Automated design of degenerate codon libraries

Marco A. Mena and Patrick S. Daugherty¹

Department of Chemical Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106-9510, USA

¹ To whom correspondence should be addressed. E-mail: psd@engineering.ucsb.edu


    Abstract
 
Degenerate codon libraries are frequently used in protein engineering and evolution studies but are often limited to targeting a small number of positions to adequately limit the search space. To mitigate this, codon degeneracy can be limited using heuristics or previous knowledge of the targeted positions. To automate design of libraries given a set of amino acid sequences, an algorithm (LibDesign) was developed that generates a set of possible degenerate codon libraries, their resulting size, and their score relative to a user-defined scoring function. A gene library of a specified size can then be constructed that is representative of the given amino acid distribution or that includes specific sequences or combinations thereof. LibDesign provides a new tool for automated design of high-quality protein libraries that more effectively harness existing sequence–structure information derived from multiple sequence alignment or computational protein design data.

Keywords: algorithm/codon/degenerate/design/library


    Introduction
 
Protein engineering and evolution applications frequently use protein libraries that include defined amino acid mixtures at certain positions of interest (Campbell et al., 2002; Nguyen and Daugherty, 2005). A widely used approach for introducing diversity is the use of degenerate codons incorporated during oligonucleotide synthesis that include mixtures of nucleotides at each position. Most often, the complete set of standard amino acids is encoded using NNK or NNS codons, where K = G or T and S = C or G. Importantly, other degenerate codons are increasingly used to encode a defined subset of standard amino acids for similarity-based cloning of unknown genes using sequence alignments (Rose et al., 1998) and for codon-based mutagenesis (Hermes et al., 1989) in the construction of protein libraries for protein engineering and directed evolution (Campbell et al., 2002; Hayes et al., 2002; Amin et al., 2004; Schmitzer et al., 2004).
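
As a concrete illustration of the degenerate-codon notation above, the following minimal Python sketch expands a degenerate codon into its constituent codons and translates them with the standard genetic code. The helper names (expand, encoded_amino_acids) are hypothetical and are not part of LibDesign.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes (e.g. K = G or T, S = C or G, N = any base).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code with bases ordered T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expand(degenerate_codon):
    """List every unambiguous codon represented by a degenerate codon."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in degenerate_codon))]


def encoded_amino_acids(degenerate_codon):
    """Set of residues (and '*' for stop) encoded by a degenerate codon."""
    return {CODON_TABLE[c] for c in expand(degenerate_codon)}


if __name__ == "__main__":
    for codon in ("NNK", "NNS", "GYT"):
        residues = "".join(sorted(encoded_amino_acids(codon)))
        print(codon, len(expand(codon)), residues)
```

Under the standard genetic code, NNK and NNS each expand to 32 codons that cover all 20 standard amino acids plus a single stop codon (TAG), whereas a restricted codon such as GYT encodes only alanine and valine.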

Despite the considerable utility of degenerate codon libraries, their use is limited by the size of the library that can be accommodated by the screen or selection, since the number of possible sequences grows exponentially with the number of codons targeted and their respective degeneracy. For example, an exhaustive search of the protein sequence space derived from 10 positions fully randomized with the 20 standard amino acids cannot currently be accomplished since it would require screening a library composed of more than 10^15 members. Consequently, one is limited to exploring a smaller number of positions if exhaustive screening is desired. Alternatively, one can investigate a larger number of positions by restricting the degeneracy at each position. In such cases, heuristics such as hydrophobicity, charge or size can be used to identify amino acid subsets likely to improve protein function (Campbell et al., 2002; Hayes et al., 2002). Often, inclusion of only a specific subset of amino acids is desired, as is the case with libraries designed using multiple sequence alignments or computational design. In these cases, the degenerate codons used at each position must be optimized to maximize coverage of the intended sequence space. Effective approaches to controlling codon diversity include mix-and-split (Glaser et al., 1992) and triphosphoramidite-based synthesis (Virnekas et al., 1994). Such approaches enable the screening of a larger number of unique protein variants, thereby increasing the probability of identifying gain-of-function mutations. However, their widespread use remains limited by the high cost of primer synthesis using these approaches (Neylon, 2004).
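
The scale of this problem can be checked with a short calculation. The sketch below assumes that library size is counted at the DNA level, as the product of the per-position codon degeneracies, and that full randomization uses NNK codons (32 codons each). Both are assumptions made for illustration, and the function names are hypothetical.

```python
from math import prod

# Number of nucleotides represented by each IUPAC symbol.
IUPAC_SIZE = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}


def degeneracy(codon):
    """Number of unambiguous codons represented by one degenerate codon."""
    return prod(IUPAC_SIZE[b] for b in codon)


def dna_library_size(codons):
    """Number of distinct DNA sequences encoded by a list of degenerate codons."""
    return prod(degeneracy(c) for c in codons)


if __name__ == "__main__":
    dna = dna_library_size(["NNK"] * 10)   # 32 ** 10
    protein = 20 ** 10                     # all 20 residues at 10 positions
    print(f"{dna:.2e} DNA sequences, {protein:.2e} protein sequences")
```

With ten NNK positions this gives 32^10, about 1.1 × 10^15 DNA sequences, encoding 20^10 (about 1.0 × 10^13) full-length protein sequences.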

Substantial effort has been invested to comprehensively search a specified protein sequence space, such as that defined by sequential randomized positions using oligonucleotide cassette mutagenesis. However, the advent of non-contiguous multiple-site mutagenesis techniques [e.g. oligonucleotide-based gene library assembly, QuikChange® Multi (Hogrefe et al., 2002)] now enables sparse sampling of a much broader sequence space defined by an arbitrarily large number of sequence positions. In such cases, the selection of appropriate degenerate codons is not straightforward, since ‘library space’ grows rapidly with the number of positions targeted. To address this problem, we employed a search algorithm to identify a set of degenerate codons that maximizes representation of a given sequence set. Given a set of aligned sequences with an arbitrary distribution of amino acids at each position, LibDesign returns a set of degenerate codon library designs that encode different approximations of the distribution. These libraries can be characterized in terms of their size and ability to represent the sequence set. LibDesign can return designs that meet user-specified constraints of library size and score, a random sampling of the designs or all possible library designs. Consequently, protein sequence space can be explored more efficiently, with the aim of evolving function more rapidly.


    Methods
 
LibDesign takes as input a set of aligned amino acid sequences, for example, derived from multiple sequence alignments or protein design algorithms. Amino acids that must be included at a position regardless of their frequency in the input sequences can be specified (for example, to ensure inclusion of a wild-type amino acid). A set of optimum degenerate codons is then determined independently for each position; an optimum codon is defined as one which encodes the specified amino acids with the minimal degeneracy, while avoiding stop codons if possible. These optimum codons are rank ordered in terms of inclusivity of the input amino acids at that position. The resulting set of codons contains members encoding the wild-type and the most frequent amino acid, the wild-type and the two most frequent amino acids, and so on. The most degenerate codon encodes the wild-type and all amino acids with a non-zero frequency in the input set.
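
The per-position codon selection described above can be sketched as a brute-force search over all 15^3 IUPAC degenerate codons. This is not the LibDesign source. It is a minimal illustration that assumes stop-free codons are preferred first, then the fewest DNA codons, with ties broken alphabetically. The names best_codon and codon_series are hypothetical.

```python
from collections import Counter
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(degenerate_codon):
    """Return (encoded residues incl. '*' for stop, number of DNA codons)."""
    codons = ["".join(c) for c in product(*(IUPAC[b] for b in degenerate_codon))]
    return frozenset(CODON_TABLE[c] for c in codons), len(codons)


# Precompute the translation of every possible degenerate codon (3375 entries).
ALL_DEGENERATE = {
    "".join(c): translate("".join(c)) for c in product(sorted(IUPAC), repeat=3)
}


def best_codon(required):
    """Least-degenerate codon covering `required`, avoiding stops if possible."""
    required = frozenset(required)
    candidates = [
        ("*" in residues, n_codons, codon)   # assumed ranking key
        for codon, (residues, n_codons) in ALL_DEGENERATE.items()
        if required <= residues
    ]
    return min(candidates)[2]


def codon_series(column, wild_type):
    """Codons of increasing inclusivity for one alignment column.

    The k-th codon covers the wild-type residue plus the k most frequent
    other residues observed in `column`.
    """
    ranked = [aa for aa, _ in Counter(column).most_common() if aa != wild_type]
    series = []
    for k in range(1 if ranked else 0, len(ranked) + 1):
        codon = best_codon({wild_type, *ranked[:k]})
        if codon not in series:              # skip duplicates
            series.append(codon)
    return series


if __name__ == "__main__":
    # Hypothetical column: A (wild-type) x3, V x2, G x1.
    print(codon_series(list("AAAVVG"), wild_type="A"))
```

For the hypothetical column in the usage example, the series is GYA (Ala/Val, two codons) followed by GBA (Ala/Val/Gly, three codons). A different tie-breaking rule, for instance one based on host codon usage, would select other third-position bases.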

Possible permutations of the codon sets are then searched using an exhaustive search, random Monte Carlo sampling or other evolutionary search algorithms (Frenkel and Smit, 2002). For each permutation, the potential library size is computed and a score is calculated. While an arbitrary scoring function could be used to rank library designs, the score as implemented here is simply the number of sequences in the input alignment that are exactly encoded by the library. Libraries that meet user-specified criteria regarding library size and score are appended to an output file.
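
The exhaustive variant of this search can be sketched as follows. The example assumes each position offers a precomputed list of (codon, encoded residues, degeneracy) choices, that library size is the product of the codon degeneracies, and that the score is the count of input sequences encoded exactly. The function name and the data layout are hypothetical.

```python
from itertools import product
from math import prod


def enumerate_libraries(options, sequences, max_size, min_score):
    """Exhaustively score every combination of per-position codon choices.

    options[i] is a list of (codon, encoded_residues, degeneracy) tuples for
    position i. Returns (codons, size, score) for each design whose size is
    at most max_size and whose score is at least min_score.
    """
    hits = []
    for design in product(*options):
        size = prod(deg for _, _, deg in design)
        # A sequence is exactly encoded if every residue is allowed
        # by the codon chosen at the corresponding position.
        score = sum(
            all(aa in residues for aa, (_, residues, _) in zip(seq, design))
            for seq in sequences
        )
        if size <= max_size and score >= min_score:
            hits.append((tuple(codon for codon, _, _ in design), size, score))
    return hits


if __name__ == "__main__":
    # Hypothetical two-position example.
    options = [
        [("GCA", {"A"}, 1), ("GYA", {"A", "V"}, 2)],
        [("GGA", {"G"}, 1), ("GBA", {"A", "V", "G"}, 3)],
    ]
    sequences = ["AG", "VG", "AA", "VV"]
    for codons, size, score in enumerate_libraries(options, sequences, 3, 2):
        print(codons, size, score)
```

Because the number of combinations is the product of the per-position option counts, exhaustive enumeration is only practical for modest design spaces. Larger spaces call for the sampling strategies mentioned above.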


    Results and discussion
 
As a model case for library design, a set of 100 protein sequences with 10 variable positions was generated using a Monte Carlo protein design algorithm (M. A. Mena and P. S. Daugherty, unpublished data) similar to those described previously (Desjarlais and Clarke, 1998). The variable positions of the resulting sequences possessed the amino acid distribution given in Table I. Additionally, in this example, a wild-type amino acid was required at each position. LibDesign enabled library design using three different approaches: (i) scoring of all possible libraries, (ii) random sampling of all possible libraries and (iii) scoring only of libraries within a given size and score constraint. Three different scores were calculated for each library design: (i) the percent of sequences of the input alignment exactly encoded by the library, (ii) the percent of input sequences encoded to within one mutation and (iii) the percent of input sequences encoded to within two mutations.
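
The three scores can be computed in a single pass over the input sequences. The sketch below assumes that 'within k mutations' means that at most k positions of an input sequence carry a residue not encoded by the library at that position. The function name is hypothetical.

```python
def coverage_scores(encoded_sets, sequences, max_mismatches=2):
    """Percent of sequences encoded with at most 0, 1, ..., max_mismatches
    positions falling outside the library.

    encoded_sets[i] is the set of residues the library allows at position i.
    """
    counts = [0] * (max_mismatches + 1)
    for seq in sequences:
        mismatches = sum(aa not in allowed
                         for aa, allowed in zip(seq, encoded_sets))
        # A sequence with m mismatches counts toward every threshold k >= m.
        for k in range(mismatches, max_mismatches + 1):
            counts[k] += 1
    return [100.0 * c / len(sequences) for c in counts]


if __name__ == "__main__":
    # Hypothetical three-position library and four input sequences.
    library = [{"A", "V"}, {"G"}, {"L"}]
    inputs = ["AGL", "VGL", "AGM", "TAM"]
    exact, within_one, within_two = coverage_scores(library, inputs)
    print(exact, within_one, within_two)
```

In this hypothetical example two of the four sequences are encoded exactly, a third differs at one position, and the fourth differs at all three, so the scores are 50%, 75% and 75%.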


 
Table I. Distribution of amino acids at each position for test case

 
Application of LibDesign to the input set (Table I) yielded a set of optimal codons (Table II) and libraries representing the input data (Table III and Figure 1). Use of a combinatorial search algorithm was important since, in total, 1 × 10^6 potential libraries could be constructed from all permutations of the optimal codon sets (Table II). A library encoding all amino acids in the distribution and the wild-type residue would have a size of 5 × 10^10 (Table II). A random sampling of the entire library space illustrates the tradeoff between library size and score (Figure 1a). Exhaustive search of a defined portion of the library space (Figure 1b) identified all libraries with scores >20% but with a library size <10^7. Libraries having the desired size and score can then be constructed using gene assembly methods as described previously (Bessette et al., 2003).
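
The number of candidate libraries and a uniform random sample of them can be obtained as sketched below. The per-position codon lists are hypothetical and are not taken from Table II, and uniform independent choice at each position is an assumption about how the random sampling might be done.

```python
import random
from math import prod


def design_space_size(options):
    """Number of distinct library designs: product of per-position choices."""
    return prod(len(choices) for choices in options)


def sample_designs(options, n, rng=None):
    """Draw n designs by choosing one codon uniformly at each position."""
    rng = rng or random.Random()
    return [tuple(rng.choice(choices) for choices in options) for _ in range(n)]


if __name__ == "__main__":
    # Hypothetical codon options for three positions.
    options = [["GCA", "GYA"], ["GGA", "GBA", "NNK"], ["CTG"]]
    print(design_space_size(options))
    for design in sample_designs(options, 3, random.Random(0)):
        print(design)
```

Each sampled design can then be sized and scored individually, which gives a size-versus-score picture of the design space without enumerating every combination.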


 
Table II. Degenerate codons computed by LibDesign at each position, from most-inclusive to least-inclusive

 

 
Table III. Sample of three libraries resulting from LibDesign, spanning different sizes and scores

 


 
Fig. 1. LibDesign output of library size versus score. The x-axis represents the size of the library in number of individual clones, whereas the y-axis represents the score, which is defined as the percent of matches to the original input sequence set. Individual candidate libraries resulting from (a) a random sampling of the entire library space and (b) an exhaustive search of the library space containing only libraries of size <10^7 and with an exact-match score >20%. Libraries are scored in terms of exact matches to input (red circles), matches to within one mutation (blue squares) or matches to within two mutations (green diamonds).

 
LibDesign can be used to design libraries using degenerate codons, given a specified set of amino acid sequences. Coupled with a priori knowledge of the desired distributions of amino acids at certain positions, this method enables library design with mutations at an arbitrarily large number of positions, while maintaining specified constraints on the potential library size. This library design method is particularly well-suited for coupling computational protein design and multiple sequence alignment data to library construction and screening.


    Availability
 
The program and source code are available from the authors by request.


    Acknowledgements
 
This work was supported, in part, by NIH National Institute for Biomedical Imaging and Bioengineering grant EB 000205.


    References
 
Amin,N., Liu,A.D., Ramer,S., Aehle,W., Meijer,D., Metin,M., Wong,S., Gualfetti,P. and Schellenberger,V. (2004) Protein Eng. Des. Sel., 17, 787–793.

Bessette,P.H., Mena,M.A., Nguyen,A.W. and Daugherty,P.S. (2003) Methods Mol. Biol., 231, 29–37.

Campbell,R.E., Tour,O., Palmer,A.E., Steinbach,P.A., Baird,G.S., Zacharias,D.A. and Tsien,R.Y. (2002) Proc. Natl Acad. Sci. USA, 99, 7877–7882.

Desjarlais,J.R. and Clarke,N.D. (1998) Curr. Opin. Struct. Biol., 8, 471–475.

Frenkel,D. and Smit,B. (2002) Understanding Molecular Simulation. Oxford University Press, Oxford.

Glaser,S.M., Yelton,D.E. and Huse,W.D. (1992) J. Immunol., 149, 3903–3913.

Hayes,R.J., Bentzien,J., Ary,M.L., Hwang,M.Y., Jacinto,J.M., Vielmetter,J., Kundu,A. and Dahiyat,B.I. (2002) Proc. Natl Acad. Sci. USA, 99, 15926–15931.

Hermes,J.D., Parekh,S.M., Blacklow,S.C., Koster,H. and Knowles,J.R. (1989) Gene, 84, 143–151.

Hogrefe,H.H., Cline,J., Youngblood,G.L. and Allen,R.M. (2002) Biotechniques, 33, 1158–1160, 1162, 1164–1165.

Neylon,C. (2004) Nucleic Acids Res., 32, 1448–1459.

Nguyen,A.W. and Daugherty,P.S. (2005) Nat. Biotechnol., 23, 355–360.

Rose,T.M., Schultz,E.R., Henikoff,J.G., Pietrokovski,S., McCallum,C.M. and Henikoff,S. (1998) Nucleic Acids Res., 26, 1628–1635.

Schmitzer,A.R., Lepine,F. and Pelletier,J.N. (2004) Protein Eng. Des. Sel., 17, 809–819.

Virnekas,B., Ge,L., Pluckthun,A., Schneider,K.C., Wellnhofer,G. and Moroney,S.E. (1994) Nucleic Acids Res., 22, 5600–5607.

Received March 2, 2005; revised June 21, 2005; accepted August 13, 2005.

Edited by Andreas Plueckthun




