1Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA and 2Institute of Astronomy, University of Cambridge, Madingley Road, Cambridge CB3 0HA, UK.
3 To whom correspondence should be addressed. e-mail: jmb50{at}cam.ac.uk
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: completeness/diversity/Poisson statistics/randomized libraries
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
An assumption underlying all directed evolution experiments is that the amount of molecular diversity theoretically possible is enormous compared with our ability to generate and screen it. Even a small protein of 100 amino acids can be encoded by 4300 10181 possible DNA sequences, a number vastly larger than the number of atoms in the observable Universe (
1080), let alone the biggest protein-encoding libraries accessible in the laboratory [10121015 using in vitro selection methods such as mRNA display (Roberts and Ja, 1999
)]. Increasingly it is acknowledged that quantitative, predictive models for the processes underlying randomized library construction will be useful in targeting and interpreting that diversity which we are able to generate experimentally [reviewed by Voigt et al. (Voigt et al., 2001a
)]. Recent studies have included in silico modelling of epPCR and the generation of crossovers in DNA shuffling (Moore and Maranas, 2000
; Moore et al., 2001
) and the construction of computational pre-screens both to identify the regions of proteins most likely to yield beneficial mutations on randomization (Voigt et al., 2001b
) and also to predict the fragments or schemas of proteins able to be recombined with minimal disruption of overall three-dimensional structure (Voigt et al., 2002
).
While these analyses hint at the insights to be gained from a quantitative approach to directed evolution, they are too complex to be generally applicable for the laboratory researcher. Moreover, the number of mutations or crossovers required, or even optimal, to effect a given functional change remains elusive. In this paper, we argue that the likelihood of finding a variant with improved properties in a given library is maximized when that library is maximally diverse. We used simple statistics to derive a series of widely-applicable equations and computer algorithms for estimating the number of unique sequence variants in libraries constructed by randomized oligonucleotide mutagenesis, epPCR and in vitro recombination. Generally, applying these algorithms provides mathematical support for the previously empirical guidelines which have evolved for generating randomized libraries in which diversity is maximized and unwanted degeneracy is minimized, although some new strategies for library construction also become apparent.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Incorporating randomized codons into one of the primers in a PCR mix allows the generation of molecular diversity at specific locations in a gene. Intuitively, we know that randomizing a greater number of codons reduces the likelihood of sampling all possible random variants. Here we derive simple equations for estimating how many variants a given library will actually contain and how large a library needs to be in order to give a fixed probability (e.g. 95%) that all possible sequence variants will be represented.
Consider a library containing a number of clones L, constructed by randomizing M codons or N = 3M base pairs, in which all possible sequence variants vi are equally probable. Since the variants are equally probable, the mean number of occurrences of any one variant vi in the library is given by = L/V (where V is the total number of possible sequence variants). For
<< L (e.g. V > 10), the actual number of occurrences of any variant vi is essentially independent of the number of occurrences of any other variant vj and can therefore be well approximated by the Poisson distribution [see Feller for details (Feller, 1968
)]:
where P(x) denotes the probability that the variant vi occurs exactly x times in the library. The probability that vi occurs at least once is 1 P(0) = 1 e = 1 eL/V. Hence the expected number of distinct variants in the library is
and the fractional completeness of the library is given by F = C/V = 1 eL/V.
As an example, let us assume that we mutate four codons in a gene using NNS (N = A/C/G/T; S = C/G) codons in the randomization protocol. Because there are 32 possible NNS codons, it follows that there are V = 324 106 possible sequence variants. If we wish to construct a library expected to contain, say, 95% of all the possible sequence variants, then we must solve F = 0.95
1 eL/V = 0.95
L = V ln0.05
3.0V. Three-fold degeneracy, corresponding to 3x106 clones in the present example, is therefore required for a library that is expected to contain 95% of the 106 possible sequence variants.
A related but distinct problem is to calculate the required size of a library that has a 95% chance of being 100% complete. The probability that every variant vi is represented is Pc = (1 eL/V)V. Solving for L gives
where the approximation holds provided V >> lnPc. Since one is generally interested in Pc values of the order of 90100% (and certainly >1%), this condition is generally true.
Returning to our example in which four NNS codons are used to generate a randomized library, we can now calculate the size that a library needs to be to have a 95% chance of containing every possible sequence variant. Solving Equation 3 for Pc = 0.95 and V = 106 gives L 1.7x107. Hence a much higher level of degeneracy,
17-fold in this example, is required to be 95% certain of sampling all possible variants. The degree of over-sampling required varies with the number of variants V. For example, if five codons are randomized instead of four, V = 325
3.4x107 and solving Equation 3 for Pc = 0.95 shows that 20-fold degeneracy is required to be 95% certain that the library will be complete.
While it is reasonably straightforward to calculate manually library completeness C or the probability Pc of having a complete library, using Equations 2 and 3, we have also written a short Fortran 77 program, GLUE, to perform these calculations. GLUE may be downloaded from www.bio.cam.ac.uk/blackburn/stats.html.
Error-prone PCR
Although most recent examples of directed evolution use epPCR in conjunction with recombination-based strategies such as DNA shuffling, it is still commonly encountered as a means of generating random diversity at any position in a gene. Further, the epPCR protocol has been refined to afford considerable control over the substitution rate. Low rates of mutation (typically 23 base substitutions per gene, corresponding to 1 amino acid substitution per sequence per generation) have traditionally been employed, in order to minimize the accumulation of deleterious mutations (Arnold et al., 2001
). However, using nucleoside analogues to generate a much higher mutational load of 8.227.2 mutations per gene in the directed evolution of TEM-1 ß-lactamase has also yielded improved variants (Zaccolo and Gherardi, 1999
).
By assuming Poisson statistics to analyse the actual compositions of epPCR libraries, here we address the question of how these apparently contradictory approaches can prove equally successful. We describe a simple program, dubbed PEDEL (Program for Estimating Diversity in Error-prone PCR Libraries) and available to download from www.bio.cam.ac.uk/blackburn/stats.html, which estimates the number of unique sequence variants, C, found in a given epPCR library.
The input parameters for PEDEL are the length of the template sequence, N (in base pairs), the size of the randomly-generated library, L, and the mean number of point mutations per sequence, (it is assumed that
<< N, e.g.
< 0.1N). A value for
can be obtained experimentally by sequencing a small number of library variants.
The probability that a library variant contains exactly x mutations, given that the mean number of mutations per sequence is , can be described by the Poisson distribution P(x) (Equation 1; Cadwell and Joyce, 1992
). The number of possible sequences differing from the parental sequence by exactly x mutations is given by
since
is the number of ways of choosing x bases to mutate and each of these may be mutated to any of three other bases, giving the 3x term.
In order to estimate the total number of unique variants, PEDEL divides the library of size L into a series of sub-libraries, each of which contains only variants with exactly x mutations. The expected size of each sub-library is given by Lx = P(x)L. Since all sequences with exactly x mutations are equally probable, the completeness statistic Cx derived for cassette mutagenesis (Equation 2) can be applied to each sub-library, substituting Vx (Equation 4) for V.
When the sub-library size Lx is small compared with the total number of possible variants Vx in that sub-library, we may expect that almost every member of the sub-library will be distinct, i.e. Cx Lx. This is the case when x is large. This approximation is good to within 5% for Lx/Vx < 0.1 and in turn allows the calculation of a threshold x value, xu (see www.bio.cam.ac.uk/
blackburn/stats.html for details), such that for all x > xu the approximation Cx
Lx is acceptable. For example, in a library with a total of L = 109 members, length N = 1000 base pairs and mean mutation rate
= 4, one finds that xu
3.1 (Figure 1). PEDEL uses this observation to split the infinite summation
|
and hence to calculate the total number of distinct variants in a given epPCR library. Conversely, when x is small enough, Lx will be large compared with Vx and we may expect the sub-library to sample nearly all possible variants, i.e. Cx Vx. Indeed, in most scenarios encountered experimentally, there is at most one x value for which one or other of these approximations is not valid. Examples of how C varies as a function of mutation rate
, library size L and template sequence length N are shown in Figure 2.
|
|
Finally, it should be noted that in reality and as detailed by Moore and Maranas (Moore and Maranas, 2000), the distribution of mutations in the final pool of epPCR-generated molecules may deviate from Poisson statistics. The primary cause is the well-documented bias in the types of base substitution introduced by Taq DNA polymerase under error-prone conditions (Shafikhani et al., 1997
). By making some sequence variants less likely than others, this has the effect of reducing library diversity (i.e. decreasing completeness, C). However, with the recent advent of commercially available polymerases such as Mutazyme (Stratagene, La Jolla, CA), which possesses an opposite bias to Taq in its mutational spectrum, unbiased libraries can now be constructed by sequential PCR amplifications with the two polymerases (W.M.Patrick, M.de Lumley and J.M.Blackburn, unpublished data). Given the premise that directed evolution is most likely to identify an improved variant when a library is maximally diverse and that this is the case when all variants in a sub-library are equally probable, we suggest that this dual-polymerase approach should become routine.
In vitro recombination of highly homologous sequences
Methods for recombining genes in vitro have proved revolutionary in the field of directed evolution. Here we describe DRIVeR (Diversity Resulting from In vitro Recombination), a program for estimating the diversity represented in libraries generated by recombining two highly homologous parental sequences which differ in only a few (e.g. 20) base or amino acid positions. It is available to be downloaded at www.bio.cam.ac.uk/blackburn/stats.html.
The number of unique sequences in a shuffled DNA library will be critically dependent upon (i) the mean number of crossovers during the recombination process and (ii) the spacing of the varying base pairs, since bases that are closely spaced are less likely to be recombined than those that are far apart in the sequence. In most reported examples of in vitro recombination, by either DNA shuffling or StEP PCR, the number of crossovers observed per daughter sequence is small [typically of the order of 14 (Zhao et al., 1998; Raillard et al., 2001
). Suppose, then, that the mean number of crossovers per daughter sequence is
and that
<< N, e.g.
< 0.1N (where N is the sequence length). We assume that the number of crossovers between two consecutive varying positions in a particular daughter sequence follows a Poisson distribution (Equation 1) with mean (n 1)
/(N M 1), where M is the number of varying base positions and (n 1)/(N M 1) gives the ratio of the total number of potential crossover points in the sequence (viz. N M 1) to the number of crossover points between the two varying positions (n 1). M is incorporated into this expression to account for the observation that, owing to the requirement for a base-paired 3' clamp in fragment reassembly and extension, crossovers immediately following variable positions are impossible. DRIVeR also therefore sets the probability of a crossover at these positions to zero.
Note that experimentally, crossovers are only observable if they occur in a region that will produce a distinct daughter sequence. Clearly, one crossover between consecutive varying base pairs produces the same daughter sequence as 3, 5, 7, ... crossovers and similarly for any even number of crossovers. Further, any crossovers occurring between one end of the sequence and the first varying base position are also unable to be detected by sequence analysis of daughter variants. Therefore only an observed value,
obs, can be determined from experimental data. However, the statistics used by DRIVeR to estimate the number of distinct sequences in a shuffled DNA library require that the value specified by the user is that of
true. For a given input value of
true, DRIVeR calculates the resulting
obs, allowing a quick, empirical search for the value of
true based on a
obs value determined by sequencing a small number of variants from a library. Conversely, experimental uncertainty in estimating
obs can be assessed by entering a range of possible
true values into the program and assessing the effects on both
obs and the overall diversity of the library. Details on these calculations are available in the supplementary material at www.bio.cam.ac.uk/
blackburn/stats.html.
Once a value for true in keeping with
obs has been determined, DRIVeR uses the inputted sequence length N, number of varying base positions M, library size L,
true and the positions of the varying bases to calculate the relative probability of each possible shuffled variant. This in turn allows the calculation of C, the expected total number of distinct daughter sequences in the library of size L. DRIVeR also tabulates the probabilities of an even or an odd number of crossovers in each interval between varying codons and a further output file contains the probabilities of occurrence of each of the possible 2M daughter sequences. As one would expect, plotting the expected value of C as a function of L with various combinations of spacing of varying base pairs shows that C increases with library size and is larger if the varying bases are well spaced along the parental sequences (Figure 3).
|
In a recent example from the literature, Raillard et al. (Raillard et al., 2001) used DNA shuffling to recombine two bacterial triazine hydrolase genes (atzA and triA, GenBank accession numbers U55933 and AF312304, respectively), which differed at nine of 1425 bases, coding for proteins with nine varying amino acids. There are therefore 29 = 512 possible shuffled daughter variants. The authors screened a library of L = 1600 shuffled variants, although they also noted that this probably fell short of a screen of the complete library. However, it was not immediately obvious what fraction of the complete library had actually been sampled here. We therefore specified N = 1425, M = 9, L = 1600 and the positions of the variable bases were 250, 274, 375, 650, 655, 757, 763, 982 and 991. Trial and error show that a
true of 10 crossovers per sequence corresponds to
obs
2, which in turn agrees with the data obtained experimentally (the authors state that every variant sequenced had undergone at least one and as many as four recombination events). On inputting
true = 10, DRIVeR calculates that the screened library of 1600 members is expected to contain
161 of the 512 possible distinct sequence variants. The biggest factor leading to the low diversity is the close spacing of the variable bases at positions 650 and 655, 757 and 763, and 982 and 991, between each of which recombination events are unlikely to occur. Indeed, even a 1000-fold larger library of size L = 1.6x106 would still be incomplete, containing an expected 484 distinct sequences. Further, the close spacing of these variable bases means that the uncertainty in library diversity, C, associated with estimating
obs is significant in this example: if
obs was in fact 1.6, the library would be expected to contain 126 variants, while
obs = 2.4 gives C = 221. This effect is reduced for sequences in which the variable bases are more evenly spaced.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Here we have used simple arguments and the approximating assumption of Poisson statistics in order to analyse completeness and therefore degeneracy in libraries generated by three common PCR-based methods. Applying the algorithms described provides laboratory researchers with either an estimate of how many unique sequence variants their library is most likely to contain or conversely a target library size to enable them to sample a specified proportion of all possible sequences.
In the first example we derive simple equations for characterizing the diversity arising from oligonucleotide-directed mutagenesis. In calculating library sizes required for fixed levels of completeness, an important distinction is made between a library that is most likely to be 95% complete (Equation 2) and a 95% chance of having a 100% complete library (Equation 3). As one might expect, we show that libraries containing much higher levels of sequence degeneracy are required to effect the latter situation. The assumption underpinning these equations is that each random variant is equally probable and at the DNA level this will be the case. However, an interesting recent study of M13 phage-displayed peptides (Rodi et al., 2002) uncovered biases in the sequences of those peptides which are successfully translated and displayed, suggesting that in this example the underlying biology has affected functional library diversity in a manner not predicted by Poisson statistics. Of the methods commonly employed for screening or selecting from randomized libraries, display on filamentous phage is the one most likely to show such a bias (reflecting the constraints on sequence imposed by periplasmic virus assembly) and this should be considered when implementing GLUE to assess the diversity represented in phage-displayed libraries.
Provided that the number of randomized codons is low, oligonucleotide-directed mutagenesis offers the potential to generate a library which samples all possible sequence variants. In epPCR, however, the ability to generate mutations at any position in a gene ensures that the possible number of variants is enormous and it becomes more informative to analyse how many distinct sequences are actually represented in a given library. The sub-library approach described here and implemented by PEDEL gives insights into the composition of epPCR libraries. For example, if the mutation rate is kept low (the case in the majority of published experiments), then a large proportion of the resulting library (as high as 37% when
= 1) will contain the parental template sequence, thus significantly reducing the effective size of that library. However, a mid-sized library with a higher mutation rate (e.g. L = 106,
= 6, N = 1000) will still contain all possible variants differing from the template sequence by a single nucleotide, whilst reducing the fraction of unmutated template sequence present to <0.25%. In addition, it will also sample many sequences which differ from the template in 27 positions. Hence PEDEL allows an informed trade-off to be made between maximizing the number of unique sequence variants sampled and ensuring that those sequences least likely to contain deleterious mutations are well represented.
DNA shuffling and its variants have proven to be extremely flexible methodologies, enabling the inclusion of many different pieces of genetic information in their protocols. Diversity can be introduced by incorporating an epPCR step, shuffling a large number of parental sequences from different species or by incorporating synthetic oligonucleotides to target specific regions for mutagenesis (Minshull and Stemmer, 1999). Here we have analysed a simplest case scenario, in which two highly homologous parental sequences are recombined. In modelling the shuffling reaction as a Poissonian process we have simplified it considerably, since as detailed by Moore et al. (Moore et al., 2001
), crossovers will tend to accumulate in regions of high sequence identity and are better predicted by considering the thermodynamics of the reassembly reaction. However, in examples such as that of the triazine hydrolases atzA and triA (Raillard et al., 2001
), where the parental sequences show near 100% sequence identity, it can be inferred that crossover position will be well approximated by a Poisson distribution and that DRIVeR will therefore provide a useful measure of the diversity represented in a given library. A further advantage of DRIVeR is that it can be used in the analysis of libraries generated by both DNA shuffling and StEP PCR, as the critical parameter is the number of crossovers, rather than the size of the reassembled fragments (a concept that is irrelevant in the latter methodology).
A key result from the analysis of recombined libraries using DRIVeR was that diversity is largely dependent on true and hence that it is maximized when crossovers are sufficiently frequent that each daughter variant is equally probable. In such a scenario, the number of variants in a given library is best estimated using the formula derived for oligonucleotide-directed mutagenesis (Equation 2). Two modifications to the DNA shuffling protocol that were described recently (Ness et al., 2002
; Zha et al., 2003
) are directly predicted by this observation. Both groups designed overlapping oligonucleotides for assembly in a synthetic DNA shuffling reaction, resulting in the production of functional daughter variants through the recombination of tightly linked diversity. In both cases, in vitro recombination was reduced to a problem of equally probable variants, with concomitant success in maximizing the resultant library diversity. A third report (Moore and Maranas, 2002
) has recently outlined a method for engineering codon usage to maintain amino acid sequence but to optimize or direct recombination, based on earlier considerations of predicting crossover generation. In combination, these approaches offer the prospect of a more rational approach to increasing diversity and therefore the likelihood of identifying improved variants in recombined libraries.
In addition to addressing the question of how much molecular diversity is present in a randomized library, the statistics and algorithms described here provide evidence for a number of empirical guidelines for library construction. In general, of course, the larger the library, the more diverse it is expected to be, although in the case of oligonucleotide-directed mutagenesis GLUE provides insight into just how large is large enough to sample a required proportion of variants. In epPCR and DNA shuffling, we have demonstrated that maximizing library diversity requires consideration of both library size, L, and mean mutation rate or crossover frequency, . Contrary to the prevailing dogma for construction of libraries by epPCR, we have shown that an elevated mutation rate can be useful for increasing the absolute number of variants sampled, in addition to reducing wasteful over-sampling of the unmutated template and single point mutants. Likewise, in DNA shuffling a very large mean number of crossovers per sequence (corresponding to small fragments in the reassembly reaction), such that every daughter variant is essentially equally probable, ensures that the maximum number of shuffled variants will be represented. This is especially important for ensuring recombination between two or more varying bases that are closely spaced in the parent sequences. In combination, we believe that the descriptive and predictive aspects of GLUE, PEDEL and DRIVeR will prove useful and generally applicable in guiding the construction and interrogation of randomized protein-encoding libraries.
![]() |
Acknowledgements |
---|
![]() |
FOOTNOTES |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Brakmann,S. (2001) ChemBioChem, 2, 865871.[CrossRef][ISI][Medline]
Cadwell,R.C. and Joyce,G.F. (1992) PCR Methods Appl., 2, 2833.[Medline]
Feller,W. (1968) An Introduction to Probability Theory and Its Applications. Wiley, New York.
Hermes,J.D., Parekh,S.M., Blacklow,S.C., Köster,H. and Knowles,J.R. (1989) Gene, 84, 143151.[CrossRef][ISI][Medline]
Minshull,J. and Stemmer,W.P.C. (1999) Curr. Opin. Chem. Biol., 3, 284290.[CrossRef][ISI][Medline]
Moore,G.L. and Maranas,C.D. (2000) J. Theor. Biol., 205, 483503.[CrossRef][ISI][Medline]
Moore,G.L. and Maranas,C.D. (2002) Nucleic Acids Res., 30, 24072416.
Moore,G.L., Maranas,C.D., Lutz,S. and Benkovic,S.J. (2001) Proc. Natl Acad. Sci. USA, 98, 32263231.
Ness,J.E., Kim,S., Gottman,A., Pak,R., Krebber,A., Borchert,T.V., Govindarajan,S., Mundorff,E.C. and Minshull,J. (2002) Nat. Biotechnol., 20, 12511255.[CrossRef][ISI][Medline]
Raillard,S. et al. (2001) Chem. Biol., 8, 891898.[CrossRef][ISI][Medline]
Roberts,R.W. and Ja,W.J. (1999) Curr. Opin. Struct. Biol., 9, 521529.[CrossRef][ISI][Medline]
Rodi,D.J., Soares,A.S. and Makowski,L. (2002) J. Mol. Biol., 322, 10391052.[CrossRef][ISI][Medline]
Saab-Rincón,G., Juárez,V.R., Osuna,J., Sánchez,F. and Soberón,X. (2001) Protein Eng., 14, 149155.
Shafikhani,S., Siegel,R.A., Ferrari,E. and Schellenberger,V. (1997) Biotechniques, 23, 304310.[ISI][Medline]
Stemmer,W.P.C. (1994a) Proc. Natl Acad. Sci. USA, 91, 1074710751.
Stemmer,W.P.C. (1994b) Nature, 370, 389391.[CrossRef][ISI][Medline]
Taylor,S.V., Kast,P. and Hilvert,D. (2001) Angew. Chem., Int. Ed. Engl., 40, 33103335.[CrossRef][Medline]
Voigt,C.A., Kauffman,S. and Wang,Z.-G. (2001a) Adv. Protein Chem., 55, 79160.[ISI]
Voigt,C.A., Mayo,S.L., Arnold,F.H. and Wang,Z.-G. (2001b) Proc. Natl Acad. Sci. USA, 98, 37783783.
Voigt,C.A., Martinez,C., Wang,Z.-G., Mayo,S.L. and Arnold,F.H. (2002) Nat. Struct. Biol., 9, 553558.[ISI][Medline]
Zaccolo,M. and Gherardi,E. (1999) J. Mol. Biol., 285, 775783.[CrossRef][ISI][Medline]
Zha,D., Eipper,A. and Reetz,M.T. (2003) ChemBioChem, 4, 3439.[CrossRef][ISI][Medline]
Zhao,H., Giver,L., Shao,Z., Affholter,J.A. and Arnold,F.H. (1998) Nat. Biotechnol., 16, 258261.[ISI][Medline]
Received November 6, 2002; revised May 6, 2003; accepted May 20, 2003.