1Bioengineering Option and 3Division of Chemistry and Chemical Engineering, California Institute of Technology, Mail Code 210-41, Pasadena, CA 91125-4100, USA
2 To whom correspondence should be addressed. E-mail: endelman{at}caltech.edu; zgw{at}cheme.caltech.edu; frances{at}cheme.caltech.edu
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: dynamic programming/laboratory evolution/optimization/protein design/recombination
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Sequence design is sometimes called rational design, but many aspects of laboratory evolution are also rationally designed (Kamtekar et al., 1993; Kono and Saven, 2001
; Voigt et al., 2001
; Moore and Maranas, 2002
), including the library diversity. By diversity we mean how proteins in the library differ from the parents and from each other. The design goals and library creation method should dictate library diversity, but our understanding of this subject is still very limited. In several studies the number of mutations m has been correlated with functional change (Zaccolo and Gherardi, 1999
; Daugherty et al., 2000
; Ostermeier, 2003
; Otey et al., 2004
). We use the corresponding library average
m
as an example of an empirically useful diversity measure.
Although diversity is needed to effect changes in protein function, it is at odds with the need for stably folded proteins (a prerequisite for most functions). Since most mutations are neutral or disruptive to protein structure, the fraction of stably folded proteins in a library tends to decrease with diversity (Daugherty et al., 2000; Guo et al., 2004
). Equivalently, for an energy function that scores protein stability (Gordon et al., 1999
; Saraf et al., 2004
), the average energy of all proteins in a library
E
tends to increase with diversity. This type of tradeoff is common in design problems with conflicting performance objectives. Our design strategy involves finding libraries on the optimal energydiversity tradeoff surface.
We apply this strategy to site-directed recombination (SDR), in which n crossovers are chosen in an alignment of p structurally-related parents to define a set of p(n + 1) peptide fragments. These fragments are then assembled combinatorially to create a library of pn+1 chimeric proteins (Figure 1). SDR is a relatively new approach to recombination in laboratory evolution (Richardson et al., 2002; Hiraga and Arnold, 2003
; Meyer et al., 2003
). Other strategies do not involve a specific choice of crossovers (Stevenson and Benkovic, 2002
). Instead, they attempt to generate all possible crossovers using techniques from molecular biology. In practice, however, these methods are often limited in either the number or locations of crossovers (Joern et al., 2002
).
|
To optimize SDR libraries for a given set of parents and fixed number of crossovers n, we minimize the average energy of all chimeras E
, subject to constraints on the length L of each peptide fragment:
![]() | (1) |
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
As with other global optimization algorithms for protein design (Dahiyat and Mayo, 1996; Gordon and Mayo, 1999
), RASPP can use any energy function with single and pairwise interactions between residues:
![]() | (2) |
![]() | (3) |
Consider two libraries, one with k1 crossovers and the other with k crossovers, whose first k1 crossover locations are identical. The difference in average energy between these libraries is
![]() | (4) |
![]() | (5) |
![]() | (6) |
Computational energies
To generate computational results we used SCHEMA disruption, an instance of Equation 2 that counts the number of pairwise interactions broken by recombination (Voigt et al., 2002; Silberg et al., 2004
):
![]() | (7) |
The contact matrix Cij depends solely on structural information, while ij uses only the parental sequence alignment. Specifically, Cij = 1 if residues i and j are within 4.5 Å in the parental structure; otherwise Cij = 0. The delta function
ij = 0 if the amino acids
i(Si) and
j(Sj) that are found in the chimera are also present at homologous positions in any single parent. Otherwise, the i j interaction is considered broken and
ij = 1.
SCHEMA disruption has proven useful in guiding SDR of ß-lactamases (Hiraga and Arnold, 2003; Meyer et al., 2003
) and cytochromes P450 (Otey et al., 2004
), the two systems explored in this study. For SDR of ß-lactamases, we used the crystal structure of TEM-1 (1BTL.pdb) (Jelsch et al., 1993
) and generated a structural alignment of 263 residues with its homolog PSE-4 (1G68.pdb) (Lim et al., 2001
) using Swiss-Pdb Viewer (Guex and Peitsch, 1997
). TEM-1 and PSE-4 have 43% sequence identity. For SDR of cytochromes P450, we used the crystal structure of CYP102A1 (1JPZ.pdb) (Haines et al., 2001
) from Bacillus megaterium, commonly known as P450 BM-3, and generated sequence alignments with Bacillus subtilis homologs CYP102A2 (ORF BG12871) and CYP102A3 (ORF BG12299) using CLUSTALW (Thompson et al., 1994
). Pairwise sequence identities for these P450 homologs are around 65% based on 456 residues. Cytochrome P450 residues are numbered as in 1JPZ.pdb.
Measuring length
Since library composition is invariant with respect to crossover location within a contiguous region of conserved residues, we do not consider conserved residues as potential crossover locations. This effectively reduces the length of the protein (N) by the number of conserved residues. To be consistent we measure fragment length L by the number of residues not conserved across all parents.
Measuring diversity
The mutation level m of each chimera is the minimum number of amino acid changes needed to convert the protein into one of the parents. The average number of mutations m
in a library with pn+1 chimeras is (
imi)/pn+1.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Every feasible n-crossover library, i.e. one that satisfies the length constraints in Equation 1, can be represented as an n-path from node 0 to column n in the directed graph of Figure 2. The node Xk visited in column k n corresponds to the position of the kth crossover. To create a one-to-one correspondence between the set of all n-paths and the set of all feasible n-crossover libraries, nodes in adjacent columns are selectively connected. A path that visits node X1 in the first column defines the first peptide fragment as [1, X1] (amino acid residues 1 to X1, inclusive), which has length X1. Thus node 0 is connected to all nodes in the first column that satisfy Lmin
X1
Lmax. Similarly, an arc from node X1 in the first column to node X2 in the second column defines the second peptide fragment as [X1 + 1, X2], which has length X2 X1. Thus node X1 is connected to node X2 if and only if Lmin
X2 X1
Lmax. This process is continued until the last column, where an arc from Xn 1 to Xn defines two peptide fragments: [Xn 1 + 1, Xn] and [Xn + 1, N] for a protein of length N. Thus node Xn 1 is connected to node Xn if and only if Lmin
Xn Xn1
Lmax and Lmin
N Xn
Lmax.
|
![]() | (8) |
![]() | (9) |
![]() | (10) |
In summary, there is a one-to-one correspondence between feasible libraries and n-paths, and the total length of each n-path equals the average energy of the corresponding library. Therefore, finding the shortest n-path is equivalent to solving Equation 1.
Complexity of finding the shortest path
Shortest-path problems can be solved efficiently because of their recursive structure (Lawler, 1976; Korte and Vygen, 2002
). In the case of Figure 2, the length of the shortest path
from node 0 to node j in column k can be computed using the shortest paths from node 0 to all nodes in column k 1:
![]() | (11) |
No information from other columns is needed. This property is the basis for dynamic programming. Using forward induction, RASPP finds the shortest path to every node in the first column, then the shortest path to every node in the second column, etc. Each evaluation of Equation 11 requires O(N) operations. This is repeated for all O(N) nodes in a column and for each of the n columns, yielding a running time of O(N2n).
In the process of finding the shortest n-path, RASPP also finds the shortest path to every column k n. These path lengths do not exactly correspond to the solution of Equation 1 with k crossovers because the set of arc connections to the last column must satisfy a different set of constraints, as discussed above. To find optimal libraries with any fixed number of crossovers k
n, the arc connections between column k 1 and column k are updated and Equation 11 is solved O(N2) times as before. This can be repeated for all values k
n with a total running time of O(N2n), the same as a single iteration of RASPP.
Although RASPP libraries are provably optimal when diversity is measured by fragment length, often we are interested in optimality with respect to other diversity measures, such as the average level of mutation m
. To use fragment length as a surrogate for
m
, we vary the length constraints over the entire range of feasible libraries: Lmin = 1 to N/(n + 1) and Lmax = N/(n + 1) to N nLmin, solving Equation 1 O(N2/n) times. This runs quickly since the arc lengths are not recalculated for each iteration. The ranges for Lmin and Lmax can easily be modified to accommodate experimental constraints arising from the library creation method.
RASPP libraries are optimized with respect to m
To illustrate RASPP, consider using three cytochrome P450 homologs (CYP102A1/A2/A3, N = 197 non-conserved residues) for the laboratory evolution of novel catalytic properties (Otey et al., 2004). By varying the length constraints for n = 7 crossovers, we generated 2052 libraries, of which 391 are distinct (Figure 3). As Lmin increases and Lmax decreases, the crossovers become more evenly spaced, resulting in libraries with higher
E
and higher
m
. Designing libraries with more crossovers increases the levels of diversity accessible by SDR, but adding fragments also complicates construction of the library. In this example, the choice of n = 7 provides enough mutants for screening (38 = 6561 chimeras) and sufficiently high levels of mutation for laboratory evolution (nearly
m
= 75) based on data from previous experiments (Otey et al., 2004
).
|
|
|
RASPP curves for parental design
Before choosing optimal crossover locations, one must decide upon a set of parents for recombination. RASPP curves provide a rapid and reliable way of determining which parents yield the lowest energy libraries in a desired diversity range. To illustrate, consider choosing which combination of cytochrome P450 homologs (A1/A2, A1/A3 or A1/A2/A3) is best for laboratory evolution. Even though a library with three parents has more chimeras than one with two parents, the comparison is fair because any random, experimental sample will on average have the same E
as the entire library.
The RASPP curves for these alternative designs reveal significant differences at mutation levels m
> 40 (Figure 6). For 40 <
m
< 60, the combination A1/A2 is better than A1/A3 because the former has lower energy. This would be difficult to ascertain by other means, since A1/A2 and A1/A3 both have 65% sequence identity and their non-conserved residues have the same surface accessibility on average (1.5 contacts per non-conserved residue). For 40 <
m
< 50, A1/A2 also has lower energy than A1/A2/A3. For 50 <
m
< 60, A1/A2 and A1/A2/A3 have comparable energy, but A1/A2 is still preferable because adding a third parent increases the cost and complexity of library construction. All three parents are needed to build libraries with
m
> 60.
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Equation 6 is our key theoretical result which shows that dynamic programming can be used for SDR library design. In this respect, Equation 6 is analogous to dead-end elimination (DEE) theorems (Desmet et al., 1992; Goldstein, 1994
; Looger and Hellinga, 2001
), which have led to many successes in protein sequence design (Dahiyat and Mayo, 1996
; Looger et al., 2003
). However, Equation 6 and the DEE theorem have very different consequences for computational protein design. RASPP finds the global energy minimum (for Equation 1) in O(N3p2 + N2n) operations for a protein of length N, making it efficient in theory and practice (Papadimitriou and Steiglitz, 1998
). In contrast, DEE requires an exponential number of operations O(aN) in the worst case. This is unavoidable (unless P = NP) because finding the amino acid sequence with minimum energy is NP-hard (Pierce and Winfree, 2002
). By averaging over the library, we transform protein design from a hard problem to an easy one.
Our tractable formulation of SDR library design also depends on constraining fragment diversity to effect changes in library diversity. We have described RASPP using fragment length, but RASPP can use any measure of fragment diversity compatible with the arc connections in Figure 2. We have shown that constraints on fragment length are effective when library diversity is measured by the average number of mutations m
. As we learn more about the effects of library diversity on protein evolution, other measures of fragment diversity will be needed.
![]() |
Acknowledgments |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Dahiyat,B.I. and Mayo,S.L. (1996) Protein Sci., 5, 895903.
Dahiyat,B.I. and Mayo,S.L. (1997) Science, 278, 8287.
Daugherty,P.S., Chen,G., Iverson,B.L. and Georgiou,G. (2000) Proc. Natl Acad. Sci. USA, 97, 20292034.
Deem,M.W. and Bogarad,L.D. (1999) Proc. Natl Acad. Sci. USA, 96, 25912595.
DeGrado,W.F. (2001) Chem. Rev., 101, 30253026.[CrossRef][ISI][Medline]
DeGrado,W.F., Summa,C.M., Pavone,V., Nastri,F. and Lombardi,A. (1999) Annu. Rev. Biochem., 68, 779819.[CrossRef][ISI][Medline]
Desmet,J., De Maeyer,M., Hazes,B. and Lasters,I. (1992) Nature, 356, 539542.[CrossRef][ISI]
Goldstein,R.F. (1994) Biophys. J., 66, 13351340.[Abstract]
Gordon,D.B. and Mayo,S.L. (1999) Structure, 7, 10891098.[ISI][Medline]
Gordon,D.B., Marshall,S.A. and Mayo,S.L. (1999) Curr. Opin. Struct. Biol., 9, 509513.[CrossRef][ISI][Medline]
Guex,N. and Peitsch,M.C. (1997) Electrophoresis, 18, 27142723.[ISI][Medline]
Guo,H.H., Choe,J. and Loeb,L.A. (2004) Proc. Natl Acad. Sci. USA, 101, 92059210.
Haines,D.C., Tomchick,D.R., Machius,M. and Peterson,J.A. (2001) Biochemistry, 40, 1345613465.[CrossRef][ISI][Medline]
Hiraga,K. and Arnold,F.H. (2003) J. Mol. Biol., 330, 287296.[CrossRef][ISI][Medline]
Holland,J.H. (1992) Adaptation in Natural and Artificial Systems. MIT Press, Cambridge, MA.
Jelsch,C., Mourey,L., Masson,J.M. and Samama,J.P. (1993) Proteins, 16, 364383.[ISI][Medline]
Joern,J.M., Meinhold,P. and Arnold,F.H. (2002) J. Mol. Biol., 316, 643656.[CrossRef][ISI][Medline]
Kamtekar,S., Schiffer,J.M., Xiong,H., Babik,J.M. and Hecht,M.H. (1993) Science, 262, 16801685.[ISI][Medline]
Kono,H. and Saven,J.G. (2001) J. Mol. Biol., 306, 607628.[CrossRef][ISI][Medline]
Korte,B. and Vygen,J. (2002) Combinatorial Optimization: Theory and Algorithms. Springer, Berlin.
Kuhlman,B., Dantas,G., Ireton,G.C., Varani,G., Stoddard,B.L. and Baker,D. (2003) Science, 302, 13641368.
Lawler,E. (1976) Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, New York.
Lim,D., Sanschagrin,F., Passmore,L., De Castro,L., Levesque,R.C. and Strynadka,N.C. (2001) Biochemistry, 40, 395402.[CrossRef][ISI][Medline]
Looger,L.L. and Hellinga,H.W. (2001) J. Mol. Biol., 307, 429445.[CrossRef][ISI][Medline]
Looger,L.L., Dwyer,M.A., Smith,J.J. and Hellinga,H.W. (2003) Nature, 423, 185190.[CrossRef][ISI][Medline]
Meyer,M.M., Silberg,J.J., Voigt,C.A., Endelman,J.B., Mayo,S.L., Wang,Z.G. and Arnold,F.H. (2003) Protein Sci., 12, 16861693.
Mitchell,M. (1996) An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA.
Moore,G.L. and Maranas,C.D. (2002) Nucleic Acids Res., 30, 24072416.
Neylon,C. (2004) Nucleic Acids Res., 32, 14481459.
Ostermeier,M. (2003) Trends Biotechnol., 21, 244247.[CrossRef][ISI][Medline]
Otey,C.R., Silberg,J.J., Voigt,C.A., Endelman,J.B., Bandara,G. and Arnold,F.H. (2004) Chem. Biol., 11, 309318.[CrossRef][ISI][Medline]
Papadimitriou,C.H. and Steiglitz,K. (1998) Combinatorial Optimization: Algorithms and Complexity. Dover, Mineola.
Pierce,N.A. and Winfree,E. (2002) Protein Eng., 15, 779782.[CrossRef][ISI][Medline]
Ravichandran,K.G., Boddupalli,S.S., Hasemann,C.A., Peterson,J.A. and Deisenhofer,J. (1993) Science, 261, 731736.[ISI][Medline]
Richardson,T.H., Tan,X., Frey,G., Callen,W., Cabell,M., Lam,D., Macomber,J., Short,J.M., Robertson,D.E. and Miller,C. (2002) J. Biol. Chem., 277, 2650126507.
Saraf,M.C., Horswill,A.R., Benkovic,S.J. and Maranas,C.D. (2004) Proc. Natl Acad. Sci. USA, 101, 41424147.
Silberg,J.J., Endelman,J.B. and Arnold,F.H. (2004) Methods Enzymol., 388, 3542.[ISI][Medline]
Stevenson,J.D. and Benkovic,S.J. (2002) J. Chem. Soc., Perkin Trans. 2, 14831493.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 46734680.[Abstract]
Voigt,C.A., Kauffman,S. and Wang,Z.G. (2001) Adv. Protein Chem., 55, 79160.[ISI]
Voigt,C.A., Martinez,C., Wang,Z.G., Mayo,S.L. and Arnold,F.H. (2002) Nat. Struct. Biol., 9, 553558.[ISI][Medline]
Zaccolo,M. and Gherardi,E. (1999) J. Mol. Biol., 285, 775783.[CrossRef][ISI][Medline]
Received August 12, 2004; accepted August 16, 2004.
Edited by Stephen Mayo