Use of a quantitative structure–property relationship to design larger model proteins that fold rapidly

Aaron R. Dinner1,2, Ellis Verosub1 and Martin Karplus1,2,3,4

1 Department of Chemistry and Chemical Biology and 2 Committee on Higher Degrees in Biophysics, Harvard University, 12 Oxford Street, Cambridge, MA 02138, USA and 3 Laboratoire de Chimie Biophysique, Institut le Bel, Université Louis Pasteur, 4 Rue Blaise Pascal, 67000 Strasbourg, France


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
A quantitative structure–property relationship (QSPR) was used to design model protein sequences that fold repeatedly and relatively rapidly to stable target structures. The specific model was a 125-residue heteropolymer chain subject to Monte Carlo dynamics on a simple cubic lattice. The QSPR was derived from an analysis of a database of 200 sequences by a statistical method that uses a genetic algorithm to select the sequence attributes that are most important for folding and a neural network to determine the corresponding functional dependence of folding ability on the chosen attributes. The QSPR depends on the number of anti-parallel sheet contacts, the energy gap between the native state and quasi-continuous part of the spectrum and the total energy of the contacts between surface residues. Two Monte Carlo procedures were used in series to optimize both the target structures and the sequences. We generated 20 fully optimized sequences and 60 partially optimized control sequences and tested each for its ability to fold in dynamic MC simulations. Although sequences in which either the number of anti-parallel sheet contacts or the energy of the surface residues is non-optimal are capable of folding almost as well as fully optimized ones, sequences in which only the energy gap is optimized fold markedly more slowly. Implications of the results for the design of proteins are discussed.

Keywords: genetic neural network/lattice model/Monte Carlo/protein folding/sequence optimization


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
An important goal of the study of protein folding is to learn how to design proteins that fold to stable, well-defined structures [the `inverse' folding problem (Pabo, 1983Go)] [for reviews, see Shaw and Bott (1996), Sauer (1996) and Shakhnovich (1998)]. In the last few years, considerable progress towards this goal has been made due to the introduction of systematic quantitative design protocols (Hellinga and Richards, 1991Go; Desjarlais and Handel, 1995Go; Harbury et al., 1995Go, 1998Go; Dahiyat and Mayo, 1996Go; Villegas et al., 1996Go; Dayihat and Mayo, 1997a). A key element of these methods is that information about existing sequences (for example, the melting temperature, Tm) is used in the design of new ones. For example, Dahiyat and Mayo (1996, 1997a) iteratively improved the form and parameters of their sequence scoring function by deriving quantitative structure–property relationships (QSPRs) for the melting temperatures of their designed sequences. With their final scoring function, they designed a complete 28-residue sequence that folds into a ßß{alpha} motif (NMR structure PDB code 1fsd) with the core aromatic residues packed in a similar manner to the corresponding residues in the template protein (Tm = 42°C) (Dahiyat and Mayo, 1997bGo).

An analogous protocol can be used in the design of sequences for computational models of proteins. For this purpose, simplified models are useful because any parameter can be varied easily to probe its role in stabilizing the target structure, and information at all levels of detail can be obtained for every sequence. In particular, design `failures' (i.e. sequences that do not fold) can be characterized and compared with sequences that do fold to the target structure. Computational studies have the additional practical advantages that they are much faster and less expensive than their experimental counterparts.

We have made a detailed analysis of the folding of a 125-residue heteropolymer model for proteins subject to Monte Carlo (MC) dynamics on a simple cubic lattice (Dinner et al., 1996Go; Dinner and Karplus, 1998Go; Dinner et al., 1998Go). The study of a few such sequences (Dinner et al., 1996Go; Dinner and Karplus, 1998Go; Dinner and Karplus, 1999Go) showed that they fold by a mechanism in which the chain quickly (~106 MC steps) collapses to a disordered globule and then makes a relatively slow search (~107 108 MC steps) through the compact states for a specific set of about 30 contacts (non-bonded spatial nearest-neighbors) that make up a stable, cooperative core which leads to a rapid accumulation of native structure. Once the core is formed, both direct folding to the native state and misfolding to a long-lived intermediate are found to occur. Molecules that form an intermediate complete folding by rearrangement and condensation of primarily surface residues; this process is slow (~107–109 MC steps) because it requires disruption of stable contacts.

To verify the generality of this folding mechanism, we studied a database of 200 such sequences, 100 of which had native states with primarily helical structure and 100 of which had native states with primarily sheet structure (Dinner et al., 1996Go, 1998Go). With a simple linear analysis, we found that folding ability was significantly correlated with the stability of the native state and the amount of kinetically accessible cooperative native secondary structure, in particular anti-parallel sheets connected by tight turns (Dinner et al., 1996Go). However, the predictivity of either a stability descriptor or a structure descriptor alone was low (as measured by the cross-validated Pearson linear correlation coefficient between the predicted and observed folding abilities, rcv). Consequently, we applied a more sophisticated statistical analysis which had been used previously to derive QSPRs for ligand binding (So and Karplus, 1996aGo,bGo). The method employs a genetic algorithm to pick the sequence descriptors that are most important for folding and an artificial neural network to derive the functional dependence of folding ability on the chosen descriptors (GNN). In addition to confirming the results of the linear analysis, the GNN identified the total energy of the contacts between surface residues as important, which is consistent with the folding mechanism described above. The resulting predictivities of the three-descriptor QSPRs (in the range 0.434 < rcv < 0.834) was a significant improvement over those of the single-descriptor linear analysis (in the range 0.199 < rcv < 0.492) (Dinner et al., 1998Go).

In the present study, we use one of the QSPRs derived in Dinner et al. (1998) to design 125mer sequences that fold repeatedly and relatively rapidly to target structures. The QSPR is a function of the number of anti-parallel sheet contacts, the energy gap between the native state and the quasi-continuous part of the spectrum, and the total energy of the contacts between surface residues. Two Monte Carlo procedures were used in series to optimize both the target structures and the sequences. We generated 20 fully optimized sequences and 60 partially optimized control sequences and tested each for its ability to fold in dynamic MC simulations. Although sequences in which either the number of anti-parallel sheet contacts or the energy of the surface residues is non-optimal are capable of folding almost as well as those in which all three inputs to the QSPR are optimized, those in which only the energy gap is optimized fold markedly more slowly. Thus, for larger model proteins, the QSPR approach represents a significant improvement over commonly employed design protocols that do not take the native structure or the spatial distribution of interactions into account (Shakhnovich and Gutin, 1993aGo,bGo; Gutin et al., 1995bGo).


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Model

The model used for the 125mer has been described in detail elsewhere (Dinner et al., 1996Go, 1998Go; Dinner and Karplus, 1998Go) and is reviewed only briefly here. The effective energy of a conformation is the sum over all contacts (non-bonded spatial nearest-neighbors):

The function {Delta}(rirj) selects the interacting residue pairs; it is unity if i and j, located at ri and rj respectively, are in contact and zero otherwise. The quantity B0 is a mean attraction between residues that corresponds to an overall hydrophobic term; for all the calculations shown, B0 = –0.25 (all energies and temperatures are in dimensionless units, and the Boltzmann constant is taken to be unity). [In Dinner et al. (1996), the value of B0 was reported incorrectly and should have been B0 = –0.25/0.6 = –0.42 in the scaled units.] Bij gives the specific interaction energy between residues i and j; a complete set of Bij defines a sequence.

Sequence database

As described above, the goal of the present study is to use a quantitative structure–property relationship (QSPR) derived from analysis of a database of sequences with known folding ability. The probability of finding sequences with randomly chosen interactions that fold within feasible simulation times (109 MC steps) is very small for the 125mer. Consequently, to obtain a database with a significant fraction of folding sequences, we generated the sequences in the following manner. From an ensemble of fully compact 5x5x5 (176-contact) conformations, we selected 100 otherwise random structures with a large fraction of contacts in sheets and 100 otherwise random structures with a large fraction of contacts in helices [see the Methods section of Dinner and Karplus (1998) for details]. For each such structure, the Bij were assigned by a procedure equivalent to drawing them from two Gaussian distributions with standard deviations of about 0.6: one for native Bij with a mean of –1 and one for non-native Bij with a mean of 0 [see Dinner et al. (1998) for details]. Since there are far more non-native contacts than native ones (3606 compared with 176), the overall distributions of the Bij are almost Gaussian in character and centered on 0 (for the 200 sequences, on average, the center is ij = –0.048, the overall standard deviation is {sigma}B = 0.597, and the normalized third moment is {alpha}3->B = –0.188). Addition of B0 to all contacts, as in Eqn 1, translates the distribution as a whole. This optimization procedure is similar to others in the literature (Shakhnovich and Gutin, 1993aGo; Gutin et al., 1995bGo) in that it lowers the energy of the native state relative to the majority of states in the spectrum and tends to increase the correlation between the energy of a state and its similarity to the chosen native state (Dinner et al., 1998Go).

The 200 sequences were each subjected to 10 independent Metropolis Monte Carlo trials (Metropolis et al., 1953Go), each of which started with a different random configuration. Trials continued for a maximum of 50x106 MC steps and were terminated earlier if the chain found the native state (first passage time). The MC move set consisted of tail flips, internal bead flips, crankshaft rotations (Hilhorst and Deutch, 1975Go) and a composite move in which two to four single bead or crankshaft moves were performed before applying the Metropolis criterion. The probabilities with which the moves were chosen and the resulting acceptance rates are given elsewhere (Dinner et al., 1996Go, 1998Go). For all trials, the temperature was T = 0.8. From equilibrium sampling over a range of temperatures for a few sequences, we estimate the melting temperatures (Tm) of the sequences studied to be in the range 0.9 < Tm < 1.1, so that T = 0.8 is expected to be lower than Tm for most sequences.

The folding ability of each sequence was measured both by the number of times it reached the native state in the 10 trials (Nf) and by the average contact overlap between the native structure and the lowest energy structure (Emin) sampled during each trial (Qm = 0Emin). Qm identifies sequences that repeatedly get close to the native state but fail to find it. As expected, the procedure used to generate the sequences produced a significant fraction capable of folding within the allotted simulation time; Nf >= 4 for 87 sheet sequences and for 45 helical sequences (similarly, Qm >= 0.8 for 88 sheet sequences and for 49 helical sequences). Complete distributions are given in figure 12 of Dinner et al. (1998).

GNN analysis

We used a genetic neural network (GNN) to obtain quantitative structure–property relationships (QSPRs) (So and Karplus, 1996aGo,bGo) for a thorough analysis of the relation between different system properties and folding ability. As described in the papers that introduced the GNN algorithm for ligand design (So and Karplus, 1996aGo,bGo), the sequence attributes (descriptors) are selected by a genetic algorithm (GA) (Holland, 1975Go) and the functional dependence of the response data on the chosen descriptors is derived by an artificial neural network (NN) (Hertz et al., 1991Go) [for reviews of the application of these methods to chemical and biological problems, see Burns and Whitesides (1993), Zupan and Gasteiger (1993), Clark and Westhead (1996) and Pedersen and Moult (1996)]. The GNN method combines the evolutionary and parallel features of genetic algorithms with the non-linear features of neural networks to obtain optimal descriptions of complex relationships. The specific GNN protocol has been published (So and Karplus, 1996bGo), and the details of its implementation in the present study are described in Dinner et al. (1998).

We considered 47 different sequence attributes as inputs to the GNN (Dinner et al., 1998Go). They fell into five categories: (i) properties of the Bij distribution; (ii) the overall stability of the native state; (iii) the structure content of the native state; (iv) correlations between sequence and native structure and (v) breakdowns of the energy and secondary structure content of the native state by residue exposure. Separate three-descriptor QSPRs were derived to predict Nf and Qm for each secondary structure class (helix and sheet) and for the entire database. In almost all cases, the GNN chose a descriptor measuring the overall stability of the native state, a descriptor measuring the amount of kinetically accessible cooperative secondary structure in the native state (typically the number of contacts in either anti-parallel sheets or turns), and a descriptor measuring the spatial distribution of pairwise interactions within the native structure (most commonly the total energy of the contacts between surface residues).

The QSPR that we selected for use in the present study is one that predicts Qm for all sequences. It was chosen over one that predicts Nf based on the higher cross-validated Pearson linear correlation coefficients between the predicted and observed folding abilities; that is, rcv = 0.774 for the Qm QSPR, while rcv = 0.654 for the Nf QSPR. The lower correlation in the latter case results from the fact that the Nf data are more noisy because this measure of folding ability is an integer so that a single atypical trial out of 10 can substantially change its value. However, Nf and Qm are strongly correlated (rNfQm = 0.823), and the QSPRs for Nf and Qm (for all sequences) take the same three descriptors as inputs. Thus, variation of the three descriptors to maximize Qm is expected to optimize Nf as well.

The first input to the QSPR is the total number of native contacts in anti-parallel sheets (Cas). For lattices, secondary structures are based on a contact map (Chan and Dill, 1990Go). Anti-parallel sheets are such that if i and j are in contact, either i–1 and j+1 are in contact or i+1 and j–1 are in contact; these contacts form lines on the contact map perpendicular to the diagonal. The second input is the energy gap between the ground state and the lower limit of the quasi-continuous spectrum (Shakhnovich and Gutin, 1990Go)

where C0 is the number of contacts in a 5x5x5 fully compact conformation (176), E0 is the energy of the native state and = C0(ij + B0) is the average energy of a 5x5x5 conformation. The third input is the total energy of the contacts between surface residues (E0->ss).

The neural network functional dependence on a descriptor (a NN input) can be probed by fixing the values of the other two descriptors chosen by the GA and varying the descriptor of interest between its minimum and maximum values in the database. The result of this procedure is shown for Cas with fixed {Delta}c and E0->ss in Figure 1aGo. Three such curves are shown: one in which the fixed descriptors ({Delta}c and E0->ss) are given their average values in the database, one in which the fixed descriptors are given their maximum values in the database (fast folding limit) and one in which the fixed descriptors are given their minimum values in the database (slow folding limit). The QSPR spans only a partial range of Qm (0.383–0.974) because the lowest Qm in the database is greater than zero (0.361). The functional dependence on Cas is similar for the average and maximum fixed descriptor values, indicating that the QSPR is not very sensitive to the values of the other two inputs if they are above a particular threshold. This threshold behavior (a sudden increase in Qm followed by saturation) can be seen clearly in Figure 1bGo, in which we show the analogous set of curves for {Delta}c. The {Delta}c functional dependence shows the least sensitivity to the fixed descriptor values; all three curves are essentially the same in shape. The functional dependence of the QSPR on E0->ss is shown in Figure 1cGo. This parameter was first suggested to be important in the earlier linear analysis of the database (Dinner et al., 1996Go) because the off-pathway intermediates entail formation of native-like local domains involving primarily surface residues. Consistent with the mechanism, as the surface energy becomes weaker (less negative), the folding ability increases. The functional dependence on E0->ss shows the strongest sensitivity to the choice of the fixed descriptor values.



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 1. Dependence of the QSPR on its inputs. Each curve is obtained by fixing the other two inputs and varying the one in question linearly between its minimum and maximum values in the 200 sequence database. Fixed values are set at their average values (solid line). Fixed values are set to their maximum values (long dash). Fixed values are set to their minimum values (short dash). (a) Cas is varied; (b) {Delta}c is varied; (c) E0->ss is varied.

 
Sequence optimization

The goal of the present study is to design sequences that maximize the Qm predicted by the QSPR. Since the inputs Cas, {Delta}c and E0->ss involve both structure and sequence, we wanted to optimize both the target structures and the sequences (sets of Bij). In principle, one could perform a simultaneous optimization of structure and sequence by a Monte Carlo procedure which changed the structure, optimized the sequence for the new structure and then applied a suitable criterion to accept or reject the new structure and sequence. However, such a `nested' Monte Carlo would be extremely computationally costly because it would involve a complete optimization of the sequence at each step that changed the structure (a `nested' Monte Carlo procedure is used in Seno et al. (1996) for two-dimensional 16mers). Consequently, we instead chose to make use of the fact that, as discussed above, the functional dependence of the QSPR on Cas is relatively insensitive to the other two inputs so long as they are near or above their mean values in the database (Figure 1bGo) and optimized the structures and the sequences separately.

We began by optimizing the structures (the Cas input). We used a Monte Carlo procedure restricted to the space of 5x5x5 cubes to maximize the QSPR with its {Delta}c and E0->ss inputs held fixed at their average values for the 200 sequence database (c = 156.423 and E0->ss = –104.556). Each trial began in a random 5x5x5 conformation (as = 62) and proceeded for 50x103 MC steps. At each step, we generated a new 5x5x5 conformation by making a `four-cycle' move (described below) (Ramakrishnan et al., 1995Go, 1997Go) and then applied a Metropolis-like criterion in which the move was accepted if and only if a random number distributed uniformly between 0 and 1 was less than exp[(q2q1)/Tq], where q1 is the old value of the QSPR, q2 is the new value of the QSPR and Tq is the Monte Carlo `temperature' (Tq is unrelated to the physically meaningful temperature used in the dynamics). In a four-cycle move, two adjacent parallel (or anti-parallel) bonds are broken and new bonds are formed along the other two sides of the square defined by the four residues of the original bonds (Figure 2Go) (Ramakrishnan et al., 1995Go, 1997Go).



View larger version (16K):
[in this window]
[in a new window]
 
Fig. 2. A four-cycle move. Parallel (or anti-parallel) bonds ij and kl are broken, and new bonds are formed along the other two sides of the square defined by ijkl (ik and jl). The reverse is also possible. If the bonds to be broken are anti-parallel, a closed subcycle is created; in this case, the subcycle is resolved by an additional four-cycle move chosen such that one bond to be broken is in the closed subcycle and the other is in the remaining part of the chain.

 
Using the above procedure, we created two sets of 20 target structures. In the first set, Tq was set to 0.0001. As discussed above, the QSPR ranges from 0.383 to 0.974, so the choice Tq = 0.0001 corresponds to a relatively low optimization temperature that drives the system to create structures with high numbers of anti-parallel sheet contacts (Figure 3aGo). Consequently, we denote this set of 20 structures by `S' for `sheet' (Table IGo). There was no need to cool the system slowly because the QSPR `potential' is sufficiently smooth in the space of 5x5x5 conformations that no significant trapping in local minima was observed. In the second set, Tq was set to a very high temperature (Tq = 1000) to completely randomize the structure. These 20 structures, which we denote `R' for `random' (Table IGo), act as controls in our study. They have Cas comparable with the helical structures in the database [compare figure 3aGo with figure 7 of Dinner et al. (1998)]. However, it should be noted that these structures, like the helical ones in the database, have a significant fraction of long-range contacts (|i – j | >= 11) due to the stipulation that their native structures be 5x5x5 cubes.



View larger version (18K):
[in this window]
[in a new window]
 
Fig. 3. Distribution of the inputs to the QSPR for the designed sequences. (a) Distribution of the number of anti-parallel sheet contacts for R (black) and S (white) structures; each value is binned with the nearest integer multiple of 5. (b) Distribution of the energy gap {Delta}c for RE (black), RQ (dark gray), SE (light gray) and SQ (white) sequences; each value is binned with the nearest integer multiple of 2. (c) Distribution of the energy between surface residues; gray tones are the same as in (b); each value is binned with the nearest integer multiple of 6.

 

View this table:
[in this window]
[in a new window]
 
Table 1 Optimization protocol
 
Each of the 40 structures served as the target native state for a Monte Carlo procedure in the space of sequences (sets of Bij) to optimize the QSPR with respect to {Delta}c and E0->ss; the Cas input was held fixed at its value for the target structure. Each trial began with a random assignment of the Bij from a Gaussian distribution:

where the mean and standard deviation of the Gaussian were set to their average values in the 200 sequence database ({sigma}B = 0.597 and ij = –0.048). Typically, the initial energy of the target structure was close to or within the quasi-continuous part of the spectrum, in which case {Delta}c is very small. For {Delta}c far outside the range of values of the original 200 sequences used to train the GNN, the QSPR is essentially flat (Figure 1bGo), which results in a non-directed and thus inefficient search. To increase {Delta}c to the range shown in Figure 1bGo, we first maximized {Delta}c directly for 500x103 MC steps. At each step, we picked two residues randomly, interchanged the rows and columns to which they corresponded in the Bij matrix, and then accepted the move if and only if a random number distributed uniformly between 0 and 1 was less than exp[({Delta}2->c{Delta}1->c)/T{Delta}], where {Delta}1->c is the old value of the {Delta}c, {Delta}2->c is the new value of {Delta}c, and T{Delta} is an optimization `temperature' that is independent of Tq. To drive the system to high {Delta}c, as discussed above, a low optimization temperature was used (T{Delta} = 0.1). The interchange of entire rows and columns in the Bij matrix is equivalent to interchanging residue types in a linear sequence (Shakhnovich and Gutin, 1993aGo,bGo); unlike the interchange of individual Bij elements (Cieplak et al., 1996Go; Klimov and Thirumalai, 1996Go; Cieplak and Banavar, 1997Go) it preserves a certain degree of frustration and does not distort the distributions due to the fact that only odd |ij| can be formed on a simple cubic lattice (Shakhnovich, 1998Go; Dinner et al., 1999Go). Thus, all the sequences in the present study are essentially Gaussian [the average normalized third moment of the Bij distributions is {alpha}3->ß = –0.075 compared with {alpha}3->ß = –0.188 for the 200 sequence database (Dinner et al., 1998Go)]. The two sets of partially optimized sequences were saved at this point and are denoted `SE' and `RE' for `sheet-energy gap' and `random-energy gap`, respectively (Table IGo).

Further Monte Carlo optimization to maximize the QSPR was then performed to generate two additional sets of 20 sequences each. For 200x103 MC steps we continued to interchange entire rows and columns of the Bij matrix subject to the acceptance criterion described above for the structure optimization; again, Tq = 0.0001. We then fixed the set of 176 Bij that were native and made interchanges of individual Bij elements within this set for 100x103 MC steps at Tq = 0.0001 to optimize Ess->0 without changing {Delta}c or distorting the distributions of native and non-native contacts. We denote these two sets of sequences as `SQ' and `RQ' for `sheet-QSPR' and `random-QSPR`, respectively (Table IGo).

The four sets of sequences are summarized in Table I.Go For each set, there are 20 sequences. In one set, we optimized all three inputs to the QSPR: Cas, {Delta}c and E0->ss (SQ). In another two, we optimized only two of the three inputs: {Delta}c and Cas (SE) or {Delta}c and E0->ss (RQ). In the final set, we optimized only the overall stability input, {Delta}c (RE). Sequences without optimization of {Delta}c were not generated because they do not meet the requirement that the target structure be stable at the temperature used in the dynamics (T = 0.8).


    Results
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
The distributions of the three inputs to the QSPR are shown for the four sets of sequences in Figure 3Go. As discussed above, the S structures have relatively high Cas, even for sheet structures, while the R structures have Cas comparable with those of the helical sequences in the 200 sequence database (Figure 3aGo). We stress that the R structures, like the helical ones in the database, have a significant fraction of long-range (|ij| >= 11) contacts due to the stipulation that the native structure be a 5x5x5 cube. As a result, the sequences designed for these structures are capable of cooperative folding (see below). Visual inspection of the S structures showed that, while a few have very uniform topologies, like the structures of the original 200 sequence database shown in figure 2bGo of Dinner et al. (1996) and figure 2bGo of Dinner et al. (1998), most of them consist of many short anti-parallel hairpins packed in a relatively irregular fashion. Moreover, there was no obvious relationship between the degree of order in a structure's overall topology and the corresponding sequence's folding ability (discussed below).

The distributions of {Delta}c differ depending on whether they are for the E or the Q sequences. Those in which we terminated the sequence optimization after the direct maximization of {Delta}c (the E sequences) tend to have higher {Delta}c (more stable target structures) than those in which we continued the optimization to maximize the QSPR. Thus, the optimization of the QSPR decreases {Delta}c (destabilizing the target structure) slightly to optimize E0->ss. Consistent with this idea, we see in Figure 3cGo that the distribution of E0->ss is shifted to the left (more stable) for the E sequences. The fully optimized SQ sequences have intermediate E0->ss, while the RQ sequences have high E0->ss to compensate for their low Cas.

To determine their folding ability, the sequences were each subjected to 10 dynamic Monte Carlo trials identical in procedure to those described above for the 200 sequence database. In particular, there was no need to change the folding conditions, which are determined by B0 and T, because they were already calibrated for relatively optimized sequences [for further discussion of this point, see Gutin et al. (1995a)]. The distributions of folding abilities are given in Figure 4Go. The distributions of the number of times folded (Nf) for the RQ, SE and SQ sequences are comparable, although the SQ sequences are the only set in which a sequence folds all 10 times. The distribution for the RE sequence is shifted markedly to the left; f = 2.1 for the RE sequences while f = 5.5, 5.1 and 5.4 for the RQ, SE and SQ sequences. Turning to the distributions of Qm, we see that, at Qm > 0.95, there are six SQ sequences and none of any other sequence set. Otherwise, the behavior for Qm is similar to that observed for Nf in that the RE sequences exhibit significantly worse folding than the other sequence sets; m = 0.675, 0.870, 0.836 and 0.902 for RE, RQ, SE and SQ sequences respectively. Thus, the sequences in which all three inputs were optimized exhibit the best folding, but those in which two out of three inputs were optimized are almost as good (with RQ slightly better than SE).



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 4. Distribution of folding abilities for RE (black), RQ (dark gray), SE (light gray) and SQ (white) sequences. (a) Qm; each Qm value is binned with the nearest integer multiple of 0.1. (b) Nf.

 
As shown in Figure 5Go, the observed behavior is consistent with the predictions of the QSPR. Both the SE and RQ sequences are predicted to be close in folding ability (as measured by Qm) to the SQ sequences. This behavior can be understood in terms of the functional dependence curves presented in Figure 1Go. The SE sequences have high Cas and high {Delta}c, so even a low E0->ss will result in high Qm; the left intercept of the long dash curve in Figure 1cGo is Qm = 0.965. The RQ sequences have E0->ss that are higher (less stable) than that of any sequence in the 200 sequence database, so the predicted Qm values fall above the left intercept of the long dash curve in Figure 1bGo (Qm = 0.855). As for the RE sequences, they are predicted and observed to fold less well. They correspond to the short dash curve in Figure 1bGo, for which Qm < 0.84 for all values of {Delta}c.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 5. Comparison of observed and predicted folding abilities for RE (black circle), RQ (black triangle), SE (white circle) and SQ (white triangle) sequences. (a) Qm; the overall (80 sequence) Pearson linear correlation coefficient is r = 0.696. (b) Nf; r = 0.569. The solid lines correspond to a perfect correlations (r = 1); the dashed lines are best-fit linear regressions.

 
Although the overall correlations in Figure 5Go are good (r = 0.696 for Qm and r = 0.569 for Nf), the QSPRs systematically overestimate the folding abilities. In particular, the partially optimized RQ (black triangle) and SE (white circle) sequences cluster at the top of the range of predicted Qm values but exhibit a significant variability in the observed Qm values (Figure 5aGo). This derives from overcompensation for the non-optimized input by increasing the remaining two inputs ({Delta}c and E0->ss in the case of the RQ sequences and Cas and {Delta}c in the case of the SE sequences) to the point that they exceed their ranges in the training database.

As was the case in Dinner et al. (1998), the predictivity of the Nf QSPR is lower than that of the Qm QSPR. In part, this is due to the fact that, as mentioned above, the Nf data are more noisy. However, the SQ sequences that exhibit high Qm but low Nf (white triangles at Nf <= 3 in Figure 5bGo) suggest that factors other than the three considered are also of importance in determining Nf. Since these sequences consistently get close to the native state but fail to find it, it is likely that the additional factors involve specific interactions that would have to be determined on a sequence-by-sequence basis [for an extreme example, see Dinner and Karplus (1998)] and thus would not be of use in the general design protocol.


    Discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
In the present study, we used a quantitative structure–property relationship (QSPR) to design 125-residue sequences that fold repeatedly and relatively rapidly to stable target structures. The QSPR was derived by statistical analysis of a database of 200 sequences with known folding abilities; the statistical method, employed previously in ligand design (So and Karplus, 1996aGo,bGo), uses a genetic algorithm to select the sequence attributes (descriptors) that are most important for folding and a neural network to determine the corresponding functional dependence of folding ability on the chosen attributes. The specific QSPR employed takes as inputs the number of anti-parallel sheet contacts, the estimated energy gap between the native state and the majority of states in the spectrum, and the total energy of the contacts between surface residues. The predicted folding ability was maximized by two Monte Carlo procedures in series: the first in the space of target conformations (5x5x5 cubes) and the second in the space of sequences (sets of assigned pairwise nearest-neighbor interactions).

Four sets of 20 sequences each were designed, all of which met the basic thermodynamic requirement for folding that the target (native) structures be stable at the temperature employed in the dynamics. One of the sets was fully optimized, while the other three were only partially optimized and acted as controls. Consistent with the functional dependence of the QSPR, it was found that sequences in which the stability (energy gap) descriptor and one of the other two descriptors were optimized folded almost as well as sequences in which all three descriptors were optimized. In contrast, sequences in which only the energy gap was optimized exhibited markedly poorer folding.

A physical interpretation of the QSPR and hence the observed folding behavior of the designed sequences is clearly of interest. As mentioned above, the energy gap, {Delta}c was optimized for all sets of sequences. Maximization of this input has kinetic effects in addition to the obvious thermodynamic ones; increasing the energy gap stabilizes more native-like structures relative to less native-like ones, which increases the bias of the energy surface towards the native state [for a fuller discussion of the role of the energy gap in folding, see Dinner et al. (1999)]. The four sets of sequences differ with respect to the optimization of the other two inputs, Cas and E0->ss. Maximization of Cas speeds folding because anti-parallel sheet contacts are typically cooperative (formation of one contact increases the probability of the formation of the others) and kinetically accessible (easily found in a random search). Thus, they promote folding by restricting the random search for the core and allowing it to maintain its structure sufficiently long to drive the subsequent accumulation of structure. As for E0->ss, its maximization (destabilization) has two advantages. It moves stronger contacts to the core, which stabilizes the transition state, and weaker contacts to the surface, which decreases the likelihood of trapping by favoring more open structures that can fold rapidly to the native state. The RE sequences exhibit the slowest folding because they have difficulty forming and maintaining the core and then tend to become trapped in a partially folded intermediate. The SE sequences fold faster because they form stable cores more readily; however, they still have a tendency to become trapped. The RQ sequences fold almost as rapidly as do the fully optimized SQ sequences because, in spite of the reduction in the number of cooperative contacts, they have relatively stable cores and labile surfaces.

Poor folding behavior for designed sequences of longer chain length (48 or more residues) in which only the energy gap was optimized has been reported previously (Abkevich et al., 1995Go, 1996Go). The problem stems from inhomogeneous spatial distributions of interactions in the target native structures that result in subdomains that behave differently. Such sequences form kinetic and equilibrium intermediates in which clusters of strong interactions are native-like while weak interactions are missing. Abkevich et al. (1996) overcame this problem by decreasing the dispersion of native interactions; 48-mers in which the standard deviation of the native Bij ({sigma}B->nat) was much smaller than that of all Bij ({sigma}B) exhibited two-state all-or-none folding while sequences in which {sigma}B->nat was much larger than {sigma}B tended to populate partially folded intermediates. The dispersion of native contact energies was not considered in the present study because, as discussed in Dinner et al. (1998), the genetic neural network (GNN) did not choose {sigma}B->nat for any of the QSPRs. Although the 48mer study considered a larger range of {sigma}B->nat/{sigma}B (0.27–1.77) than did Dinner et al. (1998) (0.82–1.06), the GNN should be sufficiently sensitive to detect any significant dependence of the folding ability on {sigma}B->nat. The fact that this descriptor was not chosen indicates that it is less important than E0->ss in the 125mer. Our success in the present study in designing 125mer sequences that fold significantly better than ones in which only the energy gap is optimized shows that Cas, {Delta}c and E0->ss are indeed sufficient, though with more sequence data and an extended set of descriptors, {sigma}B->nat might be appropriate for improving the results.

Recently, Mirny and co-workers took a more direct approach to optimizing the rate of folding. By a Monte Carlo procedure in sequence space, they made `mutations' (Gutin et al., 1995bGo) to a 48mer sequence and accepted each mutation only if it decreased the mean first passage time (MFPT) (Mirny et al., 1998Go). The optimization procedure was found to shift the folding mechanism from a random search for a critical fraction of contacts (Sali et al., 1994Go) to a nucleation mechanism (Abkevich et al., 1994Go; Mirny et al., 1998Go), which accelerated folding by two orders of magnitude. During the optimization, the energy of the nucleus contacts dropped rapidly (from ij/T = –1.00 to Bij/T = –2.12), while the energy of the rest of the contacts remained roughly the same (at ij/T = –1.00). This result is in agreement with the earlier analysis of a few fast-folding 125mer sequences in which ij/T = –1.50 for contacts in the folding core and ij/T = –1.25 for all contacts (so that ij/T {approx} –1.20 for the non-core contacts) (Dinner et al., 1996Go) and the statistical analyses of the database (Dinner et al., 1996Go, 1998Go) which ascribed importance to E0->ss. In the present study, the final phase of Monte Carlo optimization, which is designed to maximize E0->ss, moves higher native Bij to the surface and lower native Bij to the core of the target structure. Thus, the QSPR captures relatively subtle features of the direct MFPT optimization. Moreover, it does so without the need for extensive simulation of the dynamics at every step of the optimization procedure [to calculate MFPT, 400 independent dynamic MC trials were performed after each `mutation' in Mirny et al. (1998)]. As a result, the QSPR-based protocol is not only more practical (faster) for protein design but is truly predictive.

An interesting question is whether natural selection has optimized proteins for fast folding. A double-mutant cycle analysis of chymotrypsin inhibitor 2 (CI2), a 64-residue protein that exhibits fast two-state folding, revealed that particular interactions in the hydrophobic core are more favorable in the transition state than in the native state. Residues Ala16 and Ile57, which interact with an unfavorable free energy of –2.0±0.2 kcal/mol in the native state, have an unfavorable free energy of only –0.38±0.02 kcal/mol in the transition state, while residues Ala16 and Leu49 interact with an unfavorable free energy of –0.8±0.4 kcal/mol in the native state and a favorable free energy of 0.31±0.05 kcal/mol in the transition state (Ladurner et al., 1997Go). Thus, it is plausible, but not compelling, that folding rate was a factor in the evolution of CI2. A more direct test of the evolutionary importance of the folding rate was reported recently for the immunoglobulin G binding domain of peptostreptococcal protein L, a 62-residue protein with a ßß{alpha}ßß structure. Thirty-six of the residues were subjected to random mutagenesis, and functional folded variants were retrieved with a phage-display method that did not discriminate between sequences on the basis of folding rate (to avoid introducing any bias into the set examined) (Kim et al., 1998Go). The kinetics and thermodynamics of 12 of the variants were determined, and it was found that, although all the variants had reduced stability, the wild type folds with a rate (61 s–1) that is in the middle of the range observed for the variants (4–180 s–1). This suggests that the wild type sequence was optimized for stability rather than fast folding. On the basis of only two proteins, however, it is difficult to assess the generality of these results. Moreover, the role folding rate plays in natural selection is likely to depend on the function of the protein; see, for example, the discussion of {alpha}-lytic protease in Kim et al. (1998).

Although in the present study we have used a GNN-QSPR to design model sequences that fold rapidly, the use of QSPRs in protein design is clearly not limited to optimization of the folding rate or to model sequences. As already mentioned, previous studies of real proteins employed a linear QSPR as a scoring function to design sequences with greater stability (as measured by Tm) (Dahiyat and Mayo, 1996Go, 1997aGo,Dahiyat and Mayo, bGo). Given that the QSPRs were first developed for the maximization of small molecule binding, a natural extension to the above protein design studies would be to use QSPRs in the optimization of the enzymic function of an artificial sequence. Advances in QSPR methodology, including the use of genetic neural networks (So and Karplus, 1996aGo,bGo) and 3D QSPR methods (Kubinyi, 1993Go), should facilitate research directed towards this goal.


    Acknowledgments
 
We wish to thank Sung-Sau So for providing the neural network weights. A.R.D. is a Howard Hughes Medical Institute Predoctoral Fellow. This work was supported in part by a grant from the National Science Foundation.


    Notes
 
4 To whom correspondence should be addressed; email: marci{at}tammy.harvard.edu Back


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Abkevich,V.I., Gutin,A.M. and Shakhnovich,E.I. (1994) Biochemistry, 33, 10026–10036.[ISI][Medline]

Abkevich,V.I., Gutin,A.M. and Shakhnovich,E.I. (1995) Protein Sci., 4, 1167–1177.[Abstract/Free Full Text]

Abkevich,V.I., Gutin,A.M. and Shakhnovich,E.I. (1996) Folding Des., 1, 221–230.[ISI][Medline]

Burns,J.A. and Whitesides,G.M. (1993) Chem. Rev., 93, 2583–2602.[ISI]

Chan,H.S. and Dill,K.A. (1990) J. Chem. Phys., 92, 3118–3135.[ISI]

Cieplak,M. and Banavar,J.R. (1997) Folding Des., 2, 235–245.[ISI][Medline]

Cieplak,M., Vishveshwara,S. and Banavar,J.R. (1996) Phys. Rev. Lett., 77, 3681–3684.[ISI][Medline]

Clark,D.E. and Westhead,D.R. (1996) J. Comput.-Aided. Mol. Des., 10, 337–358.

Dahiyat,B.I. and Mayo,S.L. (1996) Protein Sci., 5, 895–903.[Abstract/Free Full Text]

Dahiyat,B.I. and Mayo,S.L. (1997a) Proc. Natl Acad. Sci. USA, 94, 10172–10177.[Abstract/Free Full Text]

Dahiyat,B.I. and Mayo,S.L. (1997b) Science, 278, 82–87.[Abstract/Free Full Text]

Desjarlais,J.R. and Handel,T.M. (1995) Protein Sci., 4, 2006–2018.[Abstract/Free Full Text]

Dinner,A.R. and Karplus,M. (1998) Nature Struct. Biol., 5, 236–241.[ISI][Medline]

Dinner,A.R. and Karplus,M. (1999) J. Phys. Chem., B103, 7976–7994.[ISI]

Dinner,A.R., SSali,A. and Karplus,M. (1996) Proc. Natl Acad. Sci. USA, 93, 8356–8361.[Abstract/Free Full Text]

Dinner,A.R., So.,S.-S. and Karplus,M. (1998) Proteins, 33, 177–203.[ISI][Medline]

Dinner,A.R., Abkevich,V., Shakhnovich,E. and Karplus,M. (1999) Proteins, 35, 34–40.[ISI][Medline]

Gutin,A.M., Abkevich,V.I. and Shakhnovich,E.I. (1995a) Biochemistry, 34, 3066–3076.[ISI][Medline]

Gutin,A.M., Abkevich,V.I. and Shakhnovich,E.I. (1995b) Proc. Natl Acad. Sci. USA, 92, 1282–1286.[Abstract]

Harbury,P.B., Tidor,B. and Kim,P. (1995) Proc. Natl Acad. Sci. USA, 92, 8408–8412.[Abstract]

Harbury,P.B., Plecs,J.J., Tidor,B., Alber,T. and Kim,P. (1998) Science, 282, 1462–1467.[Abstract/Free Full Text]

Hellinga,H.W. and Richards,F.M. (1991) J. Mol. Biol., 222, 763–785.[ISI][Medline]

Hertz,J., Krogh,A. and Palmer,R.G. (1991), Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City.

Hilhorst,H.J. and Deutch,J.M. (1975) J. Chem. Phys., 63, 5153–5161.[ISI]

Holland,J.H. (1975), Adaption in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor.

Kim,D.E., Gu,H. and Baker,D. (1998) Proc. Natl Acad. Sci. USA, 95, 4982–4986.[Abstract/Free Full Text]

Klimov,D.K. and Thirumalai,D. (1996) Phys. Rev. Lett., 76, 4070–4073.[ISI][Medline]

Kubinyi,H. (1993), 3D QSAR in Drug Design: Theory, Methods and Applications. ESCOM Science Publishers, Leiden.

Ladurner,A.G., Itzhaki,L.S. and Fersht,A.R. (1997) Folding Des., 2, 363–368.[ISI][Medline]

Metropolis,N., Rosenbluth,A.W., Rosenbluth,M.N., Teller,A.H. and Teller,E. (1953) J. Chem. Phys., 21, 1087–1092.[ISI]

Mirny,L.A., Abkevich,V. and Shakhnovich,E.I. (1998) Proc. Natl Acad. Sci. USA, 95, 4976–4981.[Abstract/Free Full Text]

Pabo,C. (1983) Nature, 301, 200.[ISI][Medline]

Pedersen,J.T. and Moult,J. (1996) Curr. Opin. Struct. Biol., 6, 227–231.[ISI][Medline]

Ramakrishnan,R., Pekny,J.F. and Caruthers,J.M. (1995) J. Chem. Phys., 103, 7592–7604.[ISI]

Ramakrishnan,R., Ramachandran,B. and Pekny,J.F. (1997) J. Chem. Phys., 106, 2418–2425.[ISI]

Sali,A., Shakhnovich,E. and Karplus,M. (1994) Nature, 369, 248–251.[ISI][Medline]

Sauer,R.T. (1996) Folding Des., 1, R27–R30.[ISI][Medline]

Seno,F., Vendruscolo,M., Maritan,A. and Banavar,J.R. (1996) Phys. Rev. Lett., 77, 1901–1904.[ISI][Medline]

Shakhnovich,E.I. (1998) Folding Des., 3, R45–R58.[ISI]

Shakhnovich,E.I. and Gutin,A.M. (1990) J. Chem. Phys., 93, 5967–5971.[ISI]

Shakhnovich,E.I. and Gutin,A.M. (1993a) Proc. Natl Acad. Sci. USA, 90, 7195–7199.[Abstract]

Shakhnovich,E.I. and Gutin,A.M. (1993b) Protein Engng, 8, 793–800.

Shaw,A. and Bott,R. (1996) Curr. Opin. Struct. Biol., 6, 546–550.[ISI][Medline]

So,S.-S. and Karplus,M. (1996a) J. Med. Chem., 39, 1521–1530.[ISI][Medline]

So,S.-S. and Karplus,M. (1996b) J. Med. Chem., 39, 5246–5256.[ISI][Medline]

Villegas,V., Viguera,A.R., Avilés,F.X. and Serrano,L. (1996) Folding Des., 1, 29–34.[ISI][Medline]

Zupan,J. and Gasteiger,J. (1993), Neural Networks for Chemists: An Introduction. VCH Publishers, New York.

Received March 26, 1999; revised June 4, 1999; accepted June 21, 1999.





This Article
Abstract
FREE Full Text (PDF)
Alert me when this article is cited
Alert me if a correction is posted
Services
Email this article to a friend
Similar articles in this journal
Similar articles in ISI Web of Science
Similar articles in PubMed
Alert me to new issues of the journal
Add to My Personal Archive
Download to citation manager
Search for citing articles in:
ISI Web of Science (4)
Request Permissions
Google Scholar
Articles by Dinner, A. R.
Articles by Karplus, M.
PubMed
PubMed Citation
Articles by Dinner, A. R.
Articles by Karplus, M.