Prediction of the structures of proteins with the UNRES force field, including dynamic formation and breaking of disulfide bonds

Cezary Czaplewski1,2, Stanislaw Oldziej1,2, Adam Liwo1,2,3 and Harold A. Scheraga1,4

1Baker Laboratory of Chemistry, Cornell University, Ithaca, NY 14853-1301, USA, 2Faculty of Chemistry, University of Gdansk, ul. Sobieskiego 18, 80-952 Gdansk and 3Academic Computer Center in Gdansk TASK, ul. Narutowicza 11/12, 80-952 Gdansk, Poland

4 To whom correspondence should be addressed. e-mail: has5{at}cornell.edu


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
The presence of disulfide bonds is essential for maintaining the structure and function of many proteins. The disulfide bonds are usually formed dynamically during folding. This process is not accounted for in present algorithms for protein-structure prediction, which either deduce the possible positions of disulfide bonds only after the structure is formed or assume fixed disulfide bonds during the course of simulated folding. In this work, the conformational space annealing (CSA) method and the UNRES united-residue force field were extended to treat dynamic formation of disulfide bonds. A harmonic potential is imposed on the distance between disulfide-bonded cysteine side-chain centroids to describe the energetics of bond distortion and an energy gain of 5.5 kcal/mol is added for disulfide-bond formation. Formation, breaking and rearrangement of disulfide bonds are included in the CSA search by introducing appropriate operations; the search can also be carried out with a fixed disulfide-bond arrangement. The algorithm was applied to four proteins: 1EI0 ({alpha}), 1NKL ({alpha}), 1L1I (ß-helix) and 1ED0 ({alpha} + ß). For 1EI0, a low-energy structure with correct fold was obtained both in the runs without and with disulfide bonds; however, it was obtained as the lowest in energy only with the native disulfide-bond arrangement. For the other proteins studied, structures with the correct fold were obtained as the lowest (1NKL and 1L1I) or low-energy structures (1ED0) only in runs with disulfide bonds, although the final disulfide-bond arrangement was non-native. The results demonstrate that, by including the possibility of formation of disulfide bonds, the predictive power of the UNRES force field is enhanced, even though the disulfide-bond potential introduced here rarely produces disulfide bonds in native positions. To the best of our knowledge, this is the first algorithm for energy-based prediction of the structure of disulfide-bonded proteins without any assumption as to the positions of native disulfides or human intervention. Directions for improving the potentials and the search method are suggested.

Keywords: conformational-space annealing/disulfide bond/global optimization/protein structure prediction/united residue


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Disulfide bonds occur frequently in many proteins, especially in extracellular soluble globular proteins. These bonds provide stability to the native structure of a protein and may compensate for the absence of a significant hydrophobic core in small proteins (Betz, 1993Go; Petersen et al., 1999Go). It is still an unsettled question as to whether the disulfide bonds are formed before or after the secondary structure elements are formed (Welker et al., 2001Go). Some disulfide bonds are essential for maintaining the structure and function of the protein, while others can be broken without radically changing the properties of the protein. The addition of a single disulfide bond can cause cooperative, global folding of the entire protein. For example, bovine pancreatic ribonuclease A with two (26–84, 58–110) out of four of the native disulfide bonds has no conformational order, but species with one additional disulfide (65–72 or 40–95) have native-like structure (Lester et al., 1997Go; Wedemeyer et al., 2000Go).

Protein folding simulations show that inclusion of disulfide bonds as constraints reduces the conformational space that must be searched (Skolnick et al., 1997Go; Huang et al., 1999Go; Abkevich and Shakhnovich, 2000Go). Most protein structure prediction methods and protein folding simulations do not take into account dynamic disulfide-bond formation during folding. The particular arrangement of disulfide bonds is applied as a fixed set of constraints. Some studies examine the disruption of native disulfides and the incorporation of novel disulfides and their effects on stability, but use different simulations with different, fixed, arrangements of disulfide bonds during simulations (Rey and Skolnick, 1994Go). Computer modeling of disulfide bonds which can be introduced by protein engineering into various proteins of known structure to increase their stability relative to that of the wild type is fairly common (Zhou et al., 1993Go; Burton et al., 2000Go; Dani et al., 2003Go). A few studies have addressed the important problem of predicting the disulfide bonding state of cysteines in proteins; these include the use of statistical methods (Fiser et al., 1992Go), a specially optimized threading potential (Dombkowski and Crippen, 2000Go), neural networks and hidden Markov models (Muskal et al., 1990Go; Martelli et al., 2002Go) and methods that combine local context and global information about protein sequences (Fiser and Simon, 2000Go; Mucchielli-Giorgi et al., 2002Go).

Saitô and co-workers (Watanabe et al., 1991Go; Kobayashi et al., 1992Go) developed a protein folding simulation method with dynamic formation/breaking of disulfide bonds. Their procedure is based on an assumption that folding starts with the formation of secondary structures ({alpha}-helices and ß-sheets) and then proceeds to assemble them into the tertiary structure. Consequently, the simulation starts from the conformation in which the secondary structure is already formed and other regions are extended. The search for the conformation of minimum energy is carried out by changing the dihedral angles only in regions other than the secondary structures. Packing of the secondary structures of a polypeptide chain is guided by introduction of appropriate hydrophobic interactions which are responsible for the construction of short-distance local structure. A strict algebraic relation involving the geometry that must be satisfied for two cysteine residues to form a disulfide bond cannot be applied for distances too great to yield a disulfide bond. Consequently, Watanabe et al. used a geometrical graphic representation for the locus of the hydrogen atom of the SH group in the cysteine residue to draw the distributions of the cysteines at the folding stage in which they come close during simulation. They introduced a bonding potential (Equation 1 in Watanabe et al., 1991Go) between selected pairs of cysteines provided only that the circles representing the possible positions of the hydrogen atoms of the SH groups are face-to-face, thereby making it easy for them to intersect.

In this paper, we report an extension of our hierarchical procedure for protein-structure prediction (Liwo et al., 1999aGo; Pillardy et al., 2001aGo) to proteins containing disulfide bonds. This extended procedure allows for dynamic formation and breaking of disulfide bonds during the simulations. As in our previous work, an extensive search is carried out at the united-residue level with the UNRES force field and the use of the conformational space annealing (CSA) search method (Lee et al., 1997Go, 1999Go; Czaplewski et al., 2003Go). However, both the UNRES force field and the CSA procedure have been modified, in order to treat the possible formation of disulfide bonds.

The method was applied in the following sequence to proteins with all the basic types of secondary structure: an {alpha}-helical hairpin stabilized by two disulfide bonds [a fragment of the p8MTCP1 protein; Protein Data Bank (PDB) code 1EI0, 38 residues], whose three-dimensional structure was determined by NMR spectroscopy (Barthe et al., 2000Go); NK-lysyin (PDB code 1NKL, 78 residues), whose structure was also determined from NMR data (Liepinsh et al., 1997Go) and contains three disulfides and four {alpha}-helices; the thermal hysteresis protein isoform YL-1 (PDB code 1L1I, 84 residues), whose three-dimensional structure (in a ß-helical form) was determined recently from NMR data (Daley et al., 2002Go) and contains eight disulfides; and Viscotoxin A3 (PDB code 1ED0, 46 residues), whose structure was determined from NMR data (Romagnoli et al., 2000Go) as an {alpha}/ß type protein with three disulfide bonds, two {alpha}-helices and two short anti-parallel strands stabilized by one of the disulfide bonds.


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
The UNRES force field

In the UNRES model (Liwo et al., 1997aGo,b, 2001Go, 2002Go; Lee et al., 2001Go; Pillardy et al., 2001bGo), a polypeptide chain is represented by a sequence of {alpha}-carbon (C{alpha}) atoms linked by virtual bonds with attached united side chains (SC) and united peptide groups (p). Each united peptide group is located in the middle between two consecutive {alpha}-carbons, with peptide group pi being located between C{alpha}i and C{alpha}i+1. Only these united peptide groups and the united side chains serve as interaction sites, the {alpha}-carbons serving only to define the chain geometry (see figure 1 of Liwo et al., 1997aGo). All virtual bond lengths (i.e. C{alpha}–C{alpha} and C{alpha}–SC) are fixed; the distance between neighboring C{alpha}s is 3.8 Å, corresponding to trans peptide groups, while the side-chain angles ({alpha}SC and ßSC) and virtual-bond ({theta}) and dihedral ({gamma}) angles can vary. The energy of the virtual-bond chain is expressed by the equation:

The term USCiSCj represents the mean free energy of the hydrophobic (hydrophilic) interactions between the side chains, which implicitly contains the contributions from the interactions of the side chain with the solvent (potential of mean force). The term USCipj denotes the excluded-volume potential of the side chain–peptide group interactions. The peptide group interaction potential (Upipj) accounts mainly for the electrostatic interactions (i.e. the tendency to form backbone hydrogen bonds) between peptide groups pi and pj. Utor, Utord, Ub and Urot represent the energies of virtual-dihedral angle torsions, double torsions, virtual-bond angle bending and side-chain rotamers, respectively; these terms account for the local propensities of the polypeptide chain. Details of the parameterization of all of these terms are provided in earlier publications (Liwo et al., 1997aGo,b). Finally, the terms Umcorr, m = 1, 2, ..., Ncorr, are the correlation or multibody contributions from a cumulant expansion (Liwo et al., 2001Go, 2003Go) of the restricted free energy (RFE) and the ws are the weights of the energy terms. The multibody terms are indispensable for reproduction of regular {alpha}-helical and ß-sheet structures. The UNRES force field has been derived as an RFE function of an all-atom polypeptide chain plus the surrounding solvent, where the all-atom energy function is averaged over the degrees of freedom that are lost when passing from the all-atom to the simplified system. This approach enabled us to derive the Umcorr, m = 1, 2, ..., Ncorr multibody terms by a generalized cumulant expansion of the RFE developed by Kubo (1962Go). The internal parameters of the individual Us were derived by fitting the analytical expressions to the RFE surfaces of model systems (Liwo et al., 2001Go) or by fitting the calculated distribution functions to those determined from the PDB (Liwo et al., 1997bGo). The ws (the weights of the energy terms), the internal parameters of the energy Umcorr terms and the mean free energies of side-chain interactions of the USCiSCj energy term were optimized by a hierarchical design of the potential-energy landscape (Liwo et al., 2002Go).

The optimization method assumes a hierarchical structure of the energy landscape, which means that the energy decreases as the number of native-like elements in a structure increases, being lowest for structures from the native family and highest for structures with no native-like element. A level of the hierarchy is defined as a family of structures with the same number of native-like elements (or degree of native likeness). Optimization of a potential-energy function is aimed at achieving such a hierarchical structure of the energy landscape by forcing appropriate free-energy gaps between hierarchy levels to place their energies in ascending order from the native to the most unfolded structure. This procedure is different from the method used earlier, in which the energy gap and/or the Z score between the native structure and all non-native structures were maximized, regardless of the degree of native-likeness of the non-native structures (Liwo et al., 1997bGo; Lee et al., 2001Go; Pillardy et al., 2001bGo). 1IGD, an ({alpha}+ß)-type protein, was used as a training protein for optimization of the internal parameters of the Umcorr energy terms (A.Liwo et al., unpublished data), while the ws and the internal parameters of USCiSCj were optimized by using a set of four proteins (PDB codes 1E0G, 1E0L, 1GAB and 1IGD) (S.Oldziej et al., unpublished data).

The UNRES force field is able to predict the structures of proteins containing both {alpha}-helical and ß-sheet structures with a reasonable degree of accuracy, as assessed by tests on model proteins (Lee et al., 1999Go; Liwo et al., 1999bGo; Pillardy et al., 2001aGo) and also in the CASP3 (Lee et al., 1999, 2000Go; Orengo et al., 1999Go), CASP4 (Pillardy et al., 2001aGo) and CASP5 (Czaplewski et al., 2002Go) blind prediction experiments.

In order to describe the energetics of disulfide bonds, for the pair of half-cystines that forms a disulfide bond, we replace the USCiSCj energy term of Equation 1 by the following function:

It should be noted that dCysiCysj is the distance between the centers of cysteine side chains and not the distance between the sulfur atoms of the bond. E°Cys–Cys = –5.5 kcal/mol is the energy of formation of a non-strained disulfide bond from two half-cystine residues. This energy has been estimated on the basis of the energy of formation of a single disulfide bond in proteins that has been measured experimentally to be –3.5 kcal/mol (Doig and Williams, 1991Go) and the energy of non-bonded interactions between cysteine side chains estimated from the Miyazawa–Jernigan cysteine–cysteine contact energy (Miyazawa and Jernigan, 1985Go) on the basis of Equation 3 of Liwo et al. (1993Go) to be –2.0 kcal/mol. The values of d°Cys–Cys = 4.2 Å and kCys–Cys = 6.6 kcal/(mol·Å2) were estimated on the basis of the average distance between cysteine side-chain centroids in disulfide bonds calculated from ECEPP/3 geometry (Némethy et al., 1992Go) and the ECEPP/3 torsional constants of the Cß–S{gamma}–S{gamma}–Cß dihedral angle (Némethy et al., 1992Go).

The conformational space annealing method

CSA (Lee et al., 1997Go, 1999Go, 2000Go; Czaplewski et al., 2003Go) is a hybrid method which combines genetic algorithms, essential aspects of the build-up method and a local gradient-based minimization. The method is based on the idea of conformational space annealing: in the early stages, it enforces a broad conformational search and then gradually focuses the search into smaller regions with low energy. The CSA searching method allows one to focus on many different groups of low-energy protein structures, one of which is presumably the native structure.

The CSA method begins with a randomly-generated population of conformations which are energy minimized to generate the first bank of conformations. From the initial population, a number of conformations (called seeds) are selected as parents for the trial population. These ‘seed’ conformations are altered in a non-random fashion to create new trial conformations. As in any genetic algorithm, the trial population is generated by the use of genetic operators: mutations and crossovers. Attention is paid to ensure that all trial conformations are significantly different from each other and from the parent conformations. After generation, all trial conformations are energy minimized.

The next step of the CSA algorithm is the update of the current population (the bank) without increasing its size. Each trial conformation is compared with each existing conformation of the bank. If the trial conformation is similar to an existing conformation of the bank, only the lower energy conformation of these two is preserved. If the trial conformation is not similar to any existing conformation in the bank, it represents a new distinct region of conformational space. Then it replaces the highest energy conformation in the bank, if its energy is lower than the highest energy in the bank, otherwise it is discarded. The distance between conformations i and j is defined as the differences of their virtual-bond angles and virtual-bond dihedral angles (Equation 9 of Lee et al., 2000Go). If the distance, Dij, is less than or equal to some predefined cutoff value, Dcut, conformations i and j are considered similar, otherwise they are considered different. CSA achieves its efficiency by beginning with a large value of Dcut essentially to search all possible structures and then gradually reduces (‘anneals’) Dcut by reducing the minimum distance between the conformations of the bank and focusing the search in low-energy regions of conformational space. After updating the current population, the seed conformations are selected from the set of conformations not selected as seeds previously.

Introducing new operators for dynamic disulfide-bond formation and breaking into CSA

The CSA run with dynamic disulfide-bond arrangement allows for changing the positions of disulfide bonds. The only information supplied to the procedure are the positions of cysteines, but the links between them are unknown. In other words, any two cysteines are allowed to be in a bonded state.

The following new genetic operators have been introduced into CSA to treat the formation and breaking of disulfide links; these processes are well known to influence folding pathways considerably (Staley and Kim, 1992Go; Weissman and Kim, 1992Go).

1. Formation of new bonds. All ‘seed’ conformations are analyzed for the presence of non-bonded cysteines, whose side-chain centers are closer than 7 Å and the two specified cysteines are more than l residues apart in the amino acid sequence. In this work l = 1, unless stated otherwise. For each seed and each pair satisfying this criterion, a trial conformation is generated and, for the pair of cysteines designated as bonded, the hydrophobic-interaction potential USCiSCj is replaced by the disulfide-bond potential given by Equation 2.

2. Breaking of existing bonds. For each seed conformation with disulfide-bonded cysteine pairs, a pair (a disulfide bond) is selected at random. Then a new trial conformation is generated in which the selected bond is broken. Consequently, for the selected pair of cysteines, the disulfide-bond energy given by Equation 2 is replaced by the respective hydrophobic-interaction energy USCiSCj (Equation 1).

3. Exchange of links between disulfide bonds and free cysteines. If a free cysteine (of number i in the sequence) in a ‘seed’ conformation is found close to a disulfide bond (formed by cysteines j and k), i.e. if its distance from Cysj or Cysk is less than 7 Å, the trial conformations with the links Cysi–Cysj and Cysi–Cysk, formed instead of Cysj–Cysk, are generated. An additional criterion on the distance in the amino acid sequence between bonded cysteines is applied as in point 1.

4. Exchange of links between pairs of disulfide bonds. If a disulfide bond Cysi–Cysj is found close to disulfide bond Cysk–Cysl in a ‘seed’ conformation (i.e. if one of the distances Cysi...Cysk, Cysi...Cysl, Cysj...Cysk or Cysj...Cysl is less than 7 Å), trial conformations with the exchange of the current disulfide links into two alternative possibilities (Cysi–Cysk, Cysj–Cysl or Cysi–Cysl, Cysj–Cysk) are generated. An additional criterion on the distance in the amino acid sequence between bonded cysteines is applied as in point 1.

Trial conformations generated with existing CSA operators inherit the arrangement of disulfide bonds from the parent ‘seed’ conformation but only if the distance between the bonded cysteine side-chain centroids is smaller than 7 Å after the mutation or crossover operation.

The CSA run with variable disulfide-bond arrangement can start with no disulfide bonds in the randomly generated first bank or any fixed disulfide-bond arrangement for the first randomly generated conformations. In this paper, only runs with no disulfide bonds and runs with native disulfide-bond arrangement in the first bank were used in CSA with variable disulfide-bond arrangement.

The CSA simulation with fixed disulfide-bond arrangement assumes that all the pairs of cysteines between which the bonds are to be formed have been specified. Each of these pairs is in a ‘bonded’ (‘oxidized’) state and the energy of interactions between the cysteine residues is expressed by Equation 2. New operators for dynamic disulfide-bond formation and breaking are not used in the CSA runs with fixed disulfide-bond arrangement.

For each protein, four types of simulations were carried out (i) with dynamic disulfide-bond arrangement starting with a random bank of conformations with no disulfides; (ii) with fixed (native) disulfide-bond arrangement; (iii) with dynamic disulfide-bond arrangement starting with a random bank of conformations with native disulfide-bond arrangement; and (iv) assuming that no disulfide bonds can be formed. For smaller proteins (1EI0, 1ED0), 70 000 minimizations per CSA run were used whereas for larger proteins (1NKL, 1L1I), more than one run of each kind with around 100 000 minimizations was carried out.


    Results
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Energies, r.m.s.d. values from the experimental structures and the numbers of native disulfide bonds for all types of CSA runs for the proteins studied are summarized in Table I, while representative structures are shown in Figures 15.


View this table:
[in this window]
[in a new window]
 
Table I. Energies, r.m.s.ds with respect to native structure and number of disulfide bonds for representative structures found in the CSA simulations
 


View larger version (29K):
[in this window]
[in a new window]
 
Fig. 1. Native structure of 1EI0 (A); the lowest energy structure from the run with fixed (native) disulfide-bond arrangement, an energy 8.4 kcal/mol higher than the GM and an r.m.s.d. of 3.9 Å with respect to native (B); the closest to native structure from the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with no disulfides, an energy 13.7 kcal/mol higher than the GM and an r.m.s.d. of 2.2 Å (C); the structure from the run with dynamic disulfide-bond arrangement, starting with a random bank of conformations with no disulfides, with one native disulfide bond formed, an energy 2.9 kcal/mol higher than the GM and an r.m.s.d. of 3.4 Å (D); the structure with both native disulfide bonds formed in the run with a dynamic disulfide-bond arrangement starting with a random bank of conformations with the native disulfide-bond arrangement, an energy 36 kcal/mol higher than the GM and an r.m.s.d. of 4.4 Å (E); the GM is a straight {alpha}-helix (F).

 


View larger version (29K):
[in this window]
[in a new window]
 
Fig. 5. Native structure of 1ED0 (A) and the lowest energy, GM, non-native structure found in the run assuming that no disulfide bonds exist (B); the structure with the correct fold from the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement, an energy 14.3 kcal/mol higher than the GM and an r.m.s.d. of 4.8 Å with respect to native (C); the native-like structure from the run with the fixed (native) disulfide-bond arrangement, an energy of 17.0 kcal/mol higher than GM and an r.m.s.d. of 4.9 Å (D).

 
Modifications of the UNRES force field and of the CSA global optimization procedure which allow formation of disulfide bonds were first tested on a small {alpha}-helical protein with a helix–turn–helix fold and two disulfides linking both helices together, identified in the PDB as 1EI0 (Barthe et al., 2000Go) (see Figure 1A). It should be stressed that 1EI0 was not used in the force-field optimization. The CSA global optimization runs, assuming that no disulfide bonds can be formed, led to a global-minimum (GM) structure in the form of a long {alpha}-helix with an r.m.s.d. of 13.6 Å from the native structure (see Figure 1F) and also to a structure with correct fold (not shown), with an r.m.s.d. for the C{alpha} atoms of 4.1 Å from the average NMR structure in the PDB and an energy only 0.7 kcal/mol higher than the GM. However, in the native-like structure found in this 4.1 Å r.m.s.d. run, the distances between cysteine centroids are fairly large: 11.2 Å for Cys3–Cys34 and 12.6 Å for Cys13–Cys24.

A CSA run with a fixed (native) disulfide-bond arrangement produced only native-like structures; the lowest energy was 8.4 kcal/mol higher than the GM from the previous run and the lowest energy structure has an r.m.s.d. of 3.9 Å (see Figure 1B) with both native disulfide bonds.

A CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with no disulfides, found the same GM as the run in which no disulfides were allowed (Figure 1F). No low-energy structures with both native disulfides were present in the final population. A large number of native-like structures were found, but with only one native disulfide bond. The structure closest to native in this run has an r.m.s.d. of 2.2 Å and an energy of 13.7 kcal/mol higher than the GM and only the Cys13–Cys24 disulfide bond was formed (see Figure1C). In the same run, structures with the correct fold with only the Cys3–Cys34 disulfide bond formed were found; the structure with an energy of 2.9 kcal/mol higher than the GM and an r.m.s.d. of 3.4 Å is shown in Figure 1D.

The CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement, gave similar results; the only difference is that a high-energy structure (36 kcal/mol higher than the GM) with both native disulfide bonds formed and an r.m.s.d. of 4.4 Å was present in the final population (see Figure 1E).

To check the influence of the force field used in the CSA runs, which allow dynamic formation of disulfide bonds, we re-optimized the UNRES force field by including the 1EI0 protein in the set of training proteins (PDB codes 1E0G, 1E0L, 1GAB, 1IGD and 1EI0). The disulfide-bond energy parameters of Equation 2 were not optimized; only the internal parameters of USCiSCj and the weights of the energy terms had been changed. In the CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement, the GM for 1EI0 was found with the new force field; it had the correct fold and one disulfide (Cys3–Cys34) formed and an r.m.s.d. of 3.5 Å with respect to the native (see Figure 2B). The lowest energy structure obtained by CSA, assuming that no disulfide bonds can be formed, has an r.m.s.d. of 3.6 Å and an energy 3.8 kcal/mol higher than the GM and inter-cysteine distances of 5.6 and 8.8 Å for Cys3–Cys34 and Cys13–Cys24, respectively. In the CSA run with a dynamic disulfide-bond arrangement starting with a random bank of conformations with no disulfides, the native-like structure with both disulfides was found as the structure with an energy 17.1 kcal/mol higher than the GM and an r.m.s.d. of 2.5 Å with respect to native (see Figure 2C). The lowest energy structure in the CSA run with fixed (native) disulfide-bond arrangement has an r.m.s.d. of 3.6 Å and was 10.5 kcal/mol higher than the GM (see Figure 2D). Even the force field optimized for the 1EI0 protein does not lead to a structure of 1EI0 with two disulfides formed as one of very low energy; the reason is the wrong inter-helical angle in the low-energy structures obtained with this version of the UNRES force field.



View larger version (26K):
[in this window]
[in a new window]
 
Fig. 2. Native structure of 1EI0 (A) and three structures from CSA runs with the re-optimized UNRES force field (including the 1EI0 protein in the set of training proteins): the lowest energy structure, the GM, with an r.m.s.d. of 3.5 Å with respect to native found in the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement (B); the structure with both native disulfides formed in the run with a dynamic disulfide-bond arrangement,starting with a random bank of conformations with no disulfides, an energy 17.1 kcal/mol higher than the GM and an r.m.s.d. of 2.5 Å (C); the structure with the lowest energy from the run with a fixed (native) disulfide-bond arrangement, an energy 10.5 kcal/mol higher than GM and an r.m.s.d. of 3.6 Å (D).

 
The next test case was the four {alpha}-helix bundle 1NKL protein with three disulfides (Liepinsh et al., 1997Go) (see Figure 3A). Only the original force field, optimized on four training proteins (PDB codes 1E0G, 1E0L, 1GAB, 1IGD), was used. Assuming that no disulfide bonds can be formed, all CSA runs found structures with the correct fold but they were 49.6 kcal/mol higher in energy than the non-native three-helix cyclic-like structure which had the lowest (9.4 kcal/mol) energy in these runs. The GM, which has the correct fold, was found by the CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement (Figure 3B). It has an r.m.s.d. of 5.4 Å with respect to the native structure and has 9.4 kcal/mol lower energy than the low-energy non-native structures found in the former CSA runs without disulfides. Only one disulfide bond is present and it is non-native. The lowest energy structure in the CSA run with a fixed (native) disulfide-bond arrangement has an r.m.s.d. of 5.1 Å and was 17.5 kcal/mol higher than the GM (see Figure 3C).



View larger version (40K):
[in this window]
[in a new window]
 
Fig. 3. Native structure of 1NKL (A); the lowest energy structure, GM, with an r.m.s.d. of 5.4 Å with respect to native and one non-native disulfide formed, found in the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement (B); the lowest energy from a run with a fixed (native) disulfide-bond arrangement, an energy 17.5 kcal/mol higher than the GM and an r.m.s.d. of 5.1 Å (C); the structure from the CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with no disulfides with one native disulfide formed, an energy of 40.6 kcal/mol higher than GM and an r.m.s.d. of 4.9Å (D).

 
An example of a structure from the CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with no disulfides, is shown in Figure 3D. This structure is 40.6 kcal/mol higher in energy than the GM and has an r.m.s.d. of 4.9 Å. Only one native disulfide bond, Cys35–Cys45, forms easily.

It should be noted that, although the lowest energy structure of 1NKL does not have the native disulfide-bond arrangement, the very possibility of disulfide-bond formation resulted in location of the structure with the correct fold as the lowest energy structure, as opposed to runs without the possibility of disulfide-bond formation. From Figure 3B, it can be seen that, although the six cysteine residues do not form the native bonds, they are qualitatively positioned as in the native structure. This suggests that dynamic formation, breaking and rearrangement of disulfide bonds guided the CSA search to find the correct fold, although the disulfide-bond potential introduced in this work cannot predict the correct disulfide-bond arrangement. An analysis of the history of the CSA search indicated that a single native disulfide bond was formed in low-energy structures during the course of the run (data not shown), which explains why it helped to find the native fold. The population of intermediate structures contained all native disulfide bonds, but only one was present in a particular structure. However, because of the imperfection of the force field, after the native-like fold was reached a lower energy was achieved with non-native disulfide bonds.

The ß-helical protein with eight disulfides, identified in the PDB with code 1L1I, has been chosen to test the algorithm on proteins with many disulfide bonds (Daley et al., 2002Go) (see Figure 4A). The original force field, optimized on four training proteins (PDB codes 1E0G, 1E0L, 1GAB, 1IGD), was used. It should be stressed that no similar fold is present in any of the training proteins. Assuming that no disulfide bonds can be formed, all CSA runs found only non-native structures, the lowest energy structure of these being a flat six-stranded ß-sheet built out of consecutive hairpins (not shown). However, the GM, which has the correct fold, was found by the CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement (see Figure 4C). It has an r.m.s.d. of 6.1 Å with respect to native and has 32.9 kcal/mol lower energy than the low-energy non-native structures found in the former CSA runs without disulfides. Six disulfide bonds are present but all of them are non-native. It should be noted that the energy gain due to the formation of a single disulfide bond is equal to 3.5 kcal/mol (see the section The UNRES force field, in Materials and methods). Therefore, the energy gain is greater than that due to the formation of six bonds. The CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with a native disulfide-bond arrangement, was repeated with an additional restriction of at least three residues in the loop between cysteines forming a disulfide bond. Consequently, the Cys18–Cys21 bond was forbidden to form. The lowest energy structure found by the new run has the correct fold with an r.m.s.d. of 7.7 Å, four native disulfides and an energy 46.8 kcal/mol higher than the GM (see Figure 4D). The lowest energy structure in the CSA run with the fixed (native) disulfide-bond arrangement has an r.m.s.d. of 5.7 Å and is 25.7 kcal/mol higher in energy than the GM (see Figure 4B).



View larger version (34K):
[in this window]
[in a new window]
 
Fig. 4. Native structure of 1LI1 (A); the lowest energy from the run with fixed (native) disulfide-bond arrangement, an energy 25.7 kcal/mol higher than the GM and an r.m.s.d. of 5.7 Å with respect to the native structure (B); the lowest energy structure, GM, from the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement, with an r.m.s.d. of 6.1 Å and six non-native disulfide bonds (C); the low-energy structure with four native disulfides found in the run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement with the additional restriction of at least three residues in the loop between cysteines forming a disulfide bond, an energy 46.8 kcal/mol higher than GM and an r.m.s.d. of 7.7 Å (D).

 
As in the case of 1NKL, although the lowest energy structure does not have a native disulfide-bond arrangement, the structure has a largely correct fold and the cysteines are arranged as in the native structure. Unlike 1NKL, where the formation of disulfide bonds only guided the search, but the lowest energy structure has only one non-native disulfide bond, here the native ß-helical structure is entirely stabilized by disulfide bonds, although they are not native.

The next test case was a small {alpha}/ß protein identified in the PDB with code 1ED0 (Romagnoli et al., 2000Go) (see Figure 5A). The original force field, optimized on four training proteins (PDB codes 1E0G, 1E0L, 1GAB, 1IGD), was used. Assuming that no disulfide bonds can be formed, the CSA run found only non-native structures. One of them is the GM which is a flat five-stranded ß-sheet built out of consecutive hairpins with a sixth strand packed to the surface of this ß-sheet (see Figure 5B). The structures with the correct fold were found by CSA runs with a dynamic disulfide-bond arrangement. A representative structure found by the CSA run with a dynamic disulfide-bond arrangement, starting with a random bank of conformations with the native disulfide-bond arrangement, is shown in Figure 5C. It has an r.m.s.d. of 4.8 Å with respect to native and is 14.3 kcal/mol higher in energy than the low-energy non-native structures found in the former CSA runs without disulfides. Only one disulfide bond is present and it is non-native. The lowest energy structure in the CSA run with a fixed (native) disulfide-bond arrangement has an r.m.s.d. of 6.8 Å and an energy 13.2 kcal/mol higher than the GM and bad packing of the secondary-structure elements. The closest-to-native structure from this run has an r.m.s.d. of 4.9 Å and an energy 17.0 kcal/mol higher than the non-native GM (see Figure 5D).


    Discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
The results presented in this paper show that it is possible to extend the united-residue force-field and search procedure (CSA) developed earlier to study proteins containing disulfides. To the best of our knowledge, this is the first algorithm for energy-based prediction of the structure of disulfide-bonded proteins without any assumption as to the positions of native disulfides or human intervention. The earlier work of Watanabe et al. (1991Go) concerned packing of fixed secondary-structure elements with a limited search of loop conformations. Moreover, formation of disulfide bonds was not guided by energy as in our approach, but the decision was arbitrarily based on a geometrical graphic representation, as described in the Introduction.

For 1NKL and 1L1I, the lowest energy structures obtained with inclusion of dynamic disulfide-bond formation had the correct fold, as opposed to runs in which the formation of disulfides was ignored. These structures did not have all native disulfide bonds; however, the positions of the cysteine residues were qualitatively the same as in the native structure. This suggests that the role of formation of disulfide bonds as a factor stabilizing the native structure or guiding the folding process towards the native structure is reasonably well accounted for by the modified CSA search procedure proposed in this work, even though the disulfide-bond potential is not perfect. As indicated in the Results section, the search is guided towards the native structure probably because some of the native disulfide bonds are formed during the run, although they can be broken or rearranged in the final structure. It must be stressed that the examples studied in this work show that it is not possible to use only the hydrophobic side-chain potential in united-residue simulations on proteins with disulfide bonds. Only in the case of 1EI0 is the structure with the correct fold low in energy without including disulfide-bond formation. If the formation of disulfide bonds is not assumed, the simulation results not only in incorrect packing, but also in incorrect secondary structures.

Based on the results presented in this paper, we propose the following future modifications. The energy term describing the formation of a disulfide bond should include an angular dependence in addition to the distance dependence used in the current version of the algorithm. The lack of an angular dependence led to overpopulation of the structures in which the cysteines are very close in the sequence (see Results). The formation of short-range disulfides is easy from an entropic point of view, which is represented by a distance dependence, but bonds (like disulfides) leading to short-range loops are more restricted by the stiffness of the peptide backbone and the chemical nature of the disulfide bond itself and their formation can be described by an angular and a distance dependence.

The case of the 1L1I protein shows the deficiency in the current mutation operators for formation/breaking disulfide bonds in CSA. Generally, to increase the probability of formation of a disulfide bond, we plan to introduce more complicated genetic operators. Based on our experience, the following new genetic operators are necessary: (i) copy the disulfide bond present in one conformation to another one, (ii) global perturbation of a conformation without disrupting the disulfide bond(bonds) that it already contains, (iii) small local perturbation of the backbone and side chains around cysteine moieties which are geometrically close to each other but still too far to be considered as bonded. Additionally, the possibility of formation of a disulfide bond between two free cysteines should be based not only on the distance between their centroids, but also on the orientation of the respective C{alpha}–SC vectors. All these changes in the search procedure should speed up the search and increase the probability of generating structures with correct disulfide bonds.


    Acknowledgements
 
We thank Jaroslaw Pillardy, Daniel Ripoll and Jorge Vila for helpful comments on this paper. This work was supported by grants from the National Institutes of Health (GM-14312), the National Science Foundation (MCB00-03722), the Fogarty Foundation (TW1064) and grant BW/8000-5-0234-3 from the Polish State Committee for Scientific Research (KBN). Support was also received from the National Foundation for Cancer Research. This research was conducted by using the resources of (a) the National Science Foundation Terascale Computing System at the Pittsburgh Supercomputer Center, (b) our 392-processor Beowulf cluster at the Baker Laboratory of Chemistry and Chemical Biology, Cornell University and (c) our 45-processor Beowulf cluster at the Faculty of Chemistry, University of Gdansk.


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Abkevich,V.I. and Shakhnovich,E.I. (2000) J. Mol. Biol., 300, 975–985.[CrossRef][ISI][Medline]

Barthe,P., Rochette,S., Vita,C. and Roumestand,C. (2000) Protein Sci., 9, 942–955.[Abstract]

Betz,S.F. (1993) Protein Sci., 2, 1551–1558.[Abstract/Free Full Text]

Burton,R.E., Hunt,J.A., Fierke,C.A. and Oas,T.G. (2000) Protein Sci., 9, 776–785.[Abstract]

Czaplewski,C. et al. (2002) In Fifth Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction. http://predictioncenter.llnl.gov/casp5/Casp5.html

Czaplewski,C., Liwo,A., Pillardy,J., Oldziej,S. and Scheraga,H.A. (2003) Polymer, in press.

Daley,M.E., Spyracopoulos,L., Jia,Z., Davies,P.L. and Sykes,B.D. (2002) Biochemistry, 41, 5515–5525.[CrossRef][ISI][Medline]

Dani,V.S., Ramakrishnan,C. and Varadarajan,R. (2003) Protein Eng., 16, 187–193.[Abstract/Free Full Text]

Doig,A.J. and Williams,D.H. (1991) J. Mol. Biol., 217, 389–398.[ISI][Medline]

Dombkowski,A.A. and Crippen,G.M. (2000) Protein Eng., 13, 679–689.[Abstract/Free Full Text]

Fiser,A. and Simon,I. (2000) Bioinformatics, 16, 251–256.[Abstract]

Fiser,A., Cserzö,M., Tüdös,E. and Simon,I. (1992) FEBS Lett., 302, 117–120.[CrossRef][ISI][Medline]

Huang,E.S., Samudrala,R. and Ponder,J.W. (1999) J. Mol. Biol., 290, 267–281.[CrossRef][ISI][Medline]

Kobayashi,Y., Sasabe,H., Akutsu,T. and Saitô,N. (1992) Biophys. Chem., 44, 113–127.[CrossRef][ISI][Medline]

Kubo,R. (1962) J. Phys. Soc. Jpn., 17, 1100–1120.[ISI]

Lee,J., Scheraga,H.A. and Rackovsky,S. (1997) J. Comput. Chem., 18, 1222–1232.[CrossRef][ISI]

Lee,J., Liwo,A. and Scheraga,H.A. (1999) Proc. Natl Acad. Sci. USA, 96, 2025–2030.[Abstract/Free Full Text]

Lee,J., Liwo,A., Ripoll,D.R., Pillardy,J., Saunders,J.A., Gibson,K.D. and Scheraga,H.A. (2000) Int. J. Quantum Chem., 71, 90–117.[CrossRef]

Lee,J., Ripoll,D.R., Czaplewski,C., Pillardy,J., Wedemeyer,W.J. and Scheraga,H.A. (2001) J. Phys. Chem. B, 105, 7291–7298.[CrossRef][ISI]

Lester,C.C., Xu,X., Laity,J.H., Shimotakahara,S. and Scheraga,H.A. (1997) Biochemistry, 36, 13068–13076.[CrossRef][ISI][Medline]

Liepinsh,E., Andersson,M., Ruysschaert,J.M. and Otting,G. (1997) Nat. Struct. Biol., 4, 793–795.[ISI][Medline]

Liwo,A., Pincus,M.R., Wawak,R.J., Rackovsky,S. and Scheraga,H.A. (1993) Protein Sci., 2, 1715–1731.[Abstract/Free Full Text]

Liwo,A., Oldziej,S., Pincus,M.R., Wawak,R.J., Rackovsky,S. and Scheraga,H.A. (1997a) J. Comput. Chem., 18, 849–873.[CrossRef][ISI]

Liwo,A., Pincus,M.R., Wawak,R.J., Rackovsky,S., Oldziej,S. and Scheraga,H.A. (1997b) J. Comput. Chem., 18, 874–887.[CrossRef][ISI]

Liwo,A., Lee,J., Ripoll,D.R., Pillardy,J. and Scheraga,H.A. (1999a) Proc. Natl Acad. Sci. USA, 96, 5482–5485.[Abstract/Free Full Text]

Liwo,A., Pillardy,J., Kazmierkiewicz,R., Wawak,R.J., Groth,M., Czaplewski,C., Oldziej,S. and Scheraga,H.A. (1999b) Theor. Chem. Acc., 101, 16–20.[CrossRef][ISI]

Liwo,A., Czaplewski,C., Pillardy,J. and Scheraga,H.A. (2001) J. Chem. Phys., 115, 2323–2347.[CrossRef][ISI]

Liwo,A., Arlukowicz,P., Czaplewski,C., Oldziej,S., Pillardy,J. and Scheraga,H.A. (2002) Proc. Natl Acad. Sci. USA, 99, 1937–1942.[Abstract/Free Full Text]

Liwo,A., Oldziej,S., Czaplewski,C., Kozlowska,U. and Scheraga,H.A. (2003) J. Phys. Chem. B, in press.

Martelli,P.L., Fariselli,P., Malaguti,L. and Casadio,R. (2002) Protein Eng., 15, 951–953.[Abstract/Free Full Text]

Miyazawa,S. and Jernigan,R.L. (1985) Macromolecules, 18, 534–552.[ISI]

Mucchielli-Giorgi,M.H., Hazout,S. and Tuffery,P. (2002) Proteins: Struct. Funct. Genet., 46, 243–249.[CrossRef][ISI][Medline]

Muskal,S.M., Holbrook,S.R. and Kim,S.H. (1990) Protein Eng., 3, 667–672.[Abstract]

Némethy,G., Gibson,K.D., Palmer,K.A., Yoon,C.N., Paterlini,G., Zagari,A., Rumsey,S. and Scheraga,H.A. (1992) J. Phys. Chem., 96, 6472–6484.[ISI]

Orengo,C.A., Bray,J.E., Hubbard,T., LoConte,L. and Sillitoe,I. (1999) Proteins: Struct. Funct. Genet., Suppl. 3, 149–170.[CrossRef][Medline]

Petersen,M.T.N., Jonson,P.H. and Petersen,S.B. (1999) Protein Eng., 12, 535–548.[Abstract/Free Full Text]

Pillardy,J. et al. (2001a) Proc. Natl Acad. Sci. USA, 98, 2329–2333.[Abstract/Free Full Text]

Pillardy,J., Czaplewski,C., Liwo,A., Wedemeyer,W.J., Lee,J., Ripoll,D.R., Arlukowicz,P., Oldziej,S., Arnautova,Y.A. and Scheraga,H.A. (2001b) J. Phys. Chem. B, 105, 7299–7311.[CrossRef][ISI]

Rey,A. and Skolnick,J. (1994) J. Chem. Phys., 100, 2267–2276.[CrossRef][ISI]

Romagnoli,S., Ugolini,R., Fogolari,F., Schaller,G., Urech,K., Giannattasio,M., Ragona,L. and Molinari,H. (2000) Biochem. J., 350, 569–577.[CrossRef][ISI][Medline]

Skolnick,J., Kolinski,A. and Ortiz,A.R. (1997) J. Mol. Biol., 265, 217–241.[CrossRef][ISI][Medline]

Staley,J.P. and Kim,P.S. (1992) Proc. Natl Acad. Sci. USA, 89, 1519–1523.[Abstract]

Watanabe,K., Nakamura,A., Fukuda,Y. and Saitô,N. (1991) Biophys. Chem., 40, 293–301.[CrossRef][ISI][Medline]

Wedemeyer,W.J., Welker,E., Narayan,M. and Scheraga,H.A. (2000) Biochemistry, 39, 4207–4216.[CrossRef][ISI][Medline]

Weissman,J.S. and Kim,P.S. (1992) Science, 256, 112–114.[ISI]

Welker,E., Wedemeyer,W.J., Narayan,M. and Scheraga,H.A. (2001) Biochemistry, 40, 9059–9064.[CrossRef][ISI][Medline]

Zhou,N.E., Kay,C.M. and Hodges,R.S. (1993) Biochemistry, 32, 3178–3187.[ISI][Medline]

Accepted October 16, 2003 Edited by Valerie Daggett