Disulfide recognition in an optimized threading potential

Alan A. Dombkowski and Gordon M. Crippen1

College of Pharmacy, University of Michigan, Ann Arbor, MI 48109, USA


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
An energy potential is constructed and trained to succeed in fold recognition for the general population of proteins as well as an important class which has previously been problematic: small, disulfide-bearing proteins. The potential is modeled on solvation, with the energy a function of side chain burial and the number of disulfide bonds. An accurate disulfide recognition algorithm identifies cysteine pairs which have the appropriate orientation to form a disulfide bridge. The potential has 22 energy parameters which are optimized so the Protein Data Bank (PDB) structure for each sequence in a training set is the lowest in energy out of thousands of alternative structures. One parameter per amino acid type reflects burial preference and a single parameter is used in an overpacking term. Additionally, one optimized parameter provides a favorable contribution for each disulfide identified in a given protein structure. With little training, the potential is >80% accurate in ungapped threading tests using a variety of proteins. The same level of accuracy is observed in a threading test of small proteins which have disulfide bonds. Importantly, the energy potential is also successful with proteins having uncrosslinked cysteines.

Keywords: contact potential/disulfide/fold recognition/protein structure prediction/threading


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Genome mapping efforts have dramatically increased the availability of protein sequence data, but experimental methods of structure determination have not kept pace. Therefore, computational methods which can accurately predict the correct protein structure for a given amino acid sequence are a tremendous aid in biomedical and agricultural research. Homology modeling is useful for sequences which are similar to a protein of known structure, but structure prediction for sequences having low similarity to known proteins requires methods which are not dependent on sequence alignments. One such method is fold recognition (threading), which takes advantage of the recurrence of folds and structural motifs observed in nature. Using this approach, for a sequence of interest a library of known protein folds is sampled and sequence–structure compatibility is assessed with some type of scoring function. Numerous forms of scoring functions have been tried and come under a variety of names (potential function, contact potential, hydrophobic fitness score, potential of mean force, etc.), yet most share a common element: they are a function of intramolecular contact (Miyazawa and Jernigan, 1985Go; Sippl, 1990Go; Crippen, 1991Go; Huang et al., 1995Go; Huber and Torda, 1998Go). Simplifying even further, the fundamental component of successful potential functions appears to be a dependency on hydrophobic burial, conversely considered as solvation (Huang et al., 1995Go; Thomas and Dill, 1996Go; Jones, 1997Go).

Regardless of functional form, the overall goal in fold recognition is the same: for a given protein sequence the potential function should identify the native (or near-native) structure as unique out of all possible conformations. Huang et al. (1995) demonstrated that a simple hydrophobic scoring function can recognize the correct fold for a large number of proteins. Other more complex potentials have shown similar success (Maiorov and Crippen, 1992Go; Miyazawa and Jernigan, 1996Go; Zhang and Kim, 2000Go). However, one class of proteins which has thwarted fold recognition efforts are the small, disulfide-bearing proteins. Developers of potential functions have noted difficulty with these proteins in structure identification tests (Kocher et al., 1994Go; Huang et al., 1995Go; Park et al., 1997Go; Huber and Torda, 1998Go). Disulfide-bearing proteins often failed with fundamental ungapped threading tests. Disulfide bonds provide stability to the native structure and may compensate for the lack of a notable hydrophobic core in small proteins (Petersen et al., 1999Go; Betz, 1993Go). A strong negative correlation has been observed between hydrophobic content and the number of disulfide bonds (Abkevich and Shakhnovich, 2000Go). Experimental evidence reveals that some proteins will not retain their native structure when disulfide bonds are disrupted (Creighton, 1993Go; Narhi et al., 1993Go; Zhang et al., 1997Go). Since most energy potentials do not explicitly account for post-translational modifications and prosthetic group interactions, it is not surprising that they frequently fail for proteins which depend on such interactions. If disulfide bonds are often required in the absence of a large hydrophobic core, it follows that they should be accounted for in structure prediction methods which are intended for use with small proteins. Given the abundance and significance of small, disulfide-bearing proteins, an appropriate treatment of disulfide bonds in a fold recognition function would be beneficial. Here, we seek to continue the focus on solvation while improving on intramolecular contact definitions and accounting for the contribution of disulfide bonds.

The definition of a contact is crucial in fold recognition methods. Typically, two amino acids of a given protein are considered in contact if their side chains are within a specified distance of each other and an energy value or score is attributed to the contact. Threading requires the use of simplified side chain representations, so contacts defined in this context are a crude approximation of the actual interaction between residues. When searching for the best fold for a given sequence, contiguous segments of known PDB structures (templates) are often used as the structural model. However, the query amino acid sequence will differ from the actual sequence of the template protein, so to allow compatibility the side chains are usually represented as a point in space. Commonly used representations are Cß carbons or centroids which represent the center of mass of side chains. In this manner, any sequence can be folded into any observed structural segment by assigning the identity of the query sequence amino acids to the corresponding side chain representations of the template structure. Threading provides a quick method to sample many possible structures, all having protein-like bond angles and lengths, but the disadvantage is a loss of atomic detail in the side chains.

Often a single contact definition is used for all possible pairs of interacting amino acids. For example, if the Cß for two side chains are within 8 Å, the residues are considered in contact. However, this type of definition is a poor representation in an energy potential modeled on solvation. An intramolecular contact in a solvation potential should be defined as an interaction which excludes solvent from the space between two residues. The difficulty in choosing a contact distance is due to the variation in size and shape of the 20 amino acids. The Cß atoms of two large aromatic residues may be 10 Å apart and water could be excluded from the space between them owing to the volume occupied by the side chains. The same distance is not an appropriate contact definition for two small amino acids such as alanine or glycine because a water molecule could easily pass between the two side chains. A solvation potential also requires a better estimation of the extent of contact since a single definition would give the same contact value to either a glycine or a tyrosine that is within the cut-off distance of a central residue. Obviously, a tyrosine contact would provide much greater shielding from solvent. Since intramolecular contacts (hydrophobic interactions in particular) are critical in fold recognition, we are motivated to improve the treatment of contacts in simplified protein models.

We use contact curves which account for the residue types involved and provide an estimate of the number of atomic interactions between side chains as a function of the distance between Cß representations. A contact curve for each of the 210 possible amino acid pairs represents the average number of atomic contacts observed in PDB structures as a function of Cß–Cß distance for the given amino acid pair. This allows an estimate of the extent of contact between two residues in simplified structural models. The calculated burial of a residue of a given sequence folded into a specified conformation is simply the sum of all estimated contacts for that residue and the energy associated with the residue is defined as the product of the total contact value and an energy parameter for the given amino acid type. One energy parameter per amino acid type is obtained by fitting the solvation model to observed PDB structures. The parameters are optimized to ensure that the native structure for each protein in a training set is the lowest in energy as compared to thousands of alternative conformations.

To account for the stabilizing effect of disulfide bonds, our potential function includes a favorable energy contribution for each disulfide identified. When assessing sequence–structure compatibility, all pairs of cysteines are analyzed for possible bonding. Using only the C{alpha} and Cß coordinates, a cysteine pair is considered bonded if a disulfide with standard bond lengths and angles can reside between the two residues. The potential function is trained to be accurate in structure identification of disulfide-bearing proteins and those without disulfides. This includes proteins which contain non-bonded cysteines. Thus, the general applicability of the potential function is retained.


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Estimating atomic contacts

To construct a solvation potential that is compatible with simplified protein representations, the first step is to define a contact in a manner consistent with solvation. Using the Colonna-Cesari and Sander model (1990), a contact is defined as an interaction between two atoms of different side chains which excludes water from the region between the atoms. Since the diameter of a water molecule is ~2.8 Å and the van der Waals radius of any given heavy atom (non-hydrogen) is ~1.8 Å, if two atoms are within 6.4 Å of each other we can say that a water molecule cannot pass between them owing to excluded volume. So an atomic contact is defined as two atoms from different side chains that are within 6.4 Å of each other. Using this definition, it would be simple to tabulate the total number of atomic contacts for any given residue in an all-atom model of a protein structure. However, the single-atom representation of side chains required in threading precludes performing this simple calculation. Therefore, we derive contact curves which provide a reasonable estimate of the average number of atomic contacts between two residues, given only their Cß locations and amino acid types. The curves are obtained from a survey of PDB structures using all-atom representations for the side chains. We then use the curves to estimate contact during threading with single-atom side chain models.

A survey of representative PDB structures provides the data which are then fitted to smooth sigmoidal curves. The survey structures are drawn from the PDB Select which represents protein sequences having no greater than 95% sequence homology (Hobohm et al., 1992Go). The proteins from the PDB Select list of June 1997 were screened to eliminate structures with chain breaks and alternative coordinates; the remaining 1347 structures became the survey set. For each of the 210 possible amino acid pairs, the average number of atomic contacts were tabulated and recorded as a function of Cß–Cß distance. When counting atomic contacts, we consider only heavy atoms (non-hydrogen) and use a smooth sigmoidal curve with a maximum atomic contact value of one when the atoms are <3.6 Å apart and a minimum value of zero when the distance is >6.4 Å. Equation (1) represents the sigmoid used for counting atomic contacts and for fitting the contact curves. For atomic contacts, U is 6.4 Å, L is 3.6 Å, M is 1 and {alpha} is the contact value for atoms which are distance d apart.


(1)

For each amino acid pair the average number of atomic contacts was recorded as a function of Cß–Cß distance and binned in 0.25 Å increments. Fitting a sigmoid curve to each of the 210 sets of data using Equation (1) provided three values: (1) an upper bound U, the Cß–Cß distance above which there is zero average atomic contact; (2) a lower bound L, the Cß–Cß distance below which the average number of contacts observed is at its maximum value; and (3) a maximum average contact value M. Table IGo lists the sigmoid parameters for all 210 amino acid pairs. Figure 1Go shows several examples of contact curves fit to the survey data. Cij is the average number of atomic contacts observed when the Cß atoms of the two residues are a distance dij apart. In general, amino acid pairs having smaller side chains provide a better fit than larger ones. The sigmoid function fit well to the data for most pairs. In some cases, the best fit resulted in a negative value for L. An example is the PHE–LEU pair shown in Figure 1bGo. The L value is –0.43 to allow the best fit and the curve rises above the maximum observed atomic contact value. However, this is not a practical problem since few pairs of residues are observed with a Cß–Cß distance <3 Å and the curves fit the data well in the observed range of dij.


View this table:
[in this window]
[in a new window]
 
Table I. Contact curve parameters
 


View larger version (16K):
[in this window]
[in a new window]
 
Fig. 1. Contact curve examples. Cij is the average number of atomic contacts observed between the specified amino acid pair as a function of the distance between Cß carbons, dij. The data points were obtained from all residue pairs observed in 1347 PDB structures and are arranged in 0.25 Å bins. The curves represent sigmoids fit to the data per Equation (1), with parameters given in Table IGo.

 
The atomic contact data reveals why a single contact definition is not appropriate in the context of solvation. Figure 1bGo shows that at 8 Å there is, on average, little contact between VAL and HIS, yet there is significant contact for the PHE–TRP pair. Also, the maximum value of atomic contact varies greatly among the 210 pairs. The estimate of contact provided by the contact curves gives a much more detailed description of contact between residues than simple contact definitions, yet requires no additional computational expense because the atomic contact survey and curve fitting need only be performed once.

The potential function

Our potential function has three terms representing solvation, overpacking and disulfide bonding. The solvation component of our energy potential is an implicit function of solvation. With the approach of Colonna-Cesari and Sander (1990), a given amino acid type is assumed to have a fixed number of total contacts which are divided among solvent and intramolecular interactions. By counting intramolecular contacts we can determine the extent of solvation. For residue i of a protein folded into a specified conformation, where k is the amino acid type, the solvation energy Ei is defined as the product of the energy parameter Pk(i) for amino acid type k and the solvent contacts of residue i. The energy can be determined implicitly from intramolecular contacts as shown below. By assuming that the fixed number of total contacts, Zk(i), for amino acid type k are partitioned between solvent contacts ns(i) and intramolecular contacts nr(i), the energy contribution of residue i is obtained as follows:


When comparing the energy values of the same protein sequence folded into various conformations, for any residue i the values Pk(i) and Zk(i) are constant, so the first term on the right-hand side may be ignored, giving

Ei = Pk(i)nr(i)

The negative sign is absorbed into the adjustable parameter Pk(i). For residue i the total number of intramolecular contacts nr(i) is estimated by calculating the distance from Cßi to each Cßj in the protein, using the appropriate contact curve to get Cij, then summing to obtain the total number of contacts for residue i. The total energy of a protein sequence in a given conformation is determined by summing over all residues N as follows:


(2)

Equation (2) is the solvation contribution to the potential with 20 adjustable parameters Pk(i), one for each amino acid type. All residue pairs separated by two or more residues in sequence are included in the summation. To prevent the potential from favoring overpacked structures, a steric penalty is included. If the total number of contacts for a residue exceeds a specified value for the given amino acid type, the excess contact value Si is multiplied by a steric penalty parameter Ps. This brings the number of adjustable parameters to 21, including Ps. The solvation potential with the overpacking term is shown in Equation (3). Energy values are in arbitrary units.


(3)

Each amino acid type is assigned a maximum contact value Ck(i),max which is obtained from the distribution of total atomic contacts observed in our survey structures. For each amino acid of type k found in our survey structures the contact curves were used to calculate the total number of atomic contacts. A histogram for each residue type provides a distribution of the number of occurrences of each total contact value observed in the survey. The contact value that is one standard deviation lower than the maximum observed is used as Ck(i),max.

Disulfide recognition

Previous results using an energy potential in the form of Equation (3) suggested that the solvation component alone may not account for the stability of small, disulfide-bearing proteins (Dombkowski and Crippen, 1999Go). While achieving >80% accuracy for proteins larger than 110 residues in an ungapped threading test, the potential performed poorly for smaller proteins, particularly those with disulfide bonds. Others have reported difficulty with this class of proteins. However, it should be noted that pairwise potentials with 210 contact energy values representing the amino acid pairs do implicitly account for disulfide bonds, albeit in a crude manner, since they contain a CYS–CYS contact parameter. Miyazawa and Jernigan (1985, 1996) reported that their contact energies reflect the favorable formation of disulfide bonds. In their potential the formation of CYS–CYS contacts is greatly favored over all CYS–X interactions, where X is any other amino acid type. The solvation component of our potential does not provide special treatment for CYS–CYS contacts.

In our potential, the energy associated with a given residue is a function of the total number of estimated atomic contacts, regardless of the identity of the contacting residues. As illustrated in Figure 1aGo, if we are tabulating the number of contacts for a given cysteine, then another cysteine 3.75 Å away (as measured by Cß–Cß distance) would provide four atomic contacts. The same number of contacts would be attributed to a phenylalanine which is ~5.75 Å away from the central cysteine. Both interactions effectively desolvate the central cysteine equivalently and contribute equally to the overall energy of the cysteine. Additionally, a phenylalanine 3.75 Å away from the central cysteine would provide nearly twice as many atomic contacts as another cysteine at the same distance. Since cysteine contacts are energetically favorable, the interaction with the larger phenylalanine would be preferential over contact with another cysteine at the same Cß–Cß distance. Considering the contact curves and energy parameters, two CYS–PHE pairs are favored over the alternative CYS–CYS and PHE–PHE pairs, assuming the same dij distances. Unlike the Miyazawa and Jernigan potential and similar pairwise potentials, favorable CYS–CYS treatment is not inherent in the solvation component of our potential. Therefore, to account for the stabilizing contribution of disulfides we included a disulfide parameter for cysteine pairs having the appropriate bonding geometry.

The first step in our treatment of disulfides is the identification of all possible disulfide bonds for a target sequence threaded onto a structural template. We use the C{alpha} and Cß coordinates of the template to determine if a disulfide can exist between any pair of cysteines in the threaded sequence. We check all cysteine pairs for geometry which will allow a disulfide bond to form between them. Some cysteines may have multiple pairs which provide the correct bonding geometry. So a matrix of all possible disulfide bonds is processed with a heuristic algorithm to obtain a final list of designated bonds which conform to disulfide characteristics observed in PDB structures, allowing each cysteine to participate in one bond, at most.

Our disulfide recognition algorithm simply requires that a pair of cysteines have C{alpha} and Cß atoms positioned such that a disulfide with standard bond lengths and angles can reside between the two residues and provide appropriate C{alpha}–Cß–S{gamma} bond angles. A similar approach was taken by Hazes and Dijkstra (1988) in a method developed to identify possible mutation sites in proteins where a disulfide could be added. We used the bond lengths and angles of AMBER 95 (Cornell et al., 1995Go), as shown in Table IIGo. These values are in agreement with a recent survey of 351 disulfide bridges observed in PDB structures (Petersen et al.,1999). A small amount of tolerance is allowed in C{alpha}–Cß–S{gamma} angles. Since we are using Cß representations for all side chains, the locations of cysteine S{gamma} atoms are unknown. Therefore, when analyzing a pair of cysteines for bonding we must consider all possible sulfur locations. We fixed the Cß–S{gamma} and S{gamma}–S{gamma} bond lengths along with the Cß–S{gamma}–S{gamma} angles, so each S{gamma} atom is restricted to a ring which is centered on (and perpendicular to) the Cß–Cß axis. Our disulfide model geometry is shown in Figure 2Go. By convention we assign index i to the lower numbered residue. Note that a ring would be traced if either S{gamma} atom were rotated about the Cß–Cß axis at a radius of r. The S{gamma} rings are illustrated in Figure 2bGo. The center of each S{gamma} ring is distance a along the Cß–Cß axis from the corresponding Cß. Distances a and r are functions of the {chi}3 torsion angle, which is the rotation about the S{gamma}–S{gamma} bond and can be considered as functions of the Cß–Cß distance.


View this table:
[in this window]
[in a new window]
 
Table II. Disulfide model bonds and angles
 


View larger version (19K):
[in this window]
[in a new window]
 
Fig. 2. Geometry of the disulfide model. (a) Relationship of geometric values to the Cß–Cß distance. Axis û runs through the Cß carbons. The Cß–S{gamma} and S{gamma}–S{gamma} distances are fixed, along with the Cß–S{gamma}–S{gamma} angle. Therefore, a, b, r and {chi}3 are functions of distance dij. (b) The S{gamma} rings. All possible S{gamma}i positions can be obtained by tracing a ring of radius r about axis û, with the center of the ring positioned distance a from Cßi. Likewise, S{gamma}j positions are constrained to a ring of radius r, with the center located distance a from Cßj. Since the S{gamma}–S{gamma} bond distance is fixed at 2.04 Å, shown as s, there are only two possible S{gamma}j positions for each S{gamma}i, (c) Relationship between S{gamma} positions as viewed along the Cß axis, from residue j towards residue i. This view represents the S{gamma}j ring stacked on the S{gamma}i ring. Length h is the distance between S{gamma}i and S{gamma}j when viewed down the û axis. Angle {rho} and distances r and h are functions of the Cß–Cß distance. When attempting to fit a disulfide bond between cysteines, {rho}, r and h are fixed for a given Cß–Cß distance while angle {theta} is rotated to sample all possible S{gamma} positions.

 
To fit our standard disulfide between a pair of cysteines we only need to rotate the {chi}3 angle until we achieve a distance dij which matches the Cß–Cß distance. Note that the bond could be rotated in either direction to obtain the desired dij distance. This reflects right- and left-handed disulfide bonds and both cases are checked for the appropriate geometry. Since the S{gamma}–S{gamma} bond is fixed there are only two S{gamma}j positions which are appropriate for each S{gamma}i position, corresponding to the right- and left-handed bonds. Figure 2cGo is a view looking down the Cß–Cß axis (from Cßj towards Cßi) and shows the relationship between the S{gamma} positions. In this view the S{gamma}i and S{gamma}j rings overlap. Angle {rho} and distance h are also functions of the Cß–Cß distance dij.

The only possible S{gamma} positions, given our standard disulfide bond, are completely specified by values a, b, r, {theta} and {rho}. Since a, b, r and {rho} are functions of di, they become fixed for a given Cß–Cß distance. All possible S{gamma}i and S{gamma}j locations can then be determined by rotating angle {theta}. We step angle {theta} through 360° in 1° increments, checking the C{alpha}–Cß–S{gamma} bond angles at each step. For any angle {theta}, if both C{alpha}i–Cßi–S{gamma}i and C{alpha}j–Cßj–S{gamma}j angles are 114.6 ± 10°, we declare the cysteines to have a possible bond. Equations (4)–(8) provide values a, b, r, h and {rho} as a function of the Cß–Cß distance dij. The derivations are not shown, but were obtained by simple trigonometric operations. Equation (4) is derived by projecting the Cß–S{gamma} vector on to the Cß–Cß vector to obtain length a, shown in Figure 2aGo.


(4)

Length b in equation (5) and shown in Figure 2aGo, is simply the Cß–Cß distance minus 2a.:


(5)

Since the Cß–S{gamma} bond length is fixed at 1.81 Å, radius r = 1.81sin{eta}, where {eta} is the angle between the Cß–S{gamma} and Cß–Cß vectors. We calculate {eta} from acos(a/1.81); therefore, r can be obtained as


(6)

Angle {rho} is determined by first obtaining distance h, which is easily calculated using b and the fixed S{gamma}–S{gamma} bond length, as shown in Figure 2bGo. If the S{gamma}j ring were stacked on top of the S{gamma}i ring, as shown in Figure 2c hGo, is the observed distance between the S{gamma} positions. Note that h is not the S{gamma}–S{gamma} bond distance. In Figure 2b hGo, is one side of the right triangle formed using the S{gamma}–S{gamma} bond and the shortest path from S{gamma}i to the S{gamma}j ring, which is of length b. Using this relationship, h is 2.04sin{delta}, where {delta} is the angle between the other legs of the triangle. We calculate {delta} from acos(b/2.04); therefore, h can be obtained as


(7)

The triangle formed by S{gamma}i, S{gamma}j and the center of the stacked S{gamma} rings, as shown in Figure 2cGo, has two sides of length r and one side of length h. Therefore, {rho} can be calculated using the law of cosines, which simplifies to


(8)

For a given Cß–Cß distance, the S{gamma} positions are fixed relative to each other. To find if a disulfide can reside between two cysteines we simply rotate the entire disulfide `assembly' about the Cß–Cß axis, checking the C{alpha}–Cß–S{gamma} angles along the way. The rotation is stopped when the appropriate angles have been detected or 360° has been completed, indicating that no disulfide can exist between the cysteines. In the former case, we declare a putative bond and then estimate the {chi}3 angle.

The {chi}3 torsion angle is used as our measure of disulfide quality when selecting between conflicting bonds and also in the energy potential. A matrix of all possible disulfides in the protein is obtained by completing the above procedure for all cysteine pairs. Note that some cysteines may have possible bonds to more than one other cysteine in the protein. In these cases, we select the bonds which best conform to disulfide bond characteristics observed in PDB structures, allowing a maximum of one bond per cysteine.

Disulfides observed in protein structures show a strong preference for a {chi}3 torsion angle near ±90° (Thornton, 1981Go; Srinivasan et al., 1990Go; Petersen et al., 1999Go). Therefore, we have constructed the disulfide selection procedure and energy potential to favor disulfides bonds with {chi}3 angles conforming to this observation. We note that a recent survey suggests the actual preference may be shifted towards –80 and +100° (Petersen et al., 1999Go). Here, we do not distinguish between right- and left-handed bonds, but simply estimate the torsion angle without regard to stereochemistry. We use a {chi}3 range of 0–180° and our optimum torsion angle is 90°. With the standard disulfide geometry described above, we derived Equation (9), which provides an estimate of the {chi}3 angle as a function of the Cß–Cß distance dij:


(9)

Equation (9) allows dij to range from 2.925 to 4.569 Å, with a respective {chi}3 from 0 to 180°. A very accurate estimate of the torsion angle can be calculated with Equation (9), given only the Cß–Cß distance between cysteines.

To determine the accuracy of our approach, we applied the disulfide recognition algorithm to 200 disulfide-bearing proteins from the PDB, represented only by C{alpha} and Cß coordinates. Using the first chain of each protein, every possible cysteine–cysteine pair in the 200 structures was analyzed for a total of 3358 pairs, including 484 known disulfides. Of the 484 disulfides, 98% were correctly identified using only the C{alpha} and Cß coordinates as described above. Of the nine false negatives, seven were not correctly identified because the dij distances were beyond the range allowed by Equation (9). The other two had dij distances near the allowed limits and, if correctly identified, would have poor {chi}3 angles per our estimation. As discussed later, these disulfides would make insignificant contributions to the overall energy in our potential, so the impact of their incorrect identification is minimal. In addition to the correct identification of disulfide bonds, the {chi}3 angles were very accurately estimated also. Figure 3Go shows the correlation between the actual {chi}3 measured using all-atom PDB structures and the estimated {chi}3 using the simplified protein representations along with Equation (9). The linear correlation of the 475 true positives is 0.92 and the diagonal line represents y = x.



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 3. Correlation between estimated and actual {chi}3 angles of 484 disulfides taken from 200 PDB structures. The linear correlation is 0.92. The diagonal line represents y = x.

 
When threading a protein sequence on to a structural template all possible disulfides are identified as described above. Since a given cysteine may have the appropriate binding geometry to multiple other cysteines we process the matrix of all possible disulfides to resolve conflicts, select the `best' bonds and provide the maximum number of disulfides possible. The selection procedure is based on several experimental observations: (1) a {chi}3 angle of ±90° is preferred; (2) disulfides tend to form between cysteines close in sequence; (3) there should be at least two residues in the sequence between cysteines involved in a disulfide bond (Thornton, 1981Go; Petersen et al., 1999Go). To assign disulfides from the matrix of all possible bonds, we start with the cysteine which has the smallest number of possible bonding partners, otherwise starting from the amino terminus. For this cysteine, we then select the bond having the best {chi}3 value. If multiple disulfides for this cysteine have the same torsion angle, we select the bond to the cysteine closest in sequence. The cysteines involved in the selected disulfide are then designated a bonded pair and removed from further consideration of bonding to other cysteines. Assessing the bonds available between the remaining cysteines, the process is reiterated until no more pairs are available leaving a final list of designated disulfides.

In our potential function, the energy associated with each disulfide bond is a function of the {chi}3 values. The disulfide term in the potential function is constructed to give a lower energy value to disulfides with a {chi}3 angle closest to 90°. Equation (10) is the complete form of the potential function, including the disulfide term. The summation of the right-hand term is performed over all designated disulfides t in the protein. Pd is the optimized disulfide parameter and the term enclosed in brackets ranges from 0 to 1 as a function of {chi}3. For an angle of 90° this term is simply 1 and it decreases linearly as {chi}3 varies from 90°. In this application the disulfide bond stereochemistry is not considered, so the {chi}3 angles range from 0 to 180°.


(10)

Optimizing the parameters

The 22 adjustable parameters Pk(i), Ps and Pd are obtained through an optimization procedure which ensures that for each protein sequence in a training set, the PDB (native) structure is the lowest in energy compared with a very large set of alternative conformations. Equation (11) represents the inequality generated for every alternative conformation sampled for each protein sequence in the training set. Enat is the energy of the native sequence in the native conformation and Ealt is the energy of the native sequence in an alternative conformation. The constant m is an arbitrary energy margin set to 10. The energies of each conformation are calculated using Equation (10) and are a linear function of the adjustable parameters, so the parameters can be obtained by solving the system of linear inequalities (Crippen, 1991Go; Maiorov and Crippen, 1992Go).


(11)

We generate alternative conformations for each native sequence by threading (Hendlich et al., 1990Go; Crippen, 1991Go). For a query sequence of length N, any contiguous structural segment of N residues taken from a PDB structure of length N or greater can be evaluated as a possible structure. So a PDB structure of length M can produce MN + 1 alternative conformations for a query sequence. Each conformation is simplified to C{alpha} and Cß carbons, which take on the identity of the amino acids in the query sequence. An artificial Cß is generated where the template contains a glycine.

The training procedure begins with arbitrarily set parameters, then a sequence from a training set of proteins is threaded through a set of template structures to produce the alternatives. Compliance with Equation (11) is checked for each alternative conformation and we continue to the next threading if Equation (11) is satisfied. Whenever a violation of Equation (11) is encountered, the current inequality (constraint) is added to the problem which is then solved for the adjustable parameters. The procedure is repeated for each native and continues until Equation (11) is satisfied for every alternative structure produced for every native sequence. Alternative structures classified as near-native are excluded from compliance with Equation (11). Since we are satisfying inequalities, the solution space can be large for a given problem. The parameters are optimized by choosing the solution which minimizes the sum of adjustable parameters.

Definition of near-native during training, the burial RMSD

The overall goal is to construct a potential which can identify the native or a near-native structure for a given protein sequence out of a large set of alternatives. So during training we do not require that native-like structures have a higher energy than the native. The native structure is generally accepted to be the crystal or NMR structure in the PDB. The definition of near-native is much more ambiguous. Similarity of two structures is conventionally measured by the coordinate root-mean-square deviation (RMSD) or the distance RMSD (DME). These measures of similarity are visually appealing since they classify structures that look the same as similar. However, they are not satisfactory when we consider structures which are energetically near-native. Motion of solvent-exposed loops or hinge-like motion between large protein domains do not necessarily imply changes in free energy, yet may result in RMSD differences. It is desirable to have a measure of structural similarity that is consistent with our measure of energy and independent of sequence length. Since the energy potential is a function of solvation, for training purposes we define near-native as those structures which have a solvation profile similar to that of the native. This definition allows for energetically similar structures to be classified as near-native despite coordinate differences. The burial root-mean-square deviation (BRMSD) is defined as


(12)

For a given sequence in two different conformations, N is the number of residues in the protein and Ci,a and Ci,b are the total number of contacts tabulated for residue i in structures a and b, respectively, using our contact definitions.

To assess the relationship of BRMSD to sequence length, 18 PDB structures from 43 to 333 residues in length were analyzed. The 18 sequences were threaded through 121 randomly selected PDB structures and the BRMSD was measured for each alternative conformation in reference to the native structure (data not shown). The average BRMSD shows little dependence on sequence length. The apparent independence of sequence length is appealing because one BRMSD cut-off value may be suitable to define near-native for proteins of various sequence lengths. To determine whether conventionally defined structural homologues fall within a given value of BRMSD, the coordinate RMSD and BRMSD were compared for 12 structural homologues (RMSD between 0.4 and 2.4) of protein 135l, turkey egg white lysozyme. All homologues had a BRMSD below 10, so a BRMSD of 10 was selected as our near-native cut-off and structures with a BRMSD <10 relative to the native are exempt from compliance with Equation (11) during training.

The training set

We trained the potential function to succeed in fold recognition for the general population of proteins while also providing for success with small, disulfide-bearing proteins. With a disulfide term in the potential function it is important to consider both types of proteins having multiple cysteines. For proteins with more than one cysteine, we define type I proteins as those having the maximum possible number of disulfide bonds in the native structure, given the number of cysteines in the sequence. Proteins with multiple cysteines but fewer than the maximum number of possible disulfides are designated as type II. For example, a protein with six cysteines and three disulfides would be classified as type I, whereas a protein with six cysteines but only two disulfides would be classified a type II. Since we want to ensure that the potential can identify the native structure, not simply the conformation which provides the most disulfide bonds, it is important to include both types in our training and test sets. Our training set of 22 proteins contains a variety of native structures, including seven without cysteines, five type I proteins and five type II. Table IIIGo lists the proteins in the training set and shows that the three general structural classes ({alpha}, ß, {alpha}/ß) are also included. During training these native sequences were threaded through 87 template proteins, listed in Table IVGo. The templates provided more than 31 000 alternative conformations, on average, for each native in the training set.


View this table:
[in this window]
[in a new window]
 
Table III. Native proteins in the training set
 

View this table:
[in this window]
[in a new window]
 
Table IV. Template proteins used in training
 
Testing the potential function

Testing of the trained energy potential was performed using two sets of test proteins. For a general test with representative proteins and to compare results with a frequently cited energy potential, we used the test set of Miyazawa and Jernigan (1996). It is an ungapped threading test where the goal is to identify the native structure as the lowest in energy for each of 88 test sequences threaded through 189 template structures. The 88 test proteins are structurally dissimilar and have <35% sequence identity to any other protein in the set. The structural diversity and variety of sequence length in this test set provide a comprehensive test scenario. We used simple, ungapped threading as the initial test for the potential function because it remains an essential requirement. If the native structure cannot be consistently identified from a set of ungapped decoys, there is little hope that the potential will succeed with gapped structures.

We created a second test to assess rigorously the accuracy of the potential function specifically with small, disulfide-bearing proteins. The test set was composed of 71 proteins taken from the PDB, all less than 150 residues in length and having at least one disulfide bond. These proteins were threaded, without gaps, through the 1347 PDB structures used in our survey of atomic contacts. The templates provided, on average, >250 000 alternative conformations for each native in the test set. The 71 proteins of the disulfide test set are listed in Table VGo.


View this table:
[in this window]
[in a new window]
 
Table V. The disulfide test set
 

    Results
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Training of the potential function provided the optimized parameters listed in Table VIGo. A negative value indicates that intramolecular contacts are favored, while amino acids with a positive parameter tend to be exposed to solvent. Note that the values represent the energy contribution per atomic contact. Smaller amino acids have fewer possible atomic contacts than larger ones owing to the number of heavy atoms in the side chain, so the energy parameter for a smaller residue tends to be greater in magnitude than for a larger amino acid having a similar burial preference. The energy parameters can be scaled by any given factor and the same results will be obtained since the potential is a linear function of the parameters. Arbitrary boundaries are initially set for the parameter values and we note the optimized disulfide parameter Pd is at the lower boundary. This suggests that the parameter is currently not constrained by inequalities generated with the training set, so additional training is likely to be accommodated.


View this table:
[in this window]
[in a new window]
 
Table VI. Optimized parameters
 
Testing of the potential with the Miyazawa and Jernigan (M&J) test set demonstrates its accuracy with a variety of protein types. The native structure was correctly identified as the lowest in energy for 82.6% of the proteins in the M&J test set using our potential. Two proteins common to the training and test sets, 1hoe and 1paz, are not included. This compares with 86% achieved by Miyazawa and Jernigan (1996) using their potential with the same test set. The test proteins which failed with the our potential are 3chy, 2ltn.A, 1ycc, 1fxd, 2rsp.A, 1prc.C, 2pab.A, 5rxn, 2stv, 2wrp.R, 1utg, 2por, 2cdv, 1prc.L and 2cy3. These include four cytochrome cs, four multimeric proteins, two transmembrane chains, two proteins which bind metal, one steroid binding protein and one DNA binding protein. It is known that cytochrome c will not fold to its native conformation without covalently bound heme (Dumont et al., 1994Go). Since the potential function does not account for interactions with ligands, we are threading the apo form of these proteins. Hence it is not surprising, nor incorrect, that the heme-bound crystal structure is not selected as the lowest in energy by the potential. Likewise, transmembrane proteins and those with hydrophobic binding sites are not expected to be correctly identified. In fact, any protein which depends on metals, ligands, prosthetic groups, disulfides or multimeric interactions for native state stability is likely to be incorrectly identified with a potential function that does not account for such interactions. Of course, this depends on the extent of stability provided by the prosthetic group, ligand, etc.

Here, we made an initial attempt to consider stabilizing groups by accounting for disulfide bonds and the results reveal the utility of the approach. Our potential correctly identified 28 of 29 disulfide-bearing proteins in the M&J test set. Of the 11 disulfide proteins with <100 amino acids (those likely dependent on disulfide bonds for stability), 91% were correctly identified with our potential while only 36% of this group were correct when using the potential function of Miyazawa and Jernigan. The one disulfide-bearing protein that failed with our potential was 1fxd, ferredoxin. This native, an example of a type II protein, has one disulfide bond and four additional cysteines involved in binding an iron–sulfur cluster. The correct disulfide bond was identified by the potential function and the native structure ranked second only to a conformation that provided an additional disulfide. However, we note that our potential does not just mindlessly select the conformation having the most disulfide bonds. The M&J test includes 33 type II proteins and 86% were correctly ranked number one, excluding the four cytochrome cs. Many alternative conformations with disulfides were available for the type II sequences, yet the potential consistently identified the true native conformations which have no disulfide bonds. It is clear that our potential is not overtrained with regard to disulfide-bearing proteins.

Our disulfide test set provided a thorough evaluation of the potential function with small proteins having disulfide bonds. Table VGo lists the proteins used in the disulfide test set. The templates are the representative set of 1347 PDB structures having no greater than 95% sequence identity used in our contact survey. The high threshold for identity allows significant structural coverage among the templates, reducing native structure uniqueness. Over 250 000 alternative conformations were provided, on average, for each of the 71 test proteins. The potential function correctly identified the native or a similar conformation, as the lowest in energy (`best structure') for 80% of the test sequences. For these, the best structure is listed as `native' or by the template and threading offset of the native-like conformation, along with the RMSD compared to the native. The offset is the number of residues from the beginning of the template chain where the sequence to structure alignment begins. The offset is zero where not explicitly listed. For nine of the 14 proteins that failed, the native structure was within the 10 conformations having the lowest energy.

The results indicate that our potential is capable of identifying structural homologues and analogues. The former share a common ancestral sequence, while the latter are encountered when two proteins share a common fold but are not obviously related in function or sequence. With our disulfide test, in most cases where the potential function succeeded in identifying a structurally similar conformation in place of the native, the native and best structure were homologues. This demonstrates the ability of the potential function to identify related proteins. Apparent analogues are also identified. The lowest energy conformation identified for sequence 1gur was template 1eit, with an offset of zero. There is no obvious relationship between these two proteins, other than three common disulfide bonds, yet they have very similar structures with an RMSD of 2.3 Å. They share 28% sequence identity over 34 residues, below the threshold for structural homology considering the sequence length (Sander and Schneider, 1991Go; Abagyan and Batalov, 1997Go). The functions of these two proteins also seem unrelated; 1gur is a sweet taste-suppressing polypeptide from the tree Gymnema sylvestre, and 1eit is a neurotoxin from the funnel web spider. Despite insignificant similarity in sequence and function, the energy potential is able to identify the common fold. We are encouraged by the results and a more extensive test and analysis of analogue identification are under way.

In addition to recognizing the correct fold, the potential function is also very accurate in identifying the actual disulfide bonds in each native structure. The test proteins in Table VGo have a total of 210 disulfide bonds and 98% were correctly identified. Three of the five incorrectly identified bonds have Cß–Cß distances beyond the allowed range of Equation (9). These disulfides are from NMR structures and using the first model of each PDB file reveals that two of the bonds, cysteine pairs 17–33 and 3–18 in 1gur, have anomalous disulfide geometry with respective S{gamma}–S{gamma} distances of 3.69 and 4.99 Å, respectively.


    Discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Trained with only 22 proteins, the potential function successfully identifies the native structure for >80% of test sequences in ungapped threading tests. The results show that we are equally successful with disulfide-bearing proteins, as well as representative proteins such as those in the M&J test. The potential function accurately identifies the correct disulfide bonds in our test structures and favors the native conformation. However, the potential function is not overtrained, as demonstrated by the success with type II proteins and general applicability of the potential is retained.

Ungapped threading tests were used to ensure the potential satisfies the fundamental requirement of fold recognition: identification of the native structure. This has been demonstrated, so more comprehensive threading scenarios are being investigated. The results of our disulfide test clearly show that the potential is capable of identifying homologous structures. However, for sequences with numerous homologues having varying numbers and patterns of disulfides, it remains to be determined which factors most influence the ranking. Combining our potential with a gapped threading procedure will allow us to determine whether top-ranked homologues are more accurate in RMSD or in the number and pattern of disulfides when compared with the native.

The contribution of the disulfide term is evident when the results are compared with training and testing without the disulfide component in the potential function. The training set used in this work provides an infeasible problem when the disulfide component is removed from the potential. In other words, without the disulfide term there is no set of parameters which ensures the native structure is the lowest in energy out of all alternative conformations for each sequence in our training set. The solvation component along with the overpacking term cannot account for the stability of our native proteins. Using only the solvation and overpacking components trained with a set of proteins similar in number and size to the current training set provided a potential that was >80% accurate for proteins larger than 110 residues, but it did poorly when tested with small proteins, particularly those with disulfide bonds (Dombkowski and Crippen, 1999Go). Without the disulfide component we were only 36% accurate in identifying the native structure for small (<100 residues), disulfide-bearing proteins in the M&J test. The same potential was only 39% accurate with our disulfide test set. The results improved dramatically with the present potential to 91% and 80%, respectively. We note that comparing results using different, though similar, training sets is complex. However, it is evident that including the disulfide component in the potential allows training with a previously infeasible set of proteins and results in significantly improved fold recognition accuracy.

Since disulfide bonds occur most frequently in extracellular proteins, it would be advantageous to use a disulfide-cognizant potential for structure predictions of secreted proteins. Pre-screening for secreted proteins is possible using a very accurate computational tool, TargetP (Emanuelsson et al., 2000Go). This neural network-based program is trained to identify the N-terminal target sequence which specifies the cellular destination of a protein. It is >90% accurate in identifying proteins destined for the secretory pathway. The authors estimated that 10% of proteins in Homo sapiens and Arabidopsis thaliana are secreted, excluding membrane proteins. Analysis of our 1347 representative PDB structures revealed that 38% have at least one disulfide bond, 29% are type I proteins with the maximum possible number of disulfides given the number of cysteines and 44% are type II with at least two cysteines but fewer than the maximum possible number of disulfides. Protein fold recognition efforts should benefit from consideration of such a common structural feature.

In addition to identifying probable disulfides in threaded sequences, the disulfide recognition algorithm could easily be adapted to identify possible mutation sites for inserting disulfides in a protein. Where it is desirable to increase protein stability through the introduction of a disulfide bond, our algorithm could be modified to identify all pairs of residues that would form a disulfide if they were cysteines. The procedure entails threading a sequence of all cysteines on to the target structure and recording all possible bonds. However, the identification of putative disulfides assumes that the C{alpha} and Cß coordinates are not significantly perturbed by the mutation from the original amino acid to cysteine.

Developers of fold recognition methods have striven to create potential functions which are successful for the general population of proteins. However, some types of proteins have remained particularly troublesome in structure identification attempts. Since the major contribution to most potential functions is the identification of a suitable hydrophobic core for a given sequence, it is not surprising that small proteins having little of a core and dependent on prosthetic group interactions for stability have failed prediction efforts. Here, we have continued the focus on hydrophobic burial while improving on contact definitions and accounting for disulfide-enhanced stability. It is a step towards improved fold recognition by accommodating post-translational modifications. Our success with disulfide bonds suggests that a similar approach may be useful for other types of stabilizing interactions. Our difficulty with cytochrome c suggests that heme group ligation may be one of many appropriate targets.


    Notes
 
1 To whom correspondence should be addressed. E-mail: gcrippen{at}umich.edu Back


    Acknowledgments
 
A.A.D. is a University of Michigan Rackham Regents Fellow. This work was supported by the Michigan Molecular Biophysics Training Program (GM08270), NSF (DBI-9614074), NIH (GM59097–01) and the Vahlteich Research Award Fund.


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Abagyan,R.A. and Batalov,S. (1997) J. Mol. Biol., 273, 355–368.[ISI][Medline]

Abkevich,V.I. and Shakhnovich,E.I. (2000) J. Mol. Biol., 300, 975–985.[ISI][Medline]

Betz,F.B. (1993) Protein Sci., 2, 1551–1558.[Abstract/Free Full Text]

Colonna-Cesari,F. and Sander,C. (1990) Biophys. J., 57, 1103–1107.[Abstract]

Cornell,W.D., Cieplak,P., Bayly,C.I., Gould,I.R.,Merz,K.M.,Jr, Ferguson,D.M., Spellmeyer.,D.C., Fox,T., Caldwell,J.W. and Kollman,P.A. (1995) J. Am. Chem. Soc., 117, 5179–5197.[ISI]

Creighton,T.E. (1993) Proteins, Structures and Molecular Properties. Freeman, San Francisco, p. 318.

Crippen,G.M. (1991) Biochemistry, 30, 4232–4237.[ISI][Medline]

Dombkowski,A.A. and Crippen,G.M. (1999) In Proceedings of the Third Annual International Conference on Computational Molecular Biology, Istrail,S., Pevzner,P., Waterman,M., eds. Association for Computing Machinery Press, New York, pp. 145–153.

Dumont,M.E., Corin,A.F. and Campbell,G.A. (1994) Biochemistry, 33, 7368–7378.[ISI][Medline]

Emanuelsson,O., Nielsen,H., Brunak,S. and von Heijne, G. (2000) J. Mol. Biol., 300, 1005–1016.[ISI][Medline]

Hazes,B. and Dijkstra,B.W. (1988) Protein Eng., 2, 119–125.[Abstract]

Hendlich,M., Lackner,P., Weitckus,S., Floeckner,H., Froschauer,R., Gottsbacher, Casari,G. and Sippl,M.J. (1990) J. Mol. Biol., 216, 167–180.[ISI][Medline]

Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417.[Abstract/Free Full Text]

Huang,E.S., Subbiah,S. and Levitt,M. (1995) J. Mol. Biol., 252, 709–720.[ISI][Medline]

Huber,T. and Torda,A.E. (1998) Protein Sci., 7, 142–149.[Abstract/Free Full Text]

Jones,D.T. (1997) Curr. Opin. Struct. Biol., 7, 377–387.[ISI][Medline]

Kocher,J.A., Rooman,M.J. and Wodak,S.J. (1994) J. Mol. Biol., 235, 1598–1613.[ISI][Medline]

Maiorov,V.N. and Crippen,G.M. (1992) J. Mol. Biol., 227, 876–888.[ISI][Medline]

Miyazawa,S. and Jernigan,R.L. (1985) Macromolecules, 18, 534–552.[ISI]

Miyazawa,S. and Jernigan,R.L. (1996) J. Mol. Biol., 256, 623–644.[ISI][Medline]

Narhi,L.O., Hua,Q.X., Arakawa,T., Fox,G.M., Tsai,L., Rosenfeld,R., Holst,P., Miller,J.A. and Weiss,M.A. (1993) Biochemistry, 32, 5214–5221.[ISI][Medline]

Park,B.H., Huang,E.S. and Levitt,M. (1997) J. Mol. Biol., 266, 831–846.[ISI][Medline]

Petersen,M.T., Jonson,P.H. and Petersen,S.B. (1999) Protein Eng., 12, 535–548.[Abstract/Free Full Text]

Sander,C. and Schneider,R. (1991) Proteins, 9, 56–68.[ISI][Medline]

Sippl,M.J. (1990) J. Mol. Biol., 213, 859–883.[ISI][Medline]

Srinivasan,N., Sowdhamini,R., Ramakrishnan,C. and Balaram,P. (1990) Int. J. Pept. Protein Res., 36, 147–155.[ISI][Medline]

Thomas,P.D. and Dill,K.A. (1996) J. Mol. Biol., 257, 457–469.[ISI][Medline]

Thornton,J.M. (1981) J. Mol. Biol., 151, 261–287.[ISI][Medline]

Zhang,C. and Kim,S. (2000) Proc. Natl Acad. Sci. USA, 97, 2550–2555.[Abstract/Free Full Text]

Zhang,J., Matthews,J.M., Ward,L.D. and Simpson,R.J. (1997) Biochemistry, 36, 2380–2389.[ISI][Medline]

Received April 6, 2000; revised August 7, 2000; accepted August 17, 2000.