College of Pharmacy, University of Michigan, Ann Arbor, MI 48109, USA
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: contact potential/disulfide/fold recognition/protein structure prediction/threading
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Regardless of functional form, the overall goal in fold recognition is the same: for a given protein sequence the potential function should identify the native (or near-native) structure as unique out of all possible conformations. Huang et al. (1995) demonstrated that a simple hydrophobic scoring function can recognize the correct fold for a large number of proteins. Other more complex potentials have shown similar success (Maiorov and Crippen, 1992; Miyazawa and Jernigan, 1996
; Zhang and Kim, 2000
). However, one class of proteins which has thwarted fold recognition efforts are the small, disulfide-bearing proteins. Developers of potential functions have noted difficulty with these proteins in structure identification tests (Kocher et al., 1994
; Huang et al., 1995
; Park et al., 1997
; Huber and Torda, 1998
). Disulfide-bearing proteins often failed with fundamental ungapped threading tests. Disulfide bonds provide stability to the native structure and may compensate for the lack of a notable hydrophobic core in small proteins (Petersen et al., 1999
; Betz, 1993
). A strong negative correlation has been observed between hydrophobic content and the number of disulfide bonds (Abkevich and Shakhnovich, 2000
). Experimental evidence reveals that some proteins will not retain their native structure when disulfide bonds are disrupted (Creighton, 1993
; Narhi et al., 1993
; Zhang et al., 1997
). Since most energy potentials do not explicitly account for post-translational modifications and prosthetic group interactions, it is not surprising that they frequently fail for proteins which depend on such interactions. If disulfide bonds are often required in the absence of a large hydrophobic core, it follows that they should be accounted for in structure prediction methods which are intended for use with small proteins. Given the abundance and significance of small, disulfide-bearing proteins, an appropriate treatment of disulfide bonds in a fold recognition function would be beneficial. Here, we seek to continue the focus on solvation while improving on intramolecular contact definitions and accounting for the contribution of disulfide bonds.
The definition of a contact is crucial in fold recognition methods. Typically, two amino acids of a given protein are considered in contact if their side chains are within a specified distance of each other and an energy value or score is attributed to the contact. Threading requires the use of simplified side chain representations, so contacts defined in this context are a crude approximation of the actual interaction between residues. When searching for the best fold for a given sequence, contiguous segments of known PDB structures (templates) are often used as the structural model. However, the query amino acid sequence will differ from the actual sequence of the template protein, so to allow compatibility the side chains are usually represented as a point in space. Commonly used representations are Cß carbons or centroids which represent the center of mass of side chains. In this manner, any sequence can be folded into any observed structural segment by assigning the identity of the query sequence amino acids to the corresponding side chain representations of the template structure. Threading provides a quick method to sample many possible structures, all having protein-like bond angles and lengths, but the disadvantage is a loss of atomic detail in the side chains.
Often a single contact definition is used for all possible pairs of interacting amino acids. For example, if the Cß for two side chains are within 8 Å, the residues are considered in contact. However, this type of definition is a poor representation in an energy potential modeled on solvation. An intramolecular contact in a solvation potential should be defined as an interaction which excludes solvent from the space between two residues. The difficulty in choosing a contact distance is due to the variation in size and shape of the 20 amino acids. The Cß atoms of two large aromatic residues may be 10 Å apart and water could be excluded from the space between them owing to the volume occupied by the side chains. The same distance is not an appropriate contact definition for two small amino acids such as alanine or glycine because a water molecule could easily pass between the two side chains. A solvation potential also requires a better estimation of the extent of contact since a single definition would give the same contact value to either a glycine or a tyrosine that is within the cut-off distance of a central residue. Obviously, a tyrosine contact would provide much greater shielding from solvent. Since intramolecular contacts (hydrophobic interactions in particular) are critical in fold recognition, we are motivated to improve the treatment of contacts in simplified protein models.
We use contact curves which account for the residue types involved and provide an estimate of the number of atomic interactions between side chains as a function of the distance between Cß representations. A contact curve for each of the 210 possible amino acid pairs represents the average number of atomic contacts observed in PDB structures as a function of CßCß distance for the given amino acid pair. This allows an estimate of the extent of contact between two residues in simplified structural models. The calculated burial of a residue of a given sequence folded into a specified conformation is simply the sum of all estimated contacts for that residue and the energy associated with the residue is defined as the product of the total contact value and an energy parameter for the given amino acid type. One energy parameter per amino acid type is obtained by fitting the solvation model to observed PDB structures. The parameters are optimized to ensure that the native structure for each protein in a training set is the lowest in energy as compared to thousands of alternative conformations.
To account for the stabilizing effect of disulfide bonds, our potential function includes a favorable energy contribution for each disulfide identified. When assessing sequencestructure compatibility, all pairs of cysteines are analyzed for possible bonding. Using only the C and Cß coordinates, a cysteine pair is considered bonded if a disulfide with standard bond lengths and angles can reside between the two residues. The potential function is trained to be accurate in structure identification of disulfide-bearing proteins and those without disulfides. This includes proteins which contain non-bonded cysteines. Thus, the general applicability of the potential function is retained.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
To construct a solvation potential that is compatible with simplified protein representations, the first step is to define a contact in a manner consistent with solvation. Using the Colonna-Cesari and Sander model (1990), a contact is defined as an interaction between two atoms of different side chains which excludes water from the region between the atoms. Since the diameter of a water molecule is ~2.8 Å and the van der Waals radius of any given heavy atom (non-hydrogen) is ~1.8 Å, if two atoms are within 6.4 Å of each other we can say that a water molecule cannot pass between them owing to excluded volume. So an atomic contact is defined as two atoms from different side chains that are within 6.4 Å of each other. Using this definition, it would be simple to tabulate the total number of atomic contacts for any given residue in an all-atom model of a protein structure. However, the single-atom representation of side chains required in threading precludes performing this simple calculation. Therefore, we derive contact curves which provide a reasonable estimate of the average number of atomic contacts between two residues, given only their Cß locations and amino acid types. The curves are obtained from a survey of PDB structures using all-atom representations for the side chains. We then use the curves to estimate contact during threading with single-atom side chain models.
A survey of representative PDB structures provides the data which are then fitted to smooth sigmoidal curves. The survey structures are drawn from the PDB Select which represents protein sequences having no greater than 95% sequence homology (Hobohm et al., 1992). The proteins from the PDB Select list of June 1997 were screened to eliminate structures with chain breaks and alternative coordinates; the remaining 1347 structures became the survey set. For each of the 210 possible amino acid pairs, the average number of atomic contacts were tabulated and recorded as a function of CßCß distance. When counting atomic contacts, we consider only heavy atoms (non-hydrogen) and use a smooth sigmoidal curve with a maximum atomic contact value of one when the atoms are <3.6 Å apart and a minimum value of zero when the distance is >6.4 Å. Equation (1) represents the sigmoid used for counting atomic contacts and for fitting the contact curves. For atomic contacts, U is 6.4 Å, L is 3.6 Å, M is 1 and
is the contact value for atoms which are distance d apart.
| (1) |
For each amino acid pair the average number of atomic contacts was recorded as a function of CßCß distance and binned in 0.25 Å increments. Fitting a sigmoid curve to each of the 210 sets of data using Equation (1) provided three values: (1) an upper bound U, the CßCß distance above which there is zero average atomic contact; (2) a lower bound L, the CßCß distance below which the average number of contacts observed is at its maximum value; and (3) a maximum average contact value M. Table I lists the sigmoid parameters for all 210 amino acid pairs. Figure 1
shows several examples of contact curves fit to the survey data. Cij is the average number of atomic contacts observed when the Cß atoms of the two residues are a distance dij apart. In general, amino acid pairs having smaller side chains provide a better fit than larger ones. The sigmoid function fit well to the data for most pairs. In some cases, the best fit resulted in a negative value for L. An example is the PHELEU pair shown in Figure 1b
. The L value is 0.43 to allow the best fit and the curve rises above the maximum observed atomic contact value. However, this is not a practical problem since few pairs of residues are observed with a CßCß distance <3 Å and the curves fit the data well in the observed range of dij.
|
|
The potential function
Our potential function has three terms representing solvation, overpacking and disulfide bonding. The solvation component of our energy potential is an implicit function of solvation. With the approach of Colonna-Cesari and Sander (1990), a given amino acid type is assumed to have a fixed number of total contacts which are divided among solvent and intramolecular interactions. By counting intramolecular contacts we can determine the extent of solvation. For residue i of a protein folded into a specified conformation, where k is the amino acid type, the solvation energy Ei is defined as the product of the energy parameter Pk(i) for amino acid type k and the solvent contacts of residue i. The energy can be determined implicitly from intramolecular contacts as shown below. By assuming that the fixed number of total contacts, Zk(i), for amino acid type k are partitioned between solvent contacts ns(i) and intramolecular contacts nr(i), the energy contribution of residue i is obtained as follows:
![]() |
When comparing the energy values of the same protein sequence folded into various conformations, for any residue i the values Pk(i) and Zk(i) are constant, so the first term on the right-hand side may be ignored, giving
Ei = Pk(i)nr(i)
The negative sign is absorbed into the adjustable parameter Pk(i). For residue i the total number of intramolecular contacts nr(i) is estimated by calculating the distance from Cßi to each Cßj in the protein, using the appropriate contact curve to get Cij, then summing to obtain the total number of contacts for residue i. The total energy of a protein sequence in a given conformation is determined by summing over all residues N as follows:
![]() | (2) |
Equation (2) is the solvation contribution to the potential with 20 adjustable parameters Pk(i), one for each amino acid type. All residue pairs separated by two or more residues in sequence are included in the summation. To prevent the potential from favoring overpacked structures, a steric penalty is included. If the total number of contacts for a residue exceeds a specified value for the given amino acid type, the excess contact value Si is multiplied by a steric penalty parameter Ps. This brings the number of adjustable parameters to 21, including Ps. The solvation potential with the overpacking term is shown in Equation (3). Energy values are in arbitrary units.
| (3) |
Each amino acid type is assigned a maximum contact value Ck(i),max which is obtained from the distribution of total atomic contacts observed in our survey structures. For each amino acid of type k found in our survey structures the contact curves were used to calculate the total number of atomic contacts. A histogram for each residue type provides a distribution of the number of occurrences of each total contact value observed in the survey. The contact value that is one standard deviation lower than the maximum observed is used as Ck(i),max.
Disulfide recognition
Previous results using an energy potential in the form of Equation (3) suggested that the solvation component alone may not account for the stability of small, disulfide-bearing proteins (Dombkowski and Crippen, 1999). While achieving >80% accuracy for proteins larger than 110 residues in an ungapped threading test, the potential performed poorly for smaller proteins, particularly those with disulfide bonds. Others have reported difficulty with this class of proteins. However, it should be noted that pairwise potentials with 210 contact energy values representing the amino acid pairs do implicitly account for disulfide bonds, albeit in a crude manner, since they contain a CYSCYS contact parameter. Miyazawa and Jernigan (1985, 1996) reported that their contact energies reflect the favorable formation of disulfide bonds. In their potential the formation of CYSCYS contacts is greatly favored over all CYSX interactions, where X is any other amino acid type. The solvation component of our potential does not provide special treatment for CYSCYS contacts.
In our potential, the energy associated with a given residue is a function of the total number of estimated atomic contacts, regardless of the identity of the contacting residues. As illustrated in Figure 1a, if we are tabulating the number of contacts for a given cysteine, then another cysteine 3.75 Å away (as measured by CßCß distance) would provide four atomic contacts. The same number of contacts would be attributed to a phenylalanine which is ~5.75 Å away from the central cysteine. Both interactions effectively desolvate the central cysteine equivalently and contribute equally to the overall energy of the cysteine. Additionally, a phenylalanine 3.75 Å away from the central cysteine would provide nearly twice as many atomic contacts as another cysteine at the same distance. Since cysteine contacts are energetically favorable, the interaction with the larger phenylalanine would be preferential over contact with another cysteine at the same CßCß distance. Considering the contact curves and energy parameters, two CYSPHE pairs are favored over the alternative CYSCYS and PHEPHE pairs, assuming the same dij distances. Unlike the Miyazawa and Jernigan potential and similar pairwise potentials, favorable CYSCYS treatment is not inherent in the solvation component of our potential. Therefore, to account for the stabilizing contribution of disulfides we included a disulfide parameter for cysteine pairs having the appropriate bonding geometry.
The first step in our treatment of disulfides is the identification of all possible disulfide bonds for a target sequence threaded onto a structural template. We use the C and Cß coordinates of the template to determine if a disulfide can exist between any pair of cysteines in the threaded sequence. We check all cysteine pairs for geometry which will allow a disulfide bond to form between them. Some cysteines may have multiple pairs which provide the correct bonding geometry. So a matrix of all possible disulfide bonds is processed with a heuristic algorithm to obtain a final list of designated bonds which conform to disulfide characteristics observed in PDB structures, allowing each cysteine to participate in one bond, at most.
Our disulfide recognition algorithm simply requires that a pair of cysteines have C and Cß atoms positioned such that a disulfide with standard bond lengths and angles can reside between the two residues and provide appropriate C
CßS
bond angles. A similar approach was taken by Hazes and Dijkstra (1988) in a method developed to identify possible mutation sites in proteins where a disulfide could be added. We used the bond lengths and angles of AMBER 95 (Cornell et al., 1995
), as shown in Table II
. These values are in agreement with a recent survey of 351 disulfide bridges observed in PDB structures (Petersen et al.,1999). A small amount of tolerance is allowed in C
CßS
angles. Since we are using Cß representations for all side chains, the locations of cysteine S
atoms are unknown. Therefore, when analyzing a pair of cysteines for bonding we must consider all possible sulfur locations. We fixed the CßS
and S
S
bond lengths along with the CßS
S
angles, so each S
atom is restricted to a ring which is centered on (and perpendicular to) the CßCß axis. Our disulfide model geometry is shown in Figure 2
. By convention we assign index i to the lower numbered residue. Note that a ring would be traced if either S
atom were rotated about the CßCß axis at a radius of r. The S
rings are illustrated in Figure 2b
. The center of each S
ring is distance a along the CßCß axis from the corresponding Cß. Distances a and r are functions of the
3 torsion angle, which is the rotation about the S
S
bond and can be considered as functions of the CßCß distance.
|
|
The only possible S positions, given our standard disulfide bond, are completely specified by values a, b, r,
and
. Since a, b, r and
are functions of di, they become fixed for a given CßCß distance. All possible S
i and S
j locations can then be determined by rotating angle
. We step angle
through 360° in 1° increments, checking the C
CßS
bond angles at each step. For any angle
, if both C
iCßiS
i and C
jCßjS
j angles are 114.6 ± 10°, we declare the cysteines to have a possible bond. Equations (4)(8) provide values a, b, r, h and
as a function of the CßCß distance dij. The derivations are not shown, but were obtained by simple trigonometric operations. Equation (4) is derived by projecting the CßS
vector on to the CßCß vector to obtain length a, shown in Figure 2a
.
![]() | (4) |
Length b in equation (5) and shown in Figure 2a, is simply the CßCß distance minus 2a.:
![]() | (5) |
Since the CßS bond length is fixed at 1.81 Å, radius r = 1.81sin
, where
is the angle between the CßS
and CßCß vectors. We calculate
from acos(a/1.81); therefore, r can be obtained as
![]() | (6) |
Angle is determined by first obtaining distance h, which is easily calculated using b and the fixed S
S
bond length, as shown in Figure 2b
. If the S
j ring were stacked on top of the S
i ring, as shown in Figure 2c h
, is the observed distance between the S
positions. Note that h is not the S
S
bond distance. In Figure 2b h
, is one side of the right triangle formed using the S
S
bond and the shortest path from S
i to the S
j ring, which is of length b. Using this relationship, h is 2.04sin
, where
is the angle between the other legs of the triangle. We calculate
from acos(b/2.04); therefore, h can be obtained as
![]() | (7) |
The triangle formed by Si, S
j and the center of the stacked S
rings, as shown in Figure 2c
, has two sides of length r and one side of length h. Therefore,
can be calculated using the law of cosines, which simplifies to
![]() | (8) |
For a given CßCß distance, the S positions are fixed relative to each other. To find if a disulfide can reside between two cysteines we simply rotate the entire disulfide `assembly' about the CßCß axis, checking the C
CßS
angles along the way. The rotation is stopped when the appropriate angles have been detected or 360° has been completed, indicating that no disulfide can exist between the cysteines. In the former case, we declare a putative bond and then estimate the
3 angle.
The 3 torsion angle is used as our measure of disulfide quality when selecting between conflicting bonds and also in the energy potential. A matrix of all possible disulfides in the protein is obtained by completing the above procedure for all cysteine pairs. Note that some cysteines may have possible bonds to more than one other cysteine in the protein. In these cases, we select the bonds which best conform to disulfide bond characteristics observed in PDB structures, allowing a maximum of one bond per cysteine.
Disulfides observed in protein structures show a strong preference for a 3 torsion angle near ±90° (Thornton, 1981
; Srinivasan et al., 1990
; Petersen et al., 1999
). Therefore, we have constructed the disulfide selection procedure and energy potential to favor disulfides bonds with
3 angles conforming to this observation. We note that a recent survey suggests the actual preference may be shifted towards 80 and +100° (Petersen et al., 1999
). Here, we do not distinguish between right- and left-handed bonds, but simply estimate the torsion angle without regard to stereochemistry. We use a
3 range of 0180° and our optimum torsion angle is 90°. With the standard disulfide geometry described above, we derived Equation (9), which provides an estimate of the
3 angle as a function of the CßCß distance dij:
![]() | (9) |
Equation (9) allows dij to range from 2.925 to 4.569 Å, with a respective 3 from 0 to 180°. A very accurate estimate of the torsion angle can be calculated with Equation (9), given only the CßCß distance between cysteines.
To determine the accuracy of our approach, we applied the disulfide recognition algorithm to 200 disulfide-bearing proteins from the PDB, represented only by C and Cß coordinates. Using the first chain of each protein, every possible cysteinecysteine pair in the 200 structures was analyzed for a total of 3358 pairs, including 484 known disulfides. Of the 484 disulfides, 98% were correctly identified using only the C
and Cß coordinates as described above. Of the nine false negatives, seven were not correctly identified because the dij distances were beyond the range allowed by Equation (9). The other two had dij distances near the allowed limits and, if correctly identified, would have poor
3 angles per our estimation. As discussed later, these disulfides would make insignificant contributions to the overall energy in our potential, so the impact of their incorrect identification is minimal. In addition to the correct identification of disulfide bonds, the
3 angles were very accurately estimated also. Figure 3
shows the correlation between the actual
3 measured using all-atom PDB structures and the estimated
3 using the simplified protein representations along with Equation (9). The linear correlation of the 475 true positives is 0.92 and the diagonal line represents y = x.
|
In our potential function, the energy associated with each disulfide bond is a function of the 3 values. The disulfide term in the potential function is constructed to give a lower energy value to disulfides with a
3 angle closest to 90°. Equation (10) is the complete form of the potential function, including the disulfide term. The summation of the right-hand term is performed over all designated disulfides t in the protein. Pd is the optimized disulfide parameter and the term enclosed in brackets ranges from 0 to 1 as a function of
3. For an angle of 90° this term is simply 1 and it decreases linearly as
3 varies from 90°. In this application the disulfide bond stereochemistry is not considered, so the
3 angles range from 0 to 180°.
![]() | (10) |
Optimizing the parameters
The 22 adjustable parameters Pk(i), Ps and Pd are obtained through an optimization procedure which ensures that for each protein sequence in a training set, the PDB (native) structure is the lowest in energy compared with a very large set of alternative conformations. Equation (11) represents the inequality generated for every alternative conformation sampled for each protein sequence in the training set. Enat is the energy of the native sequence in the native conformation and Ealt is the energy of the native sequence in an alternative conformation. The constant m is an arbitrary energy margin set to 10. The energies of each conformation are calculated using Equation (10) and are a linear function of the adjustable parameters, so the parameters can be obtained by solving the system of linear inequalities (Crippen, 1991; Maiorov and Crippen, 1992
).
![]() | (11) |
We generate alternative conformations for each native sequence by threading (Hendlich et al., 1990; Crippen, 1991
). For a query sequence of length N, any contiguous structural segment of N residues taken from a PDB structure of length N or greater can be evaluated as a possible structure. So a PDB structure of length M can produce M N + 1 alternative conformations for a query sequence. Each conformation is simplified to C
and Cß carbons, which take on the identity of the amino acids in the query sequence. An artificial Cß is generated where the template contains a glycine.
The training procedure begins with arbitrarily set parameters, then a sequence from a training set of proteins is threaded through a set of template structures to produce the alternatives. Compliance with Equation (11) is checked for each alternative conformation and we continue to the next threading if Equation (11) is satisfied. Whenever a violation of Equation (11) is encountered, the current inequality (constraint) is added to the problem which is then solved for the adjustable parameters. The procedure is repeated for each native and continues until Equation (11) is satisfied for every alternative structure produced for every native sequence. Alternative structures classified as near-native are excluded from compliance with Equation (11). Since we are satisfying inequalities, the solution space can be large for a given problem. The parameters are optimized by choosing the solution which minimizes the sum of adjustable parameters.
Definition of near-native during training, the burial RMSD
The overall goal is to construct a potential which can identify the native or a near-native structure for a given protein sequence out of a large set of alternatives. So during training we do not require that native-like structures have a higher energy than the native. The native structure is generally accepted to be the crystal or NMR structure in the PDB. The definition of near-native is much more ambiguous. Similarity of two structures is conventionally measured by the coordinate root-mean-square deviation (RMSD) or the distance RMSD (DME). These measures of similarity are visually appealing since they classify structures that look the same as similar. However, they are not satisfactory when we consider structures which are energetically near-native. Motion of solvent-exposed loops or hinge-like motion between large protein domains do not necessarily imply changes in free energy, yet may result in RMSD differences. It is desirable to have a measure of structural similarity that is consistent with our measure of energy and independent of sequence length. Since the energy potential is a function of solvation, for training purposes we define near-native as those structures which have a solvation profile similar to that of the native. This definition allows for energetically similar structures to be classified as near-native despite coordinate differences. The burial root-mean-square deviation (BRMSD) is defined as
![]() | (12) |
For a given sequence in two different conformations, N is the number of residues in the protein and Ci,a and Ci,b are the total number of contacts tabulated for residue i in structures a and b, respectively, using our contact definitions.
To assess the relationship of BRMSD to sequence length, 18 PDB structures from 43 to 333 residues in length were analyzed. The 18 sequences were threaded through 121 randomly selected PDB structures and the BRMSD was measured for each alternative conformation in reference to the native structure (data not shown). The average BRMSD shows little dependence on sequence length. The apparent independence of sequence length is appealing because one BRMSD cut-off value may be suitable to define near-native for proteins of various sequence lengths. To determine whether conventionally defined structural homologues fall within a given value of BRMSD, the coordinate RMSD and BRMSD were compared for 12 structural homologues (RMSD between 0.4 and 2.4) of protein 135l, turkey egg white lysozyme. All homologues had a BRMSD below 10, so a BRMSD of 10 was selected as our near-native cut-off and structures with a BRMSD <10 relative to the native are exempt from compliance with Equation (11) during training.
The training set
We trained the potential function to succeed in fold recognition for the general population of proteins while also providing for success with small, disulfide-bearing proteins. With a disulfide term in the potential function it is important to consider both types of proteins having multiple cysteines. For proteins with more than one cysteine, we define type I proteins as those having the maximum possible number of disulfide bonds in the native structure, given the number of cysteines in the sequence. Proteins with multiple cysteines but fewer than the maximum number of possible disulfides are designated as type II. For example, a protein with six cysteines and three disulfides would be classified as type I, whereas a protein with six cysteines but only two disulfides would be classified a type II. Since we want to ensure that the potential can identify the native structure, not simply the conformation which provides the most disulfide bonds, it is important to include both types in our training and test sets. Our training set of 22 proteins contains a variety of native structures, including seven without cysteines, five type I proteins and five type II. Table III lists the proteins in the training set and shows that the three general structural classes (
, ß,
/ß) are also included. During training these native sequences were threaded through 87 template proteins, listed in Table IV
. The templates provided more than 31 000 alternative conformations, on average, for each native in the training set.
|
|
Testing of the trained energy potential was performed using two sets of test proteins. For a general test with representative proteins and to compare results with a frequently cited energy potential, we used the test set of Miyazawa and Jernigan (1996). It is an ungapped threading test where the goal is to identify the native structure as the lowest in energy for each of 88 test sequences threaded through 189 template structures. The 88 test proteins are structurally dissimilar and have <35% sequence identity to any other protein in the set. The structural diversity and variety of sequence length in this test set provide a comprehensive test scenario. We used simple, ungapped threading as the initial test for the potential function because it remains an essential requirement. If the native structure cannot be consistently identified from a set of ungapped decoys, there is little hope that the potential will succeed with gapped structures.
We created a second test to assess rigorously the accuracy of the potential function specifically with small, disulfide-bearing proteins. The test set was composed of 71 proteins taken from the PDB, all less than 150 residues in length and having at least one disulfide bond. These proteins were threaded, without gaps, through the 1347 PDB structures used in our survey of atomic contacts. The templates provided, on average, >250 000 alternative conformations for each native in the test set. The 71 proteins of the disulfide test set are listed in Table V.
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
Here, we made an initial attempt to consider stabilizing groups by accounting for disulfide bonds and the results reveal the utility of the approach. Our potential correctly identified 28 of 29 disulfide-bearing proteins in the M&J test set. Of the 11 disulfide proteins with <100 amino acids (those likely dependent on disulfide bonds for stability), 91% were correctly identified with our potential while only 36% of this group were correct when using the potential function of Miyazawa and Jernigan. The one disulfide-bearing protein that failed with our potential was 1fxd, ferredoxin. This native, an example of a type II protein, has one disulfide bond and four additional cysteines involved in binding an ironsulfur cluster. The correct disulfide bond was identified by the potential function and the native structure ranked second only to a conformation that provided an additional disulfide. However, we note that our potential does not just mindlessly select the conformation having the most disulfide bonds. The M&J test includes 33 type II proteins and 86% were correctly ranked number one, excluding the four cytochrome cs. Many alternative conformations with disulfides were available for the type II sequences, yet the potential consistently identified the true native conformations which have no disulfide bonds. It is clear that our potential is not overtrained with regard to disulfide-bearing proteins.
Our disulfide test set provided a thorough evaluation of the potential function with small proteins having disulfide bonds. Table V lists the proteins used in the disulfide test set. The templates are the representative set of 1347 PDB structures having no greater than 95% sequence identity used in our contact survey. The high threshold for identity allows significant structural coverage among the templates, reducing native structure uniqueness. Over 250 000 alternative conformations were provided, on average, for each of the 71 test proteins. The potential function correctly identified the native or a similar conformation, as the lowest in energy (`best structure') for 80% of the test sequences. For these, the best structure is listed as `native' or by the template and threading offset of the native-like conformation, along with the RMSD compared to the native. The offset is the number of residues from the beginning of the template chain where the sequence to structure alignment begins. The offset is zero where not explicitly listed. For nine of the 14 proteins that failed, the native structure was within the 10 conformations having the lowest energy.
The results indicate that our potential is capable of identifying structural homologues and analogues. The former share a common ancestral sequence, while the latter are encountered when two proteins share a common fold but are not obviously related in function or sequence. With our disulfide test, in most cases where the potential function succeeded in identifying a structurally similar conformation in place of the native, the native and best structure were homologues. This demonstrates the ability of the potential function to identify related proteins. Apparent analogues are also identified. The lowest energy conformation identified for sequence 1gur was template 1eit, with an offset of zero. There is no obvious relationship between these two proteins, other than three common disulfide bonds, yet they have very similar structures with an RMSD of 2.3 Å. They share 28% sequence identity over 34 residues, below the threshold for structural homology considering the sequence length (Sander and Schneider, 1991; Abagyan and Batalov, 1997
). The functions of these two proteins also seem unrelated; 1gur is a sweet taste-suppressing polypeptide from the tree Gymnema sylvestre, and 1eit is a neurotoxin from the funnel web spider. Despite insignificant similarity in sequence and function, the energy potential is able to identify the common fold. We are encouraged by the results and a more extensive test and analysis of analogue identification are under way.
In addition to recognizing the correct fold, the potential function is also very accurate in identifying the actual disulfide bonds in each native structure. The test proteins in Table V have a total of 210 disulfide bonds and 98% were correctly identified. Three of the five incorrectly identified bonds have CßCß distances beyond the allowed range of Equation (9). These disulfides are from NMR structures and using the first model of each PDB file reveals that two of the bonds, cysteine pairs 1733 and 318 in 1gur, have anomalous disulfide geometry with respective S
S
distances of 3.69 and 4.99 Å, respectively.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Ungapped threading tests were used to ensure the potential satisfies the fundamental requirement of fold recognition: identification of the native structure. This has been demonstrated, so more comprehensive threading scenarios are being investigated. The results of our disulfide test clearly show that the potential is capable of identifying homologous structures. However, for sequences with numerous homologues having varying numbers and patterns of disulfides, it remains to be determined which factors most influence the ranking. Combining our potential with a gapped threading procedure will allow us to determine whether top-ranked homologues are more accurate in RMSD or in the number and pattern of disulfides when compared with the native.
The contribution of the disulfide term is evident when the results are compared with training and testing without the disulfide component in the potential function. The training set used in this work provides an infeasible problem when the disulfide component is removed from the potential. In other words, without the disulfide term there is no set of parameters which ensures the native structure is the lowest in energy out of all alternative conformations for each sequence in our training set. The solvation component along with the overpacking term cannot account for the stability of our native proteins. Using only the solvation and overpacking components trained with a set of proteins similar in number and size to the current training set provided a potential that was >80% accurate for proteins larger than 110 residues, but it did poorly when tested with small proteins, particularly those with disulfide bonds (Dombkowski and Crippen, 1999). Without the disulfide component we were only 36% accurate in identifying the native structure for small (<100 residues), disulfide-bearing proteins in the M&J test. The same potential was only 39% accurate with our disulfide test set. The results improved dramatically with the present potential to 91% and 80%, respectively. We note that comparing results using different, though similar, training sets is complex. However, it is evident that including the disulfide component in the potential allows training with a previously infeasible set of proteins and results in significantly improved fold recognition accuracy.
Since disulfide bonds occur most frequently in extracellular proteins, it would be advantageous to use a disulfide-cognizant potential for structure predictions of secreted proteins. Pre-screening for secreted proteins is possible using a very accurate computational tool, TargetP (Emanuelsson et al., 2000). This neural network-based program is trained to identify the N-terminal target sequence which specifies the cellular destination of a protein. It is >90% accurate in identifying proteins destined for the secretory pathway. The authors estimated that 10% of proteins in Homo sapiens and Arabidopsis thaliana are secreted, excluding membrane proteins. Analysis of our 1347 representative PDB structures revealed that 38% have at least one disulfide bond, 29% are type I proteins with the maximum possible number of disulfides given the number of cysteines and 44% are type II with at least two cysteines but fewer than the maximum possible number of disulfides. Protein fold recognition efforts should benefit from consideration of such a common structural feature.
In addition to identifying probable disulfides in threaded sequences, the disulfide recognition algorithm could easily be adapted to identify possible mutation sites for inserting disulfides in a protein. Where it is desirable to increase protein stability through the introduction of a disulfide bond, our algorithm could be modified to identify all pairs of residues that would form a disulfide if they were cysteines. The procedure entails threading a sequence of all cysteines on to the target structure and recording all possible bonds. However, the identification of putative disulfides assumes that the C and Cß coordinates are not significantly perturbed by the mutation from the original amino acid to cysteine.
Developers of fold recognition methods have striven to create potential functions which are successful for the general population of proteins. However, some types of proteins have remained particularly troublesome in structure identification attempts. Since the major contribution to most potential functions is the identification of a suitable hydrophobic core for a given sequence, it is not surprising that small proteins having little of a core and dependent on prosthetic group interactions for stability have failed prediction efforts. Here, we have continued the focus on hydrophobic burial while improving on contact definitions and accounting for disulfide-enhanced stability. It is a step towards improved fold recognition by accommodating post-translational modifications. Our success with disulfide bonds suggests that a similar approach may be useful for other types of stabilizing interactions. Our difficulty with cytochrome c suggests that heme group ligation may be one of many appropriate targets.
![]() |
Notes |
---|
![]() |
Acknowledgments |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Abkevich,V.I. and Shakhnovich,E.I. (2000) J. Mol. Biol., 300, 975985.[ISI][Medline]
Betz,F.B. (1993) Protein Sci., 2, 15511558.
Colonna-Cesari,F. and Sander,C. (1990) Biophys. J., 57, 11031107.[Abstract]
Cornell,W.D., Cieplak,P., Bayly,C.I., Gould,I.R.,Merz,K.M.,Jr, Ferguson,D.M., Spellmeyer.,D.C., Fox,T., Caldwell,J.W. and Kollman,P.A. (1995) J. Am. Chem. Soc., 117, 51795197.[ISI]
Creighton,T.E. (1993) Proteins, Structures and Molecular Properties. Freeman, San Francisco, p. 318.
Crippen,G.M. (1991) Biochemistry, 30, 42324237.[ISI][Medline]
Dombkowski,A.A. and Crippen,G.M. (1999) In Proceedings of the Third Annual International Conference on Computational Molecular Biology, Istrail,S., Pevzner,P., Waterman,M., eds. Association for Computing Machinery Press, New York, pp. 145153.
Dumont,M.E., Corin,A.F. and Campbell,G.A. (1994) Biochemistry, 33, 73687378.[ISI][Medline]
Emanuelsson,O., Nielsen,H., Brunak,S. and von Heijne, G. (2000) J. Mol. Biol., 300, 10051016.[ISI][Medline]
Hazes,B. and Dijkstra,B.W. (1988) Protein Eng., 2, 119125.[Abstract]
Hendlich,M., Lackner,P., Weitckus,S., Floeckner,H., Froschauer,R., Gottsbacher, Casari,G. and Sippl,M.J. (1990) J. Mol. Biol., 216, 167180.[ISI][Medline]
Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409417.
Huang,E.S., Subbiah,S. and Levitt,M. (1995) J. Mol. Biol., 252, 709720.[ISI][Medline]
Huber,T. and Torda,A.E. (1998) Protein Sci., 7, 142149.
Jones,D.T. (1997) Curr. Opin. Struct. Biol., 7, 377387.[ISI][Medline]
Kocher,J.A., Rooman,M.J. and Wodak,S.J. (1994) J. Mol. Biol., 235, 15981613.[ISI][Medline]
Maiorov,V.N. and Crippen,G.M. (1992) J. Mol. Biol., 227, 876888.[ISI][Medline]
Miyazawa,S. and Jernigan,R.L. (1985) Macromolecules, 18, 534552.[ISI]
Miyazawa,S. and Jernigan,R.L. (1996) J. Mol. Biol., 256, 623644.[ISI][Medline]
Narhi,L.O., Hua,Q.X., Arakawa,T., Fox,G.M., Tsai,L., Rosenfeld,R., Holst,P., Miller,J.A. and Weiss,M.A. (1993) Biochemistry, 32, 52145221.[ISI][Medline]
Park,B.H., Huang,E.S. and Levitt,M. (1997) J. Mol. Biol., 266, 831846.[ISI][Medline]
Petersen,M.T., Jonson,P.H. and Petersen,S.B. (1999) Protein Eng., 12, 535548.
Sander,C. and Schneider,R. (1991) Proteins, 9, 5668.[ISI][Medline]
Sippl,M.J. (1990) J. Mol. Biol., 213, 859883.[ISI][Medline]
Srinivasan,N., Sowdhamini,R., Ramakrishnan,C. and Balaram,P. (1990) Int. J. Pept. Protein Res., 36, 147155.[ISI][Medline]
Thomas,P.D. and Dill,K.A. (1996) J. Mol. Biol., 257, 457469.[ISI][Medline]
Thornton,J.M. (1981) J. Mol. Biol., 151, 261287.[ISI][Medline]
Zhang,C. and Kim,S. (2000) Proc. Natl Acad. Sci. USA, 97, 25502555.
Zhang,J., Matthews,J.M., Ward,L.D. and Simpson,R.J. (1997) Biochemistry, 36, 23802389.[ISI][Medline]
Received April 6, 2000; revised August 7, 2000; accepted August 17, 2000.