1 Section of Neurobiology, Yale University School of Medicine, New Haven, CT, USA, 3 Center for Molecular and Biomolecular Informatics, University of Nijmegen, PO Box 9010, 6500 GL Nijmegen, The Netherlands and 5 Biostructure Department, Novo Nordisk A/S, Måløv, Denmark
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: correlated mutation analysis/PDB-derived likelihood matrix/protein evolution/protein folding/protein residue contact
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
During evolution, orthologs of a given protein accumulate sequence variability; yet this variability is restricted by the need to conserve the proper protein fold and function (Zuckerkandl and Pauling, 1965). Analysis of covariation at two or more sequence positions can provide information on their structural interactions (Altschuh et al., 1987
, 1988
; Göbel et al., 1994
; Neher, 1994
; Shindyalov et al., 1994
; Taylor and Hatrick, 1994
; Lichtarge et al., 1996
; Olmea and Valencia, 1997
; Olmea et al., 1999; Pazos et al., 1997
). This concept forms the basis of correlated mutation analysis (CMA), illustrated in Figure 1
. CMA identifies pairs of residues that tend to remain constant or mutate in tandem. These pairs, while not necessarily nearby in the primary sequence, form three-dimensional contacts (and functional interactions) at an above-random frequency. Hence, CMA may be useful as a structure prediction tool. A number of correctly predicted contacts, if well dispersed on the peptide chain, can provide sufficient restraints to predict tertiary structure. In combination with secondary structure prediction and folding algorithms, de novo structure prediction may be feasible.
|
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
To derive contact likelihood scores for amino acid pairs, we analyzed a training set of 672 non-redundant, randomly-selected PDB files from PDBSELECT (Hobohm and Sander, 1995). As in previous studies, a number of steps were taken to limit computation time: analysis for each protein entry was limited to one subunit, 3000 residues and 50 000 atoms (values were similar to other studies). To avoid contacts involved in secondary structure, residue pairs with a chain separation less than 10 were excluded.
Any two residues with side chain heavy atoms or -carbon atoms within 4.75 Å were considered to be in contact. For each pair of amino acids of types a and b (such as a = aspartate, b = lysine), an odds ratio matrix M (a, b) was calculated as shown in Equation (1), where C (a, b) equals the number of contacts observed between types a and b, C equals the total number of contacts for all types, P (a, b) equals the number of pairs compared for types a and b, and P equals the total number of pairs compared.
![]() | (1) |
Exploration of parameter space
We tested >20 independent HSSP files to validate our parameters (see methods below). We found that limits of 3000 residues and 50 000 atoms made no difference in results. For the structure validation (described below), use of 2075 sequences was no different than use of 10300 sequences. The optimal penalty for comparisons with all anionic or cationic residues was found to be 5, but results were similar for values of 010.
Sequence analysis
We proceeded to test the accuracy of predictions on a validation set of 118 non-redundant protein families from the HSSP database (Sander and Schneider, 1991). These files were randomly selected from PDBSELECT and excluded from the training set. Each HSSP file contains an alignment of protein sequences similar to that for the PDB structure. HSSP files were randomly selected from those that contained at least 20 sequences (for adequate sample size) and included only one subunit (to avoid inter-subunit contacts, which probably involve different contact rules). We calculated a CMA score S (x, y) for each pair of sequence positions as shown in Equation (2), where n equals the length of the sequence alignment and p and q denote individual sequences. M (x, y) denotes the matrix value for symbols xp and yp. D (x, y) is the correlated mutation operator, equal to 1 if the xp and yp differ, 0 otherwise. A penalty of 5 was applied for comparisons involving all charged residues (to filter out protein surfaces). We considered x and y values with a minimum chain distance of 10 (to filter out secondary structure) and a maximum of 100 (to speed computation and keep predictions within domains). Sequence pairs with >90% conservation were removed due to insufficient information content.
![]() | (2) |
Testing the method
For each HSSP file we compared the predicted contacts with observed contacts in the corresponding PDB file. For a preliminary analysis, we chose to predict 0.1n high-scoring pairs as contacts, where n is the sequence length. Debe and Goddard (Debe and Goddard, 1999) have shown that six contacts were sufficient to solve the structure of the 72-residue lexA repressor within 4.5 Å
-carbon r.m.s.d.
We compared predicted contacts with the corresponding experimentally determined structures. As in previous studies, accuracy was calculated as the number of correctly predicted contacts divided by the number of predicted contacts. A benchmark was needed to assess these accuracy levels. We first calculated the accuracy expected at random, equivalent to the number of contacts in the structure divided by the number of residue pairs, with minimum chain separation taken into account. It was also important to determine the distribution of scores expected at random. We ran 20 Monte Carlo permutation tests, in which we created a list of all possible contacts, ranked them in random order, and calculated the accuracy achieved by chance. Performed 20 times, this method yields a 95% confidence level.
![]() |
Results and discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
With respect to RP, the best prediction was for thaumatin (1thv; Ko et al., 1994), a 207-residue protein with an RP of 23.6 and an accuracy of 45%. Contacts predicted for thaumatin are summarized in Table I
. Figure 3
compares the predicted contacts with the true contact map. Thaumatin contains both ß-sheets and
-helices. All of the correctly predicted contacts were between ß-sheets. The 217-amino acid adenylate kinase (1zin; Berry and Phillips, 1998
) yielded an RP of 19.1 and an accuracy of 33%. In this case, four of the contacts were between ß-sheets and
-helices, two between ß-sheets, and two between
-helices. For the 91-amino acid fibronectin cell adhesion molecule (1fna; Dickinson et al., 1994
), an all ß-sheet protein, RP was 18.3 and accuracy was 67%. The method did not perform equally well for all protein families: 15 of the 118 families tested had no correctly predicted pairs. In future work we hope to identify a systematic difference in fold type, function, or composition that differentiate families with high RP from those with lower values.
|
|
The number of contacts predicted for each protein family also influenced performance. When only 0.05n contacts were predicted (with a 4.5 Å cut-off), mean RP rose to 7.9, mean accuracy to 18%. The trade-off is lowered sensitivity: fewer contacts were predicted. Likewise, prediction of more contacts increased sensitivity but reduced the accuracy (specificity). With a 0.25n cut-off the mean RP fell to 5.1 and accuracy to 12%.
Table II shows the contact likelihood matrix. Unlike most matrices, this one is not symmetric. Side chain contacts, most of which result from non-bond interactions, may nonetheless be influenced by the asymmetric protein backbone. The diagonal elements represent the likelihood of contact between pairs of identical residue types. The very value of 9.6 for C:C pairs reflects the prevalence of disulfide bonds that occur in proteins secreted into the extracellular space, which typically has a net oxidizing redox status. We expect that this figure will not be so high for cytosolic and membrane proteins. The other diagonal elements are distributed between those for the charged residues that have suppressed values on account of the repulsive effect of like charges, and those that are higher than average, reflecting propensities for hydrogen bond interactions for the polar (but uncharged) residue types and hydrophobic interactions for the non-polar types. Concerning the latter, there is a pronounced tendency for the aromatic types, together with methionine, to have higher values than the aliphatic residue types. This is true both for the likelike (diagonal) and unlikeunlike (off-diagonal) cases. These residue types are not only hydrophobic, but can participate in interactions involving pi-orbitals of the aromatic rings (and the empty d-orbitals of the SD atom in methionine). Glycine prefers to associate with bulky side chains, suggestive of a propensity for knobs-into-holes packing originally proposed by Crick (Crick, 1953
).
|
Our method contains many parameters that can be optimized for different protein families or for characteristics known a priori from the sequence. Our exploration of this parameter space has so far been limited by CPU considerations. The key result of this work is that a matrix derived from contacts is most suitable for the prediction of contacts. We consider that this intuitive point should remain valid for any choice of CMA methods or parameters.
With incremental improvements, CMA may soon permit provisional structure prediction before experimental structural data become available. One way in which further improvement might be achieved is by checking whether different patterns of residue contacts prevail in different fold classes (all- or all-ß, for example). These methods may help us to confront the recent proliferation of genome information and to explore the fold space of protein families not yet solved experimentally. CMA may also aid in the refinement and validation of low-resolution structures. Even more important, the discovery of better structure prediction methods should illuminate the processes by which a nascent polypeptide adopts its native structure, and the mechanisms by which related proteins preserve similar structures despite evolutionary drift.
![]() |
Notes |
---|
4 To whom correspondence should be addressed. E-mail: vriend{at}cmbi.kun.nl
![]() |
Acknowledgments |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Altschuh,D., Vernet,T., Berti,P., Moras,D. and Nagai,K. (1988) Protein Eng., 2, 193199.[Abstract]
Berry,M.B. and Phillips,G.N.,Jr (1998) Proteins, 32, 276.[CrossRef][ISI][Medline]
Crick,F.H.C. (1953) Acta Crystallogr., 6, 689697.[CrossRef][ISI]
Debe,D.A. and Goddard,W.A.,3rd. (1999) J. Mol. Biol., 294, 619625.[CrossRef][ISI][Medline]
Dickinson,C.D., Veerapandian,B., Dai,X.P., Hamlin,R.C., Xuong,N.H., Ruoslahti,E. and Ely,K. (1994) J. Mol. Biol., 236, 1079.[ISI][Medline]
Göbel,U., Sander,C., Schneider,R. and Valencia,A. (1994) Proteins, 18, 309317.[ISI][Medline]
Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522524.
Ko,T.P., Day,J., Greenwood,A. and Mcpherson,A. (1994) Acta Crystallogr. D Biol. Crystallogr., 50, 813.[CrossRef][ISI]
Lichtarge,O., Bourne,H.R. and Cohen,F.E. (1996) J. Mol. Biol., 257, 342358.[CrossRef][ISI][Medline]
Neher,E. (1994) Proc. Natl Acad. Sci. USA, 91, 98102.[Abstract]
Olmea,O. and Valencia,A. (1997) Fold Des., 2, S25S32.[ISI][Medline]
Pazos,F., Helmer-Citterich,M., Ausiello,G. and Valencia,A. (1997) J. Mol. Biol., 29, 511523.
Sander,C. and Schneider,R. (1991) Proteins, 9, 5658.[ISI][Medline]
Shindyalov,I.N., Kolchanov,N.A. and Sander,C. (1994) Protein Eng., 7, 349358.[Abstract]
Singer,M.S., Oliveira,L., Vriend,G. and Shepherd,G.M. (1995a) Receptors Channels, 3, 8995.[ISI][Medline]
Singer,M.S., Shepherd,G.M. and Greer,C.A. (1995b) Nature, 377, 1920.[ISI][Medline]
Taylor,W.R. and Hatrick,K. (1994) Protein Eng., 7, 341348.[Abstract]
Zuckerkandl,E. and Pauling,L. (1965) In Bryson,V. and Vogel,H.J. (eds), Evolving Genes and Proteins. Academic Press, New York, pp. 97166.
Received January 29, 2002; revised May 16, 2002; accepted May 24, 2002.