Prediction of protein residue contacts with a PDB-derived likelihood matrix

Michael S. Singer1,2, Gert Vriend3,4 and Robert P. Bywater5

1 Section of Neurobiology, Yale University School of Medicine, New Haven, CT, USA, 3 Center for Molecular and Biomolecular Informatics, University of Nijmegen, PO Box 9010, 6500 GL Nijmegen, The Netherlands and 5 Biostructure Department, Novo Nordisk A/S, Måløv, Denmark


    Abstract
 Top
 Abstract
 Introduction
 Methods
 Results and discussion
 References
 
Proteins with similar folds often display common patterns of residue variability. A widely discussed question is how these patterns can be identified and deconvoluted to predict protein structure. In this respect, correlated mutation analysis (CMA) has shown considerable promise. CMA compares multiple members of a protein family and detects residues that remain constant or mutate in tandem. Often this behavior points to structural or functional interdependence between residues. CMA has been used to predict pairs of amino acids that are distant in the primary sequence but likely to form close contacts in the native three-dimensional structure. Until now these methods have used evolutionary or biophysical models to score the fit between residues. We wished to test whether empirical methods, derived from known protein structures, would provide useful predictive power for CMA. We analyzed 672 known protein structures, derived contact likelihood scores for all possible amino acid pairs, and used these scores to predict contacts. We then tested the method on 118 different protein families for which structures have been solved to atomic resolution. The mean performance was almost seven times better than random prediction. Used in concert with secondary structure prediction, the new CMA method could supply restraints for predicting still undetermined structures.

Keywords: correlated mutation analysis/PDB-derived likelihood matrix/protein evolution/protein folding/protein residue contact


    Introduction
 Top
 Abstract
 Introduction
 Methods
 Results and discussion
 References
 
There has been considerable interest in predicting protein tertiary structure from primary sequence. This interest has only intensified with the recent proliferation of genome sequence information. Beyond the practical value of such methods, they should also provide insights into fundamental principles of protein folding and evolution.

During evolution, orthologs of a given protein accumulate sequence variability; yet this variability is restricted by the need to conserve the proper protein fold and function (Zuckerkandl and Pauling, 1965Go). Analysis of covariation at two or more sequence positions can provide information on their structural interactions (Altschuh et al., 1987Go, 1988Go; Göbel et al., 1994Go; Neher, 1994Go; Shindyalov et al., 1994Go; Taylor and Hatrick, 1994Go; Lichtarge et al., 1996Go; Olmea and Valencia, 1997Go; Olmea et al., 1999; Pazos et al., 1997Go). This concept forms the basis of correlated mutation analysis (CMA), illustrated in Figure 1Go. CMA identifies pairs of residues that tend to remain constant or mutate in tandem. These pairs, while not necessarily nearby in the primary sequence, form three-dimensional contacts (and functional interactions) at an above-random frequency. Hence, CMA may be useful as a structure prediction tool. A number of correctly predicted contacts, if well dispersed on the peptide chain, can provide sufficient restraints to predict tertiary structure. In combination with secondary structure prediction and folding algorithms, de novo structure prediction may be feasible.



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 1. Illustration of CMA theory and application. (A) Several residues are shown in their structure context, in this example, two nearby {alpha}-helices. (B) For these residues, six sequences (A–F) are shown as a multiple alignment. Positions 1 and 3 show correlated substitutions (connected by arrows), as do positions 5 and n. (C) The most parsimonious evolutionary pathways are between sequences A and F, for positions 1 and 3. CMA detects pairs of residue positions that show correlated substitutions without intermediates. The theory is that when a mutation occurs in a structurally important residue (mutation 1), the intermediate has structural instability. Compensatory mutations are then selected (mutation 2) and the structural interaction is restored. Any intermediates are eventually eliminated from the sequence record due to reduced fitness. The likelihood of pre-mutation and post-mutation structural interactions is assessed by the contact matrix terms in Equation (2Go).

 
CMA methods require matrices to compare residues and calculate prediction scores. Until now these matrices have been based on amino acid identity (Shindyalov et al., 1994Go; Singer et al., 1995aGo,bGo), substitution (Göbel et al., 1994Go), or biophysical complementarity, such as charge or total side chain volume (Neher, 1994Go). Each of these matrices is based on a priori models of non-covalent residue interactions or protein evolution. Here we demonstrate the predictive power of contact likelihood matrices derived from previously solved structures in the Protein Data Bank (PDB).


    Methods
 Top
 Abstract
 Introduction
 Methods
 Results and discussion
 References
 
Calculation of empirical matrix from PDB

To derive contact likelihood scores for amino acid pairs, we analyzed a training set of 672 non-redundant, randomly-selected PDB files from PDBSELECT (Hobohm and Sander, 1995). As in previous studies, a number of steps were taken to limit computation time: analysis for each protein entry was limited to one subunit, 3000 residues and 50 000 atoms (values were similar to other studies). To avoid contacts involved in secondary structure, residue pairs with a chain separation less than 10 were excluded.

Any two residues with side chain heavy atoms or {alpha}-carbon atoms within 4.75 Å were considered to be in contact. For each pair of amino acids of types a and b (such as a = aspartate, b = lysine), an odds ratio matrix M (a, b) was calculated as shown in Equation (1), where C (a, b) equals the number of contacts observed between types a and b, C equals the total number of contacts for all types, P (a, b) equals the number of pairs compared for types a and b, and P equals the total number of pairs compared.


(1)

Exploration of parameter space

We tested >20 independent HSSP files to validate our parameters (see methods below). We found that limits of 3000 residues and 50 000 atoms made no difference in results. For the structure validation (described below), use of 20–75 sequences was no different than use of 10–300 sequences. The optimal penalty for comparisons with all anionic or cationic residues was found to be 5, but results were similar for values of 0–10.

Sequence analysis

We proceeded to test the accuracy of predictions on a validation set of 118 non-redundant protein families from the HSSP database (Sander and Schneider, 1991Go). These files were randomly selected from PDBSELECT and excluded from the training set. Each HSSP file contains an alignment of protein sequences similar to that for the PDB structure. HSSP files were randomly selected from those that contained at least 20 sequences (for adequate sample size) and included only one subunit (to avoid inter-subunit contacts, which probably involve different contact rules). We calculated a CMA score S (x, y) for each pair of sequence positions as shown in Equation (2), where n equals the length of the sequence alignment and p and q denote individual sequences. M (x, y) denotes the matrix value for symbols xp and yp. D (x, y) is the correlated mutation operator, equal to 1 if the xp and yp differ, 0 otherwise. A penalty of 5 was applied for comparisons involving all charged residues (to filter out protein surfaces). We considered x and y values with a minimum chain distance of 10 (to filter out secondary structure) and a maximum of 100 (to speed computation and keep predictions within domains). Sequence pairs with >90% conservation were removed due to insufficient information content.


(2)

Testing the method

For each HSSP file we compared the predicted contacts with observed contacts in the corresponding PDB file. For a preliminary analysis, we chose to predict 0.1n high-scoring pairs as contacts, where n is the sequence length. Debe and Goddard (Debe and Goddard, 1999Go) have shown that six contacts were sufficient to solve the structure of the 72-residue lexA repressor within 4.5 Å {alpha}-carbon r.m.s.d.

We compared predicted contacts with the corresponding experimentally determined structures. As in previous studies, accuracy was calculated as the number of correctly predicted contacts divided by the number of predicted contacts. A benchmark was needed to assess these accuracy levels. We first calculated the accuracy expected at random, equivalent to the number of contacts in the structure divided by the number of residue pairs, with minimum chain separation taken into account. It was also important to determine the distribution of scores expected at random. We ran 20 Monte Carlo permutation tests, in which we created a list of all possible contacts, ranked them in random order, and calculated the accuracy achieved by chance. Performed 20 times, this method yields a 95% confidence level.


    Results and discussion
 Top
 Abstract
 Introduction
 Methods
 Results and discussion
 References
 
Performance results for the 118 protein families are shown in Figure 2Go. Note that these results were obtained with a stringent distance cut-off of 4.5 Å. The CMA, with a mean prediction accuracy of 15.7%, performed much better overall than the random accuracy of 2.4%. Some families had accuracies approaching 70%, yet random accuracy never exceeded 5.8%. Longer proteins tended to yield lower prediction accuracies, similar to the results of Göbel et al. (Göbel et al., 1994Go). Because contacts in larger proteins are more difficult to predict, a more informative performance measure is the prediction accuracy divided by the random accuracy, referred to here as relative performance (RP). By definition, random predictions yield a mean RP of 1.00.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 2. Method and random (analytical) prediction accuracies for 118 protein families, shown as a function of protein size. The points form curves, because prediction accuracy is a quantal property. The curves for 0–3 correctly predicted contacts are labeled at left (in brackets).

 
Tests of the CMA method at the 4.5 Å cut-off showed a mean RP of 6.77 (SD 4.70), which was independent of protein size (linear correlation, r = 0.19). Simply stated, the CMA performed on average almost seven times better than random for a wide range of protein families. This result was supported by the Monte Carlo simulations, which showed a mean RP of 0.91 (SD 1.65), near the expected value of 1.00. RP values obtained by CMA were considerably better than Monte Carlo values. Moreover, for 68 of 118 protein families (58%), CMA performed significantly better than the maximum values in the respective Monte Carlo distributions (P < 0.05).

With respect to RP, the best prediction was for thaumatin (1thv; Ko et al., 1994Go), a 207-residue protein with an RP of 23.6 and an accuracy of 45%. Contacts predicted for thaumatin are summarized in Table IGo. Figure 3Go compares the predicted contacts with the true contact map. Thaumatin contains both ß-sheets and {alpha}-helices. All of the correctly predicted contacts were between ß-sheets. The 217-amino acid adenylate kinase (1zin; Berry and Phillips, 1998Go) yielded an RP of 19.1 and an accuracy of 33%. In this case, four of the contacts were between ß-sheets and {alpha}-helices, two between ß-sheets, and two between {alpha}-helices. For the 91-amino acid fibronectin cell adhesion molecule (1fna; Dickinson et al., 1994Go), an all ß-sheet protein, RP was 18.3 and accuracy was 67%. The method did not perform equally well for all protein families: 15 of the 118 families tested had no correctly predicted pairs. In future work we hope to identify a systematic difference in fold type, function, or composition that differentiate families with high RP from those with lower values.


View this table:
[in this window]
[in a new window]
 
Table I. Predicted contacts for the 207-residue protein thaumatin (1thv;Ko et al., 1994Go)
 


View larger version (18K):
[in this window]
[in a new window]
 
Fig. 3. Predicted and true contact maps for the N-terminal domain of thaumatin (1thv). Axes represent residues 1–99 (N-terminus on left and top). Red points, predicted contacts. True heavy atom distances: black, 5–6 Å; dark blue, 4–5 Å; mid-blue 3–4 Å; light blue 2–3 Å. The map was created with QUANTA(R) modeling software (Accelrys, San Diego, CA).

 
Percent accuracy and RP varied for different distance cut-offs. The RP score reached its maximum at a cut-off of 4.5 Å, but this does not necessarily imply that a cut-off of 4.5 Å is the most suitable for structure prediction. Cut-offs of 6 Å were associated with better prediction accuracies and are likely to work equally well as distance restraints. The 6 Å cut-off yielded 20% mean accuracy, the 8 Å cut-off 30%. Generally, these results did not depend on the distance cut-off used to build the likelihood matrix (data not shown).

The number of contacts predicted for each protein family also influenced performance. When only 0.05n contacts were predicted (with a 4.5 Å cut-off), mean RP rose to 7.9, mean accuracy to 18%. The trade-off is lowered sensitivity: fewer contacts were predicted. Likewise, prediction of more contacts increased sensitivity but reduced the accuracy (specificity). With a 0.25n cut-off the mean RP fell to 5.1 and accuracy to 12%.

Table IIGo shows the contact likelihood matrix. Unlike most matrices, this one is not symmetric. Side chain contacts, most of which result from non-bond interactions, may nonetheless be influenced by the asymmetric protein backbone. The diagonal elements represent the likelihood of contact between pairs of identical residue types. The very value of 9.6 for C:C pairs reflects the prevalence of disulfide bonds that occur in proteins secreted into the extracellular space, which typically has a net oxidizing redox status. We expect that this figure will not be so high for cytosolic and membrane proteins. The other diagonal elements are distributed between those for the charged residues that have suppressed values on account of the repulsive effect of like charges, and those that are higher than average, reflecting propensities for hydrogen bond interactions for the polar (but uncharged) residue types and hydrophobic interactions for the non-polar types. Concerning the latter, there is a pronounced tendency for the aromatic types, together with methionine, to have higher values than the aliphatic residue types. This is true both for the ‘like–like’ (diagonal) and ‘unlike–unlike’ (off-diagonal) cases. These residue types are not only hydrophobic, but can participate in interactions involving pi-orbitals of the aromatic rings (and the empty d-orbitals of the SD atom in methionine). Glycine prefers to associate with bulky side chains, suggestive of a propensity for ‘knobs-into-holes’ packing originally proposed by Crick (Crick, 1953Go).


View this table:
[in this window]
[in a new window]
 
Table II. Contact likelihood matrix
 
The PDB-derived matrix is not intended to replace previous CMA methods but rather to complement them with information derived from structural data. Features of other CMA approaches will likely improve on the scheme we have presented here. For example, in an effort to test the method in the simplest form possible, we did not take phylogenetic relations into account (Shindyalov et al., 1994Go; Lichtarge et al., 1996Go). Prediction accuracy may also be improved by considering chain separation, predicting residue triads or networks instead of pairs, and deriving different contact matrices for various protein fold classes. Furthermore, cellular or extracellular localization may impose different patterns of residue contact preferences, especially with respect to cysteine residues. Nucleic acid sequences may contain further information that could be exploited by derivation of a 64x64 codon matrix. We appreciate that the current two-dimensional matrix and scoring function (Equations 1 and 2GoGo) could be generalized to a four-dimensional matrix derived from multiple sequence alignments, but this will not be practical until more protein structures and sequences become available. We will consider this approach in future extensions of our work.

Our method contains many parameters that can be optimized for different protein families or for characteristics known a priori from the sequence. Our exploration of this parameter space has so far been limited by CPU considerations. The key result of this work is that a matrix derived from contacts is most suitable for the prediction of contacts. We consider that this intuitive point should remain valid for any choice of CMA methods or parameters.

With incremental improvements, CMA may soon permit provisional structure prediction before experimental structural data become available. One way in which further improvement might be achieved is by checking whether different patterns of residue contacts prevail in different fold classes (all-{alpha} or all-ß, for example). These methods may help us to confront the recent proliferation of genome information and to explore the fold space of protein families not yet solved experimentally. CMA may also aid in the refinement and validation of low-resolution structures. Even more important, the discovery of better structure prediction methods should illuminate the processes by which a nascent polypeptide adopts its native structure, and the mechanisms by which related proteins preserve similar structures despite evolutionary drift.


    Notes
 
2 Present address: Internal Medicine Residency Program, Brigham and Women’s Hospital, Boston, MA 02115, USA Back

4 To whom correspondence should be addressed. E-mail: vriend{at}cmbi.kun.nl Back


    Acknowledgments
 
We are grateful to Gordon M.Shepherd, Elias Lolis and Perry Miller for helpful discussions and review of this manuscript. This work was supported by fellowships to M.S. from the Yale MSTP and the NLM (IAIMS grant to Perry Miller), as well as grants to Gordon M.Shepherd from NIDCD, NIA, NASA, NIMH (Human Brain Project) and the DOD-ARO (Multidisciplinary University Research Initiative).


    References
 Top
 Abstract
 Introduction
 Methods
 Results and discussion
 References
 
Altschuh,D., Lesk,A.M., Bloomer,A.C. and Klug,A. (1987) J. Mol. Biol., 193, 693–707.[ISI][Medline]

Altschuh,D., Vernet,T., Berti,P., Moras,D. and Nagai,K. (1988) Protein Eng., 2, 193–199.[Abstract]

Berry,M.B. and Phillips,G.N.,Jr (1998) Proteins, 32, 276.[CrossRef][ISI][Medline]

Crick,F.H.C. (1953) Acta Crystallogr., 6, 689–697.[CrossRef][ISI]

Debe,D.A. and Goddard,W.A.,3rd. (1999) J. Mol. Biol., 294, 619–625.[CrossRef][ISI][Medline]

Dickinson,C.D., Veerapandian,B., Dai,X.P., Hamlin,R.C., Xuong,N.H., Ruoslahti,E. and Ely,K. (1994) J. Mol. Biol., 236, 1079.[ISI][Medline]

Göbel,U., Sander,C., Schneider,R. and Valencia,A. (1994) Proteins, 18, 309–317.[ISI][Medline]

Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522–524.[Abstract/Free Full Text]

Ko,T.P., Day,J., Greenwood,A. and Mcpherson,A. (1994) Acta Crystallogr. D Biol. Crystallogr., 50, 813.[CrossRef][ISI]

Lichtarge,O., Bourne,H.R. and Cohen,F.E. (1996) J. Mol. Biol., 257, 342–358.[CrossRef][ISI][Medline]

Neher,E. (1994) Proc. Natl Acad. Sci. USA, 91, 98–102.[Abstract]

Olmea,O. and Valencia,A. (1997) Fold Des., 2, S25–S32.[ISI][Medline]

Pazos,F., Helmer-Citterich,M., Ausiello,G. and Valencia,A. (1997) J. Mol. Biol., 29, 511–523.

Sander,C. and Schneider,R. (1991) Proteins, 9, 56–58.[ISI][Medline]

Shindyalov,I.N., Kolchanov,N.A. and Sander,C. (1994) Protein Eng., 7, 349–358.[Abstract]

Singer,M.S., Oliveira,L., Vriend,G. and Shepherd,G.M. (1995a) Receptors Channels, 3, 89–95.[ISI][Medline]

Singer,M.S., Shepherd,G.M. and Greer,C.A. (1995b) Nature, 377, 19–20.[ISI][Medline]

Taylor,W.R. and Hatrick,K. (1994) Protein Eng., 7, 341–348.[Abstract]

Zuckerkandl,E. and Pauling,L. (1965) In Bryson,V. and Vogel,H.J. (eds), Evolving Genes and Proteins. Academic Press, New York, pp. 97–166.

Received January 29, 2002; revised May 16, 2002; accepted May 24, 2002.





This Article
Abstract
FREE Full Text (PDF)
Alert me when this article is cited
Alert me if a correction is posted
Services
Email this article to a friend
Similar articles in this journal
Similar articles in ISI Web of Science
Similar articles in PubMed
Alert me to new issues of the journal
Add to My Personal Archive
Download to citation manager
Search for citing articles in:
ISI Web of Science (12)
Request Permissions
Google Scholar
Articles by Singer, M. S.
Articles by Bywater, R. P.
PubMed
PubMed Citation
Articles by Singer, M. S.
Articles by Bywater, R. P.