Pharmacy Department, King's College London, Manresa Road, London SW3 6LX, UK
Abstract |
---|
Keywords: protein structure prediction/solvent accessibility
Introduction |
---|
More consistent success has been achieved in predicting a quantifiable, usually discrete, one-dimensional aspect of protein structure (Rost and Sander, 1995). Such lower level structural features are of value by themselves, but are also seen as stepping stones to the prediction of three-dimensional structure. Initial interest concentrated on the prediction of secondary structure. Typically, for three-state predictions (α-helix, β-strand and turn), 65–70% prediction success has been achieved (Eisenhaber et al., 1995). With the addition of data from a multiple sequence alignment of homologous proteins, this has been raised to 75% (Frishman and Argos, 1997).
More recently, prediction of residue solvent accessibility has found favour. Solvent accessibility can be calculated from main- and side-chain atom positions in the three-dimensional structure of a protein, usually by the method of Kabsch and Sander (1983). The potential usefulness of this characteristic is linked to the discovery that burial of core residues may be a strong driving force in protein folding (Chan and Dill, 1990); knowledge of which residues in a protein are buried can improve the accuracy of tertiary structure prediction (Aszódi and Taylor, 1996). Solvent accessibility has also been used to predict the position of protein hydration sites, which may play an important part in a protein's function (Ehrlich et al., 1998).
Existing prediction methods use a variety of measures of residue burial: binary exposure categories (buried or exposed), ternary categories (buried, partially exposed or fully exposed) or more complex schemes (typically with 10 categories). To compensate for differences in amino acid side-chain size, residue exposed surface area is usually expressed as a percentage of the maximum exposed surface area, calculated by the methods of Rose et al. (1985), Chothia (1976) or Shrake and Rupley (1973). It is this normalized value that determines membership of a particular exposure category. The thresholds between categories vary in the reported work, which makes accurate comparison between methods difficult. As thresholds change, so does the number of residues in each category. Methods can appear most accurate when the residue distribution is highly skewed.
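The normalization and categorization described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any cited method: the function names, the example surface areas, the 16% binary threshold and the rule that a value equal to a threshold falls into the higher category are all assumptions made for the example.

```python
from bisect import bisect_right


def relative_accessibility(exposed_area, max_area):
    """Exposed surface area as a percentage of the residue's maximum."""
    return 100.0 * exposed_area / max_area


def exposure_category(rel_acc, thresholds):
    """Index of the exposure category for a percentage accessibility.

    thresholds is an ascending list of category boundaries in percent.
    A value equal to a boundary goes to the higher category (assumed rule).
    """
    return bisect_right(thresholds, rel_acc)


# Illustrative (exposed area, maximum area) pairs; not measured values.
residues = [(5.0, 100.0), (20.0, 200.0), (60.0, 150.0), (90.0, 120.0)]
rel = [relative_accessibility(area, max_area) for area, max_area in residues]

# Two categories with an assumed 16% threshold: 0 = buried, 1 = exposed.
binary = [exposure_category(r, [16.0]) for r in rel]

# Three categories with thresholds of 9% and 36%.
ternary = [exposure_category(r, [9.0, 36.0]) for r in rel]

print(rel)      # [5.0, 10.0, 40.0, 75.0]
print(binary)   # [0, 0, 1, 1]
print(ternary)  # [0, 1, 2, 2]
```

The same four residues fall into different groupings under the two schemes, which is why results obtained with different thresholds are hard to compare directly.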
The approaches used to predict solvent accessibility are similar to those used previously to predict secondary structure (Eisenhaber et al., 1995). The rationale is clear: in both cases the investigator hopes that the local amino acid sequence has a strong influence on the one-dimensional property under consideration. Non-local interactions are generally ignored, which is likely to account for the fact that results seem to indicate an upper limit on prediction success. Approaches meeting with success include the use of neural networks (Holbrook et al., 1990; Rost and Sander, 1994), residue substitution matrices (Pascarella et al., 1998) and Bayesian analysis (Thompson and Goldstein, 1996).
As with the prediction of secondary structure, the level of success achieved by different methods varies very little. Binary categories achieve 70–75% success, ternary categories around 55% and 10-state systems 20–25%. The use of data from sequence alignments of homologous proteins increases these values. In their report of the success of a neural network method, Rost and Sander (1994) quoted the success of random prediction in two-, three- and 10-category methods, presumably to give a bottom line by which more sophisticated methods should be judged. In recent work using genetic algorithms to evolve sequence motifs for the prediction of buried residues (unpublished), we used a more challenging bottom line. The results from this shed interesting light on the value of other prediction methods.
We present here this new baseline by which other methods of predicting residue solvent accessibility can be judged. The method takes no account of the local sequence surrounding a residue and makes predictions solely on the basis of the exposure category in which an amino acid is most often found. In this, it resembles the simplest of residue hydrophobicity or Bayesian methods. Comparison with existing methods demonstrates that these sophisticated approaches often show a surprisingly low increase in prediction success over our new baseline. Such an approach to estimating the minimum level of success that should be expected from novel prediction techniques has not previously been demonstrated.
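A baseline of this kind might be sketched as follows. The function names, the toy training data and the tie-breaking behaviour (the category seen first wins) are assumptions for illustration and are not taken from the original implementation.

```python
from collections import Counter, defaultdict


def train_baseline(residues):
    """Map each amino acid to the exposure category it occupies most often.

    residues is an iterable of (amino_acid, category) pairs. Ties are
    broken in favour of the category seen first (an assumed convention).
    """
    counts = defaultdict(Counter)
    for amino_acid, category in residues:
        counts[amino_acid][category] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}


def predict(assignments, sequence, default=None):
    """Predict a category for every residue, ignoring its neighbours."""
    return [assignments.get(aa, default) for aa in sequence]


# Toy training data (hypothetical, not taken from any structure set).
training = [
    ("L", "buried"), ("L", "buried"), ("L", "exposed"),
    ("K", "exposed"), ("K", "exposed"),
    ("G", "buried"),
]
table = train_baseline(training)
print(predict(table, "LKGL"))  # ['buried', 'exposed', 'buried', 'buried']
```

Because the lookup table depends only on residue identity, every leucine receives the same prediction regardless of its sequence context.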
Materials and methods |
---|
Predictions were made in one of two ways. In the first method, one set of protein structures was used to assign each amino acid to an exposure category. A second set of protein structures was used to assess the success of the predictions made on the basis of these assignments. In the second method, a jackknife approach was used. For a set of n structures, n − 1 were used to assign amino acids to exposure categories and the remaining one used to assess the success of the predictions. This was repeated n times with each protein structure in the data set being used once to assess the success of prediction.
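The jackknife procedure could be sketched as below, assuming each structure has already been reduced to a list of (amino acid, exposure category) pairs. The names `jackknife` and `most_common_category`, the toy structures and the choice to count residue types absent from the training set as failures are assumptions made for the example.

```python
from collections import Counter, defaultdict


def most_common_category(residues):
    """Assign each amino acid the category in which it is most often found."""
    counts = defaultdict(Counter)
    for amino_acid, category in residues:
        counts[amino_acid][category] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}


def jackknife(structures):
    """Leave-one-out assessment over a set of structures.

    structures maps a structure name to a list of (amino_acid, category)
    pairs. Returns {name: (correct, total)} for each held-out structure.
    """
    results = {}
    for held_out, test_residues in structures.items():
        # Pool the residues of the other n - 1 structures for training.
        training = [
            pair
            for name, residues in structures.items()
            if name != held_out
            for pair in residues
        ]
        table = most_common_category(training)
        # A residue type missing from the training set counts as a failure.
        correct = sum(table.get(aa) == cat for aa, cat in test_residues)
        results[held_out] = (correct, len(test_residues))
    return results


# Hypothetical structures: "b" = buried, "e" = exposed.
structures = {
    "A": [("L", "b"), ("K", "e"), ("L", "b")],
    "B": [("L", "b"), ("K", "e"), ("L", "b")],
    "C": [("L", "e"), ("K", "e"), ("G", "b")],
}
results = jackknife(structures)
print(results)  # {'A': (3, 3), 'B': (3, 3), 'C': (1, 3)}
```

Each structure is scored only with assignments derived from the other structures, so no residue contributes to its own prediction.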
Residue solvent accessibilities were taken from the DSSP file corresponding to each structure, in which exposed surface area is calculated by the method of Kabsch and Sander (1983). These were transformed to percentages of the maximum exposed surface area for each residue using the values given by Rose et al. (1985), Chothia (1976) or Shrake and Rupley (1973).
To ensure the fairness of comparisons between the baseline and existing methods of predicting solvent accessibility, the protein structure sets, exposure categories and values of maximum exposed surface area for each amino acid were chosen to match those used in the technique under comparison.
Four sets of structures were used: RS, 126 structures from Rost and Sander (1994); TG, 111 structures from Thompson and Goldstein (1996); H, the 19 structures used by Holbrook et al. (1990) to train their networks; and H', the five structures used by Holbrook et al. (1990) to test their networks. All protein structures were retrieved from the Brookhaven Protein Data Bank (PDB) (Sussman et al., 1998). Where a record had been withdrawn or superseded and was no longer available, the appropriate newer structure was used.
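The performance of each prediction method was assessed by two measures. The first was the percentage of predictions made that were successful:
$$\text{success (\%)} = 100 \times \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$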
Results |
---|
Discussion |
---|
The clear conclusion to be made from the data in Table I is that such comparisons lead to no clear conclusions. As noted by Thompson and Goldstein (1996) and Rost and Sander (1994), the choice of solvent accessibility thresholds is problematic. They have shown that using thresholds which partition the data set unevenly can permit increased prediction success with a concomitant decrease in correlation coefficient. It appears from the results presented here that as thresholds change, the performances of different prediction methods change to a variable degree. The change in three-state thresholds from 9 and 36% to 9 and 64% removed the Bayesian approach's advantage over the baseline method on both the RS and TG data sets [Table I, (ii) and (iii) or (xiii) and (xiv)].
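One way to see how uneven partitioning can inflate apparent success is to compute the share of residues in the most populated category, which is the score obtained by always predicting that category. The sketch below uses the two pairs of three-state thresholds mentioned above, but the accessibility values are hypothetical and `majority_share` is a name introduced only for this example.

```python
from bisect import bisect_right
from collections import Counter


def majority_share(rel_accessibilities, thresholds):
    """Percentage of residues in the most populated exposure category."""
    categories = [bisect_right(thresholds, r) for r in rel_accessibilities]
    largest = Counter(categories).most_common(1)[0][1]
    return 100.0 * largest / len(categories)


# Hypothetical relative accessibilities (%), not drawn from RS or TG.
rel = [2, 5, 8, 12, 20, 30, 40, 50, 60, 70]

print(majority_share(rel, [9, 36]))  # 40.0
print(majority_share(rel, [9, 64]))  # 60.0
```

With these made-up values, moving the upper threshold from 36% to 64% raises the score of a trivial majority-category guess from 40% to 60% without any gain in predictive information.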
The neural network methods of Rost and Sander (1994) and Holbrook et al. (1990) often showed little advantage over the baseline method, especially when correlation coefficients were considered. In particular, Holbrook et al.'s networks seemed to be handicapped by the small set of training data in comparison with the size of the network. If these networks did not perform better than the baseline method, it may be that they were failing to utilize the local sequence information presented to them in making their predictions. These conclusions do not extend to the use of networks in concert with data from multiple sequence alignment, when their success is markedly in excess of that of the baseline method.
The Bayesian approach of Thompson and Goldstein (1996) was more robust, even when using single sequence data, and showed an improvement over the baseline method under most circumstances. Again, the use of multiple sequence alignments improves its performance considerably.
The prediction method presented here does not offer new insights into the mechanisms of residue burial, nor does it provide improvements in the prediction of residue solvent accessibility. It does provide a bottom line for the prediction of residue burial, giving a very rapid indication of the minimum level of success that should be expected from a given set of protein structures and exposure categories. If a technique is considerably worse than the baseline method, something is amiss. If the benefits of extra sophistication are small, only the experimenter can determine whether they are worth the effort expended.
Acknowledgments |
---|
Notes |
---|
References |
---|
Chan,H.S. and Dill,K.A. (1990) Proc. Natl Acad. Sci. USA, 87, 6388–6392.
Chothia,C. (1976) J. Mol. Biol., 105, 1–14.
Ehrlich,L., Reczko,M., Bohr,H. and Wade,R.C. (1998) Protein Engng, 11, 11–19.
Eisenhaber,F., Persson,B. and Argos,P. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 1–94.
Flockner,H., Braxenthaler,M., Lackner,P., Jaritz,M., Ortner,M. and Sippl,M.J. (1995) Proteins: Struct. Funct. Genet., 23, 376–386.
Frishman,D. and Argos,P. (1997) Proteins: Struct. Funct. Genet., 27, 329–335.
Holbrook,S.R., Muskal,S.M. and Kim,S.-H. (1990) Protein Engng, 3, 659–665.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Pascarella,S., De Persio,R., Bossa,F. and Argos,P. (1998) Proteins: Struct. Funct. Genet., 32, 190–199.
Rose,G.D., Geselowitz,A.R., Lesser,G.J., Lee,R.H. and Zehfus,M.H. (1985) Science, 229, 834–838.
Rost,B. and Sander,C. (1994) Proteins: Struct. Funct. Genet., 20, 216–226.
Rost,B. and Sander,C. (1995) Proteins: Struct. Funct. Genet., 23, 295–300.
Sánchez,R. and Sali,A. (1998) Proc. Natl Acad. Sci. USA, 95, 13597–13602.
Shrake,A. and Rupley,J.A. (1973) J. Mol. Biol., 79, 351–371.
Sussman,J.L., Lin,D., Jiang,J., Manning,N.O., Prilusky,J., Ritter,O. and Abola,E.E. (1998) Acta Crystallogr., D54, 1078–1084.
Thompson,M.J. and Goldstein,R.A. (1996) Proteins: Struct. Funct. Genet., 25, 38–47.
Torda,A.E. (1997) Curr. Opin. Struct. Biol., 7, 200–205.
Received April 20, 1999; revised August 2, 1999; accepted August 23, 1999.