The bottom line for prediction of residue solvent accessibility

C.J. Richardson1 and D.J. Barlow

Pharmacy Department, King's College London, Manresa Road, London SW3 6LX, UK


    Abstract
 
A simple method of predicting residue solvent accessibilities in proteins is described, with the intention that it should be used as a baseline by which more sophisticated approaches to prediction can be judged. Comparison with existing methods of predicting residue burial reveals that their performance is often little better than that of the baseline method. The problem of comparing different prediction methods is shown to be complicated by the proliferation of different schemes for classifying residue burial.

Keywords: protein structure prediction/solvent accessibility


    Introduction
 
Knowledge of a protein's three-dimensional structure is essential for an understanding of both its function and mechanism of action. The value of such knowledge does much to explain the considerable effort being expended to close the gap between the amount of three-dimensional protein structural information and the overwhelming quantity of amino acid sequence data (Sánchez and Sali, 1998). For the cases in which clear homologues of a protein sequence exist, much progress in the prediction of three-dimensional structure has been made (Flockner et al., 1995; Torda, 1997). Where there is little homology between the sequence under investigation and those of proteins of known structure, the ground is less fertile.

More consistent success has been achieved in predicting a quantifiable, usually discrete, one-dimensional aspect of protein structure (Rost and Sander, 1995). Such lower level structural features are of value by themselves, but are also seen as stepping stones to the prediction of three-dimensional structure. Initial interest concentrated on the prediction of secondary structure. Typically, for three-state predictions (α-helix, β-strand and turn), 65–70% prediction success has been achieved (Eisenhaber et al., 1995). With the addition of data from a multiple sequence alignment of homologous proteins, this has been raised to 75% (Frishman and Argos, 1997).

More recently, prediction of residue solvent accessibility has found favour. Solvent accessibility can be calculated from main- and side-chain atom positions in the three-dimensional structure of a protein, usually by the method of Kabsch and Sander (1983). The potential usefulness of this characteristic is linked to the discovery that burial of core residues may be a strong driving force in protein folding (Chan and Dill, 1990); knowledge of which residues in a protein are buried can improve the accuracy of tertiary structure prediction (Aszódi and Taylor, 1996). Solvent accessibility has also been used to predict the position of protein hydration sites, which may play an important part in a protein's function (Ehrlich et al., 1998).

Existing prediction methods use a variety of measures of residue burial: binary exposure categories (buried or exposed), ternary categories (buried, partially exposed or fully exposed) or more complex schemes (typically with 10 categories). To compensate for differences in amino acid side-chain size, residue exposed surface area is usually expressed as a percentage of the maximum exposed surface area, calculated by the methods of Rose et al. (1985), Chothia (1976) or Shrake and Rupley (1973). It is this normalized value that determines membership of a particular exposure category. The thresholds between categories vary in the reported work, which makes accurate comparison between methods difficult. As thresholds change, so does the number of residues in each category. Methods can appear most accurate when the residue distribution is highly skewed, because a predictor that favours the largest category is then correct for most residues.
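The effect of a skewed distribution can be illustrated with a minimal sketch. The function name and the accessibility values below are made up for illustration and are not taken from any of the cited data sets; the sketch simply reports the success obtained by always predicting the larger of two binary categories as the threshold moves.

```python
def majority_class_success(rel_accs, threshold):
    """Percentage success of always predicting the larger binary category.

    rel_accs: relative accessibilities in percent (hypothetical values).
    threshold: residues below this value count as buried (an assumption
    about how the boundary is handled).
    """
    buried = sum(1 for r in rel_accs if r < threshold)
    exposed = len(rel_accs) - buried
    return 100.0 * max(buried, exposed) / len(rel_accs)


# Made-up relative accessibilities, for illustration only.
rel_accs = [0, 2, 5, 8, 12, 20, 30, 45, 60, 80]

for threshold in (5, 20, 60):
    print(threshold, majority_class_success(rel_accs, threshold))
```

With these invented values an even split gives 50% success, whereas thresholds that place most residues in one category give 80% without any use of sequence information.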

The approaches used to predict solvent accessibility are similar to those used previously to predict secondary structure (Eisenhaber et al., 1995). The rationale is clear: in both cases the investigator hopes that the local amino acid sequence has a strong influence on the one-dimensional property under consideration. Non-local interactions are generally ignored, which is likely to account for the fact that results seem to indicate an upper limit on prediction success. Approaches meeting with success include the use of neural networks (Holbrook et al., 1990; Rost and Sander, 1994), residue substitution matrices (Pascarella et al., 1998) and Bayesian analysis (Thompson and Goldstein, 1996).

As with the prediction of secondary structure, the level of success achieved by different methods varies very little. Binary categories achieve 70–75% success, ternary categories around 55% and 10-state systems 20–25%. The use of data from sequence alignments of homologous proteins increases these values. In their report of the success of a neural network method, Rost and Sander (1994) quoted the success of random prediction in two-, three- and 10-category methods, presumably to give a bottom line by which more sophisticated methods should be judged. In recent work using genetic algorithms to evolve sequence motifs for the prediction of buried residues (unpublished), we used a more challenging bottom line. The results from this shed interesting light on the value of other prediction methods.

We present here this new baseline by which other methods of predicting residue solvent accessibility can be judged. The method takes no account of the local sequence surrounding a residue and makes predictions solely on the basis of the exposure category in which an amino acid is most often found. In this, it resembles the simplest of residue hydrophobicity or Bayesian methods. Comparison with existing methods demonstrates that these sophisticated approaches often show a surprisingly low increase in prediction success over our new baseline. Such an approach to estimating the minimum level of success that should be expected from novel prediction techniques has not previously been demonstrated.


    Materials and methods
 
The baseline prediction method was applied at the level of individual amino acids in a protein sequence, with each prediction based solely on the identity of the residue and without reference to the surrounding sequence. Initially, each amino acid was classified by the analysis of a database of protein structures and assigned to the burial category in which it was most often found. Predictions were then made using these assignments. For example, in one application of this method using binary exposure categories, proline was found to be exposed in 67.5% of cases (corresponding to 1032 occurrences of the amino acid) and buried in 32.5% of cases (495 occurrences); any proline residue encountered in prediction was therefore assigned to the exposed category.
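The assignment and prediction steps can be sketched as follows. This is a minimal illustration rather than the code used in this work: the function names, the one-letter residue codes and the category labels are assumptions, and ties between categories (not discussed in the text) are resolved in favour of the category seen first.

```python
from collections import Counter, defaultdict


def train_baseline(residues):
    """Assign each amino acid to the category in which it is most often found.

    residues: iterable of (amino_acid, category) pairs taken from
    structures of known solvent accessibility.
    """
    counts = defaultdict(Counter)
    for amino_acid, category in residues:
        counts[amino_acid][category] += 1
    # most_common(1) returns the single most frequent category per residue.
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}


def predict_baseline(assignments, sequence, default=None):
    """Predict a category for every residue from its identity alone."""
    return [assignments.get(aa, default) for aa in sequence]


# Proline counts quoted in the text: 1032 exposed, 495 buried.
training = [("P", "exposed")] * 1032 + [("P", "buried")] * 495
assignments = train_baseline(training)
print(assignments["P"])
print(predict_baseline(assignments, "PPP"))
```

Because the lookup table holds one entry per amino acid, prediction for a whole sequence is a single pass with no window over neighbouring residues.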

Predictions were made in one of two ways. In the first method, one set of protein structures was used to assign each amino acid to an exposure category. A second set of protein structures was used to assess the success of the predictions made on the basis of these assignments. In the second method, a jackknife approach was used. For a set of n structures, n – 1 were used to assign amino acids to exposure categories and the remaining one used to assess the success of the predictions. This was repeated n times with each protein structure in the data set being used once to assess the success of prediction.
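The jackknife procedure can be sketched as below. The function names and the toy data are hypothetical, and two details are assumptions not stated in the text: success is pooled over all residues of all held-out proteins rather than averaged per protein, and a residue type absent from the n – 1 training proteins is counted as a failed prediction.

```python
from collections import Counter, defaultdict


def most_frequent_category(residues):
    """Map each amino acid to its most frequently observed category."""
    counts = defaultdict(Counter)
    for amino_acid, category in residues:
        counts[amino_acid][category] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}


def jackknife_success(structures):
    """Leave-one-out percentage success over a set of n structures.

    structures: list of proteins, each a list of
    (amino_acid, observed_category) pairs.
    """
    correct = 0
    total = 0
    for i, held_out in enumerate(structures):
        # Train on the other n - 1 structures.
        training = [pair
                    for j, protein in enumerate(structures) if j != i
                    for pair in protein]
        assignments = most_frequent_category(training)
        # Assess predictions on the single held-out structure.
        for amino_acid, observed in held_out:
            if assignments.get(amino_acid) == observed:
                correct += 1
            total += 1
    return 100.0 * correct / total


# Toy data for illustration only: four two-residue "proteins".
structures = [
    [("L", "buried"), ("K", "exposed")],
    [("L", "buried"), ("K", "exposed")],
    [("L", "buried"), ("K", "exposed")],
    [("L", "exposed"), ("K", "exposed")],
]
print(jackknife_success(structures))
```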

Residue solvent accessibilities were taken from the DSSP file corresponding to each structure, in which exposed surface area is calculated by the method of Kabsch and Sander (1983). These were transformed to percentages of the maximum exposed surface area for each residue using the values given by Rose et al. (1985), Chothia (1976) or Shrake and Rupley (1973).
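The normalization and categorization steps can be sketched as follows. The maximum surface areas in the example are placeholders, not the published values of Rose et al., Chothia or Shrake and Rupley; the cap at 100% and the placement of a value that falls exactly on a threshold into the more exposed category are also assumptions.

```python
from bisect import bisect_right


def relative_accessibility(amino_acid, exposed_area, max_area):
    """Exposed area as a percentage of the residue's maximum, capped at 100.

    exposed_area: accessibility read from a DSSP file.
    max_area: dict of maximum exposed areas per amino acid, supplied by
    the caller from whichever published scale is being matched.
    """
    return min(100.0, 100.0 * exposed_area / max_area[amino_acid])


def exposure_category(rel_acc, thresholds):
    """Category index for a relative accessibility; 0 is the most buried.

    thresholds: ascending percentages, e.g. [9, 36] for three states.
    """
    return bisect_right(thresholds, rel_acc)


# Placeholder maxima for illustration only (not published values).
PLACEHOLDER_MAX = {"A": 100.0, "K": 200.0}

rel = relative_accessibility("K", 40.0, PLACEHOLDER_MAX)
print(rel, exposure_category(rel, [9, 36]))
```

The same two functions cover binary, ternary and 10-state schemes, since only the list of thresholds changes.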

To ensure the fairness of comparisons between the baseline and existing methods of predicting solvent accessibility, the protein structure sets, exposure categories and values of maximum exposed surface area for each amino acid were chosen to match those used in the technique under comparison.

Four sets of structures were used: RS, 126 structures from Rost and Sander (1994); TG, 111 structures from Thompson and Goldstein (1996); H, the 19 structures used by Holbrook et al. (1990) to train their networks; and H', the five structures used by Holbrook et al. (1990) to test their networks. All protein structures were retrieved from the Brookhaven Protein Data Bank (PDB) (Sussman et al., 1998). Where a record had been withdrawn or superseded and was no longer available, the appropriate newer structure was used.

The performance of each prediction method was assessed by two measures. The first was the percentage of predictions made that were successful, that is, 100 × (number of correct predictions)/(total number of predictions).

The second was the correlation coefficient between the observed, o_i, and predicted, p_i, solvent accessibility categories, taken over residue locations i.
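Both measures can be computed as in the sketch below. It assumes that categories are coded as integers (0 for the most buried) and that the correlation coefficient takes the usual Pearson form, the covariance of o_i and p_i divided by the product of their standard deviations; the function names are hypothetical.

```python
from math import sqrt


def percent_correct(observed, predicted):
    """Percentage of residues whose predicted category equals the observed one."""
    matches = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * matches / len(observed)


def correlation(observed, predicted):
    """Pearson correlation between observed and predicted category codes."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        # Undefined when either series is constant, e.g. when every
        # residue is predicted to fall in the same category.
        return float("nan")
    return cov / sqrt(var_o * var_p)


# Toy binary example: 0 = buried, 1 = exposed.
observed = [0, 0, 1, 1]
predicted = [0, 1, 1, 1]
print(percent_correct(observed, predicted))
print(round(correlation(observed, predicted), 3))
```

Reporting both values matters because a prediction that always chooses the largest category can score well on percentage success while its correlation coefficient is undefined or close to zero.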


    Results
 
The performance of the baseline algorithm was tested with both test and training data and using a single data set with the jackknife approach. All four protein data sets were used and two-, three- and 10-state burial categories with a variety of thresholds were examined. The performance of the baseline method is shown in Table I, in comparison with the performance of existing approaches where appropriate.


 
Table I. Comparison of Bayesian, neural network and baseline methods for different protein structure data sets and burial thresholds
 
In general, the performance of the baseline algorithm was inferior by a few percentage points to that of existing methods and correlation coefficients were lower by ~0.05. However, there were a few notable exceptions to this trend. In particular, the performance of the baseline method in Table I (ii), (xiii) and (xvi) was comparatively better than might have been expected and comparatively worse in (iii), (v) and (ix).


    Discussion
 
The baseline method was developed during research into novel methods of predicting solvent accessibility as a minimally successful method one step up from random prediction. It was intended as a bottom line against which the performance of more sophisticated prediction schemes could be compared and is presented here in the same spirit. Despite this, its level of success is surprising and in some cases [Table I, (ii), (xiii) and (xvi)] it equalled or exceeded that of neural network and Bayesian methods. Even when the baseline method performed at a lower standard than these approaches, the benefit brought by their extra sophistication was often only a few percent. For example, if we consider the jackknife predictions made for the Thompson and Goldstein (TG) protein dataset using a simple binary classification of residue exposure [Table I, (i)], the difference in success achieved by the baseline and Bayesian methods is <3%. On average, therefore, for a protein of 200 residues, the baseline method would in this case correctly predict the exposure state for ~137 of the residues, whilst the Bayesian method would correctly predict the exposure state for ~141 residues (that is, just four residues more).

The clear conclusion to be drawn from the data in Table I is that such comparisons lead to no clear conclusions. As noted by Thompson and Goldstein (1996) and Rost and Sander (1994), the choice of solvent accessibility thresholds is problematic. They have shown that using thresholds which partition the data set unevenly can permit increased prediction success with a concomitant decrease in correlation coefficient. It appears from the results presented here that as thresholds change, the performances of different prediction methods change to a variable degree. The change in three-state thresholds from 9 and 36% to 9 and 64% removed the Bayesian approach's advantage over the baseline method on both the RS and TG data sets [Table I, (ii) and (iii) or (xiii) and (xiv)].

The neural network methods of Rost and Sander (1994) and Holbrook et al. (1990) often showed little advantage over the baseline method, especially when correlation coefficients were considered. In particular, Holbrook et al.'s networks seemed to be handicapped by the small set of training data in comparison with the size of the network. If these networks did not perform better than the baseline method, it may be that they were failing to utilize the local sequence information presented to them in making their predictions. These conclusions do not extend to the use of networks in concert with data from multiple sequence alignment, when their success is markedly in excess of that of the baseline method.

The Bayesian approach of Thompson and Goldstein (1996) was more robust, even when using single sequence data, and showed an improvement over the baseline method under most circumstances. Again, the use of multiple sequence alignments improves its performance considerably.

The prediction method presented here does not offer new insights into the mechanisms of residue burial, nor does it provide improvements in the prediction of residue solvent accessibility. It does provide a bottom line for the prediction of residue burial, giving a very rapid indication of the minimum level of success that should be expected from a given set of protein structures and exposure categories. If a technique is considerably worse than the baseline method, something is amiss. If the benefits of extra sophistication are small, only the experimenter can determine whether they are worth the effort expended.


    Acknowledgments
 
Dr Richardson gratefully acknowledges the support of a Cyril W. Maplethorpe Fellowship.


    Notes
 
1 To whom correspondence should be addressed. Email: foop@sg4.pcy.kcl.ac.uk


    References
 
Aszódi,A. and Taylor,W.R. (1996) Folding Des., 1, 325–334.

Chan,H.S. and Dill,K.A. (1990) Proc. Natl Acad. Sci. USA, 87, 6388–6392.

Chothia,C. (1976) J. Mol. Biol., 105, 1–14.

Ehrlich,L., Reczko,M., Bohr,H. and Wade,R.C. (1998) Protein Engng, 11, 11–19.

Eisenhaber,F., Persson,B. and Argos,P. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 1–94.

Flockner,H., Braxenthaler,M., Lackner,P., Jaritz,M., Ortner,M. and Sippl,M.J. (1995) Proteins: Struct. Funct. Genet., 23, 376–386.

Frishman,D. and Argos,P. (1997) Proteins: Struct. Funct. Genet., 27, 329–335.

Holbrook,S.R., Muskal,S.M. and Kim,S.-H. (1990) Protein Engng, 3, 659–665.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Pascarella,S., De Persio,R., Bossa,F. and Argos,P. (1998) Proteins: Struct. Funct. Genet., 32, 190–199.

Rose,G.D., Geselowitz,A.R., Lesser,G.J., Lee,R.H. and Zehfus,M.H. (1985) Science, 229, 834–838.

Rost,B. and Sander,C. (1994) Proteins: Struct. Funct. Genet., 20, 216–226.

Rost,B. and Sander,C. (1995) Proteins: Struct. Funct. Genet., 23, 295–300.

Sánchez,R. and Sali,A. (1998) Proc. Natl Acad. Sci. USA, 95, 13597–13602.

Shrake,A. and Rupley,J.A. (1973) J. Mol. Biol., 79, 351–371.

Sussman,J.L., Lin,D., Jiang,J., Manning,N.O., Prilusky,J., Ritter,O. and Abola,E.E. (1998) Acta Crystallogr., D54, 1078–1084.

Thompson,M.J. and Goldstein,R.A. (1996) Proteins: Struct. Funct. Genet., 25, 38–47.

Torda,A.E. (1997) Curr. Opin. Struct. Biol., 7, 200–205.

Received April 20, 1999; revised August 2, 1999; accepted August 23, 1999.