Department of General Chemistry, Pavia University, viale Taramelli 12, I-27100 Pavia and International Centre for Genetic Engineering and Biotechnology, Area Science Park, Padriciano 99, I-34012 Trieste, Italy. Email: carugo{at}icgeb.trieste.it
Abstract |
---|
Keywords: protein sequence/protein structure/secondary structure/solvent accessibility/structure prediction
Introduction |
---|
Several studies have been devoted to predicting solvent accessibility from the sequence. Pascarella et al. (1998) made use of residue substitution matrices, Bayesian statistics was applied by Thompson and Goldstein (1996) and neural networks were employed by others (Holbrook et al., 1990; Rost and Sander, 1994). Most recently, Richardson and Barlow (1999) proposed a very simple method and compared its performance with that of the computationally more demanding methods reported previously. Their approach computes the mean solvent-accessible area of each residue type over a learning set of protein three-dimensional structures and then classifies each residue according to its mean tendency to be buried or exposed. This procedure was found to produce results of comparable quality to the computationally more demanding approaches.
In the present work, the simple method of Richardson and Barlow (1999) was extended to include information describing the sequence environment. It is shown that the quality of the prediction improves if a few neighboring residues are included in determining the mean tendency of a residue type to be buried or exposed. Including the secondary structural types in the classification procedure does not produce any significant improvement.
Methods |
---|
The standard solvent-accessible area values were computed for each residue type as suggested by Heringa et al. (1995): the maximum solvent-accessible area of each residue type was determined in each structure of the learning set and the maximum values were then averaged. The standard values were computed for each residue type independently of its sequence environment (INDE), by considering either the preceding or the following residue (1P and 1F, respectively), by considering either the two preceding or the two following residues (2P and 2F, respectively), or by considering both the preceding and the following residue (1P1F). They were also computed independently of the sequence environment but by considering the residue backbone conformation, determined with DSSP (Kabsch and Sander, 1983) and simplified into three classes: helical, strand and other (SECS).
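The averaging of per-structure maxima for the sequence-based schemes (INDE, 1P, 1F, 2P, 2F and 1P1F) can be sketched in Python as follows. This is a minimal illustration and not the code used in this work. The input format (a sequence string paired with a list of per-residue accessible areas), the names `SCHEMES`, `context_key` and `standard_areas`, and the choice to skip residues whose neighbors fall beyond a chain end are all assumptions made for the example. The SECS scheme is omitted.

```python
from collections import defaultdict

# Offsets of the neighbors used by each scheme, relative to position i.
# (Hypothetical encoding of the scheme labels defined in the text.)
SCHEMES = {
    "INDE": (),
    "1P": (-1,),
    "1F": (1,),
    "2P": (-2, -1),
    "2F": (1, 2),
    "1P1F": (-1, 1),
}


def context_key(sequence, i, scheme):
    """Return (residue, neighbor, ...) for position i, or None if a
    required neighbor lies beyond a chain end (assumption: skip these)."""
    positions = [i + offset for offset in SCHEMES[scheme]]
    if any(p < 0 or p >= len(sequence) for p in positions):
        return None
    return (sequence[i],) + tuple(sequence[p] for p in positions)


def standard_areas(structures, scheme):
    """structures: list of (sequence, areas) pairs, one per protein.
    For every context, take the maximum area within each structure,
    then average those maxima over the structures where it occurs."""
    maxima = defaultdict(list)
    for sequence, areas in structures:
        best = {}
        for i, area in enumerate(areas):
            key = context_key(sequence, i, scheme)
            if key is None:
                continue
            best[key] = max(area, best.get(key, area))
        for key, value in best.items():
            maxima[key].append(value)
    return {key: sum(values) / len(values) for key, values in maxima.items()}


# Toy data: two tiny "structures" with made-up accessible areas.
toy = [("AGA", [10.0, 50.0, 30.0]), ("GA", [70.0, 20.0])]
print(standard_areas(toy, "INDE"))
print(standard_areas(toy, "1P"))
```

In the toy data, alanine reaches a maximum of 30 in the first structure and 20 in the second, so its INDE standard value is their mean, 25.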
When the standard solvent-accessible area of each residue type was computed by considering a single neighbor, there were 400 possible residue pairs (20 × 20). On average, each pair was observed 112 times and the standard deviation of the standard solvent-accessible area was 4.1 Å². When two neighboring residues were considered, there were 8000 possible residue triplets, each observed on average only 9.9 times, and the standard deviation of the standard solvent-accessible area was as high as 13.2 Å². If more neighboring residues had been considered, the statistics would have become unreliable because of the paucity of the data.
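The growth in the number of sequence contexts can be checked with a few lines of Python. This is only an illustrative sketch: `context_count` is a hypothetical helper, and the final estimate assumes, purely for illustration, that the observations of each triplet would spread evenly over the 20 possible additional neighbors.

```python
AMINO_ACIDS = 20  # standard residue types


def context_count(neighbors):
    """Number of distinct contexts for a residue plus `neighbors` neighbors."""
    return AMINO_ACIDS ** (neighbors + 1)


for k in range(4):
    print(k, "neighbor(s):", context_count(k), "contexts")

# Rough extrapolation from the reported mean of 9.9 observations per triplet:
# one more neighbor divides the mean occupancy by about 20.
mean_triplet_observations = 9.9
estimated_quadruplet_observations = mean_triplet_observations / AMINO_ACIDS
print("estimated observations per quadruplet:",
      round(estimated_quadruplet_observations, 2))
```

Under that assumption, each of the 160 000 possible quadruplets would be observed less than once on average, which illustrates why a third neighbor could not be included.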
A residue was considered buried or exposed to the solvent if its fractional solvent-accessible area, i.e. the ratio between its observed solvent-accessible area and its standard solvent-accessible area, was lower or higher, respectively, than a selected threshold (values of 0.2, 0.3 and 0.4 were tested). The accessibility of a residue was predicted from its mean fractional solvent-accessible area, i.e. by determining whether the mean value was below or above the same threshold. A jackknife procedure was applied: for each of the 338 structures, the predicted class was derived from the mean fractional solvent-accessible area computed over all the other structures of the learning set, and the predicted and observed classifications were compared by computing the percentage of residues correctly predicted.
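The leave-one-out evaluation can be sketched as follows. This is a hedged illustration, not the original implementation. Each structure is assumed to be given as a list of (context key, fractional accessible area) pairs. The function names are hypothetical. A residue exactly at the threshold is treated as exposed, and residues whose context never occurs in the other structures are left unscored; both choices are assumptions.

```python
from collections import defaultdict


def mean_fraction_by_key(structures):
    """Mean fractional accessible area of every context key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for residues in structures:
        for key, fraction in residues:
            sums[key] += fraction
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def jackknife_accuracy(structures, threshold=0.3):
    """Percentage of correctly predicted residues for each structure,
    with the prediction derived from all the other structures."""
    accuracies = []
    for left_out, residues in enumerate(structures):
        training = structures[:left_out] + structures[left_out + 1:]
        means = mean_fraction_by_key(training)
        correct = scored = 0
        for key, fraction in residues:
            if key not in means:
                continue  # assumption: no prediction for unseen contexts
            predicted_buried = means[key] < threshold
            observed_buried = fraction < threshold
            correct += predicted_buried == observed_buried
            scored += 1
        accuracies.append(100.0 * correct / scored if scored else float("nan"))
    return accuracies


# Toy data with made-up fractional areas for two residue types.
toy = [
    [("A", 0.1), ("K", 0.8)],
    [("A", 0.2), ("K", 0.1)],
    [("A", 0.6), ("K", 0.9)],
]
print(jackknife_accuracy(toy, threshold=0.3))
```

The keys may be single residue types (INDE) or tuples that include neighbors, so the same routine serves every scheme.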
Results and discussion |
---|
The prediction quality is significantly improved when the sequence neighbors are considered in the computation of the standard solvent-accessible area values. The inclusion of a single neighbor (1P and 1F) improves the prediction by about 1–5% and the inclusion of two neighbors (2P, 2F and 1P1F) by about 5–10%. Interestingly, similar improvements in prediction accuracy are obtained independently of the location of the two neighbors, whether both precede (2P), both follow (2F) or flank (1P1F) the residue whose accessibility is to be predicted. This clearly indicates the importance of medium-range effects: the exposure of a residue in position i is influenced not only by its neighbors in position i ± 1 but also by the more distant residues in position i ± 2. Unfortunately, the relatively small size of the learning set does not allow the importance of slightly more distant residues to be investigated.
The prediction accuracy clearly depends on the protein dimensions. This aspect has generally been neglected in the past, although the correlation between sequence length and level of prediction is evident (Figure 1). Although the percentage of successfully predicted residues can be as high as 80–100% for very small proteins, it is hardly better than 60% for larger proteins. The relationship shown in Figure 1 must be considered in designing and assessing novel algorithms for accessibility prediction. No other features were found to discriminate proteins for which the accessibility prediction was very good (better than 85%) from proteins for which the prediction was of low accuracy (less than 55%). The secondary structure content is nearly identical, with a slightly smaller presence of helical or strand backbone conformations in proteins with well predicted residue exposure (58.6% versus 64.8%). Analogously, the amino acid content was statistically identical, with the sole exception of the cysteines, which are more often encountered in small proteins adopting the knottin fold (Isaacs, 1995).
The very simple method of predicting residue exposure to the solvent from sequence data recently published by Richardson and Barlow (1999) can be improved significantly by considering the nature of the sequence neighbors. The accuracy of the prediction, measured as the percentage of successful predictions, can increase by as much as about 10% when two neighbors are considered, regardless of whether they precede or follow the residue whose accessibility is to be predicted. The number of protein three-dimensional structures currently known does not allow more than two neighboring residues to be included, although this will become possible in the near future, given the large number of novel structures solved every week. It has also been shown that the prediction accuracy of this method depends strongly on the protein dimensions, being best for small proteins. Including the sequence neighbors appears to be a promising strategy for high-quality predictions, which are of practical relevance in several areas of biology.
Acknowledgments |
---|
References |
---|
Heringa,J., Argos,P., Egmond,M.R. and de Vlieg,J. (1995) Protein Eng., 8, 21–30.
Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522–531.
Holbrook,S.R., Muskal,S.M. and Kim,S.-H. (1990) Protein Eng., 3, 659–665.
Isaacs,N.W. (1995) Curr. Opin. Struct. Biol., 5, 391–395.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Karplus,P.A. and Schulz,G.E. (1985) Naturwissenschaften, 72, 212–213.
Pascarella,S., De Persio,R., Bossa,F. and Argos,P. (1998) Proteins: Struct. Funct. Genet., 32, 190–199.
Ragone,R., Facchiano,F., Facchiano,A., Facchiano,A.M. and Colonna,G. (1989) Protein Eng., 2, 497–504.
Richardson,C.J. and Barlow,D.J. (1999) Protein Eng., 12, 1051–1054.
Rost,B. and Sander,C. (1994) Proteins: Struct. Funct. Genet., 20, 216–226.
Thompson,M.J. and Goldstein,R.A. (1996) Proteins: Struct. Funct. Genet., 25, 38–47.
Vihinen,M., Torkkila,E. and Riikonen,P. (1994) Proteins: Struct. Funct. Genet., 19, 141–149.
Received April 14, 2000; revised May 31, 2000; accepted June 23, 2000.