Predicting residue solvent accessibility from protein sequence by considering the sequence environment

O. Carugo

Department of General Chemistry, Pavia University, viale Taramelli 12, I-27100 Pavia and International Centre for Genetic Engineering and Biotechnology, Area Science Park, Padriciano 99, I-34012 Trieste, Italy. Email: carugo@icgeb.trieste.it


    Abstract
 
The solvent accessibility of each residue is predicted on the basis of the protein sequence. A set of 338 monomeric, non-homologous and high-resolution protein crystal structures is used as a learning set and a jackknife procedure is applied to each entry. The prediction is based on the comparison of the observed and the average values of the solvent-accessible area. It appears that the prediction accuracy is significantly improved by considering the residue types preceding and/or following the residue whose accessibility must be predicted. In contrast, the separate treatment of different secondary structural types does not improve the quality of the prediction. It is furthermore shown that the residue accessibility is much better predicted in small than in larger proteins. Such a discrepancy must be carefully considered in any algorithm for predicting residue accessibility.

Keywords: protein sequence/protein structure/secondary structure/solvent accessibility/structure prediction


    Introduction
 
The prediction of the solvent accessibility of polypeptide segments within proteins, like the prediction of flexibility in general (Karplus and Schulz, 1985; Ragone et al., 1989; Vihinen et al., 1994), is of great practical interest. Identifying from the sequence the amino acids that are likely to be exposed to the solvent is necessary for selecting protein fragments against which to raise specific monoclonal antibodies. It is moreover possible to devise computational protocols that predict the tertiary structure from the predicted solvent accessibility and secondary structure. An effective identification of the polypeptide segments most likely to be exposed to the solvent can also be relevant in predicting the boundaries of structural domains and the regions responsible for intramolecular conformational rearrangements.

Several studies have been devoted to the prediction of solvent accessibility on the basis of the sequence. Pascarella et al. (1998) made use of residue substitution matrices, Bayesian statistics were applied by Thompson and Goldstein (1996) and neural networks were employed by others (Holbrook et al., 1990; Rost and Sander, 1994). Most recently, Richardson and Barlow (1999) proposed a very simple method and compared its performance with that of the computationally more demanding methods reported previously. Their approach is based simply on computing the mean solvent-accessible area of each residue type over a learning set of protein three-dimensional structures and then classifying each residue type according to its mean tendency to be buried or exposed. This procedure was found to produce results of comparable quality to those of the computationally more demanding approaches.

In the present work, the simple method of Richardson and Barlow (1999) was extended to include information describing the sequence environment. It is verified that the quality of the prediction improves if a few neighboring residues are included in the determination of the mean tendency of a residue type to be buried or exposed. The inclusion of the secondary structural types in the classification procedure does not produce any significant improvement.


    Methods
 
A set of 338 non-homologous monomeric protein crystal structures was extracted from the Protein Data Bank (Bernstein et al., 1977) with the program PDB_SELECT (Hobohm and Sander, 1994) (maximum sequence identity 25% and crystallographic resolution ≤2.5 Å). The identification codes are 1euu, 1a0p, 1a7j, 1aua, 1b0m, 2gpr, 1dlc, 1juk, 2dpg, 1ihp, 1hlb, 1fep, 1rlw, 1bp1, 1dar, 1fgs, 1a8y, 1ax8, 4lbd, 2liv, 1an8, 2omf, 3nll, 8ohm, 1c25, 1toh, 1aj6, 1bob, 2fxb, 1ad6, 1inp, 1aw5, 1cyx, 1opr, 1auq, 1dhy, 1dhr, 1a26, 1gpc, 1zxq, 1beo, 1aa0, 1maz, 2bgu, 1lba, 1kte, 1tul, 1goh, 5eau, 1cfr, 1kvs, 1ash, 1auk, 1br9, 1b54, 1acc, 1sfe, 1ygs, 1rmd, 1pea, 1irk, 1ryt, 2tct, 2std, 1esc, 1tfr, 1aol, 1vin, 1a6q, 1a1x, 1csn, 1a8h, 1btn, 1cfb, 1amx, 1msc, 1hoe, 2plc, 2pgd, 1bv1, 4mt2, 1sra, 1vid, 1uok, 1poc, 1gsa, 1jpc, 1lki, 2pia, 1bol, 1pbv, 1alo, 1rmg, 1gky, 2i1b, 1oyc, 1aj2, 3tdt, 1lou, 1bea, 1fua, 1rss, 1ois, 1at0, 1ecl, 1skz, 1neu, 1alu, 1mai, 1ad2, 1dfx, 1edt, 1opy, 1phm, 1dxy, 1jdw, 1ak1, 1ceo, 1bgp, 1a8l, 1rec, 1bxe, 1idk, 1rsy, 1iso, 1ail, 1who, 1cpo, 4bcl, 1sfp, 1dun, 1a48, 1byb, 3gcb, 1bg0, 1lml, 1bg7, 1fit, 1vls, 1pud, 2abk, 1pty, 1qba, 1rkd, 1tib, 1zrn, 2dpm, 1uch, 1v39, 1bg2, 1nfn, 1ak0, 1bu8, 1uxy, 2gar, 1lcl, 1mml, 1pot, 1qnf, 1ten, 1ayl, 1bdo, 1c3d, 1bg6, 2por, 1uae, 2sli, 1tml, 2sak, 1fna, 1al3, 1hxn, 2tgi, 2acy, 1pgs, 1lbu, 1amp, 2baa, 1a8i, 1pda, 1bkb, 1ha1, 1nlr, 1chd, 1thv, 3cla, 1xjo, 1bm8, 1xwl, 1bgc, 1tfe, 1azo, 1ido, 2cyp, 1vjs, 1wab, 1vhh, 6cel, 1pdo, 1pmi, 1fdr, 1kid, 1fds, 1sbp, 1ako, 2gdm, 1knb, 1gai, 2hft, 1anf, 3chy, 1erv, 1dhn, 1aqb, 1cnv, 119l, 1cem, 1wer, 1vcc, 2bce, 1gso, 1gvp, 1ads, 1jer, 2dri, 1a3c, 1edg, 1phg, 16pk, 1bkf, 1rzl, 1smd, 1dad, 3nul, 1a8e, 1aru, 3cyr, 1nfp, 1nif, 1mrj, 1zin, 2qwc, 1kuh, 2ilk, 1ppn, 1bfg, 153l, 3pte, 2ayh, 4xis, 1nox, 2a0b, 1akz, 1a8d, 1moq, 1a3h, 1hfc, 1a62, 1hta, 1lit, 1ra9, 1tca, 3grs, 1orc, 2cba, 1kpf, 1np4, 1ah7, 1aie, 1koe, 1whi, 1rie, 3ezm, 1opd, 1b4v, 1ezm, 2mcm, 1cyo, 1poa, 1brt, 2hbg, 2eng, 2sns, 8abp, 1xnb, 2rn2, 3seb, 1g3p, 1bgf, 1aba, 2end, 3vub, 2phy, 1ecd, 2ctc, 1awd, 1rcf, 1nxb, 1ppt, 1rhs, 5p21, 1utg, 1plc, 1fus, 1bk0, 1bxa, 1c52, 7rsa, 1oaa, 1msi, 1ycc, 2pth, 2sn3, 1amm, 1bx7, 1atg, 1arb, 1ifc, 1a7s, 1ctj, 2igd, 1nkd, 3sil, 5pti, 2erl, 1a6m, 1cex, 1ixh, 1aho, 1bxo, 1nls, 2fdn, 1b0y, 3lzt, 1rb9, 2pvb, 1cbn and 1gci. The solvent-accessible area values and the secondary structural types were determined with DSSP (Kabsch and Sander, 1983).

The standard solvent-accessible area values were computed for each residue type as suggested by Heringa et al. (1995). The maximum solvent-accessible area for each residue type was computed in each structure of the learning set and the maximum values were then averaged over the structures. The standard values were computed for each residue type in several ways: independently of its sequence environment (INDE); by considering either the preceding or the following residue (1P and 1F, respectively); by considering either the two preceding or the two following residues (2P and 2F, respectively); or by considering both the preceding and the following residue (1P1F). They were also computed independently of the sequence environment but by considering the residue backbone conformation, determined with DSSP (Kabsch and Sander, 1983) and simplified into three classes: helix, strand and other (SECS).
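The standardization schemes described above can be illustrated with a minimal Python sketch. It is not the author's code: the function names, the input layout (one sequence string and one list of accessible areas per structure) and the decision to skip residues whose neighbors fall outside the chain are all assumptions made for illustration.

```python
from collections import defaultdict

# Neighbor offsets for each scheme named in the text (SECS is not covered here).
OFFSETS = {
    "INDE": (),
    "1P": (-1,),
    "1F": (1,),
    "2P": (-2, -1),
    "2F": (1, 2),
    "1P1F": (-1, 1),
}

def context_key(seq, i, scheme):
    """Key made of residue i followed by its neighbors; None near chain ends (assumption)."""
    offsets = OFFSETS[scheme]
    if any(i + o < 0 or i + o >= len(seq) for o in offsets):
        return None
    return (seq[i],) + tuple(seq[i + o] for o in offsets)

def standard_areas(structures, scheme):
    """Average, over structures, of the per-structure maximum area of each key.

    structures: list of (sequence, areas) pairs, one per learning-set entry.
    """
    maxima = defaultdict(list)
    for seq, areas in structures:
        local_max = {}
        for i, area in enumerate(areas):
            key = context_key(seq, i, scheme)
            if key is not None:
                local_max[key] = max(area, local_max.get(key, 0.0))
        for key, value in local_max.items():
            maxima[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in maxima.items()}
```

With toy input such as `[("AGA", [100.0, 50.0, 80.0]), ("GA", [60.0, 90.0])]`, the INDE scheme yields one value per residue type, whereas 1P yields one value per (residue, preceding residue) pair.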

When the standard solvent-accessible area for each residue type was computed by considering a single neighbor, there were 400 possible residue pairs (20 × 20). On average, they were observed 112 times each and the standard deviation of the standard solvent-accessible area was 4.1 Å². When it was computed by considering two neighboring residues, there were 8000 possible residue triplets (20 × 20 × 20), each observed only 9.9 times on average, and the standard deviation of the standard solvent-accessible area was as high as 13.2 Å². Had more neighboring residues been considered, the statistics would have become unreliable because of the paucity of the data.
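The growth in the number of sequence contexts can be checked with a few lines of Python. This is only an arithmetic illustration that assumes the 20 standard amino acids; the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types

def n_contexts(n_neighbors):
    """Number of distinct (residue + neighbors) combinations."""
    return len(AMINO_ACIDS) ** (n_neighbors + 1)

for k in (1, 2, 3):
    print(k, n_contexts(k))
```

One and two neighbors give the 400 pairs and 8000 triplets quoted in the text. A third neighbor would give 160 000 combinations, which illustrates why a learning set of this size cannot support larger contexts.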

A residue was considered buried or exposed to the solvent if its fractional solvent-accessible area, i.e. the ratio between its observed solvent-accessible area and its standard solvent-accessible area, was lower or higher, respectively, than a selected threshold (values of 0.2, 0.3 and 0.4 were tested). The accessibility of a residue was predicted from its mean fractional solvent-accessible area, i.e. by determining whether the mean value was below or above the same arbitrary threshold. A jackknife procedure was applied: for each of the 338 structures, the predicted class of each residue was derived from the mean fractional solvent-accessible area computed over all the other structures of the learning set, and the predicted and observed classifications were compared by computing the percentage of residues correctly predicted.
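The classification and jackknife steps can be sketched as follows. This is an illustrative reading of the procedure, not the original implementation. It assumes that the fractional areas have already been computed, that each residue carries a hashable key (its type, optionally with neighbors), that a value exactly equal to the threshold counts as buried, and that keys never seen in the other structures are skipped.

```python
from collections import defaultdict

def jackknife_accuracy(proteins, threshold=0.2):
    """Percentage of correctly predicted residues for each held-out structure.

    proteins: list of structures, each a list of (key, fractional_area) pairs.
    """
    results = []
    for held_out in range(len(proteins)):
        # Mean fractional area of every key over all the other structures.
        sums, counts = defaultdict(float), defaultdict(int)
        for j, protein in enumerate(proteins):
            if j == held_out:
                continue
            for key, frac in protein:
                sums[key] += frac
                counts[key] += 1
        correct = total = 0
        for key, frac in proteins[held_out]:
            if counts[key] == 0:
                continue  # assumption: no prediction for unseen keys
            predicted_exposed = sums[key] / counts[key] > threshold
            observed_exposed = frac > threshold
            correct += predicted_exposed == observed_exposed
            total += 1
        results.append(100.0 * correct / total if total else float("nan"))
    return results
```

Averaging the returned percentages over all structures gives a figure comparable in kind to the mean values reported for each method and threshold.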


    Results and discussion
 
Table I reports the mean percentage values of correct predictions for various arbitrary threshold values discriminating buried and exposed residues. As expected, the quality of the prediction depends on the value of the fractional solvent-accessible area adopted to discriminate buried and exposed residues (Richardson and Barlow, 1999). Nevertheless, the 68.7% of successful predictions obtained with a threshold of 0.2 and with a standardization procedure independent of the sequence environment (INDE) compares well with the values of 68.3–70.0% obtained with the same procedure on other learning sets (Richardson and Barlow, 1999).


 
Table I. Mean values and standard deviations (in parentheses) of the percentage of correct predictions of the residue accessibility to the solvent
 
It appears that the inclusion of the secondary structural type in the computation of the standard solvent-accessible area values does not improve the quality of the prediction. This might seem surprising, given the large difference of accessibility of various secondary structure elements, the loops being for example generally more solvent exposed than helices or strands. Nevertheless, it must be considered that several residues within helices or strands can protrude towards the protein exterior with the consequence that the standardization based on the secondary structural type is ineffective.

The prediction quality is significantly improved when the sequence neighbors are considered in the computation of the standard solvent-accessible area values. The inclusion of a single neighbor (1P and 1F) improves the prediction by about 1–5% and the inclusion of two neighbors (2P, 2F and 1P1F) by about 5–10%. Interestingly, similar improvements in prediction accuracy are obtained independently of the location of the two neighbors, whether both precede (2P), both follow (2F) or flank (1P1F) the residue whose accessibility must be predicted. This indicates the importance of medium-range effects: the exposure of a residue in position i is influenced not only by its neighbors in position i ± 1 but also by the more distant residues in position i ± 2. Unfortunately, the relatively small size of the learning set does not allow one to investigate the importance of slightly more distant residues.

The prediction accuracy clearly depends on the protein dimensions. This aspect has generally been neglected in the past, although the correlation between sequence length and level of prediction is evident (Figure 1). Although the percentage of successfully predicted residues can be as high as 80–100% for very small proteins, it is hardly better than 60% for larger proteins. The relationship shown in Figure 1 must be considered in designing and assessing novel algorithms for accessibility prediction. No other features were found to discriminate proteins for which the accessibility prediction was very good (better than 85%) from proteins for which the prediction was of low accuracy (less than 55%). The secondary structure content is nearly identical, with a slightly smaller presence of helical or strand backbone conformations in proteins with well predicted residue exposure (58.6% versus 64.8%). Analogously, the amino acid content was statistically identical, with the sole exception of cysteine, which is more often encountered in small proteins adopting the knotting fold (Isaacs, 1995).



 
Fig. 1. Dependence of the accuracy of the accessibility prediction on the protein dimensions. The dashed line corresponds to the best fit with the logarithmic function y = 138.2 – 14.1ln(x) (correlation coefficient = –0.82). Residues with fractional solvent-accessible area values higher than 0.2 were considered exposed. Standard accessible area values were computed by considering both the preceding and the following residues (method 1P1F). Similar trends are found by considering other burial criteria or other accessibility standardization methods.
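The fitted curve in the caption can be evaluated directly. The sketch below simply encodes the published coefficients; it assumes that x is the number of residues in the protein and that y is the percentage of correct predictions, and the function name is hypothetical.

```python
import math

def fitted_accuracy(n_residues):
    """Percentage of correct predictions from the fit y = 138.2 - 14.1 ln(x)."""
    return 138.2 - 14.1 * math.log(n_residues)

for n in (50, 100, 300):
    print(n, round(fitted_accuracy(n), 1))
```

Under these assumptions the curve gives roughly 83% for a 50-residue protein and roughly 58% for a 300-residue protein, in line with the ranges discussed in the text.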

 
Conclusions

The very simple method of predicting residue exposure to the solvent from sequence data recently published by Richardson and Barlow (1999) can be improved significantly by considering the nature of the sequence neighbors. The accuracy of the prediction, measured as the percentage of successful predictions, can increase by as much as about 10% when two neighbors are considered, regardless of whether they precede or follow the residue whose accessibility must be predicted. The present protein three-dimensional structural knowledge does not allow one to include more than a pair of residues, although this will be possible in the near future, given the large number of novel structures solved every week. It has also been shown that the prediction accuracy of this method depends strongly on the protein dimensions, being best for small proteins. This appears to be a promising strategy for performing high-quality predictions, which are of practical relevance in several areas of biology.


    Acknowledgments
 
Dr Domenico Bordo (Advanced Biotechnology Center, Genova) is gratefully acknowledged for critical reading of the manuscript and valuable discussions.


    References
 
Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535–542.

Heringa,J., Argos,P., Egmond,M.R. and de Vlieg,J. (1995) Protein Eng., 8, 21–30.

Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522–531.

Holbrook,S.R., Muskal,S.M. and Kim,S.-H. (1990) Protein Eng., 3, 659–665.

Isaacs,N.W. (1995) Curr. Opin. Struct. Biol., 5, 391–395.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Karplus,P.A. and Schulz,G.E. (1985) Naturwissenschaften, 72, 212–213.

Pascarella,S., De Persio,R., Bossa,F. and Argos,P. (1998) Proteins: Struct. Funct. Genet., 32, 190–199.

Ragone,R., Facchiano,F., Facchiano,A., Facchiano,A.M. and Colonna,G. (1989) Protein Eng., 2, 497–504.

Richardson,C.J. and Barlow,D.J. (1999) Protein Eng., 12, 1051–1054.

Rost,B. and Sander,C. (1994) Proteins: Struct. Funct. Genet., 20, 216–226.

Thompson,M.J. and Goldstein,R.A. (1996) Proteins: Struct. Funct. Genet., 25, 38–47.

Vihinen,M., Torkkila,E. and Riikonen,P. (1994) Proteins: Struct. Funct. Genet., 19, 141–149.

Received April 14, 2000; revised May 31, 2000; accepted June 23, 2000.




