1Dipartimento di Scienze Biochimiche A. Rossi Fanelli and Istituto di Biologia e Patologia Molecolare del C.N.R., 2Centro di Eccellenza di Biologia e Medicina Molecolare (BEMM), Università La Sapienza, 00185 Rome and 4Centro Interdipartimentale di Ricerca per lAnalisi dei Modelli e dellInformazione nei Sistemi Biomedici (CISB), Italy
3 To whom correspondence should be addressed at: Dipartimento di Scienze Biochimiche, Università La Sapienza, P.le A. Moro, 5, 00185 Roma, Italy. e-mail: giulio.gianese{at}uniroma1.it
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: exposure category/prediction accuracy comparison/probability profiles/solvent accessibility
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
In this work we present a method to calculate the probability of residues in a protein sequence being in a particular accessibility state. The prediction system is based on probabilistic profiles (PP) and runs on a sliding window of amino acids symmetric about the residue whose accessibility category has to be predicted. The performance is compared with that of previous methods. In many cases the prediction accuracy is significantly improved.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
The prediction of the solvent accessibility i of residue i was based on the observation of the amino acid sequence {Sj} within a window of n residue size symmetric about position i. Positions within the window are identified by j. The information on the protein structures used as training set allows one to express the probability of a particular sequence {Sj} associated with a certain accessibility in terms of P({Sj}|
i), the conditional probability of that sequence given the solvent accessibility
i of the residue at location i. Thus, the probability P({Sj},
i) of finding at the same time a particular sequence and a particular accessibility state for residue i can be calculated as follows:
P({Sj}, i) = P({Sj}|
i)P(
i)(1)
where P(i) is the probability of solvent accessibility
i in the absence of any sequence information. The conditional probability P({Sj} |
i) of a sequence symmetric about position i of a test sequence was calculated considering the probability of occurrence of each amino acid residue aj in that sequence, P(aj |
i), and the conditional probability of observing each amino acid in position j (with j > 1), given the amino acid type in j 1, P(aj|aj1,
i). Conditional probabilities of the latter type correspond to a state called a first-order state, because it catches the first-order correlation between neighbouring residues, so that P({Sj}|
i) was expressed as:
The system takes into consideration each position j, and the corresponding residue, independently and, simultaneously, the same location j as conditioned by the immediately adjacent position j 1. The right-side probability terms in Equation 2 for an accessibility state i were obtained by summing the corresponding elements p of the probability profiles derived from each structure of the training set of k proteins. The elements p were normalized by division by the sum of the probabilities for all the m (= 1, 20) amino acids in each position j. For an amino acid ap in location j, (ap)j:
and for an amino acid (ap)j preceded by an amino acid aq, (aq)j1:
where ck is a coefficient introduced to weight the contribution of each protein structure to the calculation of probabilities, i.e. ck is the ratio between the number of amino acids of the sequence k and the total number of residues in the whole training set. A sufficient number of observations allows one to comply approximately with the probability requirements. For a three-state prediction the first term in Equation 3 needs 60 (= 3x20) entries for each position j, and the first term in Equation 4 requires 1200 (= 3x20x20) entries. In the latter case the data from the set of structures could not hold enough information. In such a case the probability of the whole sequence {Sj} in Equation 2 may easily become zero. To avoid this sort of overfitting it is possible to resort to the so-called pseudocounts, which means to count more amino acids than those contained in the data. In the present work, 1 was added to all the counts in the determination of the conditional probabilities p((ap)j|(ap)j1, i) for each structure of the training set.
Probability P(i) of occurrence of an accessibility state in the absence of any sequence information is equivalent to the fraction of residues in that state in a given protein and depends on the dimension of the molecule. The number Nb of buried residues in a globular protein proved to be approximately related to the total number Nt of residues (Janin, 1979
) by:
Nb1/3 = Nt1/3 b(5)
The equation contains the parameter b, which is related to the accessibility cut-off used to select buried residues. For a two-state solvent accessibility this equation allows one to calculate the expected fraction of buried and, by difference, of exposed residues in a given protein. For a three-state accessibility prediction (buried, intermediate, exposed) another equation had to be adopted, in order to calculate the fraction of the exposed residues Ne:
Ne = aNt2/3(6)
where Nt is the total number of amino acids of a protein and a a parameter related to the threshold used to define exposed residues. This equation is a modification of that proposed by Janin (Janin, 1976) and by Teller (Teller, 1976
) that relates the surface area of a protein to its molecular weight or volume. The parameters b and a in Equations 5 and 6, respectively, were determined by averaging their values calculated for each protein of the training set, except for those with interruptions of the polypeptide chain. In the case of data sets in which surface areas of residues in chains from oligomeric proteins were calculated using the oligomeric complexes, the values of the parameter b were calculated considering all the structures of the set, irrespective of their quaternary structure, for both three- and two-state prediction. Indeed, the assembly of subunits in an oligomeric protein does not alter significantly the total number of buried residues in the single chain. On the contrary, in the case of data sets with oligomeric proteins and for a three-state prediction, Equation 6 cannot be used to compute the number of exposed residues because of the redistribution of residues between the intermediate and exposed categories after the subunit association. For this reason, the distribution of the different amino acids in the exposed state over all the structures of the training set was considered. The frequency of each amino acid type in the exposed category was then used to estimate the total number of exposed residues in the test protein through its amino acid composition. If the number of exposed residues calculated in such a way exceeded the number derived by Equation 6 with a calculated using only the monomeric proteins of the training set, then the latter number was chosen. In Table I are listed the mean values of the parameters a and b calculated using only the monomeric proteins of the 1, 2 set by Rost and Sander (1994)
at different accessibility thresholds. The values of b calculated for the complete data set irrespective of the quaternary structure of the proteins show a mean difference of only 0.2 from those derived from monomeric proteins.
|
Multiple sequence information
It is well established that the inclusion of evolutionary information by means of multiple sequence alignments of related proteins can improve the prediction accuracy of the structural features of a protein. In order to include such information in our prediction scheme, the probability of residues P(aj|i) in Equation 2 in the locations of the environmental window was substituted by the probability of observing amino acid exchanges P(Aj|
i) in the alignment positions of the same window. For an exchange between an amino acid ap and an amino acid aq in location j, (Apq)j:
The exchange probability matrices were symmetric [P(Amn)j|i) = P(Anm)j|
i)], so that no direction was given for a particular residue substitution. Symmetry was imposed by assigning to (Amn)j and (Anm)j the values of their sum divided by two. To avoid redundancy of information derived from closely related sequences, only exchanges Amn with m
n were considered, thus for a three-state prediction the left term in Equation 7 requires 1140 [= 3x(20x20 20)] entries for each position j. As for conditional probabilities, the amount of entries requires pseudocounts, at least for data sets with approximately <200 proteins. The redundancy was also reduced by excluding from the alignments all the sequences sharing >90% residue identity with the sequence whose solvent accessibility has to be predicted. Sequences sharing <20% identity with the tested sequence were excluded to eliminate remotely related proteins in which the conservation of the exposure category at aligned positions cannot be assured. Only the exchanges between the protein under examination and its related sequences were considered. A further precaution was adopted to filter the over-representation of residues in highly related sequences: only unique pairs of adjacent residues were counted during the calculation of the conditional probabilities. For example, if pairs AA, AA, AS, SS, SS, AT were observed at positions j and j 1 in the same window in the aligned sequences, only AA, AS, SS and AT would be counted.
For each position j within the window the average of the probabilities for all the substitutions observed and the average of the conditional probabilities for all the pairs of residues observed in j and j 1 were taken for prediction. The maximal window length for prediction from multiple sequence alignments is 19.
Data sets and prediction evaluation
Four sets of structures were used: RS, 126 structures from Rost and Sander (1994); TG, 111 structures from Thompson and Goldstein (1996
); NM, 215 structures from Naderi-Manesh et al. (2001
); YZ, 421 structures from Yuan et al. (2002
). Protein structures were retrieved from the Brookhaven Protein Data Bank (PDB) (Sussman et al., 1998
). Accessible surface area of amino acid residues was calculated by the program DSSP (Kabsch and Sander, 1983
), which is a widely accepted method of solvent accessibility determination, and, in one case, by the program ASC (Eisenhaber and Argos, 1993
). Relative solvent accessibility was then determined by dividing the accessible surface area of a residue for the maximum exposed surface of the same residue type in a GlyXGly oligopeptide, or AlaXAla, and with the side chain in a fully extended conformation. Values of maximum exposed surface area were those given by Rose et al. (1985
), Shrake and Rupley (1973
) and Naderi-Manesh et al. (2001
). For the AlaXAla tripeptide, the values used were those calculated by Ahmad and Gromiha (2002
) and those reported by Ahmad et al. (2003
).
In order to predict the accessibility category of residues located within the N- and C-terminal regions, eight (for single sequence input) or nine (for multiple alignment input) serine residues were added to the ends of each polypeptide chain, all assumed in the exposed state. This procedure allows a better prediction of terminal residues rather than their bypassing. For a correct comparison between the proposed and the previous methods of solvent accessibility prediction, the calculations for the different data sets were carried out using the same exposure categories and values of maximal accessibility area for each amino acid as those adopted by the relevant authors. For each structure in the data sets, alignments of homologous sequences were extracted from the standard data set HSSP (Dodge et al., 1998). Because certain HSSP files can contain many sequences (of the order of thousands), no more than 400 closest related homologues for each protein family were considered, so as to balance the single family contribution. For the YZ set, a selection of structures from the 513 protein set of Cuff and Barton (1999
), the multiple alignments proposed by the latter authors were used. Because that data set is constituted of protein domains, several structures have an incomplete polypeptide chain, thus Equations 5 and 6 cannot be used to calculate the probability of each accessibility state
i. In such cases, the amino acid composition of the sliding window was used to estimate a local probability for
i. However, the incomplete sequence data from some single structural domains make the YZ set unsuitable for testing the accuracy of the PP method by using single sequence information. Multiple alignment data may counterbalance this deficiency of sequence information. For this reason the PP method was applied to the YZ set exclusively by using multiple alignments.
Prediction based on single sequence information was tested using a jack-knife method. Each protein in a set of k structures, in turn, was extracted from the data set and used as a test protein. The remaining k 1 structures were used to calculate the probability profiles and the parameters (a, b, ck) necessary for accessibility prediction. This procedure was repeated k times until each structure was tested once. However, prediction based on multiple sequence alignments was performed by an n-fold cross-validation test on the set examined, i.e. the data set was randomly divided into n subsets, with n equal to the number adopted by the authors whose method is under comparison. Each subset was extracted from the data set and its proteins used as test proteins. The remaining proteins were used to train the method. This procedure was repeated n times with different subdivisions of the data set until all proteins were tested once.
The percentage of correctly predicted residues was calculated by dividing the total number of residues correctly assigned to solvent accessibility categories by the total number of residues in the data set. The prediction quality was assessed also by the correlation coefficient r between the observed oi and predicted pi solvent exposure states over residue locations. For a data set of N residues:
The equation is the same as used for the calculation of the correlation coefficients by all the methods directly compared in this work.
The prediction programs were written in C-language and run under the IRIX 6.3 operating system.
![]() |
Results and discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
The prediction accuracy of the PP algorithm was tested with four different data sets. All the proteins of each data set were used (no subsets were considered) and accessibility categories at two (buried, exposed) and three (buried, intermediate, exposed) states were examined. The range of thresholds tested were those adopted by the authors proposing the protein set. Therefore, the problem concerning the optimal choice of the accessibility cut-offs defining the exposure states was not taken into account. However, as for the previous methods, the prediction accuracy shows considerable variation relative to the threshold values. In fact, as noted by Rost and Sander (1994), Thompson and Goldstein (1996
) and Richardson and Barlow (1999
), the use of thresholds splitting irregularly the amino acid residues into the different accessibility states may increase the percentage of correct assignment but to the detriment of the correlation coefficient that decreases.
The results of predictions based on single sequence data of the PP method are reported in Table II, in comparison with those from the previous methods, whenever available. On the RS set, the results for a two-state prediction give accuracy values comparable (cut-off 9%) or 1% higher (cut-off 16 and 23%) than the results performed by the Bayesian method (Thompson and Goldstein, 1996
). On the same data set, for a three-state prediction, the accuracy with 9, 64% thresholds is more than 6 and 1% better than those performed by, respectively, the neural network (NN) of Rost and Sander (1994
) and the Bayesian methods, while for 9, 36% thresholds the accuracy is 2% higher than the NN method and only a little greater than the Bayesian method. On the TG set, in comparison with the Bayesian approach, the PP method achieves accuracy more than 1% greater for a two-state prediction (cut-off 20%) and nearly 2% greater (cut-offs 9, 64%) or a little better (cut-offs 9, 36%) for a three-state prediction. On the NM set the accuracy is
29% better for a two-state model, and 28% better for a three-state-model, than those obtained by the information theory (IT) method of Naderi-Manesh et al. (2001
).
|
|
Comparison with other methods
In recent years great attention has been paid to the prediction of amino acid accessibility from sequence information as part of an attempt to improve the structural information that can be extracted from the increasing number of known protein sequences. However, in some cases a direct comparison of the prediction methods proposed is not possible; the reasons are various, for example, the selection of arbitrarily different threshold values defining exposure categories, or the use of different programs to calculate accessible surface area. The pioneering method of Holbrook et al. (1990), applying neural networks, and the method of Wako and Blundell (1994
), employing amino acid environment-dependent substitution tables, were not compared with the PP method since the authors have selected very small testing sets of, respectively, five and 13 proteins. The dimension of the testing sets is too small to assess the prediction accuracy. Mucchielli-Giorgi et al. (1999
) have proposed a method based on a logistic function producing an accuracy of 70.7% for a 25% accessibility threshold. However, their prediction is limited to a two-state model only and their results report several amino acids as unknown; for these reasons the comparison with other methods cannot take place.
Carugo (2000) elaborated the simple basic method of Richardson and Barlow (1999
) in order to include information deriving from the sequence environment of residues. On a data set of 338 monomeric protein structures that method obtained an accuracy of 75.6, 71.5 and 67.8% for 20, 30 and 40% cut-offs, respectively, by inclusion of information from two neighbouring residues. Also, in that case the prediction was restricted to a two-state model and the accuracy was reported as the mean percentage of correct predictions calculated for each protein of the set. Furthermore, the fractional solvent accessibility was calculated at variance with other methods as the ratio between the observed accessible area of a residue and its standard accessible area. The standard accessible area values were computed as proposed by Heringa et al. (1995
): the maximum accessible area for each residue type was computed in each structure of the training set and finally the maximum values were averaged. Because of such differences, a direct comparison cannot be carried out.
The program JNET (Cuff and Barton, 2000) achieves a performance of 76.2 and 79.8% with accessibility thresholds of 25 and 5%, respectively, by means of application of multiple sequence alignments profiles, though for a two-state model only. By means of a method based on multiple linear regression, with a threshold of 20%, an accuracy of 71.5% has been reported using single sequence information and of 75.0% with multiple sequence alignments (Li and Pan, 2001
). However, prediction for a three-state model by this method was not attempted. A new neural network approach (Pollastri et al., 2002
) achieved improved results with a 77.2% correct classification for a 25% cut-off by using input profiles of multiple sequence alignments, even if exclusively for a two-state prediction. NETASA (Ahmad and Gromiha, 2002
), a neural network based method, was applied to the same data set proposed by Naderi-Manesh et al. (2001
). Table IV reports the results of NETASA prediction for the testing set in comparison with those of the PP method using the same type of solvent accessibility computation and the same procedure of subdivision of the data into a training and test data sets. The same 30 proteins randomly selected by the authors were used for the training. The remaining 185 proteins were adopted as the test data set. The results were not included in Table II along with accuracy values performed by the IT method since NETASA adopts different operating parameters: other accessibility thresholds, a different program for calculation of accessible surface areas (ASC), a relative solvent accessibility computed with the AlaXAla oligopeptide in an extended conformation, instead of GlyXGly. Both the accuracy and correlation coefficient values of the PP method are evidently higher than the values given by NETASA, with an improvement in accuracy of approximately >1% for a two-state prediction and >3% for a three-state system.
|
Improvements of the PP method are probably due to different cooperative factors: the consideration of protein size effects on the probability of exposure category through Equations 5 and 6; and the averaging of the probabilities given by different window sizes during the testing procedure so as to normalize the contribution of long-range effects of residues interacting at different distances with the central residue.
The performance of the PP method highlights that protein molecules still contain unexploited single sequence information for improvement in solvent accessibility prediction through alternative mathematical approaches. However, a strict evaluation of such approaches cannot overlook the compositional quality of the data sets. Generally, the non-redundancy of the data is obtained by selection of proteins with no more than 25% pairwise-sequence identity. Unfortunately, percentage identity has been proved to be a bad measure of sequence similarity, in particular for values below 30% (Brenner et al., 1998). For example, as noted by Cuff and Barton (1999
), the RS set contains pairs of proteins that display clear sequence similarity when compared by more sophisticated methods than percentage identity. Considering the continuous flow of new solved structures and the availability of effective techniques to overcome sequence identity deficiencies, it is predictable that the method described in the present work could generate more accurate predictions with protein sets of higher quality and size.
![]() |
Acknowledgements |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Ahmad,S., Gromiha,M.M. and Sarai,A. (2003) Proteins, 50, 629635.[CrossRef][ISI][Medline]
Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1998) Proc. Natl Acad. Sci. USA, 95, 60736078.
Carugo,O. (2000) Protein Eng., 13, 607609.[CrossRef][ISI][Medline]
Cuff,J.A. and Barton,G.J. (1999) Proteins, 34, 508519.[CrossRef][ISI][Medline]
Cuff,J.A. and Barton,G.J. (2000) Proteins, 40, 502511.[CrossRef][ISI][Medline]
Dodge,C., Schneider,R. and Sander,C. (1998) Nucleic Acids Res., 26, 313315.
Eisenhaber,F. and Argos,P. (1993) J. Comp. Chem., 14, 12721280.[ISI]
Heringa,J., Argos,P., Egmond,M.R. and de Vlieg,J. (1995) Protein Eng., 8, 2130.[Medline]
Holbrook,S.R., Muskal,S.M. and Kim,S.-H. (1990) Protein Eng., 3, 659665.[ISI][Medline]
Janin,J. (1976) J. Mol. Biol., 105, 1314.[Medline]
Janin,J. (1979) Nature, 277, 491492.[ISI][Medline]
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 25772637.[ISI][Medline]
Li,X. and Pan,X.-M. (2001) Proteins, 42, 15.[CrossRef][ISI][Medline]
Mucchielli-Giorgi,M.H., Hazout,S. and Tufféry,P. (1999) Bioinformatics, 15, 176177.
Naderi-Manesh,H., Sadeghi,M., Arab,S. and Movahedi,A.A.M. (2001) Proteins, 42, 452459.[CrossRef][ISI][Medline]
Pascarella,S., De Persio,R., Bossa,F. and Argos,P. (1998) Proteins, 32, 190199.[CrossRef][ISI][Medline]
Pollastri,G., Baldi,P., Fariselli,P. and Casadio,R. (2002) Proteins, 47, 142153.[CrossRef][ISI][Medline]
Richardson,C.J. and Barlow,D.J. (1999) Protein Eng., 12, 10511054.[CrossRef][ISI][Medline]
Rose,G.D., Geselowitz,A.R., Lesser,G.J., Lee,R.H. and Zehfus,M.H. (1985) Science, 229, 834838.[ISI][Medline]
Rost,B. and Sander,C. (1994) Proteins, 20, 216226.[ISI][Medline]
Sánchez,R. and Sali,A. (1998) Proc. Natl Acad. Sci. USA, 95, 1359713602.
Schonbrun,J., Wedemeyer,W.J. and Baker,D. (2002) Curr. Opin. Struct. Biol., 12, 348354.[CrossRef][ISI][Medline]
Shrake,A. and Rupley,J.A. (1973) J. Mol. Biol., 79, 351371.[ISI][Medline]
Sussman,J.L., Lin,D., Jiang,J., Manning,N.O., Prilusky,J., Ritter,O. and Abola,E.E. (1998) Acta Crystallogr. D, 54, 10781084.[CrossRef][ISI][Medline]
Teller,D.C. (1976) Nature, 260, 729731.[ISI][Medline]
Thompson,M.J. and Goldstein,R.A. (1996) Proteins, 25, 3847.[CrossRef][ISI][Medline]
Wako,H. and Blundell,T. (1994) J. Mol. Biol., 238, 682692.[CrossRef][ISI][Medline]
Yuan,Z., Burrage,K. and Mattick,J.S. (2002) Proteins, 48, 566570.[CrossRef][ISI][Medline]
Received June 23, 2003; revised October 25, 2003; accepted October 30, 2003