Improvement in prediction of solvent accessibility by probability profiles

Giulio Gianese1,2,3, Francesco Bossa1,2 and Stefano Pascarella1,2,4

1Dipartimento di Scienze Biochimiche ‘A. Rossi Fanelli’ and Istituto di Biologia e Patologia Molecolare del C.N.R., 2Centro di Eccellenza di Biologia e Medicina Molecolare (BEMM), Università ‘La Sapienza’, 00185 Rome and 4Centro Interdipartimentale di Ricerca per l’Analisi dei Modelli e dell’Informazione nei Sistemi Biomedici (CISB), Italy

3 To whom correspondence should be addressed at: Dipartimento di Scienze Biochimiche, Università ‘La Sapienza’, P.le A. Moro, 5, 00185 Roma, Italy. e-mail: giulio.gianese{at}uniroma1.it


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
The capability of predicting folding and conformation of a protein from its primary structure is probably one of the main goals of modern biology. An accurate prediction of solvent accessibility is an intermediate step along this way. A new method for predicting solvent accessibility from single sequence and multiple alignment data is described. The method is based on probability profiles calculated on an amino acid sequence centred on the residue whose accessibility has to be predicted. A profile is constructed for each exposure category considered so as to calculate the probability of a sequence being generated by the different profiles. Prediction accuracy was tested on a variety of protein sets with two- and three-state models. Different thresholds were used according to those adopted by the authors proposing the data sets. The prediction accuracy is significantly improved over existing methods.

Keywords: exposure category/prediction accuracy comparison/probability profiles/solvent accessibility


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Function and activity of a protein are strictly dependent on its molecular structure. Thus, prediction of tertiary protein structure from sequence information is of considerable scientific importance, especially because of the gap existing between the available three-dimensional protein structural information and the over-abundance of amino acid sequence data (Sánchez and Sali, 1998Go). For proteins displaying significant sequence similarity to homologues with known structure, homology modelling has provided an effective tool for structure prediction (Schonbrun et al., 2002Go). Otherwise, the success has been less consistent. In such cases, several methods have been proposed and developed to predict lower-level, usually discrete, protein structural features. The identification of these one-dimensional features represents progressive steps toward the prediction of the three-dimensional structure of proteins. Initially, prediction of secondary structure from sequence information was attempted and in recent years accuracy has improved considerably. Presently, prediction achieves 65–69% accuracy for a three-state model ({alpha}-helix, ß-strand and loop regions) by using single sequence information (Cuff and Barton, 1999Go). Additional input information derived from multiple sequence alignments of homologous proteins increased accuracy to 76% (Cuff and Barton, 2000Go). Another underlying feature of protein structural arrangement is the degree of interaction of amino acid residues with solvent molecules. Despite the dominant attention given to secondary structure prediction, solvent accessibility prediction has recently received increased interest. Relative solvent accessibility can be calculated from the main and side chain atom coordinates of each residue in the solved three-dimensional structure of a protein, usually by the method of Kabsch and Sander (Kabsch and Sander, 1983Go). Cut-off values of solvent accessibility define the exposure categories, generally binary (buried or exposed) or ternary (buried, intermediate or exposed). Thresholds dividing categories are arbitrarily chosen by the various authors, which hinders accurate comparison between different methods. Different predictive approaches have been attempted, using methods based on neural networks with (Rost and Sander, 1994Go; Cuff and Barton, 2000Go; Pollastri et al., 2002Go) or without (Holbrook et al., 1990Go; Ahmad and Gromiha, 2002Go) multiple sequence alignment information, residue substitution matrices (RSMs) derived from protein families (Wako and Blundell, 1994Go; Pascarella et al., 1998Go), Bayesian statistics (Thompson and Goldstein, 1996Go), a logistic function (Mucchielli-Giorgi et al., 1999Go), information theory (Carugo, 2000Go; Naderi-Manesh et al., 2001Go), multiple linear regression (Li and Pan, 2001Go) and support vector machines (SVM) (Yuan et al., 2002Go). Recently, a new method based on neural networks was presented for real value prediction of solvent accessibility (Ahmad et al., 2003Go). A comparison of the reported methods shows that binary categories achieve 69–71% accuracy and ternary categories ~55–60% when single sequence information is used. These accuracy values increase by inclusion of evolutionary data from multiple sequence alignments of homologous proteins: the correct prediction improves by ~4–6% and ~3–4% for, respectively, a two-state and a three-state model. It is noteworthy that prediction accuracy depends on the cut-off value of solvent accessibility. The baseline method (Richardson and Barlow, 1999Go) was proposed to give a bottom line against which more sophisticated methods of solvent accessibility prediction can be judged. The method assigns a residue into the particular exposure category in which it is most frequently found, irrespective of its local surrounding sequence. In comparison, the simple baseline approach often reaches an accuracy level only a little lower than more complex methods.

In this work we present a method to calculate the probability of residues in a protein sequence being in a particular accessibility state. The prediction system is based on probabilistic profiles (PP) and runs on a sliding window of amino acids symmetric about the residue whose accessibility category has to be predicted. The performance is compared with that of previous methods. In many cases the prediction accuracy is significantly improved.


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Algorithm

The prediction of the solvent accessibility {omega}i of residue i was based on the observation of the amino acid sequence {Sj} within a window of n residue size symmetric about position i. Positions within the window are identified by j. The information on the protein structures used as training set allows one to express the probability of a particular sequence {Sj} associated with a certain accessibility in terms of P({Sj}|{omega}i), the conditional probability of that sequence given the solvent accessibility {omega}i of the residue at location i. Thus, the probability P({Sj}, {omega}i) of finding at the same time a particular sequence and a particular accessibility state for residue i can be calculated as follows:

P({Sj}, {omega}i) = P({Sj}|{omega}i)P({omega}i)(1)

where P({omega}i) is the probability of solvent accessibility {omega}i in the absence of any sequence information. The conditional probability P({Sj} | {omega}i) of a sequence symmetric about position i of a test sequence was calculated considering the probability of occurrence of each amino acid residue aj in that sequence, P(aj | {omega}i), and the conditional probability of observing each amino acid in position j (with j > 1), given the amino acid type in j 1, P(aj|aj–1, {omega}i). Conditional probabilities of the latter type correspond to a state called a first-order state, because it catches the first-order correlation between neighbouring residues, so that P({Sj}|{omega}i) was expressed as:

The system takes into consideration each position j, and the corresponding residue, independently and, simultaneously, the same location j as conditioned by the immediately adjacent position j – 1. The right-side probability terms in Equation 2 for an accessibility state {omega}i were obtained by summing the corresponding elements p of the probability profiles derived from each structure of the training set of k proteins. The elements p were normalized by division by the sum of the probabilities for all the m (= 1, 20) amino acids in each position j. For an amino acid ap in location j, (ap)j:

and for an amino acid (ap)j preceded by an amino acid aq, (aq)j–1:

where ck is a coefficient introduced to weight the contribution of each protein structure to the calculation of probabilities, i.e. ck is the ratio between the number of amino acids of the sequence k and the total number of residues in the whole training set. A sufficient number of observations allows one to comply approximately with the probability requirements. For a three-state prediction the first term in Equation 3 needs 60 (= 3x20) entries for each position j, and the first term in Equation 4 requires 1200 (= 3x20x20) entries. In the latter case the data from the set of structures could not hold enough information. In such a case the probability of the whole sequence {Sj} in Equation 2 may easily become zero. To avoid this sort of overfitting it is possible to resort to the so-called pseudocounts, which means to count more amino acids than those contained in the data. In the present work, 1 was added to all the counts in the determination of the conditional probabilities p((ap)j|(ap)j–1, {omega}i) for each structure of the training set.

Probability P({omega}i) of occurrence of an accessibility state in the absence of any sequence information is equivalent to the fraction of residues in that state in a given protein and depends on the dimension of the molecule. The number Nb of buried residues in a globular protein proved to be approximately related to the total number Nt of residues (Janin, 1979Go) by:

Nb1/3 = Nt1/3b(5)

The equation contains the parameter b, which is related to the accessibility cut-off used to select buried residues. For a two-state solvent accessibility this equation allows one to calculate the expected fraction of buried and, by difference, of exposed residues in a given protein. For a three-state accessibility prediction (buried, intermediate, exposed) another equation had to be adopted, in order to calculate the fraction of the exposed residues Ne:

Ne = aNt2/3(6)

where Nt is the total number of amino acids of a protein and a a parameter related to the threshold used to define exposed residues. This equation is a modification of that proposed by Janin (Janin, 1976Go) and by Teller (Teller, 1976Go) that relates the surface area of a protein to its molecular weight or volume. The parameters b and a in Equations 5 and 6, respectively, were determined by averaging their values calculated for each protein of the training set, except for those with interruptions of the polypeptide chain. In the case of data sets in which surface areas of residues in chains from oligomeric proteins were calculated using the oligomeric complexes, the values of the parameter b were calculated considering all the structures of the set, irrespective of their quaternary structure, for both three- and two-state prediction. Indeed, the assembly of subunits in an oligomeric protein does not alter significantly the total number of buried residues in the single chain. On the contrary, in the case of data sets with oligomeric proteins and for a three-state prediction, Equation 6 cannot be used to compute the number of exposed residues because of the redistribution of residues between the intermediate and exposed categories after the subunit association. For this reason, the distribution of the different amino acids in the exposed state over all the structures of the training set was considered. The frequency of each amino acid type in the exposed category was then used to estimate the total number of exposed residues in the test protein through its amino acid composition. If the number of exposed residues calculated in such a way exceeded the number derived by Equation 6 with a calculated using only the monomeric proteins of the training set, then the latter number was chosen. In Table I are listed the mean values of the parameters a and b calculated using only the monomeric proteins of the 1, 2 set by Rost and Sander (1994)Go at different accessibility thresholds. The values of b calculated for the complete data set irrespective of the quaternary structure of the proteins show a mean difference of only –0.2 from those derived from monomeric proteins.


View this table:
[in this window]
[in a new window]
 
Table I. Average values of parameters a and b calculated for monomeric proteins and different accessibility thresholds
 
The probability of a sequence symmetrical with respect to an amino acid i being associated with a particular exposure state {omega}i was calculated for different sizes of the ‘environmental’ window, ranging from a length of w = 1 (coinciding with the central residue i) to w = 17 (i.e. eight residues on either side of the central position), for a total of nine different lengths. The maximal number of residues of the window was selected by observing the effect on prediction results of different window sizes. The maximal length chosen was that whose further elongation does not increase the prediction accuracy. The prediction of {omega}i of the residue ai in the window was made on the basis of the normalized probability in Equation 1. The mean over all the probabilities given by the different window lengths was calculated. The final step assigns the residue i to the accessibility category with the largest average.

Multiple sequence information

It is well established that the inclusion of evolutionary information by means of multiple sequence alignments of related proteins can improve the prediction accuracy of the structural features of a protein. In order to include such information in our prediction scheme, the probability of residues P(aj|{omega}i) in Equation 2 in the locations of the environmental window was substituted by the probability of observing amino acid exchanges P(Aj|{omega}i) in the alignment positions of the same window. For an exchange between an amino acid ap and an amino acid aq in location j, (Apq)j:

The exchange probability matrices were symmetric [P(Amn)j|{omega}i) = P(Anm)j|{omega}i)], so that no direction was given for a particular residue substitution. Symmetry was imposed by assigning to (Amn)j and (Anm)j the values of their sum divided by two. To avoid redundancy of information derived from closely related sequences, only exchanges Amn with m != n were considered, thus for a three-state prediction the left term in Equation 7 requires 1140 [= 3x(20x20 – 20)] entries for each position j. As for conditional probabilities, the amount of entries requires pseudocounts, at least for data sets with approximately <200 proteins. The redundancy was also reduced by excluding from the alignments all the sequences sharing >90% residue identity with the sequence whose solvent accessibility has to be predicted. Sequences sharing <20% identity with the tested sequence were excluded to eliminate remotely related proteins in which the conservation of the exposure category at aligned positions cannot be assured. Only the exchanges between the protein under examination and its related sequences were considered. A further precaution was adopted to filter the over-representation of residues in highly related sequences: only unique pairs of adjacent residues were counted during the calculation of the conditional probabilities. For example, if pairs AA, AA, AS, SS, SS, AT were observed at positions j and j – 1 in the same window in the aligned sequences, only AA, AS, SS and AT would be counted.

For each position j within the window the average of the probabilities for all the substitutions observed and the average of the conditional probabilities for all the pairs of residues observed in j and j – 1 were taken for prediction. The maximal window length for prediction from multiple sequence alignments is 19.

Data sets and prediction evaluation

Four sets of structures were used: RS, 126 structures from Rost and Sander (1994Go); TG, 111 structures from Thompson and Goldstein (1996Go); NM, 215 structures from Naderi-Manesh et al. (2001Go); YZ, 421 structures from Yuan et al. (2002Go). Protein structures were retrieved from the Brookhaven Protein Data Bank (PDB) (Sussman et al., 1998Go). Accessible surface area of amino acid residues was calculated by the program DSSP (Kabsch and Sander, 1983Go), which is a widely accepted method of solvent accessibility determination, and, in one case, by the program ASC (Eisenhaber and Argos, 1993Go). Relative solvent accessibility was then determined by dividing the accessible surface area of a residue for the maximum exposed surface of the same residue type in a Gly–X–Gly oligopeptide, or Ala–X–Ala, and with the side chain in a fully extended conformation. Values of maximum exposed surface area were those given by Rose et al. (1985Go), Shrake and Rupley (1973Go) and Naderi-Manesh et al. (2001Go). For the Ala–X–Ala tripeptide, the values used were those calculated by Ahmad and Gromiha (2002Go) and those reported by Ahmad et al. (2003Go).

In order to predict the accessibility category of residues located within the N- and C-terminal regions, eight (for single sequence input) or nine (for multiple alignment input) serine residues were added to the ends of each polypeptide chain, all assumed in the exposed state. This procedure allows a better prediction of terminal residues rather than their bypassing. For a correct comparison between the proposed and the previous methods of solvent accessibility prediction, the calculations for the different data sets were carried out using the same exposure categories and values of maximal accessibility area for each amino acid as those adopted by the relevant authors. For each structure in the data sets, alignments of homologous sequences were extracted from the standard data set HSSP (Dodge et al., 1998Go). Because certain HSSP files can contain many sequences (of the order of thousands), no more than 400 closest related homologues for each protein family were considered, so as to balance the single family contribution. For the YZ set, a selection of structures from the 513 protein set of Cuff and Barton (1999Go), the multiple alignments proposed by the latter authors were used. Because that data set is constituted of protein domains, several structures have an incomplete polypeptide chain, thus Equations 5 and 6 cannot be used to calculate the probability of each accessibility state {omega}i. In such cases, the amino acid composition of the sliding window was used to estimate a local probability for {omega}i. However, the incomplete sequence data from some single structural domains make the YZ set unsuitable for testing the accuracy of the PP method by using single sequence information. Multiple alignment data may counterbalance this deficiency of sequence information. For this reason the PP method was applied to the YZ set exclusively by using multiple alignments.

Prediction based on single sequence information was tested using a jack-knife method. Each protein in a set of k structures, in turn, was extracted from the data set and used as a test protein. The remaining k – 1 structures were used to calculate the probability profiles and the parameters (a, b, ck) necessary for accessibility prediction. This procedure was repeated k times until each structure was tested once. However, prediction based on multiple sequence alignments was performed by an n-fold cross-validation test on the set examined, i.e. the data set was randomly divided into n subsets, with n equal to the number adopted by the authors whose method is under comparison. Each subset was extracted from the data set and its proteins used as test proteins. The remaining proteins were used to train the method. This procedure was repeated n times with different subdivisions of the data set until all proteins were tested once.

The percentage of correctly predicted residues was calculated by dividing the total number of residues correctly assigned to solvent accessibility categories by the total number of residues in the data set. The prediction quality was assessed also by the correlation coefficient r between the observed oi and predicted pi solvent exposure states over residue locations. For a data set of N residues:

The equation is the same as used for the calculation of the correlation coefficients by all the methods directly compared in this work.

The prediction programs were written in C-language and run under the IRIX 6.3 operating system.


    Results and discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Prediction accuracy

The prediction accuracy of the PP algorithm was tested with four different data sets. All the proteins of each data set were used (no subsets were considered) and accessibility categories at two (buried, exposed) and three (buried, intermediate, exposed) states were examined. The range of thresholds tested were those adopted by the authors proposing the protein set. Therefore, the problem concerning the optimal choice of the accessibility cut-offs defining the exposure states was not taken into account. However, as for the previous methods, the prediction accuracy shows considerable variation relative to the threshold values. In fact, as noted by Rost and Sander (1994Go), Thompson and Goldstein (1996Go) and Richardson and Barlow (1999Go), the use of thresholds splitting irregularly the amino acid residues into the different accessibility states may increase the percentage of correct assignment but to the detriment of the correlation coefficient that decreases.

The results of predictions based on single sequence data of the PP method are reported in Table II, in comparison with those from the previous methods, whenever available. On the RS set, the results for a two-state prediction give accuracy values comparable (cut-off 9%) or ~1% higher (cut-off 16 and 23%) than the results performed by the Bayesian method (Thompson and Goldstein, 1996Go). On the same data set, for a three-state prediction, the accuracy with 9, 64% thresholds is more than 6 and 1% better than those performed by, respectively, the neural network (NN) of Rost and Sander (1994Go) and the Bayesian methods, while for 9, 36% thresholds the accuracy is 2% higher than the NN method and only a little greater than the Bayesian method. On the TG set, in comparison with the Bayesian approach, the PP method achieves accuracy more than 1% greater for a two-state prediction (cut-off 20%) and nearly 2% greater (cut-offs 9, 64%) or a little better (cut-offs 9, 36%) for a three-state prediction. On the NM set the accuracy is ~2–9% better for a two-state model, and 2–8% better for a three-state-model, than those obtained by the information theory (IT) method of Naderi-Manesh et al. (2001Go).


View this table:
[in this window]
[in a new window]
 
Table II. Prediction comparison of NN, Bayesian (Bayes), IT and PP methods for different data sets of protein structures and thresholds of exposure categories by using single sequence information
 
Some consideration should be given to the results of the IT method. The method is based on information theory and was applied to several accessibility states with a variety of cut-off values, including two and three states. Nevertheless, the authors have also applied their own algorithm to calculate accessible surface area values from PDB data and used such values to assess their method accuracy. With their accessibility measures, the method seems to reach a higher accuracy. Therefore, the comparison took into consideration only the accuracy obtained using the accessibility definition by DSSP, since this program is the commonly utilized reference. The results of prediction based on multiple sequence alignment data of the PP method in comparison with those from other extant methods are shown in Table III. When the PP method is applied to the RS set and compared with the NN method, for a two-state prediction there is an improvement in accuracy of more than 2% for a 9% threshold, while for a 16% threshold the accuracy is comparable. For a three-state system the accuracy is comparable for cut-offs of 9, 36% but increases to more than 6% for cut-offs of 9, 64%. Moreover, the improvement is 3% for 5, 25% cut-offs in comparison with the accuracy achieved on the same data set by the RSM method of Pascarella et al. (1998Go). The results for the TG set compared with the Bayesian method reveal a slight rise in accuracy (0.3%) both for two- and three-state prediction. Nevertheless, the increase in accuracy is more than 3% for 5, 25% thresholds when compared with the result of the RSM method. Comparison with the SVM method on the YZ set essentially displays comparable results for a two-state prediction. For a three-state system, instead, the prediction accuracy is similar for 4, 36% and 9, 16% thresholds, but is more than 1% increased for 9, 36% thresholds.


View this table:
[in this window]
[in a new window]
 
Table III. Prediction comparison of NN, Bayesian (Bayes), RSM, SVM and PP methods for different data sets of protein structures and thresholds of exposure categories by using multiple sequence alignments information
 
On the whole, the PP algorithm achieves predictions at least comparable or, in many cases, superior to that of the examined methods, the improvement ranging from a fraction of one to a few percentage points. In addition, the correlation coefficients are always higher than those obtained by the other methods.

Comparison with other methods

In recent years great attention has been paid to the prediction of amino acid accessibility from sequence information as part of an attempt to improve the structural information that can be extracted from the increasing number of known protein sequences. However, in some cases a direct comparison of the prediction methods proposed is not possible; the reasons are various, for example, the selection of arbitrarily different threshold values defining exposure categories, or the use of different programs to calculate accessible surface area. The pioneering method of Holbrook et al. (1990Go), applying neural networks, and the method of Wako and Blundell (1994Go), employing amino acid environment-dependent substitution tables, were not compared with the PP method since the authors have selected very small testing sets of, respectively, five and 13 proteins. The dimension of the testing sets is too small to assess the prediction accuracy. Mucchielli-Giorgi et al. (1999Go) have proposed a method based on a logistic function producing an accuracy of 70.7% for a 25% accessibility threshold. However, their prediction is limited to a two-state model only and their results report several amino acids as ‘unknown’; for these reasons the comparison with other methods cannot take place.

Carugo (2000Go) elaborated the simple basic method of Richardson and Barlow (1999Go) in order to include information deriving from the sequence environment of residues. On a data set of 338 monomeric protein structures that method obtained an accuracy of 75.6, 71.5 and 67.8% for 20, 30 and 40% cut-offs, respectively, by inclusion of information from two neighbouring residues. Also, in that case the prediction was restricted to a two-state model and the accuracy was reported as the mean percentage of correct predictions calculated for each protein of the set. Furthermore, the fractional solvent accessibility was calculated at variance with other methods as the ratio between the observed accessible area of a residue and its standard accessible area. The standard accessible area values were computed as proposed by Heringa et al. (1995Go): the maximum accessible area for each residue type was computed in each structure of the training set and finally the maximum values were averaged. Because of such differences, a direct comparison cannot be carried out.

The program JNET (Cuff and Barton, 2000Go) achieves a performance of 76.2 and 79.8% with accessibility thresholds of 25 and 5%, respectively, by means of application of multiple sequence alignments profiles, though for a two-state model only. By means of a method based on multiple linear regression, with a threshold of 20%, an accuracy of 71.5% has been reported using single sequence information and of 75.0% with multiple sequence alignments (Li and Pan, 2001Go). However, prediction for a three-state model by this method was not attempted. A new neural network approach (Pollastri et al., 2002Go) achieved improved results with a 77.2% correct classification for a 25% cut-off by using input profiles of multiple sequence alignments, even if exclusively for a two-state prediction. NETASA (Ahmad and Gromiha, 2002Go), a neural network based method, was applied to the same data set proposed by Naderi-Manesh et al. (2001Go). Table IV reports the results of NETASA prediction for the testing set in comparison with those of the PP method using the same type of solvent accessibility computation and the same procedure of subdivision of the data into a training and test data sets. The same 30 proteins randomly selected by the authors were used for the training. The remaining 185 proteins were adopted as the test data set. The results were not included in Table II along with accuracy values performed by the IT method since NETASA adopts different operating parameters: other accessibility thresholds, a different program for calculation of accessible surface areas (ASC), a relative solvent accessibility computed with the Ala–X–Ala oligopeptide in an extended conformation, instead of Gly–X–Gly. Both the accuracy and correlation coefficient values of the PP method are evidently higher than the values given by NETASA, with an improvement in accuracy of approximately >1% for a two-state prediction and >3% for a three-state system.


View this table:
[in this window]
[in a new window]
 
Table IV. Comparison of solvent accessibility prediction for the NM data set between PP and NETASA methods
 
The comparison of the accuracy levels reached by existing prediction methods shows that the PP approach is probably the most accurate in the prediction of solvent accessibility using both single sequence and multiple sequence information. It must be noted, however, that a rigorous comparison requires application of all the methods on the same data set using the same exposure categories and the same values of maximum exposed surface area for each amino acid. In particular, the choice of different sets of maximum exposed surface values may change the percentage of accuracy on the same data set even by a few percentage points.

Improvements of the PP method are probably due to different cooperative factors: the consideration of protein size effects on the probability of exposure category through Equations 5 and 6; and the averaging of the probabilities given by different window sizes during the testing procedure so as to normalize the contribution of long-range effects of residues interacting at different distances with the central residue.

The performance of the PP method highlights that protein molecules still contain unexploited single sequence information for improvement in solvent accessibility prediction through alternative mathematical approaches. However, a strict evaluation of such approaches cannot overlook the compositional quality of the data sets. Generally, the non-redundancy of the data is obtained by selection of proteins with no more than 25% pairwise-sequence identity. Unfortunately, percentage identity has been proved to be a bad measure of sequence similarity, in particular for values below 30% (Brenner et al., 1998Go). For example, as noted by Cuff and Barton (1999Go), the RS set contains pairs of proteins that display clear sequence similarity when compared by more sophisticated methods than percentage identity. Considering the continuous flow of new solved structures and the availability of effective techniques to overcome sequence identity deficiencies, it is predictable that the method described in the present work could generate more accurate predictions with protein sets of higher quality and size.


    Acknowledgements
 
We thank Professor Carlotta Maffei for reading the manuscript and helpful advice. This work was partially supported by grants from the Italian ‘Ministero dell’Istruzione, dell’Università e della Ricerca’.


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Ahmad,S. and Gromiha,M.M. (2002) Bioinformatics, 18, 819–824.[Abstract/Free Full Text]

Ahmad,S., Gromiha,M.M. and Sarai,A. (2003) Proteins, 50, 629–635.[CrossRef][ISI][Medline]

Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1998) Proc. Natl Acad. Sci. USA, 95, 6073–6078.[Abstract/Free Full Text]

Carugo,O. (2000) Protein Eng., 13, 607–609.[CrossRef][ISI][Medline]

Cuff,J.A. and Barton,G.J. (1999) Proteins, 34, 508–519.[CrossRef][ISI][Medline]

Cuff,J.A. and Barton,G.J. (2000) Proteins, 40, 502–511.[CrossRef][ISI][Medline]

Dodge,C., Schneider,R. and Sander,C. (1998) Nucleic Acids Res., 26, 313–315.[Abstract/Free Full Text]

Eisenhaber,F. and Argos,P. (1993) J. Comp. Chem., 14, 1272–1280.[ISI]

Heringa,J., Argos,P., Egmond,M.R. and de Vlieg,J. (1995) Protein Eng., 8, 21–30.[Medline]

Holbrook,S.R., Muskal,S.M. and Kim,S.-H. (1990) Protein Eng., 3, 659–665.[ISI][Medline]

Janin,J. (1976) J. Mol. Biol., 105, 13–14.[Medline]

Janin,J. (1979) Nature, 277, 491–492.[ISI][Medline]

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.[ISI][Medline]

Li,X. and Pan,X.-M. (2001) Proteins, 42, 1–5.[CrossRef][ISI][Medline]

Mucchielli-Giorgi,M.H., Hazout,S. and Tufféry,P. (1999) Bioinformatics, 15, 176–177.[Abstract/Free Full Text]

Naderi-Manesh,H., Sadeghi,M., Arab,S. and Movahedi,A.A.M. (2001) Proteins, 42, 452–459.[CrossRef][ISI][Medline]

Pascarella,S., De Persio,R., Bossa,F. and Argos,P. (1998) Proteins, 32, 190–199.[CrossRef][ISI][Medline]

Pollastri,G., Baldi,P., Fariselli,P. and Casadio,R. (2002) Proteins, 47, 142–153.[CrossRef][ISI][Medline]

Richardson,C.J. and Barlow,D.J. (1999) Protein Eng., 12, 1051–1054.[CrossRef][ISI][Medline]

Rose,G.D., Geselowitz,A.R., Lesser,G.J., Lee,R.H. and Zehfus,M.H. (1985) Science, 229, 834–838.[ISI][Medline]

Rost,B. and Sander,C. (1994) Proteins, 20, 216–226.[ISI][Medline]

Sánchez,R. and Sali,A. (1998) Proc. Natl Acad. Sci. USA, 95, 13597–13602.[Abstract/Free Full Text]

Schonbrun,J., Wedemeyer,W.J. and Baker,D. (2002) Curr. Opin. Struct. Biol., 12, 348–354.[CrossRef][ISI][Medline]

Shrake,A. and Rupley,J.A. (1973) J. Mol. Biol., 79, 351–371.[ISI][Medline]

Sussman,J.L., Lin,D., Jiang,J., Manning,N.O., Prilusky,J., Ritter,O. and Abola,E.E. (1998) Acta Crystallogr. D, 54, 1078–1084.[CrossRef][ISI][Medline]

Teller,D.C. (1976) Nature, 260, 729–731.[ISI][Medline]

Thompson,M.J. and Goldstein,R.A. (1996) Proteins, 25, 38–47.[CrossRef][ISI][Medline]

Wako,H. and Blundell,T. (1994) J. Mol. Biol., 238, 682–692.[CrossRef][ISI][Medline]

Yuan,Z., Burrage,K. and Mattick,J.S. (2002) Proteins, 48, 566–570.[CrossRef][ISI][Medline]

Received June 23, 2003; revised October 25, 2003; accepted October 30, 2003





This Article
Abstract
FREE Full Text (PDF)
Alert me when this article is cited
Alert me if a correction is posted
Services
Email this article to a friend
Similar articles in this journal
Similar articles in ISI Web of Science
Similar articles in PubMed
Alert me to new issues of the journal
Add to My Personal Archive
Download to citation manager
Search for citing articles in:
ISI Web of Science (6)
Request Permissions
Google Scholar
Articles by Gianese, G.
Articles by Pascarella, S.
PubMed
PubMed Citation
Articles by Gianese, G.
Articles by Pascarella, S.