1 Department of Computer Science, University of Wales, Aberystwyth, Penglais, Aberystwyth, Ceredigion, SY23 3DB, Wales, UK, 2 Department of Engineering Mathematics and Computer Science, Speed Scientific School, University of Louisville, Louisville, KY 40292, USA and 3 Department of Biostatistics and Medical Informatics and Department of Computer Sciences, University of Wisconsin, 1300 University Avenue, Room 5795 Medical Sciences, Madison, WI 53706, USA. Email: page{at}biostac.wisc.edu
Abstract
Keywords: machine-learning/neural-networks/secondary structure/statistics
Introduction
In this paper we seek to answer the following questions. Is it better to use a single secondary structure prediction method or to combine predictions from several different methods? If it is better to combine predictions, what is the best method for doing so?
Basic probability theory states that all of the evidence relevant to a prediction should be used in making that prediction (Jaynes, 1994). This suggests that if a novel prediction method provides new relevant information, then it should be possible to combine it with existing methods to obtain improved predictions. This idea is the basis of ensemble learning and multi-strategy learning, which are currently important subjects in machine learning (Dietterich, 1997). The standard approach in multi-strategy learning is to identify different prediction or learning tasks within one problem-solving activity and to apply the most appropriate strategy independently to each task (Tecuci, 1998; Califf and Mooney, 1999). Ensemble learning, on the other hand, focuses on a single prediction task and builds multiple predictors or classifiers (an ensemble) for that task. The different predictors are combined either by voting or by training a classifier to combine them. In the majority of ensemble learning applications the same machine learning algorithm is used and only the training data is changed. This is what is done in the popular forms of ensemble learning known as bagging and boosting: in bagging, different prediction methods are repeatedly learnt using different bootstrap samples (Breiman, 1996); in boosting, the sample is `reweighted' to give more importance to cases that are incorrectly predicted (Freund and Schapire, 1996). Bauer and Kohavi (1999) provide an empirical comparative analysis of boosting, bagging and related ensemble approaches.
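As a concrete illustration of the bagging idea described above (and not of any of the secondary structure methods discussed in this paper), the following minimal Python sketch learns an ensemble on bootstrap samples and combines the resulting predictors by majority vote. The toy majority-class base learner is purely illustrative.

```python
import random
from collections import Counter

def bagging_predict(train, learn, x, n_models=11, seed=0):
    """Learn `n_models` predictors, each on a bootstrap sample (drawn with
    replacement) of the training data, then combine them by majority vote.
    `learn` maps a list of (features, label) pairs to a predict(x) function."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the training set, drawn with replacement
        sample = [rng.choice(train) for _ in range(len(train))]
        models.append(learn(sample))
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

def majority_class_learner(sample):
    """Toy base learner: always predict the sample's most common class."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority
```

Because each base learner sees a different bootstrap sample, the ensemble's vote can be more stable than any single learner trained once on the full data.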
Within the field of protein secondary structure prediction the idea of combining different prediction methods is also an old one. Schulz et al. (1974) combined 11 methods; Argos et al. (1976) combined five. Sternberg (1983) stressed the merits and limitations of different prediction methods for different types of secondary structure (multi-strategy learning). Nishikawa and Ooi (1986) combined methods using a voting strategy. Biou et al. (1988) used a complicated biased method to combine three methods (including GOR III). Zhang et al. (1992) used a neural network (which they found to be better than voting) to combine predictors obtained from neural network, nearest neighbour and naive Bayes algorithms. Rost and Sander (1993) in PHD learnt several different neural networks and averaged their output to produce a final prediction. Salamov and Solovyev (1995) used majority voting to combine different nearest neighbour predictors. King and Sternberg (1996) showed that combining their program DSC with the neural network method PHD produced an improved prediction accuracy. Recently, Cuff and Barton (1999) carried out an extensive comparison of different prediction methods: DSC v1.0, PHD, NNSSP, Predator, ZPRED (Zvelebil et al., 1987) and MULPRED (G.J.Barton, unpublished). They also set up a server for these different prediction methods: http://circinus.ebi.ac.uk:8081/. The authors tried using weighted votes and neural networks to combine methods, but concluded that a simple majority vote of DSC v1.0, PHD, NNSSP and Predator (with PHD being used to break ties) was the best.
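The simple majority vote with tie-breaking described above can be sketched per residue as follows. This is an illustrative sketch, not the Cuff and Barton implementation; beyond "PHD breaks ties", the dictionary-based interface and the tie-detection detail are our own assumptions.

```python
from collections import Counter

def majority_vote(preds, tie_breaker="PHD"):
    """Combine per-residue predictions from several methods by simple
    majority vote; on a tie for first place, return the tie-breaking
    method's own prediction. `preds` maps a method name to its predicted
    state, e.g. 'H' (helix), 'E' (strand) or 'C' (coil)."""
    counts = Counter(preds.values()).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # Two states received the same (highest) number of votes
        return preds[tie_breaker]
    return counts[0][0]
```

For example, a 2-2 split between helix and strand among four methods would be resolved by whatever the tie-breaking method predicted.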
A central problem with work on combining protein secondary structure prediction methods is obtaining an accurate and unbiased estimate of the accuracy of the combined method. This is because authors of combined methods typically have not had access to an unbiased test set. For example, in King and Sternberg (1996) the combined results were biased optimistically because the PHD predictions were based on protein domains that PHD had trained on; a similar problem faced Cuff and Barton (1999), whose results were also optimistically biased because their test data included protein domains that NNSSP and Predator had been trained on.
To obtain an unbiased estimate it is necessary to test on data that none of the individual methods in the combined method have been trained on. There are two possible ways to do this: retrain all of the individual methods on a common dataset, holding out unbiased test data; or test the methods, as distributed, on structures determined after all of the methods were trained.
In this paper we use the second approach of testing prediction methods on unseen datasets. We use two unseen datasets, one from CASP3 (the Third Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction) and one from the European Bioinformatics Institute.
Methods
Training data
The training data used was a set of 496 non-homologous domains. This dataset is based on the one developed by Cuff and Barton (1999) and it is almost a proper superset of the training set of 126 domains used originally to train PHD (Rost and Sander, 1993) and DSC (King and Sternberg, 1996); the reason that it is not quite a superset is that the definition of homology used is now stricter. The dataset differs slightly from that of Barton in excluding proteins with less than 30 residues. There are 82 847 residues in this dataset: 28 678 in α-helix conformation, 17 741 in β-strand and 36 428 in coil. Secondary structure was assigned using DSSP (Kabsch and Sander, 1983). Cuff and Barton (1999) have shown that the exact mapping of DSSP output to three-state secondary structure can have a significant effect on the resulting estimated accuracy. We have used the conservative mapping: H, I and G are translated to α-helix; E is translated to β-strand; and all other states are translated as coil. This dataset is based on a 1996 release of PDB.
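The conservative DSSP mapping just described can be written directly as a small function. This is a sketch; the single-letter output codes H/E/C for the three states are our own convention, not part of DSSP.

```python
def dssp_to_three_state(dssp_states):
    """Conservative DSSP -> three-state mapping used in the text:
    H, I and G -> alpha-helix ('H'); E -> beta-strand ('E');
    every other DSSP state -> coil ('C')."""
    mapping = {"H": "H", "I": "H", "G": "H", "E": "E"}
    return "".join(mapping.get(s, "C") for s in dssp_states)
```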
Test data
We used two separate test sets. The first test set consisted of 17 domains which the CASP3 organizers identified as having no known homologous structures in PDB (http://PredictionCenter.llnl.gov/), and hence no homology with our training set. We used the 17 domains which were chronologically last to be revealed before the CASP3 meeting; we could not use any more CASP3 proteins as we did not start the experiment in time. These proteins are given in Table I. The data consists of 2373 residues: 886 in α-helix conformation, 433 in β-strand and 1054 in coil. Residues not seen in the protein structures were assigned as coil. This dataset is somewhat small to reveal significant statistical differences, but there was no possibility that any prediction method could have seen it, as all the predictions of the individual methods were carried out before the structures were revealed.
If the same broad results are obtained on both datasets, then those results will have strong experimental support.
Alignments
We formed alignments for all the proteins using the basic method described for DSC in King et al. (1997); we used the nr database instead of OWL (it contained 90 372 249 letters and 299 576 sequences and was posted on April 17, 1998).
Algorithms
Secondary structure prediction algorithms
In CASP2 (http://PredictionCenter.llnl.gov/casp2/Casp2.html) three publicly available methods clearly performed better than any others (Lesk, 1997). These methods were PHD, DSC and NNSSP. In the two years since CASP2 the program Predator (Frishman and Argos, 1997) has also appeared and become popular. We therefore chose to investigate the use of PHD, DSC, NNSSP and Predator in our analysis.
PHD (Rost and Sander, 1993) is an accurate prediction method and probably the most commonly used secondary structure prediction method. It has a publicly available server (http://www.embl-heidelberg.de/predictprotein/ppDoPredDef.html). PHD was the most accurate method in both CASP1 (Defay and Cohen, 1995) and CASP2 (Lesk, 1997). In PHD, multiple alignments are represented by a window of normalized sequence frequencies. The learning method is the neural network algorithm back-propagation, using both sequential and combined nets with single hidden layers.
DSC (King and Sternberg, 1996) is an accurate prediction method that has both a public server (King et al., 1997; http://www.icnet.uk/bmm/dsc/dsc_form_align.html) and publicly distributed code (ftp://ftp.icnet.uk/pub/bmm/king/dsc/dsc.tar.gz). In DSC, sequences are described using a set of sequence descriptors (residue conformation propensities, sequence edge effects, moments of hydrophobicity, position of insertions and deletions, moments of conservation, auto-correlation, residue ratios, secondary structure feedback effects and filtering); multiple sequences in an alignment are used in a simplistic way. The learning method is standard linear discrimination. We used DSC v1.2; this version includes an updated and more accurate GOR matrix, which has been found to improve prediction accuracy by 0.5–1%. The matrix was derived from the dataset of 496 proteins, compared with the original dataset of 126.
The NNSSP prediction method (Salamov and Solovyev, 1995) is a successful prediction method with its own server (http://genomic.sanger.ac.uk/pss/pss.shtml). Sequences are represented as multiple alignments, and information about insertions and deletions is used. The learning method of NNSSP is (50 and 100)-nearest neighbours with a voting system to combine predictions. We carried out the NNSSP predictions using the binary code given to us by Dr Solovyev and the clu2nssp program written by Nicolas Le Novère to convert the clustalw format to nnssp format.
The Predator prediction method (Frishman and Argos, 1997) is the most recent prediction method considered in this paper. It has its own prediction web server (http://www.embl-heidelberg.de/argos/predator/run_predator.html) and the code is available (ftp://ftp.ebi.ac.uk/pub/software/unix/predator). Predator is based on a combined nearest neighbour and statistical approach that aims to recognize hydrogen bonding patterns. We carried out the Predator predictions using the downloaded code with our own alignments.
Combined dataset
Using the predictions of the different methods we formed, from the CASP3 test proteins, a dataset containing the attributes described in Table II, with the class value (the value to be predicted) either α-helix, β-strand or coil.
Statistical tests
We used a binomial sign test to test whether one prediction method is better than another. The procedure for the binomial sign test is to compare the accuracy of two methods on each protein in the test data and count how many times one method was more accurate than the other. The probability of obtaining these results, assuming that there was no difference between the methods, was then calculated using the binomial distribution (one-tailed, as we are testing whether the more accurate method really was more accurate). The binomial sign test makes no assumptions about the distribution of accuracies of the methods, but it does not use the size of the differences in accuracy (i.e. 0.1% more accurate is considered the same as 10% more accurate).
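The procedure above can be sketched in a few lines of Python. One assumption is made beyond what the text states: proteins on which the two methods tie exactly are dropped from the count.

```python
from math import comb

def sign_test_p(acc_a, acc_b):
    """One-tailed binomial sign test: count the proteins on which each
    method is the more accurate (exact ties are dropped -- an assumption,
    the text does not say how ties were handled), then compute the
    probability of the better method winning at least that often under
    the null hypothesis that wins and losses are equally likely (p = 0.5)."""
    wins = sum(a > b for a, b in zip(acc_a, acc_b))
    losses = sum(a < b for a, b in zip(acc_a, acc_b))
    n = wins + losses
    k = max(wins, losses)  # win count of the more accurate method
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
```

For example, if one method is more accurate on all 5 of 5 proteins, the one-tailed probability under the null hypothesis is 1/32 ≈ 0.031.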
Results
The results of the basic prediction methods PHD, DSC v1.2, NNSSP and Predator on the CASP3 test data are given in Table III. The order of accuracy from most accurate to least was NNSSP, PHD, DSC, Predator. However, there was no significant difference between the accuracy of any of the methods at P = 0.05. The results of testing the different combined methods on the CASP3 test data are also given in Table III. In this dataset, attributes derived from PHD, DSC v1.2, NNSSP and Predator (Table II) were combined to directly predict the secondary structure class. The combined methods are from 1 to 3% more accurate. The biased vote method was found to be the most accurate and C5 the least accurate. There were no significant differences between the combining methods, except that C5 was significantly worse than voting, linear discrimination and neural networks at P = 0.01. The combined results were also compared with those of the individual methods. All the combined methods are more accurate, with the difference generally significant at P = 0.05. The evidence clearly supports the conclusion that it is better to combine predictions.
The results of the basic prediction methods PHD, DSC v1.2, NNSSP and Predator on the EBI test data are given in Table IV. The order of accuracy is the same as for CASP3: NNSSP, PHD, DSC, Predator. However, because a small number of proteins may have been seen by some methods, it is not justified to say that one method is significantly more accurate than any other. This possibility makes no difference, however, to the comparison of single and combined methods. The results of testing the different combined methods on the EBI test data are also given in Table IV. The combined methods are from 1 to 3% more accurate. The results show that the more sophisticated methods of combining predictions (linear discrimination and neural networks) are significantly more accurate (at P = 0.01) than the voting methods for combining predictions. This supports the result of Zhang et al. (1992), who successfully combined predictions using neural networks, and contradicts the results of Cuff and Barton (1999), who could not get an improvement using neural networks. A possible explanation for this disagreement is that they used directly the predictions obtained from PHD, DSC, NNSSP and Predator (i.e. α, β and coil recoded as binary are taken as input by their neural network), while we used the probabilities of each state and also the number of classifiers predicting α-helix or β-strand. It seems that their strategy leads to the loss of some information. The authors themselves suggested that better accuracies would be achievable using the propensities of the different states, rather than simple binary input.
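The feature encoding argued for above can be sketched as follows. The layout is illustrative (it is not the exact attribute set of Table II): each method contributes its three state probabilities directly, and two extra attributes count how many methods rank helix, and how many rank strand, as their top-scoring state.

```python
def combined_features(method_probs):
    """Build a combiner input vector from per-method state probabilities.
    `method_probs` is a list of (p_helix, p_strand, p_coil) tuples, one per
    method. Returns the concatenated probabilities plus two vote counts."""
    feats = [p for probs in method_probs for p in probs]
    # Index of each method's top-scoring state: 0 = helix, 1 = strand, 2 = coil
    top_states = [probs.index(max(probs)) for probs in method_probs]
    feats.append(sum(s == 0 for s in top_states))  # methods favouring helix
    feats.append(sum(s == 1 for s in top_states))  # methods favouring strand
    return feats
```

Feeding probabilities rather than hard binary labels to the combiner preserves each method's confidence, which is the information argued to be lost in the binary encoding.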
Discussion
It is better to combine prediction methods (assuming that they do not differ greatly in accuracy).
Conclusion

Acknowledgments

Notes

References
Bauer,E. and Kohavi,R. (1999) Machine Learning, in press.
Biou,V., Gibrat,J.F., Levin,J.M., Robson,B. and Garnier,J. (1988) Protein Engng, 2, 185–191.
Breiman,L. (1996) Machine Learning, 24, 123–140.
Califf,M. and Mooney,R. (1999) In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), in press.
Cuff,J.A. and Barton,G.J. (1999) Prot. Struct. Funct. Genet., 34, 508–519.
Defay,T. and Cohen,F.E. (1995) Prot. Struct. Funct. Genet., 23, 431–445.
Dietterich,T.G. (1997) AI Magazine, 18, 97–136.
Fisher,R.A. (1936) Ann. Eugenics, 7, 179–188.
Freund,Y. and Schapire,R.E. (1996) In Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp. 148–156.
Frishman,D. and Argos,P. (1997) Proteins, 27, 329–335.
Garnier,J., Osguthorpe,D.J. and Robson,B. (1978) J. Mol. Biol., 120, 97–120.
Garnier,J., Gibrat,J.F. and Robson,B. (1996) Methods Enzymol., 266, 540–553.
Gibrat,J.F., Garnier,J. and Robson,B. (1987) J. Mol. Biol., 198, 425–443.
Jaynes,E.T. (1994) Probability Theory: The Logic of Science. http://omega.albany.edu:8008/JaynesBook.html
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
King,R.D. (1996) In Sternberg,M.J.E. (ed.), Protein Structure Prediction: A Practical Approach. Oxford University Press, Oxford, pp. 79–97.
King,R.D. and Sternberg,M.J.E. (1996) Protein Sci., 5, 2298–2310.
King,R.D. and Sternberg,M.J.E. (1990) J. Mol. Biol., 216, 441–457.
King,R.D., Feng,C. and Sutherland,A. (1995) Appl. Artificial Intelligence, 9, 289–335.
King,R.D., Saqi,M., Sayle,R. and Sternberg,M.J.E. (1997) CABIOS, 13, 473–474.
Kneller,D.G., Cohen,F.E. and Langridge,R. (1990) J. Mol. Biol., 214, 171–182.
Lesk,A.M. (1997) Prot. Struct. Funct. Genet., (Suppl. 1), 151–166.
Mitchell,T.M. (1997) Machine Learning. McGraw Hill, New York.
Muggleton,S., King,R.D. and Sternberg,M.J.E. (1992) Protein Engng, 5, 647–657.
Murzin,A.G. and Bateman,A. (1997) Prot. Struct. Funct. Genet., (Suppl. 1), 105–112.
Nishikawa,K. and Ooi,T. (1986) Biochim. Biophys. Acta, 871, 45–54.
Qian,N. and Sejnowski,T.J. (1988) J. Mol. Biol., 202, 865–884.
Quinlan,J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo.
Robson,B. (1976) J. Mol. Biol., 107, 327–356.
Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584–599.
Russell,R.B., Saqi,M.A., Sayle,R.A., Bates,P.A. and Sternberg,M.J.E. (1997) J. Mol. Biol., 269, 423–439.
Salamov,A.A. and Solovyev,V.V. (1995) J. Mol. Biol., 247, 11–15.
Schulz,G.E. and Schirmer,R.H. (1978) Principles of Protein Structure. Springer-Verlag, Berlin.
Schulz,G.E. et al. (1974) Nature, 250, 140–142.
Sternberg,M.J.E. (1983) In Geisow and Barrett (eds), Computing in Biological Science. Elsevier Biomedical Press.
Weiss,S.M. and Kulikowski,C.A. (1991) Computer Systems that Learn. Morgan Kaufmann, San Mateo.
Yi,T. and Lander,E.S. (1993) J. Mol. Biol., 232, 1117–1129.
Zhang,X., Mesirov,J.P. and Waltz,D.L. (1992) J. Mol. Biol., 225, 1049–1063.
Zvelebil,M.J.J.M., Barton,G.J., Taylor,W.R. and Sternberg,M.J.E. (1987) J. Mol. Biol., 195, 957–961.
Received April 28, 1999; revised October 21, 1999; accepted October 27, 1999.