Institute of Physics, Chinese Academy of Sciences, Beijing 100080, China e-mail: jiangf{at}aphy.iphy.ac.cn
Abstract
Keywords: nearest neighbor/nucleation/protein folding/protein secondary structure prediction/sequence clustering/stability
Introduction
In recent years, most development in secondary structure prediction has come from the application of machine learning techniques. Neural networks were first applied to the prediction of secondary structure in 1988 (Bohr et al., 1988; Qian and Sejnowski, 1988). This stimulated many subsequent studies of neural networks in secondary structure prediction (Holley and Karplus, 1989; Kneller et al., 1990). However, a real breakthrough did not come until the work of Rost and Sander (Rost and Sander, 1993a,b, 1994), which was made available as the mail server PHD. The main reason for its success (accuracy better than 70% for the first time), among many others, was the introduction of the concept of profiles: instead of presenting a network with a single sequence, many aligned sequences of homologous proteins were presented. The profile method continues to improve as more and more sequences become available (Przybylski and Rost, 2002). The reason for its success seems to be that it captures the fact that protein structures are more conserved than sequences. Because only mutations that do not disrupt the three-dimensional structure of a protein survive evolution, sequence divergence under structural constraints reflects the interactions between the amino acid residues of a protein, where the interactions may be either short range or long range in sequence. Most secondary structure prediction methods that now achieve high performance, with Q3 near 80% (Riis and Krogh, 1996; Baldi et al., 1999; Cuff and Barton, 1999; Jones, 1999; Ouali and King, 2000; Pollastri et al., 2002), make use of PSI-BLAST profiles (Altschul et al., 1997) in combination with improved prediction algorithms. New machine learning methods such as support vector machines (Hua and Sun, 2001) should also benefit from PSI-BLAST profiles.
It is known that long-range interactions among the amino acid residues of a protein are very important in the formation of protein secondary structure. Although many machine learning methods can learn to take account of some of the effects of long-range interactions, the principles underlying these interactions and their roles in the formation of secondary structures are still difficult to understand. In contrast, if the formation of the secondary structures of a protein were dominated by short-range interactions, then all the information needed to predict the secondary structure of a residue would be contained in its flanking sequences. In other words, the tertiary structure would have little influence on the formation of secondary structures. If this were true for some residues of a protein, then we should be able to predict their secondary structures with higher accuracy than those of the other residues. In this work, we present a new method of secondary structure prediction that implements this idea and belongs to the general category of nearest neighbor methods (Levin et al., 1986; Solovyev and Salamov, 1994; Mehta et al., 1995). We introduce a reliability z-score based on local sequence clustering of each query sequence in the secondary structure database. The z-scores were calculated by a brute-force method for a given database constructed from experimentally determined, non-redundant protein structures. We show that for a few selected residues, the prediction accuracy could be as high as Q3 > 80%.
Materials and methods
We use DSSP (Kabsch and Sander, 1983) for automatic assignment of secondary structures. The most widely used convention for the three-state definition (Rost and Eyrich, 2001) is (i) H, G and I assigned to helix; (ii) E and B to strand; and (iii) all others to coil. We follow this convention. The effect of using different assignment conventions on prediction performance has been studied by Cuff and Barton (Cuff and Barton, 1999).
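The three-state convention above can be written as a small lookup. This is only an illustrative sketch: the function name `reduce_dssp` and the single-letter output codes H, E and C (for helix, strand and coil) are assumptions made here and are not taken from the paper.

```python
# Three-state reduction of DSSP codes, following the convention quoted above:
# H, G, I -> helix; E, B -> strand; everything else -> coil.
# The output letters H/E/C are an assumed labelling, chosen for illustration.
HELIX_CODES = frozenset("HGI")
STRAND_CODES = frozenset("EB")


def reduce_dssp(dssp_string: str) -> str:
    """Map a string of DSSP codes to three states: H (helix), E (strand), C (coil)."""
    reduced = []
    for code in dssp_string:
        if code in HELIX_CODES:
            reduced.append("H")
        elif code in STRAND_CODES:
            reduced.append("E")
        else:
            reduced.append("C")  # T, S, blank and any other code
    return "".join(reduced)
```

Because every code outside the two sets falls through to coil, the mapping stays total even if the DSSP output marks unassigned residues with a blank or a dash.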
It is known that the database used for secondary structure prediction is crucial and can have a large effect on the results of prediction tests (Rost et al., 1994). For our algorithm, a set of experimentally determined, non-redundant protein structures in the PDB (Berman et al., 2000) with mutual sequence homology <25% was selected to construct the training set and the testing set (Hobohm and Sander, 1994) (see the website http://www.cmbi.kun.nl/gv/pdbsel/). The PDBSELECT list with <25% sequence homology used for the current study was that released in February 2001, which contained 1323 usable proteins accepted by DSSP. Of these 1323 proteins, 1307 were used as the training set to construct the database. None of these proteins have chain breaks as judged by DSSP, so that no artificial N- or C-terminal residues were introduced into the database. Experimental structures determined by both NMR and X-ray methods were included if they were contained in the PDBSELECT list of <25% sequence homology. No resolution cutoff was applied to X-ray structures. The remaining 16 of the 1323 proteins were selected randomly and used as the testing set; none of them was included in the training set.
Prediction of protein secondary structure
Our prediction algorithm is based on the fact that there is clustering in sequence space for a given secondary structure state, e.g. helix. If we assume the opposite, that is, there is no clustering in sequence space, it would mean that helix has no preference in its sequences and that helix sequences are randomly distributed in the sequence space. This is contradictory to what is known so far, because prediction algorithms based on machine learning techniques have shown that there is a strong preference for certain sequences in each secondary structure state and these sequences can be grouped into classes and learned by sophisticated algorithms. If there are clusters of sequences for each secondary structure state, then for a given sequence segment, we could locate the putative cluster to which it belongs by a brute-force method, which is now feasible with fast personal computers. Once the putative cluster has been located, we could estimate how good the cluster is by calculating a z-score relative to random sequences; the higher the z-score, the better the cluster. As will be shown, this z-score is correlated with the reliability of the secondary structure prediction.
Our prediction algorithm consists of several steps:
1. Construction of the secondary structure database. After the assignment by DSSP, sequences of a segment length of 21 are extracted from all the proteins in the training set and grouped into three files, corresponding to the three states of the secondary structure of the center residue in each sequence segment. For N- and C-terminal sequences, dummy amino acids X are added to make the corresponding segment be of the same length of 21. For the current work, the database has 63 143, 41 394 and 81 197 sequence segments for the three secondary structure states of helix, strand and coil, respectively.
2. Calculation of the similarity scores. To predict the secondary structure state of the center residue of a given sequence of a segment length of W, we need to calculate the similarity scores of this sequence against all the sequences in a secondary structure state in the database. Then, the similarity scores are sorted in descending order and the top Ntop will be used to calculate a z-score. Since all the sequences in the calculation are of the same length W, no alignment is required. The similarity score between each pair of amino acids is calculated based on the BLOSUM62 similarity matrix (Altschul et al., 1997).
3. Calculation of z-score. The idea is to compare the extent of clustering between a given sequence from a real protein and random sequences. For this purpose, we generate 300 random sequences. Then we calculate the similarity scores as in step 2 for a secondary structure state. For the top Ntop similarity scores, the average (µrandom) and the standard deviation (σrandom) are calculated. µrandom and σrandom are calculated for each secondary structure state. For the current work, there are three µrandom and three σrandom for helix, strand and coil, respectively. Since there are 300 random sequences, µrandom and σrandom are averaged again over the 300 random sequences. Next, for a given sequence from a real protein, µseq and σseq are calculated in the same way as µrandom and σrandom. The z-score is defined as $z_i = (\mu_{\mathrm{seq},i} - \mu_{\mathrm{random},i})/\sigma_{\mathrm{random},i}$, where i denotes the secondary structure state (i = 1, 2 and 3 for helix, strand and coil, respectively).
4. The prediction of the secondary structure state and the estimation of its reliability. The predicted secondary structure state for the center residue of a given sequence is the state with the maximum of the z-scores of the three secondary structure states as calculated in step 3. The reliability of this prediction is estimated by a difference z-score, defined as $z_{\mathrm{diff}} = \max_{\mathrm{top}}\{z_i\} - \max_{\mathrm{next}}\{z_i\}$, that is, the difference between the largest and the second largest of the three z-scores. If zdiff is small, the distinction between the top score and the second best score is small. Hence the corresponding top secondary structure state and the second best state are both possible and cannot be distinguished by their z-scores. In predicting the secondary structures of a protein, zdiff is calculated for every residue, and then the average and the standard deviation of zdiff over all residues are computed. A sigma_cutoff is then applied to zdiff so that only predictions with zdiff larger than the sigma_cutoff are deemed reliable, whereas the rest are regarded as unreliable or unpredictable.
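The four steps above can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the S2CHE implementation: all function names are hypothetical, the identity-based `toy_pair_score` merely stands in for a BLOSUM62 lookup, random sequences are drawn with uniform amino acid composition, the population standard deviation is used, and a small floor on the random spread guards against division by zero in tiny toy databases. None of these choices is specified in the text.

```python
import random
import statistics

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # dummy residue for N- and C-terminal padding, as in step 1


def windows(sequence, states, width=21):
    """Yield (segment, state of the centre residue) for every residue."""
    half = width // 2
    padded = PAD * half + sequence + PAD * half
    for i, state in enumerate(states):
        yield padded[i:i + width], state


def build_database(proteins, width=21):
    """Step 1: group all segments by the state (H, E or C) of their centre residue."""
    db = {"H": [], "E": [], "C": []}
    for sequence, states in proteins:
        for segment, state in windows(sequence, states, width):
            db[state].append(segment)
    return db


def toy_pair_score(a, b):
    """ASSUMPTION: identity score used as a stand-in for a BLOSUM62 lookup."""
    return 1 if a == b and a != PAD else 0


def top_scores(query, segments, n_top, pair_score=toy_pair_score):
    """Step 2: ungapped scores against all segments, sorted, top n_top kept."""
    scores = sorted(
        (sum(pair_score(a, b) for a, b in zip(query, seg)) for seg in segments),
        reverse=True,
    )
    return scores[:n_top]


def random_baseline(segments, n_top, width=21, n_random=300, rng=None):
    """Step 3: mean and spread of the top scores, averaged over random queries."""
    rng = rng or random.Random(0)
    means, sds = [], []
    for _ in range(n_random):
        # ASSUMPTION: uniform composition; the text does not say how the
        # random sequences are generated.
        query = "".join(rng.choice(AMINO_ACIDS) for _ in range(width))
        top = top_scores(query, segments, n_top)
        means.append(statistics.fmean(top))
        sds.append(statistics.pstdev(top))
    return statistics.fmean(means), statistics.fmean(sds)


def predict_center(query, db, baselines, n_top):
    """Steps 3 and 4: z-score per state, predicted state and z_diff."""
    z = {}
    for state, segments in db.items():
        mu_random, sd_random = baselines[state]
        mu_seq = statistics.fmean(top_scores(query, segments, n_top))
        # The 1e-9 floor is an added guard, not part of the described method.
        z[state] = (mu_seq - mu_random) / max(sd_random, 1e-9)
    ranked = sorted(z, key=z.get, reverse=True)
    return ranked[0], z[ranked[0]] - z[ranked[1]], z
```

In use, `random_baseline` would be called once per state and its results cached in a dictionary, since the random statistics depend only on the database and on Ntop and not on the query; `predict_center` is then applied to the window around each residue of the query protein.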
Results and discussion
The best overall performance judging by both Q3 and SOV was achieved with a segment length of 21 and Ntop of 80 and the results are given in the seventh row of Table I. The overall Q3 and SOV were 68.51 and 63.57, respectively. Although our test set was small, consisting of 16 protein chains, these results indicated that our method performed worse than the best methods based on machine learning techniques such as neural networks. Individual Q3 and SOV for the three states of secondary structure had similar patterns of variation as in evaluation tests by other methods (Przybylski and Rost, 2002). The percentage of correct prediction for strand was lower than for helix and coil. Since there was no learning involved in our brute-force method, this level of performance was still encouraging and well within expectation.
For a brute-force algorithm, our sequence clustering method should have a strong memory effect, that is, an ability to perform well for proteins in the training set. In our case, this method should perform well for proteins used in constructing the database. This was tested and found to be true. In the extreme case, when Ntop was set to 1, the prediction accuracy was 100%. This is because when Ntop = 1, the secondary structure state of a query sequence is determined by the sequence in the database that is most similar to or identical with the query sequence. However, our test set consisted of proteins that share <25% sequence homology with the proteins that were used in constructing the database. Moreover, the best performance was not achieved at Ntop = 1 when the memory effect was strongest. Therefore, our good performance for the test set should not be attributed to the memory effect. Instead, our results indicate that some sequence patterns or codes were conserved among different members of a homologous protein family with low sequence homology and, perhaps, even among different families of proteins. Such patterns could be derived either from evolution of proteins or from the physicochemical principles governing protein structures.
For each residue of a protein and for each state of secondary structure, a reliability z-score was calculated by our prediction method, as described in Materials and methods. This z-score was used to determine which secondary structure state was most probable. However, it could also be used to estimate the reliability of a prediction: the higher the z-score, the more reliable is the prediction. A difference z-score and a sigma_cutoff calculated for all residues when predicting a protein have been defined in Materials and methods. In Table I, the last row gives the results of prediction performance when the sigma_cutoff was set to 0.2. Both Q3 and SOV were increased by about 17%, although only 23.34% of the residues were predicted. This result shows that the difference z-score was correlated with the reliability of prediction. Furthermore, some of the residues in a protein could be predicted with more certainty than others. The interpretation of this phenomenon could be that there were some residues whose secondary structures were more or less determined by the local sequences, that is, the short-range interactions. There were other residues whose secondary structures were somehow influenced by the distant sequences, that is, the long-range interactions. This interpretation is consistent with our current understanding of protein secondary structure formation.
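The trade-off between accuracy and coverage described here can be made concrete with a short sketch. The text does not spell out how the sigma_cutoff is combined with the average and standard deviation of zdiff, so the threshold used below (mean plus sigma_cutoff times the standard deviation) is only one possible reading and is labelled as an assumption; the function names are likewise hypothetical.

```python
import statistics


def reliable_mask(z_diffs, sigma_cutoff):
    """Flag residues whose z_diff exceeds a per-protein threshold.

    ASSUMPTION: threshold = mean(z_diff) + sigma_cutoff * sd(z_diff).
    The paper only states that a sigma_cutoff is applied to z_diff.
    """
    mean = statistics.fmean(z_diffs)
    sd = statistics.pstdev(z_diffs)
    threshold = mean + sigma_cutoff * sd
    return [zd > threshold for zd in z_diffs]


def q3_and_coverage(predicted, observed, mask):
    """Q3 (percent of kept residues predicted correctly) and percent coverage."""
    kept = [(p, o) for p, o, keep in zip(predicted, observed, mask) if keep]
    coverage = 100.0 * len(kept) / len(predicted)
    if not kept:
        return None, coverage  # Q3 is undefined when nothing is kept
    q3 = 100.0 * sum(p == o for p, o in kept) / len(kept)
    return q3, coverage
```

Raising the cutoff keeps fewer residues, so Q3 is then computed over a smaller and, if zdiff tracks reliability, more accurately predicted subset.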
To verify further the results obtained with our limited testing with 16 proteins, we performed a full jack-knife test. Using all the available 1323 proteins, we took out one protein and used the remaining 1322 to construct a database. Then we made a prediction based on this database on this protein, repeating this process for each protein in the list of 1323. The results are given in Table II.
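The jack-knife procedure is a leave-one-out loop over the protein list. The sketch below shows only the control flow; `build` and `predict` are hypothetical callables standing for database construction and per-protein prediction, and are not names used by the author.

```python
def jackknife(proteins, build, predict):
    """Leave-one-out test: predict each protein from a database built on all others."""
    results = []
    for i, held_out in enumerate(proteins):
        training = proteins[:i] + proteins[i + 1:]  # every protein except the i-th
        database = build(training)
        results.append(predict(held_out, database))
    return results
```

Rebuilding the database from scratch for every held-out protein is the simplest formulation; for a segment database, an equivalent and cheaper approach would be to skip the held-out protein's own segments during scoring.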
Our results suggested that residues in a protein with a high reliability z-score were indeed predicted with high accuracy and that these correctly predicted residues were most likely to be dominated by local interactions in their secondary structure formation. It would be interesting to analyze these residues in the protein. First, we calculated the average accessible surface areas of the correctly predicted residues found in the full jack-knife test with a sigma_cutoff of 0.3. The results are shown in Table III. In this table, the average solvent-accessible surface areas calculated by using known protein structures (Rose et al., 1985) are also listed for comparison. Overall we can see that amino acids A, C, F, I, K, L, M, V, W and Y were more buried for correctly predicted residues than the average calculated by Rose et al. (Rose et al., 1985). Amino acids S and T were slightly more buried and the remainder, including D, E, G, H, N, P, Q and R, were more exposed. Solvent-accessible surface areas for each secondary structure state were also averaged and are listed in Table III. In general, when correctly predicted, amino acids in the state of coil were more exposed, in the state of strand more buried and in the state of helix in between. The group of more buried amino acids clearly included all hydrophobic amino acids, except proline, and aromatic residues (F, Y, W). Cysteine involved in disulfide bonds should be more buried, although we did not check whether all correctly predicted cysteine residues were disulfide bonded. The group of more exposed amino acids includes polar and charged residues and residues important for making turns and loops (glycine and proline). Lysine was an exception, which was also pointed out by Rose et al. (Rose et al., 1985) to be an unusually exposed amino acid on average. In Table III, lysine belonged to the more buried group because it is significantly more buried in the strand and helix conformations than the average calculated by Rose et al. (Rose et al., 1985). It would be interesting to examine the three-dimensional structures of these correctly predicted lysine residues to see whether they were involved in specific interactions such as salt bridges. Such a clear separation between buried and exposed amino acids has not been seen previously. Previous studies have found that the tendency of hydrophobic amino acids to be buried was not as strong as expected, which was discussed by Rose et al. (Rose et al., 1985). However, when only correctly predicted residues were examined, as was done here, the tendency was much stronger and consistent with our intuitive estimation based on the physicochemical properties of the amino acids. We might suggest that these correctly predicted residues are crucial in driving protein folding and stabilizing protein structures, that is, they form the core of proteins.
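The averaging behind this kind of accessibility comparison can be sketched as a grouped mean over the correctly predicted residues. The function name, the tuple layout of the input and the use of one dictionary for both per-residue and per-state averages are assumptions made for illustration; the accessible areas themselves would come from the known structures, for example from DSSP output.

```python
from collections import defaultdict


def mean_accessibility(residues):
    """Average accessible surface area per amino acid and per (amino acid, state).

    residues: iterable of (amino_acid, state, accessible_area) tuples for the
    correctly predicted residues (hypothetical input layout).
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of areas, count]
    for amino_acid, state, area in residues:
        for key in (amino_acid, (amino_acid, state)):
            totals[key][0] += area
            totals[key][1] += 1
    return {key: area_sum / count for key, (area_sum, count) in totals.items()}
```

The resulting per-residue means can then be compared with reference averages such as those of Rose et al. to decide whether a residue type is more buried or more exposed than usual.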
Based on our results, it could be expected that this sequence clustering method for predicting protein secondary structure could be improved in several ways. First, profiles generated by finding the members of a homologous family with diverse sequence could improve the performance. This has been proved to have a significant effect (Przybylski and Rost, 2002). It is understandable that using a single instance of a protein family is statistically less reliable than using multiple instances. Hence our method should also benefit from an increased size of sequence database. Secondly, as more non-redundant protein structures are being determined experimentally, the database for the prediction will contain more useful information. This may mean more clearly delineated clusters and hence lead to less noise in prediction. It may also mean the inclusion of new clusters not previously represented, as the secondary structure formation in different protein families may follow different rules or codes. Finally, the combination of this brute-force algorithm with machine learning algorithms should be explored to improve the performance further.
Conclusions
Most algorithms for protein secondary structure prediction currently in use are based on machine learning techniques. By building better machine learning architectures, especially neural networks, and by extensively using profile methods enhanced with an ever-increasing size of sequence database, the current performance for predicting protein secondary structure has been approaching 80%. Further improvement is still promising. The difficulty in achieving accurate prediction, allowing for reasonable inherent variation, lies in taking into account the effect of long-range interactions on secondary structure formation. With this in mind, a brute-force algorithm is proposed here. This algorithm is based on sequence similarity, belongs to the general category of nearest neighbor methods and is intended to perform well for those secondary structures in a protein whose formation is dominated by short-range interactions and hence by neighboring sequences. For a query sequence from a protein, the putative cluster is found in the secondary structure database. The extent of the sequence clustering is estimated by a z-score. Our test results have shown that, for the subset of residues deemed reliable, this algorithm performed very well and that the z-score defined was correlated with the reliability of prediction. Prospects of further improving the algorithm are very encouraging.
This program is called S2CHE and is available from the author on request.
Acknowledgement
References
Baldi,P., Brunak,S., Frasconi,P., Soda,G. and Pollastri,G. (1999) Bioinformatics, 15, 937–946.
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235–242.
Bohr,H., Bohr,J., Brunak,S., Cotterill,R.M., Lautrup,B., Norskov,L., Olsen,O.H. and Petersen,S.B. (1988) FEBS Lett., 241, 223–228.
Bowie,J.U. and Eisenberg,D. (1993) Curr. Opin. Struct. Biol., 3, 347–444.
Cuff,J.A. and Barton,G.J. (1999) Proteins, 34, 508–519.
Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522–524.
Holley,H. and Karplus,M. (1989) Proc. Natl Acad. Sci. USA, 86, 152–156.
Hua,S. and Sun,Z. (2001) J. Mol. Biol., 308, 397–407.
Jones,D. and Thornton,J. (1993) J. Comput.-Aided Mol. Des., 7, 439–456.
Jones,D.T. (1999) J. Mol. Biol., 292, 195–202.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Kneller,D., Cohen,F. and Langridge,R. (1990) J. Mol. Biol., 214, 171–182.
Lemer,C., Rooman,M.J. and Wodak,S.J. (1996) Proteins, 23, 337–345.
Levin,J.M., Robson,B. and Garnier,J. (1986) FEBS Lett., 205, 303–308.
Levin,J.M., Pascarella,S., Argos,P. and Garnier,J. (1993) Protein Eng., 6, 849–854.
Levinthal,C. (1968) J. Chim. Phys., 65, 44–45.
Marti-Renom,M., Stuart,A., Fiser,A., Sanchez,R., Melo,F. and Sali,A. (2000) Annu. Rev. Biophys. Biomol. Struct., 29, 291–325.
Mehta,P.K., Heringa,J. and Argos,P. (1995) Protein Sci., 4, 2517–2525.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindles,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.
Ouali,M. and King,R.D. (2000) Protein Sci., 9, 1162–1176.
Pollastri,G., Przybylski,D., Rost,B. and Baldi,P. (2002) Proteins, 47, 228–235.
Przybylski,D. and Rost,B. (2002) Proteins, 46, 197–205.
Qian,N. and Sejnowski,T. (1988) J. Mol. Biol., 202, 865–884.
RASMOL (1999) RasMol Molecular Renderer, Copyright by R.Sayle 1992–1999 and Copyright by H.J.Bernstein 1998–1999.
Riis,S.K. and Krogh,A. (1996) J. Comput. Biol., 3, 163–183.
Rose,G.D., Gezelowitz,A.R., Lesser,G.J., Lee,R.H. and Zehfus,M.H. (1985) Science, 229, 834–838.
Rost,B. (1995) Proc. Int. Conf. Intell. Syst. Mol. Biol., 3, 314–321.
Rost,B. (2001) J. Struct. Biol., 134, 204–218.
Rost,B. and Eyrich,V. (2001) Proteins, 45, 192–199.
Rost,B. and Sander,C. (1993a) Proc. Natl Acad. Sci. USA, 90, 7558–7562.
Rost,B. and Sander,C. (1993b) J. Mol. Biol., 232, 584–599.
Rost,B. and Sander,C. (1994) Proteins, 19, 55–72.
Rost,B., Sander,C. and Schneider,R. (1994) J. Mol. Biol., 235, 13–26.
Solovyev,V.V. and Salamov,A.A. (1994) Comput. Appl. Biosci., 10, 661–669.
Wodak,S.J. and Rooman,M.J. (1993) Curr. Opin. Struct. Biol., 3, 247–259.
Zemla,A., Venclovas,C., Fidelis,K. and Rost,B. (1999) Proteins, 34, 220–223.
Received March 27, 2003; revised June 30, 2003; accepted August 22, 2003.