Prediction of protein secondary structure with a reliability score estimated by local sequence clustering

Fan Jiang

Institute of Physics, Chinese Academy of Sciences, Beijing 100080, China e-mail: jiangf{at}aphy.iphy.ac.cn


    Abstract
 
Most algorithms for protein secondary structure prediction are based on machine learning techniques, e.g. neural networks. Good architectures and learning methods have improved the performance continuously. The introduction of profile methods, e.g. PSI-BLAST, has been a major breakthrough in increasing the prediction accuracy to close to 80%. In this paper, a brute-force algorithm is proposed and the reliability of each prediction is estimated by a z-score based on local sequence clustering. This algorithm is intended to perform well for those secondary structures in a protein whose formation is mainly dominated by the neighboring sequences and short-range interactions. A reliability z-score has been defined to estimate the goodness of a putative cluster found for a query sequence in a database. The database for prediction was constructed by experimentally determined, non-redundant protein structures with <25% sequence homology, a list maintained by PDBSELECT. Our test results have shown that this new algorithm, belonging to what is known as nearest neighbor methods, performed very well within the expectation of previous methods and that the reliability z-score as defined was correlated with the reliability of prediction. This led to the possibility of making very accurate predictions for a few selected residues in a protein with an accuracy measure of Q3 > 80%. The further development of this algorithm, and a nucleation mechanism for protein folding are suggested.

Keywords: nearest neighbor/nucleation/protein folding/protein secondary structure prediction/sequence clustering/stability


    Introduction
 
The ability to extract structural information from sequences has become increasingly important for bioinformatics research. Protein structure prediction is one of the most intensely studied subjects in modern computational biology. It can be carried out at both the secondary structure level and the three-dimensional level, and the two are intimately related. Currently, two useful prediction methodologies are fold recognition [reviews on fold recognition can be found in the literature (Bowie and Eisenberg, 1993; Jones and Thornton, 1993; Wodak and Rooman, 1993; Lemer et al., 1996); for the first application of secondary structure prediction in threading, see Rost (Rost, 1995)] and comparative modeling (Marti-Renom et al., 2000), both of which rely on accurate prediction of secondary structure to improve the accuracy of prediction for three-dimensional structures. Much progress has been made through international community-wide studies and evaluations such as CASP (critical assessment of structure prediction) (CASP5: http://predictioncenter.llnl.gov/casp5/Casp5.html). The accuracy of secondary structure prediction has increased gradually over the years as we gain a better understanding of the principles of sequence–structure relationships and the effect of evolution on the proteomes of organisms (Rost, 2001).

In recent years, most development of secondary structure prediction has been in the application of machine learning techniques. Neural networks were first applied to the prediction of secondary structure in 1988 (Bohr et al., 1988; Qian and Sejnowski, 1988). This stimulated many subsequent studies of neural networks (Holley and Karplus, 1989; Kneller et al., 1990) in secondary structure prediction. However, a real breakthrough did not come until the work of Rost and Sander (Rost and Sander, 1993a,b, 1994), which was made available as a mail server, PHD. The main reason for its success (better than 70% for the first time), among many others, was that the concept of profiles was introduced. Thus, instead of presenting a network with a single sequence, many aligned sequences of homologous proteins were presented. The method of profiles continues to improve as more and more sequences become available (Przybylski and Rost, 2002). The reason for the success of the profiles method seems to be that it captures the fact that protein structures are more conserved than sequences. Because only mutations that do not disrupt the three-dimensional structure of a protein will survive evolution, sequence divergence under structural constraints reflects the interactions between amino acid residues of a protein, where the interactions could be either short range or long range in sequence. Most secondary structure prediction methods that now achieve high performance, with Q3 near 80% (Riis and Krogh, 1996; Baldi et al., 1999; Cuff and Barton, 1999; Jones, 1999; Ouali and King, 2000; Pollastri et al., 2002), make use of PSI-BLAST profiles (Altschul et al., 1997) in combination with improved prediction algorithms. New machine learning methods such as support vector machines (Hua and Sun, 2001) should also benefit from PSI-BLAST profiles.

It is known that long-range interactions among amino acid residues in a protein are very important in the formation of protein secondary structure. Although many machine learning methods could ‘learn’ to take account of some of the effects of long-range interactions, the principles underlying these interactions and their roles in influencing the formation of secondary structures are still difficult to understand. In contrast, if the formation of secondary structures of a protein were dominated by short-range interactions, then all the information for predicting the secondary structure of a residue would be contained in its flanking sequences. In other words, the tertiary structure would have little influence on the formation of secondary structures. If this were true for some residues of a protein, then we should be able to predict their secondary structures with higher accuracy than those of other residues. In this work, we present a new method of secondary structure prediction, which implements the above idea and belongs to the general category of nearest neighbor methods (Levin et al., 1986; Solovyev and Salamov, 1994; Mehta et al., 1995). We introduce a reliability z-score based on local sequence clustering of each query sequence in the secondary structure database. The z-scores were calculated by a brute-force method for a given database constructed from experimentally determined, non-redundant protein structures. We show that for a few selected residues, the prediction accuracy could be as high as Q3 > 80%.


    Materials and methods
 
Database and definition of protein secondary structure

We use DSSP (Kabsch and Sander, 1983) for automatic assignment of secondary structures. The most widely used convention for the three-state definition (Rost and Eyrich, 2001) is (i) ‘H’, ‘G’ and ‘I’ assigned to helix; (ii) ‘E’ and ‘B’ to strand; and (iii) all other DSSP codes to coil. We follow this convention. The effect of using different assignment conventions on prediction performance has been studied by Cuff and Barton (Cuff and Barton, 1999).
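The three-state convention above can be illustrated with a minimal sketch. The function name `reduce_dssp` and the single-letter labels H, E and C for helix, strand and coil are illustrative assumptions, not part of the original program.

```python
# Hypothetical helper: reduce DSSP codes to the three states used in this paper.
DSSP_TO_THREE = {
    "H": "H", "G": "H", "I": "H",  # helix
    "E": "E", "B": "E",            # strand
}


def reduce_dssp(dssp_string: str) -> str:
    """Map every DSSP code to H (helix), E (strand) or C (coil).

    Any code not listed above (including the blank DSSP uses for
    'no assignment') falls into the coil class.
    """
    return "".join(DSSP_TO_THREE.get(code, "C") for code in dssp_string)


if __name__ == "__main__":
    print(reduce_dssp("HHGGIEEB TS-"))  # HHHHHEEECCCC
```

Treating every unlisted code as coil mirrors the ‘Others’ rule of the convention, so the mapping needs no explicit list of turn, bend or blank codes.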

It is known that the database used for secondary structure prediction is crucial and can have a large effect on the results of prediction tests (Rost et al., 1994). For our algorithm, a set of experimentally determined, non-redundant protein structures in the PDB (Berman et al., 2000) with mutual sequence homology <25% was selected to construct the training set and the testing set (Hobohm and Sander, 1994) (see the website http://www.cmbi.kun.nl/gv/pdbsel/). The PDBSELECT list with <25% sequence homology used for the current study was that released in February 2001, which contained 1323 usable proteins accepted by DSSP. None of these proteins have chain breaks as judged by DSSP, so that no artificial N- or C-terminal residues were introduced into the database. Experimental structures determined by both NMR and X-ray methods were included if they were contained in the PDBSELECT list of <25% sequence homology. No resolution cutoff was applied to X-ray structures. Of the 1323 proteins, 16 were selected randomly and used as the testing set; the remaining 1307 were used as the training set to construct the database, so that no testing protein was included in the training set.

Prediction of protein secondary structure

Our prediction algorithm is based on the fact that there is clustering in sequence space for a given secondary structure state, e.g. helix. If we assume the opposite, that is, there is no clustering in sequence space, it would mean that helix has no preference in its sequences and that helix sequences are randomly distributed in the sequence space. This is contradictory to what is known so far, because prediction algorithms based on machine learning techniques have shown that there is a strong preference for certain sequences in each secondary structure state and these sequences can be grouped into classes and learned by sophisticated algorithms. If there are clusters of sequences for each secondary structure state, then for a given sequence segment, we could locate the putative cluster to which it belongs by a brute-force method, which is now feasible with fast personal computers. Once the putative cluster has been located, we could estimate how good the cluster is by calculating a z-score relative to random sequences; the higher the z-score, the better the cluster. As will be shown, this z-score is correlated with the reliability of the secondary structure prediction.

Our prediction algorithm consists of several steps:

1. Construction of the secondary structure database. After the assignment by DSSP, sequences of a segment length of 21 are extracted from all the proteins in the training set and grouped into three files, corresponding to the three states of the secondary structure of the center residue in each sequence segment. For N- and C-terminal sequences, dummy amino acids ‘X’ are added to make the corresponding segment be of the same length of 21. For the current work, the database has 63 143, 41 394 and 81 197 sequence segments for the three secondary structure states of helix, strand and coil, respectively.

2. Calculation of the similarity scores. To predict the secondary structure state of the center residue of a given sequence of a segment length of W, we need to calculate the similarity scores of this sequence against all the sequences in a secondary structure state in the database. Then, the similarity scores are sorted in descending order and the top Ntop scores are used to calculate a z-score. Since all the sequences in the calculation are of the same length W, no alignment is required. The similarity score between each pair of amino acids is calculated based on the BLOSUM62 similarity matrix (Altschul et al., 1997).

3. Calculation of z-score. The idea is to compare the extent of clustering between a given sequence from a real protein and random sequences. For this purpose, we generate 300 random sequences. Then we calculate the similarity scores as in step 2 for a secondary structure state. For the top Ntop similarity scores, the average (µrandom) and the standard deviation (σrandom) are calculated. µrandom and σrandom are calculated for each secondary structure state. For the current work, there are three µrandom and three σrandom for helix, strand and coil, respectively. Since there are 300 random sequences, µrandom and σrandom are averaged again over 300 random sequences. Next, for a given sequence from a real protein, µseq and σseq are calculated in the same way as µrandom and σrandom. The z-score is defined as zi = (µseq,i – µrandom,i)/σrandom,i, where i denotes the secondary structure state (i = 1, 2 and 3 for helix, strand and coil, respectively).

4. The prediction of the secondary structure state and the estimation of its reliability. The predicted secondary structure state for the center residue of a given sequence is the state with the maximum of the z-scores of the three secondary structure states as calculated in step 3. The reliability of this prediction is estimated by a difference z-score. The difference z-score is defined as zdiff = max_top{zi} – max_next{zi}, that is, the difference between the maximum of the three z-scores and the second maximum of the three z-scores. If zdiff is small, the distinction between the top score and the second best score is small. Hence the corresponding top secondary structure state and the second best state are both possible and cannot be distinguished by their z-scores. When predicting the secondary structures of a protein, zdiff is calculated for every residue, and then the average and the standard deviation of zdiff over all residues are calculated. A sigma_cutoff is then applied to zdiff so that only predictions with zdiff bigger than the sigma_cutoff are deemed reliable, whereas the rest are unreliable or unpredictable.
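The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's S2CHE program: `toy_score` is a stand-in for a BLOSUM62 lookup, all function names are hypothetical, and the treatment of σrandom (the within-top-Ntop standard deviation averaged over the random sequences) is one reading of step 3.

```python
import random
import statistics

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def extract_segments(sequence, states, width=21):
    """Step 1: group windows by the state of their center residue ('X'-padded ends)."""
    half = width // 2
    padded = "X" * half + sequence + "X" * half
    database = {"H": [], "E": [], "C": []}
    for i, state in enumerate(states):
        database[state].append(padded[i:i + width])
    return database


def toy_score(a, b):
    """Assumption: identity scoring in place of the BLOSUM62 matrix."""
    return 1 if a == b and a != "X" else 0


def similarity(query, segment):
    """Step 2: ungapped, position-by-position score of two equal-length segments."""
    return sum(toy_score(a, b) for a, b in zip(query, segment))


def top_stats(query, segments, n_top):
    """Mean and standard deviation of the n_top highest similarity scores."""
    scores = sorted((similarity(query, s) for s in segments), reverse=True)[:n_top]
    return statistics.fmean(scores), statistics.pstdev(scores)


def random_baseline(segments, width, n_top, n_random=300, seed=0):
    """Step 3: average the top-n_top mean and sd over random query sequences."""
    rng = random.Random(seed)
    means, sds = [], []
    for _ in range(n_random):
        query = "".join(rng.choice(AMINO_ACIDS) for _ in range(width))
        mean, sd = top_stats(query, segments, n_top)
        means.append(mean)
        sds.append(sd)
    return statistics.fmean(means), statistics.fmean(sds)


def predict(query, database, n_top=80, n_random=300):
    """Step 4: pick the state with the largest z-score and report zdiff.

    Every state in `database` must hold at least one segment.
    """
    z = {}
    for state, segments in database.items():
        mu_seq, _ = top_stats(query, segments, n_top)
        mu_rand, sd_rand = random_baseline(segments, len(query), n_top, n_random)
        z[state] = (mu_seq - mu_rand) / sd_rand if sd_rand > 0 else 0.0
    ranked = sorted(z, key=z.get, reverse=True)
    return ranked[0], z[ranked[0]] - z[ranked[1]], z


if __name__ == "__main__":
    toy_db = {
        "H": ["AAAAA", "AAAAL", "AALAA", "LAAAA", "EAAAK"],
        "E": ["VVVVV", "VIVIV", "IVIVI", "VVIVV", "TVTVT"],
        "C": ["GGGGG", "GPGPG", "PGPGP", "GGSGG", "NGNGN"],
    }
    state, z_diff, _ = predict("AAAAA", toy_db, n_top=3)
    print(state, z_diff > 0)
```

In a real run the database would come from `extract_segments` applied to every training protein, with a window width of 21 and Ntop of 80 as used in this work; the five-residue toy segments here only keep the example small.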


    Results and discussion
 
Sixteen polypeptide chains, out of the non-redundant list as described in Materials and methods, were used to test the performance of our prediction method. They were, in PDB code, 1AGT_, 1AHO_, 1AIE_, 1AISB, 1AK4C, 1AOCA, 1AOID, 1ARZC, 1ATLA, 1AXN_, 1AYFB, 1B33N, 1B87A, 1BABB, 1BE9A and 1BSNA, where the character following the four-character PDB code is the chain ID and ‘_’ means empty chain ID. There were 1323 proteins in the non-redundant list and these 16 proteins were among them. The remaining 1307 (1323 – 16) proteins were then used to construct the secondary structure database for the test. The results of the prediction test are shown in Table I. Q3 and SOV (Zemla et al., 1999) were used to evaluate the performance. Values of both overall and individual states are given in Table I. In secondary structure prediction, the segment length of sequence commonly used is 21. Fixing the segment length at 21, the effect of Ntop was examined. Ntop was sampled at 1, 2, 5, 10, 20, 40, 80, 160, 320 and 640. Q3 peaked at Ntop = 40 and SOV at Ntop = 160, with values of 68.62 and 63.93, respectively. This level of accuracy is very similar to that achieved by other nearest neighbor methods. In the previous nearest neighbor methods, when they were combined with multiple sequence alignment from superposition of known structures used for constructing the database, the best performance was 68.5% (Levin et al., 1993) and 72.2% (Mehta et al., 1995) (note, however, that the accuracy here was not Q3 as currently used as the standard measure for accuracy). Our interpretation is that, for the database that we constructed, the average size of clusters in sequence space is between 40 and 160. If Ntop is >160, noisy sequences from nearby clusters would be included and hence the signal (z-score) would be lower. The reduced signal for the correct secondary structure state will lead to a wrong prediction. If Ntop is <40, an uneven distribution of sequences within the putative cluster and some misleading sequences, e.g. sequences that have adopted more than one secondary structure state, will introduce random errors into the signal (z-score). This again reduces the chance of selecting the correct secondary structure state. Therefore, there should be an optimum value for Ntop and this was found to be between 40 and 160 for our current database. For subsequent tests, we used 80 for Ntop.
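For readers unfamiliar with the Q3 measure, the following minimal sketch computes it as the percentage of residues whose three-state prediction matches the observed state, together with the per-state values. The function name `q3` and the H/E/C labels are illustrative assumptions; SOV, which scores segment overlap, is more involved and is not reproduced here.

```python
def q3(observed: str, predicted: str):
    """Return overall Q3 (%) and per-state Q3 (%) for three-state strings."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and of equal length")
    pairs = list(zip(observed, predicted))
    overall = 100.0 * sum(o == p for o, p in pairs) / len(pairs)
    per_state = {}
    for state in "HEC":
        total = observed.count(state)
        hits = sum(o == p == state for o, p in pairs)
        # A state absent from the observed string has no defined accuracy.
        per_state[state] = 100.0 * hits / total if total else None
    return overall, per_state


if __name__ == "__main__":
    overall, per_state = q3("HHHHEEECCC", "HHHCEECCCC")
    print(overall, per_state["H"], per_state["C"])  # 80.0 75.0 100.0
```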


Table I. Evaluation of secondary structure prediction
 
The effect of the segment length of sequences used for prediction was not surprising, as shown in the eleventh and twelfth rows in Table I, where segment lengths of 11 and 5 were tested, respectively. There were dramatic decreases in performance on going from a segment length of 21 to 11 and to 5. Q3 dropped from 68.51 to 67.29 and to 62.16, respectively. SOV dropped even more dramatically, from 63.57 to 62.55 and to 52.23. Therefore, inclusion of flanking sequences around a center residue being predicted was shown to be very important for our prediction method. This is perhaps partly because the secondary structures are to a certain extent influenced by neighboring sequences as far as 10 residues away and partly because longer sequences will help delineate different clusters of sequences better than shorter sequences. This result was consistent with what has been found from other prediction methods.

The best overall performance judging by both Q3 and SOV was achieved with a segment length of 21 and Ntop of 80 and the results are given in the seventh row of Table I. The overall Q3 and SOV were 68.51 and 63.57, respectively. Although our test set was small, consisting of 16 protein chains, these results indicated that our method performed worse than the best methods based on machine learning techniques such as neural networks. Individual Q3 and SOV for the three states of secondary structure showed patterns of variation similar to those found in evaluation tests of other methods (Przybylski and Rost, 2002). The percentage of correct prediction for strand was lower than for helix and coil. Since there was no ‘learning’ involved in our brute-force method, this level of performance was still encouraging and well within expectation.

For a brute-force algorithm, our sequence clustering method should have a strong ‘memory’ effect, that is, an ability to perform well for proteins in the training set. In our case, this method should perform well for proteins used in constructing the database. This was tested and found to be true. In the extreme case, when Ntop was set to 1, the prediction accuracy was 100%. This is because when Ntop = 1, the secondary structure state of a query sequence is determined by the sequence in the database that is most similar to or identical with the query sequence. However, our test set consisted of proteins that share <25% sequence homology with the proteins that were used in constructing the database. Moreover, the best performance was not achieved at Ntop = 1 when the ‘memory’ effect was strongest. Therefore, our good performance for the test set should not be attributed to the ‘memory’ effect. Instead, our results indicate that some sequence patterns or ‘codes’ were conserved among different members of a homologous protein family with low sequence homology and, perhaps, even among different families of proteins. Such patterns could be derived either from evolution of proteins or from the physicochemical principles governing protein structures.

For each residue of a protein and for each state of secondary structure, a reliability z-score was calculated by our prediction method, as described in Materials and methods. This z-score was used to determine which secondary structure state was most probable. However, it could also be used to estimate the reliability of a prediction: the higher the z-score, the more reliable is the prediction. A difference z-score and a sigma_cutoff calculated for all residues when predicting a protein have been defined in Materials and methods. In Table I, the last row gives the results of prediction performance when the sigma_cutoff was set to 0.2σ. Both Q3 and SOV were increased by about 17%, although only 23.34% of the residues were predicted. This result shows that the difference z-score was correlated with the reliability of prediction. Furthermore, some of the residues in a protein could be predicted with more certainty than others. The interpretation of this phenomenon could be that there were some residues whose secondary structures were more or less determined by the local sequences, that is, the short-range interactions. There were other residues whose secondary structures were somehow influenced by the distant sequences, that is, the long-range interactions. This interpretation is consistent with our current understanding of protein secondary structure formation.
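The trade-off between coverage and accuracy described above can be illustrated with a small sketch. The exact form of the threshold is not spelled out in the text; the code assumes that a residue is kept when its zdiff exceeds the mean of zdiff over the protein plus sigma_cutoff times its standard deviation. That reading, and the function names, are assumptions made for illustration only.

```python
import statistics


def reliable_mask(z_diffs, sigma_cutoff=0.2):
    """Flag residues whose zdiff exceeds mean + sigma_cutoff * sd (assumed rule)."""
    mean = statistics.fmean(z_diffs)
    sd = statistics.pstdev(z_diffs)
    threshold = mean + sigma_cutoff * sd
    return [z > threshold for z in z_diffs]


def coverage_and_accuracy(observed, predicted, mask):
    """Percentage of residues kept, and Q3 (%) computed over the kept residues only."""
    kept = [(o, p) for o, p, keep in zip(observed, predicted, mask) if keep]
    coverage = 100.0 * len(kept) / len(observed)
    accuracy = 100.0 * sum(o == p for o, p in kept) / len(kept) if kept else None
    return coverage, accuracy


if __name__ == "__main__":
    z_diffs = [0.1, 0.2, 3.0, 0.1, 2.5, 0.3]
    mask = reliable_mask(z_diffs, sigma_cutoff=0.2)
    print(mask)
    print(coverage_and_accuracy("HHECCE", "HEECCC", mask))
```

In this toy input only the two residues with large zdiff survive the cutoff, and both are predicted correctly, which is the qualitative behavior reported in the last row of Table I: fewer residues predicted, but with higher accuracy.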

To verify further the results obtained with our limited testing with 16 proteins, we performed a full jack-knife test. Using all the available 1323 proteins, we took out one protein and used the remaining 1322 to construct a database. Then we made a prediction based on this database on this protein, repeating this process for each protein in the list of 1323. The results are given in Table II.
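The leave-one-out procedure just described follows a simple pattern, sketched below. The callables `build_database`, `predict_protein` and `evaluate` are hypothetical placeholders for database construction, prediction and scoring; only the loop structure is taken from the text.

```python
def jackknife(proteins, build_database, predict_protein, evaluate):
    """Full jack-knife: hold out each protein once and predict it from the rest."""
    results = []
    for i, held_out in enumerate(proteins):
        # The database never contains the protein being predicted.
        training = proteins[:i] + proteins[i + 1:]
        database = build_database(training)
        prediction = predict_protein(held_out, database)
        results.append(evaluate(held_out, prediction))
    return results


if __name__ == "__main__":
    # Toy stand-ins: the "prediction" just reports whether the held-out
    # protein leaked into the database, which must never happen.
    names = ["p1", "p2", "p3"]
    leaked = jackknife(
        names,
        build_database=set,
        predict_protein=lambda protein, database: protein in database,
        evaluate=lambda protein, prediction: prediction,
    )
    print(leaked)  # [False, False, False]
```

With 1323 proteins this means building 1323 databases, which is why the brute-force approach depends on inexpensive database construction.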


Table II. Evaluation of full jack-knife test
 
The performance was slightly worse, as commonly observed for jack-knife tests, than that tested with 16 proteins because the list of 1323 proteins may contain more instances of some patterns than others. More instances would result in better prediction and fewer instances worse prediction. Averaged over all 1323 proteins, the performance should be slightly worse. However, the general trend as shown by Table I was the same and a prediction accuracy Q3 > 80% was achieved. The fact that a high performance was achieved consistently for residues with high reliability z-scores indicated that the results were sufficiently reliable to make a distinction between residues whose secondary structures were dominated by short- and long-range interactions. Otherwise, an uneven distribution of secondary structures in sequence space due to an insufficient size of the database would lead to fluctuating performance.

Our results suggested that residues in a protein with high reliability z-score were indeed predicted with high accuracy and these correctly predicted residues were most likely to be dominated by local interactions in their secondary structure formation. It would be interesting to analyze these residues in the protein. First, we calculated the average accessible surface areas of correctly predicted residues found in the full jack-knife test with a sigma_cutoff of 0.3. The results are shown in Table III. In this table, the average solvent-accessible surface areas calculated by using known protein structures (Rose et al., 1985) are also listed for comparison. Overall we can see that amino acids A, C, F, I, K, L, M, V, W and Y were more buried for correctly predicted residues than the average calculated by Rose et al. (Rose et al., 1985). Amino acids S and T were slightly more buried and the remainder, including D, E, G, H, N, P, Q and R, were more exposed. Solvent-accessible surface areas for each secondary structure state were also averaged and are listed in Table III. In general, when correctly predicted, amino acids in the state of ‘Coil’ were more exposed, in the state of ‘Strand’ more buried and in the state of ‘Helix’ in between. The group of more buried amino acids clearly included all hydrophobic amino acids, except proline, and aromatic residues (F, Y, W). Cysteine involved in disulfide bonds should be more buried, although we did not check whether all correctly predicted cysteine residues were disulfide bonded. The group of more exposed amino acids included polar and charged residues and residues important for making turns and loops (glycine and proline). Lysine was an exception, which was also pointed out by Rose et al. (Rose et al., 1985) to be an unusually exposed amino acid on average. In Table III, lysine belonged to the more buried group because it is significantly more buried in the strand and helix conformations than the average calculated by Rose et al. (Rose et al., 1985). It would be interesting to examine the three-dimensional structures of these correctly predicted lysine residues to see whether they were involved in specific interactions such as salt bridges. Such a clear separation between buried and exposed amino acids has not been seen previously. Previous studies have found that the tendency of hydrophobic amino acids to be buried was not as strong as expected, which was discussed by Rose et al. (Rose et al., 1985). However, when only correctly predicted residues were examined, as was done here, the tendency was much stronger and consistent with our intuitive estimation based on physicochemical properties of the amino acids. We might suggest that these correctly predicted residues are crucial in driving protein folding and stabilizing protein structures, that is, they are the ‘core’ of proteins.


Table III. Accessible surface area of correctly predicted residues
 
The next step of our analysis should have been the examination of the three-dimensional arrangement of correctly predicted residues in all proteins. However, this was not possible in the present work. Instead, we selected 10 proteins, which were more or less well known, such as myoglobin, lysozyme and trypsin inhibitor. We also selected proteins that represented classes of proteins by secondary structure contents, namely all α, all β and α/β. Nevertheless, this selection of 10 proteins was very limited and could only serve to show intuitively some of the implications of these correctly predicted residues. In Figure 1, we highlight the correctly predicted residues in black and the rest in gray. The secondary structures of each protein are shown as ribbons drawn by RASMOL (RASMOL, 1999). In general, the correctly predicted residues were close to each other in spatial arrangement. Some helices and strands are highlighted. Many significant turns and loops or bends and kinks are also highlighted in black, especially in proteins consisting of β-strands such as 1A2PA, 1EJFA, 1EJGA, 1KB5B, 1NIVA, 3CHBD and 8PRN_. Sometimes having a protein segment in a coil conformation is important, e.g. in 5PTI_. In this case, the sequence involved was 5-CLEPPYTGPC-14 and the secondary structure 5-HHCCCCCCCC-14. It is conceivable that multiple prolines are there to ensure that non-native disulfide bonds between cysteines 5 and 14 are not formed. These drawings indicate that in a protein some residues were more special than others. Our method selected these residues by a reliability z-score, which measured the extent of sequence clustering around a query sequence from a protein segment relative to random query sequences. Intuitively, more clustering around the sequence of a protein segment gives a higher z-score, which also means that there are more instances in the secondary structure database. In other words, these instances and the corresponding sequence–structure pattern are more conserved among different proteins. Moreover, when the secondary structure of a protein segment is determined mainly by local or short-range interactions, the sequence of this protein segment is likely to be more conserved than that of a protein segment whose secondary structure is affected by long-range interactions involving more than one protein segment remotely located in sequence relative to this protein segment. As we know, conserved residues are generally more likely to be involved in driving protein folding and maintaining protein stability. Hence, coming back to these drawings, we suggest that protein folding might start with a nucleation step. The nucleation step might involve multiple sites in the protein, as suggested by the drawings of myoglobin, lysozyme, agglutinin and porin. Many protein folding mechanisms have been proposed from either theoretical or experimental considerations and the nucleation mechanism was one of them or was included in a few of those proposed. The nucleation mechanism is very close to the idea that proteins fold through some ‘directed’ process, as was proposed as a way to resolve the Levinthal paradox (Levinthal, 1968). Without going into a prolonged discussion of the current literature on protein folding mechanisms, for which there is a vast amount of published work, it suffices to say that much is still to be done in understanding the principles underlying protein folding and sequence–structure relationships. Our present method should offer a new way of looking at protein structures.












Fig. 1. Ribbon drawings of 10 proteins. The correctly predicted residues are highlighted in black and others in gray. It is evident that highlighted residues are close to each other in a spatial arrangement forming continuous secondary structure segments. There are also a few turns highlighted, especially in proteins with β-sheet. The ribbons were drawn with RASMOL (RASMOL, 1999).

 
In the light of the above discussion, it is evident that if the database used in our present method were constructed with proteins of different fold classes as compiled by CATH (Orengo et al., 1997) or SCOP (Murzin et al., 1995), it would constitute a more rigorous procedure for finding and evaluating sequence–structure patterns involved in protein folding and for structure prediction. Nevertheless, as was found in the profile methods, more aligned proteins with more divergent sequences led to better performance (Przybylski and Rost, 2002). Hence in our case, inclusion of more instances of protein structures with low sequence homology but redundant fold classes should help to increase the accuracy of the reliability z-score, which is important when our method is used to predict the secondary structure of a protein with low sequence homology to existing known protein structures. For studying patterns conserved among different fold classes important in protein folding, CATH or SCOP should definitely be used for constructing the database, as suggested by a referee.

Based on our results, this sequence clustering method for predicting protein secondary structure could be improved in several ways. First, profiles generated by finding the members of a homologous family with diverse sequences could improve the performance. This has been shown to have a significant effect (Przybylski and Rost, 2002). It is understandable that using a single instance of a protein family is statistically less reliable than using multiple instances. Hence our method should also benefit from an increased size of sequence database. Secondly, as more non-redundant protein structures are determined experimentally, the database for the prediction will contain more useful information. This may mean more clearly delineated clusters and hence less noise in prediction. It may also mean the inclusion of new clusters not previously represented, as the secondary structure formation in different protein families may follow different rules or ‘codes’. Finally, the combination of this brute-force algorithm with machine learning algorithms should be explored to improve the performance further.

Conclusions

Most algorithms for protein secondary structure prediction currently in use are based on machine learning techniques. By building better machine learning architectures, especially neural networks, and by extensively using profile methods enhanced with an ever-increasing size of sequence database, the current performance for predicting protein secondary structure has been approaching 80%. Further improvement is still promising. Recognizing that the main obstacle to accurate prediction, allowing for reasonable inherent variation, is the difficulty of taking into account the effect of long-range interactions on secondary structure formation, we propose a brute-force algorithm here. This algorithm is based on sequence similarity, belongs to the general category of nearest neighbor methods and is intended to perform well for those secondary structures in a protein whose formation is dominated by short-range interactions and hence neighboring sequences. For a query sequence from a protein, the putative cluster is found in the secondary structure database. The extent of the sequence clustering is estimated by a z-score. Our test results have shown that this algorithm performed very well and that the z-score as defined was correlated with the reliability of prediction. Prospects of further improving the algorithm are very encouraging.

The program implementing this algorithm is called S2CHE and is available from the author on request.


    Acknowledgement
 
Project 30170198 was supported by the NSFC.


    References
 
Altschul,S., Madden,T., Schaffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Nucleic Acids Res., 25, 3389–3402.

Baldi,P., Brunak,S., Frasconi,P., Soda,G. and Pollastri,G. (1999) Bioinformatics, 15, 937–946.

Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235–242.

Bohr,H., Bohr,J., Brunak,S., Cotterill,R.M., Lautrup,B., Norskov,L., Olsen,O.H. and Petersen,S.B. (1988) FEBS Lett., 241, 223–228.

Bowie,J.U. and Eisenberg,D. (1993) Curr. Opin. Struct. Biol., 3, 347–444.

Cuff,J.A. and Barton,G.J. (1999) Proteins, 34, 508–519.

Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522–524.

Holley,H. and Karplus,M. (1989) Proc. Natl Acad. Sci. USA, 86, 152–156.

Hua,S. and Sun,Z. (2001) J. Mol. Biol., 308, 397–407.

Jones,D. and Thornton,J. (1993) J. Comput.-Aided Mol. Des., 7, 439–456.

Jones,D.T. (1999) J. Mol. Biol., 292, 195–202.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Kneller,D., Cohen,F. and Langridge,R. (1990) J. Mol. Biol., 214, 171–182.

Lemer,C., Rooman,M.J. and Wodak,S.J. (1996) Proteins, 23, 337–345.

Levin,J.M., Robson,B. and Garnier,J. (1986) FEBS Lett., 205, 303–308.

Levin,J.M., Pascarella,S., Argos,P. and Garnier,J. (1993) Protein Eng., 6, 849–854.

Levinthal,C. (1968) J. Chim. Phys., 65, 44–45.

Marti-Renom,M., Stuart,A., Fiser,A., Sanchez,R., Melo,F. and Sali,A. (2000) Annu. Rev. Biophys. Biomol. Struct., 29, 291–325.

Mehta,P.K., Heringa,J. and Argos,P. (1995) Protein Sci., 4, 2517–2525.

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.

Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.

Ouali,M. and King,R.D. (2000) Protein Sci., 9, 1162–1176.

Pollastri,G., Przybylski,D., Rost,B. and Baldi,P. (2002) Proteins, 47, 228–235.

Przybylski,D. and Rost,B. (2002) Proteins, 46, 197–205.

Qian,N. and Sejnowski,T. (1988) J. Mol. Biol., 202, 865–884.

RASMOL (1999) RasMol Molecular Renderer, Copyright by R.Sayle 1992–1999 and Copyright by H.J.Bernstein 1998–1999.

Riis,S.K. and Krogh,A. (1996) J. Comput. Biol., 3, 163–183.

Rose,G.D., Geselowitz,A.R., Lesser,G.J., Lee,R.H. and Zehfus,M.H. (1985) Science, 229, 834–838.

Rost,B. (1995) Proc. Int. Conf. Intell. Syst. Mol. Biol., 3, 314–321.

Rost,B. (2001) J. Struct. Biol., 134, 204–218.

Rost,B. and Eyrich,V. (2001) Proteins, 45, 192–199.

Rost,B. and Sander,C. (1993a) Proc. Natl Acad. Sci. USA, 90, 7558–7562.

Rost,B. and Sander,C. (1993b) J. Mol. Biol., 232, 584–599.

Rost,B. and Sander,C. (1994) Proteins, 19, 55–72.

Rost,B., Sander,C. and Schneider,R. (1994) J. Mol. Biol., 235, 13–26.

Solovyev,V.V. and Salamov,A.A. (1994) Comput. Appl. Biosci., 10, 661–669.

Wodak,S.J. and Rooman,M.J. (1993) Curr. Opin. Struct. Biol., 3, 247–259.

Zemla,A., Venclovas,C., Fidelis,K. and Rost,B. (1999) Proteins, 34, 220–223.

Received March 27, 2003; revised June 30, 2003; accepted August 22, 2003.




