1Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara 630-0192 and 3CCSE, Japan Atomic Energy Research Institute, 81, Umemidai, Kizu-cho, Souraku, Kyoto 619-0215, Japan
2 To whom correspondence should be addressed. E-mail: takawaba{at}is.naist.jp
Abstract |
---|
Keywords: PSI-BLAST/remote homologue detection/secondary structure prediction/solvent accessibility prediction
Introduction |
---|
In this study, following the work of Geourjon et al., we developed a similar filtering method that uses predicted structural information to choose the more likely homologues from the many candidates obtained by PSI-BLAST. Compared with the original work of Geourjon et al., we introduced a few refinements. First, we employed a statistical significance score (Z-score) for structure matching. Second, in addition to secondary structure predictions, we also used solvent accessibility predictions. Third, we used a simple linear discrimination method that combines the E-value and the structural matching score.
Materials and methods |
---|
To evaluate the performance of our methods, we used 3605 representative protein domains in the SCOP database (version 1.63) (Murzin et al., 1995) with sequence identities of 30% or less. The domains of classes 6 (membrane and cell surface proteins and peptides), 8 (coiled coil proteins), 9 (low-resolution protein structures), 10 (peptides) and 11 (designed proteins), and those shorter than 40 residues, were removed from the representative list because of their particular evolutionary nature. The family and superfamily relationships defined in SCOP were considered to be the correct homologous relationships.
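The exclusion rule described above can be sketched as a small filter. This is only an illustration under stated assumptions: the function name, the constants and the idea of passing a numeric class together with the sequence are hypothetical and are not taken from the SCOP distribution files.

```python
# Class numbers excluded in the text: 6, 8, 9, 10 and 11.
EXCLUDED_CLASSES = {6, 8, 9, 10, 11}
# Domains shorter than 40 residues are removed.
MIN_LENGTH = 40


def keep_domain(scop_class: int, sequence: str) -> bool:
    """Return True if a domain passes the class and length filters."""
    return scop_class not in EXCLUDED_CLASSES and len(sequence) >= MIN_LENGTH


# Hypothetical records: (domain id, class number, sequence).
domains = [
    ("dom1", 1, "A" * 120),
    ("dom2", 6, "A" * 200),  # excluded class
    ("dom3", 3, "A" * 35),   # too short
]
kept = [dom_id for dom_id, cls, seq in domains if keep_domain(cls, seq)]
print(kept)
```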
Overview of the method
Figure 1 shows an overview of our method. The secondary structures and solvent accessibilities of the query and library sequences are predicted. PSI-BLAST (Altschul et al., 1997) is run for the query protein against the SCOP representative sequence database with a large (permissive) E-value threshold, and it outputs the homologue candidates and their alignments with the query sequence. Matching scores between the predicted secondary structures/solvent accessibilities of the query and library sequences are calculated on these alignments. The final decision is made by considering both the PSI-BLAST E-values and the structural matching scores.
Secondary structure prediction
An in-house program for secondary structure prediction was developed, using the standard neural network algorithm (Rumelhart et al., 1986; Qian and Sejnowski, 1988; Rost and Sander, 1993; Jones, 1999). The output secondary structure has three states [helix (H), strand (E) and others (C)]. The correct secondary structures were defined by the DSSP program (Kabsch and Sander, 1983). We employed the network architecture proposed by Jones (1999), which is composed of two three-layered networks (a cascaded network). The first network used the PSI-BLAST PSSM as its input and generated preliminary predictions. The second network used the predictions of the first network as its input and yielded the final prediction. The input size was 13 residues and the number of hidden units was 30 for both the first and second networks. The program's performance was evaluated by 7-fold cross-validation (Rost and Sander, 1993), using the SCOP representative dataset. The prediction accuracy of our method was Q3 = 76.18%, where Q3 is the percentage of residues whose three-state secondary structure was predicted correctly, out of all residues.
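As a minimal sketch of how a per-residue accuracy such as Q3 can be computed from a predicted and an observed state string, consider the following. The function name and the toy strings are hypothetical; the same function applies to two-state strings (Q2).

```python
def q_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not predicted:
        raise ValueError("input strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(predicted)


# Toy three-state strings (H = helix, E = strand, C = others).
pred = "CHHHHHCCEEEEC"
obs = "CHHHHCCCEEEEC"
print(round(q_accuracy(pred, obs), 2))
```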
Solvent accessibility prediction
A neural network for solvent accessibility prediction was also developed, which outputs two states of accessibility: exposed (e) or buried (b). The correct answer was defined using the accessible surface area calculated by the DSSP program (Kabsch and Sander, 1983). If the accessible surface area of a residue was greater than 15% of the value for the standard extended conformation, its accessibility was defined as exposed; otherwise, it was defined as buried. After trying various neural network architectures, we found that the second network was not effective for accessibility prediction. We therefore employed a network with 13-residue PSSM inputs and no hidden layer or second network. The accuracy of our network was Q2 = 73.74%, evaluated by 7-fold cross-validation.
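The two-state definition above amounts to thresholding a relative accessibility. The sketch below illustrates it; the function name is hypothetical, and the reference surface area for the extended conformation is passed in by the caller because the source does not list the reference values that were used.

```python
def accessibility_state(asa: float, reference_asa: float,
                        threshold: float = 0.15) -> str:
    """Return 'e' (exposed) or 'b' (buried) for one residue.

    asa           -- accessible surface area of the residue (e.g. from DSSP)
    reference_asa -- area of the same residue type in an extended conformation
                     (caller-supplied; the values are not given in the text)
    threshold     -- relative accessibility cutoff, 15% in the text
    """
    if reference_asa <= 0:
        raise ValueError("reference_asa must be positive")
    return "e" if asa / reference_asa > threshold else "b"


print(accessibility_state(40.0, 200.0))  # relative accessibility 0.20
print(accessibility_state(20.0, 200.0))  # relative accessibility 0.10
```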
Z-score for structure matching
Based on the PSI-BLAST alignment, the degree of structural matching was measured for the three-state secondary structures and the two-state solvent accessibilities. Figure 2 shows an example of the secondary structure correspondence on the PSI-BLAST sequence alignment. We assumed that homologous protein pairs have more structural matches than non-homologous pairs. Measuring structural matches for this purpose is not a trivial problem. The structural identity Q-value (Q3 or Q2) may be the simplest measure of structural matching. The Q-value is defined as the number M of residue pairs with the same structure divided by the number N of compared residues in the alignment between the query and the subject protein. However, such a Q-value tends to be high for a short alignment, even for non-homologous pairs. To solve this problem, Geourjon et al. (2001) excluded the sequence pairs with <100 aligned residues from their datasets and evaluated the structural matching by the Sov score (Rost et al., 1994; Zemla et al., 1999). Instead of simply excluding short alignments, we introduced the following Z-score, which evaluates the statistical significance of the matching structures against random matches given by the binomial distribution:
[Equations (1) and (2): definition of the Z-score for structure matching]
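The idea of scoring M matches among N aligned positions against a binomial null model can be sketched as follows. This is a minimal illustration and not necessarily the exact formula of the paper: the random-match probability p is assumed here to be the sum over states of the product of the state frequencies in the two aligned sequences, and the function name and the gap character "-" are hypothetical.

```python
from collections import Counter
from math import sqrt


def structure_match_zscore(query_states: str, library_states: str) -> float:
    """Z-score of observed structural matches versus a binomial expectation."""
    if len(query_states) != len(library_states):
        raise ValueError("aligned state strings must have equal length")
    # Compare only columns where both sequences have a residue.
    pairs = [(q, l) for q, l in zip(query_states, library_states)
             if q != "-" and l != "-"]
    n = len(pairs)
    if n == 0:
        raise ValueError("no aligned residue pairs")
    m = sum(q == l for q, l in pairs)
    # Assumed null model: states are paired independently.
    q_freq = Counter(q for q, _ in pairs)
    l_freq = Counter(l for _, l in pairs)
    p = sum(q_freq[s] * l_freq[s] for s in q_freq) / (n * n)
    variance = n * p * (1.0 - p)
    if variance == 0.0:
        raise ValueError("degenerate state composition; Z-score undefined")
    return (m - n * p) / sqrt(variance)


z_long = structure_match_zscore("HHHHEEEECCCC", "HHHHEEEECCCC")
z_short = structure_match_zscore("HHEECC", "HHEECC")
print(z_long > z_short)
```

Both toy alignments have a Q-value of 100%, yet the shorter one receives the lower Z-score, which is the behaviour the text asks for when it notes that Q-values tend to be high for short alignments.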
A simple linear discrimination method using centroid vectors was introduced to make a final decision by considering several features, such as the E-value of PSI-BLAST and the Z-score from secondary structure/accessibility prediction. The final score of the linear discrimination method is the inner product S, between the input feature vector x and the projection vector w:
[Equations (3) and (4): final score S of the linear discrimination method]
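A centroid-based linear discrimination of this kind can be sketched as below. This is an illustration under assumptions, not the authors' implementation: w is taken as the difference between the mean feature vectors of homologous and non-homologous training pairs, the features are assumed to be (log10 E-value, Z-score), no normalization is applied, and all names and numbers are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def projection_vector(homologous, non_homologous):
    """Vector pointing from the non-homologous to the homologous centroid."""
    mean_h = centroid(homologous)
    mean_n = centroid(non_homologous)
    return [h - n for h, n in zip(mean_h, mean_n)]


def score(w, x):
    """Inner product S of the projection vector w and a feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy training pairs: (log10 E-value, Z-score).
homologous = [(-10.0, 6.0), (-6.0, 4.0)]
non_homologous = [(0.0, 1.0), (2.0, -1.0)]
w = projection_vector(homologous, non_homologous)
print(w, score(w, (-7.0, 5.0)), score(w, (1.0, 0.0)))
```

A candidate would then be accepted when S exceeds a cutoff; the source does not state how such a cutoff is chosen, so none is fixed here.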
Coverage–reliability plot
To evaluate the abilities of the various detection methods, coverage–reliability plots were generated (Kawabata and Nishikawa, 2000). Coverage and reliability are defined as follows:
[Equations (5) and (6): definitions of coverage and reliability]
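One common way to trace such a plot is sketched below. It assumes, without confirmation from the source, that coverage is the number of detected true homologous pairs divided by the total number of true homologous pairs, and that reliability is the number of detected true pairs divided by the number of all detected pairs; the function name and the toy data are hypothetical.

```python
def coverage_reliability(scored_pairs, total_true):
    """Return (coverage, reliability) points for every score cutoff.

    scored_pairs -- list of (score, is_homologous); a higher score means
                    a more confident prediction
    total_true   -- number of true homologous pairs in the whole dataset
    """
    if total_true <= 0:
        raise ValueError("total_true must be positive")
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for detected, (_, is_homologous) in enumerate(ranked, start=1):
        true_positives += bool(is_homologous)
        points.append((true_positives / total_true, true_positives / detected))
    return points


pairs = [(9.0, True), (7.5, True), (6.0, False), (4.0, True)]
curve = coverage_reliability(pairs, total_true=4)
print(curve)
```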
Availability of software
Our software is available through a Web server (http://biunit.naist.jp/psisec/). It calculates a PSSM for a given target sequence using PSI-BLAST, predicts its secondary structure from the PSSM with our neural network program, searches for its homologues in the current PDB sequences using the PSSM and shows a combined result with the predicted secondary structure.
Results |
---|
Scatter plots of E-values and Z-scores for secondary structure prediction are shown in Figure 3. In general, the homologous protein pairs (red dots) were concentrated in the region of lower E-values and higher Z-scores, as compared with the non-homologous pairs (green dots). The vector w connecting the two means is also shown. The two features were combined into one by calculating the inner product with the vector w.
It is well known that the E-values of PSI-BLAST are not symmetric: the E-value E(A, B) of protein A in a library, using protein B as a query, is often different from the E-value E(B, A) of protein B in a library, using protein A as a query. Exploiting this asymmetry, the two-way PSI-BLAST method was proposed, which is reportedly more sensitive than the standard one-way PSI-BLAST (Teichmann et al., 1999; Kawabata et al., 2000). In the two-way PSI-BLAST method, the E-value for a pair of proteins A and B is evaluated by considering two PSI-BLAST searches:
[Equation (7): two-way PSI-BLAST E-value for a pair of proteins A and B]
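Combining the two directional searches can be sketched as follows. The rule used here, taking the smaller of E(A, B) and E(B, A), is an assumption made for illustration; the source text does not spell out the combination rule, and the function name and the data layout are hypothetical.

```python
def two_way_evalues(one_way: dict[tuple[str, str], float]) -> dict[frozenset[str], float]:
    """Collapse directional E-values into one value per unordered pair.

    one_way maps (query, subject) to the E-value of that search direction.
    Assumption: the pair is scored by the smaller of its two E-values.
    A pair reported in only one direction keeps that single value.
    """
    best: dict[frozenset[str], float] = {}
    for (query, subject), evalue in one_way.items():
        key = frozenset((query, subject))
        best[key] = min(evalue, best.get(key, float("inf")))
    return best


one_way = {("A", "B"): 1e-3, ("B", "A"): 5.0, ("A", "C"): 0.2}
two_way = two_way_evalues(one_way)
print(two_way[frozenset(("A", "B"))])
```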
We examined the performance of our filtering method against the two-way PSI-BLAST method, by using a symmetrical Z-score for structural matching, defined as follows:
[Equation (8): symmetrical Z-score for structural matching]
The column labeled two-way in Figure 5 shows the performance of our improved method based on the two-way PSI-BLAST method. As reported previously, the coverage of two-way PSI-BLAST is larger than that of one-way PSI-BLAST. Combination with the Z-scores for solvent accessibility and secondary structure also improved the performance of the two-way PSI-BLAST method.
Discussion |
---|
For predicting secondary structures and solvent accessibilities, neural networks with PSI-BLAST profile inputs were used in this study. To elucidate the relationship between the performance of homologue detection and the accuracy of secondary structure/solvent accessibility prediction, we also employed less accurate methods: neural networks with single sequence inputs. For this purpose, we developed an in-house neural network program with the same architecture as that used by Qian and Sejnowski (1988). We trained it using the SCOP representative datasets and evaluated it by cross-validation. The prediction accuracies were lower than those with profile inputs. For secondary structure prediction, Q3 for the network with single sequence inputs was 68.59%, whereas Q3 for that with profile inputs was 76.18%. For solvent accessibility prediction, Q2 for the network with single sequence inputs was 68.24%, whereas Q2 for that with profile inputs was 73.74%. The performance of homologue detection using these prediction methods is summarized in Figure 6. The results show that the prediction methods using a single sequence improved on PSI-BLAST, but not as much as the methods using profile inputs. This suggests that the prediction accuracy for secondary structure/solvent accessibility is crucial for the performance of our filtering method.
By default, we used the predicted secondary structure/solvent accessibility for both the query and library proteins. However, when remote homologue detection is used for structure prediction, the structures of the library proteins are already known, whereas that of the query protein needs to be predicted. To determine the effect of using an observed structure, we examined the performance of various combinations of observed and predicted structures. Figure 7 shows the performance of three combinations of secondary structures: predicted structures for both query and library proteins (PrePre), predicted structures for query proteins and observed structures for library proteins (PreObs) and observed structures for both query and library proteins (ObsObs). It is reasonable that the performance of ObsObs is the best among the three. However, the performance of the predicted versus observed structures (PreObs) is not much better than that of the predicted versus predicted structures (PrePre). In other words, our filtering method using a predicted structure as a query worked equally well for a structure-unknown library and a structure-known library. Although a similar result was reported by Geourjon et al. (2001), it is still not clear why introducing observed structures did not improve on the performance of PrePre. Coincident prediction errors for the query and library proteins may explain the high performance of PrePre.
Compared with other remote homologue detection methods, the advantages of our method are its easy implementation and fast computation. In addition, our evaluation of the performance is more reliable than those of previous, similar studies using ready-made secondary structure prediction programs, such as PHD (Rost and Sander, 1993) and PSI-PRED (Jones, 1999). This is because we could train our in-house prediction programs ourselves and apply cross-validation to them. However, we are aware of the limitations of our strategy. First, homologous pairs with large PSI-BLAST E-values cannot be found by our filtering method, because it depends completely on PSI-BLAST to provide the homologue candidates. Second, our method only filters the PSI-BLAST results; it cannot improve the alignments. The aligned sequences of the homologous pairs detected by our filtering method were often too short (data not shown). This was simply because PSI-BLAST alignments with larger E-values tend to be shorter. We now plan to introduce a dynamic programming program to realign only the homologue candidates found by PSI-BLAST, using predicted secondary structures. This may enhance the sensitivity and provide better alignments without introducing large computational costs.
Acknowledgments |
---|
References |
---|
Bindewald,E.B., Cestaro,A., Hesser,J., Heiler,M. and Tosatto,S.C.E. (2003) Protein Eng., 16, 785–789.
Bowie,J.U., Luthy,R. and Eisenberg,D. (1991) Science, 253, 164–170.
De la Cruz,X. and Thornton,J.M. (1999) Protein Sci., 8, 750–759.
Di Francesco,V., Munson,P.J. and Garnier,J. (1999) Bioinformatics, 15, 131–140.
Eddy,S. (1998) Bioinformatics, 14, 755–763.
Fischel-Goldsian,F., Mathiowitz,G. and Smith,T.F. (1990) Protein Eng., 3, 577–581.
Fischer,D. and Eisenberg,D. (1996) Protein Sci., 5, 947–955.
Geetha,V., Di Francesco,V., Garnier,J. and Munson,P.J. (1999) Protein Eng., 12, 527–534.
Geourjon,C., Combet,C., Blanchet,C. and Deleage,G. (2001) Protein Sci., 10, 788–797.
Ginalski,K., Pas,J., Wyrwicz,L.S., von Grotthuss,M., Bujnicki,J.M. and Rychlewski,L. (2003) Nucleic Acids Res., 31, 3804–3807.
Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Proc. Natl Acad. Sci. USA, 84, 4355–4358.
Hargbo,J. and Elofsson,A. (1999) Proteins, 36, 68–76.
Jones,D.T. (1999) J. Mol. Biol., 292, 195–202.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Nature, 358, 86–89.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Kawabata,T. and Nishikawa,K. (2000) Proteins, 41, 108–122.
Kawabata,T., Arisaka,F. and Nishikawa,K. (2000) Gene, 259, 223–233.
Kelly,L.A., MacCallum,R.M. and Sternberg,M.J.E. (2000) J. Mol. Biol., 299, 499–520.
McGuffin,L.J. and Jones,D.T. (2003) Bioinformatics, 19, 874–881.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Qian,N. and Sejnowski,J. (1988) J. Mol. Biol., 202, 865–884.
Rice,D.W. and Eisenberg,D. (1997) J. Mol. Biol., 267, 1026–1038.
Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584–599.
Rost,B., Sander,C. and Schneider,R. (1994) J. Mol. Biol., 235, 13–26.
Rost,B., Schneider,R. and Sander,C. (1997) J. Mol. Biol., 270, 417–480.
Rumelhart,D.E., Hinton,G.E. and Williams,R.J. (1986) Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA, pp. 318–362.
Russel,B.R., Copley,R.R. and Barton,G.J. (1996) J. Mol. Biol., 259, 349–365.
Shan,Y., Wang,G. and Zhou,H.-X. (2001) Proteins, 42, 23–37.
Teichmann,S.A., Chothia,C. and Gerstein,M. (1999) Curr. Opin. Struct. Biol., 9, 390–399.
Wallqvist,A., Fukunishi,Y., Murphy,L.R., Fadel,A. and Levy,R.M. (2000) Bioinformatics, 16, 988–1002.
Zemla,A., Venclovas,C., Fidelis,K. and Rost,B. (1999) Proteins, 34, 220–223.
Received February 20, 2004; revised August 1, 2004; accepted August 3, 2004.
Edited by Fred Cohen