Filtering remote homologues using predicted structural information

Katsunori Uehara1, Takeshi Kawabata1,2 and Nobuhiro Go1,3

1Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara 630-0192 and 3CCSE, Japan Atomic Energy Research Institute, 8–1, Umemidai, Kizu-cho, Souraku, Kyoto 619-0215, Japan

2 To whom correspondence should be addressed. E-mail: takawaba{at}is.naist.jp


    Abstract
 
Finding homologues for a given protein plays a major role in predicting the protein's structure and function. However, it is still difficult to find remote homologues with low sequence similarity, even with advanced sequence search methods. We propose a simple filtering method that uses predicted structural information, pertaining to secondary structures and solvent accessibilities. It filters the more promising homologues from the many candidate proteins obtained by PSI-BLAST with a less stringent threshold E-value. The final decision is made by a simple linear discrimination method, considering the E-value of PSI-BLAST and the statistical significance scores of structural matches. An in-house neural network program is used for the prediction of secondary structures and solvent accessibilities for both the query and library proteins. The performance of our filtering method was evaluated by the cross-validation method, using the SCOP superfamily relationship as the correct standard. Coverage–reliability plots show that our filtering method clearly improves the performance of PSI-BLAST. The secondary structure improves PSI-BLAST better than the solvent accessibilities, but the combination of these two features with PSI-BLAST leads to the best result. The advantage of our method is its easy implementation with fewer parameters to be tuned and faster computation. We also discuss its performance with predicted and observed secondary structures.

Keywords: PSI-BLAST/remote homologue detection/secondary structure prediction/solvent accessibility prediction


    Introduction
 
The detection of homologues for a given protein sequence is an essential step for predicting its tertiary structure and function. The demand for homologue detection has increased as a result of the huge number of protein sequences generated from genome sequencing projects. Since the 1990s, great efforts have been made to develop sensitive methods to detect increasingly remote homologues, based on ideas such as profile (Gribskov et al., 1987) or threading (Bowie et al., 1991; Jones et al., 1992). For highly sensitive homologue detection, profile methods use multiple alignments as inputs, whereas threading methods make use of structural information. Recently, programs based on the idea of profiles, PSI-BLAST (Altschul et al., 1997) and HMMer (Eddy, 1998), have been accepted as the standard tools for this purpose. However, structural comparison studies have suggested that there are many remote homologous protein pairs that PSI-BLAST cannot detect (Murzin et al., 1995; Kawabata and Nishikawa, 2000), indicating that there is room for more sensitive methods. In order to enhance the performance of profile methods, many researchers have proposed combined methods utilizing profiles and structural features. Several researchers reported that predicted secondary structure information improves the performance of standard sequence searches (Fischel-Ghodsian et al., 1990; Fischer and Eisenberg, 1996; Russel et al., 1996; Rice and Eisenberg, 1997; Rost et al., 1997). Following their studies, most of the recently developed recognition methods have incorporated predicted secondary structure information (De la Cruz and Thornton, 1999; Di Francesco et al., 1999; Geetha et al., 1999; Hargbo and Elofsson, 1999; Kelly et al., 2000; Wallqvist et al., 2000; Shan et al., 2001; Bindewald et al., 2003; Ginalski et al., 2003; McGuffin and Jones, 2003).

In most of these methods, a new combined score was introduced by adding sequence (or profile) and weighted secondary structure matching scores. Using such a combined score, protein pairs are aligned by the dynamic programming algorithm. Although these methods are powerful for recognizing remote homologues, they have some problems in the phases of method development and application to large databases. First, these methods contain many parameters to be adjusted by developers, such as the matching score for secondary structures and its weight against other terms. In addition, a search using the dynamic programming algorithm is much slower than that with the BLAST heuristic algorithm. Geourjon et al. (2001) proposed another effective strategy: many homologue candidates are obtained first, using the standard sequence search methods with a less stringent (i.e. larger) threshold E-value. From these candidates, a filtering program selects the more likely homologues, using predicted secondary structure information. The advantage of this method is its easy implementation with fewer parameters to be tuned and faster computation. This strategy successfully improved the BLAST results, but was only partially successful with the PSI-BLAST results.

In this study, following Geourjon et al.'s work, we developed a similar filtering method to choose more likely homologues from the many candidates obtained by PSI-BLAST, using predicted structural information. Compared with Geourjon et al.'s original work, we introduced a few refinements to the method. First, we employed a statistical significance score (Z-score) for structure matching. Second, in addition to the secondary structure predictions, we also used the solvent accessibility predictions. Third, we used a simple linear discrimination method, which combines the E-value and the structural matching score.


    Materials and methods
 
Dataset

To evaluate the performance of our methods, we used 3605 representative protein domains in the SCOP database (version 1.63) (Murzin et al., 1995) with sequence identities of 30% or less. The domains of classes 6 (membrane and cell surface proteins and peptides), 8 (coiled coil proteins), 9 (low-resolution protein structures), 10 (peptides), 11 (designed proteins) and those with length <40 residues were removed from the representative list, because of their specific nature of evolution. The ‘family’ and ‘superfamily’ relationships defined in SCOP were considered to be the correct homologous relationships.

Overview of the method

Figure 1 shows an overview of our method. The secondary structures and solvent accessibilities for the query and library sequences are predicted. PSI-BLAST (Altschul et al., 1997) is performed for the query protein against the SCOP representative sequence database and it outputs the homologue candidates with large threshold E-values and their alignments with the query sequence. Matching scores between the query and library predicted secondary structures/solvent accessibilities are calculated on the alignments. The final decision is made by considering the PSI-BLAST E-values and the structural matching scores.



 
Fig. 1. Overview of the method. Using predicted structural information, pertaining to secondary structures (Sec) and solvent accessibilities (Acc), more promising homologues are chosen from the many candidate proteins obtained by PSI-BLAST with a less stringent threshold E-value.

 
The procedure for performing PSI-BLAST actually consists of two steps. In the first step, homologues of the query protein are collected from the NR database downloaded from NCBI, with the maximum number of iterations = 5 and the threshold E-value = 0.001. After convergence, the final position-specific score matrix (PSSM) is saved. In the second step, using the PSSM as the query, the homologue candidates in the representative SCOP database are searched with the threshold E-value = 100.
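The two-step search can be sketched as a pair of command lines. The sketch below only builds the argument lists and runs nothing. It assumes the NCBI BLAST+ `psiblast` program, which may differ from the executable used in this work, and it assumes that the threshold E-value of 0.001 corresponds to the profile inclusion threshold. The function name and all file and database names are hypothetical.

```python
def build_psiblast_commands(query_fasta, nr_db, scop_db, pssm_file, hits_file):
    """Return the argument lists for the two search steps (nothing is executed)."""
    # Step 1: iterate against NR (at most 5 rounds) and save the final PSSM.
    # Mapping "threshold E-value = 0.001" to -inclusion_ethresh is an assumption.
    step1 = [
        "psiblast",
        "-query", query_fasta,
        "-db", nr_db,
        "-num_iterations", "5",
        "-inclusion_ethresh", "0.001",
        "-out_pssm", pssm_file,
    ]
    # Step 2: search the SCOP representative library with the saved PSSM,
    # using a permissive E-value so that many candidates are reported.
    step2 = [
        "psiblast",
        "-in_pssm", pssm_file,
        "-db", scop_db,
        "-evalue", "100",
        "-out", hits_file,
    ]
    return step1, step2


if __name__ == "__main__":
    for command in build_psiblast_commands(
        "query.fasta", "nr", "scop_rep", "query.pssm", "query.hits"
    ):
        print(" ".join(command))
```

The second command omits `-query` because the PSSM itself serves as the query.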

Secondary structure prediction

An in-house program for secondary structure prediction was developed, using the standard neural network algorithm (Rumelhart et al., 1986; Qian and Sejnowski, 1988; Rost and Sander, 1993; Jones, 1999). The output secondary structure included three states [helix (‘H’), strand (‘E’) and others (‘C’)]. The correct secondary structures were defined by the DSSP program (Kabsch and Sander, 1983). We employed the network architecture proposed by Jones (1999), which is composed of two three-layered networks (cascaded network). The first network used the PSI-BLAST PSSM as the input and generated preliminary predictions. The second network used the predictions of the first network as the inputs and yielded the final prediction. The input size was 13 residues and the number of hidden units was 30, for both the first and second networks. The program's performance was evaluated by 7-fold cross-validation (Rost and Sander, 1993), using the SCOP representative dataset. The prediction accuracy of our method was Q3 = 76.18%, which is the percentage of residues with correctly predicted three-state secondary structures among all of the residues.
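As an illustration of the 13-residue input, the following minimal sketch builds the input vector for one residue from a PSSM. It is not the in-house program. The function name is hypothetical, and the handling of chain ends (zero padding plus an extra flag per position) is an assumption, since the text does not describe it.

```python
def window_features(pssm, center, window=13):
    """Flatten the PSSM rows in a window centered on `center` into one vector.

    pssm:   list of per-residue score rows (for example 20 values each).
    center: index of the residue whose state is to be predicted.
    Window positions beyond either chain end are filled with zeros and
    marked with a flag of 1.0 (an assumed padding scheme).
    """
    half = window // 2
    width = len(pssm[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
            features.append(0.0)  # flag: position lies inside the chain
        else:
            features.extend([0.0] * width)
            features.append(1.0)  # flag: position lies beyond a chain end
    return features


if __name__ == "__main__":
    toy_pssm = [[float(i)] * 20 for i in range(5)]  # 5 residues, 20 columns
    vector = window_features(toy_pssm, center=0)
    print(len(vector))  # 13 positions x (20 scores + 1 flag)
```

In a cascaded design, a second network would receive a similar window built from the three-state outputs of the first network instead of PSSM rows.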

Solvent accessibility prediction

A neural network for solvent accessibility prediction was also developed, which outputs two states of accessibility: exposed (‘e’) or buried (‘b’). The correct answer was defined using the accessible surface area calculated by the DSSP program (Kabsch and Sander, 1983). If the value of the accessible surface area was greater than 15% of the value for the standard extended conformation, then its accessibility was defined as ‘exposed’; otherwise, ‘buried’. After trials of various kinds of neural network architectures, we found that the second network was not effective for accessibility prediction. Finally, we employed the network with 13-residue PSSM inputs, with no hidden layer or second network. The accuracy of our network was Q2 = 73.74%, evaluated by the 7-fold cross-validation.
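The two-state labelling rule can be sketched as follows. The function names are hypothetical, and the reference areas for the extended conformation must be supplied by the caller; the values in the usage example are placeholders, not the values used in this work.

```python
def accessibility_state(asa, extended_asa, threshold=0.15):
    """Return 'e' (exposed) or 'b' (buried) for one residue.

    asa:          accessible surface area of the residue (e.g. from DSSP).
    extended_asa: reference area of the same residue type in the standard
                  extended conformation.
    """
    if extended_asa <= 0:
        raise ValueError("reference area must be positive")
    relative = asa / extended_asa
    return "e" if relative > threshold else "b"


def label_chain(asa_values, residue_types, reference_areas):
    """Label every residue of a chain; reference_areas maps type -> area."""
    return "".join(
        accessibility_state(asa, reference_areas[res])
        for asa, res in zip(asa_values, residue_types)
    )


if __name__ == "__main__":
    # Placeholder reference areas, for illustration only.
    placeholder_reference = {"A": 100.0, "G": 80.0}
    print(label_chain([30.0, 10.0], "AG", placeholder_reference))
```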

Z-score for structure matching

Based on the PSI-BLAST alignment, the degree of structural matching was measured for the three-state secondary structures and the two-state solvent accessibilities. Figure 2 shows an example of the secondary structure correspondence on the PSI-BLAST sequence alignment. We assumed that homologous protein pairs have more structural matches than non-homologous pairs. Measuring structural matches for this purpose is not a trivial problem. The structural identity Q-value (Q3 or Q2) may be the simplest way of measuring structural matches. The Q-value is defined as the number M of residue pairs with the same structural state divided by the number N of compared residues in the alignment between the query and the subject protein. However, such a Q-value tends to be high for a short alignment, even for non-homologous pairs. To solve this problem, Geourjon et al. (2001) excluded the sequence pairs with <100 aligned residues from their datasets and evaluated the structural matching by the Sov score (Rost et al., 1994; Zemla et al., 1999). Instead of simply excluding short alignments, we introduced the following Z-score for evaluating the statistical significance of matching structures against random matches given by the binomial distribution:

$$Z = \frac{Q - p}{\sqrt{p(1 - p)/N}} \tag{1}$$
where Q is defined as M/N and p is the probability that the corresponding residues randomly have the same structural state, which is given as follows:

$$p = \sum_{s \in S} p_s^2 \tag{2}$$
where S is the set of structural states and $p_s$ is the probability that the structural state s appears by chance, which can be approximated by the frequencies of the respective structural states in the representative structural dataset. Specifically, p for secondary structures is $p_H^2 + p_E^2 + p_C^2$ and p for accessibility is $p_e^2 + p_b^2$. For the pairs with the same Q-value, the Z-score is proportional to the square root of the number of compared residues, N. This property of the Z-score is helpful for excluding non-homologous pairs with a short alignment.
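The Z-score can be sketched in a few lines, assuming the usual binomial form Z = (Q − p)/sqrt(p(1 − p)/N) with Q = M/N. The function names are hypothetical. The value p = 0.369 in the usage example is an assumption, back-calculated so that the Figure 2 case (N = 47, M = 33) gives a Z-score of about 4.73; it is not stated in the text.

```python
import math


def random_match_probability(state_frequencies):
    """Probability that two residues share a state by chance: sum of p_s^2."""
    return sum(freq * freq for freq in state_frequencies.values())


def z_score(matches, compared, p):
    """Binomial Z-score for `matches` identical states among `compared` residues."""
    if compared <= 0:
        raise ValueError("need at least one compared residue")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    q = matches / compared
    return (q - p) / math.sqrt(p * (1.0 - p) / compared)


if __name__ == "__main__":
    # Figure 2 case; p = 0.369 is an assumed background probability.
    print(round(z_score(33, 47, 0.369), 2))
    # Same Q-value (0.4) on a four times longer alignment doubles the Z-score.
    print(z_score(40, 100, 0.369) / z_score(10, 25, 0.369))
```

The second line of the usage example illustrates why short alignments are penalized: at a fixed Q-value the score grows with the square root of N.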



 
Fig. 2. Example of structural correspondence based on the PSI-BLAST alignment. (a) An example of PSI-BLAST. (b) Secondary structure information has been added to the alignment. In this case, the number of compared residues N is 47 and the number of structurally matched residues M is 33. The Z-score is 4.73.

 
Linear discrimination using centroid vectors

A simple linear discrimination method using centroid vectors was introduced to make a final decision by considering several features, such as the E-value of PSI-BLAST and the Z-score from secondary structure/accessibility prediction. The final score of the linear discrimination method is the inner product S, between the input feature vector x and the projection vector w:

$$S = \mathbf{w} \cdot \mathbf{x} \tag{3}$$
We defined the vector w as the difference between the means of two classes:

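$$\mathbf{w} = \frac{1}{N_+} \sum_{i \in (+)} \mathbf{x}_i - \frac{1}{N_-} \sum_{i \in (-)} \mathbf{x}_i \tag{4}$$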
where $N_+$ is the number of homologous pairs [positive (+) class] and $N_-$ is that of the non-homologous pairs [negative (–) class]. The performance of this method was evaluated using the 7-fold cross-validation method, where the parameter w was decided without using the test dataset.
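A minimal sketch of the centroid-based discrimination follows. The function names and the toy feature values are hypothetical. Each feature vector is taken here as [log10(E-value), Z-score], following the axes of Figure 3; whether the features were rescaled before projection is not stated, so no scaling is applied.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


def projection_vector(positives, negatives):
    """w = centroid of the homologous class minus centroid of the other class."""
    mean_pos = mean_vector(positives)
    mean_neg = mean_vector(negatives)
    return [a - b for a, b in zip(mean_pos, mean_neg)]


def discriminant_score(w, x):
    """Final score: inner product of w and the feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


if __name__ == "__main__":
    # Toy training pairs: [log10(E-value), Z-score]; values are illustrative.
    homologous = [[-10.0, 6.0], [-6.0, 4.0]]
    non_homologous = [[0.0, 1.0], [2.0, -1.0]]
    w = projection_vector(homologous, non_homologous)
    print(w)
    print(discriminant_score(w, [-5.0, 3.0]), discriminant_score(w, [1.0, 0.0]))
```

Because w points from the non-homologous centroid toward the homologous one, a larger score indicates a more likely homologue. In cross-validation, w would be computed from the training folds only.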

Coverage–reliability plot

To evaluate the abilities of the various detection methods, coverage–reliability plots were generated (Kawabata and Nishikawa, 2000). Coverage and reliability are defined as follows:

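$$\mathrm{Coverage}(S) = \frac{N_{tp}(S)}{N_t} \tag{5}$$

$$\mathrm{Reliability}(S) = \frac{N_{tp}(S)}{N_p(S)} \tag{6}$$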
where Ntp(S) is the number of homologous protein pairs with a similarity score better than S, Nt is the number of homologous protein pairs and Np(S) is the number of protein pairs with a similarity score better than S. Coverage and reliability were calculated against all of the observed scores and plotted on a plane. The curves plotted more towards the upper right are better than those towards the lower left.
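The plot can be computed by ranking all scored pairs and accumulating counts, as in the sketch below. The function name and the toy data are hypothetical. The sketch assumes coverage = Ntp(S)/Nt and reliability = Ntp(S)/Np(S), treats a higher score as better, and places the threshold at each ranked pair in turn, so tied scores are handled in arbitrary order.

```python
def coverage_reliability(scored_pairs, total_homologous):
    """Return one (coverage, reliability) point per ranked pair.

    scored_pairs:     list of (score, is_homologous); higher score is better.
    total_homologous: number of homologous pairs in the whole dataset (Nt),
                      including those never reported by the search.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for rank, (_score, is_homologous) in enumerate(ranked, start=1):
        if is_homologous:
            true_positives += 1
        coverage = true_positives / total_homologous
        reliability = true_positives / rank
        points.append((coverage, reliability))
    return points


if __name__ == "__main__":
    toy_pairs = [(9.0, True), (7.0, True), (5.0, False), (3.0, True)]
    for point in coverage_reliability(toy_pairs, total_homologous=4):
        print(point)
```

For a score where smaller is better, such as a raw E-value, the sign would be flipped before ranking.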

Availability of software

Our software is available through a Web server (http://biunit.naist.jp/psisec/). It calculates a PSSM for a given target sequence using PSI-BLAST, predicts its secondary structure from the PSSM with our neural network program, searches for its homologues in the current PDB sequences using the PSSM and shows a combined result with the predicted secondary structure.


    Results
 
Predicted structural information improves PSI-BLAST performance

Scatter plots of E-values and Z-scores for secondary structure prediction are shown in Figure 3. Basically, the homologous protein pairs (red dots) were distributed more in the lower E-value and higher Z-score region, as compared with the non-homologous pairs (green dots). The vector w connecting two means is also shown. The two features were combined into one by calculating an inner product with the vector w.



 
Fig. 3. Distribution of aligned sequence pairs plotted by the logarithm of the PSI-BLAST E-value versus the Z-score for matched predicted secondary structures. Red dots (a) are homologous pairs and green dots (b) are non-homologous pairs. The black circle is the center for homologous pairs and the black square is the center for non-homologous pairs. The black line connecting the circle and the square corresponds to the direction of the projection vector w. The blue lines are the boundaries corresponding to reliabilities of 0.8 (‘Rel = 0.8’) and 0.9 (‘Rel = 0.9’).

 
Coverage–reliability plots for the various methods are shown in Figure 4, which clearly indicates that our filtering program, which considers matching of predicted secondary structures, has a better ability to recognize homologues than the original PSI-BLAST program. The Z-score for the solvent accessibilities improved the performance, but was less effective than that for the secondary structure. Combining the three features, E-value, Z-score for secondary structure and Z-score for solvent accessibility, yielded a slightly better result than that for E-value and Z-score for secondary structure.



 
Fig. 4. Coverage–reliability plots for various discrimination methods. ‘PSI + Zacc’ is the performance of the method using the PSI-BLAST E-values and Z-scores for predicted solvent accessibilities, ‘PSI + Zsec’ is that using the PSI-BLAST E-values and Z-scores for predicted secondary structures and ‘PSI + Zsec + Zacc’ is that using the PSI-BLAST E-values and Z-scores for predicted secondary structures and solvent accessibilities.

 
Combination with two-way PSI-BLAST

It is well known that the E-values of PSI-BLAST are not symmetric: the E-value E(A, B) of protein A in a library, using protein B as a query, is often different from the E-value E(B, A) of protein B in a library, using protein A as a query. Using this asymmetry, the two-way PSI-BLAST method was proposed, which is reportedly more sensitive than the standard one-way PSI-BLAST (Teichmann et al., 1999; Kawabata et al., 2000). In the two-way PSI-BLAST method, the E-value for a pair of proteins A and B is evaluated by considering two PSI-BLAST searches:

(7)

We examined the performance of our filtering method against the two-way PSI-BLAST method, by using a symmetrical Z-score for structural matching, defined as follows:

(8)

The column labeled ‘two-way’ in Figure 5 is the performance of our improved method based on the two-way PSI-BLAST method. As reported previously, the two-way PSI-BLAST coverage is larger than the one-way. Combination with the Z-score for solvent accessibility and secondary structure also improved the performance of the two-way PSI-BLAST method.



 
Fig. 5. Coverage–reliability plots for various two-way discrimination methods. The meanings of ‘PSI + Zacc’, ‘PSI + Zsec’ and ‘PSI + Zacc + Zsec’ are the same as in Figure 4.

 

    Discussion
 
Homologue detection using prediction with single sequence inputs

For predicting secondary structures and solvent accessibilities, neural networks with PSI-BLAST profile inputs were used in this study. To elucidate the relationship between the performance of homologue detection and the accuracy of secondary structure/solvent accessibility prediction, we employed less accurate methods, namely neural networks with single sequence inputs. For this purpose, we developed an in-house neural network program with the same architecture as used by Qian and Sejnowski (1988). We trained it using the SCOP representative datasets and evaluated it by the cross-validation method. The prediction accuracies were lower than those with profile inputs. For secondary structure prediction, Q3 for the network with single sequence inputs was 68.59%, whereas Q3 for that with profile inputs was 76.18%. For solvent accessibility prediction, Q2 for the network with single sequence inputs was 68.24%, whereas Q2 for that with profile inputs was 73.74%. The performance of homologue detection using these prediction methods is summarized in Figure 6. The results show that the prediction methods using a single sequence improved PSI-BLAST, but not as much as the methods using profile inputs. This suggests that the prediction accuracy for secondary structure/solvent accessibility is crucial for the performance of our filtering method.



 
Fig. 6. Coverage–reliability plots for discrimination methods using different secondary structure/solvent accessibility prediction methods. A neural network with single sequence inputs (‘Single’) and that with profile inputs (‘Profile’) are compared.

 
Performance of observed–observed and predicted–observed secondary structures

Basically, we used the predicted secondary structure/solvent accessibility for both the query and library proteins. However, when remote homologue detection is used for structure prediction, the structures of the library proteins are already known, whereas that of the query protein needs to be predicted. In order to determine the effect of using observed structures, we examined the performance of various combinations of observed and predicted structures. Figure 7 shows the performance of three combinations of secondary structures: predicted structures for both query and library proteins (Pre–Pre), predicted structures for query proteins and observed structures for library proteins (Pre–Obs) and observed structures for both query and library proteins (Obs–Obs). It is reasonable that the performance of ‘Obs–Obs’ is the best among the three. However, the performance of the predicted versus observed structures (Pre–Obs) is not much better than that of the predicted versus predicted structures (Pre–Pre). In other words, our filtering method using a predicted structure as a query worked about equally well for a structure-unknown library and a structure-known library. Although a similar result was reported by Geourjon et al. (2001), it is still not clear why introducing observed structures did not improve on the performance of ‘Pre–Pre’. Coincident prediction errors for the query and library proteins may explain the high performance of ‘Pre–Pre’.



 
Fig. 7. Coverage–reliability plots for discrimination methods using different combinations of secondary structures. Three combinations of secondary structures were examined: predicted structures for both query and library proteins (‘Pre–Pre’), predicted structures for query proteins, observed structures for library proteins (‘Pre–Obs’) and observed structures for both query and library proteins (‘Obs–Obs’).

 
Limitations of the method and possible improvements

Compared with the other remote homologue detection methods, the advantage of our method is its easy implementation and fast computation. In addition, our evaluation of the performance is more reliable than those of previous, similar studies using ready-made secondary structure prediction programs, such as PHD (Rost and Sander, 1993) and PSI-PRED (Jones, 1999). This is because we trained our in-house prediction programs ourselves and applied the cross-validation evaluation to them. However, we are aware of the limitations of our strategy. First, homologous pairs with large PSI-BLAST E-values cannot be found by our filtering method, because our method completely depends on PSI-BLAST to provide homologue candidates. Second, our method only filters the PSI-BLAST results; it cannot improve the alignments. The aligned sequences of the homologous pairs detected by our filtering method were often too short (data not shown). This was simply because PSI-BLAST alignments with larger E-values tend to be shorter. We now plan to introduce a dynamic programming program to realign only the homologue candidates found by PSI-BLAST, using predicted secondary structures. This may enhance the sensitivity and provide better alignments, without introducing large computational costs.


    Acknowledgments
 
This work was supported by the Special Coordination Funds Promoting Science and Technology and a Grant-in-Aid for Scientific Research on Priority Area (C), Genome Information Science, from MEXT (Ministry of Education, Culture, Sports, Science and Technology, Japan).


    References
 
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,H., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.

Bindewald,E.B., Cestaro,A., Hesser,J., Heiler,M. and Tosatto,S.C.E. (2003) Protein Eng., 16, 785–789.

Bowie,J.U., Luthy,R. and Eisenberg,D. (1991) Science, 253, 164–170.

De la Cruz,X. and Thornton,J.M. (1999) Protein Sci., 8, 750–759.

Di Francesco,V., Munson,P.J. and Garnier,J. (1999) Bioinformatics, 15, 131–140.

Eddy,S. (1998) Bioinformatics, 14, 755–763.

Fischel-Ghodsian,F., Mathiowitz,G. and Smith,T.F. (1990) Protein Eng., 3, 577–581.

Fischer,D. and Eisenberg,D. (1996) Protein Sci., 5, 947–955.

Geetha,V., Di Francesco,V., Garnier,J. and Munson,P.J. (1999) Protein Eng., 12, 527–534.

Geourjon,C., Combet,C., Blanchet,C. and Deleage,G. (2001) Protein Sci., 10, 788–797.

Ginalski,K., Pas,J., Wyrwicz,L.S., von Grotthuss,M., Bujnicki,J.M. and Rychlewski,L. (2003) Nucleic Acids Res., 31, 3804–3807.

Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Proc. Natl Acad. Sci. USA, 84, 4355–4358.

Hargbo,J. and Elofsson,A. (1999) Proteins, 36, 68–76.

Jones,D.T. (1999) J. Mol. Biol., 292, 195–202.

Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Nature, 358, 86–89.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Kawabata,T. and Nishikawa,K. (2000) Proteins, 41, 108–122.

Kawabata,T., Arisaka,F. and Nishikawa,K. (2000) Gene, 259, 223–233.

Kelly,L.A., MacCallum,R.M. and Sternberg,M.J.E. (2000) J. Mol. Biol., 299, 499–520.

McGuffin,L.J. and Jones,D.T. (2003) Bioinformatics, 19, 874–881.

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.

Qian,N. and Sejnowski,T.J. (1988) J. Mol. Biol., 202, 865–884.

Rice,D.W. and Eisenberg,D. (1997) J. Mol. Biol., 267, 1026–1038.

Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584–599.

Rost,B., Sander,C. and Schneider,R. (1994) J. Mol. Biol., 235, 13–26.

Rost,B., Schneider,R. and Sander,C. (1997) J. Mol. Biol., 270, 417–480.

Rumelhart,D.E., Hinton,G.E. and Williams,R.J. (1986) Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA, pp. 318–362.

Russel,B.R., Copley,R.R. and Barton,G.J. (1996) J. Mol. Biol., 259, 349–365.

Shan,Y., Wang,G. and Zhou,H.-X. (2001) Proteins, 42, 23–37.

Teichmann,S.A., Chothia,C. and Gerstein,M. (1999) Curr. Opin. Struct. Biol., 9, 390–399.

Wallqvist,A., Fukunishi,Y., Murphy,L.R., Fadel,A. and Levy,R.M. (2000) Bioinformatics, 16, 988–1002.

Zemla,A., Venclovas,C., Fidelis,K. and Rost,B. (1999) Proteins, 34, 220–223.

Received February 20, 2004; revised August 1, 2004; accepted August 3, 2004.

Edited by Fred Cohen




