Combining sensitive database searches with multiple intermediates to detect distant homologues

Asaf A. Salamov1,2, Makiko Suwa1, Christine A. Orengo3 and Mark B. Swindells1,4,5

1 Helix Research Institute, 1532–3 Yana, Kisarazu-shi, Chiba, 292, Japan, 3 Biomolecular Structure and Modelling Unit, Department of Biochemistry, University College London, Gower Street, London, UK and 4 Tsukuba Advanced Research Alliance, University of Tsukuba, Tsukuba 305, Japan


    Abstract
 Top
 Abstract
 Introduction
 Methods
 Results
 References
 
Using data from the CATH structure classification, we have assessed the blastp, fasta, smith–waterman and gapped-blast algorithms, developed a portable normalization scheme and identified safe thresholds for database searching. Of the four methods assessed, fasta, smith–waterman and gapped-blast perform similarly, whereas the sensitivity of blastp was much lower. Introduction of an intermediate sequence search substantially improved the results. When tested on a set of relationships that could not be identified by blastp, intermediate sequences were able to find double the number of relationships identified by the smith–waterman algorithm alone. However, we found that the benefit of using intermediates varied considerably between each family and depended not only on the number of available sequences, but also their diversity. In an attempt to increase sensitivity further, a multiple intermediate sequence search (MISS) procedure was developed. When assessed on 1906 cases from a wide range of homologous families that could not be detected by the previous approaches, MISS was able to identify 241 additional relationships. MISS uses the full extent of sequence diversity to detect additional relationships, but does not consider any structure-specific information. For this reason, it is more generally applicable than fold recognition and threading methods, which require a library of known structures.

Keywords: CATH/intermediate searches/sequence analysis/protein structure


    Introduction
 Top
 Abstract
 Introduction
 Methods
 Results
 References
 
Although we are still unable to predict protein structure and function directly, sequence comparison techniques enable us to infer these properties by identifying similarities to entries already in public databases. Such methods work particularly well for globular proteins which are the subject of this research. In order to perform a successful database search, one must have an alignment algorithm, residue similarity matrix and scoring scheme as well as knowledge of the threshold beyond which the scores only report real relationships. Many researchers have investigated all these features, yet despite much work it remains unclear what scores and thresholds can reliably detect the most homologues while still rejecting all false relationships (Johnson and Overington, 1993Go; Pearson, 1996Go; Sphaer et al., 1996Go). Furthermore, as scores are dependent on the algorithm and matrix used, it has been difficult to provide general guidelines for non-specialists.

There are now many alignment algorithms available, yet despite many useful developments the most sensitive is still considered to be dynamic programming (Needleman and Wunsch, 1970Go; Smith and Waterman, 1981Go; Gotoh, 1982Go). The popularity of other programs, such as fasta (Pearson and Lipman, 1988Go; Pearson, 1996Go) and blast (Altschul et al., 1990Go; Altschul et al., 1997Go), are attributable to increased alignment speed while retaining acceptable sensitivity. With matrices, however, quality appears to depend not only on the method of construction (McLachlan, 1971Go; Dayhoff et al., 1978Go; Feng et al., 1985Go; Henikoff and Henikoff, 1992Go), but also the amount of sequence data from which substitution values are estimated (Gonnet et al., 1992Go; Jones et al., 1992Go). As a result, there is a general trend for the most recent matrices to have the highest sensitivity, even though the method of construction may have been proposed some time ago (Dayhoff et al., 1978Go).

The alignment score output by an algorithm is the basis upon which each relationship is identified. The simplest is percentage sequence identity where a heuristic twilight zone of 30% defines the level above which relationships can usually be reliably identified. A more specific assessment can be achieved by also considering the length of the alignment (Sander and Schneider, 1991Go) but, ultimately, percentage identity is an over-simplification as it neglects all of the conserved substitutions that may be occurring at other sites. An alignment score that reflects such conservation (and includes gap penalties where appropriate) should therefore be more sensitive in database searches.

Further improvements can be obtained by considering the scores that might randomly occur between unrelated sequences. Karlin and Altschul (1990) showed mathematically that the scores for ungapped alignments in random sequences should adopt an extreme distribution and Mott (1991) showed empirically that gapped alignments also result in the same distribution. Other suggestions for sequence length and database size normalization have included ln-scale approaches (Pearson, 1995Go; Sphaer et al., 1996Go), although it is unclear why they should be superior.

Naturally, any progress towards maximizing the sensitivity of a search, as well as giving the user reliable thresholds for interpreting the results, would be beneficial. But even if such improvements could be obtained by judicious selection of the above features, there would be an upper limit that cannot be exceeded by a single pairwise alignment procedure.

To obtain extra sensitivity with the same algorithms, many workers have used protein sequence variations as an extra way to assist the search process (Gribskov, 1992Go; Koonin and Tatsuov, 1994Go; Bork et al., 1995Go; Holm and Sander, 1997Go). Such methods use the simple observation that even though two sequences may not directly have a significant score, they may in practice be linked through an intermediate whose similarities to both query and target are above the threshold. A formal assessment of the benefits has recently been reported, in which sequences from the Protein Data Bank (PDB) were linked via an intermediate (Park et al., 1997Go).

Linking by intermediates is attractive, not only because it harnesses the rapid increase in sequence data, but also because it only uses the same algorithms and thresholds that were described above. This is different from methods which summarize multiple sequence information as some sort of profile. For example, in PSI-blast, where a position-dependent score matrix is constructed and then modified at each iteration, the scores vary considerably despite attempts to normalize them. The same is true for Hidden Markov Models in the pfam database (Sonnhammer et al., 1998Go), where a specific threshold (judged to be safe by the person who made the HMM) is given for each family.

We are specifically interested in ways to maximize the number of sequences that can be related to known structures, because knowledge of the approximate three-dimensional structure provides an ideal link between sequence and function. We are specifically interested in globular domains and have already developed a comprehensive classification of known structures (Orengo et al., 1997Go). Maximizing the number of sequences that can be matched to each structure by a database search should provide a useful extension to our classification. In this paper we investigate various ways to improve sequence searches, assess the limitations of current intermediate sequence technology and suggest a novel implementation that uses multiple intermediates.


    Methods
 Top
 Abstract
 Introduction
 Methods
 Results
 References
 
Alignment programs, matrices

We used blastp (Altschul et al., 1990Go), gapped-blast (Altschul et al., 1997Go), fasta (Pearson and Lipman, 1988Go) and smith–waterman (Smith and Waterman, 1981Go) algorithms and two matrices; blosum 62 and blosum 50 (Henikoff and Henikoff, 1992Go) in this work. These are also referred to as blastp, gp-blast, fasta, sw, bl62 and bl50, respectively. Our choice of matrix is not exhaustive when compared with previous reports (Johnson and Overington, 1993Go; Abagyan and Batalov, 1997Go), but our aim is to provide a reasonable benchmark rather than investigate additional gains that might be obtained by detailed optimization. Blosum matrices are the defaults for the algorithms used and furthermore several independent comparisons have already shown them to be among the best performers (Johnson and Overington, 1993Go; Abagyan and Batalov, 1997Go). Other parameters, where required, were set to the default values. For gapped-blast, the gap opening and extension penalties were –10 and –1, respectively. For fasta and smith–waterman these were –12 and –2.

Scoring scheme

The simplest alignment score for any algorithm is a product of the substitution matrix values for each aligned position and the applied gap penalties. However, it is well known that discrimination between true and false relationships can be improved by also considering the extreme distribution of scores that occurs between unrelated sequences, in addition to sequence length (Karlin and Altschul, 1990Go; Mott, 1991Go; Sander and Schneider, 1991Go). Our only criterion for a suitable scoring scheme was that it should be independent of database size. This ruled out the commonly quoted expectation values (E-values), as well as the P-values used in blastp-v1.4 as our ultimate choice, although we evaluated their suitability to the task.

We found that normalizing the raw score by taking into account the lengths of both sequences (longer sequences will always tend to give higher scores) and the extreme distribution of alignment scores identified by Karlin and Altschul (1990), gave the best discrimination on our dataset, irrespective of which algorithm/matrix combination used. The normalized score S is calculated by


where s is the raw score, n and m are the lengths of the sequences being aligned and {lambda} and K are constants taken from Alschul and Gish (1996). Specifically these were {lambda} = 0.216 and K = 0.014 for blosum 62 and {lambda} = 0.158 and K = 0.019 for blosum 50.

This equation is in fact a precursor to the calculation of expectation values used in the gapped-blast, fasta and smith–waterman searches. The probability p that a score S from a pairwise comparison is above a threshold x is


From this the expectation value (E), which also takes into account database size (D), is


The P value originally used by blastp but no longer supported by gapped-blast was a further modification of the E-value.

Data set compilation from the CATH classification

All 841 single domain proteins that had less than 95% sequence identity and were available in version 1.1 of the CATH protein structure classification (Orengo et al., 1997Go) were selected for this work. CATH is a comprehensive, hierarchical classification of structures from the PDB (Bernstein et al., 1977Go). Each Homologous superfamily (H-level) in CATH contains proteins whose relationships can either be identified by sequence searching or, when the sequence similarity is too low, structure comparison coupled with functional analysis. Our 841 sequences are distributed over 241 Homologous superfamilies. 157 of the Homologous superfamilies have only a single entry and therefore play their role by increasing the chance of a false relationship occurring when comparisons are made with data from the 200 000 entry, OWL sequence database. The 3442 true relationships which exist are contained within 84 Homologous superfamilies and the seven most populated of these are shown in Figure 3bGo. The list of 841 proteins is available at http://www.hri.co.jp/~swintech/841.list.



View larger version (45K):
[in this window]
[in a new window]
 
Fig. 3. (a) Increase in the percentage of relationships identified. Results for blastp with raw scores (from Figure 1aGo) are compared with the optimal fasta and gp-blast normalized scores (from Figure 1bGo) and the application of intermediate sequence searches. (b) Variation of success in the seven most populated Homologous superfamilies using the fasta algorithm.

 
Single- and multi-domain proteins for threshold assessment

It is important to describe here our principles for dealing with domains. The thresholds for each algorithm were originally determined using a subset of single domain proteins. This was for two reasons: first it made the process of separating related and unrelated proteins easier and did not rely on our subjective division of protein structures into domains that had not been experimentally verified; second, it helps us to determine a conservative threshold, above which unrelated sequences are unlikely to occur. This is because of the way it considers sequence length when modifying the raw score and is best explained by way of example. Consider two sequences, one containing only a single SH3 domain and another containing an SH3 domain linked to an SH2. Although the raw score would be identical, the normalized score would be lower than if the multi-domain protein were split up and the SH3 domains aligned by themselves. Therefore, the threshold arising from this analysis will also be applicable to searches with multi-domain proteins.


    Results
 Top
 Abstract
 Introduction
 Methods
 Results
 References
 
Assessing popular sequence comparison methods using the CATH classification

To assess the power of any sequence alignment method, a suitable test set is required. We take our data from the CATH classification of protein structures (Orengo et al., 1997Go). Because CATH uses structure comparison to detect distant evolutionary relationships, the procedure is not dependent on the limitations of sequence comparison technology for detecting homologues. Here we use 841 single domain entries from CATH, that constitute 241 Homologous superfamilies (H levels in CATH). Within each H level, sequence similarities range from 95% identical to statistically insignificant and thus provide a broad set with which to test the power of popular sequence alignment algorithms. In total there are 3442 real relationships to be detected (i.e. they have the same CATH number) and 349 778 false relationships to be rejected.

Variation of specificity with sensitivity for pairwise searches

Although various combinations of algorithm, matrix and scoring scheme were investigated, we limit the discussion here to those with the most important differences. Figure 1aGo shows the variation of sensitivity and specificity for the raw scores of four algorithms after performing all by all comparisons on our 841 protein dataset (results for blosum62 and blosum50 matrices were similar, and the latter are not shown). It was immediately clear that blastp performs the worst, whereas the other three are fairly similar. Although the percentage of relationships identified relates only to our dataset of 841 proteins, relative improvements are important, because the worst performing algorithm only detects obvious relationships, while leaving more difficult cases to methods with higher sensitivity.



View larger version (44K):
[in this window]
[in a new window]
 
Fig. 1. Variation of sensitivity with specificity for (a) all four algorithms using the blosum 62 matrix and raw scores and (b) the three most sensitive algorithms using both blosum 50 and 62 matrices with normalized scores. Tables below each graph quote the sensitivities at 100 and 95% specificity.

 
Using these raw data, we tried to improve the discrimination of true and false relationships with a variety of normalization schemes that included those described in the methods (Equations 1–3) as well as the logarithmic scale proposed by Pearson (1995) and used by Sphaer et al. (1996). In all tests we found that Equation 1 gave the best normalization, but the differences were small.

In retrospect this was inevitable, as even raw scores only perform a few percentage points worse than the best normalization. We suspect that the slight superiority over the currently popular E-value normalization is because the latter has an extra exponent that reduces the effect of the sequence length term. In any case, our simpler normalization procedure has the added advantage of being independent of database size (see Methods for more detail). The variation of sensitivity and specificity, when combining the effects of normalized scores with the blosum62 and blosum50 matrices, is shown in Figure 1bGo.

Although the choice of matrix has only a minimal effect on the results, it is interesting to note that in our tests, the scores were slightly higher when using the matrix that is not the default. More important, the capacity of gp-blast/bl50 to perform at essentially the same level of reliability as the sw/bl50 has important implications for large-scale sequencing projects, as gapped-blast is ~100 times faster than smith–waterman.

A threshold for safe database searching

From our test of 841 domains above, it was possible in principle to define a threshold for the normalized score that maximized sensitivity while retaining 100% specificity. However, we were aware that in real database searches there would be a wider variety of sequences with which to achieve elevated scores by chance.

Therefore, using both the fasta/bl62 and gp-blast/bl50 combinations together with our normalized scoring scheme, we compared each of the 841 domains against the Owl sequence database (containing over 200 000 entries) and stored a list of all the hits for each domain. The smith–waterman method was not assessed because of the large amount of computational time required for such a search, combined with the negligible improvement suggested by Figure 1Go.

Once the comparisons had been made, all unrelated pairs from the 841 protein test set had their Owl hits compared and the scores for any sequences appearing in both lists were recorded. For each comparison, the lower of the two scores linking two proteins via an Owl sequence was recorded (as this was the score at which the query and target sequence would be incorrectly linked by an intermediate). After performing this for all 349 778 unrelated pairs, we found that the highest score that would incorrectly link two unrelated proteins was 87 for fasta/blosum62 and 17 for gapped-blast/blosum50. Therefore, we set our thresholds to 88 and 18, respectively, on the assumption that scores above these thresholds would be rare.

Intermediate sequences improve sensitivity

With these thresholds set we once again compared the related pairs, first directly with one another (as in Figure 1Go) and then via an intermediate sequence in the manner described above, to see how many additional relationships could be identified. The latter is essentially the same as the test devised by Park et al. (1997), except that we make assessments using our own normalization scheme and have added the criterion that alignment overlap must be greater than 50% of the smallest protein. In Figure 2Go a schematic diagram of the intermediate search procedure is shown together with typical results for these comparisons. Figure 3aGo shows the variation in sensitivity for various types of search. As discussed previously, the actual percentages identified are related only to our dataset, but if we use the blastp run in Figure 1aGo as our reference, where all trivial relationships have been removed (i.e. what remains is a non-redundant dataset), the increase obtained with intermediate sequence searching is almost double that gained by using normalized scores from fasta or gapped-blast alone.



View larger version (48K):
[in this window]
[in a new window]
 
Fig. 2. (a) Schematic diagram showing the difference between the direct comparison of two protein sequences and the use of an intermediate sequence (ISS). (b) Examples of relationships that were missed using a conventional fasta search (normalized threshold is 88) but are found used intermediate sequences.

 
Sensitivity can vary considerably between families

However, it is more instructive to look at what is happening within each Homologous superfamily. Figure 3bGo shows the results for the seven largest superfamilies in our dataset. The globin superfamily has the largest number of sequences, reflecting its prevalence in PDB as well as its historical significance as the first protein structure to be determined. Clearly, intermediate sequences perform better than average for the globins. If they are discounted from the total, the percentages in Figure 3aGo decrease slightly, becoming 41.3 and 51.6%, respectively, for direct and intermediate searches using fasta. Families such as CD59 also benefit from intermediates, whereas others remain almost unaffected. (Note that in Figure 3bGo, both directions for pairwise comparisons are considered, as the comparison process is not always symmetrical when using intermediate sequences.)

Multiple intermediate sequence searches (MISS)

Given the overall benefits of intermediate searching, we considered it possible that some of the remaining relationships might be detected if more than one intermediate step were used. We refer to this implementation as sequence hopping and the number of intermediates as the number of hops. Using more than one intermediate is, in practice, problematic because at each step there may be a choice of many, possibly hundreds, of intermediates. Computationally it is unrealistic to use all of these as the next query and so some approximation is required.

Our approach was to take a relationship that escaped detection and assign one as the query and the other as the target. In the same way as before, the query and target were first compared against the OWL database and hits above the threshold collected. However, this time no sequence appears in both lists of intermediates. Therefore, we randomly selected an intermediate from the query's list that had a score above the threshold, yet was low enough to ensure that the selected sequence was considerably different from its predecessor (we did this with fasta using a normalized score between 88 and 400). Following this, the database search was repeated using the new sequence. In principle this process could be run continually until a target was found, but as the approach is time consuming we set an upper bound of 20 hops.

In order to explore the benefits of this approach, we first applied our method to the globin family. Surprisingly, we discovered that (unlike the previous searches) those involving multiple steps are crucially dependent on the search direction. In other words, for each relationship, the difference between success and failure could be critically dependent on which sequence of the pair was assigned as the query. In our case, nearly all of the missed globin relationships involved ascaris haemoglobin (1ash). When we assigned this as the query, all of the remaining relationships could be identified within five hops. However, when the search was reversed and 1ash became the target, there were almost no cases of successful identification.

Analysis revealed that this non-symmetricality results from 1ash being an outlier of the globin family. When 1ash is first compared with Owl it only identifies a small number of (about 10) globin sequences above the required threshold. As a result, each hop steadily draws the search towards the main body of globins. In contrast, other globins hit an enormous number of sequences but only a few of these are on a `path' that leads to 1ash. As the chance of finding this path is inversely proportional to the number of sequences and the number of hops required, we found that in practice the query sequence could not find 1ash as its target within a realistic number of hops.

It was clear, therefore, that a comprehensive test would require each missed relationship to be analysed twice, with the query and target sequences systematically reversed. There were a total of 1418 relationships missed by the standard ISS approach which translated into 2836 when considering both directions. We ran our searches on all families except the immunoglobulin family (which accounts for 930 of the 2836 pairs). Our reasons for this were first that the immunoglobulin family had already been rather unsuccessful in the previous tests, leaving an enormous number of time-consuming searches to be made. Second, there is an enormous number of immunoglobulin sequences, even in a non-redundant database, which substantially slows each search . Finally, because the immunoglobulin fold is a superfold (i.e. proteins can have similar structures but different functions), it is likely that many of these would only be detected by methods such as threading anyway.

In total we were able to detect 241 of the 1906 relationships analysed using this technique. This is 12% of all the remaining non-immunoglobulin relationships, which is a significant improvement, particularly when one considers that these relationships are becoming progressively harder to detect at each stage. The results are spread fairly evenly over the families. For the families shown in Figure 3bGo the extra relationships identified were: 40 globins, 42 jelly rolls, 19 CD59, 44 cytochromes, 15 plekstrins and 4 lipocalins. The other 77 are spread across the remaining families.

Conclusions

We have described a number of approaches for analysing and maximizing the efficacy of sequence database searches. We have assessed a number of normalization schemes using relationships derived from known structures. Although we found that normalized scores are superior to raw scores, the differences are small in comparison with the advances that can be made by incorporating sequence variations within each family.

In line with previous results for using a single intermediate (Park et al., 1997Go), we find a significant overall increase in sensitivity. However, in this work we emphasize the family-specific nature of these increases. Success relies on having many sequences with a large range of evolutionary distances and in practice this depends on the range of sequences available in public databases.

With the advent of genome sequencing, the situation should improve and in fact we are already seeing many families that were once poorly represented, benefiting considerably. As a result, it is becoming more and more likely that a first round of sequence searching will hit something rather than nothing and this power can be used to great effect. Even in the Casp2 prediction competition, which usually requires threading or fold recognition methods, a procedure with a single intermediate is able to detect the correct serine proteinase target.

Our developments with multiple intermediates can frequently detect similarities missed even by a single intermediate, making an additional contribution to increasing sensitivity. Once more, the method is just a simple application of a standard pairwise algorithm and requires neither the building of a profile nor the use of structural data.

One slight disadvantage of this approach is that there is asymmetry, as mentioned above. However, although not ideal for assessing our method, this asymmetry may not in practice be too serious; sequences that yield a large pool of hits from the first round of searching will probably immediately have sufficient structural and functional data available and therefore there is no critical need to find every relationship. In contrast, outliers will be quickly drawn into a larger pool of data, using only a moderate number of hops. Of course, for this to work there must be enough sequences available to link outliers to other family members, but given the number of sequencing projects now producing data this is becoming increasingly likely.

Although profile and fold recognition methods play an important part in sequence analysis, the procedures described here are attractive, not least because of the reproducibility and consistency of the thresholds applied. These approaches should be particularly helpful to non-specialists who wish to perform their own sensitive database searches.


    Acknowledgments
 
The excellent name `MISS' was suggested by one of the referees. It was also by far the quickest of the suggestions to implement.


    Notes
 
2 Present address: Sanger Centre, Wellcome Trust Genome Campus, Cambridge, UK Back

4 To whom correspondence should be addressed. Present address: Inpharmatica Ltd, 60 Charlotte Street, London W1P 2AX, UK Back


    References
 Top
 Abstract
 Introduction
 Methods
 Results
 References
 
Abagyan,R. and Batalov,S. (1997) J. Mol. Biol., 273, 355–368.[ISI][Medline]

Altschul,S.F. and Gish,W. (1996) Methods Enzymol., 266, 460–480.[ISI][Medline]

Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) J. Mol. Biol., 215, 403–410.[ISI][Medline]

Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.[Abstract/Free Full Text]

Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F.,Jr, Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 122, 535–542.

Bork,P., Gellerich,J., Groth,H., Hooft,R. and Martin,F. (1995) Protein Sci., 4, 268–274.[Abstract/Free Full Text]

Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure, Vol. 5. National Biomedical Research Foundation, Washington, DC, pp. 345–352.

Feng,D.F., Johnson,M.S. and Doolittle,R.F. (1985) J. Mol. Evol., 21, 112–125.[ISI]

Gonnet,G.H., Cohen,M.A. and Benner,S.A. (1992) Science, 256, 1443–1445.[ISI][Medline]

Gotoh,O. (1982) J. Mol. Biol., 162, 705–708.[ISI][Medline]

Gribskov,M. (1992) Gene, 119, 107–111.[ISI][Medline]

Henikoff,S. and Henikoff,J.G. (1992) Proc. Natl Acad. Sci. USA, 89, 10915–10919.[Abstract]

Holm,L. and Sander,C. (1997) Proteins, 28, 72–82.[ISI][Medline]

Johnson,M.S. and Overington,J.P. (1993) J. Mol. Biol., 233, 716–738.[ISI][Medline]

Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Comput. Appl. Biosci., 8, 275–282.[Abstract]

Karlin,S. and Altschul,S.F. (1990) Proc. Natl Acad. Sci. USA, 87, 2264–2268.[Abstract]

Koonin,E. and Tatsuov,R.L. (1994) J. Mol. Biol., 244, 125–132.[ISI][Medline]

McLachlan,A.D. (1971) J. Mol. Biol., 61, 409.[ISI][Medline]

Mott,R. (1991) Bull. Math. Biol., 54, 59–75.[ISI]

Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48, 443–453.[ISI][Medline]

Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.[ISI][Medline]

Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) J. Mol. Biol., 273, 349–354.[ISI][Medline]

Pearson,W.R. (1995) Protein Sci., 4, 1145–1160.[Abstract/Free Full Text]

Pearson,W.R. (1996) Methods Enzymol., 266, 227–258.[ISI][Medline]

Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444–2448.[Abstract]

Sander,C. and Schneider,R. (1991) Proteins, 9, 56–68.[ISI][Medline]

Smith,T.F. and Waterman,M.S. (1981) J. Mol. Biol., 147, 195–197.[ISI][Medline]

Sonnhammer,E.L.L., Eddy,S.R., Birhey,E., Bateman,A. and Durbin,R. (1998) Nucleic Acids Res., 26, 322–325.

Sphaer,E.G., Robinson,M., Yee,D., Candlin,J.D., Mines,R. and Hunkapiller,T. (1996) Genomics, 38, 179–191.[ISI][Medline]

Received May 19, 1998; revised November 9, 1998; accepted November 11, 1998.