Towards more meaningful hierarchical classification of amino acid scoring matrices

Alex C.W. May1

Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, London WC1E 6BT, UK


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Hierarchical classifications of the 20 amino acids according to residue relationships within scoring matrices have not hitherto been tested for reliability. In fact, testing here of the residue groupings obtained thus from 18 published matrices shows that they vary considerably in reliability. This behaviour gives a new insight then into the matrices with respect to the relationships between the amino acid scores contained therein. For example, other than the trivial grouping of the 20 amino acids, no reliable residue groupings are present in all 18 matrix amino acid hierarchical classifications. Hierarchical classification of the 18 scoring matrices themselves is investigated in terms of matrix representation and choice of similarity and dissimilarity measures for matrix comparison. There is no absolute standard against which to compare a matrix clustering, of course, but it is possible to assess the usefulness of a measure for the purpose in terms of the reliability of the calculated tree. Matrix representation is shown to be important. Finally, a novel two-step approach for hierarchical classification of the 18 amino acid scoring matrices is described.

Keywords: amino acid clustering/amino acid scoring matrix/hierarchical classification/tree diagram/validation


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Dynamic programming-based sequence comparison (for a review, see Waterman, 1995) and database search programs (for a review, see Altschul et al., 1994) both require weight matrices for nucleotide or amino acid scoring (for a review, see Henikoff, 1996). It is well known that the choice of scoring matrix can have a large effect on the results of sequence comparison and database scanning (e.g. Henikoff and Henikoff, 1993; Johnson and Overington, 1993; Pearson, 1995; Vogt et al., 1995).

Whatever its basis, a weight matrix is a 20x20 symmetric matrix with a score for each of the possible pairs of amino acids. The simplest scoring matrix is the 20x20 unit matrix with ones along the principal diagonal for matching identical amino acids and zeros everywhere else for non-identical pairs. Of course, treating each non-identical pair as equivalent, the 20x20 unit matrix does not incorporate information on amino acid properties and relationships.

There are several classifications of the 20 amino acids according to physico-chemical properties (e.g. Sneath, 1966; Grantham, 1974; Taylor, 1986; Livingstone and Barton, 1993; Mocz, 1995; Stanfel, 1996). In addition to residue physico-chemical properties, the classification of Taylor (1986) also includes the mutation data of Dayhoff et al. (1978). The Dayhoff approach uses observed amino acid substitutions in alignments of highly similar sequences to define probabilities for residue exchanges after a particular evolutionary time. Johnson and Overington (1993) extended the Dayhoff method to more distantly related proteins by tabulating amino acid replacements observed in alignments of protein tertiary structures. Johnson and Overington (1993) use hierarchical classification (for a review, see Gordon, 1996), in which the similarities among objects are shown as a hierarchy, to compare their 3D structure-based matrix and 12 other published ones. Each matrix was scaled between 0 and 100. First, the 20 amino acids are clustered according to the score relationships in each matrix to construct residue classifications. Second, the 20x20 matrices themselves are clustered to identify their relationships. Any classification of a set of objects into groups of similar objects requires an operational definition of similarity or dissimilarity. In both cases, Johnson and Overington (1993) use Euclidean distance, the most commonly used dissimilarity measure for cluster analysis (Everitt, 1993Go). Thus, two scoring matrices are compared in a 400-dimensional space where each of the 20x20 scores in a matrix is considered as a separate dimension.

Russell et al. (1997) again used tertiary structure alignments to construct substitution matrices but for analogues and homologues. Assignment of analogue/homologue status for proteins sharing a common fold is according to SCOP (Murzin et al., 1995Go). Analogues are defined as proteins with a common fold but usually different functions and little evidence of evolutionary relationship, while homologues are those structurally similar proteins with a common ancestor and function. Complete linkage hierarchical classification is also used to classify the 20 amino acids on the basis of score relations in the new matrices. However, the authors do not describe the (dis)similarity measure for the clustering. Although they calculate root-mean-square (r.m.s.) differences between the new matrices and three published ones, Russell et al. (1997) do not present a hierarchical classification of scoring matrices.

Recently, Agrafiotis (1997) used non-linear mapping to classify the 20 amino acids on the basis of score relationships in the scoring matrices studied by Johnson and Overington (1993). Wang et al. (1998) applied a self-organizing tree growing neural network for hierarchical classification of the data considered by Johnson and Overington (1993).

The hierarchical classifications of the 20 amino acids presented by Johnson and Overington (1993) and Russell et al. (1997) do not provide any assessment of the validity of a hierarchical representation. [Although the trees of Russell et al. (1997) do contain polytomies—`parent' nodes with more than two `children' nodes—to indicate uncertain branching orders, their use is not explicitly referred to in the text. Similarly, while the trees of Johnson and Overington (1993) are accompanied by sum of squares values, the fact that these can serve as a measure of the match between a tree and a dissimilarity matrix is not mentioned. Furthermore, other than reporting the values, the authors do not indicate how the reader might assess their significance.) How meaningful then are these classifications? The same question arises in application of hierarchical classification to the results of protein 3D structure comparison for fold classification (May, 1999Go).

One method to assess a hierarchical classification in terms of the reliability of individual branches in the optimal tree diagram or dendrogram is the bootstrap (Felsenstein, 1985Go), a non-parametric resampling approach. For instance, phylogenetic trees obtained from sequence information are almost always accompanied by statistics describing the support for a hierarchical classification and individual clusters contained within. Another non-parametric resampling method for picking out that subset of nodes in a tree that is strongly supported is the jackknife. Lanyon (1985) derives a strict consensus of n separate trees, each omitting a single, different object to define the reliable clusters in a tree. Detecting internal inconsistencies in distance data, the jackknife strict-consensus tree algorithm of Lanyon (1985) determines whether a tree is dependent on the inclusion or exclusion of particular objects. This method is used here to test the hierarchical cluster structure in 18 amino acid scoring matrices (Table IGo).


View this table:
[in this window]
[in a new window]
 
Table I. 18 amino acid scoring matrices
 
The cophenetic correlation coefficient (for reviews, see Romesburg, 1984; Everitt, 1993) is a measure of how well a tree represents in two dimensions the multi-dimensional relationships within input data. Although a value of 1.0 means that the concordance (as a linear relation) between the input data and the tree is perfect, in practice the relationship is unlikely to be totally linear. Romesburg (1984) suggests then that a cophenetic correlation of 0.8 or above indicates that the dendrogram does not greatly distort the original structure in the input data. However, the cophenetic correlation coefficient is not always a very reliable measure of the distortion due to a hierarchical model (for reviews, see Romesburg, 1984; Everitt, 1993).


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Amino acid scoring matrices

Scoring matrices were obtained from the WWW sites listed in Table IGo. Following Johnson and Overington (1993), the three matrices of Russell et al. (1997) were scaled between 0 and 100.

Hierarchical classification

The QCLUST program (written by J.Brzustowski; e-mail: jbrzusto{at}gpu.srv.ualberta.ca) is used. It is available via anonymous FTP from gause.biology.ualberta.ca in the directory /pub/jbrzusto/trees. The clustering method applied here is the unweighted pair-group method using arithmetic averages (UPGMA), the most widely used algorithm for obtaining a hierarchical classification (Romesburg, 1984Go). UPGMA is an agglomerative algorithm: it iteratively amalgamates smaller clusters that are most similar into larger ones until the number of clusters is one.


    Results
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Hierarchical classifications of the 20 amino acids

Each amino acid in a scoring matrix is described by 20 variables, the scores for exchange to any of the 20 amino acids. Dissimilarity between two amino acids is defined as a function of the vector between the two 20-dimensional vectors. The L2-norm of a vector, the familiar Euclidean length, is the norm used by Johnson and Overington (1993) (Table IIGo). Of the residue classifications obtained from the scoring matrices considered by Johnson and Overington (1993) and Russell et al. (1997), that of Miyata et al. (1979) yields the largest stable subset of the trees (Table IIGo): 16 out of the 19 internal nodes are reliable (Lanyon, 1985Go). Miyata et al. (1979) use two (polarity and molecular volume) of the three residue physico-chemical properties (atomic composition, polarity and molecular volume) described by Grantham (1974) to define an amino acid pair distance. Other than the root node encompassing all 20 residues (always reliable in a rooted tree), the largest cluster of the seven common reliable internal nodes between the classifications of Miyata et al. (1979) and Grantham (1974) is the group Ile, Leu, Met (data not shown). Of the six groups of amino acids obtained by Miyata et al. (1979) (their Table IIGo) after classification on the basis of their matrix of amino acid pair distances, only one, that of Ala, Gly, Pro, Ser, Thr, is found to be stable here. Unfortunately, Miyata et al. (1979) do not state how they produce their amino acid clustering.


View this table:
[in this window]
[in a new window]
 
Table II. Ranking of scoring matrices (Table IGo) by number of reliable internal nodesa after hierarchical classification (L2-norm) of the 20 amino acids
 
Commonly known as the Manhattan or city block metric (Everitt, 1993Go), the L1-norm of a vector is its length as the sum of the absolute values of its components. The L{infty}-norm of a vector is its length as the maximum absolute value of its components. Hierarchical classification of the 20 amino acids using the three different distance measures, L2-norm, L1-norm and L{infty}-norm, followed by construction of Lanyon (1985) jackknife strict consensus trees for each, allows us to ask how many clusters are maintained across the three consensus classifications (Table IIIGo). The idea is to test how dependent a hierarchical classification is on the choice of L2-norm as the distance measure. Although vector L2-norm, L1-norm and L{infty}-norm distance measures are obviously completely different in form, it might be the case that some of the amino acid vector relationships remain stable with variation in norm. Those that are maintained in this way might be described as `meaningful' or at least invariant under changes in vector norm. There is a statistically significant positive rank correlation at the 0.01 level of significance (one-tailed test) between the ordering of the scoring matrices according to the number of reliable internal nodes for the comparison L{infty}-norm vs L2-norm (data not shown). The rank correlation is not statistically significant at the 0.05 level for the comparisons L{infty}-norm vs L1-norm and L1-norm vs L2-norm (data not shown). Other than the mandatory grouping of all 20 residues, none of the residue relationships encapsulated within the analogues matrix of Russell et al. (1997) are stable across all three vector norms. Although Russell et al. (1997) do not specify the dissimilarity measure they use for amino acid classification, they do state that there is not enough data to classify four of the 20 amino acids with their substitution matrix derived from analogues.


View this table:
[in this window]
[in a new window]
 
Table III. Ranking of scoring matrices (Table IGo) by number of common reliable internal nodesa after hierarchical classification (L2-norm, L1-norm and L{infty}-norm) of the 20 amino acids
 
Overall, the L1-norm gives the highest number of reliable internal nodes for the 18 matrix amino acid classifications (218 of the maximum possible 684, i.e. 36x19)—the lowest is obtained with the L{infty}-norm (174) (L2-norm, 208). Using the operational definition of a distorted dendrogram suggested by Romesburg (1984) as one with a cophenetic correlation coefficient below 0.8, most of the amino acid hierarchical classifications here are not distorted. The L{infty}-norm distance measure gives the highest number of distorted trees: six of the 18 (L1-norm, 4; L2-norm, 3). Three of the matrices give distorted amino acid trees for all three norms: Fitch, Rao and A+RH. There are 82 different residue groupings in the 208 reliable internal nodes across the 18 matrix amino acid classifications obtained with the L2-norm (Table IVGo). Other than the trivial grouping of all 20 amino acids, no reliable internal nodes are present in all 18 matrix amino acid hierarchical classifications.


View this table:
[in this window]
[in a new window]
 
Table IV. Ranking of reliable internal nodes after hierarchical classification (L2-norm) of the 20 amino acids according to frequency of occurrence across the 18 scoring matrices (Table IGo)
 
Hierarchical classifications of the 18 amino acid scoring matrices

Matrix representation. Both Johnson and Overington (1993) and Russell et al. (1997) compare scoring matrices in their complete 20x20 symmetric form. However, it is not clear whether it is necessary to consider all 400 matrix elements since, by definition, a scoring matrix is symmetric with respect to its diagonal. This means that the upper triangle of 210 scores (190 pairs of different residues and 20 pairs of the same residues) contains all the unique information held within an amino acid scoring matrix. Obviously, the influence of the 20 scores for identical pairs will vary in a descriptive statistic calculated between all 400 values and that between 210 of the values.

A third possible matrix representation for comparing scoring matrices is that of the cophenetic matrix (Romesburg, 1984Go). A cophenetic matrix encodes the similarity relationships between a set of objects as defined by the output of a hierarchical classification, a tree. This means that it is possible to compare the scoring matrices at the level of the trees obtained after hierarchical classification (L2-norm) of the 20 amino acids (see above). The cophenetic matrix here contains all 400 elements each showing the level of similarity that relates each pair of amino acids in the tree.

Similarity and dissimilarity measures. The scoring matrices are clustered in each of the three matrix representations using three disimilarity measures and two similarity measures (Table VGo). Besides Euclidean distance and r.m.s. difference, the other dissimilarity measure here is the Penrose coefficient of shape difference (Romesburg, 1984Go). The latter is a function of r.m.s. difference [described as the average Euclidean distance coefficient in Romesburg (1984)]. The cosine coefficient, a similarity measure, is the familiar cosine of the angle between the directions of two vectors. According to this definition, maximum similarity, 1.0, is accorded to parallel vectors, i.e. the angle between their directions is 0 and so cos 0 = 1. The other similarity measure is the conventional correlation coefficient (Press et al., 1992Go).


View this table:
[in this window]
[in a new window]
 
Table V. Ranking of resemblance coefficients by number of reliable internal nodes after hierarchical classification by the unweighted pair-group method using arithmetic averages (UPGMA) of the 18 scoring matrices (Table IGo) in 20x20 symmetric representationa
 
Hierarchical classification according to the dissimilarity measure used by Johnson and Overington (1993)—Euclidean distance—gives the same eight reliable internal nodes for both the 20x20 symmetric matrix and upper triangle representations. Similarly, r.m.s. difference, used by Russell et al. (1997), gives the same answer with the two scoring matrix representations. In fact, the sets of reliable internal nodes obtained with these two dissimilarity measures are identical. Clearly, consideration of the entire scoring matrix imparts no extra information for these two dissimilarity measures. However, in the case of the Penrose coefficient of shape difference, the number of reliable internal nodes is affected by the scoring matrix representation: the 20x20 symmetric representation affords two such additional groupings. Both Euclidean distance and r.m.s. difference are sensitive to two types of size displacements between data profiles: additive and proportional translations. While sensitive to the latter, the Penrose coefficient of shape difference ignores the former. Hence size displacements between data profiles by means of additive translations will not contribute to dissimilarity here. The correlation coefficient, a similarity measure, ignores both additive and proportional translations of data profiles. The half scoring matrix representation gives an extra reliable internal node than the full matrix. For the symmetric scoring matrices there are six common reliable internal nodes between the eight with Euclidean distance and the eight with correlation coefficient. That these two resemblance coefficients represent the two extremes of sensitivity to additive and proprortional translations of data profiles (Romesburg, 1984Go) suggests that these six groupings are particularly stable. The six common reliable internal nodes are the root node, (A+RH RH), (Birkbeck65 STRMAT110), (Dayhoff PAM120), (Doolittle Fitch) and (Grantham Miyata). Apart from the root node, each of the six common clusters comprises matrices derived in the same way (Table IGo).

The cophenetic matrix hierarchical classifications demonstrate that this matrix representation is different from the raw scoring matrices (Table VGo).

Using the operational definition of a distorted dendrogram suggested by Romesburg (1984) as one with a cophenetic correlation coefficient below 0.8, one of the trees in Table VGo is distorted: that of the half scoring matrices with the cosine coefficient (cophenetic correlation: 0.76).

Two-step hierarchical classifications of the 18 amino acid scoring matrices. An alternative approach to use of cophenetic matrices for clustering of the scoring matrices at the level of the tree is to compare the 19 internal nodes obtained in each of the 18 matrix amino acid hierarchical classifications (L2-norm). A pair of scoring matrix residue hierarchical classifications (L2-norm) can be assessed for resemblance in terms of the number of common reliable internal nodes, the number of common unreliable internal nodes and the number of common internal nodes across all 19. A resemblance matrix for the 18 scoring matrices, constructed on the basis of each of these three similarity measures, can itself in turn be input to hierarchical classification. Clustering the matrices in terms of the number of common reliable internal nodes between the residue hierarchical classifications gives eight out of 17 reliable internal nodes (cophenetic correlation: 0.90). Classification of the matrices with respect to the number of common unreliable internal nodes between the residue hierarchical classifications yields five out of 17 reliable internal nodes (cophenetic correlation: 0.81). Other than the obligatory root node, there are no common reliable internal nodes between these two trees. Hierarchical classification of the matrices using the number of common internal nodes across all 19 as the similarity measure produces 10 out of 17 reliable internal nodes (cophenetic correlation: 0.91) (Figure 1Go). Again, the 10 reliable clusters include (Birkbeck65 STRMAT110) and (Dayhoff PAM120). However, the Miyata matrix now forms a reliable group with Doolittle instead of Grantham. Since the cophenetic matrix contains all the information represented in a tree it might be expected that the hierarchical classifications of the scoring matrices at this level would agree with the two-step ones obtained by comparing all internal nodes. Surprisingly, two of the five cophenetic matrix hierarchical classifications—Euclidean distance and r.m.s. difference—shared only one reliable internal node, the root, with the two-step clustering in Figure 1Go. The highest number of common reliable internal nodes between the third two-step hierarchical classification and the cophenetic matrix ones was three—these are Penrose coefficient of shape difference and correlation coefficient. The only non-trivial reliable internal node present in both the Penrose coefficient of shape difference and correlation coefficient cophenetic matrix hierarchical classifications and that in Figure 1Go is the grouping of the two Dayhoff matrices with that of Jones.



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 1. Two-step hierarchical classification of 18 amino acid scoring matrices (Table IGo). First, a hierarchical classification (L2-norm) of the 20 amino acids is obtained from each matrix by the unweighted pair-group method using arithmetic averages (UPGMA). Second, the 19 residue clusters specified for each matrix are then compared and a similarity measure for a pair of matrices is defined in terms of the number of common internal nodes across all 19. A UPGMA hierarchical classification of the 18 matrices is performed on the basis of this similarity measure. Unreliable internal nodes, i.e. those not retained in the jackknife strict-consensus tree (Lanyon, 1985Go), are indicated by a letter `u'. The cophenetic correlation is 0.91. Horizontal spacing between clusters in the tree diagram is proportional to the mean dissimilarity between clusters. Tree diagram plotted with the DRAWGRAM program, part of the PHYLIP package (Felsenstein, 1989Go, 1993Go).

 
The hierarchical classifications described in this paper are available via the WWW at URL http://mathbio.nimr.mrc.ac.uk/.


    Discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
In addition to hierarchical classification, Johnson and Overington (1993) also use principal component analysis (PCA) to define the main relationships between the amino acids according to a scoring matrix and those between the matrices themselves. As the authors state, an advantage of PCA is that it does not impose a tree representation on the data. Furthermore, the results of a hierarchical classification do not have to be expressed as a binary tree: for instance, Tomii and Kanehisa (1996) use a minimum spanning tree to summarize their hierarchical classification of 42 scoring matrices. The purpose of this paper, however, is to investigate the properties of hierarchical classifications of amino acid scoring matrices described by binary trees.

The tree diagram that results from a hierarchical classification resembles a phylogenetic tree summarizing the evolutionary history or phylogeny of a group of species. Classification by whatever means of biological objects related by evolution from a common ancestor can always be compared with their phylogenetic classification. Here, of course, no such comparison is possible.

Previously, hierarchical classification has been used to characterize relationships between the 20 amino acids in scoring matrices (Johnson and Overington, 1993Go; Russell et al., 1997Go). Testing here of these trees shows that they vary considerably in reliability, however (Tables II–IVGoGoGo). Other than the trivial grouping of the 20 amino acids, no reliable residue groupings are present in all 18 matrix amino acid hierarchical classifications.

Hitherto the choice of (dis)similarity measure for hierarchical classification of amino acid scoring matrices has not been investigated (Johnson and Overington, 1993Go; Russell et al., 1997Go). The dissimilarity measures used by Johnson and Overington (1993), Euclidean distance, and Russell et al. (1997), r.m.s. difference, are both sensitive to size displacements between data profiles (Romesburg, 1984Go). This means that any similarity in the shapes of the matrix data profiles is ignored. Johnson and Overington (1993) address this problem somewhat by standardizing the matrices so that the scores in each range between 0 and 100. The new matrices of Russell et al. (1997) have different ranges; apparently, this is not taken into account in their matrix comparison. Here the three matrices of Russell et al. (1997) were accordingly scaled between 0 and 100.

Regardless of the operational definition of (dis)similarity used to cluster the matrices, we can assess the usefulness of a measure for the purpose in terms of the reliability of the calculated tree. Thus, a measure that affords a more stable tree is more suited to describing the relationships between objects as a tree than one that does not. According to this criterion, the Penrose coefficient of shape difference, a dissimilarity measure, gives the largest stable subset of trees (10 out of 17 reliable internal nodes) for hierarchical classification of the 18 scoring matrices for the 20x20 symmetric representation. The highest number of reliable internal nodes for the upper triangle trees, nine out of 17, is obtained with the cosine coefficient and correlation coefficient, both similarity measures. The cophenetic matrix (L2-norm) matrix representation gives the highest number of reliable internal nodes (11 out of 17) after hierarchical classification with the dissimilarity measures Euclidean distance and r.m.s. difference.


    Acknowledgments
 
I am grateful to Scott Lanyon for answering several questions via e-mail. I thank the UK Biotechnology and Biological Sciences Research Council for financial support.


    Notes
 
1 Present address: Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 lAA, UK. E-mail: amay{at}nimr.mrc.ac.uk Back


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Agrafiotis,D.K. (1997). Protein Sci., 6, 287–293.[Abstract/Free Full Text]

Altschul,S.F., Boguski,M.S., Gish,W. and Wooton,J.C. (1994). Nature Genet., 6, 119–129.[ISI][Medline]

Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978). In Dayhoff,M.O. (ed.) Atlas of Protein Sequence and Structure, vol. 5, suppl. 3. National Biomedical Research Foundation, Washington, DC, pp. 345–358.

Doolittle,R.F. (1979). In Neurath,H. and Hill,R.L. (eds.) The Proteins, vol. 4. Academic Press, New York, pp. 1–118.

Everitt,B. (1993). Cluster Analysis, 3rd edn. Edward Arnold, London.

Felsenstein,J. (1985). Evolution, 39, 783–791.[ISI]

Felsenstein,J. (1989). Cladistics, 5, 164–166.

Felsenstein,J. (1993). PHYLIP (Phylogeny Inference Package), version 3.5c. Distributed by the author, Department of Genetics, University of Washington, Seattle, WA.

Feng,D., Johnson,M.S. and Doolittle,R.F. (1985). J. Mol. Evolut., 21, 112–125.[ISI]

Fitch,W.M. (1966). J. Mol. Biol., 16, 9–16.[ISI][Medline]

Fitch,W.M. and Margoliash,E. (1967). Science, 15, 299–304.

Gonnet,G.H., Cohen,M.A. and Benner,S.A. (1992). Science, 256, 1443–1445.[ISI][Medline]

Gordon,A.D. (1996). In Arabie,P., Hubert,L.J. and De Soete,G. (eds), Clustering and Classification. World Scientific, Singapore, pp. 65–121.

Grantham,R. (1974). Science, 185, 862–864.[ISI][Medline]

Henikoff,S. (1996). Curr. Opin. Struct. Biol., 6, 353–360.[ISI][Medline]

Henikoff,S. and Henikoff,J.G. (1992). Proc. Natl Acad. Sci. USA, 89, 10915–10919.[Abstract]

Henikoff,S. and Henikoff,J.G. (1993). Proteins, 17, 49–61.[ISI][Medline]

Johnson,M.S. and Overington,J.P. (1993). J. Mol. Biol., 233, 716–738.[ISI][Medline]

Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992). CABIOS, 8, 275–282.[Abstract]

Lanyon,S.M. (1985). Syst. Zool., 34, 397–403.[ISI]

Levin,J.M., Robson,B. and Garnier,J. (1986). FEBS Lett., 205, 303–308.[ISI][Medline]

Livingstone,C.D. and Barton,G.J. (1993). CABIOS, 9, 745–756.[Abstract]

May,A.C.W. (1999). Proteins, in press.

McLachlan,A.D. (1971). J. Mol. Biol., 61, 409–424.[ISI][Medline]

Miyata,T., Miyazawa,S. and Yasunaga,T. (1979). J. Mol. Evolut., 12, 219–236.[ISI][Medline]

Mocz,G. (1995). Protein Sci., 4, 1178–1187.[Abstract/Free Full Text]

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995). J. Mol. Biol., 247, 536–540.[ISI][Medline]

Pearson,W.R. (1995). Protein Sci., 4, 1145–1160.[Abstract/Free Full Text]

Press,W.H., Teukolsky,S.A., Vetterling,W.T. and Flannery,B.P. (1992). Numerical Recipes in C: The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge.

Rao,J.K.M. (1987). Int. J. Pept. Protein Res., 29, 276–281.[ISI][Medline]

Risler,J.L., Delorme,M.O., Delacroix,H. and Henaut,A. (1988). J. Mol. Biol., 204, 1019–1029.[ISI][Medline]

Romesburg,H.C. (1984). Cluster Analysis for Researchers. Lifetime Learning Publications, Belmont, CA.

Russell,R.B., Saqi,M.A.S., Sayle,R.A., Bates,P.A. and Sternberg,M.J.E. (1997). J. Mol. Biol., 269, 423–439.[ISI][Medline]

Sneath,P.H.A. (1966). J. Theor. Biol., 12, 157–195.[ISI][Medline]

Stanfel,L.E. (1996). J. Theor. Biol., 183, 195–205.[ISI][Medline]

Taylor,W.R. (1986). J. Theor. Biol., 119, 205–218.[ISI][Medline]

Tomii,K. and Kanehisa,M. (1996). Protein Engng, 9, 27–36.[Abstract]

Vogt,G., Etzold,T. and Argos,P. (1995). J. Mol. Biol., 249, 816–831.[ISI][Medline]

Wang,H., Dopazo,J. and Carazo,J.M. (1998). Bioinformatics, 14, 376–377.[Abstract]

Waterman,M.S. (ed.) (1995). Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman and Hall, London.

Received October 27, 1998; revised May 27, 1999; accepted June 9, 1999.





This Article
Abstract
FREE Full Text (PDF)
Alert me when this article is cited
Alert me if a correction is posted
Services
Email this article to a friend
Similar articles in this journal
Similar articles in ISI Web of Science
Similar articles in PubMed
Alert me to new issues of the journal
Add to My Personal Archive
Download to citation manager
Search for citing articles in:
ISI Web of Science (5)
Request Permissions
Google Scholar
Articles by May, A. C.W.
PubMed
PubMed Citation
Articles by May, A. C.W.