Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, London WC1E 6BT, UK
Abstract
Keywords: amino acid clustering/amino acid scoring matrix/hierarchical classification/tree diagram/validation
Introduction
Whatever its basis, a weight matrix is a 20×20 symmetric matrix with a score for each of the possible pairs of amino acids. The simplest scoring matrix is the 20×20 unit matrix, with ones along the principal diagonal for matching identical amino acids and zeros everywhere else for non-identical pairs. Because it treats every non-identical pair as equivalent, however, the unit matrix incorporates no information on amino acid properties and relationships.
There are several classifications of the 20 amino acids according to physico-chemical properties (e.g. Sneath, 1966; Grantham, 1974; Taylor, 1986; Livingstone and Barton, 1993; Mocz, 1995; Stanfel, 1996). In addition to residue physico-chemical properties, the classification of Taylor (1986) also includes the mutation data of Dayhoff et al. (1978). The Dayhoff approach uses observed amino acid substitutions in alignments of highly similar sequences to define probabilities for residue exchanges after a particular evolutionary time. Johnson and Overington (1993) extended the Dayhoff method to more distantly related proteins by tabulating amino acid replacements observed in alignments of protein tertiary structures. Johnson and Overington (1993) use hierarchical classification (for a review, see Gordon, 1996), in which the similarities among objects are shown as a hierarchy, to compare their 3D structure-based matrix and 12 other published ones. Each matrix was scaled between 0 and 100. First, the 20 amino acids are clustered according to the score relationships in each matrix to construct residue classifications. Second, the 20×20 matrices themselves are clustered to identify their relationships. Any classification of a set of objects into groups of similar objects requires an operational definition of similarity or dissimilarity. In both cases, Johnson and Overington (1993) use Euclidean distance, the most commonly used dissimilarity measure for cluster analysis (Everitt, 1993). Thus, two scoring matrices are compared in a 400-dimensional space where each of the 20×20 scores in a matrix is considered as a separate dimension.
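The 400-dimensional comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function names are our own, and the 0-100 rescaling follows the scaling step described in the text.

```python
import numpy as np

def scale_0_100(S):
    """Linearly rescale a scoring matrix so its scores span 0 to 100,
    as in the scaling step of Johnson and Overington (1993)."""
    S = np.asarray(S, dtype=float)
    return 100.0 * (S - S.min()) / (S.max() - S.min())

def matrix_distance(A, B):
    """Euclidean distance between two scaled 20x20 scoring matrices
    treated as 400-dimensional vectors, one dimension per score."""
    return float(np.linalg.norm(scale_0_100(A).ravel() - scale_0_100(B).ravel()))
```

Two matrices that differ only by an overall scale or offset become identical after rescaling and so have distance zero.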
Russell et al. (1997) again used tertiary structure alignments to construct substitution matrices but for analogues and homologues. Assignment of analogue/homologue status for proteins sharing a common fold is according to SCOP (Murzin et al., 1995). Analogues are defined as proteins with a common fold but usually different functions and little evidence of evolutionary relationship, while homologues are those structurally similar proteins with a common ancestor and function. Complete linkage hierarchical classification is also used to classify the 20 amino acids on the basis of score relations in the new matrices. However, the authors do not describe the (dis)similarity measure for the clustering. Although they calculate root-mean-square (r.m.s.) differences between the new matrices and three published ones, Russell et al. (1997) do not present a hierarchical classification of scoring matrices.
Recently, Agrafiotis (1997) used non-linear mapping to classify the 20 amino acids on the basis of score relationships in the scoring matrices studied by Johnson and Overington (1993). Wang et al. (1998) applied a self-organizing tree growing neural network for hierarchical classification of the data considered by Johnson and Overington (1993).
The hierarchical classifications of the 20 amino acids presented by Johnson and Overington (1993) and Russell et al. (1997) do not provide any assessment of the validity of a hierarchical representation. (Although the trees of Russell et al. (1997) do contain polytomies, `parent' nodes with more than two `children' nodes, to indicate uncertain branching orders, their use is not explicitly referred to in the text. Similarly, while the trees of Johnson and Overington (1993) are accompanied by sum of squares values, the fact that these can serve as a measure of the match between a tree and a dissimilarity matrix is not mentioned. Furthermore, other than reporting the values, the authors do not indicate how the reader might assess their significance.) How meaningful, then, are these classifications? The same question arises in the application of hierarchical classification to the results of protein 3D structure comparison for fold classification (May, 1999).
One method to assess a hierarchical classification in terms of the reliability of individual branches in the optimal tree diagram, or dendrogram, is the bootstrap (Felsenstein, 1985), a non-parametric resampling approach. For instance, phylogenetic trees obtained from sequence information are almost always accompanied by statistics describing the support for a hierarchical classification and the individual clusters contained within it. Another non-parametric resampling method for picking out the subset of nodes in a tree that is strongly supported is the jackknife. Lanyon (1985) derives a strict consensus of n separate trees, each omitting a single, different object, to define the reliable clusters in a tree. By detecting internal inconsistencies in distance data, the jackknife strict-consensus tree algorithm of Lanyon (1985) determines whether a tree is dependent on the inclusion or exclusion of particular objects. This method is used here to test the hierarchical cluster structure in 18 amino acid scoring matrices (Table I).
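Lanyon's jackknife idea can be sketched as follows. This is an illustrative reimplementation using SciPy's average-linkage (UPGMA) clustering, not the QCLUST program actually used in this study; the function names and the rule for trivial reduced clusters are our own.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def tree_clusters(Z, labels):
    """Clusters (frozensets of leaf labels) created by each merge in a
    SciPy linkage matrix, i.e. the internal nodes of the tree."""
    n = len(labels)
    groups = {i: frozenset([labels[i]]) for i in range(n)}
    clusters = set()
    for k, row in enumerate(Z):
        merged = groups[int(row[0])] | groups[int(row[1])]
        groups[n + k] = merged
        clusters.add(merged)
    return clusters

def jackknife_reliable(D, labels):
    """Lanyon-style jackknife: keep a cluster of the full UPGMA tree
    only if, for every single-object deletion, the correspondingly
    reduced cluster reappears in the tree built without that object."""
    D = np.asarray(D, dtype=float)
    full = tree_clusters(linkage(squareform(D), method='average'), labels)
    reliable = set(full)
    for x, dropped in enumerate(labels):
        keep = [i for i in range(len(labels)) if i != x]
        sub = tree_clusters(
            linkage(squareform(D[np.ix_(keep, keep)]), method='average'),
            [labels[i] for i in keep])
        reliable = {c for c in reliable
                    if len(c - {dropped}) <= 1 or (c - {dropped}) in sub}
    return reliable
```

For well-separated data every internal node survives all n deletions; internally inconsistent distance data lose nodes whenever removing one object reshuffles the branching order.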
Materials and methods
Scoring matrices were obtained from the WWW sites listed in Table I. Following Johnson and Overington (1993), the three matrices of Russell et al. (1997) were scaled between 0 and 100.
Hierarchical classification
The QCLUST program (written by J.Brzustowski; e-mail: jbrzusto@gpu.srv.ualberta.ca) is used. It is available via anonymous FTP from gause.biology.ualberta.ca in the directory /pub/jbrzusto/trees. The clustering method applied here is the unweighted pair-group method using arithmetic averages (UPGMA), the most widely used algorithm for obtaining a hierarchical classification (Romesburg, 1984). UPGMA is an agglomerative algorithm: it iteratively amalgamates the most similar smaller clusters into larger ones until only one cluster remains.
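As a sketch of what UPGMA does, here performed with SciPy's 'average' linkage method (which implements UPGMA) rather than QCLUST, on an invented toy dissimilarity matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Toy dissimilarity matrix for four objects (invented numbers). UPGMA
# repeatedly amalgamates the pair of clusters with the smallest average
# inter-cluster dissimilarity until a single cluster remains.
D = np.array([[0.,  2.,  6., 10.],
              [2.,  0.,  5.,  9.],
              [6.,  5.,  0.,  4.],
              [10., 9.,  4.,  0.]])
Z = linkage(squareform(D), method='average')
# Each row of Z records one merge: the two cluster indices, the average
# dissimilarity at which they fuse, and the size of the new cluster.
```

Here objects 0 and 1 fuse first (dissimilarity 2), then 2 and 3 (4), and finally the two pairs join at the average of the four cross-pair dissimilarities, (6 + 10 + 5 + 9)/4 = 7.5.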
Results
Each amino acid in a scoring matrix is described by 20 variables, the scores for exchange to any of the 20 amino acids. Dissimilarity between two amino acids is defined as a function of the vector between the two 20-dimensional vectors. The L2-norm of a vector, the familiar Euclidean length, is the norm used by Johnson and Overington (1993) (Table II). Of the residue classifications obtained from the scoring matrices considered by Johnson and Overington (1993) and Russell et al. (1997), that of Miyata et al. (1979) yields the largest stable subset of the trees (Table II): 16 out of the 19 internal nodes are reliable (Lanyon, 1985). Miyata et al. (1979) use two (polarity and molecular volume) of the three residue physico-chemical properties (atomic composition, polarity and molecular volume) described by Grantham (1974) to define an amino acid pair distance. Other than the root node encompassing all 20 residues (always reliable in a rooted tree), the largest cluster among the seven reliable internal nodes common to the classifications of Miyata et al. (1979) and Grantham (1974) is the group Ile, Leu, Met (data not shown). Of the six groups of amino acids obtained by Miyata et al. (1979) (their Table II) after classification on the basis of their matrix of amino acid pair distances, only one, that of Ala, Gly, Pro, Ser, Thr, is found to be stable here. Unfortunately, Miyata et al. (1979) do not state how they produce their amino acid clustering.
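The per-residue dissimilarity described above, the L2-norm of the difference between two rows of a scoring matrix, can be sketched as follows (the function name is our own; any square score matrix works):

```python
import numpy as np

def residue_dissimilarity(S):
    """Treat each amino acid as the 20-vector of its exchange scores
    (one row of the scoring matrix); the dissimilarity between residues
    i and j is the L2-norm (Euclidean length) of the difference of
    their row vectors."""
    S = np.asarray(S, dtype=float)
    diff = S[:, None, :] - S[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

The resulting 20×20 dissimilarity matrix is what is fed to UPGMA to obtain each residue classification.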
Matrix representation.
Both Johnson and Overington (1993) and Russell et al. (1997) compare scoring matrices in their complete 20×20 symmetric form. However, it is not clear whether it is necessary to consider all 400 matrix elements since, by definition, a scoring matrix is symmetric with respect to its diagonal. This means that the upper triangle of 210 scores (190 pairs of different residues and 20 pairs of identical residues) contains all the unique information held within an amino acid scoring matrix. Obviously, the influence of the 20 scores for identical pairs will differ between a descriptive statistic calculated over all 400 values and one calculated over 210 of them.
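Extracting those 210 unique scores can be sketched with NumPy's upper-triangle indices (the helper name is our own):

```python
import numpy as np

def unique_scores(S):
    """Upper triangle, including the diagonal, of a symmetric scoring
    matrix: for a 20x20 matrix this is 190 scores for distinct residue
    pairs plus 20 identical-pair scores, 210 values in all."""
    S = np.asarray(S)
    return S[np.triu_indices(S.shape[0])]
```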
A third possible matrix representation for comparing scoring matrices is the cophenetic matrix (Romesburg, 1984). A cophenetic matrix encodes the similarity relationships between a set of objects as defined by the output of a hierarchical classification, a tree. This means that it is possible to compare the scoring matrices at the level of the trees obtained after hierarchical classification (L2-norm) of the 20 amino acids (see above). The cophenetic matrix here contains all 400 elements, each giving the level of similarity at which a pair of amino acids is related in the tree.
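A small illustration of a cophenetic matrix, using SciPy on invented toy data: `cophenet` returns, for each pair of objects, the height in the tree at which they first fall into the same cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# Toy dissimilarities for three objects (invented numbers).
D = np.array([[0., 2., 8.],
              [2., 0., 8.],
              [8., 8., 0.]])
Z = linkage(squareform(D), method='average')
# The cophenetic matrix: tree-level distances for every pair of objects.
C = squareform(cophenet(Z))
```

Objects 0 and 1 join at height 2, and both meet object 2 only at the root (height 8), so C records 2 for the first pair and 8 for the other two.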
Similarity and dissimilarity measures.
The scoring matrices are clustered in each of the three matrix representations using three dissimilarity measures and two similarity measures (Table V). Besides Euclidean distance and r.m.s. difference, the other dissimilarity measure here is the Penrose coefficient of shape difference (Romesburg, 1984). The latter is a function of the r.m.s. difference [described as the average Euclidean distance coefficient in Romesburg (1984)]. The cosine coefficient, a similarity measure, is the familiar cosine of the angle between the directions of two vectors. According to this definition, maximum similarity, 1.0, is accorded to parallel vectors, i.e. the angle between their directions is 0 and so cos 0 = 1. The other similarity measure is the conventional correlation coefficient (Press et al., 1992).
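Sketches of these measures are given below. The function names are our own, and the Penrose coefficient is given in one common form (mean squared difference minus squared mean difference, so a constant 'size' offset between profiles scores zero); Romesburg's exact normalization may differ.

```python
import numpy as np

def rms_difference(a, b):
    """Root-mean-square difference between two score profiles."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def penrose_shape(a, b):
    """Penrose-style coefficient of shape difference: mean squared
    difference minus squared mean difference, so a constant ('size')
    displacement between profiles scores zero.  The exact normalization
    in Romesburg (1984) may differ; this form is a sketch."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(np.mean(d ** 2) - np.mean(d) ** 2)

def cosine_coefficient(a, b):
    """Cosine of the angle between the two profile vectors; parallel
    vectors give the maximum similarity of 1.0 (cos 0 = 1)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def correlation_coefficient(a, b):
    """Conventional Pearson correlation coefficient."""
    return float(np.corrcoef(np.asarray(a, float), np.asarray(b, float))[0, 1])
```

Note how the measures differ in what they ignore: profiles offset by a constant have non-zero r.m.s. difference but zero Penrose shape difference and perfect correlation.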
The cophenetic matrix hierarchical classifications demonstrate that this matrix representation is different from the raw scoring matrices (Table V).
Using the operational definition of a distorted dendrogram suggested by Romesburg (1984) as one with a cophenetic correlation coefficient below 0.8, one of the trees in Table V is distorted: that of the half scoring matrices with the cosine coefficient (cophenetic correlation: 0.76).
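The cophenetic correlation coefficient used for this distortion criterion can be sketched as follows (an illustrative helper using SciPy's UPGMA linkage, not the code used in the study):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def cophenetic_correlation(D):
    """Correlation between the input dissimilarities and the cophenetic
    distances of the UPGMA tree built from them; following Romesburg
    (1984), a value below 0.8 flags a distorted dendrogram."""
    d = squareform(np.asarray(D, dtype=float))
    Z = linkage(d, method='average')
    c, _ = cophenet(Z, d)
    return float(c)
```

An ultrametric input, one that a tree can represent exactly, gives a cophenetic correlation of 1.0.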
Two-step hierarchical classifications of the 18 amino acid scoring matrices.
An alternative to using cophenetic matrices for clustering the scoring matrices at the level of the tree is to compare the 19 internal nodes obtained in each of the 18 matrix amino acid hierarchical classifications (L2-norm). A pair of scoring matrix residue hierarchical classifications (L2-norm) can be assessed for resemblance in terms of the number of common reliable internal nodes, the number of common unreliable internal nodes and the number of common internal nodes across all 19. A resemblance matrix for the 18 scoring matrices, constructed on the basis of each of these three similarity measures, can itself in turn be input to hierarchical classification. Clustering the matrices in terms of the number of common reliable internal nodes between the residue hierarchical classifications gives eight out of 17 reliable internal nodes (cophenetic correlation: 0.90). Classification of the matrices with respect to the number of common unreliable internal nodes between the residue hierarchical classifications yields five out of 17 reliable internal nodes (cophenetic correlation: 0.81). Other than the obligatory root node, there are no common reliable internal nodes between these two trees. Hierarchical classification of the matrices using the number of common internal nodes across all 19 as the similarity measure produces 10 out of 17 reliable internal nodes (cophenetic correlation: 0.91) (Figure 1). Again, the 10 reliable clusters include (Birkbeck65, STRMAT110) and (Dayhoff, PAM120). However, the Miyata matrix now forms a reliable group with Doolittle instead of Grantham. Since the cophenetic matrix contains all the information represented in a tree, it might be expected that the hierarchical classifications of the scoring matrices at this level would agree with the two-step ones obtained by comparing all internal nodes. Surprisingly, two of the five cophenetic matrix hierarchical classifications (Euclidean distance and r.m.s. difference) shared only one reliable internal node, the root, with the two-step clustering in Figure 1. The highest number of common reliable internal nodes between the third two-step hierarchical classification and the cophenetic matrix ones was three, for the Penrose coefficient of shape difference and the correlation coefficient. The only non-trivial reliable internal node present in both of these cophenetic matrix hierarchical classifications and in Figure 1 is the grouping of the two Dayhoff matrices with that of Jones.
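Counting the internal nodes shared by two trees, the resemblance score underlying the two-step classifications above, can be sketched as follows (illustrative helpers using SciPy linkage matrices; the names are our own):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def internal_nodes(Z, labels):
    """Clusters (frozensets of leaf labels) created by each merge in a
    SciPy linkage matrix: the internal nodes of the tree."""
    n = len(labels)
    groups = {i: frozenset([labels[i]]) for i in range(n)}
    nodes = set()
    for k, row in enumerate(Z):
        merged = groups[int(row[0])] | groups[int(row[1])]
        groups[n + k] = merged
        nodes.add(merged)
    return nodes

def common_nodes(Z1, Z2, labels):
    """Number of internal nodes shared by two trees over the same
    leaves: a simple resemblance score between two hierarchical
    classifications."""
    return len(internal_nodes(Z1, labels) & internal_nodes(Z2, labels))
```

A tree on n leaves has n - 1 internal nodes (including the root), so two identical trees on 20 amino acids share all 19.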
Discussion
The tree diagram that results from a hierarchical classification resembles a phylogenetic tree summarizing the evolutionary history or phylogeny of a group of species. Classification by whatever means of biological objects related by evolution from a common ancestor can always be compared with their phylogenetic classification. Here, of course, no such comparison is possible.
Previously, hierarchical classification has been used to characterize relationships between the 20 amino acids in scoring matrices (Johnson and Overington, 1993; Russell et al., 1997). Testing of these trees here shows, however, that they vary considerably in reliability (Tables II–IV). Other than the trivial grouping of all 20 amino acids, no reliable residue groupings are present in all 18 matrix amino acid hierarchical classifications.
Hitherto the choice of (dis)similarity measure for hierarchical classification of amino acid scoring matrices has not been investigated (Johnson and Overington, 1993; Russell et al., 1997). The dissimilarity measures used by Johnson and Overington (1993), Euclidean distance, and by Russell et al. (1997), r.m.s. difference, are both sensitive to size displacements between data profiles (Romesburg, 1984). This means that any similarity in the shapes of the matrix data profiles is ignored. Johnson and Overington (1993) address this problem somewhat by standardizing the matrices so that the scores in each range between 0 and 100. The new matrices of Russell et al. (1997) have different ranges; apparently, this is not taken into account in their matrix comparison. Here, the three matrices of Russell et al. (1997) were accordingly scaled between 0 and 100.
Regardless of the operational definition of (dis)similarity used to cluster the matrices, we can assess the usefulness of a measure for the purpose in terms of the reliability of the calculated tree. Thus, a measure that affords a more stable tree is more suited to describing the relationships between objects as a tree than one that does not. According to this criterion, the Penrose coefficient of shape difference, a dissimilarity measure, gives the largest stable subset of trees (10 out of 17 reliable internal nodes) for hierarchical classification of the 18 scoring matrices for the 20×20 symmetric representation. The highest number of reliable internal nodes for the upper triangle trees, nine out of 17, is obtained with the cosine coefficient and correlation coefficient, both similarity measures. The cophenetic matrix (L2-norm) matrix representation gives the highest number of reliable internal nodes (11 out of 17) after hierarchical classification with the dissimilarity measures Euclidean distance and r.m.s. difference.
References
Altschul,S.F., Boguski,M.S., Gish,W. and Wootton,J.C. (1994). Nature Genet., 6, 119–129.
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978). In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure, vol. 5, suppl. 3. National Biomedical Research Foundation, Washington, DC, pp. 345–358.
Doolittle,R.F. (1979). In Neurath,H. and Hill,R.L. (eds), The Proteins, vol. 4. Academic Press, New York, pp. 1–118.
Everitt,B. (1993). Cluster Analysis, 3rd edn. Edward Arnold, London.
Felsenstein,J. (1985). Evolution, 39, 783–791.
Felsenstein,J. (1989). Cladistics, 5, 164–166.
Felsenstein,J. (1993). PHYLIP (Phylogeny Inference Package), version 3.5c. Distributed by the author, Department of Genetics, University of Washington, Seattle, WA.
Feng,D., Johnson,M.S. and Doolittle,R.F. (1985). J. Mol. Evol., 21, 112–125.
Fitch,W.M. (1966). J. Mol. Biol., 16, 9–16.
Fitch,W.M. and Margoliash,E. (1967). Science, 155, 279–284.
Gonnet,G.H., Cohen,M.A. and Benner,S.A. (1992). Science, 256, 1443–1445.
Gordon,A.D. (1996). In Arabie,P., Hubert,L.J. and De Soete,G. (eds), Clustering and Classification. World Scientific, Singapore, pp. 65–121.
Grantham,R. (1974). Science, 185, 862–864.
Henikoff,S. (1996). Curr. Opin. Struct. Biol., 6, 353–360.
Henikoff,S. and Henikoff,J.G. (1992). Proc. Natl Acad. Sci. USA, 89, 10915–10919.
Henikoff,S. and Henikoff,J.G. (1993). Proteins, 17, 49–61.
Johnson,M.S. and Overington,J.P. (1993). J. Mol. Biol., 233, 716–738.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992). CABIOS, 8, 275–282.
Lanyon,S.M. (1985). Syst. Zool., 34, 397–403.
Levin,J.M., Robson,B. and Garnier,J. (1986). FEBS Lett., 205, 303–308.
Livingstone,C.D. and Barton,G.J. (1993). CABIOS, 9, 745–756.
May,A.C.W. (1999). Proteins, in press.
McLachlan,A.D. (1971). J. Mol. Biol., 61, 409–424.
Miyata,T., Miyazawa,S. and Yasunaga,T. (1979). J. Mol. Evol., 12, 219–236.
Mocz,G. (1995). Protein Sci., 4, 1178–1187.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995). J. Mol. Biol., 247, 536–540.
Pearson,W.R. (1995). Protein Sci., 4, 1145–1160.
Press,W.H., Teukolsky,S.A., Vetterling,W.T. and Flannery,B.P. (1992). Numerical Recipes in C: The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge.
Rao,J.K.M. (1987). Int. J. Pept. Protein Res., 29, 276–281.
Risler,J.L., Delorme,M.O., Delacroix,H. and Henaut,A. (1988). J. Mol. Biol., 204, 1019–1029.
Romesburg,H.C. (1984). Cluster Analysis for Researchers. Lifetime Learning Publications, Belmont, CA.
Russell,R.B., Saqi,M.A.S., Sayle,R.A., Bates,P.A. and Sternberg,M.J.E. (1997). J. Mol. Biol., 269, 423–439.
Sneath,P.H.A. (1966). J. Theor. Biol., 12, 157–195.
Stanfel,L.E. (1996). J. Theor. Biol., 183, 195–205.
Taylor,W.R. (1986). J. Theor. Biol., 119, 205–218.
Tomii,K. and Kanehisa,M. (1996). Protein Engng, 9, 27–36.
Vogt,G., Etzold,T. and Argos,P. (1995). J. Mol. Biol., 249, 816–831.
Wang,H., Dopazo,J. and Carazo,J.M. (1998). Bioinformatics, 14, 376–377.
Waterman,M.S. (ed.) (1995). Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman and Hall, London.
Received October 27, 1998; revised May 27, 1999; accepted June 9, 1999.