Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 lAA, UK.E-mail: amay{at}nimr.mrc.ac.uk
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: cluster analysis/multiple sequence alignment/protein homologous family/representative set/sequence weighting
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Sequence patterns within a group of protein sequences defined by their multiple alignment can be represented as profiles (Gribskov et al., 1987). Also known as a position-specific scoring matrix, a profile is a 2D weight matrix in which the rows describe the aligned positions and the first 20 columns specify the score for each of the possible amino acids. As well as reflecting the observed frequency of each residue at an aligned position, these scores also incorporate the residue similarity relationship data of an amino acid substitution matrix. A further column denotes a penalty for insertions or deletions at that position. An extension of a standard dynamic programming algorithm (Smith and Waterman, 1981
) is used in turn to align single test sequences against a profile to identify further members of the set. Recently, profiles have been constructed from multiple sequence alignments using a hidden Markov model (HMM) approach (for a recent review, see Amitai, 1998).
Usually, there is an unequal representation of proteins in a multiple sequence alignment. That is, the sequences are not statistically independent: the space defined by the sequence identity relationships within the group is not evenly populated (Sibbald and Argos, 1990). Accordingly, a profile constructed from the multiple alignment will reflect this bias: outlier sequences will have less influence than they should. Sequence weighting addresses this problem of unequal representation.
An early weighting method is that of Vingron and Argos, who assign a weight to a sequence on the basis of the sum of its pairwise distances defined in terms of the number of amino acid mismatches to all the others (Vingron and Argos. 1989). Thus, an outlier will receive a high weight. In a similar approach, Sander and Schneider defined a weight for a sequence according to its mean pairwise distance to all the rest (Sander and Schneider, 1991
).
Altschul et al. described a procedure suitable only for sequences related by an evolutionary tree (Altschul et al., 1989). In addition, the tree must be rooted the initial bifurcation specifies the last common ancestor of all the sequences. Although also dependent on a phylogenetic tree, the simple `branch-proportional' scheme of Thompson et al. (Thompson et al., 1994
) does not require the biological root to be known. In another straightforward tree-based strategy, Gerstein et al. used recursion to traverse a tree from the leaves up to the root, incrementing the weights at each node (Gerstein et al., 1994
).
An advantage of the method of Sibbald and Argos is its lack of dependence on a tree (Sibbald and Argos, 1990). A given space containing a set of points can be partitioned into volumes around each point comprising all empty space that is closer to the central point in the volume than to any of the other points. The resulting pattern is a Voronoi diagram and the partitions are known as Voronoi volumes. Highly related proteins are clustered together in sequence space. The unevenness of such a space means that the Voronoi volume around an outlier is greater than that around each of the sequences comprising a cluster. Thus, weights can be assigned according to the Voronoi volumes around each sequence. Although it has been argued that Voronoi sequence weights are more likely to be correct in general (Vingron and Sibbald, 1993
), they suffer from two related problems. First, it is an open problem how to calculate analytically Voronoi geometry for such a high-dimensional space as sequence space. This necessitates use of a stochastic sampling procedure to generate `random sequences': Sibbald and Argos used a Monte Carlo algorithm (Sibbald and Argos, 1990
). Second, convergence to stable weights via the sampling of sequences randomly from sequence space can become impractical for larger domains (>100 residues) (Thompson et al., 1994
).
Henikoff and Henikoff described a simple and elegant sequence weighting method that instead of considering pairwise sequence distances incorporates the residue diversity at each aligned position (Henikoff and Henikoff, 1994). These position-based sequence weights are shown to perform well in database searching tests. In a related approach, Krogh and Mitchison proposed maximum entropy weights (Krogh and Mitchison, 1995
). The recent PSIC (position-specific independent counts) program of Sunyaev et al. calculates sequence weights for each column in a multiple alignment (Sunyaev et al., 1999
).
Sequence weighting is related to the problem of how to select representative protein data sets. Selection of representatives is key in not only classification of protein sequences (for example, see Bateman et al., 1999; Srinivasarao et al., 1999) and 3D structures (for instance, see Orengo et al., 1997; Holm and Sander, 1999) but also for protein target selection in structural genomics (Burley et al., 1999). As for the latter, the question of which targets should be chosen for high-throughput structure determination is far from academic: it is simply not feasible at present to determine the 3D structure of every protein encoded by the human genome. Of course, although the definition of representative is operationally defined, such a data set should combine `maximum coverage with minimum redundancy' (Hobohm et al., 1992
). Usually, a representative set contains no pair of proteins with a sequence identity greater than a user-defined threshold (for example, see Boberg et al., 1992; Heringa et al., 1992; Hobohm et al., 1992; Holm and Sander, 1998). However, as far as the author is aware, it seems that the question of how to choose arbitrarily sized representative sets of aligned sequences from a multiple alignment has not been addressed until very recently (see below). In addition, there is a need for a method to quantify how representative such a set is of the whole group. Thus, it would be possible to summarize a multiple alignment. This data reduction could be useful for extracting the key features of large multiple sequence alignments for human inspection. For example, it might be desired not only to select a few representative proteins from a multiple alignment for clarity in a figure for publication but also to quantify the extent of coverage therein. Recently, Corpet et al. described an approach to derive what they called a `rich description' of a protein family on the basis of a multiple sequence alignment (Corpet et al., 1999
). Hierarchical classification (see below) is used to define sub-families of sequences within the family. Consensus sequences are obtained at each tree node to represent the sub-families. The original alignment can then be summarized by replacement of several sequences by a consensus sequence. However, apart from the problems inherent in its dependence on hierarchical classification (see below), this approach offers no assessment of how representative the summary alignment of the consensus sequences is of the complete family.
Hierarchical classification is commonly used to group similar sequences into nested sets of clusters. The result is a rooted tree like a phylogenetic tree representing the evolutionary history of a group of species. The input for a hierarchical classification of N protein sequences is an NxN matrix of pairwise similarities or dissimilarities. The simplest similarity measure is percentage sequence identity. There are at least three problems with the use of hierarchical classification to cluster protein sequences to inform selection of a representative set from a multiple sequence alignment. First, a tree might not be the most suitable encapsulation of the distance relationships for a group of sequences. Of course, the validity of a tree can be assessed (May, 1999a, b
). This was not done by Corpet et al. Second, a problem lies in how to determine the most appropriate number of subsets for a group given a hierarchy, i.e. where to `cut' the tree. This is an open problem in hierarchical classification (Everitt, 1993
). Despite its subjectivity, visual inspection is often used to identify a level of similarity to act as the cut-point of a tree. By definition, this is not an option for an automatic method. Thus, the approach of Corpet et al. relies on a user-supplied maximum number of sub-families. Third, once the subsets have been chosen, the next task is to define a representative for each partition. Because hierarchical classification is sequentially optimal within the tree but not globally optimal over the whole tree, there is no guarantee that the partition representatives, however selected, will best represent the entire data. Clearly, this is a drawback of the procedure of Corpet et al. The hierarchical tree has a wide appeal in biology (for a review, see May, 1999b): recently, hierarchical classification has been used to cluster genome-wide expression data (for example, see Eisen et al., 1998). However, Tamayo et al. have questioned the suitability of hierarchical classification for this purpose (Tamayo et al., 1999
). Instead, these authors use self-organizing maps, artificial neural networks particularly suited to such exploratory data analysis.
A major challenge in the classification of protein sequences and 3D structures is the formulation of operational definitions of useful groupings. For example, sequences are usually clustered into homologous families (i.e. proteins related by divergent evolution from a common ancestor) on the basis of `significant' sequence identity. The problem here is that there is not a single sequence identity threshold suitable to define membership for all protein families. In other words, some families (e.g. globins) are far more heterogeneous at the sequence level than others are. The same problem arises in the clustering of protein 3D structures into groups with common folds: some folds (e.g. the globin fold) are far more conserved over evolution than others are.
Cluster analysis is not restricted to a hierarchical tree representation and use of a dissimilarity matrix. Objects related by a linear ordering can be optimally assigned into clusters using a simple dynamic programming algorithm (Bement and Waterman, 1977; for a review of constrained classification, see Gordon, 1996; Hawkins and Merriam, 1973). Importantly, the use of dynamic programming means that such a clustering is optimal for all predefined numbers of subsets (Gordon, 1996). In addition, there is no need for an a priori threshold to delineate clusters. So, reduction of the protein sequence clustering problem to one of constrained classification not only obviates the need for imposing a tree on the data but also makes it possible to identify a globally optimal solution. Furthermore, unifying sequence weighting and sequence classification, the approach described here affords a new view of relationships within a multiple sequence alignment.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
The multiple sequence alignments used in this study are those in HOMSTRAD (Mizuguchi et al., 1998), a database of protein 3D structure alignments for homologous families. HOMSTRAD is available on the Web at http://www-cryst.bioc. cam.ac.uk/~homstrad/. An advantage of the use of homologous protein families is that assignment of topological equivalence is essentially unambiguous for proteins descended by evolution from a common ancestor. By definition, 3D structure comparison is that much more difficult and therefore subjective for distantly related proteins. (Of course, it is just the level of structure similarity among the members of the protein families in HOMSTRAD that gives further support to a claim of a clear evolutionary relationship.) In particular, the November 1999 full release of HOMSTRAD comprises 1193 proteins grouped into 209 families; the mean (±SD) number of structures per family is 5.7 (4.9). The 41 families comprising only two members are not considered in this analysis since, by definition, an alignment of only two protein sequences does not constitute a multiple alignment. The mean (±SD) number of structures per family is 6.6 (5.0) for the remaining 168 families. The size distribution of the 168 families, sorted numerically by size, is (number of structures, number of families): 3, 40; 4, 29; 5, 27; 6, 20; 7, 7; 8, 10; 9, 6; 10, 5; 11, 4; 12, 6; 13, 3; 14, 3; 15, 1; 18, 1; 20, 1; 21, 1; 22, 1; 26, 1; 27, 1; 41, 1.
Hierarchical classification of sequences on the basis of identity
A standard dissimilarity measure for a pair of aligned sequences is 100 percentage sequence identity. Sequence identity is determined here over the multiply aligned positions, i.e. those aligned positions without a gap in any sequence. If a gap were considered as a different residue type then it is not clear how to treat matching gap characters in terms of sequence identity for a given pair. Alternatively, of course, sequence identity can be defined in terms of the length of the shorter sequence of a pair but this means then that the number of aligned positions is not constant for all comparisons. Furthermore, distances calculated from sets of pairwise alignments `cannot be guaranteed to be even close to Euclidean' (Forster et al., 1999). A dissimilarity matrix comprising dissimilarities for all pairs of sequences is an input data format for hierarchical classification. The most widely used algorithm for obtaining a hierarchical classification (Romesburg, 1984
), the unweighted pair-group method using arithmetic averages (UPGMA), is used here. As in May (May, 1999a
, b
), the support for the clusters defined by a tree is assessed with a jackknife test (Lanyon, 1985
).
Sequence weighting
By definition, sequences within a homologous family are not statistically independent. Sequence weighting can be used to describe the extent of sampling bias within a protein family. Six weighting schemes of which two are previously published (Vingron and Argos, 1989; Henikoff and Henikoff, 1994
) are applied here.
A simple extension to the approach of Vingron and Argos (1989) is to use the familiar sum of squares formalization: weighting a sequence according to the sum of the squares of its pairwise distances (100 percentage sequence identity) to all the others.
The other three weighting approaches are based on residue diversity over all positions in the aligned sequences. The variability of an aligned column of residues is represented in terms of information-theoretical entropy (Shenkin et al., 1991): the more unlike the site is, the higher the entropy. Thus, an aligned position occupied by only a single residue is accorded entropy 0. The challenge is to combine the information in all multiply aligned positions (Figure 1
) without using a mean as in Henikoff and Henikoff (Henikoff and Henikoff, 1994
). As a measure of central tendency, the mean is not appropriate for highly skewed distributions since it can be heavily influenced by extreme values. Of course, the mean is the most efficient measure of central tendency for normal distributions. A multiple sequence alignment comprising N multiply aligned positions can be represented as an N-dimensional vector where each element therein is the entropy at that position (Figure 1
). We can jackknife a multiple alignment of M sequences by leaving out one sequence at a time and so construct M such N-dimensional entropy vectors, each representing M 1 sequences. Use of vector algebra to describe the relationship between each of these vectors and that for all M sequences makes it possible to quantify the influence of each deleted sequence and so assign a weight. This is done in three ways here. First, the standard correlation coefficient can be used to describe the strength of association. Insensitive to size displacements between data profiles, this statistic considers only similarity in shape. Second, the standard scalar product is used: like the correlation coefficient, it is insensitive to proportional translations between data profiles but, unlike the correlation coefficient, it is sensitive to additive translations between data profiles. Finally, the Euclidean distance is used since it is sensitive to both additive and proportional translations. Hence each of these three measures has different properties. A benefit of the entropy vector algebra procedure with jackknife is that it permits comparison of a single sequence with M 1 sequences simultaneously in the context of the pattern of sequence variability across all M.
|
One definition of the most representative pair of sequences for a multiple alignment might be that pair of sequences for which the pairwise distance is closest to a measure of central tendency, e.g. the mean or the root mean square (r.m.s.), of the distribution of distances for all unique pairs of sequences. Of course, there is no guarantee that the arrangement of residue identities along the alignment of such a pair is most representative of the pattern across the entire set, especially if the proteins are not evolutionarily related. This suggests a more useful definition of the most representative pair: that pair of sequences for which there is the strongest positive linear correlation between the N-dimensional entropy vector describing the pair and that for all M sequences. Although we can assess each pair in turn on this criterion, extension of such a `generate and test' approach to select most representative sets of >2 sequences would clearly soon fall prey to a `combinatoric explosion'. In other words, the number of combinations of R sequences chosen from a set of N rapidly becomes very large for increasing values of R. This means that it quickly becomes prohibitive to evaluate each combination in turn. Instead an heuristic approach can be used: the repeated stepwise addition of a single sequence to the most representative set until all M sequences have been considered. Selection of a sequence at each step of this greedy algorithm is on the basis of that sequence which maximizes the level of similarity between the subset and all M sequences. This method is similar to an agglomerative algorithm for hierarchical classification in that each addition step is optimal subject to the constraint that once a sequence has been recruited to the subset it cannot be considered again as a potential recruit.
Constrained classification of sequences on the basis of weights
Sequences within a multiple alignment can be sorted into a list on the basis of the arithmetic order of their weights. Optimal partitioning of such a list with the algorithm of Hawkins and Merriam (Hawkins and Merriam, 1973) delineates a `core' of sequences with low weights (the most typical ones) and a corresponding `periphery' of atypical sequences with high weights. There are at least three advantages of the use of the Hawkins and Merriam partitioning method here. First, the user is not required to specify an arbitrary threshold for the division(s) (cf. the problem of where to `cut' a hierarchical classification to partition a set). Second, the resulting segmentation into core/periphery is globally optimal. Third, the use of dynamic programming guarantees that the clustering of N points is optimal for all predefined numbers of subsets k where 1
k
N. It is important to note that use of the term `dynamic programming' here does not refer to the familiar biological sequence alignment algorithms. Instead, it refers to a general algorithmic technique for optimization of which the alignment algorithms are a specific instance. A measure of the variability of a segment is the sum of squared deviations of the constituent points about their mean. Defining the total within-segment variability W by summation for k segments, the globally optimal partitioning into k segments is that which minimizes W. This global optimization problem can be solved exactly by dynamic programming (Hawkins and Merriam, 1973
). In addition, this approach also affords automatic identification of the most appropriate value of k. The improvement in W on addition of another segment is expressed in terms of `explained' variation: the percentage reduction in W with respect to W when k = 1, i.e. no segmentation. The optimal value of k is taken as that value of k that shows the largest increase in `explained' variation with respect to k 1 (Everitt, 1993
).
Two other measures are used here to order a set of protein sequences from a multiple alignment for constrained classification. First, the jackknife entropy sum for each of the M sequences is obtained by adding the entropy at each multiply aligned position in the N-dimensional entropy vector representing M 1 sequences. By definition, the jackknife entropy sum will be highest for the most representative sequence since its exclusion means loss of the most central sequence. Use of vector algebra in determination of the above three entropy-based sequence weights means that they take into account the shape of the distribution of sequence diversity across the N multiply aligned positions. The jackknife entropy sum, however, averages over all columns and so ignores the shape of the distribution. Comparison of the two types of entropy-based measures will indicate families where averaging might not be the most appropriate approach to describe the distribution of sequence diversity across the multiple alignment. Second, the coefficient of variation (CV) of pairwise distances (100 percentage sequence identity) is used to measure the relative spread of the pairwise distances for each sequence. The standard deviation (SD) is an absolute measure of the spread of a univariate distribution. CV is defined as the SD as a percentage of the mean. Outliers can be defined here as sequences with small CV values: these sequences are closest to being equidistant to the others. Clearly, it is not possible to identify outliers according to this criterion from a tree since such a data structure does not represent directly the spread of the pairwise distances for each sequence. Hence the use of the CV affords a new view of relationships among a set of protein sequences. Furthermore, since the weight accorded to a sequence by the method of Vingron and Argos (Vingron and Argos, 1989) is essentially its mean pairwise distance to the rest of the set, it is possible to investigate the relation between SD and mean for these distance distributions and so quantify the influence of SD on CV.
Relationship between sequence identity-based dissimilarity and sequence weight-based dissimilarity
For a given sequence weighting method, a dissimilarity measure between a pair of sequences can be defined as the absolute difference in weight between the two. The strength of association between the sequence identity-based dissimilarity matrix and that defined for each sequence weight is described in terms of the correlation coefficient. The nature of the relation for a family will depend on the relationships among the members. Consider a pair of identical sequences: although the absolute difference in weight is 0, the converse is not necessarily true.
![]() |
Results and discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
The result of an UPGMA hierarchical classification of N objects is a binary tree comprising N 1 internal nodes. Each internal node represents a subset (cluster) of 2 similar objects. The jackknife test of Lanyon (1985) to assess a tree requires a hierarchical classification of
4 objects. Of the 128 families with at least four members, 79 yield meaningful UPGMA trees on the basis of 100 percentage sequence identity as a dissimilarity measure. (A meaningful tree is defined as one where all the clusters are observed to be reliable according to the Lanyon method.) The mean (±SD) number of structures per family for the 79 is 5.7 (1.9). The zinc finger (CCHH-type) family (N = 15) is the one with the lowest relative proportion, 0.4, of reliable internal nodes: five out of 14 (Figure 2
). By contrast, constrained classification of sequence weights sorted in arithmetic order (Table I
) identifies all globally optimal core/periphery partitions for the family without the need for a user-supplied cut-off. According to the criterion of largest improvement in the percent fit for each new segment (see Materials and methods), the optimal number of segments is two (data not shown): the boundary is between 1ard and 1zaa3 (Table I
).
|
|
CV-based definition of outlier sequences
One operational definition of outliers is those sequences with small CV of pairwise distances (100 percentage sequence identity) since such sequences are closest to being equidistant to the others. Conversely, sequences with large CV values are those where the distribution of pairwise distances has a lot of spread. Of course, a large spread reflects a structure within the data: such sequences are closer to some than others. So, given that the CV describes the extent to which a sequence is part of a cluster, the core of a family will comprise sequences with large CV values. Use of CV in this way can lead to clustering which at first glance might appear counter-intuitive, but is entirely consistent. For example, consider a multiple alignment with no gaps of N sequences where N > 3 (by definition, more than a pair of sequences) and N 20 (the number of different amino acids). Each sequence comprises N 1 residues. Each of N 1 of the sequences are maximally (i.e. 0% sequence identity) equidistant from each other. The N 1 positions of the remaining sequence, however, consist of an amino acid that matches a residue in turn in one of the N 1 sequences. In other words, the last sequence is an exact intermediate between the N 1 sequences (Park et al., 1997
). Clearly, in terms of sequence weighting, the intermediate sequence is the most representative and so is accorded the lowest weight while the N 1 other sequences are each assigned an equal higher weight. However, since the intermediate sequence is equidistant from each of the other N 1 sequences, the CV of pairwise distances is 0.0% since SD is 0.0. Constrained classification of the CV values sorted in reverse arithmetic order delineates exactly a partition between a core of the N 1 sequences and a periphery of the intermediate sequence. However, this assignment of the intermediate sequence as an outlier is entirely consistent with the atypical nature of this sequence on the basis of this criterion. [At the first step of a hierarchical classification of the N sequences on the basis of pairwise distance (100 percentage sequence identity), the intermediate sequence is amalgamated with the first (in terms of order in the dissimilarity matrix) of the N 1 sequences. Each of the remaining sequences are in turn joined to the initial cluster. The single linkage algorithm for hierarchical classification defines the distance between two clusters A and B as the smallest distance between an object in A and an object in B. This means that despite the order dependence problem, a single linkage tree demonstrates that each of the N 1 sequences is equally close to the intermediate one. The UPGMA algorithm for hierarchical classification, however, defines the distance between two clusters A and B as the mean distance between an object in A and an object in B. Unfortunately, this then means that the order of the dissimilarity matrix becomes significant here.] Of the 168 families, five contain a sequence that is exactly equidistant in terms of sequence identity from the rest of the family members (CV is 0.0%; each of these families comprises only three sequences). Clearly, for a given family, the sequence with the smallest CV is closest to being equidistant to the rest. The largest value of CV for sequences with the smallest value within a family is 36.04% [3cox within the GMC oxidoreductase family (N = 4)]; mean (±SD) overall is 5.41 (5.94).
Relationship between mean and SD of distance distributions
The relationship between the CV of pairwise distances (100 percentage sequence identity) and the Vingron and Argos weight (Vingron and Argos, 1989) reflects the correlation between the SD and mean of the distance distributions for each sequence. There is a negative correlation between CV values and Vingron and Argos weights for 158 of the 168 families (r < 0.8 for 120 families). A complete negative correlation corresponds to the situation where the low weight (small mean) sequences that comprise the core of a family are closer to some sequences than others (large SD). Further, the high weight (large mean) outlier sequences are the nearest to being equidistant to the others (small SD). There is a positive correlation for 10 of the 168 families (r > 0.8 for seven families). A complete positive correlation is consistent with the low weight (small mean) core sequences tending towards being equidistant to the rest (small SD) (cf. the intermediate sequence in the example above). Also, the high weight (large mean) outlier sequences are closer to some sequences than others (large SD).
Entropy-based sequence weights
Ideally, the relationship between the jackknife entropy sum and the Henikoff and Henikoff sequence weights (Henikoff and Henikoff, 1994) should be complete negative correlation (r = 1). This is because both average over the residue diversity at all multiply aligned positions. In practice, r < 0.8 for all 168 families: the least negative correlation is r = 0.88 for the hormone receptor (DNA-binding domain) family (N = 5). Having established that the jackknife entropy sum is usefully related to the Henikoff and Henikoff sequence weight, the next issue is the relationship between the jackknife entropy sum and the three entropy-based sequence weights derived by vector algebra (see Materials and methods). The idea here is to investigate the usefulness of combining the information from all multiply aligned positions in ways other than by averaging. The relationship between the jackknife entropy sum and entropy-based correlation coefficient sequence weights is r > 0.8 for 22 families. As a similarity measure, the correlation coefficient ignores both additive and proportional translations between data profiles. Thus, complete positive correlation (r = 1) occurs when data profiles share a common `shape' regardless of size displacement (if any) between them. In contrast, the scalar product is sensitive to additive translations between data profiles but ignores proportional translations. In fact, 15 of the 22 families with jackknife entropy sum and entropy-based correlation coefficient sequence weights r > 0.8 have r < 0.8 for jackknife entropy sum and entropy-based scalar product sequence weights. In other words, additive translations between the entropy vectors are important for these 15 families. An example here is the CoA-dependent acetyltransferase family (N = 4) where 3cla is accorded the highest weight according to the method of Henikoff and Henikoff (Henikoff and Henikoff, 1994
). 3cla has the smallest CV of pairwise distances (100 percentage sequence identity), 5.0%, which makes it the closest to being equidistant to the others. This means that when 3cla is left out, the N-dimensional entropy vector is nearest to being parallel with that for all the sequences. On this basis, 3cla is assigned the lowest entropy-based correlation coefficient sequence weight. Euclidean distance is sensitive to both additive and proportional translations between data profiles. Of the seven families with jackknife entropy sum and entropy-based correlation coefficient sequence weights r > 0.8 and jackknife entropy sum and entropy-based scalar product sequence weights r > 0.8, six (epidermal growth factor-like domain N = 12; fibronectin type III domain N = 14; thiamine pyrophosphate enzymes N = 4; thioredoxin N = 6; transferrin N = 7; zinc finger CCHH-type N = 15) have r < 0.8 for jackknife entropy sum and entropy-based Euclidean distance sequence weights. Hence proportional translations between the entropy vectors are important for these six families. There is only one family, that of the hormone receptor (DNA-binding domain) (N = 5), where r > 0.8 for jackknife entropy sum and all three entropy-based sequence weights. Clearly, averaging entropy over all multiply aligned positions (jackknife entropy sum) does not always correspond to putting together the data without averaging. This is to be expected since the mean is not applicable for highly skewed distributions and is less efficient than other measures of central tendency when extreme values are possible. Hence an advantage of the entropy vector algebra approach presented here is that it allows direct consideration of the distribution of entropy across a multiple sequence alignment (Figure 1
).
Relationship between sequence identity-based dissimilarity and sequence weight-based dissimilarity
A dissimilarity measure can be defined between a pair of sequences as the absolute difference in sequence weight between the two. For a given sequence weighting protocol, a protein family can be characterized in terms of the relationship (correlation coefficient) between sequence identity-based dissimilarity and sequence weight-based dissimilarity. The value of r varies across all six weighting schemes for 161 of the 168 families. There are two families (TATA-box binding protein, C-terminal domain N = 4; uracil-DNA glycosylase N =3) where there are five unique values of r: in both cases the same value is shared between the position-based Henikoff and Henikoff and identity-based Vingron and Argos weights (Henikoff and Henikoff, 1994; Vingron and Argos, 1989
). The remaining five families, all comprising three structures, have a single unique value of r: 1.0 for cytokine granulocyte colony stimulating factor and enoyl-CoA hydratase and 1.0 for ciliate pheromone, microbial ribonucleases and thionin. The identity-based sum of squares sequence weighting scheme is the most frequently occurring [N = 69; mean (±SD) r = 0.66 (0.21); minimum r = 0.13; maximum r = 1.0] of those with the highest value of r for each family across the 161 families with six unique values of r.
Factors influencing the relationship between sequence identity-based dissimilarity and sequence weight-based dissimilarity
There are four factors influencing the relation (r) between sequence identity-based dissimilarity and sequence weight-based dissimilarity for a family. First, given that identical sequences are assigned identical weights, increasing the number of copies of a sequence within an otherwise unchanged multiple alignment leads to an increase in r. Thus, similar sequences are allocated similar weights. Second, it is not necessarily true, however, that identical (similar) weights imply identical (similar) sequences [for example, see `the highly idealized example' alignment of uniform sequences in Table I of Sibbald and Argos (Sibbald and Argos, 1990
)]. This leads to a decrease in r. For instance, of the 128 families with
4 members, cytochrome c' N = 4 has the most negative r (0.67) for the association between sequence identity-based dissimilarity and identity-based sum of squares sequence weight-based dissimilarity. Here the joint most distant pair, 2ccya and 1bbha (21.0% sequence identity), is closest in terms of absolute difference in sum of squares sequence weights. Third, although dissimilar sequences receive dissimilar weights resulting in an increase in r, the effect is enhanced when there are clear outlier(s) that are also close to being equidistant to the rest of the set. For example, of the 69 families where the identity-based sum of squares sequence weights give the highest value of r, the most positive value (1.00) is obtained for the immunoglobulin-binding domain N = 4 family. Here 2ptl not only is an outlier in terms of the UPGMA tree of sequence relationships (all internal nodes are reliable) but also is nearly equidistant to the rest: CV of pairwise distances (100 percentage sequence identity) = 2.16%. In contrast, consider the thioredoxin N = 6 family, that family of the 168 with r nearest 0 (0.03) for the relation between sequence identity-based dissimilarity and identity-based sum of squares sequence weight-based dissimilarity. Here the most distant pair, 1aaza and 3trx (7.5% sequence identity), is second closest in terms of absolute difference in sum of squares sequence weights. The initial bifurcation of the UPGMA tree of sequence relationships (all internal nodes are reliable) is between (1thx 2trxa 3trx) and (1aaza 1ego 3grx). 3trx is an outlier with respect to (1thx 2trxa) while 1aaza is an outlier in relation to (1ego 3grx). Hence neither 3trx nor 1aaza is an outlier with respect to the rest of the set. Of course, use of the square term means that identity-based: sum of squares sequence weights are more sensitive to outliers (i.e. upweights them) than identity-based Vingron and Argos weights. In turn, this means that absolute weight differences for comparisons involving outliers will be increased. Fourth, sequences assigned dissimilar weights are not necessarily unrelated: the absolute weight difference between 1aaza and 3grx is third highest of the 15 unique pairs in the thioredoxin N = 6 family and their sequence identity (37.3%) is also third highest. To conclude, the relation between sequence identity-based dissimilarity and sequence weight-based dissimilarity for a family is a useful indicator of intra-family sequence relationships.
Selection of representative sets of sequences
The entropy-based most representative pair of sequences for a family is defined as that pair for which there is the most positive r between the N-dimensional entropy vector characterizing the pair and that for all M sequences. The sum of squares-based most representative pair of sequences is taken as that pair of sequences for which the pairwise distance is nearest to the r.m.s. distance (a measure of central tendency) for all unique pairs. The relationship between the two pairs so selected can be summarized in terms of the ratio of the correlation coefficients for the association between each one and the complete set of sequences. The two pairs coincide (ratio of r = 1.0) in 18 of the 168 families; the largest family of the 18 is serine proteinase inhibitor Kunitz type N = 10. The ratio of r is <0.8 for 21 families; the largest family of the 21 is the largest of all 168: globin N = 41. By definition, the organization of residue identities along the alignment of the entropy-based most representative pair of sequences is closest to the arrangement of sequence diversity across the entire family. Clearly, it is not certain that the sum of squares-based most representative pair of sequences will also be the entropy-based most representative one. A problem with the pairwise distance (100 percentage sequence identity) as a dissimilarity measure here is that percentage sequence identity considers only the count of residue identities in two sequences (cf. the Hamming distance) and does not say anything about the distribution of the matches as specified by an alignment. Hence, although it is possible to identify that pair of sequences with dissimilarity closest to the `central' value (as defined by the r.m.s. or any other measure of central tendency), there is no guarantee that the placement of identities therein is most similar to the pattern of sequence diversity across the multiple alignment. Clearly, an advantage of the entropy-based method then is that the most representative pair of sequences is selected on the basis of its direct comparison with all M sequences simultaneously. Of course, if the sum of squares-based most representative pair of sequences is the same as that chosen on the basis of entropy relationships, then the `central' pairwise distance is derived from a distribution of residue identities that is most similar to that for the complete family.
The stepwise construction of entropy-based most representative sets starting from the optimal pair can be analysed in terms of the value of r for the relationship between the N-dimensional entropy vector describing the set at each step and that for all M sequences (Table II). Constrained classification of the sets according to r allows automatic identification of a minimal set of sequences to represent a family. Thus, a minimal set can be defined as the smallest set of sequences that cluster with all M sequences in the optimal constrained classification. In 167 of the 168 families the optimal partitioning according to the criterion described in Everitt (Everitt, 1993
) (see Materials and methods) is into two clusters. The one exception is the metallothionein alpha-domain family N = 3 where the optimal number of groupings is one, i.e. the entropy-based most representative pair of sequences, 1mrb and 1mrt (90.3% sequence identity), is deemed to embody the family completely (r = 1.0). Of the 31 multiply aligned positions, only three are not identities: 9, 15 and 22. The residues at these positions are (1mhu, 1mrb, 1mrt): V, P, V; A, A, S; G, G, E. The information-theoretical entropy (S) is 0.92 at each. Clearly, it is only the ninth position that differs between 1mhu and 1mrb (i.e. 96.8% sequence identity). Given that 1mhu and 1mrt match at this position, then the pair 1mrb and 1mrt is selected as the entropy-based most representative one.
|
The mean (±SD) relative size of the minimal set for the 126 families containing one is 0.56 (0.17). The smallest minimal set as a proportion of the total number of structures within a family is 0.10 for the globins N = 41, i.e. four members: 2pghb 1spgb 2mhba 1a6m. The largest minimal set as a proportion of N is 0.8 for nine families (data not shown) for all of which N = 5, i.e. again four members.
Reduction of a large multiple sequence alignment to a minimal set facilitates human inspection. Of course, the availability of complete genome sequences has only served to increase the need for such data reduction. It must be emphasized that the algorithm presented here for selection of representative sets is only an heuristic. However, it does have the advantage that it is deterministic and so the results presented here are meaningful. An extension of the current approach might be to use a Monte Carlo algorithm to find the combination of sequences that maximizes the value of the entropy-based objective function described here. Furthermore, it might be useful to report the value of r for published alignments where `only selected proteins are shown' in order to quantify the degree of comprehensiveness therein. An independent check of the criterion proposed for selection of a representative set could be to test the efficacy of a profile (Gribskov et al., 1987) constructed from such a set for searching a database for family members.
To conclude, there is a clear need for automatic methods to summarize large multiple sequence alignments to facilitate not only human inspection but also protein target selection for structural genomics. Also, to the best of my knowledge, this is the first time that a method for classification of protein sequences has been described which is guaranteed to produce globally optimal solutions. Use of sequence weighting to inform clustering allows a novel view of relationships within a multiple sequence alignment. A new sequence weighting method is proposed: one that considers directly the distribution of sequence variability along an alignment rather than averaging over all positions.
The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/~amay.
![]() |
Acknowledgments |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Amitai,M. (1998) Science, 282, 14361437.
Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Finn,R.D. and Sonnhammer,E.L.L. (1999) Nucleic Acids Res., 27, 260262.
Bement,T.R. and Waterman,M.S. (1977) Math. Geol., 9, 5561.
Blundell,T.L., Sibanda,B.L., Sternberg,M.J.E. and Thornton,J.M. (1987) Nature, 326, 347352.[ISI][Medline]
Boberg,J., Salakoski,T. and Vihinen,M. (1992) Proteins, 14, 265276.[ISI][Medline]
Boguski,M.S. (1999) Science, 286, 453455.
Bult,C.J. et al. (1996) Science, 273, 10581073.[Abstract]
Burley,S.K. et al. (1999) Nature Genet., 23, 151157.[ISI][Medline]
Casari,G., Sander,C. and Valencia,A. (1995) Nature Struct. Biol., 2, 171178.[ISI][Medline]
Corpet,F., Gouzy,J. and Kahn,D. (1999) Bioinformatics, 15, 10201027.
Eisen,M.B., Spellman,P.T., Brown,P.O. and Botstein,D. (1998) Proc. Natl Acad. Sci. USA, 95, 1486314868.
Everitt,B. (1993) Cluster Analysis. 3rd ed. Edward Arnold, London.
Felsenstein,J. (1989) Cladistics, 5, 164166.
Forster,M.J., Heath,A.B. and Afzal,M.A. (1999) Bioinformatics, 15, 8990.
Gerstein,M., Sonnhammer,E.L. and Chothia,C. (1994) J. Mol. Biol., 236, 10671078.[ISI][Medline]
Gordon,A.D. (1996) Comput. Statist. Data Anal., 21, 1729.[ISI]
Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Proc. Natl Acad. Sci. USA, 84, 43554358.[Abstract]
Hawkins,D.M. and Merriam,D.F. (1973) Math. Geol., 5, 389395.
Henikoff,S. and Henikoff,J.G. (1994) J. Mol. Biol., 243, 574578.[ISI][Medline]
Heringa,J., Sommerfeldt,H., Higgins,D. and Argos,P. (1992) Comput. Appl. Biosci., 8, 599600.[Abstract]
Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409417.
Holm,L. and Sander,C. (1998) Bioinformatics, 14, 423429.[Abstract]
Holm,L. and Sander,C. (1999) Nucleic Acids Res., 27, 244247.
Krogh,A. and Mitchison,G. (1995) ISMB, 3, 215221.[Medline]
Lanyon,S.M. (1985) Syst. Zool., 34, 397403.[ISI]
Lichtarge,O., Bourne,H.R. and Cohen,F.E. (1996) J. Mol. Biol., 257, 342358.[ISI][Medline]
Lockless,S.W. and Ranganathan,R. (1999) Science, 286, 295299.
May,A.C.W. (1999a) Struct. Fold. Des., 7, R213.[ISI][Medline]
May,A.C.W. (1999b) Proteins, 37, 2029.[ISI][Medline]
Mirny,L.A. and Shakhnovich,E.I. (1999) J. Mol. Biol., 291, 177196.[ISI][Medline]
Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) Protein Sci., 7, 24692471.
Oliver,S.G. (1996) Nature, 379, 597600.[ISI][Medline]
Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 10931108.[ISI][Medline]
Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) J. Mol. Biol., 273, 349354.[ISI][Medline]
Pollock,D.D., Taylor,W.R. and Goldman,N. (1999) J. Mol. Biol., 287, 187198.[ISI][Medline]
Ptitsyn,O.B. (1998) J. Mol. Biol., 278, 655666.[ISI][Medline]
Ptitsyn,O.B. and Ting,K.L. (1999) J. Mol. Biol., 291, 671682.[ISI][Medline]
Romesburg,H.C. (1984) Cluster Analysis for Researchers. Lifetime Learning Publications, Belmont, CA.
Sander,C. and Schneider,R. (1991) Proteins, 9, 5668.[ISI][Medline]
Shenkin,P.S., Erman,B. and Mastrandrea,L.D. (1991) Proteins, 11, 297313.[ISI][Medline]
Sibbald,P.R. and Argos,P. (1990) J. Mol. Biol., 216, 813818.[ISI][Medline]
Smith,T.F. (1999) Struct. Fold. Des., 7, R7R12.[Medline]
Smith,T.F. and Waterman,M.S. (1981) J. Mol. Biol., 147, 195197.[ISI][Medline]
Snel,B., Bork,P. and Huynen,M.A. (1999) Nature Genet., 21, 108110.[ISI][Medline]
Srinivasarao,G.Y., Yeh,L.S.L., Marzec,C.R., Orcutt,B.C. and Barker,W.C. (1999) Bioinformatics, 15, 382390.
Sunyaev,S.R., Eisenhaber,F., Rodchenkov,I.V., Eisenhaber,B., Tumanyan,V.G. and Kuznetsov,E.N. (1999) Protein Eng., 12, 387394.
Tamayo,P., Slonim,D., Mesirov,J., Zhu,Q., Kitareewan,S., Dmitrovsky,E., Lander,E.S. and Golub,T.R. (1999) Proc. Natl Acad. Sci. USA, 96, 29072912.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Comput. Appl. Biosci., 10, 1929.[Abstract]
Vingron,M. and Argos,P. (1989) Comput. Appl. Biosci., 5, 115121.[Abstract]
Vingron,M. and Sibbald,P.R. (1993) Proc. Natl Acad. Sci. USA, 90, 87778781.[Abstract]
Received August 21, 2000; revised February 1, 2001; accepted February 15, 2001.