Optimal classification of protein sequences and selection of representative sets from multiple alignments: application to homologous families and lessons for structural genomics

Alex C. W. May

Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 lAA, UK.E-mail: amay{at}nimr.mrc.ac.uk


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Hierarchical classification is probably the most popular approach to group-related proteins. However, there are a number of problems associated with its use for this purpose. One is that the resulting tree showing a nested sequence of groups may not be the most suitable representation of the data. Another is that visual inspection is the most common method to decide the most appropriate number of subsets from a tree. In fact, classification of proteins in general is bedevilled with the need for subjective thresholds to define group membership (e.g. `significant' sequence identity for homologous families). Such arbitrariness is not only intellectually unsatisfying but also has important practical consequences. For instance, it hinders meaningful identification of protein targets for structural genomics. I describe an alternative approach to cluster-related proteins without the need for an a priori threshold, first, through its use of dynamic programming, which is guaranteed to produce globally optimal solutions at all levels of partition granularity. Grouping proteins according to weights assigned to their aligned sequences makes it possible to delineate dynamically a `core–periphery' structure within families. The `core' of a protein family comprises the most typical sequences while the `periphery' consists of the atypical ones. Further, a new sequence weighting scheme that combines the information in all the multiply aligned positions of an alignment in a novel way is put forward. Instead of averaging over all positions, this procedure takes into account directly the distribution of sequence variability along an alignment. The relationships between sequence weights and sequence identity are investigated for 168 families taken from HOMSTRAD, a database of protein structure alignments for homologous families. An exact solution is presented for the problem of how to select the most representative pair of sequences for a protein family. Extension of this approach by a greedy algorithm allows automatic identification of a minimal set of aligned sequences. The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/~amay.

Keywords: cluster analysis/multiple sequence alignment/protein homologous family/representative set/sequence weighting


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Multiple alignment of molecular sequences/three-dimensional (3D) structures is a key technique in molecular biology (for a recent review, see Smith, 1999). The extent of sequence variation at a given position in a set of aligned sequences can be useful information. For example, sequence-invariant positions within meaningful multiple protein sequence alignments often correspond to residues conserved for function, e.g. an enzyme active site, or 3D structure, e.g. disulphide bridges. Thus, it is often possible to suggest the function and/or 3D structure of a new protein sequence by extrapolation from related proteins (Blundell et al., 1987Go). [Of course, the notion of function here is necessarily operationally defined by the type and reliability of the information that is available for the known proteins and the nature of the sequence relationships (for reviews, see Oliver, 1996; Boguski, 1999).] Casari et al. predict functional residues in protein sequences on the basis of patterns of conservation within multiple alignments (Casari et al., 1995Go). Lichtarge et al. define binding surfaces common to protein families on the basis of sequence conservation patterns and knowledge of the shared fold (Lichtarge et al., 1996Go). Recently, Lockless and Ranganathan identified pathways of energetic connectivity through protein folds on the basis of the degree of statistical coupling between positions in multiple sequence alignments (Lockless and Ranganathan, 1999Go). Pollock et al. used a maximum likelihood method to identify protein residues undergoing correlated evolution (coevolution) in multiple sequence alignments (Pollock et al., 1999Go). Non-functional conserved residues have been identified as possible common folding nucleation sites for the c-type cytochromes (Ptitsyn, 1998Go) and the globins (Ptitsyn and Ting, 1999Go). Mirny and Shakhnovich also relate protein folding by nucleation to sequence variability across proteins sharing a common fold (Mirny and Shakhnovich, 1999Go). Furthermore, since methods to reconstruct phylogenetic trees from homologous (related by descent from a common ancestor) protein sequences rely on multiple alignment, the evolutionary history of a new sequence can be estimated by such a comparative approach. Of course, in the current era of complete genome sequences, it is now possible to perform comparative sequence analysis at the genome level (see, for instance, Bult et al., 1996; Snel et al., 1999).

Sequence patterns within a group of protein sequences defined by their multiple alignment can be represented as profiles (Gribskov et al., 1987Go). Also known as a position-specific scoring matrix, a profile is a 2D weight matrix in which the rows describe the aligned positions and the first 20 columns specify the score for each of the possible amino acids. As well as reflecting the observed frequency of each residue at an aligned position, these scores also incorporate the residue similarity relationship data of an amino acid substitution matrix. A further column denotes a penalty for insertions or deletions at that position. An extension of a standard dynamic programming algorithm (Smith and Waterman, 1981Go) is used in turn to align single test sequences against a profile to identify further members of the set. Recently, profiles have been constructed from multiple sequence alignments using a hidden Markov model (HMM) approach (for a recent review, see Amitai, 1998).

Usually, there is an unequal representation of proteins in a multiple sequence alignment. That is, the sequences are not statistically independent: the space defined by the sequence identity relationships within the group is not evenly populated (Sibbald and Argos, 1990Go). Accordingly, a profile constructed from the multiple alignment will reflect this bias: outlier sequences will have less influence than they should. Sequence weighting addresses this problem of unequal representation.

An early weighting method is that of Vingron and Argos, who assign a weight to a sequence on the basis of the sum of its pairwise distances – defined in terms of the number of amino acid mismatches – to all the others (Vingron and Argos. 1989Go). Thus, an outlier will receive a high weight. In a similar approach, Sander and Schneider defined a weight for a sequence according to its mean pairwise distance to all the rest (Sander and Schneider, 1991Go).

Altschul et al. described a procedure suitable only for sequences related by an evolutionary tree (Altschul et al., 1989Go). In addition, the tree must be rooted – the initial bifurcation specifies the last common ancestor of all the sequences. Although also dependent on a phylogenetic tree, the simple `branch-proportional' scheme of Thompson et al. (Thompson et al., 1994Go) does not require the biological root to be known. In another straightforward tree-based strategy, Gerstein et al. used recursion to traverse a tree from the leaves up to the root, incrementing the weights at each node (Gerstein et al., 1994Go).

An advantage of the method of Sibbald and Argos is its lack of dependence on a tree (Sibbald and Argos, 1990Go). A given space containing a set of points can be partitioned into volumes around each point comprising all empty space that is closer to the central point in the volume than to any of the other points. The resulting pattern is a Voronoi diagram and the partitions are known as Voronoi volumes. Highly related proteins are clustered together in sequence space. The unevenness of such a space means that the Voronoi volume around an outlier is greater than that around each of the sequences comprising a cluster. Thus, weights can be assigned according to the Voronoi volumes around each sequence. Although it has been argued that Voronoi sequence weights are more likely to be correct in general (Vingron and Sibbald, 1993Go), they suffer from two related problems. First, it is an open problem how to calculate analytically Voronoi geometry for such a high-dimensional space as sequence space. This necessitates use of a stochastic sampling procedure to generate `random sequences': Sibbald and Argos used a Monte Carlo algorithm (Sibbald and Argos, 1990Go). Second, convergence to stable weights via the sampling of sequences randomly from sequence space can become impractical for larger domains (>100 residues) (Thompson et al., 1994Go).

Henikoff and Henikoff described a simple and elegant sequence weighting method that instead of considering pairwise sequence distances incorporates the residue diversity at each aligned position (Henikoff and Henikoff, 1994Go). These position-based sequence weights are shown to perform well in database searching tests. In a related approach, Krogh and Mitchison proposed maximum entropy weights (Krogh and Mitchison, 1995Go). The recent PSIC (position-specific independent counts) program of Sunyaev et al. calculates sequence weights for each column in a multiple alignment (Sunyaev et al., 1999Go).

Sequence weighting is related to the problem of how to select representative protein data sets. Selection of representatives is key in not only classification of protein sequences (for example, see Bateman et al., 1999; Srinivasarao et al., 1999) and 3D structures (for instance, see Orengo et al., 1997; Holm and Sander, 1999) but also for protein target selection in structural genomics (Burley et al., 1999Go). As for the latter, the question of which targets should be chosen for high-throughput structure determination is far from academic: it is simply not feasible at present to determine the 3D structure of every protein encoded by the human genome. Of course, although the definition of representative is operationally defined, such a data set should combine `maximum coverage with minimum redundancy' (Hobohm et al., 1992Go). Usually, a representative set contains no pair of proteins with a sequence identity greater than a user-defined threshold (for example, see Boberg et al., 1992; Heringa et al., 1992; Hobohm et al., 1992; Holm and Sander, 1998). However, as far as the author is aware, it seems that the question of how to choose arbitrarily sized representative sets of aligned sequences from a multiple alignment has not been addressed until very recently (see below). In addition, there is a need for a method to quantify how representative such a set is of the whole group. Thus, it would be possible to summarize a multiple alignment. This data reduction could be useful for extracting the key features of large multiple sequence alignments for human inspection. For example, it might be desired not only to select a few representative proteins from a multiple alignment for clarity in a figure for publication but also to quantify the extent of coverage therein. Recently, Corpet et al. described an approach to derive what they called a `rich description' of a protein family on the basis of a multiple sequence alignment (Corpet et al., 1999Go). Hierarchical classification (see below) is used to define sub-families of sequences within the family. Consensus sequences are obtained at each tree node to represent the sub-families. The original alignment can then be summarized by replacement of several sequences by a consensus sequence. However, apart from the problems inherent in its dependence on hierarchical classification (see below), this approach offers no assessment of how representative the summary alignment of the consensus sequences is of the complete family.

Hierarchical classification is commonly used to group similar sequences into nested sets of clusters. The result is a rooted tree like a phylogenetic tree representing the evolutionary history of a group of species. The input for a hierarchical classification of N protein sequences is an NxN matrix of pairwise similarities or dissimilarities. The simplest similarity measure is percentage sequence identity. There are at least three problems with the use of hierarchical classification to cluster protein sequences to inform selection of a representative set from a multiple sequence alignment. First, a tree might not be the most suitable encapsulation of the distance relationships for a group of sequences. Of course, the validity of a tree can be assessed (May, 1999aGo, bGo). This was not done by Corpet et al. Second, a problem lies in how to determine the most appropriate number of subsets for a group given a hierarchy, i.e. where to `cut' the tree. This is an open problem in hierarchical classification (Everitt, 1993Go). Despite its subjectivity, visual inspection is often used to identify a level of similarity to act as the cut-point of a tree. By definition, this is not an option for an automatic method. Thus, the approach of Corpet et al. relies on a user-supplied maximum number of sub-families. Third, once the subsets have been chosen, the next task is to define a representative for each partition. Because hierarchical classification is sequentially optimal within the tree but not globally optimal over the whole tree, there is no guarantee that the partition representatives, however selected, will best represent the entire data. Clearly, this is a drawback of the procedure of Corpet et al. The hierarchical tree has a wide appeal in biology (for a review, see May, 1999b): recently, hierarchical classification has been used to cluster genome-wide expression data (for example, see Eisen et al., 1998). However, Tamayo et al. have questioned the suitability of hierarchical classification for this purpose (Tamayo et al., 1999Go). Instead, these authors use self-organizing maps, artificial neural networks particularly suited to such exploratory data analysis.

A major challenge in the classification of protein sequences and 3D structures is the formulation of operational definitions of useful groupings. For example, sequences are usually clustered into homologous families (i.e. proteins related by divergent evolution from a common ancestor) on the basis of `significant' sequence identity. The problem here is that there is not a single sequence identity threshold suitable to define membership for all protein families. In other words, some families (e.g. globins) are far more heterogeneous at the sequence level than others are. The same problem arises in the clustering of protein 3D structures into groups with common folds: some folds (e.g. the globin fold) are far more conserved over evolution than others are.

Cluster analysis is not restricted to a hierarchical tree representation and use of a dissimilarity matrix. Objects related by a linear ordering can be optimally assigned into clusters using a simple dynamic programming algorithm (Bement and Waterman, 1977; for a review of constrained classification, see Gordon, 1996; Hawkins and Merriam, 1973). Importantly, the use of dynamic programming means that such a clustering is optimal for all predefined numbers of subsets (Gordon, 1996Go). In addition, there is no need for an a priori threshold to delineate clusters. So, reduction of the protein sequence clustering problem to one of constrained classification not only obviates the need for imposing a tree on the data but also makes it possible to identify a globally optimal solution. Furthermore, unifying sequence weighting and sequence classification, the approach described here affords a new view of relationships within a multiple sequence alignment.


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Homologous protein families

The multiple sequence alignments used in this study are those in HOMSTRAD (Mizuguchi et al., 1998Go), a database of protein 3D structure alignments for homologous families. HOMSTRAD is available on the Web at http://www-cryst.bioc. cam.ac.uk/~homstrad/. An advantage of the use of homologous protein families is that assignment of topological equivalence is essentially unambiguous for proteins descended by evolution from a common ancestor. By definition, 3D structure comparison is that much more difficult and therefore subjective for distantly related proteins. (Of course, it is just the level of structure similarity among the members of the protein families in HOMSTRAD that gives further support to a claim of a clear evolutionary relationship.) In particular, the November 1999 full release of HOMSTRAD comprises 1193 proteins grouped into 209 families; the mean (±SD) number of structures per family is 5.7 (4.9). The 41 families comprising only two members are not considered in this analysis since, by definition, an alignment of only two protein sequences does not constitute a multiple alignment. The mean (±SD) number of structures per family is 6.6 (5.0) for the remaining 168 families. The size distribution of the 168 families, sorted numerically by size, is (number of structures, number of families): 3, 40; 4, 29; 5, 27; 6, 20; 7, 7; 8, 10; 9, 6; 10, 5; 11, 4; 12, 6; 13, 3; 14, 3; 15, 1; 18, 1; 20, 1; 21, 1; 22, 1; 26, 1; 27, 1; 41, 1.

Hierarchical classification of sequences on the basis of identity

A standard dissimilarity measure for a pair of aligned sequences is 100 – percentage sequence identity. Sequence identity is determined here over the multiply aligned positions, i.e. those aligned positions without a gap in any sequence. If a gap were considered as a different residue type then it is not clear how to treat matching gap characters in terms of sequence identity for a given pair. Alternatively, of course, sequence identity can be defined in terms of the length of the shorter sequence of a pair but this means then that the number of aligned positions is not constant for all comparisons. Furthermore, distances calculated from sets of pairwise alignments `cannot be guaranteed to be even close to Euclidean' (Forster et al., 1999Go). A dissimilarity matrix comprising dissimilarities for all pairs of sequences is an input data format for hierarchical classification. The most widely used algorithm for obtaining a hierarchical classification (Romesburg, 1984Go), the unweighted pair-group method using arithmetic averages (UPGMA), is used here. As in May (May, 1999aGo, bGo), the support for the clusters defined by a tree is assessed with a jackknife test (Lanyon, 1985Go).

Sequence weighting

By definition, sequences within a homologous family are not statistically independent. Sequence weighting can be used to describe the extent of sampling bias within a protein family. Six weighting schemes of which two are previously published (Vingron and Argos, 1989Go; Henikoff and Henikoff, 1994Go) are applied here.

A simple extension to the approach of Vingron and Argos (1989) is to use the familiar sum of squares formalization: weighting a sequence according to the sum of the squares of its pairwise distances (100 – percentage sequence identity) to all the others.

The other three weighting approaches are based on residue diversity over all positions in the aligned sequences. The variability of an aligned column of residues is represented in terms of information-theoretical entropy (Shenkin et al., 1991Go): the more unlike the site is, the higher the entropy. Thus, an aligned position occupied by only a single residue is accorded entropy 0. The challenge is to combine the information in all multiply aligned positions (Figure 1Go) without using a mean as in Henikoff and Henikoff (Henikoff and Henikoff, 1994Go). As a measure of central tendency, the mean is not appropriate for highly skewed distributions since it can be heavily influenced by extreme values. Of course, the mean is the most efficient measure of central tendency for normal distributions. A multiple sequence alignment comprising N multiply aligned positions can be represented as an N-dimensional vector where each element therein is the entropy at that position (Figure 1Go). We can jackknife a multiple alignment of M sequences by leaving out one sequence at a time and so construct M such N-dimensional entropy vectors, each representing M – 1 sequences. Use of vector algebra to describe the relationship between each of these vectors and that for all M sequences makes it possible to quantify the influence of each deleted sequence and so assign a weight. This is done in three ways here. First, the standard correlation coefficient can be used to describe the strength of association. Insensitive to size displacements between data profiles, this statistic considers only similarity in shape. Second, the standard scalar product is used: like the correlation coefficient, it is insensitive to proportional translations between data profiles but, unlike the correlation coefficient, it is sensitive to additive translations between data profiles. Finally, the Euclidean distance is used since it is sensitive to both additive and proportional translations. Hence each of these three measures has different properties. A benefit of the entropy vector algebra procedure with jackknife is that it permits comparison of a single sequence with M 1 sequences simultaneously in the context of the pattern of sequence variability across all M.



View larger version (37K):
[in this window]
[in a new window]
 
Fig. 1. Sequence variability across the 23 multiply aligned positions of the HOMSTRAD (Mizuguchi et al., 1998Go) zinc finger CCHH-type family (N = 15). Sequence variability is defined as the information-theoretical entropy (Shenkin et al., 1991Go) normalized with respect to the maximum value possible, log215, for N = 15. Thus, the relative variability at a multiply aligned position can range from 0 (i.e. only one residue type) to 1 (i.e. 15 residue types). The four invariant positions are the eponymous C, C, H, H residues.

 
Selection of representative sets of sequences

One definition of the most representative pair of sequences for a multiple alignment might be that pair of sequences for which the pairwise distance is closest to a measure of central tendency, e.g. the mean or the root mean square (r.m.s.), of the distribution of distances for all unique pairs of sequences. Of course, there is no guarantee that the arrangement of residue identities along the alignment of such a pair is most representative of the pattern across the entire set, especially if the proteins are not evolutionarily related. This suggests a more useful definition of the most representative pair: that pair of sequences for which there is the strongest positive linear correlation between the N-dimensional entropy vector describing the pair and that for all M sequences. Although we can assess each pair in turn on this criterion, extension of such a `generate and test' approach to select most representative sets of >2 sequences would clearly soon fall prey to a `combinatoric explosion'. In other words, the number of combinations of R sequences chosen from a set of N rapidly becomes very large for increasing values of R. This means that it quickly becomes prohibitive to evaluate each combination in turn. Instead an heuristic approach can be used: the repeated stepwise addition of a single sequence to the most representative set until all M sequences have been considered. Selection of a sequence at each step of this greedy algorithm is on the basis of that sequence which maximizes the level of similarity between the subset and all M sequences. This method is similar to an agglomerative algorithm for hierarchical classification in that each addition step is optimal subject to the constraint that once a sequence has been recruited to the subset it cannot be considered again as a potential recruit.

Constrained classification of sequences on the basis of weights

Sequences within a multiple alignment can be sorted into a list on the basis of the arithmetic order of their weights. Optimal partitioning of such a list with the algorithm of Hawkins and Merriam (Hawkins and Merriam, 1973Go) delineates a `core' of sequences with low weights (the most typical ones) and a corresponding `periphery' of atypical sequences with high weights. There are at least three advantages of the use of the Hawkins and Merriam partitioning method here. First, the user is not required to specify an arbitrary threshold for the division(s) (cf. the problem of where to `cut' a hierarchical classification to partition a set). Second, the resulting segmentation into core/periphery is globally optimal. Third, the use of dynamic programming guarantees that the clustering of N points is optimal for all predefined numbers of subsets k where 1 <= k <= N. It is important to note that use of the term `dynamic programming' here does not refer to the familiar biological sequence alignment algorithms. Instead, it refers to a general algorithmic technique for optimization of which the alignment algorithms are a specific instance. A measure of the variability of a segment is the sum of squared deviations of the constituent points about their mean. Defining the total within-segment variability W by summation for k segments, the globally optimal partitioning into k segments is that which minimizes W. This global optimization problem can be solved exactly by dynamic programming (Hawkins and Merriam, 1973Go). In addition, this approach also affords automatic identification of the most appropriate value of k. The improvement in W on addition of another segment is expressed in terms of `explained' variation: the percentage reduction in W with respect to W when k = 1, i.e. no segmentation. The optimal value of k is taken as that value of k that shows the largest increase in `explained' variation with respect to k – 1 (Everitt, 1993Go).

Two other measures are used here to order a set of protein sequences from a multiple alignment for constrained classification. First, the jackknife entropy sum for each of the M sequences is obtained by adding the entropy at each multiply aligned position in the N-dimensional entropy vector representing M – 1 sequences. By definition, the jackknife entropy sum will be highest for the most representative sequence since its exclusion means loss of the most central sequence. Use of vector algebra in determination of the above three entropy-based sequence weights means that they take into account the shape of the distribution of sequence diversity across the N multiply aligned positions. The jackknife entropy sum, however, averages over all columns and so ignores the shape of the distribution. Comparison of the two types of entropy-based measures will indicate families where averaging might not be the most appropriate approach to describe the distribution of sequence diversity across the multiple alignment. Second, the coefficient of variation (CV) of pairwise distances (100 – percentage sequence identity) is used to measure the relative spread of the pairwise distances for each sequence. The standard deviation (SD) is an absolute measure of the spread of a univariate distribution. CV is defined as the SD as a percentage of the mean. Outliers can be defined here as sequences with small CV values: these sequences are closest to being equidistant to the others. Clearly, it is not possible to identify outliers according to this criterion from a tree since such a data structure does not represent directly the spread of the pairwise distances for each sequence. Hence the use of the CV affords a new view of relationships among a set of protein sequences. Furthermore, since the weight accorded to a sequence by the method of Vingron and Argos (Vingron and Argos, 1989Go) is essentially its mean pairwise distance to the rest of the set, it is possible to investigate the relation between SD and mean for these distance distributions and so quantify the influence of SD on CV.

Relationship between sequence identity-based dissimilarity and sequence weight-based dissimilarity

For a given sequence weighting method, a dissimilarity measure between a pair of sequences can be defined as the absolute difference in weight between the two. The strength of association between the sequence identity-based dissimilarity matrix and that defined for each sequence weight is described in terms of the correlation coefficient. The nature of the relation for a family will depend on the relationships among the members. Consider a pair of identical sequences: although the absolute difference in weight is 0, the converse is not necessarily true.


    Results and discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Comparison between hierarchical classification and constrained classification

The result of an UPGMA hierarchical classification of N objects is a binary tree comprising N – 1 internal nodes. Each internal node represents a subset (cluster) of >=2 similar objects. The jackknife test of Lanyon (1985) to assess a tree requires a hierarchical classification of >=4 objects. Of the 128 families with at least four members, 79 yield meaningful UPGMA trees on the basis of 100 – percentage sequence identity as a dissimilarity measure. (A meaningful tree is defined as one where all the clusters are observed to be reliable according to the Lanyon method.) The mean (±SD) number of structures per family for the 79 is 5.7 (1.9). The zinc finger (CCHH-type) family (N = 15) is the one with the lowest relative proportion, 0.4, of reliable internal nodes: five out of 14 (Figure 2Go). By contrast, constrained classification of sequence weights sorted in arithmetic order (Table IGo) identifies all globally optimal core/periphery partitions for the family without the need for a user-supplied cut-off. According to the criterion of largest improvement in the percent fit for each new segment (see Materials and methods), the optimal number of segments is two (data not shown): the boundary is between 1ard and 1zaa3 (Table IGo).



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 2. Hierarchical classification of 15 zinc finger CCHH-type protein sequences (Mizuguchi et al., 1998Go) by the unweighted pair-group method using arithmetic averages (UPGMA) according to sequence identity. Protein names are PDB codes. Tree diagram plotted with DRAWGRAM, part of the PHYLIP package (Felsenstein, 1989Go). Five of the 14 internal nodes are found to be reliable according to a jackknife test (Lanyon, 1985Go). Other than the root node including all terminal taxa, which is always reliable in a rooted tree, the reliable internal nodes are (3znf 5znf 1bboN); (3znf 5znf); (1znf 1paa); (1zaa1 1zaa2).

 

View this table:
[in this window]
[in a new window]
 
Table I. Position-based sequence weights (Henikoff and Henikoff, 1994Go) for the HOMSTRAD (Mizuguchi et al., 1998Go) zinc finger CCHH-type family (N = 15), where the weights, sorted in arithmetic order, are calculated on the basis of the 23 multiply aligned positions only
 
There are eight constrained classifications of sequences for each of the 168 families: six on the basis of weights, the jackknife entropy sum and the CV of pairwise distances (100 – percentage sequence identity). In all 1344 cases (168x8), the optimal partitioning according to the criterion described in Everitt (Everitt, 1993Go) (see Materials and methods) is found to be into two clusters. This is an interesting result given the range in family size here from three to 41 sequences. The size of a core can be expressed in terms of the relative proportion of the total number of sequences within a family. The smallest mean (±SD) size of core, 0.53 (0.16), across the 168 families is obtained with entropy-based (linear correlation) weights and CV of pairwise distances (100 – percentage sequence identity). The largest mean (±SD) size of core, 0.65 (0.15), across the 168 families is found with entropy-based (Euclidean distance) weights.

CV-based definition of outlier sequences

One operational definition of outliers is those sequences with small CV of pairwise distances (100 – percentage sequence identity) since such sequences are closest to being equidistant to the others. Conversely, sequences with large CV values are those where the distribution of pairwise distances has a lot of spread. Of course, a large spread reflects a structure within the data: such sequences are closer to some than others. So, given that the CV describes the extent to which a sequence is part of a cluster, the core of a family will comprise sequences with large CV values. Use of CV in this way can lead to clustering which at first glance might appear counter-intuitive, but is entirely consistent. For example, consider a multiple alignment with no gaps of N sequences where N > 3 (by definition, more than a pair of sequences) and N <= 20 (the number of different amino acids). Each sequence comprises N – 1 residues. Each of N – 1 of the sequences are maximally (i.e. 0% sequence identity) equidistant from each other. The N 1 positions of the remaining sequence, however, consist of an amino acid that matches a residue in turn in one of the N 1 sequences. In other words, the last sequence is an exact intermediate between the N – 1 sequences (Park et al., 1997Go). Clearly, in terms of sequence weighting, the intermediate sequence is the most representative and so is accorded the lowest weight while the N – 1 other sequences are each assigned an equal higher weight. However, since the intermediate sequence is equidistant from each of the other N – 1 sequences, the CV of pairwise distances is 0.0% since SD is 0.0. Constrained classification of the CV values sorted in reverse arithmetic order delineates exactly a partition between a core of the N – 1 sequences and a periphery of the intermediate sequence. However, this assignment of the intermediate sequence as an outlier is entirely consistent with the atypical nature of this sequence on the basis of this criterion. [At the first step of a hierarchical classification of the N sequences on the basis of pairwise distance (100 – percentage sequence identity), the intermediate sequence is amalgamated with the first (in terms of order in the dissimilarity matrix) of the N – 1 sequences. Each of the remaining sequences are in turn joined to the initial cluster. The single linkage algorithm for hierarchical classification defines the distance between two clusters A and B as the smallest distance between an object in A and an object in B. This means that despite the order dependence problem, a single linkage tree demonstrates that each of the N – 1 sequences is equally close to the intermediate one. The UPGMA algorithm for hierarchical classification, however, defines the distance between two clusters A and B as the mean distance between an object in A and an object in B. Unfortunately, this then means that the order of the dissimilarity matrix becomes significant here.] Of the 168 families, five contain a sequence that is exactly equidistant in terms of sequence identity from the rest of the family members (CV is 0.0%; each of these families comprises only three sequences). Clearly, for a given family, the sequence with the smallest CV is closest to being equidistant to the rest. The largest value of CV for sequences with the smallest value within a family is 36.04% [3cox within the GMC oxidoreductase family (N = 4)]; mean (±SD) overall is 5.41 (5.94).

Relationship between mean and SD of distance distributions

The relationship between the CV of pairwise distances (100 – percentage sequence identity) and the Vingron and Argos weight (Vingron and Argos, 1989Go) reflects the correlation between the SD and mean of the distance distributions for each sequence. There is a negative correlation between CV values and Vingron and Argos weights for 158 of the 168 families (r < –0.8 for 120 families). A complete negative correlation corresponds to the situation where the low weight (small mean) sequences that comprise the core of a family are closer to some sequences than others (large SD). Further, the high weight (large mean) outlier sequences are the nearest to being equidistant to the others (small SD). There is a positive correlation for 10 of the 168 families (r > 0.8 for seven families). A complete positive correlation is consistent with the low weight (small mean) core sequences tending towards being equidistant to the rest (small SD) (cf. the intermediate sequence in the example above). Also, the high weight (large mean) outlier sequences are closer to some sequences than others (large SD).

Entropy-based sequence weights

Ideally, the relationship between the jackknife entropy sum and the Henikoff and Henikoff sequence weights (Henikoff and Henikoff, 1994Go) should be complete negative correlation (r = –1). This is because both average over the residue diversity at all multiply aligned positions. In practice, r < –0.8 for all 168 families: the least negative correlation is r = –0.88 for the hormone receptor (DNA-binding domain) family (N = 5). Having established that the jackknife entropy sum is usefully related to the Henikoff and Henikoff sequence weight, the next issue is the relationship between the jackknife entropy sum and the three entropy-based sequence weights derived by vector algebra (see Materials and methods). The idea here is to investigate the usefulness of combining the information from all multiply aligned positions in ways other than by averaging. The relationship between the jackknife entropy sum and entropy-based correlation coefficient sequence weights is r > –0.8 for 22 families. As a similarity measure, the correlation coefficient ignores both additive and proportional translations between data profiles. Thus, complete positive correlation (r = 1) occurs when data profiles share a common `shape' regardless of size displacement (if any) between them. In contrast, the scalar product is sensitive to additive translations between data profiles but ignores proportional translations. In fact, 15 of the 22 families with jackknife entropy sum and entropy-based correlation coefficient sequence weights r > –0.8 have r < –0.8 for jackknife entropy sum and entropy-based scalar product sequence weights. In other words, additive translations between the entropy vectors are important for these 15 families. An example here is the CoA-dependent acetyltransferase family (N = 4) where 3cla is accorded the highest weight according to the method of Henikoff and Henikoff (Henikoff and Henikoff, 1994Go). 3cla has the smallest CV of pairwise distances (100 – percentage sequence identity), 5.0%, which makes it the closest to being equidistant to the others. This means that when 3cla is left out, the N-dimensional entropy vector is nearest to being parallel with that for all the sequences. On this basis, 3cla is assigned the lowest entropy-based correlation coefficient sequence weight. Euclidean distance is sensitive to both additive and proportional translations between data profiles. Of the seven families with jackknife entropy sum and entropy-based correlation coefficient sequence weights r > –0.8 and jackknife entropy sum and entropy-based scalar product sequence weights r > –0.8, six (epidermal growth factor-like domain N = 12; fibronectin type III domain N = 14; thiamine pyrophosphate enzymes N = 4; thioredoxin N = 6; transferrin N = 7; zinc finger CCHH-type N = 15) have r < –0.8 for jackknife entropy sum and entropy-based Euclidean distance sequence weights. Hence proportional translations between the entropy vectors are important for these six families. There is only one family, that of the hormone receptor (DNA-binding domain) (N = 5), where r > –0.8 for jackknife entropy sum and all three entropy-based sequence weights. Clearly, averaging entropy over all multiply aligned positions (jackknife entropy sum) does not always correspond to putting together the data without averaging. This is to be expected since the mean is not applicable for highly skewed distributions and is less efficient than other measures of central tendency when extreme values are possible. Hence an advantage of the entropy vector algebra approach presented here is that it allows direct consideration of the distribution of entropy across a multiple sequence alignment (Figure 1Go).

Relationship between sequence identity-based dissimilarity and sequence weight-based dissimilarity

A dissimilarity measure can be defined between a pair of sequences as the absolute difference in sequence weight between the two. For a given sequence weighting protocol, a protein family can be characterized in terms of the relationship (correlation coefficient) between sequence identity-based dissimilarity and sequence weight-based dissimilarity. The value of r varies across all six weighting schemes for 161 of the 168 families. There are two families (TATA-box binding protein, C-terminal domain N = 4; uracil-DNA glycosylase N =3) where there are five unique values of r: in both cases the same value is shared between the position-based Henikoff and Henikoff and identity-based Vingron and Argos weights (Henikoff and Henikoff, 1994Go; Vingron and Argos, 1989Go). The remaining five families, all comprising three structures, have a single unique value of r: –1.0 for cytokine granulocyte colony stimulating factor and enoyl-CoA hydratase and 1.0 for ciliate pheromone, microbial ribonucleases and thionin. The identity-based sum of squares sequence weighting scheme is the most frequently occurring [N = 69; mean (±SD) r = 0.66 (0.21); minimum r = 0.13; maximum r = 1.0] of those with the highest value of r for each family across the 161 families with six unique values of r.

Factors influencing the relationship between sequence identity-based dissimilarity and sequence weight-based dissimilarity

There are four factors influencing the relation (r) between sequence identity-based dissimilarity and sequence weight-based dissimilarity for a family. First, given that identical sequences are assigned identical weights, increasing the number of copies of a sequence within an otherwise unchanged multiple alignment leads to an increase in r. Thus, similar sequences are allocated similar weights. Second, it is not necessarily true, however, that identical (similar) weights imply identical (similar) sequences [for example, see `the highly idealized example' alignment of uniform sequences in Table IGo of Sibbald and Argos (Sibbald and Argos, 1990Go)]. This leads to a decrease in r. For instance, of the 128 families with >=4 members, cytochrome c' N = 4 has the most negative r (–0.67) for the association between sequence identity-based dissimilarity and identity-based sum of squares sequence weight-based dissimilarity. Here the joint most distant pair, 2ccya and 1bbha (21.0% sequence identity), is closest in terms of absolute difference in sum of squares sequence weights. Third, although dissimilar sequences receive dissimilar weights resulting in an increase in r, the effect is enhanced when there are clear outlier(s) that are also close to being equidistant to the rest of the set. For example, of the 69 families where the identity-based sum of squares sequence weights give the highest value of r, the most positive value (1.00) is obtained for the immunoglobulin-binding domain N = 4 family. Here 2ptl not only is an outlier in terms of the UPGMA tree of sequence relationships (all internal nodes are reliable) but also is nearly equidistant to the rest: CV of pairwise distances (100 – percentage sequence identity) = 2.16%. In contrast, consider the thioredoxin N = 6 family, that family of the 168 with r nearest 0 (0.03) for the relation between sequence identity-based dissimilarity and identity-based sum of squares sequence weight-based dissimilarity. Here the most distant pair, 1aaza and 3trx (7.5% sequence identity), is second closest in terms of absolute difference in sum of squares sequence weights. The initial bifurcation of the UPGMA tree of sequence relationships (all internal nodes are reliable) is between (1thx 2trxa 3trx) and (1aaza 1ego 3grx). 3trx is an outlier with respect to (1thx 2trxa) while 1aaza is an outlier in relation to (1ego 3grx). Hence neither 3trx nor 1aaza is an outlier with respect to the rest of the set. Of course, use of the square term means that identity-based: sum of squares sequence weights are more sensitive to outliers (i.e. upweights them) than identity-based Vingron and Argos weights. In turn, this means that absolute weight differences for comparisons involving outliers will be increased. Fourth, sequences assigned dissimilar weights are not necessarily unrelated: the absolute weight difference between 1aaza and 3grx is third highest of the 15 unique pairs in the thioredoxin N = 6 family and their sequence identity (37.3%) is also third highest. To conclude, the relation between sequence identity-based dissimilarity and sequence weight-based dissimilarity for a family is a useful indicator of intra-family sequence relationships.

Selection of representative sets of sequences

The entropy-based most representative pair of sequences for a family is defined as that pair for which there is the most positive r between the N-dimensional entropy vector characterizing the pair and that for all M sequences. The sum of squares-based most representative pair of sequences is taken as that pair of sequences for which the pairwise distance is nearest to the r.m.s. distance (a measure of central tendency) for all unique pairs. The relationship between the two pairs so selected can be summarized in terms of the ratio of the correlation coefficients for the association between each one and the complete set of sequences. The two pairs coincide (ratio of r = 1.0) in 18 of the 168 families; the largest family of the 18 is serine proteinase inhibitor – Kunitz type N = 10. The ratio of r is <0.8 for 21 families; the largest family of the 21 is the largest of all 168: globin N = 41. By definition, the organization of residue identities along the alignment of the entropy-based most representative pair of sequences is closest to the arrangement of sequence diversity across the entire family. Clearly, it is not certain that the sum of squares-based most representative pair of sequences will also be the entropy-based most representative one. A problem with the pairwise distance (100 – percentage sequence identity) as a dissimilarity measure here is that percentage sequence identity considers only the count of residue identities in two sequences (cf. the Hamming distance) and does not say anything about the distribution of the matches as specified by an alignment. Hence, although it is possible to identify that pair of sequences with dissimilarity closest to the `central' value (as defined by the r.m.s. or any other measure of central tendency), there is no guarantee that the placement of identities therein is most similar to the pattern of sequence diversity across the multiple alignment. Clearly, an advantage of the entropy-based method then is that the most representative pair of sequences is selected on the basis of its direct comparison with all M sequences simultaneously. Of course, if the sum of squares-based most representative pair of sequences is the same as that chosen on the basis of entropy relationships, then the `central' pairwise distance is derived from a distribution of residue identities that is most similar to that for the complete family.

The stepwise construction of entropy-based most representative sets starting from the optimal pair can be analysed in terms of the value of r for the relationship between the N-dimensional entropy vector describing the set at each step and that for all M sequences (Table IIGo). Constrained classification of the sets according to r allows automatic identification of a minimal set of sequences to represent a family. Thus, a minimal set can be defined as the smallest set of sequences that cluster with all M sequences in the optimal constrained classification. In 167 of the 168 families the optimal partitioning according to the criterion described in Everitt (Everitt, 1993Go) (see Materials and methods) is into two clusters. The one exception is the metallothionein alpha-domain family N = 3 where the optimal number of groupings is one, i.e. the entropy-based most representative pair of sequences, 1mrb and 1mrt (90.3% sequence identity), is deemed to embody the family completely (r = 1.0). Of the 31 multiply aligned positions, only three are not identities: 9, 15 and 22. The residues at these positions are (1mhu, 1mrb, 1mrt): V, P, V; A, A, S; G, G, E. The information-theoretical entropy (S) is 0.92 at each. Clearly, it is only the ninth position that differs between 1mhu and 1mrb (i.e. 96.8% sequence identity). Given that 1mhu and 1mrt match at this position, then the pair 1mrb and 1mrt is selected as the entropy-based most representative one.


View this table:
[in this window]
[in a new window]
 
Table II. Entropy-based selection of representative sets of sequences for the HOMSTRAD (Mizuguchi et al., 1998Go) zinc finger CCHH-type family (N = 15)
 
According to the above definition of a minimal set, of the 167 families with optimal partitioning into two clusters, 41 cannot be encapsulated in a minimal set. Not surprisingly, the 41 comprise all families with three structures (except, of course, the metallothionein alpha-domain family). In addition, the 41 contains two families consisting of four members: CoA-dependent acetyltransferase and cytochrome c'. Given M sequences in a family, the mean (±SD) difference in r from 1.0 for the M – 1 sets across the 41 without a minimal set is 0.17 (0.06); minimum = 0.03 for metallothionein beta-domain family N = 3; maximum = 0.30 for bacterial exopeptidases N = 3. The metallothionein beta-domains are more variable than the alpha-domains: of the 30 multiply aligned positions, seven are not identities. Six of the seven variable positions are of the same type as the three in the alpha-domains, comprising a pair of matching residues and another one (i.e. S = 0.92). The 23rd position, however, contains three different residues (S = 1.58). Maximal variability here explains why r for the entropy-based most representative pair, 2mrb and 2mrt, is not quite 1.0.

The mean (±SD) relative size of the minimal set for the 126 families containing one is 0.56 (0.17). The smallest minimal set as a proportion of the total number of structures within a family is 0.10 for the globins N = 41, i.e. four members: 2pghb 1spgb 2mhba 1a6m. The largest minimal set as a proportion of N is 0.8 for nine families (data not shown) for all of which N = 5, i.e. again four members.

Reduction of a large multiple sequence alignment to a minimal set facilitates human inspection. Of course, the availability of complete genome sequences has only served to increase the need for such data reduction. It must be emphasized that the algorithm presented here for selection of representative sets is only an heuristic. However, it does have the advantage that it is deterministic and so the results presented here are meaningful. An extension of the current approach might be to use a Monte Carlo algorithm to find the combination of sequences that maximizes the value of the entropy-based objective function described here. Furthermore, it might be useful to report the value of r for published alignments where `only selected proteins are shown' in order to quantify the degree of comprehensiveness therein. An independent check of the criterion proposed for selection of a representative set could be to test the efficacy of a profile (Gribskov et al., 1987Go) constructed from such a set for searching a database for family members.

To conclude, there is a clear need for automatic methods to summarize large multiple sequence alignments to facilitate not only human inspection but also protein target selection for structural genomics. Also, to the best of my knowledge, this is the first time that a method for classification of protein sequences has been described which is guaranteed to produce globally optimal solutions. Use of sequence weighting to inform clustering allows a novel view of relationships within a multiple sequence alignment. A new sequence weighting method is proposed: one that considers directly the distribution of sequence variability along an alignment rather than averaging over all positions.

The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/~amay.


    Acknowledgments
 
I thank Jaap Heringa and Willie Taylor for helpful discussions.


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Altschul,S.F., Carroll,R.J. and Lipman,D.J. (1989) J. Mol. Biol., 207, 647–653.[ISI][Medline]

Amitai,M. (1998) Science, 282, 1436–1437.[Free Full Text]

Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Finn,R.D. and Sonnhammer,E.L.L. (1999) Nucleic Acids Res., 27, 260–262.[Abstract/Free Full Text]

Bement,T.R. and Waterman,M.S. (1977) Math. Geol., 9, 55–61.

Blundell,T.L., Sibanda,B.L., Sternberg,M.J.E. and Thornton,J.M. (1987) Nature, 326, 347–352.[ISI][Medline]

Boberg,J., Salakoski,T. and Vihinen,M. (1992) Proteins, 14, 265–276.[ISI][Medline]

Boguski,M.S. (1999) Science, 286, 453–455.[Abstract/Free Full Text]

Bult,C.J. et al. (1996) Science, 273, 1058–1073.[Abstract]

Burley,S.K. et al. (1999) Nature Genet., 23, 151–157.[ISI][Medline]

Casari,G., Sander,C. and Valencia,A. (1995) Nature Struct. Biol., 2, 171–178.[ISI][Medline]

Corpet,F., Gouzy,J. and Kahn,D. (1999) Bioinformatics, 15, 1020–1027.[Abstract/Free Full Text]

Eisen,M.B., Spellman,P.T., Brown,P.O. and Botstein,D. (1998) Proc. Natl Acad. Sci. USA, 95, 14863–14868.[Abstract/Free Full Text]

Everitt,B. (1993) Cluster Analysis. 3rd ed. Edward Arnold, London.

Felsenstein,J. (1989) Cladistics, 5, 164–166.

Forster,M.J., Heath,A.B. and Afzal,M.A. (1999) Bioinformatics, 15, 89–90.[Abstract/Free Full Text]

Gerstein,M., Sonnhammer,E.L. and Chothia,C. (1994) J. Mol. Biol., 236, 1067–1078.[ISI][Medline]

Gordon,A.D. (1996) Comput. Statist. Data Anal., 21, 17–29.[ISI]

Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Proc. Natl Acad. Sci. USA, 84, 4355–4358.[Abstract]

Hawkins,D.M. and Merriam,D.F. (1973) Math. Geol., 5, 389–395.

Henikoff,S. and Henikoff,J.G. (1994) J. Mol. Biol., 243, 574–578.[ISI][Medline]

Heringa,J., Sommerfeldt,H., Higgins,D. and Argos,P. (1992) Comput. Appl. Biosci., 8, 599–600.[Abstract]

Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417.[Abstract/Free Full Text]

Holm,L. and Sander,C. (1998) Bioinformatics, 14, 423–429.[Abstract]

Holm,L. and Sander,C. (1999) Nucleic Acids Res., 27, 244–247.[Abstract/Free Full Text]

Krogh,A. and Mitchison,G. (1995) ISMB, 3, 215–221.[Medline]

Lanyon,S.M. (1985) Syst. Zool., 34, 397–403.[ISI]

Lichtarge,O., Bourne,H.R. and Cohen,F.E. (1996) J. Mol. Biol., 257, 342–358.[ISI][Medline]

Lockless,S.W. and Ranganathan,R. (1999) Science, 286, 295–299.[Abstract/Free Full Text]

May,A.C.W. (1999a) Struct. Fold. Des., 7, R213.[ISI][Medline]

May,A.C.W. (1999b) Proteins, 37, 20–29.[ISI][Medline]

Mirny,L.A. and Shakhnovich,E.I. (1999) J. Mol. Biol., 291, 177–196.[ISI][Medline]

Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) Protein Sci., 7, 2469–2471.[Abstract/Free Full Text]

Oliver,S.G. (1996) Nature, 379, 597–600.[ISI][Medline]

Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.[ISI][Medline]

Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) J. Mol. Biol., 273, 349–354.[ISI][Medline]

Pollock,D.D., Taylor,W.R. and Goldman,N. (1999) J. Mol. Biol., 287, 187–198.[ISI][Medline]

Ptitsyn,O.B. (1998) J. Mol. Biol., 278, 655–666.[ISI][Medline]

Ptitsyn,O.B. and Ting,K.L. (1999) J. Mol. Biol., 291, 671–682.[ISI][Medline]

Romesburg,H.C. (1984) Cluster Analysis for Researchers. Lifetime Learning Publications, Belmont, CA.

Sander,C. and Schneider,R. (1991) Proteins, 9, 56–68.[ISI][Medline]

Shenkin,P.S., Erman,B. and Mastrandrea,L.D. (1991) Proteins, 11, 297–313.[ISI][Medline]

Sibbald,P.R. and Argos,P. (1990) J. Mol. Biol., 216, 813–818.[ISI][Medline]

Smith,T.F. (1999) Struct. Fold. Des., 7, R7–R12.[Medline]

Smith,T.F. and Waterman,M.S. (1981) J. Mol. Biol., 147, 195–197.[ISI][Medline]

Snel,B., Bork,P. and Huynen,M.A. (1999) Nature Genet., 21, 108–110.[ISI][Medline]

Srinivasarao,G.Y., Yeh,L.S.L., Marzec,C.R., Orcutt,B.C. and Barker,W.C. (1999) Bioinformatics, 15, 382–390.[Abstract/Free Full Text]

Sunyaev,S.R., Eisenhaber,F., Rodchenkov,I.V., Eisenhaber,B., Tumanyan,V.G. and Kuznetsov,E.N. (1999) Protein Eng., 12, 387–394.[Abstract/Free Full Text]

Tamayo,P., Slonim,D., Mesirov,J., Zhu,Q., Kitareewan,S., Dmitrovsky,E., Lander,E.S. and Golub,T.R. (1999) Proc. Natl Acad. Sci. USA, 96, 2907–2912.[Abstract/Free Full Text]

Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Comput. Appl. Biosci., 10, 19–29.[Abstract]

Vingron,M. and Argos,P. (1989) Comput. Appl. Biosci., 5, 115–121.[Abstract]

Vingron,M. and Sibbald,P.R. (1993) Proc. Natl Acad. Sci. USA, 90, 8777–8781.[Abstract]

Received August 21, 2000; revised February 1, 2001; accepted February 15, 2001.