Secator: A Program for Inferring Protein Subfamilies from Phylogenetic Trees

Nicolas Wicker, Guy René Perrin, Jean Claude Thierry and Olivier Poch

LSIIT-ICPS (AXE E), UPRES-A CNRS 70005 Université Louis Pasteur, Illkirch, France
Laboratoire de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire CNRS/INSERM/ULP, Illkirch, France


    Abstract
 
With the huge increase in protein data, an important problem is to estimate, within a large protein family, the number of sensible subsets for subsequent in-depth structural, functional, and evolutionary analyses. To tackle this problem, we developed a new program, Secator, which implements the principle of an ascending hierarchical method using a distance matrix based on a multiple alignment of protein sequences. Dissimilarity values assigned to the nodes of a deduced phylogenetic tree are partitioned by a new stopping rule introduced to automatically determine the significant dissimilarity values. The quality of the clusters obtained by Secator is verified by a separate jackknife study. The method is demonstrated on 24 large protein families covering a wide spectrum of structural and sequence conservation, and its usefulness and accuracy with real biological data are illustrated on two well-studied protein families (the Sm proteins and the nuclear receptors).


    Introduction
 
With the rapid growth of the sequence databases, the number of sequences belonging to a particular functionally related protein family is increasing sharply. As a consequence, it is becoming more and more necessary for biologists to analyze the relationships existing between the numerous members of a protein family and categorize them into sensible subfamilies. Subfamilies are frequently representative of sets of proteins with related functions and/or distinct domain organizations resulting from different evolutionary histories. Clustering approaches have until now been focused on the discovery of groups of homologous proteins in entire protein databases (Wolf et al. 1999; Enright and Ouzounis 2000; Krause, Stoye, and Vingron 2000; Tatusov et al. 2000) based on single-sequence similarity search algorithms such as BLAST (Altschul et al. 1997) and FASTA (Pearson 1994). However, these methods are not suitable for in-depth phylogenetic studies of a predefined set of proteins. Inference of subfamilies in sets of homologous proteins remains crucial in order to gain insight into their real functional and evolutionary relationships. This is usually done by collapsing internal branches of a phylogenetic tree, either manually, using a graphical tool such as the TreeView program (Page 1996), or in a semiautomatic way with sequence grouping guided by the reliability of the branching order. The latter method was used, for example, to define groups among receptors (Nuclear Receptors Nomenclature Committee 1999) and among myosin sequences (Hodge and Cope 2000). Phylogenetic trees have also been used to group sequences (Lichtarge, Bourne, and Cohen 1996; Corpet, Gouzy, and Kahn 1999), but in both cases, the user must define the maximum distance or the minimum percentage of identity required for sequences to belong to the same group.
To our knowledge, only one algorithm has been proposed which addresses the problem of automatic clustering of probable functional subfamilies in a phylogenetic tree (Sjolander 1998). This algorithm is based on the minimization of an encoding cost of the multiple alignment of a set of proteins.

Here we present a new program called Secator which is based on a different principle and has the advantage that it is fully automatic. The first step is to create a tree from a distance matrix based on a multiple alignment using BIONJ (Gascuel 1997). The program assigns a dissimilarity value to each node in the tree and then collapses branches by automatically detecting the nodes joining distant subtrees (NJDSTs). The method was validated on 24 protein families and is illustrated using two well-studied protein families: the Sm proteins (Salgado-Garrido et al. 1999) and the nuclear receptors (Wurtz et al. 1996). Our automatic partitioning is in good agreement with previously defined subfamilies grouped according to biological data. In addition, the program distinguished five main subfamilies among the 233 nuclear receptors from Caenorhabditis elegans that have been predicted and aligned (J. Fagart, personal communication).


    Materials and Methods
 
Determination of Subfamilies of Proteins from a Phylogenetic Tree of n Sequences
Given are an unrooted tree, an n-by-n distance matrix, n sequence weights either all equal to 1 (by default) or calculated by Secator using the algorithm described in Thompson, Higgins, and Gibson (1994), and an integer resolution value (R) which enables the user to ask for more or fewer groups than in the original clustering by setting it to a positive or negative value, respectively. R is set to 0 by default.

  1. Initially, each sequence forms a different family of proteins, and a dissimilarity value between each pair (i, j) of families is calculated according to the formula


    where wseqi and wseqj are, respectively, the weights of the sequences i and j, and d(seqi, seqj) is the distance between the sequences i and j as given by the distance matrix.

  2. While the number of families is greater than two do

  3. The dissimilarity value between the two remaining families is assigned to a virtual node (node_V in fig. 1).
  4. The nodes are clustered into two groups, the group with high dissimilarity values and the group with low dissimilarity values. This clustering is done by computing the partition into two groups which has the maximum interclass inertia on a subset of all possible partitions. Initially, D = {Di}, i = 1, ..., n, is the set of all the dissimilarity values sorted in decreasing order, and g is the mean of D.
    For i = 2 to n do
    Partition the dissimilarity values into two groups Ei and Fi, where Ei is the group of high dissimilarity values and Fi is the group of low dissimilarity values.


    where we = |Ei|, wf = |Fi|, and d is the usual distance.
    The best partitioning is given by the pair (Ek, Fk), for which the corresponding Ik is the highest. This partitioning produces a threshold value of dissimilarity (TD) which is the highest value of Fk.

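  5. For |R| times do
    Repeat the partitioning of step 4 on the group of low dissimilarity values (if R is positive) or on the group of high dissimilarity values (if R is negative).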

  6. The NJDSTs are defined as the nodes with dissimilarity values above the TD. Then, from the leaves of the tree up to the internal branches, branches are collapsed until NJDSTs are met. For example, in figure 1, a, b, and c are collapsed.
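The two displayed formulas of steps 1 and 4 are missing from this copy of the text. As a hedged reconstruction only, assuming that the dissimilarity is Ward's criterion (which the Results and Discussion name as the measure used) and that the interclass inertia takes its usual two-group form, they would read roughly as below. The symbols \delta, g_{E_i}, and g_{F_i} (the means of the groups E_i and F_i) are notation introduced here, and whether the distance enters squared in the first formula is an assumption.

```latex
% Step 1 (assumed Ward form for two single-sequence families i and j)
\delta(i, j) = \frac{w_{seq_i}\, w_{seq_j}}{w_{seq_i} + w_{seq_j}}\; d(seq_i, seq_j)^2

% Step 4 (usual interclass inertia of the partition (E_i, F_i))
I_i = w_e\, d(g_{E_i}, g)^2 + w_f\, d(g_{F_i}, g)^2,
\qquad w_e = |E_i|, \quad w_f = |F_i|
```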



Fig. 1.—Example of a tree (A) before and (B) after collapsing. V = virtual node; branches are collapsed from the leaves up to the internal branches until nodes joining distant subtrees (dots) are met

 
Implementation
The method presented here is implemented in the program Secator, which is written in C and should run on any UNIX machine. The program takes as input either a distance matrix in PHYLIP format or a multiple alignment in MSF or FASTA format. In the latter case, distances between the sequences are based on percentages of residue identity. A phylogenetic tree is then calculated using BIONJ (Gascuel 1997). Secator produces two output files: the collapsed tree in PHYLIP format and, when an alignment is given, a table of the sequence groups with their mean distance (MD) scores (Thompson et al. 2000). In addition to the resolution and weighting parameters, the user can also choose whether to conserve the distances in the final tree or to produce a multifurcating tree. A jackknife option provides an assessment of the quality of the clustering and of the number of groups. The program and multiple alignments are available at http://www-bio3d-igbmc.u-strasbg.fr/~wicker/Secator/secator.html.
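As an illustration of how an alignment could be turned into a distance matrix, the following minimal Python sketch computes pairwise distances as one minus the fraction of identical residues. It is not taken from the Secator source: the function names, the treatment of gap columns (ignored here), and the value returned for sequences with no comparable positions are all assumptions made for the example.

```python
from typing import List


def identity_distance(a: str, b: str, gap: str = "-") -> float:
    """Distance = 1 - fraction of identical residues over ungapped columns.

    Assumption: columns with a gap in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 1.0  # assumption: no comparable column means maximal distance
    identical = sum(1 for x, y in pairs if x == y)
    return 1.0 - identical / len(pairs)


def distance_matrix(aligned: List[str]) -> List[List[float]]:
    """Symmetric n-by-n matrix of identity-based distances."""
    n = len(aligned)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = identity_distance(aligned[i], aligned[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy alignment (hypothetical sequences, for illustration only)
toy = ["ACDE", "ACDF", "AC-E"]
print(distance_matrix(toy))
```

A matrix of this kind would then be written in PHYLIP format or passed to a tree-building step such as BIONJ, which is outside the scope of this sketch.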


    Results and Discussion
 
The novelty of our method is the automatic clustering of the nodes of a phylogenetic tree to define probable functional subfamilies. This is realized by labeling each node with a dissimilarity value, which gives an objective estimation of the divergence of its external sub-branches. Ward's aggregative dissimilarity measure is preferred to the usual single-linkage or complete-linkage hierarchical clustering, which are also broadly used. Indeed, the progressive aspect of sequence data would tend to create oversized clusters using the single linkage. As for the complete linkage, it creates groups too compact to deal with such a sparse set of data. When an alignment is submitted, the dissimilarity is based on percentage of identity because it is the measure that is least sensitive to the physicochemical bias of the studied sequences (e.g., transmembrane sequences). However, the user has the option of providing other distances in the form of a matrix.

When dissimilarity values are above an automatically computed threshold (TD), the external subtrees are assumed to be "unmergeable," and the nodes are designated NJDSTs. When all NJDSTs have been inferred, all branches are collapsed from the leaves up to an NJDST and the corresponding sequences are clustered into a subfamily.
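The collapsing step described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the program's actual code: the tree is rooted at the virtual node, each internal node already carries its dissimilarity value, and the `Node` class and `collapse` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str = ""
    dissimilarity: float = 0.0  # unused for leaves
    children: List["Node"] = field(default_factory=list)


def collapse(node: Node, td: float) -> Tuple[List[List[str]], bool]:
    """Return (groups of leaf names, whether this subtree contains an NJDST).

    A subtree is merged into one group only if neither its root nor any
    node below it has a dissimilarity value above the threshold td.
    """
    if not node.children:
        return [[node.name]], False
    results = [collapse(child, td) for child in node.children]
    has_njdst_below = any(flag for _, flag in results)
    if node.dissimilarity > td or has_njdst_below:
        # Collapsing stops here: keep the groups found in each child.
        groups = [g for child_groups, _ in results for g in child_groups]
        return groups, True
    # No NJDST in this subtree: all of its leaves form a single group.
    merged = [leaf for child_groups, _ in results
              for g in child_groups for leaf in g]
    return [merged], False


# Toy tree rooted at a virtual node V (values are illustrative only)
tree = Node("V", 10.0, [
    Node("A", 1.0, [Node("a"), Node("b")]),
    Node("B", 8.0, [Node("C", 0.5, [Node("c"), Node("d")]), Node("e")]),
])
groups, _ = collapse(tree, td=5.0)
print(groups)  # [['a', 'b'], ['c', 'd'], ['e']]
```

In the toy tree, V and B exceed the threshold and act as NJDSTs, so the leaves below A and C are each merged into a subfamily while e remains on its own.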

The major problem is to automatically determine a suitable threshold for high dissimilarity values. In ascending hierarchical clustering, where dissimilarity values can be calculated in the same way, the threshold above which the obtained dendrogram should be cut is usually found manually by looking at the elbow of the curve of the dissimilarity values sorted in descending order (fig. 2A and D). Many stopping rules provide an automatic threshold (Milligan and Cooper 1985); however, these rules are not generally suitable when the differences between clusters are fuzzy. We present a new stopping rule of geometric nature which focuses on the clustering essence of the stopping rule.



Fig. 2.—Dissimilarity value curves and phylogenetic trees before and after collapse of the Sm protein family (A, B, and C) and of the nuclear receptor family (D, E, and F). Sm subtypes are numbered from 1 to 7: subtype 1—SmB and SmN; subtype 2—SmD1 and Lsm2; subtype 3—SmD2, Lsm3, and archaeal proteins; subtype 4—SmD3 and Lsm4; subtype 5—SmE and Lsm5; subtype 6—SmF and Lsm6; subtype 7—SmG and Lsm7. The number 8 represents the subfamily including Lsm1 and Lsm1-related sequences. Subfamilies of Caenorhabditis elegans are denoted Ce-1 to Ce-5 (F), and the other subfamilies or groups are noted as in Nuclear Receptors Nomenclature Committee (1999). Full-length alignments and accession numbers are available as supplementary materials at http://www-bio3d-igbmc.u-strasbg.fr/~wicker/Secator/secator.html.

 
As our goal is to separate a set of high values from a set of low ones, a clustering method appears quite appropriate, particularly a squared error partitional method. This method is expected to give good results because the number of groups of dissimilarity values (two) is known and the values to be clustered are linearly separated. In addition, the exact optimum can be calculated since the number of solutions to evaluate is relatively small. Indeed, if there are n dissimilarity values to cluster, only n - 1 solutions must be estimated to find the optimal solution. These solutions are obtained by separating the dissimilarity values into two groups, where all the values of one group are higher than those of the other group. The NJDST with the lowest dissimilarity value is typically situated in the elbow of the curve.
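The exhaustive search over the n - 1 ordered splits can be sketched in Python as below. This is a minimal illustration that assumes the interclass inertia is the usual size-weighted squared distance of each group mean to the overall mean; the function name and return convention are choices made for the example.

```python
from typing import List, Sequence, Tuple


def threshold_of_dissimilarity(
    values: Sequence[float],
) -> Tuple[float, List[float], List[float]]:
    """Split values into a high and a low group maximizing interclass inertia.

    Returns (TD, high group, low group), where TD is the highest value
    of the low group.
    """
    d = sorted(values, reverse=True)
    n = len(d)
    if n < 2:
        raise ValueError("at least two dissimilarity values are required")
    total = sum(d)
    g = total / n  # overall mean
    best_k, best_inertia = 1, -1.0
    prefix = 0.0
    for k in range(1, n):  # high = d[:k], low = d[k:]; n - 1 candidates
        prefix += d[k - 1]
        mean_high = prefix / k
        mean_low = (total - prefix) / (n - k)
        inertia = k * (mean_high - g) ** 2 + (n - k) * (mean_low - g) ** 2
        if inertia > best_inertia:
            best_k, best_inertia = k, inertia
    return d[best_k], d[:best_k], d[best_k:]


# Illustrative values only
td, high, low = threshold_of_dissimilarity([0.5, 9, 0.1, 10, 1, 8, 0.2])
print(td, high, low)  # 1 [10, 9, 8] [1, 0.5, 0.2, 0.1]
```

Because the values are sorted once and a running sum is kept, the search costs O(n log n) overall, so evaluating every candidate split exactly is inexpensive.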

The algorithm takes as input two parameters (resolution and weight), which enable the user to investigate the partitioning of the protein set in the light of biological knowledge. Changing the resolution parameter allows the user to select the depth of the clustering and the resulting number of subfamilies. By default, the partition method is performed only once (resolution = 0). The method is iterated on the low or high dissimilarity values for positive or negative resolutions, respectively. The absolute value of the resolution represents the number of additional iterations. The sequences are assigned equal weights by default, as this has proved to be the best setting in general. Indeed, analysis of various multiple alignments (see below) revealed that weighting of a group of highly similar sequences (typically protein sequences from organisms belonging to the same genus or closely related at the evolutionary level) frequently makes the total weight of the group negligible compared with a group of more weakly related sequences, inducing an inadequate merging. Nevertheless, sequences may be weighted according to percentage of residue identity if the weighting parameter is selected. This may enhance the differentiation of a small subgroup of weakly related sequences which are topologically close to a larger group on the tree.
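The effect of the resolution parameter can be sketched as follows. This is an illustrative reading of the description above rather than the program's code: the two-group split is assumed to maximize the usual interclass inertia, and the helper names `split_index` and `threshold_with_resolution`, as well as the rule of stopping early when a group has fewer than two values, are assumptions.

```python
from typing import List, Sequence


def split_index(desc: List[float]) -> int:
    """Index k maximizing interclass inertia for high=desc[:k], low=desc[k:]."""
    n, total = len(desc), sum(desc)
    g = total / n
    best_k, best_inertia, prefix = 1, -1.0, 0.0
    for k in range(1, n):
        prefix += desc[k - 1]
        mean_high, mean_low = prefix / k, (total - prefix) / (n - k)
        inertia = k * (mean_high - g) ** 2 + (n - k) * (mean_low - g) ** 2
        if inertia > best_inertia:
            best_k, best_inertia = k, inertia
    return best_k


def threshold_with_resolution(values: Sequence[float], resolution: int = 0) -> float:
    """Threshold (highest value of the low group) after |resolution| extra splits."""
    desc = sorted(values, reverse=True)
    if len(desc) < 2:
        raise ValueError("at least two dissimilarity values are required")
    k = split_index(desc)
    high, low = desc[:k], desc[k:]
    for _ in range(abs(resolution)):
        # Positive resolution: re-split the low group (lower threshold, more groups).
        # Negative resolution: re-split the high group (higher threshold, fewer groups).
        subset = low if resolution > 0 else high
        if len(subset) < 2:
            break  # assumption: nothing left to split
        k = split_index(subset)
        high, low = subset[:k], subset[k:]
    return low[0]


values = [12, 11, 6, 1, 0.5, 0.2, 0.1]  # illustrative values only
for r in (-1, 0, 1):
    print(r, threshold_with_resolution(values, r))
```

With these toy values the threshold moves from 6 (resolution -1) to 1 (resolution 0) to 0.5 (resolution 1), so more nodes exceed it, and more subfamilies result, as the resolution increases.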

To assess the quality of our clustering method, we performed a jackknife on 24 structurally and manually validated multiple alignments (22 aminoacyl-tRNA synthetases and the Sm and nuclear receptor protein families, available at http://www-bio3d-igbmc.u-strasbg.fr/~wicker/Secator/secator.html). For each multiple alignment, 1% of the total number of sequences (n) was removed n times, and the clusters were calculated for each resulting reduced multiple alignment and compared with the original clusters. A jackknife value was computed for each cluster and for the number of clusters. The jackknife value of a cluster is calculated as the percentage of replicates in which this cluster is identical to the one in the original result. The jackknife value of a number of groups is the percentage of replicates in which that number of groups is found. For 82.7% of the groups, the jackknife value was >80%. Furthermore, in 75% of the examples, the jackknife value of the original number of groups was >80%. If the jackknife value of a number of groups other than the original one is above a certain threshold (20%), it is suggested to the user that an alternative to the original clustering exists.
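The jackknife procedure can be sketched generically in Python. The sketch below is an assumption-laden illustration: `cluster_fn` stands for any clustering routine (a hypothetical placeholder for the full tree-building and collapsing pipeline), sequences are removed at random, and an original cluster counts as recovered when its remaining members form exactly one cluster of the replicate.

```python
import random
from collections import Counter
from typing import Callable, Dict, FrozenSet, Hashable, List, Tuple

ClusterFn = Callable[[List[Hashable]], List[List[Hashable]]]


def jackknife(
    items: List[Hashable],
    cluster_fn: ClusterFn,
    fraction: float = 0.01,
    seed: int = 0,
) -> Tuple[Dict[FrozenSet, float], Dict[int, float]]:
    """Return (support per original cluster, support per number of clusters), in %."""
    rng = random.Random(seed)
    n = len(items)
    k = max(1, round(fraction * n))  # remove at least one item per replicate
    original = [frozenset(c) for c in cluster_fn(items)]
    hits: Counter = Counter()
    sizes: Counter = Counter()
    for _ in range(n):  # n replicates, as described in the text
        removed = set(rng.sample(items, k))
        kept = [x for x in items if x not in removed]
        replicate = {frozenset(c) for c in cluster_fn(kept)}
        sizes[len(replicate)] += 1
        for c in original:
            # A cluster emptied by the removal never matches and counts as a miss.
            if (c - removed) in replicate:
                hits[c] += 1
    cluster_support = {c: 100.0 * hits[c] / n for c in original}
    size_support = {s: 100.0 * m / n for s, m in sizes.items()}
    return cluster_support, size_support


def by_tens(xs: List[int]) -> List[List[int]]:
    """Toy clustering used only to exercise the jackknife: group integers by tens."""
    groups: Dict[int, List[int]] = {}
    for x in xs:
        groups.setdefault(x // 10, []).append(x)
    return list(groups.values())


cluster_support, size_support = jackknife(list(range(30)), by_tens)
print(size_support)
```

A number of clusters other than the original one whose support exceeds a chosen cutoff would then be the signal that an alternative clustering is worth reporting.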

At this point, it should be observed that any partition of the tree may be meaningful, as was pointed out by Sjolander (1998). Indeed, there is no one criterion that is objectively better than another, and it is up to the biologist to choose the most convenient from his or her point of view or to change the criterion depending on the sequence family being analyzed. Thus, the introduction of different, complementary methods is of major importance to allow objective, reliable, and reproducible analysis.

In the next section, we illustrate our method using two well-studied protein families: the Sm proteins (Salgado-Garrido et al. 1999) and the ligand-binding domain of the nuclear receptors (Wurtz et al. 1996). These two protein families were preferred because of the presence of numerous divergent sequences from various origins and the availability of large amounts of biological, structural, and functional knowledge. In addition, these two protein families represent two extreme test cases: the Sm proteins are very small proteins with percentages of identity ranging from 39% to 73%, while the nuclear receptors are longer, highly variable proteins with percentages of identity ranging from 21% to 89%.

Analysis of Sm Proteins
The Sm proteins represent an important protein family involved in pre-mRNA splicing by promoting small nuclear RNA (snRNA) cap modification and targeting small nuclear ribonucleoproteins (snRNPs) to their appropriate cellular location. They are found in eukaryotes from yeast to humans, and some Sm-related proteins have recently been found in Archaea. At the structural level, a group of seven canonical Sm proteins, named B, D1, D2, D3, E, F, and G according to the corresponding human Sm proteins, forms a complex that can bind several RNAs (Kambach et al. 1999). At the sequence level, Sm proteins share a conserved Sm domain consisting of two blocks of weak but significant sequence similarity interrupted by a spacer region of variable length. Among the numerous proteins carrying an Sm domain, some are highly similar to the canonical Sm proteins, while others (Sm-like proteins) have no obvious counterpart in the Sm protein complex (Seraphin 1995).

Recently, an in-depth sequence analysis (Salgado-Garrido et al. 1999) showed that Sm and Sm-like proteins group into at least seven biological subtypes corresponding to the seven canonical Sm proteins with their Sm-like related proteins, while most of the archaeal proteins and two Sm-like proteins form various nonrelated groups.

We used Secator to analyze 102 sequences corresponding to 57 Sm proteins from yeast, plant, insect, and mammalian origins, as well as 45 Sm-like proteins from eukaryotic and archaeal origins. The distance matrix used as input was based on a full-length alignment, available with the definition and abbreviation of each sequence as well as the table of the sequence groups with their mean distance (MD) scores as supplementary material at http://www-bio3d-igbmc.u-strasbg.fr/~wicker/Secator/secator.html. Figure 2B shows the initial phylogenetic tree obtained with BIONJ, and figure 2C shows the resulting tree after collapsing by Secator. In figure 2A, the threshold of high dissimilarity at peak 7 corresponds to a visible disruption of the curve, implying that eight is a sensible number of subfamilies. In addition, Secator outgrouped three sequences (yLsm9, m-therm2, and aero-pern2). These sequences have features that clearly discriminate them from the rest of the family, notably the absence of the highly conserved dipeptide RG in the second block of conservation. Among the eight groups identified by Secator, six correspond exactly to the previously reported functional subtypes, highlighting the strong correlation between the biological grouping and the predicted subfamilies. Secator assigned the so-called group 1 of archaeal Sm-like proteins to subtype 3, which corresponds to the SmD2 canonical proteins and the Lsm3 proteins (Salgado-Garrido et al. 1999). At the sequence level, such a grouping may be biologically or evolutionarily relevant, since examination of the sequence conservation revealed that the SmD2, Lsm3, and archaeal sequences share two highly conserved residues (H and R at positions 45 and 90) which are absent in all other subtypes. The eighth subfamily, which was not reported in Salgado-Garrido et al. (1999) as a subtype, is composed of some Lsm1 and various Lsm1-related sequences, suggesting that these sequences might represent a new subtype.

Analysis of Nuclear Receptor Proteins
The nuclear receptor (NR) superfamily represents the single largest family of metazoan transcription factors (Tsai and O'Malley 1994). Most of the NRs are ligand-inducible factors that specifically regulate the expression of target genes involved in major physiological functions such as metabolism, development, and reproduction and are implicated in diseases such as cancer, diabetes, or hormone resistance syndromes (Weatherman, Fletterick, and Scanlan 1999). To date, more than 100 different NRs have been characterized which bind to hormones, such as sex steroids (progestins [PR], estrogens [ER], and androgens [AR]), adrenal steroids (glucocorticoids [GR] and mineralocorticoids [MR]), vitamin D3 (VDR), thyroid (TR), and retinoid (RXR 9-cis and all-trans), in addition to a variety of other metabolic and uncharacterized ligands. In general, the NRs have three structural domains: a highly variable N-terminal domain, a highly conserved DNA-binding domain (DBD), and a weakly conserved C-terminal ligand-binding domain (LBD). As the LBDs specifically bind a particular ligand type, they are the main targets for both pharmaceutical and phylogenetic studies.

The NRs (Evans 1988) were originally divided into three main subfamilies: the steroid receptor family, including ER, GR, MR, PR, and AR; the RXR receptor family, including the TR, VDR, RXR, and the ecdysone receptor (EcR); and a third family including the peroxisome proliferator-activated receptor (PPAR), steroidogenic factor 1 (SF-1), nerve growth factor-induced receptor (NGF1), and X-linked orphan receptor DAX-1. Recently, the NRs were classified into six "subfamilies" (S1–S6) and 26 "groups" (uppercase letters) (Nuclear Receptors Nomenclature Committee 1999) by aligning the DNA-binding C domain and the ligand-binding E domain.

We used Secator to cluster the LBDs of 477 sequences comprising 244 classical NRs and 233 sequences from C. elegans (supplementary material is available at http://www-bio3d-igbmc.u-strasbg.fr/~wicker/Secator/secator.html). Figures 2E and 2F show the phylogenetic trees before and after collapsing. Here, the smallest dissimilarity value of NJDSTs corresponds to peak 15 (fig. 2D).

The resulting clustering is in good agreement with the reported subfamilies even though our collapsed tree is based solely on the LBD, emphasizing the observed correlation between DBD and LBD evolution. In addition, the method appears robust, since the inclusion of numerous highly variable C. elegans sequences does not significantly affect the clustering of the classical NRs. Secator correctly discriminates three subfamilies (fig. 2F): S6 (composed of members related to the mouse GCNF1), S5 (including SF1 and LRH1), and S4 (including NOR and NUR).

Two Secator clusters differ slightly from the reported subfamilies. First, the highly similar groups S3A (ER) and S3B (ER-related) are clustered together. Second, group S2A (representative member: HNF4) has been excluded from subfamily 2. This result is linked to some specifically conserved residues that a large set of C. elegans sequences (Ce-2) shares with group S2A. In fact, the major discrepancy observed between the two classifications is linked to S1. Secator clusters S1D to S1F (REV-ERB, ROR, CNR, ...), but separates groups S1A (TRA and TRB), S1B (RAR), S1C (PPAR), S1H (UR, LXR, ...), and S1I (VDR, ONR1, ...); S1J was absent from our alignment, and S1K was merged with various orphan receptors. At the sequence level, this major difference is probably linked to the absence of the DBD in our alignment, since, as noted in Laudet (1997), all of the S1 members share a characteristic DBD binding to direct repeat elements.

In addition, this analysis proposes for the first time a clustering of the orphan C. elegans receptors into five subfamilies (Ce-1 to Ce-5). The biological relevance of these results is strongly supported by the good agreement of our analysis with the existing functional subfamily classification. The objective subfamilies identified by Secator should prove useful in the comprehension of the evolution of this crucial protein family, and particularly in the construction of structural models of C. elegans NRs.

Further improvements and comparisons of clustering techniques in sequence analysis are clearly needed. This will require the use of a large number of well-studied test cases to compare and evaluate the different and complementary methods (work in progress). Nevertheless, Secator should prove particularly useful in a wide range of sequence analysis methods, particularly those dedicated to the identification of residues and domains indicative of structural or functional differences (Hannenhalli and Russell 2000).


    Acknowledgements
 
We are much indebted to Jerome Fagart and Jean-Marie Wurtz for providing their alignment of the LBD domain and to Kimmen Sjolander, who has made available her thesis and has kindly answered our questions. We are also grateful to Julie Thompson, Odile Lecompte, and Frédéric Plewniak for helpful comments.


    Footnotes
 
William R. Taylor, Reviewing Editor

1 Abbreviations: NJDST, node joining distant subtrees; TD, threshold of dissimilarity.

2 Keywords: Secator, subfamily, phylogenetic tree, clustering

3 Address for correspondence and reprints: Olivier Poch, Laboratoire de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire CNRS/INSERM/ULP, BP 163, 67404 Illkirch cedex, France. poch{at}igbmc.u-strasbg.fr.


    References
 

    Altschul S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

    Corpet F., J. Gouzy, D. Kahn. 1999. Browsing protein families via the ‘Rich Family Description’ format. Bioinformatics 15:1020-1027.

    Enright A. J., C. A. Ouzounis. 2000. GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics 16:451-457.

    Evans R. M. 1988. The steroid and thyroid hormone receptor superfamily. Science 240:889-895.

    Gascuel O. 1997. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14:685-695.

    Hannenhalli S. S., R. B. Russell. 2000. Analysis and prediction of functional sub-types from protein sequence alignments. J. Mol. Biol. 303:61-76.

    Hodge T., M. J. Cope. 2000. A myosin family tree. J. Cell Sci. 113:3353-3354.

    Kambach C., S. Walke, R. Young, J. M. Avis, E. de la Fortelle, V. A. Raker, R. Luhrmann, J. Li, K. Nagai. 1999. Crystal structures of two Sm protein complexes and their implications for the assembly of the spliceosomal snRNPs. Cell 96:375-387.

    Krause A., J. Stoye, M. Vingron. 2000. The SYSTERS protein sequence cluster set. Nucleic Acids Res. 28:270-272.

    Laudet V. 1997. Evolution of the nuclear receptor superfamily: early diversification from an ancestral orphan receptor. J. Mol. Endocrinol. 19:207-226.

    Lichtarge O., H. R. Bourne, F. E. Cohen. 1996. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. 257:342-358.

    Milligan G. W., M. C. Cooper. 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50:159-179.

    Nuclear Receptors Nomenclature Committee. 1999. A unified nomenclature system for the nuclear receptor superfamily [letter]. Cell 97:161-163.

    Page R. D. 1996. TreeView: an application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 12:357-358.

    Pearson W. R. 1994. Using the FASTA program to search protein and DNA sequence databases. Methods Mol. Biol. 24:307-331.

    Salgado-Garrido J., E. Bragado-Nilsson, S. Kandels-Lewis, B. Seraphin. 1999. Sm and Sm-like proteins assemble in two related complexes of deep evolutionary origin. EMBO J. 18:3451-3462.

    Seraphin B. 1995. Sm and Sm-like proteins belong to a large family: identification of proteins of the U6 as well as the U1, U2, U4 and U5 snRNPs. EMBO J. 14:2089-2098.

    Sjolander K. 1998. Phylogenetic inference in protein superfamilies: analysis of SH2 domains. Intell. Syst. Mol. Biol. 6:165-174.

    Tatusov R. L., M. Y. Galperin, D. A. Natale, E. V. Koonin. 2000. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28:33-36.

    Thompson J. D., D. G. Higgins, T. J. Gibson. 1994. Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Appl. Biosci. 10:19-29.

    Thompson J. D., F. Plewniak, J. Thierry, O. Poch. 2000. DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res. 28:2919-2926.

    Tsai M. J., B. W. O'Malley. 1994. Molecular mechanisms of action of steroid/thyroid receptor superfamily members. Annu. Rev. Biochem. 63:451-486.

    Weatherman R. V., R. J. Fletterick, T. S. Scanlan. 1999. Nuclear-receptor ligands and ligand-binding domains. Annu. Rev. Biochem. 68:559-581.

    Wolf Y. I., L. Aravind, N. V. Grishin, E. V. Koonin. 1999. Evolution of aminoacyl-tRNA synthetases—analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events. Genome Res. 9:689-710.

    Wurtz J. M., W. Bourguet, J. P. Renaud, V. Vivat, P. Chambon, D. Moras, H. Gronemeyer. 1996. A canonical structure for the ligand-binding domain of nuclear receptors. Nat. Struct. Biol. 3:206.

Accepted for publication April 9, 2001.