LSIIT-ICPS (AXE E), UPRES-A CNRS 70005 Université Louis Pasteur, Illkirch, France
Laboratoire de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire CNRS/INSERM/ULP, Illkirch, France
Abstract |
---|
Introduction |
---|
Here we present a new program called Secator, which is based on a different principle and has the advantage that it is fully automatic. The first step is to create a tree from a distance matrix based on a multiple alignment using BIONJ (Gascuel 1997). The program assigns a dissimilarity value to each node in the tree and then collapses branches by automatically detecting the nodes joining distant subtrees (NJDSTs). The method was validated on 24 protein families and is illustrated using two well-studied protein families: the Sm proteins (Salgado-Garrido et al. 1999) and the nuclear receptors (Wurtz et al. 1996). Our automatic partitioning is in good agreement with previously defined subfamilies grouped according to biological data. In addition, the program distinguished five main subfamilies among the 233 nuclear receptors from Caenorhabditis elegans that have been predicted and aligned (J. Fagart, personal communication).
Materials and Methods |
---|
Results and Discussion |
---|
When dissimilarity values are above an automatically computed threshold (TD), the external subtrees are assumed to be "unmergeable," and the nodes are designated NJDSTs. When all NJDSTs have been inferred, all branches are collapsed from the leaves up to an NJDST and the corresponding sequences are clustered into a subfamily.
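The collapsing step described above can be illustrated with a minimal sketch. It assumes the dissimilarity values and the threshold TD are already known; the `Node` class, the function names, and the toy tree are hypothetical and are not taken from the Secator source. A subtree is collapsed into one subfamily when it contains no node whose dissimilarity exceeds TD.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: leaves carry a sequence name,
    internal nodes carry a dissimilarity value."""
    name: Optional[str] = None
    dissimilarity: float = 0.0
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[str]:
    """Return the sequence names found under a node, left to right."""
    if not node.children:
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(leaves(child))
    return names


def contains_njdst(node: Node, threshold: float) -> bool:
    """True if the node or any internal descendant exceeds the threshold."""
    if not node.children:
        return False
    if node.dissimilarity > threshold:
        return True
    return any(contains_njdst(child, threshold) for child in node.children)


def collapse(node: Node, threshold: float) -> List[List[str]]:
    """Cluster the leaves: every maximal subtree without an NJDST
    becomes one subfamily (assumed reading of the collapsing rule)."""
    if not contains_njdst(node, threshold):
        return [leaves(node)]
    clusters: List[List[str]] = []
    for child in node.children:
        clusters.extend(collapse(child, threshold))
    return clusters


# Toy tree with invented dissimilarity values.
tree = Node(dissimilarity=0.9, children=[
    Node(dissimilarity=0.2, children=[Node("A1"), Node("A2")]),
    Node(dissimilarity=0.7, children=[
        Node(dissimilarity=0.1, children=[Node("B1"), Node("B2")]),
        Node("C1"),
    ]),
])

print(collapse(tree, threshold=0.5))
```

With the threshold set to 0.5, the two nodes with values 0.9 and 0.7 act as NJDSTs, so the toy tree splits into three groups; raising the threshold merges groups back together.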
The major problem is to automatically determine a suitable threshold for high dissimilarity values. In ascending hierarchical clustering, where dissimilarity values can be calculated in the same way, the threshold above which the obtained dendrogram should be cut is usually found manually by looking at the elbow of the curve of the dissimilarity values sorted in descending order (fig. 2A and D). Many stopping rules provide an automatic threshold (Milligan and Cooper 1985); however, these rules are not generally suitable when the differences between clusters are fuzzy. We present a new stopping rule of a geometric nature that focuses on the clustering essence of the stopping rule.
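The manual "elbow" reading mentioned above can be approximated numerically. The sketch below uses a generic heuristic (the point of the sorted curve that deviates most from the chord joining its first and last points). It is not the geometric stopping rule implemented in Secator, whose details are not given in this section; the function name `elbow_threshold` and the sample values are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def elbow_threshold(values: List[float]) -> Tuple[int, float]:
    """Locate the elbow of the dissimilarity values sorted in descending order.

    Returns (index, threshold), where the threshold is the value at the elbow;
    values strictly greater than it are treated as "high".
    """
    ys = sorted(values, reverse=True)
    n = len(ys)
    if n < 3:
        raise ValueError("need at least three values to look for an elbow")
    x0, y0 = 0, ys[0]
    x1, y1 = n - 1, ys[-1]
    norm = math.hypot(x1 - x0, y1 - y0)
    best_index, best_distance = 0, -1.0
    for i, y in enumerate(ys):
        # Perpendicular distance from the point (i, y) to the chord.
        distance = abs((y1 - y0) * (i - x0) - (x1 - x0) * (y - y0)) / norm
        if distance > best_distance:
            best_index, best_distance = i, distance
    return best_index, ys[best_index]


# Invented values: three large dissimilarities followed by a flat tail.
sample = [0.95, 0.90, 0.85, 0.20, 0.15, 0.12, 0.10, 0.08]
index, threshold = elbow_threshold(sample)
high = [v for v in sample if v > threshold]
print(index, threshold, high)
```

On these invented values the elbow falls on the fourth sorted value, so the three largest dissimilarities would be the ones treated as high. Heuristics of this kind can behave poorly when the curve has no clear break, which is the fuzzy situation the text describes.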
The algorithm takes as input two parameters (resolution and weight), which enable the user to investigate the partitioning of the protein set in the light of biological knowledge. Changing the resolution parameter allows the user to select the depth of the clustering and the resulting number of subfamilies. By default, the partition method is performed only once (resolution = 0). The method is iterated on the low or high dissimilarity values for positive or negative resolutions, respectively. The absolute value of the resolution represents the number of additional iterations. The sequences are assigned equal weights by default, as this has proved to be the best setting in general. Indeed, analysis of various multiple alignments (see below) revealed that weighting of a group of highly similar sequences (typically protein sequences from organisms belonging to the same genus or closely related at the evolutionary level) frequently makes the total weight of the group negligible compared with a group of more weakly related sequences, inducing an inadequate merging. Nevertheless, sequences may be weighted according to percentage of residue identity if the weighting parameter is selected. This may enhance the differentiation of a small subgroup of weakly related sequences which are topologically close to a larger group on the tree.
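One possible reading of the resolution parameter is sketched below: the stopping rule is applied once, then re-applied to the values on the low side (positive resolution) or the high side (negative resolution) of the current threshold. This is an assumed interpretation of the description above, not the Secator implementation, and `largest_gap_threshold` is a toy stand-in for the actual stopping rule.

```python
from typing import Callable, List


def largest_gap_threshold(values: List[float]) -> float:
    """Toy stopping rule (an assumption): cut in the middle of the largest
    gap between consecutive sorted values."""
    ys = sorted(values, reverse=True)
    gaps = [ys[i] - ys[i + 1] for i in range(len(ys) - 1)]
    i = gaps.index(max(gaps))
    return (ys[i] + ys[i + 1]) / 2.0


def threshold_with_resolution(
    values: List[float],
    stopping_rule: Callable[[List[float]], float],
    resolution: int = 0,
) -> float:
    """Apply the stopping rule once, then |resolution| more times on the
    low values (resolution > 0) or the high values (resolution < 0)."""
    current = list(values)
    threshold = stopping_rule(current)
    for _ in range(abs(resolution)):
        if resolution > 0:
            subset = [v for v in current if v <= threshold]
        else:
            subset = [v for v in current if v > threshold]
        if len(subset) < 2:
            break  # nothing left to split; keep the last threshold
        current = subset
        threshold = stopping_rule(current)
    return threshold


# Invented dissimilarity values.
values = [0.9, 0.8, 0.4, 0.35, 0.1, 0.05]
for r in (-1, 0, 1):
    print(r, threshold_with_resolution(values, largest_gap_threshold, r))
```

Under this reading, a positive resolution lowers the threshold, so more nodes qualify as NJDSTs and more subfamilies result, while a negative resolution raises it and yields fewer, larger subfamilies.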
To assess the quality of our clustering method, we performed a jackknife on 24 structurally and manually validated multiple alignments (22 aminoacyl-tRNA synthetases and the Sm and nuclear receptor protein families available at http://www-bio3d-igbmc.u-strasbg.fr/~wicker/Secator/secator.html). For each multiple alignment, 1% of the total number of sequences (n) was removed n times, and the clusters were calculated for each resulting reduced multiple alignment and compared with the original clusters. A jackknife value was computed for each cluster and for the number of clusters. The jackknife value of a cluster is calculated as the percentage of the time this cluster is the same as in the original result. The jackknife value of the number of groups is the percentage of the time each observed number of groups is found during the jackknife. For 82.7% of the groups, the jackknife value was >80%. Furthermore, in 75% of the examples, the jackknife value of the original number of groups was >80%. If the jackknife value of an alternative number of groups is above a certain threshold (20%), it is suggested to the user that an alternative to the original clustering exists.
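The jackknife procedure can be sketched as follows. The `cluster_fn` argument is a hypothetical stand-in for the whole pipeline (alignment, distance matrix, tree building, and partitioning), and the rule used to decide that a cluster is "the same" (the original cluster, minus the removed sequences, reappears exactly) is an assumption, as is the random choice of the removed sequences.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Clusters = List[List[str]]


def jackknife(
    sequences: Sequence[str],
    cluster_fn: Callable[[List[str]], Clusters],
    fraction: float = 0.01,
    seed: int = 0,
) -> Tuple[List[float], Dict[int, float]]:
    """Remove a fraction of the n sequences n times and re-cluster each time.

    Returns the jackknife value (in percent) of each original cluster and
    the percentage of replicates yielding each observed number of clusters.
    """
    rng = random.Random(seed)
    names = list(sequences)
    n = len(names)
    k = max(1, round(fraction * n))  # remove at least one sequence
    original = [frozenset(c) for c in cluster_fn(names)]
    hits = [0] * len(original)
    counts: Counter = Counter()
    for _ in range(n):
        removed = set(rng.sample(names, k))
        kept = [s for s in names if s not in removed]
        replicate = {frozenset(c) for c in cluster_fn(kept)}
        counts[len(replicate)] += 1
        for i, cluster in enumerate(original):
            reduced = cluster - removed
            if reduced and reduced in replicate:
                hits[i] += 1
    cluster_values = [100.0 * h / n for h in hits]
    count_values = {m: 100.0 * c / n for m, c in counts.items()}
    return cluster_values, count_values


def by_prefix(names: List[str]) -> Clusters:
    """Toy clustering used only for the demonstration: group by first letter."""
    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(name[0], []).append(name)
    return list(groups.values())


cluster_values, count_values = jackknife(
    ["a1", "a2", "a3", "b1", "b2", "b3"], by_prefix
)
# Numbers of groups other than the original one that exceed the 20% level.
alternatives = [m for m, v in count_values.items() if m != 2 and v > 20.0]
print(cluster_values, count_values, alternatives)
```

Because the toy clustering is perfectly stable, every replicate recovers both groups; with a real pipeline the values would fall below 100% wherever the partition depends on individual sequences.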
At this point, it should be observed that any partition of the tree may be meaningful, as was pointed out by Sjolander (1998). Indeed, there is no one criterion that is objectively better than another, and it is up to the biologist to choose the most convenient from his or her point of view or to change the criterion depending on the sequence family being analyzed. Thus, the introduction of different, complementary methods is of major importance to allow objective, reliable, and reproducible analysis.
In the next section, we illustrate our method using two well-studied protein families: the Sm proteins (Salgado-Garrido et al. 1999) and the ligand-binding domain of the nuclear receptors (Wurtz et al. 1996). These two protein families were preferred because of the presence of numerous divergent sequences from various origins and the availability of large amounts of biological, structural, and functional knowledge. In addition, these two protein families represent two extreme test cases: the Sm proteins are a family of very small proteins with percentages of identity ranging from 39% to 73%, while the nuclear receptors are longer, highly variable proteins with percentages of identity ranging from 21% to 89%.
Analysis of Sm Proteins
The Sm proteins represent an important protein family involved in pre-mRNA splicing by promoting small nuclear RNA (snRNA) cap modification and targeting small nuclear ribonucleoproteins (snRNPs) to their appropriate cellular location. They are found in eukaryotes from yeast to humans, and some Sm-related proteins have recently been found in Archaea. At the structural level, a group of seven canonical Sm proteins, named B, D1, D2, D3, E, F, and G according to the corresponding human Sm proteins, forms a complex that can bind several RNAs (Kambach et al. 1999). At the sequence level, Sm proteins share a conserved Sm domain consisting of two blocks of weak but significant sequence similarity interrupted by a spacer region of variable length. Among the numerous proteins carrying an Sm domain, some are highly similar to the canonical Sm proteins, while others (Sm-like proteins) have no obvious counterpart in the Sm protein complex (Seraphin 1995).
Recently, an in-depth sequence analysis (Salgado-Garrido et al. 1999) showed that Sm and Sm-like proteins group into at least seven biological subtypes corresponding to the seven canonical Sm proteins with their Sm-like related proteins, while most of the archaeal proteins and two Sm-like proteins form various nonrelated groups.
We used Secator to analyze 102 sequences corresponding to 57 Sm proteins from yeast, plant, insect, and mammalian origins, as well as 45 Sm-like proteins from eukaryotic and archaeal origins. The distance matrix used as input was based on a full-length alignment, which is available as supplementary material, together with the definition and abbreviation of each sequence and the table of the sequence groups with their mean distance (MD) scores, at http://www-bio3d-igbmc.u-strasbg.fr/~wicker/Secator/secator.html. Figure 2B shows the initial phylogenetic tree obtained with BIONJ, and figure 2C shows the resulting tree after collapsing by Secator. In figure 2A, the threshold of high dissimilarity at peak 7 corresponds to a visible disruption of the curve, implying that eight is a sensible number of subfamilies. In addition, Secator outgrouped three sequences (yLsm9, m-therm2, and aero-pern2). These sequences have features that clearly discriminate them from the rest of the family, notably the absence of the highly conserved dipeptide RG in the second block of conservation. Among the eight groups identified by Secator, six correspond exactly to the previously reported functional subtypes, highlighting the strong correlation between the biological grouping and the predicted subfamilies. Secator assigned the so-called group 1 of archaeal Sm-like proteins to subtype 3, which corresponds to the SmD2 canonical proteins and the Lsm3 proteins (Salgado-Garrido et al. 1999). At the sequence level, such a grouping may be biologically or evolutionarily relevant, since examination of the sequence conservation revealed that the SmD2, Lsm3, and archaeal sequences share two highly conserved residues (H and R at positions 45 and 90) which are absent in all other subtypes. The eighth subfamily, which was not reported in Salgado-Garrido et al. (1999) as a subtype, is composed of some Lsm1 and various Lsm1-related sequences, suggesting that these sequences might represent a new subtype.
Analysis of Nuclear Receptor Proteins
The nuclear receptor (NR) superfamily represents the single largest family of metazoan transcription factors (Tsai and O'Malley 1994). Most of the NRs are ligand-inducible factors that specifically regulate the expression of target genes involved in major physiological functions such as metabolism, development, and reproduction and are implicated in diseases such as cancer, diabetes, or hormone resistance syndromes (Weatherman, Fletterick, and Scanlan 1999). To date, more than 100 different NRs have been characterized which bind to hormones, such as sex steroids (progestins [PR], estrogens [ER], and androgens [AR]), adrenal steroids (glucocorticoids [GR] and mineralocorticoids [MR]), vitamin D3 (VDR), thyroid (TR), and retinoid (RXR 9-cis and all-trans), in addition to a variety of other metabolic and uncharacterized ligands. In general, the NRs have three structural domains: a highly variable N-terminal domain, a highly conserved DNA-binding domain (DBD), and a weakly conserved C-terminal ligand-binding domain (LBD). As the LBDs specifically bind a particular ligand type, they are the main targets for both pharmaceutical and phylogenetic studies.
The NRs (Evans 1988) were originally divided into three main subfamilies: the steroid receptor family, including ER, GR, MR, PR, and AR; the RXR receptor family, including the TR, VDR, RXR, and the ecdysone receptor (EcR); and a third family including the peroxisome proliferator-activated receptor (PPAR), steroidogenic factor 1 (SF-1), nerve growth factor-induced receptor (NGF1), and X-linked orphan receptor DAX-1. Recently, the NRs were classified into six "subfamilies" (S1 to S6) and 26 "groups" (uppercase letters) (Nuclear Receptors Nomenclature Committee 1999) by aligning the DNA-binding C domain and the ligand-binding E domain.
We used Secator to cluster the LBDs of 477 sequences comprising 244 classical NRs and 233 sequences from C. elegans (supplementary material is available at http://www-bio3d-igbmc.u-strasbg.fr/~wicker/Secator/secator.html). Figures 2E and 2F show the phylogenetic trees before and after collapsing. Here, the smallest dissimilarity value of NJDSTs corresponds to peak 15 (fig. 2D).
The resulting clustering is in good agreement with the reported subfamilies even though our collapsed tree is based solely on the LBD, emphasizing the observed correlation between DBD and LBD evolution. In addition, the method appears robust, since the inclusion of numerous highly variable C. elegans sequences does not significantly affect the clustering of the classical NRs. Secator correctly discriminates three subfamilies (fig. 2F): S6 (composed of members related to the mouse GCNF1), S5 (including SF1 and LRH1), and S4 (including NOR and NUR).
Two Secator clusters differ slightly from the reported subfamilies. First, the highly similar groups S3A (ER) and S3B (ER-related) are clustered together. Second, group S2A (representative member: HNF4) has been excluded from subfamily 2. This result is linked to some specifically conserved residues that a large set of C. elegans sequences (Ce-2) shares with group S2A. In fact, the major discrepancy observed between the two classifications is linked to S1. Secator clusters S1D to S1F (REV-ERB, ROR, CNR, ...), but separates groups S1A (TRA and TRB), S1B (RAR), S1C (PPAR), S1H (UR, LXR, ...), and S1I (VDR, ONR1, ...); S1J was absent from our alignment, and S1K was merged with various orphan receptors. At the sequence level, this major difference is probably linked to the absence of the DBD in our alignment, since, as noted in Laudet (1997), all of the S1 members share a characteristic DBD binding to direct repeat elements.
In addition, this analysis proposes for the first time a clustering of the orphan C. elegans receptors into five subfamilies (Ce-1 to Ce-5). The biological relevance of these results is strongly supported by the good agreement of our analysis with the existing functional subfamily classification. The objective subfamilies identified by Secator should prove useful in the comprehension of the evolution of this crucial protein family, and particularly in the construction of structural models of C. elegans NRs.
Further improvements and comparisons of clustering techniques in sequence analysis are clearly needed. This will require the use of a large number of well-studied test cases to compare and evaluate the different and complementary methods (work in progress). Nevertheless, Secator should prove particularly useful in a wide range of sequence analysis methods, particularly those dedicated to the identification of residues and domains indicative of structural or functional differences (Hannenhalli and Russell 2000).
Acknowledgements |
---|
Footnotes |
---|
1 Abbreviations: NJDST, node joining distant subtrees; TD, threshold of dissimilarity.
2 Keywords: Secator, subfamily, phylogenetic tree, clustering
3 Address for correspondence and reprints: Olivier Poch, Laboratoire de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire CNRS/INSERM/ULP, BP 163, 67404 Illkirch cedex, France. poch{at}igbmc.u-strasbg.fr.
References |
---|
Altschul S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, 1997 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Nucleic Acids Res 25:3389-3402
Corpet F., J. Gouzy, D. Kahn, 1999 Browsing protein families via the Rich Family Description format Bioinformatics 15:1020-1027
Enright A. J., C. A. Ouzounis, 2000 GeneRAGE: a robust algorithm for sequence clustering and domain detection Bioinformatics 16:451-457
Evans R. M., 1988 The steroid and thyroid hormone receptor superfamily Science 240:889-895
Gascuel O., 1997 BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data Mol. Biol. Evol 14:685-695
Hannenhalli S. S., R. B. Russell, 2000 Analysis and prediction of functional sub-types from protein sequence alignments J. Mol. Biol 303:61-76
Hodge T., M. J. Cope, 2000 A myosin family tree J. Cell Sci 113:3353-3354
Kambach C., S. Walke, R. Young, J. M. Avis, E. de la Fortelle, V. A. Raker, R. Luhrmann, J. Li, K. Nagai, 1999 Crystal structures of two Sm protein complexes and their implications for the assembly of the spliceosomal snRNPs Cell 96:375-387
Krause A., J. Stoye, M. Vingron, 2000 The SYSTERS protein sequence cluster set Nucleic Acids Res 28:270-272
Laudet V., 1997 Evolution of the nuclear receptor superfamily: early diversification from an ancestral orphan receptor J. Mol. Endocrinol 19:207-226
Lichtarge O., H. R. Bourne, F. E. Cohen, 1996 An evolutionary trace method defines binding surfaces common to protein families J. Mol. Biol 257:342-358
Milligan G. W., M. C. Cooper, 1985 An examination of procedures for determining the number of clusters in a data set Psychometrika 50:159-179
Nuclear Receptors Nomenclature Committee, 1999 A unified nomenclature system for the nuclear receptor superfamily [letter] Cell 97:161-163
Page R. D., 1996 TreeView: an application to display phylogenetic trees on personal computers Comput. Appl. Biosci 12:357-358
Pearson W. R., 1994 Using the FASTA program to search protein and DNA sequence databases Methods Mol. Biol 24:307-331
Salgado-Garrido J., E. Bragado-Nilsson, S. Kandels-Lewis, B. Seraphin, 1999 Sm and Sm-like proteins assemble in two related complexes of deep evolutionary origin EMBO J 18:3451-3462
Seraphin B., 1995 Sm and Sm-like proteins belong to a large family: identification of proteins of the U6 as well as the U1, U2, U4 and U5 snRNPs EMBO J 14:2089-2098
Sjolander K., 1998 Phylogenetic inference in protein superfamilies: analysis of SH2 domains Intell. Syst. Mol. Biol 6:165-174
Tatusov R. L., M. Y. Galperin, D. A. Natale, E. V. Koonin, 2000 The COG database: a tool for genome-scale analysis of protein functions and evolution Nucleic Acids Res 28:33-36
Thompson J. D., D. G. Higgins, T. J. Gibson, 1994 Improved sensitivity of profile searches through the use of sequence weights and gap excision Comput. Appl. Biosci 10:19-29
Thompson J. D., F. Plewniak, J. Thierry, O. Poch, 2000 DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches Nucleic Acids Res 28:2919-2926
Tsai M. J., B. W. O'Malley, 1994 Molecular mechanisms of action of steroid/thyroid receptor superfamily members Annu. Rev. Biochem 63:451-486
Weatherman R. V., R. J. Fletterick, T. S. Scanlan, 1999 Nuclear-receptor ligands and ligand-binding domains Annu. Rev. Biochem 68:559-581
Wolf Y. I., L. Aravind, N. V. Grishin, E. V. Koonin, 1999 Evolution of aminoacyl-tRNA synthetases: analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events Genome Res 9:689-710
Wurtz J. M., W. Bourguet, J. P. Renaud, V. Vivat, P. Chambon, D. Moras, H. Gronemeyer, 1996 A canonical structure for the ligand-binding domain of nuclear receptors Nat. Struct. Biol 3:206