Archaeal Phylogeny Based on Ribosomal Proteins

Oriane Matte-Tailliez1, Céline Brochier1, Patrick Forterre and Hervé Philippe3

*Institut de Génétique et Microbiologie, Université Paris-Sud, Orsay, France;
†Phylogénie, Bioinformatique et Génome, Université Pierre et Marie Curie, Paris, France


    Abstract
 
Until recently, phylogenetic analyses of Archaea have mainly been based on ribosomal RNA (rRNA) sequence comparisons, leading to the distinction of the two major archaeal phyla: the Euryarchaeota and the Crenarchaeota. Here, thanks to the recent sequencing of several archaeal genomes, we have constructed a phylogeny based on the fusion of the sequences of the 53 ribosomal proteins present in most of the archaeal species. This phylogeny was remarkably congruent with the rRNA phylogeny, suggesting that both reflected the actual phylogeny of the domain Archaea even if some nodes remained unresolved. In both cases, the branches leading to hyperthermophilic species were short, suggesting that the evolutionary rate of their genes has been slowed down by structural constraints related to environmental adaptation. In addition, to estimate the impact of lateral gene transfer (LGT) on our tree reconstruction, we used a new method that revealed that 8 genes out of the 53 ribosomal proteins used in our study were likely affected by LGT. This strongly suggested that a core of 45 nontransferred ribosomal protein genes existed in Archaea that can be tentatively used to infer the phylogeny of this domain. Interestingly, the tree obtained using only the eight ribosomal proteins likely affected by LGT was not very different from the consensus tree, indicating that LGT mainly brought random phylogenetic noise. The major difference involves organisms living in similar environments, suggesting that LGTs are mainly directed by the physical proximity of the organisms rather than by their phylogenetic proximity.


    Introduction
 
Archaea are a collection of prokaryotes with very diverse phenotypes, which have been recognized as a major taxonomic unit (one of the three domains of life) based on ribosomal RNA (rRNA) sequence comparison (Woese 1987; Huber, Huber, and Stetter 2000). Some Archaea exhibit unique features in the living world, such as methanogenesis or the ability to grow at temperatures above 95°C or at pH values as low as 0. Recently, the search for new archaeal phylotypes by PCR from various environments has revealed that Archaea are an important component of the biosphere (Pace 1997). Molecular studies and comparative genomics have shown that Archaea are characterized by a combination of unique features, such as left-handed isoprenoids containing glycerolipids, and mosaic bacterial and eukaryotic features. In particular, despite a bacterial organization of their chromosome (messenger RNA with Shine-Dalgarno sequences, genes assembled in operons, a single origin of bidirectional replication), most of their informational proteins resemble eukaryotic ones (Forterre 1997; Olsen and Woese 1997; Rivera et al. 1998; Makarova et al. 1999).

The study of Archaea is essential to understand the history of molecular mechanisms and metabolic diversity on our planet as well as to unravel the mechanisms by which life can prosper in extreme environments. To fully benefit from such studies, a sound phylogeny of the archaeal domain should be available. The phylogeny of Archaea currently used and adopted in textbooks is based on 16S rRNA sequence comparison. From this rRNA-based phylogeny, the archaeal domain has been divided into two phyla, the Euryarchaeota and the Crenarchaeota. The Euryarchaeota include all described halophilic and methanogenic species as well as thermoacidophiles and hyperthermophiles, whereas cultivated species of the Crenarchaeota include only thermoacidophiles and hyperthermophiles. In addition, many phylotypes that probably correspond to as yet uncultured mesophilic (or even psychrophilic) species have been detected for both phyla (Pace 1997). rRNA sequence comparison has also shown a specific relationship between halophiles and Methanomicrobiales, suggesting that halophiles evolved from methanogens by the loss of methanogenesis and the acquisition of an aerobic lifestyle.

The division of Archaea into two phyla, inferred from rRNA sequences, is also supported by genomic analyses (She et al. 2001). For example, all Crenarchaeota lack eukaryotic-like histones, DNA polymerases of the D family, and cell division proteins of the MinD and FtsZ families, all generally present in Euryarchaeota (Bernander 2000). Other relationships suggested by rRNA are not firmly established. In particular, Thermoplasmatales (moderate thermophiles and acidophiles) and Archaeoglobales (sulfate reducers) are located within methanogens (with the Thermococcales emerging first), but the relationships between these subgroups are not resolved (Ludwig and Klenk 2001). In contrast, it has been observed that the structure of Thermoplasmatales RNA polymerase (B type) is similar to that of Crenarchaeota and Thermococcales, and distinct from that of methanogens (B' + B'' type) (Klenk et al. 1992).

It has recently been shown that rRNA phylogenies can sometimes be grossly misleading in the presence of unequal rates of evolution or differences in base composition (Philippe and Laurent 1998). For example, several early-branching lineages in the eukaryotic 18S rRNA tree turned out to be misplaced because of the long-branch attraction (LBA) artifact (Philippe, Germot, and Moreira 2000). This raises concerns about the reliability of the archaeal tree based on 16S rRNA, all the more so because this domain contains many extremophiles, which can bias the assessment of rRNA sequence evolution. For example, rRNA sequences of hyperthermophilic species are G + C rich, a well-known source of tree reconstruction artifacts (Woese et al. 1991; Lockhart et al. 1994), and those of halophiles display very long branches, which can produce LBA artifacts (Felsenstein 1978).

The recent discovery of the very high occurrence of lateral gene transfers (LGTs) in prokaryotes (Lan and Reeves 1996; Ochman, Lawrence, and Groisman 2000) raises a different, but nonetheless major, problem. It has been suggested that a prokaryotic phylogeny itself does not exist because prokaryotic genomes are a complete mosaic of genes from various origins (Doolittle 1999). In fact, many archaeal phylogenies are based on single protein trees (Klenk, Palm, and Zillig 1994; Brown and Doolittle 1997; Philippe and Forterre 1999; Woese et al. 2000). Although most of them validate the Euryarchaeota/Crenarchaeota phyla division, they are often in contradiction both with each other and with the rRNA tree concerning the position of the various lineages within each phylum. LGTs could provide a simple explanation for these incongruences. Yet, it has been proposed that a core of genes (especially the ones involved in numerous protein-protein interactions, the complexity hypothesis) may be refractory to transfer and could thus provide the raw material necessary to infer organismal phylogeny (Jain, Rivera, and Lake 1999). However, this hypothesis was recently weakened by the in vitro demonstration that in Escherichia coli the rRNA operon can be successfully replaced by that of a distantly related species (Asai et al. 1999) and by the numerous cases of LGTs involving the ribosomal protein rps14 in Bacteria (Brochier, Philippe, and Moreira 2000).

To infer the archaeal phylogeny (if it exists), one needs to be able to demonstrate that a set of genes (the core) has not been laterally transferred during the evolutionary history of this group. Unfortunately, the standard methods used to detect LGTs (e.g., bias in G + C content, in codon usage, or in oligonucleotide frequencies) are designed for recent events only, and they have also recently been shown to provide quite different results (Ragan 2001). We developed a method based on a phylogenetic criterion to detect LGTs (Brochier et al. in press). When applied to a sample of 57 genes from 45 bacterial species, this method revealed that only 13 genes are affected by LGTs. The phylogeny based on the remaining 44 genes was congruent with the phylogeny based on rRNA (16S and 23S). This strongly suggested that a core of nontransferred genes exists in Bacteria and that, consequently, a bacterial phylogeny can be inferred.

Here, we constructed a phylogeny of the archaeal domain based on a concatenated data set of all the ribosomal proteins present in most archaeal species. This multiprotein approach was possible thanks to the many archaeal genome projects recently completed (notably those of Thermoplasmatales and Methanomicrobiales, both of uncertain phylogenetic position). The phylogeny of the concatenated ribosomal proteins was in agreement with the rRNA tree. We actually identified a few cases of probable LGT involving archaeal ribosomal proteins. Interestingly, LGTs appear to be biased in favor of transfer between species living in the same environment.


    Materials and Methods
 
Data Sets
All 61 Pyrococcus abyssi proteins annotated as ribosomal proteins, as well as 16S and 23S rRNAs, were used as seeds for BLAST searches (Altschul et al. 1990) in 13 archaeal species whose genome sequences were complete or near completion. BLAST searches were performed at http://www.ncbi.nlm.nih.gov/BLAST/ for published sequences and locally for Methanosarcina barkeri and Ferroplasma acidarmanus (contigs retrieved at http://www.jgi.doe.gov/JGI_microbial/html/methanosarcina/methano_homepage.html and at http://www.jgi.doe.gov/JGI_microbial/html/ferroplasma/ferro_homepage.html, respectively) and for Pyrobaculum aerophilum (sequences kindly provided by Sorel Fitz-Gibbon). In addition, we retrieved almost all of the ribosomal proteins of Haloarcula marismortui, thanks to the work on the three-dimensional structure of its ribosome (Ban et al. 2000). For some short proteins that were not annotated in GenBank, we performed an additional tblastn search. A total of 53 ribosomal proteins that were present in at least 13 species and that could be unambiguously aligned were retained. In contrast to our analyses of Bacteria (Brochier et al. in press), we did not encounter proteins displaying several duplicated copies within the same species. The retrieved sequences were aligned using CLUSTALW (Thompson, Higgins, and Gibson 1994). Alignments were inspected manually using the program ED of the MUST package (Philippe 1993). Most gaps and all ambiguously aligned regions were excluded from phylogenetic analyses.
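The exclusion of gapped alignment columns mentioned above can be illustrated with a minimal Python sketch. It is not the procedure of the ED program of MUST: the function name, the toy sequences, and the strict rule of dropping every column that contains a gap are assumptions made for illustration only (the text states that most gaps, not all, were removed, and that ambiguous regions were identified by manual inspection).

```python
def drop_gapped_columns(alignment, gap_chars="-?"):
    """Return a copy of the alignment without any column containing a gap.

    `alignment` maps a species name to its aligned sequence (hypothetical
    toy input; all sequences must have the same length).
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    # Keep only the column indices where no sequence has a gap character.
    keep = [i for i in range(length)
            if not any(s[i] in gap_chars for s in seqs)]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


# Toy alignment with invented sequences (not data from the study).
toy = {
    "sp1": "MK-LV",
    "sp2": "MKALV",
    "sp3": "MRAL-",
}
filtered = drop_gapped_columns(toy)
print(filtered)
```

In this toy case the third and fifth columns are removed because one sequence has a gap there, leaving three positions per species.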

As previously reported, we noticed that most archaeal ribosomal proteins were more similar to their eukaryotic homologs than to their bacterial counterparts; several proteins in our data set were not present in Bacteria, some being absent even in Eukarya. Alignment between ribosomal proteins of different domains (even between Archaea and Eukarya) turned out to be difficult, preventing a meaningful use of bacterial or eukaryal ribosomal proteins to root the archaeal tree or to test the monophyly of Archaea. Therefore, we did not include any outgroup, which reduced the noise and increased the number of alignable positions. The trees were rooted between Crenarchaeota and Euryarchaeota, the monophyly of each being undisputed (Ludwig and Klenk 2001).

The remaining 53 proteins were concatenated into a large fusion, P1 (7,175 positions). Protein fusions were also constructed and analyzed for the eight proteins for which our principal components analysis (PCA) (see later) indicated likely cases of LGT (fusion P3, 926 positions) and for the 45 proteins remaining when these 8 were excluded (fusion P2, 6,249 positions). A fusion of the 16S and 23S rRNA sequences (fusion R, 3,933 positions) for the 14 archaeal species was analyzed in a similar way.
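The construction of a fusion from individual alignments can be sketched as follows. This is a minimal illustration, not the MUST workflow used in the study: the function name and toy sequences are hypothetical, and padding a species that lacks a protein with "?" is an assumption (the text only states that the retained proteins were present in at least 13 species). The final line checks that the position counts quoted above are consistent, since P2 and P3 together should cover P1.

```python
def concatenate(alignments, species):
    """Join per-protein alignments into one fused sequence per species.

    A species missing from a protein alignment is padded with '?'
    (treated as missing data; this padding rule is an assumption).
    """
    fusion = {sp: [] for sp in species}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        for sp in species:
            fusion[sp].append(aln.get(sp, "?" * length))
    return {sp: "".join(parts) for sp, parts in fusion.items()}


# Toy data with invented sequences (not data from the study).
species = ["A", "B", "C"]
protein_1 = {"A": "MKL", "B": "MRL", "C": "MKI"}
protein_2 = {"A": "GG", "C": "GA"}  # species B lacks this protein
fused = concatenate([protein_1, protein_2], species)
print(fused)

# Position counts quoted in the text: P2 (45 proteins) + P3 (8 proteins) = P1.
assert 6249 + 926 == 7175
```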

Phylogenetic Analyses
Neighbor-joining, maximum parsimony, and maximum likelihood (ML) analyses were carried out on all individual and concatenated data sets using MUST (Philippe 1993), PAUP 3.1 (Swofford 1993), and MOLPHY 2.3 (Adachi and Hasegawa 1996), respectively. Calculation of α-parameter values and other ML analyses, taking into account among-site rate variation (ASRV), were conducted using the program PUZZLE (Strimmer and von Haeseler 1996). ML bootstrap proportions were computed using the RELL method (Kishino, Miyata, and Hasegawa 1990) upon 2,000 top-ranking trees. For distance and parsimony analyses, 1,000 bootstrap replicates were computed. All individual and concatenated alignments and the corresponding phylogenetic trees are available at our web site (http://sorex.snv.jussieu.fr/archaea/rp.html).
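The principle of the RELL method (resampling of estimated log-likelihoods) can be sketched in a few lines of Python. This is an illustration of the general idea, not the MOLPHY implementation: the function name, the toy per-site log-likelihoods, and the number of replicates are assumptions. Instead of re-estimating each tree on every bootstrap sample, the per-site log-likelihoods computed once for each candidate tree are resampled with replacement, and one counts how often each tree obtains the highest total.

```python
import random


def rell_bootstrap(site_lnl, n_replicates=1000, seed=0):
    """Fraction of replicates in which each tree has the best resampled lnL.

    `site_lnl[t][s]` is the log-likelihood of site s under candidate tree t
    (hypothetical toy values below).
    """
    n_sites = len(site_lnl[0])
    rng = random.Random(seed)
    wins = [0] * len(site_lnl)
    for _ in range(n_replicates):
        # Draw sites with replacement, using the same sites for every tree.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = [sum(row[c] for c in cols) for row in site_lnl]
        wins[totals.index(max(totals))] += 1
    return [w / n_replicates for w in wins]


# Toy per-site log-likelihoods for three candidate trees (invented numbers).
toy_site_lnl = [
    [-1.0, -2.0, -1.5, -1.0],
    [-1.2, -1.8, -1.6, -1.1],
    [-3.0, -4.0, -3.5, -3.0],  # worse than the others at every site
]
proportions = rell_bootstrap(toy_site_lnl)
print(proportions)
```

The support for a given node would then be obtained by summing the proportions of all candidate trees that contain that node.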

Principal Components Analysis
To avoid the limitations of standard pairwise statistical comparisons of congruence between tree topologies (such as the Kishino-Hasegawa test [Kishino and Hasegawa 1990]; see Goldman, Anderson, and Rodrigo 2000), we carried out a simultaneous comparison of all the tree topologies obtained from the individual analyses and multigene fusions using a PCA approach (Brochier et al. in press). For this, tree topologies were obtained from the individual analyses of the 49 ribosomal proteins that included all the 14 archaeal species. These 49 topologies were chosen to represent the tree space (for a detailed discussion see Brochier et al. in press). The likelihood of each data set (both individual and concatenated) was computed for each of the 49 topologies using the programs MOLPHY 2.3 (Adachi and Hasegawa 1996) or PUZZLE 4.0 (Strimmer and von Haeseler 1996). Finally, each protein or rRNA was described by the 49 differences in likelihood values with respect to the best tree (measured as the number of standard deviations), which were analyzed by PCA using the program SAS (SAS 1999). The location of each individual or fusion alignment on a bidimensional diagram (the first two axes of the PCA) allowed us to study its congruence with the remaining data sets. Two points (i.e., alignments) that were close in the diagram indicated that the two corresponding genes were congruent (i.e., they similarly supported the various topologies).
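The PCA step can be illustrated with a small pure-Python sketch that extracts only the first principal axis by power iteration. It is not the SAS analysis used in the study: the function name, the starting vector, and the toy matrix (four data sets described by three topologies rather than 49) are assumptions for illustration. Each row stands for one data set and each column for the likelihood difference of one topology with respect to the best tree.

```python
def first_principal_component(rows, n_iter=200):
    """Return (axis, scores, explained_fraction) for the first PCA axis.

    `rows` holds one list per data set; columns are topologies
    (hypothetical toy values below).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Covariance matrix between topologies (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration converges to the eigenvector of the largest eigenvalue.
    v = [1.0 + 0.1 * j for j in range(p)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    # Coordinates of each data set along the first axis.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores, eigenvalue / total_variance


# Toy matrix: three data sets with similar profiles, one with a different one.
toy_rows = [
    [0.0, 1.0, 2.0],
    [0.1, 1.1, 2.1],
    [0.0, 0.9, 2.0],
    [2.0, 0.0, 0.5],  # supports the topologies differently from the others
]
axis, scores, explained = first_principal_component(toy_rows)
print(scores, explained)
```

In this toy example the fourth data set lies far from the other three along the first axis, which is the kind of separation the diagram is meant to reveal. Note that in the study itself the first axis mainly reflected data set size and incongruence appeared on the second and third axes, so a full analysis would require extracting further axes.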


    Results
 
We identified 53 ribosomal proteins that are almost ubiquitous in the 13 completely (or almost completely) sequenced archaeal genomes available and that have no duplicated copies. The sequences of all conserved archaeal ribosomal proteins were aligned individually and concatenated, leading to 7,175 unambiguously aligned amino acid positions. The ML tree of our concatenated protein set is shown in figure 1A. The tree is arbitrarily rooted between Crenarchaeota and Euryarchaeota. For comparison, a tree based on the fused small and large subunit rRNA (3,933 homologous positions) was inferred by the ML method (fig. 1B). The two trees were in excellent agreement: both recovered the monophyly of Crenarchaeota, the sister grouping of Sulfolobus and Aeropyrum, the basal branching of Pyrococcus within the Euryarchaeota, the sister grouping of Methanococcus and Methanobacterium (albeit weakly supported), the sister grouping of the two Halobacteriales and Methanosarcina, and the monophyly of a group including Archaeoglobus, Thermoplasmatales, Methanosarcina, and Halobacteriales. The only difference between the rRNA and the ribosomal protein trees was the position of Archaeoglobus, which branched with the Methanosarcina/Halobacteriales group in the ribosomal protein tree (bootstrap support [BS] of 94%), whereas it was a sister group of the Methanosarcina/Halobacteriales/Thermoplasmatales group in the rRNA tree (BS of 87%).



 
Fig. 1.—Maximum likelihood phylogenetic trees for the concatenated 53 ribosomal proteins (A, 7,175 positions) and the concatenated SSU and LSU rRNA (B, 3,933 positions). The calculations of the best tree and the branch lengths were conducted using the program PUZZLE with a Γ-law correction. Numbers close to nodes are ML BS computed by the RELL method upon 2,000 top-ranking trees using the MOLPHY program without correction for among-site variation. Scale bars correspond to 10 substitutions per 100 positions for a unit branch length. The trees were arbitrarily rooted between Crenarchaeota and Euryarchaeota.

 
The good agreement between ribosomal proteins and rRNA trees did not exclude LGTs. We therefore performed a detailed analysis of each protein through a PCA, as described in Materials and Methods. The first three axes explained 61%, 11%, and 10% of the variance. For the sake of clarity, only the first two axes are shown in figure 2, but the pattern provided by the third axis was the same (data not shown). As in our previous analysis (Brochier et al. in press), axis 1 mainly corresponded to the size of the markers, whereas axis 2 (and here 3) corresponded to the incongruence. It appeared that 45 ribosomal proteins formed a densely packed cloud on the right half of figure 2. This indicated that the corresponding proteins had similar likelihoods for all the topologies tested and were thus congruent: they shared the same history and were probably devoid of LGTs. In contrast, 8 out of the 53 ribosomal proteins were dispersed in the plot, suggesting that they were incongruent with the 45 ribosomal proteins as well as among themselves. Finally, the points corresponding to the 45 ribosomal proteins (triangles), the rRNA, and the concatenated ribosomal proteins were roughly on the same line (fig. 2). This suggested that the major difference between these three data sets was their size (~100 positions for individual proteins, ~4,000 for rRNA, and ~7,000 for concatenated proteins). It is interesting to note that the model used for tree reconstruction (with or without a Γ law model to handle ASRV and the simple Hasegawa/Kishino/Yano [Hasegawa, Kishino, and Yano 1985] or the more complex Tamura/Nei [Tamura and Nei 1993] model of nucleotide substitutions in rRNA) has very little influence on the results (see the proximity of the points RTN, RHKY, and RΓ, and P1 and P1Γ on fig. 2).



 
Fig. 2.—Principal components analysis of the likelihood values estimated for 49 tree topologies for different protein and rRNA data sets. The first axis (61% of the variance) mainly corresponded to sequence length and the second one (11% of the variance) to incongruence among data sets. Genes that were suspected to have undergone LGTs were represented by diamonds, whereas triangles corresponded to the 45 genes for which LGT was not suspected because of their close proximity. Circles corresponded to protein fusions (P1, P2, and P3) and crosses to rRNA fusions (R). All the likelihood values were computed with the JTT model without taking into account ASRV, except for P1Γ where a gamma law–modeled ASRV was used, RTN where the Tamura/Nei model was used, RHKY where the Hasegawa/Kishino/Yano model was used, and RTN+Γ where the Tamura/Nei model and a gamma law–modeled ASRV were used.

 
We inferred the ML phylogenies for the eight ribosomal proteins for which our PCA strongly suggested incongruence. However, on account of their small size (from 42 to 197 positions), the trees (see http://sorex.snv.jussieu.fr/archaea/rp.html) were very difficult to interpret because stochastic effects were too strong. Yet, a few examples of LGTs were clear, such as in the case of rpl12e, where Thermoplasmatales were paraphyletic and very close to Crenarchaeota (fig. 3).



 
Fig. 3.—Maximum likelihood phylogenetic tree for the Rpl12e protein (89 positions). Numbers close to nodes are ML BS. The tree was arbitrarily rooted. The scale bar corresponds to 10 substitutions per 100 positions for a unit branch length.

 
To determine the effect of the eight ribosomal proteins that have likely been affected by LGTs on the global phylogeny, we separately concatenated these eight proteins (fusion P3, 926 positions), the corresponding phylogeny being called the "dirty" tree, and the remaining 45 proteins (fusion P2; 6,249 positions), the corresponding phylogeny being called the "cleaned" tree. This latter tree (fig. 4A) was rather similar to the rRNA tree (fig. 1): the branching pattern within Crenarchaeota was the same, and the monophylies of Pyrococcus, Thermoplasmatales, and Methanosarcina/Halobacteriales were recovered. The major difference between the cleaned tree and the trees in figure 1 was the position of the Thermoplasmatales, here a sister group of Archaeoglobus (BS of 42%). For the global concatenated tree (fig. 1A), Thermoplasmatales were strongly excluded from a group consisting of Archaeoglobus, Methanosarcina, and Halobacteriales (BS of 94%). For the rRNA tree (fig. 1B), the sister grouping of Thermoplasmatales and Methanosarcina/Halobacteriales was highly supported (99%). A minor difference of the cleaned tree was the paraphyly instead of the monophyly of Methanococcus and Methanobacterium, but the support was very weak in all the cases (BS ~50%).



 
Fig. 4.—Maximum likelihood phylogenetic trees for the fusion P2 (A, 6,249 positions, the cleaned tree) and the fusion P3 (B, 926 positions, the dirty tree). The trees were rooted between Crenarchaeota and Euryarchaeota. Numbers close to nodes are ML BS. Scale bars correspond to 10 substitutions per 100 positions for a unit branch length.

 
Surprisingly, the dirty tree (fig. 4B) was not very different from either the global tree or the cleaned tree. The only difference was the basal position of Thermoplasmatales within Euryarchaeota (BS of 71%). This phylogeny suggested that there were few systematic biases in the direction of LGTs, except perhaps between Thermoplasmatales and Sulfolobales, which live in the same type of acidic and hot environment.


    Discussion
 
Tree Reconstruction Artifacts and Archaeal Phylogeny
The phylogeny obtained with the fusion of conserved ribosomal proteins (with all the proteins or without the ribosomal proteins potentially affected by LGT, figs. 1A and 4A, respectively) was in remarkable agreement with the rRNA phylogeny. In particular, it validated the emergence of aerobic halophiles and Thermoplasmatales from methanogenic ancestors and the basal position of Thermococcales in the euryarchaeal tree. This suggested that these two trees were good approximations of the actual species tree, although they might be similarly biased by some specific features of ribosome evolution. However, this latter explanation seems unlikely because a similar tree topology was obtained with unrelated informational proteins such as RNA polymerase (see later) or DNA topoisomerase VI (J. Filée, personal communication). Nevertheless, at least two important biases always affect phylogenetic reconstruction, i.e., the variation in evolutionary rates and the nucleotide or amino acid composition.

In both ribosomal protein and rRNA trees (fig. 1), the branches leading to hyperthermophilic species were systematically shorter than those leading to mesophilic species (if the trees were rooted between Euryarchaeota and Crenarchaeota), indicating that the rates of evolution of rRNA and ribosomal proteins in hyperthermophiles have been slower than in their mesophilic counterparts. The difference between the lengths of the branches leading to hyperthermophilic and mesophilic species was slightly less pronounced in the ribosomal protein tree than in the rRNA tree. A slower evolutionary rate (i.e., the systematically short branches of hyperthermophilic lineages in the rRNA tree) could be explained by the structural constraints related to their high G + C content and the requirement to stabilize rRNA secondary and tertiary structures at high temperatures. The observation of a similar phenomenon in the ribosomal protein tree could also be explained by structural constraints related to protein thermostability.

It has been previously suggested that the common ancestor of all Archaea was a hyperthermophile, based on the basal position of hyperthermophilic archaea in the rRNA tree (Achenbach-Richter et al. 1988). Another argument in favor of a hyperthermophilic archaeal ancestor is the similarity between the archaeal rRNA phylogeny and the phylogeny of reverse gyrase, a protein specific to hyperthermophiles (Forterre et al. 2000). If this is the case, both rRNA and ribosomal proteins from hyperthermophilic archaea would have retained more ancestral characters than their mesophilic or moderately thermophilic counterparts. Moreover, if nonhyperthermophilic species evolved from hyperthermophilic ancestors, they would have adapted to a new environment (often an extreme one, i.e., halophilic or acidophilic), and it is therefore expected that their genes underwent an accelerated evolutionary rate. The LBA artifact probably did not highly bias our cleaned tree (fig. 4A) because the two longest branches (Thermoplasmatales and Halobacteriales) did not group together but clustered with the slowly evolving species (Archaeoglobus and Methanosarcina, respectively). Yet, these variable evolutionary rates provided serious noise in our alignment, which could explain the low BS in our cleaned tree despite the use of ~6,000 orthologous positions. In contrast, in the rRNA tree (fig. 1B) the longest branches (Thermoplasmatales, Halobacteriales, and to a lesser extent Methanosarcina) strongly clustered together (BS of 99%), suggesting LBA.

In addition, the archaeal species used in this study live in extreme environments, which could strongly constrain their evolutionary pattern. For example, it is well known that hyperthermophilic species share high G + C in their rRNA sequences (Galtier and Lobry 1997) and can artifactually cluster together in the corresponding phylogenies (Woese et al. 1991; Embley, Thomas, and Williams 1992). In the rRNA tree (fig. 1B), all the species with low G + C content (below 56%, Methanosarcina, Haloarcula, Halobacterium, Thermoplasma, and Ferroplasma) were grouped with a very high statistical support (BS of 99%). This grouping was likely not only the result of an LBA artifact but also of a compositional artifact. For the ribosomal proteins, amino acid compositional biases surely exist because they are obvious at the level of the proteome (Kreil and Ouzounis 2001), but contrary to the nucleotide bias of rRNA, they are probably not convergent in halophilic and acidophilic species. Indeed, halophilic species have a K+-rich intracellular environment, and their proteins (especially the surface) are rich in aspartic and glutamic acids with an average pI of 5.1 (Ng et al. 2000). In Thermoplasma the intracellular pH is more acidic, and its proteins have a higher pI (Kawashima et al. 2000). As a result, from a compositional point of view, Thermoplasmatales and Halobacteriales will tend to be repulsed, which could counteract the attraction created by their convergent high evolutionary rate. The compositional bias thus explained why Thermoplasmatales grouped strongly with Methanosarcina/Halobacteriales in the rRNA tree (fig. 1B) and weakly with Archaeoglobus in the cleaned tree (fig. 4A), whereas the LBA artifact would simply predict the grouping of Thermoplasmatales and Halobacteriales.
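The G + C content on which the compositional argument above rests is straightforward to compute. The following Python sketch is illustrative only: the function name and the two toy sequences are hypothetical, and only the 56% cutoff is taken from the text.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T or U)."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = gc + sum(seq.count(base) for base in "ATU")
    if total == 0:
        raise ValueError("no unambiguous nucleotides in sequence")
    return gc / total


LOW_GC_THRESHOLD = 0.56  # the 56% cutoff quoted in the text

# Toy rRNA fragments with invented sequences (not data from the study).
toy_rrna = {
    "toy_high": "GGCGCCGGAU",
    "toy_low": "AUGCAUUAGC",
}
low_gc = [name for name, seq in toy_rrna.items()
          if gc_content(seq) < LOW_GC_THRESHOLD]
print(low_gc)
```

Sequences falling on the same side of such a cutoff are the ones that a reconstruction method insensitive to base composition may tend to group together.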

Our ribosomal phylogeny, even when cleared of LGT problems, should only be considered as an approximation of the species tree, not only because of a potential bias caused by ribosome coevolution but also because several additional biases (rate, amino acid composition, or covarion structure) are not well handled by the available tree reconstruction methods. Moreover, given the several weak bootstrap values (42%, 51%, and 86%) in our cleaned tree (fig. 4A), a complete resolution of the archaeal phylogeny would also require many more positions (i.e., genes) than were used here (~7,000). Nevertheless, the topologies of the euryarchaeal rRNA and ribosomal protein trees implied that the ability to perform methanogenesis appeared early on in Euryarchaeota and was lost several times independently because methanogens were never retrieved as a coherent group.

Rare Genomic Event as Phylogenetic Marker
The position of Thermoplasmatales in the rRNA tree was previously disputed on the ground that (1) the RNA polymerase of T. acidophilum is of the ABC type, as is the case for the RNA polymerases of Crenarchaeota and Thermococcales, whereas the RNA polymerases of all methanogens, Archaeoglobus, and Halobacteriales are of the AB'B''C type; and (2) T. acidophilum branches before methanogens in an RNA polymerase phylogeny based on nucleotide sequence comparison (Klenk, Palm, and Zillig 1994). However, in the present analysis (figs. 1 and 4) we found that Thermoplasmatales branch after Methanococcus and Methanobacterium in the euryarchaeal part of the tree in both rRNA and ribosomal protein phylogenies. Furthermore, we obtained the same result when we constructed a phylogenetic tree of the available amino acid sequences of all archaeal RNA polymerase B subunits (fig. 5). This tree also turned out to be identical to the global ribosomal protein tree (fig. 1A). Our topology was only slightly different from the one found by Klenk, Palm, and Zillig (1994), with the monophyly of Thermoplasma/Halobacterium instead of their paraphyly. This discrepancy likely resulted from the use of a different species sampling (only 6 by Klenk et al. 1992 and 16 in fig. 5) and perhaps also from a different tree reconstruction method (ML on nucleotides vs. ML on amino acids).



 
Fig. 5.—ML phylogenetic tree for the RNA polymerase RPOB (1,035 positions). S indicates the branch in which we assumed that the split event leading to RPOB' and RPOB'' genes occurred, and F the branch in which we assumed that RPOB' and RPOB'' genes fused. The tree was rooted between Crenarchaeota and Euryarchaeota. Numbers close to nodes are ML BS. The scale bar corresponds to 10 substitutions per 100 positions for a unit branch length.

 
Because the RNA polymerase RPOB is composed of a single subunit in all eukaryotes, in all crenarchaeotes, and in a few euryarchaeotes, the most parsimonious explanation is to assume that a single-gene split event occurred during the euryarchaeote evolution (Klenk, Palm, and Zillig 1994), which would support the early emergence of Thermococcales and Thermoplasmatales in this subgroup. However, all our phylogenies, except the dirty tree (fig. 4B), provide strong support for the emergence of Thermoplasmatales after Methanococcus and Methanobacterium (BS of 99% for rRNA, 98% for the cleaned ribosomal protein tree, and 76% for the RPOB tree). To reconcile these two results, one has to assume that the early split event of RPOB was followed by a gene fusion event of RPOB' and RPOB'' in a common ancestor of Thermoplasmatales. Such a fusion would have been a relatively easy event because the genes encoding the B' and B'' subunits are contiguous in all known archaeal genomes. Our analysis therefore suggested that a nonparsimonious explanation (one split and one fusion) was the best explanation for the evolution of RPOB in Archaea. As a result, although rare genomic events might be useful to infer phylogeny (Rokas and Holland 2000), they can also be prone to homoplasy and should be used carefully.

Relative Importance of Phylogenetic or Environmental Distance in LGT Frequencies
During our BLAST searches and preliminary phylogenetic analyses, we detected no indication of LGT of ribosomal proteins between Archaea and either Bacteria or Eucarya. It has been suggested that rpl23 of Helicobacter pylori was acquired from Archaea (Hansmann and Martin 2000). However, an analysis with an extensive species sampling (28 Eucarya, 13 Archaea, and 94 Bacteria) supported the clustering of Helicobacter with the other Bacteria, with a weak bootstrap value but also through a very conserved insertion of 13 amino acids specific to Bacteria (data not shown). The previous observation was likely the result of a combination of stochastic effects (fewer than 70 homologous positions and only 18 species) and an LBA artifact (Helicobacter evolving very fast). Similarly, although LGTs can be frequent within Bacteria (Brochier, Philippe, and Moreira 2000), we failed to detect LGTs between Bacteria and Archaea/Eucarya (Brochier et al. in press). This suggested that the ribosomes of the three domains have diverged sufficiently from each other to prevent the successful interdomain replacement of a ribosomal protein. This is in agreement with the hypothesis that informational proteins are resistant to long-range LGT (Jain, Rivera, and Lake 1999; Graham et al. 2000).

According to our PCA approach, 8 of the 53 ribosomal proteins (15%) have undergone at least one LGT event during the evolution of Archaea. Interestingly, the phylogeny inferred from the concatenation of the eight LGT proteins (fig. 4B) was not very different from the cleaned phylogeny (fig. 4A). This suggested that LGTs were rare for these eight proteins and that they just contributed random phylogenetic noise (i.e., there was no major bias in the direction of the transfer, but see later). The scarcity of LGTs was confirmed by the fact that the monophyly of closely related species (Pyrococcus, Halobacteriales, and Thermoplasmatales) was always recovered, except in four cases: the time elapsed since the common ancestor of each of these three groups is too short for a rare LGT event to have a high probability of being observed. An alternative explanation would be that LGTs of ribosomal proteins occurred only between organisms that are phylogenetically close (Woese 2000), which could explain why a phylogenetic structure persists in the sequences even if LGTs are frequent.

However, the two major differences between the dirty (fig. 4B) and the cleaned (fig. 4A) trees suggested a quite different and probably more plausible explanation. The early emergence of Thermoplasmatales (fig. 4B) could be explained by an attraction of their branches by the crenarchaeal branches, because of specific LGT between Thermoplasmatales and Sulfolobales. This hypothesis was supported by the phylogenies of rpl12e proteins, where a robust relationship between Thermoplasmatales and Crenarchaeota could be observed (fig. 3). Indeed, Baumeister and coworkers (Ruepp et al. 2000) observed that the genome of T. acidophilum contains many genes that are more closely related to Sulfolobus than to the euryarchaeal relatives of Thermoplasmatales, a fact easily explained by the observation that Sulfolobus and Thermoplasma thrive in the same type of acidic and hot environment. This indicated that the ribosomes had not diverged sufficiently in the two archaeal phyla to prevent protein exchange between them. Indeed, the analyses of complete genomes have already suggested a higher frequency of LGTs between organisms thriving in the same environment: Aquifex and thermophilic Archaea (Aravind et al. 1998), Thermotoga and thermophilic Archaea (Nelson et al. 1999), chloroplast, mitochondrion, and nucleus (Marienfeld, Unseld, and Brennicke 1999; Gallois et al. 2001), Chlamydia and Rickettsia (Wolf, Aravind, and Koonin 1999), Sinorhizobium and Streptomyces (B. Golding, personal communication), and Thermoplasma and Sulfolobus (Ruepp et al. 2000). For ribosomal proteins, LGTs were rare; they appeared to be very difficult, if not impossible, between phylogenetically very distant organisms (i.e., between the three domains) and to be mainly directed by the physical proximity of the organisms rather than by their phylogenetic proximity.


    Acknowledgements
 
We thank Sorel Fitz-Gibbon for kindly providing ribosomal sequences of P. aerophilum. Preliminary sequence data were obtained from The DOE Joint Genome Institute (JGI) at http://www.jgi.doe.gov/JGI_microbial/html/index.html. We thank Simonetta Gribaldo, Philippe Lopez, and David Moreira for careful reading of the manuscript.


    Footnotes
 
Pierre Capy, Reviewing Editor

1 These authors contributed equally to the work.

Keywords: Archaea, lateral gene transfer, molecular phylogeny, multigene analysis, ribosomal proteins

Address for correspondence and reprints: Hervé Philippe, Phylogénie, Bioinformatique et Génome, UMR 7622 CNRS, Université Pierre et Marie Curie, 9, quai St Bernard, 75005 Paris, France. herve.philippe@snv.jussieu.fr.


    References
 

    Achenbach-Richter L., R. Gupta, W. Zillig, C. R. Woese, 1988 Rooting the archaebacterial tree: the pivotal role of Thermococcus celer in archaebacterial evolution Syst. Appl. Microbiol 10:231-240

    Adachi J., M. Hasegawa, 1996 MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood Comput. Sci. Monogr 28:1-150

    Altschul S. F., W. Gish, W. Miller, E. W. Myers, D. J. Lipman, 1990 Basic local alignment search tool J. Mol. Biol 215:403-410

    Aravind L., R. L. Tatusov, Y. I. Wolf, D. R. Walker, E. V. Koonin, 1998 Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles Trends Genet 14:442-444

    Asai T., C. Condon, J. Voulgaris, D. Zaporojets, B. Shen, M. Al-Omar, C. Squires, C. L. Squires, 1999 Construction and initial characterization of Escherichia coli strains with few or no intact chromosomal rRNA operons J. Bacteriol 181:3803-3809

    Ban N., P. Nissen, J. Hansen, P. B. Moore, T. A. Steitz, 2000 The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution Science 289:905-920

    Bernander R., 2000 Chromosome replication, nucleoid segregation and cell division in archaea Trends Microbiol 8:278-283

    Brochier C., E. Bapteste, D. Moreira, H. Philippe, 2002 Eubacterial phylogeny based on translational apparatus proteins Trends Genet 18:1-5

    Brochier C., H. Philippe, D. Moreira, 2000 The evolutionary history of ribosomal protein RpS14: horizontal gene transfer at the heart of the ribosome Trends Genet 16:529-533

    Brown J. R., W. F. Doolittle, 1997 Archaea and the prokaryote-to-eukaryote transition Microbiol. Mol. Biol. Rev 61:456-502

    Doolittle W. F., 1999 Phylogenetic classification and the universal tree Science 284:2124-2129

    Embley T. M., R. H. Thomas, R. A. D. Williams, 1992 Reduced thermophilic bias in the 16S rDNA sequence from Thermus ruber provides further support for a relationship between Thermus and Deinococcus Syst. Appl. Microbiol 16:25-29

    Felsenstein J., 1978 Cases in which parsimony or compatibility methods will be positively misleading Syst. Zool 27:401-410

    Forterre P., 1997 Archaea: what can we learn from their sequences? Curr. Opin. Genet. Dev 7:764-770

    Forterre P., C. Bouthier De La Tour, H. Philippe, M. Duguet, 2000 Reverse gyrase from hyperthermophiles: probable transfer of a thermoadaptation trait from archaea to bacteria Trends Genet 16:152-154

    Gallois J., P. Achard, G. Green, R. Mache, 2001 The Arabidopsis chloroplast ribosomal protein L21 is encoded by a nuclear gene of mitochondrial origin Gene 274:179-185

    Galtier N., J. R. Lobry, 1997 Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes J. Mol. Evol 44:632-636

    Goldman N., J. P. Anderson, A. G. Rodrigo, 2000 Likelihood-based tests of topologies in phylogenetics Syst. Biol 49:652-670

    Graham D. E., R. Overbeek, G. J. Olsen, C. R. Woese, 2000 An archaeal genomic signature Proc. Natl. Acad. Sci. USA 97:3304-3308

    Hansmann S., W. Martin, 2000 Phylogeny of 33 ribosomal and six other proteins encoded in an ancient gene cluster that is conserved across prokaryotic genomes: influence of excluding poorly alignable sites from analysis Int. J. Syst. Evol. Microbiol 50:1655-1663

    Hasegawa M., H. Kishino, T. Yano, 1985 Dating of the human–ape splitting by a molecular clock of mitochondrial DNA J. Mol. Evol 22:160-174

    Huber R., H. Huber, K. O. Stetter, 2000 Towards the ecology of hyperthermophiles: biotopes, new isolation strategies and novel metabolic properties FEMS Microbiol. Rev 24:615-623

    Jain R., M. C. Rivera, J. A. Lake, 1999 Horizontal gene transfer among genomes: the complexity hypothesis Proc. Natl. Acad. Sci. USA 96:3801-3806

    Kawashima T., N. Amano, H. Koike, et al. (12 co-authors) 2000 Archaeal adaptation to higher temperatures revealed by genomic sequence of Thermoplasma volcanium Proc. Natl. Acad. Sci. USA 97:14257-14262

    Kishino H., M. Hasegawa, 1990 Converting distance to time: application to human evolution Methods Enzymol 183:550-570

    Kishino H., T. Miyata, M. Hasegawa, 1990 Maximum likelihood inference of protein phylogeny, and the origin of chloroplasts J. Mol. Evol 31:151-160

    Klenk H. P., P. Palm, W. Zillig, 1994 DNA-dependent RNA polymerases as phylogenetic marker molecules Syst. Appl. Microbiol 16:638-647

    Klenk H. P., O. Renner, V. Schwass, W. Zillig, 1992 Nucleotide sequence of the genes encoding the subunits H, B, A' and A'' of the DNA-dependent RNA polymerase and the initiator tRNA from Thermoplasma acidophilum Nucleic Acids Res 20:5226

    Kreil D. P., C. A. Ouzounis, 2001 Identification of thermophilic species by the amino acid compositions deduced from their genomes Nucleic Acids Res 29:1608-1615

    Lan R., P. R. Reeves, 1996 Gene transfer is a major factor in bacterial evolution Mol. Biol. Evol 13:47-55

    Lockhart P., M. Steel, M. Hendy, D. Penny, 1994 Recovering evolutionary trees under a more realistic model of sequence evolution Mol. Biol. Evol 11:605-612

    Ludwig W., H. P. Klenk, 2001 Overview: a phylogenetic backbone and taxonomic framework for procaryotic systematics Pp. 49-65 in D. R. Boone and R. W. Castenholz, eds. Bergey's manual of systematic bacteriology. Springer, Berlin

    Makarova K. S., L. Aravind, M. Y. Galperin, N. V. Grishin, R. L. Tatusov, Y. I. Wolf, E. V. Koonin, 1999 Comparative genomics of the Archaea (Euryarchaeota): evolution of conserved protein families, the stable core, and the variable shell Genome Res 9:608-628

    Marienfeld J., M. Unseld, A. Brennicke, 1999 The mitochondrial genome of Arabidopsis is composed of both native and immigrant information Trends Plant Sci 4:495-502

    Nelson K. E., R. A. Clayton, S. R. Gill, et al. (48 co-authors) 1999 Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima Nature 399:323-329

    Ng W. V., S. P. Kennedy, G. G. Mahairas, et al. (40 co-authors) 2000 Genome sequence of Halobacterium species NRC-1 Proc. Natl. Acad. Sci. USA 97:12176-12181

    Ochman H., J. G. Lawrence, E. A. Groisman, 2000 Lateral gene transfer and the nature of bacterial innovation Nature 405:299-304

    Olsen G. J., C. R. Woese, 1997 Archaeal genomics: an overview Cell 89:991-994

    Pace N. R., 1997 A molecular view of microbial diversity and the biosphere Science 276:734-740

    Philippe H., 1993 MUST, a computer package of management utilities for sequences and trees Nucleic Acids Res 21:5264-5272

    Philippe H., P. Forterre, 1999 The rooting of the universal tree of life is not reliable J. Mol. Evol 49:509-523

    Philippe H., A. Germot, D. Moreira, 2000 The new phylogeny of eukaryotes Curr. Opin. Genet. Dev 10:596-601

    Philippe H., J. Laurent, 1998 How good are deep phylogenetic trees? Curr. Opin. Genet. Dev 8:616-623

    Ragan M. A., 2001 On surrogate methods for detecting lateral gene transfer FEMS Microbiol. Lett 201:187-191

    Rivera M. C., R. Jain, J. E. Moore, J. A. Lake, 1998 Genomic evidence for two functionally distinct gene classes Proc. Natl. Acad. Sci. USA 95:6239-6244

    Rokas A., P. W. H. Holland, 2000 Rare genomic changes as a tool for phylogenetics Trends Ecol. Evol 15:454-459

    Ruepp A., W. Graml, M. L. Santos-Martinez, K. K. Koretke, C. Volker, H. W. Mewes, D. Frishman, S. Stocker, A. N. Lupas, W. Baumeister, 2000 The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum Nature 407:508-513

    SAS. 1999 SAS/STAT user's guide SAS Institute Inc., Cary, NC

    She Q., R. K. Singh, F. Confalonieri, et al. (30 co-authors) 2001 The complete genome of the crenarchaeon Sulfolobus solfataricus P2 Proc. Natl. Acad. Sci. USA 98:7835-7840

    Strimmer K., A. von Haeseler, 1996 Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies Mol. Biol. Evol 13:964-969

    Swofford D. L., 1993 PAUP: phylogenetic analysis using parsimony. Version 3.1.1 Illinois Natural History Survey, Champaign, Ill

    Tamura K., M. Nei, 1993 Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees Mol. Biol. Evol 10:512-526

    Thompson J. D., D. G. Higgins, T. J. Gibson, 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res 22:4673-4680

    Woese C. R., 1987 Bacterial evolution Microbiol. Rev 51:221-271

    Woese C. R., 2000 Interpreting the universal phylogenetic tree Proc. Natl. Acad. Sci. USA 97:8392-8396

    Woese C., L. Achenbach, P. Rouviere, L. Mandelco, 1991 Archaeal phylogeny: reexamination of the phylogenetic position of Archaeoglobus fulgidus in light of certain composition-induced artifacts Syst. Appl. Microbiol 14:364-371

    Woese C. R., G. J. Olsen, M. Ibba, D. Soll, 2000 Aminoacyl-tRNA synthetases, the genetic code, and the evolutionary process Microbiol. Mol. Biol. Rev 64:202-236

    Wolf Y. I., L. Aravind, E. V. Koonin, 1999 Rickettsiae and Chlamydiae: evidence of horizontal gene transfer and gene exchange Trends Genet 15:173-175

Accepted for publication December 6, 2001.