Phylogeny of Eukaryotes Based on Ribosomal RNA: Long-Branch Attraction and Models of Sequence Evolution

Hervé Philippe* and2, and Agnès Germot{ddagger}

*Equipe Phylogénie et Evolution moléculaires (UPRESA CNRS 8080), Université Paris-Sud, Orsay, France; and
{dagger}Laboratoire de Biologie des Protistes (UPRESA CNRS 6023), Université Clermont-Ferrand 2, Aubière, France

The molecular phylogeny of eukaryotes was first based on the comparison of the small-subunit ribosomal RNA (SSU rRNA) and quickly became a paradigm (Sogin 1991Citation ). However, serious criticisms have recently been raised against it. In particular, it has been suggested that the various protist lineages that emerged first (e.g., microsporidia, trichomonads, mycetozoans) were misplaced because of the long-branch attraction (LBA) artifact (Philippe and Adoutte 1998Citation ; Philippe and Laurent 1998Citation ; Germot and Philippe 1999Citation ; Moreira, Le Guyader, and Philippe 1999Citation ; Roger et al. 1999Citation ; Stiller and Hall 1999Citation ). Indeed, simulation studies (unpublished data) showed that when a distant outgroup was used, all the standard methods of phylogenetic reconstruction artifactually inferred a phylogeny with an asymmetrical base containing the fast-evolving lineages, leading to a tree shape very similar to the rRNA one.

When the substitution rates are considerably variable for different lineages, the use of a model that reflects the actual substitution probability is of utmost importance for obtaining a correct tree topology (Lockhart et al. 1994Citation ; Swofford et al. 1996Citation ; Yang 1996Citation ; Takezaki and Gojobori 1999Citation ). In the case of rRNA-based eukaryotic phylogeny, the first model violation noticed was the variation of G + C content among species (Hasegawa and Hashimoto 1993Citation ). Yet, the overall picture of the rRNA tree remained unchanged when only species with similar G + C contents were included (Leipe et al. 1993Citation ) or when methods less sensitive to this bias, such as transversion parsimony (Woese et al. 1991Citation ) or log-Det distance (Lockhart et al. 1994Citation ), were applied. Variation in evolutionary rates across sites (RAS) is a further complexity of sequence evolution which also has a significant effect on evolutionary tree building (Olsen 1987Citation ; Yang 1996Citation ). For instance, the late emergence of fast-evolving lineages such as the nucleomorph of chloroarachniophytes (Van de Peer et al. 1996Citation ) or microsporidia (Peyretaillade et al. 1998Citation ) can be retrieved only when RAS models are assumed. Finally, the existence of covarion structure constitutes a further model violation that has just begun to be considered (Lockhart et al. 1998Citation ; Lopez, Forterre, and Philippe 1999Citation ).

In this study, we sequenced the complete large-subunit (LSU) rRNA of Trichomonas vaginalis. We concatenated this sequence with that of the SSU rRNA for this taxon and compared these data with similar data from other eukaryotes. Using these data, we investigated the robustness of eukaryotic phylogeny with consideration for the model of sequence evolution used. The protocol for obtaining the LSU rRNA using several universal and specific primers was the same as that previously described (Germot, Philippe, and Le Guyader 1996Citation ), except that we sequenced both strands here. The size of the LSU rRNA was approximately 2,760 nt (the beginning and end were identified only by homology with other rRNAs), and its G + C content was 46%. The sequence of T. vaginalis was aligned manually with the available complete LSU rRNA sequences of 120 eukaryotes and of 15 archaebacteria, using the ED program of the MUST package (Philippe 1993Citation ). To increase the sequence lengths, we aligned SSU rRNA sequences for the same set of organisms whenever possible, and only the fusion of SSU and LSU rRNA was used for further analysis. A subset of only 32 species, encompassing as much biodiversity as possible (i.e., by eliminating numerous fungal, plant, and metazoan species), was used in order to reduce the computing time of our maximum-likelihood (ML) analysis. Only positions that could be unambiguously aligned were selected, leaving 875 and 1,604 positions for SSU and LSU rRNA, respectively. The alignment is available on request or at http://bufo.bc4.u-psud.fr.

The G + C content of rRNA appeared to be significantly heterogeneous between species at the 5% level, as measured by the {chi}2 statistics implemented in PUZZLE (Strimmer and von Haeseler 1996Citation ). Rather than coping with this model violation using a nonhomogeneous model (e.g., Galtier and Gouy 1998Citation ), only transversions were considered, since the ratio of purines to pyrimidines appeared to be homogeneous for all of the considered organisms, as previously observed for prokaryotic rRNA (Woese et al. 1991Citation ). This allowed us to use complex RAS models, such as those using a {Gamma} law distribution. Moreover, transitions were more saturated than transversions in a dot plot comparison (data not shown) and therefore were less likely to contain a reliable ancient phylogenetic signal.

The ML tree that was inferred with a model assuming equal rates across sites was very similar to the ones inferred from the SSU rRNA only (fig. 1A ). The three "archezoan" lineages (trichomonads, microsporidia, and diplomonads) emerged first, followed by euglenozoans and mycetozoans, and then by the crown. Within the tree, the monophyly of the various phyla was recovered with good statistical support. The main exception was the artifactual grouping of the nucleomorphs of chloroarachniophytes (Pedinomonas) and of cryptophytes (Guillardia), instead of being sister groups of chlorophytes and rhodophytes, respectively (Van de Peer et al. 1996Citation ). As with several early branching lineages, the nucleomorph of chlorarachniophytes displayed a long branch, which suggested a possible misplacement due to the LBA artifact.



View larger version (35K):
[in this window]
[in a new window]
 
Fig. 1.—A, Phylogenetic tree based on the fusion of SSU and LSU rRNAs. The tree was constructed by the maximum-likelihood (ML) method, using the quick search option of the nucml program (Adachi and Hasegawa 1996Citation ). The bootstrap values calculated using the RELL method on the 2,000 top-ranking trees for ML and by analysis of 1,000 replicates for maximum parsimony and neighbor joining arc were given for each node in bold, italic, and normal style, respectively. When the three methods provide bootstrap values greater than 90%, only the ML values are given. The scale bar represents the number of substitutions per site for a unit branch length. The nucleomorph sequences are indicated by nm. B, A revised phylogeny of eukaryotes. This tree was reconstructed according to phylogenies based on other markers (mainly actin, cytochrome oxidase, EF 1{alpha}, EF2, RNA polymerase, and tubulins). The constrained groups were microsporidia/fungi/metazoa (Baldauf and Doolittle 1997Citation ; Hirt et al. 1999Citation ), rhodophytes/chlorobiontes (Burger et al. 1999Citation ), mycetozoa (Baldauf and Doolittle 1997Citation ; Philippe and Adoutte 1998Citation ; unpublished data), diplomonads/trichomonads (Hashimoto et al. 1998Citation ; Keeling, Deane, and McFadden 1998Citation ), and euglenozoa/rhodophytes/chlorobiontes (Moreira, Le Guyader, and Philippe 1999Citation ). The branch lengths were computed using PUZZLE with eight variant gamma categories

 
We questioned the asymmetrical base on the tree of figure 1A , since it could be due to the use of an incorrect model of sequence evolution that rendered the ML method sensitive to LBA artifacts. This tree was compared with a revised one (fig. 1B ) whose topology was built according to phylogenies based on other markers (see the legend of fig. 1B for a detailed justification). The likelihood behavior of the two trees was studied according to four models of among-sites rate heterogeneity (table 1 ): (1) equal rates assumed at all sites (E); (2) a proportion of sites estimated to be invariable (I); (3) rates at all sites assumed to follow a gamma distribution ({Gamma}); and (4) some sites assumed to be invariable, with gamma-distributed rates at variable sites (I+{Gamma}). All of the ML calculations were carried out with the use of the PUZZLE software (Strimmer and von Haeseler 1996Citation ). Finally, these likelihoods were also compared with those of control trees: (1) the most-parsimonious (MP) trees computed by PAUP, version 3.1.1 (Swofford 1993Citation ), using either transitions and transversions or transversions; (2) the neighbor-joining (NJ) tree computed by MUST using p-distances with transversions only; and (3) 100 random trees generated by the MacClade program (Maddison and Maddison 1992Citation ).


View this table:
[in this window]
[in a new window]
 
Table 1 Comparison of the Log Likelihoods of Alternative Tree Topologies Using Different models of Sequence Evolution

 
The likelihood of the ML tree increased as the RAS heterogeneity was taken into account (lnL = -17,856.75 for E and -16,659.40 for {Gamma}, with a shape parameter of 0.43). Since the E model is nested within the {Gamma} model, the significance of this improvement can be evaluated using a likelihood ratio test, which is twice the difference in log-likelihood compared with the {chi}2 distribution with one degree of freedom ({alpha}, the gamma distribution shape parameter). The value of the test statistic was 2,394.7, whereas the critical value at the 0.001 significance level was 10.83. This indicated a dramatic improvement in fit, which is not surprising, since, when two rate categories of sites were considered (variable and invariable), the percentage of invariable sites was estimated to be 31%. In the case of the I + {Gamma} model, the frequency of invariable sites was estimated to be null, rendering this model identical to the {Gamma} one. Interestingly, our revised tree, which was strongly rejected by a Kishino/Hasegawa test (-5.16 SE), was less rejected when RAS heterogeneity was considered (-4.22 SE for I model and only -2.27 for {Gamma} model, close to the significance limit of -1.96 SE). This decrease in rejection was not observed for MP and NJ trees. For instance, the MP tree based on all the substitutions, which was not rejected with the E model (-0.88 SE), was significantly rejected with the {Gamma} model (-2.05 SE). However, random trees followed the same trend compared with the tree of figure 1B , but to a lesser extent (from -28.27 ± 0.54 SE for the E model to -23.77 ± 0.52 SE only for the {Gamma} one): the SE ratio of the revised tree to the random ones went from 0.18 to 0.10. Since the revised tree was much less rejected when the best-fitted model was used, its strong rejection with the E model is likely due to the fact that parts of the revised tree are correct but are obscured in the standard tree because of an LBA artifact. Nevertheless, it was still significantly rejected, even with the {Gamma} model.

Because parts of the rRNA tree (fig. 1A ) are clearly in error (e.g., the artifactual grouping of red and green nucleomorphs), the decrease in rejection of the alternative tree as more complex models are incorporated into the analyses may reflect only this difference between the trees. We therefore tried to determine which portions of the alternative tree (fig. 1B ) were making a contribution to the relative improvement in likelihood score as models were changed. As expected, each constraint seemed to contribute to a small fraction of the difference in lnL between the revised tree and the classical tree. Unfortunately, the impact of the model used was unclear (table 1 and data not shown). For example, the enforcement of the correct position of the two nucleomorphs is less rejected with the E model (-0.21 SE) than with the {Gamma} model (-0.40 SE). In contrast, enforcing the sister group relationship of fungi and microsporidia is much easier with the {Gamma} model (-0.49 SE) than with the E model (-1.88 SE). More detailed analysis was difficult because of stochastic effects and interdependence of the constraints.

The strong rejection of the revised tree (the tree specified in fig. 1B ) under the E model and not under the {Gamma} model suggests the significance of the presence of rate heterogeneity and/or covarion structure in the data. In the covarion model, sites that are variable in some parts of the underlying tree are invariable in others, and vice versa (Fitch and Markowitz 1970Citation ). Indeed, as we show below, a concern about phylogenetic inference from these data is the demonstration that the rDNA data do appear to have evolved under a covarion model, and one which is nonstationary. If proportions of invariable sites increase independently in distantly related lineages, then even ML tree building that allows for positional heterogeneity is expected to become inconsistent (Lockhart et al. 1996, 1998Citation ).

We applied the method of Lopez, Forterre, and Philippe (1999)Citation to test for the existence of a covarion structure. Since the method requires a large number of observed substitutions to be efficient, we used an expanded alignment of 60 species (divided into three groups, Archaea, basal eukaryotes, and "crown" eukaryotes) and both transitions and transversions (similar results were obtained when only transversions were used; data not shown). The matrix appeared to be significantly heterogeneous at the 1% level (the observed {chi}2 was 3,152.10, whereas the limit was 2,394.76), demonstrating the existence of a covarion structure in rRNA. To reduce the model violations, we therefore removed the 153 positions that appeared to be significantly heterogeneous at the 5% level. These positions mainly corresponded to positions that are varied in Archaea and not in Eukaryota (38%) and to positions that are varied in the crown groups and not in Archaea or in sequences present in the asymmetrical base (27%). The resulting matrix, containing 2,326 positions, was not heterogeneous. Because of the lack of a covarion structure, the {Gamma} model likely fits the data much better, although it is not possible to use a likelihood ratio test. Interestingly, with this homogeneous matrix, the tree of figure 1B is not significantly rejected (-1.35 SE when compared with the tree of fig. 1A ). This strong decrease in the rejection of the revised tree is not observed for random trees (from -23.77 SE to -23.05 SE), indicating again that this rejection is due to the existence of model violations.

In conclusion, support for the classical rRNA tree (fig. 1A ) obtained using the E model appears to be due to model violations. Our results reported here and elsewhere suggest that the long branches of the diplomonads, microsporidia, and trichomonads have been falsely attracted toward the very long branch of the archaebacterial outgroup (fig. 1B ) as a consequence of different G + C contents, rate heterogeneity, and an increased proportion of variable positions in these ingroup taxa when compared with those taxa in the crown groups. Our observation that the tree in figure 1B could not be excluded as having a worse fit than that in figure 1A when consideration was given to the nonstationary property of biological sequence evolution is a concern for eukaryote sequence studies relying on stationary analysis models.

However, it was not possible to find good statistical support for the widely accepted grouping of microsporidia with fungi (Hirt et al. 1999Citation ), despite the use of very long sequences (the fusion of SSU and LSU rRNA). This lack of resolution is likely due to the presence of both very short and very long branches within the same tree (see fig. 1B ). Therefore, the placement of fast-evolving lineages within the eukaryotic phylogeny is impossible because of the wide range of evolutionary rates in rRNA. To confidently locate these phyla, the use of other phylogenetic markers that evolved slowly in the corresponding lineages is probably the only reliable solution, as has been shown for microsporidia with tubulins (Keeling and Doolittle 1996Citation ) and RNA polymerase (Hirt et al. 1999Citation ).


    Acknowledgements
 TOP
 Acknowledgements
 literature cited
 
We acknowledge Hervé Le Guyader and Guy Brugerolle for their support throughout this work. We thank Philippe Lopez, David Moreira, Miklós Müller, and two anonymous referees for their comments on the manuscript. We also thank Patricia Johnson for providing a sample of the Trichomonas genomic DNA library, Nicolas Puchot for technical assistance, and Céline Brochier for the drawings. This work was supported by the CNRS, by the Groupement de Recherches et d'Etudes sur les Génomes (décision d'aide no. 94/125) and by the ACC-SV7. The new sequence has been deposited in GenBank with accession number AF202181.


    Footnotes
 
Masami Hasegawa, Reviewing Editor

1 Keywords: Trichomonas vaginalis rRNA phylogeny eukaryotic evolution rate across sites covarion Back

2 Address for correspondence and reprints: Hervé Philippe, Equipe Phylogénie et Evolution moléculaires (UPRESA CNRS 8080), Bâtiment 444, Université Paris-Sud, 91405 Orsay cedex, France. E-mail: herve.philippe{at}bc4.u-psud.fr Back


    literature cited
 TOP
 Acknowledgements
 literature cited
 

    Adachi, J., and M. Hasegawa. 1996. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. Comput. Sci. Monogr. 28:1–150.

    Baldauf, S. L., and W. F. Doolittle. 1997. Origin and evolution of the slime molds (Mycetozoa). Proc. Natl. Acad. Sci. USA 94:12007–12012.

    Burger, G., D. Saint-Louis, M. W. Gray, and B. F. Lang. 1999. Complete sequence of the mitochondrial DNA of the red alga Porphyra purpurea. Cyanobacterial introns and shared ancestry of red and green algae. Plant Cell 11:1675–1694.

    Fitch, W. M., and E. Markowitz. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579–593.[ISI][Medline]

    Galtier, N., and M. Gouy. 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871–879.[Abstract]

    Germot, A., and H. Philippe. 1999. Critical analysis of eukaryotic phylogeny: a case study based on the HSP70 family. J. Eukaryot. Microbiol. 46:116–124.[ISI][Medline]

    Germot, A., H. Philippe, and H. Le Guyader. 1996. Presence of a mitochondrial-type 70-kDa heat shock protein in Trichomonas vaginalis suggests a very early mitochondrial endosymbiosis in eukaryotes. Proc. Natl. Acad. Sci. USA 93:14614–14617.

    Hasegawa, M., and T. Hashimoto. 1993. Ribosomal RNA trees misleading? Nature 361:23.

    Hashimoto, T., L. B. Sanchez, T. Shirakura, M. Muller, and M. Hasegawa. 1998. Secondary absence of mitochondria in Giardia lamblia and Trichomonas vaginalis revealed by valyl-tRNA synthetase phylogeny. Proc. Natl. Acad. Sci. USA 95:6860–6865.

    Hirt, R. P., J. M. Logsdon Jr., B. Healy, M. W. Dorey, W. F. Doolittle, and T. M. Embley. 1999. Microsporidia are related to fungi: evidence from the largest subunit of RNA polymerase II and other proteins. Proc. Natl. Acad. Sci. USA 96:580–585.

    Keeling, P. J., J. A. Deane, and G. I. McFadden. 1998. The phylogenetic position of alpha- and beta-tubulins from the Chlorarachnion host and Cercomonas (Cercozoa). J. Eukaryot. Microbiol. 45:561–570.[ISI][Medline]

    Keeling, P. J., and W. F. Doolittle. 1996. Alpha-tubulin from early-diverging eukaryotic lineages and the evolution of the tubulin family. Mol. Biol. Evol. 13:1297–1305.[Abstract/Free Full Text]

    Leipe, D. D., J. H. Gunderson, T. A. Nerad, and M. L. Sogin. 1993. Small subunit ribosomal RNA+ of Hexamita inflata and the quest for the first branch in the eukaryotic tree. Mol. Biochem. Parasitol. 59:41–48.[ISI][Medline]

    Lockhart, P. J., A. W. Larkum, M. Steel, P. J. Waddell, and D. Penny. 1996. Evolution of chlorophyll and bacteriochlorophyll: the problem of invariant sites in sequence analysis. Proc. Natl. Acad. Sci. USA 93:1930–1934.

    Lockhart, P. J., M. A. Steel, A. C. Barbrook, D. Huson, M. A. Charleston, and C. J. Howe. 1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol. 15:1183–1188.[Abstract]

    Lockhart, P., M. Steel, M. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605–612.[Free Full Text]

    Lopez, P., P. Forterre, and H. Philippe. 1999. The root of the tree of life in the light of the covarion model. J. Mol. Evol. 49:496–508.[ISI][Medline]

    Maddison, W. P., and D. R. Maddison. 1992. MacClade—analysis of phylogeny and character evolution. Sinauer, Sunderland, Mass.

    Moreira, D., H. Le Guyader, and H. Philippe. 1999. Unusually high evolutionary rate of the elongation factor 1 alpha genes from the Ciliophora and its impact on the phylogeny of eukaryotes. Mol. Biol. Evol. 16:234–245.[Abstract]

    Olsen, G. 1987. Earliest phylogenetic branching: comparing rRNA-based evolutionary trees inferred with various techniques. Cold Spring Harb. Symp. Quant. Biol. LII:825–837.

    Peyretaillade, E., C. Biderre, P. Peyret, F. Duffieux, G. Metenier, M. Gouy, B. Michot, and C. P. Vivares. 1998. Microsporidian Encephalitozoon cuniculi, a unicellular eukaryote with an unusual chromosomal dispersion of ribosomal genes and a LSU rRNA reduced to the universal core. Nucleic Acids Res. 26:3513–3520.[Abstract/Free Full Text]

    Philippe, H. 1993. MUST, a computer package of management utilities for sequences and trees. Nucleic Acids Res. 21:5264–5272.[Abstract]

    Philippe, H., and A. Adoutte. 1998. The molecular phylogeny of Eukaryota: solid facts and uncertainties. Pp. 25–56 in G. Coombs, K. Vickerman, M. Sleigh, and A. Warren, eds. Evolutionary relationships among Protozoa. Chapman & Hall, London.

    Philippe, H., and J. Laurent. 1998. How good are deep phylogenetic trees? Curr. Opin. Genet. Dev. 8:616–623.[ISI][Medline]

    Roger, A. J., O. Sandblom, W. F. Doolittle, and H. Philippe. 1999. An evaluation of elongation factor 1 alpha as a phylogenetic marker for eukaryotes. Mol. Biol. Evol. 16:218–233.[Abstract]

    Sogin, M. L. 1991. Early evolution and the origin of eukaryotes. Curr. Opin. Genet. Dev. 1:457–463.[Medline]

    Stiller, J., and B. Hall. 1999. Long-branch attraction and the rDNA model of early eukaryotic evolution. Mol. Biol. Evol. 16:1270–1279.[Free Full Text]

    Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964–969.[Free Full Text]

    Swofford, D. L. 1993. PAUP: phylogenetic analysis using parsimony. Version 3.1.1. Illinois Natural History Survey, Champaign.

    Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407–514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass.

    Takezaki, N., and T. Gojobori. 1999. Correct and incorrect vertebrate phylogenies obtained by the entire mitochondrial DNA sequences. Mol. Biol. Evol. 16:590–601.[Abstract]

    Van de Peer, Y., S. A. Rensing, U. G. Maier, and R. De Wachter. 1996. Substitution rate calibration of small subunit ribosomal RNA identifies chlorarachniophyte endosymbionts as remnants of green algae. Proc. Natl. Acad. Sci. USA 93:7732–7736.

    Woese, C., L. Achenbach, P. Rouviere, and L. Mandelco. 1991. Archaeal phylogeny: reexamination of the phylogenetic position of Archaeoglobus fulgidus in light of certain composition-induced artifacts. Syst. Appl. Microbiol. 14:364–371.[ISI][Medline]

    Yang, Z. 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367–370.[ISI]

Accepted for publication January 17, 2000.