Phylogénie, Bio-informatique et Génome, CNRS, Université Pierre et Marie Curie, Paris
The phylogeny of eukaryotes is in a state of flux (Philippe, Germot, and Moreira 2000). The tree based on the small subunit ribosomal RNA, long taken as a reference, has been challenged by the discovery of serious tree reconstruction artifacts, especially long-branch attraction (LBA) (Felsenstein 1978). To avoid these methodological issues, one can improve tree reconstruction methods, especially through the implementation of more realistic models of sequence evolution, which we have called the statistical approach (Philippe and Laurent 1998). However, it is also possible to use the Hennigian approach, i.e., the maximum parsimony method applied to less noisy characters, which are a priori not prone to convergence and reversion and are thus less homoplastic. As recently reviewed (Rokas and Holland 2000), rare genomic changes (RGCs), such as intron positions, retrotransposon insertions, insertions-deletions (known as indels), mitochondrial and nuclear genetic code variants, and chloroplast and mitochondrial gene orders, have given several interesting results.
In the case of eukaryotes, gene fusions have rarely been analyzed, although the fusion of cox1 and cox2 in the mitochondrial genome supports the monophyly of Mycetozoa (Lang, Gray, and Burger 1999) and that of dihydrofolate reductase and thymidylate synthase suggests the early branching of animals, Fungi, and Microsporidia (Philippe et al. 2000). Indels are the RGCs most often used for inferring the eukaryotic tree. For example, Baldauf and Palmer (1993) provided four indels in favor of the monophyly of Opisthokonta (Fungi + Metazoa). One of these, an insertion of about 12 amino acids (aa), was further used to locate Microsporidia within this group (for a recent review see Philippe, Germot, and Moreira [2000]). Resolution of the eukaryotic tree can thus be improved by the study of indels.
However, indels are as sensitive as primary sequence data to some sources of homoplasy, such as hidden paralogy, lineage sorting, recombination, and lateral gene transfer (LGT) (Keeling and Palmer 2001). For example, a close proximity between Archaea and Gram-positive bacteria seems to be supported by a large deletion (about 20 aa) in HSP70 (Gupta 1998). But HSP70 genes displaying this deletion are also present in a few Gram-negative bacteria, demonstrating recent LGT. In fact, the taxonomic distribution of HSP70 in Archaea (e.g., its absence in Crenarchaeota) suggests an ancient LGT from a Gram-positive bacterium to a euryarchaeon (Philippe, Budin, and Moreira 1999). As a result, indels, even when large and unambiguous, can be misleading. Moreover, indels can also be affected by convergence and reversion. For example, an insertion of two aa in the elongation factor EF-1α (position 156 in Sulfolobus solfataricus) is unevenly distributed in several groups (Archaea, diplomonads, ciliates, apicomplexans, and animals) (Philippe and Laurent 1998). Because the grouping of species sharing this insertion is at odds with both the accepted phylogeny and the one based on EF-1α, this indel is clearly highly prone to convergence-reversion.
An interesting example concerns the phylogenetic position of trichomonads. A molecular phylogeny based on the enolase gene supports their early emergence (Keeling and Palmer 2000). As discussed by the authors, this group displays a very long branch and could be misplaced because of LBA, rendering the enolase phylogeny inconclusive. Nevertheless, on the basis of two insertions of one aa in the enolase of trichomonads and prokaryotes, it was suggested that "Parabasalia are the earliest extant lineage of eukaryotes studied" (Keeling and Palmer 2000). However, the corresponding deletions (i.e., the absence of these two insertions) are also found in two distantly related prokaryotes (an archaeon, Methanococcus jannaschii, and a bacterium, Campylobacter jejuni) (Hannaert et al. 2000). The present analysis questions the early emergence of trichomonads, inferred from the enolase phylogeny and from inosine monophosphate dehydrogenase (impdh), another gene that strongly supports the same phylogenetic position for trichomonads (Collart et al. 1996). We critically analyzed these two genes, paying special attention to the quality of indels as phylogenetic markers. We compared the distribution of indels with the robustly inferred phylogenetic relationships and estimated the congruence of indels among themselves for each gene.
We selected a representative set of 38 enolase sequences from a total of 286. Prokaryotic and eukaryotic sequences were more similar to each other (mean of 68% identity) than they were to trichomonad sequences (mean of 60% identity), confirming that trichomonads were very divergent. Our tree (fig. 1A) was similar to previously published ones (Hannaert et al. 2000; Keeling and Palmer 2000, 2001). Prokaryotes were well separated from eukaryotes, with the exception of Euglena, which likely acquired its enolase gene from a spirochaete (Hannaert et al. 2000). Within prokaryotes, there were two groups of Archaea, one close to Campylobacter and another close to eukaryotes. Within eukaryotes, several accepted groups (fungi, diplomonads, and Entamoeba-Mastigamoeba) were recovered. Trichomonads were connected to the tree on the central branch between the eukaryotic and the prokaryotic domains (bootstrap proportion of 84%). The fact that this long branch was located at a midpoint rooting position suggested that LBA could be responsible for this result (Philippe and Forterre 1999).
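The mean identities quoted above can be illustrated with a short sketch. The text does not state how identity was computed, so the convention used here (identical residues divided by the number of alignment columns in which neither sequence has a gap) is an assumption, and the function names are hypothetical.

```python
from itertools import product


def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity between two aligned sequences.

    Assumption: columns where either sequence has a gap are ignored.
    Other conventions (e.g., dividing by the alignment length) exist.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


def mean_identity(group_a, group_b):
    """Mean percent identity over all pairs drawn from two groups."""
    values = [percent_identity(a, b) for a, b in product(group_a, group_b)]
    return sum(values) / len(values)
```

Applied to an alignment, `mean_identity` would be called once for the prokaryote-eukaryote pairs and once for the pairs involving trichomonads. The resulting values depend on the gap convention chosen.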
Three indels (B, C, and G) were also homoplastic, but their interpretation was very much open to discussion. Two close insertions (indels B and C) are shared by Plasmodium (and all sequenced alveolates) and Arabidopsis (and several green plants) (Keeling and Palmer 2001). Interestingly, they have also been found in heterokonts (Phytophthora infestans). The phylogenetic tree, however, was in conflict with these indels because Plasmodium did not cluster with plants (fig. 1A). In addition, Arabidopsis 2 (not displaying the insertions) was the sister group of Arabidopsis 1. However, Phytophthora was close to Arabidopsis 1, in agreement with the indels.
Because indel C has a conserved sequence comprising rare aa (EWGWX), it was potentially a reliable phylogenetic marker. However, indel C is homoplastic because diplomonads (fig. 1B) as well as two red algae (Keeling and Palmer 2001) have an insertion of two aa. Moreover, indel B is homoplastic because it is also shared by two unrelated species (Pycnococcus and Rhodomonas). In this case, insertion B is incongruent with insertion C. Thus, convergences occurred for indels B and C, which is not unexpected because this region is located on a loop (Keeling and Palmer 2001).
Two alternative hypotheses were proposed to explain why the distribution of insertion C is incongruent with the enolase phylogeny: (1) the tree was incorrect, i.e., biased by a tree reconstruction artifact (Hannaert et al. 2000), or (2) the indels were misleading, i.e., biased by recombination events (Keeling and Palmer 2001). Because the ancestor of Apicomplexa had a secondary endosymbiosis with an alga (Waller et al. 1998), the presence of indel C in Apicomplexa was interpreted as resulting from an LGT from the nucleus of the alga to the nucleus of the host. Hannaert et al. (2000) considered that the complete gene had been transferred and had replaced the host copy, whereas Keeling and Palmer (2001) suggested that only a part of the gene had been transferred and had undergone recombination with the host copy. Yet, the latter authors were unable to detect the recombination, likely because the enolase protein was highly mutationally saturated (see Supplementary Material on MBE website at http://www.molbiolevol.org).
In addition to LGTs, the presence of several paralogs rendered the inference of the species phylogeny from enolase very difficult. Three paralogs are found in the complete genome of Arabidopsis, and two are present in some chlorophytes and rhodophytes (Keeling and Palmer 2001) as well as in Dictyostelium and some bacteria (see Supplementary Material). Secondary endosymbioses further complicate the history by allowing LGTs within a single organism. Even if we do not exclude the possibility of LGTs and of recombination, we think that a much simpler hypothesis explaining indels B and C cannot be ruled out: the enolase gene underwent a duplication early in the evolution of eukaryotes, with only one copy acquiring the insertions. During later evolution, differential gene losses led to the complex history shown in figure 1A.
More precisely, heterokonts and alveolates are likely sister groups (Fast et al. 2001), and this clade, called chromalveolates, would be close to plants. Insertions B and C would be a synapomorphy for this broad group, resulting from their acquisition in its ancestor in either a previously duplicated gene or a specifically duplicated gene. The first hypothesis would explain the presence of two copies without the insertions in Dictyostelium, whereas the second requires at least one independent duplication.
Indel G, used to assess the phylogenetic position of trichomonads, was also open to scrutiny. First, the alignment of this region is difficult, and two versions have been proposed, leading to either two indels of one aa each (Keeling and Palmer 2000) or one indel of two aa (Hannaert et al. 2000). We have arbitrarily accepted the second option (fig. 1B) to simplify the discussion. Prokaryotes and trichomonads display two aa, which has been interpreted in favor of the early emergence of trichomonads (Keeling and Palmer 2000). Yet, one bacterium (Campylobacter) and one archaeon (Methanococcus) do not have the insertion, suggesting the poor quality of this indel (Hannaert et al. 2000). We found an intermediate case, the presence of a single aa in Methanosarcina (fig. 1B). Moreover, just to the right of indel G, two unrelated bacterial species (Streptococcus and Chlorobium) displayed an insertion (indel H, fig. 1B). Finally, the sequences of two Sulfolobus species that became available during the writing of the manuscript displayed the same deletion as eukaryotes. All of this suggested that indel events were frequent in this region, which forms an external loop (Hannaert et al. 2000). Indel G should therefore be used with great caution as a phylogenetic marker. It certainly did not help in resolving the difficult question of the phylogenetic position of trichomonads. Moreover, an ancient recombination in trichomonads, occurring to the right of indel G between the resident eukaryotic form and a prokaryotic gene acquired by LGT, would explain both the basal emergence and insertion G of trichomonads, but evidence for this hypothesis is very weak (see Supplementary Material).
Lastly, the remaining indels were noisy, sometimes to a great extent. For example, indel I displayed 1-, 2-, or 3-aa-long insertions present or absent in different species of many lineages (e.g., 2 aa in Haloarcula and none in Halobacterium; 2 aa in Cladosporium and none in Candida, two closely related ascomycetes). Two other 1-aa-long deletions (indels J and L) were homoplastic and were present in Metazoa, plants, Dictyostelium, and some Fungi. These deletions did not even recover the monophyly of Fungi (Chytridiomycota, [Ascomycota, Zygomycota]). There was no deletion in Chytridiomycota (Neocallimastix), but the deletion was present in Zygomycota (Cunninghamella) and in some Ascomycota. Finally, indels D and E showed convergences between Archaea and some Bacteria.
Our analysis of the phylogenetic quality of the indels of the enolase gene revealed that, across all living organisms, RGCs were not devoid of homoplasy. The majority were indeed misleading (7 out of 13). Furthermore, three other indels (B, C, and G) were subject to opposing interpretations. The various hypotheses proposed to explain their taxonomic distribution (convergence, paralogy, xenology, and mosaicism) can be tested only with better species sampling (e.g., heterokonts, as suggested by Keeling and Palmer [2001]) and with refined tree reconstruction methods. At present, we believe that it is impossible to infer species phylogeny from a comparison of enolase sequences, even with the use of indels. In particular, the very basal position of trichomonads has no robust support from either the phylogeny or the short indels.
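One simple way to quantify how well an indel agrees with a given tree is to count the minimum number of changes it requires under Fitch parsimony: a presence/absence character that needs more than one change on the tree is homoplastic with respect to that tree. The sketch below is illustrative only. The tree, the taxon names (t1 to t6), and the character states are hypothetical toy data, not the enolase data set, and the text does not state that this particular procedure was used.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character (Fitch parsimony).

    tree   : rooted binary tree as nested 2-tuples with taxon names as leaves
    states : dict mapping each taxon name to its character state
    """
    def visit(node):
        if isinstance(node, str):  # leaf: its observed state, no change yet
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:  # children agree on at least one state: no extra change
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def consistency_index(tree, states):
    """Consistency index: minimum conceivable steps / steps on this tree."""
    steps = fitch_steps(tree, states)
    if steps == 0:
        return 1.0  # invariant character, trivially consistent
    return (len(set(states.values())) - 1) / steps


# Hypothetical toy example (1 = insertion present, 0 = absent).
toy_tree = ((("t1", "t2"), ("t3", "t4")), ("t5", "t6"))
clean_indel = {"t1": 1, "t2": 1, "t3": 0, "t4": 0, "t5": 0, "t6": 0}
noisy_indel = {"t1": 1, "t2": 0, "t3": 1, "t4": 0, "t5": 1, "t6": 0}

print(fitch_steps(toy_tree, clean_indel))  # 1 change: no homoplasy
print(fitch_steps(toy_tree, noisy_indel))  # 3 changes: homoplastic
```

A result of this kind is only as reliable as the reference tree, which is why such a comparison is informative only for robustly supported nodes.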
To evaluate whether the homoplasy of RGCs was specific to enolase, we studied another gene, impdh. The phylogeny as well as one indel supported a very basal emergence of trichomonads (Collart et al. 1996). However, in our updated phylogeny (fig. 2), the story appeared more complicated because supplementary searches in ongoing genome sequencing projects showed that the impdh of trichomonads was robustly included in a clade (called T) with three bacteria (Treponema denticola, Porphyromonas gingivalis, and Clostridium difficile), still emerging at the base of eukaryotes. These four sequences were very similar (>80% identity) but belonged to different major groups (trichomonads, spirochaetes, low GC Gram-positive bacteria, and Cytophagales). Their grouping suggested recent LGTs, although it was impossible to identify the donor species with the present data set. In addition to this peculiar clade, the impdh phylogeny contained the three major groups (Archaea, Eubacteria, and Eucarya) and five minor ones (fig. 2).
Three short insertions in impdh defined monophyletic groups in agreement with the phylogeny and could be considered reliable phylogenetic markers: indel 1 (2 aa) supported group T, indel 2 (1 aa) supported the monophyly of the clade B + C + guanosine monophosphate reductase (GMR), and indel 5 (2 aa) supported the monophyly of group B. In addition, indel 7 (5 aa) was present in all eukaryotes except the trichomonads, which would provide strong evidence in favor of the early emergence of trichomonads, as proposed for enolase (Keeling and Palmer 2000). Indel 3 seemed at first sight to contradict the phylogeny because it was present in groups T and GMR only, which were not closely related (fig. 2). Yet, although located at the same position, the insertions did not have the same size in the two groups (7 and 4 aa, respectively) and did not display sequence similarity. Multiple indel events possibly occurred at this position, but the evolution of indel 3 was easily understandable with additional knowledge, as for indels A and K of enolase (fig. 1B).
The two remaining indels (4 and 6), albeit large, were much more problematic. First, they did not agree with the phylogeny because insertion 4 was present in five unrelated groups and insertion 6 in two unrelated groups (fig. 2). Second, they contradicted each other: indel 6 supported a sister group relationship of groups A and T, but indel 4 did not. Third, indel 4 also contradicted indel 2: insertion 4 was present only in group C, whereas insertion 2 was shared by groups C, B, and GMR. In fact, the boundaries of indel 4 cannot be perfectly defined because of the low level of sequence identity. If we took the most proximal unambiguously aligned boundaries (positions 80 and 249 in Tritrichomonas), the size of the region so defined, encompassing indel 4, was 175 aa for the sequences containing the 120 aa of the complete flanking domain. This size was 40 aa for groups B, O, and GMR, 75 aa for group A and Borrelia (which nevertheless clustered with the group Bacteria of known impdh function), and 114 aa for Aeropyrum. The region containing the flanking domain was likely subjected to several independent deletions of various lengths, which implied that the absence of insertion 4 was at least partly caused by convergence. In contrast, insertion 6 (6 aa) was unlikely to be caused by convergence because its consensus sequence was similar in group A (DYLDET) and in group T (EYFEET). An interpretation of this indel as independent deletions in all the other sequences was implausible because the taxonomic diversity of groups A and T is very limited. An ad hoc hypothesis of recombination would easily explain the set of indels displayed by trichomonads. For example, in an ancestor of trichomonads, the eukaryotic impdh could have recombined, after indel 4, with an impdh gene of cyanobacterial origin. This would explain the absence of insertion 7, specific for eukaryotes, in trichomonads, but evidence for this hypothesis was weak (see Supplementary Material).
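The mutual contradiction between indels can be made concrete with a pairwise compatibility check. For two presence/absence characters, if all four combinations of states (0,0), (0,1), (1,0), and (1,1) are observed among the taxa, no tree can explain both characters with a single change each, so at least one of them must be homoplastic. The sketch below assumes binary coding and uses hypothetical group labels (g1 to g5) and character names. It is not a reconstruction of the impdh data or of the procedure actually used.

```python
from itertools import combinations


def compatible(char_a, char_b):
    """Pairwise compatibility of two binary characters.

    Returns False when all four state combinations occur among the taxa
    scored for both characters, i.e., when the two characters cannot both
    be free of homoplasy on any single tree.
    """
    shared_taxa = char_a.keys() & char_b.keys()
    observed = {(char_a[t], char_b[t]) for t in shared_taxa}
    return len(observed) < 4


def incompatible_pairs(characters):
    """List every pair of character names that fail the test."""
    return [
        (name_a, name_b)
        for name_a, name_b in combinations(sorted(characters), 2)
        if not compatible(characters[name_a], characters[name_b])
    ]


# Hypothetical toy matrix (1 = insertion present, 0 = absent).
toy_characters = {
    "indel_x": {"g1": 1, "g2": 1, "g3": 0, "g4": 0, "g5": 0},
    "indel_y": {"g1": 1, "g2": 1, "g3": 1, "g4": 0, "g5": 0},  # nested with x
    "indel_z": {"g1": 1, "g2": 0, "g3": 0, "g4": 1, "g5": 0},  # conflicts
}

print(incompatible_pairs(toy_characters))
```

In this toy matrix, `indel_x` and `indel_y` are nested and therefore compatible, whereas `indel_z` conflicts with both. The test identifies a conflict but not which character of the pair is the homoplastic one. That still requires comparison with an independent phylogeny or with the sequence content of the insertions.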
We demonstrated that two genes, enolase and impdh, showing a robust early emergence of trichomonads (based on phylogeny and on indels) had a very complex evolutionary history. In addition to mutational saturation and tree reconstruction artifacts, it is likely that several duplications, LGTs, and recombination events occurred and blurred the phylogenetic signal. Indels were suggested as a means to detect such events (Keeling and Palmer 2001) and indeed turned out to be often homoplastic. This homoplasy was likely caused not only by recombination but also by convergences, especially for the short indels and for at least one larger one (indel 4 of 120 aa in impdh, which underwent independent deletions). Thus, one has to be critical about the use of indels as characters for phylogenetic inference. Because they are sensitive to LGTs and recombination and are also potentially homoplastic, it is very hazardous to ground a species phylogeny on a few indels. In particular, the indels supporting the early emergence of trichomonads in enolase and impdh were of poor quality. Therefore, there is no strong evidence from these genes in favor of trichomonads as basal eukaryotes.
Acknowledgements
We thank Drs. Brochier, Casane, Gribaldo, Lopez, Moreira, and Müller, and two anonymous referees for helpful comments. For free access to the enolase and impdh gene sequences, we thank the DOE Joint Genome Institute (http://www.jgi.doe.gov/JGI_microbial/html/index.html), the Institute for Genomic Research (http://www.tigr.org), the Sanger Institute (ftp://ftp.sanger.ac.uk/pub/), and the Göttingen Genomics Laboratory.
Footnotes
William Martin, Reviewing Editor
Keywords: enolase, inosine monophosphate dehydrogenase, insertions-deletions, homoplasy, molecular phylogeny, trichomonads
Address for correspondence and reprints: Hervé Philippe, Phylogénie, Bio-informatique et Génome, UMR CNRS 7622, Université Paris Pierre et Marie Curie, 9 quai Saint Bernard, 75005 Paris, France. herve.philippe{at}snv.jussieu.fr
References
Adachi J., M. Hasegawa, 1996 MOLPHY. Version 2.3. Programs for molecular phylogenetics based on maximum likelihood Comput. Sci. Monogr 28:1-150
Baldauf S. L., J. D. Palmer, 1993 Animals and fungi are each other's closest relatives: congruent evidence from multiple proteins Proc. Natl. Acad. Sci. USA 90:11558-11562
Colby T. D., K. Vanderveen, M. D. Strickler, G. D. Markham, B. M. Goldstein, 1999 Crystal structure of human type II inosine monophosphate dehydrogenase: implications for ligand binding and drug design Proc. Natl. Acad. Sci. USA 96:3531-3536
Collart F. R., J. Osipiuk, J. Trent, G. J. Olsen, E. Huberman, 1996 Cloning, characterization and sequence comparison of the gene coding for IMP dehydrogenase from Pyrococcus furiosus Gene 174:209-216
Fast N. M., J. C. Kissinger, D. S. Roos, P. J. Keeling, 2001 Nuclear-encoded, plastid-targeted genes suggest a single common origin for apicomplexan and dinoflagellate plastids Mol. Biol. Evol 18:418-426
Felsenstein J., 1978 Cases in which parsimony or compatibility methods will be positively misleading Syst. Zool 27:401-410
Gupta R. S., 1998 What are archaebacteria: life's third domain or monoderm prokaryotes related to Gram-positive bacteria? A new proposal for the classification of prokaryotic organisms Mol. Microbiol 29:695-708
Hannaert V., H. Brinkmann, U. Nowitzki, J. A. Lee, M.-A. Albert, C. W. Sensen, T. Gaasterland, M. Müller, P. Michels, W. Martin, 2000 Enolase from Trypanosoma brucei, from the amitochondriate protist Mastigamoeba balamuthi, and from the chloroplast and cytosol of Euglena gracilis: pieces in the evolutionary puzzle of the eukaryotic glycolytic pathway Mol. Biol. Evol 17:989-1000
Keeling P. J., J. D. Palmer, 2000 Parabasalian flagellates are ancient eukaryotes Nature 405:635-637
Keeling P. J., J. D. Palmer, 2001 Lateral transfer at the gene and subgenic levels in the evolution of eukaryotic enolase Proc. Natl. Acad. Sci. USA 98:10745-10750
Kishino H., T. Miyata, M. Hasegawa, 1990 Maximum likelihood inference of protein phylogeny, and the origin of chloroplasts J. Mol. Evol 31:151-160
Lang B. F., M. W. Gray, G. Burger, 1999 Mitochondrial genome evolution and the origin of eukaryotes Annu. Rev. Genet 33:351-397
Philippe H., K. Budin, D. Moreira, 1999 Horizontal transfers confuse the prokaryotic phylogeny based on the HSP70 protein family Mol. Microbiol 31:1007-1009
Philippe H., P. Forterre, 1999 The rooting of the universal tree of life is not reliable J. Mol. Evol 49:509-523
Philippe H., A. Germot, D. Moreira, 2000 The new phylogeny of eukaryotes Curr. Opin. Genet. Dev 10:596-601
Philippe H., J. Laurent, 1998 How good are deep phylogenetic trees? Curr. Opin. Genet. Dev 8:616-623
Philippe H., P. Lopez, H. Brinkmann, K. Budin, A. Germot, J. Laurent, D. Moreira, M. Müller, H. Le Guyader, 2000 Early branching or fast evolving eukaryotes? An answer based on slowly evolving positions Philos. Trans. R. Soc. Lond. B. Biol. Sci 267:1213-1221
Rokas A., P. W. H. Holland, 2000 Rare genomic changes as a tool for phylogenetics Trends Ecol. Evol 15:454-459
Waller R. F., P. J. Keeling, R. G. Donald, B. Striepen, E. Handman, N. Lang-Unnasch, A. F. Cowman, G. S. Besra, D. S. Roos, G. I. McFadden, 1998 Nuclear-encoded proteins target to the plastid in Toxoplasma gondii and Plasmodium falciparum Proc. Natl. Acad. Sci. USA 95:12352-12357