How Molecules Evolve in Eubacteria

Peter J. Lockhart*, Daniel Huson†, Uwe Maier‡, Martin J. Fraunholz‡, Yves Van de Peer§, Adrian C. Barbrook||, Christopher J. Howe|| and Mike A. Steel¶

*Institute of Molecular BioSciences, Massey University, Palmerston North, New Zealand;
†Program in Applied and Computational Mathematics, Princeton University;
‡Fachbereich Biologie, Zellbiologie und Angewandte Botanik, Philipps-Universität Marburg, Marburg, Germany;
§Fakultät Biologie, Evolutionsbiologie, Universität Konstanz, Konstanz, Germany;
||Department of Biochemistry and Cambridge Centre for Molecular Recognition, University of Cambridge, Cambridge, England;
¶Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand.

A fundamental assumption in building evolutionary trees is that processes of change are constant across the tree of life (Li and Gu 1996; Swofford et al. 1996). Despite this universal view, it is now clear that nucleotide compositions, amino acid compositions (e.g., Lanave et al. 1984; Sueoka 1988; Hasegawa and Hashimoto 1993; Barbrook, Lockhart, and Howe 1998; Forster and Hickey 1999; Lockhart et al. 1999), and, as we demonstrate here for eubacterial sequences, the distribution of sites in sequences that can accept substitutions may change over time.

We investigated anciently diverged eubacterial sequences using a simple linear dissimilarity measure (dlcov) that was sensitive to the type of variable sequence evolution predicted by a covarion/covariotide model (a model of evolution in which the same sequence positions are free to substitute in some taxa but not in others). Since tree-building properties of dlcov differ under covarion/covariotide and rates-across-sites models, dlcov allowed us to test for evidence of covarion/covariotide evolution in eubacterial sequences. Our analyses demonstrated that evolving distributions of variable sites in molecules provide support for deep-branching patterns in phylogenies reconstructed for eubacterial trees of life. This finding joins growing evidence supporting the covarion/covariotide evolution of sequences (Fitch and Markowitz 1970; Lockhart et al. 1996, 1998; Philippe and Laurent 1998; Germot and Philippe 1999; Lopez, Forterre, and Philippe 1999; Moreira, Guyader, and Philippe 1999; Philippe et al. 2000; Steel, Huson, and Lockhart 2000).

Given two monophyletic groups of taxa, the site patterns found in an alignment of sequences can be described in terms of five classes (Lockhart et al. 1998). Two of these are used in calculating dlcov. Let N3 denote the number of sites that are unvaried in the first group but varied in the second group, and let N4 denote the number of sites that are unvaried in the second group but varied in the first. Let N denote the total number of sites. Thus,

$$ d_{lcov} = \frac{N_3 + N_4}{N} $$

is the proportion of sites varied in one group but not the other. We describe exactly the expected value of dlcov under two models: a model in which there is a distribution of rates across sites (RAS), and a covarion-style model of the type described and analyzed recently by Tuffley and Steel (1998). Under this latter model, the following nonlinear dissimilarity measure converges (with increasing sequence length) to an additive measure that is proportional to the evolutionary distance between the groups:


where N5 is the number of sites that are varied in both groups.
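The counts defined above can be tallied directly from an alignment. The following is a minimal sketch, not the authors' implementation: the function names and the toy sequences are hypothetical, and every character (including any gap symbol) is treated as an ordinary state, which is an assumption.

```python
def site_pattern_counts(group1, group2):
    """Count sites by whether they vary within each of two groups.

    group1, group2: lists of aligned sequences (strings of equal length).
    Returns N (total sites), N3 (varied only in group 2),
    N4 (varied only in group 1) and N5 (varied in both groups).
    """
    n_sites = len(group1[0])
    if any(len(s) != n_sites for s in group1 + group2):
        raise ValueError("all sequences must have the same aligned length")
    n3 = n4 = n5 = 0
    for site in range(n_sites):
        varied1 = len({s[site] for s in group1}) > 1
        varied2 = len({s[site] for s in group2}) > 1
        if varied1 and varied2:
            n5 += 1
        elif varied2:
            n3 += 1  # unvaried in the first group, varied in the second
        elif varied1:
            n4 += 1  # unvaried in the second group, varied in the first
    return {"N": n_sites, "N3": n3, "N4": n4, "N5": n5}


def dlcov(group1, group2):
    """Proportion of sites varied in one group but not the other."""
    c = site_pattern_counts(group1, group2)
    return (c["N3"] + c["N4"]) / c["N"]


# Toy alignment (hypothetical data, for illustration only).
group_a = ["ACGTA", "ACGTC", "ACGAA"]
group_b = ["ACTTA", "GCTTA", "ACTAA"]
counts = site_pattern_counts(group_a, group_b)
value = dlcov(group_a, group_b)
```

In the toy alignment, one site varies only in the second group, one only in the first, and one in both, so dlcov is 2/5. N5 is returned because the nonlinear measure uses it, but that measure is not computed here.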

At variable positions under both the RAS and the covarion-style models, we assume that the underlying mechanism of nucleotide substitution is described by the Kimura 3ST model (or some submodel). The results are expected to be similar under other models of nucleotide substitution but somewhat more difficult to analyze. Under either the RAS or the covarion-style model, the expected value of Nk/N is pk - pij, where pk is the probability that the site is varied among the sequences in group k ∈ {i, j}, and where pij is the probability that the site is varied among the sequences in both groups. Thus, if we let eij denote the expected value of dlcov (under either model), then eij = pi + pj - 2pij. Consequently, under an RAS model, we have:

$$ e_{ij} = \int \left( P[E_i \mid \lambda] + P[E_j \mid \lambda] - 2\,P[E_i \mid \lambda]\,P[E_j \mid \lambda] \right) dF(\lambda) \qquad (1) $$

where P[Ek | λ] is the probability that the sequences in group k are varied at a site evolving at rate λ, and the integration is performed with respect to the distribution of rates across sites. Note that if the sites all evolve at the same rate, pij = pipj.
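The RAS expectation can be illustrated numerically. This is a sketch under stated assumptions, not the authors' calculation: each group is reduced to a pair of sequences, substitution follows Jukes-Cantor (a submodel of Kimura 3ST), and the rate distribution is approximated by a random sample of gamma rates with mean 1. The function names and the path lengths 0.2 and 0.3 are hypothetical.

```python
import math
import random


def prob_varied_pair(rate, t):
    """P[E_k | rate] for a group of two sequences under Jukes-Cantor.

    t is the path length between the two sequences, in expected
    substitutions per site for a site evolving at rate 1.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * rate * t / 3.0))


def expected_dlcov_ras(t_i, t_j, rates):
    """Average p_i + p_j - 2*p_ij over equally weighted site rates.

    Note that no between-group distance appears anywhere in the calculation.
    """
    p_i = p_j = p_ij = 0.0
    for lam in rates:
        e_i = prob_varied_pair(lam, t_i)
        e_j = prob_varied_pair(lam, t_j)
        p_i += e_i
        p_j += e_j
        p_ij += e_i * e_j  # E_i and E_j treated as independent given the rate
    n = len(rates)
    p_i, p_j, p_ij = p_i / n, p_j / n, p_ij / n
    return p_i + p_j - 2.0 * p_ij, p_i, p_j, p_ij


rng = random.Random(1)
alpha = 0.5  # gamma shape parameter; scale 1/alpha gives mean rate 1
gamma_rates = [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(20000)]

e_gamma, p_i, p_j, p_ij = expected_dlcov_ras(0.2, 0.3, gamma_rates)
e_equal, q_i, q_j, q_ij = expected_dlcov_ras(0.2, 0.3, [1.0])
```

With a single shared rate, q_ij equals q_i * q_j. With gamma-distributed rates, p_ij exceeds p_i * p_j because fast sites tend to vary in both groups at once.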

For the covarion-style model described in Tuffley and Steel (1998), lemma 7 of that paper shows that

where b and c are positive constants (dependent only on the switching rates between "variable" and "invariable" states under the model), τij is the evolutionary distance between groups i and j, and xk = P[Ek | var] - P[Ek | inv], where P[Ek | var] (respectively, P[Ek | inv]) is the probability that a site is varied for the sequences in group k ∈ {i, j}, given that it is variable (respectively, invariable) at the root vertex of this group in the underlying tree.
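One form consistent with these definitions can be sketched as follows. It is not a transcription of lemma 7. It assumes that the variable/invariable state of a site follows a stationary two-state Markov process along the tree, and that E_i and E_j are independent given the states at the two group root vertices. Under those assumptions the covariance of the two events decays exponentially with the distance between the groups. Taking b as the sum of the two switching rates, and c as a constant that absorbs the stationary probabilities of the two states, are further assumptions.

```latex
\begin{align*}
p_{ij} &= p_i\,p_j + c\,x_i\,x_j\,e^{-b\,\tau_{ij}}, \\
e_{ij} &= p_i + p_j - 2\,p_{ij}
        = p_i + p_j - 2\,p_i\,p_j - 2c\,x_i\,x_j\,e^{-b\,\tau_{ij}}.
\end{align*}
```

When x_i x_j > 0, this expression increases with τij and approaches pi + pj - 2pipj as τij grows, which is the limiting value obtained when the two groups evolve independently.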

In comparing formulae (1) and (2) for eij under the two models, we note that equation (1) does not involve the evolutionary distance τij between the groups. Hence, under an RAS model, we cannot expect dlcov to extract phylogenetic signal. However, eij increases monotonically with τij for the covarion-style model (eq. 2) and therefore is a (nonlinear) measure of the phylogenetic distance between the groups. Thus, to a first approximation, the dlcov values are expected to fit a star phylogeny under an RAS model. Under a suitable covarion/covariotide-style model (and with τij small and monophyletic groups of similar diversity), the expectation is that dlcov will fit the underlying bifurcating tree.

We tested whether dlcov would allow the recovery of tree shapes similar to the model tree when sequences evolved under a non-covarion/covariotide model. For sequences of finite length (c = 100, 200, 300, 400, and 500), we simulated the evolution of five groups of sequences (each containing four sequences) on a bifurcating tree under Jukes-Cantor and RAS models (gamma-distributed rates with shape parameters 0.5, 1, and 1.5), where the numbers of expected substitutions per site were set to 0.2 for the internal edges and to 0.1 for the external ones. For each combination of parameters, we generated 100 different data sets. For each such data set we then reconstructed splitsgraphs (Bandelt and Dress 1992; Huson 1998) using (1) dlcov and (2) traditional distance measures, corrected according to the model used to simulate the data. Unlike the model-corrected distances, dlcov tended to produce a splitsgraph that did not favor a particular bifurcating tree.

Next, we applied dlcov and split decomposition to five different eubacterial tree of life data sets. For the analyses carried out, sequences were sampled from eubacterial groups (e.g., oxygenic photosynthesis, low G+C gram positives, etc.) so as to cover as much of the genetic diversity of each group as possible yet also maintain a hierarchical structure within each group. These steps were carried out in an attempt to identify diverse sequences showing the most conserved group structure. Sequences whose presence produced unresolved trifurcations between basal lineages within groups were excluded, since these perturbed the treelike properties of both dlcov and dcov (i.e., the splitsgraphs became boxlike). Groups that were poorly sampled with shallow divergences were also avoided. The list of taxa used, along with the alignments, is available from http://www.massey.ac.nz/~imbs/Research/MolEvol/Farside/Plants.html.
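The simulation design described above (five groups of four sequences, internal edges of 0.2 and external edges of 0.1 expected substitutions per site, Jukes-Cantor with or without gamma-distributed site rates) can be sketched as follows. The text does not give the tree topology, so the caterpillar backbone and the balanced four-taxon subtrees used here are assumptions, as are all function names.

```python
import math
import random

BASES = "ACGT"


def evolve(seq, branch_length, site_rates, rng):
    """Evolve a sequence along one edge under Jukes-Cantor.

    branch_length is in expected substitutions per site for a site of
    rate 1; site_rates[k] rescales it for site k.
    """
    out = []
    for base, rate in zip(seq, site_rates):
        # Probability that the state at the end of the edge differs.
        p_change = 0.75 * (1.0 - math.exp(-4.0 * rate * branch_length / 3.0))
        if rng.random() < p_change:
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)


def simulate_group(ancestor, internal, external, site_rates, rng):
    """Four tips on a balanced subtree ((a,b),(c,d)); topology is assumed."""
    tips = []
    for _ in range(2):
        cherry = evolve(ancestor, internal, site_rates, rng)
        tips += [evolve(cherry, external, site_rates, rng) for _ in range(2)]
    return tips


def simulate_groups(length, alpha=None, n_groups=5,
                    internal=0.2, external=0.1, seed=0):
    """Return n_groups lists of four sequences; alpha=None means equal rates."""
    rng = random.Random(seed)
    if alpha is None:
        site_rates = [1.0] * length
    else:
        # Gamma rates with shape alpha and mean 1.
        site_rates = [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(length)]
    node = "".join(rng.choice(BASES) for _ in range(length))
    groups = []
    for _ in range(n_groups):
        # Caterpillar backbone (assumed): one internal edge along the
        # backbone, then one internal edge down to the root of the group.
        node = evolve(node, internal, site_rates, rng)
        group_root = evolve(node, internal, site_rates, rng)
        groups.append(simulate_group(group_root, internal, external,
                                     site_rates, rng))
    return groups


groups = simulate_groups(length=100, alpha=0.5, seed=42)
```

Repeating the call with different seeds, sequence lengths, and shape parameters gives replicate data sets of the kind described; group-to-group dissimilarities can then be computed for each pair of groups.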

For each data set, figure 1 shows (1) unweighted bootstrap neighbor-joining trees (obtained using PAUP, version 4; Swofford 1999) recovered using uncorrected (Hamming) distances and (2) split decomposition graphs (obtained using Splitstree, version 3.1; Huson 1997) recovered using dlcov. Since split decomposition makes no assumption that data fit a bifurcating tree, it provides a conservative test for identifying covarion/covariotide support for splits which occur in the neighbor-joining trees.



 
Fig. 1.—Neighbor-joining trees (left) and splitsgraphs (right) for five eubacterial data sets. Total sequence (Hamming) differences were used in construction of the neighbor-joining trees, and group dissimilarity measures were used in reconstructing the splitsgraphs. With protein sequences, dij = dlcov = (N3 + N4)/N. For 16S rDNA, dij = (xN2 + N3 + N4)/N, where weightings for x = 2–4 gave the bifurcating graph shown. Gamma proteobacterial groups 1 and 2 correspond to strongly supported splits in the Hamming distance/neighbor-joining trees

 
Comparisons of the neighbor-joining trees and splitsgraphs for protein sequences indicate that the distributions of N3 and N4 patterns in the different data sets give rise to treelike distances for dlcov and splits that correspond to those recovered most strongly in the neighbor-joining trees (e.g., the splits between the α and γ proteobacteria and between the proteobacteria and other groups). These observations are explained if sequences belonging to the different monophyletic groups differ in their distributions of variable sites and if these differences provide support for the treelike structures recovered by tree-building algorithms such as neighbor joining.

Less support is provided by N3 + N4 patterns in the 16S rDNA sequences studied here. With these data, the expected 16S rDNA neighbor-joining tree is recovered only if we include in our dissimilarity measure an additional pattern class N2 (i.e., sites at which the character states are different between the two groups and unvaried within each group). The evolution of these patterns is equally well described by covarion and noncovarion models. Thus, with rDNA, while there is evidence for covarionlike patterns of evolution in this molecule (Lockhart et al. 1998), the extent to which these contribute to the inferred phylogenetic relationship between major eubacterial groups is less clear.
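The weighted measure given in the caption of figure 1, dij = (xN2 + N3 + N4)/N, can be sketched in the same way as dlcov. This is an illustrative sketch only: the function name and toy sequences are hypothetical, and each character is treated as an ordinary state.

```python
def weighted_group_dissimilarity(group1, group2, x):
    """(x*N2 + N3 + N4) / N for two groups of aligned sequences."""
    n_sites = len(group1[0])
    n2 = n3 = n4 = 0
    for site in range(n_sites):
        states1 = {s[site] for s in group1}
        states2 = {s[site] for s in group2}
        varied1 = len(states1) > 1
        varied2 = len(states2) > 1
        if not varied1 and not varied2:
            if states1 != states2:
                n2 += 1  # fixed within each group, different between groups
        elif not varied1:
            n3 += 1      # varied only in the second group
        elif not varied2:
            n4 += 1      # varied only in the first group
    return (x * n2 + n3 + n4) / n_sites


# Toy alignment (hypothetical data, for illustration only).
group_a = ["ACGTA", "ACGTC", "ACGAA"]
group_b = ["ACTTA", "GCTTA", "ACTAA"]
d_weighted = weighted_group_dissimilarity(group_a, group_b, x=2)
d_unweighted = weighted_group_dissimilarity(group_a, group_b, x=0)
```

Setting x = 0 reduces the measure to dlcov. Larger x gives more weight to fixed differences between the groups, which carry signal under both covarion and noncovarion models.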

It is reassuring that the strongest splits recovered in our protein splitsgraphs reconstructed using dlcov are found with different eubacterial data sets and are also recovered using the nonlinear covarion transform dcov (figures not shown), suggesting a common evolutionary history for these different molecules. However, the extent to which asymmetric processes of change may be convergent (and potentially misleading for phylogeny reconstruction) across more widely sampled groups in trees of life is a question that requires further study. Biased amino acid and nucleotide compositions can be convergent (Barbrook, Lockhart, and Howe 1998; Forster and Hickey 1999; Lockhart et al. 1999), and they are known to cause a problem for phylogeny reconstruction when sequences accepting biased substitutions also share similar distributions of varying sites (Lockhart et al. 1998). Although changes in distributions of variable sites may help to "fossilize" phylogenetic history in sequences (Lopez, Forterre, and Philippe 1999), some changes may cause problems for tree building. This can occur if the proportion of variable sites in sequences increases independently in different lineages (e.g., Lockhart et al. 1998; Philippe and Laurent 1998; Germot and Philippe 1999; Steel, Huson, and Lockhart 2000). In this case, the data can be described by the type of inconsistency phenomena discussed by Felsenstein (1978). Such processes have been suggested to mislead outgroup placement with duplicated genes (Lockhart et al. 1996; Philippe and Forterre 1999) and also to mislead the divergence order of eukaryotes (Germot and Philippe 1999; Philippe et al. 2000). These results and those we report here highlight the need for improving our understanding of the biochemical basis for processes of asymmetrical change in sequence evolution. This knowledge would surely help provide confidence in the phylogenetic inference of ancient divergences.

A final point is that we do not propose dlcov as an additive distance measure for building evolutionary trees. The measure is not expected to extract all the useful information present in the sequences, and, as we have pointed out, observations on diverse data sets suggest that the evolution of some sequences occurs by covarion processes which are nonstationary. This is a phenomenon which is difficult to model.


    Acknowledgements
 
We acknowledge support from the Alexander von Humboldt Foundation, the Deutsche Forschungsgemeinschaft, the New Zealand Marsden Fund, the New Zealand/German co-operation agreement, the Broodbank Fund, and the BBSRC.


    Footnotes
 
Masami Hasegawa, Reviewing Editor

1 Keywords: covarion, covariotide, nonstationarity, split decomposition

2 Address for correspondence and reprints: Peter J. Lockhart, Institute of Molecular BioSciences, Massey University, Palmerston North, New Zealand.


    literature cited
 

    Bandelt, H. J., and A. W. M. Dress. 1992. Split decomposition: a new and useful approach to phylogenetic distance data. Mol. Phylogenet. Evol. 1:242–252.

    Barbrook, A. C., P. J. Lockhart, and C. J. Howe. 1998. Phylogenetic analysis of plastid origins based on SecA sequences. Curr. Genet. 34:336–341.

    Felsenstein, J. 1978. Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. 27:401–410.

    Fitch, W. M., and E. Markowitz. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579–593.

    Forster, P. G., and D. A. Hickey. 1999. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. 48:284–290.

    Germot, A., and H. Philippe. 1999. Critical analysis of eukaryotic phylogeny: a case study based on the HSP70 family. J. Eukaryot. Microbiol. 46:116–124.

    Hasegawa, M., and T. Hashimoto. 1993. Ribosomal RNA trees misleading? Nature 361:23.

    Huson, D. 1998. SplitsTree: a program for analyzing and visualizing evolutionary data. Bioinformatics 14:68–73.

    Lanave, C., G. Preparata, C. Saccone, and G. J. Serio. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20:86–93.

    Li, W. H., and X. Gu. 1996. Estimating evolutionary distances between DNA sequences. U.K. edition, London.

    Lockhart, P. J., A. W. D. Larkum, M. A. Steel, P. J. Waddell, and D. Penny. 1996. Evolution of chlorophyll and bacteriochlorophyll: the problem of invariant sites in sequence analysis. Proc. Natl. Acad. Sci. USA 93:1930–1934.

    Lockhart, P. J., M. A. Steel, A. C. Barbrook, D. H. Huson, and C. J. Howe. 1998. A covariotide model describes the evolution of oxygenic photosynthesis. Mol. Biol. Evol. 15:1183–1188.

    Lockhart, P. J., C. J. Howe, A. C. Barbrook, A. W. D. Larkum, and D. Penny. 1999. Spectral analysis, systematic bias, and the evolution of chloroplasts. Mol. Biol. Evol. 16:573–576.

    Lopez, P., P. Forterre, and H. Philippe. 1999. The root of the tree of life in the light of the covarion model. J. Mol. Evol. 49:496–508.

    Moreira, D., H. L. Guyader, and H. Philippe. 1999. Unusually high evolutionary rate of the elongation factor 1a genes from the ciliophora and its impact on the phylogeny of eukaryotes. Mol. Biol. Evol. 16:234–245.

    Philippe, H., and P. Forterre. 1999. The rooting of the universal tree of life is not reliable. J. Mol. Evol. 49:509–523.

    Philippe, H., and J. Laurent. 1998. How good are deep phylogenetic trees? Curr. Opin. Genet. Dev. 8:616–623.

    Philippe, H., P. Lopez, H. Brinkman, K. Budin, A. Germot, J. Laurent, D. Moreira, M. Müller, and H. LeGuyader. 2000. Tree reconstruction and the phylogeny of the eukaryotes. Proc. Natl. Acad. Sci. USA (in press).

    Steel, M. A., D. Huson, and P. J. Lockhart. 2000. Syst. Biol. (in press).

    Sueoka, N. 1988. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. USA 85:2653–2657.

    Swofford, D. L. 1999. PAUP. Version 4.65. Sinauer, Sunderland, Mass.

    Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407–514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass.

    Tuffley, C., and M. A. Steel. 1998. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147:63–91.

Accepted for publication January 18, 2000.