*Section of Evolution and Ecology, University of California at Davis;
University/Jepson Herbaria and Museum of Paleontology, University of California at Berkeley;
and
Center for Population Biology, University of California at Davis
Abstract
Sequences of two chloroplast photosystem genes, psaA and psbB, together comprising about 3,500 bp, were obtained for all five major groups of extant seed plants and several outgroups among other vascular plants. Strongly supported, but significantly conflicting, phylogenetic signals were obtained in parsimony analyses from partitions of the data into first and second codon positions versus third positions. In the former, both genes agreed on a monophyletic gymnosperms, with Gnetales closely related to certain conifers. In the latter, Gnetales are inferred to be the sister group of all other seed plants, with gymnosperms paraphyletic. None of the data supported the modern "anthophyte hypothesis," which places Gnetales as the sister group of flowering plants. A series of simulation studies were undertaken to examine the error rate for parsimony inference. Three kinds of errors were examined: random error, systematic bias (both properties of finite data sets), and statistical inconsistency owing to long-branch attraction (an asymptotic property). Parsimony reconstructions were extremely biased for third-position data for psbB. Regardless of the true underlying tree, a tree in which Gnetales are sister to all other seed plants was likely to be reconstructed for these data. None of the combinations of genes or partitions permits the anthophyte tree to be reconstructed with high probability. Simulations of progressively larger data sets indicate the existence of long-branch attraction (statistical inconsistency) for third-position psbB data if either the anthophyte tree or the gymnosperm tree is correct. This is also true for the anthophyte tree using either psaA third positions or psbB first and second positions. A factor contributing to bias and inconsistency is extremely short branches at the base of the seed plant radiation, coupled with extremely high rates in Gnetales and nonseed plant outgroups.
Introduction
Given a finite amount of data, all phylogenetic methods can be misled. Mistaken inferences about relationships can be more or less random, or, if only certain incorrect topologies are preferred, they can be biased in the context of the underlying process of molecular evolution for those data. At worst, this bias can persist as more and more character data are added, a phenomenon known as statistical inconsistency. Maximum parsimony (MP) can be inconsistent in a simple four-taxon tree in which two long terminal branches are separated by a short interior branch (Felsenstein 1978
). This is the source of the term "long-branch attraction" (LBA; Hendy and Penny 1989
), for which length is understood to mean the expected number of substitutions, a function of rate and time. Maximum-likelihood (ML) and distance methods can also be statistically inconsistent when the assumed model of evolution is incorrect (Chang 1996
). Because consistency is an asymptotic property (i.e., one emerging with an infinite number of characters) that is not directly detectable in real data sets, the term "long-branch attraction" has colloquially been applied to bias (a property of finite data sets) when branch length heterogeneity is suspected. In this paper, the term "long-branch attraction" will be used to refer to conditions under which bias in finite data sets and/or statistical inconsistency arises due to a combination of long and short branches. Unfortunately, despite the evocativeness of the term "long-branch attraction," very little is known about the mathematical conditions under which either bias or statistical inconsistency occurs in phylogenies of more than a few taxa (Hendy and Penny 1989
; Huelsenbeck 1995
; Kim 1998
).
Not surprisingly, teasing apart random error, bias, and inconsistency has proven to be difficult for real data sets (Huelsenbeck 1998
). Several methods have been proposed to identify LBA in real data (e.g., Lyons-Weiler and Hoelzer 1997
). One method is to use an algorithm that is putatively statistically consistent, such as ML rather than MP (Huelsenbeck 1997
). However, in finite (real) data sets, even if ML is consistent and MP is not, the real issue is the extent of bias, rather than the asymptotic behavior as more data are obtained. ML estimates are not guaranteed to be unbiased (Lehmann 1983
). Under some model conditions, likelihood methods are not as efficient as nonparametric methods such as parsimony (Huelsenbeck 1998;
Siddall 1998
). An alternative approach is to use Monte Carlo simulation to examine whether the tree-building method is biased under model conditions that appear appropriate for the given data. Cases that have been studied in this way include the putative attraction of Diptera and Strepsiptera (Huelsenbeck 1998
) and of mammals and birds (Huelsenbeck, Hillis, and Jones 1996
) and basal relationships of carabid beetles (Maddison, Baker, and Ober 1999
) and of yucca moths (Pellmyr and Leebens-Mack 1999
). In each of these studies, simulation showed that long branches were sometimes long enough to cause spurious relationships to be reconstructed with high probability.
In this paper, we investigate a possible pattern of LBA in the phylogeny of extant seed plants. Recent molecular work has identified a striking case of conflict between morphological and molecular data and among different partitions of molecular data in several independent data sets. Living seed plants comprise five groups: angiosperms and four gymnosperm groupsconifers, cycads, Gnetales, and Ginkgo. Morphological cladistic analyses had produced a consensus that angiosperms and Gnetales were sister groups to the exclusion of the other seed plants, rendering the extant gymnosperms paraphyletic (Crane 1985
; Doyle and Donoghue 1986, 1992
; Loconte and Stevenson 1990
; Nixon et al. 1994
; Rothwell and Serbet 1994
; Doyle 1996
). Although well supported at the morphological level in bootstrap analyses (J. A. Doyle, personal communication), this "anthophyte hypothesis" has not been supported in molecular studies. Instead, recent studies from chloroplast, mitochondrial, and nuclear genomes are converging on a radically different alternative, in which the gymnosperms are monophyletic and the Gnetales are nested within them, sometimes within the conifers themselves (Frohlich and Parker 1999; Graham and Olmstead 1999; Hansen et al. 1999
; Ross et al. 1999; Soltis et al. 1999
; Winter et al. 1999
; Bowe, Coat, and dePamphilis 2000;
Chaw et al. 2000
). This latter hypothesis would require reevaluation of homologies of conifer and gnetalean reproductive structures, because the angiospermlike "flowers" of Gnetales might have been derived from the highly specialized strobili (cones) of conifers. On the other hand, putative morphological synapomorphies exist that might unite conifers and Gnetales, such as features of wood anatomy (Carlquist 1996
), and thus the molecular data are in a position to play an important role in developing an understanding the evolution of this dominant and diverse group of land plants.
Although an apparent consensus is emerging from these molecular studies on seed plant relationships, some interesting internal inconsistencies in the molecular data have yet to be fully explored. For example, different but strongly supported phylogenetic inferences emerge from different subpartitions of the data (i.e., first and second vs. third codon positions) and/or different weighting schemes applied to these partitions. The concept of saturationthe assumption that silent substitutions in third codon positions in protein-coding genes are occurring so rapidly that they essentially become randomized (i.e., saturated) compared with slower-evolving first and second codon positions, especially for relatively deep divergencesis often invoked when particular methods or genes fail to produce the expected results or to explain unexpected results. For this reason, third codon positions have commonly been regarded as less reliable and phylogenetically uninformative, if not potentially misleading (Meyer 1994
). Some of the recent seed plant studies using both chloroplast and mitochondrial protein-coding genes have presumed that silent substitutions (in third positions) are in fact saturated and should be downweighted (Hansen et al. 1999
; Chaw et al. 2000
). However, the theoretical justification for downweighting saturated sites, or excluding them altogether from phylogenetic analyses (Meyer 1994
; Swofford et al. 1996
), is shaky and has been questioned by recent work showing that third codon positions sometimes contain most of the phylogenetic signal in the data (Yoder, Vilgalys, and Ruvolo 1996
; Björklund 1999;
Källersjö, Albert, and Farris 1999
).
Yang (1998)
recently studied the effect of overall substitution rate on phylogenetic accuracy and concluded that rates well above those that are conventionally viewed as saturated (ca. 2030% or greater; Meyer 1994
) actually improve accuracy, and that the decline in accuracy that emerges at even higher rates is quite slow. Although Felsenstein (1978)
implicated high rates of evolution in LBA problems, these were coupled with rate heterogeneity across branches. The magnitude of saturation needed to induce LBA was dependent on the pattern of rate variation across lineages (or, in the case of constant rates, branch lengths in different lineages; Hendy and Penny 1989
). In more complex trees, theoretically grounded guidelines are not available to indicate when saturation really occurs. Because studies of deep phylogeny typically hinge on data that represent mixtures of tempos and modes of evolution even within a single gene (e.g., Chase et al. 1993
), decisions about character weighting often have a significant impact on results. Studies of the performance of these heterogeneously evolving genes are greatly needed.
In this paper, we analyze the phylogeny and molecular evolution of two highly conserved chloroplast photosystem genes that demonstrate precisely this pattern of mixed rates. At the amino acid level, these are among the most conserved genes in photosynthetic organisms. At the DNA level, third positions are evolving very rapidly and appear to be saturated in pairwise comparisons across land plants (ca. one substitution per site). Variation in rate across these genes is significant, as is variation across different lineages. We use simulation analysis to study whether and how rate heterogeneity and saturation can lead to errors in phylogeny reconstruction. In particular, we focus on whether third-position data are statistically inconsistent owing to saturation. The main goal of the paper is to assess whether LBA can explain the disparate but strongly supported phylogenetic results that are obtained when third-position data are included in analyses of seed plant relationships, or if it is necessary to seek other explanations for conflict between these data partitions. We also use likelihood methods to characterize variation in rates among lineages ("lineage effects") and to assess whether this rate variation may be a significant factor affecting phylogenetic inferences in these data.
Materials and Methods
Taxa, Genes, Primers, and Sequencing
Taxa were sampled from all five extant seed plant groups, including angiosperms (Oryza, Zea, Nicotiana, Pisum, Chloranthus, and Drimys), Gnetales (Ephedra and Welwitschia), conifers (Pinus, Araucaria, Torreya, and Sequoia), cycads (Encephalartos and Cycas), and Ginkgo. Vascular plant outgroups included representatives of homosporous eusporangiate ferns (Angiopteris) and leptosporangiate ferns (Adiantum and Asplenium), heterosporous ferns (Marsilea), lycopsids (Huperzia), sphenopsids (Equisetum), and Psilotum. A nonvascular plant, the liverwort Marchantia, was used to root the tree. Except for Marchantia, Pinus, and several angiosperms, all sequences reported here are new. Taxon names, voucher information, and GenBank accession numbers are listed in table 1
.
|
|
MP searches used heuristic search strategies, consisting of ASIS addition sequences followed by tree bisection- reconnection (TBR) branch swapping (Swofford et al. 1996
). ML searches used the HKY85+
substitution model. Because ML using this model required about 3,500 times as much computation time as MP, a series of heuristic procedures were used to permit ML bootstrap runs to be implemented. The transition/transversion ratio and the shape parameter of the gamma distribution were first estimated on MP trees separately for each gene and codon position. These parameters were then fixed in the ML searches using the PREVIOUS command, and ML heuristic searches were run with ASIS addition sequences and NNI branch swapping. Although more exhaustive search procedures such as TBR swapping could be used to search for optimal trees, running times were prohibitive when repeated across 100 bootstrap replicates. Another analysis used the SITERATES facility in PAUP* to permit a different rate of evolution in the two codon partitions. This facility estimates different fixed rates in the different partitions and should help handle any heterogeneity between partitions. However, it does not simultaneously allow rate variation across sites within the partitions (e.g., with a gamma distribution). Bootstrap runs were implemented with the same options as described above but were done with data sets consisting of the two genes concatenated together.
MP analyses of the amino acid data used step matrices based on the inferred minimum number of substitutions required to transform one amino acid into another. ML and NJ analyses of the amino acid data used the Dayhoff substitution model (Dayhoff, Schwartz, and Orcutt 1978
). ML searches used the "quick" heuristic strategy in MOLPHY. Finally, because of shifts in base composition across taxa, some neighbor-joining analyses were performed with the log-det distance transformation, which may be relatively robust to composition changes across the tree (Lockhart et al. 1994
).
Estimates of divergence (expected numbers of substitutions) along branches were obtained by ML for each gene and codon position partition separately on each of a set of three trees which formed the basis for subsequent analyses of bias, error, and LBA. The three trees were chosen so as to satisfy three major hypotheses about seed plant relationships that have been discussed in the morphological and molecular literature on seed plant relationships (fig. 1
). Two of these three major hypotheses are consistent with trees obtained with different partitions of the present data sets. The third can be obtained easily by imposing one phylogenetic constraint on the present data. Thus, tree A consisted of one MP tree estimated from third positions only for a given gene. For both genes, this tree has Gnetales as the sister group of the other seed plants. When discussing seed plant relationships per se in the context of tree A, we use the term "Gnetales hypothesis." Tree B consisted of one MP tree estimated from first and second positions only. For both genes, this tree has a monophyletic gymnosperms as sister to angiosperms (hence, the "gymnosperm hypothesis"; note that this arrangement is not found on the bootstrap tree for psaA). Tree C is one MP tree found under the constraint tree corresponding to the anthophyte hypothesis of seed plant relationships (described above), obtained from a data set combining both codon partitions. Model parameters, including shape parameter and transition/transversion (ti/tv) ratio (unlike for the ML searches), were all estimated from the data under an HKY85+ model.
|
A simulation protocol was set up to identify the error rate for recovering each of the three main hypotheses of seed plant relationships under model conditions estimated from the data. The three relationships tested were the Gnetales, gymnosperm, and anthophyte hypotheses (fig. 1
), ignoring relationships outside of seed plants or within angiosperms or the other taxa. For example, we might be interested in estimating the probability of rejecting the anthophyte hypothesis if it indeed is truethe type I error. To test this sort of hypothesis, the tempo and mode of evolution of a codon partition of one of the genes, say, third positions in psaA, was estimated on tree C, which represents the null hypothesis. Parameter estimates of the substitution model (composition, ti/tv ratio, shape parameter) were obtained via ML, along with the estimates of overall branch lengths for each branch. Then, 1,000 new data sets were generated using this tree as input, along with estimated branch lengths and parameters, using the program Seq-Gen, version 1.1 (Rambaut and Grassly 1997
). Finally, the MP solution for each replicate data set was obtained and checked to see how often the three hypotheses of seed plant relationships entailed by trees A, B, and C (fig. 1
) were found. Type II error rates (the probability of mistakenly accepting the null hypothesis when it is false) were obtained in the same fashion. For example, in the case of the anthophyte hypothesis, the type II error rate is the probability of incorrectly reconstructing the anthophyte hypothesis given that one of the two other hypotheses (A or B) is correct.
Together, type I and type II errors can be summarized in a three-by-three matrix in which the ijth element gives the probability of obtaining the seed plant relationships entailed by hypothesis i given that tree j is in fact correct (where i, j is one of A, B, or C in fig. 1 ). The iith (diagonal) element is just 1type I error on hypothesis i. The ijth (off-diagonal) elements give the type II error for hypothesis i given the alternative hypothesis, j. Under ideal circumstances, hypothesis i should be supported when Tree i is in fact true. Therefore, the iith (diagonal) elements of the matrices should be high (low type I error rate) and the ijth (off-diagonal) elements should be low, indicating low type II error rate. On the other hand, if some off-diagonal elements are high, then trees other than the true tree have a high chance of (mistakenly) being inferred. Bias is indicated by a skew toward increasing probability of some of the off-diagonal elements but not all of them.
The simulations just described all entail generating data sets with the same number of characters as that observed with the real data. This permits assessments of sampling error and bias. In addition, simulations were undertaken to make inferences about the statistical consistency of tree reconstruction with the given data sets. Selected tests along the lines described above for type I error were repeated with progressively larger numbers of characters. This was easily implemented by changing a single parameter in Seq-Gen and repeating the experimental protocol described above. Since statistical consistency is an asymptotic property, it was not strictly possible to assess it by the finite-sized simulation implemented here, because it is always possible that a seemingly monotonic trend in one direction or the other might reverse itself at some point in simulations larger than those performed (Kim 1998
). However, we assume that the results from analyses of large numbers of characters are suggestive.
Experiments were also performed to test the effects of saturation in some data sets. Overall rates of sequence evolution were modified by multiplying the treewide rate estimated from the data by a constant (another Seq-Gen option). This permitted the type I and type II error rates to be estimated as a function of increasingly higher rates of substitution above that observed in the real data, to address the question of what levels of saturation are needed to obtain a high error rate in the context of the given topology and relative branch lengths.
Results
DNA Sequences
Primers described in table 2
permitted approximately 2,155 bp of psaA (out of 2,265 bp) (716/755 codons) and 1,415 bp of psbB (out of 1,530 bp) (471/510 codons) to be sequenced, with high overlap on one or both strands. We were unable to obtain good sequence for two ferns and one seed plant in the psbB data set (Asplenium, Marsilea, and Sequoia), but two ferns and three other vascular nonseed plants were still obtained for use as outgroups. The sequence for Chloranthus, an angiosperm, was not obtained for psaA. This yielded a data set of 22 taxa for psaA and 20 taxa for psbB, with 19 taxa shared in a combined data set. For alignments, see Supplementary Materials. Table 3 shows basic summary statistics for these sequences. As with most chloroplast genes, AT content is quite high. Moreover, base compositional heterogeneity among taxa appears to be significant for both genes in the third-position data (P = 0.0 in a 2 test of heterogeneity), but not in first and second positions (P = 1.0). Transition/transversion ratios and shapes of the pattern of rate variation across sites are also different between the third-position partition and the first- and second-position partition. Rate variation is discussed further below.
|
|
|
|
ML results were reported as bootstrap majority rule trees (figs. 4 and 5 ). For psaA, little resolution of relationships within seed plants was apparent, although in both partitions there was some support for Gnetales being nested with conifers (71% bootstrap support for first and second positions; 81% for third positions). For psbB data, first and second positions showed modest support (75%) for Gnetales nested within conifers, and contradictory but modest support for third-position data for Gnetales as sister group to all other seed plants (75%). In the concatenated data, the optimal trees (this time with TBR swapping) both in first and second positions and in third positions had Gnetales nested within conifers (trees not shown). Results from runs assuming fixed but different rates in the two codon partitions (SITERATES option, also with concatenated data sets) supported the Gnetales hypothesis (tree not shown). The bootstrap tree indicated a paraphyletic gymnosperms, with Gnetales as the sister group of other seed plants (100% support) and cycads as the sister group of angiosperms (65% support only), with the latter being a result not seen in other experiments with these data.
|
|
Rates, Error, Bias, and Long-Branch Attraction
Rate Parameters and Branch Lengths
Transition/transversion ratios were higher for third positions than for first and second positions but did not differ greatly between genes (table 3
). Rate variation at the nucleotide level was characterized by the shape parameter of a gamma distribution fit to the data via ML. In both genes, the shape of the distribution was highly left-skewed for first- and second- position data (indicating most sites are highly conserved), whereas the shape for third-position data was modal (fig. 6
). Most pairwise distances between taxa (Kimura two-parameter corrected) fell in the range of 0.030.07 substitutions per site for both genes at first and second positions, but ranged from 0.5 to 1.0 at third positions, suggesting a rate 1020 times as high.
Branch lengths (expected numbers of substitutions per site) estimated by ML exhibited striking differences across lineages (figs. 710 ). Many of the patterns in these lineage effects were largely independent of which of the three phylogenetic hypotheses were used in their estimation and seemed to be correlated between genes and partitions. For third-position data (e.g., fig. 8 ), several outgroups had exceptionally long branches (Equisetum and Adiantum), as did the Gnetales (Ephedra and Welwitschia, as well as their stem lineage) among seed plants. The stem lineage subtending angiosperms was relatively long compared to that of cycads, Ginkgo, and conifers, which in fact had quite short branches in all three trees. This pattern was also found in first and second positions, although branch lengths for Equisetum were shorter.
|
|
Simulations of Error Rates and Bias
Table 5
shows results from simulation experiments of the error rates of tree reconstructions. Results differed importantly between the two genes. For psaA, the type I and type II error rates were uniformly low if either tree A or tree B was correct for either position partition. For example, if tree A were correct, tree A would be estimated by MP with high probability (0.887 and 0.926, respectively, for the two codon partitions). The same holds for tree B. However, if tree C were the true tree and third-position data were used, chances are 0.666 that tree A would be mistakenly reconstructed (note that tree A actually was reconstructed from those data). This indicates a bias in the reconstruction. If first and second positions were used instead and tree C were correct, chances are about even that any of the three trees might be reconstructed (probabilities ranging from 0.2730.386). Hence, for psaA, tree C is not easy to estimate if it is true; the other trees are.
|
The bias toward tree A when tree B is correct does not occur with third-position data for psaA. Simulation of progressively higher rates of substitution over and above the observed rates indicated that rates would have to be about eight times as high as the already-saturated rates observed (0.51.0 substitutions per site) for this bias to appear, meaning that, effectively, this gene's level of saturation is nowhere close to what is necessary to generate positively misleading levels (fig. 11 ).
|
|
Phylogeny
The strongest conclusion from this study is that in parsimony analyses, there was a striking conflict between signals from different partitions of the two chloroplast genes but not within the same partitions between genes. Trees based on first- and second-position data were well supported but conflicted dramatically with well-supported trees based on third positions (or based on a combination of both partitions). The former implies the gymnosperm hypothesis; the latter, the Gnetales hypothesis (fig. 1 ). Neither partition supported the modern anthophyte hypothesis that Gnetales are the sister group of angiosperms (Doyle 1996
). High bootstrap values for the conflicting trees and significant partition homogeneity tests mean that it is not simply the case that one of the partitions is random noise and the other has signal. Both partitions have strong signals, and they conflict. This suggests the existence of statistical bias in one or both of the partitions.
Results from likelihood analyses were the same for psbB, but the conflict between partitions was much less evident for psaA. Both partitions supported Gnetales with conifers, although neither partition was sufficiently well supported to suggest the monophyly of gymnosperms. A hypothesis of gymnosperm monophyly and a sister group relationship between Gnetales and conifers does not require a radical reinterpretation of morphological evolution in seed plants. However, a placement of Gnetales within conifers, which is the typical signal emerging from first and second positions in our data (figs. 2A, 3A, 4A, 4B, and 5A
), requires both a major modification of reproductive morphology and the reacquisition of one copy of a large inverted repeat in the chloroplast genome, which conifers have lost but angiosperms, Gnetales, and all other seed plants have retained (Raubeson and Jansen 1992
).
The two different signals we observed in the photosystem genes are each corroborated by other studies. Data from rbcL had long supported the third-position tree (Hasebe et al. 1992
; Albert et al. 1994
), and a recent reanalysis of these data only supported the tree based on first and second positions when third positions were downweighted (Chaw et al. 2000
). Studies of 18S rDNA (Chaw et al. 1997;
Soltis et al. 1999
) and chloroplast ITS sequences (Goremykin et al. 1996
) had suggested the monophyly of gymnosperms, but this was widely questioned. Recently, several studies have once again argued for this result, including the remarkable relationship of Gnetales to conifers. Hansen et al. (1999)
, using a 10-kb piece of the chloroplast genome of Gnetum to compare with complete genomes of Pinus and several angiosperms, found high support for a monophyletic gymnosperms. Winter et al. (1999)
, using multiple paralogs from the MADS-box family of nuclear genes, found the same result in each clade of orthologs (some with high support, some not). Bowe, Coat, and dePamphilis (2000) sequenced two mitochondrial protein-coding genes and inferred a gymnosperm clade with Gnetales nested among conifers. Chaw et al. (2000)
inferred the same result with a different mitochondrial gene and in a reanalysis of 18S rDNA and rbcL (see also Ross et al. 1999, but note that they also found support for the anthophyte hypothesis with 26S data alone). Finally, M. W. Frohlich (personal communication) inferred a monophyletic gymnosperms in trees from phylogenetic analysis of the nuclear gene LEAFY.
That most of these results from recent studies are in agreement is significant, but it should be noted that almost all authors have downweighted or excluded third positions in protein-coding data, either explicitly at the nucleotide level (Hansen et al. 1999
; Chaw et al. 2000;
M. W. Frohlich, personal communication) or implicitly by working with only amino acid sequences (Winter et al. 1999
). We wonder if third-position signals are as strong in these other genes as they are in these two photosystem genes.
Rates of Evolution
Properties of phylogenetic inference methods are in part determined by patterns in the tempo and mode of evolution of the sequences under study. Part of the explanation for the conflicting signals found in these photosystem genes may lie in the tremendous variation in rates seen across partitions, across sites within genes, and across lineages. The amino acid sequences of photosystem genes are extremely conservative because their gene products function in the fundamental light-harvesting systems in plants. Rates of substitution at third positions are 1020-fold higher than they are at first and second positions. Third-position pairwise distances (Kimura two-parameter corrected) range up to about one substitution per site--high enough to be interpreted as saturated, although the implications of saturation are not always obvious (Yang 1998
; Björkland 1999
). For example, the relative performance of reconstruction methods when the data are in fact considered saturated at third codon positions has yet to be adequately explored. In fact, the very definition of saturated may be contingent on the methods being used to reconstruct the phylogeny.
Rates also vary widely from site to site within both genes. Replacement substitutions are extremely variable in rate, ranging from many sites that are essentially invariant to a few that evolve fairly rapidly. Silent rates are more modal with somewhat less variation across each gene. Rates are even more variable across lineages, and variation is correlated between genes and, to a lesser extent, between partitions. Lineages with high rates, such as Gnetales and some ferns, tend to have high rates for both psaA and psbB, and both silent and replacement rates tend to be high. The same pattern holds for lineages with low rates. Seed plants other than angiosperms and Gnetales tend to have low rates at all positions.
The cause of this extensive rate variation among plant taxa is uncertain. Presumably, genes that are as functionally constrained and highly conserved as photosystem genes should be subject to intense purifying selection at nonsynonymous sites. However, the rate of nonneutral substitution is affected not only by the strength of selection, but also by demographic factors such as population size (Gillespie 1999
). Long-lived seed plants such as conifers, cycads, and Ginkgo have lower rates of amino acid substitution than most angiosperms or Gnetales that were sampled. Evidently, some combination of selection intensity, population size, and perhaps the dynamics of fluctuations in each of these has caused these dramatic rate differences.
Lineage differences in rates of evolution of silent sites, on the other hand, are usually attributed to organism-wide factors such as generation time, metabolic rate (in animals at least), and differences in DNA repair mechanisms, or clade-specific factors like diversification rate (Bousquet et al. 1992
). Little is known about generation times in plants, and even less is known about these other factors (Gaut 1998
). Nonetheless, most extant gymnosperms are likely to have long generation times. Although it might seem difficult to attribute the high rates observed in Gnetales to short generation times because extant Gnetales are all long-lived perennials, the fossil record of Gnetales includes a much more diverse array of species, many of which might have had quite short life cycles (Stewart and Rothwell 1993
). This could explain at least the high rates along the stem lineage to Gnetales. However, Gnetum, Welwitschia, and Ephedra also have high rates, despite their presumably long generation times.
Long-Branch Attraction
The detection of LBA is problematic. One method proposed to identify statistical inconsistency in parsimony reconstruction is to use an estimation procedure that is putatively consistent, such as ML (Huelsenbeck 1997
). However, there are three problems with this approach. First, in finite data sets, every method, including ML, is subject to sampling error and bias. Second, ML has been shown to be inconsistent anyway, at least in some cases of model misspecification (Chang 1996
). Third, computational limitations prevented us from assessing the strength of the signal in ML analyses (via bootstrapping) to quite the same degree as was possible with MP. We were limited to NNI branch swapping in the ML runs rather than the more exhaustive TBR swapping we undertook in the MP runs. There were some hints that ML results could differ from MP results just as would be expected if ML were more robust than MP to LBA. However, these results were dependent on the precise substitution model used and were inconsistent between genes. The third-position psaA ML results suggested moderate support for nesting of Gnetales with conifers (unlike under parsimony), but psbB showed no evidence of this. Moreover, the ML runs using SITERATES options that allow fixed but different rates in the different codon positions supported the Gnetales hypothesis, just as parsimony runs on the combined data sets did. We suspect that the SITERATES model is just not a very good one in the face of the tremendous rate variation across sites observed in these genes (fig. 6
). Perhaps what is needed is a model permitting different gamma-distributed patterns of rate variation in different partitions.
Thus, we concentrated on a simulation approach. Both bias and statistical inconsistency were evident in third-position data for psbB if either the anthophyte tree (tree C) or the gymnosperm tree (tree B) was correct. In either of those cases, the tempo and mode of evolution of psbB was such that it was most likely that the Gnetales tree (tree A) would be reconstructed in a data set the size of the one used. Moreover, matters would become even worse if more data of the same type were gathered, which is the hallmark of statistical inconsistency. The statistical bias and inconsistency arise because of a combination of extremely short branches at the base of seed plants, long branches within Gnetales, and long branches in some outgroups. Evidently, given a near trichotomy at the base of seed plants, the psbB third-position data are such that the long-branched Gnetales tend to be attracted to a more basal position near the ferns.
On the other hand, there is less evidence for these same kinds of errors for psaA third-position data. For these data, the bias involves only the anthophyte tree, and once again it is closely associated with a trichotomy at the base of seed plants. Unlike in the case of psbB data, the gymnosperm tree (tree B) can be reconstructed accurately even with third-position data. This apparent contradiction is puzzling. One explanation might be that the estimates of branch lengths by ML are themselves biased and that branch lengths are actually much longer than they appear in the psaA phylograms (figs. 7 and 8
). If so, the parameters used in simulations might be too low to force LBA to occur. The saturation experiments described in figure 11 test for this. These experiments show that branch lengths would have to be underestimated by nearly an order of magnitude to mask a long-branch problem that is really in the data. Given this, it is doubtful that the likelihood approach used to estimate branch lengths is that highly biased (Zharkikh 1994
). Note that there is a short but finite branch supporting the gymnosperm clade in tree B of psaA (fig. 8
), unlike the case for psbB, in which its length is nearly 0 (fig. 10
).
|
A few other cases of strongly conflicting signal between codon partitions have been described in the literature. Stanger-Hall and Cunningham (1998), for example, investigated conflicting relationships implied by mitochondrial Cyt b and COII data for lemurs, following up on the work of Yoder, Vilgalys, and Ruvolo (1996)
and Adkins and Honeycutt (1994)
. They argued that it was the first- and second-position data that were more misleading than the third-position data, owing to selective constraints on amino acid evolution, but that specifics of the reconstruction algorithm (e.g., details of substitution model) interacted in a complex way with the data partitions, a result we found as well. Considered together, these results and ours suggest that rejection of third-position data a priori in any phylogenetic analysis on account of presumed saturation is not a good strategy. Sometimes, this might cause more reliable data to be excluded in favor of less reliable data. More generally, this strategy may mask interesting cases of conflict between data partitions and consequently weaken our understanding of how data and algorithms interact in efforts to reconstruct credible phylogenies.
Supplementary Material
Sequence alignments are available from the following web site: http://loco.ucdavis.edu/sandlab/sp2k.htm.
|
|
We thank J. Palmer, C. dePamphilis, and M. Frohlich for sharing unpublished manuscripts; T. Metcalf, E. Sandoval, and the staff of the UCD Botanical Conservatory and the UCD Arboretum for graciously providing plant material and expertise; and the Green Plant Phylogeny Research Coordinating Group for providing financial support. M. J. Donoghue, J. A. Doyle, and K. P. Steele provided useful insights. M.J.S. and T.S.K. are supported by NSF grant DEB-9726856 to M.J.S. and J. A. Doyle.
Footnotes
Pamela Soltis, Reviewing Editor
1 Abbreviations: LBA, long branch attraction; MP, maximum parsimony; ML, maximum likelihood; NJ, neighbor joining.
2 Keywords: statistical consistency
maximum likelihood
parsimony
3 Address for correspondence and reprints: Michael J. Sanderson, Section of Evolution and Ecology, One Shields Avenue, University of California, Davis, California 95616. E-mail: mjsanderson{at}ucdavis.edu
literature cited
Adachi, J., and M. Hasegawa. 1996. MOLPHY. Computer program published by the authors. Institute of Statistical Mathematics, Tokyo.
Adkins, R. M., and R. L. Honeycutt. 1994. Evolution of primate cytochrome c oxidase subunit II gene. J. Mol. Evol. 38:215231.[ISI][Medline]
Albert, V. A., A. Backlund, K. Bremer, M. W. Chase, J. R. Manhart, B. D. Mishler, and K. C. Nixon. 1994. Functional constraints and rbcL evidence for land plant phylogeny. Ann. Mo. Bot. Gard. 81:534567.
Bakke, E., and A. von Haeseler. 1999. Distance measures in terms of substitution process. Theor. Popul. Biol. 55:166175.[ISI][Medline]
Björkland, M. 1999. Are third positions really that bad? A test using vertebrate cytochrome b. Cladistics 15:191197.
Bousquet, J., S. H. Strauss, A. H. Doerksen, and R. A. Price. 1992. Extensive variation in evolutionary rate of rbcL gene sequences among seed plants. Proc. Natl. Acad. Sci. USA 89:78447848.
Bowe, L. M., G. Coat, and C. W. dePamphilis. 2000. Phylogeny of seed plants based on all three plant genomic compartments: extant gymnosperms are monophyletic and Gnetales are derived conifers. Proc. Natl. Acad. Sci. USA (in press).
Carlquist, S. 1996. Wood, bark, and stem anatomy of Gnetales: a summary. Int. J. Plant Sci. 157(Suppl.):S58S76.
Carmean, D., and B. Crespi. 1995. Do long branches attract flies? Nature 373:666.
Chang, J. T. 1996. Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Math. Biosci. 134:189215.[ISI][Medline]
Chase, M. W., D. E. Soltis, R. G. Olmstead et al. (39 co-authors). 1993. Phylogenetics of seed plants: an analysis of nucleotide sequences from the plastid gene rbcL. Ann. Mo. Bot. Gard. 80:528580.
Chaw, S.-M., C. L. Parkinson, Y. Cheng, T. M. Vincent, and J. D. Palmer. 2000. Seed plant phylogeny inferred from all three plant genomes: monophyly of extant gymnosperms and origin of Gnetales from conifers. Proc. Natl. Acad. Sci. USA (in press).
Chaw, S.-M., A. Zharkikh, H.-M. Sung, T.-C. Lau, and W.-H. Li. 1997. Molecular phylogeny of extant gymnosperms and seed plant evolution: analysis of nuclear 18S rRNA sequences. Mol. Biol. Evol. 14:5668.[Abstract]
Crane, P. R. 1985. Phylogenetic analysis of seed plants and the origin of angiosperms. Ann. Mo. Bot. Gard. 72:716793.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. Pp. 345352 in M. O. Dayhoff, ed. Atlas of protein sequence and structure. Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C.
Doyle, J. A. 1996. Seed plant phylogeny and the relationships of Gnetales. Int. J. Plant Sci. 157(Suppl. 6):S3S39.
Doyle, J. A., and M. J. Donoghue. 1986. Seed plant phylogeny and the origin of angiosperms: an experimental cladistic approach. Bot. Rev. 52:321431.[ISI]
. 1992. Fossils and seed plant phylogeny reanalyzed. Brittonia 44:89106.
Farris, J. S., M. Källersjö, A. G. Kluge, and C. Bult. 1995. Constructing a significance test for incongruence. Syst. Biol. 44:570572.[ISI]
Felsenstein, J. 1978. A likelihood approach to character weighting and what it tells us about parsimony and compatibility. Biol. J. Linn. Soc. 16:183196.
. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783791.
Frohlich, M. W., and D. S. Parker. 1999. Seed plant phylogeny: evidence from Floricaula/Leafy. XVIth International Botanical Congress [abstract].
Gaut, B. 1998. Molecular clocks and nucleotide substitution rates in higher plants. Pp. 93120 in M. K. Hecht, ed. Evolutionary biology. Vol. 30. Plenum Press, New York.
Gillespie, J. H. 1999. The role of population size in molecular evolution. Theor. Popl. Biol. 55:145156.
Goremykin, V., V. Bobrova, J. Pahnke, A. Troitsky, A. Antonov, and W. Martin. 1996. Noncoding sequences from the slowly evolving chloroplast inverted repeat in addition to rbcL data do not support gnetalean affinities of angiosperms. Mol. Biol. Evol. 13:383396.[Abstract]
Graham, S. W., and R. G. Olmstead. 1999. A phylogeny of basal angiosperms inferred from 17 chloroplast genes. XVIth International Botanical Congress [abstract].
Hansen, A., S. Hansmann, T. Samigullin, A. Antonov, and W. Martin. 1999. Gnetum and the angiosperms: molecular evidence that their shared morphological characters are convergent rather than homologous. Mol. Biol. Evol. 16:10061009.
Hasebe, M., R. Kofuji, M. Ito, M. Kato, K. Iwatsuki, and K. Ueda. 1992. Phylogeny of gymnosperms inferred from rbcL gene sequences. Bot. Mag. (Tokyo) 105:673679.
Hendy, M. D., and D. Penny. 1989. A framework for the quantitative study of evolutionary trees. Syst. Zool. 38:297309.[ISI]
Huelsenbeck, J. P. 1995. The robustness of two phylogenetic methods: four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor-joining. Mol. Biol. Evol. 12:843849.[Abstract]
. 1997. Is the Felsenstein Zone a fly trap? Syst. Biol. 46:6974.
. 1998. Systematic bias in phylogenetic analysis: is the Strepsiptera problem solved? Syst. Biol. 47:519537.
Huelsenbeck, J. P., D. M. Hillis, and R. Jones. 1996. Parametric bootstrapping in molecular phylogenetics: applications and performance. Pp. 1945 in J. D. Ferraris and S. R. Palumbi, eds. Molecular zoology: advances, strategies and protocols. Wiley-Liss, New York.
Källersjö, M., V. A. Albert, and J. S. Farris. 1999. Homoplasy increases phylogenetic structure. Cladistics 15:9193.
Kim, J. 1998. Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst. Biol. 47:4360.[ISI][Medline]
Lehmann, E. L. 1983. Theory of point estimation. Wiley, New York.
Li, W.-H. 1997. Molecular evolution. Sinauer, Sunderland, Mass.
Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under more realistic model of sequence evolution. Mol. Biol. Evol. 11:605612.
Loconte, H., and D. W. Stevenson. 1990. Cladistics of the Spermatophyta. Brittonia 42:197211.
Lyons-Weiler, J., and G. A. Hoelzer. 1997. Escaping from the Felsenstein Zone by detecting long branches in phylogenetic data. Mol. Phylogenet. Evol. 8:375384.[ISI][Medline]
Maddison, D. R., M. D. Baker, and K. A. Ober. 1999. Phylogeny of carabid beetles inferred from 18S ribosomal DNA (Coleoptera: Carabidae). Syst. Entomol. 24:103138.[ISI]
Manly, B. F. J. 1997. Randomization, bootstrap and Monte Carlo methods in biology. Chapman and Hall, New York.
Meyer, A. 1994. Shortcomings of the cytochrome b gene as a molecular marker. TREE 9:278280.
Nixon, K. C., W. L. Crepet, D. Stevenson, and E. M. Friis. 1994. A reevaluation of seed plant phylogeny. Ann. Mo. Bot. Gard. 81:484533.
Ort, D. R., and C. F. Yocum. 1996. Oxygenic photosynthesis: the light reactions. Kluwer, Boston.
Pellmyr, O., and J. Leebens-Mack. 1999. Forty million years of mutualism: evidence for Eocene origin of the yucca-yucca moth association. Proc. Natl. Acad. Sci. USA 96:91789183.
Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: an application for the monte-carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235238.[Abstract]
Raubeson, L. A., and R. K. Jansen 1992. A rare chloroplast DNA structural mutation is shared by all conifers. Biochem. Syst. Ecol. 20:1724.
Ross, V. A., M. J. Zanis, P. S. Soltis, and D. Soltis. 1999. Phylogenetic relationships among extant seed plant lineages inferred from 26S rDNA sequences. XVIth International Botanical Congress [abstract].
Rothwell, G. R., and R. Serbet. 1994. Lignophyte phylogeny and the evolution of spermatophytes: a numerical cladistic analysis. Syst. Bot. 19:443482.[ISI]
Siddall, M. E. 1998. Success of parsimony in the four-taxon case: long branch repulsion by likelihood in the Farris zone. Cladistics 14:209220.
Soltis, P. S., D. E. Soltis, P. G. Wolf, D. L. Nickrent, S.-M. Chaw, and R. L. Chapman. 1999. The phylogeny of land plants inferred from 18s rDNA sequences: pushing the limits of rDNA signal? Mol. Biol. Evol. 16:17741784.
Stanger-Hall, K., and C. W. Cunningham. 1998. Support for a monophyletic lemuriformes: overcoming incongruence between data partitions. Mol. Biol. Evol. 15:15721577.
Stewart, W. N., and G. W. Rothwell. 1993 Paleobotany and the evolution of plants. 2nd edition. Cambridge University Press, New York.
Swofford, D. S. 1999. PAUP* 4.0. Phylogenetic analysis using parsimony (*and other methods). Version 4b2. Sinauer, Sunderland, Mass.
Swofford. D. L., G. K. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogeny reconstruction. Pp. 407514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. 2nd edition. Sinauer, Sunderland, Mass.
Winter, K.-U., A. Becker, T. Munster, J. T. Kim, H. Saedler, and G. Theissen. 1999. MADS-box genes reveal that gnetophytes are more closely related to conifers than to flowering plants. Proc. Natl. Acad. Sci. USA 96:73427347.
Yang, Z. 1998. On the best evolutionary rate for phylogenetic analysis. Syst. Biol. 47:125133.[ISI][Medline]
Yoder, A. D., R. Vilgalys, and M. Ruvolo. 1996. Molecular evolutionary dynamics of cytochrome beta in strepsirrhine primates: the phylogenetic significance of third-position transversions. Mol. Biol. Evol. 13:13391350.
Zharkikh, A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39:315329.[ISI][Medline]