Department of Biology, University of New Mexico
Department of Ecology and Evolutionary Biology, University of Connecticut
Abstract |
---|
Introduction |
---|
For DNA sequence data, several evolutionary factors have been discovered that can potentially mislead phylogeny estimation methods. Examples of such factors include transition/transversion bias (Kimura 1980; Wakeley 1993), heterogeneity in substitution rates among lineages (Felsenstein 1978), heterogeneity in substitution rates among sites within a nucleotide sequence (Navidi, Churchill, and von Haeseler 1991; Reeves 1992; Sidow and Steel 1992; Yang 1993), nonindependence of sites within a gene (Goldman and Yang 1994; Muse 1995, 1996; Schöniger and von Haeseler 1995), and nonstationarity of nucleotide frequencies across lineages (Loomis and Smith 1990; Burggraf, Stetter, and Woese 1992; Hasegawa and Hashimoto 1993; Lockhart et al. 1994; Galtier and Gouy 1995, 1998).
Lockhart et al. (1994) presented three compelling examples in which they postulated that convergence in nucleotide composition (CNC) in independent lineages led parsimony, as well as methods based on traditional substitution models, to prefer an incorrect tree, namely the tree placing taxa with similar nucleotide compositions together. LogDet (Lake 1994; Steel 1994) was the only transformation of those tested that resulted in a correct phylogenetic inference. Relatively few other cases have been found in which CNC has been identified as a problematic factor, although Foster and Hickey (1999) suggest that it may be the cause of misleading inferences for animal phylogenies when using all mitochondrial protein-coding sequences. There are at least two plausible explanations for this paucity of examples. First, if changed nucleotide composition is inherited (fig. 1A) rather than acquired by convergence (fig. 1B), one might expect phylogeny methods such as parsimony to prefer the correct tree more strongly than they should. Thus, whether nonstationarity in nucleotide composition is a problem would depend on the relative frequency in nature of inherited versus convergent similarity in nucleotide composition. This explanation is rather difficult to investigate, as it requires ascertaining relative frequencies of inherited composition versus CNC in nature. Second, even if convergent similarity in nucleotide composition is common, whether it is a problem for phylogeny methods depends on the strength of the convergence and how CNC interacts with other evolutionary factors. In this paper, we instead concentrate on this second explanation, using analyses of four-taxon phylogenies to obtain a feeling for the amount of CNC required to mislead phylogeny methods, especially parsimony. We also present a reexamination of one of the Lockhart et al. (1994) examples using computer simulation to show that other factors are at work in these data, and CNC alone does not provide a satisfactory explanation for the failure of the phylogeny methods examined.
Convergence in Nucleotide Composition in Four-Taxon Trees |
---|
In this section, we examine the question of how much CNC is required to mislead parsimony in the four-taxon case by using the probabilities of parsimony-informative patterns to define the region of statistical inconsistency for parsimony (i.e., the region in which parsimony would converge on an incorrect tree given an infinite amount of data). The model tree is that in figure 1B, consisting of two "biased" branches and three "unbiased" branches (the central branch comprises both segments attached to the root node). Because short internal branches in four-taxon trees present the greatest difficulties for phylogeny reconstruction, the length of the central branch was varied independently of the four peripheral branches. Branch lengths are given in terms of the expected number of substitutions per site (d) unless otherwise indicated. The K2P model (Kimura 1980) was used for unbiased branches, and the T92 model (Tamura 1992; Galtier and Gouy 1998) was used for biased branches. The bias introduced along the two biased branches involved increasing the frequency of both G and C by a fixed amount, the G+C bias (i.e., the frequencies of G and C each equal 0.25 plus the bias, and the frequencies of A and T each equal 0.25 minus the bias). The probability of observing any of the four bases at the root node was assumed to be 0.25, in accordance with the K2P model employed for the central branch containing the root.
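As a minimal sketch of the two substitution models just described, the following Python builds a rate matrix in which each rate is weighted by the frequency of the target base and transitions occur `kappa` times faster than transversions. The names `delta` (G+C bias) and `kappa` (transition/transversion rate ratio), the HKY-style parameterization, and the series-based matrix exponential are assumptions made for illustration and are not taken from the original analysis. With `delta = 0` the matrix reduces to the K2P form.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def base_frequencies(delta):
    """Frequencies of A, C, G, T with G and C raised by delta (assumed symbol)."""
    if not 0.0 <= delta < 0.25:
        raise ValueError("delta must lie in [0, 0.25)")
    return [0.25 - delta, 0.25 + delta, 0.25 + delta, 0.25 - delta]

def rate_matrix(delta, kappa):
    """Rate matrix scaled to one expected substitution per unit time."""
    freqs = base_frequencies(delta)
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                is_transition = (x in PURINES) == (y in PURINES)
                q[i][j] = (kappa if is_transition else 1.0) * freqs[j]
    mean_rate = sum(freqs[i] * q[i][j] for i in range(4) for j in range(4))
    for i in range(4):
        for j in range(4):
            q[i][j] /= mean_rate
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transition_probabilities(q, d, squarings=12, terms=10):
    """exp(q * d), computed by scaling and squaring a truncated Taylor series."""
    scale = d / (2 ** squarings)
    small = [[x * scale for x in row] for row in q]
    result = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for n in range(1, terms + 1):
        term = [[x / n for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(4)]
                  for i in range(4)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Unbiased (K2P-like) and biased (T92-like) branches of length d = 0.5.
unbiased = transition_probabilities(rate_matrix(0.0, 2.0), 0.5)
biased = transition_probabilities(rate_matrix(0.2, 2.0), 0.5)
```

Scaling the matrix to one expected substitution per unit time lets the second argument of `transition_probabilities` be read directly as a branch length d in substitutions per site.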
With a tree and a substitution model thus specified, it is possible to compute the probability of all 256 data patterns for any given combination of G+C bias, transition/transversion rate ratio, and branch length (d). We need be concerned with only 36 of the 256 possible patterns, 12 of which support each of the three possible unrooted trees. Let P0 be the sum of the probabilities of the 12 patterns supporting the true tree and let P1 and P2 be the sum of the probabilities of the 12 patterns supporting each of the two incorrect trees. If either P1 or P2 exceeds P0, then parsimony will tend to choose incorrectly even with an infinite number of nucleotide sites (i.e., parsimony is statistically inconsistent).
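The counting argument above can be sketched in a few lines of Python. The block enumerates all 256 patterns, identifies the 36 parsimony-informative ones, and sums their probabilities into P0, P1, and P2 for a four-taxon tree given one transition probability matrix per branch. Jukes-Cantor matrices are used here only to keep the example short; they are a stand-in for the K2P and T92 matrices of the actual analysis, and all function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

def supported_tree(pattern):
    """Return 0, 1, or 2 for a parsimony-informative pattern, else None."""
    s1, s2, s3, s4 = pattern
    if s1 == s2 and s3 == s4 and s1 != s3:
        return 0  # groups sequences (1,2) and (3,4): the true tree
    if s1 == s3 and s2 == s4 and s1 != s2:
        return 1  # groups sequences (1,3) and (2,4)
    if s1 == s4 and s2 == s3 and s1 != s2:
        return 2  # groups sequences (1,4) and (2,3)
    return None

def jc_matrix(d):
    """Jukes-Cantor transition probabilities for branch length d (stand-in model)."""
    decay = math.exp(-4.0 * d / 3.0)
    same, diff = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def pattern_probability(pattern, p1, p2, p3, p4, pc, root=(0.25,) * 4):
    """Sum over the states x and y at the two internal nodes.

    p1..p4 are the matrices for the peripheral branches, pc for the central one.
    """
    idx = [BASES.index(s) for s in pattern]
    total = 0.0
    for x in range(4):       # node joining sequences 1 and 2
        for y in range(4):   # node joining sequences 3 and 4
            total += (root[x] * pc[x][y]
                      * p1[x][idx[0]] * p2[x][idx[1]]
                      * p3[y][idx[2]] * p4[y][idx[3]])
    return total

def support_sums(p1, p2, p3, p4, pc):
    """Return [P0, P1, P2] summed over the 36 informative patterns."""
    sums = [0.0, 0.0, 0.0]
    for pattern in product(BASES, repeat=4):
        tree = supported_tree(pattern)
        if tree is not None:
            sums[tree] += pattern_probability(pattern, p1, p2, p3, p4, pc)
    return sums

short = jc_matrix(0.1)
P0, P1, P2 = support_sums(short, short, short, short, short)
print("parsimony inconsistent:", max(P1, P2) > P0)
```

Placing the root at one end of the central branch is adequate here because the stand-in model is reversible with uniform base frequencies, which mirrors the assumption of a 0.25 root frequency for each base stated in the text.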
As expected, for many combinations of branch lengths and transition/transversion rate ratio, increasing G+C bias caused parsimony to become statistically inconsistent (fig. 2). Since the model tree specified the biased branches to be those leading to sequences 1 and 3, the tree that placed sequences 1 and 3 together (tree 1) was increasingly supported as the level of bias increased. Tree 0 (the true tree, placing sequences 1 and 2 together) and tree 1 thus provided the comparison of interest; tree 2 (placing sequences 1 and 4 together) will be ignored hereinafter. The plots in figure 2 depict the difference between P0 and P1. The region of inconsistency (shaded) is entered when the surface representing P0 - P1 dips below 0; it is in this area that parsimony is expected to prefer tree 1 over the true tree.
Figure 2 shows that, in general, branch lengths must be large (>0.5 substitutions per site) for CNC to cause serious problems for parsimony, even when the G+C bias is nearly at its maximum possible value (0.24). The effect of CNC is exacerbated by small internal branch lengths and especially by transition/transversion bias.
Figure 3 repeats the analysis of figure 2, this time including the discrete gamma distribution of sitewise relative rates. In this case, we see that the addition of rate heterogeneity actually decreases the size of the zone of inconsistency, especially in regions where all branches are long. One might predict that site-to-site rate heterogeneity would make matters worse for parsimony (and any method that does not take it into account), since high rate heterogeneity implies that change is concentrated at fewer sites. This means that variable sites have a better chance of experiencing multiple hits than in the rate homogeneity case, leading to greater difficulty in distinguishing true phylogenetic signal from false signal due to convergence. This would be especially true if the total amount of accumulated nucleotide composition bias were held constant. In figure 2, this is not the case: it is the number of substitutions (branch lengths) that is held constant, and the greater success of parsimony can thus be attributed to the fact that change has been concentrated at a few variable sites, and the realized nucleotide composition bias is not as great as that for the rate homogeneity case (where more sites have undergone at least one change).
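The last point, that holding branch length constant while adding rate heterogeneity leaves fewer sites with at least one change, can be illustrated with a short calculation. The sketch below treats substitutions at a site as a Poisson process and compares a single rate class with four equally weighted classes. The four relative rates are hypothetical values chosen only so that they average to 1; they are not the discrete gamma categories used for figure 3.

```python
import math

def fraction_changed(d, rates=(1.0,)):
    """Expected fraction of sites with at least one substitution on a branch.

    A site with relative rate r is assumed to receive a Poisson number of
    substitutions with mean r * d; rate classes are equally weighted.
    """
    mean_rate = sum(rates) / len(rates)
    if abs(mean_rate - 1.0) > 1e-9:
        raise ValueError("relative rates must average to 1")
    return sum(1.0 - math.exp(-r * d) for r in rates) / len(rates)

# Hypothetical rate classes with mean 1 (not fitted to any data set).
HYPOTHETICAL_RATES = (0.1, 0.4, 1.0, 2.5)

for d in (0.2, 0.8):
    homogeneous = fraction_changed(d)
    heterogeneous = fraction_changed(d, HYPOTHETICAL_RATES)
    print(d, round(homogeneous, 4), round(heterogeneous, 4))
```

Because 1 - exp(-r d) is concave in r, averaging over unequal rates with mean 1 always yields a smaller fraction of changed sites than the homogeneous case at the same branch length, which is consistent with the explanation given above.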
Simulation Study |
---|
Using PAUP*, version 4.0d64 (Swofford 1998), we were able to reproduce the results of Lockhart et al. (1994) on the entire data matrix of eight sequences, but we reduced the data set to just the sequences from Anacystis, Olithodiscus, Euglena, and Chlamydomonas for simplicity. As table 1 shows, reducing the taxon sampling did not affect the general conclusions reached by Lockhart et al. (1994). All methods examined except LogDet favored the unrooted tree topology grouping Euglena and Olithodiscus and separating them from Chlorella and Anacystis, which have higher G+C contents (table 2). The model described by Galtier and Gouy (1998), hereinafter called the GG98 model, was used to simulate data according to the tree presumed to be correct, namely, (Anacystis, Olithodiscus, (Euglena, Chlamydomonas)). In essence, the hypothesis tested was that the process underlying the evolution of the observed sequences did not differ from the model of evolution used in the simulations. The results of the previous section suggest that the degree of bias present in the Lockhart et al. (1994) data set is not large enough to mislead parsimony (or, presumably, other methods) unless other factors exacerbate its effects. We therefore predicted that all methods would usually pick the correct tree in the simulated data sets.
Discussion |
---|
Few clear cases have been reported in which CNC has been thought to derail the phylogenetic inference process. Of the three cases presented by Lockhart et al. (1994), two involve 18S rDNA from vertebrates and COII mtDNA from honeybees. In these two data sets, we could not find any way to obtain the putative "correct" tree except by using LogDet/paralinear distances, as reported by Lockhart et al. (1994). It is notable, however, that it is necessary to exclude all constant and autapomorphic sites (analyzing only parsimony-informative sites) to accurately estimate the phylogeny for these data sets. This suggests site-to-site rate heterogeneity as the likely culprit; however, taking account of site-to-site rate heterogeneity using the standard methods fails to produce a correct estimate. Therefore other, as yet unidentified, factors must be at work in these data sets.
The simulation study reported here represents a test of the hypothesis that CNC alone, or CNC in combination with site-to-site rate heterogeneity, is sufficient to explain the failure of many phylogenetic methods for the third case presented by Lockhart et al. (1994) (represented by the chlorop.phy data set). We used a parametric bootstrap approach in which parameters were estimated from the data using maximum likelihood and simulations performed using these parameter estimates. The results show that CNC, either alone or in combination with site-to-site rate heterogeneity, is insufficient to account for difficulties found in the original data set. None of the simulated data sets presented problems for parsimony or any of the other methods tested (all of which failed on the original data set).
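The logic of the parametric bootstrap can be sketched as a simple loop: simulate a data set under the fitted model, apply a tree-building method, and record whether the tree presumed correct is recovered. In the sketch below, `parametric_bootstrap`, `toy_simulate`, and `toy_parsimony` are hypothetical stand-ins; the toy simulator draws counts of informative sites from assumed pattern probabilities rather than simulating sequences under the GG98 model, and the toy method relies on the fact that, for four taxa, the most parsimonious tree is the one supported by the most informative sites.

```python
import random

def parametric_bootstrap(simulate, infer_tree, true_tree, replicates=100, seed=1):
    """Fraction of simulated data sets on which a method recovers true_tree."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        data = simulate(rng)
        if infer_tree(data) == true_tree:
            hits += 1
    return hits / replicates

def toy_simulate(rng, weights=(0.05, 0.02, 0.02), n_sites=1000):
    """Counts of sites supporting trees 0, 1, 2 (weights are assumed values)."""
    counts = [0, 0, 0]
    for _ in range(n_sites):
        u = rng.random()
        if u < weights[0]:
            counts[0] += 1
        elif u < weights[0] + weights[1]:
            counts[1] += 1
        elif u < sum(weights):
            counts[2] += 1
        # remaining sites are uninformative and ignored
    return counts

def toy_parsimony(counts):
    """Four-taxon parsimony: choose the tree with the most supporting sites."""
    return counts.index(max(counts))

success = parametric_bootstrap(toy_simulate, toy_parsimony, true_tree=0)
print("proportion of replicates recovering tree 0:", success)
```

In a real analysis the simulator would generate sequences from maximum-likelihood parameter estimates and the inference step would call the phylogenetic method under test; only the surrounding loop is meant to carry over.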
It is clear that the GG98 model used for the simulations did not capture some factor important in the evolution of the actual sequences. One possibility is that the GG98 model does not allow enough variation in nucleotide composition across the tree. This model places some constraints on changes in nucleotide composition, forcing the frequency of G to equal the frequency of C and allowing only changes in G+C composition at the nodes of the tree. It seems unlikely that these two model constraints can account for the differences seen between the simulation results and the results from the original data. First, allowing the composition of G to differ from the composition of C should not increase the chances of an artifactual joining of Euglena to Olithodiscus, since it is the low G+C content in these lineages that is postulated to have caused problems in the original data set. Second, allowing nucleotide composition to vary within lineages should also not increase the chance of Euglena pairing with Olithodiscus, since all of the phylogenetic methods that failed on the original data set view branches as the smallest units making up a phylogenetic tree: that is, they cannot, like LogDet, take account of changes in composition that occur within branches.
When simulations incorporated both CNC and rate heterogeneity, a small fraction of the simulated data sets proved difficult for all methods. This falls short of the result that would be expected if rate heterogeneity were the all-important missing factor. Also, we would expect LogDet to perform well (as it did on the original data set) compared with the other methods examined. In fact, LogDet behaves similarly to the other methods, failing on a small fraction of the simulated data sets (table 4). These observations indicate the presence of as-yet-unknown evolutionary factors at work in the evolution of the actual sequences that are not being modeled by the simulations.
The phylogenetic methods in common use today each have their own "Achilles' heel," and it behooves researchers to learn as much as possible about the factors at work in their data prior to deciding on a method to use in the final analysis. For example, parsimony's primary Achilles' heel has long been identified as long-branch attraction (Felsenstein 1978). Maximum likelihood can correct for problems that are identified and incorporated into substitution models but can be deceived by factors not represented in the model used (e.g., rate heterogeneity; Gaut and Lewis 1995). This paper has addressed a potential Achilles' heel applicable to most methods of phylogenetic inference and found that it is perhaps not as great a threat as it was initially perceived to be. This is not to say that CNC can be ignored altogether. Figure 3 illustrates that CNC in combination with site-to-site rate heterogeneity and transition/transversion bias can cause problems even at biologically realistic substitution rates and levels of rate heterogeneity. For example, in figure 3, one point at which parsimony is inconsistent is characterized by the following parameter values: peripheral branch lengths = 0.8, central branch length = 0.1, gamma shape = 0.2, and transition/transversion rate ratio = 1.0, with a G+C difference of 0.12 between biased and unbiased lineages. These branch lengths and the G+C bias are at the edge of what is normally observed in actual data sets, but none are out of the realm of possibility, and the transition/transversion bias and degree of rate heterogeneity are not at all extreme. LogDet/paralinear distances provide a practical means for diagnosing CNC should it be present in a dosage sufficient to cause problems. A tree estimated using LogDet that differs from trees estimated using other methods should prompt an examination of the data for evidence that other methods are incorrectly joining taxa with similar nucleotide compositions.
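For readers who want to see what a LogDet-type distance involves, the following sketch computes one from a pair of aligned sequences. It uses one commonly described normalized form, in which the log determinant of the joint frequency matrix is corrected by the marginal base frequencies and scaled by 1/4; that particular normalization, the handling of non-ACGT characters, and all function names are assumptions for illustration and may differ from the LogDet/paralinear implementation used in the analyses reported here.

```python
import math

BASES = "ACGT"

def divergence_matrix(seq_x, seq_y):
    """4x4 matrix of joint base frequencies for two aligned sequences."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to equal length")
    f = [[0.0] * 4 for _ in range(4)]
    n = 0
    for a, b in zip(seq_x, seq_y):
        if a in BASES and b in BASES:  # skip gaps and ambiguity codes
            f[BASES.index(a)][BASES.index(b)] += 1.0
            n += 1
    if n == 0:
        raise ValueError("no comparable sites")
    return [[v / n for v in row] for row in f]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size, det = len(m), 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det

def logdet_distance(seq_x, seq_y):
    """Normalized LogDet-type distance (assumed form; zero for identical sequences)."""
    f = divergence_matrix(seq_x, seq_y)
    det_f = determinant(f)
    freq_x = [sum(row) for row in f]
    freq_y = [sum(f[i][j] for i in range(4)) for j in range(4)]
    if det_f <= 0.0 or min(freq_x + freq_y) <= 0.0:
        raise ValueError("LogDet distance is undefined for this pair")
    correction = 0.5 * sum(math.log(x) + math.log(y)
                           for x, y in zip(freq_x, freq_y))
    return -0.25 * (math.log(det_f) - correction)

print(logdet_distance("AAAACCCCGGGGTTTT", "AAACCCCGGGGTTTTA"))
```

A matrix of such pairwise distances could then be passed to any distance-based tree-building method and the resulting tree compared with trees from other methods, as suggested above.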
While it is unlikely that any data set can be found that shows the influence of one and only one evolutionary factor, it is nevertheless beneficial to thoroughly analyze sequence data sets in the search for good examples of the effects of evolutionary factors representing potential pitfalls for phylogeny methods. Equally important is the search for new evolutionary factors. It is only when such evolutionary factors as site-to-site rate heterogeneity, transition/transversion bias, evolutionary dependence among sites, and CNC are discovered that work can begin on creating evolutionary models that avoid the problems they create.
Acknowledgements |
---|
Footnotes |
---|
1 Keywords: nucleotide composition, phylogeny, LogDet, G+C bias, maximum parsimony
2 Address for correspondence and reprints: Gavin C. Conant, Department of Biology, 167 Castetter Hall, University of New Mexico, Albuquerque, New Mexico 87131-1091. gconant@unm.edu.
Literature Cited |
---|
Burggraf, S. G., K. O. Stetter, C. R. Woese. 1992. A phylogenetic analysis of Aquifex pyrophilus. Syst. Appl. Microbiol. 15:352-356
Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401-410
Felsenstein, J. 1993. PHYLIP (phylogeny inference package). Version 3.5. Distributed by the author, Department of Genetics, University of Washington, Seattle, Washington
Foster, P. G., D. A. Hickey. 1999. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. 48:284-290
Galtier, N., M. Gouy. 1995. Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl. Acad. Sci. USA. 92:11317-11321
Galtier, N., M. Gouy. 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871-879
Gaut, B., P. O. Lewis. 1995. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol. 12:152-162
Goldman, N., Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736
Hasegawa, M., T. Hashimoto. 1993. Ribosomal RNA trees misleading? Nature. 361:23
Huelsenbeck, J. P. 1995. Performance of phylogenetic methods in simulation. Syst. Biol. 44:17-48
Jukes, T. H., C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21-132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120
Kuhner, M. K., J. Felsenstein. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:459-468
Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA. 91:1455-1459
Lockhart, P. J., D. Penny, M. D. Hendy, C. J. Howe, T. J. Beanland, A. W. D. Larkum. 1992. Controversy on chloroplast origins. FEBS Lett. 301:127-131
Lockhart, P. J., M. A. Steel, M. D. Hendy, D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605-612
Loomis, W. F., D. W. Smith. 1990. Molecular phylogeny of Dictyostelium discoideum by protein sequence comparison. Proc. Natl. Acad. Sci. USA. 87:9093-9097
Muse, S. V. 1995. Evolutionary analyses of DNA sequences subject to constraints on secondary structure. Genetics. 139:1429-1439
Muse, S. V. 1996. Estimating synonymous and nonsynonymous substitution rates. Mol. Biol. Evol. 13:105-114
Navidi, W. C., G. A. Churchill, A. von Haeseler. 1991. Methods for inferring phylogenies from nucleic acid sequence data by using maximum likelihood and linear invariants. Mol. Biol. Evol. 8:128-143
Nei, M. 1991. Relative efficiencies of different treemaking methods for molecular data. Pp. 90-128 in M. M. Miyamoto and J. Cracraft, eds. Phylogenetic analysis of DNA sequences. Oxford University Press, New York
Reeves, J. H. 1992. Heterogeneity in the substitution process of amino acid sites of proteins coded for by mitochondrial DNA. J. Mol. Evol. 35:17-31
Schöniger, M., A. von Haeseler. 1995. Performance of the maximum likelihood, neighbor joining, and maximum parsimony methods when sequence sites are not independent. Syst. Biol. 44:533-547
Sidow, A., T. P. Steel. 1992. Estimating the fraction of invariable codons with a capture-recapture method. J. Mol. Evol. 35:253-260
Steel, M. A. 1994. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett. 7:19-23
Swofford, D. L. 1998. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4.0 (prerelease test version). Sinauer, Sunderland, Mass
Swofford, D. L., P. J. Waddell, J. P. Huelsenbeck, P. G. Foster, P. O. Lewis, J. S. Rogers. 2001. Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst. Biol. (in press)
Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. 9:678-687
Waddell, P., M. Steel. 1997. General time-reversible distances with unequal rates across sites: mixing Γ and inverse Gaussian distributions with invariant sites. Mol. Phylogenet. Evol. 8:398-414
Wakeley, J. 1993. Substitution-rate variation among sites and the estimation of transition bias. Mol. Biol. Evol. 11:426-442
Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396-1401
Yang, Z., D. Roberts. 1995. On the use of nucleic acid sequences to infer early branchings in the tree of life. Mol. Biol. Evol. 12:451-458