Effects of Nucleotide Composition Bias on the Success of the Parsimony Criterion in Phylogenetic Inference

Gavin C. Conant and Paul O. Lewis

Department of Biology, University of New Mexico
Department of Ecology and Evolutionary Biology, University of Connecticut


    Abstract
 
Convergence in nucleotide composition (CNC) in unrelated lineages is a factor potentially affecting the performance of most phylogeny reconstruction methods. Such convergence has deleterious effects because unrelated lineages show similarities due to similar nucleotide compositions and not shared histories. While some methods (such as the LogDet/paralinear distance measure) avoid this pitfall, the amount of convergence in nucleotide composition necessary to deceive other phylogenetic methods has never been quantified. We examined analytically the relationship between convergence in nucleotide composition and the consistency of parsimony as a phylogenetic estimator for four taxa. Our results show that rather extreme amounts of convergence are necessary before parsimony begins to prefer the incorrect tree. Ancillary observations are that (for unweighted Fitch parsimony) transition/transversion bias contributes to the impact of CNC and, for a given amount of CNC and fixed branch lengths, data sets exhibiting substantial site-to-site rate heterogeneity present fewer difficulties than data sets in which rates are homogeneous. We conclude by reexamining a data set originally used to illustrate the problems caused by CNC. Using simulations, we show that in this case the convergence in nucleotide composition alone is insufficient to cause any commonly used methods to fail, and accounting for other evolutionary factors (such as site-to-site rate heterogeneity) can give a correct inference without accounting for CNC.


    Introduction
 
Since phylogenetic relationships cannot be observed, it is impossible to directly verify the accuracy of phylogeny reconstructions. Because of this difficulty, it is of interest to discover conditions in data that can be demonstrated to cause phylogeny reconstruction methods to fail. One approach has been to specify a model phylogeny and a substitution model incorporating the factor of interest and then show that data generated from that phylogeny result in incorrectly inferred relationships. This demonstration can be done analytically for simple cases and some phylogeny reconstruction methods (e.g., Felsenstein 1978), but it more often requires the use of computer simulation (e.g., Nei 1991; Kuhner and Felsenstein 1994; Huelsenbeck 1995; Schöniger and von Haeseler 1995).

For DNA sequence data, several evolutionary factors have been discovered that can potentially mislead phylogeny estimation methods. Examples of such factors include transition/transversion bias (Kimura 1980; Wakeley 1993), heterogeneity in substitution rates among lineages (Felsenstein 1978), heterogeneity in substitution rates among sites within a nucleotide sequence (Navidi, Churchill, and von Haeseler 1991; Reeves 1992; Sidow and Steel 1992; Yang 1993), nonindependence of sites within a gene (Goldman and Yang 1994; Muse 1995, 1996; Schöniger and von Haeseler 1995), and nonstationarity of nucleotide frequencies across lineages (Loomis and Smith 1990; Burggraf, Stetter, and Woese 1992; Hasegawa and Hashimoto 1993; Lockhart et al. 1994; Galtier and Gouy 1995, 1998).

Lockhart et al. (1994) presented three compelling examples in which they postulated that convergence in nucleotide composition (CNC) in independent lineages led parsimony, as well as methods based on traditional substitution models, to prefer an incorrect tree, namely the tree placing taxa with similar nucleotide compositions together. LogDet (Lake 1994; Steel 1994) was the only transformation of those tested that resulted in a correct phylogenetic inference. Relatively few other cases have been found in which CNC has been identified as a problematic factor, although Foster and Hickey (1999) suggest that it may be the cause of misleading inferences for animal phylogenies when using all mitochondrial protein-coding sequences. There are at least two plausible explanations for this paucity of examples. First, if changed nucleotide composition is inherited (fig. 1A) rather than acquired by convergence (fig. 1B), one might expect phylogeny methods such as parsimony to prefer the correct tree more strongly than they should. Thus, whether nonstationarity in nucleotide composition is a problem would depend on the relative frequency in nature of inherited versus convergent similarity in nucleotide composition. This explanation is rather difficult to investigate, as it requires ascertaining relative frequencies of inherited composition versus CNC in nature. Second, even if convergent similarity in nucleotide composition is common, whether it is a problem for phylogeny methods depends on the strength of the convergence and how CNC interacts with other evolutionary factors. In this paper, we concentrate on this second explanation, using analyses of four-taxon phylogenies to obtain a feeling for the amount of CNC required to mislead phylogeny methods, especially parsimony. We also present a reexamination of one of the Lockhart et al. (1994) examples using computer simulation to show that other factors are at work in these data, and CNC alone does not provide a satisfactory explanation for the failure of the phylogeny methods examined.



 
Fig. 1.—Four-taxon trees depicting different ways in which differences in G+C composition among the tip sequences can accrue. In all cases, it is assumed that an increase in the frequency with which G's and C's are recruited into sequences in the event of a substitution occurs at some point in time, and this increased propensity continues and is inherited by descendant lineages following speciation events. A, The increase in G and C substitutions begins in the common ancestor of sequences 1 and 2 and is inherited in these two lineages, resulting in sequences 1 and 2 having higher G+C compositions than sequences 3 and 4. B, The increase occurs independently in the lineage leading to sequence 1 and the lineage leading to sequence 3. For purposes of the simulations, which all used tree B as the model tree, the branch length (d) was identical for all branches (edges) except for the two internal segments immediately descended from the root node, for which the length was d/2

 

    Convergence in Nucleotide Composition in Four-Taxon Trees
 
The term "nucleotide composition" can have at least two distinct meanings. It can refer to the nucleotide pool available for substitution or to the observed proportions of nucleotides in a particular sequence or genome. Both have been termed "equilibrium frequencies," since all commonly used substitution models (with the exception of the model underlying the LogDet/paralinear distance measure) assume that the nucleotide composition is stationary (i.e., does not change from lineage to lineage across the tree). We use the term "base frequencies" to refer to the substitution pool relative frequencies, but we allow them to change from lineage to lineage following Yang and Roberts (1995) and Galtier and Gouy (1998). When there is a change in substitution pool base frequencies, it takes some time before the observed nucleotide composition again reaches equilibrium. This lag is exacerbated by strong site-to-site rate heterogeneity, which leaves many sites unchanged for long periods of time. The appendix contains formulas for determining the expected nucleotide composition at some arbitrary time t following a change in base frequencies for models with and without the incorporation of rate heterogeneity.
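
The lag described above can be illustrated with a short sketch. This is not the appendix formula; it is a simplified derivation that assumes a T92-style process with unnormalised rates (κ·π_j for transitions, π_j for transversions, with π_G = π_C). Under that assumption the G+C fraction behaves as a two-state chain that relaxes toward the new pool value at rate (κ + 1)/2 per unit time. The function name `gc_content`, the time units, and the category rates in `hetero` are illustrative assumptions.

```python
import math

def gc_content(t, gc_start, gc_pool, kappa, rates=(1.0,)):
    """Expected G+C fraction t time units after the substitution pool changes.

    Assumes unnormalised T92-style rates: kappa * pi_j for transitions and
    pi_j for transversions, with pi_G = pi_C = gc_pool / 2. `rates` holds
    equally weighted relative rates (mean 1) for site-to-site heterogeneity.
    """
    relax = (kappa + 1.0) / 2.0
    # Average the exponential decay over the rate categories.
    decay = sum(math.exp(-r * relax * t) for r in rates) / len(rates)
    return gc_pool + (gc_start - gc_pool) * decay

# Illustrative values only (not estimates from the paper).
hetero = (0.0334, 0.2519, 0.8203, 2.8944)  # hypothetical category rates, mean 1
for t in (0.0, 0.25, 0.5, 1.0, 2.0):
    same = gc_content(t, 0.50, 0.74, kappa=1.0)
    mixed = gc_content(t, 0.50, 0.74, kappa=1.0, rates=hetero)
    print(f"t={t:4.2f}  homogeneous={same:.3f}  heterogeneous={mixed:.3f}")
```

Because the average of the exponentials is never smaller than the exponential of the average rate, the heterogeneous curve stays closer to the starting composition at every t > 0, which is the lag the paragraph refers to.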

In this section, we examine the question of how much CNC is required to mislead parsimony in the four-taxon case by using the probabilities of parsimony-informative patterns to define the region of statistical inconsistency for parsimony (i.e., the region in which parsimony would converge on an incorrect tree given an infinite amount of data). The model tree is that in figure 1B, consisting of two "biased" branches and three "unbiased" branches (the central branch comprises both segments attached to the root node). Because short internal branches in four-taxon trees present the greatest difficulties for phylogeny reconstruction, the length of the central branch was varied independently of the four peripheral branches. Branch lengths are given in terms of the expected number of substitutions per site (d) unless otherwise indicated. The K2P model (Kimura 1980) was used for unbiased branches, and the model employed for biased branches was the T92 model (Tamura 1992; Galtier and Gouy 1998). The bias introduced along the two biased branches involved increasing the frequency of both G and C by an amount δ (i.e., πG = πC = 0.25 + δ, πA = πT = 0.25 - δ). The probability of observing any of the four bases at the root node was assumed to be 0.25, in accordance with the K2P model employed for the central branch containing the root.
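
As a rough illustration of the two substitution models, the sketch below builds a T92-style rate matrix from δ and κ; setting δ = 0 recovers K2P. The function name `t92_rate_matrix` and the normalisation (one expected substitution per unit time at the matrix's own equilibrium) are assumptions made for this example, not details taken from the paper.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def t92_rate_matrix(delta, kappa):
    """Return (pi, Q) for a T92-style model; delta = 0 gives K2P.

    Off-diagonal entry (i, j) is kappa * pi_j for a transition and pi_j for
    a transversion. The matrix is then rescaled so that the expected rate at
    its own equilibrium is one substitution per unit time (an assumed
    normalisation).
    """
    pi = {"A": 0.25 - delta, "C": 0.25 + delta,
          "G": 0.25 + delta, "T": 0.25 - delta}
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                # Purine<->purine or pyrimidine<->pyrimidine is a transition.
                transition = (x in PURINES) == (y in PURINES)
                q[i][j] = (kappa if transition else 1.0) * pi[y]
        q[i][i] = -sum(q[i])  # diagonal is still 0 here, so rows sum to zero
    scale = sum(pi[x] * -q[i][i] for i, x in enumerate(BASES))
    q = [[v / scale for v in row] for row in q]
    return [pi[x] for x in BASES], q

pi, q = t92_rate_matrix(delta=0.12, kappa=10.0)
for base, row in zip(BASES, q):
    print(base, " ".join(f"{v:8.4f}" for v in row))
```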

With a tree and a substitution model thus specified, it is possible to compute the probability of all 256 data patterns for any given combination of G+C bias (δ), transition/transversion rate ratio (κ), and branch length (d). We need be concerned with only the 36 parsimony-informative patterns among the 256 possible patterns, 12 of which support each of the three possible unrooted trees. Let P0 be the sum of the probabilities of the 12 patterns supporting the true tree and let P1 and P2 be the sum of the probabilities of the 12 patterns supporting each of the two incorrect trees. If either P1 or P2 exceeds P0, then parsimony will tend to choose incorrectly even with an infinite number of nucleotide sites (i.e., parsimony is statistically inconsistent).
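
The calculation just described can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the authors' program: the rate matrices are scaled to unit rate at their own equilibrium, the matrix exponential is approximated by a Taylor series with scaling and squaring, and the helper names (`rate_matrix`, `transition_probs`, `support_sums`) are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def rate_matrix(delta, kappa):
    """T92-style matrix (K2P when delta = 0), scaled to unit equilibrium rate."""
    pi = [0.25 - delta, 0.25 + delta, 0.25 + delta, 0.25 - delta]  # A, C, G, T
    q = [[0.0] * 4 for _ in range(4)]
    for i, j in product(range(4), repeat=2):
        if i != j:
            transition = {BASES[i], BASES[j]} in ({"A", "G"}, {"C", "T"})
            q[i][j] = (kappa if transition else 1.0) * pi[j]
    for i in range(4):
        q[i][i] = -sum(q[i])
    scale = -sum(pi[i] * q[i][i] for i in range(4))
    return [[v / scale for v in row] for row in q]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transition_probs(q, t, squarings=10, terms=12):
    """Approximate exp(Q t) by a Taylor series with scaling and squaring."""
    step = t / 2 ** squarings
    small = [[q[i][j] * step for j in range(4)] for i in range(4)]
    result = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for n in range(1, terms + 1):
        term = [[v / n for v in row] for row in matmul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(4)]
                  for i in range(4)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def support_sums(d, central, delta, kappa):
    """Return (P0, P1, P2, total) for the tree of figure 1B.

    Tips 1 and 3 sit on biased branches of length d; tips 2 and 4 sit on
    unbiased branches of length d. `central` is the whole central branch,
    split evenly on either side of the root.
    """
    p_bias = transition_probs(rate_matrix(delta, kappa), d)
    p_flat = transition_probs(rate_matrix(0.0, kappa), d)
    p_half = transition_probs(rate_matrix(0.0, kappa), central / 2.0)
    sums = [0.0, 0.0, 0.0]
    total = 0.0
    for s1, s2, s3, s4 in product(range(4), repeat=4):
        prob = 0.0
        # Sum over the unobserved states at the root and two interior nodes.
        for root, x, y in product(range(4), repeat=3):
            prob += (0.25 * p_half[root][x] * p_half[root][y]
                     * p_bias[x][s1] * p_flat[x][s2]
                     * p_bias[y][s3] * p_flat[y][s4])
        total += prob
        if s1 == s2 and s3 == s4 and s1 != s3:
            sums[0] += prob  # pattern xxyy supports the true tree (tree 0)
        elif s1 == s3 and s2 == s4 and s1 != s2:
            sums[1] += prob  # pattern xyxy supports tree 1
        elif s1 == s4 and s2 == s3 and s1 != s2:
            sums[2] += prob  # pattern xyyx supports tree 2
    return sums[0], sums[1], sums[2], total

for delta in (0.0, 0.12, 0.24):
    p0, p1, p2, total = support_sums(d=0.5, central=0.1, delta=delta, kappa=1.0)
    print(f"delta={delta:.2f}  P0-P1={p0 - p1:+.5f}  P0-P2={p0 - p2:+.5f}")
```

A negative value of P0 - P1 would mark a parameter combination inside the region of inconsistency. Because the sequence at the start of a biased branch is not at that branch's equilibrium, the realised number of substitutions on biased branches differs slightly from d under this normalisation.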

As expected, for many combinations of branch lengths and κ, increasing G+C bias (δ) caused parsimony to become statistically inconsistent (fig. 2). Since the model tree specified the biased branches to be those leading to sequences 1 and 3, the tree that placed sequences 1 and 3 together (tree 1) was increasingly supported as the level of bias increased. Tree 0 (the true tree, placing sequences 1 and 2 together) and tree 1 thus provided the comparison of interest; tree 2 (placing sequences 1 and 4 together) will be ignored hereinafter. The plots in figure 2 depict the difference between P0 and P1. The region of inconsistency (shaded) is entered when the surface representing P0 - P1 dips below 0; it is in this area that parsimony is expected to prefer tree 1 over the true tree.



 
Fig. 2.—Expected performance of the parsimony criterion for differing combinations of d, κ, and δ, where d represents the expected number of substitutions per site, κ is the transition/transversion rate ratio (the instantaneous transition rate divided by the instantaneous transversion rate), and δ is the magnitude of the increase in the equilibrium frequencies of both G and C (πG = πC = 0.25 + δ, πA = πT = 0.25 - δ) on biased branches (the dashed lines in the tree depicted in fig. 1B). The performance of parsimony is measured as the difference between the probability of observing data patterns that support the correct tree and the probability of data patterns that support the "G+C tree" (i.e., the tree that incorrectly places taxa with increased G+C content together). Shaded portions of the plots represent regions of statistical inconsistency for parsimony, analogous to the "Felsenstein Zone" in the long-branch attraction problem, since in these regions misleading data patterns are more probable than patterns supporting the correct tree. A, κ = 1.0, δ = 0.0. B, κ = 1.0, δ = 0.12. C, κ = 1.0, δ = 0.24. D, κ = 10.0, δ = 0.0. E, κ = 10.0, δ = 0.12. F, κ = 10.0, δ = 0.24.

 
It has been suggested by Lockhart et al. (1992) that statistical inconsistency as a result of CNC occurs in the four-taxon case only when the internal branch of the unrooted tree is shorter than the terminal branches. This suggests that CNC is a problem related to the long-branch attraction described by Felsenstein (1978). The vertical axis in figure 2 represents the length of the central branch, while the horizontal axis represents the length of each of the peripheral branches. Figure 2 demonstrates that, in fact, even if the internal branch is equal in length to the terminal branches, there exists a level of G+C bias sufficient to cause parsimony to become inconsistent, although the level of bias required in such cases is quite high.

Figure 2 shows that, in general, branch lengths must be large (>0.5 substitutions per site) for CNC to cause serious problems for parsimony, even when the G+C bias is nearly at its maximum possible value (δ = 0.24). CNC is exacerbated by small internal branch lengths and especially by transition/transversion bias.

Figure 3 repeats the analysis of figure 2, this time including the discrete gamma distribution of sitewise relative rates. In this case, we see that the addition of rate heterogeneity actually decreases the size of the zone of inconsistency, especially in regions where all branches are long. One might predict that site-to-site rate heterogeneity would make matters worse for parsimony (and any method that does not take it into account), since high rate heterogeneity implies that change is concentrated at fewer sites. This means that variable sites have a better chance of experiencing multiple hits than in the rate homogeneity case, leading to greater difficulty in distinguishing true phylogenetic signal from false signal due to convergence. This would be especially true if the total amount of accumulated nucleotide composition bias were held constant. In figure 2, this is not the case: it is the number of substitutions (branch lengths) that is held constant, and the greater success of parsimony can thus be attributed to the fact that change has been concentrated at a few variable sites, and the realized nucleotide composition bias is not as great as that for the rate homogeneity case (where more sites have undergone at least one change).



 
Fig. 3.—Plot of the performance of parsimony as in figure 2, with the addition of site-to-site rate variation modeled as a discrete gamma distribution with four categories and α (gamma shape parameter) = 0.2. A, κ = 1.0, δ = 0.0. B, κ = 1.0, δ = 0.12. C, κ = 1.0, δ = 0.24. D, κ = 10.0, δ = 0.0. E, κ = 10.0, δ = 0.12. F, κ = 10.0, δ = 0.24.

 

    Simulation Study
 
The rigidity of the model tree in the analytical study makes it difficult to apply the results to real data sets. In particular, few real data sets follow the assumed perfect molecular clock, and fewer still have interior nodes so evenly spaced in time. We therefore used computer simulation to study the effects of CNC on the ability of parsimony and other methods to reconstruct the true tree using the chlorop.phy data set obtained from http://imbs.massey.ac.nz/Research/MolEvol/Farside/programs.htm and described in Lockhart et al. (1994). Lockhart et al. (1994) examined data from the 16S rRNA gene of chloroplasts (of diverse phylogenetic origins), as well as the cyanobacterium Anacystis. They showed that many common phylogenetic reconstruction methods failed to favor the tree assumed to be correct, which places all the chlorophyll b–containing organisms together, separated from the cyanobacterium Anacystis and the chlorophyll c–containing chromophyte alga Olithodiscus. The methods that failed were (1) parsimony, presumably equal-weighted and using unordered character states; (2) maximum likelihood, using the model described in Felsenstein (1993), presumably with the transition : transversion ratio fixed at the default value of 2; (3) neighbor joining using Jukes and Cantor (1969) distances; and (4) neighbor joining using Kimura (1980) two-parameter distances. These methods all placed Euglena between Anacystis and Olithodiscus. Using the LogDet transformation (in conjunction with neighbor joining) on just parsimony-informative sites produced the well-corroborated tree in which Euglena grouped with the other chlorophyll a/b–containing organisms. Lockhart et al. (1994) concluded that the relatively low G+C content of Euglena and Olithodiscus caused most methods to group them together.

Using PAUP*, version 4.0d64 (Swofford 1998), we were able to reproduce the results of Lockhart et al. (1994) on the entire data matrix of eight sequences, but we reduced the data set to just the sequences from Anacystis, Olithodiscus, Euglena, and Chlamydomonas for simplicity. As table 1 shows, reducing the taxon sampling did not affect the general conclusions reached by Lockhart et al. (1994). All methods examined except LogDet favored the unrooted tree topology grouping Euglena and Olithodiscus and separating them from Chlamydomonas and Anacystis, which have higher G+C contents (table 2). The model described by Galtier and Gouy (1998), hereinafter called the GG98 model, was used to simulate data according to the tree presumed to be correct, namely, (Anacystis, Olithodiscus, (Euglena, Chlamydomonas)). In essence, the hypothesis tested was that the process underlying the evolution of the observed sequences did not differ from the model of evolution used in the simulations. The results of the previous section suggest that the degree of bias present in the Lockhart et al. (1994) data set is not large enough to mislead parsimony (or, presumably, other methods) unless other factors exacerbate its effects. We therefore predicted that all methods would usually pick the correct tree in the simulated data sets.


 
Table 2 Results of Analysis of Four Taxa from Lockhart et al. (1994) Under Different Inference Methods

 

 
Table 3 Natural Log of the Likelihood for the 15 Possible Rooted Trees from Lockhart et al. (1994) for the Galtier and Gouy (1998) Model With and Without Discrete Gamma Rate Heterogeneity

 
The parameter values used in the simulations were maximum-likelihood estimates obtained using two independently written computer programs, each using the GG98 model. The program EVAL_NH, written by Galtier and Gouy, was used to check the results from a program (GG98) written separately by one of us (P.O.L.). It is important to note that the incorporation of CNC makes the model non-time-reversible. In such models, the maximum likelihood changes with different rootings, so table 3 presents likelihood scores for all 15 possible rooted topologies for four taxa. The maximum-likelihood tree under the GG98 model is the "true" tree (table 3). This result demonstrates that using a model allowing nucleotide composition to vary across the tree improves the quality of the estimated tree. The two programs were in agreement with respect to the parameter estimates for the maximum-likelihood tree (fig. 4). We each wrote independent computer programs to simulate data sets based on these parameter estimates and used PAUP*, version 4.0d64 (Swofford 1998), to evaluate each of the 1,000 simulated data sets for the five methods used by Lockhart et al. (1994) and described above: equal-weighted parsimony (MP); maximum likelihood with the F84 model (ML); minimum evolution with Jukes and Cantor (1969) distances (ME-JC); minimum evolution with K2P distances (ME-K2P); and minimum evolution with LogDet distances (ME-LogDet). None of these methods selected an incorrect tree in any of the 1,000 simulations, suggesting that there is a significant difference between the model used for simulation and the actual processes generating the observed sequences.


 
Table 1 Performance of Various Phylogenetic Inference Methods on the Eight-Taxon Data Set of Lockhart et al. (1994) and with the Four Taxa Subsequently Used

 


 
Fig. 4.—Parameter values used for the simulations using the GG98 model. These values represent maximum-likelihood estimates obtained using the GG98 model. Values below branches represent branch lengths computed using the standard HKY85 model formula for expected number of substitutions. Note that since the "equilibrium frequencies" differ for each branch in the GG98 model, the standard formula no longer reflects the expected number of substitutions, since the nucleotide composition is nonstationary. The correct formulas for this case are presented in the appendix. Numbers to the right of each node (or below taxon names) are the estimated percentages of G+C for the branch subtending the node. These do not represent the G+C composition of the sequence at the node, but instead represent the probabilities of substitution of G's and C's over the life span of the lineage leading up to the node. The estimated value of κ in this case was 3.781608.

 
We repeated the simulations, this time incorporating discrete gamma rate heterogeneity into the data. The model used is termed the GG98-Γ model, as it is identical to the GG98 model except for the addition of a gamma shape parameter. Four rate categories were used, with the mean of each category serving as the relative rate used in the likelihood calculations. Again, when the likelihood of each of the 15 possible rooted trees was computed using the GG98-Γ model, the maximum-likelihood tree was identical to the tree topology assumed to be true by Lockhart et al. (1994) (table 3). The maximum-likelihood estimates of the parameters of the GG98-Γ model (fig. 5) were used as the basis of the simulations; however, this time only the GG98 program could be used to estimate parameters because EVAL_NH does not include the gamma version of the GG98 model. In this case, some of the simulated data sets resulted in incorrect estimates of phylogeny regardless of the method used. Nevertheless, all of the methods recovered the correct tree a high percentage of the time, and LogDet did not outperform the other methods (table 4) when presented with the true amount of rate heterogeneity (the maximum-likelihood estimate of the gamma shape parameter from the original data set, 0.308, was the assumed level of rate heterogeneity in the simulated data).
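
The category rates for such a discrete gamma model can be sketched as follows. This is a minimal stdlib-only illustration of the usual equal-probability, category-mean construction, not the GG98 program itself; the function names and the series-plus-bisection numerics are choices made for this example.

```python
import math

def reg_lower_gamma(a, x):
    """Regularised lower incomplete gamma P(a, x) from its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-16:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def gamma_quantile(p, a, rate):
    """Quantile of a gamma(shape a, rate) variable, found by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(a, hi * rate) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(a, mid * rate) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, k=4):
    """Mean relative rate in each of k equal-probability categories.

    The gamma distribution has shape alpha and rate alpha, so its mean is 1.
    The mean within a category follows from the incomplete gamma function
    with shape alpha + 1 evaluated at the category boundaries.
    """
    cuts = [gamma_quantile(i / k, alpha, alpha) for i in range(1, k)]
    cdf = ([0.0]
           + [reg_lower_gamma(alpha + 1.0, c * alpha) for c in cuts]
           + [1.0])
    return [k * (cdf[i + 1] - cdf[i]) for i in range(k)]

for alpha in (0.2, 0.308):
    rates = discrete_gamma_rates(alpha)
    print(alpha, [round(r, 4) for r in rates])
```

With a small shape parameter, most categories receive rates close to zero and nearly all change falls in the fastest category, which is the sense in which strong rate heterogeneity concentrates substitutions at a few sites.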



 
Fig. 5.—Maximum-likelihood estimates of parameters obtained under the GG98-Γ model. The values below branches and beside nodes have the same meanings as in figure 4. The estimates for κ and the gamma shape parameter are 4.673279 and 0.307850, respectively.

 

 
Table 4 Results of Simulations Based on Parameter Estimates Made Using the GG98-Γ Model

 

    Discussion
 
Of the many evolutionary factors affecting the accuracy of phylogenetic inference, CNC is a relative newcomer, being recognized formally as a problem with the papers by Lake (1994), Lockhart et al. (1992), and Steel (1994). The present paper seeks to discover how much CNC is required before it presents serious problems for phylogenetic inference methods such as parsimony. The analytical results presented suggest that extreme combinations of substitution rates, transition/transversion bias, and equilibrium frequencies are required before parsimony is expected to fail. This is welcome news, because the situation investigated here represents nearly the worst-case scenario: nucleotide composition converging toward a common value in two unrelated lineages (the worst-case scenario for the four-taxon problem would involve increases in G+C in two unrelated terminal lineages and a corresponding decrease in G+C in the other two terminal lineages). Inherited similarities in nucleotide composition, on the other hand, will not be as problematic, as parsimony will tend to estimate trees correctly, albeit for the wrong reason. The only drawback posed by inherited similarities in nucleotide composition will be a tendency for parsimony to prefer the correct tree more strongly than it should, exhibiting a false degree of confidence in the form of bootstrap or decay values (Swofford et al. 2001).

Few clear cases have been reported in which CNC has been thought to derail the phylogenetic inference process. Of the three cases presented by Lockhart et al. (1994), two involve 18S rDNA from vertebrates and COII mtDNA from honeybees. In these two data sets, we could not find any way to obtain the putative "correct" tree except by using LogDet/paralinear distances, as reported by Lockhart et al. (1994). It is notable, however, that it is necessary to exclude all constant and autapomorphic sites (analyzing only parsimony-informative sites) to accurately estimate the phylogeny for these data sets. This suggests site-to-site rate heterogeneity as the likely culprit; however, taking account of site-to-site rate heterogeneity using the standard methods fails to produce a correct estimate. Therefore other, as yet unidentified, factors must be at work in these data sets.
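
The site filter mentioned above (dropping constant and autapomorphic sites) can be illustrated with a small sketch. The function name `informative_columns` and the toy alignment are hypothetical, and gaps or ambiguity codes are treated as ordinary states here, which a real analysis would need to handle.

```python
from collections import Counter

def informative_columns(alignment):
    """Indices of parsimony-informative columns in equal-length sequences.

    A column is kept when at least two different states each occur in at
    least two sequences; constant and autapomorphic columns are dropped.
    """
    keep = []
    for idx, column in enumerate(zip(*alignment)):
        counts = Counter(column)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            keep.append(idx)
    return keep

# Toy alignment (hypothetical data, four taxa, five sites).
toy = ["ACGTA",
       "ACGAA",
       "ATGAC",
       "ATCTC"]
print(informative_columns(toy))  # prints [1, 3, 4]
```

In the toy alignment, site 0 is constant and site 2 has a single divergent taxon, so only sites 1, 3, and 4 would be retained.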

The simulation study reported here represents a test of the hypothesis that CNC alone, or CNC in combination with site-to-site rate heterogeneity, is sufficient to explain the failure of many phylogenetic methods for the third case presented by Lockhart et al. (1994) (represented by the chlorop.phy data set). We used a parametric bootstrap approach in which parameters were estimated from the data using maximum likelihood and simulations performed using these parameter estimates. The results show that CNC, either alone or in combination with site-to-site rate heterogeneity, is insufficient to account for difficulties found in the original data set. Without rate heterogeneity, none of the simulated data sets presented problems for parsimony or any of the other methods tested (all of which, except LogDet, failed on the original data set); with rate heterogeneity, only a small fraction did.

It is clear that the GG98 model used for the simulations did not capture some factor important in the evolution of the actual sequences. One possibility is that the GG98 model does not allow enough variation in nucleotide composition across the tree. This model places some constraints on changes in nucleotide composition, forcing the frequency of G to equal the frequency of C and allowing only changes in G+C composition at the nodes of the tree. It seems unlikely that these two model constraints can account for the differences seen between the simulation results and the results from the original data. First, allowing the composition of G to differ from the composition of C should not increase the chances of an artifactual joining of Euglena to Olithodiscus, since it is the low G+C content in these lineages that is postulated to have caused problems in the original data set. Second, allowing nucleotide composition to vary within lineages should also not increase the chance of Euglena pairing with Olithodiscus, since all of the phylogenetic methods that failed on the original data set view branches as the smallest units making up a phylogenetic tree: that is, they cannot, like LogDet, take account of changes in composition that occur within branches.

When simulations incorporated both CNC and rate heterogeneity, a small fraction of the simulated data sets proved difficult for all methods. This falls short of the result that would be expected if rate heterogeneity were the all-important missing factor. Also, we would expect LogDet to perform well (as it did on the original data set) compared with the other methods examined. In fact, LogDet behaves similarly to the other methods, failing on a small fraction of the simulated data sets (table 4 ). These observations indicate the presence of as-yet-unknown evolutionary factors at work in the evolution of the actual sequences that are not being modeled by the simulations.

The phylogenetic methods in common use today each have their own "Achilles' heel," and it behooves researchers to learn as much as possible about the factors at work in their data prior to deciding on a method to use in the final analysis. For example, parsimony's primary Achilles' heel has long been identified as long-branch attraction (Felsenstein 1978). Maximum likelihood can correct for problems that are identified and incorporated into substitution models but can be deceived by factors not represented in the model used (e.g., rate heterogeneity; Gaut and Lewis 1995). This paper has addressed a potential Achilles' heel applicable to most methods of phylogenetic inference and found that it is perhaps not as great a threat as it was initially perceived to be. This is not to say that CNC can be ignored altogether. Figure 3 illustrates that CNC in combination with site-to-site rate heterogeneity and transition/transversion bias can cause problems even at biologically realistic substitution rates and levels of rate heterogeneity. For example, in figure 3, one point at which parsimony is inconsistent is characterized by the following parameter values: peripheral branch lengths = 0.8, central branch length = 0.1, gamma shape = 0.2, and transition/transversion rate ratio = 1.0, with a G+C difference of 0.12 between biased and unbiased lineages. These branch lengths and the G+C bias are at the edge of what is normally observed in actual data sets, but none are out of the realm of possibility, and the transition/transversion bias and degree of rate heterogeneity are not at all extreme. LogDet/paralinear distances provide a practical means for diagnosing CNC should it be present in a dosage sufficient to cause problems. A tree estimated using LogDet that differs from trees estimated using other methods should prompt an examination of the data for evidence that other methods are incorrectly joining taxa with similar nucleotide compositions.

While it is unlikely that any data set can be found that shows the influence of one and only one evolutionary factor, it is nevertheless beneficial to thoroughly analyze sequence data sets in the search for good examples of the effects of evolutionary factors representing potential pitfalls for phylogeny methods. Equally important is the search for new evolutionary factors. It is only when such evolutionary factors as site-to-site rate heterogeneity, transition/transversion bias, evolutionary dependence among sites, and CNC are discovered that work can begin on creating evolutionary models that avoid the problems they create.


    Acknowledgements
 
The authors would like to thank Peter Lockhart, Michael Steel, Michael Hendy, and David Penny for making the data sets used in their 1994 paper freely available to other researchers over the World Wide Web. Permission from David L. Swofford to use a prerelease test version of his software PAUP*, version 4.0, is also gratefully acknowledged. Finally, we thank the Biology Department of the University of New Mexico for providing support for the computing facilities needed to carry out this research. P.O.L. gratefully acknowledges funding from the Alfred P. Sloan Foundation/National Science Foundation (grant 98-4-5 ME). This paper is the culmination of the research performed for a Senior Honors Thesis by G.C.C.


    Footnotes
 
Masami Hasegawa, Reviewing Editor

1 Keywords: nucleotide composition, phylogeny, LogDet, G+C bias, maximum parsimony

2 Address for correspondence and reprints: Gavin C. Conant, Department of Biology, 167 Castetter Hall, University of New Mexico, Albuquerque, New Mexico 87131-1091. E-mail: gconant@unm.edu.


    literature cited

    Burggraf, S. G., K. O. Stetter, C. R. Woese. 1992. A phylogenetic analysis of Aquifex pyrophilus. Syst. Appl. Microbiol. 15:352–356

    Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401–410

    ———. 1993. PHYLIP (phylogeny inference package). Version 3.5. Distributed by the author, Department of Genetics, University of Washington, Seattle, Washington

    Foster, P. G., D. A. Hickey. 1999. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. 48:284–290

    Galtier, N., M. Gouy. 1995. Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl. Acad. Sci. USA 92:11317–11321

    ———. 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871–879

    Gaut, B., P. O. Lewis. 1995. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol. 12:152–162

    Goldman, N., Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736

    Hasegawa, M., T. Hashimoto. 1993. Ribosomal RNA trees misleading? Nature 361:23

    Huelsenbeck, J. P. 1995. Performance of phylogenetic methods in simulation. Syst. Biol. 44:17–48

    Jukes, T. H., C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York

    Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120

    Kuhner, M. K., J. Felsenstein. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:459–468

    Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455–1459

    Lockhart, P. J., D. Penny, M. D. Hendy, C. J. Howe, T. J. Beanland, A. W. D. Larkum. 1992. Controversy on chloroplast origins. FEBS Lett. 301:127–131

    Lockhart, P. J., M. A. Steel, M. D. Hendy, D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605–612

    Loomis, W. F., D. W. Smith. 1990. Molecular phylogeny of Dictyostelium discoideum by protein sequence comparison. Proc. Natl. Acad. Sci. USA 87:9093–9097

    Muse, S. V. 1995. Evolutionary analyses of DNA sequences subject to constraints on secondary structure. Genetics 139:1429–1439

    ———. 1996. Estimating synonymous and nonsynonymous substitution rates. Mol. Biol. Evol. 13:105–114

    Navidi, W. C., G. A. Churchill, A. von Haeseler. 1991. Methods for inferring phylogenies from nucleic acid sequence data by using maximum likelihood and linear invariants. Mol. Biol. Evol. 8:128–143

    Nei, M. 1991. Relative efficiencies of different treemaking methods for molecular data. Pp. 90–128 in M. M. Miyamoto and J. Cracraft, eds. Phylogenetic analysis of DNA sequences. Oxford University Press, New York

    Reeves, J. H. 1992. Heterogeneity in the substitution process of amino acid sites of proteins coded for by mitochondrial DNA. J. Mol. Evol. 35:17–31

    Schöniger, M., A. von Haeseler. 1995. Performance of the maximum likelihood, neighbor joining, and maximum parsimony methods when sequence sites are not independent. Syst. Biol. 44:533–547

    Sidow, A., T. P. Steel. 1992. Estimating the fraction of invariable codons with a capture-recapture method. J. Mol. Evol. 35:253–260

    Steel, M. A. 1994. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett. 7:19–23

    Swofford, D. L. 1998. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4.0 (prerelease test version). Sinauer, Sunderland, Mass

    Swofford, D. L., P. J. Waddell, J. P. Huelsenbeck, P. G. Foster, P. O. Lewis, J. S. Rogers. 2001. Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst. Biol. (in press)

    Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. 9:678–687

    Waddell, P., M. Steel. 1997. General time-reversible distances with unequal rates across sites: mixing Γ and inverse Gaussian distributions with invariant sites. Mol. Phylogenet. Evol. 8:398–414

    Wakeley, J. 1993. Substitution-rate variation among sites and the estimation of transition bias. Mol. Biol. Evol. 11:426–442

    Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396–1401

    Yang, Z., D. Roberts. 1995. On the use of nucleic acid sequences to infer early branchings in the tree of life. Mol. Biol. Evol. 12:451–458

Accepted for publication February 13, 2001.