Testing Substitution Models Within a Phylogenetic Tree

Gunter Weiss*,{dagger}, and Arndt von Haeseler{dagger},{ddagger}

* Max-Planck-Institut für evolutionäre Anthropologie, Leipzig, Germany
{dagger} Heinrich-Heine Universität Düsseldorf, Germany
{ddagger} Forschungszentrum Jülich, Germany


    Abstract
 TOP
 Abstract
 Introduction
 Models and Matrices
 Inference and Testing
 Simulation Studies
 Applications
 Discussion
 Acknowledgements
 Literature Cited
 
Phylogenetic tree reconstruction frequently assumes the homogeneity of the substitution process over the whole tree. To test this assumption statistically, we propose a test based on the sample covariance matrix of the set of substitution rate matrices estimated from pairwise sequence comparison. The sample covariance matrix is condensed into a one-dimensional test statistic {Delta} = {sum} ln(1 + {delta}i), where {delta}i are the eigenvalues of the sample covariance matrix. The test does not assume a specific mutational model. It analyses the variation in the estimated rate matrices. The distribution of this test statistic is determined by simulations based on the phylogeny estimated from the data. We study the power of the test under various scenarios and apply the test to X chromosome and mtDNA primate sequence data. Finally, we demonstrate how to include rate variation in the test.

Key Words: sequence evolution • tree reconstruction • test for homogeneity • substitution models


    Introduction
 TOP
 Abstract
 Introduction
 Models and Matrices
 Inference and Testing
 Simulation Studies
 Applications
 Discussion
 Acknowledgements
 Literature Cited
 
The reconstruction of evolutionary trees from molecular sequence data is nowadays a common tool with which to infer the ancestral relationship of living organisms. This is in part due to the ease of getting sequence data from any species one might be interested in either by standard sequencing techniques or simply by database searches. In addition, interest in the theory of sequence evolution and tree reconstruction has led to a large number of studies on substitution models (Zharkikh 1994), methods of reconstructing trees, exact and heuristic tree-search algorithms, and statistical procedures testing model assumptions and/or tree topologies (cf. Swofford et al. 1996). Today several software packages are available which put many of the theoretical results into practice (Swofford 1993; Felsenstein 1995; Adachi and Hasegawa 1996; Strimmer and von Haeseler 1996; Yang 1999; Kumar et al., 2000).

Given a set of sequence data, the first step in the analysis is the alignment of the sequences. This is a topic of its own which we do not address here (cf Gotoh 1999). From a sequence alignment a phylogenetic tree can be reconstructed by several means. Each method, however, uses different assumptions (stated either explicitly or implicitly), different optimality criteria, and different tree search algorithms. For example, most distance-based methods as well as maximum likelihood methods assume an explicit model of sequence evolution. With the exception of LogDet and paralinear distances, these models require homogeneity, reversibility, and stationarity. The model is usually defined by a matrix containing the instantaneous rates of change from one nucleotide to another. In addition it is assumed that the positions in a sequence evolve mutually independently (for review see Baake and von Haeseler 1999). However, the model might allow the overall rate of change to differ between positions, such that rate heterogeneity among sites is incorporated. Well known are attempts to model rate heterogeneity according to a gamma distribution (Yang 1993, 1994) or by a log-normal distribution (Olsen 1987). Although methods for testing the model of nucleotide substitutions abound (i.e., Goldman 1993a,b), little work has been put into the assumption that the evolutionary model is identical on all branches of a tree (for an exception see Yang and Roberts 1995). Recently, Devauchelle et al. (2001) observed large variations across taxa in substitution matrices for mtDNA encoded proteins. These observations were obtained by singular value decomposition, with no obvious way to support the results statistically.

In this study we question the justifiability of the assumption of a single evolutionary model acting on all branches of a tree. To this end, we develop a statistical test based on the sample covariance matrix S of the set of substitution rate matrices estimated from pairwise sequence comparison. The sample covariance matrix is condensed into an one-dimensional test statistic {Delta} = {sum}ln(1 + {delta}i), where the {delta}i are the eigenvalues of S. {Delta} is used to test the assumption of a single substitution model acting on all branches of a tree—i.e., testing the homogeneity assumption of the substitution process in the tree.

The outline of the paper is as follows: First, we define models of sequence evolution and describe how to estimate substitution rate matrices from pairwise sequence comparisons. Next, we derive and motivate the test statistic {Delta} and show how to determine the null-distribution of {Delta} by simulations. The reliability and sensitivity of the test are demonstrated by simulation studies. Finally, we apply the test to primate sequence data from the X chromosome and mitochondria.


    Models and Matrices
 TOP
 Abstract
 Introduction
 Models and Matrices
 Inference and Testing
 Simulation Studies
 Applications
 Discussion
 Acknowledgements
 Literature Cited
 
Suppose that substitution in a DNA sequence follows the Markovian model of molecular evolution (Felsenstein 1983; Lanave et al. 1984; Tavaré 1986; Gu and Li 1996; Kelly and Rice 1996)—i.e., a continous-time stationary Markov process defined by a rate matrix Q. This matrix has off-diagonal entries qij > 0 that give the instantaneous substitution rates at which nucleotide i is replaced by nucleotide j (i != j {A, C, G, T}). The diagonal elements of Q are given by


such that rows sum up to zero. Let {pi}i, i {A, C, G, T} be the equilibrium frequency of nucleotide i. We scale time such that the expected rate of substitutions


i.e., time is measured in expected number of substitutions. We require the substitution process to be reversible in time. Then,


must hold for all i != j {A, C, G, T}. This implies that the off-diagonal entries of Q can be expressed as qij = {pi}jbij with bij = bji, such that the model contains nine free parameters.

Assume that the rate matrix Q is diagonalizable (which is the generic case). Then, Q has a spectral decomposition


where U is the matrix of (right) eigenvectors and diag[{lambda}0, {lambda}1, {lambda}2, {lambda}3] a diagonal matrix containing the eigenvalues of matrix Q. The constraints on Q imply that {lambda}0 = 0 > {lambda}1 >= {lambda}2 >= {lambda}3. (For details we refer to Chapter 4.8 and Appendix of Karlin and Taylor [1975].) The transition probability matrix for a branch of length t (i.e., average number of substitutions) is given by


From equation (2) this can be expressed as


with {eta}s = e, s = 0, 1, 2, 3, and {eta}0 = 1 > {eta}1 >= {eta}2 >= {eta}3 > 0.

While the structure of the substitution process defined by Q is assumed to be the same for all positions along the sequence, the rate of evolution might differ from position to position. We model this rate heterogeneity among sites by a nonnegative random variable R with mean E(R) = 1 and moment-generating function {phi}(x) = E(exp(Rx)) (Gu and Li 1996; Kelly and Rice 1996). The relative rate r at a position is then given by a realization of R. However, the requirement E(R) = 1 ensures that the evolutionary rate averaged over the whole sequence remains the same independent of the specific distribution of R (for an overview see Waddell and Steel [1997]).

Conditional on R = r, the transition probability matrix of a position for a branch of length t is given by


Suppose R has density f(r). Then, the unconditional transition probability matrix is


with eigenvalues


In molecular evolution rate heterogeneity is frequently modelled by a {Gamma}-distribution with density


where the shape of the distribution depends on the parameter {alpha} > 0 (Uzzel and Corbin 1971; Wakeley 1993). For this model equation (6) yields


If substitution rates are uniform among sites (i.e., {alpha} -> {infty}) then {phi}({lambda}st) = e{lambda}st as above (eq. 3). Note that the eigenvalues of P(t) are functions of the product {lambda}st.


    Inference and Testing
 TOP
 Abstract
 Introduction
 Models and Matrices
 Inference and Testing
 Simulation Studies
 Applications
 Discussion
 Acknowledgements
 Literature Cited
 
Transition probability matrices as defined above are not directly observable from sequence data. However, they might be inferred from pairwise sequence comparison as follows. Let h(i,j) denote the probabilities of having nucleotides i and j in a pair of sequences (i, j {A, C, G, T}). These probabilities are collected in a so-called divergence matrix H (for review see Baake and von Haeseler 1999). Assuming stationarity and reversibility of the substitution process, the divergence matrix can be expressed as


where {Pi} = diag[{pi}A, {pi}C, {pi}G, {pi}T].

Using equations (5, 6, and 8), we get the following functional relationship between the substitution rate matrix Q and observables from data:


where {eta}s, s = 0, 1, 2, 3, are the eigenvalues of P(t) = {Pi}-1 H and we assume that {phi}-1() exists, which is the case for the {Gamma}-distribution. H can easily be estimated by relative frequencies of letter pairs in a pairwise alignment, an estimate for {Pi} is given by the average base composition in the corresponding alignment. Because we measure time in expected substitutions (eq. 1) we can use equation (9) to estimate Q directly from pairwise sequence comparisons. Note that the constraint of equation (1) yields the identifiability of Q in equation (9).

Now, suppose we were given a set of n aligned sequences that are related by a phylogenetic tree. Furthermore, each of the () sequence pairs yields an estimate of the substitution process that acted on the path connecting the sequence pair in the tree. Under the assumption of homogeneity of the substitution process over the whole phylogeny—i.e., all branches in the tree evolve according to the same substitution model Q—we expect all () estimates of Q to be close to each other. In this case differences between estimates are due to stochastic effects of sampling finite sequences. Alternatively, when there is heterogeneity of the substitution patterns among branches, we expect that there is variation between estimates that is not attributable to random effects but reflects differences of the substitution patterns among branches.

To quantify variation between estimates we utilize the sample covariance matrix of the estimates of Q. To this end, let q(k,l) denote the vector of the 12 off-diagonal entries of the estimate of Q for sequence pair (k, l). Note that the diagonal entries of Q are given by the negative of the respective row sum. Let further


be the mean of the vectors q(k,l). The sample covariance matrix S is then given by


where A' denotes the transpose of a matrix A. The matrix S measures the variability between the estimates q(k,l). However, for constructing a statistical test the 12 x 12 matrix S is not manageable. As it is typical of multivariate tests in general we condense the information contained in S to a scalar by introducing the following quantity:


where {delta}i, i = 1, ... , 12 are the eigenvalues of S. Note that the definition of S differs slightly from the logarithm of the determinant of S defined by ln({delta}i). The addition of one to each eigenvalue in equation (12) improves numerical stability and prevents unfavorable behavior of singular matrices.

We use {Delta} as a statistic to test the null-hypothesis of homogeneity of substitution patterns over the complete phylogeny. To do this, we need to determine the distribution of {Delta} given the null-hypothesis is true. Because we are using estimates from pairwise sequence comparisons, where sequences are related by a phylogeny, we expect those estimates to be highly correlated. Hence, we do not presume any standard distribution to fit our problem, but resort to a simulation procedure to determine the null-distribution of {Delta} described next.

From equations (1) and (9) we get not only an estimate for Q but also estimates for the distance between each pair of sequences. The resulting distance matrix is used to reconstruct the phylogenetic tree and estimate its branch lengths by standard-distance-based tree reconstruction methods, like Neighbor-Joining (Saitou and Nei 1987). Along this estimated phylogeny (tree and its branch lengths) we apply Seq-Gen (Rambaut and Grassly 1997) to simulate the evolution of sequences according to , our best estimate for Q under the null-hypothesis. For the resulting simulated data set, we compute {Delta} according to equation (12). By repeating this simulation, say 1,000 times, we get an approximation to the null-distribution of {Delta}. The P value of the actually observed {Delta}data is then estimated by the proportion of simulated {Delta} values equal to or larger than {Delta}data.


    Simulation Studies
 TOP
 Abstract
 Introduction
 Models and Matrices
 Inference and Testing
 Simulation Studies
 Applications
 Discussion
 Acknowledgements
 Literature Cited
 
To check reliability and sensitivity of our test procedure we conducted simulation studies for a variety of trees and models. We used an eight-taxon tree, shown in figure 1A, to simulate data sets under the null-model of homogeneous substitution patterns. Thus, we can determine the size of the test, i.e., the maximum probability of type I error (Bickel and Doksum 1977).



View larger version (9K):
[in this window]
[in a new window]
 
FIG. 1. Phylogenies with substitution models Q (solid line) and Q* (dotted line) used in the simulations

 
To demonstrate the statistical power of the approach, we simulated data along trees where on specific subtrees (see fig. 1 B,C,D) the substitution model Q* differed from that on the rest of the tree, Q. The branches in the trees (respecting the root node) are all of the same length t and are measured in expected substitutions per sequence. Simulations were carried out for branch lengths t = 10, 20, 50, 100, 200, and 500 and a sequence length of 5,000 nucleotides.

The substitution models are defined as pairs (B, {pi}) of symmetric matrices B and base frequency vectors {pi}, such that entries of Q are given by


Matrices B and vectors {pi} were chosen to show whether small changes in the substitution patterns can be detected by our procedure. The entries of the basic matrix B0, which specifies Q in figure 1, are all equal to one. Thus, (B0, {pi}) constitutes a Felsenstein-type substitution model (Felsenstein 1981). Matrices B1 and B2 differ from B0 by only one entry. More precisely, we set bC,T = bT,C equal to 2.0 and 4.0, respectively. That is to say, B1 and B2 correspond to situations where the rate of one type of base exchange is increased. We used four different base frequency vectors in our simulations: {pi}0 = (0.25, 0.25, 0.25, 0.25), {pi}1 = (0.15, 0.2, 0.3, 0.35), {pi}2 = (0.2, 0.2, 0.3, 0.3), and {pi}3 = (0.15, 0.25, 0.25, 0.35).

To assess the size and power of the test, we simulated 1,000 data sets under each of the 12 possible parameter combinations. Each data set was then analyzed by our procedure (eqs. 10, 11, 12) including 1,000 simulations each to approximate the distribution of the statistic {Delta}. The relative frequency of the number of data sets where the null-hypothesis was wrongly rejected on the 5% significance level was used to determine the actual size of the test. The power of the test under alternative hypotheses is given by the relative number of data sets which led correctly to rejection.

Table 1 shows the results of the simulation studies. Rows corresponding to tree A (see fig. 1) show the probability to (wrongly) reject the null-hypothesis on the 5% significance level. The results of the simulations are in good agreement with the nominal significance level; however, it appears that the test is slightly conservative. Rows corresponding to trees B, C, and D (see fig. 1) show the probability to (correctly) reject the null-hypothesis on the 5% significance level. It is clear from the simulations that the power of the test is increasing with the branch lengths of the tree—i.e., higher amount of expected substitutions—as well as with the minor proportion of branches that evolve differently from the rest of the tree. The "balanced" tree D shows generally the highest power. The power of the test is generally low with short branches (t = 10). Only when the two models in the tree differ substantially is the test able to detect that difference in a significant number of simulations. Note that the entries in the column corresponding to t = 10 in table 1 are based on only about 300 simulations, as very sparse divergence matrices cause numerical problems of the algorithm. It also becomes clear that differences between the two models Q and Q* in terms of matrices B are detected more easily (upper four ABCD blocks of table 1) than the minor differences in the base frequency distributions {pi} (last two blocks of table 1).


View this table:
[in this window]
[in a new window]
 
Table 1 Simulated Size (Tree A) and Power (Tree B-D) of the Test of Homogeneity for Various Substitution Models.

 

    Applications
 TOP
 Abstract
 Introduction
 Models and Matrices
 Inference and Testing
 Simulation Studies
 Applications
 Discussion
 Acknowledgements
 Literature Cited
 
With the encouraging results from simulations, we applied our test to two data sets of hominoid DNA. The first data set consists of 10 kb sequences from the Xq13.3 locus sampled from human, bonobo, chimpanzee, gorilla, orangutan, and gibbon (Kaessmann et al. 1999, 2001). The second data set comprises the complete mitochondrial DNA sequences from the same species except gibbon (Horai et al. 1995).

As a first step we performed the test assuming no rate heterogeneity among sites. Figure 2A shows the Neighbor-Joining primate Xq13.3 tree; the estimated transition rate matrix and the estimated nucleotide frequencies are displayed in figure 2B. Based on these estimates the null-distribution of {Delta} was approximated by the simulations in figure 2C. The observed value {Delta}Xq13 = 1.071 falls right into the distribution of {Delta} corresponding to a P value of 0.203. Thus, the test does not reject the homogeneous substitution patterns assumption. The power of the test is rather low, however, when sequences of this level of diversity are analyzed (see scale of phylogeny in figure 2A).



View larger version (17K):
[in this window]
[in a new window]
 
FIG. 2. Analyses of Xq13 data: (A) estimated phylogeny; (B) mean transition rate matrix, estimated base composition; (C) simulated null-distribution, observed value {Delta}Xq13; (D) 95%-quantile of {Delta} distribution (dashed line) and {Delta}Xq13 (solid line) as a function of the shape parameter {alpha}

 
A different result emerges for the complete mitochondrial data. The estimated tree (fig. 3A) agrees with the tree obtained from the Xq13.3 sequence. However, based on the estimates in figure 3B our simulations show, that {Delta}mtDNA = 1.033 is way out of the range of the simulated null-distribution yielding a p-value of less than 0.001 (see fig. 3C). The hypothesis of homogeneity of the substitution patterns is rejected.



View larger version (17K):
[in this window]
[in a new window]
 
FIG. 3. Analyses of mtDNA data: A. estimated phylogeny; B. mean transition rate matrix, estimated base composition; C. simulated null-distribution, observed value {Delta}mtDNA; D. 95%-quantile of {Delta} distribution (dashed line) and {Delta}mtDNA (solid line) as a function of the shape parameter {alpha}

 
We further investigated the impact of possible rate heterogeneity among positions on the outcome of the test. More specifically, we modeled rate heterogeneity by a {Gamma}-distribution (eq. 7) and varied the shape parameter {alpha} in our analyses. Note that small values of {alpha} correspond to high levels of rate heterogeneity. Figures 2D and 3D show the relationship between the upper 95% quantile of the null-distribution (based on 1,000 simulations for each choice of {alpha}) and the observed value {Delta}data as a function of {alpha}. It becomes clear that for the mtDNA data the null-hypothesis is rejected independently of levels of rate heterogeneity (fig. 3D).

The effect of possible rate heterogeneity on the analysis of Xq13 data is more interesting. For increasing level of rate heterogeneity, i.e. a decreasing value of {alpha}, {Delta}Xq13 approaches the 95% quantile curve. For {alpha} < 0.03 the value of {Delta}Xq13 is even larger than the corresponding 95% quantile. Thus, under the assumption of extreme rate heterogeneity, which may be indicated by extra evidence, the test would reject the homogeneity assumption for the Xq13 data. These results suggest that not obeying possible levels of rate heterogeneity results in a more conservative test. However, we note that with {alpha} close to zero numerical problems during estimation and simulation of the null-distribution become apparent.


    Discussion
 TOP
 Abstract
 Introduction
 Models and Matrices
 Inference and Testing
 Simulation Studies
 Applications
 Discussion
 Acknowledgements
 Literature Cited
 
The assumption of homogeneity of the substitution process on a phylogeny can be tested statistically using DNA sequence data. We showed how to construct such a test and how simulations yield an approximation of the null-distribution of the proposed test statistic {Delta}. The test itself does not assume a specific mutational model. It does not estimate parameters of the substitution process from data, but rather evaluates differences in estimated rate matrices.

However, if rate heterogeneity among sites should be incorporated, a model and its parameters have to be specified in advance of the application of the test.

Simulation studies and the analyses of two real data sets demonstrated that the test is admissible and has a reasonable power of rejection. We also showed that the problem of unknown rate heterogeneity can be circumvented by analyzing the data under various levels of rate heterogeneity. Application to the two real data sets suggests that assuming wrongly homogeneous substitution rates leads to the test being conservative.

What are the effects of a finding of heterogeneous substitution patterns within a phylogenetic tree? On the one hand the heterogeneity is interesting in itself, because there is need for a biological explanation. We could imagine that substitution patterns do not change purely by chance but could result from a change in the sequence environment—e.g., less selective constrains due to a gene duplication, different replication machinery for a mitochondrial insert in the nucleus, or need for adaptation because of changes in selective constraints either on the locus studied or on genes that affect the locus indirectly.

On the more technical side the test results can improve the way phylogenies are built. Varying substitution patterns affect the estimation of branch lengths and possibly also the branching pattern of the reconstructed tree. For example, it is well known that compositional changes in nucleotide frequencies can produce misleading trees (Lockhart et al. 1994; Galtier and Gouy 1995). With this in mind, results of a phylogenetic reconstruction should be interpreted with care, if our test suggests a rejection of the homogeneity assumption.

Because a fully parameterized probability model on a phylogeny that allows for branch-specific substitution models is not feasible, the test result can help in improving the process of model building for a specific data set by using appropriate methods that take into account different rate matrices over the tree (e.g., Steel 1993; Lake 1994).

The proposed test clearly has its limits in that divergence matrices that are sparse easily lead to complex eigenvalues and problems in assessing the null-distribution of {Delta} through simulations.

Although it was not demonstrated here, an obvious extension of the test is to answer which type of base exchange contributes significantly to a significant result of the overall test. The more challenging problem is to provide a definite answer to which branches contribute most to a significant result. For this purpose estimates of branch-specific substitution models seem to be necessary. We will address this question in the near future (Weiss and von Haeseler, in preparation).


    Acknowledgements
 TOP
 Abstract
 Introduction
 Models and Matrices
 Inference and Testing
 Simulation Studies
 Applications
 Discussion
 Acknowledgements
 Literature Cited
 
We thank the MPG for financial support. We also thank Roland Fleißner for critically reading the manuscript. The program is available free of charge and can be downloaded from www.tree-puzzle.de.


    Footnotes
 
E-mail: weiss{at}cs.uni-duesseldorf.de. Back

Wolfgang Stephan, Associate Editor Back


    Literature Cited
 TOP
 Abstract
 Introduction
 Models and Matrices
 Inference and Testing
 Simulation Studies
 Applications
 Discussion
 Acknowledgements
 Literature Cited
 

    Adachi, J., and M. Hasegawa. 1996. MOLPHY: Programs for Molecular Phylogenetics, ver. 2.3. Tokyo: Institute of Statistical Mathematics.

    Baake, E., and A. von Haeseler. 1999. Distance measures in terms of substitution processes. Theor. Popul. Biol. 55:166-175.[CrossRef][ISI][Medline]

    Bickel, P. J., and K. A. Doksum. 1977. Mathematical statistics. Holden-Day, Oakland, Calif.

    Devauchelle, C., A. Grossmann, A. Hénaut, M. Holschneider, M. Monnerot, J. Risler, and B. Torrésan. 2001. Rate matrices for analyzing large families of protein sequences. J. Comp. Biol. 8:381-399.[CrossRef][ISI]

    Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[ISI][Medline]

    Felsenstein, J. 1983. Statistical inference of phylogenies. J. R. Stat. Soc. Ser. A, 146:246-272.[ISI]

    Felsenstein, J. 1995. PHYLIP (phylogeny inference package). Version 3.572. Department of Genetics, University of Washington, Seattle.

    Galtier, N., and M. Gouy. 1995. Inferring phylogenies from sequences of unequal base composition. Proc. Natl. Acad. Sci. USA 92:11317-11321.[Abstract]

    Goldman, N. 1993a. Statistical tests of models of DNA substitutions. J. Mol. Evol. 36:182-198.[ISI][Medline]

    Goldman, N. 1993b. Simple diagnostic statistical tests of models of DNA substitutions. J. Mol. Evol. 37:650-661.[ISI][Medline]

    Gotoh, O. 1999. Multiple sequence alignment: algorithms and applications. Adv. Biophys. 36:159-206.[CrossRef][ISI][Medline]

    Gu, X., and W. H. Li. 1996. A general additive distance with time-reversibility and rate variation among nucleotide sites. Proc. Natl. Acad. Sci. USA 93:4671-4676.[Abstract/Free Full Text]

    Horai, S., K. Hayasaka, R. Kondo, K. Tsugane, and N. Takahata. 1995. Recent African origin of modern humans revealed by complete sequences of hominoid mitochondrial DNAs. Proc. Natl. Acad. Sci. USA 92:532-536.[Abstract]

    Kaessmann, H., F. Heissig, A. von Haeseler, and S. Pääbo. 1999. DNA sequence varaition in a non-coding region of low recombination on the human X chromosome. Nat. Genet. 22:78-81.[CrossRef][ISI][Medline]

    Kaessmann, H., V. Wiebe, G. Weiss, and S. Pääbo. 2001. Great ape DNA sequences reveal a reduced diversity and an expansion in humans. Nat. Genet. 27:155-156.[CrossRef][ISI][Medline]

    Karlin, S., and H. Taylor. 1975. A first course in stochastic processes. Academic Press, New York.

    Kelly, C., and J. Rice. 1996. Modeling nucleotide evolution: a heterogeneous rate analysis. Math. Biosci. 133:85-109.[CrossRef][ISI][Medline]

    Kumar, S., K. Tamura, I. Jakobsen, and M. Nei. 2000. MEGA: Molecular Evolutionary Genetics Analysis, ver. 2. Pennsylvania State University, University Park, and Arizona State University, Tempe.

    Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455-1459.[Abstract]

    Lanave, C., G. Preparata, C. Saccone, and G. Serio. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20:86-93.[ISI][Medline]

    Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605-612.[Free Full Text]

    Olsen, G. J. 1987. Earliest phylogenetic branchings: comparing rRNA-based evolutionary trees inferred with various techniques. Cold Spring Harbor Symp. Qunatita. Biol. 52:825-837.[ISI][Medline]

    Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235-238.[Abstract]

    Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.[Abstract]

    Steel, M. A. 1993. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett. 7:19-23.[CrossRef][ISI]

    Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964-969.[Free Full Text]

    Swofford, D. L. 1993. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.

    Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogeny reconstruction. Pp. 407–514 in D. M. Hillis, C. Moritz, and B. A. K. Mable, eds. Molecular systematics. 2nd edition. Sinauer Associates, Sunderland, Mass.

    Tavaré, S. 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 17:57-86.

    Uzzel, T., and K. W. Corbin. 1971. Fitting discrete probability distributions to evolutionary events. Science 172:1089-1096.[ISI][Medline]

    Waddell, P. J., and M. A. Steel. 1997. General time-reversible distances with unequal rates across sites: mixing {gamma} and inverse gaussian distributions with invariant sites. Mol. Phyl. Evol. 8:398-414.[CrossRef][ISI][Medline]

    Wakeley, J. 1993. Substitution rate variation among sites in hypervariable region 1 of human mitochondrial DNA. J. Mol. Evol. 37:613-623.[ISI][Medline]

    Yang, Z. 1993. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396-1401.[Abstract]

    Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314.[ISI][Medline]

    Yang, Z. 1999. PAML, Phylogenetic Analysis by Maximum Likelihood, ver 2.0. University College London, London.

    Yang, Z., and D. Roberts. 1995. On the use of nucleic acid sequences to infer early branchings in the tree of life. Mol. Biol. Evol. 12:451-458.[Abstract]

    Zharkikh, A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39:315-329.[ISI][Medline]

Accepted for publication December 2, 2002.