* Max-Planck-Institut für evolutionäre Anthropologie, Leipzig, Germany
Heinrich-Heine Universität Düsseldorf, Germany
Forschungszentrum Jülich, Germany
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: sequence evolution tree reconstruction test for homogeneity substitution models
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Given a set of sequence data, the first step in the analysis is the alignment of the sequences. This is a topic of its own which we do not address here (cf Gotoh 1999). From a sequence alignment a phylogenetic tree can be reconstructed by several means. Each method, however, uses different assumptions (stated either explicitly or implicitly), different optimality criteria, and different tree search algorithms. For example, most distance-based methods as well as maximum likelihood methods assume an explicit model of sequence evolution. With the exception of LogDet and paralinear distances, these models require homogeneity, reversibility, and stationarity. The model is usually defined by a matrix containing the instantaneous rates of change from one nucleotide to another. In addition it is assumed that the positions in a sequence evolve mutually independently (for review see Baake and von Haeseler 1999). However, the model might allow the overall rate of change to differ between positions, such that rate heterogeneity among sites is incorporated. Well known are attempts to model rate heterogeneity according to a gamma distribution (Yang 1993, 1994) or by a log-normal distribution (Olsen 1987). Although methods for testing the model of nucleotide substitutions abound (i.e., Goldman 1993a,b), little work has been put into the assumption that the evolutionary model is identical on all branches of a tree (for an exception see Yang and Roberts 1995). Recently, Devauchelle et al. (2001) observed large variations across taxa in substitution matrices for mtDNA encoded proteins. These observations were obtained by singular value decomposition, with no obvious way to support the results statistically.
In this study we question the justifiability of the assumption of a single evolutionary model acting on all branches of a tree. To this end, we develop a statistical test based on the sample covariance matrix S of the set of substitution rate matrices estimated from pairwise sequence comparison. The sample covariance matrix is condensed into an one-dimensional test statistic =
ln(1 +
i), where the
i are the eigenvalues of S.
is used to test the assumption of a single substitution model acting on all branches of a treei.e., testing the homogeneity assumption of the substitution process in the tree.
The outline of the paper is as follows: First, we define models of sequence evolution and describe how to estimate substitution rate matrices from pairwise sequence comparisons. Next, we derive and motivate the test statistic and show how to determine the null-distribution of
by simulations. The reliability and sensitivity of the test are demonstrated by simulation studies. Finally, we apply the test to primate sequence data from the X chromosome and mitochondria.
![]() |
Models and Matrices |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
|
Assume that the rate matrix Q is diagonalizable (which is the generic case). Then, Q has a spectral decomposition
|
|
|
While the structure of the substitution process defined by Q is assumed to be the same for all positions along the sequence, the rate of evolution might differ from position to position. We model this rate heterogeneity among sites by a nonnegative random variable R with mean (R) = 1 and moment-generating function
(x) =
(exp(Rx)) (Gu and Li 1996; Kelly and Rice 1996). The relative rate r at a position is then given by a realization of R. However, the requirement
(R) = 1 ensures that the evolutionary rate averaged over the whole sequence remains the same independent of the specific distribution of R (for an overview see Waddell and Steel [1997]).
Conditional on R = r, the transition probability matrix of a position for a branch of length t is given by
|
|
|
|
|
![]() |
Inference and Testing |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
Using equations (5, 6, and 8), we get the following functional relationship between the substitution rate matrix Q and observables from data:
|
Now, suppose we were given a set of n aligned sequences that are related by a phylogenetic tree. Furthermore, each of the () sequence pairs yields an estimate of the substitution process that acted on the path connecting the sequence pair in the tree. Under the assumption of homogeneity of the substitution process over the whole phylogenyi.e., all branches in the tree evolve according to the same substitution model Qwe expect all (
) estimates of Q to be close to each other. In this case differences between estimates are due to stochastic effects of sampling finite sequences. Alternatively, when there is heterogeneity of the substitution patterns among branches, we expect that there is variation between estimates that is not attributable to random effects but reflects differences of the substitution patterns among branches.
To quantify variation between estimates we utilize the sample covariance matrix of the estimates of Q. To this end, let q(k,l) denote the vector of the 12 off-diagonal entries of the estimate of Q for sequence pair (k, l). Note that the diagonal entries of Q are given by the negative of the respective row sum. Let further
|
|
|
We use as a statistic to test the null-hypothesis of homogeneity of substitution patterns over the complete phylogeny. To do this, we need to determine the distribution of
given the null-hypothesis is true. Because we are using estimates from pairwise sequence comparisons, where sequences are related by a phylogeny, we expect those estimates to be highly correlated. Hence, we do not presume any standard distribution to fit our problem, but resort to a simulation procedure to determine the null-distribution of
described next.
From equations (1) and (9) we get not only an estimate for Q but also estimates for the distance between each pair of sequences. The resulting distance matrix is used to reconstruct the phylogenetic tree and estimate its branch lengths by standard-distance-based tree reconstruction methods, like Neighbor-Joining (Saitou and Nei 1987). Along this estimated phylogeny (tree and its branch lengths) we apply Seq-Gen (Rambaut and Grassly 1997) to simulate the evolution of sequences according to , our best estimate for Q under the null-hypothesis. For the resulting simulated data set, we compute
according to equation (12). By repeating this simulation, say 1,000 times, we get an approximation to the null-distribution of
. The P value of the actually observed
data is then estimated by the proportion of simulated
values equal to or larger than
data.
![]() |
Simulation Studies |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
The substitution models are defined as pairs (B, ) of symmetric matrices B and base frequency vectors
, such that entries of Q are given by
|
To assess the size and power of the test, we simulated 1,000 data sets under each of the 12 possible parameter combinations. Each data set was then analyzed by our procedure (eqs. 10, 11, 12) including 1,000 simulations each to approximate the distribution of the statistic . The relative frequency of the number of data sets where the null-hypothesis was wrongly rejected on the 5% significance level was used to determine the actual size of the test. The power of the test under alternative hypotheses is given by the relative number of data sets which led correctly to rejection.
Table 1 shows the results of the simulation studies. Rows corresponding to tree A (see fig. 1) show the probability to (wrongly) reject the null-hypothesis on the 5% significance level. The results of the simulations are in good agreement with the nominal significance level; however, it appears that the test is slightly conservative. Rows corresponding to trees B, C, and D (see fig. 1) show the probability to (correctly) reject the null-hypothesis on the 5% significance level. It is clear from the simulations that the power of the test is increasing with the branch lengths of the treei.e., higher amount of expected substitutionsas well as with the minor proportion of branches that evolve differently from the rest of the tree. The "balanced" tree D shows generally the highest power. The power of the test is generally low with short branches (t = 10). Only when the two models in the tree differ substantially is the test able to detect that difference in a significant number of simulations. Note that the entries in the column corresponding to t = 10 in table 1 are based on only about 300 simulations, as very sparse divergence matrices cause numerical problems of the algorithm. It also becomes clear that differences between the two models Q and Q* in terms of matrices B are detected more easily (upper four ABCD blocks of table 1) than the minor differences in the base frequency distributions (last two blocks of table 1).
|
![]() |
Applications |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
As a first step we performed the test assuming no rate heterogeneity among sites. Figure 2A shows the Neighbor-Joining primate Xq13.3 tree; the estimated transition rate matrix and the estimated nucleotide frequencies
are displayed in figure 2B. Based on these estimates the null-distribution of
was approximated by the simulations in figure 2C. The observed value
Xq13 = 1.071 falls right into the distribution of
corresponding to a P value of 0.203. Thus, the test does not reject the homogeneous substitution patterns assumption. The power of the test is rather low, however, when sequences of this level of diversity are analyzed (see scale of phylogeny in figure 2A).
|
|
The effect of possible rate heterogeneity on the analysis of Xq13 data is more interesting. For increasing level of rate heterogeneity, i.e. a decreasing value of ,
Xq13 approaches the 95% quantile curve. For
< 0.03 the value of
Xq13 is even larger than the corresponding 95% quantile. Thus, under the assumption of extreme rate heterogeneity, which may be indicated by extra evidence, the test would reject the homogeneity assumption for the Xq13 data. These results suggest that not obeying possible levels of rate heterogeneity results in a more conservative test. However, we note that with
close to zero numerical problems during estimation and simulation of the null-distribution become apparent.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
However, if rate heterogeneity among sites should be incorporated, a model and its parameters have to be specified in advance of the application of the test.
Simulation studies and the analyses of two real data sets demonstrated that the test is admissible and has a reasonable power of rejection. We also showed that the problem of unknown rate heterogeneity can be circumvented by analyzing the data under various levels of rate heterogeneity. Application to the two real data sets suggests that assuming wrongly homogeneous substitution rates leads to the test being conservative.
What are the effects of a finding of heterogeneous substitution patterns within a phylogenetic tree? On the one hand the heterogeneity is interesting in itself, because there is need for a biological explanation. We could imagine that substitution patterns do not change purely by chance but could result from a change in the sequence environmente.g., less selective constrains due to a gene duplication, different replication machinery for a mitochondrial insert in the nucleus, or need for adaptation because of changes in selective constraints either on the locus studied or on genes that affect the locus indirectly.
On the more technical side the test results can improve the way phylogenies are built. Varying substitution patterns affect the estimation of branch lengths and possibly also the branching pattern of the reconstructed tree. For example, it is well known that compositional changes in nucleotide frequencies can produce misleading trees (Lockhart et al. 1994; Galtier and Gouy 1995). With this in mind, results of a phylogenetic reconstruction should be interpreted with care, if our test suggests a rejection of the homogeneity assumption.
Because a fully parameterized probability model on a phylogeny that allows for branch-specific substitution models is not feasible, the test result can help in improving the process of model building for a specific data set by using appropriate methods that take into account different rate matrices over the tree (e.g., Steel 1993; Lake 1994).
The proposed test clearly has its limits in that divergence matrices that are sparse easily lead to complex eigenvalues and problems in assessing the null-distribution of through simulations.
Although it was not demonstrated here, an obvious extension of the test is to answer which type of base exchange contributes significantly to a significant result of the overall test. The more challenging problem is to provide a definite answer to which branches contribute most to a significant result. For this purpose estimates of branch-specific substitution models seem to be necessary. We will address this question in the near future (Weiss and von Haeseler, in preparation).
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Wolfgang Stephan, Associate Editor
![]() |
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Adachi, J., and M. Hasegawa. 1996. MOLPHY: Programs for Molecular Phylogenetics, ver. 2.3. Tokyo: Institute of Statistical Mathematics.
Baake, E., and A. von Haeseler. 1999. Distance measures in terms of substitution processes. Theor. Popul. Biol. 55:166-175.[CrossRef][ISI][Medline]
Bickel, P. J., and K. A. Doksum. 1977. Mathematical statistics. Holden-Day, Oakland, Calif.
Devauchelle, C., A. Grossmann, A. Hénaut, M. Holschneider, M. Monnerot, J. Risler, and B. Torrésan. 2001. Rate matrices for analyzing large families of protein sequences. J. Comp. Biol. 8:381-399.[CrossRef][ISI]
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[ISI][Medline]
Felsenstein, J. 1983. Statistical inference of phylogenies. J. R. Stat. Soc. Ser. A, 146:246-272.[ISI]
Felsenstein, J. 1995. PHYLIP (phylogeny inference package). Version 3.572. Department of Genetics, University of Washington, Seattle.
Galtier, N., and M. Gouy. 1995. Inferring phylogenies from sequences of unequal base composition. Proc. Natl. Acad. Sci. USA 92:11317-11321.[Abstract]
Goldman, N. 1993a. Statistical tests of models of DNA substitutions. J. Mol. Evol. 36:182-198.[ISI][Medline]
Goldman, N. 1993b. Simple diagnostic statistical tests of models of DNA substitutions. J. Mol. Evol. 37:650-661.[ISI][Medline]
Gotoh, O. 1999. Multiple sequence alignment: algorithms and applications. Adv. Biophys. 36:159-206.[CrossRef][ISI][Medline]
Gu, X., and W. H. Li. 1996. A general additive distance with time-reversibility and rate variation among nucleotide sites. Proc. Natl. Acad. Sci. USA 93:4671-4676.
Horai, S., K. Hayasaka, R. Kondo, K. Tsugane, and N. Takahata. 1995. Recent African origin of modern humans revealed by complete sequences of hominoid mitochondrial DNAs. Proc. Natl. Acad. Sci. USA 92:532-536.[Abstract]
Kaessmann, H., F. Heissig, A. von Haeseler, and S. Pääbo. 1999. DNA sequence varaition in a non-coding region of low recombination on the human X chromosome. Nat. Genet. 22:78-81.[CrossRef][ISI][Medline]
Kaessmann, H., V. Wiebe, G. Weiss, and S. Pääbo. 2001. Great ape DNA sequences reveal a reduced diversity and an expansion in humans. Nat. Genet. 27:155-156.[CrossRef][ISI][Medline]
Karlin, S., and H. Taylor. 1975. A first course in stochastic processes. Academic Press, New York.
Kelly, C., and J. Rice. 1996. Modeling nucleotide evolution: a heterogeneous rate analysis. Math. Biosci. 133:85-109.[CrossRef][ISI][Medline]
Kumar, S., K. Tamura, I. Jakobsen, and M. Nei. 2000. MEGA: Molecular Evolutionary Genetics Analysis, ver. 2. Pennsylvania State University, University Park, and Arizona State University, Tempe.
Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455-1459.[Abstract]
Lanave, C., G. Preparata, C. Saccone, and G. Serio. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20:86-93.[ISI][Medline]
Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605-612.
Olsen, G. J. 1987. Earliest phylogenetic branchings: comparing rRNA-based evolutionary trees inferred with various techniques. Cold Spring Harbor Symp. Qunatita. Biol. 52:825-837.[ISI][Medline]
Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235-238.[Abstract]
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.[Abstract]
Steel, M. A. 1993. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett. 7:19-23.[CrossRef][ISI]
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964-969.
Swofford, D. L. 1993. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogeny reconstruction. Pp. 407514 in D. M. Hillis, C. Moritz, and B. A. K. Mable, eds. Molecular systematics. 2nd edition. Sinauer Associates, Sunderland, Mass.
Tavaré, S. 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 17:57-86.
Uzzel, T., and K. W. Corbin. 1971. Fitting discrete probability distributions to evolutionary events. Science 172:1089-1096.[ISI][Medline]
Waddell, P. J., and M. A. Steel. 1997. General time-reversible distances with unequal rates across sites: mixing and inverse gaussian distributions with invariant sites. Mol. Phyl. Evol. 8:398-414.[CrossRef][ISI][Medline]
Wakeley, J. 1993. Substitution rate variation among sites in hypervariable region 1 of human mitochondrial DNA. J. Mol. Evol. 37:613-623.[ISI][Medline]
Yang, Z. 1993. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396-1401.[Abstract]
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314.[ISI][Medline]
Yang, Z. 1999. PAML, Phylogenetic Analysis by Maximum Likelihood, ver 2.0. University College London, London.
Yang, Z., and D. Roberts. 1995. On the use of nucleic acid sequences to infer early branchings in the tree of life. Mol. Biol. Evol. 12:451-458.[Abstract]
Zharkikh, A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39:315-329.[ISI][Medline]