School of Biological Sciences, University of Manchester
![]() |
Abstract |
---|
Key Words: molecular phylogeny maximum likelihood evolutionary distances distance matrix
![]() |
Introduction |
---|
The problems with distance estimation are already apparent in the simplest Jukes-Cantor (JC) (Jukes and Cantor 1969) model of sequence evolution. Evolutionary distance is usually measured in terms of the average number of substitutions per site, which we denote as d. For two aligned sequences of length N, the estimated evolutionary distance is given within the JC model by
|
It also is apparent from equation (1) that the estimated distance is infinite or undefined, if D 3/4. When the true d is large, D approaches 3/4 for an infinite sequence; therefore, there is a significant chance that the observed D in a finite sequence will be larger than 3/4 and that the distance estimate will be undefined. For small evolutionary distances, this will rarely occur. However, for any finite length N and for any true distance d, there is a nonzero probability that
will be undefined.
When using the JC model, the problems described above will only become important when d >> 1, in which case there is saturation of mutations. In this case there is very little phylogenetic information left in the sequences, and we probably should not be using these sequences for tree construction in the first place. However, we will show below that these problems tend to be more serious with more complex models. The effect of increasing the model complexity is illustrated by the next most simple model of evolution: the Kimura 2- parameter model (K2P) (Kimura 1980). In this model the base frequencies are equal, but a distinction is made between transitions (occurring at rate 1/4) and transversions (occurring at rate 1/4ß). The ratio
/ß is usually denoted by
. The estimated evolutionary distance between two sequences is the sum of the estimated numbers of transversions and transitions per site and also can be written as a log transform formula, this time with two log terms,
|
In addition to the JC and K2P models, many other rate matrix models have been proposed. The model of Hasegawa, Kishino, and Yano (HKY) (Hasegawa, Kishino, and Yano 1985) and the model of Tamura and Nei (TN) (Tamura and Nei 1993) have rate matrices for which the eigenvectors and eigenvalues are analytically soluble. Hence, an explicit log transform formula can be written. A distance formula for general reversible models has been derived in various forms by Lanave et al. (1984), Barry and Hartigan (1987), Rodriguez et al. (1990), and Li and Gu (1996). Other distance measures such as the paralinear distance (Lake 1994) and the LogDet distance (Lockhart et al. 1994; Steel 1994) have been introduced to address nonstationarity, i.e., heterogeneity in the base composition of the sequences studied.
The observation that increasing model complexity can increase both the variance of distance estimates and the frequency with which undefined distance estimates are encountered has been made before (Zharkikh 1994; Li 1997, p. 84 for a discussion of this point). However, we wish to consider what generic features of a model affect the accuracy of distance estimates and the frequency of undefined distances.
We will begin by showing a few problems with distance estimates in real data. We will then give a quantitative theory to expand on the above arguments concerning log-transform formulae. We also consider likelihood-based methods of estimating distances where we fix the estimates of the rate parameters before estimation of the evolutionary distance. We show that these methods have errors associated with the slowest timescale in the model, and hence that these methods are more accurate.
![]() |
Examples of Distance Estimates in Real Sequence Data |
---|
|
We also can use likelihood ratio tests (Goldman 1993) for model selection purposes, and this gives results that are contradictory to those above. Because both the K2P and TN model distance estimates can be derived by maximizing the likelihood in terms of the observed numbers of tranversions and transitions and the K2P can be considered as a special case of the TN model, we can apply a likelihood ratio test to each sequence pair from the this data set. It is straightforward to calculate the log-likelihood values for the two models for any pair of sequences and, from this, to determine the difference in the log-likelihoods. Because the two models are nested, we know that 2
is distributed according to a
2 distribution under the hypothesis that the simpler K2P model is correct (Goldman 1993). On performing such a calculation for each possible pair of sequences, we find that the TN model provides a significantly better fit to the data (P < 0.05) for 99.8% of the sequence pairs for which the distance estimate does not diverge. From this, we would conclude that the TN model provides an overwhelmingly better explanation of the observed sequence data than does the K2P model. (We note that the likelihood ratio test is usually applied to the likelihood of whole trees, whereas here we are applying it simply to the likelihood of each sequence pair.)
Our second example consists of 455 small subunit (SSU) rRNA sequences taken from the database of Van De Peer et al. (2000), chosen so that there is one example of a sequence from each genus of Eubacteria. This data set has been analyzed previously by Higgs (2000) and Savill, Hoyle, and Higgs (2001) because it is an example of sequence evolution under the constraint of a conserved RNA secondary structure. These articles (and references therein) discuss a range of models for evolution of the paired stem regions of RNA sequences. Stem regions evolve through compensatory substitutions that maintain the pairing pattern; hence, the substitutions in the sites on either side of a pair are strongly correlated, and it is necessary to consider the evolution of the pair as a single unit rather than as two separate sites. One appropriate model is that of Tillier and Collins (1995), which considers six allowed paired states, AU, GU, GC, UA, UG, and CG, (occurring with frequencies AU, ...,
CG) and has rate parameters
, ß, and
quantifying the rates of double transitions, double transversions, and single transitions, respectively. The structure of the model is simple enough for an explicit log transform formula to be calculated (Tillier and Collins 1995), namely
=
V +
+
.
V,
, and
are estimators of KV, K
, and K
, the expected numbers of transversions, single transitions, and double transitions, respectively. These are given as KV = ßt, K
= 8
2 (
1 +
3)
t, and K
= 8
1
3
t, where
1 = 1/ 2(
AU +
UA),
2 = 1/2 (
GU +
UG), and
3 = 1/2 (
GC +
CG). The estimators
V,
, and
are
|
|
![]() |
Distance Estimates with Fixed Rate Parameters |
---|
The most natural way of estimating the parameters would be simply to do an ML calculation of the optimum tree and parameter values for the whole set of sequences. However, this is not practical for large data sets of sequences without spending huge amounts of computer time. Indeed, if we were able to carry out ML for the complete set of sequences, we would not be particularly interested in using distance matrix methods anyway.
One way around the problem is to take a subset of sequences and do the ML calculation and then to use the ML parameters estimated from this subset when estimating pairwise distances for the whole set. Figure 1c shows the distances obtained in this way for the LSU rRNA sequences with the TN model. The estimates of the rate parameters have been obtained using TREE- PUZZLE (Strimmer and von Haeseler 1996; Strimmer 1997; Strimmer, Goldman, and von Haeseler 1997) on 10 sequences selected at random from the data set. The pairwise distances are then obtained for each sequence pair by maximizing the likelihood while keeping the rate parameters fixed at the values estimated by TREE-PUZZLE. There are now no divergent distance estimates, and the scatter of points at high D is substantially reduced in comparison with figure 1b. It is now practical to use the TN model on this data, which we know was the preferred model according to the likelihood ratio tests.
The model of Tillier and Collins (1995) for paired sites in RNAs is not implemented in TREE-PUZZLE or other currently available programs. We are currently in the process of developing a package that will use ML and Monte Carlo Markov Chain methods to construct phylogenies of RNAs using a variety of paired-site models. This will be available shortly (Jow et al. 2002). However, for the present article, we used a much simpler method to estimate the rate parameters for the model of Tillier and Collins (1995) in the example with the SSU rRNA sequences.
For each sequence pair of the SSU rRNA data set, the parameter ratios 1 and
2 can be estimated as outlined in the previous section. We obtain final estimates of
1 and
2 by averaging finite estimates over all the 455 x 454/2 sequence pairs of the SSU rRNA data set.
1 and
2 are then held fixed at these values. Fixing
1 and
2 (along with the usual timescale normalization condition that the expected number of substitutions per site per unit time is 1) completely specifies all three rate parameters,
, ß, and
. Specifically, ß = (1 + 8
1
3
1 + 8
2[
1 +
3]
2)-1,
=
1ß, and
=
2ß. Figure 2b shows ML distances calculated from the paired states in the SSU rRNA sequences and using the six-state model of Tillier and Collins (1995) with fixed rate parameters estimated in this fashion. There are now no divergent distance estimates and the scatter of points is reduced, indicating that distance estimates are more reliable than those with the log transform formula.
The key point of this section is that if good estimates of the rate parameters are known by ML estimation on a tree or by averaging the estimates of parameter ratios from each pair as described above, then accurate estimates of the pairwise distances can be obtained by maximizing the likelihoods for each pair with the fixed values of the rate parameters. These estimates suffer much less from problems of divergence than do estimates using the log transform formulae. In the next two sections, we consider in detail why this is so. Those readers not interested in the mathematical detail can proceed to Discussion and Conclusions, where we discuss the implications of these findings.
![]() |
Error Analysis of Distance Estimates for a General Rate Model |
---|
|
We have labeled the eigenvalues of R as i, where i = 0, ..., Nstate - 1. There is one eigenvalue
0 which is zero, whereas the other eigenvalues are negative. We have assumed the eigenvalues are ordered so that |
1|
|
2|
···
|
|. We also label
as
max because it is the eigenvalue of the largest absolute value. We also can define the matrices U and V whose columns are formed from the eigenvectors of R and RT, respectively. The transition probability matrix P can then be written in terms of this decomposition as
|
|
|
|
In Appendix A, we carry out an analysis of the errors in a log transform formula of the general form given in equation (7). The probability that the ith individual logarithmic term in equation (7) diverges is
|
|
The total probability that the estimate (eq. 7) diverges is pdiv = 1 - P(Q1 > 0, Q2 > 0, ..., > 0). Obtaining a generic analytic form for pdiv is not possible. However, we approximate P(Q1 > 0, Q2 > 0, ...,
> 0)
P(Q1 > 0)P(Q2 > 0)...P(
> 0), i.e., we approximate the Qi as being independent. This gives
|
|
The approximation that Q1 and Q2 are independent is, as we shall see from simulations, a reasonably accurate one for the K2P model. If the transition to transversion ratio is large, then there is a clear separation of timescales between the two processes. Thus, pdiv consists of two step-like increases at evolutionary distances d1 and d2 (fig. 3). Thus, the range of applicability is governed by the shorter of the two timescales, d1. To test the accuracy of the above approximation, simulations were performed by generating pairs of random sequences with known evolutionary distance d and by calculating the fraction of pairs for which
diverges. Figure 3 shows simulation results for the K2P model with sequence length N = 500 and
= 10. The theoretical estimate is in excellent agreement with the simulation estimate of pdiv. The separation of the two distance scales is very clear because
is large. We also performed simulations with
= 2 (not shown). In this case, the two steps merge together to a single large step. The theory and simulation are also in good agreement in this case.
|
![]() |
ML Distance Estimates with Fixed Rate Parameters |
---|
Let us consider methods of distance estimation based on ML. We suppose that estimates of all the parameters in the rate matrix are known and that we maximize the likelihood with respect to only the distance t. If the specified values of the rate parameters are equal to the true values, we find (see Appendix B) that the probability of obtaining a divergent distance estimate, for large N and t, is now controlled by the slowest nonzero eigenvalue 1. The limiting value, as t
, for the probability of divergence is 1/2 (to leading order in N-1), irrespective of the rate matrix used. For the K2P model, we find
|
In Appendix B, we derive a perturbative result for the RMS error of the ML estimate (with fixed rate parameters) of d. Again the theoretical result (eq. 26) is in excellent agreement with simulation (not shown). Consequently, we also can conclude that the error in the ML estimate of d is indeed controlled by the slowest nonzero eigenvalue 1.
In general, we will not know a priori the true parameters in the rate matrix. If the sequence data have been generated by the same class of model that is used to estimate the pairwise evolutionary distances, then for the methods we have used to estimate the rate parameters, e.g., for ML on a fixed tree (as in TREE-PUZZLE), we expect the estimates of rate parameters to become more accurate as sequence length N is increased. In this case, our analysis in Appendix B is unchanged to leading order in N. Consequently, for increasing sequence length, distance estimates are still controlled by the slowest eigenvalue 1. Obviously, for any fixed sequence length N, the exact value of
(
t)2
will depend on precisely how close the estimates of the rate matrix parameters are to the true values.
![]() |
Discussion and Conclusions |
---|
Both the probability of divergence (eq. 11) and the estimate of the variance of (eq. 22 in Appendix A) provide a measure of how much sequence information is required to resolve a given evolutionary distance between two sequences using log transform formulae. To maintain a fixed probability of divergence pdiv or fixed variance of the evolutionary distance estimate, we have the scaling (for large evolutionary distances)
|
Models with higher Ne tend to be applicable over a narrower range of t and give more divergences because the more eigenvalues there are, the larger the largest one tends to be. This was already seen in the example with the LSU rRNA sequences in figure 1a and b. We also can see from equation (14) that the evolutionary distance between two sequences that can be resolved accurately depends logarithmically on N. To increase the evolutionary distance probed requires an exponential increase in sequence length.
What are the sequence lengths required to obtain a given level of accuracy if evolutionary distances are estimated using ML with fixed rate parameters? Again, one finds that to maintain a constant level of accuracy or constant probability of obtaining a divergent evolutionary distance estimate, the required scaling is
|
![]() |
Appendix A |
---|
|
|
The distribution, P(S1, ..., SM) is multinomial and so, as N
, is well approximated by a multivariate Gaussian,
|
|
|
|
Assuming that the dominant contribution to equation (16) comes from estimating max, then the variance in such a distance estimate is equally easily derived by generalizing the analysis of Kimura and Ohta (1972)
|
![]() |
Appendix B |
---|
|
We wish to derive the probability of obtaining an infinite value for the estimate of t which maximizes the likelihood (eq. 23) when keeping the rate matrix R fixed, i.e., the probability that the solution, tML, to log L/
t = 0 is infinite. For the two sequence problems, we assume that there exists at most one maxima in t of the likelihood L at a finite value of tML. Thus, a divergent solution tML is equivalent to the asymptotic gradient of log L approaching zero from above, i.e.,
log L/
t
0+, t
. Thus, we will focus on the distribution of limt
log L/
t.
We take two sequences separated by branch length t0 and generated with rate matrix R. We consider our fixed estimate of R in equation (23) to be exact. As N
, we can take the distribution of x(t) =
logL/
t to be Gaussian, i.e., a probability density (2
B1)-1/2 exp(- [x - B0]2/2B1), where B0 =
logL/
t
and B1 = Var(
logL/
t). Consequently, the probability of obtaining a divergent estimate for tML is
|
|
A perturbative analysis of the accuracy of the ML estimate tML also can be obtained in a straightforward manner. The derivation is tedious, and so we quote only the final result here. Writing tML = t0 + tML (t0 again being the true separation between the two sequences), we find that the mean squared error is given by
|
![]() |
Acknowledgements |
---|
![]() |
Footnotes |
---|
![]() |
Literature Cited |
---|
Atteson, K. 1997. The performance of the NJ method of phylogeny reconstruction. Pp. 133148 in B. Mirkin, F. R. McMorris, F. S. Roberts, and A. Rzhetsky, eds. Mathematical hierarchies and biology, DIMACS series in discrete mathematics and theoretical computer science, Vol. 37. American Mathematical Society, Providence, R.I.
Barry, D., and J. A. Hartigan. 1987. Asynchronous distance between homologous DNA sequences. Biometrics 43:261- 276.[ISI][Medline]
Bruno, W. J., N. D. Socci, and A. L. Halpern. 2000. Weighted neighbor joining: a likelihood-based approach to distance- based phylogeny reconstruction. Mol. Biol. Evol 17:189- 197.
Casella, G., and R. L. Berger. 2002. Statistical inference. 2nd edition. Duxbury, Pacific Grove, Calif.
De Rijk, P., J. Wuyts, Y. Van De Peer, T. Winklemans, and R. De Wachter. 2000. The European large subunit ribosomal RNA database. Nucleic Acids Res 28:177-178 [sequence data available via http://rrna.uia.ac.be/lsu].
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol 17:368-376.[ISI][Medline]
Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol 36:182-198.[ISI][Medline]
Hasegawa, M., H. Kishino, and N. Saitou. 1991. On the maximum likelihood method in molecular phylogenetics. J. Mol. Evol 32:443-445.[ISI][Medline]
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol 22:160-174.[ISI][Medline]
Higgs, P. G. 2000. RNA secondary structurephysical and computational aspects. Q. Rev. Biophys 33:199-253.[CrossRef][ISI][Medline]
Jow, H., C. Hudelot, M. Rattray, and P. G. Higgs. 2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol 19:1591- 1601.
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21123 in H. N. Munro, ed. Mammalian protein metabolism III. Academic Press, New York.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol 16:111-120.[ISI][Medline]
Kimura, M., and T. Ohta. 1972. On the stochastic model for estimation of mutational distance between homologous proteins. J. Mol. Evol 2:87-90.[ISI][Medline]
Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455-1459.[Abstract]
Lanave, C., G. Preparata, C. Saccone, and G. Serio. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol 20:86-93.[ISI][Medline]
Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Li, W.-H., and M. Gouy. 1991. Statistical methods for testing molecular phylogenies. Pp. 249277 in M. M. Miyamoto and J. Cracraft, eds. Phylogenetic analysis of DNA sequences. Oxford University Press, New York.
Li, W.-H., and X. Gu. 1996. Estimating evolutionary distances between DNA sequences. Pp. 449459 in F. Russell, ed. Methods in enzymology. Academic Press, San Diego.
Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol 11:605-612.
Nei, M., J. C. Stephens, and N. Saitou. 1985. Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from humans and apes. Mol. Biol. Evol 2:66-85.[Abstract]
Rodriguez, F., J. L. Oliver, A. Marin, and J. R. Medina. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol 142:485-501.[ISI][Medline]
Saitou, N., and T. Imanishi. 1989. Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum evolution and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol. Biol. Evol 6:514-525.
Saitou, N., and M. Nei. 1987. The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol 4:406-426.[Abstract]
Savill, N. J., D. C. Hoyle, and P. G. Higgs. 2001. RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum likelihood methods. Genetics 157:399-411.
Sourdis, J., and M. Nei. 1988. Relative efficiencies of the maximum parsimony and distance-matrix methods in obtaining the correct phylogenetic tree. Mol. Biol. Evol 5:298-311.[Abstract]
Steel, M. A. 1994. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett 7:19-24.
Strimmer, K. S. 1997. Maximum likelihood methods in molecular phylogenetics. Ph.D. thesis, University of Munich, Munich [available via http://users.ox.ac.uk/strimmer/].
Strimmer, K., N. Goldman, and A. von Haeseler. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol 14:210-211.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol 13:964-969.
Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512-526.[Abstract]
Tillier, E. R. M., and R. A. Collins. 1995. Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol. Biol. Evol 12:7-15.
Van De Peer, Y., P. De Rijk, J. Wuyts, T. Winklemans, and R. De Wachter. 2000. The European small subunit ribosomal RNA database. Nucleic Acids Res 28:175-176 [sequence data available via http://rrna.uia.ac.be/ssu].
Zharkikh, A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol 39:315-329.[ISI][Medline]