Factors Affecting the Errors in the Estimation of Evolutionary Distances Between Sequences

D. C. Hoyle and P. G. Higgs

School of Biological Sciences, University of Manchester


    Abstract
 TOP
 Abstract
 Introduction
 Examples of Distance Estimates...
 Distance Estimates with Fixed...
 Error Analysis of Distance...
 ML Distance Estimates with...
 Discussion and Conclusions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Phylogenetic methods that use matrices of pairwise distances between sequences (e.g., neighbor joining) will only give accurate results when the initial estimates of the pairwise distances are accurate. For many different models of sequence evolution, analytical formulae are known that give estimates of the distance between two sequences as a function of the observed numbers of substitutions of various classes. These are often of a form that we call "log transform formulae". Errors in these distance estimates become larger as the time t since divergence of the two sequences increases. For long times, the log transform formulae can sometimes give divergent distance estimates when applied to finite sequences. We show that these errors become significant when t ~ 1/2 |{lambda}max|-1 logN, where {lambda}max is the eigenvalue of the substitution rate matrix with the largest absolute value and N is the sequence length. Various likelihood-based methods have been proposed to estimate the values of parameters in rate matrices. If rate matrix parameters are known with reasonable accuracy, it is possible to use the maximum likelihood method to estimate evolutionary distances while keeping the rate parameters fixed. We show that errors in distances estimated in this way only become significant when t ~ 1/2 |{lambda}1|-1 logN, where {lambda}1 is the eigenvalue of the substitution rate matrix with the smallest nonzero absolute value. The accuracy of likelihood-based distance estimates is therefore much higher than those based on log transform formulae, particularly in cases where there is a large range of timescales involved in the rate matrix (e.g., when the ratio of transition to transversion rates is large). We discuss several practical ways of estimating the rate matrix parameters before distance calculation and hence of increasing the accuracy of distance estimates.

Key Words: molecular phylogeny • maximum likelihood • evolutionary distances • distance matrix


    Introduction
 TOP
 Abstract
 Introduction
 Examples of Distance Estimates...
 Distance Estimates with Fixed...
 Error Analysis of Distance...
 ML Distance Estimates with...
 Discussion and Conclusions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
There are now a large number of methods available for the construction of phylogenetic trees from molecular sequences. Where relatively small numbers of sequences are involved and large amounts of computer time is available, maximum likelihood (ML) inference of phylogenies can be used (Felsenstein 1981; Li and Gouy 1991). The ML method is based on sound statistical principles and allows tests to be made for comparing alternative tree topologies and alternative models of sequence evolution. However, when large numbers of taxa are involved, the ML method becomes very slow, and practical methods of dealing with large sets of sequences are still required. The fastest phylogeny methods usually use matrices of evolutionary distances. The neighbor joining (NJ) method (Saitou and Nei 1987) is one of the most popular of these, and tests have shown that it is relatively accurate in comparison with other heuristic clustering methods (Sourdis and Nei 1988; Saitou and Imanishi 1989; Hasegawa, Kishino, and Saitou 1991). If the distance matrix corresponds to an exactly additive tree, NJ reproduces the correct tree topology, and it is robust against small errors in the distance estimates between sequences (Atteson 1997). If errors in distance estimates are large, however, then NJ and other distance matrix methods will give incorrect tree topologies. In this article, we investigate the way in which the errors in the pairwise distances between sequences depend on the sequence length, the evolutionary model, and the degree of divergence between sequences.

The problems with distance estimation are already apparent in the simplest Jukes-Cantor (JC) (Jukes and Cantor 1969) model of sequence evolution. Evolutionary distance is usually measured in terms of the average number of substitutions per site, which we denote as d. For two aligned sequences of length N, the estimated evolutionary distance is given within the JC model by


where D is the observed fraction of sites that differ. We will refer to equations such as equation (1) as "log transform" formulae because they transform observed quantities (in this case D) to estimates of distances that are not themselves directly observable quantities. A general method of deriving log transform formulae is given in Error Analysis of Distance Estimates for a General Rate Model, equations (4)–(9). For finite sequences, n has a binomial distribution, and hence D fluctuates about its mean value. Because is a nonlinear function of D, the expectation value of is not exactly equal to d; hence, there is a small systematic error of order 1/N. In practice, this is dominated by the statistical error, which is of the order 1/N1/2 (Kimura and Ohta 1972; Nei, Stephens, and Saitou 1985).

It also is apparent from equation (1) that the estimated distance is infinite or undefined, if D >= 3/4. When the true d is large, D approaches 3/4 for an infinite sequence; therefore, there is a significant chance that the observed D in a finite sequence will be larger than 3/4 and that the distance estimate will be undefined. For small evolutionary distances, this will rarely occur. However, for any finite length N and for any true distance d, there is a nonzero probability that will be undefined.

When using the JC model, the problems described above will only become important when d >> 1, in which case there is saturation of mutations. In this case there is very little phylogenetic information left in the sequences, and we probably should not be using these sequences for tree construction in the first place. However, we will show below that these problems tend to be more serious with more complex models. The effect of increasing the model complexity is illustrated by the next most simple model of evolution: the Kimura 2- parameter model (K2P) (Kimura 1980). In this model the base frequencies are equal, but a distinction is made between transitions (occurring at rate 1/4{alpha}) and transversions (occurring at rate 1/4ß). The ratio {alpha}/ß is usually denoted by {kappa}. The estimated evolutionary distance between two sequences is the sum of the estimated numbers of transversions and transitions per site and also can be written as a log transform formula, this time with two log terms,


where S and V are the observed fractions of sites that differ by transitions and transversions between the two sequences. There are now two timescales, one associated with transitions and one with transversions. Transitions are usually faster ({kappa} > 1). The distance estimate can diverge because of divergences in either of the two log terms. The quantity 1 - 2S - V tends to zero in a time governed by the rapid transition rate, whereas the term 1 - 2V tends to zero on the slow transversion timescale. Divergences, therefore, tend to arise because of saturation of the transitions, even when the overall evolutionary distance is small. The errors in this formula, therefore, become significant at smaller evolutionary distances than for the JC model.

In addition to the JC and K2P models, many other rate matrix models have been proposed. The model of Hasegawa, Kishino, and Yano (HKY) (Hasegawa, Kishino, and Yano 1985) and the model of Tamura and Nei (TN) (Tamura and Nei 1993) have rate matrices for which the eigenvectors and eigenvalues are analytically soluble. Hence, an explicit log transform formula can be written. A distance formula for general reversible models has been derived in various forms by Lanave et al. (1984), Barry and Hartigan (1987), Rodriguez et al. (1990), and Li and Gu (1996). Other distance measures such as the paralinear distance (Lake 1994) and the LogDet distance (Lockhart et al. 1994; Steel 1994) have been introduced to address nonstationarity, i.e., heterogeneity in the base composition of the sequences studied.

The observation that increasing model complexity can increase both the variance of distance estimates and the frequency with which undefined distance estimates are encountered has been made before (Zharkikh 1994; Li 1997, p. 84 for a discussion of this point). However, we wish to consider what generic features of a model affect the accuracy of distance estimates and the frequency of undefined distances.

We will begin by showing a few problems with distance estimates in real data. We will then give a quantitative theory to expand on the above arguments concerning log-transform formulae. We also consider likelihood-based methods of estimating distances where we fix the estimates of the rate parameters before estimation of the evolutionary distance. We show that these methods have errors associated with the slowest timescale in the model, and hence that these methods are more accurate.


    Examples of Distance Estimates in Real Sequence Data
 TOP
 Abstract
 Introduction
 Examples of Distance Estimates...
 Distance Estimates with Fixed...
 Error Analysis of Distance...
 ML Distance Estimates with...
 Discussion and Conclusions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
As an example with real data, we consider a set of large subunit (LSU) rRNA sequences obtained from the database of De Rijk et al. (2000). A set of 90 mitochondrial LSU sequences was used, including representatives from all groups of mitochondria-containing eukaryotes for which sequence data were available. The alignment in the database was taken without alteration. To exclude variable regions and poorly aligned regions from the analysis, we eliminated from the alignment those sites which contained a gap or unknown nucleotide for 10% or more of the sequences. In figure 1a, the evolutionary distance estimates for the K2P model (eq. 2) are plotted against the sequence dissimilarity D for each pair of sequences. This is a well-behaved data set for which none of the distance estimates diverge and for which there is a clear relationship between D and so that all the points appear to lie on a smooth curve.



View larger version (20K):
[in this window]
[in a new window]
 
FIG. 1. Plot of estimated evolutionary distance against sequence dissimilarity D (uncorrected distance) for K2P model log transform estimates (left hand graph, 1a), TN model log transform estimates (middle graph, 1b), and estimates from TN model using ML with fixed rate parameters (right hand graph, 1c), all on LSU rRNA sequences (see main text). Evolutionary distances of d = -0.25 indicate a divergent estimate

 
Figure 1b shows the same data with distance estimates calculated with the TN model, using the log transform formula given in Tamura and Nei (1993). In this case, the distance estimate diverges for 1.43% of sequence pairs. Pairs that diverge are plotted as points at = -0.25 in the lower portion of the graph. Divergence occurs only for widely separated pairs with D between 0.6 and 0.7. Also apparent from figure 1b is that the scatter of distance estimates is greater for the TN model than for the K2P model, and the relationship between D and does not seem to be so well defined. In particular, at large D there are a number of outlier points, with well above the main curve. In any case, the TN distances could not be used without some finite value being assigned to the pairs that diverge.

We also can use likelihood ratio tests (Goldman 1993) for model selection purposes, and this gives results that are contradictory to those above. Because both the K2P and TN model distance estimates can be derived by maximizing the likelihood in terms of the observed numbers of tranversions and transitions and the K2P can be considered as a special case of the TN model, we can apply a likelihood ratio test to each sequence pair from the this data set. It is straightforward to calculate the log-likelihood values for the two models for any pair of sequences and, from this, to determine the difference {Delta} in the log-likelihoods. Because the two models are nested, we know that 2{Delta} is distributed according to a {chi}2 distribution under the hypothesis that the simpler K2P model is correct (Goldman 1993). On performing such a calculation for each possible pair of sequences, we find that the TN model provides a significantly better fit to the data (P < 0.05) for 99.8% of the sequence pairs for which the distance estimate does not diverge. From this, we would conclude that the TN model provides an overwhelmingly better explanation of the observed sequence data than does the K2P model. (We note that the likelihood ratio test is usually applied to the likelihood of whole trees, whereas here we are applying it simply to the likelihood of each sequence pair.)

Our second example consists of 455 small subunit (SSU) rRNA sequences taken from the database of Van De Peer et al. (2000), chosen so that there is one example of a sequence from each genus of Eubacteria. This data set has been analyzed previously by Higgs (2000) and Savill, Hoyle, and Higgs (2001) because it is an example of sequence evolution under the constraint of a conserved RNA secondary structure. These articles (and references therein) discuss a range of models for evolution of the paired stem regions of RNA sequences. Stem regions evolve through compensatory substitutions that maintain the pairing pattern; hence, the substitutions in the sites on either side of a pair are strongly correlated, and it is necessary to consider the evolution of the pair as a single unit rather than as two separate sites. One appropriate model is that of Tillier and Collins (1995), which considers six allowed paired states, AU, GU, GC, UA, UG, and CG, (occurring with frequencies {pi}AU, ..., {pi}CG) and has rate parameters {alpha}, ß, and {gamma} quantifying the rates of double transitions, double transversions, and single transitions, respectively. The structure of the model is simple enough for an explicit log transform formula to be calculated (Tillier and Collins 1995), namely = V + + . V, , and are estimators of KV, K, and K, the expected numbers of transversions, single transitions, and double transitions, respectively. These are given as KV = ßt, K = 8{pi}2 ({pi}1 + {pi}3) {gamma}t, and K = 8{pi}1{pi}3{alpha}t, where {pi}1 = 1/ 2({pi}AU + {pi}UA), {pi}2 = 1/2 ({pi}GU + {pi}UG), and {pi}3 = 1/2 ({pi}GC + {pi}CG). The estimators V, , and are


where V, S1, and S2 are the observed fractions of transversions, single transitions, and double transitions, respectively. Clearly, the ratios of the rate parameters also can be estimated from V, , and , i.e., {kappa}1 {equiv} {alpha} is estimated by 1 = /8{pi}1{pi}3V, and {kappa}2 {equiv} {gamma}/ß is estimated by 2 = /8{pi}2({pi}1 + {pi}3)V. Figure 2a shows the estimated distances using this formula. A divergent distance estimate occurs for 0.91% of the sequence pairs. These divergent pairs (shown in the lower portion of the graph at = -0.25) occur over a range of dissimilarity that begins with a value of D as small as 0.25. There also is considerable scatter in the distance estimates for the points that do not diverge. The problems of distance estimation are therefore quite serious in this example.



View larger version (15K):
[in this window]
[in a new window]
 
FIG. 2. Plot of estimated evolutionary distance d against sequence dissimilarity D for pairs of sequences from the SSU rRNA data set (see main text). Left hand graph (2a) shows results from Tillier and Collins (1995) six-state model with distances estimated using log transform formula. Right hand graph (2b) shows results from Tillier and Collins six-state model with distances estimated from ML with fixed rate parameters. Approximately, only every fifth data point is actually plotted on the graphs. Evolutionary distances of d = -0.25 indicate a divergent estimate

 

    Distance Estimates with Fixed Rate Parameters
 TOP
 Abstract
 Introduction
 Examples of Distance Estimates...
 Distance Estimates with Fixed...
 Error Analysis of Distance...
 ML Distance Estimates with...
 Discussion and Conclusions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
The problem with the pairwise distance estimates is that the values of parameters like the base frequencies and rate matrix parameters {alpha}, ß, etc. are estimated separately for each pair, using only information from that pair, i.e., the parameters have different values for each sequence pair. Usually when building a tree, we make the assumption that the rate matrix parameters are constant during evolution of the whole tree. It, therefore, makes sense to obtain a single estimate of the parameters using information from all the sequences and then to estimate ML distances for each sequence pair while keeping the parameters fixed at their previously estimated values. The accuracy of pairwise distance estimates obtained in this way is analyzed in ML Distance Estimates with Fixed Rate Parameters and Appendix B.

The most natural way of estimating the parameters would be simply to do an ML calculation of the optimum tree and parameter values for the whole set of sequences. However, this is not practical for large data sets of sequences without spending huge amounts of computer time. Indeed, if we were able to carry out ML for the complete set of sequences, we would not be particularly interested in using distance matrix methods anyway.

One way around the problem is to take a subset of sequences and do the ML calculation and then to use the ML parameters estimated from this subset when estimating pairwise distances for the whole set. Figure 1c shows the distances obtained in this way for the LSU rRNA sequences with the TN model. The estimates of the rate parameters have been obtained using TREE- PUZZLE (Strimmer and von Haeseler 1996; Strimmer 1997; Strimmer, Goldman, and von Haeseler 1997) on 10 sequences selected at random from the data set. The pairwise distances are then obtained for each sequence pair by maximizing the likelihood while keeping the rate parameters fixed at the values estimated by TREE-PUZZLE. There are now no divergent distance estimates, and the scatter of points at high D is substantially reduced in comparison with figure 1b. It is now practical to use the TN model on this data, which we know was the preferred model according to the likelihood ratio tests.

The model of Tillier and Collins (1995) for paired sites in RNAs is not implemented in TREE-PUZZLE or other currently available programs. We are currently in the process of developing a package that will use ML and Monte Carlo Markov Chain methods to construct phylogenies of RNAs using a variety of paired-site models. This will be available shortly (Jow et al. 2002). However, for the present article, we used a much simpler method to estimate the rate parameters for the model of Tillier and Collins (1995) in the example with the SSU rRNA sequences.

For each sequence pair of the SSU rRNA data set, the parameter ratios {kappa}1 and {kappa}2 can be estimated as outlined in the previous section. We obtain final estimates of {kappa}1 and {kappa}2 by averaging finite estimates over all the 455 x 454/2 sequence pairs of the SSU rRNA data set. {kappa}1 and {kappa}2 are then held fixed at these values. Fixing {kappa}1 and {kappa}2 (along with the usual timescale normalization condition that the expected number of substitutions per site per unit time is 1) completely specifies all three rate parameters, {alpha}, ß, and {gamma}. Specifically, ß = (1 + 8{pi}1{pi}3{kappa}1 + 8{pi}2[{pi}1 + {pi}3]{kappa}2)-1, {alpha} = {kappa}1ß, and {gamma} = {kappa}2ß. Figure 2b shows ML distances calculated from the paired states in the SSU rRNA sequences and using the six-state model of Tillier and Collins (1995) with fixed rate parameters estimated in this fashion. There are now no divergent distance estimates and the scatter of points is reduced, indicating that distance estimates are more reliable than those with the log transform formula.

The key point of this section is that if good estimates of the rate parameters are known by ML estimation on a tree or by averaging the estimates of parameter ratios from each pair as described above, then accurate estimates of the pairwise distances can be obtained by maximizing the likelihoods for each pair with the fixed values of the rate parameters. These estimates suffer much less from problems of divergence than do estimates using the log transform formulae. In the next two sections, we consider in detail why this is so. Those readers not interested in the mathematical detail can proceed to Discussion and Conclusions, where we discuss the implications of these findings.


    Error Analysis of Distance Estimates for a General Rate Model
 TOP
 Abstract
 Introduction
 Examples of Distance Estimates...
 Distance Estimates with Fixed...
 Error Analysis of Distance...
 ML Distance Estimates with...
 Discussion and Conclusions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Consider a rate matrix R, whose elements Rij (Rij > 0 {forall} i != j) define the rate of substitution from state i to state j. Thus, the probability Pij(t) of finding a particular nucleotide in state j at time t, given that it was initially in state i, obeys the simple evolution equation


R is defined such that the diagonal elements Rii = -{Sigma}j!=i Rij and also so that the substitution process is reversible, i.e., {pi}iRij = {pi}jRji {forall} i, j. At long times, Pij(t) tends to {pi}j, the equilibrium frequency of state j.

We have labeled the eigenvalues of R as {lambda}i, where i = 0, ..., Nstate - 1. There is one eigenvalue {lambda}0 which is zero, whereas the other eigenvalues are negative. We have assumed the eigenvalues are ordered so that |{lambda}1| <= |{lambda}2| <= ··· <= ||. We also label as {lambda}max because it is the eigenvalue of the largest absolute value. We also can define the matrices U and V whose columns are formed from the eigenvectors of R and RT, respectively. The transition probability matrix P can then be written in terms of this decomposition as


The evolutionary distance d between two sequences is defined as the expected number of nucleotide changes per site from which one finds d = -t {Sigma}i {pi}iRii. This can be written as a function of the eigenvalues as


To derive the log transform formula, the Pij(t) in equation (5) is replaced by the observed values in the pair of sequences in question, i.e., equating the expectation value Pij(t) with the observed number of substitutions of that type. Thus, the derivation essentially uses the "Methods of Moments" (see for example, Casella and Berger 2002, p. 312). This gives a set of simultaneous equations for the unknown quantities exp({lambda}kt). The number of unknowns is equal to the number of distinct nonzero eigenvalues of R, and we call this Ne. Solving these equations for {lambda}kt and substituting into equation (6) gives an estimate of the evolutionary distance as a log transform formulae


The quantities Mk depend on the frequencies of the states {pi}i but not on the rate parameters. Therefore, they can easily be estimated from the observed state frequencies. The Qi can be taken to be of the form


Here, the quantities S{alpha}, {alpha} = 1, ..., M, represent observed fractions of various substitution classes, e.g., S and V in the K2P model. The coefficients Bk{alpha} will be explicit from the particular log transform formula used. The log transform formulae given above (eqs. 1 and 2) are both of this form, as are those for the TN model and the RNA model used in Distance Estimates with Fixed Rate Parameters.

In Appendix A, we carry out an analysis of the errors in a log transform formula of the general form given in equation (7). The probability that the ith individual logarithmic term in equation (7) diverges is


where


Here, the notation <> represents the expectation value. From equation (9), we can see that the individual logarithmic term in equation (7) that is estimating the fastest eigenvalue {lambda}max is the most likely to produce a divergence, and so {lambda}max predominantly determines the likelihood of obtaining a divergent evolutionary distance estimate from a log transform formula.

The total probability that the estimate (eq. 7) diverges is pdiv = 1 - P(Q1 > 0, Q2 > 0, ..., > 0). Obtaining a generic analytic form for pdiv is not possible. However, we approximate P(Q1 > 0, Q2 > 0, ..., > 0) ~= P(Q1 > 0)P(Q2 > 0)...P( > 0), i.e., we approximate the Qi as being independent. This gives


where the complementary error function erfc(x) = 1 - erf(x). In the case of the K2P model, one has


where C11 = 4Pv(1 - Pv), C22 = Pv + 4Ps - (Pv + 2Ps)2 with Pv, Ps being the expected number of transversions and transitions, i.e., Pv = <V> and Ps = <S>.

The approximation that Q1 and Q2 are independent is, as we shall see from simulations, a reasonably accurate one for the K2P model. If the transition to transversion ratio {kappa} is large, then there is a clear separation of timescales between the two processes. Thus, pdiv consists of two step-like increases at evolutionary distances d1 and d2 (fig. 3). Thus, the range of applicability is governed by the shorter of the two timescales, d1. To test the accuracy of the above approximation, simulations were performed by generating pairs of random sequences with known evolutionary distance d and by calculating the fraction of pairs for which diverges. Figure 3 shows simulation results for the K2P model with sequence length N = 500 and {kappa} = 10. The theoretical estimate is in excellent agreement with the simulation estimate of pdiv. The separation of the two distance scales is very clear because {kappa} is large. We also performed simulations with {kappa} = 2 (not shown). In this case, the two steps merge together to a single large step. The theory and simulation are also in good agreement in this case.



View larger version (14K):
[in this window]
[in a new window]
 
FIG. 3. Plot of simulated and theoretical divergence probability of log transform and ML (fixed rate matrix parameters) distance estimates for the K2P model with {kappa} = 10 against evolutionary distance. The upper two curves (simulation = diamonds, theory = circles) are for the log transform estimate (eq. 2) and the lower two curves (simulation = triangles, theory = squares) are for ML with fixed rate parameters. Simulation averages evaluated over 2,000 replicates. Sequence length N = 500. One percent divergence probability occurs at approximate evolutionary distances of d = 1.40 and d = 6.65 for the two methods, respectively

 
For all models we have studied in this article, the approximation that the Qi in equation (7) is independent is found to be a good one. Certainly, for any model where the distinct nonzero eigenvalues are well separated, it is valid for pdiv <= 1/2 because within this range pdiv is predominantly controlled by a single eigenvalue {lambda}max. For pdiv > 1/2, we consider the inaccuracy introduced by assuming that Qi in equation (7) is independent to be largely unimportant in comparison with the fact that at this stage greater than 50% of the distance estimates will be undefined. From equation (11), we see that for a general rate model pdiv consists of Ne step-like increases reaching a limiting value (for long sequences and large evolutionary distances) of 1 - .


    ML Distance Estimates with Fixed Rate Parameters
 TOP
 Abstract
 Introduction
 Examples of Distance Estimates...
 Distance Estimates with Fixed...
 Error Analysis of Distance...
 ML Distance Estimates with...
 Discussion and Conclusions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Our calculations on real data sets revealed that no divergent pairwise evolutionary distance estimates were obtained if the rate parameters were held fixed and not simultaneously estimated from the sequence data of the pair. We now consider this point in more detail and determine what controls the probability of a divergent estimate and the accuracy of that estimate when the rate parameters are held fixed.

Let us consider methods of distance estimation based on ML. We suppose that estimates of all the parameters in the rate matrix are known and that we maximize the likelihood with respect to only the distance t. If the specified values of the rate parameters are equal to the true values, we find (see Appendix B) that the probability of obtaining a divergent distance estimate, for large N and t, is now controlled by the slowest nonzero eigenvalue {lambda}1. The limiting value, as t -> {infty}, for the probability of divergence is 1/2 (to leading order in N-1), irrespective of the rate matrix used. For the K2P model, we find


The result from this equation is shown in figure 3 in comparison with estimates of pdiv from simulation. The sequence data has been generated with {kappa}0 = 10.0, and the same value has been specified in the ML calculation. It can be seen that the range of applicability of the ML distance estimate (with fixed rate parameters) is much larger because the divergence probability is a function of the slow timescale (transversions) only. The simulation results are in excellent agreement with the theory indicating that the probability of obtaining a divergent evolutionary distance estimate is predominantly controlled by the slowest nonzero eigenvalue {lambda}1.

In Appendix B, we derive a perturbative result for the RMS error of the ML estimate (with fixed rate parameters) of d. Again the theoretical result (eq. 26) is in excellent agreement with simulation (not shown). Consequently, we also can conclude that the error in the ML estimate of d is indeed controlled by the slowest nonzero eigenvalue {lambda}1.

In general, we will not know a priori the true parameters in the rate matrix. If the sequence data have been generated by the same class of model that is used to estimate the pairwise evolutionary distances, then for the methods we have used to estimate the rate parameters, e.g., for ML on a fixed tree (as in TREE-PUZZLE), we expect the estimates of rate parameters to become more accurate as sequence length N is increased. In this case, our analysis in Appendix B is unchanged to leading order in N. Consequently, for increasing sequence length, distance estimates are still controlled by the slowest eigenvalue {lambda}1. Obviously, for any fixed sequence length N, the exact value of <({delta}t)2> will depend on precisely how close the estimates of the rate matrix parameters are to the true values.


    Discussion and Conclusions
 TOP
 Abstract
 Introduction
 Examples of Distance Estimates...
 Distance Estimates with Fixed...
 Error Analysis of Distance...
 ML Distance Estimates with...
 Discussion and Conclusions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
The focus of this article has been on the accuracy and reliability of distance matrix methods for phylogeny reconstruction. Our analysis has shown that distance estimates from log transform formula are accurate only on the shortest distance scale, |{lambda}max|-1, defined by the underlying rate matrix R and can frequently give rise to undefined distances. Pairwise distance estimates can be improved using ML if the rate parameters are known with reasonable accuracy, for example, by estimation from more than just the two particular sequences under consideration. In such cases, pairwise distance estimates are accurate on the longest distance scale, |{lambda}1|-1, defined by R. With improved pairwise distance estimates, better phylogenies can be constructed using NJ or variants such as Weighbor (Bruno, Socci, and Halpern 2000).

Both the probability of divergence (eq. 11) and the estimate of the variance of (eq. 22 in Appendix A) provide a measure of how much sequence information is required to resolve a given evolutionary distance between two sequences using log transform formulae. To maintain a fixed probability of divergence pdiv or fixed variance of the evolutionary distance estimate, we have the scaling (for large evolutionary distances)


Because tends to a constant, independent of N, as t -> {infty}, we can, for large sequence lengths N, consequently ignore it in the scaling relation above.

Models with higher Ne tend to be applicable over a narrower range of t and give more divergences because the more eigenvalues there are, the larger the largest one tends to be. This was already seen in the example with the LSU rRNA sequences in figure 1a and b. We also can see from equation (14) that the evolutionary distance between two sequences that can be resolved accurately depends logarithmically on N. To increase the evolutionary distance probed requires an exponential increase in sequence length.

What are the sequence lengths required to obtain a given level of accuracy if evolutionary distances are estimated using ML with fixed rate parameters? Again, one finds that to maintain a constant level of accuracy or constant probability of obtaining a divergent evolutionary distance estimate, the required scaling is


Thus, the evolutionary distances that one can probe still depends logarithmically on N. However, one should note that the prefactor to log N is as opposed to when using log transform formulae. Thus, if one has a fixed reasonably accurate estimate of the underlying rate matrix, then the evolutionary distance estimate obtained by maximizing the likelihood with respect to t can be considerably more accurate than that obtained by using the appropriate log transform formulae. This is particularly the case when evolution of sequences is occurring through processes on vastly different timescales. For the K2P model, when {kappa} = 10 (i.e., transitions occurring much more rapidly that transversions) the ratio {lambda}max/{lambda}1 = 5.


    Appendix A
 TOP
 Abstract
 Introduction
 Examples of Distance Estimates...
 Distance Estimates with Fixed...
 Error Analysis of Distance...
 ML Distance Estimates with...
 Discussion and Conclusions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Probability of Divergence in Log Transform Formulae
We take as our starting point a log transform formula of the form


which provides an estimate of the true evolutionary distance d between two sequences. The amplitudes Mi are assumed to depend only on the state frequencies {{pi}i, i = 1, ..., Nstate}. Ne is the number of distinct nonzero eigenvalues of the rate matrix model to which the log transform formula (16) applies. The Qi can be taken to be of the form


Here, the quantities S{alpha}, {alpha} = 1, ..., M, represent observed fractions of various substitution classes. With the quantities S{alpha} estimated from the sequence data, Qi provides estimates of exp({lambda}it) and in particular <Qi> = exp({lambda}it).

The distribution, P(S1, ..., SM) is multinomial and so, as N -> {infty}, is well approximated by a multivariate Gaussian,


For large N, we can consider the quantities S1, ..., SM and Q1, ..., to be continuously distributed between -{infty} and {infty}. Using the distribution (eq. 18), we calculate


where


The probability that the ith individual logarithmic term in equation (16) diverges is


From this we can see that the individual logarithmic term in equation (16) that is estimating the fastest eigenvalue {lambda}max is the most likely to produce a divergence, and so {lambda}max predominantly determines the likelihood of obtaining a divergent evolutionary distance estimate from a log transform formula.

Assuming that the dominant contribution to equation (16) comes from estimating {lambda}max, then the variance in such a distance estimate is equally easily derived by generalizing the analysis of Kimura and Ohta (1972)



    Appendix B
 TOP
 Abstract
 Introduction
 Examples of Distance Estimates...
 Distance Estimates with Fixed...
 Error Analysis of Distance...
 ML Distance Estimates with...
 Discussion and Conclusions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Analysis of ML for a Sequence Pair with Fixed Rate Parameters
The likelihood of observing two sequences 1 and 2 is


where s1,k and s2,k label the states at the kth site for sequences 1 and 2, respectively.

We wish to derive the probability of obtaining an infinite value for the estimate of t which maximizes the likelihood (eq. 23) when keeping the rate matrix R fixed, i.e., the probability that the solution, tML, to {partial} log L/{partial}t = 0 is infinite. For the two sequence problems, we assume that there exists at most one maxima in t of the likelihood L at a finite value of tML. Thus, a divergent solution tML is equivalent to the asymptotic gradient of log L approaching zero from above, i.e., {partial} log L/{partial}t -> 0+, t -> {infty}. Thus, we will focus on the distribution of limt->{infty} {partial} log L/{partial}t.

We take two sequences separated by branch length t0 and generated with rate matrix R. We consider our fixed estimate of R in equation (23) to be exact. As N -> {infty}, we can take the distribution of x(t) = {partial} logL/{partial}t to be Gaussian, i.e., a probability density (2{pi}B1)-1/2 exp(- [x - B0]2/2B1), where B0 = <{partial} logL/{partial}t> and B1 = Var({partial}logL/{partial}t). Consequently, the probability of obtaining a divergent estimate for tML is


where C = limt->{infty} B0/. We find


A perturbative analysis of the accuracy of the ML estimate tML also can be obtained in a straightforward manner. The derivation is tedious, and so we quote only the final result here. Writing tML = t0 + {delta}tML (t0 again being the true separation between the two sequences), we find that the mean squared error is given by


Consequently, in contrast to equation (22), the accuracy of the ML (with fixed rate parameters) estimate tML when the two sequences are well separated is on a distance scale determined by the slowest nonzero eigenvalue {lambda}1.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Examples of Distance Estimates...
 Distance Estimates with Fixed...
 Error Analysis of Distance...
 ML Distance Estimates with...
 Discussion and Conclusions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
This work was supported by the U.K. Biotechnology and Biological Sciences Research Council.


    Footnotes
 
E-mail: david.c.hoyle{at}man.ac.uk. Back


    Literature Cited
 TOP
 Abstract
 Introduction
 Examples of Distance Estimates...
 Distance Estimates with Fixed...
 Error Analysis of Distance...
 ML Distance Estimates with...
 Discussion and Conclusions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 

    Atteson, K. 1997. The performance of the NJ method of phylogeny reconstruction. Pp. 133–148 in B. Mirkin, F. R. McMorris, F. S. Roberts, and A. Rzhetsky, eds. Mathematical hierarchies and biology, DIMACS series in discrete mathematics and theoretical computer science, Vol. 37. American Mathematical Society, Providence, R.I.

    Barry, D., and J. A. Hartigan. 1987. Asynchronous distance between homologous DNA sequences. Biometrics 43:261- 276.[ISI][Medline]

    Bruno, W. J., N. D. Socci, and A. L. Halpern. 2000. Weighted neighbor joining: a likelihood-based approach to distance- based phylogeny reconstruction. Mol. Biol. Evol 17:189- 197.[Abstract/Free Full Text]

    Casella, G., and R. L. Berger. 2002. Statistical inference. 2nd edition. Duxbury, Pacific Grove, Calif.

    De Rijk, P., J. Wuyts, Y. Van De Peer, T. Winklemans, and R. De Wachter. 2000. The European large subunit ribosomal RNA database. Nucleic Acids Res 28:177-178 [sequence data available via http://rrna.uia.ac.be/lsu].[Abstract/Free Full Text]

    Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol 17:368-376.[ISI][Medline]

    Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol 36:182-198.[ISI][Medline]

    Hasegawa, M., H. Kishino, and N. Saitou. 1991. On the maximum likelihood method in molecular phylogenetics. J. Mol. Evol 32:443-445.[ISI][Medline]

    Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol 22:160-174.[ISI][Medline]

    Higgs, P. G. 2000. RNA secondary structure—physical and computational aspects. Q. Rev. Biophys 33:199-253.[CrossRef][ISI][Medline]

    Jow, H., C. Hudelot, M. Rattray, and P. G. Higgs. 2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol 19:1591- 1601.[Abstract/Free Full Text]

    Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–123 in H. N. Munro, ed. Mammalian protein metabolism III. Academic Press, New York.

    Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol 16:111-120.[ISI][Medline]

    Kimura, M., and T. Ohta. 1972. On the stochastic model for estimation of mutational distance between homologous proteins. J. Mol. Evol 2:87-90.[ISI][Medline]

    Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455-1459.[Abstract]

    Lanave, C., G. Preparata, C. Saccone, and G. Serio. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol 20:86-93.[ISI][Medline]

    Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

    Li, W.-H., and M. Gouy. 1991. Statistical methods for testing molecular phylogenies. Pp. 249–277 in M. M. Miyamoto and J. Cracraft, eds. Phylogenetic analysis of DNA sequences. Oxford University Press, New York.

    Li, W.-H., and X. Gu. 1996. Estimating evolutionary distances between DNA sequences. Pp. 449–459 in F. Russell, ed. Methods in enzymology. Academic Press, San Diego.

    Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol 11:605-612.[Free Full Text]

    Nei, M., J. C. Stephens, and N. Saitou. 1985. Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from humans and apes. Mol. Biol. Evol 2:66-85.[Abstract]

    Rodriguez, F., J. L. Oliver, A. Marin, and J. R. Medina. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol 142:485-501.[ISI][Medline]

    Saitou, N., and T. Imanishi. 1989. Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum evolution and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol. Biol. Evol 6:514-525.[Free Full Text]

    Saitou, N., and M. Nei. 1987. The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol 4:406-426.[Abstract]

    Savill, N. J., D. C. Hoyle, and P. G. Higgs. 2001. RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum likelihood methods. Genetics 157:399-411.[Abstract/Free Full Text]

    Sourdis, J., and M. Nei. 1988. Relative efficiencies of the maximum parsimony and distance-matrix methods in obtaining the correct phylogenetic tree. Mol. Biol. Evol 5:298-311.[Abstract]

    Steel, M. A. 1994. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett 7:19-24.

    Strimmer, K. S. 1997. Maximum likelihood methods in molecular phylogenetics. Ph.D. thesis, University of Munich, Munich [available via http://users.ox.ac.uk/~strimmer/].

    Strimmer, K., N. Goldman, and A. von Haeseler. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol 14:210-211.[Free Full Text]

    Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol 13:964-969.[Free Full Text]

    Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512-526.[Abstract]

    Tillier, E. R. M., and R. A. Collins. 1995. Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol. Biol. Evol 12:7-15.[Free Full Text]

    Van De Peer, Y., P. De Rijk, J. Wuyts, T. Winklemans, and R. De Wachter. 2000. The European small subunit ribosomal RNA database. Nucleic Acids Res 28:175-176 [sequence data available via http://rrna.uia.ac.be/ssu].[Abstract/Free Full Text]

    Zharkikh, A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol 39:315-329.[ISI][Medline]

Accepted for publication July 17, 2002.