Département Informatique Fondamentale et Applications, LIRMM, Montpellier, France
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The results presented in the paper of Nei, Kumar, and Takahashi (1998
) can be summarized as follows. When using an exact algorithm, the observed performance of the true tree is always worse than or identical to the performance of the optimal tree. This is not surprising, since the true tree is one tree among all possible trees, and therefore it cannot be better than the best tree. Moreover, with short sequences, reconstruction algorithms often fail to discover the true tree, and consequently the true tree appears to be worse than the inferred tree. With approximate algorithms, the situation is not much different. Near-optimal inferred trees are rarely worse than the true tree, and the topological accuracy of these (fast) algorithms is close to that of the exact (time-consuming) algorithms. In light of these results, Nei, Kumar, and Takahashi (1998
) suggest that more attention should be given to testing the statistical reliability of inferred trees, e.g., using the bootstrap method (Felsenstein 1985
), than to finding optimal trees with excessive computational effort.
Similar results were previously described in Gascuel (1997a, 1997b
) concerning the ME criterion (see also Kumar 1996
, figs. 6 and 7). We first summarize these results and then show that, in the case of the ME criterion, considerations on the optimization principle cannot fully explain the observations. We conclude that the ME criterion is not perfectly suited for evolutionary distance data obtained from sequences, and also that the global optimization principle is not the sole way to conceive, describe, and analyze phylogenetic reconstruction algorithms. We then present another approach, implemented in the BIONJ algorithm (Gascuel 1997a
), which follows the agglomerative scheme, as NJ, and uses a model of distance data estimated from BIOlogical sequences.
![]() |
Simulation Schemes and NJ Algorithm Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
In Gascuel (1997b)
, we simulated another type of data. Let D = (dij) be a tree distance (i.e., a distance represented by a tree with positive branch lengths), where dij is the distance between taxa i and j. A noise
ij was added to every dij to obtain the noisy distance (
ij) = (dij +
ij). The noises
ij were independently and identically distributed (i.i.d.) and normal with variance
2 and a null expectation. Such i.i.d. normal data are encountered when
ij estimates are the result of real observations with measurement errors, which is close, for example, to DNA-DNA hybridization data (Felsenstein 1987)
. We considered high and low noise levels (
= 0.6 or
= 0.1) and simulated three 12-taxon and three 24-taxon model trees. Five hundred replications were performed for each condition, which involves 3,000 data sets per noise level. Then, we applied reconstruction algorithms to the noisy matrices in order to recover the true tree corresponding to the original tree distance. The mean results are displayed in table 1
.
The situation was basically the same as that for sequence data. With high noise (corresponding to short sequences), we observed that NJ trees were shorter than the true tree in 92% of the cases, and longer in 4%. With low noise, we found values of 40% and 8%, respectively. This indicates that there is little room for improvement by optimizing the ME criterion: 4% improvement with high noise and 8% improvement with low noise (in terms of the probability of finding the true tree).
![]() |
Performance of Near-Optimal ME Trees |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The i.i.d. normal data were dealt with by the UNJ algorithm (Gascuel 1997b
). This algorithm is close to NJ but is unweighted, just as UPGMA (Sokal and Michener 1958
). The unweighted approach involves giving the same importance to each original taxon, which can be seen as the fundamental principle of the ordinary least-squares (OLS) criterion, and UNJ uses formulae that are exact in the OLS sense for estimating the branch lengths and reducing the distance matrix at each step, while formulae used by NJ are approximate (Saitou and Nei 1987
). Moreover, the OLS criterion is critical in the ME criterion definition, since the ME value of a tree is its OLS length estimate (Rzhetsky and Nei 1993
). As expected, we observed that UNJ optimized the ME criterion better than NJ did: with high noise, UNJ trees were shorter than NJ trees in 47% of the cases and longer in 11%, while with low noise, we found values of 17% and 3%, respectively (table 1
). Moreover, UNJ had better topological accuracy than NJ. For every inferred tree, we measured the number of incorrect branches, which is equal to half of the usual Robinson and Foulds 1981
topological distance between this tree and the true tree. We observed that UNJ trees were closer to the true tree than were NJ trees by 3% with high noise and by 8% with low noise (table 1
).
When applying UNJ to sequence data, we observed, as expected, that it optimized the ME criterion better than NJ did. However, it had lower topological accuracy (data not shown). The results obtained with tree swapping were even more surprising. To every NJ tree, we applied our fast tree-swapping (FTS) algorithm (Gascuel 1996
), which is an optimized implementation of Rzhetsky and Nei's (1993
) close-neighbor interchange (CNI) algorithm, but restricted to nearest-neighbor interchanges. Given a tree, FTS searches for the pair of nonintersecting clusters separated by one interior branch, which provides the greatest reduction in the ME criterion when exchanged. This process is repeated until no cluster exchange diminishes the ME criterion. Starting with NJ trees, FTS necessarily finds trees that are identical or better in the ME sense. Indeed, we found that these "NJ+FTS" trees were longer than the true tree in only 1% of the cases, with 300 sites as well as with 600 sites, as compared with 11% and 8%, respectively, for NJ alone. However, the topological accuracy was reduced by 4% with 300 sites and by 9% with 600 sites (results not previously published; see table 1
). This means that FTS, which is fully justified from the ME principle point of view, tends to degrade the good properties of NJ trees.
In one case (i.i.d. normal data), further optimization of the ME criterion tends to improve the topological accuracy of NJ, while in the other (sequence data), this yields a degradation. The explanation is related not to the optimization principle, but to the ME criterion and its use for sequence data. In fact, the UNJ algorithm, the OLS, and the ME criterion are perfectly suited for i.i.d. normal data (Gascuel 1997b
). A similar result was found by Degens (1983
) concerning UPGMA and the reconstruction of ultrametric trees (rooted with a molecular clock). On the other hand, evolutionary distance data estimated from sequences are very far from the i.i.d. normal model. The variance of the distance estimates is not constant and is much higher for long than for short distances. Moreover, the covariance of these estimates is often high (Nei and Jin 1989
; Bulmer 1991
). In other words, the variance-covariance matrix of the distance estimates obtained from sequences differs markedly from the identity matrix that corresponds to the i.i.d. model. Also, as we shall explain in the next section, NJ trees are more accurate than ME and UNJ trees simply because NJ takes sequence data features into better account.
![]() |
The BIONJ Algorithm: Principle and Performance |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
|
![]() |
BIONJ is equivalent to or more accurate than NJ for all tested conditions. The mean results are displayed in table 1
. The topological accuracy (measured using the Robinson and Foulds distance) is improved by 8% with 300 sites and by 10% with 600 sites, while the probability of finding the true tree is augmented by 5% and 4%, respectively. The gain is much higher for certain conditions. With highly varying rate model trees and with high substitution rates (maximum pairwise divergence 1.0 substitutions per site), the topological accuracy may be improved by as much as 50%, and the probability of finding the true tree may be improved by 15%. Note, however, that BIONJ is neither better nor worse than NJ in optimizing the ME criterion.
BIONJ actually does not optimize any global criterion. At each step, it selects a pair of taxa by optimizing a partial (or local) criterion, denoted as Q, which was proposed by Studier and Keppler (1988) and is equivalent to the NJ criterion (Saitou and Nei 1987
; Gascuel 1994
). When applied to any tree distance D, Q designates with certainty a pair of neighbors of the tree representing D. In practice, we only have an estimate
of D, and hence an estimate
of the true value of Q. Since we optimize
and not Q itself, we do not systematically select a true pair of neighbors. However, the more reliable
is in estimating D, the more reliable
(computed from
) is in selecting neighbors. The BIONJ approach involves iteratively (1) selecting a pair of neighbors by optimizing
and (2) reducing
such that the new estimates in the reduced matrix are as reliable as possible. This clearly strays from the global optimization paradigm.
BIONJ is a special case of what we have called the minimum variance reduction (MVR) method (Gascuel 2000
). This general approach involves achieving step 2 above owing to knowledge of the variance-covariance matrix V of estimates in
. The computing time needed by MVR depends on V. In the general case, MVR requires O(n4), where n is the number of taxa. However, MVR only requires O(n3), just as NJ does, in several special cases, e.g., UNJ (V is equal to the identity matrix), WLSNJ (V is diagonal), or BIONJ (V is nondiagonal and defined by a simple first-order model of sequence data). The MVR approach thus allows the data features to be taken into account and is much less demanding in computing time than current approximate algorithms used for the global optimization of least-squares criteria. For example, FITCH (Felsenstein 1997
) requires O(n4) to deal with OLS or weighted least-squares (WLS; V is diagonal), while generalized least-squares (GLS; no constraint on V) necessitates (among other things) inverting V, which requires O(n6) (Bulmer 1991
). Note also that the above-mentioned agglomerative algorithms are statistically consistent, which means that they converge toward the true tree when the number of sites increases (and when
is a consistent estimate of D). This property is essential in phylogenetic inference (Felsenstein 1984
) and is actually more important than the capacity to find the optimal tree according to a given criterion. Moreover, Atteson (1997) has demonstrated that the theoretical convergence rate of agglomerative algorithms such as ADDTREE, NJ, UNJ, and BIONJ is in some sense optimal for all possible distance-based methods (at least when the distances in D are bounded within a reasonable range; see Erdös et al. 1997
).
![]() |
Conclusions |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Actually, the main question that arises is that of which criterion to optimize. The results presented in this paper indicate that the ME criterion is not perfectly suited for distance data estimated from sequences due to its closeness to OLS. Improvement could probably be obtained by combining the ME principle with the WLS (or GLS) estimation of tree length, but optimizing this refined criterion would then become harder and likely not easier than directly optimizing WLS (or GLS).
However, global optimization is not the sole way to conceive phylogenetic reconstruction algorithms. In the case of distance data, the agglomerative approach also possesses solid foundations. Combined with MVR, it allows the data features to be taken into account and leads to simple, fast, and accurate algorithms such as BIONJ. However, the simplicity and speed of these algorithms is perhaps somewhat counterbalanced by a lower ability to recover the true tree, which should clearly be further studied and evaluated.
![]() |
Footnotes |
---|
1 Keywords: phylogenetic reconstruction
optimal and heuristic algorithms
criterion to be optimized
agglomerative approach
model of the data
2 Address for correspondence and reprints: Olivier Gascuel, Département Informatique Fondamentale et Applications, LIRMM, 161 rue Ada, 34392, Montpellier cedex, France. E-mail: gascuel{at}lirmm.fr
![]() |
literature cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Atteson, K. 1997. The performance of the NJ method of phylogeny reconstruction. Pp. 133148 in B. Mirkin, F. R. McMorris, F. S. Roberts, and A. Rzhetsky, eds. Mathematical hierarchies and biology. American Mathematical Society, Providence, R.I.
Bulmer, M. 1991. Use of the method of generalized least squares in reconstructing phylogenies from sequence data. Mol. Biol. Evol. 8:868883.
Degens, P. O. 1983. Hierarchical cluster methods as maximum likelihood estimators. Pp. 249253 in J. Felsenstein, ed. Numerical taxonomy. Springer, Berlin.
Erdös, P. L., M. A. Steel, L. A. Székely, and T. Warnow. 1997. Constructing big trees from short sequences. Automata Lang. Programming 24th Int. Colloquium Lect. Notes Comput. Sci. 1256:827837.
Felsenstein, J. 1984. Distance methods for inferring phylogenies: a justification. Evolution 38:1624.
. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783791.
. 1987. Estimation of hominoid phylogeny from a DNA hybridization data set. J. Mol. Evol. 26:123131.[ISI][Medline]
. 1993. PHYLIP (phylogeny inference package). Distributed by the author, Department of Genetics, University of Washington, Seattle.
. 1997. An alternating least-squares approach to inferring phylogenies from pairwise distances. Syst. Biol. 46:101111.[ISI][Medline]
Gascuel, O. 1994. A note on Sattath and Tversky's, Saittou and Nei's and Studier and Keppler's algorithms for inferring phylogenies from evolutionary distances. Mol. Biol. Evol. 11:961963.
. 1996. Algorithmes efficaces pour la construction d'arbres d'évolution minimum. Rapport de recherche LIRMM-96019.
. 1997a. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14:685695.
. 1997b. Concerning the NJ algorithm and its unweighted version, UNJ. Pp. 149170 in B. Mirkin, F. R. McMorris, F. S. Roberts, and A. Rzhetsky, eds. Mathematical hierarchies and biology. American Mathematical Society, Providence, R.I.
. 2000. Data model and classification by trees: the minimum variance reduction (MVR) method, J. Classif. (in press).
Jin, L., and M. Nei. 1990. Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol. Biol. Evol. 7:82102.[Abstract]
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111120.[ISI][Medline]
Kumar, S. 1996. A stepwise algorithm for finding minimum evolution trees. Mol. Biol. Evol. 13:584593.[Abstract]
McQuitty, L. L. 1966. Similarity analysis by reciprocal pairs for discrete and continuous data. Educ. Psychol. Meas. 26:825831.[ISI]
Nei, M., and L. Jin. 1989. Variances of the average numbers of nucleotide substitutions within and between populations. Mol. Biol. Evol. 6:290300.[Abstract]
Nei, M., S. Kumar, and K. Takahashi. 1998. The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. Proc. Natl. Acad. Sci. USA 95:1239012397.
Robinson, D. F., and L. R. Foulds. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131147.[ISI]
Rzhetsky, A., and M. Nei. 1993. Theoretical foundation of the minimum-evolution method of phylogenetic inference. Mol. Biol. Evol. 10:10731095.[Abstract]
Saitou, N., and M. Imanishi. 1989. Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic reconstructions in obtaining the correct tree. Mol. Biol. Evol. 6:514525.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstruction of phylogenetic trees. Mol. Biol. Evol. 4:406425.[Abstract]
Sattath, S., and A. Tversky. 1977. Additive similarity trees. Psychometrika 42:319345.
Sneath, P. H. A., and R. R. Sokal 1973. Numerical taxonomy. Freeman, San Francisco.
Sokal, R. R., and C. D. Michener. 1958. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 38:14091438.
Studier, J. A., and K. J. Keppler. 1988. A note on the neighbor-joining method of Saitou and Nei. Mol. Biol. Evol. 5:729731.
|