*Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, New Mexico;
Theoretical Physics Department, Bell Labs, Lucent Technologies, Murray Hill, New Jersey, and the Rockefeller University, New York, New York;
and
Health Sciences Center, Department of Molecular Genetics and Microbiology, University of New Mexico
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In maximum-likelihood (ML) phylogeny reconstruction (Felsenstein 1981
), the effect of a sequence on probabilities at an internal node decays exponentially with distance from that node, making ML trees robust to the presence of distant taxa. For example, one can expect the resolved branches in a primate tree reconstructed by ML to be robust with respect to adding bird or lizard sequences to the data set if the model of evolution is held fixed. Distance-based methods must contend with random errors in the distances that grow exponentially with distance. Because NJ is based on a criterion that does not downweight longer distances, inclusion of different bird or reptile sequences can change the reconstructed branching order of the primates much more easily with NJ than with ML. This has motivated us to create a fast method that, like ML, is inherently robust to the presence of distant taxa.
NJ, like UPGMA before it, consists of two main steps that are repeated until a tree is complete. The first step consists of choosing a pair of taxa to be joined, i.e., replaced by a single new node representing their immediate common ancestor. In the second step, distances from the new node to all other nodes are inferred. Recently, Gascuel (1997)
introduced an improvement on NJ called BIONJ, which uses the same first step as NJ, but in the second step uses weighted averages to reduce the variance in the estimates of the new distances. The weights used by BIONJ are based on variances expected for short distances in any evolutionary model.
In our method, which we call "weighted neighbor joining," or "Weighbor" for short, both NJ steps are redesigned. Our second step is not very different from BIONJs, but in our first step we replace the minimum-evolution criterion put forth by Saitou and Nei (1987)
with a likelihood-based criterion. This criterion models the distances as random variables obeying a Gaussian distribution (which approaches the true distribution in the limit of long sequences), each with an appropriate variance. In the studies in this paper, the formula specifying the variance as a function of distance is computed from the Jukes and Cantor (JC; 1969) model. The variance in a JC distance estimate (Nei, Stephens, and Saitou 1985
; Bulmer 1991
) can be written
![]() |
Criterion for Choosing the Best Pair to Join
The Weighbor criterion for deciding which pair to join is based on two subcriteria that hold for neighbors (and only for neighbors) when distances are exact. Using the labeling of nodes in figure 1 , with dij denoting the distance between nodes i and j, the subcriteria applying to neighbors i and j are:
|
We call this criterion "additivity" because it will hold if dik = diP + dPk holds for all k (and likewise with j in place of i), with P shown in figure 1 , in which case dik - djk = diP - djP.
Positivity:dik + djl - dij - dkl 0 for all choices of two other taxa k and l.
This requires the internal branch in the four-taxon tree ((i, j), (k, l)) to have a nonnegative length. That is, dPQ 0 in figure 1
. Because Q is defined by k and l, choices of these taxa that give the smallest dPQ result in the strongest test of this subcriterion.
Due to variations (errors) associated with finite sequence length, estimated distances are not expected to satisfy these criteria in every case in which i and j are neighbors. However, it is possible (using an additional simplifying assumption that the correlations between distances correspond to those for a star phylogeny) to determine the likelihood of the observed distances being generated if the two nodes are in fact sister taxa (i.e., neighbors). Thus, the pair of sequences to be joined at a given step in the procedure may be chosen as the pair for which this likelihood is highest. Estimating the negative log-likelihood gives us a "cost function." At each stage, we join the pair with the lowest cost, thereby maximizing the probability of a correct join.
Schematically, our cost function S(i, j) is the sum of two terms, one for each subcriterion:
![]() |
Evaluating Additivity
The likelihood that the observed distances are compatible with the additivity criterion for a given pair of taxa i and j is the likelihood that the various dik - djk values are all estimates of the same thing, , where the overbar indicates the optimally weighted average (see appendix 1). Taking the distance errors to be Gaussian gives us a negative log-likelihood of the form
This can be interpreted as a weighted least-squares 2 function; it is also equivalent to a weighted version of the "ThreeTree" criterion of OOta and Saitou (1998)
. The variances
2nonadd in the denominator consider only "nonadditive" variations and are computed according to
![]() |
Evaluating Positivity
Positivity is defined by an inequality, so the likelihood calculation involves integrating over all amounts by which the inequality could be satisfied. This integral results in the complementary error function erfc(x) defined by (2/)
xe-y2 dy. The negative log-likelihood that an instance dPQ of a Gaussian random variable with variance
2PQ comes from a distribution with a positive mean is
The precise definition of dPQ and the estimation of PQ are discussed below and in appendices 1 and 4; for now, let dPQ be defined by figure 1
. The log of the error function behaves linearly when |dPQ| <<
PQ, and in this situation, Pos(i, j) begins to resemble the linear NJ criterion, although with different coefficients. If the estimated length has a large negative value (dPQ/
PQ << -1), the penalty against joining the pair increases quadratically. If the estimate has a large positive value (dPQ >>
PQ), the function is nearly zero and virtually independent of dPQ; thus, pairs that clearly obey the positivity criterion are compared on the basis of additivity alone.
Heuristics to Expedite the Best Pair Search
For Weighbor to be useful, it should be much faster than ML on large trees, which implies that it should require no more than order N3 steps. While Add(i, j) can be exactly implemented in an order N3 algorithm by storing N2 sums and updating them, the positivity measure entails greater complexity. The most thorough measure of positivity would require evaluating dPQ for every quartet, which is impossible for an N3 method. This forces us to use certain heuristics to keep the calculation time proportional to N3.
Briefly, for each node i remaining at a given iteration, we first employ a series of pairwise comparisons aimed at finding the node j that is the most likely sister of i. In a comparison between j and j', we let k = j' and average over l (l {i, j, k}) in figure 1
to obtain an estimate of Pos(i, j) - Pos(i, j'). Once the most promising j is found for a given i, the cost function S(i, j) is evaluated for this pair, using an additional heuristic search to find the k and l that cause the worst value of the Pos(i, j) criterion (i.e., minimize dPQ/PQ). The pair i, j giving the best value of S(i, j) will be the pair that is joined. See www.t10.lanl.gov/billb/weighbor/technical for further details.
These heuristics allow Weighbor to estimate the complete tree in order N3 steps, like NJ and BIONJ. For comparison, the heuristic stepwise addition phase of ML phylogeny methods requires a number of steps proportional to N3L (more precisely, N3La, where La is the number of nonequivalent columns in the alignment; but in the worst case, La = L). Stepwise addition using the method of Fitch and Margoliash (1967)
(henceforth "FM") is order N4, while Bulmers (1991)
generalized least-squares method is N6. Distance methods also use an additional order N2L when the distance matrix is computed.
The heuristics we use are not guaranteed to find the optimal pair to join according to the criterion described above. They also can potentially create sensitivity to the order of taxa in the distance matrix. However, as shown below, making use of these heuristics, the method displays performance approaching that of ML and superior to that of existing N3 methods in dealing with long branches.
Calculating Distances to and from the New Node
Once we have decided to join nodes i and j at a new node (called P), there are three kinds of quantities that must be calculated in preparation for the next iteration. The first are the distances from i and j to P; the second are the distances from P to all of the remaining nodes; and the third are variables that will keep track of the variances in the latter distances.
Calling the newly created node P, the distance from i to P is computed as
![]() |
The distance from P to some remaining taxon k {i, j} is given by the weighted average
where 2avg(iP) is the mean squared nonadditivity
2nonadd(diP, dPk) averaged over k. The use of a variance averaged over k was put forward by Gascuel (1997)
for BIONJ. It is important that the same weights are used for all k, because this causes errors in dij to act as additive errors in later iterations.
The expected error in dPk will be greater than (or, if diP = 0 or djP = 0, equal to) the error expected for an actual sequence at P, because some of the error in diP and djP is passed on to dPk. We keep track of this increased error by introducing a quantity cP, which is the amount of distance one would add to dPk to get the right expected error. These quantities must be included whenever the variance formulas are used, which become
The calculation of the cs is given in appendix 6, although ci is zero for any i that is an actual sequence.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Four-Taxon Star
"Long branches attract" (LBA) is a topological bias toward trees with long branches joined as neighbors (Felsenstein 1978
). To test for LBA, we simulated four-taxon star phylogenies with two long and two short branches as shown in figure 2
. An unbiased method should choose randomly and equally among the three possible bifurcating topologies. The excess frequency with which the long branches are joined we identify as topological bias (Bruno and Halpern 1999
)
|
Maximum likelihood (we used fastDNAml [Olsen et al. 1994
]) and Weighbor show no significant bias in these tests, except perhaps for very long branch lengths. Because Weighbor is intended to approximate ML, it is reassuring that the two resemble each other in this test. One might also have expected Weighbors bias to be intermediate between the biases of positivity-based methods, such as NJ, and additivity-based methods, such as FM; however, the PHYLIP FM program, called "Fitch," requires branch lengths to be positive, apparently causing it to have the bias of a positivity-based method. The absence of detectable LBA bias in Weighbor is one good reason for using it.
Based on these bias results, one can predictand we have confirmedthat parsimony, NJ, and Fitch will perform worse than Weighbor on four-taxon trees with a short internal branch and two long branches that are not joined (this part of the tree space is known as the Felsenstein Zone). Conversely, these biased methods can perform better than Weighbor (and ML!) if the long branches are neighbors in the correct tree and if these branches are of sufficient length.
Trees that obey the molecular clock will tend to favor the LBA bias, because if a clocklike tree has two longest branches, these branches necessarily join. Thus, if real data tend to be clocklike, a biased method could have an advantage. On the other hand, Weighbor gives the correct tree when the distances are exact, suggesting that it is consistent and that if the data are sufficiently convincing, the true tree will be found. Biased methods systematically jump to the conclusion of a clocklike tree (or rather a tree with long branches joined) before the data really support it. Their biases will also tend to result in inflatedand hence misleadingbootstrap values for trees with long branches joined.
Lack of bias alone does not make a method worth using, because choosing a topology at random would also be unbiased by our definition; thus, the ability to find the correct tree when one exists must be tested. In order to avoid any influence of the bias effects we have already tested, we desire test trees that are as neutral as possible with respect to the LBA bias, for example, a four-taxon tree that has all external branch lengths equal. Tests of such symmetric four-taxon trees show negligible differences in performance between Weighbor and the other methods (data not shown), but adding additional taxa to such a tree yields more interesting results.
Five-Taxon Tree with a Long External Branch
We consider a symmetric four-taxon tree with a fifth taxon with a longer branch added well away from the short internal branch (fig. 3
). This tree still avoids effects of the LBA bias (as there is only one long branch), but reveals notable differences among some of the methods. Here we see that three methods, BIONJ, FM, and Weighbor, perform almost as well as maximum likelihood, and these methods are not much affected by the length of the long branch out to a length of 1.0. At lengths beyond 1.0, all methods, including ML, perform progressively worse. This is to be expected due to the difficulty of placing such a distant taxon correctly in the tree.
|
Parsimony clearly out-performs NJ in this test. It performs worse than the other methods, however. The term LBD may not fully describe parsimonys difficulty, because when the length of the long branch is 1.0, parsimony positions it correctly only 91% of the time.
Eight-Taxon Tree with Long Internal Branch
A more stringent test of LBD is shown in figure 4
, where the long branch is now an internal branch. Again, we imposed symmetry on the tree so that LBA bias would not be a factor.
|
Parsimony performs quite poorly when the central branch is long, although when this branch is short, parsimony is second only to the likelihood methods. Parsimonys victories with b = 0.2 over FM, NJ, and BIONJ are statistically significant (P < 0.05), but its advantage over Weighbor is not.
Weighbor performs better than NJ, BIONJ, and, by a small margin, FM over the entire range of distances. At very long distances, Weighbor also performs better than maximum likelihood using the default local search method of FastDNAml for some choices of the input order of taxa (data not shown); with randomized ("jumbled") input order, FastDNAml is indistinguishable from Weighbor when b = 1.5.
We also investigated how performance varies with sequence length. figure 5 shows results for a tree from figure 4 with an internal branch length b = 1.25. In this figure, Weighbor is about 30% more efficient than parsimony, meaning it can achieve the same accuracy with 30% less data, and is even more efficient relative to NJ and BIONJ. Recall that NJ is consistent, so it will approach 100% correctness with infinite sequence length (when there will be no distance errors); but because NJ does not account for the shape of the error distribution, its efficiency is far from optimal. Everywhere on this plot, Weighbors efficiency relative to global ML is at least 90%, and differences between Weighbor and local ML are not statistically significant.
|
Overall Performance
To summarize, we find that Weighbor is not noticeably influenced by LBA or LBD in our tests. On trees that have only short branches (d 0.25 in our tests), LBA and LBD are not an issue, and the differences among methods appear minor. However, when long branches are present, more dramatic differences become evident, and Weighbor performs better than all of the methods in our tests except ML.
We know of no examples in which Weighbor is outperformed by NJ or BIONJ, except for trees for which those methods benefit from the LBA bias, and for any such tree, there are always similar trees with different connectivity for which the LBA bias causes the biased methods perform worse (Bruno and Halpern 1999
). Parsimony performs well on trees with only short branches (d
0.25), but it performs poorly when long branches (d > 0.5) are present in these tests.
Speed
Weighbor is an order N3 algorithm, and it will run on 256 taxa in about 20 min on a 256-MHz Tatung UltraSparc workstation. Weighbor is about 5 times as fast as parsimony (PHYLIPs Dnapars; Felsenstein 1989
) and 200350 times as fast as FastDNAml using the default local search on sequences of length 500. This factor has only been investigated on up to 64 taxa for FastDNAml but should be relatively independent of the number of taxa; on the other hand, the factor will generally increase with the length of the sequences, although it will also depend on the diversity of the sequences. Weighbor is faster than the order N4 Fitch-Margoliash method on 16 or more taxa. Weighbor is substantially slower than both NJ and BIONJ, and the latter could be preferable for very large problems when speed is the primary consideration.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The Weighbor program, including source code, is freely available on the Internet at www.t10.lanl.gov/billb/weighbor or by anonymous ftp at ftp-t10.lanl.gov in pub/billb/weighbor. The sequence length L and the alphabet size ß used in calculating variances are adjustable command line parameters. For JC, ß = 4, while ß < 4 can allow for nucleotide bias or invariant sites, and ß > 4 can be used for protein sequences. Generalizing the program to use a more complex model for the function that specifies how variance increases with distance is straightforward. We expect the current version, with appropriate use of the alphabet size parameter (see "parameters" on the webpage), to work well for most molecular applications. Whenever the variance grows rapidly with distance and long branches are present, Weighbor should perform notably better than alternative methods that ignore this variance growth.
In our simulations, the distance matrix was always computed using the correct (JC) model. In real applications, the correct model is unknown, but the best available model should be used to compute the distance matrix, so that the mean nonadditivity will nearly equal zero (Swofford et al. 1996
; Halpern and Bruno 1998
). One should also be aware that certain methods for computing distances, notably the K2P (Kimura 1980
) and Tamura-Nei (1993) formulas, are not very efficient estimators and should be avoided in favor of generalized least-squares (Goldstein and Pollock 1994
; Pollock 1998
) or ML distances.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
1 Abbreviations: FM, Fitch-Margoliash; JC, Jukes-Cantor; LBA, long branches attract; LBD, long branch distracts; ML, maximum likelihood; NJ, neighbor joining.
2 Keywords: Weighbor,
evolutionary tree reconstruction,
distance methods,
long-branch attraction,
long branch distracts.
3 Address for correspondence and reprints: William J. Bruno, MS K-710, Los Alamos National Laboratory, Los Alamos, New Mexico 87545. E-mail: billb{at}lanl.gov
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Atteson, K. 1997. The performance of the neighbor-joining method of phylogeny reconstruction. Pp. 133147 in B. Mirkin, F. R. McMorris, F. S. Roberts, and A. Rzhetsky, eds. Mathematical hierarchies and biology. DIMACS Series of Discrete Mathematics and Theoretical Computer Science, Vol. 37. American Mathematical Society, Providence, R.I.
Bruno, W. J., and A. L. Halpern. 1999. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16:564566.
Bulmer, M. 1991. Use of the method of generalized least squares in reconstructing phylogenies from sequence data. Mol. Biol. Evol. 8:868883.
Felsenstein, J. 1978. Cases in which parsimony and compatibility methods will be positivity misleading. Syst. Zool. 27:401410.[ISI]
. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368376.[ISI][Medline]
. 1989. PHYLIPphylogeny inference package (version 3.2). Cladistics 5:164166.
. 1997. An alternating least squares approach to inferring phylogenies from pairwise distances. Syst. Biol. 46:112125.[ISI][Medline]
Fitch, W. M., and E. Margoliash. 1967. Construction of phylogenetic trees. Science 155:279284.
Gascuel, O. 1997. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14:685695.[Abstract]
Goldstein, D. B., and D. D. Pollock. 1994. Least squares estimation of molecular distance: noise abatement in phylogenetic reconstruction. Theor. Popul. Biol. 45:219226.[ISI][Medline]
Halpern, A. L., and W. J. Bruno. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol. Biol. Evol. 15:910917.[Abstract]
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, NY.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111120.[ISI][Medline]
Nei, M., J. C. Stephens, and N. Saitou. 1985. Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from humans and apes. Mol. Biol. Evol. 2:6685.[Abstract]
Olsen, G. J., H. Matsuda, R. Hagstrom, and R. Overbeek. 1994. FastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. CABIOS 10:4148.
OOta, S., and N. Saitou. 1998. Threetree: a new method to reconstruct molecular phylogenetic trees from distance matrices. P. 44 in Sixth annual international meeting of the Society for Molecular Biology and Evolution, program and abstracts. SMBE, Rochester, NY.
Pollock, D. D. 1998. Increased accuracy in analytical molecular distance estimation. Theor. Popul. Biol. 54:7890.[ISI][Medline]
Pollock, D. D., and D. B. Goldstein. 1995. A comparison of two methods for constructing evolutionary distances from a weighted contribution of transition and transversion differences. Mol. Biol. Evol. 12:713717.[Abstract]
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406426.[Abstract]
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogeny inference. P. 442 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. 2nd edition. Sinauer, Sunderland, Mass.
Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512526.[Abstract]
Whiting, M. F. 1998. Long-branch distraction and the strepsiptera. Syst. Biol. 47:134138.[ISI][Medline]