Department of Ecology and Evolution, University of Chicago
Abstract
In the reconstruction of a large phylogenetic tree, the most difficult part is usually the problem of how to explore the topology space to find the optimal topology. We have developed a "divide-and-conquer" heuristic algorithm in which an initial neighbor-joining (NJ) tree is divided into subtrees at internal branches having bootstrap values higher than a threshold. The topology search is then conducted by using the maximum-likelihood method to reevaluate all branches with a bootstrap value lower than the threshold while keeping the other branches intact. Extensive simulation showed that our simple method, the neighbor-joining maximum-likelihood (NJML) method, is highly efficient in improving NJ trees. Furthermore, the performance of the NJML method is nearly equal to or better than existing time-consuming heuristic maximum-likelihood methods. Our method is suitable for reconstructing relatively large molecular phylogenetic trees (number of taxa 16).
Introduction
The neighbor-joining (NJ) method (Saitou and Nei 1987
) is simple and widely used, especially for large molecular phylogenetic trees. On the other hand, the maximum-likelihood (ML) method (Cavalli-Sforza and Edwards 1967
; Felsenstein 1981
) tends to outperform the NJ method if an appropriate model of nucleotide substitution is used (Fukami-Kobayashi and Tateno 1991
; Hasegawa, Kishino, and Saitou 1991
; Hasegawa and Fujiwara 1993
; Kuhner and Felsenstein 1994
; Tateno, Takezaki, and Nei 1994
; Huelsenbeck 1995
). Unfortunately, the ML method requires a large amount of computational time when many taxa are involved. Therefore, it is desirable to drastically reduce the topology search space by introducing heuristics.
The basic idea of this paper is to use a "divide-and-conquer" strategy, briefly described as follows: An initial tree is constructed by the NJ method. Bootstrap values (Felsenstein 1985)
are computed on all internal branches (nodes). The initial tree is then divided into subtrees at internal branches that have a bootstrap value higher than a threshold. Each subtree is referred to as a composite operating taxonomic unit (OTU) and is kept intact to reduce the search space. In other words, the topology search by ML reconsiders only internal branches with bootstrap values lower than the threshold. Therefore, the depth of the search depends on the bootstrap values on the internal branches (nodes) of the NJ tree.
Figure 1 shows the basic principle of the new algorithm. Since internal branches A and E in figure 1a have low bootstrap values, they are removed and the remaining internal nodes are merged. Figure 1b shows a multifurcating tree thus constructed. This is an intermediate tree with which to reconstruct the final bifurcating tree. Reconstruction of a bifurcating tree is performed by inserting new internal branches at the multifurcating nodes using the ML principle.
|
This idea, however, is still not practical for very large trees. If a tree has m (2) adjacent internal branches with low bootstrap values, we need to perform the exhaustive search for m + 2 OTUs. When m is large, the computation will be intractable. To reduce the computational load, we present below a greedy (hill-climbing) search algorithm (e.g., Winston 1993
).
Materials and Methods
Algorithm of the Neighbor-Joining Maximum-Likelihood Method
In the neighbor-joining maximum-likelihood (NJML) method, we do not simultaneously remove all internal branches having bootstrap values lower than the threshold. Instead, only n such internal branches are removed in each step. The NJML algorithm is very simple:
Step 1
Step 2
Step 3
Figure 2 shows how the tree reconstruction is performed in this algorithm for n = 3. The bootstrap values on branches A, B, C, F, J, and L are lower than the threshold C (see fig. 2a ). However, only the three smallest bootstrap values are chosen and the corresponding internal branches (A, C, and J) are removed in this step (fig. 2b ). The multifurcating nodes will be resolved, assuming that the remaining parts reflect the true tree topology. In further steps, internal branches B, F, and L will be removed to perform the topology search.
|
Implementation
In PHYLIP, version 3.5c (Felsenstein 1993)
, the programs dnadist, seqboot, neighbor, and consensus were used with slight modifications to construct an initial NJ tree (fig. 3
).
|
We cannot exclude the possibility that a pairwise distance from bootstrap resampling is infinite in dnadist of PHYLIP (fig. 3 ). In this case, NJML will return no result.
As shown in figure 3 , NJML contains several modules written in C. GNU CC version 2.8.1 was used as a compiler.
Simulation Design
Two types of simulation were carried out to check the efficiency of NJML: (1) fixed model trees were used, and randomly generated ancestral sequences were evolved along the model trees under certain evolutionary models; and (2) randomly generated trees were used, and randomly generated ancestral sequences were evolved in the same way as in (1). No site-specific heterogeneity (Yang 1993
) was assumed in either type of simulation. The maximum number (n) of removed internal branches at each step was set at 3 for all runs.
Fixed Model Tree Simulation
Seq-Gen (Rambaut and Grassly 1997
) was used to generate ancestral sequences. In Seq-Gen, Jukes-Cantor (JC) (Jukes and Cantor 1969
) and Kimura's two-parameter (K2P) (Kimura 1980) models were implemented as special cases of Hasegawa, Kishino, and Yano's (1985)
; (HKY) model. We set the nucleotide frequencies to be equal in both models, and we set the transition-to-transversion ratio (TS/TV) to 0.5 and 4.0 for the JC and K2P models, respectively. (We set the ratio to 4.0 for the K2P model rather than the commonly used value of 2.0 because we wanted to consider a more extreme case) Base frequencies were all set to 0.25. For each case, 500 replications were generated.
The model trees used are shown in figures 4 and 5
. In figure 4 , model trees A, B, C, and D are modified from Saitou and Imanishi (1989)
. Trees A and B assume a constant rate of nucleotide substitution. The expected number of nucleotide substitutions per site is denoted by U, with U = 0.05 or 0.5. The length of each branch is expressed as multiples of a (=U/8) (see fig. 4 ). Model trees C and D have a large variation among branches in rate of nucleotide substitution. For both trees, a = 0.01 and a = 0.05 were used. The threshold of bootstrap values was set to 90% or 95%.
|
In figure 5 , model trees T1 and T2 were modified from Strimmer and von Haeseler (1996b). For each of them, a variety of substitution rates a and b were assumed. As shown in figure 5 , T1 and T2 assume a constant rate and two varying rates (among lineages), respectively.
|
Randomly Generated Model Tree Simulation
For each case, 100 trees were generated randomly, and each possible tree was equally probable. To generate them, evolver was used with slight modification. Randomly generated ancestral sequences evolved along the randomly generated trees under the JC or the K2P model. The number of OTUs was 20, and the sequence length was 1,000 bp. The expected branch lengths of randomly generated trees were varied for three cases: 0.01, 0.05, and 0.1. The other part of this simulation was the same as the fixed model tree simulation.
Computer Run Times
Computer run times were measured in NJML, PUZZLE 4.0.2 (an implementation of the QP method), fastDNAml 1.0, and DNAML 3.5 by using the clock() function of GNU CC version 2.8.1 on a Pentium III machine (450 MHz). Sequences were randomly generated by Seq-Gen under the K2P model. A given tree was randomly generated by setting the mean branch length to 0.05. These operations were iterated 100 times for each case, and each computer run time was measured. The averages of the computer run times were used to compare performances. In the case of DNAML, however, only one run was carried out, because the computational run times were too large.
Results
Table 1 shows the results of simulation using "constant-rate" trees A and B (fig. 4 ). All NJ trees given in the tables were consensus trees by bootstrap resampling, and they were used as the initial trees for the NJML method. Table 1 indicates that all initial NJ trees were improved except in two cases under the JC model. In the two exceptions, the sequences were very divergent (U = 0.50) and the simple Jukes-Cantor model was used.
|
|
|
As shown in tables 4 and 5 , although BIONJ and Weighbor often gave better results than did the NJ method, the NJML method outperformed both of them. The QP method gave better results than any other method in six cases, while the NJML method gave better or tie results in the other cases (table 4 ). In particular, note that when model tree T2 (a "rate-varying" tree) was used, the efficiency of the NJML method was considerably better than that of both the NJ method and the QP methods (table 5 ). NJML performed almost as well as fastDNAml and DNAML in most cases and gave better results than fastDNAml and DNAML in some cases.
|
|
|
|
A Phylogenetic Tree of the Small-Subunit rRNA for Eukaryotes
We reconstructed a phylogenetic tree of the small-subunit ribosomal RNA for eukaryotes by using the NJ and NJML methods. The multiply aligned sequences are from the Ribosomal Database Project (Maidak et al. 1999
). Figure 6a and b
shows the initial NJ tree and the NJML tree, respectively. The K2P model was used with TS/TV = 4.0, and the threshold was set to 90%. OTUs of the trees were automatically annotated by programs in DeepForest (OOta 1998
). Excluding gaps, 1,465 sites were used. The NJ tree (fig. 6a
) has the following major problems:
|
In the NJML tree (fig. 6b
), these problems are solved: the nematode, the sea urchin, the amphioxus, and the amoeba are located in reasonable places (e.g., see Maddison and Maddison 1998
).
Discussion
Our results showed that NJML improved almost all initial NJ trees; the improvement was especially remarkable in the varying-rate model trees (tables 2 and 5
). As shown in Saitou and Imanishi (1989)
, the NJ method outperformed the ML method when model trees A and B were used under the JC model. Our results are consistent with Saitou and Imanishi's conclusions (see table 1 ). Under the K2P model, the NJML method was more efficient than the NJ method without exception.
The results of the QP method in tables 4 and 5
were obtained by the improved QP method using discrete weights (Strimmer, Goldman, and von Haeseler 1997
). In fact, in comparison with any other method, its results were surprisingly good when model tree T1 was used with relatively long branch lengths (a/b = 0.02/0.19 and 0.03/0.42). However, we should note that model trees T1 and T2 were originally designed for testing the QP method (Strimmer and von Haeseler 1996) and that the NJML method always outperformed the QP method when model tree T2 was used.
As noted above, NJML outperformed fastDNAml and DNAML in some cases (see tables 4 and 5 ). The computational load for DNAML is prohibitive when the number of taxa is large (Strimmer and von Haeseler 1996). Although fastDNAml is considerably faster than DNAML (table 7 ), the computational load is still not negligible for large trees. In comparison, the search space of NJML is mainly determined by n (the maximum number of removed internal branches at each step), not the total number of taxa, so it is more practical than fastDNAml for large trees.
Interestingly, NJML did not always give better results with the threshold 95% than with the threshold 90%. This is because in the NJML algorithm, the search space varies considerably depending on the distribution of bootstrap values. For n = 3, there are three possibilities with regard to the size of search space if three or more internal branches have lower bootstrap values than the threshold:
This heterogeneity of search space will become higher with greater n. In the worst case, we need to explore a search space whose size is approximately s/3i=1
nj=1 (2j + 1) when there are s internal branches having lower bootstrap values than the threshold. For n = 3,
nj=1 (2j + 1) = 105 and
s/3i=1
nj=1 (2j + 1)
(s/3) x 105 = 35s, which increases only linearly with s. On the other hand, in the best case, we need to explore a search space whose size is approximately
s/3i=13n = s3n-1. Although this heterogeneity of search space should be explored in the future, we think that n = 3 is suitable for personal computers.
Since bootstrapping tends to underestimate the confidence level of a subtree (Zharkikh and Li 1992, 1995
; Hillis and Bull 1993
), a threshold of 90% or even less would be sufficiently high for a good performance of NJML.
Acknowledgements
We thank Masami Hasegawa for permission to use the source code of MOLPHY, version 2.2. Andrew Rodin kindly advised us on simulation studies. Richard Blocker helped us to maintain the computer system. This work was supported by NIH grants GM30998 and GM55759.
Footnotes
1 Keywords: phylogenetic reconstruction
topology search
subtrees
greedy algorithm
2 Address for correspondence and reprints: Wen-Hsiung Li, Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, Illinois 60637. E-mail: whli{at}uchicago.edu
literature cited
Adachi, J., and M. Hasegawa. 1996. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. Comput. Sci. Monogr. 28:1150.
Bruno, W. J., N. D. Socci, and A. L. Halpern. 2000. Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 17:189197.
Cavalli-Sforza, L. L., and A. W. F. Edwards. 1967. Phylogenetic analysis: models and estimation procedures. Am. J. Hum. Genet. 19:233.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368376.[ISI][Medline]
. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783791.
. 1993. PHYLIP: phylogeny inference package. Version 3.5c. University of Washington, Seattle.
Foulds, L. R., D. Penny, and M. D. Hendy. 1979. A graph theoretic approach to the development of minimal phylogenetic trees. J. Mol. Evol. 13:151166.[ISI][Medline]
Fukami-Kobayashi, K., and Y. Tateno. 1991. Robustness of maximum likelihood tree estimation against different patterns of base substitutions. J. Mol. Evol. 32:7991.[ISI][Medline]
Gascuel, O. 1997. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14:685695.[Abstract]
Hasegawa, M., and M. Fujiwara. 1993. Relative efficiencies of the maximum likelihood, maximum parsimony, and neighbor-joining methods for estimating protein phylogeny. Mol. Phylogenet. Evol. 2:15.[Medline]
Hasegawa, M., H. Kishino, and N. Saitou. 1991. On the maximum likelihood method in molecular phylogenetics. J. Mol. Evol. 32:443445.[ISI][Medline]
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160174.[ISI][Medline]
Hillis, D. M., and J. J. Bull. 1993. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst. Biol. 42:182192.[ISI]
Huelsenbeck, J. P. 1995. The robustness of two phylogenetic methods: four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor joining. Mol. Biol. Evol. 12:843849.[Abstract]
Jukes, T. H., and C. R. Cantor 1969. Evolution of protein molecules. Pp. 21132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111120.[ISI][Medline]
Kishino, H., and M. Hasegawa. 1990. Converting distance to time: application to human evolution. Methods Enzymol. 183:550570.[ISI][Medline]
Kuhner, M., and J. Felsenstein. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:459468.[Abstract]
Maddison, D., and W. P. Maddison. 1998. The tree of life: a multi-authored, distributed Internet project containing information about phylogeny and biodiversity. College of Agriculture, University of Arizona, Tucson. Internet address: http://phylogeny.arizona.edu/tree/phylogeny.html.
Maidak, B. L., J. R. Cole, C. T. P. Jr. et al. (14 co-authors). 1999. A new version of the RDP (Ribosomal Database Project). Nucleic Acids Res. 27:171173.
Nei, M., S. Kumar, and K. Takahashi. 1998. The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. Proc. Natl. Acad. Sci. USA 95:1239012397.
Olsen, G., H. Matsuda, R. Hagstrom, and R. Overbeek. 1994. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. 10:4148.[Abstract]
OOta, S. 1998. Development of an integrated system for molecular evolutionary study and its application. Ph.D. thesis, Department of Genetics, School of Life Science, Graduate University for Advanced Studies, Mishima, Japan.
Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235238.[Abstract]
Saitou, N., and T. Imanishi. 1989. Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol. Biol. Evol. 6:514525.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406425.[Abstract]
Strimmer, K., and A. von Haeseler. 1996a. PUZZLE. Version 2.5. Zoologisches Institut, Universitaet Muenchen, Munich, Germany.
. 1996b. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964969.
Strimmer, K., N. Goldman, and A. von Haeseler. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol. 14:210211.
Tateno, Y., N. Takezaki, and M. Nei. 1994. Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol. Biol. Evol. 11:261277.[Abstract]
Winston, P. H. 1993. Artificial intelligence. Addison-Wesley, Mass.
Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:13961401.[Abstract]
. 1999. PAML manual. Department of Biology, Galton Laboratory, University College London, London.
Zharkikh, A., and W.-H. Li. 1992. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Mol. Biol. Evol. 9:11191147.[Abstract]
. 1995. Estimation of confidence in phylogeny: the complete-and-partial bootstrap technique. Mol. Phylogenet. Evol. 4:4463.[Medline]