Department of Ecology and Evolution, University of Chicago
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The ML approach, however, requires a large amount of computer time when many taxa are involved. For amino acid sequence data, the ML method incurs even heavier computational burdens. In a given tree, the likelihood value of subtree k, which has two subtrees i and j, is computed in a recursive way: L(k)sk = (si Psksi(
i)L(i)si) (
sj Psksi(
j)L(j))sj, where sm is a state of the root of subtree m, and Psksj(
) is a transition probability between states sk and sj for branch length
(Felsenstein 1981
). The number of states of any internal node (the root of any subtree) is 20 in the case of amino acid sequences, whereas it is only 4 in the case of nucleotide sequences. Clearly, it takes much more time to analyze amino acid sequence data than to analyze DNA sequence data.
Some algorithms exist for reconstructing phylogenetic trees from amino acid sequence data, e.g., heuristic methods in PROTML of MOLPHY (Adachi and Hasegawa 1996
) and the quartet puzzling (QP) method (Strimmer and von Haeseler 1996
). However, faster methods are needed to handle large trees.
We recently developed the NJML method (Ota and Li 2000
), which is a hybrid of the neighbor-joining (NJ) (Saitou and Nei 1987
) and ML methods. Our strategy was to explore the topology search space only around an initial NJ tree using the ML method. The depth of search depends on the reliability of the initial NJ tree in terms of bootstrap values. Since it uses a greedy (hill-climbing) search algorithm for the topology search, the computational cost is greatly reduced compared with the ML method. Ota and Li (2000)
showed that this simple method was highly efficient for improving NJ trees. Furthermore, the performance was nearly equal to or better than those of existing time-consuming heuristic ML methods.
The NJML algorithm, however, was designed only for nucleotide sequence data. In this study, we extended the algorithm to be applicable to amino acid sequence data too. In the new NJML program (NJML+), some approximation options are available for reducing the computational burden. The performances of these options were examined by extensive computer simulations.
![]() |
Algorithm and Implementation of NJML+ |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Furthermore, NJML+ has the following features:
The original NJML program (Ota and Li 2000
) has two major parts: (1) a set of modules (computer programs) for reconstructing an initial bootstrap NJ tree, and (2) a module for the ML estimation. These two parts account for almost all of the computational time taken to obtain a final result. Although the ML estimation takes much time, obtaining an initial bootstrap NJ tree also takes considerable time. We tried to reduce the computational time for both parts. To reduce the computational time for part 1, one may use CLUSTAL W instead of PHYLIP to obtain a bootstrap NJ tree, because CLUSTAL W is much faster. To reduce the computational time for part 2, one may use an approximate method used in NUCML and PROTML of MOLPHY (Adachi and Hasegawa 1996
), in which branch lengths are optimized by the least-squares method and a corresponding likelihood value is computed as an approximate ML value. Since there is no iteration required (except in the final step), the computation is fast.
Now we have two options in part 1 and two options in part 2. Therefore, we can choose one of the four possible ways to obtain a final result. Users should be aware that each of the three approximation options has some drawbacks (see below). Users should choose an option according to the degree of sequence divergence; that is, choose a more exhaustive search when sequence divergence is high.
For nucleotide sequence data, NJML+ has three approximation modes (A, B, and C) and a default mode (D).
For amino acid sequence data, it is very time-consuming to use PHYLIP to reconstruct an initial bootstrap NJ tree (see Discussion). For this reason, we did not prepare modes C and D for amino acid sequence data. The drawback of omitting modes C and D is that for obtaining an initial NJ tree, we cannot choose any evolutionary model of amino acid substitution except for Kimura's (1983)
correction for multiple hits. Therefore, NJML+ for amino acid sequence data has only two approximation modes, which correspond to modes A and B for nucleotide sequence data.
We implemented NJML+ as shown in figure 1 . The data flow for each approximation mode is also shown. Some modules are inherited from NJML with minor modifications.
|
![]() |
Simulation Design |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Simulations to Measure Efficiencies to Recover Correct Topologies
Seq-Gen (Rambaut and Grassly 1997
) and PSeq-Gen (Grassly, Adachi, and Rambaut 1997
) were used to generate ancestral nucleotide and amino acid sequences, respectively. In Seq-Gen, the JC and K2P models were used. All base frequencies were set to 0.25. For each case, 500 replications were conducted. In PSeq-Gen, the Dayhoff and JTT models were used. The amino acid frequencies used were those observed when the accepted mutation matrices for the Dayhoff and JTT models were constructed. For each case, 100 replications were conducted.
Figure 2
shows model trees T1 and T2, modified from Strimmer and von Haeseler (1996)
. For each of them, a variety of substitution rates a and b were assumed. As shown in figure 2
, T1 and T2 assume a constant rate and two varying rates (among lineages), respectively.
|
To compare inferred trees with model trees, a program modified from PAML's evolver (Yang 1999
) was used.
Randomly Generated Model Tree Simulation
For each case, 500 trees were generated randomly, and each possible tree was equally probable. To generate them, evolver was used with slight modification. Randomly generated ancestral amino acid sequences evolved along the randomly generated trees under the JTT or Dayhoff model. The number of OTUs was 20, and the sequence lengths were 300 and 600 amino acids. The expected branch lengths of randomly generated trees were varied for four cases: 0.03, 0.05, 0.07, and 0.10 amino acid substitutions per residue site.
The PUZZLE program often generates multifurcating trees as final trees. In fact, published results in 1996 and 1997 (Strimmer and von Haeseler 1996
; Strimmer, Goldman, and von Haeseler 1997
) were not obtained by the original PUZZLE program. We used both the PUZZLE program and the consensus program of PHYLIP (Felsenstein 1993
) to obtain completely resolved bifurcating trees (A. von Haeseler, personal communication).
The other part of this simulation was the same as the fixed model tree simulation. Since the computational burden of ML computation is large for large amino acid sequence data sets, we used the parallel computing system Chiba City of the Argonne National Laboratory.
Computer Runtimes
Computer runtimes were measured for NJML+ with various modes. In the case of simulation with amino acid sequences, PUZZLE 4.0.2 and PROTML 2.2 (with the quick add OTUs search and star decomposition options) were also included for comparison. To measure runtimes, the clock( ) function of GNU CC, version 2.8.1, was used on a Pentium III machine. Sequences were generated by Seq-Gen and Pseq-Gen under the K2P and JTT models, respectively. A given tree was randomly generated by setting the mean branch length to 0.05. These operations were iterated 100 times for each case, and the average computer runtimes were used to estimate the performance.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
|
Since NJML+ with mode B has extremely large variances in runtimes, comparison among the three programs is not simple (see table 2 ); in particular, the standard deviation was 144.93 when the number of OTUs was 20. As shown in figure 3 , however, NJML+ with mode B had only a weak tendency to increase its runtime with the number of OTUs.
Performance of Approximation Modes (nucleotide sequence data)
Table 3
shows the performance of NJML+ with three approximation modes in comparison with the NJ method and the NJML method with the same threshold. Nucleotide sequence data were used here. Mode B performed almost as well as NJML in any case. Interestingly, in some cases, this approximation mode gave better results than NJML. Modes A and B gave similar results in many cases; however, the performance of mode A was considerably reduced in some cases. In particular, with model tree T1 and the K2P model, the performance of mode A was much poorer than that of NJ with a/b = 0.03/0.42.
|
Performance of Approximation Modes (amino acid sequence data)
In table 4
, the performances of NJ, NJML+, PROTML, and PUZZLE are shown. Amino acid sequence data were used here. For model tree T1, NJML+ with mode B was the best, and NJML+ with mode A was the second best. For example, for the JTT model with a/b = 0.03/0.42, the proportions of correct trees obtained were 0.95 for NJML+ with mode B and 0.91 for NJML+ with mode A, but only 0.63 for NJ, 0.68 for PMq, 0.65 for PMs, and 0.74 for PZ. For model tree T2, NJML+ with mode B and mode A still performed better than the other methods except for PMs, although the differences in performance were smaller than those for model tree T1.
|
Efficiencies in Randomly Generated Model Trees with Amino Acid Sequence Data
The results of simulation with randomly generated model trees are shown in table 5
. In each case, 500 randomly generated trees were used as model trees (in total, 8,000 model trees were used). NJML+ always gave better results than the other methods except for one case in which it tied with NJ (Dayhoff model with 600 amino acids of amino acid sequence data).
|
An Example: Myosin Light Chain Sequences
Figure 4
shows two phylogenetic trees of myosin essential light chain sequences reconstructed using the NJ method and NJML+ with mode B. The number of sequences was 23, and the number of sites (the sequence length) was 187. The aligned data are from OOta and Saitou (1999)
. The threshold for NJML+ was 90% with n = 3. The JTT model was used.
|
There are three differences between the NJ tree (fig. 4A ) and the NJML+ tree (fig. 4B ).
We also tried NJML+ with mode A for the same sequence data. The result was the same as tree B.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Table 2
shows that mode B of NJML+ led to complicated relationships between the number of OTUs and the corresponding runtime. This is probably because the convergent time in searching the ML value varied considerably with replicated sequence data. In other words, the size of the topology search space is highly dependent on the distribution of bootstrap values among internal branches, as suggested by Ota and Li (2000)
. On the other hand, in mode A, no iterative computation was carried out except in the final step, so the corresponding runtime and the number of OTUs were monotonously correlated.
It is obvious that Kimura's (1983)
correction is not appropriate when d >> 1.0, where d is the number of amino acid substitutions per site between two sequences. Actually, the NJ method could not recover any correct topology with Kimura's correction when model tree T2 was used with a/b = 0.03/0.42 (the expected number of substitutions per site between two most divergent sequences was b + b + a/2 + a/2 + b + b = 4b + a = 1.71). As a consequence, the results of NJML+ were poor in this case (data not shown). In practical analyses with amino acid data, however, diverged data with d >> 1.0 are rarely used. Better methods for estimating distances should improve the efficiency of NJML+. For example, the method of Ota and Nei (1994)
is a good candidate in terms of computational time.
We tried a version of NJML+ using protdist of PHYLIP instead of CLUSTAL W with amino acid sequence data. This version corresponded to NJML+ with mode B for nucleotide sequence data. In terms of efficiencies in reconstructing correct topologies, the performance was considerably better than that of NJML+ with CLUSTAL W (namely, mode C for amino acid sequence data). In terms of computer runtime, however, NJML+ with protdist was impracticable (data not shown).
The costly runtime with protdist arose from the computation of distance matrices with bootstrap resampling data. If we already have a bootstrap tree in Newick format (Felsenstein 1993
), the NJML+ algorithm will improve tree reconstruction without this computational burden.
It may happen that at a step there are two or more candidate topologies that do not differ significantly in likelihood value. Since NJML+ uses a greedy search algorithm, such a situation may lead to a wrong result. However, one can reduce the chance of making an error by using the option mode of NJML+. For example, if we have m such candidate trees in step n, we can give those m trees to NJML+ as user-defined initial bootstrap trees. This is practically a beam search algorithm with m nodes under consideration at depth n (e.g., Kumar 1996
; Rodin and Li 2000
). In addition, if we encounter other multiple candidate trees whose log likelihoods are not significantly different, we can apply the same option. This option of NJML+ provides flexibility and makes this method distinct from simple greedy search algorithms.
In simulations with randomly generated model trees, NJ performed as well as PROTML and PUZZLE. This shows that NJ is quite robust for large amino acid sequence data sets. It is interesting that although NJ and PROTML with the star decomposition search use similar strategies, their results were very different. This may be an important difference between the minimum-evolution criterion (Saitou and Imanishi 1989
; Rzhetsky and Nei 1992
) and the ML criterion.
Since NJML+ elaborates the topology search space around an initial NJ tree with the ML method, it is not surprising that NJML+ gives better results than NJ. This shows an advantage in the strategy of NJML+.
![]() |
Supplementary Material |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
1 Keywords: tree reconstruction
topology search
approximate likelihood
neighbor joining
likelihood test
2 Address for correspondence and reprints: Wen-Hsiung Li, Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, Illinois 60637. whli{at}uchicago.edu
.
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Adachi J., M. Hasegawa, 1996 MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood Comput. Sci. Monogr 28:1-150
Dayhoff M. O., 1978 Survey of new data and computer methods of analysis Pp. 28 in M. O. Dayhoff, ed. Atlas of protein sequence and structure. National Biomedical Research Foundation, Silver Springs, Md
Doolittle W. F., 1999 Phylogenetic classification and the universal tree Science 284:2124-2193
Felsenstein J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach J. Mol. Evol 17:368-376[ISI][Medline]
. 1993 PHYLIP: phylogeny inference package. Version 3.5c Distributed by the author, Department of Genetics, University of Washington, Seattle
Fukami-Kobayashi K., Y. Tateno, 1991 Robustness of maximum likelihood tree estimation against different patterns of base substitutions J. Mol. Evol 32:79-91[ISI][Medline]
Gilbert D. G., 1997 Phylodendron, Java software for phylogenetic tree drawing Indiana University, http://iubio.bio.indiana.edu/treeapp/
Goldman N., J. L. Thorne, D. T. Jones, 1996 Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses J. Mol. Biol 263:196-208[ISI][Medline]
Grassly N. C., J. Adachi, A. Rambaut, 1997 PSeq-Gen: an application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees Comput. Appl. Biosci 13:559-560[Medline]
Gribaldo S., P. Cammarano, 1998 The root of the universal tree of life inferred from anciently duplicated genes encoding components of the protein-targeting machinery J. Mol. Evol 47:508-516[ISI][Medline]
Hasegawa M., M. Fujiwara, 1993 Relative efficiencies of the maximum likelihood, maximum parsimony, and neighbor-joining methods for estimating protein phylogeny Mol. Phylogenet. Evol 2:1-5[Medline]
Hasegawa M., H. Kishino, N. Saitou, 1991 On the maximum likelihood method in molecular phylogenetics J. Mol. Evol 32:443-445[ISI][Medline]
Hasegawa M., H. Kishino, T. Yano, 1985 Dating of the human-ape splitting by a molecular clock of mitochondrial DNA J. Mol. Evol 22:160-174[ISI][Medline]
Huelsenbeck J. P., 1995 The robustness of two phylogenetic methods: four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor joining Mol. Biol. Evol 12:843-849[Abstract]
Jones D. T., W. R. Taylor, J. M. Thornton, 1992 The rapid generation of mutation data matrices from protein sequences Comput. Appl. Biosci 8:275-282[Abstract]
Jukes T. H., C. R. Cantor, 1969 Evolution of protein molecules Pp. 21132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York
Kimura M., 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences J. Mol. Evol 16:111-120[ISI][Medline]
. 1983 The neutral theory of molecular evolution Cambridge University Press, Cambridge, England
Kishino H., M. Hasegawa, 1989 Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea J. Mol. Evol 29:170-179[ISI][Medline]
Kishino H., T. Miyata, M. Hasegawa, 1990 Maximum likelihood inference of protein phylogeny and the origin of chloroplasts J. Mol. Evol 31:151-160[ISI]
Kuhner M., J. Felsenstein, 1994 A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates Mol. Biol. Evol 11:459-468[Abstract]
Kumar S., 1996 A stepwise algorithm for finding minimum evolution trees Mol. Biol. Evol 13:584-593[Abstract]
Lio P., N. Goldman, J. L. Thorne, D. T. Jones, 1998 PASSML: combining evolutionary inference and protein secondary structure prediction Bioinformatics 14:726-733[Abstract]
Ota S., N. Saitou, 1999 Phylogenetic relationship of muscle tissues deduced from superimposition of gene trees Mol. Biol. Evol 16:856-867[Abstract]
Ota S., W.-H. Li, 2000 NJML: a hybrid algorithm for the neighbor-joining and maximum likelihood methods Mol. Biol. Evol 17:1401-1409
Ota T., M. Nei, 1994 Estimation of the number of amino acid substitutions per site when the substitution rate varies among sites J. Mol. Evol 38:642-643[ISI]
Philippe H., P. Forterre, 1999 The rooting of the universal tree of life is not reliable J. Mol. Evol 49:509-523[ISI][Medline]
Rambaut A., N. C. Grassly, 1997 Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees Comput. Appl. Biosci 13:235-238[Abstract]
Rodin A., W.-H. Li, 2000 A rapid heuristic algorithm for finding minimum evolution trees Mol. Phylogenet. Evol 6:173-179
Rzhetsky A., M. Nei, 1992 Statistical properties of the ordinary least-squares, generalized least-squares, and minimum-evolution methods of phylogenetic inference J. Mol. Evol 35:367-375[ISI][Medline]
Saitou N., T. Imanishi, 1989 Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree Mol. Biol. Evol 6:514-525
Saitou N., M. Nei, 1987 The neighbor-joining method: a new method for reconstructing phylogenetic trees Mol. Biol. Evol 4:406-425[Abstract]
Sogin M. L., G. Hinkle, D. D. Leipe, 1993 Universal tree of life Nature 362:795
Strimmer K., N. Goldman, A. von Haeseler, 1997 Bayesian probabilities and quartet puzzling Mol. Biol. Evol 14:210-211
Strimmer K., A. von Haeseler, 1996 Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies Mol. Biol. Evol 13:964-969
Tateno Y., N. Takezaki, M. Nei, 1994 Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site Mol. Biol. Evol 11:261-277[Abstract]
Thompson J. D., D. G. Higgins, T. J. Gibson, 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res 22:4673-4680[Abstract]
Yang Z., 1999 PAML manual Department of Biology, Galton Laboratory, University College London, London