* Heinrich-Heine Universität Düsseldorf, Germany
Forschungszentrum Jülich, Germany
Correspondence: E-mail: haeseler{at}cs.uni-duesseldorf.de.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: sequence evolution maximum likelihood tree reconstruction phylogenetics quartet tree nearest neighbor interchange accuracy rbcL record value
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Another attempt to reduce computation time is the reconstruction of quartet trees, which are subsequently used to puzzle an overall tree (cf., Strimmer and von Haeseler 1996; Willson 1999; Ranwez and Gascuel 2001). However, the complexity of O(n4) prohibits an application of quartet methods to data with more than approximately 100 sequences, because it is necessary to evaluate all quartet trees. We propose the Important Quartet Puzzling Method, IQP, which uses only O(n2) quartets to construct a tree to overcome this computational drawback. The IQP approach is implemented as a part of an algorithm which efficiently elucidates the landscape of possible optimal trees. To this end, we introduce a search strategy that alternates between the nearest neighbor interchange (NNI) method and the sequential sequence-by-sequence assembly approach based on IQP. This combined strategy reduces the risk of getting stuck in a local optimum. For illustration, two recently analyzed data sets (Guindon and Gascuel 2003) are re-analyzed. Finally, we apply an estimator that is based on the time series of sights of better trees during the tree search to estimate when it becomes unlikely that a further search for a better tree will be successful.
![]() |
The Important Quartet Concept |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
The notation of a k-representative leaf set generalizes to all rooted subtrees of the rooted tree T. Moreover, the computation of the corresponding sets can be done in linear time. Figure 1 shows that the representative leaf set of T can be computed from the representative leaf sets of the rooted subtrees T1 and T2. Similarly, the representative leaf sets of T1 and T2 can be obtained from Sk(T3), Sk(T4), and Sk(T5), Sk(T6), respectively, and so on. The 4-representative leaf sets of subtrees T5 and T6 are S4(T5) = {(c, 3), (d, 3), (e, 3), (f, 3)} and S4(T6) = {(g, 2), (h, 2)}. S4(T2) is obtained from S4(T5) S4(T6) by increasing the corresponding distances by one, choosing the four leaves with the smallest distances, and breaking ties randomly. Thus, we obtain S4(T2) = {(g, 3), (h, 3), (c, 4), (d, 4)}.
Because the size of the k-representative leaf set of a rooted subtree Ti is O(k), the cost of computing the representative leaf set from its child leaf sets is O(k). Thus, for a rooted tree with n leaves the collection of all k-representative leaf sets is computed in O(nk) time.
Now consider an unrooted tree T. Each internal node x splits T into three disjoint subtrees, which we then root with the original internal node x. Let ,
, and
denote the corresponding rooted subtrees with the same root x (see figure 2, left). For these rooted subtrees we compute Sk(
), Sk(
), and Sk(
). As in the case of rooted trees the computation of all representative leaf sets for all rooted subtrees is efficiently possible. The collections of k-representative leaf sets are the building blocks for the IQP algorithm.
|
We call a quartet q = (t1, t2, t3, y) an important quartet of an internal node x of an unrooted tree T if and only if
Thus an important quartet consists of three representatives each from one of the three k-representative leaf sets derived from the internal node x and a sequence y, which needs to be inserted in the tree. By construction t1, t2, t3 are close to x and close to each other; thus the reconstruction of a quartet tree for the quartet (t1, t2, t3, y) is more likely to be accurate, because quartets with closely related sequences are less affected by the accumulation of evolutionary noise due to parallel and back substitutions.
More generally, a quartet q is an important quartet of an unrooted tree T if it is an important quartet of some internal node x of T.
Each internal node splits the tree into three rooted subtrees, and each of them is presented by at most k representative leaves. For a new sequence y and an internal node x, there are O(k3) important quartets. Because T has O(n) internal nodes, O(nk3) important quartets are possible.
The IQP Algorithm
To put a new sequence y into a tree T, we first determine the important quartet set of the tree T by specifying important quartets for all internal nodes of the tree T. Then for each important quartet q = (t1, t2, t3, y) the optimal tree topology (with respect to some objective function) is computed among the three unrooted topologies T, T
, and T
(see fig. 3). In IQP, quartet trees are constructed using Neighbor Joining; these trees are the minimum evolution trees in case of four sequences (Saitou and Nei 1987). The estimated quartet trees are then used to place sequence y.
|
Repeating the above procedure for all important quartets of T and summing the resulting scores assigns a total score to each branch. Sequence y is inserted on the branch with the highest score. In case of ties, a branch with highest score is selected at random. Scores are computed in O(nk3) time by using a simple recursive procedure (Schmidt 2003, chapter 4).
In the following, k was set equal to four. Thus, if one wants to compute a phylogenetic tree following the TREE-PUZZLE procedure but using only important quartets, then one would need O(n2) computing time. However, we will not pursue this approach here, because simulations showed that the accuracy of this approach is not satisfying (data not shown).
Combining Tree-Reconstruction Methods
Here we suggest a combination of tree reconstruction methods to compute an optimal tree.
Combined-Algorithm (IQPNNI)
This combined strategy has two main advantages. First, deleting and re-inserting some leaves in the optimization step helps us to escape from a local optimum when applying NNI. Moreover, deleting and re-inserting only a proportion of all leaves conserves parts of the optimized tree and therefore saves computing time.
![]() |
Accuracy |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The accuracy of IQPNNI was compared to Weighbor 1.2 (Bruno, Socci, and Halpern 2000), fastDNAml version 1.2 (Olsen et al. 1994), and PHYML version 2.0.1 (Guindon and Gascuel 2003). Weighbor is a distance-based method and is combined with DNADIST version 3.5 (Felsenstein 1993); the other programs are maximum likelihood methods (see Ranwez and Gascuel 2001 for a detailed reference about the performance of current quartet-based methods compared to other approaches). MetaPIGA (Lemmon and Milinkovitch 2002) was not included in the simulation study because it offers no version that runs in batch mode.
All methods were run with default options. However, the parameters of models of sequence evolution were not estimated but set identical to the simulated conditions.
Accuracy was measured as the percentage of cases where the inferred tree topology and the true tree topology are identical. Alternatively, we computed the distance between the inferred tree and the true tree according to Robinson and Foulds (1981)i.e., the number of bi-partitions present in one of the two trees but not the other.
Small Simulated Data
We generated randomly 3,000 trees with 30 taxa. Trees were drawn from the Yule-Harding distribution (Harding 1971). The branch lengths of trees were drawn from an exponential distribution with mean values equal to 0.03, 0.06, and 0.15 to accommodate for slow, medium, and high rates of evolution, respectively. Seq-Gen (Rambaut and Grassly 1997) was used to evolve sequences along the trees using the Kimura two-parameter model (Kimura 1980) with a transition/transversion ratio of 2.0, and a sequence length of 500 bp.
Tables 1 and 2 show that IQPNNI outperforms all other methods analyzed. However, the performance is, on average, only marginally better, i.e.3.50 versus 3.54 for the Robinson and Foulds distance (table 2). Weighbor, a distance-based method, displays the lowest performance in terms of accuracy.
|
|
The computing time to estimate a phylogenetic tree with 30 sequences is not really an issue for the four programs tested. Weighbor is the fastest program; it needs on average 0.4 seconds to output a tree. PHYML is next (2.9 seconds), followed by IQPNNI (16.7 s), and fastDNAML (28.9 seconds). Because computing time is not an issue for small data sets, one should apply the program with highest accuracy.
Large Simulated Data
The accuracy of IQPNNI for large data sets was investigated on one random tree topology that was created as described above, but with 1,000 sequences of length 500, 1,000, and 2,000 base pairs. The mean branch length was set to 0.05. We compared Weighbor, PHYML, and IQPNNI. Unfortunately fastDNAml could not be applied to 1,000 sequence data sets, because the computing time was too long. Table 3 displays the Robinson and Foulds (1981) distance for the three tree reconstruction methods designed to deal with large numbers of taxa. The numbers in the table show the results of 10 simulation runs and the average performance for each method as a function of the alignment length. Not surprisingly, as the sequence length increases all methods get better, that is they reconstruct trees that are closer to the true tree. However, there is a substantial difference in the performance between Weighbor and PHYML or IQPNNI. Weighbor shows substantially reduced accuracy. The differences in performance between PHYML and IQPNNI are less pronounced.
|
|
|
|
|
Thus, as with MetaPIGA, we need a criterion to stop our quest for the best tree. Here we suggest applying an estimation method that is based on the time of occurrence (i.e., number of iterations) of better trees during our search. These time points are indicated by the jumps in the graph in Figures 4 and 5.
More precisely, let L1, L2, ... , Lj denote the log-likelihoods for the first j iterations, then the sequence (k) of record times (i.e., iteration number, when a better tree is found) is defined by
|
This sequence is used to estimate the point in time, stop, at which to stop the search, i.e., the point at which it appears unlikely that a further search will lead to a better tree. Using the theory detailed in Cooke (1980) and Robert and Solow (2003), we estimate during the run of IQPNNI an upper 95% confidence limit
95% of
stop. More precisely, consider a sequence of record times
1,
2, ... ,
k, then we compute an upper (1
)100% stopping time as
|
|
This additional number of iterations to reach 95% further increases the computation time; i.e for the 218-ssu rRNA about 8 hours were necessary and for the rbcl data we used a total of 15 h. But now we are in the position to know that, with 95% probability, we would not have found a better tree when extending the search even longer. If one is willing to spend more computation time, it is of course possible to compute
99% or even higher upper bounds. One could also start a new run of IQPNNI.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The algorithm presented here turns out to search the tree space efficiently for better trees. We have used the maximum likelihood criterion as an objective function; however, the algorithm is not restricted to maximum likelihood. Any tree reconstruction method with an objective function to minimize or maximize will work in the presented approach.
Although we have only presented results for pdel = 0.3 and k = 4 for simulated data, further studies with pdel ranging from 0.2 to 0.4 and k from 4 to 6 showed that the accuracy, based on the Robinson and Foulds distance, percentage of correctly reconstructed trees is not affected (data not show). Similarily, various combinations of k (4, 5, or 6) and pdel, ranging from 0.1 to 0.3 resulted in minor changes in the log likelihood, which varied from 156,715 to 156,604 (k = 4, pdel = 0.1) for the ssu_rRNA data set and from 100,058 to 100,011 (k = 4, pdel = 0.1) for the rbcl-gene alignment. We note that this observation does not allow any generalizations. In any real application one should run IQPNNI for different choices of k and pdel.
Our simulations indicate that IQPNNI shows a better performance than other tested methods in terms of being closer to the true tree. However, we have only considered a very narrow range of simulation. We have not taken into account the posibility of model violations, uncertainty of estimating the parameters of the model, and various other sources of uncertainty. Including all this is beyond the scope of the paper.
Because IQPNNI generates trees with similar high likelihoods, we have included an option in the program that allows output of a majority rule consensus tree of all the intermediate trees found during one search run of the tree space. We also output the frequencies of the groupings found in that tree. If one is interested only in the most frequent groupings found in the collection of trees with a high likelihood, then this option is helpful.
Finally, we are suggesting a statistic that provides a guide to stop the search for a better tree. This very simple and crude estimation procedure proves to be very useful, although it increases the computation time yet again. To get additional confidence in the reconstructed tree and its maximum likelihood value, one needs to repeat the search with several independent runs of our program. However, we are sure that we have only scratched the surface when it comes to applying statistical inference based on record values to problems of tree reconstruction. Further investigations into the performance of such methods are certainly necessary, but beyond the scope of this paper.
![]() |
Supplementary Material |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The ribosomal sequence alignment can be retrieved from http://rdp.cme.msu/download/SSU\_rRNA/alignments/SSU\_Prok\_rep\_phylip), and the rbcl alignment can be downloaded from (http://www.cis.upenn.edu/~krice/treezilla/record.nex).
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Brauer, M. J., M. T. Holder, L. A. Dries, D. J. Zwickl, P. L. O. Lewis, and D. M. Hillis. 2002. Genetic algorithms and parallel processing in maximum-likelihood phylogeny inference. Mol. Biol. Evol. 19:1717-1726.
Bruno, W. J., N. D. Socci., and A. L. Halpern. 2000. Weighted Neighbor Joining: a likelihood-based approach to distance-based phylogeny reconstruction. J. Mol. Evol. 17:189-197.
Charleston, M. A. 2001. Hitch-hiking: a parallel heuristic search strategy, applied to the phylogeny problem. J. Comput. Biol. 8:79-91.[CrossRef][ISI][Medline]
Cooke, P. 1980. Optimal linear estimation of bounds of random variables. Biometrika 67:257-258.[ISI]
Felsenstein, J. 1993. PHYLIP (Phylogeny Inference Package) version 3.5c. Department of Genetics, University of Washington, Seattle. Distributed by the author.
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Mass.
Gascuel, O. 1997. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14:685-695.[Abstract]
Guindon, S., and O. Gascuel. 2003. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696-704.[CrossRef][ISI][Medline]
Harding, E. F. 1971. The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Prob 3:44-77.
Hasegawa, M., H. Kishino, and T.-A. Yano. 1985. Dating of the humanape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.[ISI][Medline]
Huson, D. H., S. M. Nettles, and T. J. Warnow. 1999. Disk-covering, a fast-converging method for phylogenetic reconstruction. J. Comput. Biol. 6:369-386.[CrossRef][ISI][Medline]
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.[ISI][Medline]
Larget, B., and D. L. Simon. 1999. Markov chain Monte Carlo alogorithms for the Bayesian of phylogenetic trees. Mol. Biol. Evol. 16:750-759.
Lemmon, A. R., and M. C. Milinkovitch. 2002. The metapopulation genetic algorithm: an efficient solution for the problem of large phylogeny estimation. Proc. Natl. Acad. Sci. USA 99:10516-10521.
Olsen, G. J., H. Matsuda, R. Hagstrom, and R. Overbeek. 1994. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. 10:41-48.[Abstract]
Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235-238.[Abstract]
Ranwez, V., and O. Gascuel. 2001. Quartet-based phylogenetic inference: improvements and limits. Mol. Biol. Evol. 18:1103-1116.
Robert, D. L., and A. R. Solow. 2003. When did the dodo become extinct? Nature 426:245-245.
Robinson, D. R., and L. R. Foulds. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131-147.[CrossRef][ISI]
Ronquist, F., and J. P. Huelsenbeck. 2003. MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572-1574.
Saitou, N., and M. Nei. 1987. The NeighborJoining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.[Abstract]
Schmidt, H. A. 2003. Phylogenetic trees from large datasets. Ph.D. thesis, Universität Düsseldorf.
Schmidt, H. A., and A. von Haeseler. 2002. Quartet trees as a tool to reconstruct large trees from sequences. Pp. 379388 in K. Jajuga, A. Sokolowski, and H.-H. Bock, eds., Data analysis, classification, and related methods. Studies in classification, data analysis, and knowledge organization. Springer-Verlag, Heidelberg, New York.
Schmidt, H. A., E. Petzold, M. Vingron, and A. von Haeseler. 2003. Molecular phylogenetics: parallelized parameter estimation and quartet puzzling. J. Parallel Distrib. Comput. 63:719-727.[CrossRef][ISI]
Schmidt, H. A., K. Strimmer, M. Vingron, and A. von Haeseler. 2002. TREE-PUZZLE: Maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502-504.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximumlikelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964-969.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogeny reconstruction. Pp. 407514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds., Molecular Systematics, 2nd edition. Sinauer Associates, Sunderland, Mass.
Vos, R. 2003. Accelerated likelihood surface exploration: the likelihood ratchet. Syst. Biol 52:368-373.[ISI][Medline]
Willson, S. J. 1999. Building phylogenetic trees from quartets by using local inconsistency measures. Mol. Biol. Evol. 16:685-693.