Département Informatique Fondamentale et Applications, Montpellier, France
Abstract
Introduction
The simplest quartet approach involves keeping for each group of four taxa (quartet) the most likely 4-tree and then seeking the complete phylogeny which contains as many of these 4-trees as possible. Methods based on this principle (Bandelt and Dress 1986; Jiang, Kearney, and Li 1998; Berry and Gascuel 2000) are interesting due to their good theoretical properties. They do not, however, account for the fact that some 4-trees are more reliable than others, and they implicitly give the same importance to all inferred 4-trees.
Several different approaches have been described for incorporating reliability into quartet-based phylogenetic analysis. In Erdös et al. (1997), 4-trees found to be unreliable are rejected and not used in phylogeny reconstruction. In Berry et al. (1999), two different 4-trees on the same quartet are used when a single 4-tree is unreliable. Finally, methods like quartet puzzling (QP) (Strimmer, Goldman, and von Haeseler 1997) and that of Willson (1999a) preserve the three possible 4-trees for each quartet and associate with them a weight proportional to the confidence placed in them. By using maximum likelihood to weight 4-trees, these algorithms seem to follow a reasonable approach for inferring a tree having a high likelihood on n taxa.
However, inference of phylogenies containing only four taxa is sometimes contested, which calls into question the very principle of quartet methods. The sampling of taxa preceding phylogenetic reconstruction strongly influences the inferred phylogeny. If one seeks to establish the phylogenetic relationship between four groups of taxa by using a single representative for each of these groups, the result generally depends on which representatives are selected. This problem was studied by Philippe and Douzery (1994), who, based on 4-trees inferred by a parsimony method, concluded: "Reconstructing history with only four taxa is rather a game of chance." Adachi and Hasegawa (1999) extended this result to 4-trees inferred by maximum likelihood and concluded: "As Philippe and Douzery showed, it is now clear that an argument based on a quartet analysis of a single gene is very dangerous." However, it seems possible to overcome this difficulty, since in published simulations (Strimmer and von Haeseler 1996; Strimmer, Goldman, and von Haeseler 1997) the performance of QP is close to or even better than that of maximum likelihood. This gives rise to the following questions: Is it possible to improve QP? If so, how do these new quartet methods perform compared with other classical phylogenetic reconstruction methods?
In this article, we present weight optimization (WO), a new algorithm which also uses weighted 4-trees inferred by maximum likelihood. WO searches for the tree on n taxa such that the sum of the weights of the 4-trees induced by this tree is maximal. Finding the optimal tree in this sense is computationally difficult (Steel 1992), like most optimization tasks in phylogenetic inference (Swofford et al. 1996), and thus WO is based on a heuristic approach which only guarantees finding a near-optimal tree. However, we shall see that its performance in optimizing this criterion is good enough and that better optimization does not improve the topological accuracy. We start by describing QP, and then we describe and demonstrate the properties of WO. Using computer simulations, we then compare the topological accuracies of QP and WO with those of other common phylogenetic reconstruction methods. WO significantly improves on QP, but, as discussed, the performance of these quartet methods is rather disappointing. Finally, we describe factors which seem to limit the topological accuracy of quartet methods. The implementation of WO, which uses tools of general interest in phylogenetic reconstruction, is described in the appendix.
Methods
Definitions and Notation
A tree is a set of nodes connected by branches (or edges) so that there is a single path connecting any two nodes. The degree of a node is the number of branches attached to it. The nodes with degree 1 (the leaves) are labeled by taxa; other nodes are called internal nodes. If all internal nodes have degree 3, then we say that the tree is fully resolved; if some have a higher degree, we say that the tree is partially resolved.
Removing a branch in a phylogeny separates taxa into two disjoint subsets such that their union equals the complete set of taxa. The branch is said to define the bipartition (or split) formed from the two subsets. When a bipartition separates two taxa from all the others, the pair of taxa is called an external pair. Let xy | zt denote the 4-tree that separates taxa x and y from taxa z and t. A quartet is a set of four taxa; for each quartet {x, y, z, t}, there are three possible 4-trees: xy | zt, xz | yt, and xt | yz. When an n-tree T contains a bipartition which separates taxa x and y from taxa z and t, T is said to induce the 4-tree xy | zt, and the 4-tree xy | zt is said to be consistent with T.
Let the pair (q, w) denote the weighted 4-tree (w4-tree) which associates a weight w with the 4-tree q. Let Q denote the set of weighted 4-trees used as the starting point for QP or WO. Q contains three w4-trees for each quartet. Qmax is the set induced by Q which, for each quartet, contains only the 4-tree having the maximum weight (among three). In the unweighted case, for each quartet, one 4-tree has weight 1 and the two others have weight 0. The set of 4-trees induced by a tree T is denoted QT.
After assigning weights to the 4-trees, quartet methods search for the tree T which fits them best. A natural measurement of this fit is the sum of the weights of 4-trees induced by T. Thus, we search for the tree T which maximizes the W criterion, defined as
\[
W(T) \;=\; \sum_{(q,\,w)\,\in\,Q,\;\; q\,\in\,Q_T} w \qquad\qquad (1)
\]
However, finding the tree T defined in this way is NP-hard (Steel 1992) and can require an exponentially long computing time. Therefore, we have to use heuristic algorithms (such as QP or WO) to find a near-optimal tree within a reasonable amount of computing time.
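To make the W criterion concrete, the following minimal Python sketch (our illustration, not the implementation described in the appendix) evaluates W(T) for a tree stored as an adjacency dict. It decides which 4-tree T induces on a quartet by checking whether the two corresponding leaf-to-leaf paths are disjoint; all function names are ours.

```python
from collections import deque
from itertools import combinations

def leaf_path(adj, a, b):
    """Nodes on the unique path from a to b in a tree {node: set(neighbors)}."""
    parent, queue = {a: None}, deque([a])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    path, u = [], b
    while u is not None:
        path.append(u)
        u = parent[u]
    return path

def induced_split(adj, quartet):
    """Return the 4-tree induced on `quartet`, encoded as a frozenset of two
    taxon pairs (e.g. {{x,y},{z,t}} for xy|zt), or None if unresolved."""
    x, y, z, t = sorted(quartet)
    for (a, b), (c, d) in [((x, y), (z, t)), ((x, z), (y, t)), ((x, t), (y, z))]:
        if not set(leaf_path(adj, a, b)) & set(leaf_path(adj, c, d)):
            return frozenset({frozenset({a, b}), frozenset({c, d})})
    return None

def w_criterion(adj, taxa, weights):
    """W(T): sum, over the quartets resolved by T, of the weight of the
    induced 4-tree (eq. 1). `weights` maps an encoded 4-tree to its weight."""
    return sum(weights.get(induced_split(adj, q), 0.0)
               for q in combinations(sorted(taxa), 4))
```

This brute-force evaluation examines all quartets and recomputes paths from scratch, so it is only meant to spell out the definition of W, not to suggest how WO is actually implemented.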
Quartet Puzzling
To infer the phylogeny of n sequences, QP proceeds in three stages: (1) it uses the maximum-likelihood principle to weight all possible 4-trees; (2) based on these weights, it constructs a large number of n-trees; and (3) it computes the consensus tree of these n-trees.
Given a quartet, the likelihoods l1, l2, and l3 of the three associated 4-trees are used to estimate their respective probabilities p1, p2, and p3 of being the correct 4-tree. These probabilities are evaluated using Bayes' theorem as (Strimmer, Goldman, and von Haeseler 1997)
\[
p_i \;=\; \frac{l_i}{l_1 + l_2 + l_3}, \qquad i = 1, 2, 3
\]
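A minimal sketch of this weighting step, assuming the three per-4-tree log-likelihoods have already been computed by a maximum-likelihood program; the function name and the max-subtraction trick (to avoid numerical underflow when exponentiating) are our own additions, not part of QP itself.

```python
import math

def quartet_weights(log_likelihoods):
    """Turn the log-likelihoods of the three possible 4-trees of a quartet into
    the posterior weights p1, p2, p3 (Bayes' theorem with a uniform prior)."""
    m = max(log_likelihoods)                      # shift for numerical stability
    rel = [math.exp(l - m) for l in log_likelihoods]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical example: the first 4-tree is clearly better supported.
print(quartet_weights([-1000.0, -1005.0, -1006.0]))  # ~[0.991, 0.007, 0.002]
```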
The next stage uses these weights on 4-trees to construct a collection of n-trees. To construct a single n-tree, QP follows an addition scheme. It first randomly determines an order among the taxa. The taxa are then added one by one to a partially constructed tree until a complete n-tree containing all taxa is obtained. Suppose that the taxa are denoted as 1, 2, ... , n, and that the addition order is (1, 2, 3, ... , n). The 4-tree of maximum weight among the three which resolve {1, 2, 3, 4} is used as the starting point. At step i, the current tree contains the taxa from Si = {1, 2, 3, ... , i - 1}, and QP seeks to add taxon i. To determine the branch on which taxon i is added, QP proceeds as follows. First, with each branch of the current partially constructed tree, it associates a penalty which is initialized to 0. Then, it considers all w4-trees of type (xi | yz, w) with x, y, and z in Si, which are the w4-trees "relevant" to the addition of i. For each of these w4-trees, QP adds the penalty w to the branches of the path connecting y to z. Taxon i is then added onto the least penalized branch.
Figure 1 shows how, starting from a set of w4-trees, QP determines the branch of 12 | 34 on which taxon 5 is added. For example, to take the w4-tree (51 | 23, 0.4) into account, QP adds a penalty of 0.4 to the branches belonging to the path (2, 3). Indeed, if taxon 5 is added on one of these branches, the 4-tree 51 | 23 is not induced by the inferred tree. Note, however, that adding taxon 5 on the branch connected to taxon 4 also leads to a tree that is inconsistent with 51 | 23, and that QP does not penalize this branch.
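The sketch below (our illustration, not the QP source code) implements this penalty rule for the situation of figure 1: the current tree is 12 | 34, taxon 5 is to be added, and each relevant w4-tree (x5 | yz, w) adds the penalty w to every branch on the path from y to z. Only the w4-tree (51 | 23, 0.4) is taken from the text; the other weights below are made up for illustration.

```python
from collections import deque

def path_edges(adj, a, b):
    """Branches (frozensets of endpoints) on the unique path from a to b."""
    parent, queue = {a: None}, deque([a])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    edges, u = [], b
    while parent[u] is not None:
        edges.append(frozenset((u, parent[u])))
        u = parent[u]
    return edges

def qp_penalties(adj, relevant):
    """QP penalty rule for one taxon addition: each relevant w4-tree, given as
    ((y, z), w) for the 4-tree xi|yz, penalizes every branch on the path y--z."""
    penalty = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for (y, z), w in relevant:
        for edge in path_edges(adj, y, z):
            penalty[edge] += w
    return penalty

# Current tree 12|34, with internal nodes 'a' and 'b'.
adj = {1: {'a'}, 2: {'a'}, 3: {'b'}, 4: {'b'}, 'a': {1, 2, 'b'}, 'b': {3, 4, 'a'}}
# (51|23, 0.4) comes from the text; the remaining weights are hypothetical.
relevant = [((2, 3), 0.4), ((3, 4), 0.7), ((1, 4), 0.2)]
print(qp_penalties(adj, relevant))  # taxon 5 goes on the least penalized branch
```

One can check on this toy run that the branch leading to taxon 4 receives nothing from (51 | 23, 0.4), which is exactly the subtlety noted above.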
The n-tree that is constructed varies according to the taxon addition order. In order to overcome this problem, QP randomly defines several addition orders, constructs the corresponding n-trees, and returns their consensus as the final result. Strimmer and von Haeseler (1996) recommend that trees be constructed using as many different orders as possible and suggest that 1,000 is a reasonable value. Several consensus tree definitions exist. The current version of QP uses the "majority rule" consensus tree of Margush and McMorris (1981), whose bipartitions appear in more than half of the n-trees. Usually this consensus tree is partially resolved, and the simulation tests performed by Strimmer and von Haeseler (1996) and Strimmer, Goldman, and von Haeseler (1997) were done (personal communication) using the CONSENSE program from the PHYLIP package (Felsenstein 1989). This program usually provides a fully resolved tree with relatively high branch supports which systematically contains the bipartitions of the majority-rule consensus tree.
Weight Optimization
WO uses 4-tree weighting to dynamically define the taxon addition order. At each step, the selected taxon is added so as to generate the greatest increase of the W criterion (eq. 1). This does not guarantee that the optimal n-tree will be found, but this kind of "greedy" approach has proven effective in many optimization problems (Cormen, Leiserson, and Rivest 1992, pp. 329–356).
WO also uses continuous weighting based on likelihood, as introduced by Strimmer, Goldman, and von Haeseler (1997) and described above. WO gives branches bonuses rather than penalizing them. This new formulation has the advantage of clarifying the link with the W criterion. When adding taxon i, WO chooses the branch giving the tree with the largest W score (eq. 1). To this end, WO considers every relevant w4-tree (ix | yz, w) and adds a bonus w to each branch of the current tree such that the addition of i on this branch constructs a tree inducing ix | yz. Once all relevant w4-trees have been taken into account, the bonus of any branch b with the addition of taxon i is equal to the increase ΔW(b, i) of the W criterion induced by this addition. For the w4-tree (ix | yz, w), the branches receiving a bonus are determined as follows. There is a single internal node belonging to the three paths (x, y), (x, z), and (y, z), called the median of x, y, and z (Barthélemy and Guénoche 1990, pp. 57). This internal node is attached to three branches and thus defines three disjoint subtrees, denoted Tx, Ty, and Tz, such that x ∈ Tx, y ∈ Ty, and z ∈ Tz (fig. 2). If taxon i is added on a branch of Tx, then the path (i, x) has all its branches within Tx, while the path (y, z) has all its branches within Ty ∪ Tz. This ensures that paths (i, x) and (y, z) do not intersect and, consequently, that the constructed tree induces ix | yz. WO thus adds the bonus w onto the branches of Tx, which is equivalent to adding a penalty w to the branches of Ty ∪ Tz. However, this is very different from giving a penalty only to the branches of the path (y, z), as is the case with QP.
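For contrast with the QP sketch above, here is a sketch of the WO bonus rule in the same style (again our own illustration, not the authors' code): for a relevant w4-tree (ix | yz, w), the median of x, y, and z is located and the bonus w is given to every branch of the subtree Tx on the x side of that median.

```python
from collections import deque

def path_nodes(adj, a, b):
    """Nodes on the unique path from a to b in a tree {node: set(neighbors)}."""
    parent, queue = {a: None}, deque([a])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    path, u = [], b
    while u is not None:
        path.append(u)
        u = parent[u]
    return path

def wo_bonuses(adj, relevant):
    """WO bonus rule for one taxon addition: each relevant w4-tree, given as
    ((x, y, z), w) for the 4-tree ix|yz, rewards every branch of T_x."""
    bonus = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for (x, y, z), w in relevant:
        # The median of x, y, z is the single node shared by the three pairwise paths.
        median = (set(path_nodes(adj, x, y)) & set(path_nodes(adj, x, z))
                  & set(path_nodes(adj, y, z))).pop()
        on_x_path = path_nodes(adj, median, x)
        towards_x = next(n for n in adj[median] if n in on_x_path)
        # T_x = the branch median--towards_x plus everything hanging below towards_x.
        bonus[frozenset((median, towards_x))] += w
        seen, stack = {median, towards_x}, [towards_x]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    bonus[frozenset((u, v))] += w
                    seen.add(v)
                    stack.append(v)
    return bonus

# Same situation as figure 1: tree 12|34, adding taxon 5.
adj = {1: {'a'}, 2: {'a'}, 3: {'b'}, 4: {'b'}, 'a': {1, 2, 'b'}, 'b': {3, 4, 'a'}}
print(wo_bonuses(adj, [((1, 2, 3), 0.4)]))  # (51|23, 0.4) rewards only the branch to taxon 1
```

Rewarding every branch of Tx, instead of penalizing only the path (y, z), is what makes the accumulated bonus of a branch equal to the increase ΔW it would produce.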
WO randomly selects three taxa (denoted 1, 2, and 3) to initialize the tree T and then iteratively adds the remaining taxa. Unlike with QP, the taxon addition order is not random. Instead, at each step, we add the taxon providing the "safest" addition. The safety s(i) of the addition of i is defined as follows. Let M denote the branch with the highest bonus and m the branch with the second highest bonus. Let ΔW(M, i) and ΔW(m, i) denote the increases in W(T) resulting from adding i onto branch M or m. We have
\[
s(i) \;=\; \Delta W(M, i) \;-\; \Delta W(m, i)
\]
In order to completely define the addition order, we could alternatively select the 4-tree of maximum weight as the starting point. This requires examining all 4-trees, which is time- and memory-space-consuming, and simulations indicate that this does not improve the topological accuracy of WO. The simpler solution presented here ensures that the starting 4-tree is a good one, since it is the best among the 3 × (n − 3) 4-trees that resolve a quartet {1, 2, 3, x}, with x differing from 1, 2, and 3.
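A short sketch of this selection loop, assuming (as in the reconstruction above) that the safety is the gap between the best and second-best branch bonuses; the bonus computation itself is abstracted into a callback so the snippet stays self-contained, and all names are ours.

```python
def safest_taxon(remaining, branch_bonuses):
    """Pick the next taxon to insert. For each remaining taxon i, branch_bonuses(i)
    returns {branch: bonus}; s(i) is the gap between the best branch and the
    runner-up. The taxon with the largest s(i) is returned with its best branch."""
    best = None
    for i in remaining:
        bonuses = branch_bonuses(i)
        ranked = sorted(bonuses.values(), reverse=True)
        safety = ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]
        if best is None or safety > best[0]:
            best = (safety, i, max(bonuses, key=bonuses.get))
    return best[1], best[2]

# Hypothetical bonuses for two candidate taxa over three branches:
fake = {5: {'b1': 2.0, 'b2': 1.9, 'b3': 0.1},   # ambiguous placement, s(5) = 0.1
        6: {'b1': 1.5, 'b2': 0.2, 'b3': 0.1}}   # clear-cut placement, s(6) = 1.3
print(safest_taxon([5, 6], lambda i: fake[i]))  # -> (6, 'b1')
```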
Differences Between WO and QP
Let T denote the correct tree. If the weighting of the 4-trees is accurate enough to obtain Qmax = QT, i.e., if for every quartet the correct 4-tree has a weight greater than that of the other two, then WO reconstructs the correct tree T with certainty. This property also holds for QP in the binary case (Strimmer and von Haeseler 1996), but it no longer holds when a discrete or continuous weighting is used. Moreover, WO constructs a single tree, whereas our simulations confirm that QP needs to construct a large number of trees to achieve satisfactory performance. As we shall explain, this not only makes WO faster than QP, but also saves memory space and allows us to deal with larger data sets.
The proof that WO infers the correct tree T once Qmax = QT is achieved by proving that at each step of WO, the current tree is a subtree of T. This property is true for the 3-tree used as a starting point. Assuming that it holds at the current step, we simply have to prove that, for every taxon, adding it on the branch with maximum bonus for this taxon yields a subtree of T, which guarantees that the property still holds at the next step. Since the current tree is a subtree of T, it possesses a branch such that adding i onto it leads to a subtree of T. The bonus of this branch is the sum of the weights of the newly induced 4-trees, which are all correct. The bonus of any other branch is the sum of the weights of the newly induced 4-trees, at least one of which is not correctly resolved. The weights in these two sums are in one-to-one correspondence, each pair corresponding to the same relevant quartet. Under our hypothesis that Qmax = QT, this implies that the highest sum indicates the correct branch on which to add i, and thus recursively proves that at each WO step the current tree is a subtree of T.
Figure 1 illustrates a case in which QP (based on continuous weighting) fails to infer the correct phylogeny, although Qmax = QT. In this example, QP and WO are used to infer the phylogeny of taxa 1, 2, 3, 4, and 5. Moreover, we assume that 12 | 34 is used as the starting point by both algorithms. The way QP and WO then use the 4-tree weighting to add taxon 5 is detailed in figure 1. After having taken all relevant w4-trees into account, WO adds taxon 5 on the branch connected to taxon 1 and thus infers the unique tree T such that Qmax = QT. Moreover, as shown above, WO reconstructs this tree T irrespective of the three taxa used to initialize the tree. On the other hand, QP adds taxon 5 on the branch connected to taxon 2 and does not infer this tree. In this example, the discrete approximation of a 4-tree weight different from 0 and 1 is always 1/3. Thus, for each quartet, the discrete-weight version of QP has one chance in three of selecting the correct 4-tree, and, depending on these random selections, it may or may not reconstruct the correct phylogeny. Finally, the only consistent version of QP is the one using binary weighting.
The time complexity of a phylogenetic reconstruction algorithm expresses the computing time it requires, depending on the number of treated taxa (and possibly other parameters). QP and WO both have O(n⁴) complexity. In other words, their computing time increases proportionally to the fourth power of the number of taxa. In fact, QP as described by Strimmer and von Haeseler (1996) has O(n⁵) complexity. However, both QP and WO can be implemented in O(n⁴) by gathering the w4-trees which modify the bonuses (or penalties) of the same branches and by reusing calculations already carried out during previous steps. Although they have the same theoretical complexity, WO is faster than QP because it constructs a single tree, whereas QP constructs numerous trees (1,000 by default). This difference increases when the number of taxa grows, since the authors of QP state that "generally, the more taxa are involved the more runs of the puzzling step are advised." However, WO remains slower than any method having O(n³) complexity, such as distance algorithms like neighbor joining (NJ) (Saitou and Nei 1987) or BIONJ (Gascuel 1997). The O(n⁴) implementation of WO is detailed in the appendix.
Space complexity expresses the memory space required according to the number of treated taxa. In the algorithm shown in figure 3, the size of the input data is O(n⁴), and the memory space required is thus O(n⁴), as for most quartet algorithms. However, it is possible to modify quartet methods in order to input DNA sequences and to compute the weight of a 4-tree each time it is needed. This decreases the memory space required but increases the computation time as soon as weights are needed more than once. QP needs to consult each weight at least 1,000 times (once per reconstructed tree). For such methods, recomputing the weights instead of storing them is not realistic. However, this approach is better adapted in our case, since in the algorithm shown in figure 3, the weight of a 4-tree is consulted only once. Denoting by l the length of the DNA sequences, the memory space required then becomes O(nl + n²). This relatively low space complexity allows us to deal with larger data sets than quartet methods that need to store O(n⁴) weights.
Simulation Results
Protocol
Our experimental tests followed a protocol used within a similar framework by Kumar (1996), and later by Gascuel (1997). Six model trees were considered, each consisting of 12 taxa (fig. 4). The first three (AA, BB, AB) satisfied the molecular-clock hypothesis, while the other three (CC, DD, CD) presented varying substitution rates among lineages. Each interior branch was one unit long (a for constant- and b for variable-rate trees; the lengths of external branches are given in multiples of a or b). For each of these model trees, we studied four evolutionary conditions:
The various reconstruction methods were judged on their ability to infer the correct topology (i.e., that of the tree used to generate the sequences). Generally, this evaluation is made either by counting how many times the tree proposed by the method has the same topology as the correct tree T or by considering a topological distance d(T̂, T) between the inferred tree T̂ and the correct one. We used both criteria. The first penalizes methods which propose not-fully-resolved trees, since such trees always differ from the correct one. The second avoids this drawback but requires choosing a measurement of the difference between two tree topologies, which could favor one reconstruction method over another. For our tests, we used the bipartition distance introduced by Robinson and Foulds (1981), which is equal to the number of bipartitions present in one of the two trees and not in the other. This is the topological distance used in most simulation studies.
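A minimal sketch of this distance, assuming each tree has already been summarized by its set of bipartitions (each encoded as a frozenset of the two taxon subsets it defines); the helper names are ours.

```python
def robinson_foulds(splits_a, splits_b):
    """Bipartition distance of Robinson and Foulds (1981): the number of
    bipartitions present in one tree and not in the other."""
    return len(splits_a ^ splits_b)              # size of the symmetric difference

def split(side, taxa):
    """Encode the bipartition separating `side` from the remaining taxa."""
    side = frozenset(side)
    return frozenset({side, frozenset(taxa) - side})

# Toy example with six taxa and three non-trivial bipartitions per tree.
taxa = set(range(1, 7))
tree1 = {split({1, 2}, taxa), split({1, 2, 3}, taxa), split({5, 6}, taxa)}
tree2 = {split({1, 2}, taxa), split({1, 2, 4}, taxa), split({5, 6}, taxa)}
print(robinson_foulds(tree1, tree2))             # -> 2
```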
Methods Tested
For our tests, we used the program versions available on the web. The different programs were run with model options corresponding to the Kimura two-parameter model with a transition/transversion ratio of 2. We used default parameter values for other program options. The following programs were tested:
Comparisons of the Different Methods
The results obtained by FASTDNAML, BIONJ, DNAPARS, WO, and the main variants of QP are detailed in tables 1, 2, 3, and 4. These results are summarized in tables 5 and 6 by averaging over the two sequence lengths and the four evolutionary conditions. Tables 1, 2, and 5 provide the percentages of inferred trees that exactly corresponded to the correct tree. Tables 3, 4, and 6 indicate the average bipartition distance between the inferred tree and the correct one. The results obtained from these two types of comparisons were nearly equivalent, except for QP4.2, which often inferred non-fully-resolved trees. We first analyze the performances of the QP variants and compare them with that of WO. We then compare the performances of these quartet methods with those of other reconstruction methods.
Our tests also confirm that using continuous weighting is slightly better than using discrete weighting (about 1% in terms of correctly reconstructed trees), but they question the idea that using discrete or continuous weighting (instead of binary weighting) improves the topological accuracy of QP. Indeed, tables 5 and 6 indicate that the average results obtained by QPB are better than those of QPD and QPC by about 2%–4% in terms of percentage of correctly inferred trees. This is likely explained by the fact that QPB is the only consistent variant of QP.
Tables 5 and 6 indicate that the topological accuracy of QP strongly depends on the model tree topology. The performances obtained by QPD and QPC are more than 20% worse for model tree AA than for tree BB in terms of percentage of correctly inferred trees, whereas other methods tend to have similar performances for both trees or better performances for tree BB than for tree AA. Moreover, the topological accuracy of QP1C does not depend on the model tree. The topological bias of QP thus seems to be due to the use of the consensus, which tends to privilege trees having many external pairs (tree BB has four, whereas tree AA has only two). To test this assumption, we applied QPC 1,000 times to a test set made of 12 identical sequences and therefore containing no phylogenetic signal. The trees constructed by QPC before the consensus step contained four external pairs on average, whereas those obtained after the consensus step always contained six (the largest possible number with 12 taxa). Moreover, on data files obtained from a model tree having six external pairs, QPC obtains an impressive performance, even better than that of FASTDNAML. This likely explains the previous excellent results obtained by QPD and QPC (Strimmer and von Haeseler 1996; Strimmer, Goldman, and von Haeseler 1997), because the model trees used to generate the sequences contained the greatest possible number of external pairs. QPB and QP4.2 are also affected by this topological bias, but to a lesser extent (their performances are about 7%–8% worse for model tree AA than for tree BB). The difference with QPC and QPD is due to the fact that the trees reconstructed by QPB are less diversified and that in the majority tree provided by QP4.2, only highly supported bipartitions are retained.
It appears from tables 3 and 4 that QP4.2 is more efficient (in terms of bipartition distance) than QPC and QPB only when the evolution rate is so low (MD = 0.1) and the sequences so short (l = 300) that it is not uncommon that no mutation occurs along certain branches (Kumar 1996). In this case, QP4.2 benefits from being the only method able (in our tests) to propose partially resolved trees, with about six bipartitions on average, whereas all other methods propose fully resolved trees having nine bipartitions. In all other conditions, QPB and QPC tend to be better ranked than QP4.2, according to the bipartition distance as well as the percentage of correctly inferred trees.
However, WO is better ranked than QPB and QPC for all trees but the BB tree, according to both topological criteria (tables 5 and 6). Its overall performance is better than that of QP and less dependent on the correct tree topology. Nevertheless, the performance of WO is still worse than that of the other tested methods. WO is always less accurate than FASTDNAML and BIONJ, and the only cases in which WO is ranked better than DNAPARS correspond to high evolution rates (MD = 1.0 and MD = 2.0), for which parsimony methods are well known to be poorly suited.
Tables 5 and 6 clearly demonstrate that FASTDNAML is the most accurate method. On average, it finds the correct tree 12%–13% more often than BIONJ, which is the next best method. The topological accuracy of DNAPARS depends on the evolution rate. For a low evolution rate, which corresponds to the assumptions made by parsimony, DNAPARS is a little more accurate than BIONJ; for a medium evolution rate, BIONJ has an advantage over DNAPARS, and this advantage becomes dramatic for high evolution rates.
Since there are about n⁴ quartets, the time complexity of QP and WO (O(n⁴); see above) is thus minimal. This complexity makes it possible to deal with problems involving about one or two hundred taxa, while BIONJ and DNAPARS, which have O(n³) complexity, may deal with problems involving a few thousand taxa. However, note that BIONJ and DNAPARS are heuristic algorithms, like WO or QP, and do not systematically provide optimal trees according to the criterion they optimize (e.g., parsimony in the case of DNAPARS). We tested the different programs on a PC with a 466-MHz processor and 128 MB of RAM. The average computing time required for one of our data sets was less than 0.1 s for both BIONJ and DNAPARS, about 0.6 s for WO, 2.64 s for QP, and 4.13 s for FASTDNAML. On a larger data set, containing 25 taxa and 1,896 sites, the average computing time required was about 1.6 s for BIONJ (including the computation of distances by DNADIST), 2.9 s for DNAPARS, 53.0 s for WO, 101.0 s for QP, and 318.0 s for FASTDNAML. In this case, bootstrapping the data is easy for BIONJ and DNAPARS but becomes problematic for the other methods.
Discussion
Better Optimizing the W Criterion
It could be conjectured that the results obtained by WO are disappointing because it uses a naive approach that is inefficient in optimizing the W criterion. When the correct tree T is not inferred by WO, two cases arise. Either T is the single optimum according to W, in which case the problem is due to WO incorrectly optimizing the W criterion; or T is nonoptimal (or not the single optimum), in which case the problem concerns the W criterion itself, which is not discriminating enough.
In order to know which of these two problems limits the performance of WO, we isolated the second kind of error. Instead of defining a dynamic taxon addition order, we used 1,000 random-addition orders and constructed the 1,000 corresponding trees by optimizing the W criterion at each step. Then, we selected the tree among these 1,000 trees which had the highest W criterion value. This algorithm (WO1,000) does not guarantee that the resulting tree is optimal. However, in our tests, it always inferred a tree with a criterion value at least as good as that of the correct tree T. Thus, the only errors made by WO1,000 corresponded to cases in which the W criterion was not sufficiently discriminating. We observed that the performance of WO1,000 was hardly any better than that of WO: the average improvement over all model trees and conditions was no more than 2%, according to the percentage of correctly inferred trees. This clearly shows that only a tiny fraction of the errors of WO are due to insufficient optimization.
Possible Limits of Quartet Methods
The quartet methods are highly sensitive to 4-tree inference errors. For example, trees CC and CC' in figure 5 differ only with respect to the resolution of the nine specific quartets {1, 2, 3, x} with x ∈ {4, 5, 6, 7, 8, 9, 10, 11, 12}. For 12 taxa, there are 495 quartets. Therefore, it is sufficient to change the resolution of 2% (9/495) of the quartets of tree CC to obtain the set of 4-trees which exactly corresponds to tree CC'. In this case, it is obvious that all quartet methods infer tree CC' instead of tree CC. In fact, in the unweighted case, a change in the resolution of five quartets is enough to infer the wrong tree, whereas in the weighted case this depends on the weights. A change in the weights of the w4-trees corresponding to these nine quartets such that Qmax = QCC' is enough to make all reasonable methods infer the wrong tree. More generally, with n taxa, there are trees which differ only on n − 3 quartets, whereas the total number of quartets is of the order of n⁴. Thus, the minimum rate of badly inferred 4-trees sufficient to mislead all quartet methods quickly approaches 0 as the number of taxa increases.
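A quick numerical check of this argument (the figures for 12 taxa match those above; the larger values of n are our own illustration):

```python
from math import comb

for n in [12, 20, 50, 100]:
    quartets = comb(n, 4)              # total number of quartets on n taxa
    fraction = (n - 3) / quartets      # quartets on which two such trees differ
    print(f"n = {n:3d}: {quartets:10,d} quartets, misleading fraction ~ {fraction:.3%}")
# n =  12:        495 quartets, misleading fraction ~ 1.818%
```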
We stress that poor weighting of a few specific quartets is enough to lead quartet methods to infer an incorrect bipartition. Moreover, once a wrong bipartition is inferred due to badly weighted 4-trees, this may influence the entire tree topology. For example, assume that the correct tree is CC (fig. 5) and that, due to long-branch attraction, some 4-trees are badly weighted and indicate taxa 1 and 5 as an external pair; then the three bipartitions respectively gathering together taxa {1, 2}, {1, 2, 3}, and {4, 5} can no longer be inferred. In this case, even if all quartets {1, 2, 3, x} with x ∈ {4, 5, 6, 7, 8, 9, 10, 11, 12} are correctly weighted, the branch separating taxa 1 and 2 from the others is not inferred. This clearly indicates the importance and difficulty of correctly weighting 4-trees.
Although the likelihood approach used by QP (Strimmer, Goldman, and von Haeseler 1997) and then by WO seems well founded, this weighting system seems ill-adapted to the quartet approach. As an illustration, it should be noted that the likelihood of each 4-tree is well defined but does not allow us to predict the likelihood of the whole tree. Moreover, the continuous version of WO (presented above) finds the correct tree on average only about 3% more often than the binary variant (results not shown). The inadequacy of the weighting system currently used by QP and WO could thus explain their disappointing results.
Conclusions
A great deal of work has been carried out on ways of combining 4-trees to obtain a complete n-tree. The weaknesses of the various quartet methods tested in this paper are very likely due to the method of weighting the 4-trees, rather than the method of combining them. Indeed, as explained above, considering only four taxa requires correct management of long-branch attraction. We hope that this will become possible with the new approaches of Lyons-Weiler and Takahashi (1999), who suggest ways in which the calculation of likelihood could handle this problem, or of Willson (1999b), who proposes a version of parsimony that is more resistant to long-branch attraction.
Acknowledgements
Footnotes
1 Keywords: phylogenetic reconstruction, quartet methods, tree consensus, maximum likelihood, parsimony, distance methods, computer simulations
2 Address for correspondence and reprints: Olivier Gascuel, Département Informatique Fondamentale et Applications, LIRMM, 161 Rue Ada, 34392 Montpellier Cedex 5, France. E-mail: gascuel@lirmm.fr.
References
Adachi, J., M. Hasegawa. 1999. Instability of quartet analyses of molecular sequence data by the maximum likelihood method: the cetacea/artiodactyla relationships. Cladistics 5:164–166.
Bandelt, H.-J., A. Dress. 1986. Reconstructing the shape of a tree from observed dissimilarity data. Adv. Appl. Math. 7:309–343.
Barthélemy, J. P., A. Guénoche. 1990. Trees and proximity representation. John Wiley and Sons, Chichester, England.
Berry, V., O. Gascuel. 1997. Inferring evolutionary trees with strong combinatorial evidence. Lect. Notes Comput. Sci. 1276:111–123.
Berry, V., O. Gascuel. 2000. Inferring evolutionary trees with strong combinatorial evidence. Theor. Comput. Sci. 240:271–298.
Berry, V., T. Jiang, P. Kearney, M. Li. 1999. Quartet cleaning: improved algorithms and simulations. Lect. Notes Comput. Sci. 1643:313–324.
Cormen, T. H., C. E. Leiserson, R. L. Rivest. 1992. Introduction to algorithms. MIT Press, Cambridge, Mass.
Erdös, P., M. Steel, L. A. Székely, T. Warnow. 1997. Constructing big trees from short sequences. Lect. Notes Comput. Sci. 1256:827–837.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.
Felsenstein, J. 1989. Phylogeny inference package (version 3.2). Cladistics 5:164–166.
Gascuel, O. 1997. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14:685–695.
Jiang, T., P. Kearney, M. Li. 1998. Orchestrating quartets: approximation and data correction. Pp. 416–425 in R. Motwani, ed. Proceedings of the 39th IEEE Annual Symposium on Foundations of Computer Science, Los Alamitos, Calif.
Kumar, S. 1996. A stepwise algorithm for finding minimum evolution trees. Mol. Biol. Evol. 13:584–593.
Lyons-Weiler, J., K. Takahashi. 1999. Branch length heterogeneity leads to nonindependent branch length estimates and can decrease the efficiency of methods of phylogenetic inference. J. Mol. Evol. 49:392–405.
Margush, T., F. McMorris. 1981. Consensus n-trees. Bull. Math. Biol. 43:239–244.
Olsen, G., H. Matsuda, R. Hagstrom, R. Overbeek. 1994. A tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. 10:41–48.
Philippe, H., E. Douzery. 1994. The pitfalls of molecular phylogeny based on four species, as illustrated by the cetacea/artiodactyla relationship. J. Mamm. Evol. 2:133–152.
Rambaut, A., N. Grassly. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235–238.
Robinson, D. F., L. R. Foulds. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131–147.
Saitou, N., M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.
Steel, M. 1992. The complexity of reconstructing trees from qualitative characters and subtrees. J. Classif. 9:91–116.
Strimmer, K., N. Goldman, A. von Haeseler. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol. 14:210–211.
Strimmer, K., A. von Haeseler. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964–969.
Swofford, D. L., G. J. Olsen, P. J. Waddell, D. M. Hillis. 1996. Phylogenetic inference. Pp. 407–514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. 2nd edition. Sinauer, Sunderland, Mass.
Willson, S. J. 1999a. Building phylogenetic trees from quartets by using local inconsistency measures. Mol. Biol. Evol. 16:685–693.
Willson, S. J. 1999b. A higher parsimony method to reduce long branch attraction. Mol. Biol. Evol. 16:694–705.