Department of Mathematics and Laurence H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University
![]() |
Abstract |
---|
![]() |
Introduction |
---|
If T(J) is accurately known for each J, then there exists a unique way to combine all the trees T(J) into a single tree, which would then necessarily be the accurate T (Bandelt and Dress 1986
). Usually, however, only a computed tree L(J) for each quartet J is known, instead of T(J). Natural choices of criteria for selecting L(J) include maximum likelihood (Felsenstein 1981
), maximum parsimony (Farris, Kluge, and Eckardt 1970
; Fitch 1971
), neighbor joining (Saitou and Nei 1987
), and higher-order parsimony (Willson 1999b
). None of these methods, however, are always accurate. As a result, some of these L(J) trees are usually incorrect, and the various L(J) trees do not fit together into one tree. Different approaches (see Strimmer and von Haeseler 1996
; Willson 1998, 1999a
; Berry and Bryant 1999
; Moulton and Steel 1999
; Berry et al. 2000
) have been taken to build T in this situation.
This paper presents a method of "error correction" which identifies some J for which L(J) is incorrect and also suggests an improved choice for L(J). This purely mathematical correction assumes that L(J) is correct for "most" choices of J and then corrects individual L(J) trees on the basis of their relationship with L(K) for other quartets K. Such an error correction process can often be used in conjunction with various methods of reconstruction of T in order to increase the accuracy of the reconstructed tree.
If W is a tree with taxa S = {1, 2, ... , N}, and J is a subset of S, denote by W | J the tree obtained by removing all leaves except those in J, so that W | J precisely represents the relationships in W among the taxa in the collection J. Suppose that the (usually unknown) underlying true tree is T. A "list" L assigns a tree L(J) to each quartet J. The tree T induces a list T*, defined by T*(J) = T | J, for each quartet J. If T* were known, then T could be reconstructed. Suppose that instead we are merely given a list L which assigns to each quartet J a unique tree L(J) with exactly four leaves, corresponding, respectively, to the four elements of J. Since L is to be considered an approximation of T*, we assume that for most quartets J it is true that L(J) = T | J. Each quartet J such that L(J) T | J will be called "erroneous" in L. Any quartet J such that L(J) = T | J will be called "correct."
Exact formulations of the results require the use of the "quartet distance" D(L1, L2) between two lists. Here, D(L1, L2) is the number of quartets J such that L1(J) L2(J). The quartet distance provides a convenient and sensitive measurement of the degree to which two lists differ.
This paper shows how to compute an "error correction" map, Ec, which, given an approximating list L, produces another list, Ec(L). It will be proved that if L is "close enough" to T* in the sense that D(L, T*) is small enough (more specifically, if D(L, T*) (N - 4)/2), then Ec(L) = T*. This means that when there are sufficiently few erroneous quartets in L, then Ec(L) will exactly correspond to the true tree T. No knowledge of T is required for the definition of Ec. Moreover, the map Ec is the best possible in the sense that it corrects (N - 4)/2 errors, but there exist lists with [(N - 4)/2 + 1] erroneous quartets such that no error correction map can guarantee correctness. The map Ec can be utilized on a list L no matter how L was originally obtained.
Even when L is not close to T*, it is possible that Ec(L) will be closer to T* than was L and thus will show an improvement over L. Iteration of the map Ec to compute Ec(Ec(L)) may yield even better information about T. Whereas there are no guarantees of correctness if L is not sufficiently close to T*, examples suggest that iteration should provide much improvement in the situation when the erroneous quartets are randomly, rather than systematically, placed. Moreover, the Buneman (1971)
tree for Ec(L) always refines the Buneman tree for L. Hence, edges supported strongly in L remain strongly supported in Eck(L) for each k.
This paper also reports the results of some simulations in which DNA strings are generated by various methods from a known correct tree T, with varying rates and models of substitutions. Lists L are then computed using various methods. The simulations show that iteration of Ec yields a substantial reduction in the number of erroneous quartets, provided that the original number of erroneous quartets is not too large.
If L is a list, the Buneman tree for L is a particular tree T such that no edge of T is inconsistent with any tree L(J). Usually, the Buneman tree is incompletely resolved, and various refinements with higher resolution are of interest. Among these refinements are the "refined Buneman tree" of Moulton and Steel (1999)
, the "C-tree" of Berry and Bryant (1999)
, and the "hypercleaned tree" of Berry et al. (2000)
, related to "quartet cleaning," described in Jiang, Kearney, and Li (1998)
. Every reasonable tree should refine the Buneman tree.
The refinements of the Buneman tree in the preceding paragraph all duplicate the worst-case bound that we obtain in that they are guaranteed to find the correct tree T from the list L if D(T*, L) (N - 4)/2. Nevertheless, since their constructions are very different from the construction of Ec, the methods often correct errors different from those corrected by Ec. Hence, the possibility remains of further improvement, for example, by finding the C-tree associated with Ec(L) rather than L. In simulations, iteration of Ec often adds extra resolution to these various refinements of the Buneman tree.
I also report a biological example in which the C-tree and the hypercleaned tree obtained from L do not have as much resolution as the C-tree and the hypercleaned tree obtained from Ec6(L). Thus, Ec adds resolution not available from the other methods even in the context of real data.
![]() |
The Quartet Distance |
---|
Trees are described using a standard parenthesis notation. Thus, ((1 2) (3 4)) denotes the elementary tree with leaves 1, 2, 3, 4 such that taxa 1 and 2 are together in the sense that the path between 1 and 2 is disjoint from the path from 3 to 4; other notations for the same tree are ((2 1) (3 4)), ((3 4) (1 2)), etc. The only other possible elementary trees for the quartet {1, 2 ,3, 4} are ((1 3) (2 4)), ((1 4) (2 3)), and the "star" (1 2 3 4) (see fig. 1 ). The number of elements in a set Q is denoted |Q|.
|
Define the quartet distance between two lists L1 and L2 of PN to be the number of quartets J such that L1(J) L2(J). The quartet distance is easily seen to be a metric. Some properties of the quartet distance on trees are studied in Robinson and Foulds (1981)
, Estabrook, McMorris, and Meacham (1985)
, and Steel and Penny (1993)
.
Any elementary tree T generates a list T* in PN by T*(J) = T | J. It is a theorem of Bandelt and Dress (1986)
that if T and U are elementary trees and for all quartets J we have T | J = U | J, then T = U. As a consequence, if T and U are elementary trees and T* = U*, then T = U. We may thus identify a tree T with its list T*. The quartet distance between an elementary tree T and a list L is defined to be D(T*, L). In an abuse of notation, we may sometimes omit the asterisks and write D(T, L) instead of D(T*,L); the meaning will be clear.
The property of the quartet distance that makes it useful is that distinct elementary trees T and U have distance of at least N - 3. This fact is shown in the following fundamental result, announced without proof in Steel and Penny (1993)
. The theorem is true whether or not T and U are completely resolved.
Theorem 2.1. If T and U are distinct elementary trees with N taxa, then D(T*,U*) N - 3. Moreover, for each N
4, there exist resolved elementary trees T and U such that D(T*, U*) = N - 3.
Proof. See Supplementary Material.
![]() |
The Error-Correcting Map Ec |
---|
For each elementary tree U with four leaves from the set S of N taxa, compute an integer f(U), which may be interpreted as the number of demerits of U, as follows. Initially, set f(U) = N - 4 for all U. For each possible elementary tree V with five leaves from S, say, with leaves from the set Q, |Q| = 5, perform the following:
Since any given quartet J lies in exactly N - 4 quintets Q, using Theorem 2.1, one sees that for each U, 0 f(U)
N - 4. For each quartet J, if there exists an elementary tree U with set J of leaves such that f(U) < f(W) for all elementary trees W
U with set J of leaves, define Ec(L)(J) = U; otherwise, define Ec(L)(J) = L(J). This completes the definition of the map Ec.
The definition of Ec reflects the following intuition for deciding the best tree for a quartet J: Suppose that there is a tree V with a set Q of leaves such that V | K = T(K) for all four of the five quartets K Q other than J. Then V is uniquely determined by Theorem 2.1, and it lends support for the correctness of V | J. Hence, in this situation, we subtract a demerit from f(V | J). Procedure 1 reflects this in the case where L(J) = V | J, and procedure 2 reflects this in the case where L(J)
V | J.
Here is an example illustrating the definition of Ec: Suppose that the "correct tree" T = (1 (2 (3 (4 (5 6))))) and that the list L satisfies the condition that L(J) = T | J for all J, except that L({1, 2, 3, 4}) = ((1 3) (2 4)) instead of T | J = ((1 2) (3 4)). I illustrate how Ec(L) corrects L by computing that Ec(L)({1, 2, 3, 4}) = ((1 2) (3 4)). Let J = {1, 2, 3, 4}. Initially, f(((1 2) (3 4))) = 2. If Q = {1, 2, 3, 4, 5}, let V = (1 (2 (3 (4 5)))), and note that V is found from ((1 2) (3 4)) by a certain placement of taxon 5. Now V will lead to the subtraction of 1 from f(T | J) since all the quartets K which are subsets of Q other than J itself satisfy L(K) = V | K. Similarly, if Q = {1, 2, 3, 4, 6} and V = (1 (2 (3 (4 6)))), then V leads to the subtraction of 1 from f(T | J), since all the quartets K which are subsets of Q other than J itself satisfy L(K) = V | K. It follows that f(T | J) = 0. On the other hand, for any elementary tree U with a set J of leaves, U T | J, we find f(U) = 2. In particular, no placement of 5 into U = ((1 3) (2 4)) exactly matches L on quartets other than J; for example, tree V = ((1 3) (2 (4 5))) satisfies V | {1, 2, 3, 5} = ((1 3) (2 5)), while L({1, 2, 3, 5}) = ((1 2) (3 5)). Now Ec(L)(J) = ((1 2) (3 4)) = T | J, since f((1 2) (3 4)) < f(U) for U with leaves in J, U
((1 2) (3 4)).
The computation of Ec requires the consideration of all 26 distinct elementary trees for each quintet Q. The number of cases is thus 26C(N, 5), where C(N, k) = N!/[k!(N - k)!] is the binomial coefficient, so that the computation requires time O(N5). The function f(U) has four different trees U for each quartet J and hence a domain of size 4C(N, 4). If the user is interested only in trees that are resolved, then the map can easily be redefined to consider only trees V which are resolved and only trees U which are resolved. The number of cases is then only 15C(N, 5), which is still O(N5), and the function f(U) has a domain of size 3C(N, 4).
Suppose that the correct tree is T. The following theorem gives sufficient conditions under which Ec(L)(J) will agree with T | J even if L(J) disagrees with T | J. This result permits us to call Ec an error-correcting map. What makes the theorem useful is that knowledge of T is not required in the definition of Ec. In the statement, w(J) refers to the number of quartets used in correcting L(J) that are "wrong" or "erroneous" if the correct tree is T; in this situation, we regard any quartet K as "correct " in L if and only if L(K) = T | K. The corollary then states that when L has at most (N - 4)/2 erroneous quartets, then Ec(L) gives exactly the correct tree T.
Theorem 3.1. Let T be an elementary tree with N taxa and, for a fixed quartet J, let w(J) denote the number of quartets K such that |J K| = 3 and L(K)
T | K.
Proof. See Supplementary Material.
Corollary 3.2. Suppose that T is an elementary tree, L is a list, and D(L, T*) (N - 4)/2. Then Ec(L) = T*.
Proof. We show that for each quartet J we have Ec(L)(J) = T | J. If L(J) = T | J, then case 2 of the theorem shows that Ec(L)(J) = T | J, since we will have w(J) (N - 4)/2. If L(J)
T | J, then the number w(J) in the theorem satisfies w(J) < (N - 4)/2 (since J itself is one of the erroneous quartets), so case 1 of the theorem shows that Ec(L)(J) = T | J.
Corollary 3.3. Suppose that L has at most (N - 4)/2 erroneous quartets. Then each quartet J is correct in Ec(L).
The following result shows that our error-correcting map Ec is the best possible; there can exist no error-correcting map Ec' that will always correct one more than (N - 4)/2 erroneous quartets.
Proposition 3.4. Let N > 4. There does not exist any map Ec': PN PN such that, for each elementary tree T, whenever L is a list such that D(T*, L)
(N - 4)/2 + 1 = N/2 - 1, D(T*, Ec'(L)) = 0.
Remark. It is possible to make changes in the definition of Ec given above and obtain other maps Ec': PN PN which satisfy the conclusions of 3.1, 3.2, and 3.3. An interesting question is to select which map Ec among the various possibilities has the best properties. The map Ec defined in this section appears "conservative" in the sense that, although it is more computationally intensive, in various simulations with very noisy data it seemed to introduce fewer additional erroneous quartets in comparison with some alternative maps Ec'.
![]() |
Iteration of the Error Correction Process |
---|
In practice, iteration generally leads to a fixed point Eck(L) such that Eck+1(L) = Eck(L). Moreover, the fixed point has been reached after a handful of applications of Ec. The following example, however, illustrates the possibility of a cycle of period 2. Let N = 5 and consider the list L containing ((1 2) (3 4)), ((1 5) (2 3)), ((1 2) (4 5)), ((1 5) (3 4)), and ((2 5) (3 4)). Then Ec(L) contains ((1 2) (3 4)), ((1 2) (3 5)), ((1 5) (2 4)), ((1 5) (3 4)), and ((2 5) (3 4)), differing from L in exactly two quartets. Moreover, Ec(Ec(L)) = L, so that as Ec is iterated, the lists alternate forever in a cycle of period 2. For this example, let T = ((1 2) (5 (3 4))) and U = ((1 5) (2 (3 4))). Then for each k, Eck(L) differs from T in exactly one quartet, and Eck(L) also differs from U in exactly one quartet.
Corollary 3.4 shows that use of Ec when there are (N - 2)/2 erroneous quartets could conceivably lead to the incorrect tree. We suspect, however, that the probability of this happening is low if erroneous quartets are randomly, rather than systematically, located. Theorem 3.1 shows that Ec(L)(J) can be erroneous only if there are at least (N - 4)/2 erroneous quartets among a particular collection of 4(N - 4) quartets associated with J. If there are not too many erroneous quartets and they appear randomly, then (N - 4)/2 erroneous quartets are unlikely to appear in any of those particular collections. Thus, on average, the procedure should yield correct trees even if there are somewhat more than (N - 4)/2 erroneous quartets initially.
Hence, when the number of erroneous quartets is not too large, we may hope that successive applications of the error-correcting map Ec should improve the information in each step.
![]() |
Relation to the Buneman Tree and its Refinements |
---|
A set U of splits of S is compatible if there exists an elementary tree T with U Sp(T). It is known that splits A | B and C | D of S are compatible if and only if one of A
C, A
D, B
C, B
D is empty, and a set U of splits of S is compatible if and only if each pair of splits in U is compatible (Buneman 1971
).
If L is a quartet list and A | (S - A) is a split of S, define W(A, S - A, L) = {({a, b}, {c, d}): a A, b
A, c
S - A, d
S - A, a
b, c
d, L({a, b, c, d})
((a b) (c d))}. If T is an elementary tree with the set S of leaves, and there exists an edge e of T with T(e) = A | (S - A), then W(A, S - A, T*) =
. Hence, if T is an elementary tree, if there exists an edge e of T with T(e) = A | (S - A), and if we regard L as an approximation of T*, then each ({a, b}, {c, d})
W(A, S - A, L) yields an erroneous quartet J = {a, b, c, d}. Briefly, W(A, S - A, L) specifies the quartets that appear "wrong" in L for the split A | (S - A).
Given a list L, there are some particularly interesting collections of splits. Define Bu(L) = {A | B: A | B is a split of S and W(A, B, L) = }. Define RBu(L) = {A | B: A | B is a split of S and |W(A, B, L)| < (N - 3)/2}. Define CBu(L) = {A | B: A | B is a split of S and |W(A, B, L)| < (|A| - 1)(|B| - 1)/2}
{A | B: A | B is a trivial split of S}. Given a real parameter m, define HBum(L) = {A | B: A | B is a split of S and |W(A, B, L)| < m(|A| - 1)(|B| - 1)/2}
{A | B: A | B is a trivial split of S}. Define HBu(L) = HBum(L) for the maximal m such that HBum(L) is compatible.
It is clear that Bu(L) RBu(L)
CBu(L)
HBu(L), since for any nontrivial split A | B, one can verify that (N - 3)/2
(|A| - 1)(|B| - 1)/2. Buneman (1971)
showed that Bu(L) is compatible and Bu(L) = Sp(T) for a unique tree, here called the Buneman tree, for L. It follows from Moulton and Steel (1999)
that RBu(L) is compatible and RBu(L) = Sp(T) for a unique tree, which, by analogy to their nomenclature, will be called the refined Buneman tree for L. An argument like that in Berry and Bryant (1999)
shows that CBu(L) is compatible and CBu(L) = Sp(T) for a unique tree, which, by analogy to their nomenclature, will be called the C-tree for L. Note that HBu1(L) = CBu(L), so that HBum(L) generalizes the C-tree. The tree T for which HBu(L) = Sp(T) will be called the hypercleaned Buneman tree; it is studied in Berry et al. (2000), where a polynomial time algorithm is given for its computation.
The Buneman tree for L is the tree T such that each edge of T is consistent with all the quartets in L. The major shortcoming of the Buneman tree is that it is often very incompletely resolved. Since Bu(L) RBu(L)
CBu(L)
HBu(L), these latter trees are refinements of the Buneman tree for L, often giving a more completely resolved tree.
The next result shows that the Buneman tree for Ec(L) is a refinement of the Buneman tree for L. Hence, the correction map Ec does not introduce errors incompatible with the strongly supported edges of the Buneman tree for L. Often, the Buneman tree for Ec(L) has strictly more edges than the Buneman tree for L, indicating higher resolution. Sometimes the Buneman tree for Ec2(L) may have still more resolution. The process can be repeated until Eck(L) has become periodic.
Theorem 5.1. Let L be a quartet list. Then Bu(Ec(L)) Bu(L).
Corollary 5.2. Iteration of the map Ec applied to any list L leads to a sequence of successive refinements of the Buseman tree: Bu(L) Bu(Ec(L))
Bu(Ec2(L))
...
Bu(Eck(L))
... .
While I am not able to prove that in all cases CBu(L) CBu(Ec(L)), numerous simulations have not yielded a single case in which this inclusion has failed, whereas it has frequently happened that the inclusion was strict. Similarly, it often happens that HBu(L)
HBu(Ec(L)), and the inclusion may be strict. An example with biological data is given in the Biological Examples section.
![]() |
Simulations: Methods |
---|
Given the strings associated with the leaves of the tree T, an initial list L of quartets was generated in each of five different ways: maximum parsimony (MP), maximum likelihood (ML), higher-order parsimony (HOP), neighbor joining with Hamming distance (NJH), or neighbor joining with the Jukes-Cantor distance (NJJC). For ML, the likelihoods for each quartet were computed using the HKY (Hasegawa, Kishino, and Yano 1985
) model of substitution, exactly as in the program PUZZLE (Strimmer and von Haeseler 1996
). For HOP, the HOP estimates were computed as in Willson (1999b). For NJH, the neighbor-joining formulas of Swofford et al. (1996)
were utilized with the initial distance being the Hamming distance (the uncorrected number of sites differing between the two strings). For NJJC, neighbor-joining was applied with the initial distance being the corrected Jukes-Cantor distance. For each case, I computed the resulting number b of "bad quartets," i.e., the number of quartets J such that L(J)
T | J. I then applied the error correction method to each L and iterated the result until there were no further changes observed in the lists Eck(L) or a cycle of period 2 had been identified. The value of b after this error correction was also found.
Small values of b are best. If b = 0, then the tree T could be reconstructed perfectly by the methods described in Bandelt and Dress (1986)
, Strimmer and von Haeseler (1996)
, or Willson (1999a)
. If b is small, then various methods of reconstruction usually find the correct tree T, but as b increases, the reconstructed tree becomes more problematic. Each list contains C(12, 4) = 495 quartets, so the maximum possible value of b was 495.
![]() |
Simulations: Results |
---|
|
Figure 3 shows for many data runs the points (b1, b2), where b1 is the number of erroneous quartets in some original list L, and b2 is the number of erroneous quartets in the list obtained by iterated use of the error correction map Ec. The original lists L were obtained by HOP, ML, MP, NJH, or NJJC, as indicated. For comparison, the line b2 = b1 is also drawn.
|
It is also interesting to compare the trees obtained from the lists using various reconstruction methods. I report one (difficult) case contributing to figure 2
, obtained when S = 450. Given a list X, I performed 101 replications using X to reconstruct candidate trees as in Willson (1999a
), adding taxa in a random order but in each step minimizing the number of quartets incompatible with the proposed tree. The consensus tree so obtained is denoted X#, and the tree obtained which is closest to X is denoted X+ (so that U = X+ minimizes D(X, U) among all trees U found in 101 replications); the correct tree is denoted T. For X = ML = the list found by maximum likelihood, it turns out that D(ML, T) = 82, while D(ML, ML+) = 81, with a tie for ML+, showing that the correct tree T was not the closest tree to ML. By contrast, for X = ML' = Eck(ML) for large k, D(ML', T) = D(ML', ML'+) = 27, and, in fact, T was the closest tree to ML' obtained. The consensus trees have D(ML#, T) = 33, while D(ML'#, T) = 0, again showing that error correction applied to ML improved the signal for the correct tree, since the consensus tree ML'# was the correct tree. For NJJC, D(NJJC#, T) = 24, while D(NJJC+, T) = 0, with D(NJJC, T) = 46, showing a mixed signal for the correct tree; by contrast, NJJC'# = NJJC'+ = T with D(NJJC', T) = 12, yielding a strong signal for the correct tree. HOP finds HOP# = HOP+ = HOP'# = T. For comparison, PUZZLE (Strimmer and von Haeseler 1996
) finds an incompletely resolved consensus tree P such that D(P, T) = 114. The software program PAUP* (Swofford 1999
) very quickly finds the neighbor-joining tree NH using the Hamming distance, which is, however, highly inaccurate, satisfying D(NH, T) = 195. Using instead the Jukes-Cantor distance, PAUP* finds the neighbor-joining tree NC, which is much better but still inaccurate, satisfying D(NC, T) = 24 with NC = NJJC#. (The improvement from using the Jukes-Cantor distance rather than the Hamming distance is impressive.) PAUP* also quickly finds the MP tree P, which is highly inaccurate, satisfying D(P, T) = 185. On the other hand, taking much more computer time, it finds that the ML tree is the correct tree T.
For this same example, we may indicate the effectiveness of the refinements of the Buneman trees by reporting the number of splits (edges) found in each case. The correct tree T had 21 splits, including the 12 trivial splits. All splits found in the cases reported were correct (i.e., were also splits in T), but the trees differed significantly in resolution. Bu(ML), RBu(ML), and CBu(ML) all had 12 splits (and hence no resolution), while maximal hypercleaning led to HBu(ML) with only 13 splits. By contrast, Bu(ML') had 16 splits, RBu(ML') and CBu(ML') had 18 splits, and HBu(ML') had 20 splits, so that error correction greatly increased the resolution. HOP had comparable results: Bu(HOP) had 12 splits, RBu(HOP) had 13 splits, CBu(HOP) had 16 splits, and HBu(HOP) had 19 splits. Error correction again improved resolution: Bu(HOP') had 18 splits, RBu(HOP') and CBu(HOP') had 19 splits, and HBu(HOP') had 20 splits. The computation of each of these trees took only a few seconds.
![]() |
Biological Examples |
---|
Figure 4 shows the tree HBu(ML') that was obtained. The maximal parameter m so that HBum(ML') is compatible is m = 2.0; the choice m = 2.001 leads to incompatible splits, and one can verify that no m > 2.0 leads to compatible splits. Internal nodes are labeled with letters. Note that the tree is completely resolved. The tree satisfies the condition that D(HBu(ML'), ML') = 69 while D(HBu(ML'), ML) = 122; hence, 53 out of the 59 quartets that were changed from ML to ML' were changed to favor HBu(ML'). All other trees reported here have less resolution and may be found by deleting selected edges from HBu(ML'). The tree HBu(ML) lacks the edge ej (so that it appears as though edge ej has been shrunk to a point), showing that the list ML' obtained using error correction provides more resolution than does the list ML even when hypercleaning is utilized. Moreover, D(HBu(ML), ML) = 196, while D(HBu(ML'), ML) = 122, showing that HBu(ML') has an even shorter quartet distance to ML than does HBu(ML). The tree CBu(ML') lacks both de and ej, showing that increased resolution may be obtained by hypercleaning rather than accepting the C-tree. The tree CBu(ML) = RBu(ML) lacks the edges de, ej, bd, and km, while Bu(ML) lacks the edges de, ej, bd, km, and ab; in contrast, the tree Bu(ML') = RBu(ML') lacks only the edges de, ej, and bd. This calculation shows that ML' provides more resolution than ML even when the more conservative trees Bu and RBu are utilized.
|
For a second example, Strimmer and von Haeseler (1996)
studied the phylogeny of certain amniotes. Their tree T was produced by first finding a quartet list ML by maximum likelihood and then using quartet puzzling. When we apply error correction to ML, we obtain a list ML'. The same tree T is obtained using ML' and the methods of Willson (1999a)
, but with a stronger signal: D(ML, T) = 67, while D(ML', T) = 12.
![]() |
Final Comments |
---|
Still, in my experience, the quality of L has the greatest impact on the results. If L is generated by an unreliable method (e.g., MP or NJH in the presence of serious long-branch attraction), then Ec(L) may yield no improvement.
The map Ec is not difficult to program, taking time O(N5). Code to perform the correction (for the case of completely resolved trees) is available from my website at http://www.public.iastate.edu/swillson. The program is fast; a run for 12 taxa takes only a few seconds.
![]() |
Acknowledgements |
---|
![]() |
Footnotes |
---|
1 Keywords: phylogenetic tree
quartet
Buneman tree
hyperclean
error correction
2 Address for correspondence and reprints: Stephen J. Willson, Department of Mathematics, Iowa State University, Ames, Iowa 50011. E-mail: swillson{at}iastate.edu
![]() |
literature cited |
---|
Bandelt, H.-J., and A. Dress. 1986. Reconstructing the shape of a tree from observed dissimilarity data. Adv. Appl. Math. 7:309343.[ISI]
Berry, V., and D. Bryant. 1999. Faster reliable phylogenetic analysis. Pp. 5968 in S. Istrail, P. Pevzner, and M. Waterman, eds. Proceedings of the Third Annual International Conference on Computational Molecular Biology. ACM Press, New York.
Berry, V., D. Bryant, T. Jiang, P. Kearney, M. Li, T. Wareham, and H. Zhang. 2000. A practical algorithm for recovering the best supported edges of an evolutionary tree. Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms. Association for Computing Machinery, New York, and Society for Industrial and Applied Mathematics, Philadelphia.
Buneman, P. 1971. The recovery of trees from measures of dissimilarity. Pp. 387395 in F. R. Hodson, D. G. Kendall, and P. Tautu, eds. Mathematics in archaeological and historical sciences. Edinburgh University Press, Edinburgh, Scotland.
D'Erchia, A., C. Gissi, G. Pesole, C. Saccone, and U. Arnason. 1996. The guinea-pig is not a rodent. Nature 381:597600.
Estabrook, G. F., F. R. McMorris, and C. A. Meacham. 1985. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst. Zool. 34:193200.[ISI]
Farris, J. S., A. G. Kluge, and M. J. Eckardt. 1970. A numerical approach to phylogenetic systematics. Syst. Zool. 19:172191.[ISI]
Felsenstein, J. 1978. Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. 27:401410.[ISI]
. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368376.[ISI][Medline]
Fitch, W. M. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20:406416.[ISI]
Gusfield, D. 1991. Efficient algorithms for inferring evolutionary trees. Networks 21:1928.
Hasegawa, M., H. Kishino, and K. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160174.[ISI][Medline]
Jiang, T., P. Kearney, and M. Li. 1998. Orchestrating quartets: approximation and data correction. Pp. 416425 in Proceedings of the 39th IEEE Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, Calif.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. Updating the Dayhoff matrix. Comput. Appl. Biosci. 8:275282.[Abstract]
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 7995 in S. Osawa and T. Honjo, eds. Evolution of life: fossils, molecules, and culture. Springer-Verlag, Tokyo.
Moulton, V., and M. Steel. 1999. Retractions of finite distance functions onto tree metrics. Discrete Appl. Math. 91:215233.[ISI]
Robinson, D. F., and L. R. Foulds. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131147.[ISI]
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406425.[Abstract]
Steel, M., and D. Penny. 1993. Distributions of tree comparison metricssome new results. Syst. Biol. 42:126141.[ISI]
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964969.
Swofford, D. L. 1999. PAUP*. Phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer, Sunderland, Mass.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. 2nd edition. Sinauer, Sunderland, Mass.
Willson, S. J. 1998. Measuring inconsistency in phylogenetic trees. J. Theor. Biol. 190:1536.[ISI][Medline]
. 1999a. Building phylogenetic trees from quartets by using local inconsistency measures. Mol. Biol. Evol. 16:685693.
. 1999b. A higher-order parsimony method to reduce long-branch attraction. Mol. Biol. Evol. 16:694705.
|