Quartet-Mapping, a Generalization of the Likelihood-Mapping Procedure

Kay Nieselt-Struwe and Arndt von Haeseler

*Max-Planck-Institut für biophysikalische Chemie, Göttingen, Germany; and {dagger}Max-Planck-Institut für evolutionäre Anthropologie, Leipzig, Germany


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 appendix a
 appendix b 
 appendix c  
 appendix d   
 References
 
Likelihood-mapping (LM) was suggested as a method of displaying the phylogenetic content of an alignment. However, statistical properties of the method have not been studied. Here we analyze the special case of a four-species tree generated under a range of evolution models and compare the results with those of a natural extension of the likelihood-mapping approach, geometry-mapping (GM), which is based on the method of statistical geometry in sequence space. The methods are compared in their abilities to indicate the correct topology. The performance of both methods in detecting the star topology is especially explored. Our results show that LM tends to reject a star tree more often than GM. When assumptions about the evolutionary model of the maximum-likelihood reconstruction are not matched by the true process of evolution, then LM shows a tendency to favor one tree, whereas GM correctly detects the star tree except for very short outer branch lengths with a statistical significance of >0.95 for all models. LM, on the other hand, reconstructs the correct bifurcating tree with a probability of >0.95 for most branch length combinations even under models with varying substitution rates. The parameter domain for which GM recovers the true tree is much smaller. When the exterior branch lengths are larger than a (analytically derived) threshold value depending on the tree shape (rather than the evolutionary model), GM reconstructs a star tree rather than the true tree. We suggest a combined approach of LM and GM for the evaluation of starlike trees. This approach offers the possibility of testing for significant positive interior branch lengths without extensive statistical and computational efforts.


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 appendix a
 appendix b 
 appendix c  
 appendix d   
 References
 
Recently, likelihood-mapping (LM), a method of visualizing the phylogenetic signal of a sequence alignment in a single graph, was proposed (Strimmer and von Haeseler 1997Citation ). With this method, an a priori assessment of the phylogenetic content, similar to that of the method of statistical geometry in sequence space (Eigen, Winkler-Oswatitsch, and Dress 1988Citation ), of either the whole set of sequences or a predefined partition into four subsets is possible. In addition, LM can be used to test a posteriori the confidence of an inner branch in a phylogenetic tree. The procedure of LM is based on the three maximum likelihoods for the three possible quartet trees of aligned sequences, which are transformed into barycentric coordinates. The position of the point with these coordinates in a simplex visualizes which of the seven possible treelike and nontreelike quartet graphs represents the underlying topology of divergence of the sequences.

In general, any quartet-based evaluation method of phylogenetic signal is implementable in the setup of LM. We coin the general principle quartet-mapping. In this paper, we extend the LM procedure by combining its principle with the method of statistical geometry in sequence space (Eigen, Winkler-Oswatitsch, and Dress 1988Citation ). Both LM and its extension to statistical geometry return the coordinates of a point which is mapped into the same graph. Applied to the same sequences, the two methods can therefore be used to directly compare the positions of the points and consequently the proposed topologies of the sequences. One should, however, notice a subtle difference between the two approaches. LM is built on the foundation of a stochastic model of sequence evolution, and thus the barycentric coordinates have the statistical interpretation of posterior probabilities. With the extension of this approach to statistical geometry in sequence space or maximum parsimony, for example, a clear statistical interpretation of the coordinates is lacking. However, the statistical properties of the two quartet-mapping procedures with respect to their ability to reconstruct the true evolutionary history have not been investigated so far. Although in numerous investigations it has been shown that maximum likelihood (ML) is generally robust even when evolution assumptions are violated in the reconstruction model (Gaut and Lewis 1995Citation ; Huelsenbeck 1995Citation ; Yang 1996Citation ), its performance in the LM framework is not clear. One explicit question is that of support for the correct topology, especially for the star tree. In principle, a likelihood ratio test can be constructed for the star tree versus a tree phylogeny from the ML estimates (Churchill, von Haeseler, and Navidi 1992Citation ; Strimmer, Goldman, and von Haeseler 1997Citation ; Yang and Rannala 1997Citation ; Whelan and Goldman 1999Citation ; Ota et al. 2000Citation ). However, it has been shown that under the null hypothesis, the likelihood ratio test's statistics deviate from the usual {chi}2 approximation (Goldman 1993Citation ; Gaut and Lewis 1995Citation ), and thus the test is unreliable. The bootstrap test as introduced into the phylogenetic world by Felsenstein (1985) tends to support the tree favored by the tree-making method (Zharkikh and Li 1992Citation ). Another test, called the interior-branch test, computes the standard error of estimates of edge lengths (Nei, Stephens, and Saitou 1985Citation ; Li 1989Citation ; Rzhetsky and Nei 1992Citation ). Here, the emphasis of the test lies on the assessment of the significance of a positive branch length by means of a normal deviate test. Both the bootstrap test and the interior branch test have been shown to become conservative for starlike trees (Sitnikova 1996Citation ).

In this paper, we use the quartet-mapping approach of ML and statistical geometry to investigate the following two questions: (1) Which tree shapes occur for which evolution models? (2) Given that the tree shape is correct, when is an interior branch length reliably positive? For these investigations, we use two approaches: We simulate sequence data evolving along a wide range of tree models under a broad range of parameter settings, including differences in the ratio of transition to transversion substitutions, discrete gamma-distributed site variation, unequal base composition, and unequal rate matrix entries. From the sequences, the simplex points are computed using the ML estimates as well as the statistical geometry parameters. We compare the positions of the points in the simplex relative to each other, and we compare the points with the true points as obtained from the tree models.

In addition, we compute the expected coordinates of the simplex point obtained from the statistical geometry parameters using the Jukes-Cantor (Jukes and Cantor 1969Citation ) and the Kimura (1980)Citation two-parameter substitution models.

We investigate the behavior of LM and its extension to statistical geometry under violation of the evolutionary model. We also compare the efficiencies of both methods in distinguishing between starlike trees and fully resolved trees. We investigate under which circumstances a test for a positive interior branch length can be constructed without extensive statistical and computational efforts.


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 appendix a
 appendix b 
 appendix c  
 appendix d   
 References
 
Quartet-Mapping
In this section, we introduce an extension of the LM procedure (Strimmer and von Haeseler 1997Citation ). The main feature of LM is the display of the underlying mode of evolution of quartets of aligned sequences in a single graph. There are seven graphs connecting four taxa (Eigen, Winkler-Oswatitsch, and Dress 1988Citation ): three fully resolved unrooted trees, the completely unresolved star tree, and the three partially resolved "trees" accounting for the cases in which it is difficult to distinguish between two of the three perfect trees. We denote the three perfect trees by T1, T2, and T3, the star tree by T*, and the three amalgamated trees by T12, T13, and T23. LM computes for a quartet of aligned sequences the three MLs L1, L2, and L3 belonging to T1, T2, and T3, respectively, using any model for the evolution of the sequences. The three likelihoods are transformed into posterior probabilities pi = Li (L1 + L2 + L3), i = 1, 2, 3, by applying Bayes' theorem assuming a uniform prior for all three trees. The three probabilities can be viewed as the barycentric coordinates of a vector p = (p1, p2, p3), and the corresponding vector is mapped onto a two-dimensional simplex


where the ei are the unit vectors. In the simplex, seven points are distinguished, corresponding to the seven quartet "trees" (see fig. 1 ). These are the three p vectors (1, 0, 0), (0, 1, 0), and (0, 0, 1) of the perfect tree topologies T1, T2, and T3, respectively, the center point with vector p = (1/3, 1/3, 1/3) of the star topology T*, and the three points of the partially resolved tree topologies T12, T13, and T23, whose corresponding p vectors are (1/2, 1/2, 0), (1/2, 0, 1/2), and (0, 1/2, 1/2). Using the Euclidean distance, the simplex can be subdivided into seven areas (so-called Voronoi cells), where each point in one region has the smallest Euclidean distance to its corresponding point representing one of the seven quartet "trees." For example, the region R(T*) with the points closest to p = (1/3, 1/3, 1/3) of the star topology T* is characterized as

Similarly,


Of course, the Euclidean distance can be substituted by any other reasonable metric.



View larger version (18K):
[in this window]
[in a new window]
 
Fig. 1.—The two-dimensional simplex graph of the quartet-mapping procedure. The graph shows the attractors (marked with dots), along with corresponding regions (Voronoi cells), of the seven possible graphs connecting four taxa (A, B, C, D)

 
By this approach, a one-to-one relationship between the topology of the quartet and a single point in the simplex is established. As will now be shown, this procedure is not limited to ML estimators. In fact, any quartet-based approach that evaluates some objective function is implementable in the setup of the LM procedure. The generalized method is accordingly coined "quartet-mapping." In general, each quartet method computes the support of a quartet of sequences for each of the three possible perfect trees. Let {sigma}1 be the support of tree T1, {sigma}2 that of tree T2, and {sigma}3 that of tree T3 as computed by an arbitrary quartet method. We define the relative support


of tree Ti. Then 0 <= si <= 1 and {Sigma}i si = 1. Thus the si can be seen, like pi of the LM method, as barycentric coordinates of a vector s, whose domain is the simplex S2.

Let us exemplify the quartet-mapping approach in more detail for the method of statistical geometry in sequence space (Eigen, Winkler-Oswatitsch, and Dress 1988Citation ). This method computes the sequence space graph of quartets of aligned sequences by analyzing the positionwise entries of the sequences. Let A, B, C, and D be a quartet of aligned sequences of length L, whose characters are drawn from an alphabet of finite length. For nucleotide sequences, there are 15 possible position categories j:


where w, x, y, and z are different nucleotides.

For a given quartet of sequences, one sums the number of positions dj that belong to j. Note that the sum of the 15 dj values is equal to L, the length of the alignment. The underlying topology of divergence of the quartet is obtained from the "informative" categories. A position category is considered informative if it helps to distinguish between different tree shapes. For statistical geometry, the configurations 6, ... , 14 are informative. Configurations 6, 9, and 14 support tree T1, 7, 10, and 13 support tree T2; and 8, 11, and 12 support tree T3.

We define the supports {sigma}GM1, {sigma}GM2, and {sigma}GM3 of the three perfect tree topologies T1, T2, and T3, respectively, as


Then the relative supports siGM of each of the three perfect trees are given by equation (2). This implementation of the quartet-mapping procedure is called geometry-mapping (GM).

An extension of the method of statistical geometry takes a weight matrix M of the alphabet into account (unpublished data). Each entry in the matrix M gives a dissimilarity measure between the two respective symbols of the alphabet. Each position in the alignment of the four sequences A, B, C, and D is then analyzed separately as follows: Let Al, Bl, Cl, and Dl be the entries at position l of A, B, C, and D, respectively. Then the four entries define six pairwise distances, M(Al, Bl), M(Al, Cl), M(Al, Dl), M(Bl, Cl), M(Bl, Dl), and M(Cl, Dl), and three distance sums, M(Al, Bl) + M(Cl, Dl), M(Al, Cl) + M(Bl, Dl), and M(Al, Dl) + M(Bl, Cl). We order these three distance sums by magnitude and label them maxl, medl, and minl, with maxl >= medl >= minl. The six pairwise distances together with maxl are used to compute


Note that one of the three values gl(AB | CD), gl(AC | BD), gl(AD | BC) is always equal to zero. The nonzero values define a two-dimensional diagram as shown in figure 2a whose distance segments are given exactly by these values. For each position l in the alignment, the parameters gl(· | ·) are then computed, and the appropriate distance segments are summed up, yielding g(· | ·) = {Sigma}l gl(· | ·).



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 2.—a, Given four sequences A, B, C, D, the four entries at each position l define six pairwise distances and three distance sums, which are used to solve an equation system in the seven unknowns gl (A | BCD), gl (B | ACD), gl (C|ABD), gl (D|ABC), gl (AB | CD), gl (AC | BD), and gl (AD | BC), of which one of the latter three is always equal to zero. The nonzero values can be matched to two-dimensional graphs with six distance segments. b, Summing all parameters gl ( | ) over all positions l yields seven parameters that can be matched to the distance segments of a three-dimensional box graph

 
The segments g(A | BCD), g(B | ACD), g(C | ABD), and g(D | ABC) refer to the lengths of the edges connecting the sequences A, B, C, and D, respectively, with a three-dimensional box whose edge lengths are determined by the three segments g(AB | CD), g(AC | BD), and g(AD | BC) (see fig. 2b ). These three segments again determine the overall treelikeness of the quartet, as well as the support of each of the three perfect tree topologies. The box dimensions are used to visualize the treelikeness of the quartet in the two-dimensional simplex S2. Again, we define the vector sM = (sM1, sM2, sM3), where now,


In appendix A, we prove that sGM = sM if the off-diagonal entries of the matrix M are equal to one and the diagonal entries are equal to zero.

Since LM and GM are based on the same graph, it is now possible to directly compare various statistical properties of these two tree evaluation methods.

Comparison of Likelihood-Mapping and Geometry-Mapping
We conducted both a theoretical analysis and computer simulations of four-taxon trees, in which we sampled a large proportion of the complete parameter space. The model trees are shown in figure 3 . Note that the branch lengths are expected numbers of nucleotide substitutions per site.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 3.—The star tree and the two unrooted binary trees used for the simulations. The arrow in tree I indicates the location of the hypothetical root

 
In the theoretical analyses, we computed the expected coordinates of the GM simplex vector given the tree model and assuming the Jukes-Cantor model (Jukes and Cantor 1969Citation ) and the Kimura (1980)Citation two-parameter model. In the LM setting, a prediction of the simplex vector, and therefore of the behavior of the ML method, is too difficult. Thus, here the statistical properties were only investigated by means of simulations.

For the simulations, various cases of the HKY model (Hasegawa, Kishino, and Yano 1985Citation ) and a fully reversible REV model (Yang 1994Citation ) were assumed for the sequence evolution. The cases used, together with the model trees, the branch lengths, and the models assumed in the quartet reconstruction, are listed in table 1 . For the fully reversible model, we used the following the "instantaneous nucleotide substitution rate" matrix Q:


Since Q is symmetric, the stationary base composition is uniform.


View this table:
[in this window]
[in a new window]
 
Table 1 Comprehensive List of Model Trees, Evolution Models, and Reconstruction Models Used in the Simulations

 
In all model trees, the branch lengths were increased in steps of 0.05 covering the range from 0.05 to 0.75 (fig. 4 , top); thus, for each tree and each evolution model, a total of 225 parameter combinations were analyzed.



View larger version (22K):
[in this window]
[in a new window]
 
Fig. 4.—The parameter space representing branch length combinations as used in the simulations. The left graph refers to the star tree, the middle graph to the clock tree, and the right graph to the nonclock tree. The gray scale (at the bottom) is also used in figure 7

 
In all simulations, the sequence length was set to 1,000, and for each branch length combination, 200 repetitions were carried out for each set of parameters.

LM was used as implemented in PUZZLE, version 3.0 (Strimmer and von Haeseler 1996Citation ). For the analyses of the sequences, the HKY model with the default option of no site variation was used as the reconstruction model for all simulations. Also prespecified was the transition/transversion parameter {kappa} according to the model used. Thus, in the presence of a gamma distribution of the sites, the reconstruction model leads to a violation of the assumption. Also, when sequences were generated under a fully reversible model, REV, the assumption that all transitions and all transversions had equal substitution rates was violated in the reconstruction process.

GM was used as implemented in STATGEOM, version 2.0. For all simulation models with a transition/transversion parameter {kappa} = 4.0 (where {kappa} = 2s/v; see appendix C) the weighted GM method with a weight matrix M of the alphabet taking this {kappa} ratio into account was used. In the other cases, the unweighted GM method was applied.

In the star topology simulations, the percentage of simplex points placed (correctly) into the star area, as well as the percentage of points mapped (incorrectly) into either of the three tree areas, was recorded.

For the clock and the nonclock trees the percentage of simplex points mapped into the correct tree area was computed.


    Results
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 appendix a
 appendix b 
 appendix c  
 appendix d   
 References
 
For GM, the relative frequencies of different position categories in the alignment determine the coordinates of the simplex point and therefore the topology of the four-taxon graph. For four nucleotide sequences, there are 15 different position categories, as listed in expression (3) . When the evolutionary tree and the model of sequence evolution are known, the probabilities of the different categories can easily be computed (Saitou and Nei 1986Citation ). Appendix B shows the probabilities of each category for the nonclock tree (right graph of fig. 3 ) assuming the Jukes-Cantor model. From the "informative" categories 6, ... , 14, it is then straightforward to compute the expected coordinates of the simplex point sGM = (sGM1, sGM2, sGM3) (see appendix B). Similar calculations can be performed for the Kimura two-parameter model. In appendix C, we compute the supports {sigma}Mi in the GM case when different weights of transitions and transversions are considered.

Given the outer branch lengths a and b, the coordinates of the simplex point can be used to calculate the area into which the simplex point is mapped as a function of the inner branch length c.

In the following sections, we will separately analyze the star tree, i.e., the case in which the expected inner branch length c is equal to zero, and the trees with positive inner branch lengths.

Success in Predicting the Star Tree
Figure 5 shows the results of quartet-mapping for the star tree when HKY substitution models or a general reversible model was simulated. The gray scale is shown in figure 4 . In the diagram, the shades in column 1 represent the percentages of quartets that are in the star tree area, while in columns 2–4, the shades show the percentages of quartets that are in tree areas 1, 2, and 3, respectively. We note that the probability of detecting the star tree T* is virtually independent of the choice of a and b. GM has a better chance of correctly identifying the star topology. LM, on the other hand, is too liberal in accepting a binary tree. Table 2 shows the average percentages of quartets over all 225 parameter combinations of branch lengths a and b that are in the seven simplex areas T*, T1, T2, T3, T12, T13, and T23, respectively. While GM shows almost constant values for all evolution models, the values of LM are quite different for the different evolution models. If the model of sequence evolution is violated in the reconstruction process, then LM has the tendency to favor the wrong topology (%T2 > %T1 and %T3, and %T2 > %T*). This tendency is more pronounced if a << b or vice versa. GM suffers from this tendency only for very small values of a or b.



View larger version (72K):
[in this window]
[in a new window]
 
Fig. 5.—Density plot of the star tree; complete parameter space (for the parameter domain and the gray scale, see fig. 4 ). Indicated are percentages of cases in which likelihood-mapping (top graphs) and geometry-mapping (bottom graphs) placed the simplex point into the star tree area (first column), tree T1 (second column), tree T2 (third column), and tree T3 (fourth column). a, Jukes-Cantor (JC) model. b, F81 model. c, Kimura two-parameter (K2P) model with {kappa} = 4.0. d, HKY model with {kappa} = 4.0 and f = (0.1, 0.3, 0.4, 0.2). e, JC model + {Gamma} with gamma shape parameter {alpha} = 0.8. f, JC model + {Gamma} with {alpha} = 0.2. g, K2P model + {Gamma} with {kappa} = 4.0 and {alpha} = 0.8. h, K2P model + {Gamma} with {kappa} = 4.0 and {alpha} = 0.2. i, REV model

 

View this table:
[in this window]
[in a new window]
 
Table 2 Average Percentage of Quartets in the Star Areas T*, Tree Areas T1, T2, and T3, and Net Areas T12, T13, T23 of 225 Parameter Combinations for the Star Tree

 
We conclude that LM has less power to detect the star tree than does GM. This conclusion is especially valid if the evolutionary model is violated. Then, LM has a strong tendency to suggest a wrong tree.

Success in Predicting the Binary Tree
We first analytically computed the areas into which the simplex point of GM was mapped for the parameter spaces of the clock and the nonclock trees. We used the Jukes-Cantor and the Kimura two-parameter evolutionary models. In the latter case, we compared the properties of the unweighted GM method with those of the weighted GM + M method.

The results of figure 6a show that for the clock tree, depending on the branch length combinations, GM suggests either the correct tree or the star topology. If the weighted GM method is applied, then the T1 region is significantly enlarged. For the nonclock tree (fig. 6b ), the set of branch length combinations for the true tree area is smaller than that for the clock tree. For most parameter values, GM suggests either the T12 topology or the star tree. For a very small range of branch length combinations, the simplex point is even mapped into the "inconsistent" tree area T2 = T(AC | BD).



View larger version (23K):
[in this window]
[in a new window]
 
Fig. 6.—For each parameter combination, the expected area is plotted, and the simplex point of the geometry parameters as obtained from the analytical equations is mapped into it. In the left graph, the Jukes-Cantor model was assumed, and in the middle and right graphs, the Kimura two-parameter model was assumed. In the middle graph, the statistical geometry parameters with an unweighted alphabet metric were computed; in the right graph, a weighted metric with the correct transition/transversion parameter was used. a, Density plot of the clock tree (I). b, Density plot of the nonclock tree (II)

 
While figure 6 displays the analytical results for GM, such calculations seem impossible for the LM situation. Also, for the more complex evolution models, analytical expressions relating the model to the expected coordinates of the simplex point are rather difficult. Consequently, figure 7 displays the results from simulations for the clock tree. For each branch length combination, the percentage of cases in which LM and GM map the simplex point into the correct tree area is shown. Figure 7 follows the color shades of figure 4 . The unfavorable bias of LM in the starlike situation is now clearly an advantage. In a wide range, the probability of recovering the true tree is >95%, even in the cases when the evolution model assumptions are violated. The parameter range for which GM maps >95% of the points into the correct tree area is much smaller. The separating line of the 95% area is almost equal to the respective analytical curves. The line is shifted to the left, i.e., to even smaller outer branch lengths, when the sequences evolve with a severe site variation (see fig. 7f and i ).



View larger version (36K):
[in this window]
[in a new window]
 
Fig. 7.—Density plot of the clock tree I parameter space. Indicated are percentages of cases in which likelihood-mapping (left graphs) and geometry-mapping (right graphs) placed the simplex point into the correct tree area. The gray scale is given in figure 4 . a, Jukes-Cantor (JC) model. b, Kimura two-parameter (K2P) model. c, HKY model. d, F81 model. e, JC + {Gamma} with {alpha} = 0.8. f, JC + {Gamma} with {alpha} = 0.2. g, REV model. h, K2P + {Gamma} with {alpha} = 0.8. i, K2P + {Gamma} with {alpha} = 0.2

 
The results of the simulations of nonclock tree II reinforce the clear superiority of the ML method in detecting a tree (data not shown). In almost the entire parameter space, >95% of all quartets are mapped into the correct tree area. The probability was only slightly reduced when a severe site variation was assumed, and the difference between the two equal branch lengths and the three equal branch lengths was large. GM, in comparison, is successful in reconstructing the correct tree in only a very small part of the parameter space.

We conclude that the success and the stability of LM is virtually independent of the assumed evolution model. The branch length combinations for which LM reconstructs the true tree cover a large part of the parameter space. The results of the GM analyses, on the other hand, show a clear dependency of success in reconstructing the true tree on the tree shape rather than the evolution model.

The simulations and the analytical considerations indicate a relationship between the lengths of the outer branches, a and b, and that of the inner branch, c, that determines into which simplex area the GM point is mapped. From the calculations in appendix D, we find that for the Jukes-Cantor model, the sum of the two outer branch lengths has to be smaller than a threshold value as determined in equation (15) such that there exists a positive minimal inner branch length c, given by equation (14) , for which the corresponding simplex point is mapped into the correct tree area:

This is the threshold line that separates the correct tree area from the star tree area in the clocktree as shown in figure 6 . In the Jukes-Cantor case, if the sum of a and b is larger than this threshold value, then it is impossible to find the true tree based on GM.

A similar threshold for c is, of course, conceivable for more complex models of sequence evolution. To estimate that minimal value, cmin, we computed the location of s as a function of c for three examples with fixed outer branch lengths: a = 0.1 for clock tree I, and a = 0.01, b = 0.1 and a = 0.005, b = 0.1, respectively, for nonclock tree II. c was increased in steps of 0.005 starting from c = 0.005. Based on 200 simulations for a fixed c, the vector (c) was computed, and the minimal value cmin, for which (c) was mapped into the correct tree area T1 was determined. Moreover, we computed the minimal inner edge length c0.95min such that at least 95% of the simulated s vectors were in T1. The results are shown in table 3 .


View this table:
[in this window]
[in a new window]
 
Table 3 Minimal Inner Branch Lengths (cmin) for Which the Mean Simplex Point Computed from 200 Runs Is Mapped into the Correct Tree Area, and Minimal Inner Branch Lengths (c0.95min) for Which at Least 95% of All Quartets Are Mapped into the Correct Tree Area

 
We first note that the analytically computed values from equation (14) for the Jukes-Cantor model in the GM case are identical with the respective values of table 3 . Two further results are notable: the minimal branch length cmin is equal to 0.01 ± 0.005 in the LM situation for all trees and models. This confirms our previous conclusion that LM is robust in detecting the correct tree even if the model assumptions are violated.

Second, cmin as computed by the GM method is in all cases larger than cmin of LM and changes considerably with the evolution models. It ranges between 0.02 (Jukes-Cantor) and 0.55 (Kimura two-parameter + {Gamma} (0.2)). We note too that the differences between cmin and c0.95min are larger for the GM method than for the LM implementation.

We conclude that GM is successful in reconstructing the correct tree when the following three conditions are fulfilled: (1) the sequences evolve without a severe site variation; (2) the total sum of the outer branch lengths is less than 0.9, and thus the mean length of one outer branch does not exceed 0.225; and (3) the interior branch length is larger than a minimal value that depends on the lengths of the outer branches.

Applications
Here we illustrate the two methods for two data sets. The first data set consisted of 24 mitochondrial sequences (11,133 bp) from three eutherian classes (18 eutheria, 2 marsupialia, and 1 monotremata) and three other noneutherian sequences (chicken, lungfish, and frog). The three codon positions were analyzed separately. For all codon positions, LM placed most of the quartets into the region of tree T1 (fig. 8a ), thus suggesting a branching pattern of eutheria and noneutheria versus marsupialia and monotremata. GM, on the other hand, mapped all quartets into the star tree area, suggesting an undefined branching pattern (fig. 8b ).



View larger version (19K):
[in this window]
[in a new window]
 
Fig. 8.—Four-cluster quartet-mapping of mitochondrial DNA (11,133 bases). Sequences were divided into four disjoint groups: 18 eutheria, 2 marsupialia, 1 monotremata, and 3 noneutheria. The three codon positions were analyzed separately (left: codon position 1; middle: codon position 2; right: codon position 3, respectively). a, Results of likelihood-mapping. b, Results of quartet-mapping

 
The second data set was taken from Kuiken et al. (1994). It consisted of 72 HIV-1 sequences of the hypervariable region V3 of the external envelope of the virion. The sample covered different risk groups and different infection years. Kuiken et al. (1994) showed that the data set had been randomized to a large extent, thus allowing almost no phylogenetic interpretation. However, one clustering, associated with risk group, could be distinguished. Therefore, we divided the data set into two groups, the group of HIV-1-infected drug users (IVD), and the group of HIV-1-infected homosexuals as well as HIV-1-infected hemophiliacs (non-IVD), and applied quartet-mapping to this clustering. Both LM and GM confirm the result of Kuiken et al. (1994) (fig. 9 ). Although GM maps a larger amount of the quartets into the star tree area, both mapping methods support the clustering of the IVD group versus the non-IVD group, while there are only few quartets that cannot be clustered within these risk groups.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 9.—Quartet-mapping graphs of 72 HIV-1 sequences of hypervariable region V3 of the external envelope of the virion. The set was divided into two groups: the group of HIV-1-infected drug users (IVD), and the group of the others (non-IVD). A random sample of 5,000 quartets were analyzed. a, Results of likelihood-mapping. b, Results of quartet-mapping

 

    Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 appendix a
 appendix b 
 appendix c  
 appendix d   
 References
 
In this paper, we introduced an extension of the LM procedure to other quartet-based tree reconstruction methods. We exemplified this by implementing the method of statistical geometry in sequence space in the LM framework. We compared the statistical properties of both methods by means of analytical, in addition to simulation, studies. We assessed the success of both methods in detecting the star tree, as well as their success in reconstructing the correct topology.

Let us point out that the general idea of quartet-mapping is to reduce the probability that a wrong tree is reconstructed. A tree is recognized only if the simplex point is mapped into one of the three small trapezia (see fig. 1 ). The main conclusion of this paper is that the shape of the tree, rather than the substitution model, determines how reliably and successfully ML and statistical geometry map the simplex point into the correct topology area.

Let us discuss the two methods separately. In all analyzed cases, LM appears to be too liberal in suggesting a tree when the underlying topology is a star tree or very close to it. This bias is especially pronounced if the reconstruction model does not take rate heterogeneity among sites into account. In the heterogeneous case, as well as the REV case, the binary tree which brings the long edges together is favored more often than the other two binary trees. The ultimate reason for this phenomenon needs further investigation, but it has been observed in other studies that when the data violate the assumptions of the analysis model, the likelihood ratio test provides strong support for the tree separating the long edges from the short edges (Gaut and Lewis 1995Citation ).

LM appears to be robust and successful in reconstructing the correct tree over much of the whole parameter space even when simple evolution models are used instead of more complex (and correct) ones. This agrees with robustness results reported elsewhere (Gaut and Lewis 1995Citation ; Huelsenbeck 1995; Yang 1996).

Our results show that GM allows for consistent topology reconstruction if the inner branch length is longer than a threshold value. This threshold value depends critically on the tree shape and to some degree on the substitution model. However, there exists an upper limit for the sum of the outer branches for which the interior branch length has to be infinitely long such that the correct tree is reconstructed. Thus, in a major part of the parameter space, GM suggests a star tree. On the other hand, only a small range of parameters leads to an inconsistent tree reconstruction, i.e., for which GM suggests the incorrect tree. Thus, by using the quartet-mapping approach, i.e., by considering the seven quartet graphs rather than only the three binary trees, the Felsenstein zone is significantly reduced. The large region of the star tree, whose area is equal to the sum of the three areas of the three perfect trees, is the reason that GM suggests a star tree in a large part of the parameter space. GM evaluates the absolute number of positions, and, in addition, it does not take the information of the outer branch lengths into account. Thus, the level of "noise" is quickly approached, such that a tree can no longer be extracted from the sequence alignment. It is interesting to note here that GM allows for consistent tree reconstruction even if the molecular clock is violated, as long as the threshold relation of the branch lengths is satisfied. This is in a way surprising, since statistical geometry, similarly to maximum parsimony, makes no explicit assumption about the underlying process of evolution.

One could now ask if the range of branch lengths for which GM reconstructs the true tree can be enlarged. One source of noise that destroys the reconstruction of the true tree in the statistical-geometry approach of the sequence space approach is the use of the positions in which only one pair has equal characters. We therefore used a second, modified, definition of the coordinates of the simplex point by taking merely the "parsimonious informative" positions (Swofford and Olsen 1990Citation ), i.e., those in which two pairs have equal characters. These are the position categories 68 in expression (3) . Then, the three coordinates of the simplex vector are given by sMP1 = d6/(d6 + d7 + d8), sMP2 = d7/(d6 + d7 + d8), and s3MP = d8/(d6 + d7 + d8). Note that this is the quartet-mapping implementation of the maximum-parsimony method (Farris, Kluge, and Eckhardt 1970Citation ; Fitch 1971Citation ). Assuming the Jukes-Cantor model, we computed the expected coordinates sMPi for the clock tree and the nonclock tree. Figure 10 shows the comparison of the predicted areas using the geometry implementation and the parsimony implementation. Indeed, we find a significantly larger threshold line for the T1 area for the clock tree and an enlarged T1 area for the nonclock tree. A similar computation for the threshold relation (cf. eqs. 13, 14, and 15 ) now gives


We conclude that in the maximum-parsimony case, the threshold for the sum of the outer branch lengths is {approx}1.27 times as large as the one for GM. In addition, for given branch lengths a and b, the minimal inner branch length c, for which the simplex point is mapped into the correct tree area, is smaller.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 10.—Size of the areas into which the simplex point of the original geometry-mapping is mapped as compared with the simplex point of parsimony-mapping under the assumption of the Jukes-Cantor model. In the respective left plots, the original geometry-mapping is shown; in the right plots, the parsimony-mapping is shown. a, Clock tree. The dashed lines indicate the threshold value for the outer branch length a. b, Nonclock tree.

 
The effect observed for GM is also observed for the maximum-parsimony implementation of quartet-mapping. Thus, by considering not only the three binary trees, the Felsenstein zone can also be reduced for the maximum-parsimony method. However, it can clearly be seen that the domain in the parameter space for which GM or maximum parsimony maps the simplex point into the star tree area rather than into the binary tree area is still very large. Future studies will therefore address the problem of reducing the attraction of the star tree area.

When evaluating a reconstructed tree topology, one of the biggest difficulties is in deciding when the length of an interior branch is equal to zero. From the star tree simulations, we find that ML as implemented in LM suggests a binary tree too often. On the other hand, the domain of the parameter space in which GM reconstructs a star tree is too large. We therefore propose the following procedure to test for a positive interior branch length using a combination of LM, GM, and ML. For a quartet of sequences, both LM and GM are applied. Then, we distinguish four cases:

  1. Both LM and GM propose the star topology. Then, indeed, the interior branch length is equal to zero.
  2. LM proposes a star topology, whereas GM proposes a tree topology. This case does not occur and can therefore be ignored.
  3. Both LM and GM propose the same tree topology. Then, the interior branch length is indeed positive.
  4. LM proposes a tree topology, and GM proposes a star topology. In this case, we use ML to calculate the expected branch lengths of the four outer branches, denoted by â1, â2, 1, and 2, and the expected inner branch length c. We now assume that two of the outer branches are short and of similar length and that the other two are comparatively long. We set â = (â1 + â2)/2 and = (1 + 2)/2. If these expected branch lengths â and are in the area of the parameter space, where GM should propose a tree, then with a large probability the interior branch length is indeed equal to zero. If, on the other hand, the values â, , and c are not in the area in which GM would propose a tree, then no definite conclusion can be reached.

Let us look at the example in figure 8 . Here, most of the simplex points of LM are mapped into the tree area T1, whereas the simplex points of GM are all placed into the star tree area. From an additional tree reconstruction now applied to all positions using the ML method as implemented in PUZZLE (Strimmer and von Haeseler 1996Citation ), we computed the five mean branch lengths: â1 = 0.31, â2 = 0.21, 1 = 0.40, and 2 = 0.21 for the outer branches and c = 0.06 for the inner branch length. Then, â = 0.26, = 0.31, and a + b = 0.57. The sum of â and is below the threshold of equation (15) . Therefore, there exists a positive minimal inner branch length cmin {approx} 0.36 given by equation (14) for which GM reconstructs a tree. The value c of the inner edge separating the four groups, however, is smaller than cmin, the minimal value needed for GM to reconstruct a tree. Thus, a definite conclusion about the phylogeny of these four groups cannot be reached.

In this paper, we investigated the statistical properties of LM and GM only for model trees with four taxa. It has been shown that for trees with more than four species, LM is also too liberal to accept a bifurcating tree even if the true topology is a star tree (Nieselt-Struwe 1998Citation ). Nieselt-Struwe (1998)Citation also showed that this tendency cannot be improved by increasing the sequence length.

All models of nucleotide substitution used in the simulations in this paper are Markov models and assume that evolution is independent and identical at each site and along each lineage. This assumption, of course, may be quite wrong. The work of Schöniger and von Haeseler (1995)Citation and others on site-dependent models has shown that, indeed, ML has an enlarged Felsenstein zone if sequence sites do not evolve independently (Muse 1995Citation ; Rzhetsky 1995; Schöniger and von Haeseler 1995Citation ; Tillier and Collins 1995Citation ). Investigations of models violating these assumptions should provide another insight into the stability of the ML method and statistical geometry in sequence space.

We conclude that quartet-mapping is a powerful tool with which to test and compare the success of tree evaluation methods based on quartets of taxa. It is especially a very efficient method for a quick evaluation of zero-length branches in trees. However, we are aware that this procedure cannot replace bootstrapping or other evaluation methods. The statistical properties that have been studied in this work can be supplemented by the addition of a bootstrapping procedure to assess the spreading of the point cloud. Work on this question is in progress.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 appendix a
 appendix b 
 appendix c  
 appendix d   
 References
 
This work was supported by the Deutsche Forschungsgemeinschaft and the Max-Planck-Society, which are gratefully acknowledged. We thank Korbinian Strimmer for supplying the code of PUZZLE, version 3.0. We are also grateful to two anonymous reviewers for their helpful comments on the manuscript. The extension of likelihood-mapping to geometry-mapping is incorporated in STATGEOM, version 2.0, which can be obtained from http://www.gwdg.de/~kniesel.


    Footnotes
 
Naruya Saitou, Reviewing Editor

1 Keywords: likelihood-mapping maximum likelihood statistical geometry phylogenetic content edge test Back

2 Address for correspondence and reprints: Arndt von Haeseler, Max-Planck-Institut für evolutionäre Anthropologie, Inselstrasse 22, D-04103 Leipzig, Germany. E-mail: arndt{at}eva.mpg.de Back


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 appendix a
 appendix b 
 appendix c  
 appendix d   
 References
 

    Churchill G., A. von Haeseler, W. Navidi, 1992 Sample size for a phylogenetic inference Mol. Biol. Evol 9:753-769[Abstract]

    Eigen M., R. Winkler-Oswatitsch, A. Dress, 1988 Statistical geometry in sequence space: a method of comparative sequence analysis Proc. Natl. Acad. Sci. USA 85:5913-5917[Abstract]

    Farris J., A. Kluge, M. Eckardt, 1970 A numerical approach to phylogenetic systematics Syst. Zool 19:172-191[ISI]

    Felsenstein J., 1985 Confidence-limits on phylogenies—an approach using the bootstrap Evolution 39:783-791[ISI]

    Fitch W., 1971 Toward defining the course of evolution: minimum change for a specific tree topology Syst. Zool 20:406-416[ISI]

    Gaut B., P. Lewis, 1995 Success of maximum likelihood phylogeny inference in the four-taxon case Mol. Biol. Evol 12:152-162[Abstract]

    Goldman N., 1993 Statistical tests for DNA substitution J. Mol. Evol 34:183-198

    Hasegawa M., H. Kishino, T. Yano, 1985 Dating the human-ape splitting by a molecular clock of mitochondrial DNA J. Mol. Evol 22:160-174[ISI][Medline]

    Huelsenbeck J., 1995 The robustness of two phylogenetic methods: four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor joining Mol. Biol. Evol 12:843-849[Abstract]

    Jukes T., C. Cantor, 1969 Evolution of protein molecules Pp. 21–132 in H. Munro, ed. Mammalian protein metabolism. Academic Press, New York

    Kimura M., 1980 A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences J. Mol. Evol 16:111-120[ISI][Medline]

    Kuiken C. L., K. Nieselt-Struwe, G. F. Weiller, J. Goudsmit, 1994 Quasispecies behavior of human immunodificiency virus type 1: sample analysis of sequence data Pp. 100–119 in K. W. Adolph, ed. Molecular Virology Techniques Part A. Academic Press, New York

    Li W.-H., 1989 A statistical test of phylogenies estimated from sequence data Mol. Biol. Evol 6:424-435[Abstract]

    ———. 1997 Molecular Evolution Sinauer Ass., Inc. Publishers, Sunderland, Mass., USA

    Muse S., 1995 Evolutionary analysis of DNA sequences subject to constraint on secondary structure Genetics 139:1429-1439[Abstract/Free Full Text]

    Nei M., J. Stephens, N. Saitou, 1985 Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from humans and apes Mol. Biol. Evol 2:66-85[Abstract]

    Nieselt-Struwe K., 1998 From likelihood-mapping to quartet-mapping, a new sequence analysis tool Pp. 13–22 in M. K. Uyenoyama and A. von Haeseler, eds. Proceedings of the Trinational Workshop on Molecular Evolution. Duke University Publication Group, Duke University, Durham, N.C

    Ota R., P. Waddell, M. Hasegawa, H. Shimodaira, H. Kishino, 2000 Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters Mol. Biol. Evol 17:798-803[Abstract/Free Full Text]

    Rzhetsky A., 1995 Estimating substitution rates in ribosomal RNA genes Genetics 141:771-783[Abstract/Free Full Text]

    Rzhetsky A., M. Nei, 1992 A simple method for estimating and testing minimum-evolution trees Mol. Biol. Evol 9:945-967[Free Full Text]

    Saitou N., M. Nei, 1986 The number of nucleotides required to determine the branching order of three species, with special reference to the human-chimpanzee-gorilla divergence J. Mol. Evol 24:189-204[ISI][Medline]

    Schöniger M., A. von Haeseler, 1995 Performance of the maximum likelihood, neighbor joining and maximum parsimony methods when sequence sites are not independent Syst. Biol 44:533-547[ISI]

    Sitnikova T., 1996 Bootstrap method of interior-branch test for phylogenetic trees Mol. Biol. Evol 13:605-611[Abstract]

    Strimmer K., N. Goldman, A. von Haeseler, 1997 Bayesian probabilities and quartet puzzling Mol. Biol. Evol 14:210-211[Free Full Text]

    Strimmer K., A. von Haeseler, 1996 Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies Mol. Biol. Evol 13:964-969[Free Full Text]

    ———. 1997 Likelihood-mapping: a simple method to visualize phylogenetic content of sequence alignment Proc. Natl. Acad. Sci. USA 94:6815-6819[Abstract/Free Full Text]

    Swofford D., G. Olsen, 1990 Phylogeny reconstruction Pp. 411–501 in D. Hillis and G. Moritz, eds. Molecular systematics. Sinauer, Sunderland, Mass

    Tillier E., R. Collins, 1995 Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites Mol. Biol. Evol 12:7-15[Free Full Text]

    Whelan S., N. Goldman, 1999 Distribution of statistics used for the comparison of models of sequence evolution in phylogenetics Mol. Biol. Evol 16:1292-1299[Free Full Text]

    Yang Z., 1994 Estimating the pattern of nucleotide substitution J. Mol. Evol 39:105-111[ISI][Medline]

    ———. 1996 Phylogenetic analysis using parsimony and likelihood methods J. Mol. Evol 42:294-307[ISI][Medline]

    Yang Z., B. Rannala, 1997 Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method Mol. Biol. Evol 14:717-724[Abstract]

    Zharkikh A., W.-H. Li, 1992 Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences: II Four taxa without a molecular clock. J. Mol. Evol 35:356-366[ISI][Medline]

Accepted for publication March 9, 2001.