δ Plots: A Tool for Analyzing Phylogenetic Distance Data

B. R. Holland*, K. T. Huber†, A. Dress‡ and V. Moulton§

*Lehrstuhl für Spezielle Zoologie, Ruhr-Universität Bochum, Bochum, Germany;
†Department for Biometry and Informatics, Swedish University of Agricultural Sciences;
‡FSPM-Strukturbildungsprozesse, University of Bielefeld, Bielefeld, Germany;
§Linnaeus Center for Bioinformatics, BMC, Uppsala University


    Abstract
 
A method is described that allows the assessment of treelikeness of phylogenetic distance data before tree estimation. This method is related to statistical geometry as introduced by Eigen, Winkler-Oswatitsch, and Dress (1988; Proc. Natl. Acad. Sci. USA 85:5913–5917) and, in essence, displays a measure for treelikeness of quartets in terms of a histogram that we call a δ plot. This allows identification of nontreelike data and analysis of noisy data sets arising from processes such as parallel evolution, recombination, or lateral gene transfer. In addition to an overall assessment of treelikeness, individual taxa can be ranked by reference to the treelikeness of the quartets to which they belong. Removal of taxa on the basis of this ranking results in an increase in accuracy of tree estimation. Recombinant data sets are simulated, and the method is shown to be capable of identifying single recombinant taxa on the basis of distance information alone, provided the parents of the recombinant sequence are sufficiently divergent and the mixture of tree histories is not strongly skewed toward a single tree. δ Plots and taxon rankings are applied to three biological data sets using distances derived from sequence alignment, gene order, and fragment length polymorphism.


    Introduction
 
A common problem faced in phylogenetic analysis is the inference of an evolutionary tree from a set of taxa on which a pairwise distance has been defined. Examples of such distances include distances between sequences (either observed or corrected according to some model of sequence evolution), gene-order based distances, and metrics inferred from the presence or absence of bands in restriction-fragment length data. Although the use of distances for tree inference is not always desirable, distances can provide a way to take advantage of models of evolutionary change when there is no alternative because other methods are unavailable or intractable (see Swofford et al. [1996] for more details). A topical example of this is the use of breakpoint distances in the analysis of gene-order data.

There are many recipes for inferring trees from distance matrices, but comparatively few tools are available for assessing how appropriate a tree representation is. Processes such as recombination, reassortment, gene conversion, and lateral transfer lead to reticulate evolution, which might be better described by a network than by a tree (Posada and Crandall 2001). Hence, because there is no a priori reason that a distance should be well represented by a tree, tools to assess treelikeness should prove useful.

Existing methods for assessing signals in a phylogenetic data set before a tree is estimated include Relative Apparent Synapomorphy Analysis (RASA) (Lyons-Weiler, Hoelzer, and Tausch 1996), spectral analysis (Hendy and Penny 1993), likelihood mapping (Strimmer and von Haeseler 1997) and its more recent extension quartet mapping (Nieselt-Struwe and von Haeseler 2001), and split decomposition (Bandelt and Dress 1992; Huson 1998). Note that various techniques can be used to test the accuracy of an obtained tree, most popularly bootstrapping (Felsenstein 1985). But these techniques usually rely on constructing a set of trees first and then analyzing this set.

We presume that evolution typically gives rise to a treelike signal but that this signal may be obscured by processes such as sampling error, parallel changes and reversals, substitutional biases, selective pressure, or perhaps the use of an inappropriate model to correct observed distances. Our method aims to quantify how far a distance matrix is from being additive. In particular, for each quartet q of taxa, we compute a quantity 0 ≤ δ_q ≤ 1 that indicates by how much a quartet fails to satisfy the four-point condition (Zaretsky 1965; Buneman 1971); a value of 0 indicates that q is perfectly treelike, and progressively higher values indicate that it is less and less so. This measure has been successfully used in statistical geometry (Dress 1988; Eigen, Winkler-Oswatitsch, and Dress 1988; Eigen and Winkler-Oswatitsch 1990), and indeed δ plots could be regarded as an extension of this methodology. The δ values for all quartets are displayed in a histogram that we call a δ plot. The values for each quartet will be zero if and only if the complete distance data is additive (Zaretsky 1965), that is, the distances can be represented by a weighted tree (a tree with specified edge lengths). In the Results, we use simulations to assess the behavior of δ plots, and we apply our methodology to three biological data sets: HBV viral sequences (in which recombination is known to have occurred), gene-order data from eukaryotic mitochondrial sequences, and amplified fragment-length polymorphism (AFLP) data from the yeast Candida albicans.

In addition to this holistic analysis, we also develop a method for assessing the effect that individual taxa have on the treelikeness of a distance. In particular, for each individual taxon x in a given data set, we compute the average δ value of the quartets to which x belongs, denoted δ_x, the rationale being that the more quartets containing x exhibit high δ values, the more we may expect taxon x to be obscuring any treelike signal. In the Results, we use simulations to see how the δ_x value for a taxon x depends on its position within an underlying tree and explore its use in identifying recombinants.


    Methods
 
δ Plots are based on the well-known four-point condition, which we now recall. Suppose that we are given a set $X$ of taxa and a distance table $d = (d_{xy})_{x,y \in X}$, i.e., an assignment of putative genetic distances $d_{xy} \ge 0$ ($x,y \in X$) satisfying the conditions $d_{xy} = d_{yx}$ and $d_{xx} = 0$ for all $x,y \in X$. For any four elements $x,y,u,v$ in $X$, we put

$$d_{xy|uv} := d_{xy} + d_{uv}.$$

A quartet $q = \{x,y,u,v\}$ in $X$ is said to satisfy the four-point condition if the two larger of the three quantities $d_{xy|uv}$, $d_{xu|yv}$, $d_{xv|yu}$ are equal. It is well known that $d$ is additive (i.e., can be represented by a weighted tree labeled by $X$) if and only if every quartet $q = \{x,y,u,v\}$ in $X$ satisfies this condition (Zaretsky 1965).

In case $d$ is derived from biological data, this condition will almost never be satisfied. Thus, assuming that $d_{xy|uv} \le d_{xu|yv} \le d_{xv|yu}$ holds, it is natural to consider the ratio

$$\delta_q := \frac{d_{xv|yu} - d_{xu|yv}}{d_{xv|yu} - d_{xy|uv}}$$

as a measure of the treelikeness of the quartet $q$, where we define $\delta_q$ to be zero in case $d_{xy|uv} = d_{xu|yv} = d_{xv|yu}$ holds. When the numerator, and hence $\delta_q$, equals zero, the four-point condition holds. Normalizing by $d_{xv|yu} - d_{xy|uv}$ implies that $\delta_q$ must always lie between 0 and 1. The larger the value of $\delta_q$, the less treelike $q$ is said to be.
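As an illustration of the definition just given, the following minimal sketch computes δ_q for one quartet. It assumes the distance table is stored as a symmetric nested list indexed by taxon number; the function name `delta_q` and this data layout are illustrative choices, not part of the original method description.

```python
def delta_q(d, x, y, u, v):
    """Return delta_q for the quartet {x, y, u, v} of a symmetric distance table d."""
    # The three pairwise sums d_xy|uv, d_xu|yv, d_xv|yu, sorted ascending.
    small, middle, large = sorted(
        (d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u])
    )
    # By definition delta_q is zero when all three sums coincide.
    if large == small:
        return 0.0
    # (largest - middle) / (largest - smallest), always between 0 and 1.
    return (large - middle) / (large - small)


# A tree metric: pendant edges of length 1, internal edge of length 2.
tree = [[0, 2, 4, 4], [2, 0, 4, 4], [4, 4, 0, 2], [4, 4, 2, 0]]
# A non-additive table whose three sums are 4, 6 and 8.
box = [[0, 2, 3, 4], [2, 0, 4, 3], [3, 4, 0, 2], [4, 3, 2, 0]]
print(delta_q(tree, 0, 1, 2, 3), delta_q(box, 0, 1, 2, 3))
```

Because the three sums are sorted inside the function, the result does not depend on the order in which the four taxa are passed.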

To construct a δ plot for a set of taxa X, δ_q is calculated for every quartet q in X and displayed in a histogram. The number of quartets in a data set with n taxa is $\binom{n}{4}$, so the computational cost of constructing a δ plot is O(n^4). For large n (say n > 100 taxa), it may be preferable to construct a δ plot for a random subsample of the quartets. Note that we denote by δ̄ the mean value of δ_q taken over all quartets in X. (For example δ plots, see figs. 9a and 10a. The δ plot in fig. 9a, being more skewed toward zero, shows a more treelike distribution than that in fig. 10a.)
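The construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the distance table is a symmetric nested list, the helper names (`delta_values`, `histogram`), the number of bins, and the with-replacement sampling of quartets are all choices made here for illustration and are not taken from the original text.

```python
from itertools import combinations
import random


def delta_q(d, x, y, u, v):
    """delta_q for one quartet of a symmetric distance table d."""
    sums = sorted((d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]))
    if sums[2] == sums[0]:
        return 0.0
    return (sums[2] - sums[1]) / (sums[2] - sums[0])


def delta_values(d, n_sample=None, seed=0):
    """delta_q for all quartets, or for n_sample randomly drawn quartets."""
    n = len(d)
    if n_sample is None:
        quartets = combinations(range(n), 4)  # all C(n, 4) quartets
    else:
        rng = random.Random(seed)
        # Random quartets (drawn with replacement between draws).
        quartets = (rng.sample(range(n), 4) for _ in range(n_sample))
    return [delta_q(d, *q) for q in quartets]


def histogram(values, bins=10):
    """Counts of values in equal-width bins on [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * bins
    for value in values:
        counts[min(int(value * bins), bins - 1)] += 1
    return counts


# Star tree on five taxa (d_ij = p_i + p_j): additive, so every delta_q is 0.
pend = [1, 2, 3, 4, 5]
star = [[0 if i == j else pend[i] + pend[j] for j in range(5)] for i in range(5)]
values = delta_values(star)
mean_delta = sum(values) / len(values)
print(histogram(values), mean_delta)
```

For an additive table all quartets fall in the first bin; the more mass the histogram has away from zero, the less treelike the data.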



 
Fig. 9. The δ plot (a) and δ_x plot (b) for distances derived from HBV sequences. (c) The likelihood map of all quartets calculated by Tree-Puzzle 5.0 (Strimmer and von Haeseler 1996) using the HKY model and estimating the transition-transversion ratio and nucleotide frequencies from the data

 


 
Fig. 10. The δ plot (a) and δ_x plot (b) for distances derived from gene-order data in early branching eukaryotes

 
As mentioned in the Introduction, the measure δ_q is well known in the area of statistical geometry, and we now briefly describe the connection that δ plots have with this method. In case a distance d also satisfies the triangle inequality, i.e., d is a metric, its restriction to any quartet can be represented in a diagram or weighted graph such as the one in figure 1.



 
Fig. 1. Any distance table d on four taxa x,y,u,v with $d_{xv|yu} \ge d_{xy|uv}$ and $d_{xv|yu} \ge d_{xu|yv}$ can be represented by a diagram such as the one pictured here, giving the parameters a,b,c,d,s, and l the values $a := (d_{xy} + d_{xu} - d_{yu})/2$, $b := (d_{xu} + d_{uv} - d_{xv})/2$, $c := (d_{yv} + d_{uv} - d_{yu})/2$, $d := (d_{xy} + d_{yv} - d_{xv})/2$, $s := (d_{xv} + d_{yu} - d_{xy} - d_{uv})/2$, and $l := (d_{xv} + d_{yu} - d_{xu} - d_{yv})/2$. It follows that $\delta_q = s/l$ holds in case $l \ne 0$. Note that s and l are nonnegative by construction, whereas a,b,c,d are nonnegative if (and only if) d is a metric, i.e., if (and only if) d satisfies the triangle inequality

 
This is carried out by appropriately labeling the pendant vertices (in this figure labeled by x,y,u,v) and assigning (necessarily unique) nonnegative values to the edge weights a,b,c,d,s,l so that the sum of the weights along a shortest path between each pair of taxa equals the distance between those taxa (Zaretsky 1965). If the distance is additive, then the value assigned to at least one of s or l will be zero and so, as expected, the diagram becomes a weighted tree.
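The edge-weight assignment can be checked numerically. The sketch below uses the six formulas from the caption of figure 1 and verifies that the weights along shortest paths add up to the original distances. The function name and the nested-list layout are assumptions made for illustration. Working through the formulas, x and y sit across a box side of length l and x and u across a side of length s, so the ratio of the two box sides gives δ_q; the sketch returns min(s, l)/max(s, l) rather than assuming which side is the shorter one.

```python
def quartet_diagram(d, x, y, u, v):
    """Edge weights (a, b, c, dd, s, l) of the quartet diagram.

    Assumes d_xv|yu is the largest of the three pairwise sums.
    dd is the pendant weight called d in the figure caption.
    """
    a = (d[x][y] + d[x][u] - d[y][u]) / 2    # pendant edge at x
    b = (d[x][u] + d[u][v] - d[x][v]) / 2    # pendant edge at u
    c = (d[y][v] + d[u][v] - d[y][u]) / 2    # pendant edge at v
    dd = (d[x][y] + d[y][v] - d[x][v]) / 2   # pendant edge at y
    s = (d[x][v] + d[y][u] - d[x][y] - d[u][v]) / 2  # box side
    l = (d[x][v] + d[y][u] - d[x][u] - d[y][v]) / 2  # other box side
    return a, b, c, dd, s, l


box = [[0, 2, 3, 4], [2, 0, 4, 3], [3, 4, 0, 2], [4, 3, 2, 0]]
a, b, c, dd, s, l = quartet_diagram(box, 0, 1, 2, 3)

# Shortest-path sums in the diagram reproduce the six input distances.
paths = {
    (0, 1): a + dd + l, (0, 2): a + b + s, (0, 3): a + c + s + l,
    (2, 3): b + c + l, (1, 3): dd + c + s, (1, 2): dd + b + s + l,
}
ratio = min(s, l) / max(s, l) if max(s, l) > 0 else 0.0
print((a, b, c, dd, s, l), ratio)
```

When one of s or l is zero the box collapses to a single internal edge and the ratio is zero, which is the additive case described in the text.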

Statistical geometry (Dress 1988; Eigen, Winkler-Oswatitsch, and Dress 1988; Eigen and Winkler-Oswatitsch 1990) attempts to evaluate properties of data, such as treelikeness, through the computation of diagrams or geometries like the one above for subsets of a set of aligned sequences. It can be performed either in sequence space or distance space, the latter being of interest to us here. In particular, in this method an average over all quartet diagrams is derived and represented in a characteristic diagram that represents the underlying evolutionary divergence of the sequences. Rather than compressing all of the information into a single diagram, δ plots represent the distribution of the quartet distance geometries. A similar philosophy underlies the recently developed method of quartet mapping (Nieselt-Struwe and von Haeseler 2001), which aims to visualize the phylogenetic content of a set of aligned sequences.


    Results
 
In this section, we investigate the effectiveness of δ values in identifying specific taxa that might confound tree estimation. We also look at the behavior of δ plots for recombinant sequences, the archetypal example of nontreelike data, and provide applications of our methodology to three biological data sets.

In our simulations, we generated treelike and recombinant data sets using the software packages Seq-Gen (Rambaut and Grassly 1997) and Treevolve version 1.32 (Rambaut and Grassly). Seq-Gen allows the simulation of sequence evolution (according to a variety of evolutionary models) along a user-defined weighted tree. With Treevolve, rather than specifying a generating tree, the probabilities of both recombinant and coalescent events are given. A network is then generated according to these probabilities, and sequences are produced that evolve along the network. In particular, bifurcating trees can be simulated by setting the rate of recombination to zero.

In all simulations, the model of sequence evolution used was K2P (Kimura 1980) with a transition-transversion bias of κ = 4. In Treevolve the K2P model corresponds to the settings vHKY t2; other parameters that were varied during the simulations include l, the sequence length; s, the number of taxa; n, the number of replicate data sets; and r, the rate of recombination per site. All other Treevolve parameters, for instance those concerning population history, were left at their default settings. In Seq-Gen the K2P model is chosen by setting mHKY t2; the parameters l (sequence length) and n (the number of replicate data sets) were varied, and all other parameters were left at their default settings. Distance matrices were formed using the Phylip (Felsenstein 1993) package dnadist, which calculates the Hamming distances between sequences and then corrects these according to a specified model (in our case K2P with κ = 4).
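To make the idea of a model-corrected distance concrete, here is a minimal sketch of the textbook closed-form Kimura two-parameter estimator, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions. This is an assumption-laden illustration: it estimates the transition-transversion balance from the pair of sequences rather than fixing κ = 4, and it is not claimed to reproduce what dnadist computes internally. The function name is hypothetical.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq1, seq2):
    """Closed-form K2P distance between two aligned DNA sequences."""
    # Keep only columns where both sequences have an unambiguous base.
    pairs = [
        (a, b)
        for a, b in zip(seq1.upper(), seq2.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    n = len(pairs)
    # Transition: purine<->purine or pyrimidine<->pyrimidine mismatch.
    p = sum(1 for a, b in pairs if a != b and (a in PURINES) == (b in PURINES)) / n
    # Transversion: purine<->pyrimidine mismatch.
    q = sum(1 for a, b in pairs if (a in PURINES) != (b in PURINES)) / n
    # math.log raises ValueError if the sequences are saturated
    # (1 - 2P - Q <= 0 or 1 - 2Q <= 0); no correction exists in that case.
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


same = k2p_distance("ACGTACGTAC", "ACGTACGTAC")
one_transition = k2p_distance("ACGTACGTAC", "GCGTACGTAC")  # A->G at site 1
print(same, one_transition)
```

With one transition in ten sites the corrected distance (about 0.112) exceeds the raw Hamming proportion of 0.1, which is the purpose of the correction.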

Identifying Troublesome Taxa
If a taxon sequence has a reticulate history, has been involved in a sequencing or an alignment error, or is highly divergent and is thus basically randomized with respect to the other sequences, we expect the average δ value of quartets containing this taxon to be relatively high. With this in mind, we investigated the behavior of δ_x, the mean value of δ over all quartets containing a taxon x.
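The per-taxon mean just defined can be sketched as follows. The function name `taxon_deltas`, the nested-list distance table, and the small five-taxon example (four taxa on a tree plus a fifth taxon with deliberately inconsistent distances) are illustrative assumptions, not data from the study.

```python
from itertools import combinations


def delta_q(d, x, y, u, v):
    """delta_q for one quartet of a symmetric distance table d."""
    sums = sorted((d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]))
    if sums[2] == sums[0]:
        return 0.0
    return (sums[2] - sums[1]) / (sums[2] - sums[0])


def taxon_deltas(d):
    """Mean delta_q over all quartets containing each taxon."""
    n = len(d)
    totals = [0.0] * n
    counts = [0] * n
    for quartet in combinations(range(n), 4):
        value = delta_q(d, *quartet)
        for taxon in quartet:
            totals[taxon] += value
            counts[taxon] += 1
    # counts are zero only when there are fewer than four taxa.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Taxa 0-3 follow the tree ((0,1),(2,3)); taxon 4 is close to both 0 and 2,
# which no single tree can accommodate.
d = [
    [0, 2, 4, 4, 1],
    [2, 0, 4, 4, 3],
    [4, 4, 0, 2, 1],
    [4, 4, 2, 0, 3],
    [1, 3, 1, 3, 0],
]
scores = taxon_deltas(d)
print(scores)
```

In this toy table the conflicting taxon receives the highest score, but its neighbors' scores are also raised because they share quartets with it.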

Simulated data sets were generated on two trees with 16 leaves, one with the least balanced topology and the other with the most balanced topology (see fig. 2). The expected number of changes from the root to each tip was 0.3. One thousand sets of sequences of 100 bp were generated along each tree, and for each one of these sets a 16 × 16 distance matrix was computed as described above.



 
Fig. 2. The two topologies used to generate the sequences. On the left is the least balanced topology possible (the caterpillar tree), and on the right is the most balanced topology. The expected number of changes from the root to each tip is indicated by the y-axis and is the same for both trees (0.30)

 
For each taxon x, we computed δ_x (see fig. 3). These values are similar for all taxa within the balanced tree because the taxa occupy equivalent positions within the tree topology (see fig. 2). Because of the increased chance of parallel changes and reversals, the δ_x values for the unbalanced tree are higher. Note that the unbalanced tree exhibits two trends. First, taxa at the end of long edges have higher δ_x values than those at the end of short edges. Second, taxa in the middle of the unbalanced tree have high δ_x values, probably because such taxa are contained in many quartets that are nearly starlike, that is, the internal edges of the corresponding diagram depicted in figure 1 are small compared with the pendant edges.



 
Fig. 3. δ_x averaged over 1,000 simulated data sets. The * symbols correspond to the balanced topology (δ̄ = 0.18), and the o symbols to the unbalanced topology (δ̄ = 0.37). Standard error bars are not shown as they are negligible (<0.002) in all cases

 
To study the effect that a taxon x having a high δ_x value might have on tree-building, we compared the effect of random versus δ_x-directed taxon removal on the accuracy of the neighbor-joining (NJ) method (Saitou and Nei 1987). In particular, NJ was applied to the same 16 × 16 distance matrices as in the above simulation, and its accuracy, as measured by the proportion of internal edges of the generating tree that are correctly recovered, was recorded. Taxa were removed from the data sets according to two schemes: (1) they were removed according to their δ_x values, those taxa with the highest δ_x values being removed first, and (2) they were removed in a random order. After each taxon had been removed, NJ was applied to the reduced data set and the δ_x values were recomputed. The results of this simulation are shown in figure 4.
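The δ_x-directed removal scheme (1) can be sketched as a simple loop that recomputes the scores after every removal. This is a minimal illustration only: the NJ step and the accuracy measurement are deliberately left out, the function names are hypothetical, ties are broken by taking the first taxon with the maximal score, and the toy distance table is invented for the example.

```python
from itertools import combinations


def delta_q(d, x, y, u, v):
    """delta_q for one quartet of a symmetric distance table d."""
    sums = sorted((d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]))
    if sums[2] == sums[0]:
        return 0.0
    return (sums[2] - sums[1]) / (sums[2] - sums[0])


def taxon_deltas(d):
    """Mean delta_q over all quartets containing each taxon."""
    n = len(d)
    totals, counts = [0.0] * n, [0] * n
    for quartet in combinations(range(n), 4):
        value = delta_q(d, *quartet)
        for taxon in quartet:
            totals[taxon] += value
            counts[taxon] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def remove_by_delta(d, labels, n_remove):
    """Remove up to n_remove taxa, highest delta_x first, rescoring each time."""
    d = [row[:] for row in d]
    labels = list(labels)
    removed = []
    for _ in range(n_remove):
        if len(labels) < 5:  # keep at least four taxa, i.e., one quartet
            break
        scores = taxon_deltas(d)
        worst = max(range(len(labels)), key=scores.__getitem__)
        removed.append(labels.pop(worst))
        # Drop row and column of the removed taxon.
        d = [
            [val for j, val in enumerate(row) if j != worst]
            for i, row in enumerate(d)
            if i != worst
        ]
        # A tree-building method such as NJ would be applied to d here.
    return removed, labels, d


d = [
    [0, 2, 4, 4, 1],
    [2, 0, 4, 4, 3],
    [4, 4, 0, 2, 1],
    [4, 4, 2, 0, 3],
    [1, 3, 1, 3, 0],
]
removed, kept, reduced = remove_by_delta(d, "ABCDE", 2)
print(removed, kept)
```

Rescoring after each removal matters because a single conflicting taxon also inflates the scores of the taxa it shares quartets with.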



 
Fig. 4. The effect of random versus δ_x-directed taxon removal orders on the proportion of internal edges recovered by NJ for the most balanced (a) and least balanced (b) 16-taxon trees (see fig. 2). Results have been averaged over 1,000 repetitions per point. The vertical bars show ± one standard error

 
For both tree topologies, removing taxa on the basis of their δ_x values leads to an increase in NJ's accuracy, whereas a random removal order results in a slight decrease in accuracy. The difference in removal schemes is significant in each case, although accuracy improves more for the unbalanced topology.

Recombination
To explore the behavior of δ plots for nontreelike data, we used recombination simulations. We first investigated the dependence of δ plots on various parameters, viz., number of taxa (n), sequence length (c), and recombination frequency (r).

In a preliminary simulation, sequence length was fixed at 500 bp, and trees were generated using Treevolve with r equal to 0 and n equal to 5, 10, ..., 95. This was repeated 100 times for each value of n. It was found that δ̄ was independent of n (results not shown, but they appear in Holland 2001).

In a second simulation, n was fixed at 30 and both c and r were varied; c was taken to be 200, 400, 600, 800, and 1,000, and r was taken to be 0, 2.5 × 10⁻¹⁰, 5.0 × 10⁻¹⁰, 7.5 × 10⁻¹⁰, and 1.0 × 10⁻⁹. The results are shown in figure 5.



 
Fig. 5. The dependency of δ̄ on different levels of recombination and sequence length. Results have been averaged over 1,000 repetitions per point. The vertical bars show ± one standard error

 
Because there is a smaller noise-to-signal ratio in longer sequences, we expected δ̄ to be negatively correlated with sequence length. As can be seen in figure 5, this is certainly the case when no recombination occurs (r = 0). For the highest level of recombination shown (r = 1 × 10⁻⁹), δ̄ increases slightly from sequence lengths c = 600 to c = 800. This is probably because r is specified per nucleotide, giving more opportunity for recombination to occur in a longer sequence. The plot shows a positive correlation between δ̄ and the frequency of recombination events.

The results described in the previous section indicate that δ_x can be used to identify troublesome taxa. We thus investigated whether recombinants can be detected from within tree topologies.

Certain recombination events can lead to sequence alignments having one tree underlying some portion of the alignment, and a different tree underlying another portion. Frequently this can be identified by some taxa changing their position within the tree; see, for example, the Hepatitis B alignment of Bollyky et al. (1996) and the Dengue fever alignment of Holmes, Worobey, and Rambaut (1999).

We simulated data of this type by concatenating alignments from two generating trees. Figure 6 shows the trees used to generate the recombinant alignments used in the simulation. There were two basic topologies, unbalanced and balanced. For both tree topologies, the expected number of substitutions from the root to each tip was 0.3, and these were distributed according to the molecular clock hypothesis. With each basic topology, the recombinant had parents that were either close, intermediate, or divergent (we denote the recombinant in these cases by R1, R2, and R3, respectively), giving 2 × 3 = 6 experiments in total. Each sequence was 1,000 bp long; sites 1–500 were simulated on the tree where the recombinant (either R1, R2, or R3) was attached to its left-hand parent, and sites 501–1,000 were simulated along the tree where the recombinant taxon was attached to its right-hand parent.



 
Fig. 6. The trees used to generate the alignments with a recombinant. For example, with the unbalanced topology (a) and recombinant R1 with close parents, the first half of the sequences was generated on the tree where R1 and 1 are a neighboring pair and concatenated with the sequences generated on the tree where R1 and 2 are a neighboring pair. The branch lengths are drawn to scale, the expected number of substitutions from the root to each tip being 0.3

 
Results are shown in figure 7. As expected, we see that the δ_x values of the recombinant sequences differ most significantly from the nonrecombinant sequences (1) when the parents of the sequence are highly diverged and (2) when the sequence is generated along a balanced tree as opposed to an unbalanced tree.



 
Fig. 7. δ_x for six types of recombinant alignment. The generating trees are shown in figure 6. In each plot the right-hand bar shows the mean value of δ for the quartets containing the recombinant taxon. The vertical lines on each bar indicate ± one standard deviation

 
We repeated the simulation with sequences of length 500 and 1,000 and left-right combinations of 50%-50%, 75%-25%, and 90%-10%, giving 2 × 3 = 6 experiments for each tree in figure 6. The results for the balanced tree with intermediately divergent parents are shown in figure 8. The same general trends were seen for each combination of topology and type of recombination (parents close, intermediate, or divergent).



 
Fig. 8. δ_x for six combinations of sequence length and percentage mixture with balanced topology and medium distance recombinant parents. In each plot the right-hand bar shows the mean value of δ for the quartets containing recombinant R2. The vertical lines on the bars indicate ± one standard deviation. In the top row of plots the sequence length is 500, and in the bottom row it is 1,000. From left to right the combination of sequence from each parent of the recombinant is 50%-50%, 75%-25%, and 90%-10%

 
It appears that shorter sequence lengths and less symmetrical combinations of the left and right trees make it harder to detect recombinant sequences.

In general, we suspect that the difference in δ_x between recombinant and nonrecombinant sequences will be more pronounced within a larger data set as opposed to a smaller one because the ratio of the number of quartets containing the recombinant to the number of quartets containing both some fixed taxon and the recombinant is (n - 1)/3. Note that this ratio is only a rough guide because a quartet containing the recombinant taxon does not necessarily have a high δ value. In addition, for some quartets containing the recombinant, the topology of the tree will not change from one side of the breakpoint to the other (for example quartet 1,5,8,R1 in fig. 6b); only the edge weights change. This explains why, as is particularly noticeable for the simulations using the balanced tree, those taxa that are close to the recombinant, e.g., taxa 1 and 2 for R1, have higher δ_x values than those taxa that are further away.
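The ratio (n - 1)/3 quoted above follows from a short count. A quartet containing the recombinant is completed by choosing 3 of the remaining n - 1 taxa, whereas a quartet containing both the recombinant and one fixed taxon is completed by choosing 2 of the remaining n - 2 taxa:

$$\frac{\binom{n-1}{3}}{\binom{n-2}{2}} = \frac{(n-1)(n-2)(n-3)/6}{(n-2)(n-3)/2} = \frac{n-1}{3}.$$

For instance, with n = 16 taxa there are 455 quartets containing the recombinant and 91 containing it together with a given second taxon, a ratio of 5.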

Biological Data
We considered three data sets to explore the applicability of the methods developed in this article.

Viral Data
Our first example was the Hepatitis B virus (HBV) data set analyzed by Bollyky et al. (1996), which contains 24 isolates. The HBV sequences were corrected for multiple changes according to the HKY model using PAUP (Swofford 1998) to estimate parameters (more generalized distance corrections with distributions of rates across sites or proportions of invariant sites were also tried and produced larger values for δ̄).

In figure 9a, we present the δ plot for this data set (δ̄ = 0.16), and in figure 9b we plot the δ_x values for each taxon x. The HBV strain with the highest value is the outgroup taxon HBVADW4A, the sole representative of HBV genotype F. Bollyky et al. (1996) found that the HBV data contained two recombinant taxa; the positions of these two recombinant taxa are indicated in figure 9b by arrows.

For comparison with another method that analyzes phylogenetic signals before tree construction, we also computed a likelihood map (Strimmer and von Haeseler 1997) for the data, as calculated by Tree-Puzzle 5.0 (Strimmer and von Haeseler 1996); see figure 9c. At this stage, no software is available to make it possible to compare with the more recently developed quartet mapping approach (Nieselt-Struwe and von Haeseler 2001). Most of the quartets are mapped into one of the three regions of the likelihood map. This suggests that this data set contains signals suitable for phylogenetic analysis, a conclusion that can also be drawn from the shape of the δ plot. It is also in agreement with the high bootstrap values for the main HBV genotypes that were reported by Bollyky et al. (1996).

Gene-Order Data
We considered the data set presented in Sankoff et al. (2000). In that article, normalized induced breakpoint distances were computed between mitochondrial genomes of 18 eukaryotes. Previous work on gene-order distances suggests that they may contain phylogenetic signals, although this signal may not be strong and it is uncertain how reliable such distances are for distant taxa. The δ plot for this data set is presented in figure 10a (δ̄ = 0.32). It appears to be quite nontreelike, which was supported by further analysis with SplitsTree (results not shown). In figure 10b it is seen that the two green algae Marchantia polymorpha and Nephroselmis olivacea have the highest δ_x values, followed by a group of four taxa identified by Sankoff et al. (2000) as early branching protists. This agrees with previous work where it was found that the taxa M. polymorpha and N. olivacea confounded tree-building techniques (David Bryant, personal communication).

Restriction Fragment Length Polymorphism Data
The third data set consists of 42 isolates of the yeast C. albicans (Jan Schmid, personal communication). Distances were formed from a binary character matrix on the basis of the presence of bands in AFLP.

The extent to which C. albicans reproduces sexually versus clonally is currently a matter of debate (Pujol et al. 1993; Graser et al. 1996; Tibayrenc 1997); if there is considerable sexual reproduction, then reassortment of chromosomes will result in a nontreelike signal (although some treelike signals may remain because of linkage along the chromosomes).

In Schmid et al. (1999), evidence is presented for a cluster of genetically similar isolates within C. albicans that is prevalent across many geographical regions, patient types, and forms of infection. We computed δ plots for all isolates and also for the cluster (and its complement) proposed by Schmid et al. (1999). Figure 11 shows a marked difference in the δ plots for the isolates within the cluster (δ̄ = 0.05) as compared with those outside the cluster (δ̄ = 0.31).



 
Fig. 11. Three δ plots from distances derived from fragment length polymorphism data in C. albicans. (a) The δ plot for the complete data set of 42 isolates (δ̄ = 0.14), (b) the δ plot for the 26 isolates in the cluster defined by Schmid et al. (1999) (δ̄ = 0.05), and (c) the δ plot for the 16 isolates not in this cluster (δ̄ = 0.31)

 
One explanation for this phenomenon could be that more recombination is occurring in the noncluster strains, whereas the strains within the cluster are reproducing primarily clonally. But another plausible reason might be that the noncluster strains are more diverse and thus lead to long edge lengths with higher δ values. A more in-depth analysis of this data set will appear elsewhere.


    Discussion
 
We find that δ plots are a useful exploratory data-analysis tool for phylogenetic studies. δ Plots allow assessment of how treelike distance data sets are before tree construction. Furthermore, computing δ_x for each taxon x allows identification of taxa that may obscure the treelike structure of the data and hinder accurate tree construction. Such taxa may, for example, be at the end of divergent long branches, or perhaps be recombinant strains. Simulations showed that removing taxa with the highest δ_x values improved the accuracy of tree construction for the popular distance-based NJ method.

We found that the topology of the underlying tree has an effect on δ̄, as illustrated by the difference between δ̄ for the balanced and unbalanced topologies shown in figure 3. Moreover, as indicated by figure 3, the location of a taxon x within the tree topology can influence δ_x. In general, the unbalanced topology is harder to reconstruct accurately, presumably because of the combination of short and long branches. But there was a significant improvement in the accuracy of tree estimation as the taxa with the highest δ_x values were removed from the unbalanced tree. This suggests that removing taxa with a high δ_x value could provide a quantitative method for subsampling data sets to avoid the problems of long branch attraction.

The high δ values for the unbalanced tree also highlight a potential shortcoming of our definition of δ, in that it does not take into account the lengths of the pendant edges in figure 1. Thus two quartets with the same internal edge lengths s,l will have the same δ value even in case one has extremely long pendant edges relative to the other, making it more starlike. We experimented with normalization factors for δ values to account for this problem (for example, (s + l)/(a + b + c + d + s + l)) but found that these did not alter the general trends shown in figure 3.

It is interesting to note that varying the correction used for multiple changes in sequence data seemed to make little difference to δ̄. For example, for the HBV data, the more general distance corrections, including gamma distributions of rates across sites and proportions of invariant sites, tended to produce higher δ̄ values than the simple HKY model (despite being indicated as superior models according to likelihood ratio tests). Performing corrections that make distance data less additive may have an adverse effect on the accuracy of topology estimation, even though they presumably make edge length estimation more accurate.

With respect to recombination, we found that increasing levels of recombination led to higher δ̄ values. In a simple simulation, individual recombinant taxa were found to have significantly higher δ_x values than the nonrecombinant taxa. We believe that δ plots are a useful preliminary test to suggest possible recombinants. But caution in interpreting the results is necessary because other factors apart from recombination can cause high δ_x values. This was seen in the example with HBV, where it is the outgroup taxon that has the highest δ_x value as opposed to the recombinant strains. In any case the distances can contain no information about the spatial process of recombination; so whenever high δ_x values suggest recombination, this needs to be followed up by, for example, the search for breakpoints (see Grassly and Holmes 1997; McGuire, Wright, and Prentice 1997; Weiller 1998; Holmes, Worobey, and Rambaut 1999 for different approaches to breakpoint detection). Our studies suggest that an interesting application would be to compute δ̄ within a sliding window to look for noisy portions of alignments or breakpoints.
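The sliding-window idea is only suggested in the text, so the following is a speculative sketch of how it might look rather than a method from the study. It assumes an alignment given as equal-length strings, uses uncorrected Hamming proportions instead of model-corrected distances, and introduces hypothetical names and parameters (`sliding_mean_delta`, `window`, `step`).

```python
from itertools import combinations


def delta_q(d, x, y, u, v):
    """delta_q for one quartet of a symmetric distance table d."""
    sums = sorted((d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]))
    if sums[2] == sums[0]:
        return 0.0
    return (sums[2] - sums[1]) / (sums[2] - sums[0])


def p_distance(s1, s2):
    """Proportion of differing sites between two equal-length sequences."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def sliding_mean_delta(seqs, window, step):
    """Mean delta_q over all quartets, computed per alignment window.

    Returns a list of (window start, mean delta) pairs. Needs >= 4 sequences.
    """
    results = []
    for start in range(0, len(seqs[0]) - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        d = [[p_distance(a, b) for b in chunk] for a in chunk]
        values = [delta_q(d, *q) for q in combinations(range(len(seqs)), 4)]
        results.append((start, sum(values) / len(values)))
    return results


# Toy alignment: the first four columns support grouping (0,1),(2,3),
# the last four columns support (0,2),(1,3).
seqs = ["AAAAAAAA", "AAAACCCC", "CCCCAAAA", "CCCCCCCC"]
profile = sliding_mean_delta(seqs, window=4, step=2)
print(profile)
```

In the toy alignment the two end windows are each perfectly treelike, while the window straddling the change of signal is maximally conflicting, which is the kind of peak one would look for near a breakpoint.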

In cases where the {delta} plots indicate that the data are far from being additive, we suggest the use of phylogeny software such as SplitsTree (Huson 1998) or Spectronet (Huber et al., unpublished data) that does not restrict the result to being a tree. We also suggest testing the stability of the tree topology on removing those taxa with high per-taxon mean {delta} values. Because {delta} values depend on both sequence length and tree topology, it is not easy to give specific cutoff values over which data sets or individual taxa should be deemed suspicious. One possible approach to obtaining such cutoff values when the distances are based on sequence alignments would be to use parametric bootstrapping to generate data sets based on the best tree fitting the observed distances.
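The taxon ranking used to choose candidates for removal can be sketched as below. This is a minimal illustration, assuming that a taxon's score is the mean {delta} over all quartets containing it and that {delta} for a quartet is (largest - middle)/(largest - smallest) over the three pairwise distance sums. The function names and the toy distance matrix are hypothetical.

```python
from itertools import combinations


def quartet_delta(d, x, y, u, v):
    """delta for one quartet from distance matrix d; 0 if all sums are equal."""
    lo, mid, hi = sorted([d[x][y] + d[u][v],
                          d[x][u] + d[y][v],
                          d[x][v] + d[y][u]])
    return 0.0 if hi == lo else (hi - mid) / (hi - lo)


def taxon_mean_deltas(d):
    """Mean delta over the quartets containing each taxon (needs >= 4 taxa)."""
    n = len(d)
    totals = [0.0] * n
    counts = [0] * n
    for quartet in combinations(range(n), 4):
        value = quartet_delta(d, *quartet)
        for taxon in quartet:
            totals[taxon] += value
            counts[taxon] += 1
    return [totals[t] / counts[t] for t in range(n)]


def rank_taxa(d, names):
    """Taxa sorted from least to most treelike (highest mean delta first)."""
    return sorted(zip(names, taxon_mean_deltas(d)),
                  key=lambda pair: pair[1], reverse=True)


# Toy additive distances on the tree ((A,B),C,(D,E)) with unit edge lengths,
# so every quartet, and hence every taxon, scores 0.
names = ["A", "B", "C", "D", "E"]
dist = [[0, 2, 3, 4, 4],
        [2, 0, 3, 4, 4],
        [3, 3, 0, 3, 3],
        [4, 4, 3, 0, 2],
        [4, 4, 3, 2, 0]]
print(rank_taxa(dist, names))
```

Taxa at the top of such a ranking are the ones whose removal would be tested first when checking the stability of the tree topology.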

In conclusion, we believe that, in combination with other tools for analyzing treelikeness such as statistical geometry (Dress 1988; Eigen, Winkler-Oswatitsch, and Dress 1988; Eigen and Winkler-Oswatitsch 1990), {delta} plots can provide a useful way to visualize and explore data sets that complements the various distance-based tree-building methods.


    Acknowledgements
 
B.R.H. received support from a Marsden-funded Doctoral Scholarship. B.R.H. thanks M. Hendy and D. Penny for their constructive criticism of the material. K.T.H. and V.M. thank the Swedish Research Council (VR); they and B.R.H. thank the Swedish Foundation for International Cooperation in Research and Education (STINT). A.D. thanks the DFG for its support.


    Footnotes
 
William Martin, Reviewing Editor

Keywords: genetic distance, statistical geometry, phylogenetic analysis, tree reconstruction, assessment of data quality, recombination

Address for correspondence and reprints: B. R. Holland, Lehrstuhl für Spezielle Zoologie, Ruhr-Universität Bochum, 150 Universitätstr., Bochum 44780, Germany. E-mail: barbara.holland{at}ruhr-uni-bochum.de.


    References
 

    Bandelt H.-J., A. W. M. Dress, 1992 Split decomposition: a new and useful approach to phylogenetic analysis of distance data Mol. Phyl. Evol 1:242-252

    Bollyky P. L., A. Rambaut, P. H. Harvey, E. C. Holmes, 1996 Recombination between sequences of Hepatitis B virus from different genotypes J. Mol. Evol 42:97-102

    Buneman P., 1971 The recovery of trees from measures of dissimilarity Pp. 387–395 in F. R. Hodson, D. G. Kendall, and P. Tautu, eds. Mathematics in the archaeological and historical sciences. Edinburgh University Press, Edinburgh, U.K

    Dress A., 1988 Statistische Geometrie von Konfigurationen und deren Evolution in Sequenz-Räumen: Definitionen und Probleme. Ein Programmvorschlag. In H. Begehr, ed. Die Bedeutung der von Berlin ausgehenden Mathematik in Vergangenheit und Gegenwart. Kolloquium-Verlag, Berlin.

    Eigen M., R. Winkler-Oswatitsch, 1990 Statistical geometry on sequence space Methods Enzymol 183:505-530

    Eigen M., R. Winkler-Oswatitsch, A. Dress, 1988 Statistical geometry in sequence space: a method of quantitative sequence analysis Proc. Natl. Acad. Sci. USA 85:5913-5917

    Felsenstein J., 1985 Confidence limits on phylogenies: an approach using the bootstrap Evolution 39:783-791

    Felsenstein J., 1993 PHYLIP (phylogeny inference package). Version 3.5c Department of Genetics, University of Washington, Seattle

    Graser Y., M. Volovsek, J. Arrington, G. Schonian, W. Presber, T. G. Mitchell, R. Vilgalys, 1996 Molecular markers reveal that population structure of the human pathogen Candida albicans exhibits both clonality and recombination Proc. Natl. Acad. Sci. USA 93:12473-12477

    Grassly N. C., E. C. Holmes, 1997 A likelihood method for the detection of selection and recombination using sequence data Mol. Biol. Evol 14:239-247

    Hendy M. D., D. Penny, 1993 Spectral analysis of phylogenetic data J. Class 10:5-24

    Holland B. R., 2001 Evolutionary analyses of large data sets: trees and beyond Doctoral dissertation, Massey University, Palmerston North, New Zealand.

    Holmes E. C., M. Worobey, A. Rambaut, 1999 Phylogenetic evidence for recombination in Dengue virus Mol. Biol. Evol 16:405-409

    Huson D., 1998 SplitsTree: a program for analyzing and visualizing evolutionary data Bioinformatics 14:68-73

    Kimura M., 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences J. Mol. Evol 16:111-120

    Lyons-Weiler J., G. A. Hoelzer, R. J. Tausch, 1996 Relative Apparent Synapomorphy Analysis (RASA) I: the statistical measurement of phylogenetic signal Mol. Biol. Evol 13:749-757

    McGuire G., F. Wright, M. J. Prentice, 1997 A graphical method for detecting recombination in phylogenetic data sets Mol. Biol. Evol 14:1125-1131

    Nieselt-Struwe K., A. von Haeseler, 2001 Quartet-mapping, a generalization of the likelihood-mapping procedure Mol. Biol. Evol 18:1204-1219

    Posada D., K. A. Crandall, 2001 Intraspecific gene genealogies: trees grafting into networks Trends Ecol. Evol 16:37-45

    Pujol C., J. Reynes, F. Renaud, M. Raymond, M. Tibayrenc, F. J. Ayala, F. Janbon, M. Mallie, J. Bastide, 1993 The yeast Candida albicans has a clonal mode of reproduction in a population of infected human immunodeficiency virus-positive patients Proc. Natl. Acad. Sci. USA 90:9456-9459

    Rambaut A. E., N. C. Grassly, 1997 Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees Comput. Appl. Biosci 13:235-238

    ———. Treevolve. Version 1.32 Available from http://evolve.zoo.ox.ac.uk.

    Saitou N., M. Nei, 1987 The neighbor-joining method: a new method for reconstructing phylogenetic trees Mol. Biol. Evol 4:406-425

    Sankoff D., D. Bryant, M. Denault, B. F. Lang, G. Burger, 2000 Early eukaryote evolution based on mitochondrial gene order breakpoints J. Comp. Biol 7:521-536

    Schmid J., S. Herd, P. R. Hunter, R. D. Cannon, M. Salleh, 1999 Evidence for a general-purpose genotype in Candida albicans, highly prevalent in multiple geographical regions, patient types and types of infection Microbiology 145:2405-2413

    Strimmer K., A. von Haeseler, 1996 Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies Mol. Biol. Evol 13:964-969

    Strimmer K., A. von Haeseler, 1997 Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment Proc. Natl. Acad. Sci. USA 94:6815-6819

    Swofford D. L., 1998 PAUP*: phylogenetic analysis using parsimony (* and other methods). Version 4.0 Sinauer Associates, Sunderland, Mass

    Swofford D., G. Olsen, P. Waddell, D. Hillis, 1996 Phylogenetic inference Pp. 407–514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. 2nd edition. Sinauer Associates, Sunderland, Mass

    Tibayrenc M., 1997 Are Candida albicans natural populations subdivided? Trends Microbiol 5:253-257

    Weiller G. F., 1998 Phylogenetic profiles: a graphical method for detecting genetic recombinations in homologous sequences Mol. Biol. Evol 15:326-335

    Zaretsky K., 1965 Reconstruction of a tree from the distances between its pendant vertices Uspekhi Math. Nauk (Russian Mathematical Surveys) 20:90-92

Accepted for publication May 22, 2002.