*Lehrstuhl für Spezielle Zoologie, Ruhr-Universität Bochum, Bochum, Germany;
Department for Biometry and Informatics, Swedish University of Agricultural Sciences;
FSPM-Strukturbildungsprozesse, University of Bielefeld, Bielefeld, Germany;
Linnaeus Center for Bioinformatics, BMC, Uppsala University
Abstract |
---|
Introduction |
---|
There are many recipes for inferring trees from distance matrices, but comparatively few tools are available for assessing how appropriate a tree is for a given distance matrix. Processes such as recombination, reassortment, gene conversion, and lateral transfer lead to reticulate evolution, which might be better described by a network than by a tree (Posada and Crandall 2001). Hence, because there is no a priori reason that a distance should be well represented by a tree, tools to assess treelikeness should prove useful.
Existing methods for assessing signals in a phylogenetic data set before a tree is estimated include Relative Apparent Synapomorphy Analysis (RASA) (Lyons-Weiler, Hoelzer, and Tausch 1996), spectral analysis (Hendy and Penny 1993), likelihood mapping (Strimmer and von Haeseler 1997) and its more recent extension quartet mapping (Nieselt-Struwe and von Haeseler 2001), and split decomposition (Bandelt and Dress 1992; Huson 1998). Note that various techniques can be used to test the accuracy of an obtained tree, most popularly bootstrapping (Felsenstein 1985). But these techniques usually rely on constructing a set of trees first and then analyzing this set.
We presume that evolution typically gives rise to a treelike signal but that this signal may be obscured by processes such as sampling error, parallel changes and reversals, substitutional biases, selective pressure, or perhaps the use of an inappropriate model to correct observed distances. Our method aims to quantify how far a distance matrix is from being additive. In particular, for each quartet $q$ of taxa, we compute a quantity $0 \le \delta_q \le 1$ that indicates by how much the quartet fails to satisfy the four-point condition (Zaretsky 1965; Buneman 1971); a value of 0 indicates that $q$ is perfectly treelike, and progressively higher values indicate that it is less and less so. This measure has been successfully used in statistical geometry (Dress 1988; Eigen, Winkler-Oswatitsch, and Dress 1988; Eigen and Winkler-Oswatitsch 1990), and indeed $\delta$ plots could be regarded as an extension of this methodology. The $\delta_q$ values for all quartets are displayed in a histogram that we call a $\delta$ plot. The value for every quartet will be zero if and only if the complete distance data are additive (Zaretsky 1965), that is, the distances can be represented by a weighted tree (a tree with specified edge lengths). In the Results, we use simulations to assess the behavior of $\delta$ plots, and we apply our methodology to three biological data sets: Hepatitis B virus (HBV) sequences (in which recombination is known to have occurred), gene-order data from eukaryotic mitochondrial sequences, and amplified fragment-length polymorphism (AFLP) data from the yeast Candida albicans.
In addition to this holistic analysis, we also develop a method for assessing the effect that individual taxa have on the treelikeness of a distance. In particular, for each individual taxon $x$ in a given data set, we compute the average $\delta_q$ value of the quartets to which $x$ belongs, denoted $\bar{\delta}_x$, the rationale being that the more quartets containing $x$ exhibit high $\delta_q$ values, the more we may expect taxon $x$ to be obscuring any treelike signal. In the Results, we use simulations to see how the $\bar{\delta}_x$ value for a taxon $x$ depends on its position within an underlying tree and explore its use in identifying recombinants.
Methods |
---|
If $d$ is derived from biological data, this condition will almost never be satisfied. Thus, assuming that $d_{xy|uv} \le d_{xu|yv} \le d_{xv|yu}$ holds, it is natural to consider the ratio

$$\delta_q = \frac{d_{xv|yu} - d_{xu|yv}}{d_{xv|yu} - d_{xy|uv}}$$
To construct a $\delta$ plot for a set of taxa $X$, $\delta_q$ is calculated for every quartet $q$ in $X$ and displayed in a histogram. The number of quartets in a data set with $n$ taxa is $\binom{n}{4}$, so the computational cost of constructing a $\delta$ plot is $O(n^4)$. For large $n$ (say $n > 100$ taxa), it may be preferable to construct a $\delta$ plot for a random subsample of the quartets. Note that we denote by $\bar{\delta}$ the mean value of $\delta_q$ taken over all quartets in $X$. (For example $\delta$ plots see figs. 9a and 10a. The $\delta$ plot in fig. 9a, being more skewed toward zero, shows a more treelike distribution than that in fig. 10a.)
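The construction just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' software: the function names (`delta_q`, `delta_values`, `sampled_delta_values`, `histogram`) and the two toy matrices are hypothetical, the distance matrix is assumed to be a symmetric list of lists, $d_{xy|uv}$ is read as $d_{xy} + d_{uv}$, and a quartet whose three pairwise sums coincide is assumed to score 0.

```python
import random
from itertools import combinations


def delta_q(d, x, y, u, v):
    """Return delta for the quartet {x, y, u, v} of distance matrix d."""
    # The three ways of pairing up four taxa, sorted so that
    # smallest <= middle <= largest.
    smallest, middle, largest = sorted(
        [d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]]
    )
    denom = largest - smallest
    # Assumption: if all three sums are equal, score the quartet as 0.
    return 0.0 if denom == 0 else (largest - middle) / denom


def delta_values(d):
    """delta_q for every quartet; there are C(n, 4) of them."""
    return [delta_q(d, *q) for q in combinations(range(len(d)), 4)]


def sampled_delta_values(d, k, seed=0):
    """delta_q for k randomly drawn quartets (quartets may repeat)."""
    rng = random.Random(seed)
    return [delta_q(d, *rng.sample(range(len(d)), 4)) for _ in range(k)]


def histogram(values, bins=10):
    """Counts of values in equal-width bins on [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * bins
    for value in values:
        counts[min(int(value * bins), bins - 1)] += 1
    return counts


# Additive toy matrix: tree ((0,1),(2,3)), pendant edges 1, internal edge 2.
tree_d = [[0, 2, 4, 4], [2, 0, 4, 4], [4, 4, 0, 2], [4, 4, 2, 0]]
# Non-additive toy matrix: the two smallest sums tie, so delta is 1.
box_d = [[0, 3, 4, 3], [3, 0, 3, 4], [4, 3, 0, 3], [3, 4, 3, 0]]

print(delta_values(tree_d), delta_values(box_d))
```

The mean $\bar{\delta}$ is then simply the average of the list returned by `delta_values`, and the subsampling variant trades exhaustiveness for a cost that no longer grows as $n^4$.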
Statistical geometry (Dress 1988; Eigen, Winkler-Oswatitsch, and Dress 1988; Eigen and Winkler-Oswatitsch 1990) attempts to evaluate properties of data, such as treelikeness, through the computation of diagrams or geometries like the one above for subsets of a set of aligned sequences. It can be performed either in sequence space or distance space, the latter being of interest to us here. In particular, in this method an average over all quartet diagrams is derived and represented in a characteristic diagram that represents the underlying evolutionary divergence of the sequences. Rather than compressing all of the information into a single diagram, $\delta$ plots represent the distribution of the quartet distance geometries. A similar philosophy underlies the recently developed method of quartet mapping (Nieselt-Struwe and von Haeseler 2001), which aims to visualize the phylogenetic content of a set of aligned sequences.
Results |
---|
In our simulations, we generated treelike and recombinant data sets using the software packages Seq-Gen (Rambaut and Grassly 1997) and Treevolve version 1.32 (Rambaut and Grassly). Seq-Gen allows the simulation of sequence evolution (according to a variety of evolutionary models) along a user-defined weighted tree. With Treevolve, rather than specifying a generating tree, the probabilities of both recombinant and coalescent events are given. A network is then generated according to these probabilities, and sequences are produced that evolve along the network. In particular, bifurcating trees can be simulated by setting the rate of recombination to zero.
In all simulations, the model of sequence evolution used was K2P (Kimura 1980) with a transition-transversion bias of 4. In Treevolve the K2P model corresponds to the settings vHKY t2; other parameters that were varied during the simulations include l, the sequence length; s, the number of taxa; n, the number of replicate data sets; and r, the rate of recombination per site. All other Treevolve parameters, for instance those concerning population history, were left at their default settings. In Seq-Gen the K2P model is chosen by setting mHKY t2; the parameters l (the sequence length) and n (the number of replicate data sets) were varied, and all other parameters were left at their default settings. Distance matrices were formed using the dnadist program of the PHYLIP package (Felsenstein 1993), which calculates the Hamming distances between sequences and then corrects these according to a specified model (in our case K2P with a transition-transversion bias of 4).
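To make the idea of correcting observed differences concrete, the following sketch computes a Kimura two-parameter distance for a pair of aligned sequences. It uses Kimura's classical closed-form estimator, which infers the transition and transversion proportions from the data; it is not a reimplementation of dnadist, which works with a user-specified transition-transversion ratio, and the function name `k2p_distance` is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for p, q in zip(seq_a.upper(), seq_b.upper()):
        if p not in "ACGT" or q not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if p == q:
            continue
        if {p, q} <= PURINES or {p, q} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    P = transitions / sites
    Q = transversions / sites
    # Closed-form K2P correction; math.log raises ValueError when the
    # sequences are too divergent for the correction to be defined.
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


# One transition in ten sites: the corrected distance exceeds the raw 0.1.
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

The corrected value is always at least as large as the uncorrected proportion of differing sites, because the correction accounts for multiple changes at the same site.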
Identifying Troublesome Taxa
If a taxon sequence has a reticulate history, has been involved in a sequencing or an alignment error, or is highly divergent and is thus basically randomized with respect to the other sequences, we expect the average $\delta_q$ value of quartets containing this taxon to be relatively high. With this in mind, we investigated the behavior of $\bar{\delta}_x$, the mean value of $\delta_q$ over all quartets containing a taxon $x$.
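A per-taxon mean of this kind can be sketched as follows. The names `delta_q` and `taxon_means` and the five-taxon matrix are hypothetical, chosen only for illustration; the matrix consists of an additive tree on taxa 0 to 3 plus a taxon 4 whose distances conflict with that tree, and quartets with three equal pairwise sums are assumed to score 0.

```python
from itertools import combinations


def delta_q(d, quartet):
    """delta for one quartet (a 4-tuple of indices into d)."""
    x, y, u, v = quartet
    smallest, middle, largest = sorted(
        [d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]]
    )
    denom = largest - smallest
    return 0.0 if denom == 0 else (largest - middle) / denom


def taxon_means(d):
    """Mean delta_q over all quartets containing each taxon."""
    n = len(d)
    if n < 4:
        raise ValueError("at least four taxa are needed")
    totals = [0.0] * n
    counts = [0] * n
    for quartet in combinations(range(n), 4):
        value = delta_q(d, quartet)
        for x in quartet:
            totals[x] += value
            counts[x] += 1
    return [total / count for total, count in zip(totals, counts)]


# Taxa 0-3 fit the tree ((0,1),(2,3)); taxon 4 is close to both 0 and 2,
# which no single tree on these five taxa can accommodate.
d = [
    [0, 2, 4, 4, 2],
    [2, 0, 4, 4, 4],
    [4, 4, 0, 2, 2],
    [4, 4, 2, 0, 4],
    [2, 4, 2, 4, 0],
]
means = taxon_means(d)
print(means)  # taxon 4 has the largest mean
```

In this toy case the conflicting taxon stands out, but its neighbours also pick up elevated means, since every quartet that contains the conflicting taxon contributes to three other taxa as well.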
Simulated data sets were generated on two trees with 16 leaves, one with the least balanced topology and the other with the most balanced topology (see fig. 2). The expected number of changes from the root to each tip was 0.3. One thousand sets of sequences of 100 bp were generated along each tree, and for each one of these sets a 16 × 16 distance matrix was computed as described above.
Recombination
To explore the behavior of $\delta$ plots for nontreelike data, we used recombination simulations. We first investigated the dependence of $\delta$ plots on various parameters, viz., number of taxa ($n$), sequence length ($c$), and recombination frequency ($r$).
In a preliminary simulation, sequence length was fixed at 500 bp, and trees were generated using Treevolve with $r$ equal to 0 and $n$ equal to 5, 10, ..., 95. This was repeated 100 times for each value of $n$. It was found that $\bar{\delta}$ was independent of $n$ (results not shown, but they appear in Holland 2001).
In a second simulation, $n$ was fixed at 30 and both $c$ and $r$ were varied; $c$ was taken to be 200, 400, 600, 800, and 1,000, and $r$ was taken to be 0, $2.5 \times 10^{-10}$, $5.0 \times 10^{-10}$, $7.5 \times 10^{-10}$, and $1.0 \times 10^{-9}$. The results are shown in figure 5.
The results described in the previous section indicate that $\bar{\delta}_x$ can be used to identify troublesome taxa. We thus investigated whether recombinants can be detected from within tree topologies.
Certain recombination events can lead to sequence alignments having one tree underlying some portion of the alignment, and a different tree underlying another portion. Frequently this can be identified by some taxa changing their position within the tree; see, for example, the Hepatitis B alignment of Bollyky et al. (1996) and the Dengue fever alignment of Holmes, Worobey, and Rambaut (1999).
We simulated data of this type by concatenating alignments from two generating trees. Figure 6 shows the trees used to generate the recombinant alignments used in the simulation. There were two basic topologies, unbalanced and balanced. For both tree topologies, the expected number of substitutions from the root to each tip was 0.3, and these were distributed according to the molecular clock hypothesis. With each basic topology, the recombinant had parents that were either close, intermediate, or divergent (we denote the recombinant in these cases by R1, R2, and R3, respectively), giving 2 × 3 = 6 experiments in total. Each sequence was 1,000 bp long; sites 1 to 500 were simulated on the tree where the recombinant (either R1, R2, or R3) was attached to its left-hand parent, and sites 501 to 1,000 were simulated along the tree where the recombinant taxon was attached to its right-hand parent.
In general, we suspect that the difference in $\bar{\delta}_x$ between recombinant and nonrecombinant sequences will be more pronounced within a larger data set as opposed to a smaller one because the ratio of the number of quartets containing the recombinant to the number of quartets containing some fixed taxon and the recombinant is $(n - 1)/3$. Note that this ratio is only a rough guide because a quartet containing the recombinant taxon does not necessarily have a high $\delta_q$ value. In addition, for some quartets containing the recombinant, the topology of the tree will not change from one side of the breakpoint to the other (for example quartet 1,5,8,R1 in fig. 6b); only the edge weights change. This explains why, as is particularly noticeable for the simulations using the balanced tree, those taxa that are close to the recombinant, e.g., taxa 1 and 2 for R1, have a higher $\bar{\delta}_x$ value than those taxa that are further away.
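The ratio $(n - 1)/3$ follows from a direct count. A quartet containing the recombinant is completed by choosing 3 of the other $n - 1$ taxa, whereas a quartet containing both the recombinant and one fixed taxon is completed by choosing 2 of the remaining $n - 2$ taxa, so that

$$\frac{\binom{n-1}{3}}{\binom{n-2}{2}} = \frac{(n-1)(n-2)(n-3)/6}{(n-2)(n-3)/2} = \frac{n-1}{3}.$$

As a worked case, for $n = 16$ taxa there are $\binom{15}{3} = 455$ quartets containing the recombinant and $\binom{14}{2} = 91$ quartets containing the recombinant together with a given other taxon, a ratio of 5.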
Biological Data
We considered three data sets to explore the applicability of the methods developed in this article.
Viral Data
Our first example was the Hepatitis B virus (HBV) data set analyzed by Bollyky et al. (1996), which contains 24 isolates. The HBV sequences were corrected for multiple changes according to the HKY model using PAUP (Swofford 1998) to estimate parameters (more generalized distance corrections with distributions of rates across sites or proportions of invariant sites were also tried and produced larger values for $\bar{\delta}$).
In figure 9a, we present the $\delta$ plot for this data set ($\bar{\delta}$ = 0.16), and in figure 9b we plot the $\bar{\delta}_x$ values for each taxon $x$. The HBV strain with the highest value is the outgroup taxon HBVADW4A, the sole representative of HBV genotype F. Bollyky et al. (1996) found that the HBV data contained two recombinant taxa; the positions of these two recombinant taxa are indicated in figure 10 by arrows.
For comparison with another method that analyzes phylogenetic signals before tree construction, we also computed a likelihood map (Strimmer and von Haeseler 1997) for the data, as calculated by Tree-Puzzle 5.0 (Strimmer and von Haeseler 1996); see figure 9c. At this stage, no software is available to make it possible to compare with the more recently developed quartet mapping approach (Nieselt-Struwe and von Haeseler 2001). Most of the quartets are mapped into one of the three regions of the likelihood map. This suggests that this data set contains signals suitable for phylogenetic analysis, a conclusion that can also be drawn from the shape of the $\delta$ plot. It is also in agreement with the high bootstrap values for the main HBV genotypes that were reported by Bollyky et al. (1996).
Gene-Order Data
We considered the data set presented in Sankoff et al. (2000). In that article, normalized induced breakpoint distances were computed between mitochondrial genomes of 18 eukaryotes. Previous work on gene-order distances suggests that they may contain phylogenetic signals, although this signal may not be strong and it is uncertain how reliable such distances are for distant taxa. The $\delta$ plot for this data set is presented in figure 10a ($\bar{\delta}$ = 0.32). It appears to be quite nontreelike, which was supported by further analysis with SplitsTree (results not shown). In figure 10b it is seen that the two green algae Marchantia polymorpha and Nephroselmis olivacea have the highest $\bar{\delta}_x$ values, followed by a group of four taxa identified by Sankoff et al. (2000) as early branching protists. This agrees with previous work where it was found that the taxa M. polymorpha and N. olivacea confounded tree-building techniques (David Bryant, personal communication).
Restriction Fragment Length Polymorphism Data
The third data set consists of 42 isolates of the yeast C. albicans (Jan Schmid, personal communication). Distances were formed from a binary character matrix on the basis of the presence of bands in AFLP.
The extent to which C. albicans reproduces sexually versus clonally is currently a matter of debate (Pujol et al. 1993; Graser et al. 1996; Tibayrenc 1997); if there is considerable sexual reproduction, then reassortment of chromosomes will result in a nontreelike signal (although some treelike signals may remain because of linkage along the chromosomes).
In Schmid et al. (1999), evidence is presented for a cluster of genetically similar isolates within C. albicans that is prevalent across many geographical regions, patient types, and forms of infection. We computed $\delta$ plots for all isolates and also for the cluster (and its complement) proposed by Schmid et al. (1999). Figure 11 shows a marked difference in the $\delta$ plots for the isolates within the cluster ($\bar{\delta}$ = 0.05) as compared with those outside the cluster ($\bar{\delta}$ = 0.31).
Discussion |
---|
We found that the topology of the underlying tree has an effect on $\bar{\delta}$, as illustrated by the difference between $\bar{\delta}$ for the balanced and unbalanced topologies shown in figure 3. Moreover, as indicated by figure 3, the location of a taxon $x$ within the tree topology can influence $\bar{\delta}_x$. In general, the unbalanced topology is harder to reconstruct accurately, presumably because of the combination of short and long branches. But there was a significant improvement in the accuracy of tree estimation as the taxa with the highest $\bar{\delta}_x$ values were removed from the unbalanced tree. This suggests that removing taxa with a high $\bar{\delta}_x$ value could provide a quantitative method for subsampling data sets to avoid the problems of long branch attraction.
The high values for the unbalanced tree also highlight a potential shortcoming of our definition of $\delta_q$, in that it does not take into account the lengths of the pendant edges in figure 1. Thus two quartets with the same internal edge lengths $s$, $l$ will have the same $\delta_q$ value even if one has extremely long pendant edges relative to the other, making it more starlike. We experimented with normalization factors for $\delta_q$ values to account for this problem (for example, $(s + l)/(a + b + c + d + s + l)$) but found that these did not alter the general trends shown in figure 3.
It is interesting to note that varying the correction used for multiple changes in sequence data seemed to make little difference to $\bar{\delta}$. For example, for the HBV data, the more general distance corrections (including gamma distributions of rates across sites and proportions of invariant sites) tended to produce higher $\bar{\delta}$ values than the simple HKY model (despite being indicated as superior models according to likelihood ratio tests). Performing corrections that make distance data less additive may have an adverse effect on the accuracy of topology estimation, even though they presumably make edge length estimation more accurate.
With respect to recombination, we found that increasing levels of recombination led to higher $\bar{\delta}$ values. In a simple simulation, individual recombinant taxa were found to have significantly higher $\bar{\delta}_x$ values than the nonrecombinant taxa. We believe that $\delta$ plots are a useful preliminary test to suggest possible recombinants. But caution in interpreting the results is necessary because factors other than recombination can cause high $\bar{\delta}_x$ values. This was seen in the example with HBV, where it is the outgroup taxon that has the highest $\bar{\delta}_x$ value as opposed to the recombinant strains. In any case the distances can contain no information about the spatial process of recombination, so whenever high $\bar{\delta}_x$ values suggest recombination this needs to be followed up by, for example, a search for breakpoints (see Grassly and Holmes 1997; McGuire, Wright, and Prentice 1997; Weiller 1998; Holmes, Worobey, and Rambaut 1999 for different approaches to breakpoint detection). Our studies suggest that an interesting application would be to compute $\bar{\delta}$ within a sliding window to look for noisy portions of alignments or breakpoints.
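The sliding-window idea can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names (`p_distance`, `mean_delta`, `sliding_mean_delta`) are hypothetical, uncorrected proportions of differing sites stand in for model-corrected distances, quartets with three equal pairwise sums are scored 0, and the four toy sequences are constructed so that the two halves of the alignment support different trees.

```python
from itertools import combinations


def p_distance(a, b):
    """Uncorrected proportion of differing sites between aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_delta(d):
    """Mean of delta_q over all quartets of the distance matrix d."""
    values = []
    for x, y, u, v in combinations(range(len(d)), 4):
        smallest, middle, largest = sorted(
            [d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]]
        )
        denom = largest - smallest
        values.append(0.0 if denom == 0 else (largest - middle) / denom)
    return sum(values) / len(values)


def sliding_mean_delta(seqs, window, step):
    """(start, mean delta) for each window of the alignment."""
    results = []
    for start in range(0, len(seqs[0]) - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        d = [[p_distance(a, b) for b in chunk] for a in chunk]
        results.append((start, mean_delta(d)))
    return results


# Sites 0-3 support ((0,1),(2,3)); sites 4-7 support ((0,2),(1,3)).
seqs = ["AAAAAAAA", "AAAACCCC", "CCCCAAAA", "CCCCCCCC"]
print(sliding_mean_delta(seqs, window=4, step=2))
```

In this contrived example the two windows that lie entirely on one side of the breakpoint are perfectly treelike, and only the window straddling it is not; real alignments would be noisier, and the window size would need to be large enough for the distance estimates to be meaningful.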
In cases where the $\delta$ plots indicate that the data are far from being additive, we suggest the use of phylogeny software such as SplitsTree (Huson 1998) or Spectronet (Huber et al., unpublished data) that does not restrict the result to being a tree. We also suggest testing the stability of the tree topology on removing those taxa with high $\bar{\delta}_x$ values. Because $\delta$ values are dependent on both sequence length and tree topology, it is not easy to give specific cutoff values over which data sets or individual taxa should be deemed suspicious. One possible approach to attaining such cutoff values when the distances are based on sequence alignments would be to use parametric bootstrapping to generate data sets based on the best tree fitting the observed distances.
In conclusion, we believe that, in combination with other tools for analyzing treelikeness, such as statistical geometry (Dress 1988; Eigen, Winkler-Oswatitsch, and Dress 1988; Eigen and Winkler-Oswatitsch 1990), $\delta$ plots can provide a useful way to visualize and explore data sets that complements the various distance-based tree-building methods.
Acknowledgements |
---|
Footnotes |
---|
Keywords: genetic distance, statistical geometry, phylogenetic analysis, tree reconstruction, assessment of data quality, recombination
Address for correspondence and reprints: B. R. Holland, Lehrstuhl für Spezielle Zoologie, Ruhr-Universität Bochum, 150 Universitätstr., Bochum 44780, Germany. E-mail: barbara.holland{at}ruhr-uni-bochum.de.
References |
---|
Bandelt H.-J., A. W. M. Dress, 1992 Split decomposition: a new and useful approach to phylogenetic analysis of distance data Mol. Phyl. Evol 1:242-252
Bollyky P. L., A. Rambaut, P. H. Harvey, E. C. Holmes, 1996 Recombination between sequences of Hepatitis B virus from different genotypes J. Mol. Evol 42:97-102
Buneman P., 1971 The recovery of trees from measures of dissimilarity Pp. 387-395 in F. R. Hodson, D. G. Kendall, and P. Tautu, eds. Mathematics in the archaeological and historical sciences. Edinburgh University Press, Edinburgh, U.K.
Dress A., 1988 Statistische Geometrie von Konfigurationen und deren Evolution in Sequenz-Räumen: Definitionen und Probleme. Ein Programmvorschlag. In H. Begehr, ed. Die Bedeutung der von Berlin ausgehenden Mathematik in Vergangenheit und Gegenwart. Kolloquium-Verlag, Berlin.
Eigen M., R. Winkler-Oswatitsch, 1990 Statistical geometry on sequence space Methods Enzymol 183:505-530
Eigen M., R. Winkler-Oswatitsch, A. Dress, 1988 Statistical geometry in sequence space: a method of quantitative sequence analysis Proc. Natl. Acad. Sci. USA 85:5913-5917
Felsenstein J., 1985 Confidence-limits on phylogenies: an approach using the bootstrap Evolution 39:783-791
Felsenstein J., 1993 PHYLIP (phylogeny inference package). Version 3.5c Department of Genetics, University of Washington, Seattle
Graser Y., M. Volovsek, J. Arrington, G. Schonian, W. Presber, T. G. Mitchell, R. Vilgalys, 1996 Molecular markers reveal that population structure of the human pathogen Candida albicans exhibits both clonality and recombination Proc. Natl. Acad. Sci. USA 93:12473-12477
Grassly N. C., E. C. Holmes, 1997 A likelihood method for the detection of selection and recombination using sequence data Mol. Biol. Evol 14:239-247
Hendy M. D., D. Penny, 1993 Spectral analysis of phylogenetic data J. Class 10:5-24
Holland B. R., 2001 Evolutionary analyses of large data sets: trees and beyond Doctoral dissertation, Massey University, Palmerston North, New Zealand.
Holmes E. C., M. Worobey, A. Rambaut, 1999 Phylogenetic evidence for recombination in Dengue virus Mol. Biol. Evol 16:405-409
Huson D., 1998 SplitsTree: a program for analyzing and visualizing evolutionary data Bioinformatics 14:68-73
Kimura M., 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences J. Mol. Evol 16:111-120
Lyons-Weiler J., G. A. Hoelzer, R. J. Tausch, 1996 Relative Apparent Synapomorphy Analysis (RASA) I: the statistical measurement of phylogenetic signal Mol. Biol. Evol 13:749-757
McGuire G., F. Wright, M. J. Prentice, 1997 A graphical method for detecting recombination in phylogenetic data sets Mol. Biol. Evol 14:1125-1131
Nieselt-Struwe K., A. von Haeseler, 2001 Quartet-mapping, a generalization of the likelihood-mapping procedure Mol. Biol. Evol 18:1204-1219
Posada D., K. A. Crandall, 2001 Intraspecific gene genealogies: trees grafting into networks Trends Ecol. Evol 16:37-45
Pujol C., J. Reynes, F. Renaud, M. Raymond, M. Tibayrenc, F. J. Ayala, F. Janbon, M. Mallie, J. Bastide, 1993 The yeast Candida albicans has a clonal mode of reproduction in a population of infected human immunodeficiency virus-positive patients Proc. Natl. Acad. Sci. USA 90:9456-9459
Rambaut A. E., N. C. Grassly, 1997 Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees Comput. Appl. Biosci 13:235-238
Rambaut A. E., N. C. Grassly, Treevolve. Version 1.32 Available from http://evolve.zoo.ox.ac.uk.
Saitou N., M. Nei, 1987 The neighbor-joining method: a new method for reconstructing phylogenetic trees Mol. Biol. Evol 4:406-425
Sankoff D., D. Bryant, M. Denault, B. F. Lang, G. Burger, 2000 Early eukaryote evolution based on mitochondrial gene order breakpoints J. Comp. Biol 7:521-536
Schmid J., S. Herd, P. R. Hunter, R. D. Cannon, M. Salleh, 1999 Evidence for a general-purpose genotype in Candida albicans, highly prevalent in multiple geographical regions, patient types and types of infection Microbiology 145:2405-2413
Strimmer K., A. von Haeseler, 1996 Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies Mol. Biol. Evol 13:964-969
Strimmer K., A. von Haeseler, 1997 Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment Proc. Natl. Acad. Sci. USA 94:6815-6819
Swofford D. L., 1998 PAUP*: phylogenetic analysis using parsimony (* and other methods). Version 4.0 Sinauer Associates, Sunderland, Mass.
Swofford D., G. Olsen, P. Waddell, D. Hillis, 1996 Phylogenetic inference Pp. 407-514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. 2nd edition. Sinauer Associates, Sunderland, Mass.
Tibayrenc M., 1997 Are Candida albicans natural populations subdivided? Trends Microbiol 5:253-257
Weiller G. F., 1998 Phylogenetic profiles: a graphical method for detecting genetic recombinations in homologous sequences Mol. Biol. Evol 15:326-335
Zaretsky K., 1965 Reconstruction of a tree from the distances between its pendant vertices Uspekhi Math. Nauk (Russian Mathematical Surveys) 20:90-92