* Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, New Zealand; School of Computing Sciences, University of East Anglia, Norwich, United Kingdom
Correspondence: E-mail: b.r.holland{at}massey.ac.nz.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: lower bounds parsimony phylogeny estimation human mtDNA
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
In Hendy, Foulds, and Penny (1980) an approach is presented for demonstrating, in some cases at least, that a particular tree is guaranteed to be of minimal length under the parsimony criteria. The procedure employs an upper and lower bound, and it aims at reducing the upper bound and increasing the lower bound, until they are equal (The MinMax Squeeze).
An example of this approach is illustrated in figure 2 with a small set of six taxa and seven columns (each with only two character-states). Partitioning the data into separate columns gives a lower bound of seven changes, as each column will require at least one mutation on any tree (fig. 2A). A search through tree space finds a tree with 11 changes, and this is taken as an upper boundfour changes more than the lower bound. Partitioning the columns into pairs leads to an increase in the lower bound to 10 (fig. 2B). The increase arises from detecting pairs of columns that are incompatible: they will not fit onto any tree with only one change for each column. For example, considering columns 1 and 2 together gives four dinucleotide pairs, GA, TA, GC, and TC. The minimum possible number of mutations joining the four dinucleotide pairs on any tree is 3. Thus placing columns 1 and 2 into a single part of the partition increases the lower bound by 1. However, the partition in figure 2B still gives a lower bound one less than the upper bound. In this case, the partition of columns in figure 2C increases the lower bound until it equals the upper bound, as the 3-tuple of columns 1, 2 and 7 gives six unique trinucleotide sequences which will require at least five mutations on any tree. Therefore, by the Partition Theorem of Hendy, Foulds, and Penny (1980) (which we restate as Theorem 1 in the next section), there can never be a shorter tree for these data, though the analysis in figure 2 does not demonstrate whether the minimal length tree is unique.
|
In the last 15 years intraspecies phylogenies constructed from mitochondrial data sets have been used to test hypotheses about the origins of modern humans and patterns of human migration (Cann, Stoneking, and Wilson 1987; Vigilant et al. 1991; Murray-McIntosh et al. 1998). The first study to use complete mitochondrial genomes was by Ingman et al. (2000) who used phylogenetic evidence from 53 human mitochondrial genomes to test hypotheses about the origin and dispersal of modern humans. We apply the new lower bounds we have developed to this data set and find that a set of 40 trees (2 steps shorter than the published tree) can be proved minimal.
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Given an alignment A and a phylogenetic tree on A, that is a tree with leaves labeled by the sequences of A, the Small Parsimony Problem is to compute the cost or minimum number of mutations required along the tree to give the sequences (Fitch 1971; Rinsma, Hendy, and Penny 1990). The Large Parsimony Problem is to find, over all phylogenetic trees, the trees on A that have minimal cost (Felsenstein 2003; Semple and Steel 2003). We denote this minimal cost, or parsimony score, by P(A). The Large Parsimony Problem is intractable for typical sized alignments (Graham and Foulds 1982), although for data that are nearly homoplasy-free, it is possible to find a tree in polynomial time (Fernandez-Baca and Lagergren 2003). The computational complexity of finding optimal trees means that for many data sets heuristic approaches must be used. Although a heuristic method cannot guarantee to have found the optimal tree, the score of the tree found does provide an upper bound on the score of the optimal tree.
We now turn to the problem of finding lower bounds for the parsimony score of an alignment. Given an alignment A with columns indexed by the set CA = {1,2,..., cA}, and a subset of CA, we let A
be the alignment consisting of the columns indexed by
(so that in particular A
is an nA x |
| matrix). For example, in figure 2C the subset {1, 2, 7} of CA = {1, 2, ..., 7} gives a 7 x 3 subalignment. The approaches that we present below for finding lower bounds for P(A) depend on the following theorem, which originally appeared in Hendy, Foulds, and Penny (1980).
Theorem 1 Given an alignment A, anda partition of CA (i.e., a division of CA into non-empty, non-intersecting subsets or parts whose union is CA). Then
Informally, this theorem states that the score of any tree on the alignment A can never be less than the sum of the parsimony scores of the subalignments of A given by the partition .
Distinct Sequence Bounds
The first lower bound that we present is based on the following result, whose straightforward proof we omit. For an alignment A, we let dA denote the number of distinct sequences aligned in A minus one in case the set of sequences being aligned is ample, and otherwise define dA to be just the number of distinct sequences.
Theorem 2 Suppose that A is an alignment. Then P(A)dA.
Now, given a partition of CA, we let
![]() | (1) |
Clearly, for any alignment A the bound D(A) depends on the choice of partition
. In case
is the trivial partition of CA, i.e., the partition
t in which every part has cardinality or size 1 (as in fig. 2A), then, denoting by ri the number of distinct character states in column i of A, the inequality P(A)
D
t(A) can be re-expressed as
![]() | (2) |
Incompatibility Bounds
For an alignment A and i, j distinct, we denote Ai,j also by Aij, and call a pair of columns in A indexed by i and j incompatible if P(Aij) > (ri 1) + (rj 1). In (Semple and Steel 2003, pg. 97) a lower bound for P(A) is presented that is based on the number of pairs of incompatible columns in A. We now introduce an alternative lower bound.
Theorem 3 Suppose that A is an alignment. Let M(A) be the minimal value oftaken over all integers mi
(ri 1) satisfying
Then P(A)
(3) M(A).
Proof:
For a phylogenetic tree T on A, letdenote the minimum number of mutations along T required for column i of A. Then
and
for all 1
i < j
cA. Moreover, if T' is a phylogenetic tree on A with cost P(A), then
The inequality P(A)
M(A) follows directly from these observations.
Informally, equation (3) states that the combined score of any pair of columns on any tree is at least the parsimony score for the subalignment of just those two columns. Theorem 3 states that if we could solve the integer program described it would provide a lower bound on the parsimony score. The next result follows by summing the inequalities in equation (3).
Corollary 4 Suppose that A is an alignment with cA2. Then
By Theorem 1 and Corollary 4, it follows that
![]() | (4) |
Using an approach based on incompatibility suggests a natural greedy algorithm for finding a partition of CA, this can be obtained as follows. (We will abuse notation slightly and say that a pair of elements of CA are incompatible if they index incompatible columns of A.) First compute all pairs of incompatible elements of CA, then, starting with 1, form a subset of CA by greedily adding in the next smallest element of CA that is incompatible with 1, and then repeatedly adding in the next smallest remaining element of CA that is incompatible with all of those elements already in
. This is continued until no further element can be added to
. Then, starting with the smallest element of CA
, create a new subset in the same way. This is repeated until a partition
g of CA is created, which will consist of sets of pairwise incompatible elements together with sets of size one. Note that having created such a partition we can evaluate both D- and I-bounds associated with
g.
For the special case that A is an alignment of binary sequences in which every column contains exactly two states, a lower bound can be computed for P(A) that improves on the I-bound for A. This bound is obtained as follows. Suppose that is a part in
g containing at least two elements. Since A is an alignment of binary sequences and every pair of columns indexed by
is incompatible, it follows for all i, j distinct in
that P(Aij) = 3. Hence, since in equation (3) of Theorem 3 the value of mi is 1 or 2 for all i in
, it follows that there is at most one column i of A
with mi = 1, consequently, P(A
)
2|
| 1. Moreover, if
has size 1, then similarly P(A
) = r
1 = 1 = 2|
| 1. Hence, by summing over all parts in
g we obtain
![]() | (5) |
Optimizing Partitions
For many alignments the lower bounds described above may be less than the upper bound (the parsimony score of a tree found by heuristic search), and hence cannot be used to prove the trees for such alignments minimal. However, in such cases the D- and I-bounds (but not the I'-bound) may be improved by choosing new partitions so as to (locally) optimize the bound. To do this, we consider two types of moves in partition space: The first takes two parts of a partition and merges them to form a single part. The second moves a single element from one part of a partition into a different part. These moves give a neighbourhood of partitions around any given partition whose lower bounds can be evaluated. A simple hill-climbing procedure can then be used to move through partition space until a local optimum is found.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Simulations
DNA alignments were simulated using Treevolve (Grassly and Rambaut, available from http://evolve.zoo.ox.ac.uk/software.html), which simulates the evolution of DNA sequences according to a coalescent model. As many mtDNA data sets used in population studies are effectively two-state because of the strong transition/transversion bias in mitochondrial data, we simulated such data sets by taking alignments generated by Treevolve, and amalgamating the A and G states to R, and the C and T states to Y. 100 data sets of length 100 base pairs were created under a Jukes-Cantor model of nucleotide substitution (Jukes and Cantor 1969), with either 20 or 40 taxa, and with either a high (1*109) or low (5*1010) mutation rate. The other Treevolve parameter settings we employed were population size, n, of 1*108 and generation time, b, of 1. In alignments simulated using the lower rate there were fewer parallel substitutions and back substitutions and hence less homoplasy. Although only two different mutation rate settings were used, the stochastic nature of the simulations resulted in alignments with a range of homoplasy levels.
Upper bounds on the parsimony score for the alignments were generated using PAUP* by taking the minimum of the parsimony score of the Neighbor-Joining tree (Saitou and Nei 1987), and the parsimony score of the best tree found using a heuristic search. In practice the Neighbor-Joining tree never gave an upper bound that was smaller than the upper bound derived from the tree found by heuristic search.
Both the D- and I-bounds for the parsimony score were computed employing the trivial partition t and the greedily constructed partition
g as described in the Methods. We denote these lower bounds by Dt, Dg, It, and Ig. When necessary, we also performed local optimization to try to improve bounds, which resulted in bounds which we denote, for example, by
as opposed to Dt. For alignments involving only 2-state characters, we also calculated the I'-bound which we denote simply by I'.
Results are presented in table 1 (2-state character data) and table 2 (4-state character data). As can be seen in these tables, the bound Dt = It was rarely equal to the parsimony score. For 2-state data (table 1) the bound I' gave slightly better results than the bounds Dg and Ig. However, except for alignments with low levels of homoplasy these bounds were rarely tight. Of the bounds Dg, Ig, and I'g none dominated the others; that is, none was better for all alignments. Also, none of the locally optimized bounds dominated the others. However, on average, the lower bound performed best. In contrast to this, it was slightly better to start from the partition
g when computing a locally optimized I-bound. For all the bounds the mean difference between the computed upper and lower bounds tended to be larger for the data sets with 40 taxa. However, this difference increased more sharply for the I-bounds than for the D-bounds.
|
|
|
|
Table 3 shows the performance of the D- and I-bounds. For this reduced data set it was possible to prove that the 20 equally scoring trees are optimal using the fact that the upper and lower bounds were the same. Note that this doesn't rule out the possibility that there are still other most parsimonious trees.
|
For the bound a further optimization was performed that made use of information given by one of the 40 equally scoring trees (chosen arbitrarily). For each part
in the locally optimal partition used to calculate
we checked whether the lower bound D(A
) equalled the cost of the columns in
for the selected tree. Those parts in the partition where the cost and the lower bound were equal were retained, and those where they differed were divided into trivial partitions (i.e., into a collection of size 1 subsets), and the procedure of moving through partition space using merge moves was started afresh. This procedure was repeated three times until a lower bound of 269 was attained, indicating that the 40 trees found by heuristic search are optimal (although they may not comprise the set of all most parsimonious trees). The partition that gave the lower bound of 269 (parts of size 1 not shown) was {{0, 11, 82, 85, 161}, {1, 8, 46, 47, 49, 72, 84, 181}, {3, 35, 199}, {4, 9, 14, 27, 68, 114, 180, 191}, {5, 12, 13, 21, 28, 37, 77, 117, 177, 188}, {6, 25, 44, 45, 88, 104, 110, 190}, {7, 17, 53, 60, 108, 192}, {16, 23, 83}, {20, 69, 89, 189}, {26, 143, 193}, {31, 109}, {32, 194}, {42, 52, 87, 186}, {43, 197}, {50, 116}, {61, 187}, {74, 179}, {75, 201}, {79, 112, 115}, {81, 111}, {113, 162}, {130, 198}, {182, 200}, {2, 195, 196}}.
In figure 4 we present a consensus network (Holland and Moulton 2003) of the optimal trees with a threshold of 0. In this extreme form a consensus network is simply the median network of all splits (edges) that appear in any of the 40 optimal trees. The network indicates some areas of the phylogeny that are not yet fully resolved. The 40 trees fall into five groups of 8, where within each group there are no hard conflicts but there are different levels of resolution. Denoting the group of Guraani, Siberian Inuit, and Japanese as clade A, and the group of Warao, Evenki, Khirgiz, and Buriat as clade B: one group of 8 has the Guraani basal within clade A, and clade A and B as sister groups; another 8 have a Japanese sequence basal within clade A, and clade A and a PNG Coast sequence form a sister group; another 8 have a Japanese sequence basal within clade A, and clade A and Asian Indian form a sister group; another 8 have no resolution within clade A, and clade A and clade B form a sister group; the final 8 trees have no resolution within clade A and no resolved sister group for clade A.
The 40 equally scoring trees proved optimal here had several differences to the Neighbor-Joining tree published in Ingman et al. (2000): In particular a pair of Australian lineages comes deeper within the non-African clade in accordance with the archaeological evidence that modern humans have been in Australia for more than 50,000 years (Roberts et al. 2001; Kirch 2000). The European sequences Saami, English, French, Dutch, and Crimean form a clade in all of the 40 optimal trees, as do the Samoan, Korean, and Chinese sequences. This brings more agreement between the tree and the known geographic distribution. One of the clades reported in the Ingman et al. (2000) Neighbor-Joining tree appears in 8/40 of the optimal trees, but is contradicted in 16/40 (in the other 16 it appears in an unresolved polytomy). Note that although these 40 trees have differences from the tree of Ingman et al. (2000) they do not alter the main conclusion of the Ingman et al. article regarding the African origin of modern humans.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The performance of the lower bounds was tested on both simulated and biological data. As expected, the bounds are better for data with low levels of homoplasy. Our results also indicate that a range of lower bounds should be employed to obtain best results. For data that is almost homoplasy free the bounds that are simple to compute (e.g., Dg, Ig, or I'g) should suffice. However, for high levels of homoplasy, lower bounds usually require some optimization in partition space to prove that trees are minimal.
For one of the real data sets we also used extra information, from one of the 40 trees having a parsimony score equal to the best upper bound, to improve on our lower bounds. This method might be further developed to cope with data with higher levels of homoplasy by making more use of information about the costs of each column on a tree that provides the upper bound. This could also indicate more effective ways of both constructing an initial partition and moving through partition space. Another approach that could improve the lower bounds would be to derive bounds by solving the integer program that can be formulated using the inequalities in Theorem 3 (and their counterparts for parts in partitions of size greater than 2). In addition, optimizing in partition space might be improved by approaches such as simulated annealing (Dowsland 1993) or tabu search (Glover and Laguna 1993), as these methods may allow the search to escape from local optima.
Here we have been concerned with the problem of proving trees minimal; however, tight lower bounds will also have application to the exact Branch and Bound search methodology (Hendy and Penny 1982; Felsenstein 2003). Better bounds will lead to larger portions of tree space being eliminated in the search. Ultimately we expect that this approach could lead to faster and more accurate phylogeny estimation.
At present, parsimony is only shown to be a maximum likelihood estimator when the set of sequences being analyzed is ample. For the complete mitochondrial genomes here, we are at a boundary between being ample and being sparse, and more theoretical work is required. It may be that a combination of ample subsets and maximum average likelihood between these subsets is the appropriate ML estimator. In reality there is a multi-scale continuum between generations, populations, genera, and orders etc., and it is necessary to use the most appropriate method for any application.
It is expected, with the increase in DNA sequencing potential, that many more population data sets will require analysis. Thus the MinMax Squeeze will have a large number of applications. We have applied the approach to over 100 limpet sequences (N. Gemmel, Christchurch) and in that case it was trivial to prove the tree minimal as the set of sequences is ample (unpublished data). Our main application here is to 53 complete mitochondrial genomes. This is a large increase in the size of a non-trivial data set for which a tree has been proved optimal, but we have little idea yet on the upper limits for the technique. For example, adding additional genomes in regions where the tree is still unresolved (fig. 4) should reduce the number of optimal trees from the 40 found at present. However, additional genomes may increase the extent of homoplasy, perhaps requiring more powerful methods for determining lower bounds.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Addario-Berry, L., B. Chor, M. Hallett, J. Lagergren, A. Panconesi, and T. Wareham. 2003. Ancestral maximum likelihood of phylogenetic trees is hard. Pp. 202215 in G. Benson and R. Page, eds. Algorithms in Bioinformatics, Third International Workshop, WABI 2003. Springer-Verlag, Berlin, Heidelberg.
Cann, R. L., M. Stoneking, and A. C. Wilson. 1987. Mitochondrial DNA and human evolution. Nature 325:3136.[CrossRef][ISI][Medline]
Dowsland, K. A. 1993. Simulated annealing. Pp. 2069 in C. R. Reeves, ed. Modern Heuristic Techniques for Combinatorial Problems. John Wiley and Sons Inc., New York.
Drummond, A. J., O. G. Pybus, A. Rambaut, R. Forsberg, and A. G. Rodrigo. 2003. Measurably evolving populations. Trends Ecol. Evol. 18:481488.[CrossRef][ISI]
Felsenstein, J. 2003. Inferring phylogenies. Sinauer Associates, Sunderland, Mass.
Fernandez-Baca, D., and J. Lagergren. 2003. A polynomial-time algorithm for near-perfect phylogeny. SIAM J. Comput. 32:11151127.[CrossRef][ISI]
Fitch, W. M. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20:406416.[ISI]
Glover, F., and M. Laguna. 1993. Tabu search. Pp. 70150 in C. R. Reeves, ed. Modern Heuristic Techniques for Combinatorial Problems. John Wiley and Sons Inc., New York.
Graham, R. L., and L. R. Foulds. 1982. Unlikelihood that minimal phylogenies for a realistic biological study can be constructed in reasonable computational time. Math. Biosci. 60:133142.[CrossRef][ISI]
Hendy, M. D., L. R. Foulds, and D. Penny. 1980. Proving phylogenetic trees minimal with l-clustering and set partitioning. Math. Biosci. 51:7188.[CrossRef][ISI]
Hendy, M. D., and D. Penny. 1982. Branch and bound algorithms to determine minimal evolutionary trees. Math. Biosci. 60:133142.[CrossRef][ISI]
Herrnstadt, C., J. L. Elson, E. Fahy, G. Preston, D. M. Turnbull, C. Anderson, S. S. Ghosh, J. M. Olefsky, M. Flint Beal, R. E. Davis, and N. Howell. 2002. Reduced-median-network analysis of complete mitochondrial DNA coding-region sequences for the major African, Asian, and European haplogroups. Am. J. Hum. Genet. 70:11521171.[CrossRef][ISI][Medline]
Holland, B. R., and V. Moulton. 2003. Consensus networks: a method for visualising incompatibilities in collections of trees. Pp. 165176 in G. Benson and R. Page, eds. Algorithms in Bioinformatics, Third International Workshop, WABI 2003. Springer-Verlag, Berlin Heidelberg.
Ingman, M., H. Kaessmann, S. Paabo, and U. Gyllenstern. 2000. Mitochondrial genome variation and the origin of modern humans. Nature 408:708713.[CrossRef][ISI][Medline]
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21132 in M. N. Munro, ed. Mammalian Protein Metabolism, Vol. 3. Academic Press, New York.
Kirch, P. V. 2000. On the Road of the Winds: an Archaeological History of the Pacific Islands Before European Contact. University of California Press, Berkeley, Calif.
Kivisild, T., H. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems. 2002. The emerging limbs and twigs of the East Asian mtDNA tree. Mol. Biol. Evol. 19:17371751.
Lewis, P. P. 2001. A likelihood approach to estimating phylogeny from discrete morphological character data. Syst. Biol. 50:913925.[CrossRef][ISI][Medline]
Murray-McIntosh, R. P., B. J. Scrimshaw, P. J. Hatfield, and D. Penny. 1998. Testing migration patterns and estimating founding population size in Polynesia by using human mtDNA sequences. Proc. Natl. Acad. Sci. U. S. A. 95:90479052.
Richards, M., H. Corte-Real, P. Forster, V. Macaulay, H. Wilkinson-Herbots, A. Demaine, S. Papiha, R. Hedges, H. J. Bandelt, and B. Sykes. 1996. Paleolithic and Neolithic lineages in the European mitochondrial gene pool. Am. J. Hum. Genet. 59:185203.[ISI][Medline]
Rinsma, I., M. D. Hendy, and D. Penny. 1990. Minimally coloured trees. Math. Biosci. 98:201210.[CrossRef][ISI][Medline]
Roberts, R. G., T. F. Flannery, L. K. Ayliffe, H. Yoshida, J. M. Olley, G. J. Prideaux, G. M. Laslett, A. Baynes, M. A. Smith, R. Jones, and B. L. Smith. 2001. New ages for the last Australian megafauna: continent-wide extinction about 46,000 years ago. Science 292:18881892.
Saitou, N., and M. Nei. 1987. The Neighbor-Joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406425.[Abstract]
Semple, C., and M. A. Steel. 2003. Phylogenetics. Oxford University Press, Oxford.
Steel, M. A., and D. Penny. 2000. Parsimony, likelihood and the role of models in molecular phylogenetics. Mol. Biol. Evol. 17:839850.
. 2004. Two further links between MP and ML under the Poisson model. Appl. Math. Lett. 17:785790.[CrossRef][ISI]
. 2005. MP and the phylogenetic information in multi-state characters. in V. Albert, ed. Parsimony, Phylogeny and Genomics. Oxford University Press, Oxford. (in press).
Swofford, D. L. 1998. PAUP* Phylogenetic analysis using parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Mass.
Vigilant, L., M. Stoneking, H. Harpending, K. Hawkes, and A. C. Wilson. 1991. African populations and the evolution of human mitochondrial DNA. Science 253:15031507.[ISI][Medline]
|