*Institute of Molecular BioSciences, Massey University, Palmerston North, New Zealand;
Department of Agriculture and Life Sciences, University of Tokyo, Tokyo, Japan; and
Graduate University for Advanced Studies, Tokyo, Japan
Whenever different types of data are used in a scientific analysis, a concern is whether they all point to the same answer except for sampling error. This is no less the case for phylogenetic analysis. If rates of evolution exceed a certain threshold, and edge (branch or internode) lengths and branching pattern meet certain configurations, then maximum parsimony will be inconsistent (Felsenstein 1978
). That is, the data will indicate the wrong tree with increasing support as sequence length tends to infinity. Accordingly, it is possible that faster-evolving characters strongly disagree with more slowly evolving characters in recovering the underlying tree, and the combined data set may also indicate an incorrect tree (Bull et al. 1993
). Furthermore, inconsistency has been shown to be a problem for all classes of phylogenetic analysis, including distance methods and maximum likelihood (Waddell 1995
; Lockhart et al. 1996
).
The data set may consist of a mixture of sites, that is, data partitions (a generic term that can mean a gene, a part of a gene, a set of genes, a locus, etc.) which are either consistent or inconsistent with regard to identifying the historical tree. Because of this, it is unwise to automatically combine all data (Bull et al. 1993
; Huelsenbeck et al. 1994
). Rather, one should make an effort to detect partitions of the data that lead to inferences of statistically distinct trees and then proceed with caution if differences are found (Cunningham 1997
). Note that we use the term "congruence" to mean agreement, which does not necessarily mean that trees are the same, but rather that the differences are not statistically significant.
There are other causes of incongruence apart from inconsistency of tree estimators and the attendant systematic error. One is stochastic bias due to finite sample size, i.e., a difference of the mean estimate with finite data from that with infinite data. Accordingly, the total difference between a tree estimate and the historical tree is a function of stochastic variance about the estimator's mean due to finite data, stochastic bias, and systematic error. However, stochastic bias generally drops away quickly with increasing sequence length and has rarely been shown to be a dominant factor in tree inference.
Another cause of incongruence is ancestral polymorphism. However, branch points of taxa outside of species complexes tend to be separated by reasonably long periods that are typically much longer than the ancestral population coalescence time. Furthermore, given recombination rates similar to substitution rates, finding gene trees that disagree strongly, e.g., with the bootstrap, is even more unlikely (Waddell 1995
). A third cause is horizontal transfer, which seems relatively rare in animals but is more common in bacteria and in plants, e.g., via hybridization. Overall, however, there is considerable evidence for inconsistency of tree estimators, e.g., quite different trees from first, second, and third positions, and we feel that this is the a priori most likely explanation in higher-order vertebrate systematics. That is, generally, inconsistency (systematic error) of estimators should be considered first, before looking for other causes. Therefore, associating the tests below with detection of inconsistency is often reasonable a priori, as discussed later.
Nested hypotheses of whether the optimal tree topologies for k data partitions are significantly different can be set up as follows.
H0: The trees from data partitions 1, ... , k are congruent, so we expect that the data sets were generated under the same tree topology (called TMED, the median tree).
H1: The trees for data partitions 1, ... , k are not congruent.
For example, assume there are two genes, A and B, and four taxa. Thus, there are three unrooted resolved trees T1, T2, and T3, with parameters 1,
2, and
3, respectively. Denote the data partitions A and B by XA and XB, respectively. Then, the maximum likelihood for tree Ti with Xj is ln L(Xj |
i(Xj), Ti), where ln is the natural log, and
i(Xj) is the maximum-likelihood estimate of
i, for Ti, based on data Xj. Let ln L(TiA, TjB) = ln L(XA |
i(XA), Ti) + ln L(XB |
j(XB), Tj). Furthermore, assume that the set of parameters describing the process of character change is chosen using AIC (e.g., Kishino, Miyata, and Hasegawa 1990
). The log likelihood for H0, ln L(H0), is
![]() | (1) |
| (2) |
To test H0 versus H1 when a tree is unknown a priori for each gene, brute force Monte Carlo (parametric bootstrap) computer simulation may be used (Huelsenbeck and Bull 1996
; Huelsenbeck, Bull, and Cunningham 1996
). When the tree that minimizes cLR is searched for (call it TMED; it is also the overall maximum-likelihood tree, as seen later), then n, e.g., 1,000, replicate samples of data A and B are simulated under the optimal parameter values associated with TMED. Each replicate data partition is analyzed in the same way as the original data (including tree search), and the difference in ln L under H0 and H1 is stored. The results of these replicates approximate the true distribution of cLR under H0. This test follows the outline of Cox (1962)
and is both exact and statistically efficient (for phylogenetic applications, see Reeves 1992
; Goldman 1993
). This approach is, unfortunately, computationally very expensive, and consequently this test is rarely used.
Another test of congruence (Waddell et al. 1999
) takes a tree, e.g., TMED, and asks if the maximum-likelihood trees of each data partition are within sampling error of it. This test measures the distance from TMED to the nominated tree, Ti, of each data partition in normalized
ln L units, i.e., ln L(TMED) - ln L(Ti) divided by the standard error of this difference (with the latter estimated using the formula of Kishino and Hasegawa [1989
]). These normalized likelihood differences should be close to standard-normally distributed, i.e., N(0, 1). As long as all trees are specified a priori, the expected distribution of the sum of squared normalized deviations approximates
2 (for f data partitions, degrees of freedom equals f if the median tree is known a priori, or f - 1 when TMED is chosen to minimize the squared deviations). This test does not take into account the increased uncertainty due to tree searching on the data being analyzed.
We report here a fast way to estimate the distribution of cLR under H0. It takes into account the maximization across trees by using centering (e.g., Hall 1992
, p. 292) of log likelihood values obtained from bootstrap replicates. Centering was recently used by Shimodaira and Hasegawa (1999)
to assist in the selection of a confidence set of trees for a single data partition, and this is a natural application of the centering procedure to another type of phylogenetic test.
Let X1, ... , Xk be the data partitions j = 1, ... , k, respectively. Next, select an appropriate set of parameters describing the evolution of characters in each data partition (e.g., using AIC). Select a type I error level, (e.g., 0.05), and fix n, the number of replicates to be used to infer the distribution under H0 (n > 100 is preferable), then perform the test as follows.
Step 1. Find the maximum-likelihood tree, i.e., T1, ... , Tk, for the k data partitions. If there are other trees to be tested, add in z - (k + 1) other trees, Tk+1, ... , Tz (so that z is the total number of trees included in the test).
Step 2. The test statistic for the mth tree (m = 1, ... , z) is
| (1) |
In this step, all free parameter values, including edge lengths, are reoptimized on each tree. By shifting the summation in equation (1) , clearly, the tree that gives the minimum value of cLRm, and hence the minimum sum of likelihood differences to the maximum-likelihood trees of each data partition, is TMED, with statistic cLRMED.
Step 3. (Necessary only if using RELL in step 4). For each data partition j, j = 1, ... , k, store the site pattern log likelihoods for Tm, m = 1, ... , z, in the vector vjm (giving the set v11, ... , vkz).
Step 4. For each data partition, j = 1, ... , k, generate n bootstrap replicates of the data, storing the hth replicate as Ihjm. For each hth replicate, evaluate the likelihood of data partition j on the mth tree by reoptimizing all free parameters, and store the likelihood (a single number) as ln Lhjm. The method of resampling estimated site log likelihoods (RELL; see Kishino, Miyata, and Hasegawa 1990
) from step 3, which avoids reoptimization of parameters, is a much quicker way to approximate ln Lhjm.
Step 5. Calculate 1/n nh=1 ln Lhjm, which is the average likelihood of data partition j under tree m, for the n replicates. Subtract this number from each likelihood replicate, i.e., Rhjm = ln Lhjm - 1/n
nh=1 ln Lhjm.
Step 6. For each replicate and each tree, calculate Shjm = maxi{Rhji} - Rhjm.
Step 7. For each replicate h and each tree Tm, calculate Ehm = kj=1 Shjm, where Ehm is the hth replicate of cLRm.
Step 8. Calculate pm = #{Ehm > cLRm}/n, for each m = 1, ... , z. If pm < , for all m = 1, ... , z, reject H0; otherwise, fail to reject H0.
The new test is described in general form. That is, it allows all trees from step 1 to be tested as congruent with the maximum-likelihood trees of the data partitions. If one simply wants to test H1 versus H0, then rejection of TMED almost certainly means rejection of H0 (thus, in steps 68, only TMED need be considered).
Note that steps 5 and 6 involve centering, which basically acts to conform the sampled log likelihoods of the data under different trees to what they would be if they came from a single tree as H0 specifies. Unlike the parametric bootstrap, centering forces all trees to have equal expected likelihoods under H0, which is typically not the case (i.e., some trees really are better than others). Generally, the test is conservative; i.e., it rejects H0 too rarely, due to this last point.
The number of replicates, n, can be selected in an adaptive way with minimal impact on the outcome of the test. As a simple example, if = 0.05 and pm = 0 for a trial of 100 replicates, it is sufficient to stop; otherwise, return to step 4 and increase n to 1,000.
It may be assumed that the test is made over all binary trees. However, these trees grow in number exponentially, so it is not feasible for moderate numbers of taxa. Additionally, due to the nature of centering, adding in trees far from T only makes the test more conservative. In practice, the test works best when we keep z small and include mostly trees within sampling error of each of the maximum-likelihood trees, including T. The test is fast when using RELL, so if the site likelihoods of promising trees (e.g., by the KH test) can be stored simultaneously with a tree search, there is minimal additional computational load.
A useful aspect of the new test is that any or all trees, 1, ... , z, can be tested to determine if they are acceptable "median" trees, including trees stated a priori (e.g., a morphological tree, a tree from another type of genetic data, or an area cladogram). Note that when testing any Tm as an acceptable median tree, the distribution of the test statistic is evaluated by centering on Tm (steps 68). For the parametric bootstrap, this is equivalent to resimulating on Tm (even if it is not an optimal tree). However, a test based on the distribution of cLR from the highest likelihood tree alone will often give the same answer. In testing a completely a priori tree, the new test tends to be more conservative than usual because there is no maximization of likelihood across trees for this tree.
Consider the question of whether the tree from the concatenated tRNA sequences (22 genes, 1,240 characters) of vertebrate mtDNA is congruent with the tree from the concatenated amino acids (AA; 12 genes, 2,272 noninvariant characters) of the same molecule (Waddell et al. 1999
). Each data partition has the same 36 taxa and spans from lamprey to closely related primates (data available from http://wheat.ab.a.u.-tokyo.ac.jp). The test of Waddell et al. (1999)
did not show a significant difference, although as models became more general, this test came very close to rejecting H0 with
= 0.05. Here, we reevaluate using the new test. First, we analyze the tRNA data with the most general model in PAUP*, version 4.0 (Swofford 1998
), that is, the general time-reversible model with eight free parameters other than the edge lengths, plus a mixture of invariant sites (
inv = 0.060) and variable sites following a
distribution (
= 0.973) (using successive updating of the maximum-likelihood estimates of parameters on the optimal tree following a tree bisection-reconnection tree search). The maximum-likelihood tree is slightly different from that found previously, although the differences are not statistically significant. The best tree now puts the unusual fish Polypterus with the ray fins, as most morphologists would favor. Also, the platypus joins the marsupials, and the armadillo branches immediately before the elephant, just one short internode from being sister to all the other placentals. tRNA trees of near-optimal likelihood combine local permutations of these parts of the tree, with those of the relative positions of the lamprey, the lungfish, and tetrapods. The amino acids are analyzed using ProtML, version 2.3 (Adachi and Hasegawa 1996
), the included mtDNA rate matrix, and removal of invariant sites in proportion to the constant sites (Waddell et al. 1999
).
Importantly, we have a priori biological trees from independent data. Traditional methods (paleontology and morphology) favor the tree Tbio1, (lamprey, (shark, ((rayfins, Polypterus), (coelacanth, (lungfish, (frog, (reptiles, (monotremes, (marsupials, (placentals)))))))))) (e.g., Carrol 1987
), while relationships within placentals are often assumed to follow Novacek (1992
, p. 64). An additional tree, Tbio2, our own estimate of the historical tree, is the same as above except that relationships within placentals follow Waddell, Okada, and Hasegawa (1999)
. Support for Tbio2 is largely independent of the mtDNA data. Are these trees congruent with the maximum-likelihood trees from amino acids and tRNA?
In performing the test, = 0.05 and n = 10,000. Our set of z trees incorporates 20 trees within sampling range of the maximum-likelihood tree from either tRNA or amino acid data partitions, plus the two biological trees (in this case, trees around the amino acid maximum, TAA2, are also trees around T). The results of testing each individual tree are shown in table 1
. Clearly, the new test does not reject H0 for the trees that have near-maximal likelihood based on the amino acid data, but it does reject all the best trees based on the tRNA data. This result implies that the tRNA sequences have low resolving power in parts of the phylogeny where tRNA and amino acid trees disagree. Thus, there is no evidence of the phylogenetic inferences based on tRNA and amino acid sequences being out of harmony with each other, so a joint estimate of the best tree would not seem unreasonable.
|
The estimated distributions of the cLR test statistic when testing TAA2 and Tbio1 as possible median trees are shown in figure 1 . The long tail with Tbio1 illustrates one facet of the conservativeness due to centering. Note that, in contrast to the present test, a failure to record even 1% of bootstrap support for a tree (table 1 ) does not indicate incongruence, since this type of inspection does not take into account the number and proximity of competing trees.
|
Conservativeness of a test is also presently desirable, because maximum-likelihood tests of congruence assume that we can identify and generate the data under the true model of character evolution; failure to take this process into account causes tests of homogeneity to reject H0 too often (Waddell 1995
; Huelsenbeck, Bull, and Cunningham 1996
). Furthermore, in reality, site evolution is typically correlated, so this breakdown of the independence of sites also leads to the parametric bootstrap rejecting H0 too often.
If the new test is considered too conservative, a parametric bootstrap can be invoked in those cases where rejection of H0 is not quite achieved (e.g., 0.05 < P < 0.3). Initial results of a computationally fast version of Huelsenbeck and Bull's (1996)
test are also promising. The test is the same until data sets are evolved according to the model on T, except that during the tree search on the original data, trees that are within sampling error of T and the maximum-likelihood trees of each partition are also stored. Next, n replicates for j partitions are made. Then, all n replicate sets of characters evolved for partition j are temporarily pooled to allow a single estimate of parameters and site likelihoods on each tree to be made and stored. The site log likelihoods just estimated are substituted in for each pattern in the replicate data sets (like RELL). This allows the likelihood of each tree in each replicate partition to be evaluated by simply summing the site likelihood scores. For each of the n replicates, the sum of log likelihood differences from the tree of the highest total likelihood from each of the best trees for the j partitions, the cLR statistic, is used to build the test distribution. For moderate sequence lengths, this method builds a distribution of cLR similar to that constructed using full reoptimization and tree search, as long as the set of trees stored includes all that are likely to be optimal for each replicate. Because of this, the tree search should be extensive, e.g., based around the TBR function of PAUP*.
Note that the centering test assumes that the data evolved on the same tree topology, but edge lengths do not have to show any relationship across data partitions. The test modifies easily to impose either the same edge lengths for all data partitions or the same edge lengths except for some scalar, such as overall evolutionary rate.
The hypotheses for the new test can be rephrased and used to test for systematic error, which is predominantly due to the inconsistency of tree estimators, when there is a biological reason to expect that the historical trees from all data partitions are identical. For example, in the case with nonrecombining mtDNA, we have H0: the estimated trees are mutually consistent; H1: at least one of the trees is an inconsistent estimate of the historical tree. Thus, we do not need to know the historical tree to test for inconsistency if we can safely disregard stochastic bias.
The earlier test in Waddell et al. (1999)
using squared normalized likelihood differences works well in combination with the present test to diagnose which data partitions most strongly disagree with H0. It is also the case that if the test of Waddell et al. (1999)
fails to reject H0, the present test will most likely do the same. Thus, as a very quick test, it retains utility even when trees are not specified a priori.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
1 Keywords: phylogenetic congruence likelihood ratio tests
combination of sequence data
phylogenetic analyses
inconsistency
2 Address for correspondence and reprints: Peter J. Waddell, E-mail: waddell{at}onyx.si.edu
, pjwaddell{at}hotmail.com
. Communication preferred by E-mail; please advise by E-mail before using other forms of communication.
![]() |
literature cited |
---|
![]() ![]() ![]() |
---|
Adachi, J., and M. Hasegawa. 1996. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. Comput. Sci. Monogr. 28:1150
Bull, J. J., J. P. Huelsenbeck, C. W. Cunningham, D. L. Swofford, and P. J. Waddell. 1993. Partitioning and combining data in phylogenetic analysis. Syst. Biol. 42:384397[ISI]
Carrol, R. L. 1987. Vertebrate paleontology and evolution. W. H. Freeman, New York
Cox, D. R. 1962. Further results on tests of separate families of hypotheses. J. R. Stat. Soc. B34:406424
Cunningham, C. W. 1997. Can three incongruence tests predict when data should be combined? Mol. Biol. Evol. 14:733740[Abstract]
Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401410[ISI]
Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182198[ISI][Medline]
Hall, P. 1992. The bootstrap and Edgeworth expansion. Springer-Verlag, New York
Huelsenbeck, J. P., and J. J. Bull. 1996. A likelihood ratio test to detect conflicting phylogenetic signal. Syst. Biol. 45:9298[ISI]
Huelsenbeck, J., J. J. Bull, and C. W. Cunningham. 1996. Combining data in phylogenetic analysis. Trends Ecol. Evol. 11:152158[ISI]
Huelsenbeck, J. P., D. L. Swofford, C. W. Cunningham, J. J. Bull, and P. J. Waddell. 1994. Is character weighting a panacea for the problem of data heterogeneity in phylogenetic analysis? Syst. Biol. 43:288295
Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170179[ISI][Medline]
Kishino, H., T. Miyata, and M. Hasegawa. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 31:151160[ISI]
Liu, F.-G. R., and M. M. Miyamoto. 1999. Phylogenetic assessment of molecular and morphological data for eutherian mammals. Syst. Biol. 48:5464[ISI][Medline]
Lockhart, P. J., A. W. Larkum, M. A. Steel, P. J. Waddell, and D. Penny. 1996. Evolution of chlorophyll and bacteriochlorophyll: The problem of invariant sites in sequence analysis. Proc. Natl. Acad. Sci. USA 93:19301934
Novacek, M. J. 1992. Fossils, topologies, missing data, and the higher level phylogeny of eutherian mammals. Syst. Biol. 41:5873[ISI]
Reeves, J. H. 1992. Heterogeneity in the substitution process of amino acid sites of proteins coded for by Mitochondrial DNA. J. Mol. Evol. 35:1731[ISI][Medline]
Shimodaira, H., and M. Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 16:11141116
Swofford, D. L. 1998. PAUP*. Phylogenetic analysis using parsimony (*and other methods). Version 4.0b1. Sinauer, Sunderland, Mass
Waddell, P. J. 1995. Statistical methods of phylogenetic analysis: including Hadamard conjugations, LogDet transforms, and maximum likelihood. Ph.D. thesis, Massey University, Palmerston North, New Zealand
Waddell, P. J., Y. Cao, J. Hauf, and M. Hasegawa. 1999. Using novel phylogenetic methods to evaluate mammalian mtDNA, including AA invariant sitesLogDet plus site stripping, to detect internal conflicts in the data, with special reference to the position of hedgehog, armadillo, and elephant. Syst. Biol. 48:3153[ISI][Medline]
Waddell, P. J., N. Okada, and Hasegawa. 1999. Progress in resolving the interordinal relationships of placental mammals. Syst. Biol. 48:15[ISI][Medline]