Statistical Tests of Gamma-Distributed Rate Heterogeneity in Models of Sequence Evolution in Phylogenetics

Nick GoldmanGo, and and Simon Whelan

Department of Genetics, University of Cambridge, Cambridge, England

Likelihood ratio tests (LRTs) for comparing models of sequence evolution have become popular over the last few years (Goldman 1993Citation ; Yang, Goldman and Friday 1994, 1995Citation ; Huelsenbeck and Crandall 1997Citation ; Huelsenbeck and Rannala 1997Citation ). In their simplest form, such tests compare a simpler null hypothesis (H0) with a more complex alternative hypotheses (H1) which is a generalization of H0. H0 can be derived from H1 by fixing one or more of its free parameters at particular values, and the hypotheses are described as nested. Although it is also possible to test non-nested models (Goldman 1993Citation ), nested models are often preferred, as statistical tests are simpler to perform and their results can be easier to interpret.

The test statistic for an LRT can be written as 2{delta} = 2 ln(H1/H0) = 2(ln(H1) - ln(H0)), where H0 and H1 are the maximum-likelihood (ML) scores under hypotheses H0 and H1, respectively. This statistic measures how much improvement H1 gives over H0, and when the hypotheses are nested, 2{delta} will always be nonnegative. For these nested hypotheses, and under certain regularity conditions, the asymptotic distribution of 2{delta} (i.e., for large amounts of data) will be {chi}2k. Here, k is the number of degrees of freedom by which H0 and H1 differ, that is, the number of free parameters of H1 whose values must be fixed to derive H0 (Wald 1949Citation ; Silvey 1975Citation ; Felsenstein 1981;Citation Goldman 1993Citation ; Yang, Goldman, and Friday 1994, 1995Citation ). (In effect, each free parameter contributes a {chi}21 variate to the distribution of 2{delta}, with the sum of k independent {chi}21 variates being distributed as {chi}2k.) Statistical tests assessed using such {chi}2 distributions have now become a widespread and useful tool in phylogenetics (Huelsenbeck and Crandall 1997Citation ; Huelsenbeck and Rannala 1997Citation ).

Recently, there has been renewed interest in testing whether the predicted {chi}2 distribution gives a reliable estimate of the true distribution of 2{delta} under realistic conditions (e.g., with finite sequence lengths). Whelan and Goldman (1999)Citation investigated cases in which the competing hypotheses were different models of nucleotide substitution. Under three specimen experimental designs (representing realistic phylogenies and nucleotide substitution processes), we found that the {chi}2 distribution was acceptable for performing tests of the significance of parameters describing the relative rate of transition and transversion substitutions (typically denoted {kappa}; one free parameter) and describing equilibrium base frequencies (typically denoted {pi} = ({pi}A, {pi}C, {pi}G, {pi}T); three free parameters, because of the constraint that {Sigma}X {pi}X = 1) when these parameters are estimated by ML (Whelan and Goldman 1999Citation ). Combinations of these parameters give most of the commonly used models of nucleotide substitution. The parameter {kappa} is present in the models K2P of Kimura (1980)Citation and HKY of Hasegawa, Kishino, and Yano (1985)Citation , which reduce to the models JC of Jukes and Cantor (1969)Citation and FEL of Felsenstein (1981)Citation , respectively, when {kappa} is fixed at 1. The parameters {pi} are present in FEL and HKY, which reduce to JC and K2P, respectively, when {pi} is fixed at (1/4, 1/4, 1/4, 1/4).

In contrast, for tests of the significance of the incorporation of the shape parameter of a gamma distribution used to describe among-sites variation in the rate of nucleotide substitution (Yang 1993, 1994, 1996;Citation typically denoted {alpha}; one free parameter), Whelan and Goldman (1999) consistently found an unacceptable fit between the predicted {chi}2 distributions and estimates of the true distributions of 2{delta}. We provisionally attributed this effect to the fact that reducing a hypothesis H1 incorporating a gamma distribution (denoted, e.g., JC+{Gamma}) to a hypothesis H0 without a gamma distribution (in this example, JC) requires {alpha} to be fixed at {infty}, which is on the boundary of the set [0, {infty}) of values permitted under H1. This contravenes the conditions required for the asymptotic distribution of 2{delta} to be {chi}2 (see, e.g., Self and Liang 1987Citation ). Ota and colleagues (Ota, Waddell, and Kishino 1999Citation ; Ota et al. 2000Citation ) have also investigated effects caused by parameters lying on their permitted boundaries. They illustrated that ML estimates of these parameters cannot be asymptotically normally distributed and how, as a consequence, the ML estimates must be biased and the distribution of 2{delta} (2lnLR in their notation) cannot be {chi}2. Andrews (2000) showed that parameters which lie on their permitted boundaries can also adversely affect results of nonparametric bootstraps.

We refer to a parameter which under H0 is fixed at a value on the boundary of the set of values permitted under H1 as a boundary parameter. Self and Liang (1987Citation , p. 608, case 5) showed that (in the absence of any other boundary parameters not being tested) the contribution to 2{delta} of a single boundary parameter is (asymptotically) a distribution which is a 50:50 mixture of {chi}20 and {chi}21 distributions. The {chi}20 distribution takes the value 0 with probability 1; consequently, this mixture distribution takes the value 0 with probability ½, and takes a value drawn from a {chi}21 distribution with probability ½. If the statistical test being performed also tests for the significance of k - 1 other parameters which are not on their boundaries, these contribute {chi}2k-1 to the distribution of 2{delta}, and the complete distribution of 2{delta} is a 50:50 mixture of {chi}2k-1 and {chi}2k (Self and Liang 1987Citation , p. 609, case 9). We denote this distribution by 2k. These are the situations of relevance to the phylogenetic tests described here.

The 2k distribution has a mean of k - ½; significance levels of observed statistics X may be calculated according to P(2k > X) = (P({chi}2k-1 > X) + P({chi}2k > X))/2, where the terms to the right of the equality can be found using standard tables or computer programs (e.g., the chi2 program of the PAML package; Yang 1997Citation ). Other features of 2 distributions are more difficult to derive and were calculated using locally written programs. If two or more parameters lie on their boundaries, the situation is more complicated (Self and Liang 1987Citation ; Ota et al. 2000Citation ), and if these parameters are not independent, the distribution of 2{delta} may not be expressible as any sum of {chi}2 distributions (e.g., Self and Liang 1987Citation , pp. 608–609, case 8). Models incorporating more than one boundary parameter are beginning to appear in phylogenetic analyses (e.g., Huelsenbeck and Nielsen 1999Citation ).

Using simulated data, Whelan and Goldman (1999)Citation performed a number of model comparisons to investigate the true distribution of 2{delta} when the hypotheses being compared differed by various combinations of the parameters {kappa}, {pi}, and {alpha}. We now report the results of further simulations designed to consider specifically whether the 2 distributions described by Self and Liang (1987)Citation and Ota et al. (2000)Citation are acceptable under realistic conditions for statistical comparisons of nested models where H1 incorporates a gamma distribution for rate heterogeneity and H0 does not. We again use the models JC, K2P, FEL, and HKY (using ML estimators of {pi} in FEL and HKY, denoted FELMLE and HKYMLE in Whelan and Goldman 1999Citation ) and their counterparts, which, in addition, include the gamma distribution (JC+{Gamma}, K2P+{Gamma}, FEL+{Gamma}, and HKY+{Gamma}). Nine model comparisons (labeled A–J in table 1 ) represent all possible combinations of parameters which may then be present in H0 and H1, subject to the condition that {alpha} always be included in H1 but not H0. Thus, each comparison is always testing for the significance of including {alpha}, either singly (comparisons A–D) or in combination with other parameters (E–J), and either with (B–D, H–J) or without (A, E–G) other (untested) parameters present. Model trees used for data simulation are those described in Whelan and Goldman (1999)Citation , labeled according to the original data set whose analysis gave the phylogeny and parameters used for data simulation: cytochrome b, {psi}{eta}-globin, and mtDNA (erroneously labeled "D-loop" by Whelan and Goldman 1999Citation , although, in fact, it comprises two protein-coding regions and three tRNAs).


View this table:
[in this window]
[in a new window]
 
Table 1 Probabilities and Other Information Relating to the Fit of Predicted 2 Distributions to Estimates of the True Distribution of 2{delta}

 
Table 1 summarizes our results on the fit of the values of 2{delta} obtained from the analysis of simulated data sets to the predicted 2 distributions. For all but one of the 27 comparisons in table 1 (test of HKY vs. HKY+{Gamma} for the cytochrome b model phylogeny), we observe a good fit between the distributions. Note that this is true irrespective of the combination of parameters being tested and of the combination of parameters present in both models and thus untested.

Ota et al. (2000)Citation comment on practical difficulties that can arise in iterative ML optimizations. These are due to likelihood surfaces which are almost flat, and may be particularly pronounced in cases where optimal parameter values are on or near the boundaries permitted. Possible effects are premature termination of ML iterations either before a parameter estimate reaches a boundary at which it should lie or before it escapes from a boundary at which it should not lie. We encountered some such problems in a few of the analyses reported in table 1 . Typically, difficulty in achieving an accurate ML score is greater for more complex models (e.g., H1 rather than H0) and for analyses with parameters estimated near permitted boundaries. These effects lead to slightly depressed estimates of 2{delta} when it is near 0. We find that these effects account for our example in which the fit between our estimate of the distribution of 2{delta} and the predicted 2 distribution is rejected as inadequate (results not shown), although even in this case the estimated mean and 95% point of the distribution of 2{delta} are close to those of the predicted 2 distribution (table 1 , comparison D for cytochrome b).

We conclude that the theoretical (asymptotic) prediction that the parameter {alpha} contributes a 21 distribution (50:50 mixture of {chi}20 and {chi}21) to the distribution of 2{delta} is accurate in these examples, even for the finite-sized data sets considered, and that 2k distributions (50:50 mixtures of {chi}2k-1 and {chi}2k) are appropriate for performing tests involving the boundary parameter {alpha} and k - 1 additional parameters whose fixed values under H0 do not lie on the boundary permitted under H1.

Combining the results described by Ota and colleagues (Ota, Waddell, and Kishino 1999Citation ; Ota et al. 2000Citation ), those described by Whelan and Goldman (1999)Citation , and those described above, it seems certain that the use of a {chi}21 distribution for testing the shape parameter {alpha} of the gamma distribution or other boundary parameters is inappropriate and potentially misleading. As explained by Ota et al. (2000)Citation , the past use of a {chi}21 distribution instead of the correct 21 distribution (or the use of a {chi}2k distribution instead of a 2k distribution in tests involving k - 1 other parameters) may have led to overly conservative conclusions (i.e., too great a tendency to reject H1). Although confidence in past results regarding rate heterogeneity will often be maintained (typically rejecting H0 when only H1 uses a gamma distribution to allow for such rate heterogeneity; Yang 1996Citation ), this need not always be the case. As more complex and realistic models are compared, including some which use gamma distributions to model factors other than rate heterogeneity, observed values of 2{delta} may be a lot smaller. For example, the failure to use appropriate 2 distributions leads to two false negative results in Huelsenbeck and Nielsen (1999Citation , table 2 , data set COI, second codon position, test of HKY85 vs. HKY85+{Gamma}{kappa}, and NADH6, third codon position, HKY85+{Gamma}{kappa} vs. HKY85+{Gamma}r+{Gamma}{kappa}—although we note that these comparisons are in no way crucial to those authors' findings).


View this table:
[in this window]
[in a new window]
 
Table 2 Percentage Points of 2 Distributions

 
From the theory presented by Self and Liang (1987)Citation and the results shown by Ota and colleagues (Ota, Waddell, and Kishino 1999Citation ; Ota et al. 2000Citation ) and those shown in this paper, we conclude that the use of 2 distributions is appropriate for LRTs of boundary parameters in phylogenetics. These modified distributions result in lower thresholds for rejection of H0 in favor of H1, and appropriate values for significance testing are presented in table 2 . In Whelan and Goldman (1999)Citation , we suggested that simulation always be used for tests incorporating the parameter {alpha} modeling rate heterogeneity. We are now able to recommend instead the (time-saving) use of table 2 for all future LRTs concerning the presence of one boundary parameter (and in the absence of other boundary parameters not being tested). There have been few cases to date of false negatives due to inappropriate distributions being used to test significance, but as models of sequence evolution become more complex and the relevant values of 2{delta} decrease, the use of the correct distribution for significance testing becomes increasingly important.


    Acknowledgements
 TOP
 Acknowledgements
 literature cited
 
N.G. is supported by a Wellcome Trust Fellowship in Biodiversity Research. S.W. is supported by a BBSRC Research Studentship. We are grateful to R. Ota and P. J. Waddell for allowing us to see prepublication versions of Ota et al. (2000), and to Z. Yang for alerting us to the method of calculating P(2k > X).


    Footnotes
 
Mike Hendy, Reviewing Editor

1 Keywords: likelihood ratio tests gamma distribution maximum likelihood model comparison molecular evolution phylogenetics Back

2 Address for correspondence and reprints: Nick Goldman, University Museum of Zoology, Department of Zoology, University of Cambridge, Downing Street, Cambridge CB2 3EJ, United Kingdom. E-mail: n.goldman{at}zoo.cam.ac.uk Back


    literature cited
 TOP
 Acknowledgements
 literature cited
 

    Andrews, D. W. K. 2000. Inconsistency of the bootstrap when a parameter is on the boundary of the parameter space. Econometrica 68:399–406.

    Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.[ISI][Medline]

    Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182–198.[ISI][Medline]

    Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.[ISI][Medline]

    Huelsenbeck, J. P., and K. A. Crandall. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28:437–466.[ISI]

    Huelsenbeck, J. P., and R. Nielsen. 1999. Variation in the pattern of nucleotide substitution across sites. J. Mol. Evol. 48:86–93.[ISI][Medline]

    Huelsenbeck, J. P., and B. Rannala. 1997. Phylogenetic methods come of age: testing hypotheses in an evolutionary context. Science 276:227–232.

    Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Vol. 3. Academic Press, New York.

    Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.[ISI][Medline]

    Ota, R., P. J. Waddell, M. Hasegawa, H. Shimodaira, and H. Kishino. 2000. Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol. Evol. Biol. 17:798–803.[Abstract/Free Full Text]

    Ota, R., P. Waddell, and H. Kishino. 1999. Statistical distribution for testing the resolved tree against star tree. Pp. 15–20 in Proceeding of the annual joint conference of the Japanese Biometrics and Applied Statistics Societies. Sinfonica, Minato-ku, Tokyo.

    Self, S. G., and K.-L. Liang. 1987. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. 82:605–610.[ISI]

    Silvey, S. D. 1975. Statistical inference. Chapman and Hall, London.

    Wald, A. 1949. Note on the consistency of the maximum likelihood estimate. Ann. Math. Stat. 20:595–601.[ISI]

    Whelan, S., and N. Goldman. 1999. Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics. Mol. Biol. Evol. 16:1292–1299.[Free Full Text]

    Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396–1401.[Abstract]

    ———. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306–314.[ISI][Medline]

    ———. 1996. Among-site rate variation and its impact on phylogenetic analysis. TREE 11:367–372.

    ———. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555–556.

    Yang, Z., N. Goldman, and A. Friday. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316–324.[Abstract]

    ———. 1995. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Syst. Biol. 44:384–399.[ISI]

Accepted for publication January 11, 2000.