Department of Biology, University of Rochester
Abstract

Introduction
The parameters of a phylogenetic model describe the underlying process of sequence evolution. The maximum likelihood and Bayesian methods of statistical inference both estimate these parameters (including the topology) using the likelihood function, a quantity hereafter referred to as p(X | θ), which should be read as the probability of the data, X, conditioned on a specific combination of model parameters, θ; more formally, the likelihood is proportional to the probability of observing the data. In maximum likelihood, inferences are based on finding the topology relating the species, the branch lengths, and the parameter estimates of the phylogenetic model that maximize the probability of observing the data. Bayesian inferences, on the other hand, are based on the posterior probability of the topology, branch lengths, and parameters of the phylogenetic model conditioned on the data. Posterior probabilities can be calculated using Bayes's theorem.
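As a brief refresher (standard Bayesian notation; the use of θ to collect the topology, branch lengths, and substitution model parameters is my shorthand), Bayes's theorem gives:

$$
f(\theta \mid X) = \frac{p(X \mid \theta)\, f(\theta)}{\int p(X \mid \theta)\, f(\theta)\, d\theta}
$$

The integral in the denominator runs over all topologies, branch lengths, and parameter values; it is this normalizing constant that is analytically intractable for real phylogenetic problems and motivates the MCMC approximation discussed below.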
Determining which model is best suited to the data involves two distinct criteria: model adequacy (or assessment) and model choice (or selection). Model adequacy is an absolute measure of how well a model under scrutiny fits the data. Model choice, on the other hand, is a relative measure: the best-fitting model is chosen from those available. Although a model may be the best choice, it may be, by absolute standards, inadequate. The likelihood-ratio test (LRT) and Bayes factors are model choice tests: they measure the relative merits of competing models but reveal little about their overall adequacy. (Although formally the LRT evaluates the adequacy of a model [Goldman 1993
], in practice it is used as a model choice strategy.) Although distinct, model adequacy and model choice are related criteria, and they are often evaluated simultaneously by comparing nested models that differ by a single parameter (see Goldman 1993
). Ideally, we would use only adequate models for a phylogenetic analysis, but in practice we often settle for the best available model. In fact, most models appear to be poor descriptions of sequence evolution (Goldman 1993
).
How does one choose an adequate phylogenetic model? Traditional maximum likelihood approaches to model selection employ the LRT (for hierarchically nested models) or the parametric bootstrap (for nonnested models) (Goldman 1993). Both methods depend on a particular topology, often generated by a relatively fast method such as parsimony or neighbor-joining. (See Posada and Crandall [2001] for an analysis of the effects of topology choice on model selection using the LRT, the Akaike information criterion [AIC; Akaike 1974], and the Bayesian information criterion [BIC; Schwarz 1978].) The LRT evaluates the merits of one model against another by finding the ratio of their maximum likelihoods. For nested models, the LRT statistic is asymptotically χ²-distributed with q degrees of freedom (Wilks 1938), permitting comparison with standard χ² tables to determine significance. Unfortunately, significance cannot be evaluated in this way when the models are not nested, or when the null hypothesis fixes parameters of the alternative model at the boundary of the parameter space, because the regularity conditions of the χ² approximation are not satisfied.
The parametric bootstrap, in contrast, is not constrained by these regularity conditions and therefore allows comparison of nonnested models, but it is time-intensive and may require researchers to write computer simulations to approximate the null distribution. Unfortunately, this computationally expensive approach, the AIC, and the BIC remain the only current methods (apart from simple inspection of the log likelihood scores) for comparing nonnested likelihood models.
The results of the LRT and the parametric bootstrap are conditional on the topology and model parameters chosen to conduct the test. The assumed topology may be chosen using a fast method, such as parsimony, known to be inconsistent under certain conditions (Felsenstein 1978). The branch lengths and model parameters (such as the transition/transversion bias) are generally maximum likelihood point estimates conditional on the assumed topology. Ideally, a statistical method should minimize the number of assumptions made.
Bayesian methods offer an efficient means of reducing this reliance on assumptions. These methods can accommodate uncertainty in topology, branch lengths, and model parameters. For example, Suchard, Weiss, and Sinsheimer (2001)
recently developed a Bayesian method of model selection that uses reversible jump Markov chain Monte Carlo (MCMC) and employs Bayes factors for comparing models. This approach is a Bayesian analog of the LRT: the Bayes factor indicates relative superiority of competing models by evaluating the ratio of their marginal likelihoods. In this approach, prior probability distributions of the models must be proper but allowably vague. If the information contained in the data about model adequacy is small, then the priors will determine the outcome of the test. In this situation, most of the posterior will be placed on the more complicated model (Carlin and Chib 1995
). Although the method of Suchard, Weiss, and Sinsheimer (2001)
allows comparison of models without strict dependence on a particular set of assumptions, like traditional likelihood approaches, it does not explicitly evaluate the absolute merits of a model. The chosen model may well be severely inadequate.
Here, I present a Bayesian method using posterior predictive distributions to explicitly evaluate the overall adequacy of DNA models of sequence evolution. The approach I use, posterior predictive check by simulation (Rubin 1984
; Gelfand, Dey, and Chang 1992
; Gelman et al. 1995
; Gamerman 1997
), is a Bayesian analog of classical frequentist methods such as the parametric bootstrap or randomization tests (Rubin 1984
). A similar approach has been used recently to test molecular evolution hypotheses (Huelsenbeck et al. 2001
; Nielsen and Huelsenbeck 2001
; Nielsen 2002
). The rationale motivating this approach is that an adequate model should perform well in predicting future observations. In the absence of future observations, predicted observations are simulated from the posterior distribution, under the model in question. These predicted data are then compared with the original data using a test statistic that summarizes the differences between them. Careful evaluation of the model parameters permits enhancement (addition of parameters) or simplification (elimination of irrelevant parameters) of the model to improve its overall fit to the data. Here, I use the multinomial test statistic to evaluate overall adequacy of phylogenetic models.
Materials and Methods
The GTR (general time-reversible) model is the most general model of DNA sequence evolution treated here, allowing a different rate for each substitution class and accommodating unequal base frequencies. The instantaneous rate matrix, Q, for this model is:
$$
Q = \{q_{ij}\} = \begin{pmatrix}
\cdot & r_{AC}\,\pi_C & r_{AG}\,\pi_G & r_{AT}\,\pi_T \\
r_{AC}\,\pi_A & \cdot & r_{CG}\,\pi_G & r_{CT}\,\pi_T \\
r_{AG}\,\pi_A & r_{CG}\,\pi_C & \cdot & r_{GT}\,\pi_T \\
r_{AT}\,\pi_A & r_{CT}\,\pi_C & r_{GT}\,\pi_G & \cdot
\end{pmatrix}
$$

where rows and columns are ordered A, C, G, T; the six exchangeability parameters r_AC, ..., r_GT give the relative rate of each substitution class; π_j is the stationary frequency of nucleotide j; and each diagonal entry (·) is set to the negative sum of its row so that rows sum to zero.
Using instantaneous rates, substitution probabilities for a change from nucleotide i to j over a branch of length v can be calculated as P = {P_ij} = e^{Qv}. In the case of the JC69, K2P, and HKY85 models, closed-form analytical solutions for the substitution probabilities are available (Swofford et al. 1996). For the GTR model, closed-form solutions do not exist, and standard numerical linear algebra approaches are employed to exponentiate the term Qv (Swofford et al. 1996). With a matrix of substitution probabilities available, calculation of the likelihood is straightforward using Felsenstein's (1981) pruning algorithm.
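To make the numerical route concrete, the sketch below builds a GTR rate matrix and exponentiates it with an off-the-shelf matrix exponential. The parameter values are invented for illustration, and the usual rescaling of Q to one expected substitution per site is noted but not performed:

```python
import numpy as np
from scipy.linalg import expm  # numerical matrix exponential

def gtr_rate_matrix(pi, rates):
    """Build the GTR instantaneous rate matrix Q.

    pi    : base frequencies (pi_A, pi_C, pi_G, pi_T)
    rates : exchangeabilities (r_AC, r_AG, r_AT, r_CG, r_CT, r_GT)
    """
    r_AC, r_AG, r_AT, r_CG, r_CT, r_GT = rates
    R = np.array([[0.0,  r_AC, r_AG, r_AT],
                  [r_AC, 0.0,  r_CG, r_CT],
                  [r_AG, r_CG, 0.0,  r_GT],
                  [r_AT, r_CT, r_GT, 0.0 ]])
    Q = R * np.asarray(pi)               # q_ij = r_ij * pi_j for i != j
    np.fill_diagonal(Q, -Q.sum(axis=1))  # diagonal makes rows sum to zero
    return Q

# Hypothetical parameter values, for illustration only.
pi = [0.296, 0.190, 0.238, 0.277]
rates = [1.0, 4.0, 0.8, 1.2, 3.5, 1.0]
Q = gtr_rate_matrix(pi, rates)

v = 0.1            # branch length
P = expm(Q * v)    # P_ij = probability of observing i -> j over branch v
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability vector
```

In practice, Q is rescaled so that the mean instantaneous rate equals one, which makes the branch length v directly interpretable as the expected number of substitutions per site.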
Posterior Predictive Simulations
In evaluating a model's adequacy, we would like to know how well it describes the underlying process that generated the DNA sequence data in hand. Therefore, an ideal model should perform well in predicting future observations of the data. In practice, future observations are unavailable to researchers at the time of data analysis. However, surrogate future observations under the model being tested can be simulated by sampling from the joint posterior density of trees and model parameters (hence, posterior predictive simulations; Rubin 1984). Because of the complexity of the phylogeny problem (the large number of possible combinations of topology, branch lengths, and model parameters), the posterior density cannot be evaluated analytically. Fortunately, we can use numerical methods to obtain an approximation of this density, f(θ | X), using the MCMC technique (Li 1996
; Mau 1996
; Mau and Newton 1997
; Yang and Rannala 1997
; Larget and Simon 1999
; Mau, Newton, and Larget 1999
; Newton, Mau, and Larget 1999
; Huelsenbeck and Ronquist 2001
).
Model assessment using this approach requires approximating the following predictive density:
$$
f(X^{\mathrm{rep}} \mid X) = \int f(X^{\mathrm{rep}} \mid \theta)\, f(\theta \mid X)\, d\theta
$$

where X^rep denotes a replicate data set simulated under the model and θ again collects the topology, branch lengths, and substitution parameters. In practice, this integral is approximated by simulating one replicate data set under each set of parameter values sampled by the MCMC.
Test Statistics
We now have an approximation of the posterior predictive density of the data, simulated under the phylogenetic model being scrutinized. But we are still left with the following problem: how can we use this posterior predictive distribution to assess the phylogenetic model's adequacy? This requires a descriptive test statistic (or discrepancy variable; Gelfand and Meng 1996
) that quantifies discrepancies between the observed data and the posterior predictive distribution. The test statistic is referred to as a realized value when summarizing the observed data. An appropriate test statistic can be defined to measure any aspect of the predictive performance of a model (Gelman et al. 1995
). I use the general notation T(·), where · refers to the variable being tested. To use this statistic, calculate T(·) (an example of the proposed statistic is given below) for each of the posterior predictive data sets to arrive at an approximation of the predictive distribution of this test quantity. This distribution can then be compared with the realized test statistic, which is calculated from the original data.
To assess how well a phylogenetic model is able to predict future nucleotide observations (overall adequacy), a test statistic that quantifies the frequency of site patterns is appropriate. Here I use the multinomial test statistic to summarize the difference between the observed and posterior predictive frequencies of site patterns (Goldman 1993
). A minor limitation of the multinomial is its assumption of independence among sites, restricting its application to phylogenetic models that assume independence. Deviations in the posterior predictive frequency of site patterns from the observed occur because the phylogenetic model is an imperfect description of the evolutionary process. If the evolutionary process that generated the data exhibits a GC bias, for instance, then site patterns containing a predominance of these bases will be overrepresented. An adequate model should be able to predict this deviation, given the information contained in the original sequence data.
The multinomial test statistic of the data, T(X), is calculated in the following way. Let x(i) be the ith unique observed site pattern and N(i) the number of instances this pattern is observed. For a total of N sites, S = 4^k possible site patterns (where k is the number of sequences), and n unique site patterns observed, the multinomial test statistic, T(X), can be calculated as follows:

$$
T(X) = \sum_{i=1}^{n} N_{(i)} \ln N_{(i)} - N \ln N
$$
To illustrate the multinomial test statistic, let us find the realized T(X) for k = 4 sequences with N = 10 sites from a hypothetical aligned matrix of DNA sequences, such as the one in the sketch below.
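A minimal sketch of the computation follows; the alignment itself is invented for illustration (it is not the worked example from the original paper), as are the variable names:

```python
import math
from collections import Counter

# Hypothetical k = 4, N = 10 alignment; the sequences are invented
# for illustration only.
alignment = [
    "ACGTACGTAC",   # taxon 1
    "ACGTACGTCC",   # taxon 2
    "ACGAACGTAC",   # taxon 3
    "ACGTACGAAC",   # taxon 4
]

# A site pattern is one aligned column read across all k sequences.
patterns = Counter("".join(col) for col in zip(*alignment))

N = sum(patterns.values())  # total number of sites (here, 10)
# T(X) = sum_i N_(i) * ln N_(i)  -  N * ln N
T = sum(n * math.log(n) for n in patterns.values()) - N * math.log(N)
print(f"{len(patterns)} unique patterns among {N} sites; T(X) = {T:.4f}")
```

Note that a data set dominated by a few repeated patterns yields T(X) near zero, whereas a data set in which every pattern is unique yields the most negative possible value, -N ln N.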
Predictive P Values
Classical frequentist statistics rely on tail-area probabilities to assign statistical significance; values that lie in the extremes of the null distribution of the test quantity are considered significant. Under classical statistics, these distributions are conditioned on point estimates of the model parameters. Predictive densities, on the other hand, are not. Because parameter values and trees are sampled from their posterior distribution, they are drawn in proportion to their marginal probabilities. This sampling scheme allows them to be treated as nuisance parameters (values not of direct interest) and to be integrated out. The predictive distribution of the test statistic allows us to evaluate the posterior predictive probability of the model. The posterior predictive P value for the test statistic is:
$$
P_T = \Pr\bigl[\, T(X^{\mathrm{rep}}) \ge T(X) \mid X \,\bigr] = \int \mathbf{1}\bigl\{ T(X^{\mathrm{rep}}) \ge T(X) \bigr\}\, f(X^{\mathrm{rep}} \mid X)\, dX^{\mathrm{rep}}
$$
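In Monte Carlo terms, this integral reduces to the fraction of predictive test statistics at least as extreme as the realized value. A minimal sketch (the function and argument names are mine, for illustration):

```python
import numpy as np

def posterior_predictive_p(t_realized, t_predictive):
    """Fraction of predictive T(X_rep) values >= the realized T(X)."""
    return float(np.mean(np.asarray(t_predictive) >= t_realized))

# A well-fitting model yields values near 0.5; values near 0 (or 1)
# indicate the model rarely predicts data resembling those observed.
```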
Simulations
To determine the utility and power of this approach, I simulated 300 data sets under a variety of models and parameter values (see table 1
for a description of the specifics of each analysis). For all data sets, both the true model (the model under which the data were simulated) and the JC69 model were examined. Briefly, I performed three sets of simulations to examine (1) overall model adequacy, (2) the effects of sequence divergence, and (3) model sensitivity. I discuss each of these in turn below.
To test the effects of sequence divergence, I simulated data sets of 2,000 sites under the GTR model. Parameters of the model were chosen as in the test of overall adequacy. For all data sets the tree in figure 1 was used. The overall substitution rate was varied from low (m = 0.1) to high (m = 0.75) divergence.
Power Analysis
Under the posterior predictive simulation approach, the null hypothesis is that the model is an adequate fit to the data. A model is rejected if the posterior predictive P value of the realized test statistic falls below the critical value (α = 0.05); otherwise, the model is accepted. The fraction of times a false null model is accepted is an estimate of the Type II error rate, β; the complement (1 - β) is the power of the test. The power of the multinomial test statistic to reject a false model was determined by analyzing all the data sets described previously under the JC69 model.
Analysis of the η-Globin Pseudogene
To illustrate the method of model determination using posterior predictive distributions, a DNA sequence data set was analyzed under the JC69, HKY85, and GTR models. The data set is the primate η-globin pseudogene (Koop et al. 1986; Goldman 1993) with the addition of one species, the pygmy chimpanzee. This data set consists of seven species: human beings (Homo sapiens), chimpanzee (Pan troglodytes), pygmy chimpanzee (Pan paniscus), gorilla (Gorilla gorilla), orangutan (Pongo pygmaeus), rhesus monkey (Macaca mulatta), and owl monkey (Aotus trivirgatus). The original DNA data matrix comprised 2,205 sites. Indels (183 sites) were excluded from the analyses, yielding a matrix of 2,022 sites.
Programs
MrBayes v2.0 was used to approximate the posterior distribution of a model's parameters and trees (Huelsenbeck and Ronquist 2001
). The Metropolis-coupled MCMC algorithm was used with four chains (Huelsenbeck and Ronquist 2001
The Markov chains were run for 100,000 generations and sampled every 100th generation. The first 10,000 generations were discarded as burn-in to ensure that the chain was sampled at stationarity. Convergence of the Markov chains was assessed by plotting the log probability of the chain against generation and confirming that it had plateaued. A program that reads the posterior output of MrBayes, simulates predictive data sets, and evaluates the multinomial test statistic was written in the C language. The code is available upon request.
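The post-processing loop behind such a program is straightforward to sketch. The outline below is a hypothetical illustration in Python rather than C; the (tree, params) sample format and the helper callables simulate_alignment and multinomial_T are assumptions for illustration, not the actual program's interface:

```python
import numpy as np

def posterior_predictive_check(posterior_samples, observed_alignment,
                               simulate_alignment, multinomial_T):
    """Posterior predictive check of overall model adequacy.

    posterior_samples  : iterable of (tree, params) draws from the MCMC output
    observed_alignment : list of aligned sequences (the original data X)
    simulate_alignment : callable(tree, params, n_sites) -> replicate alignment
    multinomial_T      : callable(alignment) -> multinomial test statistic
    """
    n_sites = len(observed_alignment[0])
    t_obs = multinomial_T(observed_alignment)       # realized test statistic

    # One predictive data set per posterior draw approximates f(X_rep | X).
    t_reps = np.array([multinomial_T(simulate_alignment(tree, params, n_sites))
                       for tree, params in posterior_samples])

    p_value = np.mean(t_reps >= t_obs)              # posterior predictive P
    return t_reps, p_value
```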
Results and Discussion
An illustration of the predictive distribution of the multinomial test statistic for three data sets of 500, 2,000, and 4,000 sites is presented in figure 2. As expected, the true (GTR) model centers the simulated distributions around the realized test statistic (fig. 2A, C, and E). The false (JC69) model performed poorly (PT = 0.000; fig. 2B, D, and F). Increasing the number of sites in the data set increased the power of the test to reject the JC69 model: the predictive distributions under the JC69 model moved farther from the realized test statistic. This is because of the higher number of unique site patterns: increasing the number of sites increases the probability of observing rare patterns.
Sequence Divergence
The effect of an increase in sequence divergence on power was explored by varying the overall rate of substitution across the tree shown in figure 1 (m = 0.10, 0.25, 0.50, and 0.75). Test data sets were simulated under the GTR model (table 1). The results of the sequence divergence analysis are shown in tables 2 and 3. As before, two measurements for evaluating the method are presented: the mean predictive P value (P̄T) and the power of the test. Using the first, the GTR model performed well, approaching a mean predictive P value of 0.5 as m increased. An increase in the standard deviation of the predictive P value, from 0.082 to 0.137, was observed with an increase in divergence. This may be the result of a decrease in the diversity of site patterns as sites experience multiple hits and states begin to converge. The JC69 model, on the other hand, performed poorly at all values of m. At divergence levels of m ≥ 0.25, the mean posterior predictive P value (P̄T; table 2) was below the critical level of α = 0.05.
Using the second measurement, the GTR model again performed well: it was accepted 100% of the time at all levels of divergence. The JC69 model performed poorly at low levels of sequence divergence (m = 0.10); the power of the test was relatively low (50%) but rapidly increased to 100% at larger divergences (m = 0.75). For m = 0.50, the power of the test was considerably lower than in data sets with identical simulation conditions in the test of overall model adequacy (95%; see Overall Model Adequacy).
This reduction in power may be the result of a number of factors. The first, and most likely, explanation is sampling error; the small number of replicates leads to large confidence intervals (CI) around the Type II error rate (95% CI, 14% to 53%). Second, the JC69 model may be robust to minor violations of its assumptions. For example, in replicates for which the JC69 model was accepted, assumptions were not severely violated. Analyses of model sensitivity support this explanation (see below). Third, the simulation tree for these analyses had a smaller sum of branch lengths than in the analysis of overall adequacy (branch lengths are in terms of the expected number of substitutions per site). When m = 0.5, the tree in figure 1 has a tree length of 2.266, whereas the mean tree length in the adequacy analysis was 2.601 (15 of 20 overall adequacy replicates had longer tree lengths, some as much as 36% longer). Therefore, the effects of divergence on power should be interpreted as a function of the total number of expected substitutions per site across the phylogeny, not simply the rate from the root to the tips of the tree (m). Finally, the statistic may be sensitive to the shape of the topology or to variations in branch lengths across the tree.
Sensitivity to Model Violations
The sensitivity of the multinomial test statistic in rejecting inadequate models was explored by simulating data sets under the K2P model, varying κ from 1 to 12, followed by analysis with both the K2P (true) and JC69 models. When κ = 1, the K2P model collapses into the JC69 model; a concrete sketch of this collapse is given below. Under these conditions, the JC69 model is not violated and is expected to perform as well as the K2P model. As κ increases, reflecting an increase in the transition-transversion bias, the JC69 model becomes more severely violated and is expected to perform more poorly.
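To make the collapse concrete, the K2P rate matrix can be written with transversion rate β and transition rate κβ (rows and columns ordered A, C, G, T; this parameterization is standard textbook notation rather than a formula from this paper):

$$
Q_{\mathrm{K2P}} = \beta \begin{pmatrix}
\cdot & 1 & \kappa & 1 \\
1 & \cdot & 1 & \kappa \\
\kappa & 1 & \cdot & 1 \\
1 & \kappa & 1 & \cdot
\end{pmatrix}
$$

Each diagonal entry (·) is the negative sum of its row. Setting κ = 1 makes every off-diagonal rate equal, which is exactly the JC69 model.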
The effects of model violations were explored on data sets of two sizes: 1,000 and 5,000 sites (tables 2 and 3). Both the K2P and JC69 models performed well for data sets of 1,000 sites simulated with a κ value of 1. The mean posterior predictive P values for the K2P and JC69 models were 0.433 and 0.426, respectively. Both models were accepted in 100% of the replicates. For the JC69 model, as κ increased, the mean P values declined, whereas the K2P model continued to perform well. The probability of accepting the K2P model was 100% for all replicates except one (κ = 12, 95%). The JC69 model performed well at κ = 3 (95% accepted), but as the model became increasingly violated, the power increased to 95% and 100% at κ values of 6 and 12, respectively.
A fivefold increase in the number of sites moved the mean posterior predictive P values for the K2P model toward 0.5 (table 2), and all replicates analyzed under the K2P model were accepted 100% of the time. As the number of sites increased from 1,000 to 5,000, the discriminating power of the test statistic increased, as shown by the rapid decline in the mean predictive P values with increasing κ (table 2) and by the increased power to reject the JC69 model (table 3). For example, there was a nearly 10-fold drop in the mean predictive P value between 1,000 and 5,000 sites under the JC69 model with moderate violation: for κ = 3, the mean P value decreased from 0.175 to 0.019 (table 2). In addition, the variance across the replicate data sets decreased markedly. The JC69 model was accepted 100% of the time when κ was 1, but acceptance declined with an increase in κ relative to the 1,000-site data sets. This pattern is most dramatically demonstrated by a comparison of data sets simulated with κ = 3: for 1,000 sites there was a Type II error rate of 95%, as compared with a Type II error rate of 10% for 5,000 sites under the JC69 model.
Analysis of the η-Globin Pseudogene
The primate η-globin pseudogene data set was analyzed under the GTR, HKY85, and JC69 models. Pseudogenes are nonfunctional copies in which mutations are not constrained by selection, and thus substitution biases should reflect mutational biases. Biases in the mutational spectrum will give rise to biases in the observed frequency of site patterns. Analysis of the mean base frequencies for the η-globin pseudogene indicates an AT bias (πA = 0.296, πC = 0.190, πG = 0.238, πT = 0.277). Consequently, models that assume equal base frequencies (i.e., JC69) are not expected to perform as well as models that allow for unequal frequencies (i.e., HKY85 and GTR). The HKY85 and GTR models are adequate summaries of the true underlying process (GTR, PT = 0.199; HKY85, PT = 0.303), although the HKY85 model represents a better fit to the data: it was better able to center the predictive distribution of the test statistic around the realized value (fig. 3
). This difference may be due to a better model fit or to stochastic error. The JC69 model represents a poor fit to the data (fig. 3
, PT = 0.053), even though it cannot be explicitly rejected at the 0.05 level.
When we are confronted with two models that appear to perform equally well, how do we proceed in choosing between them? One approach would be simply to choose the less complex model, thus favoring a reduction in the number of free parameters to be estimated. Another would be to use the method presented here with a test statistic that summarizes local features of the models. In this way, particular features of a model that do not contribute explanatory power can be identified and eliminated. Conversely, testing the addition of new parameters to a simpler model could lead to a better fit to the data under an expanded model. In these ways, we can identify the best model and arrive at a sound statistical choice.
Conclusions
The multinomial test statistic is presented as a way to evaluate the global (or overall) performance of a model through the posterior predictive distribution. The power of the multinomial test statistic was explored under a wide range of conditions. A number of factors have been shown here to increase power: (1) increasing the number of sites, (2) increasing sequence divergence (the expected number of substitutions per site), and (3) increasing the degree of violation of a model's assumptions.
An appealing aspect of posterior predictive distributions, when used for model checking, is that a wide variety of test statistics can be formulated to check various aspects of phylogenetic models. For example, posterior predictive distributions can be used to detect variation in rates across data partitions, allowing models to be expanded to accommodate rate heterogeneity. The generality of the posterior predictive approach, and the development of new test statistics, will permit further exploration and development of more complex and realistic phylogenetic models.
Acknowledgements
Footnotes
Keywords: phylogenetics, Bayesian inference, model determination, model selection, model adequacy, posterior predictive densities, posterior predictive simulations
Address for correspondence and reprints: Department of Biology, University of Rochester, Rochester, New York 14627. E-mail: bollback@brahms.biology.rochester.edu
References
Akaike H., 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19:716-723.
Bruno W. J., A. L. Halpern, 1999. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16:564-566.
Carlin B. P., S. Chib, 1995. Bayesian model choice via Markov chain Monte Carlo methods. J. R. Stat. Soc. B 57:473-484.
Felsenstein J., 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401-410.
Felsenstein J., 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.
Gamerman D., 1997. Markov chain Monte Carlo: stochastic simulation for Bayesian inference. Chapman and Hall, New York.
Gaut B., P. Lewis, 1995. Success of maximum likelihood in the four-taxon case. Mol. Biol. Evol. 12:152-162.
Gelfand A. E., D. K. Dey, H. Chang, 1992. Model determination using predictive distributions with implementation via sampling-based methods. Pp. 147-167 in J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, eds. Bayesian statistics 4. Oxford University Press, New York.
Gelfand A. E., X.-L. Meng, 1996. Model checking and model improvement. Pp. 189-198 in W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, eds. Markov chain Monte Carlo in practice. Chapman and Hall, New York.
Gelman A., J. B. Carlin, H. S. Stern, D. B. Rubin, 1995. Bayesian data analysis. Chapman and Hall, New York.
Goldman N., 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182-198.
Hasegawa M., H. Kishino, T. Yano, 1985. Dating the human-ape split by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.
Huelsenbeck J. P., J. P. Bollback, 2001. Application of the likelihood function in phylogenetic analysis. Pp. 415-439 in D. J. Balding, M. Bishop, and C. Cannings, eds. Handbook of statistical genetics. John Wiley and Sons, New York.
Huelsenbeck J. P., J. P. Bollback, A. Levine, 2002. Inferring the root of a phylogenetic tree. Syst. Biol. 51:32-43.
Huelsenbeck J. P., D. M. Hillis, 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247-264.
Huelsenbeck J. P., F. Ronquist, 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754-755.
Huelsenbeck J. P., F. Ronquist, R. Nielsen, J. P. Bollback, 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294:2310-2314.
Jukes T., C. Cantor, 1969. Evolution of protein molecules. Pp. 21-132 in H. Munro, ed. Mammalian protein metabolism. Academic Press, New York.
Kimura M., 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.
Koop B. F., M. Goodman, P. Xu, K. Chan, J. L. Slightom, 1986. Primate eta-globin DNA sequences and man's place among the great apes. Nature 319:234-238.
Larget B., D. Simon, 1999. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16:750-759.
Li S., 1996. Phylogenetic tree construction using Markov chain Monte Carlo. Doctoral dissertation, Ohio State University, Columbus.
Mau B., 1996. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Doctoral dissertation, University of Wisconsin, Madison.
Mau B., M. Newton, 1997. Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo. J. Comput. Graph. Stat. 6:122-131.
Mau B., M. Newton, B. Larget, 1999. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55:1-12.
Newton M., B. Mau, B. Larget, 1999. Markov chain Monte Carlo for the Bayesian analysis of evolutionary trees from aligned molecular sequences. In F. Seillier-Moiseiwitsch, T. P. Speed, and M. Waterman, eds. Statistics in molecular biology. Monograph series of the Institute of Mathematical Statistics.
Nielsen R., 2002. Mapping mutations on phylogenies. Syst. Biol. (in press).
Nielsen R., J. P. Huelsenbeck, 2001. Detecting positively selected amino acid sites using posterior predictive p-values. Pp. 576-588 in R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, eds. Pacific symposium on biocomputing. World Scientific, New Jersey.
Posada D., K. A. Crandall, 2001. Selecting the best-fit model of nucleotide substitution. Syst. Biol. 50:580-601.
Rannala B., Z. Yang, 1996. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J. Mol. Evol. 43:304-311.
Rubin D. B., 1984. Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Stat. 12:1151-1172.
Schwarz G., 1978. Estimating the dimension of a model. Ann. Stat. 6:461-464.
Suchard M. A., R. E. Weiss, J. S. Sinsheimer, 2001. Bayesian selection of continuous-time Markov chain evolutionary models. Mol. Biol. Evol. 18:1001-1013.
Sullivan J., D. L. Swofford, 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J. Mammal. Evol. 4:77-86.
Swofford D., G. Olsen, P. Waddell, D. M. Hillis, 1996. Phylogenetic inference. Pp. 407-511 in D. Hillis, C. Moritz, and B. Mable, eds. Molecular systematics. 2nd edition. Sinauer, Sunderland, Mass.
Tavaré S., 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Pp. 57-86 in Lectures on mathematics in the life sciences. Vol. 17. American Mathematical Society, Providence, R.I.
Wilks S., 1938. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Stat. 9:60-62.
Yang Z., 1993. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396-1401.
Yang Z., 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314.
Yang Z., B. Rannala, 1997. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol. Biol. Evol. 14:717-724.