Bayesian Phylogenetic Model Selection Using Reversible Jump Markov Chain Monte Carlo

John P. Huelsenbeck*, Bret Larget{dagger},{ddagger} and Michael E. Alfaro*

* Section of Ecology, Behavior and Evolution, Division of Biological Sciences, University of California, San Diego
{dagger} Department of Statistics, University of Wisconsin
{ddagger} Department of Botany, University of Wisconsin

Correspondence: E-mail: johnh{at}biomail.ucsd.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Acknowledgements
 Literature Cited
 
A common problem in molecular phylogenetics is choosing a model of DNA substitution that does a good job of explaining the DNA sequence alignment without introducing superfluous parameters. A number of methods have been used to choose among a small set of candidate substitution models, such as the likelihood ratio test, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Bayes factors. Current implementations of any of these criteria suffer from the limitation that only a small set of models are examined, or that the test does not allow easy comparison of non-nested models. In this article, we expand the pool of candidate substitution models to include all possible time-reversible models. This set includes seven models that have already been described. We show how Bayes factors can be calculated for these models using reversible jump Markov chain Monte Carlo, and apply the method to 16 DNA sequence alignments. For each data set, we compare the model with the best Bayes factor to the best models chosen using AIC and BIC. We find that the best model under any of these criteria is not necessarily the most complicated one; models with an intermediate number of substitution types typically do best. Moreover, almost all of the models that are chosen as best do not constrain a transition rate to be the same as a transversion rate, suggesting that it is the transition/transversion rate bias that plays the largest role in determining which models are selected. Importantly, the reversible jump Markov chain Monte Carlo algorithm described here allows estimation of phylogeny (and other phylogenetic model parameters) to be performed while accounting for uncertainty in the model of DNA substitution.

Key Words: Bayesian phylogenetic inference • Markov chain Monte Carlo • maximum likelihood • reversible jump Markov chain Monte Carlo • substitution models


    Introduction
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Acknowledgements
 Literature Cited
 
At present, a universal assumption of model-based methods of phylogenetic inference is that character change occurs according to a continuous-time Markov chain. At the heart of any continuous-time Markov chain is a matrix of rates, specifying the rate of change from one character state to another. For many phylogenetic analyses using DNA sequence data, it is assumed that there are four states (the nucleotides A, C, G, T/U) with a 4 x 4 matrix of rates among the 12 possible nucleotide substitutions. A few standard models of DNA substitution have been proposed. These include those first described by Jukes and Cantor (1969), Kimura (1980, 1981), Felsenstein (1981, 1984), Hasegawa, Yano, and Kishino (1984, 1985), Tamura and Nei (1993), and Tavaré (1986). Importantly, the parameterizations of the substitution process suggested by these authors can be directly applied to doublet models (used to describe the substitution process in stem-paired nucleotides; Schöniger and von Haeseler 1994) and codon models (used to describe substitutions in the three nucleotides of a codon; Goldman and Yang 1994, Muse and Gaut 1994). Moreover, these models are used in a wide variety of analysis methods, including distance, maximum likelihood, and Bayesian methods.

One of the challenges facing the biologist is to choose a model of DNA substitution that best describes the data in hand. Generally speaking, the aim is to pick a model that adequately explains the data (in this case an alignment of DNA sequences) without introducing superfluous parameters; one wants to strike a balance between bias avoidance (achieved by using models with many parameters) and reduced variance in the parameter estimates (achieved by using models with few parameters). Fortunately, there are several methods available that allow the biologist to choose among substitution models in a statistical framework. When two nested models are compared, one can use likelihood ratio tests (Goldman 1993). Models are nested when one model is a special case of a more general model. For example, the model proposed by Jukes and Cantor (1969) is a special case of the model proposed by Kimura (1980), which introduces an additional parameter describing the transition/transversion rate bias. When the transition/transversion bias is set to one, then Kimura's model is equivalent to the Jukes-Cantor model. In such nested model comparisons, minus twice the log likelihood ratio is asymptotically distributed as a {chi}2 distribution, where the degrees of freedom of the {chi}2 is the difference in the number of free parameters between the general and restricted models. Phylogeneticists commonly employ a series of nested likelihood ratio tests (LRTs) to select a substitution model, either by calculating likelihood ratios by hand for a pool of candidate models or by using Modeltest (Posada and Crandall 1998), which automates this procedure. This is a reasonable approach because many of the most familiar substitution models can be arranged linearly in a nested hierarchy.

Unfortunately, not all of the interesting substitution model comparisons are between nested models. In these cases, it may make sense to use a method that does not assume nesting of models, such as Akaike's Information Criterion (AIC; Akaike 1973), which is based on information theory (specifically, the Kullback-Liebler distance between models) and introduces a penalty term for models that have more parameters. The idea is to calculate the AIC for every candidate model, and choose the model with the smallest AIC for data analysis. (One might also consider other models that achieve a low AIC score.) Methods related to the AIC include the small sample bias adjusted AIC proposed by Hurvich and Tsai (1989), a quasi-likelihood modification (QAIC) suggested by Lebreton et al. (1992), and the Bayesian information criterion (BIC; Schwarz 1978). AIC has been commonly used for choosing among a limited number of candidate substitution models, and it is implemented in the program Modeltest (Posada and Crandall 1998).

The criteria described above all work well when maximum likelihood is the criterion. In a Bayesian analysis, model choice is often guided by the Bayes factor (Kass and Raftery 1995; Raftery 1996), which is the ratio of the marginal likelihoods of two models; instead of maximizing the likelihood with respect to the parameters, the parameters are integrated over a prior probability distribution. Relatively little work has been done on Bayesian model selection in phylogenetics, with the notable exception of that by Suchard, Weiss, and Sinsheimer (2001). Their work showed how Bayes factors can be calculated for nested models using the Savage-Dickey ratio (Verdinelli and Wasserman 1995).

In this paper, we extend the pool of candidate substitution models to all possible time-reversible models. We develop a new Markov chain Monte Carlo (MCMC) algorithm that jumps between models, allowing accurate calculation of the Bayes factor for any of the models. We also evaluate the possible DNA substitution models using AIC and BIC. We show that, although the named models often do a good job of explaining much of the pattern of substitution for real data sets, (1) many other (unnamed) models also do quite well, (2) the most complicated time-reversible model is not universally chosen as best, and (3) phylogeny estimation can be performed while averaging over the possible models.


    Methods
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Acknowledgements
 Literature Cited
 
How Many Nucleotide Substitution Models Are There?
In this article, we restrict our attention to time-reversible models of DNA substitution. Although we focus entirely on substitution models with four states (A, C, G, and T/U), all of the results of this study apply equally well to doublet and codon models. The most general time-reversible model, called the GTR model, has rate matrix


and was first described by Tavaré (1986). The entry in the ith row and jth column of the matrix specifies the rate of change from nucleotide i to nucleotide j. The diagonal entries of the matrix, here shown with a dash, are specified such that each row sums to zero. The frequency of the ith nucleotide is denoted {pi}i. The commonly used DNA substitution models all have the unusual feature that the stationary frequency of the process is built directly into the rate matrix. The sum of the four base frequencies must, of course, equal one. We introduce the additional constraint that the sum of the six rate parameters is equal to six (rAC + rAG + rAT + rCG + rCT + rGT = 6) so that their mean is one. We introduce this constraint for the same reason that most parameterizations of the GTR model fix the rate of G {longleftrightarrow} T to one, measuring the rates of the other five substitutions relative to the G {longleftrightarrow} T rate; because the absolute times on a phylogenetic tree are typically unknown, it is convenient to measure the branch lengths of the tree in terms of amount of nucleotide substitution (specifically, the branch lengths on a tree are in terms of expected number of substitutions per site). This means that the substitution rate matrix must be rescaled such that the mean rate is one. This is achieved by setting µ = –1/{sum}i(A,C,G,T){pi}iqii. It also means that we cannot estimate the absolute rates of substitution, but only their relative rates. It is just as easy to measure the relative rates of the six substitution types on a scale where they all sum to six as it is to measure the substitution rates relative to the G {longleftrightarrow} T rate.

The models typically used in phylogenetic analysis are all special cases of the GTR model, involving constraints on the nucleotide frequencies or on the rate parameters. For example, the Felsenstein (1981) model constrains the rates of substitution to be equal (rAC = rAG = rAT = rCG = rCT = rGT), and the Jukes-Cantor (1969) model has the additional constraint that the nucleotide frequencies are equal. For the remainder of this article, we will assume that the base frequencies are a common parameter across all possible models, and that they are free to vary. We concentrate our attention on constraints of the substitution rate parameters. We also introduce a notation that allows any possible time-reversible model to be described. We assign index values to each of the six substitution rates, in the order AC, AG, AT, CG, CT, GT. If a model has the constraint that ri = rj, then the index value for those two rates is the same. Moreover, the index number for the first rate is always 1, and indices are labeled sequentially. So, for example, "111111" denotes the Jukes-Cantor (1969) or Felsenstein (1981) model and "121121" denotes the Kimura (1980), Hasegawa, Kishino, and Yano (1984, 1985), or Felsenstein (1984) model. There are a total of 203 such models, with the simplest being "111111" and the most complex being the GTR model, "123456." All of the possible models are shown in table 1. Of the 203 models, one of them has one substitution type (111111), 31 of them have two substitution types (e.g., 121121), 90 of them have three substitution types (e.g., 123121), 65 of them have four substitution types (e.g., 112341), 15 of them have five substitution types (e.g., 123415), and one of them has six substitution types (123456). Only a handful of the 203 models have been named1. These include 111111 (Jukes and Cantor 1969; Felsenstein 1981), 121121 (Kimura 1980; Felsenstein 1984; Hasegawa, Kishino, and Yano 1984, 1985), 121131 (Tamura and Nei 1993), 123321 (Kimura 1981), 123341 (Posada 2003), 123425 (Posada 2003), and 123456 (Tavaré 1986).


View this table:
[in this window]
[in a new window]
 
Table 1 All Possible Time-Reversible Models of DNA Substitution.

 
How is the number of possible substitution models determined? The answer can be found in the combinatorics literature. Specifically, the number of ways a set with n objects can be partitioned into disjoint and non-empty subsets is described by the Bell numbers (Bell 1934). The Bell numbers, Bn, are calculated as the sum


where S(n, k) is a Stirling number of the second kind. The Stirling numbers of the second kind, S(n, k), give the number of ways to partition a set of n objects into k non-empty sets. The Stirling number of the second kind is


Special cases of the Stirling number of the second kind include S(n, 0) = 0, S(n, 1) = 1, S(n, n) = 1, S(n, 2) = 2n–1 1, and S(n, n – 1) = (). The Bell numbers give the sequence 1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975, 678570, 4213597, ... for n = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, .... For the problem investigated here, the Bell number for n = 6 objects gives the number of possible substitution models (B6 = 203) and the Stirling numbers of the second kind give the number of models with one [S(6, 1) = 1], two [S(6, 2) = 31], three [S(6, 3) = 90], four [S(6, 4) = 65], five [S(6, 5) = 15], and six [S(6, 6) = 1] substitution types.

To improve understanding of the remainder of this article, we introduce the following additional notation describing the substitution models. Each substitution model has K groups of substitution types. The membership of the ith group is contained in the vector ki. For example, model M101 (112231) has a total of K = 3 substitution groups: k1 = (AC, AG, GT), k2 = (AT, CG), and k3 = (CT). The number of substitution types in the ith substitution group is denoted ni and the rate for the ith substitution group is denoted ri. For model M101, n1 = 3, n2 = 2, and n3 = 1. Finally, N(M) is the number of substitution groups in model M with ni > 1. For model number 101, N(M101) = 2; the GTR model (123456) has N(M203) = 0.

Choosing Among the Possible Substitution Models
A number of criteria can be used to choose among a set of d candidate models, where the candidate models are labeled M1, M2, ... , Md. Here we review a few of these methods. However, we mainly concentrate on the Bayesian framework for choosing models, showing how Bayes factors can be calculated for substitution models of interest. We will also point out how AIC and BIC can be used to choose among all possible substitution models when using maximum likelihood.

Likelihood Ratio Testing
The likelihood ratio


measures the relative tenability of two competing models. (The likelihood, {ell}(Mi), is maximized over all the parameters.) Model M1 is favored when the likelihood ratio is greater than one, whereas the opposite is true for likelihood ratios less than one. When the models are nested, with M1 being a special case of M2, {Lambda} <= 1 and –2 loge {Lambda} is asymptotically distributed as a {chi}2 distribution with the degrees of freedom being the difference in the number of free parameters between M2 and M1.

Likelihood ratio tests have proven very useful in molecular phylogenetics (see review by Huelsenbeck and Rannala 1997), where they have been used to examine questions ranging from the molecular clock hypothesis to cospeciation in hosts and parasites. However, to use the {chi}2 approximation of the null distribution, the requirement that the models are nested restricts the hypotheses that can be compared. With non-nested models, one can use the Cox test (Cox 1962), but in these cases it is not clear which hypothesis the null distribution should be generated under. In this article, where many of the models of interest are not nested (e.g., 111222 vs. 112211), likelihood ratio tests are of only limited value.

Information Criteria
Another approach to model choice is to attempt to find a model that in some sense minimizes the distance to the "true" model. A number of methods attempt to estimate the Kullback-Liebler information, which is a measure of the distance between models. The Akaike Information Criterion (AIC; Akaike 1973) is based on the maximum likelihood score and the number of parameters for a model, and it represents an attempt to estimate the Kullback-Liebler information (distance) between the true model and the fitted model. The AIC for model i is calculated as


where max[loge {ell}(Mi)] is the maximum likelihood and p is the number of free parameters in the model. A Bayesian variant of the AIC is the Bayesian Information Criterion (BIC; Schwarz 1978) which assumes equal priors on each model and vague priors on the parameters. The BIC for model i is calculated as


where n is the number of observations. For both AIC and BIC, the first term can be interpreted as a penalty for using too simple a model; the maximum likelihood is lower for models that fail to include important parameters. The second term, on the other hand, is a penalty against overfitting. The second term in AIC and BIC favors more parsimonious models.

Model selection using AIC or BIC works by calculating the AIC/BIC scores for each candidate model and selecting the model with the lowest AIC/BIC as best. AIC and BIC have been used previously to choose among a handful of competing phylogenetic models (e.g., Posada and Crandall 1998; Yang et al. 2000).

Bayes Factors
The Bayes factor for a comparison of model M1 to model M2 is


where the observations are denoted X (in this case, an alignment of DNA sequences) and f(X|Mi) is the marginal likelihood of model i. The Bayes factor has the same intuitive meaning as the likelihood ratio, described above; values for the Bayes factor greater than one support M1, whereas a Bayes factor less than one supports M2. The marginal likelihoods are calculated by integrating (marginalizing) over the other parameters of the model. Say that model Mi has parameters {theta}i. The marginal likelihood is calculated by integrating the likelihood over a prior probability distribution, f({theta}i|Mi): f(X|Mi) = f(X|{theta}i, Mi)f({theta}i|Mi)d{theta}i. Note that in contrast to the likelihood ratio test statistic in which the ratio of the maximum likelihood scores is used, for the Bayes factor the ratio of the average likelihoods is used instead. Importantly, the Bayes factor accounts for uncertainty in the parameters of the model, {theta}i. Also note that the Bayes factor does not rely on nesting of the models. In fact, the parameters of the two compared models need not be the same. That is, the parameters of model M1 ({theta}1) are not necessarily the same parameters as those in model M2 ({theta}2).

The Bayes factor can also be calculated as the ratio of the posterior odds to the prior odds of the models:


where f(Mi|X) is the posterior probability and f(Mi) is the prior probability of model i. Often the Bayes factor can be approximated using the output of a MCMC analysis. This is the approach that will be taken in this article, with the details of the MCMC algorithm described in the next section. Suchard, Weiss, and Sinsheimer (2001) calculated the Bayes factor for the special case where the compared models are nested. Specifically, they used the Savage-Dickey ratio to approximate the Bayes factor, where the Savage-Dickey ratio is


Model M1 is a special case of M2. Specifically, M2 is the same as M1 when the parameter {theta} = {theta}0. For example, the Jukes-Cantor (1969) and Kimura (1980) models are the same when the parameter {kappa} of the Kimura model is set to one ({kappa} = 1). The Bayes factor could be calculated using the Savage-Dickey ratio for a comparison of the Jukes-Cantor (M1) to the Kimura (M2) models as


This involves performing a Bayesian analysis under the more general, Kimura (1980), model and evaluating the posterior and prior probability densities of the parameter {kappa} at one.

Reversible Jump MCMC for Model Selection
We do not pursue calculation of Bayes factors using the Savage-Dickey ratio for the same reason that we do not examine likelihood ratio tests; many of the comparisons of interest involve non-nested models. Instead, we calculate Bayes factors for all possible substitution models by constructing a Markov chain that visits substitution models in proportion to their posterior probability. The joint posterior probability of all the model parameters is calculated using Bayes' theorem as


where {tau} is the unrooted tree topology, v is the vector of branch lengths for the tree, {alpha} is the gamma shape parameter describing the amount of among-site rate variation, {pi} is a vector of nucleotide frequencies, {theta}i is the vector of substitution rates, and Mi is the substitution model. The posterior probability, f(·|X), is equal to the likelihood, f(X|·), times the prior probability distribution, f(·), divided by a normalizing constant, f(X). In this section, we describe the modeling assumptions we make in calculating posterior probabilities of models.

Likelihood
We calculate the likelihood (probability of observing the DNA sequence alignment conditional on a specific model) under the GTR + {Gamma} model of DNA substitution, or submodels of the GTR + {Gamma}. We approximate the continuous gamma distribution for among-site rate variation using the discrete approximation of Yang (1994). We use four categories to approximate the gamma distribution. The pruning algorithm of Felsenstein (1981) was used to calculate the likelihood for a tree.

Prior
We assume that all 203 substitution models have equal prior probability. Moreover, we assume a uniform prior on unrooted tree topologies, an exponential (10) prior on branch lengths, a flat Dirichlet distribution on nucleotide frequencies, and a uniform (0, 50) prior on the gamma shape parameter for among-site rate variation. We assume that the K rates of the substitution model are a linear transformation of a flat Dirichlet prior. In particular, if Y1, ... , YK have a flat Dirichlet distribution we let ri = 6Yi/ni for i = 1, ... , K. The prior density of the rates is


(The standard flat Dirichlet distribution has a normalizing constant of {Gamma}(K). Our linear transformation of the flat Dirichlet requires a different normalizing constant.)

Markov Chain Monte Carlo
The posterior probability distribution of the phylogenetic model parameters cannot be calculated analytically. We use MCMC (Metropolis et al. 1953; Hastings 1970; Green 1995) to approximate the posterior probability distribution of the phylogenetic model parameters. MCMC for phylogeny inference has been described elsewhere (see Larget and Simon 1999; Huelsenbeck et al. 2002). The general idea is to construct a Markov chain that has as its state space the parameters of the phylogenetic model and a stationary distribution that is the posterior probability distribution of the parameters. The sequence of parameter states visited by the Markov chain forms a valid, albeit dependent, sample from the posterior probability distribution of interest (Tierney 1994). In this article, we are interested in the posterior probabilities of alternative substitution models. We construct a Markov chain that, in addition to exploring the other model parameters such as the phylogenetic tree, visits alternative substitution models. The posterior probability of the ith model is approximated as the proportion of the time the Markov chain visited model i.

In each iteration of the MCMC algorithm, we pick a parameter at random to update. We update the tree and branch lengths simultaneously using the LOCAL mechanism described in Larget and Simon (1999). We update the nucleotide frequencies and substitution model parameters using a Dirichlet proposal mechanism, and we update the gamma shape parameter using a sliding window mechanism. Each of these proposal mechanisms has been described elsewhere, and we will not go into any more detail about them here. However, two of the proposal mechanisms are unique to the problem of exploring substitution models. Specifically, we have proposal mechanisms that explore the different possible substitution models by merging or splitting groups of substitution types. The main complication is that moves from one substitution model to another involve changes in the dimension of the parameter space, and the normal MCMC theory does not apply. Instead, we construct a Markov chain that jumps dimensions, using the theory described in Green (1995).

The current substitution model will be denoted M. We attempt a merge of two substitution groups with probability pM(M) and attempt a split of a substitution group into two groups with probability pS(M). When the current model is the Jukes-Cantor (111111), then the probability of attempting a merge is zero and the probability of attempting a split is set to one. When the current model is GTR (123456) the probability of attempting a split is set to zero, and the probability of a merge is set to one.

A merge move works as follows. Two of K substitution groups of the current model, M, are chosen at random. These groups are denoted ki and kj. The new group that results from the merging of these two groups is denoted k'. For example, consider a merge of the model 111222. There are only two substitution groups that can be merged: k1 = (AC, AG, AT) and k2 = (CG, CT, GT). The group that results from the merging of these two groups is k' = (AC, AG, AT, CG, CT, GT). Of course, this is the model 111111. The rates of substitution must also be chosen for the new, merged, group. For the merge move, the new rate is deterministically chosen. The rates of substitution for the two merged groups are denoted ri and rj. The new rate, r' is


The new group size is ni + nj. Remember that we constrain the sum of the rates to be 6 ({sum}iniri = 6), so that = 1. Notice that (ni + nj)r' = niri + njrj: the sum of all rates remains unchanged.

A split move works by picking one of the N(M) groups with at least two substitution types at random. The group to be split will be designated k. The substitution types in that group are then divided randomly, with the constraint that there is at least one item in each group. This results in two groups, k'i and k'j. For example, if the current model, M, is 112334, then only two groups are available that could be potentially split: the groups with substitutions AC and AG or with substitution types CG and CT. One of these two groups would be chosen at random (with each having an equal probability of being chosen), and the substitution types in the group randomly divided into two new groups. In this case, if k = (AC, AG), then the resulting groups would be k'i = (AC) and k'j = (AG). To find the new rates for the two new groups, r'i and r'j, we first generate a uniform random variable on the interval (–nir, njr), denoted u, where r is the rate of the group to be split. The new rates are then


and


Notice as well that nir'i + njr'j = (ni + nj)r, so that the sum of all the rates remains unchanged.

The probability of accepting a merge or split move, R, is


For a merge move, the prior ratio is


the proposal ratio is


and the Jacobian is


The acceptance probability for a merge move is then


For a split move from current model M with K – 1 rates to a model M' with K rates the acceptance probability is


where r is the rate of the group to be split and ni and nj are the sizes of the new groups.

Data
We examined 16 alignments of DNA sequences. Table 2 summarizes the characteristics of the data sets examined. The posterior probabilities of substitution models (and other phylogenetic model parameters) were approximated using the MCMC algorithm described above as implemented in a program written by J.P.H. PAUP* (Swofford 2002) was used to calculate the maximum likelihood scores of the 203 models for each data set. All of the data sets were analyzed using the same model: the GTR + {Gamma} model of DNA substitution, and submodels thereof. The MCMC algorithm was run for 10,000,000 cycles for each data set, and sampled every 1,000th cycle. Each data set was analyzed three times. For each chain, we excluded samples taken during the first million cycles as the "burn in," basing inferences on the last 9,000 samples taken in each analysis. All probabilities were based on the pooled post-burn in samples from the three chains.


View this table:
[in this window]
[in a new window]
 
Table 2 The DNA Sequence Alignments Examined in This Study.

 

    Results and Discussion
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Acknowledgements
 Literature Cited
 
For most of the data sets analyzed in this study, the posterior probability is spread out among a handful of models. Table 3 shows the best models selected for each analysis. The posterior probability was concentrated on only a few substitution models for the mammal and Parrotfish data set, where two and three models, respectively, accounted for 95% of the posterior probability. For some of the smaller alignments, however, there was considerable model uncertainty. For example, for the HIV-env alignment 22 models accounted for 95% of the posterior probability. Another interesting finding is that the most complicated model (GTR: 123456) was not found to be best in any of the analyses; in fact, it was chosen with a high probability only in the mammalian mitochondrial data set, where it had a posterior probability of 0.36.


View this table:
[in this window]
[in a new window]
 
Table 3 The best models for each data set.

 
The Bayes factors for the best models found for each alignment ranged from 52 to 481. Jeffreys (1961) provided a table to help interpret Bayes factors, that was modified by Raftery (1996). Generally speaking, Bayes factors in the range 12 to 150 are considered "strong" evidence in favor of a model, whereas Bayes factors greater than 150 are considered "very strong" evidence in favor of a model. Another way of interpreting the Bayes factor is as a measure of how one changes belief in a hypothesis in the light of new evidence. Lavine and Schervisch (1999:120) argue that the Bayes factor should be interpreted as measuring "the change in the odds in favor of the hypothesis when going from the prior to the posterior." For either interpretation, the large Bayes factors indicate that the alignments contained considerable information about the substitution model.

Generally speaking, substitution models with two to five substitution types did best. Table 4 shows the Bayes factors for the number of substitution types (K), and figure 1 shows the posterior probability of the number of substitution types. The Jukes-Cantor (1969) and Felsenstein (1981) model (111111) did poorly for all of the data sets examined in this study. The seven named models did well for 10 of the data sets, having a Bayes factor greater than one, meaning there was more posterior probability associated with the named models than there was prior probability. However, the Bayes factor for the named models was never greater than 115, and it was less than one for four of the data sets.


View this table:
[in this window]
[in a new window]
 
Table 4 The Bayes factors for K, the number of substitution types.

 


View larger version (29K):
[in this window]
[in a new window]
 
FIG. 1. The posterior probability of K, the number of substitution types, for each DNA sequence alignment. The first figure shows the prior probability distribution of K, calculated as the number of models with K substitution parameters divided by the total number of substitution models (203)

 
The results for model choice using AIC and BIC are largely concordant with the results from Bayes factors. Table 3 also shows the best models chosen using the AIC and BIC. For five of the alignments, AIC, BIC, and posterior probabilities were in agreement, choosing precisely the same model as best. For many of the other cases, AIC or BIC chose models that were still in the credible set of models, calculated using the reversible jump MCMC algorithm. In only one instance did AIC/BIC not choose a model that was at least in the credible set of models; for the butterfly alignment of wingless sequences, BIC chose model 193, which was not in the credible set of models. (Model 125, chosen by AIC for the butterfly data set was in the 95% credible set of models.)

In this study, we ran the Markov chains assuming a uniform prior probability on all models, and Bayes factors were calculated as the ratio of the posterior odds to the prior odds in favor of the model. The posterior probabilities of the best models happened to be of intermediate value for all 16 data sets examined, so that it was straight-forward to calculate the Bayes factor for the best model. However, it is possible that the posterior probability for the best model would be approximated to be 1.0 under a uniform prior on models, or alternatively, the biologist might be interested in the Bayes factor for a model that does a very poor job of explaining the data, having a posterior probability that is approximated using MCMC to be 0.0. However, in both cases, the true posterior probability is neither 1.0 nor 0.0, but rather a number very close to 1.0 or 0.0. How does one accurately approximate the Bayes factors for very good or very poor substitution models? One solution involves seeding the prior, giving more weight to a poor model for which one wants to approximate a Bayes factor, or giving less prior weight to a good model. One then runs the MCMC algorithm using the non-uniform prior, calculating the posterior probabilities for the model of interest as the fraction of the time that the Markov chain visited that model. One can then calculate what the posterior probability of model i would have been had the analysis been performed using a uniform prior. The posterior probability of model i is


where f'(M|X) is the posterior probability and f'(M) is the prior probability one is interested in calculating, and f(M|X) and f(M) are the posterior probability and prior probability, respectively, actually used when performing the MCMC analysis.

Application of likelihood ratio tests to a small subset of hierarchically nested models would often lead to different conclusions about the optimal model than were reached here, when considering all possible time-reversible models. For example, consider the following three nested model comparisons: Test 1, M1 (111111) versus M15 (121121); Test 2, M15 versus M40 (121131); and Test 3, M40 versus M203 (123456). Here, the likelihood ratio test statistic is asymptotically distributed as a {chi}2 with one (tests 1 and 2) or three (test 3) degrees of freedom. At the 5% level, the null (simple) model is rejected all of the time for Test 1, half of the time for Test 2, and 13 of the 16 times for Test 3. Importantly, the most complicated model considered (M203) would often be chosen as the best model when looking at only a small subset of all possible models. A similar conclusion would be reached if one only considered the subset of models that have one, two, or six substitution types. Phylogeneticists often concentrate on only these models, probably because the models are easily implemented using the program PAUP* (Swofford 2002) and are the only possible models in the program MrBayes (Huelsenbeck and Ronquist 2001).

The models chosen as best in the 16 DNA sequence alignments examined here were all similar in one respect: virtually all of them allowed the transition rates to differ from the transversion rates. That is to say, one rarely saw a model chosen that constrained a transition rate to be the same as a transversion rate (e.g., M60 where the substitution pattern is 123313). In fact, with only two exceptions, every model in the 95% credible sets of models allowed the transition rates to be different from the transversion rates. The exceptions included the HIV-env and the vertebrate ß-globin alignments, where the best models included M25 (112212), M60 (123313), M64 (112232), M100 (112312), and M171 (112343), among others. Clearly, in almost all of the sequence alignments examined in this study, the predominant pattern is that transitions occur at a different rate than transversions, and this fact determines which models are chosen as best; the best models all appear to be minor variants of the model proposed by Kimura (1980) and Hasegawa, Kishino, and Yano (1984, 1985).

The models currently used in molecular phylogenetics are quite complicated, and are likely to become more so as biologists add parameters to the models. For example, one strategy that is used to allow a phylogenetic model to better explain a DNA sequence alignment is to partition the data, perhaps according to codon or some other biologically relevant criterion, and to allow the parameters of the model to be independently estimated for each partition (Ronquist and Huelsenbeck 2003). This strategy, although it allows the phylogenetic model to better explain the data, also adds many parameters, and it is unclear how the biologist is to sort out which partitioning scheme is best; is it best to constrain a parameter, such as the gamma shape parameter that describes among-site rate variation, to be the same for all of the data partitions, or should one allow the parameter to be different? If it is to be potentially different, should the parameter be independently estimated for all partitions, or should this parameter be constrained to be the same for some but not all of the partitions? Questions such as these are likely to become more prevalent in the future as models become more complicated and biologists become more concerned with using any knowledge about a gene to tailor the assumptions of the analysis to specific gene regions. The automated procedure described in this article has the advantage that it allows the screening of a large number of possible models when performing a phylogenetic analysis. In principle, one could set up a very complicated general model, such as one that allows parameters to be independently estimated for each data partition, and then allow the MCMC procedure to explore the different possible sub-models in proportion to their probability. Formally, the inference of phylogeny would not depend on any specific model, but would be integrated over uncertainty in the model parameters. Moreover, one would learn how the pattern of substitution differs among data partitions from the output of such an analysis.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Acknowledgements
 Literature Cited
 
J.P.H. was supported by National Science Foundation (NSF) grants MCB-0075404 and DEB-0075406. B.L. was supported by National Institutes of Health (NIH) grant R01 GM68950-01 and NSF grant DEB-0075406. M.E.F. was supported by NSF Information Technology Research (ITR) grant CACTTOL-22035A.


    Footnotes
 
1 In a talk at the 1995 Evolution meetings held in Montreal, Canada, D. L. Swofford pointed out the large number of possible substitution models and named one after his, as it turns out fictional, Aunt Emily. Unfortunately, everyone involved, including D. L. Swofford, has forgotten which of the 203 models was named after "Aunt Emily," so this model is not included as one of the named ones in this article. Back

Arndt von Haeseler, Associate Editor Back


    Literature Cited
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Acknowledgements
 Literature Cited
 

    Adachi, J., and M. Hasegawa. 1995. Phylogeny of whales: dependence of the inference on species sampling. Mol. Biol. Evol. 12:177-179.[Free Full Text]

    Akaike, H. 1973. Information theory as an extension of the maximum likelihood principle. Pp. 267–281 in B. N. Petrov and F. Csaki, eds. Second International Symposium on Information Theory. Akademiai Kiado, Budapest.

    Alfaro, M. E., and S. J. Arnold. 2001. Molecular systematics and evolution of Regina and the thamnophiine snakes. Mol. Phylo. Evol. 21:408-423.[CrossRef][ISI][Medline]

    Arnason, U., A. Gullberg, and A. Janke. 1997. Phylogenetic analyses of mitochondrial DNA suggest a sister group relationship between Xenartha (Edentata) and Ferungulates. Mol. Biol. Evol. 14:762-768.[Abstract]

    Barns, S. M., C. F. Delwiche, J. D. Palmer, and N. R. Pace. 1996. Perspectives on archaeal diversity, thermophyly and monophyly from environmental rRNA sequences. Proc. Natl. Acad. Sci. USA 93:9188-9193.[Abstract/Free Full Text]

    Bell, E. T. 1934. Exponential numbers. Am. Math. Monthly 41:411-419.

    Brower, A. V. Z. 2000. Phylogenetic relationships among the Nymphalidae (Lepidoptera) inferred from partial sequences of the wingless gene. Proc. R. Soc. Lond., Ser. B 267:1201-1211.[CrossRef][ISI][Medline]

    Cox, D. R. 1962. Further results on tests of families of alternate hypotheses. J. R. Stat. Soc. B 24:406-424.

    Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[ISI][Medline]

    Felsenstein, J. 1984. Distance methods for inferring phylogenies: a justification. Evolution 38:16-24.[ISI]

    Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182-198.[ISI][Medline]

    Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736.[Abstract/Free Full Text]

    Green, P. J. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82:711-732.[ISI]

    Hafner, M. S., P. D. Sudman, F. X. Villablanca, T. A. Spradling, J. W. Demastes, and S. A. Nadler. 1994. Disparate rates of molecular evolution in cospeciating hosts and parasites. Science 265:1087-1090.[ISI][Medline]

    Harshman, J., C. J. Huddleston, J. P. Bollback, T. J. Parsons, and M. J. Braun. 2003. True and false Gharials: a nuclear gene phylogeny of Crocodylia. Syst. Biol. 52:386-402.[ISI][Medline]

    Hasegawa, M., T. Yano, and H. Kishino. 1984. A new molecular clock of mitochondrial DNA and the evolution of Hominoids. Proc. Jpn. Acad. Ser. B 60:95-98.

    Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.[ISI][Medline]

    Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.[ISI]

    Hayasaka, K., T. Gojobori, and S. Horai. 1988. Molecular phylogeny and evolution of primate mitochondrial DNA. Mol. Biol. Evol. 5:626-644.[Abstract]

    Huelsenbeck, J. H., and B. Rannala. 1997. Phylogenetic methods come of age: testing hypotheses in an evolutionary context. Science 276:227-232.[Abstract/Free Full Text]

    Huelsenbeck, J. P., and F. Ronquist. 2001. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics 17:754-755.[Abstract/Free Full Text]

    Huelsenbeck, J. P., B. Larget, R. E. Miller, and F. Ronquist. 2002. Potential applications and pitfalls of Bayesian inference of phylogeny. Syst. Biol. 51:673-688.[CrossRef][ISI][Medline]

    Hurvich, C. M., and C-L. Tsai. 1989. Regression and time series model selection in small samples. Biometrika 76:297-307.[ISI]

    Jackman, T. R., A. Larson, K. de Queiroz, and J. B. Losos. 1999. Phylogenetic relationships and tempo of early diversification in Anolis lizards. Syst. Biol. 48:254-285.[CrossRef][ISI]

    Jeffreys, H. 1961. Theory of Probability. Oxford University Press, Oxford, U.K.

    Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–123 in H. N. Munro, ed. Mammalian Protein Metabolism. Academic Press, New York.

    Kass, R. E., and A. E. Raftery. 1995. Bayes factors and model uncertainty. J. Am. Stat. Assoc. 90:773-795.[ISI]

    Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.[ISI][Medline]

    Kimura, M. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78:454-458.[Abstract]

    Larget, B., and D. Simon. 1999. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16:750-759.[Free Full Text]

    Lavine, M., and M. J. Schervish. 1999. Bayes factors: what they are and what they are not. Am. Stat. 53:119-122.[ISI]

    Lebreton, J-D., K. P. Burnham, J. Clobert, and D. R. Anderson. 1992. Modeling survival and testing biological hypotheses using marked animals: a unified approach with case studies. Ecol. Monogr. 62:67-118.[ISI]

    Losos, J. B., T. R. Jackman, A. Larson, K. de Queiroz, and L. Rodríguez-Schettino. 1998. Contingency and determinism in replicated adaptive radiations of island lizards. Science 279:2115-2118.[Abstract/Free Full Text]

    Mathews, S., and M. J. Donoghue. 1999. The root of angiosperm phylogeny inferred from duplicate phytochrome genes. Science 286:947-950.[Abstract/Free Full Text]

    Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. W. Teller, and E. Teller. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21:1087-1091.[ISI]

    Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates with application to the chloroplast genome. Mol. Biol. Evol. 11:715-724.[Abstract/Free Full Text]

    Posada, D. 2003. Using Modeltest and PAUP* to select a model of nucleotide substitution. Pp. 6.5.1–6.5.14 in A. D. Baxevanis, et al., eds. Current Protocols in Bioinformatics. John Wiley & Sons, Inc., New York.

    Posada, D., and K. A. Crandall. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14:817-818.[Abstract]

    Raftery, A. E. 1996. Hypothesis testing and model selection. Pp. 163–187 in W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, eds. Markov chain Monte Carlo in Practice. Chapman & Hall, London.

    Ronquist, F., and J. P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572-1574.[Abstract/Free Full Text]

    Schöniger, M., and A. von Haeseler. 1994. A stochastic model and the evolution of autocorrelated DNA sequences. Mol. Phyl. Evol. 3:240-247.[CrossRef][Medline]

    Schwarz, G. 1978. Estimating the dimension of a model. Ann. Stat. 6:461-464.[ISI]

    Suchard, M. A., R. E. Weiss, and J. S. Sinsheimer. 2001. Bayesian selection of continuous-time Markov chain evolutionary models. Mol. Biol. Evol. 18:1001-1013.[Abstract/Free Full Text]

    Streelman, J. T., M. E. Alfaro, M. W. Westneat, D. R. Bellwood, and S. A. Karl. 2002. Evolutionary history of the parrotfishes: biogeography, ecomorphology, and comparative diversity. Evolution 56:961-971.[ISI][Medline]

    Swofford, D. L. 2002. PAUP*. Phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.

    Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512-526.[Abstract]

    Tavaré, S. 1986. Some probabilistic and statistical problems on the analysis of DNA sequences. Pp. 57–86 in Lectures in Mathematics in the Life Sciences, vol. 17.

    Tierney, L. 1994. Markov chains for exploring posterior distributions. Ann. Statist. 22:1701-1762.[ISI]

    Van Den Bussche, R. A., R. J. Baker, J. P. Huelsenbeck, and D. M. Hillis. 1998. Base compositional bias and phylogenetic analyses: A test of the "flying DNA" hypothesis. Mol. Phylogenet. Evol. 13:408-416.

    Verdinelli, I., and L. Wasserman. 1995. Computing Bayes factors using a generalization of the Savage-Dickey density ratio. J. Am. Stat. Assoc. 90:614-618.[ISI]

    Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314.[ISI][Medline]

    Yang, Z., R. Nielsen, N. Goldman, and A. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.[Abstract/Free Full Text]

Accepted for publication February 17, 2004.