Bayesian Phylogenetics Using an RNA Substitution Model Applied to Early Mammalian Evolution

H. Jow*, C. Hudelot{dagger}, M. Rattray* and P. G. Higgs{dagger}

*Department of Computer Science, University of Manchester;
{dagger}School of Biological Sciences, University of Manchester


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusion
 Acknowledgements
 References
 
We study the phylogeny of the placental mammals using molecular data from all mitochondrial tRNAs and rRNAs of 54 species. We use probabilistic substitution models specific to evolution in base paired regions of RNA. A number of these models have been implemented in a new phylogenetic inference software package for carrying out maximum likelihood and Bayesian phylogenetic inferences. We describe our Bayesian phylogenetic method which uses a Markov chain Monte Carlo algorithm to provide samples from the posterior distribution of tree topologies. Our results show support for four primary mammalian clades, in agreement with recent studies of much larger data sets mainly comprising nuclear DNA. We discuss some issues arising when using Bayesian techniques on RNA sequence data.


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusion
 Acknowledgements
 References
 
Determining the evolutionary relationship among placental mammals is one of the most controversial problems in evolutionary biology. Although molecular phylogeneticists appear to be making good progress on this group, striking inconsistencies between different studies remain. Recent studies of large data sets mainly derived from nuclear DNA seem to have established a consensus with respect to certain fundamental aspects of early mammalian evolution with strong evidence of four primary, superordinal clades (Eizirik, Murphy, and O'Brien 2001Citation ; Madsen et al. 2001Citation ; Murphy et al. 2001a,Citation 2001bCitation ). Using the numbering from Murphy et al. (2001a)Citation these groups are defined as group I, Afrotheria, which includes species thought to have originated within Africa and the island of Madagascar, first described by Stanhope et al. (1998)Citation ; group II, Xenarthra; group III, including primates, rodents, lagomorphs, and tree shrews; group IV, carnivores, artiodactyls, perissodactyls and others, sometimes referred to as Laurasiatheria because these are thought to have originated in Laurasia—Europe, Asia, and North America (Waddell, Okada, and Hasegawa 1999Citation ; Madsen et al. 2001Citation ).

In contrast, recent analyses using complete mitochondrial genomes have been unable to establish the higher level relationship among mammals with any confidence and, in some cases, apparently high support is given to clades which are not consistent with these more recent studies (Cao et al. 2000Citation ; Nikaido et al. 2000Citation ). It has been argued that mitochondrial data are inherently less informative than nuclear data for obtaining deep-level mammalian phylogenies and that data sets obtained using only mitochondrial genomes will therefore provide less phylogenetic resolution per nucleotide (Springer et al. 2001Citation ). However, with the current trend toward using ever larger data sets, researchers may be losing sight of the fact that a smaller data set will often provide more accurate results if it fits better with the evolutionary assumptions of the phylogenetic inference method (Swofford et al. 2001Citation ).

We have carried out a phylogenetic analysis using the complete set of mitochondrial tRNA and rRNA sequences from 54 mammals, using an evolutionary model specifically suited to RNA molecules. We use Bayesian phylogenetic inference techniques which arguably provide the most principled and efficient use of data for problems in which there are very many highly probable alternative trees. Our results show support for the same four primary clades observed in the studies of Madsen et al. (2001)Citation and Murphy et al. (2001a,Citation 2001bCitation ).

Modern approaches to phylogeny are increasingly based on principled probabilistic foundations and make use of explicit evolutionary models. In particular, maximum likelihood and Bayesian methods have a stochastic model of sequence evolution at their heart in which substitutions are modeled as a time-homogeneous Poisson process. Standard models which consider substitutions at the DNA, amino acid, or codon level have been developed (see, for example, Swofford et al. 1996Citation ; Lewis 2001Citation ). Many RNA molecules are subject to functional constraints, resulting in highly conserved secondary structure over long evolutionary times. Mutations at different sites in a molecule are correlated because of compensatory mutations in the helices which are required to preserve base pairing. Models of DNA evolution which consider base paired sites as independent are therefore unsuitable for modeling RNA coding genes.

A number of substitution models for the helical regions of RNA have been proposed that consider pairs of sites as the fundamental unit rather than the single site. In principle, there are 16 possible pairs that can be formed with the four bases; hence, we need a 16 x 16 substitution rate matrix to describe the evolutionary process. In practice, however, there are only six frequently occurring pairs (AU, GU, GC, UA, UG, and CG), whereas the other 10 mismatch pairs together account for only 2%–3% of pairs in conserved structural regions (Higgs 2000Citation ). Some models use only a 6 x 6 matrix and neglect mismatches completely, whereas other models group all the mismatches into a single MM state and hence use a 7 x 7 matrix. Savill, Hoyle, and Higgs (2001)Citation compared many different evolutionary models using likelihood methods to distinguish which features of the rate models are important to give good explanations of real sequence evolution. The rate of double substitutions (e.g., AU to GC) is apparently high, and models that allow double substitutions as a single step fit the data significantly better than those that allow only single substitutions at a time (e.g., AU to GU, then GU to GC). This suggests that the double substitution process is occurring via the compensatory mutation mechanism explained by population genetics theory (Higgs 1998Citation ). The consensus sequence of the population changes at two sites simultaneously, even though the two mutations almost certainly did not occur simultaneously. It was also found that the six-state and seven-state models appear to fit the data as well as the 16-state ones, and there appears to be little benefit from detailed treatment of the mismatches as separate states. Within the six-state and seven-state model families, the most general time-reversible models give significantly higher likelihood values than any of the alternative models with fewer parameters.

We have developed software implementing the most common six-state and seven-state RNA models for use with maximum likelihood and Bayesian phylogenetic inference (PHASE, available from http://www.bioinf.man.ac.uk/resources/). In this article we describe our Bayesian inference method which is based on Markov chain Monte Carlo (MCMC) techniques (Metropolis et al. 1953Citation ; Hastings 1970Citation ). Although MCMC is a well-established method, it has only recently been applied to the problem of phylogenetic inference (Li 1996Citation ; Mau 1996Citation ; Mau and Newton 1997Citation ; Yang and Rannala 1997Citation ; Larget and Simon 1999Citation ; Mau, Newton, and Larget 1999Citation ; Li, Pearl, and Doss 2000Citation ). Two recent reviews suggest that this method of phylogenetic inference has a very promising future (Huelsenbeck et al. 2001Citation ; Lewis 2001Citation ). The main strength of the MCMC approach is the ability to assign a level of support, the posterior probability, to basically any phylogenetic hypothesis under investigation. None of the currently available packages (Larget and Simon 1999Citation ; Huelsenbeck and Ronquist 2001Citation ; McGiure, Denham, and Balding 2001Citation ) include the RNA evolutionary model used here, which is a major limitation given the extensive use of RNA genes in phylogenetic analyses. Here, we apply our MCMC algorithm to the phylogeny of placental mammals and discuss some of the statistical issues that are important when applying Bayesian inference to RNA data sets.


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusion
 Acknowledgements
 References
 
Data Preparation
We constructed a data set using the complete sequences of all RNA molecules taken from 54 complete mammalian mitochondrial genomes, including all 22 tRNAs as well as the small (12S) and large (16S) subunit rRNAs. We include 49 placental mammals as well as four marsupials and a monotreme as outgroups. Missing tRNA Lys in two marsupials (the bandicoot and wallaroo) were treated as missing values in the analysis. All available sequences (in October, 2001) were extracted from the NCBI database, and the accession numbers are given in table 1 . Sequences were aligned by eye using the secondary structure as reference. The mitochondrial tRNA profiles developed by Helm et al. (2000)Citation were used as a guide to align the tRNAs, whereas human rRNA secondary structures from the Gutell lab (Cannone et al. 2002Citation , http://www.rna.icmb.utexas.edu/) were used to help align the rRNAs. Most of the stem regions have highly conserved sequences, and the variations were carefully checked to allow, in at least 50% of the species, a Watson-Crick or GU-UG pair. This criterion was chosen to conserve a large part of the molecules' structure and to allow enough flexibility for further addition of species at a later date. Only stem pairs which were conserved according to this definition were used in the analysis. The final data set was made up of 1,946 nucleotides corresponding to 973 pairs.


View this table:
[in this window]
[in a new window]
 
Table 1 Scientific Names, Classification, and NCBI Accession Numbers of Species in the Data Set

 
Substitution Model
In this article we use a model with seven states in total, with six states representing the most common base pairs (AU, GU, GC, UA, UG, and CG) and a composite mismatch state (MM) representing the other 10 less frequent pairs. We use the most general, time-reversible seven-state model with 42 rate parameters rij defining the rate of substitution from state i to j and seven frequency parameters {pi}i defining the expected frequency of state i (this is model 7A in Savill, Hoyle, and Higgs 2001Citation ). The model is directly analogous to the general time-reversible four-state model used for single nucleotides (see, for example, Page and Holmes 1998Citation , pp. 148–154). There is an obvious constraint that the frequencies add to one,


We impose the following additional constraint,


which makes the average rate of substitutions one per unit of evolutionary time. There is a further constraint that the model is time-reversible in which case,

As in Savill, Hoyle, and Higgs (2001)Citation we define rij {equiv} {alpha}ij{pi}j, so that the aforementioned constraint is automatically satisfied for symmetric choice of {alpha}ij. In total there are 23 constraints on 49 parameters, leading to a model with 26 independent parameters.

We also allow for substitution rate variation over sites by using the discrete-gamma model of Yang (1994)Citation . This model approximates the distribution of rates across different sites, using a number of discrete categories which are chosen to approximate a Gamma distribution. A single parameter determines the shape of this Gamma distribution and we use four site categories here, a choice which was shown to provide good all-round performance for the cases considered by Yang (1994)Citation .

Bayesian Phylogenetics
In Bayesian inference we are interested in computing the posterior probability of the hypothesis under investigation. This quantity is the probability of the hypothesis given the available data and including any prior assumptions we wish to include. The symbol {tau}i labels the ith tree topology, {upsilon}i are the branch lengths associated with this topology, and {theta} are the parameters of our evolutionary model (i.e., the rate parameters {alpha}ij, base pair frequencies {pi}i, and the Gamma distribution parameter). We can calculate the posterior probability density of the combined state {phi} = {{tau}i, {upsilon}i, {theta}} given sequence data X using Bayes' theorem,


where Ns is the number of topologies for a data set containing s species, P(X|{phi}) is the data likelihood, and p({phi}) is the prior probability density associated with state {phi}. Using this quantity we can compute the posterior probability of any identifiable phylogenetic feature of interest by integrating out the appropriate variables. For example, we can compute the posterior probability of a particular topology by integrating out the model parameters and branch lengths. Similarly, we can calculate the posterior probability of other phylogenetic features such as the existence of clades or their relative positioning.

The sum and integrals in the denominator of equation (4) are intractable for realistic sized problems. MCMC is therefore used to generate a large sample from the posterior probability distribution of states without explicit enumeration of these sums and integrals. To calculate the posterior probability of any phylogenetic hypothesis, we simply determine the fraction of samples in support of the hypothesis. For example, the posterior probability of a particular topology is simply given by the fraction of times we see this topology in our MCMC sample.

We use the standard Metropolis-Hastings MCMC algorithm which involves a two-stage process to construct a Markov chain in some state space (Metropolis et al. 1953Citation ; Hastings 1970Citation ). Firstly, a new state {phi}' is proposed according to some proposal mechanism defined by a probability density function f({phi}'|{phi}) conditional on the current state {phi}. Secondly, the proposed state is accepted with some probability which depends on the posterior probability of the state as well as the proposal distribution. In particular, we define the transition probability between states to be,


where {phi}n is the nth state in the chain and f({phi}|{phi}')/f({phi}'|{phi}) is known as the Hastings ratio of the proposal distribution. The advantage of this formulation is that the denominator in equation (4) cancels in equation (5) . We can therefore compute the transition probability using only knowledge of the priors and the likelihood which can be calculated using Felsenstein's efficient recursive algorithm (Felsenstein 1981Citation ).

Equation (5) defines a first order Markov chain, and it can be shown that under quite weak conditions this chain converges to an equilibrium in which states are distributed according to the posterior density p({phi}|X). The chain therefore provides us with samples from the posterior probability of topologies, model parameters, or any other quantity of interest. Our only condition is that the Markov chain be ergodic, i.e., that there is a nonzero probability of reaching any point in the state-space starting from any other point in a finite number of steps. Unfortunately, we do not always know when the Markov chain has converged sufficiently to give accurate results. We choose to complete a number of independent runs, which at least allows us to determine the consistency of our results.

We have no strong evidence for any particular prior distribution of trees, and we therefore choose a simple factorized prior p({phi}) = P({tau}i)p({theta})p({upsilon}i). We set a uniform prior over topologies P({tau}i) = 1/Ns. We choose a flat Dirichlet distribution prior for base pair frequency parameters, i.e., all sets of base pair frequencies summing to one are equally likely. We choose a uniform positive prior for substitution rate ratio parameters, branch lengths, and the Gamma distribution parameter in the variable-rate model of Yang (1994)Citation . In the case of uniform priors one must set an upper limit to ensure a normalisable prior distribution, but in practice the particular choice of this upper limit does not usually affect our results unless an extreme value is chosen.

We split up the proposal process by having a number of different schemes for different groups of variables which we apply iteratively. For the frequency parameters we adopt the technique of Larget and Simon (1999)Citation who use a Dirichlet proposal distribution centered at the current frequency vector (see also the description of software implementation at http://www.bioinf.man.ac.uk/resources/). For the Gamma distribution parameter and substitution rates we use a normal proposal distribution centered at the current value for each rate, with a reflecting boundary at zero and at some user-defined upper limit. For the substitution rates the proposal is actually done for a parameter which is the ratio between each rate and one reference rate. This ensures that the Hastings ratio is one for this proposal irrespective of the normalization in equation (2) .

We use three different proposals for branch length and topology changes which are illustrated in figure 1 . The continuous change proposal (top of fig. 1 ) mainly results in branch length changes but can also induce local topology changes. This proposal changes the length of a randomly chosen branch. If the initial length is x, the new length of the branch is set to x + {delta}, where {delta} is chosen from a normal distribution with mean zero. A special rule is applied when x + {delta} becomes negative. If the branch is an internal branch (such as that shown at the top of fig. 1 ) then the topology is changed to one of the two nearest neighbor topologies with each having a probability of 0.5, and the length of the new internal branch is set to y = |x + {delta}|. If the branch changed is an external branch, then there is no alternative topology to change to if the length becomes negative. In this case the topology of the tree remains the same, and the length of the branch is set to |x + {delta}|. We impose an upper limit on the branch length to ensure that the prior constraint on branch lengths is satisfied, although in practice we have not observed the branch lengths approaching such an upper limit in our simulations.



View larger version (13K):
[in this window]
[in a new window]
 
Fig. 1.—The three different branch length and topology change proposals used by our MCMC algorithm. The continuous change proposal (top) can induce topology changes if x + {delta} is negative for an internal branch, in which case the two possibilities shown are equally likely with y = |x + {delta}|. The NNI proposal keeps the internal branch length y fixed. The SPR proposal has a nontrivial Hastings ratio b/(x + y).

 
Although the continuous change proposal can induce local topology changes in a smooth manner, we also include the nearest neighbor interchange (NNI) and subtree pruning and regrafting (SPR) proposals described by Swofford et al. (1996)Citation . To use these moves in the context of MCMC, we have to specify how the branch lengths are affected. In the NNI (middle of fig. 1 ) we select an internal branch at random and rearrange the four neighboring subtrees or leaves into one of two possible alternative arrangements with equal probability keeping the length of the internal branch unchanged. In SPR we randomly detach a subtree E (dashed line) from the tree and reattach it to a randomly selected branch b at a random point (solid line). There is now a single internal branch of length x + y. Unlike the two other topology and branch length proposals, for SPR there is a nontrivial Hastings ratio (b/(x + y) in fig. 1 ).

For the results described later trees are sampled after every 10 MCMC cycles, where one cycle corresponds to an update of all model parameters and one continuous branch length change. The other two topology proposals occur with equal probability every 10 cycles. An initial burn-in period was neglected, after which 20,000 trees were sampled and stored for every run.


    Results
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusion
 Acknowledgements
 References
 
We summarize the results from five independent MCMC runs in figure 2 . The scientific names of species and NCBI accession numbers are given in table 1 . A consensus tree is constructed from all five runs, using the majority rule consensus method implemented by PHYLIP (Felsenstein 1989Citation ). Each interior node is annotated with the mean posterior probability of the corresponding clade over five runs, and we also give some indication of the standard deviation (see figure caption for details). The estimated substitution model parameters averaged over all runs are given in column 2 of table 2 and in table 3 , along with the state frequencies observed in the data set (column 1 in table 2 ), and the proposal statistics for the MCMC algorithm are given in table 5 .



View larger version (29K):
[in this window]
[in a new window]
 
Fig. 2.—A consensus tree constructed using the combined set of topologies obtained from five independent MCMC runs with a general time-reversible seven-state RNA substitution model and four variable-rate model categories. Maximum likelihood branch lengths are shown for this topology. Numbers represent the percentage of each clade appearing in the combined set of topologies. Letters indicate the variation (standard deviation {sigma}) in these percentages measured over five runs (a, {sigma} <= 1%; b, 1% < {sigma} <= 5%; c, 5% < {sigma} <= 10%; d, 10% < {sigma} < 18%)

 

View this table:
[in this window]
[in a new window]
 
Table 2 State Frequencies Measured from Complete Data Set Against Mean Posterior Estimates from a Seven-State Variable-Rate Model with Four Rate Categories and from a Model with Constant Rate at Every Site

 

View this table:
[in this window]
[in a new window]
 
Table 3 Mean Posterior Transition Rate Estimates for a Seven-State Variable-Rate Model with Four Rate Categories

 

View this table:
[in this window]
[in a new window]
 
Table 5 The Proportion of Accepted Proposals for the MCMC Runs

 
The four primary superordinal clades correspond to those identified by Madsen et al. (2001)Citation and Murphy et al. (2001a,Citation 2001bCitation ) and are supported by our analysis with posterior probability of 86% for group I, 97% for group III, and 100% for group IV. Group II, Xenarthra, is only represented by one species (an armadillo) but is separated from the other groups with 86% posterior probability. The species in groups III and IV are always separated from the other species and are monophyletic sister groups with 97% posterior probability. As far as the relationship between these groups is concerned, we find some support for two alternative arrangements. With 76% posterior probability, we find that group I and group II form a clade (referred to as Atlantogenta by Waddell, Okada, and Hasegawa 1999Citation ) as shown in the consensus tree. The next most likely arrangement with 24% probability has group II branching off first followed by I and finally III and IV as sisters. The other possibility, which has group I branching off first followed by group II, has very low support (<1% posterior probability). The studies of Madsen et al. (2001)Citation , Murphy et al. (2001a)Citation , and Eizirik, Murphy, and O'Brien (2001)Citation failed to distinguish strongly between these alternatives, although it is interesting to note that these last two studies appear to find strongest support for group I as the earliest branching group, a hypothesis which is not supported by our results. In a more recent study combining the data sets from Madsen et al. (2001)Citation , Murphy et al. (2001a)Citation , and using a Bayesian MCMC algorithm Murphy et al. (2001b)Citation also found strong support for the hypothesis with group I branching off earliest.

The monophyly of Afrotheria has previously been obtained using mitochondrial genes (Stanhope et al. 1998Citation ; Mouchaty et al. 2000bCitation ), but its position has been poorly resolved, and it has typically been positioned close to the fereuungulates, which are found in group IV here. We find that it typically branches off earlier than this. Xenarthra has long been identified as one of the oldest placental groups, and our analysis is consistent with this; however, it is only more recently that Afrotheria has been found close to the root of the placentals. This leads us to question the common view of an origin of placental mammals in Asia, spreading to the Northern Hemisphere and then migrating to the Southern Hemisphere.

Group III contains many of the most studied mammals, including the primates, rodents, rabbits, and a tree shrew. This group is well supported, with a posterior probability of 97%, but there are many alternative arrangements within the group that are less well resolved. For example, within the primates there is a posterior probability of 83% attached to the sister relationship of the gorilla and human, whereas the more usually accepted arrangement which places chimps as sister group to humans has only 12% support. We attach little significance to this, as our method is not ideal for very closely related species. Our choice of sites only consists of those which are reliably aligned over the full range of mammals. Rapidly varying sites which would give useful information about closely related species have been excluded. We note that the relationship of the higher primates was also unresolved by Madsen et al. (2001)Citation with a larger data set. Within the primates the Anthropoidea group (old and new world monkeys, and apes) is well supported. Our results support the sister relationship of the loris (a member of Strepsirhini) with the tarsier which is in agreement with other molecular studies (Madsen et al. 2001Citation ) but is in disagreement with morphological evidence (Ross, Williams, and Kay 1998Citation ) that places the tarsier as a sister group to the Anthropoidea. There is no resolution for the relative positioning of these species, and the arrangement with highest support (68%) has them separated from the other primates. This is perhaps explained by a compositional bias which makes the positioning of the tarsier using mitochondrial genes particularly problematic (Schmitz, Ohme, and Zischler 2002Citation ).

According to our studies there is low support for the monophyly of rodents (26%), and therefore a monophyletic rodent group does not appear in the consensus tree shown in figure 2 . Support for the most likely arrangement which separates Muridae from the other rodents is also quite weak (61%). The monophyly of rodents has been questioned by recent complete mitochondrial genome studies (Reyes et al. 2000Citation ). The latest studies using large numbers of predominantly nuclear genes show stronger support for monophyly (Adkins et al. 2001Citation ; Madsen et al. 2001Citation ; Murphy et al. 2001a,Citation 2001bCitation ), whereas studies using smaller nuclear data sets failed to find reliable support (Adkins et al. 2001Citation ). We find some support (79%) for the tree shrew forming a clade with the rodents and lagomorphs, and there is very low support for the traditional positioning of the tree shrew at the root of the primates.

Group IV is supported with 100% posterior probability, but there are low support values for many clades within this group because of the large number of alternative arrangements that occur within the MCMC runs. This lack of resolution is largely caused by the alpaca, whose position within the tree is particularly variable for this data set. The carnivores are monophyletic with 100% support, and the Perissodactyla form a well-supported monophyletic group with 93% posterior probability. Although the Cetartiodactyla are poorly resolved with weak support for monophyly (28%), the generally accepted relationship for all species apart from the pig and alpaca is strongly supported (97%). The arrangement of species from Eulipotyphla and Chiroptera is poorly resolved for the most part, and the groups appear to mix. There appears to be surprisingly strong support for a close relationship between the hedgehogs and the bats, although we note that the hedgehogs are particularly problematic in all phylogenies using mitochondrial genes. Indeed, the hedgehog has been placed at the root of all placentals (Stanhope et al. 1998Citation ; Mouchaty et al. 2000a,Citation 2000b;Citation Nikaido et al. 2000Citation ) and widely separated from the mole, which is usually thought to be closely related.


    Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusion
 Acknowledgements
 References
 
Our results are promising and show quite good congruence with recent studies of much larger data sets derived mainly from nuclear DNA. We believe that taking a well-defined set of sites within the RNA helices and using an evolutionary model appropriate to these sites is likely to increase the reliability of results.

A major advantage of the Bayesian approach is that we do not limit ourselves to a single best tree. It is therefore possible to find strong support for a specific clade or clade arrangement of interest even when there may be a huge number of alternative topologies with comparable likelihood values. An alternative approach would be to use bootstrap resampling of maximum likelihood phylogenies to get an estimate of confidence. But, this would require much greater computational resources than the MCMC approach used here, and the interpretation of bootstrap confidence intervals can be problematic because they seem to provide a conservative estimate of confidence (Swofford et al. 1996Citation ; however, see Efron, Halloran, and Holmes 1996Citation , for an alternative viewpoint). There may also be bias in bootstrap samples because of the particular optimization heuristic used to obtain an estimate of the maximum likelihood tree.

The resolution at some levels of our phylogeny was not good and in particular there were a number of alternative arrangements within the main groups identified which could not be excluded. This indicates that there is not sufficient evidence within the current set of sequence data to distinguish between these alternatives. It does not indicate a problem with the MCMC algorithm. The variation in posterior probability estimates from different MCMC runs is small for all strongly supported groups in figure 2 , indicating that we have good consistency. Only the relatively weakly supported groups have standard deviations greater than 10% in the mean posterior probability.

It is particularly important to use appropriate substitution models when applying Bayesian inference techniques to RNA data because ignoring covariation between base paired sites will generally invalidate our estimates of posterior probabilities. The usual four-state DNA substitution models consider the nucleotides which are base paired in a helical region of RNA to be independent, however, typically 90% of these bases form stable Watson-Crick base pairs, whereas the majority of other bases form the less stable GU-UG pairing. If we treated these bases as independent, then we would significantly overestimate our posterior probability for a hypothesis which is well supported. To illustrate the argument, consider an idealized situation in which we apply the HKY model (Hasegawa, Kishino, and Yano 1985Citation ) to an RNA helical region in which all of the base pairs form Watson-Crick pairs. Let L({phi}1) and L({phi}2) denote likelihoods for the data on one side of the helical region for two different phylogenetic hypotheses {phi}1 and {phi}2. If we ignore correlations, then the likelihood for the data combined from both sides will be L({phi}1)2 and L({phi}2)2 by symmetry of the substitution model. If the prior probabilities associated with {phi}1 and {phi}2 are equal, we see that the ratio of posterior probabilities is L({phi}1)/L({phi}2) in the first case and L({phi}1)2/L({phi}2)2 in the second case. So, for example, if {phi}1 is 100 times more likely given only information from one side of the helix, this will correspond to being 10,000 times more likely given data on both sides of the helix under a four-state DNA substitution model. Ignoring correlations therefore results in an inflated confidence in this hypothesis. Similar considerations would also invalidate bootstrap support values as well as the method of Kishino and Hasegawa (1989)Citation which is often used to construct confidence intervals for phylogenetic inferences using maximum likelihood or parsimony methods. In practice one could overcome the problem of correlated substitution by simply applying a standard DNA model to data from one side of the helical regions in an RNA molecule, however, in doing this we would lose important information from the 10% of base pairs not forming Watson-Crick pairs.

In figures 3 and 4 we present results using the general time-reversible four-state model for single DNA sites applied to the same paired parts of the RNA genes but treating the sites as independent. Four variable-rate categories are used. In figure 3 , we compare the number of occurrences of each unique topology in a single run (ranked in order of number of occurrences) with results obtained using the seven-state model. We see that runs using the four-state DNA model result in significantly higher posterior probabilities being attached to the top topologies compared with runs using the RNA model. Similarly, much lower posterior probabilities are attached to the less frequent topologies when using a DNA model. These results are consistent with our argument against using a DNA model on base paired regions of RNA.



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 3.—The number of occurrences of each topology is plotted for two MCMC runs using the general time-reversible seven-state RNA (solid line) and four-state DNA (dashed line) substitution model. The topologies are ranked according to the number of occurrences in each case.

 


View larger version (30K):
[in this window]
[in a new window]
 
Fig. 4.—A consensus tree constructed using the combined set of topologies obtained from five independent MCMC runs using a general time-reversible four-state DNA substitution model with four rate categories. Percentages and letters have the same meaning as in figure 2

 
In figure 4 we show the consensus tree from runs with the four-state DNA model, using exactly the same methodology as that for the results described in the previous section. We see that the estimated posterior probability for the clades is typically much higher than results using the seven-state RNA model (compare with fig. 2 ). In some respects the tree appears to reflect the expected relationships better than in figure 2 , most notably in group III where we now see strong support for monophyly of the primates and rodents. However, when we look at group IV we see that similarly high support is given to more dubious arrangements. For example, in 88% of samples we find the horse and ass separated from the other Perissodactyla and the hippopotamus separated from the whales within the Cetartiodactyla. This should be contrasted with figure 2 where there is generally more uncertainty attached to clades within the group, but there is support for monophyly of the Perissodactyla and for the usually accepted grouping of the hippopotamus and whales.

Thus, on the basis of the topology of the consensus tree obtained, there seems little to distinguish between the four-state and seven-state models because each tree contains some aspects that appear more reasonable than the other. From a statistical point of view, the seven-state model is preferable because correlations make the posterior probabilities using the four-state model invalid. One thing that is clear is that this analysis, based on a fairly restricted number of sites, seems to give much more reliable results than several previous methods using large sets of mitochondrial genes (Cao et al. 2000Citation ; Nikaido et al. 2000Citation ). It is therefore clear that the paired regions of RNA are highly informative. One reason for this may be the lack of ambiguity in their alignment.

For tRNA genes, most of the informative sites are in the stem regions. The anticodon loop is highly conserved in most mitochondrial tRNAs and hence contains little phylogenetic information. The other two loops are very variable in both length and sequence in most cases. Highly variable and gappy regions of alignments are usually excluded from phylogenetic analyses. In the rRNA alignments, however, there are many reliably aligned sites in unpaired regions, and it would be desirable to include these sites in the analysis. We intend to develop our program so that it can deal with two evolutionary models simultaneously: a seven-state model for the paired sites and a four-state model for the unpaired sites. At present, our program deals with only one or other of these models and therefore we have limited our analysis to the paired sites.

Table 2 shows the posterior average of the estimated frequency parameter over two different MCMC runs in comparison with the empirical frequencies of the states measured directly from the sequences. In column 2 we show the estimates from the runs used to create the tree in figure 2 , using a variable-rate model with four rate categories. In column 3 we show estimates from similar runs but using a constant rate of evolution at every site. The frequency estimates in columns 2 and 3 are fairly close to the empirical frequencies except for a notable difference in the MM frequency, which is 11% in column 2 but is closer to 4% in columns 1 and 3. The transition rate matrices shown in tables 3 and 4 for the variable and constant rate models, respectively, are also significantly different from one another. The problem appears to be that sites with the highest substitution rates are also those with the highest frequency of MM states. This makes sense because the rapidly evolving sites are presumably also those under the weakest evolutionary constraints for conserved structure. The rapidly changing sites contribute more to the estimate of the frequency parameters than other sites because they provide less correlated and therefore more informative samples from the distribution of states. A possible way of eliminating this problem would be to allow independent estimates of the frequency parameters in each rate category of the variable-rate model.


View this table:
[in this window]
[in a new window]
 
Table 4 Mean Posterior Transition Rate Estimates for a Seven-State Model with Constant Rate at Every Site

 
The choice of which pairs of sites to include in the analysis was based on the fraction of mismatch states at each position. For most sites, mismatches are rare (4% of all pairs on average); however, there are a number of sites with much higher frequency. Of the 973 paired sites used in the analysis, 113 have mismatches in 10% to 50% of sequences. A cutoff of 50% was used in constructing the data set. We could have used a stricter criterion, but this would have meant eliminating potentially informative sites from the data.

We emphasize that the tree in figure 2 was obtained using a variable-rate model, and we estimate that there is a significant variation in evolutionary rates over sites in our sequences. Allowing for variable rates influences the resulting tree a great deal. When the MCMC analysis is performed with a constant rate over sites, we obtain trees differing markedly from figure 2 that contain some biologically very implausible clades. Therefore we do not show these results here.


    Conclusion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusion
 Acknowledgements
 References
 
We have developed a software package for phylogenetic inference which implements a number of evolutionary models specific to the stem regions of structural RNA molecules, for use in maximum likelihood and Bayesian phylogenetic inference. In this article we used our Bayesian MCMC algorithm to study a data set containing DNA from the stem regions of all the mitochondrial RNA genes taken from 54 mammals. Our results show good resolution in determining the early branches of the mammal tree, and the four main groups identified correspond well with recent results obtained from much larger data sets mainly comprising nuclear genes. We demonstrate that the use of standard four-state models of DNA substitution will overestimate the support (in terms of posterior probabilities) for phylogenetic inferences when using DNA taken from RNA encoding genes. Future work will focus on improving the fit between our evolutionary models and the data and allowing for both four-state and seven-state models to be used simultaneously.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusion
 Acknowledgements
 References
 
M.R. and P.G.H. gratefully acknowledge a BBSRC award which provided support for C.H. H.J. was supported by an ORS scholarship from the University of Manchester.


    Footnotes
 
Mark Ragan, Reviewing Editor

Keywords: Bayesian phylogeny mitochondrial RNA mammalian evolution RNA substitution model Markov chain Monte Carlo Back

Address for correspondence and reprints: M. Rattray, Department of Computer Science, University of Manchester, Manchester, UK. E-mail: magnus{at}cs.man.ac.uk . Back


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusion
 Acknowledgements
 References
 

    Adkins R. M., E. L. Gelke, D. Rowe, R. L. Honeycutt, 2001 Molecular phylogeny and divergence time estimates for major rodent groups: evidence from multiple genes Mol. Biol. Evol 18:777-791[Abstract/Free Full Text]

    Cannone J. J., S. Subramanian, M. N. Schnare, et al. (14 co-authors) 2002 The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs BioMed. Central Bioinformat 3:2.

    Cao Y., M. Fujiwara, M. Nikaido, N. Okada, M. Hasegawa, 2000 Interordinal relationships and timescale of Eutherian evolution as inferred from mitochondrial genome data Gene 259:149-158[ISI][Medline]

    Efron B., E. Halloran, S. Holmes, 1996 Bootstrap confidence levels for phylogenetic trees Proc. Natl. Acad. Sci. USA 93:7085-7090[Abstract/Free Full Text]

    Eizirik E., W. J. Murphy, S. J. O'Brien, 2001 Molecular dating and biogeography of the early placental mammals J. Hered 92:212-219[Abstract/Free Full Text]

    Felsenstein J. P., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach J. Mol. Evol 17:368-376[ISI][Medline]

    ———. 1989 PHYLIP (phylogeny inference package). Version 3.2 Cladistics 5:164-166

    Hasegawa M., H. Kishino, T. Yano, 1985 Dating of the human-ape splitting by a molecular clock of mitochondrial DNA J. Mol. Evol 21:160-174

    Hastings W. K., 1970 Monte Carlo sampling methods using Markov chains and their applications Biometrika 57:97-109[ISI]

    Helm M., H. Brulé, D. Friede, R. Giege, D. Pûtz, C. Florentz, 2000 Search for characteristic structural features of mammalian mitochondrial tRNAs RNA 6:1356-1379[Abstract/Free Full Text]

    Higgs P. G., 1998 Compensatory neutral mutations and the evolution of RNA Genetica 102:91-101

    ———. 2000 RNA secondary structure: physical and computational aspects Q. Rev. Biophys 33:199-253[ISI][Medline]

    Huelsenbeck J. P., F. Ronquist, 2001 Mrbayes: Bayesian inference of phylogenetic trees Bioinformatics 17:754-755[Abstract/Free Full Text]

    Huelsenbeck J. P., F. Ronquist, R. Nielsen, J. P. Bollback, 2001 Bayesian inference of phylogeny and its impact on evolutionary biology Science 294:2310-2314[Abstract/Free Full Text]

    Kishino H., M. Hasegawa, 1989 Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoides J. Mol. Evol 29:170-179[ISI][Medline]

    Larget B., D. L. Simon, 1999 Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees Mol. Biol. Evol 16:750-759[Free Full Text]

    Lewis P. O., 2001 Phylogenetic systematics turns over a new leaf Trends Ecol. Evol 16:30-37[ISI][Medline]

    Li S., 1996 Phylogenetic tree construction using Markov chain Monte Carlo Doctoral dissertation, Ohio State University, Columbus

    Li S., D. K. Pearl, H. Doss, 2000 Phylogenetic tree construction using Markov chain Monte Carlo J. Am. Stat. Soc 95:493-508

    Madsen O., M. Scally, C. J. Douady, D. J. Kao, R. W. DeBry, R. Adkins, H. M. Amrine, M. J. Stanhope, W. W. de Jong, M. S. Springer, 2001 Molecular phylogenetics and the origins of placental mammals Nature 409:614-618[ISI][Medline]

    Mau B., 1996 Bayesian phylogenetic inference via Markov chain Monte Carlo methods Doctoral dissertation, University of Wisconsin, Madison

    Mau B., M. A. Newton, 1997 Phylogenetic inference for binary data on dendograms using Markov chain Monte Carlo J. Comput. Graph. Stat 6:122-131[ISI]

    Mau B., M. Newton, B. Larget, 1999 Bayesian phylogenetic inference via Markov chain Monte Carlo methods Biometrics 55:1-12[ISI][Medline]

    McGiure G., M. C. Denham, D. J. Balding, 2001 Mac5: Bayesian inference of phylogenetic trees from DNA sequences incorporating gaps Bioinformatics 17:479-480[Abstract]

    Metropolis N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller, 1953 Equations of state calculations for fast computing machines J. Chem. Phys 21:1087-1091[ISI]

    Mouchaty S. K., A. Gullberg, A. Janke, U. Arnason, 2000a. The phylogenetic position of the Talpidae within Eutheria based on analysis of complete mitochondrial sequences Mol. Biol. Evol 17:60-67[Abstract/Free Full Text]

    ———. 2000b. Phylogenetic position of the tenrecs (Mammalia: Tenrecidae) of Madagascar based on analysis of the complete mitochondrial genome sequence of Echinops telfairi Zool. Scr 29:307-317[ISI]

    Murphy W. J., E. Eizirik, W. E. Johnson, Y. P. Zhang, O. A. Ryder, S. J. O'Brien, 2001a. Molecular phylogenetics and the origins of placental mammals Nature 409:614-618[ISI][Medline]

    Murphy W. J., E. Eizirik, S. J. O'Brien, et al. (11 co-authors) 2001b. Resolution of the early placental mammal radiation using Bayesian phylogenetics Science 294:2348-2351[Abstract/Free Full Text]

    Nikaido M., M. Harada, Y. Cao, M. Hasegawa, N. Okada, 2000 Monophyletic origin of the order Chiroptera and its phylogenetic position among Mammalia, as inferred from the complete sequence of the mitochondrial DNA of a Japanese megabat, the Ryukyu flying fox J. Mol. Evol 51:318-328[ISI][Medline]

    Page R. D. M., E. Holmes, 1998 Molecular evolution, a phylogenetic approach Blackwell Science

    Reyes A., C. Gissi, G. Pesole, F. M. Catzeflis, C. Saccone, 2000 Where do rodents fit? Evidence from the complete mitochondrial genome of Sciurus vulgaris Mol. Biol. Evol 48:1-5

    Ross C., B. Williams, R. F. Kay, 1998 Phylogenetic analysis of anthropoid relationships J. Hum. Evol 35:221-306[ISI][Medline]

    Savill N. J., D. C. Hoyle, P. G. Higgs, 2001 RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum likelihood methods Genetics 157:399-411[Abstract/Free Full Text]

    Schmitz J., M. Ohme, H. Zischler, 2002 The complete mitochondrial sequence of Tarsius bancanus: evidence for the extensive nucleotide compositional plasticity of primate mitochondrial DNA Mol. Biol. Evol 19:544-553[Abstract/Free Full Text]

    Springer M. S., R. W. DeBry, C. Douady, H. Amrine, O. Madsen, W. W. de Jong, M. J. Stanhope, 2001 Mitochondrial versus nuclear gene sequences in deep-level mammalian phylogeny reconstruction Mol. Biol. Evol 18:132-143[Abstract/Free Full Text]

    Stanhope M. J., V. G. Waddell, O. Madsen, W. D. Jong, S. B. Hedges, G. C. Cleven, D. Kao, M. S. Springer, 1998 Molecular evidence for multiple origins of insectivora and for a new order of endemic African insectivore mammals Proc. Natl. Acad. Sci. USA 95:9967-9972[Abstract/Free Full Text]

    Swofford D. L., G. J. Olsen, P. J. Waddell, D. M. Hillis, 1996 Phylogenetic inference Pp. 407–515 in D. M. Hillis, ed. Molecular systematics. 2nd edition. Sinauer Associates, Sunderland, Mass

    Swofford D. L., P. J. Waddell, J. P. Huelsenbeck, P. G. Foster, P. O. Lewis, J. S. Rogers, 2001 Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods Syst. Biol 50:525-539[ISI][Medline]

    Waddell P. J., N. Okada, M. Hasegawa, 1999 Towards resolving the interordinal relationships of placental mammals Syst. Biol 48:1-5[ISI][Medline]

    Yang Z., 1994 Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods J. Mol. Evol 39:306-314[ISI][Medline]

    Yang Z., Rannala B., 1997 Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method Mol. Biol. Evol 14:717-724[Abstract]

Accepted for publication May 13, 2002.