* Department of Biology, University College London, London, United Kingdom; School of Biomedical and Chemical Sciences, University of Western Australia, Crawley Perch, Western Australia; and
Department of Computer Science, University of Manchester, Manchester, United Kingdom
Correspondence: E-mail: m.telford{at}ucl.ac.uk.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
RNAs transcribed from rRNA genes have a complex secondary structure mediated by base pairing between sometimes distant regions of the rRNA molecule. The pairing between the stem nucleotides has important consequences for their evolution which differs from that of unpaired loop nucleotides. These differences in evolution should ideally be accounted for when using rRNA sequences for phylogeny estimation.
We use a novel permutation approach to demonstrate the significant superiority of models of sequence evolution that allow stem and loop regions to evolve according to separate models and, in common with previous studies, we show that 16-state models that take base pairing of stems into account are significantly better than simpler, 4-state, single-nucleotide models. One of these 16-state models has been applied to the phylogeny of the Bilateria using small subunit rRNA (SSU) sequences. Our optimal tree largely echoes previous results based on SSU in particular supporting the tripartite Bilaterian tree of deuterostomes, lophotrochozoans, and ecdysozoans. There are also a number of differences, however, perhaps most important of which is the observation of a clade consisting of the gastrotrichs plus platyheminthes that is basal to all other lophotrochozoan taxa. Use of 16-state models also appears to reduce the Bayesian support given to certain biologically improbable groups found using standard 4-state models.
Key Words: ribosomal RNA phylogeny Bilateria secondary structure maximum likelihood Bayesian analysis
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
An important aspect of rRNA genes that has largely been ignored by zoologists when reconstructing bilaterian phylogenies is that the functional RNAs transcribed from these genes have a complex secondary structure mediated by base pairing between sometimes distant regions of the rRNA molecule. By accounting for this secondary structure in the models of evolution used in phylogeny reconstruction, we might hope to improve the reliability of the bilaterian tree.
rRNAs consist of base paired stem and unpaired loop regions. The pairing within the stems involves the Watson-Crick A:U, G:C pairs and the noncanonical G:U pair; other pairings exist, but they are rare enough to be disregarded in the current context. The constraints arising from the need to conserve secondary structure have important effects on the evolution of the stem sequences that differ from those of the loop regions. In essence, because there is selective pressure for the maintenance of rRNA secondary structure potential substitutions affecting one half of a pair of stem nucleotides have a different probability of fixation compared to the equivalent nucleotide in a loop. More specifically, in stem regions, changes from a base paired state to an unpaired state tend to be strongly selected against. Moreover, not all changes between sets of pairing bases are equal. Changes between A:U and G:U and between G:U and G:C each involve a single change and are therefore common. Changing between A:U and G:C or between U:A and C:G can also relatively easily occur involving two changes with the stable transitional state G:U or U:G. Changing from any of A:U, G:U, and G:C to their counterparts U:A, U:G, or C:G on the other hand necessitates either two simultaneous changes which is very unlikely or passage via an unpaired transitional state which is presumably maladapative. Such changes are correspondingly much rarer.
Stem models have been around for a number of years, and there have been a number of descriptions of the use of such models in phylogeny reconstruction (Schöniger and von Haeseler 1994; Muse 1995; Rzhetsky 1995; Tillier and Collins 1995; Otsuka, Terai, and Nakano 1999; Savill, Hoyle, and Higgs 2001; Jow et al. 2002; Hudelot et al. 2003; Smith, Lui, and Tillier 2004). Considering the widely accepted view that the use of more accurate models of sequence evolution should lead to more accurate inference of phylogeny based on those sequences, it seems surprising that the vast majority of published studies of rRNA sequences have not taken into account the differences expected between the evolution of stems and of loops. Our particular interest is in incorporating knowledge of secondary structure in a phylogenetic analysis of the Bilateria, something that has been attempted in the past but without great success. Early in the history of animal SSU phylogenetics, Patterson (1989) recognized the interdependence of nucleotides within stems and therefore employed a scheme which weighted stem nucleotides at half the weight of the loop positions. However, this does not take into account the different selective pressures on stem positions, only their nonindependence. More recently, Otsuka and Sugaya (2003) considered the mitochondrial LSU genes of various animals and, taking stem pairs into account, derived genetic distances between pairs of taxa. These distances were used for tree construction using the Neighbor Joining and unweighted pair group method with arithmetic mean (UPGMA) methods but gave trees that contradicted current understanding of animal phylogeny: Otsuka and Sugaya did not find a monophyletic grouping of lophotrochozans, the mollusks and annelids are grouped with arthropods rather than with brachiopods and platyhelminthes, and the hemichordates group within the chordates rather than with echinoderms. In any case it is clear that a likelihood-based tree estimation method would be preferable to Otsuka and Sugaya's distance-based approach (Schöniger and von Haeseler 1994; Muse 1995; Rzhetsky 1995). The ideal approach must be to model the different evolution of stems and loops within a likelihood framework.
Previous likelihood-based studies have already suggested the superiority of approaches that take into account base pair correlation in RNA stems over methods assuming complete evolutionary independence of stem nucleotides. Using a standard likelihood ratio test, Muse (1995) showed that the addition of an extra pairing-parameter to standard DNA models significantly increased the fit with real data sets. Similarly, Schöniger and von Haeseler (1999) rejected the standard Hasegawa, Kishino, and Yano (HKY) model in favor of a corresponding doublet model with a Goldman-Cox test. Savill et al. (2001) compared a number of doublet models and show that models that allow for double substitutions are preferred.
Here, a permutation test (Felsenstein 2003) is used to demonstrate the presence of altered evolutionary patterns in stems regions compared to loops and to provide further evidence that base paired models are more appropriate than standard, single-nucleotide models. We use the best of these 16-state models to estimate the phylogeny of 72 bilaterians using and Markov Chain Monte Carlo (MCMC) Bayesian search procedure.
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Test Tree
Our initial approach was to evaluate the different models of evolution by comparing their likelihood maximized on one specific test tree. For this model evaluation, a data set of 23 bilaterian SSU sequences was used. Unreliably aligned positions were excluded, leaving 1563 positions in the analysis. A total of 680 sites were paired stem nucleotides and the remaining 883 were loop-bulge nucleotides (alignment available on request). The tree used for all comparisons of models is shown in figure 1. Ideally, one would optimize the topology along with all other parameters when performing the permutation tests, but this is computationally impossible. In spite of this we believe our results to be robust for two reasons. First, our tree is very likely to be approximately correct; it makes excellent sense biologically, and this topology or a very close variant of it is found by all of the models used in an MCMC Bayesian search (results not shown). Second, we have repeated the same permutation tests using a different and probably slightly suboptimal topology (we moved Branchiostoma to be the sister group of the echinoderm-hemichordate-Xenoturbella clade) and got essentially identical results (not shown). We subsequently use the best model identified in this way under a Bayesian framework.
|
Permutation Test of 2xGTR Model
We have used a permutation test to compare models rather than the more usual parametric bootstrap. The parametric bootstrap tests the null hypothesis that two nested models (with differing numbers of parameters) are indistinguishable. Parametric bootstrapping uses multiple data sets simulated according to the simpler of the two models under comparison. In contrast, the permutation test approach (Felsenstein 2003), although clearly related to parametric bootstrapping, differs in that (1) the null hypothesis is that the partitioning of characters to different models has no effect and (2) the multiple data sets derive from a randomization of the initial data rather than from simulation. In short, we are asking whether partitioning characters according to our more complex model (e.g., considering stem nucleotides separate from loop nucleotides) is distinguishable, in terms of lnLikelihood, from a random partitioning of these characters. If it is and if the likelihood is higher, then we conclude that the more complex model is preferable to the more simple model.
The first test is whether stem nucleotides and loop nucleotides come from a single population (i.e., they are evolving according to a single GTR modelthe simpler model) or whether they come from different populations (and are therefore evolving according to different GTRsthe more complex model). We calculate the lnLikelihood of the more complex model which corresponds to the data set correctly divided into stem and loops (real data set) using separate GTR models for stem and loop. This is compared to the lnLikelihoods calculated under the same dual model using 100 permutated samples in which the same nucleotides are randomly assigned to pseudo stem and loop classes of the same size as the real ones and therefore actually correspond to the simpler model. If the real lnLikelihood is outside the distribution of 95% of the permutated lnLikelihoods, then the stems and loops are inferred to be evolving according to different models (come from different populations) with 95% confidence. An increase in lnLikelihood seen with twin models over a single model implies that we are justified in assigning stems and loops to separate models in our phylogenetic analyses. To simplify plotting the lnLikelihoods we subtracted the lnLikelihoods of the single GTR from the lnLikelihoods of the double GTRs (2xGTR) for unpermutated and permutated data. The likelihoods were calculated using Phase on the fixed topology shown in figure 1. Branch lengths were also reestimated.
Permutation Tests of Models RNA16A, RNA16HKY85, and RNA16GTR
This permutation test approach was also taken when considering conditions RNA16A, RNA16HKY85, and RNA16GTR. Here, we are asking whether the stem nucleotides evolve according to models in which one side of a pair constrains the evolution of the other side of a pair (i.e., a 16-state, stem-pair modelcomplex) as opposed to a model in which this is not true and in which there is no correlation (equivalent to a four-state modelsimple). For each of the three 16-state models we calculate the lnLikelihood when the model is fitted to the correct stem pairs. This is compared to the lnLikelihoods, calculated under the same model, of 100 permutated samples in which the nucleotides are randomly assigned to the same number of stem pairs; the randomization ensures that there is no correlation between members of pairs. If the lnLikelihood of the correctly paired data set is outside the distribution of 95% of the lnLikelihoods of the permutated data, then we can say with 95% confidence that the stems do not evolve according to the uncorrelated model. An increase in lnLikelihood seen with 16-state models over a 4-state model implies that we would be justified in the use of 16-state models in our phylogenetic analyses.
In order to permutate the data we randomly reordered columns of stem nucleotides while maintaining the set of parentheses indicating the secondary structure. To simplify plotting the lnLikelihoods, we subtracted the lnLikelihood of the GTR from the lnLikelihoods of the 16-state models (models RNA16A, RNA16HKY85, and RNA16GTR) for unpermutated and permutated stem data.
Phylogeny of the Bilateria
The Bayesian framework using the software MCMCPhase was used to estimate the phylogeny of a sample of 72 taxa from within the Bilateria using GTR for the loop regions and RNA16A for the stem regions. Unreliably aligned positions were excluded leaving 1510 positions in the analysis. A total of 634 sites were paired stem nucleotides and the remaining 876 were loop-bulge nucleotides (alignment available on request). Random starting trees were used, and the parameters of the substitution models were treated as random variables and estimated during the analysis. A total of 1,000,000 initial generations were run which was seen to be more than sufficient for the lnLikelihood estimates to plateau. A total of 1,000,000 additional generations were run with sampling of the tree including branch lengths every 1,000th generation. The analyses were repeated four times with indistinguishable results, supporting the notion that a plateau had been reached. MrBayes "sumt" command was used to generate a consensus tree with branch lengths and Bayesian clade support values from these samples. For comparison, the same analyses were run using a single GTR for both stem and loop positions.
![]() |
Results and Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
There is very little improvement in lnLikelihood using model RNA16GTR over model RNA16HKY85 (difference = 3.045). As models RNA16HKY85 and RNA16GTR are nested, we are able to compare their lnLikelihoods using a chi-square test with 4 degrees of freedom for the four additional parameters in model RNA16GTR. The chi-square test shows that the improvement in likelihood of model RNA16GTR versus model RNA16HKY85 (using 2 x difference = 6.08 as the test statistic) is nonsignificant.
Superiority of Models that Account for Stem Pairing
We are unaware of a previous use of a permutation approach for studies of stem models. Using this approach we have been able to demonstrate for the first time the superiority of models of nucleotide evolution that divide rRNA sequences into sets of stem and loop nucleotides. Furthermore, we have used the permutation approach to confirm results (e.g., Schöniger and von Haeseler 1999) showing that 16-state, stem-pair models for stem regions of rRNA genes are appropriate and far superior to 4-state, single-nucleotide models. Models RNA16HKY85 and RNA16GTR were included for completeness because they are included in the popular MrBayes software. Our results show that, although they are almost certainly inferior to Model RNA16A they improve hugely over four-state models and their use should be encouraged.
Phylogeny of the Bilateria
Our Bayesian tree (fig. 4) derived using a combination of 4-state GTR model for loops and 16-state RNA16A model for stems accords well with current ideas of bilaterian phylogeny (Adoutte et al. 2000). We see the three separate major clades of the Deuterostomia, Ecdysozoa, and Lophotrochozoa, although the tree was rooted between Deuterostomia and Protostomia so that the relative topology of these three branches has not been tested. Within the Deuterostomia we recover the expected division between chordates and a clade containing the Ambulacraria (echinoderms plus hemichordates) and the worm Xenoturbella as their sister group. The 99% Bayesian support for the exclusion of Xenoturbella from within the Ambulacraria using this superior model of evolution adds to our confidence in this result (Bourlat et al. 2003), although it must be remembered that Bayesian support generally gives higher values than an equivalent nonparametric bootstrap.
|
Within the Lophotrochozoa, although certain groups are well resolved (gastrotrichs, platyhelminths, bivalve mollusks, gastropod mollusks, polyplacophoran mollusks, pogonophorans, nemerteans, and rotifers), for the most part, the relationships between these and other taxa are not well resolved. We do find weak support for a monophyletic assemblage of annelids that includes the sipunculid (now widely considered a derived annelid) and strong support for the groupings of rotifer plus cycliophoran plus entoprocts and gastrotrich plus platyhelminths, but other branches remain poorly supported. The lack of resolution within the Lophotrochozoa is in common with other analyses that have used SSU (Mallatt and Winchell 2002).
Comparison of the single GTR + RNA16A tree and the GTR tree (not shown) shows a number of differences. Our most significant novel finding using GTR + RNA16A is the observation of two highly supported groups within the Lophotrochozoa; the gastrotrichs plus platyhelminthes in one and all other lophotrochozoan phyla (including the rotifer-cycliophoran clade) in the other. Analyses with a single GTR model group the rotifer-cycliophoran clade with the gastrotrich-platyhelminth clade rather than with the other lophotrochozoans. This difference deserves to be examined further considering the demonstrated superiority of the evolutionary model we have used. The single GTR model also fails to group entoprocts with cycliophora + rotifers and supports a number of unlikely associationsentoprocts with nemerteans, phoronid with annelids, and pogonophorans with gastropod mollusks. In general, the GTR + RNA16A model seems to be more agnostic in its placement of these taxa, resulting in a less well-resolved but more accurate tree. Overall, our method, although certainly not perfect with this data set, seems to have found a more credible tree than the simpler alternative.
![]() |
Conclusions |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Although clearly desirable in terms of a measurably improved approximation to the true evolution of rRNA sequences, incorporation of secondary structure information is not a panacea. Our consideration of the phylogeny of the Bilateria using SSU sequences shows that much of our tree does not differ from previous analyses using standard four-state models. Additionally, areas of the tree, particularly within the Lophotrochozoa, are still not resolved. rRNA genes are, nevertheless, the most widely sequenced genes across all living organismsmore than 20,000 SSU sequences in the European rRNA databaseand such a significant improvement in using them for phylogenetic analyses may be hoped to further our understanding of the tree of life. We believe that our tree represents the best-justified SSU-based phylogeny of the Bilateria yet produced.
![]() |
Appendix |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Extracting Secondary Structure Information Using Xstem
Xstem parses the DCSE alignment format file and identifies positions that make up the two halves of each stem pair in each sequence. The file can be output in formats readable by MrBayes, Phase, and in the 20-state character set proposed by Smith, Lui, and Tillier (2004).
Adding Taxa and Altering Alignments Using Ystem
Ystem allows conversion of the DCSE alignment into a Nexus file that can then be imported into programs such as MacClade where new taxa can be added and the alignment altered. Ystem allows reconversion of the altered nexus file to the DCSE format, reintroducing the structural information from the original DCSE file while maintaining the new alignment and additional taxa.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Adoutte, A., G. Balavoine, N. Lartillot, O. Lespinet, B. Prud'homme, and R. de Rosa. 2000. The new animal phylogeny: reliability and implications. Proc. Natl. Acad. Sci. USA 97:44534456.
Aguinaldo, A. A., J. M. Turbeville, L. S. Linford, M. C. Rivera, J. R. Garey, R. A. Raff, and J. A. Lake. 1997. Evidence for a clade of nematodes, arthropods and other moulting animals. Nature 387:489493.[CrossRef][ISI][Medline]
Boore, J. L., D. V. Lavrov, and W. M. Brown. 1998. Gene translocation links insects and crustaceans. Nature 392:667668.[CrossRef][ISI][Medline]
Bourlat, S. J., C. Nielsen, A. E. Lockyer, D. T. J. Littlewood, and M. J. Telford. 2003. Xenoturbella is a deuterostome that eats molluscs. Nature 424:925928.[CrossRef][ISI][Medline]
Cook, C. E., M. L. Smith, M. J. Telford, A. Bastianello, and M. E. Akam. 2001. Hox genes and the phylogeny of the arthropods. Curr. Biol. 11:759763.[CrossRef][ISI][Medline]
Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401410.[ISI]
. 2003. Inferring phylogenies. Sinauer Associates Inc, Sunderland, Mass.
Friedrich, M., and D. Tautz. 1995. rDNA phylogeny of the major extant arthropod classes and the evolution of myriapods. Nature 376:165167.[CrossRef][ISI][Medline]
Haase, A., M. Stern, K. Wächtler, and G. Bicker, 2001. A tissue-specific marker of Ecdysozoa. Dev. Genes Evol. 211:428433.[CrossRef][ISI][Medline]
Gutell, R. R., J. J. Cannone, Z. Shang, Y. Du, and M. J. Serra. 2000. A story: unpaired adenosine bases in ribosomal RNAs. J. Mol. Biol. 304:335354.[CrossRef][ISI][Medline]
Hudelot, C., V. Gowri-Shankar, H. Jow, M. Rattray, and P. G. Higgs. 2003. RNA-based phylogenetic methods: application to mammalian mitochondrial RNA sequences. Mol. Phyl. Evol. 28:241252.[CrossRef][ISI][Medline]
Huelsenbeck, J. P., and F. R. Ronquist. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754755.
Jow, H., C. Hudelot, M. Rattray, and P. G. Higgs. 2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol. 19:15911601.
Leitner, T., S. Kumar, and J. Albert. 1997. Tempo and mode of nucleotide substitutions in gag and env gene fragments in human immunodeficiency virus type 1 populations with a known transmission history. J. Virol. 71:47614770.[Abstract]
Mallatt, J., and C. J. Winchell. 2002. Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes. Mol. Biol. Evol. 19:289301.
Muse, S. V. 1995. Evolutionary analysis of DNA sequences subject to constraints on secondary structure. Genetics 139:14291439.
Otsuka, J., and N. Sugaya. 2003. Advanced formulation of base pair changes in the stem regions of ribosomal RNAs; its application to mitochondrial rRNAs for resolving the phylogeny of animals. J. Theor. Biol. 222:447460.[ISI][Medline]
Otsuka, J., G. Terai, and T. Nakano. 1999. Phylogeny of organisms investigated by the base-pair changes in the stem regions of small and large ribosomal subunit RNAs. J. Mol. Evol. 48:218235.[ISI][Medline]
Posada, D., and K. A. Crandall. 2001. Selecting the best-fit model of nucleotide substitution. Syst. Biol. 500:580601.
Patterson, C. 1989. Phylogenetic relations of major groups: conclusions and prospects. Pp. 471488 in B. Fernholm and H. Jornvall, eds. The hierarchy of life. Elsevier, Amsterdam.
Rzhetsky, A. 1995. Estimating substitution rates in ribosomal RNA genes. Genetics 141:771783.
Savill, N. J., D. C. Hoyle, and P. G. Higgs. 2001. RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum-likelihood methods. Genetics 157:399411.
Schmidt, H. A., K. Strimmer, M. Vingron, and A. von Haeseler. 2002. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502504.
Schöniger, M., and A. von Haeseler. 1994. A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phyl. Evol. 3:240247.[CrossRef][Medline]
. 1999. Toward assigning helical regions in alignments of ribosomal RNA and testing the appropriateness of evolutionary models. J. Mol. Evol. 49:691698.[ISI][Medline]
Smith, A. D., T. W. H. Lui, and E. R. M. Tillier. 2004. Emprical models for substitution in ribosomal RNA. Mol. Biol. Evol. 21:419427.
Sullivan, J., and D. L. Swofford 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenies. J. Mamm. Evol. 4:7786.
Tillier, E. R. M., and R. A. Collins. 1995. Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol. Biol. Evol. 12:715.
Vawter, L., and W. M. Brown. 1993. Rates and patterns of base change in the small subunit ribosomal RNA gene. Genetics 134:597608.
Wolf, Y. I., I. B. Rogozin, and E.V. Koonin. 2004. Coelomata and not ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res. 14:2936.
|