Department of Medical Biophysics, University of Toronto, and Ontario Cancer Institute, University Health Network, Toronto, Ontario, Canada
Correspondence: E-mail: e.tillier{at}utoronto.ca.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
We have developed a method to obtain substitution rate matrices empirically from RNA alignments that include structural information in the form of base pairs. Our data consisted of alignments from the European Ribosomal RNA Database of Bacterial and Eukaryotic Small Subunit and Large Subunit Ribosomal RNA ( Wuyts et al. 2001. Nucleic Acids Res. 29:175177; Wuyts et al. 2002. Nucleic Acids Res. 30:183185). Using secondary structural information, we converted each sequence in the alignments into a sequence over a 20-symbol code: one symbol for each of the four individual bases, and one symbol for each of the 16 ordered pairs. Substitutions in the coded sequences are defined in the natural way, as observed changes between two sequences at any particular site. For given ranges (windows) of sequence divergence, we obtained substitution frequency matrices for the coded sequences. Using a technique originally developed for modeling amino acid substitutions ( Veerassamy, Smith, and Tillier. 2003. J. Comput. Biol. 10:9971010), we were able to estimate the actual evolutionary distance for each window. The actual evolutionary distances were used to derive instantaneous rate matrices, and from these we selected a universal rate matrix.
The universal rate matrices were incorporated into the Phylip Software package ( Felsenstein 2002. http://evolution.genetics.washington.edu/phylip.html), and we analyzed the ribosomal RNA alignments using both distance and maximum likelihood methods. The empirical substitution models performed well on simulated data, and produced reasonable evolutionary trees for 16S ribosomal RNA sequences from sequenced Bacterial genomes.
Empirical models have the advantage of being easily implemented, and the fact that the code consists of 20 symbols makes the models easily incorporated into existing programs for protein sequence analysis. In addition, the models are useful for simulating the evolution of RNA sequence and structure simultaneously.
Key Words: rRNA evolution empirical substitution model RNA structure simulation
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Probabilistic methods of phylogenetic analysis, such as Bayesian phylogeny (Yang and Rannala 1997), maximum likelihood (Felsenstein 1981), and modified distance measures (Kimura 1981), make use of substitution models that define probabilities for the transition from one base to another. When base substitution is adequately modeled using a low number of states (or under restricting assumptions), parametric models are feasible. Simultaneously modeling the evolution of sequence and structure has been done with several parametric models, but this approach has been shown to be accurate only when large numbers of parameters are used (Savill, Hoyle, and Higgs 2001). In the common case where a reasonable description of the substitution process requires a large number of parameters (e.g., for models of amino acid substitution), empirical models are often used. For empirical models, the relative substitution probabilities are derived from databases of alignments; the PAM model of amino acid substitution is a well-known example (Dayhoff, Schwartz, and Orcutt 1978). The estimation of empirical substitution rates in rRNA has been attempted previously (Tillier and Collins 1998; Higgs 2000).
Databases of rRNA sequences are now sufficiently large (with over 50,000 sequences represented) to form the basis of an empirical model of substitution for rRNA. Here we present an empirical model of substitution for both individual and paired bases in structurally annotated rRNA. Our model is based on sequence alignments from RNA databases and conserved reference secondary structures. The method we used to obtain substitution matrices from rRNA data is an extension of a procedure we recently applied to derive models of amino acid substitution (Veerassamy, Smith, and Tillier 2003). We accomplish this by converting the rRNA sequences into sequences over a 20-symbol code. Although the 20-symbol code reflects the number of individual bases and ordered pairs of bases, the size of the alphabet is fortuitous: the resulting rRNA models have 20 x 20 substitution rate matrices, and can be used in many applications designed to analyze the evolution of proteins using 20 x 20 amino acid substitution rate matrices. Such programs were modified to allow the analysis of rRNA sequences with the same methods.
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Sequence Coding
We converted each rRNA sequence into a sequence over a 20-symbol alphabet in order to treat single-stranded and double-stranded bases uniformly. The code is given in figure 1a, and an example conversion is shown in figure 1b. The code uses four symbols to represent the single-stranded bases: the unpaired bases or bases paired with a gap (which are also single stranded) and 16 symbols to represent ordered pairs of bases, as given by the reference secondary structure. Bases that were not A, C, G, U, or were coded as unknowns (X), as were pairs having the form (,), and unknowns were ignored in subsequent analyses. The symbol for each pair appears exactly once in the converted sequence, and the symbols are ordered according to the position of the 3' base in a pair.
|
Substitution Frequency Matrices and Observed Distances
From the alignments, each pair of converted sequences was compared, and their observed sequence divergence (frequency of substitution) was recorded. Those sites containing X were disregarded. Sequence pairs were disregarded altogether when there were fewer than 200 comparable sites between them. This was done to prevent the artificial situation of observing very high divergence between two sequences purely because very few sites were comparable.
A series of overlapping windows representing the full range of sequence divergence were defined. The first window included all sequence pairs with divergence less than or equal to the fixed window size w. The ith window included all pairwise sequence comparisons with a relative divergence between i and w + i. For each alignment, each type of substitution was counted for all of the sequence pairs in each window. In this manner, a count matrix was obtained for each window, and these were converted into frequency matrices. Because direction of substitution is not known, each substitution was counted for both directions, and the diagonal entries in the matrix were doubled. Using this method, consecutive windows are expected to contain a nearly identical set of pairs, and windows corresponding to sufficiently high divergence are expected to be empty. The sliding-window strategy allowed us to ensure that we had observations from many different levels of sequence divergence. By manipulating the window size we were able to control the amount of data in each window. The sizes for the sliding windows were selected according to the quality of matrices produced for the sizes (see below).
Estimating Actual Evolutionary Distances
Each count matrix was converted to a frequency matrix, denoted by F. From the frequency matrices, we calculated the vector of observed frequencies for each code symbol
|
|
|
|
|
We estimated the derivatives of the observed distances with respect to actual distance numerically, using the five-point formula (Burden and Faires 1985),
|
|
|
|
|
|
|
|
|
|
This scheme can provide up to n(1 - w) + 1 instantaneous rate matrices, but many fewer were found to be valid. Both the negative eigenvalues in mutability matrices and the negative off-diagonal rates in instantaneous rate matrices have been found to be associated with insufficient observations in the count matrix and/or high divergence.
From the set of valid instantaneous rate matrices, one was selected as the universal rate matrix. The selection criteria were based on how well an instantaneous matrix could predict the observed mutability matrices, given the estimated actual evolutionary distance. For each valid instantaneous rate matrix A, we evaluated the sum of relative residuals. This is calculated as the difference in the norm (largest singular value) between the exponential of the estimated log of the matrix and the original matrix:
|
|
We simulated sequences using a 20-species tree (see fig. 5), and the resulting sequences were analyzed with our own programs, and programs from the PHYLIP package (Felsenstein 2002). We used both the dnaml program and the combination of the dnadist and neighbor programs from PHYLIP on sequences obtained by converting the simulated sequences (over the 20-symbol code) back into the four bases of RNA (using our program called 20to4). We also used the programs rrnadist and rrnaml, which are modified versions of the PHYLIP protdist and protml that incorporate our 16S rRNA matrices. Finally, our own distance program rnadist, which implements the parametric model OTRNA from Tillier and Collins (1998), was also used.
|
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
Window sizes were selected to produce the greatest number of valid matrices. For the Bacterial, Mitochondrial, and Archaeal alignments, a window size of 10% of the alignment length was found to produce very good results, and deviation from this size did not produce significantly more valid matrices. For Eukaryote (SSU and LSU) alignments, increasing the window size resulted in improvements up to 20%. Because we wish to cover the largest range of actual divergences, the window size was selected to balance the sometimes conflicting requirements of producing instantaneous rate matrices corresponding to large divergence, and also to keep the residuals low.
Table 1 shows the number of windows in each alignment from which observations were made. The total number of windows is the number of count matrices with non-zero row sums. All of these were used in our analysis when testing the universal rate matrices according to predictive ability. Not all of these could be used when fitting the curves needed to estimate the actual distances. Those windows for which the mutability matrix had negative eigenvalues could not be used in Equation 8. We also required that the numerically estimated derivatives from Equation 8 be non-negative. The mutability matrices corresponding to all windows fitting this criterion were used to fit the curve given by equation 11. Figure 2 gives the curves for Bacterial SSU, Mitochondrial SSU, and Eukaryote SSU and LSU, where the data points used to estimate the curve are plotted, along with the resulting curve. The fitted curve is shown over the range of divergence for non-empty windows. Figure 3 shows the resulting distance correction for the different datasets.
The valid windows are those that meet the requirements for fitting the curve; they also produced a valid instantaneous rate matrix with no negative values off the diagonal. Table 1 shows that the Eukaryote alignments produced many valid windows, whereas the mitochondrial alignment produced very few.
The most important statistic in table 1 is the average relative residual of the observed mutability matrices with the mutability matrices predicted using the universal rate matrix. The relative residual measures how well the selected universal rate matrix can predict the observed data (the mutability matrices). We calculated residuals predicting mutability matrices from all non-empty windows; we also predicted mutability matrices from those windows with a valid instantaneous rate matrix. When predicting mutability matrices for those windows producing valid instantaneous matrices, we observed that the universal rate matrix had residuals less than 0.1. The residual for predicting Bacterial matrices was based on 144 comparisons. For the Mitochondrial SSU matrix, the residual with respect to valid matrices was 0.048, but it was averaged over only 19 matrices, each corresponding to a window close to the one used to obtain the universal rate matrix. The Eukaryotic alignments produced a large number of valid matrices.
Simulations
Because of the unusual nature of our parametric substitution model for recoded RNA sequences, and because we would be using protein-analysis programs for the analysis of rRNA data, we first established the feasibility of our approach by means of simulated sequences. The Bacterial SSU rRNA matrix was incorporated into PSeq-Gen (Grassly, Adachi, and Rambaut 1997) and used to generate 100 simulated data sets which we then analyzed with several methods; the results are given in figure 5. The values on the branches are the number of simulations in which the correct tree (given in figure 5f) was found. The topology and branch lengths of the tree in figure 5f were chosen to reflect a range for the degree of difficulty in correctly identifying the branching by standard methods. The rRNA methods (rrnaml [fig. 5b] and rrnadist [fig. 5a] using our Bacterial SSU rRNA matrix with default parameters) did well in recovering the trees. This is not surprising because these methods applied the same model that was employed to generate the sequences. Nevertheless, this result means that such models are useful for the simulation of rRNA evolution and for phylogenetic analysis of such sequences. The simulated sequences were also analyzed with the distances calculated according to the OTRNA model (figure 5c), which is a parametric model for RNA evolution (Tillier and Collins 1998). Although rnadist does not do as well as the rrnadist (fig. 5a), it does perform better than the standard DNA methods dnadist (fig. 5e) and dnaml (fig. 5d) on the same data. The DNA methods were applied after reverse coding the sequences back to the four-base RNA code, so that the secondary structure information was lost. The observed reduction in the methods' ability to obtain the correct tree shows that the consideration of structural information is important for obtaining the correct phylogeny.
Bacterial Phylogeny
Because the rrnadist and rrnaml programs were successful in simulations, we were confident in applying the empirical models to actual rRNA data. We chose the sequenced Bacterial genomes because of our interest in these genomes, and because their true phylogeny is unknown. The resulting rrnaml tree is compared to the dnaml tree in figure 6. The trees are remarkably similar, but with generally lower bootstrap values for the rrnaml tree. The distance trees obtained with dnadist, rnadist, and rrnadist are also not significantly different from one another (data not shown).
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The assumptions of our method for deriving the empirical matrix have largely been addressed by Veerassamy, Smith, and Tillier (2003). Some minor methodological changes were made to apply the method to the coded rRNA sequences. Unlike the amino acid substitution frequency matrices from the BLOCKS database (Henikoff and Henikoff 1992), the rRNA substitution frequency matrices often had negative eigenvalues for higher sequence divergence levels and thus could not be used for estimating a rate matrix. Negative eigenvalues in mutability matrices are often associated with an insufficient number of observations in the corresponding count matrix. Instead of using the clustering approach of Blosum (Henikoff and Henikoff 1992), we used a sliding-window strategy. The purpose of the sliding window was to obtain a large number of count matrices, corresponding to many different levels of sequence divergence, and to ensure that each count matrix contained a high number of counts.
Negative eigenvalues are also observed when the frequency matrix is too far from identity (Devauchelle et al. 2002). Biologically speaking, this is the situation of too many overlapping substitutions to determine a set of positive substitution rates that produced the observed data. This phenomenon has been observed before and accounts for the difficulty in obtaining distance estimates from rRNA (Hoyle and Higgs 2003). Negative eigenvalues occur earlier (i.e., at lower levels of observed divergence) in the Bacterial data set than in the Eukaryotic data set. A likely cause is higher rates of simultaneous compensatory base substitutions in Bacteria, and stronger conservation of secondary structure in Bacteria.
Certain assumptions underlying our model are independent of our methods. The most evident assumption is that the sequence data and the alignments used are correct. The huge alignments from the rRNA databases represent solutions to an extremely difficult global multiple sequence alignment problem, and they cannot realistically be expected to be optimal with respect to any measure of alignment quality. The nature of our empirical model requires that very large multiple alignments be used, so we must assume that the alignment given is sufficiently accurate to form a basis of inference.
We also analyzed alignments obtained from the RDPII database (Maidak et al. 2001) and obtained similar results (data not shown). We chose to focus on the European rRNA database, so that we could refine our methods for a specific database. The European rRNA database was more complete in terms of 23S sequences. Also, in the RDPII database, the crucial distinction between deleted nucleotides and unknown (unsequenced) nucleotides was not always clear.
Other assumptions had to do with the correctness of the reference structures, and how well conserved is the secondary structure in any of the alignments used. The structures we used were obtained from the CRW database (Cannone et al. 2002). The structures were largely obtained by comparative sequence analysis (Gutell, Lee, and Cannone 2002), which also assumes conserved structures throughout sequences. Although the structure of rRNA is quite conserved, it is not immutable. Particularly the Eukaryote sequences have shown changes in structure (Wuyts, Van de Peer, and Wachter 2001). These variations are reflected in the universal rate matrix we derived for them, which showed some significant substitution rates between paired and unpaired bases (see fig. 4). These rates reflect changes in structure or problems with the alignments. The matrices will be refined by considering more reference secondary structures in an alignment. The same methods can easily be applied in the extreme case where each sequence has a known secondary structure.
The idea that the base pairs rather than the bases in the sequences are the independently evolving units in an structured RNA sequence has been used before (Tillier and Collins 1995), and sophisticated parametric models considering up to the 16 base combinations possible in a pairing have been developed. The disadvantages of these models are that they can be slow and difficult to implement, although some such models have been implemented in a Bayesian phylogenetic framework (PHASE) (Jow et al. 2002; Hudelot et al. 2003), and for distance analyses (rrnadist). It has also been shown that RNA models models require many free parameters to be accurate (Savill, Hoyle, and Higgs 2001) because of differences in the rates of substitution between the base pairs. However, the models are usually applied only to rRNA molecules for phylogenetic analysis. The large size of the rRNA database makes it possible to empirically derive the substitution rates between the base pairs, and including the single-stranded bases in the model allowed us to easily implement the combined analysis of single-stranded and double-stranded regions. This task was made even easier by the fact that 20 x 20 empirical matrices have commonly been implemented for the analysis of protein sequences.
A recoding approach could be applied to other RNA molecules (tRNAs for example), but different recoding alphabets might also be used on other types of sequences. For example, it is conceivable to derive an empirical model for codon evolution (a 64 x 64 matrix). The method used to then derive a matrix is independent of the recoding. Our method for deriving the empirical matrices (from Veerassamy, Smith, and Tillier 2003), requires large amounts of data (as does the approach of Muller and Vingron [2000]). Other approaches for deriving matrices could be used in more specific cases with smaller datasets (Arvestad and Bruno 1997; Whelan and Goldman 2001; Devauchelle et al. 2002).
The phylogenetic trees obtained with the new method show reduced levels of statistical support, as expected, because of the reduced number of independently evolving characters considered in the recoded sequences (Tillier and Collins 1995). The new method therefore does not allow better resolution of the tree; rather it gives a more accurate estimate of the (generally low) confidence in the branch estimates. The branching order is not significantly different from the one obtained with the standard DNA approaches, however. This similarity is reassuring in two ways, the first being that the standard DNA methods on this data set show some robustness with respect to the violation of the assumption of independence of sites, and second because it shows that the methods using our empirical models do at least as well as the parametric approaches.
We have shown that empirical matrices can be used for the study of rRNA evolution with both simulations and with an example phylogenetic analysis of Bacterial 16S rRNA sequences. The ability to analyze and simulate rRNA sequence evolution including secondary structure constraints should be very useful in other studies.
![]() |
Supplementary Material |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Arvestad, L., and W. J. Bruno. 1997. Estimation of reversible substitution matrices from multiple pairs of sequences. J. Mol. Evol. 45:696-703.[ISI][Medline]
Burden, R. L., and J. D. Faires. 1985. Numerical analysis, 3rd edition. PWS Publishers, Boston.
Cannone, J. J., S. Subramanian, and M. N. Schnare, et al. (14 co-authors). 2002. The Comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3:1-2 http://www.rna.icmb.utexas.edu/.[CrossRef][Medline]
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. Pp. 345352 in (M. O. Dayhoff, ed.) Atlas of protein sequence and structure, vol. 5. National Biomedical Research Foundation.
Devauchelle, C., A. Grossmann, A. Hnaut, M. Holschneider, M. Monnerot, J. L. Risler, and B. Torrsani. 2002. Rate matrices for analyzing large families of protein sequences. J. Comput. Biol. 8:381-399.[CrossRef][ISI]
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[ISI][Medline]
Felsenstein, J. 2002. PHYLIP (phylogeny inference package) version 3.6.3. http://evolution.genetics.washington.edu/phylip.html.
Grassly, N. C., J. Adachi, and A. Rambaut. 1997. PSeq-Gen: an application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:559-560.[Medline]
Gutell, R. R., J. C. Lee, and J. J. Cannone. 2002. The accuracy of ribosomal RNA comparative structure models. Curr. Opin. Struct. Biol. 12:301-310.[CrossRef][ISI][Medline]
Henikoff, S., and J. Henikoff. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:10915-10919.[Abstract]
Higgs, P. G. 2000. RNA Secondary structure: physical and computational aspects. Q. Rev. Biophys. 33:199-253.[CrossRef][ISI][Medline]
Hoyle, D. C., and P. G. Higgs. 2003. Factors affecting the errors in the estimation of evolutionary distances between sequences. Mol. Biol. Evol. 20:1-9.
Hudelot, C., V. Gowri-shankar, H. Jow, M. Rattray, and P. Higgs. 2003. RNA-based phylogenetic methods: application to mammalian mitochondrial RNA sequences. Mol. Phylogenet. Evol. 28:241-252.[CrossRef][ISI][Medline]
Jow, H., C. Hudelot, M. Rattray, and P. G. Higgs. 2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol. 19:1591-1601.
Kimura, M. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78:454-458.[Abstract]
Maidak, B. L., J. R. Cole, T. G. Lilburn, C. T. Parker, P. R. Saxman, R. J. Farris, G. M. Garrity, G. J. Olsen, T. M. Schmidt, and J. M. Tiedje. 2001. The RDP-II (Ribosomal Database Project). Nucleic Acids Res. 29:173-174.
Muller, T., and M. Vingron. 2000. Modeling amino acid replacement. J. Comput. Biol. 7:761-776.[CrossRef][ISI][Medline]
Savill, N. J., D. C. Hoyle, and P. G. Higgs. 2001. RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum-likelihood methods. Genetics 157:399-411.
Tillier, E. R. M., and R. A. Collins. 1995. Neighbor-Joining and maximum likelihood with RNA sequences: addressing the inter-dependence of sites. Mol. Biol. Evol. 12:7-15.
Tillier, E. R. M., and R. A. Collins., and. 1998. High apparent rate of simultaneous compensatory base-pair substitutions in ribosomal RNA. Genetics 148:1993-2002.
Veerassamy, S., A. Smith, and E. R. M. Tillier. 2003. A transition probability model for amino acid substitutions from BLOCKS. J. Comput. Biol. 10:997-1010.[CrossRef][ISI][Medline]
Whelan, S., and N. Goldman. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum likelihood approach. J. Mol. Evol. 18:691-699.
Wuyts, J., P. D. Rijk, Y. Van de Peer, T. Winkelmans, and R. D. Wachter. 2001. The European Large Subunit Ribosomal RNA Database. Nucleic Acids Res. 29:175-177.
Wuyts, J., Y. Van de Peer, and R. D. Wachter. 2001. Distribution of substitution rates and location of insertion sites in the tertiary structure of ribosomal RNA. Nucleic Acids Res. 29:5017-5028.
Wuyts, J., Y. Van de Peer, T. Winkelmans, and R. D. Watcher. 2002. The European Database on Small Subunit Ribosomal RNA. Nucleic Acids Res. 30:183-185.
Yang, Z., and B. Rannala. 1997. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo Method. Mol. Biol. Evol. 14:717-724.[Abstract]