* Bioinformatics Research Center, University of Aarhus, Aarhus, Denmark; Genome Analysis and Bioinformatics Group, Department of Statistics, University of Oxford, Oxford, England
Correspondence: E-mail: roald{at}birc.au.dk.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: RNA structure coding region overlapping information context-dependent evolution virus evolution
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In this article we present a model of the nucleotide substitution process in such coding regions with conserved RNA structure (hereinafter, CORS). The model can be applied to data where there is a priori knowledge of the protein-coding and RNA structural annotation. Our focus here is to estimate parameters that contain information about the evolutionary process. Other potential applications of the model include comparative RNA secondary structure prediction in coding regions and the estimation of RNA virus phylogenies between higher taxonomical units where the double evolutionary constraints upon CORS could potentially alleviate the often incurred problems of saturation (Zanotto et al. 1996).
The functional and structural interactions of the amino acids within a protein can create a variety of evolutionary dependencies between protein-coding nucleotides. Most stochastic models of nucleotide substitution for coding regions consider only the simplest of these, namely the context dependency in the evolutionary process among nucleotides within adjacent non-overlapping three-tuples (codons) introduced by the triplet nature of the genetic code (Goldman and Yang 1994; Muse and Gaut 1994). These models ignore other interactions and assume evolutionary independence between codons, which means that the transition probability between sequences can be factored into the product of the transition probabilities between codons and calculated with relative ease.
Only certain base pairs can form the stable chemical bonds needed to maintain an RNA structure. The conservation of structure therefore introduces long-range context dependencies in the evolutionary process between base-pairing nucleotides. Existing stochastic models of nucleotide substitution for regions with RNA structure incorporate long-range correlations by considering two-tuples of base-pairing nucleotides as independent units of evolution resulting in a similar factorization of the transition probability as described above [see Savill et al. (2001) for a description].
An evolutionary model of nucleotide substitution CORS must acknowledge that selection evaluates new mutations both in the context of the encoded protein and in the conserved RNA structure. As a consequence of the context dependencies described above, the evolutionary process of base-pairing two-tuples in stem regions can no longer be assumed to be independent of neighboring nucleotides, because these now make up the protein-coding context in which substitutions occur. The neighboring nucleotides may in turn base-pair with nucleotides from yet other codons, and thus expand the context dependency to include these (fig. 1). In this manner context dependency cascades throughout the structural regions and questions the computationally convenient assumption of evolutionary independence between short N-tuples of nucleotides.
|
An alternative solution is to reduce context dependencies to a level manageable by traditional phylogenetic models. The model presented here achieves this by replacing context dependencies within codons with codon positionspecific heterogeneity in the substitution process and is inspired by the previous work of Hein and Stovlbaek (1995) and Yang (1996). Thus the input to our analysis is an alignment of DNA or RNA sequences with multiple layers of annotation, which we use to define sets of nucleotide tuples considered to evolve via independent, but annotation-specific, substitution processes. As this presents a general procedure in the construction of what we refer to as context-reducing models of molecular evolution, we develop a general conceptual framework for its presentation. (See Siepel and Hausler (2003) for a different approach to context reduction.) The assumption of independence between N-tuples allows for the factorization of transition probabilities and subsequent application of the model to a large data set of full-length genome sequences from the hepatitis C virus with known RNA structural elements. Using this data set, we evaluate different components of the model and estimate evolutionary parameters.
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Components of Phylogenetic Models
The data of traditional phylogenetic analysis is an alignment of n homologous sequences. Let the alignment be represented by a matrix x of dimension n x L with entries belonging to the alphabet . The rows xj (1
j
n) of x correspond to the aligned sequences, and the entries of a column xi (1
i
L) correspond to homologous sequence symbols.
A standard assumption in phylogenetic analysis is that of evolutionary independence between short N-tuples of nucleotides. In the following we consider models that split the alignment into such N-tuples.
Let a phylogenetic model for independent alignment columns be given by a five-tuple of model components = {
, Q,
,
, ß}, where
is the sequence alphabet of size d, Q is the instantaneous rate matrix of dimension d x d,
is a vector of equilibrium frequencies of length d,
is a rooted binary tree topology, and ß is a vector of branch lengths. Phylogenetic models defined for N-tuples, instead of single columns, will have
replaced by
N and the dimensionality of Q and
adjusted accordingly.
, Q, and
define a continuous Markov process which is used to model the substitution process along the branches of the phylogenetic tree represented by
and ß. The likelihood of an N-tuple given a phylogenetic model can be calculated in time O(n|
|2N) by the dynamic programing algorithm of Felsenstein (1981), where |
|N is the size of the N-tuple alphabet.
Let xv be an N-tuple defined as the concatenation of the alignment columns specified by the N-long vector of indices v(vi {1, ... , L}). The columns of xv need not be direct neighbors in x. Let I be a set of index vectors defining a fragmentation of x into N-tuples. The likelihood of x given I is:
![]() | (1) |
Annotated Alignments
Alignments can be annotated with information on the structure or function of different regions. This information can be given in the form of m label sequences, each drawn from a set Ak defined by annotation category k (1 k
m). The annotation of x can then be represented by a matrix y of dimension m x L. The rows yj (1
j
m) of y correspond to the label sequences. The complete annotation of alignment position i is contained in the column yi (1
i
L), which is a member of the combined label set
The complete annotation of an N-tuple is given by the vector yv belonging to the label set
The fragmentation of an alignment is determined by its annotation and the specific independence assumptions of a given model. Let frag(x, y) be a mapping from an alignment and its annotation to an index set I, which then defines the fragmentation. The fragmentation of x thus partitions the sites in the alignment into nucleotide tuples which may be of varying lengths.
Defining a Composite Phylogenetic Model
Differences in the selective regime acting on the regions defined by the annotation will give rise to differences in the substitution process. It is therefore of interest to use different phylogenetic models for N-tuples with different annotations. A phylogenetic model for annotated alignments can be defined as the set = {
1, ...,
K}, where the submodels
i = {
i, Qi,
i,
i, ßi} are traditional phylogenetic models for N-tuples as defined above.
Let c define a mapping from the annotation of an N-tuple onto {1, ..., K} representing the set of phylogenetic models. The likelihood of an alignment given its annotation and a composite phylogenetic model is
![]() | (2) |
Parameterizations
The parameter space of a composite phylogenetic model is potentially large, but it can be reduced by introducing constraints on the legal parameter values. These constraints can be expressed by equations defining legal subspaces of the parameters and express assumptions about the substitution process. They can, for example, be introduced to test hypotheses or to create robust models for sparse data.
The off-diagonal entries of a rate matrix qa,b (with a b, a, b
) denote the instantaneous rate of change from a to b. The diagonal entries are defined by the requirement that rows sum to zero
The matrix of transition probabilities P for a given time span t can be found by matrix exponentiation
(Liò and Goldman 1998). It is convenient to normalize Q to one expected substitution per site per time unit by requiring the equality
![]() | (3) |
The substitution process is commonly assumed to be time reversible, which can be ensured by the constraint of detailed balance: aqa,b =
bqb,a
a > b. Other constraints commonly applied to nucleotide models include strand symmetry (Lobry and Lobry 1999) and a fixed ratio between transitions and transversions (Hasegawa, Kishino, and Yano 1985).
The substitution processes of the regions defined by an annotation can often be assumed to have some common properties. These properties can be intrinsic to the type of sequence being modellede.g., the transition bias of nucleotide sequencesor it can be due to a selective force acting across several regions. Including constraints between rate matrices allows tests of the validity of such assumptions and can considerably reduce the number of free parameters.
If there is no exchange of genetic material between the lineages of the phylogenetic tree, i can be assumed to be the same between submodels. If the substitution process does not change between branches, ß can also be assumed equal between submodels, in which case differences in the rate of substitution can be incorporated by defining ßi as a scaling of a general branch length vector: ßi = riß. Each submodel can then be defined as
i = (
i, Qi,
i, ri,
, ß).
Likelihood Ratio Tests
When two models have nested parameter spaces, their relative fit can be evaluated by a likelihood ratio test (LRT) between the simpler (null) model i and a more complex (alternative) model
j. The test statistic
![]() | (4) |
A Composite Phylogenetic Model for Coding Regions with Conserved RNA Structures
In this section, the general framework outlined above is used to define a model of the substitution process in coding nucleotide sequences with overlapping RNA secondary structure.
Let the RNA structural annotation be given by the sequence yS drawn from AS = {ns, l, p}, where ns denotes nonstructural positions, l denotes loop and bulge positions, and p denotes RNA stem-pairing positions. Let the coding annotation be given by the sequence yC drawn from AC = {1, 2, 3}, where 1, 2, and 3 represent first, second, and third codon positions, respectively. (See figure 1 for examples of the annotation.)
All columns of the alignment, apart from the RNA stem-pairing ones, are assumed to evolve independently. As opposed to the standard models of coding regions, this corresponds to ignoring context dependency between nucleotides within the same codon. The fragmentation function thus maps stem-pairing positions onto index vectors for two-tuples, and everything else onto index vectors for single columns. The specific pairing of sites can be given to the frag function as an extra annotation sequence, which is implicit here.
Each possible labeling of the N-tuples defined by the fragmentation now defines a submodel. The possible labelings for single columns consist of the set and for the two-tuples it consists of
since the criteria defining the set of single columns and two-tuples were the presence or absence of the RNA-structure label p. The total number of submodels therefore becomes 15
This corresponds to three different submodels for single nucleotides in the three codon positions of nonstructural regions, three different submodels for single nucleotides in the three codon positions of loop/bulge regions, and nine different submodels for the nine different codon-position combinations of a base-pairing nucleotide pair.
The enumeration of the submodels is given by the mapping (fig. 1). In the following discussion the submodels and their parameters will be referred to through their defining labels, as this is more informative than an explicit enumeration. A dot (·) is used to represent a label of any type. For example, the submodel for two-tuples modeling RNA stem-pairing first and third codon positions will be denoted
c(pp,13), and
c(pp,·) will represent all submodels for stem-pairing two-tuples.
Rate Matrices for Single Sites
The rate matrices of submodels for single columns (Qc(ns,·) and Qc(l,·)) are parameterized according to the HKY model (Hasegawa, Kishino, and Yano 1985). The constraints of the HKY model can be expressed by defining each off-diagonal entry in terms of four free parameters:
![]() |
Rate Matrices for Two-Tuples
The rate matrices of submodels for two-tuples (c(pp,·)) are highly constrained to allow estimates from sparse data (table 2). Matrix entries are based on a pre-estimated reversible pair-symmetric two-tuple (i.e., 16 x 16) rate matrix (Qfixed), which was estimated from a large set of stem-pairing sites from noncoding RNA-structures (Knudsen and Hein 1999). The rate matrix can be found at www.stats.ox.ac.uk/
meyer/CORSmodel. The off-diagonal entries of Qi are defined by the equations
![]() |
|
The Start Model
The most general form of the model full that we can construct has no shared parameters between the 15 submodels and thus contains 6·5 + 9·3 = 57 parameters, excluding the phylogenetic tree.
We start, however, with a model start that is constrained in some dimensions. These constraints reflect the amount of available data, our hypotheses concerning the substitution process, and our desire to obtain biologically interpretable parameters.
In start we let the submodels for the three different codon positions in the nonstructural regions have separate parameter sets. Hence, we expect to see differences in the substitution process and rate which reflect the average effect that nucleotide substitutions in each codon position has on protein conservation. Specifically, we expect that third positions, where nearly all substitutions are synonymous, will show a higher estimated rate of substitution than first positions, where fewer substitutions are synonymous, and that these will again show higher rate estimates than second positions, where all substitutions are nonsynonymous. A similar relation is expected from the ratio of the rate of transitions to the rate of transversions (
), again reflecting the relative number of transitions that contribute synonymous substitutions in each codon position. No scaling of the phylogenetic tree is needed for nonstructural third positions (Ic(ns,3)), because the phylogenetic tree was estimated from these. We therefore fix the rate of the third-position submodel to one (rc(ns,3) = 1).
Although both nonstructural and loop/bulge regions are nonpairing, the latter may have a biological functionality which distinguishes them from the former. In the starting model, therefore, no ties were introduced between parameters of loop/bulge and nonstructural substitution models.
Because of the relative sparseness of data columns in each of the two-tuple annotation categories (Ic(pp,·)), we have chosen to constrain their submodels considerably. Thus, we set
s3 = 1
j, k
{1, 2, 3}, where s1 and s2 are two new free parameters shared between all relevant two-tuple models. This corresponds to assuming that the codon-positionspecific effects on the relative rate of substitution are independent of the specific position combination and removes 16 free parameters compared to
full. Because Qfixed is symmetric, this specifically induces equivalence between models with symmetric codon positions, so that
c(pp,jk) =
c(pp,kj). The normalization procedure means that s1 and s2 are defined relative to s3 = 1. Their estimates can thus be interpreted as the effect that protein conservation has on the relative rate of substitution compared to the rate in third positions. Hence, we would expect these estimates to rank like the rates of substitution in the nonstructural regions.
We also constrain the substitution rate parameters of the two-tuple models by parameterizing the rate as where rp is a new free parameter shared between all two-tuple models. Given the previously mentioned normalization procedure, (rc(ns,j) + rc(ns,k))/2 is the expected rate of substitution for a nonstructural nucleotide pair evolving independently. This means that rp can be interpreted as a scaling of the substitution rate in the nonstructural regions, induced by structure conservation.
Constraints on the parameters of start will be used to express hypotheses on the substitution process, which are then tested in a likelihood ratio framework (see Results).
Parameter Estimation
The maximum likelihood estimate (MLE) argmaxP(x | y,
) can be found through numerical optimization. Such optimizations are computationally intensive and prone to be caught in local optima when the dimensionality of the parameter space is large. The number of parameters subject to numerical optimization is here reduced by pre-estimating the phylogenetic tree (
and ß), and by following the normal practice of using a simple analytic estimate for the equilibrium frequencies, which is derived by counting, and thus is not based on
and ß. In the following discussion we denote the set of index vectors mapped to
i by Ii = {v
frag(x, y) : c(yv) = i}.
The estimate of and ß is based on third-codon positions in nonstructural regions (Ic(ns,3)). This allows the ri estimates to be interpreted as the rate of substitution relative to sites in Ic(ns,3) (third position). A distance matrix based on Kimura's two-parameter model (Kimura 1980) was found using DNADIST with default settings from the PHYLIP package (Felsenstein 1993).
was estimated from this distance matrix using Weighted Neighbor Joining (Bruno, Socci, and Halpern 2000). The BASEML program from the PAML program package (Yang 2000) was used to find a MLE of ß under a HKY model (which thus corresponds to
keeping
fixed.
The estimator for i is the symbol frequency in the set of N-tuples defined by Ii. When the equilibrium distribution is constrained to being the same for several submodels, the estimator becomes the symbol frequency in the corresponding union of N-tuple entry sets. Recall that only
c(ns,·) and
c(l,·) are free parameters, because
c(pp,·) are pre-estimated along with Qfixed.
The MLEs of the remaining parameters of (i.e., Qi and ri) are found using the quasi-Newton numerical optimization method, with BFGS approximation of the Hessian implemented in the OPT++ package (Meza 1994). The optimization was found to be robust to the initial parameter values. Rewriting the composite-likelihood expression (see eq. 2),
![]() | (5) |
Approximative standard errors of the MLE found by the numerical optimization procedure were calculated from an estimate of the Fisher information matrix [e.g., Ewens and Grant (2001)]. The estimator used was a difference approximation to minus the Hessian of the log-likelihood function (ln(P(x|y, )) evaluated at the MLE. This approximation of the standard errors relies on the asymptotic behavior of the MLE; the standard errors of estimates based on sparse data (i.e., parameters based on subsets of Ic(pp,·) or Ic(l,·)) are therefore only indicative.
Implementation
A general framework for phylogenetic analysis has been written in C++ which allows models to be specified in XML. A Linux executable version can be downloaded from www.stats.ox.ac.uk/meyer/CORSmodel.
The Data
Hepatitis C virus (HCV) is a flavivirus belonging to the Flaviviridae family with a positive-sense single-stranded RNA genome. The genome is approximately 9,500 bases long and contains a single open reading frame (ORF) which encodes one large polyprotein. On the basis of phylogeny, hepatitis C viruses are divided into genotypes 1 through 6, and these are further divided into subtypes designated by a, b, and c, in order of discovery. Two RNA structures have been described in the 5' and 3' untranslated regions of the HCV genome: one is involved in initiation of RNA replication (Yi and Lemon 2003) and the other functions as an internal ribosomal entry site (Reynolds et al. 1996). However, it has recently been demonstrated, via bioinformatics (Tuplin et al. 2002) and enzymatic mapping techniques (Tuplin et al. personal communication), that five RNA secondary structures exist within the 3' part of the coding region. These structures define our structural annotation.
The alignment, which contained only 0.27% gaps, was generated manually from the coding part of 99 HCV genotype 1a and 1b genomic sequences (alignment with accession numbers available at www.stats.ox.ac.uk/meyer/CORSmodel). The RNA structure annotation of the alignment was extrapolated from the genotype 1a sequence used in the experimental validation of the coding structures. Tables 1 and 2 provide the distribution of nucleotides between the different structural categories. All sites outside these five structures were annotated as nonstructural (ns). The first 50 sites of the alignment were discarded due to an RNA structure known to extend from the 5' UTR into the beginning of the coding region (Reynolds et al. 1996).
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Model Comparisons
The model start was taken as a starting point for the model comparisons. Simpler models are defined by successively adding constraints to the parameter space of
start. This leads to a hierarchy of nested models, which is depicted in figure 2. The relative fit of successive models is evaluated by likelihood ratio tests where the simpler model represents the null hypothesis and the more general model represents the alternative hypothesis. A significant P value will therefore give rise to rejection of the simpler model and retention of the more general model. A nonsignificant P value does not lend support to rejection of the simpler model, in which case the simpler model is adopted. Table 3 defines the models in terms of their constraints. Table 4 reports the likelihoods, the test statistics, and the P values of each test.
|
|
|
A further LRT showed no significant effect of letting nucleotide equilibrium frequencies differ between nonstructural and loop/bulge regions (3 vs.
2), and
3 therefore replaced
2 as our null model. The opposite was observed when testing for difference in the transition bias between nonstructural and loop/bulge regions (
4 vs.
3).
Comparisons of models 5 and
6 versus
3 showed that both transition bias and nucleotide equilibrium frequencies are significantly different between the three different codon positions in the nonstructural regions.
Next, we turned to the submodels that describe substitution of two-tuples. By comparing models 7 and
3, we found a significant effect of including the codon-positionspecific skewing via the parameters s1 and s2. A similar observation was made for the parameter rp, which scales the rate of substitution in base-pairing regions (
8 vs.
3).
Lastly, we tested our restricted start model start against the completely unrestricted full model
full, which was found to provide a significantly better relative fit.
Parameter Estimates
Parameter estimates from the final model (3) and standard errors are listed in table 5. Values of the rate of substitution and the transition bias estimated for the nonstructural submodels followed our prediction and were ordered by codon position as follows: third > first > second.
|
Considering the two-tuple submodels, we found a reduced relative rate of substitution in first and second codon positions. Contrary to expectation, the estimated effect was slightly more pronounced (s2 > s1) in first than in second positions, but these estimates are also associated with a high degree of uncertainty.
The scaling parameters for the structural regions (rl and rp) showed a reduction in the absolute rate of substitution to about half in loop/bulge regions and about a third in base-pairing regions.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In base-pairing regions we found that the substitution process is affected by selection to conserve both the amino acid sequence of the encoded protein and the RNA structure. The protein-coding constraint was reflected as a lowering of the relative rate of substitution in base-pairing nucleotides occupying first and second codon positions, compared to that observed in noncoding RNA structural regions, and the constraints from structural conservation caused a marked reduction of the rate of substitution in base-pairing regions compared to nonstructural regions.
We also inferred that selection imposes a significantly different filtering of mutations in loop/bulge regions compared to nonstructural regions, which results in a lowered rate of substitution and a difference in the relative number of accepted transitions and transversions. This indicates that loop/bulge regions of the investigated structures are not mere spacer regions but have a biological function imposing additional selective constraints. This added constraint could occur because loop/bulge regions mainly occupy slowly evolving protein regions. The amino acid composition of loop regions, however, shows no sign of a bias toward slowly evolving amino acids like proline, cysteine, and tryptophan when compared to the overall amino acid composition (results not shown). It therefore seems more likely that this functionality is related to the RNA structures where it could be mediated by the formation of pseudo knots via complementary base-pairing or the interaction with proteins as described for the loop region of the cis-acting replication element (CRE) of polioviruses (Goodfellow, Kerrigan, and Evans 2003).
A factor which may affect both parameter estimates and LRTs is the fidelity of the structural annotation. The experimental annotation employed in our study is based on a HCV genotype 1a strain, and our fragmentation of the alignment, and subsequent parameter estimation, is based on the assumption that its structure is conserved throughout the alignment. However, a comparison to an experimental annotation of homologous RNA structures in HCV genotype 2 shows considerable differences (Tuplin et al. in progress). Our data set does not contain any genotype 2 sequences but consists solely of sequences from the more closely related subtypes 1a and 1b. Still, we found that 10.1% of the positions annotated as pairing contain mismatching nucleotides (mismatching with respect to base-pairingi.e., pairs other than A-T, C-G, or T-G), indicating that functional conservation within HCV genotype 1 may be achieved with some structural flexibility. Some positions may thus be mis-annotated on part of the tree and could potentially affect parameter estimates. To investigate this possibility, we re-estimated parameters under model 3, using a "cleaned" data set, where alignment columns containing mismatching nucleotides were treated as missing data. This showed that mis-annotation results in a small upward bias in the estimated rate of substitution in base-pairing regions (rp = 0.323 vs. rp = 0.384). A slight effect was also observed on the s parameters, which changed their relation from s1 < s2 to the expected relation s2 < s1.
The rate estimates of loop/bulge submodels could potentially be downward biased if regions annotated as loop/bulge are indeed base-pairing throughout a part of the tree. However, because of the strength of the constraints introduced by complementary base-pairing, we expect the effect of structural evolution to be greatest in two-tuple models.
Although the effects we observe are small, it is clear that the assumptions on which our model rests are sensitive to structural evolution. This question could be addressed through the use of models that allow the RNA structure to evolve along the tree. Such models do not exist at present but represent an exciting challenge.
There are more immediate ways in which the present approach could be improved. One would expect that the constraints of protein conservation differ between amino acids with different structural or functional roles in the protein. The resulting heterogeneity in the substitution process could be accommodated either by adding additional layers of annotation describing protein structure and function or by integrating over a distribution of, e.g., substitution rates (Yang 1993; Felsenstein and Churchill 1996).
We have stated our model via a general framework for constructing context-reducing phylogenetics models for genetic data with multiple annotations. There are several potential applications of this modeling framework, including, for example, protein sequences where both secondary and tertiary structure is taken into account, coding regions with overlying splice-sites, and coding regions annotated by genomic characteristics (e.g., isochore vs. non-isochore). The main practical limitations on the use of the presented framework will be of time usage and robustness of the parameter optimizations. Total time usage will depend on the time and number of likelihood calculations in the optimization procedure. Because the time spent in each likelihood calculation is proportional to the number of sites and squares in the alphabet size of the sites, it becomes impractical in most cases to fragment the alignment into tuples longer than three. Adding free parameters to a model will generally increase the complexity of the search space. Thus, the needed number of likelihood calculations will grow more than linearly with the number of free parameters, and the chance of finding the global optimum will decrease. This is true for a given submodel, but the relation between the total number of free parameters and the overall optimization procedure is more complex. A way of reducing the overall number of parameters is by constraining these between submodels. However, such constraints make submodels interdependent, and they increase the number of free parameters which have to be optimized simultaneously and thus the complexity of the search space. As a consequence, the effect of the total number of parameters on the speed and fidelity of the estimation procedure will depend on a complex interplay between the data and the structure of the model, and no clear guidelines are available. However, we note that the optimizations procedure employed here was both feasible and robust with more than 50 free parameters.
The comparison between the full model and our constrained starting model showed that the latter does not capture the full complexity of the evolutionary process. As more data from CORS accumulate, more elaborate models should be explored. It would also be of great interest to evaluate the goodness-of-fit that our model provides to data, and to estimate the validity of our assumption of a distributed LRT statistic. Both could be tested by parametric bootstrapping (Goldman 1993), but because of the numerical optimization procedures used, this would be extremely demanding computationally.
Furthermore, it would be interesting to evaluate how well our context-reducing model compares to context-elaborate models such as that of Pedersen and Jensen (2001), which represent a more loyal description of the true evolutionary process in structural regions. The latter approach employs a more accurate model of the evolutionary process, but it must resort to approximative computational techniques, whereas our model represents an approximation to the known context dependencies in the evolutionary process but relies on exact calculations. Thus the choice of substitution model type stands between models which treat a small amount of information with the greatest possible accuracy and approximative models which treat the greatest possible amount of information with reduced accuracy. Our objectives here were to develop a model that could be used in the estimation of evolutionary parameters and the testing of evolutionary hypotheses, in comparative RNA structure prediction in coding regions, and in RNA virus phylogenetics. At present the challenges in implementation and computation do not allow for the routine use of process-based models to solve any of these problems. This motivated our choice of a context-reducing model. As algorithms and computers improve, context-elaborate models will, however, become more attractive. The choice of model will then be determined by the type of analysis performed. Thus, context-elaborate models may become the models of choice for the estimation of evolutionary parameters and the testing of evolutionary hypotheses. However, for use in comparative RNA structure prediction and RNA virus phylogenetics, we believe that the computational demand of high throughput sequence analysis and phylogenetic algorithms will dictate the use of computationally convenient approximative models for quite some time.
![]() |
Conclusion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Mark Springer, Associate Editor
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Bruno, W. J., Socci, N. D., and Halpern, A. L. 2000. Weighted Neighbor Joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 17:189197.
Chartrand, P., Meng, X. H., Huttelmaier, S., Donato, D., and Singer, R. H. 2002. Asymmetric sorting of ashlp in yeast results from inhibition of translation by localization elements in the mRNA. Mol. Cell 10:13191330.[ISI][Medline]
Chartrand, P., Meng, X. H., Singer, R. H., and Long, R. M. 1999. Structural elements required for the localization of ASH1 mRNA and of a green fluorescent protein reporter particle in vivo. Current Biol. 9:333336.[CrossRef][ISI][Medline]
Ewens, W. J., and Grant, G. R. 2001. Statistical methods in bioinformatics. SpringerVerlag, New York.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368376.[ISI][Medline]
. 1993. PHYLIP, Phylogeny Inference Package, 3.5c edition. University of Washington, Seattle, Wash.
Felsenstein, J., and Churchill, G. A. 1996. A hidden markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93104.[Abstract]
Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182198.[ISI][Medline]
Goldman, N., and Yang, Z. 1994. A codon-based model of nucleotide substitution of protein-coding DNA sequences. Mol. Biol. Evol. 11:725735.
Goodfellow, I. G., Kerrigan, D., and Evans, D. J. 2003. Structure and function analysis of the poliovirus cis-acting replication element (CRE). RNA 9:124137.
Hasegawa, M., Kishino, H., and Yano, T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160174.[ISI][Medline]
Hein, J., and Stovlbaek, J. 1995. A maximum-likelihood approach to analyzing nonoverlapping and overlapping reading frames. J. Mol. Evol. 40:181189.[ISI][Medline]
Katz, L., and Burge, C. B. 2003. Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res. 13:20422051.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol Evol. 16:111120.[ISI][Medline]
Knudsen, B., and Hein, J. 1999. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15:446454.
Liò, P., and Goldman, N. 1998. Models of molecular evolution and phylogeny. Genome Res. 8:12331244.
Lobry, J., and Lobry, C. 1999. Evolution of DNA base composition under no-strand-bias conditions when the substitution rates are not constant. Mol. Biol. Evol. 16:719723.[Abstract]
Meza, J. C. 1994. OPT++: an object-oriented class library for nonlinear optimization. Technical Report SAND94-8225, Sandia National Laboratories, Livermore, Calif.
Muse, S. V., and Gaut, B. S. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol. Evol. 11:715724.
Pedersen, A. M., and Jensen, J. L. 2001. A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames. Mol. Biol. Evol. 18:763776.
Reynolds, J. E., Kaminski, A., Carroll, A. R., Clarke, B. E., Rowlands, D. J., and Jackson, R. J. 1996. Internal initiation of translation of hepatitis C virus RNA: the ribosome entry site is at the authentic initiation codon. RNA 2:867878.[Abstract]
Robinson, D. M., Jones, D. T., Kishino, H., Goldman, N., and Thorne, J. L. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol. Biol. Evol. 20:16921704.
Savill, N. J., Hoyle, D. C., and Higgs, P. G. 2001. RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum-likelihood methods. Genetics 157:399411.
Siepel, A., and Haussler, D. 2003. Phylogenetic estimation of context dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 23:46888.
Tuplin, A., Wood, J., Evans, D. J., Patel, A. H., and Simmonds, P. 2002. Thermodynamic and phylogenetic prediction of RNA secondary structures in the coding region of hepatitis C virus. RNA 8:824841.
Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:13961401.[Abstract]
. 1996. Maximum-Likelihood Models for Combined Analyses of Multiple Sequences Data. J. Mol. Evol. 42:587596.[ISI][Medline]
. 2000. Phylogenetic Analysis by Maximum Likelihood (PAML), 3.0 edition. University College London.
Yi, M., and Lemon, S. M. 2003. 3' nontranslated RNA signals required for replication of hepatitis C virus RNA. J. Virol. 77:35573568.
Zanotto, P. M., Gibbs, M. J., Gould, E. A., and Holmes, E. C. 1996. A reevaluation of the higher taxonomy of viruses based on RNA polymerases. J. Virol. 70:60836096.[Abstract]