A Dependent-Rates Model and an MCMC-Based Methodology for the Maximum-Likelihood Analysis of Sequences with Overlapping Reading Frames
Anne-Mette Krabbe Pedersen and
Jens Ledet Jensen
Department of Theoretical Statistics, Institute of Mathematics, University of Aarhus, Denmark
 |
Abstract
|
---|
We present a model and methodology for the maximum-likelihood analysis of pairwise alignments of DNA sequences in which two genes are encoded in overlapping reading frames. In the model for the substitution process, the instantaneous rates of substitution are allowed to depend on the nucleotides occupying the sites in a neighborhood of the site subject to substitution at the instant of the substitution. By defining the neighborhood of a site to extend over all sites in the codons in both reading frames to which a site belongs, constraints imposed by the genetic code in both reading frames can be taken into account. Due to the dependency of the instantaneous rates of substitution on the states at neighboring sites, the transition probability between sequences does not factorize and therefore cannot be obtained directly. We present a Markov chain Monte Carlo procedure for obtaining the ratio of two transition probabilities between two sequences under the model considered, and we describe how maximum-likelihood parameter estimation and likelihood ratio tests can be performed using the procedure. We describe how the expected numbers of different types of substitutions in the shared history of two sequences can be calculated, and we use the described model and methodology in an analysis of a pairwise alignment of two hepatitis B sequences in which two genes are encoded in overlapping frames. Finally, we present an extended model, together with a simpler approximate estimation procedure, and use this to test the adequacy of the former model.
 |
Introduction
|
---|
In Felsenstein's maximum-likelihood framework for inferring evolutionary trees from DNA sequences, a fundamental assumption is that the substitution processes in the single-nucleotide sites are independent (Felsenstein 1981
). When this is the case, transition probabilities between sequences can feasibly be obtained, because these probabilities become products of transition probabilities between nucleotides. If the independent substitution processes in the sites are assumed to be identical Markov processes described by a matrix of instantaneous rates Qnuc, the matrix of transition probabilities between nucleotides separated by time t can be obtained as Pnuc(t) = exp(Qnuct).
For many sequences, the assumption of independent substitution processes in the nucleotide sites is in striking contradiction to biological reality. Protein-coding sequences present an obvious example: the rate of synonymous substitution is generally higher than that of nonsynonymous substitution (see Li, Wu, and Luo 1985
and references therein). Whether a substitution in a site is synonymous or nonsynonymous depends on what nucleotides occupy the other sites of the codon. Substitution processes in nucleotide sites belonging to the same codon are thus nonindependent.
Li, Wu, and Luo (1985)
were among the first to describe a method for estimating evolutionary distances between coding sequences in which constraints imposed by the structure of the genetic code were taken into account. Their method relied on a partitioning of sites into degeneracy classes. A site was defined to be nondegenerate if all possible changes at the site were nonsynonymous, twofold-degenerate if one was synonymous and the other two were nonsynonymous, and fourfold-degenerate if all possible changes were synonymous. The classification of sites into degeneracy classes was based on one of the observed sequences, and the degeneracy class of a site was assumed to be constant over time. Having defined the degeneracy classes of the sites, the sequences were analyzed under the assumption that the substitution processes in the nucleotide sites were independent. This method is approximate: the classification of sites into degeneracy classes depends on which sequence one chooses to base the classification on, and the degeneracy class of a site is not constant over time, but changes as substitutions occur.
Goldman and Yang (1994)
and Muse and Gaut (1994)
described how the nonindependence introduced by the structure of the genetic code could be dealt with in an exact manner. They presented codon-based models in which the substitution processes in codons, rather than single-nucleotide sites, were assumed to be independent. The substitution processes in the codons were assumed to be identical, reversible Markov processes, described by a 61 x 61 matrix of instantaneous rates of codon substitution, Qcodon. In the matrix, entries corresponding to nonsynonymous substitutions could be modified relative to synonymous ones by multiplication of a factor representing the fractional reduction of amino-acid-altering relative to amino-acid-preserving rates. Due to the assumption of independent substitution processes in codons, transition probabilities between sequences in codon-based models factorize into a product of transition probabilities between codons. As in nucleotide-based models, these can be obtained by taking the exponential of the product of the rate matrix and the time, i.e., Pcodon(t) = exp(Qcodont). The basic idea behind codon-based models has since been utilized in the development of models for substitution processes in RNA sequences. Schöninger and von Haeseler (1994)
, Muse (1995)
, and Tillier and Collins (1995)
have presented dinucleotide-based models for the analysis of RNA sequences that allow dependencies among the substitution processes in nucleotide sites that participate in base-pairings to be modeled. The substitution processes in dinucleotides are assumed to be independent, and the transition probability between sequences factorizes into a product of transition probabilities between dinucleotides.
The substitution processes in sequences in which more genes are encoded in overlapping reading frames are subject to constraints imposed by overlapping genetic codes. Due to the overlapping of the constraints, an assumption of independent substitution processes in small subsequences is inappropriate. An adequate description of the substitution process in these sequences can therefore not be obtained by exploiting the idea behind the codon-based models. A model for the substitution processes in sequences with multiple overlapping reading frames was suggested by Hein and Støvlbæk (1995)
, who extended the notion of the degeneracy class of a site to that of a combination of degeneracy classes (one for each reading frame to which a site belongs). They defined class-specific matrices of instantaneous rates of substitution, assumed independent Markov processes in the sites according to these matrices, and illustrated how a maximum-likelihood analysis of evolutionary trees under the model could be performed. As in the method of Li, Wu, and Luo (1985)
, the classification of sites was based on one of the observed sequences, and the class of a site was assumed to be fixed. The method thus inherits the shortcomings of Li, Wu, and Luo's (1985)
method and deals with the constraints imposed by the overlapping genetic codes in an approximate manner.
In this study, we present a model for the substitution process in sequences in which two genes are encoded in overlapping frames that incorporates the constraints imposed by both of the overlapping genetic codes in an exact manner. This is achieved by allowing the instantaneous rates of substitution in a site to depend on what nucleotides occupy the sites in the neighborhood of the site at the instant of the substitution. Thus, the model does not rely on the degeneracy class notion and does not assume that the substitution processes in any subsequences are independent. Rather, the model contains parameters representing the degrees of selectional constraints operating in the different frames, and these parameters can be estimated. Due to the nonindependent instantaneous rates of substitution, the transition probability between two sequences does not factorize into products of transition probabilities between small subsequences. Rather, transition probabilities between full-length sequences must be considered. The model and methodology we describe here were obtained by generalizing a model and methodology we have previously presented (Jensen and Pedersen 2000
).
In the Methods section, we describe the model for the substitution process in sequences with overlapping reading frames. We show that the substitution process is reversible and derive the stationary distribution of a sequence under the model. We further describe a Markov chain Monte Carlo (MCMC) procedure for estimating the ratio of two transition probabilities between two sequences under the described model. Together, these entities, the stationary distribution of a single sequence and the ratio of transition probabilities between two sequences, comprise the elements needed for a maximum-likelihood analysis to be performed. In the Results section, we analyze an alignment of two homologous hepatitis B subsequences in which the polymerase (P) and the envelope (S) genes are encoded in overlapping frames using the model and methodology presented in the preceding section. We obtain maximum-likelihood estimates of the parameters in the model and perform a likelihood ratio test of a hypothesis concerning the mode of substitution in the sequences. We calculate expected numbers of various types of substitutions. Finally, to check the adequacy of the model, we present a more general model, together with a simpler approximate estimation procedure.
 |
Methods
|
---|
The Model
In this section, we present a model for the substitution process in sequences with overlapping reading frames, in which constraints imposed by the two overlapping genetic codes are incorporated.
We consider an alignment of two homologous DNA sequences in which two genes are encoded in overlapping reading frames. We assume that the sequences have evolved from a common ancestral sequence through independent identical evolutionary processes that involve substitutions only. We further assume that the substitution process is a homogeneous Markov process and that substitutions happen sequentially, so that within an instant, the sequence may be changed in one nucleotide position only. We do not allow substitutions that generate stop codons in any of the reading frames considered, and we assume that no substitutions occur in the first and last codons of reading frame I in the alignment. With
short for the set of parameters specifying the substitution process, the likelihood of observing the two sequences x and y at the tips of an evolutionary tree with branch lengths tx and ty is given by

| (1) |
where the sum over s0 is over all possible ancestral sequences and ps0(
) is the probability under the model of the ancestral sequence being s0. The parameters tx and ty are the time epochs separating sequences x and y respectively, from the ancestral sequence, and Ps0
z(
, tz) is the transition probability between sequences s0 and z, z = x, y.
We still have to specify the precise form of the Markov process in the inner parts of the sequences, that is, in codons 2, ... , n - 1, in a way that permits the constraints imposed by the two overlapping genetic codes to be incorporated. For this, consider a sequence in which two genes are encoded in overlapping reading frames. The substitution of a nucleotide in the sequence may lead to the alteration of the amino acid being encoded in both of the reading frames, in one of the reading frames only, or in none of the reading frames. Assume that the reading frames overlap as illustrated in figure 1
. Whether a nucleotide substitution is synonymous or nonsynonymous with respect to one of the reading frames depends on what nucleotides occupy the other positions of the codon within that reading frame at the instant of the substitution. Whether a substitution in the first codon position in reading frame I is synonymous or nonsynonymous with respect to reading frame II depends on what nucleotides occupy the two immediately preceding nucleotide positions, that is, positions 2 and 3 of the preceding reading frame I codon. In order to determine whether a substitution in codon position 2 or 3 in reading frame I is synonymous or nonsynonymous with respect to reading frame II, the nucleotide immediately following the codon, that is, in position 1 of the next codon in reading frame I, must be known.
In order to incorporate constraints imposed by the operation of two overlapping reading frames in the model for the substitution process, we must allow the instantaneous rates of nucleotide substitution to be context-dependent. Numbering the positions by the codon number in reading frame I, the instantaneous rate of substitution of one of the three nucleotides in codon i should depend on the nucleotides occupying positions 2 and 3 of codon i - 1 and position 1 of codon i + 1. Let denote the ith codon in reading frame I of the inner part of a sequence (i = 2, ... , n - 1), where zik is the nucleotide in codon position k, k = 1, 2, 3. Let
i be a codon that differs from
i in one nucleotide position only. In order to allow for unequal nucleotide frequencies, we assume that the instantaneous rate of substitution to codon
i is proportional to
(
i), where (
i) is the target nucleotide, that is, the nucleotide occupying the position in
i at which it differs from zi. We assume that the
k's, k
{A, C, G, T}, sum to 1. We further allow for transition/transversion bias by multiplying instantaneous rates of transitional substitutions by the factor K. With respect to these two features, our model is similar to that of Hasegawa, Kishino, and Yano (1985)
. The constraints imposed by the genetic codes are incorporated by multiplying all instantaneous rates that alter the amino acid in reading frame I only by fI, those that alter the amino acid in reading frame II only by fII, and those that alter the amino acids in both reading frames by fI/II, a procedure related to that used in the codon-based model for single coding sequences (Goldman and Yang 1994
; Muse and Gaut 1994
). We refer to the f parameters (fI, fII, and fI/II) as parameters for selective constraints. An f parameter larger than 1 indicates that amino-acid-altering substitutions in the associated reading frame are promoted, whereas if f < 1, synonymous substitutions are favored. When f = 1, there are no selective constraints in the corresponding reading frame. Note that using reading frame I for numbering the positions along the sequence has no influence on the instantaneous rates. Whether a substitution alters the amino acid in one of the reading frames is not related to the numbering used.
Let denote the instantaneous rates of substitution from a sequence that has zi as the ith codon in reading frame I to a sequence that is identical to the sequence except in codon i, in which it holds the codon
i, at an instant when positions 2 and 3 of codon i - 1 and position 1 of codon i + 1 are z2i-1, z3i-1, and z1i+1, respectively. The model then states the following form for the instantaneous rates of substitution:

| (1A) |
Here, 1ts is 1 for a transition (ts) and 0 for a transversion (tv), 1non(I),syn(II) is 1 for a substitution that is nonsynonymous in reading frame I and synonymous in reading frame II and 0 otherwise, and 1syn(I),non(II) and 1non(I),non(II) are defined similarly. The set
, on which the indicator functions are 1 in equation (2)
, consists of the 61 nonstop codons. Spelled out, we obtain the following representation of the instantaneous rates:

| (1) |
The instantaneous rate of a substitution which changes a codon ACA to a codon GCA in reading frame I at an instant when the codon considered is preceded by the nucleotides TC and followed by the nucleotide A is thus
GKfI, since the substitution is a transition to a G that changes the amino acid coded for in reading frame I from a threonine to an alanine and does not change the amino acid encoded in reading frame II, since both of the codons TCA and TCG code for serines. The instantaneous rate of a substitution which changes a codon GAT in the context CC | ... | A to GAA is
AfI/II, since the substitution is a transversion to an A and the codons ATA and AAA (in reading frame II) code for different amino acids, as do the codons GAT and GAA (in reading frame I). Note that the model can easily be modified to other kinds of overlapping genes, e.g., genes encoded in opposite directions. The only modification needed is in the translation via the genetic code.
The Stationary Distribution and Reversibility
We assume that the Markov process has reached equilibrium and let
(z) denote the equilibrium frequency of a sequence z. In this section, we show that the model presented above is reversible, identify the stationary distribution of a sequence under the model, and give a quick procedure for calculating the normalizing constant of the stationary distribution.
A Markov process with instantaneous rates qz,
is reversible and has
as the stationary distribution if
 | (1B) |
Under the model described above, the equilibrium frequency of a sequence with a stop codon in any of the two reading frames is 0, as are instantaneous rates of substitutions that generate stop codons. The above equality is thus satisfied for these cases. Moreover, the equality is trivially satisfied for sequences z and
that differ in more than one nucleotide position, as in this case, the instantaneous rates are 0. Therefore, assume that the two sequences z and
do not contain stop codons in either reading frame and differ at only one codon position in codon i, and consider

| (1C) |
Since all factors in the instantaneous rates except
(
i), which depends on the target nucleotide (
i), appear symmetrically, they cancel out, and we obtain
 | (1D) |
It is easily seen that this equality is satisfied if the equilibrium distribution of a sequence
(z) is a product over the
k parameters corresponding to the nucleotide constituents of the sequence. Incorporating the exclusion of sequences with stop codons in the second reading frame, we obtain that under the model, the stationary distribution of a sequence z = (z2, ... , zn-1) with z1 and zn fixed is

| (3) |
if zi
,
i. For all other sequences, the stationary distribution is 0. Here, Z is a normalizing constant different from 1, because sequences with stops in either reading frames are excluded. Without this exclusion, Z would indeed be 1, because the
k's sum to 1. We have thus identified the equilibrium distribution and at the same time shown that the process is reversible. With reversibility, the likelihood value becomes independent of the placement of the root, and with this and the assumed equilibrium, the likelihood in equation (1)
reduces to
 | (4) |
where we write
x(
) for
(x) to stress that in the likelihood
(x) is treated as a function of the parameters in the model, of which only
, and not the branch length(s), is relevant for the stationary distribution, and t is the sum of the branch lengths tx and ty.
In order to calculate the likelihood value eq. (4), we must be able to calculate the value of the normalizing constant Z of the stationary distribution eq. (3). This normalizing constant can be found by summing up the equilibrium frequencies of all possible sequences. By first summing over (z2i, z3i), i = 2, ... , n - 1, we can derive an explicit form for Z. The details are in appendix A, where we end up with the formula (A.9)

| (2) |
where all the terms are defined in appendix A.
Calculation of the Transition Probability Between Two Sequences
We now specify how the transition probability from a sequence x to a sequence y under the model described above can be calculated. Since the instantaneous rates of substitution under the model depend on the states at neighboring sites at the instant of the substitution, the probability of transition between the full sequences does not reduce to a product of "marginal" transition probabilities, such as a product of transition probabilities between nucleotides or codons. The substitution processes in all the sites must be considered simultaneously. As argued in Jensen and Pedersen (2000), we will have to resort to simulations in order to calculate the transition probability. Furthermore, in order to reduce the variance of the simulated values, we simulate the ratio of two probabilities Px
y(
1, t1)/Px
y(
2, t2) instead of simulating a transition probability directly. If the ratio can be evaluated for two sets of parameter values, likelihood ratio tests can be obtained, and maximum-likelihood estimates of the parameters in the model can be found by maximizing the ratio

| (3) |
as a function of (
1, t1) for fixed (
2, t2). In the following, we describe a procedure for obtaining the ratio of two transition probabilities using an MCMC simulation technique.
Let
t be the space of paths between sequences x and y separated by time t, and let L denote a particular path in
t. A path is a specification of the number of substitutions, the positions in which the substitutions occur, what nucleotides replace existing nucleotides in these substitutions, and the times (
(0, t)) at which the substitutions occur. Let µt be the measure on
i which, for a fixed number of substitutions r and fixed positions and fixed nucleotides of these substitutions, corresponds to ordinary integration on the space (0, t)r for the substitution times. For a positive number s, we denote by (s/t)L the path in
s obtained by scaling all the substitution times in L by s/t. Let q
(t; L) be the contribution from the path L to the transition probability Px
y(
, t). A detailed description of q is given in appendix B.
In appendix B, we derive the representation

| (5) |
where r is the number of substitutions in the path L, and
denotes the mean value under the distribution
on the space
t2 having density

| (6) |
with respect to µt2. We can thus obtain an approximation of the ratio of the two transition probabilities by calculating [tr1q
1(t1; (t1/t2)L)]/[t2rq
2(t2; L)] for a large number of paths L drawn from
. The farther apart (
1, t1) and (
2, t2) are, the larger the variance and the more paths are needed for the ratio to be obtained with reasonable precision. It is thus necessary when maximizing as a function of (
1, t1) to alter the parameters in the simulation measure (
2, t2) as (
1, t1) moves away from (
2, t2).
We now specify how to simulate from equation (6)
. We use an MCMC method (Gilks, Richardson, and Spiegelhalter 1996
), that is, we construct a Markov chain on the path space
t2 that has
as its stationary distribution. A path L is the collection of paths Lij of the nucleotides in the jth codon position of codon number i in reading frame I. We construct the Markov chain by running through the codons from number 2 to number n - 1 while we update the path Li for the ith codon. The updating of Li is done by proposing a new path L'i from a distribution Pi with density qi. The new path L'i is accepted with probability

| (7) |
where 
2 is given in appendix B. From a computational-cost point of view, the important thing here is that 
2 depends on Li (or L'i) and the two neighboring paths Li-1 and Li+1 only. Having completed a run through the alignment, we have performed a transition in our Markov chain on sequence paths. By continuing the procedure many times, we obtain a sample of sequence paths which contains paths throughout the support of equation (6)
in the correct proportions. In particular, if we propose a path L'i that gives rise to a stop codon in reading frame II, then 
2 (L'i | Li-1, Li+1) will be zero, and therefore the path is not accepted.
The choice of an initial path L to start the Markov chain is not important. We have obtained a start path by simulating paths Li from Pi, i = 2, ... , n -1, and continuing until a sequence path without stop codons in reading frame II has been obtained. The exact form of the path proposal distribution for codon i, qi, is given by the following three steps:
- The number of substitutions ki is taken from a modified Poisson distribution with intensity
i.
- The substitution times ti(r), r = 1, ... , ki, are taken from a uniform distribution on the interval from 0 to t2.
- The nucleotide position and the new nucleotide for each substitution is chosen from a set of allowed substitutions
r.
As for the modification of the Poisson distribution in step 1, note that if the ith codons in the two sequences are identical, paths with one substitution are impossible. If the codons differ at d0 positions, there must be at least d0 substitutions in the path leading from one codon to the other. We thus modify the Poisson distributions and propose a number of substitutions for the path between the ith codons from

| (4) |
where di0 is the number of codon positions at which the ith codons in the two sequences differ. The intensity
i is described below.
In both steps 1 and 3 we will use intensities
iy,w of a change from a codon y to a codon w given by

| (4A) |
where , is defined as in equation (2)
, except that we treat substitutions to stop codons in the second reading frame ((xi-12, xi-13, w1) or (w2, w3, xi+11)) as substitutions to a 21st amino acid, rather than giving them intensity 0. We have adopted this approach in order to keep the proposal distribution simple; that is, we are not using the paths Li-1 and Li+1 during the proposal step. For the Poisson intensity
i in step 1, we take

| (5) |
In step 3, let ki be the number of substitutions from step 1, and let zi(r), r = 0, ... , ki, be the codon after the rth substitution with zi(0) = xi and zi(ki) = yi. When zi(1), ... , zi(r - 1) have been chosen, we choose zi(r) according to the probabilities

| (6) |
The set of allowed substitutions for substitution number r,
r, can be described as follows: let di(z) be the number of nucleotide positions at which codon z differs from codon yi. Then,

| (7) |
The following examples illustrate the role of
r. Assume that for a potential path between identical codons (xi = yi), the number of substitutions chosen is two. For the first substitution, we may choose any combination of position and nucleotide, as long as the nucleotide chosen is different from the one that occupies that position at the moment, and as long as we do not create a stop codon. Irrespective of the choices for the first substitution, the choices for the second (and last) substitution are completely fixed, since we must get to the target codon via this substitution. Similarly, in a path containing one substitution between codons that differ in one position, the choices of position and nucleotide for the substitution are both completely fixed. For paths between codons that differ at one position for which we have chosen a number of two substitutions, the first must occur in the position at which the two codons differ (since otherwise it would generate an additional difference, making it impossible to get to the target codon with the remaining one substitution). Furthermore, the first substitution must not generate the target codon, since if it does, we have no way of assigning the last substitution. Note that we may generate a path for which the set of allowed substitutions for a certain substitution is empty. This is the case for a path with two substitutions between the codons TCA and TTA: the first substitution must be in position 2 (as otherwise we would generate an additional difference), it cannot be to a T (because then the target codon is reached prematurely), and it cannot be to an A or a G (as rates to the codons TAA and TGA are 0, since the codons are stop codons). A path for which an empty set of allowed substitutions is created will be discarded; that is, it will never be accepted.
With this procedure for generating paths, the density of proposing a path Li for codon i is

| (8) |
Having obtained expressions for the equilibrium frequency of a sequence under the model, and here an approximation of the likelihood ratio of transition probabilities, we have the means for performing maximum-likelihood estimation and likelihood ratio tests.
 |
Results
|
---|
Below, we present results from a maximum-likelihood analysis of a pairwise alignment of two hepatitis B sequences, in which the model and methodology presented in the Methods section were used. We refer to the model presented above as "the full model". We give maximum-likelihood estimates of the parameters in the full model and perform a likelihood ratio test of a model of multiplicatively operating selective constraints (referred to as "the multiplicative model") under the full model. The multiplicative model is accepted. In the subsequent subsection, we show how the expected numbers of various types of substitutions may be calculated and give the values obtained under the multiplicative model. In the last subsection, we present an extension of the full model, referred to as the "extended model," that we use to check the adequacy of the full model. For the extended model, we describe and use a simpler, but approximate, estimation method than that used for the full and multiplicative models.
Maximum-Likelihood Analysis
The hepatitis B viral genome is circular and partially double-stranded, with the longer strand consisting of approximately 3,200 nt (Ganem 1996
).
Every nucleotide in the genome is within a coding region, and more than half of the sequence is translated in more than one reading frame. The genome has four open reading frames (ORFs): P, C, S, and X. The P ORF encodes the viral polymerase, the C ORF encodes the structural protein of the nucleocapsid, the X ORF encodes a putative regulatory protein, and the S ORF encodes the viral surface glycoproteins. The S ORF is completely embedded in the P ORF, and the C and X genes partially overlap with the P ORF and also themselves partially overlap (see fig. 2
).

View larger version (19K):
[in this window]
[in a new window]
|
Fig. 2.Schematic representation of the circular genome of hepatitis B. Approximate locations of the four open reading frames C, P, X, and S are shown.
|
|
Two full genome sequences were obtained from the GenBank database (http://ncbi.nlm.nih.gov/genbank) (accession numbers AF151735 [type ayw2] and X75663 [type adw4q]). An alignment of the parts of the sequences in which the S (surface) and P (polymerase) genes are encoded in overlapping frames was obtained automatically using GENAL (Hein and Støvlbæk 1994
). The alignment exhibited an insertion of 11 consecutive codons after the first 2 codons in the reading frame of the S gene. Analysis was restricted to the region following the insertion, a region spanning 1,152 nt, or 384 codons, in the P gene reading frame. The paths of the first and last codons in this region were assumed to be fixed with no substitutions. In the region analyzed, the sequences differed at 13% of the nucleotide positions. The 150 differing nucleotide positions were distributed among 119 reading frame I codons, and 78 transitional differences were exhibited, with the remaining 72 being transversional differences. Among the differing nucleotide positions, 70 fell in first codon positions in reading frame I (third codon positions in reading frame II), 32 in second codon positions in reading frame I (first codon positions in reading frame II), and 48 in third codon positions of reading frame I (second codon positions of reading frame II).
The MCMC algorithm was implemented in C. The Markov chain appeared to converge quickly toward its stationary distribution (results not shown). In the maximization of the likelihood ratio as a function of (
1, t1), we used a stepwise procedure: we started with a certain set of initial values for the parameters for the simulation measure (
2, t2). We performed a first rough maximization (round I) with high threshold (10-4) in which 10,000 samples were used in each calculation of the ratio of the two transition probabilities. We used the resulting "rough" maximum-likelihood estimates as new values of (
2, t2) in a new round of maximization (round II) with a lower threshold (10-5). In this round, each evaluation of the likelihood value was based on 100,000 samples. We then proceeded to round III, in which we used the obtained parameter estimates from round II as new values of (
2, t2), while the number of samples used for evaluation of the ratio of transition probabilities was kept at 100,000, and the threshold was again lowered (10-6). In each round, the starting values of (
1, t1) were set equal to those of the simulation measure (
2, t2).
In order to examine the efficiency and dependency of the MCMC sampler and maximization scheme on the starting values of the parameters, we compared the outcomes of four different runs. Runs A, B, and C had initial values of
k = 0.25, k
{A, C, G, T}, but the initial values for the remaining parameters varied. Initial parameter values of run C* were identical to those of run C:II to the first four decimals, but the runs were started with different random seeds. Initial parameter values and maximum-likelihood estimates obtained after each round in each of the four runs are given in table 1
. Also given are the numbers of iterations (full parameter vector updatings) used in the maximization procedure (Powell's method; Press et al. 1992
) to reach the given maximum-likelihood estimates. The intensities are scaled so that, at equilibrium, the expected number of substitutions out of a codon is 1, and thus the parameter t gives the expected number of substitutions in total per codon between the two sequences. How this scaling is obtained is described in the next section.
View this table:
[in this window]
[in a new window]
|
Table 1 Maximum-Likelihood Estimates of the Parameters in the Full Model After each of Three Rounds in Four Different Runs
|
|
The maximum-likelihood estimates obtained in the four different runs were similar, and the results did not indicate any dependency of the Gibbs sampler on the starting values for the procedure (rows A:III, B:III, and C:III), nor was the result sensitive to the random seed (rows C:III and C*:III). After round III, values obtained for the
k, k
{A, C, G, T}, parameters varied by <1%, those of t and K by <5%, and those of the remaining parameters (fP, fS, and fP/S) by <8%. In contrast to the similarity of the final parameter values, the computational time requirements of the runs differed markedly. As can be seen from the number of iterations used in the maximization procedures of the runs (table 1
, last column), the computational effort involved was positively correlated with the distance between the parameter values in which the run was started and the "true" values. This demonstrated the importance of having "good" starting values.
The degrees of variation in the final parameter values from the four different runs reflect the amount of information contained in the data concerning the different parameters. The variance is expected to be larger on the purely evolutionary parameters (t, K, fP, fS, and fP/S) than on the parameters
k, k
{A, C, G, T}, which are determined mainly by sequence composition (equilibrium distribution). Among the purely evolutionary parameters, the variance is most likely larger on the fP, fS, and fP/S than on the t and K parameters, since the former are related to a finer partitioning of evolutionary events. It is likely that some of the variation obtained among the final values of the fP, fS, and fP/S parameters is due to too few samples in the calculation of the transition probability. As the likelihood surface is flat in the directions corresponding to the purely evolutionary parameters, a relatively high precision in the calculation of the transition probability is necessary for reliable estimates of these parameters to be obtained. The precision may be increased by augmenting the number of samples used. The pattern of variation among round II and III parameter values illustrates this (table 1
): reasonable values for the
k, k
{A, C, G, T}, parameters could be obtained after round I, as these differed only slightly (<5%) from the final values obtained after round III. As for the t and K parameters, their values were ill determined after round I (see run B); however, after round II, they differed by no more than a few percent. The values for the remaining parameters (fP, fS, and fP/S) obtained after round II differed by up to 14% from the final values.
The maximum-likelihood estimates of the parameters showed that in the double coding region analyzed, selection against amino-acid-altering substitutions was stronger in the S than in the P reading frame (fS
0.15 vs. fP
0.22). A similar finding has been reported by Yang, Lauder, and Lin (1995)
. Parameter estimates further indicated that selection against substitutions that altered the amino acid encoded in both reading frames was particularly strong (fP/S
0.06). Under the model above, with fP and fS varying freely, we performed a log likelihood ratio test for the null hypothesis that selection against amino acid substitution in the double coding region acts multiplicatively; that is, fP/S = fP · fS. Parameter estimates under the null hypothesis were
A = 0.238,
C = 0.270,
G = 0.219,
T = 0.273, t = 0.450, K = 1.590, fP = 0.346, and fS = 0.250. Use of the values from run A:III for the parameters under the model with fP and fS varying freely led to a -2 log Q test statistic of 1.305. This gave a P value of approximately 0.25, and the null hypothesis of multiplicatively operating selective constraints was thus accepted.
Expected Numbers of Various Types of Substitutions
If we scale the intensities so that the average rate of substitution per codon at equilibrium equals 1, the time t between sequences will effectively be measured as expected numbers of substitutions per codon. Let (s1, s2, s3), (s4, s5, s6), and (s7, s8, s9) be three consecutive codons in reading frame I, and consider the septet of nucleotides s = (s1, s2, s3, s4, s5, s6, s7). The scaling is then obtained by requiring that

| (9) |
where probs denotes the stationary probability of septet s, and
is the set of all septets. A procedure for calculating probs is given in appendix A.
We can now write the instantaneous rate of synonymous substitutions (that is, substitutions that are synonymous in both reading frames) per codon as

| (8) |
where the set M(s) consists of those w = (w1, w2, w3) for which the change (s4, s5, s6)
(w1, w2, w3) is synonymous, the change (s2, s3, s4)
(s2, s3, w1) is synonymous, and the change (s5, s6, s7)
(w2, w3, s7) is synonymous. The expected number of synonymous substitutions per codon is obtained as
syn(I), syn(II)t, with the maximum-likelihood estimates as parameter values. Expected numbers of other types of substitutions are obtained by restricting the summation in equation (8)
appropriately.
Expected proportions of different kinds of substitutions per codon under the model with multiplicatively acting selection factors were calculated by inserting the maximum-likelihood estimates in the formulas above. The obtained values are given in table 2
. Also given are values for a situation with the same parameter values of
k, k
{A, C, G, T}, and K, but with no selective constraints, that is, with fP = fS = fP/S = 1.0. Expected numbers of the various types of substitutions per codon can be obtained by multiplication with
. For the maximum-likelihood values under the model with multiplicative selective constraints, the ratio of transitional to transversional substitutions was 0.534/0.466 = 1.146, and that of synonymous to any type of nonsynonymous substitutions was 0.058/(1 - 0.058) = 0.062. With respect to reading frame I, the ratio of synonymous to nonsynonymous rates was (0.058 + 0.314)/(1 - 0.058 - 0.314) = 0.592, whereas with respect to reading frame II, it was (0.058 + 0.439)/(1 - 0.058 - 0.439) = 0.988. The corresponding values for the case that was similar but had no selective constraints (second row of table 2
) were 0.848, 0.012, 0.379, and 0.385, respectively. As compared with the hypothetical situation with similar parameter values but no selective constraints, the transition/transversion rate ratio was thus raised by a factor of 1.146/0.848 = 1.351, and the ratio of overall synonymous to nonsynonymous rates was raised by a factor of 0.062/0.012 = 5.167. The ratio of synonymous to nonsynonymous rates with respect to reading frames I and II, respectively, were raised by factors of 0.592/0.379 = 1.562 and 0.988/0.385 = 2.566. These ratios further establish the stronger degree of selective constraints for the evolution of the S gene than the part of the P gene in which the S gene overlaps.
View this table:
[in this window]
[in a new window]
|
Table 2 Expected Proportions of Various Types of Substitutions per Codon Under the Multiplicative Selectional Constraints Model for Two Sets of Parameter Values: Maximum-Likelihood Estimators (MLEs) and No Selective Constraints (NSCs)
|
|
Model Check
In this section, we present an extension of the model described in the Methods section and describe a simpler, but approximate, estimation method. We use the extended model to check the adequacy of the full model assumed above. In the extended model, all dinucleotide interactions and position specific nucleotide intensity parameters are allowed for.
In the extended model, we use the intensities from equation (2) , with the nucleotide intensity
(
i) replaced by a more general term which includes position specific nucleotide intensities
jk, k
{A, G, C, T},
jA +
jG +
jC +
jT = 1, j = 1, 2, 3, and dinucleotide interactions. By a dinucleotide interaction we mean a function of two neighboring nucleotides. We denote these functions by
 | (9A) |
respectively. Precisely, the extended model is now defined by using equation (2) with the term
(
i) replaced by

| (10) |
where
is the position at which
i differs from zi (note that of the eight terms inside the brackets, four terms cancel because
i and zi differ at one position only).
The model considered in the Methods section corresponds to the model here with no dinucleotide interactions, that is, with
j(a, b)
1, for all (a, b)
{A, G, C, T}2, and identical position-specific nucleotide intensities, that is,
ja =
a, a
{A, G, C, T}, j = 1, 2, 3. A model where the only dinucleotide interactions are selection against CpG dinucleotides is obtained when
j = 1 except for the values
j(C, G), j = 1, 2, 3. In this model, instantaneous rates of substitution that generate (respectively, eliminate) a CpG in frame j, j = 1, 2, 3, are multiplied by {1/
j(C, G)}1/2 (respectively,
j(C, G)1/2) relative to instantaneous rates of substitution that leave the CpG count unaltered.
In the extended model, the stationary distribution of a sequence z = (z2, ... , zn-1) is

| (9) |
for zi
for all i. This can be verified by inspection in a manner similar to that used for the model in the Methods section. Like the model in the Methods section, the extended model allows an explicit formula for the normalizing constant to be derived (see eq. A.4 in appendix A), and that allows expected numbers of various types of substitutions to be calculated (e.g., eq. 8). For the latter, stationary probabilities of subsequences under the extended model are neededthese are derived in appendix A.
To check the adequacy of the model assumed in the Methods section, we compare its performance with that of the extended model. For parameter estimation in the extended model we use the following simplified procedure: we first estimate the dinucleotide interactions
j and the position-specific nucleotide intensities
ja using the stationary distribution under the extended model. We base the estimation on one of the two sequences (we use ayw2). The full extended model, however, has too many parameters to be useful. When fitting the extended model, we will make the dinucleotide interactions as simple as possible; that is, we will only include those in the model that increase the fit of the data to the model significantly. For this, we use the following stepwise selection procedure: We start by analyzing the conditional distribution of z1i+1 given (z1i, z2i, z3i), given in equation (A.14)
in appendix A, in order to estimate the interaction
1. We start with the model where
1
1 and thereby obtain an estimate of
1ara, a
{A, C, G, T} (see appendix A). We next calculate the score function (numerically) for the different entries of
1 and choose the entry with the largest absolute value. We include this entry as a parameter in the conditional distribution and see if this provides a significantly better description of the distribution. This procedure is continued until a reasonable fit has been obtained (see below). Next, the conditional distributions (A.15) and (A.16) from appendix A are treated in the same way in order to estimate the interactions
2 and
3.
The result of the stepwise procedure for estimating the interaction parameters and the position-specific nucleotide intensities are given in tables 3 and 4
. It is clear from table 3
that there was a significant CG depression in the datathe sequence considered the first entry to be included in the interaction
j was the CG entry in all three positions. The stepwise procedure additionally includes four types of dinucleotide interactions. Furthermore, the estimates of the position-specific nucleotide intensities
j varied among the three codon positions (table 4 ). The results from this analysis show that the simpler model does not do well with respect to describing a single sequence, meaning that the simple model is only a rough approximation of the true model.
View this table:
[in this window]
[in a new window]
|
Table 4 Parameter Estimates of the Position-Specific Nucleotide Intensity Parameters Obtained Under the Extended Model in which all Selected Interactions Have Been Included
|
|
Having estimated the interactions
j and the position-specific nucleotide intensities
j in the extended model using the stationary distribution, we could return to the MCMC method of the Methods section to estimate the remaining purely evolutionary parameters t, K, fP, fS, and fP/S. However, we mention here another approximate estimation method that will allow us to consider the fit of the model as well. We split the observable changes in the aligned codons in the two sequences into a number of disjoint groups. In particular, we used the nine groups obtained by dividing codons into groups with single and multiple changes and further dividing the group exhibiting single changes into transition and transversion groups, and dividing each of these two groups into groups based on the four combinations of synonymous and nonsynonymous changes in the two reading frames. The observed numbers Ng, g = 1, ... , 9, in the nine groups are approximately independent and Poisson distributed with, say, mean µg. Thus, we form an approximate likelihood based on the Poisson approximation and use this to estimate the remaining parameters (t, K, fP, fS, fP/S). The means µg can be approximated by a simple forward simulation of the Markov process describing the evolution; that is, we must simulate exponential waiting times and simulate the jump type.
The estimated evolutionary parameters for the extended model, are t = 0.476, K = 1.946, fP = 0.175, fS = 0.161, and fP/S = 0.033. Observed numbers of codons falling into the nine groups are given in column 3 of table 5
, and expected numbers under the extended model with corresponding parameter estimates are given in the fourth column. The expectations under the extended model fit well with the observed numbers in the nine categories (-2 log Q = 3.8
(4)). In the fifth column, results are given for a model that is similar to the extended model except that the selectional constraints are assumed to act multiplicatively (fP/S = fPfS). Here, parameter values identical to those in the column before are used, except that fP/S = 0.175 · 0.161 = 0.028. Even though no fitting of the parameters to the extended model with multiplicatively operating selection parameters has been done, the -2 log Q test statistic is only marginally augmented (to 4.1), and the result of multiplicatively operating selection factors found in the analyses based on the simple model of the Methods section are confirmed. The last column gives the expected number of codons in the nine categories under the model of the Methods section using the maximum-likelihood estimates from the subsection above. With a -2 log Q value of 9.3, the simple model is a good approximation of the extended model with respect to the evolutionary part, as measured by the expectations of the various types of changes.
View this table:
[in this window]
[in a new window]
|
Table 5 Observed and Expected Numbers of Nine Groups of Codons Under the Three Models "Ext," "Ext-Mult," and "Full," along with Twice the Decrease in Log Likelihood When Going from the "Means-Free" Model to Each of These Three Models
|
|
 |
Discussion
|
---|
The model presented for the substitution process in DNA sequences in which two genes are encoded in overlapping reading frames takes into account constraints imposed by the genetic code in both of the reading frames. A model for the analyses of sequences with three genes encoded in overlapping frames, in which constraints imposed by three overlapping codes are incorporated, can be obtained simply by extending the neighborhood of dependency one nucleotide to the right. A procedure for calculating the transition probability between two sequences under such a model can be obtained by minor modifications of the procedure presented here. By combining models of the above types, one can achieve a model for sequences with combinations of non-, single-, double-, and triple-coding regions in which constraints imposed by the various combinations of overlapping reading frames are allowed for. In the case of hepatitis B, analyses of the substitution process in the full genome under such a model should be feasible, given the limited size of the genome (3.2 kb) and the small number of genes.
The methodology described for calculating the transition probability between two sequences has two drawbacks. First of all, it allows for the analysis of pairs of sequences only. The development of a procedure for obtaining the likelihood of observing a set of (more than two) sequences at the tips of a given binary tree under a model with dependent substitution rates has yet to be developed. The second drawback is that the methodology is computationally very demanding. It should be possible to reduce the computational time requirements considerably by parallelizing computations. Computational requirements of a similar procedure for more than two sequences related by a tree will be increased due to the larger number of branch length parameters. It is, however, possible that the increase due to more parameters will be counterbalanced by an increase of information in the data regarding the more poorly determined purely evolutionary parameters, which would allow transition probabilities to be approximated by smaller samples of paths.
Given the computational requirements for inference under the presented model, it would be of considerable interest to compare results obtained with this model and methodology to those obtained using the more heuristic and much quicker models and procedures. Computational demands, however, seriously limit the possibility of performing simulation-based bias studies.
 |
Appendix A
|
---|
In this appendix, we first find the normalizing constants Z in the stationary distributions (3) and (9) of a sequence. We next turn to a rewriting of the stationary distribution that allows us to calculate the stationary probability of a subsequence.
Since the model in equation (3) is a special case of the model in equation (9), we first state the formulas for the earlier model. The normalizing constant Z is found by summing up the equilibrium frequencies over all z2, ... , zn-1. If we sum first over (zi2, zi3), i = 2, ... , n - 1, we obtain

| (A1) |
where, for (a, b)
{A, G, C, T}2,

| (A2) |
To evaluate equation (A.1)
, we use the eigenvalues and eigenvectors of the 4 x 4 matrix V. Let
1, ... ,
4 and v1, ... , v4 be the eigenvalues and left eigenvectors, respectively, with
1 being the largest eigenvalue. Writing vi = (viA, viG, viC, viT), we thus have

| (11) |
Let w = (wA, wG, wC, wT) be the vector with
 | (A3) |
where a
{A, G, C, T}, and define coefficients c1, ... , c4 by
 | (A2) |
Then we get, from equation (A.1) ,

| (A4) |
Let us now specialize equation (A.2)
to the simple model in equation (A.1)
, corresponding to
j
1 and
ja =
a. The matrix V from equation (A.2)
becomes

| (A5) |
where the vectors
= (
A,
G,
C,
T) and ß = (ßA, ßG, ßC, ßT) are given by

| (12) |
with
stop =
T
A
A +
T
G
A +
T
A
G. A simple calculation shows that the eigenvalues are

| (A6) |
with corresponding left eigenvectors

| (A7) |
where

| (A8) |
The vector w from equation (A.3)
becomes

| (13) |
and the normalizing constant in equation (A.4)
is

| (A9) |
In particular, if (z12, z13) is different from (T, A) and (T, G) we find that c1 = c2 = 1/
We next turn to a closer study of the stationary measure in equations (3) and (9)
. The stationary measure can be written as a product of conditional densities. To this end, we consider the matrix V in equation (A.2)
again and let r = (rA, rG, rC, rT) be the positive right eigenvector corresponding to the largest eigenvalue
1. For the simple model with V given in equation (A.5) , we find

| (A10) |
Considering the chain {zi1}, i = 1, 2, ... , one finds that this is a homogeneous Markov chain with transition matrix

| (14) |
Since

| (15) |
the stationary density p0 for this Markov chain is

| (A11) |
Furthermore, the conditional density p231|1 of (zi2, zi3, zi+11) given zi1 is

| (A12) |
Thus, the stationary frequency of the septet (zi, zi+1, zi+21) is

| (A13) |
which can be used for evaluating the expected numbers of various types of substitutions. For the simple model, we use equation (A.13) with
j
1,
aj =
a in equation (A.12)
, with r given in equation (A.10)
, and with v1 given in equation (A.7)
.
Finally, in connection with finding a suitable extended model, we use the following conditional distributions:

| (A14) |
where the functions g and h are normalizing functions, defined so that equations (A.14) and (A.15)
are densities.
 |
Appendix B
|
---|
We first derive formula (5). From the definition of q
1(t1; L), we have

| (15) |
where r is the number of substitutions in the path L. We then find

| (16) |
where
is the mean under the measure
defined in equation (6)
.
The weight q
(t; L) of a path L with r substitutions is the product of the densities of r waiting times, times the product of r jump probabilities, times the probability that the last waiting time exceeds t. Since the intensities in these waiting times are the sum over all the positions of the intensity for an event at this position, one sees that q
(t; L) becomes a product

| (1) |
where q
(t, i, 1) depends on (Li-12, Li-13, Li), and q
(t, i, 2) and q
(t, i, 3) depend on (Li, Li+11). To give the exact form of these terms, define si to be the total number of substitutions in the paths Li-12, Li-13, Li1, Li2, Li3, and Li+11. Let ui(r) be the time the rth among these substitutions occurs, and let zi-12(r), zi-13(r), zi(r), and zi+11(r), respectively, be the nucleotide contents of the second and third positions of codon i - 1, the three positions in codon i, and the first position in codon i + 1 after the rth substitution. Set

| (17) |
With these definitions, we can write q
(t, i, j) as

| (18) |
The 
2(Li | Li-1, Li+1) term from equation (7) comes from the conditional distribution of Li given (Li-1, Li+1) and is found from equation (B.1)
to be

| (19) |
The term 
2(L'i | Li-1, Li+1) is defined as above, with Li replaced by L'i.
 |
Acknowledgements
|
---|
A.K.P. was supported by grants from the Danish Natural Science Research Foundation and the Carlsberg Foundation. Two anonymous reviewers and the associate editor are thanked for many helpful comments.
 |
Footnotes
|
---|
Keith Crandall,
Reviewing Editor
1 Keywords: overlapping reading frames
substitution process
dependent substitution rates
MCMC
hepatitis B 
2 Address for correspondence and reprints: Anne-Mette Krabbe Pedersen, Department of Theoretical Statistics, Institute of Mathematics, University of Aarhus, Ny Munkegade, DK-8000 Aarhus C, Denmark. annemet{at}imf.au.dk 
 |
literature cited
|
---|
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368376.[ISI][Medline]
Ganem, D. 1996. Hepadnaviridae and their replication. Pp. 27032737 in B. N. Fields, D. M. Knipe, and P. M. Howley, eds. Fields Virology. Vol. 2, 3rd edition. Lippincott-Raven, Philadelphia.
Gilks, W. R., S. Richardson, and D. J. Spiegelhalter. 1996. Introducing Markov chain Monte Carlo. Pp. 19 in W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, eds. Markov chain Monte Carlo in practice. Chapman and Hall, London.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725736.[Abstract/Free Full Text]
Hasegawa, M., M. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160174.[ISI][Medline]
Hein, J., and J. Støvlbæk. 1995. A maximum-likelihood approach to analyzing nonoverlapping and overlapping reading frames. J. Mol. Evol. 40:181189.[ISI][Medline]
Jensen, J. L., and A. K. Pedersen. 2000. Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv. Appl. Prob. 32:499517.[ISI]
Li, W.-H., C.-I. Wu, and C.-C. Luo. 1985. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2:150174.[Abstract]
Muse, S. V. 1995. Evolutionary analyses of DNA sequences subject to constraints on secondary structure. Genetics 139:14291439.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715724.[Abstract/Free Full Text]
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 1992. Numerical recipes in Fortran. Cambridge University Press, Cambridge, England.
Schöninger, M., and A. von Haeseler. 1994. A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phylogenet. Evol. 3:240247.[Medline]
Tillier, E. R. M., and R. A. Collins. 1995. Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol. Biol. Evol. 12:715.[Free Full Text]
Yang, Z., I. J. Lauder, and H. J. Lin. 1995. Molecular evolution of the hepatitis B virus genome. J. Mol. Evol. 41:587596.[ISI][Medline]
Accepted for publication December 11, 2000.