* Bioinformatics Research Center, North Carolina State University
Bioinformatics Unit, Department of Computer Science, University College London, London, United Kingdom
Laboratory of Biometrics, Graduate School of Agriculture and Life Sciences, University of Tokyo, Tokyo, Japan
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
Correspondence: E-mail: thorne@statgen.ncsu.edu.
Abstract
Key Words: protein structure, evolution, Markov chain Monte Carlo, Bayesian
Introduction
A notable exception to this lack of progress is found in studies by Pedersen and Jensen on evolutionary dependence among sites that is caused by reading frame overlap (Jensen and Pedersen 2000; Pedersen and Jensen 2001). Here, we borrow ideas from Pedersen and Jensen but focus on the evolutionary dependence among codons that is associated with protein tertiary structure. In addition, we build upon a recently proposed technique for simulating protein evolution (Parisi and Echave 2001). With this technique, the rate at which a site experiences change can be modified by substitutions at neighboring sites. Through simulations, Parisi and Echave convincingly demonstrated that their model incorporates information that is unobtainable with widely used models of protein evolution. For example, the simulation studies showed tendencies of certain amino acids to preferentially occupy certain sites in the left-handed β helix domain of UDP-N-acetylglucosamine acyltransferases. When a group of actual sequences with this helix domain was examined, qualitatively similar tendencies were observed.
Parisi and Echave began their simulations with a reference protein of known tertiary structure. They then selected a function to assign a distance between a protein sequence and the reference structure. Underlying the sequence-structure distance is the idea that protein tertiary structure evolves very slowly (Chothia and Lesk 1986; Flores et al. 1993). Therefore, the energy associated with the structure of an ancestral protein (e.g., the reference protein) and the energy associated with the structure of a descendant protein should be similar. The sequence-structure distance can be interpreted as a surrogate for the difference in energies between an ancestral and a descendant protein.
When applying the Parisi-Echave procedure, most simulations that begin with an ancestral sequence will likely result in a descendant that differs from an observed descendant. Because of this apparent drawback, an alternative strategy is needed. Our approach differs from Parisi and Echave's work mainly because our goal is to perform statistical inference when studying the relationship between two observed sequences. It would be computationally inefficient to simulate sequence evolution and then discard all simulated descendant sequences that differ from those observed.
Here, we explain our technique for statistical inference with evolutionary models that have dependence due to protein structure. With simulations, we show that this technique can obtain accurate parameter estimates. With analyses of actual sequence pairs, we show that our parameter estimates are biologically reasonable. Although dependence owing to protein structure is the focus here, slight modifications of our inferential procedure can be applied when the evolutionary dependence among sequence sites arises in other ways.
Modeling Protein Evolution Under Structural Constraints
The key motivation underlying our model is that nonsynonymous substitution rates should depend in part on whether the implied amino acid replacements would stabilize or destabilize the known and assumed fixed protein tertiary structure. To assess the effect of an amino acid replacement on protein stability, a measure is needed for how well the sequence fits the structure both before and after the replacement. If this measure indicates that the replacement would improve the sequence-structure fit, then the rate of the nonsynonymous change should be high. Likewise, if the sequence-structure fit would become poorer as a result of an amino acid replacement, then the corresponding nonsynonymous rate should be low.
Fortunately, systems for assessing the compatibility between sequence and structure have been developed for the purpose of protein fold recognition (e.g., see Jones and Thornton 1996). Our evolutionary model relies on a sequence-structure compatibility criterion that has been successfully applied to protein fold recognition by Jones, Taylor, and Thornton (1992) and Jones (1999). This criterion can be split into two components, one component assessing solvent accessibility and the other assessing pairwise interactions between residues near to each other in 3-dimensional space. In our application we assume that protein tertiary structure is known and fixed, and consequently the values of these two components can be determined for a sequence by threading it onto the known structure. The solvent accessibility and pairwise measures of sequence-structure compatibility, respectively denoted by Es(i) and Ep(i) for a sequence i, are intended to be positively correlated with the free energy the protein has when folded into the known structure. Therefore, Es(i) and Ep(i) have low (ideally negative) values when sequences and structures are relatively compatible. Rather than being actual energy potentials, the specific values of Es(i) and Ep(i) are derived from statistical analysis of large numbers of known structures to assess the "plausibility" of observing different amino acids at different degrees of solvent accessibility and observing different pairs of amino acids at different physical separations (Jones, Taylor, and Thornton 1992; Jones 1999).
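To make the role of these two components concrete, the following sketch computes a toy version of such a score for a sequence threaded onto a fixed structure. The lookup tables and structural annotations here are hypothetical placeholders, not the published potentials of Jones, Taylor, and Thornton (1992).

```python
# Illustrative sketch only: the lookup tables and structural annotations are
# hypothetical placeholders, not the published threading potentials.

def threading_score(seq, burial, contacts, solv_table, pair_table):
    """Return (Es, Ep) for a sequence threaded onto a fixed structure.

    seq        : amino-acid string, one letter per structural position
    burial     : solvent-accessibility category for each position
    contacts   : (i, j) index pairs that are close in three-dimensional space
    solv_table : dict (amino_acid, burial_category) -> pseudo-energy
    pair_table : dict of sorted (aa_i, aa_j) tuples -> pseudo-energy
    Lower (more negative) values indicate a better sequence-structure fit.
    """
    e_s = sum(solv_table[(aa, b)] for aa, b in zip(seq, burial))
    e_p = sum(pair_table[tuple(sorted((seq[i], seq[j])))] for i, j in contacts)
    return e_s, e_p
```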
Except for the treatment of nonsynonymous rates, our parameterization is similar to that of widely used codon models. We include parameters πA, πG, πC, and πT (πA + πG + πC + πT = 1 and all of these parameters are non-negative) so that mutations to the four nucleotide types need not be equally likely. Alternative modeling options that we have not yet pursued would have separate sets of these mutation frequency parameters for each of the three codon positions or separate parameters for each of the 61 codons (e.g., Pedersen, Wiuf, and Christiansen 1998). Our model contains the parameter κ > 0 because transitions and transversions may occur at different rates, and it contains the parameter u to scale the overall rate of change. To handle nonsynonymous rates, there are the parameters ω, s, and p, which will be discussed in more detail below.
The instantaneous rate of change Ri,j from sequence i to sequence j is set to 0 if sequences i and j differ at more than one nucleotide or if sequence j encodes a premature stop codon. For the other cases where sequences i and j differ by exactly one nucleotide that has type h (h ∈ {A, G, C, T}) in sequence j, our rate matrix entries are:

Ri,j = u πh if the change is a synonymous transversion,
Ri,j = u πh κ if the change is a synonymous transition,
Ri,j = u πh ω e^((Es(i)−Es(j))s + (Ep(i)−Ep(j))p) if the change is a nonsynonymous transversion, and
Ri,j = u πh κ ω e^((Es(i)−Es(j))s + (Ep(i)−Ep(j))p) if the change is a nonsynonymous transition.
When s = p = 0, our model simplifies to the sort of widely used codon model that has been studied by others (e.g., Muse and Gaut 1994; Goldman and Yang 1994). Biologically plausible values of both s and p are positive, for positive values of these parameters favor sequences with good fits to the known structure. The biologically unreasonable case where s < 0 and p < 0 would have evolution favoring sequences that do not fit the known structure well.
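As one way to make the rate parameterization concrete, the sketch below computes a single off-diagonal rate of this form. The genetic-code helper translate, the scoring functions e_s and e_p, and the parameter container are illustrative assumptions, not part of any published implementation.

```python
import math

TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def rate_entry(seq_i, seq_j, site, params, e_s, e_p, translate):
    """Instantaneous rate from sequence i to sequence j, which differ only at `site`.

    params carries pi (dict of nucleotide frequencies), kappa, omega, u, s, p.
    e_s / e_p map a sequence to its solvent-accessibility / pairwise scores, and
    translate maps a nucleotide sequence to its amino-acid sequence (None on a
    premature stop codon).  All of these helpers are assumptions for the sketch.
    """
    aa_j = translate(seq_j)
    if aa_j is None:                            # premature stop codon: rate 0
        return 0.0
    h = seq_j[site]
    rate = params["u"] * params["pi"][h]
    if frozenset((seq_i[site], h)) in TRANSITIONS:
        rate *= params["kappa"]                 # transition rather than transversion
    if translate(seq_i) != aa_j:                # nonsynonymous change
        delta = (e_s(seq_i) - e_s(seq_j)) * params["s"] + \
                (e_p(seq_i) - e_p(seq_j)) * params["p"]
        rate *= params["omega"] * math.exp(delta)
    return rate
```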
The s and p parameters reflect the contribution to nonsynonymous rates that comes from the sequence-structure fit, whereas the parameter ω is intended to capture contributions to nonsynonymous rates that are external to the protein of interest. A variety of external impacts on nonsynonymous rates can be envisioned. For example, the value of ω may be less than 1 if the protein being studied is part of a co-adapted system that might be disrupted when nonsynonymous changes cause the protein to function less well in the system. The value of ω could exceed 1 if, for example, nonsynonymous changes to the protein helped a pathogen evade the immune system of its host. We view this distinction between effects on nonsynonymous rates that are external to the protein (represented by ω) and effects on nonsynonymous rates that are internal to the protein (represented by s and p) as being potentially useful for characterizing the process of molecular evolution. Although our implementation forces all codons to share the same ω value, it would be straightforward to adopt previously proposed strategies that allow ω to vary among sites (e.g., Yang et al. 2000).
Stationary Probabilities of Sequences
A nice feature of this model is the explicit form for the equilibrium distribution of each possible coding sequence of length N. For simplicity of notation, let θ represent all the parameters in the rate matrix R (i.e., θ = {κ, ω, s, p, u, πA, πC, πG, πT}) and use im to represent the nucleotide at position m in DNA sequence i. Sequence i has stationary probability
p(i|θ) = πi1 πi2 ··· πiN e^(−2sEs(i) − 2pEp(i)) / Σj [πj1 πj2 ··· πjN e^(−2sEs(j) − 2pEp(j))],     (1)

where πim denotes the frequency parameter (πA, πC, πG, or πT) corresponding to the nucleotide occupying position m of sequence i, and the sum in the denominator runs over all sequences j of length N that do not encode a premature stop codon.
As with models that have independent evolution of codons (e.g., Goldman and Yang 1994; Muse and Gaut 1994), the stationary probability p(i|θ) does not depend on κ, ω, or u. Interestingly, if s and p are not both zero, then the expected frequencies under stationarity of A, C, G, and T need not be close to πA, πC, πG, and πT. The model is also time reversible; the time reversibility property is

p(i|θ) Ri,j = p(j|θ) Rj,i

for all sequences i and j.
Sequence Path Densities
The relatively general dependence structure of our model does not facilitate calculation of p(j|i, θ). Transition probabilities for conventional models of sequence evolution are calculated by specifying the rate matrix among 61 codon states (or 4 nucleotide states or 20 amino acid states) and then exponentiating this matrix (see Swofford et al. 1996). The 4^N × 4^N dimension of our rate matrix is typically too large to make matrix exponentiation computationally tractable. Rather than exponentiating the matrix, our strategy is to augment the (observed) sequence data i and j with an (unobserved) path ρ, that is, a sequence of events between the two sequences. Here, we will arbitrarily choose i to be the ancestor and j to be the descendant. The sequence path ρ specifies how i is transformed to j on a branch of the evolutionary tree. The information within ρ includes both the times on the branch at which sequence changes occur and the nature of each sequence change event. Because there are actually an infinite number of possible sequence paths that could transform one sequence to another, our strategy is to randomly sample these sequence paths from an appropriate probability density.
Specifically, we adopt a Bayesian inference framework. We specify a prior density p(θ) for our parameters and then sample θ and ρ from their joint posterior density p(θ, ρ|i, j). This sampling of θ and ρ allows their joint posterior distribution to be approximated and thereby serve as the basis for Bayesian parameter estimation. We can express p(θ, ρ|i, j) as

p(θ, ρ|i, j) ∝ p(θ) p(i|θ) p(j, ρ|i, θ).
Consider the rate of change away from a sequence v to any other sequence of length N. This rate of change away from v will be denoted Rv,·, where

Rv,· = Σw≠v Rv,w.

Given these total rates, the density p(j, ρ|i, θ) of a particular path ρ and its resulting descendant j is the usual one for a continuous-time Markov process: each substitution event specified by ρ contributes the instantaneous rate of that particular change, and each interval between consecutive events (as well as the interval following the final event) contributes the exponential probability that the current sequence experiences no change during the interval.
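A minimal sketch of this path-density calculation is given below; the event-list representation and the helper functions rate and total_rate are assumptions made for the example.

```python
import math

def log_path_density(events, T, rate, total_rate, ancestor):
    """Log density of a sequence path under a continuous-time Markov model.

    events     : list of (time, new_sequence) pairs, sorted by time, 0 < time < T
    T          : duration of the branch
    rate       : rate(v, w) -> instantaneous rate from sequence v to sequence w
    total_rate : total_rate(v) -> total rate of change away from sequence v
    ancestor   : the sequence at the start of the branch
    """
    log_p, current, t_prev = 0.0, ancestor, 0.0
    for t, nxt in events:
        # Exponential waiting time with no change, then the specified change.
        log_p += -total_rate(current) * (t - t_prev) + math.log(rate(current, nxt))
        current, t_prev = nxt, t
    log_p += -total_rate(current) * (T - t_prev)    # no event after the last change
    return log_p
```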
Metropolis-Hastings Algorithm
Via the Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970), we construct a Markov chain on the combined θ and ρ state space with stationary distribution p(θ, ρ|i, j). To implement our Markov chain Monte Carlo (MCMC) algorithm, we randomly pick an initial state (θ(0), ρ(0)) from the set of possibilities where p(θ, ρ|i, j) exceeds zero. As described below in detail, we then propose random new values θ′ and ρ′ conditional on the current values θ and ρ. The proposal density will be denoted J(θ′, ρ′|θ, ρ). With probability equal to the minimum of 1 and

[p(θ′, ρ′|i, j) J(θ, ρ|θ′, ρ′)] / [p(θ, ρ|i, j) J(θ′, ρ′|θ, ρ)],     (6)

the proposed state (θ′, ρ′) becomes the next state of the Markov chain; otherwise, the chain remains at (θ, ρ).
Because of the sum over all sequences in the denominator of Equation (1), it is not computationally feasible to explicitly calculate the p(i|θ) or p(i|θ′) terms in Equation (6). Therefore, we approximate the ratio p(i|θ′)/p(i|θ). Our approximation strategy relies on randomly sampling a group of M sequences of length N from the stationary distribution of sequences for parameter values θ* = {κ*, ω*, s*, p*, u*, πA*, πC*, πG*, πT*}. The M sequences are effectively independently sampled from this stationary distribution and will be denoted σ(1), σ(2), ... , and σ(M). The sampling of σ(h) from p(σ(h)|θ*) is achieved via construction of a Markov chain with a state space consisting of all sequences of length N. This Gibbs Sampling approach (Geman and Geman 1984) has the desired p(σ(h)|θ*) as its stationary distribution and exploits the fact that the numerator of Equation (1) is straightforward to calculate even though the denominator may be computationally intractable. Consider four DNA sequences of length N that are identical except for the residue at one specific nucleotide site. The sequence with an A at this sole position will be denoted by vA, while the sequences with C, G, and T will be respectively denoted by vC, vG, and vT. Because the denominator of Equation (1) need not be evaluated, it is easy to determine p(vβ|θ*)/(p(vA|θ*) + p(vC|θ*) + p(vG|θ*) + p(vT|θ*)) for all β ∈ {A, C, G, T}. Conditional upon N − 1 of the N sequence positions, the residue at the sole position that is free to vary can therefore be randomly sampled according to its stationary probability.
The initial state of the Markov chain defined by our Gibbs Sampler is randomly selected from the set of all DNA sequences of length N that lack a stop codon. Thereafter, the Gibbs Sampler repeatedly cycles through two steps. The first step is to randomly select a site at which to propose a change. The second step is to fix the sequence at all positions except this site and then randomly choose the nucleotide to occupy the selected site according to the stationary probabilities of the four sequences that represent the four possible residues at this site. This Gibbs Sampler allows a sequence σ(h) to be sampled with probability p(σ(h)|θ*). Because we desire the M sequences σ(1), σ(2), ... , σ(M) to be effectively independent samples, it is important to simulate the Markov chain for a long time between the sampling of each of the M sequences.
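The site-wise update just described might be sketched as follows, where unnorm_stationary stands for the computable numerator of Equation (1) and is assumed to return 0 for any candidate sequence containing a premature stop codon.

```python
import random

def gibbs_sample_sequence(seq, n_sweeps, unnorm_stationary):
    """One long Gibbs run over a nucleotide sequence (list of 'A','C','G','T').

    unnorm_stationary(seq) returns the numerator of the stationary probability;
    the intractable denominator cancels in the site-wise conditional.  The
    helper is an assumption for this sketch.
    """
    seq = list(seq)
    for _ in range(n_sweeps):
        site = random.randrange(len(seq))
        weights = []
        for nuc in "ACGT":
            candidate = seq[:site] + [nuc] + seq[site + 1:]
            weights.append(unnorm_stationary(candidate))
        total = sum(weights)
        r, acc = random.uniform(0.0, total), 0.0
        for nuc, w in zip("ACGT", weights):
            acc += w
            if r <= acc:
                seq[site] = nuc
                break
    return seq
```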
An importance sampling argument (see Appendix A) shows that, when M is sufficiently large,

p(i|θ′)/p(i|θ) ≈ [f(i, θ′)/f(i, θ)] [Σh f(σ(h), θ)/f(σ(h), θ*)] / [Σh f(σ(h), θ′)/f(σ(h), θ*)],     (7)

where f(v, θ) denotes the numerator of Equation (1) evaluated for sequence v with parameter values θ, and each sum runs over h = 1, 2, ... , M.
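A sketch of this estimator, written in terms of the numerator f of Equation (1), is shown below; it illustrates the importance sampling idea rather than the exact implementation used here.

```python
def stationary_ratio(i, theta_new, theta_old, theta_star, samples, f):
    """Estimate p(i|theta_new) / p(i|theta_old) by importance sampling.

    samples       : M sequences drawn from the stationary distribution at theta_star
    f(seq, theta) : numerator of the stationary probability, which is computable
                    even though its normalizing constant is not (an assumed helper)
    """
    # Each sum estimates a ratio of normalizing constants relative to theta_star.
    z_old = sum(f(s, theta_old) / f(s, theta_star) for s in samples)
    z_new = sum(f(s, theta_new) / f(s, theta_star) for s in samples)
    return (f(i, theta_new) / f(i, theta_old)) * (z_old / z_new)
```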
Proposing θ
Our MCMC implementation actually consists of various proposal distributions J(θ′, ρ′|θ, ρ), and the Markov chain is formed by cycling through these different proposal distributions. Each proposal can result in only slight differences between (θ, ρ) and (θ′, ρ′). Most of these proposal distributions lead to slight differences between θ′ and θ but set ρ′ = ρ. For example, one proposal distribution has ρ′ = ρ and θ′ = θ except that κ′ ≠ κ. We employ similar proposal steps that propose change to only s, only p, only u, or only ω. All of these proposal steps are Metropolis-Hastings schemes that involve sampling a proposed parameter value from a uniform distribution that is determined by the current parameter value, a prespecified "window" length surrounding the current parameter value, and any constraints on the parameters.
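A sketch of one such windowed proposal is given below. Truncating the window at a constraint boundary is only one possible treatment of the constraints, and the returned Hastings correction reflects that particular choice; the text does not specify which scheme is used.

```python
import math
import random

def propose_in_window(current, half_width, lower=None, upper=None):
    """Uniform "window" proposal for a single parameter.

    The proposal is drawn uniformly from the part of
    (current - half_width, current + half_width) allowed by the optional
    constraints `lower` and `upper` (e.g. kappa > 0).  Truncation makes the
    proposal asymmetric, so the log Hastings ratio
    log J(old|new) - log J(new|old) is returned with the proposed value.
    """
    lo = current - half_width if lower is None else max(lower, current - half_width)
    hi = current + half_width if upper is None else min(upper, current + half_width)
    proposal = random.uniform(lo, hi)
    rev_lo = proposal - half_width if lower is None else max(lower, proposal - half_width)
    rev_hi = proposal + half_width if upper is None else min(upper, proposal + half_width)
    return proposal, math.log((hi - lo) / (rev_hi - rev_lo))
```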
There is a separate proposal step for each of πA, πG, πC, and πT, and these are also conventional in that the proposed value for πA, πG, πC, or πT will involve sampling from some uniform distribution. However, these steps are slightly different from those for κ, s, p, u, and ω because the constraint that πA + πG + πC + πT = 1 necessitates that a change in one of these four parameters cannot be made without an accompanying total change of the opposite sign in the other three. Our implementation partitions this total change of the opposite sign among the other three parameters according to the relative sizes of the three parameters. For example, if πA′ = πA + δ is obtained by sampling from a uniform distribution centered about πA, then we set πG′ = πG − δ πG/(πG + πC + πT), πC′ = πC − δ πC/(πG + πC + πT), and πT′ = πT − δ πT/(πG + πC + πT). If the three free parameters had been πA, πG/(πG + πC + πT), and πC/(πG + πC + πT) rather than πA, πG, and πC, then the proposal density J(θ′, ρ′|θ, ρ) would be the uniform density for πA′ because only πA would be changed with this proposal step. Because our parameterization is actually in terms of πA, πG, and πC, J(θ′, ρ′|θ, ρ) becomes the uniform density for πA′ multiplied by a factor 1/(1 − πA′)² that represents the Jacobian of the transformation from πA, πG/(πG + πC + πT), and πC/(πG + πC + πT) to πA, πG, and πC.
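The sketch below illustrates this frequency proposal. The handling of the simplex boundary and the assumption of equal, untruncated windows in the Hastings correction are simplifications made for the example.

```python
import math
import random

def propose_pi(pi, target, half_width):
    """Propose new nucleotide frequencies by perturbing one of them.

    `pi` maps 'A', 'C', 'G', 'T' to frequencies summing to one, and `target`
    names the frequency being updated.  The change drawn for the target is
    spread over the other three frequencies in proportion to their current
    sizes, so the four frequencies still sum to one.  The returned log
    Hastings ratio includes the Jacobian factor discussed above and assumes
    the uniform window is never truncated (a simplification for this sketch).
    """
    others = [n for n in "ACGT" if n != target]
    rest = sum(pi[n] for n in others)              # equals 1 - pi[target]
    delta = random.uniform(-half_width, half_width)
    new_pi = dict(pi)
    new_pi[target] = pi[target] + delta
    for n in others:
        new_pi[n] = pi[n] - delta * pi[n] / rest
    if new_pi[target] <= 0.0 or new_pi[target] >= 1.0:
        return pi, float("-inf")                   # reject proposals outside the simplex
    log_hastings = 2.0 * (math.log(1.0 - new_pi[target]) - math.log(1.0 - pi[target]))
    return new_pi, log_hastings
```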
Proposing Site Paths
When proposing a new path ρ′ different from ρ, all parameter values in θ are held constant. The sequence paths ρ and ρ′ represent possible ways to transform an ancestral sequence i into a descendant j. A sequence path is fully specified by a series of nucleotide substitutions and the specific times at which the substitutions occur. For each possible sequence path, i must have residue ir at site r, whereas j must have residue jr at this site. A sequence path ρ can also be represented by a set of site paths ρ1, ρ2, ... , ρr, ... , ρN. A particular site path ρr specifies all nucleotide substitutions that change site r in path ρ, as well as the specific times of the changes. Our approach for proposing ρ′ is to set ρ′ equal to ρ with the exception of the site path at one randomly selected site. For this site, we base the sampling of the proposed site path on a simple model of evolution that assumes independence of nucleotide substitutions among sites. To emphasize that this simple model may have a parameterization that is quite different from that of the dependent-sites model, φ rather than θ will represent the independent-sites model and its parameters. Our goal will be to sample a site path ρr for site r from p(ρr|ir, jr, φ).
Because of the relative ease of sampling the site paths ρr, we adopt what is commonly referred to as the Felsenstein 1984 model (see Felsenstein 1989). Nielsen (2001) has used a similar algorithm for sampling site paths from this model for a different application. Nielsen's algorithm, or an algorithm that simulates site paths beginning with ir and then discards the paths unless they end with jr, would be suitable for our application. However, our implementation uses a different algorithm for sampling site paths ρr from p(ρr|ir, jr, φ) (see Appendix B). No matter what method is selected for sampling from p(ρr|ir, jr, φ), the resulting proposal ρ′ should be accepted or rejected according to Equation (6).
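The simple rejection alternative mentioned above (simulate a site path starting from ir and discard it unless it ends at jr) can be sketched as follows; it is not the algorithm of Appendix B, and the rate-matrix representation is an assumption for the example.

```python
import random

def sample_site_path_by_rejection(start, end, T, rate_matrix, max_tries=100000):
    """Sample a single-site path conditional on its endpoints by rejection.

    rate_matrix[a][b] is the substitution rate from nucleotide a to b under the
    independent-sites proposal model.  Simulation begins at `start`; the path
    is kept only if the state at time T equals `end`.
    """
    for _ in range(max_tries):
        state, t, events = start, 0.0, []
        while True:
            total = sum(r for b, r in rate_matrix[state].items() if b != state)
            t += random.expovariate(total)
            if t >= T:
                break
            # Choose the next nucleotide in proportion to its rate.
            r, acc = random.uniform(0.0, total), 0.0
            for b, rate in rate_matrix[state].items():
                if b == state:
                    continue
                acc += rate
                if r <= acc:
                    state = b
                    break
            events.append((t, state))
        if state == end:
            return events
    raise RuntimeError("no accepted path; endpoints may be too improbable")
```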
Examples
There is a trade-off when determining the mesh size for the θ* values. A mesh that is too wide will produce poor approximations with Equation (7). A mesh that is overly fine will be computationally infeasible. By selecting 9 points along the ranges of each of the s and p uniform priors to define the mesh size separating successive values of these parameters on the θ* grid, we found an acceptable compromise between approximation accuracy and computational requirements. We performed some MCMC runs with a finer mesh and obtained similar results to those obtained when 9 points defined the grid for each of s and p. Posterior approximations when only 5 points defined the grid for each of s and p were not satisfactory. We did some exploratory analyses with a prior for s that was uniform on the interval (−4, 4) and a prior for p that was uniform on the interval (−0.3, 0.3). With this prior for s and p, 17 points were used to define the grid for each of s and p. We found that this latter prior for s and p yielded very similar results to those with the (−2, 2) prior for s and the (−0.15, 0.15) prior for p (data not shown). The remaining three dimensions of the 5-dimensional θ* grid represent πA, πC, and πG. For these parameters, we have 12 points define the mesh along each of these dimensions. The grid points for these parameters are each evenly spaced in the interval from 0.14 to 0.36. The value of M, the number of sequences sampled from the stationary distribution represented by a particular grid point, was 1,000 for the results presented here. The number of Gibbs iterations between successively sampled sequences was set to 10 multiplied by the length in nucleotides of the sequences.
To enhance MCMC convergence, we have each MCMC cycle include one proposed update of s, p, u, κ, and ω but five proposed updates of site paths. Each MCMC cycle also proposes an update to one of πA, πC, πG, and πT, so that a proposed update to each of these parameters is made once per four cycles. Each analysis presented here consisted of 100,000 MCMC cycles. The first 5,000 of these cycles were treated as a "burn-in" period and did not contribute to the posterior approximations. For all cases, two separate MCMC runs were performed to check for convergence to the desired posterior distribution. Plots of sampled parameter values versus the cycle number at which samples were taken were also examined to check for convergence.
One of the codon models implemented by the PAML software (Yang 1997) is equivalent to our parameterization when s = 0 and p = 0. For the uniform prior distributions that we employ, the maximum of the likelihood surface for s = 0 and p = 0 should be the point at which the joint posterior density is maximized. As a simple check of our software, the maximum likelihood estimates returned by PAML can be compared to the samples from the posterior distribution that are obtained by our MCMC routine. For all pairs of sequences used in this article, we have compared the PAML version 3.12 output with that of our software and have found a close correspondence (data not shown).
It is conventional to express the amount of evolution separating two sequences in terms of the expected number of nucleotide substitutions per site. With our dependence model, the substitution rate at a site is determined by the residues occupying other sites, and it can change when other sites change. To explore the prior distribution of the expected number of substitutions per site with our dependence model, we used a simulation procedure. For each simulation, values of the parameters that comprise θ were sampled from their prior distribution. Based on the resulting θ, the aforementioned Gibbs Sampling technique was employed to randomly select a sequence from its stationary distribution. This randomly sampled sequence was treated as being the ancestral sequence, and then evolution was simulated according to our model to produce a descendant sequence. By computing the average number of changes per site for each simulated sequence path and by repeating the simulation procedure 10,000 times, the prior distribution for the expected number of changes per site was approximated to yield the results presented here.
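The simulation loop can be sketched as follows; the three helper functions are stand-ins for the machinery described above rather than named routines from any implementation.

```python
def prior_changes_per_site(n_sims, sample_theta, sample_stationary_seq, simulate_path):
    """Approximate the prior distribution of the number of changes per site.

    sample_theta()                 draws parameter values from their priors
    sample_stationary_seq(theta)   draws an ancestral sequence by Gibbs sampling
    simulate_path(seq, theta)      evolves the sequence along the branch and
                                   returns the list of substitution events
    All three helpers are assumed to be supplied by the machinery above.
    """
    draws = []
    for _ in range(n_sims):
        theta = sample_theta()
        ancestor = sample_stationary_seq(theta)
        events = simulate_path(ancestor, theta)
        draws.append(len(events) / len(ancestor))
    return draws
```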
Annexin V
We first illustrate our techniques by analysis of two different annexin V sequence pairs that each consist of 314 aligned codons. One pair includes the chicken and human annexin V sequences (GenBank accession numbers M30971 and X12454) and the other pair consists of the mouse and rat sequences (GenBank accession numbers NM_009673 and NM_013132). The two pairs of sequences represent nonoverlapping evolutionary lineages and substantially different divergence levels. The human-chicken pair is 78% identical at the amino acid level and 74% identical at the nucleotide level, whereas the mouse-rat pair is 95% identical at the amino acid level and 93% identical at the nucleotide level.
Annexin V is a calcium-dependent phospholipid-binding protein that has potent vascular anticoagulant activity (Andree et al. 1990), is also an inhibitor of protein kinase C (Schlaepfer, Jones, and Haigler 1992), and has been associated with the apoptotic pathway (Reutelingsperger and van Heerde 1997) and the initiation of the hepatitis B virus infection process (Neurath and Strick 1994). It was selected for this study as an unbiased, arbitrary example of a typical protein of biological interest. The tertiary structure of the chicken annexin V protein has been experimentally determined (Bewley et al. 1993), and we assume that the native conformations of all four annexin V sequences are essentially identical to this structure.
For these two annexin V sequence pairs, the posterior distributions of s and p are quite similar (table 1). For both sequence pairs, the posteriors of s and p are concentrated in the s > 0 and p > 0 quadrant. This is the biologically plausible quadrant of the parameter space where evolution favors sequences that fold well into the known structure. The posterior distributions of κ, ω, and u were relatively unaffected by whether s and p were forced to be zero. This indicates that the information contributing to the s and p estimates is largely independent of the information contributing to the estimates of the κ, ω, and u parameters.
To evaluate the potential performance of our estimation procedure, we simulated annexin V evolution using the posterior means of the model parameters as estimated from the human-chicken pair. For each of five simulated annexin V pairs, the values of κ, ω, u, πA, πC, πG, and πT were set to their posterior means from our analysis of the original human-chicken data while the values of s and p varied among simulations. The posterior means of s and p for the human-chicken pair were approximately s = 0.947 and p = 0.0282. One pair of sequences was simulated for each of (s = 0.947, p = 0.0282), (s = 0.947, p = −0.0282), (s = −0.947, p = 0.0282), (s = −0.947, p = −0.0282), and (s = 0, p = 0). In each simulation, an ancestral sequence was selected via Gibbs Sampling from the appropriate stationary distribution of sequences. A descendant sequence was then evolved according to our model, and the resulting sequence pair was analyzed with the approach described here. For each of the five simulation scenarios, the posterior densities of s and p are concentrated close to the true values of these parameters (table 2).
For both annexin V sequence pairs, we have explored variation due to protein structure among the possible nonsynonymous changes that could affect a sequence. For all 2,058 possible sequences j that differ by one nonsynonymous substitution from the observed mouse sequence i, the rate factor associated with sequence-structure compatibility, e^((Es(i)−Es(j))s + (Ep(i)−Ep(j))p), was calculated. To produce the histogram of these 2,058 values shown in figure 1A, the posterior means of s and p from the mouse-rat comparison were used. A corresponding histogram generated with the 2,069 possible nonsynonymous changes to the chicken sequence is similar and is not shown. Both histograms indicate that the nonsynonymous rate variation due to protein structure is substantial, but both the mean and the median effect of nonsynonymous substitutions on structural compatibility are deleterious. Although many of the possible nonsynonymous changes improve the sequence-structure compatibility (i.e., e^((Es(i)−Es(j))s + (Ep(i)−Ep(j))p) > 1), this improved compatibility does not overcome the low posterior mean of ω by making any of these possible nonsynonymous changes positively selected.
Lysozyme
In their pioneering work on adaptation, Stewart and collaborators (e.g., Stewart, Schilling, and Wilson 1987; Messier and Stewart 1997; see also Yang 1998; Yang and Nielsen 2002) demonstrated that the evolutionary lineage leading to the Colobine monkeys (e.g., Colobus guereza) has experienced an excess of nonsynonymous substitutions during lysozyme c evolution. The apparent explanation for this excess is that the Colobine monkeys have acquired a foregut in which bacteria ferment leaves. In the Colobine monkeys, lysozyme c is active at a lower pH and is more resistant to pepsin than in other primates. These properties of lysozyme c in Colobine monkeys may be adaptations that help with lysis of bacteria that pass from the foregut fermentation chamber into the true stomach (Stewart, Schilling, and Wilson 1987; Messier and Stewart 1997).
To investigate lysozyme c evolution with our approach, we selected two pairs of lysozyme c sequences that represent nonoverlapping evolutionary lineages and that each consist of 130 aligned codons. The phylogenetic path separating the Rhesus macaque (Macaca mulatta) and Colobus guereza pair (GenBank accession numbers X60236 and U76916, 96% nucleotide identity and 91% amino acid identity) includes lineages that have previously been demonstrated to have experienced a strong excess of nonsynonymous substitutions (Stewart, Schilling and Wilson 1987; Messier and Stewart 1997; Yang and Nielsen 2002). The phylogenetic path separating the marmoset (Callithrix jacchus) and human pair (GenBank accession numbers M19045 and U76923, 93% nucleotide identity and 87% amino acid identity) does not seem to have experienced such a strong excess of nonsynonymous change (Stewart, Schilling, and Wilson 1987; Messier and Stewart 1997; Yang and Nielsen 2002), although our estimates indicate it has a ratio of nonsynonymous to synonymous rates that is substantially higher than for the annexin V pairs (see below). The tertiary structure of the human lysozyme c protein has been experimentally determined (Harata, Abe, and Muraki 1998), and we assume that the native conformations of all four lysozyme c sequences are essentially identical to this structure.
As with the annexin V analyses, the posterior distributions of s and p are concentrated in the biologically plausible quadrant of the parameter space for both lysozyme c sequence pairs (table 3). In general, we do not find strong correlations between ω, s, and p in our samples from the posterior distribution. As an example, Pearson's correlation for the human-marmoset comparison was approximately 0.13 between s and p, 0.00 between s and ω, and 0.02 between p and ω.
Future Directions
Although the multiple sequence version of our methods may tend to be computationally impractical for inference of evolutionary tree topologies, the extension to multiple sequences may lead to improved methods for ancestral sequence reconstruction. With our approach, ancestral sequences that do not fold well into a tertiary structure are unlikely to be inferred. In addition, the paths from ancestral to descendant sequences may allow the order of adaptive nucleotide substitution events during protein evolution to be inferred.
Other methods for detecting positive selection (e.g., Yang et al. 2000; Yang and Nielsen 2002) treat each codon independently and ignore protein structure. In reality, the question of whether a particular nonsynonymous substitution confers a selective advantage should depend to some extent on the amino acids encoded by other codons. Our technique has the advantages of incorporating and quantifying this dependence.
Although the sequence-structure compatibility criterion employed here (Jones, Taylor, and Thornton 1992; Jones 1999) has demonstrated considerable success when applied to protein fold recognition, other sequence-structure compatibility functions have been proposed (e.g., Singh, Tropsha, and Vaisman 1996), and it will be interesting to explore which of these criteria is most appropriate for describing the process of molecular evolution. Incorporation of alternative criteria would necessitate few changes to our approach.
The sequence-structure compatibility criterion can be viewed as a surrogate for fitness. Measures of the fitness of a sequence that are not explicitly derived from protein structure could be incorporated into our approach without major modifications of our inference procedure. Such modifications might be of particular interest for studying bacteriophage or retroviruses because these quickly evolving organisms are good systems for experimentally measuring the fitness consequences of particular nucleotide substitutions (e.g., Bull, Badget, and Wichman 2000). For making inferences regarding history and evolutionary process, our statistical approaches could supplement more direct experimental information.
Appendix A
Let f(v, θ) denote the numerator of Equation (1) for sequence v and parameter values θ, so that p(v|θ) = f(v, θ)/Z(θ) with Z(θ) = Σw f(w, θ), the sum running over all coding sequences of length N. For a sequence σ drawn from p(·|θ*), the expectation of f(σ, θ)/f(σ, θ*) is Σw [f(w, θ*)/Z(θ*)] [f(w, θ)/f(w, θ*)] = Z(θ)/Z(θ*). Therefore, when M is large, (1/M) Σh f(σ(h), θ)/f(σ(h), θ*) approximates Z(θ)/Z(θ*). Writing p(i|θ′)/p(i|θ) = [f(i, θ′)/f(i, θ)] [Z(θ)/Z(θ′)] and approximating Z(θ)/Z(θ′) by the ratio of the two Monte Carlo sums yields Equation (7).
Appendix B
Because of the assumed independence of the within-group and general processes, the total number of events experienced at a site in a time duration T has a Poisson distribution with parameter (g + w)T, where g and w are the rates of the general and within-group processes, respectively. This formulation of the Felsenstein 1984 model actually allows general or within-group events to occur that result in the same nucleotide type occupying a site before and after an event. This sort of "hidden" event is algebraically convenient for determining how to sample ρr from p(ρr|ir, jr, φ), and it is not difficult to normalize so that the rate units are the expected number of nucleotide changes rather than the expected number of events.
With the Felsenstein 1984 model, the probability that a site is occupied by nucleotide type β after an amount of evolution T, given that the site was originally occupied by type α, is

Pαβ(T) = e^(−(g+w)T) 1{α = β} + e^(−gT) (1 − e^(−wT)) (πβ/πH(α)) 1{β ∈ H(α)} + (1 − e^(−gT)) πβ,

where H(α) denotes the group (purines or pyrimidines) containing α, πH(α) is πR = πA + πG if α is a purine and πY = πC + πT if α is a pyrimidine, and 1{·} equals 1 when the bracketed condition holds and 0 otherwise.
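A small function for this transition probability, written directly from the three cases in the expression above, is sketched below; base frequencies are passed as a dict.

```python
import math

PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

def f84_transition_prob(a, b, T, g, w, pi):
    """Transition probability under the two-process (general/within-group)
    formulation of the Felsenstein 1984 model.

    g, w : rates of general and within-group events;  pi : base frequencies.
    """
    group = PURINES if a in PURINES else PYRIMIDINES
    group_freq = sum(pi[n] for n in group)
    prob = (1.0 - math.exp(-g * T)) * pi[b]                  # at least one general event
    if b in group:                                           # same purine/pyrimidine group
        prob += math.exp(-g * T) * (1.0 - math.exp(-w * T)) * pi[b] / group_freq
    if a == b:                                               # no event of either kind
        prob += math.exp(-(g + w) * T)
    return prob
```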
We employ three steps to sample a particular site path from the distribution p(ρr|ir, jr, φ). First, a random sample is obtained from the distribution of the number of general and within-group changes conditional on ir, jr, and φ. The second step, conditional upon the number of events that occur at site r, is to randomly sample the times of these events from a uniform distribution that spans the time period of length T. The uniform distribution is used because, conditional upon the number of events, event times are uniformly distributed for a time-homogeneous Poisson process such as the one defined by the Felsenstein 1984 model. After the numbers of general and within-group events have been determined and the event times have been assigned, the third step is to appropriately assign the specific residue types that are yielded by each event.
The form of the transition probabilities facilitates sampling the number of general events and the number of within-group events at a site conditional upon the information that the site was occupied by type ir at the beginning of the branch and by type jr at the end of the branch. The most complicated situation is for ir = jr. In this situation, ρr has 0 general events and 0 within-group events with probability e^(−gT) e^(−wT)/Pirjr(T). The probability of 0 general events and at least one within-group event at the site for ir = jr is e^(−gT) (1 − e^(−wT)) (πjr/πH(jr))/Pirjr(T). Given that this happens, the probability of exactly nw within-group events on the path is (wT)^nw e^(−wT)/(nw! (1 − e^(−wT))) for nw ∈ {1, 2, ...}. Therefore, the probability of 0 general events and nw within-group events, given that ir = jr and nw ∈ {1, 2, ...}, is e^(−gT) (wT)^nw e^(−wT) (πjr/πH(jr))/(nw! Pirjr(T)). Likewise, the probability of ng > 0 general events and nw within-group events is [(gT)^ng e^(−gT)/ng!] [(wT)^nw e^(−wT)/nw!] πjr/Pirjr(T) when ir = jr. Similar reasoning allows sampling from the probability distribution of ng and nw conditional upon the beginning nucleotide type ir and the ending type jr for the cases where ir ≠ jr. After determining ng and nw, the times for each general or within-group event on a path can be randomly sampled from a uniform distribution along the branch.
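For the ir = jr case, the sampling of the event counts can be sketched as follows; the Poisson helpers are illustrative and the weights are the numerators of the case probabilities above, whose total equals Pirjr(T).

```python
import math
import random

PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

def _poisson(mean):
    """Poisson draw by inversion of the cumulative distribution (small means)."""
    u = random.random()
    k, term = 0, math.exp(-mean)
    cum = term
    while u > cum:
        k += 1
        term *= mean / k
        cum += term
    return k

def _positive_poisson(mean):
    """Poisson draw conditioned on being at least one."""
    while True:
        k = _poisson(mean)
        if k >= 1:
            return k

def sample_event_counts_same_endpoint(a, T, g, w, pi):
    """Sample (n_general, n_within) for a site that starts and ends in state a."""
    group = PURINES if a in PURINES else PYRIMIDINES
    group_freq = sum(pi[n] for n in group)
    w_none = math.exp(-(g + w) * T)                                    # no events at all
    w_within_only = math.exp(-g * T) * (1.0 - math.exp(-w * T)) * pi[a] / group_freq
    w_general = (1.0 - math.exp(-g * T)) * pi[a]                       # >= 1 general event
    r = random.uniform(0.0, w_none + w_within_only + w_general)
    if r < w_none:
        return 0, 0
    if r < w_none + w_within_only:
        return 0, _positive_poisson(w * T)
    return _positive_poisson(g * T), _poisson(w * T)
```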
The last step in randomly sampling a site path conditional upon the beginning type ir and ending type jr is to randomly determine which nucleotide type occupies a site after each event. The final event on a site path must always be to type jr because this type occupies the site at the end of the branch. This final event may be of either the general or within-group variety.
If the last general event is not the final event on a site path, then, to have type jr at the end of the site path, the last general event must be to a nucleotide type included in group H(jr). Specifically, the last general event results in type k with probability πk/πH(jr) if k belongs to group H(jr). If k does not belong to H(jr), then the probability is 0 that type k is assigned to occupy the site after the last general event.
Next, all general events that are not the last general event on the path at the site are handled. This is accomplished by having nucleotide type k occupy the site after one of these general events with probability πk. When all general events have been assigned nucleotide types, the final step is to treat any remaining unassigned within-group events. Because the within-group events cannot result in transversions, the time intervals on a branch during which a site is occupied by a purine and those during which it is occupied by a pyrimidine depend solely on whether the site is occupied by a purine or a pyrimidine at the beginning of the branch and on which general events result in purines and which result in pyrimidines. If an unassigned within-group event occurs during a "purine" interval, then the event results in A with probability πA/πR and in G with probability πG/πR. If an unassigned within-group event occurs during a "pyrimidine" interval, then the event results in C with probability πC/πY and in T with probability πT/πY.
The procedure discussed to this point allows site paths to be sampled conditional upon the residues that begin and end the path, but some events on the sampled site path may be "hidden" in that the nucleotides before and after the event are identical. Because the goal is to propose a site path according to the Felsenstein 1984 model and then evaluate the resulting sequence path with a model that has dependence among codons and that is not parameterized so as to consider hidden events, all hidden events are pruned from the site path before it is considered in Equation (6). Because the Felsenstein 1984 model can also be parameterized in the conventional way that has all evolutionary events result in a changed sequence, this conventional parameterization is adopted for calculation of J(θ′, ρ′|θ, ρ) and J(θ, ρ|θ′, ρ′).
Acknowledgements
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Andree, H. A. M., C. P. M. Reutelingsperger, R. Hauptmann, H. C. Hemker, W. Th. Hermens, and G. M. Willems. 1990. Binding of vascular anticoagulant α (VACα) to planar phospholipid bilayers. J. Biol. Chem. 265:4923-4928.
Bewley, M. C., C. M. Boustead, J. H. Walker, D. A. Waller, and R. Huber. 1993. Structure of chicken annexin V at 2.25-Å resolution. Biochemistry 32:3923-3929.
Bull, J. J., M. R. Badget, and H. A. Wichman. 2000. Big-benefit mutations in a bacteriophage inhibited with heat. Mol. Biol. Evol. 17:942-950.
Chothia, C., and A. M. Lesk. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5:519-527.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary changes in proteins. Pp. 345-352 in Atlas of protein sequence and structure, vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.
Felsenstein, J. 1989. Phylogenetic Inference Package (PHYLIP), version 3.2. University of Washington, Seattle. Cladistics 5:164-166.
Flores, T. P., C. A. Orengo, D. S. Moss, and J. M. Thornton. 1993. Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci. 2:1811-1826.
Geman, S., and D. Geman. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6:721-741.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736.
Harata, K., Y. Abe, and M. Muraki. 1998. Full-matrix least squares refinement of lysozymes and analysis of anisotropic thermal motion. Proteins 30:232-243.
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.
Jensen, J. L., and A. K. Pedersen. 2000. Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv. Appl. Prob. 32:499-517.
Jones, D. T. 1999. GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287:797-815.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. A new approach to protein fold recognition. Nature 358:86-89.
Jones, D. T., and J. M. Thornton. 1996. Potential energy functions for threading. Curr. Opin. Struct. Biol. 6:210-216.
Koshi, J. M., D. P. Mindell, and R. A. Goldstein. 1997. Beyond mutation matrices: physical chemistry based evolutionary models. Pp. 80-89 in S. Miyano and T. Takagi, eds. Genome informatics 1997. Universal Academy Press, Tokyo.
Koshi, J. M., D. P. Mindell, and R. A. Goldstein. 1999. Using physical chemistry based substitution models in phylogenetic analyses of HIV-1 subtypes. Mol. Biol. Evol. 16:173-179.
Koshi, J. M., and R. A. Goldstein. 1998. Mathematical models of natural amino acid site mutations. Proteins 32:289-295.
Messier, W., and C.-B. Stewart. 1997. Episodic adaptive evolution of primate lysozymes. Nature 385:151-154.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21:1087-1092.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with applications to the chloroplast genome. Mol. Biol. Evol. 11:715-724.
Neurath, A. R., and N. Strick. 1994. The putative cell receptors for hepatitis B virus (HBV), annexin V, and apolipoprotein H, bind to lipid components of HBV. Virology 204:475-477.
Nielsen, R. 2001. Mutations as missing data: inferences on the ages and distributions of nonsynonymous and synonymous mutations. Genetics 159:401-411.
Parisi, G., and J. Echave. 2001. Structural constraints and emergence of sequence patterns in protein evolution. Mol. Biol. Evol. 18:750-756.
Pedersen, A.-M. K., and J. L. Jensen. 2001. A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames. Mol. Biol. Evol. 18:763-776.
Pedersen, A.-M. K., C. Wiuf, and R. B. Christiansen. 1998. A codon-based model designed to describe lentiviral evolution. Mol. Biol. Evol. 15:1069-1081.
Pollock, D. D., W. R. Taylor, and N. Goldman. 1999. Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. 287:187-198.
Pupko, T., I. Pe'er, R. Shamir, and D. Grauer. 2000. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol. Biol. Evol. 17:890-896.
Reutelingsperger, C. P. M., and W. L. van Heerde. 1997. Annexin V: the regulator of phosphatidylserine-catalyzed inflammation and coagulation during apoptosis. Cell. Mol. Life Sci. 53:527-532.
Schlaepfer, D. D., J. Jones, and H. T. Haigler. 1992. Inhibition of protein kinase C by Annexin V. Biochemistry 31:1886-1891.
Singh, R. K., A. Tropsha, and I. I. Vaisman. 1996. Delaunay tessellation of proteins. J. Comput. Biol. 2:213-221.
Stewart, C.-B., J. W. Schilling, and A. C. Wilson. 1987. Adaptive evolution in the stomach lysozymes of foregut fermenters. Nature 330:401-404.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407-514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics, 2nd ed. Sinauer Associates, Sunderland, Mass.
Wollenberg, K. R., and W. R. Atchley. 2000. Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc. Natl. Acad. Sci. USA 97:3288-3291.
Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555-556.
Yang, Z. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568-573.
Yang, Z., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908-917.
Yang, Z., R. Nielsen, and M. Hasegawa. 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15:1600-1611.
Yang, Z., R. Nielsen, N. Goldman, and A.-M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.