Protein Evolution with Dependence Among Codons Due to Tertiary Structure

Douglas M. Robinson*, David T. Jones{dagger}, Hirohisa Kishino{ddagger}, Nick Goldman§ and Jeffrey L. Thorne*,

* Bioinformatics Research Center, North Carolina State University
{dagger} Bioinformatics Unit, Department of Computer Science, University College London, London, United Kingdom
{ddagger} Laboratory of Biometrics, Graduate School of Agriculture and Life Sciences, University of Tokyo, Tokyo, Japan
§ European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom

Correspondence: E-mail: thorne{at}statgen.ncsu.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Modeling Protein Evolution Under...
 Examples
 Future Directions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Markovian models of protein evolution that relax the assumption of independent change among codons are considered. With this comparatively realistic framework, an evolutionary rate at a site can depend both on the state of the site and on the states of surrounding sites. By allowing a relatively general dependence structure among sites, models of evolution can reflect attributes of tertiary structure. To quantify the impact of protein structure on protein evolution, we analyze protein-coding DNA sequence pairs with an evolutionary model that incorporates effects of solvent accessibility and pairwise interactions among amino acid residues. By explicitly considering the relationship between nonsynonymous substitution rates and protein structure, this approach can lead to refined detection and characterization of positive selection. Analyses of simulated sequence pairs indicate that parameters in this evolutionary model can be well estimated. Analyses of lysozyme c and annexin V sequence pairs yield the biologically reasonable result that amino acid replacement rates are higher when the replacements lead to energetically favorable proteins than when they destabilize the proteins. Although the focus here is evolutionary dependence among codons that is associated with protein structure, the statistical approach is quite general and could be applied to diverse cases of evolutionary dependence where surrogates for sequence fitness can be measured or modeled.

Key Words: protein structure • evolution • Markov chain Monte Carlo • Bayesian


    Introduction
 TOP
 Abstract
 Introduction
 Modeling Protein Evolution Under...
 Examples
 Future Directions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Rates of amino acid replacement can vary both among protein positions (e.g., Yang, Nielsen, and Hasegawa 1998) and among the types of amino acids involved in the replacement (e.g., Dayhoff, Schwartz, and Orcutt 1978). Furthermore, rates of amino acid replacement at a protein site are likely to depend on the amino acids found at other positions in the protein. Although it is widely accepted that positions in a protein sequence do not evolve independently, and although methods for detecting protein sites with correlated patterns of evolution have been proposed (e.g., Pollock, Taylor, and Goldman 1999; Wollenberg and Atchley 2000), little progress has been made on incorporating dependence among sites into procedures for evolutionary inferences.

A notable exception to this lack of progress is found in studies by Pedersen and Jensen on evolutionary dependence among sites that is caused by reading frame overlap (Jensen and Pedersen 2000; Pedersen and Jensen 2001). Here, we borrow ideas from Pedersen and Jensen but focus on the evolutionary dependence among codons that is associated with protein tertiary structure. In addition, we build upon a recently proposed technique for simulating protein evolution (Parisi and Echave 2001). With this technique, the rate at which a site experiences change can be modified by substitutions at neighboring sites. Through simulations, Parisi and Echave convincingly demonstrated that their model incorporates information that is unobtainable with widely used models of protein evolution. For example, the simulation studies showed tendencies of certain amino acids to preferentially occupy certain sites in the left-handed ß helix domain of UDP-N-acetylglucosamine acyltransferases. When a group of actual sequences with this helix domain was examined, qualitatively similar tendencies were observed.

Parisi and Echave began their simulations with a reference protein of known tertiary structure. They then selected a function to assign a distance between a protein sequence and the reference structure. Underlying the sequence-structure distance is the idea that protein tertiary structure evolves very slowly (Chothia and Lesk 1986; Flores et al. 1993). Therefore, the energy associated with the structure of an ancestral protein (e.g., the reference protein) and the energy associated with the structure of a descendant protein should be similar. The sequence-structure distance can be interpreted as a surrogate for the difference in energies between an ancestral and a descendant protein.

When applying the Parisi–Echave procedure, most simulations that begin with an ancestral sequence will likely result in a descendant that differs from an observed descendant. Because of this apparent drawback, an alternative strategy is needed. Our approach differs from Parisi and Echave's work mainly because our goal is to perform statistical inference when studying the relationship between two observed sequences. It would be computationally inefficient to simulate sequence evolution and then discard all simulated descendant sequences that differ from those observed.

Here, we explain our technique for statistical inference with evolutionary models that have dependence due to protein structure. With simulations, we show that this technique can obtain accurate parameter estimates. With analyses of actual sequence pairs, we show that our parameter estimates are biologically reasonable. Although dependence owing to protein structure is the focus here, slight modifications of our inferential procedure can be applied when the evolutionary dependence among sequence sites arises in other ways.


    Modeling Protein Evolution Under Structural Constraints
 TOP
 Abstract
 Introduction
 Modeling Protein Evolution Under...
 Examples
 Future Directions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Parameterization
Following Parisi and Echave (2001), we propose a Markov model to describe the evolution of proteins under structural constraints. This is accomplished with the creation of an instantaneous rate matrix, R, where the entry in row i and column j represents the rate of change from one sequence to another. Because we abandon the assumption of independent changes among codons, we cannot follow the conventional practice (e.g., Goldman and Yang 1994; Muse and Gaut 1994) by usefully expressing our model with a series of 61 x 61 rate matrices that each describe change at a specific codon location in the protein. For a DNA sequence of length N nucleotides and ignoring for now the possibility of stop codons, the dimensions of our rate matrix R are 4N x 4N. Our assumption that each substitution event changes only a single nucleotide residue reduces the maximum number of nonzero elements in each row of R to 3N + 1. These 3N + 1 elements comprise the diagonal entry that represents no change in the sequence and one entry for each of the three possible substitutions at each of the N nucleotide sites.

The key motivation underlying our model is that nonsynonymous substitution rates should depend in part on whether the implied amino acid replacements would stabilize or destabilize the known and assumed fixed protein tertiary structure. To assess the effect of an amino acid replacement on protein stability, a measure is needed for how well the sequence fits the structure both before and after the replacement. If this measure indicates that the replacement would improve the sequence-structure fit, then the rate of the nonsynonymous change should be high. Likewise, if the sequence-structure fit would become poorer as a result of an amino acid replacement, then the corresponding nonsynonymous rate should be low.

Fortunately, systems for assessing the compatibility between sequence and structure have been developed for the purpose of protein fold recognition (e.g., see Jones and Thornton 1996). Our evolutionary model relies on a sequence-structure compatibility criterion that has been successfully applied to protein fold recognition by Jones, Taylor, and Thornton (1992) and Jones (1999). This criterion can be split into two components, one component assessing solvent accessibility and the other assessing pairwise interactions between residues near to each other in 3-dimensional space. In our application we assume that protein tertiary structure is known and fixed, and consequently the values of these two components can be determined for a sequence by threading it onto the known structure. The solvent accessibility and pairwise measures of sequence-structure compatibility, respectively denoted by Es(i) and Ep(i) for a sequence i, are intended to be positively correlated with the free energy the protein has when folded into the known structure. Therefore, Es(i) and Ep(i) have low (ideally negative) values when sequences and structures are relatively compatible. Rather than being actual energy potentials, the specific values of Es(i) and Ep(i) are derived from statistical analysis of large numbers of known structures to assess the "plausibility" of observing different amino acids at different degrees of solvent accessibility and observing different pairs of amino acids at different physical separations (Jones, Taylor, and Thornton 1992; Jones 1999).

Except for the treatment of nonsynonymous rates, our parameterization is similar to that of widely used codon models. We include parameters {pi}A, {pi}G, {pi}C, and {pi}T ({pi}A + {pi}G + {pi}C + {pi}T = 1 and all of these parameters are non-negative) so that mutations to the four nucleotide types need not be equally likely. Alternative modeling options that we have not yet pursued would have separate sets of these mutation frequency parameters for each of the three codon positions or separate parameters for each of the 61 codons (e.g., Pedersen, Wiuf, and Christiansen 1998). Our model contains the parameter {kappa} > 0 because transitions and transversions may occur at different rates, and it contains the parameter u to scale the overall rate of change. To handle nonsynonymous rates, there are the parameters {omega}, s, and p which will be discussed in more detail below.

The instantaneous rate of change Ri, j from sequence i to sequence j is set to 0 if sequences i and j differ at more than one nucleotide or if sequence j encodes a premature stop codon. For the other cases where sequences i and j differ by exactly one nucleotide that has type h (h {A, G, C, T}) in sequence j, our rate matrix entries are:


When s = p = 0, our model simplifies to the sort of widely used codon model that has been studied by others (e.g., Muse and Gaut 1994; Goldman and Yang 1994). Biologically plausible values of both s and p are positive, for positive values of these parameters favor sequences with good fits to the known structure. The biologically unreasonable case where s < 0 and p < 0 would have evolution favoring sequences that do not fit the known structure well.

The s and p parameters reflect the contribution of nonsynonymous rates that comes from the sequence-structure fit, whereas the {omega} parameter is intended to capture contributions to nonsynonymous rates that are external to the protein of interest. A variety of external impacts on nonsynonymous rates can be envisioned. For example, the value of {omega} may be less than 1 if the protein being studied is part of a co-adapted system that might be disrupted when nonsynonymous changes cause the protein to function less well in the system. The value of {omega} could exceed 1 if, for example, nonsynonymous changes to the protein helped a pathogen evade the immune system of its host. We view this distinction between effects on nonsynonymous rates that are external to the protein (represented by {omega}) and effects on nonsynonymous rates that are internal to the protein (represented by s and p) as being potentially useful for characterizing the process of molecular evolution. Although our implementation forces all codons to share the same {omega} value, it would be straightforward to adopt previously proposed strategies that allow {omega} to vary among sites (e.g., Yang et al. 2000).

Stationary Probabilities of Sequences
A nice feature of this model is the explicit form for the equilibrium distribution of each possible coding sequence of length N. For simplicity of notation, let {theta} represent all the parameters in the rate matrix R (i.e., {theta} = {{kappa}, {omega}, s, p, u, {pi}A, {pi}C, {pi}G, {pi}T}) and use im to represent the nucleotide at position m in DNA sequence i. Sequence i has stationary probability


where the sum in the denominator is over all possible sequences k that lack a premature stop codon. We note that the above formula resembles the Boltzmann distribution of statistical physics and the denominator resembles what is known in statistical physics as a partition function. Previously proposed models that have evolutionary independence among protein locations also take the Boltzmann form as a stationary distribution (e.g., Koshi, Mindell, and Goldstein 1997; Koshi and Goldstein 1998; Koshi, Mindell, and Goldstein 1999).

As with models that have independent evolution of codons (e.g., Goldman and Yang 1994; Muse and Gaut 1994), the stationary probability p(i|{theta}) does not depend on {kappa}, {omega}, or u. Interestingly, if s and p are not both zero then the expected frequencies under stationarity of A, C, G, and T need not be close to {pi}A, {pi}C, {pi}G, and {pi}T. The model is also time reversible. The time reversibility property,


is computationally convenient because it allows the likelihood p(i, j|{theta}) for a data set with two sequences i and j to be computed with


rather than by enumerating all possible common ancestral sequences of i and j (see Felsenstein 1981).

Sequence Path Densities
The relatively general dependence structure of our model does not facilitate calculation of p(j|i, {theta}). Transition probabilities for conventional models of sequence evolution are calculated by specifying the rate matrix among 61 codon states (or 4 nucleotide states or 20 amino acid states) and then exponentiating this matrix (see Swofford et al. 1996). The 4N x 4N dimension of our rate matrix is typically too large to make matrix exponentiation computationally tractable. Rather than matrix exponentiation, our strategy is to augment the (observed) sequence data i and j with a (unobserved) path {rho}, or a sequence of events between the two sequences. Here, we will arbitrarily choose i to be ancestral and j to be the descendant. The sequence path {rho} specifies how i is transformed to j on a branch of the evolutionary tree. The information within {rho} includes both the times on the branch at which sequence changes occur and the nature of each sequence change event. Because there are actually an infinite number of possible sequence paths that could transform one sequence to another, our strategy is to randomly sample these sequence paths from an appropriate probability density.

Specifically, we adopt a Bayesian inference framework. We specify a prior density p({theta}) for our parameters and then sample {theta} and {rho} from their joint posterior density p({rho}, {theta}|i, j). This sampling of {theta} and {rho} allows their joint posterior distribution to be approximated and thereby serve as the basis for Bayesian parameter estimation. We can express p({rho}, {theta}|i, j) as


The p(i, j) term in the denominator is difficult to explicitly determine. However, p(i, j) is not a function of {theta} and need not be calculated with the Markov chain Monte Carlo procedure described in the next section. The p(j, {rho}|i, {theta}) term is governed by the rate matrix R. For a specific sequence path {rho} from i to j, let q be the total number of nucleotide substitutions on the path and let t(z) be the time of the zth substitution where t(0) = 0 is the time at which the branch begins. For simplicity, the time at which the branch ends will be t(q + 1) = 1. Because the scaling parameter u in our model can be adjusted, the time at which the branch ends can always be considered to be 1. The sequence that exists immediately after the zth substitution will be represented by i(z). Let i(0) = i and i(q + 1) = i(q) = j. Therefore, the sequence path {rho} is fully specified by t(0), t(1), ... , t(q), t(q + 1) and i(0), i(1), ... , i(q), i(q + 1).

Consider the rate of change away from a sequence v to any other sequence of length N. This rate of change away from v will be denoted Rv where


and where the sum is over all sequences k that differ from v. In practice, Rv is feasible to calculate because Rv,k must be zero unless k is one of the 3N sequences that differ from v at exactly one nucleotide. The time until v experiences a nucleotide substitution is exponentially distributed with parameter Rv. Given that some substitution occurs, the probability that v is transformed to some sequence k is Rv,k /Rv. Therefore,


where the final term eRi(q),·(t(q+1)–t(q)) represents the probability of no change in the time interval from the final substitution at t(q) until the time t(q + 1) = 1 that the branch ends.

Metropolis-Hastings Algorithm
Via the Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970), we construct a Markov chain on the combined {theta} and {rho} state space with stationary distribution p({rho}, {theta}|i, j). To implement our Markov chain Monte Carlo (MCMC) algorithm, we randomly pick initial state ({rho}(0), {theta}(0)) from the set of possibilities where p({rho}, {theta}|i, j) exceeds zero. As described below in detail, we then propose random new values {rho}' and {theta}' conditional on the current values {rho} and {theta}. The proposal density will be denoted J({rho}', {theta}'|{rho}, {theta}). With probability equal to the minimum of 1 and


we set the next state of our Markov chain ({rho}(1), {theta}(1)) equal to the proposed state (i.e., {rho}(1) = {rho}', {theta}(1) = {theta}'). Otherwise, {rho}(1) = {rho}(0) and {theta}(1) = {theta}(0). By repeatedly proposing p' and {theta}' and then randomly accepting the proposals with probabilities determined by Equation (6), a Markov chain with the desired stationary distribution p({rho}, {theta}|i, j) is formed.

Because of the sum over all sequences in the denominator of Equation (1), it is not computationally feasible to explicitly calculate the p(i|{theta}) or p(i|{theta}') terms in Equation (6). Therefore, we approximate the ratio p(i|{theta}')/p(i|{theta}). Our approximation strategy relies on randomly sampling a group of M sequences of length N from the stationary distribution of sequences for parameter values {theta}* = {{kappa}*, {omega}*, s*, p*, u*, }. The M sequences are effectively independently sampled from this stationary distribution and will be denoted {eta}(1), {eta}(2), ... , and {eta}(M). The sampling of {eta}(h) from p({eta}(h)|{theta}*) is achieved via construction of a Markov chain with a state space consisting of all sequences of length N. This Gibbs Sampling approach (Geman and Geman 1984) has the desired p({eta}(h)|{theta}*) as its stationary distribution and exploits the fact that the numerator of Equation (1) is straightforward to calculate even though the denominator may be computationally intractable. Consider four DNA sequences of length N that are identical except for the residue at one specific nucleotide site. The sequence with an A at this sole position will be denoted by vA while the sequences with C, G, and T will be respectively denoted by vC, vG, and vT. Because the denominator of Equation (1) need not be evaluated, it is easy to determine p(v{alpha}|{theta}*)/(p(vA|{theta}*) + p(vC|{theta}*) + p(vG|{theta}*) + p(vT|{theta}*)) for all {alpha} {A, C, G, T}. Conditional upon N – 1 of the N sequence positions, the residue at the sole position that is free to vary can therefore be randomly sampled according to its stationary probability.

The initial state of the Markov chain defined by our Gibbs Sampler is randomly selected from the set of all DNA sequences of length N that lack a stop codon. Thereafter, the Gibbs Sampler repeatedly cycles through two steps. The first step is to randomly select a site at which to propose a change. The second step is to fix the sequence at all positions except this site and then randomly choose the nucleotide to occupy the selected site according to the stationary probabilities of the four sequences that represent the four possible residues at this site. This Gibbs Sampler allows a sequence {eta}(h) to be sampled with probability p({eta}(h)|{theta}*). Because we desire the M sequences {eta}(1), {eta}(2), ... , {eta}(M) to be effectively independent samples, it is important to simulate the Markov chain for a long time between the sampling of each of the M sequences.

An importance sampling argument (see Appendix A) shows that when M is sufficiently large,


The quality of this approximation improves if {theta}* is close to both {theta}' and {theta}. For this reason, the {theta}* chosen for this approximation depends in our implementation on the values of {theta} and {theta}'. The ratio approximated in Equation (7) depends on the six parameters s, p, {pi}A, {pi}G, {pi}C, and {pi}T. Because {pi}A + {pi}G + {pi}C + {pi}T = 1, the values of these six parameters can be specified in a 5-dimensional space. We determine the possible values for {theta}* by partitioning this 5-dimensional space into a grid. For a particular combination of {theta} and {theta}', we find their midpoint and then choose {theta}* to be the grid point that is nearest this midpoint. Because not all grid points will be visited in a particular MCMC run, we save computation by not sampling the M sequences from p({eta}(h)|{theta}*) until a particular {theta}* grid point is visited for the first time during the MCMC run.

Proposing {theta}
Our MCMC implementation actually consists of various proposal distributions J({rho}', {theta}'|{rho}, {theta}), and the Markov chain is formed by cycling through these different proposal distributions. Each proposal can result in only slight differences between ({rho}, {theta}) and ({rho}', {theta}'). Most of these proposal distributions lead to slight differences between {theta}' and {theta} but set {rho}' = {rho}. For example, one proposal distribution has {rho}' = {rho} and {theta}' = {theta} except that {kappa}' != {kappa}. We employ similar proposal steps that propose change to only s, only p, only u, or only {omega}. All of these proposal steps are Metropolis-Hastings schemes that involve sampling a proposed parameter value from a uniform distribution that is determined by the current parameter value, a prespecified "window" length surrounding the current parameter value, and any constraints on the parameters.

There is a separate proposal step for each of {pi}A, {pi}G, {pi}C, and {pi}T, and these are also conventional in that the proposed value for {pi}A, {pi}G, {pi}C, or {pi}T will involve sampling from some uniform distribution. However, these steps are slightly different from those for {kappa}, s, p, u, and {omega} because the constraint that {pi}A + {pi}G + {pi}C + {pi}T = 1 necessitates that a change in one of these four parameters cannot be made without an accompanying total change of the opposite sign in the other three. Our implementation partitions this total change of the opposite sign among the other three parameters according to the relative size of the three parameters. For example, if = {pi}A + {delta} is obtained by sampling from a uniform distribution centered about {pi}A, then we set = {pi}G{delta}{pi}G/({pi}G + {pi}C + {pi}T), = {pi}C{delta}{pi}C/({pi}G + {pi}C + {pi}T), and = {pi}T {delta}{pi}T/({pi}G + {pi}C + {pi}T). If the three free parameters had been /, and / rather than , , and , then the proposal density J({rho}', {theta}'|{rho}, {theta}) would be the uniform density for {delta} because only would be changed with this proposal step. Because our parameterization is actually in terms of , , and , J({rho}', {theta}'|{rho}, {theta}) becomes the uniform density for {delta} multiplied by a factor 1/(1 – )2 that represents the Jacobian of the transformation from , /( + + ) and /( + + ) to , , and .

Proposing Site Paths
When proposing a new path {rho}' different from {rho}, all parameter values in {theta} are held constant. The sequence paths {rho} and {rho}' represent possible ways to transform an ancestral sequence i to a descendant j. A sequence path is fully specified by a series of nucleotide substitutions and the specific times at which the substitutions occur. For each possible sequence path, i must have a residue ir at site r, whereas j must have a residue jr at this site. A sequence path {rho} can also be represented by a set of site paths {rho}1, {rho}2, ... , {rho}r, ... , {rho}N. A particular site path {rho}r specifies all nucleotide substitutions that change site r in path {rho}, as well as the specific times of the changes. Our approach for proposing {rho}' is to set {rho}' equal to {rho} with the exception of the site path at one randomly selected site. For this site, we base the sampling of the proposed site path on a simple model of evolution that assumes independence of nucleotide substitutions among sites. To emphasize that this simple model may have a parameterization that is quite different from the dependent sites model, {psi} rather than {theta} will represent the independent sites model and its parameters. Our goal will be to sample a site path {rho}r for site r from p({rho}r|ir, jr, {psi}).

Because of the relative ease of sampling the site paths {rho}r, we adopt what is commonly referred to as the Felsenstein 1984 model (see Felsenstein 1989). Nielsen (2001) has used a similar algorithm for sampling site paths from this model for a different application. Nielsen's algorithm, or an algorithm that simulates site paths beginning with ir and then discards the paths unless they end with jr, would be suitable for our application. However, our implementation uses a different algorithm for sampling site paths {rho}r from p({rho}r|ir, jr, {psi}) (see Appendix B). No matter what method is selected for sampling from p({rho}r|ir, jr, {psi}), the resulting proposal {rho}' should be accepted or rejected according to Equation (6).


    Examples
 TOP
 Abstract
 Introduction
 Modeling Protein Evolution Under...
 Examples
 Future Directions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Prior Densities and Implementation
In all analyses, all combinations of non-negative values for {pi}A, {pi}C, {pi}G, and {pi}T that satisfy {pi}A + {pi}C + {pi}G + {pi}T = 1 were treated as being equally likely a priori. Prior densities were uniform on the interval (0, 4) for u, uniform on the interval (0, 10) for {kappa}, and uniform on the interval (0, 10) for {omega}. Because so little is understood about the relationship between protein structure and protein evolution, the assignment of priors to the s and p parameters was somewhat but not completely arbitrary. Positive values of s and p represent the biologically reasonable scenario where evolution maintains sequence-structure compatibility. Negative values represent the scenario where evolution favors incompatibility. By centering prior distributions for s and p about zero, neither the compatibility nor the incompatibility of sequences and structures was favored a priori. With priors for s and p centered about zero, posterior distributions that are concentrated in the quadrant with s and p both exceeding 0 would be biologically reasonable and would therefore help to validate our approach. To set the end points of our uniform prior distributions for s and p, some exploration of posteriors was performed. We did not want the posterior distributions for s and p to be concentrated near the prior interval end points and, because the approximation of Equation (7) will be best for a finely meshed grid of {theta}* values, we did not want the uniformly distributed intervals for s and p to be overly wide. We opted to analyze each pair of sequences with a prior for s that was uniform on the interval (–2, 2) and a prior for p that was uniform on the interval (–0.15, 0.15). This prior was chosen because posterior distributions of s and p were not concentrated near the end points of their respective prior intervals and because the grid mesh of {theta}* proved fine enough to yield satisfactory approximations with Equation (7).

There is a trade-off when determining the mesh size for the {theta}* values. A mesh that is too wide will produce poor approximations with Equation (7). A mesh that is overly fine will be computationally infeasible. By selecting 9 points along the ranges of each of the s and p uniform priors to define the mesh size separating successive values of these parameters on the {theta}* grid, we found an acceptable compromise between approximation accuracy and computational requirements. We performed some MCMC runs with a finer mesh and obtained similar results to those obtained when 9 points defined the grid for each of s and p. Posterior approximations when only 5 points defined the grid for each of s and p were not satisfactory. We did some exploratory analyses with a prior for s that was uniform on the interval (–4, 4) and a prior for p that was uniform on the interval (–0.3, 0.3). With this prior for s and p, 17 points were used to define the grid for each of s and p. We found that this latter prior for s and p yielded very similar results to those with the (–2, 2) prior for s and (–0.15, 0.15) prior for p (data not shown). The remaining three dimensions of the 5-dimensional {theta}* grid represent , , and . For these parameters, we have 12 points define the mesh along each of these dimensions. The grid points for these parameters are each evenly spaced in the interval from 0.14 to 0.36. The value of M, the number of sequences sampled from the stationary distribution represented by a particular grid point, was 1,000 for the results presented here. The number of Gibbs iterations between successively sampled sequences was set to 10 multiplied by the length in nucleotides of the sequences.

To enhance MCMC convergence, we have each MCMC cycle include one proposed update of s, p, u, {kappa}, and {omega} but five proposed updates of site paths. Each MCMC cycle also proposes an update to one of {pi}A, {pi}C, {pi}G, and {pi}T so that a proposed update to each of these parameters is made once per four cycles. Each analysis presented here consisted of 100,000 MCMC cycles. The first 5,000 of these cycles were treated as a "burn-in" period and did not contribute to the posterior approximations. For all cases, two separate MCMC runs were performed to check for convergence to the desired posterior distribution. Plots of sampled parameter values versus the cycle number at which samples were taken were also examined to check for convergence.

One of the codon models implemented by the PAML software (Yang 1997) is equivalent to our parameterization when s = 0 and p = 0. For the uniform prior distributions that we employ, the maximum of the likelihood surface for s = 0 and p = 0 should be the point at which the joint posterior density is maximized. As a simple check of our software, the maximum likelihood estimates returned by PAML can be compared to the samples from the posterior distribution that are obtained by our MCMC routine. For all pairs of sequences used in this article, we have compared the PAML version 3.12 output with that of our software and have found a close correspondence (data not shown).

It is conventional to express the amount of evolution separating two sequences in terms of the expected number of nucleotide substitutions per site. With our dependence model, the substitution rate at a site is determined by the residues occupying other sites, and it can change when other sites change. To explore the prior distribution of the expected number of substitutions per site with our dependence model, we used a simulation procedure. For each simulation, values of the parameters that comprise {theta} were sampled from their prior distribution. Based on the resulting {theta}, the aforementioned Gibbs Sampling technique was employed to randomly select a sequence from its stationary distribution. This randomly sampled sequence was treated as being the ancestral sequence, and then evolution was simulated according to our model to produce a descendant sequence. By computing the average number of changes per site for each simulated sequence path and by repeating the simulation procedure 10,000 times, the prior distribution for the expected number of changes per site was approximated to yield the results presented here.

Annexin V
We first illustrate our techniques by analysis of two different annexin V sequence pairs that each consist of 314 aligned codons. One pair includes the chicken and human annexin V sequences (GenBank accession numbers M30971 and X12454) and the other pair consists of the mouse and rat sequences (GenBank accession numbers NM_009673 and NM_013132). The two pairs of sequences represent nonoverlapping evolutionary lineages and substantially different divergence levels. The human-chicken pair is 78% identical at the amino acid level and 74% identical at the nucleotide level, whereas the mouse-rat pair is 95% identical at the amino acid level and 93% identical at the nucleotide level.

Annexin V is a calcium-dependent phospholipid-binding protein that has potent vascular anticoagulant activity (Andree et al. 1990), is also an inhibitor of protein kinase C (Schlaepfer, Jones, and Haigler 1992), and has been associated with the apoptotic pathway (Reutelingsperger and van Heerde 1997) and the initiation of the hepatitis B virus infection process (Neurath and Strick 1994). It was selected for this study as an unbiased, arbitrary example of a typical protein of biological interest. The tertiary structure of the chicken annexin V protein has been experimentally determined (Bewley et al. 1993), and we assume that the native conformations of all four annexin V sequences are essentially identical to this structure.

For these two annexin V sequence pairs, the posterior distributions of s and p are quite similar (table 1). For both sequence pairs, the posteriors of s and p are concentrated in the s > 0 and p > 0 quadrant. This is the biologically plausible quadrant of the parameter space where evolution favors sequences that fold well into the known structure. The posterior distributions of {kappa}, {omega}, and u were relatively unaffected by whether s and p were forced to be zero. This indicates that the information contributing to the s and p estimates is largely independent of the information contributing to the estimates of the {kappa}, {omega}, and u parameters.


View this table:
[in this window]
[in a new window]
 
Table 1 Priors and Posteriors for the Annexin V Sequence Comparisons.

 
For biologically relevant values of s and p, the amino acid composition of sequences that are highly compatible with the tertiary structure may differ from the amino acid composition when s = p = 0. These amino acid composition differences may induce differences in expected nucleotide composition. This may explain why the estimates of {pi}A, {pi}C, {pi}G, and {pi}T were affected by whether or not s and p were forced to zero (table 1).

To evaluate the potential performance of our estimation procedure, we simulated annexin V evolution using the posterior means of the model parameters as estimated from the human-chicken pair. For each of five simulated annexin V pairs, the values of {kappa}, {omega}, u, {pi}A, {pi}C, {pi}G, {pi}T were set to their posterior means from our analysis of the original human-chicken data while the values of s and p varied among simulations. The posterior means of s and p for the human-chicken pair were approximately s = 0.947 and p = 0.0282. One pair of sequences was simulated for each of (s = 0.947, p = 0.0282), (s = 0.947, p = –0.0282), (s = –0.947, p = 0.0282), (s = –0.947, p = –0.0282), and (s = 0, p = 0). In each simulation, an ancestral sequence was selected via Gibbs Sampling from the appropriate stationary distribution of sequences. A descendant sequence was then evolved according to our model, and the resulting sequence pair was analyzed with the approach described here. For each of the five simulation scenarios, the posterior densities of s and p are concentrated close to the true values of these parameters (table 2).


View this table:
[in this window]
[in a new window]
 
Table 2 Posterior Means and 95% Credibility Intervals for s and p with Sequence Pairs That Were Simulated According to Our Model and the Annexin V Tertiary Structure.

 
With our parameterization, a nonsynonymous nucleotide substitution that changes a sequence i to a sequence j will occur at the rate this substitution would have if it were synonymous multiplied by a factor of


Therefore, if e(Es(i)–Es(j))s+(Ep(i)–Ep(j))p > 1/{omega}, the change from sequence i to sequence j could be interpreted as involving positive selection. Likewise, the change from sequence i to j could be interpreted as negatively selected if e(Es(i)–Es(j))s+(Ep(i)–Ep(j))p < 1/{omega}. Among the nonsynonymous nucleotide substitutions that could occur at a particular codon in a particular sequence, some of these nonsynonymous substitutions may be positively selected and others may be negatively selected. The fact that some changes to a particular codon may be positively selected whereas others are negatively selected is a desirable property of our parameterization of nonsynonymous rates.

For both annexin V sequence pairs, we have explored variation due to protein structure among the possible nonsynonymous changes that could affect a sequence. For all 2,058 possible sequences j that differ by one nonsynonymous substitution from the observed mouse sequence i, the rate factor associated with sequence-structure compatibility, e(Es(i)–Es(j))s+(Ep(i)–Ep(j))p, was calculated. To produce the histogram of these 2,058 values shown in figure 1A, the posterior means of s and p from the mouse-rat comparison were used. A corresponding histogram generated with the 2,069 possible nonsynonymous changes to the chicken sequence is similar and is not shown. Both histograms indicate that the nonsynonymous rate variation due to protein structure is substantial but both the mean and median effect of nonsynonymous substitutions on structural compatibility are deleterious. Although many of the possible nonsynonymous changes improve the sequence-structure compatibility (i.e., e(Es(i)–Es(j))s+(Ep(i)–Ep(j))p > 1 ), this improved compatibility does not overcome the low posterior mean of {omega} by making any of these possible nonsynonymous changes positively selected.



View larger version (14K):
[in this window]
[in a new window]
 
FIG. 1. (A) Nonsynonymous rate variation due to structure in mouse annexin V. Letting mouse annexin V be sequence i, a histogram of the values of the factor e(Es(i)–Es(j))s+(Ep(i)–Ep(j))p is constructed from the 2,058 sequences j that differ from i by exactly 1 nonsynonymous nucleotide substitution. The values of s and p used here are the posterior mean estimates from the mouse-rat comparison. Of the 2,058 factors represented by the histogram, 713 factors exceed 1. None of the 2,058 factors exceeds 6.803, the inverse of the posterior mean of the {omega} parameter, and therefore none of the possible nonsynonymous substitutions is categorized as positive selection. (B) Comparison for mouse annexin V of effects of solvent accessibility and pairwise interactions on nonsynonymous rates. For each of the 2,058 possible nonsynonymous changes to mouse annexin V that are summarized in figure 1A, the value of (Es(i) – Es(j))s is plotted versus the value of (Ep(i) – Ep( j ))p. The x = y and x = –y lines are drawn to assist comparison between effects of pairwise interactions and solvent accessibility

 
To compare the impact on nonsynonymous rates of solvent accessibility and pairwise interactions, values of (Es(i) – Es(j))s and (Ep(i) – Ep(j))p should be considered instead of the values of s and p in isolation. Figure 1B plots (Es(i) – Es(j))s versus (Ep(i) – Ep(j))p for the 2,058 values summarized in figure 1A. This plot shows that relatively few nonsynonymous changes to the mouse sequence improve both the pairwise and solvent accessibility components of sequence-structure compatibility. The plot also indicates that solvent accessibility and pairwise interactions can both have substantial impacts on nonsynonymous rates.

Lysozyme
In their pioneering work on adaptation, Stewart and collaborators (e.g., Stewart, Schilling, and Wilson 1987; Messier and Stewart 1997; see also Yang 1998; Yang and Nielsen 2002) demonstrated that the evolutionary lineage leading to the Colobine monkeys (e.g., Colobus guereza) has experienced an excess of nonsynonymous substitutions during lysozyme c evolution. The apparent explanation for this excess is that the Colobine monkeys have acquired a foregut in which bacteria ferment leaves. In the Colobine monkeys, lysozyme c is active at a lower pH and is more resistant to pepsin than in other primates. These properties of lysozyme c in Colobine monkeys may be adaptations that help with lysis of bacteria that pass from the foregut fermentation chamber into the true stomach (Stewart, Schilling, and Wilson 1987; Messier and Stewart 1997).

To investigate lysozyme c evolution with our approach, we selected two pairs of lysozyme c sequences that represent nonoverlapping evolutionary lineages and that each consist of 130 aligned codons. The phylogenetic path separating the Rhesus macaque (Macaca mulatta) and Colobus guereza pair (GenBank accession numbers X60236 and U76916, 96% nucleotide identity and 91% amino acid identity) includes lineages that have previously been demonstrated to have experienced a strong excess of nonsynonymous substitutions (Stewart, Schilling and Wilson 1987; Messier and Stewart 1997; Yang and Nielsen 2002). The phylogenetic path separating the marmoset (Callithrix jacchus) and human pair (GenBank accession numbers M19045 and U76923, 93% nucleotide identity and 87% amino acid identity) does not seem to have experienced such a strong excess of nonsynonymous change (Stewart, Schilling, and Wilson 1987; Messier and Stewart 1997; Yang and Nielsen 2002), although our estimates indicate it has a ratio of nonsynonymous to synonymous rates that is substantially higher than for the annexin V pairs (see below). The tertiary structure of the human lysozyme c protein has been experimentally determined (Harata, Abe, and Muraki 1998), and we assume that the native conformations of all four lysozyme c sequences are essentially identical to this structure.

As with the annexin V analyses, the posterior distributions of s and p are concentrated in the biologically plausible quadrant of the parameter space for both lysozyme c sequence pairs (table 3). In general, we do not find strong correlations between {omega}, s, and p in our samples from the posterior distribution. As an example, Pearson's correlation for the human-marmoset comparison was approximately –0.13 between s and p, 0.00 between s and {omega}, and 0.02 between p and {omega}.


View this table:
[in this window]
[in a new window]
 
Table 3 Priors and Posteriors for the Lysozyme c Sequence Comparisons.

 
Figure 2A shows a histogram that summarizes the values of e(Es(i)–Es(j))s+(Ep(i)–Ep(j))p for the 862 sequences j that differ by exactly one nonsynonymous change from the human lysozyme c sequence. The values of s and p used to construct this histogram were those from the human-marmoset comparison. Figure 2B shows a histogram that summarizes the values of e(Es(i)–Es(j))s+(Ep(i)–Ep(j))p for the 872 sequences j that differ by exactly one nonsynonymous change from the Rhesus macaque lysozyme c sequence. The values of s and p used to construct this histogram were those from the colobus-rhesus sequence comparison. As indicated by the portions of these histograms to the right of the arrows, a substantial number of possible nonsynonymous substitutions would be positively selected. In comparison, none of the possible nonsynonymous annexin V substitutions represented in the histograms of figure 1 would be positively selected. This disparity in fraction of positively selected nonsynonymous substitutions can be mainly attributed to the higher estimates of {omega} for the two lysozyme c sequence pairs than for those obtained from annexin V.



View larger version (11K):
[in this window]
[in a new window]
 
FIG. 2. (A) Nonsynonymous rate variation due to structure in human lysozyme c. Letting human lysozyme c be sequence i, a histogram of the values of the factor e(Es(i)–Es(j))s+(Ep(i)–Ep(j))p is constructed from the 862 sequences j that differ from i by exactly 1 nonsynonymous nucleotide substitution. The values of s and p here are the posterior mean estimates from the human-callithrix comparison. Of the 862 factors represented by the histogram, 276 factors exceed 1. The arrow at 1.233 shows the inverse of the posterior mean of the {omega} parameter. Therefore, the 115 possible nonsynonymous changes to the human sequence that exceed 1.233 can be categorized as positive selection. (B) Nonsynonymous rate variation due to structure in the Rhesus macaque lysozyme c. Letting Rhesus macaque lysozyme c sequence be sequence i, a histogram of the values of the factor e(Es(i)–Es(j))s+(Ep(i)–Ep(j))p is constructed from the 872 sequences j that differ from i by exactly 1 nonsynonymous nucleotide substitution. The values of s and p here are the posterior mean estimates from the Colobus-Rhesus comparison. Of the 872 factors represented by the histogram, 264 factors exceed 1. The arrow at 0.377 shows the inverse of the posterior mean of the {omega} parameter. Therefore, the 754 possible nonsynonymous changes to the Rhesus macaque sequence that exceed 0.377 can be categorized as positive selection. (C) Comparison for human lysozyme c of effects of solvent accessibility and pairwise interactions on nonsynonymous rates. For each of the 862 possible nonsynonymous changes to human lysozyme c that are summarized in figure 2A, the value of (Es(i) – Es(j))s is plotted versus the value of (Ep(i) – Ep(j))p). The x = y and x = –y lines are drawn to assist comparison between effects of pairwise interactions and solvent accessibility

 
Figure 2C plots (Es(i) – Es(j))s versus (Ep(i) – Ep(j))p for the 862 possible nonsynonymous changes to human lysozyme c. This plot has a pattern that is qualitatively similar to the annexin V pattern in figure 1B. The plot for the 872 possible nonsynonymous changes to the Rhesus macaque lysozyme c is also qualitatively similar to figure 1B and is not shown.


    Future Directions
 TOP
 Abstract
 Introduction
 Modeling Protein Evolution Under...
 Examples
 Future Directions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
An extension of our inferential procedure to data sets with more than two sequences should not be difficult. The main modification will be with how site paths are proposed. With more than two sequences, a site path will traverse multiple branches. Pupko et al. (2000) have introduced a fast algorithm for finding the optimal joint ancestral sequence reconstruction with simple independent sites models of evolution. A variation of this algorithm will allow joint ancestral sequence reconstructions to be sampled according to probabilities defined by the Felsenstein 1984 model. Given a set of reconstructed ancestral nodes for a site, the problem of proposing site paths for multiple sequences becomes the smaller problem of proposing site paths for each branch that comprises the tree. We have already addressed this smaller problem here.

Although the multiple sequence version of our methods may tend to be computationally impractical for inference of evolutionary tree topologies, the extension to multiple sequences may lead to improved methods for ancestral sequence reconstruction. With our approach, ancestral sequences that do not fold well into a tertiary structure are unlikely to be inferred. In addition, the paths from ancestral to descendant sequences may allow the order of adaptive nucleotide substitution events during protein evolution to be inferred.

Other methods for detecting positive selection (e.g., Yang et al. 2000; Yang and Nielsen 2002) treat each codon independently and ignore protein structure. In reality, the question of whether a particular nonsynonymous substitution confers a selective advantage should depend to some extent on the amino acids encoded by other codons. Our technique has the advantages of incorporating and quantifying this dependence.

Although the sequence-structure compatibility criterion employed here (Jones, Taylor, and Thornton 1992; Jones 1999) has demonstrated considerable success when applied to protein fold recognition, other sequence-structure compatibility functions have been proposed (e.g., Singh, Tropsha, and Vaisman 1996), and it will be interesting to explore which of these criteria is most appropriate for describing the process of molecular evolution. Incorporation of alternative criteria would necessitate few changes to our approach.

The sequence-structure compatibility criterion can be viewed as a surrogate for fitness. Measures of the fitness of a sequence that are not explicitly derived from protein structure could be incorporated into our approach without major modifications of our inference procedure. Such modifications might be of particular interest for studying bacteriophage or retroviruses because these quickly evolving organisms are good systems for experimentally measuring the fitness consequences of particular nucleotide substitutions (e.g., Bull, Badget, and Wichman 2000). For making inferences regarding history and evolutionary process, our statistical approaches could supplement more direct experimental information.


    Appendix A
 TOP
 Abstract
 Introduction
 Modeling Protein Evolution Under...
 Examples
 Future Directions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
The approximation of Equation (7) for p(i|{theta}')/p(i|{theta}) relies on having a random sample of sequences {eta}(1), {eta}(2), ... , {eta}(M) from p({eta}(h)|{theta}*). The denominator of Equation (1) is


and


The ratio of the denominators can be approximated via importance sampling,


Substitution of this approximation for D({theta})/D({theta}') into Equation (8) yields Equation (7).


    Appendix B
 TOP
 Abstract
 Introduction
 Modeling Protein Evolution Under...
 Examples
 Future Directions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
To propose a sequence path {rho}', we modify the current sequence path {rho} by replacing the site path {rho}r at site r with a site path that is sampled from the posterior density p(|i, j, {psi}) as defined by the Felsenstein 1984 nucleotide substitution model (Felsenstein 1989). The Felsenstein 1984 model can be interpreted as comprising two separate processes: the within-group process and the general process. For the general process, events occur at a rate g. If a general event occurs, a nucleotide of type i is replaced by a nucleotide of type j with probability {pi}j. Besides general events, within-group events are also allowed by the Felsenstein 1984 model. These events occur at a rate w, and the possible outcomes of a within-group event depend on whether the nucleotide occupying the site before an event is a purine or a pyrimidine. Let H({alpha}) be R if nucleotide type {alpha} is a purine and Y if nucleotide type {alpha} is a pyrimidine. Also, define {pi}R = {pi}A + {pi}G and {pi}Y = {pi}C + {pi}T. If a within-group event occurs at a site currently occupied by type {alpha}, nucleotide type ß replaces {alpha} with probability {pi}ß/{pi}H(ß) if {alpha} and ß are both purines or both pyrimidines and with probability 0 otherwise. In other words, within-group events either lead to transitions or to no change of nucleotide type.

Because of assumed independence of the within-group and general processes, the total number of events experienced at a site in time duration T has a Poisson distribution with parameter (g + w)T. This formulation of the Felsenstein 1984 model actually allows general or within-group events occur that result in the same nucleotide type occupying a site before and after an event. This sort of "hidden" event is algebraically convenient for determining how to sample {rho}r from p({rho}r|i, j, {psi}), and it is not difficult to normalize so that the rate units are the expected number of nucleotide changes rather than the expected number of events.

With the Felsenstein 1984 model, the probability that a site is occupied by nucleotide type ß after an amount of evolution T given that the site was originally occupied by type {alpha} is


Each of the three terms found in these transition probabilities has an intuitive explanation. The probability of 0 general events and 0 within-group events is egTewT. The egT(1 – ewT){pi}ß/{pi}H(ß) term represents the case of 0 general events and at least one within-group event where the most recent within-group event results in the nucleotide type ß occupying the site. The (1 – egT){pi}ß term is the probability that the site is occupied by type ß and at least one general event occurs.

We employ three steps to sample a particular site path from the distribution p({rho}r|i, j, {psi}). First, a random sample is obtained from the distribution of the number of general and within-group changes conditional on ir, jr, and {psi}. The second step, conditional upon the number of events that occur at site r, is to randomly sample the times of these events from a uniform distribution that spans the time period of length T.

The uniform distribution is used because, conditional upon the number of events, event times are uniformly distributed for a time-homogeneous Poisson process such as the one defined by the Felsenstein 1984 model. After the numbers of general and within-group events have been determined and the event times have been assigned, the third step is to appropriately assign the specific residue types that are yielded by each event.

The form of the transition probabilities facilitates sampling the number of general events and the number of within-group events at a site conditional upon the information that the site was occupied by type ir at the beginning of the branch and by type jr at the end of the branch. The most complicated situation is for ir = jr. In this situation, {rho}r has 0 general events and 0 within-group events with probability egTewT/(T). The probability of 0 general events and at least one within-group event at the site for ir = jr is egT(1 ewT)/(T)). Given that this happens, the probability of exactly nw within-group events on the path is ewT/(nw!(1 ewT)) for nw {1, 2, ...}. Therefore, the probability of 0 general events and nw within-group events given that ir = jr and nw {1, 2, ...} is egT/(T)). Likewise, the probability of ng > 0 general events and nw within-group events is / when ir = jr. Similar reasoning allows sampling from the probability distribution of ng and nw conditional upon the beginning nucleotide type ir and the ending type jr for the cases where ir != jr. After determining ng and nw, the times for each general or within-group event on a path can be randomly sampled from a uniform distribution along the branch.

The last step in randomly sampling a site path conditional upon the beginning type ir and ending type jr is to randomly determine which nucleotide type occupies a site after each event. The final event on a site path must always be to type jr because this type occupies the site at the end of the branch. This final event may be of either the general or within-group variety.

If the last general event is not the final event on a site path, to have type jr at the end of the site path, the last general event must be to a nucleotide type included in group H( jr). Specifically, the last general event results in type k with probability {pi}k/ if k belongs to group H( jr). If k does not belong to H( jr), then the probability is 0 that type k is assigned to occupy the site after the last general event.

Next, all general events that are not the last general event on the path at the site are handled. This is accomplished by having nucleotide type k occupy the site after one of these general events with probability {pi}k. When all general events have been assigned nucleotide types, the final step is to treat any remaining unassigned within-group events. Because the within-group events cannot result in transversions, the time intervals on a branch during which a site is occupied by a pyrimidine depend solely on whether the site is occupied by a purine or a pyrimidine at the beginning of the branch and on which general events result in purines and which result in pyrimidines. If an unassigned within-group event occurs during a "purine" interval, then the event results in A with probability {pi}A/{pi}R and G with probability {pi}G/{pi}R. If an unassigned within-group event occurs during a "pyrimidine" interval, then the event results in C with probability {pi}C/{pi}Y and in T with probability {pi}T/{pi}Y.

The procedure discussed to this point allows site paths to be sampled conditional upon the residues that begin and end the path, but some events on the sampled site path may be "hidden" in that the nucleotides before and after the event are identical. Because the goal is to propose a site path according to the Felsenstein 1984 model and then evaluate the resulting sequence path with a model that has dependence among codons and that is not parameterized so as to consider hidden events, all hidden events are pruned from the site path before it is considered in Equation (6). Because the Felsenstein 1984 model can also be parameterized in the conventional way that has all evolutionary events result in a changed sequence, the conventional parameterization is adopted for calculation of J({rho}', {theta}'|{rho}, {theta}) and J({rho}, {theta}|{rho}', {theta}').


    Acknowledgements
 TOP
 Abstract
 Introduction
 Modeling Protein Evolution Under...
 Examples
 Future Directions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
We thank Kevin Scott, Stefan Bekiranov, Reinhard Hopperger, Elizabeth A. Thompson, and two anonymous referees for their help. This work was supported by the Institute for Bioinformatics Research and Development of the Japan Science and Technology Co. (H.K. and J.L.T.), the Japanese Society for the Promotion of Science (H.K.), National Science Foundation grant INT 99-0934 (D.M.R. and J.L.T.), National Science Foundation grant DEB-0120635 (J.L.T.), and a Wellcome Trust Senior Fellowship in Basic Biomedical Sciences (N.G.).


    Footnotes
 
Brian Golding, Associate Editor Back


    Literature Cited
 TOP
 Abstract
 Introduction
 Modeling Protein Evolution Under...
 Examples
 Future Directions
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 

    Andree, H. A. M., C. P. M. Reutelingsperger, R. Hauptmann, H. C. Hemker, W. Th. Hermens, and G. M. Willems. 1990. Binding of vascular anticoagulant {alpha} (VAC {alpha}) to planar phospholipid bilayers. J. Biol. Chem. 265:4923-4928.[Abstract/Free Full Text]

    Bewley, M. C., C. M. Boustead, J. H. Walker, D. A. Waller, and R. Huber. 1993. Structure of chicken annexin V at 2.25-A resolution. Biochemistry 32:3923-3929.[ISI][Medline]

    Bull, J. J., M. R. Badget, and H. A. Wichman. 2000. Big-benefit mutations in a bacteriophage inhibited with heat. Mol. Biol. Evol. 17:942-950.[Abstract/Free Full Text]

    Chothia, C., and A. M. Lesk. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5:519-527.[Abstract]

    Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary changes in proteins. Pp. 345–352 in Atlas of protein sequence and structure, vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C.

    Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[ISI][Medline]

    Felsenstein, J. 1989. Phylogenetic Inference Package (PHYLIP), version 3.2. University of Washington,. Seattle. Cladistics 5:164-166.

    Flores, T. P., C. A. Orengo, D. S. Moss, and J. M. Thornton. 1993. Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci. 2:1811-1826.[Abstract/Free Full Text]

    Geman, S., and D. Geman. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6:721-741.[ISI]

    Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736.[Abstract/Free Full Text]

    Harata K., Y. Abe, and M. Muraki. 1998. Full-matrix least squares refinement of lysozymes and analysis of anisotropic thermal motion. Proteins 30:232-243.[CrossRef][Medline]

    Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.[ISI]

    Jensen, J. L., and A. K. Pedersen. 2000. Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv. Appl. Prob. 32:499-517.[CrossRef][ISI]

    Jones, D. T. 1999. GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287:797-815.[CrossRef][ISI][Medline]

    Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. A new approach to protein fold recognition. Nature 358:86-89.[CrossRef][ISI][Medline]

    Jones, D. T., and J. M. Thornton. 1996. Potential energy functions for threading. Curr. Opin. Struct. Biol. 6:210-216.[CrossRef][ISI][Medline]

    Koshi, J. M., D. P. Mindell, and R. A. Goldstein. 1997. Beyond mutation matrices: physical chemistry based evolutionary models. Pp. 80–89 in S. Miyano and T. Takagi, eds. Genome informatics 1997. Universal Academy Press, Tokyo.

    Koshi, J. M., D. P. Mindell, and R. A. Goldstein. 1999. Using physical chemistry based substitution models in phylogenetic analyses of HIV-1 subtypes. Mol. Biol. Evol. 16:173-179.[Abstract]

    Koshi, J. M., and R. A. Goldstein. 1998. Mathematical models of natural amino acid site mutations. Proteins 32:289-295.[CrossRef][ISI][Medline]

    Messier, W., and C.-B. Stewart. 1997. Episodic adaptive evolution of primate lysozymes. Nature 385:151-154.[CrossRef][ISI][Medline]

    Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21:1087-1092.[ISI]

    Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with applications to the chloroplast genome. Mol. Biol. Evol. 11:715-724.[Abstract/Free Full Text]

    Neurath, A. R., and N. Strick. 1994. The putative cell receptors for hepatitis B virus (HBV), annexin V, and apolipoprotein H, bind to lipid components of HBV. Virology 204:475-477.[CrossRef][ISI][Medline]

    Nielsen, R. 2001. Mutations as missing data: inferences on the ages and distributions of nonsynonymous and synonymous mutations. Genetics 159:401-411.[Abstract/Free Full Text]

    Parisi, G., and J. Echave. 2001. Structural constraints and emergence of sequence patterns in protein evolution. Mol. Biol. Evol. 18:750-756.[Abstract/Free Full Text]

    Pedersen, A.-M. K., and J. L. Jensen. 2001. A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames. Mol. Biol. Evol. 18:763-776.[Abstract/Free Full Text]

    Pedersen, A.-M. K., C. Wiuf, and R. B. Christiansen. 1998. A codon-based model designed to describe lentiviral evolution. Mol. Biol. Evol. 15:1069-1081.[Abstract]

    Pollock D. D., W. R. Taylor, and N. Goldman. 1999. Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. 287:187-98.[CrossRef][ISI][Medline]

    Pupko, T., I. Pe'er, R. Shamir, and D. Grauer. 2000. A fast algorithm for joint reconstruction of ancestral amino–acid sequences. Mol. Biol. Evol. 17:890-896.[Abstract/Free Full Text]

    Reutelingsperger, C. P. M., and W. L. van Heerde. 1997. Annexin V. the regulator of phosphatidylserine-catalyzed inflammation and coagulation during apoptosis. Cell. Mol. Life Sci. 53:527-532.[CrossRef][ISI][Medline]

    Schlaepfer, D. D., J. Jones, and H. T. Haigler. 1992. Inhibition of protein kinase C by Annexin V. Biochemistry 31:1886-1891.[ISI][Medline]

    Singh, R. K., A. Tropsha, and I. I. Vaisman. 1996. Delaunay tessellation of proteins. J. Comput. Biol. 2:213-221.

    Stewart, C.-B., J. W. Schilling, and A. C. Wilson. 1987. Adaptive evolution in the stomach lysozymes of foregut fermenters. Nature 330:401-404.[CrossRef][ISI][Medline]

    Swofford, D. L., G. J. Olsen, P. J. Waddel, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407–514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics, 2nd ed. Sinauer Associates, Sunderland, Mass.

    Wollenberg, K. R., and W. R. Atchley. 2000. Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc. Natl. Acad. Sci. USA 97:3288-3291.[Abstract/Free Full Text]

    Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555-556.[Medline]

    Yang, Z. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568-573.[Abstract]

    Yang, Z., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908-917.[Abstract/Free Full Text]

    Yang, Z., R. Nielsen, and M. Hasegawa. 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15:1600-1611.[Abstract/Free Full Text]

    Yang, Z., R. Nielsen, N. Goldman, and A.-M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.[Abstract/Free Full Text]

Accepted for publication May 26, 2003.