Applications of Codon and Rate Variation Models in Molecular Phylogeny

Eric E. Schadt*{dagger}, Janet S. Sinsheimer*{ddagger} and Kenneth Lange*{ddagger}

*Department of Biomathematics, UCLA School of Medicine, Los Angeles;
{dagger}Department of Computational Genomics, Rosetta Inpharmatics, 12040 115th Avenue NE, Kirkland, Washington;
{ddagger}Department of Human Genetics, UCLA School of Medicine, Los Angeles


    Abstract
 TOP
 Abstract
 Introduction
 Theory and Methods
 Results
 Discussion
 Acknowledgements
 References
 
The current article illustrates the practical advantages of some new models and statistical algorithms for codon substitution and spatial rate variation in molecular phylogeny. Our companion paper in this issue discusses at length the mathematical properties of these models for nucleotide and codon substitution, for site-to-site and branch-to-branch heterogeneity in rates of evolution, and for spatial correlation in the assignment of rates. In this study we summarize the theoretical background and apply the models and algorithms to data on ß-globin, the complete HIV genome, and the mitochondrial genome. Our complex but realistic models enhance biological interpretation of sequence data and show substantial improvements in model fit over existing models. All the new statistical algorithms applied are incorporated in our phylogeny software LINNAEUS, which is tuned for performance and modeling flexibility.


    Introduction
 TOP
 Abstract
 Introduction
 Theory and Methods
 Results
 Discussion
 Acknowledgements
 References
 
The recent whole-genome sequencing of a variety of species has prompted a careful reexamination of statistical methods for reconstructing phylogenies (Blanchette, Bourque, and Sankoff 1997Citation ; Lander et al. 2001Citation ; Moret et al. 2001Citation ). Current annotations of these genomes show strong regional variation in sequence composition, gene density, repeat density, gene duplication frequency, and recombination rates. Content variation by region suggests variation in evolutionary pressures. If evolutionary pressures do vary widely, then phylogeny reconstruction and interpretation will be improved if reconstruction and estimation of selection go hand in hand, particularly in coding and regulatory regions.

Considerable progress has already been made in this program for integrating selection and reconstruction. Models for codons, for rate heterogeneity among sites, and for spatial correlation in rate assignment are the subjects of several recent papers (Goldman and Yang 1994Citation ; Yang 1995Citation ; Felsenstein and Churchill 1996Citation ; Goldman, Thorne, and Jones 1998Citation ; Yang, Nielsen, and Hasegawa 1998Citation ; Wagner, Baake, and Gerisch 1999Citation ; Yang et al. 2000Citation ). Our companion paper (Schadt and Lange 2002Citation ) addresses mathematical and algorithmic issues raised by generalizations of these models. The current article summarizes and applies our generalizations to actual data.

The models presented in Yang, Nielsen, and Hasegawa (1998)Citation , Wagner, Baake, and Gerisch (1999)Citation , and Yang et al. (2000)Citation are closest in spirit to the models featured in our article. Yang, Nielsen, and Hasegawa (1998)Citation introduce amino acid acceptance rates in a codon model. In their model, nonsynonymous codon substitutions are accepted more or less freely, depending on empirically measured distances between amino acids. Yang et al. (2000)Citation discuss different rate class models. Wagner, Baake, and Gerisch (1999)Citation develop a four-state mutation-selection model for DNA–RNA sequences. This model uses coupled Ising chains to incorporate rate dependency among nucleotide sites. Our companion paper (Schadt and Lange 2002Citation ) generalizes these models to an arbitrary number of rate classes and to rate correlation at a distance.

Now that the evolutionary relationships connecting many species have been sorted out, the more subtle questions of structure, function, and regulation will drive the field of molecular phylogeny. More sophisticated models should aid in the functional annotation of genes. Identification of conserved domains is bound to benefit from information on rates of evolution, strength of interaction between spatially separated sites, and most likely assignment of codons to internal nodes and of codon sites to rate classes.

As a guide to what follows, we first briefly summarize the generalized codon substitution and spatial rate variation models covered in detail by Schadt and Lange (2002)Citation . The current article emphasizes applications, intuition, and interpretation. In our view, data analysis of coding regions should be conducted at the codon level where mutation and selection interact to determine amino acid substitutions. In modeling this interplay, we deal with the tradeoffs between accuracy and parsimony by postulating amino acid similarity classes and estimating acceptance probabilities for proposed substitutions within and between classes. To capture spatial interactions in rate class assignments, we turn to Markov random fields (Cressie 1993Citation , pp. 410–423). These offer a rich theoretical context and host of machinery for elaborating models. To illustrate the strengths of each of these model advances, we analyze real sequence data on ß-globin, the complete HIV genome, and the mitochondrial genome (MTG). All algorithms presented in this study or in Schadt and Lange (2002)Citation are implemented in the LINNAEUS software package, freely available to nonprofit researchers upon request.


    Theory and Methods
 TOP
 Abstract
 Introduction
 Theory and Methods
 Results
 Discussion
 Acknowledgements
 References
 
Our point of departure is the nucleotide substitution model posited by Schadt, Sinsheimer, and Lange (1998)Citation . This generalization of traditional models (Jukes and Cantor 1969Citation , pp. 21–132; Kimura 1980Citation ; Hasegawa, Kishino, and Yano 1985Citation ) is a continuous-time Markov chain on the states A, G, C, and T with infinitesimal generator or rate matrix,


The off-diagonal entries of {Lambda} have simple probabilistic interpretations. Thus, the probability that nucleotide G replaces nucleotide A during a small time interval {Delta}t is approximately the product {alpha}{Delta}t of {Delta}t and the infinitesimal transition rate {alpha} in row A and column G of {Lambda}.

An entry pij(t) of the matrix exponential P(t) = et{Lambda} gives the probability of going from state (nucleotide) i at time 0 to state (nucleotide) j at time t. Computation of any matrix's exponential is far easier when the matrix can be diagonalized. Fortunately, all eigenvalues and eigenvectors of {Lambda} are real and explicitly computable. Reversibility is achieved by imposing the constraints ß{gamma} = {lambda}{sigma} and {alpha}{delta} = {epsilon}{kappa} on the infinitesimal transition rates. Under reversibility, the equilibrium frequencies of the different states are


Several authors have suggested an elaboration of the nucleotide model allowing stochastic variation in branch lengths (Nei and Gojobori 1986Citation ; Tamura and Nei 1993Citation ; Yang 1993Citation ; Kelly and Rice 1996Citation ). If this extension is carried out subject to one branch length having unit mean (Yang 1993Citation ), then all kinetic parameters and branch length parameters are statistically identifiable. The most tractable model assumes that branch lengths are independent realizations of gamma random variables. In other words, for each branch b and each nucleotide site, the branch length Tb is an independent gamma variate with scale parameter {zeta}b and shape parameter {nu}b. It is parsimonious to eliminate one parameter per branch by postulating a common coefficient of variation


across branches. This is equivalent to choosing a common shape parameter {nu}. With these conventions it is possible to calculate an analog of the matrix exponential et{Lambda} and proceed with likelihood computation just as in the unadorned nucleotide model. The model just described differs from that in Yang (1993)Citation , where the same random branch length applies to all sites. Unless a discrete approximation to the gamma distribution is invoked, this choice renders the model computationally intractable for all but a handful of taxa (Yang 1994Citation ). In any case, we will soon turn to an alternative model that handles rate variation more efficiently.

Any Markov chain model for nucleotide substitution, including the one with infinitesimal generator (1), induces a codon substitution model in a natural fashion. If the infinitesimal transition rate of going from nucleotide b to nucleotide d is vbd, then the infinitesimal transition rate in going from codon (a, b, c) to codon (a, d, c) is also vbd. (Similar expressions hold when the mutated nucleotide occurs in the first or third position.) The only exception to this rule is that transitions to stop codons are disallowed. An induced codon model has 61 states and a rather sparse infinitesimal generator {Upsilon}. All finite-time transition probabilities are entries of the matrix exponential et{Upsilon} by analogy to the nucleotide model. The computation of et{Upsilon} is harder compared with the computation of et{Lambda} because of the sheer number of states involved and the lack of explicit eigenvalues and eigenvectors. Although using an induced codon model seems like a trivial conceptual advance, eliminating passages to stop codons often leads to a large increase in the maximum likelihood of the induced codon model compared with the maximum likelihood of the nucleotide model.

Selection can be added to a codon model by introducing acceptance probabilities. If two neighboring codons, such as (a, b, c) and (a, d, c) are nonsynonymous, then we postulate a probability {rho} with which substitutions from one to the other are actually accepted. Rejected substitutions correspond to mutations that are eliminated by selection. In general, {rho} depends on its start and destination codons. If we impose symmetry constraints such as {rho}(a,b,c)->(a,d,c) = {rho}(a,d,c)->(a,b,c), then any reversible nucleotide model induces a reversible codon model. Reversibility permits consideration of unrooted trees rather than rooted trees and has many numerical advantages. This class of model is similar to those described in Pedersen, Wiuf, and Christiansen (1998)Citation and Yang, Nielsen, and Hasegawa (1998)Citation .

In modeling acceptance probabilities, we must balance parsimony and biological realism. One flexible device is to divide the amino acids into similarity (or mathematical equivalence) classes. The most comprehensive model would assign each amino acid to its own similarity class, but this model generates more parameters than can be reliably estimated from most data sets, even with the symmetry constraints necessary for reversibility imposed. In many of the applications presented in this study, we divide the amino acids into four similarity classes on the basis of polarity. These classes are: S1 = {G, I, V, L, A, M, P, F, W}, S2 = {Q, N, Y, H, K, R}, S3 = {S, T, E, D}, and S4 = {C}. Cysteine (C) is placed in its own class because of its propensity to form disulfide bridges. Four {rho}'s are estimated: a within class acceptance rate {rho}0, an acceptance rate {rho}1 between the nonpolar class S1 and either of the polar classes S2 and S3, an acceptance rate {rho}2 between the two opposite polar classes S2 and S3, and an acceptance rate {rho}3 between cysteine and any other class.

Because these natural groupings according to charge and polarity omit many important chemical determinants, LINNAEUS is written flexibly enough to accommodate alternative decompositions. Of course, similarity classes hardly exhaust the modeling possibilities. To take advantage of the diverse ways in which chemists rate amino acids, we briefly explore penalty models of the form


for two amino acids having quantitative scales s1k and s2k with respect to amino acid property k. Relevant properties include polarity, hydrophobicity, molecular weight, side-chain volume, and surface area. Equation (3) for {rho} is symmetric in the two amino acids and reduces to 1 for synonymous substitutions. The constraint {rho} <= 1 is achieved by requiring the {rho}k to be nonnegative. Likelihood ratio tests of the hypotheses {rho}k = 0 can quantify the relative importance of each property in the substitution process. We apply this model to the ß-globin and MTG data discussed subsequently.

In addition to introducing penalties for nonsynonymous substitutions, it is also helpful to incorporate different rate classes in statistical analysis. Indeed, codon sites in crucial catalytic domains or in conserved secondary structures are under much stronger selective pressure than are codon sites in other parts of a protein. Rate classes are distinguished by a second level of acceptance probabilities. Suppose that there are l such classes with class i assigned parameter {eta}i. Then the nonsynonymous substitution (a, b, c) -> (a, d, c) has infinitesimal transition rate vbd{rho}{eta}i depending on {eta}i and the penalty {rho} specified by the two amino acids. In contrast, a synonymous substitution (a, b, c) -> (a, d, c) has infinitesimal transition rate vbd. In most applications it is convenient to order the classes so that class 0 evolves fastest, immediately followed by class 1 and so on, down to class l - 1 which evolves slowest. In symbols, this implies 0 < {eta}i+1 < {eta}i for all i < l - 1. To better identify all parameters, we usually constrain {eta}0 = 1 and restrict l <= 3.

It is worth noting that {rho} and {eta} generalize the nonsynonymous–synonymous ratio parameter {omega} appearing in the discrete model of Yang et al. (2000)Citation . Specifically, if there are multiple rate classes and a single {rho}, then our model corresponds to the discrete model M3 of Yang et al. (2000)Citation . Multiple {rho}'s allow for differences in nonsynonymous rates both within as well as between rate classes (Yang, Nielsen, and Hasegawa 1998Citation ). Positive selection occurs when an {omega} or a product {rho}{eta} exceeds 1. In the latter case we must relax either the convention {rho} <= 1 or the convention {eta} <= 1. In our examples we allow values of {rho} in excess of 1.

Although these two levels of acceptance clearly separate the different forces of evolution acting on orthologous genes, the distinctions between penalties and rate classes become even clearer once we introduce correlated rates of evolution. Markov chains offer convenient vehicles for such an introduction, but they are restricted to a unidirectional flow of information and nearest-neighbor interactions. Markov random fields are more flexible and fully consistent with fast likelihood evaluation (Schadt and Lange 2002Citation ). The crux of the matter is to specify a realistic prior distribution for the rate class vector C = (C1, ..., Cn) over n codon sites. In this study the rate class Ci of codon site i assumes one of the values 0, ..., l - 1 for l rate classes. In the multilevel logistic model, the distribution of C can be written as the normalized sum


using the potential function


Here r is the spatial extent of the interactions, µj accounts for the proportion of sites in class j, and {theta}j accounts for the spatial interaction at distance j. The indicator function 1D equals 1 when the condition D is true and 0 otherwise. The normalizing factor {Sigma}d eH(d) is called the partition function. Its presence dictates that we impose a constraint such as µ0 = 0. All classes are a priori equally likely when all µj = 0. Positive values of {theta} favor positive interactions between sites and induce clumped rate class assignments. Negative values of {theta}'s favor negative interactions and induce alternating rate class assignments. The Ising model corresponds to the special case l = 2 and r = 1.

Each of these model generalizations operates within the context of a fixed tree. Our companion paper (Schadt and Lange 2002Citation ) also suggests methods for reducing the number of potential trees and systematically visiting those trees consistent with accepted taxonomic groupings. For a given tree, the estimation of parameters by maximum likelihood is by no means the only objective. In reconstructing evolution we want to impute codons at the root and other internal nodes of the tree. This is possible either by computing appropriate posterior probabilities node by node or by finding a single best global assignment through a variant of Viterbi's dynamic programming method (Viterbi 1967Citation ). Of course, codon imputation should be carried out using the best tree and the best-fitting model. Imputation of rate classes to codon sites is similarly helpful in visualizing conserved regions within proteins. One can also make these assignments locally by computing posterior probabilities or globally by dynamic programming.

Even though each of the above innovations is important in its own right, their integration in usable software is even more important. In the examples that follow, we illustrate the power of our program LINNAEUS in this integrative role. The data sets are familiar, so there is little need to analyze them exhaustively. To keep the current article within reasonable bounds, we have not undertaken large-scale simulations to prove the tautology that advances in model realism translate into better parameter estimates. Finally, note that although we ignore nucleotide and codon gaps, the models discussed in this study can accommodate gaps as additional states. Depending on user demand and programming feasibility, future versions of LINNAEUS will handle gaps.


    Results
 TOP
 Abstract
 Introduction
 Theory and Methods
 Results
 Discussion
 Acknowledgements
 References
 
An Example of Stochastic Rate Variation
As a first example we apply the stochastic rate variation model to the complete HIV-1 genomes of 14 varieties representing five different subtypes. The aligned sequences UGSE8891, Q2317, U455, HAN2, MBC925, RL42, 96BW0402, 92BR025, ETH2220, IBNG, SE6165, 92NG003, DRCBL, and HH87932 can be retrieved from the HIV Sequence Compendium 2000 (Kuiken et al. 2000Citation ) on the Los Alamos National Laboratory website (http://hiv-web.lanl.gov/). Figure 1 gives the accepted phylogeny of the 14 varieties. The substitution of gamma-distributed branch lengths for fixed branch lengths is a crude device for modeling rate variation across sites. For the sake of simplicity we conduct our comparisons using reversible nucleotide models at equilibrium.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 1.—The phylogeny used for the 14 HIV genome sequences discussed in the text. Listed branch lengths represent the estimated means for the associated gamma distributions. The SE of these parameter estimates appear in parentheses

 
Table 1 summarizes the maximum likelihood fits of two models, with estimated mean branch lengths listed in figure 1 . Comparison of the models by a likelihood ratio test is complicated by our choice of parameters. In the gamma model, branch b has mean {xi}b = {nu}/{zeta}b and variance {nu}/{zeta}2b = {xi}2b/{nu}. This suggests the alternative parameterization {xi}b and {psi} = {nu}-1. The fixed branch length model then corresponds to the choice {psi} = 0. Under the null hypothesis {psi} = 0, the likelihood ratio test statistic


displayed in table 1 should be a half-and-half mixture of a point mass at 0 and a {chi}21 random variable. Note here that 0 is a lower boundary for {psi}. Accordingly, the P-value of the test is 1/2Pr({chi}21 >= 4.90) = 0.013. The estimated {xi}b appear in figure 1 and show good agreement between the deterministic and stochastic models, except for a few internal branches. In our wider experience with the stochastic branch length model, rate variation among taxa must be large before there is much improvement in model fit. In extreme examples involving rapidly and slowly evolving paralogs within a given species, the stochastic branch length model delivers sharp improvements in fit.


View this table:
[in this window]
[in a new window]
 
Table 1 Parameter Estimates for 14 HIV-1 Genomes Using the Tree in Figure 1

 
Models for the ß-Globin Gene
To illustrate the flexibility and power of the codon substitution and spatial rate variation models, we now fit a sequence of models to the ß-globin data highlighted in Yang et al. (2000)Citation . These data consist of aligned sequences from 17 vertebrate taxa. The ß-globin gene in the taxa stretches over 148 codons and includes 11 {alpha}-helices comprising 111 codons. The highly conserved heme-binding histidines at codon positions 64 and 93 occur in nonhelical regions. For the sake of brevity we consider only reversible models. In view of Felsenstein's pulley principle, the likelihood of the data is insensitive to the root position. We consequently force both branches connected to the root to have length 1. To capture the complexity of evolution within this gene, we fitted a hierarchy of models and test nested models using likelihood ratio criteria.

The phylogeny depicted in figure 2 differs from the standard one advocated by Murphy et al. (2001)Citation , where the bushbaby appears with the other primates instead of with the lagomorpha, and the rodents are most closely related to the lagomorpha. We have chosen the current tree to permit comparisons with the results of Yang et al. (2000)Citation who correctly noted that it is the tree best supported by the ß-globin data. We will take up this point later.



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 2.—The Yang et al. (2000)Citation phylogeny for the 17 taxa in the ß-globin example. Branch lengths are estimated under model 9 in table 3 .

 
We fit nine models of increasing sophistication to the ß-globin data. Model 1 is the nucleotide model of Schadt, Sinsheimer, and Lange (1998)Citation with the reversibility constraints {alpha}{delta} = {epsilon}{kappa} and ß{gamma} = {lambda}{sigma} and the root probabilities (2). Model 2 is the codon model induced by this nucleotide model with rejection of transitions to stop codons. Model 3 adds a single acceptance probability {rho} for nonsynonymous codon substitutions. Model 4 replaces this simple acceptance probability by the four acceptance probabilities {rho}0 through {rho}3 previously discussed for the amino acid similarity sets S1 through S4. In the case of the codon models, the root codon frequencies are normalized products of the equilibrium nucleotide frequencies ignoring the three stop codons. Table 2 gives the maximum log likelihood, Akaike information criterion (AIC) (Sakamoto, Ishiguro, and Kitagaw 1986Citation ), chi-square statistic and P-value where appropriate, maximum likelihood parameter estimates, and standard errors (SE) of the parameter estimates for these first four models.


View this table:
[in this window]
[in a new window]
 
Table 2 Maximum Likelihood Estimates and Standard Errors for Four Single Rate Class Models Fit to the ß-Globin Data

 
Models 5 and 6 expand the single rate class of model 3 to two and three rate classes, respectively. Models 7 and 8 introduce spatial interaction parameters into model 6, at maximum distances of one and four codons, respectively. Finally, model 9 is a fairly complete multilevel logistic model having codon acceptance parameters over the similarity sets S1 through S4, three rate classes, and spatial correlation acting over a distance of four or fewer codons. Table 3 repeats for models 5 through 9 the statistics appearing in table 2 . Parameter estimates are missing in tables 2 and 3 whenever the corresponding parameters are ignored in the model. Going from top to bottom along the leftmost column of each table, we encounter first, kinetic parameters; second, acceptance parameters; third, rate class proportion parameters; and fourth, rate class interaction parameters.


View this table:
[in this window]
[in a new window]
 
Table 3 Maximum Likelihood Estimates and Standard Errors for Five Rate Class Models Fit to the ß-Globin Data

 
The induced codon model 2 with the same parameters as the nucleotide model 1 offers a substantial improvement in fit (AIC 7869.46 vs. AIC 7924.02). Further dramatic improvement is realized in going from model 2 to model 3, where general nonsynonymous substitutions are penalized. For these two nested models, the likelihood ratio {chi}2 = 213.42 with df = 1 (P << 10-20). The similarity sets introduced in model 4 yield a more modest but still significant improvement over model 3; {chi}2 = 16.13 with df = 3 and asymptotic P = 0.001. For these first four models, the focus is on codon transitions and substitution penalties. Models 5 through 9 stress rate classes and spatial correlations and also show large improvements in maximum likelihoods. In passing from model 3 to model 5 with two rate classes instead of one, we get {chi}2 = 179.24 with df = 2 (P << 10-20). We improve the fit of model 5 by allowing for three rate classes in model 6, with {chi}2 = 46.32 and df = 2 (P = 6.8 x 10-11). Finally, the spatial interaction terms present in models 7 and 8 provide further improvements in fit. The {chi}2 values in going from model 6 to model 7 and from model 7 to model 8 are 5.42 with df = 1 (P = 0.020) and 34.32 with df = 3 (P = 1.7 x 10-7), respectively. Adding spatial interaction terms provides even more striking improvements in fit in the mitochondrial example discussed subsequently. Finally, reintroducing four acceptance probabilities along with the three rate classes results in another improvement in fit, with {chi}2 = 19.48 with df = 3 (P = 0.00022).

In addition to the hierarchy of models described in tables 2 and 3 , we fit a modified version of model 2 using the substitution penalties (eq. 3 ) based on the hydrophobicity ratings given by Dressler and Potter (1991, pp. 134–135)Citation and the side-chain volumes given by Zamyatnin (1972)Citation . Table 4 presents the results of fitting the ß-globin data to three models. Model H includes only hydrophobicities, model V includes only volumes, and model HV includes both. Hydrophobicity is more predictive than is volume, but both effects are quite significant. Model HV has a much lower AIC than has model 4. This could be attributed to the superiority of scales to qualitative classifications or to the superiority of hydrophobicity to charge. More extensive experimentation with these data shows that charge penalties are highly correlated with hydrophobicity penalties and are almost as predictive. Hydrophobicity and volume are the best pair of predictors. On a marginal basis, the ß-globin data suggest the following order of importance in natural selection: (1) hydrophobicity, (2) charge, (3) volume, (4) surface area, and (5) molecular weight.


View this table:
[in this window]
[in a new window]
 
Table 4 Maximum Likelihood Estimates and Standard Errors for the Hydrophobicity-Volume Model Fit to the ß-Globin Data

 
These data also illustrate how rate classes can be imputed to codon sites and codons to internal nodes. On the basis of the preferred model 9, figure 3 plots the posterior probability and Viterbi rate class assignments site by site. These plots clarify the estimates summarized in tables 2 and 3 . The x-axis of figure 3 represents codon position. The posterior probabilities generate the three curves drawn in subplots two and four: (1) a light black curve representing the posterior probability of the slowest rate class, (2) a double-bold black curve representing the posterior probability of the middle rate class, and (3) a bold black curve representing the posterior probability of the fastest rate class. The scale of the y-axis in these subplots is logarithmic to base 10. An examination of the posterior probabilities is instructive. From codon sites 1–23, it is clear that the middle and fastest rate classes are the most likely, reflecting the somewhat variable nature of the codons in this segment. Codon sites 24 through 37 represent one of the most conserved stretches of the protein. A few other highly conserved regions are also evident from the displayed posterior probabilities.



View larger version (49K):
[in this window]
[in a new window]
 
Fig. 3.—Posterior probability and Viterbi rate class plots for 145 codons in the ß-globin molecule for the taxa represented in figure 2 . Plot conventions and interpretation are given in the text

 
Subplots one and three of figure 3 give the Viterbi rate class assignment for the entire protein. Here, 2 represents the slowest rate class, 1 the middle rate class, and 0 the fastest rate class. Particularly noteworthy are codon sites 58 through 63 and 87 through 102, which contain crucial histidine residues involved in heme binding. In the Viterbi plots, the several abrupt changes between slowest and fastest rate classes are consistent with the weak spatial interactions seen in the maximum likelihood analysis.

It is difficult to visualize simultaneously all codon assignments to internal nodes. As an interesting example, we highlight the assignments at codon site 7. Figure 4 gives the maximum posterior probability assignments and maximum parsimony assignments. Both these methods fix the contemporary taxa at their observed codons. At internal nodes where assignments differ, the posterior probability assignments appear in parentheses. Maximum parsimony operating at the nucleotide level gives the GAC codon (aspartic acid) for two of the internal nodes, whereas the posterior probability method operating at the codon level gives the AAC codon (asparagine). At the amino acid level, maximum parsimony gives alanine at these two nodes. It is possible that weighted parsimony might reconcile these discrepancies, but certainly minimizing the number of transitions between charge groups is a more rational strategy than the strategy of minimizing unweighted transition counts.



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 4.—Nucleotide and amino acid assignments at ß-globin codon site 7 for the phylogeny appearing in figure 2 .

 
Finally, we return to the question of the best tree. This data set affords the opportunity of identifying a best phylogeny subject to constraints. With 17 taxa for the ß-globin data, exhaustive enumeration of the more than 6 x 1015 possible phylogenies is out of question. But as described in our companion paper (Schadt and Lange 2002Citation ) we can easily enumerate all phylogenies with accepted groupings kept intact. For the sake of argument, we chose five invariant groups: (1) the primates consisting of the human, tarsier, and bushbaby, (2) the rodents consisting of the mouse, rat, and hamster, (3) the lagomorpha consisting of the hare and rabbit, (4) the artiodactyla and carnivora consisting of the pig, sheep, cow, and elephant seal, and (5) the out-group consisting of the opossum, chicken, duck, and two-clawed frogs. Assuming that each group is internally organized consistent with the conclusions of Murphy et al. (2001)Citation and that groups 1 through 4 are more closely related to each other than they are to group 5, we present in figure 5 the model 9 maximum log likelihoods computed by LINNAEUS for the 15 possible trees connecting these groups. The best-supported tree depicted in figure 5 (row 1 from top, column 1 from left) clearly differs from the tree (row 2, column 2) most consistent with the Murphy et al. (2001)Citation phylogeny.



View larger version (27K):
[in this window]
[in a new window]
 
Fig. 5.—Possible phylogenies for the ß-globin data with accepted taxonomic groupings. Model 9 maximum log likelihoods are given beneath each tree. The bovi/sui/phocidae label corresponds to the group containing the sheep and cow in the Bovidae family, the pig in the Suidae family, and the elephant seal in the Phocidae family

 
The model 9 maximum log likelihood (-3649.63) for the Yang et al. (2000)Citation tree in figure 2 surpasses the best log likelihood of the figure 5 trees. The figure 2 tree is hard to accept because it separates the bushbaby from the primates and shows the primates and rodents as nearest neighbors. Among the plausible reasons that can be advanced for these discrepancies are: (1) the small amount of data, (2) alignment bias, (3) gene duplication and loss, (4) convergent evolution, (5) model misspecification, and (6) lateral gene transfer (Cao et al. 1998Citation ). Of course, the point of this exercise is not to settle the issue but merely to illustrate the flexibility of LINNAEUS.

Spatial Interaction in MTGs
To further illustrate the advantages offered by the codon substitution and rate variation models, we turn to the MTG data featured in Yang et al. (2000)Citation . The MTG data consist of all the mitochondrial coding sequences from seven primates: human, common chimpanzee, bonobo chimpanzee, gorilla, Bornean orangutan, Sumatran orangutan, and common gibbon. The accepted phylogeny for these taxa is shown in figure 6 . Table 5 provides maximum likelihoods and parameter estimates for four reversible codon models. For ease of reference, figure 6 labels branches by their estimated lengths under model 3 as defined by table 5 .



View larger version (7K):
[in this window]
[in a new window]
 
Fig. 6.—The accepted phylogeny for the 7 taxa in the MTG example. Branch lengths are estimated under model 3 in table 5 .

 

View this table:
[in this window]
[in a new window]
 
Table 5 Maximum Likelihood Estimates and Standard Errors for Four Models Fit to the MTG Data

 
All four models use two or three rate classes and the amino acid similarity sets S1 through S4. The first three models involve two rate classes but vary in their extent of spatial interaction. The fourth model adds a third rate class. Model 1 postulates no spatial interaction, model 2 (the Ising model) spatial interaction at a distance of one codon, and model 3 spatial interaction at a distance of four codons. Model 4 differs from model 3 by adding a third rate class. The likelihood ratio statistics for passing from model 1 to model 2 and from model 2 to model 3 are 82.06 and 77.08, respectively. The corresponding asymptotic chi-square tests have df = 1 and 3 and are highly significant (P << 10-20).

In contrast, the likelihood ratio statistic for adding a third rate class is just 5.80 with df = 1 or 2 (P value between 0.016 and 0.055). In model 4, the estimate of {eta}2 hits its lower boundary of 0; hence, rate class 3 can only represent invariant sites. Our results conflict with those of Yang et al. (2000)Citation , who found support for three rate classes and evidence for positive selection. The importance of spatial correlation in the MTG is consistent with the observed patterns found in the compact genomes of many lower organisms, where most sites have an important biological function and are subject to stronger selective pressures than are the intergenic regions in more complex genomes.

We also fit the hydrophobicity-volume model to the MTG data. Table 6 reports our results for the analogs of ß-globin models H (hydrophobicity alone), V (volume alone), and HV (hydrophobicity and volume combined). These models can be compared with a codon model (model 0) without penalty and spatial parameters and with the same parameterization as model 2 for the ß-globin data (maximum log likelihood of -31,201.73 and AIC of 62,437.46) and with a codon model (model 5) with the four similarlity class penalties and with the same parameterization as model 4 for the ß-globin data (maximum loglikelihood of -29,574.45 and AIC of 59,188.90). The likelihood ratio tests noted in table 6 show that models H, V, and HV, all significantly improve fit over model 0. Hence, hydrophobicity and side-chain volume are both important determinants of codon substitution, with hydrophobicity having the greater impact. But AIC values favor the similarity class model 5 over the HV model.


View this table:
[in this window]
[in a new window]
 
Table 6 Maximum Likelihood Estimates and Standard Errors for the Hydrophobicity-Volume Model Fit to the MTG Data

 
Figure 7 serves the same purpose as do the subplots one and three of figure 3 and depicts the Viterbi rate class vector for a portion of four coding sequences in the MTG. Starting from the top of the figure, the sequences range from a strongly conserved portion of the MTATP6 gene, to a moderately conserved portion of the cytochrome c subunit 3 gene, to a less conserved portion of the cytochrome b and NADH dehydrogenase subunit 3 genes. In figure 7 , 1 represents the slow rate class and 0 the fast rate class. In contrast to the ß-globin plot, we see longer stretches assigned to the slow rate class in the more conserved genes. This is consistent with the stronger overall spatial interaction observed for this sequence.



View larger version (39K):
[in this window]
[in a new window]
 
Fig. 7.—Viterbi rate class assignments for portions of 4 genes in the MTG. Rate 1 represents the slow class and rate 0 the fast class

 
Program Performance
The performance of LINNAEUS in maximum likelihood estimation varies widely depending on the quality of parameter starting values, the number of iterations until convergence, the length of the sequence under analysis, the number of rate classes, and the extent of spatial interactions. For the small ß-globin example, the average, per-iteration run time for models 1 through 5 was 0.04, 6.29, 7.10, 9.23, and 25.71 s, respectively, on a Pentium IV 1.2 GHz PC with 1 GB of RAM. The more extensive HIV and mitochondrial codon examples require about 15 min per maximum likelihood run. In general, once the step to codon models is taken, the computational price of adding rate variation and spatial interaction is modest.

Profiling the LINNAEUS code reveals that the majority of time in likelihood evaluation is spent exponentiating the infinitesimal generator. It follows that the complexity of likelihood evaluation scales linearly with the number of rate classes. Cursory examination of run times also supports the importance of computing partial derivatives analytically rather than numerically. With n parameters, it takes n + 1 likelihood evaluations to compute partial derivatives by forward differences and 2n likelihood evaluations to compute them by central differences. In practice, we see an approximately n-fold reduction in computing times using the derivative formulas derived in Schadt and Lange (2002)Citation .

The number of trees examined is also a crucial determinant of computational complexity. There are several solutions to this bottleneck. One is to screen large numbers of trees by statistically inferior methods using distance matrices or maximum parsimony. This can identify a subset of good candidate trees for more thorough maximum likelihood comparison. If the number of candidate trees is moderately large, then application of a nucleotide model rather than a codon model might be prudent in a first pass. Another tactic is to restrict the trees examined by keeping accepted taxonomic groupings intact. This works well in the ß-globin example. Finally, it is straightforward to conduct several maximum likelihood searches in parallel on multiprocessor or clustered machines. LINNAEUS readily adapts to parallelization.


    Discussion
 TOP
 Abstract
 Introduction
 Theory and Methods
 Results
 Discussion
 Acknowledgements
 References
 
As initially observed by Zuckerkandl and Pauling (1965, pp. 97–116)Citation , amino acids with similar physical and chemical properties are more likely to replace one another than are amino acids with dissimilar properties. To improve phylogenetic inference we need to pay attention to this fact. One possibility is to take advantage of the amino acid metric of Miyata, Miyazawa, and Yasunaga (1979)Citation . This metric measures the distance between amino acid pairs on the basis of the polarity and volume of their side chains. In the phylogenetic models of Yang, Nielsen, and Hasegawa (1998)Citation and Yang et al. (2000)Citation , it takes only two additional parameters on an exponential scale to exploit the Miyata and Yasunga metric. We have taken a second track suggested by Yang, Nielsen, and Hasegawa (1998)Citation by grouping amino acids in similarity classes and introducing penalties for transitions between amino acids of the same or different classes. We do not penalize synonymous substitutions. Our initial bias favored the similarity class approach over the metric approach. One objection to the metric model is that it combines various chemical and physical properties into a single distance. This amalgamation deters the assessment of the marginal impact of each property. The penalty model (eq. 3 ) overcomes this objection. It is interesting that the ß-globin and MTG data offer no clear verdict in deciding which method is superior. Application of either the similarity class method or the amended metric method is certainly preferable to ignoring the physical and chemical properties of the underlying amino acids. Both methods fail to account for selection in synonymous substitution. In principle, this could operate through different rates of translation or degradation of RNA message.

Models with multiple rate classes are well equipped to capture variation in evolutionary pressure. Viewing rate class assignment as a Markov random field allows the identification of patches of codon sites evolving at uniform rates. Spatial correlation in evolutionary pressure is plausible and should be distinguished from the general penalties incurred in replacing one amino acid by a dissimilar amino acid. In the mitochondrial data, we see substantial improvements in fit after inclusion of spatial interaction in rate classes. Our companion paper (Schadt and Lange 2002Citation ) argues that maximum likelihood estimation is still possible with a modest amount of correlation that skips nearest neighbors and acts at a distance. The geometry of proteins certainly brings amino acids that are separated by long stretches of sequence into close proximity. How best to model action at a distance is still an open question, but the algorithmic machinery exists for likelihood evaluation in simple cases.

One of the themes we have stressed is imputation of codons to internal nodes and rate classes to codon sites. These assignments can be made locally by computing posterior probabilities or globally by Viterbi's method of dynamic programming. Basing assignments on complex models fit to real data should lead to valuable insights into evolutionary history and conserved regions of proteins. The plots in figures 4 and 7 are instructive in this regard. The massive data sets now available from genomic sequences deserve detailed analysis that goes beyond traditional phylogeny reconstruction. In accordance with this, we can expect better estimates of branch lengths using more realistic models. Poor models confuse and confound various effects. Good models separate these effects and prevent one from masking another.

The best models are both realistic and computable. It is a mistake to cavalierly dismiss computational bottlenecks with the statement that they will eventually respond to the relentless increases in processor speeds and parallel computing. When problem complexity scales exponentially in some important determinant, no hardware advances are likely to clear the bottleneck. For this reason, research on algorithms is still relevant, possibly more so for Bayesian analysis than for maximum likelihood analysis. Without the computational advances incorporated in LINNAEUS, it would be much more painful to undertake data analysis based on codon models and rate class variation. Our algorithms for partial differentiation of the likelihood alone have resulted in huge reductions in computation times. We hope the availability of LINNAEUS will accelerate the transition to more realistic likelihood models. The biological and mathematical complexity of these methods should no longer be a barrier to their use even if we prefer to teach the simpler methods to undergraduates.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Theory and Methods
 Results
 Discussion
 Acknowledgements
 References
 
This research was supported in part by USPHS grants GM08185 (E.E.S.), CA16042 (J.S.S.), and GM53275 (K.L.).


    Footnotes
 
Fumio Tajima, Reviewing Editor

Keywords: evolution DNA sequence alignment Markov chain Markov random field maximum likelihood dynamic programming Back

Address for correspondence and reprints: Department of Computational Genomics, Rosetta Inpharmatics, 12040 115th Avenue NE, Kirkland, WA 98034. E-mail: eric_schadt{at}merck.com Back


    References
 TOP
 Abstract
 Introduction
 Theory and Methods
 Results
 Discussion
 Acknowledgements
 References
 

    Blanchette M., G. Bourque, D. Sankoff, 1997 Breakpoint phylogenies Genome Inform. Ser. Workshop Genome Inform 8:25-34[Medline]

    Cao Y., A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Paabo, M. Hasagawa, 1998 Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders J. Mol. Evol 47:307-322[ISI][Medline]

    Cressie N. A. C., 1993 Statistics for spatial data Wiley, New York

    Dressler D., G. Potter, 1991 Discovering enzymes Scientific American Library, New York

    Felsenstein J., G. A. Churchill, 1996 A Hidden Markov Model approach to variation among sites in rate of evolution Mol. Biol. Evol 13:93-104[Abstract]

    Goldman N., J. L. Thorne, D. T. Jones, 1998 Assessing the impact of secondary structure and solvent accessibility on protein evolution Genetics 149:445-458[Abstract/Free Full Text]

    Goldman N., Z. Yang, 1994 A codon-based model of nucleotide substitution for protein-coding DNA sequences Mol. Biol. Evol 11:725-736[Abstract/Free Full Text]

    Hasegawa M., H. Kishino, T. Yano, 1985 A new molecular clock of mitochondrial DNA and the evolution of hominoids Proc. Jpn. Acad. Sci. B 60:95-98

    Jukes T. H., C. R. Cantor, 1969 Evolution of protein molecules Academic Press, New York

    Kelly C., J. Rice, 1996 Modeling nucleotide evolution: a heterogeneous rate analysis Math. Biosci 133:85-109[ISI][Medline]

    Kimura M., 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences J. Mol. Evol 16:111-120[ISI][Medline]

    Kuiken C., B. Foley, B. Hahn, P. Marx, J. McCutchan, J. Mellors, J. Mullins, S. Wolinsky, B. Korber, 2000 HIV sequence compendium 2000 Edited by Los Alamos National Laboratory, Los Alamos, N.M.

    Lander E. S., L. M. Linton, B. Birren, et al. (254 co-authors) 2001 Initial sequencing and analysis of the human genome Nature 409:860-921[ISI][Medline]

    Miyata T., S. Miyazawa, T. Yasunaga, 1979 Two types of amino acid substitutions in protein evolution J. Mol. Evol 12:219-236[ISI][Medline]

    Moret B. M., L. S. Wang, T. Warnow, S. K. Wyman, 2001 New approaches for reconstructing phylogenies from gene order data Bioinformatics 17:S165-S173[Abstract/Free Full Text]

    Murphy W. J., E. Eizirik, W. E. Johnson, Y. P. Zhang, O. A. Ryder, S. J. O'Brien, 2001 Molecular phylogenetics and the origins of placental mammals Nature 409:614-618[ISI][Medline]

    Nei M., T. Gojobori, 1986 Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions Mol. Biol. Evol 3:418-426[Abstract]

    Pedersen A. K., C. Wiuf, F. B. Christiansen, 1998 A codon-based model designed to describe lentiviral evolution Mol. Biol. Evol 15:1069-1081[Abstract]

    Sakamoto Y., M. Ishiguro, G. Kitagawa, 1986 Akaike information criterion statistics D. Reidel Publishing Company, Dordrecht, Holland

    Schadt E. E., K. Lange, 2002 Codon and rate variation models in molecular phylogeny Mol. Biol. Evol 19:1534-1549.[Abstract/Free Full Text]

    Schadt E. E., J. S. Sinsheimer, K. Lange, 1998 Computational advances in maximum likelihood methods for molecular phylogeny Genome Res 8:222-233[Abstract/Free Full Text]

    Tamura K., M. Nei, 1993 Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees Mol. Biol. Evol 10:512-526[Abstract]

    Viterbi J., 1967 Error bounds for convolutional codes and an asymptotically optimal decoding algorithm IEEE Trans. Information Theory 13:260-269

    Wagner H., E. Baake, T. Gerisch, 1999 Ising quantum chain and sequence evolution J. Stat. Phys 92:1017-1052[ISI]

    Yang Z., 1993 Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites Mol. Biol. Evol 10:1396-1401[Abstract]

    ———. 1994 Estimating the pattern of nucleotide substitution J. Mol. Evol 39:105-111[ISI][Medline]

    ———. 1995 A space-time process model for the evolution of DNA sequences Genetics 139:993-1005[Abstract/Free Full Text]

    Yang Z., R. Nielsen, N. Goldman, A. M. Pedersen, 2000 Codon-substitution models for heterogeneous selection pressure at amino acid sites Genetics 155:431-449[Abstract/Free Full Text]

    Yang Z., R. Nielsen, M. Hasegawa, 1998 Models of amino acid substitution and applications to mitochondrial protein evolution Mol. Biol. Evol 15:1600-1611[Abstract/Free Full Text]

    Zamyatnin A. A., 1972 Protein volume in solution Prog. Biophys. Mol. Biol 24:107-123[Medline]

    Zuckerkandl E., L. Pauling, 1965 Evolutionary divergence and convergence in proteins Academic Press, New York

Accepted for publication May 2, 2002.