*Department of Biomathematics, UCLA School of Medicine, Los Angeles;
Department of Computational Genomics, Rosetta Inpharmatics, 12040 115th Avenue NE, Kirkland, Washington;
Department of Human Genetics, UCLA School of Medicine, Los Angeles
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Considerable progress has already been made in this program for integrating selection and reconstruction. Models for codons, for rate heterogeneity among sites, and for spatial correlation in rate assignment are the subjects of several recent papers (Goldman and Yang 1994
; Yang 1995
; Felsenstein and Churchill 1996
; Goldman, Thorne, and Jones 1998
; Yang, Nielsen, and Hasegawa 1998
; Wagner, Baake, and Gerisch 1999
; Yang et al. 2000
). Our companion paper (Schadt and Lange 2002
) addresses mathematical and algorithmic issues raised by generalizations of these models. The current article summarizes and applies our generalizations to actual data.
The models presented in Yang, Nielsen, and Hasegawa (1998)
, Wagner, Baake, and Gerisch (1999)
, and Yang et al. (2000)
are closest in spirit to the models featured in our article. Yang, Nielsen, and Hasegawa (1998)
introduce amino acid acceptance rates in a codon model. In their model, nonsynonymous codon substitutions are accepted more or less freely, depending on empirically measured distances between amino acids. Yang et al. (2000)
discuss different rate class models. Wagner, Baake, and Gerisch (1999)
develop a four-state mutation-selection model for DNARNA sequences. This model uses coupled Ising chains to incorporate rate dependency among nucleotide sites. Our companion paper (Schadt and Lange 2002
) generalizes these models to an arbitrary number of rate classes and to rate correlation at a distance.
Now that the evolutionary relationships connecting many species have been sorted out, the more subtle questions of structure, function, and regulation will drive the field of molecular phylogeny. More sophisticated models should aid in the functional annotation of genes. Identification of conserved domains is bound to benefit from information on rates of evolution, strength of interaction between spatially separated sites, and most likely assignment of codons to internal nodes and of codon sites to rate classes.
As a guide to what follows, we first briefly summarize the generalized codon substitution and spatial rate variation models covered in detail by Schadt and Lange (2002)
. The current article emphasizes applications, intuition, and interpretation. In our view, data analysis of coding regions should be conducted at the codon level where mutation and selection interact to determine amino acid substitutions. In modeling this interplay, we deal with the tradeoffs between accuracy and parsimony by postulating amino acid similarity classes and estimating acceptance probabilities for proposed substitutions within and between classes. To capture spatial interactions in rate class assignments, we turn to Markov random fields (Cressie 1993
, pp. 410423). These offer a rich theoretical context and host of machinery for elaborating models. To illustrate the strengths of each of these model advances, we analyze real sequence data on ß-globin, the complete HIV genome, and the mitochondrial genome (MTG). All algorithms presented in this study or in Schadt and Lange (2002)
are implemented in the LINNAEUS software package, freely available to nonprofit researchers upon request.
![]() |
Theory and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
An entry pij(t) of the matrix exponential P(t) = et gives the probability of going from state (nucleotide) i at time 0 to state (nucleotide) j at time t. Computation of any matrix's exponential is far easier when the matrix can be diagonalized. Fortunately, all eigenvalues and eigenvectors of
are real and explicitly computable. Reversibility is achieved by imposing the constraints ß
=
and
=
on the infinitesimal transition rates. Under reversibility, the equilibrium frequencies of the different states are
|
Several authors have suggested an elaboration of the nucleotide model allowing stochastic variation in branch lengths (Nei and Gojobori 1986
; Tamura and Nei 1993
; Yang 1993
; Kelly and Rice 1996
). If this extension is carried out subject to one branch length having unit mean (Yang 1993
), then all kinetic parameters and branch length parameters are statistically identifiable. The most tractable model assumes that branch lengths are independent realizations of gamma random variables. In other words, for each branch b and each nucleotide site, the branch length Tb is an independent gamma variate with scale parameter
b and shape parameter
b. It is parsimonious to eliminate one parameter per branch by postulating a common coefficient of variation
|
Any Markov chain model for nucleotide substitution, including the one with infinitesimal generator (1), induces a codon substitution model in a natural fashion. If the infinitesimal transition rate of going from nucleotide b to nucleotide d is vbd, then the infinitesimal transition rate in going from codon (a, b, c) to codon (a, d, c) is also vbd. (Similar expressions hold when the mutated nucleotide occurs in the first or third position.) The only exception to this rule is that transitions to stop codons are disallowed. An induced codon model has 61 states and a rather sparse infinitesimal generator . All finite-time transition probabilities are entries of the matrix exponential et
by analogy to the nucleotide model. The computation of et
is harder compared with the computation of et
because of the sheer number of states involved and the lack of explicit eigenvalues and eigenvectors. Although using an induced codon model seems like a trivial conceptual advance, eliminating passages to stop codons often leads to a large increase in the maximum likelihood of the induced codon model compared with the maximum likelihood of the nucleotide model.
Selection can be added to a codon model by introducing acceptance probabilities. If two neighboring codons, such as (a, b, c) and (a, d, c) are nonsynonymous, then we postulate a probability with which substitutions from one to the other are actually accepted. Rejected substitutions correspond to mutations that are eliminated by selection. In general,
depends on its start and destination codons. If we impose symmetry constraints such as
(a,b,c)
(a,d,c) =
(a,d,c)
(a,b,c), then any reversible nucleotide model induces a reversible codon model. Reversibility permits consideration of unrooted trees rather than rooted trees and has many numerical advantages. This class of model is similar to those described in Pedersen, Wiuf, and Christiansen (1998)
and Yang, Nielsen, and Hasegawa (1998)
.
In modeling acceptance probabilities, we must balance parsimony and biological realism. One flexible device is to divide the amino acids into similarity (or mathematical equivalence) classes. The most comprehensive model would assign each amino acid to its own similarity class, but this model generates more parameters than can be reliably estimated from most data sets, even with the symmetry constraints necessary for reversibility imposed. In many of the applications presented in this study, we divide the amino acids into four similarity classes on the basis of polarity. These classes are: S1 = {G, I, V, L, A, M, P, F, W}, S2 = {Q, N, Y, H, K, R}, S3 = {S, T, E, D}, and S4 = {C}. Cysteine (C) is placed in its own class because of its propensity to form disulfide bridges. Four 's are estimated: a within class acceptance rate
0, an acceptance rate
1 between the nonpolar class S1 and either of the polar classes S2 and S3, an acceptance rate
2 between the two opposite polar classes S2 and S3, and an acceptance rate
3 between cysteine and any other class.
Because these natural groupings according to charge and polarity omit many important chemical determinants, LINNAEUS is written flexibly enough to accommodate alternative decompositions. Of course, similarity classes hardly exhaust the modeling possibilities. To take advantage of the diverse ways in which chemists rate amino acids, we briefly explore penalty models of the form
|
In addition to introducing penalties for nonsynonymous substitutions, it is also helpful to incorporate different rate classes in statistical analysis. Indeed, codon sites in crucial catalytic domains or in conserved secondary structures are under much stronger selective pressure than are codon sites in other parts of a protein. Rate classes are distinguished by a second level of acceptance probabilities. Suppose that there are l such classes with class i assigned parameter i. Then the nonsynonymous substitution (a, b, c)
(a, d, c) has infinitesimal transition rate vbd
i depending on
i and the penalty
specified by the two amino acids. In contrast, a synonymous substitution (a, b, c)
(a, d, c) has infinitesimal transition rate vbd. In most applications it is convenient to order the classes so that class 0 evolves fastest, immediately followed by class 1 and so on, down to class l - 1 which evolves slowest. In symbols, this implies 0 <
i+1 <
i for all i < l - 1. To better identify all parameters, we usually constrain
0 = 1 and restrict l
3.
It is worth noting that and
generalize the nonsynonymoussynonymous ratio parameter
appearing in the discrete model of Yang et al. (2000)
. Specifically, if there are multiple rate classes and a single
, then our model corresponds to the discrete model M3 of Yang et al. (2000)
. Multiple
's allow for differences in nonsynonymous rates both within as well as between rate classes (Yang, Nielsen, and Hasegawa 1998
). Positive selection occurs when an
or a product
exceeds 1. In the latter case we must relax either the convention
1 or the convention
1. In our examples we allow values of
in excess of 1.
Although these two levels of acceptance clearly separate the different forces of evolution acting on orthologous genes, the distinctions between penalties and rate classes become even clearer once we introduce correlated rates of evolution. Markov chains offer convenient vehicles for such an introduction, but they are restricted to a unidirectional flow of information and nearest-neighbor interactions. Markov random fields are more flexible and fully consistent with fast likelihood evaluation (Schadt and Lange 2002
). The crux of the matter is to specify a realistic prior distribution for the rate class vector C = (C1, ..., Cn) over n codon sites. In this study the rate class Ci of codon site i assumes one of the values 0, ..., l - 1 for l rate classes. In the multilevel logistic model, the distribution of C can be written as the normalized sum
|
|
Each of these model generalizations operates within the context of a fixed tree. Our companion paper (Schadt and Lange 2002
) also suggests methods for reducing the number of potential trees and systematically visiting those trees consistent with accepted taxonomic groupings. For a given tree, the estimation of parameters by maximum likelihood is by no means the only objective. In reconstructing evolution we want to impute codons at the root and other internal nodes of the tree. This is possible either by computing appropriate posterior probabilities node by node or by finding a single best global assignment through a variant of Viterbi's dynamic programming method (Viterbi 1967
). Of course, codon imputation should be carried out using the best tree and the best-fitting model. Imputation of rate classes to codon sites is similarly helpful in visualizing conserved regions within proteins. One can also make these assignments locally by computing posterior probabilities or globally by dynamic programming.
Even though each of the above innovations is important in its own right, their integration in usable software is even more important. In the examples that follow, we illustrate the power of our program LINNAEUS in this integrative role. The data sets are familiar, so there is little need to analyze them exhaustively. To keep the current article within reasonable bounds, we have not undertaken large-scale simulations to prove the tautology that advances in model realism translate into better parameter estimates. Finally, note that although we ignore nucleotide and codon gaps, the models discussed in this study can accommodate gaps as additional states. Depending on user demand and programming feasibility, future versions of LINNAEUS will handle gaps.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
|
The phylogeny depicted in figure 2
differs from the standard one advocated by Murphy et al. (2001)
, where the bushbaby appears with the other primates instead of with the lagomorpha, and the rodents are most closely related to the lagomorpha. We have chosen the current tree to permit comparisons with the results of Yang et al. (2000)
who correctly noted that it is the tree best supported by the ß-globin data. We will take up this point later.
|
|
|
In addition to the hierarchy of models described in tables 2
and 3
, we fit a modified version of model 2 using the substitution penalties (eq. 3
) based on the hydrophobicity ratings given by Dressler and Potter (1991, pp. 134135)
and the side-chain volumes given by Zamyatnin (1972)
. Table 4
presents the results of fitting the ß-globin data to three models. Model H includes only hydrophobicities, model V includes only volumes, and model HV includes both. Hydrophobicity is more predictive than is volume, but both effects are quite significant. Model HV has a much lower AIC than has model 4. This could be attributed to the superiority of scales to qualitative classifications or to the superiority of hydrophobicity to charge. More extensive experimentation with these data shows that charge penalties are highly correlated with hydrophobicity penalties and are almost as predictive. Hydrophobicity and volume are the best pair of predictors. On a marginal basis, the ß-globin data suggest the following order of importance in natural selection: (1) hydrophobicity, (2) charge, (3) volume, (4) surface area, and (5) molecular weight.
|
|
It is difficult to visualize simultaneously all codon assignments to internal nodes. As an interesting example, we highlight the assignments at codon site 7. Figure 4 gives the maximum posterior probability assignments and maximum parsimony assignments. Both these methods fix the contemporary taxa at their observed codons. At internal nodes where assignments differ, the posterior probability assignments appear in parentheses. Maximum parsimony operating at the nucleotide level gives the GAC codon (aspartic acid) for two of the internal nodes, whereas the posterior probability method operating at the codon level gives the AAC codon (asparagine). At the amino acid level, maximum parsimony gives alanine at these two nodes. It is possible that weighted parsimony might reconcile these discrepancies, but certainly minimizing the number of transitions between charge groups is a more rational strategy than the strategy of minimizing unweighted transition counts.
|
|
Spatial Interaction in MTGs
To further illustrate the advantages offered by the codon substitution and rate variation models, we turn to the MTG data featured in Yang et al. (2000)
. The MTG data consist of all the mitochondrial coding sequences from seven primates: human, common chimpanzee, bonobo chimpanzee, gorilla, Bornean orangutan, Sumatran orangutan, and common gibbon. The accepted phylogeny for these taxa is shown in figure 6
. Table 5
provides maximum likelihoods and parameter estimates for four reversible codon models. For ease of reference, figure 6
labels branches by their estimated lengths under model 3 as defined by table 5
.
|
|
In contrast, the likelihood ratio statistic for adding a third rate class is just 5.80 with df = 1 or 2 (P value between 0.016 and 0.055). In model 4, the estimate of 2 hits its lower boundary of 0; hence, rate class 3 can only represent invariant sites. Our results conflict with those of Yang et al. (2000)
, who found support for three rate classes and evidence for positive selection. The importance of spatial correlation in the MTG is consistent with the observed patterns found in the compact genomes of many lower organisms, where most sites have an important biological function and are subject to stronger selective pressures than are the intergenic regions in more complex genomes.
We also fit the hydrophobicity-volume model to the MTG data. Table 6 reports our results for the analogs of ß-globin models H (hydrophobicity alone), V (volume alone), and HV (hydrophobicity and volume combined). These models can be compared with a codon model (model 0) without penalty and spatial parameters and with the same parameterization as model 2 for the ß-globin data (maximum log likelihood of -31,201.73 and AIC of 62,437.46) and with a codon model (model 5) with the four similarlity class penalties and with the same parameterization as model 4 for the ß-globin data (maximum loglikelihood of -29,574.45 and AIC of 59,188.90). The likelihood ratio tests noted in table 6 show that models H, V, and HV, all significantly improve fit over model 0. Hence, hydrophobicity and side-chain volume are both important determinants of codon substitution, with hydrophobicity having the greater impact. But AIC values favor the similarity class model 5 over the HV model.
|
|
Profiling the LINNAEUS code reveals that the majority of time in likelihood evaluation is spent exponentiating the infinitesimal generator. It follows that the complexity of likelihood evaluation scales linearly with the number of rate classes. Cursory examination of run times also supports the importance of computing partial derivatives analytically rather than numerically. With n parameters, it takes n + 1 likelihood evaluations to compute partial derivatives by forward differences and 2n likelihood evaluations to compute them by central differences. In practice, we see an approximately n-fold reduction in computing times using the derivative formulas derived in Schadt and Lange (2002)
.
The number of trees examined is also a crucial determinant of computational complexity. There are several solutions to this bottleneck. One is to screen large numbers of trees by statistically inferior methods using distance matrices or maximum parsimony. This can identify a subset of good candidate trees for more thorough maximum likelihood comparison. If the number of candidate trees is moderately large, then application of a nucleotide model rather than a codon model might be prudent in a first pass. Another tactic is to restrict the trees examined by keeping accepted taxonomic groupings intact. This works well in the ß-globin example. Finally, it is straightforward to conduct several maximum likelihood searches in parallel on multiprocessor or clustered machines. LINNAEUS readily adapts to parallelization.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Models with multiple rate classes are well equipped to capture variation in evolutionary pressure. Viewing rate class assignment as a Markov random field allows the identification of patches of codon sites evolving at uniform rates. Spatial correlation in evolutionary pressure is plausible and should be distinguished from the general penalties incurred in replacing one amino acid by a dissimilar amino acid. In the mitochondrial data, we see substantial improvements in fit after inclusion of spatial interaction in rate classes. Our companion paper (Schadt and Lange 2002
) argues that maximum likelihood estimation is still possible with a modest amount of correlation that skips nearest neighbors and acts at a distance. The geometry of proteins certainly brings amino acids that are separated by long stretches of sequence into close proximity. How best to model action at a distance is still an open question, but the algorithmic machinery exists for likelihood evaluation in simple cases.
One of the themes we have stressed is imputation of codons to internal nodes and rate classes to codon sites. These assignments can be made locally by computing posterior probabilities or globally by Viterbi's method of dynamic programming. Basing assignments on complex models fit to real data should lead to valuable insights into evolutionary history and conserved regions of proteins. The plots in figures 4 and 7 are instructive in this regard. The massive data sets now available from genomic sequences deserve detailed analysis that goes beyond traditional phylogeny reconstruction. In accordance with this, we can expect better estimates of branch lengths using more realistic models. Poor models confuse and confound various effects. Good models separate these effects and prevent one from masking another.
The best models are both realistic and computable. It is a mistake to cavalierly dismiss computational bottlenecks with the statement that they will eventually respond to the relentless increases in processor speeds and parallel computing. When problem complexity scales exponentially in some important determinant, no hardware advances are likely to clear the bottleneck. For this reason, research on algorithms is still relevant, possibly more so for Bayesian analysis than for maximum likelihood analysis. Without the computational advances incorporated in LINNAEUS, it would be much more painful to undertake data analysis based on codon models and rate class variation. Our algorithms for partial differentiation of the likelihood alone have resulted in huge reductions in computation times. We hope the availability of LINNAEUS will accelerate the transition to more realistic likelihood models. The biological and mathematical complexity of these methods should no longer be a barrier to their use even if we prefer to teach the simpler methods to undergraduates.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Keywords: evolution
DNA sequence
alignment
Markov chain
Markov random field
maximum likelihood
dynamic programming
Address for correspondence and reprints: Department of Computational Genomics, Rosetta Inpharmatics, 12040 115th Avenue NE, Kirkland, WA 98034. E-mail: eric_schadt{at}merck.com
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Blanchette M., G. Bourque, D. Sankoff, 1997 Breakpoint phylogenies Genome Inform. Ser. Workshop Genome Inform 8:25-34[Medline]
Cao Y., A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Paabo, M. Hasagawa, 1998 Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders J. Mol. Evol 47:307-322[ISI][Medline]
Cressie N. A. C., 1993 Statistics for spatial data Wiley, New York
Dressler D., G. Potter, 1991 Discovering enzymes Scientific American Library, New York
Felsenstein J., G. A. Churchill, 1996 A Hidden Markov Model approach to variation among sites in rate of evolution Mol. Biol. Evol 13:93-104[Abstract]
Goldman N., J. L. Thorne, D. T. Jones, 1998 Assessing the impact of secondary structure and solvent accessibility on protein evolution Genetics 149:445-458
Goldman N., Z. Yang, 1994 A codon-based model of nucleotide substitution for protein-coding DNA sequences Mol. Biol. Evol 11:725-736
Hasegawa M., H. Kishino, T. Yano, 1985 A new molecular clock of mitochondrial DNA and the evolution of hominoids Proc. Jpn. Acad. Sci. B 60:95-98
Jukes T. H., C. R. Cantor, 1969 Evolution of protein molecules Academic Press, New York
Kelly C., J. Rice, 1996 Modeling nucleotide evolution: a heterogeneous rate analysis Math. Biosci 133:85-109[ISI][Medline]
Kimura M., 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences J. Mol. Evol 16:111-120[ISI][Medline]
Kuiken C., B. Foley, B. Hahn, P. Marx, J. McCutchan, J. Mellors, J. Mullins, S. Wolinsky, B. Korber, 2000 HIV sequence compendium 2000 Edited by Los Alamos National Laboratory, Los Alamos, N.M.
Lander E. S., L. M. Linton, B. Birren, et al. (254 co-authors) 2001 Initial sequencing and analysis of the human genome Nature 409:860-921[ISI][Medline]
Miyata T., S. Miyazawa, T. Yasunaga, 1979 Two types of amino acid substitutions in protein evolution J. Mol. Evol 12:219-236[ISI][Medline]
Moret B. M., L. S. Wang, T. Warnow, S. K. Wyman, 2001 New approaches for reconstructing phylogenies from gene order data Bioinformatics 17:S165-S173
Murphy W. J., E. Eizirik, W. E. Johnson, Y. P. Zhang, O. A. Ryder, S. J. O'Brien, 2001 Molecular phylogenetics and the origins of placental mammals Nature 409:614-618[ISI][Medline]
Nei M., T. Gojobori, 1986 Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions Mol. Biol. Evol 3:418-426[Abstract]
Pedersen A. K., C. Wiuf, F. B. Christiansen, 1998 A codon-based model designed to describe lentiviral evolution Mol. Biol. Evol 15:1069-1081[Abstract]
Sakamoto Y., M. Ishiguro, G. Kitagawa, 1986 Akaike information criterion statistics D. Reidel Publishing Company, Dordrecht, Holland
Schadt E. E., K. Lange, 2002 Codon and rate variation models in molecular phylogeny Mol. Biol. Evol 19:1534-1549.
Schadt E. E., J. S. Sinsheimer, K. Lange, 1998 Computational advances in maximum likelihood methods for molecular phylogeny Genome Res 8:222-233
Tamura K., M. Nei, 1993 Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees Mol. Biol. Evol 10:512-526[Abstract]
Viterbi J., 1967 Error bounds for convolutional codes and an asymptotically optimal decoding algorithm IEEE Trans. Information Theory 13:260-269
Wagner H., E. Baake, T. Gerisch, 1999 Ising quantum chain and sequence evolution J. Stat. Phys 92:1017-1052[ISI]
Yang Z., 1993 Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites Mol. Biol. Evol 10:1396-1401[Abstract]
. 1994 Estimating the pattern of nucleotide substitution J. Mol. Evol 39:105-111[ISI][Medline]
. 1995 A space-time process model for the evolution of DNA sequences Genetics 139:993-1005
Yang Z., R. Nielsen, N. Goldman, A. M. Pedersen, 2000 Codon-substitution models for heterogeneous selection pressure at amino acid sites Genetics 155:431-449
Yang Z., R. Nielsen, M. Hasegawa, 1998 Models of amino acid substitution and applications to mitochondrial protein evolution Mol. Biol. Evol 15:1600-1611
Zamyatnin A. A., 1972 Protein volume in solution Prog. Biophys. Mol. Biol 24:107-123[Medline]
Zuckerkandl E., L. Pauling, 1965 Evolutionary divergence and convergence in proteins Academic Press, New York