Codon and Rate Variation Models in Molecular Phylogeny

Eric Schadt*† and Kenneth Lange*‡

*Department of Biomathematics, UCLA School of Medicine, Los Angeles, CA;
†Computational Genomics, Rosetta Inpharmatics, Kirkland, WA;
‡Department of Human Genetics, UCLA School of Medicine, Los Angeles, CA


    Abstract
 
This article generalizes previous models for codon substitution and rate variation in molecular phylogeny. Particular attention is paid to (1) reversibility, (2) acceptance and rejection of proposed codon changes, (3) varying rates of evolution among codon sites, and (4) the interaction of these sites in determining evolutionary rates. To accommodate spatial variation in rates, Markov random fields rather than Markov chains are introduced. Because these innovations complicate maximum likelihood estimation in phylogeny reconstruction, it is necessary to formulate new algorithms for the evaluation of the likelihood and its derivatives with respect to the underlying kinetic, acceptance, and spatial parameters. To derive the most from maximum likelihood analysis of sequence data, it is useful to compute posterior probabilities assigning residues to internal nodes and evolutionary rate classes to codon sites. It is also helpful to search through tree space in a way that respects accepted phylogenetic relationships. Our phylogeny program LINNAEUS implements algorithms realizing these goals. Readers may consult our companion article in this issue for several examples.


    Introduction
 
If the promise of the burgeoning DNA sequence and protein databases is to be realized, then more refined models of evolution at the sequence level must be constructed. Not only must these models be more realistic, they must also permit fast computation in frequentist and Bayesian statistical inference. This is a tall order. There are several ways to increase the realism of single-nucleotide substitution models. For instance, one can work at the protein level and consider transitions between amino acids or between secondary structures within proteins rather than between nucleotides (Koshi and Goldstein 1997, 1998). Although this traditional approach has the advantage of taking evolution and sequence conservation seriously, it suffers from two defects.

First, not all sequence conservation occurs within coding regions. Regions of gene regulation also experience the guiding hand of selection. Second, amino acid substitution models blur the distinction between mutation, which operates at the nucleotide level, and selection, which operates at the amino acid level. One can reasonably expect the rate of mutation at a nucleotide site to be independent of the role its resident nucleotide plays at the codon level. Whether selection retains or eliminates a mutation depends on whether the corresponding codon substitution is synonymous or nonsynonymous. Some nonsynonymous substitutions can be more easily tolerated than others. These considerations suggest that we view selection as accepting some codon substitutions and rejecting others according to definite probabilistic rules (Hasegawa 1986; Adachi and Hasegawa 1992; Goldman and Yang 1994; Pedersen, Wiuf, and Christiansen 1998; Yang, Nielsen, and Hasegawa 1998). More realistic modeling must recognize that some codon sites experience stronger selection than other sites. This fact suggests that one allows for variation in the rate of evolution among sites (Koshi and Goldstein 1998; Yang, Nielsen, and Hasegawa 1998; Yang 2000). Finally, because conserved catalytic domains, protein folds, and tertiary structures extend over both consecutive and nonconsecutive codon sites, it is desirable to pursue models that allow for spatial correlation in rate variation (Yang 1995; Felsenstein and Churchill 1996; Wagner, Baake, and Gerisch 1999).

These broad research trends have been set in motion by the researchers just noted. Our goal here is to refine some of the current models in interesting and productive ways and show how these models fit into fast maximum likelihood estimation. For example, we advocate modeling the acceptance of nonsynonymous substitutions using equivalence classes of amino acids on the basis of physical and chemical similarities. Modeling of spatial variation in the rate of evolution has been limited to the construction of hidden Markov chains and coupled Ising chains connecting adjacent nucleotide sites (Yang 1995; Felsenstein and Churchill 1996; Wagner, Baake, and Gerisch 1999). These natural developments parallel the successful applications of hidden Markov chains in contexts such as gene finding, similarity searches, and homology recognition (Gribskov, McLachlan, and Eisenberg 1987; Tatusov, Altschul, and Koonin 1994; Yi and Lander 1994; Burge and Karlin 1997). But there is a growing recognition of the need to liberate phylogenetic modeling from the unidirectional flow of information implicit in Markov chains and the simple nearest neighbor interactions required by Ising chains (Koshi and Goldstein 1998; Yang, Nielsen, and Hasegawa 1998). The most nuanced models will achieve these goals with codons rather than nucleotides as the units of selection.

In our view, the theory of Markov random fields offers a perfect framework for expansion of the rate variation models. In particular, we demonstrate that imposing a Gibbs prior with limited spatial interactions is fully compatible with fast likelihood evaluation. This opens the door to more realistic, but still parsimonious, codon models. On the numerical side, we present several algorithmic improvements in likelihood evaluation that can decrease computing times by at least two orders of magnitude. To stimulate the wider use of sophisticated models, we have implemented all of our advances in the computer program LINNAEUS. Applications of LINNAEUS appear in our companion article (Schadt, Sinsheimer, and Lange 2002) and will not be pursued here.


    Theory and Methods
 
Our point of departure is our previous generalization of Kimura's nucleotide substitution model (Kimura 1980; Schadt, Sinsheimer, and Lange 1998). This generalization possesses just enough symmetry to permit explicit diagonalization and exponentiation of the infinitesimal generator, or rate matrix, of the underlying continuous-time Markov chain. Given the partial derivatives of the infinitesimal generator, we have shown how to evaluate the partial derivatives of the likelihood almost as quickly as the likelihood itself. Such evaluation is important because most methods of maximum likelihood search rely heavily on gradient information. Our previous algorithmic advances have even greater relevance to codon models, with 61 states, than to nucleotide models, with just four states. Fortunately, many of the analytic successes at the nucleotide level carry over to the codon level. The algorithms for quick evaluation of the likelihood and its derivatives are cases in point. Other tasks such as matrix exponentiation are less tractable. But considerable progress can be made even here if one pays heed to the sparseness and reversibility properties of the infinitesimal generator. In particular, it is profitable to relate reversibility at the codon level to reversibility at the nucleotide level.

Nucleotide Models
Codon models are preferable to amino acid models because they incorporate detail at the mechanistic level at which mutation operates. The codon models of Goldman and Yang (1994) and Muse and Gaut (1994) rely on a continuous-time Markov chain with 61 states. Mutations to one of the three stop codons totally disrupt gene function and can be safely ignored. Of course, codon models are no panacea. Nucleotide substitution is seldom completely random in conserved regions flanking coding regions, near intron-exon boundaries, or in intronic regions such as the silencer and enhancer regions implicated in regulation and splicing. In the case of regulation, substitutions can affect transcription and transcript stability; in the case of splicing, substitutions can change the entire structure of the translated protein (Miyata and Yasunaga 1980; Li, Wu, and Luo 1985; Nei and Gojobori 1986; Li 1993).

As a basis for our codon models, we adopt the generalization of Kimura's nucleotide substitution model proposed by Schadt, Sinsheimer, and Lange (1998). Kimura's model and subsequent generalizations view a phylogeny as a binary tree with contemporary taxa forming the leaves. Evolution occurs from site to site and branch to branch via independent Markov chains bouncing back and forth among the four possible states A (adenine), G (guanine), C (cytosine), and T (thymine). These continuous-time chains share the common infinitesimal generator


in our earlier nucleotide model. All eigenvalues and eigenvectors of Λ are real and explicitly computable. This fact facilitates finding the equilibrium distribution and forming the finite-time transition matrix P(t) = exp(tΛ). The entry p_ij(t) of P(t) represents the probability that a daughter node of the tree occupies state j at time t given that its mother node occupies state i at time 0. Note that Λ embodies the assumption that transversion rates depend only on their destination states and not on their initial states. This assumption allows more flexibility than competing models (Kimura 1980; Goldman and Yang 1994; Muse and Gaut 1994) currently available in widely used phylogeny programs. Although we presently ignore nucleotide and codon gaps, the models discussed here can accommodate gaps as additional states. If these models prove useful, we will extend them to allow for gaps.
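Since the generator matrix itself has not survived reproduction here, only a minimal sketch of the finite-time computation can be given. The 4 × 4 generator below is hypothetical (its numerical rates are illustrative, not the article's estimates), but it respects the stated assumption that transversion rates depend only on the destination state:

```python
import numpy as np
from scipy.linalg import expm

# States ordered A, G, C, T. Transitions (A<->G, C<->T) get their own rates;
# transversion rates depend only on the destination column. All values are
# hypothetical illustrations, not the article's estimated parameters.
alpha, delta = 0.6, 0.4      # transition rates A->G, G->A
eps, kappa = 0.5, 0.3        # transition rates C->T, T->C
to_A, to_G, to_C, to_T = 0.2, 0.15, 0.25, 0.1   # destination-specific transversions

Lam = np.array([
    [0.0,   alpha, to_C,  to_T],   # from A
    [delta, 0.0,   to_C,  to_T],   # from G
    [to_A,  to_G,  0.0,   eps ],   # from C
    [to_A,  to_G,  kappa, 0.0 ],   # from T
])
np.fill_diagonal(Lam, -Lam.sum(axis=1))   # generator rows sum to zero

P = expm(0.5 * Lam)   # P(t) at t = 0.5: entry (i, j) = Pr(state j at t | state i at 0)
assert np.allclose(P.sum(axis=1), 1.0)
assert (P >= -1e-12).all()
```

With rows summing to zero, exp(tΛ) is automatically a stochastic matrix, which the assertions confirm.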

Rate Variation Among Lineages
Because, in general, it is impossible to separate the time parameter for each branch from the kinetic parameters, we adopt Yang's (1994) convention and set one of the branch lengths equal to 1. This convention allows a branch length to be interpreted as the expected number of changes between the two taxa at the ends of the branch. The nucleotide substitution models sketched above implicitly assume that the rate of evolution across lineages is constant. To allow for stochastic variation in branch lengths, one can introduce randomized times for each branch.

There are two ways of choosing these times, both involving gamma distributed random variates. In the traditional model (Nei and Gojobori 1986; Tamura and Nei 1993; Yang 1993; Kelly and Rice 1996), each branch length at a given site is multiplied by the same gamma variate with mean 1. The gamma variates at different sites are chosen independently. Unfortunately, this procedure enormously complicates likelihood evaluation because the likelihood at each site must be integrated over all possible realizations of the gamma variate. Although one can carry out such quadratures numerically, using, say, Laguerre polynomials, accuracy tends to be low and computation times tend to be high, particularly when derivatives of the likelihood are also sought.

Alternatively, one can imagine multiplying the different branch lengths at a given site by independent gamma distributed variates of mean 1. Although this creates even more integrals, these can be done along each branch separately as part of a single overall likelihood evaluation. Furthermore, as noted below, the necessary quadratures can be performed analytically. To promote model parsimony, it is preferable to specify the same shape parameter across both branches and sites for the independent gamma variates. In this regard, recall that the standard parameterization of a gamma distributed variate T involves a rate parameter ζ and a shape parameter ν. The shape parameter determines the coefficient of variation of T through the equation

$$\mathrm{CV}(T) \;=\; \frac{\sqrt{\operatorname{Var}(T)}}{E(T)} \;=\; \frac{\sqrt{\nu/\zeta^{2}}}{\nu/\zeta} \;=\; \frac{1}{\sqrt{\nu}}.$$
The constraint E(T) = ν/ζ = 1 forces ζ = ν.

In our alternative model with independent gamma multipliers rather than a common gamma multiplier, we must compute the unconditional finite-time transition matrix Q for evolution along each branch. If the randomized time elapsed along a branch is gamma distributed with shape parameter ν and rate parameter ζ, one calculates Q by averaging e^{tΛ} over the gamma density:

$$Q(\Lambda) \;=\; \int_{0}^{\infty} e^{t\Lambda}\, \frac{\zeta^{\nu} t^{\nu-1} e^{-\zeta t}}{\Gamma(\nu)}\, dt.$$
The most natural method of computing Q(Λ) is to consider it to be a matrix version of the analytic function

$$f(\lambda) \;=\; \int_{0}^{\infty} e^{\lambda t}\, \frac{\zeta^{\nu} t^{\nu-1} e^{-\zeta t}}{\Gamma(\nu)}\, dt,$$
defined for Re(λ) < ζ. Differentiation under the integral sign and integration by parts show that f(λ) satisfies the ordinary differential equation

$$f'(\lambda) \;=\; \frac{\nu}{\zeta - \lambda}\, f(\lambda),$$
with solution f(λ) = (1 − λ/ζ)^{−ν} subject to the initial condition f(0) = 1.

Although this insight is useful, practical computation of Q(Λ) is best achieved when Λ is diagonalizable (Horn and Johnson 1991, p. 520). If Λ = UDU^{-1} with D diagonal having diagonal entries d_i, then

$$Q(\Lambda) \;=\; U f(D)\, U^{-1},$$
where f(D) is the diagonal matrix having diagonal entries f(d_i). The condition Re(d_i) ≤ 0, satisfied by all infinitesimal generators Λ, guarantees that all diagonal entries f(d_i) are well defined. Extension of Q(Λ) to nondiagonalizable Λ is discussed thoroughly by Horn and Johnson (1991) and will be omitted here.
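The spectral recipe can be sketched directly. The 2-state generator below is hypothetical; the check exploits the fact that f(0) = 1 forces Q to remain a stochastic matrix:

```python
import numpy as np

def gamma_mixed_transition(Lam, zeta, nu):
    """Apply f(d) = (1 - d/zeta)^(-nu) spectrally: Q = U f(D) U^{-1}."""
    d, U = np.linalg.eig(Lam)                 # assume Lam is diagonalizable
    fd = (1 - d / zeta) ** (-nu)              # well defined since Re(d) <= 0 < zeta
    return (U @ np.diag(fd) @ np.linalg.inv(U)).real

Lam = np.array([[-1.0, 1.0],
                [0.5, -0.5]])                 # hypothetical 2-state generator
Q = gamma_mixed_transition(Lam, zeta=2.0, nu=2.0)
assert np.allclose(Q.sum(axis=1), 1.0)        # f(0) = 1 preserves row sums
assert (Q >= 0).all()
```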

From Nucleotide Models to Codon Models
Any nucleotide substitution model induces a trivial codon substitution model. For example, if v_bd is the infinitesimal transition rate from nucleotide b to nucleotide d, then on the codon level, the infinitesimal transition rate from codon (a,b,c) to codon (a,d,c) is still v_bd, assuming sites mutate independently and neither codon (a,b,c) nor codon (a,d,c) is a stop codon. Substitutions involving the first or third sites are treated similarly, and single-step substitutions between codons differing at more than one site are prohibited.

The only difference between running this naive codon model and three adjacent independent nucleotide models is that the three stop codons are disallowed in the codon model. For this reason, the naive codon model is not terribly interesting. One can increase its relevance by incorporating an acceptance mechanism. If two neighboring codons (a,b,c) and (a,d,c) are nonsynonymous, then one can penalize transitions between them by replacing v_bd with ρv_bd. Here ρ ∈ [0, 1] is the probability that the proposed evolutionary change is accepted. If the destination codon is one of the three stop codons, the acceptance probability is 0. The acceptance probability for a synonymous change is 1. Mathematically, the codon model with acceptance probabilities is determined by a 61 × 61 infinitesimal generator Υ whose off-diagonal entries are the products of the corresponding entries of the infinitesimal generator of the naive codon model and the acceptance matrix. The diagonal entries of Υ are defined so that row sums vanish.
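A sketch of the construction of Υ, assuming the standard genetic code, hypothetical uniform nucleotide rates, and a single illustrative acceptance probability ρ for all nonsynonymous changes:

```python
import numpy as np
from itertools import product

# Standard genetic code in TCAG order; '*' marks stop codons.
bases = "TCAG"
aas = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
code = {"".join(t): aas[i] for i, t in enumerate(product(bases, repeat=3))}
codons = [c for c in code if code[c] != "*"]          # the 61 sense codons
idx = {c: i for i, c in enumerate(codons)}

# Hypothetical nucleotide rates v[b][d] and a single acceptance probability rho.
v = {b: {d: 0.25 for d in bases if d != b} for b in bases}
rho = 0.2

Y = np.zeros((61, 61))
for c1 in codons:
    for pos in range(3):
        for d in bases:
            c2 = c1[:pos] + d + c1[pos + 1:]
            if c2 == c1 or code[c2] == "*":
                continue                               # stops are excluded outright
            accept = 1.0 if code[c1] == code[c2] else rho
            Y[idx[c1], idx[c2]] = v[c1[pos]][d] * accept
np.fill_diagonal(Y, -Y.sum(axis=1))                    # row sums vanish

assert len(codons) == 61
assert np.allclose(Y.sum(axis=1), 0.0)
assert ((Y > 0).sum(axis=1) <= 9).all()   # at most 9 single-nucleotide neighbors
```

The last assertion anticipates the sparseness exploited later: only about 15% of the entries of Υ are nonzero.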

Preservation of Reversibility
Although the codon model is too complicated to permit explicit exponentiation of its infinitesimal generator Υ, symmetry considerations can simplify this task and likelihood computation in general. For example, to avoid complex arithmetic in computing the eigenstructure of Υ, it is desirable to derive sufficient conditions guaranteeing that Υ has real eigenvalues and eigenvectors. Reversibility is the key. One can easily show that reversibility in the nucleotide model is equivalent to imposing the two constraints βγ = λσ and αδ = εκ on Λ. With these constraints in force, the equilibrium distribution π is


If we let D_π denote the diagonal matrix with the entries of π along its diagonal, then reversibility is equivalent to the detailed balance condition

$$D_\pi \Lambda \;=\; \Lambda^{*} D_\pi,$$
where the superscript * denotes vector or matrix transpose. Detailed balance implies that the eigenvalues of Λ are real. To prove this result rigorously, consider the matrix D_π^{1/2} Λ D_π^{-1/2}, which is similar to Λ (Kelly 1979, pp. 5–10). The calculation

$$\left(D_\pi^{1/2} \Lambda D_\pi^{-1/2}\right)^{*} \;=\; D_\pi^{-1/2} \Lambda^{*} D_\pi^{1/2} \;=\; D_\pi^{-1/2}\left(D_\pi \Lambda\right) D_\pi^{-1/2} \;=\; D_\pi^{1/2} \Lambda D_\pi^{-1/2}$$
does the trick when we recall that a symmetric matrix has real eigenvalues and real orthogonal eigenvectors (Horn and Johnson 1985, pp. 169–171).
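One way to check this chain of identities numerically is to manufacture a reversible generator from a hypothetical equilibrium distribution and a symmetric matrix of exchangeabilities (both invented for illustration):

```python
import numpy as np

# Lam[i,j] = S[i,j]/pi[i] with S symmetric gives detailed balance:
# pi[i] Lam[i,j] = S[i,j] = pi[j] Lam[j,i].
rng = np.random.default_rng(0)
pi = np.array([0.1, 0.2, 0.3, 0.4])          # hypothetical equilibrium distribution
A = rng.uniform(0.1, 1.0, size=(4, 4))
S = (A + A.T) / 2                             # symmetric exchangeabilities
Lam = S / pi[:, None]
np.fill_diagonal(Lam, 0.0)
np.fill_diagonal(Lam, -Lam.sum(axis=1))

Dh = np.diag(np.sqrt(pi))
Dhi = np.diag(1.0 / np.sqrt(pi))
Sym = Dh @ Lam @ Dhi                          # similar to Lam and symmetric
assert np.allclose(Sym, Sym.T)
assert np.allclose(np.linalg.eigvals(Lam).imag, 0.0)   # hence real eigenvalues
assert np.allclose(pi @ Lam, 0.0)                      # pi is the equilibrium
```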

Besides accounting for nonsynonymous codon changes in a parsimonious manner, the codon model possesses the attractive property of turning a reversible nucleotide model into a reversible codon model. The scalar version of detailed balance is π_b v_bd = π_d v_db. If the acceptance probabilities in the codon model are symmetric, then we claim that the equilibrium probability of the codon (a,b,c) is π_a π_b π_c up to a multiplicative constant. The proof of this statement is simply the equality

$$\pi_a \pi_b \pi_c\, \rho\, v_{bd} \;=\; \pi_a \pi_d \pi_c\, \rho\, v_{db},$$
where ρ = ρ_{(a,b,c)→(a,d,c)} = ρ_{(a,d,c)→(a,b,c)}. Thus, detailed balance is preserved, and Υ is reversible with an identified equilibrium distribution. Because stop codons are omitted, we must normalize the proposed equilibrium distribution by dividing its entries by the sum Σ_{(a,b,c)} π_a π_b π_c taken over all nonstop codons (a,b,c).

Codon Acceptance Probabilities
In contrast to Muse and Gaut (1994), who employ a single acceptance probability that makes no distinctions between the different kinds of nonsynonymous codon substitutions, and in contrast to Goldman and Yang (1994), who use empirically determined codon acceptance probabilities based on the amino acid distances given in Grantham (1974), we prefer to estimate a small number of codon acceptance probabilities directly from the data. A large amount of sequence data would tend to favor this tactic of trading extra parameters for greater biological realism. Of course, the extreme model treating each amino acid separately requires (20 choose 2) = 190 symmetric acceptance probabilities. To avoid such a proliferation of parameters, we group amino acids into similarity (equivalence) classes and then define acceptance probability parameters for transitions within or between classes. For instance, if we define similarity classes on the basis of the polarity properties of the amino acids, then the natural classes comprise the nonpolar amino acids S1 = {G, I, V, L, A, M, P, F, W}, the polar or positively charged amino acids S2 = {Q, N, C, Y, H, K, R}, and the polar or negatively charged amino acids S3 = {S, T, E, D}. We could further divide these classes on the basis of the size of amino acid side chains or other properties potentially affecting substitution rates. For instance, to account for cysteine's propensity to form disulfide bonds bridging different parts of a protein, we could place cysteine in its own similarity class S4 = {C}.

With the similarity classes S1 through S4 just defined, we postulate an acceptance probability ρ_0 within classes, an acceptance probability ρ_1 between a polar and a nonpolar class, an acceptance probability ρ_2 between different polar classes, and an acceptance probability ρ_3 involving substitution to or from cysteine. One would anticipate that ρ_0 > ρ_1 > ρ_2 and ρ_0 > ρ_3. The symmetry constraints implicit in this parameterization are not completely realistic. For instance, they probably fail for transitions involving cysteine. Substitution of another amino acid for a cysteine involved in a disulfide bond is bound to be much less likely than the reverse substitution. But it would take an enormous amount of data to see this effect, and it is mathematically advantageous to maintain symmetry for the sake of reversibility.
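A minimal sketch of the class-based acceptance rule; the class memberships follow the text (with cysteine moved from S2 into S4), and the numerical ρ values are hypothetical placeholders respecting the anticipated ordering:

```python
# Similarity classes from the text; rho values are illustrative placeholders.
CLASSES = {
    "S1": set("GIVLAMPFW"),   # nonpolar
    "S2": set("QNYHKR"),      # polar / positively charged (C moved to S4)
    "S3": set("STED"),        # polar / negatively charged
    "S4": set("C"),           # cysteine on its own
}
rho0, rho1, rho2, rho3 = 0.8, 0.4, 0.2, 0.1

def class_of(aa):
    return next(name for name, members in CLASSES.items() if aa in members)

def acceptance(aa1, aa2):
    """Symmetric acceptance probability for a nonsynonymous change aa1 <-> aa2."""
    c1, c2 = class_of(aa1), class_of(aa2)
    if c1 == c2:
        return rho0
    if "S4" in (c1, c2):
        return rho3           # substitutions to or from cysteine
    if "S1" in (c1, c2):
        return rho1           # polar <-> nonpolar
    return rho2               # between the two different polar classes

assert acceptance("A", "G") == rho0               # both nonpolar
assert acceptance("K", "E") == rho2               # different polar classes
assert acceptance("C", "W") == acceptance("W", "C")   # symmetry
```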

The penalty model just described appears to coincide with one proposed earlier by Yang, Nielsen, and Hasegawa (1998). In our companion article (Schadt, Sinsheimer, and Lange 2002), we compare it with a generalization of the amino acid metric model proposed by Goldman and Yang (1994). No clear winner emerges in this comparison, but the similarity class and metric models are both clearly superior to the naive Muse and Gaut (1994) model.

Matrix Exponentiation and Differentiation
One of the primary concerns in likelihood evaluation is computation of the finite-time transition matrix exp(tΥ) and its partial derivatives with respect to the parameters. Computation is much easier if Υ is diagonalizable. Under the conditions of reversibility described above, let ν denote the product equilibrium distribution constructed from the nucleotide equilibrium distribution (eq. 1). Also let D_ν denote the diagonal matrix with the entries of ν on its diagonal. Because D_ν^{1/2} Υ D_ν^{-1/2} is symmetric, there exists a real orthogonal matrix U and a real diagonal matrix Δ such that D_ν^{1/2} Υ D_ν^{-1/2} = UΔU*. The columns of U are the eigenvectors of D_ν^{1/2} Υ D_ν^{-1/2}, and the diagonal entries of Δ are its eigenvalues. It follows from this representation that

$$\exp(t\Upsilon) \;=\; D_\nu^{-1/2}\, U \exp(t\Delta)\, U^{*} D_\nu^{1/2}.$$
Here exp(tΔ) is the diagonal matrix derived by exponentiating each of the eigenvalues separately. The sparseness of D_ν^{1/2} Υ D_ν^{-1/2} helps in finding U. Only about 15% of the entries of Υ, and thus of D_ν^{1/2} Υ D_ν^{-1/2}, are nonzero in the codon model. When we use special routines in the LAPACK linear algebra package and the D2a class in the IML++ package that exploit sparseness and symmetry, the computational burden of computing exp(tΥ) decreases by a factor of five. Details on the use of these packages for sparse matrix computations can be found in Dongarra et al. (1996) and Pozo, Remington, and Lumsdaine (1996).
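The representation above can be checked against direct matrix exponentiation on a small reversible toy generator (again hypothetical):

```python
import numpy as np
from scipy.linalg import expm

# Reversible toy generator built as in the reversibility discussion.
rng = np.random.default_rng(1)
pi = np.array([0.15, 0.35, 0.25, 0.25])
A = rng.uniform(0.1, 1.0, size=(4, 4))
S = (A + A.T) / 2
Y = S / pi[:, None]
np.fill_diagonal(Y, 0.0)
np.fill_diagonal(Y, -Y.sum(axis=1))

Dh, Dhi = np.diag(np.sqrt(pi)), np.diag(1.0 / np.sqrt(pi))
w, U = np.linalg.eigh(Dh @ Y @ Dhi)       # symmetric eigenproblem: real w, orthogonal U
t = 0.7
P = Dhi @ U @ np.diag(np.exp(t * w)) @ U.T @ Dh
assert np.allclose(P, expm(t * Y))        # matches direct matrix exponentiation
```

The payoff in practice is that `eigh`-style symmetric solvers are faster and numerically safer than general nonsymmetric eigendecompositions.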

Suppose ξ is a parameter of the codon model. Any numerical method that can compute the entrywise partial derivatives (∂/∂ξ)exp(tΥ) using only Υ and the entrywise partial derivatives (∂/∂ξ)Υ is highly desirable. For instance, the optimization engine SEARCH of LINNAEUS works faster and more reliably given these partial derivatives (Lange, Boehnke, and Weeks 1988). Formulas for (∂/∂ξ)exp(tΥ) are available in the statistical literature on fitting differential equation models (Jennrich and Bright 1974) and in the Lie group literature (Richtmyer 1981, pp. 143–144), but these have not seen application in molecular phylogeny. For this reason, we indicate briefly one method of calculating (∂/∂ξ)exp(tΥ).

Suppose f(z) is an analytic function of the complex variable z. The particular choice f(z) = exp(tz) is the pertinent one for us. Because f(z) is locally defined by an absolutely convergent power series, one can define a matrix-valued analytic function f(M) on square matrices M by substituting M for z in this series. This correspondence gives rise to a matrix functional calculus whose crowning highlight is the Cauchy integral formula

$$f(M) \;=\; \frac{1}{2\pi i} \oint_C f(z)\,(zI - M)^{-1}\, dz,$$
where I is the identity matrix, C is a simple closed curve strictly enclosing all the eigenvalues of M, and f(z) is analytic on C and its interior (Ikebe and Inagaki 1986; Horn and Johnson 1991). Differentiation of this formula yields

$$\frac{\partial}{\partial \xi} f(M) \;=\; \frac{1}{2\pi i} \oint_C f(z)\,(zI - M)^{-1}\, \frac{\partial M}{\partial \xi}\,(zI - M)^{-1}\, dz. \tag{2}$$
If we assume that M is diagonalizable in the form M = SDS^{-1} with D a diagonal matrix with jth diagonal entry d_j, then we can write

$$(zI - M)^{-1} \;=\; S\,(zI - D)^{-1} S^{-1} \;=\; \sum_j \frac{1}{z - d_j}\, S D_j S^{-1},$$
where D_j is a matrix of zeros except for a one in diagonal entry j. Substituting this expression in equation (2) produces

$$\frac{\partial}{\partial \xi} f(M) \;=\; \sum_j \sum_k \left[\frac{1}{2\pi i} \oint_C \frac{f(z)}{(z - d_j)(z - d_k)}\, dz\right] S D_j S^{-1}\, \frac{\partial M}{\partial \xi}\, S D_k S^{-1}$$
based on the standard version of Cauchy's integral formula for scalar-valued analytic functions (Marsden 1987, pp. 121–122). If F is the matrix with entry f_jk = f′(d_j) when d_j = d_k and f_jk = [f(d_j) − f(d_k)]/(d_j − d_k) when d_j ≠ d_k, then this reduces to

$$\frac{\partial}{\partial \xi} f(M) \;=\; S\left[F \circ \left(S^{-1}\, \frac{\partial M}{\partial \xi}\, S\right)\right] S^{-1},$$
where ∘ denotes the Hadamard (pointwise) product of two matrices. The computational complexity of computing (∂/∂ξ)f(M) is consequently dominated by ordinary matrix multiplication.
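This Hadamard-product formula translates into a few lines of linear algebra. The sketch below assumes a diagonalizable matrix and verifies the result against a central finite difference; the 3-state generator and its perturbation are invented for illustration:

```python
import numpy as np
from scipy.linalg import expm

def dexpm(M, dM, t=1.0):
    """d/dxi exp(tM) via S [F o (S^{-1} dM S)] S^{-1} for diagonalizable M."""
    d, Smat = np.linalg.eig(M)
    Sinv = np.linalg.inv(Smat)
    d = np.asarray(d, dtype=complex)
    f = np.exp(t * d)
    num = f[:, None] - f[None, :]
    den = d[:, None] - d[None, :]
    same = np.abs(den) < 1e-12
    F = np.where(same,
                 t * f[:, None] + 0 * f[None, :],   # f'(d_j) = t e^{t d_j} on ties
                 num / np.where(same, 1.0, den))    # divided differences otherwise
    return (Smat @ (F * (Sinv @ dM @ Smat)) @ Sinv).real

# Hypothetical 3-state generator, perturbed so that row sums stay zero.
M = np.array([[-1.0, 0.7, 0.3],
              [0.2, -0.6, 0.4],
              [0.1, 0.5, -0.6]])
E = np.array([[-1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
h = 1e-6
fd = (expm(M + h * E) - expm(M - h * E)) / (2 * h)   # central finite difference
assert np.allclose(dexpm(M, E), fd, atol=1e-6)
```

As the text notes, the cost is dominated by ordinary matrix multiplications once the eigendecomposition of M is in hand.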

When M is not diagonalizable, a more complicated formula of Schwerdtfeger applies based on the Jordan canonical form and Frobenius covariants of M (Horn and Johnson 1991). Fortunately, for reversible models Υ is diagonalizable. Even when Υ is not diagonalizable a priori, we can assume this condition in numerical practice because almost all matrices have distinct eigenvalues and are therefore diagonalizable. The appendix contains a short proof of this measure-theoretic assertion.

Rate Variation Models
Several authors have objected to the naive assumptions that different codon sites evolve equally rapidly and independently (Goldman and Yang 1994; Yang 1995; Felsenstein and Churchill 1996). Clearly, residues forming part of a catalytic domain or critical protein fold are more strongly conserved than residues that fall outside such regions. Better models of evolution must allow for variation in the rate of evolution and correlation in the rates of neighboring codon sites. The simplest way to model rate variation is to suppose that each codon site i is assigned a random rate class C_i from among r possible rate classes, which for convenience we label 1, ..., r. For two rate classes, we nominate the first class as slow and the second as fast. Probably the most parsimonious way of differentiating slow from fast evolution is to modulate the rate of evolution through the acceptance probabilities. Slow evolution can be distinguished from fast evolution by introducing a multiplicative parameter η ∈ (0, 1) and replacing each acceptance probability ρ_i by ηρ_i. For synonymous codon changes, one retains the acceptance probability of 1.

Modeling correlations in the rate of evolution among neighboring codon sites is more subtle. It is important to recognize that we are dealing with two different graphs in molecular phylogeny: (1) a taxon graph representing the relationships between the taxa and (2) a site graph representing which codon sites interact in determining rates of evolution. The site-to-site independence assumption essentially hides the fact that the site graph exists. The nodes of this graph are the different codon sites; its edges connect interacting pairs of sites. Yang (1995) and Felsenstein and Churchill (1996) implicitly define a site graph through their Markov chain models for rate variation. The Markov chain models postulate that the rate of evolution at the current site depends on the rate of evolution at the previous site. Although this is certainly a more realistic premise than independent rate variation, it is not the last word on the subject. Hidden Markov chains impose a unidirectional flow of information. Except for mathematical convenience, there is no reason to assume a preferred direction. Markov chain models also permit interactions only between nearest neighbors. These limitations can be relaxed without overly complicating maximum likelihood estimation and statistical inference.

The key to generalization is to replace Markov chains by Gibbs random fields (Kelly 1979; Whittaker 1990, pp. 47–61; Cressie 1993, pp. 410–422). The prototype Gibbs random field is the Ising model from statistical mechanics. As opposed to a Markov chain, it allows a bidirectional flow of information. Wagner, Baake, and Gerisch (1999) introduced a simple two-state nucleotide evolution model based on coupled Ising chains. This model captures nearest neighbor interactions between particles on a one-dimensional lattice (Thompson 1972, pp. 116–144). The Ising model and more complicated Gibbs fields are defined through potential functions H(c) defined on the realizations of the random rate class vector C = (C_1, ..., C_n) over n consecutive codon sites. A Gibbs random field assigns the prior probability

$$P(C = c) \;=\; \frac{e^{H(c)}}{\sum_d e^{H(d)}}$$
to the realization C = c. For the sake of notational simplicity, we omit the usual minus signs in the definition of potentials and priors. The partition function Σ_d e^{H(d)} is a multiple sum extending over all vectors d = (d_1, ..., d_n) drawn from the n-fold Cartesian product of {1, ..., r}. As a prelude to integrating the Gibbs models with phylogeny likelihood calculation, it is useful to review some particular Gibbs fields that can be applied directly to this problem.

Gibbs Fields and Their Partition Functions
In the Potts generalization of the Ising model (Baxter 1982, pp. 422–452), one determines the prior probability of the random vector C = (C_1, ..., C_n) over n codon sites through either the linear potential function

$$H(c) \;=\; \sum_{i=1}^{n} \mu_{c_i} \;+\; \sum_{i=1}^{n-1} \theta_{c_i c_{i+1}}$$
or the circular potential function

$$H(c) \;=\; \sum_{i=1}^{n} \mu_{c_i} \;+\; \sum_{i=1}^{n} \theta_{c_i c_{i+1}}.$$
Here the vector (μ_i) controls the frequency of the different rate classes, and the matrix (θ_ij) controls the spatial correlation in rate class between neighboring codons. This model generalizes the Ising model by allowing codon sites to belong to an arbitrary number of rate classes. In the circular case, we set c_{n+1} = c_1.

One of the hardest problems in statistical mechanics has been the evaluation of partition functions. Fortunately, the Potts model permits exact evaluation. If we define the r × r transfer matrix

$$T \;=\; \left(e^{\theta_{ij} + \mu_j}\right),$$

then

$$\sum_d e^{H(d)} \;=\; \sum_{i=1}^{r} \sum_{j=1}^{r} e^{\mu_i} \left(T^{\,n-1}\right)_{ij}$$

in the linear model.
Conveniently, the circular model dispenses with frequency parameters after the reparameterization ω_ij = θ_ij + (μ_i + μ_j)/2. If we let M be the r × r matrix (e^{ω_ij}) and v the r × 1 vector (e^{μ_i/2}), then the partition functions reduce to the expressions

$$\sum_d e^{H(d)} \;=\; v^{*} M^{\,n-1} v \quad \text{(linear model)}, \qquad \sum_d e^{H(d)} \;=\; \operatorname{tr}\!\left(M^{\,n}\right) \quad \text{(circular model)}.$$
These expressions can be further simplified by diagonalizing M.

For example, in the Ising model when r = 2 and

$$\mu_1 = -\mu_2 = \mu, \qquad \theta_{11} = \theta_{22} = \theta, \qquad \theta_{12} = \theta_{21} = -\theta,$$

the matrix

$$M \;=\; \begin{pmatrix} e^{\theta+\mu} & e^{-\theta} \\ e^{-\theta} & e^{\theta-\mu} \end{pmatrix}$$

has eigenvalues

$$\varphi_{\pm} \;=\; e^{\theta}\cosh\mu \;\pm\; \sqrt{e^{2\theta}\sinh^{2}\mu + e^{-2\theta}}.$$
These can be used to write tr(M^n) = φ_+^n + φ_-^n in the circular Ising model (Ising 1925; Kramers and Wannier 1941; Onsager 1944).
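Both transfer-matrix reductions can be verified against brute-force enumeration for small r and n. The reparameterization ω_ij = θ_ij + (μ_i + μ_j)/2 used below is our reading of the construction, and all parameter values are arbitrary:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
r, n = 3, 5
mu = rng.normal(size=r)
Th = rng.normal(size=(r, r))
Th = (Th + Th.T) / 2                                    # symmetric interactions

def H(c, circular=False):
    val = mu[list(c)].sum() + sum(Th[c[i], c[i + 1]] for i in range(n - 1))
    return val + (Th[c[-1], c[0]] if circular else 0.0)

configs = list(product(range(r), repeat=n))
Z_lin = sum(np.exp(H(c)) for c in configs)
Z_circ = sum(np.exp(H(c, circular=True)) for c in configs)

M = np.exp(Th + (mu[:, None] + mu[None, :]) / 2)        # M_ij = e^{omega_ij}
v = np.exp(mu / 2)                                      # v_i = e^{mu_i/2}
assert np.isclose(Z_lin, v @ np.linalg.matrix_power(M, n - 1) @ v)
assert np.isclose(Z_circ, np.trace(np.linalg.matrix_power(M, n)))
```

The brute-force sum has r^n terms, while the transfer-matrix forms cost only n matrix products of size r × r.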

Evaluation of the partition function can sometimes be done directly without recourse to linear algebra. For instance, consider the linear logistic model

$$H(c) \;=\; \sum_{i=1}^{n-1} \theta_i\, 1_{\{c_i = c_{i+1}\}}$$
with variation in nearest neighbor interactions but no frequency differences in rate classes (Derin and Elliott 1987). For the sake of argument, imagine that the rate indicators C_i are independent and uniformly distributed over the r rate classes. The n − 1 indicator random variables 1_{{C_i = C_{i+1}}} are then independent, and the normalized partition function

$$m(\theta_1, \ldots, \theta_{n-1}) \;=\; \frac{1}{r^{n}} \sum_d e^{H(d)} \;=\; E\!\left[\exp\!\left(\sum_{i=1}^{n-1} \theta_i\, 1_{\{C_i = C_{i+1}\}}\right)\right]$$
can be viewed as their multivariate moment generating function. By virtue of independence, m(θ_1, ..., θ_{n−1}) reduces to the product

$$m(\theta_1, \ldots, \theta_{n-1}) \;=\; \prod_{i=1}^{n-1} E\!\left[e^{\theta_i 1_{\{C_i = C_{i+1}\}}}\right] \;=\; \prod_{i=1}^{n-1} \frac{e^{\theta_i} + r - 1}{r}$$
of the univariate moment generating functions. It follows that the partition function satisfies

$$\sum_d e^{H(d)} \;=\; r \prod_{i=1}^{n-1}\left(e^{\theta_i} + r - 1\right),$$
which obviously simplifies to a power if the θ_i are equal. Extending this result to include frequency differences appears intractable (Strauss 1977), but in this simple case we see that allowing for rate variation among codon sites imposes little extra computational burden.
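The closed form is easy to confirm by enumeration for small r and n (illustrative θ_i values):

```python
import numpy as np
from itertools import product

r, n = 3, 6
theta = np.array([0.3, -0.5, 1.1, 0.0, 0.7])    # one theta per adjacent pair

Z_brute = sum(
    np.exp(sum(theta[i] for i in range(n - 1) if c[i] == c[i + 1]))
    for c in product(range(r), repeat=n)
)
Z_closed = r * np.prod(np.exp(theta) + r - 1)   # r * prod_i (e^{theta_i} + r - 1)
assert np.isclose(Z_brute, Z_closed)
```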

As a final example, suppose that r = 2 and the rate indicators C_i are chosen from {−1, 1} rather than {1, 2}. The partition function

$$\sum_c \exp\!\left(\theta_1 \sum_{i=1}^{n-1} c_i c_{i+1} \;+\; \theta_2 \sum_{i=1}^{n-2} c_i c_{i+2}\right)$$
joins nearest and next nearest neighbors. It can be radically simplified by writing D_i = C_i C_{i+1} and observing that D_i ∈ {−1, 1} and D_i D_{i+1} = C_i C_{i+2}. These considerations imply that

$$\sum_c e^{H(c)} \;=\; 2 \sum_d \exp\!\left(\theta_1 \sum_{i=1}^{n-1} d_i \;+\; \theta_2 \sum_{i=1}^{n-2} d_i d_{i+1}\right),$$

the factor 2 reflecting the free choice of C_1.
Thus H(d) exactly matches the potential of the linear Ising model with n - 1 sites. It follows that the explicit formula found for the Ising partition function applies here as well.
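The change of variables D_i = C_i C_{i+1} can also be checked by enumeration; the factor of 2 accounts for the free choice of C_1 (parameter values are arbitrary):

```python
import numpy as np
from itertools import product

theta1, theta2, n = 0.4, -0.3, 6

# original field: spins c in {-1, 1}^n with nearest and next-nearest couplings
Z_c = sum(
    np.exp(theta1 * sum(c[i] * c[i + 1] for i in range(n - 1))
           + theta2 * sum(c[i] * c[i + 2] for i in range(n - 2)))
    for c in product((-1, 1), repeat=n)
)
# reduced field: d_i = c_i c_{i+1} gives a linear Ising chain with n - 1 sites
Z_d = sum(
    np.exp(theta1 * sum(d) + theta2 * sum(d[i] * d[i + 1] for i in range(n - 2)))
    for d in product((-1, 1), repeat=n - 1)
)
assert np.isclose(Z_c, 2 * Z_d)
```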

Likelihood Evaluation Under Rate Variation
To compute the likelihood of the data under the rate variation model, we let L(c) = Π_{i=1}^n L_i(c_i) denote the likelihood of the observations over all codon sites conditional on the value (c_1, ..., c_n) of the rate class vector C. Here L_i(c_i) is the likelihood specific to codon site i when it is in rate class c_i. Given the value of the finite-time transition matrix e^{tΥ} and its derivatives (∂/∂ξ)exp(tΥ), computation of L_i(c_i) and its derivatives (∂/∂ξ)L_i(c_i) is carried out by the algorithms sketched by Felsenstein (1981) and Schadt, Sinsheimer, and Lange (1998). These algorithms mimic Baum and Petrie's (1966) forward and backward algorithms and have computational complexity O(ts^2), where t is the number of taxa and s = 4 or s = 61 is the number of states of the model. As Felsenstein (1981) notes, a complexity of O(ts^2) is a vast improvement over the complexity O(s^t) encountered when one visits the internal nodes of the evolutionary tree simultaneously rather than recursively. Jensen, Lauritzen, and Olesen (1990) present a nice generalization of the Baum and Petrie and Felsenstein algorithms to more complex graphical structures. This generalization closely resembles algorithms developed in pedigree analysis (Lange 1997, pp. 102–110).
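The pruning recursion can be illustrated on a toy three-taxon tree; the generator, branch lengths, and observed leaf states below are hypothetical:

```python
import numpy as np
from scipy.linalg import expm

# Rooted toy tree: root -> (internal node u, leaf C); u -> (leaf A, leaf B).
Lam = np.array([[-0.9, 0.3, 0.3, 0.3],
                [0.2, -0.8, 0.3, 0.3],
                [0.2, 0.3, -0.8, 0.3],
                [0.2, 0.3, 0.3, -0.8]])
w, V = np.linalg.eig(Lam.T)
pi = np.abs(V[:, np.argmin(np.abs(w))].real)
pi /= pi.sum()                                   # equilibrium distribution at the root

tA, tB, tC, tU = 0.1, 0.2, 0.3, 0.15             # illustrative branch lengths
PA, PB, PC, PU = (expm(t * Lam) for t in (tA, tB, tC, tU))
a, b, c = 0, 2, 3                                # observed leaf states

Lu = PA[:, a] * PB[:, b]                         # conditional likelihoods at node u
lik = pi @ ((PU @ Lu) * PC[:, c])                # fold in leaf C at the root

# brute force over the states of the two unobserved internal nodes
brute = sum(pi[i] * PU[i, j] * PA[j, a] * PB[j, b] * PC[i, c]
            for i in range(4) for j in range(4))
assert np.isclose(lik, brute)
```

The recursion touches each node once with O(s^2) work per branch, whereas the brute-force sum grows exponentially in the number of internal nodes.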

To compute the full likelihood in the rate variation model with spatial correlation, we must take into account the prior probability of the rate class vector C. Given a potential function H(c), we have

L = Z^{-1} Σ_c e^{H(c)} ∏_{i=1}^n L_i(c_i),  with partition function Z = Σ_c e^{H(c)}.    (3)

At this level of generality, it is not possible to suggest an efficient method of evaluating L. But once we limit the extent of spatial interaction, things become easier. For instance, let us take as a concrete example the potential function

H(c) = Σ_{i=1}^n μ_{c_i} + Σ_{j=1}^k θ_j Σ_{i=1}^{n-j} 1_{{c_i = c_{i+j}}}    (4)

for a linear logistic model with interactions extending over k nearest neighbors. In this case, the exponentiated potential

e^{H(c)} = ∏_{i=1}^n [e^{μ_{c_i}} ∏_{j=1}^{min{k,n-i}} e^{θ_j 1_{{c_i = c_{i+j}}}}]

comes into play as a product of arrays.

Computation of either the partition function or the numerator of L reduces to the evaluation of a multiple sum of products of arrays. It is best to compute the multiple sums as iterated sums. Consider the numerator of L. In the forward algorithm, we eliminate the indices c_1, ..., c_n in the indicated order. Starting with the array f_0(c_1, ..., c_k) = 1 at step 0, at step i we recursively create the new array f_i(c_{i+1}, ..., c_{min{i+k,n}}) using

f_i(c_{i+1}, ..., c_{min{i+k,n}}) = Σ_{c_i} f_{i-1}(c_i, ..., c_{min{i+k-1,n}}) L_i(c_i) e^{μ_{c_i}} ∏_{j=1}^{min{k,n-i}} e^{θ_j 1_{{c_i = c_{i+j}}}}

and then immediately discard f_{i-1}(c_i, ..., c_{min{i+k-1,n}}). The final array f_n has zero dimension and furnishes the numerator of L in equation (3). In computing the array f_i(c_{i+1}, ..., c_{min{i+k,n}}), it is advantageous to perform the indicated array multiplications pairwise, starting with the two smallest arrays L_i(c_i) and e^{μ_{c_i}}, multiplying the resulting product against e^{θ_1 1_{{c_i = c_{i+1}}}}, and so forth until a single array holding L_i(c_i) e^{μ_{c_i}} ∏_{j=1}^{min{k,n-i}} e^{θ_j 1_{{c_i = c_{i+j}}}} is formed. The final multiplication of this array against f_{i-1}(c_i, ..., c_{min{i+k-1,n}}) can be carried out simultaneously with addition over the index c_i. When the array e^{θ_j 1_{{c_i = c_{i+j}}}} is brought into play, its obvious symmetries can be exploited to reduce the overall computational burden.

The computational complexity of the forward algorithm scales linearly in n. At each step of the algorithm, one should rescale constructed arrays to prevent numerical underflow. Because a similar algorithm applies to the partition function, there is no need to express the partition function in closed form. Of course, if a closed form exists, it will usually be preferred on esthetic grounds.
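For the simplest case k = 1 (nearest-neighbor interactions only), the forward elimination with rescaling can be sketched in a few lines. In this Python sketch, the site likelihoods L[i][c] and the parameters μ and θ are hypothetical stand-ins, and the result is checked against direct summation over all rate vectors:

```python
from itertools import product
from math import exp, log, isclose

def forward_log_numerator(L, mu, theta):
    """Forward elimination of c_1, ..., c_n for k = 1 neighbor
    interactions: returns the log of
    sum_c prod_i L[i][c_i] exp(mu[c_i]) exp(theta * 1{c_i == c_{i+1}}).
    Each intermediate array is rescaled to prevent underflow."""
    n, r = len(L), len(mu)
    f = [1.0] * r                  # f_0(c_1), flat over rate classes
    log_total = 0.0
    for i in range(n):
        new = [0.0] * r
        for ci in range(r):
            term = f[ci] * L[i][ci] * exp(mu[ci])
            if i < n - 1:          # couple site i to site i + 1
                for cj in range(r):
                    new[cj] += term * (exp(theta) if ci == cj else 1.0)
            else:                  # last site: final array has zero dimension
                new[0] += term
        s = sum(new)
        log_total += log(s)        # accumulate the scale on the log scale
        f = [x / s for x in new]   # rescale to avoid underflow
    return log_total

def brute_log_numerator(L, mu, theta):
    # direct summation over all r^n rate vectors, for verification only
    n, r = len(L), len(mu)
    total = 0.0
    for c in product(range(r), repeat=n):
        w = 1.0
        for i in range(n):
            w *= L[i][c[i]] * exp(mu[c[i]])
            if i < n - 1 and c[i] == c[i+1]:
                w *= exp(theta)
        total += w
    return log(total)

L = [[0.2, 0.5], [0.4, 0.1], [0.3, 0.3], [0.6, 0.2]]
mu, theta = [0.1, -0.3], 0.8
assert isclose(forward_log_numerator(L, mu, theta),
               brute_log_numerator(L, mu, theta))
```

The brute-force version costs O(r^n), while the forward pass costs O(nr²), illustrating the linear scaling in n noted above.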

Likelihood Differentiation Under Rate Variation
For certain rate variation models, it may be impossible to compute partial derivatives exactly. In such cases numerical partial derivatives of the likelihood can be approximated by either forward or central differences. Each forward difference requires an additional likelihood evaluation, and each central difference requires two additional likelihood evaluations. These extra likelihood evaluations are especially costly for kinetic parameters because they involve recalculation of all of the site-specific likelihoods L_i(c_i). For spatial parameters, the site-specific likelihoods are fixed, and it is possible to evaluate the full likelihood quickly given them. This suggests that numerical differentiation with respect to the spatial parameters is reasonable. One drawback of forward and central differences is that they entail roundoff errors originating from the subtraction of nearly equal function values. For a real analytic function f(x), there is a better alternative (Squire and Trapp 1998) based on the approximation

f'(x) ≈ Im f(x + iy)/y,

where x and y are real and i = √-1. This approximation can be implemented in computer languages such as C and Fortran that accommodate complex arithmetic. Fortunately, in the linear logistic model (eq. 4), the likelihood is a real analytic function of the spatial parameters μ_j and θ_i.
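In any language with complex arithmetic the approximation is a one-liner. A minimal sketch, using Python's complex type in place of C or Fortran complex variables: because no subtraction of nearly equal quantities occurs, the step y can be taken tiny without catastrophic cancellation.

```python
import cmath, math

def complex_step_derivative(f, x, h=1e-20):
    # f'(x) ~ Im f(x + i h) / h  (Squire and Trapp); no subtraction,
    # so h can be far smaller than any finite-difference step
    return f(complex(x, h)).imag / h

# example: f(x) = exp(sin(x)), whose exact derivative is cos(x) exp(sin(x))
f = lambda z: cmath.exp(cmath.sin(z))
x = 0.7
exact = math.cos(x) * math.exp(math.sin(x))
assert abs(complex_step_derivative(f, x) - exact) < 1e-12
```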

The product rule allows us to express the partial derivative of the numerator of L with respect to a kinetic parameter ξ as

(∂/∂ξ) Σ_c e^{H(c)} ∏_{j=1}^n L_j(c_j) = Σ_{i=1}^n Σ_c e^{H(c)} (∂/∂ξ)L_i(c_i) ∏_{j≠i} L_j(c_j).

For each i, the inner sum on the multi-index c can be computed using the forward algorithm by substituting (∂/∂ξ)L_i(c_i) for L_i(c_i). Although each of these separate evaluations is fast, the cumulative work can be considerable if n is large. It is possible to reduce the required effort by carrying out a backward algorithm and storing intermediate results. The backward algorithm for likelihood evaluation begins with the array b_{n+1}(c_{n-k+1}, ..., c_n) = 1 and recursively defines new arrays

b_i(c_{max{i-k,1}}, ..., c_{i-1}) = Σ_{c_i} L_i(c_i) e^{μ_{c_i}} ∏_{j=1}^{min{k,i-1}} e^{θ_j 1_{{c_{i-j} = c_i}}} b_{i+1}(c_{max{i-k+1,1}}, ..., c_i)

until reaching the constant array b_1, which equals the numerator of L.

It is conceptually simple but notationally messy to describe how to combine the forward and backward algorithms to compute partial derivatives. The basic idea is to exploit as much of the forward and backward information as possible. This is accomplished by using f_{i-1}(c_i, ..., c_{i+k-1}) and b_{i+k}(c_i, ..., c_{i+k-1}) as a basis for further computation. For the sake of simplicity, we deal with a central value of i satisfying k <= i <= n - k. When i < k, we omit the forward array and deal with the backward array alone, and when i > n - k we do the reverse. These special cases we leave to the reader. For a central value of i we collect all arrays that have not been visited into a single array a(c_i, ..., c_{i+k-1}) by forming the product

a(c_i, ..., c_{i+k-1}) = f_{i-1}(c_i, ..., c_{i+k-1}) (∂/∂ξ)L_i(c_i) ∏_{m=i+1}^{i+k-1} L_m(c_m) ∏_{m=i}^{i+k-1} e^{μ_{c_m}} ∏_{i ≤ m < m+j ≤ i+k-1} e^{θ_j 1_{{c_m = c_{m+j}}}} b_{i+k}(c_i, ..., c_{i+k-1}).    (6)

It is now a simple matter of summing over the remaining indices to produce

Σ_{c_i} ··· Σ_{c_{i+k-1}} a(c_i, ..., c_{i+k-1}).    (7)

In practice, efficient evaluation of the numerator of L and its partial derivatives with respect to the kinetic parameters can be orchestrated by first carrying out the backward algorithm, storing the intermediate arrays, and then performing the forward algorithm. At the ith step of the forward algorithm, we sandwich the necessary forward and backward results and compute one piece of the differential of the numerator as indicated in equations (6) and (7).

Inferring Hidden States
The models and algorithms discussed so far lend themselves to the optimal assignment of codons to internal nodes and of codon sites to rate classes. This flexibility will help scientists in understanding and visualizing the most likely ancestral sequences and the most highly conserved regions in a protein. Felsenstein and Churchill (1996) detailed such an algorithm for computing posterior probabilities of a given nucleotide site belonging to a given rate class, using their hidden Markov model of rate variation among sites. Others have detailed fast and efficient algorithms for the reconstruction of nucleotide and amino acid sequences and demonstrated the usefulness of these reconstructions in annotating the functional regions of protein sequences (Pupko et al. 2000; Armon, Graur, and Ben-Tal 2001). We generalize these reconstruction and rate assignment algorithms so they can be applied in the context of the generalized models we have developed here.

There are two common methods for inferring hidden states in the type of likelihood models detailed here. The first simply computes the posterior probability of each possible assignment and chooses the most probable one at each node or site. The second method finds the most likely constellation of assignments over all nodes or sites simultaneously by dynamic programming. The posterior probability method involves fairly modest changes to the various likelihood algorithms. The dynamic programming method requires more radical surgery and adapts the Viterbi (1967) algorithm as generalized by Dawid (1992).

Inferring Hidden Rate Classes
Consider for instance computation of the posterior probability that a codon site belongs to a given rate class. The posterior probability in question is the ratio of a joint probability and the likelihood of the data. The joint probability captures the data with the given codon site restricted to a particular rate class. Computation of the joint probability proceeds exactly as likelihood computation with the exception that when the codon site in question is reached in the computation, the range of summation shifts from all possible rate classes to a single fixed rate class. To carry out this strategy as efficiently as possible, we mimic the process of computing partial derivatives. Thus, by analogy to equations (6) and (7) we form

a(c_i, ..., c_{i+k-1}) = f_{i-1}(c_i, ..., c_{i+k-1}) L_i(c_i) ∏_{m=i+1}^{i+k-1} L_m(c_m) ∏_{m=i}^{i+k-1} e^{μ_{c_m}} ∏_{i ≤ m < m+j ≤ i+k-1} e^{θ_j 1_{{c_m = c_{m+j}}}} b_{i+k}(c_i, ..., c_{i+k-1})    (8)

and the multiple sum

S(c_i) = Σ_{c_{i+1}} ··· Σ_{c_{i+k-1}} a(c_i, ..., c_{i+k-1}),    (9)

omitting the sum on c_i. Dividing S(c_i) defined by equation (9) by the likelihood now gives the desired posterior probability of rate class c_i at codon site i.
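For small n the defining ratio can be checked by direct summation. The sketch below (Python, with a k = 1 nearest-neighbor potential and hypothetical site likelihoods) restricts the sum over rate vectors to those fixing c_i and divides by the unrestricted sum; the forward-backward machinery above computes exactly this ratio in time linear in n.

```python
from itertools import product
from math import exp, isclose

def joint_weight(c, L, mu, theta):
    # e^{H(c)} * prod_i L_i(c_i) for the k = 1 linear logistic potential
    n = len(L)
    w = 1.0
    for i in range(n):
        w *= L[i][c[i]] * exp(mu[c[i]])
        if i < n - 1 and c[i] == c[i+1]:
            w *= exp(theta)
    return w

def posterior_rate_class(i, L, mu, theta):
    """Posterior probability of each rate class at codon site i:
    sum the joint weights with c_i held fixed at each class, then
    divide by the unrestricted sum."""
    n, r = len(L), len(mu)
    num = [0.0] * r
    for c in product(range(r), repeat=n):
        num[c[i]] += joint_weight(c, L, mu, theta)
    total = sum(num)
    return [x / total for x in num]

L = [[0.2, 0.5], [0.4, 0.1], [0.3, 0.3]]
mu, theta = [0.1, -0.3], 0.8
post = posterior_rate_class(1, L, mu, theta)
assert isclose(sum(post), 1.0)       # posteriors sum to one at each site
```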

To explain the Viterbi algorithm for finding the best sequence of rate classes, it is clearer to adopt a general perspective. The problem of likelihood evaluation reduces to an evaluation of a multiple sum of multiple products of arrays of the form

Σ_{c ∈ Z} ∏_{j ∈ J} a_{T_j}(c).

Here Z = Z_1 x ··· x Z_n is a Cartesian product of n finite sets. In our case, each set Z_i = {1, ..., r} contributing to the product represents the possible choices of rate class at codon i. The multi-index or sequence c = (c_1, ..., c_n) determines a rate class assignment for each of the n codon sites. The array a_{T_j}(c) depends only on the components of c determined by the subset T_j ⊂ {1, ..., n}. For instance, if n = 3 and T_j = {1, 3}, then the array a_{T_j}(c) reduces to an array a_{{1,3}}(c_1, c_3) depending on the first and third indices. The restriction to subsets indexed by elements j ∈ J conveys the fact that only some of the 2^n subsets of {1, ..., n} occur in the product defining the sum. We can avoid duplicate sets T_j = T_k by replacing corresponding arrays a_{T_j}(c) and a_{T_k}(c) by their pointwise product a_{T_j}(c)a_{T_k}(c). In practice, ∪_{j ∈ J} T_j = {1, ..., n}.

In contrast to likelihood evaluation, the best assignment problem involves finding a multi-index c maximizing ∏_{j ∈ J} a_{T_j}(c). In other words, maxima replace sums. Dynamic programming solves a problem by taking advantage of already computed solutions of smaller examples of the same problem. In this case, we solve a sequence of n subproblems. For subproblem m, put

U_m = {j ∈ J : T_j ∩ {1, ..., m} ≠ ∅}

to pin down the arrays partially indexed by one or more of the first m indices c_1, ..., c_m. In the current context, dynamic programming exploits the identity

max_{c_1, ..., c_m} ∏_{j ∈ U_m} a_{T_j}(c) = max_{c_m} [∏_{j ∈ U_m \ U_{m-1}} a_{T_j}(c) max_{c_1, ..., c_{m-1}} ∏_{j ∈ U_{m-1}} a_{T_j}(c)].

The validity of this identity hinges on the fact that the arrays in the first product on the right involve the index c_m but none of the indices c_1, ..., c_{m-1}. The solutions to the inner maximization depend on the indices c_i with

i ∈ ∪_{j ∈ U_{m-1}} T_j \ {1, ..., m - 1}.    (10)

These are precisely the indices other than c_1, ..., c_{m-1} implicated by association with c_1, ..., c_{m-1} through some array. For each choice of the associated indices, we must preserve a pointer that specifies a choice of c_1, ..., c_{m-1} providing a solution. When we perform the outer maximization over c_m, the solution depends on the indices c_i with i ∈ ∪_{j ∈ U_m} T_j \ {1, ..., m}. Again we preserve a pointer that specifies a best c_m together with the sequence c_1, ..., c_{m-1} specified by the solution at stage m - 1. At stage n we find a maximum and recover the solution vector by tracing back along the associated pointers.

As a toy example, consider the product

a_{{1,2}}(c_1, c_2) a_{{2,3}}(c_2, c_3) a_{{2,4}}(c_2, c_4) a_{{3,4}}(c_3, c_4).

For m = 1 we find a maximum of

a_{{1,2}}(c_1, c_2)

as a function of c_2 and record a pointer specifying a best c_1 for each c_2. For m = 2 we find a maximum of

a_{{2,3}}(c_2, c_3) a_{{2,4}}(c_2, c_4) max_{c_1} a_{{1,2}}(c_1, c_2)

as a function of (c_3, c_4) and record a pointer from (c_3, c_4) to (c_1, c_2). For m = 3 we find a maximum of

a_{{3,4}}(c_3, c_4) max_{c_1, c_2} a_{{1,2}}(c_1, c_2) a_{{2,3}}(c_2, c_3) a_{{2,4}}(c_2, c_4)

as a function of (c_4) and record a pointer from (c_4) to (c_1, c_2, c_3). For m = 4 we find a maximum over c_4 and combine it with its pointer to (c_1, c_2, c_3).

The rate class problem possesses two major advantages. First, the index set (eq. 10 ) at stage m - 1 always contains at most k indices for k nearest neighbor interactions. Second, it is natural to proceed from left to right along the n codon sites. As we will discuss later, the same strategy may or may not work well for neighborhood structures connecting nonconsecutive sites.
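For k = 1 nearest-neighbor interactions, the rate-class version of this dynamic programming is the classical Viterbi recursion with one pointer per class per site. The Python sketch below uses hypothetical site likelihoods L[i][c] and potential parameters, and verifies the traced-back path against exhaustive search:

```python
from itertools import product
from math import exp

def viterbi_rate_classes(L, mu, theta):
    """Most probable rate-class vector under the k = 1 linear logistic
    model: maximize prod_i L[i][c_i] e^{mu[c_i]} e^{theta 1{c_i=c_{i+1}}}
    by dynamic programming with traceback pointers."""
    n, r = len(L), len(mu)
    score = [L[0][c] * exp(mu[c]) for c in range(r)]
    back = []
    for i in range(1, n):
        new, ptr = [], []
        for c in range(r):
            best, arg = -1.0, -1
            for p in range(r):
                v = score[p] * (exp(theta) if p == c else 1.0)
                if v > best:
                    best, arg = v, p
            new.append(best * L[i][c] * exp(mu[c]))
            ptr.append(arg)
        score, back = new, back + [ptr]
    c = max(range(r), key=lambda x: score[x])   # best final class
    path = [c]
    for ptr in reversed(back):                  # trace back the pointers
        c = ptr[c]
        path.append(c)
    return path[::-1]

def brute_best(L, mu, theta):
    # exhaustive search over all r^n vectors, for verification only
    n, r = len(L), len(mu)
    def w(c):
        v = 1.0
        for i in range(n):
            v *= L[i][c[i]] * exp(mu[c[i]])
            if i < n - 1 and c[i] == c[i+1]:
                v *= exp(theta)
        return v
    return max(product(range(r), repeat=n), key=w)

L = [[0.2, 0.5], [0.4, 0.1], [0.3, 0.3], [0.6, 0.2]]
mu, theta = [0.1, -0.3], 0.8
assert viterbi_rate_classes(L, mu, theta) == list(brute_best(L, mu, theta))
```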

Inferring Hidden Codons: Ancestral Sequence Reconstruction
We now turn to the problem of imputing codons to internal nodes. To simplify our exposition, we fix a tree connecting the taxa of interest, a codon site, and a rate class for that site. In our previous article (Schadt, Sinsheimer, and Lange 1998), we sketched Felsenstein's (1981) algorithm for computing the likelihood of the data observed at the leaves of the tree. This algorithm works for both nucleotide and codon models. In discussing the algorithm, we introduced several probabilities. These were (1) the conditional probability T_{k->l}(i, j) that the daughter node l of a branch k -> l is in state j given that the mother node k is in state i, (2) the conditional probability D_k(i) of the subtree rooted at node k given node k is in state i, (3) the conditional probability L_k(i) of the left subtree rooted at node k given node k is in state i, (4) the conditional probability R_k(i) of the right subtree rooted at node k given node k is in state i, and (5) the joint probability U_k(i) of node k being in state i and the complementary subtree showing its observed bases at its leaves. In definition (1), note that T_{k->l}(i, j) is nothing more than an entry of the finite-time transition matrix P(t) pertaining to the branch k -> l. The complementary subtree referred to in definition (5) contains node k and all nodes that are not descendants of k. The letters D, L, R, and U in these definitions suggest the directions down, left, right, and up, with the root occurring at the top of the tree.

We remind the reader that the quantities L_k(i) and R_k(i) are computed in a postorder traversal of the tree in which daughter nodes are visited prior to mother nodes. The quantities U_k(i) are computed in a subsequent preorder traversal of the tree in which mother nodes are visited before daughter nodes. If l is one of the daughter nodes of k, then computation of the U_l(j) relies on the stored values of L_k(i) and R_k(i) available after the postorder traversal and on the stored values of U_k(i) available after the partially completed preorder traversal. At the time U_l(j) is computed, it is trivial to compute the product U_l(j)L_l(j)R_l(j) giving the joint probability of the data at the site and an assigned state j for node l. The ratio of this probability to the likelihood of the data represents the conditional probability of the assigned state j given the data.

To remove the restriction to a particular site and rate class, let i denote a codon site, l a node, and c_i a rate class at site i. We have just shown how to compute the joint likelihood L_i(c_i, x_l) of the observed data at site i and a specified codon x_l at node l. Within the context of the rate class model, efficient computation of the posterior probabilities of the various codons x_l at the various nodes l proceeds by constructing the array a(c_i, ..., c_{i+k-1}) of equation (8) omitting the factor L_i(c_i). This amended array is then used to form the sum S(c_i) of equation (9). The joint probability of the data and codon x_l at node l is determined by the sum Σ_{c_i=1}^r L_i(c_i, x_l)S(c_i). Dividing this joint probability by the likelihood of the data yields the posterior probability of codon x_l at node l of site i. If done correctly, the whole process of imputing codons to internal nodes is very efficient.

Implementation of the Viterbi algorithm is similar in spirit to the posterior probability algorithm, but practical implementation and notation becomes cumbersome when one attempts to integrate it with the codon model. For this reason, let us assume that we have already assigned as previously described an optimal rate class to each site. This allows us to fix a tree, a codon site, and a rate class for that site. Our description of the Viterbi algorithm for assigning optimal codons to nodes again exploits both a postorder and a preorder traversal. We retain the symbol D_m(i) but interpret it as the conditional probability of the subtree emanating from m with optimally assigned codons. Here we obviously condition on codon i at node m. If m is a leaf, then D_m(i) is the 0/1 function 1_{{i=j}} indicating consistency between the hypothesized codon i and the observed codon j at m. Now consider a mother node m with left and right daughter nodes l and r. The appropriate recurrence defining D_m(i) is

D_m(i) = [max_j T_{m->l}(i, j) D_l(j)][max_k T_{m->r}(i, k) D_r(k)].    (11)

When we reach the root, we evaluate max_i π_i D_root(i) using the codon priors π_i at the root. This gives the probability of the best assignment of codons.

Of course, this does not solve the problem of finding the best assignment. To find the best assignment, we must define left and right pointers at each node m that track the best j and k in equation (11) given an i. In the preorder phase of the Viterbi algorithm, we then simply trace back the pointers from mother nodes to daughter nodes and note the optimal codon at each node as we visit it. The whole algorithm is obviously reminiscent of maximum parsimony. Both methods assign codons (or nucleotides) to nodes. If the models we have been exploring capture natural variation well, then maximum parsimony assignment is apt to be inferior to Viterbi assignment using likelihoods. Further exploration of this claim is clearly warranted.
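The postorder pass and pointer traceback can be sketched on a toy two-state model standing in for the 61-state codon alphabet. The transition matrix, root priors, and tree below are all hypothetical, and for this shallow tree only the root-level pointers are needed to recover the single internal node; a full implementation would store pointers at every node. The result is checked against exhaustive enumeration of the hidden states.

```python
# hypothetical two-state stand-in for the codon model: one transition
# matrix P on every branch and priors pi at the root
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [0.6, 0.4]
# root -> (internal node with two leaves showing state 0, leaf showing state 1)
tree = ("root", ("internal", 0, 0), 1)   # leaves encoded by observed state

def viterbi_down(node):
    """Postorder pass: D[i] is the probability of the best assignment to
    the subtree below, given state i here, using
    D_m(i) = max_j P[i][j] D_l(j) * max_k P[i][k] D_r(k)."""
    if isinstance(node, int):            # leaf: 0/1 consistency indicator
        return [1.0 if i == node else 0.0 for i in range(2)], None, None
    _, left, right = node
    Dl, _, _ = viterbi_down(left)
    Dr, _, _ = viterbi_down(right)
    D, lp, rp = [], [], []
    for i in range(2):
        j = max(range(2), key=lambda s: P[i][s] * Dl[s])
        k = max(range(2), key=lambda s: P[i][s] * Dr[s])
        D.append(P[i][j] * Dl[j] * P[i][k] * Dr[k])
        lp.append(j)                     # left pointer: best child state
        rp.append(k)                     # right pointer
    return D, lp, rp

D, lp, rp = viterbi_down(tree)
root_state = max(range(2), key=lambda i: pi[i] * D[i])
best = pi[root_state] * D[root_state]
internal_state = lp[root_state]          # preorder traceback

# brute force over the two hidden nodes confirms the assignment
probs = {(a, b): pi[a] * P[a][b] * P[b][0] * P[b][0] * P[a][1]
         for a in range(2) for b in range(2)}
assert max(probs, key=probs.get) == (root_state, internal_state)
assert abs(best - max(probs.values())) < 1e-12
```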

Tree Building Algorithms
Efficient likelihood calculation and good optimization algorithms go only partway toward easing the computational burdens of finding the best evolutionary tree. For m contemporary taxa, there are T_m = (2m - 3)!/[2^{m-2}(m - 2)!] rooted trees to consider. A host of methods exists for efficiently searching tree space; among these are branch-and-bound (Felsenstein 1981), simulated annealing (Schadt, Sinsheimer, and Lange 1998), and the recent progressive search algorithms of Warnow, Moret, and St. John (2001) and Nakhleh et al. (2001). Here we will discuss methods of exhaustive enumeration that take into account constraints on phylogenies available through accepted groupings of taxa.

To illustrate our point, consider the unrooted tree of figure 1 depicting 16 complete HIV-1 genomes across five subtypes. In the absence of prior information on subtypes, these taxa are connected by more than 6.19 × 10^15 possible rooted trees. Grouping the taxa by subtype dramatically reduces the number of possible trees. There are three taxa each of subtypes A, B, C, and F and four taxa of subtype G. Each group of three related taxa can be arranged in three trees, the group of four related taxa can be arranged in 15 trees, and the five subtypes themselves can be arranged in 105 trees. Therefore, there are only (3^4)(15)(105) = 127,575 rooted trees connecting the taxa and preserving the subtypes. This represents a reduction in the number of possible trees by 10 orders of magnitude. Because there are fewer trees, the time devoted to fitting each possible tree is less of an issue. Accordingly, constraining tree space through known groupings promotes more sophisticated modeling.
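These counts are easy to reproduce directly from the formula T_m = (2m - 3)!/[2^{m-2}(m - 2)!] for the number of rooted bifurcating trees on m labeled taxa:

```python
from math import factorial

def rooted_trees(m):
    # T_m = (2m-3)! / [2^(m-2) (m-2)!], the number of rooted bifurcating
    # trees on m labeled taxa (equal to the double factorial (2m-3)!!)
    return factorial(2*m - 3) // (2**(m - 2) * factorial(m - 2))

assert [rooted_trees(m) for m in (2, 3, 4, 5)] == [1, 3, 15, 105]

# 16 HIV-1 genomes, unconstrained: about 6.19e15 rooted trees
unconstrained = rooted_trees(16)

# constrained by subtype: four groups of three, one group of four,
# and the five subtype units themselves
constrained = rooted_trees(3)**4 * rooted_trees(4) * rooted_trees(5)
assert constrained == 127575
assert unconstrained // constrained > 10**10   # ~10 orders of magnitude
```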



Fig. 1.—Constraining the topology of tree space can greatly reduce the number of possible trees.

 
Before discussing direct enumeration with group constraints, it is worth spending a few moments considering direct enumeration without such constraints. The standard inductive proof of the formula T_m = (2m - 3)!/[2^{m-2}(m - 2)!] suggests an efficient means of visiting all possible rooted trees. Suppose we number the taxa from 1 to m. Given a tree with m - 1 taxa, a tree with m taxa can be constructed by attaching taxon m to any one of the existing 2m - 4 edges or directly to the root if the current bifurcation at the root is moved forward in time. If we label the edges 1, ..., 2m - 4 and the root 2m - 3, then we can think of enlarging the tree as a choice among these integers. We can view the whole tree as a vector with m - 2 integer entries representing such successive choices. Position 1 corresponds to the attachment site in going from 2 taxa to 3 taxa and is occupied by one of the integers 1, 2, 3. In general, position k corresponds to the attachment site in going from k + 1 taxa to k + 2 taxa and is occupied by one of the integers 1, ..., 2k + 1.

An algorithm for visiting all possible trees becomes in this context simply a systematic method of enumerating these vectors. The most obvious scheme, backtracking, involves partial vectors. If the current partial vector is (l1, ..., lk), then we extend it to (l1, ..., lk, 1) when k < m - 2. If the current partial vector is the full vector (l1, ..., lm-2), and lm-2 < 2m - 3, then we replace (l1, ..., lm-2) by (l1, ..., lm-2 + 1). Finally, if the current partial vector is the full vector (l1, ..., lm-2) and lm-2 = 2m - 3, then we backtrack to the partial vector (l1, ..., lk), where k is the first position in going from right to left along (l1, ..., lm-2) at which lk < 2k + 1. We then replace the full vector by the partial vector (l1, ..., lk + 1). If no such position k exists, then we have exhausted the list of trees over the m taxa.
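The backtracking scheme can be sketched compactly as a generator over the choice vectors. This Python encoding (1-based positions, as in the text) is our own illustrative version; the count of vectors visited necessarily equals T_m.

```python
from math import factorial

def enumerate_tree_vectors(m):
    """Backtracking enumeration of the integer vectors (l_1, ..., l_{m-2})
    encoding rooted trees on m taxa: position k takes the values
    1, ..., 2k + 1, the attachment choices when taxon k + 2 is added."""
    vec = []
    while True:
        if len(vec) < m - 2:
            vec.append(1)                 # extend the partial vector
            continue
        yield tuple(vec)                  # a full vector = one tree
        # backtrack to the rightmost position that can be incremented
        while vec and vec[-1] == 2 * len(vec) + 1:
            vec.pop()
        if not vec:
            return                        # the list of trees is exhausted
        vec[-1] += 1

m = 7
count = sum(1 for _ in enumerate_tree_vectors(m))
assert count == factorial(2*m - 3) // (2**(m - 2) * factorial(m - 2))
```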

In principle, one could object that this is an inefficient process because it visits partial trees when only full trees are of interest. There are two logical replies. First, if we let Sm = T2 + ··· + Tm be the number of possible trees with m or fewer taxa, then one can prove inductively that


More importantly, the difficult likelihood calculations are undertaken only for the full trees. Programming the backtracking algorithm leans heavily on recursion and on visiting the branches and root in a regular order, say by a postorder traversal.

The most natural way to impose a tree constraint is to treat selected taxa as a unit. It will help in discussing constraints to refer to a taxon as a leaf and a group of taxa as a leaf bundle, or a bundle for short. In the HIV example just discussed, there are five bundles with four bundles having three leaves each and a fifth bundle having four leaves. If we want to enumerate all trees consistent with these groups, then we enumerate all trees consistent with the five bundles. In effect, the bundles serve as leaves. For each such enumerated tree, we unfold each attached bundle into the possible subtrees formed by its leaves.

It is helpful to expand this definition by making it recursive. Hence, a tree is built from leaves and leaf bundles. Each bundle consists of additional leaves and additional bundles. Each additional bundle consists of yet more leaves and yet more bundles, and so forth, like nested Russian dolls, but more complicated. Figure 2 gives an example in which 10 leaves are organized into two bundles B1 and B2 and two leaves, say 9 and 10. Bundle B1 consists of bundle B3 and leaves 4 and 5. Bundle B2 consists of leaves 6, 7, and 8 and no bundles. Finally, bundle B3 consists of leaves 1, 2, and 3 and no bundles. If we want to count the total number of trees consistent with these constraints, then we start at the root of the tree and list the attached leaves and bundles. In general, suppose the level 0 contributions attached to the root consist of leaves i_1, ..., i_j and bundles B_1, ..., B_k. The total number of consistent trees C_0 is then

C_0 = T_{j+k} ∏_{l=1}^{k} C_1(B_l),

where C_1(B_l) counts the number of consistent subtrees coming from bundle B_l at level 1. This formula recursively works its way down to the lowest level.
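The recursive count takes only a few lines. The bundle representation below (a pair of a free-leaf count and a list of subbundles) is our own illustrative encoding; treating each bundle as a single pseudo-leaf, the count at each level is T over the number of units times the product of the counts inside each bundle.

```python
from math import factorial

def rooted_trees(m):
    # number of rooted bifurcating trees on m labeled units
    return 1 if m < 2 else factorial(2*m - 3) // (2**(m - 2) * factorial(m - 2))

def consistent_trees(n_leaves, bundles):
    """Count trees consistent with nested bundle constraints: each
    bundle is (n_leaves, list_of_subbundles)."""
    units = n_leaves + len(bundles)       # leaves plus pseudo-leaves
    count = rooted_trees(units)
    for b in bundles:
        count *= consistent_trees(*b)     # recurse into each bundle
    return count

# HIV example: five subtype bundles at the root, no free leaves
hiv = consistent_trees(0, [(3, []), (3, []), (3, []), (3, []), (4, [])])
assert hiv == 127575

# figure 2 example: free leaves 9 and 10, B1 = {B3 plus two leaves},
# B2 = three leaves, B3 = three leaves
fig2 = consistent_trees(2, [(2, [(3, [])]), (3, [])])
```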



Fig. 2.—A 10-taxa tree with constrained groups B1, B2, and B3. The different levels of the tree illustrate the constrained subtopologies described in the text. These constraints result in a significant reduction in the number of trees that need to be considered to find the best tree.

 
Obviously, our description conveys only the general idea of how to assemble consistent trees. Each time a tree is visited at level 0, it must be assembled from its own leaves and the subtrees assembled at lower levels through its bundles. For the sake of brevity, we omit all programming details except to emphasize once again that recursion and postorder traversals are the keys to fast, compact code.

Long-Range Correlations
If prior information suggests that widely separated amino acids jointly determine a catalytic domain or protein fold, then it might be wise to model the correlated evolution of the responsible codons. The conserved histidine residues in the beta-globin protein furnish a typical example. How this modeling is done can have an enormous impact on the efficiency of likelihood evaluation. To illustrate our algorithmic point in a nongenetic context, suppose we have to evaluate the sum of products of arrays

Σ_c ∏_{i=1}^{n-3} a_{{i,i+3}}(c_i, c_{i+3}).

The arrays a_{{i,i+3}}(c_i, c_{i+3}) correspond to spatial correlation at a distance of 3. The forward method of evaluating this sum forms in sequence the arrays and scalars

b_{i+3}(c_{i+3}) = Σ_{c_i} b_i(c_i) a_{{i,i+3}}(c_i, c_{i+3}),    (12)

with b_i(c_i) = 1 for i <= 3, finishing with the sums Σ_{c_m} b_m(c_m) over the final site of each of the three interleaved chains. Similar formulas apply with maxima replacing sums if we want to find the optimal sequence of states c_1, ..., c_n.

It is important to note in either case that the complexity of the algorithm (eq. 12) increases sharply if we substitute arrays a_{{i,i+1,i+2,i+3}}(c_i, c_{i+1}, c_{i+2}, c_{i+3}) for the arrays a_{{i,i+3}}(c_i, c_{i+3}). Then the intermediate arrays b_{i+3}(c_{i+3}) are replaced by the more complicated arrays b_{{i+1,i+2,i+3}}(c_{i+1}, c_{i+2}, c_{i+3}) involving more indices and requiring more arithmetic operations to generate and handle. The moral is that in dealing with correlation at a distance it is better to omit short-range correlations or to strictly limit the number of long-range correlations. Even with these restrictions, computing the derivatives of the likelihood becomes messier. Of course, with extremely complicated patterns of interaction or in contexts such as pedigree analysis where sites are no longer arranged in a linear chain, it may be more efficient to eliminate the indices in permuted order. Finding the best order can be highly nontrivial. At some point the computational complexity of likelihood evaluation or dynamic programming can balloon out of control. In such cases it may be worthwhile to examine other strategies such as simulated annealing or genetic algorithms for solving the Viterbi problem. Likelihood evaluation yields to stochastic sampling (Lange 1997).
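A small sketch makes the savings concrete. With hypothetical positive arrays joining sites three apart, the computation splits into three interleaved chains, each intermediate array carries a single index, and the result agrees with the full multiple sum:

```python
from itertools import product
from math import isclose
import random

random.seed(7)
n, r = 9, 2
# hypothetical positive arrays a_{i,i+3}(c_i, c_{i+3}), one per coupled pair
a = {i: [[random.uniform(0.1, 1.0) for _ in range(r)] for _ in range(r)]
     for i in range(n - 3)}

def brute_force():
    # full multiple sum over all r^n state vectors, for verification only
    total = 0.0
    for c in product(range(r), repeat=n):
        w = 1.0
        for i, arr in a.items():
            w *= arr[c[i]][c[i + 3]]
        total += w
    return total

def forward_elimination():
    """Eliminate c_1, c_2, ... in order. Every array joins sites three
    apart, so the sites fall into three independent interleaved chains
    and each intermediate array b_{i+3}(c_{i+3}) carries one index."""
    total = 1.0
    for start in range(3):
        chain = list(range(start, n, 3))
        b = [1.0] * r                      # b over the first chain site
        for i in chain[:-1]:               # couple site i to site i + 3
            b = [sum(b[ci] * a[i][ci][cj] for ci in range(r))
                 for cj in range(r)]
        total *= sum(b)                    # sum out the last chain site
    return total

assert isclose(brute_force(), forward_elimination())
```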


    Discussion
Within coding regions, codons rather than nucleotides are the units of evolution, and neighboring codon sites experience correlated evolutionary pressures. For reasons of computational simplicity, these two facts are often overlooked even though taking them into account improves phylogeny reconstruction and helps identify conserved domains within proteins. The current article describes several enhancements to previous models for codon substitution and rate variation. Some of these enhancements improve the realism of the models, and some improve their computer implementation. Our improvements to the models include: (1) more flexible nucleotide substitution mechanisms, (2) codon models with more detailed and flexible acceptance mechanisms for nonsynonymous substitutions, (3) usable sufficient conditions for reversibility in the codon models, and (4) rate variation models based on Gibbs random fields that permit a bidirectional flow of information and interactions between nonadjacent codon sites. Our computational enhancements include: (5) a clear exposition of how reversibility helps in diagonalizing and exponentiating the infinitesimal generator, (6) an explicit method for differentiating the matrix exponential of the infinitesimal generator, (7) fast forward and backward algorithms for likelihood evaluation, (8) fast algorithms for forming the partial derivatives of the likelihood, (9) fast algorithms for assigning codons to internal nodes and rate classes to codon sites, and (10) fast algorithms for enumerating trees constrained by accepted groupings of taxa.

As the field of molecular phylogeny matures, its emphasis is shifting from phylogeny reconstruction to the more subtle and ambitious problems of annotating whole genomes and assigning function to their parts. The sheer amount of sequence data currently generated will sustain much more complex phylogenetic modeling. The models discussed here and their creative extensions will aid in identifying sequence motifs, conserved regions within genes, and gene families. Better estimates of branch lengths and site-by-site rates of evolution will be critical in these tasks. Equally important will be better tools for reconstructing ancestral sequences and for visualizing domains of slow and fast evolution. Although phylogeny-based sequence analysis cannot hope to supplant X-ray crystallography and nuclear magnetic resonance in understanding protein structure, these alternatives are expensive and time consuming. Partial knowledge is preferable to none, particularly in identifying targets for therapeutic intervention and in unraveling complex webs of interacting proteins.

Our modeling philosophy is to retain as much model generality as possible consistent with fast computation. Our bias toward flexibility is reflected in our program LINNAEUS, freely available from Eric Schadt on request. For instance, LINNAEUS allows users to define codon equivalence classes tailored to specific proteins or types of proteins. For receptors spanning cellular membranes, classifying codons by polarity makes good sense. For enzymatic proteins, precise folding is required, and codon classification by amino acid size might be more relevant. LINNAEUS also permits users to specify the number of rate classes and the extent of spatial correlation. LINNAEUS is written in VC++ but mainly adheres to the ANSI C++ standard and makes use of the Standard Template Library. LINNAEUS has currently been compiled and tested only under Windows NT 4.0 and Windows 2000, but we plan to make a Linux/UNIX version available in the near future. We further contemplate many future enhancements. It might be useful to incorporate covariates such as measures of solvent accessibility amino acid by amino acid. Of course, one of the biggest challenges is to model rate correlation at a distance. The three-dimensional structure of proteins brings amino acids into close proximity that are distant in the linear metric of the coding sequence. Somehow we must overcome the computational barrier of sifting through all pairs of codons to discover coordinated evolution at a distance.

Codon models are far more computationally demanding than nucleotide models. Nonetheless, the combination of good algorithms and faster processors has put them within reach. As explained in our companion article, the cumulative impact of our algorithmic improvements is substantial. Once the step to codon models is taken, the computational price of adding rate variation is modest. Computation of the partial derivatives of the loglikelihood is the biggest computational bottleneck in maximum likelihood estimation with codon models. If a model has p parameters, then it takes p + 1 likelihood evaluations to approximate partial derivatives by forward differences and 2p + 1 likelihood evaluations to approximate them by central differences. Likelihood evaluation itself is dominated by diagonalizing and exponentiating the infinitesimal generator. Our exact formulas compute the score vector via the diagonalization of the infinitesimal generator. Thus, a single likelihood evaluation suffices. With nearly 50 parameters in some of our more complex models, we see an almost 50-fold reduction in computing times. How long-range rate correlations are modeled can also impact computation times. This is a complicated topic and bears more examination, particularly if the advantages of fast likelihood differentiation are to be preserved.

We hope the availability of LINNAEUS will accelerate the transition to likelihood-based models, which in our view have far more to offer than traditional nonparametric methods. Our companion article (Schadt, Sinsheimer, and Lange 2002) provides examples of LINNAEUS in action. Past problems with likelihood-based methods stem from poor models and slow computation. As both of these barriers fall, there will be less and less excuse for using nonparametric methods. This is particularly true as attention shifts from phylogeny reconstruction to the discovery of conserved regions within genes and proteins.


    Acknowledgements
This research was supported in part by the USPHS grant GM53275.


    Appendix
 
To prove that almost all n x n matrices are diagonalizable, it suffices to prove that almost all n x n matrices have n distinct eigenvalues. Consider a typical such matrix M = (mij). The eigenvalues r1, ..., rn of M coincide with the roots of the characteristic polynomial p(r) = det(rI - M). Expanding this determinant, it is clear that the coefficients of

p(r) = r^n + c_{n-1} r^{n-1} + ··· + c1 r + c0

are themselves polynomials in the entries of M. But we can also write

p(r) = (r - r1)(r - r2) ··· (r - rn) = r^n - σ1 r^{n-1} + σ2 r^{n-2} - ··· + (-1)^n σn,

where σk is the sum of the (n choose k) products ri1 ri2 ··· rik of the roots taken k at a time. For instance,

σ1 = r1 + r2 + ··· + rn,    σ2 = Σ_{i<j} ri rj,    σn = r1 r2 ··· rn.

In the mathematical literature, σk is known as the kth elementary symmetric polynomial in the roots. In general, a symmetric polynomial s(r1, ..., rn) is a multivariate polynomial symmetric under all permutations of its n arguments (Niven, Zuckerman, and Montgomery 1991, pp. 484–489).
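The relation between the coefficients of p(r) and the elementary symmetric polynomials is easy to check numerically. A minimal pure-Python sketch (the function names here are ours, purely illustrative):

```python
from itertools import combinations
from math import prod

def elementary_symmetric(roots, k):
    """sigma_k: sum over all k-element subsets of the product of the chosen roots."""
    return sum(prod(subset) for subset in combinations(roots, k))

def char_poly_coefficients(roots):
    """Coefficients of p(r) = prod_i (r - r_i), highest power first:
    [1, -sigma_1, +sigma_2, ..., (-1)^n sigma_n]."""
    n = len(roots)
    return [(-1) ** k * elementary_symmetric(roots, k) for k in range(n + 1)]

# (r - 1)(r - 2)(r - 3) = r^3 - 6r^2 + 11r - 6
print(char_poly_coefficients([1, 2, 3]))  # [1, -6, 11, -6]
```

The alternating signs in `char_poly_coefficients` mirror the expansion of p(r) in terms of the σk above.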

When n = 2 the characteristic polynomial

p(r) = r^2 - σ1 r + σ2 = r^2 - (r1 + r2)r + r1 r2

has distinct roots if and only if its discriminant

σ1^2 - 4σ2 = (r1 - r2)^2

does not vanish. In general, the n-dimensional discriminant is the symmetric polynomial

d(r1, ..., rn) = Π_{i<j} (ri - rj)^2.

By definition, the roots of p(r) are distinct if and only if d(r1, ..., rn) does not vanish. Now the fundamental theorem of symmetric polynomials says that any symmetric polynomial can be expressed as a polynomial in the elementary symmetric polynomials (Niven, Zuckerman, and Montgomery 1991). Because each σk is itself a polynomial in the entries of M, it follows that

d(r1, ..., rn) = q(m11, ..., mnn)

for some polynomial q(m11, ..., mnn), which cannot be identically 0 because some matrices have distinct eigenvalues.

Our goal is now within reach. Indeed, one need only add that any nontrivial multivariate polynomial is nonzero for almost all values of its arguments (Hoffman 1975, p. 358). This fact follows for a univariate polynomial because it has at most a finite number of roots. The general case follows by induction and Fubini's theorem. It is noteworthy that the entries of M can be either real or complex. The weaker classical result that the set S of real matrices with distinct eigenvalues is both open and dense in R^(n^2) follows from our stronger result (Hirsch and Smale 1974, pp. 154–156). The set S is open because the eigenvalues of M are continuous functions of the entries of M. The set S is dense because an open ball B ⊂ R^(n^2) containing no matrix with distinct eigenvalues would force the complement of S to have positive measure, contradicting the proposition just established.
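The proposition lends itself to a quick Monte Carlo sanity check in the 2 x 2 case, where the discriminant of the characteristic polynomial is available in closed form. This throwaway sketch (not part of LINNAEUS) draws random matrices and confirms that the discriminant essentially never vanishes:

```python
import random

def discriminant_2x2(a, b, c, d):
    """Discriminant of det(rI - M) = r^2 - (a + d)r + (ad - bc)
    for M = [[a, b], [c, d]]; it equals (r1 - r2)^2, so a nonzero
    value (positive or negative) certifies distinct eigenvalues."""
    return (a + d) ** 2 - 4 * (a * d - b * c)

random.seed(0)
trials = 10_000
distinct = sum(
    discriminant_2x2(*(random.random() for _ in range(4))) != 0
    for _ in range(trials)
)
print(distinct == trials)  # a randomly drawn matrix almost surely has distinct eigenvalues
```

A negative discriminant corresponds to a complex-conjugate pair of eigenvalues, which are still distinct; only an exactly zero discriminant, a measure-zero event, gives a repeated eigenvalue.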


    Footnotes
 
Fumio Tajima, Reviewing Editor

Keywords: molecular evolution, codon models, Markov chain, Gibbs field, maximum likelihood, dynamic programming

Address for correspondence and reprints: Eric Schadt, Computational Genomics, Rosetta Inpharmatics, Kirkland, WA 98034. E-mail: eric_schadt@merck.com.


    References
 

    Adachi J., M. Hasegawa, 1992 Amino-acid substitution of proteins coded for mitochondrial-DNA during mammalian evolution Japan J. Genet 67:187-197[ISI]

    Armon A., D. Graur, N. Ben-Tal, 2001 ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information J. Mol. Biol 307:447-463[ISI][Medline]

    Baum L. E., T. Petrie, 1966 Statistical inference for probabilistic functions of finite state Markov chains Ann. Math. Stat 37:1554-1563[ISI]

    Baxter R., 1982 Exactly solved models in statistical mechanics Academic Press, New York

    Burge C., S. Karlin, 1997 Prediction of complete gene structures in human genomic DNA J. Mol. Biol 268:78-94[ISI][Medline]

    Cressie N. A. C., 1993 Statistics for spatial data Wiley, New York

    Dawid A. P., 1992 Applications of a general propagation algorithm for probabilistic expert systems Stat. Comput 2:25-36

    Derin H., H. Elliott, 1987 Modeling and segmentation of noisy and textured images using Gibbs Random Fields IEEE Trans. Pattern Anal. Machine Intelligence 9:39-55[ISI]

    Dongarra J., A. Lumsdaine, R. Pozo, K. A. Remington, 1996 IML++: iterative methods library reference guide. Version 1.2 ftp://math.nist.gov/pub/pozo/docs/iml.ps.gz

    Felsenstein J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach J. Mol. Evol 17:368-376[ISI][Medline]

    Felsenstein J., G. A. Churchill, 1996 A Hidden Markov Model approach to variation among sites in rate of evolution Mol. Biol. Evol 13:93-104[Abstract]

    Goldman N., Z. Yang, 1994 A codon-based model of nucleotide substitution for protein-coding DNA sequences Mol. Biol. Evol 11:725-736[Abstract/Free Full Text]

    Grantham R., 1974 Amino acid difference formula to help explain protein evolution Science 185:862-864[ISI][Medline]

    Gribskov M., A. D. McLachlan, D. Eisenberg, 1987 Profile analysis: detection of distantly related proteins Proc. Natl. Acad. Sci. USA 84:4355-4358[Abstract]

    Hasegawa M., 1986 Statistical methods in molecular phylogeny and their application to human evolution Japan J. Genet 61:633-634

    Hirsch M. W., S. Smale, 1974 Differential equations, dynamical systems, and linear algebra Academic Press, New York

    Hoffman K., 1975 Analysis in Euclidean space Prentice-Hall, Englewood Cliffs, NJ

    Horn R. A., C. R. Johnson, 1985 Matrix analysis Cambridge University Press, New York

    ———. 1991 Topics in matrix analysis Cambridge University Press, Cambridge, U.K

    Ikebe Y., T. Inagaki, 1986 An elementary approach to the functional calculus for matrices Am. Math. Monthly 93:390-392[ISI]

    Ising E., 1925 Beitrag zur Theorie des Ferromagnetismus Z. Physik 31:253-258

    Jennrich R. I., P. B. Bright, 1974 Fitting systems of linear differential equations using computer generated exact derivatives Technical Report No. 10. Health Sciences Computing Facility, UCLA, Los Angeles, Calif

    Jensen F. V., S. L. Lauritzen, K. G. Olesen, 1990 Bayesian updating in recursive graphical models by local computations Comp. Stat. Q 4:269-282

    Kelly C., J. Rice, 1996 Modeling nucleotide evolution: a heterogeneous rate analysis Math. Biosci 133:85-109[ISI][Medline]

    Kelly F. P., 1979 Reversibility and stochastic networks Wiley, New York

    Kimura M., 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences J. Mol. Evol 16:111-120[ISI][Medline]

    Koshi J. M., R. A. Goldstein, 1997 Mutation matrices and physical-chemical properties: correlations and implications Proteins 27:336-344[ISI][Medline]

    ———. 1998 Mathematical models of natural amino acid site mutations Proteins 32:289-295[ISI][Medline]

    Kramers H. A., G. H. Wannier, 1941 Statistics of the two-dimensional ferromagnet. Part I Phys. Rev 60:252-262

    Lange K., 1997 Mathematical and statistical methods for genetic analysis Springer-Verlag, New York

    Lange K., M. Boehnke, D. E. Weeks, 1988 Programs for pedigree analysis: MENDEL, FISHER, and dGENE Genet. Epid 5:471-472[ISI]

    Li W. H., 1993 Unbiased estimation of the rates of synonymous and nonsynonymous substitution J. Mol. Evol 36:96-99[ISI][Medline]

    Li W. H., C. I. Wu, C. C. Luo, 1985 A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes Mol. Biol. Evol 2:150-174[Abstract]

    Marsden J. E., 1987 Basic complex analysis W. H. Freeman and Company, New York

    Miyata T., T. Yasunaga, 1980 Molecular evolution of mRNA: a method for estimating evolutionary rates of synonymous and amino acid substitutions from homologous nucleotide sequences and its application J. Mol. Evol 16:23-36[ISI][Medline]

    Muse S. V., B. S. Gaut, 1994 A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome Mol. Biol. Evol 11:715-724[Abstract/Free Full Text]

    Nakhleh L., U. Roshan, K. St. John, J. Sun, T. Warnow, 2001 Designing fast converging phylogenetic methods Bioinformatics 17(Suppl. 1):S190-S198

    Nei M., T. Gojobori, 1986 Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions Mol. Biol. Evol 3:418-426[Abstract]

    Niven I., H. S. Zuckerman, H. L. Montgomery, 1991 An introduction to the theory of numbers. 5th edition Wiley, New York [Appendix A.2]

    Onsager L., 1944 A two-dimensional model with an order-disorder transition Phys. Rev 65:117-149

    Pedersen A-M., K. C. Wiuf, F. B. Christiansen, 1998 A codon-based model designed to describe lentiviral evolution Mol. Biol. Evol 15:1069-1081[Abstract]

    Pozo R., K. A. Remington, A. Lumsdaine, 1996 SparseLib++: sparse matrix class library reference guide. Version 1.5 ftp://math.nist.gov/pub/pozo/papers/sparse.ps.gz

    Pupko T., I. Pe'er, R. Shamir, D. Graur, 2000 A fast algorithm for joint reconstruction of ancestral amino acid sequences Mol. Biol. Evol 17:890-896[Abstract/Free Full Text]

    Richtmyer R. D., 1981 Principles of advanced mathematical physics, Vol. II Springer-Verlag, New York

    Schadt E. E., J. S. Sinsheimer, K. Lange, 1998 Computational advances in maximum likelihood methods for molecular phylogeny Genome. Res 8:222-233[Abstract/Free Full Text]

    ———. 2002 Applications of codon and rate variation models in molecular phylogeny Mol. Biol. Evol 19:1550-1562.[Abstract/Free Full Text]

    Squire W., G. Trapp, 1998 Using complex variables to estimate derivatives of real functions SIAM Rev 40:110-112[ISI]

    Strauss D. J., 1977 Clustering on coloured lattices J. Appl. Prob 14:135-143

    Tamura K., M. Nei, 1993 Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees Mol. Biol. Evol 10:512-526[Abstract]

    Tatusov R. L., S. F. Altschul, E. V. Koonin, 1994 Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks Proc. Natl. Acad. Sci. USA 91:12091-12095[Abstract/Free Full Text]

    Thompson C. J., 1972 Mathematical statistical mechanics Princeton University Press, Princeton, NJ

    Thompson J. D., D. G. Higgins, T. J. Gibson, 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice Nucleic Acids Res 22:4673-4680[Abstract]

    Viterbi A. J., 1967 Error bounds for convolutional codes and an asymptotically optimal decoding algorithm IEEE Trans. Informat. Theory 13:260-269

    Wagner H., E. Baake, T. Gerisch, 1999 Ising quantum chain and sequence evolution J. Stat. Phys 92:1017-1052[ISI]

    Warnow T., B. M. Moret, K. St. John, 2001 Absolute convergence: true trees from short sequences Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA 01). Pp. 186–195

    Whittaker J., 1990 Graphical models in applied multivariate statistics Wiley, Chichester, U.K

    Yang Z., 1993 Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites Mol. Biol. Evol 10:1396-1401

    ———. 1994 Estimating the pattern of nucleotide substitution J. Mol. Evol 39:105-111[ISI][Medline]

    ———. 1995 A space-time process model for the evolution of DNA sequences Genetics 139:993-1005[Abstract/Free Full Text]

    ———. 2000 Relating physicochemical properties of amino acids to variable nucleotide substitution patterns among sites PSB Proc 4:81-92

    Yang Z., R. Nielsen, M. Hasegawa, 1998 Models of amino acid substitution and applications to mitochondrial protein evolution Mol. Biol. Evol 15:1600-1611.[Abstract/Free Full Text]

    Yi T. M., E. S. Lander, 1994 Recognition of related proteins by iterative template refinement Protein Sci 3:1315-1328[Abstract/Free Full Text]

Accepted for publication May 2, 2002.