*Department of Biomathematics, UCLA School of Medicine, Los Angeles, CA;
Computational Genomics, Rosetta Inpharmatics, Kirkland, WA;
Department of Human Genetics, UCLA School of Medicine, Los Angeles, CA
Abstract

Introduction
First, not all sequence conservation occurs within coding regions. Regions of gene regulation also experience the guiding hand of selection. Second, amino acid substitution models blur the distinction between mutation, which operates at the nucleotide level, and selection, which operates at the amino acid level. One can reasonably expect the rate of mutation at a nucleotide site to be independent of the role its resident nucleotide plays at the codon level. Whether selection retains or eliminates a mutation depends on whether the corresponding codon substitution is synonymous or nonsynonymous. Some nonsynonymous substitutions can be more easily tolerated than others. These considerations suggest that we view selection as accepting some codon substitutions and rejecting others according to definite probabilistic rules (Hasegawa 1986
; Adachi and Hasegawa 1992
; Goldman and Yang 1994
; Pedersen, Wiuf, and Christiansen 1998
; Yang, Nielsen, and Hasegawa 1998
). More realistic modeling must recognize that some codon sites experience stronger selection than other sites. This fact suggests that one allows for variation in the rate of evolution among sites (Koshi and Goldstein 1998
; Yang, Nielsen, and Hasegawa 1998
; Yang 2000
). Finally, because conserved catalytic domains, protein folds, and tertiary structures extend over both consecutive and nonconsecutive codon sites, it is desirable to pursue models that allow for spatial correlation in rate variation (Yang 1995
; Felsenstein and Churchill 1996
; Wagner, Baake, and Gerisch 1999
).
These broad research trends have been set in motion by the researchers just noted. Our goal here is to refine some of the current models in interesting and productive ways and show how these models fit into fast maximum likelihood estimation. For example, we advocate modeling the acceptance of nonsynonymous substitutions using equivalence classes of amino acids on the basis of physical and chemical similarities. Modeling of spatial variation in the rate of evolution has been limited to the construction of hidden Markov chains and coupled Ising chains connecting adjacent nucleotide sites (Yang 1995
; Felsenstein and Churchill 1996
; Wagner, Baake, and Gerisch 1999
). These natural developments parallel the successful applications of hidden Markov chains in contexts such as gene finding, similarity searches, and homology recognition (Gribskov, McLachlan, and Eisenberg 1987
; Tatusov, Altschul, and Koonin 1994
; Yi and Lander 1994
; Burge and Karlin 1997
). But there is a growing recognition of the need to liberate phylogenetic modeling from the unidirectional flow of information implicit in Markov chains and simple nearest neighbor interactions required by Ising chains (Koshi and Goldstein 1998
; Yang, Nielsen, and Hasegawa 1998
). The most nuanced models will achieve these goals with codons rather than nucleotides as the units of selection.
In our view, the theory of Markov random fields offers a perfect framework for expansion of the rate variation models. In particular, we demonstrate that imposing a Gibbs prior with limited spatial interactions is fully compatible with fast likelihood evaluation. This opens the door to more realistic, but still parsimonious, codon models. On the numerical side, we present several algorithmic improvements in likelihood evaluation that can decrease computing times by at least two orders of magnitude. To stimulate the wider use of sophisticated models, we have implemented all of our advances in the computer program LINNAEUS. Applications of LINNAEUS appear in our companion article (Schadt, Sinsheimer, and Lange 2002
) and will not be pursued here.
Theory and Methods
Nucleotide Models
Codon models are preferable to amino acid models because they incorporate detail at the mechanistic level at which mutation operates. The codon models of Goldman and Yang (1994)
and Muse and Gaut (1994)
rely on a continuous-time Markov chain with 61 states. Mutations to one of the three stop codons totally disrupt gene function and can be ignored safely. Of course, codon models are no panacea. Nucleotide substitution is seldom completely random in conserved regions flanking coding regions, near intron-exon boundaries, or in intronic regions such as the silencer and enhancer regions implicated in regulation and splicing. In the case of regulation, substitutions can affect transcription and transcript stability; in the case of splicing, substitutions can change the entire structure of the translated protein (Miyata and Yasunaga 1980
; Li, Wu, and Luo 1985
; Nei and Gojobori 1986
; Li 1993
).
As a basis for our codon models, we adopt the generalization of Kimura's nucleotide substitution model proposed by Schadt, Sinsheimer, and Lange (1998)
. Kimura's model and subsequent generalizations view a phylogeny as a binary tree with contemporary taxa forming the leaves. Evolution occurs from site to site and branch to branch via independent Markov chains bouncing back and forth among the four possible states A (adenine), G (guanine), C (cytosine), and T (thymine). These continuous-time chains share the common infinitesimal generator
|
Rate Variation Among Lineages
Because, in general, it is impossible to separate the time parameter for each branch from the kinetic parameters, we adopt Yang's (1994)
convention and set one of the branch lengths equal to 1. This convention allows a branch length to be interpreted as the expected number of changes between the two taxa at the ends of the branch. The nucleotide substitution models sketched above implicitly assume that the rate of evolution across lineages is constant. To allow for stochastic variation in branch lengths, one can introduce randomized times for each branch.
There are two ways of choosing these times, both involving gamma distributed random variates. In the traditional model (Nei and Gojobori 1986
; Tamura and Nei 1993; Yang 1993
; Kelly and Rice 1996), each branch length at a given site is multiplied by the same gamma variate with mean 1. The gamma variates at different sites are chosen independently. Unfortunately, this procedure enormously complicates likelihood evaluation because the likelihood at the site must be integrated over all possible realizations of the gamma variate. Although one can carry out such quadratures numerically, using, say, Laguerre polynomials, accuracy tends to be low, and computation times tend to be high, particularly when derivatives of the likelihood are also sought.
Alternatively, one can imagine multiplying the different branch lengths at a given site by independent gamma distributed variates of mean 1. Although this creates even more integrals, these can be done along each branch separately as part of a single overall likelihood evaluation. Furthermore, as noted below, the necessary quadratures can be performed analytically. To promote model parsimony, it is preferable to specify the same shape parameter across both branches and sites for the independent gamma variates. In this regard, recall that the standard parameterization of a gamma distributed variate T involves a scale parameter and a shape parameter \alpha. The shape parameter determines the coefficient of variation of T through the equation

CV(T) = \alpha^{-1/2}.
In our alternative model with independent gamma multipliers rather than a common gamma multiplier, we must compute the unconditional finite-time transition matrix Q for evolution along each branch. Assuming a branch has average length 1/\lambda and the random multiplier T has mean 1 and shape parameter \alpha, one calculates Q using the formula

Q = E\left[e^{(T/\lambda)\Lambda}\right] = \left(I - \frac{1}{\alpha\lambda}\Lambda\right)^{-\alpha},

where \Lambda denotes the infinitesimal generator and the expectation is the matrix extension of the Laplace transform of the gamma density.
Although this insight is useful, practical computation of Q is best achieved when \Lambda is diagonalizable (Horn and Johnson 1991, p. 520). If \Lambda = UDU^{-1} with D diagonal having diagonal entries d_i, then

Q = U \,\mathrm{diag}\left[\left(1 - \frac{d_i}{\alpha\lambda}\right)^{-\alpha}\right] U^{-1}.
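To make the analytic quadrature concrete, the following sketch (our own illustration, not LINNAEUS code; the rate \mu, shape \alpha = 2, and \lambda = 1 are arbitrary values) checks the assumed closed form Q = (I - \Lambda/(\alpha\lambda))^{-\alpha} on a symmetric two-state chain, whose generator has eigenvalues 0 and -2\mu, against direct numerical integration over the gamma density.

```python
import math

# Sketch: for a symmetric 2-state chain with rate mu, the generator has
# eigenvalues 0 and -2*mu, so the assumed closed form
# Q = U diag[(1 - d_i/(alpha*lam))^(-alpha)] U^{-1} gives a stay probability
#   q_stay = 1/2 + (1/2) * (1 + 2*mu/(alpha*lam))**(-alpha).
def q_stay_closed(mu, alpha, lam):
    return 0.5 + 0.5 * (1.0 + 2.0 * mu / (alpha * lam)) ** (-alpha)

# Check by integrating the conditional stay probability against the
# Gamma(shape=alpha, rate=alpha*lam) density (mean branch length 1/lam).
def q_stay_quadrature(mu, alpha, lam, upper=60.0, steps=200_000):
    rate = alpha * lam
    total = 0.0
    h = upper / steps
    for j in range(steps):
        t = (j + 0.5) * h  # midpoint rule
        density = rate ** alpha * t ** (alpha - 1) * math.exp(-rate * t) / math.gamma(alpha)
        stay = 0.5 + 0.5 * math.exp(-2.0 * mu * t)
        total += stay * density * h
    return total

closed = q_stay_closed(0.3, 2.0, 1.0)
numeric = q_stay_quadrature(0.3, 2.0, 1.0)
print(closed, numeric)  # the two values agree to several decimal places
```

The closed form is nothing more than the Laplace transform of the gamma density evaluated at the nonzero eigenvalue, which is why the quadrature can be skipped entirely in production code.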
From Nucleotide Models to Codon Models
Any nucleotide substitution model induces a trivial codon substitution model. For example, if v_{bd} is the infinitesimal transition rate from nucleotide b to nucleotide d, then on the codon level, the infinitesimal transition rate from codon (a,b,c) to codon (a,d,c) is still v_{bd}, assuming sites mutate independently and neither codon (a,b,c) nor codon (a,d,c) is a stop codon. Substitutions involving the first or third sites are treated similarly, and single-step substitutions between codons differing at more than one site are prohibited.
The only difference between running this naive codon model and three adjacent independent nucleotide models is that the three stop codons are disallowed in the codon model. For this reason, the naive codon model is not terribly interesting. One can increase its relevance by incorporating an acceptance mechanism. If two neighboring codons (a,b,c) and (a,d,c) are nonsynonymous, then one can penalize transitions between them by replacing v_{bd} with \alpha v_{bd}. Here \alpha \in [0, 1] is the probability that the proposed evolutionary change is accepted. If the destination codon is one of the three stop codons, the acceptance probability is 0. The acceptance probability for a synonymous change is 1. Mathematically, the codon model with acceptance probabilities is determined by a 61 x 61 infinitesimal generator \Omega whose off-diagonal entries are the products of the corresponding entries of the infinitesimal generator of the naive codon model and the acceptance matrix. The diagonal entries of \Omega are defined so that row sums vanish.
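As an illustration of this construction (a sketch of our own, not code from LINNAEUS; the Kimura-style nucleotide rates, \kappa = 2, and \alpha = 0.2 are hypothetical placeholders), one can assemble the 61 x 61 generator from a nucleotide rate function and an acceptance rule:

```python
# Off-diagonal codon rates are nucleotide rates times an acceptance
# probability (1 for synonymous changes, alpha for nonsynonymous ones,
# 0 for stop codons); diagonals make row sums vanish.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")  # standard genetic code
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES) for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
CODONS = [c for c in CODE if CODE[c] != "*"]  # the 61 sense codons

def nuc_rate(b, d, kappa=2.0):
    # placeholder Kimura-style rates: transitions kappa, transversions 1
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    return kappa if (b, d) in transitions else 1.0

def codon_generator(alpha=0.2, kappa=2.0):
    q = {c: {} for c in CODONS}
    for c1 in CODONS:
        for pos in range(3):
            for d in BASES:
                if d == c1[pos]:
                    continue
                c2 = c1[:pos] + d + c1[pos + 1:]
                if CODE[c2] == "*":
                    continue  # mutations to stop codons are rejected
                accept = 1.0 if CODE[c1] == CODE[c2] else alpha
                q[c1][c2] = nuc_rate(c1[pos], d, kappa) * accept
        q[c1][c1] = -sum(q[c1].values())  # row sums vanish
    return q

q = codon_generator()
print(len(CODONS))  # 61
```

Multi-site substitutions simply never appear among the off-diagonal entries, which enforces the single-step restriction of the text.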
Preservation of Reversibility
Although the codon model is too complicated to permit explicit exponentiation of its infinitesimal generator \Omega, symmetry considerations can simplify this task and likelihood computation in general. For example, to avoid complex arithmetic in computing the eigenstructure of \Omega, it is desirable to derive sufficient conditions guaranteeing that \Omega has real eigenvalues and eigenvectors. Reversibility is the key. One can easily show that reversibility in the nucleotide model is equivalent to imposing two constraints equating pairs of the kinetic parameters of \Lambda. With these constraints in force, the equilibrium distribution \pi is
|
If we let D denote the diagonal matrix with the entries of \pi along its diagonal, then reversibility is equivalent to the detailed balance condition

D\Lambda = \Lambda^{t} D.
Besides accounting for nonsynonymous codon changes in a parsimonious manner, the codon model possesses the attractive property of turning a reversible nucleotide model into a reversible codon model. The scalar version of detailed balance is \pi_b v_{bd} = \pi_d v_{db}. If the acceptance probabilities in the codon model are symmetric, then we claim that the equilibrium probability of the codon (a,b,c) is \pi_a \pi_b \pi_c up to a multiplicative constant. The proof of this statement is simply the equality

\pi_a \pi_b \pi_c \,\alpha\, v_{bd} = \pi_a \pi_d \pi_c \,\alpha\, v_{db},

where \alpha is the common acceptance probability for the substitution in either direction.
Codon Acceptance Probabilities
In contrast to Muse and Gaut (1994)
, who employ a single acceptance probability that makes no distinctions between the different kinds of nonsynonymous codon substitutions, and in contrast to Goldman and Yang (1994)
, who use empirically determined codon acceptance probabilities based on the amino acid distances given in Grantham (1974)
, we prefer to estimate a small number of codon acceptance probabilities directly from the data. A large amount of sequence data would tend to favor this tactic of trading extra parameters for greater biological realism. Of course, the extreme model treating each amino acid separately requires \binom{20}{2} = 190 symmetric acceptance probabilities. To avoid such a proliferation of parameters, we group amino acids into similarity (equivalence) classes and then define acceptance probability parameters for transitions within or between classes. For instance, if we define similarity classes on the basis of the polarity properties of the amino acids, then the natural classes comprise the nonpolar amino acids S1 = {G, I, V, L, A, M, P, F, W}, the polar or positively charged amino acids S2 = {Q, N, C, Y, H, K, R}, and the polar or negatively charged amino acids S3 = {S, T, E, D}. We could further divide these classes on the basis of the size of amino acid side chains or other properties potentially affecting substitution rates. For instance, to account for cysteine's propensity to form disulfide bonds bridging different parts of a protein, we could place cysteine in its own similarity class S4 = {C}.
With the similarity classes S1 through S4 just defined, we postulate an acceptance probability \alpha_0 within classes, an acceptance probability \alpha_1 between a polar and a nonpolar class, an acceptance probability \alpha_2 between different polar classes, and an acceptance probability \alpha_3 for substitutions to or from cysteine. One would anticipate that \alpha_0 > \alpha_1 > \alpha_2 and \alpha_0 > \alpha_3. The symmetry constraints implicit in this parameterization are not completely realistic. For instance, they probably fail for transitions involving cysteine. Substitution of another amino acid for a cysteine involved in a disulfide bond is bound to be much less likely than the reverse substitution. But it would take an enormous amount of data to see this effect, and it is mathematically advantageous to maintain symmetry for the sake of reversibility.
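The class-based acceptance rule can be phrased as a simple lookup. In the sketch below (our illustration; the numerical values of \alpha_0 through \alpha_3 are hypothetical), cysteine is removed from S2 once S4 is introduced, and the cysteine rule takes precedence:

```python
# Illustrative lookup: map a pair of amino acids to the acceptance
# probability dictated by similarity classes S1-S4. Parameter values
# are placeholders, not estimates.
S1 = set("GIVLAMPFW")   # nonpolar
S2 = set("QNYHKR")      # polar or positively charged (C moved to S4)
S3 = set("STED")        # polar or negatively charged
S4 = set("C")           # cysteine alone

def acceptance(aa1, aa2, a0=0.9, a1=0.2, a2=0.4, a3=0.1):
    if aa1 == aa2:
        return 1.0  # synonymous change
    cls1 = next(s for s in (S1, S2, S3, S4) if aa1 in s)
    cls2 = next(s for s in (S1, S2, S3, S4) if aa2 in s)
    if cls1 is cls2:
        return a0               # within a class
    if S4 in (cls1, cls2):
        return a3               # to or from cysteine takes precedence
    if S1 in (cls1, cls2):
        return a1               # polar versus nonpolar
    return a2                   # between different polar classes

print(acceptance("K", "R"), acceptance("K", "E"), acceptance("C", "S"))
```

Because the lookup is symmetric in its two arguments, the resulting acceptance matrix preserves the reversibility argument given above.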
The penalty model just described appears to coincide with one proposed earlier by Yang, Nielsen, and Hasegawa (1998)
. In our companion article (Schadt, Sinsheimer, and Lange 2002
), we compare it with a generalization of the amino acid metric model proposed by Goldman and Yang (1994)
. No clear winner emerges in this comparison, but the similarity class and metric models are both clearly superior to the naive Muse and Gaut (1994)
model.
Matrix Exponentiation and Differentiation
One of the primary concerns in likelihood evaluation is computation of the finite-time transition matrix \exp(t\Omega) and its partial derivatives with respect to the parameters. Computation is much easier if \Omega is diagonalizable. Under the conditions of reversibility described above, let \pi denote the product equilibrium distribution constructed from the nucleotide equilibrium distribution (eq. 1). Also let D_\pi denote the diagonal matrix with the entries of \pi on its diagonal. Because D_\pi^{1/2} \Omega D_\pi^{-1/2} is symmetric, there exists a real orthogonal matrix U and a real diagonal matrix \Delta such that D_\pi^{1/2} \Omega D_\pi^{-1/2} = U \Delta U^{*}. The columns of U are the eigenvectors of D_\pi^{1/2} \Omega D_\pi^{-1/2}, and the diagonal entries of \Delta are its eigenvalues. It follows from this representation that

\exp(t\Omega) = D_\pi^{-1/2}\, U \exp(t\Delta)\, U^{*} D_\pi^{1/2}.
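The symmetrizing similarity transform is easy to verify on a toy reversible chain. The following sketch (our own check on a two-state model, not the 61-state codon model) confirms that detailed balance makes D^{1/2}\Omega D^{-1/2} symmetric and that the resulting closed form P_{00}(t) = \pi_0 + \pi_1 e^{-(a+b)t} agrees with a direct series evaluation of \exp(t\Omega):

```python
import math

# Toy reversible 2-state chain: generator rows (-a, a) and (b, -b),
# equilibrium pi proportional to (b, a).
a, b, t = 0.7, 0.3, 1.4
pi = (b / (a + b), a / (a + b))
omega = [[-a, a], [b, -b]]

# Similarity transform S = D^(1/2) Omega D^(-1/2); check symmetry.
s01 = math.sqrt(pi[0] / pi[1]) * a
s10 = math.sqrt(pi[1] / pi[0]) * b
print(abs(s01 - s10) < 1e-12)  # True: detailed balance symmetrizes Omega

# The symmetric transform has eigenvalues 0 and -(a+b); undoing the
# transform yields the closed form P_00(t) = pi_0 + pi_1 e^{-(a+b)t}.
p00_eigen = pi[0] + pi[1] * math.exp(-(a + b) * t)

# Compare against a brute-force power series for exp(t*Omega).
def mat_mult(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_exp(m, terms=30):
    out = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for n in range(1, terms):
        term = [[term[i][j] / n for j in range(2)] for i in range(2)]
        term = mat_mult(term, m)  # term is now (t*Omega)^n / n!
        out = [[out[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return out

p_series = mat_exp([[t * omega[i][j] for j in range(2)] for i in range(2)])
print(abs(p_series[0][0] - p00_eigen) < 1e-9)  # True
```

For the full 61-state generator the same recipe applies, with the symmetric eigendecomposition computed numerically rather than by hand.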
Suppose \theta is a parameter of the codon model. Any numerical method that can compute the entrywise partial derivatives (\partial/\partial\theta)\exp(t\Omega) using only \Omega and the entrywise partial derivatives (\partial/\partial\theta)\Omega is highly desirable. For instance, the optimization engine SEARCH of LINNAEUS works faster and more reliably given these partial derivatives (Lange, Boehnke, and Weeks 1988). Formulas for (\partial/\partial\theta)\exp(t\Omega) are available in the statistical literature on fitting differential equation models (Jennrich and Bright 1974) and in the Lie group literature (Richtmyer 1981, pp. 143-144), but these have not seen application in molecular phylogeny. For this reason, we indicate briefly one method of calculating (\partial/\partial\theta)\exp(t\Omega).
Suppose f(z) is an analytic function of the complex variable z. The particular choice f(z) = exp(tz) is the pertinent one for us. Because f(z) is locally defined by an absolutely convergent power series, one can define a matrix-valued analytic function f(M) on square matrices M by substituting M for z in this series. This correspondence gives rise to a matrix functional calculus whose crowning highlight is the Cauchy integral formula
f(M) = \frac{1}{2\pi i}\oint_{\Gamma} f(z)\,(zI - M)^{-1}\, dz,

where \Gamma is a simple closed curve enclosing the eigenvalues of M.
If we assume that M is diagonalizable in the form M = SDS-1 with D a diagonal matrix with jth diagonal entry dj, then we can write
f(M) = S f(D) S^{-1},

where f(D) is the diagonal matrix with diagonal entries f(d_j). One can show that the partial derivative of f(M) with respect to a parameter \theta is then

\frac{\partial}{\partial\theta} f(M) = S\left[G \circ \Phi\right] S^{-1}, \qquad G = S^{-1}\frac{\partial M}{\partial\theta}S, \qquad \Phi_{ij} = \begin{cases}\dfrac{f(d_i)-f(d_j)}{d_i-d_j}, & d_i \ne d_j,\\[4pt] f'(d_i), & d_i = d_j,\end{cases}

where \circ denotes the entrywise product.
When M is not diagonalizable, a more complicated formula of Schwerdtfeger applies based on the Jordan canonical form and Frobenius covariants of M (Horn and Johnson 1991). Fortunately for reversible models, \Omega is diagonalizable. Even when \Omega is not diagonalizable a priori, we can assume this condition in numerical practice because almost all matrices have distinct eigenvalues and are therefore diagonalizable. The appendix contains a short proof of this measure-theoretic assertion.
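For a diagonalizable M = SDS^{-1}, one concrete method forms G = S^{-1}(\partial M/\partial\theta)S and multiplies it entrywise by divided differences of e^{tz} at the eigenvalues. The sketch below (a hypothetical 2 x 2 example of ours, not LINNAEUS code) checks this recipe against central finite differences:

```python
import math

# Divided-difference formula: d/dtheta exp(tM) = S [G o Phi] S^{-1}, with
# G = S^{-1} (dM/dtheta) S, Phi_ij = (e^{t d_i} - e^{t d_j})/(d_i - d_j)
# off the diagonal, and Phi_ii = t e^{t d_i}.
def eig_parts(theta):
    # M(theta) = [[-theta, theta], [1, -1]]: eigenvalues 0 and -(1 + theta)
    d = [0.0, -(1.0 + theta)]
    s = [[1.0, theta], [1.0, -1.0]]
    sinv = [[1.0 / (1 + theta), theta / (1 + theta)],
            [1.0 / (1 + theta), -1.0 / (1 + theta)]]
    return d, s, sinv

def mat_mult(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def exp_tm(theta, t):
    d, s, sinv = eig_parts(theta)
    mid = [[math.exp(t * d[0]), 0.0], [0.0, math.exp(t * d[1])]]
    return mat_mult(mat_mult(s, mid), sinv)

def dexp_divided_difference(theta, t):
    d, s, sinv = eig_parts(theta)
    dm = [[-1.0, 1.0], [0.0, 0.0]]  # dM/dtheta for this example
    g = mat_mult(mat_mult(sinv, dm), s)
    phi = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            if d[i] == d[j]:
                phi[i][j] = t * math.exp(t * d[i])
            else:
                phi[i][j] = (math.exp(t * d[i]) - math.exp(t * d[j])) / (d[i] - d[j])
    b = [[g[i][j] * phi[i][j] for j in range(2)] for i in range(2)]
    return mat_mult(mat_mult(s, b), sinv)

theta, t, h = 0.6, 1.3, 1e-6
exact = dexp_divided_difference(theta, t)
for i in range(2):
    for j in range(2):
        numer = (exp_tm(theta + h, t)[i][j] - exp_tm(theta - h, t)[i][j]) / (2 * h)
        assert abs(exact[i][j] - numer) < 1e-6
print("divided-difference derivative matches finite differences")
```

The formula costs little beyond the eigendecomposition already needed for \exp(t\Omega) itself, which is what makes exact derivatives attractive in practice.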
Rate Variation Models
Several authors have objected to the naive assumptions that different codon sites evolve equally rapidly and independently (Goldman and Yang 1994
; Yang 1995
; Felsenstein and Churchill 1996
). Clearly, residues forming part of a catalytic domain or critical protein fold are more strongly conserved than residues that fall outside such regions. Better models of evolution must allow for variation in the rate of evolution and correlation in the rates of neighboring codon sites. The simplest way to model rate variation is to suppose that each codon site i is assigned a random rate class C_i from among r possible rate classes, which for convenience we label 1, ..., r. For two rate classes, we nominate the first class as slow and the second as fast. Probably the most parsimonious way of differentiating slow from fast evolution is to modulate the rate of evolution through the acceptance probabilities. Slow evolution can be distinguished from fast evolution by introducing a multiplicative parameter \sigma \in (0, 1) and replacing each acceptance probability \alpha_i by \sigma\alpha_i. For synonymous codon changes, one retains the acceptance probability of 1.
Modeling correlations in the rate of evolution among neighboring codon sites is more subtle. It is important to recognize that we are dealing with two different graphs in molecular phylogeny: (1) a taxon graph representing the relationships between the taxa and (2) a site graph representing which codon sites interact in determining rates of evolution. The site-to-site independence assumption essentially hides the fact that the site graph exists. The nodes of this graph are the different codon sites; its edges connect interacting pairs of sites. Yang (1995)
and Felsenstein and Churchill (1996)
implicitly define a site graph through their Markov chain models for rate variation. The Markov chain models postulate that the rate of evolution at the current site depends on the rate of evolution at the previous site. Although this is certainly a more realistic premise than independent rate variation, it is not the last word on the subject. Hidden Markov chains impose a unidirectional flow of information. Except for mathematical convenience, there is no reason to assume a preferred direction. Markov chain models also permit interactions only between nearest neighbors. These limitations can be relaxed without overly complicating maximum likelihood estimation and statistical inference.
The key to generalization is to replace Markov chains by Gibbs random fields (Kelly 1979; Whittaker 1990, pp. 47-61; Cressie 1993, pp. 410-422). The prototype Gibbs random field is the Ising model from statistical mechanics. As opposed to a Markov chain, it allows a bidirectional flow of information. Wagner, Baake, and Gerisch (1999) introduced a simple two-state nucleotide evolution model based on coupled Ising chains. This model captures nearest neighbor interactions between particles on a one-dimensional lattice (Thompson 1972, pp. 116-144). The Ising model and more complicated Gibbs fields are defined through potential functions H(c) defined on the realizations of the random rate class vector C = (C_1, ..., C_n) over n consecutive codon sites. A Gibbs random field assigns the prior probability

\Pr(C = c) = \frac{e^{-H(c)}}{Z}, \qquad Z = \sum_{d} e^{-H(d)},

where the normalizing constant Z is the partition function.
Gibbs Fields and Their Partition Functions
In the Potts generalization of the Ising model (Baxter 1982, pp. 422-452), one determines the prior probability of the random vector C = (C_1, ..., C_n) over n codon sites through either the linear potential function

H(c) = -\beta \sum_{i=1}^{n-1} 1_{\{c_i = c_{i+1}\}}

or its circular analog, which adds the boundary term -\beta 1_{\{c_n = c_1\}}.
One of the hardest problems in statistical mechanics has been the evaluation of partition functions. Fortunately, the Potts model permits exact evaluation. If we define the matrix M with entries m_{jk} = e^{\beta 1_{\{j=k\}}} for 1 \le j, k \le r, then the partition function for the linear chain reduces to

Z = \mathbf{1}^{t} M^{\,n-1} \mathbf{1},

where \mathbf{1} is the column vector of all ones.
For example, in the Ising model when r = 2, the matrix M has diagonal entries e^{\beta} and off-diagonal entries 1. Its eigenvalues are e^{\beta} + 1 and e^{\beta} - 1, and because \mathbf{1} is an eigenvector for the first eigenvalue, the partition function reduces to Z = 2(e^{\beta} + 1)^{n-1}.
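A quick numerical check of the transfer matrix identity (our sketch, with arbitrary \beta and n) compares Z = \mathbf{1}^t M^{n-1}\mathbf{1} with brute-force enumeration and, for r = 2, with the closed form 2(e^\beta + 1)^{n-1}:

```python
import itertools
import math

# Potts partition function: transfer matrix versus brute force.
def partition_transfer(n, r, beta):
    m = [[math.exp(beta if j == k else 0.0) for k in range(r)] for j in range(r)]
    vec = [1.0] * r                      # running product M^i 1
    for _ in range(n - 1):
        vec = [sum(m[j][k] * vec[k] for k in range(r)) for j in range(r)]
    return sum(vec)                      # 1^t M^(n-1) 1

def partition_brute(n, r, beta):
    total = 0.0
    for c in itertools.product(range(r), repeat=n):
        h = -beta * sum(c[i] == c[i + 1] for i in range(n - 1))
        total += math.exp(-h)
    return total

n, beta = 7, 0.8
print(abs(partition_transfer(n, 3, beta) - partition_brute(n, 3, beta)) < 1e-7)
print(abs(partition_transfer(n, 2, beta) - 2 * (math.exp(beta) + 1) ** (n - 1)) < 1e-7)
```

The transfer matrix evaluation scales linearly in n, while the brute-force sum grows as r^n; the agreement of the two is what makes the closed forms trustworthy.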
Evaluation of the partition function can sometimes be done directly without recourse to linear algebra. For instance, consider the linear logistic model
|
|
|
|
As a final example, suppose that r = 2 and the rate indicators C_i are chosen from {-1, 1} rather than {1, 2}. The partition function for the potential H(c) = -\beta \sum_{i=1}^{n-1} c_i c_{i+1} can be evaluated directly as

Z = \sum_{c} \exp\Bigl(\beta \sum_{i=1}^{n-1} c_i c_{i+1}\Bigr) = 2^{n} (\cosh\beta)^{n-1},

because summing over the last spin contributes a factor 2\cosh\beta regardless of its neighbor.
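The closed form for the +/-1 chain is equally easy to confirm by enumeration (assuming free boundaries and the potential \beta\sum c_i c_{i+1}; the parameter values below are arbitrary):

```python
import itertools
import math

# Brute-force check of Z = 2^n cosh(beta)^(n-1) for the +/-1 Ising
# chain with free boundaries.
def ising_partition(n, beta):
    total = 0.0
    for c in itertools.product((-1, 1), repeat=n):
        total += math.exp(beta * sum(c[i] * c[i + 1] for i in range(n - 1)))
    return total

n, beta = 10, 0.55
closed = 2 ** n * math.cosh(beta) ** (n - 1)
print(abs(ising_partition(n, beta) - closed) / closed < 1e-12)  # True
```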
Likelihood Evaluation Under Rate Variation
To compute the likelihood of the data under the rate variation model, we let L(c) = \prod_{i=1}^{n} L_i(c_i) denote the likelihood of the observations over all codon sites conditional on the value (c_1, ..., c_n) of the rate class vector C. Here L_i(c_i) is the likelihood specific to codon site i when it is in rate class c_i. Given the value of the finite-time transition matrix e^{t\Omega} and its derivatives (\partial/\partial\theta)e^{t\Omega}, computation of L_i(c_i) and its derivatives (\partial/\partial\theta)L_i(c_i) is carried out by the algorithms sketched by Felsenstein (1981)
and Schadt, Sinsheimer, and Lange (1998)
. These algorithms mimic Baum and Petrie's (1966)
forward and backward algorithms and have computational complexity O(ts^2), where t is the number of taxa and s = 4 or s = 61 is the number of states of the model. As Felsenstein (1981)
notes, a complexity of O(ts^2) is a vast improvement over the complexity O(s^t) encountered when one visits the internal nodes of the evolutionary tree simultaneously rather than recursively. Jensen, Lauritzen, and Olesen (1990)
present a nice generalization of the Baum and Petrie and Felsenstein algorithms to more complex graphical structures. This generalization closely resembles algorithms developed in pedigree analysis (Lange 1997
, pp. 102-110).
To compute the full likelihood in the rate variation model with spatial correlation, we must take into account the prior probability of the rate class vector C. Given a potential function H(c), we have
L = \sum_{c} L(c)\,\Pr(C = c) = \frac{\sum_{c}\prod_{i=1}^{n} L_i(c_i)\, e^{-H(c)}}{\sum_{d} e^{-H(d)}}.
Computation of either the partition function or the numerator of the likelihood reduces to the evaluation of a multiple sum of products of arrays. It is best to compute the multiple sums as iterated sums. Consider the numerator of the likelihood. In the forward algorithm, we eliminate the indices c_1, ..., c_n in the indicated order. Starting with the array f_0(c_1, ..., c_k) = 1 at step 0, at step i we recursively create the new array f_i(c_{i+1}, ..., c_{\min\{i+k,n\}}) using
|
The computational complexity of the forward algorithm scales linearly in n. At each step of the algorithm, one should rescale constructed arrays to prevent numerical underflow. Because a similar algorithm applies to the partition function, there is no need to express the partition function in closed form. Of course, if a closed form exists, it will usually be preferred on esthetic grounds.
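For the nearest neighbor case k = 1, assuming a Potts-type potential H(c) = -\beta\sum 1_{\{c_i = c_{i+1}\}}, the forward elimination reduces to the short sketch below (our own, with random placeholder values standing in for the site-specific likelihoods L_i(c_i)); in production code one would also rescale f at each step to prevent underflow:

```python
import itertools
import math
import random

# Forward elimination of c_1, ..., c_n for a nearest-neighbor potential:
# the running array f is indexed by the next rate class only.
def numerator_forward(site_lik, beta):
    n, r = len(site_lik), len(site_lik[0])
    f = [1.0] * r  # f_0, constant in c_1
    for i in range(n - 1):
        f = [sum(f[ci] * site_lik[i][ci] * math.exp(beta * (ci == cnext))
                 for ci in range(r)) for cnext in range(r)]
    return sum(f[cn] * site_lik[-1][cn] for cn in range(r))

# Exhaustive sum over all r^n rate class vectors, for comparison.
def numerator_brute(site_lik, beta):
    n, r = len(site_lik), len(site_lik[0])
    total = 0.0
    for c in itertools.product(range(r), repeat=n):
        w = math.exp(beta * sum(c[i] == c[i + 1] for i in range(n - 1)))
        for i in range(n):
            w *= site_lik[i][c[i]]
        total += w
    return total

random.seed(1)
lik = [[random.random() for _ in range(2)] for _ in range(8)]  # placeholder L_i
a = numerator_forward(lik, 0.7)
b = numerator_brute(lik, 0.7)
print(abs(a - b) / b < 1e-12)  # True: linear-time elimination matches
```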
Likelihood Differentiation Under Rate Variation
For certain rate variation models, it may be impossible to compute partial derivatives exactly. In such cases numerical partial derivatives of the likelihood can be approximated by either forward or central differences. Each forward difference requires an additional likelihood evaluation, and each central difference requires two additional likelihood evaluations. These extra likelihood evaluations are especially costly for kinetic parameters because they involve recalculation of all of the site-specific likelihoods Li(ci). For spatial parameters, the site-specific likelihoods are fixed, and it is possible to evaluate the full likelihood quickly given them. This suggests that numerical differentiation with respect to the spatial parameters is reasonable. One drawback of forward and central differences is that they entail roundoff errors originating from the subtraction of nearly equal function values. For a real analytic function f(x), there is a better alternative (Squire and Trapp 1998
) based on the approximation
|
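The approximation in question replaces a difference quotient by the imaginary part of a complex evaluation, f'(x) \approx \mathrm{Im}\, f(x + ih)/h, which involves no subtraction of nearly equal quantities. A minimal illustration (our example function, not tied to the likelihood):

```python
import cmath
import math

# Complex-step differentiation: since f(x + ih) = f(x) + ih f'(x) + O(h^2)
# for real analytic f, the imaginary part divided by h recovers f'(x)
# without subtractive cancellation, so h can be taken extremely small.
def complex_step(f, x, h=1e-20):
    return f(complex(x, h)).imag / h

f = lambda z: cmath.exp(cmath.sin(z))        # analytic test function
x = 0.8
exact = math.cos(x) * math.exp(math.sin(x))  # analytic derivative
print(abs(complex_step(f, x) - exact))       # agrees to machine precision
```

The method requires only that the likelihood code accept complex arguments, which is often a mechanical change.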
The product rule allows us to express the partial derivative of the likelihood with respect to a kinetic parameter \theta as

\frac{\partial}{\partial\theta} L = \frac{1}{\sum_{d} e^{-H(d)}}\sum_{j=1}^{n}\sum_{c}\frac{\partial L_j(c_j)}{\partial\theta}\prod_{i \ne j} L_i(c_i)\, e^{-H(c)},

the partition function being free of the kinetic parameters.
It is conceptually simple but notationally messy to describe how to combine the forward and backward algorithms to compute partial derivatives. The basic idea is to exploit as much of the forward and backward information as possible. This is accomplished by using f_{i-1}(c_i, ..., c_{i+k-1}) and b_{i+k}(c_i, ..., c_{i+k-1}) as a basis for further computation. For the sake of simplicity, we deal with a central value of i satisfying k \le i \le n - k. When i < k, we omit the forward array and deal with the backward array alone, and when i > n - k we do the reverse. These special cases we leave to the reader. For a central value of i we collect all arrays that have not been visited into a single array a(c_i, ..., c_{i+k-1}) by forming the product
|
|
Inferring Hidden States
The models and algorithms discussed so far lend themselves to the optimal assignment of codons to internal nodes and of codon sites to rate classes. This flexibility will help scientists in understanding and visualizing the most likely ancestral sequences and the most highly conserved regions in a protein. Felsenstein and Churchill (1996)
detailed such an algorithm in computing posterior probabilities for a given nucleotide site belonging to a given rate class, using their hidden Markov-based model of rate variation among sites. Others have detailed fast and efficient algorithms for the reconstruction of nucleotide and amino acid sequences and demonstrated the usefulness of these reconstructions in annotating the functional regions of protein sequences (Pupko et al. 2000
; Armon, Graur, and Ben-Tal 2001
). We generalize these reconstruction and rate assignment algorithms so they can be applied in the context of the generalized reconstruction models we have developed here.
There are two common methods for inferring hidden states in the type of likelihood models detailed here. The first simply computes the posterior probability of each possible assignment and chooses the most probable one at each node or site. The second method finds the most likely constellation of assignments over all nodes or sites simultaneously by dynamic programming. The posterior probability method involves fairly modest changes to the various likelihood algorithms. The dynamic programming method requires more radical surgery and adapts the Viterbi (1967)
algorithm as generalized by Dawid (1992)
.
Inferring Hidden Rate Classes
Consider for instance computation of the posterior probability that a codon site belongs to a given rate class. The posterior probability in question is the ratio of a joint probability and the likelihood of the data. The joint probability captures the data with the given codon site restricted to a particular rate class. Computation of the joint probability proceeds exactly as likelihood computation with the exception that when the codon site in question is reached in the computation, the range of summation shifts from all possible rate classes to a single fixed rate class. To carry out this strategy as efficiently as possible, we mimic the process of computing partial derivatives. Thus, by analogy to equations (6) and (7) we form
|
|
To explain the Viterbi algorithm for finding the best sequence of rate classes, it is clearer to adopt a general perspective. The problem of likelihood evaluation reduces to an evaluation of a multiple sum of multiple products of arrays of the form

\sum_{c}\prod_{j \in J} a_{T_j}(c).

In contrast to likelihood evaluation, the best assignment problem involves finding a multi-index c maximizing \prod_{j \in J} a_{T_j}(c). In other words, maxima replace sums. Dynamic programming solves a problem by taking advantage of already computed solutions of smaller examples of the same problem. In this case, we solve a sequence of n subproblems. For subproblem m, put
|
|
|
As a toy example, consider the product
|
|
|
|
The rate class problem possesses two major advantages. First, the index set (eq. 10 ) at stage m - 1 always contains at most k indices for k nearest neighbor interactions. Second, it is natural to proceed from left to right along the n codon sites. As we will discuss later, the same strategy may or may not work well for neighborhood structures connecting nonconsecutive sites.
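The max-product recursion with backpointers can be illustrated on a small chain (our toy example with arbitrary array values); exhaustive search confirms the dynamic programming answer:

```python
import itertools

# Max-product dynamic program: find the rate-class vector c maximizing a
# product of nearest-neighbor arrays a_i(c_i, c_{i+1}).
def best_assignment(arrays, r):
    score = {c: 1.0 for c in range(r)}  # best product ending in state c
    back = []
    for a in arrays:
        new_score, pointers = {}, {}
        for cnext in range(r):
            cand = [(score[c] * a[c][cnext], c) for c in range(r)]
            new_score[cnext], pointers[cnext] = max(cand)
        back.append(pointers)
        score = new_score
    best, last = max((v, c) for c, v in score.items())
    path = [last]
    for pointers in reversed(back):     # trace the pointers backwards
        path.append(pointers[path[-1]])
    return best, path[::-1]

arrays = [[[0.2, 0.9], [0.5, 0.1]], [[0.7, 0.3], [0.4, 0.8]],
          [[0.6, 0.2], [0.1, 0.9]]]
best, path = best_assignment(arrays, 2)
brute = max(
    (arrays[0][c[0]][c[1]] * arrays[1][c[1]][c[2]] * arrays[2][c[2]][c[3]],
     list(c))
    for c in itertools.product(range(2), repeat=4))
print(best == brute[0], path == brute[1])  # True True
```

Maxima commute with the elimination order in exactly the way sums do, which is why the same forward sweep applies with sums replaced by maxima.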
Inferring Hidden Codons: Ancestral Sequence Reconstruction
We now turn to the problem of imputing codons to internal nodes. To simplify our exposition, we fix a tree connecting the taxa of interest, a codon site, and a rate class for that site. In our previous article (Schadt, Sinsheimer, and Lange 1998
), we sketched Felsenstein's (1981)
algorithm for computing the likelihood of the data observed at the leaves of the tree. This algorithm works for both nucleotide and codon models. In discussing the algorithm, we introduced several probabilities. These were (1) the conditional probability T_{k \to l}(i, j) that the daughter node l of a branch k \to l is in state j given that the mother node k is in state i, (2) the conditional probability D_k(i) of the subtree rooted at node k given node k is in state i, (3) the conditional probability L_k(i) of the left subtree rooted at node k given node k is in state i, (4) the conditional probability R_k(i) of the right subtree rooted at node k given node k is in state i, and (5) the joint probability U_k(i) of node k being in state i and the complementary subtree showing its observed bases at its leaves. In definition (1), note that T_{k \to l}(i, j) is nothing more than an entry of the finite-time transition matrix P(t) pertaining to the branch k \to l. The complementary subtree referred to in definition (5) contains node k and all nodes that are not descendants of k. The letters D, L, R, and U in these definitions suggest the directions down, left, right, and up, with the root occurring at the top of the tree.
We remind the reader that the quantities L_k(i) and R_k(i) are computed in a postorder traversal of the tree in which daughter nodes are visited prior to mother nodes. The quantities U_k(i) are computed in a subsequent preorder traversal of the tree in which mother nodes are visited before daughter nodes. If l is one of the daughter nodes of k, then computation of the U_l(j) relies on the stored values of L_k(i) and R_k(i) available after the postorder traversal and on the stored values of U_k(i) available after the partially completed preorder traversal. At the time U_l(j) is computed, it is trivial to compute the product U_l(j)L_l(j)R_l(j) giving the joint probability of the data at the site and an assigned state j for node l. The ratio of this probability to the likelihood of the data represents the conditional probability of the assigned state j given the data.
To remove the restriction to a particular site and rate class, let i denote a codon site, l a node, and ci a rate class at site i. We have just shown how to compute the joint likelihood Li(ci, xl) of the observed data at site i and a specified codon xl at node l. Within the context of the rate class model, efficient computation of the posterior probabilities of the various codons xl at the various nodes l proceeds by constructing the array a(ci, ..., ci+k-1) of equation (8)
omitting the factor L_i(c_i). This amended array is then used to form the sum S(c_i) of equation (9). The joint probability of the data and codon x_l at node l is determined by the sum \sum_{c_i=1}^{r} L_i(c_i, x_l)S(c_i). Dividing this joint probability by the likelihood of the data yields the posterior probability of codon x_l at node l of site i. If done correctly, the whole process of imputing codons to internal nodes is very efficient.
Implementation of the Viterbi algorithm is similar in spirit to the posterior probability algorithm, but practical implementation and notation becomes cumbersome when one attempts to integrate it with the codon model. For this reason, let us assume that we have already assigned as previously described an optimal rate class to each site. This allows us to fix a tree, a codon site, and a rate class for that site. Our description of the Viterbi algorithm for assigning optimal codons to nodes again exploits both a postorder and a preorder traversal. We retain the symbol Dm(i) but interpret it as the conditional probability of the subtree emanating from m with optimally assigned codons. Here we obviously condition on codon i at node m. If m is a leaf, then Dm(i) is the 0/1 function 1{i=j} indicating consistency between the hypothesized codon i and the observed codon j at m. Now consider a mother node m with left and right daughter nodes l and r. The appropriate recurrence defining Dm(i) is
Dm(i) = [maxj pij(tl)Dl(j)] [maxk pik(tr)Dr(k)],     (11)

where pij(t) denotes the codon transition probability across a branch of length t, and tl and tr are the lengths of the branches leading from m to l and r.
Of course, this does not solve the problem of finding the best assignment. To find the best assignment, we must define left and right pointers at each node m that track the best j and k in equation (11) given an i. In the preorder phase of the Viterbi algorithm, we then simply trace back the pointers from mother nodes to daughter nodes and note the optimal codon at each node as we visit it. The whole algorithm is obviously reminiscent of maximum parsimony. Both methods assign codons (or nucleotides) to nodes. If the models we have been exploring capture natural variation well, then maximum parsimony assignment is apt to be inferior to Viterbi assignment using likelihoods. Further exploration of this claim is clearly warranted.
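A small sketch of the recurrence and its pointer traceback, with two states standing in for the 61 sense codons; the transition matrix, tree shape, and observations below are illustrative. A brute-force maximization over all internal-node assignments serves as a check.

```python
import itertools
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # illustrative transition matrix, all branches
pi = np.array([0.6, 0.4])
obs = {"B": 0, "C": 1, "D": 0}      # tree: root -> (A, D);  A -> (B, C)

def leaf_D(state):                  # D_m(i) = 1{i = observed state} at a leaf
    v = np.zeros(2)
    v[state] = 1.0
    return v

def combine(D_left, D_right):
    """Viterbi recurrence: take the best daughter state on each side of m."""
    D = np.zeros(2)
    ptr = np.zeros((2, 2), dtype=int)
    for i in range(2):
        left = P[i] * D_left        # p_ij * D_l(j) as j varies
        right = P[i] * D_right
        ptr[i] = [left.argmax(), right.argmax()]
        D[i] = left.max() * right.max()
    return D, ptr

D_A, ptr_A = combine(leaf_D(obs["B"]), leaf_D(obs["C"]))
D_root, ptr_root = combine(D_A, leaf_D(obs["D"]))

# Traceback: pick the best root state, then follow the stored pointers down.
root_state = int((pi * D_root).argmax())
A_state = int(ptr_root[root_state, 0])   # A is the left daughter of the root

# Brute-force check over all assignments to the two internal nodes.
def joint(r, a):
    return (pi[r] * P[r, a] * P[r, obs["D"]] *
            P[a, obs["B"]] * P[a, obs["C"]])

best = max(itertools.product(range(2), repeat=2), key=lambda s: joint(*s))
print((root_state, A_state), best)
```

The pointer arrays play exactly the role of the left and right pointers described in the text; with 61 codon states the loops grow but the structure is unchanged.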
Tree Building Algorithms
Efficient likelihood calculation and good optimization algorithms go only partway toward easing the computational burdens of finding the best evolutionary tree. For m contemporary taxa, there are Tm = (2m - 3)!/[2^(m-2)(m - 2)!] rooted trees to consider. A host of methods exists for efficiently searching tree space; among these are branch-and-bound (Felsenstein 1981), simulated annealing (Schadt, Sinsheimer, and Lange 1998), and the recent progressive search algorithms of Warnow, Moret, and St. John (2001) and Nakhleh et al. (2001). Here we will discuss methods of exhaustive enumeration that take into account constraints on phylogenies available through accepted groupings of taxa.
To illustrate our point, consider the unrooted tree of figure 1 depicting 16 complete HIV-1 genomes across five subtypes. In the absence of prior information on subtypes, these taxa are connected by more than 6.19 x 10^15 possible rooted trees. Grouping the taxa by subtype dramatically reduces the number of possible trees. There are three taxa each of subtypes A, B, C, and F and four taxa of subtype G. Each group of three related taxa can be arranged in three trees, the group of four related taxa can be arranged in 15 trees, and the five subtypes themselves can be arranged in 105 trees. Therefore, there are only (3^4)(15)(105) = 127,575 rooted trees connecting the taxa and preserving the subtypes. This represents a reduction in the number of possible trees by 10 orders of magnitude. Because there are fewer trees, the time devoted to fitting each possible tree is less of an issue. Accordingly, constraining tree space through known groupings promotes more sophisticated modeling.
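These counts are easy to verify mechanically. The sketch below uses the identity Tm = (2m - 3)!! (the double factorial), which is equivalent to the formula for Tm given earlier; the bundle sizes are the subtype sizes quoted in the text.

```python
from math import prod

def T(m):
    """Number of rooted binary trees on m labeled taxa: (2m - 3)!!."""
    return prod(range(2 * m - 3, 0, -2)) if m >= 2 else 1

unconstrained = T(16)                      # all 16 HIV-1 genomes, no grouping
bundles = [3, 3, 3, 3, 4]                  # subtype sizes A, B, C, F, G
constrained = T(len(bundles)) * prod(T(b) for b in bundles)
print(unconstrained, constrained)
```

The first count is roughly 6.19 x 10^15 and the second is 127,575, matching the figures in the text.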
Each rooted tree on m taxa can be encoded as a vector (l1, ..., lm-2), where the component lk, ranging over 1, ..., 2k + 1, records the branch to which taxon k + 2 is attached when the tree is grown taxon by taxon.
An algorithm for visiting all possible trees becomes in this context simply a systematic method of enumerating these vectors. The most obvious scheme, backtracking, involves partial vectors. If the current partial vector is (l1, ..., lk), then we extend it to (l1, ..., lk, 1) when k < m - 2. If the current partial vector is the full vector (l1, ..., lm-2), and lm-2 < 2m - 3, then we replace (l1, ..., lm-2) by (l1, ..., lm-2 + 1). Finally, if the current partial vector is the full vector (l1, ..., lm-2) and lm-2 = 2m - 3, then we backtrack to the partial vector (l1, ..., lk), where k is the first position in going from right to left along (l1, ..., lm-2) at which lk < 2k + 1. We then replace the full vector by the partial vector (l1, ..., lk + 1). If no such position k exists, then we have exhausted the list of trees over the m taxa.
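The set of full vectors visited by the backtracking scheme can be generated directly; the sketch below uses itertools.product rather than explicit partial-vector bookkeeping, and it reads lk as the index of the branch receiving taxon k + 2, one standard labeling consistent with the bound lk ≤ 2k + 1.

```python
from itertools import product
from math import prod

def tree_vectors(m):
    """Enumerate the vectors (l_1, ..., l_{m-2}) with 1 <= l_k <= 2k + 1.

    Each vector labels one rooted tree on m taxa: l_k records where
    taxon k + 2 joins the partial tree built from the first k + 1 taxa.
    """
    yield from product(*(range(1, 2 * k + 2) for k in range(1, m - 1)))

def T(m):                                  # (2m - 3)!! rooted trees on m taxa
    return prod(range(2 * m - 3, 0, -2)) if m >= 2 else 1

counts = [sum(1 for _ in tree_vectors(m)) for m in (3, 4, 5, 6)]
print(counts)
```

For m = 3, 4, 5, 6 the generator yields 3, 15, 105, and 945 vectors, in agreement with Tm.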
In principle, one could object that this is an inefficient process because it visits partial trees when only full trees are of interest. There are two logical replies. First, if we let Sm = T2 + ··· + Tm be the number of possible trees with m or fewer taxa, then one can prove inductively that
Sm ≤ (3/2)Tm. Hence the partial vectors visited inflate the total work by at most 50% over an enumeration of the full trees alone.
The most natural way to impose a tree constraint is to treat selected taxa as a unit. It will help in discussing constraints to refer to a taxon as a leaf and a group of taxa as a leaf bundle, or a bundle for short. In the HIV example just discussed, there are five bundles with four bundles having three leaves each and a fifth bundle having four leaves. If we want to enumerate all trees consistent with these groups, then we enumerate all trees consistent with the five bundles. In effect, the bundles serve as leaves. For each such enumerated tree, we unfold each attached bundle into the possible subtrees formed by its leaves.
It is helpful to expand this definition by making it recursive. Hence, a tree is built from leaves and leaf bundles. Each bundle consists of additional leaves and additional bundles. Each additional bundle consists of yet more leaves and yet more bundles, and so forth, like nested Russian dolls, but more complicated. Figure 2 gives an example in which 10 leaves are organized into two bundles B1 and B2 and two leaves, say 9 and 10. Bundle B1 consists of bundle B3 and leaves 4 and 5. Bundle B2 consists of leaves 6, 7, and 8 and no bundles. Finally, bundle B3 consists of leaves 1, 2, and 3 and no bundles. If we want to count the total number of trees consistent with these constraints, then we start at the root of the tree and list the attached leaves and bundles. In general, suppose the level 0 contributions attached to the root consist of leaves i1, ..., ij and bundles B1, ..., Bk. The total number of consistent trees C0 is then
C0 = Tj+k C1 ··· Ck,

where each Cl counts the subtrees consistent with bundle Bl and is computed recursively by the same formula. In the HIV example, C0 = T5 · 3^4 · 15 = 105 · 81 · 15 = 127,575.
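Assuming the count factorizes as C0 = Tj+k · C1 ··· Ck over the attached bundles — the pattern behind the 127,575 count for the HIV example — the recursion can be sketched as follows. The tuple encoding of bundles below is our own illustrative convention.

```python
from math import prod

def T(m):                                  # (2m - 3)!! rooted trees on m taxa
    return prod(range(2 * m - 3, 0, -2)) if m >= 2 else 1

def consistent_trees(leaves, bundles=()):
    """C = T(j + k) times the product of the counts for the k attached bundles."""
    counts = [consistent_trees(*b) for b in bundles]
    return T(leaves + len(bundles)) * prod(counts, start=1)

# HIV-1 example: five bundles at the root, no free leaves.
hiv = consistent_trees(0, [(3,), (3,), (3,), (3,), (4,)])

# Figure 2: root carries leaves 9, 10 and bundles B1, B2;
# B1 = {B3, leaves 4, 5}, B2 = {leaves 6, 7, 8}, B3 = {leaves 1, 2, 3}.
fig2 = consistent_trees(2, [(2, [(3,)]), (3,)])
print(hiv, fig2)
```

The HIV call reproduces 127,575; applying the same formula to the figure 2 arrangement gives 405 consistent trees.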
Long-Range Correlations
If prior information suggests that widely separated amino acids jointly determine a catalytic domain or protein fold, then it might be wise to model the correlated evolution of the responsible codons. The conserved histidine residues in the beta-globin protein furnish a typical example. How this modeling is done can have an enormous impact on the efficiency of likelihood evaluation. To illustrate our algorithmic point in a nongenetic context, suppose we have to evaluate the sum of products of arrays
Σc1 ··· Σcn Πi a{i,i+3}(ci, ci+3),

where the array a{i,i+3}(ci, ci+3) couples the rate classes at the distant sites i and i + 3. Summing out the indices in the order c1, c2, ..., cn-3 generates intermediate arrays via the update

b{i+3}(ci+3) = Σci a{i,i+3}(ci, ci+3)b{i}(ci),     (12)

with each b{i}(ci) initialized to 1; the original sum of products is then recovered as the product of the sums of the arrays attached to the last three sites. An entirely parallel recurrence with maxima replacing sums handles the corresponding Viterbi problem.
It is important to note in either case that the complexity of the algorithm (eq. 12) increases sharply if we substitute arrays a{i,i+1,i+2,i+3}(ci, ci+1, ci+2, ci+3) for the arrays a{i,i+3}(ci, ci+3). Then the intermediate arrays b{i+3}(ci+3) are replaced by the more complicated arrays b{i+1,i+2,i+3}(ci+1, ci+2, ci+3) involving more indices and requiring more arithmetic operations to generate and handle. The moral is that in dealing with correlation at a distance it is better to omit short-range correlations or to strictly limit the number of long-range correlations. Even with these restrictions, computing the derivatives of the likelihood becomes messier. Of course, with extremely complicated patterns of interaction or in contexts such as pedigree analysis where sites are no longer arranged in a linear chain, it may be more efficient to eliminate the indices in permuted order. Finding the best order can be highly nontrivial. At some point the computational complexity of likelihood evaluation or dynamic programming can balloon out of control. In such cases it may be worthwhile to examine other strategies such as simulated annealing or genetic algorithms for solving the Viterbi problem. Likelihood evaluation yields to stochastic sampling (Lange 1997).
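A sketch of this elimination order on random arrays, checked against brute force. Sites are 0-based here, the arrays a{i,i+3} are random stand-ins, and the update for the b arrays follows the recurrence discussed above; the sizes are illustrative.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 3                                # 8 sites, 3 rate classes per site
# One array per pair {i, i+3}; with 0-based sites the pairs are (0,3), ..., (4,7).
a = {i: rng.random((r, r)) for i in range(n - 3)}

# Eliminate in site order: summing out c_i folds a_{i,i+3} into b_{i+3}.
b = {i: np.ones(r) for i in range(n)}
for i in range(n - 3):
    b[i + 3] = b[i + 3] * (a[i].T @ b[i])  # b_{i+3}(c) = sum_ci a(ci, c) b_i(ci)
fast = np.prod([b[i].sum() for i in range(n - 3, n)])

# Brute force over all r**n rate-class configurations for comparison.
slow = sum(np.prod([a[i][c[i], c[i + 3]] for i in range(n - 3)])
           for c in product(range(r), repeat=n))
print(fast, slow)
```

The elimination touches each array once with r^2 work, versus r^n terms for the naive sum; replacing the pairwise arrays with four-index arrays would force the intermediate b arrays to carry three indices, which is exactly the blow-up the text warns against.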
Discussion
As the field of molecular phylogeny matures, its emphasis is shifting from phylogeny reconstruction to the more subtle and ambitious problems of annotating whole genomes and assigning gene function. The sheer amount of sequence data currently generated will sustain much more complex phylogenetic modeling. The models discussed here and their creative extensions will aid in identifying sequence motifs, conserved regions within genes, and gene families. Better estimates of branch lengths and site-by-site rates of evolution will be critical in these tasks. Equally important will be better tools for reconstructing ancestral sequences and for visualizing domains of slow and fast evolution. Although phylogeny-based sequence analysis cannot hope to supplant X-ray crystallography and nuclear magnetic resonance in understanding protein structure, these alternatives are expensive and time consuming. Partial knowledge is preferable to none, particularly in identifying targets for therapeutic intervention and in unraveling complex webs of interacting proteins.
Our modeling philosophy is to retain as much model generality as possible consistent with fast computation. Our bias toward flexibility is reflected in our program LINNAEUS, freely available from Eric Schadt on request. For instance, LINNAEUS allows users to define codon equivalence classes tailored to specific proteins or types of proteins. For receptors spanning cellular membranes, classifying codons by polarity makes good sense. For enzymatic proteins, precise folding is required, and codon classification by amino acid size might be more relevant. LINNAEUS also permits users to specify the number of rate classes and the extent of spatial correlation. LINNAEUS is written in VC++ but mainly adheres to the ANSI C++ standard and makes use of the Standard Template Library. LINNAEUS has currently been compiled and tested only under Windows NT 4.0 and Windows 2000, but we plan to make a Linux/UNIX version available in the near future. We further contemplate many future enhancements. It might be useful to incorporate covariates such as measures of solvent accessibility amino acid by amino acid. Of course, one of the biggest challenges is to model rate correlation at a distance. The three-dimensional structure of proteins brings amino acids into close proximity that are distant in the linear metric of the coding sequence. Somehow we must overcome the computational barrier of sifting through all pairs of codons to discover coordinated evolution at a distance.
Codon models are far more computationally demanding than nucleotide models. Nonetheless, the combination of good algorithms and faster processors has put them within reach. As explained in our companion article, the cumulative impact of our algorithmic improvements is substantial. Once the step to codon models is taken, the computational price of adding rate variation is modest. Computation of the partial derivatives of the loglikelihood is the biggest computational bottleneck in maximum likelihood estimation with codon models. If a model has p parameters, then it takes p + 1 likelihood evaluations to approximate partial derivatives by forward differences and 2p + 1 likelihood evaluations to approximate them by central differences. Likelihood evaluation itself is dominated by diagonalizing and exponentiating the infinitesimal generator. Our exact formulas compute the score vector via the diagonalization of the infinitesimal generator. Thus, a single likelihood evaluation suffices. With nearly 50 parameters in some of our more complex models, we see an almost 50-fold reduction in computing times. How long-range rate correlations are modeled can also impact computation times. This is a complicated topic and bears more examination, particularly if the advantages of fast likelihood differentiation are to be preserved.
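The evaluation counts for the two difference schemes are easy to check with a stand-in objective; the quadratic loglike below is purely illustrative, not a phylogenetic likelihood.

```python
import numpy as np

calls = 0
def loglike(theta):                        # illustrative stand-in for a likelihood
    global calls
    calls += 1
    return -np.sum((theta - np.arange(len(theta))) ** 2)

def forward_grad(f, x, h=1e-6):
    base = f(x)                            # 1 evaluation at x, then p more
    return np.array([(f(x + h * e) - base) / h for e in np.eye(len(x))])

def central_grad(f, x, h=1e-5):
    f(x)                                   # the evaluation of the likelihood itself
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                     for e in np.eye(len(x))])

p = 5
x = np.zeros(p)
calls = 0; g_fwd = forward_grad(loglike, x); fwd_calls = calls
calls = 0; g_cen = central_grad(loglike, x); cen_calls = calls
print(fwd_calls, cen_calls)                # p + 1 versus 2p + 1 evaluations
```

With p = 5 parameters the counts are 6 and 11, illustrating why an exact score vector obtained from a single diagonalization pays off so handsomely as p grows toward 50.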
We hope the availability of LINNAEUS will accelerate the transition to likelihood-based models, which in our view have far more to offer than traditional nonparametric methods. Our companion article (Schadt, Sinsheimer, and Lange 2002) provides examples of LINNAEUS in action. Past problems with likelihood-based methods stem from poor models and slow computation. As both of these barriers fall, there will be less and less excuse for using nonparametric methods. This is particularly true as attention shifts from phylogeny reconstruction to the discovery of conserved regions within genes and proteins.
Acknowledgements
Appendix
When n = 2 the characteristic polynomial of the matrix M = (mij) is

λ^2 - (m11 + m22)λ + (m11m22 - m12m21),

with discriminant

(m11 - m22)^2 + 4m12m21.

The discriminant is a nontrivial polynomial in the entries of M, and M has distinct eigenvalues precisely when it does not vanish.
Our goal is now within reach. Indeed, one need only add that any nontrivial multivariate polynomial is nonzero for almost all values of its arguments (Hoffman 1975, p. 358). This fact follows for a univariate polynomial because it has at most a finite number of roots. The general case follows by induction and Fubini's theorem. It is noteworthy that the entries of M can be either real or complex. The weaker classical result that the set S of real matrices with distinct eigenvalues is both open and dense in R^(n^2) follows from our stronger result (Hirsch and Smale 1974, pp. 154-156). The set S is open because the eigenvalues of M are continuous functions of the entries of M. If there were an open ball B in R^(n^2) containing no matrix with distinct eigenvalues, then the complement of S would have positive measure, contradicting the proposition just established.
Footnotes
Keywords: molecular evolution, codon models, Markov chain, Gibbs field, maximum likelihood, dynamic programming
Address for correspondence and reprints: Eric Schadt, Computational Genomics, Rosetta Inpharmatics, Kirkland, WA 98034. E-mail: eric_schadt{at}merck.com.
References
Adachi J., M. Hasegawa, 1992 Amino-acid substitution of proteins coded for mitochondrial-DNA during mammalian evolution Japan J. Genet 67:187-197
Armon A., D. Graur, N. Ben-Tal, 2001 ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information J. Mol. Biol 307:447-463
Baum L. E., T. Petrie, 1966 Statistical inference for probabilistic functions of finite state Markov chains Ann. Math. Stat 37:1554-1563
Baxter R., 1982 Exactly solved models in statistical mechanics Academic Press, New York
Burge C., S. Karlin, 1997 Prediction of complete gene structures in human genomic DNA J. Mol. Biol 268:78-94
Cressie N. A. C., 1993 Statistics for spatial data Wiley, New York
Dawid A. P., 1992 Applications of a general propagation algorithm for probabilistic expert systems Stat. Comput 2:25-36
Derin H., H. Elliott, 1987 Modeling and segmentation of noisy and textured images using Gibbs random fields IEEE Trans. Pattern Anal. Machine Intelligence 9:39-55
Dongarra J., A. Lumsdaine, R. Pozo, K. A. Remington, 1996 IML++: iterative methods library reference guide. Version 1.2 ftp://math.nist.gov/pub/pozo/docs/iml.ps.gz
Felsenstein J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach J. Mol. Evol 17:368-376
Felsenstein J., G. A. Churchill, 1996 A hidden Markov model approach to variation among sites in rate of evolution Mol. Biol. Evol 13:93-104
Goldman N., Z. Yang, 1994 A codon-based model of nucleotide substitution for protein-coding DNA sequences Mol. Biol. Evol 11:725-736
Grantham R., 1974 Amino acid difference formula to help explain protein evolution Science 185:862-864
Gribskov M., A. D. McLachlan, D. Eisenberg, 1987 Profile analysis: detection of distantly related proteins Proc. Natl. Acad. Sci. USA 84:4355-4358
Hasegawa M., 1986 Statistical methods in molecular phylogeny and their application to human evolution Japan J. Genet 61:633-634
Hirsch M. W., S. Smale, 1974 Differential equations, dynamical systems, and linear algebra Academic Press, New York
Hoffman K., 1975 Analysis in Euclidean space Prentice-Hall, Englewood Cliffs, NJ
Horn R. A., C. R. Johnson, 1985 Matrix analysis Cambridge University Press, New York
. 1991 Topics in matrix analysis Cambridge University Press, Cambridge, U.K.
Ikebe Y., T. Inagaki, 1986 An elementary approach to the functional calculus for matrices Am. Math. Monthly 93:390-392
Ising E., 1925 Beitrag zur Theorie des Ferromagnetismus Z. Physik 31:253-258
Jennrich R. I., P. B. Bright, 1974 Fitting systems of linear differential equations using computer generated exact derivatives Technical Report No. 10. Health Sciences Computing Facility, UCLA, Los Angeles, Calif.
Jensen F. V., S. L. Lauritzen, K. G. Olesen, 1990 Bayesian updating in recursive graphical models by local computations Comp. Stat. Q 4:269-282
Kelly C., J. Rice, 1996 Modeling nucleotide evolution: a heterogeneous rate analysis Math. Biosci 133:85-109
Kelly F. P., 1979 Reversibility and stochastic networks Wiley, New York
Kimura M., 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences J. Mol. Evol 16:111-120
Koshi J. M., R. A. Goldstein, 1997 Mutation matrices and physical-chemical properties: correlations and implications Proteins 27:336-344
. 1998 Mathematical models of natural amino acid site mutations Proteins 32:289-295
Kramers H. A., G. H. Wannier, 1941 Statistics of the two-dimensional ferromagnet. Part I Phys. Rev 60:252-262
Lange K., 1997 Mathematical and statistical methods for genetic analysis Springer-Verlag, New York
Lange K., M. Boehnke, D. E. Weeks, 1988 Programs for pedigree analysis: MENDEL, FISHER, and dGENE Genet. Epid 5:471-472
Li W. H., 1993 Unbiased estimation of the rates of synonymous and nonsynonymous substitution J. Mol. Evol 36:96-99
Li W. H., C. I. Wu, C. C. Luo, 1985 A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes Mol. Biol. Evol 2:150-174
Marsden J. E., 1987 Basic complex analysis W. H. Freeman and Company, New York
Miyata T., T. Yasunaga, 1980 Molecular evolution of mRNA: a method for estimating evolutionary rates of synonymous and amino acid substitutions from homologous nucleotide sequences and its application J. Mol. Evol 16:23-36
Muse S. V., B. S. Gaut, 1994 A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome Mol. Biol. Evol 11:715-724
Nakhleh L., U. Roshan, K. St. John, J. Sun, T. Warnow, 2001 Designing fast converging phylogenetic methods Bioinformatics 17(Suppl. 1):S190-S198
Nei M., T. Gojobori, 1986 Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions Mol. Biol. Evol 3:418-426
Niven I., H. S. Zuckerman, H. L. Montgomery, 1991 An introduction to the theory of numbers. 5th edition Wiley, New York [Appendix A.2]
Onsager L., 1944 A two-dimensional model with an order-disorder transition Phys. Rev 65:117-149
Pedersen A-M., K. C. Wiuf, F. B. Christiansen, 1998 A codon-based model designed to describe lentiviral evolution Mol. Biol. Evol 15:1069-1081
Pozo R., K. A. Remington, A. Lumsdaine, 1996 SparseLib++: sparse matrix class library reference guide. Version 1.5 ftp://math.nist.gov/pub/pozo/papers/sparse.ps.gz
Pupko T., I. Pe'er, R. Shamir, D. Graur, 2000 A fast algorithm for joint reconstruction of ancestral amino acid sequences Mol. Biol. Evol 17:890-896
Richtmyer R. D., 1981 Principles of advanced mathematical physics, Vol. II Springer-Verlag, New York
Schadt E. E., J. S. Sinsheimer, K. Lange, 1998 Computational advances in maximum likelihood methods for molecular phylogeny Genome Res 8:222-233
. 2002 Applications of codon and rate variation models in molecular phylogeny Mol. Biol. Evol 19:1550-1562
Squire W., G. Trapp, 1998 Using complex variables to estimate derivatives of real functions SIAM Rev 40:110-112
Strauss E. J., 1977 Clustering on coloured lattices J. Appl. Prob 4:135-143
Tamura K., M. Nei, 1993 Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees Mol. Biol. Evol 10:512-526
Tatusov R. L., S. F. Altschul, E. V. Koonin, 1994 Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks Proc. Natl. Acad. Sci. USA 91:12091-12095
Thompson C. J., 1972 Mathematical statistical mechanics Princeton University Press, Princeton, NJ
Thompson J. D., D. G. Higgins, T. J. Gibson, 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res 22:4673-4680
Viterbi A. J., 1967 Error bounds for convolutional codes and an asymptotically optimal decoding algorithm IEEE Trans. Informat. Theory 13:260-269
Wagner H., E. Baake, T. Gerisch, 1999 Ising quantum chain and sequence evolution J. Stat. Phys 92:1017-1052
Warnow T., B. M. Moret, K. St. John, 2001 Absolute convergence: true trees from short sequences Pp. 186-195 in Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA 01)
Whittaker J., 1990 Graphical models in applied multivariate statistics Wiley, Chichester, U.K.
Yang Z., 1993 Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites Mol. Biol. Evol 10:1396-1401
. 1994 Estimating the pattern of nucleotide substitution J. Mol. Evol 39:105-111
. 1995 A space-time process model for the evolution of DNA sequences Genetics 139:993-1005
. 2000 Relating physicochemical properties of amino acids to variable nucleotide substitution patterns among sites PSB Proc 4:81-92
Yang Z., R. Nielsen, M. Hasegawa, 1998 Models of amino acid substitution and applications to mitochondrial protein evolution Mol. Biol. Evol 15:1600-1611
Yi T. M., E. S. Lander, 1994 Recognition of related proteins by iterative template refinement Protein Sci 3:1315-1328