Maximum-Likelihood Phylogenetic Analysis Under a Covarion-like Model

Nicolas Galtier2,

Centre National de la Recherche Scientifique UMR 5000—Génome, Populations, Interactions, Université Montpellier 2, Montpellier, France


    Abstract
 TOP
 Abstract
 Introduction
 Methods
 Data Analysis
 Discussion
 Appendix
 Acknowledgements
 literature cited
 
Here, a model allowing covarion-like evolution of DNA sequences is introduced. In contrast to standard representation of the distribution of evolutionary rates, this model allows the site-specific rate to vary between lineages. This is achieved by adding as few as two parameters to the widely used among-site rate variation model, namely, (1) the proportion of sites undergoing rate changes and (2) the rate of rate change. This model is implemented in the likelihood framework, allowing parameter estimation, comparison of models, and tree reconstruction. An application to ribosomal RNA sequences suggests that covarions (i.e., site-specific rate changes) play an important role in the evolution of these molecules. Neglecting them results in a severe underestimate of the variance of rates across sites. It has, however, little influence on the estimation of ancestral G+C contents obtained from a nonhomogeneous model, or on the resulting inferences about the evolution of thermophyly. This theoretical effort should be useful for the study of protein adaptation, which presumably proceeds in a typical covarion-like manner.


    Introduction
 TOP
 Abstract
 Introduction
 Methods
 Data Analysis
 Discussion
 Appendix
 Acknowledgements
 literature cited
 
Markov models of DNA sequence evolution are widely used for reconstructing phylogenetic trees and studying the processes of molecular evolution from genomic sequence data. Considerable progress has been made since the precursor works of Jukes and Cantor (1969)Citation and Kimura (1980)Citation : models have been built to account for unbalanced base composition (Hasegawa, Kishino, and Yano 1985Citation ; Tamura 1992Citation ), variable G+C content between sequences (Galtier and Gouy 1998Citation ), and synonymous versus nonsynonymous changes (Goldman and Yang 1994Citation ), to mention only a few. A significant advance occurred when Yang (1993, 1994)Citation introduced a model allowing variable substitution rates across sites within the likelihood framework. Yang showed that adding a single parameter—namely, the shape parameter of an assumed Gamma distribution of rates—could increase the likelihood by many orders of magnitude. This is presumably because selective constraints vary across sites. Some sites in a protein or an RNA evolve more or less freely, whereas others can hardly be substituted without a significant drop in fitness. Accounting for this effect greatly improves our representation of molecular evolution.

This picture, however, is an instantaneous one. It is quite likely that the selective constraints applying to a particular site also vary in time and between lineages. As far as long periods of time are concerned, critical sites with respect to the function of a macromolecule may change, making the evolutionary rate of a particular site variable across the phylogeny. This notion was introduced as early as 30 years ago by Fitch and Markowitz (1970)Citation and Fitch (1971)Citation and was called the "covarion" process. The terminology comes from the idea that the rate of a site might be modified by a substitution arising at a distinct site of the molecule, with which it therefore covaries. Site-specific rate variation, however, can as well be caused by external factors such as environmental changes. In this paper, I refer to "true covarions" when two (several) sites of a sequence are undergoing nonindependent evolution, and I use the term "site-specific rate variation" (SSRV) or "covarion-like evolution" in the more general case.

There are at least two good reasons for being interested in modeling SSRV. First, it might improve phylogenetic reconstructions. Philippe and colleagues have shown that SSRV is common in genes widely used for the recovery of deep phylogenies and have and suggested that it induces tree-building biases (Germot and Philippe 1999Citation ; Lopez, Forterre, and Philippe 1999Citation ; Philippe et al. 2000Citation ). Second, modeling SSRV should help to obtain an understanding of the way natural selection applies at the molecular level. In particular, episodic adaptive evolution presumably proceeds with frequent changes of rates at various sites.

Little theoretical work was achieved after Fitch's early reports about SSRV until Tuffley and Steel (1998)Citation proposed an explicit Markov model that resulted in a distance-based test for detecting SSRV effects (Lockhart et al. 1998, 2000Citation ). In this paper, I introduce a more general Markov model of site-specific rate variation and devise a maximum-likelihood implementation of this model. This method is applied to ribosomal RNA sequences to assess the amount of covarion-like evolution for this kind of data and to check the robustness of previously reported inferences about the early evolution of life.


    Methods
 TOP
 Abstract
 Introduction
 Methods
 Data Analysis
 Discussion
 Appendix
 Acknowledgements
 literature cited
 
The Model
Yang (1994)Citation proposed a discrete-Gamma model of among-site rate variation where the evolutionary rate of a particular site is one out of an arbitrary finite number g of possible relative rates (r1, r2, ... , rg). These possible relative rates and their probabilities are obtained by discretizing a Gamma distribution with mean unity; one can ensure equal probabilities for the ri's by using appropriate cutting points, which was done by Yang and in this study. The probability distribution of rates across sites is therefore determined by a single parameter, namely, the shape parameter {alpha} of the Gamma distribution. Under this model, simulating the evolution of a site on a given tree using a given Markov process M of nucleotide substitution involves three steps: (1) randomly draw a rate from the discretized Gamma distribution, (2) randomly draw an ancestral nucleotide state at the root of the tree, and (3) make this state evolve along the branches of the tree according to process M and the rate that was drawn at step 1. A site therefore belongs to a "category" (following Felsenstein's terminology) which is fixed during the entire simulation process.

I now generalize Yang's model by allowing site-specific rates to vary in time. It is assumed that the rate of a particular site can switch from one category to another according to a continuous Markov process R. With rate {nu}, the current evolutionary rate moves to a new category, obtained by randomly drawing from the distribution of ri's. Switches between distant categories of rates are therefore assumed to be as probable as short-range changes. The parameter {nu} could be called the "rate variation rate." It determines the amount of site-specific rate variation. It is assumed constant over sites and in time. Note that the process R of rate change is continuous: switches can occur anywhere in the tree, not specifically at nodes.

Simulating the evolution of a site under the SSRV model now involves four steps: (1) randomly draw an ancestral rate at the root of the tree from the discretized Gamma distribution; (2) make this rate evolve along the branches of the tree according to process R, and record the rate assigned to each segment of the tree (i.e., between switching points); (3) randomly draw an ancestral nucleotide state at the root of the tree; and (4) make this state evolve along the branches of the tree according to process M and the local rate determined at step 2 (i.e., scaling the length of each segment according to its relative rate). Nucleotide process M is therefore compounded with (i.e., superimposed on) rate process R. R is time-reversible, making the compound process time-reversible if M is so. Note that sites are independent and identically distributed under this model: there is no "true covariation" here. The equal-rates (ER), among-site rate variation (ASRV), and site-specific rate variation (SSRV) models are graphically compared in figure 1 . SSRV reduces to ASRV when {nu} = 0 (no change of rate) and to ER when {nu} tends to infinity (permanent change of rate results in constant rate).



View larger version (16K):
[in this window]
[in a new window]
 
Fig. 1.—Distribution of rates across sites and lineages under three models of evolution. Each tree plot describes the distribution of rates across lineages for a particular site under the considered model. Three categories of rate are assumed, represented by different line thicknesses. Under the equal-rates (ER) model, all sites evolve at a constant, unique, moderate rate. Under the among-site rate variation (ASRV) model, each site has its own rate (low, moderate, or high), which is constant between lineages. Under the site-specific rate variation (SSRV) model, the rate of a site can switch between categories; a site has distinct rates in distinct lineages

 
In the formulation above, the rate of a site can be reassigned its current category when a switch occurs. This has no biological relevance: process R is identical to a process where switches would occur at rate {nu}' = {nu}·(g - 1)/g without self-reassignment. The chosen parameterization simplifies some of the calculations below.

Maximum-Likelihood Implementation
This section aims at computing the probability of a particular DNA sequence data set under the SSRV model given a tree topology, a set of branch lengths {{lambda}i}, a time-reversible nucleotide Markov process (or substitution matrix) M, a Gamma shape parameter {alpha}, and an SSRV rate {nu}.

Felsenstein (1981)Citation introduced likelihood calculation across trees under the ER model ({alpha} = {infty}, {nu} = 0). Since independent evolution of sites is assumed, the probability of a data set is the product of the probabilities of observing each site. The probability of a particular site is computed by conditioning over all possible nucleotide states at internal nodes of the tree. For example, the probability of site (AAG) under ER in the three-species tree of figure 1 (top left) is


(1)
where R is the nucleotide state at the root node, X0 and X1 belong to {A, C, G, T}, and Pr(X -> Y/{lambda}) is the probability of reaching state Y when evolving from state X along a branch of length {lambda} according to process M. These transition probabilities are derived from the theory of stochastic processes (e.g., see Yang 1995Citation ). With time-reversible M, the likelihood does not depend on the location of the root as soon as the probabilities of nucleotide states at the root are taken from the stationary distribution of M—the so-called pulley principle (Felsenstein 1981Citation ). Equation (1) can be generalized to any tree topology using a recurrent formula. Let y be a site;


(2)
where R is the nucleotide state at the root node, N is the state at any internal node, N1 and N2 are the states at the son nodes of N, and {lambda}1 and {lambda}2 are the lengths of branches leading from N to N1 and from N to N2, respectively. The summation for X1 and X2 is over {A, C, G, T}. Equations (2) and (3) show that the likelihood of a site under ER can be computed by a single pass on the tree, in time linear in the number of leaves and in the square of the number of states (four for nucleotide sequences).

I now extend this method to the above SSRV model. The probability of a site can be computed by conditioning on both states and rates at ancestral nodes. Equations (2) and (3) become


(4)
where rR, rN, rN1, and rN2 are the rates at nodes root, N, N1, and N2, and the summations for s and t are over the range {r1, ... , rg}. Equations (4) and (5) show that computing the likelihood under the SSRV model with four states (nucleotides) has the same complexity as computing the likelihood under equal rates with 4·g states. The time required is therefore g2 times as great as that required under the ER model, and g times as great as that required under the ASRV model (Yang 1994Citation ).

Now comes the problem of calculating transition probabilities in equation (5) , namely, the probability of evolving from rate ri to rate rj and from state X to state Y during length {lambda} under processes R and M. This can be done following the standards of stochastic process theory. The combined R x M process can be viewed as a single Markov process Q in which states are defined as (X, ri) pairs, where X is a nucleotide state and ri is a rate. There are 4·g possible states. Transitions between states occur at rate


(6)
The probability matrix P of final state (Y, rj) given initial state (X, ri) after evolution according to Q along branch length {lambda} is given by taking the exponent of matrix Q·{lambda} (Yang 1995Citation ):

(7)

This approach allows exact calculation of the likelihood. It is not optimal, however, for technical purposes, because equation (7) requires numerical diagonalization of 4·g xg matrix Q, which can be time-consuming. This numerical step, moreover, precludes analytical calculation of the derivatives of the likelihood with respect to parameters of the model, quite useful for likelihood maximization. This is especially problematic when a complex model of nucleotide change is used (e.g., see below). I now derive an approximate calculation of the transition probabilities that does not require any matrix diagonalization. This is achieved by writing


(8)
Equation (8) holds because rate changes do not depend on nucleotide states. The left-hand factor depends only on the rate process R. From stochastic process theory, it equals (1 - exp(-{nu}·{lambda}))/g if j != i, and exp(-{nu}·{lambda}) + (1 - exp(-{nu}·{lambda}))/g if j = i. The right-hand factor is the probability of evolving from nucleotide state X to nucleotide state Y along branch length {lambda} given that the initial rate was ri and the final rate is rj. This can be approximated by


(9)
where ij({lambda}, {nu}) is the mean of the relative rate across its evolution along a branch of length {lambda} at rate {nu} given the initial and final rates ri and rj. Equation (9) is an approximation because the nucleotide process is assumed to apply along the average net branch length, rather than to be integrated over the distribution of the net length (where the net length of a branch is the length obtained after every segment has been scaled according to its rate). The conditional average rate is


(10)
where a = 2·(1 - ri), b = 1 + (g - 2)·ri, and c = g if j = i, and where a = 2 - ri - rj, b = 1 - ri - rj, and c = 0 if j != i. The derivation of equation (10) is given in the appendix. The approximation was found to be quite good when applied to the data sets used in this study. The approximate transition probabilities can be differentiated analytically, allowing the use of efficient maximization algorithms.

Unequal SSRV Rates Among Sites
A constant rate of rate change {nu} is assumed in the above SSRV model, which might be unrealistic. It is likely that for many proteins or RNAs the rate of some sites remains more or less constant for long periods, while other sites switch more often. Some sites might remain critical for the function of the macromolecule and evolve at a slow rate in all of the branches of the tree as a consequence of strong purifying selection pressure. Some may escape any selective pressure and evolve at a (fast) neutral rate in the long run. Other sites, involved in episodic adaptive events, might recurrently switch between a slow and a fast rate, in agreement with the above-described SSRV process. I now generalize the SSRV model to account for this possibility. It is assumed that a proportion {pi} of the sites evolve according to SSRV (with SSRV rate {nu}), while the remaining sites evolve according to ASRV (with SSRV rate 0). This general model is called unequal site-specific rate variation (USSRV).

The probability of site y under USSRV is given by

(11)
where PrSSRV(y) is given by equations (4) and (5) and PrASRV(y) is obtained by setting {nu} = 0 in equations (4) and (5) (or see Yang 1994Citation ). The USSRV model acknowledges g + 1 categories of sites: sites with constant rate r1 (proportion [1 - {pi}]/g), sites with constant rate r2 (proportion [1 - {pi}]/g), ... , sites with constant rate rg (proportion [1 - {pi}]/g), and sites with variable rates (proportion {pi}). The likelihood of a site under USSRV is computed by averaging likelihoods conditional on that site belonging to every category. USSRV reduces to SSRV when {pi} = 1 and to ASRV when {pi} = 0.

The approximate likelihood is maximized using the Newton-Raphson method. Then, the exact calculation is performed using the near-optimal parameter values sought in the previous step. Finally, the exact likelihood is maximized using the downhill-simplex method. These algorithms are available as part of the NHML package (ftp://pbil.univ-lyon1.fr/pub/mol_phylogeny).


    Data Analysis
 TOP
 Abstract
 Introduction
 Methods
 Data Analysis
 Discussion
 Appendix
 Acknowledgements
 literature cited
 
Ribosomal RNA (rRNA) sequences are widely used for reconstructing ancient evolutionary events. Recently, an improved model of nucleotide substitutions was applied to data from Eukaryotes, Bacteria, and Archaea for the purpose of estimating ancestral rRNA base compositions (Galtier, Tourasse, and Gouy 1999Citation ). This model, however, did not allow for site-specific rate variation. In this section, I investigate the importance of SSRV effects in rRNA evolution. The relevance of the above-mentioned study is checked in the light of the SSRV model. Ribosomal RNA data are also used to examine the influence of the number of taxa on SSRV estimation.

Two data sets, each including 40 species (10 Archaea, 17 Bacteria or chloroplasts, and 13 Eukaryotes), were used: small-subunit (SSU) rRNA (695 unambiguously aligned, complete sites) and large-subunit (LSU) rRNA (1,409 sites). Applying a nonhomogeneous, nonstationary model of nucleotide substitution (Galtier and Gouy 1998Citation ) to these data, Galtier, Tourasse, and Gouy (1999)Citation estimated the G+C content of the most recent common ancestor (MRCA) to extant life forms. They found that the moderate estimated G+C content (56.1% ± 5% for SSU, 54.0% ± 2.5% for LSU) is not compatible with life at very high temperatures: high ribosomal G+C content is a necessary condition for survival of present-day species in hot environments (Galtier and Lobry 1997Citation ). This result therefore questions the hypothesis of a thermophilic common ancestor (e.g., see Woese 1987Citation ; Forterre 1996Citation ).

Galtier, Tourasse, and Gouy (1999)Citation assumed a Gamma distribution of rates among sites. I conducted the analysis again under the more general SSRV and USSRV models, allowing covarion-like effects. This involved combining the above piece of theory with Galtier and Gouy's (1998)Citation nonhomogeneous model of DNA evolution. The latter allows G+C content to vary in time and between lineages. The combined model accounts for unequal transition/transversion ratio, variable G+C content between lineages, variable rates among sites and, site-specific rate changes. The assumed Gamma distribution was discretized in g = 4 equally probable classes of rates. Table 1 displays the log likelihoods and the details of parameter estimates for four models of rate variation, namely, ER, ASRV, SSRV, and USSRV (assumed tree topology: fig. 1 of Galtier, Tourasse, and Gouy 1999Citation ).


View this table:
[in this window]
[in a new window]
 
Table 1 Likelihood Analysis of Two 40-Species rRNA Data Sets Under Four Models of Site-Specific Rate Distribution

 
This analysis suggests that site-specific changes of evolutionary rates (i.e., covarion-like evolution) is a major feature of rRNA evolution. Allowing site-specific rate changes resulted in a vast increase in log likelihood (ln LSSRV - ln LASRV = 105.1 for SSU, 230.2 for LSU). Allowing unequal {nu} among sites (USSRV) also significantly improved the fit (23.6 more log likelihood units for SSU, 37.7 for LSU). The difference in log likelihood was highly significant according to likelihood ratio tests. The estimated proportion {pi} of sites undergoing rate changes under USSRV was remarkably high. Although the increase of log likelihood was lower than than between ER and ASRV, accounting for site-specific changes of rates seemed to significantly improve our representation of rRNA evolution.

Allowing covarion-like evolution made a difference with respect to the estimation of parameters of the evolutionary model. Although the general shape of the tree was unchanged (data not shown), branches were slightly longer under (U)SSRV than under ASRV (total tree length was increased by roughly 10%). This suggests that some saturation might be overlooked when site-specific rate variation exists and is not taken into account. The transition/transversion ratio {kappa} was also higher when estimated under (U)SSRV. Again, this is reminiscent of the previously reported (and confirmed here) underestimate of {kappa} when among-site rate variation is not accounted for (e.g., Wakeley 1996Citation ). Underestimating {kappa} is a consequence of overlooking multiple substitutions, since transitions saturate more quickly than transversions. The shape parameter {alpha} of the Gamma distribution is the most sensitive to covarion-like effects. It is dramatically decreased when site-specific rate variation is allowed—remember that a decrease in {alpha} means a higher variance of rates across categories. This is presumably because the mean evolutionary rate of a rate-changing site is not extreme: fast and slow periods average in the long run. These sites are "seen" as moderately fast when considered from the point of view of ASRV, while they are actually a mixture of slow and fast rates. The variance of the overall distribution of rates is therefore underestimated.

The use of a nonhomogeneous, nonstationary model of evolution allows one to estimate ancestral base compositions. The SSRV and USSRV estimates of the MRCA's rRNA G+C content are very close to (and even slightly lower than) the ASRV estimate. Galtier, Tourasse, and Gouy's (1999)Citation result is therefore confirmed when site-specific rate variation is taken into account. Their claim of a nonhyperthermophilic common ancestor did not result from a biased analysis—as far as covarion-like effects are concerned.

LSU rRNA data were used to assess the sensitivity of SSRV detection to the number of analyzed sequences. Fifteen subsets of sequences including 20, 10, or 5 species (5 of each category) were randomly drawn from the total of 40 sequences, making sure that four domains (namely, Eukaryotes, Bacteria, Euryarchaea, and Crenarchaea) were represented in proportions roughly identical to those of the total data set. Each subset was analyzed using ASRV, SSRV, and USSRV. Results are shown in table 2 (the approximate likelihood calculation was used here). For all three models, the estimated variance of rate across sites was decreased (parameter {alpha} increases) when the number of species decreased, as previously reported (e.g., Tourasse and Gouy 1997Citation ). A similar effect was found for parameters {nu} and {pi}. Twenty-species data sets were quite consistent with the total 40-species data set but 10- and 5-species data sets contained little information with respect to site-specific rate variation, as indicated by the small difference in log likelihood between models and the high variance of parameter estimates. With 10 species, a significant increase in log likelihood was found when moving from the ASRV to the SSRV model, but not when moving from SSRV to USSRV (excepting one data set out of five). With five species, no (U)SSRV effect was detected. This again underlines the importance of species sampling in molecular phylogeny and evolution studies. There is little hope of detecting any SSRV effect using fewer than 20 or 30 sequences.


View this table:
[in this window]
[in a new window]
 
Table 2 Influence of the Number of Analyzed Sequences on Site-Specific Rate Variation Detection and Parameter Estimates

 

    Discussion
 TOP
 Abstract
 Introduction
 Methods
 Data Analysis
 Discussion
 Appendix
 Acknowledgements
 literature cited
 
The SSRV and USSRV models introduced in this paper are more general than Fitch and Markowitz's (1970)Citation original description of the so-called "covarion" process, recently formalized by Tuffley and Steel (1998)Citation . In the original model, sites can be either "on" or "off": there are two classes of rates, one of which is rate 0. A common rate of switch between the two categories is assumed for all sites. The proposed SSRV model allows an arbitrary number of classes but involves the additional assumption of discrete-Gamma distribution (i.e., constraints on the relative rates of distinct categories). The advantages of (discrete-)Gamma versus discrete rate class models are discussed by Yang (1996)Citation in the context of among-site rate variation. Discrete-Gamma models provide a good fit to many data sets at the cost of few, easily interpretable parameters. Yang's arguments presumably also apply to SSRV. The USSRV model is original in allowing a proportion of sites not to experience rate changes, relaxing a dubious assumption of Tuffley and Steel's (1998)Citation and SSRV models. It is quite unlikely that every site of a molecule undergoes rate changes at a common rate.

A maximum-likelihood implementation was achieved by making use of the properties of the compound process of rate and nucleotide changes. A desirable property of the SSRV and USSRV models in the likelihood framework is their generalization of the widely used ER and ASRV models. (U)SSRV reduces to ASRV when {nu} = 0 and/or {pi} = 0, and to ER when {alpha} or {nu} tends to infinity. This means that the parameters are easily interpretable. They directly measure the importance of site-specific rate variation and can be compared between data sets. Furthermore, the nested relationship between the four models allows relevant comparisons of likelihoods and election of the most appropriate model through likelihood ratio tests.

The new models revealed a significant amount of site-specific rate variation when applied to ribosomal RNA data. This analysis suggested that neglecting SSRV when it exists has at least two important consequences. First, the variance of the distribution of rates among sites is underestimated (Gamma shape parameter overestimated). Second, correction of multiple substitutions is less efficient (total tree length and transition/transversion ratio underestimated). The two effects presumably result from a unique cause: highly switching sites have a moderate average rate in the long run. Said simply, these sites are considered moderately fast when analyzed under ASRV, so the number of multiple substitutions they experience during fast-rate episodes is underestimated. The ability of (U)SSRV models to detect some saturation that is hidden to ASRV suggests that these models might improve phylogenetic reconstructions. Lockhart et al. (1998)Citation feel the same: they argue that the internal edge that separates plastid and cyanobacterial 16S rRNA sequences is mainly the consequence of overlooked covarion effects. The large number of taxa required to properly account for SSRV effects and the resulting extensive running time preclude any attempt to search the tree space with reasonable efficiency, however. If these models have to be used for phylogenetic purposes, it should be in the context of evaluating a small number of competing topologies previously sought using faster algorithms.

Another promising application field is the study of protein adaptation. An important and popular goal of molecular evolution is the detection of positive selection at the sequence level. Classically, this was achieved by comparing nonsynonymous (Ka) and synonymous (Ks) rates of evolution (e.g., Hughes and Nei 1988Citation ). Positive selection was invoked when Ka was higher than Ks, which was found for a very small fraction of proteins (Endo, Ikeo, and Gojobori 1996Citation ). This approach, however, is limited by averaging of nonsynonymous and synonymous rates over all sites. It would not detect positive selection acting on a few sites. Yang et al. (2000)Citation improved this strategy by applying the ASRV model, therefore separating rather than averaging fast and slow nonsynonymous rates across sites. Yang et al. (2000)Citation found that a larger number of proteins than expected included sites evolving according to a positive-selection-like process. Following their comment, I argue that the importance of positive selection might still be underestimated when data are analyzed under ASRV. This is because the nonsynonymous/synonymous rate of each site is averaged over the whole tree. ASRV cannot detect short episodes of positive selection involving sites which are constrained (i.e., slow) in other parts of the tree. It is quite likely, however, that protein adaptation involves short adaptive episodes, followed by stasis when a new local optimum of fitness has been reached. Yang, Swanson, and Vacquier (2000)Citation dealt with this problem by allowing a distinct Ka/Ks ratio (their {omega} parameter) in each branch of the tree. This was done at the cost of a large number of additional parameters and of again averaging Ka/Ks over sites.

SSRV models might be the right approach to account for episodic evolution of proteins. Both lineage and site effects are automatically separated at the cost of only one or two parameters. Perspectives of this work therefore include generalization to codon-based models of evolution and the use of data analysis tools for characterizing those sites and lineages involved in positive selection in the context of (U)SSRV models.


    Appendix
 TOP
 Abstract
 Introduction
 Methods
 Data Analysis
 Discussion
 Appendix
 Acknowledgements
 literature cited
 
Derivation of the Average Relative Rate ij Along a Branch of Length {lambda} Conditional on Initial Rate ri and Final Rate rj
With rate {nu}, the relative rate changes from its current category to any of the g possible categories with equal probabilities. The mean rate along a pathway of length {lambda} starting from ri and ending with rj is computed by conditioning on the number of changes C:


(A1)
where rij(C = k) is the mean relative rate given initial and final rates ri and rj, respectively, and given that k changes occurred. First, consider the i != j case. Since changes are equiprobable, the mean rate conditional on k > 0 changes is


(A2)
Equation (A2) holds because k changes cut the branch into k + 1 intervals, the first and last of which have rates ri and rj, respectively, with the remaining k - 1 having average rate 1 since they are randomly sampled from a distribution of mean 1 (remember that reassignment of current rate is allowed).

The probability that k changes occurred given ri and rj != ri is 0 for k = 0 and


(A3)
for k != 0. These probabilities can be computed by noting that (1) the probability of k changes irrespective of initial and final states is ({lambda}·{nu})k·exp(-{lambda}·{nu})/k! (changes occur according to a Poisson process with rate {lambda}·{nu}), and (2) all g possible final states are equiprobable given that at least one change occurred. The numerator and denominator in equation (A3) are therefore


(A4)
The ratio (A4)/(A5) simplifies to


(A6)
Substituting equations (A2) and (A6) into equation (A1) and summing (noting that the term corresponding to k = 0 in equation (A1) is 0 when i != j) results in equation (10) . Similar reasoning can be used when j = i.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Methods
 Data Analysis
 Discussion
 Appendix
 Acknowledgements
 literature cited
 
Many thanks to Olivier Gascuel and Ziheng Yang for helpful comments and suggestions. This work was supported by the Génopole Montpellier-Perpignan.


    Footnotes
 
Dan Graur, Reviewing Editor

3 Abbreviations: ASRV, among-site rate variation; SSRV, site-specific rate variation; USSRV, unequal site-specific rate variation. Back

1 Keywords: covarion Markov model thermophyly LUCA positive selection Back

2 Address for correspondence and reprints: Centre National de la Recherche Scientifique UMR 5000—Génome, Populations, Interactions, Université Montpellier 2, Place E. Bataillon, 34095 Montpellier cedex, France. galtier{at}crit1.univ-montp2.fr Back


    literature cited
 TOP
 Abstract
 Introduction
 Methods
 Data Analysis
 Discussion
 Appendix
 Acknowledgements
 literature cited
 

    Endo, T., K. Ikeo, and T. Gojobori. 1996. Large-scale search for genes on which positive selection may operate. Mol. Biol. Evol. 5:685–690.

    Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.[ISI][Medline]

    Fitch, W. M. 1971. Rate of change of concomitantly variable codons. J. Mol. Evol. 1:84–96.[Medline]

    Fitch, W. M., and E. Markowitz. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579–593.[ISI][Medline]

    Forterre, P. 1996. A hot topic: the origin of hyperthermophiles. Cell 85:789–792.

    Galtier, N., and M. Gouy. 1998. Inferring pattern and process: maximum likelihood implementation of a non-homogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871–879.[Abstract]

    Galtier, N., and J. Lobry. 1997. Relationships between genomic G+C content, RNA secondary structures and optimal growth temperature in prokaryotes. J. Mol. Evol. 44:632–636.[ISI][Medline]

    Galtier, N., N. J. Tourasse, and M. Gouy. 1999. A non-hyperthermophilic ancestor to extant life forms. Science 283:220–221.

    Germot, A., and H. Philippe. 1999. Critical analysis of eukaryotic phylogeny: a case study based on the HSP70 family. J. Eukaryot. Microbiol. 46:116–124.[ISI][Medline]

    Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736.[Abstract/Free Full Text]

    Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.[ISI][Medline]

    Hughes, A. L., and M. Nei. 1988. Nucleotide substitution at major histocompatibility complex loci reveals overdominant selection. Nature 335:167–170.

    Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.

    Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.[ISI][Medline]

    Lockhart, P. J., D. H. Huson, U. Maier, M. J. Fraunholz, Y. Van de Peer, A. C. Barbrook, C. J. Howe, and M. A. Steel. 2000. How molecules evolves in eubacteria. Mol. Biol. Evol. 17:835–838.[Free Full Text]

    Lockhart, P. J., M. A. Steel, A. C. Barbrook, D. H. Huson, M. A. Charleston, and C. J. Howe. 1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol. 15:1183–1188.[Abstract]

    Lopez, P., P. Forterre, and H. Philippe. 1999. The root of the tree of life in the light of the covarion model. J. Mol. Evol. 49:496–508.[ISI][Medline]

    Philippe, H., P. Lopez, H. Brinkman, K. Budin, A. Germot, J. Laurent, D. Moreira, M. Müller, and H. Le Guyader. 2000. Early branching or fast evolving eukaryotes? An answer based on slowly evolving positions. Proc. R. Soc. Lond. B Biol. Sci. 267:1213–1221.[ISI][Medline]

    Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. 9:678–687.[Abstract]

    Tourasse, N. J., and M. Gouy. 1997. Evolutionary distances between nucleotide sequences based on the distribution of substitution rates among sites as estimated by parsimony. Mol. Biol. Evol. 14:287–298.[Abstract]

    Tuffley, C., and M. A. Steel. 1998. Modelling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147:63–91.[ISI][Medline]

    Wakeley, J. 1996. The excess of transitions among nucleotide substitutions: new methods of estimating transition bias underscore its significance. Trends Ecol. Evol. 11:158–163.[ISI]

    Woese, C. R. 1987. Bacterial evolution. Microbiol. Rev. 51:221–271.[ISI]

    Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396–1401.[Abstract]

    ———. 1994. Maximum-likelihood phylogenetic estimation of from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306–314.[ISI][Medline]

    ———. 1995. On the general reversible Markov process model of nucleotide substitution: a reply to Saccone et al. J. Mol. Evol. 41:254–255.[ISI]

    ———. 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367–372.[ISI]

    Yang, Z., R. Nielsen, N. Goldman, and A.-M. K. Pedersen. 2000. Codon-substitution-models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449.

    Yang, Z., W. J. Swanson, and V. D. Vacquier. 2000. Maximum likelihood analysis of molecular adaptation in abalone sperm lysin reveals variable selective pressures among lineages and sites. Mol. Biol. Evol. 17:1446–1455.[Abstract/Free Full Text]

Accepted for publication January 11, 2001.