Centre National de la Recherche Scientifique UMR 5000Génome, Populations, Interactions, Université Montpellier 2, Montpellier, France
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
This picture, however, is an instantaneous one. It is quite likely that the selective constraints applying to a particular site also vary in time and between lineages. As far as long periods of time are concerned, critical sites with respect to the function of a macromolecule may change, making the evolutionary rate of a particular site variable across the phylogeny. This notion was introduced as early as 30 years ago by Fitch and Markowitz (1970)
and Fitch (1971)
and was called the "covarion" process. The terminology comes from the idea that the rate of a site might be modified by a substitution arising at a distinct site of the molecule, with which it therefore covaries. Site-specific rate variation, however, can as well be caused by external factors such as environmental changes. In this paper, I refer to "true covarions" when two (several) sites of a sequence are undergoing nonindependent evolution, and I use the term "site-specific rate variation" (SSRV) or "covarion-like evolution" in the more general case.
There are at least two good reasons for being interested in modeling SSRV. First, it might improve phylogenetic reconstructions. Philippe and colleagues have shown that SSRV is common in genes widely used for the recovery of deep phylogenies and have and suggested that it induces tree-building biases (Germot and Philippe 1999
; Lopez, Forterre, and Philippe 1999
; Philippe et al. 2000
). Second, modeling SSRV should help to obtain an understanding of the way natural selection applies at the molecular level. In particular, episodic adaptive evolution presumably proceeds with frequent changes of rates at various sites.
Little theoretical work was achieved after Fitch's early reports about SSRV until Tuffley and Steel (1998)
proposed an explicit Markov model that resulted in a distance-based test for detecting SSRV effects (Lockhart et al. 1998, 2000
). In this paper, I introduce a more general Markov model of site-specific rate variation and devise a maximum-likelihood implementation of this model. This method is applied to ribosomal RNA sequences to assess the amount of covarion-like evolution for this kind of data and to check the robustness of previously reported inferences about the early evolution of life.
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
I now generalize Yang's model by allowing site-specific rates to vary in time. It is assumed that the rate of a particular site can switch from one category to another according to a continuous Markov process R. With rate , the current evolutionary rate moves to a new category, obtained by randomly drawing from the distribution of ri's. Switches between distant categories of rates are therefore assumed to be as probable as short-range changes. The parameter
could be called the "rate variation rate." It determines the amount of site-specific rate variation. It is assumed constant over sites and in time. Note that the process R of rate change is continuous: switches can occur anywhere in the tree, not specifically at nodes.
Simulating the evolution of a site under the SSRV model now involves four steps: (1) randomly draw an ancestral rate at the root of the tree from the discretized Gamma distribution; (2) make this rate evolve along the branches of the tree according to process R, and record the rate assigned to each segment of the tree (i.e., between switching points); (3) randomly draw an ancestral nucleotide state at the root of the tree; and (4) make this state evolve along the branches of the tree according to process M and the local rate determined at step 2 (i.e., scaling the length of each segment according to its relative rate). Nucleotide process M is therefore compounded with (i.e., superimposed on) rate process R. R is time-reversible, making the compound process time-reversible if M is so. Note that sites are independent and identically distributed under this model: there is no "true covariation" here. The equal-rates (ER), among-site rate variation (ASRV), and site-specific rate variation (SSRV) models are graphically compared in figure 1
. SSRV reduces to ASRV when = 0 (no change of rate) and to ER when
tends to infinity (permanent change of rate results in constant rate).
|
Maximum-Likelihood Implementation
This section aims at computing the probability of a particular DNA sequence data set under the SSRV model given a tree topology, a set of branch lengths {i}, a time-reversible nucleotide Markov process (or substitution matrix) M, a Gamma shape parameter
, and an SSRV rate
.
Felsenstein (1981)
introduced likelihood calculation across trees under the ER model (
=
,
= 0). Since independent evolution of sites is assumed, the probability of a data set is the product of the probabilities of observing each site. The probability of a particular site is computed by conditioning over all possible nucleotide states at internal nodes of the tree. For example, the probability of site (AAG) under ER in the three-species tree of figure 1
(top left) is
| (1) |
| (2) |
I now extend this method to the above SSRV model. The probability of a site can be computed by conditioning on both states and rates at ancestral nodes. Equations (2) and (3)
become
| (4) |
Now comes the problem of calculating transition probabilities in equation (5)
, namely, the probability of evolving from rate ri to rate rj and from state X to state Y during length under processes R and M. This can be done following the standards of stochastic process theory. The combined R x M process can be viewed as a single Markov process Q in which states are defined as (X, ri) pairs, where X is a nucleotide state and ri is a rate. There are 4·g possible states. Transitions between states occur at rate
| (6) |
![]() | (7) |
This approach allows exact calculation of the likelihood. It is not optimal, however, for technical purposes, because equation (7) requires numerical diagonalization of 4·g x 4·g matrix Q, which can be time-consuming. This numerical step, moreover, precludes analytical calculation of the derivatives of the likelihood with respect to parameters of the model, quite useful for likelihood maximization. This is especially problematic when a complex model of nucleotide change is used (e.g., see below). I now derive an approximate calculation of the transition probabilities that does not require any matrix diagonalization. This is achieved by writing
| (8) |
| (9) |
| (10) |
Unequal SSRV Rates Among Sites
A constant rate of rate change is assumed in the above SSRV model, which might be unrealistic. It is likely that for many proteins or RNAs the rate of some sites remains more or less constant for long periods, while other sites switch more often. Some sites might remain critical for the function of the macromolecule and evolve at a slow rate in all of the branches of the tree as a consequence of strong purifying selection pressure. Some may escape any selective pressure and evolve at a (fast) neutral rate in the long run. Other sites, involved in episodic adaptive events, might recurrently switch between a slow and a fast rate, in agreement with the above-described SSRV process. I now generalize the SSRV model to account for this possibility. It is assumed that a proportion
of the sites evolve according to SSRV (with SSRV rate
), while the remaining sites evolve according to ASRV (with SSRV rate 0). This general model is called unequal site-specific rate variation (USSRV).
The probability of site y under USSRV is given by
![]() | (11) |
The approximate likelihood is maximized using the Newton-Raphson method. Then, the exact calculation is performed using the near-optimal parameter values sought in the previous step. Finally, the exact likelihood is maximized using the downhill-simplex method. These algorithms are available as part of the NHML package (ftp://pbil.univ-lyon1.fr/pub/mol_phylogeny).
![]() |
Data Analysis |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Two data sets, each including 40 species (10 Archaea, 17 Bacteria or chloroplasts, and 13 Eukaryotes), were used: small-subunit (SSU) rRNA (695 unambiguously aligned, complete sites) and large-subunit (LSU) rRNA (1,409 sites). Applying a nonhomogeneous, nonstationary model of nucleotide substitution (Galtier and Gouy 1998
) to these data, Galtier, Tourasse, and Gouy (1999)
estimated the G+C content of the most recent common ancestor (MRCA) to extant life forms. They found that the moderate estimated G+C content (56.1% ± 5% for SSU, 54.0% ± 2.5% for LSU) is not compatible with life at very high temperatures: high ribosomal G+C content is a necessary condition for survival of present-day species in hot environments (Galtier and Lobry 1997
). This result therefore questions the hypothesis of a thermophilic common ancestor (e.g., see Woese 1987
; Forterre 1996
).
Galtier, Tourasse, and Gouy (1999)
assumed a Gamma distribution of rates among sites. I conducted the analysis again under the more general SSRV and USSRV models, allowing covarion-like effects. This involved combining the above piece of theory with Galtier and Gouy's (1998)
nonhomogeneous model of DNA evolution. The latter allows G+C content to vary in time and between lineages. The combined model accounts for unequal transition/transversion ratio, variable G+C content between lineages, variable rates among sites and, site-specific rate changes. The assumed Gamma distribution was discretized in g = 4 equally probable classes of rates. Table 1
displays the log likelihoods and the details of parameter estimates for four models of rate variation, namely, ER, ASRV, SSRV, and USSRV (assumed tree topology: fig. 1 of Galtier, Tourasse, and Gouy 1999
).
|
Allowing covarion-like evolution made a difference with respect to the estimation of parameters of the evolutionary model. Although the general shape of the tree was unchanged (data not shown), branches were slightly longer under (U)SSRV than under ASRV (total tree length was increased by roughly 10%). This suggests that some saturation might be overlooked when site-specific rate variation exists and is not taken into account. The transition/transversion ratio was also higher when estimated under (U)SSRV. Again, this is reminiscent of the previously reported (and confirmed here) underestimate of
when among-site rate variation is not accounted for (e.g., Wakeley 1996
). Underestimating
is a consequence of overlooking multiple substitutions, since transitions saturate more quickly than transversions. The shape parameter
of the Gamma distribution is the most sensitive to covarion-like effects. It is dramatically decreased when site-specific rate variation is allowedremember that a decrease in
means a higher variance of rates across categories. This is presumably because the mean evolutionary rate of a rate-changing site is not extreme: fast and slow periods average in the long run. These sites are "seen" as moderately fast when considered from the point of view of ASRV, while they are actually a mixture of slow and fast rates. The variance of the overall distribution of rates is therefore underestimated.
The use of a nonhomogeneous, nonstationary model of evolution allows one to estimate ancestral base compositions. The SSRV and USSRV estimates of the MRCA's rRNA G+C content are very close to (and even slightly lower than) the ASRV estimate. Galtier, Tourasse, and Gouy's (1999)
result is therefore confirmed when site-specific rate variation is taken into account. Their claim of a nonhyperthermophilic common ancestor did not result from a biased analysisas far as covarion-like effects are concerned.
LSU rRNA data were used to assess the sensitivity of SSRV detection to the number of analyzed sequences. Fifteen subsets of sequences including 20, 10, or 5 species (5 of each category) were randomly drawn from the total of 40 sequences, making sure that four domains (namely, Eukaryotes, Bacteria, Euryarchaea, and Crenarchaea) were represented in proportions roughly identical to those of the total data set. Each subset was analyzed using ASRV, SSRV, and USSRV. Results are shown in table 2
(the approximate likelihood calculation was used here). For all three models, the estimated variance of rate across sites was decreased (parameter increases) when the number of species decreased, as previously reported (e.g., Tourasse and Gouy 1997
). A similar effect was found for parameters
and
. Twenty-species data sets were quite consistent with the total 40-species data set but 10- and 5-species data sets contained little information with respect to site-specific rate variation, as indicated by the small difference in log likelihood between models and the high variance of parameter estimates. With 10 species, a significant increase in log likelihood was found when moving from the ASRV to the SSRV model, but not when moving from SSRV to USSRV (excepting one data set out of five). With five species, no (U)SSRV effect was detected. This again underlines the importance of species sampling in molecular phylogeny and evolution studies. There is little hope of detecting any SSRV effect using fewer than 20 or 30 sequences.
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
A maximum-likelihood implementation was achieved by making use of the properties of the compound process of rate and nucleotide changes. A desirable property of the SSRV and USSRV models in the likelihood framework is their generalization of the widely used ER and ASRV models. (U)SSRV reduces to ASRV when = 0 and/or
= 0, and to ER when
or
tends to infinity. This means that the parameters are easily interpretable. They directly measure the importance of site-specific rate variation and can be compared between data sets. Furthermore, the nested relationship between the four models allows relevant comparisons of likelihoods and election of the most appropriate model through likelihood ratio tests.
The new models revealed a significant amount of site-specific rate variation when applied to ribosomal RNA data. This analysis suggested that neglecting SSRV when it exists has at least two important consequences. First, the variance of the distribution of rates among sites is underestimated (Gamma shape parameter overestimated). Second, correction of multiple substitutions is less efficient (total tree length and transition/transversion ratio underestimated). The two effects presumably result from a unique cause: highly switching sites have a moderate average rate in the long run. Said simply, these sites are considered moderately fast when analyzed under ASRV, so the number of multiple substitutions they experience during fast-rate episodes is underestimated. The ability of (U)SSRV models to detect some saturation that is hidden to ASRV suggests that these models might improve phylogenetic reconstructions. Lockhart et al. (1998)
feel the same: they argue that the internal edge that separates plastid and cyanobacterial 16S rRNA sequences is mainly the consequence of overlooked covarion effects. The large number of taxa required to properly account for SSRV effects and the resulting extensive running time preclude any attempt to search the tree space with reasonable efficiency, however. If these models have to be used for phylogenetic purposes, it should be in the context of evaluating a small number of competing topologies previously sought using faster algorithms.
Another promising application field is the study of protein adaptation. An important and popular goal of molecular evolution is the detection of positive selection at the sequence level. Classically, this was achieved by comparing nonsynonymous (Ka) and synonymous (Ks) rates of evolution (e.g., Hughes and Nei 1988
). Positive selection was invoked when Ka was higher than Ks, which was found for a very small fraction of proteins (Endo, Ikeo, and Gojobori 1996
). This approach, however, is limited by averaging of nonsynonymous and synonymous rates over all sites. It would not detect positive selection acting on a few sites. Yang et al. (2000)
improved this strategy by applying the ASRV model, therefore separating rather than averaging fast and slow nonsynonymous rates across sites. Yang et al. (2000)
found that a larger number of proteins than expected included sites evolving according to a positive-selection-like process. Following their comment, I argue that the importance of positive selection might still be underestimated when data are analyzed under ASRV. This is because the nonsynonymous/synonymous rate of each site is averaged over the whole tree. ASRV cannot detect short episodes of positive selection involving sites which are constrained (i.e., slow) in other parts of the tree. It is quite likely, however, that protein adaptation involves short adaptive episodes, followed by stasis when a new local optimum of fitness has been reached. Yang, Swanson, and Vacquier (2000)
dealt with this problem by allowing a distinct Ka/Ks ratio (their
parameter) in each branch of the tree. This was done at the cost of a large number of additional parameters and of again averaging Ka/Ks over sites.
SSRV models might be the right approach to account for episodic evolution of proteins. Both lineage and site effects are automatically separated at the cost of only one or two parameters. Perspectives of this work therefore include generalization to codon-based models of evolution and the use of data analysis tools for characterizing those sites and lineages involved in positive selection in the context of (U)SSRV models.
![]() |
Appendix |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
| (A1) |
| (A2) |
The probability that k changes occurred given ri and rj ri is 0 for k = 0 and
| (A3) |
| (A4) |
| (A6) |
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
3 Abbreviations: ASRV, among-site rate variation; SSRV, site-specific rate variation; USSRV, unequal site-specific rate variation.
1 Keywords: covarion
Markov model
thermophyly
LUCA
positive selection
2 Address for correspondence and reprints: Centre National de la Recherche Scientifique UMR 5000Génome, Populations, Interactions, Université Montpellier 2, Place E. Bataillon, 34095 Montpellier cedex, France. galtier{at}crit1.univ-montp2.fr
![]() |
literature cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Endo, T., K. Ikeo, and T. Gojobori. 1996. Large-scale search for genes on which positive selection may operate. Mol. Biol. Evol. 5:685690.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368376.[ISI][Medline]
Fitch, W. M. 1971. Rate of change of concomitantly variable codons. J. Mol. Evol. 1:8496.[Medline]
Fitch, W. M., and E. Markowitz. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579593.[ISI][Medline]
Forterre, P. 1996. A hot topic: the origin of hyperthermophiles. Cell 85:789792.
Galtier, N., and M. Gouy. 1998. Inferring pattern and process: maximum likelihood implementation of a non-homogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871879.[Abstract]
Galtier, N., and J. Lobry. 1997. Relationships between genomic G+C content, RNA secondary structures and optimal growth temperature in prokaryotes. J. Mol. Evol. 44:632636.[ISI][Medline]
Galtier, N., N. J. Tourasse, and M. Gouy. 1999. A non-hyperthermophilic ancestor to extant life forms. Science 283:220221.
Germot, A., and H. Philippe. 1999. Critical analysis of eukaryotic phylogeny: a case study based on the HSP70 family. J. Eukaryot. Microbiol. 46:116124.[ISI][Medline]
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725736.
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160174.[ISI][Medline]
Hughes, A. L., and M. Nei. 1988. Nucleotide substitution at major histocompatibility complex loci reveals overdominant selection. Nature 335:167170.
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111120.[ISI][Medline]
Lockhart, P. J., D. H. Huson, U. Maier, M. J. Fraunholz, Y. Van de Peer, A. C. Barbrook, C. J. Howe, and M. A. Steel. 2000. How molecules evolves in eubacteria. Mol. Biol. Evol. 17:835838.
Lockhart, P. J., M. A. Steel, A. C. Barbrook, D. H. Huson, M. A. Charleston, and C. J. Howe. 1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol. 15:11831188.[Abstract]
Lopez, P., P. Forterre, and H. Philippe. 1999. The root of the tree of life in the light of the covarion model. J. Mol. Evol. 49:496508.[ISI][Medline]
Philippe, H., P. Lopez, H. Brinkman, K. Budin, A. Germot, J. Laurent, D. Moreira, M. Müller, and H. Le Guyader. 2000. Early branching or fast evolving eukaryotes? An answer based on slowly evolving positions. Proc. R. Soc. Lond. B Biol. Sci. 267:12131221.[ISI][Medline]
Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. 9:678687.[Abstract]
Tourasse, N. J., and M. Gouy. 1997. Evolutionary distances between nucleotide sequences based on the distribution of substitution rates among sites as estimated by parsimony. Mol. Biol. Evol. 14:287298.[Abstract]
Tuffley, C., and M. A. Steel. 1998. Modelling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147:6391.[ISI][Medline]
Wakeley, J. 1996. The excess of transitions among nucleotide substitutions: new methods of estimating transition bias underscore its significance. Trends Ecol. Evol. 11:158163.[ISI]
Woese, C. R. 1987. Bacterial evolution. Microbiol. Rev. 51:221271.[ISI]
Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:13961401.[Abstract]
. 1994. Maximum-likelihood phylogenetic estimation of from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306314.[ISI][Medline]
. 1995. On the general reversible Markov process model of nucleotide substitution: a reply to Saccone et al. J. Mol. Evol. 41:254255.[ISI]
. 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367372.[ISI]
Yang, Z., R. Nielsen, N. Goldman, and A.-M. K. Pedersen. 2000. Codon-substitution-models for heterogeneous selection pressure at amino acid sites. Genetics 155:431449.
Yang, Z., W. J. Swanson, and V. D. Vacquier. 2000. Maximum likelihood analysis of molecular adaptation in abalone sperm lysin reveals variable selective pressures among lineages and sites. Mol. Biol. Evol. 17:14461455.