Department of Biology, University of Rochester
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
A prediction of the concomitantly variable codons, or covarion, hypothesis of substitution (Fitch and Markowitz 1970
) is that the rate of substitution should vary over the evolutionary history of a group. The argument is that as mutations are fixed at some sites in a gene, the functional constraints at other positions may change. Hence, the rate at a site should vary over the evolutionary history of a group of organisms if the covarion model of evolution is correct. Attempts to test this model have looked at the variability of specific positions in an amino acid or DNA sequence across different clades of organisms (Lockhart et al. 1998
); the covarion hypothesis is supported if a site is found to be variable in some clades but not in others. Application of this test to eubacterial tufA gene sequences suggested that a covariotide model does explain the data better than a model with constant rates on the tree (Lockhart et al. 1998
). Application of a covariotide model to two ribosomal sequences data sets also supported a covarion-like model of substitution (Galtier 2001)
.
This paper takes advantage of a formulation of the covarion model first proposed by Tuffley and Steel (1998)
that allows the likelihood for an alignment of DNA sequences to be calculated. I test a covariotide model of DNA substitution on 11 genes using a likelihood ratio test. A covariotide model provides a significant improvement in most of the genes examined.
The method used in this study is very similar to that used in a recent paper by Galtier (2001)
. Galtier (2001)
extends the gamma model of Yang (1993, 1994)
that allows the rate at a site to vary across a sequence. Specifically, he allows the rate at a site to change over evolutionary time using the same type of framework as Tuffley and Steel (1998)
, estimating the rate of switching (and other parameters) using maximum likelihood. Moreover, he allows the rate of switching to vary across a sequence. In this paper, I directly implement the model of Tuffley and Steel (1998)
and a variant thereof that allows for site-specific or gamma-distributed rate variation.
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Phylogenetic Model
The species are assumed to be related by an unrooted bifurcating tree, i. Each possible tree is numbered 1, ..., B(s), where B(s) = (2s - 5)!/[2s-3(s - 3)!] is the number of possible unrooted binary trees for s species. Figure 1
shows an example of an arbitrarily chosen tree of s = 8 sequences. The tips of the tree are labeled 1 to s and the internal nodes are labeled s + 1 to 2s - 2. The internal nodes are labeled consecutively according to a postorder traversal of the tree (i.e., from the tips of the tree to the root). Although the tree is unrooted, for convenience it can be drawn to be rooted at one of the tip sequences. In this paper, the root of the tree is at sequence s. Each phylogenetic tree has 2s - 3 branches. The lengths of the branches are in terms of expected number of substitutions per site and the collection of branch lengths for the ith tree is designated vi = (v1, v2, ..., v2s-3).
|
|
|
Together, the switching and substitution processes form a single time-reversible Markov process with instantaneous rate matrix
|
As an example, consider a covarion model for the Jukes-Cantor (1969)
model of DNA substitution. The Jukes-Cantor (1969)
model of DNA substitution assumes constant rates among substitution types and equal stationary frequencies for the nucleotides. The instantaneous rates of change for the Jukes-Cantor covarion model are then
|
The implementation of the covariotide model used in this study is a generalization of the proportion of invariable sites model. The proportion of invariable sites model assumes that some fraction, p, of sites are unable to vary. The proportion of invariable sites model can be obtained from the covariotide model by allowing s01 and s10 to approach 0 with the constraint that (s10)/(s01 + s10) = p.
The probability of observing the data at the ith site involves a summation over all possible ancestral states that can be assigned to the internal nodes on the tree. However, the possible states for the covariotide model are not simply A, C, G, and T, as they normally are for substitution models typically used in phylogenetic analysis, but rather A0, C0, G0, T0, A1, C1, G1, and T1; every nucleotide can be either capable (1) or incapable (0) of substitution. To calculate the likelihood at a site, I use a variable z(j) that contains the conditional probability of the observations above the jth node of the tree, given that the node j was in a particular state [z(j) = (z(j)A0, z(j)C0, z(j)G0, z(j)T0, z(j)A1, z(j)C1, z(j)G1, z(j)T1)]. This variable is initialized for the tip sequences. If xji = C is observed at tip sequence j, for example, then the conditional likelihood vector z(j) is z(j) = (0, 1, 0, 0, 0, 1, 0, 0). In other words, the conditional probability of all entries that correspond to an on or off C are set to 1 and the others are set to 0.
The site probability can then be calculated using Felsenstein's (1981)
pruning algorithm. Starting from the tips of the tree, we visit successively deeper nodes in the tree calculating z(j) for each visited node. The conditional probability of the observations on the subtree above node j is
|
|
|
|
![]() |
|
|
The summation and multiple integration required to calculate the integrated likelihood for s01 and s10 are impossible to perform analytically. I use the program MrBayes 2.1 (Huelsenbeck and Ronquist 2001
) to approximate the integrated likelihood,
(s01, s10). The program is set up to perform Bayesian inference of phylogeny and uses a variant of Markov chain Monte Carlo to approximate the posterior probability density of parameters. However, when uninformative priors are used, as they are for this study, the posterior distribution of a parameter can be considered a rescaled likelihood surface. All Markov chains were run for 2,000,000 generations and sampled every 100th cycle. Inferences were based on a total of 15,000 sampled points for each chain (the first quarter of the samples was discarded as the portion of the chain sampled when not at apparent stationarity). Chains were initialized using randomly chosen trees.
Testing the Covarion Model
I test the null hypothesis that the rate of substitution over the tree is constant for a site. The null hypothesis, H0, is that the proportion of invariable sites model is true. The proportion of invariable sites model can be obtained from the Tuffley and Steel (1998)
covariotide model by allowing s01 and s10 to approach zero with the constraint that (s10)/(s01 + ss10) = p, where p is the proportion of sites that are incapable of substitution. More specifically, the proportion of invariable sites model assumes that a site takes one of two rates: with probability p the site is invariable, meaning that it has a rate of zero; with probability 1 - p the rate at the site is multiplied by the factor 1/(1 - p). Note that rate variation across sites is still allowed under the null hypothesis. In fact, the model of rate variation across sites is quite sophisticated as it combines the proportion of invariable sites model with either the site-specific or gamma model for among-site rate variation. Under the null hypothesis, among-site rate variation is zero with probability p. With probability 1 - p the rate multiplier for a site (
, from above) is either drawn from a gamma distribution or sites are divided into first, second, and third codon positions and the rate multiplier,
, is estimated separately for each.
The alternative hypothesis, H1, allows the switching rates to vary freely. The likelihood ratio
|
Although the models are nested, because the null model involves a restriction of the parameters at the boundary of the parameter space, the 2 approximation does not hold (Whelan and Goldman 1999
; Ota et al. 2000
). For this study the appropriate null distribution is a mixture of
20,
21, and
22 distributions (Self and Liang 1987
). In particular, there are two parameters on the boundary of the parameter space and the other (nuisance) parameters of the model are not restricted to be on the boundary. Hence, the situation in this paper matches case 7 discussed in Self and Liang (1987)
. The likelihood ratio test statistic, -2 loge
, is asymptotically distributed as a mixture of
20,
21, and
22 distributions with mixing probabilities
- p,
, and p, respectively. The mixing parameter, p
1/2 is calculated as
|
In this study the likelihood ratio test statistic was approximated by fitting a bivariate normal distribution to the switching rates sampled by the MCMC algorithm. The likelihood ratio was approximated by comparing the density at the maximum of the bivariate normal to the density at the restriction that s01 = s10 = 0. The posterior probability distribution is asymptotically normally distributed, and the normal approximation is known as the Laplace approximation to the posterior density (Laplace 1986
; also see Raftery 1995
and Gamerman 1997
). The normal approximation to the posterior density has been used in model testing of evolutionary hypotheses by Suchard, Weiss, and Sinsheimer (2001)
. The bivariate normal has positive density for all possible values, whereas the switching model is restricted to have positive likelihoods for s01, s10 > 0. This should not be a problem, though, as I am only interested in knowing the relative heights of the bivariate normal at the maximum value and at the restriction; that is, I only need to know the likelihoods up to a constant in order to validly test the null hypothesis. For the purposes of determining the mixture of
2 distributions (discussed previously) the mixing parameter, p, must be estimated. The mixing parameter is calculated using the information matrix, which for the bivariate normal approximation to the posterior density of s01 and s10 is
|
I checked convergence of the ergodic averages for the switching rates using Gelman's (1995)
statistic (the estimated potential scale reduction). This statistic summarizes the ratio of two estimates of the within- and between-chain variances for different Markov chains starting from random points in the parameter space. It is constructed such that the estimated potential scale reduction approaches 1 from above. Values of
less than 1.1 or 1.2 are considered to indicate an adequately run MCMC analysis (Gelman 1995
).
![]() |
Results and Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
|
|
|
|
The null hypothesis tested in this studythat switching rates were zerohas parameter values at the boundaries of the parameter space. This means that the normal asymptotic result, that the likelihood ratio test statistic is 2 distributed with
degrees of freedom, does not hold (Whelan and Goldman 1999
; Ota et al. 2000
). Instead, the null distribution is a mixture of
2 distributions. Here, the mixture was (
- p)
20,
21, and p
22, where p is a parameter that is calculated from the Fisher information matrix for the parameters. The 95% critical values for a test of the null hypothesis varied from 4.54 to 4.90 for the 11 genes examined here. The critical values differed because the switching rates were more or less correlated across the genes. If I had ignored the fact that the parameters were at the boundaries of the parameter space under the null hypothesis, I would have used the critical values of a
2 distribution with
= 2 degrees of freedom. The 95% critical value for a test of the null hypothesis would have been 5.99. In other words, the test would have been conservative. In this study, the null hypothesis was usually rejected decisively and the extra work spent calculating the exact critical values was probably unnecessary. However, for more subtle problems it may be important to use the exact critical values to avoid inappropriate rejection of the null hypothesis.
The likelihood approach taken here differs from that taken by Tuffley and Steel (1998)
. Tuffley and Steel (1998)
derived distance measures for the covarion model and noted that in principal the covarion model can be distinguished from a rates across sites model when there are at least four monophyletic groups of species. However, distance methods are generally not as powerful at distinguishing models or inferring trees as likelihood-based methods. The covarion model implemented in this study is somewhat different from the model used by Tuffley and Steel (1998)
in that among-site rate variation is accommodated along with rate variation on the tree. Hence, this allows a test of the covarion model while accounting for the fact that different sites may have different inherent rates. Moreover, as implemented in this study, the covarion model is readily applied to phylogeny estimation.
Galtier (2001)
implemented a model of rate variation over a tree that allows the rate to switch incrementally, with the rate increments determined by a discrete gamma distribution. The appropriate null hypothesis for the Galtier (2001)
model is that rates are gamma distributed, and he tested this hypothesis using likelihood ratio tests. The covarion-like model implemented in this study differs from that implemented by Galtier (2001)
, in that rates simply switch between off and on. Rate variation across sites is allowed in the variant of the Tuffley and Steel (1998)
model implemented in this study because the rate of substitution, when the switching process is on, can vary from one site to another. Specifically, rates across sites followed either the site-specific model or the gamma model. The null hypothesis tested in this study, then, was the proportion of invariable sites model plus either site-specific or gamma-distributed rate variation. Which covarion-like model best fits biological sequences remains an open and interesting question.
To date, the covarion-covariotide model of DNA substitution has found little application in comparative sequence analysis. This study, and two earlier studies of three genes (Lockhart et al. 1998
; Galtier 2001
), suggests that a model that allows rates to change through time may provide a much better explanation of observed DNA sequences. The covarion model may also prove important in phylogeny reconstruction; rates of change that vary over the phylogenetic history of a group may mislead phylogenetic methods that fail to accommodate this process (Lockhart et al. 1998
). Finally, the general methodology used to accommodate rate variation over time may also prove useful for accommodating other processes that are known to vary over the evolutionary history of organisms, such as the frequency of the nucleotides G and C. Some of these processes are known to mislead phylogenetic methods (Lockhart et al. 1992
), especially for ancient groups, and any method that can estimate the rate at which substitution parameters change over evolutionary time might lead to new insights about the process of DNA substitution.
![]() |
Program Availability |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Keywords: covarion-covariotide model
integrated likelihood
likelihood ratio
Markov chain Monte Carlo
maximum likelihood
mixed chi-square
Address for correspondence and reprints: John P. Huelsenbeck, Department of Biology, University of Rochester, Rochester, New York 14627. johnh{at}brahms.biology.rochester.edu
.
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Arnason U., A. Gullberg, A. Janke, 1997 Phylogenetic analyses of mitochondrial DNA suggest a sister group relationship between Xenartha (Edentata) and Ferungulates Mol. Biol. Evol 14:762-768[Abstract]
Atrian S., L. Sánchez-Pulido, R. Gonzàlez-Duarte, A. Valencia, 1998 Shaping of Drosophila alcohol dehydrogenase through evolution: relationship with enzyme functionality J. Mol. Evol 47:211-221[ISI][Medline]
Berger J. O., B. Liseo, R. L. Wolpert, 1999 Integrated likelihood methods for eliminating nuisance parameters Stat. Sci 14:1-28[ISI]
Bollback J. P., J. P. Huelsenbeck, 2001 Phylogeny, genome evolution, and host specificity of single-stranded RNA bacteriophage (Family Leviviridae) J. Mol. Evol 52:117-128[ISI][Medline]
Brown W. M., E. M. Prager, A. Wang, A. C. Wilson, 1982 Mitochondrial DNA sequences of primates, tempo and mode of evolution J. Mol. Evol 18:225-239[ISI][Medline]
Felsenstein J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach J. Mol. Evol 17:368-376[ISI][Medline]
Fitch W. M., E. Markowitz, 1970 An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution Biochem. Genet 4:579-593[ISI][Medline]
Galtier N., 2001 Maximum-likelihood phylogenetic analysis under a covarion-like model Mol. Biol. Evol 18:866-873
Gamerman D., 1997 Markov chain Monte Carlo Chapman & Hall, London
Gelman A., 1995 Inference and monitoring convergence Pp. 131144 in W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, eds. Markov chain Monte Carlo in practice. Chapman & Hall, London
Huelsenbeck J. P., F. Ronquist, 2001 MrBayes: Bayesian inference of phylogenetic trees Bioinformatics 17:754-755
Hughes A. L., M. Nei, 1988 Pattern of nucleotide substitution at major histocompatibility complex loci reveals overdominant selection Nature 355:167-170
Jukes T. H., C. R. Cantor, 1969 Evolution of protein molecules Pp. 21123 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York
Kuno G., G.-J. J. Chang, K. R. Tsuchiya, N. Karabatsos, C. B. Cropp, 1998 Phylogeny of the genus Flavivirus J. Virol 72:73-83
Laplace P. S., 1986 Memoir on the probability of causes of events (translation by S. Stigler) Stat. Sci 1:364-378
Lockhart P. J., C. J. Howe, D. A. Bryant, T. J. Beanland, A. W. D. Larkum, 1992 Substitutional bias confounds inference of cyanelle origins from sequence data Mol. Biol. Evol 34:153-162
Lockhart P. J., M. A. Steel, A. C. Barbrook, D. H. Huson, M. A. Charleston, C. J. Howe, 1998 A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages Mol. Biol. Evol 15:1183-1188[Abstract]
Ota R., P. J. Waddell, M. Hasegawa, H. Shimodaira, H. Kishino, 2000 Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters Mol. Biol. Evol 17:798-803
Raftery A. E., 1995 Hypothesis testing and model selection Pp. 163187 in W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, eds. Markov chain Monte Carlo in practice. Chapman & Hall, London
Self S. G., K.-Y. Liang, 1987 Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions J. Am. Stat. Assoc 82:605-610[ISI]
Suchard M. A., R. E. Weiss, J. S. Sinsheimer, 2001 Bayesian selection of continuous-time Markov chain evolutionary models Mol. Biol. Evol 18:1001-1013
Takahashi K., A. P. Rooney, M. Nei, 2000 Origins and divergence times of mammalian class II MHC gene clusters J. Hered 91:198-204
Tavare S., 1986 Some probabilistic and statistical problems on the analysis of DNA sequences Pp. 5786 in Lectures in mathematics in the life sciences, Vol. 17. American Mathematical Society
Tuffley C., M. Steel, 1998 Modeling the covarion hypothesis of nucleotide substitution Math. Biosci 147:63-91[ISI][Medline]
Whelan S., N. Goldman, 1999 Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics Mol. Biol. Evol 16:1292-1299
Whiting M. F., J. C. Carpenter, Q. D. Wheeler, W. C. Wheeler, 1997 The Strepsiptera problem: phylogeny of the holometabolous insect orders inferred from 18S and 28S ribosomal DNA sequences and morphology Syst. Biol 46:1-68[ISI][Medline]
Yang Z., 1993 Maximum-likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods J. Mol. Evol 39:306-314[ISI]
. 1994 Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ across sites Mol. Biol. Evol 10:1396-1401[Abstract]
. 1996 Among-site rate variation and its impact on phylogenetic analyses TREE 11:367-372
Yang Z., R. Nielsen, N. Goldman, A.-M. K. Pedersen, 2000 Codon-substitution models for heterogeneous selection pressure at amino acid sites Genetics 155:431-449