* Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Sweden
Centre for the Study of Evolution & School of Biological Sciences, University of Sussex, Brighton, U.K.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: molecular clock overdispersion substitution rate ANOVA
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In mammals we have good evidence of all three types of evolutionary rate variation (Li 1997), but as yet we have no quantitative data on the relative contributions of gene, lineage, and gene-by-lineage effects. We have developed a method to solve this problem by using an analysis of variance (ANOVA) in which we partition the variance due to each effect and test whether the effects are statistically significant. The basic approach is as follows. We take orthologous genes from a series of lineages which either form a star phylogeny or form independent pairs of taxa (fig. 1); in the latter case we need to know the time of divergence for each pair of taxa to partition the variance correctly. In ANOVA we need an estimate of the error variance. This could potentially be estimated analytically from the formulae used to estimate the branch lengths; however, this is complex, so we have elected to estimate the error variance empirically by dividing each gene into its odd- and even-numbered codons This yields two estimates of the substitution rate for each combination of lineage and gene. We calculate the branch lengths for each half of the gene and divide the branch length by the time of divergence if required (i.e., if we have independent pairs of lineages).
|
We have applied our new method to nuclear and mitochondrial sequence data from mammals. For the nuclear data set we analyzed the variation in both amino acid and synonymous substitution rates using four taxa which form an approximate star phylogeny. But the ratio of synonymous to amino substitution rates in mitochondria is such that we had to analyze different data sets; for the amino acid substitution rate we analyzed seven taxa which form an approximate star phylogeny, and for the synonymous substitution rate we analyzed four independent pairs of taxa.
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Nuclear Genes
We used the ACNUC (Gouy et al. 1984) and HOVERGEN (Duret, Mouchiroud, and Gouy 1994) databases to search for orthologous genes in the following four mammalian species: human, cow, mouse, and rabbit. Coding sequences were extracted using the EMBOSS package at HGMP (http://www.hgmp.mrc.ac.uk/), and DNA sequence alignment was performed on the basis of protein alignments using the default settings of the CLUSTALW (Thompson, Higgins, and Gibson 1994) multiple alignment program, as implemented in the DAMBE package (http://web.hku.hk/xxia/software/ software.htm). Because it is important to confirm orthology, we performed an additional check of the phylogeny using the PAML package (Yang 1997), which was used to test the three alternative four-species unrooted trees using a codon model of sequence evolution and the Shimodaira and Hasegawa (1999) test with multiple-comparison correction. For only one gene was the true species tree of [[human, cow], [mouse, rabbit]] rejected, leaving 29 genes. Accession numbers are available on request from the authors.
Times of Divergence
To successfully partition the substitution rate between gene, lineage, and gene-by-lineage effects, we need to express the substitution rates in terms of substitution per unit time. For the nuclear genes and one of the mitochondrial DNA data sets, this is straightforward because the taxa in each case form an approximate star phylogeny; the divergences for each taxa are therefore over similar time scales. In contrast, in the case of the data set used to analyze mitochondrial synonymous substitution rates, we analyzed four independent pairs of taxa in which the taxa have been separated for different periods of time. We obtained the following times of divergence from Pesole et al. (1999), who inferred times of divergence from locally calibrated molecular clocks (chimps), biogeographic data (seals), and fossils (whales, horses-donkeys): chimps 2.5 Myr, seals 2.7 Myr, whales 5 Myr, horse-donkey 4 Myr.
Substitution Rate Estimation
Rates of synonymous and amino acid substitution were estimated using the PAML package (Yang 1997). Amino acid substitution rates, Ka, were estimated from the amino acid sequences of the genes, using the Poisson model of amino acid substitution, with variation among sites modeled by a discretized gamma function (similar results were obtained assuming no variation in the rate of substitution across sites). Synonymous substitution rates, Ks, were estimated using a model of codon evolution which accounted for the transition-transversion rate bias, with codon usage bias modeled by the nucleotide frequencies at the three codon positions and with a different ratio of nonsynonymous to synonymous substitution rates allowed down each lineage. To reduce the effect of changes at the amino acid level on synonymous site evolution, we restricted the analysis of Ks to codons for which the first two codon positions were invariant across all species.
Mitochondrial Ks rates were estimated by pairwise comparison between each of the four closely related species pairs, and mitochondrial Ka rates were estimated by analysis of the seven species alignments, with the tree assumed to be [[mouse, rabbit], [human, cow]], [[elephant, aardvark], armadillo], which is consistent with recent mammalian phylogenies (Madsen et al. 2001; Murphy et al. 2001). Nuclear Ks and Ka rates were estimated by analysis of the four species alignments, with the tree assumed to be ([human, cow], [mouse, rabbit]). Only the terminal branch lengths were included in the mitochondrial Ka analysis and the nuclear gene analyses (i.e., internal branch lengths were not used). We did this for two reasons: first, the deepest branch cannot be partitioned without the addition of an out-group taxon; second, the internal branches make distance estimates from different taxa nonindependent, for example, the divergence from mouse to the root of the mitochondrial Ka tree shares several internal branches with the divergence from rabbit to the root.
The sharing of lineages is not the only potential source of nonindependence; even the terminal branches are not independent of one another because they are all dependent on every sequence in the data set (excluding data sets in which we analyze pairs of independent taxa); for example, if we have three sequences for which the pairwise divergences are D12, D13, and D23, the individual branch lengths are calculated using all three pairwise comparisons, e.g., (D12 + D13 - D23)/2. As Bulmer (1989) has shown, this nonindependence generates negative covariances between branches, which ultimately manifests itself as overdispersion. In our method this covariance is expected to be automatically partitioned into the error term; we examine this issue in our generic simulations below.
To check our assumption of a star phylogeny, we calculated the average distances along all branches in our two data sets. These are shown in figure 1; we only give the Ks tree for the nuclear genes because the Ka tree is very similar in terms of the proportion of the total tree length which is in the internal branch. As expected, it is evident that the internal branches in each tree are much shorter than the terminal branches. It should also be noted that the statistical significance of the gene effect and the gene-by-lineage interaction are independent of the star assumption; we can divide the divergence down any branch by an arbitrary constant with no effect on the probability values. Of course, the significance of the lineage effect is affected, as are the estimates of the variance components.
The ANOVA of Substitution Rates
To test each effect (gene, lineage, and gene-by-lineage interaction) and to partition the variance of the substitution rates, we used a model II, two-way ANOVA with replication. We tested the interaction term by an F- test of the interaction mean sum-of-squares (MS) against the error MS. The gene and lineage effects were tested against the interaction MS; we did not attempt to combine the error and interaction MS. This did not affect our results qualitatively. The variance was partitioned according to the following formulae:
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Table 1a shows the proportion of each set of the 1,000 simulated data sets in which the gene, lineage, or interaction effects were significant at 5%. Over all parameter combinations the level of type I error is as expected for the lineage effect but is highly conservative for the gene effect. The pattern for the interaction term is more complex. With only three lineages the level of type I error is unacceptably high over most parameter combinations. But the situation improves as the number of lineages increases, with the error being about double that expected with four lineages and the same as that expected with 10 lineages.
|
The heterogeneity in variance we have simulated is quite substantial, but the results suggest that caution should be taken in interpreting the mildly significant interaction effects. For this reason we have performed parametric simulations to check the significance of the interaction effects we have observed.
Variation in Mitochondrial Substitution Rates
The results of the ANOVA of amino acid substitution rates, Ka, in mitochondrial genes are given in table 2. All three effects are significant, with the gene effect being the strongest, followed by the lineage and gene-by-lineage interaction. Figure 2 shows the (geometric) mean amino acid substitution rates for each gene and lineage. The strong lineage effect seems to be largely due to the fast rates of evolution in human and elephant, and to a lesser extent, in mouse. In contrast, the gene effect is rather more evenly distributed. Figure 3 shows the substitution rate for each gene in each lineage, plotted on a log scale; if there were no gene-by-lineage interaction (or sampling error), the lines would be parallel. The lines are not parallel, but it is not easy to discern any obvious interactions, i.e., instances of a gene evolving particularly fast or slowly in a particular lineage.
|
|
|
In contrast to the Ka analysis, only the lineage effect is significant for the synonymous substitution rates in the mitochondrial data set, and this contributes around 85% of the nonerror variance. But the divergences in this data set are small, and this may restrict our power to detect gene and gene-by-lineage effects. Unfortunately, because of the fact that we needed a divergence date to partition the variance correctly, we were restricted in our choice of taxa. The mean substitution rate for each lineage and gene are plotted in figure 4. The rate of synonymous substitution in mitochondrial genes appears to be much higher in whales and horses than in seals and hominids (human-chimpanzee).
|
The analysis of Ka in nuclear genes reveals highly significant lineage, gene, and gene-by-lineage effects (see table 2). Most of the nonerror variance in Ka is due to gene effects, with the remainder split between lineage and gene-by-lineage effects in the ratio 2:1. The mean substitution rates for each gene and lineage are plotted in figure 5. As with the mitochondrial Ka analysis, individual gene-by-lineage interactions are hard to identify; one possibility is the dec gene in mouse (fig. 6).
|
|
|
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In all four ANOVAs the lineage effect is significant. For the nuclear genes, in which the comparison can be made, the lineage effect seems to be quite similar for the amino acid and synonymous substitution rates humans have the lowest rate, followed by cow and rabbit, which have similar branch lengths, with mice showing the greatest rate. To investigate this further, we did an ANOVA of the logarithm of Ka/Ksas expected, there was no lineage effect (see table 2), which suggests that the lineage effect is similar for the amino acid and synonymous substitution rates and that they probably have a common cause. The cause of the lineage effect is most probably the differences in the rate of mutation between lineages, due to variation in factors such as generation time and metabolic rate, but there are other possibilities. For example, if slightly deleterious mutations segregated at both synonymous and amino acid sites, then differences in effective population size would generate correlated differences in rate along lineages between Ka and Ks.
A gene effect for synonymous substitution rates has been clearly demonstrated only once before (Mouchiroud, Gautier, and Bernardi 1995); this was also the case in mammals. A gene effect is also implicit in the observation that the genes which are close together on mammalian chromosomes have similar rates of synonymous substitution (Matassi, Sharp, and Gautier 1999; Lercher, Williams, and Hurst 2001); if the synonymous substitution did not vary in a consistent manner between lineages, it would be difficult to detect a local similarity in rate. The significant gene effect might be due to this local similarity, although the effect is quite weak, particularly in rodents (Lercher, Williams, and Hurst 2001). The gene effect might also be due to CpG islands which are likely to reduce the rate of evolution through the stabilization of the CpG dinucleotide (Bulmer 1986; Sved and Bird 1990).
Gene-by-Lineage Interactions and Overdispersion
The molecular clock is said to be overdispersed when the variation in the rate of evolution of a gene across lineages is greater than that expected under a Poisson process. But, overdispersion can arise through a variety of processes, including variation in the mutation rate between lineages. This led Gillespie (1989) to suggest that lineage effects should be removed before testing for overdispersion. Gene-by-lineage interactions are equivalent to Gillespie's (1989) definition of overdispersion: it is the variation in the rate of evolution which cannot be attributed to gene or lineage effects. Overdispersion has been observed previously for both synonymous and amino acid substitution rates in mammalian nuclear genes (Gillespie 1989; Ohta 1995). But there was some doubt over the significance of the observed overdispersion because two factors were not accounted for: (1) the variance associated with the estimate of the substitution rate (Bulmer 1989), and (2) the covariance associated with estimating the rate for a nonindependent set of lineages (Bulmer 1989). Simulations suggest that the rate of amino acid substitution is genuinely overdispersed in mammals (Nielsen 1997); however, a similar conclusion was not reached for the synonymous substitution rate. The level of overdispersion is usually measured by the index of dispersion; this is the variance in the "corrected" number of substitutions for a gene across lineages divided by the mean corrected number of substitutions, where the corrected number of substitutions is the number of substitutions for a gene in a lineage weighted by the relative rate of evolution in the lineage across genes. Gillespie (1989) found that the index of overdispersion was 6.95 for the rate of amino acid substitution and 4.64 for the rate of synonymous substitution across primates, artiodactyls, and rodents for a data set of 20 genes; similar results were obtained by Ohta (1995) on a larger data set, 5.60 for Ka and 5.89 for Ks. The values of the index of overdispersion for our data are comparable: 8.6 for Ka and 4.8 for Ks for nuclear genes, and 10.0 and 8.2 for mitochondrial genes. Our simulation results suggest that the overdispersion of synonymous substitution rates is not significant.
The cause of overdispersion in the rate of amino acid substitution has been one of the central arguments in the neutralist-selectionist debate (Gillespie 1986, 1989, 1991; Takahata 1987, 1988; Cutler 2000). Three neutral explanations for overdispersion have been suggested. First, overdispersion in the rate of amino acid substitution could be due to overdispersion in the mutation rate; if this were the case, then we would not expect an interaction for Ka/Ks. But there is a highly significant gene-by-lineage interaction for Ka/Ks (see table 2), and so it seems likely that there is overdispersion in the rate of amino acid substitution which is not caused by variation in the mutation rate.
Second, it has been suggested that overdispersion could be due to variation in the proportion of effectively neutral mutations caused by variation in the effective population size (Takahata 1987; Araki and Tachida 1997). If the strength of selection against a deleterious mutation is such that |Nes| << 1, where Ne is the effective population size and s the strength of selection, then the mutation can spread by random genetic drift and become fixed in the population. But if |Nes| >> 1, the probability of fixation is essentially zero, and the mutation cannot become fixed. Therefore, variation in the effective population size can change the proportion of mutations which can become fixed and, hence, the rate of evolution. Single-locus models show that this process can lead to overdispersion under some parameter combinations (Takahata 1987; Araki and Tachida 1997). But it seems likely that much of this overdispersion effect will manifest itself as a lineage effect in a multilocus system. If the effective population size decreases, the proportion of effectively neutral mutations will increase, but it will increase in all genes, generating an overall increase in the rate of evolution across all genes. Oversdispersion will be present because the change in the proportion of effectively neutral mutations will be different for different genes, but the effect may be too subtle to detect.
Third, it has been suggested that overdispersion might arise through "fluctuating neutral space" (Takahata 1987; Iwasa 1993)neutral substitutions change the proportion of mutations which are neutral. Under certain circumstances this model is expected to generate adaptive evolution for the following reason. The proportion of mutations which are neutral cannot increase indefinitely, so some neutral mutations will reduce the proportion of mutations which are neutral whereas others increase it (at equilibrium we expect the number of mutations becoming neutral to be equal to the number of mutations becoming nonneutral). Consider a neutral mutation at one site which changes another site, which was formerly neutral, to one that is subject to selection. On average there is a one in four chance that this second site will be occupied by the most advantageous allele, because there are four nucleotides. The first substitution will therefore create advantageous alleles at the second site, and adaptive evolution will ensue.
But it could be the case that the mutation at the first site will not be neutral unless the second site is occupied by its most advantageous allele. In this model, neutral space will fluctuate, but it will do so slowly. If the mutation at the first site only affects the selection at one other site, there is a one in four chance that the second site will be occupied by the most advantageous of the alleles and that the mutation at the first site will be neutral. If the first mutation affects the selection at two sites, the probability that the first mutation is neutral drops to 1 in 16; if it affects three sites, the probability is reduced further to 1 in 64. Hence, neutral mutations which significantly decrease the neutral space will be rare. This behavior may be exactly what is required to explain overdispsersion because, as Gillespie (1993) and Cutler (2000) have shown, the change in the rate of evolution needs to be at a rate which is slower than the rate of substitution.
Gillespie (1986, 1989, 1991) has proposed that overdispersion is a consequence of bursts of adaptive evolution. This model certainly seems to be more consistent with some of the dramatic examples of overdispersion that are known. For example, there has been a very marked acceleration in the rate of amino acid substitution in the ancestral lineages leading to both the hominoids and colobine monkeys in the lysozyme gene; in each case, Ka > Ks, suggesting that the acceleration is due to adaptive evolution (Messier and Stewart 1997; Yang 1998). But there are potential problems with the hypothesis that overdispersion is caused by bursts of adaptive evolution. As Gillespie (1993) and Cutler (2000) have shown, the environment needs to change at a rate which is slower than the rate at which substitutions occur for overdispersion to be produced. Given that a typical mammalian protein of 500 codons goes through one amino acid substitution every 2 Myr, the rate of environmental change needs to be very slow, and yet the habitat of most species changes much more rapidly. For example, a large proportion of the world was under ice during the last ice-age just 10,000 years ago. There are two possible solutions to this problem. First, the rate of evolution of a protein may not depend upon the external environment but on the substitutions which have occurred in the protein or in other related proteins. This is similar to the covarion model and might be termed a fluctuating adaptive space model. Second, adaptive evolution might be associated with the occupation of a vacant or new niche, and such niches may only become available at a much slower rate than the rate of environmental change.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Araki, H., and H. Tachida. 1997. Bottleneck effect on evolutionary rate in the nearly neutral mutation model. Genetics 147:907-914.
Bulmer, M. 1986. Neighbouring base effects on substitution rates in pseudogenes. Mol. Biol. Evol 3:322-329.[Abstract]
Bulmer, M. 1989. Estimating the variability of substitution rates. Genetics 123:615-619.
Cutler, D. 2000. Understanding the overdispersed molecular clock. Genetics 154:1403-1417.
Duret, L., D. Mouchiroud, and M. Gouy. 1994. HOVERGEN, a database of homologous vertebrate genes. Nucleic Acids Res 22:2360-2365.[Abstract]
Gillespie, J. H. 1986. Variability of evolutionary rates of DNA. Genetics 113:1077-1091.
Gillespie, J. H. 1989. Lineage effects and the index of dispersion of molecular evolution. Mol. Biol. Evol 6:636-647.[Abstract]
Gillespie, J. H. 1991. The causes of molecular evolution. Oxford University Press, Oxford.
Gillespie, J. H. 1993. Substitution processes in molecular evolution. I. Uniform and clustered substitutions in a haploid model. Genetics 134:971-981.
Gouy, M., F. Milleret, C. Mugnier, M. Jacobzone, and C. Gautier. 1984. ACNUC: a nucleic acid sequence data base and analysis system. Nucleic Acids Res 12:121-127.[Abstract]
Iwasa, Y. 1993. Overdispersed molecular evolution in constant environments. J. Theor. Biol 164:373-393.[CrossRef][ISI][Medline]
Lanave, C., S. Liuni, F. Licciulli, and M. Attimonelli. 2000. Update of AMmtDB: a database of multi-aligned Metazoa mitochondrial DNA sequences status. Nucleic Acids Res. 28:153-154.
Lercher, M. J., E. J. B. Williams, and L. D. Hurst. 2001. Local similarity in evolutionary rates extends over whole chromosomes in human-rodent and mouse-rat comparisons: implications for understanding the mechanistic basis of the male mutation bias. Mol. Biol. Evol 18:2032-2039.
Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Madsen, O., M. Scally, C. J. Douady, D. J. Kao, R. W. DeBry , R. Adkins, H. M. Amrine, M. J. Stanhope, W. W. de Jong , M. S. Springer. 2001. Parallel adaptive radiations in two major clades of placental mammals. Nature 409:610-614.[CrossRef][ISI][Medline]
Matassi, G., P. M. Sharp, and C. Gautier. 1999. Chromosomal location effects on gene sequence evolution in mammals. Curr. Biol 9:786-791.[CrossRef][ISI][Medline]
Messier, W., and C.-B. Stewart. 1997. Episodic adaptive evolution of primate lysozymes. Nature 385:151-154.[CrossRef][ISI][Medline]
Mouchiroud, D., C. Gautier, and G. Bernardi. 1995. Frequencies of synonymous substitutions are gene-specific and correlated with frequencies of non-synonymous substitutions. J. Mol. Evol 40:107-113.[ISI][Medline]
Murphy, W. J., E. Elzirik, W. E. Johnson, Y. P. Zhing, O. A. Ryder, and S. J. O'Brien. 2001. Molecular phylogenetics and the origins of placental mammals. Nature 409:614-618.[CrossRef][ISI][Medline]
Nielsen, R. 1997. Robustness of the estimator of the index of overdispersion for DNA sequences. Mol. Phylogenet. Evol. 7:346-351.[CrossRef][ISI][Medline]
Ohta, T. 1995. Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory. J. Mol. Evol 40:56-63.[ISI][Medline]
Pesole, G., C. Gissi, A. De Chirico, and C. Saccone. 1999. Nucleotide substitution rate of mammalian mitochondrial genomes. J. Mol. Evol 48:427-434.[ISI][Medline]
Shimodaira, H., and M. Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol 16:1114-1116.
Sved, J., and A. P. Bird. 1990. The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc. Natl. Acad. Sci. USA 87:4692-4696.[Abstract]
Takahata, N. 1987. On the overdispersed molecular clock. Genetics 116:169-179.
Takahata, N. 1988. More on the episodic clock. Genetics 118:387- 388.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. ClustalWimproving the sensitivity of progressive multiple alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.[Abstract]
Wallis, M. 1994. Variable evolutionary rates in the molecular evolution of mammalian growth hormones. J. Mol. Evol. 38:619-627.[ISI][Medline]
Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci 13:555-556.[Medline]
Yang, Z. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol 15:568-573.[Abstract]