* Physics Department and Center for Theoretical Biological Physics, University of California at San Diego
Institute for Theoretical Physics, Cologne University, Cologne, Germany
Department of Biological Sciences, Stanford University
Correspondence: E-mail: peter.arndt{at}uni-koeln.de.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: nucleotide substitution CpG-methylation repetitive elements GC content
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
To address these issues we developed a maximum-likelihood (ML) method that simultaneously estimates the substitution rates corrected for multiple hits, for the four transversions, two transitions, and the CpG-based transition process. This method is based on a recently developed treatment of neighbor-dependent substitutions (Arndt, Burge, and Hwa 2002), which is an extension of Kimura's approach along with its generalizations (Kimura 1981; Hasegawa, Kishino, and Yano 1985) to include the CpG-based transition process. Importantly, it is capable of recovering the true patterns of substitution going far back in time, given the knowledge of the ancestral sequence and a sufficient amount of data.
In the case of the human genome, sufficient amount of data is indeed available in the form of exceptionally prolific interspersed repetitive elements (REs), which have been inserted into the human genome in bursts at various stages during evolution (Britten et al. 1988; Jurka and Smith 1988; Britten 1994; Kapitonov and Jurka 1996; Mighell, Markham, and Robinson 1997). Every burst generated numerous copies of an ancestral "master sequence," which can be reconstructed easily from the available data (Jurka 2000). The master sequences was not the same for every burst; this way various subfamilies of RE have been generated at different times in evolution. Most of the REs reside in the intergenic regions and are believed to be functionally neutral.
The oldest families of REs that can be identified in the human genome were inserted over 200 MYA. Naturally, older elements accumulated more base substitutions than younger elements. Therefore, a careful comparison of the frequencies of substitutions observed in REs of different ages can tell us the history of the substitutional processes extending far into the past. This analysis applied to all REs found in the human genome provides us with the information about the genome-wide averaged background substitutional processes at different times. Using the same analysis and applying it to REs found in specific regions, we could reconstruct the region-specific substitutional processes as well. We were able to reveal different substitution patterns in regions with different background GC content. These patterns changed in time and therefore provide us additional information on the origin and timing of the human isochore structure.
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
Data Analysis
In the analysis, the "star" phylogeny was assumed for each subfamily of REs. This assumption is based on the known biology of retrotransposons and supported by previous phylogenetic analysis of the transposons (Britten et al. 1988; Jurka and Smith 1988; Jurka and Milosavljevic 1991). The assumption is invalidated if a significant fraction of the REs is generated by duplication. However, the latter is estimated to affect under 10% of the REs (J. Jurka, personal communication), making the star phylogeny a reasonable starting point. Given the degree of sequence divergence of the copies of REs (table 1), we assume that each copy evolves as a unique sequence. Further, the extremely low divergence of the youngest REs (Jurka 2000) in the human genome suggests that retrotranscription errors can be safely ignored. The ancestor sequence for each subfamily is taken from the RepBase (Jurka 2000) and verified by a ML-based reconstruction method.
Estimation of Substitution Frequencies
We used a maximum-likelihood (ML) approach to estimate the substitution frequencies for each subfamily of REs. The observed data is given by the pair-wise gapped alignments of each identified copy of an RE in the genome (ß1, ..., ßM), with its ancestral sequence (1, ...,
M) from the RepBase (Jurka 2000). Here Greek letters represent the nucleotides A, C, G, and T, and M is the length of the alignment. From the alignments of a particular subfamily of RE, we count the number of observed substitutions
i
ßi, recording also the neighboring bases in the ancestor,
i1 to the left and
i+1 to the right. We disregard a site i if any of the
i1,
i,
i+1, or ßi is an insertion or deletion in the alignment. We denote these counts by N(ß|
L,
,
R) and the set of all counts for a subfamily by {N}.
To implement the ML approach, we need to specify a substitution model describing the observed data. We chose a general model comprising all possible single nucleotide transversions (eight) and transitions (four), as well as the CpG-based transition, CpGCpA/TpG. Each process and its complementary process are assumed to occur with the same substitution frequency per site. Hence, the substitution model is parameterized by a set of seven frequencies, collectively denoted as {
}. The likelihood of observing the data {N} under a given model with parameters {
} can be approximated by
|
The heart of our method is the recently developed neighbor-dependent substitution model (Arndt, Burge, and Hwa 2002). This method is an extension of Kimura's approach and its generalizations (Kimura 1981; Hasegawa, Kishino, and Yano 1985) to include the CpG-based transition process. It was assumed that the REs were selectively neutral, so that the two DNA strands can be treated in the same way. The model comprised the four transversions, two single-nucleotide transitions, and the CpG-based transition, as well as all the secondary processes involving multiple and back substitutions. A special case of this model in which all the transition rates were equal and all the transversion rates were equal was solved analytically (Arndt, Burge, and Hwa 2002). For the more general case at hand involving all seven substitution frequencies, the principle of the likelihood calculation was the same as described by Arndt, Burge, and Hwa (2002), but the calculation became more tedious and was performed with the help of a computer. The corresponding program is available upon request. Undoubtedly, the inclusion of the neighbor-dependent substitution process made the analysis more complicated. However, the results obtained were more sensitive and reliable at longer evolutionary time scales, especially when compared with the alternative method of "direct counting" of CpG CpA/TpG events. The performance of this approach has been evaluated quantitatively as will be shortly described below.
Reconstruction of the Ancestral Sequence
In our analysis, we assumed that the RE sequence appearing in the RepBase was the correct ancestral sequence (Jurka 2000). We tested this assumption by reconstructing the ancestral sequence from all copies found in the human genome, using the ML approach with the above substitution model. In contrast to the above estimation of substitution frequencies, here we want to find the most likely ancestral sequence (1, ...,
M), given the observed copies and our substitution model. The likelihood to be maximized is again L({N}|{
}), but now one has to adjust the ancestral sequence
i (which changes the numbers {N}) while keeping the substitution frequencies {
} fixed. Initially, the frequencies {
} are unknown. We use the naïve consensus sequence and the ML analysis described above to get an estimate for these frequencies. Subsequently, we construct the ML ancestor by varying the ancestor sequence and maximizing L({N}|{
}) using the {
} just determined. We then use the new ancestor sequence to get a better estimate of the frequencies. After three iterations, this procedure converges to a unique ML ancestor sequence and estimated substitution frequencies. We tested this method by independently evolving many copies (in number comparable to the number of copies of REs found in the human genome) of a synthetic ancestral sequence. We were able to reconstruct this ancestor sequence exactly, even if all CpG disappeared from the naïve consensus of the evolved sequences due to the high CpG-based transition frequency. Comparing the ML ancestor and the ancestor in the RepBase, we observe differences for less than 1% of the sites.
Performance Evaluation
To test the performance of the ML method, we synthetically aged sequences with known substitution frequencies starting from an ancestral sequence. We used the following stochastic evolution procedure: pick a base at random and allow one of the seven substitution processes we consider to occur with probability that is 1/1,000 of the corresponding substitution frequency ; repeat until every base was visited 1,000 times on average. Scaling all
s by a single factor while keeping their ratios fixed, we can generate sequences evolved for various amounts of times with different degrees of divergence from the ancestor. The ratios themselves are the relative substitution frequencies and are independent of the amount of time that the sequences have been aged.
To test the accuracy and limitation of our ML method in estimating the substitution frequencies, we synthetically evolved 20,000 copies of the ancestral MIR sequence (5 Mb in total, about 1/10 of the amount of MIR sequences found in the human genome) for varying periods of time t, with equal transversion rates and differing transition rates (A:TG:C rate = 3x, G:C
A:T rate = 5x, and CpG-based transition rate = 40x that of the transversion rate, respectively). These rates are chosen to be similar to those found in our later analysis. We perform the ML analysis for sequences at each age t, and plot the three transition frequencies (solid symbols) obtained against the average of the transversion frequencies for each age in figure 1. We observe fixed ratios between the transition and transversion frequencies, with the slopes reflecting very accurately (within 1% relative errors) the relative substitution rates used in aging. Estimates of the same frequencies using the direct counting method are indicated in figure 1 by the open symbols. The result is especially poor for the CpG-based transition frequencies (the open diamonds). For sequences with transversion frequency above 0.02, this DC-estimated CpG transition rate saturates to 0.5, indicating that most CpGs had decayed into either CpAs or TpGs.
|
We first consider the effect of a fivefold increase in the CpG-based transition rate (from 8x to 40x the transversion rate), while keeping the relative single-nucleotide transition rates the same as before. The change is applied at a time t0 corresponding to an average transversion frequency of 0.025, to mimic an observed effect to be described below. The frequencies estimated by ML for various aging time are shown in figure 2. Here and below, it will be convenient to use the average transversion frequency as the unit of aging time t. For t < t0 = 0.025, the sequences do not know about the rate change at t0, and the estimated substitution frequencies are the same as those obtained before (as shown in figure 1). For t > t0, the estimated CpG-based transition frequencies (the diamonds in figure 2) become very different, whereas the other two transition frequencies (the triangles) follow the same trend as for t < t0. When plotted against the average transversion frequencies, the CpG-based transition frequencies exhibit a sharp change at t = 0.025 and fall reasonably well onto two straight lines. We see that the slopes for the data points in the two regions reflect the two relative substitution rates used in the aging process (slopes of the solid lines in figure 2). Even for the less reliable "past" rate, we obtain a relative error of less than 10%. Of course, the actual error size depends on the total amount of sequence data for each time t.
|
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
Intriguingly, t0 = 0.025 (90 MYA) corresponds roughly to the time of the mammalian radiation (Hedges et al. 1996; Kumar and Hedges 1998; Easteal 1999; Murphy et al. 2001). Corroborating this inference is an independent observation in figure 4 that the CpG-based transition frequencies of nearly all the L1Pxx elements (solid square symbols, corresponding to the L1s found in all primates) fall on the steep segment, while the CpG-based frequencies of all the L1Mxx elements (solid diamond symbols, corresponding to L1s found in all eutherian mammals) (Smit and Riggs 1995) fall on the flatter segment. One possible mechanism for the abrupt increase of the CpG-based transition rate (but not the neighbor-independent C
T transition rate) is a hypothetical elevation of methylation activity in germline cells at the time of mammalian radiation.
The History of Substitution in Different Isochores
To gain some insight into the strong regional GC content variations of the human genome, we repeated the above analysis separately for REs found in regions with different base composition. Specifically, we partitioned the REs according to the base composition of the region flanking individual copies into seven equal-sized bins of GC content ranging from 30% to 60%. A plot similar to figure 4 was generated for each of the GC content bins and analyzed separately. As expected, we find the abrupt upward shift in the rate of CpG-based transition for every bin of GC content, occurring at approximately 90 MYA (data not shown). More surprisingly, the rates of single-nucleotide transitions also showed time dependence. As shown in figure 5, both transition frequencies are nearly independent of GC content for transversion frequencies less than 0.02 (i.e., in the recent past), whereas clear dependence on GC content can be seen for the two transitions at much earlier times (e.g., for transversion frequencies greater than 0.03). Moreover, the dependences of GC content in the distant past are opposite for the two transitions: the rates of GC-enriching transitions were higher in the regions of higher GC content (fig. 5a), whereas the opposite was the case for the AT-enriching transitions (fig. 5b). Similar results have already been found by Lander et al. (2001). It thus appears that the GC-dependence of the substitution pattern also underwent a change. Intriguingly, this change coincides in time (90 MYA) with the upward shift of the CpG-based transition rate and the mammalian radiation. Note that this change is not a trivial consequence of the shift in the genome-wide CpG-based transition rate shown in figure 4, since the different substitution processes are treated separately in our ML analysis.
|
|
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Our estimates of the recent pattern of substitution agree with those derived by other authors by studying evolution at synonymous sites in pairs of mammalian species (Duret et al. 2002; Rosenberg, Subramanian, and Kumar 2003) and by studying SNPs in the human populations (Eyre-Walker and Hurst 2001; Duret et al. 2002; Webster, Smith, and Ellegren 2003). In any case, the bigger estimation error is expected in the older, pre-90 MYA patterns of substitution. However, those are the patterns that produce a very good fit between the observed and the expected GC content in the human genome. Therefore, we believe that our main results are robust to the possible errors in the estimation procedure. It is noteworthy that although the GC content expected based on pre-90 MYA patterns fits well the observed GC content, it consistently underestimates it by a small amount (fig. 6b). This suggests the existence of another GC-enriching process acting fairly uniformly on most sequences in the genome. The possible candidates include insertions of GC-rich sequences (pseudogenes or REs) and an AT-bias in small deletions or a GC-bias in small insertions that we have not taken into account in our procedure.
If the high-GC isochores have been continuously degraded over the past 90 Myr, why do we still see the pronounced isochore structure in the human genome? We believe this is primarily because the degrading process is very slow. Based on the estimated substitution rates, we expect that high-GC isochores had at most 5% higher GC content 90 MYA. This would imply that the faster-evolving mammalian lineages should have the more muted isochore patternthe expectation is borne out in the mouse lineage (Mouchiroud, Gautier, and Bernardi 1988).
How can we explain the sharp reduction in the GC dependence of all transition and transversion rates (figs. 5 and 7), without any apparent change in the genome-wide rate of transitions and transversions at approximately 90 MYA (fig. 4)? One plausible scenario involves a combination of the regional variation in substitution rates (driven by either mutational biases [Sueoka 1988; Wolfe, Sharp, and Li 1989], consistent differences in the strength of selection [Bernardi 1986; Eyre-Walker 1999], or frequency of biased gene conversion [Holmquist 1992; Eyre-Walker 1993]) and increased chromosomal rearrangement activities in the mammalian lineage. Imagine that some regions in the genome (for instance centromeric and telomeric regions) have consistently higher rates of AT to GC substitutions. Without chromosomal rearrangements, the DNA sequences in these regions will eventually reach steady state and attain a higher GC content consistent with the local substitution rates. Suppose the rate of chromosomal rearrangements increased significantly after steady state was reached (e.g., at around 90 MYA). Then until a new steady state can be reached (a process which would take several hundred million years), some REs in GC-rich segments would be subject to low ATGC rates, whereas others would continue to have high AT
GC rates, and similarly for REs in GC-poor segments. This would remove the apparent GC-dependence of the substitution rates since 90 MYA and explain why all transition and transversion rates changed their GC-dependences synchronously at that time, whereas the genome-averaged substitution rates hardly changed at all. An alternative scenario is the occurrence of an isolated episode of increased chromosomal rearrangements at around 90 MYA. This would also produce the observed shift in the GC-dependence of the single-nucleotide substitution patterns. However, under this scenario, we would expect the isochore structure existing before 90 MYA to reestablish itself rather than becoming completely degraded after several hundred million years.
This model requires that the rate of chromosomal rearrangements in the amniote ancestor be lower than what it is in eutherian mammals. It predicts that the mammalian lineages with higher rates of chromosomal rearrangements should show more muted isochore structure, with less variation of the GC content across the genome. The mouse lineage does appear to have a higher rate of chromosomal rearrangements and a more muted isochore structure. However, it remains to be established whether this correlation cannot be explained away simply by the increase in the overall rate of molecular evolution in the mouse lineage.
The most intriguing result generated by our analysis is the apparently coincidental increase of the CpG-based transition rate (possibly due to an increase in methylation rate of cytosine) and the decrease in the GC dependence of the single-nucleotide transitions and transversions (possibly due to the increase in chromosomal rearrangement activities). Although we do not have an explanation for this coincidence, we would like to point out a possible causal connection. If the mammalian lineage experienced an invasion of very active transposable elements (such as L1s) at this time, the increase in methylation rate (and thus an increase in the CpG-based transition rate) could be an evolved response to control the expansion of transposable elements (Yoder, Walsh, and Bestor 1997). The increase in the rate of chromosomal rearrangements then can be due to the increase in the number of dispersed homologous sequences generated by the active transposable elements and the consequent increase in the rate of ectopic recombination.
Although the suggested explanation is highly speculative, it has the virtue of being a parsimonious, single-cause explanation for the multiple coincident changes in genomic patterns of molecular evolution in the ancestor of eutherian mammals. Further research will shed light on the validity of this model. The most robust part of this study, however, is the determination of the history of genomic patterns of molecular evolutionthis history will need to be incorporated into the future accounts of the evolution of the human genome.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Arndt, P. F., C. B. Burge, and T. Hwa. 2002. DNA sequence evolution with neighbor-dependent mutation. 6th Annual International Conference on Computational Biology. Washington, D.C J. Comp. Biol. 10:313-322.[ISI]
Bernardi, G. 1986. Compositional constraints and genome evolution. J. Mol. Evol. 24:1-11.[ISI][Medline]
Bernardi, G. 1993. The vertebrate genome: isochores and evolution. Mol. Biol. Evol. 10:186-204.[Abstract]
Bernardi, G. 1990. Compositional transitions in the nuclear genomes of cold-blooded vertebrates. J. Mol. Evol. 31:282-293.[ISI][Medline]
Bernardi, G. 2000. Isochores and the evolutionary genomics of vertebrates. Gene 241:3-17.[CrossRef][ISI][Medline]
Bernardi, G., S. Hughes, and D. Mouchiroud. 1997. The major compositional transitions in the vertebrate genome. J. Mol. Evol. 44:S44-51.[ISI][Medline]
Britten, R. J. 1994. Evidence that most human Alu sequences were inserted in a process that ceased about 30 million years ago. Proc. Natl. Acad. Sci. USA 91:6148-6150.[Abstract]
Britten, R. J., W. F. Baron, D. B. Stout, and E. H. Davidson. 1988. Sources and evolution of human Alu repeated sequences. Proc. Natl. Acad. Sci. USA 85:4770-4774.[Abstract]
Coulondre, C., J. H. Miller, P. J. Farabaugh, and W. Gilbert. 1978. Molecular basis of base substitution hotspots in Escherichia coli. Nature 274:775-780.[ISI][Medline]
Duret, L., M. Semon, G. Piganeau, D. Mouchiroud, and N. Galtier. 2002. Vanishing GC-rich isochores in mammalian genomes. Genetics 162:1837-1847.
Easteal, S. 1999. Molecular evidence for the early divergence of placental mammals. Bioessays 21:1052-1058 discussion 1059.[CrossRef][ISI][Medline]
Eyre-Walker, A. 1993. Recombination and mammalian genome evolution. Proc. R. Soc. Lond. B. Biol. Sci. 252:237-243.[ISI][Medline]
Eyre-Walker, A. 1999. Evidence of selection on silent site base composition in mammals: potential implications for the evolution of isochores and junk DNA. Genetics 152:675-683.
Eyre-Walker, A., and L. D. Hurst. 2001. The evolution of isochores. Nat. Rev. Genet. 2:549-555.[CrossRef][ISI][Medline]
Filipski, J., J. P. Thiery, and G. Bernardi. 1973. An analysis of the bovine genome by Cs2SO4-Ag density gradient centrifugation. J. Mol. Biol. 80:177-197.[ISI][Medline]
Galtier, N., and D. Mouchiroud. 1998. Isochore evolution in mammals: a human-like ancestral structure. Genetics 150:1577-1584.
Gu, Z., H. Wang, A. Nekrutenko, and W. H. Li. 2000. Densities, length proportions, and other distributional features of repetitive sequences in the human genome estimated from 430 megabases of genomic sequence. Gene 259:81-88.[CrossRef][ISI][Medline]
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.[ISI][Medline]
Hedges, S. B., P. H. Parker, C. G. Sibley, and S. Kumar. 1996. Continental breakup and the ordinal diversification of birds and mammals. Nature 381:226-229.[CrossRef][ISI][Medline]
Holmquist, G. P. 1992. Chromosome bands, their chromatin flavors, and their functional features. Am. J. Hum. Genet. 51:17-37.[ISI][Medline]
Hughes, S., D. Zelus, and D. Mouchiroud. 1999. Warm-blooded isochore structure in Nile crocodile and turtle. Mol. Biol. Evol. 16:1521-1527.[Abstract]
Jurka, J. 2000. Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 16:418-420.[CrossRef][ISI][Medline]
Jurka, J., and A. Milosavljevic. 1991. Reconstruction and analysis of human Alu genes. J. Mol. Evol. 32:105-121.[ISI][Medline]
Jurka, J., and T. Smith. 1988. A fundamental division in the Alu family of repeated sequences. Proc. Natl. Acad. Sci. USA 85:4775-4778.[Abstract]
Kapitonov, V., and J. Jurka. 1996. The age of Alu subfamilies. J. Mol. Evol. 42:59-65.[ISI][Medline]
Kimura, M. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc Natl. Acad. Sci. USA 78:454-458.[Abstract]
Kumar, S., and S. B. Hedges. 1998. A molecular timescale for vertebrate evolution. Nature 392:917-920.[CrossRef][ISI][Medline]
Lander, E. S., L. M. Linton, and B. Birren, et al. (256 co-authors). 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.[CrossRef][ISI][Medline]
Mighell, A. J., A. F. Markham, and P. A. Robinson. 1997. Alu sequences. FEBS Lett. 417:1-5.[CrossRef][ISI][Medline]
Mouchiroud, D., C. Gautier, and G. Bernardi. 1988. The compositional distribution of coding sequences and DNA molecules in humans and murids. J. Mol. Evol. 27:311-320.[ISI][Medline]
Murphy, W. J., E. Eizirik, W. E. Johnson, Y. P. Zhang, O. A. Ryder, and S. J. O'Brien. 2001. Molecular phylogenetics and the origins of placental mammals. Nature 409:614-618.[CrossRef][ISI][Medline]
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 1992. Numerical recipes in C, the art of scientific computing. Cambridge University Press, Cambridge.
Rosenberg, M. S., S. Subramanian, and S. Kumar. 2003. Pattern of transitional mutation biases within and among mammalian genomes. Mol. Biol. Evol. 20:988-993.
Smit, A. F., and A. D. Riggs. 1995. MIRs are classic, tRNA-derived SINEs that amplified before the mammalian radiation. Nucleic Acids Res. 23:98-102.[Abstract]
Smit, A. F., G. Toth, A. D. Riggs, and J. Jurka. 1995. Ancestral, mammalian-wide subfamilies of LINE-1 repetitive sequences. J. Mol. Biol. 246:401-417.[CrossRef][ISI][Medline]
Sueoka, N. 1988. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. USA 85:2653-2657.[Abstract]
Webster, M. T., N. G. Smith, and H. Ellegren. 2003. Compositional evolution of noncoding DNA in the human and chimpanzee genomes. Mol. Biol. Evol. 20:278-286.
Wolfe, K. H., P. M. Sharp, and W. H. Li. 1989. Mutation rates differ among regions of the mammalian genome. Nature 337:283-285.[CrossRef][ISI][Medline]
Yi, S., D. L. Ellsworth, and W. H. Li. 2002. Slow molecular clocks in old world monkeys, apes, and humans. Mol. Biol. Evol. 19:2191-2198.
Yoder, J. A., C. P. Walsh, and T. H. Bestor. 1997. Cytosine methylation and the ecology of intragenomic parasites. Trends Genet. 13:335-340.[CrossRef][ISI][Medline]