*Laboratoire de Biométrie, Génétique et Biologie des Populations, Université Claude Bernard, Villeurbanne, France; and Laboratoire Génome, Populations, Interactions, Université Montpellier 2, Montpellier, France
Abstract |
---|
Introduction |
---|
The reason for the TpA scarcity is not clearly understood. However, UpA appears to be a preferential target for ribonucleases (Beutler et al. 1989). Moreover, Beutler et al. (1989) noticed that TpA is more stringently excluded in DNA destined to be expressed in the cytosol (exons of protein-coding genes and tRNA and rRNA genes) than in nontranscribed Y-chromosomal DNA, DNA that is expressed only in mitochondria, and DNA that is degraded within the nucleus (intron DNA). This led the authors to propose that, because of the instability of UpA dinucleotides, there is selective pressure against them in mRNA, tRNA, and rRNA sequences.
Unexpectedly, the deficiencies in CpG and TpA dinucleotides, measured by the ratio of observed to expected dinucleotide frequency (CpGo/e, TpAo/e), vary according to the G+C content of human genes: CpG depletion is lower and TpA depletion higher in G+C-rich than in G+C-poor genes (Hanai and Wada 1988). The same trend has been observed within genes: both the G+C content and the CpGo/e ratio are higher in 5' untranslated regions (UTRs) than in 3' UTRs (Pesole et al. 1997). On a larger scale, it has been shown that CpGo/e is higher in G+C-rich parts of the genome (G+C-rich isochores) than in G+C-poor regions (Bernardi et al. 1985; Aissani and Bernardi 1991; Jabbari and Bernardi 1998). Interestingly, these correlations (positive and negative, respectively) between sequence CpGo/e or TpAo/e and G+C content have also been found in RNA viruses (Rima and McFerran 1997). However, the reason for these correlations was not established.
We propose that the observed TpA deficiency (on one hand) and the observed correlations between G+C content, CpG deficiency, and TpA deficiency (on the other hand) are essentially indirect consequences of the mutational CpG depletion. We first present an intuitive argument explaining the reasons for these effects, and we then quantify them through an improved model of dinucleotide evolution that accounts for overlaps between successive dinucleotides.
Intuitive Argument |
---|
To give an idea of the impact of CpG mutations, we calculated the G+C content and the observed over expected dinucleotide frequencies in a random sequence where 67% of CpG's would have been changed to TpG or CpA. We calculated these values for G+C contents (before CpG mutations) of 40% and 60%. As shown in table 1 , in both cases, CpG depletion induces an apparent TpA deficiency. The CpGo/e ratio is higher than the ratio of final/initial CpG frequencies (0.33). Thus, the CpGo/e ratio underestimates the real mutation pressure on CpG dinucleotides. Finally, the CpGo/e and TpAo/e ratios are, respectively, higher and lower in the G+C-rich than in the G+C-poor sequence, in agreement with observations from the human genome.
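The arithmetic behind this argument can be sketched in a few lines. The function below is only an illustration under stated assumptions that are not taken from the source: A = T and C = G before mutation, neighbouring bases independent, and each mutated CpG becoming TpG or CpA with equal probability, at most once. The name `cpg_decay_ratios` and its parameters are hypothetical.

```python
def cpg_decay_ratios(gc_before, mutated_fraction=0.67):
    """Observed/expected CpG and TpA ratios after CpG decay in a random sequence.

    Assumptions (not from the source): A = T and C = G before mutation,
    independent neighbouring bases, and each mutated CpG becoming TpG or
    CpA with equal probability, at most once.
    """
    c = g = gc_before / 2
    a = t = (1 - gc_before) / 2
    cpg_before = c * g
    tpa = t * a  # under these assumptions TpA is neither created nor destroyed
    lost = mutated_fraction * cpg_before
    cpg_after = cpg_before - lost
    # Half of the lost CpGs turn C into T, the other half turn G into A.
    c_after, g_after = c - lost / 2, g - lost / 2
    t_after, a_after = t + lost / 2, a + lost / 2
    return {
        "gc_after": c_after + g_after,
        "cpg_oe": cpg_after / (c_after * g_after),
        "tpa_oe": tpa / (t_after * a_after),
    }


for gc in (0.40, 0.60):
    print(gc, cpg_decay_ratios(gc))
```

In this sketch the TpA count is unchanged while the T and A frequencies rise, so the expected TpA frequency increases and TpAo/e drops below 1 without any mutation acting on TpA itself.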
A Model of Dinucleotide Evolution |
---|
In equation (1), r(i, j → m, k) is the rate of change from trinucleotide (ijk) to trinucleotide (imk), which can be deduced from the model. For instance, r(A, A → C, T) equals , and r(A, C → T, G) equals 1.(1 - ). The factor b((x, y), (i, j → m, k)) in equation (1) is the balance for dinucleotide (xy) when an (ijk) to (imk) change occurs, i.e., the difference between the number of (xy) dinucleotides included in trinucleotide (imk) and the number of (xy) dinucleotides included in (ijk). For instance, b[(A, C), (A, C → T, G)] is -1 (one AC is lost by changing ACG to ATG), b[(A, A), (A, G → A, A)] is 2, and b[(A, A), (C, C → T, C)] is 0. In words, equation (1) states that the overall change for dinucleotide (xy) is the sum, over all trinucleotides (ijk) and all possible changes for j, of the frequency of that trinucleotide times the probability of that change times the effect of that change on (xy) occurrence. Trinucleotide frequencies can be deduced from dinucleotide ones:
where $n_j = \sum_i d_{ij} = \sum_i d_{ji}$ is the frequency of nucleotide j. Equation (2) assumes that dependencies do not extend farther than two bases, i.e., that the probability of the state of one nucleotide depends only on its neighbors. We checked this approximation with simulations and found it to be very good.
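The displayed forms of equations (1) and (2) are not legible in this text. From the verbal description alone, a plausible reading is given below; it is an assumption, not the authors' typeset form. Here $d_{xy}$ denotes the frequency of dinucleotide (xy) and $t_{ijk}$ that of trinucleotide (ijk).

```latex
\begin{align}
\frac{\mathrm{d}\, d_{xy}}{\mathrm{d}t}
  &= \sum_{i,j,k} \; \sum_{m \neq j}
     t_{ijk} \; r(i,\, j \to m,\, k) \;
     b\big((x, y),\, (i,\, j \to m,\, k)\big) \tag{1} \\
t_{ijk}
  &= \frac{d_{ij}\, d_{jk}}{n_j},
  \qquad n_j = \sum_i d_{ij} = \sum_i d_{ji} \tag{2}
\end{align}
```

The second line is the usual first-order Markov expression, which matches the stated assumption that the state of a nucleotide depends only on its neighbors.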
Equation (1), written for all possible (x, y), forms a system of 16 differential equations that describe the instantaneous dynamics of dinucleotide frequencies under our model. In contrast to the analogous system in models describing nucleotide evolution, this system can hardly be solved analytically, essentially because the expression of $t_{ijk}$ includes products between $d_{ij}$'s, making it nonlinear. However, equation (1) allows one to quickly simulate the evolution of dinucleotides and to deduce equilibrium frequencies given , , and 1. The simulation process is the following:
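The authors' simulation steps and rate parameters are not legible in this text. As a rough illustration only, the sketch below integrates a system of this kind by Euler steps until the dinucleotide frequencies stop changing. Everything specific in it is an assumption: the rate function (a Tamura-like scheme with equilibrium G+C content `theta`, transition/transversion ratio `kappa`, and a multiplier `omega` applied to C → T and G → A changes inside CpG), the parameter names, the first-order expression used for trinucleotide frequencies, and the step size.

```python
from itertools import product

BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def rate(i, j, m, k, theta, kappa, omega):
    """Hypothetical rate of the change (ijk) -> (imk); not the authors' values."""
    target = theta / 2 if m in "GC" else (1 - theta) / 2
    r = target * (kappa if (j, m) in TRANSITIONS else 1.0)
    c_to_t_in_cpg = (j, k, m) == ("C", "G", "T")
    g_to_a_in_cpg = (i, j, m) == ("C", "G", "A")
    return r * omega if (c_to_t_in_cpg or g_to_a_in_cpg) else r


def count(xy, tri):
    """Number of occurrences of dinucleotide xy in a trinucleotide."""
    return (tri[:2] == xy) + (tri[1:] == xy)


def build_terms(theta, kappa, omega):
    """Precompute, for every change (ijk) -> (imk), its rate and its balances."""
    terms = []
    for i, j, k in product(BASES, repeat=3):
        for m in BASES:
            if m == j:
                continue
            old, new = i + j + k, i + m + k
            balance = {}
            for xy in {old[:2], old[1:], new[:2], new[1:]}:
                b = count(xy, new) - count(xy, old)
                if b:
                    balance[xy] = b
            terms.append((i + j, j + k, j, rate(i, j, m, k, theta, kappa, omega), balance))
    return terms


def equilibrium(theta, kappa, omega, dt=0.02, tol=1e-12, max_steps=50000):
    """Euler-integrate the 16 dinucleotide frequencies until they are stationary."""
    pi = {b: theta / 2 if b in "GC" else (1 - theta) / 2 for b in BASES}
    d = {x + y: pi[x] * pi[y] for x, y in product(BASES, repeat=2)}
    terms = build_terms(theta, kappa, omega)
    for _ in range(max_steps):
        n = {j: sum(d[i + j] for i in BASES) for j in BASES}
        delta = dict.fromkeys(d, 0.0)
        for left, right, j, r, balance in terms:
            flux = d[left] * d[right] / n[j] * r  # trinucleotide frequency times rate
            for xy, b in balance.items():
                delta[xy] += flux * b
        for xy in d:
            d[xy] += dt * delta[xy]
        if max(abs(v) for v in delta.values()) < tol:
            break
    return d


def obs_over_exp(d, xy):
    """Observed/expected ratio of dinucleotide xy from dinucleotide frequencies."""
    n_x = sum(d[xy[0] + b] for b in BASES)
    n_y = sum(d[b + xy[1]] for b in BASES)
    return d[xy] / (n_x * n_y)


d = equilibrium(theta=0.5, kappa=2.0, omega=10.0)
print(obs_over_exp(d, "CG"), obs_over_exp(d, "TA"))
```

With `omega=1.0` the sketch reduces to independent sites, so both ratios stay at 1; raising `omega` lets one inspect how the CpG and TpA ratios respond under these assumed rates.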
Results |
---|
Figure 1 displays the relationship between CpG and TpA depletion (observed/expected frequencies) and G+C content in 3,073 human DNA sequences longer than 100 kb. Sequences were retrieved from GenBank release 115 (December 15, 1999) using the ACNUC database (Gouy et al. 1985). As previously reported with smaller data sets, a significant correlation between CpGo/e and G+C content was found: CpG deficiency was lower in G+C-rich regions. A moderate TpA deficiency also appeared, correlated with G+C content as well (but negatively). The regression lines are shown.
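For readers who want to compute these quantities on their own sequences, a minimal sketch follows. It assumes one common definition of the observed/expected ratio (dinucleotide frequency divided by the product of the two nucleotide frequencies); the exact normalization used for figure 1 is not stated here, and the function name `composition` is hypothetical.

```python
def composition(seq):
    """Return G+C content and observed/expected ratios for CpG and TpA.

    Assumes seq has at least two bases; o/e is defined here as
    f(xy) / (f(x) * f(y)), which is an assumption about the normalization.
    """
    seq = seq.upper()
    n = len(seq)
    freq = {b: seq.count(b) / n for b in "ACGT"}
    # Overlapping dinucleotides, one per adjacent pair of positions.
    pairs = [seq[i:i + 2] for i in range(n - 1)]

    def obs_over_exp(xy):
        expected = freq[xy[0]] * freq[xy[1]]
        observed = pairs.count(xy) / len(pairs)
        return observed / expected if expected > 0 else float("nan")

    return {
        "gc": freq["G"] + freq["C"],
        "cpg_oe": obs_over_exp("CG"),
        "tpa_oe": obs_over_exp("TA"),
    }


print(composition("ACGTTTACGCGATATACGGC"))
```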
Interestingly, our model also predicts some TpA deficiency at equilibrium, although no specific mutational mechanism has been assumed with respect to TpA dinucleotides. Moreover, in agreement with the observation on real sequences, our model predicts a negative correlation between TpAo/e and G+C content (fig. 1b). Note that the slope of the correlation is the same in real and in simulated sequences. Thus, variations of the TpAo/e ratio according to the G+C content are probably simply a direct consequence of CpG depletion. Using a related approach but a different model, Bulmer (1986) did not predict any TpA deficiency. One should note, however, that the TpAo/e ratio is lower in real sequences than expected according to our model. Thus, other factors contribute to the deficiency of TpA in human sequences.
In summary, these simulations confirm the qualitative predictions of the simplistic example presented in table 1. In agreement with real data, our model predicts that: (1) an increased mutation rate from CpG to TpG and CpA induces an apparent depletion in TpA, (2) this apparent TpA depletion increases with G+C content, and (3) the CpGo/e ratio underestimates the real mutation pressure on CpG dinucleotides, all the more as the sequence is G+C-rich; as a consequence, (4) CpGo/e and TpAo/e are correlated (positively and negatively, respectively) to G+C content.
Discussion |
---|
Second, this result raises the question of the definition of CpG islands. The essential feature of CpG islands is the absence of methylation (at least in the germ line). Originally, CpG islands were identified as genomic regions that were very rich in cleavable sites for mCpG-sensitive restriction enzymes (Bird 1986). The sequencing of these unmethylated regions revealed relatively high G+C contents and CpGo/e ratios. These features have been used to detect CpG islands by computer analysis of genomic sequences. Classically, CpG islands are identified as DNA regions (200 bp) with a G+C content higher than 50% and a CpGo/e ratio higher than 0.6 (Gardiner-Garden and Frommer 1987). Are these criteria relevant for identifying all nonmethylated islands? The relatively high G+C content in CpG islands can be explained in part by the fact that CpG depletion tends to decrease the G+C content in the rest of the genome. However, it is also possible that this latter property reflects a bias in the original method for the detection of unmethylated DNA regions: the recognition sites for mCpG-sensitive restriction enzymes contain at least 50% G+C and hence are more frequent in G+C-rich than in G+C-poor DNA. It is therefore not clear whether this latter criterion is necessary to identify unmethylated DNA regions. Indeed, it has been shown that nonmethylated islands in fish genomes are G+C-poor (Cross et al. 1991).
The CpGo/e ratio reflects the mutability of CpG's and thus is an indicator of the level of methylation in the germ line. However, as we have shown, in G+C-rich regions, this ratio underestimates the real CpG depletion. According to our model, a CpGo/e ratio of 0.6 with a G+C content of 50% corresponds to a rate of transition at the CpG doublet about 3.5 times as low as that in the rest of the genome (i.e., 1 = 8.0). Figure 2 displays the CpGo/e ratio predicted by our model for different G+C contents and for two values of 1: high methylation rate (1 = 27.6, genomic average rate) and low methylation rate (1 = 8.0, CpG islands rate). This figure shows that according to the classical criteria (CpGo/e > 0.6, G+C > 50%), a highly methylated G+C-rich region (>70%) would be erroneously considered as a CpG island. Conversely, an undermethylated G+C-poor region (<50%) would not be identified as a CpG island. We therefore suggest that the criteria to identify CpG islands should be set according to the G+C content of sequences. According to our simulations (fig. 2), the threshold of CpGo/e as a function of G+C frequency to assess the presence of unmethylated islands can be calculated with the following formula:
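For comparison, the classical fixed-threshold criteria discussed above (G+C content higher than 50%, CpGo/e higher than 0.6, on a region of 200 bp) can be written as a simple check. This is a sketch of those classical criteria only, not of the G+C-dependent threshold, which is not reconstructed here. The function name, the treatment of 200 bp as a minimum length, and the normalization of the observed/expected ratio are assumptions.

```python
def is_classical_cpg_island(seq, min_length=200, min_gc=0.5, min_oe=0.6):
    """Apply fixed G+C and CpG o/e thresholds to one candidate region.

    Assumptions: min_length is treated as a lower bound, and o/e is
    f(CG) / (f(C) * f(G)) computed over the region.
    """
    seq = seq.upper()
    n = len(seq)
    if n < min_length:
        return False
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return False  # no CpG is possible, and o/e would be undefined
    gc = (c + g) / n
    cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    oe = (cpg / (n - 1)) / ((c / n) * (g / n))
    return gc > min_gc and oe > min_oe


print(is_classical_cpg_island("CG" * 100))              # G+C-rich, CpG-rich
print(is_classical_cpg_island("C" * 100 + "G" * 100))   # G+C-rich, almost no CpG
```

Because both thresholds are fixed, such a check cannot distinguish a highly methylated G+C-rich region from an unmethylated one in the way the text argues a G+C-dependent threshold would.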
Third, Beutler et al. (1989) have noted that the TpAo/e ratio is lower in exons of protein-coding genes and tRNA and rRNA genes than in Y-chromosomal DNA, mitochondrial DNA, and introns. They also have shown that UpA has a destabilizing effect on RNAs. This led them to propose that a selective pressure against TpA is acting in DNA sequences destined to be expressed in the cytosol. However, it should be noted that protein-coding regions and tRNA or rRNA genes are characterized by a relatively high G+C content (on average, 55%–60%) compared with the other sequences they analyzed (less than 49% in introns, 44% in mitochondria, and 39% in chromosome Y genomic sequences). According to our model, these differences in G+C content could explain the differences in TpAo/e. To directly test the selectionist hypothesis proposed by Beutler et al. (1989), we compared the TpAo/e ratios in coding regions, introns, and 3' UTRs of human genes. Coding regions and 3' UTRs are part of the mRNA (and hence are destined to be expressed in the cytosol), whereas introns are not. Therefore, according to the selectionist hypothesis, TpAo/e should be lower in coding regions and 3' UTRs than in introns. On the other hand, coding regions are relatively G+C-rich compared with introns and 3' UTRs. Therefore, according to our model, TpAo/e should be lower in coding regions than in 3' UTRs and introns. As shown in table 2, the data fit our model and not the selectionist hypothesis.
Since human genomic DNA is essentially nontranscribed, another prediction of the selectionist hypothesis is that TpAo/e should be lower in 3' UTRs and introns than in genomic sequences. On the contrary, we found that the average TpAo/e ratios in 3' UTRs (0.64 ± 0.20) and introns (0.68 ± 0.13) were very close to those of large (>100 kb) genomic sequences of similar base composition (0.67 and 0.66, respectively, calculated according to the regression slope presented in fig. 1).
Therefore, contrary to what has been proposed (Beutler et al. 1989), there is no evidence that TpA dinucleotides are more counterselected in exons than in introns or in transcribed than in nontranscribed DNA. As shown with our model, CpG depletion induces an apparent TpA depletion that depends on the G+C content of sequences. The differences in the TpAo/e ratios between the different sequences analyzed by Beutler et al. (1989) are merely a consequence of their differences in G+C content. However, as mentioned previously, the CpG depletion does not totally explain the observed TpA deficiency in the human genome. Karlin and Mràzek (1997) proposed that the deficiency in TpA might be due to its low thermodynamic stacking energy in DNA. They also suggested that because of the presence of TpA as part of many regulatory signals (e.g., TATA box, polyadenylation signal), TpA suppression might help to avoid inappropriate binding of regulatory factors. Although these are useful working hypotheses, they are still speculative, and the reason for the TpA depletion remains to be determined.
Acknowledgements |
---|
Footnotes |
---|
1 Keywords: dinucleotides, CpG, TpA, CpG islands, methylation, isochores
2 Address for correspondence and reprints: Laurent Duret, Laboratoire de Biométrie, Génétique et Biologie des Populations, UMR CNRS 5558, Université Claude Bernard, 43 Boulevard du 11 Novembre 1918, 69622 Villeurbanne cedex, France. E-mail: duret{at}biomserv.univ-lyon1.fr
Literature Cited |
---|
Aissani, B., and G. Bernardi. 1991. CpG islands: features and distribution in the genome of vertebrates. Gene 106:173–183.
Antequera, F., and A. Bird. 1999. CpG islands as genomic footprints of promoters that are associated with replication origins. Curr. Biol. 9:R661–R667.
Bernardi, G., B. Olofsson, J. Filipski, M. Zerial, J. Salinas, G. Cuny, M. Meunier-Rotival, and F. Rodier. 1985. The mosaic genome of warm-blooded vertebrates. Science 228:953–958.
Beutler, E., T. Gelbart, J. H. Han, J. A. Koziol, and B. Beutler. 1989. Evolution of the genome and the genetic code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage. Proc. Natl. Acad. Sci. USA 86:192–196.
Bird, A. P. 1980. DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res. 8:1499–1504.
Bird, A. P. 1986. CpG-rich islands and the function of DNA methylation. Nature 321:209–213.
Bulmer, M. 1986. Neighboring base effects on substitution rates in pseudogenes. Mol. Biol. Evol. 3:322–329.
Cargill, M., D. Altshuler, J. Ireland et al. (17 co-authors). 1999. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22:231–238.
Cross, S., P. Kovarik, J. Schmidtke, and A. Bird. 1991. Non-methylated islands in fish genomes are GC-poor. Nucleic Acids Res. 19:1469–1474.
Gardiner-Garden, M., and M. Frommer. 1987. CpG islands in vertebrate genomes. J. Mol. Biol. 196:261–282.
Giannelli, F., T. Anagnostopoulos, and P. M. Green. 1999. Mutation rates in human. II. Sporadic mutation-specific rates and rate of detrimental human mutations inferred from Hemophilia B. Am. J. Hum. Genet. 65:1580–1587.
Gouy, M., C. Gautier, M. Attimonelli, C. Lanave, and G. Di Paola. 1985. ACNUC, a portable retrieval system for nucleic acid sequences databases: logical and physical designs and usage. Comp. Appl. Biosci. 1:167–172.
Halushka, M. K., J.-B. Fan, K. Bentley, L. Hsie, N. Shen, A. Weder, R. Cooper, R. Lipshutz, and A. Chakravarti. 1999. Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet. 22:239–247.
Hanai, R., and A. Wada. 1988. The effects of guanine and cytosine variation on dinucleotide frequency and amino acid composition in the human genome. J. Mol. Evol. 27:321–325.
Jabbari, K., and G. Bernardi. 1998. CpG doublets, CpG islands and Alu repeats in long human DNA sequences from different isochore families. Gene 224:123–128.
Karlin, S., and J. Mràzek. 1997. Compositional differences within and between eukaryotic genomes. Proc. Natl. Acad. Sci. USA 94:10227–10232.
Pesole, G., S. Luini, G. Grillo, and C. Saccone. 1997. Structural and compositional features of untranslated regions of eukaryotic mRNAs. Gene 205:95–102.
Rima, B. K., and N. V. McFerran. 1997. Dinucleotide and stop codon frequencies in single-stranded RNA viruses. J. Gen. Virol. 78:2859–2870.
Sved, J., and A. Bird. 1990. The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc. Natl. Acad. Sci. USA 87:4692–4696.
Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. 9:678–687.
Yang, Z. 1995. On the general reversible Markov process model of nucleotide substitution: a reply to Saccone et al. J. Mol. Evol. 41:254–255.