Centre for the Study of Evolution and School of Biological Sciences, University of Sussex, Brighton, England
Abstract
Introduction
We can test whether synonymous codon bias is caused by mutation bias using population genetic data. Let u be the mutation rate from G : C base pairs to A : T base pairs, and let v be the mutation rate in the opposite direction. If mutation rates are low (i.e., $N_e u \ll 1$ and $N_e v \ll 1$, where $N_e$ is the effective population size) and constant, and no other evolutionary forces affect base composition, then the equilibrium frequency of G : C base pairs in a sequence is f = v/(v + u) (Sueoka 1962). Therefore, the probability that we will observe an A or T mutation segregating at a site which was ancestrally G or C, henceforth referred to as a GC→AT mutation, is $M_{GC \to AT} = fuH(n)$, where H(n) is the probability of observing a neutral mutation in a sample of n sequences. Likewise, the probability of observing a G or C mutation at a site which was ancestrally A or T, henceforth referred to as an AT→GC mutation, is $M_{AT \to GC} = (1 - f)vH(n)$. It is not difficult to show that $M_{GC \to AT} = M_{AT \to GC}$; i.e., the number of AT→GC mutations segregating in a sample is expected to be equal to the number of GC→AT mutations if mutation bias is the sole cause of synonymous codon bias (Eyre-Walker 1997, 1999).
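The equality of the two probabilities follows from substituting the equilibrium value f = v/(v + u) into both expressions. The short derivation below uses only the symbols already defined in the text:

```latex
\begin{align*}
M_{GC \to AT} &= f\,u\,H(n) = \frac{v}{v+u}\,u\,H(n) = \frac{uv}{u+v}\,H(n),\\
M_{AT \to GC} &= (1-f)\,v\,H(n) = \frac{u}{v+u}\,v\,H(n) = \frac{uv}{u+v}\,H(n).
\end{align*}
```

Both fluxes reduce to the same quantity, so under mutation bias alone the two classes of segregating mutation are expected in equal numbers, whatever the values of u and v.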
A recent analysis showed that there were more GC→AT mutations than AT→GC mutations segregating at synonymous sites in mammalian MHC genes, suggesting that mutation bias was not solely responsible for synonymous codon bias (Eyre-Walker 1999). However, it was not possible to demonstrate conclusively that the data conformed to the infinite-sites model (the requirement that mutation rates are low), and the results lacked generality, since for each species, all the studied genes came from a small region of a single chromosome.
A large number of single-nucleotide polymorphisms (SNPs) from human protein-coding genes, dispersed throughout the genome, have recently been published (Cargill et al. 1999; Hacia et al. 1999). For many of these SNPs, the corresponding sites have been sequenced in chimpanzees. Since the divergence between humans and chimpanzees is low (0.015 at fourfold-degenerate synonymous sites; Eyre-Walker and Keightley 1999), the chimpanzee sequence can be used to infer the ancestral state in humans (i.e., whether an SNP segregating X and Y is due to an X→Y or a Y→X mutation). Furthermore, the average nucleotide diversity at fourfold-degenerate sites in human genes is sufficiently low (0.001; Li and Stadler 1991; Cargill et al. 1999) for the data to conform to the infinite-sites model, even at CpG dinucleotides, which mutate approximately 10–20 times as fast as other sites (Bulmer 1986; Sved and Bird 1990).
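The polarization step described above can be sketched in a few lines. This is a minimal illustration, not the procedure actually used for the data: the function name `classify_snp` and the return labels are hypothetical, and the chimpanzee base is simply assumed to be the ancestral state whenever it matches one of the two human alleles.

```python
STRONG = {"G", "C"}  # bases forming G : C pairs
WEAK = {"A", "T"}    # bases forming A : T pairs


def classify_snp(allele_a, allele_b, chimp_base):
    """Polarize a biallelic human SNP by parsimony using the chimpanzee base.

    Returns "GC->AT", "AT->GC", "other" (e.g., G<->C or A<->T changes),
    or None when the chimpanzee base matches neither human allele.
    """
    alleles = {allele_a.upper(), allele_b.upper()}
    ancestral = chimp_base.upper()
    if len(alleles) != 2 or ancestral not in alleles:
        return None  # the ancestral state cannot be inferred by parsimony
    derived = (alleles - {ancestral}).pop()
    if ancestral in STRONG and derived in WEAK:
        return "GC->AT"
    if ancestral in WEAK and derived in STRONG:
        return "AT->GC"
    return "other"
```

For example, a C/T polymorphism is scored as GC→AT if the chimpanzee carries C and as AT→GC if it carries T.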
In this paper, we test the mutation bias hypothesis (i.e., whether mutation bias is responsible for synonymous codon bias) in humans by analyzing the pattern of polymorphism in synonymous SNPs.
Materials and Methods
CpG Islands
CpG islands were identified by calculating the number of CpGs expected from the base composition and comparing this number with the number observed. If the observed/expected ratio was >50%, the SNP was inferred to be in a CpG island. For CpG analysis, we used the longest available contiguous sequence. If the sequence length was >600 bp, then a sliding-window analysis was performed (window length = 300 bp, step length = 50 bp), and the maximum observed/expected value among the windows overlapping the SNP was taken. For shorter contiguous sequences, we took the observed/expected value for the entire sequence.
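The sliding-window procedure can be sketched as follows. Several details are assumptions made for illustration rather than taken from the text: the expected CpG count is computed as (number of C × number of G)/length, a final window is added flush with the end of the sequence so that every position is covered, and the function names are hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio, with expected = (#C * #G) / length (assumed)."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    observed = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    expected = n_c * n_g / len(seq)
    return observed / expected


def snp_cpg_ratio(seq, snp_pos, window=300, step=50, min_len=600):
    """Maximum observed/expected ratio over windows containing snp_pos (0-based).

    Sequences of min_len bp or fewer are scored as a whole.
    """
    if len(seq) <= min_len:
        return cpg_obs_exp(seq)
    starts = list(range(0, len(seq) - window + 1, step))
    last = len(seq) - window
    if starts[-1] != last:
        starts.append(last)  # assumption: make sure the 3' end is covered
    return max(cpg_obs_exp(seq[s:s + window])
               for s in starts if s <= snp_pos < s + window)


def in_cpg_island(seq, snp_pos, threshold=0.5):
    """True if the SNP falls in a region whose ratio exceeds the threshold."""
    return snp_cpg_ratio(seq, snp_pos) > threshold
```

The threshold of 0.5 corresponds to the >50% criterion given above; `snp_pos` must lie within the sequence.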
Results and Discussion
Hypermutation
Hypermutable sites potentially have two effects: they could lead to problems with parsimony, and they could violate the infinite-sites assumption. In each case, if the hypermutable sites had elevated rates of AT→GC mutation, they would tend to generate an excess of GC→AT mutations, as we see in the data. The reasons for this rather counterintuitive behavior are fully discussed elsewhere (Eyre-Walker 1998, 1999). However, three lines of evidence suggest that hypermutable sites were not responsible for the excess of GC→AT mutations we observed. First, we are not aware of any evidence of AT→GC hypermutable sites in mammals; the one well-known class of hypermutable sites, CpG dinucleotides, is expected to cause a bias in the opposite direction of that required to explain the data: CpG dinucleotides generate C→T and G→A transitions at elevated rates, and such mutations will tend to appear as T→C and A→G changes, respectively, in the data (Eyre-Walker 1998, 1999). Second, it is possible to demonstrate that the excess in GC→AT mutations is not due to a problem with parsimony, since we can dispense with the chimpanzee sequence and infer the direction of mutation from the frequencies of the alleles segregating at a site; the rarer allele is assumed to be more recent. This method is unbiased under the null hypothesis (synonymous codon bias is caused by mutation bias) and the infinite-sites assumption (Eyre-Walker 1999). Using allele frequencies, we infer that there have been 65 GC→AT mutations, compared with 37 AT→GC mutations over all genes (P = 0.007) and 45 GC→AT versus 15 AT→GC mutations (P = 0.0001) for genes with GC3 > 0.6; the sample sizes are smaller because frequency data are available for only a subset of the SNPs. Third, the infinite-sites assumption would only be seriously compromised in this context if the rate of mutation were some 100 times as high as the average nucleotide diversity observed (Eyre-Walker 1999), and with that level of hypermutability, we would expect to see an excess of GC→AT substitutions inferred by parsimony (Eyre-Walker 1998) over even short timescales, such as the divergence along the human lineage since we split from chimpanzees (Eyre-Walker and Keightley 1999). In a sample of 28 genes sequenced in humans, chimpanzees, and gorillas (Eyre-Walker and Keightley 1999), there have been identical numbers of GC→AT and AT→GC synonymous substitutions along the human lineage (22 substitutions in each direction inferred by parsimony, 18 GC→AT and 15 AT→GC substitutions for genes with GC3 > 0.6), just as we expect for a sequence of stationary base composition.
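The frequency-based inference and the comparison of the two mutation counts can be illustrated with a short sketch. The text does not name the statistical test used, so the exact two-tailed binomial (sign) test below is an assumption, as are the function names; under the null hypothesis the two classes of mutation are equally likely.

```python
from math import comb


def infer_direction(freqs):
    """Return (ancestral, derived) for a biallelic site, taking the rarer
    allele as derived; returns None if the two frequencies are tied.

    freqs is a dict with exactly two entries, e.g. {"C": 0.9, "T": 0.1}.
    """
    (a, fa), (b, fb) = freqs.items()
    if fa == fb:
        return None
    return (a, b) if fa > fb else (b, a)


def two_tailed_sign_test(k, n):
    """Exact two-tailed binomial P-value for k of n events, null p = 0.5."""
    tail = max(k, n - k)
    upper = sum(comb(n, i) for i in range(tail, n + 1)) / 2 ** n
    return min(1.0, 2 * upper)


# Counts quoted in the text: 65 vs. 37 (all genes), 45 vs. 15 (GC3 > 0.6).
p_all = two_tailed_sign_test(65, 65 + 37)
p_gc_rich = two_tailed_sign_test(45, 45 + 15)
```

If the reported P-values came from a test of this kind, `p_all` and `p_gc_rich` should be of the same order as the values of 0.007 and 0.0001 given above.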
Mutation Pattern
The excess of GC→AT mutations segregating in human SNPs could be the result of a recent change in the mutation pattern from a GC bias to an AT bias, but this seems unlikely for three reasons. First, a change in the mutation pattern would manifest itself as an excess of GC→AT substitutions over AT→GC substitutions unless the change in the mutation pattern had been very recent. As we showed above, there appear to have been similar numbers of GC→AT and AT→GC substitutions along the human lineage since the split from chimpanzees. Second, a dramatic change in the mutation pattern is required to explain the data. For example, there are 18 GC→AT mutations and 4 AT→GC mutations for the SNPs in exons with GC3 between 70% and 80%, and the change in the mutation process needed to cause this pattern would eventually reduce GC3 to 40% (calculated using eq. 8 in Eyre-Walker [1997]). Third, we would require several independent changes in the mutation pattern in the same direction to explain the excess of GC→AT synonymous polymorphisms in the MHC genes of other mammals (Eyre-Walker 1999).
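The order of magnitude of the 40% figure can be checked with a back-of-envelope calculation. The sketch below is not a reproduction of eq. 8 in Eyre-Walker (1997); it assumes neutrality, the infinite-sites model, equal detectability of both mutation classes, and a current GC3 of 0.75 (the midpoint of the 70%–80% class). Under those assumptions, the ratio of segregating GC→AT to AT→GC mutations estimates fu/[(1 - f)v], from which u/v and the new equilibrium v/(u + v) follow.

```python
def implied_equilibrium_gc(n_gc_to_at, n_at_to_gc, current_gc):
    """Equilibrium G+C content implied by the observed mutation counts,
    assuming they reflect a new, purely mutational process."""
    ratio = n_gc_to_at / n_at_to_gc                    # estimates f*u / ((1 - f)*v)
    u_over_v = ratio * (1 - current_gc) / current_gc   # solve for u/v
    return 1 / (1 + u_over_v)                          # v / (u + v)


# Counts from the text; 0.75 is an assumed midpoint for the GC3 class.
gc3_equilibrium = implied_equilibrium_gc(18, 4, 0.75)
```

With these inputs u/v = 1.5, so the implied equilibrium GC3 is 1/(1 + 1.5) = 0.4, in line with the figure quoted above.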
Selection and Biased Gene Conversion
It therefore seems that mutation bias is not responsible for synonymous codon bias in human genes. However, there are at least two other possibilities: natural selection and biased gene conversion. Biased gene conversion is a process which leads to the biased transmission of alleles; for example, if biased gene conversion is very strong and G+C-biased, 100% of all gametes from a C/T heterozygote will be C. Both selection and biased gene conversion are expected to generate an excess of GC→AT mutations. This can be seen using the following simple argument: Let us imagine there is no mutation bias, and selection has elevated the G+C content of a sequence to 80%. Since there is no mutation bias, 80% of the new mutations will be GC→AT, and 20% will be AT→GC (ignoring G→C and A→T mutations). Unfortunately, the situation is more complicated, because selection may affect the probability of detecting a mutation. For example, if directional selection had elevated the G+C content to 80% in the previous example, each GC→AT mutation would be slightly deleterious, while each AT→GC mutation would be slightly advantageous; we would therefore expect to detect the AT→GC mutations more readily, because they would segregate at slightly higher frequencies, on average, than the GC→AT mutations.
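The simple argument can be made concrete with a small calculation of the share of new mutations that are GC→AT for a given G+C content and pair of mutation rates. This sketch covers only the input of new mutations; it deliberately ignores the effect of selection on detectability described in the paragraph above, and the function name is hypothetical.

```python
def proportion_new_gc_to_at(gc_content, u, v):
    """Fraction of new mutations that are GC->AT.

    u is the GC->AT mutation rate and v the AT->GC rate; G<->C and
    A<->T mutations are ignored, as in the text.
    """
    gc_to_at_flux = gc_content * u
    at_to_gc_flux = (1 - gc_content) * v
    return gc_to_at_flux / (gc_to_at_flux + at_to_gc_flux)


# No mutation bias (u = v) with G+C raised to 80% by selection.
selected = proportion_new_gc_to_at(0.8, 1.0, 1.0)

# Mutation bias alone: G+C content sits at v/(u + v), here 0.25 for u = 3, v = 1.
neutral = proportion_new_gc_to_at(0.25, 3.0, 1.0)
```

The first case gives the 80% of the example in the text; the second gives one half, the expectation when mutation bias is the only force acting on base composition.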
To demonstrate formally that selection and biased gene conversion are expected to generate an excess of GC→AT mutations, we derived the expected proportion of GC→AT mutations segregating in a sample of sequences, $P_{GC \to AT}$, under two models: a model of weak directional selection, which is equivalent to a model of biased gene conversion (Nagylaki 1983); and a model of strong stabilizing selection. Let f′ (or f″) be the frequency of sites fixed for G : C base pairs, u be the mutation rate from G : C to A : T base pairs, and v be the mutation rate in the opposite direction. We will assume that selection or biased gene conversion favors high G+C. First, consider weak directional selection and biased gene conversion, two processes which can be described by a single parameter s, since they are dynamically identical (Nagylaki 1983). Under semidominant directional selection, s is the strength of selection in favor of G : C base pairs, and under biased gene conversion, s is the strength of biased gene conversion, where (s + 1)/2 of the alleles from a G : C/A : T heterozygote are G or C. If mutation rates are low enough that the infinite-sites assumption holds (i.e., $N_e u \ll 1$, $N_e v \ll 1$), the equilibrium proportion of sites fixed for G : C in a diploid is
Noncoding DNA
It is likely that whatever affects synonymous codon bias also affects large regions of the genome, since in mammals synonymous codon bias is correlated with the base composition of the chromosomal region in which the gene is situated; i.e., GC3 is strongly correlated with the G+C content of the 5' and 3' UTRs, introns, and isochores (Bernardi et al. 1985; Clay et al. 1996). As expected, there is an excess of GC→AT mutations segregating in intron, 3' UTR, and anonymous STS sequences (i.e., STS sequences which are not known to be within or flanking a protein-coding sequence), particularly in those sequences which are G+C-rich (table 2). It therefore seems that either natural selection or biased gene conversion also affects the base composition of G+C-rich noncoding DNA and therefore has a profound effect on the structure of the human genome, since large sections of the genome are G+C-rich, while others are G+C-poor (Bernardi 1995).
Acknowledgements
Footnotes
1 Abbreviations: EST, expressed sequence tag; MHC, major histocompatibility complex; SNP, single-nucleotide polymorphism; STS, sequence tagged site; UTR, untranslated region.
2 Keywords: human, synonymous codons, mutation bias
3 Address for correspondence and reprints: Adam Eyre-Walker, Centre for the Study of Evolution and School of Biological Sciences, University of Sussex, Brighton BN1 9QG, United Kingdom. a.c.eyre-walker{at}sussex.ac.uk
Literature Cited
Bernardi, G. 1995. The human genome: organization and evolutionary history. Annu. Rev. Genet. 29:445–476
Bernardi, G., B. Olofsson, J. Filipski, M. Zerial, J. Salinas, G. Cuny, M. Meunier-Rotival, and F. Rodier. 1985. The mosaic genome of warm blooded vertebrates. Science 228:953–958
Bulmer, M. 1986. Neighbouring base effects on substitution rates in pseudogenes. Mol. Biol. Evol. 3:322–329
Bulmer, M. 1991. The selection-mutation-drift theory of synonymous codon usage. Genetics 129:897–907
Cargill, M., D. Altshuler, J. Ireland et al. (17 co-authors). 1999. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22:231–238
Clay, O., S. Caccio, Z. Zoubak, D. Mouchiroud, and G. Bernardi. 1996. Human coding and noncoding DNA: compositional correlations. Mol. Phylogenet. Evol. 5:2–12
Duret, L., and D. Mouchiroud. 1999. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc. Natl. Acad. Sci. USA 96:4482–4487
Eyre-Walker, A. 1997. Differentiating selection and mutation bias. Genetics 147:1983–1987
Eyre-Walker, A. 1998. Problems with parsimony in sequences of biased base composition. J. Mol. Evol. 47:686–690
Eyre-Walker, A. 1999. Evidence of selection on silent site base composition in mammals: potential implications for the evolution of isochores and junk DNA. Genetics 152:675–683
Eyre-Walker, A., and P. D. Keightley. 1999. High genomic deleterious mutation rates in hominids. Nature 397:344–347
Filipski, J. 1987. Correlation between molecular clock ticking, codon usage, fidelity of DNA repair, chromosome banding and chromatin compactness in germline cells. FEBS Lett. 217:184–186
Gouy, M., and C. Gautier. 1982. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 10:7055–7074
Hacia, J. G., J.-B. Fan, O. Ryder et al. (16 co-authors). 1999. Determination of ancestral alleles for human single-nucleotide polymorphisms using high-density oligonucleotide arrays. Nat. Genet. 22:164–167
Ikemura, T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2:13–34
Lewis, J., and A. P. Bird. 1991. DNA methylation and chromatin structure. FEBS Lett. 205:155–159
Li, W.-H., and L. A. Stadler. 1991. Low nucleotide diversity in man. Genetics 129:513–523
Li, W.-H., M. Tanimura, and P. M. Sharp. 1987. An evaluation of the molecular clock hypothesis using mammalian DNA sequences. J. Mol. Evol. 25:330–342
Nagylaki, T. 1983. Evolution of a finite population under gene conversion. Proc. Natl. Acad. Sci. USA 80:6278–6281
Sawyer, S. A., and D. L. Hartl. 1992. Population genetics of polymorphism and divergence. Genetics 132:1161–1176
Sharp, P. M., C. J. Burgess, A. T. Lloyd, and K. J. Mitchell. 1992. Selective use of termination and variation in codon choice. Pp. 397–425 in D. L. Hatfield, B. J. Lee, and R. M. Pirtle, eds. Transfer RNA in protein synthesis. CRC Press, Boca Raton, Fla
Sueoka, N. 1962. On the genetic basis of variation and heterogeneity of DNA base composition. Proc. Natl. Acad. Sci. USA 48:582–592
Sueoka, N. 1988. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. USA 85:2653–2657
Sved, J., and A. P. Bird. 1990. The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc. Natl. Acad. Sci. USA 87:4692–4696
Wolfe, K. H., P. M. Sharp, and W.-H. Li. 1989. Mutation rates differ among regions of the mammalian genome. Nature 337:283–285