Synonymous Codon Bias Is Not Caused by Mutation Bias in G+C-Rich Genes in Humans

Nick G. C. Smith and Adam Eyre-Walker

Centre for the Study of Evolution and School of Biological Sciences, University of Sussex, Brighton, England


    Abstract
 
It has been suggested that synonymous codon bias is a consequence of mutation bias in mammals. We tested this hypothesis in humans using single-nucleotide polymorphism data. We found a pattern of polymorphism which was inconsistent with the mutation bias hypothesis in G+C-rich genes. However, the data were consistent with the action of natural selection or biased gene conversion. Similar patterns of polymorphism were also observed in noncoding DNA, suggesting that natural selection or biased gene conversion may affect large tracts of the human genome.


    Introduction
 
It is well established that selection acts on synonymous codon use in many groups of organisms, including bacteria, fungi, and insects (Sharp et al. 1992). However, in those groups of organisms, the degree of synonymous codon bias is correlated with the level of gene expression (Gouy and Gautier 1982; Ikemura 1985; Duret and Mouchiroud 1999), and this is not observed in mammals (Duret and Mouchiroud 1999). Instead, the pattern of synonymous codon use is correlated with the G+C content of the genomic region in which the gene resides; genes in the G+C-rich regions of the genome preferentially use G- and C-ending codons, while those in the A+T-rich regions use A- and T-ending codons (Bernardi et al. 1985; Bernardi 1995). This pattern has led to the suggestion that synonymous codon bias is caused by mutation bias in mammals (Filipski 1987; Sueoka 1988; Wolfe, Sharp, and Li 1989).

We can test whether synonymous codon bias is caused by mutation bias using population genetic data. Let u be the mutation rate from G : C base pairs to A : T base pairs, and let v be the mutation rate in the opposite direction. If mutation rates are low (i.e., N_e u << 1 and N_e v << 1, where N_e is the effective population size) and constant, and no other evolutionary forces affect base composition, then the equilibrium frequency of G : C base pairs in a sequence is f = v/(v + u) (Sueoka 1962). Therefore, the probability that we will observe an A or T mutation segregating at a site which was ancestrally G or C, henceforth referred to as a GC->AT mutation, is M_GC->AT = f u H(n), where H(n) is the probability of observing a neutral mutation in a sample of n sequences. Likewise, the probability of observing a G or C mutation at a site which was ancestrally A or T, henceforth referred to as an AT->GC mutation, is M_AT->GC = (1 - f) v H(n). It is not difficult to show that M_GC->AT = M_AT->GC; i.e., the number of AT->GC mutations segregating in a sample is expected to be equal to the number of GC->AT mutations if mutation bias is the sole cause of synonymous codon bias (Eyre-Walker 1997, 1999).
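The equality follows in one line once the equilibrium frequency f = v/(v + u) is substituted. A short worked version, using only the symbols defined above:

```latex
\begin{align*}
M_{GC \to AT} &= f\,u\,H(n) = \frac{v}{u+v}\,u\,H(n) = \frac{uv}{u+v}\,H(n),\\
M_{AT \to GC} &= (1-f)\,v\,H(n) = \frac{u}{u+v}\,v\,H(n) = \frac{uv}{u+v}\,H(n).
\end{align*}
```

Both expressions reduce to the same quantity, so under mutation bias alone the expected ratio of GC->AT to AT->GC segregating mutations is 1, whatever the values of u and v.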

A recent analysis showed that there were more GC->AT mutations than AT->GC mutations segregating at synonymous sites in mammalian MHC genes, suggesting that mutation bias was not solely responsible for synonymous codon bias (Eyre-Walker 1999). However, it was not possible to demonstrate conclusively that the data conformed to the infinite-sites model (the requirement that mutation rates are low), and the results lacked generality, since for each species, all the studied genes came from a small region of a single chromosome.

A large number of single-nucleotide polymorphisms (SNPs) from human protein-coding genes, dispersed throughout the genome, have recently been published (Cargill et al. 1999; Hacia et al. 1999). For many of these SNPs, the corresponding sites have been sequenced in chimpanzees. Since the divergence between humans and chimpanzees is low (~0.015 at fourfold-degenerate synonymous sites; Eyre-Walker and Keightley 1999), the chimpanzee sequence can be used to infer the ancestral state in humans (i.e., whether an SNP segregating X and Y is due to an X->Y or a Y->X mutation). Furthermore, the average nucleotide diversity at fourfold-degenerate sites in human genes is sufficiently low (~0.001; Li and Stadler 1991; Cargill et al. 1999) for the data to conform to the infinite-sites model, even at CpG dinucleotides, which mutate approximately 10–20 times as fast as other sites (Bulmer 1986; Sved and Bird 1990).

In this paper, we test the mutation bias hypothesis (i.e., whether mutation bias is responsible for synonymous codon bias) in humans by analyzing the pattern of polymorphism in synonymous SNPs.


    Materials and Methods
 
Data
SNP data from two recent studies were obtained from their respective websites (http://waldo.wi.mit.edu/cvar_snps/ for Cargill et al. [1999]; http://genome.nhgri.nih.gov/apes/ for Hacia et al. [1999]). Both data sets provided the following information: the nucleotides segregating in humans, the nucleotide(s) at the same site in chimpanzee sequences, reference names for the sequences containing the SNPs, and the nucleotide sequences flanking the SNP. By assuming the chimpanzee nucleotide to be the ancestral state, SNPs were classified as GC->AT or AT->GC according to the mutation which generated them. We ignored those SNPs at which A/T or G/C were segregating and the few sites for which ancestral state reconstruction was ambiguous (if the chimpanzee site was polymorphic or if the chimpanzee nucleotide differed from both human nucleotides). We obtained the sequence containing each SNP by using the accession number from the Human SNP database (www-genome.wi.mit.edu/SNP/human/index.html) or by using the SNP-flanking sequences in a BLAST search. Annotations in the GenBank sequences allowed us to classify the SNPs into four classes: there were 125 synonymous, 60 intron, 60 3' untranslated region (UTR), and 49 anonymous STS SNPs. All of the synonymous and intron SNPs came from Cargill et al. (1999), while all of the STS SNPs came from Hacia et al. (1999); exactly half of the 3' UTR SNPs came from each data set. STS SNPs not found in exons, introns, ESTs, or UTRs were categorized as anonymous; we ignored SNPs for which no clear annotation was available (e.g., unannotated EST sequences). For each synonymous SNP, we calculated the third position G+C content (GC3) of the exon in which the SNP was contained, or GC3 for the complete coding sequence if the intron/exon boundaries were not known. For SNPs in introns and UTRs, we calculated the G+C content for the intron or UTR concerned, and for STSs, we used the longest sequence available, up to 500 bp either side of the SNP.
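The classification rule described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the authors' code: the function name `classify_snp` and the set-based representation of alleles are assumptions made for the example.

```python
from typing import Optional, Set

STRONG = {"G", "C"}  # bases forming G:C pairs
WEAK = {"A", "T"}    # bases forming A:T pairs


def classify_snp(human_alleles: Set[str], chimp_alleles: Set[str]) -> Optional[str]:
    """Return 'GC->AT', 'AT->GC', or None if the SNP is excluded."""
    # Exclude sites that are not biallelic in humans or are polymorphic in chimpanzee.
    if len(human_alleles) != 2 or len(chimp_alleles) != 1:
        return None
    ancestral = next(iter(chimp_alleles))
    # Exclude sites where the chimpanzee base matches neither human allele.
    if ancestral not in human_alleles:
        return None
    derived = next(iter(human_alleles - {ancestral}))
    if ancestral in STRONG and derived in WEAK:
        return "GC->AT"
    if ancestral in WEAK and derived in STRONG:
        return "AT->GC"
    # A/T and G/C polymorphisms leave G+C content unchanged and are ignored.
    return None
```

For example, a human C/T polymorphism with C in the chimpanzee is scored as GC->AT, whereas the same polymorphism with T in the chimpanzee is scored as AT->GC.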

CpG Islands
CpG islands were identified by calculating the expected number of CpGs based on the base composition and comparing this number with the number observed. If the observed/expected ratio was >50%, the SNP was inferred to be in a CpG island. For CpG analysis, we used the longest available contiguous sequence. If the sequence length was >600 bp, then a sliding-window analysis was performed (window length = 300 bp, step length = 50 bp), and the maximum observed/expected value overlapping the SNP was taken. For shorter contiguous sequences, we took the observed/expected value for the entire sequence.
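A minimal sketch of this procedure is given below. The text does not spell out the formula for the expected number of CpGs; the sketch assumes the common definition (number of C times number of G, divided by sequence length). The function names and the fallback for SNPs lying beyond the last full window are likewise assumptions.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio; expected = (#C * #G) / length (assumed definition)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    expected = seq.count("C") * seq.count("G") / n
    if expected == 0:
        return 0.0
    observed = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return observed / expected


def cpg_score_at_snp(seq: str, snp_pos: int, window: int = 300,
                     step: int = 50, min_len: int = 600) -> float:
    """Maximum observed/expected value over windows overlapping the SNP (0-based position)."""
    if len(seq) <= min_len:
        return cpg_obs_exp(seq)  # short sequences: use the whole sequence
    scores = [
        cpg_obs_exp(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
        if start <= snp_pos < start + window
    ]
    if not scores:
        # Assumed fallback: SNP lies past the last full window, so use the final window.
        scores = [cpg_obs_exp(seq[-window:])]
    return max(scores)


def in_cpg_island(seq: str, snp_pos: int) -> bool:
    return cpg_score_at_snp(seq, snp_pos) > 0.5  # threshold from the text
```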


    Results and Discussion
 
Synonymous SNPs
There are 125 GC<->AT synonymous SNPs in the data set of Cargill et al. (1999) for which the chimpanzee sequence has been determined and a complete human cDNA sequence exists. As in a previous analysis of synonymous SNPs in MHC genes (Eyre-Walker 1999), there is a clear excess of GC->AT mutations segregating at synonymous sites (88 GC->AT and 37 AT->GC mutations, P < 0.00001). The excess of GC->AT mutations is particularly evident in those genes which preferentially use G- and C-ending codons (for SNPs in exons with GC3 > 0.6, 60 GC->AT and 11 AT->GC mutations, P < 0.00001) (table 1); there is no evidence of an excess of GC->AT mutations in genes with low GC3.
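The text does not state which test produced these P-values. As an illustration only, the sketch below assumes a two-sided exact binomial (sign) test of the observed counts against the 1:1 ratio expected under the mutation bias hypothesis; the function name is hypothetical.

```python
from math import comb


def two_sided_binomial_p(k: int, n: int) -> float:
    """Exact two-sided P-value for k successes in n trials when p = 0.5."""
    tail = min(k, n - k)
    lower = sum(comb(n, i) for i in range(tail + 1))
    # With p = 0.5 the distribution is symmetric, so the two tails are equal.
    return min(1.0, 2 * lower / 2 ** n)


# Counts quoted in the text: all synonymous SNPs, and SNPs in exons with GC3 > 0.6.
p_all = two_sided_binomial_p(88, 88 + 37)
p_gc_rich = two_sided_binomial_p(60, 60 + 11)
```

Under this assumed test, both counts depart strongly from a 1:1 ratio; the exact values depend on the choice of test and should not be read as a reproduction of the published figures.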


 
Table 1 The Numbers of GC->AT and AT->GC Synonymous Mutations Segregating in Human Genes

 
Sampling Bias
While this result would seem to be inconsistent with the mutation bias hypothesis in the G+C-rich genes, there are a number of explanations for the excess of GC->AT mutations which need to be considered: sampling bias, hypermutable sites, and a recent change in the pattern of mutation. It seems unlikely, for several reasons, that our results were due to biases in the methods used to detect the SNPs (i.e., ascertainment bias). First, Cargill et al. (1999) estimate that they detected at least 85% of all SNPs. Second, if ascertainment bias were responsible, we would not expect the excess of GC->AT mutations to increase with increasing G+C content, as we see in the data (table 1); since under the mutation bias hypothesis we expect equal numbers of GC->AT and AT->GC mutations at all G+C contents, we would expect a similar level of ascertainment bias at all compositional levels. Third, a similar excess of synonymous GC->AT mutations was observed in MHC genes, where the mutations were detected by a different method, direct sequencing (Eyre-Walker 1999).

Hypermutation
Hypermutable sites potentially have two effects: they could lead to problems with parsimony, and they could violate the infinite-sites assumption. In each case, if the hypermutable sites had elevated rates of AT->GC mutation, they would tend to generate an excess of GC->AT mutations, as we see in the data. The reasons for this rather counterintuitive behavior are fully discussed elsewhere (Eyre-Walker 1998, 1999). However, three lines of evidence suggest that hypermutable sites were not responsible for the excess of GC->AT mutations we observed. First, we are not aware of any evidence of AT->GC hypermutable sites in mammals; the one well-known class of hypermutable sites, CpG dinucleotides, is expected to cause a bias in the opposite direction to that required to explain the data: CpG dinucleotides generate C->T and G->A transitions at elevated rates, and such mutations will tend to appear as T->C and A->G changes, respectively, in the data (Eyre-Walker 1998, 1999). Second, it is possible to demonstrate that the excess in GC->AT mutations is not due to a problem with parsimony, since we can dispense with the chimpanzee sequence and infer the direction of mutation from the frequencies of the alleles segregating at a site; the rarer allele is assumed to be more recent. This method is unbiased under the null hypothesis (synonymous codon bias is caused by mutation bias) and the infinite-sites assumption (Eyre-Walker 1999). Using allele frequencies, we infer that there have been 65 GC->AT mutations, compared with 37 AT->GC mutations over all genes (P = 0.007) and 45 GC->AT versus 15 AT->GC mutations (P = 0.0001) for genes with GC3 > 0.6; the sample sizes are smaller because frequency data are available for only a subset of the SNPs. Third, the infinite-sites assumption would only be seriously compromised in this context if the rate of mutation were some 100 times as high as the average nucleotide diversity observed (Eyre-Walker 1999), and with that level of hypermutability, we would expect to see an excess of GC->AT substitutions inferred by parsimony (Eyre-Walker 1998) over even short timescales, such as the divergence along the human lineage since we split from chimpanzees (Eyre-Walker and Keightley 1999). In a sample of 28 genes sequenced in humans, chimpanzees, and gorillas (Eyre-Walker and Keightley 1999), there have been identical numbers of GC->AT and AT->GC synonymous substitutions along the human lineage (22 substitutions in each direction inferred by parsimony, 18 GC->AT and 15 AT->GC substitutions for genes with GC3 > 0.6), just as we expect for a sequence of stationary base composition.

Mutation Pattern
The excess of GC->AT mutations segregating in human SNPs could be the result of a recent change in the mutation pattern from a GC bias to an AT bias, but this seems unlikely for three reasons. First, a change in the mutation pattern would manifest itself as an excess of GC->AT substitutions over AT->GC substitutions unless the change in the mutation pattern had been very recent. As we showed above, there appear to have been similar numbers of GC->AT and AT->GC substitutions along the human lineage since the split from chimpanzees. Second, a dramatic change in the mutation pattern is required to explain the data. For example, there are 18 GC->AT mutations and 4 AT->GC mutations for the SNPs in exons with GC3 between 70% and 80%, and the change in the mutation process needed to cause this pattern would eventually reduce GC3 to ~40% (calculated using eq. 8 in Eyre-Walker [1997]). Third, we would require several independent changes in the mutation pattern in the same direction to explain the excess of GC->AT synonymous polymorphisms in the MHC genes of other mammals (Eyre-Walker 1999).
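The ~40% figure can be recovered with a back-of-the-envelope calculation. The sketch below is a simple neutral argument and is not claimed to be eq. 8 of Eyre-Walker (1997): it assumes that segregating GC->AT and AT->GC mutations arise in proportion to f u and (1 - f) v under the new mutation pattern, and it takes the midpoint of the 70%-80% class (0.75) as the current GC3. The function name is hypothetical.

```python
def implied_equilibrium_gc(n_gc_to_at: int, n_at_to_gc: int, current_gc: float) -> float:
    """Equilibrium G+C content implied by the observed ratio of segregating mutations."""
    # Counts are proportional to f*u and (1 - f)*v, so
    # u/v = (n_gc_to_at / n_at_to_gc) * ((1 - f) / f).
    u_over_v = (n_gc_to_at / n_at_to_gc) * ((1.0 - current_gc) / current_gc)
    # Equilibrium under mutation alone: v / (u + v) = 1 / (1 + u/v).
    return 1.0 / (1.0 + u_over_v)


print(round(implied_equilibrium_gc(18, 4, 0.75), 2))  # 0.4
```

With 18 GC->AT and 4 AT->GC mutations at GC3 = 0.75, the implied ratio u/v is 1.5, giving an eventual GC3 of about 0.4.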

Selection and Biased Gene Conversion
It therefore seems that mutation bias is not responsible for synonymous codon bias in human genes. However, there are at least two other possibilities: natural selection and biased gene conversion. Biased gene conversion is a process which leads to the biased transmission of alleles; for example, if biased gene conversion is very strong and G+C-biased, 100% of all gametes from a C/T heterozygote will be C. Both selection and biased gene conversion are expected to generate an excess of GC->AT mutations. This can be seen using the following simple argument: Let us imagine there is no mutation bias, and selection has elevated the G+C content of a sequence to 80%. Since there is no mutation bias, 80% of the new mutations will be GC->AT, and 20% will be AT->GC (ignoring G<->C and A<->T mutations). Unfortunately, the situation is more complicated, because selection may affect the probability of detecting a mutation. For example, if directional selection had elevated the G+C content to 80% in the previous example, each GC->AT mutation would be slightly deleterious, while each AT->GC mutation would be slightly advantageous; we would therefore expect to detect the AT->GC mutations more readily, because they would segregate at slightly higher frequencies, on average, than the GC->AT mutations.

To demonstrate formally that selection and biased gene conversion are expected to generate an excess of GC->AT mutations, we derived the expected proportion of GC->AT mutations segregating in a sample of sequences, P_GC->AT, under two models: a model of weak directional selection, which is equivalent to a model of biased gene conversion (Nagylaki 1983); and a model of strong stabilizing selection. Let f ' (or f '') be the frequency of sites fixed for G : C base pairs, u be the mutation rate from G : C to A : T base pairs, and v be the mutation rate in the opposite direction. We will assume that selection or biased gene conversion favors high G+C. First, consider weak directional selection and biased gene conversion, two processes which can be described by a single parameter s, since they are dynamically identical (Nagylaki 1983). Under semidominant directional selection, s is the strength of selection in favor of G : C base pairs, and under biased gene conversion, s is the strength of biased gene conversion, where (s + 1)/2 of the alleles from a G : C/A : T heterozygote are G or C. If mutation rates are low enough that the infinite-sites assumption holds (i.e., N_e u << 1, N_e v << 1), the equilibrium proportion of sites fixed for G : C in a diploid is


(Li et al. 1987; Bulmer 1991), where S = 4N_e s. Therefore, in a sample of n sequences, the expected proportions of GC->AT and AT->GC mutations are given by


(Sawyer and Hartl 1992), assuming there is free recombination. The proportion of GC->AT mutations segregating in the sample is simply


Second, consider a stabilizing-selection model in which selection is acting to maintain the G+C content of a sequence at f ''. If we assume that selection is sufficiently strong that the average G+C content of the sequence in the population is at the optimum, f '', and that mutation rates are sufficiently low that each mutation in the sequence under stabilizing selection appears, segregates, and is removed before the next occurs, then each mutation, whether it is GC->AT or AT->GC, is deleterious. If we assume that selection is symmetrical about the optimum, then each mutation will be subject to the same level of selection; let the strength of selection be s against the mutation. Then, we have


As figure 1 shows, when selection favors increased G+C, we expect an excess of GC->AT mutations under both models, and when selection favors increased A+T, we expect a deficit of GC->AT mutations. This is likely to be the pattern we expect under most models of selection, since the stabilizing- and directional-selection models lie at opposite ends of a continuum; as selection becomes weak in the stabilizing-selection model, mutation pressure will push the population away from the optimum; if selection becomes very weak, then the population will be sufficiently far below the optimum that the model becomes a weak directional-selection model.
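The qualitative behavior of the two models can be explored numerically. The sketch below is not a transcription of the paper's displayed equations, which are not reproduced in this copy; it rests on stated assumptions. For the directional model it assumes the standard mutation-selection-drift equilibrium f ' = v e^S / (u + v e^S) and a Sawyer-Hartl-type stationary frequency density for new mutations, with GC->AT mutations treated as deleterious (-S) and AT->GC mutations as favored (+S). For the stabilizing model it assumes that both classes are selected against equally, so that detection probabilities cancel. All function names are hypothetical, and mutation rates enter only through k = v/(u + v).

```python
import math


def sample_polymorphism_weight(n: int, S: float, steps: int = 20000) -> float:
    """Relative chance that a new mutation with scaled selection S is seen
    segregating in a sample of n sequences (midpoint-rule integration)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # population frequency of the new allele
        seen = 1.0 - x ** n - (1.0 - x) ** n  # both alleles present in the sample
        if abs(S) < 1e-9:
            density = 1.0 / x  # neutral limit of the density below
        else:
            density = (1.0 - math.exp(-S * (1.0 - x))) / (
                (1.0 - math.exp(-S)) * x * (1.0 - x))
        total += seen * density * h
    return total


def equilibrium_gc(k: float, S: float) -> float:
    """Assumed equilibrium f' for mutation bias k = v/(u + v) and S = 4*Ne*s."""
    return k * math.exp(S) / ((1.0 - k) + k * math.exp(S))


def p_gc_to_at_directional(k: float, S: float, n: int = 10) -> float:
    """Expected proportion of GC->AT mutations under directional selection or BGC."""
    f = equilibrium_gc(k, S)
    m_gc_at = f * (1.0 - k) * sample_polymorphism_weight(n, -S)  # selected against
    m_at_gc = (1.0 - f) * k * sample_polymorphism_weight(n, S)   # favored
    return m_gc_at / (m_gc_at + m_at_gc)


def p_gc_to_at_stabilizing(k: float, f_opt: float) -> float:
    """Expected proportion of GC->AT mutations when every mutation is equally deleterious."""
    gc_at = f_opt * (1.0 - k)
    at_gc = (1.0 - f_opt) * k
    return gc_at / (gc_at + at_gc)
```

Under these assumptions, with no mutation bias (k = 0.5) the directional model returns exactly one half at S = 0, more than one half when S > 0, and less than one half when S < 0, which matches the pattern described for figure 1.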



 
Fig. 1.—The expected proportion of GC->AT mutations segregating in a sample of sequences (P_GC->AT) plotted against the G+C content of the sequence for a variety of mutation biases (k = v/(u + v)) for two models: weak directional selection/biased gene conversion (solid lines) (eq. 3) and stabilizing selection (dashed lines) (eq. 4). Note that for the first model, P'_GC->AT is a function of k, S, and f ', but f ' is itself a function of k and S, so P'_GC->AT can be plotted parametrically against f ' as S varies. In each case, the sample size is assumed to be 10 sequences; the results are very similar for other sample sizes.

 
CpG Dinucleotides
While both selection and biased gene conversion are consistent with the data presented here, there are few data which can discriminate between them at present. We can test two simple selective hypotheses: that selection is acting on synonymous codon use, but only to maintain (1) CpG islands, ~1-kb sequences which have high levels of the dinucleotide CpG and high G+C content, or (2) methylated CpG dinucleotides. Both CpG islands and methylated CpGs have been implicated in the regulation of gene expression (Lewis and Bird 1991) and might therefore be targets of natural selection. However, while the excess of GC->AT mutations is very apparent for both CpG islands and CpG dinucleotides (CpG islands: 14 GC->AT mutations and 1 AT->GC mutation, P = 0.0005; CpG dinucleotides: 47 GC->AT and 15 AT->GC mutations, P = 0.0001 at SNPs segregating C/T at a site flanked 3' by G, or G/A at a site flanked 5' by C), there is also an excess of GC->AT mutations both for non-CpG island DNA and for dinucleotides other than CpG (non-CpG island: 73 GC->AT and 36 AT->GC mutations, P = 0.0005; other dinucleotides: 41 GC->AT and 15 AT->GC mutations, P = 0.023).

Noncoding DNA
It is likely that whatever affects synonymous codon bias also affects large regions of the genome, since in mammals synonymous codon bias is correlated with the base composition of the chromosomal region in which the gene is situated; i.e., GC3 is strongly correlated with the G+C content of the 5' and 3' UTRs, introns, and isochores (Bernardi et al. 1985; Clay et al. 1996). As expected, there is an excess of GC->AT mutations segregating in intron, 3' UTR, and anonymous STS sequences (i.e., STS sequences which are not known to be within or flanking a protein-coding sequence), particularly in those sequences which are G+C-rich (table 2). It therefore seems that either natural selection or biased gene conversion also affects the base composition of G+C-rich noncoding DNA and therefore has a profound effect on the structure of the human genome, since large sections of the genome are G+C-rich, while others are G+C-poor (Bernardi 1995).


 
Table 2 The Numbers of GC->AT and AT->GC Single-Nucleotide Polymorphisms (SNPs) Segregating in Introns, 3' Untranslated Regions (UTRs), and Anonymous Sequence Tagged Site (STS) Sequences

 


    Acknowledgements
 
We thank Eric Lander, Francis Collins, and their groups for making their data available, and Gil McVean, Laurence Hurst, and Peter Keightley for comments and helpful discussion. This work was supported by the BBSRC (N.G.C.S., A.E.-W.) and the Royal Society (A.E.-W.).


    Footnotes
 
Manolo Gouy, Reviewing Editor

1 Abbreviations: EST, expressed sequence tag; MHC, major histocompatibility complex; SNP, single-nucleotide polymorphism; STS, sequence tagged site; UTR, untranslated region.

2 Keywords: human, synonymous codons, mutation bias

3 Address for correspondence and reprints: Adam Eyre-Walker, Centre for the Study of Evolution and School of Biological Sciences, University of Sussex, Brighton BN1 9QG, United Kingdom. a.c.eyre-walker@sussex.ac.uk


    literature cited
 

    Bernardi, G. 1995. The human genome: organization and evolutionary history. Annu. Rev. Genet. 29:445–476

    Bernardi, G., B. Olofsson, J. Filipski, M. Zerial, J. Salinas, G. Cuny, M. Meunier-Rotival, and F. Rodier. 1985. The mosaic genome of warm blooded vertebrates. Science 228:953–958

    Bulmer, M. 1986. Neighbouring base effects on substitution rates in pseudogenes. Mol. Biol. Evol. 3:322–329

    ———. 1991. The selection-mutation-drift theory of synonymous codon usage. Genetics 129:897–907

    Cargill, M., D. Altshuler, J. Ireland et al. (17 co-authors). 1999. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22:231–238

    Clay, O., S. Caccio, Z. Zoubak, D. Mouchiroud, and G. Bernardi. 1996. Human coding and noncoding DNA: compositional correlations. Mol. Phylogenet. Evol. 5:2–12

    Duret, L., and D. Mouchiroud. 1999. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc. Natl. Acad. Sci. USA 96:4482–4487

    Eyre-Walker, A. 1997. Differentiating selection and mutation bias. Genetics 147:1983–1987

    ———. 1998. Problems with parsimony in sequences of biased base composition. J. Mol. Evol. 47:686–690

    ———. 1999. Evidence of selection on silent site base composition in mammals: potential implications for the evolution of isochores and junk DNA. Genetics 152:675–683

    Eyre-Walker, A., and P. D. Keightley. 1999. High genomic deleterious mutation rates in hominids. Nature 397:344–347

    Filipski, J. 1987. Correlation between molecular clock ticking, codon usage, fidelity of DNA repair, chromosome banding and chromatin compactness in germline cells. FEBS Lett. 217:184–186

    Gouy, M., and C. Gautier. 1982. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 10:7055–7074

    Hacia, J. G., J.-B. Fan, O. Ryder et al. (16 co-authors). 1999. Determination of ancestral alleles for human single-nucleotide polymorphisms using high-density oligonucleotide arrays. Nat. Genet. 22:164–167

    Ikemura, T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2:13–34

    Lewis, J., and A. P. Bird. 1991. DNA methylation and chromatin structure. FEBS Lett. 205:155–159

    Li, W.-H., and L. A. Stadler. 1991. Low nucleotide diversity in man. Genetics 129:513–523

    Li, W.-H., M. Tanimura, and P. M. Sharp. 1987. An evaluation of the molecular clock hypothesis using mammalian DNA sequences. J. Mol. Evol. 25:330–342

    Nagylaki, T. 1983. Evolution of a finite population under gene conversion. Proc. Natl. Acad. Sci. USA 80:6278–6281

    Sawyer, S. A., and D. L. Hartl. 1992. Population genetics of polymorphism and divergence. Genetics 132:1161–1176

    Sharp, P. M., C. J. Burgess, A. T. Lloyd, and K. J. Mitchell. 1992. Selective use of termination and variation in codon choice. Pp. 397–425 in D. L. Hatfield, B. J. Lee, and R. M. Pirtle, eds. Transfer RNA in protein synthesis. CRC Press, Boca Raton, Fla

    Sueoka, N. 1962. On the genetic basis of variation and heterogeneity of DNA base composition. Proc. Natl. Acad. Sci. USA 48:582–592

    ———. 1988. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. USA 85:2653–2657

    Sved, J., and A. P. Bird. 1990. The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc. Natl. Acad. Sci. USA 87:4692–4696

    Wolfe, K. H., P. M. Sharp, and W.-H. Li. 1989. Mutation rates differ among regions of the mammalian genome. Nature 337:283–285

Accepted for publication January 4, 2001.