Department of Integrative Biology, University of CaliforniaBerkeley
The phenomenon of codon usage bias has been important in the study of evolution because it provides examples of weak selection working at the molecular level. During the last two decades, evidence has accumulated that some examples of codon usage bias are driven by selection, particularly for species of fungi (e.g., Bennetzen and Hall 1982
; Ikemura 1985
), bacteria (e.g., Ikemura 1981
; Sharp and Li 1987
), and insects (e.g., Akashi 1997
; Moriyama and Powell 1997
). This connection between codon usage bias and selection has been important in stimulating the development of alternatives to strictly neutral theories of molecular evolution (Ohta and Gillespie 1996
; Kreitman and Antezana 2000
). Uncertainty persists, however, regarding the phylogenetic distribution of codon usage bias (such as whether selection-based codon usage bias is present in mammals; e.g., Karlin and Mrazek 1996
; Iida and Akashi 2000
; Sueoka and Kawanishi 2000
; Smith and Eyre-Walker 2001a;
Urrutia and Hurst 2001
). Questions also remain as to what models of selection underlie codon preferences (Kreitman and Antezana 2000
), and specifically whether the presence of suboptimal codons is the result of mutation and drift, variation in selection pressure across sites, or antagonistic selection pressures (Smith and Eyre-Walker 2001b
).
For investigating certain questions regarding the evolution of codon usage bias, it is useful to have a summary statistic describing the pattern of codon usage across all amino acids. Many summary statistics have already been developed to describe the patterns of codon usage. They can be divided roughly into two classes (Comeron and Aguade 1998
). One class summarizes the usage of certain preferred codons, and the other compares every codon's usage to a null distribution (typically uniform usage of synonymous codons). The former class of methods has the disadvantage that it requires a prior knowledge of the preferred codons. With summary statistics one can explore general patterns, such as the relationship of codon usage bias to recombination rate, gene length, or synonymous substitution rate. Observing broad patterns such as these has already provided insight into the evolutionary dynamics of codon usage bias (Kliman and Hey 1993
; Akashi and Eyre-Walker 1998
; Kreitman and Comeron 1999
).
Of the summary statistics that do not require knowledge of preferred codons, the effective number of codons (ENC, or c as proposed by Wright 1990
) has been found to be the most statistically well behaved, in that it is the least affected by short gene lengths (Comeron and Aguade 1998
).
c is inversely proportional to the extent of nonuniform codon usage. It takes the value of 61 when all codons are being used with equal frequency, and its value decreases as codon usage becomes less uniform. It is intuitively accessible because its value is intended to correspond to the number of codons being used in a sequence.
One limitation of c is that measuring the departures from uniform codon usage is not always desirable. Knowledge of background nucleotide composition patterns may suggest a null distribution of codon usage that is nonuniform. For instance, in Drosophila, where the background nucleotide composition is 60% AT, one might like to measure how far codon usage departs from 60% usage of AT-ending codons. Indeed, accounting for background nucleotide composition when studying codon usage has been recognized as being important (Akashi, Kliman, and Eyre-Walker 1998
; Marais, Mouchiroud, and Duret 2001
; Urrutia and Hurst 2001
). Taking composition into account is particularly important for phylogenetic studies of codon usage bias where comparisons are made among species with differing background nucleotide compositions or for studies of codon usage bias in genomic regions that may differ in background nucleotide composition. If not taken into consideration, differences in background nucleotide composition might lead one to conclude that differences in codon preference exist when they do not.
The departure of c from 61 because of variation in background nucleotide composition was recognized by Wright and described as the following relationship between the third position GC content (fGC) and
c:
|
Other codon usage bias summary statistics such as Akashi's scaled 2 (Akashi 1995
), the maximum-likelihood codon bias statistic (MCB; Urrutia and Hurst 2001
), and B*(a) (Karlin and Mrazek 1996
) account for background nucleotide composition. However, as presented below, these statistics are affected strongly by the sequence lengths being studied. Clearly, it would be useful to have a statistic similar to
c that has low sequence-length effects but that relaxes the assumption of equal usage of all synonymous codons.
Here, a statistic ('c) is presented that accounts for background nucleotide composition and is minimally affected by sequence length. The statistic is based on Pearson's X2 statistics and describes the departure of the observed codon usage from some expected distribution. The expected distribution can be derived from knowledge of the background nucleotide composition. When uniform usage of codons is expected,
'c reduces to
c. Like
c,
'c is relatively insensitive to sequence length. The properties of
'c are tested on simulated sequences, and compared with related summary statistics. The results suggest that
'c is useful for studying codon usage among sequences that vary in background nucleotide composition.
To understand the construction of 'c, it is first necessary to provide background on
c.
c, or the ENC (Wright 1990
), is derived by making an analogy with the effective number of alleles ne at a locus (Kimura and Crow 1964
). An estimate of ne is
e = 1/
, where
is an estimate of the expected heterozygosity at the locus. For quantifying codon usage bias at an amino acid a, Wright defined an estimate of the homozygosity of codon usage as follows:
|
|
|
|
To derive 'c and show its relationship to
c, I use Pearson's X2 statistics to quantify the departure of each codon's usage (pi) from some expected usage (ei) for each amino acid. The expected usage of codons can be calculated in a number of ways, including using mono-, di-, or trinucleotide frequencies. The
2 statistic for each amino acid, Xa2, is calculated as
|
|
|
Notably, 'c simplifies to
c when synonymous codons are expected to be in equal frequency. In this case the ei are all 1/k, and the Xa2 statistic for each amino acid reduces to
|
|
This derivation shows that 'c is a generalization of
c to the case in which expected codon usage is not uniform. Whereas
c is implicitly based on an expected distribution of uniform usage,
'c can have the expected proportions of synonymous codon usage given by the background nucleotide composition.
'c will decrease from 61 when the codon usage does not match the distribution predicted by nucleotide composition. In contrast,
c will decrease from 61 whenever the codon usage departs from uniform usage of each synonymous codon. The result is that the value of
'c should be less dependent on nucleotide composition and more dependent on the codon preferences that go beyond those caused by unequal nucleotide composition.
In the practical implementation of both c and
'c, care must be taken to exclude
a values that are undefined or equal to zero, which occurs when amino acids are rare or missing (see Wright 1990
). The calculations presented subsequently follow Wright's suggestion that if no threefold redundant codons are observed, one should average and to obtain . If other k-fold redundancy classes are unobserved, is assumed to equal 1/k. Such an assumption is conservative with regard to measuring strong codon usage bias. Finally, in the calculation of we exclude amino acids that are observed fewer than five times. Here, these methods of dealing with missing data are applied equally to the calculation of both
c and
'c.
To explore the properties of the statistics c and
'c, pseudorandomly generated coding sequences were generated for various lengths and nucleotide compositions. Amino acids were chosen uniformly, and codon usage was assumed to be multinomial with proportions determined by the nucleotide composition. The background nucleotide composition estimates are given as fA, fC, fG, and fT for adenine, cytosine, guanine, and thymine, respectively. For example, for a codon i that has the sequence CAT, the expected frequency was calculated as ei = fCfAfT/K, where K is a renormalization constant for ensuring the ei for an amino acid sum to one.
A range of nucleotide compositions were used to generate the sequences (Table 1
). The nucleotide compositions are modeled after those used in a study by Comeron and Aguade (1998)
and contain conditions corresponding to different levels of background GC-AT content. The compositions also vary in their degree of GC skew (Lobry 1996
; Shioiri and Takahata 2001
). The Low-2, Med-2, and High-2 have higher GC skew relative to Low-1, Med-1, and High-1, respectively. For each composition and gene length, 10,000 sequences were generated. The mean and variance of the
c and
'c statistics were calculated for each set of sequences.
|
Figure 1
shows the mean values of the various bias statistics for sequences of 7,500 bp for the various background nucleotide compositions. The mean values are shown relative to their mean values with uniform nucleotide composition. As expected when the nucleotide frequencies are uniform, c and
'c are equal, and both share a value of 61. However, as the unevenness in the background nucleotide composition increases, the mean value of
c decreases, whereas that of
'c remains relatively constant. The value of
'c is nearly 61 for all nucleotide composition conditions, whereas that of
c ranges from 61 with equal nucleotide compositions to nearly 26 when the compositions are highly nonuniform. It is also worth noting that
c is particularly sensitive to GC skew. The value of
c is on average nine units smaller for the composition sets with high GC skew than for those with none. Like
'c, scaled
2 and MCB remain constant over various nucleotide composition sets. The B*(a) statistic remains nearly constant for modest levels of unevenness in nucleotide composition but increases for the medium and high levels.
|
|
|
Although 'c differs from
c, they are similar enough in some cases so that using
'c rather than
c may not always produce different results. For example, a reanalysis with
'c of data from Dunn, Bielawski, and Yang (2001)
on the relationship between Drosophila substitution rates and codon usage bias produced results that are qualitatively similar to those from the original analysis with
c (data not shown).
However, in other cases the difference between c and
'c will be important. For example, a difference in
c of 10 may be seen as biologically significant (e.g., Moriyama and Powell 1997
, 1998
). At that resolution the results presented here show that using
'c with nonuniform expectations will result in a considerable difference if the GC or AT content is on the order of 85%. Such nucleotide composition is extreme in nature and is found genome-wide in protozoan species, such as Plasmodium falciparum (Gardner et al. 1998
), or in mitochondrial genomes of arthropods, such as Drosophila, Anopheles, Bombyx, and Apis (Shioiri and Takahata 2001
). In addition, the results presented here show that the presence of nucleotide composition skew will affect the value of
c, to change it even more from that of
'c.
If differences of less than 10 in the value of c are of interest, then using
'c rather than
c is important for even lower levels of unevenness in nucleotide composition. For instance, a nucleotide composition of approximately 35% GC or AT can cause the value of
c to change by five units relative to equal usage. In addition, modest levels of GC skew, such as in composition set Low-2, can cause the value of
c to change by five units relative to no skew.
Using 'c with nonuniform expectations will be particularly advantageous when comparing codon usage bias among sequences from genomic regions that vary significantly in background nucleotide composition. Such variation is especially notable in mammals and birds, where variation in nucleotide composition is structured into isochores (Bernardi et al. 1985
). In addition,
'c may also be useful in comparative studies of codon usage bias in which the species being studied have substantially different nucleotide compositions.
In summary, this paper has introduced a summary statistic of codon usage bias ('c) that incorporates variation in background nucleotide composition among sequences. Simulations show that relative to the available statistics,
'c effectively adjusts for background nucleotide composition, is the least affected by gene length, and has a low CV. With its use the analysis of codon usage bias can be accomplished without the confounding effects of variation in nucleotide composition.
A freely distributable program that calculates both c and
'c is available on the Web page: http://ib.berkeley.edu/labs/slatkin/software.html.
Acknowledgements
The author thanks Josh Herbeck and Montgomery Slatkin for helpful discussions, Araxi Urrutia for discussion and for providing the script to calculate MCB, and two reviewers for their comments. The research was supported by a Howard Hughes Medical Institute Predoctoral Fellowship and the National Institutes of Health (NIH GM-40282 to M. Slatkin).
Footnotes
Adam Eyre-Walker, Reviewing Editor
Abbreviations: ENC, effective number of codons.
Keywords: codon usage bias
background nucleotide composition
mutation bias
Address for correspondence and reprints: John A. Novembre, Department of Integrative Biology, University of CaliforniaBerkeley, VLSB 3060, Berkeley, California 94720. novembre{at}socrates.berkeley.edu
References
Akashi H., 1995 Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA Genetics 139:1067-1076
. 1997 Distinguishing the effects of mutational biases and natural selection on DNA sequence variation Genetics 147:1989-1991
Akashi H., A. Eyre-Walker, 1998 Translational selection and molecular evolution Curr. Opin. Genet. Dev 8:688-693[ISI][Medline]
Akashi H., R. Kliman, A. Eyre-Walker, 1998 Mutation pressure, natural selection, and the evolution of base composition in Drosophila Genetica 102103:49-60
Bennetzen J., B. Hall, 1982 Codon selection in yeast J. Biol. Chem 257:3026-3031
Bernardi G., B. Olofsson, J. Filipski, M. Zerial, J. Salinas, G. Cuny, M. Meunier-Rotival, F. Rodier, 1985 The mosaic genome of warm-blooded vertebrates Science 228:953-958[ISI][Medline]
Comeron J., M. Aguade, 1998 An evaluation of measures of synonymous codon usage bias J. Mol. Evol 47:268-274[ISI][Medline]
Dunn K., J. Bielawski, Z. Yang, 2001 Substitution rates in Drosophila nuclear genes: implications for translational selection Genetics 157:295-305
Gardner M., H. Tettelin, D. Carucci, et al. (24 co-authors) 1998 Chromosome 2 sequence of the human malaria parasite Plasmodium falciparum Science 282:1126-1132.
Iida K., H. Akashi, 2000 A test of translational selection at silent sites in the human genome: base composition comparisons in alternatively spliced genes Gene 261:93-105[ISI][Medline]
Ikemura T., 1981 Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes J. Mol. Biol 146:1-21[ISI][Medline]
. 1985 Codon usage and tRNA content in unicellular and multicellular organisms Mol. Biol. Evol 2:13-34[Abstract]
Karlin S., J. Mrazek, 1996 What drives codon choices in human genes? J. Mol. Biol 262:459-472[ISI][Medline]
Kimura M., J. Crow, 1964 The number of alleles that can be maintained in a finite population Genetics 49:725-738
Kliman R., J. Hey, 1993 Reduced natural selection associated with low recombination in Drosophila melanogaster Mol. Biol. Evol 10:1239-1258[Abstract]
Kreitman M., M. Antezana, 2000 The population and evolutionary genetics of codon bias Pp. 82101 in R. Singh and C. Krimbas, eds. Evolutionary genetics: from molecules to morphology. Cambridge University Press, Cambridge, U.K
Kreitman M., J. Comeron, 1999 Coding sequence evolution Curr. Opin. Genet. Dev 9:637-641[ISI][Medline]
Lobry J., 1996 Asymmetric substitution patterns in the two DNA strands of bacteria Mol. Biol. Evol 13:660-665[Abstract]
Marais G., D. Mouchiroud, L. Duret, 2001 Does recombination improve selection on codon usage? Lessons from nematode and fly complete genomes Proc. Natl. Acad. Sci. USA 98:5688-5692
Moriyama E., J. Powell, 1997 Codon usage bias and tRNA abundance in Drosophila J. Mol. Evol 45:514-523[ISI][Medline]
. 1998 Gene length and codon usage bias in Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli Nucleic Acids Res 26:3188-3193
Ohta T., J. Gillespie, 1996 Development of neutral and nearly neutral theories Theor. Popul. Biol 49:128-142[ISI][Medline]
Sharp P., W. Li, 1987 The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias Mol. Biol. Evol 4:222-230[Abstract]
Shioiri C., N. Takahata, 2001 Skew of mononucleotide frequencies, relative abundance of dinucleotides, and DNA strand asymmetry J. Mol. Evol 53:364-376[ISI][Medline]
Smith N., A. Eyre-Walker, 2001a. Synonymous codon bias is not caused by mutation bias in G+C-rich genes in humans Mol. Biol. Evol 18:982-986
. 2001b. Why are translationally sub-optimal synonymous codons used in Escherichia coli? J. Mol. Evol 53:225-236[ISI][Medline]
Sueoka N., Y. Kawanishi, 2000 DNA G+C content of the third codon position and codon usage biases of human genes Gene 261:53-62[ISI][Medline]
Urrutia A., L. Hurst, 2001 Codon usage bias covaries with expression breadth and the rate of synonymous evolution in humans, but this is not evidence for selection Genetics 159:1191-1199
Wright F., 1990 The effective number of codons' used in a gene Gene 87:23-29[ISI][Medline]