1 Laboratorio de Organización y Evolución del Genoma, Facultad de Ciencias, Iguá 4225, Montevideo 11400, Uruguay
2 Escuela Universitaria de Tecnología Médica, Facultad de Medicina, Avda. Italia (s/n) Hospital de Clínicas, Montevideo 11600, Uruguay
Correspondence
Héctor Musto
hmusto{at}fcien.edu.uy
ABSTRACT
INTRODUCTION
The available evidence suggests that the strength and direction of these forces can vary both among different species and among sequences from the same genome. For example, the genomic G+C contents of prokaryotes vary from 25 to 75 mol% (Sueoka, 1962), and given the correlation that holds between GC3s (G+C content at silent third codon positions) and genomic G+C content (Bernardi & Bernardi, 1986; Muto & Osawa, 1987), the mutational bias characteristic of each genome greatly influences codon choices. However, the availability of very long contigs, and especially complete genomes, has shown that the mutational bias is not simply shifting the whole genome towards G+C or A+T. For instance, it has been shown that there are regional variations in the G+C content around the genome of Mycoplasma genitalium (McInerney, 1997; Kerr et al., 1997) which exert a great influence on GC3s and, consequently, on codon usage. Perhaps more unexpected was the finding of Lobry (1996), who showed that in several bacteria the leading and lagging strand of replication can be easily recognized by the so-called GC-skew, the quantity (G-C)/(G+C). Indeed, the leading strand usually displays positive values while the reverse is true for the lagging strand (the switch of sign occurs exactly at or very near to the origin and terminus of replication). As a consequence, the leading strand is G- (and T-) rich, while the lagging strand displays a bias towards C (and A). This effect can be so strong that in species like Borrelia burgdorferi, Treponema pallidum and Chlamydia trachomatis the position of the sequences in relation to the replication fork can be recognized as the most important force driving codon usage (McInerney, 1998; Lafay et al., 1999; Romero et al., 2000a). Finally, a common theme in completely sequenced genomes is the finding of regions displaying base compositions far away from those of the genome as a whole. These regions have been interpreted as being the result of events of horizontal transfer of DNA between species differing in their genomic G+C contents (Garcia-Vallve et al., 2000; Karlin, 2001), and the sequences located in these regions display different codon usage than the rest of the genes. Therefore, it can be concluded that the overall base content of a genome and the mutational bias of each replicative strand are the main forces driving codon usage.
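As an illustration of the GC-skew defined above, the following minimal sketch computes (G-C)/(G+C) in non-overlapping or sliding windows along a sequence. The function names, the window parameters and the toy sequence are assumptions made for this example only; they are not taken from Lobry (1996) or from any particular software package.

```python
def gc_skew(seq: str) -> float:
    """Return (G - C) / (G + C) for one sequence; 0.0 if it has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_skew(genome: str, window: int, step: int) -> list[tuple[int, float]]:
    """GC-skew in windows along the sequence, as (window start, skew) pairs."""
    return [
        (start, gc_skew(genome[start:start + window]))
        for start in range(0, len(genome) - window + 1, step)
    ]


# Hypothetical toy sequence: a G-rich half followed by a C-rich half.
toy = "GGGCGGAT" * 5 + "CCCGCCAT" * 5
profile = windowed_skew(toy, window=20, step=20)
for start, skew in profile:
    print(start, round(skew, 3))
```

On a real chromosome the sign of the windowed values would be expected to switch near the origin and terminus of replication; on the toy sequence it switches at the junction between the two halves.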
However, superimposed onto these general effects, in several species it has been found that natural selection leads to the fixation of some triplets among highly expressed genes. This was observed in Escherichia coli (Post & Nomura, 1979; Gouy & Gautier, 1982), where it was noted that the codon usage of highly expressed sequences was biased in relation to the pattern of lowly expressed genes. Indeed, in the former group there is an increase of certain triplets (major codons) while in the latter group the usage of codons is more random. From another perspective, Ikemura (1981) showed that there is a match between these codons and the most abundant tRNAs. Therefore, for E. coli it was proposed that the triplets that are recognized more efficiently by the most abundant isoacceptor are preferred, and the degree of bias in each gene should be proportional to the level of expression. Although the codon usage pattern of several prokaryotes fell within this interpretation (i.e. codon usage is the result of mutational biases and translational selection), as more species are studied more peculiarities are beginning to appear. For example, it was shown that in Helicobacter pylori, although the composition of the genome is not skewed and there is a low (but detectable) level of heterogeneity among genes, codon usage does not appear to be influenced simply by mutational biases or translational selection (Lafay et al., 2000). Furthermore, in Mycobacterium tuberculosis, although the classical factors are apparent, it was reported that the hydropathy level of each protein is correlated with the base content at silent sites (de Miranda et al., 2000). A more complex pattern was found in Chlamydia trachomatis, since codon usage appears to be shaped by the global genomic composition, the strand-specific mutational bias (as noted above), natural selection acting at the level of translation, the hydropathy level of each protein and each amino acid's conservation (Romero et al., 2000a). Therefore, as more prokaryotic genomes are analysed it is becoming clear that more factors shape codon usage than previously thought. Hence, more studies are needed (I) to understand the generality of the factors and phenomena described above, and (II) to detect new forces shaping codon usage. With these goals in mind, we decided to study the codon usage patterns in two species of Clostridium that have been sequenced recently, namely Clostridium perfringens (Shimizu et al., 2002) and Clostridium acetobutylicum (Nolling et al., 2001). These Gram-positive, anaerobic, spore-forming bacteria have several features that make them useful for these studies: (i) they belong to the same genus, which is important for comparative purposes; (ii) their genomes are compositionally biased (G+C contents of 31 and 29 mol%, respectively), which could hide the effect of natural selection; (iii) their generation time is short (Shimizu et al., 2002), which, contrary to (i) and (ii), would make selection for translational efficiency more likely to be detected; and (iv) on the leading strand of replication the two species display a very strong purine bias and an excess of coding sequences (Shimizu et al., 2002), which might add further levels of complexity to their patterns of codon usage.
METHODS
Methods of analysis.
Codon usage, correspondence analysis (COA) (Greenacre, 1984), GC3s (the frequency of codons ending in C or G, excluding Met, Trp and stop codons), the relative synonymous codon usage (RSCU) (Sharp et al., 1986) and the codon adaptation index (CAI) (Sharp & Li, 1987) were calculated using the program CODONW 1.3 (written by John Peden and available from ftp://molbiol.ox.ac.uk/Win95.codonW.zip). In the two species under study, the CAI was calculated taking the codon usage of the ribosomal proteins as a reference. COA of RSCU values was carried out to determine the major sources of variation among synonymous codons. Putative orthologous sequences were identified by running a BLAST query of the whole set of proteins of one genome against the set of the other using the stand-alone BLAST package (Altschul et al., 1997). The sequence with the best match, according to the score value, was identified. Then, the coding sequences of these pairs were translated and aligned using CLUSTAL W (Thompson et al., 1994); subsequently, the alignments were back-translated to the known DNA sequences. dS (synonymous distance) and dN (non-synonymous distance) values were calculated by the Nei–Gojobori method (Nei & Gojobori, 1986) with the JADIS package (Goncalves et al., 1999), only on those pairs of sequences displaying a minimal value of 50 % identity and with a length difference of 20 % at the amino acid level. The analyses were performed only with the pairs of sequences displaying dS values ≤2·0. The final dataset comprised 676 pairs of genes. Whole-genome alignment and comparison were carried out with the MUMmer system (release 2.1) (Delcher et al., 2002) using the default settings.
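To make the two simplest indices concrete, the sketch below computes RSCU and GC3s for a single coding sequence following the definitions given above (RSCU as the observed count of a codon divided by the mean count of its synonymous family; GC3s as the fraction of G- or C-ending codons, excluding Met, Trp and stop codons). It is an illustrative reimplementation under the standard genetic code, not the CODONW source; all function names and the toy sequence are assumptions.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(cds: str) -> Counter:
    """Count in-frame codons; a trailing partial codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def rscu(counts: Counter) -> dict[str, float]:
    """RSCU for every codon of each synonymous family observed in the sequence."""
    families: dict[str, list[str]] = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if len(codons) == 1 or total == 0:
            continue  # Met, Trp, or an amino acid absent from the sequence
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values


def gc3s(counts: Counter) -> float:
    """Fraction of G/C-ending codons among synonymous codons."""
    synonymous = [c for c, aa in CODON_TABLE.items() if aa not in "MW*"]
    total = sum(counts[c] for c in synonymous)
    gc_ending = sum(counts[c] for c in synonymous if c[2] in "GC")
    return gc_ending / total if total else 0.0


# Hypothetical toy gene: Met, Ala x3 (GCT, GCT, GCA), Lys x2 (AAA, AAG), stop.
toy = "ATG" + "GCT" * 2 + "GCA" + "AAA" + "AAG" + "TAA"
toy_counts = count_codons(toy)
toy_rscu = rscu(toy_counts)
toy_gc3s = gc3s(toy_counts)
```

In a genome-wide analysis the same counts would be accumulated per gene, and the resulting gene-by-codon RSCU table would be the input to the correspondence analysis.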
RESULTS AND DISCUSSION
Patterns of codon usage in C. perfringens
When COA is applied to C. perfringens it detects a principal trend (8·4 % of the total variability) that is clearly associated with expression levels. Indeed, at one extreme of this axis (Fig. 2a) lie genes that are known to be heavily expressed, such as those encoding several ribosomal proteins, translation elongation factors, glyceraldehyde-3-phosphate dehydrogenase, phosphoglycerate kinase, fructose-bisphosphate aldolase, triose-phosphate isomerase, pyruvate kinase and heat-shock proteins, while genes expressed at the lowest levels and those encoding hypothetical proteins are distributed almost normally around the mean value of this axis. The clustering of highly expressed genes at one end of the distribution indicates that these sequences are characterized by a different pattern of codon usage than the rest of the genes; therefore, translational selection might be operative in this bacterium. To see which triplets are increased in the highly expressed group of genes, we compared the codon usage pattern of the sequences displaying the most extreme values at both ends of the first axis (50 genes at either extreme). The differences in codon usage between the two groups were tested with a χ² test. We found that there are 17 codons whose usage is significantly increased (P<0·01) among the highly expressed group of genes, and they encode 17 amino acids (Cys is the only residue without an increased triplet). These codons are listed in Table 2.
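The text does not specify how the contingency tables for the χ² test were built. One plausible construction, shown here purely as an assumption, is a 2×2 table per codon: the count of that codon versus the count of its synonymous alternatives, in the highly expressed group versus the lowly expressed group. The sketch uses the Pearson statistic with one degree of freedom and no continuity correction; the function names and the counts are hypothetical.

```python
import math


def chi2_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square (1 d.f., no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # an empty row or column: nothing to test
    chi2 = n * (a * d - b * c) ** 2 / denom
    # With 1 d.f., the upper-tail probability is erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def is_increased(high_codon: int, high_other: int,
                 low_codon: int, low_other: int, alpha: float = 0.01) -> bool:
    """True if the codon is significantly more frequent in the high group."""
    _, p = chi2_2x2(high_codon, high_other, low_codon, low_other)
    if p >= alpha:
        return False
    high_frac = high_codon / (high_codon + high_other)
    low_frac = low_codon / (low_codon + low_other)
    return high_frac > low_frac


# Hypothetical counts: the codon accounts for 80 of 100 synonymous codons in the
# high-expression group and 40 of 100 in the low-expression group.
stat, p_value = chi2_2x2(80, 20, 40, 60)
flag = is_increased(80, 20, 40, 60)
```

Because many codons are tested at once, an analysis of this kind may also want a correction for multiple comparisons; the text above reports only the P<0·01 threshold.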
To further confirm the translational selection hypothesis, we calculated the CAI value for each sequence in C. perfringens taking as a reference the codon usage of ribosomal proteins, which are certainly heavily expressed. When all the sequences were sorted according to their CAI, the highest values were displayed not only by the genes encoding ribosomal proteins (which is a trivial result) but also by almost exactly the same genes that lie at the extreme of the first axis generated by the COA, which is confirmed by the strong correlation between the position of the sequences along this axis and the respective CAI values (R=0·82, P<0·0001). These results support our interpretation that the first axis discriminates expression levels.
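The CAI calculation described above can be sketched as follows. Each codon receives a relative adaptiveness w, its count in the reference set divided by the count of the most used synonymous codon for the same amino acid, and the CAI of a gene is the geometric mean of w over its codons (Met, Trp and stop codons excluded). This is a minimal illustration, not the CODONW implementation: the replacement of zero reference counts by 0.5 is an assumption adopted here to avoid taking the logarithm of zero, and the reference sequence is a toy example rather than real ribosomal protein genes.

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(cds: str) -> Counter:
    """Count in-frame codons; a trailing partial codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def relative_adaptiveness(reference: list[str]) -> dict[str, float]:
    """w values from a reference set of highly expressed coding sequences."""
    ref = Counter()
    for cds in reference:
        ref.update(codon_counts(cds))
    families: dict[str, list[str]] = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in "MW*":
            families.setdefault(aa, []).append(codon)
    w = {}
    for codons in families.values():
        # Assumption: codons absent from the reference set are given a count of 0.5.
        adjusted = {c: ref[c] if ref[c] else 0.5 for c in codons}
        top = max(adjusted.values())
        for c in codons:
            w[c] = adjusted[c] / top
    return w


def cai(cds: str, w: dict[str, float]) -> float:
    """Geometric mean of w over the scored codons of one gene."""
    scored = {c: n for c, n in codon_counts(cds).items() if c in w}
    total = sum(scored.values())
    if total == 0:
        return 0.0
    log_sum = sum(n * math.log(w[c]) for c, n in scored.items())
    return math.exp(log_sum / total)


# Hypothetical reference: Ala as GCT x3 and GCA x1, Lys as AAA x2 and AAG x1.
reference = ["GCT" * 3 + "GCA" + "AAA" * 2 + "AAG"]
w = relative_adaptiveness(reference)
best = cai("GCT" + "AAA", w)   # uses only the preferred codons
worse = cai("GCA" + "AAG", w)  # uses only the less frequent codons
```

A side effect of this sketch is that amino acids entirely absent from the reference set receive w = 1 for all their codons; with a realistic reference set of ribosomal protein genes this situation should be rare.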
The second axis of the COA (6·7 % of the variability) discriminates between genes located in the leading or lagging strand of replication. The importance of this effect can be so high that in species like Borrelia burgdorferi, Treponema pallidum and Chlamydia trachomatis it is the most important force driving codon usage (McInerney, 1998; Lafay et al., 1999; Romero et al., 2000a). Among these species, the sequences located in the leading strand are G- and T-rich at the synonymous sites, while the complementary bases are more frequent in genes located in the lagging strand. However, this kind of bias is not found in C. perfringens. Indeed, when the position of the codons in relation to the second axis is analysed it can be seen that purine- and pyrimidine-ending triplets lie at opposite extremes. When the genes are sorted according to their position on the second axis, most sequences located in the lagging strand of replication cluster together towards one end of the distribution (Fig. 3a). This result is certainly related to the very strong purine bias associated with an excess of coding sequences that characterizes the leading strand of C. perfringens, as well as the genomes of several other Gram-positive prokaryotes (Shimizu et al., 2002). This is shown in Table 3, where the nucleotide compositions of C. perfringens and C. acetobutylicum are displayed. It can be seen that there is a clear asymmetry in the distribution of ORFs between the two strands and that, although the GC3 content remains constant, the purine content is higher in the leading strand; it should be stressed that the differences are greater for G than for A. However, the differences are constant in the two clostridial species across their entire genomes (Table 3). We note that this bias towards A+G in the leading strand is so strong that it detects the origin and terminus of replication as clearly as does the GC-skew (Fig. 1).
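One common way to visualize a strand-specific purine bias is a cumulative purine excess curve: walking along one strand of the chromosome, add 1 for each purine (A or G) and subtract 1 for each pyrimidine (C or T). The curve changes slope where the bias changes sign, so its extremes are candidate positions for the origin and terminus of replication. The sketch below is offered as an illustration under that assumption; it is not necessarily the procedure used to draw Fig. 1, and the function names and toy sequence are hypothetical.

```python
from itertools import accumulate


def cumulative_purine_excess(genome: str) -> list[int]:
    """Running sum of +1 for A/G and -1 for C/T; other symbols contribute 0."""
    steps = (1 if b in "AG" else -1 if b in "CT" else 0 for b in genome.upper())
    return list(accumulate(steps))


def turning_points(genome: str) -> tuple[int, int]:
    """Indices of the minimum and maximum of the cumulative curve."""
    curve = cumulative_purine_excess(genome)
    return curve.index(min(curve)), curve.index(max(curve))


# Hypothetical toy sequence: a pyrimidine-rich half followed by a purine-rich half.
toy = "CTCTA" * 4 + "AGAGC" * 4
minimum_at, maximum_at = turning_points(toy)
```

On the toy sequence the minimum falls at the end of the pyrimidine-rich half, which is where the bias switches sign. Which extreme corresponds to the origin on a real chromosome depends on the strand and start coordinate of the deposited sequence.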
Patterns of codon usage in C. acetobutylicum
Our next step was to study the factors that shape codon usage in a bacterium related to C. perfringens, C. acetobutylicum. Although the two species belong to the same genus, there are strong differences between them. First, the genome of C. acetobutylicum is 30 % longer and displays 40 % more ORFs than the genome of C. perfringens. Second, while in the former species the origin and terminus of replication are roughly opposite in the genome, in the latter bacterium this is not the case (Fig. 1). Third, since the split of these two species from their last common ancestor there have been a number of genomic rearrangements (Shimizu et al., 2002), although both organisms still share several compositional features (low G+C content, strong purine bias in the leading strand of replication, mean GC-skew of 20 %).
COA in C. acetobutylicum detected a principal trend (6·7 % of the total variability) that was equivalent to the second main trend in C. perfringens; in other words, it discriminated between genes located on the leading or lagging strand of replication (Fig. 3b), and again it was associated with a strong purine bias in the sequences placed in the leading strand (see Fig. 1 and Table 3). Not surprisingly, when the genes were sorted according to their position on the second axis generated by the analysis (5·4 % of the variability), the most heavily expressed sequences were clustered at one end of the distribution, indicating that translational selection for codon usage is operative in C. acetobutylicum too (Fig. 2b). We made the same analyses as were made in C. perfringens, to detect the increased codons among the putatively highly expressed genes of C. acetobutylicum (see above). We found that 17 triplets encoding 15 amino acids are increased among the highly expressed set of sequences (no optimal codons were detected for Cys, Asp and Thr). It is interesting to note that 13 of these codons were shared between the two species (Table 2), showing that the general pattern described in C. perfringens is also valid for C. acetobutylicum. However, we should remark that the differences observed in the RSCU values between highly and lowly expressed sequences in C. acetobutylicum were not as high as those in C. perfringens (Table 2).
When the CAI values were calculated in C. acetobutylicum (taking as a reference the sequences encoding its ribosomal proteins) we found that the highest values were again displayed by the same genes that lie at the extreme of the second axis generated by the COA, and the correlation between the position of the sequences along this axis and the respective CAI values was highly significant (R=0·56, P<0·0001), although lower than in C. perfringens (this is consistent with the smaller differences in RSCU values between highly and lowly expressed sequences observed in C. acetobutylicum, see above). Therefore, we conclude that, in spite of minor differences, the same main forces are operative for shaping codon usage in the two bacteria studied here, although it should be noted that translational selection appears to be less strong in C. acetobutylicum than in C. perfringens. Whether this difference in the strength of selection is due to differences in generation times and/or effective population size deserves further investigation.
Comparative studies of C. perfringens and C. acetobutylicum
To gain support for the above-mentioned conclusions, we analysed the orthologous sequences from C. perfringens and C. acetobutylicum. Since no qualitative differences were observed using either the Nei–Gojobori or the Li (Li, 1993) method, the results shown correspond to the former method. As can be seen in Fig. 4, as a consequence of the huge genomic rearrangements, most orthologous sequences fell outside the diagonal, indicating a nearly complete lack of gene order conservation. From this result, and taking into account the strong and diverse mutational biases that characterize the two replicative strands of these genomes, it is interesting to split the sequences into three groups: those placed on the leading strand in both species, those placed on the lagging strand in both species, and those which changed strand. The total figures are 561 leading, 50 lagging and 65 that have switched strand. The base composition at the synonymous sites for these pairs is representative, in each species, of the whole dataset (data not shown).
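The three-way split of orthologous pairs can be expressed in a few lines. The sketch below assumes that each gene has already been labelled as lying on the leading or lagging strand in its own genome; the labels, function name and example pairs are hypothetical. The final lines check that the reported group sizes add up to the 676 pairs of the final dataset.

```python
from collections import Counter


def strand_group(strand_a: str, strand_b: str) -> str:
    """Classify one orthologous pair by the replicative strand of each member."""
    return strand_a if strand_a == strand_b else "switched"


# Hypothetical pairs: (strand in C. perfringens, strand in C. acetobutylicum).
pairs = [
    ("leading", "leading"),
    ("leading", "lagging"),
    ("lagging", "lagging"),
    ("lagging", "leading"),
    ("leading", "leading"),
]
groups = Counter(strand_group(a, b) for a, b in pairs)

# Group sizes reported in the text; together they account for all 676 pairs.
reported = {"leading": 561, "lagging": 50, "switched": 65}
total_pairs = sum(reported.values())
```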
ACKNOWLEDGEMENTS
REFERENCES
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.
Bernardi, G. & Bernardi, G. (1986). Compositional constraints and genome evolution. J Mol Evol 24, 1–11.
Bulmer, M. (1991). The selection-mutation-drift theory of synonymous codon usage. Genetics 129, 897–907.
de Miranda, A. B., Alvarez-Valin, F., Jabbari, K., Degrave, W. M. & Bernardi, G. (2000). Gene expression, amino acid conservation, and hydrophobicity are the main factors shaping codon preferences in Mycobacterium tuberculosis and Mycobacterium leprae. J Mol Evol 50, 45–55.
Delcher, A. L., Phillippy, A., Carlton, J. & Salzberg, S. L. (2002). Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res 30, 2478–2483.
Dong, H., Nilsson, L. & Kurland, C. G. (1996). Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J Mol Biol 260, 649–663.
Garcia-Vallve, S., Romeu, A. & Palau, J. (2000). Horizontal gene transfer of glycosyl hydrolases of the rumen fungi. Mol Biol Evol 17, 352–361.
Goncalves, I., Robinson, M., Perriere, G. & Mouchiroud, D. (1999). JADIS: computing distances between nucleic acid sequences. Bioinformatics 15, 424–425.
Gouy, M. & Gautier, C. (1982). Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res 10, 7055–7074.
Grantham, R., Gautier, C., Gouy, M., Jacobzone, M. & Mercier, R. (1981). Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Res 9, r43–74.
Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. London: Academic.
Grocock, R. J. & Sharp, P. M. (2002). Synonymous codon usage in Pseudomonas aeruginosa PAO1. Gene 289, 131–139.
Ikemura, T. (1981). Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J Mol Biol 151, 389–409.
Kanaya, S., Yamada, Y., Kudo, Y. & Ikemura, T. (1999). Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis. Gene 238, 143–155.
Karlin, S. (2001). Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol 9, 335–343.
Kerr, A. R., Peden, J. F. & Sharp, P. M. (1997). Systematic base composition variation around the genome of Mycoplasma genitalium, but not Mycoplasma pneumoniae. Mol Microbiol 25, 1177–1179.
Lafay, B., Lloyd, A. T., McLean, M. J., Devine, K. M., Sharp, P. M. & Wolfe, K. H. (1999). Proteome composition and codon usage in spirochaetes: species-specific and DNA strand-specific mutational biases. Nucleic Acids Res 27, 1642–1649.
Lafay, B., Atherton, J. C. & Sharp, P. M. (2000). Absence of translationally selected synonymous codon usage bias in Helicobacter pylori. Microbiology 146, 851–860.
Li, W. H. (1993). Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol 36, 96–99.
Lobry, J. R. (1996). Origin of replication of Mycoplasma genitalium. Science 272, 745–746.
McInerney, J. O. (1997). Prokaryotic genome evolution as assessed by multivariate analysis of codon usage patterns. Microb Comp Genomics 2, 1–10.
McInerney, J. O. (1998). Replicational and transcriptional selection on codon usage in Borrelia burgdorferi. Proc Natl Acad Sci U S A 95, 10698–10703.
Musto, H., Romero, H., Zavala, A., Jabbari, K. & Bernardi, G. (1999). Synonymous codon choices in the extremely GC-poor genome of Plasmodium falciparum: compositional constraints and translational selection. J Mol Evol 49, 27–35.
Muto, A. & Osawa, S. (1987). The guanine and cytosine content of genomic DNA and bacterial evolution. Proc Natl Acad Sci U S A 84, 166–169.
Nei, M. & Gojobori, T. (1986). Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 3, 418–426.
Nolling, J., Breton, G., Omelchenko, M. V. & 16 other authors (2001). Genome sequence and comparative analysis of the solvent-producing bacterium Clostridium acetobutylicum. J Bacteriol 183, 4823–4838.
Percudani, R., Pavesi, A. & Ottonello, S. (1997). Transfer RNA gene redundancy and translational selection in Saccharomyces cerevisiae. J Mol Biol 268, 322–330.
Post, L. E. & Nomura, M. (1979). Nucleotide sequence of the intercistronic region preceding the gene for RNA polymerase subunit alpha in Escherichia coli. J Biol Chem 254, 10604–10606.
Romero, H., Zavala, A. & Musto, H. (2000a). Codon usage in Chlamydia trachomatis is the result of strand-specific mutational biases and a complex pattern of selective forces. Nucleic Acids Res 28, 2084–2090.
Romero, H., Zavala, A. & Musto, H. (2000b). Compositional pressure and translational selection determine codon usage in the extremely GC-poor unicellular eukaryote Entamoeba histolytica. Gene 242, 307–311.
Sharp, P. M. & Li, W. H. (1986). An evolutionary perspective on synonymous codon usage in unicellular organisms. J Mol Evol 24, 28–38.
Sharp, P. M. & Li, W. H. (1987). The codon Adaptation Index: a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15, 1281–1295.
Sharp, P. M., Tuohy, T. M. & Mosurski, K. R. (1986). Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res 14, 5125–5143.
Shimizu, T., Ohtani, K., Hirakawa, H. & 7 other authors (2002). Complete genome sequence of Clostridium perfringens, an anaerobic flesh-eater. Proc Natl Acad Sci U S A 99, 996–1001.
Sueoka, N. (1962). On the genetic basis of variation and heterogeneity of DNA base composition. Proc Natl Acad Sci U S A 48, 582–592.
Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680.
Zavala, A., Naya, H., Romero, H. & Musto, H. (2002). Trends in codon and amino acid usage in Thermotoga maritima. J Mol Evol 54, 563–568.
Received 17 November 2002; revised 3 December 2002; accepted 17 January 2003.