*Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada;
Institute of Basic Medical Sciences, Beijing, China
Materials and Methods
Functional Groups of Bacterial Genes
The functional classification of the E. coli protein-coding genes (Riley 1993) was obtained from http://www.genetics.wisc.edu/html/orftables/index.html. Some groups of similar functions were merged for our analysis, leaving the following 20 functional groups: amino acid biosynthesis and metabolism (A); biosynthesis of cofactors, prosthetic groups and carriers (B); cell structure (C); structural proteins (C1); energy metabolism (E); carbon compound catabolism (E1); phage, transposon, or plasmid (F); central intermediary metabolism (I); fatty acid and phospholipid metabolism (L); membrane proteins (M); nucleotide biosynthesis and metabolism (N); other known genes (O); cell processes (including adaptation, protection, and putative chaperones) (P); DNA replication (R); transcription, RNA processing, and degradation (S); translation, posttranslational modification (T); hypothetical, unclassified, unknown (U); transport and binding proteins and putative transport proteins (X); putative enzymes (Y); regulatory function and putative regulatory proteins (Z). The original group R in Riley's (1993) scheme contains not only DNA replication genes but also DNA recombination, restriction-modification, and repair genes. Because previous work has suggested that such genes may be of alien origin (Lawrence and Ochman 1997; Mrazek and Karlin 1999), they were removed from this group and added to the O group (other known genes). Sixteen of the 20 functional groups can be further grouped into four superclasses (Tamames et al. 1996): Communication (containing P and Z), Energy (containing A, B, E, E1, I, L, N, and X), Information (containing R, S, and T), and Structure (containing C, C1, and M).
The functional classification of H. influenzae genes was obtained from http://www.tigr.org/tdb/CMR/ghi/htmls/SplashPage.html. It has a scheme similar to that of E. coli, except that it lacks the four categories of carbon compound catabolism (E1), structural proteins (C1), membrane proteins (M), and putative enzymes (Y), and there is an additional group (denoted group P1) which includes functions of protein secretion and trafficking, protein folding, stabilization, modification, repair, and degradation. Thus, for H. influenzae, the functional superclasses are as follows: Communication (containing P, P1, and Z), Energy (containing A, B, E, I, L, N, and X), Information (containing R, S, and T), and Structure (containing only C).
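The superclass assignments described above can be written down as a simple lookup table. The following is a minimal sketch in Python; the variable and function names are illustrative assumptions, and only the group codes and superclass memberships are taken from the text.

```python
# Superclass membership as listed in the text for each genome.
ECOLI_SUPERCLASSES = {
    "Communication": ["P", "Z"],
    "Energy": ["A", "B", "E", "E1", "I", "L", "N", "X"],
    "Information": ["R", "S", "T"],
    "Structure": ["C", "C1", "M"],
}
HINF_SUPERCLASSES = {
    "Communication": ["P", "P1", "Z"],
    "Energy": ["A", "B", "E", "I", "L", "N", "X"],
    "Information": ["R", "S", "T"],
    "Structure": ["C"],
}


def superclass_of(group, table):
    """Return the superclass containing a functional group code, or None."""
    for name, groups in table.items():
        if group in groups:
            return name
    return None
```

Groups that belong to no superclass (for E. coli: F, O, U, and Y) return `None`, which matches the statement that only 16 of the 20 E. coli groups are assigned to a superclass.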
Self-Organizing Map
The SOM is an unsupervised neural network method that is particularly useful for data visualization (Kohonen 1997). The SOM algorithm simultaneously finds a representative set of reference vectors of the training data and positions them on a regular two-dimensional grid of neurons. It can be thought of as a flexible net that is spread into the data "cloud." Because the net is two-dimensional, it can easily be visualized. The mapping from the input space onto the grid of neurons is learned from the training data samples by a simple stochastic learning process whereby the SOM neurons (the reference vectors) are adjusted by small steps with respect to the input vectors. A thorough description of the algorithm can be found elsewhere (Kohonen 1997; Marabini and Carazo 1994). The following is a summary of the method:
[Equation (1): not recoverable from the source]
[Equation (2): not recoverable from the source]
For our analysis, we used the SOM Toolbox (http://www.cis.hut.fi/projects/somtoolbox), a newly developed MATLAB-based (The MathWorks, Inc.) SOM implementation.
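The stochastic learning process described above can be illustrated with a minimal sketch in Python with NumPy. This is not the SOM Toolbox code. It assumes a rectangular grid, sequential (one sample at a time) training, a Gaussian neighborhood, and linearly decaying learning rate and radius. The function name and all parameter values are illustrative assumptions.

```python
import numpy as np


def train_som(data, rows=6, cols=6, n_steps=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small rectangular SOM; returns a (rows, cols, dim) codebook."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    dim = data.shape[1]
    # Initialize reference vectors from randomly chosen training samples.
    picks = rng.integers(0, len(data), rows * cols)
    codebook = data[picks].reshape(rows, cols, dim).copy()
    # Grid coordinates of every neuron, used by the neighborhood function.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for t in range(n_steps):
        frac = t / n_steps
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighborhood radius shrinks
        x = data[rng.integers(0, len(data))]     # pick one input vector at random
        # Best-matching unit: neuron whose reference vector is closest to x.
        dists = np.linalg.norm(codebook - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Gaussian neighborhood around the best-matching unit on the 2-D grid.
        grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        # Move each reference vector a small step toward x.
        codebook += lr * h[:, :, None] * (x - codebook)
    return codebook
```

In a codon usage setting, each row of `data` would be one gene's codon usage vector. How those vectors are built is not specified in this section, so that step is left out of the sketch.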
Visualization of the Map
The SOM Toolbox provides several ways to display a map. The two methods used in this paper are the U-matrix (unified distance matrix) (Ultsch and Siemon 1990) and component planes. The U-matrix uses color to show the distances between neighboring map units: longer distances are represented by shades of yellow and red, while shorter distances are represented by shades of blue. Therefore, clusters of genes with similar codon usages appear as blue areas, while areas in which the codon usage is changing rapidly between adjacent units appear as yellow and red areas. Because horizontally transferred genes and highly expressed genes have different-from-normal codon usage, they are expected to be separated on the U-matrix.
Clusters of genes can be made easier to recognize by labeling the map with known gene names or functional classes. As genes encoding ribosomal proteins are known to be highly expressed, these genes were marked on the map to serve as a landmark for clusters of highly expressed genes in the projection. Similarly, as genes such as insertion sequences, transposases, restriction-modification endonucleases, and flagella-related genes are often thought to be horizontally transferred, they can serve as landmarks for presumed alien genes. For the E. coli genome, we also labeled those genes that were previously classified according to different expression levels (Sharp and Li 1986): very highly expressed genes (45 ribosomal protein genes and 10 other genes from Sharp and Li's data); highly expressed genes (15 genes in Sharp and Li's data, of which 4 were not found in the current genome sequence); moderate codon bias (57 genes, of which 12 were not found); and low codon bias (58 genes, of which 6 were not found). The genes that were not found probably had their locus names changed over the years.
A component plane consists of the values of one vector component (representing one codon) in all map units and provides an idea of the spread of values of that component. Thus, by showing component planes of all synonymous codons for one amino acid, it is easy to see which codons are more frequently used than the others. An inspection of all component planes indicates which components are correlated. Correlations between codons are revealed as similar patterns in identical positions of the component planes. By comparing the U-matrix with component planes, it is possible to identify which components contribute strongly to a cluster observed in the U-matrix.
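As a minimal sketch of the two operations described above, the following Python code extracts a single component plane from a trained map and computes pairwise correlations between planes. It assumes the map is stored as a `(rows, cols, dim)` NumPy array with one component per codon. The function names are illustrative assumptions, and Pearson correlation is only one possible way to quantify "similar patterns in identical positions."

```python
import numpy as np


def component_plane(codebook, k):
    """Values of vector component k (one codon) across all map units."""
    return codebook[:, :, k]


def component_correlations(codebook):
    """Pearson correlation between every pair of component planes."""
    rows, cols, dim = codebook.shape
    flat = codebook.reshape(rows * cols, dim)  # one row per map unit
    return np.corrcoef(flat, rowvar=False)     # (dim, dim) matrix
```

Entries near 1 indicate codons whose usage rises and falls together across the map, and entries near -1 indicate codons used in opposite regions. A component that is constant over the whole map has undefined correlation.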
Shannon Uncertainty
Another way to measure codon usage is through the information theoretical notion of Shannon uncertainty, or entropy. Shannon uncertainty can be thought of as a measure of randomness; a fair die has a higher Shannon uncertainty than a loaded die. In terms of codon usage, unbiased codon usage has a higher Shannon uncertainty than biased usage. The advantage of using Shannon uncertainty is that it allows a complex source of bias to be represented by a single statistic. The Shannon uncertainty H for M possible outcomes is given by the following formula (Shannon 1948):
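$$H = -\sum_{i=1}^{M} p_i \log_2 p_i \qquad (1)$$
where $p_i$ is the probability of the $i$th outcome.
[Equation (2): not recoverable from the source]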
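The die example and the application to codon usage can be sketched in Python as follows. The first function is the standard Shannon formula in bits. The second is only one plausible reading of a per-gene codon usage index (entropy of synonymous codon choice for each amino acid, averaged over the amino acids present), since the paper's exact definition is not available here. The function names and the small `codon_to_aa` table used in the usage note are assumptions.

```python
import math
from collections import Counter


def shannon_uncertainty(probs):
    """H = -sum(p * log2(p)) in bits; zero-probability outcomes contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def codon_usage_uncertainty(codons, codon_to_aa):
    """Average per-amino-acid entropy of synonymous codon choice (assumed form)."""
    by_aa = {}
    for codon in codons:
        aa = codon_to_aa[codon]
        by_aa.setdefault(aa, Counter())[codon] += 1
    entropies = []
    for counts in by_aa.values():
        total = sum(counts.values())
        entropies.append(shannon_uncertainty(c / total for c in counts.values()))
    return sum(entropies) / len(entropies)
```

A fair six-sided die gives `shannon_uncertainty([1/6] * 6)`, which equals log2(6), while any loaded die gives a smaller value. Likewise, with a hypothetical partial table `{"CGT": "R", "CGC": "R", "TTC": "F", "TTT": "F"}`, a gene that uses both codons of each amino acid equally scores higher than one that always uses the same codon.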
Results
Although the axes of a U-matrix have no intrinsic biological meaning, the vertical axis was found to be correlated with G+C content and average gene length, with genes of lower G+C content and shorter length found near the top of the map, and increasing G+C content and length toward the lower part (figure not shown). This is consistent with the observation that the presumed alien genes (on the upper side of the map) have a lower average G+C content (46.0%) and a much shorter average length (246 amino acids) than the genes with normal codon usage (52.9% G+C and 349 amino acids) and highly expressed genes (52.3% G+C and 338 amino acids). Note, however, that not all alien genes have reduced G+C content compared with nonalien genes. Some alien genes have very high G+C contents (Lawrence and Ochman 1997; Karlin, Mrazek, and Campbell 1998). Along the horizontal axis, average G+C content increases on the left, remains high in the middle, and decreases on the right. There is no simple trend in average gene length along this axis.
From component planes of synonymous codons, one can identify which codons are more commonly used to encode an amino acid. Figure 2 shows the component planes of the six synonymous codons of arginine. It can be observed that CGC and CGT are the two most commonly used codons for coding arginine, while AGA and AGG are the least common. A comparison of images of component planes and the U-matrix shows that CGT (arginine), TCT (serine), GGT (glycine), TTC (phenylalanine), AAC (asparagine), TAC (tyrosine), and ATC (isoleucine) mainly occur in the class of presumed highly expressed genes. AGA (arginine), AGG (arginine), CTA (leucine), ACA (threonine), and ATA (isoleucine) occur mainly in the class of presumed alien genes. These codons contribute strongly to the formation of their respective categories.
Discussion
Karlin, Campbell, and Mrazek (1998) introduced a new method to measure the codon bias of one group of genes relative to that of another group, based on a modification of the existing Codon Adaptation Index (CAI; Sharp and Li 1987). They used the method to detect alien genes and highly expressed genes in several bacterial genomes, including those of H. influenzae and M. jannaschii (Mrazek and Karlin 1999). In the H. influenzae genome, Mrazek and Karlin (1999) identified 48 alien genes. Because of discrepancies between their gene-naming system and the one we used, we could locate only 38 of these genes. Nevertheless, a comparison of these 38 presumed alien genes with the presumed alien genes identified with SOM reveals 30 genes in common. Therefore, most (79%) of the presumed alien genes detected by Mrazek and Karlin that we could locate fall within the alien gene cluster identified in this study. In the M. jannaschii genome, Mrazek and Karlin identified 23 alien genes, of which only 7 (30%) were identified by SOM. However, of the 22 highly expressed genes identified by Mrazek and Karlin, 19 (86%) were found by SOM. It should be noted that Karlin et al.'s method considers only genes of >200 codons, which excludes many shorter alien genes. For example, in a 1.43-Mb E. coli contig, they identified only 16 alien genes, while Lawrence and Ochman (1997) identified 228 alien genes (Karlin, Mrazek, and Campbell 1998). Nonetheless, the above comparisons demonstrate that both the alien genes and the highly expressed genes detected by Mrazek and Karlin (1999) are generally subsets of the respective gene categories identified by SOM.
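The percentages quoted above follow directly from the reported counts. As a small worked check, the sketch below recomputes them in Python and also shows how such an overlap could be computed from two gene lists. The function name is an illustrative assumption, and no actual gene identifiers from either study are reproduced.

```python
def recovery(reference_genes, som_genes):
    """Count and percentage of reference genes that also appear in the SOM cluster."""
    reference = set(reference_genes)
    shared = reference & set(som_genes)
    return len(shared), 100.0 * len(shared) / len(reference)


# (genes in common, genes in the reference set) as reported in the text.
reported = {
    "H. influenzae alien": (30, 38),
    "M. jannaschii alien": (7, 23),
    "M. jannaschii highly expressed": (19, 22),
}
rounded = {name: round(100 * hit / total) for name, (hit, total) in reported.items()}
```

The rounded values are 79, 30, and 86, in agreement with the percentages given in the paragraph.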
In addition to the main methodological difference between this work and other work on the detection of alien genes in bacterial genomes, we also applied an information-theoretic measure of codon usage based on Shannon uncertainty. Although the "effective number of codons" of a gene (Wright 1990) also quantifies how far the codon usage of a gene departs from equal usage of synonymous codons, it takes no account of amino acid composition. The Shannon uncertainty of codon usage calculates the combined entropy of the 61 sense codons and averages it over the 20 amino acids. This index appears to distinguish effectively the major gene categories within each of the seven genomes (table 1). In all cases, the highly expressed gene class shows a decrease in Shannon uncertainty compared with the average codon usage of the genome, which indicates that this class is more biased in codon usage than the genome average. The index for the alien gene class is increased above the genome average, indicating that codon bias is decreased in this class. This is expected for a group of genes that have been horizontally transferred from many different lineages. The normally expressed gene class has a Shannon uncertainty close to the genome average, since the bulk of the genes of a genome are in the normal class. However, it seems meaningless to compare Shannon uncertainty across genomes, since there is no connection between this index for the genomes and their phylogenetic relationship (see, e.g., table 1; the Shannon uncertainty of the E. coli genome differs more from that of H. influenzae than from that of A. aeolicus).
Class Structures of Codon Usage Map
In addition to the major class structures (genes with normal codon usage, presumed highly expressed genes, and presumed alien genes) found from the U-matrix, there are several subclusters within each of the classes, suggesting that all categories can be further divided. Such subclasses in the presumed alien gene category indicate multiple origins of horizontally transferred genes and different stages of amelioration toward the host genome (Lawrence and Ochman 1997, 1998). As the codon usage of horizontally transferred genes approaches the host usage over time, genes that were transferred earlier are expected to appear near the border between the clusters of alien genes and genes with normal codon usage.
Similarly, there exist subclasses in the clusters of genes with highly biased codon usage, as very highly expressed genes (such as the ribosomal proteins) tend to have more biased usage than do moderately highly expressed genes (such as tRNA synthetase and RNA polymerase). On the E. coli codon usage map (fig. 1), some genes that were previously identified as having moderate codon bias (Sharp and Li 1986) are located in the area of highly expressed genes. This discrepancy is caused by the criteria that were used to define different levels of codon bias. As Sharp and Li (1986) noted, insufficient data were available then to classify these genes accurately with regard to level of gene expression. The smaller variation within the category of normally expressed genes may be derived from a combination of many factors, such as different expression levels, sequences of different functions, and different G+C contents.
Functional Classification of Gene Categories
The functional classification of gene categories identified with SOM gives further insight into codon usage patterns of the E. coli and H. influenzae genomes. The distributions of different functions among highly expressed genes, normally expressed genes, and putative alien genes are very different from each other. Of particular interest, Jain, Rivera, and Lake (1999), using a different approach, recently found that operational genes (those involved in housekeeping) are more likely to be horizontally transferred than informational genes (those involved in transcription, translation, and related processes). Our work supports this proposal. However, in Jain, Rivera, and Lake's (1999) work, several functional groups (energy metabolism, transport proteins, and replication) were not included in their list of operational or informational genes. Our result supports the hypothesis that energy metabolism, replication, and transport protein genes are relatively less often horizontally transferred (fig. 4). However, some replication-related genes, such as those for DNA recombination, restriction, and modification, tend to have unusual codon usage and thus may also be horizontally transferred. It will be interesting to find out whether functional distribution of alien genes shows the same trend in other bacterial genomes, especially the archaebacteria.
Horizontal Gene Transfer in the Seven Genomes
The fact that a cluster of genes with unusual codon usage was found by SOM in all seven bacterial genomes suggests that horizontal gene transfer in prokaryotes is a widespread phenomenon (Doolittle 1999; Jain, Rivera, and Lake 1999). The mixing of gene classes of the four hyperthermophilic organisms (A. aeolicus, A. fulgidus, M. thermoautotrophicum, and P. horikoshii) is surprising but understandable, as both convergent evolution and horizontal transfer between these organisms are expected because they live in similar environments. In cases of horizontal transfer, perhaps a cluster analysis of the codon usage information in the CUTG database (Nakamura, Gojobori, and Ikemura 1999) could be used to identify the possible source organisms of the transferred genes.
Footnotes
1 Keywords: codon usage, self-organizing map, genome, gene function, horizontal gene transfer
2 Address for correspondence and reprints: Huai-chun Wang, Institute of Basic Medical Sciences, 27 Taiping Road, Beijing 100850, China. wanghc{at}nic.bmi.ac.cn
Literature Cited
Badger, J. 1999. Exploration of microbial genomic sequences via comparative analysis. Ph.D. thesis, University of Illinois at Urbana-Champaign.
Blattner, F. R., G. Plunkett III, C. A. Bloch et al. (17 co-authors). 1997. The complete genome sequence of Escherichia coli K-12. Science 277:1453–1474.
Bult, C. J., O. White, G. J. Olsen et al. (40 co-authors). 1996. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273:1058–1073.
Deckert, G., P. V. Warren, T. Gaasterland et al. (15 co-authors). 1998. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392:353–358.
Doolittle, W. F. 1999. Phylogenetic classification and the universal tree. Science 284:2124–2129.
Fleischmann, R. D., M. D. Adams, O. White et al. (40 co-authors). 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512.
Grantham, R., C. Gautier, M. Gouy, R. Mercier, and A. Pave. 1980. Codon catalog usage and the genome hypothesis. Nucleic Acids Res. 8:r49–r62.
Ikemura, T. 1981. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J. Mol. Biol. 146:1–21.
Ikemura, T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2:13–34.
Jain, R., M. C. Rivera, and J. A. Lake. 1999. Horizontal gene transfer among genomes: the complexity hypothesis. Proc. Natl. Acad. Sci. USA 96:3801–3806.
Karlin, S., A. M. Campbell, and J. Mrazek. 1998. Comparative DNA analysis across diverse genomes. Annu. Rev. Genet. 32:185–225.
Karlin, S., J. Mrazek, and A. M. Campbell. 1998. Codon usages in different gene classes of the Escherichia coli genome. Mol. Microbiol. 29:1341–1355.
Kawarabayasi, Y., M. Sawada, H. Horikawa et al. (30 co-authors). 1998. Complete sequence and gene organization of the genome of a hyper-thermophilic archaebacterium, Pyrococcus horikoshii OT3. DNA Res. 5:55–76.
Klenk, H. P., R. A. Clayton, J. F. Tomb et al. (51 co-authors). 1997. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390:364–370.
Kohonen, T. 1982. Self-organized formation of topologically correct feature map. Biol. Cybern. 43:59–69.
Kohonen, T. 1997. Self-organizing maps. 2nd extended edition. Springer, Berlin.
Lawrence, J. G., and H. Ochman. 1997. Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol. 44:383–397.
Lawrence, J. G., and H. Ochman. 1998. Molecular archaeology of Escherichia coli genome. Proc. Natl. Acad. Sci. USA 95:9413–9417.
Marabini, R., and J. M. Carazo. 1994. Pattern recognition and classification of images of biological macromolecules using artificial neural networks. Biophys. J. 66:1804–1814.
Mathe, C., A. Peresetsky, P. Dehais, M. van Montagu, and P. Rouze. 1999. Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction. J. Mol. Biol. 285:1977–1991.
Médigue, C., T. Rouxel, P. Vigier, A. Henaut, and A. Danchin. 1991. Evidence of horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol. 222:851–856.
Mrazek, J., and S. Karlin. 1999. Detecting alien genes in bacterial genomes. Ann. N.Y. Acad. Sci. 870:314–329.
Nakamura, Y., T. Gojobori, and T. Ikemura. 1999. Codon usage tabulated from the international DNA sequence databases; its status 1999. Nucleic Acids Res. 27:292.
Riley, M. 1993. Functions of the gene products of Escherichia coli. Microbiol. Rev. 57:862–952.
Shannon, C. E. 1948. A mathematical theory of communication. Bell System Tech. J. 27:379–423, 623–656.
Sharp, P. M., and W.-H. Li. 1986. Codon usage in regulatory genes in Escherichia coli does not reflect selection for rare codons. Nucleic Acids Res. 14:7737–7749.
Sharp, P. M., and W.-H. Li. 1987. The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15:1281–1295.
Smith, D. R., L. A. Doucette-Stamm, C. Deloughery et al. (37 co-authors). 1997. Complete genome sequence of Methanobacterium thermoautotrophicum deltaH: functional analysis and comparative genomics. J. Bacteriol. 179:7135–7155.
Tamames, J., G. Casari, C. Ouzounis, and A. Valencia. 1996. Genomes with distinct function composition. FEBS Lett. 389:96–101.
Ultsch, A., and H. P. Siemon. 1990. Kohonen's self-organizing feature maps for exploratory data analysis. Pp. 305–308 in Proceedings of the International Neural Network Conference 1990. Kluwer, Dordrecht, The Netherlands.
Vesanto, J. 1999. SOM-based data visualization methods. Intelligent Data Anal. 3:111–126.
Wright, F. 1990. The 'effective number of codons' used in a gene. Gene 87:23–29.