University of Wisconsin-Madison, Department of Plant Pathology, Madison, Wisconsin 53706
IN THE SHORT TIME since the first bacterial genome was sequenced (3), additional bacterial genomes from over 70 genera in 12 different phyla have been sequenced, resulting in over 500 Mb of sequence data. Even as the scientific community sequences ever more genomes, the avalanche of data we have already acquired has not come close to being fully analyzed. Among the reasons for this are that the statistical tools to address many genomics questions remain to be developed, and we do not yet fully understand the results from the statistical tools we already have.
The majority of available bacterial genome sequences are in the Proteobacteria and Firmicutes phyla. Comparison of genomes from closely related species or genera in these phyla has clarified which portions of the genomes are conserved among species and genera and which portions differ. In many cases, genes in closely related genera are colinear along much of the chromosome, but single gene differences and gene islands consisting of multiple genes are interspersed along the shared genomic backbone. It is often not clear how these gene islands were acquired by the bacteria, but many have one or more of the hallmarks of horizontal gene transfer (HGT), including low GC content, unusual codon usage, fragments of mobile element sequences (such as phage, plasmid, transposon, or insertion element sequences), a phylogenetic history that does not match that of conserved housekeeping genes, and absence in a closely related genome (4).
Not all genes that are likely to have been horizontally transferred maintain all of the traits of HGT. The most powerful method for identifying HGT is comparison of genomes within a species or genus, but in many cases, only one or a few genome sequences are available so this is not possible. Indirect methods that search genomes for genes with low GC content or unusual codon usage have been developed, and these can be used to detect HGT candidates, but these methods do not necessarily detect the same sets of genes, leaving open the question of whether some of these methods are not actually detecting HGT (9).
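As a rough illustration of what an indirect, composition-based screen might look like, the following sketch flags genes whose GC fraction deviates strongly from the mean of all genes in a genome. It is not taken from any of the studies cited here: the function names, the toy sequences, and the z-score cutoff of 1.5 are assumptions chosen only for illustration, and a gene flagged this way is a candidate for closer inspection, not evidence of HGT.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    gc = sum(base in "GC" for base in seq)
    at = sum(base in "AT" for base in seq)
    return gc / (gc + at) if gc + at else 0.0


def flag_atypical_genes(genes, z_cutoff=1.5):
    """Return names of genes whose GC fraction is unusual for this genome.

    genes maps a gene name to its nucleotide sequence. The cutoff is a
    hypothetical default, not a value recommended by the cited work.
    """
    values = {name: gc_fraction(seq) for name, seq in genes.items()}
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        return []  # every gene has the same composition
    return [name for name, value in values.items()
            if abs(value - mu) / sigma > z_cutoff]


# Toy input: five genes at 50% GC and one AT-rich gene at 10% GC.
toy_genes = {f"gene{i}": "ATGC" * 5 for i in range(1, 6)}
toy_genes["island_gene"] = "ATATATATAG"
flagged = flag_atypical_genes(toy_genes)
```

A screen of this kind considers base composition only, so it would be expected to disagree with codon usage or phylogenetic screens on at least some genes.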
Several investigators have tried to clarify which, if any, of these indirect methods are best able to detect HGT. Koski et al. (6) compared the largely colinear Escherichia coli and Salmonella typhi genomes and defined HGT candidates based on which genes were present only in E. coli. They found that gene base composition was a poor indicator of HGT. In contrast, Daubin et al. (2) examined the codon usage and base composition in genes unique to genomes from several bacterial species and found that these genes, believed to be recently acquired, have a relatively low GC content compared with surrounding genes, even in AT-rich genomes. Their results suggest that "peculiar evolutionary pressures" are acting on genes that undergo HGT that result in a maintained low GC content relative to the rest of the genome. They concluded that base composition analysis does not significantly underestimate the amount of HGT in a genome and, thus, that GC content analysis is a useful method for identifying horizontally transferred genes. Ragan (9) used multiple indirect methods to identify HGT candidates in E. coli and found that different methods identified different subsets of genes. Ragan concluded that these different methods do divide genes into classes that are probably biologically important but that we still do not understand enough about genome evolution to interpret the importance of these different gene subsets. Still, these indirect methods serve to help biologists focus their attention on a gene subset when searching for HGT candidates among a sometimes overwhelming number of genes.
These HGT candidates often carry genes important for a part of a species' life cycle, such as virulence genes, and are responsible for the differences in competitiveness in closely related species or strains (for examples, see Refs. 1, 7, and 8). These gene islands may drive the differentiation of species by allowing strains with the gene island to be more fit in particular environments. This may result in the interaction of those strains with a different microbial community and allow them to acquire new gene islands. The presence of one gene island can even be required for the acquisition of other islands, as in the human pathogen Vibrio cholerae. In this example, a gene island encoding the toxin-coregulated pili (TCP), which is required for virulence, also encodes the receptor for CTX, a filamentous phage that carries a second set of virulence genes encoding cholera toxin (10).
In the previous online release of Physiological Genomics (release 16.1, December 2003), Zhang and Zhang (13) reported the use of a windowless method, called the cumulative GC profile, that allowed them to locate regions differing in GC content from surrounding genes and to identify three new gene islands. This windowless method is essentially an algorithm that produces a line that changes direction sharply wherever regions with different GC content are encountered. They had previously reported on this method (11, 12), but it had not been applied to identify new candidates for HGT until now.
The output of the algorithm reported by Zhang and Zhang (13) is far more intuitive than that of the window methods presently used. With current methods, the researcher must empirically choose a window of a certain number of nucleotides, and the sequence within that sliding window is then analyzed. If the window is too large, then the resolution of the output is low. If the window is too small, then the output is noisy and difficult to interpret. In either case, researchers are likely to miss regions where there is an abrupt change in GC content. With the method reported by Zhang and Zhang, the output is a fairly smooth curve plotted in two dimensions. A gradual up or down slope in the curve indicates a relatively lower or higher GC content, respectively. A sudden change in GC content results in a sharp maximum or minimum in the curve.
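The contrast between the two approaches can be sketched in a few lines of code. The example below is a simplified illustration, not the implementation of Zhang and Zhang: it assumes the curve is a running count of A and T minus G and C, and it removes the genome-wide trend with a straight line from the origin to the final point, where a least-squares fit could be used instead. The function names and the toy sequence are hypothetical.

```python
def sliding_window_gc(seq, window, step=1):
    """GC fraction in each window; the window size must be chosen by hand."""
    seq = seq.upper()
    return [sum(base in "GC" for base in seq[i:i + window]) / window
            for i in range(0, len(seq) - window + 1, step)]


def cumulative_gc_profile(seq):
    """Windowless, detrended cumulative AT-minus-GC walk along seq.

    Assumed convention: the curve rises through AT-rich (lower GC)
    stretches and falls through GC-rich ones.
    """
    walk = []
    total = 0
    for base in seq.upper():
        if base in "AT":
            total += 1
        elif base in "GC":
            total -= 1
        # Ambiguous bases such as N leave the running total unchanged.
        walk.append(total)
    if not walk:
        return []
    # Subtract the straight line joining the origin to the last point so
    # that only departures from the overall composition remain.
    slope = walk[-1] / len(walk)
    return [value - slope * (i + 1) for i, value in enumerate(walk)]


# Toy sequence: a GC-rich backbone interrupted by an AT-rich "island".
toy = "G" * 10 + "A" * 10 + "G" * 10
profile = cumulative_gc_profile(toy)
island_start = min(range(len(profile)), key=profile.__getitem__)
island_end = max(range(len(profile)), key=profile.__getitem__)
```

In this toy case the curve falls through the backbone, turns sharply upward at the first base of the AT-rich stretch, and turns downward again at its end, so the sharp minimum and maximum bracket the island without any window size having been chosen.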
Zhang and Zhang used this algorithm to examine the GC content of the Bacillus cereus genome. B. cereus is a gram-positive spore-forming bacterium that is an opportunistic pathogen and commonly found in soil. It is closely related to B. anthracis, the infamous and deadly pathogen. They identified three regions where the graphed line changed direction sharply and subsequently identified three previously unreported B. cereus gene islands not found in B. anthracis. These gene islands had other hallmarks of horizontally transferred regions including insertion into a tRNA gene and genes homologous to phage genes.
Because this method results in a graph representing the base composition of the entire genome at once, the authors were able to confirm an earlier report that the B. cereus genome can be divided into three regions that vary in base composition (5). Their method allowed them to display this in a format that was intuitive and that also allowed comparison of the base composition of entire genomes from multiple species on one graph.
The power of this technique is that the authors were able to identify these gene islands with their method prior to comparison of the two genomes. Thus the many researchers who have only one available genome sequence for their organism of interest should be able to use this technique to identify putative horizontally transferred gene islands. In cases where there are multiple genome sequences available, this method will aid in identification of important gene islands that may be present in multiple strains of a single species. By their demonstration of the utility of their windowless base composition analysis, Zhang and Zhang have added another important device to the bioinformatics toolbox.
Remarkably, these three gene islands were not discussed in the initial report on the B. cereus genome (5). This result highlights how much analysis remains to be done even on genomes that have already been published. It also raises the question of whether these gene islands were left out of the original publication because of the manuscript size limitations imposed by many journals. As we are provided with additional tools for genome analysis, such as this one developed by Zhang and Zhang, and with many more genomes to compare each new genome against, we need to develop new formats for publication of genome sequences that allow a more comprehensive analysis of each new sequence.
FOOTNOTES
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).
Address for reprint requests and other correspondence: A. O. Charkowski, Univ. of Wisconsin-Madison, Dept. of Plant Pathology, 1630 Linden Dr., Madison, WI 53706 (E-mail: amyc{at}plantpath.wisc.edu).
10.1152/physiolgenomics.00199.2003.