Making sense of an alphabet soup: the use of a new bioinformatics tool for identification of novel gene islands. Focus on "Identification of genomic islands in the genome of Bacillus cereus by comparative analysis with Bacillus anthracis"

Amy O. Charkowski

University of Wisconsin-Madison, Department of Plant Pathology, Madison, Wisconsin 53706

IN THE SHORT TIME since the first bacterial genome was sequenced (3), additional bacterial genomes from over 70 genera in 12 different phyla have been sequenced, resulting in over 500 Mb of sequence data. Even as the scientific community sequences ever more genomes, the avalanche of data we have already acquired has not come close to being fully analyzed. Among the reasons for this are that the statistical tools to address many genomics questions remain to be developed, and we do not yet fully understand the results from the statistical tools we already have.

The majority of available bacterial genome sequences are in the Proteobacteria and Firmicutes phyla. Comparison of genomes from closely related species or genera in these phyla has clarified which portions of the genomes are conserved among species and genera and which portions differ. In many cases, genes in closely related genera are colinear along much of the chromosome, but single gene differences and gene islands consisting of multiple genes are interspersed along the shared genomic backbone. It is often not clear how these gene islands were acquired by the bacteria, but many have one or more of the hallmarks of horizontal gene transfer (HGT), including low GC content, unusual codon usage, fragments of mobile element sequences (such as phage, plasmid, transposon, or insertion element sequences), a phylogenetic history that does not match that of conserved housekeeping genes, and absence in a closely related genome (4).

Not all genes that are likely to have been horizontally transferred maintain all of the traits of HGT. The most powerful method for identifying HGT is comparison of genomes within a species or genus, but in many cases only one or a few genome sequences are available, so this comparison is not possible. Indirect methods that search genomes for genes with low GC content or unusual codon usage have been developed, and these can be used to detect HGT candidates. However, these methods do not necessarily detect the same sets of genes, leaving open the question of whether some of them actually detect HGT (9).

Several investigators have tried to clarify which, if any, of these indirect methods are best able to detect HGT. Koski et al. (6) compared the largely colinear Escherichia coli and Salmonella typhi genomes and defined HGT candidates based on which genes were present only in E. coli. They found that gene base composition was a poor indicator of HGT. In contrast, Daubin et al. (2) examined the codon usage and base composition in genes unique to genomes from several bacterial species and found that these genes, believed to be recently acquired, have a relatively low GC content compared with surrounding genes, even in AT-rich genomes. Their results suggest that "peculiar evolutionary pressures" act on genes that undergo HGT and maintain a low GC content in these genes relative to the rest of the genome. They concluded that base composition analysis does not significantly underestimate the amount of HGT in a genome and, thus, that GC content analysis is a useful method for identifying horizontally transferred genes. Ragan (9) used multiple indirect methods to identify HGT candidates in E. coli and found that different methods identified different subsets of genes. Ragan concluded that these different methods do divide genes into classes that are probably biologically important but that we still do not understand enough about genome evolution to interpret the importance of these different gene subsets. Still, these indirect methods help biologists focus their attention on a gene subset when searching for HGT candidates among a sometimes overwhelming number of genes.

These HGT candidates often carry genes important for a part of a species' life cycle, such as virulence genes, and are responsible for the differences in competitiveness in closely related species or strains (for examples, see Refs. 1, 7, and 8). These gene islands may drive the differentiation of species by allowing strains with the gene island to be more fit in particular environments. This may result in the interaction of those strains with a different microbial community and allow them to acquire new gene islands. The presence of one gene island can even be required for the acquisition of other islands, as in the human pathogen Vibrio cholerae. In this example, a gene island encoding the toxin-coregulated pilus (TCP), which is required for virulence, also encodes the receptor for CTXΦ, a filamentous phage that carries a second set of virulence genes encoding cholera toxin (10).

In the previous online release of Physiological Genomics (release 16.1, December 2003), Zhang and Zhang (13) reported the use of a windowless method, called the cumulative GC profile, that allowed them to identify regions differing in GC content from surrounding genes and to identify three new gene islands. This windowless method is essentially an algorithm that produces a line that changes direction sharply as regions with different GC content are encountered. They had previously reported on this method (11, 12), but it had not been applied to identify new candidates for HGT until now.
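The idea of a line that turns sharply where GC content changes can be illustrated with a minimal sketch. The code below is not the authors' implementation: it assumes a simple cumulative sum that steps up for A or T and down for G or C, then subtracts the genome-wide average step so that the curve is flat on average. The function name, the detrending choice, and the toy sequence are all hypothetical.

```python
def cumulative_gc_profile(seq):
    """Return a detrended cumulative (A+T) - (G+C) curve, one value per base.

    Assumption for illustration only: the trend removed is the mean step
    over the whole sequence, which may differ from the fit used in (13).
    """
    steps = []
    for base in seq.upper():
        if base in "AT":
            steps.append(1)    # AT-rich stretches push the curve up
        elif base in "GC":
            steps.append(-1)   # GC-rich stretches push the curve down
        else:
            steps.append(0)    # ambiguous bases (e.g. N) leave it unchanged
    if not steps:
        return []
    mean_step = sum(steps) / len(steps)
    profile = []
    running = 0
    for n, step in enumerate(steps, start=1):
        running += step
        profile.append(running - mean_step * n)
    return profile


# Toy sequence (hypothetical): a 50% GC backbone with a 100-base,
# GC-free "island" inserted at position 400.
backbone = "ATGC" * 100
island = "AATA" * 25
toy_genome = backbone + island + backbone

profile = cumulative_gc_profile(toy_genome)
turn_down_to_up = profile.index(min(profile))  # near the island start
turn_up_to_down = profile.index(max(profile))  # near the island end
print(turn_down_to_up, turn_up_to_down)
```

In this toy case the curve drifts gently downward along the backbone, climbs steeply through the low-GC island, and then drifts downward again, so the island boundaries appear as a sharp minimum and a sharp maximum without any window size having been chosen.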

The output of the algorithm reported by Zhang and Zhang (13) is far more intuitive than that of the window methods presently used. With current methods, the researcher must empirically choose a window of a certain number of nucleotides, and the sequence within that sliding window is then analyzed. If the window is too large, then the resolution of the output is low. If the window is too small, then the output is noisy and difficult to interpret. In either case, researchers are likely to miss regions where there is an abrupt change in GC content. With the method reported by Zhang and Zhang, the output is a fairly smooth curve plotted in two dimensions. A gradual up or down slope in the curve indicates a relatively lower or higher GC content, respectively. A sudden change in GC content results in a sharp maximum or minimum in the curve.
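The window-size trade-off described above can be seen in a small sketch of a conventional window analysis. This is a generic illustration, not code from any of the cited studies; the function name, the non-overlapping default step, and the toy sequence are assumptions.

```python
def sliding_window_gc(seq, window, step=None):
    """Return (start, gc_fraction) for each full window along seq.

    Hypothetical helper: windows are non-overlapping unless step is given.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of nucleotides")
    if step is None:
        step = window
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc_count = sum(1 for base in chunk if base in "GC")
        results.append((start, gc_count / window))
    return results


# Toy sequence (hypothetical): a 50% GC backbone with a 100-base,
# GC-free island at positions 400-499.
toy_genome = "ATGC" * 100 + "AATA" * 25 + "ATGC" * 100

well_matched = sliding_window_gc(toy_genome, 100)  # island window drops to 0.0
too_large = sliding_window_gc(toy_genome, 300)     # island diluted to about 0.33
too_small = sliding_window_gc(toy_genome, 2)       # values jump between 0.0 and 1.0
```

With a 100-base window the island happens to line up with one window and stands out clearly. With a 300-base window it is averaged with flanking backbone and its boundaries are lost, and with a 2-base window even the uniform backbone swings between the extremes. In a real genome the island size is not known in advance, which is why the window has to be chosen empirically.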

Zhang and Zhang used this algorithm to examine the GC content of the Bacillus cereus genome. B. cereus is a gram-positive spore-forming bacterium that is an opportunistic pathogen and commonly found in soil. It is closely related to B. anthracis, the infamous and deadly pathogen. They identified three regions where the graphed line changed direction sharply and subsequently identified three previously unreported B. cereus gene islands not found in B. anthracis. These gene islands had other hallmarks of horizontally transferred regions including insertion into a tRNA gene and genes homologous to phage genes.

Because this method results in a graph representing the base composition of the entire genome at once, the authors were able to confirm an earlier report that the B. cereus genome can be divided into three regions that vary in base composition (5). Their method allowed them to display this in a format that was intuitive and that also allowed comparison of the base composition of entire genomes from multiple species on one graph.

The power of this technique is that the authors were able to identify these gene islands with their method prior to comparison of the two genomes. Thus the many researchers who only have one available genome sequence for their organism of interest should be able to use this technique to identify putative horizontally transferred gene islands. In cases where there are multiple genome sequences available, this method will aid in identification of important gene islands that may be present in multiple strains of a single species. By their demonstration of the utility of their windowless base composition analysis, Zhang and Zhang have added another important device to the bioinformatics toolbox.

Remarkably, these three gene islands were not discussed in the initial report on the B. cereus genome (5). This result highlights how much analysis remains to be done even on genomes that have already been published. It also raises the question of whether these gene islands were omitted from the original publication because of the manuscript size limitations imposed by many journals. As we are provided with additional tools for genome analysis, such as this one developed by Zhang and Zhang, and with many more genomes to compare each new genome against, we need to develop new formats for publication of genome sequences that allow a more comprehensive analysis of each new sequence.

FOOTNOTES

Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

Address for reprint requests and other correspondence: A. O. Charkowski, Univ. of Wisconsin-Madison, Dept. of Plant Pathology, 1630 Linden Dr., Madison, WI 53706 (E-mail: amyc@plantpath.wisc.edu).

10.1152/physiolgenomics.00199.2003.

REFERENCES

  1. Alfano JR, Charkowski AO, Deng WL, Badel JL, Petnicki-Ocwieja T, van Dijk K, and Collmer A. The Pseudomonas syringae Hrp pathogenicity island has a tripartite mosaic structure composed of a cluster of type III secretion genes bounded by exchangeable effector and conserved effector loci that contribute to parasitic fitness and pathogenicity in plants. Proc Natl Acad Sci USA 97: 4856–4861, 2000.
  2. Daubin V, Lerat E, and Perriere G. The source of laterally transferred genes in bacterial genomes. Genome Biol 4: R57, 2003.
  3. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, McKenney K, Sutton G, FitzHugh W, Fields CA, Gocayne JD, Scott J, Shirley R, Liu LI, Glodek A, Kelley JM, Weidman JF, Phillips CA, Spriggs T, Hedblom E, Cotton MD, Utterback TR, Hanna MC, Nguyen DT, Saudek DM, Brandon RC, Fine LD, Fritchman JL, Furhmann JL, Geoghagan NSM, Gnehm CL, McDonald LA, Small KV, Fraser CM, Smith HO, and Venter JC. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512, 1995.
  4. Groisman EA and Ochman H. Pathogenicity islands: bacterial evolution in quantum leaps. Cell 87: 791–794, 1996.
  5. Ivanova N, Sorokin A, Anderson I, Galleron N, Candelon B, Kapatral V, Bhattacharyya A, Reznik G, Mikhailova N, Lapidus A, Chu L, Mazur M, Goltsman E, Larsen N, D’Souza M, Walunas T, Grechkin Y, Pusch G, Haselkorn R, Fonstein M, Ehrlich SD, Overbeek R, and Kyrpides N. Genome sequence of Bacillus cereus and comparative analysis with Bacillus anthracis. Nature 423: 87–91, 2003.
  6. Koski LB, Morton RA, and Golding GB. Codon bias and base composition are poor indicators of horizontally transferred genes. Mol Biol Evol 18: 404–412, 2001.
  7. Lawrence JG and Ochman H. Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci USA 95: 9413–9417, 1998.
  8. Nunes LR, Rosato YB, Muto NH, Yanai GM, da Silva VS, Leite DB, Goncalves ER, de Souza AA, Coletta HD, Machado MA, Lopes SA, and Oliveira RCC. Microarray analyses of Xylella fastidiosa provide evidence of coordinated transcription control of laterally transferred elements. Genome Res 13: 570–578, 2003.
  9. Ragan MA. On surrogate methods for detecting lateral gene transfer. FEMS Microbiol Lett 201: 187–191, 2001.
  10. Waldor MK and Mekalanos JJ. Lysogenic conversion by a filamentous phage encoding cholera toxin. Science 272: 1910–1914, 1996.
  11. Zhang CT, Wang J, and Zhang R. A novel method to calculate the G+C content of genomic DNA sequences. J Biomol Struct Dyn 19: 333–341, 2001.
  12. Zhang CT, Zhang R, and Ou HY. The Z curve database: a graphic representation of genome sequences. Bioinformatics 19: 593–599, 2003.
  13. Zhang R and Zhang CT. Identification of genomic islands in the genome of Bacillus cereus by comparative analysis with Bacillus anthracis. Physiol Genomics 16: 19–23, 2003. First published November 4, 2003; 10.1152/physiolgenomics.00170.2003.




Copyright © 2004 by the American Physiological Society.