1 Oxford Centre for Ecology and Hydrology, Mansfield Road, Oxford OX1 3SR, UK
2 Department of Biology and Biochemistry, University of Bath, Claverton Down, Bath BA2 7AY, UK
Correspondence
Dawn Field
dfield{at}ceh.ac.uk
No genome is an island
Genomes can never be considered as isolated datasets or islands. Rather, they must be viewed and interpreted in the context of the large amount of molecular data available in public databases. We now have a vast collection of genomes in our public databases. This collection contains more than 2500 genomes from bacteria (>200), viruses (>1200), plasmids (>600), eukaryotes (>30) and organelles (>500) (Field et al., 2005; Wheeler et al., 2005). While there are numerous resources for the study of genomes, we focus here on databases and software for the study of complete prokaryotic genomes.
First-generation tools focused on the study of a single genome, and the majority of genomic resources developed to date identify features in individual genomes expressly for the sake of functional annotation. Now there is an obvious shift towards the creation of tools that allow the viewing and manipulation of data in a comparative genomic context. While many of these next-generation comparative genomics databases and software packages combine information from multiple sources to decorate a single target genome with finer detail, there are also truly comparative genomic tools that leverage the information in multiple genomes simultaneously. Tools that allow the direct comparison of two or more genomes are becoming increasingly common (Darling et al., 2004; Hohl et al., 2002; Kurtz et al., 2004). For example, the Artemis Comparison Tool (Table 1) provides a visualization of BLAST hits between two complete genome sequences, thus allowing rapid examination of the degree of synteny (conservation in gene order), major genomic rearrangements, or the integration of novel genomic islands, phages or other foreign elements.
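To make the notion of synteny concrete, the following minimal sketch scores gene-order conservation between two genomes as the fraction of neighbouring gene pairs shared by both. It is an illustration only, not how the Artemis Comparison Tool works: the function name `conserved_adjacencies`, the gene labels and the toy gene orders are hypothetical, orthologues are assumed to be already identified and given the same label, and the chromosomes are treated as linear for simplicity.

```python
from typing import List, Set, FrozenSet


def conserved_adjacencies(order_a: List[str], order_b: List[str]) -> float:
    """Fraction of neighbouring gene pairs in order_a that are also
    neighbours in order_b, ignoring orientation (hypothetical helper)."""

    def adjacencies(order: List[str]) -> Set[FrozenSet[str]]:
        # Each adjacency is an unordered pair of consecutive genes.
        return {frozenset(pair) for pair in zip(order, order[1:])}

    adj_a = adjacencies(order_a)
    adj_b = adjacencies(order_b)
    if not adj_a:
        return 0.0
    return len(adj_a & adj_b) / len(adj_a)


# Toy data (assumed): genome_b carries an inversion of the g3..g5 segment.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6"]
genome_b = ["g1", "g2", "g5", "g4", "g3", "g6"]

score = conserved_adjacencies(genome_a, genome_b)
print(f"conserved adjacencies: {score:.2f}")
```

In this toy case only the two adjacencies at the inversion breakpoints are lost, so an inversion lowers the score far less than a wholesale shuffling of gene order would.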
Despite all the advantages of continually growing collections of genomes, increased taxonomic and ecological richness itself is not enough to solve all the challenges associated with interpreting the contents of these genomes. We simultaneously need to improve the quality and speed of annotation and combine computational studies with empirical studies, especially those that help to elucidate the functions of the large numbers of hypothetical and orphan genes still found in our genome databases. Likewise, the sheer vastness of these growing datasets poses myriad challenges for data management and analyses.
The current status of resources for the study of collections of genomes
With the current wealth of genomes it is more compelling than ever before to find novel ways to maximize the power of comparative genomics to unravel the biology of individual isolates and groups of taxa. In the following three sections we review currently available resources for accessing complete genome sequences, databases and computational servers, and software for local data analysis.
Genome sequences.
The largest collections of raw and annotated genome files can be downloaded from the genome sections of the primary international databases (Table 1). There are now also a growing number of secondary genomic resources dedicated to subsets of genomes grouped by taxonomic similarity and, more recently, by shared niche, as well as specialized projects making reannotated and added-value versions of genomes available (Table 1). All completed and ongoing genome projects can be tracked in the Genomes Online Database (Bernal et al., 2001).
Databases and computational servers.
Increasingly, testing hypotheses about specific genomes or sets of genes is possible using online resources that provide the ability to view and manipulate pre-computed analyses and access a range of computational tools (Table 1). Some of the genomic features that are now catalogued in online databases include: conserved orthologous genes (COGS) (Tatusov et al., 2003; Uchiyama, 2003), gene fusion events (Suhre & Claverie, 2004), orphans (Siew et al., 2004; Wilson et al., 2005), functional groups of genes like cytoplasmic membrane transport proteins (Ren et al., 2004), horizontally derived sequences (Table 1), replication origins and compositional biases (Hallin & Ussery, 2004). Among these sites, the Comprehensive Microbial Resource (CMR) (Peterson et al., 2001) and the Genome Atlas (Hallin & Ussery, 2004) stand out for the variety of tools they contain.
Software for local data analysis and the creation of new tools.
Although they remain a valuable resource, pre-computed datasets may not encompass a genome of interest, nor do they permit the user to explore different analytical parameters. In such cases, it is necessary to download genomes and software, and in some cases it may also be necessary to write new software and create new databases from scratch. One aid is the growing availability of toolkits that facilitate the creation of bespoke programming code, for example the widely used Bio* programming libraries (http://open-bio.org/). Another generic advance is the number of ways to quickly set up bioinformatics computing resources. These include CDs and DVDs full of bioinformatics software as well as complete turn-key workstations optimized for bioinformatics research (Tiwari & Field, 2005).
In the future it is envisioned that specialized analyses will be possible using local software that communicates with a variety of websites on the internet to process data, thus providing maximum power and flexibility. This technology is already being developed; it involves the next generation of the web, and more specifically web services and workflow tools. There are various workflow tools now becoming available, but Taverna (Oinn et al., 2004) is of special interest because it is part of the MyGRID project (http://www.mygrid.org.uk/) to produce next-generation tools in support of data-intensive in silico experiments in biology.
Challenges and opportunities associated with the analysis of many genomes
The ability to compare more genomes is compelling from a scientific standpoint, but also brings with it a series of challenges. Issues of data storage, computational speed, file formats, integration of multiple tools, and ease of access all become more complex. On a higher level, the information required to manipulate the genomes, universal naming conventions, and the construction of databases to incorporate metadata require novel approaches. Perhaps most important are the conceptual advances that must be implemented through the development of new algorithms and statistical approaches for detecting patterns in data. For example, sequence alignment continues to be an area of active research (Miller, 2001) and approaches must now cope with the need to align millions of base pairs of sequence from two or more genomes (Table 1). The evolutionary fluidity of bacterial genomes, in terms of rapid loss and addition of genes and mobile elements and large chromosomal rearrangements, means that complete genome alignments present a far more serious computational challenge than single gene alignments writ large. Phylogenomics, for which accurate alignment is a pre-requisite, is also still actively advancing to meet the complexities involved with whole-genome data rather than single gene alignments (see http://www.phylo.org/). Here we briefly describe six research areas set to make a significant impact on our future understanding of genome sequences in a comparative genomic context.
Genomic annotation in a comparative genomic context.
While the annotation of a new genome in a comparative genomic context (i.e. through comparison with an already annotated close relative) is now a common practice, and there are an increasing number of projects (re-)annotating large numbers of genomes, there are also novel ways to leverage the availability of collections of genomes to improve annotation. One practice involves successively annotating sets of functionally related genes across all genomes in a dataset instead of annotating each genome to completion before attempting another. The Fellowship for the Interpretation of Genomes consortium is developing this approach within the framework of the SEED annotation tool (Table 1), which allows experts to progressively annotate individual 'subsystems' (for example Type II secretion systems or biosynthesis of O-glycans). This new paradigm should improve the quality and quantity of expert annotations available to the wider public.
Merging automated, experimental and curated information.
As the above example illustrates, one of the most exciting prospects for the future is the comparison of information derived from automated, experimental and curated sources. This trend is underscored by the American Society for Microbiology, which recently published a report on the need to characterize functionally unknown and orphan genes and to build a centralized, curated database of all microbial genomes based on experimental analyses (Roberts et al., 2005). Understanding the functions of the large number of hypothetical predicted proteins in our complete genome collection is one of the biggest challenges of the future, and databases that help this effort will be exceptionally useful.
Visualization of genomic comparisons.
The ability to summarize large volumes of genomic data in a visually intuitive format is a critical step. Currently most tools that provide access to multi-genomic information do so with respect to a single reference genome. For example, Fig. 1 was created by inputting the Haemophilus influenzae genome into the standalone Multiple Genome Navigator software (Hoebeke et al., 2003). Databases of the future will ideally let the user switch between views based on each genome as reference strain. They will also provide novel ways to display data that expand beyond this type of view to include the ability to compare every genome with every other genome.
Multiple genome sequences for single species can also reveal evidence concerning the short-term micro-evolution of bacteria, and the dynamics of genome architecture. Such analyses are most powerfully conducted within a phylogenetic or population biology framework; thus genome data should, where possible, be considered in concert with population-level data from a large number of strains (Feil, 2004). Population-level data, such as multilocus sequence typing (MLST) data (Maiden et al., 1998), can reveal the degree of clonality (or genotypic clustering) within populations based on sequence-level analysis of stable (core) housekeeping genes. Alternatively, evidence on the distribution of accessory elements, which are likely to play a role in the adaptation to specific micro-niches, can be assayed by the generation of microarray data. Microarray data can also help to identify hyper-variable regions of the genome, genomic rearrangements and evidence concerning gene expression. Furthermore, it is relatively cheap and simple to generate data from large strain collections, although there remains an urgent need to develop more efficient software to aid in the interpretation of microarray data.
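As a rough illustration of how MLST-style data can expose genotypic clustering, the sketch below groups strains by their allelic profiles and counts allele differences between profiles. Everything in it is assumed for the example: the strain names, the seven-locus profiles and the allele numbers are invented, and the helper functions `group_by_profile` and `allele_differences` are hypothetical names, not part of any MLST software.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]


def group_by_profile(profiles: Dict[str, Profile]) -> Dict[Profile, List[str]]:
    """Group strains that share an identical allelic profile."""
    groups: Dict[Profile, List[str]] = defaultdict(list)
    for strain, profile in profiles.items():
        groups[tuple(profile)].append(strain)
    return dict(groups)


def allele_differences(a: Profile, b: Profile) -> int:
    """Number of loci at which two profiles carry different alleles."""
    return sum(x != y for x, y in zip(a, b))


# Invented profiles: one allele number per housekeeping locus (seven loci assumed).
profiles = {
    "strain_A": (1, 4, 1, 4, 12, 1, 10),
    "strain_B": (1, 4, 1, 4, 12, 1, 10),
    "strain_C": (1, 4, 1, 4, 12, 1, 3),
    "strain_D": (7, 6, 1, 5, 8, 8, 6),
}

groups = group_by_profile(profiles)
for profile, strains in groups.items():
    print(profile, strains)

print(allele_differences(profiles["strain_A"], profiles["strain_C"]))
print(allele_differences(profiles["strain_A"], profiles["strain_D"]))
```

Strains with identical or near-identical profiles (here A, B and C) would be candidates for a single clonal cluster, whereas a strain differing at most loci (here D) would fall outside it. Real analyses use dedicated clustering methods rather than this simple count.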
The value of taking a population genomics approach.
Given the observed diversity within many named species and the increasing ease and decreasing cost of the generation of complete bacterial sequences, it seems inevitable that genomic datasets will in time encompass meaningful population samples for single species. Given an appropriate sampling regime (which could be informed by current population data such as provided by MLST), relatively small samples of sequenced strains, perhaps 10–30, will result in a powerful synthesis between genomics and population biology (Luikart et al., 2003). However, in order to exploit this resource, the existing tools for the analysis of sequence alignments from single gene loci will need to be adapted to deal with complete genome sequences. The problem of alignment was dealt with earlier, but there exists a battery of statistical tests for exploring evolutionary parameters, such as the intensity or direction of selection (Yang & Bielawski, 2000), demographic changes in the population (Strimmer & Pybus, 2001) or the rate of homologous recombination (Holmes et al., 1999), which could be applied to whole-genome data. A powerful approach would be to employ sliding windows, where data are subdivided into blocks of a user-defined size. Each block would, in turn, be analysed sequentially, thus making the computational problems of analysing large sequences tractable. Crucially, this would also provide insights into differing levels of variation along the genome and thus provide evidence as to the consistency of evolutionary forces in disparate genomic locales. For example, tests to detect homologous recombination based either on the distribution of polymorphic sites, or on the level of phylogenetic consistency, may reveal the extent (boundary points) of large-scale sequence mosaicism in genomes, at the level of tens or hundreds of kilobases rather than at the level of single gene loci, the scale at which such tests are traditionally employed (Smith et al., 1991). Although the possibility of large-scale homologous recombination is rarely investigated, it has been observed in E. coli (Guttman & Dykhuizen, 1994) and more recently in Staphylococcus aureus (Robinson & Enright, 2004). Thus, there is broad scope for novel types of databases and software that integrate genomes, information on mosaicism and the ability to make phylogenetic inferences (Fig. 2).
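The sliding-window idea described above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not an implementation of any of the cited tests: it takes an already aligned set of sequences, marks each column as polymorphic if more than one base is observed (gaps and N are ignored), and counts polymorphic sites in windows of a user-defined size. The function name `polymorphic_sites_per_window`, its `window` and `step` parameters, and the toy alignment are all hypothetical.

```python
from typing import List, Tuple


def polymorphic_sites_per_window(
    alignment: List[str], window: int, step: int
) -> List[Tuple[int, int]]:
    """Return (window_start, polymorphic_site_count) for each window.

    Assumes all sequences are pre-aligned and of equal length.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    # A column is polymorphic if it holds more than one distinct base,
    # ignoring gap characters and ambiguous 'N' calls.
    variable = []
    for column in zip(*alignment):
        bases = {base.upper() for base in column} - {"-", "N"}
        variable.append(len(bases) > 1)

    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        results.append((start, sum(variable[start:start + window])))
    return results


# Toy alignment (invented): three strains, twelve aligned positions.
toy_alignment = [
    "ACGTACGTACGT",
    "ACGTACGAACGT",
    "ACCTACGAACGT",
]

windows = polymorphic_sites_per_window(toy_alignment, window=4, step=4)
for start, count in windows:
    print(start, count)
```

On whole-genome alignments the same loop would run over windows of kilobases rather than a few bases, and the per-window count could be replaced by any of the statistics mentioned above. Abrupt changes in a statistic between neighbouring windows are the kind of signal that might point to a mosaic boundary.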
Currently the difficulty of obtaining such metadata in a high-quality and easily accessible format is a common bottleneck in large-scale computational studies. This has led to a call for a new genomic standard (Field & Hughes, 2005) to capture information about genome sequences at the time of publication (analogous to the submission of genome annotation files). In this way, the experts generating each genome sequence would be directly responsible for providing data to the wider community about the detailed features of the organism. A catalogue of these reports would provide an extensive amount of novel data and a powerful new research tool for the future and would complement the growing number of initiatives aimed at generating computed genomic features.
Summary and a look to the future
All of the above advances rely on increased data integration and access to finer levels of detail. Ideally, next-generation databases and tools will be able to incorporate population-level data, place genomes into a rigorous phylogenetic and organismal context, and combine computed and curated data to maximize the quality and quantity of data. This will require increased interactions and collaborations between researchers working in allied fields. Multi-disciplinary approaches will be necessary to meet the challenge of interpreting the vast quantity of data generated by bacterial genome sequencing. Education will be essential if researchers are to span the cultural gulfs between these separate disciplines and work productively at the interfaces of fields like ecology and bioinformatics. A key goal will be the ability to seamlessly move between future datasets in the search for answers. Finding further ways to leverage the knowledge of our current, sizeable collection of genomes is also critical to producing rationales for the targeted study of additional taxa, genomes, populations and genes. To appreciate the challenges and potential that lie before us we need to imagine access to thousands of bacterial genomes (and tens of strains for particular species) and modify our vision of the tools we need for the future accordingly.
REFERENCES
Cole, S. T., Eiglmeier, K., Parkhill, J. & 41 other authors (2001). Massive gene decay in the leprosy bacillus. Nature 409, 1007–1011.
Darling, A. C., Mau, B., Blattner, F. R. & Perna, N. T. (2004). Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14, 1394–1403.
Feil, E. J. (2004). Small change: keeping pace with microevolution. Nat Rev Microbiol 2, 483–495.
Field, D. & Hughes, J. (2005). Cataloguing our current genome collection. Microbiology 151, 1016–1019.
Field, D., Hughes, J. & Gray, T. (2005). The GenomeMine database. http://www.genomics.ceh.ac.uk/GMINE/.
Guttman, D. S. & Dykhuizen, D. E. (1994). Clonal divergence in Escherichia coli as a result of recombination, not mutation. Science 266, 1380–1383.
Haft, D. H., Selengut, J. D., Brinkac, L. M., Zafar, N. & White, O. (2005). Genome properties: a system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics. Bioinformatics 21, 293–306.
Hallin, P. F. & Ussery, D. W. (2004). CBS genome atlas database: a dynamic storage for bioinformatic results and sequence data. Bioinformatics 20, 3682–3686.
Hoebeke, M., Nicolas, P. & Bessieres, P. (2003). MuGeN: simultaneous exploration of multiple genomes and computer analysis results. Bioinformatics 19, 859–864.
Hohl, M., Kurtz, S. & Ohlebusch, E. (2002). Efficient multiple genome alignment. Bioinformatics 18 Suppl 1, S312–S320.
Holmes, E. C., Worobey, M. & Rambaut, A. (1999). Phylogenetic evidence for recombination in dengue virus. Mol Biol Evol 16, 405–409.
Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C. & Salzberg, S. L. (2004). Versatile and open software for comparing large genomes. Genome Biol 5, R12.
Luikart, G., England, P. R., Tallmon, D., Jordan, S. & Taberlet, P. (2003). The power and promise of population genomics: from genotyping to genome typing. Nat Rev Genet 4, 981–994.
Maiden, M. C., Bygraves, J. A., Feil, E. & 10 other authors (1998). Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95, 3140–3145.
Miller, W. (2001). Comparison of genomic DNA sequences: solved and unsolved problems. Bioinformatics 17, 391–397.
Oinn, T., Addis, M., Ferris, J. & 8 other authors (2004). Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20, 3045–3054.
Peterson, J. D., Umayam, L. A., Dickinson, T., Hickey, E. K. & White, O. (2001). The comprehensive microbial resource. Nucleic Acids Res 29, 123–125.
Ren, Q., Kang, K. H. & Paulsen, I. T. (2004). TransportDB: a relational database of cellular membrane transport systems. Nucleic Acids Res 32 (database issue), D284–D288.
Roberts, R. J., Karp, P., Kasif, S., Linn, S. & Buckley, M. R. (2005). An Experimental Approach to Genome Annotation. Critical Issues Colloquia Report. Washington, DC: American Society for Microbiology.
Robinson, D. A. & Enright, M. C. (2004). Evolution of Staphylococcus aureus by large chromosomal replacements. J Bacteriol 186, 1060–1064.
Siew, N., Azaria, Y. & Fischer, D. (2004). The ORFanage: an ORFan database. Nucleic Acids Res 32 (database issue), D281–D283.
Smith, J. M., Dowson, C. G. & Spratt, B. G. (1991). Localized sex in bacteria. Nature 349, 29–31.
Strimmer, K. & Pybus, O. G. (2001). Exploring the demographic history of DNA sequences using the generalized skyline plot. Mol Biol Evol 18, 2298–2305.
Suhre, K. & Claverie, J. M. (2004). FusionDB: a database for in-depth analysis of prokaryotic gene fusion events. Nucleic Acids Res 32 (database issue), D273–D276.
Tatusov, R. L., Fedorova, N. D., Jackson, J. D. & 14 other authors (2003). The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41.
Tiwari, B. & Field, D. (2005). A bioinformatics playground. LinuxUser and Developer 46, 50–56.
Uchiyama, I. (2003). MBGD: microbial genome database for comparative analysis. Nucleic Acids Res 31, 58–62.
Welch, R. A., Burland, V., Plunkett, G., 3rd & 16 other authors (2002). Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci U S A 99, 17020–17024.
Wheeler, D. L., Barrett, T., Benson, D. A. & 26 other authors (2005). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 33 (database issue), D39–D45.
Wilson, G. A., Bertrand, N., Patel, Y., Hughes, J. B., Feil, E. J. & Field, D. (2005). Orphans as taxonomically restricted and ecologically important genes. Microbiology 151 (in press).
Yang, Z. & Bielawski, J. P. (2000). Statistical methods for detecting molecular adaptation. Trends Ecol Evol 15, 496–503.