Laboratoire Génomique des Microorganismes Pathogènes, Institut Pasteur, Département de Biologie Moléculaire, 25 rue du Dr Roux, 75724 Paris Cedex 15, France
Author for correspondence: Farid Chetouani. Tel: +33 1 45 68 87 48. Fax: +33 1 45 68 87 86. e-mail: fchetou{at}pasteur.fr
ABSTRACT
Keywords: comparative genomics, bioinformatics, software, differential genome analysis
INTRODUCTION
To date, two complementary in silico methods have been developed that allow genome subtraction. They are based either on computed clusters of homologous proteins or on pairwise protein comparisons. The first approach constructs protein families as follows. First, all proteins of a sequence database, including those of complete genomes, are compared to each other with similarity search software such as BLASTP (Altschul et al., 1997) or FASTA (Pearson & Lipman, 1988). Then, the corresponding search outputs are processed according to default constraints to extract significant hits. Finally, the protein families are constructed using single transitive links: for example, if proteins A and B are similar according to the constraints and proteins B and C are also similar, then proteins A, B and C are stored in the same cluster. Software tools such as CluSTr (Kriventseva et al., 2001), COG (Tatusov et al., 2001), HOBACGEN (Perriere et al., 2000), ProtoMap (Yona et al., 2000) and SYSTERS (Krause et al., 2000) provide access to such sets of homologous proteins, but only COG contains a tool, entitled Phylogenetic pattern search, which allows genome subtraction to select protein families. The second approach does not use fixed constraints: the user defines the similarity thresholds used to decide whether a coding sequence is present or absent in a genome. The software Seebugs (Bruccoleri et al., 1998) belongs to this category and is based on protein sequence comparisons using the FASTA program.
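The single-transitive-link construction described for the first approach can be illustrated with a small union-find sketch. This is not code from any of the cited resources; the function name `single_linkage_clusters` and the input format (a list of protein names plus a list of pairs already judged similar) are assumptions made for illustration.

```python
def single_linkage_clusters(proteins, similar_pairs):
    """Group proteins into families using single transitive links.

    `similar_pairs` holds (a, b) tuples for proteins already judged
    similar under some constraint (hypothetical input format).
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # A~B and B~C place A, B and C in one cluster; D stays alone.
    print(single_linkage_clusters(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```

Note that A and C end up in the same family even though they were never compared directly, which is the defining property of single-linkage clustering.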
To our knowledge, there are only two freely available resources providing a query engine for differential genome analysis: the reference website COG (http://www.ncbi.nlm.nih.gov/COG/) and the Seebugs software. The public database COG contains defined clusters of homologous genes for 34 of the 43 publicly available complete genomes (April 2001). If a user is interested in a specific cellular process or in genome data from a micro-organism not yet included in the COG database, the Seebugs software could be installed locally. However, as the authors of Seebugs admit in their documentation, installation is somewhat complex. Considering this situation, we have developed a software package with a user-friendly web interface for differential genome analysis which we have called FindTarget. The user chooses the similarity criteria to decide whether or not a gene has a counterpart in a set of selected genomes according to BLASTP comparisons between theoretical proteomes (predicted from the DNA sequences). Coloured multiple alignments and phylogenetic trees of conserved proteins are provided to help define relationships between gene products. For each selected gene a link to the corresponding entry in a public annotated genome database (if available) allows access to updated gene information. Any genome, even unfinished or private genomes, can be added.
METHODS
If the database contains n genomes, n×(n−1) proteome BLASTP comparisons should theoretically be performed. To limit computation time, only the proteome versus proteome comparisons defined by the local FindTarget administrator are launched, according to the research interest of the user. To reduce disk space usage, for each proteome versus proteome BLASTP comparison, only the alignment properties of the best hits are saved. These are the names of the query protein and of its best-matching protein, the lengths of the query sequence and of its best-hit protein, the number of similar or identical amino acids in the best overlap region, the length of the best overlap region with its expected value and score, and the number of amino acids found in all overlap regions for the query protein and for its best-hit protein. Typically, for a comparison of two genomes each encoding about 4000 proteins, a parsed BLAST output file has a size of only 150 kilobytes. Several script utilities are provided for easy installation or update of the database. Owing to its design, FindTarget is not restricted to BLASTP comparisons. It can also support other similarity search programs such as FASTA (Pearson & Lipman, 1988) or PSI-BLAST (Altschul et al., 1997).
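As a rough illustration of the two points above, the sketch below enumerates the n×(n−1) ordered proteome pairs and reduces a list of hits to one best hit per query protein. It does not reproduce FindTarget's parser: the dictionary keys (`query`, `subject`, `score`) and the rule that the best hit is the one with the highest score are assumptions.

```python
from itertools import permutations


def all_comparisons(genomes):
    """Return every ordered (query, subject) proteome pair: n*(n-1) in total."""
    return list(permutations(genomes, 2))


def keep_best_hits(hits):
    """Keep one hit per query protein.

    Each hit is a dict with hypothetical keys 'query', 'subject' and 'score';
    the highest score is assumed to define the best hit.
    """
    best = {}
    for hit in hits:
        query = hit["query"]
        if query not in best or hit["score"] > best[query]["score"]:
            best[query] = hit
    return best


if __name__ == "__main__":
    print(len(all_comparisons(["g1", "g2", "g3", "g4"])))  # 4 * 3 ordered pairs
    hits = [
        {"query": "p1", "subject": "q7", "score": 120.0},
        {"query": "p1", "subject": "q9", "score": 310.0},
        {"query": "p2", "subject": "q3", "score": 55.0},
    ]
    print(keep_best_hits(hits)["p1"]["subject"])
```

Storing only one record per query protein is what keeps the parsed output small relative to the raw BLAST report.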
Differential genome analysis algorithm.
During a FindTarget session, the user defines the input parameters used to query the database. These parameters are presented in Table 1. To increase flexibility, different selection and exclusion criteria are available (Table 2). The algorithm executed according to the chosen parameters is divided into two steps. First, the program selects all the proteins from the query genome that have a homologue in at least m reference genomes (m = match number) according to the selection criterion, which generates a temporary list of query proteins. Second, the program rejects from this temporary list all the query proteins that have a homologue in at least one exclusion genome according to the exclusion criterion. The query proteins that remain form the final list. Typically, such an analysis takes 14 s on a Linux Pentium II 400 MHz computer with 128 megabytes of RAM.
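The two-step selection and exclusion procedure can be sketched as follows. This is an illustrative reading of the description above, not the FindTarget source: the function name `find_targets` and the input format (for each query protein, the set of genomes in which a homologue was found under the selection or exclusion criterion) are assumptions.

```python
def find_targets(query_proteins, selection_hits, exclusion_hits,
                 reference_genomes, exclusion_genomes, match_number):
    """Two-step differential genome analysis (illustrative sketch).

    selection_hits / exclusion_hits map a query protein to the set of genomes
    where it has a homologue under the selection / exclusion criterion
    (hypothetical pre-computed inputs).
    """
    references = set(reference_genomes)
    excluded = set(exclusion_genomes)

    # Step 1: keep proteins with a homologue in at least `match_number`
    # reference genomes.
    temporary = [
        p for p in query_proteins
        if len(selection_hits.get(p, set()) & references) >= match_number
    ]

    # Step 2: reject proteins with a homologue in any exclusion genome.
    return [
        p for p in temporary
        if not (exclusion_hits.get(p, set()) & excluded)
    ]


if __name__ == "__main__":
    selection = {"p1": {"Cj", "Hi", "Hp", "Nm"}, "p2": {"Cj"},
                 "p3": {"Cj", "Hi", "Hp", "Nm"}}
    exclusion = {"p3": {"Bs"}}
    print(find_targets(["p1", "p2", "p3"], selection, exclusion,
                       ["Cj", "Hi", "Hp", "Nm"], ["Bs", "Mt", "Mg"], 4))
```

Keeping the selection and exclusion hit sets separate reflects the fact that the two criteria can use different thresholds; lowering `match_number` relaxes step 1 without affecting step 2.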
During software development, every effort was made to keep installation easy. Installation is a two-step process. First, the following external software must be installed on a Unix web server: BLAST (Altschul et al., 1997), BLAST 2 Sequences (Tatusova & Madden, 1999), DisplayFam (Corpet et al., 1999), MultAlin (Corpet, 1988), Html4blast (http://bioweb.pasteur.fr/docs/softgen.html#html4blast) and MView (Brown et al., 1998). Second, the configuration file of the FindTarget package has to be modified according to the local host machine (definitions of the file directories, paths to external software, e-mail address of the software administrator).
RESULTS AND DISCUSSION
An application of FindTarget
To illustrate the utility of FindTarget, we used it to predict genes potentially implicated in Gram-negative membrane synthesis. The cell envelope is formed by the cytoplasmic membrane, the periplasm and the cell wall. In Gram-positive bacteria, the periplasm is defined as the volume directly surrounding the cytoplasmic membrane. In Gram-negative bacteria, an additional outer membrane is present. In certain bacteria, a capsule (polysaccharide layer) may surround the cell envelope. The cytoplasmic membrane is a phospholipid bilayer containing membrane proteins. The cell wall consists of peptidoglycan (also called murein), a linear polysaccharide with peptide linkers. The Gram-negative outer membrane consists of a phospholipid bilayer, membrane proteins, lipoproteins and lipopolysaccharides (lipid A and O polysaccharide). The lipopolysaccharides are surface antigens anchored to the outer membrane by a terminal lipid A core. The lipid A and O polysaccharides are unique to the outer membrane of Gram-negative bacteria (Neidhart et al., 1990).
The functions of the genes expected to be specific for Gram-negative bacteria are diverse. They may encode outer-membrane proteins or proteins involved in the interaction between the outer membrane and the cytoplasmic membrane. They may also be involved in the synthesis and degradation of membrane constituents. Using FindTarget, we searched for E. coli proteins having a homologue in a set of Gram-negative bacteria (Campylobacter jejuni, Haemophilus influenzae, Helicobacter pylori 26695, Neisseria meningitidis Z2491), but not in a set of Gram-positive bacteria (Bacillus subtilis, Mycobacterium tuberculosis, Mycoplasma genitalium G37). The input parameters for this session are defined in Table 1. This approach allowed us to select 39 proteins from E. coli (see Table 3 for a complete list of selected genes). As expected, the number of gene products selected by this approach changes with the stringency defined by the values of the numeric input parameters (selection/exclusion criteria, match number).
Conclusion
FindTarget is an easy-to-use and powerful tool for identifying genes that are potentially specific to one or several species, as determined by the similarity criteria selected by the user. The Unix package is available upon request and can be readily installed on inexpensive Linux personal computers, which are now becoming common in molecular biology laboratories.
However, it is important to remember that in some organisms identical biochemical reactions may be catalysed by non-related enzymes, a phenomenon known as non-orthologous gene displacement (Koonin et al., 1996). The FindTarget user therefore has to interpret the results with care, as the absence of a protein in a genome does not necessarily mean that the corresponding function is missing.
FindTarget includes practical functionalities to analyse the results, such as generation of multiple alignments, reconstruction of phylogenetic trees, similarity searches in local databases and optional links to public databases of annotated genomes. The list of the genes selected during a work session and their coding sequences can be displayed and saved for further analysis. Because FindTarget produces results quickly, it allows successive requests to test several combinations of parameters and to define the most appropriate ones. Finally, results of this in silico comparison could be combined with other whole-genome analyses such as transcriptome studies and two-dimensional gel electrophoresis of proteins. Indeed, transcriptome studies often lead to the identification of numerous genes, which cannot all be analysed in depth. Combining these results with a FindTarget analysis could provide arguments for selecting genes for further functional analysis. With the growing number of publicly available complete genomes, software tools like FindTarget should provide a rational basis for experimental design in the rapidly expanding field of functional genomics.
ACKNOWLEDGEMENTS
REFERENCES
Brown, N. P., Leroy, C. & Sander, C. (1998). MView: a web-compatible database search or multiple alignment viewer. Bioinformatics 14, 380-381.
Bruccoleri, R. E., Dougherty, T. J. & Davison, D. B. (1998). Concordance analysis of microbial genomes. Nucleic Acids Res 26, 4482-4486.
Corpet, F. (1988). Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16, 10881-10890.
Corpet, F., Gouzy, J. & Kahn, D. (1999). Browsing protein families via the Rich Family Description format. Bioinformatics 15, 1020-1027.
Fujita, Y., Yoshida, K., Miwa, Y., Yanai, N., Nagakawa, E. & Kasahara, Y. (1998). Identification and expression of the Bacillus subtilis fructose-1,6-bisphosphatase gene (fbp). J Bacteriol 180, 4309-4313.
Huynen, M., Dandekar, T. & Bork, P. (1998). Differential genome analysis applied to the species-specific features of Helicobacter pylori. FEBS Lett 426, 1-5.
Huynen, M. A., Diaz-Lazcoz, Y. & Bork, P. (1997). Differential genome display. Trends Genet 13, 389-390.
Koonin, E. V., Mushegian, A. R. & Bork, P. (1996). Non-orthologous gene displacement. Trends Genet 12, 334-336.
Krause, A., Stoye, J. & Vingron, M. (2000). The SYSTERS protein sequence cluster set. Nucleic Acids Res 28, 270-272.
Kriventseva, E. V., Fleischmann, W., Zdobnov, E. M. & Apweiler, R. (2001). CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res 29, 33-36.
Neidhart, F. C., Ingraham, J. L. & Schaechter, M. (1990). Physiology of the Bacterial Cell. Sunderland, MA: Sinauer Associates.
Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85, 2444-2448.
Perriere, G., Duret, L. & Gouy, M. (2000). HOBACGEN: database system for comparative genomics in bacteria. Genome Res 10, 379-385.
Sekowska, A., Bertin, P. & Danchin, A. (1998). Characterization of polyamine synthesis pathway in Bacillus subtilis 168. Mol Microbiol 29, 851-858.
Tatusov, R. L., Natale, D. A., Garkavtsev, I. V. & 7 other authors (2001). The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 29, 22-28.
Tatusova, T. A. & Madden, T. L. (1999). BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett 174, 247-250.
Wall, L., Christiansen, T. & Schwartz, R. L. (1996). Programming Perl, 2nd edition. Sebastopol: O'Reilly & Associates.
Yona, G., Linial, N. & Linial, M. (2000). ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res 28, 49-55.
Received 12 April 2001; revised 5 July 2001; accepted 13 July 2001.