Extracting phylogenetic information from whole-genome sequencing projects: the lactic acid bacteria as a test case

Tom Coenye and Peter Vandamme

Laboratorium voor Microbiologie, Universiteit Gent, K.L. Ledeganckstraat 35, B-9000 Gent, Belgium

Correspondence
Tom Coenye
Tom.Coenye@UGent.be


   ABSTRACT
The availability of an ever increasing number of complete genome sequences of diverse prokaryotic taxa has led to the introduction of novel approaches to infer phylogenetic relationships among bacteria. In the present study the sequences of the 16S rRNA gene and nine housekeeping genes were compared with the fraction of shared putative orthologous protein-encoding genes, conservation of gene order, dinucleotide relative abundance and codon usage among 11 genomes of species belonging to the lactic acid bacteria. In general there is a good correlation between the results obtained with various approaches, although it is clear that there is a stronger phylogenetic signal in some datasets than in others, and that different parameters have different taxonomic resolutions. It appears that trees based on different kinds of information derived from whole-genome sequencing projects do not provide much additional information about the phylogenetic relationships among bacterial taxa compared to more traditional alignment-based methods. Nevertheless, it is expected that the study of these novel forms of information will have its value in taxonomy, to determine which genes are shared, when genes or sets of genes were lost in evolutionary history, to detect the presence of horizontally transferred genes and/or confirm or enhance the phylogenetic signal derived from traditional methods. Although these conclusions are based on a relatively small dataset, they are largely in agreement with other studies and it is anticipated that similar trends will be observed when comparing other genomes.


Abbreviations: LAB, lactic acid bacteria; UPGMA, Unweighted Pair Group Method with Arithmetic mean


   INTRODUCTION
The first sequence of an entire bacterial genome (that of Haemophilus influenzae Rd) was published in 1995 (Fleischmann et al., 1995). Since then the total number of publicly available genome sequences has grown rapidly (see http://igweb.integratedgenomics.com/GOLD/) and there has been an increasing interest in the use of these genome sequence data to assess evolutionary relationships among bacterial species (Eisen, 2000; Wolf et al., 2002). The comparison of single genes (especially the 16S rRNA gene) to infer phylogenetic relationships among bacteria has been widely used for several decades (see, for example, Wheelis et al., 1992; Eisen, 1995; Garrity & Holt, 2001), but there has been considerable debate whether a tree based on any single gene can accurately represent the evolution of a species, considering the possibility of horizontal gene transfer (Lan & Reeves, 1996; Ochman et al., 2000) and the possibility of degradation of the phylogenetic signal because of saturation for amino acid substitutions (Forterre & Philippe, 1999) as complicating factors. 
Proposed ways to put whole-genome sequences to work in bacterial classification include alignments and analysis of large numbers of conserved genes (‘supertree approach’) (Feng et al., 1997; Roux et al., 1997; Brown et al., 2001; Daubin et al., 2001; Wolf et al., 2001), comparisons based on the presence and absence of orthologous genes or families of genes, or differences in overall gene content (Snel et al., 1999; Tekaia et al., 1999; Fitz-Gibbon & House, 1999; Wolf et al., 2001; House & Fitz-Gibbon, 2002; Bansal & Meyer, 2002), presence or absence of specific protein folds (Lin & Gerstein, 2000), presence and absence of conserved insertions and deletions (Gupta, 1998; Gupta & Griffiths, 2002), differences in amino acid composition (Tekaia et al., 2002), conservation of gene order (Dandekar et al., 1998; Huynen & Bork, 1998; Wolf et al., 2001; Kunisawa, 2001; Suyama & Bork, 2001) and biases in nucleotide composition of genomes [including dinucleotide relative abundance (Karlin et al., 1997, 1998) and codon usage (Wright, 1990; Karlin et al., 1997)].

The goal of the present study was to extract information from whole-genome sequences that can be used to gain further insights into the taxonomy and evolution of bacterial species. As a test case for our approach we chose the lactic acid bacteria (LAB). Phylogenetically, LAB can be divided into two groups. Most LAB (including members of the genera Streptococcus, Lactobacillus and Lactococcus) belong to the Gram-positive bacteria with a low (<50 mol%) G+C content (‘Firmicutes’), while some other LAB (e.g. Bifidobacterium species) belong to the Gram-positive bacteria with a high (>50 mol%) G+C content (‘Actinobacteria’). LAB are a good test case for this phylogenomic approach since (i) they are taxonomically relatively well characterized using more traditional methods (Pot et al., 1994), (ii) several complete genome sequences are available, and (iii) many more will become available in the near future (Klaenhammer et al., 2002).


   METHODS
Whole-genome sequence data.
The complete genome sequences used in this study are shown in Table 1.


Table 1. Whole-genome sequences used in this study

 
Detection of putative orthologous genes.
Putative orthologous protein-encoding genes [defined as those homologous genes that show the largest identity of several possibilities above a certain threshold (Bansal & Meyer, 2002)] were detected by comparing annotated whole-genome sequences by means of a direct genome-to-genome BLASTP analysis (Altschul et al., 1997). Query sequences that did not identify homologues with BLAST E values <$10^{-10}$ were considered to be specific for the query genome. This analysis was performed in a bidirectional manner, meaning that each genome was used as a query against all the other genomes. Based on the fraction of putative orthologous genes observed in pairwise combinations between genomes, a UPGMA (Unweighted Pair Group Method with Arithmetic mean) tree was constructed using the Bionumerics 3.5 software package (Applied Maths).
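The thresholding step described above can be illustrated with a minimal Python sketch. It assumes that BLASTP hits have already been parsed into (query, subject, E value) tuples; that tuple layout, the function name `shared_fraction` and the toy gene identifiers are hypothetical and are not taken from the pipeline used in this study.

```python
from typing import Iterable, Tuple

# A hit is (query_gene_id, subject_gene_id, e_value). This layout is an
# assumption made for illustration, not the output format of a specific tool.
Hit = Tuple[str, str, float]


def shared_fraction(n_query_genes: int, hits: Iterable[Hit],
                    e_cutoff: float = 1e-10) -> float:
    """Fraction of query genes with at least one hit below the E-value cutoff."""
    if n_query_genes <= 0:
        raise ValueError("n_query_genes must be positive")
    # A query gene counts once, however many subject genes it matches.
    matched = {query for query, _subject, e_value in hits if e_value < e_cutoff}
    return len(matched) / n_query_genes


# Toy data (invented): genome A has four genes, compared against genome B.
hits_a_vs_b = [
    ("a1", "b7", 1e-50),
    ("a1", "b9", 1e-12),  # second hit for the same query, counted once
    ("a2", "b3", 1e-30),
    ("a3", "b5", 1e-4),   # too weak: a3 is treated as specific to genome A
    ("a4", "b2", 1e-11),
]
fraction = shared_fraction(4, hits_a_vs_b)
```

Running the same function with the roles of the two genomes swapped gives the second value of a bidirectional comparison, which is why the two triangles of a pairwise table need not be identical.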

Comparison of 16S rRNA gene sequences and amino acid sequences of various house-keeping proteins.
16S rRNA gene sequences and the amino acid sequences of the proteins encoded by gyrB (DNA gyrase B subunit), rpoD (RNA polymerase σ42 protein), sodA (manganese-dependent superoxide dismutase), dnaK (heat-shock protein 70), recA (recombination protein A), gki (glucose kinase), ddl (D-alanine : D-alanine ligase), alaS (alanyl-tRNA synthetase) and ileS (isoleucyl-tRNA synthetase) of all strains listed in Table 1 were downloaded from the GenBank or EMBL databases. We compared protein sequences instead of the corresponding DNA sequences since (i) silent substitutions are much more frequent than replacement mutations and tend to randomize the third codon position in protein-encoding genes, and (ii) the base composition of the third codon position may vary systematically between species, indicating that it is subject to a selective force that may be lineage-specific (Swofford & Olsen, 1990). Alignment, tree construction and bootstrap analyses (1000 replicates) were performed using the Bionumerics 3.5 software package (Applied Maths). Phylogenetic trees were constructed using the neighbour-joining, UPGMA, maximum-parsimony and maximum-likelihood methods (Saitou & Nei, 1987; Felsenstein, 1988, 1996). Bootstrap values >70 % were considered to be significant. A supertree based on the sequences of the 16S rRNA gene and house-keeping proteins was constructed by combining the similarity matrices of the individual comparisons using the Bionumerics 3.5 software package.

Determination of dinucleotide relative abundance values.
We determined the dinucleotide relative abundance value for each genome. Sequences were concatenated with their inverted complementary sequence using REVSEQ, YANK and UNION (EMBOSS). Mononucleotide frequencies were calculated using Artemis 4.0 (Rutherford et al., 2000); dinucleotide frequencies were calculated using COMPSEQ (EMBOSS). Dinucleotide relative abundances $\rho^*_{XY}$ were calculated using the equation $\rho^*_{XY} = f_{XY}/(f_X f_Y)$, where $f_{XY}$ denotes the frequency of dinucleotide XY, and $f_X$ and $f_Y$ denote the frequencies of X and Y, respectively (Karlin et al., 1997). Statistical theory and data from previous studies (Karlin et al., 1997, 1998) indicate that the normal range of $\rho^*_{XY}$ is between 0·78 and 1·23. The dissimilarities in relative abundance of dinucleotides between two sequences f and g were calculated using the equation described by Karlin et al. (1997): $\delta^*(f,g) = \frac{1}{16}\sum_{XY} |\rho^*_{XY}(f) - \rho^*_{XY}(g)|$ (multiplied by 1000 for convenience), where the sum extends over all dinucleotides. To allow the clustering of $\delta^*(f,g)$ values and comparison with other datasets, $\delta^*(f,g)$ values were transformed to percentages of the largest $\delta^*(f,g)$ value observed in this study.
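The two quantities defined above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the EMBOSS and Artemis workflow used here: the function names `rho_star` and `delta_star` are hypothetical, and both strands are counted separately rather than physically concatenated, so that no artificial dinucleotide is created at the junction.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def rho_star(seq: str) -> dict:
    """Strand-symmetrized relative abundances: f_XY / (f_X * f_Y)."""
    seq = seq.upper()
    strands = (seq, seq.translate(COMPLEMENT)[::-1])
    mono = {base: 0 for base in "ACGT"}
    di = {pair: 0 for pair in DINUCLEOTIDES}
    for strand in strands:
        for base in strand:
            if base in mono:
                mono[base] += 1
        for i in range(len(strand) - 1):
            pair = strand[i:i + 2]
            if pair in di:  # skips pairs containing ambiguous symbols
                di[pair] += 1
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0 or min(mono.values()) == 0:
        raise ValueError("sequence must contain all four bases")
    return {
        pair: (di[pair] / n_di)
        / ((mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono))
        for pair in DINUCLEOTIDES
    }


def delta_star(seq_f: str, seq_g: str, scale: float = 1000.0) -> float:
    """Mean absolute difference of rho* over the 16 dinucleotides, scaled."""
    rho_f, rho_g = rho_star(seq_f), rho_star(seq_g)
    return scale * sum(abs(rho_f[p] - rho_g[p]) for p in DINUCLEOTIDES) / 16


# Toy sequences (invented) to show the call pattern.
toy_f = "ACGTACGGTTCAGCATTAGC"
toy_g = "AACCGGTTAGCT"
toy_delta = delta_star(toy_f, toy_g)
```

Because the counts are pooled over both strands, a sequence and its reverse complement give identical profiles, and complementary dinucleotides such as AC and GT always receive the same value.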

Codon usage.
The numbers of each codon used in the sequences listed in Table 1 were calculated using Artemis 4.0 (Rutherford et al., 2000). Two other indices of codon usage bias were calculated for each sequence. Nc, the effective number of codons used in a sequence (Wright, 1990), was calculated using CHIPS (EMBOSS). Nc values range from 20 (in an extremely biased genome where only one codon is used per amino acid) to 61 (all synonymous codons are used with equal probability) (Wright, 1990; Andersson & Sharp, 1996). We also calculated GC3s, the frequency of G+C in synonymously variable third positions of codons (excluding Met, Trp and termination codons).
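To make the GC3s definition concrete, a minimal Python sketch is given below. It assumes an in-frame coding sequence and the standard codon assignments; the function name `gc3s` and the toy sequence are hypothetical, and the program actually used to obtain GC3s in this study is not implied.

```python
# Codons without synonymous alternatives (Met, Trp) and termination codons
# are excluded, following the definition given in the text.
EXCLUDED_CODONS = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def gc3s(cds: str) -> float:
    """Fraction of G or C at the third position of synonymously variable codons."""
    cds = cds.upper()
    usable = gc_third = 0
    # Ignore a trailing partial codon, if any.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in EXCLUDED_CODONS or set(codon) - set("ACGT"):
            continue  # excluded codon or ambiguous nucleotide
        usable += 1
        if codon[2] in "GC":
            gc_third += 1
    if usable == 0:
        raise ValueError("no synonymously variable codons found")
    return gc_third / usable


# Toy coding sequence (invented): ATG GCC AAA TTT GGG TGG TAA
toy_gc3s = gc3s("ATGGCCAAATTTGGGTGGTAA")
```

In the toy sequence, four codons remain after exclusion (GCC, AAA, TTT, GGG), two of which end in G or C.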

Collinearity of genomes.
Regions displaying substantial DNA sequence similarity between two genomes were visualized using DOTTUP (EMBOSS). The word size used was 100 nt.
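The idea behind a word-based dot-plot can be sketched as follows. As we understand it, DOTTUP places a dot wherever two sequences share an exact word of the chosen size; the Python function below reproduces only that idea on the forward strand, with a hypothetical name (`word_matches`) and toy sequences, and it is not meant to scale to complete genomes.

```python
from collections import defaultdict
from typing import List, Tuple


def word_matches(seq_a: str, seq_b: str, word_size: int) -> List[Tuple[int, int]]:
    """Return (i, j) start positions where seq_a and seq_b share an exact word."""
    if word_size <= 0:
        raise ValueError("word_size must be positive")
    # Index every word of seq_a by its start position.
    index = defaultdict(list)
    for i in range(len(seq_a) - word_size + 1):
        index[seq_a[i:i + word_size]].append(i)
    # Scan seq_b and report every position pair with an identical word.
    matches = []
    for j in range(len(seq_b) - word_size + 1):
        for i in index.get(seq_b[j:j + word_size], ()):
            matches.append((i, j))
    return matches


# Toy sequences (invented); a word size of 100 nt was used in the study.
toy_matches = word_matches("ACGTACGT", "TTACGTAA", 4)
```

Collinear regions show up as runs of matches with a constant difference i - j, which is what appears as a diagonal line in the plot.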

Statistical analysis.
Comparison of results obtained with various methods and statistical analyses were performed using the Bionumerics 3.5 (Applied Maths) and SPSS 11.0.1 (SPSS Inc.) software packages.


   RESULTS
Differences in gene content
To determine the number of genes present in one genome but not the other, we performed pairwise, bidirectional BLASTP comparisons, using a cutoff value of E=$10^{-10}$. It is obvious that the apparent number of putative orthologous genes will change with the choice of parameters. However, previous studies have shown that lowering the threshold does not dramatically increase the number of orthologues detected (Snel et al., 1999; Bansal & Meyer, 2002). To verify this we performed pairwise BLASTP comparisons between the genomes of both Streptococcus agalactiae strains listed in Table 1, using various cutoff levels. Our results indicate that the difference between a cutoff level of E=$10^{-10}$ and E=$10^{-2}$ is less than 5 % (1902 putative orthologues compared to 1989 putative orthologues detected). We decided to consistently use the more stringent cutoff value throughout this study, especially because most taxa included in this study are relatively closely related and were expected to share a high number of putative orthologues.
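The size of this difference can be checked directly from the two counts reported above. The choice of denominator is not stated in the text, so both possibilities are shown here as an assumption-free check:

```latex
\[
\frac{1989 - 1902}{1989} = \frac{87}{1989} \approx 0.044
\qquad\text{and}\qquad
\frac{1989 - 1902}{1902} = \frac{87}{1902} \approx 0.046
\]
```

Either way the difference is below 5 %.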

The percentage of putative orthologous protein-encoding genes detected in these pairwise comparisons is shown in Table 2. We have expressed the number of putative orthologues as the fraction of the genes being compared, to account for differences in genome size (Bansal & Meyer, 2002). Strains of the same Streptococcus species shared between 84·15 and 93·93 % of their protein-encoding genes while strains belonging to different species within the genus Streptococcus shared between 55·95 and 75·90 % of their protein-encoding genes. Species belonging to different genera of LAB did not share more than 63·64 % of their protein-encoding genes. A dendrogram derived from the data in Table 2 is shown in Fig. 1. This dendrogram confirmed that closely related strains and species shared a higher fraction of their protein-encoding genes than do more distantly related ones.


Table 2. Fraction of putative orthologues shared by strains investigated

Upper triangle, sequences from strains in header row used as database and sequences of strains in header column used as query; lower triangle, vice versa.

 


Fig. 1. UPGMA tree based on the fraction of shared putative orthologues between genomes.

 
Collinearity of genomes
To determine to what extent gene order is conserved among different taxa, we determined the collinearity of the genomes. Using a word size of 100, significant collinearity was only seen between strains belonging to the same species (see Fig. 2 for examples). Decreasing the word size (down to 10) did not result in the detection of any additional conservation of gene order.



Fig. 2. Dot-plot of the genomes of two S. pneumoniae (top) and two S. agalactiae (bottom) strains, visualizing the regions with substantial similarity between the two genomes. The word size used was 100.

 
Comparison of 16S rRNA gene sequences and amino acid sequences of various house-keeping proteins
Phylogenetic trees based on complete 16S rRNA gene sequences and the sequences of nine proteins involved in central metabolism were constructed using the neighbour-joining, UPGMA, maximum-parsimony and maximum-likelihood methods. We wanted our selection to contain (i) both more conserved and more variable proteins and (ii) proteins belonging to different functional groups. We included recA, gyrB, dnaK and rpoD because they have been used in numerous taxonomic studies and represent the more conserved genes. sodA was included because it has been used in many studies regarding the identification of Gram-positive bacteria. gki, ileS and alaS were included as proteins with intermediate variability. ddl was included as a more variable protein; this protein has also been used in several studies regarding taxonomy and identification of streptococci.

No differences in overall topology were observed using the different methods for tree construction (data not shown). The neighbour-joining dendrograms (including bootstrap values) derived from complete 16S rRNA gene sequences and the sequences of nine proteins involved in central metabolism and a dendrogram based on the combined analysis are shown in Fig. 3. Overall branch lengths are relatively similar (except for the most distantly related organism, Bifidobacterium longum) (Fig. 3). In the tree based on 16S rRNA gene sequences, strains of the same species group together and share more than 99 % sequence similarity. Bootstrap values indicate that most branches in the dendrogram are highly significant, with the exception of the Streptococcus pyogenes branch (bootstrap value of 44 %), the S. pyogenes–S. agalactiae branch (21 %) and the Streptococcus mutans–Streptococcus pneumoniae branch (62 %). S. pyogenes and S. agalactiae appear closely related (96·8 % sequence similarity), while S. mutans and S. pneumoniae are somewhat more distantly related to the former two species and to each other (less than 93 % sequence similarity). This confirms results based on the analyses of 16S rRNA gene sequences of other strains representing the same species (Bentley et al., 1991; Kawamura et al., 1995). Lactococcus lactis appears as the closest neighbour of the streptococci (88·3 % sequence similarity), while Lactobacillus plantarum and B. longum are more distantly related (less than 84 and 74 % sequence similarity, respectively). A comparable topology is observed in the trees based on the gyrB, dnaK, recA, alaS and sodA sequences (it should be noted that the sodA gene is not present in Lb. plantarum WCFS1 and B. longum NCC2705). In the trees based on rpoD, gki, ddl and ileS sequences, a different topology is observed. In the tree based on rpoD sequences, Lb. plantarum appears more closely related to the streptococci than Lc. lactis. In the tree based on gki sequences, S. pneumoniae appears more distantly related to the other streptococci than Lc. lactis. In the tree based on ddl sequences, substantial intraspecies and intragenus sequence variability can be observed. Finally, the sequence of the Lb. plantarum ileS protein seems to be considerably different from that of all other LAB: the Lb. plantarum sequence is the most dissimilar sequence in this tree, while the B. longum sequences are the most dissimilar in all other trees (with the exception of the sodA tree). Bootstrap analysis indicates that, as in the tree based on the 16S rRNA gene, most branches in the dendrograms are highly significant, although there are some exceptions (Fig. 3). In the dendrogram based on the combined analysis, the topology is similar to that of the 16S rRNA gene tree, with the exception that S. mutans UA159 seems more closely related to S. pyogenes and S. agalactiae than S. pneumoniae (and vice versa in the 16S rRNA gene-based tree). In all trees (except the trees based on gki and ddl sequences) the genus Streptococcus and the individual Streptococcus species appear as monophyletic groups.




Fig. 3. Phylogenetic trees based on 16S rRNA gene sequences, the amino acid sequences of nine housekeeping genes and combined sequences. Bootstrap values are only indicated if they are <70 %. The scale bar indicates 10 % dissimilarity.

 
Dinucleotide relative abundance values
We determined the relative dinucleotide abundance values for all strains listed in Table 1. $\rho^*_{XY}$ values were in the normal range for all dinucleotides in all taxa investigated except for AT (over-represented in all genomes except B. longum NCC2705), CG (under-represented in all genomes except Lb. plantarum WCFS1 and B. longum NCC2705) and TA (under-represented in all genomes except S. agalactiae 2603V/R and NEM316, and Lc. lactis IL1403) (data not shown). $\delta^*(f,g)$ values were between 3·94 (between S. pneumoniae R6 and S. pneumoniae TIGR4) and 192·4 (between S. mutans UA159 and B. longum NCC2705). A dendrogram based on normalized $\delta^*(f,g)$ values is shown in Fig. 4.



Fig. 4. UPGMA tree based on normalized $\delta^*(f,g)$ values.
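For readers unfamiliar with UPGMA, the clustering behind a dendrogram of this kind can be sketched in Python as follows. This is a textbook version of the algorithm with a helper that rescales distances to percentages of the largest value; the function names `percent_of_max` and `upgma`, the labels and the distances are all invented for illustration, and the BioNumerics implementation may differ in details such as tie handling.

```python
def percent_of_max(dist):
    """Rescale a distance matrix to percentages of its largest entry."""
    largest = max(max(row) for row in dist)
    if largest == 0:
        raise ValueError("all distances are zero")
    return [[100.0 * value / largest for value in row] for row in dist]


def upgma(labels, dist):
    """Cluster a symmetric distance matrix; return the tree as nested tuples."""
    clusters = {i: (labels[i], 1) for i in range(len(labels))}  # id -> (tree, size)
    # Pairwise distances keyed by (larger_id, smaller_id).
    d = {(i, j): dist[i][j] for i in range(len(labels)) for j in range(i)}
    next_id = len(labels)
    while len(clusters) > 1:
        (a, b), _ = min(d.items(), key=lambda item: item[1])  # closest pair
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[(max(a, c), min(a, c))]
            d_bc = d[(max(b, c), min(b, c))]
            # Size-weighted mean keeps every original leaf equally weighted.
            merged[(next_id, c)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {key: value for key, value in d.items() if a not in key and b not in key}
        d.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Invented distances between four hypothetical genomes A-D.
toy_labels = ["A", "B", "C", "D"]
toy_dist = [
    [0, 2, 10, 40],
    [2, 0, 12, 42],
    [10, 12, 0, 38],
    [40, 42, 38, 0],
]
toy_tree = upgma(toy_labels, percent_of_max(toy_dist))
```

In the toy example A and B are joined first, C joins that pair next and D is attached last, mirroring the way closely related strains cluster before more distant taxa.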

 
Codon usage
Dendrograms based on codon usage and GC3s are shown in Fig. 5. In general codon usage is very similar among the LAB investigated, except for B. longum NCC2705 which has a clearly different codon usage. GC3s is virtually identical for strains belonging to the same species, but there is clearly more variation between different species. B. longum NCC2705 is an outlier here as well. There is little variation in Nc among the different genomes investigated, with Nc ranging from 40·4 (B. longum) to 52·5 (Lb. plantarum). The complete table with codon usage, GC3s and Nc is available as supplementary data at http://mic.sgmjournals.org.



Fig. 5. UPGMA trees based on the percentage similarity in codon usage (top) and GC3s (bottom).

 
Analysis of concordance between techniques
To determine how concordant the groupings obtained using the different methods were, we compared the similarity matrices obtained from the different experiments using the Bionumerics 3.5 software package by calculating the Pearson product-moment correlation coefficient. The correlation coefficients are shown in Table 3. Second-degree regression curves illustrating the relationship between several parameters are shown in Fig. 6.
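A comparison of this kind can be sketched in Python as shown below. The sketch assumes that each method yields a symmetric similarity matrix over the same ordered set of genomes and that only the entries below the diagonal are compared; the function names and the toy matrices are hypothetical and the values are invented.

```python
from math import sqrt


def lower_triangle(matrix):
    """Entries below the diagonal, row by row."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i)]


def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for constant input")
    return s_xy / sqrt(s_xx * s_yy)


# Two invented similarity matrices (in %) for three hypothetical genomes.
method_1 = [[100, 90, 60], [90, 100, 50], [60, 50, 100]]
method_2 = [[100, 85, 70], [85, 100, 65], [70, 65, 100]]
r = pearson(lower_triangle(method_1), lower_triangle(method_2))
r_percent = 100 * r  # expressed as a percentage
```

The diagonal is left out because self-similarities are 100 % by construction and would inflate the coefficient.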


Table 3. Pearson product-moment correlation coefficient (expressed as a percentage) between different experiments

 


Fig. 6. Concordance between selected techniques.

 

   DISCUSSION
In the present study we wanted to evaluate the usefulness of various forms of information extracted from whole-genome sequences for bacterial taxonomy. At present it is unclear how the wealth of information revealed by large-scale whole-genome sequencing projects can be used for taxonomic purposes. We set out to determine different types of information, useful in taxonomy and phylogeny, that can be extracted from whole-genome sequences. Previous studies have addressed this question as well, but most of these studies only addressed deep phylogenetic relationships (e.g. the relationship between Bacteria and Archaea) or were restricted to a single method, or both. In the present study we used the presence of putative orthologues, conservation of gene order, alignment of macromolecules, biases in nucleotide composition and codon usage to assess the relationships between various LAB.

Presence of putative orthologues and collinearity of genomes
The most straightforward way of comparing genomes is to consider them as a ‘bag of genes’ and to determine the number or fraction of shared putative orthologues (Huynen & Bork, 1998; Bansal & Meyer, 2002). It has been proposed that in this kind of analysis, evolutionary distance can be interpreted in terms of the underlying evolutionary events, including gene loss and acquisition (Snel et al., 1999). Several studies have shown that the number of orthologues shared between species ultimately depends on genome size and phylogenetic position (Snel et al., 1999; Bansal & Meyer, 2002). However, the genome sizes of the organisms included in the present study are in the same range (1·85–3·31 Mb; Table 1) and it is therefore not surprising to see that the number of genes two genomes have in common is a reflection of their phylogenetic relationship. The strong phylogenetic signal observed implies that for this group of organisms (i) horizontal gene transfer events are less common than generally accepted (Lan & Reeves, 1996; Ochman et al., 2000), (ii) horizontal gene transfer events occur mainly between closely related species (e.g. different Streptococcus species), (iii) closely related species are affected in the same way (i.e. horizontal gene transfer affected their common ancestor), and/or (iv) transferred genes mainly replace orthologues already present in the genome (Snel et al., 1999). This confirms previous findings in other bacteria (see for example Snel et al., 1999, 2002).

Although there are several reports documenting the conservation of local gene order (Tamames et al., 1997; Dandekar et al., 1998; Nölling et al., 2001), it is thought that in general there is little conservation of global gene order (Kolsto, 1997; Okstad et al., 1999; Suyama & Bork, 2001). This is largely confirmed in the present study as significant global conservation of gene order could only be seen between strains of the same species. Our data also confirm that the order of orthologous genes is less preserved than their presence (Huynen & Bork, 1998).

Comparing phylogenetic trees based on ‘universal’ genes
A traditional approach to determine the relationships between organisms is the construction of phylogenetic trees of universal or near-universal genes. However, when comparing trees based on different genes, different topologies can often be observed. This has been attributed to horizontal gene transfer (Lan & Reeves, 1996; Ochman et al., 2000) and saturation for amino acid substitutions (Forterre & Philippe, 1999). Therefore, trees may reflect the genealogy of individual genes (‘gene tree’) rather than the genealogy of the species (‘species tree’) (Teichmann & Mitchison, 1999; Brocchieri, 2001). Nevertheless, sequencing and analysis of conserved macromolecules, and especially the 16S rRNA gene, is one of the cornerstones of bacterial taxonomy and phylogeny (Vandamme et al., 1996; Rossello-Mora & Amann, 2001). Recently it was proposed that the applicability of protein-encoding gene sequence analysis to genomically circumscribe bacterial taxa should be evaluated (Stackebrandt et al., 2002). The data from a first study (Zeigler, 2003) suggest that sequence analysis of a small set of protein-encoding genes could reliably discriminate bacterial species. A way to avoid the above-mentioned problems is the so-called ‘supertree approach’ in which combined alignments of orthologous genes and/or proteins are compared. Phylogenies based on supertrees are supposedly more robust and representative of the organisms' phylogeny since the numbers of phylogenetically informative sites and sampled loci are increased (Brown et al., 2001; Wolf et al., 2002). Supertrees can be constructed by concatenating many sequence alignments into one, using the combined long sequence for tree reconstruction, or by combining phylogenetic information contained in multiple, independently reconstructed gene trees (Wolf et al., 2002). 
The latter method has the advantage that (i) it does not a priori require that exactly the same set of species is represented in all alignments, and (ii) it does not force all proteins into a single sequence change model (Wolf et al., 2002). In the present study we constructed trees based on the sequence of nine proteins involved in central metabolism and of the 16S rRNA gene. We also constructed a supertree based on the combined alignment of all 10 sequences.

As mentioned above the topologies of the trees based on the 16S rRNA gene, gyrB, sodA, dnaK, recA and alaS were very similar, which is largely in agreement with previously published data for these and other bacteria (Eisen, 1995; Huang, 1996; Poyart et al., 1998; Diaz-Lazcoz et al., 1998; Ahmad et al., 2000). The extensive intraspecies and intragenus diversity observed in the ddl-based tree is most probably the result of a previously described hitchhiking effect driven by the penicillin-binding protein 2b gene (pbp2b): interspecies recombinational exchanges at the pbp2b locus (selected for by the use of penicillin) can extend into or through the nearby ddl gene (Enright & Spratt, 1999). At present we have no explanation for the topologies seen in the trees based on rpoD, gki and ileS. The supertree based on the combined analysis is also in agreement with the 16S rRNA gene tree, again confirming results obtained in previous studies (Daubin et al., 2001; Brown et al., 2001).

Biases in nucleotide composition
Dinucleotide relative abundance values are thought to be constant within a genome and it has been hypothesized that this is due to the factors that affect them being constant throughout the genome. It has also been postulated that the set of dinucleotide relative abundance values constitutes a genomic signature that reflects the pressure of these factors (Karlin et al., 1997). We showed that the dissimilarities in relative abundance of dinucleotides between the genomes of two strains belonging to the same species are lower than between the genomes of strains belonging to different species. Variations in tRNA availability, translational accuracy and efficiency, and codon/anticodon interaction strength are important factors for the generation of biases in codon usage and therefore codon bias can be seen as a measurement of these mutational and selectional pressures (Karlin et al., 1998). Most LAB investigated in this study had very similar patterns of codon usage (Lb. plantarum and B. longum being the exceptions), suggesting that their overall patterns of the above-mentioned factors are similar. There seems to be no variation in the G+C content at synonymous third codon positions within a species, but considerable diversity between species. Overall, the phylogenetic signal is stronger for GC3s than for codon usage.

Concordance between methods
As is obvious from Table 3 and Fig. 6 there is, in general, a good correlation between the results obtained with the various approaches. Nevertheless, it is also clear that there is a stronger phylogenetic signal in some datasets than in others and that different parameters have different taxonomic resolution. For example the absence of conservation of gene order across species makes this approach less suitable for comparing distantly related organisms. Although our conclusions are based on a relatively small dataset, we anticipate that similar trends will be observed when comparing other genomes. All analyses described in the present study were carried out on a standard desktop computer and most of the software described is freely available or can be accessed via a web interface. Of all the methods described, determination of $\delta^*(f,g)$ values is probably the easiest and has the additional advantage that it does not require functional annotation of the genome sequence or a prior alignment of sequences.

Conclusions
The results of the present study seem to confirm the data obtained by Wolf et al. (2001) in that trees based on different kinds of information derived from whole-genome sequencing projects do not provide much additional information about the phylogenetic relationships among bacterial taxa compared to traditional alignment-based methods. Nevertheless, we anticipate that the study of these novel forms of information will have its value in taxonomy. Several of the approaches described allow the determination of which genes or functional categories of genes are shared, when genes or sets of genes are lost in evolutionary history, and the detection of the presence of horizontally transferred genes. Most importantly, they can confirm or enhance the phylogenetic signal derived from traditional (alignment-based) methods (Wolf et al., 2002). It can therefore be expected that these novel approaches will find their place beside the more traditional approaches to bacterial phylogeny and will provide a better picture of bacterial phylogeny and taxonomy.


   ACKNOWLEDGEMENTS
 
T. C. and P. V. are indebted to the Fund for Scientific Research – Flanders (Belgium) for a position as postdoctoral fellow and research grants, respectively. T. C. also acknowledges the support from the Belgian Federal Government (Federal Office for Scientific, Technical and Cultural Affairs). We thank Dirk Gevers for stimulating discussions.


   REFERENCES
Ahmad, S., Selvapandiyan, A. & Bhatnagar, R. K. (2000). Phylogenetic analysis of gram-positive bacteria based on grpE, encoded by the dnaK operon. Int J Syst Evol Microbiol 50, 1761–1766.

Ajdic, D., McShan, W. M., McLaughlin, R. E. & 16 other authors (2002). Genome sequence of Streptococcus mutans UA159, a cariogenic dental pathogen. Proc Natl Acad Sci U S A 99, 14434–14439.

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.

Andersson, S. G. E. & Sharp, P. M. (1996). Codon usage in the Mycobacterium tuberculosis complex. Microbiology 142, 915–925.

Bansal, A. K. & Meyer, T. E. (2002). Evolutionary analysis by whole-genome comparisons. J Bacteriol 184, 2260–2272.

Bentley, R. W., Leigh, J. A. & Collins, M. D. (1991). Intrageneric structure of Streptococcus based on comparative analysis of small-subunit rRNA sequences. Int J Syst Bacteriol 41, 487–494.

Beres, S. B., Sylva, G. L., Barbian, K. D. & 13 other authors (2002). Genome sequence of a serotype M3 strain of group A Streptococcus: phage-encoded toxins, the high-virulence phenotype, and clone emergence. Proc Natl Acad Sci U S A 99, 10078–10083.

Bolotin, A., Wincker, P., Mauger, S., Jaillon, O., Malarme, K., Weissenbach, J., Ehrlich, S. D. & Sorokin, A. (2001). The complete genome sequence of the lactic acid bacterium Lactococcus lactis ssp. lactis IL1403. Genome Res 11, 731–753.

Brocchieri, L. (2001). Phylogenetic inferences from molecular sequences: review and critique. Theor Popul Biol 59, 27–40.

Brown, J. R., Douady, C. J., Italia, M. J., Marshall, W. E. & Stanhope, M. J. (2001). Universal trees based on large combined protein sequence data sets. Nat Genet 28, 281–285.

Dandekar, T., Snel, B., Huynen, M. & Bork, P. (1998). Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 23, 324–328.

Daubin, V., Gouy, M. & Perriere, G. (2001). Bacterial molecular phylogeny using supertree approach. Genome Inform 12, 155–164.

Diaz-Lazcoz, Y., Aude, J. C., Nitschke, P., Chiapello, H., Landes-Devauchelle, C. & Risler, J. L. (1998). Evolution of genes, evolution of species: the case of aminoacyl-tRNA synthetases. Mol Biol Evol 15, 1548–1561.

Eisen, J. A. (1995). The RecA protein as a model molecule for molecular systematic studies of bacteria: comparison of trees of RecAs and 16S rRNAs from the same species. J Mol Evol 41, 1105–1123.

Eisen, J. A. (2000). Assessing evolutionary relationships among microbes from whole-genome analysis. Curr Opin Microbiol 3, 475–480.

Enright, M. C. & Spratt, B. G. (1999). Extensive variation in the ddl gene of penicillin-resistant Streptococcus pneumoniae results from a hitchhiking effect driven by the penicillin-binding protein 2b gene. Mol Biol Evol 16, 1687–1695.

Felsenstein, J. (1988). Phylogenies from molecular sequences: inference and reliability. Annu Rev Genet 22, 521–565.

Felsenstein, J. (1996). Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol 266, 418–427.

Feng, D. F., Cho, G. & Doolittle, R. F. (1997). Determining divergence times with a protein clock: update and reevaluation. Proc Natl Acad Sci U S A 94, 13028–13033.

Ferretti, J. J., McShan, W. M., Ajdic, D. & 20 other authors (2001). Complete genome sequence of an M1 strain of Streptococcus pyogenes. Proc Natl Acad Sci U S A 98, 4658–4663.

Fitz-Gibbon, S. & House, C. H. (1999). Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res 27, 4218–4222.

Fleischmann, R. D., Adams, M. D., White, O. & 37 other authors (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512.

Forterre, P. & Philippe, H. (1999). Where is the root of the universal tree of life? Bioessays 21, 871–879.

Garrity, G. M. & Holt, J. G. (2001). The road map to the manual. In Bergey's Manual of Systematic Bacteriology, Vol. 1, 2nd edn, pp. 119–141. Edited by D. R. Boone & R. W. Castenholz. New York: Springer.

Glaser, P., Rusniok, C., Buchrieser, C. & 9 other authors (2002). Genome sequence of Streptococcus agalactiae, a pathogen causing invasive neonatal disease. Mol Microbiol 45, 1499–1513.

Gupta, R. S. (1998). Protein phylogenies and signature sequences: a reappraisal of evolutionary relationships among Archaebacteria, Eubacteria and Eukaryotes. Microbiol Mol Biol Rev 62, 1435–1491.

Gupta, R. S. & Griffiths, E. (2002). Critical issues in bacterial phylogeny. Theor Popul Biol 61, 423–434.

Hoskins, J., Alborn, W. E., Arnold, J. & 39 other authors (2001). Genome of the bacterium Streptococcus pneumoniae strain R6. J Bacteriol 183, 5709–5717.

House, C. H. & Fitz-Gibbon, S. T. (2002). Using homolog groups to create a whole-genomic tree of free-living organisms: an update. J Mol Evol 54, 539–547.

Huang, W. M. (1996). Bacterial diversity based on type II DNA topoisomerase genes. Annu Rev Genet 30, 79–107.

Huynen, M. A. & Bork, P. (1998). Measuring genome evolution. Proc Natl Acad Sci U S A 95, 5849–5856.

Karlin, S., Mrazek, J. & Campbell, A. M. (1997). Compositional biases of bacterial genomes and evolutionary implications. J Bacteriol 179, 3899–3913.

Karlin, S., Campbell, A. M. & Mrazek, J. (1998). Comparative DNA analysis across diverse genomes. Annu Rev Genet 32, 185–225.

Kawamura, Y., Hou, X. G., Sultana, E., Miura, H. & Ezaki, T. (1995). Determination of 16S rRNA sequences of Streptococcus mitis and Streptococcus gordonii and phylogenetic relationships among members of the genus Streptococcus. Int J Syst Bacteriol 45, 406–408.

Klaenhammer, T., Altermann, E., Arigoni, F. & 33 other authors (2002). Discovering lactic acid bacteria by genomics. Antonie Van Leeuwenhoek 82, 29–58.

Kleerebezem, M., Boekhorst, J., van Kranenburg, R. & 17 other authors (2003). Complete genome sequence of Lactobacillus plantarum WCFS1. Proc Natl Acad Sci U S A 100, 1990–1995.

Kolsto, A. B. (1997). Dynamic bacterial genome organization. Mol Microbiol 24, 241–248.

Kunisawa, T. (2001). Gene arrangements and phylogeny in the class Proteobacteria. J Theor Biol 213, 9–19.

Lan, R. & Reeves, P. R. (1996). Gene transfer is a major factor in bacterial evolution. Mol Biol Evol 13, 47–55.

Lin, J. & Gerstein, M. (2000). Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res 10, 808–818.

Nölling, J., Breton, G., Omelchenko, M. V. & 16 other authors (2001). Genome sequence and comparative analysis of the solvent-producing bacterium Clostridium acetobutylicum. J Bacteriol 183, 4823–4838.

Ochman, H., Lawrence, J. G. & Groisman, E. A. (2000). Lateral gene transfer and the nature of bacterial innovation. Nature 405, 299–304.

Okstad, O. A., Hegna, I., Lindback, T., Rishovd, A. L. & Kolsto, A. B. (1999). Genome organization is not conserved between Bacillus cereus and Bacillus subtilis. Microbiology 145, 621–631.

Pot, B., Ludwig, W., Kersters, K. & Schleifer, K. H. (1994). Taxonomy of lactic acid bacteria. In Bacteriocins of the Lactic Acid Bacteria: Microbiology, Genetics and Applications. Edited by L. De Vuyst & E. J. Vandamme. London: Chapman & Hall.

Poyart, C., Quesne, G., Coulon, S., Berche, P. & Trieu-Cuot, P. (1998). Identification of streptococci to species level by sequencing the gene encoding the manganese-dependent superoxide dismutase. J Clin Microbiol 36, 41–47.

Rossello-Mora, R. & Amann, R. (2001). The species concept for prokaryotes. FEMS Microbiol Rev 25, 39–67.

Roux, V., Rydkina, E., Eremeeva, M. & Raoult, D. (1997). Citrate synthase gene comparison, a new tool for phylogenetic analysis, and its application for the Rickettsiae. Int J Syst Bacteriol 47, 252–261.

Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M. A. & Barrell, B. G. (2000). Artemis: sequence visualisation and annotation. Bioinformatics 16, 944–945.

Saitou, N. & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4, 406–425.

Schell, M. A., Karmirantzou, M., Snel, B. & 9 other authors (2002). The genome sequence of Bifidobacterium longum reflects its adaptation to the human gastrointestinal tract. Proc Natl Acad Sci U S A 99, 14422–14427.

Smoot, J. C., Barbian, K. D., Van Gompel, J. J. & 15 other authors (2002). Genome sequence and comparative microarray analysis of serotype M18 group A Streptococcus strains associated with acute rheumatic fever outbreaks. Proc Natl Acad Sci U S A 99, 4668–4673.

Snel, B., Bork, P. & Huynen, M. A. (1999). Genome phylogeny based on gene content. Nat Genet 21, 108–110.

Snel, B., Bork, P. & Huynen, M. A. (2002). Genomes in flux: the evolution of archeal and proteobacterial gene content. Genome Res 12, 17–25.

Stackebrandt, E., Frederiksen, W., Garrity, G. M. & 10 other authors (2002). Report of the ad-hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol 52, 1043–1047.

Suyama, M. & Bork, P. (2001). Evolution of prokaryotic gene order: genome rearrangements in closely related species. Trends Genet 17, 10–13.

Swofford, D. L. & Olsen, G. J. (1990). Phylogeny reconstruction. In Molecular Systematics. Edited by D. M. Hillis & C. Moritz. Sunderland, MA: Sinauer Associates.

Tamames, J., Casari, G., Ouzounis, C. & Valencia, A. (1997). Conserved clusters of functionally related genes in two bacterial genomes. J Mol Evol 44, 66–73.

Teichmann, S. A. & Mitchison, G. (1999). Is there a phylogenetic signal in prokaryote proteins? J Mol Evol 49, 98–107.

Tekaia, F., Lazcano, A. & Dujon, B. (1999). The genomic tree as revealed from whole proteome comparisons. Genome Res 9, 550–557.

Tekaia, F., Yeramian, E. & Dujon, B. (2002). Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 297, 51–60.

Tettelin, H., Nelson, K. E., Paulsen, I. T. & 36 other authors (2001). Complete genome sequence of a virulent isolate of Streptococcus pneumoniae. Science 293, 498–506.

Tettelin, H., Masignani, V., Cieslewicz, M. J. & 40 other authors (2002). Complete genome sequence and comparative genomic analysis of an emerging human pathogen, serotype V Streptococcus agalactiae. Proc Natl Acad Sci U S A 99, 12391–12396.

Vandamme, P., Pot, B., Gillis, M., De Vos, P., Kersters, K. & Swings, J. (1996). Polyphasic taxonomy, a consensus approach to bacterial systematics. Microbiol Rev 60, 407–438.

Wheelis, M. L., Kandler, O. & Woese, C. R. (1992). On the nature of global classification. Proc Natl Acad Sci U S A 89, 2930–2934.

Wolf, Y. I., Rogozin, I. B., Grishin, N. V., Tatusov, R. L. & Koonin, E. V. (2001). Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol Biol 1, 8.

Wolf, Y. I., Rogozin, I. B., Grishin, N. V. & Koonin, E. V. (2002). Genome trees and the tree of life. Trends Genet 18, 472–479.

Wright, F. (1990). The ‘effective number of codons’ used in a gene. Gene 87, 23–29.

Zeigler, D. R. (2003). Gene sequences useful for predicting relatedness of whole genomes in bacteria. Int J Syst Evol Microbiol 53, 1893–1900.

Received 27 May 2003; revised 16 September 2003; accepted 16 September 2003.