Investigating the Relationship Between Genome Structure, Composition, and Ecology in Prokaryotes

Pietro Liò

Department of Zoology, University of Cambridge, United Kingdom


    Abstract
 
Our thesis is that the DNA composition and structure of genomes are selected in part by mutation bias (GC pressure) and in part by ecology. To illustrate this point, we compare and contrast the oligonucleotide composition and the mosaic structure in 36 complete genomes and in 27 long genomic sequences from archaea and eubacteria. We report the following findings: (1) High–GC-content genomes show a large underrepresentation of short distances between Gn and Cn homopolymers with respect to distances between An and Tn homopolymers; we discuss selection versus mutation bias hypotheses. (2) The oligonucleotide compositions of the genomes of Neisseria (N. meningitidis and N. gonorrhoeae), Helicobacter pylori, and Rhodobacter capsulatus are more biased than those of the other sequenced genomes. (3) The genomes of free-living species or nonchronic pathogens show more mosaic-like structure than genomes of chronic pathogens or intracellular symbionts. (4) Genome mosaicity of intracellular parasites has a maximum corresponding to the average gene length; in the genomes of free-living species and nonchronic pathogens the maximum occurs at larger length scales. This suggests that free-living species can incorporate large pieces of DNA from the environment, whereas for intracellular parasites there are recombination events between homologous genes. We discuss the consequences in terms of evolution of genome size. (5) Intracellular symbionts and obligate pathogens show a small, but nonzero, amount of chromosome mosaicity, suggesting that recombination events occur in these species.


    Introduction
 
The recently derived genome sequences of several bacteria and archaea provide an excellent opportunity to investigate differences in DNA sequence composition between and within genomes and to relate these differences to the species' ecology. Genome sequences have been investigated on the basis of several characteristics, for example, dinucleotide contents (Karlin and Burge 1995; Karlin, Campbell, and Mrazek 1998), percentage of shared homologous genes (Huynen, Dandekar, and Bork 1998; Tekaia, Lazcano, and Dujon 1999; Grishin, Wolf, and Koonin 2000), and patterns associated with physicochemical properties of sequences, such as DNA curvature of regulatory sequences (Bolshoy and Nevo 2000; Pedersen et al. 2000).

The density of guanine and cytosine, i.e., the GC content, is an important parameter for investigating and comparing the structure and evolution of genomes (Muto and Osawa 1987; Bernardi 1993; Bellgard and Gojobori 1999). Although the average GC content of bacterial genomes varies across species, the GC mutational pressure, i.e., the specificity in replication and repair machinery, and the context-dependent mutation bias tend to homogenize each genome (Sueoka 1962, 1992; Liò et al. 1996; Karlin, Campbell, and Mrazek 1998). The different selection constraints acting on regulatory and coding regions and the lateral transfer of genes with different GC content (Martin 1999; Garcia-Vallvé, Romeu, and Palau 2000; Ochman, Lawrence, and Groisman 2000) oppose this homogenization of genome composition.

We have compared a large number of genome sequences to investigate the relationships between oligonucleotide composition and genome structure heterogeneity in prokaryotes and their ecology, i.e., distinguishing free-living species and nonchronic pathogens from chronic pathogens and symbionts. First, we use a measure of sequence entropy, based on the frequencies of short oligonucleotides (1–7 bp), to compare 36 complete genomes and 27 long genomic sequences from archaea and eubacteria with different GC content. Second, because oligonucleotide frequencies depend on the mosaic structure of genomes, we used a method derived from wavelets (see Chui 1992 and Daubechies 1992, among others) to estimate the length scale of the genome structure heterogeneity in bacteria and archaea with different ecology. We show that the entropy measure and the wavelet scalogram can be used as complementary methods to detect global patterns in genome sequences.


    Materials and Methods
 
Data
We have analyzed 36 complete genomes and 27 long sequences from eubacteria and archaea (permission to use data of uncompleted genome sequencing projects has been obtained; see Acknowledgments). Table 1 lists the genomic sequences, ordered by average GC content (the same sequences are shown in figure 2), and indicates whether the genome sequence is partial (pg) and whether the prokaryote is an archaeon (A). It is known that there is often a large difference between the average genomic GC content and the GC content calculated considering only the third base of codons; in this work we analyze entire genome sequences, and therefore we have used the average genomic GC content.


 
Table 1 List of the Genomic Sequences Analyzed

 


 
Fig. 2.—Entropy measure of genomic sequences. A, Second-difference of entropy based on the frequencies of 7 bp oligonucleotides with respect to the average GC content in bacteria and archaea genomes (see Materials and Methods and table 1 ). Entropies of single-base frequency for genome data set are coded as filled circles. The species are coded as in figure 1F. B, we have subtracted the values of the curved line in figure 2A from the second-difference in entropy of 7 bp oligonucleotide (y-axis); the x-axis represents the GC content. The species are coded as in figure 1F, and the arrows are described in the text

 

    Statistical Analysis
 
Measure of Between-Genome Composition Diversity
We compared genome sequences through the analysis of short oligonucleotide frequencies (1–7 bp). We used the following entropy measure:

$$D(n) = -\sum_{i} f_i \ln f_i,$$

where $f_i$ is the frequency of the oligonucleotide $i$ of length $n$ ($n$ = 1, ..., 7 bases). In short sequences, there are oligonucleotides that do not occur even if their probabilities in the entire genome are not zero. Therefore, in order to compensate for the finite length of the sequence, $N$, we used a correction factor $M/2N$, added to the observed entropy, where $M$ is the number of oligonucleotides with nonzero occurrences (see Herzel 1988; Schmitt and Herzel 1997; Liò and Ruffo 1998). In a long, random DNA sequence, all the oligonucleotides of length $n$ are equally represented and $D(n)$ is equal to $n \ln 4$ (= 1.386... for $n$ = 1). If some oligonucleotides are more abundant than expected and others extremely rare, $D(n)$ is smaller than $n \ln 4$; if the sequence is periodic, with a period $p$, $D(n)$ is a constant for $n > p$. A further estimator is the first-difference of entropy, $D(n) - D(n-1)$, which estimates the bias in the frequencies of oligonucleotides of length $n$, taking into account the bias in the frequencies of oligonucleotides of length $n - 1$. The term $D(2) - D(1)$ estimates the bias in dinucleotide frequencies, taking into account the single-base frequencies; the term $D(3) - D(2)$ estimates the bias in trinucleotide frequencies with respect to the dinucleotide frequencies, and so on. Therefore, the plot of $D(n) - D(n-1)$ for different values of $n$ makes it possible to detect changes in oligonucleotide frequencies that may reflect mutation-selection pressures. In table 2, we show the effect of the correction factor in computing the first-difference of entropy of oligonucleotide frequencies in a 200-kbp random DNA sequence with GC content = 0.5. Clearly, using the correction factor, we can neglect the finite size of the sequences; moreover, all the sequences in the genome data set are longer than 200 kbp. In a similar way, for comparison, we computed the second-differences of entropy, $D(n) - D(n-2)$, for different values of $n$.
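The entropy measure and its first-difference can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the analysis: the function names `oligo_entropy` and `first_difference` are hypothetical, oligonucleotides are counted in overlapping windows, and the correction is taken to be $M/2N$ added to the observed entropy, with $N$ the sequence length.

```python
import math
import random
from collections import Counter


def oligo_entropy(seq, n, correct=True):
    """Entropy D(n), in nats, of n-mer frequencies, with an optional M/(2N) term."""
    total = len(seq) - n + 1  # number of overlapping n-mers (an assumption)
    counts = Counter(seq[i:i + n] for i in range(total))
    d = -sum((c / total) * math.log(c / total) for c in counts.values())
    if correct:
        # M = n-mers seen at least once; N = sequence length (assumed form).
        d += len(counts) / (2 * len(seq))
    return d


def first_difference(seq, n, correct=True):
    """D(n) - D(n-1); for n = 1 this is D(1) itself."""
    if n == 1:
        return oligo_entropy(seq, 1, correct)
    return oligo_entropy(seq, n, correct) - oligo_entropy(seq, n - 1, correct)


# 200-kbp random sequence with GC content 0.5, as in the test of table 2.
random.seed(0)
random_seq = "".join(random.choice("ACGT") for _ in range(200_000))
for n in range(1, 8):
    corrected = first_difference(random_seq, n)
    raw = first_difference(random_seq, n, correct=False)
    print(n, round(corrected, 4), round(raw, 4))
```

For a random sequence the corrected first-differences should stay close to ln 4 at every length, whereas the uncorrected values drift downward as n grows, because the number of possible oligonucleotides becomes comparable to the sequence length.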


 
Table 2 Test of the Correction Factor for the Sequence Finite Size

 
Distribution of Distances Between Homopolymers
We used density plots to analyze the distribution of distances between An, Tn, Cn, and Gn homopolymers. Density plots, which have unit total area, are essentially smooth versions of histograms and provide smooth estimates of population frequency or probability density curves. Histograms depend on the starting point of the grid of bins, and the differences can be surprisingly large; better results are obtained by averaging the histograms over a large number of different starting points and using very small bins (Silverman 1986, pp. 45–47). We compute the density through the formula

$$\hat{f}(s) = \frac{1}{nb} \sum_{i=1}^{n} k\left(\frac{s - s_i}{b}\right)$$

for a sample of distances $s_1, ..., s_i, ..., s_n$, a fixed kernel $k$, and a bandwidth $b$. We used a normal kernel, and we selected data-dependent bandwidths ($b$) using the formula $b = 0.9 \min(\sigma, R/1.34)\, n^{-1/5}$, where $n$ is the sample size, $\sigma$ is the standard deviation, and $R$ is the interquartile range (Silverman 1986).
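A minimal NumPy sketch of this density estimate follows. The function names are hypothetical, the sample standard deviation (ddof = 1) is an assumption, and the distances are synthetic stand-ins rather than data from any genome.

```python
import numpy as np


def silverman_bandwidth(samples):
    """b = 0.9 * min(sigma, R / 1.34) * n**(-1/5), with R the interquartile range."""
    samples = np.asarray(samples, dtype=float)
    sigma = samples.std(ddof=1)  # sample standard deviation (an assumption)
    q75, q25 = np.percentile(samples, [75, 25])
    return 0.9 * min(sigma, (q75 - q25) / 1.34) * samples.size ** (-0.2)


def normal_kernel_density(samples, grid, bandwidth=None):
    """Evaluate (1 / (n b)) * sum_i k((s - s_i) / b) on grid, with a normal kernel k."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    b = silverman_bandwidth(samples) if bandwidth is None else bandwidth
    z = (grid[:, None] - samples[None, :]) / b
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (samples.size * b)


# Synthetic distances (bp), used only to exercise the functions.
rng = np.random.default_rng(0)
distances = rng.exponential(scale=500.0, size=2000)
grid = np.linspace(-2000.0, 8000.0, 2001)
density = normal_kernel_density(distances, grid)
area = float(np.sum(density) * (grid[1] - grid[0]))  # should be close to 1
print(round(area, 3))
```

Because each kernel integrates to one, the estimated curve has unit area as long as the grid covers the range of the data, which is the property that distinguishes a density plot from a raw histogram of counts.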

Measure of Within-Genome Composition Heterogeneity
Karlin and Brendel (1993), Karlin and Burge (1995), and Karlin and Mrazek (1997), among others (see also Liò et al. 1996), have provided evidence that even the DNA sequence of the simplest microorganism shows a large degree of patchiness at different sequence length scales. In their analyses, the window size is chosen on the basis of the sequence length and the features to be detected; consequently, the localization accuracy of the methods is of the order of the chosen window length. In order to investigate GC patchiness, we coded DNA sequences as G,C = 1, A,T = 0. It is known that GC content variations along genomes occur at different sequence lengths, for example, codons, genes, long repeats, pathogenicity islands, and isochores (Bernardi 1993; Nekrutenko and Li 2000). GC variation is much larger than purine-pyrimidine variation (Liò et al. 1996; Liò and Ruffo 1998); therefore, GC patchiness represents most of the genome sequence patchiness. Arneodo et al. (1995, 1996, 1998) used the continuous wavelet transform to analyze GC patchiness and correlations in DNA sequences of different species. We used the wavelet scalogram based on the discrete wavelet transform (DWT, Mallat 1989).

Using Wavelets to Assess Within-Genome Composition Heterogeneity
The name wavelet means small wave (the sinusoids used in Fourier analysis are big waves). In short, a wavelet is an oscillation that decays quickly. Wavelets are functions that can be used to efficiently describe a signal by breaking it down into its components at different scales (or frequency bands) and following their evolution in the space domain. Unlike the Fourier basis, wavelets are local in both frequency and space. Wavelets integrate to zero and come in different, complex shapes (some, such as the Haar wavelet, are discontinuous), each suitable for a different class of problems. Wavelets are also related to fractals in that the same shapes repeat at different orders of magnitude. Therefore, they typically perform better than Fourier analysis when the signals contain discontinuities and sharp spikes (Chui 1992; Daubechies 1992).

Wavelet Series
In wavelet theory, a function is represented by an infinite series expansion in terms of dilated and translated versions of a basic function $\psi$, called the mother wavelet, each multiplied by an appropriate coefficient (see Daubechies 1992 and Chui 1992, among others). The wavelet family $\psi_{j,k}$ is obtained from the mother wavelet by shrinking by a factor $2^j$ and translating by $2^{-j}k$, to obtain $\psi_{j,k}(x) = 2^{j/2}\psi(2^j x - k)$, where the subscript $j$ represents the dilation number and the subscript $k$ represents the translation number. The factor $2^{j/2}$ is a normalization factor for $\psi_{j,k}$. The wavelet series representation of a function $f$ is therefore

$$f(x) = \sum_{j,k} f_{j,k}\, \psi_{j,k}(x)$$

with wavelet coefficients

$$f_{j,k} = \int f(x)\, \psi_{j,k}(x)\, dx.$$

Coefficients $f_{j,k}$ describe features of $f$ at the spatial location $2^{-j}k$ and frequency proportional to $2^j$ (or scale $j$). Unlike the Fourier transform, wavelets provide time-frequency localization in that the coefficient $f_{j,k}$ gives information about the function near time point $2^{-j}k$ and near frequency proportional to $2^j$.

Discrete Wavelet Transform
The DWT decomposes a function into its wavelet coefficients (Mallat 1989). From a computational point of view, the DWT proceeds by recursively applying two convolution functions known as quadrature mirror filters, each producing an output stream that is half of the length of the original input, until the resolution level zero is reached. If the filters are applied $n$ times (with $2^n \le N$), at each intermediate step (a level in wavelet terminology) $j = 1, ..., n$, the transform produces two vectors of coefficients, $S_j$ of scaling coefficients and $D_j$ of wavelet coefficients. The vector $D_j$ is kept, whereas $S_j$ is processed through the two filters. At the last level $n$, both $S_n$ and $D_n$ are kept. Different coefficient vectors contain information about the characteristics of the sequence at different scales or sequence lengths. Coefficients at coarse scales capture gross and global features. Coefficients at fine scales contain the local details of the profile. At level $j$, the wavelet coefficients $D_j$ are associated with changes in the averages of the data on a scale $2^{j-1}$ at a set of location times. Scaling coefficients $S_n$ at the last level $n$ are instead associated with averages of the data on scales $2^n$ and higher. The wavelet transform is, therefore, a cumulative measure of the variations in the data over regions proportional to the wavelet scales, with coefficients at coarser and coarser levels, i.e., for increasing values of $j$, describing features at lower frequency ranges and larger time periods. For example, given a genome sequence of $2 \times 10^6$ bp, GC variations at gene length (1,000–2,000 bp) correspond to scales 10 to 11.
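The recursion can be illustrated with a short sketch. It swaps in the Haar filter pair, the simplest quadrature mirror filters, in place of the Daubechies filters used in this work, and the names `gc_indicator` and `haar_dwt` are hypothetical.

```python
import numpy as np


def gc_indicator(seq):
    """Code G and C as 1, A and T as 0."""
    return np.array([1.0 if base in "GC" else 0.0 for base in seq.upper()])


def haar_dwt(signal, levels):
    """Pyramid algorithm: return ([D1, ..., Dn], Sn) using Haar filters."""
    s = np.asarray(signal, dtype=float)
    if levels < 1 or s.size % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    details = []
    for _ in range(levels):
        even, odd = s[0::2], s[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # high-pass: wavelet coefficients Dj (kept)
        s = (even + odd) / np.sqrt(2.0)              # low-pass: scaling coefficients Sj (filtered again)
    return details, s


# Toy 16-base sequence: GC-rich and AT-rich blocks of 4 bases alternate.
x = gc_indicator("GGCCATATGCGCATTA")
details, smooth = haar_dwt(x, levels=4)
print([d.size for d in details], smooth.size)
print([float(d @ d) for d in details])
```

In this toy sequence all the detail energy falls at level 3, whose coefficients contrast the averages of adjacent blocks of $2^{3-1} = 4$ bases, which is the block size of the composition change.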

For practical purposes, the DWT is often represented in matrix form as $Wy$, with $W$ an orthogonal matrix and $y$ a vector of observations of the signal. An inverse wavelet transform can also be defined. The standard DWT, like the fast Fourier transform, operates on data sets whose length is a power of 2. When required, data can be padded with zeros. These zeros do not affect the results.

Choice of Wavelet Basis
We found that, in general, Daubechies wavelets perform better than Haar wavelets for the type of signals considered in this work (GC plots). Wavelets from the Daubechies families have two important properties: they are compactly supported and have the maximum number of vanishing moments (a function $f$ has $N$ vanishing moments if $\int x^q f(x)\, dx = 0$ for $q = 0, 1, ..., N - 1$). Compact supports are useful to describe local characteristics that change rapidly with time. A large number of vanishing moments leads to high compressibility because the fine-scale wavelet coefficients will be essentially zero where the function is smooth. Although we have previously used Daubechies $N$ = 10 (Liò and Vannucci 2000), the analysis of the large set of genomic sequences (table 1) showed that, if one is interested in scales at or above gene length, the Daubechies basis of type $N$ = 2 performs as well as $N$ = 10 for G+C pattern analysis. Therefore, in this work and in a recent publication (Vannucci and Liò 2001) we used Daubechies $N$ = 2.
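The vanishing-moment property can be checked numerically on the filter coefficients. The sketch below uses the standard four-tap filter of the Daubechies wavelet with two vanishing moments and the Haar filter for comparison; the helper `discrete_moment` is a hypothetical name, and the discrete sums stand in for the integral in the definition.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
# Scaling (low-pass) filter of the Daubechies wavelet with two vanishing moments.
H = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))
# Wavelet (high-pass) filter from the quadrature mirror relation.
G = np.array([H[3], -H[2], H[1], -H[0]])
# Haar wavelet filter, which has a single vanishing moment.
G_HAAR = np.array([1.0, -1.0]) / np.sqrt(2.0)


def discrete_moment(filt, q):
    """Sum over k of k**q * filt[k], a discrete analogue of the q-th moment."""
    k = np.arange(filt.size)
    return float(np.sum(k ** q * filt))


for q in range(3):
    print(q, round(discrete_moment(G, q), 6), round(discrete_moment(G_HAAR, q), 6))
```

The Daubechies filter annihilates both constant and linear trends (moments 0 and 1 vanish), whereas the Haar filter annihilates only constants, so a smooth GC gradient leaves less residue in the fine-scale Daubechies coefficients.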

Wavelet Scalogram
Genome mosaicity is analyzed using the scalogram, which is the equivalent of the periodogram used in Fourier analysis. The scalogram is a plot of the sum of the squares of the coefficients at each scale (Flandrin 1988; Ariño and Vidakovic 1995; Chiann and Morettin 1998). The plot indicates at which scale of resolution the energy of the function is concentrated. A relatively smooth function will have most of its energy concentrated at large scales. A function showing high-frequency oscillations will have a large portion of its energy concentrated in high-resolution wavelet coefficients.

We found that the largest amount of GC content variation in eubacteria and archaea genomes occurs at short sequence lengths, mainly at the length of one or a few codons. Therefore, in order to improve the detection of GC content variations at and above gene length, we applied a wavelet denoising technique to eliminate rapid variations of GC content (Donoho and Johnstone 1994; Donoho et al. 1995). Finally, each scalogram is generated by subtracting the values of a scalogram obtained from a random DNA sequence of the same length and average GC content.
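The scalogram and the subtraction of a random baseline can be sketched as follows. This is an illustration under simplifying assumptions: Haar filters replace Daubechies $N$ = 2, the denoising step is omitted, a single random sequence serves as the baseline, and the function names and the synthetic mosaic are hypothetical.

```python
import numpy as np


def haar_details(signal, levels):
    """Wavelet coefficient vectors D1, ..., Dn from a Haar pyramid."""
    s = np.asarray(signal, dtype=float)
    out = []
    for _ in range(levels):
        out.append((s[0::2] - s[1::2]) / np.sqrt(2.0))
        s = (s[0::2] + s[1::2]) / np.sqrt(2.0)
    return out


def scalogram(signal, levels):
    """Energy per level: sum of squared wavelet coefficients."""
    return np.array([float(d @ d) for d in haar_details(signal, levels)])


def excess_scalogram(gc, levels, rng):
    """Scalogram minus that of a random 0/1 sequence of equal length and GC content."""
    gc = np.asarray(gc, dtype=float)
    baseline = (rng.random(gc.size) < gc.mean()).astype(float)
    return scalogram(gc, levels) - scalogram(baseline, levels)


rng = np.random.default_rng(1)
length, block = 2 ** 14, 2 ** 9
# Synthetic mosaic: 512-base blocks alternating between GC probability 0.35 and 0.65.
p = np.where((np.arange(length) // block) % 2 == 0, 0.35, 0.65)
mosaic = (rng.random(length) < p).astype(float)
excess = excess_scalogram(mosaic, levels=12, rng=rng)
print(int(np.argmax(excess)) + 1)  # level carrying the largest excess energy
```

For this synthetic mosaic the excess energy peaks at level 10, whose coefficients contrast adjacent blocks of $2^9$ = 512 bases, so the position of the maximum recovers the patch size built into the sequence.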


    Results
 
Patterns of Genome Composition in High– and Low–GC-Content Genomes
Recent papers have shown that the relative abundances of the dinucleotides constitute a signature of prokaryotic and eukaryotic genome sequences (Karlin and Burge 1995; Karlin and Mrazek 1997; Karlin, Campbell, and Mrazek 1998). In this work we have extended the analysis to longer oligonucleotides (1–7 bp) and used a very large genome sequence data set. Figure 1A shows the entropy of the single-base frequency with respect to the GC content for the 36 complete genomes and 27 long genomic sequences (see description in Materials and Methods and table 1). We found that single-base entropies do not show significant deviations from a curve generated using random DNA sequences with different GC content (the curved line in fig. 1B and E). Figures 1B to E show the first-difference of entropy of 2, 3, 6, and 7 bp oligonucleotide frequencies (results for 4 and 5 bp do not add any interesting features). The genomes are represented by symbols according to the species' ecology. In figure 1F, we show the difference between the first-difference of entropy of 7 bp oligonucleotide frequencies and the values obtained using random DNA sequences (curved line in fig. 1E). Therefore, whereas for figures 1A to E the entropy values should be compared with the curved line, in figure 1F they should be compared with the top line (value 0). Note that in figure 1F the species are classified according to their taxonomy. Results from figures 1E and F suggest that genomes are clustered neither by ecology nor by taxonomy. For many genome sequences, the first-difference of entropy is very large for both dinucleotides (distances from the curved line of fig. 1A) and trinucleotides (distances between dinucleotide positions in fig. 1B and trinucleotide positions in fig. 1C). There are few genomes that have almost equal or similar first-difference of entropy values at all oligonucleotide lengths n (n = 1, ..., 7).
We found that all the Chlamydia sequences (lower triangles) have similar values of first-difference of entropy; there are negligible differences between Haemophilus influenzae and H. ducreyi and between Neisseria meningitidis strains MC58 and Z2491; there are small differences between the N. meningitidis strains and N. gonorrhoeae (second arrow from left in fig. 1B) and between the Helicobacter pylori strains J99 and 26695 (first arrow from left in fig. 1B). The genome sequences of H. pylori and Neisseria show a larger first-difference of entropy for dinucleotides than the sequences with the same GC content (fig. 1B). The genome sequences of B. stearothermophilus, H. pylori, B. pseudomallei, and Thermotoga maritima show the largest first-difference of entropy for trinucleotides.



 
Fig. 1.—Entropy measure of genomic sequences. A, Entropies of single-base frequency for the genome data set with respect to the GC content (x-axis). B–E, first-difference of entropy based on the frequencies of 2, 3, 6, and 7 bp oligonucleotides, respectively, with respect to the average GC content in bacteria and archaea genomes (see Materials and Methods and table 1). The curved lines in B and E represent the entropy of single-base frequency from random DNA sequences with different GC content. The genomes are represented by symbols according to the prokaryotes' ecology: squares for obligate pathogens and intracellular symbionts, circles for free-living and nonobligate pathogens. Complete genomes are represented by open symbols and uncompleted genomes by closed symbols. F, we have subtracted the values of the curved line in figure 1E (and fig. 1B) from the first-difference in entropy of 7 bp oligonucleotide frequencies (y-axis); the x-axis represents the GC content. The species are coded according to their taxonomy: closed squares, mycoplasmas; closed diamonds, mycobacteria; open squares, other gram-positive bacteria; circles, archaea; upper triangles, gram-negative bacteria; open diamonds, hyperthermophiles; star, spirochaetes; lower triangles, chlamydias; filled circles, cyanobacteria. The arrows are described in the text.

 
Our results show that, particularly for n > 4, the largest first-difference of entropy occurs for low–GC-content genomes, for example Ureaplasma urealyticum (first arrow from left in fig. 1E), the H. pylori strains (first arrow from left in fig. 1B; second arrow from left in fig. 1E; first arrow from the left in fig. 1F), the N. meningitidis strains (second arrow from left in fig. 1B; third arrow in fig. 1E; second arrow from the right in fig. 1F), and N. gonorrhoeae (on the right-hand side of N. meningitidis), and for high–GC-content genomes, such as Rhodobacter capsulatus (fourth arrow from the left in fig. 1E; third arrow from the left in fig. 1F) and Streptomyces coelicolor (first arrow from the right in fig. 1E).

The genome sequence of Mycoplasma genitalium shows a larger oligonucleotide first-difference of entropy than the other sequenced mycoplasma genomes, U. urealyticum and Mycoplasma pneumoniae (closed squares in fig. 1F). Among the high–GC-content genomes, Mycobacterium bovis and Mycobacterium tuberculosis have a larger first-difference of entropy than Mycobacterium leprae (closed diamonds in fig. 1F). Although genes from these genomes share 80% amino acid identity on average, M. leprae has a lower GC content than the other two mycobacteria (Bellgard and Gojobori 1999). The genomes of M. bovis and M. tuberculosis have a lower first-difference of entropy than the genomes of proteobacteria with similar GC content, such as Pseudomonas aeruginosa and the Bordetella sequences.

The comparison of genome sequences that have complementary GC content, for example, 35% and 65%, shows that low–GC-content genomes have a smaller first-difference of entropy (in absolute value) than high–GC-content genomes (fig. 1F). For example, the average first-differences of entropy for 7-bp oligonucleotides in genome sequences with GC content below 0.40, between 0.40 and 0.60, and above 0.60 are (standard deviations in parentheses) -0.042 (0.020), -0.046 (0.019), and -0.071 (0.015), respectively. This finding seems to be particularly relevant for proteobacteria (circles in fig. 1F). A further confirmation is given by the second-difference entropies of 7-bp oligonucleotides, D(7) - D(5), for the same set of genomic sequences of figure 1 (fig. 2A). In figure 2B, we have subtracted the entropies of random DNA sequences from the values in figure 2A. The largest values correspond to those of figure 1F: the two H. pylori strains (first arrow from left in figs. 2A and B), N. meningitidis and N. gonorrhoeae (second arrow from left in figs. 2A and B), and R. capsulatus (third arrow from the left in figs. 2A and B).

We analyzed in detail the oligonucleotide frequencies in high– and low–GC-content genomes. The contrast in oligonucleotide frequencies for high– and low–GC-content genomes shows that, particularly for n > 4 bp, homopolymers of the type Gn and Cn are generally underrepresented in all genomes in the data set. We made use of density plots to analyze the distribution of distances between An, Tn, Cn, and Gn homopolymers. In figure 3, we show the density plots of the distances between A6 and T6 homopolymers in AT-rich genomes and the distances between C6 and G6 homopolymers in GC-rich genomes in a set of bacterial genomes. Each plot of figure 3 shows the distances between homopolymers in genomes with almost complementary GC content, for example, Campylobacter jejuni (GC content ~31%) versus P. aeruginosa (~67%, fig. 3A); Borrelia burgdorferi (~29%) versus Deinococcus radiodurans (~67%, fig. 3B); Methanococcus jannaschii (~31%) versus Halobacterium sp. (~68%, fig. 3C); and Escherichia coli (~51%, fig. 3D). In the absence of any selection or mutation mechanism, we expect the GC-rich genomes to contain a number of Gn and Cn homopolymers roughly equal to the number of An and Tn homopolymers in AT-rich genomes, and vice versa. Instead, we found that the number of G6 and C6 homopolymers in GC-rich genomes is much lower than the number of A6 and T6 homopolymers in AT-rich genomes. All the plots show similarly large differences between G6/C6 and A6/T6 distances in GC-rich and AT-rich genomes, respectively.
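As an illustration of how such distances might be collected before density estimation, the sketch below locates homopolymer runs with a regular expression. The exact definition of distance used for figure 3 is not stated in the text, so the start-to-start spacing between consecutive runs of either base is an assumption here, and the function names are hypothetical.

```python
import re


def homopolymer_starts(seq, base, n):
    """Start positions of maximal runs of base that are at least n long."""
    return [m.start() for m in re.finditer(f"{base}{{{n},}}", seq)]


def distances_between(seq, bases, n):
    """Start-to-start gaps (bp) between consecutive runs of any base in bases."""
    starts = sorted(p for b in bases for p in homopolymer_starts(seq, b, n))
    return [b - a for a, b in zip(starts, starts[1:])]


toy = "AAAAAAGGTTTTTTTCCAAAAAA"
print(homopolymer_starts(toy, "A", 6))
print(distances_between(toy, "AT", 6))
```

The same call with `bases="GC"` would give the G6/C6 distances for a GC-rich genome, and either list can then be passed to a kernel density estimator.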



 
Fig. 3.—Density plots of the distances between A6 and T6 homopolymers in low–GC-content genomes and between C6 and G6 homopolymers in high–GC-content genomes. A, Campylobacter jejuni (GC content ~31%) and P. aeruginosa (~67%); B, B. burgdorferi (~29%) and D. radiodurans (~67%); C, M. jannaschii (~31%) and Halobacterium sp. (~68%); D, E. coli (~51%). The x-axis shows the distances between homopolymers, and the y-axis shows the density

 
Within-Genome Composition Heterogeneity
Our study, based on a very large set of genomes from bacteria and archaea, searches for a correlation between genome mosaicity and ecology. In order to investigate the mosaic-like structure of genomes, we calculated the wavelet scalograms for the genome data set (fig. 4). We found quantitative and qualitative differences between the genomes of free-living species and nonchronic pathogens (figs. 4A and B) and those of intracellular parasites and symbionts (figs. 4C and D). Although the size of GC variations, i.e., the scales, depends on the sequence length because of the wavelet decomposition process (see Materials and Methods), the genomes of free-living species and nonobligate pathogens clearly show more mosaic-like structure than the genomes of obligate pathogens and intracellular symbionts. In particular, the genomes of the N. meningitidis strains show more mosaic-like structure than all the other genomes analyzed, even more than Synechocystis, E. coli (not shown), and other free-living bacteria. Mosaicity occurs at all the scales in the genomes of free-living species and nonchronic parasites (figs. 4A and B), whereas the genomes of intracellular parasites have low or negligible values at coarser scales (figs. 4C and D). The maximum of the scalograms for intracellular parasites occurs at scales that correspond to the average gene length (table 3); it occurs at coarser scales for the genomes of free-living species and nonchronic pathogens. This suggests that free-living species can incorporate large pieces of DNA from the environment, whereas for intracellular parasites the most common event is recombination between homologous genes. This has important consequences for the evolution of genome reduction in that it explains the genome size reduction observed in intracytoplasmic genomes and mitochondria. Finally, we found that there is small but not negligible mosaicity in genomes of intracytoplasmic symbionts.
It is noteworthy that these results do not depend on the GC directional mutational pressure: for example, low–GC-content genomes (for example Buchnera sp.) and genomes that have 50% GC content (for example Chlamydias) have similar genome mosaicity and ecology.



 
Fig. 4.—Wavelet scalograms of archaea and eubacteria genome sequences. A, Synechocystis sp. (+), N. meningitidis MC58 (circles), V. cholerae (triangles); B, A. fulgidus (+), C. jejuni (circles), M. jannaschii (triangles); C, H. pylori (+), R. prowazekii (circles), Chlamydia pneumoniae J138 (triangles); D, B. burgdorferi (+), U. urealyticum (circles), Buchnera sp. (triangles). The x-axis shows the scale, and the y-axis shows the energy that is an estimate of the amount of mosaicity at each scale

 

 
Table 3 Numerical Results of Scalogram of Figure 4

 

    Discussion
 
Genome Composition and GC Mutational Pressure
The evolution of DNA composition and structure of bacterial genomes depends on mutational and selection processes, such as GC mutation pressure, strand-bias mutation generated by DNA repair during transcription (Sueoka 1995), and the relative frequency of synonymous codons caused by the abundance of tRNAs (Sorensen, Kurland, and Pedersen 1989; Berg and Kurland 1997). Although much of the difference in oligonucleotide frequencies in a short sequence can be nonfunctional, genome-wide it may have relevance in terms of genome organization, codon usage, gene expression, recombination, and mutation rate. The first-differences of entropy of dinucleotide frequencies may reflect structural constraints, for example, the stacking energies and the species-specific machinery for DNA repair and modification (Karlin, Campbell, and Mrazek 1998). Because coding regions generally constitute more than 90% of the bacterial genome, trinucleotide frequency diversity may be related to synonymous codon usage in coding regions.

The genomes of N. meningitidis and H. pylori show larger values of first-difference of entropy than other bacteria and archaea with similar GC content. These genomes contain several large pathogenicity islands that have different GC content from the nearby genomic regions (Tomb et al. 1997Citation ; Liò and Vannucci 2000Citation ; Parkhill et al. 2000Citation ; Tettelin et al. 2000Citation ). The results from the entropy analysis and the wavelet scalograms allow us to infer that the H. pylori genome does not contain more GC-heterogeneity than the other chronic pathogens, but it contains few genomic regions with remarkably different short oligonucleotide frequencies. This is in agreement with our previous findings (see fig. 2 in Liò and Vannucci 2000Citation ) and the findings of Liu and co-workers (Liu et al. 1999Citation ) that showed that there are large differences in sequence composition in the CAG pathogenicity island with respect to the average composition of the H. pylori genome. The genome of N. gonorrhoea has similar entropy values as that of N. meningitidis, it may contain pathogenicity islands with similar base composition and size; R. capsulatus may have pathogenicity regions or alien genes too. Low–GC-content genomes show less diversity in oligonucleotide frequencies than high–GC-content genomes. Some authors have shown that in E. coli and in other bacteria, the homopolymers of the type An and Tn are more abundant than Cn and Gn (Dechering et al. 1998Citation ; Shomer and Yagil 1999Citation ). The results in figure 3 suggest that short distances (<2 kbp) between Cn and Gn are less abundant than short distances between An and Tn in bacterial genomes. Although several DNA mismatch and repair mechanisms are known to change oligonucleotide frequencies (see for example Deschavanne and Radman 1991Citation ), no mutation mechanisms is known to act selectively on long Gn and Cn homopolymers. 
Because noncoding regions represent a small percentage of eubacterial and archaeal genomes, statistical analyses give little information about differences between homopolymers in coding regions with respect to the overall genome. The fact that bacterial mRNAs are generally polycistronic, with lengths of several kilobases, suggests the importance of a constraint on RNA secondary structure against Gn and Cn sticky patches. This is in agreement with the work of Huynen and co-workers, who found that, in histone genes, the compensation of the G-C ratio indicates a selection pressure at the mRNA level rather than a selection pressure or mutation bias at the DNA level or a selection pressure on codon usage (Huynen, Konings, and Hogeweg 1992). It is also known that pairing of palindromic Gn and Cn patches serves as a transcription termination mechanism in several bacterial operons (see, for example, Lewin 1997, pp. 318–319).
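
The comparison of short distances between homopolymers can be illustrated with a minimal sketch. It assumes that a distance is measured between the start positions of consecutive tracts, that a tract is a run of at least `min_len` identical bases, and that "short" means below 2 kbp; the function names and the default `min_len` are hypothetical and do not reproduce the exact procedure behind figure 3.

```python
import re


def tract_starts(seq, bases, min_len):
    """Start positions of homopolymer runs (>= min_len) of any base in `bases`."""
    # Builds e.g. "G{5,}|C{5,}" for bases="GC", min_len=5.
    pattern = "|".join(f"{b}{{{min_len},}}" for b in bases)
    return [m.start() for m in re.finditer(pattern, seq.upper())]


def short_distance_fraction(seq, bases, min_len=5, max_dist=2000):
    """Fraction of distances between consecutive tracts that are below max_dist bp."""
    starts = tract_starts(seq, bases, min_len)
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    if not gaps:
        return 0.0  # fewer than two tracts: no distances to evaluate
    return sum(g < max_dist for g in gaps) / len(gaps)
```

Calling `short_distance_fraction(genome, "GC")` and `short_distance_fraction(genome, "AT")` on the same sequence gives two fractions that can be compared directly; an underrepresentation of short Gn and Cn distances would appear as a lower value for the first call.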

Ecology and Genome Composition Heterogeneity: Molecular Evolution Implications
A possible explanation for why the genomes of free-living species and nonchronic pathogens show more patchiness than the genomes of chronic pathogens or symbionts is that free-living bacteria and nonchronic pathogens experience a fluctuating and challenging environment in which selection varies in time and place. The genomes of these species can easily exchange DNA segments and incorporate cassettes of resistance genes that allow them to face environmental changes. Nonchronic pathogens probably have a high degree of mosaicity because of the pressure to keep pathogenicity islands that are shared among bacterial species with different genome-wide base compositions.

Differences in the wavelet scalograms of free-living bacteria may also reflect different mechanisms for DNA uptake and recombination. Therefore, genomes such as those of D. radiodurans (GC content ≈ 67%) and S. coelicolor (GC content ≈ 72%), which are subjected to strong GC mutational bias, do not undergo a reduction in genome size and tRNA population. In contrast, the genomes of chronic pathogens and symbionts that are subjected to GC pressure, such as Mycoplasma capricolum (GC content ≈ 25%) and Micrococcus luteus (GC content ≈ 75%), undergo a reduction in genome size and tRNA population (Muto et al. 1990; Kano et al. 1991; Andersson and Kurland 1995). The fact that the scalograms of the genomes of intracellular pathogens or symbionts such as Buchnera sp. (fig. 4D and table 3) show a small but not negligible amount of genome composition heterogeneity suggests that genetic recombination events occur. This finding is in agreement with the predictions of Wolf and co-workers for Rickettsiae and Chlamydiae (Wolf, Aravind, and Koonin 1999). The genome of Rickettsia prowazekii contains the largest amount of repeats (24%) among the bacterial genomes sequenced to date (Andersson et al. 1998). It is known that the presence of repeats increases the recombination rate, and this may explain the relatively large values of the GC variations.

Our analyses, based on a very large number of genome sequences, show how oligonucleotide composition is affected by GC mutational pressure and that the mosaic-like structure of a genome depends on the history of gene transfer events and thus on the ecology of the species. The entropy measure and the wavelet scalogram are complementary methods to analyze genome sequences and to detect the presence of alien genes; that is, they can be used as preliminary analyses in the investigation of host-pathogen relationships. The alien genes can be located through the selective reconstruction of the GC plot from the scalogram using just a few scales, as shown in Liò and Vannucci (2000). Further improvements of this work will consider using the nondecimated or stationary version of the DWT, a modified transform in which the coefficients at each level are not subsampled.
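
The idea of a scalogram computed from a GC plot can be sketched as follows. This is a simplified illustration, assuming the scalogram is the sum of squared DWT detail coefficients at each scale and using the Haar wavelet for brevity; the wavelet family, window size, and function names here are assumptions and are not those used for the results in this paper.

```python
from math import sqrt


def gc_signal(seq, window):
    """GC fraction in consecutive non-overlapping windows of `window` bp."""
    seq = seq.upper()
    return [
        sum(ch in "GC" for ch in seq[i:i + window]) / window
        for i in range(0, len(seq) - window + 1, window)
    ]


def haar_scalogram(signal):
    """Energy of Haar detail coefficients per level, finest scale first.

    The signal is truncated to the largest power-of-two length.
    """
    n = 1
    while n * 2 <= len(signal):
        n *= 2
    approx = list(signal[:n])
    energies = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / sqrt(2) for a, b in pairs]  # local differences
        approx = [(a + b) / sqrt(2) for a, b in pairs]  # local averages
        energies.append(sum(d * d for d in detail))
    return energies
```

In this decimated transform, level j (counting from 1) summarizes GC variation over blocks of roughly 2^j windows, so the level at which the energy peaks indicates the typical length scale of compositional patches. A nondecimated (stationary) transform would keep all coefficients at every level instead of halving them.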


    Acknowledgments and Permissions for Using Genome Data
 
P.L. is supported by an EPSRC/BBSRC Bioinformatics Initiative grant. We thank Marina Vannucci and Nick Goldman for helpful suggestions. For the following species we have considered all nonredundant sequences available in GenBank and the sequences freely available through the web sites of several research institutions (we obtained permission to use data from ongoing genomic projects). Sequence data of B. pertussis, B. bronchiseptica, B. parapertussis, B. pseudomallei, Corynebacterium diphtheriae, C. difficile, M. bovis, M. leprae, S. typhi, S. coelicolor, and Y. pestis are from sequencing groups at the Sanger Centre and can be found at www.sanger.ac.uk. Sequence data of N. gonorrhoea, S. pyogenes, A. actinomycetemcomitans, B. stearothermophilus, S. aureus, and S. mutans are from the University of Oklahoma (www.genome.ou.edu, dna1.chem.ou.edu/gono.html). We acknowledge the Gonococcal Genome Sequencing Project supported by USPHS/NIH grant AI38399, and L. A. Lewis, A. Gillaspy, R. McLaughlin, M. Gipson, T. Ducey, T. Ownbey, K. Hartman, C. Nydick, M. Carson, J. Vaughn, C. Thomson, L. Song, S. Lin, X. Yuan, F. Najar, M. Zhan, Q. Ren, H. Zhu, S. Qi, S. Kenton, H. Lai, J. White, S. Clifton, B. A. Roe, and D. W. Dyer. We also acknowledge the S. mutans Genome Sequencing Project funded by a USPHS/NIH grant from the Dental Institute and B. A. Roe, R. Y. Tian, H. G. Jia, Y. D. Qian, S. P. Linn, L. Song, R. E. McLaughlin, M. McShan, and J. Ferretti. L. pneumophila and N. punctiforme sequences are from the Columbia Genome Center (genome3.cpmc.columbia.edu); K. pneumoniae is sequenced by the Washington University Consortium (genome.wustl.edu/gsc/Projects/bacteria.shtml); H. ducreyi and M. maripaludis are sequenced by the University of Washington (www.htsc.washington.edu, kandisnky.genome.washington.edu/maripaludis); N. europaea and P. marinus are from DOE-JGI (www.jgi.doe.gov/JGI_microbial/html); P. abyssi is from Genoscope (www.genoscope.cns.fr/Pab); R. capsulatus is sequenced by the University of Chicago (capsulapedia.uchicago.edu); and R. sphaeroides is sequenced by the University of Texas (www-mmg.med.uth.tmc.edu/sphaeroides).


    Footnotes
 
William Taylor, Reviewing Editor

Keywords: genomics, GC content, genome structure, prokaryotes, ecology

Address for correspondence and reprints: Pietro Liò, Department of Zoology, University of Cambridge, Downing Street, Cambridge CB2 3EJ, U.K. p.lio@zoo.cam.ac.uk


    References
 

    Alm R. A., L. S. Ling, D. T. Moir, et al. (23 co-authors) 1999 Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori Nature 397:176-180

    Andersson S. G., C. G. Kurland, 1995 Genomic evolution drives the evolution of the translation system Biochem. Cell Biol 73:775-787

    Andersson S. G., A. Zomorodipour, J. O. Andersson, et al. (10 co-authors) 1998 The genome sequence of Rickettsia prowazekii and the origin of mitochondria Nature 396:133-140

    Ariño M., B. Vidakovic, 1995 On wavelet scalograms and their applications in economic time series Discussion paper 95-21, ISDS, Duke University

    Arneodo A., E. Bacry, P. V. Graves, J. F. Muzy, 1995 Characterizing long-range correlations in DNA sequences from wavelet analysis Phys. Rev. Lett 74:3293-3296

    Arneodo A., Y. d'Aubenton Carafa, B. Audit, E. Bacry, J. F. Muzy, C. Thermes, 1998 What can we learn with wavelets about DNA sequences Physica A 249:439-448

    Arneodo A., Y. d'Aubenton Carafa, E. Bacry, P. V. Graves, J. F. Muzy, C. Thermes, 1996 Wavelet based fractal analysis of DNA sequences Physica D 1328:1-30

    Bellgard M. I., T. Gojobori, 1999 Inferring the direction of evolutionary changes of genomic base composition TiG 15:254-256

    Berg O. G., C. G. Kurland, 1997 Growth rate-optimised tRNA abundance and codon usage J. Mol. Biol 270:544-550

    Bernardi G., 1993 The vertebrate genome: isochores and evolution Mol. Biol. Evol 10:186-204

    Blattner F. R., G. Plunkett, C. A. Bloch, et al. (17 co-authors) 1997 The complete genome sequence of Escherichia coli K-12 Science 277:1453-1474

    Bolshoy A., E. Nevo, 2000 Ecologic genomics of DNA: upstream bending in prokaryotic promoters Genome Res 10:1185-1193

    Bult C. J., O. White, G. J. Olsen, et al. (23 co-authors) 1996 Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii Science 273:1058-1073

    Chiann C., P. A. Morettin, 1998 A wavelet analysis for time series J. Nonparametric Stat 10:1-46

    Chui C. K., 1992 An introduction to wavelets Academic Press, New York

    Cole S. T., R. Brosch, R. Parkhill, et al. (25 co-authors) 1998 Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence Nature 393:537-544

    Daubechies I., 1992 Ten lectures on wavelets SIAM, Philadelphia

    Dechering K. J., K. Cuelenaere, R. N. Konings, J. A. Leunissen, 1998 Distinct frequency-distributions of homopolymeric DNA tracts in different genomes Nucleic Acids Res 26:4056-4062

    Deckert G., P. V. Warren, T. Gaasterland, et al. (15 co-authors) 1998 The complete genome of the hyperthermophilic bacterium Aquifex aeolicus Nature 392:353-358

    Deschavanne P., M. Radman, 1991 Counterselection of GATC sequences in enterobacteriophages by the components of the methyl-directed mismatch repair system J. Mol. Evol 33:125-132

    Donoho D., I. Johnstone, 1994 Ideal spatial adaptation via wavelet shrinkage Biometrika 81:425-455

    Donoho D., I. Johnstone, G. Kerkyacharian, D. Picard, 1995 Wavelet shrinkage: asymptopia? (with discussion) J. R. Stat. Soc. Ser. B 57:301-369

    Flandrin P., 1988 Time-frequency and time-scale IEEE Fourth Annual ASSP Workshop on Spectrum Estimation and Modeling. Pp. 77–80. Minnesota, Minn

    Fleischmann R. D., M. D. Adams, O. White, et al. (10 co-authors) 1995 Whole-genome random sequencing and assembly of Haemophilus influenzae Rd Science 269:496-512

    Fraser C. M., S. Casjeans, W. M. Huang, et al. (25 co-authors) 1997 Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi Nature 390:580-586

    Fraser C. M., J. D. Gocayne, O. White, et al. (10 co-authors) 1995 The minimal gene complement of Mycoplasma genitalium Science 270:397-403

    Fraser C. M., S. J. Norris, G. M. Weinstock, et al. (25 co-authors) 1998 Complete genome sequence of Treponema pallidum, the syphilis spirochete Science 281:375-388

    Garcia-Vallvé S., A. Romeu, J. Palau, 2000 Horizontal gene transfer in bacterial and archaeal complete genomes Genome Res 10:1719-1725

    Glass J. I., E. J. Lefkowitz, J. S. Glass, C. R. Heiner, E. Y. Chen, G. H. Cassell, 2000 The complete sequence of the mucosal pathogen Ureaplasma urealyticum Nature 407:757-762

    Grishin N. V., Y. I. Wolf, E. V. Koonin, 2000 From complete genomes to measures of substitution rate variability within and between proteins Genome Res 10:991-1000

    Heidelberg J. F., J. A. Eisen, W. C. Nelson, et al. (33 co-authors) 2000 DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae Nature 406:477-483

    Herzel H., 1988 Complexity of symbol sequences Syst. Anal. Model. Simul 5:435-441

    Himmelreich R., H. Hilbert, H. Plagens, E. Pirkl, B. C. Li, R. Herrmann, 1996 Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae Nucleic Acids Res 24:4420-4449

    Huynen M., T. Dandekar, P. Bork, 1998 Measuring genome evolution Proc. Natl. Acad. Sci. USA 95:5849-5856

    Huynen M. A., D. A. Konings, P. Hogeweg, 1992 Equal G and C contents in histone genes indicates selection pressures on mRNA secondary structure J. Mol. Evol 34:280-291

    Kalman S., W. Mitchell, R. Marathe, et al. (10 co-authors) 2000 Comparative genomes of Chlamydia pneumoniae and C. trachomatis Nat. Genet 21:385-389

    Kaneko T., S. Sato, H. Kotani, et al. (24 co-authors) 1996 Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions DNA Res 3:109-136

    Kano A., Y. Andachi, T. Ohama, S. Osawa, 1991 Novel anticodon composition of transfer RNAs in Micrococcus luteus, a bacterium with a high genomic G + C content. Correlation with codon usage J. Mol. Biol 221:387-401

    Karlin S., V. Brendel, 1993 Patchiness and correlations in DNA sequences Science 259:677-680

    Karlin S., C. Burge, 1995 Dinucleotide relative abundance extremes: a genomic signature Trends Genet 11:283-290

    Karlin S., A. M. Campbell, J. Mrazek, 1998 Comparative DNA analysis across diverse genomes Annu. Rev. Genet 32:185-225

    Karlin S., J. Mrazek, 1997 Compositional differences within and between eukaryotic genomes Proc. Natl. Acad. Sci. USA 94:10227-10232

    Kawarabayasi Y., Y. Hino, H. Horikawa, et al. (25 co-authors) 1999 Complete genome sequence of an aerobic hyper-thermophilic crenarchaeon, Aeropyrum pernix K1 DNA Res 6:83-101

    Kawarabayasi Y., M. Sawada, H. Horikawa, et al. (25 co-authors) 1998 Complete sequence and gene organization of the genome of a hyper-thermophilic archaebacterium, Pyrococcus horikoshii OT3 DNA Res 5:55-76

    Klenk H. P., R. A. Clayton, J. F. Tomb, et al. (25 co-authors) 1997 The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus Nature 390:364-370

    Kunst F., N. Ogasawara, I. Moszer, et al. (25 co-authors) 1997 The complete genome sequence of the gram-positive bacterium Bacillus subtilis Nature 390:249-256

    Lewin B., 1997 Genes VI, Chap. 11 Oxford University Press Inc., New York

    Liò P., S. Ruffo, A. Politi, M. Buiatti, 1996 Analysis of genomic patchiness of Haemophilus influenzae and S. cerevisiae chromosomes J. Theor. Biol 183:455-469

    Liò P., S. Ruffo, 1998 Searching for genomic constraints Il Nuovo Cimento D 20:113-127

    Liò P., M. Vannucci, 2000 Finding pathogenicity islands and gene transfer events in genome data Bioinformatics 16:932-940

    Liu G., T. K. McDaniel, S. Falkow, S. Karlin, 1999 Sequence anomalies in the Cag7 gene of the Helicobacter pylori pathogenicity island Proc. Natl. Acad. Sci. USA 96:7011-7016

    Mallat S. G., 1989 A theory for multiresolution signal decomposition: the wavelet representation IEEE Trans. Pattern Machine Intelligence 11:674-693

    Martin W., 1999 Mosaic bacterial chromosomes: a challenge en route to a tree of genomes Bioessays 21:99-104

    Muto A., Y. Andachi, H. Yuzawa, F. Yamao, S. Osawa, 1990 The organization and evolution of transfer RNA genes in Mycoplasma capricolum Nucleic Acids Res 18:5037-5043

    Muto A., S. Osawa, 1987 The guanine and cytosine content of genomic DNA and bacterial evolution Proc. Natl. Acad. Sci. USA 84:166-169

    Nekrutenko A., W. H. Li, 2000 Assessment of compositional heterogeneity within and between eukaryotic genomes Genome Res 10:1986-1995

    Nelson K. E., R. A. Clayton, S. R. Gill, et al. (25 co-authors) 1999 Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima Nature 399:323-329

    Ng W. V., S. P. Kennedy, G. G. Mahairas, et al. (43 co-authors) 2000 Genome sequence of Halobacterium species NRC-1 Proc. Natl. Acad. Sci. USA 97:12176-12181

    Ochman H., J. G. Lawrence, E. A. Groisman, 2000 Lateral gene transfer and the nature of bacterial innovation Nature 405:299-303

    Parkhill J., M. Achtman, K. D. James, et al. (21 co-authors) 2000 Complete DNA sequence of a serogroup A strain of Neisseria meningitidis Z2491 Nature 404:502-506

    Parkhill J., B. W. Wren, K. Mungall, 2000 The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences Nature 403:665-668

    Pedersen A. G., L. J. Jensen, S. Brunak, H. H. Staerfeldt, D. W. Ussery, 2000 A DNA structural atlas for Escherichia coli J. Mol. Biol 299:907-930

    Read T. D., R. C. Brunham, C. Shen, et al. (25 co-authors) 2000 Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39 Nucleic Acids Res 28:1397-1406

    Ruepp A., W. Graml, M. L. Santos-Martinez, et al. (13 co-authors) 2000 The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum Nature 407:508-513

    Schmitt A. O., H. Herzel, 1997 Estimating the entropy of DNA sequences J. Theor. Biol 188:369-377

    Scott D. W., 1992 Multivariate density estimation Theory, practice and visualization. John Wiley and Sons, New York

    Shigenobu S., H. Watanabe, M. Hattori, Y. Sakaki, H. Ishikawa, 2000 Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS Nature 407:81-86

    Shirai M., H. Hirakawa, M. Kimoto, et al. (10 co-authors) 2000 Comparison of whole genome sequences of Chlamydia pneumoniae J138 from Japan and CWL029 from USA Nucleic Acids Res 28:2311-2314

    Shomer B., G. Yagil, 1999 Long W tracts are over-represented in the Escherichia coli and Haemophilus influenzae genomes Nucleic Acids Res 27:4491-4500

    Silverman B. W., 1986 Density estimation for statistics and data analysis Chapman & Hall, London

    Simpson A. J. G., F. C. Reinach, P. Arruda, et al. (115 co-authors) 2000 The genome sequence of the plant pathogen Xylella fastidiosa Nature 406:151-157

    Smith D. R., L. A. Doucette-Stamm, C. Deloughery, et al. (25 co-authors) 1997 Complete genome sequence of Methanobacterium thermoautotrophicum deltaH: functional analysis and comparative genomics J. Bacteriol 179:7135-7155

    Sorensen M. A., C. G. Kurland, S. Pedersen, 1989 Codon usage determines translation rate in Escherichia coli J. Mol. Biol 207:365-377

    Stephens R. S., S. Kalman, C. Lammel, et al. (12 co-authors) 1998 Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis Science 282:754-759

    Stover C. K., X. Q. Pham, A. L. Erwin, et al. (31 co-authors) 2000 Complete genome sequence of Pseudomonas aeruginosa PA01, an opportunistic pathogen Nature 406:959-964

    Sueoka N., 1962 On the genetic basis of variation and heterogeneity of DNA base composition Proc. Natl. Acad. Sci. USA 48:582-588

    Sueoka N., 1992 Directional mutation pressure, selective constraints and genetic equilibria J. Mol. Evol 34:95-114

    Sueoka N., 1995 Intrastrand parity rules of DNA base composition and usage biases of synonymous codons J. Mol. Evol 40:318-325

    Takami H., K. Nakasone, Y. Takaki, et al. (12 co-authors) 2000 Complete genome sequence of the alkaliphilic bacterium Bacillus halodurans and genomic comparison with Bacillus subtilis Nucleic Acids Res 28:4317-4331

    Tekaja F., A. Lazcano, B. Dujon, 1999 The genomic tree as revealed from whole genome comparisons Genome Res 9:550-557

    Tettelin H., N. J. Saunders, J. Heidelberg, et al. (42 co-authors) 2000 Complete genome sequence of Neisseria meningitidis serogroup B strain MC58 Science 287:1809-1815

    Tomb J. F., O. White, A. R. Kerlavage, et al. (25 co-authors) 1997 The complete genome sequence of the gastric pathogen Helicobacter pylori Nature 388:539-547

    Vannucci M., P. Liò, 2001 Wavelet analysis of biological sequences: applications to protein structure and genomics Sankhya Ser. B 63:204-219

    White O., J. A. Eisen, J. F. Heidelberg, et al. (25 co-authors) 1999 Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1 Science 286:1571-1577

    Wolf Y. I., L. Aravind, E. V. Koonin, 1999 Rickettsiae and Chlamydiae: evidence of horizontal gene transfer and gene exchange TiG 15:173-175

Accepted for publication October 3, 2001.




