Department of Zoology, University of Cambridge, United Kingdom
Abstract
Introduction
The density of guanine and cytosine, i.e., the GC content, is an important parameter to investigate and compare the structure and evolution of genomes (Muto and Osawa 1987; Bernardi 1993; Bellgard and Gojobori 1999). Although the average GC content of bacterial genomes varies across species, the GC mutational pressure, i.e., the specificity in replication and repair machinery, and the context-dependent mutation bias tend to homogenize each genome (Sueoka 1962, 1992; Liò et al. 1996; Karlin, Campbell, and Mrazek 1998). The different selection constraints acting on regulatory and coding regions and the lateral transfer of genes with different GC content (Martin 1999; Garcia-Vallvé, Romeu, and Palau 2000; Ochman, Lawrence, and Groisman 2000) oppose this homogenization of genome composition.
We have compared a large number of genome sequences to investigate the relationships between oligonucleotide composition and genome structure heterogeneity in prokaryotes and their ecology, i.e., distinguishing free-living species and nonchronic pathogens from chronic pathogens and symbionts. First, we use a measure of sequence entropy, based on the frequency content of short oligonucleotides (1–7 bp), to compare 36 complete genomes and 27 long genomic sequences from archaea and eubacteria with different GC content. Second, because oligonucleotide frequencies depend on the mosaic structure of genomes, we use a method derived from wavelets (see Chui 1992 and Daubechies 1992, among others) to estimate the length scale of the genome structure heterogeneity in bacteria and archaea with different ecology. We show that the entropy measure and the wavelet scalogram can be used as complementary methods to detect global patterns in genome sequences.
Materials and Methods
Statistical Analysis
Measure of Within-Genome Composition Heterogeneity
Karlin and Brendel (1993), Karlin and Burge (1995), and Karlin and Mrazek (1997), among others (see also Liò et al. 1996), have provided evidence that the DNA sequence of even the simplest microorganism shows a large degree of patchiness at different sequence length scales. In their analyses, the window size is chosen on the basis of the sequence length and the features to be detected; consequently, the localization accuracy of the methods is of the order of the chosen window length. To investigate GC patchiness, we coded DNA sequences as G, C = 1 and A, T = 0. GC content is known to vary along genomes at different sequence lengths, for example, at the level of codons, genes, long repeats, pathogenicity islands, and isochores (Bernardi 1993; Nekrutenko and Li 2000). GC variation is much larger than purine-pyrimidine variation (Liò et al. 1996; Liò and Ruffo 1998); therefore, GC patchiness represents most of the genome sequence patchiness. Arneodo et al. (1995, 1996, 1998) used the continuous wavelet transform to analyze GC patchiness and correlations in DNA sequences of different species. We used the wavelet scalogram based on the discrete wavelet transform (DWT; Mallat 1989).
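The binary coding described above (G, C = 1 and A, T = 0) can be sketched in a few lines of Python. The function names `gc_indicator` and `gc_content` are hypothetical, and skipping ambiguous symbols such as N is an assumption of this sketch, not something stated in the text.

```python
def gc_indicator(seq):
    """Code a DNA sequence as 1 for G or C and 0 for A or T.

    Assumption of this sketch: symbols other than A, C, G, T are skipped.
    """
    coded = []
    for base in seq.upper():
        if base in "GC":
            coded.append(1)
        elif base in "AT":
            coded.append(0)
    return coded


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    coded = gc_indicator(seq)
    return sum(coded) / len(coded) if coded else 0.0


signal = gc_indicator("ATGCGCNNTA")
print(signal)                    # [0, 0, 1, 1, 1, 1, 0, 0]
print(gc_content("ATGCGCNNTA"))  # 0.5
```

The resulting 0/1 vector is the kind of signal on which a wavelet transform can then operate.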
Using Wavelets to Assess Within-Genome Composition Heterogeneity
The name wavelet means small wave (by contrast, the sinusoids used in Fourier analysis are big waves). In short, a wavelet is an oscillation that decays quickly. Wavelets are functions that can be used to efficiently describe a signal by breaking it down into its components at different scales (or frequency bands) and following their evolution in the space domain. Unlike the Fourier basis, wavelets are local in both frequency and space. Wavelets sum to zero and come in different, complex shapes (some, such as the Haar wavelet, are discontinuous), each suitable for a different class of problems. Wavelets are also related to fractals in that the same shapes repeat at different orders of magnitude. They therefore perform better than Fourier analysis when the signals contain discontinuities and sharp spikes (Chui 1992; Daubechies 1992).
Wavelet Series
In wavelet theory, a function is represented by an infinite series expansion in terms of dilated and translated versions of a basic function $\psi$, called the mother wavelet, each multiplied by an appropriate coefficient (see Daubechies 1992 and Chui 1992, among others). The wavelet family $\psi_{j,k}$ is obtained from the mother wavelet by shrinking by a factor $2^j$ and translating by $2^{-j}k$, to obtain $\psi_{j,k}(x) = 2^{j/2}\,\psi(2^j x - k)$, where the subscript $j$ represents the dilation number and the subscript $k$ represents the translation number. The scale factor $2^{j/2}$ is a normalization factor for $\psi_{j,k}$. The wavelet series representation of a function $f$ is therefore

$$f(x) = \sum_{j,k} f_{j,k}\,\psi_{j,k}(x).$$

Coefficients $f_{j,k}$ describe features of $f$ at the spatial location $2^{-j}k$ and at a frequency proportional to $2^j$ (or scale $j$). Unlike the Fourier transform, wavelets provide time-frequency localization, in that the coefficient $f_{j,k}$ gives information about the function near the time point $2^{-j}k$ and near a frequency proportional to $2^j$.
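As a small illustration of the dilation and translation rule, the sketch below builds the family from the Haar mother wavelet, chosen here only because it has a closed form (it is not the basis adopted later in this work). It checks numerically that a member has unit norm and that two members are orthogonal. The function names and the midpoint-rule integration on [0, 1] are assumptions of the sketch.

```python
def haar_mother(x):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0


def haar(j, k, x):
    """Dilated and translated wavelet 2^(j/2) * psi(2^j * x - k)."""
    return 2.0 ** (j / 2.0) * haar_mother(2.0 ** j * x - k)


def inner_product(f, g, lo=0.0, hi=1.0, steps=4096):
    """Midpoint-rule approximation of the integral of f(x) * g(x) on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += f(x) * g(x)
    return total * h


def psi_21(x):
    return haar(2, 1, x)


def psi_10(x):
    return haar(1, 0, x)


norm_squared = inner_product(psi_21, psi_21)  # close to 1
overlap = inner_product(psi_10, psi_21)       # close to 0
```

Here `psi_21` is supported on [1/4, 1/2), which is the interval starting at the translation $2^{-j}k$ with width $2^{-j}$ for j = 2 and k = 1.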
Discrete Wavelet Transform
The DWT decomposes a function into its wavelet coefficients (Mallat 1989). From a computational point of view, the DWT proceeds by recursively applying two convolution functions known as quadrature mirror filters, each producing an output stream that is half the length of the original input, until resolution level zero is reached. If the filters are applied $n$ times (with $2^n \le N$), at each intermediate step (a level in wavelet terminology) $j = 1, \ldots, n$, the transform produces two vectors of coefficients, $S_j$ of scaling coefficients and $D_j$ of wavelet coefficients. The vector $D_j$ is kept, whereas $S_j$ is processed through the two filters. At the last level $n$, both $S_n$ and $D_n$ are kept. Different coefficient vectors contain information about the characteristics of the sequence at different scales or sequence lengths. Coefficients at coarse scales capture gross and global features. Coefficients at fine scales contain the local details of the profile. At level $j$, the wavelet coefficients $D_j$ are associated with changes in the averages of the data on a scale $2^{j-1}$ at a set of location times. Scaling coefficients $S_n$ at the last level $n$ are instead associated with averages of the data on scales $2^n$ and higher. The wavelet transform is, therefore, a cumulative measure of the variations in the data over regions proportional to the wavelet scales, with coefficients at coarser and coarser levels, i.e., for increasing values of $j$, describing features at lower frequency ranges and larger time periods. For example, given a genome sequence of $2 \times 10^6$ bp, GC variations at gene length (1,000–2,000 bp) correspond to scales 10 to 11.
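The correspondence between sequence lengths and scales in the example above is simple base-2 arithmetic, sketched below. The helper names are hypothetical, and the sketch maps a length to the level j whose nominal length 2^j is closest on a logarithmic scale, which is the convention the gene-length example appears to use.

```python
import math


def nearest_scale(length_bp):
    """Level j whose nominal length 2**j is closest, on a log scale, to length_bp."""
    return round(math.log2(length_bp))


def max_levels(sequence_length):
    """Largest n such that 2**n does not exceed sequence_length."""
    return sequence_length.bit_length() - 1


print(nearest_scale(1000), nearest_scale(2000))  # 10 11
print(max_levels(2_000_000))                     # 20
```

For a genome of 2 x 10^6 bp, at most 20 halving steps are available, so scales 10 and 11 sit in the middle of the accessible range.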
For practical purposes, the DWT is often represented in matrix form as $Wy$, with $W$ an orthogonal matrix and $y$ a vector of observations of the signal. An inverse wavelet transform can also be defined. The standard DWT, like the fast Fourier transform, operates on data sets of length $2^N$, with $N$ an integer. When required, data can be padded with zeros. These zeros do not affect the results.
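The recursive filtering scheme can be sketched with the Haar filters, which are used here only because they fit in two lines; this is a simplification and not the Daubechies basis used in this work. The names `pad_to_power_of_two` and `haar_dwt` are hypothetical. At each level the detail vector is kept and the smooth vector is filtered again, and the input is zero-padded to a power of two.

```python
import math


def pad_to_power_of_two(values):
    """Append zeros until the length is a power of two."""
    n = len(values)
    target = 1 << (n - 1).bit_length() if n > 1 else 1
    return [float(v) for v in values] + [0.0] * (target - n)


def haar_dwt(values):
    """Pyramid algorithm with Haar filters.

    Returns (details, smooth): details[j - 1] holds the wavelet
    coefficients D_j (finest level first) and smooth holds the scaling
    coefficients left after the last level.
    """
    s = pad_to_power_of_two(values)
    root2 = math.sqrt(2.0)
    details = []
    while len(s) > 1:
        pairs = [(s[i], s[i + 1]) for i in range(0, len(s), 2)]
        details.append([(a - b) / root2 for a, b in pairs])  # high-pass output, kept
        s = [(a + b) / root2 for a, b in pairs]              # low-pass output, filtered again
    return details, s


details, smooth = haar_dwt([1, 1, 0, 0, 1, 0, 1, 1])
# Each level halves the length: 4, 2, 1 coefficients.
print([len(d) for d in details])
```

Because the transform is orthogonal, the sum of squares of all coefficients equals the sum of squares of the input, which is the property the scalogram relies on.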
Choice of Wavelet Basis
We found that, in general, Daubechies wavelets perform better than Haar wavelets for the type of signals considered in this work (GC plots). Wavelets from the Daubechies families have two important properties: they are compactly supported and have the maximum number of vanishing moments for a given support width (a function $f$ has $N$ vanishing moments if $\int x^q f(x)\,dx = 0$ for $q = 0, 1, \ldots, N - 1$). Compact support is useful for describing local characteristics that change rapidly with time. A large number of vanishing moments leads to high compressibility because the fine-scale wavelet coefficients will be essentially zero where the function is smooth. Although we have previously used Daubechies N = 10 (Liò and Vannucci 2000), the analysis of the large set of genomic sequences (table 1) showed that, for scales at or above gene length, the Daubechies basis of type N = 2 performs as well as N = 10 for G+C pattern analysis. Therefore, in this work and in a recent publication (Vannucci and Liò 2001) we used Daubechies N = 2.
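The vanishing-moment property can be checked numerically on the filter coefficients. The sketch below assumes that N counts vanishing moments, so that Daubechies N = 2 is the four-tap filter, and it uses the discrete analogue of the moment condition (sums over filter taps in place of integrals). The function names are hypothetical.

```python
import math


def daubechies2_filters():
    """Four-tap Daubechies scaling filter h and wavelet filter g."""
    r3 = math.sqrt(3.0)
    norm = 4.0 * math.sqrt(2.0)
    h = [(1 + r3) / norm, (3 + r3) / norm, (3 - r3) / norm, (1 - r3) / norm]
    # Quadrature mirror relation: g_k = (-1)^k * h_{L-1-k}
    g = [(-1) ** k * h[len(h) - 1 - k] for k in range(len(h))]
    return h, g


def discrete_moment(filt, q):
    """Sum over k of k^q times the k-th filter coefficient."""
    return sum((k ** q) * c for k, c in enumerate(filt))


def count_vanishing_moments(g, tol=1e-12):
    """Number of consecutive discrete moments of g (from q = 0) that are zero."""
    q = 0
    while q < len(g) and abs(discrete_moment(g, q)) < tol:
        q += 1
    return q


h, g = daubechies2_filters()
haar_g = [1 / math.sqrt(2.0), -1 / math.sqrt(2.0)]
print(count_vanishing_moments(g), count_vanishing_moments(haar_g))  # 2 1
```

The Haar filter annihilates only constants, whereas the four-tap Daubechies filter also annihilates linear trends, which is one way to read the compressibility argument above.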
Wavelet Scalogram
Genome mosaicity is analyzed using the scalogram, which is the wavelet equivalent of the periodogram used in Fourier analysis. The scalogram is a plot of the sum of the squares of the coefficients at each scale (Flandrin 1988; Ariño and Vidakovic 1995; Chiann and Morettin 1998). The plot indicates at which scale of resolution the energy of the function is concentrated. A relatively smooth function will have most of its energy concentrated at large scales. A function showing high-frequency oscillations will have a large portion of its energy concentrated in high-resolution wavelet coefficients.
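The definition of the scalogram as a per-scale sum of squared coefficients can be sketched as follows. Haar filters are again substituted for the Daubechies basis to keep the example short, the input length is assumed to be a power of two, and the function names are hypothetical.

```python
import math


def haar_levels(values):
    """Wavelet coefficients per level (finest first) from a Haar pyramid."""
    s = [float(v) for v in values]
    if len(s) == 0 or len(s) & (len(s) - 1):
        raise ValueError("length must be a power of two")
    root2 = math.sqrt(2.0)
    levels = []
    while len(s) > 1:
        levels.append([(s[i] - s[i + 1]) / root2 for i in range(0, len(s), 2)])
        s = [(s[i] + s[i + 1]) / root2 for i in range(0, len(s), 2)]
    return levels


def scalogram(values):
    """Energy (sum of squared wavelet coefficients) at each level."""
    return [sum(c * c for c in level) for level in haar_levels(values)]


rapid = [1, 0] * 8         # alternates at every position
blocky = [1] * 8 + [0] * 8  # a single change in the middle

print(scalogram(rapid))   # energy sits at the finest level
print(scalogram(blocky))  # energy sits at the coarsest level
```

The two toy signals have the same mean but opposite scalograms, which mirrors the contrast drawn above between oscillating and smooth functions.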
We found that the largest amount of GC content variation in eubacterial and archaeal genomes occurs at short sequence lengths, mainly at the length of one or a few codons. Therefore, to improve the detection of GC content variations over gene length, we applied a wavelet denoising technique to eliminate rapid variations of GC content (Donoho and Johnstone 1994; Donoho et al. 1995). Finally, each scalogram is generated by subtracting the values of a scalogram obtained from a random DNA sequence of the same length and average GC content.
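The two post-processing steps can be sketched as below. The text does not say which shrinkage rule or threshold was used, so soft thresholding with the universal threshold is an assumption taken from the general wavelet-shrinkage literature, as is the robust noise estimate. The random reference signal is drawn position by position, so its GC content matches the target only on average. All function names are hypothetical.

```python
import math
import random
import statistics


def soft_threshold(coeffs, threshold):
    """Shrink each coefficient toward zero by `threshold`."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]


def universal_threshold(finest_details, n):
    """sigma * sqrt(2 ln n), with sigma estimated from the finest-level
    coefficients as median(|d|) / 0.6745 (assumed robust estimate)."""
    sigma = statistics.median(abs(c) for c in finest_details) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(n))


def random_gc_signal(length, gc_fraction, seed=0):
    """0/1 signal of a random sequence with the given expected GC content."""
    rng = random.Random(seed)
    return [1 if rng.random() < gc_fraction else 0 for _ in range(length)]


def subtract_baseline(scalogram, baseline):
    """Level-by-level difference between observed and random scalograms."""
    return [obs - ref for obs, ref in zip(scalogram, baseline)]


shrunk = soft_threshold([3.0, -0.5, 1.0, -2.0], 1.0)
corrected = subtract_baseline([4.0, 2.0], [1.0, 0.5])
```

In a full pipeline the shrinkage would be applied to the wavelet coefficients of the GC signal before the scalogram is computed, and the baseline scalogram would come from transforming `random_gc_signal` with the same basis.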
Results
The genome sequence of Mycoplasma genitalium shows a larger oligonucleotide first-difference of entropy than the other sequenced mycoplasma genomes, U. urealyticum and Mycoplasma pneumoniae (closed squares in fig. 1F). Among the high-GC-content genomes, Mycobacterium bovis and Mycobacterium tuberculosis have a larger first-difference of entropy than Mycobacterium leprae (closed diamonds in fig. 1F). Although genes from these genomes share 80% amino acid identity on average, M. leprae has a lower GC content than the other two mycobacteria (Bellgard and Gojobori 1999). The genomes of M. bovis and M. tuberculosis have a lower first-difference of entropy than the genomes of proteobacteria with similar GC content, such as Pseudomonas aeruginosa and the Bordetellae sequences.
The comparison of genome sequences that have complementary GC-content percentages, for example, 35% and 65%, shows that low-GC-content genomes have a smaller first-difference of entropy than high-GC-content genomes (fig. 1F). For example, the average first-differences of entropy for 7-bp oligonucleotides in genome sequences with a GC fraction below 0.40, between 0.40 and 0.60, and above 0.60 are (standard deviations in parentheses) -0.042 (0.020), -0.046 (0.019), and -0.071 (0.015), respectively. This finding seems to be particularly relevant for proteobacteria (circles in fig. 1F). A further confirmation is given by the second-difference entropies of 7-bp oligonucleotides, D(7)-D(5), for the same set of genomic sequences of figure 1 (fig. 2A). In figure 2B, we have subtracted entropies from random DNA sequences from the values in figure 2A. The largest values correspond to those of figure 1F: the two H. pylori (first arrow from the left in figs. 2A and B), N. meningitidis and N. gonorrhoeae (second arrow from the left in figs. 2A and B), and R. capsulatus (third arrow from the left in figs. 2A and B).
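The entropies discussed above are built from oligonucleotide frequencies. The exact definition of D(n) and of its first and second differences is not reproduced in this section, so the sketch below shows only a plain Shannon entropy of overlapping n-mers as an assumed starting point; it should not be read as the measure used for figures 1 and 2. The function names are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq, n):
    """Counts of overlapping n-mers made only of A, C, G, and T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def block_entropy(seq, n):
    """Shannon entropy, in bits, of the n-mer frequency distribution."""
    counts = kmer_counts(seq, n)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(block_entropy("ACGTACGT", 1))  # 2.0, four equally frequent bases
```

A sequence that uses few distinct oligonucleotides scores low, which is the sense in which reduced diversity in oligonucleotide frequencies shows up in an entropy-based measure.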
We analyzed in detail the oligonucleotide frequencies in high- and low-GC-content genomes. The contrast in oligonucleotide frequencies for high- and low-GC-content genomes shows that, particularly for n > 4 bp, homopolymers of the type Gn and Cn are generally underrepresented in all genomes in the data set. We made use of density plots to analyze the distribution of distances between An, Tn, Cn, and Gn homopolymers. In figure 3, we show the density plots of the distances between A6 and T6 homopolymers in AT-rich genomes and the distances between C6 and G6 homopolymers in GC-rich genomes in a set of bacterial genomes. Each plot of figure 3 shows the distances between homopolymers in genomes with almost complementary GC content, for example, Campylobacter jejuni (GC content 31%) versus P. aeruginosa (67%, fig. 3A); Borrelia burgdorferi (29%) versus Deinococcus radiodurans (67%, fig. 3B); Methanococcus jannaschii (31%) versus Halobacterium sp. (68%, fig. 3C); and Escherichia coli (51%, fig. 3D). In the absence of any selection or mutation mechanism, we expect the GC-rich genomes to contain a number of Gn and Cn homopolymers roughly equal to the number of An and Tn homopolymers in AT-rich genomes and vice versa. Instead, we found that the number of G6 and C6 homopolymers in GC-rich genomes is much lower than the number of A6 and T6 homopolymers in AT-rich genomes. All the plots show similar large differences between G6/C6 and A6/T6 distances in GC-rich and AT-rich genomes, respectively.
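The distances that feed the density plots can be obtained by locating homopolymer runs and differencing their positions. The text does not specify whether A6 means exactly six or at least six identical bases, nor how distances are measured, so the sketch assumes maximal runs of at least n bases and start-to-start distances. The function names are hypothetical.

```python
import re


def homopolymer_starts(seq, base, n):
    """Start positions of maximal runs of `base` that are at least n long."""
    pattern = re.compile("%s{%d,}" % (re.escape(base.upper()), n))
    return [m.start() for m in pattern.finditer(seq.upper())]


def consecutive_distances(positions):
    """Distances between consecutive run start positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


seq = "AAAAAACGTAAAAAAAGG"
starts = homopolymer_starts(seq, "A", 6)
print(starts)                         # [0, 9]
print(consecutive_distances(starts))  # [9]
```

The list of distances for each genome is what a kernel density estimate would then smooth into a curve like those in figure 3.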
Discussion
The genomes of N. meningitidis and H. pylori show larger values of the first-difference of entropy than other bacteria and archaea with similar GC content. These genomes contain several large pathogenicity islands whose GC content differs from that of the nearby genomic regions (Tomb et al. 1997; Liò and Vannucci 2000; Parkhill et al. 2000; Tettelin et al. 2000). The results from the entropy analysis and the wavelet scalograms allow us to infer that the H. pylori genome does not contain more GC heterogeneity than the other chronic pathogens but contains a few genomic regions with remarkably different short-oligonucleotide frequencies. This is in agreement with our previous findings (see fig. 2 in Liò and Vannucci 2000) and with the findings of Liu and co-workers (Liu et al. 1999), who showed that there are large differences in sequence composition in the CAG pathogenicity island with respect to the average composition of the H. pylori genome. The genome of N. gonorrhoeae has entropy values similar to those of N. meningitidis; it may contain pathogenicity islands with similar base composition and size. R. capsulatus may have pathogenicity regions or alien genes too. Low-GC-content genomes show less diversity in oligonucleotide frequencies than high-GC-content genomes. Some authors have shown that in E. coli and in other bacteria, homopolymers of the type An and Tn are more abundant than Cn and Gn (Dechering et al. 1998; Shomer and Yagil 1999). The results in figure 3 suggest that short distances (<2 kbp) between Cn and Gn are less abundant than short distances between An and Tn in bacterial genomes. Although several DNA mismatch and repair mechanisms are known to change oligonucleotide frequencies (see, for example, Deschavanne and Radman 1991), no mutation mechanism is known to act selectively on long Gn and Cn homopolymers. Because noncoding regions represent a small percentage of eubacterial and archaeal genomes, statistical analyses give little information about differences between homopolymers in coding regions with respect to the overall genome. The fact that bacterial mRNAs are generally polycistronic, with lengths of several kilobases, suggests the importance of a constraint on RNA secondary structure against Gn and Cn sticky patches. This is in agreement with the work of Huynen and co-workers, who found that, in histone genes, the compensation of the G-C ratio indicates a selection pressure at the mRNA level rather than a selection pressure or mutation bias at the DNA level or a selection pressure on codon usage (Huynen, Konings, and Hogeweg 1992). It is also known that the pairing of palindromic Gn and Cn patches serves as a transcription termination mechanism in several bacterial operons (see, for example, Lewin 1997, pp. 318–319).
Ecology and Genome Composition Heterogeneity: Molecular Evolution Implications
A possible explanation as to why the genomes of free-living bacteria and nonchronic pathogens show more patchiness than the genomes of chronic pathogens or symbionts is that free-living bacteria and nonchronic pathogens experience a fluctuating and challenging environment, with selection that is diversified in time and locus. The genomes of these species can easily exchange DNA segments and incorporate cassettes of resistance genes that allow them to face environmental changes. Nonchronic pathogens probably have a high degree of mosaicity because of the pressure to keep the pathogenicity islands shared among bacterial species with different genome-wide base composition.
Differences in the wavelet scalograms of free-living bacteria may also reflect different mechanisms for DNA uptake and recombination. Therefore, genomes such as D. radiodurans (GC content 67%) and S. coelicolor (GC content 72%), which are subjected to strong GC mutational bias, do not undergo a reduction in genome size and tRNA population. Instead, the genomes of chronic pathogens and symbionts that are subjected to GC pressure, such as Mycoplasma capricolum (GC content 25%) and Micrococcus luteus (GC content 75%), undergo a reduction in genome size and tRNA population (Muto et al. 1990; Kano et al. 1991; Andersson and Kurland 1995). The fact that the scalograms of the genomes of intracellular pathogens or symbionts such as Buchnera sp. (fig. 4D and table 3) show a small but not negligible amount of genome composition heterogeneity suggests that genetic recombination events occur. This finding is in agreement with the predictions of Wolf and co-workers for Rickettsiae and Chlamydiae (Wolf, Aravind, and Koonin 1999). The genome of Rickettsia prowazekii contains the largest amount of repeats (24%) among the bacterial genomes sequenced to date (Andersson et al. 1998). It is known that the presence of repeats increases the recombination rate, and this may explain the relatively large values of the GC variations.
Our analyses, based on 36 complete genomes and 27 long genomic sequences, show how oligonucleotide composition is affected by GC mutational pressure and that the mosaic-like structure of genomes depends on the history of gene transfer events and thus on the ecology of the species. The entropy measure and the wavelet scalogram are complementary methods for analyzing genome sequences and detecting the presence of alien genes; i.e., they can be used as preliminary analyses in the investigation of host-pathogen relationships. The alien genes can be located through the selective reconstruction of the GC plot from the scalogram using just a few scales, as shown in Liò and Vannucci (2000). Further improvements of this work will consider using the nondecimated or stationary version of the DWT, a modified transform in which coefficients at each level are not subsampled.
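Selective reconstruction from a few scales can be sketched as a forward transform, zeroing of the unwanted levels, and an inverse transform. The sketch uses Haar filters for brevity and assumes a power-of-two input length; it is an illustration of the idea and not the procedure of Liò and Vannucci (2000). The function names are hypothetical.

```python
import math

ROOT2 = math.sqrt(2.0)


def haar_forward(values):
    """Return (details, smooth) with details[j - 1] holding level j."""
    s = [float(v) for v in values]
    if len(s) == 0 or len(s) & (len(s) - 1):
        raise ValueError("length must be a power of two")
    details = []
    while len(s) > 1:
        details.append([(s[i] - s[i + 1]) / ROOT2 for i in range(0, len(s), 2)])
        s = [(s[i] + s[i + 1]) / ROOT2 for i in range(0, len(s), 2)]
    return details, s


def haar_inverse(details, smooth):
    """Invert haar_forward, going from the coarsest level to the finest."""
    s = list(smooth)
    for d in reversed(details):
        rebuilt = []
        for a, c in zip(s, d):
            rebuilt.append((a + c) / ROOT2)
            rebuilt.append((a - c) / ROOT2)
        s = rebuilt
    return s


def reconstruct_from_levels(values, keep_levels):
    """Rebuild the signal using only the detail levels in keep_levels."""
    details, smooth = haar_forward(values)
    kept = [d if (j + 1) in keep_levels else [0.0] * len(d)
            for j, d in enumerate(details)]
    return haar_inverse(kept, smooth)


signal = [1, 1, 1, 1, 0, 0, 0, 0]
coarse_only = reconstruct_from_levels(signal, {3})   # recovers the step
mean_only = reconstruct_from_levels(signal, set())   # flat line at the mean
```

Keeping only the levels near gene length or above would leave a smoothed GC plot in which regions of atypical composition stand out from the genome average.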
Acknowledgments and Permissions for Using Genome Data
Footnotes
Keywords: genomics, GC content, genome structure, prokaryotes, ecology
Address for correspondence and reprints: Pietro Liò, Department of Zoology, University of Cambridge, Downing Street, Cambridge CB2 3EJ, U.K. p.lio@zoo.cam.ac.uk.
References
Alm R. A., L. S. Ling, D. T. Moir, et al. (23 co-authors). 1999. Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori. Nature 397:176-180.
Andersson S. G., C. G. Kurland. 1995. Genomic evolution drives the evolution of the translation system. Biochem. Cell Biol. 73:775-787.
Andersson S. G., A. Zomorodipour, J. O. Andersson, et al. (10 co-authors). 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396:133-140.
Ariño M., B. Vidakovic. 1995. On wavelet scalograms and their applications in economic time series. Discussion paper 95-21, ISDS, Duke University.
Arneodo A., E. Bacry, P. V. Graves, J. F. Muzy. 1995. Characterizing long-range correlations in DNA sequences from wavelet analysis. Phys. Rev. Lett. 74:3293-3296.
Arneodo A., Y. d'Aubenton Carafa, B. Audit, E. Bacry, J. F. Muzy, C. Thermes. 1998. What can we learn with wavelets about DNA sequences? Physica A 249:439-448.
Arneodo A., Y. d'Aubenton Carafa, E. Bacry, P. V. Graves, J. F. Muzy, C. Thermes. 1996. Wavelet based fractal analysis of DNA sequences. Physica D 1328:1-30.
Bellgard M. I., T. Gojobori. 1999. Inferring the direction of evolutionary changes of genomic base composition. TiG 15:254-256.
Berg O. G., C. G. Kurland. 1997. Growth rate-optimised tRNA abundance and codon usage. J. Mol. Biol. 270:544-550.
Bernardi G. 1993. The vertebrate genome: isochores and evolution. Mol. Biol. Evol. 10:186-204.
Blattner F. R., G. Plunkett, C. A. Bloch, et al. (17 co-authors). 1997. The complete genome sequence of Escherichia coli K-12. Science 277:1453-1474.
Bolshoy A., E. Nevo. 2000. Ecologic genomics of DNA: upstream bending in prokaryotic promoters. Genome Res. 10:1185-1193.
Bult C. J., O. White, G. J. Olsen, et al. (23 co-authors). 1996. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273:1058-1073.
Chiann C., P. A. Morettin. 1998. A wavelet analysis for time series. J. Nonparametric Stat. 10:1-46.
Chui C. K. 1992. An introduction to wavelets. Academic Press, New York.
Cole S. T., R. Brosch, J. Parkhill, et al. (25 co-authors). 1998. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393:537-544.
Daubechies I. 1992. Ten lectures on wavelets. SIAM, Philadelphia.
Dechering K. J., K. Cuelenaere, R. N. Konings, J. A. Leunissen. 1998. Distinct frequency-distributions of homopolymeric DNA tracts in different genomes. Nucleic Acids Res. 26:4056-4062.
Deckert G., P. V. Warren, T. Gaasterland, et al. (15 co-authors). 1998. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392:353-358.
Deschavanne P., M. Radman. 1991. Counterselection of GATC sequences in enterobacteriophages by the components of the methyl-directed mismatch repair system. J. Mol. Evol. 33:125-132.
Donoho D., I. Johnstone. 1994. Ideal spatial adaptation via wavelet shrinkage. Biometrika 81:425-455.
Donoho D., I. Johnstone, G. Kerkyacharian, D. Picard. 1995. Wavelet shrinkage: asymptopia? (with discussion). J. R. Stat. Soc. Ser. B 57:301-369.
Flandrin P. 1988. Time-frequency and time-scale. IEEE Fourth Annual ASSP Workshop on Spectrum Estimation and Modeling. Pp. 77-80. Minnesota, Minn.
Fleischmann R. D., M. D. Adams, O. White, et al. (10 co-authors). 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496-512.
Fraser C. M., S. Casjens, W. M. Huang, et al. (25 co-authors). 1997. Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390:580-586.
Fraser C. M., J. D. Gocayne, O. White, et al. (10 co-authors). 1995. The minimal gene complement of Mycoplasma genitalium. Science 270:397-403.
Fraser C. M., S. J. Norris, G. M. Weinstock, et al. (25 co-authors). 1998. Complete genome sequence of Treponema pallidum, the syphilis spirochete. Science 281:375-388.
Garcia-Vallvé S., A. Romeu, J. Palau. 2000. Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Res. 10:1719-1725.
Glass J. I., E. J. Lefkowitz, J. S. Glass, C. R. Heiner, E. Y. Chen, G. H. Cassell. 2000. The complete sequence of the mucosal pathogen Ureaplasma urealyticum. Nature 407:757-762.
Grishin N. V., Y. I. Wolf, E. V. Koonin. 2000. From complete genomes to measures of substitution rate variability within and between proteins. Genome Res. 10:991-1000.
Heidelberg J. F., J. A. Eisen, W. C. Nelson, et al. (33 co-authors). 2000. DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406:477-483.
Herzel H. 1988. Complexity of symbol sequences. Syst. Anal. Model. Simul. 5:435-441.
Himmelreich R., H. Hilbert, H. Plagens, E. Pirkl, B. C. Li, R. Herrmann. 1996. Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res. 24:4420-4449.
Huynen M., T. Dandekar, P. Bork. 1998. Measuring genome evolution. Proc. Natl. Acad. Sci. USA 95:5849-5856.
Huynen M. A., D. A. Konings, P. Hogeweg. 1992. Equal G and C contents in histone genes indicates selection pressures on mRNA secondary structure. J. Mol. Evol. 34:280-291.
Kalman S., W. Mitchell, R. Marathe, et al. (10 co-authors). 2000. Comparative genomes of Chlamydia pneumoniae and C. trachomatis. Nat. Genet. 21:385-389.
Kaneko T., S. Sato, H. Kotani, et al. (24 co-authors). 1996. Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res. 3:109-136.
Kano A., Y. Andachi, T. Ohama, S. Osawa. 1991. Novel anticodon composition of transfer RNAs in Micrococcus luteus, a bacterium with a high genomic G + C content. Correlation with codon usage. J. Mol. Biol. 221:387-401.
Karlin S., V. Brendel. 1993. Patchiness and correlations in DNA sequences. Science 259:677-680.
Karlin S., C. Burge. 1995. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 11:283-290.
Karlin S., A. M. Campbell, J. Mrazek. 1998. Comparative DNA analysis across diverse genomes. Annu. Rev. Genet. 32:185-225.
Karlin S., J. Mrazek. 1997. Compositional differences within and between eukaryotic genomes. Proc. Natl. Acad. Sci. USA 94:10227-10232.
Kawarabayasi Y., Y. Hino, H. Horikawa, et al. (25 co-authors). 1999. Complete genome sequence of an aerobic hyper-thermophilic crenarchaeon, Aeropyrum pernix K1. DNA Res. 6:83-101.
Kawarabayasi Y., M. Sawada, H. Horikawa, et al. (25 co-authors). 1998. Complete sequence and gene organization of the genome of a hyper-thermophilic archaebacterium, Pyrococcus horikoshii OT3. DNA Res. 5:55-76.
Klenk H. P., R. A. Clayton, J. F. Tomb, et al. (25 co-authors). 1997. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390:364-370.
Kunst F., N. Ogasawara, I. Moszer, et al. (25 co-authors). 1997. The complete genome sequence of the gram-positive bacterium Bacillus subtilis. Nature 390:249-256.
Lewin B. 1997. Genes VI, Chap. 11. Oxford University Press Inc., New York.
Liò P., S. Ruffo, A. Politi, M. Buiatti. 1996. Analysis of genomic patchiness of Haemophilus influenzae and S. cerevisiae chromosomes. J. Theor. Biol. 183:455-469.
Liò P., S. Ruffo. 1998. Searching for genomic constraints. Il Nuovo Cimento D 20:113-127.
Liò P., M. Vannucci. 2000. Finding pathogenicity islands and gene transfer events in genome data. Bioinformatics 16:932-940.
Liu G., T. K. McDaniel, S. Falkow, S. Karlin. 1999. Sequence anomalies in the Cag7 gene of the Helicobacter pylori pathogenicity island. Proc. Natl. Acad. Sci. USA 96:7011-7016.
Mallat S. G. 1989. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattern Anal. Machine Intell. 11:674-693.
Martin W. 1999. Mosaic bacterial chromosomes: a challenge en route to a tree of genomes. Bioessays 21:99-104.
Muto A., Y. Andachi, H. Yuzawa, F. Yamao, S. Osawa. 1990. The organization and evolution of transfer RNA genes in Mycoplasma capricolum. Nucleic Acids Res. 18:5037-5043.
Muto A., S. Osawa. 1987. The guanine and cytosine content of genomic DNA and bacterial evolution. Proc. Natl. Acad. Sci. USA 84:166-169.
Nekrutenko A., W. H. Li. 2000. Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Res. 10:1986-1995.
Nelson K. E., R. A. Clayton, S. R. Gill, et al. (25 co-authors). 1999. Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature 399:323-329.
Ng W. V., S. P. Kennedy, G. G. Mahairas, et al. (43 co-authors). 2000. Genome sequence of Halobacterium species NRC-1. Proc. Natl. Acad. Sci. USA 97:12176-12181.
Ochman H., J. G. Lawrence, E. A. Groisman. 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405:299-303.
Parkhill J., M. Achtman, K. D. James, et al. (21 co-authors). 2000. Complete DNA sequence of a serogroup A strain of Neisseria meningitidis Z2491. Nature 404:502-506.
Parkhill J., B. W. Wren, K. Mungall. 2000. The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature 403:665-668.
Pedersen A. G., L. J. Jensen, S. Brunak, H. H. Staerfeldt, D. W. Ussery. 2000. A DNA structural atlas for Escherichia coli. J. Mol. Biol. 299:907-930.
Read T. D., R. C. Brunham, C. Shen, et al. (25 co-authors). 2000. Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39. Nucleic Acids Res. 28:1397-1406.
Ruepp A., W. Graml, M. L. Santos-Martinez, et al. (13 co-authors). 2000. The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum. Nature 407:508-513.
Schmitt A. O., H. Herzel. 1997. Estimating the entropy of DNA sequences. J. Theor. Biol. 188:369-377.
Scott D. W. 1992. Multivariate density estimation: theory, practice and visualization. John Wiley and Sons, New York.
Shigenobu S., H. Watanabe, M. Hattori, Y. Sakaki, H. Ishikawa. 2000. Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 407:81-86.
Shirai M., H. Hirakawa, M. Kimoto, et al. (10 co-authors). 2000. Comparison of whole genome sequences of Chlamydia pneumoniae J138 from Japan and CWL029 from USA. Nucleic Acids Res. 28:2311-2314.
Shomer B., G. Yagil. 1999. Long W tracts are over-represented in the Escherichia coli and Haemophilus influenzae genomes. Nucleic Acids Res. 27:4491-4500.
Silverman B. W. 1986. Density estimation for statistics and data analysis. Chapman & Hall, London.
Simpson A. J. G., F. C. Reinach, P. Arruda, et al. (115 co-authors). 2000. The genome sequence of the plant pathogen Xylella fastidiosa. Nature 406:151-157.
Smith D. R., L. A. Doucette-Stamm, C. Deloughery, et al. (25 co-authors). 1997. Complete genome sequence of Methanobacterium thermoautotrophicum deltaH: functional analysis and comparative genomics. J. Bacteriol. 179:7135-7155.
Sorensen M. A., C. G. Kurland, S. Pedersen. 1989. Codon usage determines translation rate in Escherichia coli. J. Mol. Biol. 207:365-377.
Stephens R. S., S. Kalman, C. Lammel, et al. (12 co-authors). 1998. Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis. Science 282:754-759.
Stover C. K., X. Q. Pham, A. L. Erwin, et al. (31 co-authors). 2000. Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen. Nature 406:959-964.
Sueoka N. 1962. On the genetic basis of variation and heterogeneity of DNA base composition. Proc. Natl. Acad. Sci. USA 48:582-588.
Sueoka N. 1992. Directional mutation pressure, selective constraints and genetic equilibria. J. Mol. Evol. 34:95-114.
Sueoka N. 1995. Intrastrand parity rules of DNA base composition and usage biases of synonymous codons. J. Mol. Evol. 40:318-325.
Takami H., K. Nakasone, Y. Takaki, et al. (12 co-authors). 2000. Complete genome sequence of the alkaliphilic bacterium Bacillus halodurans and genomic comparison with Bacillus subtilis. Nucleic Acids Res. 28:4317-4331.
Tekaia F., A. Lazcano, B. Dujon. 1999. The genomic tree as revealed from whole genome comparisons. Genome Res. 9:550-557.
Tettelin H., N. J. Saunders, J. Heidelberg, et al. (42 co-authors). 2000. Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science 287:1809-1815.
Tomb J. F., O. White, A. R. Kerlavage, et al. (25 co-authors). 1997. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388:539-547.
Vannucci M., P. Liò. 2001. Wavelet analysis of biological sequences: applications to protein structure and genomics. Sankhya Ser. B 63:204-219.
White O., J. A. Eisen, J. F. Heidelberg, et al. (25 co-authors). 1999. Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1. Science 286:1571-1577.
Wolf Y. I., L. Aravind, E. V. Koonin. 1999. Rickettsiae and Chlamydiae: evidence of horizontal gene transfer and gene exchange. TiG 15:173-175.