Department of Biology, Duke University
Abstract |
---|
Key Words: comparative genomics, natural selection, motif bias, promoters
Introduction |
---|
For three main reasons, Eubacteria and Archaea are ideal systems in which to evaluate the hypothesis that selection acts against spurious binding sites. First, many small, fully sequenced prokaryotic genomes are available for analysis. Second, these genomes are uncondensed and thus open to direct binding by transcriptional machinery (Langer et al. 1995), which presents the opportunity for selection to act against inappropriate binding of transcriptional machinery independent of chromatin condensation. Third, the sites that control transcription are well defined. In Eubacteria, transcription is controlled in large part by the RNA polymerase holoenzyme, whose contact with DNA is mediated by interactions between the σ70 factor and the -35 and -10 sequences (consensus sequences 5'-TTGACA-3' and 5'-TATAAT-3', respectively [Baumann, Qureshi, and Jackson 1995]). In Archaea, transcription resembles that in eukaryotes: an A-box motif (5'-TTTA[T/A]A-3'), centered at -27 bp from the start of transcription, is bound by TATA-binding protein independent of RNA polymerase II (Baumann, Qureshi, and Jackson 1995; Langer et al. 1995). These binding site motifs, or slight variants on them, are necessary for transcription in both groups of organisms.
Here we test the hypothesis that selection acts to remove spurious transcription factor binding sites throughout 52 genomes of Eubacteria and Archaea. We use the consensus binding site from Eubacteria and the two main variants from Archaea as our focal sequences and test for underrepresentation of these sequences in both coding and noncoding regions of the genome. A model for the loss and gain of binding sites is also introduced; this model allows us to estimate the average strength of selection against spurious binding sites across genomes.
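The counting step behind this test can be illustrated with a minimal sketch. It is not the authors' program: the function names are hypothetical, overlapping matches are counted, and a motif that equals its own reverse complement (such as TTTAAA) is assumed to be counted once per position rather than once per strand.

```python
def reverse_complement(motif: str) -> str:
    """Return the reverse complement of an A/C/G/T motif."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(motif))


def count_motif(seq: str, motif: str) -> int:
    """Count occurrences of motif on the given strand, allowing overlaps."""
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


def count_both_strands(seq: str, motif: str) -> int:
    """Count a motif on both strands by also scanning for its reverse complement.

    Assumption: a palindromic motif occupies the same position on both
    strands, so it is counted only once.
    """
    rc = reverse_complement(motif)
    total = count_motif(seq, motif)
    if rc != motif:
        total += count_motif(seq, rc)
    return total
```

For example, `count_both_strands(genome, "TTGACA")` would give the observed count of the -35 consensus for a genome held in the string `genome`; an expected count is still needed before any statement about underrepresentation can be made.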
Materials and Methods |
---|
Chaos Game Representation of Genomes
Counts obtained from above were graphed according to a chaos game representation (CGR) algorithm (Deschavanne et al. 1999). The program made by the authors to generate these figures, CGRmotif, is available at http://www.duke.edu/jes12/cgr. Whereas most methods used to find binding sites attempt to identify sites that are overrepresented in upstream regions (Wagner 1998; Berman et al. 2002), our program can be used to identify previously unknown transcription factor binding sites by looking for deficiencies of sites in a genome (Stajich, Hahn, and Wray, unpublished data).
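As an illustration of the general idea, and not of CGRmotif itself, the sketch below places k-mer counts on a 2^k by 2^k grid in which each base selects a quadrant. The corner assignment in `CORNERS` and the function names are assumptions; published CGR layouts differ in which base goes in which corner.

```python
# Assumed corner layout as (column bit, row bit); conventions vary.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_cell(kmer: str) -> tuple:
    """Map a k-mer to (row, col) on a 2^k x 2^k grid.

    The last base picks the largest quadrant and each earlier base picks a
    finer one, matching the iterated "move halfway to the corner" rule.
    """
    row = col = 0
    for base in reversed(kmer):
        x, y = CORNERS[base]
        col = (col << 1) | x
        row = (row << 1) | y
    return row, col


def cgr_grid(seq: str, k: int) -> list:
    """Count every overlapping k-mer of seq into its grid cell."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
            row, col = kmer_cell(kmer)
            grid[row][col] += 1
    return grid
```

With k = 6, every hexamer, including the focal motifs, has its own cell, so an underrepresented motif would appear as an unusually sparse cell relative to its neighbors.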
Results and Discussion |
---|
As an alternative way to correct for counting transcriptionally relevant binding sites in noncoding regions, we scanned promoter regions 5' of coding DNA, excluding a single binding site motif, if any were found, that was in the correct orientation (see Materials and Methods for details). This method is needed for two reasons. First, not every gene has its own promoter sequence: many genes are cotranscribed in polycistronic operons in both groups of organisms. Second, not every gene uses the strongly binding consensus sites. Genes that are not transcribed at high levels may use nonconsensus, weaker binding sites as a regulating mechanism (Kobayashi, Nagata, and Ishihama 1990; Xu, McCabe, and Koudelka 2001). Taking into account the active binding sites, we find that spurious binding sites are underrepresented in noncoding regions in almost every archaeal and eubacterial genome (table 2, "Noncoding-x"). This indicates that there is consistently stronger selection against spurious binding sites in noncoding regions. A concern is that when examining any motif, regardless of identity or function, our attempt to correct for selectively maintained motifs may lead to some amount of underrepresentation. However, we are still confident in our results because both across whole genomes and in coding regions, where controlling for true binding sites does not depend on correctly identifying such sites, we still see a significant underrepresentation of spurious transcription factor binding sites.
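The correction described above can be sketched roughly as follows. The details of the actual procedure are in Materials and Methods; here the function names are hypothetical, the upstream region is assumed to be supplied 5' to 3' on the gene's coding strand, and "correct orientation" is assumed to mean a match on that strand.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(seq: str, motif: str) -> list:
    """Start positions of all matches of motif in seq, allowing overlaps."""
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == motif]


def spurious_sites_upstream(upstream: str, motif: str) -> int:
    """Count motif copies in one upstream region, excluding one putative true site.

    Assumption: if at least one copy lies in the sense orientation, exactly
    one such copy is treated as the functional site and left out of the count.
    """
    sense = find_all(upstream, motif)
    rc = revcomp(motif)
    antisense = find_all(upstream, rc) if rc != motif else []
    total = len(sense) + len(antisense)
    return total - 1 if sense else total
```

A region with a single sense-strand TATAAT therefore contributes zero spurious sites, whereas a region with only an antisense copy contributes one.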
Our results suggest two different selective mechanisms acting against spurious binding sites. Selection most likely acts throughout the genome to reduce random binding, thus enhancing transcriptional efficiency. In promoter regions, selection may also act to remove spurious sites to avoid steric hindrance of transcription factors or to avoid gene silencing via ectopic transcription and RNA interference. The avoidance of steric hindrance in promoter regions may explain the conservation of sequences with no known binding affinities between binding sites in promoters: any motif that possibly binds a transcription factor is deleterious, thus further constraining sequence space.
Strength of Selection
Because little is known about the strength of the selective forces that constrain the genome as a whole, we also estimated the strength of selection against spurious binding sites. Selection acting on any single gene for translational efficiency is revealed by the nonrandom use of synonymous codons (codon bias) to match tRNA species abundance (Ikemura 1985; Moriyama and Powell 1997). The strength of selection for codon usage, expressed as the product Nes, has been estimated in both bacteria (Hartl, Moriyama, and Sawyer 1994) and Drosophila (Akashi 1995), where Ne represents the effective population size of the species and s represents the selection coefficient for each variant. To measure the average strength of selection against spurious binding sites, we modeled the number of spurious binding sites in a genome as a balance between mutation and selection. Equation 1 describes the equilibrium number of spurious binding sites expected in a genome in the neutral case, without selection against binding sites, and equation 2 describes the corresponding equilibrium when selection acts against them.
Solving for n0 and n in equations 1 and 2 and dividing gives the ratio n/n0 (equation 3). Here we assume that β, the rate of mutation away from a binding site motif, is much larger than the rate of mutation to that motif (that mutation away from any particular sequence is much more likely than mutation to that sequence) and that x, the number of true binding sites, is negligible compared with the number of possible binding sites in a genome, N (where N is the size of the genome in base pairs - size of binding motif in base pairs + 1).
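One way to write down a mutation-selection balance consistent with this description is sketched below. It is a reconstruction under stated assumptions and may differ from equations 1 to 3 in form. The symbols α (the rate of mutation to the motif) and ψ (the probability of fixation of a mutation that creates a spurious site, relative to a neutral mutation) are introduced here for illustration. Whether x enters the neutral balance is likewise an assumption.

```latex
\begin{align*}
\text{neutral:}\quad & n_0\,\beta = (N - n_0)\,\alpha
  \;\Longrightarrow\; n_0 = \frac{\alpha N}{\alpha + \beta} \\
\text{with selection:}\quad & n\,\beta = (N - n - x)\,\alpha\,\psi
  \;\Longrightarrow\; n = \frac{\alpha\,\psi\,(N - x)}{\alpha\,\psi + \beta} \\
\text{ratio:}\quad & \frac{n}{n_0}
  = \psi \cdot \frac{N - x}{N} \cdot \frac{\alpha + \beta}{\alpha\,\psi + \beta}
  \;\approx\; \psi \qquad (\beta \gg \alpha,\; x \ll N)
\end{align*}
```

Under this sketch both stated assumptions are needed: x much smaller than N removes the factor (N - x)/N, and β much larger than α removes the last factor, leaving the ratio of observed to expected sites approximately equal to ψ.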
Intuitively, this result makes sense: the deficiency of binding sites in the genome (the ratio of the observed number of binding sites to the expected number) is due to the lower probability of fixation of mutants that are selected against (the selection parameter). Equation 3 allows us to numerically solve for the average efficacy of selection (as measured by Nes) on spurious binding sites. These values are plotted for the Eubacteria and Archaea in figure 2.
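The numerical step can be illustrated with a short sketch. It assumes, as a standard diffusion-theory form rather than the article's own equation, that the ratio of observed to expected sites equals ψ(S) = S / (e^S - 1), the fixation probability of a deleterious mutation relative to a neutral one, where S is a scaled selection coefficient (for example S = 2Nes in a haploid population; the factor depends on the model convention). The function names are hypothetical.

```python
import math


def relative_fixation(S: float) -> float:
    """psi(S) = S / (exp(S) - 1): relative fixation probability of a
    mutation selected against with scaled coefficient S >= 0."""
    if abs(S) < 1e-12:
        return 1.0  # neutral limit
    return S / math.expm1(S)


def solve_scaled_selection(ratio: float, hi: float = 50.0,
                           tol: float = 1e-10) -> float:
    """Find S >= 0 such that psi(S) equals the observed/expected ratio.

    psi decreases monotonically from 1 at S = 0, so bisection suffices.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must lie in (0, 1]")
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if relative_fixation(mid) > ratio:
            lo = mid  # selection still too weak to explain the deficit
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Calling `solve_scaled_selection(0.75)` returns the S that would produce a 25% deficit under this assumed form; stronger deficits map to larger S. The values obtained this way are illustrative and are not a reproduction of figure 2.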
The effects of selection within each genome are also quite large. The average numbers of binding sites in a genome, relative to the expected, are 91% for TTGACA and 89% for TATAAT in Eubacteria and 99% for TTTAAA and 98% for TTTATA in Archaea (for uncorrected genomes); in coding regions alone the numbers are 87% for TTGACA, 81% for TATAAT, 91% for TTTAAA, and 90% for TTTATA. If we correct for true binding sites across the genome, the average numbers of spurious binding sites, relative to the expected, are 85% for TTGACA, 75% for TATAAT, 89% for TTTAAA, and 90% for TTTATA. Once again, because we assume that the consensus binding sites are the same across each taxonomic group, these numbers are conservative estimates of the effects of selection.
The method used to estimate the expected number of motifs in a genome (Karlin, Burge, and Campbell 1992) corrects for motif bias in all subsequences of our focal motifs. This means that simple codon bias, in or out of frame, does not affect our estimates. Unfortunately, this method does not take into account any effects of di-codon bias: the nonrandom distribution of neighboring codon pairs (Gutman and Hatfield 1989). However, we have good reason to think that this effect is minimal or nonexistent. Codon bias across these diverse sets of organisms, which have many hundreds of millions of years separating them even within Eubacteria or Archaea, is extremely varied (Nakamura, Gojobori, and Ikemura 2000). The different genomes differ in the synonymous codons that are used, in the GC content of coding regions, and in the amino acids that are used (Singer and Hickey 2000). In addition, it has been shown that di-codon bias differs among the species of Eubacteria and Archaea (Badger and Olsen 1999; McVean and Hurst 2000) and so should not explain the patterns we see across these groups. Finally, this bias only has effects on one DNA strand, whereas our results use both strands. It should be noted, however, that none of the reasons stated above argues against the contention that di-codon bias may be caused by selection against spurious binding sites in any single genome.
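To make the idea of an expected count concrete, the sketch below uses a simpler stand-in, not the Karlin, Burge, and Campbell (1992) estimator: the maximal-order Markov estimate, in which the expected count of a word is the product of the counts of its two longest proper subwords divided by the count of their shared interior. Unlike the method cited above, it corrects only for these subwords, scans a single strand, and uses hypothetical function names.

```python
def count(seq: str, word: str) -> int:
    """Number of occurrences of word in seq, allowing overlaps."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def markov_expected(seq: str, word: str) -> float:
    """Expected count of word under a Markov model of order len(word) - 2:
    N(w[1..k-1]) * N(w[2..k]) / N(w[2..k-1])."""
    if len(word) < 3:
        raise ValueError("word must have at least 3 bases")
    inner = count(seq, word[1:-1])
    if inner == 0:
        return 0.0
    return count(seq, word[:-1]) * count(seq, word[1:]) / inner


def observed_over_expected(seq: str, word: str) -> float:
    """Ratio of observed to expected counts; NaN when nothing is expected."""
    expected = markov_expected(seq, word)
    return count(seq, word) / expected if expected else float("nan")
```

A ratio below 1 from `observed_over_expected(genome, "TATAAT")` would indicate underrepresentation relative to this simplified expectation; di-codon effects, as noted above, are not modeled.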
The next step in the study of selection against binding site motifs will be to examine eukaryotic genomes, where this form of selection may introduce a low level of background selection (Charlesworth, Morgan, and Charlesworth 1993) throughout the genome. It will be interesting to learn whether the motif bias observed here is as evenly distributed in eukaryotes, where heterochromatin, gene-rich regions, and different rates of recombination introduce a greater degree of spatial heterogeneity across a genome. In regions of heterochromatin, where DNA may not be open to spurious binding by transcription factors, selection against spurious binding site motifs would be unnecessary. In euchromatin, areas that are gene rich, and hence transcriptionally active much of the time, may show the strongest effects of this selection. On top of both of these conditions, rates of recombination along a chromosome show an effect on the efficacy of selection (Kliman and Hey 1993; Comeron and Kreitman 2002) and may add to the spatial heterogeneity in underrepresentation. Finally, even though the transcriptional machinery of the Archaea shares many similarities with that of eukaryotes (Baumann, Qureshi, and Jackson 1995; Langer et al. 1995), the complexity of multicellular regulatory regions is unnecessary in prokaryotic genomes. In eukaryotes, there are often multiply-represented transcription factor binding sites in any one promoter region, as opposed to the single site necessary to initiate transcription. In these cases calculating underrepresentation only in coding regions may be preferred to avoid the inclusion of multiple binding sites maintained by selection.
The pattern of binding site motif underrepresentation presented here clearly supports the action of selection in constraining sequences throughout the genome, regardless of function. In fact, selection is almost certainly constraining sequences without biologically relevant function, as well as coding and regulatory sequences, to a specific region of sequence space. Although we have demonstrated selection only on the consensus sequences in the focal binding sites, this method may be used to estimate the strength of selection against variants of the consensus and other transcription factor binding sites. The use of population genetics models and theory along with the tools of comparative genomics has allowed new insight into the effects of natural selection at the level of the whole genome (e.g., Comeron and Kreitman 2002; Lynch 2002). Here we have extended this approach by connecting the effects of purifying selection on genomic sequences with stabilizing selection at the level of transcriptional output.
Acknowledgements |
---|
Footnotes |
---|
Literature Cited |
---|
Akashi, H. 1995. Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA. Genetics 139:1067-1076.
Badger, J., and G. Olsen. 1999. CRITICA: coding region identification tool invoking comparative analysis. Mol. Biol. Evol. 16:512-524.
Baumann, P., S. A. Qureshi, and S. P. Jackson. 1995. Transcription: new insights from studies on Archaea. Trends Genet. 11:279-283.
Berman, B. P., Y. Nibu, B. D. Pfeiffer, P. Tomancak, S. E. Celniker, M. Levine, G. M. Rubin, and M. B. Eisen. 2002. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl. Acad. Sci. USA 99:757-762.
Burge, C., A. M. Campbell, and S. Karlin. 1992. Over- and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. USA 89:1358-1362.
Charlesworth, B., M. T. Morgan, and D. Charlesworth. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289-1303.
Comeron, J. M., and M. Kreitman. 2002. Population, evolutionary and genomic consequences of interference selection. Genetics 161:389-410.
Davidson, E. H. 2001. Genomic regulatory systems: development and evolution. Academic Press, San Diego.
Deschavanne, P. J., A. Giron, J. Vilain, G. Fagot, and B. Fertil. 1999. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol. Biol. Evol. 16:1391-1399.
Fairall, L., and J. W. R. Schwabe. 2001. DNA binding by transcription factors. Pp. 65-84 in J. Locker, ed. Transcription factors. Academic Press, San Diego.
Gutman, G. A., and G. W. Hatfield. 1989. Nonrandom utilization of codon pairs in Escherichia coli. Proc. Natl. Acad. Sci. USA 86:3699-3703.
Hartl, D. L., E. N. Moriyama, and S. A. Sawyer. 1994. Selection intensity for codon bias. Genetics 138:227-234.
Hess, C. M., J. Gasper, H. E. Hoekstra, C. E. Hill, and S. V. Edwards. 2000. MHC class II pseudogene and genomic signature of a 32-kb cosmid in the house finch (Carpodacus mexicanus). Genome Res. 10:613-623.
Ikemura, T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2:13-34.
Karlin, S., and C. Burge. 1995. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 11:283-290.
Karlin, S., C. Burge, and A. M. Campbell. 1992. Statistical analyses of counts and distributions of restriction sites in DNA sequences. Nucleic Acids Res. 20:1363-1370.
Karlin, S., A. M. Campbell, and J. Mrazek. 1998. Comparative DNA analysis across diverse genomes. Annu. Rev. Genet. 32:185-225.
Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge.
Kliman, R., and J. Hey. 1993. Reduced natural selection associated with low recombination in Drosophila melanogaster. Mol. Biol. Evol. 10:1239-1258.
Kobayashi, M., K. Nagata, and A. Ishihama. 1990. Promoter selectivity of Escherichia coli RNA polymerase: effect of base substitutions in the promoter -35 region on promoter strength. Nucleic Acids Res. 18:7367-7372.
Langer, D., J. Hain, P. Thuriaux, and W. Zillig. 1995. Transcription in Archaea: similarity to that in Eucarya. Proc. Natl. Acad. Sci. USA 92:5768-5772.
Li, Q. M., and S. A. Johnston. 2001. Are all DNA binding and transcription regulation by an activator physiologically relevant? Mol. Cell. Biol. 21:2467-2474.
Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Locker, J. 2001. Transcription factors. Academic Press, San Diego, Calif.
Lynch, M. 2002. Intron evolution as a population-genetic process. Proc. Natl. Acad. Sci. USA. 99:6118-6123.
McVean, G. A. T., and G. D. D. Hurst. 2000. Evolutionary lability of context-dependent codon bias in bacteria. J. Mol. Evol. 50:264-275.
Moriyama, E. N., and J. R. Powell. 1997. Codon usage bias and tRNA abundance in Drosophila. J. Mol. Evol. 45:514-523.
Nakamura, Y., T. Gojobori, and T. Ikemura. 2000. Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res. 28:292.
Ohta, T. 1992. The nearly neutral theory of molecular evolution. Ann. Rev. Ecol. Syst. 23:263-286.
Rice, P., I. Longden, and A. Bleasby. 2000. EMBOSS: The European molecular biology open software suite. Trends Genet. 16:276-277.
Rocha, E. P. C., A. Danchin, and A. Viari. 2001. Evolutionary role of restriction modification systems as revealed by comparative genome analysis. Genome Res. 11:946-958.
Singer, G. A. C., and D. A. Hickey. 2000. Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol. 17:1581-1588.
Stajich, J. E., D. Block, and K. Boulez, et al. (21 co-authors). 2002. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 12:1611-1618.
Stone, J. R., and G. A. Wray. 2001. Rapid evolution of cis-regulatory sequences via local point mutations. Mol. Biol. Evol. 18:1764-1770.
Wagner, A. 1998. Distribution of transcription factor binding sites in the yeast genome suggests abundance of coordinately regulated genes. Genomics 50:293-295.
Xu, J., B. C. McCabe, and G. B. Koudelka. 2001. Function-based selection and characterization of base-pair polymorphisms in a promoter of Escherichia coli RNA polymerase sigma(70). J. Bacteriol. 183:2866-2873.