Center for Biological Sequence Analysis, Building 2081, and Section for Molecular Microbiology, Building 3012, BioCentrum-DTU, Technical University of Denmark, DK-2800 Lyngby, Denmark
Author for correspondence: Steen Knudsen. Tel: +45 45252480. Fax: +45 45931585. e-mail: steen{at}cbs.dtu.dk
![]() |
ABSTRACT |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: sigma factor, HMM, response regulator aspartate phosphatase, extended -10 region
Abbreviations: FP, false positive; HMM, hidden Markov model; RNAP, RNA polymerase; TP, true positive
![]() |
INTRODUCTION |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The factor in the RNAP complex recognizes and binds a specific conserved DNA pattern upstream of the transcription start site, thereby allowing the RNAP to associate with the DNA strand, first loosely in a closed promoterpolymerase complex, and then tightly, melting a local region of the promoter to form an open promoterpolymerase complex, resulting in initiation of transcription. When the transcription is initiated, the
factor is released from the complex.
Every factor facilitates binding of the RNAP complex by recognition of a specific binding site, usually located 10 and 35 bp upstream of the transcription start site.
A in B. subtilis participates in the initiation of transcription of most of the housekeeping genes. The consensus sequence recognized by
A, 5'-TTGACA-17 nt-TATAAT-3', is identical to the consensus that
70 of Escherichia coli recognizes.
A-dependent promoters from B. subtilis are easily transcribed by the
70 of E. coli, but poorly the other way around (Camacho & Salas, 1999
), which suggests that the RNAP of B. subtilis has a stricter requirement for binding than the RNAP of E. coli (Voskuil & Chambliss, 1998
; Camacho & Salas, 1999
). This corresponds with the fact that earlier studies have shown that many Gram-positives including B. subtilis utilize an extended -10 region in a large number of their
A-dependent promoters. This region is located 1 bp upstream of the -10 region and is hence referred to as the -16 region. The consensus of this region is 5'-TRTG-3', where R=G/A (Helmann, 1995
; Voskuil & Chambliss, 1998
; Camacho & Salas, 1999
), and it is therefore larger than the corresponding 5'-TG-3' motif found in E. coli (Ponnambalam et. al., 1986
; Keilty & Rosenberg, 1987
). This extension is estimated to exist in less than 10% of the promoters in E. coli (Chan et al., 1990
) and approximately 45% in B. subtilis. Especially in promoters containing this extended signal a series of A- and T-rich regions upstream of the -35 region has been observed. And both
A-dependent promoter types have an overrepresentation of A residues downstream of the -10 region (Voskuil & Chambliss, 1998
). By extracting this information it is possible to create a model for prediction of new sites.
The complete genome sequence (4·2 Mb) of B. subtilis was published in November 1997 (Kunst et al., 1997 ), and at present 4228 genes are annotated (SubtiList, 1999
). Approximately one-third of these genes have experimentally identified functions. The function of the second third can be predicted by homology to other known gene products. Among the last third of the genes there is most likely an unknown number of misclassified ORFs, and therefore the exact number of genes in B. subtilis remains unknown. It is therefore also not possible to estimate how many promoters exist in the genome of B. subtilis. If the annotated genes are correct and if only regions upstream of a gene and downstream of a terminator, or regions between genes arranged head-to-head, are defined as promoter regions, a conservative estimation will be that B. subtilis has 1800 promoter regions. The fraction of these that is dependent on
A is not known.
There are currently no publicly accessible tools for the prediction of A-binding sites in B. subtilis. Nobody has attempted to estimate the number of such sites. We have used hidden Markov models (HMMs) and trained them to recognize
A-binding sites in B. subtilis from existing experimentally generated data. The goal of this work is to create a tool to predict the number of true signals within the genome. This work has the further aim of recovering possible hidden information in the surrounding sequence of the two types of
A-binding sites as they are known today. This will clarify the differences between the sites, and make it easier to distinguish between them.
![]() |
METHODS |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
HMMs are generally well suited for searching for motifs like A-binding sites since they facilitate an easy and intuitive incorporation of prior knowledge about signals associated with the motif in question. Another advantage, compared to techniques such as neural networks, is the ease of relating trained model parameters to sequence information; for instance, it is possible to use the trained emission probabilities to directly read off any consensus signals found and to get a good idea of the information present in these signals.
Prior knowledge may be included in the HMM architecture by addition or deletion of states, by biasing their nucleotide emission probabilities and/or biasing the probabilities of transitions between them.
When a model architecture has been set up, the optimal parameters are estimated by the BaumWelch algorithm, which maximizes the likelihood of the training sequences given the model i.e. it finds the HMM parameters which best capture the statistics of the training sequences.
The trained model is then used to analyse sequences not included in the training set. To get an idea of the models ability to generalize, one may split the initial training set into 10 parts and then repeatedly train on nine parts and test on the remaining part, until all parts have been tested once. This is a common technique known as a 10-fold cross-validation. It provides a way of estimating the extent of expected false positives and false negatives for a given threshold, when using the model to decode new sequences.
Decoding is the term applied to the process of evaluating how well a sequence or sub-sequence fits a given HMM model. There are several ways to perform decoding, and we have used posterior decoding, where one calculates, for the ith nucleotide xi in the query sequence x of length L, the total probability that the state i emitting it is state k, P(
i=k|x). Note that in general there are many paths through the model that could have emitted nucleotide xi while in state k (i.e. for which
i=k), so one must add the probabilities of all these parses to get the total probability. Formally, we have:
![]() | (1) |
The numerator may be written
![]() | (2) |
![]() |
since all observations after xi depend only on i. The first and second term in this product may be calculated recursively by the forward and backward algorithm respectively (Durbin et al., 1998
). The remaining unknown on the right-hand side of equation (1)
, P(x), may also be obtained from the forward/backward algorithms.
HMM prediction.
For predicting sites in the genome we calculate, for each nucleotide, the posterior probability that it was emitted by the first state of the -10 region in a A-binding motif. Once we have the posterior probabilities of the -10 start-states at all nucleotide positions, we simply regard all probabilities above a certain threshold (determined by the cross-validation procedure described above) as statistically significant. Hence, whenever the posterior probability of the desired motif exceeds the threshold, the model is said to have found a motif at that nucleotide. The better the motif and its contextual signals fit the model, the higher the probability score, and the more confidence will be placed in the prediction.
Fig. 1 is a schematic view of the HMM used to predict
A promoters in B. subtilis. The model incorporates known information about conserved positions in
A binding (Helmann, 1995
; Voskuil & Chambliss, 1998
), and was trained to pick up additional unknown signals.
|
The model shown in Fig. 1 is used exclusively for decoding (testing). For training purposes a loop model should be avoided, since in the absence of fully labelled training sequences it may end up using several motifs in its maximum-likelihood estimation even though only one is actually present; this will then distort the statistics of the motif states and hence impair decoding performance. Thus, during training a second background (identical to BACKGROUND) is included on the right-hand side of Fig. 1
in such a way that all three signal states must pass on to this second background and from here to the end state E.
Promoter regions as well as other intergenic regions are known to be comparatively A and T rich. Hence, in order to prevent prediction of A signals merely on the basis of A and T richness, a state has been added to the model (A-T background in Fig. 1
). The signal state is included purely for technical reasons in order to switch from the trained leftright model to the looping prediction model shown in Fig. 1
.
Note the presence of two alternative A-binding site models in Fig. 1
. This is motivated by the finding of two different submodels of binding sites one with an extended -10 region (extended) and one without (normal) (Helmann, 1995
; Voskuil & Chambliss, 1998
). As dictated by the data in the training set each model allows a separation of the -10 and -35 region of 1621 bp and 410 bp between the -10 region and the start site of transcription (Helmann, 1995
).
Fig. 2 shows a more detailed view of the extended and normal submodels. The consensus sequences of the -10 and -35 regions are clearly marked, as are the A-T-rich states. Dotted lines indicate the presence of more states than could be comfortably shown. The presence of the 9 and 5 extra explicitly modelled states in the extended model reflects the expectation that there is more information in the binding sites. The rationale behind the self-looped states is to model length distributions between e.g. the signal state (Fig. 1
) and the start of the -35 region. There are two more looped states in the normal model in order to compensate for the 9 explicit states in the extended model. If the length modelling differed markedly between the extended and normal model, one would risk situations where a submodel was preferred merely on the basis of length. The emission probabilities of the looped states are identical to the background state.
|
Bacterial strains, plasmids and growth conditions.
The bacterial strains and plasmids used in this study are listed in Table 1. Strains HØJ1, HØJ4, HØJ5 and HØJ8 have a single-copy raplacZ transcriptional fusion inserted by a double-crossover recombination event at the amyE locus, with and without a base substitution in the extended -10 region. Cells were grown at 37 °C as described previously (Saxild et al., 1995
). Spizizen salt-buffered minimal medium supplemented with 100 µg L-tryptophan ml-1 was used in the enzyme assay and LuriaBertani (LB) broth was used as rich medium. The relevant antibiotics were used at the following concentrations: neomycin, 5 µg ml-1; ampicillin, 50 µg ml-1.
|
Construction of clones.
The promoter regions from rapA and rapB were obtained by a PCR on chromosomal DNA from the wild-type B. subtilis strain 168. In addition to the wild-type promoter region, site-directed mutations were incorporated in the extended -10 region by using PCR primers with mismatches. The amplified promoter fragments with a 5' EcoRI linker and a 3' BamHI linker were cloned in a transcriptional fusion with the reporter gene lacZ using the vector pDG268neo. The plasmid was amplified in E. coli MC1061, linearized with KpnI and transformed into B. subtilis HH263. The transcriptional fusion was integrated into the amyE gene by a double-crossover event (Saxild et al., 1996 ). All strains containing a transcriptional fusion were confirmed by colony PCR with relevant primers, sequencing of the cloned promoter region and verification of the AmyE- phenotype, by screening for inability to produce clearing zones on LB plates containing 1% starch. Primers complementary to regions on each side of the cloned fusion were used in PCRs to confirm that no double insertion had occurred.
Primer extension.
RNA was isolated from HØJ1 and HØJ5 as described by Saxild et al. (1995) . The single-stranded DNA primer (annealing just downstream of the BamHI cloning site in pDG268) was radiolabelled at the 5' terminus using T4 polynucleotide kinase and [
-33P]ATP. The primer extension was performed by using the displayTHERMO-RT Reverse Transcriptase kit from Display Systems Biotech. The radiolabelled cDNA probes were separated on a 6% polyacrylamide sequencing gel next to a sequencing of pB1/pA1 with the same primer, and visualized by autoradiography.
ß-Galactosidase activity assay.
Growing cells were harvested by pouring 2530 ml culture into a 50 ml centrifuge tube 1/3 full of ice, centrifuging at 7000 g for 5 min, washing with 10 ml of a 0·9% NaCl solution, centrifuging at 7000 g for 5 min, washing with 2 ml of a 0·9% NaCl solution, centrifuging at 15000 g for 2 min, discarding the liquid phase, gently adding 0·5 ml 30 mM phosphate buffer (pH 7·5), 1 mM EDTA and 1 mM DTT (sonication buffer) without resolving the pellet, and stored at -20 °C. The total amount of protein was determined by the Lowry method. The ß-galactosidase activity assay was performed using the method of Miller (1972) .
![]() |
RESULTS |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
The predicted signals always contain the whole A-binding motif, and the expected transcription start site, but are reported based on the score from the -10, rather than the -35, sequence. As observed in E. coli, the -10 region of B. subtilis tends to occur unaccompanied by a -35 signal (or accompanied by an extremely poor one) in
A-binding promoters (Camacho & Salas, 1999
), though usually dependent on an activator (Lewis et al., 2000
). The converse is relatively rare, and presumably such single -35 sites are not sufficient to bind
A and should therefore not be counted as hits.
In order to estimate the number of FPs made on the entire genome we generated 100000 random sequences of length 100 with BACKGROUND statistics and counted the number of sequences scoring higher than the chosen cutoff. We did this three times and got 185, 199 and 225 sequences respectively. We then simply assume that the mean of these numbers, 203, is the expected number of FPs made on 100000 candidates. In addition to the first-order Markov statistics used in BACKGROUND, we also tried generating the random sequences from kth-order Markov chains for k=0, 2 and 3. The number of sequences found on average in these cases was 791, 265 and 270 respectively i.e. they all performed worse than the chosen k=1.
Note that it is conceivable that some of the high-scoring random sequences would in fact bind A in an experimental setup and hence are not really FPs. Nevertheless, this is our best estimate. In a genome of length 4·2 Mb the model is therefore expected to find roughly (4·2x106/100)(203/100000)=85 FPs on both the positive and the negative strand, making a total of 170.
Using the HMM we predict that the entire genome of B. subtilis contains 2538 A-binding sites. When examining the list containing the reported results (1927 high-confidence predictions) we were able to locate 1127 of these within the 400 bp upstream regions of the 4228 predicted genes in B. subtilis (SubtiList, 1999
). Both these lists are available from the authors upon request.
The model further predicts that approximately 50% of the predicted sites are of the extended type, which is a little more than previous findings on smaller samples (45%) (Helmann, 1995 ).
The sequence logos in Fig. 4 show the profiles of the HMM predictions. From this it is clear that more information exists in both types of
A-binding sites than previously thought. It is especially clear that our model has found that the transcription start site in B. subtilis promoters dependent on
A is highly conserved. The consensus sequence of this signal is 5'-YRTA-3' (+1 in bold) in the normal type, and 5'-YRNA-3', where Y=C/T, R=A/G and N=nt, in the extended type. The most frequent observed +1 signal in
A-binding promoters is, according to these results, 5'-TATA-3'. The signal in the +1 position is strongest in the extended type, and here an A is much more frequent than a G. This may be to compensate for the fact that the +2 position in this type of binding site is less conserved.
|
The model identified A-binding sites in the expected promoter regions of eight of the response-regulator aspartate phosphatase encoding genes (the rap genes). Our model predicts that rapB, -D, -E, -I and -K are transcribed from a
A-dependent promoter using the extended
A-binding site, and that rapF, -G and -H have the normal
A-binding site.
In Fig. 5 the expected -10 regions of the rap genes are aligned. There appears to be a highly conserved consensus containing a TG motif immediately upstream of a -10 motif in all the rap genes, which at least implies that this group of genes are being transcribed from a promoter containing an extended -10 region.
|
Experimental verification of predicted extended sites
We tested these predictions by site-directed mutagenesis of the extended region within the predicted A-binding site. The site-directed mutagenesis (see Table 2
) indeed showed a decrease in transcription for both rapA and rapB throughout a sporulation experiment (rapA, -B and -E are known to play a role in the phosphorelay signal-transduction system of sporulation: Mueller et al., 1992
; Jiang et al., 2000
), confirming that this region is necessary for transcription. When the G in the TG motif in the promoter region of rapB is substituted with a C, the amount of transcript drops on average 10-fold in the sporulation experiment. Likewise, when the corresponding G upstream of rapA is substituted with an A, the level of transcription drops approximately fourfold.
|
![]() |
DISCUSSION |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
When using the chosen cutoff, we are unable to locate approximately 30% of the true A-binding sites. These false negatives are binding sites that in a variety of ways differ from the average
A-binding sites. One example of true
A-binding sites that this prediction tool has difficulties in finding are the Spo0A-activated promoters. These promoters are known to have one or several 0A boxes at or near the -35 region, where Spo0A
P binds and activates transcription. This binding abolishes the negative effect of not only the poorly conserved -35 regions, but also the exceptionally large separation between the -10 and the -35 region (more than 21 bp), which promoters of this type are known to have (Lewis et al., 2000
). These sites do not fit the model due to the fact that the model only allows a spacing of 1621 bp. Despite this drawback, we chose to accept this restriction because it gave rise to the model with the best overall performance.
The large spacing and poorly conserved -35 regions, which are often observed in activator-dependent promoters, could explain why the model does not find any true A-binding sites in either rapA, rapC, rapE, or the putative second site in rapF, though there apparently exists a strong signal for an extended -10 region in this group of genes. rapA, rapC and rapE are known to be activated by the binding of ComA
P to a ComA box upstream of the -35 region (Mueller et al., 1992
; Lazazzera et al., 1999
; Jiang et al., 2000
). When the expected promoter regions of the rap genes are aligned, it appears that rapF has a ComA box consensus site at the same position as rapA, rapC and rapE, which strongly suggests that expression of rapF is also dependent on ComA
P (alignment not shown).
In Fig. 4 it is observed that the HMM has found a highly conserved signal at the +1 position. It appears that the site of initiation of transcription in
A-dependent promoters is separated from the -10 Pribnow box by on average 7 bp and has the consensus 5'-pyrimidine-purine-T-A/C-3' (most frequent: 5'-TATA-3'), and starting transcription at the purine. This corresponds with findings in E. coli, where the initiation site in
70-dependent promoters is -purine-pyrimidine- (Rosenberg & Court, 1979
; Pedersen & Engelbrecht, 1995
).
From this work, it is suggested that the A-binding sites classified as extended are significantly different from normal
A-binding sites in two areas of the promoter sequence. These differences are the -16-motif and the four bases approximately 40 nt upstream of the initiation site, which seem to be rich in A and T. The extended type has likewise been found to be less conserved at position 5 of the -35 region (the C in TTGACA) and at the +2 position in the +1 motif.
In conclusion, we have constructed an HMM that has identified A-binding sites in B. subtilis with known sensitivity and specificity. We have estimated the total number of
A-binding sites to be around 2538, and found the ratio between extended and normal -10 regions of
A-binding sites to be around 1:1. To support these findings we have experimentally verified that two of the predicted promoters indeed depend on an extended type of
A-binding site.
The trained HMM is available from the authors upon request. The list of predictions from the trained HMM is available as supplementary data with the online version of this paper at http://mic.sgmjournals.org.
![]() |
ACKNOWLEDGEMENTS |
---|
![]() |
REFERENCES |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Chan, B., Spassky, A. & Busby, S. (1990). The organization of open complexes between Escherichia coli RNA polymerase either with or without consensus -35 sequences. Biochem J 270, 141-148.[Medline]
Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998). Markov chains and hidden Markov models. In Biological Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids, pp. 4680. Cambridge, MA: Cambridge University Press.
Helmann, J. D. (1995). Compilation and analysis of Bacillus subtilis A-dependent promoter sequences: evidence for extended contact between RNA polymerase and upstream promoter DNA. Nucleic Acids Res 23, 2351-2360.[Abstract]
Huang, X. & Helmann, J. D. (1998). Identification of target promoters for the Bacillus subtilis sigma(X) factor using a consensus-directed search. J Mol Biol 279, 165-173.[Medline]
Huang, X., Fredrick, K. L. & Helmann, J. D. (1998). Promoter recognition by Bacillus subtilis sigma W: autoregulation and partial overlap with the sigma X regulon. J Bacteriol 180, 3765-3770.
Jiang, M., Grau, R. & Perego, M. (2000). Differential processing of propeptide inhibitors of Rap phosphatases in Bacillus subtilis. J Bacteriol 182, 303-310.
Keilty, S. & Rosenberg, M. (1987). Constitutive function of a prositively regulated promoter reveals new sequences essential for activity. J Biol Chem 262, 6389-6395.
Kunst, F., Ogasawara, N., Moszer, I. & 148 other authors (1997). The complete genome sequence of the gram-positive bacterium Bacillus subtilis. Nature 390, 249256.[Medline]
Lazazzera, B. A., Kurtser, I. G., McQuade, R. S. & Grossman, A. D. (1999). An autoregulatory circuit affecting peptide signaling in Bacillus subtilis. J Bacteriol 181, 5193-5200.
Lewis, R. J., Brannigan, J. A., Offen, W. A., Smith, I. & Wilkinson, A. J. (1998). An evolutionary link between sporulation and prophage induction in the structure of a repressor:anti-repressor complex. J Mol Biol 283, 907-912.[Medline]
Lewis, R. J., Krzywda, S., Brannigan, J. A., Turkenburg, J. P., Muchová, K., Dodson, E. J., Barák, I. & Wilkinson, A. J. (2000). The trans-activation domain of the sporulation response regulator Spo0A revealed by X-ray crystallography. Mol Microbiol 38, 198-212.[Medline]
Miller, J. H. (1972). Assay of ß-galactosidase. In Experiments in Molecular Genetics, pp. 352355. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory.
Mueller, J. P., Bukusoglu, G. & Sonenshein, A. L. (1992). Transcriptional regulation of Bacillus subtilis glucose starvation-inducible genes: control of gsiA by the ComPComA signal transduction system. J Bacteriol 174, 4361-4373.[Abstract]
Pedersen, A. G. & Engelbrecht, J. (1995). Investigations of Escherichia coli promoter sequences with artificial neural networks: new signals discovered upstream of the transcriptional startpoint. Proc Int Conf Intell Syst Mol Biol 3, 292-299.[Medline]
Ponnambalam, S., Webster, C., Bingham, A. & Busby, S. (1986). Transcription initiation at the Escherichia coli galactose operon promoters in the absence of the normal -35 region sequences. J Biol Chem 261, 16043-16048.
Rosenberg, M. & Court, D. (1979). Regulation sequences involved in the promotion and termination of RNA transcription. Annu Rev Genet 13, 319-373.[Medline]
Sanger, F., Nicklen, S. & Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA 74, 5463-5467.[Abstract]
Saxild, H. H., Jacobsen, J. H. & Nygaard, P. (1995). Functional analysis of the Bacillus subtilis purT gene encoding formate-dependent glycinamide ribonucleotide transformylase. Microbiology 141, 2211-2218.[Abstract]
Saxild, H. H., Andersen, L. N. & Hammer, K. (1996). dranupCpdp operon of Bacillus subtilis: nucleotide sequence, induction by deoxyribonucleosides, and transcriptional regulation by the deoR-encoded DeoR repressor protein. J Bacteriol 178, 424-434.[Abstract]
Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst Tech J 27, 379423, 623656.
SubtiList (1999). Release R15.1. Current URL http://bioweb.pasteur.fr/GenoList/SubtiList.
Voskuil, M. I. & Chambliss, G. H. (1998). The -16 region of Bacillus subtilis and other gram-positive bacterial promoters. Nucleic Acids Res 26, 3584-3590.
Zeng, X. & Saxild, H. H. (1999). Identification and characterization of a DeoR-specific operator sequence essential for induction of dranupCpdp operon expression in Bacillus subtilis. J Bacteriol 181, 1719-1727.
Zhang, Y. & Begley, T. P. (1991). Cloning, sequencing and regulation of thiA, a thiamin biosynthesis gene from Bacillus subtilis. Gene 198, 73-82.
Received 31 January 2001;
revised 22 May 2001;
accepted 6 June 2001.