Division of Pharmacology/Neurobiology, Biozentrum of the University of Basel, CH-4056 Basel, Switzerland
Address all correspondence and requests for reprints to: Michael Podvinec, Division of Pharmacology/Neurobiology, Biozentrum of the University of Basel, Klingelbergstrasse 70, CH-4056 Basel, Switzerland. E-mail: michael.podvinec{at}unibas.ch.
![]() |
ABSTRACT |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
INTRODUCTION |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The ligands for NRs are extremely diverse, including steroid hormones, vitamin D3, thyroid hormones, fatty acids, eicosanoids, RA, bile acids, sterols, and numerous drugs and other xenochemicals. Of particular recent interest in pharmacogenomics are the xenobiotic- or drug-sensing receptors, constitutive androstane receptor, pregnane X receptor, also termed steroid and xenobiotic receptor or pregnenolone activated receptor (PXR/SXR/PAR), and chicken xenobiotic receptor (CXR) (5, 6).
Cognate REs for NRs are repeats of single DNA hexamers in a distinct arrangement toward each other in terms of relative orientation and spacing. The hexamer half-sites have a canonical consensus sequence of AG(G|T)TCA. Half-site repeats are categorized into direct repeats (DRs) and palindromic inverted repeats (IRs) or everted repeats (ERs). The majority of NRs binds as homo- or heterodimers to these hexamer repeats, each dimer partner interacting with one of the hexamers (1). Variations in the half-site sequence, their relative orientation, and the length of the spacer sequence allow for a wide range of different REs, enabling specific binding of a given receptor dimer to multiple target genes with different affinities, as well as giving rise to integration of different signaling pathways by competition of different receptors at a given binding site (7).
Available laboratory techniques to identify likely REs in target genes are tedious and time consuming and include DNAse I-hypersensitivity assays, subfragmentation, and reporter gene experiments. Bioinformatics-based approaches could considerably facilitate these studies, but present methods have been poor predictors of NR REs because of a low degree of sequence conservation of these elements. Here, we describe a computer algorithm capable of predicting weakly conserved recognition sites for transcription factors that bind as dimers, such as NRs. Compared with present approaches, our algorithm provides more specificity in the recognition of putative NR REs, making the analysis of large stretches of genomic sequences feasible. The algorithm is based on weighted nucleotide distribution matrices and the combined analysis of both half-sites of the RE. We have named this data-mining tool NUBIScan for "nuclear receptor binding site scanner."
To assess the validity of predictions by this algorithm, we have analyzed the sequences of three genes known to be regulated by NRs and compared our findings to experimental evidence. Moreover, we have used NUBIScan as a tool to identify new functional NR REs in the chicken cytochrome P450 (CYP) 2H1 and 3A37 genes. We find that the predictions obtained with the NUBIScan algorithm correlate well with experimental findings and provide an efficient way to select putative regulatory regions in large genomic sequences for experimental analysis.
![]() |
RESULTS AND DISCUSSION |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
For our analysis, we have selected 28'630 bp from a bacterial artificial chromosome clone sequence (GenBank accession no. AC002457), corresponding to the sequence between the end of the upstream genes coding sequence and the beginning of the MDR1 coding sequence.
Within this sequence, we scanned for high-scoring matches to DR4, DR3, and ER6 motifs, using the matrix described in Fig. 1. These motifs are known to be recognized by PXR/RXR and have been found in the MDR1 RE. The functional DR4(I) element (9) had the best DR4 match with a Z score of 10.23 (Fig. 2A
). An ER6 motif, which binds PXR/RXR heterodimers in vitro, was predicted as the second-best ER6. Two further DR4 elements, DR4(II) and DR4(III), which could not be shown to be functional, are found on the 17th rank with identical Z scores of 6.94. Also, a DR3 element found in the sequence but shown to be ineffective was found at third rank with a Z score of 7.59 (Fig. 2A
).
|
|
We analyzed 11.2 kb of the CYP3A7 promoter (GenBank accession no. AF329900) for DR3 and ER6 repeats (Fig. 2B). The functional dNR1 site was found as the best match (Z score 8.67). In contrast, the inactive dNR3 site, located 400 bp downstream from the XREM, attained a Z score of only 3.75. Searching for ER6-type REs, the dNR2 site was the second-best match, followed by the pNR site. Notably, the best-scoring ER6 element is located only 100 bp upstream from the dNR2 site, and it is possible, but untested so far, that this element contributes to the activation mediated by the (extended) XREM.
Because the underlying nucleotide distribution matrix defines what type of REs are recognized by the algorithm, it can be adapted to recognize REs for various receptors. To demonstrate this possibility of extension and adaptation, we have compiled a set of specific matrices for the - or
-subtypes of the PPAR from experimental data by Juge-Aubry et al. (12), weighted according to the ability of the hexamer sequences to bind either receptor isoform in vitro. As a test case for these matrices, we chose the rat glucose transporter 2 gene (GLUT2/SLC2A2). GLUT2 is a low-affinity facilitative glucose transporter, present in pancreatic ß-cells, hepatocytes, as well as intestine and kidney epithelial cells. A role for GLUT2 as glucose sensor in pancreatic ß-cells stimulating insulin release has been proposed: homozygous GLUT2 null mice show hyperglycemia and relative hypoinsulinemia (13). Furthermore, troglitazone, an antidiabetic agent that increases glucose tolerance and insulin sensitivity, has previously been shown to be an activator of PPAR
and to increase GLUT2 protein levels (14, 15). Recently, a functional PPAR
RE was identified in the 5'-part of the GLUT2 gene (16).
We screened 5192 bp of the 5'-part of the GLUT2 gene (GenBank accession number: L28126) for DR1 elements, the cognate REs for PPAR. The functional PPAR RE was the best match with a Z score of 7.19 when using a matrix specific for PPAR
(Fig. 2C
), and with a Z score of 7.13 when using a matrix geared toward PPAR
(not shown). This difference in match quality may be reflected in the activation potential of these two receptors: whereas PPAR
activated the RE efficiently in transactivation assays, PPAR
activated the element to a much lesser extent (16).
These exemplary applications show that prediction of functional NR REs is possible in sequences as long as 30 kb. It must be emphasized that the Z score of a match is a measure of the resemblance to known functional sites but does not prove function, which depends on additional variables.
The Algorithm as a Predictor of Binding Sites
After proving that the NR REs predicted by our algorithm correlate well with functional binding sites in characterized regulatory regions, we applied NUBIScan to previously uncharacterized regulatory regions and tested the predictions.
We searched for REs in the recently characterized chicken CYP3A37 gene, the expression of which is increased by prototypical inducers of the CYP3A gene family in ovo and in LMH chicken hepatoma cells (17). From a chicken genomic library, we isolated a cosmid clone containing all CYP3A37 exons and flanking regions. We have subcloned and sequenced 3.1 kb of upstream sequence.
1 Because CYP3A37 is induced by a range of drugs and steroids that were shown to activate the chicken xenobiotic receptor CXR (6, 17), we surmised that CXR is the receptor mediating induction of this gene. Therefore, we scanned the upstream sequence with NUBIScan for DR4-type CXR REs, using the matrix described in Fig. 1. The best-scoring match (Z score, 8.72) was located at -1082 bp (Table 1
). Concurrent with in silico analysis, the upstream region was subfractionated into smaller pieces, which were cloned into the pBLCAT5 chloramphenicol acetyl transferase (CAT) reporter gene vector. These constructs were tested for inducibility in LMH cells. The site conferring drug inducibility coincided with the predicted DR4-type binding site (Fig. 3A
). Based on these results, we have defined a phenobarbital response unit (PBRU), able to confer drug induction, which is 159 bp in length, encompassing the predicted binding site (Fig. 3B
). Site-directed mutagenesis of one of the predicted half-sites abolished drug induction completely. Mutation also abolished activation by CXR in transactivation assays (Fig. 3C
) and prevented binding of CXR/RXR to the site in EMSAs (Fig. 4
).
|
|
A search for DR4 repeats using the previously used matrix (see Fig. 1) yielded five predicted REs with a Z score higher than 6.5 (Table 1
). The best-scoring result (Z score 8.92) is the DR4 element essential for drug-induced activation of the 264 bp, as shown by site-directed mutagenesis (18). While the second- and fourth-best results concern fragments unresponsive in reporter gene assays, the third-best match colocalizes with the DR4 described as necessary for CXR binding and drug-mediated induction in the 240-bp PBRU (19). The fifth predicted RE lies within the previously described 264-bp PBRU (Fig. 5A
) and may explain the weak residual induction previously observed after mutation of the other DR4 element in this PBRU. To study the relative function of the two REs in more detail, we created mutants of the 264-bp PBRU. In addition to the previously described mutant of the 264-bp PBRU, where the NR1 site is mutated (18), we mutated both half-sites of the NR2 site in wild-type and
NR1 constructs, as depicted in Fig. 5B
. We tested these constructs in reporter gene assays in LMH cells. Inducibility was measured after treating the cells for 24 h with 400 µM phenobarbital (PB), 500 µM glutethimide (GE), or vehicle alone (0.1% dimethylsulfoxide) (Fig. 5B
). Mutation of the upstream DR4 element (
NR1) profoundly reduced drug inducibility conferred by the element (from 74.5-fold to 6.5-fold for GE) but did not abolish it completely. Only additional mutation of the downstream DR4 (
NR1
NR2) completely abolished induction, suggesting functional roles for both DR4 elements in the 264-bp PBRU. Interestingly, mutation of only the downstream RE (
NR2) led to very similar results as observed with mutation of the upstream element (
NR1). Inducibility was greatly reduced (from 74.5-fold to 4.3-fold for GE). This attributes important roles in induction to both REs in the 264-bp PBRU.
|
In EMSAs, we then determined whether CXR/RXR can still bind to the mutant PBRUs (Fig. 6). Radiolabeled wild-type probe could be efficiently displaced by cold wild-type competitor, but not with the
NR1 mutant.
NR2 competitor also displaced the wild-type probe quite efficiently, whereas the combined
NR1
NR2 mutant did not compete with the probe. When wild-type and mutant PBRUs were radiolabeled and used as probe, strong binding of CXR/RXR heterodimer was observed to the wild-type probe and to the
NR2 probe.
NR1 could still weakly bind the receptors, and with the
NR1
NR2 mutant, no binding at all was observed.
|
In conclusion, application of the NUBIScan algorithm to this well characterized regulatory region thus has verified the usefulness of the algorithm in determining likely sites of regulation by NRs. Not unexpectedly, not all predicted sites mapped to distinctly drug-inducible fragments of the CYP2H1 flanking region. Recognition and transactivation by NRs depends on more than the core binding site, such as three-dimensional DNA structure, accessory binding sites, and surrounding sequences, factors not accounted for in the present model. Nevertheless, among the five best matches, the majority of the predictions were positive, and this refined in silico analysis has led to the discovery of a further RE in the 264-bp PBRU, which had eluded previous studies and computer analyses.
Advantages and Limitations of the NUBIScan Algorithm
The present algorithm clearly provides more specificity than an approach based on matching consensus patterns. Already in generating a consensus pattern, much of the information about nucleotide frequency contained in an aligned set of RE sequences is discarded. When searching with a consensus, then, again information is lost, because only a match/no match decision can be made. In contrast, an approach using comparisons to a nucleotide frequency matrix conserves frequency information from the set of REs and furthermore, leads to differentiated quality scores for each match and thus retains more specificity (22). Presently available matrix-based approaches for the prediction of transcription factor-binding sites, such as the MatInspector algorithm (23), are designed to cover a large number of different protein-DNA interactions and do not consider all aspects of the internal structure of a particular binding site. The here described algorithm proved to be considerably more efficient and specific in the detection of functional NR REs.
Generally, detection of NR REs is hampered by the low degree of sequence conservation among REs, leading to low predictive power of the algorithm. As a result, the researcher is often overwhelmed with a wealth of putative REs, the quality scores of which are very similar. Increasing the specificity of predictions alleviates this problem.
Our approach focuses on a prediction of specific combinations of two low-affinity binding sites, such as NR binding sites. As a new concept, we consider each half-site of the RE as a motif and then search for a specific combination of these motifs. The net result of this approach of combining two matches to a matrix is that the mean score of all matches in a sequence drops dramatically, because the score of weak matches is decreased by the combination step, whereas the score of good matches (close to 1) remains nearly the same.
Using this combined scoring approach, a good half-site can compensate to some extent for a weaker half-site, an observation that was also made in experimental analysis of NR binding sites (e.g. in Ref. 18).
It is clear, however, that an in silico approach can never replace biological evidence for the functionality of a predicted RE. Transcriptional activation is a dynamic, multiprotein process, which depends on further factors such as chromatin structure; the sequences surrounding the NR binding site are important for its function. Because the algorithm focuses solely on the hexamer cores, predicted REs thus may not be functional in a native context. On the other hand, NRs have been shown to act also indirectly by cooperation with other proteins without contacting the DNA themselves. This level of regulation will be missed by our algorithm; it can only predict REs that involve direct receptor/DNA interaction.
Nevertheless, active REs were consistently found among the top-scoring predictions, and an analysis with NUBIScan is a time-efficient first approach to highlighting regions of interest for further experiments.
The estimate of false positive or false negative predictions is strongly dependent on the cut-off chosen for the Z score. With a threshold of >6.5 Z scores, we have not yet detected a false negative result, i.e. a site to which a receptor binds that is not detected. With the same Z score, we have seen false positives in the CYP2H1 flanking region, as discussed above.
Compared with three common approaches, visual inspection, pattern searches, and searches using the MatInspector algorithm (23), NUBIScan performs favorably: visual inspection of the sequence is restricted to short sequences. A pattern search approach based on degenerate consensus patterns will result in too many hits without any means to quantitatively distinguish good and poor matches. The MatInspector algorithm uses a database of weighted nucleotide matrices. Unfortunately, within the current release of the database, there are very few matrices for NR REs. Moreover, analysis of large genomic sequences (>1000 bp) with MatInspector for single transcription factors is not recommended by the authors, as the predictive power of the algorithm is decreased by random matches that exceed the set threshold. This is prevented in NUBIScan by the multiplicative half-site combination step.
With this gain of specificity, achieved through the modular approach of combining distinct, but related, sites rather than using a larger, more variable region spanning both half-sites, screening large sets of genomic sequences for particular binding sites becomes feasible, where in our experience other programs fail to produce reliable results due to low specificity of the attained matches.
![]() |
CONCLUSIONS |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Although the algorithm obviously cannot replace experimental verification of the proposed sites of regulation, it selects the set of sites most promising for further investigation. Sequence information of genes and their surrounding sequence is increasingly available for many genomes. A researcher interested in the regulation of a specific gene can apply the algorithm to mine this growing information base. This screening method for target genes of a particular regulatory pathway complements high-throughput approaches, such as gene expression arrays.
![]() |
MATERIALS AND METHODS |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Matrix Calculation
Initially, a nucleotide distribution matrix is constructed from a set of aligned sequences, e.g. from a group of NR half-sites. Subsequently, a position weight is calculated for each position of the matrix representing conservation of bases at that position
![]() | (1) |
Equation 1: definition of matrix position weight according to information content. Wi, weight at position i; b, base; pi(b), frequency of base b at position i (taken from stochastic matrix).
This position weight is arbitrarily scaled between 0 and 100, where 100 signifies a position with full sequence conservation throughout the training set and 0 signifies no conservation. Due to the positional weights of the matrix, more conserved nucleotides are considered more important in the analysis of a query sequence.
Scanning a Query Sequence for Potential NR Binding Sites
To localize specific NR binding sites, NUBIScan searches for two occurrences of the matrix within a specified distance, having a defined relative orientation (DR, ER, or IR). These parameters are defined before execution of the algorithm. The comparison is a two-step process: initially, single scores, i.e. quality scores for all possible matches on both strands of the query sequence to the matrix, are calculated. Match quality scores are expressed as ratio to the best attainable match.
![]() | (2) |
Finally, matrix matches are combined by multiplication to give the score for the desired arrangement and spacing of half-sites. Thus, SFull(j) = Shalf-site(j) * Shalf-site(j +n) for position j in the query sequence and a DR with n spacer nucleotides. For ERs and IRs, scores from the sense and antisense strand are combined analogously. Positions whose score is over a predefined threshold value are included in a list of found matches. This threshold can be set either as a final score or as a Z score (i.e. a number of standard deviations from the mean score of all possible sites in the sequence).
The core algorithm has been programmed in C as a command-line tool for Microsoft Corp. Windows and Unix platforms. A web interface for the algorithm was programmed in Perl.
The algorithm is shown in detail in the supplemental data to this paper published on The Endocrine Societys Journals Online web site, http://mend.endojournals.org/. In addition, this algorithm is the core of a web interface that functions as a central hub, providing background information and user instructions as well as access to the programs. This web interface is available to researchers from nonprofit and academic institutions at the following URL: http://www.nubiscan.unibas.ch. It allows users to create their own matrix files from sets of sequences and to scan query sequences with their own matrices or with provided general-purpose matrices.
Users can define a search strategy for their query sequence, so that a sequence can be scanned for different arrangements of half-sites at the same time. In this way, variations in the admissible half-site spacing for a given receptor can be accommodated. For each single search, a threshold can be set either as absolute score or as Z score. As a starting point, we recommend setting the threshold at six to seven Z scores, and to examine the resulting predictions subsequently in rank order.
Cloning and Mutagenesis of REs
CAT reporter gene plasmids containing subfragments of the CYP3A37 5'-flanking region were constructed by PCR and subsequent restriction digestion (see Fig. 3). The 3096-bp fragment was amplified by PCR using 5'-ATC GGA TCC AGC TGG GTG TAG GGT CCA T-3' and 5'-ATC GGA TCC ACT GGC CTC ATG TCC CGA-3' as sense and antisense primers, creating a BamHI site on both ends to facilitate cloning.
From the 3096-bp fragment, the 1264-bp fragment was excised using EcoNI and NsiI, religation resulting in the 874 bp + 960 bp construct. The 1264-bp fragment was further cut with AatII, resulting in the 896-bp and 368-bp fragment. Finally, digestion with BspMI yielded the 212-bp and 159-bp fragments. The DR4 site of the 159-bp PBRU was mutated into a NotI restriction site by PCR using standard overlap techniques as described previously (18). Thus, the DR4 was altered from the wild-type 5'-TGAACTGCGATGCACT-3' sequence to 5'-gcggccGCGATGCACT-3 (altered bases are in lowercase letters, the hexamer cores are represented in boldface). CAT reporter gene plasmids containing the wild-type 264-bp CYP2H1 PBRU or the 264-bp CYP2H1 PBRU with the NR1 half-sites mutated were described (18). Briefly, the NR1 site was changed from 5'-GAACTTCCTTGCCCT-3' to 5'-ccgcggTCCTgatatc-3', introducing SacII and EcoRV restriction sites. Mutants of the second NR site were done with PCR overlap techniques. By changing the wild-type 5'-GGGTCCTGGGAGTTCA-3' sequence to 5'-tctagaTGGGctcgag-3', the half-sites were mutated into an XbaI and a XhoI restriction site.
Cell Culture and Transient Transfection, EMSA
LMH chicken hepatoma cells were cultivated as previously described (6). One day before transfection, the culture medium was replaced with Williams E medium containing glutamine and penicillin/streptomycin, supplemented with 10% delipidated, charcoal-stripped FCS (Sigma, St. Louis, MO). This change was observed to greatly increase the extent of drug-mediated induction, presumably by the absence of inhibitory substances from the serum. Cells were transfected with 1 µg of reporter vector and 100 ng of pRSV-ß-Galactosidase (kindly provided by Dr. A. Kralli, Biozentrum, University of Basel, Switzerland) vector per well, using FuGene6 transfection reagent (Roche Molecular Biochemicals, Rotkreuz, Switzerland) according to the manufacturers instructions. Four hours after transfection, an appropriate dilution of the inducer compounds was added to the wells, and cells were incubated for an additional 24 h before preparation of cell extracts to assay for reporter gene and ß-galactosidase expression. Transactivation assays in CV-1 monkey kidney cells, reporter gene assays, and EMSAs were performed as described before (6). Expression plasmids for CXR and chicken RXR have been described (6). The monoclonal antimouse RXR rabbit antibody used for supershifts was kindly provided by Dr. P. Chambon (Institut de Génétique et de Biologie Moléculaire et Cellulaire, Université Louis Pasteur, Illkirch, France).
![]() |
ACKNOWLEDGMENTS |
---|
![]() |
FOOTNOTES |
---|
This work was supported by the Swiss National Science Foundation.
Abbreviations: CAT, Chloramphenicol acetyltransferase; CXR, chicken xenobiotic receptor; CYP, cytochrome P450; DR, direct repeat; ER, everted repeat; GE, glutethimide; GLUT2, glucose transporter 2 gene; IR, inverted repeat; MDR1, multidrug resistance gene 1; NR, nuclear receptor; PB, phenobarbital; PBRU, phenobarbital response unit; PXR, pregnane X receptor; RE, response element; XREM, xenobiotic-responsive enhancer module.
1 These sequence data have been submitted to the DDBJ/EMBL/GenBank databases under accession no. AF486653.
Received for publication December 26, 2001. Accepted for publication February 13, 2002.
![]() |
REFERENCES |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|