(Received for publication, July 25, 1994; and in revised form, November 30, 1994)
The influence of simple repeat sequences, cloned into different
positions relative to the SV40 early promoter/enhancer, on the
transient expression of the chloramphenicol acetyltransferase (CAT)
gene was investigated. Insertion of (G)(C)
in either orientation into the 5`-untranslated region of the CAT
gene reduced expression in CV-1 cells 50-100-fold when compared
with controls with random sequence inserts. Analysis of CAT-specific
mRNA levels demonstrated that the effect was due to a reduction of CAT
mRNA production rather than to post-transcriptional events. In
contrast, insertion of the same insert in either orientation upstream
of the promoter-enhancer or downstream of the gene stimulated gene
expression 2-3-fold. These effects could be reversed by
cotransfection of a competitor plasmid carrying (G)(C) sequences.
The results suggest that a G·C-binding transcription factor modulates
gene expression
in this system and that promoter strength can be regulated by providing
protein-binding sites in trans. Although constructs containing longer
tracts of alternating (C-G), (T-G), or (A-T) sequences inhibited CAT
expression when inserted in the 5`-untranslated region of the CAT gene,
the amount of CAT mRNA was unaffected. Hence, these inhibitions must be
due to post-transcriptional events, presumably at the level of
translation. These effects of microsatellite sequences on gene
expression are discussed with respect to recent data on related simple
repeat sequences which cause several human genetic diseases.
Eukaryotic genomes contain a relatively high percentage of
direct repeat sequences known as microsatellites. Simple di-, tri-, or
tetranucleotide repeats have been reported to be unstable and vary in
length of repeat unit from one to several thousand base pairs.
Recently, abnormalities of such repeats have been implicated in the
genesis of a number of human genetic conditions. For example,
instability of dinucleotide (CA) repeats has been observed in some
human cancers (1). More recently, several neurological diseases have
been shown to result from an increased number of trinucleotide repeat
units near or within certain genes (2, 3, 4, 5).
Expansion of trinucleotide (CGG) repeats is responsible for the fragile
X syndrome (6), while several other disorders (myotonic
dystrophy, spinal and bulbar muscular atrophy, spinocerebellar ataxia
type 1, and Huntington's disease) have been linked to expanded
triplet blocks of (CTG) repeats (4, 7, 8, 9, 10).
The massive expansion of these triplet repeats is a novel type of
mutational event (11). The molecular etiology of this
non-Mendelian behavior is unknown at present.
Simple repeat sequences adopt several types of non-B-DNA structures
under appropriate environmental conditions (negative supercoil
density, ionic strength, etc.). Repeating pur-pyr sequences such as
(G-C) or (A-C) form left-handed Z-DNA, and inverted repeats adopt
cruciform structures (11, 12). Under the influence of negative
supercoiling and acidic conditions, (G-A)(T-C) repeat sequences fold
into intramolecular triple-stranded structures (H-DNA) based on C·G·C
triads, as long as there is mirror symmetry of the pur·pyr
stretch (11, 13). In the presence of magnesium, (G)(C) sequences in
plasmids also form C·G·G-based triple strands even at neutral
pH (13, 14). The tandem arrays of 12-bp ( ) direct repeats (DR2) in
the segment inversion site of herpes simplex virus-1 are highly
G+C-rich and form an unorthodox conformation under physiological
conditions (15). Additionally, runs of three or more Gs can
self-associate to form tetra-stranded structures, which have been
implicated as possible synaptic intermediates during meiosis (16) and
are essential parts of the structure adopted by the single-stranded
telomeric repeats at the ends of eukaryotic chromosomes (17, 18, 19).
The trinucleotide (CTG) repeat found in the 3`-untranslated region of
the myotonin kinase gene (20) adopts a still different non-B DNA
structure ( )( ) and displays preferential nucleosome assembly in
vitro (22).
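The sequence features named in this paragraph (alternating pur-pyr runs, mirror-symmetric pur·pyr stretches, and runs of three or more Gs) can be recognized from a single strand written 5` to 3`. The following is a minimal sketch of such checks, not a method used in this study; the function names and the example sequences are hypothetical, and passing a check only marks a sequence as a candidate for the corresponding structure, since formation also depends on supercoiling, pH, and ionic conditions.

```python
import re

PURINES = set("AG")  # A and G; every other base is treated as a pyrimidine


def is_alternating_pur_pyr(seq):
    """True if purines and pyrimidines strictly alternate (Z-DNA candidate)."""
    seq = seq.upper()
    return len(seq) > 1 and all(
        (a in PURINES) != (b in PURINES) for a, b in zip(seq, seq[1:])
    )


def is_mirror_pur_pyr(seq):
    """True if the strand is all-purine or all-pyrimidine and reads the same
    in both directions (mirror symmetry, H-DNA candidate)."""
    seq = seq.upper()
    all_pur = all(b in PURINES for b in seq)
    all_pyr = all(b not in PURINES for b in seq)
    return (all_pur or all_pyr) and seq == seq[::-1]


def g_runs(seq, min_len=3):
    """Return (start, length) for every run of at least min_len consecutive Gs."""
    pattern = r"G{%d,}" % min_len
    return [(m.start(), len(m.group())) for m in re.finditer(pattern, seq.upper())]


if __name__ == "__main__":
    # Illustrative sequences only; repeat counts are not taken from the paper.
    print(is_alternating_pur_pyr("GC" * 8))   # alternating (G-C)
    print(is_mirror_pur_pyr("GA" * 8 + "G"))  # mirror-symmetric (G-A) stretch
    print(g_runs("ATGGGGTTGGCGGG"))           # positions of G runs
```

Working on one strand is enough here because the complementary strand of a homopurine stretch is homopyrimidine, and strict pur-pyr alternation is preserved on the complement.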
Recent studies demonstrate that non-B DNA structures form and have biological consequences in vivo. Left-handed Z-DNA exists in plasmids (23) and the chromosome (24) and is stabilized by domains of negative supercoiling (25, 26) in living Escherichia coli cells. In addition, triplexes can exist in vivo and regulate gene expression (27, 28).
Simple repeat sequences of pur·pyr composition occur with high
frequency in the promoters and particularly around the transcriptional
signal of many eukaryotic genomes (13, 14). These repeat sequences are
found in the 5`-untranslated region of a number of human
genes (3, 29, 30) and are the binding sites for several
proteins (29, 31). The trinucleotide (CCG) repeat, which becomes
unstable in fragile X syndrome, is located in the 5`-untranslated
region of the FMR1 gene and is characterized as a binding site for a
specific nuclear protein (32).
In spite of the realization that simple repeat
sequences exhibit high structural variability in vitro,
relatively little is known about their properties in biological
systems. Hence, we investigated the effects of the simple repeating
sequences on transcription. Herein, a systematic study was conducted to
understand the biological consequences of the presence of repeating
sequences in the vicinity of a transcription unit on transient gene
expression in monkey cells. Our data show that (C)(G) tracts act
as modulators in cis on transcription driven by the SV40 early
promoter/enhancer. Transcription is inhibited, and this inhibition can
be titrated in trans with a competitor sequence. Therefore,
(G)(C)-binding proteins may act as transcriptional regulators.
Constructs containing other repeating sequences, (C-G), (T-G), and
(A-T), inhibited gene expression only when inserted in the
5`-untranslated region of the CAT gene in a manner that was
length-dependent. RNA analyses revealed that transcription was not
inhibited by these inserts and that full-length mRNA was produced. This
indicates that the inhibition is due to post-transcriptional events.
In parallel studies, plasmids containing different lengths of other
repeating DNA sequences were constructed (Table 1). Fragments
containing (CG)E(CG), (CG)E(CG), and (CG) were excised from pRW1558,
pRW1557, and pRW1567 (34), respectively, by XhoII cleavage, except for
(CG), where XhoII + EcoRI were used. (E designates the AATT in the
EcoRI sites.) These fragments were blunt-ended and inserted into the
filled-in HindIII site of pSV2cat (site 1), except for pRW1958, which
had its insert in the SmaI site of the polylinker in pRW1906. The
plasmids thus obtained were designated pRW1958, pRW1957, and pRW1925,
respectively. The 48- and 32-bp alternating (pur·pyr) sequences were
inserted into site 2 of pSV2cat, resulting in pRW1922 and pRW1923,
respectively. The insertion of the same 48- and 32-bp fragments into
site 3 generated pRW1927 and pRW1928, respectively. Recombinant
plasmids containing other repeating sequences also were constructed.
pRW1912 has a 98-bp BamHI fragment containing (TG)E(CA) derived from
pRW1151 (35). Insertion of this fragment into sites 2 and 3 resulted
in pRW1921 and pRW1926, respectively. pRW1924 contains the 140-bp
insert from the 3` side of the mouse immunoglobulin gene containing a
62-bp tract of (T-G); this fragment was excised from pRW777 (36) using
EcoRI + HindIII. pRW1929 has a 120-bp XbaI-SspI fragment containing
the (AT)N(AT) sequence. This fragment was excised from a modified pUC
plasmid containing the 1.9-kilobase pair KpnI-PvuII fragment, which is
the DNase I super-hypersensitive site located 53 kbp upstream of the
human -globin gene (37) (a gift of Dr. Tim Townes of University of
Alabama, Birmingham). The above fragments were blunt-ended and
inserted into site 1 of pSV2cat. All plasmids were grown in E. coli
HB101 and purified twice over cesium chloride gradients. All inserts
were characterized by DNA sequence analyses on both strands using the
Maxam-Gilbert sequencing method (38).
Figure 1: Schematic diagram of pSV2cat. The SV40 early promoter, the TATA box (AT), the 21-bp repeats, and the 72-bp enhancers are shown. The arrow indicates the direction of transcription. DNA-repeating sequences were inserted into the 5`-untranslated region of the CAT gene (site 1). In addition, the same sequences were placed either upstream (site 2) or downstream of the SV40 promoter/enhancer (sites 3 and 4). The heavy line indicates the bacterial CAT-coding sequence. The cross-hatched regions correspond to the three DNA fragments used as hybridization probes.
Table 1 and Fig. 2 show that insertion of 50 bp of
polylinker sequences into the 5`-untranslated region of the gene
(pRW1906) has no detrimental effect on gene expression and, in fact,
consistently yielded slightly higher enzyme activity levels than the
parent plasmid pSV2cat. In contrast, the level of CAT activity for
pRW1253, which contains a 29-bp GC tract in the same context as
pRW1906, is strongly reduced. If this effect were caused by mRNA
secondary structure, inversion of the insert (pRW1254) should lead to
normal levels of CAT expression, since the other strand is now the
coding strand. Comparison of the CAT activity expressed by the two
constructs containing (G)(C) sequences in different orientations
showed that both produced similar low levels of enzyme. As shown in
Table 1, the reduction in activity was of the order of 20-100-fold
relative to pSV2cat.
Figure 2: CAT levels expressed by constructs containing repeating
(G)(C) tracts within the transcribed region. Transfection of CV-1
cells and determination of CAT activity were performed as described
under "Experimental Procedures."
Since the negative effect of the (G)(C) sequences was independent of
the orientation and therefore of the transcribed DNA strand, it was
likely to be caused by an event at the DNA level. To eliminate the
possibility that the observed reductions are a consequence of the
polylinker sequences present in addition to the pur·pyr tracts in
pRW1253 and pRW1254, we also constructed a plasmid (pRW1266) in which
most of the polylinker sequences surrounding the (G)(C) block were
deleted (see "Experimental Procedures"). This construct was very
deficient in CAT expression, even more so than pRW1253 and pRW1254,
indicating that the negative effect is caused by the presence of the
pur·pyr tracts and not by any flanking sequences (data not shown).
The effect of simple repeat DNA sequences was investigated further by
studying sequence isomeric inserts. The level of CAT activity in
constructs containing (C-G) and (C-G) was strongly reduced (80-90%),
whereas no inhibition was observed for pRW1925 containing 14 bp of
(C-G). Thus, the inhibition of gene expression depends on the length
of the (C-G) insert. Additionally, constructs containing (T-G)
(pRW1912) and (A-T) sequences (pRW1929) inhibited CAT expression,
whereas little or no inhibition was observed with pRW1924 containing
the (T-G) insert (Table 1).
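The text reports some effects as percent reductions (80-90%) and others as fold reductions (20-100-fold), and the two scales are easy to misread against each other. The short sketch below only converts between them; it is plain arithmetic, the function names are hypothetical, and no values beyond those quoted in the text are assumed.

```python
def percent_reduction_to_fold(percent):
    """Convert a percent reduction in activity to a fold reduction.

    A 90% reduction leaves 10% of the activity, i.e. a 10-fold reduction.
    """
    remaining = 1.0 - percent / 100.0
    if remaining <= 0.0:
        raise ValueError("a reduction of 100% or more has no finite fold value")
    return 1.0 / remaining


def fold_to_percent_reduction(fold):
    """Convert a fold reduction in activity to a percent reduction."""
    if fold <= 0.0:
        raise ValueError("fold reduction must be positive")
    return 100.0 * (1.0 - 1.0 / fold)


if __name__ == "__main__":
    for pct in (80.0, 90.0):
        print(f"{pct:.0f}% reduction = {percent_reduction_to_fold(pct):.1f}-fold")
    for fold in (20.0, 100.0):
        print(f"{fold:.0f}-fold = {fold_to_percent_reduction(fold):.1f}% reduction")
```

On this scale, an 80-90% reduction corresponds to roughly 5-10-fold, whereas a 20-100-fold reduction corresponds to 95-99%.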
The inhibition of CAT expression observed with plasmids containing these repeating sequences was not a result of deletions or rearrangements of these repeats in mammalian cells, since Southern blot analyses on plasmids containing the above inserts in CV-1 and COS-1 cells confirmed the integrity of all the insert sequences (see "Experimental Procedures"). The reason why other workers (43) observed deletions of (C-G) sequences in SV40-derived constructs is unclear but may be due to the use of different systems.
Figure 3: CAT mRNA levels in transfected CV-1 cells. Total RNA was extracted from CV-1 cells transfected with the indicated constructs 48 h post-transfection, transferred to nylon membranes, and probed with the DNA fragments as described under "Experimental Procedures."
The up-modulation of gene expression was largely independent of the
distance of the (G) inserts from the transcription start site, since
constructs with inserts in site 3, downstream of the CAT gene
(pRW1269), showed the same level of expression as those in site 2
(pRW1259 and pRW1260). This is further corroborated by the results
obtained with the (G) insert in site 4. Since the insertion of (G) and
(C) at site 4 lies within the coding sequence of the CAT gene and
abolishes expression of the gene, RNA analyses were performed on these
constructs. Northern blot analysis (not shown) revealed that pRW1931
produced full-length mRNA, and this again was independent of the
orientation of the insert, as demonstrated by the orientation isomer
pRW1932. The amounts of RNA made by the constructs at site 4 were
similar to those in sites 2 and 3.
The level of CAT activity observed in constructs containing 48- and
32-bp repeating (C-G) inserts in site 2 (pRW1922 and pRW1923,
respectively) and site 3 (pRW1927 and pRW1928, respectively) was
similar to that of the parental plasmid, pSV2cat or pRW1292. Similar
results were obtained with pRW1921 and pRW1926, containing repeating
(T-G) inserts at sites 2 and 3, respectively.
Furthermore, when these sequences were inserted into pA10cat2, no stimulation of CAT expression was observed. Our results clearly indicate that the alternating (C-G) and (T-G) sequences do not enhance transcription of the CAT gene in the enhancerless constructs (pA10cat2 derivatives), in disagreement with a previous report (44) and likewise do not have any effect on transcription from the wild-type SV40 promoter.
Figure 4: Competition in trans by a plasmid containing repeating
(G)(C) sequences. CAT activity of cells cotransfected with 3 µg (open
circles) or 8 µg (filled circles) of reporter plasmid pRW1254 and
increasing amounts of competitor plasmid pRW1231 is shown. For the CAT
assays, protein concentrations were adjusted so that 20-35% of
starting material was acetylated after 30 min. All values reflect the
average of two determinations. Reproducibility was ±15%.
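The logic of titration in trans can be illustrated with a toy calculation. The sketch below is not derived from the data in Fig. 4 and is not the authors' model: it assumes a fixed pool of binding factor, equal and tight binding to every (G)(C) site, and a number of sites proportional to the micrograms of plasmid transfected. The function name, the pool size, and all numbers are hypothetical.

```python
def reporter_occupancy(factor_pool, reporter_sites, competitor_sites):
    """Fraction of reporter sites bound by factor under stoichiometric titration.

    Assumptions (illustrative only): binding is tight, all sites have equal
    affinity, and the factor distributes evenly over reporter and competitor
    sites. All quantities are in the same arbitrary units.
    """
    total_sites = reporter_sites + competitor_sites
    if total_sites <= 0:
        raise ValueError("at least one binding site is required")
    return min(1.0, factor_pool / total_sites)


if __name__ == "__main__":
    factor_pool = 5.0  # hypothetical amount of binding factor
    reporter = 3.0     # arbitrary units, by analogy with 3 µg of reporter
    for competitor in (0.0, 7.0, 17.0):
        occ = reporter_occupancy(factor_pool, reporter, competitor)
        print(f"competitor={competitor:>4}: reporter occupancy={occ:.2f}")
```

Under these assumptions the reporter stays fully occupied until the total number of sites exceeds the factor pool, after which occupancy falls as competitor is added; for a repressing site, that fall would correspond to relief of repression.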
These results clearly demonstrate that (G)(C) blocks of
sufficient length at site 1 can exert a negative regulatory effect on
gene expression in a transient expression system. This effect is
independent of the orientation of the insert, which excludes the
possibility of formation of mRNA secondary structure as the basis for
the observed block of CAT expression. In addition, it appears from RNA
blots that overall transcription is repressed in these constructs and
that pausing or stoppage of the RNA polymerase at the pur·pyr
sites does not significantly contribute to the effect.
In parallel studies, we found that alternating (G-C) stretches of 48
or 32 bp in length caused a strong (80-90%) reduction in CAT
activity. Similar effects were observed with constructs containing
120-bp alternating (A-T) and 98-bp (T-G) insert sequences. The large
reduction in CAT activity exerted by these inserts is clearly due to a
post-transcriptional block. Since these inserts have a dyad axis in
their sequences, a hairpin can form in the mRNA. Such hairpins have
been shown to inhibit the movement of the 40 S ribosomal subunit along
the RNA (49, 50). The (T-G) tracts lacked this symmetry and did not
inhibit CAT gene expression. Calculations of the free energies of
formation of such hairpins, which are of the order of 100 kcal/mol,
reveal that the mRNA of the (G-C) inserts can fold into
extraordinarily stable hairpins. It is very likely that such a
structure would present a strong stop for the ribosomes. The finding
that the (C-G) insert, which contains an inverted repeat, does not
inhibit CAT expression is probably due to its shorter length and its
lower free energy (<30 kcal/mol) for mRNA hairpin formation. The free
energy of RNA hairpin formation for the (A-T)-rich insert (pRW1929) is
<30 kcal/mol. The reason for the observed reduction in its CAT
activity is uncertain but may be related to its A+T content. The
alternating (C-G) and (T-G) sequences are known to form Z-DNA in vitro
and in vivo in bacterial cells (34, 35, 51), whereas plasmids
containing repeat sequences of (A-T) (pRW1929) and (TG)E(CA) (pRW1912)
adopt cruciform structures. Our data do not strictly exclude the
possible existence of such structures in mammalian cells, but they
clearly indicate that these sequences do not act as inhibitors of
transcription. From our studies, it therefore appears that the G+C
content of the insert alone is not sufficient to produce the
inhibition of transcription observed with (G)(C) tracts. Since the
inserts seem to be >95% stable, we assume that the down-regulation of
CAT transcription is not due to gene rearrangements or site-specific
cutting by (G)(C)-specific endonucleases. We cannot rule out extensive
site-specific nicking of the pur·pyr sequences by such enzymes.
However, such a model would make it difficult to explain the observed
reduction of expression of the CAT gene by the constructs carrying
inserts outside the transcription unit. In addition, to ascertain that
the (G)(C)-specific reductions in transcriptional competence are not
cloning artifacts, we have removed the inserts from these constructs
and reconstituted wild-type pSV2cat activity (data not shown).
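The dyad-axis argument in this paragraph can be checked directly: a sequence has dyad symmetry when it equals its own reverse complement, so a transcript of it can pair with itself as a hairpin. The sketch below is a minimal illustration under stated assumptions; the function names are hypothetical and the repeat counts are chosen for illustration, not taken from the plasmids. It tests only for perfect self-complementarity and does not estimate hairpin free energy.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def has_dyad_symmetry(seq):
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


if __name__ == "__main__":
    # Illustrative inserts; repeat counts are hypothetical.
    inserts = {
        "alternating (C-G)": "CG" * 12,
        "alternating (A-T)": "AT" * 12,
        "(TG)n-AATT-(CA)n": "TG" * 6 + "AATT" + "CA" * 6,
        "uninterrupted (T-G)": "TG" * 12,
    }
    for name, seq in inserts.items():
        print(f"{name:>20}: dyad symmetry = {has_dyad_symmetry(seq)}")
```

In this sketch the alternating (C-G) and (A-T) tracts and the (TG)-AATT-(CA) arrangement are self-complementary, whereas an uninterrupted (T-G) tract is not, in line with the distinction drawn in the text.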
It is conceivable that mRNA containing long runs of homopolymers could
hybridize with its coding sequence by forming a D-loop or a triplex and
thus hinder the formation of new transcripts. However, this is very
unlikely since the down-regulation of transcription is independent of
the orientation of the inserts. Similarly, the fact that these
sequences do not act as enhancers excludes the possibility of
competition for transcription stimulation factors. Since the inserts do
not appear to inhibit elongation of the CAT mRNA, but rather repress
transcription initiation, it seems possible that the (G)(C) inserts act
at a distance to render the promoter region less active. Oligo(G)(C)
blocks have been shown to exert structural distortions on flanking
sequences (52, 53), and similar pur·pyr sequences have been reported to
influence structural transitions over a long distance (54, 55, 56).
The fact that gene expression is reduced to a different degree
depending on the distance of the (G)(C) inserts from the promoter
supports this model.
A plausible model for site 1 is that the (G)(C) sequences are
complexed to putative DNA-binding proteins in vivo, which then
interfere with gene expression. In this case, the binding protein would
obstruct the accessibility of the promoter to transcription factors or
prevent the DNA from assuming the conformation necessary for promoter
function. In previous experiments, a model was proposed for the effect
of (G) tracts on gene expression when these sequences were inserted 5`
of the TK enhancerless promoter (28). In that model, a trans-acting
factor fails to bind to the long (G) tracts due to formation of a
triple helix structure in mouse LTK cells. Our results with inserts at
site 2 agree with those of that study, although the level of
enhancement of CAT gene expression is different. We have observed
2-3-fold enhancement with (G) at this site as compared to the 10-fold
enhancement reported there. The difference could be due to the use of
different systems. It is possible that different promoters have
different responses to the presence of polypurine tracts, and also the
G-binding protein (GBP) could be present at different levels in
different cell types.
As stated above, simple repeat sequences containing runs of Gs and Cs
occur in the 5`-untranslated region of several genes (14, 29) and are
binding sites for proteins (29, 57). An erythrocyte-specific factor
(BGP1) binds to the linear fragments of (G) tracts in the 5`-flanking
region of the chicken adult -globin gene (29). It has been shown that
this factor has greater affinity for the (G) sequences than Sp1 and is
distinct from Sp1. Additionally, the nuclear factor suGF1 from sea
urchin embryos was reported recently to interact with 11 contiguous Gs
in the H1-H4 intergenic region of a sea urchin early histone gene in
vitro (58). Although it has been suggested that these factors may play
a role in gene regulation, the current experimental evidence for the
proposed role of these factors binding to (G)(C) is unclear. Here, we
have shown that when (G) is inserted in the 5`-untranslated region of
the CAT gene, transcription is inhibited, and this inhibition can be
titrated in trans with the competitor. Thus, the (G)-binding proteins
can act as transcriptional regulators. We believe that this
(G)-binding factor is different from Sp1, but further work is required
to characterize this factor and study its interaction with the (G)
tracts in detail.
The results with (G) inserts in sites 1 and 2-4 are consistent with a
looping model. DNA looping is known to facilitate protein-protein
interactions for accurate transcriptional initiation (59). A possible
mechanism by which these inserts may affect gene expression is
suggested by the fact that the (G)-binding protein (GBP) can recognize
the (G) sequences. The DNA-bound regulatory protein (GBP) touches the
TATA-box binding protein (TBP) and initiates transcription when the
intervening DNA loops out or bends to allow protein-protein
interactions to occur. When (G) is inserted in site 1, the loop is too
small to form because the distance between the (G) and the
transcription initiation site is 100 bp; the GBP binds to (G) in the
reporter plasmid, blocking transcription and resulting in very low
expression. This suggests that it acts as a repressor for the inserts
in site 1. In the presence of the competitor, the GBP binds to the (G)
in the competitor DNA, and the level of expression goes up slightly.
In contrast, when (G) is inserted in sites 2, 3, or 4, distal from
the initiation site, looping is possible, transcription is initiated,
and RNA is produced. In this case, the loop is not formed in the
presence of the competitor, and the expression is lowered, suggesting
that the GBP now acts as an activator for the inserts at these sites.
Our results are consistent with this looping model.
However, we cannot exclude the possibility of formation of an altered
DNA structure in (G) tracts inside the monkey cells which
could be recognized by the G-binding factor, thereby affecting the
initiation of transcription. Protein-protein interactions between GBP
and the basal transcription factor can be studied further using the
two-hybrid system pioneered by Fields et al. (61).
The biological functions of DNA microsatellites are unknown. Our
data demonstrate that microsatellite-type sequences, which are
recognized by some proteins, can profoundly affect transcription and
thus gene expression, depending on their sequence, length, and map
location. Numerous prior studies have revealed that these same factors
are important in stabilizing non-B-DNA conformations (25, 26). Several
human genetic diseases (Introduction, 2-11) are caused by expansion of
(CGG) or (CTG) triplet repeats which are proximal to or within the
relevant genes. It is of interest that the (CGG) repeat sequences are
located in the 5`-untranslated region of the FMR1 gene, whose
transcriptional regulation is linked to the fragile X syndrome. In
affected individuals, the FMR1 protein is absent (21), suggesting a
possible role of microsatellites in gene expression. These sequences
are known to adopt non-B-DNA structures (11-13, 15-19, 28, 34-36,
51-53, 56, 60) and can serve as binding sites for some
proteins (32, 60). These correlations suggest possible conformational
roles of these repetitive sequences in gene regulation in human
diseases.