(Received for publication, February 19, 1997, and in revised form, April 24, 1997)
From the Institut für Genetik,
ABSTRACT
The autonomous expansion of the unstable 5′-d(CGG)n-3′ repeat in the 5′-untranslated region of the human
FMR1 gene leads to the fragile X syndrome, one of the most
frequent causes of mental retardation in human males. We have recently
described the isolation of a protein p20-CGGBP that binds
sequence-specifically to the double-stranded trinucleotide repeat
5′-d(CGG)-3′ (Deissler, H., Behn-Krappa, A., and Doerfler, W. (1996)
J. Biol. Chem. 271, 4327-4334). We demonstrate now
that the p20-CGGBP can also bind to an interrupted repeat sequence.
Peptide sequence tags of p20-CGGBP obtained by nanoelectrospray mass
spectrometry were screened against an expressed sequence tag data base,
retrieving a clone that contained the full-length coding sequence for
p20-CGGBP. A bacterially expressed fusion protein p20-CGGBP-6xHis
exhibits a binding pattern to the double-stranded 5′-d(CGG)n-3′
repeat similar to that of the authentic p20-CGGBP. This novel protein
lacks any overall homology to other known proteins but carries a
putative nuclear localization signal. The p20-CGGBP gene is
conserved among mammals but shows no homology to non-vertebrate
species. The gene encoding the sequence for the new protein has been
mapped to human chromosome 3.
INTRODUCTION
DNA sequences containing trinucleotide repeats of the general
sequence (CXG)n or (GAA)n appear to be
genetically unstable. Amplifications of such repeats were found to be
associated with several human diseases, e.g. fragile X
(FraX) syndrome, Huntington's disease,
myotonic dystrophy, or Friedreich's ataxia (for reviews see Refs. 1
and 2). The number of repeats usually correlates with the severity of
the disease.
The FraX syndrome is associated with an amplification of the
trinucleotide sequence 5′-d(CGG)n-3′ in the 5′-UTR of the
FMR1 (fragile X mental retardation) gene and extensive
5′-d(CG)-3′ methylation in the repeat as well as in adjacent
regions (3-7). As a consequence, the expression of the encoded protein
FMRP is reduced or abolished. Fragile sites similar to this locus have been identified on several of the human chromosomes, and all sites characterized so far have shown an expansion of 5′-d(CGG)n-3′
repeats suggesting an important structural function of this DNA sequence. Several studies have demonstrated that single-stranded oligodeoxyribonucleotides of the general sequence
5′-d(CXG)n-3′ with X = A or G can
form stable secondary structures in vitro (8-14). The
mechanism of the amplification of unstable DNA sequences is not
understood. Expansion might involve DNA slippage or unequal crossing
over, but these models do not explain the observed phenomenon sufficiently (for review see Ref. 15). The function of the repeat itself is also unknown. A 5′-d(CGG)4-3′ fragment in the
rRNA gene promoter (16) and a 5′-d(CTG)25-3′ element
located in the promoter of the mouse growth inhibitory factor gene (17)
are thought to function as regulatory elements.
Genetic instability of long 5′-d(CXG)n-3′ tracts
with deletion products of triplet repeat sequences has also been observed in Escherichia coli (18). The instability of long
5′-d(CGG)n-3′ repeats in E. coli has been shown to
depend on host cell genotype, length, polymorphism, and on the
orientation of the triplet repeat relative to the replication origin
(19). Expansion products of triplet repeat sequences have also been
detected (20).
We are interested in the characterization of human cellular proteins
binding to 5′-d(CGG)n-3′ repeats in a sequence-specific manner.
The study of these proteins may improve our understanding of triplet
repeat amplifications and their function. Double-stranded as well as
single-stranded simple trinucleotide repeat sequences are target
sequences for sequence-specific DNA binding proteins (21-23). We have
recently isolated the 20-kDa protein p20-CGGBP (5′-d(CGG)n-3′
binding protein) from HeLa nuclear extracts by DNA affinity
chromatography. This protein binds sequence-specifically to the
unstable, double-stranded 5′-d(CGG)n-3′ trinucleotide repeat
(24). The protein requires more than eight repeats for proper binding.
Base pair exchanges in the 1st base pair of every second triplet repeat
abolish binding. None of the other known unstable trinucleotide repeats
or either single strand of the trinucleotide repeat
5′-d(CGG)n-3′ can serve as a target sequence for this protein.
The binding of p20-CGGBP is also severely inhibited by complete or
partial cytosine-specific DNA methylation of the binding motif.
Interestingly, a p20-CGGBP activity has been found in several human and
mammalian cell lines and in human primary lymphocytes.
Here we describe the rapid cloning of the full-length cDNA for
p20-CGGBP by a novel strategy. Peptide sequence tags obtained by
nanoelectrospray mass spectrometry (25) of the purified protein have
been used to screen an EST data base. The encoded protein has been
expressed in bacteria, and its binding specificity has been
investigated in detail.
EXPERIMENTAL PROCEDURES
Cell Lines
Human HeLa cells were purchased from the
Gesellschaft für Biotechnologische Forschung, Braunschweig,
Germany.
The 5′-d(CGG)n-3′ binding protein p20-CGGBP was purified from HeLa
nuclear extracts by anion exchange chromatography and DNA affinity
chromatography as described (24) with slight modifications. A mixture
of double-stranded (ds) and single-stranded calf thymus DNA-cellulose
(Pharmacia Biotech Inc.) was used as an unspecific DNA matrix. As a
specific DNA matrix 5′-d(CGG)17-3′ ds-Sepharose was used.
Each fraction was tested for its ability to bind sequence-specifically to the oligodeoxyribonucleotide 5′-d(CGG)17-3′ ds as
described. The sequences of various oligodeoxyribonucleotides used in
this study are summarized in Table I. Protein fractions prepared for
peptide sequencing were eluted from
5′-d(CGG)17-3′ ds-Sepharose at 65 °C for 10 min in 5 mM Tris-HCl, 1% SDS, 2 mM dithiothreitol, 10 µg/ml aprotinin, pH 6.9, separated by SDS-polyacrylamide gel electrophoresis and transferred to a polyvinylidene difluoride (PVDF)
membrane (Problot, Applied Biosystems) in 10 mM CAPS, 10% methanol, pH 11. After staining of the membrane in Coomassie Blue R-250, the appropriate band was excised and stored at
−20 °C (26).
Table I.
Repetitive oligodeoxyribonucleotides used in competition experiments
Composition of ds     Sequence
(CGG)17ds             5′-(CGG CGG)8 CGG-3′
                      3′-(GCC GCC)8 GCC-5′
(CAG)17ds             5′-(CAG CAG)8 CAG-3′
                      3′-(GTC GTC)8 GTC-5′
(CGA)17ds             5′-(CGA CGA)8 CGA-3′
                      3′-(GCT GCT)8 GCT-5′
(CAA)17ds             5′-(CAA CAA)8 CAA-3′
                      3′-(GTT GTT)8 GTT-5′
(TGG)17ds             5′-(TGG TGG)8 TGG-3′
                      3′-(ACC ACC)8 ACC-5′
CGG8Tds               5′-(CGG TGG)8 CGG-3′
                      3′-(GCC ACC)8 GCC-5′
CGG8Ads               5′-(CGG AGG)8 CGG-3′
                      3′-(GCC TCC)8 GCC-5′
CGG8Gds               5′-(CGG GGG)8 CGG-3′
                      3′-(GCC CCC)8 GCC-5′
CGG10AGGds            5′-(CGG)3 AGG (CGG)9 AGG (CGG)3-3′
                      3′-(GCC)3 TCC (GCC)9 TCC (GCC)3-5′
CGG10AGG/(CCG)17      5′-(CGG)3 AGG (CGG)9 AGG (CGG)3-3′
                      3′-(GCC)3 GCC (GCC)9 GCC (GCC)3-5′
(CGG)17/CCG10CCT      5′-(CGG)3 CGG (CGG)9 CGG (CGG)3-3′
                      3′-(GCC)3 TCC (GCC)9 TCC (GCC)3-5′
FraxFds               5′-GTCCCCCGCTGCCGTCGCCGTCGCCGTCGCCGCCGCCGCCGCCGCCGCCGCC-3′
                      3′-CAGGGGGCGACGGCAGCGGCAGCGGCAGCGGCGGCGGCGGCGGCGGCGGCGG-5′
Protein fractions isolated from HeLa
nuclear extracts and enriched for the 5′-d(CGG)17-3′ ds
binding activity as well as fractions without such activity were
separated by SDS-polyacrylamide gel electrophoresis. After transfer to
a PVDF membrane (Fluorotrans, Pall, Dreieich, Germany) in 195 mM glycine, 25 mM Tris, pH 8.3, the membrane
was washed in buffer SW (10 mM K-HEPES, 100 mM
KCl, 0.5 mM dithiothreitol, 0.1 mM EDTA, 0.2 mM spermine, 10% glycerol, protease inhibitors, pH 7.9).
All subsequent steps were carried out at 4 °C unless stated
otherwise. Proteins were denatured by washing the membrane twice for 10 min in buffer SWA (10 mM K-HEPES, 100 mM KCl,
0.5 mM dithiothreitol, 0.1 mM EDTA, 10%
glycerol, 6 M guanidinium hydrochloride, pH 7.9) and were
renatured by removing half of the volume of buffer SWA, replacing it
with buffer SW, and incubating the membrane for 10 min. This step was
repeated four times and followed by a 5-min incubation in buffer SW.
The membrane was blocked in buffer SW containing 5% low fat milk
powder, 10 µg/ml salmon sperm DNA, and 1 µg/ml poly(dA·dT) for 60 min. After a wash in buffer SW with 0.5% low fat milk powder, the membrane was incubated for 2 h at 18 °C in buffer SW with 60 fmol of 32P-labeled oligodeoxyribonucleotide
(1.7 × 10^6 cpm) and 4 µg/ml poly(dA·dT). The
membrane was finally washed in buffer SW for 10 min, dried, and
exposed for 2-4 days on Kodak XAR films.
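The stepwise renaturation described above amounts to a serial twofold dilution of the denaturant. The following Python sketch estimates the guanidinium hydrochloride concentration after each buffer exchange. It rests on assumptions that the text does not state: ideal mixing, exactly half of the volume exchanged each time, and five exchanges in total (the first exchange plus four repetitions). The function name is hypothetical.

```python
def stepwise_dilution(start_molar, steps, fraction_replaced=0.5):
    """Concentration after each exchange of a fixed fraction of the volume
    with denaturant-free buffer (ideal mixing assumed)."""
    concentrations = [start_molar]
    for _ in range(steps):
        concentrations.append(concentrations[-1] * (1.0 - fraction_replaced))
    return concentrations


# Buffer SWA contains 6 M guanidinium hydrochloride; five exchanges is an
# assumption (one initial exchange plus four repetitions).
series = stepwise_dilution(6.0, steps=5)
for step, conc in enumerate(series):
    print(f"after {step} exchange(s): {conc:.4f} M guanidinium hydrochloride")
```

Under these assumptions the denaturant falls below 0.2 M before the final incubation in buffer SW alone.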
The PVDF
membrane carrying the blotted protein was destained in water. The
protein was reduced, alkylated, and digested overnight with trypsin
(12.5 ng/µl; Boehringer Mannheim, sequencing grade) at 37 °C in a
50 mM ammonium bicarbonate, 5 mM
CaCl2 buffer (25, 27). Peptides were extracted from the
membrane in 10 µl of aqueous 5% formic acid (Merck, Darmstadt,
Germany) and subsequently in 10 µl of 1:1 acetonitrile:water, 5%
formic acid (two changes). The resulting peptide mixture was dried and
stored at −20 °C. The dried mixture was redissolved in 1 µl of 70%
formic acid, rapidly diluted with 9 µl of water to avoid formylation,
and concentrated and desalted using 50 nl of Poros™ R2 material
(PerSeptive Biosystems, Framingham, MA) prepared in a glass
capillary as described previously (25, 27). The peptide mixture was
eluted directly into a gold-coated glass capillary by passing 2 volumes
of 0.4 µl of 50% methanol, 5% formic acid over the Poros™ column.
The gold-coated glass capillary was mounted in the nanoelectrospray ion
source (28, 29) on the mass spectrometer for peptide tandem mass
spectrometry (MS) sequencing.
Tandem MS investigations were performed on an API III triple quadrupole
mass spectrometer (PE Sciex, Ontario, Canada) equipped with an updated
collision cell (30) and a nanoelectrospray ion source (28, 29). To
detect peptides whose signals were below the chemical noise level, the parent
ion scan technique, a filtering method for triple quadrupole mass
spectrometers, was used on the isoleucine/leucine immonium ion (31)
(see Fig. 3). Only peptides generating upon fragmentation a predefined
fragment ion, in this instance the 86-Da immonium ion of
isoleucine/leucine, were detected. By applying this filtering
procedure, even peptides generating signals below the noise level in
the spectrum could still be recognized.
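The filtering idea behind the parent ion scan can be sketched in a few lines of Python. The first function derives the m/z of the isoleucine/leucine immonium ion from standard monoisotopic masses (residue mass minus CO plus one proton). The second keeps only those precursors whose fragments include that marker ion. The precursor and fragment values in the example are invented for illustration and are not data from this study, and the function names are hypothetical. A real instrument performs this selection in hardware during the scan, not on stored peak lists.

```python
# Standard monoisotopic masses in Da.
RESIDUE_MASS = {"L": 113.08406, "I": 113.08406}
CO_MASS = 27.99491
PROTON_MASS = 1.00728


def immonium_mz(residue):
    """m/z of the singly charged immonium ion of an amino acid residue."""
    return RESIDUE_MASS[residue] - CO_MASS + PROTON_MASS


def parent_ion_scan(precursors, marker_mz, tolerance=0.3):
    """Return precursor m/z values whose fragment list contains the marker ion."""
    return [
        precursor
        for precursor, fragments in precursors.items()
        if any(abs(fragment - marker_mz) <= tolerance for fragment in fragments)
    ]


# Hypothetical precursor -> fragment m/z lists (illustrative values only).
example = {
    472.3: [86.1, 175.1, 272.2],
    530.8: [70.1, 120.1, 301.2],
    610.4: [86.1, 147.1, 409.3],
}
marker = immonium_mz("L")
hits = parent_ion_scan(example, marker)
print(f"marker ion m/z = {marker:.2f}; precursors retained: {hits}")
```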
Data base searches were performed using the sequence tag approach and the PeptideSearch program (32). A sequence tag carries partial information on a peptide sequence and typically comprises an internal sequence stretch of about three to four amino acids together with its mass location in a peptide of known total mass. Frequently, this information can be gained from a tandem mass spectrum, and it allows the highly specific identification of the underlying complete peptide sequence during a data base search. A non-redundant data base currently containing more than 200,000 protein sequences or open reading frames was searched. Another data base was prepared from the available EST data base dbEST (33, 34), which currently contains about 500,000 human sequence entries. The EST clone ID269133 (GenBank accession numbers [GenBank] and [GenBank]) was obtained via the I.M.A.G.E. consortium (35), distributed by Research Genetics, Huntsville, AL, and resequenced on automated DNA sequencers.
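To illustrate why a sequence tag is such a selective probe, the following Python sketch matches a tag against a small peptide list. It is a simplified illustration and not the PeptideSearch algorithm: the tag is expressed here as the summed residue mass preceding the sequence stretch, the stretch itself, and the summed residue mass following it, whereas a real search works with fragment ion masses and enzyme cleavage rules. Isoleucine and leucine are treated as identical because they have the same mass. The peptides VSVLAAK and GASPEDLK are made-up decoys, and all function names are hypothetical.

```python
# Standard monoisotopic residue masses in Da.
MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def normalize(sequence):
    """Treat the isobaric residues I and L as the same letter."""
    return sequence.replace("I", "L")


def matches_tag(peptide, prefix_mass, tag, suffix_mass, tol=0.5):
    """True if the tag occurs in the peptide with matching flanking masses."""
    pep, motif = normalize(peptide), normalize(tag)
    for start in range(len(pep) - len(motif) + 1):
        if pep[start:start + len(motif)] != motif:
            continue
        before = sum(MASS[aa] for aa in peptide[:start])
        after = sum(MASS[aa] for aa in peptide[start + len(motif):])
        if abs(before - prefix_mass) <= tol and abs(after - suffix_mass) <= tol:
            return True
    return False


# Two peptides reported in the text plus two invented decoys.
DATABASE = ["FVVTAPPAR", "VSVIQDFVK", "VSVLAAK", "GASPEDLK"]

# Illustrative tag: N-terminal stretch VSV(I/L), followed by the residue
# mass of QDFVK (computed here, not read from a spectrum).
suffix = sum(MASS[aa] for aa in "QDFVK")
hits = [p for p in DATABASE if matches_tag(p, 0.0, "VSVL", suffix)]
print(hits)
```

The decoy VSVLAAK shares the four-residue stretch but is rejected because its flanking mass differs, which shows how the mass constraints add specificity beyond the short sequence alone.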
Expression of the Fusion Protein p20-CGGBP-6xHis in E. coli and Purification by Ni-Chelate Chromatography
A fragment p20-518 was
amplified from the plasmid 269133 by using primers p20Bam5
(5′-dCGCGGATCCGAGCGATTTGTAGTAACAGCA-3′) and p20Kpn3
(5′-dGGGGTACCTCAACAATCTTGTGAGTTGAG-3′) with Pfu DNA
polymerase (Stratagene). The fragment p20-518 contained a BamHI recognition site at the 5′-end, a site for
KpnI at the 3′-end, and coded for amino acids (aa) 2-166 of
p20-CGGBP. This fragment was cloned into the vector pQE40 (Qiagen,
Hilden, Germany) that had been cut with BamHI and KpnI.
Thereby, a histidine hexamer was introduced at the N terminus of the
fusion protein (see below). The correct sequence of the fusion
construct pQE40-518 was ascertained by DNA sequencing of 10 independent
clones after transformation into E. coli M15[pREP4]
(Qiagen). The vector pQE40 encoded the fusion protein
6xHis-dihydrofolate reductase (DHFR-6xHis). Its coding sequence was
removed by cleavage with the above-mentioned enzymes.
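The restriction sites built into the two primers can be checked directly from the sequences given above. This Python sketch locates the BamHI (GGATCC) and KpnI (GGTACC) recognition sequences in each primer and prints the reverse complement of the antisense primer. The recognition sequences are standard; the variable and function names are chosen here for illustration. That the reverse complement reads TGA immediately upstream of the KpnI site is an observation from the sequence, consistent with a stop codon, and is not a statement made in the text.

```python
SITES = {"BamHI": "GGATCC", "KpnI": "GGTACC"}

PRIMERS = {
    "p20Bam5": "CGCGGATCCGAGCGATTTGTAGTAACAGCA",
    "p20Kpn3": "GGGGTACCTCAACAATCTTGTGAGTTGAG",
}


def reverse_complement(sequence):
    """Reverse complement of an unambiguous DNA sequence (A, C, G, T)."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(sequence):
    """Map each enzyme whose site occurs in the sequence to its 0-based position."""
    return {
        enzyme: sequence.find(site)
        for enzyme, site in SITES.items()
        if site in sequence
    }


for name, sequence in PRIMERS.items():
    print(name, find_sites(sequence))

# Sense-strand view of the 3' primer.
print(reverse_complement(PRIMERS["p20Kpn3"]))
```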
The E. coli strains containing the appropriate constructs
were grown in liquid culture at 37 °C to an
A600 nm of 0.7. Synthesis of the fusion protein
was induced by adjusting the medium to 1.5 mM
isopropyl-1-thio-β-D-galactopyranoside (MBI Fermentas). After incubation for another 2 h, bacteria were pelleted and lysed for 30 min on ice in buffer SB (50 mM sodium
phosphate, 10 mM Tris-HCl, 300 mM NaCl, 1 mM MgCl2, 0.1% Tween 20, 20% glycerol, 8 µg/ml aprotinin, pH 7.85) in the presence of 1 mg/ml lysozyme (Calbiochem). The lysate was sonicated on ice for 1 min at 40 watts
(Sonifier B12, Branson Sonic Power Company, Danbury, CT). The
cytoplasmic extract was collected after centrifugation (10,000 × g, 4 °C, 10 min), frozen in liquid nitrogen, and stored
at −80 °C.
The fusion protein p20-CGGBP-6xHis and the control protein DHFR-6xHis
were purified by Ni2+-chelate affinity chromatography from
bacteria expressing the constructs pQE40-518 and pQE40, respectively.
Proteins from cytoplasmic extracts were bound to
Ni2+-nitrilotriacetic-agarose (Qiagen) at 4 °C for
3 h and, subsequently, at room temperature for 30 min. All
following steps were carried out at 4 °C. The material was washed
with buffer SB and SB100 (same as SB but containing 100 mM
imidazole, pH 6.5), and the fusion proteins were eluted with buffer
SB500 (same as SB, but containing 500 mM imidazole, pH
6.0). Since imidazole inhibited the DNA-binding activity, the eluted
fusion proteins were equilibrated in 20 mM K-HEPES, 100 mM NaCl, 1 mM MgCl2, 0.15 mM spermine, 0.1 mM EDTA, 0.5 mM
dithiothreitol, 20% glycerol, 0.01% Tween 20, and protease
inhibitors, pH 7.9. Even in the presence of protective carrier
proteins, the purified proteins were stable at −80 °C for only 2 weeks. The purity of the fusion proteins was followed by
electrophoresis on SDS-polyacrylamide gels. The DNA-binding specificities of the purified fusion proteins and of the bacterial lysates were tested by electrophoretic mobility shift assays
(EMSA).
EMSAs were carried out
essentially as described (24). Oligodeoxyribonucleotides (for nomenclature
and sequences, see Table I) were hybridized in a thermal cycler and
5′-end-labeled to a specific activity of 15,000 cpm/fmol by T4
polynucleotide kinase. When the binding activity of the bacterially
expressed fusion protein p20-CGGBP-6xHis was determined, 10 µg of
bovine serum albumin and/or 0.15 mM spermine were added per
assay to increase the stability of the protein.
Total cellular RNA was isolated from cell lines according to standard procedures. Electrophoresis was carried out under denaturing conditions (36). The RNA was transferred to a nylon membrane (Qiabrane, Qiagen) and hybridized (37) with 32P-labeled probe p20-701. A dot blot membrane carrying standardized amounts of mRNA isolated from a number of human tissues (human master mRNA blot, CLONTECH, Heidelberg, Germany) was also hybridized with the p20-701 probe.
A somatic cell hybrid panel (Oncor, Inc., Heidelberg, Germany) containing 15 µg of PstI-cleaved genomic DNA from 23 different rodent cell lines each carrying one human chromosome was hybridized with probe p20-779. The DNA probe p20-779 contained the complete 779-bp insert and was isolated from the EST clone ID269133 after EcoRI and NotI cleavage. The DNA fragment p20-701 lacked the poly(A)-tail of the insert and was isolated after BfaI cleavage of probe p20-779. Hybridization probes were 32P-labeled by randomly primed tagging (38).
RESULTS
Protein
fractions isolated from HeLa nuclear extracts and highly enriched for
the 5′-d(CGG)17-3′-ds-binding activity were analyzed by
Southwestern blotting for their binding activity to the target sequence
of the p20-CGGBP protein (Fig. 1). Separated and blotted
proteins were incubated either with a 32P-labeled
oligodeoxyribonucleotide that carried 17 5′-d(CGG)-3′ repeats
[(CGG)17ds] or with the control oligodeoxyribonucleotide (CAG)17ds that contained 17 5′-d(CAG)-3′ repeats. A band of
about 20 kDa was detected in all fractions highly enriched for
p20-CGGBP with 32P-labeled (CGG)17ds as a
binding probe (Fig. 1, lanes designated fractions III and
IV). This band was not present when (CAG)17ds was used as a probe. It also failed to be detected in a fraction deficient in 5′-d(CGG)17-3′-ds-binding activity. An
additional band of 100 kDa was observed in some fractions as well as in
crude nuclear extracts (Fig. 1, and data not shown) when either binding probe was used. This activity was likely due to unspecific DNA-protein interactions since this binding was also detected with HeLa nuclear extracts when a variety of single- or double-stranded
oligodeoxyribonucleotides were used as binding probes (data not shown).
These results confirmed that p20-CGGBP could bind directly to its
target sequence without the involvement of additional cellular
proteins.
Determination of the Amino Acid Sequence of the p20-CGGBP Protein: EST Clone ID269133 Contains the Complete Coding Sequence for p20-CGGBP
For the determination of the amino acid sequence of
several internal peptides of p20-CGGBP, about 400 ng of p20-CGGBP were isolated from 5 × 10^9 HeLa cells, separated from
accompanying proteins by SDS-polyacrylamide gel electrophoresis, and
transferred to a PVDF membrane. The experimental design for the
determination of the amino acid sequence is outlined in
Fig. 2. Extraction of the peptides from the membrane
resulted in limited recovery, as seen by the poor signal-to-noise ratio in Fig. 3, upper panel. Only one peptide was
apparent whose fragmentation spectrum is shown in Fig. 3, lower
panel. To detect peptide ion signals below the level of chemical
noise, a parent ion scan for the immonium ion of isoleucine/leucine was
performed. This scan revealed two more peptides that were subsequently
fragmented (Fig. 3, 2nd panel from top). Tandem MS spectra
of all three peptides were generated. Interpretation of these spectra
resulted in one complete and two partial sequences. The sequence of
peptide "a" was determined to be FVVTAPPAR. For peptide "b,"
interpretation of the high m/z range resulted in
the N-terminal sequence VSV(I/L) or a peptide sequence tag (17 Da)
VSV(I/L) (634.4 Da). A peptide sequence tag could also be assigned to
peptide "c": (172.2 Da) (I/L)YV (601.80 Da). A search of a large
non-redundant protein data base containing more than 200,000 entries
revealed no match to the sequence FVVTAPPAR or the sequence tags,
indicating that the protein was unknown. A search in the expressed
sequence tag data base (dbEST) with PeptideSearch, however, did
retrieve a matching EST (clone ID269133). This clone had been sequenced
from the 3′-end (GenBank accession no. [GenBank]) and the 5′-end (GenBank accession no. [GenBank]). Peptide a was found in the former and peptide b
in the latter. The retrieved sequence for peptide b was
VSVIQDFVK, which matched the obtained tandem mass spectrum for this peptide. Peptide c did not match directly, but regions 1 and 2 of the peptide sequence tag were found to match, indicating an error
in the DNA sequence coding for the C-terminal part of the
peptide (32). Resequencing of the clone indeed revealed a different DNA
sequence. After its correction the peptide sequence was
TALYVPLD, which also led to complete agreement with the
tandem mass spectrum of peptide c. The derived amino acid sequence of the 501-bp open reading frame, obtained after double-stranded resequencing of the EST clone, thus contained all three peptides covering 28 amino acids (Fig. 3, bottom).
The analysis of the cDNA sequence revealed that the nucleotide
sequence context of a putative start codon ATG
(AGGATGG) at nucleotide position 197 of the
779-bp insert was in accordance with the Kozak rules (40) for
functional translational start sites in eucaryotes. In addition,
several very short ORFs were detected 5′ of the putative start codon.
The stop codon at nucleotide 698 was followed by a polyadenylation
signal at nucleotide 729 and a poly(A)-tail starting at nucleotide 759. The 501-bp ORF encoded a 166-aa long protein with a molecular mass of
about 19 kDa which is in accordance with the apparent molecular mass
for p20-CGGBP as determined by SDS-polyacrylamide gel electrophoresis. The results indicated that the EST clone ID269133 contained the complete coding sequence of p20-CGGBP.
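The figures quoted in this paragraph can be cross-checked with a short calculation. The Python sketch below tests the start codon context AGGATGG for the two positions usually considered most important in the Kozak consensus (a purine at position −3 and a G at position +4), compares the distance between the stated start and stop codon positions with the stated ORF length, and estimates the protein mass. The average residue mass of 110 Da is a common rule of thumb and an assumption here, not a value from the text, so the mass is only approximate.

```python
KOZAK_CONTEXT = "AGGATGG"  # sequence context of the putative start codon
START_POS = 197            # position of the ATG in the 779-bp insert
STOP_POS = 698             # position of the stop codon
RESIDUES = 166             # length of the encoded protein (aa)
AVG_RESIDUE_DA = 110.0     # rule-of-thumb average residue mass (assumption)

atg = KOZAK_CONTEXT.index("ATG")
minus3 = KOZAK_CONTEXT[atg - 3]  # base three positions upstream of the A
plus4 = KOZAK_CONTEXT[atg + 3]   # base directly after the ATG
kozak_ok = minus3 in "AG" and plus4 == "G"

orf_span_bp = STOP_POS - START_POS
approx_kda = RESIDUES * AVG_RESIDUE_DA / 1000.0

print(f"Kozak positions -3/+4: {minus3}/{plus4} -> favorable: {kozak_ok}")
print(f"start-to-stop distance: {orf_span_bp} bp")
print(f"estimated mass: {approx_kda:.1f} kDa")
```

The estimate of roughly 18 kDa is close to the value of about 19 kDa given above; an exact mass would require the full amino acid sequence.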
Northern blot analyses
of RNA isolated from human HeLa cells detected only one transcript with
an apparent size of 1.2 kilobases with p20-CGGBP cDNA (probe
p20-701) as hybridization probe (data not shown). This finding
suggested the presence of a large 5′-UTR of p20-CGGBP-specific RNA. Low
level expression of this RNA was found in a number of human tissues by
dot blot RNA hybridization. In contrast to very low level expression in
whole fetal brain, expression was high in adult cerebellum and cerebral
cortex but low in other regions of the adult brain. RNA isolated from
human placenta, thymus, and lymph nodes also contained relatively high levels of a p20-CGGBP-specific RNA. These results agree with the observation that several EST clones with extensive homologies to the
clone ID269133 (35) had been isolated from different human tissues as
well as from mouse and rat. Moreover, DNA from several mammalian
sources yielded specific signals with the p20-CGGBP cDNA probe,
whereas DNA from non-mammalian species, except chicken DNA, did not
(Fig. 4). This high degree of conservation of the p20-CGGBP sequence among mammals and its expression in a variety of
human tissues are consistent with our observation that an activity
binding to the double-stranded 5′-d(CGG)n-3′ trinucleotide
repeat was detected in a large number of human and mammalian cell lines
(24). However, a yeast sequence homologous to the EST clone ID269133
was not found. Analyses of the derived amino acid sequence did not
reveal any overall homology to known proteins. Computer-aided sequence
analyses detected a putative nuclear localization signal between aa 69 and 84 (Fig. 3, bottom panel).
The results presented are in accordance with the finding that protein-DNA complex formation between p20-CGGBP and its target sequence cannot be competed by a set of oligodeoxyribonucleotides carrying consensus binding sequences for known DNA-binding proteins. We, therefore, conclude that p20-CGGBP is a novel DNA-binding protein.
Chromosomal Localization of the Gene Encoding p20-CGGBP
A somatic cell hybrid panel with genomic DNA from mouse or hamster hybrid
cell lines, each carrying one specific human chromosome, was hybridized
to the complete insert of EST clone ID269133 (probe p20-779). A
strong human DNA-specific signal was detected with DNA from human
chromosome 3 (arrowhead in Fig. 5). The
intense cross-hybridization to the corresponding rodent genes (mouse
and hamster) of p20-CGGBP was consistent with its high degree of
conservation in mammals.
The Fusion Protein p20-CGGBP-6xHis Binds Sequence-specifically to the Double-stranded Trinucleotide Repeat 5′-d(CGG)n-3′
The binding
specificity of the protein encoded by the 501-bp ORF of EST clone
ID269133 was further characterized. The protein was expressed in
E. coli which was chosen as a host because an activity
similar to p20-CGGBP did not occur in this procaryote (see also
Fig. 6C). A histidine-hexamer was introduced
at the N terminus of the coding sequence of p20-CGGBP starting with the second amino acid to avoid internal translational start sites. The
expressed fusion protein p20-CGGBP-6xHis showed an apparent molecular
mass of 20 kDa upon electrophoresis in SDS-polyacrylamide gels as
expected (Fig. 6A). Purification of the recombinant protein by Ni2+-chelate affinity chromatography was feasible only
under native conditions and resulted in pure p20-CGGBP-6xHis (Fig.
6A). As a control, the mammalian enzyme dihydrofolate
reductase (DHFR-6xHis) was also expressed and purified from bacteria
under similar experimental conditions (Fig. 6A).
Bacterial lysates were prepared from bacteria induced for the expression of p20-CGGBP-6xHis or for DHFR-6xHis as well as from uninduced bacteria. The lysates were tested for the presence of proteins capable of binding to the oligodeoxyribonucleotide (CGG)17ds. Formation of the specific complex cI between (CGG)17ds and bacterial proteins was detected only in lysates prepared from bacteria expressing the fusion protein p20-CGGBP-6xHis (Fig. 6B) and not from those expressing DHFR-6xHis (Fig. 6C).
The specificity of the complex cI between the fusion protein
p20-CGGBP-6xHis and the double-stranded target sequence was assessed with purified recombinant p20-CGGBP-6xHis that was bound to
(CGG)17ds in the presence of oligodeoxyribonucleotides
containing other trinucleotide or related sequence repeats (see Table
I). The formation of the complex cI, which exhibited a
mobility similar to that of the complex formed with p20-CGGBP isolated from HeLa
nuclei, was competed only by the homologous oligodeoxyribonucleotide
(CGG)17ds (Fig. 6B) and not by
oligodeoxyribonucleotides carrying different trinucleotide repeats,
such as (CAG)17ds, (CGA)17ds,
(TGG)17ds, and (CAA)17ds (Fig. 6B).
Furthermore, no competition was detected with oligodeoxyribonucleotides
containing base pair exchanges at the first base of every other triplet
repeat. In addition, the oligodeoxyribonucleotide FraxFds, which was
isolated from the FraxF locus (41) and contained only eight
5′-d(CGG)-3′ repeats adjacent to other triplets, failed to compete for
binding. Thus, more than eight 5′-d(CGG)-3′ repeats were apparently
required for proper binding. These competition patterns were identical to those described previously for HeLa nuclear extracts or for p20-CGGBP purified from them (24). However, the binding affinity of
purified recombinant p20-CGGBP-6xHis was lower than that of the
authentic p20-CGGBP. Possibly the recombinant protein could have been
partly inactivated during or after purification by metal chelate
chromatography, since the binding activity of the raw bacterial lysate
was quite high (data not shown). The addition of carrier protein
(e.g. bovine serum albumin) or of polyamines (e.g. spermine) increased the stability of the purified
protein slightly without influencing its binding specificity.
These DNA binding studies confirmed that the ORF in the EST clone ID269133 encoded the full-length cDNA for p20-CGGBP. Furthermore, p20-CGGBP was found to be sufficient for the formation of complex cI. Thus, involvement of other cellular proteins in the formation of complex cI was unlikely.
p20-CGGBP Also Binds to Interrupted Repeats and Tolerates an A/G Mismatch in Its Target Sequence
FMR1 promoter
sequences carrying more than 40 repeats of the 5′-d(CGG)-3′ stretch
appeared to be stable upon female transmission when these repeats were
interrupted by 5′-d(AGG)-3′ triplets at every 7th to 10th position
(42-44). The loss of these interrupting 5′-d(AGG)-3′ triplets at the
3′-end of a longer 5′-d(CGG)-3′ trinucleotide repeat was implicated as
a possible first step in the expansion.
Binding of purified p20-CGGBP or crude nuclear extract to the target
oligodeoxyribonucleotide (CGG)17ds was competed in the presence of the oligodeoxyribonucleotide CGG10AGGds (see Table I). This
oligodeoxyribonucleotide contained two 5′-d(AGG)-3′ interruptions and
an uninterrupted stretch of nine 5′-d(CGG)-3′ repeats (Fig. 7).
This result suggested that p20-CGGBP could indeed
bind to an interrupted repeat.
Similarly, the ds oligodeoxyribonucleotides
(CGG)17/CCG10CCT and CGG10AGG/(CCG)17 (see
Table I) also competed for the binding to (CGG)17ds (Fig.
7). These results indicate that either nine 5′-d(CGG)-3′ repeats are
sufficient for binding or that one mismatch (A/G or C/T) in the binding
sequence can be tolerated. Competition experiments with
oligodeoxyribonucleotides carrying a mismatch in the first position of
every second triplet (for details and sequences see
Table II) showed that the presence of several A/G mismatches obviously did not inhibit binding of p20-CGGBP to its target
sequence.
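The repeat counts that underlie these competition experiments can be reproduced from the Table I sequences. The Python sketch below expands the compact repeat notation and reports the longest uninterrupted run of a given triplet. The helper names are hypothetical. For FraxFds the repeat is counted as GCC on the strand written 5′ to 3′ in Table I, that is, as the complement of the CGG triplets on the opposite strand; this choice of reading frame is an assumption made here to match the count of eight repeats given in the text.

```python
import re


def expand(units):
    """Expand [(unit, count), ...] into a plain sequence string."""
    return "".join(unit * count for unit, count in units)


def longest_run(sequence, unit):
    """Largest number of back-to-back copies of unit found in sequence."""
    runs = re.findall(f"(?:{unit})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)


# Upper strands as listed in Table I.
cgg17 = expand([("CGG", 17)])
cgg10agg = expand([("CGG", 3), ("AGG", 1), ("CGG", 9), ("AGG", 1), ("CGG", 3)])
cgg8t = expand([("CGGTGG", 8), ("CGG", 1)])
fraxf = "GTCCCCCGCTGCCGTCGCCGTCGCCGTCGCCGCCGCCGCCGCCGCCGCCGCC"

print("(CGG)17ds :", longest_run(cgg17, "CGG"))
print("CGG10AGGds:", longest_run(cgg10agg, "CGG"))
print("CGG8Tds   :", longest_run(cgg8t, "CGG"))
print("FraxFds   :", longest_run(fraxf, "GCC"))
```

Read this way, the competitors that bind carry at least nine uninterrupted repeats, whereas FraxFds with eight and CGG8Tds with isolated single repeats do not compete, in line with the requirement for more than eight repeats.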
DISCUSSION
Mechanisms underlying instability and the physiological function
of trinucleotide repeats in the human genome are not understood. We
have, therefore, investigated proteins that interact
sequence-specifically with the unstable triplet repeat
5′-d(CGG)n-3′ in the human FMR1 gene and have
recently purified the 20-kDa protein p20-CGGBP from nuclear extracts of
HeLa cells. This protein binds exclusively to 5′-d(CGG)n-3′
repeats and not to any other unstable triplet repeat sequence (24). The
cloning of the cDNA encoding this protein has now become possible
with a strategy involving amino acid sequence determination by tandem
mass spectrometry and screening of EST data bases with the obtained
sequence tags.
From partial mass spectrometric sequence data of three peptides of the protein, a cDNA fragment has been identified in the EST data base using special software algorithms. Since the coding sequence of this protein is short, it is completely represented by the identified clone, omitting the need for any library screening and subcloning. To our knowledge this is the first reported example in which a combination of mass spectrometric sequencing, data base searching, and a generally available clone collection has made additional cloning efforts redundant. The rapid cloning of the cDNA of p20-CGGBP thus demonstrates the potential of mass spectrometric sequencing in conjunction with the screening of EST data bases. Once the protein had been purified on an SDS gel and the sequencing had started, it took only 2 weeks to have its clone available to express the protein and test for its putative function. The EST data base is thought to contain already more than half of all human genes (34). Since the size of the data base is still growing rapidly and due to efforts to obtain longer stretches of coding sequence, it is reasonable to expect that many human proteins may be amenable to the type of analysis described here.
Mass spectrometric sequencing has intrinsically favorable characteristics that make it the technique of choice for EST data base identifications. Much lower amounts of protein are sufficient for analysis as compared with conventional amino acid sequencing techniques such as Edman degradation (25). In one experiment several peptides can be fragmented. The chances that an analyzed peptide is represented in the EST data base are thus increased. The generation of sequence tags, short amino acid sequences together with their precise mass location in the peptide, from tandem mass spectra is relatively simple and can often be done automatically. Sequence tags have a high statistical search specificity making them a very powerful probe to locate the corresponding EST, even in the presence of DNA sequencing errors, as was the case here. We anticipate that mass spectrometric sequencing in conjunction with EST data bases will play an important role in cloning new proteins from organisms for which large EST data bases are available or from closely related organisms (45).
The authenticity of the isolated cDNA is supported by the finding that the recombinant protein p20-CGGBP-6xHis exhibits the same, although weaker, DNA-binding pattern as the purified protein. Most likely, p20-CGGBP is the exclusive protein partner in the p20-CGGBP-(CGG)17ds complex cI. It contains at least the DNA-binding domain of the involved proteins as shown by Southwestern blot analyses. Complex cI probably consists of more than one molecule of p20-CGGBP because it is highly sensitive to deoxycholate treatment (24), suggesting the involvement of protein-protein interactions (46).
The physiological function of the novel DNA-binding protein p20-CGGBP
in triplet repeat function and instability cannot be derived solely
from its cDNA sequence due to the lack of homology to known
proteins. However, high conservation of p20-CGGBP among mammals as
confirmed by Southern blot analyses (Fig. 4) and computer-aided sequence analyses as well as its expression in a variety of human tissues point to an important, if not essential, function in mammalian cells. It is attractive to speculate that the 5′-d(CGG)-3′
repeat itself has regulatory functions in the expression of the adjacent FMR1 gene. It has been described that short trinucleotide
repeats 5′-d(CGG)-3′ or 5′-d(CTG)-3′ could function as regulatory
elements in at least two different mammalian promoters (16, 17). In this context, a report about two individuals carrying an expanded, but
unmethylated, 5′-d(CGG)-3′ repeat in the 5′-UTR of the FMR1 gene without the fragile X phenotype but with normal expression of FMRP
is very interesting (47). It is likely that the silencing of the FMRP
expression in FraX individuals is in part due to cytosine-specific DNA
methylation of the amplified repeat and adjacent sequences. Interestingly, p20-CGGBP can bind to an interrupted or uninterrupted FMR1 triplet repeat but not to a highly methylated 5′-d(CGG)-3′
repeat in vitro (24). The elucidation of the cDNA sequence of
p20-CGGBP now provides a basis for the study of its role in triplet
repeat function and expansion in mammalian and non-mammalian cells.
We thank Helmut Deissler, Institute of Cell Biology, University of Essen Medical School for help with the sequence analyses and for valuable comments on the manuscript and Sandra Kühn for providing D. melanogaster DNA. We are grateful to Petra Böhm for expert editorial work.