(Received for publication, January 28, 1997, and in revised form, April 28, 1997)
MUC5B, which maps in a cluster with MUC6, MUC2, and MUC5AC to
chromosome 11p15.5, is a human mucin gene whose genomic
organization is being elucidated. We have recently published the
sequence and the peptide organization of its huge central exon, 10,713 base pairs (bp) in length. We present here the genomic organization of
its 3′ region.

Mucus is the layer that covers, protects, and lubricates the
luminal surfaces of epithelial respiratory, gastrointestinal, and
reproductive tracts. These basic properties are due to the viscous and
viscoelastic properties of mucins, the major glycoprotein components of
mucus. Mucins constitute a family of high molecular mass glycoproteins
synthesized by the goblet cells of the epithelia and in some cases by
submucosal glands (for more complete reviews, see Refs. 1-3).
Alterations of the biosynthesis of mucins affecting the protein core
and/or the carbohydrate content linked to the peptide have been
observed in numerous pathological situations, including various adenomas
and carcinomas and inflammatory diseases such as cystic fibrosis, asthma,
chronic bronchitis, and inflammatory bowel diseases (4-7). Moreover,
the hypersecretion of mucins and the presence of alternating
hydrophobic and hydrophilic domains in mucins have been shown to play a
central role in the pathogenesis of cholesterol gallstones (8, 9).
All apomucins contain tandemly repeated sequences rich in threonine
and/or serine. Due to the high carbohydrate content, the peptide moiety
of mucins has been difficult to characterize. cDNA cloning has
enabled researchers to approach the study of the mucins over the past
decade. Today, the membrane-associated mucin MUC1 and the secreted MUC7
are the only mucins for which the full-length cDNA and the genomic
organization have been reported (10-13). Both were revealed to be, in
fact, small mucins. A complete cDNA of the large secreted mucin
MUC2 (14-17) has been described. Partial cDNAs have been
identified for the other human mucin genes that code for secreted
mucins: MUC3 (18), MUC4 (19), MUC5AC (20-24), MUC5B (25), and MUC6
(26).
Four mucin genes are mapped to 11p15.5: MUC5AC,
MUC5B, MUC2, and MUC6 (26-28). Recently,
we have determined that the order of the four clustered 11p15.5 human
mucin genes is tel-MUC6/MUC2/MUC5AC/MUC5B-cen (29). We have
also established that MUC2, MUC5AC, and MUC5B have a consensus
cysteine-rich domain found twice in MUC2 (16), at least four times in
MUC5AC (21, 22, 24), and seven times in MUC5B (30).
MUC5B is expressed mainly in bronchus glands and also in
submaxillary glands, endocervix, gall bladder, and pancreas (31-35). The structural organization of the peptide deduced from the nucleotide sequence of the central region of MUC5B has been published
recently (30). The single large exon of 10,713 bp, containing all the tandem repeat
domain, is, to our knowledge, the biggest described for a vertebrate
gene. It codes for a 3570-amino acid peptide. Nineteen subdomains have
been individualized. Most of the MUC5B subdomains show similarity to
each other, creating four larger composite super-repeat units of 528 amino acids. Each super-repeat is made up of a repetitive subdomain built from an
irregular repeat of 29 amino acids, one cysteine-rich subdomain (10 cysteine residues, 108 aa), and one unique sequence of 111 amino acid
residues also rich in serine and threonine. The complete organization
of the region downstream of the central region of the human
MUC5B gene, i.e. its complete 3′ region, is reported in this paper.

A human placental genomic DNA library in pWE15 cosmid provided by
Stratagene was screened using the JER57 probe. Two positive clones,
BEN1 and BEN2, were obtained (30). BEN2 was the useful clone in the
present study.
Oligonucleotide primers used in
PCR, RACE-PCR, RT-PCR, and sequencing experiments were synthesized by
Eurogentec (Liège, Belgium). Their sequences and locations are
indicated in Table I.
Table I.
Primers used for cDNA synthesis and DNA and cDNA sequencing
Unité 377 INSERM,
The 3′ region of MUC5B encompasses 10,690 bp, and its genomic sequence has
been completely determined. It is
composed of 18 exons ranging in size from 32 to 781 bp, in contrast
to the very large central exon. The sizes of the 18 introns range from 114 to 1118 bp. Some repetitive sequences were identified in
four introns. The sequence of the 18 exons
encodes an 808-amino acid peptide. This carboxyl-terminal region
exhibits extensive sequence similarity to MUC2, MUC5AC, and von
Willebrand factor, particularly the number and the positions of the
cysteine residues, suggesting that this domain may be derived from a
common ancestral gene. The presence in these components of a cystine
knot also found in growth factors such as transforming growth
factor-β is of particular interest. Moreover, one part of this
peptide is identical to the 196-amino acid sequence deduced from the
cDNA clone pSM2-1, which codes for a part of the high molecular
weight mucin MG1 isolated from human sublingual gland. Considering the
expression pattern of MUC5B and the origin of MG1, we can
thus conclude that MUC5B encodes MG1.
We present here the complete genomic nucleotide
sequence, the exon-intron organization, and the full cDNA sequence
coding for the carboxyl-terminal domain of the human MUC5B apomucin.
This domain spans 808 amino acid residues and can be divided into six subdomains. The last five cysteine-rich subdomains exhibit extensive sequence similarity to MUC2, MUC5AC, and von Willebrand factor (vWF) (17, 22, 23,
36), particularly the number and the positions of the cysteine
residues, suggesting that this domain may be derived from a common
ancestral gene. Moreover, with the exception of one substitution, which
does not change the coded amino acid, one part of the cDNA sequence
we determined is identical to the nucleotide sequence of pSM2-1. This
cDNA codes for 196 amino acids in the carboxyl-terminal region of
the high molecular weight mucin MG1 isolated from human sublingual
gland (37). Considering the expression pattern of MUC5B (31) and the
origin of MG1, we can thus conclude that MUC5B encodes
MG1.
Screening of cDNA and Genomic Libraries
A λgt11 cDNA library constructed from human tracheal mucosa was screened
with rabbit antibodies raised to deglycosylated Pronase glycopeptides
from bronchial mucins (38). Among the various positive clones obtained,
the one designated TH71 and containing a poly(A) tail was of particular
interest in the present study.
A human genomic λEMBL4 phage library was screened by hybridization
with the JER57 probe (25). One positive clone, CEL5, was isolated and
studied.
Primer designation  Primer sequence (5′ to 3′)  Position     Orientation (a)
NAU61               ACTCAATGCTCAGGGTTTATTTGC    10582-10605  AS
NAU67               GGGTTTATTTGCAAAACTG         10575-10593  AS
NAU71               AGTGCTGATTGCACACTGCGT       838-859      AS
NAU102              CCTGTCGCAGCTTCCTGGCAG       10446-10466  AS
NAU106              CAGTGAGCATAGGGGAAGCCT       3387-3407    S
NAU127              AGGCTTCCCCTATGCTCACTG       3387-3407    AS
NAU128              CGTGTCCACTGTGTCCTCCTCAGTC   1-25         S
NAU140              GATGGCGGAGGGCTGCTTCTG       5139-5159    S
NAU141              CAGACCGTGTGCACGCAGCAC       1001-1021    AS
NAU142              CCAGGGTAGGACTCCTGAGTG       10246-10266  AS
NAU151              TGAGCAGCGGTTTCAGCAAGA       3168-3188    S
NAU152              CAAGGTTGTGGCACTCAGCAA       3837-3857    AS
NAU196              CGAGGGTTCAGTGTCGGTG         6013-6031    S
NAU200              CAGTGTCCTTACCGGGAGA         2221-2239    AS
NAU203              ATTTAGGAAACCCATCGGGT        5689-5708    AS
NAU207              CGCGGGGTGCCACACACAGGCC      10142-10163  AS
NAU208              GGGTGTAGGTGTGCAGGATGG       9927-9947    AS
NAU219              GCAGGGAAGGGCGCCTGGGAA       7394-7414    AS
NAU226              AGCGGAAGGTGGGACAGCAGT       6620-6640    AS
NAU227              ACTGCTGTCCCACCTTCCGCT       6620-6640    S
NAU232              CTTCCCAGGCGCCCTTCCCTGC      7393-7414    S
NAU233              CTGCGAGACCGAGGTCAACATC      9113-9134    S
NAU234              GATGTTGACCTCGGTCTCGCAG      9113-9134    AS
NAU249              CTCCTCACAGGAGTAGCAGC        8814-8832    AS
NAU277              CAGTGACTGGCGAGGTGCAACTG     3973-3995    S
NAU278              GTATGGGGCCGCATGCGTTGTACACT  4624-4649    AS
NAU280              TGGACAGATGCCCAGGGTTGA       5901-5921    S
NAU281              TGCCATTGTACGAACACAGCT       6776-6796    AS
NAU282              CTGCAGGCCCCATTGGGTCAT       7297-7317    S
NAU293              ATGAGCCGTGGATGGGGTCCC       1195-1215    S
NAU297              TCATGGTCCTGGGCGGCTCCT       5277-5297    AS
(a) Strand orientation: sense (S), antisense (AS).
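Several primers in Table I share the same coordinates but have opposite orientations (for example NAU106/NAU127, NAU227/NAU226, and NAU233/NAU234), so each antisense primer should be the reverse complement of its sense partner. The following minimal Python sketch illustrates that check; the helper names (reverse_complement, is_complementary_pair) are ours and are not part of the original work.

```python
# Complement table for unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


# Sequences copied from Table I; comments give position and orientation.
PRIMERS = {
    "NAU106": "CAGTGAGCATAGGGGAAGCCT",   # 3387-3407, S
    "NAU127": "AGGCTTCCCCTATGCTCACTG",   # 3387-3407, AS
    "NAU226": "AGCGGAAGGTGGGACAGCAGT",   # 6620-6640, AS
    "NAU227": "ACTGCTGTCCCACCTTCCGCT",   # 6620-6640, S
    "NAU233": "CTGCGAGACCGAGGTCAACATC",  # 9113-9134, S
    "NAU234": "GATGTTGACCTCGGTCTCGCAG",  # 9113-9134, AS
}


def is_complementary_pair(sense: str, antisense: str) -> bool:
    """True if the antisense primer is the reverse complement of the sense primer."""
    return reverse_complement(PRIMERS[sense]) == PRIMERS[antisense]


if __name__ == "__main__":
    for s, a in [("NAU106", "NAU127"), ("NAU227", "NAU226"), ("NAU233", "NAU234")]:
        print(s, a, is_complementary_pair(s, a))
```

The same function can be used to convert any antisense primer of Table I into the sense-strand sequence before searching for it in the genomic sequence.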
The 5′-AmpliFINDER RACE
kit (CLONTECH) was used to synthesize first-strand
cDNA from human trachea poly(A)+ RNA (1 µg) obtained
from CLONTECH using NAU61 as a primer (Table I),
followed by ligation of the 5′-ANCHOR adapter. The PCR was then
performed using the nested primer NAU67 (Table I and Fig. 1) and the
5′-ANCHOR primer. Nested PCRs involving a second or third round
amplification were carried out with 1 µl of the reaction mixture
obtained from each previous round of PCR as template.
RT-PCR Amplification
Total RNA of human gall bladder was extracted as described previously (39). Single-stranded cDNA was synthesized using the 1st STRAND Synthesis kit (CLONTECH), random hexamers, and human trachea poly(A)+ mRNA (0.5 µg) (CLONTECH) or total gall bladder RNA (1 µg). PCR amplification reaction mixtures (50 µl) contained 0.3 mM dNTPs, 2.5 units of Taq DNA polymerase (Boehringer Mannheim), 15 pmol of the appropriate primers, the buffer system purchased with Taq DNA polymerase, and an aliquot of cDNA. The PCR was performed using a Perkin-Elmer Thermal Cycler 480. PCR parameters were 94 °C for 2 min, followed by 30 cycles at 94 °C for 30 s, 60 °C for 1 min, and 72 °C for 2 min, followed by a final extension at 72 °C for 15 min. The amplified products were electrophoresed on a 1% Seaplaque gel (FMC, Rockland, ME) and stained with ethidium bromide. The band was cut out, purified using Preps DNA purification resin (Promega), and subcloned into the T/A cloning vector, pMOSBlue T-vector (Amersham). Thereafter, cDNA clones were subcloned into pBluescript KS(+) vector (Stratagene) using the restriction enzymes (Boehringer Mannheim) PstI, SacI, and/or SmaI. Subclones were sequenced as described below using either universal primers or a series of oligonucleotides specific for both strands of the inserts (Table I).
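The cycling parameters given above can be written down as a small data structure, which also makes the total programmed hold time easy to compute. This is only an illustrative sketch: the step labels (denaturation, annealing, extension) are our conventional reading of the three temperatures, and ramp times between steps are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    label: str     # conventional name for the step (our assumption)
    temp_c: int    # block temperature in degrees Celsius
    seconds: int   # programmed hold time


INITIAL = Step("initial denaturation", 94, 2 * 60)
CYCLE = (
    Step("denaturation", 94, 30),
    Step("annealing", 60, 60),
    Step("extension", 72, 2 * 60),
)
N_CYCLES = 30
FINAL = Step("final extension", 72, 15 * 60)


def total_hold_seconds() -> int:
    """Sum of programmed hold times, excluding temperature ramps."""
    per_cycle = sum(step.seconds for step in CYCLE)
    return INITIAL.seconds + N_CYCLES * per_cycle + FINAL.seconds


if __name__ == "__main__":
    print(total_hold_seconds() / 60, "min")  # 2 + 30 * 3.5 + 15 = 122 min
```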
Isolation and Sequencing of MUC5B Genomic Clones and Sequence Analyses
Fragments of the genomic clones CEL5 and BEN2
corresponding to the region downstream of the central exon were
subcloned into pBluescript KS(+) vector as described previously (30).
The double-stranded plasmid inserts were sequenced manually by the
dideoxynucleotide chain termination method (40) using
[α-35S]dATP (Amersham) and Sequenase 2.0 (U. S. Biochemical Corp.) according to the protocol indicated by the
manufacturer. Universal primers or a series of specific
oligonucleotides were used. Sequencing reaction mixtures were
electrophoresed on 6% polyacrylamide gel (Sequagel-6TM, National
Diagnostics). The clones were sequenced on both strands several times.
Direct DNA sequencing on cosmid was performed as described previously
(30). Computer analyses were performed using PC/GENE Software. The
whole genomic sequence reported in this paper has been submitted to the
EMBL Data Bank with accession number [GenBank]. The sequence of TH71 has
been submitted to the EMBL Data Bank with accession number [GenBank].
To determine the exact number of
repeats in intron G, we first cut the subcloned genomic
BglII-BglII fragment with SacI and
RsaI, whose sites flank the region containing these direct 59-bp
repeats. Complete digestion with SmaI was obtained using
10 units/µg DNA for 3 h. Partial digestions were performed
using 1 unit/µg DNA and 0.25 unit/µg DNA for 1 h. After
electrophoretic separation on a 1.5% agarose gel, blot analysis was
conducted using the antisense oligonucleotide NAU199
(5′-AGAGCCGAGGGGTCTGGG-3′), which had previously been radiolabeled using T4
polynucleotide kinase (Boehringer Mannheim) and
[γ-32P]ATP (Amersham).
The partial restriction maps of the genomic clones CEL5 and
BEN2 were determined. Their overlapping parts present the same restriction map. The partial restriction map of the 3′ region of
MUC5B is shown in Fig. 1 together with the
overlapping fragments, which were separated and subcloned into
pBluescript KS(+) vector. The fragments
BamHI-SacII and PstI-SacII
(in the left part of Fig. 1) contain the 3′ end of the
central exon. All these clones were entirely sequenced after
restriction digestion and subcloning. Primer walking using specific
oligonucleotides (Table I) was also performed.
Several cDNA-positive clones were
obtained by screening the λgt11 cDNA library using antibodies as
described previously (38). The clone designated TH71 is 380 bp in
length. Its sequence (Fig. 2), submitted to the EMBL
data bank with accession number [GenBank], revealed a poly(A) tail with 73 A, 16 bp downstream from a polyadenylation signal (AAUAAA). By
sequencing the PstI-PstI subclone (noted with an
asterisk in Fig. 1) obtained from the fragment
NotI-BglII of the BEN2 clone, an identical 67-bp
sequence was observed (Fig. 2), up to the A where the poly(A) addition
occurs, indicating that the clone BEN2 contains the 3′ end of the
MUC5B gene. Using the two synthesized oligonucleotides NAU61
and NAU67 chosen in this sequence, a 5′-RACE-PCR experiment was
performed. After cloning of the fragment obtained, the insert of 88 bp
designated RACE67 was sequenced. This sequence is identical to the
88-bp sequence determined in the PstI-PstI clone
(Fig. 2). In contrast, the first 34 nucleotides differ from the
sequence of TH71. The TH71 clone, which has been found using the
antibodies directed against the repeat part of the MUC5B apomucin (38),
begins with a 132-bp sequence we found in the central exon. Between
this sequence and the 3′ end identical to the RACE67, TH71 seems to
have been rearranged; moreover, the following results show that an
important part of the cDNA has been lost. We will discuss these
data below.
The NotI-BglII fragment from BEN2 contains two other clustered canonical polyadenylation signals, AATAAA. The first was located about 2 kilobase pairs downstream from the first polyadenylation signal and the second 298 base pairs downstream from this latter AATAAA. The significance of these two additional polyadenylation signals is not known. It will be interesting to determine if several forms of MUC5B mRNA can be transcribed by selection of alternative polyadenylation signals.
The dinucleotides TG and GT were found with oligo(T) stretches in the
region downstream from the first AATAAA motif within the
PstI-PstI subclone. This region, referred to as
the "GT cluster," is important for 3′ processing of polyadenylated
mRNAs (41). Moreover, the pentanucleotide CATTG was found between
the AATAAA sequence and the poly(A) addition site (Fig. 3).
This CAYTG recognition element was described by Berget (42) as being
related to cleavage site selection. The author
suggested that pre-polyadenylated RNA hybridizes with the U4 small nuclear
ribonucleoprotein through the AAUAAA recognition element, which is related
to primary site selection, and through the CAYUG recognition element,
which is related to cleavage site selection. Hence,
MUC5B combines some common features of 3′ mRNA
processing. From this nucleotide sequence, the new oligonucleotide
NAU102 was synthesized to perform RT-PCR.
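The search for the 3′-processing elements discussed above (the AATAAA signal and a downstream CAYTG element, where Y is C or T) can be sketched with regular expressions. The toy sequence and the 30-base search window below are hypothetical and chosen only for illustration; they are not taken from the MUC5B sequence.

```python
import re

POLYA_SIGNAL = "AATAAA"
CAYTG = "CA[CT]TG"  # IUPAC Y = C or T


def find_motifs(seq: str, pattern: str) -> list:
    """Return 0-based start positions of all matches, including overlapping ones."""
    lookahead = re.compile(f"(?={pattern})")
    return [m.start() for m in lookahead.finditer(seq.upper())]


def cleavage_elements_after_signal(seq: str, window: int = 30) -> dict:
    """Map each AATAAA start to the CAYTG starts found within `window` bases downstream.

    The window size is an illustrative assumption, not a value from the paper.
    """
    caytg_starts = find_motifs(seq, CAYTG)
    hits = {}
    for start in find_motifs(seq, POLYA_SIGNAL):
        end = start + len(POLYA_SIGNAL)
        hits[start] = [p for p in caytg_starts if end <= p < end + window]
    return hits


if __name__ == "__main__":
    toy = "GGCAATAAACCCTGCATTGTTTGTGTTTT"  # hypothetical sequence
    print(cleavage_elements_after_signal(toy))  # {3: [14]}
```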
RT-PCR
Two specific overlapping cDNAs were synthesized by
RT-PCR experiments. The locations of the oligonucleotides used in these experiments are indicated in Fig. 1. The oligonucleotide primer NAU151
was designed on the basis of the sequence determined for the
BglII-BglII cosmid fragment (Fig. 1).
This fragment hybridized with human tracheal RNA on Northern blot and
probably contains coding sequences. An amplification product was
obtained when the RT-PCR was performed with the two primers NAU151 and
NAU102 using human tracheal first-strand cDNA as template. It was
designated RT151-102 and is 2209 nucleotides in length. Another RT-PCR
was then performed with the following oligonucleotide primers: NAU152, designed from the sequence of RT151-102, and NAU128, chosen in the 3′
end sequence of the MUC5B central exon (30). The resultant 1166-bp amplification product, called RT128-152, and RT151-102 were
cloned into pMOS-Blue T-vector. They were subsequently
subcloned into pBluescript KS(+) vector after cutting with the
restriction enzymes PstI, SacI, and/or
SmaI. The subclones were entirely sequenced on both strands
several times using T3 and T7 primers and specific oligonucleotides
(see Table I). The two amplification products RT128-152 and
RT151-102 have overlapping sequences of 416 nucleotides.
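As a simple arithmetic illustration, the length of the cDNA contig covered by the two overlapping RT-PCR products follows from their sizes and their overlap. The sketch below takes the three reported values at face value; the function name merged_length is ours.

```python
def merged_length(len_a: int, len_b: int, overlap: int) -> int:
    """Length of the contig formed by two fragments sharing `overlap` bases."""
    if overlap < 0 or overlap > min(len_a, len_b):
        raise ValueError("overlap must lie between 0 and the shorter fragment length")
    return len_a + len_b - overlap


RT128_152 = 1166  # bp, as reported in the text
RT151_102 = 2209  # nucleotides, as reported in the text
OVERLAP = 416     # nucleotides shared by the two products

if __name__ == "__main__":
    print(merged_length(RT128_152, RT151_102, OVERLAP))  # 1166 + 2209 - 416 = 2959
```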
The 3′ region of the
human MUC5B gene shown in Fig. 3 encompasses 10,690 bp, of
which the first 113 nucleotides correspond to the 3′ end of the central
exon we recently published (30). The full-length sequence has been
submitted to the EMBL data bank with accession number [GenBank].
The 3′ region of the MUC5B gene is composed of 18 exons ranging
in size from 32 to 781 bp (Table II), in good agreement
with the mean length of exons (43) and in contrast to the extraordinarily large central exon of MUC5B (30). The last exon is the
largest one. It codes for the 72-amino acid COOH terminus of the core protein and comprises the 3′-untranslated region, 564 bp in length, of
the MUC5B gene. The sizes of the 18 introns range from 114 bp to 1118 bp. Each intron begins with a GT and ends with an AG (Table
II), obeying strictly the GT/AG rule of splice-junction sequences
proposed by Mount (44).
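The GT/AG rule mentioned above can be checked mechanically once intron coordinates are known. The following sketch uses a made-up miniature gene; the sequences and coordinates are hypothetical, and only the rule itself (an intron starts with GT and ends with AG) comes from the text.

```python
def obeys_gt_ag(intron: str) -> bool:
    """True if the intron starts with GT and ends with AG."""
    s = intron.upper()
    return len(s) >= 4 and s.startswith("GT") and s.endswith("AG")


def splice(genomic: str, intron_spans: list) -> str:
    """Remove introns given as 0-based, end-exclusive (start, end) spans.

    Raises ValueError if any intron violates the GT/AG rule.
    """
    pieces, cursor = [], 0
    for start, end in sorted(intron_spans):
        intron = genomic[start:end]
        if not obeys_gt_ag(intron):
            raise ValueError(f"intron at {start}-{end} violates the GT/AG rule")
        pieces.append(genomic[cursor:start])
        cursor = end
    pieces.append(genomic[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    # Hypothetical two-exon gene: exon 1, one 12-bp intron, exon 2.
    genomic = "ATGGCC" + "GTAAGTTTTCAG" + "TGCTAA"
    print(splice(genomic, [(6, 18)]))  # ATGGCCTGCTAA
```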
The 2423-nucleotide open reading frame (Fig. 3) encodes an 808-amino acid peptide rich in cysteine (10.1%) and proline (9.5%). This region is relatively poor in threonine and serine (8.3 and 7.3%, respectively). It is thus different from a mucin-like domain. The comparison of a part of the deduced peptide sequence of RT151-102 (aa 634-829 in Fig. 3) with the deduced amino acid sequence of the cDNA clone pSM2-1 (37) shows 100% identity. At the nucleotide level, only one codon differs, since the proline in position 769 in our sequence (Fig. 3) is coded by CCC instead of CCG in the sequence of Troxler et al. (37). Consequently, this suggests that pSM2-1 is a part of the MUC5B gene. The pSM2-1 clone was isolated from a human sublingual gland cDNA library, screened with a polyclonal antiserum against deglycosylated MG1, the high molecular weight mucin from human sublingual gland. MG1 is a candidate, among other roles, for participation in enamel pellicle formation (45). MG1 is made up of multiple disulfide-linked subunits and contains numerous hydrophobic binding sites in naked regions with negatively charged amino acid residues (46). These characteristics are in very good agreement with our data, since such regions do exist in the 3570-amino acid peptide encoded by the central exon of MUC5B (30). Seven nonadjacent domains, termed Cys subdomains, have been individualized among the 19 subdomains encoded by the central exon. These Cys subdomains, found in several other apomucins, are richer in Cys (9.3%), Asp (4.9%), and Glu (7.7%) than the 12 other subdomains. The Cys subdomains are poor in Ser and Thr (Ser+Thr: 9.6%) versus tandem repeat domains, termed R domains (Ser+Thr: 52.5%). Moreover, Loomis et al. (46) suggested that aromatic residues of MG1 are buried within the hydrophobic domains. In fact, the Tyr and Trp amino acid residues are strikingly clustered in the Cys subdomains for the MUC5B apomucin.
What is more, the MG1 glycoprotein and MUC5B mRNAs are both expressed in salivary glands among other mucosa for MUC5B (31, 37). We can thus conclude that MUC5B encodes the MG1 apomucin.
The deduced amino acid sequence of the carboxyl-terminal region of
MUC5B contains 15 consensus sequences for attachment of N-linked oligosaccharides (italic in Fig. 4).
Structure predictions were performed using the PC/GENE
software (47). The secondary structure of the carboxyl-terminal region
of MUC5B was predicted to contain 62% β turn conformation and 13%
α helix structure located between aa 219 and 250, 407 and 421, 776 and
801. The rest of the structure consists of extended and coil
structures. The rigid conformation could be essential for the
oligomerization process. In fact 88% of the cysteine residues are
located in or near a β turn, as are 11 of the 15 potential
N-glycosylation sites. Moreover, a serum albumin family
signature with the consensus sequence
YX6CCX7C has also been
found between residues 658 and 682. As mucins are well known to bind
various hydrophobic substances such as cholesterol, fatty acids, or
bilirubin, this small region could be important, for example, in the formation of
gallstones, in which mucins have been described to be
involved (8, 9, 48, 49).
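Both sequence features discussed above can be located with simple pattern searches. The sketch below assumes the usual definition of the N-glycosylation sequon, Asn-X-Ser/Thr with X being any residue except proline (the text itself does not spell out the consensus), and reads YX6CCX7C as Y, six residues, CC, seven residues, C. The test peptide is hypothetical and is not the MUC5B sequence.

```python
import re

# Asn-X-Ser/Thr, X != Pro (commonly used sequon definition; our assumption).
# A lookahead is used so that overlapping sequons are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")

# Y-X(6)-C-C-X(7)-C, our reading of the "YX6CCX7C" signature in the text.
ALBUMIN_LIKE = re.compile(r"Y.{6}CC.{7}C")


def n_glycosylation_sites(protein: str) -> list:
    """1-based positions of the Asn residue of each potential sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(protein.upper())]


def albumin_signature_spans(protein: str) -> list:
    """1-based inclusive (first, last) residue spans matching Y-X6-CC-X7-C."""
    return [(m.start() + 1, m.end()) for m in ALBUMIN_LIKE.finditer(protein.upper())]


if __name__ == "__main__":
    toy = "MNCTAANPSGYAAAAAACCAAAAAAACNGT"  # hypothetical peptide
    print(n_glycosylation_sites(toy))    # [2, 28]; N7 is skipped (followed by Pro)
    print(albumin_signature_spans(toy))  # [(11, 27)]
```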
As far as TH71 is concerned, we can now estimate that more than 2000 bp have been lost in this cDNA. We were unable to reproduce this cDNA using RT-PCR. It may be concluded that a problem occurred when this clone was produced; the clone was nevertheless of great interest in determining the location of the polyadenylation signal in the genomic DNA.
Deduced Amino Acid Sequence of MUC5B Carboxyl-terminal Region: Comparison with Other Proteins
Some partial alignments with the
sequence of vWF have been made by other authors, for example for MUC2
(16) and for MUC5AC-related cDNA clones, like NP3a (22) and L31
(23). Fig. 4 shows that an alignment on longer sequences can be
accomplished. The conservation of nearly all of the cysteine residues
and of several other amino acids of vWF and of the three 11p15.5 human
mucins MUC2, MUC5AC, and MUC5B is readily apparent, suggesting a very
similar tertiary structure. The comparison of the deduced
carboxyl-terminal MUC5B peptide with vWF especially allows us to
dissect this region into six domains: one domain called MUC11p15-type,
which follows the central exon, one 56-amino acid domain with
similarities to what we called the A3uD4 domain (located between the A3
and the D4 domains in vWF), one D4-like domain, one B-like domain, one
C-like domain, and one CK domain of 86 amino acid residues (see Fig. 5A).
The domain called MUC11p15-type (Fig. 5A) from aa 38 to 84 in MUC5B follows the central domain described previously (30). It shows similarities to MUC2 and MUC5AC, particularly with regard to the cysteine residues. This domain is somewhat different from the A3 domain of vWF.
The second domain, called A3uD4, also present in vWF, is 69 aa in length in MUC5B and spans aa 85-154. This domain is also found in MUC2 and L31, i.e. MUC5AC.
The vWF-D4 domain was found in MUC5AC, MUC2, and MUC5B (aa 155-533). D4 was found in zonadhesin, a sperm membrane protein that binds in a specific manner to the egg extracellular matrix from pig (50). Moreover, the D4 domain shows similarity to a part of vitellogenin found in nematode Caenorhabditis elegans, chicken, and frog (51). Among the well conserved peptide sequences, of which the positions are indicated for MUC5B, NC(S/T)YVL (aa 180-185), TXGXCGXC (aa 300-307), YAXLC (aa 417-421), CXDWR (aa 427-431), EGCFCP (aa 473-478) found in MUC2, MUC5AC, vWF and MUC5B, the TXGXCGXC octapeptide contains the vicinal cysteine residue motif CGXC, which also exists in vitellogenin and zonadhesin. Mayadas et al. (52) showed that these motifs are similar to the amino acid sequences at the active site of disulfide isomerases that catalyze thiol protein disulfide interchange. These vicinal cysteines may have the capacity to catalyze disulfide interchain formation, but Voorberg et al. (53) indicated that the dimerization resides in the last 151 residues for the vWF. Recently Perez-Vilar et al. (54) validated this hypothesis for PSM, showing that this apomucin can very likely form dimers between its carboxyl-terminal domains. Hence, our sequencing data suggest that the MUC5B apomucin is also able to form dimers between its carboxyl-terminal domains. The presence of disulfide-linked subunits was already predicted by Loomis et al. (46) in MG1. Moreover, Kawagishi et al. reported that MG1 contains at least two subunits (55), one of which is the salivary link component that weakly cross-reacts with antiserum to the human small intestinal link component, which contains N-linked carbohydrate (56).
Following the D4-like domain, one B-like domain of 40 aa residues was found in MUC5B (aa 534-574), MUC2, and MUC5AC instead of the three B domains defined for vWF (36). This B-like domain has not yet been found in other protein sequences.
Instead of the two C domains (C1 and C2) present in vWF, only one C-like domain was found in the three 11p15.5 mucins MUC5B, MUC5AC, and MUC2 (aa 575-740 in MUC5B). One C-like domain was previously reported in the frog integumentary mucin FIM-B.1 (57). This C-like domain has been described to be more related to the C1 than to the C2 domain. In contrast to vWF, in which the homologous C1 and C2 domains arose by duplication, this duplication did not occur in FIM-B.1 (57). In fact, our genomic study shows that the C-like domain of MUC5B is more related to the C2 domain. Extremely conserved intron positions can be shown for introns 45 and L, and for introns 46 and M. Thus C1 and C2 in vWF, and C-like domains in 11p15.5 mucin genes, probably have a common ancestor domain, which has duplicated into C1 and C2 in vWF.
The last domain found in MUC5B is the CK domain (for cystine knot) from
aa 741 to 826. The CK domain was also found in the 3′ end of the
secreted proteins MUC2, MUC5AC, FIM-B.1, and rat-Muc2 (58, 59). The CK
domain exists in other secreted proteins (60-62). Eleven cysteine
residues and some other amino acid residues within this CK domain are
nearly invariant. Molecular modeling of the Norrie disease protein (61)
predicts that this domain has a tertiary structure similar to that of
transforming growth factor β (TGF-β). In TGF-β, seven cysteine
residues, corresponding to cysteines 741, 764, 768, 787, 788, 818, and
820 in MUC5B, are nearly invariant (61). Crystallography studies of
TGF-β2 have shown that six of these cysteine residues are closely
grouped to make a rigid structure called the cystine knot (for reviews see 60, 63). Moreover, the determination of the crystal structure of
dimeric nerve growth factor and platelet-derived growth factor revealed
a structure similar to that of TGF-β2. Tertiary structure similarities probably account for a strong resistance, as suggested by
these authors, to heat, denaturants, and extremes of pH. The remaining
cysteine residue in each monomer, corresponding in MUC5B to Cys-787,
forms an additional disulfide bond that was found to link two TGF-β2
monomers into a dimer. Hence, the human mucins MUC5B, MUC5AC, and MUC2,
the animal mucins PSM, BSM, rat-Muc2, and FIM-B.1 and Norrie disease
protein may be members, with their 11 cysteine residues, of a new CK
subfamily.
Out of the 15 consensus sequences for attachment of N-linked
oligosaccharides (italic in Fig. 4), 10 sites are close to
those observed in the 3′ ends of MUC2 and/or MUC5AC (L31) and/or vWF. Four of them (positions aa 179, 298, 299, and 569 in MUC5B) have the
same positions in the deduced peptides of the three human mucin genes
mapped on chromosome 11p15.5. One site has the same position in the
three mucins and in vWF (aa 2223 in vWF and aa 463 in MUC5B). In
addition to the typical and expected O-glycosylation that
occurs in MUC5B, it is very tempting to speculate that this apomucin,
synthesized in the endoplasmic reticulum, is rapidly N-glycosylated. The polypeptide might fold to form
intramolecular interactions and then dimers through intermolecular
disulfide bridges within the carboxyl-terminal region. Although
previous studies on bovine vWF suggested that
N-glycosylation is not necessary for dimerization (64),
Wagner et al. (65) more recently reported that
N-linked carbohydrate addition onto human vWF is important for dimerization. In contrast, Perez-Vilar et al. (54)
demonstrated that PSM dimerization is not dependent on the
N-linked oligosaccharides within its carboxyl-terminal
domain. Further studies on the MUC5B apomucin, using recombinant protein
synthesis to obtain antibodies and culture of mucus-secreting cells
such as HT29-MTX in the presence of tunicamycin, will be required to clarify
the role of N-glycosylation.
The schematic organization of the carboxyl-terminal MUC5B gene is given in Fig. 5B. No alternative splicing was found using total RNA from gall bladder or poly(A)+ mRNA from trachea and the following pairs of primers: NAU128/NAU152, NAU151/NAU203, NAU140/NAU219, NAU227/NAU208, and NAU233/NAU67 (Table I).
Introns A, C, E, G, I, K, L, and Q are class 1, where each intron
interrupts the coding sequence between the first and second bases of
the codon (66). Introns F, H, and R are class 2 (the intron interrupts
the codon between the second and third bases), and introns B, D, J, M,
N, O, and P are class 0 (the intron occurs between codons). Introns B,
C, E, G, J, K, L, M, and N have the same positions as the introns 33, 34, 36, 37, 38, 40/41, 45, 46, or 43/47, respectively, in vWF (Fig. 4).
It must be emphasized that introns C and 34, E and 36, G and 37, K and
40 or 41, L and 45 are class 1, while introns J and 38, M and 46, and N
and 43 or 47 are class 0. We can then observe that the ORFs between the symmetrical introns C and E, between E and G, between G and K, between
K and L, between J and M and between M and N have flanking introns of
the same phase class at both their ends. Consequently these ORFs are
good candidates for exon shuffling especially when both ORF flanking
introns are class 1 (67). It would be interesting to determine if
MUC2, MUC5AC, and MUC5B have a common
3′-end gene organization. It could then be proposed that exon shuffling
mechanisms may have played an important role in the formation of genes
coding for proteins with D/B/C/CK domains, while a single ancestral
gene may have evolved by successive duplications to give rise to the 11p15.5 human mucin gene family. Clearly, much work remains to be done
and new data have to be collected concerning the three other 11p15.5
mucin genes MUC2, MUC5AC, and MUC6 to
confirm our hypothesis; in particular, their exon-intron organization
has to be elucidated.
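The intron classes used in this paragraph correspond to the position of the intron within the reading frame: the number of coding bases upstream of the intron, modulo 3. The sketch below encodes the class of each intron as listed in the text and checks that the intron pairs named as flanking candidate shuffling units are indeed of the same class. The function names and the numeric example are ours.

```python
# Intron classes (phases) of the MUC5B 3' region as listed in the text.
INTRON_CLASS = {
    "A": 1, "C": 1, "E": 1, "G": 1, "I": 1, "K": 1, "L": 1, "Q": 1,
    "F": 2, "H": 2, "R": 2,
    "B": 0, "D": 0, "J": 0, "M": 0, "N": 0, "O": 0, "P": 0,
}

# Intron pairs named in the text as flanking regions with same-class ends.
SYMMETRIC_PAIRS = [("C", "E"), ("E", "G"), ("G", "K"), ("K", "L"), ("J", "M"), ("M", "N")]


def intron_phase(coding_bases_upstream: int) -> int:
    """Class 0: between codons; 1: after the first base; 2: after the second base."""
    return coding_bases_upstream % 3


def is_symmetric(left: str, right: str) -> bool:
    """True if both flanking introns belong to the same class."""
    return INTRON_CLASS[left] == INTRON_CLASS[right]


if __name__ == "__main__":
    for left, right in SYMMETRIC_PAIRS:
        print(left, right, is_symmetric(left, right))
    # Hypothetical coordinates: two class 1 introns after 100 and 253 coding bases
    # enclose 153 coding bases, a multiple of 3, so the frame is preserved if the
    # enclosed region is inserted, deleted, or duplicated.
    print(intron_phase(100), intron_phase(253), (253 - 100) % 3)
```

A region flanked by introns of the same class always contains a multiple of three coding bases, which is why such regions can be shuffled without disturbing the downstream reading frame.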
In some introns, unique tandemly repeated sequences that are more or
less perfect are found: 23 copies of an imperfect 20-bp repeat in
intron A, 11 copies of an imperfect 10-bp repeat in intron C, 9 copies
of a perfect 59-bp repeat in intron G, and 12 copies of an imperfect
20-bp repeat in intron P. Searching of the GenBank data base indicated
that the consensus sequences of these four distinct repeats were not
identical with any registered sequence. It is striking that intron G is
75% (G+C)-rich and it is almost entirely built up of copies of a
perfect 59-bp repeat, CCTGTGCGGTGAGTGGGGGCGGCCCCGGGCCCCCCAGACCCCTCGGCCTCTCTGAGTGT. Each repeat contains one GC box binding site and one SmaI enzyme
recognition site. The first copy of this repeat begins in the 3′ end of
exon 6. To determine the exact number of repeats in this intron, we first cut the subcloned genomic BglII-BglII
fragment using the SacI and RsaI enzymes, whose sites flank
the region containing these perfect 59-bp repeats. Then we performed a
complete and two partial restriction digestions with SmaI
(for details see "Experimental Procedures"). NAU199 is an
oligonucleotide that recognizes a part of the 59-bp repeat. The results
shown in Fig. 6 led us to conclude that there are nine
59-bp repeats. In a previous study where single-stranded oligonucleotides were used, we found that a nuclear factor called NF1-MUC5B (68), extracted from the colonic mucus-secreting subclone HT29-MTX, binds this GC site. This factor, with a
Mr of 42,000, has been demonstrated not to be
Sp1. Biochemical studies are currently in progress in our laboratory to
characterize this nuclear factor.
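The logic of the repeat-counting experiment can be mimicked in silico: because each copy of the repeat carries one SmaI site (CCCGGG, cut as CCC^GGG), a complete digest of an array of n perfect copies releases n - 1 internal fragments of one repeat length. The sketch below uses the 59-bp repeat sequence given above; the flanking sequences are hypothetical placeholders, and the array of nine copies is built only to illustrate the counting.

```python
REPEAT = "CCTGTGCGGTGAGTGGGGGCGGCCCCGGGCCCCCCAGACCCCTCGGCCTCTCTGAGTGT"
SMAI_SITE = "CCCGGG"  # SmaI cuts blunt between CCC and GGG


def count_tandem_copies(seq: str, unit: str) -> int:
    """Length (in copies) of the longest run of back-to-back perfect copies of unit."""
    best = 0
    i = seq.find(unit)
    while i != -1:
        n, j = 0, i
        while seq.startswith(unit, j):
            n += 1
            j += len(unit)
        best = max(best, n)
        i = seq.find(unit, i + 1)
    return best


def smai_fragments(seq: str) -> list:
    """Fragment lengths after a complete SmaI digest of a linear sequence."""
    cuts = []
    pos = seq.find(SMAI_SITE)
    while pos != -1:
        cuts.append(pos + 3)  # cut after CCC
        pos = seq.find(SMAI_SITE, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Hypothetical flanks standing in for the SacI and RsaI ends of the fragment.
    fragment = "ATATATAT" + REPEAT * 9 + "TATATATA"
    print(count_tandem_copies(fragment, REPEAT))
    print(smai_fragments(fragment))
```

In a partial digest, fragments spanning one, two, three, or more repeats are produced, which is what yields a countable ladder on the blot.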
In summary, we have cloned and sequenced the whole genomic 3′ region of
MUC5B and defined the exon-intron organization. We have
proved that this gene codes for the high molecular weight salivary
mucin MG1. MUC5B is expressed essentially at high levels in
acini mucous cells of salivary and respiratory submucosal glands and in
epithelial cells of gall bladder, endocervix, and pancreas. This study
provides the first genomic organization of the 3′ region for a large
secreted gel-forming mucin gene. Our recent work showed that the
central domain of MUC5B is encoded by a single large exon (10,713 bp),
the largest one described to date in vertebrates. The deduced protein
contains 19 subdomains. Some of them show similarity to each other,
creating repeat units called super-repeats of 528 amino acid residues,
which are the biggest ever determined in mucin genes. Each super-repeat
comprises a 108-amino acid cysteine-rich subdomain. This last
subdomain, found seven times in MUC5B, has thus been found several
times in at least three of the four human mucin genes mapped to
11p15.5 (30). It seems that 11p15.5 human mucin genes are
characterized by (i) a large exon encoding the repetitive domain as
demonstrated for MUC5B and as suggested by Toribara et al.
for MUC2 (15), (ii) the presence of Cys subdomains with 10%
Cys residues (30), and (iii) a unique sequence just downstream from the
repetitive domain typical of the 11p15.5 mucin genes and a
cysteine-rich region, which is divided in several subdomains similar to
vWF-D4, vWF-B, vWF-C, and CK domains (Fig. 4). It will be interesting
to determine if the three other mucin genes MUC2,
MUC5AC, and MUC6 have the same 3′ end genomic
organization as MUC5B. Moreover, it is clear with our
previously published data (21, 29, 30) and with our present results,
that MUC5AC and MUC5B are two distinct genes;
therefore, it would be preferable for all authors to be precise in
specifying which gene is concerned when they write MUC5.
The nucleotide sequence(s) reported in this paper has been submitted to the GenBankTM/EMBL Data Bank with accession number(s) Y10080[GenBank] and Y09788[GenBank].
We are grateful for the technical assistance of Michel Crépin, Evelyne Destailleur, Viviane Mortelec, and Danièle Petitprez.
While this manuscript was in review,
Nielsen, P. A., Bennett, E. P., Wandall, H. H., Therkildsen, M. H.,
Hannibal, J., and Clausen, H. ((1997) Glycobiology
7, 413-419) identified MG1 as tracheobronchial mucin MUC5B.
On the other hand, Keates, A. C., Nunes, D. P., Afdhal, N. H., Troxler,
R. F., and Offner, G. D. ((1997) Biochem. J. 324, 295-303) published a partial genomic organization of the 3′ end of
MUC5B with some differences from our data.