(Received for publication, May 30, 1996, and in revised form, September 30, 1996)
From the INSERM 377 Laboratoire Gérard Biserte,
place de Verdun, 59045 Lille Cedex, France and the
¶ Laboratoire de Biochimie et de Biologie Moléculaire de
l'Hôpital C. Huriez, CHRU de Lille, 59037 Lille
Cedex, France
Human mucin gene MUC5B is mapped clustered with MUC6, MUC2, and MUC5AC on chromosome 11p15.5. We report here the isolation of three overlapping genomic clones of human MUC5B spanning approximately 40 kilobases. We have determined their partial restriction maps and the intron-exon boundaries of the central region encoding a single open reading frame. This coding region has been completely sequenced. Its length is 10,713 base pairs, and it encodes a 3570-amino acid peptide. Nineteen subdomains have been individualized. Some subdomains show similarity to each other, creating larger composite repeat units that we have called super-repeats. Four super-repeats of 528 amino acid residues are thus observed within the central exon. Each comprises (i) a subdomain composed of 11 repeats of the irregular repeat of 29 amino acid residues, (ii) a unique conserved subdomain with no typical repeat, and (iii) a cysteine-rich subdomain. This latter subdomain has high sequence similarity to the cysteine-rich domains described in MUC2 and MUC5AC. Sequence data of these three genes, together with their clustered organization, lead us to suggest that they may be a part of a multigene family. The super-repeat present in MUC5B is the largest ever determined in mucin genes and the central exon of this gene is, by far, the largest reported for a vertebrate gene.
Mammalian respiratory, gastrointestinal, and reproductive tracts are protected by mucus secretions, of which the major components are the mucins. The mucins form a heterogeneous group of high molecular mass, polydisperse, highly glycosylated macromolecules. They are synthesized and secreted by specialized cells in the epithelium.
Considerable advances have been made over the past years toward our understanding of the structure and function of mucin glycoproteins. The isolation of mucin cDNA clones introduced a new approach to the structure of the mucins. Until now, at least eight human mucin genes have been identified (see Ref. 1 for review). The chromosomal localization of these genes has been established: four of them, MUC2, MUC5AC, MUC5B, and MUC6, are clustered on 11p15.5 between the HRAS and IGF2 genes, MUC1 is on 1q21-24, MUC3 on 7q22, MUC4 on 3q29, and MUC7 on chromosome 4q13-21. Recently, a cDNA called pAM1 has been cloned from a human tracheal library and localized on chromosome 12 (2). Three novel cDNAs have been reported: NP3a from a human nasal polyp library (3), L31 from a HT29-MTX cell line library (4), and HGM-1 from a human stomach library (5). Their sequences show that they correspond to some parts of the MUC5AC gene. More recently, a novel cDNA (pSM2-1) from human sublingual gland has been described (6). MUC1, which is developmentally regulated and aberrantly expressed by carcinomas, encodes a membrane-associated mucin-like glycoprotein. In contrast, the other described genes code for secreted mucins (1). Mucins present extended arrays of tandemly repeated sequences, producing a protein core rich in potential O-glycosylation sites and having a high content of serine, threonine, proline, glycine, and alanine. The tandem repeat units vary in length from as few as 24 bp in MUC5AC (7) to 507 bp in MUC6 (8). The tandem repeat domain is flanked on either side by nonrepeat regions. In the MUC2 gene product, these tandem repeats are flanked by cysteine-rich subdomains of approximately 845 residues upstream and 700 residues downstream. Both cysteine-rich subdomains have sequences similar to the D-domains of human pro-von Willebrand factor (9, 10). Some parts of these D-domains are also found in NP3a (3) and in HGM-1 (5).
Moreover, the MUC2 gene has upstream to the tandem repeat a region of imperfectly conserved repeats flanked by another type of cysteine-rich domain (11). This latter domain has also been described in HGM-1 (5), twice in MUC5AC (12), and also in its related cDNAs (3, 4).
Mucins are essential for the protective properties of the mucus (13, 14). In addition it is becoming apparent that an abnormal expression of the mucin genes occurs in various disease states and in conditions associated with a high risk of adenoma or carcinoma (15-17). Therefore, the study of the factors responsible for the regulation of mucin expression is of great interest. With this goal, it is necessary to acquire a detailed knowledge of the genomic structure of the mucin genes. The complete sequences of MUC1 (18), MUC2 (11, 9, 19), and MUC7 (20) cDNAs have been described. However, only partial cDNA sequences are published for the other mucin genes. The whole genomic structure is only known for MUC1 (21), although the partial genomic organization of MUC7 has also been reported recently (22).
We have previously described four human tracheobronchial cDNA clones with degenerate 87-bp tandem repeats belonging to the human MUC5B gene (23). Our laboratory has focused considerable attention upon this gene which is expressed in mucous glands of tracheobronchial tissue, submaxillary glands, gall bladder, and endocervix (24-26).
In this paper, we present the partial restriction map of the three overlapping genomic clones of the MUC5B gene spanning more than 40 kb and the complete sequence of its central exon which encompasses 10,713 bp and codes for a 3570-amino acid peptide. Nineteen subdomains are individualized. Some subdomains show similarity to each other, creating larger composite repeat units that we call super-repeats. In addition to the tandem repeat of 29 amino acid residues as described previously (23), which we now define as being irregular or imperfectly conserved, four repeats of 528 amino acid residues are observed within the central exon. Each comprises 11 repeats of the irregular repeat of 29 amino acid residues, a unique conserved domain of 111 amino acid residues with no typical repeat but rich in threonine, serine, and alanine and a cysteine-rich region of 108 amino acid residues. This latter subdomain has sequence similarity with cysteine-rich domains of MUC2 and MUC5AC or its related cDNAs. This super-repeat is the largest ever determined in mucin genes.
DNA from 20 healthy unrelated volunteers was prepared from leukocytes. It was digested with the following restriction endonucleases: BamHI, BglII, EcoRI, HindIII, KpnI, PstI, XbaI, and XhoI. Fragments were separated by electrophoresis in phosphate buffer through 1% agarose gel, transferred to a nylon membrane, and hybridized as described previously (23).
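As an illustration of what a complete digestion with these enzymes does to a DNA molecule, the following minimal sketch locates recognition sites in a sequence and lists the resulting fragment sizes. The recognition sequences and top-strand cut offsets are the standard ones for the eight enzymes named above; the function names and the toy sequence are hypothetical and are not MUC5B sequence.

```python
# Recognition sequence and top-strand cut offset (bases from the start of
# the site) for the eight enzymes used for the Southern blots.
SITES = {
    "BamHI": ("GGATCC", 1),
    "BglII": ("AGATCT", 1),
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
    "KpnI": ("GGTACC", 5),
    "PstI": ("CTGCAG", 5),
    "XbaI": ("TCTAGA", 1),
    "XhoI": ("CTCGAG", 1),
}


def cut_positions(seq, enzyme):
    """Return the 0-based positions at which the top strand is cut."""
    site, offset = SITES[enzyme]
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)
    return positions


def fragment_sizes(seq, enzyme):
    """Sizes of the fragments from a complete digest of a linear molecule."""
    cuts = [0] + cut_positions(seq, enzyme) + [len(seq)]
    return [b - a for a, b in zip(cuts, cuts[1:])]


# Toy 36-nt sequence (illustrative only) with two BamHI sites and one EcoRI site.
toy = "AAAA" + "GGATCC" + "TTTTTTTT" + "GAATTC" + "CCCC" + "GGATCC" + "AA"

for name in ("BamHI", "EcoRI", "HindIII"):
    print(name, fragment_sizes(toy, name))
```

On a Southern blot only the fragments that hybridize with the probe are detected, so a real analysis would additionally keep only the fragments overlapping the probed region.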
Library Screening
JER57, the longest cDNA of MUC5B isolated previously (23), was first used as a probe to screen a genomic EMBL4 phage library (12). One positive clone, CEL5, was isolated and studied. Further screenings yielded only the same clone.
To isolate larger genomic clones of the MUC5B gene, we screened a human placenta genomic DNA library in pWE15 cosmids provided by Stratagene using the JER57 probe. Two positive clones BEN1 and BEN2 were isolated and studied.
Restriction Mapping of Cosmids
The restriction mapping strategy of Wahl et al. (27) was slightly modified as follows. Cosmids were digested to completion with the restriction enzyme NotI. For each of the other restriction enzymes used, one part of this NotI-digested cosmid was digested to completion and a second part partially digested with the enzyme in order to generate a set of fragments that began at the T7 or T3 promoters and ended at the site of cleavage of the chosen enzyme. These NotI-terminated digestion products were fractionated on an agarose gel (0.6%) and blotted to Hybond™-N+ membrane (Amersham Corp.) by capillary blotting overnight. The fragments were then mapped relative to the T7 or T3 promoters by hybridizing the blot with end-labeled oligonucleotide-sequencing primers specific for these promoters.
5′ RACE-PCR
The 5′-AmpliFINDER RACE kit (Clontech, Inc., Palo Alto, CA) was used to synthesize first strand cDNA from 2 µg of human tracheal poly(A)+ RNA obtained from Clontech with NAU58 as first primer (5′-TTGTAGCACATCTTGAAGACGCCC-3′, antisense nt 776-799) followed by the ligation of the 5′ anchor adapter. The RACE-PCR was performed in 50-µl reaction volumes containing 5 µl of 10× buffer, 5 µl of 10 mM deoxynucleoside triphosphates, 2.5 µl of reverse-transcribed target cDNA, 10 pmol of each primer (NAU57, 5′-CCTGCACACCAGGCCGAAGTG-3′, antisense nt 741-762 with underlined nucleotides added in 5′ to generate a BamHI restriction site, and 5′ anchor primer), and 1.5 units of Taq DNA polymerase (Boehringer Mannheim). After overlaying with 50 µl of mineral oil (Sigma), the mixture was denatured at 94 °C for 3 min followed by 30 cycles at 94 °C for 1 min, 71 °C for 1 min, and 72 °C for 2 min. The elongation step was extended for an additional 10-min period. Secondary amplification was performed using 2.5 µl of the primary amplification product. The thermal cycling protocol used was the same as for the primary RACE amplification.
Total RNA was extracted from a human gall bladder using the guanidine isothiocyanate/CsCl method (28, 29).
Reverse Transcription and Amplification
A sample (0.5 µg) of human tracheal poly(A)+ RNA (Clontech) and a sample (1 µg) of total RNA extracted from human gall bladder were reverse-transcribed with the 1st-STRAND™ cDNA synthesis kit (Clontech) using random primers according to the manufacturer's instructions.
The first strand cDNA (8 µl) and cosmid DNA (30 ng) were amplified by PCR with various primers: NAU112 (antisense) 5′-ACCAGGCTGGGCCTGGGCACGGCA-3′ (nt 3105-3129), NAU113 (sense) 5′-GACGACTACAGCCACTGCCCCAGTACCCTA-3′ (nt 1671-1697), NAU81 (sense) 5′-CCAACTGGACCCTGGCACAGGTG-3′ (nt 694-716), NAU82 (antisense) 5′-GACTGAGGAGGACACAGTGGACACG-3′ (nt 10601-10625), NAU128 (sense) 5′-CGTGTCCACTGTGTCCTCCTCAGTC-3′ (nt 10601-10625), NAU71 (antisense) 5′-AGTGCTGATTGCACACTGCGT-3′ (in the first exon downstream of the central exon), and NAU136 (sense) 5′-TTCAACTATGAAATCCGTGTGTTC-3′ (nt 8493-8516).
The thermal cycling protocol used was the same as the one described above except that the annealing temperature was 62 °C. PCR experiments were performed using a Perkin-Elmer apparatus.
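NAU82 (antisense) and NAU128 (sense) are both assigned to nt 10601-10625, which implies that they should be reverse complements of each other. The short sketch below checks this for the two sequences as printed above; the helper name `reverse_complement` is ours and is not part of any published protocol.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


# Primer sequences as listed in the text (5' to 3').
NAU82 = "GACTGAGGAGGACACAGTGGACACG"   # antisense, nt 10601-10625
NAU128 = "CGTGTCCACTGTGTCCTCCTCAGTC"  # sense, nt 10601-10625

same_site = reverse_complement(NAU82) == NAU128
expected_length = 10625 - 10601 + 1

print("NAU82 and NAU128 anneal to the same site:", same_site)
print("Primer length:", len(NAU82), "nt; coordinate span:", expected_length, "nt")
```

The same function can be used to orient any of the antisense primers onto the sense strand before locating them in the submitted sequence.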
Cloning of Amplification Products
RACE-PCR products were separated by electrophoresis. Parts of the gel containing bands of interest were excised and the DNA was purified using Glassmilk (BIO 101, Inc.) and cloned into pGEMT vector (Promega). PCR products were purified using Preps DNA purification resin (Promega) and cloned into pMOSblue vector (Amersham).
Cloning in pKS
The fragments of interest from phage or cosmid clones were subcloned into the pBluescript KS(+) vector from Stratagene.
Plasmid DNA Purification
We used the Wizard™ minipreps DNA purification system (Promega).
DNA Sequencing and Sequence Analyses
The clones were sequenced on both strands by the dideoxy chain termination method using α-35S-dATP with Sequenase version 2.0 (U. S. Biochemical Corp.), Sequitherm (TEBU), or the T7Sequencing™ kit (Pharmacia Biotech Inc.). They were sequenced using synthetic oligonucleotides corresponding to the T7 and T3 primers of the pKS plasmid and to the T7 and −40 primers of the pGEMT or pMOSblue vector. Part of the sequence was determined by primer walking using primers specific to the MUC5B gene. Analyses of nucleic acid and protein sequence data were performed using PC/GENE Software.
To sequence cosmid DNA (2 µg) directly with Sequenase version 2.0, annealing had to be carried out at 37 °C for 30 min and 5 pmol of primer had to be used, instead of the 0.5 pmol used when sequencing plasmid DNA. The nucleotide sequence reported in this paper has been submitted to the EMBL Data Bank with accession number Z72496.
Human genomic DNA from leukocytes of 20 healthy unrelated volunteers was digested with BamHI, BglII, EcoRI, HindIII, KpnI, PstI, XbaI, and XhoI. The sizes of the fragments obtained and hybridized with the JER57 probe are indicated in Table I. These results indicated the fragments of interest recognized with the JER57 probe. We isolated all these fragments from a phage or a cosmid genomic library and sequenced them to obtain the complete sequence recognized with the JER57 probe (see below).
The EMBL4 human genomic library was screened with the JER57 probe, and one positive clone with an insert of approximately 12 kb, called CEL5, was obtained. The JER57 probe hybridized with one BamHI-BamHI fragment of 1.1 kb (indicated in Figs. 1 and 2A) situated in the 5′ part of this clone. This fragment was completely sequenced. Other screenings of the EMBL4 genomic library only gave the same clone. This led us to screen a human placenta genomic DNA library in cosmid vector pWE15 using JER57 as probe. Two cosmid clones containing inserts of approximately 40 kb were obtained and called BEN1 and BEN2. The partial restriction map of the three clones (CEL5, BEN1, and BEN2) is indicated in Fig. 1. BEN1 contains a CpG island, since restriction sites such as BssHII and NotI are close together (30, 31). This island is located 24 kb from the 5′-end of the insert and corresponds to the I 8 CpG island on the large-scale map established for the 11p15.5 region (32). BEN2 overlaps the 3′ region of BEN1 over 16 kb and completely overlaps the CEL5 clone. We have found on these clones all the restriction fragments corresponding to the fragments recognized with the JER57 probe and observed on Southern blots of genomic DNA (Fig. 1 and Table I).
Strategy for the Sequencing of the Cosmid Genomic Fragments Hybridizing with the JER57 Probe
We prepared, subcloned into pBluescript KS(+) and sequenced all the fragments indicated in Fig. 2A at the top. Various restriction fragments derived from these clones were also subcloned to determine the entire sequence. We distinguished the fragments obtained from BEN2 only (with an asterisk). One fragment BamHI-BamHI of 1.1 kb was obtained from BEN2 and from CEL5 (noted with an asterisk and ++++). Since some fragments had the same lengths but were at various positions, we had to carefully isolate them starting from larger fragments unambiguously positioned, obtained for example only from BEN2, or for others only from BEN1. Due to an extremely high proportion of GC residues, many sequence problems were encountered. Subclones were thus sequenced several times before a reliable sequence was obtained.
Determination of Intron-Exon Boundaries on Each Side of the Central Exon
To obtain cDNAs upstream of the central tandem repeat region we used 5′ RACE-PCR with the primers NAU58 and NAU57 chosen in a nonrepeat region 5′ of the central region. The amplification products were analyzed by agarose gel electrophoresis. Four ethidium bromide-stained bands ranging from 0.4 to 1 kb were obtained (data not shown) and cloned into pGEMT vector. Several transformants were isolated and sequenced. The nucleotide sequence showed that the 3′-ends of all the clones sequenced overlap with the genomic sequence. The comparison of the sequence of the longest cDNA (985 bp) with the sequence of the 1.35-kb BamHI-KpnI fragment (noted with a vertical arrow on the left in Fig. 2A) from the cosmid BEN1 indicates an intron of 468 bp (nucleotide sequence not shown), thus identifying the 5′-end of the central exon. Splice acceptor and donor sequences agree with the "GT-AG" rule (33).
To determine the 3′-end of the central exon we performed RT-PCR using two primers (NAU128 and NAU71; their positions are shown in Fig. 2A) on human RNAs from tracheobronchial tissue and from gall bladder. We then compared the sequences obtained after subcloning into pMOSblue vector with those determined on fragments of the cosmid located 3′ of the central exon: the BamHI-SacII fragment of ~1.7 kb and the PstI-SacII fragment of ~1.3 kb (noted with vertical arrows on the right in Fig. 2A). The presence of an intron 27 bp downstream from the PstI site was evidenced by comparing the sequences of the two cDNAs obtained with the genomic sequence. Splice acceptor and donor sequences agree with the GT-AG rule (33).
In order to confirm that the sequence was in the correct order, we established restriction maps of overlapping fragments such as the 8-kb KpnI-KpnI fragment from BEN2 or the 5.5-kb HindIII-NotI 3′-end fragment from BEN1 (Fig. 1).
To confirm that this sequence forms only one exon, additional PCR products were produced using various pairs of primers (see Fig. 2A). We compared the lengths and the sequences of the fragments obtained by RT-PCR on gall bladder or tracheobronchial RNA with those obtained by PCR on genomic DNA (cosmids BEN1 or BEN2). Using NAU81 and NAU112 we obtained fragments of ~2500 bp, using NAU113 and NAU112 fragments of ~1400 bp, and using NAU82 and NAU136 fragments of ~2100 bp. In each case, we obtained the same lengths and the same sequences with either cDNAs or genomic DNA as templates. No intron was detected in this way. Thus the sequenced region consists of a single long open reading frame which extends over 10,713 bp (submitted to the EMBL Data Bank with accession number Z72496) coding for 3570 amino acid residues and leading to a polypeptide core with a calculated Mr of 370,000.
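The approximate product sizes quoted above can be compared with the sizes expected from the primer coordinates given under the amplification methods, assuming an intronless template. The sketch below does this arithmetic; the dictionary layout and the function name are illustrative choices of ours, and the coordinates are copied from the primer list as printed, so the results are only as exact as those coordinates.

```python
# Primer coordinates (nt, 1-based, inclusive) as listed in the methods.
PRIMERS = {
    "NAU81": (694, 716, "sense"),
    "NAU113": (1671, 1697, "sense"),
    "NAU136": (8493, 8516, "sense"),
    "NAU112": (3105, 3129, "antisense"),
    "NAU82": (10601, 10625, "antisense"),
}


def expected_product_size(sense, antisense):
    """Length of the amplicon spanning both primers on an intronless template."""
    s_start, _, s_strand = PRIMERS[sense]
    _, a_end, a_strand = PRIMERS[antisense]
    if s_strand != "sense" or a_strand != "antisense":
        raise ValueError("primers must be given as (sense, antisense)")
    if a_end < s_start:
        raise ValueError("antisense primer lies upstream of the sense primer")
    return a_end - s_start + 1


# Primer pairs used in the text, with the approximate observed sizes.
PAIRS = [("NAU81", "NAU112", 2500), ("NAU113", "NAU112", 1400), ("NAU136", "NAU82", 2100)]

for sense, antisense, observed in PAIRS:
    size = expected_product_size(sense, antisense)
    print(f"{sense}/{antisense}: expected {size} bp, observed ~{observed} bp")
```

An intron between a primer pair would make the genomic product longer than the cDNA product, which is why identical lengths from both templates argue for a single exon.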
Comparison with the cDNAs Described Previously
The nucleotide sequences of the previously isolated cDNAs JUL10, JER28, and JUL7 (23) have been positioned within the central exon. JUL10 is at position 5272-6258. JUL7 overlaps JER28. They are respectively positioned at 8510-10143 and at 9172-9732. Some small differences in sequence were observed between the genomic sequence determined here and these cDNAs as determined previously by us (23). As we did not succeed in positioning the JER57 sequence as reported previously (23), we redetermined its nucleotide sequence (accession number X74955, with correction submitted to the EMBL Data Bank). The reason for the discrepancy between this newly determined sequence of JER57, which is at position 8361-10222, and our previous work is explained under "Discussion."
These cDNAs were obtained by screening a human tracheobronchial tissue cDNA expression library using antibodies (23). Their deduced amino acid sequences allowed us to choose the appropriate open reading frame of the central exon continuous with that of their sequences.
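The relative positions of the four cDNAs within the central exon follow directly from the coordinates given above. As a minimal sketch (helper names are hypothetical), their lengths and mutual overlaps can be tabulated as follows.

```python
# Positions (nt, 1-based, inclusive) of the cDNAs within the central exon,
# as stated in the text.
CDNAS = {
    "JUL10": (5272, 6258),
    "JUL7": (8510, 10143),
    "JER28": (9172, 9732),
    "JER57": (8361, 10222),
}


def length(name):
    """Length in nucleotides of the region covered by a cDNA."""
    start, end = CDNAS[name]
    return end - start + 1


def overlaps(a, b):
    """True if the two cDNAs share at least one nucleotide position."""
    (a_start, a_end), (b_start, b_end) = CDNAS[a], CDNAS[b]
    return a_start <= b_end and b_start <= a_end


def contains(outer, inner):
    """True if `inner` lies entirely within `outer`."""
    (o_start, o_end), (i_start, i_end) = CDNAS[outer], CDNAS[inner]
    return o_start <= i_start and i_end <= o_end


for name in CDNAS:
    print(name, length(name), "nt")
print("JER28 within JUL7:", contains("JUL7", "JER28"))
print("JUL7 within JER57:", contains("JER57", "JUL7"))
print("JUL10 overlaps JER57:", overlaps("JUL10", "JER57"))
```

With these coordinates JER28 falls inside JUL7, which in turn falls inside the redetermined JER57, while JUL10 lies separately further upstream.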
Analysis of the Nucleotide Sequence and of the Deduced Amino Acid Sequence
Nineteen subdomains are individualized and are indicated in Fig. 2B. Seven code for cysteine-rich subdomains called Cys1 (aa 6-112), Cys2 (aa 176-283), Cys3 (aa 457-565), Cys4 (aa 986-1093), Cys5 (aa 1515-1622), Cys6 (aa 2213-2320), and Cys7 (aa 2742-2849). Their amino acid sequences are displayed in Fig. 3. Their average amino acid composition is given in Table II (column Cys). These subdomains are rich in cysteine residues (9.3%). Fig. 3 also shows the similarity between these cysteine-rich subdomains and homologous domains found in other mucins such as human MUC5AC or related cDNAs (3, 5), mouse Muc5ac (34), pig gastric mucin (35), human MUC2 (9), and rat MUC2 homologue (36). In addition to the remarkable conservation of the cysteine residues, the conservation of numerous amino acid residues (bold boxes or boxes in Fig. 3) such as tryptophan, proline, arginine, and glycine is to be noted, especially between MUC5B and MUC5AC and its related cDNAs. There is one conserved potential O-glycosylation site in each of these cysteine-rich subdomains (boxed together with a tryptophan residue in the upper part of Fig. 3). No potential N-glycosylation site exists. The Cys4 to Cys7 subdomains of MUC5B mucin are identical in sequence except at three positions. The average amino acid composition of the seven cysteine-rich subdomains (see column Cys in Table II) is very similar to that of domain III in HGM-1 (5) (Fig. 3). It is noticeable that all but 2 of the cysteine residues of the central exon are in the cysteine-rich subdomains. Likewise, 31 of the 32 tyrosine residues present in the central exon are in the cysteine-rich subdomains.
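The lengths and spacing of the seven cysteine-rich subdomains can be read off the residue coordinates listed above. The following sketch performs that arithmetic; variable names are ours, and the coordinates are exactly those quoted in the text.

```python
# Residue coordinates (1-based, inclusive) of the cysteine-rich subdomains.
CYS = {
    "Cys1": (6, 112),
    "Cys2": (176, 283),
    "Cys3": (457, 565),
    "Cys4": (986, 1093),
    "Cys5": (1515, 1622),
    "Cys6": (2213, 2320),
    "Cys7": (2742, 2849),
}

names = list(CYS)

# Length of each subdomain in amino acid residues.
lengths = {name: end - start + 1 for name, (start, end) in CYS.items()}

# Distance from the start of one subdomain to the start of the next.
spacings = {
    f"{a}->{b}": CYS[b][0] - CYS[a][0]
    for a, b in zip(names, names[1:])
}

for name, n in lengths.items():
    print(f"{name}: {n} residues")
for pair, d in spacings.items():
    print(f"{pair}: {d} residues between starts")
```

With the coordinates as printed, Cys3 to Cys4, Cys4 to Cys5, and Cys6 to Cys7 are each spaced 529 residues apart, one residue more than the 528 quoted for the super-repeat, which may simply reflect where the subdomain boundaries were drawn. The longer Cys5 to Cys6 spacing (698 residues) is what would be expected from the larger number of repeats in RIII.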
The first two subdomains between Cys1-Cys2 and Cys2-Cys3, subdomains R01 (64 amino acid residues) and R02 (174 amino acid residues), respectively, are enriched with threonine, serine, proline, and alanine (Table II), but no typical repeats or even imperfect repeats can be discerned. Searching of the GenBank™ data base indicated that these sequences were not identical with any registered sequence. A certain similarity with other mucins, especially MUC1 or MUC2, is due to the typical amino acid composition of mucins.
Five subdomains (RI to RV) composed of various numbers of 87-bp imperfect tandem repeats are encountered downstream of the regions coding for the cysteine-rich subdomains Cys3 to Cys7. The alignment of the corresponding nucleotide sequences is shown in Fig. 4A. Domains RI, RII, and RIV are very similar. Each contains 11 imperfectly conserved repeats of 87 bp. The comparison is more obvious when looking at the deduced amino acid sequences (Fig. 4B). The subdomain RIII contains the same 11 irregular repeats of 29 amino acid residues from RIII-6 to RIII-17 (except RIII-7). Moreover, RIII-1 to RIII-5 are very similar to the first five repeats of subdomains RI, RII, and RIV. The subdomain RV is composed of 22 irregular repeats. RV-1 to RV-5 are also very similar to the first five repeats of the subdomains RI to RIV. RV-6 to RV-17 (except RV-7 and RV-9) are similar to the 11 imperfectly conserved repeats of the other R-subdomains except for the insertion of the pentapeptide PTTTT in RV-14. Beneath the alignment is indicated the consensus sequence of this imperfectly conserved repeat and the percentages of the indicated amino acid residue(s). One serine residue has 100% conservation and 6 amino acid residues have more than 80% conservation. As far as all the five R-subdomains are concerned, 44 units have the 29-amino acid residue repeat out of the aligned 72 units. The others are composed of 28, 27, or 26 amino acid residues, although one contains 32 residues. The amino acid composition (Table II) shows an extremely high threonine content (37%) and high serine, proline, and alanine contents. There are numerous potential O-glycosylation sites. The sequence TXXP, which has been implicated as a major site for GalNAc addition, is found in most of the 72 repeats (5, 37). These regions are likely to be heavily glycosylated. 
Moreover, one potential N-glycosylation site exists in each R-subdomain, and it is noticeable that the surrounding sequences are exactly the same for the potential N-glycosylation sites situated in RI, RII, RIII, and RIV (TPPVP N TTATT).
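Both kinds of potential glycosylation site discussed here can be located with simple pattern searches. The sketch below assumes the usual N-glycosylation sequon N-X-S/T (X being any residue except proline) and the TXXP motif mentioned above; the function and pattern names are hypothetical, and the second peptide is a toy sequence, not MUC5B.

```python
import re

# Lookahead patterns so that overlapping matches are all reported.
TXXP = re.compile(r"(?=T..P)")           # Thr, any two residues, Pro
SEQUON = re.compile(r"(?=N[^P][ST])")    # Asn, any residue but Pro, Ser/Thr


def find_sites(pattern, peptide):
    """Return the 0-based start index of every match in the peptide."""
    return [m.start() for m in pattern.finditer(peptide.upper())]


# Sequence surrounding the potential N-glycosylation site shared by RI to RIV.
shared = "TPPVPNTTATT"
# Toy peptide used only to illustrate the TXXP search.
toy = "ATTAPTSSP"

print("N-X-S/T sequons in", shared, ":", find_sites(SEQUON, shared))
print("TXXP motifs in", shared, ":", find_sites(TXXP, shared))
print("TXXP motifs in", toy, ":", find_sites(TXXP, toy))
```

Such searches only flag potential sites; whether a given serine, threonine, or asparagine is actually glycosylated has to be established experimentally.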
Concerning the sequences called RI-end, RII-end, RIII-end, and RIV-end, the alignment of their nucleotide sequences (Fig. 5A) shows a striking similarity. A consensus sequence is indicated. The amino acid sequences are aligned in Fig. 5B. The 111 amino acid residues are identical to each other except at eight positions. Searching of the GenBank™ data base indicated that these sequences are not identical with any registered sequence. A certain sequence similarity with other mucin genes is only due to the high percentage of threonine, serine, proline, and alanine (Table II) encountered in these sequences. There are numerous potential O-glycosylation sites and five TXXP sequences are present in each R-end subdomain.
The central exon ends with a 75-amino acid peptide that we have called R03, since it consists of a sequence different from those of the other subdomains of the MUC5B central exon. It is enriched in threonine, serine, proline, phenylalanine, and valine (Table II), but no typical repeat exists. It is different from the sequences found at the 3′-ends of the tandem repeats of MUC5AC (3, 4) and of MUC2 (9).
Close examination revealed the existence of a super-repeat found four times within the MUC5B central exon that we have called UpA, UpB, UpC, and UpD (Fig. 2B). Restriction mapping and sequence determination have shown that the BamHI digestion sites approximately flank the super-repeats (Fig. 2A). The BamHI digestion sites are quite regularly distributed within the central exon. This super-repeat consists of an R-subdomain with 11 imperfect repeats of 29 amino acid residues followed by an R-end subdomain and a cysteine-rich subdomain. It contains 528 amino acid residues. The alignment of these four super-repeats has been performed (not shown since the alignments of the various sequences existing in these super-repeats have already been shown in Figs. 3, 4B, and 5B). They present a striking similarity except in the 11 positions noticed before (in the R-end and the cysteine-rich subdomains) and in 54 positions in the first part of the super-repeat composed of the 11 imperfectly conserved repeats of 29 amino acid residues.
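The composition of the super-repeat can be checked with a few lines of arithmetic using only figures given in this paper: 528 residues in total, an R-end subdomain of 111 residues, a cysteine-rich subdomain of 108 residues, 11 repeats of nominally 29 residues, and the 507-bp MUC6 repeat unit cited in the Introduction. This is a sketch of the bookkeeping, not a re-analysis of the sequence, and the variable names are ours.

```python
SUPER_REPEAT_AA = 528   # residues per super-repeat
R_END_AA = 111          # conserved R-end subdomain
CYS_AA = 108            # cysteine-rich subdomain
N_REPEATS = 11          # irregular repeats per R-subdomain
NOMINAL_REPEAT_AA = 29  # nominal repeat length
MUC6_REPEAT_BP = 507    # largest mucin repeat unit described previously

# Residues left for the 11 irregular repeats once the two other subdomains
# are subtracted from the super-repeat.
r_subdomain_aa = SUPER_REPEAT_AA - R_END_AA - CYS_AA
mean_repeat_aa = r_subdomain_aa / N_REPEATS

# Total that 11 perfectly regular 29-residue repeats would have produced.
regular_total_aa = N_REPEATS * NOMINAL_REPEAT_AA + R_END_AA + CYS_AA

# Size of the super-repeat in nucleotides and relative to the MUC6 repeat.
super_repeat_bp = SUPER_REPEAT_AA * 3
ratio_to_muc6 = super_repeat_bp / MUC6_REPEAT_BP

print("R-subdomain:", r_subdomain_aa, "residues")
print("Mean repeat length:", round(mean_repeat_aa, 2), "residues")
print("Total with regular 29-residue repeats:", regular_total_aa, "residues")
print("Super-repeat:", super_repeat_bp, "bp, i.e.",
      round(ratio_to_muc6, 2), "times the MUC6 repeat")
```

The mean repeat length of about 28 residues, slightly below the nominal 29, is in line with the irregular nature of the repeat, some units of which contain only 26 to 28 residues.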
We report here the isolation of three overlapping genomic clones of the human MUC5B mucin, one from a phage library (CEL5) and two from a cosmid library (BEN1 and BEN2) using the longest cDNA of MUC5B that we had previously isolated in the laboratory and called JER57 (23). MUC5B has been mapped clustered with MUC6, MUC2, and MUC5AC to chromosome 11p15.5. The partial restriction map of the three overlapping clones has been determined. We have also determined the intron-exon boundaries of the central region encoding a single open reading frame. This coding region has been completely sequenced. Its length is 10,713 bp, and it codes for 3570 amino acid residues. MUC5B is expressed in mucous glands of tracheobronchial tissue, submaxillary glands, gall bladder, and endocervix (24-26). RT-PCR experiments carried out on total RNA from human gall bladder or on human tracheal poly(A)+ RNA unambiguously demonstrated that this open reading frame consists of a single exon and does not contain any intron. It was not possible to amplify the complete length of this open reading frame by RT-PCR using the two primers, NAU81 and NAU82, that are on both ends of the central exon. However, by performing different amplifications using more closely situated primers, in particular in the MUC5B specific part of the cysteine-rich domains, we were able to produce fragments exhibiting the same lengths and the same sequences as those in the cosmid clones. No difference was found in the common sequences determined on the genomic clones from the two libraries, even though they came from two unrelated individuals. This gene differs from MUC1 and MUC2, for which variable number of tandem repeats polymorphisms have been demonstrated (38, 11). Since the repeat unit of 87 bp is imperfectly conserved, we had to sequence the whole central exon, while in MUC2 the authors were able to extrapolate the sequence repeated up to 100 times (9, 11).
The central exon is much longer than that of MUC1, which varies from 3.5 to 6.2 kb, depending on the number of repeats (38). MUC7, a non-gel-forming mucin, possesses a central exon of 2.2 kb (22). As far as MUC2 is concerned, the intron/exon distribution is not yet known. Toribara et al. (11) have indicated that the 5′ region of the GMUC clone isolated in their laboratory forms, together with the tandem repeat array, a single large exon. In the most common genotype for MUC2, this region extends over 8,700 bp. No information is available about the size of the exonic repeat region of the other mucin genes. A survey on intron and exon lengths has indicated that in vertebrates few exons are over 800 bp (39). Most vertebrate genes are composed of long introns and short exons. Composed of 10,713 bp, the central exon of MUC5B is much larger even than the 7572-bp exon of the gene for lipoprotein ApoB considered to be the largest one in vertebrates (40). The corresponding polypeptide coded by the central exon of MUC5B has a calculated Mr of 370,000.
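The sizes quoted in this comparison lend themselves to a quick consistency check. The sketch below uses only numbers stated in the paper (10,713 bp, 3570 residues, Mr 370,000, the 7572-bp ApoB exon, and the 800-bp figure from the exon survey); the variable names are ours and the interpretation of the results is hedged accordingly.

```python
EXON_BP = 10_713          # central exon of MUC5B
PEPTIDE_AA = 3570         # residues encoded
CALCULATED_MR = 370_000   # calculated Mr of the polypeptide core
APOB_EXON_BP = 7572       # ApoB exon previously considered the largest
TYPICAL_MAX_EXON_BP = 800 # few vertebrate exons exceed this length

# The exon length is an exact multiple of three.
triplets, remainder = divmod(EXON_BP, 3)

# Nucleotides not accounted for by 3570 codons.
extra_nt = EXON_BP - 3 * PEPTIDE_AA

# Average mass per residue implied by the calculated Mr.
mean_residue_mass = CALCULATED_MR / PEPTIDE_AA

# Size relative to the two reference figures given in the text.
ratio_to_apob = EXON_BP / APOB_EXON_BP
ratio_to_typical = EXON_BP / TYPICAL_MAX_EXON_BP

print("Triplets in the exon:", triplets, "remainder:", remainder)
print("Nucleotides beyond 3570 codons:", extra_nt)
print("Mean residue mass:", round(mean_residue_mass, 1), "Da")
print("Relative to the ApoB exon:", round(ratio_to_apob, 2))
print("Relative to an 800-bp exon:", round(ratio_to_typical, 1))
```

The exon thus holds one triplet more than the 3570 codons reported; how those three nucleotides are distributed at the exon boundaries is not stated here. The mean residue mass of roughly 104 Da is plausible for a core dominated by small residues such as threonine, serine, proline, and alanine.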
Nineteen subdomains have been individualized. Seven subdomains of 108 amino acid residues, called Cys1 to Cys7, contain 10 cysteine residues and show a very similar organization to that observed twice in human (11) and rat (36) MUC2, at least twice in human (12) and mouse (34) MUC5AC, as well as in other mucin cDNAs, which can now be identified as representing parts of the MUC5AC gene, i.e. NP3a (3), HGM-1 (5), and L31 (4). The structure of the cysteine-rich subdomains has been conserved over a long evolutionary time scale, and the evolutionary constraint has likely been maintained because of crucial disulfide bonds. Thus, this typical cysteine-rich domain seems to be a feature of at least three of the four human mucin genes located on 11p15.5; little information is available as to whether there is a similar domain in MUC6 (8). It is more and more tempting to speculate, as have Toribara et al. (8), that at least the MUC2, MUC5AC, and MUC5B genes on chromosome 11p15.5 are part of a multigene family whose members code for secreted mucins. The conservation of numerous amino acid residues in addition to the cysteine residues suggests an important role for these domains. It is noticeable that the MUC5B and MUC5AC genes show greater similarity with each other than with the MUC2 genes (from human and rat). Deletions of amino acid residues occur at the same positions in MUC5B and MUC5AC, while some observed deletions are characteristic of the MUC2 type. Moreover, MUC5B and MUC5AC are very close together in the cluster (32). Overall, these observations suggest that MUC5AC and MUC5B may have the same ancestral gene.
It is noteworthy that the mouse mucin gene homologous to human MUC5AC is mapped to a site on mouse chromosome 7 homologous to the location of the human secretory mucin gene cluster on human chromosome 11p15.5 (34). We do not as yet know if mouse possesses both Muc5ac and Muc5b. Perhaps only one gene exists.
The presence of multiple cysteine residues strongly supports the idea that the MUC5B gene codes for a secreted mucin as disulfide bonding is necessary for the formation of a mucus gel. Moreover, we can postulate that interactions may occur via cysteine-rich domains between several mucin molecules producing a complicated network. This can explain the observations made using electron microscopy of bronchial mucins which have been shown to form complex entangled structures (41) or "bush-like" aggregates (42). The cysteine-rich domains may play a highly conserved role such as the packaging of the mucins or in mediating interactions with numerous other proteins such as, for example, the association with bilirubin in the gall bladder matrix (43). Further studies will be required to more adequately clarify the role of these domains.
An examination of the newly determined sequence of JER57 cDNA led us to conclude that the changes in the reading frame emphasized in our previous work (23) were in fact due to errors in sequence determination because of difficulties encountered in determining the sequence of such imperfectly conserved repeat domains. This is a problem that we have now overcome. We will now call the 87-bp repeat "imperfectly conserved" or "irregular" rather than "degenerate" repeat as designated previously. The cDNAs previously isolated (23) essentially consist of parts of the R-subdomain. In fact, the repeat of 29 amino acid residues of the R-subdomains appears to be only one of the components of the super-repeat. This super-repeat is the largest ever determined in mucin genes. The largest described until now was the 507-bp repeat unit of MUC6 (8). The super-repeat in MUC5B is more than three times as long as the MUC6 repeat. Its particularity is that it is made of three different subdomains. Variation in the number of copies of this super-repeat has not yet been observed. An interesting aspect of the deduced polypeptide sequence is the alternating arrangement of the three types of subdomains. The four R-end subdomains show a striking similarity to each other. Their role is likely to be crucial.
The conservation of the potential N-glycosylation sites within the repeat may be important for the proper maturation of MUC5B providing the positions at which N-glycosylation occurs. Recently, Denny et al. (44) reported that molecular cloning of mucin apoproteins showed that the consensus sequence(s) for N-glycosylation is usually found outside of the repeat domain. For MUC5B, five out of the seven potential N-glycosylation sites are within the repeat region. The subdomains R01, R02, and R03 display amino acid compositions typical of mucins but do not contain any repeat.
The MUC5B gene is clustered in 11p15.5 with MUC2, MUC6, and MUC5AC (8, 32). Although mucins are characterized by having tandem repeat regions containing significant amounts of threonine, serine, alanine, and proline, the individual repeat units for each of these genes are very different in length and amino acid sequence. Moreover, the number of the cysteine-rich subdomains emphasized here, which are different from the D-domains of human pro-von Willebrand factor (10), differs considerably between MUC2 and MUC5B. The role of these two types of secreted mucins may be highly specialized as suggested by in situ hybridization experiments which show a different expression pattern for each gene (24). In fact, we will now be able to design new oligonucleotides and produce fusion proteins and antibodies in order to reinvestigate the expression of the MUC5B gene using in situ hybridization and immunohistochemistry.
Experiments performed to elucidate the entire genomic organization of MUC5B are now progressing as is the study of the regulatory regions. Whether the regulatory elements of MUC5B function coordinately with the regulatory elements of adjacent mucin genes in the cluster is an important question to answer in the future.
The nucleotide sequence reported in this paper has been submitted to the GenBank™/EMBL Data Bank with accession number Z72496 (HSMUC5BEX).
We are indebted to Dr. A. T. Nurden for help in improving the style of the paper. We thank Pascal Mathon and Danièle Petitprez for technical assistance.