(Received for publication, August 3, 1994; and in revised form, October 28, 1994)
Transcripts for the α1 chain of mouse type XVIII collagen
were found to be heterogeneous at their 5′-ends and to encode three
variant N-terminal sequences of the ensuing 1315-, 1527-, or
1774-residue collagen chains. The variant mRNAs appeared to originate
from the use of two alternate promoters of the α1(XVIII) chain
gene, resulting in the synthesis of either short or long N-terminal
non-collagenous NC1 domains, the latter being further subject to
modification due to alternative splicing of the transcripts. As a
result, the 1527- and 1774-residue polypeptides share the same signal
peptide, and the lengths of their NC1 domains are 517 or 764 amino acid
residues, respectively, while the 1315-residue polypeptide has a
different signal peptide and a 301-residue NC1 domain. The longest NC1
domain was strikingly characterized by a 110-residue sequence with 10
cysteines, which was found to be homologous with the previously
identified frizzled proteins belonging to the family of
G-protein-coupled membrane receptors. Thus, it is proposed that the
cysteine-rich motif, termed fz, represents a new sequence motif that
can be found in otherwise unrelated proteins. Tissues containing mainly
one or two NC1 domain mRNA variants or all three NC1 domains were
identified, indicating that there is tissue-specific utilization of two
alternate promoters and alternative splicing of α1(XVIII)
transcripts.
The collagen superfamily includes 19 collagen types (1, 2, 3). A feature common to all collagenous proteins is the presence of at least one triple helical sequence of a repeated Gly-X-Y motif. All collagenous proteins contain non-collagenous sequences at their N and C termini and often also within the collagenous sequences. Portions of the non-collagenous domains of several collagens can be aligned with sequences found in non-collagenous mosaic proteins such as fibronectin, von Willebrand factor, and thrombospondin, while other regions are exclusively found in the collagens (4).
The 19 types of collagen can be divided into two groups, fibrillar and non-fibrillar collagens, in terms of their primary structure and supramolecular aggregates (1, 2, 3). Characteristic of the structurally homologous fibril-forming collagens, i.e. types I-III, V, and XI, is that the repeated Gly-X-Y sequence is long and uninterrupted, while the members of the heterogeneous group of non-fibrillar collagens, types IV, VI-X, and XII-XIX, are all characterized by the presence of one or more interruptions in the collagenous sequence. Several kinds of molecular assembly have been found in the non-fibrillar collagens, and members of this group can be divided into distinct subgroups in terms of their structural homology.
A recent addition to the non-fibrillar collagens is type XVIII (5, 6, 7, 8). Elucidation of
the complete primary structure of the α1 chain of mouse type XVIII
collagen has revealed a 1315-residue polypeptide that includes a
25-residue signal peptide, a 301-residue N-terminal non-collagenous
domain (NC1), a 674-residue collagenous sequence with nine interruptions
of 10-24 residues, and a 315-residue non-collagenous domain
(NC11) (8). It is likely that heterogeneity occurs at the
N-terminal end, since the overlapping cDNA clones reported (6, 7) differ with respect to the first 27 amino acid
residues of the α1(XVIII) chain.
Interestingly, seven of the
collagenous domains of the α1(XVIII) chain and both flanking
domains share homology with the recently described α1 chain of type
XV collagen (6, 9, 10, 11), and it
has been suggested that they may form a new subgroup within the
collagen family (5, 6, 7). Their N-terminal
non-collagenous domains share sequence homology in that both contain an
approximately 200-residue sequence corresponding to the N-terminal end of
thrombospondin (7, 10). The C-terminal non-collagenous
domains of the α1(XVIII) and α1(XV) chains are unique to these
two chains and are highly homologous throughout their
sequences (6, 8). Despite the homology, the two chains
are sufficiently different to make it unlikely that they reside in the
same collagen molecule, implying that they are α chains of separate
collagen types. The human genes encoding these homologous proteins are
located on separate chromosomes, that for the α1(XVIII) chain on
21q22.3 (12) and that for the α1(XV) chain on
9q21-22 (13).
The purpose of this work was to
elucidate the nature of the sequence variability affecting the
N-terminal ends of the α1(XVIII) chains. Three variant N-terminal
sequences were identified, and one of these was found to contain a
cysteine-rich domain postulated here to represent a new sequence motif.
Furthermore, marked differences were observed in the tissue
distribution of the variant mRNAs corresponding to the three N-terminal
ends.
To isolate the extreme 5′-end
of the short α1(XVIII) variant, cDNA fragments were obtained
without the conventional library screening by the PCR method developed
for isolation of end fragments from yeast artificial chromosome
clones (16). A 3-µl aliquot of the pool of blunt-ended cDNA
fragments was ligated to linkers at a final linker concentration of
0.1 µM (for linkers, see Ref. 16) in a 25-µl
ligation reaction at room temperature for 4 h. The incubation was
stopped by adding 75 µl of water, and the strands were denatured by
heating for 10 min at 95 °C. 2 µl of the diluted cDNA pool was
used as the template in a 10-µl PCR reaction containing Dynazyme
polymerase buffer (Finnzymes, Finland), 0.2 mM of each
deoxynucleotide, 10 pmol of MIXX-23 (see below), 1 pmol of a 25-mer
linker primer (see Ref. 16), and 0.5 units of Dynazyme polymerase
(Finnzymes, Finland). The amplification conditions were 1 min at 94
°C, 45 s at 65 °C, and 1 min at 72 °C for 35 cycles. After
the first round of PCR, the reaction mixture was diluted with 250
µl of water, and 2 µl of the dilution was used for the second
round of PCR with the conditions as above except that 5 pmol of MIXX-24
(see below) primer and 5 pmol of the 25-mer linker primer were used.
The products synthesized were digested with EcoRI and subcloned into
the EcoRI site of Bluescript SK. After transformation, colony
hybridization was used to screen for recombinants containing
α1(XVIII) sequences.
To isolate cDNA clones covering the translation initiation codon of the long variant, the mouse embryo cDNA pool described above was used for PCR as described above. A 10-µl PCR reaction contained 2 µl of the cDNA pool, Dynazyme polymerase buffer, 0.2 mM of each deoxynucleotide, 5 pmol of the primer MIXX-39 (see below), 5 pmol of a 25-mer linker primer, and 0.5 units of Dynazyme polymerase, and the amplification was performed as above. The PCR products were characterized as above except that the probe used in the hybridization was the insert of the clone PE17.24 described above.
The oligonucleotide primers used were (bases added to generate a complete EcoRI restriction site as well as two-base extensions are underlined): MIXX-20, 5′-GATGGCAAATAGCACCC-3′ (nt 379-395 and 1672-1688 in Fig. 2, A and B, respectively); MIXX-23, 5′-GTGGCTGGCCGGACATGAAACAG-3′ (nt 345-367 and 1638-1660 in Fig. 2, A and B, respectively); MIXX-24, 5′-ATGAATTCTGGTCCAAAGATGTAGGCCGG-3′ (nt 258-280 and 1551-1573 in Fig. 2, A and B, respectively); and MIXX-39, 5′-ATTCAGGGGACTCAGGGAATTC-3′ (nt 86-107 in Fig. 2B).
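The coordinate ranges quoted for the primers can be cross-checked against each other, because a primer lying in sequence shared by both cDNA forms should map to Fig. 2A and Fig. 2B positions that differ by a constant offset. The Python sketch below is only an illustrative consistency check and not part of the original methods. The offset is taken from the MIXX-20 coordinates (1672 - 379), and it is assumed that the six 5′ bases of MIXX-24 (a two-base extension plus the four bases completing the EcoRI site) are the non-templated additions. All helper names are hypothetical.

```python
# Primer sequences and their Fig. 2A coordinates, as quoted in the text.
PRIMERS = {
    "MIXX-20": ("GATGGCAAATAGCACCC", (379, 395)),
    "MIXX-23": ("GTGGCTGGCCGGACATGAAACAG", (345, 367)),
    "MIXX-24": ("ATGAATTCTGGTCCAAAGATGTAGGCCGG", (258, 280)),
}

# Assumption: number of non-templated 5' bases carried by each primer.
ADDED_5P_BASES = {"MIXX-20": 0, "MIXX-23": 0, "MIXX-24": 6}

# Offset between Fig. 2B and Fig. 2A numbering, derived from MIXX-20.
OFFSET_A_TO_B = 1672 - 379


def span_length(coords):
    """Number of nucleotides covered by an inclusive (start, end) range."""
    start, end = coords
    return end - start + 1


def templated_length(name):
    """Primer length after removing the assumed non-templated 5' bases."""
    seq, _ = PRIMERS[name]
    return len(seq) - ADDED_5P_BASES[name]


def to_fig2b(coords):
    """Shift a Fig. 2A range into Fig. 2B numbering."""
    start, end = coords
    return (start + OFFSET_A_TO_B, end + OFFSET_A_TO_B)


if __name__ == "__main__":
    for name, (_, coords_a) in PRIMERS.items():
        consistent = templated_length(name) == span_length(coords_a)
        print(name, coords_a, "->", to_fig2b(coords_a), "length consistent:", consistent)
```

Under these assumptions the MIXX-24 range maps to nt 1551-1573 and the 23-nt MIXX-23 range maps to nt 1638-1660 in Fig. 2B numbering.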
Figure 2:
Nucleotide and deduced amino acid
sequences of cDNA clones for the three NC1 domains of the α1 chain
of mouse type XVIII collagen. A, nucleotide and deduced amino
acid sequences of the short NC1 domain, NC1-301, the sequences
shown encoding the signal peptide and the first 78 residues of this
domain. The asterisk indicates the extreme 5′-end nucleotide
of the previously reported clone SXT-5 (7). B,
nucleotide and deduced amino acid sequences for the long NC1 domains,
NC1-764 and NC1-517, the sequences shown encoding the
signal peptide and the first 541 or 273 residues, respectively, of
these domains. Sequences lacking in clones PE8.1, PE19, and PE15.2 due
to alternative splicing are shown in brackets. In A and B, the last 17 nt represent oligonucleotide MIXX-20
(see Fig. 1A). A 299-residue sequence is common to the
three NC1 domains, and the first 76 residues of the common sequences
are shown as shaded in A and B. The
remaining 223 residues are not shown in the figure, as they are encoded
by sequences downstream of oligonucleotide MIXX-20 used in this study
and can be found in a previous study (7). Potential N-linked glycosylation sites are boxed, cysteine
residues are circled, and the arrows indicate the
most likely signal peptide cleavage sites. The numbering of
the nucleotide and amino acid residues begins from the
5′-end.
Figure 1:
cDNA clones encoding
three N termini of the mouse α1 chain of type XVIII collagen and
hydropathy plot of the longest variant. A, the overlapping
cDNA clones are shown with respect to the schematic structure of the
short (upper panel) and long (lower panel) NC1 domains. The clones covering three different
length NC1 variants, NC1-301, NC1-517, and NC1-764,
are shown. SS1 and SS2 indicate signal peptides for
the NC1-301 and NC1-764/517 variants, respectively. The
lengths of the signal peptides, the domain common to NC1-764 and
NC1-517, the alternatively spliced sequence present only in
NC1-764, and the sequence common to all NC1 variants are shown in
amino acids; (2) indicates residues unique to the
NC1-301. The locations of the EcoRI (E) and SacI (S) restriction sites are indicated. The EcoRI sites shown in parentheses represent linker
sites introduced during cloning. cDNA sequences specific for the long
and short NC1 domains are shown with gray and white boxes, respectively. The locations of oligonucleotide primers used
in cloning are shown with arrows. The scale is in kilobases. B, in the hydropathy plot, the numbering of the amino
acid residues begins from the first residue of the longest NC1 domain.
The hydrophobic regions are positive, and the hydrophilic ones are
negative. The alternatively spliced sequences are shown in brackets. The locations of all cysteine residues are shown
with vertical bars. The variable sequences of the
three N termini and their 299-residue common region are indicated. The arrow indicates the most likely signal peptide cleavage
site.
The cDNA clones PE8.1, PE19, and PE15.2 varied in length between 833 and 929 nt but contained sequences at their extreme 5′- and 3′-ends that overlapped with the 1.7-kb clone PE17.24 (Fig. 2B). The cDNA-derived open reading frame of the clone PE17.24 contained a stretch of leucine residues at its beginning, which suggested that the translation initiation codon might be located nearby in the 5′ direction. Primer extension in combination with PCR of the 5′-ends using MIXX-39 as an antisense oligonucleotide resulted in isolation of the 94-bp clone PX4.3 (Fig. 1A), which extended furthest in the 5′ direction and covered the putative initiation codon.
The 1461 extreme 5′ nucleotides derived from the overlapping clones PE17.24 and PX4.3 differed from the first 168 nt derived from clones PX2.25 and PE21, but nt 1462-1688 of PE17.24/PX4.3 and nt 169-395 of PX2.25/PE21 were identical. Thus, PE17.24/PX4.3 must encode a variant N-terminal domain in which the signal peptide and the first 2 amino acid residues of the NC1-301 domain are replaced by a markedly longer sequence. The sequences encoding the remaining 299 residues of the NC1 domain are identical for the two variants. Nucleotides 1338-1467 of PE17.24/PX4.3 were identical to the 123-nt clone TA5 reported by Oh et al. (6) as differing from the 5′-end of the mRNA encoding the NC1-301. The sequences described in this paper therefore represent the same 5′-end as reported by Oh et al. (6) but extend 1.4 kb further in the 5′ direction.
The overlapping clones PE17.24 and PX4.3 correspond to a non-collagenous domain of 785 residues (Fig. 2B). This sequence begins with a methionine, and only 2 nt of 5′-untranslated sequences are included in the clones. The N-terminal sequence of the predicted polypeptide is highly hydrophobic and clearly fulfills the criteria for a signal peptide. The cysteine at position 19 and leucine at position 21 best suit the rules for residues occupying the -3 and -1 positions in a signal peptide, but the valine at position 22 and the alanine at position 24 satisfy the rules almost as well (22). Thus, the signal peptide is predicted to be either 21 or 24 residues in length. Assuming that the signal peptide identified here is 21 residues long (signal peptide 2), the NC1 domain encoded by the mRNA corresponding to PE17.24 will be 764 residues (NC1-764). A striking feature of the NC1-764 domain is the presence of 10 cysteine residues within a stretch of 110 residues located immediately upstream of the portion of the NC1-764 domain identical to the NC1-301. In addition, putative N-linked glycosylation sites are located at residues 354 and 361 of NC1-764.
Clones PE8.1, PE15.2, and PE19 were lacking nt 721-1461, which encode residues 241-486 of NC1-764 (Fig. 2B). Thus, these clones covered the same signal peptide 2 as clones PE17.24 and PX4.3, and the NC1 domain is 517 residues (NC1-517). A stretch of 247 residues located at the center of NC1-764 is lacking from NC1-517, and most strikingly, the region lacking encompasses the cysteine-rich domain and the two putative N-linked glycosylation sites (see above).
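The arithmetic behind the NC1-517 length can be checked directly from the coordinates given above. The following Python sketch uses only the figures quoted in the text (nt 721-1461 lacking, a 764-residue long NC1 domain); the variable names are illustrative and not taken from the original work.

```python
# Nucleotides lacking from clones PE8.1, PE15.2 and PE19 (Fig. 2B numbering).
SPLICED_START, SPLICED_END = 721, 1461
NC1_LONG_RESIDUES = 764

# Inclusive length of the alternatively spliced segment in nucleotides.
spliced_nt = SPLICED_END - SPLICED_START + 1

# Whole codons removed, and any leftover bases that would shift the frame.
spliced_codons, leftover_nt = divmod(spliced_nt, 3)
in_frame = leftover_nt == 0

# Length of the NC1 domain once the spliced segment is absent.
nc1_short_residues = NC1_LONG_RESIDUES - spliced_codons

print(spliced_nt, spliced_codons, in_frame, nc1_short_residues)
# prints: 741 247 True 517
```

A segment length divisible by three means that its removal leaves the downstream reading frame unchanged, which is consistent with the shared 299-residue C-terminal portion of the NC1 domains.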
Hydropathy analysis indicated that an approximately 70-residue stretch adjacent to the putative signal peptide of NC1-764 and NC1-517 represents the most hydrophilic region of the NC1 domain sequences, this stretch being particularly rich in acidic amino acid residues (Fig. 1B and Fig. 2B). In contrast, the region subject to alternative splicing and the beginning of the common NC1 portion are the most hydrophobic parts (Fig. 1B).
Figure 3:
Comparison of the cysteine-rich sequences
identified in the NC1-764 domain of the mouse α1(XVIII)
collagen chain with rat frizzled-1 and frizzled-2 proteins and the Drosophila frizzled protein. The numbering of the
amino acid residues begins from the N terminus of each protein (23). Note that the last α1(XVIII) chain amino acid
residue in the aligned sequence represents the extreme C-terminal
residue of the alternatively spliced region in the NC1-764
domain. The amino acid residues that are identical between the mouse
α1(XVIII) collagen and one or more of the frizzled proteins are
shown by black boxes, and similar residues are shown
in shaded boxes. The additional identities and
homologies that exist only between the frizzled proteins are not
indicated. The similarly located cysteine residues are numbered from the N-direction. A consensus motif for the
homologous sequence present in all four polypeptides is indicated as
follows: h, hydrophobic; p, polar; -, acidic
residues; and +, basic residues.
The 10 cysteine
residues located in the cysteine-rich region of the mouse α1(XVIII)
collagen chain NC1 domain can be aligned with 10 almost identically
spaced cysteine residues in the rat and Drosophila frizzled
proteins (Fig. 3). Other residues around the cysteines are also
found to be identical or similar, the identity between a stretch of 126
amino acids in NC1-764 and 127 amino acids in the rat frizzled-1
protein being 24% and the similarity 47%. Within this stretch, the
degree of identity is 57-86%, and the degree of similarity is
81-95% between the three frizzled proteins. The numbers of amino
acid residues separating cysteines 2 and 3, cysteines 3 and 4,
cysteines 4 and 5, and cysteines 6 and 7 are identical in the three
frizzled proteins and the α1(XVIII) chain, while differences of
1-4 residues in length can be observed between the other cysteine
pairs.
Figure 4:
Northern blot analysis of variant mRNAs of
mouse α1(XVIII) collagen in mouse tissues. 2 µg of
poly(A)+ RNA from the adult tissues indicated were
fractionated by gel electrophoresis. muscle, skeletal muscle. A, blot hybridized with a mixture of cDNA probes recognizing
all of the α1(XVIII) collagen mRNAs. To obtain a representative
hybridization pattern for all of the α1(XVIII) collagen mRNAs, a
shorter exposure (short exp.) of the skeletal muscle
sample is also shown. B, blot hybridized with a probe
identifying mRNAs encoding both long NC1 domains. C, blot
hybridized with a probe recognizing mRNAs encoding the NC1-764
variant. D, blot hybridized with a probe identifying mRNAs
encoding the NC1-301 domain. The positions of the probes with
respect to the variant NC1 structures are given schematically below the
autoradiography blots. The sizes of marker RNAs and the α1(XVIII)
mRNA signals are indicated in kilobases.
The same
Northern blot was also hybridized with variant-specific cDNA probes.
Two different-sized mRNAs were found to occur for both the
NC1-764 and NC1-301 variants and probably also for the
NC1-517 variant, which reflected the utilization of different
poly(A) signals (5, 6). Probe B (see Fig. 4) identifying all mRNAs encoding NC1-764 and
NC1-517 domains resulted in clear signals of 5.7, 6.1, and 7.0 kb
in the lung, liver, skeletal muscle, and kidney and extremely faint
signals of 6.1 and 7.0 kb in the other tissues (Fig. 4B), while probe C (see Fig. 4) detecting
only those mRNAs encoding the NC1-764 variant resulted in the
detection of two transcripts of 6.1 and 7.0 kb in the lung, liver,
skeletal muscle, and kidney (Fig. 4C) and extremely
faint signals in the other tissues (not visible in Fig. 4C). Comparison of the signals obtained using the
B and C probes indicated that the 7.0-kb band is specific for
NC1-764 mRNA, and the 5.7-kb band is specific for NC1-517
mRNA, while the 6.1-kb band contains both the shorter NC1-764
mRNA and the longer NC1-517 mRNA. Furthermore, comparison of the
signal intensities obtained with probes B and C suggests that mRNAs
encoding the NC1-517 variant predominate in the liver. A probe
specific for NC1-301 resulted in detection of strongly
hybridizing mRNAs of 4.5 and 5.7 kb in the kidney and testis, the same
transcripts also being faintly visible in the other tissues (Fig. 4D). Since the NC1-301-specific probe did
not recognize the mRNAs for the long variants, the results suggest that
separate transcription initiation sites exist for the two types of
transcripts. The relative expression levels of the three mRNA variants
are indicated in Table 1.
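Part of the band assignment for the long variants follows from a simple set comparison of the signals obtained with probes B and C. The Python sketch below is an illustration of that reasoning only; it uses the band sizes quoted above for the tissues with clear signals, and the variable names are hypothetical. It does not attempt to show that the 7.0-kb band is exclusive to NC1-764, which additionally rests on the relative sizes of the transcripts.

```python
# Band sizes (kb) reported for the tissues with clear signals.
BANDS_PROBE_B = {5.7, 6.1, 7.0}  # probe B: mRNAs for NC1-764 and NC1-517
BANDS_PROBE_C = {6.1, 7.0}       # probe C: mRNAs for NC1-764 only

# Bands seen with probe B but not with probe C cannot contain NC1-764 mRNA,
# so they must derive from NC1-517 transcripts alone.
only_nc1_517 = BANDS_PROBE_B - BANDS_PROBE_C

# Bands seen with both probes contain NC1-764 mRNA; whether NC1-517 mRNA
# co-migrates in any of them cannot be decided from this comparison.
contains_nc1_764 = BANDS_PROBE_B & BANDS_PROBE_C

print(sorted(only_nc1_517))      # prints: [5.7]
print(sorted(contains_nc1_764))  # prints: [6.1, 7.0]
```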
The three NC1 domain variants of the type XVIII collagen
chains are likely to be due to the use of two alternate promoters and
to the primary transcripts for one of these promoters also being
subject to alternative splicing. An overview of the three α1(XVIII)
polypeptide variants, which consist of 1774, 1527, or 1315 amino acid
residues with sequence-derived molecular masses of 182.2, 156.0, or
134.3 kDa, respectively, is presented in Fig. 5. The first two
variants have the same signal peptide and NC1 domains that are either
764 or 517 residues in length, depending on the alternative splicing,
while the third variant has its own signal peptide, and its NC1 domain
is 301 residues. All three polypeptides are thought to be identical
with respect to a 299-residue portion of their NC1 domains, their
collagenous domains, and their C-terminal non-collagenous domains.
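The three chain lengths can be reproduced by summing the domain lengths reported in this paper. The following Python sketch is a bookkeeping check under stated assumptions: signal peptide 2 is taken to be 21 residues (the alternative 24-residue prediction is not used), the 674-residue collagenous sequence is counted together with its interruptions, and the class and constant names are illustrative.

```python
from dataclasses import dataclass

# Portions common to all three variants, as quoted in the text.
COLLAGENOUS_RESIDUES = 674    # collagenous sequence including its interruptions
C_TERMINAL_NC_RESIDUES = 315  # C-terminal non-collagenous domain


@dataclass(frozen=True)
class Variant:
    name: str
    signal_peptide: int  # residues in the signal peptide
    nc1: int             # residues in the N-terminal NC1 domain

    @property
    def total(self) -> int:
        """Length of the full polypeptide, signal peptide included."""
        return (self.signal_peptide + self.nc1
                + COLLAGENOUS_RESIDUES + C_TERMINAL_NC_RESIDUES)


VARIANTS = [
    Variant("NC1-301", signal_peptide=25, nc1=301),
    Variant("NC1-517", signal_peptide=21, nc1=517),  # assumes 21-residue signal peptide 2
    Variant("NC1-764", signal_peptide=21, nc1=764),  # assumes 21-residue signal peptide 2
]

if __name__ == "__main__":
    for v in VARIANTS:
        print(v.name, v.total)
```

With these inputs the totals are 1315, 1527, and 1774 residues, matching the chain lengths given above.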
Figure 5:
Schematic structures of the full-length
variant polypeptides of the mouse α1(XVIII) collagen chains.
Collagenous sequences are shown in white, non-collagenous
domains common to all variants are shown in black,
non-collagenous sequences common to both long variant NC1 domain
portions are shown in gray, and non-collagenous sequence
unique to the NC1-764 variant is shown by cross-hatching. The putative signal peptide 1 is indicated
with left hatching, and the putative signal peptides
2 are indicated with right hatching. The lengths of
the amino acid sequences (aa) specific for each variant are
given, as well as the lengths of the common regions. C,
cysteine residue; 10C, cluster of 10 cysteine residues; N, potential N-glycosylation site; 2N, two
adjacent N-glycosylation sites; O, potential O-linked glycosylation site; fz and tsp,
frizzled and thrombospondin sequence motifs, respectively; ac,
acidic domain.
Several of the other collagens are known to be modified by the use
of alternative promoters and alternative splicing, although the
significance of these modifications is not fully understood at present.
An alternate, cartilage-specific transcription start site has been
found within intron 2 of the gene encoding the chick α2 chain of
type I collagen (25), and an alternative transcript of the
chick α1(III) gene was identified in which exons 1-23 are
replaced by the initiation of transcription at intron 23 (26).
These type I and III variant transcripts probably direct synthesis of
non-collagenous chains. Two transcription start sites have also been
found for the gene encoding the α1 chain of type IX
collagen (27). The two promoters are used in a tissue-specific
manner, resulting in the synthesis of α1(IX) polypeptides
possessing either long or short N-terminal non-collagenous domains,
similar to our observations regarding type XVIII collagen. The first
collagen found to undergo alternative splicing was type XIII collagen,
and it is still the only one in which this affects both collagenous and
non-collagenous sequences (28, 29, 30).
Subsequently, alternative splicing has been found to affect the modular
N-terminal non-collagenous domains of the α3 chain of type VI
collagen (31, 32, 33) and the α1 chains
of the homologous collagen types XII and XIV (34, 35, 36). The C-terminal non-collagenous
domains of the α3 chain of type IV collagen (37) and the
α2 chain of type VI collagen (38, 39) are also
modified by alternative splicing. Furthermore, the primary transcripts
for the α1(XIV) chain are also subject to alternative splicing
affecting the 5′-untranslated sequences, which are hypothesized to
modulate translational control (35). The mode of alternative
splicing of primary transcripts is in most cases exon skipping, but use
of an internal splice acceptor site has also been observed in
α2(VI) transcripts (38). The 5′-end of the gene encoding
the α1(XVIII) chains has not been characterized, and therefore we
do not yet know how the proposed two promoters of this gene are
arranged with respect to each other or what the mode of the observed
alternative splicing is. Significant differences occur at the
N-terminal ends of the type XVIII collagen chain variants, and it may
therefore be presumed that the variant molecules possess different
functional properties.
Both long NC1 domains have a markedly more
acidic N terminus than NC1-301, which only consists of sequences
present in all three NC1 variants. The most striking difference between
the NC1 variants is the occurrence of a stretch of 110 amino acid
residues with 10 cysteines within the 247-residue sequence only present
in the NC1-764 domain. Interestingly, the α1(XVIII) chain
cysteine-rich domain showed homology to a cysteine-rich domain found in
the Drosophila frizzled protein and the rat frizzled-1 and
frizzled-2 proteins. These proteins vary in size between 570 and 641
residues and all contain a domain characterized by 10 cysteines within
their N-terminal one-third portion and seven putative transmembrane domains
within their C-terminal two-thirds portion (23, 40).
With respect to the seven-transmembrane-domain profile, the frizzled
proteins resemble the G-protein-coupled receptors. Mutations in the Drosophila frizzled locus encoding the frizzled protein cause
abnormal orientation of the wing hairs, suggesting that this protein is
needed for establishment of cell polarity in the
epidermis (24, 41). Moreover, genetic mosaic studies
suggest that the product of the frizzled locus functions in a
dual fashion as it appears to serve both in reception of a polarity
signal and in its intercellular transmission to the adjacent cells. The
cysteine-rich portions of the frizzled proteins encompass most of their
extracellular portion, and these domains are thus likely to be involved
in ligand binding and intercellular transmission of polarity
information. The ligand(s) participating in this event is not known,
however. Collagens are known to be mosaic proteins with a number of
shuffled domains also present in non-collagenous proteins (4).
The homology identified here leads us to suggest that the cysteine-rich
sequence, termed here the fz motif, represents another sequence motif
that can be found in both non-collagenous and collagenous proteins.
Elucidation of the possible function of the fz motif in type XVIII
collagen will require recombinant expression experiments, however.
The tissue distribution of the variant α1(XVIII) collagen
transcripts is unusual. Of the eight mouse tissues studied, markedly
high levels of mRNAs for type XVIII collagen were found in the liver
and kidney, the next highest levels being found in the lung, skeletal
muscle, and testis, while the brain, heart, and spleen contained
markedly lower levels. We know of no other collagen mRNAs with
similarly prominent expression in liver. mRNAs for the NC1-301
variant appeared to be constitutively expressed in low amounts in all
tissues except in the kidney and testis, where they were more abundant,
while the two other mRNA variants likely to be derived from the second
promoter were found in the kidney, liver, lung, and skeletal muscle,
being thus more restricted in their tissue distribution. mRNAs encoding
the NC1-517 variant were mainly responsible for the strong
signals in the liver, but liver tissue was also found to contain the
mRNA variant for the NC1-764 domain characterized by the
cysteine-rich sequence. Lung tissue also contained mRNAs encoding the
two long NC1 domains as its major variants but with a more even
distribution of the two forms. The kidney contained all three mRNA
variants, namely mRNAs for the two long NC1 domains and for the
NC1-301, whereas the testis contained only mRNAs for the
NC1-301 domain in any appreciable amounts. Northern analysis
revealed wide expression in rat tissues of mRNAs for the frizzled-1 and
frizzled-2 proteins with highest mRNA levels in the kidney, liver,
heart, uterus, and ovary (23). Thus, kidney and liver appear to
be tissues that contain both mRNAs for the type XVIII collagen variant
with the fz motif and mRNAs for the two frizzled proteins. In
conclusion, the tissue distribution of α1(XVIII) mRNAs is unlike
that of any of the other collagens, and there are distinct differences
between the variants in this respect. This latter observation argues
for a possible functional significance of the utilization of two
presumed promoters and alternative splicing of α1(XVIII)
transcripts.
The nucleotide sequences reported in this paper have been submitted to the GenBank(TM)/EMBL Data Bank with accession numbers U11636 and U11637.