2 Department of Medical Biochemistry, Göteborg University, Gothenburg, Sweden; and 3 FCC, Fraunhofer-Chalmers Research Centre for Industrial Mathematics, Chalmers Science Park, Gothenburg, Sweden
Received on November 24, 2003; revised on January 20, 2004; accepted on February 23, 2004
Abstract |
---|
Key words: bioinformatics / glycosylation / mucin / SEA / EGF
Introduction |
---|
Mucins are characterized by one or more domains that are rich in the amino acids Ser, Thr, and Pro and that are referred to as mucin or PTS domains. These domains often contain tandem repeats. The combined Ser and Thr content of a mucin domain often exceeds 40%, sometimes reaching 90%, and the Pro content is often more than 5%. The function of a mucin domain is to serve as a scaffold for O-linked glycans attached to Ser and Thr residues; these glycans bind water and interact with lectins (carbohydrate-binding proteins), whether found on microorganisms or endogenous. The dense oligosaccharide clusters make this domain resistant to proteolysis and give it an extended and stiff conformation, which can be likened to a bottle brush. The biophysical properties of mucins are largely determined by these extensive O-glycosylated domains, as illustrated by the typical mucin having more than 80% of its mass as carbohydrate. Such highly O-glycosylated mucin domains are also found in other proteins, where they make up minor parts. Mucin domains are often found in the stalk region of membrane proteins, for example in the low-density lipoprotein receptor, where the mucin domain is 48 residues long (Davis et al., 1986). The length and nature of the mucin domains can be important, as illustrated by the Ebola virus glycoprotein (Yang et al., 2000). In the classical mucins, the mucin domains are long. The longest known today is that of the porcine submaxillary mucin, made of 135 repeats of 81 amino acids, giving a mucin domain with more than 10,000 residues (Eckhardt et al., 1997). Among the smaller is that of MUC7, with six 23-amino-acid repeats, giving a length of 138 residues (Bobek et al., 1993). Another characteristic of mucin domains, important for their prediction, is that they are typically encoded by a single exon.
The amino acid sequence of mucin domains tends to be poorly conserved. As a consequence, for the identification of such domains, one cannot rely on methods like BLAST that take advantage of sequence similarity. Identification of mucin domains is therefore a bioinformatic challenge, and novel methods are needed. For this reason we developed approaches based on the amino acid compositional bias characteristic of mucin domains.
As a starting point, we applied our methods to identify as many mucins as possible in the genome of the puffer fish Fugu rubripes (Aparicio et al., 2002). The genome is only 390 Mb, about one-eighth the size of the 3000-Mb human genome, yet it contains a similar repertoire of genes (Brenner et al., 1993). A number of mucins of higher mammals have previously been identified, and their functional roles have been studied. However, other vertebrates as well as lower metazoans have been less well characterized with respect to mucins. The analysis of Fugu in the present work therefore contributes to a better understanding of the evolution of mucins in the vertebrates. We report both gel-forming and transmembrane mucin genes that were not previously identified in the annotation of the Fugu genome.
Results |
---|
The ENSEMBL Fugu project provides a set of proteins predicted by the standard gene prediction pipeline as well as a set of proteins predicted by the Genscan method. We searched both these categories for mucin domains using the PTSPRED and MPRED programs. In all cases where a mucin domain was predicted, the full-length protein sequence was retrieved and analyzed with SignalP (Nielsen et al., 1997) and TMHMM (Sonnhammer et al., 1998) for the prediction of signal sequence and transmembrane domains, respectively. The full-length sequences were also analyzed with respect to Pfam domains using hmmer (http://hmmer.wustl.edu). All proteins predicted to have at least one mucin domain, one signal peptide (when the predicted protein included the N-terminus), and at least one Pfam domain typical of human mucins (Table II) were further studied at the genomic level to obtain as accurate a prediction as possible of the complete coding sequence.
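The selection criteria in the preceding paragraph can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Candidate` record, its field names, and the `MUCIN_PFAM_DOMAINS` set are hypothetical and not part of the original pipeline, and the domain set merely lists the domain types named in the text rather than the full contents of Table II.

```python
from dataclasses import dataclass, field

# Assumption: a stand-in for the Pfam domains "typical of human mucins".
# Only domain types named in the text are listed; the real list is Table II.
MUCIN_PFAM_DOMAINS = frozenset({"vwd", "SEA", "EGF", "CK"})


@dataclass
class Candidate:
    """Hypothetical summary of one predicted protein after all analyses."""
    name: str
    n_mucin_domains: int        # hits from PTSPRED and/or MPRED
    has_signal_peptide: bool    # SignalP result
    includes_n_terminus: bool   # False if the N-terminus is missing (assembly gap)
    pfam_domains: set = field(default_factory=set)  # names of hmmer/Pfam hits


def passes_filter(c, mucin_pfam=MUCIN_PFAM_DOMAINS):
    """Apply the three criteria described in the text."""
    if c.n_mucin_domains < 1:
        return False
    # A signal peptide is only demanded when the N-terminus was predicted.
    if c.includes_n_terminus and not c.has_signal_peptide:
        return False
    # At least one Pfam domain typical of mucins must be present.
    return bool(set(c.pfam_domains) & mucin_pfam)
```

Note that a candidate with a missing N-terminus is not rejected for lacking a signal peptide, which mirrors how the incomplete predictions are handled in the text.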
A transmembrane, MUC1-type mucin
Fugu MUC1
Only one of the predicted mucins had a domain structure reminiscent of a MUC1-type mucin (Figure 1). It is identical to a peptide predicted by Genscan (v. 11, accession number scaffold_368.124967.141885). The fMUC1 protein is predicted to have 2055 amino acids and, according to the TMHMM program, one transmembrane domain. The mucin or PTS domain is located in the N-terminal end of the protein, which is predicted to be extracellular. The stalk region between the PTS and transmembrane domains contains two SEA as well as two EGF domains, both typical for transmembrane mucins (Table II). The SEA domain is found in several mucins, such as MUC1 and MUC13 (Wreschner et al., 2002). The function of this domain is not known, but in MUC1 it holds the two protein parts together despite a posttranslational cleavage in the middle of the domain. An EGF domain is sometimes found together with a SEA domain (as in MUC13) but sometimes without it (as in MUC4). The cytoplasmic tail of some of the mucins has been shown to be involved in signaling by phosphorylation (Gendler, 2001). The presence of a large number of Tyr, Ser, and Thr residues in the cytoplasmic tail of fMUC1 is consistent with this idea and suggests a similar function for fMUC1.
Gel-forming, MUC2-type mucins
All identified human gel-forming mucins contain several von Willebrand factor D domains (vwd). The function of these domains is presently not understood, but they are found in a number of nonmucin proteins, where they take part in the formation of large multimeric protein complexes. In addition to the vwd domains, the human gel-forming mucins typically have a C-terminal cysteine-knot (CK) domain, responsible for a disulfide bond-mediated dimerization. Here we have identified four different proteins, each of which has one CK domain, four vwds, at least one mucin domain, and a domain structure typical for the human gel-forming mucins. All these mucins have been named MUC2-type mucins.
Fugu MUC2A
The predicted protein fMUC2A is almost identical to a peptide predicted by Genscan in the v. 8 data set (FuguGenscan 10920). However, in the v. 11 data set, the same prediction is missing and instead another related but truncated protein was found (SINFRUP00000135941). The fMUC2A mucin (1993 amino acids) has one PTS domain with three vwd domains on its N-terminal side and one on its C-terminal side (Figure 1). Not only this domain organization but also the domain lengths are similar to those of human MUC2. An exception is that the central mucin domain is considerably smaller in Fugu, accounting for the difference in length (about 3000 amino acids) between these two proteins.
Fugu MUC2B
In the standard category of Fugu proteins (v. 11) we found only a set of small proteins (SINFRUP00000141455, SINFRUP00000141460, and SINFRUP00000141449), representing fragments of our predicted protein fMUC2B. The fMUC2B is more closely related to a peptide (18974 of v. 8) predicted by Genscan. However, this Genscan prediction differs from fMUC2B in that it has four transmembrane helices in its N-terminus. These helices are in a region of the genome with a gap in the assembly, and the promoter region for fMUC2B might be in this gap. Therefore, we consider the Genscan peptide 18974 an incorrect prediction, in which two unrelated genes have been combined into one. The fMUC2B (2634 amino acids) is similar to fMUC2A, except that its PTS domain is larger and it has a longer N-terminus.
Fugu MUC2C
The fMUC2C protein is based on the v. 11 protein SINFRUP00000151681 and a Genscan prediction (Scaffold_981.70259.92060), whereas it has no equivalent in the v. 8 data set. The SINFRUP00000151681 protein represents only a part of fMUC2C (from position 26 to a position within the PTS region). The Genscan prediction is identical to fMUC2C, except for the PTS domain and a C-terminal extension with three vwd domains. This extension is presumably incorrect, and it seems more likely that the fMUC2C has a CK domain at its C-terminal end (Figure 1). Within the central region, at least two PTS domains were predicted with an intervening Cys-rich domain. This type of intervening domain has been named CysD or Cys-subdomain and is found in the human MUC2 (two) and in MUC5AC and 5B (several). Unfortunately, it is not possible to fully reconstruct the PTS domain because there is a gap in the genome assembly in this region.
Fugu MUC2D
The predicted fMUC2D (Figure 1) is encoded by a sequence in the v. 11 data set (SINFRUP00000160341, SINFRUP00000160337, and SINFRUP00000160338) where one part (SINFRUP00000160341) is identical to a part of our predicted fMUC2D. Due to a gap in the genome sequence, the predicted protein is incomplete at its N-terminus; therefore it is not known if it contains additional vwd domains as found for other proteins of this family.
Fugu zonadhesin
The human zonadhesin has not been classified as a mucin but has many of the characteristic features of a mucin. For instance, it has a central mucin domain, although not as large as those of the classical mucins (Figure 1). In the search for mucin domains, we found the Fugu zonadhesin, which is identical to SINFRUP00000149997 in the v. 11 data set except for minor differences in the N-terminus. Genscan predicts a protein (scaffold_981.70259.92060, v. 11) where our predicted Fugu zonadhesin is joined to a sequence containing a number of transmembrane domains. This prediction is probably not correct and seems to merge two different genes. The N-terminal part of the Fugu zonadhesin is still missing, but the part that could be predicted has the same domain structure as the human protein (Figure 1).
Evolutionary relationship between Fugu mucins
To further analyze the relationship between the different Fugu mucins, we considered multiple alignments of regions that are well conserved. We therefore extracted all the vwd domains of these proteins and numbered them from the N-terminal end as shown in Figure 1. A multiple alignment with CLUSTALW was carried out, and the guide tree used by the program is shown in Figure 2. All of the zonadhesin vwd domains are found in one of the branches. For the other mucins, however, each branch typically contained vwd domains with the same relative position/number from the different mucins. The only exception is the vwd1 of fMUC2B, which is more closely related to the vwd2 family. These results suggest that the order of vwd domains has been conserved and that the four fMUC2 proteins are evolutionarily closely related. This relationship was not clearly recognized from a multiple alignment of the full-length mucins. In addition, these results give further support to our mucin predictions.
Discussion |
---|
Many mucin domains have short tandem repeats, and it should be possible to account for this in a mucin search strategy. However, some mucin domains have no apparent repeat structure, probably because the repeats have been rapidly lost during evolution. For example, the human MUC1 has a mucin domain with nearly identical 20-amino-acid repeats, whereas the corresponding mouse sequence shows only a limited repetitive nature (Gendler et al., 1990; Spicer et al., 1991).
The mucin domains known so far are always contained within one exon. This important criterion was taken into account during the prediction and reconstruction of the full-length proteins. At the same time, a complicating issue is that sequencing and assembly of genomic regions encoding mucin domains is difficult, because these sequences are G/C-rich, long, and repetitive. An illustration of this is that among the human gel-forming mucins, only MUC5B and MUC2 are completely sequenced (Desseyn et al., 1998; Gum et al., 1994). Other mucins, for example MUC5AC, MUC6, and MUC19, are only partially sequenced. The problem is also illustrated in the present study, where fMUC2C and fzonadhesin are both incomplete in their PTS domains as a result of gaps in the genome assembly.
A characteristic property of the mucin domains in the mature protein is their dense O-glycosylation. These glycans are added posttranslationally by enzymes present in the Golgi apparatus. This means that the predicted proteins have to be processed in the secretory pathway and therefore must carry a signal sequence directing them for secretion. All the mucin-type molecules known today have a typical N-terminal signal sequence. Because signal sequence predictions by the SignalP program are relatively accurate, the presence of a signal sequence has been used as an additional requirement for the prediction of a mucin. An N-terminal signal sequence is present in all the proteins predicted here (not known for fMUC2D and fzonadhesin because the N-termini of these are still missing). For this reason, we require that a candidate protein predicted by PTSPRED or MPRED should have a signal sequence to qualify as a mucin protein. This requirement removes false positives: for instance, PTSPRED and MPRED searches identified several typical nuclear proteins (such as RNA polymerase and splicing and transcription factors), as these have regions with the amino acid composition typical of mucins.
The four predicted gel-forming Fugu mucins as well as the fzonadhesin contain vwds. The name is derived from the von Willebrand factor, a protein of the human coagulation system (Sadler, 1998). This protein has been suggested to be the ancestor of the gel-forming mucins (Desseyn et al., 2000) and is also found in the Fugu genome (SINFRUP00000149997, also shown in the database). The vwd domains are typical for gel-forming mucins and are also found in other extracellular proteins that are involved in the formation of polymeric complexes, such as vitellogenin, humoral lectin, apolipophorin, and luciferin 2-oxygenase. The vwd seems to be ubiquitous in multicellular eukaryotes. For instance, vitellogenin with this domain is found in higher mammals as well as in Caenorhabditis elegans. Our identification of the gel-forming mucins in Fugu shows that these mucins have a global domain structure comparable to other gel-forming mucins and that the mucins fMUC2A-C are homologous with respect to their vwd structure (Figure 2). This is consistent with, and indirectly supports, our prediction of the protein sequence in these parts. Finally, the analysis of vwds illustrates that for an efficient prediction of mucins it is important to consider domains of the protein other than the mucin domains.
Gel-forming mucins have previously been found to be coexpressed with trefoil factors (TFF) (Taupin and Podolsky, 2003). Searching the Fugu genome for the TFF motif did not reveal any TFF proteins (data not shown).
In the present study we identified a number of Fugu mucins that were previously not annotated. Only one transmembrane mucin was found, and it contained domains typical of the human transmembrane mucins. Several of the human proteins, such as MUC1, MUC13, and MUC17, have SEA domains (Wreschner et al., 2002) and some have EGF domains (MUC3A, 3B, 4, 12, 13, and 17), all in the stalk region between the mucin domain and the membrane domain. About 10 transmembrane mucins and mucin-type molecules are known to be encoded by the human genome today (MUC1, 3A, 3B, 4, 12, 13, 15, 16, 17, and CD43) as compared to the five gel-forming mucins (MUC2, 5AC, 5B, 6, and MUC19). Therefore, the evolution of mucins from lower vertebrates to higher mammals seems to have maintained the number of gel-forming mucins at about the same level, whereas the family of transmembrane mucins has been markedly expanded. This probably reflects important and expanding roles for the transmembrane mucins in higher animals, where they are involved in mucosal surface protection and signaling.
Materials and methods |
---|
One method for the identification of mucin domains, implemented in PTSPRED, is to examine the frequency of the amino acids Ser, Thr, and Pro. The basic principle of the program is that a protein sequence is analyzed by moving a window, typically 100 amino acids long, along the sequence and determining the composition of Ser, Thr, and Pro in that window. The window is by default moved in steps of 10. If the S + T content and the P content of a window are each above their respective threshold values, the window is recorded as a potential PTS domain. If two or more such domains overlap, they are merged in the output from the program. Typical threshold values are 40% S + T and 5% P. The output from the program is a list of hits ordered by the length of the PTS-rich region. Finally, one version of the program allows us to analyze a genomic sequence by considering all six possible translation reading frames.
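The window scan described above can be sketched in a few lines of Python. This is not the PTSPRED source, which is available only on request; the function name `find_pts_regions`, its parameter names, and details such as the treatment of sequences shorter than the window and the strict "greater than" comparison are assumptions made for illustration. The six-frame genomic mode is not covered.

```python
def find_pts_regions(seq, window=100, step=10, st_min=0.40, p_min=0.05):
    """Return merged (start, end) regions, 0-based half-open, longest first.

    A window is a hit when its Ser+Thr fraction exceeds st_min and its
    Pro fraction exceeds p_min. Overlapping hit windows are merged.
    """
    seq = seq.upper()
    if not seq:
        return []

    hits = []
    # Assumption: a sequence shorter than the window is scored as one window.
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        win = seq[start:start + window]
        st_frac = (win.count("S") + win.count("T")) / len(win)
        p_frac = win.count("P") / len(win)
        if st_frac > st_min and p_frac > p_min:
            hits.append((start, start + len(win)))

    # Hits are already sorted by start, so one pass is enough to merge.
    merged = []
    for start, end in hits:
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    # Order the output by region length, longest first.
    return sorted(merged, key=lambda r: r[1] - r[0], reverse=True)
```

Because the reported region is the union of whole windows, its boundaries extend somewhat beyond the PTS-rich stretch itself; a real implementation might trim them afterwards.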
An alternative approach, the MPRED program, is built on a generalized hidden Markov model that is currently composed of two states, mucin or nonmucin, but can be extended to include additional features. The algorithm runs through the protein sequence and determines which amino acids belong to which state, resulting in a set of start and end coordinates for potential mucin domains along with a probability indicating the reliability of the prediction. The probability distributions incorporated in the model, such as state transitions, domain lengths, and sequence composition, are based on empirical data. Currently the performance of the method is limited by the relatively small number of available training sequences. With more training sequences, the specificity of the predictions would be improved.
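To illustrate the idea of assigning each residue to a mucin or nonmucin state, the following is a minimal sketch of Viterbi decoding in an ordinary two-state hidden Markov model. It is a simplification rather than MPRED itself: MPRED is a generalized HMM with explicit domain-length distributions trained on empirical data, whereas all probabilities below (`START`, `TRANS`, and the emission values) are invented for illustration, and the sketch reports no reliability score.

```python
import math

STATES = ("nonmucin", "mucin")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: illustrative, untrained parameters.
START = {"nonmucin": 0.9, "mucin": 0.1}
TRANS = {
    "nonmucin": {"nonmucin": 0.99, "mucin": 0.01},
    "mucin": {"nonmucin": 0.01, "mucin": 0.99},
}
# Mucin state favours Ser, Thr, Pro; nonmucin state is uniform.
_MUCIN_EMIT = {aa: 0.40 / 17 for aa in AMINO_ACIDS}
_MUCIN_EMIT.update({"S": 0.25, "T": 0.25, "P": 0.10})
EMIT = {"nonmucin": {aa: 0.05 for aa in AMINO_ACIDS}, "mucin": _MUCIN_EMIT}


def emit_logp(state, aa):
    # Unknown symbols (e.g. X) are treated as uninformative in both states.
    return math.log(EMIT[state].get(aa, 0.05))


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START[s]) + emit_logp(s, seq[0]) for s in STATES}
    back = []
    for aa in seq[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + emit_logp(s, aa)
            ptr[s] = prev
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def mucin_segments(path):
    """Collapse a state path into (start, end) half-open mucin segments."""
    segments, start = [], None
    for i, state in enumerate(path):
        if state == "mucin" and start is None:
            start = i
        elif state != "mucin" and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(path)))
    return segments
```

A call such as `mucin_segments(viterbi(protein_sequence))` yields start and end coordinates analogous to those described for MPRED. Unlike the window scan, the boundaries here are placed residue by residue, and the low switching probability discourages short spurious segments.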
PTSPRED and MPRED as well as further documentation of these programs are available on request.
Transmembrane domains were predicted by the TMHMM program (Sonnhammer et al., 1998) and signal sequences using SignalP (Nielsen et al., 1997). For the alignment of protein to DNA we used BLAST (Altschul et al., 1997) or programs of the GCG package (Wisconsin package version 10.2, Genetics Computer Group, Madison, WI). CLUSTALW (Thompson et al., 1994) was used for multiple sequence alignments. For Pfam searches, the domains previously found in mucins were downloaded (www.sanger.ac.uk/Software/Pfam) and searches were made with the hmmer package (http://hmmer.wustl.edu). In-house Perl scripts were used for additional tasks.
References |
---|
Aparicio, S., Chapman, J., and Brenner, S. (2002) Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science, 297, 1301–1310.
Bobek, L.A., Tsai, H., Biesbrock, A.R., and Levine, M.J. (1993) Molecular cloning, sequence, and specificity of expression of the gene encoding the low molecular weight human salivary mucin (MUC7). J. Biol. Chem., 268, 20563–20569.
Brenner, S., Elgar, G., Sandford, R., Macrae, A., Venkatesh, B., and Aparicio, S. (1993) Characterization of the pufferfish (Fugu) genome as a compact model vertebrate genome. Nature, 366, 265–268.
Davis, C.G., Elhammer, R.D.W., Schneider, W.J., Kornfeld, S., Brown, M.S., and Goldstein, J.L. (1986) Deletion of clustered O-linked carbohydrates does not impair function of low density lipoprotein receptor in transfected fibroblasts. J. Biol. Chem., 261, 2828–2838.
Desseyn, J.L., Buisine, M.P., Porchet, N., Aubert, J.P., and Laine, A. (1998) Genomic organization of the human mucin gene MUC5B-cDNA and genomic sequences upstream of the large central exon. J. Biol. Chem., 273, 30157–30164.
Desseyn, J.L., Aubert, J.P., Porchet, N., and Laine, A. (2000) Evolution of the large secreted gel-forming mucins. Mol. Biol. Evol., 17, 1175–1184.
Eckhardt, A.E., Timpte, C.S., DeLuca, A.W., and Hill, R.L. (1997) The complete cDNA sequence and structural polymorphism of the polypeptide chain of porcine submaxillary mucin. J. Biol. Chem., 272, 33204–33210.
Gendler, S.J. (2001) MUC1, the renaissance molecule. J. Mamm. Gland Biol. Neoplasia, 6, 339–353.
Gendler, S.J. and Spicer, A.P. (1995) Epithelial mucin genes. Ann. Rev. Physiol., 57, 607–634.
Gendler, S.J., Lancaster, C.A., Taylor-Papadimitriou, J., Duhig, T., Peat, N., Burchell, J., Pemberton, L., Lalani, E.N., and Wilson, D. (1990) Molecular cloning and expression of human tumor-associated polymorphic epithelial mucin. J. Biol. Chem., 265, 15286–15293.
Gum, J.R., Hicks, J.W., Toribara, N.W., Siddiki, B., and Kim, Y.S. (1994) Molecular cloning of human intestinal mucin (MUC2) cDNA. Identification of the amino terminus and overall sequence similarity to prepro-von Willebrand factor. J. Biol. Chem., 269, 2440–2446.
Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng., 10, 1–6.
Perez-Vilar, J. and Hill, R.L. (1999) The structure and assembly of secreted mucins. J. Biol. Chem., 274, 31751–31754.
Sadler, J.E. (1998) Biochemistry and genetics of von Willebrand factor. Ann. Rev. Biochem., 67, 395–424.
Sonnhammer, E.L., von Heijne, G., and Krogh, A. (1998) A hidden Markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 6, 175–182.
Spicer, A.P., Parry, G., Patton, S., and Gendler, S.J. (1991) Molecular cloning and analysis of the mouse homologue of the tumor-associated mucin, MUC1, reveals conservation of potential O-glycosylation sites, transmembrane, and cytoplasmic domains and a loss of minisatellite-like polymorphism. J. Biol. Chem., 266, 15099–15109.
Taupin, D. and Podolsky, D.K. (2003) Trefoil factors: Initiators of mucosal healing. Nat. Rev. Mol. Cell Biol., 4, 721–723.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res., 22, 4673–4680.
Wreschner, D., McGuckin, M.A., Williams, S.J., Baruch, A., Yoell, M., Ziv, R., Okun, L., Zaretsky, J., Smorodinsky, N., Keydar, I., and others. (2002) Generation of ligand-receptor alliances by SEA module-mediated cleavage of membrane-associated mucin proteins. Protein Sci., 11, 698–706.
Yang, Z.Y., Duckers, H.J., Sullivan, N.J., Sanchez, A., Nabel, E.G., and Nabel, G.J. (2000) Identification of the Ebola virus glycoprotein as the main viral determinant of vascular cell cytotoxicity and injury. Nat. Med., 6, 886–889.