The addition of a carbohydrate moiety to the side chain of a residue in a protein chain influences the physicochemical properties of the protein. Glycosylation is known to alter proteolytic resistance, protein solubility, stability, local structure, lifetime in circulation and immunogenicity (Lis and Sharon, 1993; Hounsell et al., 1996). The process, being vital to the function of certain proteins, is an important factor in the production of therapeutic proteins for humans (Goochee et al., 1991). Because eukaryotic expression systems can fold proteins correctly and perform post-translational modifications, they have come into focus.
One such potential candidate for recombinant glycoprotein production is the simple eukaryote Dictyostelium discoideum (Jung and Williams, 1997; Slade et al., 1997). D.discoideum is an interesting model organism for studying glycosylation, and may yield insights into the evolution of the glycosylation apparatus from simple eukaryotes to the more complex mammalian systems.
A useful aspect for glycoprotein design is the determination of preferable acceptor motifs for glycosyltransferases. The acceptor site for N-glycosylation in various systems is relatively well known (Bause, 1983). The consensus sequence Asn-Xaa-Ser/Thr-Xaa (Xaa being any residue except Proline) provides the necessary signal for glycosylation, even though not all of these motifs are actually utilized. Information from studies on O-glycosylation acceptor sites has, however, been rather limited. It is known that O-glycosylation occurs on certain hydroxyl residues (mainly serine and threonine), but not on all of these exposed amino acids in the protein.
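The N-glycosylation consensus quoted above lends itself to a simple pattern search. The following is a minimal sketch, assuming the motif is read literally as Asn-Xaa-Ser/Thr-Xaa with neither Xaa being proline; the function name `find_sequons` and the example sequence are hypothetical and not taken from the study.

```python
import re

# Asn, any residue except Pro, Ser or Thr, any residue except Pro.
# The lookahead lets overlapping motifs be reported as well.
SEQUON = re.compile(r"(?=N[^P][ST][^P])")

def find_sequons(sequence):
    """Return 1-based positions of Asn residues that start a consensus motif."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence)]

# Hypothetical sequence: NGTA matches, NPS and NKTP are excluded by proline.
positions = find_sequons("MNGTAANPSLNKTP")
print(positions)
```

As the text notes, a match only marks a potential acceptor site; not all such motifs are actually utilized.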
Residues flanking the acceptor sites are believed to influence the process of glycosylation (Wilson et al., 1991; O'Connell et al., 1992; Wang et al., 1993; Hansen et al., 1995; Nehrke et al., 1996) though no simple acceptor consensus sequence is obvious (Gooley and Williams, 1994; Chou et al., 1995). Several classes of O-linked glycosylation have been reported (Hansen et al., 1995) as opposed to the single class of reducing terminal linkages for N-linked glycosylation (Kornfeld and Kornfeld, 1985). Further, multiple glycosyltransferases may be involved in the process with distinct substrate specificities, as is the case with the UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase family (Clausen and Bennett, 1996; Wandall et al., 1997).
In vivo studies on the human von Willebrand factor (Nehrke et al., 1996) indicate that position -1 relative to the glycosylated site is sensitive to charged residues. Acidic residues at positions -1 and +3 eliminate glycosylation altogether. Other studies indicate that prolines, serines, and threonines (O'Connell et al., 1991; Wilson et al., 1991; Elhammer et al., 1993), but no charged residues (Nehrke et al., 1996), flank the glycosylated Ser or Thr. Many O-glycosylated sites are known to occur among Ser and Thr clusters (Gooley et al., 1991; Wilson et al., 1991; Pisano et al., 1993; Hansen et al., 1998). Further, there are indications that serine residues are less frequently glycosylated compared to threonine residues in the same sequence context (Elhammer et al., 1993; Wang et al., 1993; Jung et al., 1997).
Since O-glycosylation is a post-translational event most probably initiated in the Golgi apparatus (Roth et al., 1994; Röttger et al., 1998), it is reasonable to assume that glycosylation acceptor sites occur on the surface of the protein (Aubert et al., 1976), suitably exposed to a UDP-HexNAc:polypeptide N-acetylhexosaminyl transferase. O-glycosylation is thus also dependent on local secondary structure (Hansen et al., 1994) and on the overall tertiary structure of the protein.
Apart from the O-linked GalNAc ("mucin type") modification, another prominent form of glycosylation is the addition of N-acetylglucosamine (GlcNAc) as the primary glycan to the side chain of serine or threonine residues. Two configurations have been described for the anomeric linkage of GlcNAc to the polypeptide chain. O-[alpha]-Linkages normally exist for cell surface and extracellular proteins in simple eukaryotes (Jung et al., 1998; Previato et al., 1998), while O-[beta]-GlcNAc occurs on nuclear and cytosolic glycoproteins in a large range of eukaryotes (Hart, 1997). The enzyme responsible for this glycosylation (O-[beta]-GlcNAc) has recently been cloned and characterized (Kreppel et al., 1997; Lubas et al., 1997).
The secreted and cell-surface glycoproteins of several simple eukaryotes, including Entamoeba histolytica, Trypanosoma cruzi, Leishmania major, Plasmodium falciparum, and Dictyostelium discoideum have recently been shown to carry O-GlcNAc. In the case of T.cruzi and D.discoideum secreted glycoproteins, the reducing terminal linkage has been identified as O-[alpha]-GlcNAc (Previato et al., 1998; Jung et al., 1998). Intracellular proteins with O-GlcNAc have also been identified in some of these eukaryotes (e.g., in Leishmania sp., Handman et al., 1993), so it is probable that two O-GlcNAc glycosylation pathways exist in these organisms also: an ER/Golgi pathway responsible for secreted and cell-surface glycoproteins and a second cytosolic/nuclear pathway responsible for the O-[beta]-GlcNAc glycosylation. Are any of the glycosylation motifs similar between the O-[alpha]-GlcNAc glycosylation and the O-[beta]-GlcNAc glycosylation?
As no simple pattern for GlcNAc acceptor sites emerges, the glycosylated and nonglycosylated sites are not easily separable. This prompted us to investigate if these fuzzy patterns were recognizable using artificial neural networks. Neural networks are capable of classifying even highly complex and nonlinear biological sequence patterns, where correlations between positions are important (Presnell and Cohen, 1993). Not only does the network recognize the patterns seen during training, but it also retains the ability to generalize and recognize similar, though not identical patterns. Artificial neural network algorithms have been extensively used in biological sequence analysis (Wu, 1997; Baldi and Brunak, 1998).
Our study employs artificial neural networks in an attempt at predicting O-[alpha]-GlcNAc glycosylation sites in membrane and secreted proteins of D.discoideum. It is suggested that O-linked GlcNAc addition to these proteins in D.discoideum may resemble the mucin-type glycosylation in mammals (Jung et al., 1998). The knowledge of glycosylation sites (and their sequence contexts) from in vivo experimental data (Jung et al., 1997, 1998) was combined with surface accessibility prediction to develop a method for predicting putative acceptor sites in other D.discoideum amino acid sequences. The 34 megabase D.discoideum genome, comprising six chromosomes, is estimated to contain ~7000 genes (Loomis and Smith, 1995), roughly one-tenth of which is available in public databases. In our study, we scanned the D.discoideum protein sequences available in the SwissProt (Bairoch and Apweiler, 1998) and Genpept (Benson et al., 1998) databases for GlcNAc O-glycosylation and ranked these according to their likelihood of being glycosylated. Further, we have attempted to classify the available sequences into functional categories and cellular locations and to mark the glycosylated fractions.
The sequence context around GlcNAc O-linked glycosylation sites
Figure 1. Sequence information logo of O-glycosylated sequence contexts. Conservation of residues at each position is computed as a function of the Kullback-Leibler information content (Kullback and Leibler, 1951) and represented as sequence logos (Schneider and Stephens, 1990). Sequences are aligned with position zero being the glycosylated Ser/Thr. Heights of the letters indicate their information content; the letters are stacked vertically according to their heights, so that the letter at the top of each stack is the most conserved residue at that position. The 39 glycosylated windows were referenced against a background of all the nonglycosylated sequence windows (window size = 41, n = 324). The charged residues are shown in blue (basic, positive) and red (acidic, negative), while the neutral ones are shown in green (polar, hydrophilic) and black (apolar, hydrophobic).
The data used in this paper comprised eight Dictyostelium discoideum membrane and secreted protein sequences. Earlier experiments showed that 39 of the 363 potential serine/threonine sites are modified in vivo with O-[alpha]-linked GlcNAc moieties (Table IV). The distribution of amino acid residues around these O-GlcNAc sites is illustrated in Figure
Figure 2. Distance distribution of glycosylated sites. Distances between glycosylated sites in the training set data are plotted here. Clustering is evident by the high frequency of closely placed sites (distances of 2, 4). The empty bins at positions 1, 3, 5, 7, . . . display the peculiar nature of the occupied sites never to occur at odd positions relative to each other.
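A distance distribution of this kind can be tabulated directly from the site positions. The sketch below assumes that distances are taken between every pair of glycosylated sites within the same protein (the caption does not say whether all pairs or only neighbouring sites were used); the protein names and positions are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def distance_histogram(sites_by_protein):
    """Count sequence distances between all pairs of sites within each protein."""
    histogram = Counter()
    for positions in sites_by_protein.values():
        for first, second in combinations(sorted(positions), 2):
            histogram[second - first] += 1
    return histogram

# Hypothetical site positions, not the training data of the study.
example_sites = {"protein_A": [10, 12, 16], "protein_B": [5, 7]}
histogram = distance_histogram(example_sites)
odd_distances = [d for d in histogram if d % 2 == 1]
print(sorted(histogram.items()), odd_distances)
```

An empty `odd_distances` list corresponds to the empty odd-numbered bins described in the caption.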
We did not segregate the sequence contexts of glycosylated serine sites from those of glycosylated threonine sites, owing to the paucity of experimental data for glycosylated serines (only three Ser sites are known).
Conformational preferences of the O-glycosylated sites
Secondary structure predictions using the PHD server (Rost and Sander, 1994) indicated that the glycosylated sites predominantly lay in coil and [beta]-sheet regions (data not shown). Of the glycosylated sites, 69% were predicted to have no secondary structure and the remaining 31% were predicted to be [beta]-strands. Regions around the glycosylated site were largely unassigned (coils), with positions -7 and -3 showing a strong preference for [beta]-strands. The "even"-positioning of residues shown in Figure
Predictive performance of the neural networks
The final method correctly identifies ~97% of the glycosylated and nonglycosylated residues in a cross-validated test. Performance measures are summarized in Table I.
The incorporation of protein surface-accessibility predictions (using a surface derived threshold) increases sensitivity and makes the method more reliable for new entries. Prediction involving serine glycosylated residues is currently poor and the network does not distinguish serine flanks from threonine flanks very well. We hope to improve this with availability of more experimental data on glycosylated serine sites.
Predictions using NetOGlyc 2.0
We performed a preliminary study comparing acceptor site similarities between the D.discoideum GlcNAc context and the mammalian GalNAc context. NetOGlyc is a mucin-type (GalNAc) glycosylation predictor trained on mammalian protein sequences (Hansen et al., 1995; Hansen et al., 1998). Predictions using this server on our D.discoideum data set gave a correlation coefficient of 0.74. Interestingly, the server identified 35 of the 39 D.discoideum GlcNAc-modified sites; however, it also made 13 incorrect predictions (false positives). This indicates a high degree of overlap between the specificities in the two cases, and hints that the D.discoideum GlcNAc specificities may be a subset of the mammalian repertoire of O-glycosylations.
Table I.
Method | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) | Corr. coeff. |
ANNs w/o surface | 89 | 97 | 93 | 94 | 0.87 |
ANNs with surface | 97 | 97 | 94 | 98 | 0.93 |
Figure 3. Predictions on the extracellular protein Sheathin D. The server predicts six O-GlcNAc sites in the spacer domain of ShD (LYYGEPCNTPTVTPTTTPSPTTKPPTGPLK, predicted glycosylated residues underlined). The vertical impulses are the Ser/Thr residues. A residue with an impulse crossing the surface-derived threshold (horizontal wavy line) is said to be glycosylated.
Figure 4. Linear MALDI-TOF analysis of tryptic digests of ShD. (A) is the complete spectrum for the tryptic digest of ShD observed in linear mode. The large peptide masses of 5875.47 to 6280.77 Da show the diagnostic ±203 Da and -146 Da mass differences typical of glycopeptides with HexNAc and Fucose additions, respectively. (B) shows the expanded glycopeptide region for ShD. The spectra were obtained on a Perseptive Biosystems MALDI-TOF instrument and insulin was used as external calibrant.
The protein SP96 (SwissProt Acc. P14328) was however not included in the above statistics. NetOGlyc abundantly predicted glycosylations throughout this protein. This was not a surprise, since the protein is known to contain O-linked fucose residues and fucose and/or GlcNAc in a phosphodiester linkage to serine or threonine residues, but no O-linked GlcNAc residues directly linked to serine or threonine residues (M.Mreyen, unpublished observations).
Experimental confirmation of a "blind"-prediction
Sheathin D, a glycoprotein found in the extracellular matrix of D.discoideum, was first described, and its N-terminal sequence published, by Ti et al. (1995). The protein reacts with the carbohydrate-specific mAb MUD50, giving a strong indication that Sheathin D contains reducing terminal O-GlcNAc. O-glycans recognized by mAb MUD50 at this developmental stage of the D.discoideum life cycle normally comprise reducing terminal O-GlcNAc, phosphate, and fucose (Haynes et al., 1993; Zachara et al., 1996; Jung et al., 1997), where the GlcNAc is in an [alpha]-linkage to the peptide backbone (Jung et al., 1998).
Peptides from an in situ endoproteinase trypsin digest of ShD were analyzed by MALDI-TOF. One mass in the mass spectrum corresponds to a potential glycosylated peptide with the sequence LYYGEPCNTPTVTPTTTPSPTTKPPTGPLK (Figure
While the mass alone cannot determine the number of glycosylation sites, the masses of the glycopeptide in ShD (the mass of the nonglycosylated tryptic peptide is 3158.58) are consistent with 6-8 glycosylation sites of an oligosaccharide composed of GlcNAc, phosphate, and fucose (see Table II). The theoretical peptide contains 10 potential sites, but elsewhere (Jung et al., 1998) we have shown that in proteins expressed at the asexual stage in D.discoideum, the second Thr in a tandem repeat is not glycosylated. This thus leaves a maximum of 8 possible sites of glycosylation.
Table II.
Observed mass ShD[alpha] | Mass of carbohydratea | Compositionb |
6280.73 | 3121.51 | Hex1HexNAc9dHex5 (Phos)5 |
 | 3121.57 | Hex4HexNAc7dHex5 (Phos)4 |
6077.75 | 2918.31 | Hex1HexNAc8dHex5 (Phos)5 |
 | 2918.37 | Hex4HexNAc6dHex5 (Phos)4 |
5931.18a | 2772.17 | Hex1HexNAc8dHex4 (Phos)5 |
 | 2772.23 | Hex4HexNAc6dHex4 (Phos)4 |
5815.7 | 2756.17 | Hex3HexNAc6dHex5 (Phos)5 |
 | 2756.23 | HexNAc8dHex5 (Phos)5 |
5875.47a | 2715.12 | Hex1HexNAc7dHex5 (Phos)5 |
 | 2715.17 | Hex4HexNAc5dHex5 (Phos)4 |
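The carbohydrate masses in Table II can be approximated by summing residue masses over a proposed composition. The sketch below assumes standard average residue masses (anhydro forms for the sugars, HPO3 for phosphate); these values are not given in the text and may differ slightly from those the authors used, so only agreement to within about 0.1 Da should be expected.

```python
# Approximate average residue masses in Da (assumed values, not from the paper).
RESIDUE_MASS = {
    "Hex": 162.1406,     # hexose, C6H10O5
    "HexNAc": 203.1925,  # N-acetylhexosamine, C8H13NO5
    "dHex": 146.1412,    # deoxyhexose such as fucose, C6H10O4
    "Phos": 79.9799,     # phosphate, HPO3
}

def composition_mass(composition):
    """Sum residue masses for a composition given as {residue name: count}."""
    return sum(RESIDUE_MASS[name] * count for name, count in composition.items())

# The two compositions proposed for the largest glycopeptide in Table II.
first = composition_mass({"Hex": 1, "HexNAc": 9, "dHex": 5, "Phos": 5})
second = composition_mass({"Hex": 4, "HexNAc": 7, "dHex": 5, "Phos": 4})
print(round(first, 2), round(second, 2))
```

Two quite different compositions come out within a fraction of a dalton of each other, which illustrates why the mass alone cannot fix the number of glycosylation sites.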
A DictyOGlyc prediction on the complete protein chain of Sheathin D (ShD, compiled from the cDNA in Dicty-cDB, M. Fuchs and A. Gooley, unpublished observations) shows a cluster of six predicted O-[alpha]-GlcNAc sites (out of the above mentioned eight possible sites), corresponding to the spacer domain of the protein (Figure
Predictions on public Dictyostelium discoideum protein sequences
On a redundancy-reduced SwissProt/Genpept data set of 652 full-length sequences, the DictyOGlyc server predicted 252 (~38%) of these proteins as glycosylated (907 predicted glycosylation sites out of the 43,258 Ser and Thr residues in all). In another batch, the 11,689 nonredundant translated cDNA sequences from the Dicty-cDB database (5 Sep '98 version, kindly provided by H.Urushihara, Japan; http://www.csm.biol.tsukuba.ac.jp/cDNAproject.html) were tested for glycosylations, and the server predicted 3062 (~25%) of these as glycosylated. No further analysis was performed on this set at this stage; however, all predictions are available from our website (http://www.cbs.dtu.dk/services/DictyOGlyc/STATS/).
The protein entries predicted in the SwissProt/Genpept data set were analyzed further. A histogram (not shown) depicting distances between predicted glycosylated sites revealed a similar "even" positioning as seen in the training set data.
Predicted entries were ranked in descending order of their likelihood of being glycosylated. Proteins with high-scoring predicted sites were predominantly found to be membrane bound or secreted proteins of D.discoideum (Table III). This was expected, since these proteins follow a similar glycosylation pathway. A substantial number of lower-ranked predicted sites belonged to cytoplasmic and nuclear proteins. These proteins are believed to follow a different pathway for O-GlcNAc addition, even though some of the glycosyltransferase acceptor sites show similar patterns (Snow and Hart, 1998). Cysteine proteinases also follow a different glycosylation pathway (Mehta et al., 1997) and were not found to have any predicted glycosylations.
Table III.
Proteins picked up by the server include a variety of spore germination associated proteins and one can speculate whether glycosylation has any role to play in the spore germination process. A recent report (Ramalingam and Ennis, 1997) presents experimental evidence of a spore germination protein being O-glycosylated which is also on our list of "high-rankers" (SPG7_DICDI, SwissProt Acc. P22698).
GP100, a membrane glycoprotein, was predicted to possess a cluster of O-GlcNAc modified residues in a region annotated in SwissProt as a "Thr/Pro rich extracellular region" (Table III, 4th row). The glycoprotein has been annotated to possess both N-linked and O-linked glycosylations, but only the N-glycosylation sites (and not the O-glycosylations) seem to have been experimentally mapped.
Another "high-ranker" is the Contact site A protein which is known to contain reducing terminal GlcNAc residues. While the precise sites are unknown, the approximate locations of the O-glycosylations on the protein have been recently characterized to be in the carboxy terminal region of the protein (Mang, 1995) as predicted by our method.
FP21 is a cytoplasmic glycoprotein and is known to be O-glycosylated although presumably via a different pathway (Kozarov et al., 1995). The exact linkage of the sugar moieties to this cytoplasmic protein has recently been determined to be a pentasaccharide linked via its reducing terminal GlcNAc to a hydroxyproline residue (Teng-umnuay et al., 1998). Our server predicts four glycosylation sites (two high-ranking) in this 162 amino acid protein. The protein PSA_DICDI (SwissProt Acc. P12729) is a prespore-specific cell surface antigen that has been determined in vitro to have reducing terminal GlcNAc residues (Gooley et al., 1992). Some of the fusion peptides used in the training of the network were modified portions of this protein. Our method however predicts two extra sites, not represented in the training data set, which are in agreement with studies carried out recently in our laboratory (N.Zachara, personal communication).
Decomposition of the available proteome into functional and locational categories
Functional classification, derived primarily from the GeneQuiz server (Scharf et al., 1994; Casari et al., 1996), and categorization into cellular locations are depicted in Figure
The cellular-location classification (Figure
Figure 5. Functional and cellular locational classification of available proteome. (Left) Functional classification of the available D.discoideum proteome primarily using the GeneQuiz server (14 classes). The outer circle shows a segregation of all the protein sequences, while the inner circle depicts the fraction in each class which is predicted glycosylated. Figures in brackets: the first figure indicates the fraction of the proteome that the particular class constitutes; the second figure indicates the percentage (of that class) predicted glycosylated. * There is no "Cell envelope" in D.discoideum. This category, assigned by GeneQuiz, largely contains RAS-related proteins. (Right) Locational classification of proteins (classified using annotations, PSORT, SignalP, and the TMHMM prediction servers). Compartments chosen were those used by PSORT.
Mapping the predicted glycosylation sites along the length of all proteins (Figure
Figure 6. Glycosylation sites mapped across the length of all protein chains. The number of glycosylated sites is shown by the height of the bars. The functional (above) and locational (below) categories of the proteins in each column are indicated by a color scheme.
A study in which the number of sites in a protein chain was correlated with its function and location (Figure
Figure 7. The number of glycosylation sites in each protein correlated with its function/location. Binning the available D.discoideum proteome according to number of sites in each protein and classifying them functionally (left) and according to subcellular location (right).
Mapping glycosylation acceptor sites on proteins is an important step toward understanding the glycosylation process. Clearly, experimental determination of an acceptor site is more conclusive evidence than a theoretical prediction. However, given the tedious and time-consuming procedures involved in accurate experimental determination, prediction methods have the advantage of being fast, reproducible, and publicly available, and they have been shown to be accurate enough (Nielsen et al., 1999) to give valuable hints and a direction for further experiments. Because of the fuzzy sequence context of the acceptor site for O-linked glycans, we used artificial neural networks to identify such sites. A protein surface-accessibility prediction method was integrated into our method for the final prediction of a site being O-[alpha]-GlcNAc modified. The resulting method, "DictyOGlyc," is useful for rapidly scanning an entire database of proteins for possible candidates of glycosylation. In a separate study, the protein Sheathin D was analyzed for O-linked glycosylations. The experimental results agree well with the DictyOGlyc predictions.
Using our method, we scanned D.discoideum protein entries from the SwissProt and Genpept databases, which comprise roughly one-tenth of the entire proteome of the organism. A closer look was taken at some of the "high-ranking" predicted sites, almost all of which were from membrane and secreted proteins. A number of the proteins predicted to be glycosylated are known to carry O-glycosylations that have not yet been experimentally site-mapped. Roughly one-third of the available D.discoideum proteome we analyzed was predicted to have GlcNAc O-glycosylations, and these proteins span almost all functional classes. This gives an indication of how widespread these glycosylations may be within the organism. Membrane and secreted (extracellular) proteins were predicted to possess a number of clustered glycosylation sites in the C-terminal half of the protein. This probably corresponds to a proteolysis-resistant "spacer" region which extends the catalytic domain away from the cell surface. The class of transport and binding proteins is clearly an important class for glycosylations and it largely consists of sporulation-associated proteins. Regulatory proteins and nuclear proteins also show a high number of predicted glycosylation sites, which could possibly be sites for reciprocal phosphorylation.
The fact that glycosylations were predicted on a large number of intracellular and nuclear proteins was surprising, even though the predictions were mostly "low-ranking." This suggests that intracellular glycosylation motifs are very similar to extracellular ones, and the glycosyltransferases involved in both cases are either the same or have very similar catalytic domains. We do not know whether D.discoideum contains O-[beta]-GlcNAc glycosylated proteins, but considering the diversity of organisms that are known to carry this modification, it would appear almost inevitable that this modification will be discovered. However, a T.cruzi UDP-HexNAc:polypeptide [alpha]-N-acetylglucosaminyl transferase (O-[alpha]-GlcNAc transferase) activity present in a microsomal membrane preparation did not glycosylate the well characterized acceptor for the O-[beta]-GlcNAc-transferase, YSDSPSTST (Haltiwanger et al., 1992; Previato et al., 1998). While one motif is not sufficient to exclude the possibility of common acceptor motifs between cytosolic and secreted O-GlcNAc glycosylated proteins, a strategy where a neural network is trained on a data set of motifs identified in nuclear/cytosolic glycoproteins would allow a comparison between cytosolic and Golgi associated GlcNAc glycosylation motifs. Such a comparison could evaluate the extent and nature of the similarity in the motifs. Importantly, the Edman degradation based technology we use to identify glycosylation sites modified in vivo (Gooley and Williams, 1997) can differentiate between anomeric linkages of O-GlcNAc (N.E.Zachara, A.A.Gooley, and K.L.Williams, unpublished observations).
Data used in training the networks were derived from in vivo studies which makes the method more relevant to the biological system (Gooley and Williams, 1994). Glycosylation in vivo has been shown to be less restricted in terms of regions flanking the acceptor site in comparison to in vitro glycosylation (Nehrke et al., 1996). Statistical analysis of the training data does indicate certain residue preferences in regions surrounding the O-glycosylated sites though not such that a consensus acceptor sequence pattern can be defined. Interestingly, all glycosylated positions were found to be at even positions relative to each other which might indicate a structural preference such as [beta]-sheets making alternate residues accessible for being occupied.
The training data set has been limited in order to study a specific system: membrane and secreted proteins of D.discoideum. While confining the scope of the method, it also eliminates noise from data of other systems, allowing this particular glycosylation event to be studied in detail. Correlation studies with NetOGlyc (Hansen et al., 1998) indicate that the domains of GlcNAc acceptor specificities on D.discoideum and those of GalNAc on mammalian systems are not mutually exclusive and suggest that D.discoideum may contain a subset of the glycosyltransferases found in mammals. Further, this provides an indication that mammalian sequences may exhibit aberrant glycosylation patterns if expressed in D.discoideum. An analogous case has been shown earlier where recombinant gp120 from HIV-1, expressed in baculovirus infected insect cells, exhibits an altered glycosylation pattern from mammalian gp120 (Moore et al., 1990). One strategy of reducing obstacles with recombinant mammalian glycoprotein production in D.discoideum would be to tailor glycosylation sequence motifs in mammalian proteins for expression in D.discoideum systems with the help of the DictyOGlyc server.
The method is useful in detecting patterns similar to those shown during training. Experimental elucidation of new glycosylation sites with different acceptor motif patterns would be useful to incorporate into the server for enhanced predictions. Neural networks, in the future, may provide a very powerful tool in functional genomics for mass screening of translated DNA sequence databases to obtain an idea of the type and extent of post-translational modifications in the proteome. A combination of such servers could further provide a method for functional annotation of the large number of unassigned ORFs being churned out rapidly by genome sequencing projects.
Publicly available E-Mail and WWW server
The method has been made available on the Internet; amino acid sequences can be submitted at http://www.cbs.dtu.dk/services/DictyOGlyc/ for instant prediction of GlcNAc O-glycosylations. Alternatively, sequences can be E-mailed to DictyOGlyc@cbs.dtu.dk. Sending the word "help" in the E-mail text will return information on input and output formats. We encourage users to return feedback on any experimental confirmations or falsifications of the predictions. Any new information regarding glycosylation on D.discoideum would be highly appreciated and could be used to retrain the networks thereby improving our current prediction accuracy.
Sequence data used for network training
As in vivo glycosylation has been shown to differ from in vitro glycosylation (Nehrke et al., 1996), it is relevant to mention that the data used in this study were exclusively derived from in vivo studies. In vivo expressed glutathione S-transferase fusion peptides, with experimentally determined GlcNAc O-glycosylation sites (Jung et al., 1997, 1998), were among the sequences used for training the neural networks. Other sequences employed for the training include secreted recombinant forms of PsA (Prespore-specific antigen) (Zachara et al., 1996), a spore coat protein SP96 and the N-terminal sequence of a recently described (Packer et al., 1997) secreted protein ("ATP"-Protein). This set of eight proteins had 39 serine and threonine residues [alpha]-linked to an O-GlcNAc sugar moiety (Table IV) and encompasses all in vivo data acquired to date with respect to D.discoideum. Elucidation of glycosylation sites was performed by solid-phase protein sequencing and has been reported earlier (Zachara et al., 1996; Jung et al., 1997, 1998; Packer et al., 1997).
Discussion
Materials and methods
Analyzing the sequence context and conformational preferences
One approach used for analyzing sequences towards pattern identification is the study of information density over the sequence motif. Information content of a particular position in a sequence is a measure of the degree of residue conservation at that position. Sequence windows centered on the glycosylated residue were aligned and the Kullback-Leibler information content was quantified by the formula
$$I_i = \sum_{L} p_{iL} \log_2 \frac{p_{iL}}{q_{iL}}$$
where I is the information content at position i, p the probability of occurrence of the amino acid L (at position i) in a glycosylated window, and q the probability of occurrence in a nonglycosylated window (the background distribution). This information content, expressed in bits/amino acid, was visualized using sequence logos (Schneider and Stephens, 1990). Here, amino acid symbols are scaled to heights denoting their frequencies of occurrence at each position. Scaled amino acid symbols are stacked vertically in ascending order of their heights, the total height of the stack giving the value of I at that position. The sequence logo illustrates preferred acceptor patterns for O-GlcNAc transferases and makes any consensus sequence distinctly evident. A preliminary structural study was performed on the known glycosylation windows and their secondary structures predicted using the PHD method (Rost and Sander, 1994). Resulting predictions were represented using the Kullback-Leibler information content expressed in the form of sequence logos, as described above. Predictions around nonglycosylated amino acids were taken as reference in the Kullback-Leibler information measure.
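The per-position Kullback-Leibler information content described above can be computed from two sets of aligned windows. The following is a minimal sketch, assuming plain observed frequencies without pseudocounts and a base-2 logarithm (so that the result is in bits); the short windows are invented for illustration and are not the training data.

```python
from collections import Counter
from math import log2

def column_frequencies(windows, position):
    """Observed residue frequencies at one position of the aligned windows."""
    counts = Counter(window[position] for window in windows)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

def information_content(glycosylated, background, position):
    """Kullback-Leibler information (bits) of glycosylated vs. background windows."""
    p = column_frequencies(glycosylated, position)
    q = column_frequencies(background, position)
    total = 0.0
    for residue, p_residue in p.items():
        if residue not in q:
            # Without pseudocounts the measure is undefined here (assumption).
            raise ValueError(f"{residue} never occurs in the background")
        total += p_residue * log2(p_residue / q[residue])
    return total

# Hypothetical three-residue windows centred on a Ser/Thr.
glycosylated = ["PTP", "PTP", "ATP", "PTV"]
background = ["PSA", "ATP", "GTV", "PSG"]
profile = [information_content(glycosylated, background, i) for i in range(3)]
print([round(value, 3) for value in profile])
```

In a sequence logo, each residue at a position would then be drawn with a height proportional to its share of this total.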
Neural network algorithms
The basic idea with artificial neural networks is to use a network of neurons, each node (or neuron) having multiple inputs and a single output based on the weights (or strengths) associated with the various inputs. On presenting sequence windows, exhibiting a particular feature, repetitively to such a network, randomly initialized weights can be adjusted to achieve a desired output, that is, to classify the pattern to the assigned feature. Once the network has been so trained on sequences with a known feature, it can be presented with an unclassified sequence, and patterns within the sequence are (hopefully) suitably classified. We employed standard feed-forward artificial neural networks with sigmoidal nodes and one layer of hidden units (Baldi and Brunak, 1998). Back propagation was used for adjusting the weights (Rumelhart et al., 1986), and amino acids were represented in the network by sparse encoding (Qian and Sejnowski, 1988). The method was similar to the one we used earlier for predicting O-linked GalNAc residues on mammalian proteins (Hansen et al., 1998).
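To make the network architecture concrete, the sketch below shows sparse (one-hot) encoding of a sequence window and a single forward pass through a feed-forward network with one hidden layer of sigmoidal units. It is an illustration under stated assumptions rather than the authors' implementation: the padding symbol, the weight initialization range, and all function names are hypothetical, and the back-propagation training step is omitted.

```python
import math
import random

# 20 standard amino acids plus a padding symbol for window positions that
# extend past a chain terminus (the padding symbol is an assumption).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def sparse_encode(window):
    """Encode each residue as a group of 21 units with exactly one unit set to 1."""
    vector = []
    for residue in window:
        units = [0.0] * len(ALPHABET)
        units[ALPHABET.index(residue)] = 1.0
        vector.extend(units)
    return vector

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer of sigmoidal units feeding a single sigmoidal output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

def random_network(n_inputs, n_hidden, rng):
    """Small random starting weights, as used before any training."""
    hidden_weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                      for _ in range(n_hidden)]
    output_weights = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    return hidden_weights, [0.0] * n_hidden, output_weights, 0.0

rng = random.Random(0)
encoded = sparse_encode("TPTVT")           # window of five residues
network = random_network(len(encoded), 2, rng)  # two hidden units
score = forward(encoded, *network)
print(len(encoded), score)
```

With untrained weights the output carries no meaning; training would adjust the weights so that windows around glycosylated sites score close to one.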
We evaluated performance of neural networks with different parameters by partitioning the entire data set into a training data set and a test data set. Essentially seven of the eight protein sequences were used to train the network, the remaining sequence being used as test data. The networks were tested over moving window sizes ranging from three amino acids (one amino acid flanking the serine or threonine site) to 23 amino acids (11-residue flanks), and with the number of hidden units ranging from 2 to 8. Performance was evaluated with cross-validation using a coefficient of correlation (Matthews, 1975) given by
$$C = \frac{P_x N_x - P_{fx} N_{fx}}{\sqrt{(N_x + N_{fx})(N_x + P_{fx})(P_x + N_{fx})(P_x + P_{fx})}}$$
where Px is the number of true positives (experimentally verified glycosylated sites which are also predicted glycosylated), Nx the number of true negatives (experimentally verified nonglycosylated, predicted nonglycosylated), Pfx the number of false positives (experimentally nonglycosylated, predicted glycosylated) and Nfx, the number of false negatives (experimentally verified glycosylated, predicted nonglycosylated).
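A short sketch of how this correlation coefficient could be computed from observed and predicted site labels is given below. It assumes the standard Matthews (1975) form built from the four counts defined above; returning 0.0 when the denominator vanishes is a common convention adopted here, not something stated in the text.

```python
import math

def matthews_correlation(observed, predicted):
    """Matthews correlation coefficient for paired boolean labels.

    observed[i]  is True if site i is experimentally glycosylated.
    predicted[i] is True if site i is predicted glycosylated.
    """
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)          # Px
    tn = sum(1 for o, p in pairs if not o and not p)  # Nx
    fp = sum(1 for o, p in pairs if not o and p)      # Pfx
    fn = sum(1 for o, p in pairs if o and not p)      # Nfx
    denom = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    if denom == 0:
        return 0.0  # convention for degenerate cases (assumption)
    return (tp * tn - fp * fn) / denom
```

A perfect prediction gives 1, a fully inverted prediction gives -1, and a prediction no better than chance gives a value near 0.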
We settled on a jury of six neural networks comprising three sets of training data, with window sizes of 5, 7, and 19 consecutive residues and, for each window size, 2 or 5 hidden units. The results arising from each network in the jury were sigmoidally enhanced and averaged to obtain a value between zero and unity. Traditionally, a threshold of 0.5 is used, so that a site with an output above 0.5 would be assigned as glycosylated. During evaluation of the networks, we tested them on threshold values ranging from 0.1 to 0.9 (data not shown). Most of the test data were correctly predicted when the threshold was set between 0.45 and 0.55. In the final method we used a surface-accessibility-modified threshold, as described below.
Protein surface-accessibility prediction
The exact subcellular location of the initiation of O-glycosylation is still controversial (Roth et al., 1994; Röttger et al., 1998). However, it is universally agreed that the O-glycan is linked post-translationally to Ser or Thr residues of a fully folded and assembled protein (Aubert et al., 1976; Rudd and Dwek, 1997; Van den Steen et al., 1998), and is thus surface exposed on the protein. Because surface exposure is a valuable factor in determining whether a site can be O-glycosylated, we combined the glycosylation predictions above with surface accessibility predictions for the potential sites.
The network used to predict surface accessibility was trained on a non-redundant (less than 25% sequence identity for alignment lengths of 80 or more residues) data set of 134 globular protein structures extracted from the PDB database. All of these were X-ray-determined structures with a resolution better than 2.5 Å. The Connolly Molecular Surface procedure (Connolly, 1993) was used with a probe radius of 1.4 Å, corresponding to the molecular radius of water, to label all residues in the 134 protein structures as either surface exposed or buried. A residue was assigned to the surface if more than 20% of its normalized standard maximal surface area (Rose et al., 1985) was exposed to the solvent. The resulting data set of 134 labeled protein sequences (residues labeled "buried"/"surface"), based on high resolution protein structures, was used to train neural networks to recognize the relation between sequence and surface accessibility. Overall, the method correctly predicted surface accessibility for 74% of the residues. Details of the network used, training procedures, and performance will be reported elsewhere. This surface prediction method has been used earlier in prediction of post-translational cleavage sites of picornaviral polyproteins (Blom et al., 1996) and prediction of mucin type O-glycosylations in mammalian proteins (Hansen et al., 1998).
Predicted outputs from the glycosylation network and the surface-accessibility network were combined in the following manner. The glycosylation threshold was reduced for sites predicted to be on surface, but increased for sites predicted to be buried. To accomplish this, a modulated threshold was derived by scaling the surface score (Os) with a factor e and off-setting it by a constant c. Using this variable cutoff, glycosylation was assigned if Og > c + (e × Os), where Og was the combined glycosylation potential. A systematic screening of combinations of e and c by cross-validating on the known experimental data yielded a scaling factor (e) of 0.999 and an off-set value (c) of 0.118 as optimum. The averaged glycosylation potential (from the six networks) was compared with the modulated surface threshold, and a site with a potential greater than the threshold qualified as glycosylated (Figure
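The decision rule Og > c + (e × Os) can be written out as a small function. The sketch below is only illustrative: it averages the jury outputs directly and omits the sigmoidal enhancement step, whose exact form is not given in the text, and the function and argument names are hypothetical. The default values of e and c are those reported above.

```python
def assign_glycosylation(network_outputs, surface_score, e=0.999, c=0.118):
    """Apply the surface-modulated threshold to one Ser/Thr site.

    network_outputs: outputs of the jury networks for this site (0..1).
    surface_score:   output Os of the surface-accessibility network.
    Returns (is_glycosylated, margin), where margin is the difference
    between the averaged potential Og and the modulated threshold.
    """
    og = sum(network_outputs) / len(network_outputs)  # combined potential
    threshold = c + e * surface_score                 # variable cutoff
    margin = og - threshold
    return margin > 0, margin
```

The returned margin is the quantity by which predicted sites can later be ranked.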
Figure 8. Schematic sketch of the neural networks used. A jury of six neural networks with different configurations were initially trained on the known glycosylated and nonglycosylated sites. This trained jury can then be used to predict sites on any protein sequence. The outputs from each of the networks are averaged and thresholded using a surface-prediction modulated value.
Predictions on the available proteome
The final method was put together as a prediction server and was employed to scan the available D.discoideum proteome in public databases for predicted glycosylations. The Dictyostelium subset in SwissProt Rel. 35 (Bairoch and Apweiler, 1998) comprised 267 sequences, which were combined with the 664 protein sequences obtained from GenPept Rel. 106 (Benson et al., 1998). A redundancy reduction was performed in which 100% identical protein sequences were eliminated from the combined set. The reduced set of 652 sequences (from an initial number of 931) formed our basic data set for further proteome studies. Glycosylation predictions were made on this set and ranked in order of their likelihood of being glycosylated (determined by the difference between predicted potential and threshold).
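The two bookkeeping steps in this paragraph, removing 100% identical sequences and ranking predicted sites by the difference between potential and threshold, might look roughly as follows. This is a minimal sketch with hypothetical function names and record layouts; it treats two entries as redundant only when their sequences are identical over their full length, which is one reading of the text.

```python
def remove_identical(records):
    """Keep the first entry for each distinct sequence.

    records: iterable of (identifier, sequence) pairs.
    """
    seen = set()
    unique = []
    for identifier, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((identifier, sequence))
    return unique

def rank_sites(sites):
    """Sort predicted sites by decreasing (potential - threshold).

    sites: iterable of (identifier, position, potential, threshold).
    Only sites whose potential exceeds the threshold are kept.
    """
    predicted = [s for s in sites if s[2] > s[3]]
    return sorted(predicted, key=lambda s: s[2] - s[3], reverse=True)
```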
In addition, we also analyzed translated cDNA sequences from the Dicty-cDB database. A total of 12,565 translated cDNA sequences was reduced for redundancy to 11,689 unique sequences. Glycosylation predictions were made on these and ranked.
Fine teasing the available proteome
The SwissProt-GenPept 652-entry reduced data set was segregated by function and by cellular location. Functional classifications were performed using the public GeneQuiz server (Scharf et al., 1994; Casari et al., 1996), which derives functional annotations for each sequence and categorizes them into one of 14 classes (including an "unknown" category). A few of the unknown category proteins were later reclassified manually into one of the 13 functional classes on the basis of their databank annotations.
Cellular locations of the proteins were derived using a combination (in order of precedence) of SwissProt annotations (where present), the PSORT II prediction server (Nakai and Kanehisa, 1992; Horton and Nakai, 1997), the SignalP prediction server (Nielsen et al., 1997), and the Transmembrane helix predictor TMHMM (Sonnhammer et al., 1998).
Three types of analyses were performed. First, percentages of glycosylated fractions in each functional and locational category were calculated and plotted (Figure
We thank Natasha Zachara for her critical comments and text regarding O-α-GlcNAc and O-β-GlcNAc glycosylation in simple eukaryotes, Andrew Bohlken for preparing the ShD alpha digest, Kristoffer Rapacki for assistance with database extraction and Hideko Urushihara, University of Tsukuba, Japan, who kindly provided a translated version of Dicty-cDB. Assistance from the maintainers of GeneQuiz and their patience with our large submissions is appreciated. E.J. appreciates support from the Deutscher Akademischer Austauschdienst (HspII/AUFE) and Macquarie University International Postgraduate Research Award for her doctoral research. K.L.W. and A.A.G. acknowledge support from the Australian Research Council and National Health and Medical Research Council for this research. R.G., J.H., and S.B. acknowledge support from the Danish National Research Foundation.
O-GlcNAc, O-linked N-acetylglucosamine; HexNAc, N-acetylhexosamine.
Last modification: 14 Oct 1999
Copyright©Oxford University Press, 1999.