Scanning the available Dictyostelium discoideum proteome for O-linked GlcNAc glycosylation sites using neural networks

Ramneek Gupta1, Eva Jung2, Andrew A. Gooley2, Keith L. Williams2, Søren Brunak1 and Jan Hansen1,3

1Center for Biological Sequence Analysis, Department of Biotechnology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark and 2School of Biological Sciences, Macquarie University, Sydney, 2109 NSW, Australia

Received on November 6, 1998; revised on February 19, 1999; accepted on March 16, 1999

Dictyostelium discoideum has been suggested as a eukaryotic model organism for glycobiology studies. Presently, the characteristics of acceptor sites for the N-acetylglucosaminyltransferases in Dictyostelium discoideum, which link GlcNAc in an alpha linkage to hydroxyl residues, are largely unknown. This motivates the development of a species-specific method for prediction of O-linked GlcNAc glycosylation sites in secreted and membrane proteins of D.discoideum. The method presented here employs a jury of artificial neural networks. These networks were trained to recognize the sequence context and protein surface accessibility in 39 experimentally determined O-[alpha]-GlcNAc sites found in D.discoideum glycoproteins expressed in vivo. In cross-validation, 97% of the glycosylated and nonglycosylated sites were correctly identified. Based on the currently limited data set, an abundant periodicity of two (positions -3, -1, +1, +3, etc.) in proline residues alternating with hydroxyl amino acids was observed upstream and downstream of the acceptor site. This was a consequence of the spacing of the glycosylated residues themselves, which were peculiarly found to be situated only at even positions with respect to each other, indicating that these may be located within [beta]-strands. The method has been used for a rapid and ranked scan of the fraction of the Dictyostelium proteome available in public databases, remarkably 25-30% of which was predicted to be glycosylated. The scan revealed acceptor sites in several proteins known experimentally to be O-glycosylated at unmapped sites. The available proteome was classified into functional and cellular compartments to study any preferential patterns of glycosylation. A sequence-based prediction server for GlcNAc O-glycosylations in D.discoideum proteins has been made available through the WWW at http://www.cbs.dtu.dk/services/DictyOGlyc/ and via E-mail to DictyOGlyc@cbs.dtu.dk.

Key words: Dictyostelium/O-glycosylation/neural-networks/prediction/proteome

Introduction

The addition of a carbohydrate moiety to the side-chain of a residue in a protein chain influences the physicochemical properties of the protein. Glycosylation is known to alter proteolytic resistance, protein solubility, stability, local structure, lifetime in circulation and immunogenicity (Lis and Sharon, 1993; Hounsell et al., 1996). The process, being vital in the functionality of certain proteins, is an important factor in the production of therapeutic proteins for humans (Goochee et al., 1991). Due to their ability to correctly fold proteins and perform post-translational modifications, eukaryotic expression systems have come into focus.

One such potential candidate for recombinant glycoprotein production is the simple eukaryote Dictyostelium discoideum (Jung and Williams, 1997; Slade et al., 1997). D.discoideum is an interesting model organism for studying glycosylation, and may yield insights into the evolution of the glycosylation apparatus from simple eukaryotes to the more complex mammalian systems.

A useful aspect for glycoprotein design is the determination of preferable acceptor motifs for glycosyltransferases. The acceptor site for N-glycosylation in various systems is relatively well known (Bause, 1983). The consensus sequence Asn-Xaa-Ser/Thr-Xaa (Xaa being any residue except Proline) provides the necessary signal for glycosylation, even though not all of these motifs are actually utilized. Information from studies on O-glycosylation acceptor sites has, however, been rather limited. It is known that O-glycosylation occurs on certain hydroxyl residues (mainly serine and threonine), but not on all of these exposed amino acids in the protein.

Residues flanking the acceptor sites are believed to influence the process of glycosylation (Wilson et al., 1991; O'Connell et al., 1992; Wang et al., 1993; Hansen et al., 1995; Nehrke et al., 1996) though no simple acceptor consensus sequence is obvious (Gooley and Williams, 1994; Chou et al., 1995). Several classes of O-linked glycosylation have been reported (Hansen et al., 1995) as opposed to the single class of reducing terminal linkages for N-linked glycosylation (Kornfeld and Kornfeld, 1985). Further, multiple glycosyltransferases may be involved in the process with distinct substrate specificities, as is the case with the UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase family (Clausen and Bennett, 1996; Wandall et al., 1997).

In vivo studies on the human von Willebrand factor (Nehrke et al., 1996) indicate that position -1 relative to the glycosylated site is sensitive to charged residues. Acidic residues at positions -1 and +3 eliminate glycosylation altogether. Other studies indicate that prolines, serines, and threonines (O'Connell et al., 1991; Wilson et al., 1991; Elhammer et al., 1993), but no charged residues (Nehrke et al., 1996), flank the glycosylated Ser or Thr. Many O-glycosylated sites are known to occur among Ser and Thr clusters (Gooley et al., 1991; Wilson et al., 1991; Pisano et al., 1993; Hansen et al., 1998). Further, there are indications that serine residues are less frequently glycosylated compared to threonine residues in the same sequence context (Elhammer et al., 1993; Wang et al., 1993; Jung et al., 1997).

Since O-glycosylation is a post-translational event most probably initiated in the Golgi apparatus (Roth et al., 1994; Röttger et al., 1998), it is reasonable to assume that glycosylation acceptor sites occur on the surface of the protein (Aubert et al., 1976), suitably exposed to a UDP-HexNAc:polypeptide N-acetylhexosaminyl transferase. O-glycosylation is thus also dependent on local secondary structure (Hansen et al., 1994) and on the overall tertiary structure of the protein.

Apart from the O-linked GalNAc ("mucin type") modification, another prominent form of glycosylation is the addition of N-acetylglucosamine (GlcNAc) as the primary glycan to the side chain of serine or threonine residues. Two configurations have been described for the anomeric linkage of GlcNAc to the polypeptide chain. O-[alpha]-Linkages normally exist for cell surface and extracellular proteins in simple eukaryotes (Jung et al., 1998; Previato et al., 1998), while O-[beta]-GlcNAc occurs on nuclear and cytosolic glycoproteins in a large range of eukaryotes (Hart, 1997). The enzyme responsible for this glycosylation (O-[beta]-GlcNAc) has recently been cloned and characterized (Kreppel et al., 1997; Lubas et al., 1997).

The secreted and cell-surface glycoproteins of several simple eukaryotes, including Entamoeba histolytica, Trypanosoma cruzi, Leishmania major, Plasmodium falciparum, and Dictyostelium discoideum have recently been shown to carry O-GlcNAc. In the case of T.cruzi and D.discoideum secreted glycoproteins, the reducing terminal linkage has been identified as O-[alpha]-GlcNAc (Previato et al., 1998; Jung et al., 1998). Intracellular proteins with O-GlcNAc have also been identified in some of these eukaryotes (e.g., in Leishmania sp., Handman et al., 1993), so it is probable that two O-GlcNAc glycosylation pathways exist in these organisms also: an ER/Golgi pathway responsible for secreted and cell-surface glycoproteins and a second cytosolic/nuclear pathway responsible for the O-[beta]-GlcNAc glycosylation. Are any of the glycosylation motifs similar between the O-[alpha]-GlcNAc glycosylation and the O-[beta]-GlcNAc glycosylation?

As no simple pattern for GlcNAc acceptor sites emerges, the glycosylated and nonglycosylated sites are not easily separable. This prompted us to investigate if these fuzzy patterns were recognizable using artificial neural networks. Neural networks are capable of classifying even highly complex and nonlinear biological sequence patterns, where correlations between positions are important (Presnell and Cohen, 1993). Not only does the network recognize the patterns seen during training, but it also retains the ability to generalize and recognize similar, though not identical patterns. Artificial neural network algorithms have been extensively used in biological sequence analysis (Wu, 1997; Baldi and Brunak, 1998).

Our study employs artificial neural networks in an attempt at predicting O-[alpha]-GlcNAc glycosylation sites in membrane and secreted proteins of D.discoideum. It is suggested that O-linked GlcNAc addition to these proteins in D.discoideum may resemble the mucin-type glycosylation in mammals (Jung et al., 1998). The knowledge of glycosylation sites (and their sequence contexts) from in vivo experimental data (Jung et al., 1997, 1998) was combined with surface accessibility prediction to develop a method for predicting putative acceptor sites in other D.discoideum amino acid sequences. The 34 megabase D.discoideum genome, comprising six chromosomes, is estimated to contain ~7000 genes (Loomis and Smith, 1995), roughly one-tenth of which is available in public databases. In our study, we scanned the D.discoideum protein sequences available in the SwissProt (Bairoch and Apweiler, 1998) and Genpept (Benson et al., 1998) databases for GlcNAc O-glycosylation and ranked these according to their likelihood of being glycosylated. Further, we have attempted to classify the available sequences into functional categories and cellular locations and to mark the glycosylated fractions.

Results

The sequence context around GlcNAc O-linked glycosylation sites


Figure 1. Sequence information logo of O-glycosylated sequence contexts. Conservation of residues at each position is computed as a function of the Kullback-Leibler information content (Kullback and Leibler, 1951) and represented as sequence logos (Schneider and Stephens, 1990). Sequences are aligned with position zero being the glycosylated Ser/Thr. Heights of the letters indicate their information content, the letters are stacked vertically according to their heights, the ones on the top of each stack thus reflecting the most conserved residue at that position. The 39 glycosylated windows were referenced against a background of all the nonglycosylated sequence windows (window size = 41, n = 324). The charged residues are shown in blue (basic, positive) and red (acidic, negative), while the neutral ones are shown in green (polar, hydrophilic) and black (apolar, hydrophobic).

The data used in this paper comprised eight Dictyostelium discoideum membrane and secreted protein sequences. Earlier experiments showed 39 of the 363 potential serine/threonine sites to be modified in vivo with O-[alpha]-linked GlcNAc moieties (Table IV). The distribution of amino acid residues around these O-GlcNAc sites is illustrated in Figure 1. While no clear consensus is observable, a high occurrence of proline (alternate positions from -17 to +19) and valine residues (alternate positions from -7 to +13) alternating with threonine residues is observed. This is largely due to the close clustering of a number of glycosylated positions in our data. However, it may indicate that residues significant for glycosylation extend beyond the earlier reported -4 and +4 positions (Elhammer et al., 1993). The high occurrence of prolines at positions -1 and +3 relative to the glycosylated site reported by Wilson et al. (1991) is observed in our study, but is not limited to these positions. In a recent report, Jung et al. (1997) showed the motif PTVTPT, which also displays the alternating proline/valine positioning, to be sufficient for O-glycosylation.
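The per-position information content behind a logo such as Figure 1 can be sketched as follows. This is only an illustration, assuming the relative-entropy form sum over residues a of p(a) log2(p(a)/q(a)), with p taken from the glycosylated windows and q from the nonglycosylated background. The function names, the pseudocount, and the five-residue toy windows are hypothetical and are not taken from the data set used here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(windows, column, pseudocount=0.5):
    """Residue frequencies in one aligned column, smoothed by a pseudocount.

    Characters outside the 20 standard residues (e.g. padding) are ignored.
    """
    counts = Counter(w[column] for w in windows if w[column] in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def relative_entropy(foreground, background, column):
    """Kullback-Leibler divergence (bits) of foreground vs. background column."""
    p = column_frequencies(foreground, column)
    q = column_frequencies(background, column)
    return sum(p[a] * math.log2(p[a] / q[a]) for a in AMINO_ACIDS)

# Hypothetical toy windows centred on a Ser/Thr (real windows span 41 residues).
glyco = ["PTVTP", "PTPTP", "VTPTV"]
nonglyco = ["GSAKL", "DTEKR", "ASGSL", "KTLEA"]

profile = [relative_entropy(glyco, nonglyco, i) for i in range(5)]
for offset, bits in zip(range(-2, 3), profile):
    print(f"position {offset:+d}: {bits:.2f} bits")
```

The pseudocount keeps the logarithm finite when a residue is absent from the background column; with only 39 positive windows some smoothing of this kind would likely be needed, but the value used above is arbitrary.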

Figure 2 shows a histogram of the distances between glycosylated sites in the same peptide. It reflects the clustering and shows the strong tendency of glycosylated sites to lie close to one another. An interesting observation is that all utilized sites occur at even distances from each other. This extends even to consecutive glycosylated sites with 11 intervening nonglycosylated residues.


Figure 2. Distance distribution of glycosylated sites. Distances between glycosylated sites in the training set data are plotted here. Clustering is evident by the high frequency of closely placed sites (distances of 2, 4). The empty bins at positions 1, 3, 5, 7, . . . display the peculiar nature of the occupied sites never to occur at odd positions relative to each other.
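A distance histogram of this kind, and the check for "even" spacing, can be sketched as below. The sketch assumes that distances are taken between neighbouring glycosylated sites within one peptide; the site positions are a hypothetical toy example, not the training data.

```python
from collections import Counter

def neighbour_distances(sites):
    """Histogram of distances between consecutive glycosylated positions."""
    ordered = sorted(sites)
    return Counter(b - a for a, b in zip(ordered, ordered[1:]))

def all_even(histogram):
    """True if no pair of neighbouring sites is an odd distance apart."""
    return all(distance % 2 == 0 for distance in histogram)

# Hypothetical 1-based positions of occupied sites in a single peptide.
toy_sites = [9, 11, 13, 15, 17, 29]
hist = neighbour_distances(toy_sites)

for distance in sorted(hist):
    print(distance, hist[distance])
print("even spacing only:", all_even(hist))
```

If all neighbouring distances are even, every pairwise distance within the peptide is even as well, so checking neighbours is sufficient. A distance of 12 corresponds to 11 intervening residues.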

We did not segregate sequence contexts for serine and threonine glycosylated sites owing to the paucity of experimental data for serine glycosylated sites (only three Ser sites known).

Conformational preferences of the O-glycosylated sites

Secondary structure predictions using the PHD server (Rost and Sander, 1994) indicated that the glycosylated sites predominantly lay in coil and [beta]-sheet regions (data not shown). Of the glycosylated sites, 69% were predicted to have no secondary structure and the remaining 31% were predicted to be in [beta]-strands. Regions around the glycosylated site were largely unassigned (coils), with positions -7 and -3 showing a strong preference for [beta]-strands. The "even" positioning of residues shown in Figure 2 suggests structural constraints for the transferase enzyme or possible "twists" in the protein chain making alternate residues inaccessible. A likely example of this could be an antiparallel [beta]-sheet. This is indeed seen in one of the predicted protein entries (SwissProt Acc. P13466, PDB Entry 1KSR) whose structure (Fucini et al., 1997) is available in the Brookhaven Protein Databank (Bernstein et al., 1977; Abola et al., 1987). The only other two proteins (SwissProt Acc. P10733, P20425; PDB Entries 1SVQ, 1UKE) with predicted glycosylations that had structures available (Schnuchel et al., 1995; Scheffzek et al., 1996) in PDB at this time exhibited the predicted glycosylated sites in highly accessible coils. This is in accordance with our findings of the glycosylated sites occurring within [beta]-sheets and coil regions.

Predictive performance of the neural networks

The final method correctly identifies ~97% of the glycosylated and nonglycosylated residues in a cross-validated test. Performance measures are summarized in Table I.

The incorporation of protein surface-accessibility predictions (using a surface-derived threshold) increases sensitivity and makes the method more reliable for new entries. Prediction involving serine glycosylated residues is currently poor, and the network does not distinguish serine flanks from threonine flanks very well. We hope to improve this as more experimental data on glycosylated serine sites become available.
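The decision rule, in which a Ser/Thr is called glycosylated when its network potential crosses a surface-derived threshold, can be sketched as follows. How the jury combines its member networks and how the threshold is derived from predicted accessibility are not specified in this section, so the sketch simply assumes an unweighted average of the network outputs and takes the per-residue thresholds as given input. All names and numbers are hypothetical.

```python
def jury_potential(network_scores):
    """Combine the outputs of the individual networks (assumed: plain average)."""
    return sum(network_scores) / len(network_scores)

def predict_sites(sequence, scores_per_site, thresholds):
    """Return 1-based positions of Ser/Thr residues predicted as glycosylated.

    scores_per_site: position -> list of outputs, one per network in the jury
    thresholds:      position -> surface-derived threshold for that residue
    """
    predicted = []
    for position, residue in enumerate(sequence, start=1):
        if residue not in "ST":
            continue  # only hydroxyl residues are candidate acceptor sites
        if jury_potential(scores_per_site[position]) > thresholds[position]:
            predicted.append(position)
    return predicted

# Hypothetical example: three networks scoring the three Ser/Thr of "PTVTPS".
scores = {2: [0.9, 0.8, 0.7], 4: [0.6, 0.4, 0.5], 6: [0.2, 0.3, 0.1]}
thresholds = {2: 0.50, 4: 0.55, 6: 0.50}
print(predict_sites("PTVTPS", scores, thresholds))
```

Keeping the threshold per residue, rather than global, is what lets a surface-accessibility prediction raise or lower the bar for individual sites.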

Predictions using NetOGlyc 2.0

We performed a preliminary study comparing acceptor site similarities between the D.discoideum GlcNAc context and the mammalian GalNAc context. NetOGlyc is a mucin-type (GalNAc) glycosylation predictor trained on mammalian protein sequences (Hansen et al., 1995; Hansen et al., 1998). Predictions using this server on our D.discoideum data set gave a correlation coefficient of 0.74. Interestingly, the server identified 35 of the 39 D.discoideum GlcNAc-modified sites; however, it also made 13 incorrect predictions (false positives). This indicates a high level of overlap in specificity between the two cases, and hints that the D.discoideum GlcNAc specificities may be a subset of the mammalian repertoire of O-glycosylations.

Table I. Predictive performance of the neural network algorithms used for prediction of GlcNAc O-glycosylation

Method              Sensitivity (%)   Specificity (%)   PPV (%)   NPV (%)   Corr. coeff.
ANNs w/o surface    89                97                93        94        0.87
ANNs with surface   97                97                94        98        0.93

Cross-validation was performed over the test data sets. Values in the table have been rounded up to the nearest integer. ANNs: artificial neural networks (GlcNAc site predictors); sensitivity: Px/(Px + Nfx); specificity: Nx/(Nx + Pfx); PPV: positive prediction value, Px/(Px + Pfx), reflects the reliability of the positively predicted sites; NPV: negative prediction value, Nx/(Nx + Nfx), reflects the reliability of the negatively predicted sites.
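The measures in Table I can be computed from the four confusion counts as sketched below. The sketch reads Px and Nx as correctly predicted positives and negatives, and Pfx and Nfx as falsely predicted positives and negatives, which is what the formulas in the footnote imply. It further assumes that the correlation coefficient is the Matthews coefficient, which is not stated in this section. The counts in the example are hypothetical.

```python
import math

def performance(px, nfx, pfx, nx):
    """Sensitivity, specificity, PPV, NPV and (assumed) Matthews correlation.

    px:  correctly predicted glycosylated sites
    nfx: glycosylated sites predicted negative (false negatives)
    pfx: nonglycosylated sites predicted positive (false positives)
    nx:  correctly predicted nonglycosylated sites
    """
    denominator = math.sqrt((px + pfx) * (px + nfx) * (nx + pfx) * (nx + nfx))
    return {
        "sensitivity": px / (px + nfx),
        "specificity": nx / (nx + pfx),
        "ppv": px / (px + pfx),
        "npv": nx / (nx + nfx),
        "corr_coeff": (px * nx - pfx * nfx) / denominator if denominator else 0.0,
    }

# Hypothetical counts, chosen only to exercise the formulas.
measures = performance(px=8, nfx=2, pfx=1, nx=89)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With far more negative than positive sites, a correlation coefficient is a useful companion to the percentages, since a predictor that labels everything negative would still score a high specificity.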


Figure 3. Predictions on the extracellular protein Sheathin D. The server predicts six O-GlcNAc sites in the spacer domain of ShD (LYYGEPCNTPTVTPTTTPSPTTKPPTGPLK, predicted glycosylated residues underlined). The vertical impulses are the Ser/Thr residues. A residue with an impulse crossing the surface-derived threshold (horizontal wavy line) is said to be glycosylated.


Figure 4. Linear MALDI-TOF analysis of tryptic digests of ShD. (A) is the complete spectrum for the tryptic digest of ShD observed in linear mode. The large peptide masses of 5875.47 to 6280.77 Da show the diagnostic ±203 Da and -146 Da mass differences typical of glycopeptides with HexNAc and Fucose additions, respectively. (B) shows the expanded glycopeptide region for ShD. The spectra were obtained on a Perseptive Biosystems MALDI-TOF instrument and insulin was used as external calibrant.

The protein SP96 (SwissProt Acc. P14328) was, however, not included in the above statistics. NetOGlyc abundantly predicted glycosylations throughout this protein. This was not a surprise, since the protein is known to contain O-linked fucose residues and fucose and/or GlcNAc in a phosphodiester linkage to serine or threonine residues, but no O-linked GlcNAc residues directly linked to serine or threonine residues (M.Mreyen, unpublished observations).

Experimental confirmation of a "blind"-prediction

Sheathin D, a glycoprotein found in the extracellular matrix of D.discoideum, was first described, and its N-terminal sequence published, by Ti et al. (1995). The protein reacts with the carbohydrate-specific mAb MUD50, giving a strong indication that Sheathin D contains reducing terminal O-GlcNAc. O-glycans recognized by mAb MUD50 at this developmental stage of the D.discoideum life cycle are normally composed of reducing terminal O-GlcNAc, phosphate, and fucose (Haynes et al., 1993; Zachara et al., 1996; Jung et al., 1997), where the GlcNAc is in an [alpha]-linkage to the peptide backbone (Jung et al., 1998).

Peptides from an in situ endoproteinase trypsin digest of ShD were analyzed by MALDI-TOF. One mass in the mass spectrum corresponds to a potential glycosylated peptide with the sequence LYYGEPCNTPTVTPTTTPSPTTKPPTGPLK (Figure 3). Mass differences of ±203 Da, ±162 Da, ±146 Da diagnostic of HexNAc, Hex, and deoxyHex heterogeneity respectively, were observed (Figure 4b).

While the mass alone cannot determine the number of glycosylation sites, the masses of the glycopeptide in ShD (the mass of the nonglycosylated tryptic peptide is 3158.58) are consistent with 6-8 glycosylation sites of an oligosaccharide composed of GlcNAc, phosphate, and fucose (see Table II). The theoretical peptide contains 10 potential sites, but elsewhere (Jung et al., 1998) we have shown that in proteins expressed at the asexual stage in D.discoideum, the second Thr in a tandem repeat is not glycosylated. This thus leaves a maximum of 8 possible sites of glycosylation.
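The site count for this peptide can be reproduced with a short sketch. It lists every Ser/Thr and then applies one possible reading of the tandem-repeat rule, namely that a Thr immediately following a retained Thr is skipped. That reading is an assumption made for illustration, as are the function names; the residues it retains are not necessarily the eight sites referred to in the text.

```python
PEPTIDE = "LYYGEPCNTPTVTPTTTPSPTTKPPTGPLK"

def hydroxyl_sites(sequence):
    """1-based positions of all Ser and Thr residues."""
    return [i for i, residue in enumerate(sequence, start=1) if residue in "ST"]

def drop_second_thr_in_tandem(sequence, sites):
    """Skip a Thr that directly follows a retained Thr (assumed reading of the rule)."""
    kept = []
    for position in sites:
        follows_kept_thr = (
            bool(kept)
            and kept[-1] == position - 1
            and sequence[position - 2] == "T"
        )
        if sequence[position - 1] == "T" and follows_kept_thr:
            continue
        kept.append(position)
    return kept

all_sites = hydroxyl_sites(PEPTIDE)
candidate_sites = drop_second_thr_in_tandem(PEPTIDE, all_sites)
print(len(all_sites), all_sites)
print(len(candidate_sites), candidate_sites)
```

Under this reading the run TTT keeps its first and third Thr and the pair TT keeps only its first, which takes the 10 Ser/Thr residues down to 8 candidates.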

Table II. Observed heterogeneity for ShD glycosylation on MALDI-TOF MS

Observed mass ShD[alpha]   Mass of carbohydratea   Compositionb
6280.73                    3121.51                 Hex1HexNAc9dHex5 (Phos)5
                           3121.57                 Hex4HexNAc7dHex5 (Phos)4
6077.75                    2918.31                 Hex1HexNAc8dHex5 (Phos)5
                           2918.37                 Hex4HexNAc6dHex5 (Phos)4
5931.18a                   2772.17                 Hex1HexNAc8dHex4 (Phos)5
                           2772.23                 Hex4HexNAc6dHex4 (Phos)4
5815.7                     2756.17                 Hex3HexNAc6dHex5 (Phos)5
                           2756.23                 HexNAc8dHex5 (Phos)5
5875.47a                   2715.12                 Hex1HexNAc7dHex5 (Phos)5
                           2715.17                 Hex4HexNAc5dHex5 (Phos)4

The peptide LYYGEPCNTPTVTPTTTPSPTTKPPTGPLK contains 10 possible sites, but only the 8 sites underlined have the potential to be glycosylated, based on previous in vivo studies (Jung et al., 1998). The peptide mass of the unglycosylated peptide is 3158.58. The ion-mode was M+H and all mass values are average. The search was done with a mass error of ±0.5 Da. The DictyOGlyc 1.1 server predicts six GlcNAc residues in a reducing terminal linkage. Additional HexNAc residues in the composition could account for glycan chain extensions (Haynes et al., 1993).
aMass error ±1.5 Da.
bThere are several compositions that fit the observed mass difference. However, only two structures are consistent with the known composition of the MUD50 dependent carbohydrate epitope.
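The arithmetic behind matching a composition to a carbohydrate mass can be sketched as follows. The residue masses are commonly used average values for anhydro-sugar residues and for a phosphate group (HPO3); they are supplied here as assumptions and are not necessarily the exact values used for the search, so small differences from the tabulated masses are to be expected. Function names are hypothetical.

```python
# Assumed average residue masses in Da (not taken from this paper).
RESIDUE_MASS = {
    "Hex": 162.1406,     # hexose
    "HexNAc": 203.1925,  # N-acetylhexosamine
    "dHex": 146.1412,    # deoxyhexose, e.g. fucose
    "Phos": 79.9799,     # phosphate, HPO3
}

def glycan_mass(composition):
    """Summed average mass of a composition given as {residue: count}."""
    return sum(RESIDUE_MASS[name] * count for name, count in composition.items())

def matches(carbohydrate_mass, composition, tolerance=0.5):
    """True if the composition explains the mass within the tolerance (Da)."""
    return abs(glycan_mass(composition) - carbohydrate_mass) <= tolerance

first = {"Hex": 1, "HexNAc": 9, "dHex": 5, "Phos": 5}
second = {"Hex": 4, "HexNAc": 7, "dHex": 5, "Phos": 4}

print(round(glycan_mass(first), 2), matches(3121.51, first))
print(round(glycan_mass(second), 2), matches(3121.57, second))
```

Both compositions listed against the 6280.73 peak fall within the ±0.5 Da search tolerance of one another, which illustrates why mass alone cannot choose between them.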

A DictyOGlyc prediction on the complete protein chain of Sheathin D (ShD, compiled from the cDNA in Dicty-cDB, M. Fuchs and A. Gooley, unpublished observations) shows a cluster of six predicted O-[alpha]-GlcNAc sites (out of the above mentioned eight possible sites), corresponding to the spacer domain of the protein (Figure 3). The remaining 45 Ser/Thr residues in the protein were predicted negative for O-[alpha]-GlcNAc modifications and the peptide mass fingerprint did not show any indications of further glycopeptides (data not shown).

Predictions on public Dictyostelium discoideum protein sequences

On a redundancy-reduced SwissProt/Genpept data set of 652 full-length sequences, the DictyOGlyc server predicted 252 (~38%) of these proteins as glycosylated (907 predicted glycosylation sites out of the 43,258 Ser and Thr residues in all). In another batch, the nonredundant 11,689 translated cDNA sequences from the Dicty-cDB database (5 Sep '98 version, kindly provided by H.Urushihara, Japan; http://www.csm.biol.tsukuba.ac.jp/cDNAproject.html) were tested for glycosylations, and the server predicted 3062 (~25%) of these as glycosylated. No further analysis of this set was performed at this stage; however, all predictions are available from our website (http://www.cbs.dtu.dk/services/DictyOGlyc/STATS/).

The protein entries predicted in the SwissProt/Genpept data set were analyzed further. A histogram (not shown) depicting distances between predicted glycosylated sites revealed a similar "even" positioning as seen in the training set data.

Predicted entries were ranked in descending order of their likelihood of being glycosylated. Proteins with high-scoring predicted sites were predominantly found to be membrane bound or secreted proteins of D.discoideum (Table III). This was expected, since these proteins follow a similar glycosylation pathway. A substantial number of lower-ranked predicted sites belonged to cytoplasmic and nuclear proteins. These proteins are believed to follow a different pathway for O-GlcNAc addition, even though some of the glycosyltransferase acceptor sites show similar patterns (Snow and Hart, 1998). Cysteine proteinases also follow a different glycosylation pathway (Mehta et al., 1997) and were not found to have any predicted glycosylations.


Table III. O-GlcNAc predictions made on Dictyostelium sequences from public databanks
Sequences for all D.discoideum proteins were extracted from SwissProt (rel. 35) and Genpept (rel. 106). Predictions were made using DictyOGlyc 1.1, and entries were ranked according to their likelihood of being glycosylated (potential - surface-derived threshold). The table shows some of the proteins with "high-ranking" predictions. The figures in the table show the Ser/Thr sites in each protein (mapped across the protein length) with each of their prediction potentials (vertical impulses) and the surface-derived threshold (horizontal wavy line). * indicates predicted O-glycosylated proteins which are also known experimentally to be O-glycosylated (exact sites not mapped).
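Ranking by "potential minus surface-derived threshold" can be sketched as below. The sketch assumes that each protein is scored by its best site, i.e. the largest margin over the threshold, which is one plausible way of turning per-site margins into a per-protein rank. The protein names and values are hypothetical.

```python
def best_margin(sites):
    """Largest (potential - threshold) over the Ser/Thr sites of one protein."""
    return max(potential - threshold for potential, threshold in sites)

def rank_proteins(predictions):
    """Protein names in descending order of their best margin."""
    return sorted(predictions, key=lambda name: best_margin(predictions[name]),
                  reverse=True)

# Hypothetical per-site (potential, threshold) pairs.
predictions = {
    "protB": [(0.6, 0.5)],
    "protC": [(0.3, 0.5)],
    "protA": [(0.9, 0.5), (0.4, 0.5)],
}

for name in rank_proteins(predictions):
    print(name, round(best_margin(predictions[name]), 2))
```

A protein whose best margin is negative has no site above its threshold and therefore falls at the bottom of the list as a nonglycosylated entry.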

Proteins picked up by the server include a variety of spore germination associated proteins and one can speculate whether glycosylation has any role to play in the spore germination process. A recent report (Ramalingam and Ennis, 1997) presents experimental evidence of a spore germination protein being O-glycosylated which is also on our list of "high-rankers" (SPG7_DICDI, SwissProt Acc. P22698).

GP100, a membrane glycoprotein, was predicted to possess a cluster of O-GlcNAc modified residues in a region annotated in SwissProt as a "Thr/Pro rich extracellular region" (Table III, 4th row). The glycoprotein has been annotated to possess both N-linked and O-linked glycosylations, but only the N-glycosylation sites (and not the O-glycosylations) seem to have been experimentally mapped.

Another "high-ranker" is the Contact site A protein which is known to contain reducing terminal GlcNAc residues. While the precise sites are unknown, the approximate locations of the O-glycosylations on the protein have been recently characterized to be in the carboxy terminal region of the protein (Mang, 1995) as predicted by our method.

FP21 is a cytoplasmic glycoprotein and is known to be O-glycosylated although presumably via a different pathway (Kozarov et al., 1995). The exact linkage of the sugar moieties to this cytoplasmic protein has recently been determined to be a pentasaccharide linked via its reducing terminal GlcNAc to a hydroxyproline residue (Teng-umnuay et al., 1998). Our server predicts four glycosylation sites (two high-ranking) in this 162 amino acid protein. The protein PSA_DICDI (SwissProt Acc. P12729) is a prespore-specific cell surface antigen that has been determined in vitro to have reducing terminal GlcNAc residues (Gooley et al., 1992). Some of the fusion peptides used in the training of the network were modified portions of this protein. Our method however predicts two extra sites, not represented in the training data set, which are in agreement with studies carried out recently in our laboratory (N.Zachara, personal communication).

Decomposition of the available proteome into functional and locational categories

Functional classification, derived primarily from the GeneQuiz server (Scharf et al., 1994; Casari et al., 1996), and categorization into cellular locations are depicted in Figure 5. Forty-four percent of all proteins were functionally unclassified, which incidentally is similar to the percentage of unclassified yeast ORFs. A large number of these unclassified proteins (40%) were predicted to be glycosylated. Other predicted glycosylations were widespread among all functional classes, notable among which were those for regulatory functions (68%) and for transport and binding proteins (61%). The "regulatory function" proteins mainly consisted of kinases, and the predicted glycosylation motifs could possibly be sites for reciprocal glycosylation and phosphorylation, a phenomenon revealed by G.Hart (Hart et al., 1995). The transport and binding proteins were a mix of sporulation-associated proteins, cell surface proteins, cAMP receptors, and others.

The cellular-location classification (Figure 5) indicated the cytoplasmic and nuclear proteins constituting up to 75% of the entire set, a large fraction of which was predicted glycosylated. Even though these were lower-ranked predictions compared to extracellular and membrane proteins, their high occurrence was surprising. Membrane and secreted proteins constituted 14% of all proteins, roughly half of which contained predicted glycosylation sites.


Figure 5. Functional and cellular locational classification of available proteome. (Left) Functional classification of the available D.discoideum proteome primarily using the GeneQuiz server (14 classes). The outer circle shows a segregation of all the protein sequences, while the inner circle depicts the fraction in each class which is predicted glycosylated. Figures in brackets: the first figure indicates the fraction of the proteome that the particular class constitutes; the second figure indicates the percentage (of that class) predicted glycosylated. * There is no "Cell envelope" in D.discoideum. This category, assigned by GeneQuiz, largely contains RAS-related proteins. (Right) Locational classification of proteins (classified using annotations, PSORT, SignalP, and the TMHMM prediction servers). Compartments chosen were those used by PSORT.

Mapping the predicted glycosylation sites along the length of all proteins (Figure 6) indicated a high preference for glycosylation near the C-terminal end of proteins, though this is also largely reflective of the highly clustered glycosylations in this area. Protein function did not appear to have any readily-apparent bias except for the preference of proteins involved in biosynthesis of cofactors, prosthetic groups and carriers to be glycosylated along 60-80% of their protein chain length. The segregation into cellular locations revealed that membrane proteins have a bias for being glycosylated toward their C-terminal ends.


Figure 6. Glycosylation sites mapped across the length of all protein chains. The number of glycosylated sites is shown by the height of the bars. The functional (above) and locational (below) categories of proteins in each of the columns are depicted by a coloration scheme.
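Mapping sites across protein length amounts to converting each site to a relative position and binning it, as sketched below. The choice of ten equal-width bins and the example proteins are assumptions made for illustration only.

```python
def relative_position_histogram(proteins, n_bins=10):
    """Count predicted sites per relative-length bin.

    proteins: iterable of (chain_length, [1-based site positions])
    Bin 0 covers the N-terminal end, bin n_bins - 1 the C-terminal end.
    """
    bins = [0] * n_bins
    for length, sites in proteins:
        for position in sites:
            index = min((position - 1) * n_bins // length, n_bins - 1)
            bins[index] += 1
    return bins

# Hypothetical proteins: (length, predicted glycosylation sites).
toy_proteins = [
    (100, [5, 85, 87, 99]),
    (50, [48, 50]),
]
histogram = relative_position_histogram(toy_proteins)
print(histogram)
```

Normalizing by chain length lets proteins of very different sizes contribute to the same plot, at the cost of hiding absolute distances from either terminus.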

A study in which the number of sites in a protein chain was correlated with its function and location (Figure 7) revealed that glycosylated sites occurred in abundance (>10 sites in one protein chain) mostly in extracellular and membrane proteins and, interestingly, also in nuclear proteins. Functionally, these included transport and binding proteins and those for regulatory functions and central intermediary metabolism. The predicted glycosylation sites in most of these proteins were found to be clustered and with "even" spacing. A fraction of these proteins were functionally "unknown"; the predicted glycosylated clusters could provide a valuable clue to their function. Cytoplasmic, cytoskeletal, peroxisomal, and endoplasmic reticular proteins did not carry more than three glycosylation sites in any single protein chain. Also, none of the Golgi-associated or secretory vesicle-associated proteins showed any predicted glycosylations despite having a large number of Ser/Thr sites. This might be an important result, since it could indicate a different pathway for this type of glycosylation. However, this is highly speculative with current data, since it is based on fewer than 10 proteins (both classes combined). Functionally, the only class of proteins which did not have at least one predicted glycosylated site was that of fatty acid and phospholipid metabolism; this class, however, comprises only a single protein.


Figure 7. The number of glycosylation sites in each protein correlated with its function/location. Binning the available D.discoideum proteome according to number of sites in each protein and classifying them functionally (left) and according to subcellular location (right).

Discussion

Mapping glycosylation acceptor sites on proteins is an important step toward understanding the glycosylation process. Clearly, experimental determination of an acceptor site is more conclusive evidence than a theoretical prediction. However, given the tedious and time-consuming procedures involved in accurate experimental determination, prediction methods have the advantage of being fast, reproducible, and publicly available, and have been shown to be accurate enough (Nielsen et al., 1999) to give valuable hints and a direction for further experiments. Due to the fuzzy sequence context of the acceptor site for O-linked glycans, we use artificial neural networks as a method for identifying such sites. A protein surface-accessibility prediction method was integrated into our method for the final prediction of a site being O-[alpha]-GlcNAc modified. The developed method, "DictyOGlyc," is useful in rapidly scanning an entire database of proteins for possible candidates of glycosylation. In a separate study, the protein Sheathin D was analyzed for O-linked glycosylations. The experimental results show a high correlation to the DictyOGlyc predictions.

Using our method, we scanned D.discoideum protein entries from the SwissProt and GenPept databases, which comprise roughly one-tenth of the entire proteome of the organism. A closer look was taken at some of the "high-ranking" predicted sites, almost all of which were from membrane and secreted proteins. A number of the proteins predicted to be glycosylated are known to have O-glycosylations which have not yet been experimentally site-mapped. Roughly one-third of the available D.discoideum proteome we analyzed was predicted to have GlcNAc O-glycosylations, and these proteins span almost all functional classes. This gives an indication of how widespread these glycosylations may be within the organism. Membrane and secreted (extracellular) proteins were predicted to possess a number of clustered glycosylation sites in the C-terminal half of the protein. This probably corresponds to a proteolysis-resistant "spacer" region which extends the catalytic domain away from the cell surface. The class of transport and binding proteins is clearly an important class for glycosylations and it largely consists of sporulation-associated proteins. Regulatory proteins and nuclear proteins also show a high number of predicted glycosylation sites, which could possibly be sites for reciprocal phosphorylation.

The fact that glycosylations were predicted on a large number of intracellular and nuclear proteins was surprising, even though the predictions were mostly "low-ranking." This suggests that intracellular glycosylation motifs are very similar to extracellular ones, and the glycosyltransferases involved in both cases are either the same or have very similar catalytic domains. We do not know whether D.discoideum contains O-[beta]-GlcNAc glycosylated proteins, but considering the diversity of organisms that are known to carry this modification, it would appear almost inevitable that this modification will be discovered. However, a T.cruzi UDP-HexNAc:polypeptide [alpha]-N-acetylglucosaminyl transferase (O-[alpha]-GlcNAc transferase) activity present in a microsomal membrane preparation did not glycosylate the well characterized acceptor for the O-[beta]-GlcNAc-transferase, YSDSPSTST (Haltiwanger et al., 1992; Previato et al., 1998). While one motif is not sufficient to exclude the possibility of common acceptor motifs between cytosolic and secreted O-GlcNAc glycosylated proteins, a strategy where a neural network is trained on a data set of motifs identified in nuclear/cytosolic glycoproteins would allow a comparison between cytosolic and Golgi associated GlcNAc glycosylation motifs. Such a comparison could evaluate the extent and nature of the similarity in the motifs. Importantly, the Edman degradation based technology we use to identify glycosylation sites modified in vivo (Gooley and Williams, 1997) can differentiate between anomeric linkages of O-GlcNAc (N.E.Zachara, A.A.Gooley, and K.L.Williams, unpublished observations).

Data used in training the networks were derived from in vivo studies, which makes the method more relevant to the biological system (Gooley and Williams, 1994). Glycosylation in vivo has been shown to be less restricted in terms of regions flanking the acceptor site in comparison to in vitro glycosylation (Nehrke et al., 1996). Statistical analysis of the training data does indicate certain residue preferences in regions surrounding the O-glycosylated sites, though not to the extent that a consensus acceptor sequence pattern can be defined. Interestingly, all glycosylated positions were found to be at even positions relative to each other, which might indicate a structural preference such as [beta]-sheets, in which alternate residues are accessible for glycosylation.

The training data set has been limited in order to study a specific system: membrane and secreted proteins of D.discoideum. While this confines the scope of the method, it also eliminates noise from data of other systems, allowing this particular glycosylation event to be studied in detail. Correlation studies with NetOGlyc (Hansen et al., 1998b) indicate that the domains of GlcNAc acceptor specificities on D.discoideum and those of GalNAc on mammalian systems are not mutually exclusive and suggest that D.discoideum may contain a subset of the glycosyltransferases found in mammals. Further, this provides an indication that mammalian sequences may exhibit aberrant glycosylation patterns if expressed in D.discoideum. An analogous case has been shown earlier where recombinant gp120 from HIV-1, expressed in baculovirus-infected insect cells, exhibits an altered glycosylation pattern from mammalian gp120 (Moore et al., 1990). One strategy for reducing obstacles with recombinant mammalian glycoprotein production in D.discoideum would be to tailor glycosylation sequence motifs in mammalian proteins for expression in D.discoideum systems with the help of the DictyOGlyc server.

The method is useful in detecting patterns similar to those shown during training. Experimental elucidation of new glycosylation sites with different acceptor motif patterns would be useful to incorporate into the server for enhanced predictions. Neural networks, in the future, may provide a very powerful tool in functional genomics for mass screening of translated DNA sequence databases to obtain an idea of the type and extent of post-translational modifications in the proteome. A combination of such servers could further provide a method for functional annotation of the large number of unassigned ORFs being churned out rapidly by genome sequencing projects.

Publicly available E-Mail and WWW server

The method has been made available on the Internet; amino acid sequences can be submitted at http://www.cbs.dtu.dk/services/DictyOGlyc/ for instant prediction of GlcNAc O-glycosylations. Alternatively, sequences can be E-mailed to DictyOGlyc@cbs.dtu.dk. Sending the word "help" in the E-mail text will return information on input and output formats. We encourage users to return feedback on any experimental confirmations or falsifications of the predictions. Any new information regarding glycosylation on D.discoideum would be highly appreciated and could be used to retrain the networks thereby improving our current prediction accuracy.

Materials and methods

Sequence data used for network training

As in vivo glycosylation has been shown to differ from in vitro glycosylation (Nehrke et al., 1996), it is relevant to mention that the data used in this study were exclusively derived from in vivo studies. In vivo expressed glutathione S-transferase fusion peptides, with experimentally determined GlcNAc O-glycosylation sites (Jung et al., 1997, 1998), were among the sequences used for training the neural networks. Other sequences employed for the training include secreted recombinant forms of PsA (Prespore-specific antigen) (Zachara et al., 1996), a spore coat protein SP96 and the N-terminal sequence of a recently described (Packer et al., 1997) secreted protein ("ATP"-Protein). This set of eight proteins had 39 serine and threonine residues [alpha]-linked to an O-GlcNAc sugar moiety (Table IV) and encompasses all in vivo data acquired to date with respect to D.discoideum. Elucidation of glycosylation sites was performed by solid-phase protein sequencing and has been reported earlier (Zachara et al., 1996; Jung et al., 1997, 1998; Packer et al., 1997).

Table IV. In vivo experimental data used for training the artificial neural networks
Table IVa shows sequence contexts (13-residue sequence windows) around the glycosylated amino acid for each of the 39 glycosylated sites found using in vivo methods in D.discoideum (membrane and secreted proteins). As shown, no general consensus sequence or pattern is evident, but the neural networks use this information to "learn" and "generalize" while predicting the possibility of glycosylations on a new sequence window shown to it. (a, Packer et al., 1997; b, Jung et al., 1998; c, Jung et al., 1997; d, unpublished observations; e, Zachara et al., 1996; f, M.Mreyen, unpublished observations). Table IVb shows the sequence regions glycosylated in each of the proteins. The line below each of the sequences is the assignment line; T and S indicating glycosylated threonines and serines, respectively. The clustering and "even"-positioning of glycosylation sites is evident. In GST-MUC1, this is true even for sites 12 residues apart (with no intermediate sites).

Analyzing the sequence context and conformational preferences

One approach to identifying patterns in sequences is to study the information density over the sequence motif. The information content of a particular position in a sequence is a measure of the degree of residue conservation at that position. Sequence windows centered on the glycosylated residue were aligned and the Kullback-Leibler information content was quantified by the formula

$$I(i) = \sum_{L} p_i(L) \log_2 \frac{p_i(L)}{q_i(L)}$$

where I is the information content at position i, p the probability of occurrence of the amino acid L (at position i) in a glycosylated window, and q the probability of occurrence in a nonglycosylated window (the background distribution). This information content, expressed in bits/amino acid, was visualized using sequence logos (Schneider and Stephens, 1990). Here, amino acid symbols are scaled to heights denoting their frequencies of occurrence at each position. Scaled amino acid symbols are stacked vertically in ascending order of their heights, the total height of the stack giving the value of I at that position. The sequence logo illustrates preferred acceptor patterns for O-GlcNAc transferases and makes any consensus sequence distinctly evident. A preliminary structural study was performed on the known glycosylation windows and their secondary structures predicted using the PHD method (Rost and Sander, 1994). Resulting predictions were represented using the Kullback-Leibler information content expressed in the form of sequence logos, as described above. Predictions around nonglycosylated amino acids were taken as reference in the Kullback-Leibler information measure.
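As an illustration of the information measure described above, the following minimal Python sketch computes a per-position Kullback-Leibler information content from two sets of aligned windows. The function names, the 20-letter alphabet string, and the use of pseudocounts are assumptions made for this example and are not taken from the original implementation; pseudocounts are included only so that the ratio p/q stays finite for small data sets.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_probabilities(windows, position, pseudocount=1.0):
    """Residue probabilities at one window position.

    The pseudocount (must be > 0) is an assumption of this sketch; it keeps
    p/q finite when a residue is absent from a small data set.
    """
    counts = Counter(w[position] for w in windows if w[position] in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def kl_information(glyc_windows, nonglyc_windows, pseudocount=1.0):
    """Information content in bits at each position: sum_L p * log2(p / q).

    p is estimated from glycosylated windows, q from nonglycosylated windows
    (the background distribution). All windows must have the same length.
    """
    length = len(glyc_windows[0])
    info = []
    for i in range(length):
        p = position_probabilities(glyc_windows, i, pseudocount)
        q = position_probabilities(nonglyc_windows, i, pseudocount)
        info.append(sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS))
    return info
```

Positions where glycosylated and nonglycosylated windows share the same residue distribution give a value of zero, while positions with a residue preference specific to glycosylated windows give a positive value, which is what the stack height in a sequence logo represents.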

Neural network algorithms

The basic idea of artificial neural networks is to use a network of neurons, each node (or neuron) having multiple inputs and a single output based on the weights (or strengths) associated with the various inputs. When sequence windows exhibiting a particular feature are presented repeatedly to such a network, randomly initialized weights can be adjusted to achieve a desired output, that is, to classify the pattern to the assigned feature. Once the network has been so trained on sequences with a known feature, it can be presented with an unclassified sequence, and patterns within the sequence are (hopefully) suitably classified. We employed standard feed-forward artificial neural networks with sigmoidal nodes and one layer of hidden units (Baldi and Brunak, 1998). Back propagation was used for adjusting the weights (Rumelhart et al., 1986), and amino acids were represented in the network by sparse encoding (Qian and Sejnowski, 1988). The method was similar to the one we used earlier for predicting O-linked GalNAc residues on mammalian proteins (Hansen et al., 1998b).
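The window extraction and sparse encoding mentioned above can be sketched as follows. This is an illustrative example only: the padding symbol, the choice of 21 input units per position (20 amino acids plus one unit for positions beyond the chain termini), and the function names are assumptions of this sketch, not details confirmed by the text.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # hypothetical symbol for window positions beyond the chain termini


def extract_windows(sequence, flank):
    """Yield (index, window) for every Ser/Thr, padding at the termini.

    `flank` is the number of residues on each side of the candidate site,
    so the window length is 2 * flank + 1.
    """
    padded = PAD * flank + sequence + PAD * flank
    for i, residue in enumerate(sequence):
        if residue in "ST":
            yield i, padded[i:i + 2 * flank + 1]


def sparse_encode(window):
    """Sparse (one-hot) encoding: one unit set to 1.0 per window position.

    Unknown symbols are mapped to the padding unit (an assumption).
    """
    symbols = ALPHABET + PAD
    vector = []
    for residue in window:
        unit = [0.0] * len(symbols)
        unit[symbols.index(residue if residue in symbols else PAD)] = 1.0
        vector.extend(unit)
    return vector
```

For example, a 5-residue window is turned into a vector of 5 × 21 = 105 inputs, exactly five of which are non-zero; such a vector would form the input layer of a feed-forward network.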

We evaluated the performance of neural networks with different parameters by partitioning the entire data set into a training data set and a test data set. Essentially, seven of the eight protein sequences were used to train the network, the remaining sequence being used as test data. The networks were tested over moving window sizes ranging from three amino acids (one amino acid flanking the serine or threonine site) to 23 amino acids (11-residue flanks), and with the number of hidden units ranging from 2 to 8. Performance was evaluated with cross-validation using a coefficient of correlation (Matthews, 1975) given by

$$C = \frac{P_x N_x - N_{fx} P_{fx}}{\sqrt{(N_x + N_{fx})(N_x + P_{fx})(P_x + N_{fx})(P_x + P_{fx})}}$$

where Px is the number of true positives (experimentally verified glycosylated sites which are also predicted glycosylated), Nx the number of true negatives (experimentally verified nonglycosylated, predicted nonglycosylated), Pfx the number of false positives (experimentally nonglycosylated, predicted glycosylated) and Nfx, the number of false negatives (experimentally verified glycosylated, predicted nonglycosylated).
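The evaluation scheme described above can be illustrated with a short Python sketch: a generator that holds out one protein at a time, and a function computing the Matthews correlation coefficient from the four counts defined in the text. The function names and the convention of returning 0.0 when the denominator vanishes are assumptions of this example.

```python
import math


def leave_one_protein_out(proteins):
    """Yield (training_proteins, test_protein), holding out each protein once."""
    for i, held_out in enumerate(proteins):
        yield proteins[:i] + proteins[i + 1:], held_out


def matthews_correlation(px, nx, pfx, nfx):
    """Matthews correlation coefficient.

    px:  true positives   nx:  true negatives
    pfx: false positives  nfx: false negatives
    Returns 0.0 when any marginal sum is zero (a convention chosen here).
    """
    denominator = math.sqrt(
        (nx + nfx) * (nx + pfx) * (px + nfx) * (px + pfx)
    )
    if denominator == 0:
        return 0.0
    return (px * nx - nfx * pfx) / denominator
```

A coefficient of 1 corresponds to perfect prediction, 0 to a prediction no better than random, and -1 to a completely inverted prediction. With eight proteins, the generator produces eight training/test partitions of seven and one proteins, respectively.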

We settled on a jury of six neural networks comprising three sets of training data, trained over window sizes of 5, 7, and 19 consecutive residues, with 2 and 5 hidden units for each window size. The results arising from each network in the jury were sigmoidally enhanced and averaged to obtain a value between zero and unity. Traditionally, a threshold of 0.5 is used; hence, a site with an output of more than 0.5 would be assigned as glycosylated. During evaluation of the networks, we tested them on threshold values ranging from 0.1 to 0.9 (data not shown). Most of the test data were correctly predicted when the threshold was set between 0.45 and 0.55. We use a surface-accessibility modified threshold as described below.
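The jury step can be sketched as below. The text does not specify the form of the sigmoidal enhancement, so the logistic function centered at 0.5 and its `gain` parameter are purely hypothetical choices made for illustration; only the averaging of the enhanced outputs into a value between zero and unity follows the description above.

```python
import math


def enhance(output, gain=4.0):
    """Hypothetical sigmoidal sharpening of a raw network output in [0, 1].

    The logistic form and the gain value are assumptions of this sketch.
    """
    return 1.0 / (1.0 + math.exp(-gain * (output - 0.5)))


def jury_potential(outputs, gain=4.0):
    """Average the enhanced outputs of all networks in the jury."""
    enhanced = [enhance(o, gain) for o in outputs]
    return sum(enhanced) / len(enhanced)
```

With this choice, a jury whose members all return 0.5 yields a combined potential of exactly 0.5, and outputs above or below 0.5 are pushed further toward one or zero before averaging.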

Protein surface-accessibility prediction

The exact subcellular location of the initiation of O-glycosylation is still controversial (Roth et al., 1994; Röttger et al., 1998). However, it is generally agreed that the O-glycan is linked post-translationally to Ser or Thr residues of a fully folded and assembled protein (Aubert et al., 1976; Rudd and Dwek, 1997; Van den Steen et al., 1998), and is thus surface exposed on the protein. Since surface exposure is a valuable factor in determining whether a site can be O-glycosylated, we combined the glycosylation predictions above with surface-accessibility predictions for the potential sites.

The network used to predict surface accessibility was trained on a non-redundant (less than 25% sequence identity for alignment lengths of 80 or more residues) data set of 134 globular protein structures extracted from the PDB database. All of these were X-ray determined structures with a resolution better than 2.5 Å. The Connolly Molecular Surface procedure (Connolly, 1993) was used with a probe radius of 1.4 Å, corresponding to the molecular radius of water, to label all residues in the 134 protein structures as either surface exposed or buried. A residue was assigned to the surface if more than 20% of its normalized standard maximal surface area (Rose et al., 1985) was exposed to the solvent. The resulting data set of 134 labeled protein sequences (residues labeled "buried"/"surface"), based on high-resolution protein structures, was used to train neural networks to recognize the relation between sequence and surface accessibility. Overall, the method correctly predicted surface accessibility for 74% of the residues. Details of the network used, training procedures, and performance will be reported elsewhere. This surface prediction method has been used earlier in prediction of post-translational cleavage sites of picornaviral polyproteins (Blom et al., 1996) and prediction of mucin type O-glycosylations in mammalian proteins (Hansen et al., 1998b).

Predicted outputs from the glycosylation network and the surface-accessibility network were combined in the following manner. The glycosylation threshold was reduced for sites predicted to be on surface, but increased for sites predicted to be buried. To accomplish this, a modulated threshold was derived by scaling the surface score (Os) with a factor e and off-setting it by a constant c. Using this variable cutoff, glycosylation was assigned if Og > c + (e × Os), where Og was the combined glycosylation potential. A systematic screening of combinations of e and c by cross-validating on the known experimental data yielded a scaling factor (e) of 0.999 and an off-set value (c) of 0.118 as optimum. The averaged glycosylation potential (from the six networks) was compared with the modulated surface threshold, and a site with a potential greater than the threshold qualified as glycosylated (Figure 8). A separate study using the surface accessibility network predicted all the glycosylation sites in the training data to be on the protein surface (data not shown).
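The decision rule stated above, Og > c + (e × Os), can be written as a short Python sketch. The values of e and c are those reported in the text; the function names are hypothetical, and the orientation of the surface score Os (whether high values denote exposed or buried residues) is not spelled out in the text, so the sketch simply applies the inequality literally.

```python
SCALE_E = 0.999   # scaling factor e reported in the text
OFFSET_C = 0.118  # off-set value c reported in the text


def modulated_threshold(surface_score, e=SCALE_E, c=OFFSET_C):
    """Site-specific threshold c + e * Os derived from the surface score Os."""
    return c + e * surface_score


def is_glycosylated(potential, surface_score, e=SCALE_E, c=OFFSET_C):
    """Assign glycosylation if the jury potential Og exceeds the threshold."""
    return potential > modulated_threshold(surface_score, e, c)


def margin(potential, surface_score, e=SCALE_E, c=OFFSET_C):
    """Difference between potential and threshold (usable for ranking sites)."""
    return potential - modulated_threshold(surface_score, e, c)
```

For instance, with a surface score of 0.4 the threshold becomes 0.118 + 0.999 × 0.4 = 0.5176, so a site with a combined potential of 0.6 would be assigned as glycosylated while a site with a potential of 0.5 would not.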


Figure 8. Schematic sketch of the neural networks used. A jury of six neural networks with different configurations was initially trained on the known glycosylated and nonglycosylated sites. This trained jury can then be used to predict sites on any protein sequence. The outputs from each of the networks are averaged and thresholded using a surface-prediction modulated value.

Predictions on the available proteome

The final method was put together as a prediction server and was employed to scan the available D.discoideum proteome in public databases for predicted glycosylations. The Dictyostelium subset in SwissProt Rel. 35 (Bairoch and Apweiler, 1998) comprised 267 sequences, which were combined with the 664 protein sequences obtained from GenPept Rel. 106 (Benson et al., 1998). A redundancy reduction was performed in which 100% identical protein sequences were eliminated from the combined set. The reduced set of 652 sequences (from an initial number of 931) formed our basic data set for further proteome studies. Glycosylation predictions were made on this set and ranked in order of their likelihood of being glycosylated (determined by the difference between predicted potential and threshold).
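The redundancy reduction and ranking steps can be illustrated with the following Python sketch. It assumes that "100% identical" means exact string identity of the full sequences, and it represents each predicted site as a dictionary with hypothetical keys `potential` and `threshold`; neither the data layout nor the function names come from the original implementation.

```python
def remove_identical(entries):
    """Keep the first entry for each distinct sequence.

    `entries` is a list of (name, sequence) pairs. Only exact, full-length
    identity is treated as redundancy (an assumption of this sketch).
    """
    seen = set()
    unique = []
    for name, sequence in entries:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, sequence))
    return unique


def rank_sites(sites):
    """Sort predicted sites by (potential - threshold), highest first."""
    return sorted(
        sites,
        key=lambda site: site["potential"] - site["threshold"],
        reverse=True,
    )
```

Ranking by the margin rather than by the raw potential takes the site-specific, surface-modulated threshold into account, so that two sites with the same potential can still be ordered differently.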

In addition, we also analyzed translated cDNA sequences from the Dicty-cDB database. A total of 12,565 translated cDNA sequences was reduced for redundancy to 11,689 unique sequences. Glycosylation predictions were made on these and ranked.

Fine teasing the available proteome

The SwissProt-GenPept 652-entry reduced data set was segregated functionally and according to cellular locations. Functional classifications were performed using the public GeneQuiz server (Scharf et al., 1994; Casari et al., 1996), which derives functional annotations for each sequence and categorizes them into one of 14 classes (including an "unknown" category). A few of the unknown category proteins were later reclassified manually into one of the 13 functional classes on the basis of their databank annotations.

Cellular locations of the proteins were derived using a combination (in order of precedence) of SwissProt annotations (where present), the PSORT II prediction server (Nakai and Kanehisa, 1992; Horton and Nakai, 1997), the SignalP prediction server (Nielsen et al., 1997), and the Transmembrane helix predictor TMHMM (Sonnhammer et al., 1998).

Three types of analyses were performed. First, percentages of glycosylated fractions in each functional and locational category were calculated and plotted (Figure 5). Second, predicted glycosylations were mapped across the length of all proteins (Figure 6) by normalizing all protein chains to the same length and enumerating predicted sites in each of 10 columns into which the length was split. Each of these columns was also segregated into functional and locational compartments. Last, a study was done to test the existence of any correlation between the number of sites predicted on one protein chain with its functional and locational assignment (Figure 7). The number of sites predicted glycosylated in each category was normalized with the number of proteins existing in that category, and a column was created for each "bin" (number of sites). Looking at the numbers, we rebinned the columns into classes of 1, 2, 3, 4-10 and above 10 (number of predicted sites per protein chain).
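The two binning operations described above can be sketched in Python as follows. The equal-width mapping of residue positions onto 10 columns and the label strings for the site-count classes are assumptions made for this illustration; the text does not state how boundary positions were handled.

```python
def length_bin(position, chain_length, n_bins=10):
    """Map a 0-based residue index to one of n_bins equal-width columns.

    Normalizes chains of different length to a common scale.
    """
    return min(position * n_bins // chain_length, n_bins - 1)


def site_count_class(n_sites):
    """Assign a protein to a class by its number of predicted sites.

    Classes follow the rebinning in the text: 1, 2, 3, 4-10 and above 10.
    Proteins without predicted sites are labeled "0" (an added convention).
    """
    if n_sites <= 0:
        return "0"
    if n_sites <= 3:
        return str(n_sites)
    if n_sites <= 10:
        return "4-10"
    return ">10"
```

For a chain of 100 residues, positions 0 to 9 fall in the first column and positions 90 to 99 in the last, so a cluster of predicted sites in the C-terminal half would appear in columns 5 to 9.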

Acknowledgments

We thank Natasha Zachara for her critical comments and text regarding O-[alpha]-GlcNAc and O-[beta]-GlcNAc glycosylation in simple eukaryotes, Andrew Bohlken for preparing the ShD alpha digest, Kristoffer Rapacki for assistance with database extraction, and Hideko Urushihara, University of Tsukuba, Japan, who kindly provided a translated version of Dicty-cDB. Assistance from the maintainers of GeneQuiz and their patience with our large submissions is appreciated. E.J. appreciates support from the Deutscher Akademischer Austauschdienst (HspII/AUFE) and a Macquarie University International Postgraduate Research Award for her doctoral research. K.L.W. and A.A.G. acknowledge support from the Australian Research Council and National Health and Medical Research Council for this research. R.G., J.H., and S.B. acknowledge support from the Danish National Research Foundation.

Abbreviations

O-GlcNAc, O-linked N-acetylglucosamine; HexNAc, N-acetylhexosamine.

References

Abola,E.E., Bernstein,F.C., Bryant,S.H., Koetzle,T.F. and Weng,J. (1987) Protein Data Bank. In Allen,F.H., Bergerhoff,G. and Sievers,R. (eds.), Crystallographic Databases-Information Content, Software Systems, Scientific Applications. Data Commission of the International Union of Crystallography, Bonn/Cambridge/Chester, pp. 107-132.

Aubert,J.P., Biserte,G. and Loucheux-Lefebur,M.H. (1976) Carbohydrate-peptide linkage in glycoproteins. Arch. Biochem. Biophys., 175, 410-418.

Bairoch,A. and Apweiler,R. (1998) The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998. Nucleic Acids Res., 26, 38-42.

Baldi,P. and Brunak,S. (1998) Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA.

Barth,A., Muller-Taubenberger,A., Taranto,P. and Gerisch,G. (1994) Replacement of the phospholipid-anchor in the Contact Site A glycoprotein of D.discoideum by a transmembrane region does not impede cell adhesion but reduces time on the cell surface. J. Cell Biol., 124, 205-215.

Bause,A. (1983) Structural requirements of N-glycosylation of proteins. Biochem. J., 209, 331-336.

Benson,D.A., Boguski,M.S., Lipman,D.J., Ostell,J. and Ouellette,B.F.F. (1998) GenBank. Nucleic Acids Res., 26, 1-7.

Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F.Jr., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112, 535-542.

Blom,N., Hansen,J.E., Blaas,D. and Brunak,S. (1996) Cleavage site analysis in picornaviral polyproteins: discovering cellular targets by neural networks. Protein Sci., 5, 2203-2216.

Casari,G., Ouzounis,C., Valencia,A. and Sander,C. (1996) Genequiz-II: automatic function assignment for genome sequence analysis. In Proceedings of the First Annual Pacific Symposium on Biocomputing. World Scientific, Hawaii, pp. 707-709.

Chou,K.C., Zhang,C.T., Kezdy,F.J. and Poorman,R.A. (1995) A vector projection method for predicting the specificity of GalNAc-transferase. Proteins Struct. Funct. Genet., 21, 118-126.

Clausen,H. and Bennett,E. (1996) A family of UDP-GalNAc: polypeptide N-acetylgalactosaminyl-transferases control the initiation of mucin-type O-linked glycosylation. Glycobiology, 6, 635-646.

Connolly,M.L. (1993) The molecular surface package. J. Mol. Graph., 11, 139-141.

Elhammer,A.P., Poorman,R.A., Brown,E., Maggiora,L.L., Hoogerheide,J.G. and Kezdy,F.J. (1993) The specificity of UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase as inferred from a database of in vivo substrates and from the in vitro glycosylation of proteins and peptides. J. Biol. Chem., 268, 10029-10038.

Fang,H., Higa,M., Suzuki,K., Aiba,K., Urushihara,H. and Yanagisawa,K. (1993) Molecular cloning and characterization of two genes encoding gp138, a cell surface glycoprotein involved in the sexual cell fusion of Dictyostelium discoideum. Dev. Biol., 156, 201-208.

Fucini,P., Renner,C., Herberhold,C. and Noegel,A.A. (1997) The repeating segments of the F-actin cross-linking gelation factor (ABP-120) have an immunoglobulin-like fold. Nature Struct. Biol., 4, 223-230.

Giorda,R., Ohmachi,T., Shaw,D.R. and Ennis,H.L. (1990) A shared internal threonine-glutamic acid-threonine-proline repeat defines a family of Dictyostelium discoideum spore germination specific proteins. Biochemistry, 29, 7264-7269.

Goochee,C.F., Gramer,M.J., Anderson,D.C., Bahr,J.B. and Rasmussen,J.R. (1991) The oligosaccharides of glycoproteins: bioprocess factors affecting oligosaccharide structure and their effect on glycoprotein properties. Bio/Technology, 9, 1347-1355.

Gooley,A.A. and Williams,K.L. (1994) Towards characterizing O-glycans: the relative merits of in vivo and in vitro approaches in seeking peptide motifs specifying O-glycosylation sites. Glycobiology, 4, 413-417.

Gooley,A.A. and Williams,K.L. (1997) How to find, identify and quantitate the sugars on proteins. Nature, 385, 557-559.

Gooley,A.A., Classon,B.J., Marschalek,R. and Williams,K.L. (1991) Glycosylation sites identified by detection of glycosylated amino acids released from Edman degradation: the identification of Xaa-Pro-Xaa-Xaa as a motif for Thr-O-glycosylation. Biochem. Biophys. Res. Commun., 178, 1194-1201.

Gooley,A.A., Marschalek,R. and Williams,K.L. (1992) Size polymorphisms due to changes in the number of O-glycosylated tandem repeats in the Dictyostelium discoideum glycoprotein PsA. Genetics, 130, 749-756.

Haltiwanger,R.S., Blomberg,M.A. and Hart,G.W. (1992) Glycosylation of nuclear and cytoplasmic proteins. Purification and characterization of a uridine diphospho-N-acetylglucosamine:polypeptide [beta]-N-acetylglucosaminyltransferase. J. Biol. Chem., 267, 9005-9013.

Handman,E., Barnett,L.D., Osborn,A.H., Goding,J.W. and Murray,P.J. (1993) Identification, characterisation and genomic cloning of a O-linked N-acetylglucosamine-containing cytoplasmic Leishmania glycoprotein. Mol. Biochem. Parasitol., 62, 61-72.

Hansen,J.E., Lund,O., Rapacki,K., Clausen,H., Mosekilde,E., Nielsen,J.O. and Hansen,J.E.S. (1994) Glycosylation and Protein Conformation. In Bohr,H. and Brunak,S. (eds.), Protein Structure by Distance Analysis. IOS Press, Amsterdam, pp. 247-254.

Hansen,J.E., Lund,O., Engelbrecht,J., Bohr,H., Nielsen,J.O., Hansen,J.E.S. and Brunak,S. (1995) Prediction of O-glycosylation of mammalian proteins: specificity patterns of UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase. Biochem. J., 308, 801-813.

Hansen,J.E., Lund,O., Nilsson,J., Rapacki,K. and Brunak,S. (1998a) O-GLYCBASE Version 3.0: a revised database of O-glycosylated proteins. Nucleic Acids Res., 26, 387-389.

Hansen,J.E., Lund,O., Tolstrup,N., Gooley,A.A., Williams,K.L. and Brunak,S. (1998b) NetOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility. Glycoconjugate J., 15, 115-130.

Hart,G.W. (1997) Dynamic O-linked glycosylation of nuclear and cytoskeletal proteins. Annu. Rev. Biochem., 66, 315-335.

Hart,G.W., Greis,K.D., Dong,L.Y., Blomberg,M.A., Chou,T.Y., Jiang,M.S., Roquemore,E.P., Snow,D.M., Kreppel,L.K. and Cole,R.N. (1995) O-Linked N-acetylglucosamine: the "yin-yang" of Ser/Thr phosphorylation? Nuclear and cytoplasmic glycosylation. Adv. Exp. Med. Biol., 376, 115-123.

Haynes,P.A., Gooley,A.A., Ferguson,M.A., Redmond,J.W. and Williams,K.L. (1993) Post-translational modifications of the Dictyostelium discoideum glycoprotein PsA. Glycosylphosphatidylinositol membrane anchor and composition of O-linked oligosaccharides. Eur. J. Biochem., 216, 729-737.

Horton,P. and Nakai,K. (1997) Better prediction of protein cellular localization sites with the k nearest neighbor classifier. In Proc. Fifth Int. Conf. on Intelligent Systems for Molecular Biology. Volume 5. AAAI Press, Menlo Park, CA, pp. 147-152.

Hounsell,E.F., Davies,M.J. and Renouf,D.V. (1996) O-linked protein glycosylation structure and function. Glycoconjugate J., 13, 19-26.

Jung,E. and Williams,K.L. (1997) The production of recombinant glycoproteins with special reference to simple eukaryotes including Dictyostelium discoideum. Biotechnol. Appl. Biochem., 25, 3-8.

Jung,E., Gooley,A.A., Packer,N.H., Slade,M.B., Williams,K.L. and Dittrich,W. (1997) An in vivo approach for the identification of acceptor sites for O-glycosyltransferases: motifs for the addition of O-GlcNAc in Dictyostelium discoideum. Biochemistry, 36, 4034-4040.

Jung,E., Gooley,A.A., Packer,N.H., Karuso,P. and Williams,K.L. (1998) Rules for the addition of O-linked N-acetylglucosamine to secreted proteins in Dictyostelium discoideum. Studies on glycosylation of mucin MUC1 and MUC2 repeats. Eur. J. Biochem., 253, 517-524.

Kornfeld,R. and Kornfeld,S. (1985) Assembly of asparagine-linked oligosaccharides. Annu. Rev. Biochem., 54, 631-664.

Kozarov,E., van der Wel,H., Field,M., Gritzali,M., Brown,R.D.Jr. and West,C.M. (1995) Characterization of FP21, a cytosolic glycoprotein from Dictyostelium. J. Biol. Chem., 270, 3022-3030.

Kreppel,L.K., Blomberg,M.A. and Hart,G.W. (1997) Dynamic glycosylation of nuclear and cytosolic proteins. Cloning and characterization of a unique O-GlcNAc transferase with multiple tetratricopeptide repeats. J. Biol. Chem., 272, 9308-9315.

Kullback,S. and Leibler,R.A. (1951) On information theory and sufficiency. Ann. Math. Stat., 22, 79-86.

Lis,H. and Sharon,N. (1993) Protein glycosylation: structural and functional aspects. Eur. J. Biochem., 218, 1-27.

Loomis ,W.F. and Smith,D.W. (1995) Consensus phylogeny of Dictyostelium. Experientia, 51, 1110-1115. MEDLINE Abstract

Lubas ,W.A., Frank,D.W., Krause,M. and Hanover,J.A. (1997) O-Linked GlcNAc transferase is a conserved nucleocytoplasmic protein containing tetratricopeptide repeats. J. Biol. Chem., 272, 9316-9324. MEDLINE Abstract

Mang ,C. (1995) Analyse der Kohlenhydratstrukturen des Zelladhaesionsproteins contact Site A. PhD thesis. Ludwig Maximilians University, Munich, Germany.

Matthews ,B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442-451. MEDLINE Abstract

Mehta ,D.P., Etchison,J.R., Wu,R. and Freeze,H.H. (1997) UDP-GlcNAc:Ser-protein N-acetylglucosamine-1-phosphotransferase from Dictyostelium discoideum recognizes serine-containing peptides and eukaryotic cysteine proteinases. Biotechnol. Gen. Eng. Rev., 14, 1-35.

Moore ,J.P., McKeating,J.A., Jones,I.M., Stephens,P.E., Clements,G., Thomson,S. and Weiss,R.A. (1990) Characterization of recombinant gp120 and gp160 from HIV-1: binding to monoclonal antibodies and soluble CD4. AIDS, 4, 307-315. MEDLINE Abstract

Nakai ,K. and Kanehisa,M. (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14, 897-911. MEDLINE Abstract

Nehrke ,K., Hagen,F.K. and Tabak,L.A. (1996) Charge distribution of flanking amino acids influences O-glycan acquisition in vivo. J. Biol. Chem., 271, 7061-7065. MEDLINE Abstract

Nielsen ,H., Engelbrecht,J., Brunak,S. and von Heijne,G. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng., 10, 1-6. MEDLINE Abstract

Nielsen ,H., Brunak,S. and von Heijne,G. (1999) Machine learning approaches for prediction of signal peptides and other protein signals. Protein Eng., 12, 3-9. MEDLINE Abstract

O'Connell ,B.C., Tabak,L.A. and Ramasubbu,N. (1991) The influence of flanking sequence on the O-glycosylation of threonine in vitro. J. Biol. Chem., 180, 1024-1030.

O'Connell ,B.C., Hagan,F.K. and Tabak,L.A. (1992) The influence of flanking sequence on O-glycosylation. Biochem. Biophys. Res. Commun., 267, 25010-25018.

Packer ,N.H., Pawlak,A., Kett,W.C., Gooley,A.A., Redmond,J.W. and Williams,K.L. (1997) Proteome analysis of glycoforms: a review of strategies for the microcharacterisation of glycoproteins separated by two-dimensional polyacrylamide gel electrophoresis. Electrophoresis, 18, 452-460. MEDLINE Abstract

Pisano,A., Redmond,J.W., Williams,K.L. and Gooley,A.A. (1993) Glycosylation sites identified by solid-phase Edman degradation: O-linked glycosylation motifs on human glycophorin A. Glycobiology, 3, 429-435.

Presnell,S.R. and Cohen,F.E. (1993) Artificial neural networks for pattern recognition in biochemical sequences. Annu. Rev. Biophys. Biomol. Struct., 22, 283-298.

Previato,J.O., Sola-Penna,M., Agrellos,O.A., Jones,C., Oeltmann,T., Travassos,L.R. and Mendonca-Previato,L. (1998) Biosynthesis of O-N-acetylglucosamine-linked glycans in Trypanosoma cruzi. Characterization of the novel uridine diphospho-N-acetylglucosamine:polypeptide N-acetylglucosaminyltransferase-catalyzing formation of N-acetylglucosamine alpha1->O-threonine. J. Biol. Chem., 273, 14982-14988.

Qian,N. and Sejnowski,T.J. (1988) Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol., 202, 865-884.

Ramalingam,R. and Ennis,H.L. (1997) Characterization of the Dictyostelium discoideum cellulose-binding protein CelB and regulation of gene expression. J. Biol. Chem., 272, 26166-26172.

Rose,G.D., Geselowitz,A.R., Lesser,G.J., Lee,R.H. and Zehfus,M.H. (1985) Hydrophobicity of amino acid residues in globular proteins. Science, 229, 834-838.

Rost,B. and Sander,C. (1994) Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19, 55-72.

Roth,J., Wang,Y., Eckhardt,A.E. and Hill,R.L. (1994) Subcellular localization of the UDP-N-acetyl-d-galactosamine: polypeptide N-acetylgalactosaminyltransferase-mediated O-glycosylation reaction in the submaxillary gland. Proc. Natl. Acad. Sci. USA, 91, 8935-8939.

Röttger,S., White,J., Wandall,H.H., Olivo,J.C., Stark,A., Bennett,E.P., Whitehouse,C., Berger,E.G., Clausen,H. and Nilsson,T. (1998) Localisation of three human polypeptide GalNAc-transferases in HeLa cells suggests initiation of O-linked glycosylation throughout the Golgi apparatus. J. Cell Sci., 111, 45-60.

Rudd,P.M. and Dwek,R.A. (1997) Glycosylation: Heterogeneity and the 3D structure of proteins. Crit. Rev. Biochem. Mol. Biol., 32, 1-100.

Rumelhart,D.E., Hinton,G.E. and Williams,R.J. (1986) Learning internal representations by error propagation. In Rumelhart,D., McClelland,J. and the PDP Research Group (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. 1. Foundations. MIT Press, Cambridge, MA., pp. 318-362.

Scharf,M., Schneider,R., Casari,G., Bork,P., Valencia,A., Ouzounis,C. and Sander,C. (1994) Genequiz: a workbench for sequence analysis. In Proc. of Second Int. Conf. on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 348-353.

Scheffzek,K., Kliche,W., Wiesmuller,L. and Reinstein,J. (1996) Crystal structure of the complex of UMP/CMP kinase from Dictyostelium discoideum and the bisubstrate inhibitor P1-(5[prime]-adenosyl) P5-(5[prime]-uridyl) pentaphosphate (UP5A) and Mg2+ at 2.2 A: implications for water-mediated specificity. Biochemistry, 35, 9716-9727.

Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097-6100.

Schnuchel,A., Wiltscheck,R., Eichinger,L., Schleicher,M. and Holak,T.A. (1995) Structure of severin domain 2 in solution. J. Mol. Biol., 247, 21-27.

Slade,M.B., Emslie,K.R. and Williams,K.L. (1997) Expression of recombinant glycoproteins in the simple eukaryote Dictyostelium discoideum. Biotechnol. Gen. Eng. Rev., 14, 1-35.

Snow,D.M. and Hart,G.W. (1998) Nuclear and cytoplasmic glycosylation. Int. Rev. Cytol., 181, 43-74.

Sonnhammer,E.L., von Heijne,G. and Krogh,A. (1998) A hidden Markov model for predicting transmembrane helices in protein sequences. In Proc. Sixth Int. Conf. on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 175-182.

Teng-umnuay,P., Morris,H.R., Dell,A., Panico,M., Paxton,T. and West,C.M. (1998) The cytoplasmic F-box binding protein SKP1 contains a novel pentasaccharide linked to hydroxyproline in Dictyostelium. J. Biol. Chem., 273, 18242-18249.

Ti,Z.C., Wilkins,M.R., Vardy,P.H., Gooley,A.A. and Williams,K.L. (1995) Glycoprotein complexes interacting with cellulose in the "cell print" zones of the Dictyostelium discoideum extracellular matrix. Dev. Biol., 168, 332-341.

Van den Steen,P., Rudd,P.M., Dwek,R.A. and Opdenakker,G. (1998) Concepts and principles of O-linked glycosylation. Crit. Rev. Biochem. Mol. Biol., 33, 151-208.

Wandall,H.H., Hassan,H., Mirgorodskaya,E., Kristensen,A.K., Roepstorff,P., Bennett,E.P., Nielsen,P.A., Hollingsworth,M.A., Burchell,J., Taylor-Papadimitriou,J. and Clausen,H. (1997) Substrate specificities of three members of the human UDP-N-acetyl-[alpha]-d-galactosamine:polypeptide N-acetylgalactosaminyltransferase family, GalNAc-T1, -T2 and -T3. J. Biol. Chem., 272, 23503-23514.

Wang,Y., Agrawal,N., Eckhardt,A.E., Stevens,R.D. and Hill,R.L. (1993) The acceptor substrate specificity of porcine submaxillary UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase is dependent on the amino acid sequences adjacent to serine and threonine residues. J. Biol. Chem., 268, 22979-22983.

Wilson,I.B.H., Gavel,Y. and von Heijne,G. (1991) Amino acid distributions around O-linked glycosylation sites. Biochem. J., 275, 529-534.

Wong,E.F.S., Brar,S.K., Sesaki,H., Yang,C. and Siu,C.H. (1996) Molecular cloning and characterization of DdCAD-1, a Ca2+ dependent cell-cell adhesion molecule, in Dictyostelium discoideum. J. Biol. Chem., 271, 16399-16408.

Wu,C.H. (1997) Artificial neural networks for molecular sequence analysis. Comput. Chem., 21, 237-256.

Zachara,N.E., Packer,N.H., Temple,M.D., Slade,M.B., Jardine,D.R., Karuso,P., Moss,C.J., Mabbutt,B.C., Curmi,P.M.G., Williams,K.L. and Gooley,A.A. (1996) Recombinant prespore specific antigen from Dictyostelium discoideum is a [beta]-sheet glycoprotein with a spacer peptide modified by O-linked N-acetylglucosamine. Eur. J. Biochem., 238, 511-518.


3To whom correspondence should be addressed


Copyright © Oxford University Press, 1999.