Biases and complex patterns in the residues flanking protein N-glycosylation sites

Shifra Ben-Dor2, Nir Esterman3, Eitan Rubin2 and Nathan Sharon1,3

2 Department of Biological Services, Weizmann Institute of Science, Rehovot 76100, Israel; and 3 Department of Biological Chemistry, Weizmann Institute of Science, Rehovot 76100, Israel

Received on June 10, 2003; revised on August 8, 2003; accepted on September 3, 2003


    Abstract
 
N-Glycosylation, the most common and most versatile protein modification reaction, occurs at the β-amide of the asparagine of the Asn-Xaa-Ser/Thr sequon. For reasons that are unclear, not all such sequons are glycosylated. To find patterns that affect glycosylation, we examined the amino acid residues from the 20th residue preceding the sequon to the 20th residue following it, using bioinformatics tools. A clean data set of annotated, experimentally verified, glycosylated and nonglycosylated sequons derived from 617 well-defined nonredundant N- and N-,O-glycoproteins listed in SWISS-PROT (June 2002) was used. NXS and NXT sequons were analyzed separately. Although no overt patterns were found to explain sequon occupancy or nonoccupancy, trends for over- or underrepresentation of certain amino acids at particular positions were statistically significant and different in NXS and NXT sequons. In extension of earlier reports, none of the 80 Asn-Pro-Ser/Thr sequons found were glycosylated, and a markedly low level of glycosylation was seen in sequons with Pro at the position following the Ser/Thr. In addition, the considerable number of glycosylated sequons in the C-terminal 10 residues of glycoproteins suggests that N-glycosylation in these cases may be posttranslational and not cotranslational, as is widely accepted.

Key words: bioinformatics / database survey / glycoproteins / glycosylation frequency / sequon


    Introduction
 
N-Glycosylation, the most common and most versatile protein modification reaction (Apweiler et al., 1999), has long been known to occur almost exclusively at asparagine residues that are part of the consensus sequence NXS/T, also known as the sequon (Marshall, 1972; Spiro, 2002). It has, however, been noted that not all such sequons are glycosylated, although the reasons for this were not understood. Several approaches have been used in the past to define the signals that control sequon glycosylation (reviewed by Shakin-Eshleman, 1996). These included surveys of glycosylated and nonglycosylated sequons, generally in small numbers of glycoproteins; the use of sequon-containing peptides as glycosylation acceptors in cell-free systems and intact cells; and examination of the effect of site-directed mutagenesis on glycosylation both in intact cells and cell-free systems. Most of these studies concentrated on the role of amino acid residues occupying the X position and occasionally also the position following the S/T of the sequon. Only one survey studied further flanking residues, 15 on either side of the sequon, in close to 50 proteins (Gavel and von Heijne, 1990). The main conclusions reached were: (1) in the X position P inhibits glycosylation absolutely; (2) at P1 (the position immediately following the sequon), P inhibits glycosylation in the majority of cases; (3) NXS sequons are less frequently glycosylated than NXT sequons (Kaplan et al., 1987); (4) the efficiency of glycosylation of NXS sequons is affected by the nature of the X residue, whereas that of NXT is not (Kasturi et al., 1995).

The availability of a large number of glycoproteins in which the glycosylation sites have been fully characterized has prompted us to examine the effect of additional sequon-flanking residues on N-glycosylation. Here we report on the distribution of residues at the X position of the sequons and at 20 positions on either side in 617 nonredundant N- and N-,O-linked glycoproteins selected from the SWISS-PROT database (version 40 of June 2002) in which the positions of occupied and unoccupied sequons are known ("well-defined").


    Results and discussion
 
Data set
The data were taken from SWISS-PROT version 40.24 (June 2002) and categorized into four groups: (1) all proteins, (2) all glycoproteins, (3) well-defined N-glycoproteins (including both N- and N-,O-linked), and (4) nonredundant well-defined N-glycoproteins. All proteins consisted of the entire database, glycosylated and unglycosylated proteins included. All glycoproteins comprised any database entry that contained the Feature Table key <CARBOHYD> tag (Jung et al., 2001) and the description <N-LINKED>. The well-defined glycoproteins were obtained by further filtering the all glycoproteins set, so that every site analyzed had been experimentally determined as glycosylated or not (more details in Materials and methods). The well-defined glycoproteins were analyzed in two sets, the full set and a nonredundant set, in which proteins with more than 90% identity were excluded to prevent skewing of the data set by heavily overrepresented families. Proteins with sequons that are not always glycosylated (for example, RNase A/B, P07998, or tissue plasminogen activator, P00750) were excluded from analysis, as were nonstandard sequons, such as NGGT (reported only in MOPC Ig heavy chain V region, P01756) or NXC (discussed later).

Frequency of occurrence of sequons and amino acids in all proteins, all glycoproteins, and well-defined glycoproteins
As summarized in Table I, the all protein set contained 111,817 entries, of which 75,392 have sequons and 10,696 have been reported as glycoproteins (all glycoproteins). The total number of well-defined glycoproteins in the database is 859. Among them 110 are O-linked, 79 are N-,O-linked, and 670 are N-glycoproteins. We focused on the well-defined N- and N-,O-glycoproteins, their number being 749. The average number of sequons per sequon-containing protein varies greatly between the various data sets; most of the difference can be attributed to differences in average protein length of the different groups.


Table I. Number of different entities, derived from SWISS-PROT database (June 2002)

 
When comparing the distribution of sequons per 250 amino acids, a different pattern emerges. All reported glycoproteins and all the well-defined N- and N-,O-glycoproteins contain a little over 2 sequons per molecule (ranging from 2.05 to 2.22 per molecule). The sequon-containing proteins in SWISS-PROT have an average of 1.81 sequons per 250 residues; in bacterial proteins (one-third of all sequon-containing proteins), where glycosylation is a very rare event, there are 1.75 sequons per protein molecule. A lack of sequons therefore cannot account for the inability of these organisms to produce glycoproteins.
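The normalization to 250 residues can be illustrated with a minimal Python sketch. It assumes a pooled density (total sequons divided by total residues); the text does not state whether pooling or per-protein averaging was used, and the function names are hypothetical.

```python
import re

# N, then any residue, then S or T. The zero-width lookahead lets
# overlapping sequons (e.g. "NNTS") each be counted. Proline at X is
# deliberately not excluded, since NPS/T still counts as a sequon.
SEQUON = re.compile(r"(?=N[A-Z][ST])")

def sequon_positions(seq):
    """Return the 0-based index of the Asn of every NXS/T sequon."""
    return [m.start() for m in SEQUON.finditer(seq)]

def sequons_per_250(seqs):
    """Pooled sequon density (assumption): sequons per 250 residues."""
    total_sequons = sum(len(sequon_positions(s)) for s in seqs)
    total_residues = sum(len(s) for s in seqs)
    return 250.0 * total_sequons / total_residues
```

Normalizing by length in this way removes the effect of the differing average protein lengths of the data sets.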

Taxonomic distribution
Most well-defined glycoproteins are from metazoa (82%), with fewer from plants (10%), fungi (5%), viruses (2%), and other organisms (1%) (Figure 1A). The distribution of these glycoproteins is nearly identical to that of all well-defined glycoproteins (including the O-glycoproteins). This is essentially the same as that found on the basis of release 36 (Apweiler et al., 1999). The database grew by more than 25%, and the percentage of all glycoproteins remained the same, at 10% of the total database. However, the number of well-defined glycoproteins did not grow at the same relative rate as the database, because less protein chemistry is being performed, and the growth of the database is due in part to many predicted proteins resulting from the rapid increase in the number of genomes elucidated. Nevertheless, the distribution of well-defined glycoproteins is roughly representative of the all glycoproteins data set (Figure 1B). The higher percentage of viral proteins and the lower percentage of plant proteins in the latter data set are likely due to the fact that viral glycoproteins have not been well studied, whereas many plant glycoproteins have been investigated on the protein level.



Fig. 1. Distribution of sequon-containing proteins in SWISS-PROT by taxonomy. (A) Well-defined N-, and N-,O-glycoproteins. (B) All glycoproteins. (C) All sequon-containing proteins. (D) All sequon-containing proteins except for eubacteria.

 
The all sequon taxonomy distribution (Figure 1C) is virtually identical to the whole database distribution (with a 2% shift from fungi to others in the whole database distribution; data not shown) but markedly different from that shown in Figure 1A and 1B. When we subtract the eubacterial sequon-containing proteins from all the sequon-containing proteins (Figure 1D), the division shifts to the distribution that would be expected if glycosylation were equally distributed among the various taxa. What we see in Figure 1A and 1B, however, is a definite enrichment of glycoproteins in Metazoa at the expense of all other classes of organism.

Sequon occupancy
The nonredundant well-defined glycoproteins have 2081 sequons. Their rate of occupancy is 60.7%, which matches previously published results (Shakin-Eshleman, 1996; Apweiler et al., 1999). As shown in Table II, in these glycoproteins there are 50% more NXT sequons than NXS ones, and the rate of occupancy of the NXT sequons is about one-third higher than that of the NXS sequons, a trend that has been discussed in detail elsewhere (Shakin-Eshleman, 1996).


Table II. Rate of sequon occupancy

 
Among the occupied sequons, we found three nonstandard verified sequons in which cysteine substitutes for the standard hydroxy amino acids. Although this number has not changed since it was reported over 10 years ago (Gavel and von Heijne, 1990), four additional glycosylated NXC sequons were identified very recently in Caenorhabditis elegans glycoproteins (Kaji et al., 2003). The NXC sequons were excluded from further analysis.

Amino acid distribution
Before asking if there were unusual features in amino acid frequency around sequons, we compared the amino acid distribution of all glycoproteins and of the well-defined glycoproteins to that of the whole database ("normal frequency"). In Figure 2 only those residues whose frequency differs from normal by more than 10% are shown. Residues G, T, W, P, and especially C (the latter almost twice) are overrepresented in glycoproteins as a whole and in well-defined glycoproteins in particular. In contrast, the residues E, R, K, A, M, and I are underrepresented. This matches the pattern for single residue discrimination of intracellular and extracellular proteins proposed by Nakashima and Nishikawa (1994). It may be due to the fact that the majority of well-defined glycoproteins are secreted or membrane proteins and would therefore be expected to have a similar profile. Once again, the well-defined glycoproteins are representative of the larger glycoprotein set.



Fig. 2. Frequency of amino acids in the glycoprotein data sets. The ratio of amino acids with more than a 10% difference from the normal distribution is shown. Light gray: residues in all glycoproteins divided by residues in all proteins, dark gray: residues in well-defined glycoproteins divided by residues in all proteins.

 
Occupied sequons at the C-terminal decapeptide of N-glycoproteins
Eighteen sequences with occupied sequons in the C-terminal decapeptide of nonprocessed proteins were found in our data set. These are listed in Table III. None of the glycoproteins mentioned in Table III as glycosylated at the C-terminal decapeptide have been reported to undergo posttranslational processing at their C-terminal end prior to glycosylation. One protein (P15797) undergoes cleavage of the C-terminal end after glycosylation (the site is in the cleaved peptide). All but one of the sequences (P07307) are secreted. P07307 is a type II membrane protein. The frequency of sequon occurrence in this domain of glycoproteins is less than average (~70%), whereas the rate of occupancy of the sequons is similar to the average (~60%). These findings are of special interest in view of the biosynthetic studies that led to the conclusion that N-glycosylation is cotranslational, and that the first step in the process, the transfer of the dolichol oligosaccharide precursor, occurs approximately 30 amino acids from the active ribosome (Varki et al., 1999). The presence of a considerable number of N-linked sugars at the C-terminal 10 residues of glycoproteins suggests that N-glycosylation in these cases is posttranslational.


Table III. Occupied N-glycosylation sites at the C-terminal 10-residue domain of different glycoproteins

 
Distribution of sequons within the well-defined glycoproteins
The finding that the frequency of sequon occurrence at the C-terminal end of the well-defined glycoproteins is lower than average prompted us to calculate the relative positions of all sequons and of the glycosylated ones along the polypeptide chain of these glycoproteins. It was found that the sequons are unevenly distributed along the polypeptide chain, with lower frequencies at both ends (Figure 3A). The ratio between NXS and NXT sequons stays approximately even along the length of the polypeptide, with averages of 40% and 60%, respectively (data not shown). However, when the percentage of glycosylation was also taken into consideration, differences were found (Figure 3B). For NXS sequons there are two peaks of glycosylation, in the fourth and seventh deciles. NXT sequons peak at the third and seventh deciles. Both drop at the fifth decile and from the eighth decile on, but the drop in NXS glycosylation is particularly drastic. The difference between the extent of occupancy of NXS and NXT sequons was particularly high in the last decile of the well-defined glycoproteins, namely, 24.1% versus 52.2%, respectively.



Fig. 3. Sequon distribution in nonredundant well-defined glycoproteins. (A) All sequons, regardless of glycosylation state and of hydroxy amino acid, distributed by decile of protein length. (B) Percent of glycosylated sequons distributed by decile of protein length. NXS (circles) and NXT (squares).
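The decile analysis can be sketched as follows. This is an illustrative Python sketch, not the procedure actually used: it assumes the position of a sequon is taken as the 0-based index of its Asn, and the function names are hypothetical.

```python
def decile(position, length):
    """Map a 0-based residue index to a decile (1..10) of protein length."""
    return min(10, position * 10 // length + 1)

def occupancy_by_decile(sites):
    """Fraction of glycosylated sequons in each decile.

    sites: iterable of (asn_index, protein_length, glycosylated).
    Deciles that contain no sequons map to None.
    """
    counts = {d: [0, 0] for d in range(1, 11)}  # decile -> [occupied, total]
    for pos, length, glycosylated in sites:
        d = decile(pos, length)
        counts[d][1] += 1
        if glycosylated:
            counts[d][0] += 1
    return {d: (occ / tot if tot else None) for d, (occ, tot) in counts.items()}
```

Running this separately on the NXS and NXT sites would give the two curves of the kind shown in panel B.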

 
Amino acid distribution in and around the sequons
We have examined the distribution of amino acids from the 20th residue preceding the sequons to the 20th residue following them, comparing occupied to nonoccupied sites. The residues N-terminal to the sequon were denoted with an M (e.g., M1, M2, and M3), and those C-terminal to the sequon with a P (e.g., P1, P2, and P3), the numbers increasing with distance from the sequon. In total, our well-defined nonredundant set has 2081 sequons (Table I). Of these, only those sequons with 20 flanking amino acids on both sides were analyzed, in total 1910, of which 1157 are glycosylated and 753 are not. We separated the NXS and NXT sequons for this analysis, as their rates of occupancy were very different (Table II).
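The position labels can be made concrete with a short Python sketch. It is a hypothetical helper, not the authors' code: the labels "N", "X", and "S/T" for the sequon itself are assumptions chosen for illustration, and sequons without a full flank on both sides are skipped, as in the text.

```python
FLANK = 20  # residues analyzed on each side of the sequon

def flank_labels(flank=FLANK):
    """Labels from the most distant N-terminal position to the most distant C-terminal one."""
    upstream = [f"M{i}" for i in range(flank, 0, -1)]     # M20 ... M1
    downstream = [f"P{i}" for i in range(1, flank + 1)]   # P1 ... P20
    return upstream + ["N", "X", "S/T"] + downstream

def labelled_window(seq, asn_index, flank=FLANK):
    """Return {label: residue} for one sequon, or None if a flank is incomplete."""
    start = asn_index - flank
    end = asn_index + 3 + flank  # the sequon occupies three residues
    if start < 0 or end > len(seq):
        return None
    return dict(zip(flank_labels(flank), seq[start:end]))
```

For example, with `flank=2` the sequence `ACNGTPK` and Asn index 2 give M2 = A, M1 = C, X = G, P1 = P, and P2 = K.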

A chi-square test was performed to check for the statistical significance of the occurrence of each amino acid at each position individually, comparing glycosylated to nonglycosylated sequons (Figure 4). The p-value is the standard value for chi-square scores with one degree of freedom. Only residues with a p-value of under 0.01 are presented in the figure. NPS/T was excluded from this analysis, because there are no cases in which it is glycosylated and zero is an unacceptable value in a chi-square calculation.




Fig. 4. Chi-square analysis of individual amino acids at individual positions. (A) NXS sequons; (B) NXT sequons. The p-value is the standard value for chi-square scores with one degree of freedom. Only residues with p < 0.01 are shown. *, P at the X position (details in text).

 
The most statistically significant result in both NXS and NXT sequons is P at P1. It is a strong but not absolute inhibitor of glycosylation: of the 61 NXS/TP sequences present in the data set, 7 were glycosylated.

A considerable difference was found between the patterns of flanking residues of the NXT and NXS sequons. In the case of NXS, hydrophobic residues at positions M9 to M17 as well as positively charged ones at P5 and P7 are highly overrepresented in the domains flanking glycosylated sequons. With NXT, the situation is more complex. The same or similar residues are overrepresented in close positions in the domains flanking glycosylated and nonglycosylated sequons. This is the case for instance with Cys in positions P4 and P6 in glycosylated sequons and in positions P5 and P8 in nonglycosylated ones.

Multisite analysis
The data-mining tool WizWhy (WizSoft, Israel) was used to analyze the sequons and flanking regions. WizWhy identifies complex if-then rules or patterns by first identifying biases in single sites, and merging rules that together better explain the dependent variable, in this case, glycosylation. Due to the small sample size, we grouped the amino acids together on a chemical basis, in the following nine groups: MLIV, TSC, FYW, AG, KR, DE, QN, P, and H. Rules were chosen based on a minimum of 30 instances in which the rule was true and a probability of at least 90% that the rule was true (small number of exceptions). Once again, the NXS and NXT sequons were examined separately.
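The two thresholds can be illustrated by checking a single candidate rule against them. The Python sketch below is not WizWhy's algorithm, which is not described here. It only evaluates a rule that is supplied by hand, under one reading of the thresholds: at least 30 sites at which the rule holds, and at least 90% of the sites matching its conditions having the predicted outcome. All names are hypothetical, and each site is assumed to be a dictionary keyed by position label (such as X or P1).

```python
# The nine chemical groups listed in the text.
GROUPS = ["MLIV", "TSC", "FYW", "AG", "KR", "DE", "QN", "P", "H"]
RESIDUE_TO_GROUP = {aa: group for group in GROUPS for aa in group}

def rule_stats(sites, conditions, outcome=True):
    """Evaluate one if-then rule.

    sites: list of (window, glycosylated), window being {label: residue}.
    conditions: {label: group}, e.g. {"X": "P"}.
    outcome: the glycosylation state the rule predicts.
    Returns (hits, confidence): the number of matching sites with the
    predicted outcome, and that number as a fraction of all matching sites.
    """
    matched = [
        glycosylated
        for window, glycosylated in sites
        if all(RESIDUE_TO_GROUP.get(window.get(label)) == group
               for label, group in conditions.items())
    ]
    hits = sum(1 for g in matched if g == outcome)
    confidence = hits / len(matched) if matched else 0.0
    return hits, confidence

def passes(sites, conditions, outcome=True, min_hits=30, min_conf=0.9):
    """Apply the two thresholds (as interpreted above) to one rule."""
    hits, confidence = rule_stats(sites, conditions, outcome)
    return hits >= min_hits and confidence >= min_conf
```

Grouping residues raises the number of sites matching any one condition, which is what allows rules to reach the 30-instance threshold in a data set of this size.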

One rule contraindicating glycosylation was found in both populations examined (NXS, NXT): X (P). Thus, none of the 80 NPS/T sequons present in the nonredundant well-defined glycoprotein set are occupied, confirming the earlier conclusion that a proline residue in position X completely prevents sequon glycosylation. The effect of P1 (P) was not seen, however, because there were not enough instances to pass the necessary threshold after the data set was divided by the hydroxy amino acid.

No rules indicating glycosylation were found for NXS sequons. However, 12 patterns, with two or three positions, were found with more than 90% confidence for NXT sequons (Table IV).


Table IV. Rules that indicate NXT glycosylation

 
In another grouping (DE; KRH; QNST; AMLIVFWY; C; G; P), once again no rules promoting glycosylation were found for NXS; however, this time 39 rules were found for NXT. From the NXT rules we see a trend that can be summarized by general patterns: M14 and M13, QNST; M5, AMLIVFWY; X, AMLIVFWY; P15 and P16, AMLIVFWY. Full details of the patterns are available on request.


    Conclusions
 
In this study we presented a database survey of 617 well-defined nonredundant glycoproteins in the attempt to identify patterns that promote or suppress N-glycosylation. Except for P at positions X and P1, no single amino acid at any of the flanking positions of the NXS/T sequon seems to have a dramatic effect on N-glycosylation. However, many individual flanking amino acids significantly affect the chances of a sequon being glycosylated or not.

There is a considerable difference between the patterns of flanking residues of the NXT and NXS sequons. A search for complex patterns revealed several consensus sequences in the flanking domains that promote glycosylation but cannot fully explain glycosylation specificity. It is likely that conformational factors have a more decisive role in determining whether a sequon will be occupied or not (Imperiali, 1997). Conformational factors are in fact the explanation given for failure of NPS/T and NXS/TP glycosylation (Bause, 1983). The presence of a considerable number of glycosylated sequons at the C-terminal decapeptide of glycoproteins shows that glycosylation can take place posttranslationally as well as cotranslationally.


    Materials and methods
 
Data set
The data were taken from SWISS-PROT version 40.24 of July 2002 (O'Donovan et al., 2002). The database was screened for proteins containing the FT <CARBOHYD> tag (Jung et al., 2001). This data set was named "all glycoproteins." This was further filtered to exclude those proteins that did not have at least one well-defined N-glycosylation site. A site is considered well defined if in the SWISS-PROT database it is in the form: "CARBOHYD (residue number) N-LINKED (GLCNAC...):" or of a form similar to this and is not listed as potential, probable, or by similarity. In cases where there was doubt whether a site was glycosylated or not (e.g., when it was listed as "partial," which means it is sometimes glycosylated and sometimes not) it was excluded from further analysis. This means that at least one positive well-defined site had to be present in a glycoprotein for it to be considered well defined.
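The site filter described above can be sketched in Python. The qualifier strings matched here (POTENTIAL, PROBABLE, BY SIMILARITY, PARTIAL) and the category names are assumptions based on the wording of the text, not a specification of the SWISS-PROT feature table format, and the function name is hypothetical.

```python
# Qualifiers that mark an annotation as not experimentally verified
# (assumed spellings, taken from the wording of the text).
UNCERTAIN = ("POTENTIAL", "PROBABLE", "BY SIMILARITY")

def classify_site(description):
    """Classify the description text of one CARBOHYD feature."""
    text = description.upper()
    if "N-LINKED" not in text:
        return "not N-linked"
    if "PARTIAL" in text:
        return "excluded"            # sometimes glycosylated, sometimes not
    if any(flag in text for flag in UNCERTAIN):
        return "not well defined"
    return "well defined"
```

Under this sketch a protein would enter the well-defined set only if at least one of its sites is classified as "well defined".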

A nonredundancy test, CD-HI (Li et al., 2001), was performed to generate a data set that contained proteins that had less than 90% identity, to prevent skewing of the data set by heavily overrepresented families. This data set was named "well-defined glycoproteins."

The sequons in each well-defined nonredundant glycoprotein were then classified into three categories: glycosylated, nonglycosylated, and unknown. The latter category, which included only 11 sequons, was excluded from further analysis.

After completing the screening, we checked that each position denoted as a sequon in the database was indeed present in the sequence of the particular glycoprotein, and was indeed a sequon. In cases of doubt of reliability of the annotation, each case was checked individually by extensive literature search. We excluded proteins with unspecified residues (X) near the sequon (within 20 amino acids on either side).

Analyses
Amino acid distribution of all of SWISS-PROT was taken from the database documentation files. The distribution of amino acids in each position around the sequon was calculated for 20 amino acids preceding and following the sequon.

Chi-square
Chi-square analysis was performed on each of the sequons and their flanking residues. The frequency of each amino acid at each position was determined and compared to the presence of all other amino acids at that position. This was done for the X residue in NXS/T as well as the residues flanking both glycosylated and nonglycosylated sequons. The two sets were then compared. This method considers not only how many times an amino acid seems to promote glycosylation but also how many times it refrains from interfering with it (and vice versa). The analysis was carried out on NXS and NXT sequons separately.
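For one amino acid at one position, this comparison reduces to a 2 × 2 contingency table with one degree of freedom. The Python sketch below shows the calculation under two assumptions: no continuity correction is applied (the text does not say whether one was used), and the function name is hypothetical. For one degree of freedom the p-value can be computed from the complementary error function as erfc(sqrt(chi2 / 2)).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for a 2 x 2 table (one degree of freedom).

    a: glycosylated sequons with the residue at the position
    b: glycosylated sequons with any other residue there
    c: nonglycosylated sequons with the residue
    d: nonglycosylated sequons with any other residue
    Returns (chi2, p), or None when a row or column total is zero.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return None
    chi2 = n * (a * d - b * c) ** 2 / denom
    p = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p
```

Because the "other residue" counts enter the table, the statistic reflects both how often a residue accompanies glycosylation and how often its absence does.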

WizWhy
Multisite analysis was carried out with the program WizWhy. WizWhy identifies complex if-then rules or patterns by first identifying biases in single sites, and merging rules that together better explain the dependent variable, in this case, glycosylation. Due to the small sample size we grouped the amino acids together on a chemical basis, in the following groups: MLIV, TSC, FYW, AG, KR, DE, QN, P, and H. Rules were chosen based on a minimum of 30 instances in which the rule was true and a probability of at least 90% that the rule was true (small number of exceptions).


    Acknowledgements
 
We thank Dr. Hillary Voet (Hebrew University Faculty of Agriculture, Rehovot) for help with statistics and Dr. Mira Marcus-Kalish (WizSoft, Israel) for the gift of WizWhy.


    Footnotes
 
1 To whom correspondence should be addressed; e-mail: nathan.sharon@weizmann.weizmann.ac.il


    References
 
Apweiler, R., Hermjakob, H., and Sharon, N. (1999) On the frequency of protein glycosylation, as deduced from analysis of the SWISS-PROT database. Biochim. Biophys. Acta, 1473, 4–8.

Bause, E. (1983) Structural requirements of N-glycosylation of proteins. Biochem. J., 209, 331–336.

Gavel, Y. and von Heijne, G. (1990) Sequence differences between glycosylated and non-glycosylated Asn-X-Thr/Ser acceptor sites: implications for protein engineering. Protein Eng., 3, 433–442.

Imperiali, B. (1997) Protein glycosylation: the clash of the titans. Acc. Chem. Res., 30, 452–459.

Jung, E., Veuthey, A.L., Gasteiger, E., and Bairoch, A. (2001) Annotation of glycoproteins in the SWISS-PROT database. Proteomics, 1, 262–268.

Kaji, H., Saito, H., Yamauchi, Y., Shinkawa, T., Taoka, M., Hirabayashi, J., Kasai, K., Takahashi, N., and Isobe, T. (2003) Lectin affinity capture, isotope-coded tagging and mass spectrometry to identify N-linked glycoproteins. Nat. Biotechnol., 21, 667–672.

Kaplan, H.A., Welply, J.K., and Lennarz, W.J. (1987) Oligosaccharyl transferase: the central enzyme in the pathway of glycoprotein assembly. Biochim. Biophys. Acta, 906, 161–173.

Kasturi, L., Eshleman, J.R., Wunner, W.H., and Shakin-Eshleman, S.H. (1995) The hydroxy amino acid in an Asn-X-Ser/Thr sequon can influence N-linked core glycosylation efficiency and the level of expression of a cell surface glycoprotein. J. Biol. Chem., 270, 14756–14761.

Li, W., Jaroszewski, L., and Godzik, A. (2001) Clustering of highly homologous sequences to reduce the size of large protein database. Bioinformatics, 17, 282–283.

Marshall, R. (1972) Glycoproteins. Annu. Rev. Biochem., 41, 673–702.

Nakashima, H. and Nishikawa, K. (1994) Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol., 238, 54–61.

O'Donovan, C., Martin, M.J., Gattiker, A., Gasteiger, E., Bairoch, A., and Apweiler, R. (2002) High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief Bioinform., 3, 275–284.

Shakin-Eshleman, S.H. (1996) Regulation of N-linked core glycosylation. Trends Glycosci. Glycotechnol., 8, 115–130.

Spiro, R. (2002) Protein glycosylation: nature, distribution, enzymatic formation, and disease implications of glycopeptide bonds. Glycobiology, 12, 43R–56R.

Varki, A., Cummings, R., Esko, J., Freeze, H., Hart, G., and Marth, J. (1999) Essentials of glycobiology. Cold Spring Harbor Laboratory Press, New York, p. 68.