1 Department of Biotechnology and Biomaterial Chemistry, Graduate School of Engineering, Nagoya University, Chikusa-ku, Nagoya 464-8603, 2 Department of Computational Biology, Biomolecular Engineering Research Institute, 6-2-3, Furuedai, Suita, Osaka 565-0874 and 4 Advanced IT Development Department, CTI Co., Ltd, Meieki-Minami 1-27-2, Nakamura-ku, Nagoya 450-0003, Japan
3 To whom correspondence should be addressed. e-mail: shirai{at}beri.or.jp
Abstract
Keywords: computer application/drug design/ligand prediction/molecular interaction/sugar
Introduction
Polysaccharides (carbohydrates) are often referred to as the third molecular chain of life. The functional roles of polysaccharides and their interactions with proteins are drawing more attention than before, since it has been recognized that carbohydrates are used as information carriers rather than simple storage material. Carbohydrate–protein interactions are involved in a variety of biological activities, including immune responses, cell–cell recognition and cell adhesion (Weis et al., 1992; Feizi, 1993; Lis and Sharon, 1998; Kogelberg and Feizi, 2001). Since polysaccharides assume a large variety of configurations, their potential for information encoding might be greater than that of peptides or nucleotides (Laine, 1994). Therefore, many carbohydrate-binding proteins are being considered as targets for new medicines (Beuth et al., 1994; Axford, 1997). Computer-aided predictions of carbohydrate–protein interactions might facilitate the rational design of drugs with activities against the proteins.
Although carbohydrate–protein interactions have been analyzed in several studies of the mechanisms of protein–carbohydrate recognition (Rini, 1995; Weis and Drickamer, 1996; Elgavish and Shaanan, 1997; Rao et al., 1998; Garcia-Hernandez and Hernandez-Arana, 1999; Garcia-Hernandez et al., 2000; Clarke et al., 2001; Neumann et al., 2002), only one computer application has been reported that is specifically aimed at predicting carbohydrate-binding sites (Taroni et al., 2000). Compared with the abundance of methodologies developed for protein–nucleic acid (Kono and Sarai, 1999) or protein–protein interactions (Jones and Thornton, 1997a,b; Ishida et al., 2000), there are still very few methods for predicting carbohydrate–protein interactions.
The rapid expansion of the structure databases provides an opportunity to use knowledge-based approaches for prediction. Several hundred structures of carbohydrate–protein complexes are currently available in the Protein Data Bank (PDB) (Zhang and Kim, 2003). The previously reported system used the statistics of amino acid propensity at carbohydrate-binding sites (Taroni et al., 2000). Patches of residues on a test protein molecule were ranked by the average of the propensity score over the residues. In 63% of the test proteins, at least one of the best three predicted patches was found to have considerable overlap with the real one. This result demonstrated that the patch and propensity approach was valid for prediction.
It is known that certain amino acid residues in carbohydrate-binding sites show characteristic spatial distributions around saccharide moieties (Drickamer, 1992; Iobst and Drickamer, 1994; Rini, 1995; Kolatkar and Weis, 1996; Weis and Drickamer, 1996; Elgavish and Shaanan, 1997; Taroni et al., 2000). This suggests that the coordinates of the carbohydrate-binding residues can be explicitly used for predictions. In this study, a program system has been developed that uses the empirical rules of the spatial distribution of protein atoms at known carbohydrate-binding sites for prediction, and the performance of the system was tested on 50 known carbohydrate–protein complexes.
Materials and methods
A schematic overview of the prediction system is shown in Figure 1. The system consists of two components: the programs for the construction of empirical rules and those for the sugar-binding-site search. The former programs take a set of known 3D carbohydrate–protein complex structures as input and construct the empirical rules. The search programs take the 3D structure of a target protein and the empirical rules as inputs, and search for the positions of carbohydrate-binding sites on the target. The output is a set of PDB-formatted monosaccharide coordinates that are placed on the target protein and scored by the empirical rules. The programs were written in C and run on an OCTANE workstation (SGI).
To construct the empirical rules, a total of 80 carbohydrate–protein complexes were selected from the PDB. First, all of the PDB entries containing saccharide moieties were retrieved from the database. Proteins that contained only covalently bound carbohydrates were discarded. Then, the amino acid sequences of the proteins were compared with each other to purge redundancy. A 30% overall sequence identity was used as a cut-off.
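The redundancy purge described above can be sketched as a greedy filter. This is only an illustration under stated assumptions: the text does not say how sequence identity was computed or in what order entries were visited, so `identity` is a hypothetical caller-supplied function, the toy identity table is invented, and treating a pair at exactly 30% as redundant is a guess.

```python
from typing import Callable, List


def purge_redundancy(entries: List[str],
                     identity: Callable[[str, str], float],
                     cutoff: float = 30.0) -> List[str]:
    """Greedily keep entries whose identity to every kept entry is below cutoff.

    `identity` is a hypothetical function returning percent sequence identity
    for a pair of entries; the alignment method is not specified in the text.
    """
    kept: List[str] = []
    for entry in entries:
        # Keep the entry only if it is dissimilar to everything kept so far.
        if all(identity(entry, other) < cutoff for other in kept):
            kept.append(entry)
    return kept


# Toy, made-up identity table for three placeholder entries.
_TOY = {frozenset(("A", "B")): 45.0,
        frozenset(("A", "C")): 12.0,
        frozenset(("B", "C")): 20.0}


def toy_identity(a: str, b: str) -> float:
    return _TOY[frozenset((a, b))]


selected = purge_redundancy(["A", "B", "C"], toy_identity)
```

With the toy table, entry B is dropped because it is 45% identical to A, while C is retained. A greedy pass depends on input order, so a real pipeline would need a rule for which member of a redundant pair to keep.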
The monosaccharide moieties in the selected entries were categorized according to their molecular species and conformation. All of the moieties of the same category were superimposed and the structures of amino acid residues that comprised the binding sites were compared visually on computer graphics. The proteins that appeared to be redundant were also discarded at this stage. To promote data accumulation, the same proteins were retained in the database if they bound different saccharides. The entries in the final selection are listed in Table I.
The monosaccharide moieties, along with their contacting amino acid residues, were superimposed on a reference coordinate system, as defined in Figure 2a. The C3 atom of a monosaccharide (C4 atom for sialic acid) was placed at the origin. The C3–C2 bond (C4–C3 bond for sialic acid) was aligned with the x-axis by placing the C2 atom on the positive region. The C3–C4 bond (C4–C5 bond for sialic acid) was laid on the xy plane (Figure 2a).
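The reference coordinate system can be illustrated with a short sketch that builds an orthonormal frame from three ring atoms and transforms any point into it. The function names are hypothetical, and placing C4 on the positive-y side of the xy plane is an assumption, since the text only states that the C3–C4 bond lies in that plane.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    n = math.sqrt(_dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def reference_frame(c3, c2, c4):
    """Return (origin, (x, y, z)) with C3 at the origin, C3->C2 along +x,
    and C4 in the xy plane (on the +y side, which is an assumption)."""
    x = _unit(_sub(c2, c3))
    z = _unit(_cross(x, _sub(c4, c3)))  # normal to the C2-C3-C4 plane
    y = _cross(z, x)                    # completes a right-handed frame
    return c3, (x, y, z)


def to_frame(point, frame):
    """Express a Cartesian point in the reference frame."""
    origin, (x, y, z) = frame
    v = _sub(point, origin)
    return (_dot(v, x), _dot(v, y), _dot(v, z))


# Made-up coordinates, for illustration only.
c3, c2, c4 = (1.0, 1.0, 1.0), (1.0, 3.0, 1.0), (0.0, 1.0, 1.0)
frame = reference_frame(c3, c2, c4)
```

For sialic acid the same functions would be called with C4, C3 and C5 in place of C3, C2 and C4. Applying `to_frame` to every contacting protein atom superimposes all binding sites of one monosaccharide type in a common frame.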
Binding-site search
First, a reference grid system was placed on the target protein molecule. The grid covered the entire target protein and consisted of evenly distributed points at 1.0 Å intervals (Figure 2c). Grid points were discarded if they lay within 4.0 Å of, or more than 6.0 Å from, the nearest protein atom. The origin of each scoring system (defined in the previous section and Figure 2b) was moved to every remaining grid point, and the system was rotated around the point in 15° steps in spherical polar angles. The score was calculated at each rotation step for every monosaccharide type. The score was the sum of the target atom counts at the points in the scoring system that were closest to the corresponding atoms of the target protein. The translation and rotation parameters for the 10 best scores were retained during the search through the reference grid. The coordinates of the predicted monosaccharide were calculated from the translation and rotation parameters. Finally, the program output the 10 best monosaccharide coordinates placed on the target protein.
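The grid-point filtering step can be sketched as follows. This is a brute-force illustration, not the original C implementation: the function name, the bounding-box padding and the handling of points lying exactly 4.0 Å or 6.0 Å from the nearest atom are assumptions.

```python
import math
from itertools import product


def shell_grid_points(atoms, spacing=1.0, d_min=4.0, d_max=6.0):
    """Return grid points whose nearest protein atom lies between d_min and d_max.

    `atoms` is a list of (x, y, z) tuples in angstroms. The grid spans the
    bounding box of the atoms padded by d_max (the padding is an assumption).
    """
    lo = [min(a[i] for a in atoms) - d_max for i in range(3)]
    hi = [max(a[i] for a in atoms) + d_max for i in range(3)]
    axes = [[lo[i] + k * spacing
             for k in range(int((hi[i] - lo[i]) / spacing) + 1)]
            for i in range(3)]
    kept = []
    for point in product(*axes):
        nearest = min(math.dist(point, atom) for atom in atoms)
        # Discard points too close to the protein or too far from it.
        if d_min <= nearest <= d_max:
            kept.append(point)
    return kept


# Toy "protein" consisting of a single atom at the origin.
points = shell_grid_points([(0.0, 0.0, 0.0)])
```

Each surviving point would then serve as a trial origin for the scoring system, which is rotated in 15° steps and scored. The brute-force nearest-atom search costs time proportional to the number of grid points times the number of atoms; a spatial index would be the usual remedy for real proteins.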
Performance test on known complexes
Fifty known carbohydrate–protein complex structures, including 24 enzymes and 26 lectins, were used as targets for the performance test (Table II). The complex structures contained a total of 127 monosaccharide moieties (71 for enzymes, 56 for lectins). The scoring systems for glucose (Glc), galactose (Gal), mannose (Man), N-acetylglucosamine (Nag), fucose (Fuc) and sialic acid (Sia) were used for the searches. Sites for some derivatives, such as methyl-mannose and glucose phosphate, were sought by using the scoring system for their prototypes (Table II). The target proteins listed in Table II were selected to evenly cover the protein types (enzyme or lectin) and binding saccharide types. Since the number of available new structures was limited, the proteins used for the scoring system and their relatives were also used as the targets. However, when each target protein and its relatives were analyzed, they were excluded from the scoring system construction.
A tolerant evaluation was also made, for comparative purposes. The criterion was similar to that used in the previous study (Taroni et al., 2000), which employed the method to find residue patches of the carbohydrate-binding site by using amino acid propensities at the known binding sites. The residue patch is a set of amino acid residues that comprises a carbohydrate-binding site. The program PATCH (Jones and Thornton, 1997a,b) was used to determine the residue patches in the previous study (Taroni et al., 2000). The judgment on the prediction result was based on the relative overlap of the predicted residue patch. The relative overlap is the percentage overlap of a residue patch to the one that had maximum overlap to the real (experimentally determined) patch among the examined patches. The actual percentage overlap of a patch to the real patch was smaller than the relative overlap if none of the examined patches had 100% overlap to the real patch. If at least one of the best three predicted patches had >70% relative overlap, then the prediction was regarded as successful. From the table given in the literature (Taroni et al., 2000), the successfully detected patches had 26–86% (average 49%) overlap with the real patches.
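The relative-overlap criterion of the previous study can be made concrete with a small sketch. It follows one reading of the definition above, in which each patch's percentage overlap is divided by the best percentage overlap found among all examined patches; the function names and the example numbers are illustrative assumptions.

```python
def relative_overlaps(percent_overlaps):
    """Scale each percentage overlap by the best overlap among examined patches."""
    best = max(percent_overlaps)
    if best <= 0:
        raise ValueError("no examined patch overlaps the real patch")
    return [100.0 * p / best for p in percent_overlaps]


def patch_prediction_successful(ranked_percent_overlaps, top_n=3, threshold=70.0):
    """True if any of the top_n ranked patches exceeds the relative-overlap threshold.

    `ranked_percent_overlaps` lists the actual percentage overlaps of all
    examined patches, ordered by their predicted rank (best prediction first).
    """
    relative = relative_overlaps(ranked_percent_overlaps)
    return any(r > threshold for r in relative[:top_n])


# Made-up overlaps for four examined patches, in predicted rank order.
example = [30.0, 45.0, 20.0, 60.0]
example_relative = relative_overlaps(example)
```

In this made-up case the best patch overlaps the real one by only 60%, so the second-ranked patch has a relative overlap of 75% although its actual overlap is 45%. This shows how a prediction can pass the 70% relative threshold with a much lower actual overlap.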
In this study, the percentage overlap was defined as the ratio: (number of residues in contact with both the predicted and real monosaccharide)/(number of residues in contact with the real monosaccharide). A residue was considered part of the residue patch of a binding site when at least one atom of the residue was found within 4.0 Å of any of the monosaccharide atoms. A site was considered detected when at least one of the top three predictions had >50% overlap with the real patch.
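The tolerant criterion of this study can be written down as a minimal sketch. The data layout (residue identifiers paired with atom coordinates) and the function names are assumptions made for illustration; only the 4.0 Å contact distance, the overlap ratio and the top-three, >50% rule come from the text.

```python
import math


def contact_residues(ligand_atoms, protein_atoms, cutoff=4.0):
    """Residues with at least one atom within `cutoff` angstroms of the ligand.

    `protein_atoms` is an iterable of (residue_id, (x, y, z)) pairs and
    `ligand_atoms` is a list of (x, y, z) tuples.
    """
    return {res for res, xyz in protein_atoms
            if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms)}


def percentage_overlap(predicted_patch, real_patch):
    """100 * |predicted AND real| / |real|, with patches given as residue sets."""
    if not real_patch:
        raise ValueError("real patch is empty")
    return 100.0 * len(predicted_patch & real_patch) / len(real_patch)


def site_detected(ranked_patches, real_patch, top_n=3, threshold=50.0):
    """True if any of the top_n predicted patches overlaps the real one by > threshold."""
    return any(percentage_overlap(p, real_patch) > threshold
               for p in ranked_patches[:top_n])


# Made-up residue patches, for illustration only.
real = {"A10", "A12", "A15", "A40"}
first = {"A10", "A12"}                 # 50% overlap, not above the threshold
second = {"A10", "A12", "A15", "B1"}   # 75% overlap
```

Because the denominator counts only residues of the real patch, extra residues in a predicted patch (such as `B1` here) do not lower its overlap.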
Results
To derive the empirical rules of carbohydrate–protein interactions, a non-redundant carbohydrate–protein complex database was constructed. The monosaccharide moieties were categorized according to their molecular species and conformations. Since only non-linear hexoses, namely Glc (85), Gal (28), Man (25), Fuc (12), Sia (13) and Nag (26) in the ⁴C₁ chair conformation, were observed frequently enough in the database, these six types were used for the empirical rule construction (the number of observations of each hexose in the non-redundant database is shown in parentheses). Variations in the α/β anomer, the C5–C6 torsion angles and some chemical modifications (methylation, sulfation, halogenation and deoxygenation) were ignored to enhance data accumulation.
Binding-site prediction: strict evaluation
The prediction system was tested on 50 known carbohydrate–protein complex structures. The target proteins have a total of 61 monosaccharide/polysaccharide binding sites, which can be divided into 127 monosaccharide binding sites. The performance of the system was evaluated by both the strict and tolerant criteria (see Materials and methods for definitions). The results are summarized in Table II and Figure 3. The success rate of prediction for monosaccharide sites by the strict criterion was 16% (20/127). The rate for enzymes was 18% (13/71) and that for lectins was 13% (7/56) (Figure 3a, solid line).
Binding-site prediction: tolerant evaluation
The performance was also evaluated on a residue patch basis (tolerant criterion). Based on the tolerant criterion, 59% (75/127) of the monosaccharide sites were detected. The rates for enzymes and lectins were 66 (47/71) and 50% (28/56), respectively (Figure 3a, dashed line). In terms of the polysaccharide sites, 69 (42/61), 77 (23/30) and 61% (19/31) were detected overall, and for enzymes and lectins, respectively (Figure 3b, dashed line).
Figure 4b shows an example. Based on the tolerant criterion, the residue patch of the real site R1 was detected by the second best prediction site P2 (90% overlap). The patches for sites R2–R4 were detected by the best prediction site P1 (50, 100 and 80% overlap, respectively), because the patch residues for the sites considerably overlap each other. In this case, the position and orientation of the saccharide ligand are not reliable, although the residues comprising the binding site were detectable.
The performance (success rate of prediction) of the present system was compared with that of the previous study by using the data shown in the literature (Taroni et al., 2000) based on the polysaccharide-tolerant criterion. Note that the above criterion is different from that used for the evaluation of their results (see Materials and methods for the criteria used previously and in this study). The values of percentage overlap of predicted patch to the real patch were derived from the literature (tables III and IV in Taroni et al., 2000), which were comparable with the percentage overlap values in this study. The prediction in this study is based on the detection of monosaccharides in either monosaccharide- or polysaccharide-binding sites, while both sites were treated together in the previous study. Considering this difference, the comparison was made by using a total of 18 monosaccharide- or disaccharide-binding sites that were commonly used in both studies (Table III). With the present prediction system, the sites were detected with efficiencies of 67% (12/18) overall, 78% (7/9) for enzymes and 56% (5/9) for lectins (Figure 3c, solid line). The corresponding values in the previous study were 33% (6/18) overall, 44% (4/9) for enzymes and 22% (2/9) for lectins (Figure 3c, dashed line).
The performance was evaluated for each monosaccharide type. The success rates for Gal, Glc, Man and Nag were 64 (16/25), 64 (27/42), 26 (5/19) and 61% (14/23), respectively, based on the monosaccharide-tolerant evaluation (Figure 3d). The rates for Fuc and Sia were 67 (2/3) and 100% (4/4), although these values are not comparable with the others, due to the poor sampling.
Detection rates for monosaccharide and polysaccharide
Among a total of 127 binding sites on the target proteins, 29 were monosaccharide-binding sites and 98 were sites for a saccharide residue of polysaccharides. The success rates for monosaccharide ligands and saccharide residues in polysaccharide ligands were 55 (16/29) and 60% (59/98) by the tolerant criterion, respectively.
Discussion
The performance of the present prediction system can be summarized as follows. The coordinates of the bound monosaccharide were predictable in 16% of the cases (monosaccharide-strict criterion; Figure 3a, solid line). At least one of the monosaccharide moieties in a chain was detected in 30% of the cases (polysaccharide-strict criterion; Figure 3b, solid line). The low success rate might be because the present system used only the ⁴C₁ conformation of saccharide and did not estimate conformational changes in saccharides and amino acid residues on binding. At least one of the best three predictions located a part of a binding site in 69% of the cases, i.e. the coordinates of the predicted ligand are not reliable but a saccharide will bind to the region (polysaccharide-tolerant criterion; Figure 3b, dashed line). The system works efficiently for Gal, Glc and Nag sites, but not for Man sites (Figure 3d). The success rates for monosaccharides (55%) and saccharide residues in chains (60%) were not significantly different from each other. The relatively higher rate for the latter might be because the detection of a residue patch in a polysaccharide-binding site was easier due to overlap between patches of neighboring saccharide residues.
The performance of the present system appeared to be improved in comparison with the previously reported one (Taroni et al., 2000) (Figure 3c). This improvement is due to the explicit use of the carbohydrate-binding residue coordinates. The proportions of the overall rates of enzymes and lectins are similar between the present and previous methods (Figure 3c). The binding sites of the enzymes were more efficiently detected than those of the lectins. As previously pointed out (Taroni et al., 2000), enzymes tend to bind saccharides in a deeper cleft as compared with lectins. This increases the number of atoms in contact with the substrate and thus might facilitate predictions.
Empirical rule inspection
Examples of the scoring system are shown in Figure 5. The regions where selected target atoms were frequently observed are enclosed in the networks. The selected target atoms are aromatic-C, carboxyl-O, charged and non-charged amino-N.
A possible explanation for the poor detection rate for Man (Figure 3d), from a scoring system inspection, is that its scoring system contains more peaks that are more dispersed than for the others (Figure 5c). For example, single prominent carboxyl-O peaks were observed for Glc and Gal, but four peaks were observed for Man (see green networks in Figure 5a–c). This might be partly due to poor sampling, and the further accumulation of empirical data would improve the detection rate.
Suggestions for improvement of the current system
Figure 6 shows the success rate of the predictions against the relative score. Relative score is the score of a predicted saccharide-binding site divided by the highest score of the same saccharide type for the same target protein. The success rate is the number of successful predictions divided by that of the false predictions in each range of relative score. The success rate rises as relative score increases, and the rate exceeds unity at the highest relative score (see line plot in Figure 6), which demonstrates the significance of the scoring system.
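The relative score and the per-range success ratio can be sketched as follows. The number and width of the score ranges are not stated in the text, so the five equal-width bins used here are an assumption, as are the function names and the example data.

```python
def relative_scores(scores):
    """Divide each score by the highest score for the same saccharide type and target."""
    best = max(scores)  # assumes at least one positive score
    return [s / best for s in scores]


def success_ratio_by_bin(predictions, n_bins=5):
    """Successful / false predictions in each relative-score range.

    `predictions` is a list of (relative_score, is_success) pairs with relative
    scores in (0, 1]. A bin with no false predictions yields None because the
    ratio is undefined there.
    """
    hits = [0] * n_bins
    misses = [0] * n_bins
    for rel, ok in predictions:
        b = min(int(rel * n_bins), n_bins - 1)  # a score of 1.0 joins the top bin
        if ok:
            hits[b] += 1
        else:
            misses[b] += 1
    return [h / m if m else None for h, m in zip(hits, misses)]


# Made-up predictions, for illustration only.
toy = [(1.0, True), (0.95, True), (0.9, False),
       (0.3, False), (0.35, False), (0.25, True)]
ratios = success_ratio_by_bin(toy)
```

A ratio above 1 in a range means that successful predictions outnumber false ones there, which corresponds to the "exceeds unity" statement above.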
The present prediction system uses the spatial distribution of amino acid residues, while the previous system (Taroni et al., 2000) uses their propensity at carbohydrate-binding sites. Since the two prediction systems do not use exactly the same information, a combination of the systems might be another strategy to improve the performance.
Because many of the natural ligands of carbohydrate-binding proteins are polysaccharides, the current system needs to be extended toward polysaccharide prediction. One of the remarkable features of the present system is that the monosaccharide coordinates are given as the prediction results. The monosaccharide coordinates can be used as a scaffold to construct those of the polysaccharide, if methods for refining the coordinates and chain elongation are available. The improvement of the program system by adding this function is currently in progress.
Acknowledgements
References
Berman,H.M., Bhat,T.N., Bourne,P.E., Feng,Z., Gilliland,G., Weissig,H. and Westbrook,J. (2000) Nat. Struct. Biol., 7, 957–959.
Beuth,J., Ko,L.H., Pulverer,G., Uhlenbruck,G. and Pichlmaier,H. (1994) Int. J. Med. Microbiol. Virol. Parasitol. Infect. Dis., 281, 324–333.
Clarke,C., Woods,R.J., Gluska,J., Cooper,A., Nutley,M.A. and Boons,G.J. (2001) J. Am. Chem. Soc., 123, 12238–12247.
Drickamer,K. (1992) Nature, 360, 183–186.
Elgavish,S. and Shaanan,B. (1997) Trends Biochem. Sci., 22, 462–467.
Feizi,T. (1993) Curr. Opin. Struct. Biol., 3, 701–710.
Garcia-Hernandez,E. and Hernandez-Arana,A. (1999) Protein Sci., 8, 1075–1086.
Garcia-Hernandez,E., Zubillaga,R.A., Rodriguez-Romero,A. and Hernandez-Arana,A. (2000) Glycobiology, 10, 993–1000.
Iobst,S.T. and Drickamer,K. (1994) J. Biol. Chem., 269, 15512–15519.
Ishida,H., Shirai,T., Matsuda,Y., Kato,Y., Ohno,M., Isaji,T. and Yamane,T. (2000) J. Biochem., 128, 561–574.
Jones,S. and Thornton,J.M. (1997a) J. Mol. Biol., 272, 121–132.
Jones,S. and Thornton,J.M. (1997b) J. Mol. Biol., 272, 133–143.
Kogelberg,H. and Feizi,T. (2001) Curr. Opin. Struct. Biol., 11, 635–643.
Kolatkar,A.R. and Weis,W.I. (1996) J. Biol. Chem., 271, 6679–6685.
Kono,H. and Sarai,A. (1999) Proteins: Struct. Funct. Genet., 35, 114–131.
Laine,R.A. (1994) Glycobiology, 4, 759–767.
Lis,H. and Sharon,N. (1998) Chem. Rev., 98, 637–674.
Neumann,D., Kohlbacher,O., Lenhof,H.P. and Lehr,C.M. (2002) Eur. J. Biochem., 269, 1518–1524.
Rao,V.S.R., Lam,K. and Qasba,P.K. (1998) Int. J. Biol. Macromol., 23, 295–307.
Rini,J.M. (1995) Annu. Rev. Biophys. Biomol. Struct., 24, 551–577.
Taroni,C., Jones,S. and Thornton,J.M. (2000) Protein Eng., 13, 89–98.
Weis,W.I. and Drickamer,K. (1996) Annu. Rev. Biochem., 65, 441–473.
Weis,W.I., Taylor,M.E. and Drickamer,K. (1992) Immunol. Rev., 98, 637–674.
Westbrook,J., Feng,Z., Chen,L., Yang,H. and Berman,H.M. (2003) Nucleic Acids Res., 31, 489–491.
Yokoyama,S., et al. (2000) Nat. Struct. Biol., 7, 943–945.
Zhang,C. and Kim,S.H. (2003) Curr. Opin. Chem. Biol., 7, 28–32.
Received April 24, 2003; revised June 2, 2003; accepted June 10, 2003.