1 Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London, WC1E 6BT and Department of Crystallography, Birkbeck College, Malet Street, London, WC1 7HX, UK
Abstract
Keywords: prediction/protein-sugar complex/protein-sugar interactions/surface patch
Introduction
Materials and methods
A non-homologous dataset of protein-carbohydrate complexes was selected from the January 1995 release of the PDB for the analysis of protein-carbohydrate interactions. Homology was assigned as follows. Proteins sharing more than 30% sequence identity were assigned to the same homologous family. From each family a representative complex was chosen that, where available, contained a native protein, represented a naturally occurring complex and had the highest resolution. The selected entries were then compared using the structural alignment program SSAP (Taylor and Orengo, 1989). The SSAP score gives the degree of structural similarity of each pair, with 100 indicating structural identity and zero the lowest similarity. Proteins with an SSAP score greater than 80 were considered related, so some sequence families were combined. From each of the final non-homologous families a representative was again chosen as described above. This set of proteins, referred to as dataset I, was used to optimize the prediction parameters.
Two additional sets of protein-carbohydrate complexes were used to test the performance of the prediction algorithm. Both were selected from the PDB as updated to March 1997 and were named Test set I and Test set II. Test set I, the smaller of the two, comprised all new sugar binding proteins that show no homology to any protein in dataset I or to each other. Test set II contained all other protein-carbohydrate complexes, which are homologous to structures in either dataset I or Test set I.
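The two-stage family assignment can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used: it assumes single-linkage grouping, it assumes pairwise sequence identities and SSAP scores are already available as dictionaries, and it picks representatives by resolution alone. The function names, input layout, entry labels and numerical values are all hypothetical.

```python
from itertools import combinations


def cluster(items, related):
    """Single-linkage clustering: two items share a group if related(a, b)."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if related(a, b):
            parent[find(a)] = find(b)

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


def pair_lookup(table, a, b, default=0.0):
    """Symmetric lookup in a dict keyed by (a, b) tuples."""
    return table.get((a, b), table.get((b, a), default))


def non_homologous_families(entries, seq_identity, ssap, resolution):
    # Stage 1: sequence families (more than 30% identity).
    seq_fams = cluster(entries, lambda a, b: pair_lookup(seq_identity, a, b) > 30.0)
    # Representative = lowest resolution value (assumption: the text also
    # prefers native proteins and naturally occurring complexes).
    reps = {min(fam, key=resolution.get): fam for fam in seq_fams}
    # Stage 2: merge families whose representatives score above 80 in SSAP.
    merged = cluster(list(reps), lambda a, b: pair_lookup(ssap, a, b) > 80.0)
    return [sorted(m for r in group for m in reps[r]) for group in merged]


# Hypothetical entries and scores, for illustration only.
entries = ["p1", "p2", "p3", "p4"]
seq_identity = {("p1", "p2"): 45.0}
ssap = {("p1", "p3"): 85.0}
resolution = {"p1": 1.8, "p2": 2.2, "p3": 2.0, "p4": 1.5}
families = non_homologous_families(entries, seq_identity, ssap, resolution)
```

Here `p1` and `p2` form one sequence family, and that family is then merged with `p3` because their representatives are structurally similar.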
Patch analysis
A surface patch on a protein is defined as N neighbouring solvent accessible amino acid residues surrounding a central exposed residue (Jones and Thornton, 1997a), where N is the number of residues comprising the actual sugar binding site on the protein. The neighbouring residues are determined by their Cα positions. An amino acid is defined as a surface residue if more than 1% of its accessible surface area is exposed to the solvent. An observed carbohydrate binding site patch is formed by those surface residues whose accessible surface area decreases by more than 1 Å² on binding of the ligand. All possible overlapping surface patches were determined for each structure (i.e. one patch for each surface amino acid residue) with the program PATCH (Jones and Thornton, 1997a). For each patch, six parameters were calculated: solvation potential, residue sugar interface propensity, hydrophobicity, planarity, protrusion index and relative accessible surface area (ASA) (Jones and Thornton, 1997a).
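The definitions above can be illustrated with a short Python sketch. It is not the PATCH program: it assumes neighbours are simply the nearest surface residues by Cα distance, and it takes the patch size N to include the central residue. The function names and the input dictionaries are hypothetical.

```python
import math


def surface_residues(rel_asa, cutoff=1.0):
    """Residues whose relative ASA (in %) exceeds the cutoff."""
    return [res for res, asa in rel_asa.items() if asa > cutoff]


def interface_residues(asa_free, asa_bound, min_loss=1.0):
    """Residues losing more than min_loss square angstroms of ASA on binding."""
    return [res for res in asa_free if asa_free[res] - asa_bound[res] > min_loss]


def build_patch(centre, ca_coords, surface, n):
    """Central residue plus its nearest surface neighbours, n residues in total."""
    others = [res for res in surface if res != centre]
    others.sort(key=lambda res: math.dist(ca_coords[centre], ca_coords[res]))
    return [centre] + others[: n - 1]


def all_patches(ca_coords, rel_asa, n):
    """One patch per surface residue, so patches overlap heavily."""
    surface = surface_residues(rel_asa)
    return {centre: build_patch(centre, ca_coords, surface, n) for centre in surface}


# Toy input: five residues on a line, residue 2 buried.
ca = {1: (0.0, 0.0, 0.0), 2: (1.0, 0.0, 0.0), 3: (2.0, 0.0, 0.0),
      4: (3.0, 0.0, 0.0), 5: (10.0, 0.0, 0.0)}
rel = {1: 50.0, 2: 0.5, 3: 20.0, 4: 30.0, 5: 40.0}
patches = all_patches(ca, rel, n=3)
```

Because every surface residue seeds its own patch, the number of patches equals the number of surface residues.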
Solvation potential
The solvation potential is a knowledge-based measure of the propensity of an amino acid residue to have a certain degree of solvation in the protein (Jones et al., 1992). These potentials, evaluated at points across the full accessibility range from 0 to 100%, were derived from a large dataset of non-homologous proteins. The solvation potential of a given residue in a protein depends on the relative surface exposure of that residue. The solvation potential of a surface residue is given by the difference between the solvation potential value associated with its exposed ASA and the solvation potential corresponding to a residue of the same type with zero ASA:
$$\Delta S_i = S_i(\mathrm{ASA}_i) - S_i(0) \qquad (1)$$
where $S_i(a)$ denotes the solvation potential of a residue of type $i$ at relative accessibility $a$.
The solvation potential for the complete patch is the mean solvation potential of the amino acid residues comprising the patch. The more positive the solvation potential, the higher the propensity for burial.
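As a sketch of this calculation, the Python below takes a table of potentials sampled at a few accessibility values, subtracts the zero-accessibility value for each residue and averages over the patch. Linear interpolation between the sampled points is an assumption, and the residue labels and potential values are placeholders, not the published potentials.

```python
def interpolate(points, x):
    """Linear interpolation over sorted (x, y) points, clamped at both ends."""
    if x <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return points[-1][1]


def residue_solvation(potentials, res_type, rel_asa):
    """Potential at the observed accessibility minus the potential at zero ASA."""
    pts = potentials[res_type]
    return interpolate(pts, rel_asa) - interpolate(pts, 0.0)


def patch_solvation(potentials, patch):
    """Mean over the patch; patch is a list of (residue type, relative ASA in %)."""
    return sum(residue_solvation(potentials, t, a) for t, a in patch) / len(patch)


# Placeholder table: residue type -> [(relative ASA in %, potential), ...]
toy_potentials = {
    "X": [(0.0, 1.0), (50.0, 0.0), (100.0, -1.0)],
    "Y": [(0.0, -0.5), (100.0, 0.5)],
}
toy_patch = [("X", 25.0), ("Y", 50.0)]
value = patch_solvation(toy_potentials, toy_patch)
```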
Interface propensity
This parameter was adapted to the sugar binding analysis by using the propensity of each amino acid residue to be in the interface with the sugar. The propensity quotient P(i) of a surface residue of type i is defined as the ratio between its frequency at the interface and the 'average' frequency of any amino acid at the interface (eqn 2); that is, a residue favours the interface region if it is found there more frequently than average.
Values above 1 indicate a propensity for being in contact with a sugar ligand. For easier evaluation of the results, the natural logarithm of P(i) was taken, so that positive values indicate a propensity to be involved in carbohydrate binding and negative values indicate a tendency to avoid the interface. The sugar interface propensities of the 20 common amino acid residues are listed in Table I.
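A hedged Python sketch of the log propensity follows. It assumes one plausible reading of the definition: the fraction of surface residues of type i found at the interface, divided by the same fraction pooled over all residue types. The counts are invented for illustration and the function name is hypothetical.

```python
import math


def log_propensities(interface_counts, surface_counts):
    """ln P(i) for each residue type; None where the logarithm is undefined."""
    average = sum(interface_counts.values()) / sum(surface_counts.values())
    result = {}
    for res, n_surface in surface_counts.items():
        n_interface = interface_counts.get(res, 0)
        if n_interface == 0 or n_surface == 0:
            result[res] = None  # never observed at the interface
        else:
            result[res] = math.log((n_interface / n_surface) / average)
    return result


# Invented counts, for illustration only.
interface_counts = {"TRP": 20, "LYS": 10}
surface_counts = {"TRP": 100, "LYS": 200}
ln_p = log_propensities(interface_counts, surface_counts)
```

With these counts the pooled interface fraction is 0.1, so the first residue type scores ln 2 (favoured) and the second scores ln 0.5 (disfavoured).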
Hydrophobicity
The patch hydrophobicity is the mean hydrophobicity of the residues in a given surface patch, calculated with the Fauchère and Pliska scale (Fauchère and Pliska, 1983). The larger the hydrophobicity parameter, the more hydrophobic the patch residues.
Planarity
The planarity of a surface patch is calculated as the root mean square deviation (r.m.s.d.) of all patch atoms from the best fit plane through the patch. The higher the r.m.s.d. value, the less planar the patch.
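One way to compute such an r.m.s.d., sketched in Python with only the standard library: the least-squares plane passes through the centroid, and the sum of squared distances from it equals the smallest eigenvalue of the 3 x 3 scatter matrix of the centred coordinates. The closed-form eigenvalue routine and the function names are choices made for this sketch, not a description of the original implementation.

```python
import math


def smallest_eigenvalue(a):
    """Smallest eigenvalue of a symmetric 3x3 matrix (trigonometric closed form)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return min(a[0][0], a[1][1], a[2][2])
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # guard against rounding
    phi = math.acos(r) / 3.0
    return q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)


def plane_rmsd(points):
    """R.m.s. distance of the points from their least-squares plane."""
    n = len(points)
    centroid = [sum(pt[k] for pt in points) / n for k in range(3)]
    centred = [[pt[k] - centroid[k] for k in range(3)] for pt in points]
    scatter = [[sum(c[i] * c[j] for c in centred) for j in range(3)] for i in range(3)]
    return math.sqrt(max(smallest_eigenvalue(scatter), 0.0) / n)


# Two flat layers half a unit above and below the plane z = 0.
layers = [(x, y, z) for x in (0.0, 4.0) for y in (0.0, 4.0) for z in (0.5, -0.5)]
# Six points lying exactly in the tilted plane x + y + z = 0.
tilted = [(1, -1, 0), (-1, 1, 0), (1, 0, -1), (0, 1, -1), (-1, 0, 1), (0, -1, 1)]
```

A perfectly flat patch gives a value close to zero whatever its orientation, as the `tilted` example shows.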
Protrusion
The protrusion index gives an indication of the patch protrusion from the surface of a protein. Residue protrusion indexes are calculated by fitting an equimomental ellipsoid to the protein and calculating the relative location of each residue in a series of concentric shells (Thornton et al., 1986). The patch protrusion index is the mean index over the patch residues. The higher the value, the more protruding the patch.
Accessibility
The patch accessibility is the average relative ASA over the patch residues. The higher the ASA parameter, the more accessible the patch.
The parameter scores of the observed binding site on a given protein are calculated, together with those of all other surface patches of that same protein. For each parameter, the values are divided into 10 equal intervals and ranked on a scale of 1 to 10, with interval 1 containing the highest scores and interval 10 the lowest. This procedure is repeated for each protein-sugar complex in the dataset.
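The interval ranking can be sketched as below. The sketch assumes equal-width intervals spanning the minimum to the maximum value observed on the protein, and it assigns a value lying exactly on a boundary to the lower-scoring interval. Both choices, and the function name, are assumptions.

```python
def interval_rank(value, all_values, n_bins=10):
    """Rank from 1 (highest-scoring interval) to n_bins (lowest)."""
    lo, hi = min(all_values), max(all_values)
    if hi == lo:
        return 1  # no spread: every patch falls in the same interval
    width = (hi - lo) / n_bins
    rank = int((hi - value) / width) + 1
    return min(rank, n_bins)  # the minimum value belongs to the last interval


# Example: parameter values of all patches on one protein (made up).
values = list(range(0, 101))
observed_site_rank = interval_rank(95, values)
```

A rank of 1 or 2 for the observed binding site suggests that the parameter discriminates the site from the rest of the surface, whereas a rank near 5 suggests it does not.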
Patch prediction
The prediction algorithm is based on a comparison of the parameters described above across all calculated patches on a protein. Each of the six parameter ranges is normalized to a scale of 0 to 100 and used to calculate the probability Pj of each patch being a sugar binding site. The probability Pj, or combined score, is calculated from the individual scores according to eqn 3.
The parameters included in the calculation can be chosen according to the results of the preceding patch analysis. It is also possible to choose how each individual score contributes to the combined score, i.e. whether a high value of a parameter is considered favourable or unfavourable in the prediction. For example, if the characteristics required are high residue propensity, low accessibility and low protrusion index, eqn 3 is evaluated with the propensity score counted as favourable when high and the accessibility and protrusion scores counted as favourable when low (eqn 4).
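A minimal Python sketch of the scoring step. The exact form of the combined score is not shown here, so the sketch assumes that it is the mean of the selected normalized scores, with a score s replaced by 100 - s when a low parameter value is favourable. That combination rule, the function names and the toy values are all assumptions.

```python
def normalise(values):
    """Rescale raw parameter values linearly onto 0-100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [50.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def combined_scores(raw, favourable):
    """raw: {parameter: [value per patch]}; favourable: {parameter: 'high' or 'low'}."""
    n_patches = len(next(iter(raw.values())))
    totals = [0.0] * n_patches
    for param, direction in favourable.items():
        for j, score in enumerate(normalise(raw[param])):
            totals[j] += score if direction == "high" else 100.0 - score
    return [total / len(favourable) for total in totals]


def top_patches(scores, k=3):
    """Indices of the k best-scoring patches, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy values for three patches, using the enzyme-type criteria from the text.
raw = {"propensity": [0.0, 1.0, 2.0], "asa": [30.0, 10.0, 20.0],
       "protrusion": [5.0, 1.0, 3.0]}
enzyme = {"propensity": "high", "asa": "low", "protrusion": "low"}
scores = combined_scores(raw, enzyme)
best = top_patches(scores)
```

Changing the `favourable` mapping is all that is needed to switch between parameter sets, for example to high propensity and high protrusion.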
Results
The dataset of 19 non-homologous carbohydrate binding proteins (Table II) comprises nine enzyme structures, seven lectin structures, one Fab fragment and two periplasmic carbohydrate binding proteins. Some of the complexes contain more than one carbohydrate binding site. All non-identical binding sites on a single structure were considered separately. When more than one identical binding site was present, the site containing the ligand with the lowest average B factor was chosen. Cyclomaltodextrin glycosyltransferase (1cxg) and glycogen phosphorylase (6gpb) are two examples of enzymes containing carbohydrate ligands bound to sites other than the active site. In our classification, the catalytic sites of these two structures were included in the enzyme class and their other sugar binding sites in the lectin class. The 19 dataset structures contain a total of 25 binding sites.
Dataset I was screened to determine which, if any, of the six parameters (solvation potential, propensity, hydrophobicity, planarity, protrusion and relative accessible surface area) best differentiate the carbohydrate binding site regions on a protein. The six parameters were computed for every calculated patch and for the observed sugar interface of each dataset protein. The size of the calculated patches was, in each case, equal to the size of the observed interface and is reported in Table II. Although a monosaccharide can be in contact with up to 19 protein residues, several oligosaccharides interact with only 8-9 amino acids. Carbohydrate interface sizes range from a minimum of seven residues in the case of maltose bound to cyclodextrin glycosyltransferase (1cxg), to a maximum of 35 for soybean β-amylase with bound maltotetraose (1byb). Enzymes tend to bury a large portion of the bound sugar, while lectins leave the ligand more exposed to the solvent, usually interacting with the end monosaccharide units of an oligosaccharide. For multimeric structures, the calculation of parameters and patches was performed on the whole protein, so that the interface regions between different subunits were excluded from the calculation. Only those patches containing residues from the subunits of interest were subsequently used. For example, the homopentamer cholera toxin (1chb) has one binding site per monomer. The program PATCH is run on the complete protein but only the patches containing residues from one chain (H) are kept.
The total number of calculated patches for each structure is given in Table II and the parameter distributions for each protein are summarized in histogram plots (one plot for each parameter). An example is presented in Figure 2 for soybean β-amylase with bound maltotetraose.
Prediction
Dataset I
From the above analysis it was concluded that a surface patch has a high probability of being a carbohydrate binding site on an enzyme or periplasmic sugar binding protein if it has a high average residue propensity, a low protrusion index and a low relative ASA. In lectin structures the best patches have a high residue propensity score and a high protrusion index. These parameters were also used for the Fab fragment. The prediction was additionally run with the residue propensity scores alone, to determine the extent to which the amino acid composition of the binding site region distinguishes a binding site from the remaining protein surface. Different patch sizes were explored: 12, 15 and 20 residues for enzymes; 12 and 15 residues for lectins. The patch size chosen does not have a significant effect on the results, being in any case very small compared with the total size of the protein [Table II(c) and (d)]. The best choice was 15 residues for enzymes and 12 residues for lectins. The scoring criteria for the prediction algorithm are given in the Materials and methods section. For a prediction to count as successful, a calculated patch has to achieve at least 70% of the maximum possible overlap with the observed binding site (Rel ≥ 70). The results for the 19 dataset I complexes are reported in Table III.
The results for the enzyme-type binding sites were very good: 10 of the 11 binding sites were correctly predicted. The only poor prediction is the active site of glycogen phosphorylase (6gpb), for which the top three patches do show some overlap with the observed site, but below the threshold set for a correct prediction. It should be noted that the patch with the highest overlap ranks 12 out of 663 patches.
The analysis of the lectin binding sites is more complex, since almost all structures are multimeric proteins or contain more than one carbohydrate binding site. When more than one binding site is present, care must be taken in interpreting the results. The top three patches can refer to any of the binding sites, particularly if they are all of the same type and have been predicted with the same parameters. An incorrect prediction can arise if the top three patches overlap with a binding site other than the one chosen. To overcome this problem, the prediction results of all distinct binding sites of a single protein were combined. The top three calculated patches that do not overlap with another binding site are taken as the best scoring patches for a binding site. The ranking of the patch with the highest overlap is shifted accordingly, by removing all patches that overlap with one of the other binding sites.
Only six of the 14 lectin binding sites were successfully predicted. In three examples (1slt, 2aai and 1cxg4) the top three patches showed insufficient overlap with the observed interface. Rerunning the prediction with residue propensity only gives a correct prediction in all three cases. In the remaining five unsuccessful predictions (2cwgE, 6gpb2, 1hgh1, 1hgh2 and 1mfb), the observed binding site patch was completely missed.
Wheat germ agglutinin (2cwg) has four binding sites (Wright, 1990), only two of which are occupied by a bound oligosaccharide in this entry. The low affinity binding site (containing ligand E) was completely missed. A closer analysis revealed that all calculated patches scoring better than those overlapping with binding site E either overlap with binding site D (containing ligand D) or are located in the areas containing the two unoccupied binding sites.
The carbohydrate binding site of the Fab fragment (1mfb) was also incorrectly predicted. The lack of success in this case can be explained by the results of the patch analysis for this complex in Table II. The ranking of the protrusion index of the observed binding site patch was 5, meaning that this parameter is not discriminating for this protein. A prediction was therefore run using the residue interface propensity score only, and the results improved considerably. The relative overlap of the top three patches was now 44, 67 and 67% respectively, and the patch with maximum P1 ranks 9 out of 297 possibilities.
The site of glycogen storage in glycogen phosphorylase (6gpb2) was also not located in the prediction. The highest scoring patch was in position 58 out of 664, and only 2% of the patches have a relative overlap greater than 70%.
Both binding sites of influenza virus haemagglutinin (1hgh1 and 1hgh2), a membrane bound protein, were missed in the prediction. The third best patch shows an overlap with the low affinity binding site. A visual inspection of the results showed that most of the top patches in this case are located in the area facing the viral membrane.
Of particular interest were the results obtained for cyclomaltodextrin glycosyltransferase (1cxg), which has four sugar binding sites, three with lectin characteristics and one with enzyme characteristics. The program PATCH was run with two different sets of parameters on this protein, giving a successful prediction in both cases. Use of the enzyme parameters (high residue propensity, low protrusion index and low ASA) located the catalytic site, whereas with the lectin parameters (high residue propensity and high protrusion index) the top patches all overlapped with the other three binding sites. Using only the residue propensity score and a patch size of 15, the top four patches overlapped with the active site (patch 4 having 100% relative overlap), while the first patch overlapping with one of the maltose binding sites was in position 11.
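The success criterion and the treatment of multiple binding sites can be sketched in Python as follows. The sketch assumes that the maximum possible overlap is the size of the smaller of the two residue sets, which matters when the fixed patch size differs from the size of the observed site. Function names and residue numbers are hypothetical.

```python
def relative_overlap(patch, site):
    """Shared residues as a percentage of the maximum possible overlap."""
    patch, site = set(patch), set(site)
    return 100.0 * len(patch & site) / min(len(patch), len(site))


def is_successful(ranked_patches, site, top=3, threshold=70.0):
    """True if any of the top-ranked patches reaches the overlap threshold."""
    return any(relative_overlap(p, site) >= threshold for p in ranked_patches[:top])


def filter_other_sites(ranked_patches, other_sites):
    """Drop patches that overlap any other binding site, keeping rank order."""
    return [p for p in ranked_patches
            if not any(set(p) & set(s) for s in other_sites)]


# Toy example: a 10-residue site and two 12-residue patches.
site = list(range(1, 11))
patch_a = list(range(4, 16))   # shares residues 4-10 with the site
patch_b = list(range(20, 32))  # no shared residues
ranked = [patch_b, patch_a]
```

For a protein with several sites, `filter_other_sites` would be applied first and the top three of the remaining patches then passed to `is_successful`.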
Discussion
Acknowledgments
Notes
References
Bundle,D.R. and Young,N.M. (1992) Curr. Opin. Struct. Biol., 2, 666.
Davies,G. and Henrissat,B. (1995) Structure, 3, 853-859.
Drickamer,K. (1995) Nature Struct. Biol., 2, 437-439.
Drickamer,K. (1997) Structure, 5, 465-468.
Fauchère,J. and Pliska,V. (1983) Eur. J. Med. Chem., 18, 369-375.
Henrissat,B. and Davies,G. (1997) Curr. Opin. Struct. Biol., 7, 637-644.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Nature, 358, 86-89.
Jones,S. and Thornton,J.M. (1997a) J. Mol. Biol., 272, 121-132.
Jones,S. and Thornton,J.M. (1997b) J. Mol. Biol., 272, 133-143.
Laskowski,R.A., Luscombe,N.M., Swindells,M.B. and Thornton,J.M. (1996) Protein Sci., 5, 2438-2452.
Meyer,J.E.W. and Schulz,G.E. (1997) Protein Sci., 6, 1084-1091.
Moodie,S.L., Mitchell,J.O. and Thornton,J.M. (1996) J. Mol. Biol., 263, 486-500.
Quiocho,F.A. (1989) Pure Appl. Chem., 61, 1293-1306.
Sharon,N. and Lis,H. (1990) Chem. Brit., 26, 679-682.
Sharon,N. (1993) Trends Biochem. Sci., 18, 221-226.
Spurlino,J.C., Rodseth,L.E. and Quiocho,F.A. (1992) J. Mol. Biol., 226, 15-22.
Taroni,C. (1998) Computational Analysis of Protein-Carbohydrate Interactions. PhD thesis, University College London.
Taylor,G. (1996) Curr. Opin. Struct. Biol., 6, 830-837.
Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 1-22.
Thornton,J.M., Edwards,M.S., Taylor,W.R. and Barlow,D.J. (1986) EMBO J., 5, 409-413.
Toone,E.J. (1994) Curr. Opin. Struct. Biol., 4, 719.
Vyas,N.K. (1991) Curr. Opin. Struct. Biol., 1, 732-740.
Vyas,N.K., Vyas,M.N. and Quiocho,F.A. (1991) J. Biol. Chem., 266, 5226-5237.
Weis,W.I. (1997) Curr. Opin. Struct. Biol., 7, 624-630.
Wright,C.S. (1990) J. Mol. Biol., 215, 635.
Received May 4, 1999; revised November 1, 1999; accepted November 16, 1999.