Mapping of protein surface cavities and prediction of enzyme class by a self-organizing neural network

Martin Stahl, Chiara Taroni and Gisbert Schneider1

F. Hoffmann-La Roche Ltd, Pharmaceuticals Research, CH-4070 Basel, Switzerland


    Abstract
An automated computer-based method for mapping protein surface cavities was developed and applied to a set of 176 metalloproteinases containing zinc cations in their active sites. With very few exceptions, the cavity search routine detected the active site among the five largest cavities and produced reasonable active site surfaces. Cavities were described by means of solvent-accessible surface patches. For a given protein, these patches were calculated in three steps: (i) definition of the atoms forming surface cavities by a grid-based technique; (ii) generation of solvent-accessible surfaces; (iii) assignment of an accessibility value and a generalized atom type to each surface point. Topological correlation vectors were generated from the set of surface points forming the cavities and projected onto a plane by a self-organizing network. The resulting map of 865 enzyme cavities displays clusters of active sites that are clearly separated from the other cavities. It is demonstrated that both fully automated recognition of active sites and prediction of enzyme class can be performed for novel protein structures with high accuracy.

Keywords: active site/correlation/feature map/nonlinear projection/protein structure


    Introduction
Knowledge of the three-dimensional structure of a target protein is a rich source of information for computer-aided drug design. Of special interest are the size and shape of the active site, and the distribution of functional groups and lipophilic areas. As the number of solved X-ray structures of proteins is rapidly increasing, it is both possible and desirable to address questions related to coverage of the protein structure universe, conserved arrangements of functional groups or common ligand binding patterns (Alberts et al., 1998; Young et al., 1999). However, such an analysis cannot be performed by visual inspection of structural models alone. An automated procedure for the analysis, prediction and comparison of potential binding sites in proteins could therefore be a very helpful tool (Böhm, 1998).

Here we describe the implementation of a computational method for (i) automated detection of protein surface pockets, (ii) generation of a property-encoded solvent-accessible surface (SAS) for each pocket, (iii) generation of topological correlation vectors of the SAS and (iv) projection (visualization) of these vectors onto a planar display by means of self-organizing maps (SOM). As a result, a two-dimensional map was obtained which displays the distribution of surface cavities in a chemical property space. This method was applied to a set of 176 proteins from the Protein Data Bank (PDB) containing a catalytically active zinc ion in the active site (Bernstein et al., 1977). On the resulting SOM, active site pockets are clearly separated from other surface depressions for the majority of proteins. A more detailed analysis showed that the automated mapping of the active sites accurately reflects established enzyme classification. This can give new insight into local structural similarities between enzymes with completely different folds and functions. Furthermore, the mapping technique allowed for the correct classification of 90 surface pockets derived from 18 additional zinc-containing proteins that were not contained in the training set.


    Computational methods
Protein data collection

A training set of 175 protein structures was selected from the PDB. It contained all proteins accessible on July 17, 1998, carrying a catalytically active zinc cation in the active site with at least two nitrogen atoms in the zinc coordination sphere. It was found that the raw collection was biased towards structures of carbonic anhydrases I and II. Therefore, all structures of mutants of these enzymes were removed. The structures of three procarboxypeptidases remained in the set (1pyt, 1pca, 1nsa), although these represent inactive enzyme forms. In addition, a test set consisting of 18 proteins was compiled to estimate the accuracy of our prediction system: 1bc2, 1bn1, 1bn3, 1bn4, 1bnn, 1bnq, 1bnt, 1bnu, 1bnw, 1bv3, 1bvt, 1cpx, 1kop, 1koq, 1sxs, 2anh, 2bmi and 4aig. These structures were made accessible in the PDB between July and December 1998.

Detection of surface cavities

A rectangular Cartesian grid with 1 Å spacing (a × b × c grid points) was generated around the protein (Figure 1a). Grid points within 0.8 Å of the van der Waals surface of a protein atom were marked as 'protein'. The remaining points were marked as 'solvent'. To define a grid-based surface, solvent points were selected that lay less than 2 Å from a protein point. For these surface points, a crude accessibility measure was calculated: starting from a given grid point, the program scanned along the positive and negative x, y and z axes (six directions) and along the four cubic diagonals in both senses (eight directions), yielding a total of 14 scan directions. A maximum of 10 steps on the grid was considered along each direction. When a protein grid point was encountered during a scan, a counter variable with an initial (maximal) value of 14 was decremented. This results in large accessibility values, x, for surface grid points close to convex parts of the protein surface, and yields low values for points within clefts or surface depressions (Figure 1b). As a next step, all surface grid points with x > 4 were reset to solvent. A surface grid point was also reset if fewer than 10 of the surrounding 26 grid points were marked as surface. As a result, the remaining surface grid points formed contiguous clusters defining protein surface pockets (Figure 1c). The pockets were sorted according to the number of grid points involved. Finally, pocket atoms were defined as being the protein grid points closest to any surface point. Variants of this algorithm have been applied by us in a different context (Stahl and Böhm, 1998), and are part of the LIGSITE program (Hendlich et al., 1997).
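
The crude accessibility scan can be illustrated with a short Python sketch. It is not the program used in this work. It assumes that the protein grid points are held in a set of integer (i, j, k) indices and that the counter is decremented once per blocked direction; the names DIRECTIONS and accessibility are hypothetical.

```python
from itertools import product

# 6 axis directions plus 8 cube-diagonal directions give 14 scan directions.
AXES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
DIAGONALS = list(product((-1, 1), repeat=3))
DIRECTIONS = AXES + DIAGONALS


def accessibility(protein, point, max_steps=10):
    """Return 14 minus the number of scan directions blocked by protein.

    protein   -- set of (i, j, k) grid indices marked as 'protein' (assumed layout)
    point     -- (i, j, k) index of the surface grid point being scored
    max_steps -- number of grid steps examined along each direction
    """
    x, y, z = point
    value = len(DIRECTIONS)  # counter starts at its maximal value of 14
    for dx, dy, dz in DIRECTIONS:
        for step in range(1, max_steps + 1):
            if (x + step * dx, y + step * dy, z + step * dz) in protein:
                value -= 1  # this direction hits protein: decrement once
                break
    return value
```

In this sketch, a point one grid step above a flat layer of protein grid points is blocked in five directions and scores 9, so it would be reset to solvent by the cut-off x > 4. A point at the bottom of a deep well that is open only along one axis scores 1 and would be kept as a pocket point.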



Fig. 1. Schematic description of the cavity detection process. (a) The protein is embedded in a rectangular grid; grid points are marked as 'protein' (squares with bold edges) or 'solvent' (gray-shaded squares). (b) Accessibility values are calculated for 'solvent' points, and a threshold criterion is applied to define cavities. (c) Contiguous clusters of 'cavity' points are detected and excised (for details see text). (d) Connolly surfaces of cavity-forming protein atoms are calculated.

 
Calculation of cavity surfaces

Solvent-accessible surfaces (SAS) were calculated by the Connolly algorithm (Figure 1d) (Connolly, 1983). For all cavity surface points, a more accurate accessibility value, x, was calculated that employs 45 instead of 14 scan directions (cf. previous paragraph). This algorithm has been described elsewhere (Stahl et al., 1999). In the present work, scans were performed up to a distance of 9 Å from the surface points. We found empirically that the majority of the surface points with an accessibility value above 25 were situated outside the binding pocket. Therefore, these points were removed. Finally, small disconnected surface patches were automatically removed, including surface points at a distance of more than 4 Å from the nearest grid point of the pocket.

Assignment of interaction type

One of five possible interaction types ('aliphatic', 'hydrogen-bond donor', 'hydrogen-bond acceptor', 'aromatic-face' and 'aromatic-edge') was assigned to each of the surface points. A point was marked as 'aliphatic' if the closest atom center belonged to an sp3-hybridized CH-, CH2- or CH3-group, a sulfur atom engaged in a disulfide bridge or a carboxylate carbon atom. Surface points of thiol and hydroxy groups, primary amines and metal cations were classified as 'donor' points; surface points of carboxylate oxygen atoms were classified as 'acceptor' points.

The assignment of interaction types to points forming the surface of other functional groups depended on their relative positions. Atoms that were part of amides, guanidinium groups and aromatic rings were assigned two vectors: a unit vector perpendicular to the plane of the corresponding functional group (v1), and a unit vector pointing towards the center of the functional group (v2). This center was defined to be the central carbon atom of an amide or guanidinium moiety, or the geometric center of an aromatic ring, respectively. The surface normal vector s of a given surface point P was calculated by the Connolly algorithm (Connolly, 1983). If the angle between s and v1 was larger than 50° or the angle between s and v2 was smaller than 80°, P was marked as 'aromatic-face'. If these conditions did not apply, the interaction type of P was 'donor', 'acceptor' or 'aromatic-edge', depending on the atom type of the closest atom.

The surface description resulting from this algorithm has two advantages over a simple atom type code: (i) it includes orientation-dependent features of functional groups on the protein surface, and (ii) it is complementary to the surface properties of potential ligands binding to protein surface pockets. Self-organizing maps (vide infra) generated from the interaction atom type descriptors proved to be superior to those generated with orientation-independent atom types (results not shown).

Generation of topological correlation vectors

A set of SAS points defining a protein pocket, together with their associated accessibility values, x, and their interaction atom types, T, served as the starting point for the generation of topological cross-correlation vectors, CV. All pairs of SAS points with a distance 0 < d < 15 Å were considered. This range was divided into 10 equal distance bins $CV_d$ (width = 1.5 Å). Each distance bin was further subdivided into 15 bins, one for each pair of interaction atom types (five types give 15 unordered pairs), resulting in 150 vector elements $CV^T_d$ for CV. Each vector element is a sum $\sum x_A x_B$, where (A, B) are pairs of surface points falling into the distance bin d and having the interaction atom types specified by T.
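
As an illustration of this construction, the following Python sketch builds a 150-element vector from a list of surface points. It is a minimal sketch under stated assumptions, not the original implementation: each unordered pair of points is counted once, the vector is laid out bin by bin, and the names TYPES, PAIR_INDEX and correlation_vector are hypothetical.

```python
import math
from itertools import combinations, combinations_with_replacement

TYPES = ("aliphatic", "donor", "acceptor", "aromatic-face", "aromatic-edge")
# Five interaction types give 15 unordered type pairs.
PAIR_INDEX = {
    pair: i
    for i, pair in enumerate(combinations_with_replacement(range(len(TYPES)), 2))
}
N_BINS = 10       # distance bins covering 0 < d < 15 Angstrom
BIN_WIDTH = 1.5   # Angstrom


def correlation_vector(points):
    """Build the 150-element vector for one pocket.

    points -- list of (xyz, accessibility, type_name) tuples (assumed layout)
    """
    n_pairs = len(PAIR_INDEX)
    cv = [0.0] * (N_BINS * n_pairs)
    for (pa, xa, ta), (pb, xb, tb) in combinations(points, 2):
        d = math.dist(pa, pb)
        if not 0.0 < d < N_BINS * BIN_WIDTH:
            continue  # pair lies outside the considered distance range
        d_bin = min(int(d / BIN_WIDTH), N_BINS - 1)
        i, j = sorted((TYPES.index(ta), TYPES.index(tb)))
        # Each element accumulates the product of the two accessibility values.
        cv[d_bin * n_pairs + PAIR_INDEX[(i, j)]] += xa * xb
    return cv
```

Because the vector depends only on pairwise distances, types and accessibility values, it is invariant to rotation and translation of the pocket, which is what allows pockets from unrelated structures to be compared without superposition.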

Self-organizing maps (SOM)

Kohonen's self-organizing neural network provides a method for topology-preserving nonlinear projection of a high-dimensional space onto a low-dimensional display (Kohonen, 1982). A thorough description of the algorithm can be found elsewhere (Schneider and Wrede, 1998; Kohonen, 1989). The idea is to pave the high-dimensional data space (here spanned by the correlation vectors derived from protein cavities), in a manner similar to Voronoi or Dirichlet tessellation, to obtain 'receptive fields' of artificial neurons. As a result of this self-organization process, the receptive fields represent data clusters, and their relative arrangement in high-dimensional space can be visualized on a low-dimensional display. Here, a two-dimensional map of 20 × 20 neurons (clusters) with toroidal topology was used (Kohonen, 1989). Distances between cavity correlation vectors were calculated by the standard Euclidean distance measure. The SOM was optimized by the conventional Kohonen algorithm, using a Gaussian neighborhood function and an initial update radius of 13 neurons (Roche in-house software, NEUROMAP) (Schneider and Wrede, 1998).
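
The training scheme can be sketched in a few lines of Python. This is a generic Kohonen-type SOM and not the NEUROMAP software. Only the map size, the toroidal topology, the Gaussian neighborhood, the Euclidean distance and the initial radius of 13 are taken from the text; the random initialization, the linear decay schedules, the learning rate and the number of epochs are assumptions, and all function names are hypothetical.

```python
import math
import random


def torus_dist2(a, b, size):
    """Squared grid distance between neurons a and b with wrap-around edges."""
    dx = min(abs(a[0] - b[0]), size - abs(a[0] - b[0]))
    dy = min(abs(a[1] - b[1]), size - abs(a[1] - b[1]))
    return dx * dx + dy * dy


def best_matching_unit(weights, vec):
    """Grid position of the neuron whose weight vector is closest to vec."""
    return min(weights, key=lambda pos: math.dist(weights[pos], vec))


def train_som(data, size=20, radius0=13.0, rate0=0.5, epochs=20, seed=0):
    """Train a size x size toroidal map; returns {(i, j): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(i, j): [rng.random() for _ in range(dim)]
               for i in range(size) for j in range(size)}
    steps = epochs * len(data)
    for t in range(steps):
        frac = 1.0 - t / steps                 # decays linearly from 1 towards 0
        radius = max(radius0 * frac, 0.5)      # assumed lower bound on the radius
        rate = rate0 * frac                    # assumed learning-rate schedule
        vec = rng.choice(data)
        bmu = best_matching_unit(weights, vec)
        for pos, w in weights.items():
            # Gaussian neighborhood centred on the winning neuron.
            h = math.exp(-torus_dist2(pos, bmu, size) / (2.0 * radius * radius))
            for k in range(dim):
                w[k] += rate * h * (vec[k] - w[k])
    return weights
```

On a torus the neighborhood wraps around the map edges, so neurons in the first and last columns (or rows) are direct neighbors and no neuron sits at a border.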


    Results and discussion
The aims of this work were to assess the usefulness of SOMs and our protein cavity descriptors for classification and prediction of active sites. A set of 175 protein structures served as training data, containing enzymes of closely as well as only distantly related families, enzymes of completely different function and multiple X-ray structures of a number of enzymes. It therefore allowed for an analysis of clustering behavior at various levels of sequence and structure similarity. For each protein, the largest surface depressions were determined by the cavity search routine. In 64% of the training set, the active site was identical to the largest surface pocket; all other active sites were among the five largest pockets (rank 2, 23%; rank 3, 6%; rank 4, 5%; rank 5, 2%). Thus, property-encoded surfaces were calculated for the five largest surface pockets only. This resulted in 865 examples, for which topological correlation vectors were generated. In the following we will use the terms `active pocket' and `inactive pocket' as short forms for active site and non-active site pockets.

Figure 2 shows the distribution of active and inactive pockets on a two-dimensional SOM, as defined by the cavity descriptors. The separation of active (gray) and inactive pockets (white) is striking. Several groups of active pockets are surrounded by empty neurons (black) and can thereby be distinguished from the large, coherent white areas where inactive pockets are grouped together. This observation strongly supports the usefulness of our correlation vector representation of protein surface cavities and indicates that relevant biological features have been captured.



Fig. 2. Nonlinear projection of the distribution of protein surface cavities in a chemical space spanned by topological correlation vectors. A toroidal 20 × 20 self-organizing map was used. The 'receptive fields' of the neurons are indicated by squares, and the location of different cavity types is shown by gray-shading. Black, empty neuron; white, inactive pockets; dark gray, metalloproteinase active pockets; light gray, other Zn2+-containing active pockets; cross-hatched, multiple cavity types clustered; arrow, row of cavity structures shown in Figure 3.

 
Metalloproteinase active pockets (dark gray) form three separate groups, whereas the majority of the other active pockets (light gray) cluster in the center of the map (Figure 2). Visual inspection of the pocket surfaces suggests four large groups which can, with some simplification, be regarded as four areas on the map separated by diagonal lines. The upper left corner of the map, consisting mainly of inactive pockets, is dominated by small, shallow surface depressions. The large, diagonal band of active pockets contains medium-sized, valley-shaped surfaces. The size and complexity of these cavities increase towards the lower right corner of the map. The two groups of metalloproteinase active pockets in the lower right quadrant of the map are small, representing relatively deep and narrow active sites.

These observations are partly illustrated in Figure 3. It shows simplified lateral views of the pockets represented by row number 5 on the map shown in Figure 2 (arrow in Figure 2). One should keep in mind that the map actually forms a torus, i.e. neuron [20,5] is directly adjacent to neuron [1,5]. As a result of topology preservation, adjacent neurons contain similar protein cavity structures. Neurons [6,5] and [14,5] are empty, reflecting comparatively large structural differences between the neurons separated by these 'voids'. The distribution of cavity shapes along the neurons shown in Figure 3 indicates that the SOM was able to perform a reasonable mapping of the 'cavity space' defined by our correlation vector representation.



Fig. 3. Lateral views of protein cavity structures projected onto a row of adjacent neurons of the self-organizing feature map. Neuron positions on the map are given in brackets (cf. Figures 2 or 4). Note that neuron [20,5] is directly adjacent to neuron [1,5]. Neurons [6,5] and [14,5] do not contain structures ('empty clusters').

 
In Figure 4a, the distribution of the different types of active pockets is displayed, and the corresponding enzyme classes are marked by letters (cf. legend of Figure 4). It is immediately obvious that most classes of enzymes form individual clusters. There are two notable exceptions to this observation. Superoxide dismutases have extremely shallow active sites, which leads to small active site surface patches that cannot be well distinguished from the bulk of inactive pockets by means of the correlation vectors. Therefore, several members of this group are scattered among inactive pockets (e.g. neurons [7,15] and [8,15]). β-Lactamases are the second exception. These enzymes possess a flexible loop region above the active site that can adopt different conformations (in some cases no density is observed for this part of the structure) (Philippon et al., 1998). This is reflected in greatly varying accessibility values for the active site surfaces (e.g. neurons [11,4] and [12,9]).



Fig. 4. Distribution of metalloproteinase classes on a self-organizing map (cf. Fig. 2). (a) Training data projection; (b) test data projection. A, adenosine deaminase; B, β-lactamase; C, carbonic anhydrase; D, L-fucose-1-phosphate aldolase; E, alkaline phosphatase; F, purple acid phosphatase; G, other Zn2+-containing active pockets; H, Cu/Zn superoxide dismutase; I, astacin; L, adamalysins; M, matrixins; O, other metalloproteinases; P, procarboxypeptidase; S, serralysins; T, thermolysin and neutral protease; X, carboxypeptidases; dotted, inactive pocket; cross-hatched, multiple cavity types clustered.

 
Carbonic anhydrases form the largest of the contiguous clusters of active pockets (Figure 4a). Interestingly, this cluster is divided into two subgroups. An all-against-all sequence comparison with BLAST2 (Altschul and Gish, 1996) and subsequent clustering using the Jarvis–Patrick algorithm (Jarvis and Patrick, 1973) revealed that the members within each subgroup possess more than 97% pairwise sequence identity, while any pair of sequences from the two groups has less than 60% identity. The outlier in neuron [8,16], 1thj, has only 55% sequence identity to the large group and adopts a different fold from that of the members of the two large clusters. 1thj forms a single-stranded, left-handed β-helix (or β-solenoid), whereas the other carbonic anhydrases have an α/β roll architecture. For the class of carbonic anhydrases, differences in sequence are thus paralleled by differences in active site features.

Alkaline phosphatases, metzincins, thermolysins and carboxypeptidases also form large groups of active pockets (Figure 4a). Clustering of enzymes of the same type is not perfect in all cases, e.g. there are two outliers in neurons [8,1] and [10,1], which should be part of the thermolysin and the alkaline phosphatase group, respectively. Visual inspection reveals that these pocket surfaces cannot be distinguished from the remaining members of the respective group; therefore, their location on the map must be attributed to errors during the map's self-organization process. It is well known that conventional Kohonen-type SOMs are prone to certain topology distortions inherent to the training algorithm (Kohonen, 1982; Graepel and Obermayer, 1998). Furthermore, these observations might be due to premature convergence of the training process (Bienfait and Gasteiger, 1997a,b). Despite these disadvantages, SOMs are generally considered to be well suited for visualization of high-dimensional spaces (Kirew et al., 1998; Schneider and Wrede, 1998; Schneider et al., 1998). Recently, some modifications of the training algorithm and additional methods have been suggested that might help to overcome the problems mentioned (Graepel and Obermayer, 1998; Wang et al., 1998).

The arrangement of metalloproteinases on the map warrants a more detailed analysis. Endo- and exopeptidases are well separated (Figure 4a). The group of carboxypeptidase A active pockets is located at a large distance from the other metalloproteinase pockets, and is also set apart from a distant relative, the muramoyl pentapeptide carboxypeptidase (1lbu, neuron [15,11]), as well as from their inactive pro-enzymes (denoted by 'P' in Figure 4a). The large family of metzincin enzymes is divided into three groups (Stöcker et al., 1995; Borkakoti, 1998). This seems to be a consequence of their active site shape: the matrix metalloproteinase cluster in the lower left corner of the map is characterized by active pockets containing a stretched-out and rather shallow cavity, and an S1-pocket of moderate size (MMP-1, MMP-9). In contrast, the second cluster located in the lower right corner of the map contains pockets dominated by deep S1 subsites (MMP-8, adamalysins). Some of the pocket surfaces of both groups are depicted in Figure 3 (neurons [17,5], [18,5], [19,5]). The separate neuron [14,4] in Figure 4a contains pockets with tunnel-shaped S1-pockets (MMP-3 structures and one neutrophil collagenase, 1mnc, with Arg222 rotated away from the bottom of the S1 pocket). The small group of serralysins is slightly set apart from the other metzincin family proteins because their active pockets possess more complex, distorted surfaces (Bode et al., 1996).

Up to this point, the analysis of the SOMs has shown that the automatic projection of surface-derived correlation vectors leads to an intuitively reasonable arrangement of protein pockets. It is apparent, however, that the specific position of a pocket on the map depends on the various empirical cut-off values used in our cavity analysis. We were therefore interested to see whether the trained SOM (Figures 2 and 4a) permits correct predictions for enzymes not contained in the training set. Surface pockets and the corresponding correlation vectors were calculated for a set of 18 zinc enzyme structures from the PDB not contained in the training set. Their projection on the trained SOM is shown in Figure 4b. With one exception, a carbonic anhydrase (PDB codes 1kop and 1koq) located in neuron [11,14], all active pockets and all inactive pockets were correctly classified. Furthermore, the active pockets were placed within clusters of the correct type of enzyme. This means that our method is sufficiently robust for accurately predicting the enzyme type, provided that members of the enzyme class were covered by the training set.
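
Projecting new pockets onto a trained map amounts to a nearest-neuron lookup, which can be sketched as follows. This is a hedged illustration only: the majority-vote labeling of neurons and the rule of returning no prediction for empty neurons are assumptions made for the sketch, whereas the assignments discussed in the text were judged from the cluster positions in Figure 4. The names nearest_neuron, label_neurons and predict are hypothetical.

```python
import math
from collections import Counter, defaultdict


def nearest_neuron(weights, vec):
    """Grid position of the neuron with the smallest Euclidean distance to vec."""
    return min(weights, key=lambda pos: math.dist(weights[pos], vec))


def label_neurons(weights, training_vectors, training_labels):
    """Give each occupied neuron the majority label of its training pockets."""
    votes = defaultdict(Counter)
    for vec, label in zip(training_vectors, training_labels):
        votes[nearest_neuron(weights, vec)][label] += 1
    return {pos: counts.most_common(1)[0][0] for pos, counts in votes.items()}


def predict(weights, neuron_labels, vec):
    """Label of the neuron a test pocket falls into, or None if it is empty."""
    return neuron_labels.get(nearest_neuron(weights, vec))
```

Here weights is a mapping from grid positions to weight vectors of a trained map, and the labels could be enzyme classes or simply 'active' and 'inactive'.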

In the present work we have successfully applied a novel method for automatic recognition of cavities on the surface of protein structures. The applicability of nonlinear projection by conventional Kohonen-type neural networks for data visualization was substantiated. This method complements other automated procedures for locating binding pockets based on triangulation techniques (Liang et al., 1998). To further improve the accuracy of the SOM projections, modified or other nonlinear mapping algorithms might be useful (Bienfait and Gasteiger, 1997a,b; Schneider and Wrede, 1998; Schneider et al., 1998). Furthermore, we were able to demonstrate that correlation vectors encoding the distribution of generalized atom types and the shape of surface pockets are suited for (i) classification of active and inactive sites, and (ii) accurate prediction of the enzymatic class of test set proteins. This was verified taking the surface cavities of a set of zinc-containing metalloproteinases as an example. We are convinced that this and similar techniques bear significant potential for automated protein structure analysis and drug design (Verdonk et al., 1999).


    Acknowledgments
 
Hans-Joachim Böhm, Daniel Bur and Petra Schneider are thanked for helpful discussions and comments on the manuscript.


    Notes
 
1 To whom correspondence should be addressed; email: gisbert.schneider@roche.com


    References
Alberts,I.L., Nadassy,K. and Wodak,S.J. (1998) Protein Sci., 7, 1700–1716.

Altschul,S.F. and Gish,W. (1996) Methods Enzymol., 266, 460–480.

Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535–542.

Bienfait,B. and Gasteiger,J. (1997a) J. Mol. Graph. Model., 15, 203–215.

Bienfait,B. and Gasteiger,J. (1997b) J. Mol. Graph. Model., 15, 254–258.

Bode,W., Grams,F., Reinemer,P., Gomis-Rüth,F.X., Baumann,U., McKay,D.B. and Stöcker,W. (1996) Adv. Exp. Med. Biol., 389, 1–11.

Böhm,H.-J. (1998) J. Comput.-Aided Mol. Des., 12, 309–323.

Borkakoti,N. (1998) In Gubernator,K. and Böhm,H.-J. (eds), Structure-Based Ligand Design. Wiley-VCH, Weinheim, pp. 73–86.

Connolly,M.L. (1983) J. Appl. Crystallogr., 16, 548–558.

Graepel,T. and Obermayer,K. (1998) Neural Comput., 11, 139–155.

Hendlich,M., Rippmann,F. and Barnickel,G. (1997) J. Mol. Graph. Model., 15, 359–363.

Jarvis,R.A. and Patrick,E.A. (1973) IEEE Trans. Comput., C-22, 1025–1034.

Kirew,D.B., Chretien,J.R., Bernard,P. and Ros,F. (1998) SAR QSAR Environ. Res., 8, 93–107.

Kohonen,T. (1982) Biol. Cybern., 43, 59–69.

Kohonen,T. (1989) Self-Organization and Associative Memory. Springer Verlag, Heidelberg.

Liang,J., Edelsbrunner,H. and Woodward,C. (1998) Protein Sci., 7, 1884–1897.

Philippon,A., Dusart,J., Joris,B. and Frere,J.M. (1998) Cell. Mol. Life Sci., 54, 341–346.

Schneider,G., Sjöling,S., Wallin,E., Wrede,P., Glaser,E. and von Heijne,G. (1998) Proteins, 30, 49–60.

Schneider,G. and Wrede,P. (1998) Prog. Biophys. Mol. Biol., 70, 175–222.

Stahl,M. and Böhm,H.-J. (1998) J. Mol. Graph. Model., 16, 121–132.

Stahl,M., Bur,D. and Schneider,G. (1999) J. Comput. Chem., 20, 336–347.

Stöcker,W., Grams,F., Baumann,U., Reinemer,P., Gomis-Rüth,F.X., McKay,D.B. and Bode,W. (1995) Protein Sci., 4, 823–840.

Verdonk,M.L., Cole,J.C. and Taylor,R. (1999) J. Mol. Biol., 289, 1093–1108.

Wang,H.C., Dopazo,J., de la Fraga,L.G., Zhu,Y.P. and Carazo,J.M. (1998) Protein Sci., 7, 2613–2622.

Young,M.M., Skillman,A.G. and Kuntz,I.D. (1999) Proteins, 34, 317–332.

Received August 19, 1999; revised October 25, 1999; accepted November 16, 1999.




