Common Structural Cliques: a tool for protein structure and function analysis

Mariusz Milik, Sándor Szalma and Krzysztof A. Olszewski¹

Accelrys, 9685 Scranton Road, San Diego, CA 92121, USA

¹To whom correspondence should be addressed. E-mail: kato{at}accelrys.com


    Abstract
 
Proposed is a method for locating functionally relevant atoms in protein structures and a representation of spatial arrangements of these atoms allowing for a flexible description of active sites in proteins. The search method is based on comparison of local structure features of proteins that share a common biochemical function. The method does not depend on overall similarity of structures and sequences of compared proteins or on previous knowledge about functionally relevant residues. The compared protein structures are condensed to a graph representation, with atoms as nodes and distances as edge labels. Protein graphs are then compared to extract all possible Common Structural Cliques. These cliques are merged to create Structural Templates: graphs that describe structural analogies between compared proteins. Structures of serine endopeptidases were compared in pairs using the presented algorithm with different geometrical parameters. Additionally, a Structural Template was extracted from the structures of aminotransferases, two different proteins that catalyze the same type of chemical reaction. The results presented show that the method works efficiently even in the case of large protein systems and allows for extraction of common structural features from proteins catalyzing a particular chemical reaction, but that evolved from different ancestors by convergent evolution.

Keywords: convergence/function assignment/graph representation/protein structure/structural templates/structure–function relationship


    Introduction
 
Proteins belonging to the same functional family often share common local structural features even if there is no evolutionary relationship or sequence similarity between them. In many cases, and particularly in the case of enzymes, protein function is determined by the chemical character and spatial arrangement of a few selected atoms. These atoms may form the actual active site of the protein, bind a particular substrate or coordinate a prosthetic group or a metal ion that catalyzes the specific chemical reaction.

Looking at protein structures from the functional perspective, one may distinguish two classes of protein residues: (i) functionally relevant residues that participate in the specific protein function, and (ii) scaffold residues that form the structural environment for the functional residues, keeping them in proper spatial configuration. Obviously, such a division is in many cases arbitrary and artificial. It may be used, however, as a starting point for a more advanced analysis of structure–function relationships in proteins. This focuses interest on a small subset of residues that are extracted from the whole protein structure and that are expected to contain most of the function-related information for the given protein. Additionally, knowledge of the most functionally relevant residues can enhance functional annotation of low homology sequence alignments (Reddy et al., 2001) because it is expected that the functionally relevant residues would be better conserved in the evolutionary process than the structural ones (Lesk and Chothia, 1980; Chothia and Lesk, 1986). For example, it is possible to drastically mutate structural scaffold residues of proteins without significant disruption of their functionality (Axe et al., 1996; Gassner et al., 1996; Chothia and Gerstein, 1997). Moreover, atoms from the functional residues are expected to form similar arrangements both in homologous and in evolutionarily unrelated proteins performing the same biological function [for example, the aminotransferase family discussed by Petsko and Ringe (Petsko and Ringe, 2002)]. Therefore, functional annotation procedures based on functionally relevant residues can be applied to non-homologous proteins expected to perform similar biological functions (Russell, 1998; Russell et al., 1998).

Comparison of protein structures and searches for Structural Templates (STs) of functionally relevant residues currently concern many research groups. A sub-graph isomorphism algorithm was proposed by Artymiuk et al. for searching protein structures for user-defined 3D patterns (Artymiuk et al., 1994). Patterns were defined based on the prior knowledge of functional residues in the given protein family. The authors used a reduced representation of protein structures with pseudo-atoms representing side chains. Pattern geometry was represented by a graph with the pseudo-atoms as nodes and interatomic distances as edges. In another approach, Fisher et al. used geometric hashing for a Cα-only representation of protein structure (Fisher et al., 1994, 1995). Because protein structures were represented there as unconnected spheres centered at Cα coordinates, the extracted patterns were sequence-independent. The patterns resulted from automated clustering of pairs of compatible spheres. The results described in the paper were encouraging, but a Cα-only representation is not always optimal for function-defining patterns. Wallace et al. proposed a similar method applying the geometric hashing paradigm to protein structures in expanded side chain representation (Wallace et al., 1996, 1997). The patterns, containing a set of atoms defining the active site and a list of allowed amino acids, were stored in the PROCAT database. In another approach, Russell proposed a 3D pattern extraction method based on comparison of conserved residues from protein structures (Russell, 1998), using multiple sequence alignment to define patterns. Each residue was represented by three atoms and a weighted root mean square deviation (r.m.s.d.) between residues was used as a similarity measure.
Another method, proposed by Fetrow and Skolnick and Di Gennaro et al., used a Cα-only representation of protein structures, distance conservation for each pair of residues and additional sequence-dependent constraints, to define Fuzzy Functional Forms (FFFs) (Fetrow and Skolnick, 1998; Di Gennaro et al., 2001). The FFFs definition of structural pattern is rooted strongly in expert knowledge and literature analysis. Turcotte et al. proposed a machine learning method for protein fold recognition (Turcotte et al., 2001). Using the derived rules, they were able to assign automatically protein structures to the proper SCOP families. Unfortunately, common protein folds do not always imply a common biological function. Irving et al. designed a method for protein active site identification by structural alignments (Irving et al., 2001). The method was used to suggest the locations of plausible active sites in homologous proteins and it was based on a search of maximal common sub-clusters of Cα atoms. Another method combined sequence threading with chemostructural restrictions to detect functionally important and evolutionarily conserved fragments of protein structures (Reva et al., 2002). The restrictions applied during the threading procedure are extracted from experimental data and from literature analysis. The method was applied to refine a homology model of dipeptidyl peptidase IV.

Most of these structure comparison methods, based on rigid structure comparison (measured by r.m.s. distance between atoms), use very restricted (e.g. Cα-only) representations or require external knowledge about residues belonging to the putative functionally relevant site. The rigid structure comparison methods work best for reasonably similar protein domains that share a common fold. Without additional information about a possible active site definition, it may be difficult to use these methods to compare active sites of two proteins with low sequence similarity or of convergently evolved proteins. Additionally, in the case of evolutionarily close proteins and in the case of Cα-only representations, it is difficult to distinguish between residues conserved because of their functional importance and structural scaffold residues.

In this paper, we propose a method, called Common Structural Cliques (CSC), for automatically locating functionally relevant atoms in protein structures. The method provides a representation of spatial arrangements of the functional atoms and allows for the flexible description of active/binding sites in proteins. This new representation allows for a formal description of flexible structural patterns that contain multiple rigid sub-patterns connected by flexible hinges. The search method is based on the comparison of protein structures that share a common biological function. The method does not depend on overall similarity of structures and sequences of compared proteins or on previous knowledge about functionally relevant residues in the considered protein family. This work is a modification and expansion of the previously published FAUST method for the extraction of functionally relevant templates from protein structures (Milik et al., 2002).

In this modification, new algorithms are applied to the extraction of local similarities between protein structures. By restricting our attention to four-atom cliques, we were able to search exhaustively for possible matches in both structures, in contrast to the previously used heuristic algorithm. Additionally, we added an algorithm that automatically searches for the most promising ST by combining structural cliques containing atom pairs having the maximum number of overlaps (the algorithm is described in detail below). The manual procedure for this process, previously used in FAUST, is still available in CSC as an option.


    Materials and methods
 
The goal of this work is to show the extraction of STs for enzyme families and to analyze the performance of our algorithms with different values of geometric parameters. For simplicity, we used the Enzyme Classification (EC) number to define family membership for candidate protein structures (Bairoch, 2000). This classification allowed us to select protein structures objectively for the ST extraction procedure and to define an objective criterion for evaluation of the extracted templates. We used the EC number assigned to structures by the authors of the source PDB file. This classification information was used here only at the initial data set preparation stage, i.e. for the definition of protein families. No other information extracted from the PDB file (e.g. active site definitions) was used in the ST extraction procedure. Protein structures used in this work were taken from the representative set of non-redundant protein structures (Hobohm and Sander, 1994). The sequence similarity threshold value was 25% to minimize trivial structural similarities resulting from sequence homology; the database version was from April 2002. In the case of multi-chain structures, all chains were included in CSC searches to include potential functional multimers and enzyme–inhibitor complexes.

Analysis of local similarities between protein structures using the CSC procedure occurs in four general steps. First, the compared protein structures must be condensed to the atom graph representation, with atoms as nodes and atom–atom distances as edge labels. Second, all local atom cliques are extracted from these atom graphs. Third, all possible CSCs are extracted by exhaustive comparison of local atom cliques. Fourth, the extracted CSCs are merged to create sets of larger and continuous graphs: STs that describe structural analogies between compared proteins (see Figure 1 for illustration).



 
Fig. 1. Schematic view of data flow in the CSC algorithm. (a) Local atom cliques are subsets of atoms from protein structures with the property that every atom in the clique is closer than 8 Å to every other atom in the clique. (b) A Common Structural Clique (CSC) establishes correspondence between two local atom cliques selected from two protein structures in such a way that respective interatomic distances between atoms belonging to the CSC are approximately equal in both protein structures and atoms belonging to the same pair have the same chemical identity. (c) Structural Templates (STs) are created by overlapping CSCs and are represented as graphs.

 
In order to reduce the size of atom graphs representing protein structures, we selected representative atoms from protein side chains (Milik et al., 2002). Backbone atoms were excluded, because of the abundance of local structural cliques created by backbone atoms involved in secondary structure elements. These cliques in most cases are non-functional and significantly increase input noise in our method. The complete list of the selected atom types, with PDB codes, is presented in Table I. Obviously, other criteria can be used to choose the representative set of atoms. For instance, surface-exposed atoms may be selected to compare the structures of signaling proteins; atoms from a binding site neighborhood may be chosen to analyze the geometry of protein–ligand interactions, etc.


 
Table I. Complete list of the selected atom types, with their PDB codes
 
The initial atom graph representation of a protein structure is defined here by a list of representative atoms, in conjunction with a matrix containing information about atom–atom proximity. This matrix (called the adjacency matrix in graph theory) contains the value 1 in position (i,j) when two atoms i and j are closer to each other than a pre-defined threshold distance and the value 0 otherwise. Technically, an adjacency matrix defined in this way is an atomic resolution version of the well-known ‘contact map’ representation of protein structures. However, in our case, multiple points may be used for the definition of the local geometry of one residue. Distance threshold values of 7, 8 and 9 Å were tested. These threshold values were chosen based on analysis of the most prevalent local distances between atoms in typical enzymatic active sites. Tests of these threshold values showed that the result of the CSC template extraction procedure is not very sensitive to this parameter (see Results and Discussion sections for details).
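The adjacency matrix described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `adjacency_matrix` and the example coordinates are hypothetical, and only the 0/1 rule with a distance threshold is taken from the text.

```python
import math

def adjacency_matrix(coords, threshold=8.0):
    """Return a 0/1 matrix: entry (i, j) is 1 when atoms i and j are
    closer than `threshold` (in Å) and 0 otherwise. The diagonal stays 0."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                adj[i][j] = adj[j][i] = 1
    return adj

# Hypothetical coordinates (Å) of four representative side-chain atoms.
atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (20.0, 0.0, 0.0)]
adj = adjacency_matrix(atoms, threshold=8.0)
# The first three atoms are mutually adjacent; the fourth is isolated:
# [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
```

The double loop is quadratic in the number of atoms; for large structures a spatial grid or k-d tree would be the usual replacement, but that is a design choice outside what the text specifies.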

The second stage of the procedure starts from extraction of all possible local atom cliques from the compared protein structural graphs. The local atom clique is defined as a set of atoms from the structural graph with the property that every atom from this set is adjacent to every other atom from the same set. The adjacency is defined in the adjacency matrix created in the previous stage. In our application, structural graphs are exhaustively searched to identify all four-node cliques.
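As an illustration of the exhaustive four-node clique search, the following Python sketch enumerates every set of four mutually adjacent atoms from a 0/1 adjacency matrix. The enumeration strategy (extending a clique only with higher-index common neighbours) and the name `four_cliques` are assumptions made for this example; the text states only that the search is exhaustive.

```python
def four_cliques(adj):
    """List all 4-node cliques (a < b < c < d) of a 0/1 adjacency matrix."""
    n = len(adj)
    # For each node keep only neighbours with a higher index, so that
    # every clique is generated exactly once in sorted order.
    higher = [{j for j in range(i + 1, n) if adj[i][j]} for i in range(n)]
    cliques = []
    for a in range(n):
        for b in sorted(higher[a]):
            common_ab = higher[a] & higher[b]
            for c in sorted(common_ab):
                for d in sorted(common_ab & higher[c]):
                    cliques.append((a, b, c, d))
    return cliques

# Hypothetical graph: atoms 0-3 are mutually adjacent; atom 4 is adjacent
# to atoms 0, 1 and 2 but not to atom 3.
adj = [
    [0, 1, 1, 1, 1],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
]
cliques = four_cliques(adj)  # [(0, 1, 2, 3), (0, 1, 2, 4)]
```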

In the third stage, all atom cliques from selected protein structures (defining a functional family) were pair-wise compared such that every local atom clique extracted from one of the protein structures was compared with each atom clique from the other protein graphs to find similarities and so define CSCs.

A CSC establishes correspondence between two local atom cliques selected from two protein structures. Interatomic distances between corresponding atoms belonging to the CSC are approximately equal in both protein structures and corresponding atoms belonging to the same pair have the same chemical identity (see Figure 1 for illustration). Inter-atom distances were considered approximately equal when they differ by <1 Å. This parameter was also tested with a value of 1.5 Å on a set of protein structures from the serine protease family (see Results for details). The tests showed no evident improvement in the generated 3D pattern quality, but a drastic increase in the number of random (not related to active site) atom cliques.
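The matching criterion for a CSC can be sketched in Python as follows. This is an illustrative sketch only: atoms are represented as hypothetical `(atom_name, (x, y, z))` tuples, the atom name stands in for "chemical identity", and trying all permutations of the second clique is an assumption that is practical only because cliques have four atoms.

```python
import math
from itertools import combinations, permutations

def match_cliques(clique_a, clique_b, tolerance=1.0):
    """Return a list of (index_in_a, index_in_b) pairs if the two cliques
    form a CSC, or None. Atoms are (atom_name, (x, y, z)) tuples."""
    n = len(clique_a)
    if len(clique_b) != n:
        return None
    for perm in permutations(range(n)):
        # Paired atoms must have the same chemical identity.
        if any(clique_a[i][0] != clique_b[perm[i]][0] for i in range(n)):
            continue
        # Every corresponding distance must agree within the tolerance.
        if all(
            abs(math.dist(clique_a[i][1], clique_a[j][1])
                - math.dist(clique_b[perm[i]][1], clique_b[perm[j]][1])) < tolerance
            for i, j in combinations(range(n), 2)
        ):
            return list(zip(range(n), perm))
    return None

# Hypothetical cliques: clique_b is a shifted, slightly distorted copy of
# clique_a with its atoms listed in a different order.
clique_a = [("OG", (0, 0, 0)), ("ND1", (3, 0, 0)), ("NE2", (0, 4, 0)), ("OD1", (0, 0, 5))]
clique_b = [("OD1", (10, 0, -5.2)), ("OG", (10, 0, 0)), ("NE2", (10, 4.3, 0)), ("ND1", (13, 0, 0))]
pairs = match_cliques(clique_a, clique_b)  # [(0, 1), (1, 3), (2, 2), (3, 0)]
```

Because only distances are compared, the test accepts clique_b even though its OD1 atom lies on the opposite side of the plane of the other three atoms.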

Depending on the size of compared protein structures and their structural similarity, from one to >1000 CSCs may be extracted for a single protein pair. Because many extracted cliques share common atom pairs, they may be consolidated into larger constructs, called here Structural Templates (STs) (see Figure 1). Often, STs created by overlapping CSCs contain conflicting cliques, where one atom from the first protein is paired in different cliques with several different atoms from the second protein. As a result, the proposed ST must be pruned to remove inconsistencies and the selected cliques must be merged in order to refine information about functionally relevant fragments of proteins. The CSC merging is performed by a greedy algorithm, which attempts to generate the largest possible ST that contains the atom pairs most frequently used in the analyzed CSCs.

The data format for the description of STs contains a list of atoms, with their coordinates, that is extracted from one of the compared protein structures and a square matrix, analogous to the adjacency matrix from graph theory. However, in the presented application, this matrix is used to describe the expected importance of the distance between two atoms for the specific protein function. This importance is evaluated by pair comparisons of protein structures from the given functional family and in the present work it is defined as ‘1’, meaning that the given distance is relevant, and ‘0’, meaning that it is irrelevant for the function. Application of this distance importance matrix allows for a formal description of flexible structural patterns, containing two or more rigid sub-patterns connected by flexible hinges. Such patterns are impossible to describe in the rigid r.m.s.d.-based definition of structural patterns. To compare templates we used distance r.m.s.d., i.e. the root mean square deviation of relevant distance differences. Hence, irrelevant distance differences effectively contribute with zero weight to the distance r.m.s.d. While the distance r.m.s.d. for any ST will by definition be less than a threshold value (1.0 Å), the corresponding rigid r.m.s.d. may be much higher.
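A minimal Python sketch of a distance r.m.s.d. restricted by a 0/1 importance matrix is given below. The function name, the normalization by the number of relevant distances and the toy "hinge" coordinates are assumptions made for illustration; the text does not give an explicit formula.

```python
import math
from itertools import combinations

def distance_rmsd(coords_a, coords_b, importance):
    """Root mean square difference of corresponding interatomic distances,
    counting only atom pairs (i, j) with importance[i][j] == 1."""
    squared = []
    for i, j in combinations(range(len(coords_a)), 2):
        if importance[i][j]:
            diff = (math.dist(coords_a[i], coords_a[j])
                    - math.dist(coords_b[i], coords_b[j]))
            squared.append(diff * diff)
    return math.sqrt(sum(squared) / len(squared)) if squared else 0.0

# Hypothetical four-atom pattern: straight in structure A, bent at atom 2
# (the "hinge") in structure B.
straight = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
bent = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0)]

all_relevant = [[int(i != j) for j in range(4)] for i in range(4)]
# Distances from atom 3 to atoms 0 and 1 span the hinge and are ignored.
hinged = [row[:] for row in all_relevant]
for i, j in [(0, 3), (1, 3)]:
    hinged[i][j] = hinged[j][i] = 0

rigid_like = distance_rmsd(straight, bent, all_relevant)  # about 0.39 Å
flexible = distance_rmsd(straight, bent, hinged)          # 0.0 Å
```

With the hinge-spanning distances marked as irrelevant, the bent and straight patterns are indistinguishable, which is the behaviour the importance matrix is meant to provide.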

The merging algorithm starts by selecting from the list the CSC containing the most overlaps (common atom pairs) with other CSCs. If multiple CSCs have the maximum number of overlaps, the algorithm compares their number of conflicts and the CSC with the minimum number of conflicts is chosen. The selected clique is then used as a seed for a first draft of the ST, which is a superposition of all cliques that have overlaps with the previously chosen one. Usually, this ST contains conflicts, which means that an atom from one protein is paired with multiple atoms from another. Therefore, the proposed ST must be pruned to establish a one-to-one relation for atoms from compared proteins. In this process, conflicting cliques are iteratively removed from the ST definition. Every step of the pruning algorithm selects the CSC with the maximum number of conflicts with other CSCs in the set. If many cliques have the same number of conflicts, the one with the minimum number of overlaps is chosen. The chosen CSC is then removed from the ST definition; the ST is rebuilt and again tested for conflicts. This step is repeated until the resulting ST does not contain conflicting atom pairs.
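The seed selection and pruning steps can be sketched in Python as shown below. This is a simplified reading of the description, not the authors' implementation: each CSC is modelled as a set of hypothetical `(atom_in_A, atom_in_B)` label pairs, overlaps and conflicts are counted pair by pair, and the seed is not protected from pruning (the text does not say whether it should be).

```python
def overlaps(c1, c2):
    """Number of atom pairs shared by two CSCs."""
    return len(c1 & c2)

def conflicts(c1, c2):
    """Number of clashes: an atom of one protein paired with different
    atoms of the other protein in the two CSCs."""
    return sum(1 for a1, b1 in c1 for a2, b2 in c2 if (a1 == a2) != (b1 == b2))

def merge_cscs(cscs):
    """Greedy merge: pick a seed, collect overlapping CSCs, prune conflicts."""
    cscs = [frozenset(c) for c in cscs]

    def total(score, c, pool):
        return sum(score(c, other) for other in pool if other is not c)

    # Seed: most overlaps; ties are broken by the fewest conflicts.
    seed = max(cscs, key=lambda c: (total(overlaps, c, cscs),
                                    -total(conflicts, c, cscs)))
    draft = [c for c in cscs if c is seed or overlaps(c, seed) > 0]
    while True:
        # Candidate for removal: most conflicts, then fewest overlaps.
        scores = [(total(conflicts, c, draft), -total(overlaps, c, draft), i)
                  for i, c in enumerate(draft)]
        worst = max(scores)
        if worst[0] == 0:
            break
        del draft[worst[2]]
    return set().union(*draft)

# Hypothetical CSCs; the last one pairs atom A2 with B9 instead of B2.
cscs = [
    {("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")},
    {("A2", "B2"), ("A3", "B3"), ("A4", "B4"), ("A5", "B5")},
    {("A1", "B1"), ("A2", "B2"), ("A5", "B5"), ("A6", "B6")},
    {("A1", "B1"), ("A2", "B9"), ("A3", "B3"), ("A7", "B7")},
]
template = merge_cscs(cscs)
```

In this example the fourth CSC conflicts with each of the other three and is pruned, leaving a six-pair template with a one-to-one atom correspondence.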

In the first set of experiments we analyzed protein structures from the serine endopeptidase family (EC numbers 3.4.21.–). This family contains hydrolases acting on peptide bonds. Table II contains the list of protein structures from the non-redundant database labeled with this EC number that also have a resolution of 2.0 Å or better. Two important enzymes from this family are trypsin and subtilisin. The catalytic activity of trypsin is provided by a charge relay system involving an aspartic acid residue hydrogen bonded to a histidine, which itself is hydrogen bonded to a serine. The sequence fragments near the active-site serine and histidine residues are well conserved in this family (Brenner, 1988). Catalytic activity of subtilase (Siezen et al., 1991) is provided by a charge relay system similar to that in trypsin; however, it most probably evolved by independent convergent evolution (Brenner, 1988). The sequences around the residues involved in the catalytic triad (aspartic acid, serine and histidine) are completely different from those around the analogous residues in the trypsin serine proteases and are used as signatures specific to that class of proteases.


 
Table II. List of studied protein structures from the serine endopeptidase family (EC 3.4.21.–)
 
Two structures of aminotransferases, L-aspartate aminotransferase (1AJR, EC 2.6.1.1) and D-amino acid aminotransferase (1DAA, EC 2.6.1.21), were used in another application of the CSC algorithm. Both enzymes catalyze the reaction where an α-amino acid is converted to an α-keto acid followed by conversion of a different α-keto acid to a new α-amino acid. Both proteins use the same cofactor, pyridoxal phosphate (PLP), in this process. The first aminotransferase converts L-aspartate to L-glutamate and the second one catalyzes the analogous reaction for D-forms of various amino acids in bacteria. The basic catalytic mechanism remains the same for both of them. However, the amino acid sequences of these enzymes, and also their structures, are very different and they were used as an example of convergent evolution of enzymes (Petsko and Ringe, 2002).


    Results
 
Calibration

The CSC algorithm, as presented above, depends on two main geometric parameters: the distance threshold (used in the procedure of extracting local atom cliques from protein structures) and the distance tolerance parameter (used for the definition of distance similarity in the CSC development procedure). Several combinations of these parameters were tested to locate values optimal for analysis of protein enzyme structures. The tests were performed on a set of low sequence homology (<25%), high quality (2.0 Å or better resolution) structures of serine endopeptidases (EC 3.4.21.–). In this test, the seven selected structures were compared in pairs using the CSC algorithm with different values of parameters. Tested were combinations of geometric parameters for the local atom clique extraction procedure. Three values of distance threshold (7.0, 8.0 and 9.0 Å) and two values of distance tolerance (1.0 and 1.5 Å) were used.

The number of local atom cliques extracted from a single structure grows approximately geometrically as the distance threshold increases (see Figure 2). Analogously, numbers of extracted CSCs and sizes of final STs for all analyzed pairs of structures grow significantly with increasing values of the distance threshold parameter (Table III) and the distance tolerance parameter (Table IV). Nevertheless, the size of the extracted STs does not depend strongly on the analyzed parameters and the final STs for a given pair of structures overlap, in most cases. Therefore, a distance threshold of 8.0 Å and a distance tolerance of 1.0 Å were used in the remaining part of this work.



 
Fig. 2. Number of local atom cliques extracted from every structure in the serine endopeptidase set (EC 3.4.21.–) for the three values of the distance threshold.

 

 
Table III. CSC extraction results for the serine endopeptidase set (EC 3.4.21.–) with a distance tolerance parameter of 1.0 Å and distance threshold values of 7, 8 and 9 Å
 

 
Table IV. CSC extraction results for the serine endopeptidase set (EC 3.4.21.–) with a distance tolerance parameter of 1.5 Å and a distance threshold of 8.0 Å
 
Serine endopeptidases

Tables V and VI present lists of atoms included in ST definitions for the selected, representative structures of proteins that have the serine endopeptidase function. Rows of these tables correspond to atoms found in any ST from the selected structure. Values of ‘1’ or ‘0’ indicate whether the given atom is a member of the corresponding ST or not. For example, the first row of Table V contains information that the atom SG from residue 42 Cys, from chain A in the 1AVW structure, was included in the STs created by comparison of this structure with 1C5L, 1FLE, 1FN8 and 1SGP. This atom was not included in STs created by comparison of this structure with the subtilisin structures 1GCI and 1SCJ. Table rows with information about atoms that belong to the well-defined active site of serine endopeptidases (catalytic triad: His, Asp, Ser) are in boldface. It is encouraging that in both examples presented almost all STs contained all the active atoms from the catalytic triad residues. The only exception is the 1GCI/1SCJ pair, in which the oxygen from Ser125 replaced the oxygen from Ser221, which is used in all remaining STs. Analogous results were obtained for the remaining structure pairs from this family. The conclusion is that by using any of the pairs of structures as a source, one would be able to locate precisely the active site for the serine endopeptidase family in a fully automated procedure, from atom coordinates alone.


 
Table V. Atom conservation in diverse STs: list of ST atoms extracted from pairs of trypsin (1AVW) with the remaining serine endopeptidase structures
 

 
Table VI. Atom conservation in diverse STs: list of ST atoms extracted from pairs of subtilisin (1GCI) with the remaining serine endopeptidase structures
 
Examples of two STs extracted for the serine endopeptidase family are presented in Tables VII and VIII. All tables containing ST information are formatted as follows: the left side of the table contains pairs of atoms from both compared structures (e.g. 1AVW and 1GCI in the case of Table VII); and the right side of the table contains the adjacency matrix for the ST for this particular pair of structures. For example, the first row of Table VII contains information that atom ND1 from residue His57 from chain A of structure 1AVW was included in the definition of the ST for this structure, created by CSC comparison with the structure 1GCI. This atom was paired in a CSC with atom ND1 from residue His64 from chain A of structure 1GCI; additionally, all distances between this atom and all other atoms from the list are important for the ST definition (the appropriate row in the adjacency matrix contains only 1s).


 
Table VII. ST extracted from structures of trypsin from Sus scrofa (1AVW) and subtilisin from Bacillus lentus (1GCI), from the serine endopeptidase family
 

 
Table VIII. ST extracted from structures of trypsin from Sus scrofa (1AVW) and trypsin from Fusarium oxysporum (1FN8), from the serine endopeptidase family
 
The templates from Tables VII and VIII were created by comparison of the same protein (porcine trypsin, 1AVW) with two different proteins. One of these proteins (bacterial subtilisin, 1GCI) has a different fold to trypsin and most probably evolved independently, as was discussed above. The second protein (fungal trypsin, 1FN8) evolved from the same ancestor as the porcine trypsin and has a similar fold. It may be seen that the second ST is larger, which should be expected for homology reasons. It is significant that all the atoms from structure 1AVW that were included in the definition of the first ST are also included in the second. The relevant atom names are in boldface in Table VII for easier comparison. By comparing the non-homologous trypsin and subtilisin structures, we were able to identify an atom cluster that is known to be indispensable for performing the serine endopeptidase function.

The spatial arrangement of the extracted ST is presented in Figure 3. In this figure, the atoms from the catalytic triad form the core of the ST. The core atoms are flanked on one side by two cysteines and a serine and on the other side by another serine. There are no conserved distances between these two groups of bordering atoms.



 
Fig. 3. Spatial arrangement of the atoms from 1AVW structure included in templates from Tables VII and VIII. Atoms 195 Ser OG, 57 His (NE2 and ND1) and 102 Asp (OD1 and OD2) from the catalytic triad form the core of the ST and are present in both templates. Remaining atoms were only included in the ST from Table VIII. Lines represent atom–atom distances conserved in the ST.

 
Aminotransferases

L-Aspartate aminotransferase (1AJR) and D-amino acid aminotransferase (1DAA) are two completely different protein structures sharing a common biochemical function. They catalyze the same type of chemical reaction for substrates that have different chiralities. These proteins form functional dimers and CSC was not able to discover any STs when applied to single chains of these proteins. However, when we included all chains in the analysis process, the ST presented in Table IX was extracted. The format of this table is analogous to that of Tables VII and VIII. The extracted ST contained atoms from two tyrosine residues, a glutamic acid residue and an arginine residue. The connectivity matrix for the template is comparatively dense, which means that the distances between included atoms are similar in both structures. However, the adjacency matrix contains some zeros, meaning that distances between some atoms are excluded from the ST definition. For example, the arginine atoms (numbers 7 and 8) are coordinated only with atoms from one of the tyrosines from the complex (numbers 3 and 4).


 
Table IX. ST extracted from the structures of the two aminotransferases, L-aspartate aminotransferase (1AJR) and D-amino acid aminotransferase (1DAA) (see Figures 4 and 5 and Discussion for more details)
 
Figure 4 shows the spatial arrangement of the atoms from 1AJR (Figure 4a) and from 1DAA (Figure 4b) that were included in the ST definition for this pair of proteins. The core part of the ST contains two interacting tyrosine residues. In one structure (1AJR) these tyrosine residues belong to two separate amino acid chains; whereas in the second structure (1DAA) both these residues belong to the same chain. The glutamic acid side chain is correlated with core atoms on one side and the arginine on the other side. According to the ST definition, there is no distance correlation between glutamic acid and arginine side chains.



 
Fig. 4. Configuration of ST extracted from structures of L-aspartate aminotransferase (1AJR, EC 2.6.1.1) and D-amino acid aminotransferase (1DAA, EC 2.6.1.21). (a) ST atoms from 1AJR. (b) ST atoms from 1DAA. The core part of the ST is formed by two interacting tyrosine residues. In 1AJR these residues belong to two different amino acid chains; in 1DAA they belong to the same chain.

 
Figure 5 shows an attempt to overlap ST atoms from both analyzed structures of aminotransferases. Although the template extraction procedure succeeded, it is impossible to overlay all atoms in these templates using a rigid body transformation. This results from the internal flexibility of our ST definition. Some distances between ST defining atoms were excluded from the ST definition and the overall chirality of the pattern was disregarded. The rigid body r.m.s.d. is 2.7 Å on eight template atoms, whereas the distance-based r.m.s.d. is only 0.35 Å. In Figure 5, only the atoms from tyrosines and glutamic acid were overlapped, leaving the arginine atoms unaligned. It may be seen that the ST atoms from 1AJR form a conformation that is almost a mirror image of atoms from 1DAA. This demonstrates how our algorithm’s distance-based matching definition of an ST is different from the most commonly used active site comparisons that are based on rigid overlap (r.m.s.d. minimization). Notably, the basic difference in chemical functions of both compared proteins is chirality-based, that is, one of them converts L-amino acids and the second converts D-amino acids. Therefore, the template provides an example of how the chirality of the substrate possibly influences the chirality of the enzyme active site.
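The reason a distance-based template can match near mirror-image arrangements can be shown with a small Python sketch: a reflection leaves every interatomic distance unchanged, while the handedness of the arrangement (measured here by the signed volume of a tetrahedron) changes sign. The coordinates and function names are hypothetical and are not taken from the 1AJR or 1DAA structures.

```python
import math
from itertools import combinations

def pairwise_distances(coords):
    """All interatomic distances, in a fixed pair order."""
    return [math.dist(p, q) for p, q in combinations(coords, 2)]

def signed_volume(p0, p1, p2, p3):
    """Signed volume of a tetrahedron; the sign encodes its handedness."""
    u = [p1[k] - p0[k] for k in range(3)]
    v = [p2[k] - p0[k] for k in range(3)]
    w = [p3[k] - p0[k] for k in range(3)]
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return det / 6.0

# Hypothetical four-atom arrangement and its mirror image (z -> -z).
site = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
mirror = [(x, y, -z) for x, y, z in site]

same_distances = pairwise_distances(site) == pairwise_distances(mirror)  # True
handedness = (signed_volume(*site), signed_volume(*mirror))  # (10.0, -10.0)
```

A rigid-body superposition cannot map one arrangement onto the other, whereas a comparison based only on distances treats them as identical.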



Fig. 5. Overlap of ST atoms from the analyzed structures of aminotransferases (1AJR and 1DAA). Side chains from 1AJR are green; side chains from 1DAA are blue. The PLP cofactor molecules are colored analogously. Only the atoms from tyrosines and glutamic acids were overlapped, leaving the arginine atoms free for placement. The selected side chains from 1AJR form a conformation close to a mirror image of the side chains from the 1DAA molecule. This figure provides an example of the possible influence of chirality of the substrate on the chirality of the enzyme active site.
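The contrast between the rigid body and distance-based r.m.s.d. values can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the optional `pairs` argument (standing in for the fact that some distances are excluded from the ST definition) and the four-atom coordinates are assumptions made purely for illustration. The sketch compares a small chiral arrangement of points with its mirror image, using a signed volume to show that the two have opposite handedness.

```python
import itertools
import math


def distance_rmsd(coords_a, coords_b, pairs=None):
    """R.m.s. difference between corresponding interatomic distances.

    coords_a, coords_b: equally long lists of (x, y, z) tuples, where atom i
    in the first list is matched to atom i in the second.
    pairs: optional iterable of (i, j) index pairs to compare; by default all
    pairs are used (an assumption of this sketch).
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both atom lists must have the same length")
    if pairs is None:
        pairs = itertools.combinations(range(len(coords_a)), 2)
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one atom pair is required")
    total = 0.0
    for i, j in pairs:
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        total += (d_a - d_b) ** 2
    return math.sqrt(total / len(pairs))


def signed_volume(p0, p1, p2, p3):
    """Six times the signed volume of a tetrahedron; the sign flips on reflection."""
    u = [p1[k] - p0[k] for k in range(3)]
    v = [p2[k] - p0[k] for k in range(3)]
    w = [p3[k] - p0[k] for k in range(3)]
    cross = (v[1] * w[2] - v[2] * w[1],
             v[2] * w[0] - v[0] * w[2],
             v[0] * w[1] - v[1] * w[0])
    return u[0] * cross[0] + u[1] * cross[1] + u[2] * cross[2]


# Hypothetical chiral arrangement of four atoms (coordinates in Å) ...
atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (1.0, 1.0, 5.0)]
# ... and its mirror image, obtained by reflecting through the xy plane.
mirror = [(x, y, -z) for x, y, z in atoms]

print(distance_rmsd(atoms, mirror))                   # all distances preserved
print(signed_volume(*atoms), signed_volume(*mirror))  # opposite handedness
```

In this toy case the distance-based r.m.s.d. is zero although the signed volumes have opposite signs, so no rigid body rotation can superimpose the two sets exactly. A measure built only on interatomic distances is blind to chirality, which is consistent with the near mirror-image templates described above.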

 
CSC algorithm versatility tests

Table X reports the performance of CSC-based ST extraction on a diverse set of protein pairs selected by EC class. We chose pairs of proteins that correspond to the same EC number (catalyze the same chemical reaction) but belong to different fold classes according to the SCOP classification of proteins (Murzin et al., 1995). The protein size varied from 200 to >500 residues and the overall structural r.m.s.d. varied from 2.7 to 7.3 Å. In our experience, ST extraction rarely takes more than 15 s per protein pair. The results are summarized in Table X; the full analysis of the underlying templates will be reported in subsequent papers. To analyze the other end of the similarity spectrum, we extracted a template from a cathepsin and a papain structure. These two structures share the same fold (1.5 Å r.m.s.d.) and have 45% sequence similarity, but catalyze slightly different reactions (as characterized by EC 3.4.22.1 and 3.4.22.2, respectively). In this case the CSC algorithm took <20 s to extract a template that contained atoms from residues Cys25 and His159, which are known to be associated with the catalytic function.


Table X. CSC template extraction performance for enzyme pairs sharing the same Enzyme Commission (EC) number and having different folds as assigned in the SCOP database (Murzin et al., 1995)
 

    Discussion
 
The results presented show that the proposed method for the discovery of local similarities between protein structures works efficiently even in the case of large protein systems. For example, our algorithm was able to extract all local atom cliques from two dimeric aminotransferase structures containing 824 (1AJR) and 564 (1DAA) residues, respectively, and then locate all possible matches between these atom cliques (CSCs). The calculations for the aminotransferase structures, including the extraction of the final ST from the CSC list, took <30 s on a 1.8 GHz Pentium IV PC. Additionally, the process is fully automatic, allowing the analysis of local similarities for large sets of protein structures.

The proposed method for joining overlapping CSCs and creating the final ST works well in the majority of the cases tested; the resulting sets of atoms are predominantly located in the neighborhood of the expected active site for the analyzed structures (see Tables V–VIII). The clique-joining algorithm is based on a greedy search method. It detected atoms from the catalytic triad in 20 out of 21 pairs of serine endopeptidase structures. In the failed case, the search was disrupted by a large clique of atoms from the inhibitor chain of the structure. However, the catalytic triad was localized for this pair of structures in the second run, after removing the inhibitor chain from the extraction process. In the case of aminotransferases, the algorithm extracted sets of atoms from the neighborhood of the cofactor molecule (see Table IX and Figure 5), which supports the likely functional importance of these atoms. Obviously, it is difficult to prove that the cliques containing the most frequent atom pairs are always more functionally relevant; however, we see it as a plausible assumption in the case when the compared proteins have different structures and similar functions.
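A greedy joining of overlapping cliques might look like the following Python sketch. It is only an illustration under stated assumptions, not the procedure used in this work: each CSC is represented as a set of matched atom pairs (atom in protein A, atom in protein B), cliques are ranked by how frequent their atom pairs are across the whole CSC list, and a clique is merged only if it overlaps the growing template and keeps the atom mapping one-to-one. The function names and atom labels are hypothetical, and the distance checks that a real implementation would need are omitted.

```python
from collections import Counter


def compatible(template, csc):
    """True if adding csc keeps the A-to-B atom mapping one-to-one."""
    a_to_b = dict(template)
    b_to_a = {b: a for a, b in template}
    return all(a_to_b.get(a, b) == b and b_to_a.get(b, a) == a
               for a, b in csc)


def greedy_merge(cscs):
    """Greedily join overlapping cliques into a single template.

    cscs: list of sets of (atom_a, atom_b) matched pairs; each set is assumed
    to be internally consistent (one-to-one).
    """
    if not cscs:
        return set()
    # Frequency of every matched atom pair over all cliques.
    freq = Counter(pair for csc in cscs for pair in csc)
    # Seed with the clique whose pairs are the most frequent overall.
    ordered = sorted(cscs, key=lambda c: sum(freq[p] for p in c), reverse=True)
    template = set(ordered[0])
    remaining = ordered[1:]
    changed = True
    while changed:
        changed = False
        for csc in list(remaining):
            # Merge only cliques that share a pair with the template
            # and do not contradict the mapping built so far.
            if template & csc and compatible(template, csc):
                template |= csc
                remaining.remove(csc)
                changed = True
    return template


# Hypothetical cliques; labels such as "a1" and "b1" stand for atoms.
cscs = [
    {("a1", "b1"), ("a2", "b2"), ("a3", "b3")},
    {("a2", "b2"), ("a3", "b3"), ("a4", "b4")},
    {("a3", "b3"), ("a4", "b9")},   # conflicts with a4 -> b4
    {("a7", "b7"), ("a8", "b8")},   # no overlap with the template
]
template = greedy_merge(cscs)
print(sorted(template))
```

In this toy input the first two cliques are joined, the third is rejected because it would map one atom to two partners, and the fourth is left out because it shares nothing with the template. A large, unrelated clique that happens to score highest would seed the template instead, which is one way a greedy search can be misled, as in the inhibitor-chain case described above.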

The results of the parameter calibration tests (Tables III and IV) show that the selected set of side chain atoms (see Table I) and the proposed values of the distance threshold (8.0 Å) and distance tolerance (1.0 Å) work well for the tested structures. As shown in Figure 2, the number of initial local atom cliques that must be analyzed during the CSC creation procedure grows substantially as the distance threshold value increases. An increase in the number of local atom cliques implies an increase in the number of CSCs for every analyzed pair of structures (see Table III, below the diagonal). Nevertheless, the final STs remain basically the same, with only a slight increase in size (see Table III, above the diagonal). Analogously, an increase in the value of the distance tolerance (a parameter used in the atom clique comparison procedure) increases the number of initial CSCs (see Table IV, below the diagonal) without drastic changes in the final STs (see Table IV, above the diagonal).
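The roles of the two parameters can be sketched in a few lines of Python. This is an illustration only: the clique size of three atoms, the use of "less than or equal" comparisons, the function names and the toy coordinates are all assumptions, and the actual clique definition is the one given in Materials and methods.

```python
import itertools
import math

DISTANCE_THRESHOLD = 8.0  # Å, value quoted in the text
DISTANCE_TOLERANCE = 1.0  # Å, value quoted in the text


def local_cliques(coords, threshold=DISTANCE_THRESHOLD, size=3):
    """Index tuples of `size` atoms whose pairwise distances are all <= threshold."""
    cliques = []
    for combo in itertools.combinations(range(len(coords)), size):
        if all(math.dist(coords[i], coords[j]) <= threshold
               for i, j in itertools.combinations(combo, 2)):
            cliques.append(combo)
    return cliques


def cliques_match(dists_a, dists_b, tolerance=DISTANCE_TOLERANCE):
    """True if corresponding distances of two cliques differ by <= tolerance."""
    return len(dists_a) == len(dists_b) and all(
        abs(a - b) <= tolerance for a, b in zip(dists_a, dists_b))


# Hypothetical atoms spaced 3 Å apart along a line.
coords = [(3.0 * k, 0.0, 0.0) for k in range(5)]
for threshold in (4.0, 8.0, 13.0):
    print(threshold, len(local_cliques(coords, threshold)))

# A looser tolerance accepts clique pairs that a stricter one rejects.
print(cliques_match([3.0, 6.0, 3.0], [3.5, 6.8, 2.4]))
print(cliques_match([3.0, 6.0, 3.0], [3.5, 6.8, 2.4], tolerance=0.5))
```

For these five points the number of three-atom cliques rises from 0 to 3 to 10 as the threshold grows from 4 to 8 to 13 Å, and the same pair of cliques matches at a tolerance of 1.0 Å but not at 0.5 Å. Both trends mirror, in miniature, the growth in clique and CSC counts reported in Figure 2 and Tables III and IV.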

Analysis of the STs, even without looking at the structures or literature analysis, may provide some insight into the compared proteins. Taking the ST extracted from the aminotransferase structures (Table IX) as an example, we can form a hypothesis: if the selected ST contains functionally relevant residues and the structure is not an enzyme–inhibitor complex, then these proteins form functional dimers. The hypothesis is true for both analyzed enzymes. Additionally, in the case of 1AJR, the ST is created by tyrosine and arginine from chain A and tyrosine and glutamic acid from chain B; in the case of 1DAA, both tyrosine residues are from chain A and glutamic acid and arginine are from chain B. This implies that this particular functional site is not sequence dependent and most probably evolved as an effect of a convergence process.
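The chain-based reasoning above can be written down as a small Python check. The residue and chain assignments are those stated in the text (residue numbers are not given here and are therefore omitted); the data layout and function names are hypothetical.

```python
# ST residues as (residue name, chain) pairs, taken from the description above.
ST_RESIDUES = {
    "1AJR": [("TYR", "A"), ("ARG", "A"), ("TYR", "B"), ("GLU", "B")],
    "1DAA": [("TYR", "A"), ("TYR", "A"), ("GLU", "B"), ("ARG", "B")],
}


def chains_used(residues):
    """Sorted list of chain identifiers that contribute residues to the ST."""
    return sorted({chain for _, chain in residues})


def spans_multiple_chains(residues):
    """True if the template draws residues from more than one chain."""
    return len(chains_used(residues)) > 1


def same_chain_partition(residues_a, residues_b):
    """True if both templates group residue types into chains identically."""
    def partition(residues):
        groups = {}
        for name, chain in residues:
            groups.setdefault(chain, []).append(name)
        return sorted(sorted(names) for names in groups.values())
    return partition(residues_a) == partition(residues_b)


for pdb_id, residues in ST_RESIDUES.items():
    print(pdb_id, chains_used(residues), spans_multiple_chains(residues))
print(same_chain_partition(ST_RESIDUES["1AJR"], ST_RESIDUES["1DAA"]))
```

Both templates span two chains, which is the observation behind the dimer hypothesis, while the way the residue types are distributed over the chains differs between the two structures.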

The examples presented illustrate the applicability of the algorithm to extract common structural features from proteins that catalyze a particular chemical reaction, but evolved from different ancestors owing to convergent evolution. Obviously, sequence-homology-based methods for pattern extraction are not applicable in such cases. Additionally, the presented approach makes the active site description malleable: it can express similarities between active sites that are very difficult to describe otherwise, since the sites cannot be superimposed by a rigid body transformation.

The template extraction algorithm works best for comparison of structures of unrelated proteins that share some common feature or function. We hypothesize that when the configuration of the template atoms is conserved in either convergent or divergent protein structures sharing the same function, the selected set is likely to contain functionally relevant atoms. Application of the CSC algorithm to similar proteins sharing a similar function is also feasible, as demonstrated by the cathepsin–papain example.


    Acknowledgement
 
The authors thank Kathleen Moore for reading and commenting on the manuscript.


    References
 
Artymiuk,P.J., Poirrette,A.R., Grindley,H.M., Rice,D.W. and Willett,P. (1994) J. Mol. Biol., 243, 327–344.

Axe,D.D., Foster,N.W. and Fersht,A.R. (1996) Proc. Natl Acad. Sci. USA, 93, 5590–5594.

Bairoch,A. (2000) Nucleic Acids Res., 28, 304–305.

Brenner,S. (1988) Nature, 334, 528–530.

Chothia,C. and Gerstein,M. (1997) Nature, 385, 579–581.

Chothia,C. and Lesk,A.M. (1986) EMBO J., 5, 823–826.

Di Gennaro,J.A., Siew,N., Hoffman,B.T., Zhang,L., Skolnick,J., Neilson,L.I. and Fetrow,J.S. (2001) J. Struct. Biol., 134, 232–245.

Fetrow,J.S. and Skolnick,J. (1998) J. Mol. Biol., 281, 949–968.

Fisher,D., Wolfson,H., Lin,S.L. and Nussinov,R. (1994) Protein Sci., 3, 769–778.

Fisher,D., Tsai,C.J., Nussinov,R. and Wolfson,H. (1995) Protein Eng., 8, 981–997.

Gassner,N.C., Baase,W.A. and Matthews,B.W. (1996) Proc. Natl Acad. Sci. USA, 93, 12155–12158.

Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522–524.

Irving,J.A., Whisstock,J.C. and Lesk,A.M. (2001) Proteins, 42, 378–382.

Lesk,A.M. and Chothia,C. (1980) J. Mol. Biol., 136, 225–270.

Milik,M., Szalma,S. and Olszewski,K.A. (2002) In Guigo,R. and Gusfield,D. (eds), Algorithms in Bioinformatics. Springer, Berlin, p. 182.

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.

Petsko,G. and Ringe,D. (2002) Protein Structure and Function: from Sequence to Consequence. New Science Press, London. E-book: http://www.new-science-press.com/browse/protein/.

Reddy,B.V., Li,W.W., Shindyalov,I.N. and Bourne,P.E. (2001) Proteins, 42, 148–163.

Reva,B., Finkelstein,A. and Topiol,S. (2002) Proteins, 47, 180–193.

Russell,R.B. (1998) J. Mol. Biol., 279, 1211–1227.

Russell,R.B., Sasieni,P.D. and Sternberg,M.J. (1998) J. Mol. Biol., 282, 903–918.

Shindyalov,I.N. and Bourne,P.E. (1998) Protein Eng., 11, 739–747.

Siezen,R.J., de Vos,W.M., Leunissen,J.A.M. and Dijkstra,B.W. (1991) Protein Eng., 4, 719–737.

Turcotte,M., Muggleton,S.H. and Sternberg,J.E. (2001) J. Mol. Biol., 306, 591–605.

Wallace,A.C., Laskowski,R.A. and Thornton,J.M. (1996) Protein Sci., 5, 1001–1013.

Wallace,A.C., Borkakoti,N. and Thornton,J.M. (1997) Protein Sci., 6, 2308–2323.

Received March 9, 2003; revised June 6, 2003; accepted June 24, 2003.




