Accelrys, 9685 Scranton Road, San Diego, CA 92121, USA. 1To whom correspondence should be addressed. E-mail: kato{at}accelrys.com
Abstract |
---|
Keywords: convergence/function assignment/graph representation/protein structure/structural templates/structure-function relationship
Introduction |
---|
Looking at protein structures from the functional perspective, one may distinguish two classes of protein residues: (i) functionally relevant residues that participate in the specific protein function, and (ii) scaffold residues that form the structural environment for the functional residues, keeping them in the proper spatial configuration. Obviously, such a division is in many cases arbitrary and artificial. It may be used, however, as a starting point for a more advanced analysis of structure-function relationships in proteins. It focuses interest on a small subset of residues that are extracted from the whole protein structure and that are expected to contain most of the function-related information for the given protein. Additionally, knowledge of the most functionally relevant residues can enhance functional annotation of low homology sequence alignments (Reddy et al., 2001) because the functionally relevant residues are expected to be better conserved in the evolutionary process than the structural ones (Lesk and Chothia, 1980; Chothia and Lesk, 1986). For example, it is possible to drastically mutate structural scaffold residues of proteins without significant disruption of their functionality (Axe et al., 1996; Gassner et al., 1996; Chothia and Gerstein, 1997). Moreover, atoms from the functional residues are expected to form similar arrangements both in homologous and in evolutionarily unrelated proteins performing the same biological function [for example, the aminotransferase family discussed by Petsko and Ringe (Petsko and Ringe, 2002)]. Therefore, functional annotation procedures based on functionally relevant residues can be applied to non-homologous proteins expected to perform similar biological functions (Russell, 1998; Russell et al., 1998).
Comparison of protein structures and searches for Structural Templates (STs) of functionally relevant residues currently concern many research groups. A sub-graph isomorphism algorithm was proposed by Artymiuk et al. for searching protein structures for user-defined 3D patterns (Artymiuk et al., 1994). Patterns were defined based on prior knowledge of functional residues in the given protein family. The authors used a reduced representation of protein structures with pseudo-atoms representing side chains. Pattern geometry was represented by a graph with the pseudo-atoms as nodes and interatomic distances as edges. In another approach, Fisher et al. used geometric hashing for a Cα-only representation of protein structure (Fisher et al., 1994, 1995). Because protein structures were represented there as unconnected spheres centered at Cα coordinates, the extracted patterns were sequence-independent. The patterns resulted from automated clustering of pairs of compatible spheres. The results described in the paper were encouraging, but a Cα-only representation is not always optimal for function-defining patterns. Wallace et al. proposed a similar method applying the geometric hashing paradigm to protein structures in an expanded side chain representation (Wallace et al., 1996, 1997). The patterns, containing a set of atoms defining the active site and a list of allowed amino acids, were stored in the PROCAT database. In another approach, Russell proposed a 3D pattern extraction method based on comparison of conserved residues from protein structures (Russell, 1998), using multiple sequence alignment to define patterns. Each residue was represented by three atoms and a weighted root mean square deviation (r.m.s.d.) between residues was used as a similarity measure. Another method, proposed by Fetrow and Skolnick and Di Gennaro et al., used a Cα-only representation of protein structures, distance conservation for each pair of residues and additional sequence-dependent constraints to define Fuzzy Functional Forms (FFFs) (Fetrow and Skolnick, 1998; Di Gennaro et al., 2001). The FFF definition of a structural pattern is rooted strongly in expert knowledge and literature analysis. Turcotte et al. proposed a machine learning method for protein fold recognition (Turcotte et al., 2001). Using the derived rules, they were able to automatically assign protein structures to the proper SCOP families. Unfortunately, common protein folds do not always imply a common biological function. Irving et al. designed a method for protein active site identification by structural alignments (Irving et al., 2001). The method was used to suggest the locations of plausible active sites in homologous proteins and it was based on a search for maximal common sub-clusters of Cα atoms. Another method combined sequence threading with chemostructural restrictions to detect functionally important and evolutionarily conserved fragments of protein structures (Reva et al., 2002). The restrictions applied during the threading procedure are extracted from experimental data and from literature analysis. The method was applied to refine a homology model of dipeptidyl peptidase IV.
Most of these structure comparison methods, based on rigid structure comparison (measured by r.m.s. distance between atoms), use very restricted (e.g. Cα-only) representations or require external knowledge about the residues belonging to the putative functionally relevant site. The rigid structure comparison methods work best for reasonably similar protein domains that share a common fold. Without additional information about the possible active site definition, it may be difficult to use these methods to compare active sites of two proteins with low sequence similarity or of converged proteins. Additionally, in the case of evolutionarily close proteins and in the case of Cα-only representations, it is difficult to distinguish between residues conserved because of their functional importance and structural scaffold residues.
In this paper, we propose a method, called Common Structural Cliques (CSC), for automatically locating functionally relevant atoms in protein structures. The method provides a representation of spatial arrangements of the functional atoms and allows for the flexible description of active/binding sites in proteins. This new representation allows for a formal description of flexible structural patterns that contain multiple rigid sub-patterns connected by flexible hinges. The search method is based on the comparison of protein structures that share a common biological function. The method does not depend on overall similarity of structures and sequences of compared proteins or on previous knowledge about functionally relevant residues in the considered protein family. This work is a modification and expansion of the previously published FAUST method for the extraction of functionally relevant templates from protein structures (Milik et al., 2002).
In this modification, new algorithms are applied to the extraction of local similarities between protein structures. By restricting our attention to four-atom cliques, we were able to search exhaustively for possible matches in both structures, in contrast to the previously used heuristic algorithm. Additionally, we added an algorithm that automatically searches for the most promising ST by combining structural cliques containing the atom pairs that have the maximum number of overlaps (the algorithm is described in detail below). The manual procedure for this process, previously used in FAUST, is still available in CSC as an option.
Materials and methods |
---|
Analysis of local similarities between protein structures using the CSC procedure occurs in four general steps. First, the compared protein structures must be condensed to the atom graph representation, with atoms as nodes and atom-atom distances as edge labels. Second, all local atom cliques are extracted from these atom graphs. Third, all possible CSCs are extracted by exhaustive comparison of local atom cliques. Fourth, the extracted CSCs are merged to create sets of larger and continuous graphs: STs that describe structural analogies between compared proteins (see Figure 1 for illustration).
The second stage of the procedure starts from extraction of all possible local atom cliques from the compared protein structural graphs. The local atom clique is defined as a set of atoms from the structural graph with the property that every atom from this set is adjacent to every other atom from the same set. The adjacency is defined in the adjacency matrix created in the previous stage. In our application, structural graphs are exhaustively searched to identify all four-node cliques.
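The exhaustive four-node clique search described above can be sketched in a few lines of Python. This is a minimal illustration rather than the CSC implementation: the function names `adjacency` and `four_cliques` are hypothetical, and the sketch assumes that two atoms are adjacent when their distance does not exceed the distance threshold (8.0 Å is the value adopted in Results).

```python
from math import dist

def adjacency(coords, threshold=8.0):
    """Boolean adjacency matrix of the atom graph.

    Assumption: atoms i and j are adjacent when their distance
    (in angstroms) is at most `threshold`.
    """
    n = len(coords)
    return [[i != j and dist(coords[i], coords[j]) <= threshold
             for j in range(n)] for i in range(n)]

def four_cliques(adj):
    """Exhaustively list every four-node clique as an ordered index tuple."""
    n = len(adj)
    cliques = []
    for a in range(n):
        for b in range(a + 1, n):
            if not adj[a][b]:
                continue  # prune early: a and b are not adjacent
            for c in range(b + 1, n):
                if not (adj[a][c] and adj[b][c]):
                    continue
                for d in range(c + 1, n):
                    if adj[a][d] and adj[b][d] and adj[c][d]:
                        cliques.append((a, b, c, d))
    return cliques
```

The nested loops visit each index combination at most once, so the search is exhaustive, and the early `continue` statements skip any partial set that is already not fully connected.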
In the third stage, all atom cliques from selected protein structures (defining a functional family) were pair-wise compared such that every local atom clique extracted from one of the protein structures was compared with each atom clique from the other protein graphs to find similarities and so define CSCs.
A CSC establishes correspondence between two local atom cliques selected from two protein structures. Interatomic distances between corresponding atoms belonging to the CSC are approximately equal in both protein structures and corresponding atoms belonging to the same pair have the same chemical identity (see Figure 1 for illustration). Inter-atom distances were considered approximately equal when they differ by <1 Å. This parameter was also tested with a value of 1.5 Å on a set of protein structures from the serine protease family (see Results for details). The tests showed no evident improvement in the generated 3D pattern quality, but a drastic increase in the number of random (not related to active site) atom cliques.
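The comparison of two local atom cliques can be illustrated as follows. This is a sketch under stated assumptions, not the authors' code: each clique is given as four `(label, (x, y, z))` tuples, the `label` string stands in for the chemical identity of an atom, and `match_cliques` is a hypothetical name. The function tries every assignment of atoms between the two cliques and accepts the first one in which paired atoms share a label and all six interatomic distances differ by less than the tolerance.

```python
from itertools import combinations, permutations
from math import dist

def match_cliques(clique_a, clique_b, tolerance=1.0):
    """Return a tuple `perm` such that atom i of clique_a corresponds to
    atom perm[i] of clique_b, or None if the cliques do not match.

    Assumption: each clique is a list of four (label, (x, y, z)) tuples.
    """
    for perm in permutations(range(4)):
        # Corresponding atoms must have the same chemical identity.
        if any(clique_a[i][0] != clique_b[perm[i]][0] for i in range(4)):
            continue
        # All six corresponding distances must agree within the tolerance.
        if all(abs(dist(clique_a[i][1], clique_a[j][1])
                   - dist(clique_b[perm[i]][1], clique_b[perm[j]][1])) < tolerance
               for i, j in combinations(range(4), 2)):
            return perm
    return None
```

Because only distances are compared, the match is independent of the position and orientation of either structure.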
Depending on the size of compared protein structures and their structural similarity, from one to >1000 CSCs may be extracted for a single protein pair. Because many extracted cliques share common atom pairs, they may be consolidated into larger constructs, called here Structural Templates (STs) (see Figure 1). Often, STs created by overlapping CSCs contain conflicting cliques, where one atom from the first protein is paired in different cliques with several different atoms from the second protein. As a result, the proposed ST must be pruned to remove inconsistencies and the selected cliques must be merged in order to refine information about functionally relevant fragments of proteins. The CSC merging is performed by a greedy algorithm, which attempts to generate the largest possible ST that contains the atom pairs most frequently used in the analyzed CSCs.
The data format for the description of STs contains a list of atoms, with their coordinates, that is extracted from one of the compared protein structures and a square matrix, analogous to the adjacency matrix from graph theory. However, in the presented application, this matrix is used to describe the expected importance of the distance between two atoms for the specific protein function. This importance is evaluated by pair comparisons of protein structures from the given functional family and in the present work it is defined as 1, meaning that the given distance is relevant, or 0, meaning that it is irrelevant for the function. Application of this distance importance matrix allows for a formal description of flexible structural patterns, containing two or more rigid sub-patterns connected by flexible hinges. Such patterns are impossible to describe in the rigid r.m.s.d.-based definition of structural patterns. To compare templates we used the distance r.m.s.d., i.e. the root mean square deviation of relevant distance differences. Hence, irrelevant distance differences contribute effectively with zero weight to the distance r.m.s.d. While the distance r.m.s.d. for any ST will by definition be less than a threshold value (1.0 Å), the corresponding rigid r.m.s.d. may be much higher.
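A small example may clarify how the distance importance matrix enters the distance r.m.s.d. The sketch below is illustrative only: `distance_rmsd` is a hypothetical name, the two coordinate lists are assumed to hold corresponding atoms in the same order, and normalizing by the number of relevant atom pairs is an assumption, since the text does not state the normalization.

```python
from itertools import combinations
from math import dist, sqrt

def distance_rmsd(coords_a, coords_b, importance):
    """Root mean square of distance differences over relevant atom pairs.

    importance[i][j] is 1 if the i-j distance is relevant, 0 otherwise.
    Assumption: the mean is taken over relevant pairs only.
    """
    squared = []
    for i, j in combinations(range(len(coords_a)), 2):
        if importance[i][j]:
            diff = dist(coords_a[i], coords_a[j]) - dist(coords_b[i], coords_b[j])
            squared.append(diff * diff)
    return sqrt(sum(squared) / len(squared)) if squared else 0.0
```

Setting an entry of `importance` to 0 removes that distance from the sum, which is how a hinge between two rigid sub-patterns can move without affecting the score.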
The merging algorithm starts by selecting from the list the CSC containing the most overlaps (common atom pairs) with other CSCs. If multiple CSCs have the maximum number of overlaps, the algorithm compares their numbers of conflicts and the CSC with the minimum number of conflicts is chosen. The selected clique is then used as a seed for a first draft of the ST, which is a superposition of all cliques that have overlaps with the previously chosen one. Usually, this ST contains conflicts, which means that an atom from one protein is paired with multiple atoms from another. Therefore, the proposed ST must be pruned to establish a one-to-one relation for atoms from compared proteins. In this process, conflicting cliques are iteratively removed from the ST definition. Every step of the pruning algorithm selects the CSC with the maximum number of conflicts with other CSCs in the set. If many cliques have the same number of conflicts, the one with the minimum number of overlaps is chosen. The chosen CSC is then removed from the ST definition; the ST is rebuilt and again tested for conflicts. This step is repeated until the resulting ST does not contain conflicting atom pairs.
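The seed selection and pruning steps can be sketched as below. This is one possible reading of the procedure, not the published code. Each CSC is represented as a set of `(atom_in_protein_1, atom_in_protein_2)` pairs with hypothetical integer atom identifiers; the number of overlaps of a CSC is counted as the number of other CSCs sharing at least one atom pair with it, which is an assumption, and ties that remain after both criteria are broken arbitrarily.

```python
def overlaps(c1, c2):
    """True if the two CSCs share at least one atom pair."""
    return bool(c1 & c2)

def conflicts(c1, c2):
    """True if an atom of one protein is paired with different partners."""
    return any((a1 == a2) != (b1 == b2)
               for a1, b1 in c1 for a2, b2 in c2)

def count(relation, csc, pool):
    """Number of other CSCs in `pool` standing in `relation` to `csc`."""
    return sum(1 for other in pool if other is not csc and relation(csc, other))

def build_template(cscs):
    """Greedy merge of a non-empty list of CSCs into one conflict-free ST."""
    cscs = [frozenset(c) for c in cscs]
    # Seed: most overlaps; ties go to the CSC with the fewest conflicts.
    seed = max(cscs, key=lambda c: (count(overlaps, c, cscs),
                                    -count(conflicts, c, cscs)))
    draft = [c for c in cscs if c is seed or overlaps(c, seed)]
    while True:
        # Worst CSC: most conflicts; ties go to the fewest overlaps.
        worst = max((count(conflicts, c, draft), -count(overlaps, c, draft), i)
                    for i, c in enumerate(draft))
        if worst[0] == 0:
            break  # one-to-one atom correspondence reached
        del draft[worst[2]]
    return set().union(*draft)
```

Recomputing the conflict counts after each removal mirrors the description above, in which the ST is rebuilt and tested again after every pruning step.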
In the first set of experiments we analyzed protein structures from the serine endopeptidase family (EC numbers 3.4.21.). This family contains hydrolases acting on peptide bonds. Table II contains the list of protein structures from the non-redundant database labeled with this EC number that also have a resolution of 2.0 Å or better. Two important enzymes from this family are trypsin and subtilisin. The catalytic activity of trypsin is provided by a charge relay system involving an aspartic acid residue hydrogen bonded to a histidine, which itself is hydrogen bonded to a serine. The sequence fragments near the active-site serine and histidine residues are well conserved in this family (Brenner, 1988). The catalytic activity of subtilase (Siezen et al., 1991) is provided by a charge relay system similar to that in trypsin; however, it most probably arose independently through convergent evolution (Brenner, 1988). The sequences around the residues involved in the catalytic triad (aspartic acid, serine and histidine) are completely different from those around the analogous residues in the trypsin serine proteases and are used as signatures specific to that class of proteases.
Results |
---|
The CSC algorithm, as presented above, depends on two main geometric parameters: the distance threshold (used in the procedure of extracting local atom cliques from protein structures) and the distance tolerance parameter (used for the definition of distance similarity in the CSC development procedure). Several combinations of these parameters were tested to locate values optimal for analysis of protein enzyme structures. The tests were performed on a set of low sequence homology (<25%), high quality (2.0 Å or better resolution) structures of serine endopeptidases (EC 3.4.21.). In this test, the seven selected structures were compared in pairs using the CSC algorithm with different values of parameters. Tested were combinations of geometric parameters for the local atom clique extraction procedure. Three values of distance threshold (7.0, 8.0 and 9.0 Å) and two values of distance tolerance (1.0 and 1.5 Å) were used.
The number of local atom cliques extracted from a single structure grows approximately geometrically as the distance threshold increases (see Figure 2). Analogously, the numbers of extracted CSCs for all analyzed pairs of structures grow significantly with increasing values of the distance threshold parameter (Table III) and the distance tolerance parameter (Table IV). Nevertheless, the size of the final STs does not depend strongly on the analyzed parameters and the final STs for a given pair of structures overlap in most cases. Therefore, a distance threshold of 8.0 Å and a distance tolerance of 1.0 Å were used in the remaining part of this work.
Tables V and VI present lists of atoms included in ST definitions for the selected, representative structures of proteins that have the serine endopeptidase function. Rows of these tables correspond to atoms found in any ST from the selected structure. Values of 1 or 0 indicate whether the given atom is a member of the corresponding ST or not. For example, the first row of Table V contains information that the atom SG from residue 42 Cys, from chain A in the 1AVW structure, was included in the STs created by comparison of this structure with 1C5L, 1FLE, 1FN8 and 1SGP. This atom was not included in STs created by comparison of this structure with the subtilisin structures 1GCI and 1SCJ. Table rows with information about atoms that belong to the well-defined active site of serine endopeptidases (catalytic triad: His, Asp, Ser) are in boldface. It is encouraging that in both examples presented almost all STs contained all the active atoms from the catalytic triad residues. The only exception is the 1GCI/1SCJ pair, in which the oxygen from Ser125 replaced the oxygen from Ser221, which is used in all remaining STs. Analogous results were obtained for the remaining structure pairs from this family. The conclusion is that by using any of the pairs of structures as a source, one would be able to locate precisely the active site for the serine endopeptidase family in a fully automated procedure, from atom coordinates alone.
The spatial arrangement of the extracted ST is presented in Figure 3. In this figure, the atoms from the catalytic triad form the core of the ST. These core atoms are flanked on one side by atoms from two cysteines and a serine and on the other side by atoms from another serine. There are no conserved distances between these two groups of bordering atoms.
L-Aspartate aminotransferase (1AJR) and D-amino acid aminotransferase (1DAA) are two completely different protein structures sharing a common biochemical function. They catalyze the same type of chemical reaction for substrates that have different chiralities. These proteins form functional dimers and CSC was not able to discover any STs when applied to single chains of these proteins. However, when we included all chains in the analysis process, the ST presented in Table IX was extracted. The format of this table is analogous to that of Tables VII and VIII. The extracted ST contained atoms from two tyrosine residues, a glutamic acid residue and an arginine residue. The connectivity matrix for the template is comparatively dense, which means that the distances between included atoms are similar in both structures. However, the matrix contains some zeros, meaning that distances between some atoms are excluded from the ST definition. For example, the arginine atoms (numbers 7 and 8) are coordinated only with atoms from one of the tyrosines from the complex (numbers 3 and 4).
Table X reports CSC ST extraction performance on a diverse set of protein pairs based on EC classes. We chose pairs of proteins that correspond to the same EC number (catalyze the same chemical reaction) but belong to different fold classes according to the SCOP (Murzin et al., 1995) classification of proteins. The protein size varied from 200 to >500 residues and the overall structural r.m.s.d. varied from 2.7 to 7.3 Å. In our experience, ST extraction with CSC rarely takes more than 15 s per protein pair. The results are summarized in Table X and the full analysis of the underlying templates will be reported in subsequent papers. In order to analyze the other end of the similarity spectrum, we extracted a template from a cathepsin structure and a papain structure. These two structures share the same fold (1.5 Å r.m.s.d.) and have similar sequences (45% sequence similarity) but catalyze slightly different reactions (as characterized by EC 3.4.22.1 and 3.4.22.2, respectively). In this case the CSC algorithm took <20 s to extract a template that contained atoms from residues Cys25 and His159, known to be associated with the catalytic function.
Discussion |
---|
The proposed method for joining overlapping CSCs and creating the final ST works well in the majority of the cases tested; the resulting sets of atoms are predominantly located in the neighborhood of the expected active site for the analyzed structures (see Tables V-VIII). The clique-joining algorithm is based on a greedy search method. It detected atoms from the catalytic triad in 20 out of 21 pairs of serine endopeptidase structures. In the failed case, the search was disrupted by a large clique of atoms from the inhibitor chain of the structure. However, the catalytic triad was localized for this pair of structures in the second run, after removing the inhibitor chain from the extraction process. In the case of aminotransferases, the algorithm extracted sets of atoms from the neighborhood of the cofactor molecule (see Table IX and Figure 5), which supports the likely functional importance of these atoms. Obviously, it is difficult to prove that the cliques containing the most frequent atom pairs are always more functionally relevant; however, we see it as a plausible assumption in the case when the compared proteins have different structures and similar functions.
The results of the method parameters calibration tests (Tables III and IV) show that the selected set of side chain atoms (see Table I) and proposed values of distance threshold (8.0 Å) and distance tolerance parameters (1.0 Å) work well for the tested structures. As shown in Figure 2, the number of initial local atom cliques that must be analyzed during the CSC creation procedure grows substantially as the distance threshold value increases. An increase in the number of local atom cliques implies an increase in CSCs for every analyzed pair of structures (see Table III, below the diagonal). Nevertheless, the final STs remain basically the same, with only a slight increase in size (see Table III, above the diagonal). Analogously, an increase in the value of distance tolerance (a parameter used in the atom clique comparison procedure) increases the number of initial CSCs (see Table IV, below the diagonal) without drastic changes in the final STs (see Table IV, above the diagonal).
Analysis of the STs, even without looking at the structures or literature analysis, may provide some insight into the compared proteins. Taking the ST extracted from the aminotransferase structures (Table IX) as an example, we can form a hypothesis: if the selected ST contains functionally relevant residues and the structure is not an enzyme-inhibitor complex, then these proteins form functional dimers. The hypothesis is true for both analyzed enzymes. Additionally, in the case of 1AJR, the ST is created by tyrosine and arginine from chain A and tyrosine and glutamic acid from chain B; in the case of 1DAA, both tyrosine residues are from chain A and glutamic acid and arginine are from chain B. This implies that this particular functional site is not sequence dependent and most probably evolved as an effect of a convergence process.
The examples presented illustrate the applicability of the algorithm to extract common structural features from proteins that catalyze a particular chemical reaction, but evolved from different ancestors owing to convergent evolution. Obviously, sequence-homology-based methods for pattern extraction are not applicable in such cases. Additionally, the presented approach captures an important malleability feature of the active site description. It allows for expressing similarities between active sites that are very difficult to describe otherwise, since they cannot be superimposed by the rigid body transformation.
The template extraction algorithm works best for comparison of structures of non-related proteins expressing some common feature or function. We hypothesize that when the configuration of the template atoms is conserved in either convergent or divergent protein structures sharing the same function, then the selected set is likely to contain functionally relevant atoms. Application of the CSC algorithm to similar proteins, sharing similar function, is also feasible, as demonstrated on the cathepsin-papain example.
Acknowledgement |
---|
References |
---|
Axe,D.D., Foster,N.W. and Fersht,A.R. (1996) Proc. Natl Acad. Sci. USA, 93, 5590-5594.
Bairoch,A. (2000) Nucleic Acids Res., 28, 304-305.
Brenner,S. (1988) Nature, 334, 528-530.
Chothia,C. and Gerstein,M. (1997) Nature, 385, 579-581.
Chothia,C. and Lesk,A.M. (1986) EMBO J., 5, 823-826.
Di Gennaro,J.A., Siew,N., Hoffman,B.T., Zhang,L., Skolnick,J., Neilson,L.I. and Fetrow,J.S. (2001) J. Struct. Biol., 134, 232-245.
Fetrow,J.S. and Skolnick,J. (1998) J. Mol. Biol., 281, 949-968.
Fisher,D., Wolfson,H., Lin,S.L. and Nussinov,R. (1994) Protein Sci., 3, 769-778.
Fisher,D., Tsai,C.J., Nussinov,R. and Wolfson,H. (1995) Protein Eng., 8, 981-997.
Gassner,N.C., Baase,W.A. and Matthews,B.W. (1996) Proc. Natl Acad. Sci. USA, 93, 12155-12158.
Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522-524.
Irving,J.A., Whisstock,J.C. and Lesk,A.M. (2001) Proteins, 42, 378-382.
Lesk,A.M. and Chothia,C. (1980) J. Mol. Biol., 136, 225-270.
Milik,M., Szalma,S. and Olszewski,K.A. (2002) In Guigo,R. and Gusfield,D. (eds), Algorithms in Bioinformatics. Springer, Berlin, p. 182.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536-540.
Petsko,G. and Ringe,D. (2002) Protein Structure and Function: from Sequence to Consequence. New Science Press, London. E-book: http://www.new-science-press.com/browse/protein/.
Reddy,B.V., Li,W.W., Shindyalov,I.N. and Bourne,P.E. (2001) Proteins, 42, 148-163.
Reva,B., Finkelstein,A. and Topiol,S. (2002) Proteins, 47, 180-193.
Russell,R.B. (1998) J. Mol. Biol., 279, 1211-1227.
Russell,R.B., Sasieni,P.D. and Sternberg,M.J. (1998) J. Mol. Biol., 282, 903-918.
Shindyalov,I.N. and Bourne,P.E. (1998) Protein Eng., 11, 739-747.
Siezen,R.J., de Vos,W.M., Leunissen,J.A.M. and Dijkstra,B.W. (1991) Protein Eng., 4, 719-737.
Turcotte,M., Muggleton,S.H. and Sternberg,J.E. (2001) J. Mol. Biol., 306, 591-605.
Wallace,A.C., Laskowski,R.A. and Thornton,J.M. (1996) Protein Sci., 5, 1001-1013.
Wallace,A.C., Borkakoti,N. and Thornton,J.M. (1997) Protein Sci., 6, 2308-2323.
Received March 9, 2003; revised June 6, 2003; accepted June 24, 2003.