Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK 1Present address: Department of Biochemistry and Chemistry, University of Leicester, Leicester LE1 7RH, UK
2 To whom correspondence should be addressed. e-mail: wtaylor{at}nimr.mrc.ac.uk
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: classification/evolution/fold/protein/topology/trees
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Hierarchical clustering
Protein structures are usually analysed at the level of the domain. However, the definition of a domain is not always straightforward. Small structural differences between otherwise similar proteins can have major consequences to the way in which a protein structure is broken into domains and even within a domain such differences can alter the way in which its fold (or topology) is perceived. Expert judgement can be used to some extent to overcome these problems, but experts do not always agree.
Current systems for the classification of protein structures use different methods. SCOP (Murzin et al., 1995) is principally a manual approach while FSSP (Holm and Sander, 1997
) is an automated method. Between these lie CATH (Orengo et al., 1997
) and HOMSTRAD (Mizuguchi et al., 1998
) which combine automation with manual curation. The basic approach, however, is the same: proteins are divided into their component domains that are then compared pairwise and clustered into groups of similar structures. This approach works well when there is clear similarity between the structures; however, for distantly related (or unrelated) structures, it becomes difficult to define a rational hierarchy on the relationships.
This difficulty is further compounded by the problem of domain definition where differences in the definitions can result in inconsistent fragmentary similarities between structures. Indeed many of the differences between the current classification systems are largely due to different domain definitions (Hadley and Jones, 1995).
Sub-graph matching
To tackle the problem of relating distant structural similarities, more abstract methods were developed based on the network of interactions between secondary structures. Such networks were amenable to the computational methods of graph analysis, allowing a structural similarity to be defined as a sub-graph isomorphism between two structures (Mitchell et al., 1989; Artymiuk et al., 1990
; Koch et al., 1992
; Harrison et al., 2002
; Taylor, 2002c
). One advantage of the approach is that it implicitly solves the problem of defining a domain since each domain is the connected components in the graph.
This approach is encountered again in the TOPS method that was derived from comparing topological diagrams of protein folds (Flores et al., 1994; Gilbert et al., 1999
); however, this method is restricted to protein folds that incorporate a ß-sheet as it does not embody a detailed description of helical interactions.
In these approaches, the size of the common sub-graph found in two proteins can be taken as a measure of their similarity and used as a basis either for clustering or representing relationships as a network of connected cores.
Evolutionary approach
From the pioneering work of Ptitsyn and colleagues (Ptitsyn and Finkelstein, 1980; Finkelstein and Ptitsyn, 1987
), it has been shown that the three-dimensional arrangement of helices and strands in larger proteins can be obtained from the stepwise addition of secondary structure elements (SSEs) to basic structural motifs (Efimov, 1993
, 1997
). Whether this accretion of SSEs reflects either a possible folding pathway for the protein or an evolutionary history is debatable but irrespective of any of these rationalizations, it provides a valid approach for the classification of protein folds.
Using this approach, the protein folds become organized into a phylogenetic tree (which may include missing-links). Unlike clustering by similarity, the tree can be arbitrarily deep, so relating the most dissimilar folds. A disadvantage, however, is that the construction of the trees is a manual operation that embodies an implicit set of assumptions and rules that are only to varying degrees stated explicitly.
Matching ideal forms
In a double attack on the problems of protein fold classification with domain definition, a series of ideal folds (called forms) were developed and matched at the level of secondary structures to known structures (Taylor, 2000, 2002a
). Each successful match simultaneously identifies a fold and defines the domain. The set of ideal folds can be organized in a table which is not unlike a protein equivalent to the periodic table of elements (Taylor, 2002b
).This analogy is based on the correspondence between layers of secondary structure with electronic orbitals. Just as the orbitals become filled with electrons, so the layers become filled with secondary structures. In this arrangement, a step in any direction in the table represents the addition or deletion of an SSE in one of the layers.
The organization of known structure based on their ideal forms embodies many of the principles discussed above for the alternative approaches: for example, the identification of the largest common form shared between two proteins corresponds to their largest isomorphic sub-graph in the graph-based methods, while steps within the table are equivalent to the addition of secondary structures onto a core structure. The growth of a large structure from a simple core can then be viewed as a pathway through the table of forms (ToF) and a distance metric between structures can be established as the shortest path (or edit distance) between the structures.
Outline of the work
In this work, the connections between the evolutionary approach to protein fold analysis and pathways through the ToF are explored. The aim of this investigation is to develop both a metric of structure similarity based on an edit distance and also to establish in a rigorous way, the set of rules that can be used to constrain steps through the ToF.
The work concentrates on the three-layer ßß structures which provide a rich collection of structures that have been extensively analysed by Efimov into a large evolutionary tree.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
The encoding of protein architectures as layers of secondary structure allows the fold of the chain to be described as a series of moves between the layers. Concentrating on the three-layer ßß architecture, each layer can be designated by the letters A, B and C (respectively) specifying a position coordinate in one dimension. Location in the layer was encoded as a numeric displacement relative to the first SSE to be found in the layer (specifying a second coordinate dimension). The third dimension encodes just the orientation of each SSE relative to the first SSE. This string encoding is similar to that devised by Flower (1998
) but is directly rooted in a coordinate frame making it more computationally tractable. A chain path can then be encoded using three descriptors for each SSE, as shown in Figure 1.
|
Largest common fold
Dynamic programming solution. Given two topology strings, the problem of finding the largest common fold can be solved using a variation of the dynamic-programming algorithm commonly used to align sequences. In this initial consideration of the problem, we will assume that the assignment of layers and the orientation of the first strand are the same in both strings.
Given the topology strings corresponding to the two proteins shown in Figure 1:
+B0.A0.+B2.C0.+B1.C1.+B1.C2.+B2
and
+B0.A0.+B1.C0.+B2.C1.+B3.A1.+B1
a matrix of scores is calculated by accumulating a positive score (+3) for a match in the SSE position and orientation relative to the last SSE in the same layer and a negative score (1) for a mismatch. Taking the two example topology strings (Figure 2), the first strands match and score 3, as do the following helices (giving 6), the next pair of strands have the same orientation but differ in their displacement from the first strand, so they mismatch, reducing the score to 5. Following this algorithm to completion identifies the highest scoring pathway through the matrix which corresponds to the elements that match in the common core. (See Figure 2 for further details.)
|
Additional constraints. To produce biologically meaningful results, the algorithm was modified so that solutions that contained a gap in a ß-sheet were discarded. This implies that the algorithm will not skip-over core ß-strands within a substructure but these are still able to be discarded from the edge of a sheet as mismatches in the normal way.
The solution to a single comparison is not necessarily unique and it can be seen from the example in Figure 2 that there are two equally high scoring paths (one substitutes a final helix in the match for a strand). In this situation, a decision was made to take the solution with most ß-strands.
An option was provided to report the largest common substructure (LCS) that does not contain any internal mismatches (corresponding to a strict sub-diagonal in the score matrix). This constrains solutions to be derived only by deletions from the termini of the chains. By analogy with sequence alignment, the two variants are referred to as global and local, with the latter being the more restrictive. The different behaviour of this option is shown in Figure 3. It was considered that the option that allows general deletion of structure was more biologically realistic and the results presented below were all generated with this option.
|
Proteins can now be clustered according to the common substructures that they share and two algorithms were investigated for this. The first clusters according to the size of the substructure: i.e. the largest substructure is chosen as the next node in the tree. The second creates clusters according to the distance between the structures: i.e. in each cycle the substructure which requires the least number of additions to form the two larger structures is chosen as the next node in the tree. These alternatives will be referred to as the basic and distance based algorithms, respectively.
In both algorithms, the new node is added to the pool of structures, and the two larger structures are removed. The common substructure of two proteins is then represented at their joining node as an ancestral structure, unless this also corresponds with one of the structures (i.e. the smaller structure is contained within the larger), in which case the smaller structure itself occupies the ancestral node.
Data
The method was tested on a subset of the three-layer ß proteins analysed by Efimov (1997
). These were: (1) thiolase; (2) phosphoglucomutase (PGM); (3) adenylate kinase (ADK); (4) fructose-2,6-biphosphatase (FBiP); (5) phosphofructokinase (small domain) (PFK); (6) porphobilinogen deaminase (domain 2) (PBGD); (7) L-arabinose binding protein (Q domain) (ABP); (8) rhodanase (RHD); (9) dethiobiotin synthase (DBS); (10) subtilisin (SBT); (11) isocitrate dehydrogenase (ICD); (12) pyruvate decarboxylase (
,
-domains) (PDC); (13) hypoxanthine-guanine phosphoribosyltransferase (HGPRT); (14) pyruvate oxidase (POx); (15) mitochondrial F1-ATPase (ATPase); (16) rec A protein (recA). These proteins will be referred to in the text by name, or more commonly by their number (in parentheses) or abbreviation (in brackets). The topology strings for these structures are given in Table I.
|
The selected structures cover the major well populated part of the tree of structures drawn by Evimov (1997, table 5) and ignores the part of his tree in which larger structures are connected by long sparsely populated branches back to the root.
![]() |
Results and discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
|
While the automatically generated trees are in broad agreement with the Efimov tree, there are details, involving the desirability of making particular insertions/deletions in core positions that need to be further examined. In the current work we have not attempted to make any further modifications to the basic algorithm but have, rather, assessed its behaviour on a well characterized group of protein structures. If it is assumed that Efimov provides a gold-standard for the relationships between structures, it would now be possible to modify the algorithm (adding further constraints or relaxing existing ones) to optimize the behaviour of the algorithm on its ability to regenerate the trees of Efimov. Rather than pursue this route, a more fundamental investigation would involve an attempt to derive the underlying constraints from the data. Using the automated approach described here, it will be possible to calculate trees of structures rapidly for each formulation of the rules, so allowing these to be varied until the most parsimonious tree is obtained.
The primary application of this approach is to impose a minimal hierarchical description on the relationship of protein folds in a way that clearly states the assumptions and rules that have been applied. While the resulting classification is in itself of value, it is also interesting to speculate whether the relationships might have resulted from a corresponding series of evolutionary events. If this were so then the ancestral nodes that have no equivalent known structure might correspond to relic structures that are yet to be discovered. Alternatively, it might be postulated that the resulting order reflects the constraints of similar folding pathwayswith the accretion of secondary structure imitating the assembly of the protein structure as it folds (as originally postulated by Ptitsyn and colleagues). These two options may even be simultaneously true: in the way in which ontogeny recapitulates phylogeny in embryonic development, so the protein folding pathway might equally recapitulate its evolutionary history.
For the further application of the method to a large set of protein structures, the problem of secondary structure definition will need to be addressed. For this, a method that is insensitive to the fine details of hydrogen-bonding is required and we are investigating the use of a line-segment based approach for this (Taylor, 2001). More importantly, for the method to become fully automatic, it will also be necessary to have a robust method to define protein topology. As mentioned above, the matching of idealized forms can be used for this (Taylor, 2002b
) but this introduces the added complexity that there is not always a unique match of a form to a structure. To overcome this problem, the current method has the potential to be extended to consider alternative topology strings derived from multiple forms. Thus, rather than force the choice of a best match for each protein, the matches that give rise to the best path can be selected.
In this work, we have presented the first automatic method to cluster protein structures by their fold using a hierarchical tree organization with the representation of ancestral nodes. We have shown that, in outline, the resulting trees correspond well with those compiled by an expert and that some differences arise both through differing emphasis on the permitted rules of evolution and some from inconsistent manual application of these rules. The evolutionary significance of the resulting tree and its ancestral structures is difficult to evaluate but as the tree has been compiled without bias and without reference to function, the subsequent analysis of a larger number of structures with reference to their function may provide some insights.
![]() |
Acknowledgements |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Efimov,A.V. (1993) Prog. Biophys. Mol. Biol., 60, 201239.[CrossRef][ISI][Medline]
Efimov,A.V. (1997) Proteins, 28, 241260.[CrossRef][ISI][Medline]
Finkelstein,A.V. and Ptitsyn,O.B. (1987) Prog. Biophys. Mol. Biol., 50, 171190.[CrossRef][ISI][Medline]
Flores,T.P., Moss,D.S. and Thornton,J.M. (1994) Protein Eng., 7, 3137.[ISI][Medline]
Flower,D.R. (1998) Protein Eng., 11, 723727.[CrossRef][ISI][Medline]
Gilbert,D., Westhead,D., Nagano,N. and Thornton,J.M. (1999) Bioinformatics, 15, 317326.
Hadley,C. and Jones,D.T. (1995) Structure, 7, 10991112.
Harrison,A., Pearl,F., Mott,R., Thornton,J. and Orengo,C. (2002) J. Mol. Biol., 323, 909926.[CrossRef][ISI][Medline]
Holm,L. and Sander,C. (1997) Nucleic Acids Res., 25, 231234.
Koch,I., Kaden,F. and Selbig,J. (1992) Proteins, 12, 314323.[ISI][Medline]
Mitchell,E.M., Artymiuk,P.J., Rice,D.W. and Willett,P. (1989) J. Mol. Biol., 212, 151166.[ISI]
Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) Protein Sci., 7, 24692471
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536540.[CrossRef][ISI][Medline]
Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 10931108.[ISI][Medline]
Ptitsyn,O.B. and Finkelstein,A.V. (1980) Q. Rev. Biophys., 13, 339386.[ISI][Medline]
Taylor,W.R. (2000) Biochem. Soc. Trans, 28, 264269.[ISI][Medline]
Taylor,W.R. (2001) J. Mol. Biol., 310, 11351150.[CrossRef][ISI][Medline]
Taylor,W.R. (2002a) In Mewes,B. and Weiss,H.S. (eds), Bioinformatics and Genome Analysis. Springer-Verlag, Berlin, Ernst Schering Research Foundation Workshop Vol. 38, pp. 133148.
Taylor,W.R. (2002b) Nature, 416, 657660.[CrossRef][ISI][Medline]
Taylor,W.R. (2002c) Mol. Cell. Proteomics, 1, 334339.
Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 122.[ISI][Medline]
Received June 19, 2003; revised September 13, 2003; accepted October 21, 2003