Protein fold comparison by the alignment of topological strings

Linus O. Johannissen1 and William R. Taylor2

Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK 1Present address: Department of Biochemistry and Chemistry, University of Leicester, Leicester LE1 7RH, UK

2 To whom correspondence should be addressed. e-mail: wtaylor{at}nimr.mrc.ac.uk


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Using the definitions of protein folds encoded in a text string, a dynamic programming algorithm was devised to compare these and identify their largest common substructure and calculate the distance (in terms of the number of edit operations) that this lay from each structure. This provided a metric on which the folds were clustered into a ‘phylogenetic’ tree. This construction differs from previous automatic structure clustering algorithms as it has explicit representation of the structures at ‘ancestral’ branching nodes, even when these have no corresponding known structure. The resulting tree was compared with that compiled by an ‘expert’ in the field and while there was broad agreement, differences were found that resulted from differing degrees of emphasis being placed on the types of operations that can be used to transform structures. Some concluding speculations on the relationship of such trees to the evolutionary history and folding of the proteins are advanced.

Keywords: classification/evolution/fold/protein/topology/trees


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
With the rapidly growing number of known protein structures, it becomes increasingly important to have a consistent and systematic way to classify and analyse these structures. Such analysis provides insights into the variety of structures found in nature and helps in the understanding of how they relate to one another. As a consequence, light may be shed on the enigmatic relationship between structure and sequence, including its manifestation through function, folding and evolution. A better understanding of these can then, in turn, be applied to improve the prediction of structure from sequence data.

Hierarchical clustering

Protein structures are usually analysed at the level of the domain. However, the definition of a domain is not always straightforward. Small structural differences between otherwise similar proteins can have major consequences to the way in which a protein structure is broken into domains and even within a domain such differences can alter the way in which its fold (or topology) is perceived. Expert judgement can be used to some extent to overcome these problems, but experts do not always agree.

Current systems for the classification of protein structures use different methods. SCOP (Murzin et al., 1995Go) is principally a manual approach while FSSP (Holm and Sander, 1997Go) is an automated method. Between these lie CATH (Orengo et al., 1997Go) and HOMSTRAD (Mizuguchi et al., 1998Go) which combine automation with manual curation. The basic approach, however, is the same: proteins are divided into their component domains that are then compared pairwise and clustered into groups of similar structures. This approach works well when there is clear similarity between the structures; however, for distantly related (or unrelated) structures, it becomes difficult to define a rational hierarchy on the relationships.

This difficulty is further compounded by the problem of domain definition where differences in the definitions can result in inconsistent fragmentary similarities between structures. Indeed many of the differences between the current classification systems are largely due to different domain definitions (Hadley and Jones, 1995Go).

Sub-graph matching

To tackle the problem of relating distant structural similarities, more abstract methods were developed based on the network of interactions between secondary structures. Such networks were amenable to the computational methods of graph analysis, allowing a structural similarity to be defined as a sub-graph isomorphism between two structures (Mitchell et al., 1989Go; Artymiuk et al., 1990Go; Koch et al., 1992Go; Harrison et al., 2002Go; Taylor, 2002cGo). One advantage of the approach is that it implicitly solves the problem of defining a domain since each domain is the connected components in the graph.

This approach is encountered again in the TOPS method that was derived from comparing topological diagrams of protein folds (Flores et al., 1994Go; Gilbert et al., 1999Go); however, this method is restricted to protein folds that incorporate a ß-sheet as it does not embody a detailed description of helical interactions.

In these approaches, the size of the common sub-graph found in two proteins can be taken as a measure of their similarity and used as a basis either for clustering or representing relationships as a network of connected cores.

‘Evolutionary’ approach

From the pioneering work of Ptitsyn and colleagues (Ptitsyn and Finkelstein, 1980Go; Finkelstein and Ptitsyn, 1987Go), it has been shown that the three-dimensional arrangement of helices and strands in larger proteins can be obtained from the stepwise addition of secondary structure elements (SSEs) to basic structural motifs (Efimov, 1993Go, 1997Go). Whether this accretion of SSEs reflects either a possible folding pathway for the protein or an evolutionary history is debatable but irrespective of any of these rationalizations, it provides a valid approach for the classification of protein folds.

Using this approach, the protein folds become organized into a phylogenetic tree (which may include ‘missing-links’). Unlike clustering by similarity, the tree can be arbitrarily deep, so relating the most dissimilar folds. A disadvantage, however, is that the construction of the trees is a manual operation that embodies an implicit set of assumptions and rules that are only to varying degrees stated explicitly.

Matching ideal forms

In a double attack on the problems of protein fold classification with domain definition, a series of ideal folds (called forms) were developed and matched at the level of secondary structures to known structures (Taylor, 2000Go, 2002aGo). Each successful match simultaneously identifies a fold and defines the domain. The set of ideal folds can be organized in a table which is not unlike a protein equivalent to the periodic table of elements (Taylor, 2002bGo).This analogy is based on the correspondence between layers of secondary structure with electronic orbitals. Just as the orbitals become filled with electrons, so the layers become filled with secondary structures. In this arrangement, a step in any direction in the table represents the addition or deletion of an SSE in one of the layers.

The organization of known structure based on their ideal forms embodies many of the principles discussed above for the alternative approaches: for example, the identification of the largest common form shared between two proteins corresponds to their largest isomorphic sub-graph in the graph-based methods, while steps within the table are equivalent to the addition of secondary structures onto a core structure. The ‘growth’ of a large structure from a simple core can then be viewed as a pathway through the table of forms (ToF) and a distance metric between structures can be established as the shortest path (or edit distance) between the structures.

Outline of the work

In this work, the connections between the evolutionary approach to protein fold analysis and pathways through the ToF are explored. The aim of this investigation is to develop both a metric of structure similarity based on an edit distance and also to establish in a rigorous way, the set of rules that can be used to constrain steps through the ToF.

The work concentrates on the three-layer ß{alpha}ß structures which provide a rich collection of structures that have been extensively analysed by Efimov into a large ‘evolutionary’ tree.


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Topology strings

The encoding of protein architectures as layers of secondary structure allows the fold of the chain to be described as a series of moves between the layers. Concentrating on the three-layer ß{alpha}ß architecture, each layer can be designated by the letters A, B and C (respectively) specifying a position coordinate in one dimension. Location in the layer was encoded as a numeric displacement relative to the first SSE to be found in the layer (specifying a second coordinate dimension). The third dimension encodes just the orientation of each SSE relative to the first SSE. This string encoding is similar to that devised by Flower (1998Go) but is directly rooted in a coordinate frame making it more computationally tractable. A chain path can then be encoded using three descriptors for each SSE, as shown in Figure 1.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 1. Example topology strings. Two small {alpha}ß{alpha} layer proteins fitting form 1–5–3 (a) (one helix above and three below a five-stranded sheet) and form 2–5–2 (b) are shown as topology diagrams with their corresponding topology strings below. In the topology diagrams, helices are depicted as circles and ß-strands as triangles. In the topology strings, the three layers of secondary structure ({alpha}ß{alpha}) are designated A, B and C, respectively. Each SSE is given a label of three parts indicating orientation (‘+’, ‘–’), layer and position in the layer. The first strand in each layer is, by definition, at position 0 with others numbered relative to this. In the topology diagram negative numbers lie to the left, positive to the right. Similarly, in the strings, a positive orientation corresponds to an SSE approaching (‘out of the page’) in the diagrams.

 
This encoding scheme might appear to depend on the orientation of the molecule; however, both the orientation and positions of the SSEs are determined relative to the first strand and, if in addition, the labelling of the layers is not predefined, then the scheme can be made orientation independent. This latter property was achieved by assigning the label ‘A’ to the first {alpha} layer to be occupied. Although the scheme is independent of orientation for the whole molecule, i.e. identical structures will have the same string, it remains sensitive to the starting point so two identical substructures within larger molecules need not have the same string. This difficulty will be considered further below.

Largest common fold

Dynamic programming solution. Given two topology strings, the problem of finding the largest common fold can be solved using a variation of the dynamic-programming algorithm commonly used to align sequences. In this initial consideration of the problem, we will assume that the assignment of layers and the orientation of the first strand are the same in both strings.

Given the topology strings corresponding to the two proteins shown in Figure 1:

+B0.–A0.+B–2.–C0.+B–1.–C1.+B1.–C2.+B2

and

+B0.–A0.+B–1.–C0.+B2.–C1.+B3.–A1.+B1

a matrix of scores is calculated by accumulating a positive score (+3) for a match in the SSE position and orientation relative to the last SSE in the same layer and a negative score (–1) for a mismatch. Taking the two example topology strings (Figure 2), the first strands match and score 3, as do the following helices (giving 6), the next pair of strands have the same orientation but differ in their displacement from the first strand, so they mismatch, reducing the score to 5. Following this algorithm to completion identifies the highest scoring pathway through the matrix which corresponds to the elements that match in the common core. (See Figure 2 for further details.)



View larger version (18K):
[in this window]
[in a new window]
 
Fig. 2. Example topology string alignment. The basic algorithm to align topology strings is illustrated for two proteins ‘A’ and ‘B’. (a) Their strings are arranged to form a score matrix which is filled with values beginning at the top left. Matching SSEs score 3 and mismatches score –1. The scores are accumulated towards the lower right with each new score taking the highest score from the adjacent cells of the submatrix to its top left (of which it forms the bottom right corner). After a match (*) only a diagonal transition is allowed. A bias of 1 is added for matching ß-strands when they have different relative positions in the sheet and a penalty of –3 is given to cross-layer mismatches. Since the match is dependent on the relative position to the first match, each initial pairing of SSEs is made in turn for both interchanges of the A and C {alpha}-layers. See text for details. (b) The resulting alignment.

 
Substructure matching. The basic algorithm described above is limited by the requirement that SSE matches are dependent on their orientation and displacement relative to the first instance of an SSE in the same layer and so cannot be used to find substring matches (corresponding to local alignments in sequence matching). However, following the use of dynamic programming in three-dimensional structure matching (double dynamic programming) (Taylor and Orengo, 1989Go), the common core can be calculated for each ‘enforced’ matching of all pairs of SSEs (of like type). It may now be necessary to ‘rotate’ the subfragments into the same orientation but this can be easily done typographically: a 180° rotation in the plane of the topological diagram requires only the interchange of A{leftrightarrow}C characters combined with the negation of all position values while a flip in up/down orientation requires the negation of the orientation sign combined with either of the preceding transformations. As the orientation is set by the forced match of the initial elements in the substrings, only one alternative string needs to be tested and the better match of the two retained as the solution.

Additional constraints. To produce biologically meaningful results, the algorithm was modified so that solutions that contained a gap in a ß-sheet were discarded. This implies that the algorithm will not skip-over core ß-strands within a substructure but these are still able to be discarded from the edge of a sheet as mismatches in the normal way.

The solution to a single comparison is not necessarily unique and it can be seen from the example in Figure 2 that there are two equally high scoring paths (one substitutes a final helix in the match for a strand). In this situation, a decision was made to take the solution with most ß-strands.

An option was provided to report the largest common substructure (LCS) that does not contain any internal mismatches (corresponding to a strict sub-diagonal in the score matrix). This constrains solutions to be derived only by deletions from the termini of the chains. By analogy with sequence alignment, the two variants are referred to as ‘global’ and ‘local’, with the latter being the more restrictive. The different behaviour of this option is shown in Figure 3. It was considered that the option that allows general deletion of structure was more biologically realistic and the results presented below were all generated with this option.




View larger version (25K):
[in this window]
[in a new window]
 
Fig. 3. Local versus global algorithm. The difference in results between the two alternative algorithms is illustrated. (A) The global algorithm allows the deletion/insertion of SSEs at both edges of a domain, while (B) the local algorithm allows deletion/insertion only at the termini of the chain. This corresponds to the longest common substring in the topology strings (as defined by structure, not character matching). Deleted portions in both structures are shown in grey. A triangle represents a ß-strand while a circle represents an {alpha}-helix.

 
Hierarchical clustering

Proteins can now be clustered according to the common substructures that they share and two algorithms were investigated for this. The first clusters according to the size of the substructure: i.e. the largest substructure is chosen as the next node in the tree. The second creates clusters according to the distance between the structures: i.e. in each cycle the substructure which requires the least number of additions to form the two larger structures is chosen as the next node in the tree. These alternatives will be referred to as the basic and distance based algorithms, respectively.

In both algorithms, the new node is added to the pool of structures, and the two larger structures are removed. The common substructure of two proteins is then represented at their joining node as an ‘ancestral’ structure, unless this also corresponds with one of the structures (i.e. the smaller structure is contained within the larger), in which case the smaller structure itself occupies the ‘ancestral’ node.

Data

The method was tested on a subset of the three-layer ß{alpha} proteins analysed by Efimov (1997Go). These were: (1) thiolase; (2) phosphoglucomutase (PGM); (3) adenylate kinase (ADK); (4) fructose-2,6-biphosphatase (FBiP); (5) phosphofructokinase (small domain) (PFK); (6) porphobilinogen deaminase (domain 2) (PBGD); (7) L-arabinose binding protein (Q domain) (ABP); (8) rhodanase (RHD); (9) dethiobiotin synthase (DBS); (10) subtilisin (SBT); (11) isocitrate dehydrogenase (ICD); (12) pyruvate decarboxylase ({alpha}, {gamma}-domains) (PDC); (13) hypoxanthine-guanine phosphoribosyltransferase (HGPRT); (14) pyruvate oxidase (POx); (15) mitochondrial F1-ATPase (ATPase); (16) rec A protein (recA). These proteins will be referred to in the text by name, or more commonly by their number (in parentheses) or abbreviation (in brackets). The topology strings for these structures are given in Table I.


View this table:
[in this window]
[in a new window]
 
Table I. Topology strings
 
For these structures, we have adopted the secondary structure definitions of Efimov. For the general application of the method, the problem of secondary structure definition will need to be addressed. This aspect will revisited in the Discussion.

The selected structures cover the major well populated part of the tree of structures drawn by Evimov (1997Go, table 5) and ignores the part of his tree in which larger structures are connected by long sparsely populated branches back to the root.


    Results and discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Each of the clustering methods described in Materials and methods (Hierarchical clustering) generated a reasonable tree structure of the similarities between the proteins. Allowing for some branch rearrangement in the representation of the trees (which were generated automatically), it can be seen that the two trees are almost isomorphic (Figures 4 and 5). The main difference is the insertion of some extra intermediate steps using the LCS-based variation of the clustering algorithm. Specifically, an extra node is inserted between the root of the tree and the two branches carrying thiolase (1) and adenylate kinase (3) and the intermediate node represented by the ‘ancestral’ structure ‘number 18’ in Figure 5 is replaced by two ‘ancestral’ structures in Figure 4.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 4. LCS clustering—basic algorithm. The dendrogram shows the clustering of the 16 protein structures using the basic clustering algorithm in which the LCS is chosen as the next node in the tree. The proteins are numbered as in Table I. ‘Ancestral’ or branching node structures that do not correspond to a known structure are given an arbitrary number.

 


View larger version (15K):
[in this window]
[in a new window]
 
Fig. 5. LCS clustering—distance-based algorithm. The dendrogram shows the clustering of the 16 protein structures using the clustering algorithm operating on the distance of each structure from their LCS. The proteins are numbered as in Table I. ‘Ancestral’ or branching node structures that do not correspond to a known structure are given an arbitrary number. (As these are automatically generated, they do not correspond to those in Figure 4.)

 
The latter tree (Figure 4) was compared with a simplified representation of the tree constructed by Efimov (1997Go) (Figure 6). The majority of the tree structure is isomorphic with that of Efimov, including the relationship of the thiolase (1) and adenylate kinase (3) branches. A minor rearrangement occurs among the structures 12 (PDC), 13 (HGPRT) and 14 (POx). Using our automatic method, these join at a common node, whereas Efimov links 12 to 14. While this may seem like a trivial change, the connection between 12 and 14 involves the ‘deletion’ of an internal (buried) ß-strand from the sheet (with the subsequent closure and reformation of an intact sheet). This is an operation that was specifically disallowed by our algorithms but has been allowed on this occasion by Efimov who allows some flexibility in the application of his rules. Specifically, he states: ‘a structure obtained in the preceding step is maintained [but] it can be slightly modified’ (Efimov, 1997Go).



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 6. Efimov tree. The major part of the ß/{alpha} tree constructed by Efimov (1997Go, table 5) is shown in simplified form in which each of the ‘ancestral’ structures that do not constitute a branching node have been removed.

 
The largest rearrangement, however, occurs between the two branches carrying subtilisin (10) and PFK (5). Efimov links subtilisin (10) by a long branch of seven ‘ancestral’ structures back to a point where there is a common node with our automatically generated tree. Similarly, DBS (9) is linked back through five ‘ancestral’ structures to PFK (5). In contrast, these two structures are linked by a common node (‘number 17’) in our construction (Figure 4). This association involves the loss of two helices and an edge strand from DBS to recreate the unusual left-handed ß{alpha}ß unit found at the N-termini edge of the subtilisin sheet. Although this transition can be accomplished in three steps it is not an obvious route to take as it involves inserting/deleting a helix that lies between two existing helices.

While the automatically generated trees are in broad agreement with the Efimov tree, there are details, involving the desirability of making particular insertions/deletions in core positions that need to be further examined. In the current work we have not attempted to make any further modifications to the basic algorithm but have, rather, assessed its behaviour on a well characterized group of protein structures. If it is assumed that Efimov provides a ‘gold-standard’ for the relationships between structures, it would now be possible to modify the algorithm (adding further constraints or relaxing existing ones) to optimize the behaviour of the algorithm on its ability to regenerate the trees of Efimov. Rather than pursue this route, a more fundamental investigation would involve an attempt to derive the underlying constraints from the data. Using the automated approach described here, it will be possible to calculate trees of structures rapidly for each formulation of the rules, so allowing these to be varied until the most parsimonious tree is obtained.

The primary application of this approach is to impose a minimal hierarchical description on the relationship of protein folds in a way that clearly states the assumptions and ‘rules’ that have been applied. While the resulting classification is in itself of value, it is also interesting to speculate whether the relationships might have resulted from a corresponding series of evolutionary events. If this were so then the ‘ancestral’ nodes that have no equivalent known structure might correspond to relic structures that are yet to be discovered. Alternatively, it might be postulated that the resulting order reflects the constraints of similar folding pathways—with the accretion of secondary structure imitating the assembly of the protein structure as it folds (as originally postulated by Ptitsyn and colleagues). These two options may even be simultaneously true: in the way in which ontogeny recapitulates phylogeny in embryonic development, so the protein folding pathway might equally recapitulate its evolutionary history.

For the further application of the method to a large set of protein structures, the problem of secondary structure definition will need to be addressed. For this, a method that is insensitive to the fine details of hydrogen-bonding is required and we are investigating the use of a line-segment based approach for this (Taylor, 2001Go). More importantly, for the method to become fully automatic, it will also be necessary to have a robust method to define protein topology. As mentioned above, the matching of idealized forms can be used for this (Taylor, 2002bGo) but this introduces the added complexity that there is not always a unique match of a form to a structure. To overcome this problem, the current method has the potential to be extended to consider alternative topology strings derived from multiple forms. Thus, rather than force the choice of a ‘best’ match for each protein, the matches that give rise to the best path can be selected.

In this work, we have presented the first automatic method to cluster protein structures by their fold using a hierarchical tree organization with the representation of ‘ancestral’ nodes. We have shown that, in outline, the resulting trees correspond well with those compiled by an expert and that some differences arise both through differing emphasis on the permitted ‘rules’ of evolution and some from inconsistent manual application of these rules. The evolutionary significance of the resulting tree and its ‘ancestral’ structures is difficult to evaluate but as the tree has been compiled without bias and without reference to function, the subsequent analysis of a larger number of structures with reference to their function may provide some insights.


    Acknowledgements
 
We thank Beatrice Gorinsky for her support and encouragement during this project. L.O.J. was supported by BBSRC and the work formed part of the requirements for fulfilment of the MSc course in Crystallography at Birkbeck College, London.


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Artymiuk,P.J., Rice,D.W., Mitchell,E.M. and Willett,P. (1990) Protein Eng., 4, 39–43.[ISI][Medline]

Efimov,A.V. (1993) Prog. Biophys. Mol. Biol., 60, 201–239.[CrossRef][ISI][Medline]

Efimov,A.V. (1997) Proteins, 28, 241–260.[CrossRef][ISI][Medline]

Finkelstein,A.V. and Ptitsyn,O.B. (1987) Prog. Biophys. Mol. Biol., 50, 171–190.[CrossRef][ISI][Medline]

Flores,T.P., Moss,D.S. and Thornton,J.M. (1994) Protein Eng., 7, 31–37.[ISI][Medline]

Flower,D.R. (1998) Protein Eng., 11, 723–727.[CrossRef][ISI][Medline]

Gilbert,D., Westhead,D., Nagano,N. and Thornton,J.M. (1999) Bioinformatics, 15, 317–326.[Abstract/Free Full Text]

Hadley,C. and Jones,D.T. (1995) Structure, 7, 1099–1112.

Harrison,A., Pearl,F., Mott,R., Thornton,J. and Orengo,C. (2002) J. Mol. Biol., 323, 909–926.[CrossRef][ISI][Medline]

Holm,L. and Sander,C. (1997) Nucleic Acids Res., 25, 231–234.[Abstract/Free Full Text]

Koch,I., Kaden,F. and Selbig,J. (1992) Proteins, 12, 314–323.[ISI][Medline]

Mitchell,E.M., Artymiuk,P.J., Rice,D.W. and Willett,P. (1989) J. Mol. Biol., 212, 151–166.[ISI]

Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) Protein Sci., 7, 2469–2471[Abstract/Free Full Text]

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.[CrossRef][ISI][Medline]

Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.[ISI][Medline]

Ptitsyn,O.B. and Finkelstein,A.V. (1980) Q. Rev. Biophys., 13, 339–386.[ISI][Medline]

Taylor,W.R. (2000) Biochem. Soc. Trans, 28, 264–269.[ISI][Medline]

Taylor,W.R. (2001) J. Mol. Biol., 310, 1135–1150.[CrossRef][ISI][Medline]

Taylor,W.R. (2002a) In Mewes,B. and Weiss,H.S. (eds), Bioinformatics and Genome Analysis. Springer-Verlag, Berlin, Ernst Schering Research Foundation Workshop Vol. 38, pp. 133–148.

Taylor,W.R. (2002b) Nature, 416, 657–660.[CrossRef][ISI][Medline]

Taylor,W.R. (2002c) Mol. Cell. Proteomics, 1, 334–339.[Abstract/Free Full Text]

Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 1–22.[ISI][Medline]

Received June 19, 2003; revised September 13, 2003; accepted October 21, 2003





This Article
Abstract
FREE Full Text (PDF)
Alert me when this article is cited
Alert me if a correction is posted
Services
Email this article to a friend
Similar articles in this journal
Similar articles in ISI Web of Science
Similar articles in PubMed
Alert me to new issues of the journal
Add to My Personal Archive
Download to citation manager
Request Permissions
Google Scholar
Articles by Johannissen, L. O.
Articles by Taylor, W. R.
PubMed
PubMed Citation
Articles by Johannissen, L. O.
Articles by Taylor, W. R.