The ups and downs of protein topology; rapid comparison of protein structure

Andrew C.R. Martin

School of Animal and Microbial Sciences, University of Reading, Whiteknights, P.O. Box 228, Reading RG6 6AJ, UK. E-mail: a.c.r.martin{at}reading.ac.uk


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Protein topology can be described at different levels. At the most fundamental level, it is a sequence of secondary structure elements (a `primary topology string'). Searching predicted primary topology strings against a library of strings from known protein structures is the basis of some protein fold recognition methods. Here a method known as TOPSCAN is presented for rapid comparison of protein structures. Rather than a simple two-letter alphabet (encoding strand and helix), more complex alphabets are used encoding direction, proximity, accessibility and length of secondary elements and loops in addition to secondary structure. Comparisons are made between the structural information content of primary topology strings and encodings which contain additional information (`secondary topology strings'). The algorithm is extremely fast, with a scan of a large domain against a library of more than 2000 secondary structure strings completing in ~30 s. Analysis of protein fold similarity using TOPSCAN at primary and secondary topology levels is presented.

Keywords: CATH/protein fold/structure comparison/structural similarity


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
As rapidly increasing quantities of new structure data become available, the classification of protein folds is becoming more and more important. When a new structure is solved, one wishes to ask the question of whether this fold has been seen before. If the fold is one of the commonly occurring superfolds (e.g. immunoglobulin fold, TIM barrel, {alpha}ß-plait, Rossmann fold), then it can generally be recognized by eye, but the less common folds are more difficult to recognize. Outside these superfolds, there is generally a one-to-one mapping between fold and homologous family. In some cases, remote homologues may only be recognised in the light of structural similarity allowing inference of functional information. Automated servers to answer the classification question have recently been developed and rely on structure comparison programs.

Unfortunately, detailed automatic structure comparison is a time-consuming process. The double-dynamic programming algorithm used in SSAP (Taylor and Orengo, 1989Go) at the atomic coordinate level is particularly computer intensive making it impractical to run a full scan of a protein of any size against a library of known folds. An experimental `CATH-server' at University College London (Orengo et al., 1998Go) has made use of sequence screening and a hierarchical expansion scheme through the representative levels in the CATH domain classification database (Orengo et al., 1997Go) to reduce the search, but in the worst-case scenario, it could still be necessary to scan against more than 3000 near-identical sequence representatives. It is estimated that ~2 weeks of computer time would be required to scan a large protein domain such as a TIM barrel in this way and this is clearly impractical for use as a server. The initial goal of this work therefore was to develop a rapid approximate method for protein fold comparison which could be used as a pre-screen for SSAP.

The CATH classification of protein domain structures is a hierarchical classification encoding class (C), architecture (A), topology (T), homology (H), sequence family (S), near- identical sequence (N) and identical sequence (I). The use of the word `topology' in this description is something of a misnomer. When we refer to the topology of a protein, what we generally mean is the three-dimensional fold. More strictly, for a given spatial arrangement of secondary structure elements (SSEs), the topology describes how these elements are connected.

The dictionary definition of `topology' is `the factors which remain unchanged as an object undergoes a continuous deformation'. In terms of protein secondary structure, the true topology is simply the sequence of SSEs, i.e. if one imagines being able to hold the N- and C-terminal ends of a protein chain and pull it out straight, the topology does not change whatever the protein fold (providing no knots are formed in the folded protein). Here, we describe this as the `primary topology' while, by analogy with primary and tertiary structure, the protein fold is described as the `tertiary topology'. A primary topology string is a sequence of E and H characters representing ß-strand and {alpha}-helix in DSSP notation (Kabsch and Sander, 1983Go).

When a sequence for a protein of unknown structure does not show obvious sequence homology with a protein of known structure, fold recognition is commonly used to give clues as to the three-dimensional structure. There are two common approaches to this problem: one is `threading' which assesses how well a sequence is accommodated within a three-dimensional structure (e.g. Jones et al., 1992); the other is alignment of a predicted sequence of secondary structure assignments (Rost et al., 1997Go) or SSEs (Russell et al., 1996Go) against a fold library encoded in this form. The latter fold recognition method is therefore making the assumption that the tertiary topology can be predicted from the primary topology. An analysis of the occurrence of given primary topologies in different tertiary topologies is presented.

If this assumption is true, then it should also be possible to use primary topology strings for three-dimensional structure comparison. However, given the extra information available within a three-dimensional structure, one can introduce an intermediate level of topology `secondary topology' which contains additional information (such as SSE direction, proximity, accessibility and length of the elements and the loops which connect them) to improve the mapping between these lower levels of topological description and tertiary topology (i.e. protein fold).

A number of other methods of protein structure comparison have made use of a simplified representation of protein structure since the concept was first suggested by Sheridan et al. (1985). For example, Mitchell et al. (1990) looked for common sub-structures by using a linear representation of helices and strands to create a graph amenable to comparison using subgraph isomorphism. Kleywegt and Jones (1997) described a protein as a set of SSEs together with number of residues and physical length of each SSE and used matrices of distances between the centres of the SSEs and cosines of angles between the direction vectors of the elements. These representations were then compared to find similar motifs.

Other methods similar in principle to TOPSCAN include FAST-SSAP (Taylor and Orengo, 1989Go), a constraint programming (CP) method based on TOPS diagrams (Gilbert et al., 1999Go) and TOP, a least-squares fitting method (Lu, 2000Go). FAST-SSAP uses double dynamic programming to align vectors representing the SSEs to compare two structures. The CP method uses a four-letter alphabet to represent the two types of secondary structure going either up or down and adds information about chirality and hydrogen bonds between SSEs. Comparison is performed by finding a maximum common template and scoring the two structures against that template. TOP represents each SSE by its two end-points and then performs iterative least-squares fitting of this small number of points to find an optimal match.

The major difference between the method presented here and other methods is that the representation of SSEs is purely symbolic. This has a distinct speed advantage and means that only a single algorithm (dynamic programming) is needed for performing the comparison of structures. No comparisons are needed at the level of coordinates or distances. Here an analysis of protein fold similarity at the primary and secondary topology levels is presented and the ability of these to predict similarity at the tertiary topology (protein fold) level is assessed. Implications of the results for fold recognition and secondary structure prediction are discussed.


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
TOPSCAN reduces a protein to a topology string which can represent the structure as a string of letters encoding simply the primary topology (a two-letter alphabet) or the secondary topology using various additional data from the 3D structure. For example, by encoding direction information with the secondary structure, a 12-letter alphabet is used.

A simple Needleman and Wunsch dynamic programming algorithm (Needleman and Wunsch, 1970Go) is then applied to compare two such strings and a percentage similarity score is calculated. The percentage similarity score is calculated as the percentage of the higher score achieved by scoring each of the two sequences against itself. Alternatively, rather than comparing two structures, a library of topology strings may be pre-built from the Protein Databank (or a representative subset) and a structure may then be scanned against this library.

The TOPSCAN program itself is implemented in C. All analysis and control of the programs was achieved using sets of scripts written in Perl and driven from a relational database of the CATH data implemented using the freely available object-relational database PostgreSQL (http://www.postgresql.org). Analysis was performed on a dual PentiumIII-450 machine running RedHat Linux 6.0.

Datasets

All analysis was performed using datasets derived from the CATH classification of protein domain folds (Orengo et al., 1997Go). In CATH, representative structures are assigned at T, H, S and N levels. For analysis of parameters to be used in the program, the representatives from the H level (Hreps) from CATH v1.6 were selected, but from this set, any structures now obsoleted from the protein databank (Berman et al., 2000Go) and any NMR structures were removed. This gave a set of 914 domains representing a set of non-homologous protein domains.

For testing the performance of the program, a similar set was created using the CATH Nreps leading to a set of 3124 domains. Each Nrep is a representative of a near identical group of sequences having >=95% sequence identity. These were used to build the libraries against which each of those 3124 domains was tested (matches against self were always ignored in calculating results).

Creating primary topology strings

TOPSCAN enables secondary structure to be calculated from a three-dimensional structure using either DSSP (Kabsch and Sander, 1983Go) or STRIDE (Frishman and Argos, 1995Go). For the analyses presented in this paper, STRIDE was used. From these assignments, regions of ß-sheet (Kabsch and Sander assignment, E) and of {alpha}-helix (Kabsch and Sander assignment, H) are extracted. Only continuous regions of at least a specified number of residues with the same assignment are selected. This produces the primary topology of the protein equivalent to a string of E and H characters where one character represents one complete strand or helix. The program also allows one to treat 310-helix assignments as {alpha}-helix as is frequently done in secondary structure prediction.

Creating secondary topology strings

To increase the information content of the topology string, various information from the three-dimensional structure can be incorporated into a secondary topology string.

For simplicity and clarity of explanations, topology strings are described here as vectors of characters. In practice, integer vectors are used to allow more than 52 descriptors. This also provides a technical simplification as lookups in the scoring matrix may be made simply by the integer offsets for the two descriptors being compared.

Secondary structure element (SSE) direction

The end-points of each SSE in the primary topology are found in three dimensions and the vector between them is calculated. The direction of the vector is grouped into one of six classes depending on the largest component of the vector (i.e. positive or negative x, y or z). This is equivalent to saying the element points up, down, left, right, forward or back. The encoding is summarized in Table IGo.


View this table:
[in this window]
[in a new window]
 
Table I. Encoding scheme used to represent secondary structure and direction information
 
The scoring matrix used for the dynamic programming algorithm is based on the scoring scheme shown in Table IIGo. Although somewhat arbitrary, it appears to work well: the same secondary structure in the same orientation scores highest, different orientations score worse and different secondary structure types achieve much lower scores.


View this table:
[in this window]
[in a new window]
 
Table II. Scoring scheme employed in the matrix for the dynamic programming comparison of two topology strings
 
The starting premise was that for the same type of secondary structure, one wishes to assign three scores: approximately the same direction (0° <= {Delta} <= 45°) scores 10, approximately at right-angles (45° <= {Delta} <= 135°) scores 5, approximately opposite (135° <= {Delta} <= 180°) scores 2. However, because of the boolean definition of whether a vector is in a given quadrant, it is possible that vectors actually point in very similar directions although they are in different quadrants (for example, one points at +44° while another points at +46°; because the definition of quadrant depends on the largest component of the direction vector, the quadrant boundaries are at ±45° and ±135°). Picking pairs of random numbers between 0–90 and 90–180 and plotting the distribution of the differences shows an upside-down V-shaped curve centred around 90 (data not shown). Two vectors in adjacent quadrants will actually be within 90° of one another 50% of the time. The actual score used for adjacent quadrants is therefore the median of 10 and 5 (7.5 rounded up to 8).

When two vectors are assigned to opposite quadrants, they can never be closer than 90°. Picking pairs of random numbers between 0–90 and 180–270 gives an identical distribution centred around 180. The angle between two vectors can never be greater than 180°, so every angle <180° observed between vectors in opposite quadrants can also be seen in vectors assigned to adjacent quadrants. In opposite quadrants, the angle between two vectors is <135° (our cutoff for saying two vectors are approximately at right angles) 12.5% of the time. The score is therefore assigned as 2 + [(5 – 2)x12.5/100] = 2.375, which is rounded back down to 2. For different secondary structure types, we make the simple arbitrary assignments of 3, 1 and 0. A gap insertion penalty of 8 is used with no gap extension penalty.

Because any pair of proteins is in an arbitrary relative orientation, the definition of `up' in one protein may not correspond to `up' in the second. Therefore, one of the strings is permuted 23 times, such that the dynamic programming algorithm comparison is performed a total of 24 times (equivalent to the six sides of a cube, each of which may be four ways up). Table IIIGo shows the modifications made to the encoding to achieve rotations about the x, y and z axes.


View this table:
[in this window]
[in a new window]
 
Table III. Permutations to topology strings
 
Secondary structure element (SSE) proximity

To add information about the packing of SSEs, the proximity of an SSE to the preceding element is encoded. This is performed as follows (see Figure 1Go).



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 1. Calculation of proximity of an SSE to the preceding element. (a) is considered proximal if da, db, dc or dd is <12.0 Å; (b) is considered proximal if da or dd is <12.0 Å; (c) is not considered proximal because the nearest points to A, B, C and D (shown by black circles) are not between the secondary structure end points.

 
Given an element with endpoints C,D and a preceding element with endpoints A,B, the minimum distances between points A and B and the line described by CD are calculated. This is repeated with points C and D to line AB. As each minimum point-to-line distance is calculated, the position on the line closest to the point is calculated (shown in Figure 1Go with a black circle). If this point is between the two endpoints of the line and the distance is <12.0 Å, then we say that the element C,D is proximal to the preceding element and encode this within the topology strings. The N-terminal element does not have a proximity parameter assigned to it.

The scores in the similarity matrix are multiplied by 0.75 and rounded down to the next integer for mismatches in proximity. This also applies to the other items encoded in secondary topology strings. Note that all cutoffs use single values, so the penalty of a mismatch (a reduction of the score by 25%) is only small.

Accessibility

Using the Hrep fold library, the mean residue accessibility was calculated for all residues in helices and in strands as assigned by STRIDE. This uses the algorithm of Eisenhaber and co-workers (Eisenhaber and Argos, 1993Go; Eisenhaber et al., 1995Go). Since absolute, rather than relative, accessibility is used, an element of amino acid composition is also taken into account. The mean was calculated for each SSE and then averaged over all elements. The values obtained depend on the minimum allowed length for an SSE: three residue helices, 52.8 Å2; four residue helices, 52.8 Å2; three residue strands, 33.8 Å2; four residue strands, 32.4 Å2. The values for helices also change if 310-helices are treated as {alpha}-helices (three residue cutoff, 56.5 Å2; four residue cutoff, 53.3 Å2).

SSEs with mean residue accessibilities less than these mean cutoffs are assigned as buried, others are exposed.

Element length

Mean helix and strand lengths were calculated on the same basis using STRIDE secondary structure assignments. The mean helix length is 12.5 residues while the mean strand length is 5.4 residues. If an element is longer than the appropriate cutoff, it is defined as `long', otherwise `short'.

If 310-helix assignments are treated as {alpha}-helix, then the mean for helix length falls to 10.5 residues as a result of the large number of short 310-helices. The mean length of a 310-helix was also calculated and found to be 3.3 residues.

Loop length

Mean loop lengths were calculated in the same way and assigned to the element which follows the loop. The first SSE therefore does not have a loop length parameter assigned to it. The mean loop length was 6.9 residues. In this case, 310-helix assignments are treated as loop. If, instead, they are treated as helix, the mean loop length was 5.7 residues.


    Results and discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Assignment of secondary structure (primary topology)

The assignment of secondary structure is the critical first stage. We have found that the DSSP algorithm (Kabsch and Sander, 1983Go) can be over-sensitive to errors in the structure. For example, NMR structures often show little secondary structure in DSSP assignments whereas an intuitive visual inspection shows substantial secondary structure. The DSSP algorithm works by assigning a cutoff to hydrogen-bonding energies and finding patterns of hydrogen bonds characteristic of secondary structures. Any hydrogen bonds which fail to meet these criteria are excluded and secondary structure assignment may therefore be less than optimal.

The STRIDE software from Frishman and Argos (1995) was designed to reduce this sensitivity. However, using STRIDE in place of DSSP makes little difference to the overall results (data not shown). The TOPSCAN software allows either DSSP or STRIDE secondary structure assignments to be used and STRIDE was selected for all further analysis.

We explored the effects of setting the minimum required number of consecutive residues assigned to a given secondary structure to three and four residues. We also looked at treating 310-helix residues as the same as {alpha}-helix (data not shown). This made little difference when the length cutoff was set to four, but made results significantly worse when the length cutoff was set to three. This is because most instances of true 310-helix are less than four residues in length (as shown above, the mean length of a 310-helix is 3.3 residues). These are then treated as {alpha}-helix and compared with regions of true {alpha}-helix. This suggests that treating 310-helix as {alpha}-helix is not a good strategy and this has implications for those who do this in secondary structure prediction (e.g. King et al., 2000). It appears that it would be better to treat 310-helix as coil in secondary structure prediction, particularly when this is being used for fold recognition.

Assignment of secondary topology

In addition to secondary structure and the direction of SSEs, the following factors were considered in defining secondary topology: accessibility, proximity, element length and loop length. The performance is assessed in terms of percent coverage versus percentage error, specifically taking error rates of 1 and 5%. The dataset used was the CATH Nreps with NMR structures excluded. Error and coverage at a given TOPSCAN score were calculated as (number of false hits/total number of hits) and (number of true hits/total true hits), respectively, and coverage values at 1 and 5% error rates were calculated by linear interpolation of the coverage versus error plots in the ranges 0.5–1.75 and 3.0–8.0, respectively. This gives a sample of approximately eight points evenly spread above and below the interpolation point. These results are summarized in Table IVGo.


View this table:
[in this window]
[in a new window]
 
Table IV. Summary of effects of adding information to the secondary topology strings
 
As can be seen in Table IVGo, the best coverage results are achieved with the secondary structure cutoff set to three and with accessibility, proximity, element length and loop length information in the secondary topology strings. Interestingly, at the 5% error rate, loop length has a detrimental effect, presumably because the loops are the most variable factor between proteins adopting the same fold.

Does primary topology predict protein fold?

Using the CATH classification, each protein fold containing more than one near-identical sequence family (i.e. each CAT level containing more than one Nrep) was examined. Within each of these 292 groups, the mean TOPSCAN score was calculated for all pairs of Nreps using only the primary topology (secondary structure cutoff length set to three). The frequency of scores obtained is shown in Figure 2aGo. Unsurprisingly, the highest peak is for a score of 95–100% (see Table VGo), i.e. within a protein fold, the primary topology is generally identical. As shown in the table, similar results were obtained when trivial matches between very small structures containing fewer than 4 SSEs were ignored.




View larger version (30K):
[in this window]
[in a new window]
 
Fig. 2. Frequency of scores achieved using primary topology (a) within a given protein fold and (b) when comparing a given fold with every other fold.

 

View this table:
[in this window]
[in a new window]
 
Table V. High-scoring comparisons
 
Every Nrep within each protein fold (CAT) was then compared with every Nrep of every other protein fold (a total of 4 688 807 comparisons) and the distribution of the scores was plotted (Figure 2bGo). Initially one would expect that different folds will have a low score in this comparison, but 14.2% of the comparisons score >65% with 0.66% scoring >95% (see Table VGo), the distribution peaking at a score of 35–40%. However, when one considers, for example, 4-helix bundles as shown in Figure 3Go, it becomes obvious that there are multiple folds containing the same set of SSEs. The figure shows two 4-helix bundles from bovine acyl-coenzyme A binding protein (Kragelund et al., 1993; PDB code, 1aca) and cytochrome c' from Rhodospirillum molischianum (Finzel et al., 1985; PDB code, 2ccy). At the primary topology level, both would be described as HHHH (note that 2ccy has a short segment of 310-helix between the second and third main helices).




View larger version (139K):
[in this window]
[in a new window]
 
Fig. 3. Two examples of 4-helix bundles: (a) PDB code 1aca and (b) PDB code 2ccy chain A. Both would have the same primary topology description of HHHH, but they show different topologies (CATH topology assignments 1.20.80 and 1.20.120, respectively). The four helices are labelled A–D working from N-terminus to C-terminus. Images were generated with Molscript 2.1.2 (Kraulis, 1991Go).

 
Thus, by knowing only the primary topology (perhaps from secondary structure prediction), there is a high chance of matching the incorrect protein fold with an identical primary topology. These results contrast with the protein taxonomy based on secondary structure derived by Przytycka et al. (1999) which is based purely on primary topology strings. They calculate a similarity score on the basis of the common number of residues matched when a pair of SSEs is aligned. From the results presented here, it appears that the authors were either lucky to avoid false positive matches in the data set they used or that the scoring on the basis of length is a very good filter.

Does secondary topology predict protein fold?

As with comparison of primary topology, each protein fold containing more than one near-identical sequence family was examined. Within each of these groups, the TOPSCAN score was calculated using the secondary topology with a secondary structure length cutoff of three and including accessibility, proximity, element length and loop length information (Figure 4aGo, Table VGo). The distribution moves to the left compared with using primary topology and the peak for 95–100% is lost. This is largely a result of the single cutoffs used to define secondary topology. Scores are artificially reduced where two examples lie close to, but on either side of, the cutoff. The distribution peaks are around 55–60%.




View larger version (26K):
[in this window]
[in a new window]
 
Fig. 4. Frequency of scores achieved using secondary topology (a) within a given protein fold and (b) when comparing a given fold with every other fold.

 
Comparison between protein folds (Figure 4bGo) now peaks at a score of 15–20% and all the high-scoring peaks are removed. Only 0.15% of comparisons score >65% (Table VGo).

Referring again to Figure 3Go, although the primary topology for the two folds is the same (HHHH), by following the chain trace, it is clear that the directions of the helices are different (1aca, up, down, down, up; 2ccy, up, down, up, down) and incorporating this information into the secondary topology strings will distinguish between the two folds.

Interestingly both the comparisons of primary and secondary topology within a fold show one comparison with a very low score (<10%). These structures (domain 3 from PDB files 1dar (al-Karadaghi et al., 1996Go) and 1elo (Aevarsson et al., 1994Go)) were examined using RasMol and are shown in Figure 5Go. They represent one domain from elongation factor G (EF-G) from Thermus thermophilus strain HB8 complexed and uncomplexed with GDP, respectively. Whereas 1dar contains only ß-strands, 1elo has two poorly defined strands and a helix. The sequences of these two are identical, yet the fold is different as a result of conformational flexibility of EF-G on binding nucleotide. It should also be noted that the structures are poorly defined in this region with some residues not visible in the crystal structures. Thus some of the differences may reflect errors in the structures although, since both structures are solved by the same group, it is unlikely that they would have built one with an {alpha}-helix and the other without unless there were a real difference.




View larger version (69K):
[in this window]
[in a new window]
 
Fig. 5. Domain three of elongation factor G (EF-G) from Thermus thermophilus strain HB8 (a) from crystal structure 1dar in complex with GDP and (b) from crystal structure 1elo of the free protein. The structures are shown with the N-terminus (residue 433) on the left and the C-terminus (residue 483) on the right. Note the discontinuities resulting from some residues not being visible in the x-ray structures.

 
TIM barrels and Rossmann folds

At the primary topology level, the TIM barrel and the Rossmann fold are somewhat similar—in both cases, they consist in the main of alternating ß-strands and {alpha}-helices which form a core of ß-sheet with helices on the outside. In the case of the TIM barrel the ß-sheet is curved into a barrel whereas the Rossmann fold has a relatively flat sheet. Because the differences between the two occur in a plane perpendicular to the direction of the SSEs, the directional element of the secondary topology assignments is also similar. However, differences do occur in the accessibilities and lengths of the elements. It is interesting that in Rose's taxonomy based on secondary structure (Przytycka et al., 1999Go), the flavodoxin and flavodoxin-like Rossmann folds are well separated from the TIM barrels. However, one of the structures which they place in the middle of the TIM barrel cluster (2cmd: lactate and malate dehydrogenase) is split in the CATH classification into two domains, the first of which is assigned as a Rossmann fold.

Table VIGo shows three examples of TIM barrels and Rossmann folds having almost identical primary topology. Although in these cases the secondary topology similarities are much lower, it can be seen that the scores are relatively high for such different protein folds although, in practice, they are comparable with the SSAP scores.


View this table:
[in this window]
[in a new window]
 
Table VI. Examples of TIM barrels and Rossmann folds with similar primary and secondary topology
 
Comparison of TOPSCAN with other rapid methods

As cited in the Introduction, a number of other methods have used a reduced representation of protein structure to allow rapid comparison. The main difference between TOPSCAN and other methods is that the representation is purely symbolic and this simple approach allows a very significant speed advantage. It is approximately 25 times faster than the current implementations of FAST-SSAP (Taylor and Orengo, 1989Go) and CP (Gilbert et al., 1999Go) and is estimated to be approximately 30 000 times faster than normal SSAP. FAST-SSAP could be speeded by calculating the secondary structure vectors simply from the end-points of the SSEs rather than using an eigenvector.

Compared with FAST-SSAP and other methods which rely upon the endpoints of SSEs, the simple segregation of secondary structure vectors into six quadrants throws away a lot of more detailed information about the relative orientation of the elements, but as a result, allows us to perform normal single dynamic programming, removing the need for double dynamic programming.

Compared with CP, the TOPSCAN representation actually has more information about direction of the secondary structure vectors, but loses the more detailed proximity information (encoded in the constraint programming by hydrogen-bond information) and the chirality arcs.

The authors of TOP (Lu, 2000Go) do not provide directly comparable results. However, since the method relies not only on the direction of the SSEs, but also on the precise location of the endpoints, it seems likely that the method will be particularly sensitive to the secondary structure definition. It is well known that assignment is particularly problematic at the ends of the secondary structures and minor variations between structures can cause one or two residues at each end to be (or not to be) included in an SSE. This could cause a difference in end-point location of >6.0 Å. This is less of a problem with DEJAVU (Kleywegt and Jones, 1997Go) where the matrix of distances between midpoints of the SSEs and angles between the SSEs are used. While the endpoints of the SSEs are used by TOPSCAN to define the direction, their precise location is less important. They are also implicitly used by TOPSCAN to define the length of an SSE and of a loop, but since these are simply taken as `long' or `short' rather than using an exact length, only a small percentage of SSEs or loops (those of length close to the cutoff) will be likely to be misclassified.

The best results achieved with TOPSCAN (~26.5% coverage at 1% error and ~52% coverage at 5% error) compare with ~60% (at 1% error) and ~78% (at 5% error) for full SSAP and ~36% (at 1% error) and ~58% (at 5% error) for CP (Gilbert et al., 2000Go). Given the large increase in speed achieved over these methods, TOPSCAN performs well and is particularly valuable as a preliminary screen.

Performance of TOPSCAN as a prescreen for more detailed comparison

The coverage versus error figures shown in Table IVGo give a standard measure for the performance of the method in structure comparison. As described above, these results are similar to the CP method at a 5% error rate.

If the method is to be used as a pre-screen for a more detailed and computationally more demanding method such as SSAP, one is more interested in being able to obtain true positives (correct hits) with a minimum of false positives (incorrect hits). In summary, error rates at 90, 95 and 99% coverage are 29.02, 39.15 and 60.82%, respectively (it is impractical to aim for 100% coverage owing to odd cases such as domain 3 from 1elo and 1dar where the proteins are identical but have different folds as well as possible errors in CATH).

Figure 6Go shows the distribution of the maximum TOPSCAN score for a correct hit against another protein domain of the same fold. (As before, each protein fold containing more than one near-identical sequence family was examined and, for each of these families, the maximum TOPSCAN score for a correct hit against another protein domain with the same fold was recorded.) It is clear that a number of correct hits have low maximum scores. Therefore, it is recommended that if TOPSCAN is to be used as a filter for other methods, that it is not used to exclude structures, but to sort the structures for presentation to a detailed method such as SSAP with the most likely matches first. As soon as SSAP finds a good hit, the search through the ordered list is terminated. This procedure has been implemented using an early version of TOPSCAN (encoding only secondary structure type and direction) in the CATH-server (Orengo et al., 1998Go).



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 6. Distribution of best TOPSCAN scores obtained for each CATH Nrep matching another Nrep of the same fold (i.e. in the same CAT family).

 
Conclusions

TOPSCAN has proved a useful addition to the available methods for comparing protein structures. TOPSCAN was initially developed as a simple method to reduce the search space for the CATH-server. In that implementation, the secondary topology strings contain only the directional information and it is used as a rapid method of ranking structures for further more detailed comparison using SSAP.

Analysis of the CATH data using TOPSCAN has shown some interesting anomalies as well as a few minor errors in the CATH data which have been fed back to the CATH maintainers. Thus TOPSCAN also appears to have a role in rapid validation of structural classification.

Our analysis of factors affecting the performance of TOPSCAN has suggested that treating 310-helix as {alpha}-helix in secondary structure prediction is a bad strategy, especially when the results are to be used for fold recognition.

In common with other methods which reduce a three-dimensional structure to a set of secondary structure elements with or without associated vector information, TOPSCAN is most likely to fail when a structure is poorly defined and DSSP or STRIDE are unable to make reliable secondary structure assignments. The atomic level full SSAP is less prone to these problems, but pays a huge penalty in the time required for comparison.

Future work will include investigating more complex classification of properties such as solvent accessibility. Currently only binary classifications are used (e.g. exposed versus buried). In the case of ß-strands, specific allowance could be made for edge strands. Changing the global Needlemann and Wunsch dynamic programming algorithm to a local Smith–Waterman alignment will also allow for sub-structure matching. Given the current analysis of factors which affect the accuracy of comparison of structures, this information can feed back into fold recognition and the algorithm described here, when combined with methods for prediction of the factors used in the secondary topology string, will be of application in fold recognition.

A TOPSCAN server has been made available at http://www.bioinf.org.uk/topscan/ and at http://www.rubic.rdg.ac.uk/topscan/.


    Acknowledgments
 
I thank Christine Orengo, Frances Pearl and Janet Thornton for making the CATH data available and for their support during the first part of this work, which was funded by departmental funds at University College London as part of the development of the CATH-server.


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Aevarsson,A., Brazhnikov,E., Garber,M., Zheltonosova,J., Chirgadze,Y., al-Karadaghi,S., Svensson,L.A. and Liljas,A. (1994). EMBO J., 13, 3669–3677.[Abstract]

al-Karadaghi,S., Aevarsson,A., Garber,M., Zheltonosova,J. and Liljas,A. (1996). Structure, 4, 555–565.[ISI][Medline]

Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000). Nucleic Acids Res., 28, 235–242.[Abstract/Free Full Text]

Biou,V., Dumas,R., Cohen-Addad,C., Douce,R., Job,D. and Pebay-Peyroula,E. (1997). EMBO J., 16, 3405–3415.[Abstract/Free Full Text]

Eisenhaber,F. and Argos,P. (1993). J. Comput. Chem., 14, 1272–1280.[ISI]

Eisenhaber,F., Lijnzaad,P., Argos,P. and Scharf,M. (1995). J. Comput. Chem., 16, 273–284.[ISI]

Finzel,B.C., Weber,P.C., Hardman,K.D. and Salemme,F.R. (1985). J. Mol. Biol., 186, 627–643.[ISI][Medline]

Frishman,D. and Argos,P. (1995). Proteins: Struct. Funct. Genet., 23, 566–579.[ISI][Medline]

Gilbert,D., Westhead,D., Nagano,N. and Thornton,J.M. (1999). Bioinformatics, 15, 317–326.[Abstract/Free Full Text]

Gilbert,D., Westhead,D., Viksna,J. and Thornton,J.M., (2000) In Proceedings of the Symposium on Artificial Intelligence in Bioinformatics, Annual Convention of the Society for the Study of Artificial Intelligence and the Simulation of Behaviour (AISB-00). University of Birmingham, 11–17.

Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992). Nature, 358, 86–89.[ISI][Medline]

Kabsch,W. and Sander,C. (1983). Biopolymers, 22, 2577–2637.[ISI][Medline]

King,R.D., Ouali,M., Strong,A.T., Aly,A., Elmaghraby,A., Kantardzic,M. and Page,D. (2000). Protein Eng., 13, 15–19.[Abstract/Free Full Text]

Kleywegt,G.J. and Jones,T.A. (1997). Methods Enzymol., 277, 525–545.[ISI]

Kragelund,B.B., Andersen,K.V., Madsen,J.C., Knudsen,J. and Poulsen,F.M. (1993). J. Mol. Biol., 230, 1260–1277.[ISI][Medline]

Kraulis,P.J. (1991). J. Appl. Crystallogr., 24, 946–950.[ISI]

Lu,G. (2000). J. Appl. Crystallogr., 33, 176–183.[ISI]

Mitchell,E., Artymiuk,P.J., Rice,D.W. and Willett,P. (1990). J. Mol. Biol., 5, 151–166.

Needleman,S.B. and Wunsch,C.D. (1970). J. Mol. Biol., 48, 443–453.[ISI][Medline]

Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997). Structure, 5, 1093–1108.[ISI][Medline]

Orengo,C.A., Martin,A.C.R., Hutchinson,E.G., Jones,S., Jones,D.T., Michie,A.D., Swindells,M.B. and Thornton,J.M. (1998). Acta Crystallogr., D54, 1155–1167.[ISI]

Przytycka,T., Aurora,R. and Rose,G.D. (1999). Nature Struct. Biol., 6, 672–682.[ISI][Medline]

Rost,B., Schneider,R. and Sander,C. (1997). J. Mol. Biol., 270, 471–480.[ISI][Medline]

Russell,R.B., Copley,R.R. and Barton,G.J. (1996). J. Mol. Biol., 259, 349–365.[ISI][Medline]

Sheridan,R.P., Dixon,J.S. and Venkataraghavan,R. (1985). Int. J. Pept. Protein Res., 25, 132–143.[ISI]

Spezio,M., Wilson,D.B. and Karplus,P.A. (1993). Biochemistry, 32, 9906–9916.[ISI][Medline]

Story,R.M., Weber,I.T. and Steitz,T.A. (1992). Nature, 355, 318–325.[ISI][Medline]

Taylor,W.R. and Orengo,C.A. (1989). J. Mol. Biol., 208, 208–229.

Vidgren,J., Svensson,L.A. and Liljas,A. (1994). Nature, 368, 354–358.[ISI][Medline]

Wilmanns,M., Priestle,J.P., Niermann,T. and Jansonius,J.N. (1992). J. Mol. Biol., 223, 477–507.[ISI][Medline]

Wilson,D.K., Bohren,K.M., Gabbay,K.H. and Quiocho,F.A. (1992). Science, 257, 81–84.[ISI][Medline]

Received August 11, 2000; revised October 10, 2000; accepted October 16, 2000.





This Article
Abstract
FREE Full Text (PDF)
Alert me when this article is cited
Alert me if a correction is posted
Services
Email this article to a friend
Similar articles in this journal
Similar articles in ISI Web of Science
Similar articles in PubMed
Alert me to new issues of the journal
Add to My Personal Archive
Download to citation manager
Search for citing articles in:
ISI Web of Science (9)
Request Permissions
Google Scholar
Articles by Martin, A. C.R.
PubMed
PubMed Citation
Articles by Martin, A. C.R.