From the Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, United Kingdom
![]() |
ABSTRACT |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
The level of description of protein structure at which the greatest simplification can be achieved with the least loss of important topological information is when secondary structure elements (SSEs) 1 are represented as line segments. This gives an order-of-magnitude reduction in the volume of data, which in some algorithms can lead to significant increases in speed (6, 7). With this level of description it is practical to represent the interactions between secondary structures as a graph and to apply graph-theoretic methods to find common substructures in proteins (810). The principal method used in these approaches is to identify subgraph isomorphisms between the two secondary structure interaction graphs from each protein. This is a reasonably difficult problem, especially when the desired solution is to find the maximal common subgraph. Most methods use domain heuristics to guide a branch-and-bound search based on the algorithms of Ullmann (11) or Bron and Kerbosch (12).
In this work, the matching of protein SSEs is approached using the simpler class of bipartite graph-matching algorithms. These are designed to find an optimal pairing-up of SSEs, but do not attempt to constrain these in a coherent network as is required in the isomorphism algorithms. Despite this limitation, it will be demonstrated that good solutions can be found that, because of their speed, will prove most useful as a prefilter on the selection of proteins to be compared by a more exhaustive comparison method.
![]() |
MATERIALS AND METHODS |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() | (Eq. 1) |
In this equation, x is the distance between two points that lie equidistant from the end points of the contact normal between the two lines. The parameters b and d are the distance cutoff (b) beyond which the Gaussian decay is applied with a damping factor determined by d. The values of b (for base) and d (for decay) will be adjusted below.
|
Solvent-accessible Surface Area Changes
The interactions of the segments of protein structure corresponding to the SSEs were measured using the change in the solvent-accessible surface area observed when each segment was removed from the intact structure. This approach follows earlier studies (13, 14) but used the Definition of Secondary Structure in Proteins program (15) to calculate the solvent areas rather than the original program of Lee and Richards (16).
The accessible surface area (summed over each residue) was calculated first using the intact protein (length N) giving a set of residue areas {C1 . . . CN}. The two segments were then removed in turn, and the areas were recalculated, giving two further sets of residue areas {A1 . . . AN} and {B1 . . . BN} (in which the residue numbering of intact protein is retained). If the first segment runs from an . . . ac and the second runs from bn . . . bc, then the combined effect of removing each segment on the other (summed area change (SAC)) can be found as follows in Equation 2.
![]() | (Eq. 2) |
Note that (within the error of the area calculation) the areas of the intact protein (Ci) are always less than those after removal of a segment. In addition, the linear segments were always separated by at least one residue to avoid covalent bonded surfaces being exposed and counted.
Calibrating Segment Packing
Continuous Packing Classes
Each segment was characterized only by the length/residue (rise) along the segment axis (17). This can be extended to pairs of segments as Rij = ri + rj, where r is the rise along the axis of the segment for segments i and j. Plotting the combined rise of both segments against their interaction strength (Fig. 2a) shows that the , ßß, and ß
classes remain sufficiently distinct and that little information is lost by reducing the pair of values to one.
|
Matching Segment Packing
The previous section developed a geometric measure of line segment interaction that gives a reasonable estimate of true interaction of the full atomic coordinates of the segments (as measured by solvent-accessible surface area). In the forthcoming section it is investigated whether this measure will provide a useful basis for the comparison of protein structures when represented as line segments.
Secondary Structure Packing Plot
For two segments, i and j, their combined rise (Rij) can be plotted against their interaction strength as estimated by the IOA measure. This gives a very quick visualization of the type of protein in terms of its secondary structure packing (or architecture). Furthermore, the comparison of two of these plots can give a rough measure of the similarity of the packing between two proteins. Without resolving the specific identity (or sequence order) of the SSEs, it is possible to match up similar interactions. For example, a small ß protein (5nul) with two helices above a five-stranded sheet and three helices below will have three
interactions and four ßß interactions plus various ß
interactions. When plotted with a structurally equivalent protein (1fx1), corresponding interactions are apparent to the eye (Fig. 3).
|
Trials on matrices of random numbers with various sizes and distributions of values suggested that the SMA generally found a solution around 5% less optimal than the Hungarian algorithm. Given the rough nature of the comparison, this was not considered to be an unacceptable result, and with its more efficient time dependence, the SMA algorithm was adopted in the studies below. Indeed, on a limited number of trials on real data, the SMA performed almost equally as well as the Hungarian algorithm (see below).
Interaction Match Score
The matching algorithms discussed above require a metric between points in the packing plot. This should have the properties that strong interactions are paired up and that SSEs of unlike type are not matched. If the interaction (IOA) between two SSEs, i and j, in protein A is designated as IijA, and their combined rise values (ri + rj) is RijA, then an equivalent pair (m and n) in another protein, B will have values ImnB and RmnB . To ensure that both interactions are significant, the match of these pairs of SSEs was quantified by the product of their individual interactions weighted by W as follows.
![]() | (Eq. 3) |
The difference in secondary structure types (r = RijA - RmnB) should lead to a weaker match as the difference increases. This inversion was implemented using a Gaussian switch function of the form, shown in Equation 4,
![]() | (Eq. 4) |
such that when there is no difference, then the value of s is 1. The factor ws (which corresponds to the standard deviation in the normal distribution bell curve) modifies the effect of the transform and will be optimized below.
Two further geometric quantities that were readily available from the calculation of the overlap areas were the distance between the lines (specifically, the closest approach of the line segments) and the angle between them. 3 The difference in these quantities was inverted as above, giving a distance score d and an angle score a, each with their associated weighting factor, wd and wa, respectively. Similarly the difference in packing was also considered in the form of a similar score p and weight wp.
These quantities were all combined in an overall score (S) as shown in Equation 5.
![]() | (Eq. 5) |
Thus, if the secondary structure types, distances, angles, and packing are a close match, then the value of S will approach P (the combined interaction), but if any one quantity should differ markedly, then the value of S will fall toward zero.
The value of S for each pair of interactions compared between two proteins provides the preferences on which the stable marriage algorithm operates and the resulting sum of the S values over all matched pairs of interactions gives the measure of similarity between the two proteins (referred to below as the score).
Data Selection
As described above, the method embodies four adjustable parameters, along with the weight W on the combined packing. It is desirable to optimize these for typical proteins. This was done by selecting a set of true relationships and a set of false relationships and varying the parameters to optimize the separation between the scores obtained on each set.
The set of true relationships was taken as all pairwise relationships between proteins defined as belonging to a family in the HOMSTRAD data bank (20). Similarly, false relationships were taken as the interfamily pairs in the HOMSTRAD data bank. As there are many more interfamily pairs, these were limited to the same number of intrafamily pairs, giving slightly over 5000 in each set.
Dividing True from False
Both from theory and trials, the size of the score was expected, and found, to be in rough proportion to the square of the number of matched pairs, and its square root was plotted correspondingly as a function of the number of pairs. When matching elements in objects of different size, the number of matched pairs is often taken as the normalizing quantity. In the current application, as in sequence alignment, this would be the size of the smaller protein. However, with the current method, when matching a small protein against a large one, the former is almost sure to find a better match for its components as the size of the latter increases. To model this aspect, the geometric mean of the lengths was taken, which retains the property that the expected score is zero if one protein has zero components and increases slowly (as a function of the square root of the difference) as one protein grows larger than the other.
On this plot, the line was then found that divided optimally the true scores from the false score. This was determined as the sum of the number of true scores lying below the dividing line and the number of false scores lying above the line. This quantity is often referred to as the sum of false-negatives and false-positives or FN + FP. For computational efficiency, this sum was not minimized as this would involve a search, dealing with multiple minima. Instead, the minimum sum was found when FN = FP (±1), which is more easily located and is referred to below as the balanced sum.
![]() |
RESULTS |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
For the stable marriage algorithm the elapsed computer time on an otherwise quiet 733-MHz Pentium III processor was 8 min for the 4000 comparisons (0.12 s per comparison). This includes filling the score matrices and their sorting (into ranked order). The time for the Hungarian algorithm was just four times greater, but comparison of the scores found by the two algorithms revealed that there was almost no difference, suggesting that the stable marriage algorithm performs better on real data compared with the trials with random number fields described above.
Parameter Optimization
The fast execution time of the SMA allowed the parameter space to be explored extensively. The weight on the packing match (W in Equation 3) was varied from 0 to 0.5 by intervals of 0.1, whereas the Gaussian factor on the packing difference (wp in Equation 4) was varied from 1/20 to 20 in rough doubling intervals (0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0, and 20.0). For the best values found with these combinations, all the other parameters were raised or lowered by a factor of two, and the process was repeated until the minimum value in packing parameter space was not lowered by displacement of any of the other parameters. The minimum solution was found when W = 0.2 and wp = 10, lying in a smooth shallow basin (Table I). At this point, the other parameters were ws = 2, wa = 5, and wd = 0.05 for secondary structure, angle matching, and distance matching, respectively. The low weight on the distances reflects their redundancy with the packing measure. These results were generated using the signed angles (incorporating chirality) and were slightly better than those generated using the unsigned angle (data not shown).
|
|
Protein Structure Classification
One of the clear applications of a fast structure comparison method is the pairwise comparison of all known structures (21). This results in lists of similarities that can be ordered partially into a tree structure. However, because so many of the similarities are effectively random, the data do not conform to a metric space, and the resulting trees (or other visualizations) do not provide a unique overall representation of the data. This problem has been approached recently in a different way by using idealized protein structure representations (stick models) that can be arranged in something like the periodic table of elements (22). These ideal protein forms, however, must accommodate many arrangements of SSEs and are therefore very numerous (currently over 12,000) and still require a fast-matching algorithm. Their comparison with native protein stick structures is suited ideally to the current SMA method as, initially, the sequential order of the SSEs is not considered (23).
Each native protein structure (reduced to sticks) was matched against a collection of idealized forms, and the score obtained from the current algorithm was plotted against the number of sticks in the match. The results of this application are shown for the small ß chemotaxis-Y protein (PDB code 3chy) in Fig. 5. The native protein is matched against the set of forms with three secondary structure layers (
-helices packed on either side of a ß-sheet) and the set with four layers (
-helices packed on either side of a ß-sandwich). The former class to which the native protein corresponds is clearly favored, and the ideal form corresponding to the native structure that has two helices above and three helices below a five-stranded sheet (25-3) scores second highest.
|
![]() |
Conclusions |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
The algorithm has been applied to the comparison of the PDB against a large number of idealized structures (forms) (22) and should also be useful as a pre-filter on more conventional large structure comparison problems. It will prove to be particularly suitable when these involve fake or decoy structures or large numbers of automatically generated models (25), because its focus at the SSE level makes it insensitive to conformational detail. A further application is to incorporate the algorithm into an existing heuristic algorithm to improve the selection of potentially equivalent SSEs that could then be refined at the more detailed residue level (24).
![]() |
ACKNOWLEDGMENTS |
---|
![]() |
FOOTNOTES |
---|
Published, MCP Papers in Press, March 4, 2002, DOI 10.1074/mcp.T200001-MCP200
1 The abbreviations used are: SSE, secondary structure element; IOA, inverse overlap area; SAC, summed area change; SMA, stable marriage algorithm; FN, false-negative(s); FP, false-positive(s).
2 The contact normal between two extended lines is the (unique) line that is perpendicular to both. For two line segments, a b and c
d, running in the directions x = b - a and y = d - c, the direction of their contact normal is z = x x y. A matrix (M) can then be formed from x, y, and z (referred to as the basis vectors) as Mx = x, My = y, and Mz = z. If the line segments are joined by e = d - a, then the end points of the contact normal (g and h) can be found from the components of the basis vectors needed to get from a to d via the contact normal. That is, e = fM, where f contains the required coefficients and M contains the basis vectors. The components of f can be solved for as f = eW (where W is the inverse of M) and then obtaining the displacements along each line as x' = fx · x and y' = fy · y, giving g = a + x' and h = d + y'. Note: x indicates the vector (or cross-) product whereas · is a simple product (and not the dot product). Note also that the ends of the contact normal need not lie within either line segment. The situation of exactly parallel line segments does not arise in natural proteins but can be encountered in artificial (idealized) models where the situation is treated separately.
3 Two types of angles were considered. The first was the simple (unsigned) angle between the two lines as calculated from their scalar (dot) product in the range 0
. The second had the angle signed by the chirality of the pair as lines right = 0
and left = 0
-
. The difference in the latter angles was calculated as a - b or 2
- a - b when a - b >
.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
To whom correspondence should be addressed. Tel.: 44-02089138552; Fax: 44-02089138545; E-mail: wtaylor{at}nimr.mrc.ac.uk
![]() |
REFERENCES |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|