MolCom: a method to compare protein molecules based on 3-D structural and chemical similarity

S.D. O’Hearn1, A.J. Kusalik2,3 and J.F. Angel4

1 National Research Council of Canada, Plant Biotechnology Institute, Saskatoon, Saskatchewan, S7N 0W9 and Departments of 3 Computer Science and 4 Biochemistry, University of Saskatchewan, Saskatoon, Saskatchewan, S7N 0W0, Canada

2 To whom correspondence should be addressed. E-mail: kusalik{at}cs.usask.ca


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
This paper describes an improved method for conducting global feature comparisons of protein molecules in three dimensions and for producing a new form of multiple structure alignment. Our automated MolCom method incorporates an octtree strategy to partition and examine molecular properties in three-dimensional space at multiple levels of analysis. The MolCom method’s multiple alignment is in the form of an octtree which locates regions in three-dimensional space where correspondence between molecules is identified based on a dynamic set of molecular features. MolCom offers a practical solution to the inherent compromise between computational complexity and analytical detail. MolCom is currently the only method that can analyze and compare a series of defined physicochemical properties using multiple, simultaneous levels of resolution. It is also the only method that provides a consensus structure outlining precisely where the similarity exists in three-dimensional space. Using a modest-sized collection of structural properties, separate experiments were conducted to calibrate MolCom and to verify that the spatial analyses and resulting structure alignments accurately identified both similar and dissimilar structures. The accuracy of MolCom was found to be over 99% and the similarity scores correlated strongly with the z-scores of the Alignment by Incremental Combinatorial Extension of the Optimal Path method.

Keywords: 3-D protein structure comparison/MolCom


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Automated protein structure comparison is fast becoming a necessity given the enormous growth in the number of known protein structures and with the emergence of high-throughput structural genomics. Methods in widespread use involve interstructural distance deviations, distance matrices, clustering and topology. Interstructural distance deviation algorithms use dynamic programming to minimize a measurement of structural deviation between matched {alpha}-carbon atoms of amino acid residues. Examples include root-mean-square deviation (r.m.s.d.) and Area Functional with Fit Comparison (AFFC) (Falicov and Cohen, 1996Go). r.m.s.d. and AFFC are effective and fast quantitative measures of similarity (Falicov and Cohen, 1996Go), but optimal scores do not guarantee true biological significance (Alexandrov and Fischer, 1996Go). Distance matrix algorithms build matrices containing quantitative values about relationships internal to each structure, such as contact patterns for each residue (Taylor and Orengo, 1989Go; Sander and Holm, 1993Go). To improve the overall score, either progressively more complex matrices are constructed using Monte Carlo methods (Taylor and Orengo, 1989Go; Holm and Sander, 1993Go) or paths representing pairs of aligned structural fragments in the original matrices are extended (Shindyalov and Bourne, 1998Go). The DALI method (Holm and Sander, 1993Go) and the Alignment by Incremental Combinatorial Extension of the Optimal Path method (CE) (Shindyalov and Bourne, 1998Go) are Web-accessible programs based on distance matrices. Three-dimensional clustering algorithms find groupings of structural elements arranged similarly in space [collectively called the common structural core (Sander and Holm, 1996Go)]. The Web-accessible programs SARF2 (Alexandrov et al., 1992Go; Alexandrov and Fischer, 1996Go) and the Vector Alignment Search Tool (VAST) (Gilbrat et al., 1996Go) compare vector representations of Spatial ARrangements of backbone Fragments (SARFs). Topology is a relaxation of the geometric requirement that all distances between portions of an object must remain invariant for the object to maintain its identity (Mezey, 1993Go). The representation of topological structure can be simplified for efficient searching and topology matching by two-dimensional ‘TOPS protein topology cartoon’ diagrams (Gilbert and Westhead, 1998Go). The diagrams represent secondary structural elements (SSEs) which can be aligned and scored using constraint logic programming.

In essence, conventional methods for comparing proteins attempt to co-locate similar portions of the molecules using features or properties associated with a single level of analysis; to date, no methods analyze and compare a series of defined molecular properties (be they structural, chemical or otherwise) using multiple, simultaneous levels of resolution and none provide a compact consensus structure outlining precisely where the similarity exists in three-dimensional space. The MolCom method simultaneously satisfies all of these shortcomings.

This paper addresses the problem of how to analyze and compare protein structures rigorously so that several levels of detail can be incorporated into a comprehensive three-dimensional ‘spatial map’ of structural and chemical similarity. Our MolCom method spatially decomposes protein molecules for comparison, similarity detection and categorization. It uses an octtree space-partitioning strategy to explore recursively and represent trends in a dynamic collection of molecular features whose composition changes systematically with the level of analysis. The collection of properties examined and their relative contributions to the assessment of similarity at each level of analysis can be tailored by the investigator. The fundamental goal is to explore and summarize variations in the structure and chemistry of the molecules through space. Features appropriate to the volumes under analysis are compared at every iteration of the algorithm. Measurements taken from overlapping regions of space are summarized at discrete aggregation points. The summarized information is used, in turn, to construct an octtree which identifies those cubic regions containing similar molecular features. The octtree is particularly efficient at summarizing information essential for the characterization and topological localization of molecular properties observed within and around the space occupied by the co-located protein structures being compared. Moreover, the octtree constitutes a multiple alignment of the spatial regions containing molecular similarity (O’Hearn, 2000Go; O’Hearn and Kusalik, 2000Go) and provides a compact three-dimensional representation of equivalenced portions of the molecules which reveals a common structural core (Sander and Holm, 1996Go). A collection of octtrees creates a rapid, efficient comparison space for automated classification of newly resolved molecules.

A computer application, MolCom3D, has been developed which successfully applies the MolCom method to protein molecules (the name ‘MolCom’ will refer to both the algorithm and the MolCom3D implementation hereafter, except where indicated otherwise). Separate experiments were conducted to calibrate MolCom and to verify that the spatial analyses and resulting alignments accurately identified both similar and dissimilar molecules (O’Hearn, 2000Go). Using pairs of structures with varying degrees of known similarity, the accuracy of MolCom was found to be over 99% in classifying structures as ‘similar’ or ‘dissimilar’. MolCom’s results were also correlated with scores from CE and compared qualitatively with results from DALI and VAST and with detailed visual inspections of the orientations of secondary structural elements. The similarity scores correlated strongly with the z-scores of the CE method (Shindyalov and Bourne, 1998Go). MolCom was highly sensitive even when molecules had extremly low sequence identities (below 4%). On the basis of the qualitative visual comparisons, we consider MolCom’s similarity scores to be ‘chemically’ and ‘biologically’ accurate insomuch as collections of physicochemical properties, which ultimately govern folding and function, are appropriately matched during the comparison.

In the next section, a number of concepts for comparing structures in three dimensions are described, including octtree spatial maps, hierarchical multiple structure alignments and appropriate levels of analysis for structural and chemical properties. These concepts are then used in presenting the details of the MolCom method and its calibration and verification methodologies. The results of calibration and verification tests follow. Finally, in the last section, the advantages of the MolCom method and future research directions are discussed.


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
In this section, important concepts underlying the algorithm are explicitly defined. Then the algorithm and the methodology for its calibration and verification testing are described.

Our technique requires a number of parameters and functions (called ‘settings’ hereafter) to be established. Some settings were simply assigned reasonable values based on background knowledge, whereas others required empirical calibration through controlled comparisons using proteins of known similarity. Once an acceptable calibration was achieved, the calibrated method was verified by comparing results from the MolCom and CE methods using a representative sample of proteins.

Definitions

The MolCom method extends prevailing notions of octtrees, multiple structure alignments, levels of analysis and approriate levels of analysis. We now present these definitions before delving into the details of the method itself.

Definition of the octtree spatial map The MolCom method is a three-dimensional instance of our recently developed Structure Comparison Framework for comparing N-dimensional objects using hypercubes (O’Hearn, 2000Go). In three dimensions, the hypercubic subdivision specified in the framework reduces to a cubic mesh structure represented by an octtree. Based on operational definitions compiled from the literature (Foley et al., 1990Go), we give the following general definition of an octtree:

An octtree is a hierarchical data structure which represents and locates aggregated feature information within a cubic region of space. It is constructed in accordance with a recursive spatial decomposition of each cubic region into eight uniform cubic sub-regions. These sub-regions are bounded by the three (orthogonal) dividing planes and the original (planar) faces of the cubic region being subdivided.

The octtree is neither the actual subdivided cubic region nor the resulting mesh structure. Rather, it is a logical structure in the form of a tree representing and locating features of interest which are found whenever the corresponding cubic regions are similarly broken down. The actual data structure used to implement the octtree can exist in any convenient form (e.g. binary tree, octary tree, 2–3 tree, etc.). The octtree simply requires each of the eight sub-cubes to be represented by a corresponding subtree. The MolCom method considers the octtree to be a special case of a binary space partition (BSP) tree (Foley et al., 1990Go; O’Hearn, 2000Go).

Figure 1Go shows a spatial decomposition of successive cubic regions (left-hand side) and a corresponding octary tree representation of the octtree (right-hand side).



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 1. Four levels of analysis of an octtree spatial decomposition are shown. The left column (top to bottom) shows the space examined from the first to the fourth level of analysis. All properties are considered to emanate from the ‘aggregation points’ at the centers of the cubes. The right column shows the corresponding octtrees which encode information about the properties radiating from those aggregation points.

 
Traversing the tree is equivalent to locating cubes, sub-cubes, sub-sub-cubes and so forth, represented in the tree. Cubic regions of space are represented in the tree if the overall trends in specific chemical and structural properties are equivalent in those regions for all molecules compared using that tree. The precise geometric mapping from the origin to the center of a given cube represented in the tree, along with its dimensions, are defined precisely in terms of a recurrence relation specified in our original Structure Comparison Framework (O’Hearn, 2000Go) on which the MolCom method is based.

Definition of analytical appropriateness We informally define the level of analysis as the degree of physical detail or conceptual abstraction chosen for the investigation or description of a system. A level of analysis can be considered appropriate if the degree of physical detail (or conceptual abstraction) succinctly embodies the essential characteristics and functionality of some portion of the system under investigation. Current comparative methods apply a fixed level of analysis. For example, r.m.s. deviation and distance matrices consider amino acid residue positions (Holm and Sander, 1993Go; Collaborative Computational Project No. 4, 1994Go; Shindyalov and Bourne, 1998Go), clustering is based on backbone fragments (Alexandrov et al., 1992Go; Gilbrat et al., 1996Go) and topological algorithms concentrate on a particular level of structure (Gilbert and Westhead, 1998Go). However, the hierarchical nature of protein structure warrants investigation from a variety of perspectives adopting different levels of analysis. To date, no method for protein comparison examines molecules in a recursive manner while changing the features considered salient over the progression of the algorithm. We devised the MolCom method to fill this void, focusing its comparisons on appropriate molecular features at every level of analysis over the progression of the algorithm.

Definition of the hierarchical multiple structure alignment Current methods for detecting protein similarity involve the construction of some form of multiple structure alignment which reveals regions of similarity based on three-dimensional structure. Our method is no exception. However, the notion of what constitutes a multiple structure alignment varies considerably with the approach used to build it, ranging from alignments of {alpha}-carbon atoms to alignments of secondary structural elements. Consequently, we have extended the definition of the multiple structure alignment by considering the principal purpose of such alignments, namely to identify general regions of correspondence between proteins in a way that is chemically and biologically meaningful. Our extended definition reads as follows:

A multiple structure alignment of k objects is a computational entity representing the collection of spatial locations (or other referencing identifiers) mapped to regions deemed to coincide in all k objects and deemed equivalent with respect to the consideration of (one or more) properties within and around those regions in all k objects.

Correspondence (coincidence) does not pertain exclusively to sequence or to a given type of structure, but rather to any comparable property that arises from the structural constitution of the molecules (consequently, the word ‘structure’ has been retained in ‘multiple structure alignment’). In particular, the spatial localities from which properties are considered (measured, predicted, conceptualized) need not coincide geometrically with the spatial locations actually attributed with the measurements. This abstraction allows for physically reasonable but insightful departures from the purely geometric bounds of the objects being compared. Such geometric departures have been found to increase sensitivity to similarity between non-rigid objects (O’Hearn, 2000Go; O’Hearn and Kusalik, 2000Go) and are characteristic of topological methods such as the TOPS-based structure comparison method (Gilbert and Westhead, 1998Go). The MolCom method matches hierarchical sets of features in building its octtrees and the resulting octtree-based alignments are referred to as hierarchical multiple structure alignments.

Dynamic levels of analysis and property weighting curves

Over successive levels of analysis, the depth of the octtree increases and property information is aggregated from cubes having edge lengths which are half as long as in the previous level. To promote analytical appropriateness, the MolCom method defines a set of property weighting functions that regulate the extent to which the various types of molecular properties contribute to the overall similarity score as the level of analysis increases (see Figure 2Go). The weighting functions are meant to approximate the relative contributions of empirical properties at the volumes of space corresponding with each level of analysis.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 2. Property contribution weighting functions.

 
For instance, at L = 1, cubes may have edge lengths of 256 Å and the presence and orientations of ß-strands [a member of the ‘Group 1(a)’ properties] are heavily weighted, whereas the number of charged residues [a member of the ‘Group 2(a)’ properties] are only moderately considered. The cubic regions that are further explored (as a result of their contents being similar) are subdivided into 64 Å sub-cubes and then into 32 Å sub-sub-cubes and so on. Within cubes of, say, 4 Å at L = 7, ß-strands contribute much less to the similarity score and charged residues are heavily weighted. This is because the long-range property, ß-strand, has already been largely accounted for in the larger volumes. MolCom consequently maintains an appropriate level of analysis at every comparative iteration. The merits of this approach are accentuated if one considers the consequences of conducting the above analysis in reverse – charged residues at high volumes, ß-strands at low volumes – or if one considers not applying a dynamic level of analysis at all. Further aspects of Figure 2Go are discussed in subsequent sections.

Octtrees constructed using this hierarchically organized approach to property analysis constitute information-rich hierarchical multiple structure alignments. The octtree-based alignments are, in effect, consensus structures which reveal structurally and chemically relevant similarity based on the defined collection of properties used in the comparative analyses.

Operation of the MolCom algorithm

In general, the MolCom algorithm operates as follows. Properties in corresponding regions of space for each molecule are compared. Regions found to contain similar properties are recursively analyzed in further detail. As the resolution becomes finer, the collection of properties examined changes so that the volumes of the regions under analysis are matched with properties expected to dominate molecular structure and chemical activity. An octtree spatial map is constructed during the comparison which precisely locates those cubic regions of space where, collectively, the properties match. Moreover, the octtree spatial map can be annotated with information about the properties deemed similar in those regions.

The details of the method are as follows. It is assumed that the molecules are pre-aligned in approximately the same orientation. [In earlier implementations of MolCom, molecules under comparison were first rotated and translated into initial alignment orientations by a program called lsqkab, a protein crystallographic program from the Collaborative Computational Project (CCP4) (Collaborative Computational Project No. 4, 1994Go). With subsequent development of the algorithm, the initial alignment function was incorporated into the implementation. Currently the alignment is performed using repeated orienting applications of MolCom, starting at 40° increments and then refining the best 40° orientations to within 20, 10 and then 5° of the optimal orientation giving the best similarity score. The authors are currently perfecting a better incremental pre-alignment method (see Discussion section).] A starting cube of edge length |H0| is defined such that it is large enough to enclose fully the largest molecule under investigation. Over successive iterations of the algorithm or levels of analysis, L, this cube is recursively subdivided into eight uniform sub-cubes of length |HL| and comparative information is summarized while an octtree is constructed. A fully constructed octtree locates only those cubic regions of space where collections of properties have been found equivalent. Further subdivision of a given sub-cube terminates when an octant decision score falls under a certain exploration threshold value. An octtree is considered fully constructed if no further similarity can be found for the collection of molecules or if the maximum height Lmax has been reached. As a consequence of the algorithm, if one molecule is vastly larger than the other, the molecules will be deemed dissimilar early on in the analysis.

The results of all property measurements for a given sub-cube (or equivalently, octant) are collated and assigned to a point located at the center of the sub-cube called the aggregation point (the points at the centers of the cubes shown in Figure 1Go). The properties currently examined by the MolCom method are listed in Table IGo. The nine properties are categorized into groups ‘1a’ and ‘2a’ (more influential) and groups ‘1b’ and ‘2b’ (less influential).


View this table:
[in this window]
[in a new window]
 
Table I. Properties compared in MolCom
 
The analysis of additional or alternative properties can be readily added to the method to enable, for example, comparisons of force fields, solvent accessible regions, surface topologies and so on. The contribution from a property, p, in the property list (Table IGo) is related to the level of analysis using a sigmoid function, Wp(L) (see Figure 2Go). The family of functions Wp(L) is defined in Table IIIGo and the coefficients used in the MolCom method are listed in Table IIGo. The coefficients were derived from mathematical software (Hyams, 1997Go) to give functions which allow properties to be ‘blended’ over successive levels of analysis. A property-specific scalar, wp, affects the property’s relative scoring contribution Wp(L) uniformly over all levels of analysis. That is, wp controls the ‘amplitude’ of the property. To add an additional property type, Q, say, all that needs be defined is a weighting function WQ(L) indicating the contribution of Q to the overall analysis relative to the other properties over successive levels of analysis and a uniform scalar wQ. The function WQ can be any type, it need not be sigmoid.


View this table:
[in this window]
[in a new window]
 
Table III. Parameters of the MolCom method

 

View this table:
[in this window]
[in a new window]
 
Table II. Coefficients of the property weighting functions
 
Every property analyzed is assigned one or more observation points which represent points in space defined to give rise to that property (e.g. properties such as {alpha}-helix orientations are represented by more than one observation point in space). Observation points are used in assigning property measurement scores to neighboring aggregation points. Properties are thought to emanate from observation points in space and to interact with one or more aggregation points in the vicinity; the interaction strength decreases with distance as shown in Figure 3Go. All interactions are summarized at the aggregation points.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 3. The fringe locality of an aggregation point is the volume inside the larger cubic region shown but outside the inner cubic region surrounding A. A potential observation point P is shown situated in this fringe locality. Measurements from P to be summarized at A are scaled by D(r). The size of the overlapping cubic localities and the corresponding number of A–P interactions are limited by E(L).

 
A limit on the distance of interaction between observation points and aggregation points is imposed through the definition of overlap functions, E(L). An overlap function defines a volume of space which is at least as large as the octant and which contains at its center the aggregation point of the octant. Any property measurements taken from observation points within this overlapping spatial domain are aggregated to all aggregation points contained within it. Measurements within this region but outside of the octant itself (the fringe region) are scaled by a decay function (measurements inside the cubic region can be thought to be scaled by unity). The decay function, D(r), defined in Table IIIGo, distributes measurements to aggregration points in proportion to volume by considering the distances, r, between a given aggregation point and the observation points of the detected property, as shown in Figure 3Go. Considerable accuracy is achieved by the algorithm when the overlap function is varied over successive trials and the results are averaged.

The MolCom method indicates the extent to which trends in the defined collection of properties match in the compared molecules by way of a similarity score. The score is meant to account fairly for sparse distributions of properties in space. A significant portion of the volume containing a molecule consists of empty space. While this empty space should be considered in the similarity scores, it should not be allowed to dominate the scoring, lest every molecule be found nearly 100% similar. Thus our similarity score is cumulative over the levels of analysis and is adjusted for the amount of volume under analysis. The similarity of two molecules for a given level of analysis, L, is given by the cumulative volume-adjusted (CVA) score defined by the following relation:


(1)

where

CVA0 does not exist;

TL = total number of octants examined at level L;



The CVA score at level L > 1 is the sum of two terms. The first term scales the ratio of similar octants to total octants examined (RL) by the proportion of the volume actually found to contain properties (TL/8TL- 1). Every octant in a region where RL is above the exploration threshold is further subdivided in the next iteration. The second term awards the remaining proportion of the volume whose octants contain no properties [1 – (TL/8TL- 1)] the prevailing similarity score from the previous level of analysis (RL- 1). The octants in this region are not further subdivided because they contain no discernible properties. The factor (1 – S) governs the extent to which empty space influences the score. The influence of empty space increases as S is increased. S was set to 0 during calibration, but subsequent informal tests found S = 0.15 to give slightly better scores with large molecules. The overall similarity score (simply referred to as the CVA score) is the geometric mean of the level-dependent CVA scores:


(2)

By definition, a CVA score at or above 70% is meant to indicate similarity; dissimilar molecules should never be awarded CVA scores of 70% or more. MolCom was calibrated on the basis of this similarity boundary.

Implementation

The MolCom method for proteins was implemented as a C++ computer application, called MolCom3D, along with auxiliary UNIX shell scripts and a Perl/CGI Web interface. MolCom3D takes as inputs two protein coordinate files (currently in PDB format). It produces the octtree during the comparative process, the best CVA similarity score found and two new PDB coordinate files with the molecules co-located at their centers of mass and rotated into the positions giving the highest CVA scores. A script file is also produced which automatically loads and superimposes the generated files into VMD (Humphrey et al., 1996Go) for viewing.

Calibration of the MolCom algorithm

The MolCom algorithm required several constants and functions (‘settings’) to be determined in accordance with the rules of the Structure Comparison Framework (O’Hearn, 2000Go) from which it was derived. These are summarized in the first two columns of Table IIIGo. Some of these parameters could be assigned values whereas others required empirical calibration using protein structural data. On the basis of background knowledge (Lehninger et al., 1993Go; Shortle, 1998Go) and practical experience, it was straightforward to assign values for settings controlling the general exploratory behaviour and the overall scoring contribution scheme for similarity detection. These included |H0|, Lmax, D(r), S and the functional coefficients of Wp(L), i.e. a, b, c and d. To enable MolCom’s exploratory behavior and property scoring contributions to be tuned for proteins, the remaining settings, including E(L), T and the scalars wp, were assigned values based on calibration. The objective of calibration testing was to determine at least one set of parameter assignments that would result in the generation of reasonable CVA similarity scores based on criteria described below.

Calibration testing was conducted using pairs of structures drawn from the ‘knottin’ and ‘snake toxin’ superfamilies (folds) (Murzin et al., 1998Go). These superfamilies offered a collection of medium-sized proteins with roughly equal sizes (60–70 residues) which were convenient for calibration purposes. The calibration based on medium-sized proteins was expected to extend to large proteins; this extension was validated with subsequent verification testing.

MolCom made a total of 1950 comparisons using 78 prealigned pairs of protein structures with varying degrees of similarity. The 78 pairs consisted of all possible (132) combinations of structures with PDB identifiers 1AHO, 1LQH, 1NRA, 1PTX, 2SN3, 1COD (‘knottins’) and 1COD, 1CRE, 1ERA, 1FAS, 1KBS, 1NXB, 2CDX, 2CRT (‘snake toxin-like’). The comparisons involved five cubic overlap settings, k, of E(L), ranging from 100 to 200% in increments of 25%, and five exploration thresholds, T, ranging from 50 to 70% in 5% increments (O’Hearn, 2000Go). The protein pairs were assigned to two groups based on their confirmed similarity; similar pairs were assigned to Group ‘S’ and dissimilar pairs were assigned to Group ‘D’. Similarity was confirmed if the r.m.s.d. values were below 10 Å, the proteins came from the same SCOP classifications and the presence and orientations of SSEs matched on the basis of visual inspection; pairs were confirmed to be dissimilar otherwise.

A limit of seven cubic subdivisions was imposed for each execution of MolCom since the CVA score was found not to change appreciably beyond this level during previous experimentation. Using the best parameter assignments for k and T, further comparisons were conducted using four different scalar multiplier (wp) combinations for the functions Wp(L) (Figure 2Go) associated with each property group. The multipliers are given as a 4-tuple (w1a,w1b,w2a,w2b). The 4-tuples tested were (1,1,1,1) (shown in Figure 2Go), (1,2,1,1), (1,1,1,2) and (1,2,1,2). For instance, applying combination (2,1,1,1) to the curves shown in Figure 2Go would cause helices and sheets to affect the similarity score four times as much as turns and loops (instead of two times) and charged and polar residues to have twice the effect of aromatic and aliphatic residues (as already shown in the figure).

The validity of the CVA scores was assessed as follows. First, the CVA scores were expected to demonstrate a tendency to increase monotonically from 0 to 100% as the degree of similarity of the calibration structures rose from extreme dissimilarity to identity. The correlation of the CVA scores with the minimum r.m.s.d.s of the structures was used as an approximate measure of this tendency. A successful calibration required a correlation of -0.8 or less. Second, the number of individual errors committed in awarding CVA scores was to be minimized. An error was committed whenever a CVA score below 70% was awarded to a pair of proteins from group ‘S’ (similar structures were deemed dissimilar) or whenever a score of 70% or more was awarded to proteins from group ‘D’ (dissimilar structures were deemed similar). This was referred to as the ‘segregation rule’ and was in effect only during calibration wherefor a special, visually inspected, bipartite collection of structures was used. The number of mismatches based on the ‘segregation rule’ was tallied. It was required that at least 95% of the protein pairs be correctly classified as ‘similar’ or ‘dissimilar’ according to the ‘segregation rule’. Finally, the difference between the averaged CVA scores from Groups ‘S’ and ‘D’ was to be maximized. This difference is referred to as the ‘selectivity’ of the CVA score.

The results of calibration testing are reported in the Results section.

Verification of the MolCom method

The objective of verification testing was to confirm that the calibrated MolCom method produced meaningful and accurate CVA scores (with at least as much accuracy as the currently available automated methods) and that MolCom’s alignments of superimposed structures, based on a quantitative analysis of trends in molecular properties through space, would be as good as alignments produced by manual manipulation of structures using visualization software (such as Turbo Frodo, MolMol, Setor or VMD). The verification was performed on data different from that used for calibration.

A collection of 1054 pairs of structures (distinct from those used during calibration testing) was formed from four superfamily classifications, including ‘knottins’ (1AYJ, 1BCG, 1BMR, 1BRZ, 1CHZ, 1CN2, 1DQ7:A, 1DQ7:B, 1FH3, 1GPS, 1I6F, 1LQQ, 1SNB, 2B3C), ‘snake toxins’ (1CRD, 1CTX, 1CVO, 1FRA, 1FSC, 1KBT, 1NEA, 1NOR, 1NTN, 1QKE, 2CRS, 2CTX, 3EBX), ‘lysozymes’ (1AT5, 1AZF, 1HEL, 1HER, 1JPO, 1LYY, 1LZG, 1OUG, 1UIF, 1UIH, 1WQR, 2HEC) and ‘microbial ribonucleases’ (1BU4, 1DET, 1GSP, 1RGA, 1RGL, 1RLS, 1RMS, 1RTU, 2GSP, 3BIR, 3GSP, 4BIR, 4GSP, 5GSP, 6GSP). The structure sizes ranged from 51 to 436 amino acid residues. The pairs formed from within and between the four superfamily classifications were compared by MolCom and by CE (Shindyalov and Bourne, 1998Go). There were about four times as many pairs of dissimilar structures as similar structures as a result of the pairing composition. The structures were not pre-aligned. Instead, MolCom’s property-based technique for alignment of structures was tested. MolCom was configured to combine CVA score estimates from 125, 150, 175 and 200% cubic overlap into an overall CVA score.

In addition to the above test cases, some structures known to be complex or difficult to compare were also tried. These included pairs of large structures with low sequence identities and pairs of structures whose similarity was documented as ‘difficult to detect’ for CE, DALI and VAST (Shindyalov and Bourne, 1998Go). The pairs are listed in Table VGo.


View this table:
[in this window]
[in a new window]
 
Table V. Challenging pairs of structures
 
Biological accuracy was verified by comparing MolCom’s CVA scores with the CE method’s z-scores for the same pairs. The z-scores of the CE method already have a convenient interpretation established by the biological community. A z-score below 3.7 indicates similarities of low significance; z-scores between 3.7 and 4.0 are in a ‘twilight zone’ where some biological significance can be seen; z-scores between 4.0 and 4.5 indicate superfamily-level similarities, strongly related functions or strongly recurring folds; z-scores above 4.5 indicate family-level similarity. It was expected that MolCom’s CVA score boundaries would closely match the CE z-score boundaries with the corresponding interpretation: CVA scores below 50% indicate dissimilarity; CVA scores between 50 and 60% correspond to MolCom’s ‘twilight zone’; CVA scores between 60 and 70% indicate ‘marginal’ similarity; and CVA scores of 70% or more indicate high similarity. A structure compared with itself (identity) always gives a CVA score of 100%.

The validity of the CVA scores was also assessed by counting the number of incorrectly awarded CVA scores: pairs of structures formed between groups were expected to receive CVA scores under 70%, while pairs formed within groups were expected to receive CVA scores over 50%.

A subset of 200 alignments of superimposed structures produced by MolCom was chosen at random and assessed qualitatively by visual inspection and any errors were tallied. The MolCom alignments were also visually compared with alignments produced by the lsqkab software (Collaborative Computational Project No. 4, 1994Go) whenever the minimized r.m.s.d. was 4 Å or lower. Where CVA scores were high (over 70%), the major trends in SSEs were expected to match (e.g., {alpha}-helices found near {alpha}-helices, ß-sheets found near ß-sheets, etc.) and were expected to be in approximately similar orientations. This tendency was to become more pronounced with increasing CVA scores.


    Results
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Calibration

MolCom was successfully calibrated. A detailed (and Web-accessible) listing of the calibration results is available (O’Hearn et al., 2002Go). For brevity, we discuss only the parameter assignments that yielded an acceptable calibration:

  1. The best exploration threshold, T, was 55%.
  2. Using the property weighting functions, W1a(L), W1b(L), W2a(L), W2b(L), with the coefficients shown in Table IIGo, the best scalar multiplier combination (w1a,w1b,w2a,w2b) was found to be (1,1,1,1).
  3. The best cubic overlap functions, , were found to be in the range . Different cubic overlap values in this range were found to give equally good estimates of the CVA score. Accuracy was improved (fewer classification errors were observed) when the CVA scores were calculated from several values of k and averaged together.

Figure 4Go shows the general relationship between the CVA scores and the amount of r.m.s.d. between structures. The CVA scores were found to increase monotonically with the degree of similarity as evidenced by a strong negative correlation between the CVA scores and the r.m.s.d. values, r = -0.85.



View larger version (10K):
[in this window]
[in a new window]
 
Fig. 4. Relationship between MolCom’s CVA scores and r.m.s.d.s between compared structures during calibration. Each point on the graph represents a comparison of two protein structures.

 
The mean CVA score for Group ‘S’ (similar pairs) was 77.9% and for Group ‘D’ (dissimilar pairs) 36.1%. The resulting best selectivity was therefore 41.8%. The number of errors according to the ‘segregation rule’ was three (96% accuracy). The calibration criteria were therefore satisfied by the determined combination of settings. Although achieving an optimal calibration might increase MolCom’s sensitivity further, it was not a focus of this research; instead, optimal calibration was left for future work.

Verification

A Web-accessible listing of the verification results is available (O’Hearn et al., 2002Go). The MolCom method was found to produce biologically accurate CVA scores on the 1053 pairs of structures. Figure 5Go shows a scatter plot of MolCom’s CVA scores versus z-scores from the CE method. The regression line relating CVA scores to z-scores was



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 5. Relationship between MolCom’s CVA scores and z-scores from the CE method for pairs of proteins compared during verification testing. Each point on the graph represents a comparison of two protein structures.

 

The CVA scores correlated strongly with the z-scores, r = 0.82. Figure 5Go also shows four intersecting areas where mismatches in scoring pairs of structures were considered significant. These areas are also listed in Table IVGo. Each area is bounded by a z-score domain boundary and a CVA score range boundary; mismatches in other areas were considered marginal.


View this table:
[in this window]
[in a new window]
 
Table IV. Mismatched combinations of CE z-scores and CVA scores
 
A single mismatch was found in area 1, involving the similar structures 1CTX and 1QKE shown in Figure 6Go. MolCom correctly detected significant similarity (CVA = 73.6). However, CE detected no significant similarity (z = 3.3). The low score awarded by CE probably resulted from the high r.m.s.d. of 10.2 Å between the structures coupled with the extremely low sequence identity of 3.6%.



View larger version (45K):
[in this window]
[in a new window]
 
Fig. 6. MolCom’s superposition of structures 1CTX (light shading) and 1QKE (dark shading) is shown. Similarity was detected by MolCom (version 0.7a) (CVA = 73.6) but not by the CE method (version 1.01) (z = 3.3). The sequence identity, 3.6%, was extremely low.

 
There were six mismatches where CE detected similarity but MolCom did not. These mismatches are visible in areas 2 and 3 (see Table IVGo). Two of the structure pairs (1CHZ+1DQ7:b and 1BCG+1I6F) were considered to have family-level similarity by CE but were considered to be only marginally similar by MolCom. However, both pairs had r.m.s.d.s above 10 Å and the alignments produced by MolCom were found to be correct by visual inspection. The other four pairs had z-scores between 4 and 4.5 (superfamily-level similarity) but were within MolCom’s CVA ‘twilight zone’. The alignments of these structures by MolCom were examined and were found to be correct. The low CVA scores reflected either high r.m.s.d. (1BCG+1DQ7:b, 1CN2+1DQ7:b) or considerable geometric discordance in SSE arrangement (1BCG+2B3C, 1BCG+1CN2). There were no mismatches in area 4 (strong similarity detection by CE and no similarity detection by MolCom).

The number of incorrectly awarded CVA scores was seven out of 1054 (>99% accuracy). No false positives were detected; all pairs of structures formed between groups (dissimilar pairs) received CVA scores <70%. The seven CVA scores considered incorrect (pairs within groups receiving CVA scores <50%) all exhibited highly discordant secondary structural geometries (the structure pairs are listed in the right-hand column of Table IVGo). Of the seven misclassified pairs, the CE method also misclassified five of the pairs as being insignificant, with z < 3.7 (1AYJ+1CN2, 1BCG+1GPS, 1CDR+1CTX, 1CN2+1GPS, 1GPS+2B3C) and classified one pair (1CHZ+1GPS) with z = 3.7. The largest mismatch in scoring (1GPS+1SNB) between CE (z = 3.9) and MolCom (CVA = 46.9%) was in the CE ‘twilight zone’. The 1SNB structure contains a large looped region not present in the 1GPS structure. The resulting discord in centers of mass causes MolCom’s low score.

Qualitative visual assessment of the subset of MolCom’s superimposed structure alignments showed that all alignments appeared correctly superimposed. All lsqkab-generated alignments below 4 Å were virtually indistinguishable from alignments generated by MolCom. In all cases, the major trends in SSEs were noticeably coincident with an increasing amount of overlap as the CVA score increased.

MolCom was able to detect at least marginal similarity in 10 out of the 12 challenging pairs of structures listed in Table VGo deemed similar by experts (Shindyalov and Bourne, 1998Go). Pairs 1–4 were not detected as similar by DALI or VAST; pairs 5 and 6 were not detected by DALI; pairs 7 and 8 were not detected by VAST. Pairs 9–12 (and pair 8) were rather complex structures with low sequence identities (well under 30%). MolCom did not consider pairs 6 and 8 to be similar, but the alignments were reasonable. Only VAST detected similarity in pair 6. CE detected similarity in all pairs detected by MolCom, along with pair 8. However, the structures of pair 8 were considered similar by CE primarily because of matching localized regions of sequence and not because of global, 3-D structural similarity. MolCom, in fact, produced a reasonable CVA score for pair 8 despite the amount of global structural differences.

As further illustration of the operation of MolCom, Figure 7Go shows an example superposition produced by MolCom of two similar structures from Staphylococcus aureus, V8 protease (V8prot) [recently resolved by X-ray crystallography at the University of Saskatchewan (Prasad et al., 2001Go)] and epidermolytic toxin A (1AGJ). Figure 8Go shows another example superposition generated by MolCom involving two large TIM-barrel molecules, enolase (1EBH) and mandelate racemase (2MNR).



View larger version (83K):
[in this window]
[in a new window]
 
Fig. 7. MolCom’s superposition of V8 protease (dark shading) and structurally similar epidermolytic toxin A (1AGJ:A) (light shading); sequence identity, 27.9%; MolCom CVA similarity score, 68.9% (increasing to 73.1% with the bottom helix of 1AGJ removed).

 


View larger version (84K):
[in this window]
[in a new window]
 
Fig. 8. MolCom’s superposition of enolase (1EBH) (dark shading) and structurally similar mandelate racemase (1MNR) (light shading); sequence identity, 13.7%; MolCom CVA similarity score, 67.1%.

 

    Discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
The MolCom method is currently the only method that compares a collection of user-defined molecular properties directly in three-dimensional space. It is designed to indicate the degree of global physicochemical similarity between structures and to align those structures solely on the basis of a hierarchical set of molecular properties. MolCom can be adapted to examine other hierarchical collections of properties in detail, such as surface topologies, force fields, ligand-receptor docking sites and so on. MolCom finds orientations of structure giving the best match of collections of properties for the compared structures and summarizes this information in an octtree. New structure files are generated showing the best superposition of the structures based on the properties analyzed. The CVA score represents the degree of similarity in the structures.

Using a modest-sized collection of properties and an unoptimized but acceptable calibration, MolCom performed as well as the CE method in detecting similarity. We conclude that MolCom gives biologically accurate results since CE is considered to do so and MolCom’s results are consistent with those of CE. MolCom has the potential to be optimized for detecting (and revealing the locations of) extremely subtle differences in similar molecules by expanding the set of properties considered or by increasing the exploration threshold parameter. Thus MolCom can become an indispensable tool of the protein crystallographer for studying subtle differences in crystal structures resulting from differences in determination method (X-ray crystallography, NMR spectroscopy), substrate conformations in the ligand-bound and ligand-unbound states and so on.

The MolCom paradigm can be extended in several ways to handle a wider range of comparative and analytical problems, from comparing two or more structures, to analyzing individual structures and summarizing their structural and chemical characteristics. To this end, annotated octtrees will be incorporated into the algorithm in the near term. Annotation will allow octtree-based comparison spaces to be created in which newly resolved molecules can be rapidly compared and classified. Annotated octtrees could also be used to represent the nature and locations of structural and chemical properties analyzed for a single protein structure in three-dimensional space. It is possible that additional calibration testing would result in settings that give even better verification results. For instance, some of the parameters that were simply assigned values could be set using empirical tests. For improved robustness, the calibration set could be extended to incorporate more diverse proteins. The extended calibration set would be kept available for investigators to add new property types to the method. A substructural co-location algorithm will be developed (O’Hearn, 2000Go), which will bring regions of statistically significant similarity into closer contact. Co-location should dramatically increase sensitivity to subtle similarities and differences in hierarchically complex aggregate features, such as binding sites. Additional properites will also be added to MolCom, such as regional dihedral angle distribution tendencies [suggested by Prasad et al. (Prasad et al., 2001Go)] and other collections of properties for comparing surfaces and for exploring complementarity in structures. A fast alignment method is under development. It is based on a superposition of the actual observation points using a four-dimensional r.m.s.d. algorithm which incorporates three observation point positional vectors and one observation point type vector. The fast alignment method will allow a local version of MolCom to be created which will be able to detect interesting similarities between molecules of substantially different sizes. We also intend to build a ‘property browser’ allowing an investigator to view configurable representations of ‘property densities’ through space.


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Alexandrov,N.N. and Fischer,D. (1996) Proteins, 25, 354–365.[CrossRef][ISI][Medline]

Alexandrov,N.N. Takahashi,K. and Go,N. (1992) J. Mol. Biol., 225, 5–9.[ISI][Medline]

Collaborative Computational Project No. 4 (1994) Acta Crystallogr., P50, 760–763.

Falicov,A. and Cohen,F.E. (1996) J. Mol. Biol., 258, 871–892.[CrossRef][ISI][Medline]

Foley,J.D., van Dam,A., Feiner,S.K. and Hughes,J.F. (1990) Computer Graphics, Principles and Practice. Addison-Wesley, Reading, MA.

Gilbert,D. and Westhead,D. (1998) http//www.ebi.ac.uk/~drg/epsrc-vf/epsrc-report-long.html.

Gilbrat,J.F., Madej,T. and Bryant,S.H. (1996) Curr. Opin. Struct. Biol., 6, 377–385.[CrossRef][ISI][Medline]

Holm,L. and Sander C. (1993) J. Mol. Biol., 233, 123–138.[CrossRef][ISI][Medline]

Hyams,D. (1997) http//www.ebicom.net/~dhyams/cmain.htm.

Humphrey,W., Dalke,A. and Schulten K. (1996) J. Mol. Graphics, 14, 33–38.[CrossRef][ISI][Medline]

Lehninger,A.L., Nelson,D.L. and Cox,M.M. (1993) Principles of Biochemistry. Worth, New York.

Mezey,P.G. (1993) Shape in Chemistry, an Introduction to Molecular Shape and Topology. VCH, New York.

Murzin,A.G., Brenner,S.E., Hubbard,T.J.P. and Chothia C. (1998) http//scop.mrc-lmb.cam.ac.uk/scop/intro.html.

O’Hearn,S. (2000) Thesis, University of Saskatchewan. http//www.octtech.com/publications/ohearn_thesis.pdf.

O’Hearn,S. and Kusalik,A. (2000) P073, Intelligent Systems for Molecular Biology. http//ismb2000.sdsc.edu/posters_abstracts/MondayPS.html (download, http//www.octtech.com/publications/ohearn_poster.pdf).

O’Hearn,S., Kusalik,A. and Angel J (2002) The Protein 3-D Structure Alignment Web Site. http//www.octtech.com/MolCom3D.

Prasad, L, Hayakawa,K., Leduc,Y. and Delbaere,L. (2001) P097. American Crystallographic Association Abstracts. http//www.hwi.buffalo.edu/ACA/ACA01/abstracts/PSII.html

Sander,C. and Holm,L. (1993) J. Mol. Biol., 233, 123–138.[CrossRef][ISI][Medline]

Sander,C. and Holm,L. (1996) Science, 273, 595–602.[Abstract/Free Full Text]

Shindyalov,I.N. and Bourne,P.E. (1998) Protein Eng., 11, 739–747.[Abstract]

Shortle,D. (1998) Curr. Biol., 7, 151–154.

Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 1–22.[ISI][Medline]

Received May 15, 2002; revised October 28, 2002; accepted December 17, 2002.





This Article
Abstract
FREE Full Text (PDF)
Alert me when this article is cited
Alert me if a correction is posted
Services
Email this article to a friend
Similar articles in this journal
Similar articles in ISI Web of Science
Similar articles in PubMed
Alert me to new issues of the journal
Add to My Personal Archive
Download to citation manager
Search for citing articles in:
ISI Web of Science (2)
Request Permissions
Google Scholar
Articles by O’Hearn, S.D.
Articles by Angel, J.F.
PubMed
PubMed Citation
Articles by O’Hearn, S.D.
Articles by Angel, J.F.