1 Center for Applied Molecular Engineering, Institute for Chemistry and Biochemistry, University of Salzburg, Jakob-Haringer Str. 3, A-5020 Salzburg and 3 ProCeryon Biosciences GmbH, Jakob-Haringer Str. 3, A-5020 Salzburg, Austria
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: alternative alignments/structure similarity measure/superimposition
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
There are many possible ways to define structure similarity among proteins, e.g. by root mean square deviation (r.m.s.d.) of rigid body superimposition (Zuker and Somorjai, 1989; May and Johnson, 1994
; Diederichs, 1995
) or differences in interatomic distances (Taylor and Orengo, 1989
; Holm and Sander, 1993
). We focus on rigid body superimposition, where the goal is to find the highest number of equivalent residues with the lowest r.m.s.d. Unfortunately, both parameters are interdependent. A solution is to keep the r.m.s.d. constant while optimizing the number of equivalent residues (Feng and Sippl, 1996
). As a consequence, the main measure of similarity is the number of structurally equivalent residues, which is a rather intuitive measure and considerably simplifies ranking.
It has been demonstrated that multiple solutions are inherent in structure comparison (Boutonnet et al., 1995; Feng and Sippl, 1996
; Godzik, 1996
). For a particular protein pair several solutions with comparable extent of similarity can be observed. They can differ in small shifts in secondary structure elements, e.g. a shift of two residues in ß-strands, but also correspond to alignments of different regions, e.g. different domains. Therefore, alternative solutions are relevant whenever an alignment is needed, as in the evaluation of structure prediction (Lackner et al., 1999
). Here we investigated and optimized the ProSup structure comparison program (Feng and Sippl, 1996
). In ProSup the r.m.s.d. is kept below a threshold by defining a certain distance cutoff for equivalent residues. This cutoff applies during both major steps of the algorithm, (1) the iterative expansion of preliminary alignments and (2) the dynamic programming-based refinement. Pairs of small fragments having an r.m.s.d. below a certain cutoff are selected for initial superimpositions. This in general results in a number of distinct alignments. ProSup provides different filters applicable to the alignments, in order to remove regions of non-significant similarity.
We describe and discuss the ProSup algorithm and its parameters in detail in the Materials and methods section. In the Results and discussion section we (1) show that ProSup is able to detect remote structural similarity using a test set of 10 protein pairs reported in the literature, (2) compare the ranking of ProSup with the ranking of DALI (Holm and Sander, 1993), which is based on the alignment of distance matrices, (3) compare the ProSup ranking with the expert knowledge-derived structure classification of SCOP (Murzin et al., 1995
), (4) calculate a measure of statistical significance for the structure alignments, (5) discuss the problem of side chain orientation and (6) investigate the nature of multiple solutions on several examples.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
The algorithm
The algorithm basically has four steps (Figure 1):
|
Every similar fragment pair from (1) is expanded to an initial alignment. From step (2) a subset of initial alignments is selected rather than a single one. This subset is the origin of the alternative alignments. The preliminary initial alignments are optimized in step (3). Subsequently the optimized alignments are evaluated by a filter, which removes short fragments and checks for side chain orientations. Finally, they are classified by a clustering procedure.
Let A and B be the first and second protein chain, respectively. Let lS be the length of the fragments to be compared and lA and lB the sizes of proteins A and B. Then in step (1) all (lA lS + 1)x(lB lS + 1) fragments are superimposed (Sippl and Stegbuchner, 1981) and the subset with r.m.s.d. cS (r.m.s.d. cutoff for seed fragments) is forwarded to step (2).
In step (2) a seed is being expanded by (2a) superimposing the whole A and B according to the seed fragment pair Ai..i + lS1 and Bj..j + lS1 where i and j define the position of the seed in A and B. Subsequently (2b) starting from the N- and C-terminal ends of the seed fragment those residue pairs An, Bm are located which have a distance in space smaller than the distance cutoff for alignment construction dC. This defines a new set of equivalent residues. In (2c) A and B are superimposed according to the new set of equivalent residues and step (2b) is repeated until the initial alignment does not change any more.
Step (2) is repeated for each seed fragment pair. Naturally many Ai..i+lS1, Bj..j+lS1 pairs expand to the same initial alignment. Identical alignments are disregarded. Because of computing time considerations, not every initial alignment will be refined. Only the nS alignments with the highest number of equivalent residues and the lowest r.m.s.d. will be forwarded to step (3).
In step (3) a seed alignment is refined by a dynamic programming algorithm (Needleman and Wunsch, 1970). From step (2) we keep the last superimposition. Then in (3a) we compute a score matrix S for every residues pair Ai, Bj by measuring the distance dAi,Bj:
![]() |
Then the optimal path through this matrix is search by a Needleman and Wunsch algorithm (Needleman and Wunsch, 1970). This results in a set of aligned residues. Note that aligned now does not necessarily mean structurally equivalent. In (3b), those residues where dAi',Bj'
dC and i', j' is an aligned pair, define a new set of equivalent residues. The procedure is repeated at (3a) until the alignment does not change any more. Also in this step different initial alignments may converge to the same final alignment. Identical alignments are disregarded. Besides dC, the distance cutoff for alignment construction, there are also other parameters which can be applied in the refinement: gap penalty g, gap extension penalty and maximum gap length. In the study presented, only gap penalty is applied.
In (4) several filters are applied to remove non-significant similarities. A second distance cutoff is now introduced which defines the final equivalent residues (dE). As an option, a cutoff (dB) for the maximum distance between Cß atoms can be applied. These will remove equivalent residues where the side chains have different orientations. Finally, a filter for a minimum fragment length of consecutively aligned pairs (lF) is applied.
Parameter optimization
In order to determine the parameter set that produces the most reasonable alignments in the shortest computational time, an optimization procedure was performed. In general, a high number of equivalent residues is associated with significant matches and maximizing this number was used as the main optimization criterion. Human decision was required to select the parameters dE and lF by visual inspection of the superimpositions. Seven parameters were optimized (Table I, except nE and cC ) using a set of 100 protein pairs (Table II
).
|
|
Seed fragment parameters: lS and cS
The selection of initial seed fragments is controlled by the length lS and r.m.s.d. cutoff cS. Large r.m.s.d. cutoff values cS and small fragment lengths lS result in a large number of seed fragments and initial alignments, which is computationally demanding but is more sensitive to weak similarities. Optimization results are shown in Figure 2.
|
If the maximum number of initial alignments nS is set to one, only the best initial alignment is being forwarded to refinement. Alternative alignments would be neglected. Increasing nS increases the number of alternative alignments but also the computation time. Low values of nS should be used for database searches, where the goal is just to identify the related structures. Larger values should be used for individual comparison where the set of alternatives might provide important information.
The distance cutoff for alignment construction dC defines a maximum distance allowed between two residues in steps (2) and (3). By optimization of the number of equivalent residues and by visual inspection we found that fairly large values for dC should be used in order to find the similarities. To remove aligned regions of low similarity a more restrictive cutoff should be applied in the filter stage.
Refinement parameters: dC and g
dC is used in creating the seed alignments and also on the refinement, the overall influence of dC is discussed above. Small values for the gap penalty g, as in any other application of this type of dynamic programming alignment algorithms, cause scattered alignments if the structures to be aligned have low similarity. An optimal gap penalty depends not only on the similarity function in the alignment procedure but also on the definition of the minimum fragment length in the evaluation. A larger gap penalty forces the alignments to be less scattered. Less scattered alignments produce larger regions of consecutive equivalent residues. Thus, if the minimum fragment length is larger also larger gap penalties should be preferred (Figure 3).
|
Alignment evaluation parameters cannot be optimized on the number of equivalent residues. Increasing the distance cutoff for evaluation dE always increases the number of equivalences whereas increasing the minimum fragment length lF always decreases that number. Therefore, visualization of individual superimpositions was required to select reasonable values for these parameters, 5 Å for dE and five residues for lF. If lF is set to a low number some residues may be marked as equivalent where the structural relation is insignificant. It happens, for example, that a helix from the one protein crosses a strand from the other and the corresponding C atoms are close together in a limited area. Larger values of lF avoid this problem. Again a balance must be found between filtering insignificant equivalences and not losing short but important stretches of equivalences.
An additional criterion is the minimum number of equivalent residues nE (after applying lF and dE). Alignments with a smaller number of equivalences than nE are disregarded.
ProSup provides an additional feature which does not influence the structure comparison results, but is convenient for the analysis of the output. In some cases alternative alignments might share many equivalent residues, in other cases they are rather different. We use a cluster algorithm to classify the alternative alignments. Two alternative alignments are put into different clusters if they have less than cCx100% equivalent residues aligned equally, relative to the smaller alignment. Smaller values for the cluster cutoff cC result in more clusters.
Suggestion for parameter settings
Parameter setting depends on the application of structure comparison. The optimized parameters presented in this study (Table I) are a good starting point. For finding remote similarity the distance cutoffs dE and dC can be larger, e.g. 56 Å. For calculating precise alignments lower distance cutoff values should be preferred (34 Å). Scattered alignments should be avoided by setting lE to 45. In case of database searches, where speed is an issue, the number of initial alignments forwarded to the refinement step should be kept low (nS of 20). For studying individual protein pairs large values for nS (100 or more) should be used to search exhaustively for alternative solutions.
![]() |
Results and discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
A main requirement for any structure comparison method is its ability to detect weak structural similarity. A test suite should contain structures which have low but detectable structural similarity. We chose a set of 10 protein pairs which was previously used by Shindyalov and Bourne (1998) to investigate their CE method. The set itself was published by Fischer et al. (1996). ProSup results were compared with those from three methods available as web services and frequently used by the scientific community, VAST, DALI and CE (Table III).
|
R.m.s.d. reported by ProSup is fairly constant in all cases; the range of r.m.s.d. is 1.7 2.7 Å. The upper limit of r.m.s.d. is determined by the parameters applied in ProSup. In the other three methods we find a larger variation in r.m.s.d.; the largest value is 4.2 Å in VAST. This variation in r.m.s.d. prevents an objective comparison of the results in terms of performance based on number of equivalent residues. For example, consider the comparison of the granulocyte-colony stimulating factors 1bge-B and 2gmf-A. ProSup reports 87 equivalent residues with an r.m.s.d. of 2.4 Å. VAST gives a similar r.m.s.d. of 2.3 Å, but the alignment is shorter with 71 aligned residues. CE reports the largest number of equivalent residues, 107, but also a larger r.m.s.d. of 3.9 Å. Even a detailed comparison according to these numbers is difficult, but in general ProSup results are in the range of those from the other three methods.
We also compare the structural alignments obtained from the different methods. We found that the ProSup alignments agree best with CE, where in sum 761 residue pairs are aligned identically; worst is the agreement with VAST, where 715 residue pairs are aligned identically. In most cases >40% of the aligned residues match. A remarkable exception is the pair 1bge-B and 2gmf-A, where the alignment of ProSup differs from all the others. In this case also CE differs substantially from DALI and VAST in terms of alignment.
Ranking in database searches
The number of equivalent residues nevertheless can be an appropriate measure if r.m.s.d. is kept constant and provides a simple and intuitive measure to rank structural similarity. This principle is applied in ProSup and used whenever a comparison of a single protein structure with a database of structures is performed. Other methods use different approaches. For example, in DALI ranking is done by a Z-score derived from a measure based on intramolecular distances.
DALI is a frequently used method and enjoys broad acceptance in the scientific community. We compare the results of database searches performed with ProSup and DALI. Table IV shows the results for a database scan with the transcription factor ETS-1 (1bqv), a CASP3 target (Murzin, 1999
). The table shows the 10 top ranking structures reported by ProSup. In DALI, structures with a Z-score >2 are considered to be similar and therefore are shown in the table.
|
|
A further test of ProSup ranking was performed by comparison with the expert knowledge classification of SCOP (Murzin et al., 1995). ProSup ranking should match SCOP classification. We used a set of 53 single domain proteins from Domingues et al. (2000) and compared them against a database of about 800 protein chains with low sequence similarity. For all of the 53 target proteins there is at least one structure in the database with the same SCOP fold category. In addition, each of the target proteins has several similar folds in the database according to ProSup. We removed trivial cases of high sequence similarity between target and database structures. Only chains where FASTA3 (Pearson and Lipman, 1988
) does not detect significant similarity were considered.
In 48 of the 53 cases the target protein and the best ranking database structure are members of the same SCOP fold category (Table V). In 42 cases (80%) ProSup identifies a chain which belongs to the same superfamily. This means that ProSup tends to rank those structures in first position where SCOP expects evolutionary relationships.
|
Significance of structure comparison
We investigate how a measure of significance of structural similarity can be applied to ProSup results. Such a measure has been proposed by Levitt and Gerstein (1998). Here we calculate the structure similarity score, defined as Sstr = M{1/[1 + (dij/d0)2] Ngap/2}, where M = 20, dij is the distance between aligned C
atoms, d0 = 5 and Ngap is the number of gaps for the ProSup based alignments. Note that the dij are the distances in the complete alignment, not from the equivalent regions only. Subsequently we convert the scores into p-values by applying the formalism given in Levitt and Gerstein (1998). We compare how the p-values correlate with the number of equivalent residues. We assume that the underlying distribution of Sstr is similar in both methods. Figure 5
shows the number of equivalent residues versus the logarithm of p-values. The horizontal line separates the 10 lowest p-values and the vertical line separates the ten top ranking cases with ProSup. Five proteins are common in both cases; the agreement is considerable. This was confirmed when we repeated this experiment in the 53 database scans described in the section above. The correlation coefficient ranges from 0.66 to 0.95, average 0.85.
|
Frequently structure comparison is based on C traces. This is sufficient if the goal is the identification of similar architectures or topologies. In other applications, such as the detection of evolutionary relationships or evaluation of predictions, a more detailed description of similarity is required. Not only C
superimposition but also the orientation of the side chains is important.
For example, in the superimposition of ß-strands, C atoms close in space can have Cß atoms in an opposite orientation (Figure 6
). A simple but efficient way to take side chain orientation into account was implemented in ProSup. It consists of a filter applied after C
superimposition. For each pair of equivalent C
atoms we measure also the distance between the corresponding Cß. If the distance is larger than a distinct cutoff this pair is not considered as structurally equivalent. For some protein pairs, Cß filtering reduces the number of equivalent residues considerably (Table VI
).
|
|
Structure comparison often yields multiple solutions which correspond to distinct alignments. The main sources of alternative alignments are structural repeats at different levels, secondary structure, supersecondary structure, internal repeats and domains.
For example, the comparison of four-helix bundles with shorter and larger helices obviously can be aligned in different ways (Feng and Sippl, 1996), because of the repetitive nature of
-helices. However, there are also more intricate cases at the level of secondary structure, such as in the comparison of the
+ ß proteins 1cew-I and 1oun-A. Superimposition yields two alternative alignments which have comparable numbers of equivalent residues and r.m.s.d., 70/2.4 and 68/2.4 Å, but the aligned residues are different in the two alternatives (Figure 7
). The differences between these two solutions come from shifts of four residues in a helix and shifts of two residues in the strands.
|
An example of alternative alignments caused by internal repeats is the second domain of 1tdj. It consists of two subdomains. Both of them can be superimposed with the ferredoxin-like domain of 1psd-A.
There are many examples on the domain level. For example, the superimposition of a protein consisting of a single immunoglobulin-like domain with one consisting of two such domains has obviously (at least) two solutions.
Alternative alignments provide important information for several applications of structure comparison. For example, if structure alignments are used for deriving amino acid substitution matrices (Prli et al., 2000
), the question arises of which alternative to use. If there is sequence homology, the number of identical residues could serve as a measure to select alternatives but in the case of remote homologous or analogous proteins this should not be considered as a reliable criterion. Probably there is no answer and the best solution is to use both alignments. If structure comparison is used to evaluate or optimize structure prediction methods, considering only one and neglecting the other solution might lead to a wrong evaluation. To conclude, in every application of structure comparison, where the alignment plays a role, one must be aware that alternative solutions may exist.
Computational effort and availability
The computing time depends on both the size of the structures and the parameters used. With the default parameter set a search in a database containing 842 protein chains takes between 14 min and 3.2 h on a PentiumIII 500MHz Linux system. For example, a BPTI-like fold (1den, 60 residues) takes 14 min, a globin (1ash, 150 residues) 1 h and a TIM barrel (1ece-A, 358 residues) 2.5 h. On average it is 4.5 s per pairwise comparison for an average target structure size of 157 residues and an average database structure size of 258 residues.
ProSup is available as an online service at http://www.came.sbg.ac.at.
Conclusion
Optimizing the number of equivalent residues while keeping the r.m.s.d. constant provides a simple and intuitive measure of structure similarity. We have demonstrated that this measure can be used for ranking in database searches, whose results are in good agreement with expert knowledge classification. We further showed that structure comparison frequently has multiple solutions, which should not be neglected for some applications. ProSup provides a number of alternative alignments rather than a single solution. The C trace is an adequate approximation of the backbone conformation. By default ProSup uses C
atoms only in the structure superimposition calculation. In this way the computing time is kept to a minimum. Nevertheless, in some applications side chain orientation plays an important role. ProSup provides a filter which guarantees proper side chain orientation.
![]() |
Notes |
---|
![]() |
Acknowledgments |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Defay,T. and Cohen,F.E. (1995) Proteins, 23, 431445.[ISI][Medline]
Diederichs,K. (1995) Proteins, 23, 187195.[ISI][Medline]
Domingues,F.S., Lackner,P. and Sippl,M.J. (2000) J. Mol. Biol., 297, 10031013.[ISI][Medline]
Feng,Z.K. and Sippl,M.J. (1996) Fold. Des., 1, 123132.[ISI][Medline]
Fischer,D., Eloffson,A., Rice,D.W. and Eisenberg,D. (1996) In Hunter L. and Klein T. (eds), Proceedings of the 1st Pacific Symposium on Biocomputing. World Scientific Publishing Company, Singapore, pp. 300318.
Godzik,A. (1996) Protein Sci., 5, 13251338.
Holm,L. and Sander,C. (1993) J. Mol. Biol., 233, 123138.[ISI][Medline]
Holm,L. and Sander,C. (1996) Science, 273, 595603.
Koppensteiner,W.A., Lackner,P., Wiederstein,M. and Sippl,M.J. (2000) J. Mol. Biol., 296, 11391152.[ISI][Medline]
Kraulis,P.J. (1991) J. Appl. Crystallogr., 24, 946950.[ISI]
Lackner,P., Koppensteiner,W.A., Domingues,F.S. and Sippl,M.J. (1999) Proteins, Suppl. 3, 112120.
Lemer,C.M., Rooman,M.J. and Wodak,S.J. (1995) Proteins, 23, 337355.[ISI][Medline]
Levitt,M. and Gerstein,M. (1998) Proc. Natl Acad. Sci. USA, 95, 59135920.
Marchler-Bauer,A. and Bryant,S.H. (1995) Proteins, 23, 7482.
May,A. and Johnson,M.S. (1994) Protein Eng., 7, 475485.[ISI][Medline]
Merritt,E.A. and Murphy,M.E.P. (1994) Acta Crystallogr. D, 50, 869873.[ISI][Medline]
Murzin,A.G. (1993) EMBO J., 12, 861867.[Abstract]
Murzin,A.G. (1998) Curr. Opin. Struct. Biol., 8, 380387.[ISI][Medline]
Murzin,A.G. (1999) Proteins, Suppl. 3, 88103.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536540.[ISI][Medline]
Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48, 443453.[ISI][Medline]
Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 10931108.[ISI][Medline]
Orengo,C.A., Todd,A.E. and Thornton,J.M. (1999) Curr. Opin. Struct. Biol., 9, 374382.[ISI][Medline]
Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 24442448.[Abstract]
Prli,A., Domingues,F.S. and Sippl,M.J. (2000) Protein Eng., 13, 545550.
Shindyalov,I.N. and Bourne,P.E. (1998) Protein Eng., 11, 739747.[Abstract]
Sippl,M.J. and Stegbuchner,H. (1981) Comput. Chem., 15, 7378.
Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 122.[ISI][Medline]
Weber,P.C. and Salemme,F.R. (1980) Nature, 287, 8284.[ISI][Medline]
Zuker,M. and Somorjai,R.L. (1989) Bull. Math. Biol., 51, 5578.[ISI][Medline]
Received March 9, 2000; revised September 14, 2000; accepted September 28, 2000.