Department of Biological Chemistry, Weizmann Institute of Science, Rehovot 76100, Israel
1 To whom correspondence should be addressed. e-mail: gideon.schreiber{at}weizmann.ac.il
Abstract
Keywords: docking/interface prediction/protein–protein interaction/scoring function
Introduction
The docking process can be divided into two steps (Halperin et al., 2002). First, a large number of potential structures with reasonable surface complementarity are generated. Second, these structures are ranked by a score that should separate the near-native structures from the pool of non-native ones. Common scoring approaches are based either on surface complementarity (Katchalski-Katzir et al., 1992; Walls and Sternberg, 1992; Norel et al., 1994, 1995, 1999), sometimes together with an electrostatic filter (Gabb et al., 1997; Norel et al., 2001; Heifetz et al., 2002), or on energy-based methods such as residue potential scores or elaborate free energy evaluations (Jackson and Sternberg, 1995; King et al., 1996; Jackson et al., 1998; Moont et al., 1999; Camacho et al., 2000; Lorber et al., 2002). The known scoring functions, especially the more refined free energy estimates, depend on a high-resolution description of the protein surfaces. As the side-chain and main-chain conformations may change upon complexation and are difficult to predict, the applicability of these approaches is limited. Moreover, homology models, even when built from highly similar structures, can provide only a rough estimate of the surface shape of the modeled protein. Therefore, so far only very low-resolution docking has been attempted for homology models (Vakser, 1996; Tovchigrechko et al., 2002).
Recently, we presented a new structure-based prediction program, ProMate (Neuvirth and Schreiber, 2004), which calculates the potential location of a protein–protein interface. The algorithm is based on the analysis of the unique structural and biochemical characteristics of transient protein–protein binding sites. Here, we present a new scoring function that is based on the prediction of putative binding sites using ProMate. Our scoring procedure does not rely on energy considerations, but measures the tightness of the fit of the interacting proteins at the predicted binding site. The function is independent of side-chain conformations and tolerant of inaccuracies in the backbone conformation.
Methods
ProMate (http://bioportal.weizmann.ac.il/promate) is based on a statistical analysis of several properties that were found to distinguish binding regions from non-binding ones. The properties were modeled using a database of 57 transient hetero-interactions of proteins whose structures are known in both the unbound and complexed forms (excluding antigens). Histograms of the distribution of each property in the interface and non-interface regions were constructed and served as the basic model for the prediction stage. Properties used for prediction include the frequency of atoms, their characteristics, chemical character, secondary structure, hydrophobic patches, distribution of water molecules and evolutionary conservation. A tested protein is initially processed as an independent set of circles. For every circle, each of the properties is examined and the likelihood of this circle belonging to the interface is determined. For a given property, the score is the observed frequency of the circle's property value in the interface of the training set divided by the sum of its observed frequencies in the interface and non-interface. In other words, denoting interface by I and surface by S, where O refers to the observed frequency in the training set, for an input circle c:

$$P(c \in I) = \frac{O_I(c)}{O_I(c) + O_S(c)}$$
Each circle's probability is multiplied by the probability of being an interface for a protein of a specific size. The combined score is the product of the scores resulting from the different properties, corrected according to the actual frequencies as they appear in the training set. To smooth the score further, the frequencies of the adjacent dots in a 7 Å circle are taken into account for the final score. This procedure is repeated for a number of iterations (H.Neuvirth and G.Schreiber, unpublished work).
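The histogram-based score described above can be illustrated with a minimal Python sketch. This is not the ProMate implementation: the function names, the bin edges and the toy property values are hypothetical, and normalizing each histogram to relative frequencies, as well as omitting the size-dependent prior and the 7 Å smoothing, are simplifying assumptions.

```python
from bisect import bisect_right

def bin_index(value, edges):
    """Index of the histogram bin containing `value` (clamped to the outer bins)."""
    return min(max(bisect_right(edges, value) - 1, 0), len(edges) - 2)

def relative_frequencies(values, edges):
    """Observed frequency O of each bin for one class of training circles."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        counts[bin_index(v, edges)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def property_score(value, edges, o_interface, o_surface):
    """O_I / (O_I + O_S) for the bin of `value`; 0.5 if the bin was never observed."""
    k = bin_index(value, edges)
    denom = o_interface[k] + o_surface[k]
    return o_interface[k] / denom if denom > 0 else 0.5

def combined_score(per_property_scores):
    """Product of the per-property scores (assumed combination, no correction terms)."""
    result = 1.0
    for s in per_property_scores:
        result *= s
    return result

# Hypothetical training data for a single property, e.g. a conservation value.
edges = [0.0, 0.5, 1.0]
o_i = relative_frequencies([0.8, 0.9, 0.7, 0.2], edges)   # interface circles
o_s = relative_frequencies([0.1, 0.2, 0.3, 0.85], edges)  # non-interface circles

high = property_score(0.9, edges, o_i, o_s)
low = property_score(0.1, edges, o_i, o_s)
```

In this toy example a circle with a high property value receives a larger score than one with a low value, because high values were seen more often in the interface histogram than in the surface histogram.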
Docking
An extensive set of 21 non-redundant enzyme–inhibitor complexes in their unbound state (Chen et al., 2003; Gray et al., 2003) was docked. The unbound structures superimposed on the bound complexes served as reference structures. All docking calculations were performed with a parallel version of the program package FT-Dock (Walls and Sternberg, 1992; Gabb et al., 1997; Moont et al., 1999) (http://www.bmm.icnet.uk/docking/) on a Mac G5 dual-processor computer. The parallel version was generously provided by G.R.Smith and M.J.Sternberg. FT-Dock closely follows the shape complementarity algorithm introduced by Katchalski-Katzir et al. (1992). The docking was performed with the size of a single grid unit set to 2.0 Å, an angle step of 12°, a surface thickness of 4.0 Å and an internal deterrent value of 20. The molecular grid extended 1.8 Å outwards from the mobile molecule and 3.2 Å outwards from the static molecule. The 20 best surface complementarity translations were kept for each rotation. These settings have been found to be a good compromise between docking speed and accuracy (G.R.Smith, personal communication). The structures obtained were rescored using the residue potential scoring function implemented in FT-Dock (Moont et al., 1999); 10 000 structures were generated and further evaluated per docked complex. The docking calculations for all 21 complexes took 15 h on the dual G5 processor.
Scoring
The probabilities of residues being part of an interface were calculated as described elsewhere, using our software ProMate (http://bip.weizmann.ac.il/promate). Since a binding site covers about 10% of the total surface of the protein, only the top-scoring 10% of residues were taken as interface. These residues do not necessarily form a continuous patch. Two distances were calculated: first, the average minimum distance of the predicted interfacial Cα atoms of protein 1 to any of the Cα atoms of the binding partner, and second, the average minimum distance of all Cα atoms of protein 1 to any Cα atom of protein 2. From these two distances the ToF (tightness of fit) score was calculated according to:

$$\mathrm{ToF} = \frac{d_{\mathrm{inter}}}{d_{\mathrm{all}}}, \qquad d_{\mathrm{inter}} = \frac{1}{n}\sum_{i=1}^{n} D_{\mathrm{inter};i}, \qquad d_{\mathrm{all}} = \frac{1}{m}\sum_{j=1}^{m} D_{\mathrm{all};j}$$

where $D_{\mathrm{inter};i}$ is the minimum distance of the Cα atom of residue i, predicted to be interface, of protein 1 to any Cα atom of protein 2, and $D_{\mathrm{all};j}$ is the minimum distance of the Cα atom of surface residue j, which is either interface or not, of protein 1 to any Cα atom of protein 2. There are n predicted interfacial residues and m surface residues altogether for the respective protein. $d_{\mathrm{inter}}$ is therefore the average minimum distance of the predicted interfacial residues to the other protein, and $d_{\mathrm{all}}$ is the average minimum distance of all surface residues to the other protein. Hence ToF measures the tightness of fit at the predicted binding site, normalized by the size of the protein.
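The distance calculation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: it assumes that ToF is the ratio of the two average minimum distances, that Cα coordinates are already available as (x, y, z) tuples, and that per-residue interface scores come from some predictor; the function names and the toy coordinates are hypothetical.

```python
import math

def min_distance(point, partner_ca):
    """Smallest Euclidean distance from one C-alpha to any C-alpha of the partner."""
    return min(math.dist(point, q) for q in partner_ca)

def top_fraction(scores, fraction=0.10):
    """Indices of the highest-scoring residues (at least one), e.g. the top 10%."""
    k = max(1, round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def tof_score(surface_ca, interface_idx, partner_ca):
    """Assumed ToF: mean minimum distance of the predicted interface residues
    divided by the mean minimum distance of all surface residues."""
    d = [min_distance(p, partner_ca) for p in surface_ca]
    d_inter = sum(d[i] for i in interface_idx) / len(interface_idx)
    d_all = sum(d) / len(d)
    return d_inter / d_all

# Toy example: four surface residues of protein 1 and a one-residue "partner".
surface_ca = [(2.0, 0.0, 0.0), (4.0, 0.0, 0.0), (6.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
partner_ca = [(0.0, 0.0, 0.0)]
interface_idx = [0, 1]  # residues assumed to be predicted as interface
toy_tof = tof_score(surface_ca, interface_idx, partner_ca)
```

Under this assumption a small value means that the predicted interface lies closer to the partner than the surface does on average, so lower scores correspond to a tighter fit at the predicted site.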
Evaluation of scoring performance
The scoring performance was evaluated by calculating the chance of obtaining a result as good as or better than that obtained from the scoring function by randomly picking complexes out of the pool of generated complexes. This probability is described by the hypergeometric distribution. Hence, the probability was calculated according to:

$$p = \sum_{i=a}^{b} \frac{\binom{r}{i}\binom{m-r}{n-i}}{\binom{m}{n}}$$

where m is the total number of complexes (10 000), n the rank of the first near-native complex, r the total number of near-native complexes in the ensemble, a = 1 (at least one near-native structure is to be found) and b = min(n, r), the upper limit of the possible number of near-native structures when picking n times. This gives the probability of obtaining by chance at least one, and at most all possible, near-native complexes when picking n times. Structures with an r.m.s.d. of ≤3.0 Å were considered near-native. If no near-native conformation was found, b was set to 1 and n was set to the rank of the best structure in the ensemble.
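The chance probability defined above can be evaluated directly with exact integer binomial coefficients. The following Python sketch assumes the standard hypergeometric sum over i = a … b with b = min(n, r); the function name is hypothetical and the special case for ensembles without a near-native structure is not handled.

```python
from math import comb

def chance_probability(m, n, r, a=1):
    """Probability of drawing between a and min(n, r) near-native complexes
    when picking n complexes at random from m, of which r are near-native."""
    b = min(n, r)
    hits = sum(comb(r, i) * comb(m - r, n - i) for i in range(a, b + 1))
    return hits / comb(m, n)

# Values quoted for 1PPE in the Results: rank 9, 299 near-native structures.
p_1ppe = chance_probability(m=10_000, n=9, r=299)
```

With these inputs the sketch gives roughly 0.24, close to the p = 0.23 reported for 1PPE, which illustrates why a low rank alone says little when the docking step has produced many near-native structures.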
Results and discussion
ToF is based on calculating the minimum distance between the binding site predicted by ProMate and the protein partner. Three distance scores can be calculated by this method. The binding site can be predicted for the enzyme (E), not taking the inhibitor (I) into account [ToF(E→I)], it can be predicted for the inhibitor, not taking the enzyme into account [ToF(I→E)], or it can be predicted for both [ToF(E↔I)]. Preliminary tests on a reduced dataset clearly demonstrated that ToF(E→I) performs best (data not shown). A possible reason for this is that ProMate was not designed to work for small proteins (<85 amino acids), which is the size of many of the inhibitors.
Analysis of the ToF performance
The performance of ToF(E→I) for all 21 structures is summarized in Table I and Figure 1. The most intuitive criterion for success of the scoring function is whether near-native structures are found at low ranks. A near-native structure with an r.m.s.d. of <3.0 Å is found among the 10 top-ranking structures for seven complexes, and eight additional complexes have an r.m.s.d. of <5.1 Å among the 10 top-ranking structures. However, this measure of success does not discriminate between the performance of the docking function and that of the scoring function. If, for example, 9990 near-native structures are found by the docking algorithm and the scoring function gives the first 10 ranks to only the non-native structures, the scoring function performs badly even though the lowest rank of a near-native structure is only 11. If, on the other hand, the docking algorithm generates just one near-native structure and the scoring function gives this structure the rank 11, the scoring function performs well. A way to quantify the performance of the scoring function given the performance of the docking algorithm is to calculate the probability of performing as well as or better than the scoring function by random picking. This probability can be described with a hypergeometric distribution. The difference between the probability and the lowest rank as a criterion is clearly demonstrated for 1PPE: although the rank of the first near-native structure is 9, which appears to represent good success of the scoring function, the existence of 299 near-native conformations (which reflects the success of the docking program) makes it fairly probable that the same result could be obtained by random picking (p = 0.23). As demonstrated here, the probability is the least biased single value that describes the performance of the scoring function. The average probability of being as good as or better than the scoring function by chance is p = 0.08. When the two outliers, 2SNI and 1BRS, are excluded, the average probability drops to p = 0.05. Hence the scoring function performs much better than random picking. In four cases the probability is <0.01 and in five cases it is >0.1. Of these five cases, three have a probability of p > 0.2. These three or five cases can be considered failures of the scoring function, which corresponds to a success rate of 77–85%.
Relationship between r.m.s.d. and ToF
A linear relationship between a score and the r.m.s.d. is desirable, since this would allow successful scoring even if no near-native structure is found. Indeed, an R² value of >0.5 is found in 16 cases (Table I and Figure 1). The worst correlation between ToF and r.m.s.d. is found for 1BRS (barnase–barstar) (Figure 1). This is slightly surprising, since the interface prediction works well for both proteins. Nevertheless, it has been observed earlier that the barnase–barstar complex features an unusually large number of interfacial water molecules. One can speculate that the interface between barnase and barstar is not as tight as it is for the other complexes. This would explain both the failure of our scoring scheme for this complex and the existence of water in the interface.
The apparently worst case in Table I is the scoring for 2SNI, with the first near-native structure having rank 2281. This is also the only protein for which no near-native structure has been found in the 1000 top-scoring complexes. The lowest r.m.s.d. of 3.1 Å in the ensemble is slightly higher than our cutoff for near-native structures, indicating that not only the scoring but also the docking failed on this protein. Still, even for this protein the scoring function discriminates fairly well between low- and high-r.m.s.d. structures (Figure 1). However, many false-positive structures in the r.m.s.d. range 5–10 Å are found. This stems from the interface predictor, which has a significant offset from the real interface. Therefore, the highest scoring complex positions the inhibitor close to, but significantly offset from, the real structure. Overall, the results shown in Table I and Figure 1 suggest that for none of the 21 complexes analyzed did the scoring fail completely. A complete failure would be an anti-correlation between the score and the r.m.s.d. or a probability of p > 0.5. The ability of ToF(E→I) to distinguish between near-native and non-native structures without taking into account a prediction of the binding site of the second protein apparently stems from a feature of protein–protein complexes. Of all possible orientations at the binding site, the tightest fitting orientation appears to be the correct one. This points to a lock-and-key mechanism of protein recognition, once the correct binding site is found in an early stage of complexation. Structural changes in the sense of an induced fit mechanism might have to occur to realize the tight fit at the binding site. Still, our scoring function might be improved by taking energy considerations into account.
Docking of homology models gives valuable information, even if the r.m.s.d. to the true complex structure is rather high. Energy-based functions rely on a folding funnel, in which the true conformation is at a low energy and all non-native conformations are of similar, but higher, energy (Gray et al., 2003). Therefore, the applicability of high-resolution energy functions for docking of homology models is problematic, as homology models describe the surface at low resolution. The funnel-like behavior might not be observed in these cases. As our function does not rely on a high-resolution description of the surface, the linear relationship between the ToF score and r.m.s.d. should be retained even for homology models (provided that the interface is predicted correctly within a narrow region).
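The linear relationship between ToF and r.m.s.d. discussed in this section can be quantified with the coefficient of determination of a simple linear fit. The Python sketch below is only an illustration: the function name and the data points are hypothetical and do not come from Table I or Figure 1.

```python
def r_squared(x, y):
    """R^2 of the least-squares line y = a*x + b (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)

# Hypothetical decoys: r.m.s.d. to the reference (in Å) and the corresponding score.
rmsd = [2.0, 5.0, 9.0, 14.0, 20.0, 27.0]
tof = [0.35, 0.48, 0.55, 0.71, 0.80, 0.97]
fit_quality = r_squared(rmsd, tof)
```

A value close to 1 indicates that the score rises steadily with the distance from the reference structure, which is the behavior the text describes as desirable when no near-native structure is present in the ensemble.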
Examining the performance of ToF on selected protein complexes
Here we will show three specific cases, 1AVW, 1ACB and 1CSE, and discuss the performance of our new score in more detail. In all three cases, the location of the interface is correctly predicted by ProMate, albeit with a small offset for 1CSE (Figures 2–4). Still, the rank of the first near-native structure for the three is 1, 11 and 500, respectively, with the best r.m.s.d. within the first 10 results being 2.1, 6.7 and 6.4 Å, respectively. Clearly, we obtain the best result for 1AVW, with both proteins being properly oriented relative to each other. For the other two, the inhibitor is rotated relative to the real complex structure, while its translational position is correct.

It is interesting to compare the spread of the r.m.s.d. of the docking results versus the score (Figure 1) with the spread in the center of mass (Figures 2–4). For 1AVW the predicted interface is located around the correct center, although with a larger spread in comparison with 1ACB. For 1AVW, structures with an r.m.s.d. of up to 15 Å display low scores (Figure 1). This is caused by two factors: on the one hand, the predicted interface is rather large, allowing slightly offset conformations to score well; on the other hand, the binding partner is large, so that an error in the angular orientation results in a large r.m.s.d. Despite these two factors, the top-scoring conformation has an r.m.s.d. of 2.1 Å to the real structure, underlining the power of our scoring function. For 1ACB, the interface prediction is excellent. Still, the first near-native structure is only at rank 11. This is caused by tightly fitting, yet wrongly rotated, conformations at the predicted interface (Figure 3). Here, other scoring functions or biological data might help to distinguish between these possibilities. This also demonstrates that the quality of the interface prediction can be judged from the R² value of the linear fit: a strictly linear relationship as, for example, for 1ACB or 1CHO indicates a good prediction of the interface, while deviations from linearity as, for example, for 1AVW and 1MAH indicate a rather broad predicted interface. Combination with measuring the tightness of fit at the predicted interface nevertheless enables good results to be obtained for these predictions.

For 1CSE, the binding site prediction is slightly offset. This, together with the existence of tightly fitting, yet wrongly rotated, conformations leads to a poor performance, with the first near-native structure at rank 500. However, since the binding site is identified nearly correctly, even the non-native results can improve our understanding of protein–protein interactions and can guide experiments.
A valid question is whether a score that does not depend on any predicted binding site, but only on the tightness of fit of the two proteins, would perform well. If this were true, only the correct binding site would allow a tight fit between the protein and its partner. To answer this question, a modified score was calculated for the successfully docked protein complexes of decoy set I. The modified score does not take into account the binding site prediction, but calculates the normalized average minimum distance of the lowest 10% of all distances between the two proteins. This score cannot discriminate between near-native and non-native structures (Figure 5). While ToF clearly distinguishes between near-native and non-native structures and has a linear relation with the r.m.s.d., the normalized average of the 10% smallest distances between the two proteins is not related to the r.m.s.d. Therefore, the correct orientation of the two proteins is not the tightest fit possible between the two proteins, but the tightest fit possible at the binding site of the larger protein. This conclusion is consistent with the two-step mechanism suggested for protein complexation (Schreiber, 2002). In the first step, the protein surface is roughly scanned for patches that are suitable for binding. This preliminary scan leads to the formation of an encounter complex, where the relative orientation of the two proteins is already near-native, but short-range interactions are not yet formed. The second step scans the binding site for the best possible fit, leading to the final complex. As shown earlier by both our group and others (Lo Conte et al., 1999; Ma et al., 2003), certain characteristics, such as hydrophobicity, atom density and the potential to form specific salt bridges, determine potential binding sites. These are exactly the attributes that allow us to predict binding sites from unbound structures. Hence our procedure, which first scans the protein surface for potential binding sites and thereafter scans the binding site for the tightest fit, emulates the formation of protein–protein complexes.
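The control score without a binding site prediction can be sketched in the same style as ToF. The Python code below is a hedged illustration only: the text does not spell out the normalization, so dividing by the mean minimum distance of all surface residues (by analogy with ToF) is an assumption, and the function names and coordinates are hypothetical.

```python
import math

def min_distances(surface_ca, partner_ca):
    """Minimum C-alpha distance to the partner for every surface residue."""
    return [min(math.dist(p, q) for q in partner_ca) for p in surface_ca]

def unguided_tightness(surface_ca, partner_ca, fraction=0.10):
    """Mean of the smallest `fraction` of the minimum distances, divided by the
    mean over all residues (assumed normalization, no interface prediction)."""
    d = sorted(min_distances(surface_ca, partner_ca))
    k = max(1, round(fraction * len(d)))
    closest = sum(d[:k]) / k
    overall = sum(d) / len(d)
    return closest / overall

# Toy example: ten surface residues at distances 1..10 Å from a single partner atom.
surface_ca = [(float(i), 0.0, 0.0) for i in range(1, 11)]
partner_ca = [(0.0, 0.0, 0.0)]
control = unguided_tightness(surface_ca, partner_ca)
```

Because the closest residues are chosen per docked pose rather than fixed by a prediction, any pose that packs some part of the surface tightly against the partner scores well, which fits the observation that this score does not track the r.m.s.d.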
References
Chance,M.R. et al. (2002) Protein Sci., 11, 723–738.
Chen,R., Mintseris,J., Janin,J. and Weng,Z. (2003) Proteins, 52, 88–91.
Gabb,H.A., Jackson,R.M. and Sternberg,M.J. (1997) J. Mol. Biol., 272, 106–120.
Gray,J.J., Moughon,S., Wang,C., Schueler-Furman,O., Kuhlman,B., Rohl,C.A. and Baker,D. (2003) J. Mol. Biol., 331, 281–299.
Halperin,I., Ma,B., Wolfson,H. and Nussinov,R. (2002) Proteins, 47, 409–443.
Heifetz,A., Katchalski-Katzir,E. and Eisenstein,M. (2002) Protein Sci., 11, 571–587.
Jackson,R.M. and Sternberg,M.J. (1995) J. Mol. Biol., 250, 258–275.
Jackson,R.M., Gabb,H.A. and Sternberg,M.J. (1998) J. Mol. Biol., 276, 265–285.
Katchalski-Katzir,E., Shariv,I., Eisenstein,M., Friesem,A.A., Aflalo,C. and Vakser,I.A. (1992) Proc. Natl Acad. Sci. USA, 89, 2195–2199.
King,B.L., Vajda,S. and DeLisi,C. (1996) FEBS Lett., 384, 87–91.
Lo Conte,L., Chothia,C. and Janin,J. (1999) J. Mol. Biol., 285, 2177–2198.
Lorber,D.M., Udo,M.K. and Shoichet,B.K. (2002) Protein Sci., 11, 1393–1408.
Ma,B., Elkayam,T., Wolfson,H. and Nussinov,R. (2003) Proc. Natl Acad. Sci. USA, 100, 5772–5777.
Marti-Renom,M.A., Stuart,A.C., Fiser,A., Sanchez,R., Melo,F. and Sali,A. (2000) Annu. Rev. Biophys. Biomol. Struct., 29, 291–325.
McConkey,B.J., Sobolev,V. and Edelman,M. (2002) Curr. Sci., 83, 845–856.
Moont,G., Gabb,H.A. and Sternberg,M.J. (1999) Proteins, 35, 364–373.
Neuvirth,H. and Schreiber,G. (2004) J. Mol. Biol., in press.
Norel,R., Lin,S.L., Wolfson,H.J. and Nussinov,R. (1994) Biopolymers, 34, 933–940.
Norel,R., Lin,S.L., Wolfson,H.J. and Nussinov,R. (1995) J. Mol. Biol., 252, 263–273.
Norel,R., Petrey,D., Wolfson,H.J. and Nussinov,R. (1999) Proteins, 36, 307–317.
Norel,R., Sheinerman,F., Petrey,D. and Honig,B. (2001) Protein Sci., 10, 2147–2161.
Rosenfeld,R., Vajda,S. and DeLisi,C. (1995) Annu. Rev. Biophys. Biomol. Struct., 24, 677–700.
Sali,A., Glaeser,R., Earnest,T. and Baumeister,W. (2003) Nature, 422, 216–225.
Sandak,B., Nussinov,R. and Wolfson,H.J. (1998) J. Comput. Biol., 5, 631–654.
Schreiber,G. (2002) Curr. Opin. Struct. Biol., 12, 41–47.
Smith,G.R. and Sternberg,M.J. (2002) Curr. Opin. Struct. Biol., 12, 28–35.
Tovchigrechko,A., Wells,C.A. and Vakser,I.A. (2002) Protein Sci., 11, 1888–1896.
Vakser,I.A. (1995) Protein Eng., 8, 371–377.
Vakser,I.A. (1996) Biopolymers, 39, 455–464.
Walls,P.H. and Sternberg,M.J. (1992) J. Mol. Biol., 228, 277–297.
Received November 26, 2003; revised January 22, 2004; accepted January 22, 2004 Edited by Alan Fersht