Department of Chemistry, Peking University Jiuyuan Molecular Design Laboratory and 1 Department of Technical Physics, Peking University, Beijing 100871, China
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: automated docking/genetic algorithm/molecular recognition/tabu search
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
The computational procedure of docking for proteinprotein and proteinpeptide can be classified into three levels by the degree of approximations (Fraga et al., 1995)RBD (rigid-body docking) (Jiang et al., 1991
), SFD (semi-flexible docking) (Shoichet et al., 1991
) and FD (flexible docking) (Hart et al., 1992; Luty et al., 1995
). The RBD computation usually uses the crystal structures of bound complexes to perform docking calculations, but it is also useful just for bound complexes. It is difficult to find the appropriate associated sites for unbound complex systems because this model does not consider the conformational change during the docking process; however, it is very common for minor conformational changes to result when an active molecule associates with its substrate. SD can be viewed as soft-docking which allows for minor conformational changes when a receptor binds with its substrate. One successful method, Fan Jiang's soft-docking procedure (Jiang et al., 1991
), uses a cube representation of the molecular surface and volume. From this procedure it is possible to design a simple algorithm for a six-dimensional search and to embody implicitly the effects of the conformational changes caused by complex formation. FD is mainly used in the small active moleculeprotein systems. In the search, the transnational and rotational degrees of freedom are restricted and some twist angles of the ligand have been treated as variables in the energetic function. In the docking method of Luty et al. (1995), molecular dynamics is used to evaluate the ligandreceptor interaction. However, while it is well known that fully considering the conformational changes near the active site is very time-consuming, it is still impossible to carry out full energy minimization at each local position for large complex systems, such as proteinprotein systems.
During the binding process between a receptor and its substrate, its energetic potential surface is so complicated that it is impossible to determine the associated site by carrying out minimization using gradient methods such as the steepest descent and GaussNewton methods. These methods fall into the local potential wells very easily. Some stochastic methods, including Monte Carlo simulated annealing, have been introduced into the study of molecular association, usually with a more complete potential energetic function. We have introduced the Simplex method into the minimization procedure and we found that it could overcome the local minima more easily than Gradient methods. Combined with a random search, Simplex methods can offer a good set of answers to some systems (Wang et al., 1997). Genetic algorithms, which are regarded as intelligent stochastic methods, were introduced into computational chemistry in the late 1970s (Michalewicz et al., 1994). Now genetic algorithms have been used in other fields of computational chemistry. Oshiro et al. (1995) started work on docking procedures in 1994. The development of two kinds of GA method was proposed, which were a sphere-based GA method and a explicit orientation-based method. Both of the two GA methods aimed at optimizing orientations and conformations of the ligand. The fitness of each of the `chromosomes' was the molecular mechanics interaction energy. More recently, the tabu (or taboo, TS) (Glover et al., 1993) search has begun to attract attention as an effective heuristic search procedure for combinatorial optimization problems in the molecular design field. David et al. (1997) was the first to apply this search method to a docking procedure and proved it was very effective in finding the proper binding mode.
We have compared several heuristic algorithms in a previous study (Hou et al., 1999), which showed that a genetic algorithm and tabu search were both superior to the Monte Carlo simulated annealing algorithm. But from a comparison of the results, we found that these two algorithms did not perform very effectively in all conditions, in some cases both GA and TS showed bad results. It is difficult to solve a docking problem thoroughly when using only a single algorithm. Although genetic algorithms can overcome the potential barriers successfully in some cases, it is still common that GA staggers in local potential wells in most cases. With respect to the escape from local minima, TS seems more superior than GA; however, it converges relatively slower, especially near the best solutions. So according to their merits and shortcomings, a hybrid algorithm (HA) combining GA with TS was proposed. The hybrid algorithm was applied to explore the possible associated sites of proteinpeptide and proteinprotein complexes. It is expected that the hybrid method may overcome the potential barriers more easily. In our laboratory, we have developed different score functions for the following two stages of conformation searching. In the first stage, surface complementarity is considered, while in the second stage only energetic complementarity is considered. From extensive studies, we found that the steric complementarity was more important than energetic complementarity, especially for proteinprotein and proteinpeptide systems. Moreover, in order to take account of the conformational flexibility of the ligand and the protein, two strategies were introduced into our docking procedure. The first docking stage is a kind of soft-docking, some degree of surface overlap is tolerated to account for side-chain flexibility of the proteins. The second docking stage is a sort of flexible-docking, to some relatively small ligands, the internal conformational flexibility of the ligands are also taken into account, some torsion angles of the ligand are allowed to rotate freely. Using these two strategies, the conformational changes can be well considered to some extents. The program is written in C language and run under the Unix, Dos or Windows operating systems, and our molecular docking procedure has been embedded into the Peking University Interaction Computational System (PUICS) as a separate module. In this study, all ligands were considered as soft bodies, but their torsion angles cannot be allowed to rotate freely.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
The idea of a genetic algorithm was borrowed from genetics and natural selection. A population of `chromosomes' encoding solutions to the problem is first generated and then it `evolves' through a process similar to biological evolution, including genetic crossover, genetic mutation and natural selection. Chromosomes encoding good partial solutions survive, reproduce and combine to generate new chromosomes, which hopefully encode better solutions in the succeeding generations. Chromosomes with small fitness will gradually perish in the succeeding generations.
The strength of the genetic algorithm lies in its ability to handle a large and diverse set of variables. For example, genetic algorithms have been considered as one of the two strongest and hopeful methods in conformational analysis (Fraga et al., 1995) (the other one is molecular dynamics) which involve a large set of variables (twist angles). The genetic algorithm has been widely used in computational chemistry, more detailed introductions can be found in Michalewicz et al. (1994).
Tabu search
The tabu search was first suggested by Glover et al. (1993) and originally applied in the field of operation research. However, compared with the field of computational chemistry, it may be a somewhat new algorithm. The basic idea of the method, described by Glover et al. (1993), is to explore the search space of feasible solutions by a sequence of moves, and, in the mean time, some restrictions will be imposed to enable a search process to rake otherwise difficult regions. The real foundation of the tabu search may be sought in concepts that systematically violate feasibility conditions, as in heuristic procedures based on surrogate constraints, or even in cutting plane algorithms.
A tabu search will only remain one current solution during the course of a search. First, an initial solution is specified or randomly generated at the start of the iterations, then, some moves are generated from the current solution. Each of these moves is evaluated using the evaluation function and they are ranked in order (the best move at the head of the list). Moves are considered as tabu if they are not different enough from those solutions in the tabu list (we will define the criteria to check whether they are tabu). The best move will be accepted if it is better than other solutions in the tabu list. So, only non-tabu solutions will be accepted. If neither of these criteria can be met, the iteration cycle is terminated. If a new current solution can be found, it will be added to the tabu list. If the tabu list is full after several iterations, the current solution will replace one of the solutions in the tabu list. Usually, the tabu list will be managed in a `first-in, first-out' manner. When a new current solution has been identified and stored, additional moves are generated from it and the search procedure will continue with a new iteration. After a number of iterations, if the best solution cannot be changed, `convergence' is achieved, the tabu search exits and the best solution will be returned. The restrictions imposed in the tabu search mean that this algorithm can search relatively large areas after many iterations, and it has been proven that tabu searches can efficiently find the global solution to difficult optimization problems.
The hybrid algorithm combined with GA and TS
A comparison of the GA and TS algorithms shows that they both have their merits and limitations. GA converges faster when near the best solution, and can find it very quickly, but GA can fall into local minima relatively more easily. In contrast with GA, TS can avoid falling into local minima, but it converges relatively more slowly. As a result, a hybrid algorithm was proposed. The basic procedure of the hybrid algorithm is similar to TS, but compared with traditional TS, it is different with respect to two points. The first modification is that after N possible moves from the current solution, some extra steps of crossover and mutation operations, which come from GA, are added. The test results showed that this modification offered particular advantages over the traditional tabu search. The second modification is that we use an elite population to remain several best and different solutions in the crossover and mutation operations. After 2030 steps of crossover operation, N new solutions are ranked, and several of the best several solutions in the elite population are compared with the solutions in the tabu list to check if they are tabu; if they are, then the GA iteration will be performed again. In a traditional tabu search only one current solution remains during the search, but in the second modification, several solutions remain, which may help the hybrid algorithm converge faster. The scheme of the hybrid algorithm is illustrated in Figure 1. The new hybrid algorithm combines the advantages of both GA and TS; it not only converges faster, but also does not fall into local minima easily.
|
In the first step, the dot surface is generated using the MS program written by Connolly (1983). The parameters used in this program are discussed later. Then the coordinates of the probe molecule and the target molecule as well as their surface dot are randomly rotated and translated. The surface dot coordinates have also been synchronistically moved with the probe and target molecules. Then an initial solution is randomly generated containing six variables, three translational degrees of freedom and three rotational degrees of freedom. The three rotational variables are described by three Euler angles. The position of the target molecule is fixed and six variables define the orientation of the probe molecule.
The initial solution is evaluated using surface complementarity. The evaluation score is composed of two parts: the matching score and penalty score of atomic overlapping. The matching score is calculated in the following way. For each surface dot of the probe molecule, the matching property with target surface dots within a certain distance is repeatedly tested (Figure 2). The distance usually reaches a value 1.02.0 times the sum of radii of the two atoms which hold the two dots. The normal lines of the two interesting dots have a separation angle and if the angle is smaller than a threshold, says 30°, the two dots can be assumed to be `matched' and the areas shared by the two dots are added. The total value of those areas is used to evaluate the matching complementarity (Figure 3
).
|
|
![]() |
where Scorematch is the matching score and Scoreoverlap is the penalty score. Const is a coefficient balancing the contributions of the two parts. Const is mainly determined by the dot density, an important parameter of the MS program, which is defined as the average dot numbers per angstrom square area of both probe and target molecules. Const usually takes a value 520-fold that of the dot density. Then the hybrid algorithm is performed iteratively, Equation 1 is used to evaluate all solutions. After convergence, a set of solutions is obtained from the tabu list. Using the hybrid algorithm, in most cases we have successfully overcome the tendency of GA to stagger in the local extreme points. Generally, if we performed 50 steps of crossover and mutation operations for one TS iteration, 50100 iterations were enough to find the appropriate sites. If the best several solutions cannot be improved after 1020 tabu iterations, `convergence' is achieved. For a typical system containing more than 15003000 atoms, this process takes between 20 and 30 h on a PC with a Pentium II 350 processor. Moreover, cluster analysis is not needed in this stage. Because in the minimization process every solution has been checked whether it is tabu; the threshold measure used in this paper to determine the tabu status has a r.m.s.d. of 3 Å or less between the two solutions being compared. Then, a more detailed searching will be performed for each conformation at the next stage.
Detailed searching the local associated sites based on energetic complementarity
In this stage, a more detailed search is performed for each solution derived from the first stage. The position of the target molecule is also fixed and some changeable variables are defined for the probe molecule. In this stage, only a local search was performed near these binding sites from the surface complementarity. Considering the fast convergence of GA near the best solution, we usually only use GA in the local search. A set of `chromosomes' are randomly generated and each one represents an orientation. The fitness score of each `chromosome' is the interaction energy between the probe and target molecules. Only van der Walls energy, electrostatic energy and hydrogen-bond energy are considered. The force field was AMBER (Weiner et al., 1984, 1986
). Non-polar hydrogen atoms were omitted for simplification and united atom types were introduced in order to evaluate the interaction energy more precisely. Our purpose of this step is mainly to purge the high-energy conformations after the surface complementarity, and in the mean time to calculate the interaction energy precisely. When the unbonded interaction energy remains stable in a user-defined region after 2030 iterations, `convergence' is achieved. At this stage, cluster analysis should be performed and the solution with the largest fitness is selected to calculate the r.m.s. with the crystal structure.
At this stage, different systems are treated in different manners. For proteinprotein and some proteinpeptide systems, due to the high flexibility of the ligand, it is very difficult to consider their conformational change at this stage; so for relatively large ligands, only three degrees of translation and three degrees of rotation are considered. For proteinsmall molecule systems, a flexible docking procedure is applied, the internal conformational flexibility of the ligand is taken into account, and some torsional angles are defined as variables in the GA minimizations.
The difference between the two stages lies in the fact that the first one optimizes the orientations in the whole translational space; however, in the second stage, the translational vector is restrained near the associated sites derived from the first stage. The CPU used in this step is only about 5 percent of that in the first stage.
![]() |
Result and discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
In our docking procedure, so many parameters need to be carefully calibrated. So before every docking calculation, these parameters must be properly defined. Some important parameters and their abbreviations in our docking procedure are listed in Table II. The parameters can be divided into two types: four parameters are concerned with surface and energetic complementarity, the other seven are concerned with the hybrid minimization algorithm. An appropriate selection of these parameters is important, since they affect not only the total computational time but also the quality of the solutions.
|
For the hybrid minimization algorithm, some parameters will greatly affect the calculation results. The parameters of the hybrid minimization algorithm include tabu size, tabu threshold measure, tabu iterations, the number of moves (population size), elite size, mutation ratio and crossover ratio. The tabu size controls how many different solutions should remain in the tabu list, definition of this parameter by the user is optional. In general, a 1020 tabu size is recommended in our docking procedure. A larger tabu size did not improve the final solutions by much, but more time was consumed on solution comparison. In this study, the tabu size for all test sets were all defined as 20. From our studies, Tabu threshold measure was a very important parameter, it directly connects with the efficiency of final solutions, this parameter is used to determine if the solutions after the GA interruptions are different enough from those solutions in the tabu list. In this study, the r.m.s.d. between two conformations was used to compare two different docked binding modes, and the tabu threshold measure could vary with specific studied system. For larger complexes, this parameter can be defined larger, but for some smaller complexes, this parameter can be smaller. For example, in systems 1, 2 and 3, the tabu threshold measure was defined as 2 Å, but in other systems, the tabu threshold measure was defined as 4 Å. Another parameter, the number of moves (population size) means that for every tabu iteration, how many moves are generated based on the current solution. This parameter also represents the population size in GA iterations. The larger the number of moves, the greater the chances that global orientations can be found and the more time that will be consumed. The elite size is the number of best solutions directly into the next GA iterations and tabu list in the elitist strategy. In the calculations, the elite size is about 5 percent of the number of moves, the number of moves is 50 and the elite size is 2. The mutation ratio and crossover ratio are defined as 0.05 and 0.35, respectively.
Five bound complexes
The initial five complexes in Table I are bound complexes of proteinprotein and proteinpeptide. The crystal structure of the complexes are directly from the PDB, the components of each complex were taken apart at an arbitrary relative orientation, then our docking procedure was used to dock together and compare the docked complexes with the crystal structures of the complexes.
In the surface complementarity stage, in order to speed up the calculations, all hydrogens were omitted. Moreover, for energetic complementarity and surface complementarity, two types of reference frames were used. At the surface complementarity stage, the alternate space axes remained the same as the initial axes from PDB. But in the energetic complementarity stage, the alternate space axes were transformed, after transformation, the origin of the coordinate was superposed with the gravity center of the ligand molecule after the surface complementarity. The goal of the transformation was to leave a relatively small rotation vector for the local search stage of energetic complementarity, thus restraining the movement of the ligand within a relatively small region. In the second stage, the translational extents cannot exceed 4 Å. In the energetic complementarity stage, a cut-off of 12 Å was used to calculate the van der Waals interaction, another cut-off at 16 Å was applied in the calculation of the Coulombic interaction.
Table III lists the calculated conformations from both the first and second stages. For the surface complementarity, the top three solutions are given in Table III
; for the energetic complementarity, only the best solution for each binding mode from the surface complementarity was given. For these five systems, the correct binding model could be found within 50 tabu iterations. The calculated results were very encouraging (see Table III
and Figure 4
); in most cases, the smallest root mean square distance is smaller than 1.0 Å (besides 1PLG). To these five bound complexes, the correct binding conformation was only determined precisely using surface complementarity. In our laboratory, abundant bound complexes were calculated using our docking procedure, in most cases, the best binding conformation possesses the best surface complementarity. But in a very few cases, another binding conformation may exist with a better surface complementarity than the correct binding conformation; these calculation results will be discussed in the next paper. But, in general, given a reliable surface score function, you can be sure of finding the correct binding conformation.
|
|
Two unbound complexes
The ultimate goal of molecular docking is to predict proteinprotein and proteinpeptide interactions without requiring a complexed crystal structure. Compared with the docking of these complexes with crystal structures, calculations for complexes without crystal structures seem more difficult. During the formation of a complex, some molecules will undergo conformational changes, so the docking procedure must be sufficiently soft to manage conformational changes, yet specific enough to identify the correct solution. In some cases, especially, the binding regions between protein and protein or peptide are unknown, complete search using flexible body is not tractable. Even using rigid-body, it is very difficult to determine the global minimum using conventional minimization algorithms. In these two cases, the calculation results are listed in Table IV, the highest ranked correct prediction is shown in Figure 5
.
|
|
For 2KAI, it can be found that the best solution from the surface complementarity no longer corresponds to the correct docking conformation. After detailed energetic minimization and superimposed with the crystal structure of bound 2KAI, a good solution was found in four out of 10 solutions from the tabu list. So for some unbound complexes, using surface complementarity alone cannot reliably dock unbound complexes, further energetic complementarity is needed to filter these solutions from the surface complementarity. In the mean time, we cannot conclude that the correct solution is sure to have the best energetic complementarity, because in the docking process, we do not really consider the flexibility of the systems. For 2KAI, the second solution has the smallest interaction energy, but its r.m.s.d. was larger than 10 Å.
The crystal structures of uncomplexed trypsin (3PTN) and an uncomplexed trypsin inhibitor (4PTI) have been solved separately in different crystal forms. A comparison of their structures with the corresponding components of a complex has indicated that relatively large conformational changes have occurred, especially in the trypsin inhibitor. After superimposing only the backbone atoms for 3PTN and 4PTI, the r.m.s.d. for 3PTN is only 0.323 Å; but for 4PTI, the conformational change is relatively large, its r.m.s.d. is 1.272 Å. This test case is very challenging, because it has been extensively studied by several other docking procedures (Kaichalski-Katzir et al., 1992; Gabb et al., 1997). The attempt by Katchalski-Katzir et al. (1992) to dock 3PTN and 4PTI was unsuccessful. Our calculation results listed in Table IV
show the correct binding conformation was successfully found. When we superimposed the native complexed crystal structure of 2PTC with our docking result, the r.m.s.d. was 2.54 Å (only backbone atoms were considered in the calculation of r.m.s.d.). From Table IV
, it is obvious that the best solution from surface complementarity corresponds with the correct solution, but it no longer significantly better than the rest of the solutions. It is relatively difficult for us to determine if the first solution is the correct solution, but after the second stage of energetic complementarity, we found that the first solution was more energetically favorable, in fact, this solution was very close to the correct solution. In order to compare the potential influence of the conformational change and test our minimization algorithm further, we docked receptor and ligand from the complexed protein structure together (see case 4 in Table III
). For the bound and unbound structures, the same parameters were used. But we found that the calculation results were so different, the best solution of surface complementarity for the bound system was much better than that for the unbound system. When comparing their r.m.s.d., the result for bound system was significantly superior to unbound system, the r.m.s.d. for bound system is only 0.475, much smaller than that of the unbound system. The reason for these differences between the bound and unbound complexes are mainly derived from the conformational changes during the process of forming the complex. These conformational changes may greatly affect the shape of a molecular surface. Minor changes in the molecular surface, especially near the binding site, will greatly affect the docking results. Our methods only implicitly consider the conformation change for these molecules near the binding site, but the conformations for these molecules do not really change. In this circumstance, the docking results will produce some deviations from the real complex. But for 2PTC, the surface did not change a lot in the process of complex formation, the essential molecular shape does not change greatly. Generally, for most unbound complexes, the best binding conformation may not correspond with the best surface or energetic complementarity, but some binding conformations with a relatively large score of surface and energetic complementarity surely can be found near the correct binding mode. So using surface complementarity and a good minimization algorithm, combined with the filter of energetic complementarity, in most cases, it is possible for us to predict the binding mode for an unbound complex.
Technical information
The complete docking package, named SFDOCK, consists of approximately 5000 lines of C language code, including a soft-docking procedure for proteinprotein interactions, a flexible-docking procedure for the small moleculesprotein procedure and a database searching procedure for ligand design. All docking experiments were carried out on a PC. The soft-docking procedure has been embedded into the Peking University Interaction System as a separate module and can be easily used through a graphics interface. Some source code from this study can be obtained from the author upon request.
![]() |
Acknowledgments |
---|
![]() |
Notes |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
David,R.W., David,E.C. and Christopher,W.M. (1997) J. Comp. Mol. Des., 11, 209228.[ISI]
Fraga,S., Parker,J.M. and Pocok,J.M. (1995) Computer Simulations of Protein Structures and Interaction. Springer Verlag, BerlinHeidelbergNew York, p. 2081.
Glover,F. and Laguna,M. (1993) In Reeves,C.R., Modern Heuristic Techniques for Combinatorial Problem. Blackells, Oxford, UK, pp. 70150.
Hou,T.J., Wang,J.M. and Xu,X.J. (1998) Chinese Chemical Letters. (In press).
Jiang,F. and Kim,S.H. (1991) J. Mol. Biol., 219, 79102.[ISI][Medline]
Katchalski-Katzi,E., Shariv,I., Eisenstein,M., Friesen,A.A., Alfalo,C. and Wodak,S.J. (1992) Proc. Natl Acad. Sci. USA, 89, 21952199.[Abstract]
Gabb,H.A., Jackson,R.M. and Sternberg,M.J.E. (1997) J. Mol. Biol., 272, 106120.[ISI][Medline]
Luty,B.A., Wasserman,Z.R., Stoutern,P.F.W., Hodge,C.N. and McCammon,J.A. (1995) J. Comp. Chem., 16, 454464.[ISI]
Michalewics,Z. (1994) Genetic Algorithm + Data Structures = Evolution Programs. Springer Verlag, BerlinHeidelbergNew York, pp. 1330.
Oshiro,C.M., Kuntz,I.D. and Dixon,J.S. (1995) J. Comp. Mol. Des., 9, 113130.[ISI]
Shoichet,K. and Kuntz,I.D. (1991) J. Mol. Biol., 221, 327346.[ISI][Medline]
Wang,J.M., Chen,L.R., Jiang,F. and Xu,X.J. (1998) Proceeding of Chinese Peptide Symposium-96, ESCOM.
Weiner,S.J., Kollman,P.A., Case,D.A., Singh,U.C., Ghio,C., Alagona,G., Proteta,S.Jr. and Weiner,P. (1984) J. Am. Chem. Soc., 106, 765784.[ISI]
Weiner,S.J., Kollman,P.A., Nguyen,D.T. and Case,D.A. (1986) J. Comp. Chem., 7, 230252.[ISI]
Received September 18, 1997; revised January 22, 1999; accepted May 28, 1999.