BCB (Berlin Center for Genome-based Bioinformatics) at the Institute of Biochemistry, Charité (Medical Faculty of the Humboldt University Berlin), Monbijoustrasse 2, D-10117 Berlin, Germany
1 To whom correspondence should be addressed. e-mail: elke.michalsky{at}charite.de
Abstract |
---|
Keywords: homology modelling/protein loops/protein segments/structure database/structure prediction
Introduction |
---|
Loop modelling approaches fall into two main categories: knowledge-based and ab initio (de novo) methods. Knowledge-based approaches try to find a segment of a protein with known three-dimensional structure that fits the stem regions of a loop. The residues preceding and following the loop are called stem residues. Usually, a database search is followed by an evaluation of suitable candidates and an optimization by means of an energy function. Ab initio methods share a search for, or an enumeration of, conformations, which is usually based on potentials or scoring functions. Often knowledge-based parts are included, e.g. phi-, psi-maps of known loops (e.g. Fiser et al., 2000; Deane and Blundell, 2001; Tosatto et al., 2002).
The first ab initio methods for modelling loops or short polypeptide segments were introduced by Moult and James and by Bruccoleri and Karplus, using conformational search with an optional energy minimization (Moult and James, 1986; Bruccoleri and Karplus, 1987). Fine et al. generated multiple conformations followed by either energy minimization or molecular dynamics followed by minimization (Fine et al., 1986). Knowledge-based methods were pioneered by Greer (Greer, 1981), combined approaches were introduced by Martin et al. (1989) and Sutcliffe et al. presented one of the first automated methods (Sutcliffe et al., 1987a,b).
Van Vlijmen and Karplus presented a knowledge-based approach where a set of loops is selected from a database, followed by a constrained optimization of the loop orientation and ranking by means of an energy function (van Vlijmen and Karplus, 1997). Starting from a set of possible loop conformations extracted from a database, Samudrala and Moult use a graph theoretical approach to find the conformation that best approximates the native one. Plausible conformations are found using a clique-finding method, which combines a recursive backtracking procedure with a branch and bound technique (Samudrala and Moult, 1998).
An ab initio method is presented in Fiser et al. (2000). Here, the positions of all non-hydrogen atoms are optimized with respect to a pseudo energy function, supplemented with statistical preferences for dihedral angles and for non-bonded atomic contacts. The algorithm of Tosatto et al. (Tosatto et al., 2002) is based on a divide and conquer approach recursively decomposing the target loop until the conformations of the resulting segments can be compiled analytically. For this purpose, a database of possible conformations for loop segments is used, which were anticipated using a list of (phi, psi)-angle pairs extracted from the Protein Data Bank (PDB) (Berman et al., 2000). Artificial neural networks are used in Reczko et al. (1995) to predict H3 loops of a set of antibodies. The neural network is trained on a set of loops that are similar to known H3 loops. CODA, an algorithm presented in Deane and Blundell (2001), combines a knowledge-based and an ab initio method by clustering the predictions of the two algorithms and making a consensus prediction using a set of filters.
Although both ab initio and knowledge-based loop modelling methods have improved in recent years and particularly the length of modelled loops has increased, it was concluded from the CASP4 experiment (Critical Assessment of Techniques for Protein Structure Prediction) (Lattman, 2001) that there was no significant progress in homology modelling in general (Moult et al., 2001; Tramontano et al., 2001). Fidelis et al. compared the performance of an ab initio and a database method and concluded that database methods are limited to loops of four residues (Fidelis et al., 1994). However, van Vlijmen and Karplus succeeded in predicting loops of length nine with reasonable accuracy by means of a database method (van Vlijmen and Karplus, 1997). Deane and Blundell stated that their database search method is overtaken by their ab initio method at around six residues loop length (Deane and Blundell, 2001). All this was a motivation to create Loops In Proteins (LIP), the database of protein segments presented in this paper, and to supplement it by different selection criteria and a ranking function designed for the purpose of loop prediction. The performance of the resulting loop prediction algorithm was compared in detail with a recently published ab initio approach (Fiser et al., 2000), in the following called the Fiser method, and with two further methods for a small loop test set.
Materials and methods |
---|
Test sets
A non-homologous set of protein structures from the PDB (<20% pairwise sequence identity) that were determined by X-ray crystallography at a resolution of 1.8 Å or better was obtained from http://www.fccc.edu/research/labs/dunbrack/pisces/culledpdb.html. Secondary structural elements were identified using the DSSP program (Kabsch and Sander, 1983). Those segments connecting two secondary structural elements were defined as loops. Thus, N- and C-termini were excluded in particular. Loop test sets, each containing 50 loops of the same length, length ranging from 1 to 15 residues, were extracted by random selection. No test set for a given length contains two loops from the same protein structure. These test sets were used to optimize the selection criteria and ranking function described in the subsequent paragraphs and are therefore called parameterization test sets in the following.
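The loop definition above (a segment connecting two secondary structural elements, termini excluded) can be sketched as a scan over a per-residue DSSP string. This is a minimal illustration, not the authors' extraction code: the function name `find_loops` is hypothetical, and the choice of which DSSP codes count as secondary structural elements (`SSE_CODES`) is an assumption, since the text does not state it.

```python
from typing import List, Tuple

# Assumption: helix (H, G, I) and strand/bridge (E, B) codes count as
# secondary structural elements; everything else is treated as loop.
SSE_CODES = set("HGIEB")


def find_loops(dssp: str, max_len: int = 15) -> List[Tuple[int, int]]:
    """Return (start_index, length) for every segment that connects two
    secondary structural elements and has at most max_len residues."""
    loops = []
    i, n = 0, len(dssp)
    while i < n:
        if dssp[i] in SSE_CODES:
            i += 1
            continue
        start = i
        while i < n and dssp[i] not in SSE_CODES:
            i += 1
        # A segment touching either chain end is a terminus, not a loop.
        flanked = start > 0 and i < n
        if flanked and (i - start) <= max_len:
            loops.append((start, i - start))
    return loops
```

For a test set of a given length, one would then draw at most one loop per protein structure at random from the segments of that length.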
For comparison purposes, loop predictions were made for the test sets from Fiser et al. (2000). They are available at the URL http://www.salilab.org/. Each test set consists of 40 loops of the same length, with the length, i.e. the number of amino acid residues, ranging from 1 to 14. Some of the proteins included in the test sets were substituted by newer versions in the PDB: 4ptp was substituted by 5ptp, 2cyr by 3cyr, 4fxn by 2fox, 3b5c by 1cyo, 1aak by 2aak. For technical reasons, i.e. missing stem residues, some loops had to be eliminated from the test sets. This concerns one loop each of length 4, 6, 7 and 12 residues, and two 14-residue loops. All test sets are available at http://www.protein-design.com/LIP/.
LIP database
LIP is a comprehensive compilation of backbone conformations found in the PDB. It includes all protein segments of 1–15 amino acid residues length contained in the PDB, which amounts to $10^8$ segments. For the purpose of loop modelling, both NMR structures and theoretical models are excluded from the database. Furthermore, only proteins with a resolution of 3.5 Å or better are included.
For each protein segment, the following items are stored: length; PDB identifier of the protein; PDB number of the N-terminal stem residue; amino acid sequence; the values $x$ and $y$, where $(x, y)$ is a two-dimensional vector between the atoms $C_\alpha^{(N)}$ and $N^{(C)}$ in the $C_\alpha^{(N)}C^{(N)}N^{(C)}$ plane, so that the distance between the stem residues can be calculated from $\mathrm{dist}_{stem} = (x^2 + y^2)^{1/2}$; $\beta$, the angle included by the lines connecting the atoms $C^{(N)}$ and $N^{(C)}$ and also $N^{(C)}$ and $C_\alpha^{(C)}$; and $\gamma$, the dihedral angle between the $C_\alpha^{(N)}C^{(N)}N^{(C)}$ and $C^{(N)}N^{(C)}C_\alpha^{(C)}$ planes (for clarification, see Figure 1). Atoms indicated by a superscript, (N) and (C), belong to the N- and C-terminal stem residues, respectively. The loop parameters are stored in different files, which are indexed by sequence length and the rounded values of $x$ and $y$. These files are stored in several subdirectories, which are named after the respective $x$ and $y$ value pairs.
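To make the stored geometry concrete, the following sketch computes the four stem parameters from the coordinates of the four stem atoms. It is an illustration under stated assumptions rather than the authors' implementation: the names `stem_parameters` and `index_key` are hypothetical, the second angle is called `gamma` here, the local frame places the alpha carbon of the N-terminal stem residue at the origin with the carbonyl carbon on the positive x-axis, `y` is taken as non-negative, the dihedral sign follows the usual four-atom convention, and angles are in radians.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a: Vec) -> Vec:
    n = math.sqrt(_dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def stem_parameters(ca_n: Vec, c_n: Vec, n_c: Vec, ca_c: Vec):
    """Return (x, y, beta, gamma) for the stem atoms CA(N), C(N), N(C), CA(C)."""
    # In-plane coordinates of N(C): origin CA(N), x-axis towards C(N).
    ex = _unit(_sub(c_n, ca_n))
    v = _sub(n_c, ca_n)
    x = _dot(v, ex)
    y = math.sqrt(max(0.0, _dot(v, v) - x * x))  # assumption: y >= 0
    # beta: angle at N(C) between the lines to C(N) and to CA(C).
    u, w = _sub(c_n, n_c), _sub(ca_c, n_c)
    cos_beta = _dot(u, w) / math.sqrt(_dot(u, u) * _dot(w, w))
    beta = math.acos(max(-1.0, min(1.0, cos_beta)))
    # gamma: dihedral between planes CA(N)-C(N)-N(C) and C(N)-N(C)-CA(C).
    b1, b2, b3 = _sub(c_n, ca_n), _sub(n_c, c_n), _sub(ca_c, n_c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    m1 = _cross(_unit(b2), n1)
    gamma = math.atan2(_dot(m1, n2), _dot(n1, n2))
    return x, y, beta, gamma


def index_key(length: int, x: float, y: float) -> Tuple[int, int, int]:
    """Hypothetical file index: sequence length plus rounded x and y."""
    return (length, round(x), round(y))
```

The stem distance then follows as `math.hypot(x, y)`. Indexing by length and rounded `(x, y)` means that a lookup only has to touch the few files whose cells can contain segments spanning the gap in question.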
Selection of loop candidates from the database
As a first step, a list of loops of the required length with arbitrary amino acid sequence is extracted from the database by reading four files with the required indices. A loop is included in the list if it fits into the gap between the N- and C-terminal stem residues with a tolerance of 0.75 Å. For each loop extracted from the database, a goodness is calculated. The goodness is a rough estimate of RMSDstem, which is defined as the RMSD (root mean square deviation) with respect to the $C_\alpha^{(N)}$, $C^{(N)}$, $N^{(C)}$ and $C_\alpha^{(C)}$ atoms of the original protein and the protein that contains the loop candidate after superposition of those atoms. Values incorporated into the calculation of the goodness are $x$ and $y$ and the angles $\beta$ and $\gamma$ (the stem angle and the dihedral angle defined in the preceding section):

$$\mathrm{goodness} = \Delta x^2 + \Delta y^2 + 2\,(\Delta\beta^2 + \Delta\gamma^2)$$

where e.g. $\Delta x^2$ denotes the squared deviation of the $x$ values of original protein and loop candidate. To calculate the goodness, the following coordinate system is chosen: $C_\alpha^{(N)}$ is the origin and $C^{(N)}$ defines the positive part of the x-axis. Considering original loop and loop candidate in this system, the $C_\alpha^{(N)}$ and $C^{(N)}$ atoms, respectively, coincide owing to the identical bond lengths. The squared distance between the $N^{(C)}$ atoms is $\Delta x^2 + \Delta y^2$. In the following, it is assumed that both $N^{(C)}$ atoms coincide: the enlargement of the $C_\alpha^{(C)}$ distance owing to the translational displacement of the $N^{(C)}$ atoms is neglected for simplification. Only the rotation term is retained. Let $\theta$ denote the angle enclosed by the two $N^{(C)}C_\alpha^{(C)}$ bonds (with centre $N^{(C)}$). The distance between the $C_\alpha^{(C)}$ atoms is smaller than their arc distance, which equals bond length $\times\ \theta$. For $\theta$, the inequality $\theta^2 \le \Delta\beta^2 + \Delta\gamma^2$ holds. With $\mathrm{dist}(C_\alpha^{(C)})$ denoting the distance between the $C_\alpha^{(C)}$ atoms, this yields:

$$\mathrm{dist}(C_\alpha^{(C)})^2 \le \text{bond length}^2 \times \theta^2 \le \text{bond length}^2 \times (\Delta\beta^2 + \Delta\gamma^2) \approx 2 \times (\Delta\beta^2 + \Delta\gamma^2)$$

with the assumption bond length $\approx 1.4$ Å. As shown above, the distances between the remaining atom pairs can be calculated exactly by the equations:

$$\mathrm{dist}(C_\alpha^{(N)}) = \mathrm{dist}(C^{(N)}) = 0$$

and

$$\mathrm{dist}(N^{(C)})^2 = \Delta x^2 + \Delta y^2$$

Overall, considering all simplifications, goodness is a qualitative upper estimate for $\mathrm{RMSD}_{stem}^2$.
In the second step, the loop candidates are ranked according to goodness and the best 250 loops are selected. A database search takes less than 1 s on average, as only four files containing the loops that fit into the gap have to be read (see above).
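The two selection steps can be sketched as follows. This is a hedged illustration, not the published code: the function names and the dictionary layout of a candidate are hypothetical, the second stored angle is called `gamma`, angles are assumed to be in radians (consistent with the arc-length argument), the dihedral difference is wrapped into the interval from -pi to pi, and "fits into the gap" is interpreted as a stem distance within the 0.75 Å tolerance.

```python
import math
from typing import Dict, List

Params = Dict[str, float]  # keys: "x", "y", "beta", "gamma"


def _wrap(angle: float) -> float:
    """Map an angle difference into [-pi, pi) (assumption, see lead-in)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def goodness(target: Params, cand: Params) -> float:
    """Squared deviations of x, y plus twice those of the two angles."""
    dx = cand["x"] - target["x"]
    dy = cand["y"] - target["y"]
    dbeta = cand["beta"] - target["beta"]
    dgamma = _wrap(cand["gamma"] - target["gamma"])
    return dx * dx + dy * dy + 2.0 * (dbeta * dbeta + dgamma * dgamma)


def select_candidates(target: Params, candidates: List[Params],
                      tol: float = 0.75, keep: int = 250) -> List[Params]:
    """Keep candidates that span the gap within tol, best goodness first."""
    gap = math.hypot(target["x"], target["y"])
    fitting = [c for c in candidates
               if abs(math.hypot(c["x"], c["y"]) - gap) <= tol]
    return sorted(fitting, key=lambda c: goodness(target, c))[:keep]
```

Because a smaller goodness means a closer match of the stem geometry, the list is sorted in ascending order before it is cut to the best 250 entries.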
Loops that clash with the rest of the protein and those likely to protrude from the protein surface are rejected from the list of loop candidates. For this purpose, the minimal and maximal distances to the main chain of the protein are calculated for each loop. Main-chain atoms, including oxygen, are considered and the stem residues are excluded from this calculation. If a loop has a minimal distance <2.4 Å, it is eliminated. The maximal distance cut-off is chosen depending on loop length: a loop candidate is rejected if its maximal distance to the rest of the protein exceeds the value 4.5 x ln(loop length) + 4. These distance cut-off values were determined by analysis of the parameterization test sets (see the section Test sets). A logarithmic curve which approximates but does not fall below the greatest maximal distances found in the parameterization test sets was fitted with the aid of Microsoft Excel. The different values are shown in Figure 3. Furthermore, loop candidates are checked for acceptable phi/psi angles at the stem regions after being fitted into the protein structure.
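The distance filter described above can be written down in a few lines. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the "maximal distance" of a loop is interpreted as the largest nearest-neighbour distance from any loop main-chain atom to the remaining main-chain atoms, which the text does not spell out. It requires Python 3.8 or later for `math.dist`.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def max_distance_cutoff(loop_length: int) -> float:
    """Length-dependent upper cut-off in angstroms: 4.5 * ln(length) + 4."""
    return 4.5 * math.log(loop_length) + 4.0


def passes_distance_filters(loop_atoms: Sequence[Point],
                            protein_atoms: Sequence[Point],
                            loop_length: int,
                            min_cutoff: float = 2.4) -> bool:
    """Reject loops that clash with or protrude from the protein.

    protein_atoms should hold main-chain atoms of the rest of the
    protein, with the stem residues already left out.
    """
    nearest = [min(math.dist(a, p) for p in protein_atoms)
               for a in loop_atoms]
    if min(nearest) < min_cutoff:        # clash with the protein
        return False
    return max(nearest) <= max_distance_cutoff(loop_length)  # protrusion
```

For a loop of 10 residues the upper cut-off evaluates to roughly 14.4 Å, so the filter becomes more permissive for longer loops while the clash threshold stays fixed.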
Ranking
Loop candidates are ranked by a function including RMSDstem, introduced in the section Selection of loop candidates from the database. In addition, the sequence similarity of a loop candidate to the original loop is assessed. A sequence score M is calculated using an environment-specific amino acid substitution matrix for accessible residues (Overington et al., 1992). The data were taken from the database AAindex2, which can be found at http://www.genome.ad.jp/aaindex/ (Kawashima and Kanehisa, 2000). The rank is then calculated according to the equation

$$\mathrm{Rank} = M - 0.1 \times \mathrm{RMSD}_{stem}^2$$
The main-chain atoms of the loop candidate that ranks first are a prediction for those of the original loop.
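A minimal sketch of this ranking step is given below. Several points are assumptions made for illustration: the function names are hypothetical, the sequence score M is taken as the sum of substitution scores over aligned positions, a higher rank value is read as better, and the tiny matrix contains made-up numbers rather than values from the Overington et al. matrix.

```python
from typing import Dict, List, Tuple

Matrix = Dict[Tuple[str, str], float]

# Made-up substitution scores for illustration only.
TOY_MATRIX: Matrix = {
    ("G", "G"): 5.0, ("G", "A"): 0.0,
    ("A", "G"): 0.0, ("A", "A"): 3.0,
}


def sequence_score(target_seq: str, cand_seq: str, matrix: Matrix) -> float:
    """Sum substitution scores position by position (assumed form of M)."""
    if len(target_seq) != len(cand_seq):
        raise ValueError("loop sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(target_seq, cand_seq))


def rank_score(m: float, rmsd_stem: float, weight: float = 0.1) -> float:
    """Rank = M - 0.1 * RMSDstem^2; higher is read as better."""
    return m - weight * rmsd_stem ** 2


def best_candidate(target_seq: str,
                   candidates: List[Tuple[str, float]],
                   matrix: Matrix) -> Tuple[str, float]:
    """candidates are (sequence, RMSDstem) pairs; return the top-ranked one."""
    return max(candidates,
               key=lambda c: rank_score(
                   sequence_score(target_seq, c[0], matrix), c[1]))
```

With these toy values, a candidate that matches the target sequence and has a small stem RMSD outranks both a well-fitting candidate with a dissimilar sequence and an identical sequence that fits the stems poorly.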
The fitting of the ranking function was started from its more general form:
$$\mathrm{Rank} = a \times M - b \times \mathrm{goodness} - c \times \mathrm{RMSD}_{stem}^2$$
In order to determine the parameters of the ranking function, several combinations of those were tested and applied to the parameterization test set. For this, plausible upper and lower bounds were fixed for each parameter. Then, these intervals were discretized and all combinations of parameters within those grids were tested. The goal of this procedure was to minimize the average global RMSD between original protein and the top-ranked loop candidates. Separate optimizations for different loop lengths resulted in highly inconsistent parameters and yielded no uniform trend; a simultaneous optimization for all loop lengths yielded the above ranking function, which does not depend on the loop length and resulted in a loss of 0.07 Å for the overall mean global RMSD with respect to the averaged optimal values achieved by separate optimizations.
As inclusion of the goodness, introduced in the section Selection of loop candidates from the database, into the ranking function did not significantly change the results regarding mean values over the test sets, it was consequently rejected. Nevertheless, experience shows that in some cases, especially for shorter loops (up to length five), unsuitable loop candidates can be rejected by inclusion of the goodness.
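The parameter fit described in this section is an exhaustive grid search, which can be sketched as follows. The data layout and the name `grid_search` are hypothetical: each loop is represented by a list of candidates carrying a precomputed sequence score, goodness, stem RMSD and global RMSD to the native loop, and the grids themselves are placeholders for the discretized intervals mentioned in the text.

```python
from itertools import product
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Candidate = Dict[str, float]  # keys: "M", "goodness", "rmsd_stem", "global_rmsd"


def grid_search(loops: List[List[Candidate]],
                a_grid: Sequence[float],
                b_grid: Sequence[float],
                c_grid: Sequence[float]) -> Tuple[float, Tuple[float, float, float]]:
    """Return (mean global RMSD, (a, b, c)) for the best parameter triple."""
    best = None
    for a, b, c in product(a_grid, b_grid, c_grid):
        picked = []
        for cands in loops:
            # Top-ranked candidate under the general ranking function.
            top = max(cands, key=lambda k: a * k["M"]
                      - b * k["goodness"]
                      - c * k["rmsd_stem"] ** 2)
            picked.append(top["global_rmsd"])
        score = mean(picked)
        if best is None or score < best[0]:
            best = (score, (a, b, c))
    return best
```

Running the search once over all loop lengths together, rather than once per length, corresponds to the simultaneous optimization that produced the length-independent ranking function.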
Results and discussion |
---|
Short loops of up to four residues are modelled with comparable quality with respect to the global RMSD by the Fiser method and the LIP method. Concerning the local RMSD, the LIP method performs about one-third better for loops of length four. Loops of length 12 are predicted 37% better with respect to global RMSD and 45% better with respect to local RMSD by the LIP method. Regarding the median, LIP predicts four-residue loops almost 40% better with respect to both global and local RMSD. The median values for the eight- and 12-residue loops are >75% (global RMSD) and 85% better (local RMSD) than those of Fiser et al. (see Table I).
Table II shows detailed results for 14 selected loops in comparison with the results of Deane and Blundell (2001), Fiser et al. (2000) and van Vlijmen and Karplus (1997). Deane and Blundell and Fiser et al. chose these 14 loops from the test set of van Vlijmen and Karplus for their comparison of prediction results. As indicated in the Introduction, the loop prediction method of van Vlijmen and Karplus is based on a database of protein segments, like the method presented in this paper. In contrast, Fiser et al. follow an ab initio approach. Deane and Blundell combine these two methods. From Table II, it can be seen that loop modelling with the aid of the LIP database produced the best results of the four methods compared for seven out of 14 loops. For three loops the LIP method performed worst and in two cases LIP yielded the second best results. Overall LIP performed best regarding the median with respect to all 14 loops. However, the comparison of the results for these 14 loops can be regarded as only a rough estimation of performance. The most evident reason is that loops of different lengths are compared. Moreover, the test set of 14 loops is not representative in any respect, which is the reason for using the median as the most appropriate measure here, as deviations from the main trend are not overestimated. Additionally, the predictions in Table II were produced at different dates. Owing to enlarged protein structure databases or increased computing power, it can be expected that earlier methods would yield better results at present. In order to assess the influence of this effect, all loops contained in Table II were submitted to a number of loop prediction web servers, including those using the prediction methods of Fiser et al. (server ModLoop) and Deane and Blundell (server CODA) as well as the server RAPPER maintained by DePristo et al. (DePristo et al., 2003). However, the results do not indicate any overall trend of improvement over time, at least for this small test set. These results are available at http://www.protein-design.com/LIP/.
An advantage of database methods compared with ab initio methods is that only the database has to be updated regularly, so that it grows with the PDB in a natural manner. By contrast, the cost for recalculation of (phi, psi) maps and resulting potentials, as used in the ab initio approach (Fiser et al., 2000), is much higher. Furthermore, the computing time for one loop is several hours with the Fiser method. To model one loop of arbitrary length with the LIP method presented in this paper, a database search of less than 1 s is necessary. Evaluation and ranking of the loop candidates take 10 min on average, depending more on factors such as the number of considered loop candidates and the size of the respective proteins than on loop length.
Loop modelling with aid of the LIP database has already been applied successfully in the homology modelling section of the CASP5 experiment, where the authors took part under the group name Preissner with group number 488. Three out of 16 models completed with loops from the LIP database ranked among the best 15 (rank 3, 7 and 15, respectively) of all submitted models according to the CASP criterion. A further six models ranked among the best 50, where the total number of submitted models amounts to 150 in each case. Two examples of successfully modelled loops are shown in Figure 5. In both cases, the loop modelled by means of LIP was more accurate than those coming from the two top-ranked predictions for the target (see Table III). In the field of homology modelling, a further advantage of the LIP database takes effect: every kind of protein segment, not only loops, can be modelled. Solely the ranking function has to be adapted to the particular situation.
A novel knowledge-based method for the prediction of loops in the framework of homology modelling has been presented. It is based on a comprehensive compilation of backbone conformations from the PDB, called LIP. The results were compared with those of a thoroughly evaluated ab initio method published recently (Fiser et al., 2000). Predictions were made for 14 test sets of 40 loops each, loop lengths ranging from 1 to 14 residues. Loops of lengths up to nine residues could be modelled with a local RMSD <1 Å by the LIP method, and those of length up to 14 residues with an accuracy better than 2 Å. This indicates that, in particular for longer loops, the LIP method performs better than the Fiser method (Fiser et al., 2000). Prediction accuracies were compared for an additional test set of 14 loops of lengths four to nine residues. This test set has already been used elsewhere (van Vlijmen and Karplus, 1997; Fiser et al., 2000; Deane and Blundell, 2001). Here again, the LIP method yielded very good results, in particular for longer loops.
Every loop prediction method, including various ab initio methods, uses data from the PDB in some way. Therefore, and owing to the different dates of publication of the compared methods, the comparison is not entirely realistic, as earlier methods may yield better results at present. In particular, this can be expected for the database methods, as indicated by comparison of the performances of a current and a reduced version of the LIP database.
Nevertheless, loop modelling by means of LIP yields very good results and performs better than the other methods compared here. Particularly for longer loops, the presented knowledge-based method reaches higher accuracy than other approaches. In particular, inspection of the median values in comparison with other methods indicates that the LIP method is a valuable contribution to the field of loop modelling.
Further positive aspects of the presented method are the short computing times and the fact that the database grows in a natural manner with the PDB, which can be expected to improve prediction accuracies continuously. Moreover, like CODA (Deane and Blundell, 2001), the LIP method is not limited to loop prediction but can be used to model arbitrary peptide fragments in proteins. A limitation of the LIP method is the relatively high standard deviations in prediction accuracies in comparison with the ab initio method (Fiser et al., 2000).
For the purpose of homology modelling, a graphical user interface has been designed to make interactive use of the protein segment LIP database possible. The program together with the database on DVD is available from the authors on request.
Acknowledgements |
---|
References |
---|
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235–242.
Bruccoleri,R.E. and Karplus,M. (1987) Biopolymers, 26, 137–168.
Deane,C.M. and Blundell,T.L. (2001) Protein Sci., 10, 599–612.
DePristo,M.A., de Bakker,P.I.W., Lovell,S.C. and Blundell,T.L. (2003) Proteins, 51, 41–55.
Fidelis,K., Stern,P.S., Bacon,D. and Moult,J. (1994) Protein Eng., 7, 953–960.
Fine,R.M., Wang,H., Shenkin,P.S., Yarmusch,D.L. and Levinthal,C. (1986) Proteins, 1, 342–362.
Fiser,A., Kinh Gian Do,R. and Šali,A. (2000) Protein Sci., 9, 1753–1773.
Greer,J. (1981) J. Mol. Biol., 153, 1027–1042.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Kawashima,S. and Kanehisa,M. (2000) Nucleic Acids Res., 28, 374.
Lattman,E.E. (2001) Proteins, 44, 399.
Martin,A.C.R., Cheetham,J.C. and Rees,A.R. (1989) Proc. Natl Acad. Sci. USA, 86, 9268–9272.
Moult,J. and James,M.N. (1986) Proteins, 1, 146–163.
Moult,J., Fidelis,K., Zemla,A. and Hubbard,T. (2001) Proteins, 45, 2–7.
Overington,J., Donnelly,D., Johnson,M.S., Šali,A. and Blundell,T.L. (1992) Protein Sci., 1, 216–226.
Reczko,M., Martin,A.C.R., Bohr,H. and Suhai,S. (1995) Protein Eng., 8, 389–395.
Samudrala,R. and Moult,J. (1998) J. Mol. Biol., 279, 287–302.
Sánchez,R. and Šali,A. (1997) Curr. Opin. Struct. Biol., 7, 206–214.
Schonbrun,J., Wedemeyer,W.J. and Baker,D. (2002) Curr. Opin. Struct. Biol., 12, 348–354.
Sutcliffe,M.J., Haneef,I., Carney,D. and Blundell,T.L. (1987a) Protein Eng., 1, 377–384.
Sutcliffe,M.J., Hayes,F.R.F. and Blundell,T.L. (1987b) Protein Eng., 1, 385–392.
Tosatto,S.C.E., Bindewald,E., Hesser,J. and Männer,R. (2002) Protein Eng., 15, 279–286.
Tramontano,A., Leplae,R. and Morea,V. (2001) Proteins, 45, 22–38.
van Vlijmen,H.W.T. and Karplus,M. (1997) J. Mol. Biol., 267, 975–1001.
Received May 12, 2003; revised October 17, 2003; accepted October 21, 2003