Loops In Proteins (LIP)—a comprehensive loop database for homology modelling

E. Michalsky1, A. Goede and R. Preissner

BCB (Berlin Center for Genome-based Bioinformatics) at the Institute of Biochemistry, Charité (Medical Faculty of the Humboldt University Berlin), Monbijoustrasse 2, D-10117 Berlin, Germany

1 To whom correspondence should be addressed. E-mail: elke.michalsky@charite.de


    Abstract
 
One of the most important and challenging tasks in protein modelling is the prediction of loops, as can be seen in the large variety of existing approaches. Loops In Proteins (LIP) is a database that includes all protein segments of a length up to 15 residues contained in the Protein Data Bank (PDB). In this study, the applicability of LIP to loop prediction in the framework of homology modelling is investigated. Searching the database for loop candidates takes less than 1 s on a desktop PC, and ranking them takes a few minutes. This is an order of magnitude faster than most existing procedures. The measure of accuracy is the root mean square deviation (RMSD) with respect to the main-chain atoms after local superposition of target loop and predicted loop. Loops of up to nine residues length were modelled with a local RMSD <1 Å and those of length up to 14 residues with an accuracy better than 2 Å. The results were compared in detail with a thoroughly evaluated and tested ab initio method published recently and additionally with two further methods for a small loop test set. The LIP method produced very good predictions. In particular for longer loops it outperformed other methods.

Keywords: homology modelling/protein loops/protein segments/structure database/structure prediction


    Introduction
 
Currently, comparative or homology modelling performs best among the existing methods for the prediction of unknown protein structures. Given a template protein with known three-dimensional structure and sufficiently high sequence identity to the target, high-accuracy models can be produced. Besides the proper alignment, the most important and challenging task in protein modelling is the prediction of loop conformations (Sánchez and Sali, 1997; Fiser et al., 2000; Baker and Sali, 2001; Schonbrun et al., 2002). A large variety of existing approaches address this problem.

Basically, the approaches fall into two main categories: knowledge-based and ab initio (de novo) methods. Knowledge-based approaches try to find a segment of a protein with known three-dimensional structure that fits the stem regions of a loop, where the residues preceding and following the loop are called stem residues. Usually, a database search is followed by an evaluation of suitable candidates and an optimization by means of an energy function. Ab initio methods have in common a search for, or an enumeration of, conformations, which is usually based on potentials or scoring functions. Often knowledge-based parts are included, e.g. phi/psi maps of known loops (Fiser et al., 2000; Deane and Blundell, 2001; Tosatto et al., 2002).

The first ab initio methods for modelling loops or short polypeptide segments were introduced by Moult and James and by Bruccoleri and Karplus, using conformational search with an optional energy minimization (Moult and James, 1986; Bruccoleri and Karplus, 1987). Fine et al. generated multiple conformations followed by either energy minimization or molecular dynamics followed by minimization (Fine et al., 1986). Knowledge-based methods were pioneered by Greer (Greer, 1981), combined approaches were introduced by Martin et al. (1989) and Sutcliffe et al. presented one of the first automated methods (Sutcliffe et al., 1987a,b).

Van Vlijmen and Karplus presented a knowledge-based approach where a set of loops is selected from a database, followed by a constrained optimization of the loop orientation and ranking by means of an energy function (van Vlijmen and Karplus, 1997). Starting from a set of possible loop conformations extracted from a database, Samudrala and Moult use a graph theoretical approach to find the conformation that approximates the natural one best. Plausible conformations are found using a clique-finding method, which combines a recursive backtracking procedure with a branch and bound technique (Samudrala and Moult, 1998).

An ab initio method is presented in Fiser et al. (2000). Here, the positions of all non-hydrogen atoms are optimized with respect to a pseudo energy function, supplemented with statistical preferences for dihedral angles and for non-bonded atomic contacts. The algorithm of Tosatto et al. (Tosatto et al., 2002) is based on a divide and conquer approach, recursively decomposing the target loop until the conformations of the resulting segments can be compiled analytically. For this purpose, a database of possible conformations for loop segments is used, which were anticipated using a list of (phi, psi)-angle pairs extracted from the Protein Data Bank (PDB) (Berman et al., 2000). Artificial neural networks are used in Reczko et al. (1995) to predict H3 loops of a set of antibodies. The neural network is trained on a set of loops that are similar to known H3 loops. CODA, an algorithm presented in Deane and Blundell (2001), combines a knowledge-based and an ab initio method by clustering the predictions of the two algorithms and making a consensus prediction using a set of filters.

Although both ab initio and knowledge-based loop modelling methods have improved in recent years and particularly the length of modelled loops has increased, it was concluded from the CASP4 experiment (Critical Assessment of Techniques for Protein Structure Prediction) (Lattman, 2001) that there was no significant progress in homology modelling in general (Moult et al., 2001; Tramontano et al., 2001). Fidelis et al. compared the performance of an ab initio and a database method and concluded that database methods are limited to loops of four residues (Fidelis et al., 1994). However, van Vlijmen and Karplus succeeded in predicting loops of length nine with reasonable accuracy by means of a database method (van Vlijmen and Karplus, 1997). Deane and Blundell stated that their database search method is overtaken by their ab initio method at around six residues loop length (Deane and Blundell, 2001). All this was a motivation to create Loops In Proteins (LIP), the database of protein segments presented in this paper, and to supplement it by different selection criteria and a ranking function designed for the purpose of loop prediction. The performance of the resulting loop prediction algorithm was compared in detail with a recently published ab initio approach (Fiser et al., 2000), in the following called ‘Fiser method’, and with two further methods for a small loop test set.


    Materials and methods
 
Test sets

A non-homologous set of protein structures from the PDB (<20% pairwise sequence identity) that were determined by X-ray crystallography at a resolution of 1.8 Å or better was obtained from http://www.fccc.edu/research/labs/dunbrack/pisces/culledpdb.html. Secondary structural elements were identified using the DSSP program (Kabsch and Sander, 1983). Those segments connecting two secondary structural elements were defined as loops. Thus, N- and C-termini in particular were excluded. Loop test sets, each containing 50 loops of the same length, with lengths ranging from 1 to 15 residues, were extracted by random selection. No test set for a given length contains two loops from the same protein structure. These test sets were used to optimize the selection criteria and ranking function described in the subsequent paragraphs and are therefore called ‘parameterization test sets’ in the following.

For comparison purposes, loop predictions were made for the test sets from Fiser et al. (2000). They are available at the URL http://www.salilab.org/. Each test set consists of 40 loops of the same length, where the length, i.e. the number of amino acid residues, ranges from 1 to 14. Some of the proteins included in the test sets were substituted by newer versions in the PDB: 4ptp was substituted by 5ptp, 2cyr by 3cyr, 4fxn by 2fox, 3b5c by 1cyo, 1aak by 2aak. For technical reasons, i.e. missing stem residues, some loops had to be eliminated from the test sets. This concerns one loop each of length 4, 6, 7 and 12 residues, and two 14-residue loops. All test sets are available at http://www.protein-design.com/LIP/.

LIP database

LIP is a comprehensive compilation of backbone conformations found in the PDB. It includes all protein segments of 1–15 amino acid residues length contained in the PDB, which amounts to approximately $10^8$ segments. For the purpose of loop modelling, both NMR structures and theoretical models are excluded from the database. Furthermore, only proteins with a resolution of 3.5 Å or better are included.

For each protein segment, the following items are stored: length; PDB identifier of the protein; PDB number of the N-terminal stem residue; amino acid sequence; the values $x$, $y$, where $(x, y)$ is a two-dimensional vector between the atoms $C^{(N)}$ and $N^{(C)}$ in the $C_\alpha^{(N)}$–$C^{(N)}$–$N^{(C)}$ plane, so that the distance between the stem residues can be calculated from $\mathrm{dist}_{\mathrm{stem}} = (x^2 + y^2)^{1/2}$; $\beta$, the angle included by the lines connecting the atoms $C^{(N)}$ and $N^{(C)}$ and also $N^{(C)}$ and $C_\alpha^{(C)}$; and $\gamma$, the dihedral angle between the $C_\alpha^{(N)}$–$C^{(N)}$–$N^{(C)}$ and $C^{(N)}$–$N^{(C)}$–$C_\alpha^{(C)}$ planes (for clarification, see Figure 1). Atoms indicated by a superscript, (N) and (C), belong to the N- and C-terminal stem residues, respectively. The loop parameters are stored in different files, which are indexed by sequence length and the rounded values of $x$ and $y$. These files are stored in several subdirectories, which are named after the respective $x$ and $y$ value pairs.



 
Fig. 1. Geometric values that are stored in the LIP database for each loop. The points $C_\alpha^{(N)}$, $C^{(N)}$ and $N^{(C)}$ define the xy plane of the coordinate system, where the x-axis connects $C_\alpha^{(N)}$ and $C^{(N)}$. The figure shows the values $x$ and $y$ that characterize each loop and are stored in the database. Further characteristic values are $\beta$, the angle included by the lines connecting the atoms $C^{(N)}$ and $N^{(C)}$ and also $N^{(C)}$ and $C_\alpha^{(C)}$, and $\gamma$, the dihedral angle between the $C_\alpha^{(N)}$–$C^{(N)}$–$N^{(C)}$ and $C^{(N)}$–$N^{(C)}$–$C_\alpha^{(C)}$ planes.
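The stored geometric values can be illustrated with a short sketch. The following Python code is not taken from the LIP software; it assumes that the four stem atoms are given as Cartesian coordinates in Å, that y is taken as non-negative (the third atom defines the plane), that angles are in radians, and that the dihedral follows one common sign convention. The function name stem_parameters and the helper functions are hypothetical.

```python
import math

def sub(a, b):
    return tuple(p - q for p, q in zip(a, b))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def unit(a):
    n = norm(a)
    return tuple(p / n for p in a)

def stem_parameters(ca_n, c_n, n_c, ca_c):
    """Return (x, y, beta, gamma) for the stem atoms CA(N), C(N), N(C), CA(C)."""
    ex = unit(sub(c_n, ca_n))            # x-axis runs from CA(N) to C(N)
    v = sub(n_c, c_n)                    # vector from C(N) to N(C)
    x = dot(v, ex)                       # component along the x-axis
    y = norm(sub(v, tuple(x * e for e in ex)))  # in-plane component, >= 0
    # beta: angle at N(C) between the directions to C(N) and to CA(C)
    u, w = sub(c_n, n_c), sub(ca_c, n_c)
    cos_beta = dot(u, w) / (norm(u) * norm(w))
    beta = math.acos(max(-1.0, min(1.0, cos_beta)))
    # gamma: dihedral angle CA(N)-C(N)-N(C)-CA(C)
    b1, b2, b3 = sub(c_n, ca_n), v, sub(ca_c, n_c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    m1 = cross(n1, unit(b2))
    gamma = math.atan2(dot(m1, n2), dot(n1, n2))
    return x, y, beta, gamma
```

With these definitions, math.hypot(x, y) reproduces the stem distance between C(N) and N(C).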

 
To assess the influence of the number of available protein structures on database methods for loop modelling, a reduced version of the LIP database was derived from a 1990 version of the PDB. This PDB version was retrieved by aid of the tool ‘PDBmining2’, accessible from http://mirrors.rcsb.org/SMS/. The size of the current LIP version amounts to 6.71 GB of disk space. Detailed information about the number of database entries depending on the length of protein segments and the distance spanned by the segments is included in Figure 2.



 
Fig. 2. LIP statistics. The number of protein segments contained in the LIP database is shown depending on the length of loop segments (black bars). The unit on the corresponding left-hand axis is millions. LIP contains 8 334 561 protein segments of length 1 residue and 7 644 887 segments of length 15. White bars show the average distance in Å that is spanned by a segment of a certain length. More precisely, the spanned distance here means $\mathrm{dist}_{\mathrm{stem}}$, the distance between the stem residues, i.e. the distance between the C-atom of the N-terminal stem residue and the N-atom of the C-terminal stem residue. Protein segments of length 1 span a distance of 3.72 Å and those of length 15 span 19.98 Å on average.

 
Selection of loop candidates from the database

As a first step, a list of loops of the required length with arbitrary amino acid sequence is extracted from the database by reading four files with the required indices. A loop is included in the list if it fits into the gap between the N- and C-terminal stem residues with a tolerance of 0.75 Å. For each loop extracted from the database, a ‘goodness’ is calculated. The goodness is a rough estimate of $\mathrm{RMSD}_{\mathrm{stem}}$, which is defined as the RMSD (root mean square deviation) with respect to the $C_\alpha^{(N)}$, $C^{(N)}$, $N^{(C)}$ and $C_\alpha^{(C)}$ atoms of the original protein and the protein that contains the loop candidate after superposition of those atoms. Values incorporated into the calculation of the goodness are $x$ and $y$ and the angles $\beta$ and $\gamma$ (see the preceding section):

$$\text{goodness} = \Delta x^2 + \Delta y^2 + 2(\Delta\beta^2 + \Delta\gamma^2)$$

where e.g. $\Delta x^2$ denotes the squared deviation of the $x$ values of original protein and loop candidate. To calculate the ‘goodness’, the following coordinate system is chosen: $C_\alpha^{(N)}$ is the origin and $C^{(N)}$ defines the positive part of the x-axis. Considering original loop and loop candidate in this system, the $C_\alpha^{(N)}$ and $C^{(N)}$ atoms, respectively, coincide owing to the identical bond lengths. The squared distance between the $N^{(C)}$ atoms is $\Delta x^2 + \Delta y^2$. In the following, it is assumed that both $N^{(C)}$ atoms coincide: the enlargement of the $C_\alpha^{(C)}$ distance owing to the translational displacement of the $N^{(C)}$ atoms is neglected for simplification. Only the rotation term is retained. Let $\Delta\alpha$ denote the angle enclosed by the $N^{(C)}$–$C_\alpha^{(C)}$ bonds (with centre $N^{(C)}$). The distance between the $C_\alpha^{(C)}$ atoms is smaller than their arc distance, which equals $\text{bond length} \times \Delta\alpha$. For $\Delta\alpha$, the inequality $\Delta\alpha^2 \le \Delta\beta^2 + \Delta\gamma^2$ holds. With $\mathrm{dist}(C_\alpha^{(C)})$ denoting the distance between the $C_\alpha^{(C)}$ atoms, this yields:

$$\mathrm{dist}(C_\alpha^{(C)})^2 \le \text{bond length}^2 \times \Delta\alpha^2 \le \text{bond length}^2 \times (\Delta\beta^2 + \Delta\gamma^2) \le 2 \times (\Delta\beta^2 + \Delta\gamma^2)$$

with the assumption $\text{bond length} \approx 1.4$. As shown above, the distances between the remaining atom pairs can be calculated exactly by the equations:

$$\mathrm{dist}(C_\alpha^{(N)}) = \mathrm{dist}(C^{(N)}) = 0$$

and

$$\mathrm{dist}(N^{(C)})^2 = \Delta x^2 + \Delta y^2$$

Overall, considering all simplifications, ‘goodness’ is a qualitative upper estimate for $\mathrm{RMSD}_{\mathrm{stem}}^2$.

In the second step, the loop candidates are ranked according to goodness and the best 250 loops are selected. A database search takes less than 1 s on average, as only four files containing the loops that fit into the gap have to be read (see above).
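The selection step can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. It assumes that angles are stored in radians, that angle differences are wrapped to the interval [-pi, pi), and that "fits into the gap" means the stem distances of target and candidate differ by at most 0.75 Å. The names StemGeometry, fits_gap and select_candidates are hypothetical.

```python
import math
from dataclasses import dataclass

TOLERANCE = 0.75  # Å, gap tolerance stated in the text
N_BEST = 250      # number of candidates kept after sorting by goodness

@dataclass
class StemGeometry:
    x: float
    y: float
    beta: float   # radians (assumption)
    gamma: float  # radians (assumption)

def angle_diff(a, b):
    """Difference a - b wrapped to [-pi, pi) (wrapping is an assumption)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi

def goodness(target, cand):
    """goodness = dx^2 + dy^2 + 2 (dbeta^2 + dgamma^2); smaller is better."""
    dx = target.x - cand.x
    dy = target.y - cand.y
    db = angle_diff(target.beta, cand.beta)
    dg = angle_diff(target.gamma, cand.gamma)
    return dx * dx + dy * dy + 2.0 * (db * db + dg * dg)

def fits_gap(target, cand, tol=TOLERANCE):
    """Assumed gap test: stem distances agree within the tolerance."""
    return abs(math.hypot(target.x, target.y) - math.hypot(cand.x, cand.y)) <= tol

def select_candidates(target, candidates, n_best=N_BEST):
    """Keep candidates that fit the gap, sorted by increasing goodness."""
    fitting = [c for c in candidates if fits_gap(target, c)]
    return sorted(fitting, key=lambda c: goodness(target, c))[:n_best]
```

Because the goodness only needs four stored numbers per segment, no coordinates have to be read or superposed at this stage, which is consistent with the short search time reported above.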

Loops that clash with the rest of the protein and those likely to protrude from the protein surface are rejected from the list of loop candidates. For this purpose, the minimal and maximal distances to the main chain of the protein are calculated for each loop. Main-chain atoms, including oxygen, are considered and the stem residues are excluded from this calculation. If a loop has a minimal distance <2.4 Å, it is eliminated. The maximal distance cut-off is chosen depending on loop length: a loop candidate is rejected if its maximal distance to the rest of the protein exceeds the value $4.5 \times \ln(\text{loop length}) + 4$. These distance cut-off values were determined by analysis of the parameterization test sets (see the section ‘Test sets’). A logarithmic curve which approximates but does not fall below the greatest maximal distances found in the parameterization test sets was fitted with the aid of Microsoft Excel. The different values are shown in Figure 3. Furthermore, loop candidates are checked to result in correct phi/psi angles at the stem regions after they were fitted into the protein structure.
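A minimal sketch of the distance filter follows. The cut-off values 2.4 Å and 4.5 × ln(loop length) + 4 are taken from the text. How the minimal and maximal distance of a loop are defined in detail is not stated, so the sketch assumes that each loop atom is assigned the distance to its nearest main-chain atom of the remaining protein and that the minimum and maximum of these values are used. The function names are hypothetical.

```python
import math

MIN_DIST_CUTOFF = 2.4  # Å, from the text

def max_dist_cutoff(loop_length):
    """Length-dependent upper cut-off in Å: 4.5 * ln(loop length) + 4."""
    return 4.5 * math.log(loop_length) + 4.0

def loop_distances(loop_atoms, protein_atoms):
    """Return (min, max) over loop atoms of the nearest-protein-atom distance.

    Assumption: this is one plausible reading of 'minimal and maximal
    distances to the main chain'; the source does not define them further.
    """
    nearest = [min(math.dist(a, p) for p in protein_atoms) for a in loop_atoms]
    return min(nearest), max(nearest)

def passes_distance_filter(loop_atoms, protein_atoms, loop_length):
    """Reject clashing loops (too close) and protruding loops (too far)."""
    d_min, d_max = loop_distances(loop_atoms, protein_atoms)
    return d_min >= MIN_DIST_CUTOFF and d_max <= max_dist_cutoff(loop_length)
```

For a one-residue loop the upper cut-off is 4 Å, and for a 15-residue loop it is roughly 16.2 Å.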



 
Fig. 3. Parameters for loop selection. A loop is eliminated from the list of candidates if its maximal distance from the rest of the protein exceeds or its minimal distance falls below a certain cut-off value. Minimal and maximal distances from the respective protein main chains were calculated for all loops in the parameterization test set (50 loops for each loop length from 1 to 15 residues). For each loop length, the largest maximal distance and the smallest minimal distance found in the test set are shown in Å. MinDist and MaxDist refer to the minimal and maximal distances, respectively. A logarithmic curve which approximates but preferably does not fall below the largest MaxDist value was fitted [$\text{MaxDist cut-off} = 4.5 \times \ln(\text{loop length}) + 4$]. As MinDist cut-off, the value 2.4 Å was chosen: only 0.05% of the test loops have a smaller minimal distance from the protein.

 
To evaluate prediction quality fairly, all loops that originate from a protein with an amino acid sequence similar to that of the original protein are eliminated from the list of loop candidates. Obviously this would not be done in the application case. To assess the ‘similarity’ of amino acid sequences, only the chains of the proteins that contain the loops are considered. Dividing the shorter of both protein chains into segments of 20 residues, the frequency of occurrence of those in the longer sequence is counted. If it amounts to at least half of their total number, then the sequences are called ‘similar’. This procedure ensures that loops coming from different versions or slight mutations of the original protein are not considered as loop candidates. In this way, loops belonging to the original protein are removed from the list of candidates as well.
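The similarity test described above can be sketched as follows. The sketch assumes non-overlapping segments of 20 residues, exact substring matching, and that a trailing fragment shorter than 20 residues is ignored; chains shorter than 20 residues are treated as a single segment. None of these details are stated in the text, and the function name is hypothetical.

```python
def similar_sequences(chain_a, chain_b, seg_len=20):
    """Return True if at least half of the segments of the shorter chain
    occur verbatim in the longer chain."""
    shorter, longer = sorted((chain_a, chain_b), key=len)
    if len(shorter) < seg_len:
        segments = [shorter]  # assumption: short chains form one segment
    else:
        segments = [shorter[i:i + seg_len]
                    for i in range(0, len(shorter) - seg_len + 1, seg_len)]
    hits = sum(1 for seg in segments if seg in longer)
    return 2 * hits >= len(segments)
```

Candidates whose source chain is flagged as similar to the target chain would then be dropped before ranking, but only in the evaluation setting.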

Ranking

Loop candidates are ranked by a function including $\mathrm{RMSD}_{\mathrm{stem}}$, introduced in the section ‘Selection of loop candidates from the database’. In addition, the sequence similarity of a loop candidate to the original loop is assessed. A sequence score $M$ is calculated using an ‘environment-specific amino acid substitution matrix for accessible residues’ (Overington et al., 1992). The data were taken from the database AAindex2, which can be found at http://www.genome.ad.jp/aaindex/ (Kawashima and Kanehisa, 2000). The rank is then calculated according to the equation

$$\text{Rank} = M - 0.1 \times \mathrm{RMSD}_{\mathrm{stem}}^2$$

The main-chain atoms of the loop candidate that ranks first are a prediction for those of the original loop.
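The ranking can be sketched as below. The sketch assumes that M is the sum of substitution scores over aligned loop positions and that the candidate with the highest rank value is the one that "ranks first"; the text states neither explicitly. The toy matrix is purely illustrative and is not the matrix of Overington et al. (1992). All function names are hypothetical.

```python
def toy_matrix(alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Illustrative stand-in matrix: 1 for identical residues, 0 otherwise."""
    return {(a, b): 1.0 if a == b else 0.0 for a in alphabet for b in alphabet}

def sequence_score(target_seq, candidate_seq, matrix):
    """Assumed form of M: sum of substitution scores over aligned positions."""
    return sum(matrix[(a, b)] for a, b in zip(target_seq, candidate_seq))

def rank_value(m, rmsd_stem):
    """Rank = M - 0.1 * RMSD_stem^2, as given in the text."""
    return m - 0.1 * rmsd_stem ** 2

def best_candidate(target_seq, candidates, matrix):
    """candidates: list of (sequence, rmsd_stem); returns the top-ranked one."""
    return max(candidates,
               key=lambda c: rank_value(sequence_score(target_seq, c[0], matrix), c[1]))
```

In this form a candidate with a well-matching sequence can outrank a candidate that fits the stems slightly better, since the stem term only enters with weight 0.1.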

The fitting of the ranking function was started from its more general form:

$$\text{Rank} = a \times M - b \times \text{goodness} - c \times \mathrm{RMSD}_{\mathrm{stem}}^2$$

In order to determine the parameters of the ranking function, several combinations of those were tested and applied to the parameterization test set. For this, plausible upper and lower bounds were fixed for each parameter. Then, these intervals were discretized and all combinations of parameters within those grids were tested. The goal of this procedure was to minimize the average global RMSD between original protein and the top-ranked loop candidates. Separate optimizations for different loop lengths resulted in highly inconsistent parameters and yielded no uniform trend; a simultaneous optimization for all loop lengths yielded the above ranking function, which does not depend on the loop length and resulted in a loss of 0.07 Å for the overall mean global RMSD with respect to the averaged optimal values achieved by separate optimizations.
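The parameter fitting described above amounts to an exhaustive grid search, which can be sketched as follows. The sketch assumes a linear general form of the ranking function with weights a, b and c on M, the goodness and the squared stem RMSD; it also assumes that each candidate is described by a tuple (M, goodness, RMSD_stem, RMSD_global). The bounds, step counts and all function names are hypothetical.

```python
import itertools

def discretize(lo, hi, steps):
    """Evenly spaced grid of `steps` values from lo to hi (steps >= 2)."""
    return [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]

def top_ranked_rmsd(candidates, a, b, c):
    """Global RMSD of the candidate ranked first for the given weights.

    Each candidate is a tuple (M, goodness, rmsd_stem, rmsd_global).
    The linear form a*M - b*goodness - c*rmsd_stem^2 is an assumption.
    """
    best = max(candidates, key=lambda t: a * t[0] - b * t[1] - c * t[2] ** 2)
    return best[3]

def grid_search(test_set, grid_a, grid_b, grid_c):
    """Return the weights minimizing the mean global RMSD of top-ranked loops.

    test_set: one candidate list per test loop.
    """
    best_params, best_mean = None, float("inf")
    for a, b, c in itertools.product(grid_a, grid_b, grid_c):
        mean = sum(top_ranked_rmsd(cands, a, b, c) for cands in test_set) / len(test_set)
        if mean < best_mean:
            best_params, best_mean = (a, b, c), mean
    return best_params, best_mean
```

Running the search once over all loop lengths together, rather than per length, corresponds to the simultaneous optimization that produced the length-independent ranking function.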

As inclusion of the goodness, introduced in the section Selection of loop candidates from the database, into the ranking function did not significantly change the results regarding mean values over the test sets, it was consequently rejected. Nevertheless, experience shows that in some cases, especially for shorter loops (up to length five), unsuitable loop candidates can be rejected by inclusion of the goodness.


    Results and discussion
 
In order to be able to compare the results of this study with those of the Fiser method (Fiser et al., 2000), the same RMSD values have to be calculated. In the following, the RMSD for the loop main-chain atoms (N, $C_\alpha$, C, O) after superposition of the main-chain atoms in the stem residues is called ‘global RMSD’ and is indicated by $\mathrm{RMSD}_{\mathrm{global}}$. The ‘local RMSD’ $\mathrm{RMSD}_{\mathrm{local}}$ is calculated after superposition of the loop main-chain atoms. Fiser et al. define accuracy classes of loop prediction: a ‘good’ prediction has a local RMSD <1 Å, a ‘medium’ prediction has a local RMSD between 1 and 2 Å and the local RMSD of a ‘bad’ prediction is >2 Å.
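The two accuracy measures differ only in which atoms are superposed before the deviation is computed. The following sketch covers the final step: it assumes that both coordinate sets have already been superposed (on the stem residues for the global RMSD, on the loop itself for the local RMSD) and that atoms are listed in the same order. The superposition itself is omitted. The assignment of the boundary values 1 Å and 2 Å to the ‘medium’ class is an assumption, and the function names are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD in Å between two equally ordered, already superposed atom lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

def accuracy_class(local_rmsd):
    """Accuracy classes of Fiser et al.: good < 1 Å, medium 1-2 Å, bad > 2 Å."""
    if local_rmsd < 1.0:
        return "good"
    if local_rmsd <= 2.0:
        return "medium"
    return "bad"
```

Applied to the averages reported in this section, a local RMSD of 0.93 Å falls into the ‘good’ class and 1.71 Å into the ‘medium’ class.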

Short loops of up to four residues are modelled with comparable quality with respect to the global RMSD by the Fiser method and the LIP method. Concerning the local RMSD, the LIP method performs about one-third better for loops of length four. Loops of length 12 are predicted ~37% better with respect to global RMSD and ~45% better with respect to local RMSD by the LIP method. Regarding the median, LIP predicts four residue loops almost 40% better with respect to both global and local RMSD. The median values for the eight- and 12-residue loops are >75% (global RMSD) and ~85% better (local RMSD) than those of Fiser et al. (see Table I).


 
Table I. Results: detailed results for the loop test sets of length 4, 8 and 12, each including 40 loops, in comparison with the results of Fiser et al. (2000)
 
Figure 4 shows that, on average, loop modelling by means of the LIP database achieves medium results even for loops of 14 residues length ($\mathrm{RMSD}_{\mathrm{local}}$ = 1.71 Å on average over the test set). Predictions for loops of up to nine residues length (nine-residue loops: $\mathrm{RMSD}_{\mathrm{local}}$ = 0.93 Å on average over the test set of 40 loops) have good accuracy. The global RMSD is <3.5 Å for all test sets; the average for the 14-residue loops is 3.46 Å (see Figure 4). In comparison, the Fiser method achieved medium results for loops of length six to nine residues and only accuracies >2 Å for loops longer than nine residues (see Fiser et al., 2000 and Table I).



 
Fig. 4. Results: quality of loop prediction. For each loop length of 1–14 residues, the average local and global RMSD ($\mathrm{RMSD}_{\mathrm{local}}$ and $\mathrm{RMSD}_{\mathrm{global}}$) and the standard deviations with respect to the test set are shown in Å. Each test set included 40 loops of the same length, reduced by the loops that could not be modelled where necessary (see the section ‘Test sets’ in Materials and methods). Loop predictions were made for a current version of the LIP database and for a reduced version of LIP, which was derived from a 1990 version of the PDB.

 
The number of published protein structures increases continuously. It is evident that database methods for loop modelling will benefit from this development. To assess the quantity of this influence, the calculations for all test sets were repeated with a reduced version of the LIP database, which was derived from a 1990 version of the PDB. For direct comparison, the results are included in Figure 4. With the reduced LIP database, good prediction accuracies are achieved only for loops of up to three residues length. Predictions for loops of up to nine residues length have a medium accuracy for the reduced LIP version in comparison with good accuracy for the current version. Inaccuracy increases dramatically with loop length. In particular, the global RMSD values are not acceptable: for loops longer than five residues, rising above 3 Å, whereas the standard deviations remain comparable to those of the current results. Furthermore, the accuracies for the current LIP version are distributed more linearly than those for the earlier one. These facts indicate that owing to the growth of the PDB, in particular the deposition of new fold structures, database methods for loop prediction will yield continuously better results.

Table II shows detailed results for 14 selected loops in comparison with the results of Deane and Blundell (2001), Fiser et al. (2000) and van Vlijmen and Karplus (1997). Deane and Blundell and Fiser et al. chose these 14 loops from the test set of van Vlijmen and Karplus for their comparison of prediction results. As indicated in the Introduction, the loop prediction method of van Vlijmen and Karplus is based on a database of protein segments, like the method presented in this paper. In contrast, Fiser et al. follow an ab initio approach. Deane and Blundell combine these two methods. From Table II, it can be seen that loop modelling with the aid of the LIP database produced the best results of the four methods compared for seven out of 14 loops. For three loops the LIP method performed worst and in two cases LIP yielded the second best results. Overall LIP performed best regarding the median with respect to all 14 loops. However, the comparison of the results for these 14 loops can be regarded as only a rough estimation of performance. The most evident reason is that loops of different lengths are compared. Moreover, the test set of 14 loops is not representative in any respect, which is the reason for using the median as the most appropriate measure here, as deviations from the main trend are not overestimated. Additionally, the predictions in Table II were produced at different dates. Owing to enlarged protein structure databases or increased computing power, it can be expected that earlier methods would yield better results at present. In order to assess the influence of this effect, all loops contained in Table II were submitted to a number of loop prediction web servers, including those using the prediction methods of Fiser et al. (server ‘ModLoop’) and Deane and Blundell (server ‘CODA’) as well as the server ‘RAPPER’ maintained by DePristo et al. (DePristo et al., 2003). However, the results do not indicate any overall trend of improvement over time, at least for this small test set. These results are available at http://www.protein-design.com/LIP/.


 
Table II. Results: detailed results for 14 selected loops in comparison with the results of Deane and Blundell, Fiser et al. and van Vlijmen and Karplus
 
Overall, the LIP method achieves very good results in comparison with other methods. In particular, it outperforms others in the case of longer loops (see Tables I and II). A limitation of loop prediction by means of the LIP database is the relatively high standard deviations in comparison with the results of Fiser et al. (2000) (see Table I). This can be traced back to the fact that, if the database contains a loop similar to or almost identical with the original loop, then the prediction quality is very good. By contrast, if no roughly similar loop can be found in the database, the conformation of the predicted loop will deviate critically from the original one.

An advantage of database methods compared with ab initio methods is that the database only has to be updated regularly and thus grows with the PDB in a natural manner. By contrast, the cost for recalculation of (phi, psi) maps and resulting potentials, as used in the ab initio approach (Fiser et al., 2000), is much higher. Furthermore, the computing time for one loop amounts to several hours with the Fiser method. To model one loop of arbitrary length with the LIP method presented in this paper, a database search of less than 1 s is necessary. Evaluation and ranking of the loop candidates take ~10 min on average, depending more on factors such as the number of considered loop candidates and the size of the respective proteins than on loop length.

Loop modelling with aid of the LIP database has already been applied successfully in the homology modelling section of the CASP5 experiment, where the authors took part under the group name ‘Preissner’ with group number 488. Three out of 16 models completed with loops from the LIP database ranked among the best 15 (rank 3, 7 and 15, respectively) of all submitted models according to the CASP criterion. A further six models ranked among the best 50, where the total number of submitted models amounts to ~150 in each case. Two examples of successfully modelled loops are shown in Figure 5. In both cases, the loop modelled by means of LIP was more accurate than those coming from the two top-ranked predictions for the target (see Table III). In the field of homology modelling, a further advantage of the LIP database takes effect: every kind of protein segment, not only loops, can be modelled. Solely the ranking function has to be adapted to the particular situation.






 
Fig. 5. Loop modelling at CASP5. Two loops modelled in the framework of the CASP5 experiment are shown. (a and b) Loop 98–102 from target T0153 (PDB identifier 1mq7); (c and d) loop 202–205 from target T0183 (PDB identifier 1o0y). The original loop is coloured blue and the loop predicted by the LIP method is red. The two thin green loops stem from the models that ranked best according to the CASP-criterion for the respective target. Light green, best; dark green, second rank. Each loop model is shown after superposition of the main-chain atoms of two stem residues with those of the original loop (a), (c) and after local superposition of the respective loop residues (b), (d). In each case, the original protein is shown with the same orientation in both figures and two stem residues are added on each loop terminus. RMSD values for all superpositions are given in Table III.

 

 
Table III. Accuracies for loop modelling at CASP5
 
Conclusions

A novel knowledge-based method for the prediction of loops in the framework of homology modelling has been presented. It is based on a comprehensive compilation of backbone conformations from the PDB, called LIP. The results were compared with those of a thoroughly evaluated ab initio method published recently (Fiser et al., 2000). Predictions were made for 14 test sets of 40 loops each, loop lengths ranging from 1 to 14 residues. Loops of lengths up to nine residues could be modelled with a local RMSD <1 Å by the LIP method, and those of length up to 14 residues with an accuracy better than 2 Å. This indicates that, in particular for longer loops, the LIP method performs better than the Fiser method (Fiser et al., 2000). Prediction accuracies were compared for an additional test set of 14 loops of lengths four to nine residues. This test set has already been used elsewhere (van Vlijmen and Karplus, 1997; Fiser et al., 2000; Deane and Blundell, 2001). Here again, the LIP method yielded very good results, in particular for longer loops.

Every loop prediction method, including various ab initio methods, uses data from the PDB in some way. Therefore, and owing to the different dates of publication of the compared methods, the comparison is not entirely realistic, as earlier methods may yield better results at present. In particular, this can be expected for the database methods, as indicated by comparison of the performances of a current and a reduced version of the LIP database.

Nevertheless, loop modelling by means of LIP yields very good results and performs better than other methods. Particularly for longer loops, the presented knowledge-based method reaches higher accuracy than other approaches. Especially inspection of the median values in comparison with other methods indicates that the LIP method is a valuable contribution to the field of loop modelling.

Further positive aspects of the presented method are the short computing times and the fact that the database grows in a natural manner with the PDB, which will continuously lead to better prediction accuracies. Moreover, like CODA (Deane and Blundell, 2001), the LIP method is not limited to loop prediction but can be used to model arbitrary peptide fragments in proteins. A limitation of the LIP method is the relatively high standard deviations in prediction accuracies in comparison with the ab initio method (Fiser et al., 2000).

For the purpose of homology modelling, a graphical user interface has been designed to make interactive use of the protein segment LIP database possible. The program together with the database on DVD is available from the authors on request.


    Acknowledgements
 
The authors thank Björn Peters for critical discussions and valuable comments and suggestions. This work was supported by the BMBF (German Federal Ministry of Education and Research).


    References
 
Baker,D. and Sali,A. (2001) Science, 294, 93–96.

Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235–242.

Bruccoleri,R.E. and Karplus,M. (1987) Biopolymers, 26, 137–168.

Deane,C.M. and Blundell,T.L. (2001) Protein Sci., 10, 599–612.

DePristo,M.A., de Bakker,P.I.W., Lovell,S.C. and Blundell,T.L. (2003) Proteins, 51, 41–55.

Fidelis,K., Stern,P.S., Bacon,D. and Moult,J. (1994) Protein Eng., 7, 953–960.

Fine,R.M., Wang,H., Shenkin,P.S., Yarmusch,D.L. and Levinthal,C. (1986) Proteins, 1, 342–362.

Fiser,A., Kinh Gian Do,R. and Sali,A. (2000) Protein Sci., 9, 1753–1773.

Greer,J. (1981) J. Mol. Biol., 153, 1027–1042.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Kawashima,S. and Kanehisa,M. (2000) Nucleic Acids Res., 28, 374.

Lattman,E.E. (2001) Proteins, 44, 399.

Martin,A.C.R., Cheetham,J.C. and Rees,A.R. (1989) Proc. Natl Acad. Sci. USA, 86, 9268–9272.

Moult,J. and James,M.N. (1986) Proteins, 1, 146–163.

Moult,J., Fidelis,K., Zemla,A. and Hubbard,T. (2001) Proteins, 45, 2–7.

Overington,J., Donnelly,D., Johnson,M.S., Sali,A. and Blundell,T.L. (1992) Protein Sci., 1, 216–226.

Reczko,M., Martin,A.C.R., Bohr,H. and Suhai,S. (1995) Protein Eng., 8, 389–395.

Samudrala,R. and Moult,J. (1998) J. Mol. Biol., 279, 287–302.

Sánchez,R. and Sali,A. (1997) Curr. Opin. Struct. Biol., 7, 206–214.

Schonbrun,J., Wedemeyer,W.J. and Baker,D. (2002) Curr. Opin. Struct. Biol., 12, 348–354.

Sutcliffe,M.J., Haneef,I., Carney,D. and Blundell,T.L. (1987a) Protein Eng., 1, 377–384.

Sutcliffe,M.J., Hayes,F.R.F. and Blundell,T.L. (1987b) Protein Eng., 1, 385–392.

Tosatto,S.C.E., Bindewald,E., Hesser,J. and Männer,R. (2002) Protein Eng., 15, 279–286.

Tramontano,A., Leplae,R. and Morea,V. (2001) Proteins, 45, 22–38.

van Vlijmen,H.W.T. and Karplus,M. (1997) J. Mol. Biol., 267, 975–1001.

Received May 12, 2003; revised October 17, 2003; accepted October 21, 2003




