Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA, UK
Abstract
Keywords: database/loop prediction/protein conformation/SLoop/substitution tables
Introduction
Donate et al. (1996) described a database of 2000 loops derived from 223 protein chains. Loops of a given length were grouped according to the type of their bounding secondary structures, which were either α-helical or β-strand. The loops within each group were pairwise superposed and populated conformation classes were identified by clustering the loops according to their structural similarity using the phylogeny package PHYLIP (Felsenstein, 1985). This resulted in 161 well-populated, conformationally unique structural classes. Other databases of loops have also been developed. Oliva et al. (1997) classified over 3000 loops from 233 proteins into five types (α-α, β-β links, β-β hairpins, α-β and β-α) according to the secondary structures they embrace. This resulted in 56 structurally unique groups. The classes were further divided into 121 sub-classes according to four geometric descriptors: the distance (D) between the secondary structures bracing the loop, the packing angle (the angle between the axes of the secondary structures), the hoist angle (the angle between the axis of the first secondary structure and the vector between the secondary structures bracing the loop) and the meridian angle (the angle between the axis of the second secondary structure and a plane parallel to the vector of D). More recently, Li et al. (1999) described a database of loops extracted from a set of homologous proteins. They only considered loops when the bounding secondary structures had an r.m.s.d. of <1 Å over all Cα atoms across the family of homologous proteins. These loops were then clustered into 84 structurally distinct classes based on the Cα distances using average linkage cluster analysis. Clusters were selected so that members of a cluster differed by no more than 1.5 Å. Only 44 classes had an equivalent in the classification of Donate et al. (1996). Kwasigroch et al. (1996) also described a database of loops of length 3–8 residues, clustered according to the length of the loop. Classification into structural families depended on two values, the mean distance between the first and last Cα and the distance to the centre of gravity of the cluster. This database has been extended and a loop prediction method developed based upon these metrics (Wojcik et al., 1999). Their analysis also shows that there are distinct preferences for residues close to the adjacent secondary structures, with residues in the middle of the loop having greater variation in both sequence and structure.
The SLoop database described by Donate et al. (1996) has now been updated using protein chains selected from the June 2000 version of the HOMSTRAD database (Mizuguchi et al., 1998). It currently contains over 10 000 loops of up to 20 residues in length, which cluster into over 560 well-populated classes (Burke et al., 2000). There are also ~3100 distinct structural classes with fewer than three members. In this paper, we compare different scoring schemes for loop class prediction and analyse their success as a function of loop length and the nature of the bounding secondary structures.
Methods
A set of 968 newly characterized proteins with a resolution of better than 2.5 Å was derived from the June 2000 version of the HOMSTRAD database (Mizuguchi et al., 1998). Loops were defined using the algorithm described by Kabsch and Sander (1983). As previously described (Donate et al., 1996), loops of a given length were grouped according to the type of their bounding secondary structures, which were either α-helices or β-strands. The loops within each group were pairwise superposed and populated conformation classes were identified by clustering the loops according to their structural similarity (Felsenstein, 1985). The name of a loop class is derived from the types of its bounding secondary structures (H for helix, E for strand) and the main chain conformation of the loop, defined as one of seven conformations (α, g, l, t, b, p and e) where α-helix, 3₁₀-helix and π-helix are grouped into the α conformation. Where there are alternative conformations of loops, these are both indicated but separated by an underscore. For example, Haab_tH describes a loop class of length three, conformation aab or aat, linking two helices.
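The naming convention can be illustrated with a short parser. This is only a sketch based on the single example given above: it assumes that `_x` marks `x` as an alternative to the conformation letter immediately before the underscore, and the function name `parse_class_name` is hypothetical rather than part of SLoop.

```python
from itertools import product


def parse_class_name(name: str) -> dict:
    """Split a class name such as 'Haab_tH' into its parts.

    Assumption (inferred from one example only): '_x' marks x as an
    alternative to the conformation letter just before the underscore.
    """
    if len(name) < 3 or name[0] not in "HE" or name[-1] not in "HE":
        raise ValueError(f"not a recognised class name: {name!r}")
    core = name[1:-1]
    positions = []  # one list of allowed letters per loop residue
    i = 0
    while i < len(core):
        if core[i] == "_":
            if not positions or i + 1 >= len(core):
                raise ValueError(f"misplaced underscore in {name!r}")
            positions[-1].append(core[i + 1])
            i += 2
        else:
            positions.append([core[i]])
            i += 1
    # Expand the alternatives into every allowed conformation string.
    conformations = ["".join(combo) for combo in product(*positions)]
    return {
        "start": name[0],
        "end": name[-1],
        "length": len(positions),
        "conformations": conformations,
    }


print(parse_class_name("Haab_tH"))
```

For Haab_tH this gives two bounding helices, a loop length of three and the conformations aab and aat, in agreement with the description above.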
Description of the class scoring templates
Each class in the SLoop database contains information about the sequences of its member loops, the local structural environment of the loop residues and the angle and distance between bounding secondary structures. Residue environments are defined, as described by Topham et al. (1993), in terms of main chain conformation, relative side chain solvent accessibility and side chain hydrogen bonding. Nine main chain conformations (α-helix, 3₁₀-helix, π-helix, g, l, t, b, p and e) are defined along with three types of solvent accessibility (<7%; between 7 and 40%; >40%). There are also three types of independent hydrogen bonding possible (side chain to side chain; side chain to main chain amide; and side chain to main chain carboxyl). This gives a total of 216 (9 × 3 × 2 × 2 × 2) residue environments. By considering only one type of side chain hydrogen bond, the number of environments can be reduced to 54 (9 × 3 × 2). A reduced table with 96 environments (6 × 2 × 2 × 2 × 2) was also derived using six main chain conformations (H, b, p, t, g, coil), two types of solvent accessibility (<7%; >7%) along with three types of independent hydrogen bonding. Position-specific substitution templates can be derived for each class by averaging the environment-dependent amino acid substitution tables, as discussed by Topham et al. (1993), at each position in the loop. These class templates describe the probabilities of substituting a residue at each position in a loop by each of the 21 residue types, distinguishing cystine and cysteine. Contributions of a loop to these probabilities are weighted by the inverse of the number of its homologous member loops within a class.
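The environment counts and the family weighting can be checked with a small sketch. It is illustrative only: `count_environments` and `weighted_template` are hypothetical names, each hydrogen bond type is assumed to be simply present or absent, and the per-loop residue distributions are assumed to have already been looked up from the environment-dependent substitution tables.

```python
from collections import Counter, defaultdict
from itertools import product


def count_environments(n_main_chain, n_accessibility, n_hbond_types):
    """Count environments; each hydrogen bond type is present or absent."""
    states = [range(n_main_chain), range(n_accessibility)]
    states += [range(2)] * n_hbond_types
    return sum(1 for _ in product(*states))


def weighted_template(member_profiles, families):
    """Average per-position residue distributions over one class.

    member_profiles[i][pos] is a dict residue -> probability for loop i
    (assumed to come from an environment-specific substitution table).
    families[i] names the homologous family of loop i; each loop is
    weighted by 1 / (number of class members from the same family).
    """
    family_size = Counter(families)
    weights = [1.0 / family_size[f] for f in families]
    total = sum(weights)
    template = []
    for pos in range(len(member_profiles[0])):
        column = defaultdict(float)
        for profile, weight in zip(member_profiles, weights):
            for residue, prob in profile[pos].items():
                column[residue] += weight * prob / total
        template.append(dict(column))
    return template


print(count_environments(9, 3, 3))  # full set of environments
print(count_environments(9, 3, 1))  # one combined hydrogen bond type
print(count_environments(6, 2, 3))  # reduced main chain and accessibility
```

The three calls reproduce the totals of 216, 54 and 96 environments. With this weighting, two homologous loops in a class together contribute as much to the template as a single loop from an unrelated family.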
Sequence-based score
A score of the compatibility of a sequence with a loop class can be calculated from the class template. The score for the complete loop sequence, Sseq, is defined as
[Equation for Sseq: image not preserved in the source]
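The defining equation is not available in this copy, so the following is not the published definition of Sseq. It is a minimal sketch of one common way to score a sequence against a position-specific template, assuming a sum of log-probabilities over loop positions. The names `sequence_score` and `rank_classes`, the `floor` parameter and the toy templates are all hypothetical.

```python
import math


def sequence_score(sequence, template, floor=1e-6):
    """Sum of log-probabilities of each residue at its loop position.

    Assumed form for illustration, not the published Sseq.
    `floor` avoids log(0) for residues absent from a template column.
    """
    if len(sequence) != len(template):
        raise ValueError("sequence and template differ in length")
    return sum(
        math.log(max(column.get(residue, 0.0), floor))
        for residue, column in zip(sequence, template)
    )


def rank_classes(sequence, templates):
    """Return class names of matching length, best score first."""
    scored = {
        name: sequence_score(sequence, template)
        for name, template in templates.items()
        if len(template) == len(sequence)
    }
    return sorted(scored, key=scored.get, reverse=True)


# Toy templates for two hypothetical two-residue classes.
templates = {
    "classA": [{"G": 0.9, "P": 0.1}, {"D": 0.8, "N": 0.2}],
    "classB": [{"G": 0.1, "P": 0.9}, {"D": 0.5, "N": 0.5}],
}
print(rank_classes("GD", templates))
```

Only templates of the same length as the query are scored, mirroring the grouping of classes by loop length.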
Loop test set
The method was tested on the loops in the database using 7-fold cross-validation. All loops from the SLoop database were randomly placed into one of seven groups containing ~1400 loops each. One consequence of moving a loop out of its class and into a test group is that about 10% of classes, those with a low population of loops, no longer have any members. Each group was checked to make sure that the distribution of length and type of loops was similar to that for the complete database. For each group in turn, class templates were recalculated with the loops from that group removed. The conformational class of the loops in each group was then predicted using the relevant reduced set of templates, and the results were averaged over all seven groups.
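The partitioning step can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: loops are represented as hypothetical `(loop_id, class_name)` pairs, the random seed is arbitrary, and the check on length and type distributions is omitted.

```python
import random
from collections import Counter


def make_folds(loops, k=7, seed=0):
    """Randomly assign loops to k groups of near-equal size."""
    shuffled = list(loops)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def emptied_classes(loops, held_out):
    """Classes left with no members once held_out is removed."""
    total = Counter(cls for _, cls in loops)
    held = Counter(cls for _, cls in held_out)
    return {cls for cls, n in total.items() if held[cls] == n}


# Toy data: 700 loops in 50 classes plus one singleton class.
loops = [(i, f"class{i % 50}") for i in range(700)]
loops.append((700, "rare"))
folds = make_folds(loops)
print([len(fold) for fold in folds])
```

A class with a single member is always emptied by the group that holds it, which is why low-population classes lose their templates during cross-validation.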
Results
Figure 1 shows the distribution of well-populated loop classes according to their length and type of bounding secondary structure. Compared with the original analysis of Donate et al. (1996), the number of unique, well-populated classes has increased at every loop length for all combinations of bounding secondary structures. The total number of classes has more than doubled at all lengths, with loops of three and four residues showing the largest increase. As a percentage of the well-populated classes, classes of length three and four residues comprise 17 and 18.5%, respectively, up from 15% each in the original database. Loop classes of length two and five residues have each decreased by about 2%, to 12.5 and 13.5% of the classes, respectively. The other loop lengths comprise almost the same percentage of the database as before. Grouping the classes according to the type of bounding secondary structure, the proportion of EE classes has decreased considerably, from 39 to 28% of the classes. The proportions of EH, HE and HH classes have increased to 28% (from 20%), 24% (from 22%) and 20% (from 19%), respectively. When the low-populated classes are also considered, the proportion of HH classes rises further to 25% of the database, at the expense of EH and HE loops, with the proportion of EE classes increasing slightly to 31%. Interestingly, the most common loop length rises from four to seven residues. This is consistent with the expectation that longer loops have more degrees of conformational freedom and can adopt a larger number of unique conformations. As the number of protein structures, and hence the size of the database, increases, we should see more conformational classes at these longer lengths. However, longer loops are also more conformationally flexible, and this limits their structural characterization by X-ray crystallography.
Each class in the current database was analysed to identify features conserved between its member loops. Figure 3 shows the percentage of classes that have particular features completely conserved in all loops in a class and those that have a particular feature conserved in >75% of the loops. The percentages were calculated considering only residues in the loop and also including three residues from each of the secondary structures bounding the loop. Some general observations can be made. Since the classes were clustered on the basis of their structural similarity, it is not surprising that the most conserved feature is the main chain torsion angles within a class.
Surprisingly, there is very little conservation of sequence or solvent accessibility within any class. To investigate the apparent lack of sequence conservation within the classes, the average percentage sequence identity of loops within well-populated classes was calculated and is shown in Figure 4. These values were also calculated for the original database. The most common sequence identity is very low, between 10 and 20%, in both versions of the database. The classes in the current database seem to have more diverse sequences, with 47% of classes possessing an identity between 10 and 20% and only 15% of classes having a sequence identity of >40%, compared with 30 and 25%, respectively, in the original database.
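The text does not specify how the average identity was computed. One plausible reading, sketched here as an assumption, is the mean of all pairwise identities between the equal-length loops of a class. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b):
    """Percentage of positions with identical residues (equal lengths)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def mean_class_identity(sequences):
    """Mean pairwise identity over all loop pairs in one class."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two loops")
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical four-residue loops from one class.
print(mean_class_identity(["GDGK", "GNGK", "PDGS"]))  # 50.0
```

No alignment is needed because all loops within a class have the same length.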
Since the size of the database, in terms of both the number of loops and the number of classes, has increased considerably since its original description, the accuracy of the prediction method was re-evaluated using the seven test sets. Based on the observed conservation of structural features, several different scoring schemes were tested. These included the use of substitution tables with a combined hydrogen bonding environment (54 tables) rather than specific hydrogen bonding environments (216 tables) and six classes of main chain φ–ψ conformation rather than nine (96 tables). Exclusion of classes with low populations and the inclusion in the sequence of residues in the bounding secondary structures were also tested.
To analyse the effect of the residues in the bounding secondary structures, classes were predicted including the sequence of the three residues from each of the bounding secondary structures in the scoring. This increases the accuracy of prediction by up to 10% for every type of environment definition (see Figure 5). This result, in agreement with the findings of Wojcik et al. (1999), suggests that the environment and properties of the capping residues of the secondary structures not only stabilize the secondary structure elements, but for short loops also affect the allowed conformations of the adjoining loop residues.
The effect of the population of a SLoop class on the prediction accuracy can be seen in Figure 6. As the population of the class increases, so does the accuracy of prediction. Considering all loop classes, irrespective of population, the accuracy for the highest scoring prediction is 58% (including the sequence of the bounding secondary structures and using the 96 environment substitution tables). This accuracy rises steadily to 63% when only classes with five or more loop members are considered (Figure 6). This improvement is due to a combination of an improved profile for each class and scoring against fewer classes, which reduces the number of false-positive predictions.
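The effect of excluding low-population classes can be illustrated with a toy calculation. This is a sketch under assumptions: each test loop is taken to come with a ranked list of candidate classes, excluded classes are removed from both the test loops and the candidates, and the templates are not recalculated. The function name and data are hypothetical.

```python
def accuracy_by_population(results, class_sizes, min_members):
    """Top-prediction accuracy (%) using only classes of sufficient size.

    results: list of (true_class, candidate classes ranked best first).
    class_sizes: dict class name -> number of member loops.
    """
    correct = total = 0
    for true_class, ranked in results:
        if class_sizes[true_class] < min_members:
            continue  # the loop belongs to an excluded class
        allowed = [c for c in ranked if class_sizes[c] >= min_members]
        total += 1
        correct += bool(allowed) and allowed[0] == true_class
    return 100.0 * correct / total if total else float("nan")


class_sizes = {"A": 10, "B": 2, "C": 6}
results = [
    ("A", ["B", "A", "C"]),  # a small class outranks the true class
    ("C", ["C", "A"]),
    ("B", ["B", "A"]),
]
print(accuracy_by_population(results, class_sizes, min_members=1))
print(accuracy_by_population(results, class_sizes, min_members=5))
```

In this toy case the first loop is recovered once the sparsely populated class B is no longer a candidate, which is the false-positive effect described above.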
To examine the effect of the local environment created by other parts of the structure on loop selection, each SLoop prediction was re-evaluated using knowledge from the structure from which it was derived. A fragment from each predicted SLoop class was superimposed on to the native structure using the three residues from each bounding secondary structure as anchor regions. The r.m.s.d.s between the residues in the anchor region of the native structure and those in the fragment were calculated. If the anchor r.m.s.d. was greater than a predefined cutoff value, the prediction was considered incorrect and not used. Once the fragment was fitted, the r.m.s.d. between the residues in the predicted loop conformation and the actual loop conformation was also calculated. All backbone atoms were used in fitting fragments to the native structure and in calculating the r.m.s.d. of predicted fragments. A knowledge-based contact potential, described by Moult and colleagues (Samudrala and Moult, 1998), was also calculated for the predicted loop fitted on to the native structure. The difference in contact energy between the native structure without the loop and the native structure with a correctly predicted loop is below 20 for 96% of all correct predictions over all loop lengths (data not shown). Any predicted loop with a difference in contact potential above this value was assumed to be incorrect and the prediction discarded. Figure 7 shows the average loop r.m.s.d. for each loop length for various anchor r.m.s.d. cutoffs. There is a clear correlation between the average loop r.m.s.d. and the length of the loop, although the average loop r.m.s.d. is also very dependent on the anchor r.m.s.d. cutoff. A lower anchor r.m.s.d. cutoff reduces the average loop r.m.s.d. but it also reduces the coverage of loop prediction. The coverage is >90% for all lengths for anchor r.m.s.d. cutoffs of ≥2.0 Å. As the anchor r.m.s.d. cutoff falls, there is a considerable drop in coverage across all loop lengths, down to about 20–30% at the cutoff that gives the lowest average r.m.s.d.
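The trade-off between coverage and average loop r.m.s.d. can be made concrete with a small sketch. It assumes the anchor and loop r.m.s.d. values have already been computed for each prediction. The function name and the numbers are invented for illustration and are not results from this work.

```python
def filter_by_anchor(predictions, anchor_cutoff):
    """Apply an anchor r.m.s.d. cutoff to a set of predictions.

    predictions: list of (anchor_rmsd, loop_rmsd) pairs in Å.
    Returns (coverage in %, mean loop r.m.s.d. of accepted predictions);
    the mean is None when no prediction passes the cutoff.
    """
    kept = [loop for anchor, loop in predictions if anchor <= anchor_cutoff]
    coverage = 100.0 * len(kept) / len(predictions)
    mean_loop = sum(kept) / len(kept) if kept else None
    return coverage, mean_loop


# Invented values, for illustration only.
predictions = [(0.4, 0.8), (0.9, 1.5), (1.8, 2.9), (2.5, 4.0)]
for cutoff in (2.0, 1.0, 0.5):
    print(cutoff, filter_by_anchor(predictions, cutoff))
```

Tightening the cutoff lowers the mean loop r.m.s.d. of the accepted predictions while discarding a growing share of them, which is the behaviour described for Figure 7.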
Discussion
The average loop r.m.s.d.s obtained using this method are comparable to those obtained using previously published loop databases (Wojcik et al., 1999; Deane and Blundell, 2000) or ab initio methods (van Vlijmen and Karplus, 1997) over all loop lengths. As has been noted before (Fidelis et al., 1994; Wojcik et al., 1999), comparison of the accuracy of any loop prediction method with other loop prediction methods is sometimes complicated by the fact that different atoms are used to calculate the final r.m.s.d. of the loop. This problem is compounded by the many different methods that are used to fit a predicted loop to the native loop structure. The superposition of loops in this analysis is performed using only the anchor residues in the neighbouring secondary structures. Once the predicted supersecondary fragment has been fitted on to the native structure anchors, the r.m.s.d. between all main chain atoms in the predicted and actual structures is calculated for the loop residues only.
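The distinction matters because an r.m.s.d. computed without refitting the loop itself is generally larger than one computed after superposing the loop atoms directly. A minimal sketch of the final step is given below. It assumes the predicted fragment has already been superposed on the anchors (the fitting itself is not shown) and that the two coordinate lists contain the same main chain atoms in the same order. The function name is hypothetical.

```python
import math


def rmsd_without_refit(predicted, native):
    """R.m.s.d. between matched atoms, with no further superposition.

    predicted, native: equal-length lists of (x, y, z) coordinates in Å
    for the loop main chain atoms, already in the same reference frame.
    """
    if len(predicted) != len(native) or not predicted:
        raise ValueError("coordinate lists must be non-empty and matched")
    squared = sum(math.dist(p, n) ** 2 for p, n in zip(predicted, native))
    return math.sqrt(squared / len(predicted))


# Two atoms, each displaced by 1 Å along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
native = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd_without_refit(predicted, native))  # 1.0
```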
Since SLoop predicts the conformational class of a loop using only sequence information, knowledge of the local structural environment of the loop can be used to identify incorrect predictions. As is commonly done in many loop prediction methods, the r.m.s.d. of the ends of the predicted fragment to the residues in the bounding secondary structures can be calculated and predictions above a threshold value can be rejected. The use of contact potentials has also been shown to discriminate 95% of correct predictions. Conversely, since no structural information is used to predict the conformational class, the method can also be considered as a validation method for loops built on to homology models, discarding predictions based on the fitting to a structure or suggesting changes to the orientation of the bounding secondary structures. The structural variation within a class may also suggest dynamics of a loop and the sequence variation can also be used to predict allowed mutations in the sequence.
References
Blundell,T., Carney,D., Gardner,S., Hayes,F., Howlin,B., Hubbard,T., Overington,J., Singh,O.A., Sibanda,B.L. and Sutcliffe,M. (1988) Eur. J. Biochem., 172, 513–520.
Bruccoleri,R.E., Haber,E. and Novotny,J. (1988) Nature, 335, 564–568.
Bruccoleri,R.E. and Karplus,M. (1987) Biopolymers, 26, 137–168.
Bruccoleri,R.E. and Karplus,M. (1990) Biopolymers, 29, 1847–1862.
Chothia,C., Lesk,A.M., Tramontano,A., Levitt,M., Smith-Gill,S.J., Air,G., Sheriff,S., Padlan,E.A., Davies,D. and Tulip,W.R. (1989) Nature, 342, 877–883.
Claessens,M., Van Cutsem,E., Lasters,I. and Wodak,S. (1989) Protein Eng., 2, 335–345.
Deane,C.M. and Blundell,T.L. (2000) Proteins: Struct. Funct. Genet., 40, 135–144.
Donate,L.E., Rufino,S.D., Canard,L.H.J. and Blundell,T.L. (1996) Protein Sci., 5, 2600–2616.
Efimov,A.V. (1991) Protein Eng., 1, 173–181.
Felsenstein,J. (1985) Evolution, 39, 783–791.
Fidelis,K., Stern,P.S., Bacon,D. and Moult,J. (1994) Protein Eng., 7, 953–960.
Jones,T.A. and Thirup,T. (1986) EMBO J., 5, 819–822.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Kwasigroch,J.M., Chomilier,J. and Mornon,J.P. (1996) J. Mol. Biol., 259, 855–872.
Li,W., Liu,Z. and Lai,L. (1999) Biopolymers, 49, 481–495.
Martin,A.C.R. and Thornton,J.M. (1996) J. Mol. Biol., 263, 800–815.
Meirovitch,H. and Hendrickson,T.F. (1997) Proteins, 29, 127–140.
Milner-White,E.J. and Poet,R. (1986) J. Mol. Biol., 238, 733–747.
Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) Protein Sci., 7, 2469–2471.
Oliva,B., Bates,P.A., Querol,E., Aviles,F.X. and Sternberg,M.J. (1997) J. Mol. Biol., 266, 814–830.
Rufino,S.D., Donate,L.E., Canard,H.J. and Blundell,T.L. (1997) J. Mol. Biol., 267, 352–367.
Samudrala,R. and Moult,J. (1998) J. Mol. Biol., 275, 895–916.
Sibanda,B.L. and Thornton,J.M. (1985) Nature, 316, 170–174.
Sibanda,B.L., Blundell,T.L. and Thornton,J.M. (1989) J. Mol. Biol., 206, 759–777.
Sudarsanam,S., DuBose,R.F., March,C.J. and Srinivasan,S. (1995) Protein Sci., 4, 1412–1420.
Sutcliffe,M.J., Hayes,F.R.F. and Blundell,T.L. (1987) Protein Eng., 1, 385–392.
Topham,C.M., McLeod,A., Eisenmenger,F., Overington,J.P., Johnson,M.S. and Blundell,T.L. (1993) J. Mol. Biol., 229, 194–220.
van Vlijmen,H.W.T. and Karplus,M. (1997) J. Mol. Biol., 267, 975–1001.
Wintjens,R.T., Rooman,M.J. and Wodak,S.J. (1996) J. Mol. Biol., 255, 235–253.
Wojcik,J., Mornon,J.-P. and Chomilier,J. (1999) J. Mol. Biol., 289, 1469–1490.
Received October 19, 2000; revised February 26, 2001; accepted March 12, 2001.