Department of Chemistry and Chemical Biology and BioMaPS Institute, Rutgers, the State University of New Jersey, 610 Taylor Road, Piscataway, NJ 08854-8087, USA
1 To whom correspondence should be addressed. e-mail: ronlevy{at}lutece.rutgers.edu
Abstract
Assembling short fragments from known structures has been a widely used approach to construct novel protein structures. To what extent there exist structurally similar fragments in the database of known structures for short fragments of a novel protein is a question that is fundamental to this approach. This work addresses that question for seven-, nine- and 15-residue fragments. For each fragment size, two databases, a query database and a template database of fragments from high-quality protein structures in SCOP20 and SCOP90, respectively, were constructed. For each fragment in the query database, the template database was scanned to find the lowest r.m.s.d. fragment among non-homologous structures. For seven-residue fragments, there is a 99% probability that there exists such a fragment within 0.7 Å r.m.s.d. for each loop fragment. For nine-residue fragments there is a 96% probability of a fragment within 1 Å r.m.s.d., while for 15-residue fragments there is a 91% probability of a fragment within 2 Å r.m.s.d. These results, which update previous studies, show that there exists sufficient coverage to model even a novel fold using fragments from the Protein Data Bank, as the current database of known structures has increased enormously in the last few years. We have also explored the use of a grid search method for loop homology modeling and make some observations about the use of a grid search compared with a database search for the loop modeling problem.
Keywords: database/fold/loop modeling/protein fragment/protein structure
Introduction
Modeling of protein structure based on sequence and structural homology or limited experimental data can make use of a systematic search of conformational space (Deane and Blundell, 2000), spatial restraints (Fiser et al., 2002) or databases of fragments of known proteins. The fragment database approach dates back to 1986, when retinol binding protein was reconstructed by choosing fragments from only three other proteins (Jones and Thirup, 1986). Since then, protein fragment databases have been used to build complete protein backbone structures (Reid and Thornton, 1989; Correa, 1990; Summers and Karplus, 1990; Holm and Sander, 1991; Levitt, 1992) or to serve as candidates for loop modeling [e.g. (van Vlijmen and Karplus, 1997; Wojcik et al., 1999)]. The protein structure prediction program ROSETTA (Simons et al., 1999), which was used successfully in the CASP competition, also builds structures by assembling short fragments.
As an example of an application of fragment-based modeling, we recently determined the backbone structure of ubiquitin using limited NMR residual dipolar coupling data (Andrec et al., 2001, 2002). In our method, we selected a small number of seven-residue fragments which fit well to the experimental data from a library of nearly 200 000 fragments drawn from the SCOP40 database (Murzin et al., 1995; Brenner et al., 2000; Chandonia et al., 2002). This was done for all overlapping seven-residue data windows. The chosen fragments were subjected to a filtering procedure to maximize their structural similarity over overlapping regions of sequence and the complete protein backbone structure was built by superimposing the selected fragments (Figure 1).
Methods
For seven-residue fragments, two protein structure databases were constructed for the purposes of this study: a query database, which is used to represent fragments of unknown novel structures, and a template database, which is used to represent fragments of known structures. The query and template databases are derived from the SCOP20 and SCOP90 databases, version 1.53 (Brenner et al., 2000; Chandonia et al., 2002). These databases consist of domains from the complete SCOP database (Murzin et al., 1995) selected so that no pair of domains has more than 20 or 90% sequence identity, respectively. Our databases are constructed using only those structures with an R-factor of <20% and resolution better than 2 Å. The query database contains 34 205 seven-residue fragments from 172 domains selected from SCOP20 so that no pair of domains belongs to the same fold, while the much larger template database contains 174 914 seven-residue fragments from 955 domains selected from SCOP90. The fragments in both the template and query databases can be overlapping, e.g. residues 1-7, 2-8 and 3-9 on the same peptide chain can all be in the database.
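The overlapping-window construction described above can be illustrated with a short sketch. The function name `overlapping_fragments` is hypothetical, and the sketch assumes a single unbroken chain numbered from 1; handling of chain breaks or missing residues is not described in the text and is left out.

```python
from typing import List, Tuple


def overlapping_fragments(n_residues: int, size: int = 7) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) residue ranges of all windows.

    Assumes one continuous chain of n_residues residues (illustrative only).
    """
    if size <= 0:
        raise ValueError("fragment size must be positive")
    # A chain of n residues holds n - size + 1 overlapping windows.
    return [(start, start + size - 1)
            for start in range(1, n_residues - size + 2)]


if __name__ == "__main__":
    windows = overlapping_fragments(200, 7)
    print(windows[:3], len(windows))
```

For a 200-residue chain this yields the ranges 1-7, 2-8, 3-9 and so on, for a total of 194 seven-residue windows.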
The Fidelis et al. study (Fidelis et al., 1994) considered only loop structures, which they defined as seven-residue fragments with fewer than four continuous α-helical or β-strand residues as defined by DSSP (Kabsch and Sander, 1983). In this study, a loop fragment is defined in exactly the same way. Furthermore, we define an α fragment to be a seven-residue fragment with four or more continuous α-helical residues and a β fragment to be a seven-residue fragment with four or more continuous β-strand residues. The percentage of α, β or loop fragments in our query database is 33.7, 18.8 and 47.6%, respectively. For the template database, the corresponding percentages are 32.3, 19.5 and 48.2%, respectively.
The goal of this work was to study the distribution of the r.m.s.d.s of the most similar fragment in the template database (the nearest neighbor fragment) for every fragment in the query database. The structures of protein fragments were compared by calculating the r.m.s.d. after optimal superimposition of the Cα atoms (Kabsch, 1976; McLachlan, 1979). Since we wish our results to be relevant even in the case where the unknown structure corresponds to a previously unobserved fold, for each fragment in the query database we eliminate from consideration all fragments in the template database which come from domains belonging to the same fold according to the SCOP classification (Murzin et al., 1995) as the domain from which the query fragment is derived.
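The comparison step can be sketched as follows. This is a minimal illustration, assuming NumPy is available and using the standard SVD form of the Kabsch superposition; the function names, the `(fold_id, coordinates)` template layout and the fold labels are hypothetical and not taken from the original implementation, which is not described at the code level.

```python
import numpy as np


def kabsch_rmsd(p, q):
    """R.m.s.d. between two (N, 3) coordinate sets after optimal superposition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Remove translation by centering both sets on their centroids.
    p = p - p.mean(axis=0)
    q = q - q.mean(axis=0)
    # Singular values of the covariance matrix give the best achievable overlap.
    u, s, vt = np.linalg.svd(p.T @ q)
    # Flip the smallest singular value if the best fit would be a reflection.
    if np.linalg.det(u @ vt) < 0:
        s[-1] = -s[-1]
    msd = ((p ** 2).sum() + (q ** 2).sum() - 2.0 * s.sum()) / len(p)
    return float(np.sqrt(max(msd, 0.0)))  # clamp tiny negative round-off


def nearest_neighbor_rmsd(query_xyz, query_fold, templates):
    """Lowest r.m.s.d. over template fragments from a different fold.

    templates: iterable of (fold_id, coords) pairs (assumed layout).
    """
    best = float("inf")
    for fold_id, xyz in templates:
        if fold_id == query_fold:
            continue  # skip fragments from the same SCOP fold
        best = min(best, kabsch_rmsd(query_xyz, xyz))
    return best
```

A brute-force scan like this costs one superposition per query-template pair; the text does not say whether any pre-filtering was used, so none is assumed here.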
Fragments of nine and 15 residues were studied in the same manner as seven-residue fragments, except that the definitions of loop, α and β fragments were adjusted for the longer size. In particular, an α fragment was defined to be a nine-residue fragment with five or more continuous α-helical residues or a 15-residue fragment with eight or more continuous α-helical residues. A β fragment was defined to be a nine-residue fragment with five or more continuous β-strand residues or a 15-residue fragment with eight or more continuous β-strand residues. Loop fragments were defined to be all fragments that are neither α fragments nor β fragments. These definitions serve the purpose of roughly separating the contributions from fragments of different secondary structures.
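The three fragment classes can be expressed as a small rule over a DSSP secondary-structure string. The sketch below assumes that "α-helical" corresponds to the DSSP code `H` and "β-strand" to the code `E`; whether other helix or bridge codes were counted is not stated in the text, and the names `MIN_RUN` and `classify_fragment` are hypothetical.

```python
import re

# Minimum run of continuous helix or strand residues for each fragment size,
# as given in the Methods text (7 -> 4, 9 -> 5, 15 -> 8).
MIN_RUN = {7: 4, 9: 5, 15: 8}


def longest_run(ss: str, code: str) -> int:
    """Length of the longest uninterrupted run of `code` in the string `ss`."""
    runs = re.findall(f"{re.escape(code)}+", ss)
    return max((len(run) for run in runs), default=0)


def classify_fragment(ss: str) -> str:
    """Classify a fragment as 'alpha', 'beta' or 'loop' from its DSSP string.

    Assumes 'H' marks alpha-helix and 'E' marks beta-strand residues.
    """
    threshold = MIN_RUN[len(ss)]
    if longest_run(ss, "H") >= threshold:
        return "alpha"
    if longest_run(ss, "E") >= threshold:
        return "beta"
    return "loop"
```

Because each threshold exceeds half the fragment length, a fragment cannot satisfy both the α and the β condition, so the order of the two tests does not affect the result.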
Results
For seven residues, Figure 2A shows the distribution of the Cα r.m.s.d. for 99 995 pairs of loop fragments randomly chosen from the template database such that the two fragments do not come from domains of the same SCOP fold. These random pair r.m.s.d.s have a bell-shaped distribution ranging from 0 to almost 6 Å, with the peak at 2.8 Å. This distribution is almost identical with Fidelis et al.'s result of an all-against-all comparison of loop fragments [their figure 1A (Fidelis et al., 1994)]. Figure 2B shows the distribution of the r.m.s.d.s of the nearest neighbor fragments in the template database for each fragment in the query database (solid curve). This distribution has a sharp peak at 0.05 Å, a broader peak at ~0.25 Å and a long tail extending to 1.0 Å. When that distribution is decomposed according to the relative contributions from loop, α and β query fragments (long dashed, dot-dashed and short dashed curves, respectively), it becomes obvious that the sharp peak at 0.05 Å is due to the α query fragments, as the distribution of nearest neighbor r.m.s.d.s of the α fragments shows a peak that overlaps with the 0.05 Å peak of the solid curve. The distribution of nearest neighbor r.m.s.d.s for loop fragments is a broad curve with a peak at 0.3 Å. Beyond 0.6 Å, the tail of this distribution almost overlaps the tail of the solid curve, which is consistent with the commonly held view that loops are irregular structures and have fewer close neighbors in the structural database. In comparison, Fidelis et al.'s distribution of nearest neighbor r.m.s.d.s for loop fragments has a peak at 0.5-0.6 Å and has a tail extending to 1.4 Å (Fidelis et al., 1994). This significant improvement in the peak position and the upper tail over the results obtained by Fidelis et al. in 1994 is a reflection of the vastly increased structural diversity of today's PDB and suggests that there is sufficient coverage in the present database to construct models for novel structures based on existing protein fragments. The distribution of nearest neighbor r.m.s.d.s for β query fragments is also broadly distributed, with a peak at 0.2-0.25 Å and a long tail extending to 0.8 Å. The solid curve in Figure 5 shows the cumulative probability of nearest neighbor r.m.s.d.s for all loop fragments in the query database. The height of the curves in Figure 5 at an r.m.s.d. value of x is equal to the percentage of the query fragments whose nearest neighbor r.m.s.d. is ≤x. If the threshold for structural similarity is set to 1 Å, then we are virtually guaranteed to find a fragment in the template database which has a similar structure to a given query database fragment, even if we insist that the two fragments come from domains with different folds. With a stricter similarity threshold of 0.7 Å, the probability is still 99.3%.
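The cumulative curve described here is an empirical distribution function over the nearest neighbor r.m.s.d.s. A minimal sketch, with a hypothetical function name and made-up toy values that are not data from this study:

```python
from bisect import bisect_right
from typing import Sequence


def cumulative_fraction(nn_rmsds: Sequence[float], x: float) -> float:
    """Fraction of query fragments whose nearest neighbor r.m.s.d. is <= x."""
    ordered = sorted(nn_rmsds)
    # bisect_right counts every value less than or equal to x.
    return bisect_right(ordered, x) / len(ordered)


if __name__ == "__main__":
    # Toy values in angstroms, for illustration only.
    toy = [0.05, 0.25, 0.30, 0.60, 1.20]
    print(cumulative_fraction(toy, 0.7))
```

Evaluating this function at a chosen similarity threshold gives the coverage probability quoted in the text for that threshold.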
The results described above are considerably more promising than those reported by Fidelis et al. in 1994 (Fidelis et al., 1994) and are due to the much larger database of protein structures currently available. The center of the distribution of nearest neighbor r.m.s.d.s for a seven-residue loop fragment is improved significantly from ~0.6 to 0.3 Å. The fact that 99.3% of the time one can find a seven-residue fragment in the template database that is within 0.7 Å of a given fragment in the query database implies that a database approach is applicable for the construction of complete protein folds from short fragments, when combined with sparse experimental data, such as given by Andrec et al. (Andrec et al., 2001, 2002). Consider the case where the threshold of similarity is realistically set to 0.7 Å and a protein of unknown structure is 200 residues long. The probability that all 194 overlapping seven-residue fragments in the target protein have a similar structure in the template database is 0.256 ($0.993^{194}$), while the probability of encountering one rare fragment for which no similar structure exists in the database is 0.350 ($194 \times 0.993^{193} \times 0.007$). The latter probability decreases rapidly for more than one rare fragment. Since the fragments can be overlapping, the presence of a small number of rare fragments is not fatal, since there may be sufficient structural information in the neighboring fragments to allow for the construction of the protein structure.
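The two probabilities quoted for the 200-residue example follow from a binomial model in which each of the 194 windows is covered independently with probability 0.993. Independence is an assumption of this sketch (overlapping windows are in practice correlated), and the function name is hypothetical.

```python
from math import comb


def prob_k_rare(n_fragments: int, p_covered: float, k: int) -> float:
    """Binomial probability that exactly k of n fragments lack a close match.

    Treats fragments as independent, which is a simplifying assumption.
    """
    p_rare = 1.0 - p_covered
    return comb(n_fragments, k) * p_covered ** (n_fragments - k) * p_rare ** k


if __name__ == "__main__":
    n = 200 - 7 + 1  # 194 overlapping seven-residue windows
    print(round(prob_k_rare(n, 0.993, 0), 3))  # 0.256
    print(round(prob_k_rare(n, 0.993, 1), 3))  # 0.35
```

The same function can be evaluated for other values of k to tabulate the chance of encountering several rare fragments in one target.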
As expected, α fragments are structurally very similar to each other for all three sizes. It is surprising, however, that for seven-residue fragments, the average nearest neighbor r.m.s.d. for β fragments is 0.2-0.25 Å and is broadly distributed. In fact, the distribution of nearest neighbor r.m.s.d. for β fragments is actually close to that for loop fragments (Figure 2B). Although it is known that β strands are structurally diverse owing to twisting and other distortions (Chothia, 1973), the observed variability is somewhat greater than might have been expected.
Enforcing the condition that the query and template database fragments belong to domains having different SCOP folds might be too strict to use as a criterion to remove structural homology, since proteins in the same fold class but different superfamilies are not homologous to each other. The fact that structurally similar fragments can be found even under this very strong condition implies that the results are applicable even when the prediction target is of an entirely new fold and is confirmed by the results with the CASP4 new fold set (Figure 6).
For longer fragments, it becomes more difficult to find a similar fragment in the template database for each fragment in the query database. However, how similar a fragment must be to the native structure for it to be useful clearly depends on the application. For example, the existence of fragments in the database which are within 0.7 Å r.m.s.d. over seven residues is necessary for building models of protein structure using an NMR residual dipolar coupling based approach (Andrec et al., 2001). As the fragment size becomes longer, the structure becomes more fold-specific, i.e. fragments from a different fold are less likely to share similar structure. For example, an average 15-residue fragment typically has a nearest neighbor r.m.s.d. of about 1.5 Å (e.g. the peak of the distribution in Figure 4B or the dot-dashed curve in Figure 5). Such fragments are not structurally similar enough to fit NMR dipolar coupling data, though they will be useful in building homology models. When compared with the distribution of the r.m.s.d. of pairs of randomly chosen fragments (Figure 4A), which has a mean of 5.5 Å and a standard deviation of 1.6 Å, the 1.5 Å r.m.s.d. between nearest neighbor non-homologous 15-residue fragments is 2.8 standard deviations from the mean and is therefore highly statistically significant in the sense of structural similarity.
Knowing that the right structural fragment is in the database is only the first step: one must also be able to pick it out. For the loop modeling problem, one can search for fragments in the PDB whose residues adjacent to the loop can be superimposed on those of the target (e.g. Greer, 1980; Summers and Karplus, 1990; van Vlijmen and Karplus, 1997; Wojcik et al., 1999; Deane and Blundell, 2000). This approach has already been shown to be effective for loops up to nine residues long (van Vlijmen and Karplus, 1997), but becomes less reliable for longer loops. A second way to pick out the correct fragments is to use information in the amino acid sequence and composition, as is done in the loop modeling programs of Kwasigroch et al. (Kwasigroch et al., 1996) and Wojcik et al. (Wojcik et al., 1999). Short fragments whose structures correlate strongly with sequence profiles can also be predicted using sequence to structure clustering according to the method of Bystroff and Baker (Bystroff and Baker, 1998).
An increasingly important way to pick out the correct fragments from the database is the use of sparse experimental data [e.g. (Jones and Thirup, 1986; Cornilescu et al., 1999; Delaglio et al., 2000; Andrec et al., 2001, 2002)], which very effectively reduces the number of feasible fragments. An example of such a strategy based on NMR residual dipolar couplings is shown in Figure 1. Residual dipolar coupling data can be highly sensitive to small changes in conformation, resulting in a relatively high rate of false negatives, that is, fragments that are similar in structure to the target but which do not score sufficiently well to pass the filter in step A of Figure 1 (Andrec et al., 2001, 2002). For such a strategy to be successful, it is critical that there be a sufficient number of structures close to the target structure. The growth of the structural database described above has been essential for the feasibility of model building based on fragment libraries using NMR data.
The database-oriented approach to protein structure construction is complementary to the grid search approach, which systematically searches the conformational space. The former is efficient and the fragments found are guaranteed to be physically reasonable, but it is limited by the completeness of the database for longer fragments. The latter is not limited by completeness, but rather by the exponential increase of the conformational space that must be searched. Based on results from this study, the database is essentially complete for seven- and nine-residue fragments. Furthermore, the efficiency of database methods can be greatly increased by making use of clustering methods to reduce the structural redundancy of the database (Lessel and Schomburg, 1997; Kolodny et al., 2002). In particular, the recent work of Kolodny et al. has demonstrated that the database size can be reduced to under 500 fragments for fragment lengths of four to seven residues and still result in adequate modeling accuracy. In addition, fragments selected by the database-based approach can be successfully used as initial conformations for optimization (van Vlijmen and Karplus, 1997; Simons et al., 1999). The usefulness of database search for longer fragments will depend on the radius of convergence necessary for the particular homology modeling application and the ability to anneal database loop fragments onto template frameworks.
In order to compare the database search results with a grid search method, we performed both types of searches on a set of 23 loops, nine and 15 residues in length. The conformational search was performed using an early implementation of the PLOP (Protein Local Optimization Program) loop homology modeling software of M.P.Jacobson and R.A.Friesner (personal communication), which has been partially described in other publications (Jacobson et al., 2002a,b). Loop conformations are generated in PLOP by sampling the backbone dihedral angles using a discretized version of the Ramachandran plot for the N- and C-terminal halves of the loop independently and then applying a loop closure algorithm in the middle of the loop. The primary mechanism for screening loop conformations is identification of steric clashes. However, other criteria are also employed to rapidly eliminate unlikely conformations, including screens to ensure that the side chains on the loop can fit properly. Because the accessible backbone conformational space of a loop can vary widely, the number of conformations in the backbone dihedral angle library is not set in advance. Rather, the PLOP algorithm commences with coarse sampling and gradually samples more finely until a prescribed number of loop conformations have been generated.
One question of interest is whether nature makes use of all sterically satisfactory loops of a given size. For example, one could imagine that not every sterically feasible loop has a close neighbor in the PDB. If that were the case, then the use of database search would have a distinct advantage over systematic search, since one would avoid those sterically feasible loops that nature (for whatever reasons) does not use. To determine if this is the case, we used PLOP to generate sterically feasible loop models for each of the target loops as described in the caption of Figure 7. For all of these feasible loop models, we found the nearest neighbor in the template database as described above and generated overall histograms of the resulting distributions of r.m.s.d.s, which are shown in Figure 7. For both nine- and 15-residue loops, these distributions are qualitatively very similar to the distributions in Figures 3 and 4, particularly in the location and thickness of the upper tail. This indicates that, for the most part, nature does in fact use all sterically feasible loops, since loops systematically generated by grid search have nearest neighbors in the database with the same distribution as loops from actual proteins.
Overall, we have shown that there exist structurally very similar fragments in the PDB from a non-homologous protein for short loops; for seven-residue fragments there is a 99% probability that there exists a non-homologous structure within 0.7 Å, whereas for nine-residue fragments there is a 96% probability that there exists a non-homologous structure within 1.0 Å. For longer loops (15 residues) we observe a >90% probability that there exists a non-homologous structure within 2 Å r.m.s.d. Compared with systematic search in conformational space, the use of a database of known structures has the potential advantage of more efficient sampling and guarantees that all backbone conformations are physically reasonable. Results from this study are far more optimistic than those from previous studies of a similar nature and should encourage the use of fragment databases for protein structure determination, prediction and loop modeling, either alone or in combination with the conformational search approach to building protein structures.
Acknowledgements
We thank Matthew Jacobson and Richard Friesner for providing us with an early version of their homology modeling program and for helpful discussions. This work was supported in part by grants from the National Institutes of Health (GM 30580 and GM 06899).
References
Andrec,M., Du,P. and Levy,R.M. (2001) J. Biomol. NMR, 21, 335-347.
Andrec,M., Harano,Y., Jacobson,M.P., Friesner,R.A. and Levy,R.M. (2002) J. Struct. Funct. Genomics, 2, 103-111.
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235-242.
Brenner,S.E., Koehl,P. and Levitt,M. (2000) Nucleic Acids Res., 28, 254-256.
Bystroff,C. and Baker,D. (1998) J. Mol. Biol., 281, 565-577.
Chandonia,J.M., Walker,N.S., Lo Conte,L., Koehl,P., Levitt,M. and Brenner,S.E. (2002) Nucleic Acids Res., 30, 260-263.
Chothia,C. (1973) J. Mol. Biol., 75, 295-302.
Cornilescu,G., Delaglio,F. and Bax,A. (1999) J. Biomol. NMR, 13, 289-302.
Correa,P.E. (1990) Proteins, 7, 366-377.
Deane,C.M. and Blundell,T.L. (2000) Proteins, 40, 135-144.
Delaglio,F., Kontaxis,G. and Bax,A. (2000) J. Am. Chem. Soc., 122, 2142-2143.
Fidelis,K., Stern,P.S., Bacon,D. and Moult,J. (1994) Protein Eng., 7, 953-960.
Fiser,A., Feig,M., Brooks,C.L.,III and Sali,A. (2002) Acc. Chem. Res., 35, 413-421.
Greer,J. (1980) Proc. Natl Acad. Sci. USA, 77, 3393-3397.
Holm,L. and Sander,C. (1991) J. Mol. Biol., 218, 183-194.
Jacobson,M.P., Friesner,R.A., Xiang,Z. and Honig,B. (2002a) J. Mol. Biol., 320, 597-608.
Jacobson,M.P., Kaminski,G.A., Friesner,R.A. and Rapp,C.S. (2002b) J. Phys. Chem. B, 106, 11673-11680.
Jones,T.A. and Thirup,S. (1986) EMBO J., 5, 819-822.
Kabsch,W. (1976) Acta Crystallogr., A32, 922-923.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577-2637.
Kolodny,R., Koehl,P., Guibas,L. and Levitt,M. (2002) J. Mol. Biol., 323, 297-307.
Kwasigroch,J., Chomilier,J. and Mornon,J. (1996) J. Mol. Biol., 259, 855-872.
Lessel,U. and Schomburg,D. (1997) Protein Eng., 10, 659-664.
Levitt,M. (1992) J. Mol. Biol., 226, 507-533.
McLachlan,A.D. (1979) J. Mol. Biol., 128, 49-79.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536-540.
Reid,L.S. and Thornton,J.M. (1989) Proteins, 5, 170-182.
Simons,K.T., Bonneau,R., Ruczinski,I. and Baker,D. (1999) Proteins, Suppl. 3, 171-176.
Sippl,M.J., Lackner,P., Domingues,F.S., Prlic,A., Malik,R., Andreeva,A. and Wiederstein,M. (2001) Proteins, Suppl. 5, 55-67.
Summers,N.L. and Karplus,M. (1990) J. Mol. Biol., 216, 991-1016.
van Vlijmen,H.W.T. and Karplus,M. (1997) J. Mol. Biol., 267, 975-1001.
Wojcik,J., Mornon,J. and Chomilier,J. (1999) J. Mol. Biol., 289, 1469-1490.
Received October 23, 2002; revised March 3, 2003; accepted March 30, 2003.