Systèmes Moléculaires et Biologie Structurale, LMCP, Universités Paris 6 et Paris 7, CNRS UMR 7590, Case 115, 75252 Paris cedex 05, France
1 To whom correspondence should be addressed. e-mail: jean-paul.mornon{at}lmcp.jussieu.fr or jacques.chomilier{at}lmcp.jussieu.fr
Abstract
Keywords: homology modeling/hydrophobic core/minimal surface/protein folding/secondary structure
Introduction
Materials and methods
To validate the prediction, one must at this stage compare the modeled structures with the X-ray structures. This is done by means of the root mean square function (r.m.s.), which is the root mean square deviation on Cα between the generated structures and the native structures (see Equation 11 in Part I). Owing to approximations of the model, the minima of the goal function F seldom coincide exactly with those of the r.m.s., i.e. with the actual structure. Hence a new function has been defined that accounts for an average distance separating all hydrophobic amino acids belonging to the different SSE considered (Equations 9 and 10 in Part I). The idea underlying this function is to produce a structure in which the hydrophobic residues are as compact as possible. In this paper we try to relate this function to the r.m.s. on a small set of known structures of various folds in order to validate the algorithm. It appeared that smoothing this function over a small range of angular positions of the SSE (typically 10 or 20°), in order to smear out singular points, allows one to find a structure close to the native structure in the neighborhood of the minimum of the function.
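As an illustration of the validation measure, the following is a minimal sketch of an r.m.s. deviation over paired Cα positions. It assumes that the two coordinate sets have already been superimposed and list the same residues in the same order; the function name `rmsd` and the input layout are hypothetical, and the exact form of Equation 11 in Part I is not reproduced here.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z) points.

    Assumption: both lists hold C-alpha positions of the same residues,
    already superimposed; no optimal fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the two copies of the same atom.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

In practice the model would first be fitted onto the native structure (for instance by a least-squares superposition) before this deviation is computed.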
Results
We illustrate the HELIX model procedure with a set of four proteins (PDB codes 1enh, 2mhr, 1lpe, 1gmf) consisting of three- and four-α-helix bundles. They are listed in Table I, which reports the total number of residues, the number of amino acids involved in the SSE (loops are not included in this model) and the total number of models generated by RUSSIA. The models correspond to different initial positions of the SSE obtained by discrete rotations around their respective HFCs with a step of 10°. In the case of an n-helix bundle model, this procedure yields a set of 36^n initial positions for the SSE. Owing to the loop constraints, many conformations are rejected. For instance, of the 46 656 initial conformations of 1enh, only 883 structures are kept for further processing, as shown in Table I. The quality of the model is estimated as follows: each model is compared with the native structure from the PDB, and the percentages of models whose r.m.s. values are better than 3 and 4 Å are given in Table I, together with the value of the best r.m.s. In all four cases, the minimal r.m.s. is better than 3 Å, but the corresponding model is not necessarily detected by the algorithm among the selected structures. There is no correlation between the quality of the model with respect to the crystallographic structure, as measured by the best r.m.s., and the number of residues in the protein. Indeed, the difference between the models and the actual proteins appears to depend at least partly on the curvature of the helices. In particular, long helices are often curved and/or kinked, and in such cases the rigid ideal cylinders used here to describe helices lead to a severely increased r.m.s. (see, for instance, 1lpe in Figure 2). For all proteins of Table I, we compared the function with the r.m.s., i.e. prediction was compared with validation. In most cases, the smallest r.m.s. values are not spread all over the conformation space represented by all initial angles of the helices (see, for instance, Figure 14 in Part I) but are mainly located in a single region. In contrast, as the RUSSIA procedure does not take loops into account explicitly, the function may have several regions (often two) containing small values and, as only one of them coincides with the region of small r.m.s., selecting the best model on the basis of the minimum of the function may miss the most native-like model. This is why the function is smoothed over a small range of angles around its minimum. For three-helix proteins this interval of F is 10°; it is increased to 20° for four-helix folds to account for the increased complexity and to smooth local irregularities better. The appearance and number of false minima of the function (corresponding to a high r.m.s.) are closely related to the length of the loops: the shorter the loops, the fewer the false minima. Nevertheless, even with short loops, the SSE can still adopt different conformations, owing to a reversal in the direction of one loop (between helices 2 and 3) in the case of the four-helix bundle, and produce close values of the function (Figure 1).
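The enumeration of initial positions and the angular smoothing described above can be sketched as follows. This is only an illustration under stated assumptions: the names `initial_angle_grid` and `smoothed_score` are hypothetical, the scores are assumed to be stored in a dictionary keyed by tuples of helix angles, and the smoothing is taken to be a plain average over the retained grid neighbors within the window, which the text does not specify.

```python
import itertools

STEP = 10  # rotation step in degrees, as quoted in the text

def initial_angle_grid(n_helices, step=STEP):
    """Return an iterator over all angle combinations for n helices (36^n for a 10° step)."""
    angles = range(0, 360, step)
    return itertools.product(angles, repeat=n_helices)

def smoothed_score(scores, angles, window, step=STEP):
    """Average the score over grid neighbors within +/- window degrees per helix.

    scores: dict mapping a tuple of angles to a score; conformations rejected
            by the loop constraints are simply absent from the dict.
    window: half-width in degrees, assumed to be a multiple of step.
    Angles wrap around at 360 degrees.
    """
    offsets = range(-window, window + step, step)
    neighbours = []
    for delta in itertools.product(offsets, repeat=len(angles)):
        key = tuple((a + d) % 360 for a, d in zip(angles, delta))
        if key in scores:
            neighbours.append(scores[key])
    if not neighbours:
        raise ValueError("no retained conformation within the smoothing window")
    return sum(neighbours) / len(neighbours)
```

For a three-helix bundle, `initial_angle_grid(3)` runs over 36^3 = 46 656 combinations, the number quoted for 1enh before the loop constraints are applied.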
One notices that the percentage of native-like models (r.m.s. <4 Å) decreases with the complexity of the structure, taken as the number of residues and the number of SSE, because the number of degrees of freedom increases with the length of the sequence. The percentage of models whose r.m.s. is <4 Å decreases from 100% in the case of 1enh (three helices and 56 amino acids) to 7% in the case of 1lpe (four helices and 166 amino acids). The intermediate cases are interesting to analyze because 1gmf and 2mhr have comparable lengths, 118 and 119 residues, but 2mhr yields 90% of native-like models whereas 1gmf yields only 27%. One can assume that this is because better models are produced for the protein that has the larger number of amino acids involved in regular SSE: there are 20 more amino acids included in helices in 2mhr than in 1gmf, a relative increase of one-third.
Although the RUSSIA algorithm determines the minimum of the function, it does not always find the global minimum of the r.m.s., but merely a local minimum close to the global value. The average values of the r.m.s. among the selected structures belong to the range classified as good (3–5 Å). In the case of 1gmf, the global r.m.s. minimum among the generated structures was at 2.92 Å from the native structure. The local minimum reached by the algorithm on the basis of the function criterion had an r.m.s. of 3.17 Å, which is only 0.25 Å greater than the global value but nevertheless greater than 3 Å. This is why the percentage of structures with an r.m.s. <3 Å decreases from 0.51% for all generated structures to 0% for the selected structures of the 1gmf model. Moreover, as already mentioned, for apolipoprotein (1lpe) the influence of the rigid nature of the modeled helices is clearly visible on the long helix and consequently contributes to a rather high r.m.s. This difficulty, not found with β-sheets (see later), would probably be reduced by using two independent successive cylinders to model the longest helices, at the cost of a reasonable increase in computer time.
Sheet parameters
We extracted 30 sheets from the set of proteins given in Table II, with different locations of the sheets within the protein structures: buried in the core of the structure, solvent exposed on the surface or partially exposed. The 13 upper proteins are from the MIXED model, i.e. they concern strands involved in globular domains of the α/β class. The six lower proteins are from the β class and constitute the BETA model. Their sheets are composed of 3–5 strands. All these β-sheet structures were modeled for different values of the strand shifts b, and the value of the function was calculated for all resulting conformations of the β-sheets. The function was determined for each derived model, and the r.m.s. was derived from the superimposition of each β-sheet model on its counterpart in the native structure. Then, correlation coefficients between the function and the r.m.s. were calculated. They range from 0.34 to 0.98 for the 13 proteins of the α/β class, with a mean of 0.70. This value was used to scale the parameters used further in the sheets and must not be confused with the correlation coefficient of 0.95 between the function and the r.m.s. (see Part I) obtained when the full cores of the model and the actual X-ray structures were compared. We noticed that the function and the r.m.s. are correlated for structures with a high degree of solvent exposure. One reasonable assumption is that the positions of the HFC are more conserved when the sheet is located on the surface of the protein with one side facing the solvent. Several structures revealed a low value of this correlation coefficient owing to a phase shift between the function and the r.m.s. It appeared that the function is less correlated with the r.m.s. when calculated for either side of the β-sheet than for both sides of the β-sheet, as occurs for a solvent-exposed sheet.
We observed that the native structure of a β-sheet in a small globular protein corresponds to a low value of the function calculated only for the residues belonging to the β-sheet. We shall denote this sheet function by the subscript s, in order to distinguish it from the global protein function, denoted by the subscript p, which is calculated for all hydrophobic residues of the protein, i.e. including the helices. In other words, the sheet function concerns only the hydrophobic residues involved in β-sheets, and it is better correlated with the r.m.s. than the protein function. This might be because compactness is better realized with sheets than with helices, as hydrophobic residues face each other more directly in the former case. We then decided that for the MIXED model the goal function would be the sum of the protein function and the sheet function, thus enhancing the weight of the sheet function, as the hydrophobic residues involved in the sheets are included in both terms. This noticeably improved the correlation between the goal function and the r.m.s.
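The correlation analysis above can be illustrated with a short sketch. The text does not state which correlation coefficient was used; the Pearson coefficient is assumed here, and the function name `pearson` and its inputs (one function value and one r.m.s. value per sheet model) are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumption: xs holds the function values of the sheet models and ys the
    corresponding r.m.s. values; the paper does not name the coefficient used.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 would mean that low function values go together with low r.m.s. values, which is the behavior the selection procedure relies on.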
The worst correlation coefficient in the BETA class, 0.03, occurs for 1bec. One can argue that this is probably because it concerns a fairly large sheet of five strands. Such sheets are often in the interior of the protein with a large hydrophobic face, which can explain the failure of the algorithm. However, the five-stranded β-sheets of 1qi3 and 1svr in the MIXED class show good correlation coefficients, 0.9 and 0.83, respectively. Hence, for the moment, the limit of the method is reached with four strands, and the method performs better when the sheets are exposed to the solvent.
Mixed α/β structures
In the MIXED models corresponding to the α/β class, a β-sheet is advantageously considered as a single SSE. However, this model is more complex than the HELIX model, since strands can slide along the mean propagation direction of the backbone within each β-sheet, thus generating numerous initial conformations. Four proteins with different levels of complexity, containing both helices and sheets, are analyzed in detail in Table III: 1bbg, 1qi3 and 2igd are two-layer sandwiches and contain one helix and one β-sheet with three, five and four strands, respectively; 1aba is a three-layer sandwich, formed by three helices located on both sides of the central four-strand sheet (Figure 3). The single helix on one side of 1aba shows a large translation relative to the native structure. This might be because the space left at the lower right corner of the native structure by the absence of explicit loops in the RUSSIA procedure is filled by translating this α2 helix in order to decrease the distances between its hydrophobic residues and the facing sheet. The constraints based on the lengths of the loops surrounding this helix are not sufficient to produce a correct positioning. This feature is the main contributor to the large r.m.s. among the selected models: 3.69 Å. Owing to its geometry, the MIXED model is more stable than the HELIX model. In particular, it is less sensitive to the rotation of helices: for slightly twisted and long sheets, the compactness of the hydrophobic core is not perturbed by the rotation of helices. This is why the function is smoothed over a range of ±60° around its minimum in the case of the four proteins in Table III. Rotation of one sheet with respect to the second one in a β-sandwich does not much affect the compactness of the structure. This is not the case when a relative rotation is made between helices in a HELIX model. This increases the effect of loop closure constraints in the BETA model.
β-Class structures
The BETA model deals with the all-β class, and each β-sheet is considered as one SSE, as in the MIXED model. The BETA model explores the interactions between two β-sheets, which may contain up to 10 strands altogether (Figure 4). The number of generated structures in Table IV is larger than for both the HELIX and MIXED models, because of the numerous possible values of the shifts that one can introduce in each sheet. The generated structures were sorted according to the value of the function, and it was decided to retain for each modeled protein the 100 best structures, i.e. those with the lowest values of the function. It is important to note that the number of initial native-like structures (better than 3 Å r.m.s.) is not very high, 14% at most, but the mean r.m.s. among the selected structures remains at very reasonable values, 3.4 Å for the worst case (1hnf). This is a direct consequence of the fact that the rigid blocks considered are the sheets and not the strands themselves, yielding only two blocks per structure.
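The selection step described above (sort by function value, keep the 100 lowest, then report the mean r.m.s. and the fraction of native-like models) can be sketched as follows. The names `select_best` and `summarize` and the representation of a model as a `(score, rmsd)` pair are hypothetical choices made for this illustration only.

```python
def select_best(models, n_keep=100):
    """Keep the n_keep models with the lowest function value.

    models: list of (score, rmsd) pairs, where score is the value of the
    selection function and rmsd the deviation from the native structure.
    """
    return sorted(models, key=lambda m: m[0])[:n_keep]

def summarize(selected, threshold=3.0):
    """Return (mean r.m.s., fraction of models with r.m.s. below threshold)."""
    if not selected:
        raise ValueError("no models selected")
    rmsds = [r for _, r in selected]
    mean_rmsd = sum(rmsds) / len(rmsds)
    fraction = sum(1 for r in rmsds if r < threshold) / len(rmsds)
    return mean_rmsd, fraction
```

Note that the r.m.s. is used only for the summary, not for the selection, since the native structure is not available to the algorithm when it ranks the models.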
Discussion
Having analyzed the distributions of inter-residue closest distances in Part I, we noticed that in most cases they were far from Gaussian. The notion of closest distance refers to a given pair of residues belonging to different SSE in one protein. The distances are obtained as follows: for each protein in the PDB set, one determines the closest neighbor of each amino acid; an ensemble of closest distances is thus derived, which is sorted according to the chemical nature of the amino acids. Only some pair distance distributions (generally hydrophobic) were unimodal and bell-shaped, as shown in Figure 5 for the Leu–Ile pair. When the distributions were equivalent for the two members of a pair, the data were merged, as in Figure 5; otherwise they were not. In the latter case, the distributions were generally bimodal and not commutative, as reported in Figure 6 for the Ala–Gln distribution. This is probably a consequence of the relative sizes of the partners and of their respective locations (hydrophobic core or surface shell). The first mode, with the smaller inter-residue distance, in the Ala–Gln distribution in Figure 6 mainly corresponds to residues in protein cores. The second mode, with the larger distance, mainly represents interactions between residues at the periphery, whereas in the reverse pair the second mode prevails. The second mode is often less significant than the first one, as one can observe from Figure 7. This is a general trend that was found in numerous cases.
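The way closest distances are collected, and why the resulting distributions need not be commutative, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the name `closest_distances`, the representation of a residue as `(name, sse_id, position)`, and the use of a single representative point per residue (the text does not say which atom or centroid is used).

```python
import math
from collections import defaultdict

def closest_distances(residues):
    """Collect closest-neighbor distances keyed by ordered residue-type pair.

    residues: list of (name, sse_id, (x, y, z)) for one protein.
    For each residue i, the nearest residue j belonging to a different SSE is
    found and the distance is filed under the ordered pair (name_i, name_j).
    Because the nearest neighbor of j need not be i, the table for (A, B)
    can differ from the table for (B, A).
    """
    table = defaultdict(list)
    for name_i, sse_i, pos_i in residues:
        best = None
        for name_j, sse_j, pos_j in residues:
            if sse_j == sse_i:
                continue  # only pairs from different SSE are considered
            d = math.dist(pos_i, pos_j)
            if best is None or d < best[0]:
                best = (d, name_j)
        if best is not None:
            table[(name_i, best[1])].append(best[0])
    return table
```

Accumulating these tables over a set of proteins and histogramming each list would give one distance distribution per ordered pair of residue types.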
One of the main advantages of the RUSSIA algorithm is its speed. For most of the structures, generation of all possible conformations took a few hours on a Pentium 233 PC. Only in the case of absence of loop conditions (1gmf) was the program executed for about 100 h, generating many similar structures. The program, however, is not completely optimized and the computation time may still be considerably decreased in the future.
Improvements and perspectives
RUSSIA depends on prior knowledge of the position and nature (α or β) of the regular secondary structures and, more precisely, of the strong hydrophobic amino acids that they contain. The robustness of the procedure with regard to the predicted limits of the α or β secondary structures is therefore of crucial importance. As often occurs, only a limited number of hydrophobic amino acids lie at the extremities of α or β structures (Poupon and Mornon, 1999a, 2001), and the exact limits of the SSE may consequently be of only relative importance. Moreover, as strands within β-sheets are translated during the exploration of the conformational space, this feature becomes less significant for such structures. However, to check these considerations, we moved the SSE limits by one hydrophobic amino acid, thereby changing the number of hydrophobic residues taken into consideration, as shown in Figure 8. A representative member of each protein class (α, α/β, β) was tested. Although one can observe a certain deterioration of the results in terms of r.m.s., the algorithm still generated good structures when the SSE limits were shifted: they stayed within the 3–7.5 Å r.m.s. deviation range relative to the native structures, i.e. in the range considered as good according to the classification of Bonneau et al. (2001). This demonstrates the robustness of the algorithm as far as the limits of the SSE are concerned.
RUSSIA works with a central, point-like attractor, the overall geometric center C of the hydrophobic faces of the α-helices or β-sheets. This feature is appropriate for the prediction of small- to medium-sized globular domains. For larger ones, a broader definition of this geometric attractor is required, e.g. an elongated area, which may allow one to deal successfully with the largest domains, provided that the number of independent SSE is kept within acceptable limits of computer time. In this respect, the number of strands is of relatively little importance compared with the number of helices, because strands have few degrees of freedom within the sheets. For the moment, the upper limit for sheets is six strands. For large globular domains, one possible improvement might be to take into account, in addition to the central hydrophobic attractor C, hydrophilic attractors outside the protein in order to balance the working forces.
Somewhat surprisingly, helices described as rigid cylinders appear to be more difficult to handle than β-strands gathered into single helicoid surfaces, and they lead to clearly worse results for α or mixed structures than for all-β structures. One perspective may therefore be to leave more freedom to helices (non-rigid cylinders), although this will increase the number of parameters to be explored and consequently the computing time. However, the latter will no longer be a significant constraint.
Finally, it may be significant to note that, for three- or four-stranded sheets as considered in this paper, knowing the spatial order of the strands in advance is not absolutely crucial since, among the few theoretical possibilities to be explored, some are very unlikely (Znamenskiy et al., 2000). In the future, sheets of five or more β-strands might also be considered and processed.
Conclusion
We have proposed an algorithm able to assemble the core of a protein, knowing the location and nature of the SSE. The main advantages of this procedure are its simplicity and speed, which result from treating helices and sheets as rigid bodies and discarding loops. The driving force is the maximization of hydrophobic compactness. The algorithm generates compact 3D structures for small and regular globular proteins. It can be considered as a step in the building of protein cores, as it models the relative topology of the SSE reasonably well. The resulting structures, compared with the native ones, have r.m.s. values of the order of 3 Å, and smaller r.m.s. values, of the order of 2 Å, are also frequently found. Generally, structures with r.m.s. deviations from the native structure within 3–7.5 Å are considered as good (Bonneau et al., 2001) and structures with r.m.s. deviations <4 Å are considered as native-like (Simmons et al., 1999). Hence the average r.m.s. values for each set of structures generated by the RUSSIA algorithm, compared with the modeled protein, are good, and the best structures in the sets selected by the function are native-like. Hence we can conclude that the inter-residue distance matrix provides a good approximation of the set of all existing distances between the residues in the PDB.
A comparison function was proposed to sort out the generated structures, without any complementary knowledge or assumption about the native structure. To evaluate further the conformations produced by RUSSIA, one has to use alternative approaches, such as measuring distances between topologically conserved residues. Otherwise, introducing into the procedure a proper treatment of loops could improve the algorithm. The present algorithm can be extended to the prediction of larger proteins given a sufficient number of loop constraints, provided that they fold in one single compact domain. Transferring this present procedure to non-compact domains (i.e. for which two or more local hydrophobic barycenters can be defined) has not yet been considered.
Acknowledgements
References
Cohen,F.E. and Sternberg,M.J.E. (1980) J. Mol. Biol., 138, 321–333.
Guerois,R., Nielsen,J. and Serrano,L. (2002) J. Mol. Biol., 320, 369–387.
Kloczkowski,A. and Jernigan,R.L. (2002) J. Biomol. Struct. Dyn., 20, 323–325.
Laskowski,R. (2001) Nucleic Acids Res., 29, 221–222.
Lee,J., Liwo,A., Ripoll,D.R., Pillardy,J. and Scheraga,H.A. (1999) Proteins, Suppl., 3, 204–208.
Poupon,A. and Mornon,J.P. (1998) Proteins, 33, 329–342.
Poupon,A. and Mornon,J.P. (1999a) FEBS Lett., 452, 283–289.
Poupon,A. and Mornon,J.P. (1999b) Theor. Chem. Acc., 101, 2–8.
Poupon,A. and Mornon,J.P. (2001) Theor. Chem. Acc., 106, 113–120.
Reddy,B., Li,W., Shindyalov,I. and Bourne,P. (2001) Proteins, 42, 148–163.
Reva,B.A., Finkelstein,A.V. and Skolnick,J. (1998) Fold. Des., 3, 141–147.
Rykunov,D.S., Lobanov,M.Y. and Finkelstein,A.V. (2000) Proteins, 40, 494–501.
Samudrala,R., Xia,Y., Huang,E. and Levitt,M. (1999) Proteins, Suppl., 3, 194–198.
Simmons,K.T., Ruczinski,I., Kooperberg,C., Fox,B.A. and Baker,D. (1999) Proteins, 34, 82–95.
Sippl,M.J. and Weitckus,S. (1992) Proteins, 13, 258–271.
Srinivasan,R. and Rose,G.D. (1995) Proteins, 22, 81–89.
Znamenskiy,D., Le Tuan,K., Poupon,A., Chomilier,J. and Mornon,J.-P. (2000) Protein Eng., 6, 407–412.
Received July 25, 2003; revised October 25, 2003; accepted October 30, 2003