A new protein folding algorithm based on hydrophobic compactness: Rigid Unconnected Secondary Structure Iterative Assembly (RUSSIA). II: Applications

Denis Znamenskiy, Khan Le Tuan, Jean-Paul Mornon¹ and Jacques Chomilier¹

Systèmes Moléculaires et Biologie Structurale, LMCP, Universités Paris 6 et Paris 7, CNRS UMR 7590, Case 115, 75252 Paris cedex 05, France

¹ To whom correspondence should be addressed. E-mail: jean-paul.mornon@lmcp.jussieu.fr or jacques.chomilier@lmcp.jussieu.fr


    Abstract
 
The RUSSIA procedure (Rigid Unconnected Secondary Structure Iterative Assembly) produces structural models of cores of small- and medium-sized proteins. Loops are omitted from this treatment and regular secondary structures are reduced to points, the centers of their hydrophobic faces. This methodology relies on the maximum compactness of the hydrophobic residues, as described in detail in Part I. Starting data are the sequence and the predicted limits and natures of regular secondary structures (α or β). Helices are treated as rigid cylinders, whereas β-strands are collectively taken into account within β-sheets modeled by helicoid surfaces. Strands are allowed to shift along their mean axis to allow some flexibility and the α-helices can be placed on either side of β-sheets. Numerous initial conformations are produced by discrete rotations of the helices and sheets around the direction going from the center of their hydrophobic face to the global center of the protein. Proposed models are selected by minimizing the distances separating hydrophobic residues that belong to different regular secondary structures. The procedure is rapid and appears to be robust relative to the quality of starting data (nature and length of regular secondary structures). The dependence of model quality on secondary structure prediction, and in particular on the β-sheet topology, is nevertheless one of the limits of the present algorithm. We present here some results for a set of 12 proteins (α, β and α/β classes) of lengths 40–166 amino acids. The r.m.s. deviations for core models with respect to the native proteins are in the range 1.4–3.7 Å.

Keywords: homology modeling/hydrophobic core/minimal surface/protein folding/secondary structure


    Introduction
 
In Part I (preceding paper in this issue), we described a new algorithm tailored to assemble rigid models of α and β secondary structures, whose nature and approximate length were assumed to be known from previous sequence analysis. Helices are modeled by cylinders with no internal degree of freedom and β-sheets by adjustable helicoid surfaces in which strands are allowed to shift along the mean direction of the helicoid axis. The driving force is the optimal compactness of hydrophobic amino acids, subject to constraints such as (i) repulsion of amino acids below a certain distance limit and (ii) a maximum distance between successive secondary structure extremities, so that the connecting loop, whose sequence length is known, can be accommodated. Each helix or sheet is reduced to a single point, its hydrophobic face center (HFC), defined as the geometric center of its hydrophobic residues. The various HFCs are iteratively displaced towards the geometric center of all HFCs (simulating the center of the protein), thus leading their secondary structure elements (SSE) towards a compact assembly. If a constraint is violated during this displacement, either through contacts between residues or through SSE termini too far apart to be connected by a loop, rotations of α-helices around the direction of displacement are allowed. This procedure is applied in this paper to a representative set of small- to medium-sized globular domains (up to 166 amino acids) of the α, β and α/β classes of known structures, with the topology of the strands inside the sheets therefore taken from the actual structures.


    Materials and methods
 
The basis of the RUSSIA (Rigid Unconnected Secondary Structure Iterative Assembly) procedure was discussed in detail in Part I and is only briefly summarized here. The algorithm is currently limited to small globular proteins, i.e. to domains where the hydrophobic core results in a unique gravity center and which are built from three to four secondary and super-secondary structure elements, helices and/or β-sheets (up to five strands each), for the all-α (HELIX), all-β (BETA) or α/β (MIXED) classes. All SSE are currently considered as rigid blocks, and the loops linking them are discarded as explicit elements in this model. Geometric centers of the hydrophobic faces of each SSE are determined. The starting conformation is such that no residues from different SSE are in contact with each other. Further, within one β-sheet, each strand can be shifted relative to the others by one or several Cα steps, producing various initial conditions. To reach a compact structure, the SSE are moved in steps towards the geometric center C of their hydrophobic face centers. The geometric aggregation of the hydrophobic core of globular proteins is obtained by minimizing a goal function F during the simulation, provided that two sets of constraints are respected, one on loops and the other on steric clashes of SSE. The function F was introduced to define the halt condition for the iterations of the algorithm. It is defined at each step as the largest distance between the gravity centers of the hydrophobic faces of the SSE and the global geometric center of the model (see, for instance, Figure 7 in Part I). In other words, all blocks are moved one step in the direction of the center of the nucleus under construction. A new gravity center of all HFCs is then calculated and a new displacement is applied in the direction of the protein center. Displacements stop when an inter-residue distance constraint starts to apply, i.e. when the distance between two amino acids drops below a certain threshold. Another constraint applies if the distance to be spanned by the loop connecting two SSE becomes too long relative to its sequence length. At this stage of the procedure, small rotations (typically 1°) are performed for each block in order to suppress contacts between blocks without altering the compactness of the whole structure. The process stops when F no longer decreases.
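The displacement loop summarized above can be illustrated by a minimal sketch. It is not the RUSSIA implementation: each SSE is reduced to its HFC only, the residue-pair distance limits are replaced by a single hypothetical `min_separation` between HFCs, and the loop-length constraint and the small block rotations are left out. The function names and the step size are assumptions made for the illustration.

```python
import math

def centroid(points):
    """Geometric center of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def goal_function(hfcs):
    """F: largest distance between any HFC and the center C of all HFCs."""
    c = centroid(hfcs)
    return max(math.dist(p, c) for p in hfcs)

def compact(hfcs, step=0.5, min_separation=6.0, max_iter=1000):
    """Move every HFC one step toward C until a constraint applies or F stalls.

    min_separation is a hypothetical stand-in for the residue-pair distance
    limits; loop-length constraints and block rotations are omitted.
    """
    hfcs = [tuple(p) for p in hfcs]
    f = goal_function(hfcs)
    for _ in range(max_iter):
        c = centroid(hfcs)  # recomputed after every accepted displacement
        moved = []
        for p in hfcs:
            d = math.dist(p, c)
            t = min(step, d) / d if d > 0 else 0.0
            moved.append(tuple(p[i] + t * (c[i] - p[i]) for i in range(3)))
        too_close = any(
            math.dist(a, b) < min_separation
            for i, a in enumerate(moved) for b in moved[i + 1:]
        )
        f_new = goal_function(moved)
        if too_close or f_new >= f:
            break  # keep the last configuration that respected the constraint
        hfcs, f = moved, f_new
    return hfcs, f
```

Calling `compact([(10, 0, 0), (-10, 0, 0), (0, 10, 0)])` returns the last accepted HFC positions together with the final value of F; the step that would have violated the separation limit is discarded.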

In order to validate the prediction, one must, at this stage, compare modeled structures with X-ray structures. This is done by means of the root mean square deviation (r.m.s.) on Cα between the generated structures and the native structures (see Equation 11 in Part I). Owing to approximations of the model, the minima of the goal function F seldom coincide exactly with those of the r.m.s., i.e. with the actual structure. Hence a new function Φ has been defined that accounts for an average distance separating all hydrophobic amino acids belonging to the different SSE considered (Equations 9 and 10 in Part I). The idea underlying this function is to produce a structure where hydrophobic residues are as compact as possible. We try in this paper to relate Φ to r.m.s. on a small set of known structures of various folds in order to validate the algorithm. It appeared that smoothing Φ over a small range of angular positions of the SSE (typically 10 or 20°), in order to even out singular points, allows one to find a structure ‘close’ to the native structure in the neighborhood of the minimum of this Φ function.
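As an illustration of the two quantities compared here, the sketch below computes a plain mean distance over hydrophobic-residue pairs taken from different SSE, and an r.m.s. deviation over paired Cα coordinates. The exact definitions are Equations 9–11 of Part I, which are not reproduced in this paper, so the unweighted mean and the assumption that the two structures are already superimposed are simplifications of this sketch; the function names are illustrative.

```python
import math
from itertools import combinations

def phi(sse_hydrophobic_coords):
    """Mean distance over hydrophobic-residue pairs from different SSE.

    sse_hydrophobic_coords: one list of (x, y, z) positions per SSE.
    Assumption: a simple unweighted mean; Part I may define Phi differently.
    """
    total, count = 0.0, 0
    for sse_a, sse_b in combinations(sse_hydrophobic_coords, 2):
        for a in sse_a:
            for b in sse_b:
                total += math.dist(a, b)
                count += 1
    return total / count

def rmsd(model, native):
    """R.m.s. deviation between paired Calpha coordinates.

    Assumption: the two coordinate sets are already optimally superimposed.
    """
    if len(model) != len(native):
        raise ValueError("model and native must contain the same number of atoms")
    squared = sum(math.dist(m, n) ** 2 for m, n in zip(model, native))
    return math.sqrt(squared / len(model))
```

Pairs within the same SSE are deliberately excluded, since only distances between different secondary structures measure how compactly the blocks are assembled.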


    Results
 
All-α-helix structures

We illustrate the HELIX model procedure with a set of four proteins (PDB codes 1enh, 2mhr, 1lpe, 1gmf) consisting of three- and four-α-helix bundles listed in Table I, which reports the total number of residues, the number of amino acids involved in the SSE (loops are not included in this model) and the total number of models generated by RUSSIA. These models correspond to different initial positions of the SSE obtained by discrete rotations around their respective HFCs with a step of 10°. In the case of an n-helix bundle model, this procedure yields a set of 36^n initial positions for the SSE. Owing to the loop constraints, many conformations are rejected. For instance, from the 46 656 (36^3) initial conformations of 1enh, only 883 structures are kept for further processing, as shown in Table I. The quality of the model is estimated on the following basis: each model is compared with the native structure from the PDB, and the percentages of models whose r.m.s. values are better than 3 and 4 Å are given in Table I, in addition to the value of the best r.m.s. In all four cases, the minimal r.m.s. is better than 3 Å, but the corresponding model is not necessarily detected by the algorithm among the selected structures. There is no correlation between the quality of the model with respect to the crystallographic structure, as measured by the best r.m.s., and the number of residues of which the protein is composed. Indeed, the difference between the models and the actual proteins appears to be at least partly dependent upon the curvature of helices. In particular, long helices are often curved and/or kinked, and in such cases the rigid ideal cylinders used here to describe helices lead to severely increased r.m.s. (see, for instance, 1lpe in Figure 2). For all proteins of Table I, we compared the Φ function with the r.m.s., i.e. prediction was compared with validation. In most cases, the smallest r.m.s. values are not spread all over the conformation space represented by all initial angles of the helices (see, for instance, Figure 14 in Part I) but are mainly located in a single region. In contrast, as the RUSSIA procedure does not take loops into account explicitly, the function Φ may have several regions (often two) containing small values and, as only one of them coincides with the region of small r.m.s., selecting the best model on the basis of the minimum of Φ may miss the most native-like model. This is why Φ is smoothed over a small range of angles around its minimum. For three-helix proteins the smoothing interval for Φ is 10°; it is increased to 20° for four-helix folds to account for the increased complexity and to smooth local irregularities better. The appearance and the number of the ‘false’ minima of Φ (corresponding to a high r.m.s.) are closely related to the length of the loops: the shorter the loops, the fewer the ‘false’ minima. Nevertheless, even in the case of short loops, the SSE can still adopt different conformations, owing to a reversal in the direction of one loop (between helices 2 and 3) in the case of the four-helix bundle, and produce close Φ values (Figure 1).
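The size of the initial conformation set and the smoothing of Φ can be sketched as follows. The enumeration reproduces the count quoted for a three-helix bundle (36 orientations per helix at a 10° step, 36^3 = 46 656 combinations). The smoothing is shown as a moving average over a one-dimensional, periodic angular profile; how the averaging window is laid over the multi-dimensional angle space in RUSSIA is not specified here, so this choice and the function names are assumptions of the sketch.

```python
from itertools import product

def initial_orientations(n_helices, step_deg=10):
    """All combinations of discrete rotation angles, one angle per helix."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_helices)

def smooth_circular(values, half_width):
    """Moving average on a periodic angular grid.

    values: Phi sampled at equally spaced angles covering 360 degrees.
    half_width: number of grid points averaged on each side (an assumption;
    the text only gives the smoothing range in degrees).
    """
    n = len(values)
    window = 2 * half_width + 1
    return [
        sum(values[(i + k) % n] for k in range(-half_width, half_width + 1)) / window
        for i in range(n)
    ]
```

With a 10° grid, `half_width=1` averages each point with its two nearest neighbors, which damps isolated sharp minima while leaving broad basins largely unchanged.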


 
Table I. HELIX model
 


 
Fig. 2. HELIX model: RUSSIA models and native 3D structures. Comparison of the ‘best’ 3D conformation among those selected and the native structure. The four structures in Table I, 1enh, 2mhr, 1gmf and 1lpe, are presented. Lengths and natures of SSE used to calculate the models are depicted above the sequence. The length of loops is indicated in parentheses.

 


 
Fig. 1. Two possible configurations for a four-helix bundle. These two configurations are difficult to distinguish using only the average inter-hydrophobic distances as a criterion of comparison. Both give rise to close low values of the function Φ (average distance between Cα of hydrophobic amino acids).

 
To evaluate the robustness of the algorithm to widely different initial conditions and, consequently, its power of convergence, the resulting models were compared with the X-ray structures and r.m.s. values were calculated. The number of structures selected on the basis of the minimum of the Φ function is given in Table I, together with the percentages of them with an r.m.s. from the native structure below 3 and 4 Å. For the selected models, minimal, maximal and average r.m.s. are given. It is found that the selected structures do not include the minimal r.m.s. among all generated structures, but are fairly close to it. The minimum value of r.m.s. is better than 3.2 Å for all four models. If one admits that the final result is chosen at random among the set of selected structures, one can expect a mean deviation from the actual structure which remains <5 Å for all four examples, which falls in the range of medium-resolution models. Reva et al. stated that one should consider as fairly successful an r.m.s. deviation better than 6 Å, as the ‘probability of obtaining such a model by chance is so remote’ (Reva et al., 1998). Further, it was established that the r.m.s. of a random prediction of a compact structure is of the order of 15 Å for a 100-residue protein (Cohen and Sternberg, 1980). In the case of 1gmf, although the maximum r.m.s. is fairly high at 8.02 Å, the mean value among the 113 selected models is reasonable at 4.7 Å and the model can be compared with the native structure (Figure 2).

One notices that the percentage of native-like models (r.m.s. <4 Å) decreases with the complexity of the structure, taken as the number of residues and the number of SSE, because the number of degrees of freedom increases with long sequences. The percentage of models whose r.m.s. is <4 Å decreases from 100% in the case of 1enh (three helices and 56 amino acids) to 7% in the case of 1lpe (four helices and 166 amino acids). The intermediate cases are interesting to analyze because both 1gmf and 2mhr have comparable lengths, 118 and 119 residues, but 2mhr produces 90% of native-like models whereas the 1gmf output is only 27%. One can assume that the better models are produced for the protein which has the larger number of amino acids involved in regular SSE. There are 20 more amino acids included in helices in 2mhr than in 1gmf, which is a relative increase of one-third.

Although the RUSSIA algorithm determines the minimum of Φ, it does not always find the global minimum of r.m.s., but merely a local minimum close to the global value. The average values of the r.m.s. among the selected structures belong to the range classified as ‘good’ (3–5 Å). In the case of 1gmf, the global r.m.s. minimum among the generated structures was at 2.92 Å from the native structure. A local minimum reached by the algorithm on the Φ criterion had an r.m.s. of 3.17 Å, which is only 0.25 Å greater than the global value, but nevertheless greater than 3 Å. This is why the percentage of structures with an r.m.s. <3 Å decreases from 0.51% for all generated structures to 0% for the selected structures of the 1gmf model. As already noted, for apolipoprotein (1lpe) the rigid nature of the modeled helices visibly affects the long helix and consequently contributes to a rather high r.m.s. This difficulty, not found with β-sheets (see later), would probably be reduced by using two independent successive cylinders to model the longest helices, at the cost of a reasonable increase in computer time.

Sheet parameters

We extracted 30 sheets from a set of proteins given in Table II, with different locations of the sheets within protein structures: buried in the core of the structure, solvent exposed on the surface or partially exposed. The 13 upper proteins are from the MIXED model, i.e. concern strands involved in globular domains of the α/β class. The six lower proteins are from the β class and constitute the BETA model. Their sheets are composed of 3–5 strands. All these β-sheet structures were modeled for different values of strand shifts b and the value of the Φ function was calculated for all resulting conformations of β-sheets. Φ was determined from each derived model and the r.m.s. was derived from the superimposition of each β-sheet model on its counterpart in the native structure. Then, correlation coefficients between Φ and r.m.s. were computed. They range from 0.34 to 0.98 for the 13 proteins of the α/β class, with a mean of 0.70. This value was used to scale the parameters used further in the sheets and must not be confused with the correlation coefficient of 0.95 between Φ and r.m.s. (see Part I), obtained when the full cores of the model and the actual X-ray structures were compared. We noticed that Φ and r.m.s. are correlated for structures with a high degree of solvent exposure. One reasonable assumption is that positions of the HFC are more conserved when the sheet is located on the surface of the protein with one side facing the solvent. Several structures revealed a low value of this correlation coefficient owing to a phase shift between Φ and r.m.s. It appeared that Φ is less correlated to the r.m.s. when calculated for either side of the β-sheet than for both sides of the β-sheet, as occurs for a solvent-exposed sheet. We observed that the native structure of a β-sheet in a small globular protein corresponds to a low value of Φ (calculated only for residues belonging to the β-sheet), which we shall call Φs (Φsheet), in order to distinguish it from the global protein Φp (Φprotein) calculated for all hydrophobic residues of the protein, i.e. including the helices. In other words, Φs concerns only the hydrophobic residues involved in β-sheets and it is better correlated to the r.m.s. than Φp. This might be because compactness is better realized with sheets than with helices, as hydrophobic residues face each other more directly in the first case. We then decided that for the MIXED model the goal function would be the sum, Φ = Φp + Φs, thus enhancing the weight of Φs, as the hydrophobic residues involved in the sheets are included in both Φs and Φp. This noticeably improved the correlation between Φ and r.m.s.
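The correlation coefficients quoted for Table II relate, for one sheet, the Φ values of all its modeled conformations to the corresponding r.m.s. values. Assuming that the usual Pearson coefficient is meant (the text does not name the estimator), a minimal sketch is given below; `pearson` is an illustrative name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences.

    Here xs would hold the Phi values of the sheet conformations and ys the
    matching r.m.s. values (assumed usage).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

A coefficient close to 1 means that lowering Φ reliably lowers the r.m.s.; a phase shift between the two profiles, as mentioned above, lowers the coefficient even when both curves have similar shapes.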


 
Table II. Parameters for the sheets
 
For the BETA model, there is more than one sheet in the structure, contrary to the MIXED model. The correlation between Φ and r.m.s. was computed in this class of proteins for each individual sheet and it ranges from 0.03 to 0.88, with a mean of 0.63 (Table II). In the case of 1hnf, composed of two domains of ~100 amino acids each of β-sandwich architecture (Laskowski, 2001), there is no interaction between the two domains. The second domain is composed of one sheet of three strands facing a second sheet of four strands, while one of the sheets of the first domain is composed of six strands and therefore has not been considered in this study because it falls outside the limits of the present algorithm. Both three-strand sheets of 1hnf give a satisfactory correlation coefficient (0.69 and 0.88), whereas the larger sheet shows essentially no correlation, with a coefficient of 0.25. This sheet is bent at one end in the actual structure, hence in the model the two long strands have been shortened in comparison with the assignment, in order to avoid the bending. Some of the omitted residues may really contribute to the HFC, but as the algorithm does not produce a bend, we decided to skip these residues. As β-sheets generally have a larger hydrophobic face than helices, strand shifts yield only a small change in the hydrophobic face geometry. If both sheets of a β-sandwich have a small degree of twist, rotation of one of the sheets around the axis passing through the HFCs of both sheets does not significantly change the distance between the hydrophobic residues of the β-sheets. Therefore, the Φ function is less sensitive to deformations in the BETA model than in the MIXED and HELIX models. Although correlation coefficients between Φ and r.m.s. are lower for the BETA model than for the others, the absolute values of the r.m.s. are the best (typically <2.0 Å, as will be shown later) owing to the small number of degrees of freedom for the blocks in the case of the BETA model.

The worst correlation coefficient in the BETA class, 0.03, occurs for 1bec. One can argue that this is probably because it concerns a fairly ‘large’ sheet of five strands. Such sheets are often in the interior of the protein with a large hydrophobic face, which can explain the failure of the algorithm. However, the five-stranded β-sheets of 1qi3 and 1svr in the MIXED class show good correlation coefficients, 0.9 and 0.83, respectively. Hence for the moment the limit of the method is reached with four strands, and the method performs better as long as sheets are exposed to the solvent.

Mixed α/β structures

In the MIXED models corresponding to the α/β class, a β-sheet is advantageously considered as a single SSE. However, the MIXED model is more complex than the HELIX model, since strands can slide along the mean backbone propagation direction within each β-sheet, thus generating numerous initial conformations. Four proteins with different levels of complexity, containing both helices and sheets, are analyzed in detail in Table III: 1bbg, 1qi3 and 2igd are two-layer sandwiches and contain one helix and one β-sheet with three, five and four strands, respectively; 1aba is a three-layer sandwich, formed by three helices located on both sides of the central four-strand sheet (Figure 3). The single helix on one side of 1aba shows a large translation relative to the native structure. This might be because the space left at the right lower corner of the native structure by the absence of explicit loops in the RUSSIA procedure is filled by translating this α2 helix in order to decrease the distances between its hydrophobic residues and the facing sheet. The constraints based upon the length of the loops surrounding this helix are not sufficient to produce a correct positioning. This feature is the main contributor to the large r.m.s. among the selected models: 3.69 Å. Owing to its geometry, the MIXED model is more stable than the HELIX model. In particular, it is less sensitive to the rotation of helices and, for slightly twisted and long sheets, compactness of the hydrophobic core is not perturbed by the rotation of helices. This is why Φ is smoothed over a range of ±60° around its minimum in the case of the four proteins in Table III. Rotation of one sheet with respect to the second one in a β-sandwich does not much affect the compactness of the structure. This is not the case when a relative rotation is made between helices in a HELIX model. This increases the effect of loop closure constraints in the BETA model.


 
Table III. MIXED model
 


 
Fig. 3. MIXED model: RUSSIA models and native 3D structures. Comparison of the ‘best’ 3D conformations among those selected and the native structures. The four structures in Table III, 1bbg, 1qi3, 2igd and 1aba, are presented. Details as in Figure 2.

 
The nature of interactions between an α-helix and a β-sheet is different from that between α-helices. In the MIXED model, false minima are less frequent among the models with a low value of Φ. The generated structures were sorted according to the level of the Φ function and it was found that the 100 expected ‘best’ structures corresponding to the lowest values of Φ represent a sufficient set of native-like structures.

β-Class structures

The BETA model deals with the all-β class and each β-sheet is considered as one SSE, as in the MIXED model. The BETA model explores the interactions between two β-sheets and it may contain up to 10 strands altogether (Figure 4). The number of generated structures in Table IV is increased, compared with both HELIX and MIXED models, because of the numerous possible values of shifts that one can introduce in each sheet. The generated structures were sorted according to the level of Φ and it was decided to retain the 100 ‘best’ structures corresponding to the lowest values of Φ for each modeled protein structure. It is important to note that the number of initial native-like structures (at better than 3 Å r.m.s.) is not very high, 14% at most, but the mean r.m.s. among the selected structures remains at very reasonable values, 3.4 Å for the worst case (1hnf). This is a direct consequence of the fact that the rigid blocks considered are the sheets and not the strands themselves, yielding only two blocks per structure.
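The selection step described for the MIXED and BETA models, keeping the 100 generated structures with the lowest Φ and then reporting r.m.s. statistics over that subset, can be sketched as below. Representing each model as a `(phi, rmsd)` pair and the names `select_lowest_phi` and `summarize` are assumptions of this illustration.

```python
import heapq
from statistics import mean

def select_lowest_phi(models, k=100):
    """Return the k models with the lowest Phi, in increasing order of Phi.

    models: iterable of (phi, rmsd) pairs (hypothetical representation).
    """
    return heapq.nsmallest(k, models, key=lambda m: m[0])

def summarize(selected, threshold=3.0):
    """Minimal, maximal and mean r.m.s. of the selected models, plus the
    fraction of them below the given r.m.s. threshold (in angstroms)."""
    values = [r for _, r in selected]
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "fraction_below": sum(r < threshold for r in values) / len(values),
    }
```

Note that the r.m.s. is used only to evaluate the selection afterwards; the selection itself relies on Φ alone, since the native structure would be unknown in a real prediction.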



 
Fig. 4. BETA model: RUSSIA models and native 3D structures. Comparison of the ‘best’ 3D conformations among those selected and the native structures. Four structures in Table IV, 1hnf, 1hla, 7fab and 1shs, are presented. Details as in Figure 2.

 

 
Table IV. BETA model
 
The accuracy of the potential of the present model does not allow a reliable selection of the single best structure but rather yields a fairly large set of good ones, close to the native structures, typically around 3 Å r.m.s. deviation. A more thorough sorting of the resulting structures could, however, be achieved by extending the model to an all-atom one (Lee et al., 1999; Samudrala et al., 1999). An alternative way of selecting models would be the use of threading algorithms (Sippl and Weitckus, 1992; Rykunov et al., 2000) or contact energies (Srinivasan and Rose, 1995), which are suited to extracting proper folds.


    Discussion
 
Inter-residue distance distribution

Having analyzed the distributions of inter-residue closest distances in Part I, we noticed that in most cases they were far from Gaussian. The notion of closest distance refers to a given pair of residues belonging to different SSE in one protein. The distances are obtained as follows: for each protein in the PDB set, one determines the closest neighbor of each amino acid; an ensemble of closest distances is thereby derived, which is sorted according to the chemical nature of the amino acids. Only some pair distance distributions (generally hydrophobic) were single mode and bell-like, as shown in Figure 5 for the example of the Leu→Ile pair. When distributions were equivalent between the two members of a pair, the data were merged as in the case of Figure 5, otherwise they were not. In this latter case, the distributions were generally bimodal and not commutative, as reported in Figure 6 for the Ala→Gln distribution. This is probably the consequence of the relative size of the partners and of their respective locations (hydrophobic core or surface shell). The ‘first’ mode with the smaller inter-residue distance in the Ala→Gln distribution in Figure 6 mainly corresponds to residues in protein cores. The ‘second’ mode with the larger distance mainly represents the interactions between residues at the periphery, whereas in the reverse pair, the ‘second’ mode prevails. The ‘second’ mode is often less significant than the ‘first’ one, as one can observe from Figure 7. This is a general trend that was found in numerous cases.



 
Fig. 5. Monomodal distribution of inter-residue distances. The distributions of closest distances between Cα atoms of leucine to isoleucine and isoleucine to leucine have no significant difference between them according to the χ² test, so they were merged into one set, which presents one mode and which is Gaussian-like.

 


 
Fig. 6. Bimodal distribution of inter-residue distances. The distribution of closest distances between Cα atoms of alanine to glutamine contains two modes, corresponding to distances in the core for the smallest and in the periphery of proteins for the largest one.

 


 
Fig. 7. Distributions of closest distances between alanine and isoleucine Cα atoms. (a) Distances were measured from alanine to isoleucine. The largest mode corresponds to long distances from residues belonging to less compact regions to residues within protein cores. (b) Distances were measured from isoleucine to alanine. The relative importance of the longest mode has strongly decreased.

 
In order to simulate the repulsion needed to avoid steric clashes when performing SSE clustering, a non-symmetric matrix containing the values of the largest modes for all residue pairs was generated (Table V). The monomodal distributions are mostly situated in the upper left quadrant of the table, corresponding to hydrophobic pairs with modes mostly situated between 6 and 7 Å. In the case of bimodal distributions, the modes were often situated in the 4–6 and 7–9 Å intervals. To be usable, the matrix of distance distribution modes must be symmetrical, because it is impossible to choose between two distances for the same pair of residues without a priori knowledge of their 3D positions in the protein. In the case of pairs of both hydrophobic or both hydrophilic residues, their distributions were merged. The hydrophobic versus hydrophilic pair distributions represent distances found in the core of a protein and therefore the upper right quadrant was retained and only the ‘first’ modes of the distributions were considered. Nevertheless, as some of these pair distributions contained <30 elements, they had to be merged with those of the hydrophilic versus hydrophobic pairs to obtain a greater sample size. Table VI shows the symmetrical matrix of the ‘representative’ core inter-residue distances used as limits below which two SSE were not allowed to interpenetrate further.
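How a symmetric matrix of limiting distances can act as a repulsion test between two SSE is sketched below. The data layout (a dictionary keyed by unordered residue-type pairs), the function names and the numerical limits used in the usage note are all hypothetical and are not taken from Table VI.

```python
import math

def min_allowed(limits, res_a, res_b):
    """Limit for a residue-type pair; the frozenset key makes it symmetric."""
    return limits[frozenset((res_a, res_b))]

def sse_clash(sse_a, sse_b, limits):
    """True if any Calpha pair from the two SSE is closer than its limit.

    Each SSE is a list of (residue_type, (x, y, z)) entries.
    """
    return any(
        math.dist(xyz_a, xyz_b) < min_allowed(limits, res_a, res_b)
        for res_a, xyz_a in sse_a
        for res_b, xyz_b in sse_b
    )
```

For example, with the made-up limits `{frozenset(("LEU", "ILE")): 6.5}`, a leucine at the origin and an isoleucine 5 Å away would be flagged as a clash, whereas the same pair 7 Å apart would not. Keying the table by an unordered pair guarantees that the lookup for (Leu, Ile) and (Ile, Leu) returns the same value, which is the property required of the symmetrized matrix.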


 
Table V. Non-symmetric matrix containing the largest modes of residue pair distance distributions
 

 
Table VI. Sample sizes of distance distributions for the hydrophobic versus hydrophilic residue pairs
 
Computing

One of the main advantages of the RUSSIA algorithm is its speed. For most of the structures, generation of all possible conformations took a few hours on a Pentium 233 PC. Only in the case of absence of loop conditions (1gmf) was the program executed for about 100 h, generating many similar structures. The program, however, is not completely optimized and the computation time may still be considerably decreased in the future.

Improvements and perspectives

RUSSIA depends on previous knowledge of the position and nature (α or β) of the regular secondary structures and, more precisely, of the strong hydrophobic amino acids that they contain. The robustness of the procedure with regard to the predicted limits of α or β secondary structure is therefore of crucial importance. As often occurs, only a limited number of hydrophobic amino acids lie at the extremities of α or β elements (Poupon and Mornon, 1999a, 2001), and the exact limits of the SSE may consequently be of only relative importance. Moreover, as strands within β-sheets are translated during the exploration of the conformational space, this feature becomes less significant for such structures. However, to check these considerations, we shifted the SSE limits by one hydrophobic amino acid, and consequently changed the number of hydrophobic residues taken into consideration, as shown in Figure 8. A representative member of each protein class (α, α/β, β) was tested. Although one can observe a certain deterioration of results in terms of r.m.s., the algorithm generated ‘good’ resulting structures when the SSE limits were shifted: they stayed within 3–7.5 Å of the r.m.s. deviation limit, compared with the native structures, i.e. in the range considered as ‘good’ when one follows the qualification of Bonneau et al. (2001). This demonstrates the robustness of the algorithm so far as limits of SSE are concerned.



 
Fig. 8. Effects of modification of the SSE limits. Three proteins (1enh, 2igd and 1shs) from different classes (α, α/β, β) were chosen to test the robustness of the algorithm by modifying the SSE limits. The structures with alternative SSE limits are noted as 1enh', 2igd' and 1shs'. Hydrophobic residues are marked in the sequence by white letters with a black background. SSE limits are marked by horizontal lines. The SSE nature is indicated by an α or β symbol. Two helices are displaced in 1enh, one helix and two strands in 2igd and five strands in 1shs.
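The limit-shifting test can be illustrated with a minimal sketch. It is not the code used for Figure 8: the function names are hypothetical, the set of ‘strong hydrophobic’ residues (V, I, L, F, M, Y, W) is an assumption, and the toy sequence is invented for illustration. Each function moves one SSE limit so that exactly one hydrophobic residue is gained or lost.

```python
# Assumption: the strong hydrophobic residues are taken to be V, I, L, F, M, Y, W.
HYDROPHOBIC = set("VILFMYW")


def hydrophobic_positions(sequence, start, end):
    """Return 0-based indices of hydrophobic residues in sequence[start..end] (inclusive)."""
    return [i for i in range(start, end + 1) if sequence[i] in HYDROPHOBIC]


def shift_limit_outward(sequence, start, end, side):
    """Extend one SSE limit until it includes one more hydrophobic residue.

    side is "N" (move start toward index 0) or "C" (move end toward the
    sequence end). The limits are returned unchanged if no hydrophobic
    residue is found on that side.
    """
    if side == "N":
        for i in range(start - 1, -1, -1):
            if sequence[i] in HYDROPHOBIC:
                return i, end
    elif side == "C":
        for i in range(end + 1, len(sequence)):
            if sequence[i] in HYDROPHOBIC:
                return start, i
    else:
        raise ValueError("side must be 'N' or 'C'")
    return start, end


def shift_limit_inward(sequence, start, end, side):
    """Shrink one SSE limit so that the SSE loses its outermost hydrophobic residue.

    The limits are returned unchanged if fewer than two hydrophobic
    residues are present, so that the SSE always keeps at least one.
    """
    if side not in ("N", "C"):
        raise ValueError("side must be 'N' or 'C'")
    positions = hydrophobic_positions(sequence, start, end)
    if len(positions) < 2:
        return start, end
    if side == "N":
        return positions[0] + 1, end
    return start, positions[-1] - 1


# Toy example (invented sequence, SSE spanning indices 4..10).
toy_sequence = "GSLKAEVTGILNKF"
extended = shift_limit_outward(toy_sequence, 4, 10, "N")
shortened = shift_limit_inward(toy_sequence, 4, 10, "C")
```

In this toy case the extended element gains the leucine at index 2 and the shortened one loses the leucine at index 10, so the count of hydrophobic residues changes by exactly one in each direction.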

 

 
Table VII. Comparison of resulting structures with actual and simulated SSE limits prediction
 
Among hydrophobic residues, those that are conserved without functional need, whether they belong to the folding nucleus and form the super core (Kloczkowski and Jernigan, 2002), occupy the so-called ‘topohydrophobic’ positions (Poupon and Mornon, 1999b) or occupy conserved key amino acid positions (Reddy et al., 2001), may be valuable for further sorting the selected structures or even as new constraints in the process itself. Moreover, it would also be interesting to compute models for several divergent sequences known to code for the same fold, e.g. at a level of 15% sequence identity. One would thus expect to converge to the best candidates for the common core necessary to produce a given fold. Alternatively, it may be useful to examine the mutations that do not modify the convergence of the procedure, in order to explore better the sequence–structure signature of a fold, somewhat in the sense of the program Fold-X (Guerois et al., 2002).

RUSSIA works with a central point attractor, the overall geometric center C of the hydrophobic faces of α-helices or β-sheets. This feature is suited to the prediction of small- to medium-sized globular domains. For larger ones, a broader definition of this geometric attractor is required, e.g. an elongated area, which may allow one to deal successfully with the largest domains, provided that the number of independent SSE keeps the computing time within acceptable limits. In this respect, the number of strands is of relatively little importance compared with the number of helices, because strands have few degrees of freedom within the sheets. For the moment, sheets are limited to at most six strands. For large globular domains, one possible improvement might be to combine the central hydrophobic attractor C with hydrophilic attractors outside the protein in order to balance the working forces.
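As a rough illustration of the central attractor and of the compactness idea behind it, the following sketch computes a geometric center C from the centers of hydrophobic faces and a mean distance between hydrophobic residues belonging to different SSE. It is a simplified stand-in, not the comparison function of Part I: the function names, the data layout and the coordinates are assumptions made for the example.

```python
import math
from itertools import combinations


def geometric_center(points):
    """Return the unweighted centroid of a list of (x, y, z) points."""
    n = len(points)
    if n == 0:
        raise ValueError("at least one point is required")
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def mean_inter_sse_distance(sse_residues):
    """Mean distance between hydrophobic residues of different SSE.

    sse_residues maps an SSE label to the list of (x, y, z) positions of
    its hydrophobic residues. Pairs within the same SSE are ignored,
    because rigid bodies keep those distances fixed.
    """
    total, count = 0.0, 0
    for label_a, label_b in combinations(sse_residues, 2):
        for p in sse_residues[label_a]:
            for q in sse_residues[label_b]:
                total += math.dist(p, q)
                count += 1
    if count == 0:
        raise ValueError("at least two non-empty SSE are required")
    return total / count


# Invented coordinates (in angstroms) for two candidate arrangements.
face_centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0)]
center_c = geometric_center(face_centers)

compact = {"H1": [(0.0, 0.0, 0.0)], "H2": [(3.0, 4.0, 0.0)], "E1": [(0.0, 0.0, 5.0)]}
extended = {"H1": [(0.0, 0.0, 0.0)], "H2": [(6.0, 8.0, 0.0)], "E1": [(0.0, 0.0, 10.0)]}
prefer_compact = mean_inter_sse_distance(compact) < mean_inter_sse_distance(extended)
```

Under these assumptions a lower mean distance marks the more compact arrangement, which is the direction in which candidate models would be ranked.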

Somewhat surprisingly, helices described as rigid cylinders appear to be more difficult to handle than β-strands gathered in single helicoid surfaces, and they lead to clearly worse results for α or mixed structures than for all-β structures. One perspective may therefore be to leave more freedom to helices (non-rigid cylinders), although this will increase the number of parameters to be explored and consequently the computing time. However, computing time should no longer be a significant constraint.

Finally, it may be significant to note that, for three- or four-stranded sheets as considered in this paper, knowing the spatial order of strands in advance is not absolutely crucial since, among the few theoretical possibilities to be explored, some are very unlikely (Znamenskiy et al., 2000). In the future, sheets of five or more β-strands might also be considered and processed.

Conclusion

We have proposed an algorithm able to assemble the core of a protein, knowing the location and nature of the SSE. The main advantages of this procedure are its simplicity and speed, which result from treating helices and sheets as rigid bodies and discarding loops. The driving force is the maximization of hydrophobic compactness. The algorithm generates compact 3D structures for small and regular globular proteins. It can be considered as a step in the building of protein cores, as it models reasonably well the relative topology of the SSE. The resulting structures, compared with the native ones, have r.m.s. deviations of the order of 3 Å, and smaller values, of the order of 2 Å, are also frequently found. Generally, structures with r.m.s. deviations from the native structure within 3–7.5 Å are considered as ‘good’ (Bonneau et al., 2001) and structures with r.m.s. deviations <4 Å are considered as ‘native-like’ (Simmons et al., 1999). Hence the average r.m.s. deviations of each set of structures generated by the RUSSIA algorithm, relative to the modeled protein, are ‘good’ and the ‘best’ structures in the Φ-selected sets are ‘native-like’. We therefore conclude that the inter-residue distance matrix provides a good approximation of the set of all existing distances between the residues in the PDB.
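The two quality qualifications quoted above can be applied mechanically, as in the following minimal sketch. The function names are hypothetical, the coordinates are assumed to be already superimposed (no optimal fitting is performed here), and the thresholds are a literal reading of the ranges cited in the text, so a value between 3 and 4 Å receives both labels.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z).

    Assumption: the two structures are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def quality_labels(rmsd_angstrom):
    """Return the qualifications that apply to an r.m.s. deviation in angstroms.

    'native-like' : below 4 A (threshold cited from Simmons et al., 1999)
    'good'        : within 3-7.5 A (range cited from Bonneau et al., 2001)
    """
    labels = []
    if rmsd_angstrom < 4.0:
        labels.append("native-like")
    if 3.0 <= rmsd_angstrom <= 7.5:
        labels.append("good")
    return labels


# Invented two-point example: only the second point is displaced, by 2 A.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
deviation = rmsd(model, native)
labels = quality_labels(deviation)
```

A deviation above 7.5 Å returns an empty list, i.e. neither qualification applies.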

A comparison function Φ was proposed to sort the generated structures without any complementary knowledge or assumption about the native structure. To evaluate further the conformations produced by RUSSIA, one has to use alternative approaches, such as measuring distances between topologically conserved residues. Alternatively, introducing a proper treatment of loops into the procedure could improve the algorithm. The present algorithm can be extended to the prediction of larger proteins given a sufficient number of loop constraints, provided that they fold into a single compact domain. Transferring the present procedure to non-compact domains (i.e. those for which two or more local hydrophobic barycenters can be defined) has not yet been considered.


    Acknowledgements
 
D.Z. and K.L.T. were funded by the French Ministries of Foreign Affairs and Research, respectively. The Genome CNRS Project partially supported this research. Part of this project was supported by EU under contract number QLG2-CT-2002-01298.


    References
 
Bonneau,R., Charlie,E.M.S. and Baker,D. (2001) Proteins, 43, 1–11.

Cohen,F.E. and Sternberg,M.J.E. (1980) J. Mol. Biol., 138, 321–333.

Guerois,R., Nielsen,J. and Serrano,L. (2002) J. Mol. Biol., 320, 369–387.

Kloczkowski,A. and Jernigan,R.L. (2002) J. Biomol. Struct. Dyn., 20, 323–325.

Laskowski,R. (2001) Nucleic Acids Res., 29, 221–222.

Lee,J., Liwo,A., Ripoll,D.R., Pillardy,J. and Scheraga,H.A. (1999) Proteins, Suppl., 3, 204–208.

Poupon,A. and Mornon,J.P. (1998) Proteins, 33, 329–342.

Poupon,A. and Mornon,J.P. (1999a) FEBS Lett., 452, 283–289.

Poupon,A. and Mornon,J.P. (1999b) Theor. Chem. Acc., 101, 2–8.

Poupon,A. and Mornon,J.P. (2001) Theor. Chem. Acc., 106, 113–120.

Reddy,B., Li,W., Shindyalov,I. and Bourne,P. (2001) Proteins, 42, 148–163.

Reva,B.A., Finkelstein,A.V. and Skolnick,J. (1998) Fold. Des., 3, 141–147.

Rykunov,D.S., Lobanov,M.Y. and Finkelstein,A.V. (2000) Proteins, 40, 494–501.

Samudrala,R., Xia,Y., Huang,E. and Levitt,M. (1999) Proteins, Suppl., 3, 194–198.

Simmons,K.T., Ruczinski,I., Kooperberg,C., Fox,B.A. and Baker,D. (1999) Proteins, 34, 82–95.

Sippl,M.J. and Weitckus,S. (1992) Proteins, 13, 258–271.

Srinivasan,R. and Rose,G.D. (1995) Proteins, 22, 81–89.

Znamenskiy,D., Le Tuan,K., Poupon,A., Chomilier,J. and Mornon,J.-P. (2000) Protein Eng., 6, 407–412.

Received July 25, 2003; revised October 25, 2003; accepted October 30, 2003




