A new protein folding algorithm based on hydrophobic compactness: Rigid Unconnected Secondary Structure Iterative Assembly (RUSSIA). I: Methodology

Denis Znamenskiy, Jacques Chomilier1, Khan Le Tuan and Jean-Paul Mornon1

Systèmes Moléculaires et Biologie Structurale, LMCP, Universités Paris 6 et Paris 7, CNRS UMR 7590, Case 115, 75252 Paris cedex 05, France

1 To whom correspondence should be addressed. e-mail: jacques.chomilier{at}lmcp.jussieu.fr or jean-paul.mornon{at}lmcp.jussieu.fr


    Abstract
 Top
 Abstract
 Introduction
 The RUSSIA methodology
 References
 
We present an algorithm that is able to propose compact models of protein 3D structures, only starting from the prediction of the nature and length of regular secondary structures. Helices are modeled by cylinders and sheets by helicoid surfaces, all strands of a sheet being considered as a single block. It means that relative topology of the strands inside one sheet is a prerequisite. Loops are only considered as constraints, given by the maximal distance between their C{alpha} extremities according to their sequence length. Unconnected regular secondary structures are reduced to a single point, the center of their hydrophobic faces. These centers are then repeatedly moved in order to obtain a compact hydrophobic core. To prevent secondary structures from interpenetrating, a repulsive term is introduced in the function whose minimization leads to the compact structure. This RUSSIA (Rigid Unconnected Secondary Structure Assembly) algorithm has the advantage of relying on a small number of variables and therefore many initial conformations can be tested. Flexibility is produced in the following way: helices or sheets are allowed to rotate around the direction leading to the center of the model; residues in a sheet can slide along the main direction of the strand where they are embedded. RUSSIA is fast and simple and it produces on a test set several neighbor good models with an r.m.s. to the native structures in the range 1.4–3.7 Å. These models can be further treated by statistical potentials used in threading approaches in order to detect the best candidate. The limits of the present method are the following: small proteins with few secondary structures are excluded; multi domain proteins must be split into several compact globular domains from their sequences; sheets of more than five strands and completely buried helices are not treated. In this first paper the algorithm is developed and in Part II, which follows, some applications are presented and the program is evaluated.

Keywords: homology modeling/hydrophobic core/minimal surface/protein folding/secondary structure


    Introduction
 Top
 Abstract
 Introduction
 The RUSSIA methodology
 References
 
Prediction of protein fold from the knowledge of the amino acid sequence is still an important challenge in the post-genomic era. In many cases, a necessary stage is the prediction of secondary structure elements (SSE) as precisely as possible, regarding both their nature and position. The set of SSE constitutes the core of a protein and it is frequently called the ‘structurally conserved regions’ between several proteins of related sequence and common fold (Russell and Barton, 1994Go). In most of the procedures developed for the purpose of automatic modeling, the core backbones are first modeled from one or several templates selected from a multiple alignment. Then the peptides linking the SSE, i.e. the loops, are constructed and in a final step all side chains are built. In this paper, we introduce a new algorithm called ‘Rigid Unconnected Secondary Structures Iterative Assembling’ (RUSSIA), which does not need reference to a previous three-dimensional (3D) known template to operate. It performs a compact assembling of SSE as rigid blocks in order to predict the tertiary structure of small- to medium-sized protein cores. It thus produces a model restricted to the core, where the loops are used as distance constraints but their backbones are not modeled, leaving the SSE unconnected. In their pioneering work, Cohen et al. introduced the idea of considering an {alpha} class protein as composed of helical segments and connecting links of known length (Cohen et al., 1979Go).

RUSSIA uses the secondary structure prediction from existing programs, such as PHD (Rost et al., 1994Go) or HMM (Eddy, 1998Go). In the present model, ß-sheets (instead of ß-strands) and {alpha}-helices are considered as SSE, representing the basic building blocks of a structure. The {alpha}-helix geometry is invariant, in the sense that it has no inner degree of liberty, and the architecture of ß-sheets can be altered by the relative displacements (shifts) of the strands involved in the sheet, giving rise to different initial conformations for one sheet of known composition and topology. All SSE are reduced to a single point, the geometric center of their hydrophobic residues. The global 3D structure is produced by the successive displacements of these points towards their geometric center. To avoid steric hindrance during these displacements, a repulsive potential, based on a mean inter-residue distance analysis, is used in conjunction with the fact that each loop must accommodate a given number of residues. From this simple force field, several structures are proposed and in the final step, a choice is performed among them.


    The RUSSIA methodology
 Top
 Abstract
 Introduction
 The RUSSIA methodology
 References
 
The present algorithm can be divided into three major parts: (1) the geometry (topology) of the model and the initial conformation; (2) the force field (potential) used to produce compact structures; and (3) the search technique and the evaluation of proposed models.

To represent the residues ri of a sequence, we defined the vector ri = ri(a,x,y,z), where a is the nature of amino acid i and (x, y, z) are the coordinates of its {alpha}-carbon. We chose to model the residues by material points, placed at their C{alpha} atom positions, as these positions are well determined in the SSE and also because they do not depend on rotamers. Within a given SSE, the spatial positions of residues are fixed during one simulation.

The topology of SSE

The topology of ß-sheet surfaces emerges from the study of the geometry of two media borders (Osserman, 1986Go; Nitsche, 1989Go; Safran, 1994Go) and, in particular, from the study of soap films (Fomenko, 1986Go). Let us consider two liquids in a container, not mixing with each other. When the system is in equilibrium, there is a border between the two media, which is a two-dimensional (2D) smooth surface. The geometry of the inter-media border is under the influence of the equilibrium. Here, we introduce, from differential geometry, the notion of the ‘average curvature’ of a surface (Frankel, 1997Go). To calculate at a given point the average curvature of a smooth 2D surface embedded in a 3D Euclidean space, one needs to calculate the values of curvatures (second-order derivatives) of all curves produced by intersections of the surface with a plane, containing the vector normal to this surface. The sum of the largest and the smallest values of curvatures is called the average curvature of the surface at this point. From the Poisson theorem we know that the average curvature of an inter-media border in equilibrium is constant at all regular points.

If we consider a soap film introduced between two media with pressures p1 and p2, its average curvature will be {chi}(p1 p2), where 1/{chi} is the surface tension (a constant, depending on the nature of the film). If p1 = p2, the average curvature of the surface is equal to zero. Such surfaces are called ‘minimal surfaces’. The term ‘minimal’ means that any small distortion of the surface in the neighborhood of a given point does not diminish the surface.

The ß-sheets are approximated by a minimal surface, a helicoid one (Znamenskiy et al., 2000Go). Another minimal surface is the catenoid surface, produced by rotation around the y-axis of the curve y = ach(x/a) where a is a constant (this curve also describes the shape of a chain hanging between two fixed points). This surface can be used to model ß-sheet barrels (Lasters et al., 1988Go; Lasters, 1990Go; Flower, 1994Go; King et al., 1994Go; Murzin et al., 1994Go).

From the C{alpha} coordinates of an SSE, a vector field v can be constructed. The vectors at the C{alpha} positions can be defined as the sum of two vectors that bind C{alpha} i with i – 1 and i + 1 (Figure 1a). Between the C{alpha} the vector field can be smoothly interpolated. Then the integral curve C of the field v can be traced to model the backbone chain for ß-strands (Figure 1b) as well as for {alpha}-helices (Figure 1c).



View larger version (18K):
[in this window]
[in a new window]
 
Fig. 1. Vector field v for various SSE. (a) The vector field in the C{alpha} positions is defined as the sum of two vectors (Ci – 1,Ci) and (Ci,Ci + 1) that bind the C{alpha} atom i with i – 1 and i + 1. (b) Interpolation of the vector field between the C{alpha} atoms of a ß-strand. The integral curve C of the field v is a linear function in cylindrical coordinates and is represented in gray. (c) Interpolation of the vector field between the C{alpha} atoms of an {alpha}-helix represented in gray.

 
We modeled the SSE by the level surfaces f1 = r (cylinder) and f2 = z/{Theta} (helicoid) (Figure 2). All the regular SSE are therefore modeled using these 2D surfaces in the 3D space: a cylinder and a helicoid. Both {alpha}-helix and ß-sheet are described by the same equation with different parameters. By adjusting the step (twist) of the helicoid, thus fixing f2 and the radius of the cylinder, thus fixing f1, we obtained either an {alpha}-helix or a ß-sheet structure (Figure 2), according to the values of f1 and f2. The intersection of a cylinder and a helicoid sharing the same axis yields two curves and only one is kept to model {alpha}-helices (Figure 3): f1 is then set to 2.4 Å and f2 is set to h{pi}/1.8 where h is the rise of an {alpha}-helix. For a ß-sheet model (Figure 4), f1 takes discrete values, corresponding to the number of strands in the sheet and f2 is determined by the twist of the sheet (Znamenskiy et al., 2000Go).



View larger version (77K):
[in this window]
[in a new window]
 
Fig. 2. Surfaces defining the SSE. The SSE are modeled by the surfaces of the integrals of the vector field v: f1 = C1 (for strands) and f2 = C2 (for helices), with C1, C2 real numbers.

 


View larger version (44K):
[in this window]
[in a new window]
 
Fig. 3. {alpha}-Helix modeling. The intersection of the cylinder and the helicoid yields two curves and one of them is taken as a representation of the periodic backbone, with 3.6 C{alpha} atoms per turn. The step h of the helicoid surface and the radius r of the cylinder are fixed, thus with a null number of degrees of freedom.

 


View larger version (82K):
[in this window]
[in a new window]
 
Fig. 4. Modeling of a four-stranded ß-sheet. The various ß-strands are constructed on the basis of the intersections of multiple concentric cylinders of radii r12 and r34 with the helicoid surface, whose equation is given on the figure. In this example, the two cylinders give rise to four intersecting curves.

 
{alpha}-Helix modeling. To model an {alpha}-helix structure, we used the conformation parameters most frequently found in helices of globular proteins. In order to decrease the number of degrees of freedom of the system, the {alpha}-helices are considered as ideally rigid. The C{alpha} atoms of an {alpha}-helix structure are placed at 3.8 Å intervals on the curve resulting from the intersection of two surfaces locally approximated by the cylinder and the helicoid (Figure 3). Their positions remain fixed relative to the local coordinate system.

ß-Sheet modeling. To decrease the number of degrees of freedom of the model, we have previously shown that gathering the ß-strands in a sheet described by a single helicoid surface (Figure 4) is a pertinent description of this regular super secondary structure (Znamenskiy et al., 2000Go). The algorithm assumes the twisted architecture of the ß-sheets (Ho and Curmi, 2002Go), often observed in the PDB database (Berman et al., 2000Go), and due to the lone pair of the backbone nitrogens that tend to eclipse the C{alpha}–C bond which results in a decrease in the {phi} value (Shamovsky et al., 2000Go). Nevertheless, owing to good convergence of the algorithm, slightly bent ß-sheets can also be considered, yielding a lower precision for the resulting structure. For the three- and four-stranded ß-sheets, the average root mean square (r.m.s.) deviation between the ‘twisted’ models of ß-sheets and all the ß-sheets from the PDB database, including the bent ß-sheets, is 1.5 Å, making the ‘twisted’ model representative for these classes of ß-sheets.

In the present version of RUSSIA, we focused on proteins whose folding relies on the compactness of SSE. Hence it is not designed to propose models of folds made of single sheets, as in the case of some toxins (see, for instance, bucadin, PDB code 1f94), where folding holds because of the presence of disulfide bridges, instead of SSE interactions. Our method starts with sheets of three strands, hence smaller sheets are not considered. Even if more strands were involved, bulged or bent sheets are not in the reach of this method, so far.

In practice, the cylinder radii correspond to the distances between one strand and the z-axis. The step of the helicoid surface is given by 2{pi}H and the angle of rotation of a strand around the z-axis is {theta}. The conformation of a particular sheet is given by the twist T = 360°/2{pi}H, related to a fixed value of H (Znamenskiy et al., 2000Go). The strands are obtained as the intersections of cylinders of radius r and the helicoids, whose equations are given in Figure 4. The C{alpha} atoms of a strand are positioned along these intersecting curves, alternately on either side of the helical surface at a distance of 0.9 Å distance from it, on the cylinder surface at 3.8 Å from each other, as shown in the inset of Figure 4. The radius is invariant for a given strand and the distance between two adjacent strands is 4.7 Å. The ß-sheet architecture is defined by the number of residues in a sheet with fixed relative topology, i.e. strands cannot move from one end to the other in the sheet. To overcome this limitation, one can enumerate all possible topologies as starting conformations and run the program as many times as necessary, taking advantage of a few privileged topologies (Znamenskiy et al., 2000Go). Furthermore, it was demonstrated that the number of allowed topologies both for ß-sandwich and {alpha}-helix packing can be decreased because of the restrictions due to steric and connectivity constraints (Cohen et al., 1980Go).

This algorithm is limited to a small number of strands (4–5), hence full enumeration of the topologies is feasible. For a four-stranded sheet the total number of topologies is only 16, as long as one ignores the sequential order, and 96 when loops are considered, i.e. when connectivity is included. Furthermore, it is known that all topologies are not equivalently probable and when ‘the numbers of observed topologies are compared with the numbers of theoretical topologies for three- to eight-stranded ß-sheets, it is clearly seen that the current structural database represents only a small fraction of what topologies are possible’ (Zhang and Kim, 2000Go; Znamenskiy et al., 2000Go). Hence one can drastically reduce the number of investigations and limit it to the known folds. As the program is fairly rapid, one can hope in the near future to generate from scratch all already indexed topologies and then select the best ones among the computed models.

Building of the SSE as rigid blocks and choice of their initial positions

In globular proteins, most SSE have one side exposed to the solvent and the other facing the inner part of the protein. Statistically, hydrophobic residues are less present on the solvent-exposed side of SSE. Therefore, we can distinguish a region with a high density of hydrophobic residues on the SSE surface, which we call the ‘hydrophobic face’ (HF), characterized by its ‘hydrophobic face center’ (HFC). Residues V, I, M, W, Y, L, F and C were considered as hydrophobic following the results of several compilations of experimental data (Callebaut et al., 1997Go; Soyer et al., 2000Go; Hennetin et al., 2003Go). Hence the RUSSIA algorithm considers each HFC as a point of attraction, guiding its SSE during the folding process. Once the initial relative positioning of these HFCs has been performed (the choice of initial positions is described later), the various HFC drive the SSE towards the geometric center of the protein, until one equilibrium position is reached. Hence, the HFC of a given SSE represents the attraction point for other SSE in the dynamic part of the algorithm and the motor of the algorithm is the collapse of these hydrophobic HFC.

{alpha}-Helix hydrophobic face definition. To determine the HFC of the {alpha}-helices, hydrophobic residue angular positions only were considered. In a canonical {alpha}-helix with 3.6 residues per turn, a C{alpha} atom can occupy one of 18 discrete angular positions in a plane perpendicular to the axis of the local cylindrical coordinate system (see Figure 3).

A ‘connected region’ was defined as successive angular positions occupied by hydrophobic residues on the base circle, allowing at most insertion of one non-hydrophobic residue among a pair of hydrophobic ones (see Figure 5). The ‘hydrophobic face’ of the {alpha}-helix was defined by the set of hydrophobic residues belonging to the previously defined longest ‘connected region’. The angular coordinate of the HFC of an {alpha}-helix was defined as the weighted sum of C{alpha} angular coordinates of hydrophobic residues belonging to the hydrophobic face. For helices longer than 18 amino acids, one angular position can be occupied by more than one residue. To choose the weights for each hydrophobic residue in the sum, we considered a training set of five four-helix bundle proteins, ferritin (PDB code 1fha), granulocyte-macrophage colony-stimulating factor (1gmf), hemerythrin (1hmd), apolipoprotein E3 (1lpe) and myohemerythrin (2mhr). Using the simulated annealing algorithm (Press et al., 1991Go), we derived the weights for hydrophobic residues that determined the HFC producing the best orientation of an {alpha}-helix towards the center of a protein. Four hydrophobic amino acid species were considered to derive the HFC of the helix: Cys (with a weight of 4.4), Leu (2.5), Met (1.3) and Phe (1.4). This choice relies on the fact that, when one of these four amino acids is present in the HF of a helix, its position is very close to the HFC, in our test set. This was done on a trial and error basis and as it happened to give a reasonably successful packing of the structure, this rule of thumb has not been further refined in the present state of the algorithm. If a helix had none of these four residues, then Ile, Val, Trp and Tyr are considered with equivalent weights. Once the angular coordinate has been set, the HFC is then placed in the middle point along the {alpha}-helix axis. This choice of the HFC determination produces the most compact structures.



View larger version (37K):
[in this window]
[in a new window]
 
Fig. 5. Definition of the HFC of an {alpha}-helix. The C{alpha} positions of an {alpha}-helix are projected on a base circle of the cylinder. We defined a ‘connected region’ as successive positions occupied by projections of hydrophobic residues on this base circle. The ‘hydrophobic face’ of the {alpha}-helix is defined as the set of hydrophobic residues whose projections belong to the longest ‘connected region’. The hydrophobic face is shaded gray.

 
ß-Sheet hydrophobic face definition. One HF is defined as the set of all hydrophobic residues belonging to one given side of a ß-sheet. Therefore, as a ß-sheet has two sides, there are two HF for one sheet. To decide which side of a ß-sheet a given residue belongs to, one considers the sign of the scalar product of the (Cß – C{alpha}) vector with the vector n, normal to the considered side of the ß-sheet surface; if it is positive then the residue belongs to this side.

The HFCs are calculated as the unweighted geometrical centers of the C{alpha} of hydrophobic residues on each side. As the HFC of a sheet contains more hydrophobic residues than in a helix, we did not evidence any statistical correspondence between the HFC and the nature of the amino acids. Two independent HFCs are obtained, one for each side of a ß-sheet (Figure 6).



View larger version (44K):
[in this window]
[in a new window]
 
Fig. 6. Definition of the HF of a ß-sheet. The two hydrophobic faces of a ß-sheet are defined as the sets of hydrophobic residues belonging to a given side of a ß-sheet. Each HFC is defined as the geometric center of the hydrophobic residues C{alpha} on the same side of the ß-sheet.

 
SSE initial positions. In the algorithm, three models of globular proteins were distinguished: the HELIX model describes the three- and four-helix bundles; the BETA model contains two ß-sheets; and the MIXED model contains both {alpha}-helices and ß-sheets. We assumed that more complex proteins can be split into a set of these basic motifs, so they were not explicitly treated in our study.

For the HELIX model with three helices, the HFCs are placed at the vertices of an equilateral triangle and in the case of four helices they were placed at the vertices of a square. The initial position of helix i is determined by the angle of rotation {alpha}i around the axis passing through the HFC of this helix i and C, the geometric center of all HFCs (Figure 7). We found that the optimal number of angular positions {alpha} covering the whole conformational space of an {alpha}-helix is 36, yielding a 10° grid. In other words, the initial conformations correspond to 36 possible positions for each helix, obtained by rotating each helix around this axis. Hence, for a three-helix bundle we obtained 363 initial positions and 364 for a four-helix bundle. The actual number of initial positions, once the loop connectivity conditions are applied, is much smaller.



View larger version (45K):
[in this window]
[in a new window]
 
Fig. 7. HELIX model. The goal function F to be minimized is defined as the largest distance hk, where hk is the distance between the HFC of the kth helix and C, the geometric center of HFCs.

 
In the BETA model, the two ß-sheets are placed at an initial distance h = 8 Å, such that both HFCs belong to the same normal to the sheets. This initial distance is coherent with the range for a standard packing geometry of twisted sheets (Cohen et al., 1981Go). One of the ß-sheets is then rotated around the axis determined by the two HFCs and a 10° grid is also applied. Some flexibility has been introduced in the relative combinations of ß-strands in the sheet. Residues can be displaced of one or two positions along the mean direction of the strand, resulting in several initial positions of ß-strands within a ß-sheet (Figure 8). These shifts are characterized by b values, corresponding to the number of C{alpha} positions by which the strand has been shifted. The direction is given by the sign of b, whose values have been limited to the range [–2, +3].



View larger version (67K):
[in this window]
[in a new window]
 
Fig. 8. Shift of strand i in a three-stranded sheet. The parameter bi determines the direction and value (in units of C{alpha}–C{alpha} displacement) of the shift. bi = 0 is the reference where all three strands present a canonical conformation. The residues of a strand shifted by one position change of side relatively to the mean plane of the ß-sheet.

 
To describe completely the topology of the sheets, all relative positioning of the strands and their possible directions are taken as initial conditions. For instance, in one three-stranded sheet, one has three independent topologies regardless of the direction of the strands, that we can call 123, 132 and 213, and all others are not independent of this set. For each of the three topologies, four cases must be considered when the direction of the strands is included: {uparrow}{uparrow}{uparrow}, {downarrow}{uparrow}{downarrow}, {uparrow}{uparrow}{downarrow} and {downarrow}{uparrow}{uparrow}. It therefore leads to 12 independent initial conditions for a three-strand sheet.

For the MIXED model, an {alpha}-helix was placed in front of a ß-sheet. Its HFC is placed along the normal to the ß-sheet plane passing by the HFC of the sheet side facing the helix. A second {alpha}-helix can be set on the other side of the ß-sheet. They are both at an initial distance h = 8 Å from the sheet, along the normal to the sheet passing at the HFC facing the considered helix (Figure 9). Among the MIXED models, the simplest case is a structure containing one {alpha}-helix facing one ß-sheet, and at the other extreme, there can be several helices on each side of the ß-sheet. The maximum number of {alpha}-helices that one can position on one side of a sheet depends on the numbers of residues involved in these SSE.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 9. MIXED SANDWICH model. The goal function F for the MIXED SANDWICH model is defined as the greatest distance hk between the HFC of each {alpha}-helix and the corresponding HFC of the ß-sheet.

 
The diameter of an {alpha}-helix is 5.8 Å and the lower limit of inter-residue distances, derived from our analysis (see Figure 10), is around 4 Å. Therefore, if we consider a three-stranded ß-sheet and two parallel {alpha}-helices, we can hardly position more than one {alpha}-helix on each side of the ß-sheet without exposing part of their HF to the solvent. Nevertheless, a four-stranded ß-sheet may contain two {alpha}-helices on each side. In the present algorithm, the maximum number of {alpha}-helices facing a four-stranded ß-sheet was fixed at two.



View larger version (23K):
[in this window]
[in a new window]
 
Fig. 10. Definition of residue constraints. (a) The distributions of distances dij were calculated for each pair of residues i, j. The mode of the distribution dij* is found. f(dij) is the smoothed distribution of distances. (b) The constraint function g(dij). The minimum is attained at dij*. (c) A simplified constraint function g*(dij) is constructed, it reaches its minimum at dij*.

 
Distance constraints among SSE

As the bulk of the RUSSIA algorithm relies on geometry, the first step was to produce a force field able to attract the SSE, reduced to a single point, their HFC. In addition to this attraction, one repulsion must be introduced in order to prevent the rigid blocks from inter-penetrating during the simulation. It was derived from the real mean equilibrium distances between SSE in a set of structures. This training set consisted of 32 895 residue pairs, extracted from the June 2000 release of the PDB (Berman et al., 2000Go), limited to a maximum of 25% sequence identity between any pair. SSE assignment was provided by DSSP (Kabsch and Sander, 1983Go). For each residue ri of a given protein, we measured the distances dij between the C{alpha} of this residue and any C{alpha} j in the sequence, provided that residues ri and rj belong to different SSE (Vendruscolo et al., 2000Go). Remember that RUSSIA assembles SSE as rigid blocks and loops are discarded from this study in terms of explicit treatment of their residues. Therefore, we did not consider a distance dij if any residue in the pair (ri, rj) belonged to a loop. To be coherent with the fact that all strands of a sheet constitute one rigid block and for ease of computing, we did not consider distances between residues if both belonged to ß-strands.

From the set of distances dij, the closest distance di for each residue ri is calculated between all residues and we obtained a set D of 32 895 closest distances. These distances were then grouped according to the chemical nature of amino acid pairs ai and aj, thus giving a matrix of 400 data, where each cell in this matrix is a list of closest distances:

D(ai,aj) = (dij1, ..., dijnij)(1)

where nij are the numbers of minimal distances found in the whole data set for the amino acid types ai and aj. We analyzed the distribution of distances for each of the 400 D(ai,aj). The number of distance classes lij was chosen from the sample size nij according to the Sturgess equation (see, for instance Wonnacott and Wonnacott, 1995Go):

lij {approx} 1 + 1.44lnnij(2)

The ‘representative’ distance dij* = d*(ai,aj) for each pair of amino acids was assumed to be the mode of the distribution D(ai,aj), i.e. the value for which the density of the distribution is maximal. In the case of multimodal distributions, the mode corresponding to the smallest value of dij* was selected. This is consistent with the fact that the bulk of the algorithm is the maximal compactness of the hydrophobic core of proteins. Hence we obtained a 20x20 symmetric matrix giving a value of the ‘representative’ distance between any pair of amino acids (Bryant and Lawrence, 1993Go), presented in Table I.


View this table:
[in this window]
[in a new window]
 
Table I. Symmetric inter-residue distance matrix for the 20 amino acids
 
The ‘representative’ distances between residues were needed to determine some constraints in the non-linear programming problem of the SSE assembling (Simons et al., 1999Go). In other words, the SSE will be forced to place themselves at relative distances corresponding to those of ‘representative’ distances, provided that one can decide what are the residues representative of the SSE. To find a numerical solution using a ‘gradient’ algorithm, we had to replace these constraints by ‘smooth’ curves. Once the ‘smoothed’ distributions of distances f(dij) had been calculated (Figure 10a), it was possible to define a constraint function:

g(dij) = Cf(dij)(3)

where C is a constant. The function g(dij) represents a potential for each pair of interacting residues and is similar to that of van der Waals (Figure 10b).

To decrease the number of calculations, we used a simplified version of the constraint function g* for a pair of residues (Sippl, 1990Go):

The function g* so constructed reaches its minimum at the ‘representative’ distance dij*. A shorter distance between residues is not prohibitive (as in a real protein) but less favorable. This constraint g is used to avoid the overlapping of the SSE when they are brought closer in order to produce a compact structure. It also accounts for the steric hindrances of the residues, reduced to a single point, the C{alpha}, in these calculations. Thus for n residues, one gets a set of n(n – 1) inter-residue conditions with g*(dij) = 0. This set of conditions is denoted B1.

The goal function

In order to produce a compact structure from a set of independent and unconnected SSE, the distances between their HFCs were used to define a goal function F. Its minimum must be reached if one wishes to fulfil the conditions of compactness of a protein. Minimization is accomplished by regular displacements of the SSE along the axis passing through the HFC of the particular SSE and the geometric center of HFCs of all the SSEs. When SSE come into contact, they are rotated in a plane perpendicular to their previous displacement.

For the HELIX model, the goal function was the greatest distance between all HFCs and C, the geometric center of all the HFCs (Figure 7), such that, for a protein with k helices:

where hk is the distance between the HFC of helix number k and C. By choosing this type of very simple goal function, containing as few parameters as possible, we assumed that minimizing the distance hk for every SSE would lead to a ‘compact’ 3D structure. Furthermore, we noticed that minimizing the average h for all SSE could produce locally compact structures which, however, were not globally optimal.

For the MIXED model, the goal function F has the same formalism as for the HELIX model. In this case, the hk are the distances from the helix HFCs to the HFCs of each side of the facing ß-sheet (Figure 9). Minimization of F is performed independently for each side of the sheet. Flexibility has been introduced in the sheets by means of shifts of strands inside one sheet. To ensure that the native structure is present among all generated models, the helix is moved from one side of the sheet to the other under the initial conditions of the simulation.

For the BETA model, the distance h in the goal function F was defined as the distance between the HFCs of the two sheets (Figure 11). One of the two sheets is permitted to rotate around the axis passing through the HFCs, the second one remaining fixed. The parameters allowed to vary are the shifts for each ß-strand in each ß-sheet and the relative rotation of ß-sheets around the axis passing through their HFCs.



View larger version (72K):
[in this window]
[in a new window]
 
Fig. 11. BETA SANDWICH model. h is the distance between the HFCs of the two ß-sheets.

 
Loop connectivity

The RUSSIA algorithm does not consider loops as a rigid SSE, because they present a wide range of conformations. The size of the loops is taken as the greatest distance between the first and last C{alpha} atoms of the loop and it is used in the algorithm as a constraint applied to the SSE. This greatest distance lm is derived from the study of loop conformation by Wojcik et al. (Wojcik et al., 1999Go), where all loops of a given length m are classified with respect to the distance separating the first and last C{alpha} of each loop.

Then, a constraint function skm(l) (Figure 12) is defined by



View larger version (8K):
[in this window]
[in a new window]
 
Fig. 12. Definition of loop constraint function sm. l is the distance between the adjacent SSE; lm is the maximal loop length for m residues; sm reaches {infty} if l > lm.

 

The value of {delta} = 0.5 Å was fixed by trial and error tests, for all values of m. The parameter l is the distance between the two SSE connected by the loop. These constraints allow the elimination of a large number of ‘prohibited’ initial positions of the SSE at the very beginning of the algorithm, thus saving computing time. This set of conditions is denoted B2.

Minima search technics

Having modeled the regular SSE, we wished to bring them closer to form compact structures. Therefore, we had to minimize the goal function F that has been previously defined:

F(x1, ..., xk, y1, ..., yk, z1, ..., zk, {phi}1, ..., {phi}k,{psi}1, ..., {psi}k ,{theta}1, ..., {theta}k)(7)

The parameter p = (x1, ..., xk, y1, ..., yk, z1, ..., zk) determines the Cartesian coordinates of the k SSE, considered as rigid blocks and thus reduced to a single point. The parameter q = ({phi}1, ..., {phi}k, {psi}1, ..., {psi}k, {theta}1, ..., {theta}k) determines the angular positions of all the SSE, also by reference to their HFCs. When the distance between the SSE decreases in order to minimize F, the SSE can come into contact. To avoid the steric hindrance, a set of inter-residue contact conditions, called B1, has been defined, based upon inter-residue distance analysis. To account for the presence of the loops linking the k SSE, another set of k – 1 boundary conditions, called B2, has been introduced. Hence the problem of producing a model is given by the following set of equations:

where F is the goal function and B1, B2 are the set of boundary conditions for k SSE.

To reach the minimum of F, the SSE are moved by steps towards the center of the protein (Figure 13). Let us remember that F itself is defined as the largest hk distance between helices and the center C. One can understand the minimization process as the following: one searches for the largest distance hk, then this distance is reduced, which in turn produces a new position for the center C, thus giving rise to a new largest hk distance (it may be with another helix) and so on. Once the function g (residue contact constraint) becomes greater than zero, the SSE are slightly moved (of the order of 5% of the largest F) and rotated (typically 1°) around their HFC until g = 0. The operation is repeated as long as values of F decrease and s = 0.



View larger version (33K):
[in this window]
[in a new window]
 
Fig. 13. Minimization strategy. (a) To minimize the goal function F, we move the SSE i in the direction of the vector –{nabla}F until the boundary conditions are violated. (b) If residues are too close SSE i is moved away in the direction of the vector –{nabla}g or –{nabla}s if the loop length is such that the loop constraint function must be applied.

 
The initial positions of SSE are discretized and explore all possible SSE combinations. For instance, the {alpha}-helices adopt 36 different angular positions, corresponding to 10° rotations. Sheets are also rotated with the same angular step. One has verified that rotating the initial position of one {alpha}-helix by 10° produces an r.m.s. deviation between the C{alpha} of the model and the native structures of <1 Å, which is an acceptable step for our model.

The transfer of an {alpha}-helix from one side of the ß-sheet to another in the MIXED model and the swap of {alpha}-helices (change of relative position of one helix in the bundle) in the HELIX model yield a fairly complete set of the initial positions for the {alpha}-helices, although proper comparison of very different structures is difficult. Nevertheless, wrong structures will be eliminated at the end of the process. In a sheet every strand adopts six possible positions, because of the shifts (Figure 10). These positions cover all possible ß-sheet conformations for the considered class of ß-sheets (Znamenskiy et al., 2000)Go. Swapping the ß-strands within a ß-sheet, with respect to the connectivity condition, allows a complete description of ß-sheet conformations, at the cost of several runs of the program. Hence the algorithm explores all possible SSE conformations and a simulated annealing algorithm (Press et al., 1991Go) is used to perform the search of the minima.

Sorting out solutions

For each protein model, the conformational space was explored and the minimization algorithm was applied to different initial positions of SSE corresponding to the various angles at rotations of the SSE around the direction of their displacements (see earlier). Each one leads to a tertiary structure. As the SSE also rotate when one of the amino acids comes into contact with another SSE, very similar structures can be produced from different initial conditions. In the particular case of sheets, initial conditions correspond to one given value of the shift b for each strand of the sheet, besides the global angular orientation of the sheet relative to its direction of displacement. After some structures have been eliminated by the loop length condition, several fulfil the conditions of the goal function and remain possible candidates for the final structure. To sort out these structures and to determine the ‘best’ ones, we introduced the potential function {Phi}:

where {Phi} represents the average distance between the C{alpha} of the hydrophobic residues belonging to different SSE. The initial angles for the h helices: are {alpha} = ({alpha}1, ..., {alpha}h) and the initial shifts for the s strands are b = (b1, ..., bs) for a total number of SSE given by k = h + s. The total number of hydrophobic residues is nphob. The function {Phi}, although fairly simple, is smoother than the function F and more suitable for comparing structures. Conceptually, it corresponds to the fact that the folding is accomplished by allowing the local clusters of hydrophobic residues attached to each SSE to come into contact. It is a consequence of the empirical observation that hydrophobic clusters correspond to regular secondary structures (Woodcock et al., 1992Go; Callebaut et al., 1997Go; Hennetin et al., 2003Go). Of course, if any additional information about the 3D structure of a protein is available, a more precise and specific function could be used. From the analysis performed on a test set of 12 sample structures (described in the following Part II), we noticed that the shape of {Phi} is similar to that of the r.m.s. deviation between the RUSSIA generated and the native structures. This latter r.m.s. is only calculated over the {alpha}-carbons of the residues involved in helices and strands. As an example, we considered a three-helix bundle from engrailed homeodomain, PDB code 1enh. Owing to the small number of SSE, a simple graphical representation can easily be made (e.g. Figure 14). Initial angles {alpha}1 for the first helix are on the first line, in degrees between 0 and 180°, and for the second helix below, {alpha}2 is limited to the range 140–200°. The large regions corresponding to the ‘prohibited’ initial angles were discarded in order to focus more clearly on the important areas. At every 10° step of {alpha}2, the {alpha}1 distribution is reported over 180°. The ordinate is the initial {alpha}3 angle of the third helix of 1enh, from –60 to 160°. The light-shaded areas represent the smoothed values of {Phi} (top, Figure 14) and r.m.s. (bottom, Figure 14) and the dark areas correspond to these lowest values. Regions where the combination of these initial angles is impossible are left white. As this was also the case for the whole test set, we introduced the ‘smoothed’ {Phi} function and r.m.s. over 20°:



View larger version (59K):
[in this window]
[in a new window]
 
Fig. 14. {Phi} and r.m.s. for 1enh. Comparison of the smoothed {Phi} function (upper figure) and smoothed r.m.s. difference (lower figure) between the generated and native structures of the engrailed homeodomain protein. All hydrophobic residues contribute. On the abscissa the second line represents the {alpha}2 initial angle of the second helix, from 140 to 200°. At each step of 10°, the range of {alpha}1 angles from 0 to 150° is represented on the first line of the absissa. The last {alpha}3 initial angle is reported on the ordinate, in the range –60 to 150°. The minimal r.m.s. deviation is 2.58 Å. The best r.m.s. occurs for angle values of 40, 150 and 60° for the first, second and third helices, respectively.

 

{Phi} = [{Phi}({alpha}i) + {Phi}({alpha}i + 10°) + {Phi}({alpha}i – 10°)]/3(10)

r.m.s. = [r.m.s.({alpha}i) + r.m.s.({alpha}i + 10°)+ r.m.s.({alpha}i 10°)]/3(11)

Indeed, the correlation coefficient between the {Phi} and r.m.s. is clearly higher when these functions are smoothed (Figure 14). The fact that smoothing the {Phi} and r.m.s. over a range of ±20° is an a posteriori proof that the step of 10° used to generate the structures does not need to be smaller. A correlation, however, does not always provide a coincidence of minima. The fact that smoothed parameters {Phi} and r.m.s. correlate means that one should not restrict to the smallest value of {Phi} in order to select the best model, but scan the various models over a range of 20° around this minimal solution. From here on and throughout this paper and Part II, we will be concerned by those smoothed values of {Phi} and r.m.s. To select a structure of low r.m.s. with the native one, we used two search techniques. The first approach is to sort out all the generated structures by their {Phi} values and select a small number of structures of low {Phi} values The second approach is to choose the structure with the lowest {Phi} value and to explore the structures generated for close initial positions; the convergence of the algorithm leads to structures of small r.m.s. when starting from neighboring initial structures. For the HELIX model the second approach seemed to be more appropriate, whereas for the BETA and MIXED the first approach was preferred.

In the example in Figure 14, a correlation coefficient of 0.95 was obtained between smoothed {Phi} and r.m.s., revealing a quasi-linear relation between these two functions. However, the generated structure with minimal value of {Phi} does not necessarily provide the least r.m.s. compared with the native structure. Nevertheless one can find a reasonably close structure in the neighborhood of ±20° around the corresponding initial angles. Our tests showed that this search range was sufficient.

We also studied, as an example of the MIXED model, a pollen allergen 5 of PDB code 1bbg, containing three ß-strands and one {alpha}-helix. The conformational space has four dimensions, the initial {alpha} angle of the helix and the three shifts b1, b2, b3 for the corresponding strands in the sheet (Figure 15). We also found a phase shift of {Phi} relative to the r.m.s. between the generated and the native structures. The correlation coefficient between the ‘averaged’ function {Phi} and the r.m.s. is 0.95. The 10% of the lowest values of {Phi} and r.m.s. is colored in black and 90% of the greatest values of the functions is colored in gray. White areas represent the initial angles and shifts of the SSE for which the algorithm could not generate a tertiary structure. The minimal r.m.s. deviation is 1.73 Å.



View larger version (89K):
[in this window]
[in a new window]
 
Fig. 15. {Phi} and r.m.s. for 1bbg. Smoothed {Phi} values for a generated structure of the pollen allergen 5 protein are represented in the upper figure. The lower figure represents the values of the r.m.s. difference between the generated and the native structures for different initial angles of the {alpha}-helix and shifts of strands. The values of shifts for the first and second strands of the ß-sheet are indicated on the ordinate. The initial angles for the {alpha}-helix and shifts for the third strand are indicated on the abscissa in the first and second rows, respectively. The minimal r.m.s. deviation is 1.73 Å. The best r.m.s. occurs for values of the shifts of the three strands and the angle around the helix of 0, –1, –1 and 240°.

 
Conclusion

In a subsequent study (see the following Part II), we applied the RUSSIA procedure to a representative set of small- to medium-sized globular domains (up to 166 amino acids) of the all-{alpha}, all-ß and {alpha}/ß protein classes of CATH (Orengo et al., 1998Go). We show that, with limited computing power (PC Pentium 233), it routinely allows one to obtain good folds (C{alpha} r.m.s. in the range 1.4–3.7 Å) for the core or the protein, limited to the SSE considered as rigid blocks.


    Acknowledgements
 
D.Z. and K.L.T. were funded by the French Ministries of Foreign Affairs and Research, respectively. The Genome CNRS project partially supported this research. Part of this project was supported by EU under contract number QLG2-CT-2002-01298.


    References
 Top
 Abstract
 Introduction
 The RUSSIA methodology
 References
 
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235–242.[Abstract/Free Full Text]

Bryant,S.H. and Lawrence,C.E. (1993) Proteins, 16, 92–112.[ISI][Medline]

Callebaut,I., Labesse,G., Durand,P., Poupon,A., Canard,L., Chomilier,J., Henrissat,B. and Mornon,J.P. (1997) Cell. Mol. Life Sci., 53, 621–645.[CrossRef][ISI][Medline]

Cohen,F.E., Richmond,T.J. and Richards,F.M. (1979) J. Mol. Biol., 132, 275–288.[ISI][Medline]

Cohen,F.E., Sternberg,M.J.E. and Taylor,W.R. (1980) Nature, 285, 378–382.[ISI][Medline]

Cohen,F.E., Sternberg,M.J.E. and Taylor,W.R. (1981) J. Mol. Biol., 148, 253–272.[ISI][Medline]

Eddy,S.R. (1998) Bioinformatics, 14, 755–763.[Abstract]

Flower,D.R. (1994) Protein Eng., 7, 1305–1310.[ISI][Medline]

Fomenko,A.T. (1986) Comput. Math. Appl. B, 12, 825–834.[ISI]

Frankel,T. (1997) The Geometry of Physics. Cambridge University Press, Cambridge.

Hennetin,J., Le Tuan,K., Canard,L., Colloc’h,N., Mornon,J.-P. and Callebaut,I. (2003) Proteins, 51, 236–244.[CrossRef][ISI][Medline]

Ho,B.K. and Curmi,P.M.G., (2002) J. Mol. Biol., 317, 291–308.[CrossRef][ISI][Medline]

Kabsch,W. and Sander,C. (1983) FEBS Lett., 155, 179–182.[CrossRef][ISI][Medline]

King,R.D., Clark,D.A., Shirazi,J. and Sternberg,M.J. (1994) Protein Eng., 7, 1295–1303.[ISI][Medline]

Lasters,I. (1990) Protein Eng., 4, 133–135.[ISI][Medline]

Lasters,I., Wodak,S., Philippe,A. and van Gunsteren,E. (1988) Proc. Natl Acad. Sci. USA, 85, 3338–3342.[Abstract]

Murzin,A.G., Lesk,A.M. and Chothia,C. (1994) J. Mol. Biol., 236, 1369–1381.[CrossRef][ISI][Medline]

Nitsche,J.C.C. (1989) Lectures on Minimal Surfaces. Cambridge University Press, Cambridge.

Orengo,C.A., Martin,A.M., Hutchinson,G., Jones,S., Jones,D.T., Michie,A.D., Swindells,M.B. and Thornton,J.M. (1998) Acta Crystallogr. D, 54, 1155–1167.[CrossRef][ISI][Medline]

Osserman,R.A. (1986) A Survey of Minimal Surfaces. Dover, New York.

Press,W.H., Flannery,B.P., Teukolsky,S.A. and Vetterling,W.T. (1991) Numerical Recipes in C. Cambridge, Cambridge University Press.

Rost,B., Schneider,R. and Sander,C. (1994) Comput. Applic. Biosci., 10, 53–60.[Abstract]

Russell,R.B. and Barton,G.J. (1994) J. Mol. Biol., 244, 332–350.[CrossRef][ISI][Medline]

Safran,S.A. (1994) Statistical Thermodynamics of Surfaces, Interfaces and Membranes. Addison-Wesley, Boston.

Shamovsky,I.L., Ross,G.M. and Riopelle,R.J. (2000) J. Phys. Chem. B, 104, 11296–11307.[CrossRef][ISI]

Simons,K.T., Ruczinski,I., Kooperberg,C., Fox,B.A., Bystroff,C. and Baker,D. (1999) Proteins, 34, 82–95.[CrossRef][ISI][Medline]

Sippl,M. (1990) J. Mol. Biol., 213, 859–883.[ISI][Medline]

Soyer,A., Chomilier,J., Mornon,J.-P., Jullien,R. and Sadoc,J.-F. (2000) Phys. Rev. Lett., 85, 3532–3535.[CrossRef][ISI][Medline]

Vendruscolo,M., Najmanovich,R. and Domany,E. (2000) Proteins, 38, 134–148.[CrossRef][ISI][Medline]

Wojcik,J., Mornon,J.-P. and Chomilier,J. (1999) J. Mol. Biol., 289, 1469–1490.[CrossRef][ISI][Medline]

Wonnacott,T.H. and Wonnacott,R.J. (1995) Introductory Statistics for Business and Economics. Wiley, New York.

Woodcock,S., Mornon,J.P. and Henrissat,B. (1992) Protein Eng., 5, 629–635.[ISI][Medline]

Zhang,C. and Kim S.H., (2000) J. Mol. Biol., 299, 1075–1089.[CrossRef][ISI][Medline]

Znamenskiy,D., Le Tuan,K., Poupon,A., Chomilier,J. and Mornon,J.-P. (2000) Protein Eng., 13, 407–412.[CrossRef][ISI][Medline]

Received July 25, 2003; revised October 25, 2003; accepted October 30, 2003





This Article
Abstract
FREE Full Text (PDF)
Alert me when this article is cited
Alert me if a correction is posted
Services
Email this article to a friend
Similar articles in this journal
Similar articles in ISI Web of Science
Similar articles in PubMed
Alert me to new issues of the journal
Add to My Personal Archive
Download to citation manager
Request Permissions
Google Scholar
Articles by Znamenskiy, D.
Articles by Mornon, J.-P.
PubMed
PubMed Citation
Articles by Znamenskiy, D.
Articles by Mornon, J.-P.