Loop fold nature of globular proteins
Igor N. Berezovsky1 and Edward N. Trifonov
Department of Structural Biology, The Weizmann Institute of Science, P.O.B. 26, Rehovot 76100, Israel

Abstract
Protein chains make numerous returns in globules, thus forming loops closed by tight residue-to-residue contacts (closed loops). Previous statistical analysis of the sizes and locations of the closed loops in all major protein folds revealed that the loops have an almost standard contour length of 25–30 amino acid residues and follow one after another along the chain. In this work the closed loops of the major folds are presented in three dimensions. A special image filtering procedure is introduced that allows one to visualize the standard size closed loops for the first time. The loop positions along the sequences are verified by detection of loop-end clusters.
Keywords: closed loops/image filtering/major folds/protein folding/protein structure

Introduction
Proteins are characterized in many ways on the basis of structural and evolutionary considerations (Murzin et al., 1995; Orengo et al., 1997). There is one type of structural element which, inexplicably, has never been considered, namely closed loops, i.e. returns of the chain trajectories. Note that these are not loops in the sense of the traditional definition as linkers between elements of secondary structure (Leszczynski and Rose, 1986; Martin et al., 1995; Kwasigroch et al., 1996; Oliva et al., 1997). The so-called U-turns (Kolinski et al., 1997) also do not include the loop closure points. Closed loops of a protein globule connect points distantly positioned along the chain which are thus in contact (defined, for example, as short Cα to Cα distances). Compactness of the proteins implies large numbers of such chain-to-chain contacts, unlike loose Gaussian trajectories only occasionally returning to themselves. The loop fold structure of the globular proteins is not immediately seen, being disguised by frequent trajectory changes due to various elements of secondary structure, primarily α-helices. A simple filtering (smoothing) procedure described below makes the closed loops clearly seen. The loops and the sites of multiple contacts in 10 major folds are analyzed. The maps are constructed in which the closed loops (total average size 24–26 amino acid residues) show the same size preference as in previous work (25–30 residues) where the statistics of the loop sizes of large ensembles of protein structures were analyzed (Berezovsky et al., 2000). Three-dimensional (3-D) structures of 10 major fold types demonstrate that the proteins are universally built of consecutively connected standard closed loops.

Materials and methods
We define the closed loops as continuous sub-trajectories of the folded chains with small Cα-to-Cα distances between their ends (up to 10 Å). The Cα–Cα contacts with immediate neighbors along the sequence are not considered. Five residues are taken as the cut-off value. The standard deviations for the peak values in the loop size histograms (Figure 1) are estimated as square roots of the values in the nearby minima.
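The loop definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the function name `find_closed_loops` and its parameters are hypothetical, the input is assumed to be a plain list of Cα coordinates in Å, and the five-residue cut-off is read here as "ignore pairs separated by five or fewer positions along the chain".

```python
import math

def find_closed_loops(ca_coords, max_dist=10.0, min_separation=5):
    """Return (start, end, distance) for every residue pair that closes a loop.

    ca_coords: list of (x, y, z) C-alpha coordinates, one per residue.
    Indices are 0-based positions in that list (an assumption of this sketch).
    """
    loops = []
    n = len(ca_coords)
    for i in range(n):
        # Assumed reading of the cut-off: skip pairs whose sequence
        # separation is min_separation residues or fewer.
        for j in range(i + min_separation + 1, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            if d <= max_dist:
                loops.append((i, j, d))
    # Tightest closures first, the order in which loops are considered later.
    loops.sort(key=lambda loop: loop[2])
    return loops
```

Every pair returned is only a candidate closure; choosing which candidates become mapped loops is a separate step.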

Fig. 1. Loop size distributions for 101 eukaryotic proteins (a) and 162 prokaryotic proteins (b), each of more than 200 amino acid residues. The protein structures for the analysis are taken from the PDB database. The threshold of allowed sequence similarity is 25% (PDB_SELECT).
Inspection of the positional distribution of the loop ends along the sequences reveals numerous sites (small regions) where many loops originate, as illustrated in Figure 2. We use such diagrams to locate the loops, considering the most prominent ones first. The mapping starts from the sites of multiple end-to-end connections: we first map only the loops with both ends belonging to multiple connection points, as in Figure 2c (see also the flowchart of this procedure in Figure 3). The loops with the tightest end-to-end distances, irrespective of their size, are taken first. The procedure is repeated until the 10 Å limit is reached, although it is normally exhausted at shorter distances. The second round then follows, which involves the standard loops with only one end belonging to the multiple connection points. The last stage involves single isolated standard loops with no multiple connections, again in the order of the tightness of the closure. These stages are also presented in the flowchart (Figure 3).
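The loop-end diagrams used to locate the multiple connection sites can be approximated as per-residue counts. The sketch below is a hedged illustration: the function name `loop_end_profiles` is hypothetical, and taking the position-wise product of the N-end and C-end counts is our reading of the combined diagram of Figure 2c.

```python
def loop_end_profiles(loops, chain_length):
    """Count loop N-ends and C-ends per residue and form their product.

    loops: iterable of (start, end) residue indices, 0-based, start < end.
    Returns three lists of length chain_length.
    """
    n_ends = [0] * chain_length
    c_ends = [0] * chain_length
    for start, end in loops:
        n_ends[start] += 1   # a loop begins at this residue
        c_ends[end] += 1     # a loop ends at this residue
    # Positions where both loop starts and loop ends cluster stand out
    # in the product profile.
    product = [a * b for a, b in zip(n_ends, c_ends)]
    return n_ends, c_ends, product
```

Since the sites are small regions rather than single residues, a real analysis would probably pool counts over a few neighboring positions before looking for maxima; that step is left out here.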

Fig. 3. Flowchart of the loop mapping procedure. Multiple contact sites (MCS), shown as dots in the scheme, correspond to positions of the major maxima in the diagrams of multiple contacts as in Figure 2c.
The uncertainty for the points of multiple contacts is ±2 amino acid residues. The (anti)parallel α- and β-structures form several short Cα–Cα contacts, in which case the shortest is taken. As suggested by the size distribution of the loops (Berezovsky et al., 2000), the least frequent loop size is 15 amino acid residues. Correspondingly, the loops accepted into the mapping procedure could be as small as 16 amino acid residues. Acceptance of such small loops may cause a bias in the final loop size distribution, but as the results below indicate, this is not the case. The procedure described is practically devoid of uncertainties.
In a few cases composite loops have been observed, consisting of two loops with all four ends within a 10 Å distance. Such loops were split into smaller ones. For example, loop 32–83 (4.014 Å, 52) in β Aligned Prism (1vmoA) consists of loops 31–57 (3.740 Å, 27) and 56–76 (7.840 Å, 21). In the case of partial overlapping, the tighter of the two loops was accepted. With overlapping of less than five common amino acid residues, both loops were accepted.
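The staged selection and the overlap rule can be combined into a small greedy sketch. It is an illustration under assumptions, not the authors' implementation: `map_loops` and its parameters are hypothetical names, candidates are assumed to be already limited to closures of at most 10 Å, stage priority is assumed to take precedence over tightness, and composite loops are not split explicitly (a large loop is simply rejected once a tighter overlapping loop has been accepted).

```python
def map_loops(candidates, mcs, tolerance=2, max_overlap=4, min_size=16):
    """Greedy three-stage selection of closed loops.

    candidates: list of (start, end, distance) tuples, start < end.
    mcs: residue positions of multiple contact sites.
    """
    def near_mcs(pos):
        # A loop end belongs to a site if it lies within +/- tolerance residues.
        return any(abs(pos - site) <= tolerance for site in mcs)

    def stage(loop):
        # 0: both ends at contact sites, 1: one end, 2: isolated loop.
        return 2 - (near_mcs(loop[0]) + near_mcs(loop[1]))

    def overlap(a, b):
        # Number of residues shared by two loops.
        return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

    accepted = []
    # Earlier stages first; within a stage, tightest closure first.
    for loop in sorted(candidates, key=lambda l: (stage(l), l[2])):
        if loop[1] - loop[0] + 1 < min_size:
            continue
        # Loops sharing fewer than five residues may coexist.
        if all(overlap(loop, other) <= max_overlap for other in accepted):
            accepted.append(loop)
    return sorted(accepted)
```

Applied to the three β Aligned Prism loops quoted above with no contact sites supplied, this sketch keeps 31–57 and 56–76 and drops the composite loop 32–83, in line with the split described in the text.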
A trajectory smoothing procedure replaces the coordinates of every Cα atom by the average coordinates of seven Cα atoms centered at the given residue.
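The smoothing step amounts to a seven-residue moving average over the Cα trace. The following is a minimal sketch; the name `smooth_trajectory` is hypothetical, and the treatment of the chain termini (averaging over the shorter window that is available) is an assumption, since the text does not specify it.

```python
def smooth_trajectory(ca_coords, window=7):
    """Replace each C-alpha position by the mean of a window centered on it.

    ca_coords: list of (x, y, z) tuples, one per residue.
    Near the termini the window is truncated (an assumption of this sketch).
    """
    half = window // 2
    n = len(ca_coords)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        segment = ca_coords[lo:hi]
        # Average each coordinate over the residues in the window.
        smoothed.append(tuple(
            sum(point[k] for point in segment) / len(segment)
            for k in range(3)
        ))
    return smoothed
```

Averaging over seven residues spans roughly two turns of an α-helix, which is presumably why the helical zig-zags are flattened while the larger loop contours survive.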

Results
The updated histograms of the loop size distributions for prokaryotic and eukaryotic proteins are shown in Figure 1. The histograms are calculated as earlier (Berezovsky et al., 2000) using enlarged sets of structures (162 prokaryotic and 101 eukaryotic proteins). Both plots demonstrate a major preference for loop sizes of 25–30 amino acid residues. The amplitudes at the peak positions show an excess over nearby minima of 603 and 423 occurrences in Figure 1a and b, respectively. This corresponds to over 11 standard deviations in both cases.
The purpose of the loop mapping is to split every protein structure into a set of minimum-sized elementary closed loops (primary loops). An important lead in the process of mapping is the existence of multiple contacts, clusters of N-ends and C-ends of the loops. This is illustrated by Figure 2, where clusters of N-ends (Figure 2a), C-ends (Figure 2b) and both (a product of the two, Figure 2c) are shown for the β Trefoil fold (1afc A).
Figure 4 displays the polypeptide chain trajectories for 10 major folds in standard backbone (left) and smoothed presentations, with the mapped loops indicated by various colors. A striking uniformity is observed among proteins that are otherwise thoroughly different. This is better seen when the smoothed trajectories, not obstructed by the ubiquitous zig-zags of α-helices, are inspected. As Figure 4 amply illustrates, all major types of folds, despite substantial differences in their overall appearances, are equally `spelled' by consecutive arrays of the loops. It is important to note that this is not only a property of the typical sized folds (100–200 amino acid residues), but also of substantially larger molecules (data not shown), such as β-galactosidase made of five domains (Jacobson et al., 1994) or the huge multi-domain muscular protein titin (Politou et al., 1996). In other words, all globular proteins regardless of their type, size or function appear to be largely built of connected loops of the same typical size.
In a few cases the mapping procedure allows for mutually exclusive alternative linear arrays of the loops. Although both variants can be considered in each case, selection of the tighter end-to-end contacts leads to a unique choice. For example, in the case of β Barrel the array 9–40 (4.463, 32), 42–63 (4.377, 22), 63–90 (4.365, 28), 90–122 (4.735, 33), 131–170 (3.739, 40), 167–211 (4.489, 45), 211–232 (4.640, 22) and 230–249 (6.316, 20) can be partially replaced by the overlapping set 68–112 (6.670, 45), 110–150 (4.379, 41), 147–190 (7.744, 44) and 188–227 (5.883, 40). The latter, however, offers loops which are more relaxed and more scattered size-wise. Similarly, in the rotationally symmetrical case of β Horseshoe (1bnh) two alternatives are possible, as shown in Figure 4, bottom.
Apparent secondary contacts appear in composite loops, that is, large loops with smaller internal closures. For example, in β Sandwich (2hlaB) loop 39–80 (4.421, 42) covers the loop 49–68 (4.760, 20). A large loop may consist of several nearly standard-sized loops. For example, loop 254–299 (4.047, 46) of β 8 Propellor consists of loops 254–272 (5.156, 19) and 273–289 (3.814, 17); similarly, loop 324–384 (4.250, 61) is made of smaller ones, 327–343 (4.422, 17) and 348–379 (4.936, 32); finally, loop 468–509 splits into 466–482 (6.187, 17) and 485–506 (3.359, 22). In the β Sandwich (2hlaB), region 39–80 can be considered either as a composite loop or as the overlap with another tight loop 28–63 (3.741, 36). In both cases, linearity and nearly standard size are maintained. Thus, composite loops responsible for the secondary contacts between primary loops do not interfere with the general linear arrangement of the primary loops. The distant contacts may be responsible for 3-D stabilization of sequentially engaged primary loops during protein folding.
The mean value of the loop sizes in the maps of Figure 2 is 24–26 amino acid residues, matching well the preferential size observed in the overall histogram of the loop sizes, as in Figure 1.
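The significance quoted for the histogram peaks can be written out explicitly. The symbols below are ours, not the authors', and the relation assumes simple counting statistics, with the count at the nearby minimum, N_min, setting the standard deviation as described in Materials and methods:

```latex
\sigma \approx \sqrt{N_{\min}}, \qquad
z = \frac{N_{\text{peak}} - N_{\min}}{\sqrt{N_{\min}}}
```

Under this reading, an excess of 603 occurrences with z > 11 implies N_min below about 3000, and an excess of 423 implies N_min below about 1480. These bounds are only what the stated numbers allow; the actual minimum counts are not given in the text.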




Fig. 4. Major protein folds in traditional backbone presentation (left of each single-column group) and in smoothed form (right of each single-column group): Non-Bundle (1eca), β Roll (1pht), β Sandwich (2hla B), β Trefoil (1afc A), β Aligned Prism (1vmo A), β Barrel (4tim A), β 8 Propellor (3aah A), β 3 Solenoid (2pec), β 3-Layer Sandwich (1pya B), β Horseshoe (1bnh). The alternative arrays for the β Horseshoe (1bnh) are 2–27 (4.267, 26), 26–53 (4.131, 28), 54–82 (4.007, 29), 83–110 (4.135, 28), 112–141 (4.227, 30), 140–167 (3.907, 28), 165–194 (3.992, 30), 197–224 (4.045, 28), 226–255 (4.436, 30), 254–281 (4.114, 28), 282–310 (4.383, 29), 311–338 (4.271, 28), 339–367 (4.451, 29), 368–395 (4.276, 28), 396–424 (4.390, 29), 430–456 (4.242, 27) (average Cα–Cα distance is 4.2 Å); and 2–27 (4.267, 26), 32–60 (4.576, 29), 60–89 (4.969, 30), 94–120 (4.862, 27), 119–147 (4.798, 29), 151–177 (4.273, 27), 178–205 (5.077, 28), 264–291 (4.543, 28), 292–320 (4.927, 29), 322–348 (4.277, 27), 349–376 (4.765, 28), 378–404 (4.596, 27), 401–429 (4.438, 29), 430–456 (4.242, 27). The second set consists of the loops with larger end-to-end distances (average Cα–Cα distance is 4.6 Å). These can be considered rather as secondary contacts in the polypeptide chain trajectory. Chain sections of various colors correspond to the nearly standard size closed loops mapped as described.

Discussion
The preferred size of the closed loops, 25–30 amino acid residues, may originate from polymer statistical properties of the polypeptide chains. It is in the range of the optimum size for ring (loop) closure of the polypeptide chain with a mixed amino acid sequence (Berezovsky et al., 2000). At first sight this may appear as a statistical feature of no relevance to the biological functions of proteins. The loops as such are obviously important building blocks of the protein structure and may well have been under selection pressure during protein evolution. Both the size of the loops and their actual positions along the protein sequence could have been selected. It was, perhaps, natural from the beginning to keep unchanged the optimum size as enforced by the polymer statistics. As to the actual positions of the loop ends in the protein sequence, selection most likely has taken place. Indeed, the sequence, evolutionarily driven, would have matching sites, making `stitches' at key positions to guarantee an efficient and unique loop pattern. Such hypothetical stitches obviously should play an important role both for primary looping (linear arrangement of nearly standard-sized loops) and for secondary interactions. The loop closure might also have been an important stage in the earliest evolution of proteins when the chain lengths were approaching the loop closure size. Since there are many proteins of such small size which are biologically active (Douglass et al., 1984), one could speculate that the observed nearly standard-sized loops may have been independent active entities at some early stage of protein evolution. Later, owing to fusion of the respective genes, the small loop-like proteins may have turned into larger multiloop structures. The loop closure dramatically decreases the number of alternative conformations that the chain may acquire, thus fixing selected conformations. The chain-to-chain contacts are also advantageous energetically, providing the necessary stability to the loops and their associations in multiloop structures.
The linear arrangement of the loops immediately suggests the sequence of events during cotranslational protein folding. The folding process may start with the formation of the contact closing the first primary loop. Other loops would be formed sequentially, involving the corresponding interacting sites, until completion of the synthesis. Already at this initial stage the sequence would thus provide instructions for the protein folding (looping). A whole arsenal of current concepts about protein structure suggests further, secondary events: formation of α-helices, of β-sheets, of secondary hydrophobic and polar loop-to-loop contacts, etc. It is not excluded, of course, that the formation of these secondary elements may occur already during the primary looping as well as subsequently.

Notes
1 To whom correspondence should be addressed. E-mail: igor.berezovsky{at}weizmann.ac.il 

Acknowledgments
The authors are grateful to A.Grosberg for stimulating discussions and E.Yakobson for critical reading of the manuscript. I.N.B. is a Post-Doctoral Fellow of the Feinberg Graduate School, Weizmann Institute of Science.

References
Berezovsky,I.N., Grosberg,A.Y. and Trifonov,E.N. (2000) FEBS Lett., 466, 283–286.
Douglass,J., Civelli,O. and Herbert,E. (1984) Annu. Rev. Biochem., 53, 665–714.
Jacobson,R.H., Zhang,X.J., DuBose,R.F. and Matthews,B.W. (1994) Nature, 369, 761–766.
Kolinski,A., Skolnick,J., Godzik,A. and Hu,W.-P. (1997) Proteins: Struct. Funct. Genet., 27, 290–308.
Kwasigroch,J.M., Chomilier,J. and Mornon,J.P. (1996) J. Mol. Biol., 259, 855–872.
Leszczynski,J.F. and Rose,G.D. (1986) Science, 234, 849–855.
Martin,A.C.R., Toda,K., Stirk,H.J. and Thornton,J.M. (1995) Protein Eng., 8, 1093–1101.
Murzin,A., Brenner,S.E., Hubbard,T.J.P. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Oliva,B., Bates,P.A., Querol,E., Aviles,F.X. and Sternberg,M.J.E. (1997) J. Mol. Biol., 259, 814–830.
Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.
Politou,A.S., Gautel,M., Improta,S., Vangelista,L. and Pastore,A. (1996) J. Mol. Biol., 255, 604–616.
Received October 18, 2000;
revised February 26, 2001;
accepted March 12, 2001.