Loop fold nature of globular proteins

Igor N. Berezovsky¹ and Edward N. Trifonov

Department of Structural Biology, The Weizmann Institute of Science, P.O.B. 26, Rehovot 76100, Israel


    Abstract
 
Protein chains make numerous returns in globules, thus forming loops, closed by tight residue-to-residue contacts—closed loops. Previous statistical analysis of the sizes and locations of the closed loops in all major protein folds revealed that the loops have an almost standard contour length of 25–30 amino acid residues and follow one after another along the chain. In this work the closed loops of the major folds are presented in three dimensions. A special image filtering procedure is introduced that allows one to visualize the standard size closed loops for the first time. The loop positions along the sequences are verified by detection of loop-end clusters.

Keywords: closed loops/image filtering/major folds/protein folding/protein structure


    Introduction
 
Proteins are characterized in many ways on the basis of structural and evolutionary considerations (Murzin et al., 1995; Orengo et al., 1997). There is one type of structural element which, inexplicably, has never been considered, namely closed loops, i.e. returns of the chain trajectories. Note that these are not loops in the traditional sense of linkers between elements of secondary structure (Leszczynski and Rose, 1986; Martin et al., 1995; Kwasigroch et al., 1996; Oliva et al., 1997). The so-called U-turns (Kolinski et al., 1997) also do not include the loop closure points. Closed loops of a protein globule connect points that are distant along the chain but in contact in space (defined, for example, by short Cα-to-Cα distances). Compactness of the proteins implies large numbers of such chain-to-chain contacts, unlike loose Gaussian trajectories, which only occasionally return to themselves. The loop fold structure of the globular proteins is not immediately seen, being disguised by frequent trajectory changes due to various elements of secondary structure, primarily α-helices. A simple filtering (smoothing) procedure described below makes the closed loops clearly visible. The loops and the sites of multiple contacts in 10 major folds are analyzed. Maps are constructed in which the closed loops (overall average size 24–26 amino acid residues) show the same size preference as in previous work (25–30 residues), where the statistics of the loop sizes of large ensembles of protein structures were analyzed (Berezovsky et al., 2000). Three-dimensional (3-D) structures of 10 major fold types demonstrate that the proteins are universally built of consecutively connected standard closed loops.


    Materials and methods
 
We define the closed loops as continuous sub-trajectories of the folded chains with small Cα-to-Cα distances between their ends (up to 10 Å). The Cα–Cα contacts between near neighbors along the sequence are not considered; five residues are taken as the cut-off value for the sequence separation. The standard deviations for the peak values in the loop size histograms (Figure 1) are estimated as square roots of the values in the nearby minima.
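As an illustration of the definition above, the following minimal Python sketch enumerates candidate closed loops from a list of Cα coordinates. It is not the authors' code: the names `find_closed_loops` and `loop_size` are hypothetical, and the sketch assumes that the five-residue cut-off means a minimum sequence separation of five positions between the two loop ends.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]
# (index of N-end, index of C-end, end-to-end distance in angstroms)
Loop = Tuple[int, int, float]


def find_closed_loops(ca: List[Coord],
                      max_dist: float = 10.0,
                      min_sep: int = 5) -> List[Loop]:
    """List all residue pairs (i, j) whose C-alpha atoms are in contact.

    A pair counts as a closed loop when the ends are at least `min_sep`
    positions apart along the chain (assumed reading of the cut-off)
    and at most `max_dist` angstroms apart in space.
    """
    loops: List[Loop] = []
    for i in range(len(ca)):
        for j in range(i + min_sep, len(ca)):
            d = math.dist(ca[i], ca[j])
            if d <= max_dist:
                loops.append((i, j, d))
    return loops


def loop_size(loop: Loop) -> int:
    """Contour length of a loop in residues, counting both ends."""
    return loop[1] - loop[0] + 1
```

A histogram of `loop_size` over all loops of many chains would correspond to the kind of distribution shown in Figure 1; reading the coordinates from PDB files is left out of the sketch.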



 
Fig. 1. Loop size distributions for 101 eukaryotic proteins (a) and 162 prokaryotic proteins (b) of more than 200 amino acid residues. The protein structures for the analysis are taken from the PDB. The threshold of allowed sequence similarity is taken at 25% (PDB_SELECT).

 
Inspection of the positional distribution of the loop ends along the sequences reveals numerous sites (small regions) where many loops originate, as illustrated in Figure 2. We use such diagrams to locate the loops, considering the most prominent ones first. The mapping is started from the sites of multiple end-to-end connections. That is, we first map only the loops with both ends belonging to multiple connection points, as in Figure 2c (see also the flowchart of this procedure in Figure 3). The loops with the tightest end-to-end distances, irrespective of their size, are taken first. The procedure is repeated until the 10 Å limit is reached, although it is normally exhausted at shorter distances. The second round then follows, which involves the standard loops with only one end belonging to the multiple connection points. The last stage involves single isolated standard loops with no multiple connections, again in order of the tightness of the closure. All three stages are shown in the flowchart (Figure 3).
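The loop-end diagrams described above can be sketched as simple per-residue counts. The Python sketch below is only one plausible reading: the function name `loop_end_profiles` and the optional `window` parameter (which spreads each end over neighboring residues, in the spirit of "small regions") are assumptions, as is taking the product of panels (a) and (b) position by position.

```python
from typing import List, Tuple

# (index of N-end, index of C-end, end-to-end distance in angstroms)
Loop = Tuple[int, int, float]


def loop_end_profiles(loops: List[Loop],
                      n_residues: int,
                      window: int = 0
                      ) -> Tuple[List[int], List[int], List[int]]:
    """Count loop N-ends and C-ends per residue and form their product.

    With window > 0 each end also counts for residues within +/- window
    positions (an assumed way to model small regions, not from the paper).
    """
    n_ends = [0] * n_residues
    c_ends = [0] * n_residues
    for start, end, _dist in loops:
        for k in range(max(0, start - window),
                       min(n_residues, start + window + 1)):
            n_ends[k] += 1
        for k in range(max(0, end - window),
                       min(n_residues, end + window + 1)):
            c_ends[k] += 1
    # High values mark residues where many loops both begin and end.
    product = [a * b for a, b in zip(n_ends, c_ends)]
    return n_ends, c_ends, product
```

Under these assumptions, the major maxima of `product` would play the role of the multiple connection points from which the mapping starts.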



 
Fig. 2. Example of the distribution of loop ends along the protein sequence (β Trefoil fold, 1afc A). (a) N-ends of the tight loops (Cα–Cα distance up to 10 Å); (b) C-ends of the loops; (c) product of (a) and (b). Panel (c) displays the sites of multiple contacts used for the mapping of primary loops.

 


 
Fig. 3. Flowchart of the loop mapping procedure. Multiple contact sites (MCS), shown as dots in the scheme, correspond to positions of the major maxima in the diagrams of multiple contacts, as in Figure 2c.

 
The uncertainty for the points of multiple contacts is ±2 amino acid residues. The (anti)parallel α- and β-structures form several short Cα–Cα contacts, in which case the shortest is taken. As suggested by the size distribution of the loops (Berezovsky et al., 2000), the least frequent loop size is 15 amino acid residues. Correspondingly, the loops accepted into the mapping procedure could be as small as 16 amino acid residues. Acceptance of such small loops might bias the final loop size distribution but, as the results below indicate, this is not the case. The procedure described is practically devoid of uncertainties.

In a few cases composite loops have been observed, consisting of two loops with all four ends within a 10 Å distance. Such loops were split into smaller ones. For example, loop 32–83 (4.014 Å, 52 residues) in β Aligned Prism (1vmo A) consists of loops 31–57 (3.740 Å, 27 residues) and 56–76 (7.840 Å, 21 residues). In the case of partial overlap, the tighter of the two loops was accepted. When the overlap was less than five common amino acid residues, both loops were accepted.
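The selection rules above (tightest loops first, a minimum accepted size of 16 residues, and rejection of a looser loop that shares five or more residues with an accepted one) can be illustrated by a simplified greedy sketch. It is not the authors' procedure: the three-stage ordering by multiple contact sites is deliberately left out, composite loops are handled only implicitly through the tightness ordering, and the names `overlap` and `map_primary_loops` are hypothetical.

```python
from typing import List, Tuple

# (first residue, last residue, end-to-end distance in angstroms)
Loop = Tuple[int, int, float]


def overlap(a: Loop, b: Loop) -> int:
    """Number of residues shared by two loops (both ends inclusive)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def map_primary_loops(loops: List[Loop],
                      min_size: int = 16,
                      max_shared: int = 4) -> List[Loop]:
    """Greedy selection of a linear array of loops, tightest first.

    A candidate is accepted if it has at least `min_size` residues and
    shares at most `max_shared` residues with every accepted loop
    (i.e. fewer than five common residues).
    """
    accepted: List[Loop] = []
    for cand in sorted(loops, key=lambda lp: lp[2]):
        if cand[1] - cand[0] + 1 < min_size:
            continue
        if all(overlap(cand, acc) <= max_shared for acc in accepted):
            accepted.append(cand)
    return sorted(accepted)  # order along the chain
```

Applied to the three β Aligned Prism loops quoted above, this sketch keeps 31–57 and 56–76 and rejects the composite loop 32–83, because 32–83 is looser than 31–57 and overlaps it almost entirely.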

A trajectory smoothing procedure replaces the coordinates of every Cα atom by the average coordinates of seven Cα atoms centered at the given residue.
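The smoothing step is a seven-point moving average over the Cα trace, which the following minimal Python sketch illustrates. The function name `smooth_trajectory` is hypothetical, and the treatment of the chain termini (averaging over whatever part of the window is available) is an assumption, since the text does not specify it.

```python
from typing import List, Tuple

Coord = Tuple[float, float, float]


def smooth_trajectory(ca: List[Coord], window: int = 7) -> List[Coord]:
    """Replace each C-alpha position by the mean of a centered window.

    Near the chain ends the window is truncated to the residues that
    exist (assumed behavior; not stated in the source).
    """
    half = window // 2
    smoothed: List[Coord] = []
    for i in range(len(ca)):
        lo = max(0, i - half)
        hi = min(len(ca), i + half + 1)
        n = hi - lo
        x = sum(p[0] for p in ca[lo:hi]) / n
        y = sum(p[1] for p in ca[lo:hi]) / n
        z = sum(p[2] for p in ca[lo:hi]) / n
        smoothed.append((x, y, z))
    return smoothed
```

A seven-residue window spans roughly two turns of an α-helix, so the helical zig-zag largely averages out while the overall course of the chain is retained.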


    Results
 
The updated histograms of the loop size distributions for eukaryotic and prokaryotic proteins are shown in Figure 1. The histograms are calculated as earlier (Berezovsky et al., 2000), using enlarged sets of structures (101 eukaryotic and 162 prokaryotic proteins). Both plots demonstrate a major preference for loop sizes of 25–30 amino acid residues. The amplitudes at the peak positions show an excess over the nearby minima of 603 and 423 occurrences in Figure 1a and b, respectively. This corresponds to over 11 standard deviations in both cases.

The purpose of the loop mapping is to split every protein structure into a set of minimum-sized elementary closed loops (primary loops). An important lead in the process of mapping is the existence of multiple contacts, clusters of N-ends and C-ends of the loops. This is illustrated by Figure 2, where clusters of N-ends (Figure 2a), C-ends (Figure 2b) and both (the product of the two, Figure 2c) are shown for the β Trefoil fold (1afc A).

Figure 4 displays the polypeptide chain trajectories for 10 major folds in standard backbone (left) and smoothed (right) presentations, with the mapped loops indicated by various colors. A striking uniformity among otherwise thoroughly different proteins is observed. This is better seen when the smoothed trajectories, not obstructed by the ubiquitous zig-zags of α-helices, are inspected. As Figure 4 amply illustrates, all major types of folds, despite substantial differences in their overall appearance, are equally 'spelled' by consecutive arrays of the loops. It is important to note that this is a property not only of the typical-sized folds (100–200 amino acid residues), but also of substantially larger molecules (data not shown), such as β-galactosidase, made of five domains (Jacobson et al., 1994), or the huge multi-domain muscle protein titin (Politou et al., 1996).
In other words, all globular proteins, regardless of their type, size or function, appear to be largely built of connected loops of the same typical size.

In a few cases the mapping procedure allows for mutually exclusive alternative linear arrays of the loops. Although both variants can be considered in each case, selection of the tighter end-to-end contacts leads to a unique choice. For example, in the case of αβ Barrel the array 9–40 (4.463, 32), 42–63 (4.377, 22), 63–90 (4.365, 28), 90–122 (4.735, 33), 131–170 (3.739, 40), 167–211 (4.489, 45), 211–232 (4.640, 22) and 230–249 (6.316, 20) can be partially replaced by the overlapping set 68–112 (6.670, 45), 110–150 (4.379, 41), 147–190 (7.744, 44) and 188–227 (5.883, 40); the values in parentheses are the end-to-end Cα–Cα distance in Å and the loop size in residues. The latter set, however, offers loops which are more relaxed and more scattered in size. Similarly, in the rotationally symmetrical case of αβ Horseshoe (1bnh) two alternatives are possible, as shown in Figure 4, bottom.

Apparent secondary contacts appear in composite loops, that is, large loops with smaller internal closures. For example, in β Sandwich (2hla B) loop 39–80 (4.421, 42) covers the loop 49–68 (4.760, 20). A large loop may consist of several nearly standard-sized loops. For example, loop 254–299 (4.047, 46) of β 8 Propellor consists of loops 254–272 (5.156, 19) and 273–289 (3.814, 17); similarly, loop 324–384 (4.250, 61) is made of smaller ones, 327–343 (4.422, 17) and 348–379 (4.936, 32); finally, loop 468–509 splits into 466–482 (6.187, 17) and 485–506 (3.359, 22). In the β Sandwich (2hla B), region 39–80 can be considered either as a composite loop or as overlapping with another tight loop, 28–63 (3.741, 36). In both cases, linearity and nearly standard size are maintained. Thus, composite loops responsible for the secondary contacts between primary loops do not interfere with the general linear arrangement of the primary loops.
The distant contacts may be responsible for 3-D stabilization of sequentially engaged primary loops during protein folding. The mean value of the loop sizes in the maps of Figure 4 is 24–26 amino acid residues, matching well the preferential size observed in the overall histogram of the loop sizes (Figure 1).
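The significance estimate quoted for the histogram peaks can be written out explicitly. In the sketch below, $N_{\mathrm{peak}}$ and $N_{\mathrm{min}}$ (symbols introduced here for illustration, not used in the original) denote the histogram counts at a peak and at the nearby minimum; the standard deviation is taken as the square root of the count at the minimum, as stated in Materials and methods.

```latex
\[
  \sigma \approx \sqrt{N_{\mathrm{min}}}, \qquad
  z = \frac{N_{\mathrm{peak}} - N_{\mathrm{min}}}{\sqrt{N_{\mathrm{min}}}} .
\]
% With the reported excesses N_peak - N_min and z > 11:
\[
  N_{\mathrm{min}} < \left(\frac{603}{11}\right)^{2} \approx 3005
  \quad \text{(Figure 1a)}, \qquad
  N_{\mathrm{min}} < \left(\frac{423}{11}\right)^{2} \approx 1479
  \quad \text{(Figure 1b)} .
\]
```

These bounds are only what the quoted numbers imply under this reading; the actual counts at the minima are not given in the text.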






 
Fig. 4. Major protein folds in traditional backbone presentation (left in each pair) and in smoothed form (right in each pair): α Non-Bundle (1eca), β Roll (1pht), β Sandwich (2hla B), β Trefoil (1afc A), β Aligned Prism (1vmo A), αβ Barrel (4tim A), β 8 Propellor (3aah A), β 3 Solenoid (2pec), αβ 3-Layer Sandwich (1pya B), αβ Horseshoe (1bnh). The alternative arrays for the αβ Horseshoe (1bnh), with the end-to-end Cα–Cα distance in Å and the loop size in residues given in parentheses, are 2–27 (4.267, 26), 26–53 (4.131, 28), 54–82 (4.007, 29), 83–110 (4.135, 28), 112–141 (4.227, 30), 140–167 (3.907, 28), 165–194 (3.992, 30), 197–224 (4.045, 28), 226–255 (4.436, 30), 254–281 (4.114, 28), 282–310 (4.383, 29), 311–338 (4.271, 28), 339–367 (4.451, 29), 368–395 (4.276, 28), 396–424 (4.390, 29), 430–456 (4.242, 27) (average Cα–Cα distance 4.2 Å); and 2–27 (4.267, 26), 32–60 (4.576, 29), 60–89 (4.969, 30), 94–120 (4.862, 27), 119–147 (4.798, 29), 151–177 (4.273, 27), 178–205 (5.077, 28), 264–291 (4.543, 28), 292–320 (4.927, 29), 322–348 (4.277, 27), 349–376 (4.765, 28), 378–404 (4.596, 27), 401–429 (4.438, 29), 430–456 (4.242, 27) (average Cα–Cα distance 4.6 Å). The second set consists of loops with larger end-to-end distances; these can rather be considered as secondary contacts in the polypeptide chain trajectory. Chain sections of various colors correspond to the nearly standard size closed loops mapped as described.

 

    Discussion
 
The preferred size of the closed loops, 25–30 amino acid residues, may originate from polymer statistical properties of the polypeptide chains. It is in the range of the optimum size for ring (loop) closure of a polypeptide chain with a mixed amino acid sequence (Berezovsky et al., 2000). At first sight this may appear to be a statistical feature of no relevance to the biological functions of proteins. However, the loops as such are obviously important building blocks of the protein structure and may well have been under selection pressure during protein evolution. Both the size of the loops and their actual positions along the protein sequence could have been selected. It was, perhaps, natural from the beginning to keep unchanged the optimum size as enforced by the polymer statistics. As to the actual positions of the loop ends in the protein sequence, selection most likely has taken place. Indeed, the sequence, evolutionarily driven, would have matching sites, making 'stitches' at key positions to guarantee an efficient and unique loop pattern. Such hypothetical stitches should play an important role both for primary looping (linear arrangement of nearly standard-sized loops) and for secondary interactions. The loop closure might also have been an important stage in the earliest evolution of proteins, when the chain lengths were approaching the loop closure size. Since there are many biologically active proteins of such small size (Douglass et al., 1984), one could speculate that the observed nearly standard-sized loops may have been independent active entities at some early stage of protein evolution. Later, owing to fusion of the respective genes, the small loop-like proteins may have turned into larger multiloop structures. The loop closure dramatically decreases the number of alternative conformations that the chain may acquire, thus fixing selected conformations. The chain-to-chain contacts are also advantageous energetically, providing the necessary stability to the loops and their associations in multiloop structures.

The linear arrangement of the loops immediately suggests the sequence of events during cotranslational protein folding. The folding process may start with the formation of the contact closing the first primary loop. Other loops would then be formed sequentially, involving the corresponding interacting sites, until completion of the synthesis. Already at this initial stage the sequence would thus provide instructions for the protein folding (looping). A whole arsenal of current concepts about protein structure suggests further, secondary events: formation of α-helices, of β-sheets, of secondary hydrophobic and polar loop-to-loop contacts, etc. It is not excluded, of course, that the formation of these secondary elements may occur already during the primary looping as well as subsequently.


    Notes
 
¹ To whom correspondence should be addressed. E-mail: igor.berezovsky@weizmann.ac.il


    Acknowledgments
 
The authors are grateful to A. Grosberg for stimulating discussions and E. Yakobson for critical reading of the manuscript. I.N.B. is a Post-Doctoral Fellow of the Feinberg Graduate School, Weizmann Institute of Science.


    References
 
Berezovsky,I.N., Grosberg,A.Y. and Trifonov,E.N. (2000) FEBS Lett., 466, 283–286.

Douglass,J., Civelli,O. and Herbert,E. (1984) Annu. Rev. Biochem., 53, 665–714.

Jacobson,R.H., Zhang,X.J., DuBose,R.F. and Matthews,B.W. (1994) Nature, 369, 761–766.

Kolinski,A., Skolnick,J., Godzik,A. and Hu,W.-P. (1997) Proteins: Struct. Funct. Genet., 27, 290–308.

Kwasigroch,J.M., Chomilier,J. and Mornon,J.P. (1996) J. Mol. Biol., 259, 855–872.

Leszczynski,J.F. and Rose,G.D. (1986) Science, 234, 849–855.

Martin,A.C.R., Toda,K., Stirk,H.J. and Thornton,J.M. (1995) Protein Eng., 8, 1093–1101.

Murzin,A., Brenner,S.E., Hubbard,T.J.P. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.

Oliva,B., Bates,P.A., Querol,E., Aviles,F.X. and Sternberg,M.J.E. (1997) J. Mol. Biol., 259, 814–830.

Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.

Politou,A.S., Gautel,M., Improta,S., Vangelista,L. and Pastore,A. (1996) J. Mol. Biol., 255, 604–616.

Received October 18, 2000; revised February 26, 2001; accepted March 12, 2001.