A pentapeptide-based method for protein secondary structure prediction

A. Figureau1, M.A. Soto2,3 and J. Tohá2

1 Institut de Physique Nucléaire de Lyon, Université Claude Bernard, 69622 Villeurbanne Cedex, France and 2 Departamento de Física, Fac. Ciencias Físicas y Matemáticas, Universidad de Chile, Casilla 487-3, Santiago, Chile


    Abstract
 
We present a new method for protein secondary structure prediction, based on the recognition of well-defined pentapeptides in a large databank. Using a databank of 635 protein chains, we obtained a success rate of 68.6%. We show that progress is achieved when the databank is enlarged, when the 20 amino acids are adequately grouped into 10 sets and when more pentapeptides are attributed one of the defined conformations, α-helices or β-strands. The analysis of the model indicates that the essential variable is the number of pentapeptides of well-defined structure in the database. Our model is simple, does not rely on arbitrary parameters and allows a detailed analysis of the results of each chosen hypothesis.

Keywords: pattern recognition/pentapeptide dictionary/pentapeptide recognition/protein structure prediction


    Introduction
 
During the last decade, molecular databases have grown rapidly, principally as a result of genome sequencing projects (Apweiler et al., 2001), and the gap between known protein sequences and structures has consequently widened. Release 40.5 (December 2001) of the Swiss-Prot protein database contains 103 079 entries, comprising 3.7 × 10⁷ amino acids, and the last release of EMBL protein structure data (September 2001) contains 25 351 chains with 90% sequence homology (Hobohm and Sander, 1994).

Hence, the development of theoretical methods to predict protein structure has become an important research field. During the last 40 years a variety of such algorithms has been published, using different approaches (Chou and Fasman, 1974; Ptitsyn and Finkelstein, 1983; Biou et al., 1988; Levin et al., 1993; Rost and Sander, 1993, 1995; Donnelly et al., 1994; Eisenhaber et al., 1995; Srinivasan and Rose, 1995; Frishman and Argos, 1997; Salamov and Solovyev, 1997; Chandonia and Karplus, 1999; Cuff and Barton, 2000), and continuous progress has been reported over the years (Rost, 2001).

One of the possible difficulties in obtaining satisfactory results is related to the size of databanks, which are limited to very small subsets of existing proteins (even more so if redundant information is eliminated). A second limitation lies in the reduction from three-dimensional data to secondary structure sequences. Methods developed to assign defined conformations to each residue of primary sequences (Kabsch and Sander, 1983; Frishman and Argos, 1995; King and Johnson, 1999) give results differing on average by 15% (Colloc’h et al., 1993; Labesse et al., 1997). This value reflects the different approximations used to evaluate residue interactions. In particular, the role of long-range interactions (Cootes et al., 1998) should be included.

The identification of sequence motifs represents an approach which could in principle use most of the information contained in the banks (Rooman and Wodak, 1988; Shestopalov, 1990). In a previous work (Figureau et al., 1999), we proposed to circumvent the lack of data by comparing a test motif with a limited variety of known structural patterns. The goal was to produce a complete dictionary of archetypal short pentapeptides. This choice was justified by physico-chemical considerations: an α-helix can essentially be maintained through an NH group of one residue H-bonded with a CO group three positions further along the chain. Similarly, β-strands are defined only if at least three consecutive residues in the chain are bonded through β-bridges to three others. The idea was to select regular structures among all the existing ones, so that small differences in the sequence should correspond to small variations in the conformation. The choice of pentamers was a compromise. On the one hand, longer polypeptides would be more appropriate, since they offer less structural ambiguity, as will be discussed later. However, such a choice has the disadvantage of greatly reducing the data available to construct the databank for α- and even more for β-structures, because such long regular segments are more difficult to form and are more rarely found in proteins. On the other hand, shorter polypeptides are more numerous but have a more flexible structure, so they are present in various types of defined structures.

The selection of short peptides obviously makes our method insensitive to the long-range interactions that participate in the native conformation of the protein. Hence it is not surprising that identical pentamers can adopt different conformations: this was noted in the pioneering work of Kabsch and Sander (1984). The same observation also applies, although to a lesser extent, to polypeptides as long as eight residues or even longer (Sudarsanam, 1998). Moreover, it has been shown statistically that non-local through-space interactions (Cootes et al., 1998), and also amino acid surroundings (Rogov and Nekrasov, 2001), are important for the 3D organization of a protein molecule.

Our goal was to compare any given short peptide with a dictionary extracted from existing data, so that the prediction rate can potentially be improved as this dictionary becomes more complete, with at least two repetitions of each pentapeptide.


    Methods
 
Databanks

Data were taken from the EMBL release of May 14, 1997 (Hobohm and Sander, 1994). Proteins were determined with a resolution better than or equal to 3 Å. We chose the 25% pdb_select list, which contains 635 chains, corresponding to 1.7 × 10⁵ residues. The secondary structure of protein residues corresponds to the DSSP method (Kabsch and Sander, 1983). We grouped the eight possible structural classes into three types: α-helix (H, G, I), β-strands (E) and coil (B, T, S, C). Note that we included B in coils because, as an isolated bridge, it does not represent a regular geometric structure. This is more consistent with our approach (King and Johnson, 1999). There are 58 513, 36 616 and 74 020 residues of α, β and coil type, respectively (169 149 in total), corresponding to 34.6, 21.6 and 43.8% of the residues.
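
The reduction from eight DSSP classes to three types, and the proportions quoted above, can be illustrated with a minimal Python sketch. The output labels 'H', 'E' and 'C' and the function name reduce_dssp are illustrative choices, not part of the original work; treating any unlisted symbol as coil is an assumption.

```python
# Reduction of the eight DSSP classes to three types, as described in the text.
# Output labels 'H' (alpha), 'E' (beta) and 'C' (coil) are illustrative.
DSSP_TO_THREE = {
    "H": "H", "G": "H", "I": "H",            # helical classes -> alpha
    "E": "E",                                 # extended strand -> beta
    "B": "C", "T": "C", "S": "C", "C": "C",   # isolated bridge, turn, bend, coil -> coil
}

def reduce_dssp(dssp_string):
    """Map a DSSP string to three states; unlisted symbols count as coil (assumption)."""
    return "".join(DSSP_TO_THREE.get(ch, "C") for ch in dssp_string)

# Residue counts quoted in the text (alpha, beta, coil).
counts = {"H": 58513, "E": 36616, "C": 74020}
total = sum(counts.values())
percentages = {k: round(100 * v / total, 1) for k, v in counts.items()}
```

The counts sum to the 169 149 residues of the bank, and the rounded percentages reproduce the 34.6, 21.6 and 43.8% given above.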

Pentapeptides are obtained using the sliding window method: a sequence of n residues gives rise to n – 4 pentapeptides. The type of a segment is defined as α (or β) if it contains three consecutive residues of that type, in contrast to our previous work (Figureau et al., 1999), where the type of the segment was defined from all five consecutive residues of the pentapeptide. This is not equivalent to the use of triplets, because the whole pentapeptide containing three successive residues of the same type is attributed this structure. With this definition, the structures defined as regular are now much more numerous, which gives a better balance between coils and the other types.
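
As an illustration of the sliding window and of the three-consecutive-residue rule, a short Python sketch follows. It assumes that the sequence and its three-state structure string ('H', 'E', 'C', hypothetical labels) have equal length; the function names are ours.

```python
def sliding_pentapeptides(sequence, structure):
    """Yield (pentapeptide, structure window) pairs; n residues give n - 4 windows."""
    for start in range(len(sequence) - 4):
        yield sequence[start:start + 5], structure[start:start + 5]

def pentapeptide_type(window):
    """Return 'H' or 'E' if the window holds three consecutive residues of that type, else 'C'."""
    for kind in ("H", "E"):
        if kind * 3 in window:
            return kind
    return "C"

# A nine-residue toy chain gives 9 - 4 = 5 pentapeptides.
windows = list(sliding_pentapeptides("ACDEFGHIK", "CHHHHCEEE"))
types = [pentapeptide_type(ss) for _, ss in windows]
```

A window of five residues cannot contain both three consecutive α and three consecutive β residues, so the order in which the two types are tested does not matter.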

As a result, we construct a bank of pentapeptide patterns. The proportions of α, β and coil pentapeptides are 35.2, 20.6 and 44.2%, respectively. As expected, these proportions are similar to those of residues. The total number of pentapeptides in the bank is 165 903, which should be compared with the total number of possible pentapeptides. Since not all random sequences of five amino acids can be expected to occur in proteins, this number is probably smaller than 20⁵ = 3 200 000, but in any case much larger than our sample, which contains only 155 928 different peptides (those found more than once in the bank are rather few). It is therefore convenient to restrict the representation space further.

Amino acid grouping

We grouped the 20 amino acids into sets. The only condition that we imposed on the choice of these groups was to respect as much as possible the separation observed between the α and β pentapeptide sets of the bank. In this calculation, α (β) pentapeptides were homogeneously constituted by five α (β) residues instead of three: this allows less flexibility in the type of the pentapeptides and gives more confidence in the separation of the two groups (Figureau et al., 1999). The number of distinct groups of amino acids was restricted to 10, going by a step-by-step optimization from 20 amino acids to 19, 18, etc. The size of the space of pentapeptides is then reduced by a factor 2⁵ = 32.

The 10 sub-groups of amino acids giving the best separation between α and β pentapeptides are the following: (C–F–W–Y), (I–V), (L–M), (H–Q–R), (E–K), (D–N), (S–P), (A), (G) and (T). In this set, equivalent amino acids have roughly similar physico-chemical properties, although our algorithm does not take these into account. For comparison, we also constructed other sets taking advantage explicitly of these similarities (Johnson and Overington, 1993). Among many possibilities we chose the following:
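
The effect of the grouping on the representation space can be sketched in Python as follows. The 10 groups are those listed above; labelling them with the digits 0 to 9 is an arbitrary, hypothetical convention, as is the function name.

```python
# The 10 groups of the standard set, one string per group.
GROUPS = ["CFWY", "IV", "LM", "HQR", "EK", "DN", "SP", "A", "G", "T"]

# One-letter amino acid code -> group label (digits chosen for illustration).
AA_TO_GROUP = {aa: str(idx) for idx, grp in enumerate(GROUPS) for aa in grp}

def reduce_pentapeptide(pentapeptide):
    """Rewrite a pentapeptide in the 10-letter group alphabet."""
    return "".join(AA_TO_GROUP[aa] for aa in pentapeptide)

full_space = 20 ** 5                            # 3 200 000 possible pentapeptides
reduced_space = 10 ** 5                         # 100 000 after grouping
reduction_factor = full_space // reduced_space  # 2**5 = 32
```

For example, ILMVA and VMLIA are mapped to the same reduced pattern, so their occurrences in the bank are pooled.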

Set 2: (C), (F–W–Y), (I–V–L–M), (G–T), (H), (R–Q–N–K), (D–E), (S), (P) and (A);

Set 3: (C), (F–V–Y–I), (W–L–M), (T–S), (H–G), (Q–K), (N–D), (R–E), (P) and (A).

Pattern dictionary

Using the amino acid grouping described above, there are 74 616 different pentapeptides in our bank, most of which are found more than once. These therefore represent a major fraction of the reduced configuration space, since only 25 384 (100 000 – 74 616) different pentapeptides are absent from the bank. To illustrate this fact, we present in Figure 1 the number of pentapeptides in the bank as a function of their number of occurrences. The average frequency is 1.7 and a large majority (88%) of all possible pentapeptides are represented fewer than four times in the bank. For comparison, equivalent numbers are presented for the first 123 proteins (27 294 pentapeptides). The average frequency is then 0.27 and practically no pentapeptide is present more than three times in this sub-bank.
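
The occupation statistics of the reduced space can be obtained with a simple counting step. The following Python sketch is only illustrative (the function name and the string encoding of reduced pentapeptides are assumptions); it also checks the averages quoted above, taken as the bank size divided by the 10⁵ possible patterns.

```python
from collections import Counter

def occurrence_histogram(reduced_pentapeptides, space_size=10**5):
    """Return {f: number of distinct patterns seen f times}, including f = 0."""
    per_pattern = Counter(reduced_pentapeptides)
    histogram = Counter(per_pattern.values())
    histogram[0] = space_size - len(per_pattern)   # patterns never observed
    return dict(histogram)

# Figures quoted in the text.
mean_large = 165903 / 10**5    # about 1.7 occurrences per possible pattern
mean_small = 27294 / 10**5     # about 0.27 for the 123-protein sub-bank
absent = 10**5 - 74616         # patterns absent from the large bank
```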



Fig. 1. Number (N) of different pentapeptides occurring f times in two databases of different size (165 903 vs 27 294 pentapeptides).

 
Clearly, from the small sample, it would be difficult to obtain information for many pentapeptides. In contrast, with the larger sample it is possible to use a dictionary approach. When considering a pentapeptide of unknown type, we find all its occurrences in the pattern bank, translate them into the corresponding types and finally decide from this information which kind of pattern is most appropriate.

In fact, as data are still not overabundant, we proceed with the jack-knife method. Each pentapeptide is compared with all the others that are not in the same protein and its predicted structure is compared with the real one. As a consequence, all pentapeptides that appear only once in the bank are considered as unknown structures. The dictionary approach is therefore unsuitable for nearly 55 000 different pentapeptides (55% of the dictionary, but only 18% of the bank; in Fig. 1 they correspond to the 25 384 pentapeptides not found and the 29 487 found only once).
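
A minimal Python sketch of the jack-knife dictionary lookup is given below. The data layout (triples of protein identifier, reduced pentapeptide and type) and the function names are assumptions made for illustration; how the collected types are turned into a decision is left open here.

```python
from collections import Counter, defaultdict

def build_dictionary(bank):
    """bank: iterable of (protein_id, reduced_pentapeptide, type) triples."""
    index = defaultdict(list)
    for protein_id, pattern, kind in bank:
        index[pattern].append((protein_id, kind))
    return index

def jackknife_lookup(index, protein_id, pattern):
    """Count the types of identical pentapeptides found in other proteins.

    An empty Counter means the pattern is 'unknown' for the dictionary approach.
    """
    return Counter(kind for pid, kind in index.get(pattern, []) if pid != protein_id)

# Toy bank: the pattern of p3 occurs only once, so it cannot be looked up.
bank = [("p1", "00000", "H"), ("p2", "00000", "H"),
        ("p2", "00000", "C"), ("p3", "11111", "E")]
index = build_dictionary(bank)
```

In this toy bank, the lookup for the pentapeptide of protein p1 returns one α and one coil occurrence from p2, whereas the lookup for p3 returns nothing, as for the pentapeptides found only once in the real bank.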

Nearest neighbour method

As the identification of all pentapeptides is not possible, an extrapolation procedure has to be devised. We suppose that similar pentapeptides are of the same type, so our approach constitutes a true nearest neighbour method. Since each pentapeptide is surrounded by a neighbourhood assumed to be of the same nature, the knowledge of N regularly dispersed points in our representation space determines the knowledge of N assemblies of points. Now, any pentapeptide differs by only one residue from its 9 × 5 = 45 nearest neighbours. The knowledge of 10⁵/45 regularly distributed pentapeptides should then be sufficient to allow the determination of the type of all the others. As we have seen, the distribution is not regular, in particular for the less abundant β-strands. We therefore define a distance between pentapeptides which can be used even for less homologous pairs.
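
The count of 45 nearest neighbours can be checked by direct enumeration. In this Python sketch the reduced alphabet is written with the digits 0 to 9, a hypothetical labelling of the 10 amino acid groups.

```python
def one_substitution_neighbours(pattern, alphabet="0123456789"):
    """All patterns differing from `pattern` at exactly one position."""
    return [pattern[:i] + symbol + pattern[i + 1:]
            for i in range(len(pattern))
            for symbol in alphabet
            if symbol != pattern[i]]

neighbours = one_substitution_neighbours("01234")
# Number of regularly spaced pentapeptides whose neighbourhoods would cover the space.
needed = 10**5 / 45    # roughly 2200
```

Each of the five positions can be replaced by any of the nine other groups, which gives the 9 × 5 = 45 neighbours of the text.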

Consequently, to define the degree of similarity between two pentapeptides we calculate the distance d between them. Aligning the two pentapeptides, we look for all the identities between their residues, wherever these may be located. If each identity is represented by a link between both pentapeptides, we define a graph for this pair. We call any part of this graph a configuration. Next, we retain all of those configurations in which there is no residue cross pairing. Each of these is then evaluated as a function of the number p of residues related to the graph, the shifting Δi for the corresponding pairs and the gap δij between connected residues of each pentapeptide. The minimum value is chosen as characteristic and is called distance d:


Although very rough, this measure is generally in good agreement with a qualitative estimation by eye. For instance, the distance between abcde and abcfg is 20, whereas that between abcde and abfcg is 23 = (50 – 30 + 1 + 2). Other examples were given in a previous paper (Figureau et al., 1999).

Our method consists then of the following steps:

  1. The calculation of the distances of any test pentapeptide to all those in the bank (because of the jack-knife process, pentapeptides from the same protein as the test pentapeptide are excluded).
  2. The construction of histograms presenting the numbers of pentapeptides of each category α, β or coil at a given non-zero distance from the test element. We recall that the zero-distance case has been treated in the dictionary approach. Histograms are normalized to the same total number of pentapeptides, so as to compensate for the unequal numbers of α, β and coils in the databank.
  3. Then, various criteria might be applied for the comparison of these histograms. Here we consider only the simplest criterion of proximity: after normalization, the tested pentapeptide is attributed the type that is most frequent at the smallest distance. The criterion is modified if there is ambiguity in the prediction: for those peptides represented in similar proportions by two (or three) types, we consider the next nearest neighbourhood to make the prediction.
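
The three steps above can be sketched in Python under explicit assumptions. The distance used here is only a stand-in (10 per mismatched position), not the graph-based distance d of the text, which also accounts for shifted identities. The tolerance that decides when two types are "in similar proportions" is a hypothetical parameter, and all names are illustrative.

```python
from collections import Counter, defaultdict

def stand_in_distance(a, b):
    """Stand-in only: 10 per mismatched position, not the graph-based distance d."""
    return 10 * sum(x != y for x, y in zip(a, b))

def predict_by_neighbours(test_pattern, test_protein, bank, tolerance=0.1):
    """bank: list of (protein_id, reduced_pentapeptide, type). Returns a type or None."""
    # Step 1: jack-knife, keep only pentapeptides from other proteins.
    others = [(pat, kind) for pid, pat, kind in bank if pid != test_protein]
    totals = Counter(kind for _, kind in others)
    # Step 2: one histogram per non-zero distance (distance -> type -> count).
    histograms = defaultdict(Counter)
    for pat, kind in others:
        d = stand_in_distance(test_pattern, pat)
        if d > 0:                      # zero distance belongs to the dictionary step
            histograms[d][kind] += 1
    # Step 3: nearest distance first; normalise by the abundance of each type.
    for d in sorted(histograms):
        scores = {k: n / totals[k] for k, n in histograms[d].items()}
        ranked = sorted(scores.values(), reverse=True)
        if len(ranked) == 1 or ranked[0] - ranked[1] > tolerance * ranked[0]:
            return max(scores, key=scores.get)
        # Ambiguous at this distance: try the next nearest neighbourhood.
    return None

bank = [("p2", "00001", "H"), ("p3", "00002", "E"),
        ("p4", "00011", "E"), ("p4", "99999", "H")]
prediction = predict_by_neighbours("00000", "p1", bank)
```

In this toy bank the two neighbours at the smallest distance are one α and one β pentapeptide in equal normalized proportions, so the decision is taken at the next distance, where only a β pentapeptide is found.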

Secondary structure prediction

Another step must be performed, since we have considered pentapeptides as basic elements. We must define a smoothing process, which transforms the structural information on pentapeptides into one on residues.

The simplest smoothing process takes into account the types of the five pentapeptides in which the given residue is present (fewer than five at the ends of the sequences). The residue is then given the most frequent of these types. More than 95% of the residues recover their original type after application of the two processes: pentapeptide characterization and smoothing. The missed residues are either isolated or belong to an isolated pair of consecutive residues, i.e. they are mainly coil residues.
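
The smoothing step can be sketched in Python as a majority vote over the windows that contain each residue. Leaving a residue undetermined (None) when the vote is tied is our assumption about how indeterminacy arises; the function name is illustrative and at least one pentapeptide is required.

```python
from collections import Counter

def smooth(pentapeptide_types):
    """pentapeptide_types[k] is the type of the window starting at residue k.

    Returns one type per residue, or None where the vote is tied (assumption).
    Requires at least one pentapeptide.
    """
    n_windows = len(pentapeptide_types)
    n_residues = n_windows + 4
    result = []
    for r in range(n_residues):
        first = max(0, r - 4)            # first window containing residue r
        last = min(r, n_windows - 1)     # last window containing residue r
        votes = Counter(pentapeptide_types[first:last + 1])
        ranked = votes.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            result.append(None)          # tie between the two leading types
        else:
            result.append(ranked[0][0])
    return result

residue_types = smooth("HHHCC")   # five windows, hence nine residues
```

Residues near the chain ends are covered by fewer than five windows, so the vote there rests on one to four pentapeptides.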

For a proper evaluation of result reliability, we use the percentage of correctly predicted residues Q3(i):


When undetermined structures are found, we define also


where Ni is the total number of residues of type i, N+i is the number of correctly predicted residues of type i and Nui the number of residues of type i undetermined.
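
For illustration, the two success rates can be computed as in the Python sketch below. It assumes the usual reading of the definitions above: Q3(i) is the percentage of the Ni residues of type i that are correctly predicted, and Q3*(i) is the same percentage taken over the determined residues only, Ni – Nui. The second form in particular is an assumption on our part.

```python
def q3(n_correct, n_total):
    """Percentage of correctly predicted residues of one type (assumed form)."""
    return 100.0 * n_correct / n_total

def q3_star(n_correct, n_total, n_undetermined):
    """Same percentage restricted to residues with a determined prediction (assumed form)."""
    return 100.0 * n_correct / (n_total - n_undetermined)
```

With these forms, Q3* equals Q3 whenever no residue is left undetermined and is larger otherwise.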

The Matthews coefficients are calculated for a more precise evaluation of the accuracy. Equivalent quantities are used when discussing the accuracy of pentapeptide predictions.
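
A minimal Python sketch of the Matthews coefficient for one structural type is given below, assuming the standard one-versus-rest definition in terms of true and false positives and negatives. Returning 0 when the denominator vanishes is a convention adopted here for illustration.

```python
from math import sqrt

def matthews(tp, tn, fp, fn):
    """Matthews correlation coefficient for one type against the two others."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0                      # convention for degenerate cases
    return (tp * tn - fp * fn) / denominator
```

The coefficient is 1 for a perfect prediction, –1 for a completely inverted one and 0 for a prediction no better than chance.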


    Results
 
Residue type prediction

In Table I we present the results for the 169 149 residues of all the proteins in the databank. In this case, the Matthews coefficients for α, β and coil secondary structures are 0.525, 0.458 and 0.499, respectively. To compare these results with other prediction methods, we give in the second row of Table I the success rates obtained with the Rost-126 database, which contains 126 proteins and 24 395 residues with proportions of α, β and coil similar to those of our databank.


Table I. Percentages of success rate predictions for residues of the large database (169 149 residues) and the standard Rost database (126 proteins)
 
One important ingredient in our calculation is the choice of the amino acid partition. In order to explore the validity of our standard set of amino acid groups and the sensitivity of the method to this choice, we present in Table II the results for the two other sets given in Methods. We conclude from these results that our approach to determining the grouping is justified.


Table II. Success rate values obtained using three different amino acid groupings: G2 = (C), (F–W–Y), (I–V–L–M), (G–T), (H), (R–Q–K–N), (D–E), (P), (S) and (A); G3 = (C), (F–V–Y–I), (W–L–M), (T–S), (H–G), (Q–K), (N–D), (R–E), (P) and (A); G4 = (C), (F), (W), (Y), (I), (V), (L), (M), (G), (T), (H), (R) (Q), (K), (N), (D), (E), (P), (S) and (A)
 
Pentapeptide prediction

In Table III we present the success rates for pentapeptides equivalent to those presented in Table I for residues. Both sets of results are very similar. The saturation effect, in particular for β pentapeptides, is also very visible when the large and Rost databanks are compared.


Table III. Percentages of success rate predictions for pentapeptides of the large and standard Rost databases (note that Q3 = Q3*, contrary to residue predictions, where the indeterminacy was produced by the smoothing process)
 
Table IV displays the success rates for pentapeptides occurring n times in the larger bank (n = 1, 2, ..., 7). For this evaluation, the type of the pentapeptide was considered as α, β or coil only if all five residues were homogeneously of that type.


Table IV. Success rates for pentapeptides present n times in the larger databank (165 903 pentapeptides) (all Q3* values are equal to the corresponding Q3)
 
To evaluate the importance of having a pentapeptide dictionary where each pentapeptide is represented at least three times in the bank, we separated calculations based on zero distance from those obtained from non-zero distances (neighbourhood approach). In Table V these two options are easily differentiated, demonstrating once again the advantage of using a larger database.


Table V. Success rates for pentapeptides as deduced when only data for distances 0 and 10 are used (the values Q3 are not equal to the corresponding Q3*)
 

    Discussion
 
The results in Table I show, as expected, that a smaller databank allows less precise predictions. While results for this small bank are rather disappointing, it is encouraging to find much better predictions in the larger one, in contrast to other methods, which give similar results in the two situations (Frishman and Argos, 1997). To verify that there is no peculiarity in these two databanks, we selected various sub-banks from pdb_select (size 635) of size similar to the Rost-126 bank and we arrived at almost identical results.

The results in Table II justify the selection of the amino acid set employed in the calculations. Better results might be obtained if a global optimization had been performed instead of a step-by-step one. Even a non-optimal grouping might do better, but there is no simple way to determine such a set. In any case, we cannot expect significant progress from a modification of our grouping. Table II includes the results obtained when the 20 amino acids are considered as 20 groups (G4). It is clear that in this case all success rates are much less satisfactory. In our interpretation this is as expected, since a larger representation space (of size 20⁵ instead of 10⁵) means less significant statistics.

Therefore, in its present state, our method allows moderately satisfactory predictions of protein secondary structures. Calculations on small banks do not attain the precision of other methods. Nevertheless, for larger banks one should emphasize that we have not used any information from homologous proteins. Our results of 68.6% accuracy, although not directly comparable to recent publications, are of the same order, provided that the inclusion of homologous aligned sequences is not taken into account, so as to compare equivalent sets of hypotheses.

Let us then examine the most direct effect related to the size of the bank: the frequency of appearance of each pentapeptide. In Table IV, values for n = 1 correspond to pentapeptides present only once in the bank and, because of the jack-knife process, the calculation bears on neighbours only. This explains the lower values in the corresponding row. Except for n = 1 and for β rates, the numbers in Table IV are similar when n varies, suggesting that the proportion of ambiguous pentapeptides is statistically independent of n, especially for n ≥ 3. The addition of any new pentapeptide to the bank is most effective if this pentapeptide was previously present only once or twice in the bank, which is the case for 50% of the 10⁵ possible ones (75% if all the pentapeptides not represented in the bank are included). As for β rates, their observed decrease with n is due simply to the small number of β pentapeptides in the bank and in particular to the low probability of finding the same β pentapeptide many times. A systematic increase in the databank size will then be useful mainly as long as it concerns pentapeptides of low frequency, so as to bring them into the categories n ≥ 3. A second effect to be analysed concerns the pentapeptide nearest neighbours. In the Rost bank the proportion of predictions based on the dictionary approach is low, in contrast to the larger bank, where on average the pentapeptide nearest neighbours are at a smaller distance. A priori, this explains the better results obtained with the larger bank, as shown in Table I for residues and Table III for pentapeptides.

Finally, as expected, the results in Table V show that calculations restricted to zero distances (dictionary approach) give much better predictions than those using only data at d = 10. These results are pentapeptide predictions, because the smoothing process cannot be used in the first case, where only a fraction of all pentapeptides can be found directly (Q3 and Q3* are different). We note that the final success rates for pentapeptides (Table III) result from the combination of these two rows and consequently they are intermediate.

Our conclusion is that improvements can still be attained with the recent growth of databanks, the lack of sufficient ß-strand data being the principal difficulty.


    Notes
 
3 To whom correspondence should be addressed. E-mail: masoto@cec.uchile.cl


    Acknowledgments
 
We are grateful to Dr J.Delorme and Professor G.Lamot for many interesting discussions and advice and to Dr B.Rost for useful suggestions. A.Figureau is indebted to the Departamento de Investigación y Desarrollo, Universidad de Chile, for partial financial support. We are also grateful to the Programs CNRS/CONICYT and ECOS/CONICYT for the financial support that allowed continuous collaboration.


    References
 
Apweiler,R. et al. (2001) Nucleic Acids Res., 29, 37–40.

Biou,V., Gibrat,J.F., Levin,J.M., Robson,B. and Garnier,J. (1988) Protein Eng., 2, 185–191.

Chandonia,J.M. and Karplus,M. (1999) Proteins, 35, 293–306.

Chou,P.Y. and Fasman,G.D. (1974) Biochemistry, 13, 222–245.

Colloc’h,N., Etchebest,C., Thoreau,E., Henrissat,B. and Mornon,J.P. (1993) Protein Eng., 6, 377–382.

Cootes,A.P., Curmi,P.M.G., Cunningham,R., Donnelly,C. and Torda,A.E. (1998) Proteins, 32, 175–189.

Cuff,J.A. and Barton,G.J. (2000) Proteins, 40, 502–511.

Donnelly,D., Overington,J.P. and Blundell,T.L. (1994) Protein Eng., 7, 645–653.

Eisenhaber,F., Persson,B. and Argos,P. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 1–94.

Figureau,A., Soto,M.A. and Tohá,J. (1999) J. Theor. Biol., 201, 103–111.

Frishman,D. and Argos,P. (1995) Proteins, 23, 566–579.

Frishman,D. and Argos,P. (1997) Proteins, 27, 329–335.

Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522.

Johnson,M.S. and Overington,J.P. (1993) J. Mol. Biol., 233, 716–738.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Kabsch,W. and Sander,C. (1984) Proc. Natl Acad. Sci. USA, 81, 1075–1078.

King,S.M. and Johnson,W.C. (1999) Proteins, 35, 313–320.

Labesse,G., Colloc’h,N., Pothier,J. and Mornon,J.P. (1997) CABIOS, 13, 291–295.

Levin,J.M., Pascarella,S., Argos,P. and Garnier,J. (1993) Protein Eng., 6, 849–854.

Ptitsyn,O.B. and Finkelstein,A.V. (1983) Biopolymers, 22, 15–25.

Rogov,S.I. and Nekrasov,A.N. (2001) Protein Eng., 14, 459–463.

Rooman,M.J. and Wodak,S.J. (1988) Nature, 335, 45–49.

Rost,B. (2001) J. Struct. Biol., 134, 204–218.

Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584–599.

Rost,B. and Sander,C. (1995) Proteins, 23, 295–300.

Salamov,A.A. and Solovyev,V.V. (1997) J. Mol. Biol., 268, 31–36.

Shestopalov,B.V. (1990) Mol. Biol., 24, 1117–1125.

Srinivasan,R. and Rose,G. (1995) Proteins, 22, 81–99.

Sudarsanam,S. (1998) Proteins, 30, 228–231.

Received January 20, 2002; revised October 11, 2002; accepted December 18, 2002.




