1 Institut de Physique Nucléaire de Lyon, Université Claude Bernard, 69622 Villeurbanne Cedex, France and 2 Departamento de Física, Fac. Ciencias Físicas y Matemáticas, Universidad de Chile, Casilla 487-3, Santiago, Chile
Abstract |
---|
Keywords: pattern recognition/pentapeptide dictionary/pentapeptide recognition/protein structure prediction
Introduction |
---|
Hence, the development of theoretical methods to predict protein structure has become an important research field and during the last 40 years a variety of such algorithms have been published, using different approaches (Chou and Fasman, 1974; Ptitsyn and Finkelstein, 1983; Biou et al., 1988; Levin et al., 1993; Rost and Sander, 1993, 1995; Donnelly et al., 1994; Eisenhaber et al., 1995; Srinivasan and Rose, 1995; Frishman and Argos, 1997; Salamov and Solovyev, 1997; Chandonia and Karplus, 1999; Cuff and Barton, 2000) and continuous progress has been reported over the years (Rost, 2001).
One of the possible difficulties in obtaining satisfactory results is related to the size of databanks, which are limited to very small subsets of existing proteins (this is even more true if redundant information is eliminated). A second limitation lies in the reduction from three-dimensional data to secondary structure sequences. Methods developed to assign defined conformations to each residue of primary sequences (Kabsch and Sander, 1983; Frishman and Argos, 1995; King and Johnson, 1999) give results differing on average by 15% (Colloch et al., 1993; Labesse et al., 1997). This value reflects the different approximations used to evaluate residue interactions. In particular, the role of long-range interactions (Cootes et al., 1998) should be included.
The identification of sequence motifs represents an approach which could in principle use most of the information contained in the banks (Rooman and Wodak, 1988; Shestopalov, 1990). In a previous work (Figureau et al., 1999), we proposed to circumvent the lack of data by comparing a test motif with a limited variety of known structural patterns. The goal was to produce a complete dictionary of archetypal short pentapeptides. This choice was justified by physico-chemical considerations: an α-helix can essentially be maintained through an NH group of one residue H-bonded with a CO group three positions further in the chain. Similarly, β-strands are defined only if at least three consecutive residues in the chain are bonded through β-bridges to three others. The idea was to select regular structures among all the existing ones, so that small differences in the sequence should correspond to small variations in the conformation. The choice of pentamers was a compromise. On the one hand, larger polypeptides would be more appropriate, since they offer less structure ambiguity, as will be discussed later. However, such a choice has the inconvenience of greatly reducing the data available for α- and even more for β-structures to construct the databank, because such long polypeptides are more difficult to form and more rarely found in proteins. On the other hand, smaller polypeptides are more numerous but have a more flexible structure, so they are present in various types of defined structures.
The selection of short peptides obviously makes our method insensitive to long-range interactions that participate in the protein native conformation. Hence it is not a surprise that identical pentamers can adopt different conformations: this was noted in the pioneering work of Kabsch and Sander (1984). The same observation applies also, although at a lower level, to polypeptides as long as eight residues or even longer (Sudarsanam, 1998). Moreover, it has been shown that statistically non-local through-space interactions (Cootes et al., 1998), and also amino acid surroundings (Rogov and Nekrasov, 2001), are important for the 3D organization of a protein molecule.
Our goal was to compare any given short peptide with a dictionary extracted from existing data, so that the prediction rate can potentially be improved as this dictionary becomes more complete, with at least two repetitions of each pentapeptide.
Methods |
---|
Data were taken from the EMBL release of May 14, 1997 (Hobohm and Sander, 1994). Proteins were determined with resolution better than or equal to 3 Å. We chose the 25% pdb_select list, which contains 635 chains, corresponding to $1.7 \times 10^{5}$ residues. The secondary structure of protein residues corresponds to the DSSP method (Kabsch and Sander, 1983). We grouped the eight possible structural classes into three types: α-helix (H, G, I), β-strands (E) and coil (B, T, S, C). Note that we included B in coils because, as an isolated bridge, it does not represent a regular geometric structure. This is more consistent with our approach (King and Johnson, 1999). There are 58 513, 36 616 and 74 020 residues of each type (169 149 in total), corresponding to 34.6, 21.6 and 43.8% of α, β and coil residues, respectively.
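The reduction from eight DSSP classes to three types can be sketched as a lookup table. This is only an illustration of the grouping stated above; the one-letter labels `a`, `b`, `c` and the function name `reduce_dssp` are choices made for this sketch, and treating unlisted codes (such as a blank or `-`) as coil is an assumption.

```python
# Grouping of the eight DSSP classes into three types, as stated in the text.
# The labels 'a' (alpha), 'b' (beta) and 'c' (coil) are chosen for this sketch.
DSSP_TO_TYPE = {
    "H": "a", "G": "a", "I": "a",             # helical classes
    "E": "b",                                  # extended strand
    "B": "c", "T": "c", "S": "c", "C": "c",   # isolated bridge, turn, bend, coil
}


def reduce_dssp(dssp_string):
    """Map a string of DSSP codes to the three-type alphabet.

    Assumption: any code not listed above (e.g. '-' or a blank) is coil.
    """
    return "".join(DSSP_TO_TYPE.get(code, "c") for code in dssp_string)
```

For example, `reduce_dssp("HGIEBTSC")` gives `"aaabcccc"`.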
Pentapeptides are obtained using the sliding window method: a sequence of n residues gives rise to n − 4 pentapeptides. The type of a segment is defined as α (or β) if it contains three consecutive residues of that type, in contrast to our previous work (Figureau et al., 1999), where the type of the segment was defined according to that of the five consecutive residues of the pentapeptide. This is not equivalent to the use of triplets, because pentapeptides containing three successive residues of the same type are attributed this same structure. However, the structures defined as regular are now much more numerous, making the distribution between coils and the other types more balanced.
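A minimal sketch of the sliding window and of the three-in-a-row typing rule follows. It assumes the three-letter type alphabet `a`/`b`/`c` (a labelling chosen here) and assumes that a window with no run of three α or three β residues is classed as coil; the function names are hypothetical.

```python
def window_type(window_types):
    """Type of a 5-residue window: 'a' or 'b' if it holds three consecutive
    residues of that type, otherwise 'c' (coil fallback is an assumption)."""
    for t in ("a", "b"):
        if t * 3 in window_types:
            return t
    return "c"


def pentapeptides(sequence, types):
    """Return (pentapeptide, type) pairs for every window of width 5.

    A chain of n residues yields n - 4 windows (none if n < 5).
    """
    if len(sequence) != len(types):
        raise ValueError("sequence and types must have the same length")
    return [
        (sequence[start:start + 5], window_type(types[start:start + 5]))
        for start in range(len(sequence) - 4)
    ]
```

A window of five residues cannot contain both a run of three α and a run of three β residues, so the order in which the two types are tested does not matter.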
As a result, we construct a bank of pentapeptide patterns. The proportions of α, β and coil pentapeptides are 35.2, 20.6 and 44.2%, respectively. As expected, these proportions are similar to those of residues. The total number of pentapeptides in the bank is 165 903, which should be compared with the total number of possible pentapeptides. Since all random sequences of five amino acids might not be expected to be present in proteins, this number is probably smaller than $20^5$ = 3 200 000, but in any case much larger than our sample, which contains only 155 928 different peptides (those found more than once in the bank are rather few). It is then convenient to restrict the representation space further.
Amino acid grouping
We grouped the 20 amino acids into sets. The only condition that we imposed on the choice of these groups was to respect as much as possible the separation observed between the α and β pentapeptide sets of the bank. In this calculation, α (β) pentapeptides were homogeneously constituted by five α (β) residues instead of three: this allows less flexibility in the type of the pentapeptides and gives more confidence in the separation of the two groups (Figureau et al., 1999). The number of distinct groups of amino acids was restricted to 10, going by a step-by-step optimization from 20 amino acids to 19, 18, etc. The size of the space of pentapeptides is then reduced by a factor $2^5$ = 32.
The 10 sub-groups of amino acids giving the best separation between α and β pentapeptides are the following: (CFWY), (IV), (LM), (HQR), (EK), (DN), (SP), (A), (G) and (T). In this set, equivalent amino acids have roughly similar physico-chemical properties, although our algorithm does not take these into account. For comparison, we also constructed other sets taking advantage explicitly of these similarities (Johnson and Overington, 1993). Among many possibilities we chose the following:
Set 2: (C), (FWY), (IVLM), (GT), (H), (RQNK), (DE), (S), (P) and (A);
Set 3: (C), (FVYI), (WLM), (TS), (HG), (QK), (ND), (RE), (P) and (A).
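As an illustration of how the grouping shrinks the representation space from $20^5$ to $10^5$ patterns, the following sketch encodes a pentapeptide as a tuple of group numbers using the optimized set listed above. The names `SET_1`, `build_group_index` and `encode`, and the numbering of the groups from 0 to 9, are conventions of this sketch only.

```python
# The ten optimized groups quoted in the text (one-letter amino acid codes).
SET_1 = ["CFWY", "IV", "LM", "HQR", "EK", "DN", "SP", "A", "G", "T"]


def build_group_index(groups):
    """Map each amino acid letter to the index of the group containing it."""
    index = {}
    for group_id, members in enumerate(groups):
        for amino_acid in members:
            index[amino_acid] = group_id
    return index


GROUP_OF = build_group_index(SET_1)


def encode(pentapeptide, group_of=GROUP_OF):
    """Replace each residue by its group number, giving a hashable key."""
    return tuple(group_of[amino_acid] for amino_acid in pentapeptide)
```

With this encoding, `FVLKA` and `YIMEA` map to the same key, since each pair of aligned residues falls in the same group. Sets 2 and 3 can be used in the same way by passing a different group list to `build_group_index`.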
Pattern dictionary
Using the above-described amino acid grouping, there are 74 616 different pentapeptides in our bank, most of which are found more than once. These represent, then, a major fraction of the reduced configuration space, since only 25 384 (100 000 − 74 616) different pentapeptides are absent in the bank. To illustrate this fact, we present in Figure 1 the number of pentapeptides in the bank as a function of their number of occurrences. The average frequency is 1.7 and a large majority (88%) of all possible pentapeptides are represented less than four times in the bank. For comparison, equivalent numbers are presented for the first 123 proteins (27 294 pentapeptides). The average frequency is then 0.27 and practically no pentapeptide is present more than three times in this sub-bank.
In fact, as data are still not overabundant, we proceed with the jack-knife method. Each pentapeptide is compared with all the others that are not in the same protein and its predicted structure is compared with the real one. As a consequence, all pentapeptides that appear only once in the bank are considered as unknown structures. The dictionary approach is then unsuitable for nearly 55 000 different pentapeptides (55% of the dictionary, but only 18% of the bank; they correspond in Fig. 1 to 25 384 pentapeptides not found and to 29 487 found only once).
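The jack-knife dictionary lookup can be sketched as follows. The data layout (a list of `(protein_id, key, type)` entries) and the function names are hypothetical, and the text does not state how conflicting occurrences of the same pentapeptide are resolved, so the majority vote used here is an assumption.

```python
from collections import Counter, defaultdict


def build_bank(entries):
    """Index the bank by encoded pentapeptide.

    entries: iterable of (protein_id, key, type) tuples, where key is any
    hashable encoding of the pentapeptide (layout assumed for this sketch).
    """
    bank = defaultdict(list)
    for protein_id, key, ptype in entries:
        bank[key].append((protein_id, ptype))
    return bank


def jackknife_lookup(bank, protein_id, key):
    """Predict the type of `key` from occurrences in other proteins only.

    Returns None when the pentapeptide is unknown outside its own protein.
    Majority voting among the remaining occurrences is an assumption.
    """
    votes = Counter(t for pid, t in bank.get(key, []) if pid != protein_id)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

A pentapeptide that occurs only in the query protein returns `None`, which is the case the text describes as an unknown structure.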
Nearest neighbour method
As the identification of all pentapeptides is not possible, an extrapolation procedure has to be devised. We suppose that similar pentapeptides are of the same type, so our approach constitutes a true nearest neighbour method. Since a neighbourhood assumed to be of the same nature surrounds each pentapeptide, the knowledge of N regularly dispersed points in our representation space determines the knowledge of N assemblies of points. Now, any pentapeptide differs by only one residue from its 9 × 5 = 45 nearest neighbours. The knowledge of $10^5/45$ regularly distributed pentapeptides should then be sufficient to allow the determination of the type of all the others. As we have seen, the distribution is not regular, in particular for the less abundant β-strands. We therefore define a distance between pentapeptides which can be used even for less homologous pairs.
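The count of 45 nearest neighbours can be checked by enumerating every single-substitution variant of an encoded pentapeptide. The sketch assumes pentapeptides are encoded as tuples of group numbers 0 to 9; the function name is hypothetical.

```python
def one_substitution_neighbours(key, n_groups=10):
    """List all encoded pentapeptides differing from `key` at exactly one
    position: 5 positions times (n_groups - 1) alternative groups."""
    neighbours = []
    for position, current in enumerate(key):
        for group in range(n_groups):
            if group != current:
                neighbours.append(key[:position] + (group,) + key[position + 1:])
    return neighbours
```

With ten groups each pentapeptide has 45 such neighbours, so about $10^5/45 \approx 2222$ regularly spaced points would be needed under the idealized argument above.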
Consequently, to define the degree of similarity between two pentapeptides we calculate the distance d between them. Aligning the two pentapeptides, we look for all the identities between their residues, wherever these may be located. If each identity is represented by a link between both pentapeptides, we define a graph for this pair. We call any part of this graph a configuration. Next, we retain all of those configurations in which there is no residue cross pairing. Each of these is then evaluated as a function of the number p of residues related to the graph, the shift of each corresponding pair i and the gap between connected residues i and j of each pentapeptide. The minimum value is chosen as characteristic and is called the distance d:
![]() |
Although very rough, this measure is generally in good agreement with a qualitative estimation by eye. For instance, the distance between abcde and abcfg is 20, whereas that between abcde and abfcg is 23 (= 50 − 30 + 1 + 2). Other examples were given in a previous paper (Figureau et al., 1999).
Our method consists then of the following steps:
Secondary structure prediction
Another step must be performed, since we have considered pentapeptides as basic elements. We must define a smoothing process, which transforms the structural information on pentapeptides into one on residues.
The simplest smoothing process takes into account the types of the five pentapeptides in which the given residue is present (fewer than five at the ends of the sequences). The residue is then given the most frequent type among them. More than 95% of the residues recover their original type after application of the two processes: pentapeptide characterization and smoothing. The missed residues are either isolated or belong to an isolated pair of consecutive residues, i.e. they are mainly coil residues.
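A minimal sketch of this majority-vote smoothing is given below. It assumes the window types are supplied in sequence order (entry k is the type of the pentapeptide starting at residue k); the text does not specify how ties are broken, so taking the first most common type is an assumption.

```python
from collections import Counter


def smooth(pentapeptide_types):
    """Convert per-window types into per-residue types by majority vote.

    Residue r belongs to the windows starting at r-4 .. r, clipped to the
    chain, so end residues receive fewer than five votes.
    Tie-breaking (first most common type) is an assumption of this sketch.
    """
    n_windows = len(pentapeptide_types)
    if n_windows == 0:
        return ""
    residue_types = []
    for r in range(n_windows + 4):          # n - 4 windows cover n residues
        first = max(0, r - 4)
        last = min(r, n_windows - 1)
        votes = Counter(pentapeptide_types[first:last + 1])
        residue_types.append(votes.most_common(1)[0][0])
    return "".join(residue_types)
```

For six windows typed `aaaccc`, the ten residues come out as `aaaaaccccc`.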
For a proper evaluation of result reliability, we use the percentage of correctly predicted residues Q3(i):
$$Q_3(i) = 100 \times \frac{N_i^{+}}{N_i}$$
When undetermined structures are found, we define also
$$Q_3^{*}(i) = 100 \times \frac{N_i^{+}}{N_i - N_i^{u}}$$
where $N_i$ is the total number of residues of type i, $N_i^{+}$ is the number of correctly predicted residues of type i and $N_i^{u}$ is the number of undetermined residues of type i.
The Matthews coefficients are calculated for a more precise evaluation of the accuracy. Equivalent quantities are used when discussing the accuracy of pentapeptide predictions.
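These accuracy measures can be sketched as follows. The sketch assumes Q3(i) = 100 N+i/Ni and Q3*(i) = 100 N+i/(Ni − Nui), which follow from the definitions of Ni, N+i and Nui given above, and uses the standard Matthews correlation formula for one class against the rest. Undetermined predictions are represented by `None`, a convention of this sketch, and the function names are hypothetical.

```python
import math


def q3_scores(observed, predicted, kind):
    """Return (Q3, Q3*) in percent for residues of type `kind`.

    predicted may contain None for undetermined residues (sketch convention).
    """
    pairs = list(zip(observed, predicted))
    n_i = sum(1 for o, _ in pairs if o == kind)
    n_plus = sum(1 for o, p in pairs if o == kind and p == kind)
    n_und = sum(1 for o, p in pairs if o == kind and p is None)
    q3 = 100.0 * n_plus / n_i if n_i else float("nan")
    determined = n_i - n_und
    q3_star = 100.0 * n_plus / determined if determined else float("nan")
    return q3, q3_star


def matthews(observed, predicted, kind):
    """Matthews correlation coefficient for `kind` versus all other types."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == kind and p == kind)
    tn = sum(1 for o, p in pairs if o != kind and p != kind)
    fp = sum(1 for o, p in pairs if o != kind and p == kind)
    fn = sum(1 for o, p in pairs if o == kind and p != kind)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

When no residue is undetermined, the two percentages returned by `q3_scores` coincide.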
Results |
---|
In Table I we present the results for the 169 149 residues of all the proteins in the databank. In this case, the Matthews coefficients for α, β and coil secondary structures are 0.525, 0.458 and 0.499, respectively. To compare these results with other prediction methods, we give in the second row of Table I the success rates obtained with the Rost-126 database, which contains 126 proteins and 24 395 residues with proportions of α, β and coil similar to those of our databank.
In Table III we present the success rates for pentapeptides equivalent to those presented in Table I for residues. Both sets of results are very similar. The saturation effect, in particular for β pentapeptides, is also very visible when the large and Rost databanks are compared.
Discussion |
---|
The results in Table II justify the selection of the amino acid set employed in the calculations. Better results might be obtained if a global optimization had been performed instead of a step-by-step one. Even a non-optimal grouping might do better, but there is no simple way to determine such a set. In any case, we cannot expect significant progress from a modification of our grouping. Table II includes the results obtained when the 20 amino acids are considered as 20 groups (G4). It is clear that in this case all success rates are much less satisfactory. In our interpretation this is as expected, since a larger representation space (of size $20^5$ instead of $10^5$) means less significant statistics.
Therefore, in its present state, our method allows moderately satisfactory predictions of protein secondary structures. Calculations on small banks do not attain the precision of other methods. Nevertheless, for larger banks one should emphasize that we have not used any information from homologous proteins. Our results of 68.6% accuracy, although not directly comparable to recent publications, are of the same order, provided that the inclusion of homologous aligned sequences is not taken into account, so as to compare equivalent sets of hypotheses.
Let us then examine the most direct effect related to the size of the bank: the frequency of appearance of each pentapeptide. In Table IV, values for n = 1 correspond to pentapeptides present only once in the bank and, because of the jack-knife process, the calculation bears on neighbours only. This explains the lower values in the corresponding row. Except for n = 1 and for β rates, the numbers in Table IV are similar when n varies, suggesting that the proportion of ambiguous pentapeptides is statistically independent of n, especially for n ≥ 3. The addition of any new pentapeptide to the bank is really efficient if this pentapeptide was already present only once or twice in the bank, which is the case for 50% of the $10^5$ possible ones (75% if all the pentapeptides not represented in the bank are included). As for β rates, their observed decrease with n is due simply to the small number of β pentapeptides in the bank and in particular to the low probability of finding the same β pentapeptide many times. A systematic increase in the databank size will then be useful mainly as long as it concerns pentapeptides of low frequency, so as to lead them to categories n ≥ 3. A second effect to be analysed concerns the pentapeptide nearest neighbours. In the Rost bank the proportion of predictions based on the dictionary approach is low, contrary to the larger bank, where on average the pentapeptide nearest neighbours are at a smaller distance. A priori, this explains the better results obtained with this last bank, as exhibited in Table I for residues and Table III for pentapeptides.
Finally, as expected, the results in Table V show that calculations restricted to zero distances (dictionary approach) give much better predictions than those using only data at d = 10. These results are pentapeptide predictions, because the smoothing process cannot be used in the first case, where only a fraction of all pentapeptides can be found directly (Q3 and Q3* are different). We note that the final success rates for pentapeptides (Table III) result from the combination of these two rows and consequently they are intermediate.
Our conclusion is that improvements can still be attained with the recent growth of databanks, the lack of sufficient β-strand data being the principal difficulty.
References |
---|
Biou,V., Gibrat,J.F., Levin,J.M., Robson,B. and Garnier,J. (1988) Protein Eng., 2, 185–191.
Chandonia,J.M. and Karplus,M. (1999) Proteins, 35, 293–306.
Chou,P.Y. and Fasman,G.D. (1974) Biochemistry, 13, 222–245.
Colloch,N., Etchebest,C., Thoreau,E., Henrissat,B. and Mornon,J.P. (1993) Protein Eng., 6, 377–382.
Cootes,A.P., Curmi,P.M.G., Cunningham,R., Donnelly,C. and Torda,A.E. (1998) Proteins, 32, 175–189.
Cuff,J.A. and Barton,G.J. (2000) Proteins, 40, 502–511.
Donnelly,D., Overington,J.P. and Blundell,T.L. (1994) Protein Eng., 7, 645–653.
Eisenhaber,F., Persson,B. and Argos,P. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 1–94.
Figureau,A., Soto,M.A. and Tohá,J. (1999) J. Theor. Biol., 201, 103–111.
Frishman,D. and Argos,P. (1995) Proteins, 23, 566–579.
Frishman,D. and Argos,P. (1997) Proteins, 27, 329–335.
Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522.
Johnson,M.S. and Overington,J.P. (1993) J. Mol. Biol., 233, 716–738.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Kabsch,W. and Sander,C. (1984) Proc. Natl Acad. Sci. USA, 81, 1075–1078.
King,S.M. and Johnson,W.C. (1999) Proteins, 35, 313–320.
Labesse,G., Colloch,N., Pothier,J. and Mornon,J.P. (1997) CABIOS, 13, 291–295.
Levin,J.M., Pascarella,S., Argos,P. and Garnier,J. (1993) Protein Eng., 6, 849–854.
Ptitsyn,O.B. and Finkelstein,A.V. (1983) Biopolymers, 22, 15–25.
Rogov,S.I. and Nekrasov,A.N. (2001) Protein Eng., 14, 459–463.
Rooman,M.J. and Wodak,S.J. (1988) Nature, 335, 45–49.
Rost,B. (2001) J. Struct. Biol., 134, 204–218.
Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584–599.
Rost,B. and Sander,C. (1995) Proteins, 23, 295–300.
Salamov,A.A. and Solovyev,V.V. (1997) J. Mol. Biol., 268, 31–36.
Shestopalov,B.V. (1990) Mol. Biol., 24, 1117–1125.
Srinivasan,R. and Rose,G. (1995) Proteins, 22, 81–99.
Sudarsanam,S. (1998) Proteins, 30, 228–231.
Received January 20, 2002; revised October 11, 2002; accepted December 18, 2002.