Skewed distribution of protein secondary structure contents over the conformational triangle

Chun-Ting Zhang1,2 and Ren Zhang3

1 Department of Physics, Tianjin University, Tianjin 300072 and 3 Department of Epidemiology and Biostatistics, Tianjin Cancer Institute and Hospital, Tianjin 300060, China


    Abstract
 
A conformational triangle method is presented to analyze the secondary structure contents of 1028 structurally known proteins in the non-redundant data set of the recent 25% PDB_SELECT. The secondary structure contents of each protein are mapped on to a point in the triangle. The distribution of the 1028 points is strongly skewed in the triangle and about 42% of the whole area is empty; this empty region is called the forbidden area. The detailed border between the allowable and forbidden areas was calculated. A possible explanation of the skewed distribution is discussed. The distributions of the mapping points for enzymes and non-enzymes in this non-redundant data set are compared. It was found that a necessary, but not a sufficient, condition for a protein to be an enzyme is that its coil content is ≥0.223. It is hoped that the skewed distribution observed here could be used to test secondary structure and threading predictions.

Keywords: coil content/conformational triangle/enzymes/flexibility/forbidden area/helix content/non-enzymes/strand content


    Introduction
 
Biological data are increasing exponentially, and we are faced with the question of what they mean. Analyzing these data is a severe challenge. The great accumulation of biological data, including detailed structural data for more than 7000 proteins, provides a chance to discover new knowledge. Knowledge discovery and data mining (KDDM) is the main subject of bioinformatics today. This paper represents typical KDDM work: it looks for empirical rules describing the relationships among helix, strand and coil composition, based on more than 1000 proteins of a non-redundant data set whose three-dimensional structures are currently available in the PDB.

The helix, strand and coil compositions of a protein are the fractions of residues in the α-helix, β-strand and coil conformations, respectively, where turns are treated as coil. There are different ways to assign one of these three secondary structure types to each residue in a protein on the basis of its three-dimensional structural data (Kabsch and Sander, 1983; Richards and Kundrot, 1988; Sklenar et al., 1989). In this paper, the method of Kabsch and Sander is used, i.e. the DSSP program is used to compute the secondary structures of proteins (Kabsch and Sander, 1983). Residues assigned H, G or I in the DSSP output file are treated as helix, E or B as strand and all the remainder as coil. Hence coils as defined here include turns. This is only a simplified treatment.
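The three-state reduction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `secondary_structure_contents` and the example state string are hypothetical, and the input is assumed to be the per-residue DSSP state column already extracted as a string.

```python
# Three-state reduction described in the text:
# H, G, I -> helix; E, B -> strand; everything else (turns, bends, blanks) -> coil.
HELIX_STATES = set("HGI")
STRAND_STATES = set("EB")


def secondary_structure_contents(dssp_states):
    """Return (alpha, beta, c): fractions of helix, strand and coil residues."""
    n = len(dssp_states)
    if n == 0:
        raise ValueError("empty DSSP state string")
    helix = sum(1 for s in dssp_states if s in HELIX_STATES)
    strand = sum(1 for s in dssp_states if s in STRAND_STATES)
    coil = n - helix - strand  # all remaining residues count as coil
    return helix / n, strand / n, coil / n


# Hypothetical 20-residue state string, for illustration only.
example_states = "--HHHHHHTT-EEEE-SGGG"
alpha, beta, c = secondary_structure_contents(example_states)
print(alpha, beta, c)
```

By construction the three fractions sum to 1, so only two of them are independent.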


    Database and method
 
The PDB is a very biased and highly redundant database. To obtain a reliable result, a non-redundant database should be used. Here the recent 25% PDB_SELECT protein database of Hobohm et al. (1992) is used, in which the pairwise sequence identity is <25%. The version used here is the December 1998 release, which contains 1028 proteins. We obtained these data from the FTP site ftp://ftp.embl-heidelberg.de/pub/databases/pdb_select.

Since three real numbers representing the contents of helix, strand and coil are associated with each protein, we have to analyze 3 × 1028 = 3084 values, which would occupy several printed pages. Extracting something useful from such a large amount of data by inspection would be difficult. The strategy used here is to visualize the data graphically. To this end the concept of a conformational triangle is introduced, by which the secondary structure composition of a protein (three numbers) is mapped on to a point in a two-dimensional plane. Accordingly, a great amount of data can be studied in a perceivable form.

For convenience, the contents of α-helix, β-strand and coil in a protein are denoted by α, β and c, respectively. Obviously, α + β + c = 1. This means that among the three real numbers only two are independent, which provides a way to map the secondary structure composition of a protein into a regular (equilateral) triangle. Consider the regular triangle ΔABC with its height equal to 1, as shown in Figure 1. It is well known that the sum of the distances from any point within this triangle to the three sides is exactly equal to the height, i.e. 1. Let the distances of a point P to the sides BC, AC and AB be equal to α, β and c, respectively. The point P then represents the secondary structure composition of the protein studied, and the mapping is a one-to-one correspondence. A Cartesian coordinate system is set up, in which the origin O is at the center of the triangle with the x-axis parallel with the side AB. The coordinates of the point P(x, y) may be expressed in terms of α and β as follows:
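A form of Equation 1 consistent with this construction can be reconstructed under two assumptions that are not stated explicitly here: that C is the apex above the base AB, and that A is the left-hand vertex of the base. These choices agree with the relation c* ≥ y(x*) + 1/3 and the x-interval [–0.577, 0.577] used in the Results, and with the statement there that a positive slope of the fitted line means strand content rises with coil content.

```latex
\begin{equation}
  x = \frac{\beta - \alpha}{\sqrt{3}}, \qquad
  y = c - \frac{1}{3} = \frac{2}{3} - \alpha - \beta
\end{equation}
```

Under these assumptions the vertices A (α = 1), B (β = 1) and C (c = 1) lie at (–1/√3, –1/3), (1/√3, –1/3) and (0, 2/3), respectively, with 1/√3 ≈ 0.577.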

where α and β are the contents of α-helix and β-strand for the protein studied. In this way, the points representing the secondary structure contents of the proteins studied are distributed within the triangle ΔABC, which is called the conformational triangle hereafter. Relationships among helix, strand and coil contents can then be found by studying the distribution of the mapping points.
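The mapping can be sketched in Python. The function names `to_triangle` and `from_triangle` are illustrative only, and the orientation used (the all-helix vertex A on the left, the all-strand vertex B on the right, the all-coil vertex C at the top) is an assumption inferred from the text rather than read from the figure.

```python
import math

SQRT3 = math.sqrt(3.0)


def to_triangle(alpha, beta):
    """Map helix and strand contents to a point (x, y) in the triangle.

    Assumed orientation: origin at the center, base AB parallel to the x-axis,
    A (alpha = 1) on the left, B (beta = 1) on the right, C (c = 1) on top.
    """
    c = 1.0 - alpha - beta
    if alpha < 0.0 or beta < 0.0 or c < -1e-12:
        raise ValueError("contents must be non-negative and sum to at most 1")
    x = (beta - alpha) / SQRT3  # horizontal position along the base direction
    y = c - 1.0 / 3.0           # the center lies 1/3 above the base AB
    return x, y


def from_triangle(x, y):
    """Inverse mapping: recover (alpha, beta, c) from a point (x, y)."""
    c = y + 1.0 / 3.0
    total = 1.0 - c   # alpha + beta
    diff = SQRT3 * x  # beta - alpha
    return (total - diff) / 2.0, (total + diff) / 2.0, c


# The three vertices of the triangle under the assumed orientation.
print(to_triangle(1.0, 0.0))  # A: all helix
print(to_triangle(0.0, 1.0))  # B: all strand
print(to_triangle(0.0, 0.0))  # C: all coil
```

Because the inverse exists for every point of the triangle, the mapping is one-to-one, as stated above.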



 
Fig. 1. Consider the regular triangle ΔABC with its height equal to 1. The contents of α-helix, β-strand and coil in a protein are denoted by α, β and c, respectively. Let the distances of a point P to the sides BC (labelled α), AC (labelled β) and AB (labelled coil) be equal to α, β and c, respectively; the coordinates of P are then uniquely determined by the three numbers. A Cartesian coordinate system is set up, in which the origin O is at the center of the triangle with the x-axis parallel with the side AB. The 1028 mapping points, one for each of the 1028 proteins, are distributed within the triangle. Note the skewed distribution of these points. The region in which no points lie is called the forbidden area. The border between the allowable and forbidden areas is indicated by the broken line within the triangle.

 

    Results and discussion
 
The distribution of the 1028 mapping points representing the secondary structure contents of the 1028 proteins in the conformational triangle is shown in Figure 1. It can be seen that the distribution is not uniform but strongly skewed. This observation is in accordance with the finding of Brenner et al. (1997), who pointed out that the structural classification of proteins reveals strikingly skewed distributions at all levels. In Figure 1, some regions within the conformational triangle are densely covered, some are sparsely populated and some are empty. The empty region, in which there are no mapping points, is called the forbidden area. There may be various ways to define the forbidden area. The one we adopt is shown in Figure 1: the whole of the bottom side of the triangle is divided into seven sub-intervals, and in each sub-interval the border between the allowable and forbidden areas is represented by a straight line segment. The whole borderline is then described by a function y(x), defined as follows:

where x and y are defined in Equation 1. The function defined in Equation 2 roughly describes the borderline between the allowable and forbidden areas. Based on the borderline function y(x) in Equation 2, the forbidden area occupies 42.35% of the whole area of the triangle. Suppose that there is a protein with helix and strand contents α* and β*, respectively. We calculate x* and y* by using Equation 1. If y* ≥ y(x*), the mapping point is within the allowable area; otherwise, if y* < y(x*), it is within the forbidden area. As an example, consider the prion protein. It is well known that the prion protein has two possible structures, PrPC and PrPSc (Prusiner, 1982). PrPC is basically an all-α protein with α* = 0.40 and β* ≈ 0, whereas PrPSc is an αβ protein with α* = 0.30 and β* = 0.43, according to experimental reports using FTIR and CD techniques (Pan et al., 1997; Aguzzi and Weissmann, 1997). Both structures satisfy the condition y* ≥ y(x*). Hence both mapping points lie in the allowable area (see Figure 2).
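The prion example can be worked through numerically. The sketch below assumes the coordinate form x = (β − α)/√3, y = c − 1/3 (A on the left, B on the right, C on top), which is inferred from the text. The name `is_allowable` and its `border` argument are hypothetical; `border` stands in for the borderline function y(x) of Equation 2, whose coefficients are not reproduced here and are therefore left to the caller.

```python
import math


def to_triangle(alpha, beta):
    """Assumed form of Equation 1: x = (beta - alpha)/sqrt(3), y = c - 1/3."""
    c = 1.0 - alpha - beta
    return (beta - alpha) / math.sqrt(3.0), c - 1.0 / 3.0


def is_allowable(alpha, beta, border):
    """Decision rule from the text: the point is allowable if y* >= y(x*).

    `border` is a caller-supplied function x -> y(x) standing in for
    Equation 2; no border values are assumed in this sketch.
    """
    x, y = to_triangle(alpha, beta)
    return y >= border(x)


# Helix and strand contents quoted in the text (beta of PrPC taken as 0).
prion = {"PrPC": (0.40, 0.00), "PrPSc": (0.30, 0.43)}

for name, (alpha, beta) in prion.items():
    x, y = to_triangle(alpha, beta)
    coil = 1.0 - alpha - beta
    print(f"{name}: x = {x:.3f}, y = {y:.3f}, c = {coil:.2f}")
```

Under the assumed orientation this gives approximately (–0.231, 0.267) for PrPC and (0.075, –0.063) for PrPSc, with coil contents of 0.60 and 0.27.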



 
Fig. 2. The distribution of the mapping points of the secondary structure contents of 441 enzymes within the conformational triangle ΔABC. See the legend of Figure 1 for a detailed explanation of the triangle coordinate system. Comparing the point distribution here with that in Figure 1, we find that some mapping points with low coil content are `filtered' out. There exists a threshold of coil content (denoted by Λ) for enzymes. Based on the data in the plot, we find Λ = 0.223. A necessary rather than a sufficient condition for an enzyme molecule is that its coil content must be ≥0.223. As an example, the mapping points of the secondary structure contents for the two structures of the prion protein, PrPC and PrPSc, are shown and denoted by filled circles. Both points lie in the allowable area, with y* ≥ y(x*), and both satisfy the coil-content condition for enzymes, c* > Λ.

 
The fact that 42.35% of the whole area of the conformational triangle belongs to the forbidden area is worthy of study. One possible explanation is presented in the following. The condition for the allowable area is y* ≥ y(x*), as mentioned above. Using Equation 1, we transform this condition to c* ≥ y(x*) + 1/3 ≡ Λ(x*), where Λ(x*) is the cut-off for the coil content. The fact that Λ(x*) > 0 for any x* ∈ [–0.577, 0.577] (the whole interval of x) indicates that no protein is entirely free of coils (turns). Although this is trivial, it reflects the basic fact that coils (turns) are absolutely necessary for protein folding, whereas helices and strands are not always necessary. Observing Figure 1 further, we find that the minimum values of Λ(x*) [denoted by minΛ(x*)] differ between protein classes. Using the borderline function y(x) in Equation 2, we find that for the all-α proteins, minΛ(x*) = 0.059; for the all-β proteins, minΛ(x*) = 0.174; and for the αβ (including α/β and α + β) proteins, minΛ(x*) = 0.217. Here the definition of structural classes proposed by Nakashima et al. (1986) is used. These cut-off values reflect the different intrinsic structural characteristics of helix and strand. It is well known that hydrogen bonds are important for protein folding. For the α-helix, hydrogen bonds are formed between different residues within the α-helix itself, subject to three-dimensional constraints. In contrast, for the β-sheet, hydrogen bonds are formed between adjoining β-strands, subject to two-dimensional constraints (Chothia et al., 1997). Because the formation of hydrogen bonds requires well-defined bond lengths and orientations, it seems that the formation of a β-sheet needs more coils to be involved than does that of an α-helix. This is probably one reason why minΛ(x*) for the all-β or αβ proteins is greater than minΛ(x*) for the all-α proteins. This reasoning is also in agreement with the following observation. Fitting the 1028 mapping points by a straight line using a least-squares technique, we find the fitting line

where (x, y) is the coordinate of the point on the fitting line. Using Equation 1, we rewrite Equation 3 as

where α and β are the contents of helix and strand, respectively, associated with the point on the fitting line. The fact that the slope of the line in Equation 3 or 4 is greater than zero indicates that, on average, the content of β-strand is positively correlated with that of coil, whereas the content of α-helix is negatively correlated with that of coil. In other words, overall, the more strands there are, the more coils there are, and the more helices, the fewer coils. It is well known that of the three secondary structural elements the helix is generally the least flexible and the coil the most flexible, with the strand in between (Schulz and Schirmer, 1979; Chothia et al., 1997; Oliva et al., 1997). In other words, the intrinsic flexibility of β-strands is much greater than that of α-helices (Chothia et al., 1997) and the flexibility of coils seems to be generally greater than that of β-strands. The condition for the allowable area, i.e. c* ≥ Λ(x*), indicates that the flexibility of coils is necessary for the stable folding of proteins. In summary, the appearance of the forbidden area in the conformational triangle seems to be related to the flexibility of protein structures.
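The straight-line fit and its rewriting in terms of α and β can be sketched as follows. This is a generic ordinary least-squares fit, not the authors' code; the function names and the three sample points are hypothetical, the fitted coefficients of Equations 3 and 4 are not reproduced, and the rewriting assumes the mapping x = (β − α)/√3, y = 2/3 − α − β inferred from the text.

```python
import math

SQRT3 = math.sqrt(3.0)


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0.0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def line_in_contents(slope, intercept):
    """Rewrite y = slope * x + intercept as p * alpha + q * beta = r.

    Substituting the assumed mapping gives
    (1 - slope/sqrt(3)) * alpha + (1 + slope/sqrt(3)) * beta = 2/3 - intercept.
    """
    p = 1.0 - slope / SQRT3
    q = 1.0 + slope / SQRT3
    r = 2.0 / 3.0 - intercept
    return p, q, r


# Hypothetical mapping points, for illustration only (not the 1028 proteins).
sample_points = [(-0.2, -0.05), (0.0, 0.0), (0.2, 0.05)]
slope, intercept = fit_line(sample_points)
print(slope, intercept)
print(line_in_contents(slope, intercept))
```

Since y = c − 1/3 and, under the assumed orientation, x grows with β − α, a positive slope means that coil content rises as composition shifts from helix toward strand, which is the reading given in the text.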

Proteins are thought to be structurally and dynamically flexible molecules. The flexibility of some (not all) proteins is necessary for their functions, especially enzymatic functions (Tsou, 1986). To illustrate this, the mapping points of the 441 enzymes in the recent 25% PDB_SELECT protein database of Hobohm et al. (1992) are shown in Figure 2. Comparing the distribution in Figure 2 with that in Figure 1, we find that some mapping points with low coil content are `filtered' out. There exists a threshold of coil content (denoted by Λ) for enzymes. Based on the data in Figure 2, we find that Λ = 0.223. A necessary rather than a sufficient condition for an enzyme molecule is that its coil content must be ≥0.223. The value of Λ may change as the protein database grows; however, a substantial deviation from 0.223 in the future is unlikely. In other words, at least 22.3%, or nearly one quarter, of the residues of an enzyme molecule must assume the coil (including turn) conformation. As mentioned above, coils are probably more flexible than helices and strands. Therefore, the existence of a higher threshold Λ for enzymes indicates that the functions of enzymes need more flexible conformational elements such as coils. Nevertheless, a protein with a higher coil content may not be an enzyme. We performed a statistical test to see whether the distributions of the coil contents of enzymes and non-enzymes are significantly different. The average contents of helix, strand and coil and their variances for the 441 enzymes and 587 (1028 – 441) non-enzymes were calculated and are listed in Table I. Based on these data, a t-test was performed and it was found that the two coil content distributions (enzymes compared with non-enzymes) are not significantly different at the 0.05 significance level. In summary, the constraint that the coil content of an enzyme must be greater than or equal to a threshold Λ is only a necessary condition for a protein to be an enzyme, but by no means a sufficient one.
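The two checks described above can be sketched as follows. The function names are hypothetical. The text does not say which variant of the t-test was used, so Welch's unequal-variance form is shown purely as an assumption, computed from summary statistics (means, variances, sample sizes) of the kind listed in Table I; no Table I values are assumed here.

```python
import math

ENZYME_COIL_THRESHOLD = 0.223  # value of Lambda reported in the text


def meets_enzyme_coil_condition(alpha, beta, threshold=ENZYME_COIL_THRESHOLD):
    """Necessary (not sufficient) condition for an enzyme: c >= threshold."""
    coil = 1.0 - alpha - beta
    return coil >= threshold


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's t statistic and approximate degrees of freedom.

    Assumption: the unequal-variance form is one possible choice of t-test;
    the text does not state which form was actually used.
    """
    se1, se2 = var1 / n1, var2 / n2  # squared standard errors of the means
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    dof = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, dof


# Coil condition for the two prion structures quoted in the text.
print(meets_enzyme_coil_condition(0.40, 0.00))  # PrPC, c = 0.60
print(meets_enzyme_coil_condition(0.30, 0.43))  # PrPSc, c = 0.27
```

With sample sizes of 441 and 587, the two-sided critical value at the 0.05 level is close to the normal value of 1.96, so a statistic with |t| below roughly that value corresponds to the "not significantly different" outcome reported above.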


 
Table I. The average secondary structure contents and the variances of enzymes and non-enzymes
 
In conclusion, the skewed distribution of the secondary structure contents over the conformational triangle reported in this paper is a finding worth noting. The existence of a large forbidden area in the secondary structure composition space seems to be related to the flexibility of protein structures. However, an exact explanation for the forbidden area is still not available. Any future successful protein folding theory should give this phenomenon a satisfactory explanation. At present, the skewed distribution observed here could be used to test secondary structure and threading predictions.


    Acknowledgments
 
This work was supported in part by the Pandeng Project of China and a grant from the State Education Commission of China.


    Notes
 
2 To whom correspondence should be addressed. E-mail: ctzhang@tju.edu.cn


    References
 
Aguzzi,A. and Weissmann,C. (1997) Nature, 389, 795–798.

Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1997) Curr. Opin. Struct. Biol., 7, 369–376.

Chothia,C., Hubbard,T., Brenner,S., Barns,H. and Murzin,A. (1997) Annu. Rev. Biophys. Biomol. Struct., 26, 597–627.

Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Nakashima,H., Nishikawa,K. and Ooi,T. (1986) J. Biochem., 99, 152–162.

Oliva,B., Bates,P.A., Querol,E., Aviles,F.X. and Sternberg,M.J.E. (1997) J. Mol. Biol., 266, 814–830.

Pan,K.M., Baldwin,M., Nguyen,J., Casset,M., Serban,A., Groth,D., Mehlhorn,I., Huang,Z., …

Prusiner,S.B. (1982) Science, 216, 136–144.

Richards,F.M. and Kundrot,C.E. (1988) Proteins, 3, 71–84.

Schulz,G.E. and Schirmer,R.H. (1979) Principles of Protein Structure. Springer, New York.

Sklenar,H., Etchebest,C. and Lavery,R. (1989) Proteins, 6, 46–60.

Tsou,C.L. (1986) Trends Biochem. Sci., 11, 427–429.

Received April 14, 1999; revised July 8, 1999; accepted July 8, 1999.