Computer-Aided Drug Discovery, Pharmacia and Upjohn, Kalamazoo, MI 49007-4940 and 1 Department of Computer and Information Science, Indiana University Purdue University Indianapolis, Indianapolis, IN 46202-5132, USA
Abstract |
---|
Keywords: 1st-order coupled components/α-helix/β-sheet/β-bridge/$3_{10}$-helix/π-helix/H-bonded turn/bend/random coil
Introduction |
---|
In a pioneering study, Krigbaum and Knutton (1973) introduced the multiple linear regression (MLR) algorithm to predict the secondary structure content of a protein from its amino acid composition. Muskal and Kim (1992) approached the problem differently, developing a tandem neural network method that takes into account the protein's amino acid composition, molecular weight and heme presence. More recently, by incorporating some nonlinear terms as well as knowledge of the protein structural class, Zhang et al. (1996, 1998) proposed a new approach to predict the amount of secondary structure in a globular protein. According to their report, the predicted results of Zhang et al. (1996, 1998) are better than those of Krigbaum and Knutton (1973) and Muskal and Kim (1992). However, Zhang's method requires a priori knowledge of the structural class of the query protein in order to predict its secondary structure content, which limits its applicability. Moreover, the amino-acid-composition used in all the aforementioned methods is the 0th-order coupled composition, as defined by

$$\Psi_0 = [P(\mathrm{A}),\ P(\mathrm{C}),\ P(\mathrm{D}),\ P(\mathrm{E}),\ \ldots,\ P(\mathrm{Y})] \qquad (1)$$

where A, C, D, E, ..., and Y represent the single-letter codes of the 20 amino acids and P(A) represents the proportion of amino acid A (alanine) in a given protein, P(C) the proportion of C (cysteine), P(D) the proportion of D (aspartic acid), and so forth. As eqn 1 shows, each amino acid component is treated independently, i.e. the coupling effects among the 20 amino acid components are not incorporated at all. The amino-acid-composition thus defined is the 0th-order coupled composition, as denoted by the subscript 0 of $\Psi_0$ in eqn 1.
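
As an illustration of eqn 1, the 20 proportions can be computed directly from a sequence string. The following is a minimal sketch; the function name `composition_0th`, the toy sequence and the choice to reject non-standard residue codes are assumptions made for this example, not part of the original method description.

```python
from collections import Counter

# Single-letter codes of the 20 amino acids in alphabetical order (A, C, D, ..., Y).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_0th(sequence):
    """Return the 20 proportions P(A), P(C), ..., P(Y) in alphabetical order."""
    seq = sequence.upper()
    # Assumption: any residue code outside the 20 standard ones is an error.
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


# Hypothetical toy sequence: two alanines and two cysteines.
toy_composition = composition_0th("ACCA")
print(toy_composition[:3])
```

Each component depends only on how often a residue occurs, not on where it occurs, which is the independence the text refers to.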
Obviously, the 0th-order-coupled system is the lowest approximation. If we wish to incorporate the coupling effects of residues along a sequence so as to reflect more accurately the reality in a protein, how can we develop a method to predict its secondary structure content? The present study was initiated in an attempt to deal with this problem.
Algorithm |
---|
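The 1st-order coupled amino-acid-composition is defined by

$$\Psi_1 = [P(\mathrm{A}|\mathrm{A}),\ P(\mathrm{C}|\mathrm{A}),\ \ldots,\ P(\mathrm{Y}|\mathrm{A});\ P(\mathrm{A}|\mathrm{C}),\ P(\mathrm{C}|\mathrm{C}),\ P(\mathrm{D}|\mathrm{C}),\ \ldots,\ P(\mathrm{Y}|\mathrm{C});\ \ldots;\ P(\mathrm{A}|\mathrm{Y}),\ \ldots,\ P(\mathrm{Y}|\mathrm{Y})] \qquad (2)$$
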
where P(C|A) is the proportion of amino acid C occurring along a protein sequence from the N- to the C-terminus, given that A has occurred immediately preceding it; P(D|C) is the proportion of amino acid D occurring along the same sequence, given that C has occurred just preceding D; and so forth.
Generally speaking, if the coupling effects of the $\mu$ ($\mu$ = 2, 3, ...) closest neighboring amino acid residues are to be considered, then eqn 1 should be modified to be a $\mu$th-order coupled amino-acid-composition consisting of $20^{\mu+1}$ components, each of which would correspond to a $\mu$th-order conditional proportion. As one could surmise, the analysis of a higher-order coupled system would be much more complicated. Therefore, the treatment in this paper is confined to the 1st-order coupled system; i.e. only the coupling effect of the closest adjacent amino acids is taken into account, as formulated by eqn 2.
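
The growth in the number of components with the coupling order can be made concrete with a short calculation. This sketch reads the component count as 20 raised to the power (order + 1), which matches the 20 components of eqn 1 and the 400 components of eqn 2; the function name `component_count` is hypothetical.

```python
def component_count(order, alphabet_size=20):
    """Number of components in a coupled composition of the given order."""
    # Assumption: an order-m composition has one component for every possible
    # run of m + 1 consecutive residues, i.e. alphabet_size ** (m + 1).
    if order < 0:
        raise ValueError("order must be non-negative")
    return alphabet_size ** (order + 1)


for order in range(4):
    print(order, component_count(order))
```

Under this reading a 2nd-order system already has 8000 components, so each additional order multiplies the number of coefficients to be fitted by 20.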
The current method is established on the basis of eqn 2, which formulates a conditional probability contribution from each amino acid in the sequence given that it is immediately preceded by a particular one of the 20 amino acids. Accordingly, the 1st-order coupled amino-acid-composition (eqn 2) introduced here involves explicit representation of sequential properties that are not included in the conventional amino-acid-composition, or the 0th-order coupled amino-acid-composition, as formulated by eqn 1.
Suppose the 20 native amino acids are denoted by $X_i$ ($i$ = 1, 2, ..., 20) in the alphabetical order of their single-letter codes, i.e. $X_1$ = A, $X_2$ = C, ..., $X_{20}$ = Y; then according to the normalization condition we have

$$\sum_{i=1}^{20}\sum_{j=1}^{20} P(X_j|X_i) = 1 \qquad (3)$$

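
A minimal sketch of how the 400 coupled components might be computed is given below. It assumes that each component is the count of an adjacent residue pair divided by the total number of adjacent pairs in the sequence, which is the reading under which the 400 components sum to 1 as the normalization condition requires. The function name `composition_1st`, the dictionary layout keyed by (preceding, current) and the toy sequence are hypothetical.

```python
from itertools import product

# Single-letter codes of the 20 amino acids in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_1st(sequence):
    """Map each (preceding, current) residue pair to its proportion.

    The entry for ("A", "C") plays the role of P(C|A): residue C occurring
    with A immediately preceding it, read from the N- to the C-terminus.
    """
    seq = sequence.upper()
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    if len(seq) < 2:
        raise ValueError("need at least two residues to form a pair")
    # Start every one of the 400 possible pairs at zero.
    counts = {pair: 0 for pair in product(AMINO_ACIDS, repeat=2)}
    for preceding, current in zip(seq, seq[1:]):
        counts[(preceding, current)] += 1
    total_pairs = len(seq) - 1  # assumption: normalize by the number of adjacent pairs
    return {pair: n / total_pairs for pair, n in counts.items()}


# Hypothetical toy sequence with the adjacent pairs AC, CC and CA.
toy_coupled = composition_1st("ACCA")
print(toy_coupled[("A", "C")], sum(toy_coupled.values()))
```

Unlike the 0th-order composition, two sequences with identical residue counts but a different residue order generally give different coupled components.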
For brevity, the 400 components in eqn 2 are denoted by $y_1, y_2, \ldots, y_{400}$. The rationale of the current method is that the secondary structure content of a protein is correlated with its amino-acid-composition; however, compared with the 0th-order composition, such a correlation would be more accurately reflected in terms of the 1st-order coupled composition. Thus, the content of a secondary structural element in a protein, e.g. α-helix, can be estimated by the following equation:

$$\alpha = \frac{n_\alpha}{n} = F_\alpha(y_1, y_2, \ldots, y_{400}) \qquad (4)$$

where $\alpha$ represents the α-helix content, $n_\alpha$ the number of residues occurring in the α-helices of a given protein and $n$ the number of its total residues, while $F_\alpha(y_1, y_2, \ldots, y_{400})$ is a function to be determined. Expanding the function $F_\alpha$ as a Taylor series at $y_1 = y_2 = \cdots = y_{400} = 0$, we have

$$\alpha = (F_\alpha)_0 + \sum_{i=1}^{400}\left(\frac{\partial F_\alpha}{\partial y_i}\right)_0 y_i + \frac{1}{2!}\sum_{i=1}^{400}\sum_{j=1}^{400}\left(\frac{\partial^2 F_\alpha}{\partial y_i\,\partial y_j}\right)_0 y_i y_j + \cdots \qquad (5)$$

where the subscript 0 means that the value of the corresponding term is obtained by substituting $y_1 = y_2 = \cdots = y_{400} = 0$ into it. Since all $y_i$ ($i$ = 1, 2, ..., 400) in a real protein are generally $\ll 1$, with an average equal to 1/400 = 0.0025, and the derivatives are bounded for real-world situations, the third term and above in eqn 5 can be neglected. Thus, we approximately have

$$\alpha = c_0^\alpha + \sum_{i=1}^{400} c_i^\alpha y_i \qquad (6)$$

where $c_0^\alpha = (F_\alpha)_0$ and $c_i^\alpha = (\partial F_\alpha/\partial y_i)_0$. The coefficients $c_i^\alpha$ ($i$ = 0, 1, ..., 400) can be determined through a training dataset by the following procedure.
Suppose in a given training dataset there are $N$ proteins identified by an index $k$ ($k$ = 1, 2, ..., $N$), and the 400 coupled components of the $k$th protein are denoted by $y_{k,1}, y_{k,2}, \ldots, y_{k,400}$. In order to determine the coefficients of eqn 6, we define an objective function given by

$$Q_\alpha = \sum_{k=1}^{N}\left(d_k^\alpha - c_0^\alpha - \sum_{i=1}^{400} c_i^\alpha y_{k,i}\right)^2 \qquad (7)$$

where $d_k^\alpha$ is the content of α-helices in the $k$th protein of the training dataset, derived here from the DSSP file (Kabsch and Sander, 1983) of that protein, as done in Chou et al. (1998). Determining the coefficients $c_i^\alpha$ ($i$ = 0, 1, ..., 400) amounts to finding the minimum of $Q_\alpha$, and hence to solving the following set of linear algebraic equations

$$\frac{\partial Q_\alpha}{\partial c_i^\alpha} = 0 \qquad (i = 0, 1, \ldots, 400) \qquad (8)$$

The procedure adopted here is essentially the least squares solution to the multiple regression problem. It can be shown that eqn 8 usually has a unique solution if $N$, the number of proteins in the training dataset, is equal to or greater than 401 (see Appendix A); singular value decomposition may also be used to obtain the least squares solution. Accordingly, all the coefficients $c_i^\alpha$ ($i$ = 0, 1, ..., 400) in eqn 7 can be derived. Substituting them into eqn 6, we immediately obtain the desired equation for predicting the content of α-helices in a query protein.
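
The fitting step described above can be sketched with a small least-squares solver based on the normal equations. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the toy data use only two made-up components per protein (instead of 400) so that the example stays readable.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_rule_parameters(components, observed):
    """Return least-squares coefficients [c_0, c_1, ..., c_m]."""
    # A leading 1 in each row carries the constant term c_0.
    rows = [[1.0] + list(y) for y in components]
    size = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    xtd = [sum(r[i] * d for r, d in zip(rows, observed)) for i in range(size)]
    return solve_linear_system(xtx, xtd)


# Hypothetical toy training set: 4 "proteins", 2 components each, with
# observed contents generated exactly from 0.1 + 0.2*y1 + 0.3*y2.
toy_components = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_observed = [0.1, 0.3, 0.4, 0.6]
toy_parameters = fit_rule_parameters(toy_components, toy_observed)
print(toy_parameters)
```

The solver raises an error when the normal equations are singular; in that situation the singular value decomposition mentioned in the text would be the appropriate route to a least squares solution.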
Following a similar procedure, we can also predict the content of β-sheets, their parallel and antiparallel fractions, as well as the content of β-bridges, $3_{10}$-helices, π-helices, H-bonded turns, bends and random coils for a given protein. Accordingly, in parallel to eqn 6, a general formulation for predicting all the secondary structure elements can be written as

$$\xi = c_0^\xi + \sum_{j=1}^{400} c_j^\xi y_j \qquad (9)$$

where $\xi$ is a general symbol for all the secondary structure elements, and $c_j^\xi$ ($j$ = 0, 1, 2, ..., 400) are also called the 1st-order coupled 'rule-parameters' for predicting the content of the secondary structural element $\xi$. When $\xi$ = 'α', eqn 9 will yield the content of α-helices; when $\xi$ = 'β', the content of β-sheets; when $\xi$ = 'parallel', the content of parallel β-sheets; when $\xi$ = 'antiparallel', the content of antiparallel β-sheets; when $\xi$ = 'bridge', the content of β-bridges; when $\xi$ = '310', the content of $3_{10}$-helices; when $\xi$ = 'π', the content of π-helices; when $\xi$ = 'H-bond', the content of H-bonded turns; when $\xi$ = 'bend', the content of bends; and when $\xi$ = 'coil', the content of random coils. Note that by definition the secondary structure content must be within the range 0 to 1 (see eqn 4). Therefore, if it is found that $\xi > 1$ or $\xi < 0$, the value of $\xi$ is set to 1 or 0, respectively. Such cases, however, occurred very rarely.
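
The prediction step of eqn 9, including the restriction of the result to the range 0 to 1, can be sketched as follows. The function name `predict_content` and the parameter layout (constant term first, followed by one coefficient per component) are assumptions made for this example, and the numbers are made up.

```python
def predict_content(rule_parameters, components):
    """Evaluate the linear predictor and clip the result to [0, 1].

    rule_parameters: [c_0, c_1, ..., c_m] (assumed layout, constant term first)
    components:      [y_1, ..., y_m] of the query protein
    """
    constant, coefficients = rule_parameters[0], rule_parameters[1:]
    if len(coefficients) != len(components):
        raise ValueError("one coefficient per component is required")
    raw = constant + sum(c * y for c, y in zip(coefficients, components))
    # A content is a fraction of residues, so values outside [0, 1] are
    # assigned to the nearest bound, as described in the text.
    return min(1.0, max(0.0, raw))


# Hypothetical parameters and components (2 components instead of 400).
print(predict_content([0.1, 0.2, 0.3], [1.0, 1.0]))
print(predict_content([0.9, 1.0], [0.5]))
print(predict_content([-0.5, 0.1], [1.0]))
```

The same function would serve every secondary structure element; only the set of rule parameters passed in changes.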
To facilitate comparison, the corresponding equations based on the conventional amino-acid-composition (eqn 1) are also given here. By following procedures parallel to the above derivation, they can be easily obtained as follows:

$$\xi = b_0^\xi + \sum_{j=1}^{20} b_j^\xi x_j \qquad (10)$$

where $x_j$ is the proportion of amino acid $X_j$ in a protein whose secondary structure contents are to be predicted (see eqn 1), and $b_j^\xi$ ($j$ = 0, 1, 2, ..., 20) are the 0th-order coupled 'rule-parameters' for predicting the content of the secondary structural element $\xi$, as can be derived by the following equations:

$$Q_\xi = \sum_{k=1}^{N}\left(d_k^\xi - b_0^\xi - \sum_{j=1}^{20} b_j^\xi x_{k,j}\right)^2 \qquad (11)$$

$$\frac{\partial Q_\xi}{\partial b_j^\xi} = 0 \qquad (j = 0, 1, \ldots, 20) \qquad (12)$$

where $x_{k,1}, x_{k,2}, \ldots, x_{k,20}$ are the 20 0th-order coupled components (see eqn 1) as usually defined for the amino-acid-composition of the $k$th protein in the training dataset, and $d_k^\xi$ is a general symbol for the observed content of the secondary structure element $\xi$ in the $k$th protein. When $\xi$ = 'α', it becomes $d_k^\alpha$ of eqn 7, which is none other than the observed content of α-helices in the $k$th protein. As mentioned above, the observed value of $d_k^\xi$ ($\xi$ = 'α', 'β', 'bridge', '310', 'π' or any other secondary structural element) can be derived from the DSSP file (Kabsch and Sander, 1983) of the $k$th protein in a given training dataset.

A comparison of eqns 10–12 with eqns 7–9 indicates that the sequence-coupled effects are no longer counted in the result predicted by eqn 10. This is because all the conditional probability terms, which were originally associated with the 1st-order coupled rule-parameters in eqn 9, degenerate into the independent amino-acid-composition terms (see eqns 10 and 11).
Results and discussion |
---|
|
the average absolute error for each secondary structure element

$$\Delta_\xi = \frac{1}{N}\sum_{k=1}^{N}\left|\xi_k - d_k^\xi\right|$$

the standard deviation for each secondary structure element
|
and the overall average error $\langle\Delta\rangle$

$$\langle\Delta\rangle = \frac{1}{\Lambda}\sum_{\xi}\Delta_\xi$$

where $\xi$ = 'α', 'β', ..., or 'coil', $\xi_k$ is the predicted content for the secondary structure element $\xi$ in the $k$th protein, while $d_k^\xi$ is the corresponding observed content, and $\Lambda$ is the total number of the secondary structure elements considered; that is, 10 for the current study.
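
The two averaged error measures can be sketched in a few lines, assuming the usual definitions: the average absolute error is the mean of |predicted − observed| over the proteins tested, and the overall average error is the mean of those per-element errors over the elements considered. The function names and all numbers below are hypothetical.

```python
def average_absolute_error(predicted, observed):
    """Mean of |predicted - observed| over the proteins tested."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - d) for p, d in zip(predicted, observed)) / len(observed)


def overall_average_error(error_by_element):
    """Mean of the per-element average absolute errors."""
    if not error_by_element:
        raise ValueError("no secondary structure elements given")
    return sum(error_by_element.values()) / len(error_by_element)


# Made-up predicted and observed contents for two proteins.
helix_error = average_absolute_error([0.5, 0.2], [0.4, 0.4])
# Made-up per-element errors for two elements (the paper considers 10).
overall = overall_average_error({"helix": 0.1, "sheet": 0.3})
print(helix_error, overall)
```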
Self-consistency test
In this test, the rule parameters derived from the 628 proteins in Table I by eqns 7–8 were used to predict the secondary structure content of the same proteins by eqn 9. The 10 sets of 1st-order coupled rule parameters (each containing 401 coefficients) thus found for predicting the content of α-helices, β-sheets, their parallel and antiparallel proportions, β-bridges, $3_{10}$-helices, π-helices, H-bonded turns, bends and random coils, respectively, are given in Appendix B. The results of the self-consistency test for the 628 proteins in Table I are given in Table II, from which we can see that the average absolute errors for the prediction of α-helices and β-sheets are 0.056 and 0.046, with standard deviations of 0.008 and 0.005, respectively. For the other secondary structure elements, except for the proportions of parallel and antiparallel β-strands, the average errors were all ≤0.020 with a standard deviation of ≤0.001. The average absolute errors for the prediction of the parallel and antiparallel β-strand proportions are relatively large. However, although the overall average error for all 10 secondary structure elements is 0.062, excluding these two from consideration reduces it to 0.028, indicating an excellent self-consistency for the 1st-order coupled composition regression algorithm. To show the prediction quality, the calculated and observed contents of α-helices and β-sheets in each of the 628 proteins are shown in Figure 1a and b, respectively.
Although the prediction errors reported above are very small, it should be pointed out that they are merely the results of a self-consistency test based on a limited number of proteins. In the self-consistency test, the secondary structure content of each protein in a training dataset is predicted using the coefficients derived from the same dataset. In other words, the rule parameters derived from the training dataset include information about the protein later tested. This will certainly give an overly optimistic error estimate because of the memorization effect. Nevertheless, the self-consistency test is absolutely necessary because it reflects the consistency of a prediction method, especially of its algorithm part. A prediction algorithm certainly cannot be deemed a good one if it is inconsistent. In other words, the self-consistency test is necessary but not sufficient for evaluating a prediction method. As a complement, a cross-validation examination based on an independent testing dataset is needed, as given below.
Independent-dataset test
Testing on a set of proteins not present in the training dataset is important because it reflects the effectiveness of a prediction method, and in particular checks the validity of the training dataset: whether it contains sufficient information to reflect all the important features concerned so as to yield high prediction quality in application. For cross-validation, an independent testing dataset was constructed. It consisted of 52 proteins with known structures (Table III). The sequence similarity between any two proteins in this dataset, or between a protein in this dataset and any one in the training dataset (Table I), is no more than 35%. The secondary structure contents of these proteins were calculated in terms of the rule parameters derived from the proteins of the training dataset by the 0th- and 1st-order coupled algorithms, respectively. The results thus obtained for the content of α-helices and β-sheets, together with the corresponding observed values, are listed in Table III. As we can see there, for each of the 52 proteins the contents predicted by the 1st-order coupled algorithm for both α-helices and β-sheets are much closer to the observed values than those predicted by the 0th-order coupled algorithm.
It should be pointed out that although in principle the algorithm formulated here can be used to predict the percentage of parallel and antiparallel β-sheets in a protein, the results are considerably poorer than those for the other secondary structure elements. To improve this situation, the incorporation of some special effect into the algorithm might be necessary.
Conclusion |
---|
Appendix A |
---|
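Substituting eqn 7 into eqn 8 gives

$$\sum_{k=1}^{N}\left(d_k^\alpha - \sum_{i=0}^{400} c_i^\alpha y_{k,i}\right) y_{k,j} = 0 \qquad (j = 0, 1, \ldots, 400)$$

where $y_{k,0}$ = 1 is a dummy symbol. The above equation can be written as

$$\mathbf{X}^T(\mathbf{D} - \mathbf{X}\mathbf{C}) = \mathbf{0}$$

where

$$\mathbf{X} = \begin{bmatrix} y_{1,0} & y_{1,1} & \cdots & y_{1,400} \\ y_{2,0} & y_{2,1} & \cdots & y_{2,400} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N,0} & y_{N,1} & \cdots & y_{N,400} \end{bmatrix}$$

T is the transposition operator, and

$$\mathbf{C} = [c_0^\alpha, c_1^\alpha, \ldots, c_{400}^\alpha]^T, \qquad \mathbf{D} = [d_1^\alpha, d_2^\alpha, \ldots, d_N^\alpha]^T$$

Accordingly, we have

$$\mathbf{X}^T\mathbf{X}\mathbf{C} = \mathbf{X}^T\mathbf{D}$$

If $\mathbf{X}^T\mathbf{X}$ is invertible, $\mathbf{C}$ has a unique solution

$$\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{D}$$

The condition that $\mathbf{X}^T\mathbf{X}$ is invertible requires $N \ge 401$. When $N \ge 401$ and the $N$ proteins selected for the training dataset are not homologous to one another, $\mathbf{X}^T\mathbf{X}$ is usually invertible.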
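
The requirement on $N$ follows from a standard rank argument, sketched here under the assumption that $\mathbf{X}$ denotes the $N \times 401$ matrix whose $k$th row is $(1, y_{k,1}, \ldots, y_{k,400})$:

$$\operatorname{rank}(\mathbf{X}^T\mathbf{X}) = \operatorname{rank}(\mathbf{X}) \le \min(N, 401)$$

The $401 \times 401$ matrix $\mathbf{X}^T\mathbf{X}$ is invertible only if its rank is 401, which is impossible when $N < 401$. The condition is necessary rather than sufficient: even with enough proteins, linearly dependent columns of $\mathbf{X}$ would still make $\mathbf{X}^T\mathbf{X}$ singular.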
Appendix B |
---|
Acknowledgments |
---|
Notes |
---|
References |
---|
Bode,W., Papamokos,E. and Musil,D. (1987) Eur. J. Biochem., 166, 673–692.
Bussian,B.M. and Sander,C. (1989) Biochemistry, 28, 4271–4277.
Chou,K.C. (1988) Biophys. Chem., 30, 3–48.
Chou,K.C. (1995) Proteins Struct. Funct. Genet., 21, 319–344.
Chou,K.C. (1997a) J. Peptide Res., 49, 120–144.
Chou,K.C. (1997b) Biopolymers, 42, 837–853.
Chou,K.C. and Blinn,J.R. (1997) J. Protein Chem., 16, 575–595.
Chou,K.C. and Elrod,D.W. (1999) Protein Engng, 12, 107–118.
Chou,K.C. and Zhang,C.T. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 275–349.
Chou,K.C., Liu,W., Maggiora,G.M. and Zhang,C.T. (1998) Proteins Struct. Funct. Genet., 31, 97–103.
Chou,P.Y. (1980) Amino Acid Composition of Four Classes of Proteins. In Abstracts of Papers, Part I, Second Chemical Congress of the North American Continent, Las Vegas.
Chou,P.Y. (1989) Prediction of Protein Structural Classes from Amino Acid Composition. In Fasman,G.D. (ed.), Prediction of Protein Structure and the Principles of Protein Conformation. Plenum Press, New York, pp. 549–586.
Chou,P.Y. and Fasman,G.D. (1978) Adv. Enzymol. Relat. Subj. Biochem., 47, 45–148.
Dubchak,I., Holbrook,S.R. and Kim,S.-H. (1993) Proteins, 16, 79–91.
Farber,G.K. and Petsko,G.A. (1990) Trends Biochem. Sci., 15, 228–234.
Fasman,G.D. (1989) The Development of the Prediction of Protein Structure. In Fasman,G.D. (ed.), Prediction of Protein Structure and the Principles of Protein Conformation. Plenum Press, New York, pp. 317–358.
Folmer,R.H., Nilges,M., Konings,R.N. and Hilbers,C.W. (1995) EMBO J., 14, 4132–4142.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Krigbaum,W.R. and Knutton,S.P. (1973) Proc. Natl Acad. Sci. USA, 70, 2809–2813.
Lehmann,M.S., Pebay-Peyroula,E., Cohen-Addad,C. and Odani,S. (1989) J. Mol. Biol., 210, 235–236.
Liu,W. and Chou,K.C. (1997) J. Protein Chem., 17, 209–217.
Liu,W. and Chou,K.C. (1998) Protein Sci., 7, 2324–2330.
Muskal,S.M. and Kim,S.-H. (1992) J. Mol. Biol., 225, 713–727.
Nakashima,H., Nishikawa,K. and Ooi,T. (1986) J. Biochem., 99, 152–162.
Pastore,A., Saudek,V., Ramponi,G. and Williams,R.J.P. (1992) J. Mol. Biol., 224, 427–440.
Sreerama,N. and Woody,R.W. (1994) J. Mol. Biol., 242, 497–507.
Zhang,C.T., Zhang,Z. and He,Z. (1996) J. Protein Chem., 15, 775–786.
Zhang,C.T., Zhang,Z. and He,Z. (1998) J. Protein Chem., 17, 261–272.
Received March 9, 1999; revised July 24, 1999; accepted August 5, 1999.