1Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, 3Bioinformatics Research Centre, Donghua University, Shanghai 200050, 4Department of Biomedical Engineering, Shanghai Jiaotong University, Shanghai 200030, 5Tianjin Institute of Bioinformatics and Drug Discovery, Tianjin, China and 6Gordon Life Science Institute, San Diego, CA 92130, USA
2To whom correspondence should be addressed. E-mail: mengking@sjtu.edu.cn; kchou@san.rr.com
Abstract
Keywords: Chou's invariance theorem/covariant discriminant algorithm/pseudo-amino acid composition/spectral analysis/weighted ν-SVM
Introduction
Membrane proteins are generally classified into the following five types: (1) type I membrane proteins, (2) type II membrane proteins, (3) multipass transmembrane proteins, (4) lipid chain-anchored membrane proteins and (5) GPI-anchored membrane proteins (Figure 1). The function of a membrane protein is closely related to the type to which it belongs. Therefore, a fast and efficient method for predicting the type of a membrane protein will significantly speed up the process of function determination for newly found membrane proteins. In a pioneering study, based on the amino acid composition, the covariant discriminant algorithm was introduced by Chou and Elrod (1999) to predict the types of membrane proteins. By definition, the conventional amino acid composition is a vector of 20 components, each representing the frequency of occurrence of one of the 20 native amino acids (Nakashima et al., 1986; Chou, 1995; Zhou, 1998). Accordingly, using the amino acid composition to represent a protein sample misses all the sequence-order and sequence-length effects. To cope with this problem, a new concept, the so-called pseudo-amino acid composition, was proposed by Chou (2001). The pseudo-amino acid composition bears the main features of the amino acid composition, but meanwhile it can also incorporate some sequence-order effects. Stimulated by its success in improving prediction quality, here we introduce a different approach to formulating the pseudo-amino acid composition.
This paper is devoted to combining the concept of pseudo-amino acid composition with the ν-SVM to develop a new predictor for membrane protein types.
Pseudo-amino acid composition and discrete Fourier transform
Suppose a protein P with a sequence of L amino acid residues:

\[ \mathbf{P} = R_1 R_2 R_3 \cdots R_L \tag{1} \]

where $R_i$ denotes the residue at sequence position $i$.
The conventional amino acid composition is defined as 20 discrete numbers, each representing the frequency of occurrence of one of the 20 native amino acids (Nakashima et al., 1986; Chou and Zhang, 1994; Chou, 1995; Zhou, 1998). Compared with the conventional amino acid composition, the pseudo-amino acid composition is a vector with 20 + λ discrete components (Chou, 2001) and hence may be viewed as a point in a (20 + λ)-D space, as given by
\[ \mathbf{X} = [\,x_1, x_2, \ldots, x_{20}, x_{20+1}, \ldots, x_{20+\lambda}\,]^{\mathrm{T}} \tag{2} \]

where the components are given by

\[ x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 20+1 \le u \le 20+\lambda \end{cases} \tag{3} \]

in which $f_i$ is the occurrence frequency of the $i$-th native amino acid, $w$ is a weight factor and $\theta_m$ is the $m$-th tier sequence-correlation factor, defined by

\[ \theta_m = \frac{1}{L-m} \sum_{i=1}^{L-m} f(R_i)\, f(R_{i+m}), \qquad m = 1, 2, \ldots, \lambda \;\; (\lambda < L) \tag{4} \]

where $f(R_i)$ is the numerical value assigned to residue $R_i$.
As can be seen from Figure 2, the sequence-order effect of a protein can, to some extent, be reflected through a set of discrete numbers θ1, θ2, θ3, ..., θλ, as defined by Equation 4. Such information is very useful for analysing proteins by means of a set of discrete numbers. Accordingly, the first 20 components of Equation 2 reflect the effect of the amino acid composition, whereas the components from 20 + 1 to 20 + λ reflect some sequence-order effects. The set of 20 + λ components as formulated by Equations 2 and 3 is called the pseudo-amino acid composition of protein P. Such a name is used because it retains the main features of the amino acid composition while containing information beyond the conventional amino acid composition. The pseudo-amino acid composition thus defined has the following advantage: compared with the 210-D pair-coupled amino acid composition (Chou, 1999) and the 400-D first-order coupled amino acid composition (Liu and Chou, 1999), which contain the sequence-order effect only over a very short range (i.e. within two adjacent amino acid residues along a chain), the pseudo-amino acid composition incorporates many more sequence effects, i.e. not only the short-range effects but also the medium- and long-range ones, as indicated by a series of sequence-coupling factors with different tiers of correlation (see Figure 2 and Equations 2–4). Therefore, the prediction quality can be significantly improved by using the pseudo-amino acid composition to represent a protein sample. For the detailed formulation and applications of the pseudo-amino acid composition, readers are referred to two recent papers (Chou, 2001; Chou and Cai, 2003).
Below we shall use the technique of spectral analysis to formulate the pseudo-amino acid composition. As shown in Equation 1, the sequence of a protein is composed of a series of characters, which is hard for a computer to process because each element in the sequence is a linguistic symbol rather than a numerical value. To cope with this situation, each individual amino acid in the protein sequence has to be coded numerically, i.e. expressed in terms of f(Ri) of Equation 4. As is well known, the hydrophilic value of an amino acid is a very important physicochemical property that has crucial effects on the folding of a protein as well as its function, particularly for membrane proteins. In view of this, we choose the hydrophilic value of Ri for f(Ri). Since a coded protein sequence can be treated as a stationary random process, many techniques in statistical signal processing can be used to characterize the sequence-order effects of a protein sequence.
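To make the coding step concrete, the following minimal Python sketch maps a sequence to its numerical series f(R1), ..., f(RL). The Hopp–Woods hydrophilicity scale is used here purely for illustration; the paper does not name the specific scale it adopted, so this choice is an assumption.

```python
import numpy as np

# Hopp-Woods hydrophilicity values for the 20 native amino acids.
# The paper does not specify its scale, so this choice is an assumption.
HYDROPHILICITY = {
    "A": -0.5, "R": 3.0, "N": 0.2, "D": 3.0, "C": -1.0,
    "Q": 0.2, "E": 3.0, "G": 0.0, "H": -0.5, "I": -1.8,
    "L": -1.8, "K": 3.0, "M": -1.3, "F": -2.5, "P": 0.0,
    "S": 0.3, "T": -0.4, "W": -3.4, "Y": -2.3, "V": -1.5,
}

def code_sequence(seq: str) -> np.ndarray:
    """Map a protein sequence R1 R2 ... RL to the series f(R1), ..., f(RL)."""
    return np.array([HYDROPHILICITY[r] for r in seq])
```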
In statistical signal processing, the correlation sequence, the covariance sequence and the spectral density function are the three basic statistical quantities of discrete random signals. The true cross-correlation sequence is a statistical quantity defined as
\[ R_{xy}(m) = E\{\, x_{n+m}\, y_n^{*} \,\} \tag{5} \]

and the cross-covariance sequence as

\[ C_{xy}(m) = E\{\, (x_{n+m} - \mu_x)(y_n - \mu_y)^{*} \,\} \tag{6} \]

where $\mu_x$ and $\mu_y$ are the means of the two stationary processes and $E\{\cdot\}$ denotes the expected value.
The autocorrelation and autocovariance sequences are their special cases, defined as follows:
\[ R_{xx}(m) = E\{\, x_{n+m}\, x_n^{*} \,\} \tag{7} \]

\[ C_{xx}(m) = E\{\, (x_{n+m} - \mu_x)(x_n - \mu_x)^{*} \,\} \tag{8} \]

In practice, only a finite segment $x_1, x_2, \ldots, x_L$ of the process is available, so the autocorrelation sequence is estimated from the data by

\[ \hat{R}_{xx}(m) = \frac{1}{L-m} \sum_{i=1}^{L-m} x_{i+m}\, x_i^{*} \tag{9} \]
Comparing Equation 9 with Equations 3 and 4, we can see that the sequence-order correlation factors θm defined by Chou (2001) are virtually an autocorrelation sequence of the coded protein sequence. Hence all the powerful tools of statistical signal analysis can be used to incorporate sequence-order effects. The goal of spectral analysis is to describe the distribution (over frequency) of the power contained in a signal, based on a finite set of data. The power spectrum of a stationary random process xn is mathematically related to the correlation sequence by the discrete-time Fourier transform. In terms of the physical frequency f (e.g. in hertz), it is given by
\[ S_{xx}(f) = \frac{1}{f_s} \sum_{m=-\infty}^{\infty} R_{xx}(m)\, e^{-j 2\pi m f / f_s} \tag{10} \]

where $f_s$ is the sampling frequency. Conversely, the correlation sequence can be recovered from the PSD by the inverse discrete-time Fourier transform:

\[ R_{xx}(m) = \int_{-f_s/2}^{f_s/2} S_{xx}(f)\, e^{\,j 2\pi m f / f_s}\, df \tag{11} \]
For real signals, the average power of a signal over a particular frequency band [f1, f2] can be found by integrating the PSD over that band:
\[ \bar{P}_{[f_1, f_2]} = \int_{f_1}^{f_2} S_{xx}(f)\, df + \int_{-f_2}^{-f_1} S_{xx}(f)\, df \tag{12} \]

and the total average power of the signal is obtained by integrating over the whole band:

\[ \bar{P} = \int_{-f_s/2}^{f_s/2} S_{xx}(f)\, df = R_{xx}(0) \tag{13} \]
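As a small illustration of the estimator reconstructed as Equation 9, the sketch below computes the sample autocorrelation of a coded sequence; applied to the series f(R1), ..., f(RL) produced by code_sequence() above, it yields exactly the form of the sequence-correlation factor θm in Equation 4.

```python
import numpy as np

def autocorrelation(x: np.ndarray, m: int) -> float:
    """Sample autocorrelation at lag m (Equation 9): the average of
    x[i] * x[i + m] over the L - m available pairs of positions."""
    L = len(x)
    return float(np.dot(x[: L - m], x[m:])) / (L - m)
```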
Since the low-frequency components of a given PSD better reflect the global information (Chou, 1988, 1989) of a given signal than the high-frequency components, we took the first 20 components of the energy spectrum to incorporate most of the sequence-order information. Thus the pseudo-amino acid composition of a protein can be defined in a 40-D space as given by
\[ \mathbf{X} = [\,x_1, x_2, \ldots, x_{20}, x_{21}, \ldots, x_{40}\,]^{\mathrm{T}} \tag{14} \]

where

\[ x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{20} \psi_j}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\psi_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{20} \psi_j}, & 21 \le u \le 40 \end{cases} \tag{15} \]
In the above equation, fi (i = 1, 2, ..., 20) denote the frequencies of occurrence of the 20 native amino acids in a protein, which reflect the amino acid composition (Nakashima et al., 1986; Chou and Zhang, 1993; Chou, 1995), and ψj (j = 1, 2, ..., 20) denote the first 20 low-frequency coefficients of the energy spectrum of the coded protein sequence, which reflect some sort of sequence-order effect. The parameter w is used to control how much sequence-order information should be considered; in this paper, w was set to 0.15 (Pan et al., 2003).
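The construction of Equations 14 and 15 can be sketched as follows. The periodogram is used here as the PSD estimator and the coded series is mean-centred before the FFT; both choices are our assumptions, as the paper does not state which estimator it used. For example, pseudo_aa_composition(seq, code_sequence(seq)) returns the 40-D vector X of Equation 14.

```python
import numpy as np

AA_ORDER = "ARNDCQEGHILKMFPSTWYV"

def pseudo_aa_composition(seq: str, coded: np.ndarray, w: float = 0.15) -> np.ndarray:
    """Build the 40-D pseudo-amino acid vector of Equations 14-15 from a
    sequence and its hydrophilicity coding (assumes len(seq) >= 40 so
    that 20 low-frequency spectral bins are available)."""
    L = len(seq)
    # First 20 components: conventional amino acid composition f_1..f_20.
    f = np.array([seq.count(a) for a in AA_ORDER], dtype=float) / L
    # Periodogram estimate of the power spectral density of the coded,
    # mean-centred series; keep the first 20 low-frequency components.
    x = coded - coded.mean()
    psi = (np.abs(np.fft.rfft(x)) ** 2 / L)[:20]
    # Normalization of Equation 15 with weight w.
    denom = f.sum() + w * psi.sum()
    return np.concatenate([f / denom, w * psi / denom])
```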
Compared with the original pseudo-amino acid composition introduced by Chou (2001), the current representation has the following features. (1) As can be seen from Equation 10, the PSD is mathematically related to the correlation sequence by the discrete-time Fourier transform. Therefore, the sequence-order information harboured by the autocorrelation sequence is also preserved by the PSD, since the transform is linear and loses no information. (2) The PSD is a more compact representation. For the coded protein sequences, most of the signal's power is concentrated in the low-frequency components of the PSD; hence 20 components of the PSD are sufficient to represent the protein sequence, as the high-frequency parts contain little information. (3) The low-frequency components represent the global information (Chou, 1988, 1989) of the coded sequence. The type of a protein can be reflected by the curve of the hydrophilicity values of its residues, and the curve's global shape, captured by the low-frequency components of the PSD, is the more important factor in determining the type of membrane protein. For example, the appearance of several peaks and valleys in the curve may indicate that the corresponding protein is likely to be a multi-pass transmembrane protein (Figure 1).
Results and discussion
During the operation, the width of the Gaussian RBFs was selected as 1 to minimize the estimate of the VC dimension, and the parameter ν was assigned as 0.06. The weight si (i = 1, 2, 3, 4, 5) for each class was determined from the ratio of the class sample sizes according to Equation A29 in Appendix A.
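For concreteness, the Gaussian RBF kernel with width σ takes the standard form below; reading the quoted "width 1" as σ = 1 is our assumption, since the paper does not spell out its kernel parameterization:

\[ K(\mathbf{x}, \mathbf{x}') = \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}}{2\sigma^{2}} \right), \qquad \sigma = 1 \]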
After being trained, the hyper-plane was built in the feature space and thus the output could be obtained. The prediction quality was examined by three methods (Chou and Zhang, 1995): the re-substitution test, the jackknife test and the independent dataset test, as explained below.
Re-substitution test
The so-called re-substitution test is designed to examine the self-consistency of an identification method (Zhou, 1998; Cai, 2001; Zhou and Assa-Munt, 2001; Zhou and Doctor, 2003). When the re-substitution test is performed for the current classifier, the type of each membrane protein in a data set is in turn predicted using the rule parameters derived from the same training data set. The success rates for the 2059 membrane proteins are listed in Table I; the overall success rate is 99.85%, which shows that, after being trained, the weighted ν-SVM has captured the complicated relationship between the pseudo-amino acid composition and the types of membrane proteins. Because the rule parameters derived from the training data set harbour the information of the query protein later plugged back into the test, the re-substitution test tends to underestimate the error and enhance the success rate, so the rate thus obtained may give some sort of optimistic estimate (Chou and Zhang, 1994; Zhou, 1998; Cai, 2001; Zhou and Assa-Munt, 2001; Zhou and Doctor, 2003). However, the re-substitution test is definitely necessary, because any algorithm whose self-consistency performance is poor cannot be deemed a good one; that is, the re-substitution test is necessary but not sufficient for evaluating a classifier. Hence a cross-validation test on an independent testing data set is recommended, because it reflects the generalization ability of a classifier in practical applications. This is very useful for checking whether a training database contains sufficient information to reflect all the important features concerned, so as to yield a high success rate in application. The results of the re-substitution test obtained for the 2059 membrane proteins are given in Tables I and II.
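The test can be phrased in a few lines; the sketch below assumes a scikit-learn-style classifier and is illustrative only.

```python
import numpy as np

def resubstitution_accuracy(X: np.ndarray, y: np.ndarray, clf) -> float:
    """Self-consistency (re-substitution) test: the classifier is trained
    and then scored on the very same data, so the resulting accuracy is
    an optimistic estimate of real-world performance."""
    clf.fit(X, y)
    return float(clf.score(X, y))
```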
Jackknife test
The independent dataset test, the sub-sampling test and the jackknife test are the three most common methods for cross-validation in statistical prediction. Among these three, the jackknife test is regarded as the most objective and effective one; see, e.g., Chou and Zhang (1995) for a comprehensive discussion and Mardia et al. (1979) for the mathematical principles. During jackknifing, each membrane protein in the dataset is in turn taken out and all the rule parameters are calculated from the remaining proteins. In other words, the type of each membrane protein is predicted using the rule parameters derived from all the other membrane proteins except the one being identified. During the process of jackknifing, both the training data set and the testing data set are actually open, and a protein moves from one to the other in turn. The results of the jackknife test thus obtained for the 2059 membrane proteins are also given in Tables I and II.
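A minimal jackknife loop, again assuming a scikit-learn-style classifier, looks as follows; the model is re-trained from scratch for every left-out protein, which is what makes the test expensive but objective.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import LeaveOneOut

def jackknife_accuracy(X: np.ndarray, y: np.ndarray, clf) -> float:
    """Jackknife (leave-one-out) test: each protein is predicted by a
    model whose rule parameters are derived from all the other proteins."""
    hits = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = clone(clf).fit(X[train_idx], y[train_idx])
        hits += int(model.predict(X[test_idx])[0] == y[test_idx][0])
    return hits / len(y)
```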
Independent dataset test
Furthermore, predictions were also made for the 2625 independent membrane proteins based on the rule parameters derived from the 2059 proteins in the training dataset. The 2625 independent proteins were also taken from Chou and Elrod (1999). Of the 2625 proteins, 478 are type I transmembrane proteins, 180 type II transmembrane proteins, 1867 multi-pass transmembrane proteins, 14 lipid chain-anchored membrane proteins and 86 GPI-anchored membrane proteins. The predicted results are also listed in Tables I and II.
From Table I, we may draw the following conclusions. (1) The success rates obtained by the pseudo-amino acid composition approach are significantly higher than those obtained by the other approaches. (2) A comparison between the current approach and all the other approaches indicates that the success rates of the former are about 6% higher than those of the latter in the self-consistency test and 3% higher in the independent dataset test. The only exception is that the jackknife rate is about 3.7% lower than that of the functional-domain approach (Cai et al., 2003b), although it is higher than those of all the other methods.
It should be pointed out that it is not sufficient to compare only the overall success rates. To make an in-depth comparison, one should look into the success rate for each type. In order to illustrate the weighted ν-SVM's ability to compensate for the bias caused by the imbalance of the dataset, a comparison with its original form was made, as listed in Table II, from which the following facts can be deduced. (1) For class 4, which has the smallest size, the success rate of the weighted ν-SVM is about 31% higher than that of the ν-SVM in the self-consistency test and about 51% higher in the independent dataset test. In the jackknife test, the weighted ν-SVM also outperformed the ν-SVM by about 5%. This is fully consistent with what is expected, because the effects caused by the dataset size have been taken into consideration during the formulation of the algorithm. (2) The success rates of the weighted ν-SVM are higher than or equal to those of the original ν-SVM in the self-consistency test, jackknife test and independent dataset test for classes 1, 2, 4 and 5. (3) For class 3, which has the largest size, the success rates of the weighted ν-SVM are lower than those of the ν-SVM in the jackknife test and the independent dataset test; however, the overall success rates of the weighted ν-SVM are higher than those of the ν-SVM. (4) The results are fully consistent with what we expected: the success rates for the classes of small size are improved at the cost of slightly reducing the success rates for the large classes. This is rational because the samples in the small classes are treated as more important, i.e. a larger coefficient is assigned to the data in a small class to improve its prediction accuracy. What matters is that the overall success rates were enhanced.
Conclusion
Appendix A: Weighted support vector machines
Based on the concept of the weighted SVM (Fan et al., 2003), we adopt its specific form, the weighted ν-SVM, and apply it to the problem of predicting membrane protein types.
ν-SVM
The basic idea of applying SVMs to pattern classification can be outlined as follows. First, map the input vectors into a feature space (possibly of higher dimension), either linearly or non-linearly, according to the selection of the kernel function. Then, within the feature space, seek an optimized linear division, i.e. construct a hyper-plane that separates the samples into two classes (this can be extended to multiple classes) with the fewest errors and the maximal margin. SVM training always seeks a globally optimized solution and avoids over-fitting, so it can handle a large number of features. A complete description of the theory of SVMs for pattern recognition was given by Vapnik (1998).
Given a set of samples, i.e. a series of input vectors

\[ (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l), \qquad \mathbf{x}_i \in \mathbb{R}^d,\;\; y_i \in \{+1, -1\} \tag{A1} \]
ν-SVM (Scholkopf et al., 2000) uses the parameter ν to control the number of support vectors and errors. Its primal problem is

\[ \begin{aligned} \min_{\mathbf{w},\, \boldsymbol{\xi},\, \rho,\, b} \quad & \tau(\mathbf{w}, \boldsymbol{\xi}, \rho) = \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 - \nu\rho + \tfrac{1}{l} \textstyle\sum_{i=1}^{l} \xi_i \\ \text{subject to} \quad & y_i \bigl( \langle \mathbf{w}, \Phi(\mathbf{x}_i) \rangle + b \bigr) \ge \rho - \xi_i, \qquad \xi_i \ge 0, \qquad \rho \ge 0 \end{aligned} \tag{A2} \]

where $\Phi$ is the feature-space mapping. Introducing Lagrange multipliers $\alpha_i$ and the kernel $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle$, the corresponding dual problem is

\[ \begin{aligned} \max_{\boldsymbol{\alpha}} \quad & W(\boldsymbol{\alpha}) = -\tfrac{1}{2} \textstyle\sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j) \\ \text{subject to} \quad & 0 \le \alpha_i \le \tfrac{1}{l}, \qquad \textstyle\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad \textstyle\sum_{i=1}^{l} \alpha_i \ge \nu \end{aligned} \tag{A3} \]

and the resulting decision function is

\[ f(\mathbf{x}) = \operatorname{sgn}\Bigl( \textstyle\sum_{i=1}^{l} \alpha_i y_i\, k(\mathbf{x}_i, \mathbf{x}) + b \Bigr) \tag{A4} \]
To calculate b and ρ in the above equations, we need to select the same number of samples (S > 0 is that number) from the two classes. Suppose S+ denotes the samples selected from the positive training dataset and S− those from the negative training dataset. According to the Karush–Kuhn–Tucker (KKT) conditions (Karush, 1939; Cristianini and Shawe-Taylor, 2000), the condition in Equation A2 gives
[Equations (A5)–(A9)]
In C-SVM, the only adjustable parameter is the constant C, which greatly influences performance. However, because there is no natural interpretation of this parameter, it is hard to adjust. In ν-SVM, the parameter C is replaced by ν. In this parameterization (Equation A3), ν places a lower bound on the sum of the αi, which causes the linear term to be dropped from the objective function. Another connection between the two algorithms is that an increase in the parameter C leads to a decrease in the number of support vectors in C-SVM, while in ν-SVM a decrease in ν leads to a smaller number of support vectors.
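As a further aid to interpretation, the well-known "ν-property" proved by Scholkopf et al. (2000) can be stated as follows: ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors,

\[ \frac{\#\{\text{margin errors}\}}{l} \;\le\; \nu \;\le\; \frac{\#\{\text{support vectors}\}}{l} \]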
Weighted ν-SVM
The performance of ν-SVM is also impaired when training sets with uneven class sizes are used. We propose the weighted ν-SVM to solve this problem. The primal problem of the weighted ν-SVM is given by
\[ \begin{aligned} \min_{\mathbf{w},\, \boldsymbol{\xi},\, \rho,\, b} \quad & \tau(\mathbf{w}, \boldsymbol{\xi}, \rho) = \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 - \nu\rho + \tfrac{1}{l} \textstyle\sum_{i=1}^{l} s_i \xi_i \\ \text{subject to} \quad & y_i \bigl( \langle \mathbf{w}, \Phi(\mathbf{x}_i) \rangle + b \bigr) \ge \rho - \xi_i, \qquad \xi_i \ge 0, \qquad \rho \ge 0 \end{aligned} \tag{A10} \]

where $s_i > 0$ is the weight assigned to sample $i$. Carrying through the same dualization as before, the constraints on the dual variables become

\[ 0 \le \alpha_i \le \frac{s_i}{l}, \qquad \textstyle\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad \textstyle\sum_{i=1}^{l} \alpha_i \ge \nu \tag{A11} \]

[Equations (A12)–(A16)]
In the weighted ν-SVM, by assigning the training samples of different classes different weights, we can compensate for the unfavorable impact caused by the uneven class sizes. For simplicity, we analyze the case of two classes. We assign si = s+ for the positive class and si = s− for the negative class. Hence the constraint in Equation A11 reduces to
\[ 0 \le \alpha_i \le \frac{s_+}{l} \;\; (y_i = +1), \qquad 0 \le \alpha_i \le \frac{s_-}{l} \;\; (y_i = -1) \tag{A17} \]

[Equation (A18)]
Because most of the gaps are not zero in trained SVMs, according to the constraints of the primal problem, the margin ρ is greater than zero. By Kuhn–Tucker theory, we obtain δ = 0, δ being the multiplier associated with the constraint ρ ≥ 0; and if ξi > 0, then the multiplier associated with ξi is 0. Substituting this into Equation A6, we obtain
[Equation (A19)]

[Equation (A20)]
Before further analysis, we need to define several notions. A support vector (SV) is a training sample whose dual variable αi ≠ 0. A normal support vector (NSV) is a training sample whose dual variable satisfies 0 < αi < si/l. A boundary support vector (BSV) is a training sample that satisfies both αi = si/l and ξi > 0. The BSVs are the misclassified training points.
Because each BSV has the property αi = si/l, the quantity (s+/l)·nBSV+, where nBSV+ is the number of BSVs belonging to the positive class, is no greater than the sum of the dual variables over the positive class:
[Equations (A21)–(A29)]
In the weighted ν-SVM, by weighting the samples in the small classes, the classification accuracy for the small classes can be improved. Meanwhile, by eliminating the bias toward the large class, the prediction accuracy for the large class is reduced slightly. Such a method can be applied directly to the prediction of membrane protein types, where the training sets of the five classes are highly uneven.
In this paper, we use the one-against-one approach (Knerr et al., 1990), in which k(k − 1)/2 classifiers are constructed, each trained on data from two different classes; k is the number of classes and in this paper k = 5. During the training stage, we first calculate the ratios of the different sample sizes according to Equation A29 and then assign different weights to the training samples of the different classes, as described above. In classification we use a voting strategy: each binary classifier is considered a voter that casts a vote for every data point, and a data point is assigned to the class with the maximum number of votes.
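As a minimal sketch of this overall setup, scikit-learn's NuSVC can stand in for the weighted ν-SVM: it trains the k(k − 1)/2 one-against-one classifiers internally and combines them by voting. The inverse-size class weights and the mapping of "RBF width 1" to gamma = 0.5 (under K(x, x′) = exp(−‖x − x′‖²/2σ²) with σ = 1) are our assumptions, not the paper's stated procedure.

```python
import numpy as np
from sklearn.svm import NuSVC

def train_weighted_nu_svm(X: np.ndarray, y: np.ndarray) -> NuSVC:
    """Train a class-weighted nu-SVM on 40-D pseudo-amino acid vectors."""
    classes, counts = np.unique(y, return_counts=True)
    # Weight each class inversely to its size so that small classes
    # (e.g. lipid chain-anchored proteins) are not swamped by class 3.
    weights = {c: counts.max() / n for c, n in zip(classes, counts)}
    # nu = 0.06 follows the paper; gamma and the weights are assumptions.
    return NuSVC(nu=0.06, kernel="rbf", gamma=0.5,
                 class_weight=weights).fit(X, y)
```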
Acknowledgments
References
Cai,Y.D. (2001) Proteins, 43, 336–338.
Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2000) Mol. Cell Biol. Res. Commun., 4, 230–233.
Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002a) Comput. Chem., 26, 293–296.
Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002b) Internet Electron. J. Mol. Des., 1, 219–226.
Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002c) Peptides, 23, 205–208.
Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002d) J. Cell. Biochem., 84, 343–348.
Cai,Y.D., Lin,S. and Chou,K.C. (2003a) Peptides, 24, 159–161.
Cai,Y.D., Zhou,G.P. and Chou,K.C. (2003b) Biophys. J., 84, 3257–3263.
Cedano,J., Aloy,P., Perez-Pons,J.A. and Querol,E. (1997) J. Mol. Biol., 266, 594–600.
Chew,H.G., Crisp,D.J. and Bogner,R.E. (2000) Target detection in radar imagery using support vector machines with training size biasing. Singapore.
Chew,H.G., Bogner,R.E. and Lim,C.C. (2001) Dual ν-support vector machine with error rate and training size biasing. Salt Lake City, pp. 1269–1272.
Chou,J.J. and Zhang,C.T. (1993) J. Theor. Biol., 161, 251–262.
Chou,K.C. (1988) Biophys. Chem., 30, 3–48.
Chou,K.C. (1989) Trends Biochem. Sci., 14, 212.
Chou,K.C. (1995) Proteins, 21, 319–344.
Chou,K.C. (1999) J. Protein Chem., 18, 473–480.
Chou,K.C. (2000) Biochem. Biophys. Res. Commun., 278, 477–483.
Chou,K.C. (2001) Proteins, 43, 246–255; Erratum, 2001, 44, 60.
Chou,K.C. (2002) In Weiner,P.W. and Lu,Q. (eds), Gene Cloning and Expression Technologies. Westborough, MA: Eaton Publishing, pp. 57–70.
Chou,K.C. and Cai,Y.D. (2002) J. Biol. Chem., 277, 45765–45769.
Chou,K.C. and Cai,Y.D. (2003) J. Cell. Biochem., 90, 1250–1260; Addendum, 2004, 91, 1085.
Chou,K.C. and Elrod,D.W. (1999) Proteins, 34, 137–153.
Chou,K.C. and Zhang,C.T. (1994) J. Biol. Chem., 269, 22014–22020.
Chou,K.C. and Zhang,C.T. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 275–349.
Chou,P.Y. (1980) In Abstracts of Papers, Part I, Second Chemical Congress of the North American Continent, Las Vegas.
Cristianini,N. and Shawe-Taylor,J. (2000) An Introduction to Support Vector Machines. Cambridge: Cambridge University Press.
Ding,C.H. and Dubchak,I. (2001) Bioinformatics, 17, 349–358.
Fan,X.W., Du,S.X. and Wu,T.J. (2003) J. Image Graphics, 8, 1037–1042.
Guo,Z.M. (2002) Master's Thesis, Shanghai Jiaotong University.
Hua,S.J. and Sun,Z.R. (2001) J. Mol. Biol., 308, 397–407.
Karush,W. (1939) MSc Thesis, University of Chicago.
Knerr,S., Personnaz,L. and Dreyfus,G. (1990) Single-layer learning revisited: a stepwise procedure for building and training a neural network. Berlin: Springer.
Lee,Y.J. and Mangasarian,O.L. (2001) RSVM: Reduced Support Vector Machines.
Lin,C.F. and Wang,S.D. (2002) IEEE Trans. Neural Networks, 13, 464–471.
Liu,W. and Chou,K.C. (1999) Protein Eng., 12, 1041–1050.
Mahalanobis,P.C. (1936) Proc. Natl Inst. Sci. India, 2, 49–55.
Mardia,K.V., Kent,J.T. and Bibby,J.M. (1979) Multivariate Analysis. London: Academic Press, pp. 322–381.
Nakashima,H. and Nishikawa,K. (1994) J. Mol. Biol., 238, 54–61.
Nakashima,H., Nishikawa,K. and Ooi,T. (1986) J. Biochem., 99, 152–162.
Pan,Y.X., Zhang,Z.Z., Guo,Z.M., Feng,G.Y., Huang,Z.D. and He,L. (2003) J. Protein Chem., 22, 395–402.
Pillai,K.C.S. (1985) In Kotz,S. and Johnson,N.L. (eds), Encyclopedia of Statistical Sciences. New York: Wiley, pp. 176–181.
Schneider,G. and Wrede,P. (1994) Biophys. J., 66, 335–344.
Scholkopf,B., Smola,A. and Williamson,R.C. (2000) Neural Comput., 12, 1207–1245.
Suykens,J. and Vandewalle,J. (1999) Neural Process. Lett., 9, 293–300.
Vapnik,V.N. (1995) The Nature of Statistical Learning Theory. Berlin: Springer.
Vapnik,V.N. (1998) Statistical Learning Theory. New York: Wiley-Interscience.
Zhou,G.P. (1998) J. Protein Chem., 17, 729–738.
Zhou,G.P. and Assa-Munt,N. (2001) Proteins, 44, 57–59.
Zhou,G.P. and Doctor,K. (2003) Proteins, 50, 44–48.
Received May 10, 2004; revised July 23, 2004; accepted July 26, 2004.
Edited by Alan Fersht