Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition

Meng Wang1,2, Jie Yang1, Guo-Ping Liu1, Zhi-Jie Xu1 and Kuo-Chen Chou1,2,3,4,5,6

1Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, 3Bioinformatics Research Centre, Donghua University, Shanghai 200050, 4Department of Biomedical Engineering, Shanghai Jiaotong University, Shanghai 200030, 5Tianjin Institute of Bioinformatics and Drug Discovery, Tianjin, China and 6Gordon Life Science Institute, San Diego, CA 92130, USA

2 To whom correspondence should be addressed. E-mail: mengking@sjtu.edu.cn; kchou@san.rr.com


    Abstract
Membrane proteins are generally classified into the following five types: (1) type I membrane proteins, (2) type II membrane proteins, (3) multipass transmembrane proteins, (4) lipid chain-anchored membrane proteins and (5) GPI-anchored membrane proteins. Prediction of membrane protein types has become one of the hot topics in bioinformatics. Currently, we face two critical challenges in this area: first, how to take into account the extremely complicated sequence-order effects and, second, how to deal with the highly uneven sizes of the subsets in a training dataset. In this paper, stimulated by the concept of using the pseudo-amino acid composition to incorporate the sequence-order effects, the spectral analysis technique is introduced to represent the statistical sample of a protein. Based on this framework, the weighted support vector machine (SVM) algorithm is applied. The new approach has remarkable power in dealing with the bias that arises when one subset in the training dataset contains many more samples than the others. The new method is particularly useful when the focus is on proteins belonging to small subsets. The results obtained by the self-consistency test, jackknife test and independent dataset test are encouraging, indicating that the current approach may serve as a powerful complementary tool to other existing methods for predicting the types of membrane proteins.

Keywords: Chou's invariance theorem/covariant discriminant algorithm/pseudo-amino acid composition/spectral analysis/weighted ν-SVM


    Introduction
Owing to the development of high-throughput sequencing technology, the data in various biology databases have been increasing at an unprecedented rate, which challenges the speed and ability of biologists and computational scientists to analyze these data. Because most of the specific functions of a cell are carried out by the membrane proteins (see e.g. Alberts et al., 1994; Lodish et al., 1995), prediction of membrane protein types has become a vitally important subject in molecular and cellular biology. Although the type of a membrane protein can be determined by various biochemical experiments, it is both time consuming and costly if the determination is based on an experimental approach alone. In view of this, it is highly desirable to develop an automated method to expedite the speed of determination.

Membrane proteins are generally classified into the following five types: (1) type I membrane proteins, (2) type II membrane proteins, (3) multipass transmembrane proteins, (4) lipid chain-anchored membrane proteins and (5) GPI-anchored membrane proteins (Figure 1). The function of a membrane protein is closely related to the type to which it belongs. Therefore, a fast and efficient method for predicting the type of a membrane protein will significantly speed up the process of function determination for newly found membrane proteins. In a pioneering study based on the amino acid composition, the covariant discriminant algorithm was introduced by Chou and Elrod (1999) to predict the types of membrane proteins. By definition, the conventional amino acid composition is a vector of 20 components, each representing the frequency of occurrence of one of the 20 native amino acids (Nakashima et al., 1986; Chou, 1995; Zhou, 1998). Accordingly, using the amino acid composition to represent a protein sample misses all the sequence-order and sequence-length effects. To cope with this problem, a new concept, the so-called ‘pseudo-amino acid composition’, was proposed by Chou (2001). The pseudo-amino acid composition retains the main features of the amino acid composition while also incorporating some sequence-order effects. Stimulated by its success in improving prediction quality, here we introduce a different approach to formulating the pseudo-amino acid composition.



Fig. 1. Schematic drawing showing the following five types of membrane proteins: (a) type I transmembrane, (b) type II transmembrane, (c) multipass transmembrane, (d) lipid-chain anchored membrane and (e) GPI-anchored membrane. As shown, although both type I and type II membrane proteins are single-pass transmembrane, type I has a cytoplasmic C-terminus and an extracellular or luminal N-terminus for plasma membrane or organelle membrane, respectively, while the arrangement of N- and C-termini in type II membrane proteins is just the reverse. No such distinction was drawn between the extracellular (or luminal) and cytoplasmic sides for the other three types in the current classification scheme. Reproduced from Chou (2002), with permission.

 
Meanwhile, the support vector machine (SVM) has recently been widely used in bioinformatics. However, its performance is greatly limited by uneven subset sizes in the training dataset: the classification results are undesirably biased toward the classes with more samples in the corresponding subsets. In other words, the larger the size of a subset, the smaller the classification error, and the smaller the size of a subset, the larger the classification error. In the dataset constructed by Chou and Elrod (1999), the training subsets are uneven. To cope with this problem, we use the weighted ν-support vector machine (ν-SVM).

This paper is devoted to combining the concept of the pseudo-amino acid composition with the weighted ν-SVM to develop a new predictor of membrane protein types.


    Pseudo amino acid composition and discrete Fourier transform
A protein sequence can be represented as a series of amino acids by their single-character codes A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y, formulated as

$$\mathbf{P} = R_1 R_2 R_3 R_4 \cdots R_L \tag{1}$$
where the component R1 is the first residue, R2 the second residue and so forth.

The conventional amino acid composition is defined as 20 discrete numbers, each representing the frequency of occurrence of one of the 20 native amino acids (Nakashima et al., 1986; Chou and Zhang, 1994; Chou, 1995; Zhou, 1998). Compared with the conventional amino acid composition, the pseudo-amino acid composition is a vector with 20+λ discrete components (Chou, 2001) and hence may be viewed as a point in a (20+λ)-D space, as given by

$$\mathbf{P} = [\,p_1,\ p_2,\ \ldots,\ p_{20},\ p_{20+1},\ \ldots,\ p_{20+\lambda}\,]^{\mathrm{T}} \tag{2}$$
where the first 20 components are the same as in the conventional amino acid composition, while the additional components p20+1, ..., p20+λ are related to λ different ranks (Figure 2) of sequence-order correlation factors, as formulated by the following equation (Chou, 2001):

$$\tau_i = \frac{1}{L-i}\sum_{j=1}^{L-i} J_{j,\,j+i}, \qquad i = 1, 2, \ldots, \lambda \tag{3}$$



 
Fig. 2. A schematic drawing to show (a) the first-rank, (b) the second-rank and (c) the third-rank sequence-order correlation mode along a protein sequence. Panel (a) reflects the correlation mode between all the most contiguous residues, panel (b) that between all the second most contiguous residues and panel (c) that between all the third most contiguous residues. Adapted from Chou (2001), with permission.

 
In the above equation, L denotes the length of the protein and τi is called the ith-rank coupling factor, which harbours the ith sequence-order correlation factor. An illustration of how these factors are associated with the sequence-order effect is given in Figure 2. The coupling factor Ji,j in Equation 3 is defined as a function of the amino acids Ri and Rj, such as the physicochemical distance (Schneider and Wrede, 1994; Chou, 2000) from Ri to Rj, or some combination of several biochemical quantities related to Ri and Rj (Chou, 2001, 2002). Hence τi can be rewritten as:

$$\tau_i = \frac{1}{L-i}\sum_{j=1}^{L-i} f(R_j)\, f(R_{j+i}), \qquad i = 1, 2, \ldots, \lambda \tag{4}$$

As can be seen from Figure 2, the sequence-order effect of a protein can, to some extent, be reflected through a set of discrete numbers τ1, τ2, τ3, ..., τλ, as defined by Equation 4. Such information is very useful for analyzing proteins with a set of discrete numbers. Accordingly, the first 20 components of Equation 2 reflect the effect of the amino acid composition, whereas the components from 20 + 1 to 20 + λ reflect some sequence-order effects. The set of 20 + λ components formulated by Equation 2 is called the pseudo-amino acid composition of protein P. The name is used because it retains the main features of the amino acid composition while containing information beyond it. The pseudo-amino acid composition thus defined has the following advantage: compared with the 210-D pair-coupled amino acid composition (Chou, 1999) and the 400-D first-order coupled amino acid composition (Liu and Chou, 1999), which capture the sequence-order effect only over a very short range (i.e. within two adjacent amino acid residues along a chain), the pseudo-amino acid composition incorporates many more sequence effects, i.e. not only short-range but also medium- and long-range ones, as indicated by a series of sequence-coupling factors with different tiers of correlation (see Figure 2 and Equations 2-4). Therefore, the prediction quality can be significantly improved by using the pseudo-amino acid composition to represent a protein sample. For the detailed formulation and applications of the pseudo-amino acid composition, readers are referred to two recent papers (Chou, 2001; Chou and Cai, 2003).

Below we shall use the technique of spectral analysis to formulate the pseudo-amino acid composition. As shown in Equation 1, the sequence of a protein is composed of a series of characters, which is hard for a computer to process because each element in the sequence is a linguistic symbol rather than a numerical value. To cope with this, each individual amino acid in the protein sequence has to be coded numerically, i.e. expressed in terms of f(Ri) of Equation 4. As is well known, the hydrophilicity of an amino acid is a very important physicochemical property with crucial effects on the folding of a protein as well as its function, particularly for membrane proteins. In view of this, we choose the hydrophilic value of Ri for f(Ri). Since a coded protein sequence can be treated as a stationary random process, many techniques in statistical signal processing can be used to characterize the sequence-order effects of a protein sequence.
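As a concrete illustration of this coding step, the sketch below maps a protein sequence to a numeric signal. The paper does not state which hydrophilicity scale was used, so the Hopp-Woods scale here is an assumption; any 20-value scale can be substituted.

```python
import numpy as np

# Hopp-Woods hydrophilicity scale (an assumption; the paper does not
# specify which hydrophilicity scale was actually used).
HYDROPHILICITY = {
    'A': -0.5, 'R': 3.0, 'N': 0.2, 'D': 3.0, 'C': -1.0,
    'Q': 0.2, 'E': 3.0, 'G': 0.0, 'H': -0.5, 'I': -1.8,
    'L': -1.8, 'K': 3.0, 'M': -1.3, 'F': -2.5, 'P': 0.0,
    'S': 0.3, 'T': -0.4, 'W': -3.4, 'Y': -2.3, 'V': -1.5,
}

def code_sequence(seq):
    """Map a protein sequence R1 R2 ... RL to the numeric signal f(R1), ..., f(RL)."""
    return np.array([HYDROPHILICITY[r] for r in seq])
```

The resulting array is the stationary signal x_n on which the correlation and spectral quantities below operate.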

In statistical signal processing, the correlation, covariance sequence and spectral density function are the three basic statistical quantities of discrete random signals. The true cross-correlation sequence is a statistical quantity defined as

$$R_{xy}(m) = E\{x_{n+m}\, y_n^{*}\} \tag{5}$$
where m is an integer, E{·} is the expected-value operator and the asterisk (*) denotes the complex conjugate. The covariance sequence is the mean-removed cross-correlation sequence

$$C_{xy}(m) = E\{(x_{n+m} - \mu_x)(y_n - \mu_y)^{*}\} \tag{6}$$
where µx and µy are the means of the stationary processes xn and yn, respectively.

The autocorrelation and autocovariance sequences are the special cases obtained when the two processes coincide:

$$R_{xx}(m) = E\{x_{n+m}\, x_n^{*}\} \tag{7}$$

$$C_{xx}(m) = E\{(x_{n+m} - \mu_x)(x_n - \mu_x)^{*}\} \tag{8}$$
In practice, one must estimate these sequences, because it is possible to access only a finite segment of the infinite-length random process. For example, the autocorrelation sequence is estimated as follows:

$$\hat{\tau}(m) = \frac{1}{L - |m|}\sum_{n=1}^{L-|m|} x(n+|m|)\, x^{*}(n) \tag{9}$$
where x(n) is indexed from 1 to L and |m| is the absolute value of the integer m.
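Equation 9 can be implemented directly; a minimal sketch follows, assuming a real-valued signal (so the conjugate is dropped) and the 1/(L−|m|) normalization that mirrors the averaging in Equation 4.

```python
import numpy as np

def autocorr_estimate(x, m):
    """Estimate tau(m) = 1/(L-|m|) * sum_{n=1}^{L-|m|} x(n+|m|) * x(n)."""
    x = np.asarray(x, dtype=float)
    L, m = len(x), abs(m)
    # overlap the signal with a lagged copy of itself and average
    return float(np.dot(x[m:], x[:L - m]) / (L - m))
```

For example, `autocorr_estimate([1, 2, 3, 4], 1)` averages the products 2·1, 3·2 and 4·3 over the three overlapping positions.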

Comparing Equation 9 with Equations 3 and 4, we can see that the sequence-order correlation factors τm defined by Chou (2001) are virtually an autocorrelation sequence of the coded protein sequence. Hence all the powerful tools of statistical signal analysis can be used to incorporate sequence-order effects. The goal of spectral analysis is to describe the distribution (over frequency) of the power contained in a signal, based on a finite set of data. The power spectrum of a stationary random process xn is mathematically related to the correlation sequence by the discrete-time Fourier transform. In terms of the physical frequency f (e.g. in hertz), it is given by

$$X(f) = \sum_{m=1}^{L} \tau(m)\, e^{-j 2\pi f m / f_s} \tag{10}$$
where fs is the sampling frequency, τ(m) (m = 1, ..., L) is the autocorrelation sequence of xn as defined in Equation 9 and j is the imaginary unit. Hence the power spectral density (PSD) of the stationary signal xn is defined as

$$P_{xx}(f) = \frac{X(f)}{L} = \frac{1}{L}\sum_{m=1}^{L} \tau(m)\, e^{-j 2\pi f m / f_s} \tag{11}$$
where L is the number of data points of xn.

For real signals, the average power of a signal over a particular frequency band can be found by integrating the PSD over that band:

$$\bar{P}_{[f_1,\,f_2]} = \int_{f_1}^{f_2} P_{xx}(f)\, df \tag{12}$$
From the above expression, it can be seen that Pxx(f) represents the power content of a signal in an infinitesimal frequency band, which is why we call it the power spectral density. The energy of Pxx(f) is calculated by the following equation:

$$\theta(f) = \sqrt{\mathrm{Re}[P_{xx}(f)]^2 + \mathrm{Im}[P_{xx}(f)]^2} \tag{13}$$
where Re[Pxx(f)] is the real part of Pxx(f) and Im[Pxx(f)] is the imaginary part.

Since the low-frequency components of a given PSD reflect the global information (Chou, 1988, 1989) of a signal better than the high-frequency components, we took the first 20 components of the energy spectrum to incorporate most of the sequence-order information. Thus the pseudo-amino acid composition of a protein can be defined in a 40-D space as given by

$$\mathbf{P} = [\,p_1,\ p_2,\ \ldots,\ p_{20},\ p_{21},\ \ldots,\ p_{40}\,]^{\mathrm{T}} \tag{14}$$
where

$$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{20} \theta_j}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{20} \theta_j}, & 21 \le u \le 40 \end{cases} \tag{15}$$

In the above equation, fi (i = 1, 2, ..., 20) denotes the frequency of occurrence of the ith of the 20 native amino acids in a protein, which actually reflects the amino acid composition (Nakashima et al., 1986; Chou and Zhang, 1993; Chou, 1995), and θj (j = 1, 2, ..., 20) denotes the jth of the first 20 low-frequency coefficients of the energy spectrum of the protein sequence, which reflect some sort of sequence-order effect. The parameter w controls how much sequence-order information is considered; in this paper, w was set to 0.15 (Pan et al., 2003).
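Putting the pieces together, the 40-D vector of Equations 14 and 15 can be sketched as follows. The spectral estimator here (a DFT of the autocorrelation sequence, keeping the magnitudes of the first 20 low-frequency bins as the θj) is one reading of Equations 9-13; the exact estimator and hydrophilicity scale used in the paper are not fully specified, so both are assumptions.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def pseudo_aa_40d(seq, scale, w=0.15):
    """40-D pseudo-amino acid composition: 20 composition terms + 20 spectral terms.

    seq: protein sequence (length >= 20); scale: dict mapping each of the
    20 residues to a numeric (e.g. hydrophilicity) value -- an assumption here.
    """
    L = len(seq)
    x = np.array([scale[r] for r in seq], dtype=float)
    # first 20 components: amino acid composition f_i
    f = np.array([seq.count(a) for a in AA], dtype=float) / L
    # autocorrelation sequence (Equation 9), lags m = 0 .. L-1
    tau = np.array([np.dot(x[m:], x[:L - m]) / (L - m) for m in range(L)])
    # energy spectrum (Equations 10-13): magnitude of the DFT of tau;
    # keep the first 20 low-frequency coefficients as theta_j
    theta = np.abs(np.fft.fft(tau))[:20]
    # normalization of Equation 15 (note f.sum() == 1)
    denom = f.sum() + w * theta.sum()
    return np.concatenate([f, w * theta]) / denom
```

By construction all 40 components are non-negative and sum to 1, as required of a composition-style vector.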

Compared with the original pseudo-amino acid composition introduced by Chou (2001), the current representation has the following features. (1) It can be seen from Equation 10 that the PSD is mathematically related to the correlation sequence by the discrete-time Fourier transform. Therefore, the sequence-order information harboured by the autocorrelation sequence is also preserved by the PSD, since the transform is linear and loses no information. (2) The PSD is a more compact representation. For the coded protein sequences, most of the signal's power concentrates in the low-frequency components of the PSD; hence 20 components of the PSD are sufficient to represent the protein sequence, as the high-frequency parts contain little information. (3) The low-frequency components represent the global information (Chou, 1988, 1989) of the coded sequence. The type of a protein can be reflected by the curve of the hydrophobicity values of its residues, and the curve's global shape, represented by the low-frequency components of the PSD, is the more important factor in determining the type of membrane protein. For example, the appearance of several peaks and valleys in the curve may indicate that the corresponding protein is likely to be a multi-pass transmembrane protein (Figure 1).


    Results and discussion
Using the dataset constructed by Chou and Elrod (1999), we tested our approach to demonstrate its feasibility. The dataset contains 2059 membrane protein sequences: 435 type I transmembrane proteins, 152 type II transmembrane proteins, 1311 multi-pass transmembrane proteins, 51 lipid chain-anchored membrane proteins and 110 GPI-anchored membrane proteins (Figure 1). Chou and Elrod classified the 2059 proteins into five groups; the names of these proteins are given in Table I of Chou and Elrod (1999).


 
Table I. Overall rates of correct prediction for the five membrane protein types by different algorithms and test methods

 
Instead of the covariant discriminant algorithm, which is a combination of the Mahalanobis distance (Mahalanobis, 1936; Pillai, 1985) and Chou's invariance theorem (Zhou and Assa-Munt, 2001; Pan et al., 2003; Zhou and Doctor, 2003), here we use an SVM, a kind of learning machine, to conduct the prediction. However, owing to the highly uneven sizes of the sub-datasets investigated here, the SVM is often undesirably biased toward the classes of membrane proteins with more samples. To deal with this problem, the weighted ν-SVM is proposed to improve the prediction accuracy for small subsets. For a detailed description of this algorithm, see Appendix A, where a full introduction to the ν-SVM is given, followed by an analysis of the reasons for the undesirable bias toward the subsets with more samples in the training set. Finally, by assigning different weights to the samples in different subsets, the unfavourable impact of the uneven class sizes is compensated.

During the operation, the width of the Gaussian RBF kernel was set to 1 to minimize the estimate of the VC dimension, and the parameter ν was set to 0.06. The weight si (i = 1, 2, 3, 4, 5) for each class is determined by the following procedure:

  1. Set smax, the weight of the largest subset, to 1.
  2. Set the other si (i = 1, 2, 3, 4, 5) from ℓmax and ℓi, where ℓmax denotes the number of training samples in the largest data set and ℓi is the number of training samples in each of the other data sets.
  3. Train the weighted ν-SVM with the given sample weight si (i = 1, 2, 3, 4, 5) for each class and classify newly entered data.
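The explicit weighting formula in step 2 was lost in the extraction of this article; a common choice consistent with steps 1 and 2 (the largest class gets weight 1 and smaller classes get proportionally larger weights) is si = ℓmax/ℓi, which is assumed in the sketch below.

```python
def class_weights(train_counts):
    """Assumed weighting s_i = l_max / l_i (NOT confirmed by the paper):
    the largest class gets s = 1, smaller classes get larger penalties."""
    l_max = max(train_counts.values())
    return {cls: l_max / l for cls, l in train_counts.items()}

# sizes of the five training subsets from Chou and Elrod (1999)
weights = class_weights({1: 435, 2: 152, 3: 1311, 4: 51, 5: 110})
```

Under this assumption the multi-pass class (class 3) keeps weight 1, while the small lipid chain-anchored class (class 4) is penalized roughly 26 times more heavily per misclassified sample.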

After training, the hyper-plane was built in the feature space and thus the output could be obtained. The prediction quality was examined by three methods (Chou and Zhang, 1995): the re-substitution test, the jackknife test and the independent dataset test, as explained below.

Re-substitution test

The so-called re-substitution test is designed to examine the self-consistency of an identification method (Zhou, 1998; Cai, 2001; Zhou and Assa-Munt, 2001; Zhou and Doctor, 2003). When the re-substitution test is performed for the current classifier, the type of each membrane protein in the data set is in turn predicted using the rule parameters derived from the same training data set. The success rate for the 2059 membrane proteins is listed in Table I; the overall success rate is 99.85%, which shows that after training the weighted ν-SVM has captured the complicated relationship between the pseudo-amino acid composition and the types of membrane proteins. Because the rule parameters are derived from a training data set that includes the very protein later plugged back in as the query, the re-substitution test tends to underestimate the error and inflate the success rate; the rate thus obtained gives a somewhat optimistic estimate (Chou and Zhang, 1994; Zhou, 1998; Cai, 2001; Zhou and Assa-Munt, 2001; Zhou and Doctor, 2003). However, the re-substitution test is definitely necessary, because any algorithm with poor self-consistency cannot be deemed a good one; that is, the test is necessary but not sufficient for evaluating a classifier. Hence a cross-validation test on an independent testing data set is recommended, because it reflects the generalization of a classifier in practical applications. This is very useful for checking whether a training database contains sufficient information to reflect all the important features concerned, so as to yield a high success rate in application. The results of the re-substitution test for the 2059 membrane proteins are given in Tables I and II.


 
Table II. Rates of correct prediction (%) for the five membrane protein types by the weighted ν-SVM and ν-SVM algorithms and different test methods

 
Jackknife test

The independent dataset test, the sub-sampling test and the jackknife test are the three most common cross-validation methods in statistical prediction. Among the three, the jackknife test is regarded as the most objective and effective; see, e.g., Chou and Zhang (1995) for a comprehensive discussion and Mardia et al. (1979) for the mathematical principles. During jackknifing, each membrane protein in the dataset is in turn taken out and all the rule parameters are calculated from the remaining proteins. In other words, the type of each membrane protein is predicted using the rule parameters derived from all the other membrane proteins except the one being identified. During jackknifing, both the training data set and the testing data set are actually open, and a protein moves from one to the other in turn. The jackknife results for the 2059 membrane proteins are also given in Tables I and II.
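The jackknife protocol itself is classifier-agnostic and can be sketched in a few lines. The nearest-centroid rule below is only a self-contained stand-in for the weighted ν-SVM; any fit-and-predict routine can be plugged in.

```python
import numpy as np

def nearest_centroid(X_train, y_train, x_query):
    """Stand-in classifier: assign x_query to the class with the closest mean."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(centroids - x_query, axis=1))]

def jackknife_accuracy(X, y, fit_predict=nearest_centroid):
    """Leave-one-out: each sample is predicted from rules derived from all others."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i      # remove the protein being identified
        hits += fit_predict(X[mask], y[mask], X[i]) == y[i]
    return hits / len(X)
```

Each iteration retrains from scratch on the remaining samples, which is exactly why the jackknife is more expensive but less optimistic than the re-substitution test.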

Independent dataset test

Furthermore, predictions were also made for 2625 independent membrane proteins based on the rule parameters derived from the 2059 proteins in the training dataset. The 2625 independent proteins were also taken from Chou and Elrod (1999): 478 are type I transmembrane proteins, 180 type II transmembrane proteins, 1867 multi-pass transmembrane proteins, 14 lipid chain-anchored membrane proteins and 86 GPI-anchored membrane proteins. The predicted results are also listed in Tables I and II.

From Table I, we may draw the following conclusions. (1) The success rates obtained by the pseudo-amino acid composition approach are significantly higher than those obtained by the other approaches. (2) A comparison between the current approach and the others indicates that the success rates of the former are about 6% higher in the self-consistency test and 3% higher in the independent dataset test. The only exception is that the jackknife result is about 3.7% lower than that of the functional-domain approach (Cai et al., 2003b), although it is higher than those of all the other methods.

It should be pointed out that comparing only the overall success rate is not sufficient; for an in-depth comparison, one should examine the success rate for each type. To illustrate the weighted ν-SVM's ability to compensate for the bias caused by the imbalance of the dataset, a comparison with its original form is given in Table II, from which the following facts can be deduced. (1) For class 4, which has the smallest size, the success rate of the weighted ν-SVM is about 31% higher than that of the ν-SVM in the self-consistency test and about 51% higher in the independent dataset test; in the jackknife test, the weighted ν-SVM also outperformed the ν-SVM by about 5%. This is fully consistent with expectation, because the effects caused by the dataset sizes have been taken into account in formulating the algorithm. (2) The success rates of the weighted ν-SVM are higher than or equal to those of the original ν-SVM in the self-consistency, jackknife and independent dataset tests for classes 1, 2, 4 and 5. (3) For class 3, which has the largest size, the success rates of the weighted ν-SVM are lower than those of the ν-SVM in the jackknife and independent dataset tests; however, the overall success rates of the weighted ν-SVM are higher. (4) The results are fully consistent with what we expected: the success rates for the small classes are improved at the cost of slightly reducing the success rates for the large classes. This is rational, because the samples in small classes are treated as more important, i.e. a larger coefficient is assigned to the data in a small class to improve its prediction accuracy. Importantly, the overall success rates were also enhanced.


    Conclusion
The above results indicate that the types of membrane proteins are predictable with considerable accuracy. The development of statistical prediction of protein attributes generally consists of two aspects: constructing a training dataset and formulating a prediction algorithm. The latter in turn consists of two parts, i.e. how to represent a protein and how to operate the prediction. The progression from the 20-D amino acid composition space (Nakashima et al., 1986; Chou and Zhang, 1993; Chou, 1995; Zhou, 1998), through the (20+λ)-D pseudo-amino acid composition space (Chou, 2001), to the 2005-D functional-domain composition space (Chou and Cai, 2002; Cai et al., 2003b) reflects the development in representing a protein sample. In this paper, the technique of signal spectrum analysis was introduced to represent a protein via the pseudo-amino acid composition (Chou, 2001), so as to incorporate sequence-order information. The progression from the simple geometric distance algorithm (Nakashima and Nishikawa, 1994), through the Mahalanobis distance algorithm (Chou and Zhang, 1994; Chou, 1995) and the covariant discriminant algorithm (Chou and Elrod, 1999; Pan et al., 2003; Zhou and Doctor, 2003), to the current weighted ν-SVM algorithm reflects the development of the operating algorithms. The weighted ν-SVM algorithm is particularly useful for the problems caused by uneven subset sizes in the training dataset, or when the classification accuracy for a small subset is of primary interest.


    Appendix A: Weighted support vector machines
Owing to their global optimization, the sparseness of their solutions and their use of kernel-induced feature spaces, SVMs are widely used in bioinformatics. They have been applied to protein fold recognition (Ding and Dubchak, 2001), protein–protein interaction prediction (Bock and Gough, 2001), protein secondary structure prediction (Hua and Sun, 2001), protein structural class prediction (Cai et al., 2002a), prediction of the specificity of GalNAc-transferase (Cai et al., 2002c), subcellular location prediction (Cai et al., 2000, 2002d; Chou and Cai, 2002), signal peptide prediction (Cai et al., 2003a) and membrane protein type prediction (Cai et al., 2002b, 2003b). Many variants have been proposed to improve performance, such as the C-SVM (Vapnik, 1995, 1998), the one-class SVM (Scholkopf et al., 2000), the RSVM (Lee and Mangasarian, 2001), the ν-SVM (Scholkopf et al., 2000), the weighted SVM (Chew et al., 2001; Lin and Wang, 2002) and the LS-SVM (Suykens and Vandewalle, 1999). When training datasets with uneven class sizes are used, the classification results of support vector machines are undesirably biased toward the classes with more samples: the larger the subset, the smaller the classification error. Some efforts have been made in this regard (Chew et al., 2000, 2001; Lin and Wang, 2002). The weighted SVM (Fan et al., 2003) compensates for the unfavourable impact of this bias by assigning each subset a different penalty coefficient.

Based on the concept of the weighted SVM (Fan et al., 2003), we adopt its specific form, the weighted ν-SVM, and apply it to the prediction of membrane protein types.

ν-SVM

The basic idea of applying SVMs to pattern classification can be outlined as follows. First, map the input vectors into a feature space (possibly of higher dimension), either linearly or non-linearly, according to the selection of the kernel function. Then, within the feature space, seek an optimal linear division, i.e. construct a hyper-plane that separates the samples into two classes (this can be extended to multiple classes) with the fewest errors and maximal margin. The SVM training process always seeks a globally optimized solution and avoids over-fitting, so it can handle a large number of features. A complete description of the theory of SVMs for pattern recognition was given by Vapnik (1998).

Given a set of {ell} samples, i.e. a series of input vectors

$$(\mathbf{x}_1, y_1),\ (\mathbf{x}_2, y_2),\ \ldots,\ (\mathbf{x}_\ell, y_\ell), \qquad \mathbf{x}_i \in \mathbb{R}^d \tag{A1}$$
where xi can be regarded as the ith protein, a vector defined in the 40-D pseudo-amino acid space according to Equations 14 and 15, and Rd is a Euclidean space with d dimensions. Since a multi-class identification problem can always be converted into two-class identification problems, without loss of generality the formulation below is given for the two-class case only. Suppose that the output derived from the learning machine is expressed by yi ∈ {+1, −1} (i = 1, ..., ℓ), where −1 and +1 stand for the two classes concerned. The goal is to construct a binary classifier, i.e. to derive from the available samples a decision function with a small probability of misclassifying a future sample.

The ν-SVM (Scholkopf et al., 2000) uses the parameter ν to control the number of support vectors and errors. Its primal problem is

$$\begin{aligned} \min_{\mathbf{w},\,\boldsymbol{\xi},\,b,\,\rho}\quad & \tfrac{1}{2}\|\mathbf{w}\|^2 - \nu\rho + \tfrac{1}{\ell}\sum_{i=1}^{\ell}\xi_i \\ \text{s.t.}\quad & y_i\big(\mathbf{w}\cdot\Phi(\mathbf{x}_i) + b\big) \ge \rho - \xi_i, \qquad \xi_i \ge 0, \qquad \rho \ge 0 \end{aligned} \tag{A2}$$
Its dual problem is

$$\begin{aligned} \max_{\boldsymbol{\alpha}}\quad & -\tfrac{1}{2}\sum_{i=1}^{\ell}\sum_{j=1}^{\ell} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \\ \text{s.t.}\quad & 0 \le \alpha_i \le \tfrac{1}{\ell}, \qquad \sum_{i=1}^{\ell} \alpha_i y_i = 0, \qquad \sum_{i=1}^{\ell} \alpha_i \ge \nu \end{aligned} \tag{A3}$$
The decision function is

$$f(\mathbf{x}) = \mathrm{sgn}\Big(\sum_{i=1}^{\ell} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\Big) \tag{A4}$$

To calculate b and ρ in the above equation, we select the same number S > 0 of samples from each of the two classes. Let S+ be the set of these samples drawn from the positive training dataset and S− that drawn from the negative training dataset. According to the Karush–Kuhn–Tucker (KKT) conditions (Karush, 1939; Cristianini and Shawe-Taylor, 2000), the condition in Equation A2

$$y_i\big(\mathbf{w}\cdot\Phi(\mathbf{x}_i) + b\big) \ge \rho - \xi_i \tag{A5}$$
becomes

$$\mathbf{w}\cdot\Phi(\mathbf{x}) + b = \rho, \qquad \mathbf{x} \in S_+ \tag{A6}$$
and

$$\mathbf{w}\cdot\Phi(\mathbf{x}) + b = -\rho, \qquad \mathbf{x} \in S_- \tag{A7}$$
Hence, with some deduction, we obtain the formulae for calculating b and ρ:

$$b = -\frac{1}{2S}\sum_{\mathbf{x}\in S_+\cup S_-}\ \sum_{j=1}^{\ell} \alpha_j y_j K(\mathbf{x}_j, \mathbf{x}) \tag{A8}$$

$$\rho = \frac{1}{2S}\Big(\sum_{\mathbf{x}\in S_+}\sum_{j=1}^{\ell} \alpha_j y_j K(\mathbf{x}_j, \mathbf{x}) - \sum_{\mathbf{x}\in S_-}\sum_{j=1}^{\ell} \alpha_j y_j K(\mathbf{x}_j, \mathbf{x})\Big) \tag{A9}$$

In the C-SVM, the only adjustable parameter is the constant C, which greatly influences performance. However, because this parameter has no natural interpretation, it is hard to tune. In the ν-SVM, the parameter C is replaced by ν. In this parameterization (Equation A3), ν places a lower bound on the sum of the αi, which causes the linear term to be dropped from the objective function. Another connection between the two algorithms is that an increase in C leads to a decrease in the number of support vectors in the C-SVM, while a decrease in ν likewise leads to a smaller number of support vectors.

Weighted ν-SVM

The performance of the ν-SVM is also impaired when training sets with uneven class sizes are used. We propose the weighted ν-SVM to solve this problem. Its primal problem is given by

(A10)
where si denotes the weight of each sample. We use the Lagrange multiplier method to solve the above optimization problem:

\[
L(w,b,\xi,\rho,\alpha,\beta,\delta)=\frac{1}{2}\|w\|^2-\nu\rho+\frac{1}{\ell}\sum_{i=1}^{\ell}s_i\xi_i-\sum_{i=1}^{\ell}\alpha_i\bigl[y_i\bigl(w\cdot\Phi(x_i)+b\bigr)-\rho+\xi_i\bigr]-\sum_{i=1}^{\ell}\beta_i\xi_i-\delta\rho
\tag{A11}
\]
where αi ≥ 0, βi ≥ 0 and δ ≥ 0 are all Lagrange multipliers. Differentiating with respect to w, b, ξi and ρ and imposing stationarity, we obtain

\[
\frac{\partial L}{\partial w}=0:\qquad w=\sum_{i=1}^{\ell}\alpha_i y_i\Phi(x_i)
\tag{A12}
\]

\[
\frac{\partial L}{\partial b}=0:\qquad \sum_{i=1}^{\ell}\alpha_i y_i=0
\tag{A13}
\]

\[
\frac{\partial L}{\partial \xi_i}=0:\qquad \alpha_i+\beta_i=\frac{s_i}{\ell}
\tag{A14}
\]

\[
\frac{\partial L}{\partial \rho}=0:\qquad \sum_{i=1}^{\ell}\alpha_i-\delta=\nu
\tag{A15}
\]
Substituting Equations A12–A15 into Equation A11, we obtain its dual problem:

\[
\max_{\alpha}\ -\frac{1}{2}\sum_{i=1}^{\ell}\sum_{j=1}^{\ell}\alpha_i\alpha_j y_i y_j K(x_i,x_j)
\quad\text{s.t.}\quad 0\le\alpha_i\le\frac{s_i}{\ell},\quad \sum_{i=1}^{\ell}\alpha_i y_i=0,\quad \sum_{i=1}^{\ell}\alpha_i\ge\nu
\tag{A16}
\]

In the weighted ν-SVM, by assigning different weights to the training samples of different classes, we can compensate for the unfavorable impact of uneven class sizes. For simplicity, we analyze the two-class case. We assign si = s+ for the positive class and si = s– for the negative class. Hence the box constraint in the dual problem (Equation A16) reduces to

\[
0\le\alpha_i\le\frac{s^{+}}{\ell}\ \ (y_i=+1),\qquad 0\le\alpha_i\le\frac{s^{-}}{\ell}\ \ (y_i=-1)
\tag{A17}
\]
According to the constraint in Equation A13, we obtain

\[
\sum_{y_i=+1}\alpha_i=\sum_{y_i=-1}\alpha_i
\tag{A18}
\]

Because the margin in a trained SVM is generally not zero, by the constraint ρ ≥ 0 in Equation A10 we have ρ > 0. By the Karush–Kuhn–Tucker theory, δρ = 0; hence if ρ > 0, then δ = 0. Substituting this into Equation A15, we obtain

\[
\sum_{i=1}^{\ell}\alpha_i=\nu
\tag{A19}
\]

Thus, from Equations A18 and A19, we deduce that

\[
\sum_{y_i=+1}\alpha_i=\sum_{y_i=-1}\alpha_i=\frac{\nu}{2}
\tag{A20}
\]

Before further analysis, we need to define several notions. A support vector (SV) is a training sample whose dual variable satisfies αi > 0. A normal support vector (NSV) is a training sample whose dual variable satisfies 0 < αi < si/ℓ. A boundary support vector (BSV) is a training sample that satisfies both αi = si/ℓ and ξi > 0. The BSVs are misclassified training points.
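These definitions amount to a small decision rule on (αi, ξi). The sketch below is our illustration, not the paper's; the function name and tolerance are assumptions. It labels a training sample from its dual variable, its slack and its class-dependent upper bound si/ℓ:

```python
def sample_kind(alpha, xi, upper, tol=1e-9):
    """Classify a training sample given alpha_i, xi_i and upper = s_i / ell."""
    if alpha <= tol:
        return "non-SV"    # alpha_i = 0: plays no role in the expansion
    if alpha < upper - tol:
        return "NSV"       # 0 < alpha_i < s_i/ell: normal support vector
    # alpha_i = s_i/ell: at the bound; a BSV if it also has positive slack
    return "BSV" if xi > tol else "bounded SV"
```

For example, with upper = 0.005, sample_kind(0.002, 0.0, 0.005) returns "NSV" while sample_kind(0.005, 0.3, 0.005) returns "BSV".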

Because each BSV of the positive class has the property αi = s+/ℓ, the sum of the dual variables of the NBSV+ BSVs belonging to the positive class cannot exceed the total Σyi=+1 αi = ν/2, where NBSV+ is the number of BSVs that belong to the positive class:

\[
N_{\mathrm{BSV}+}\cdot\frac{s^{+}}{\ell}\le\frac{\nu}{2}
\tag{A21}
\]

i.e.

\[
N_{\mathrm{BSV}+}\le\frac{\nu\ell}{2s^{+}}
\tag{A22}
\]

In addition, the maximum of an SV's dual variable αi is s+/ℓ; let NSV+ denote the number of SVs belonging to the positive class. Since the total ν/2 is shared among at most NSV+ variables, each no larger than s+/ℓ, we have

\[
\frac{\nu}{2}\le N_{\mathrm{SV}+}\cdot\frac{s^{+}}{\ell}
\tag{A23}
\]

i.e.

\[
N_{\mathrm{SV}+}\ge\frac{\nu\ell}{2s^{+}}
\tag{A24}
\]
Combining Equations A22 and A24, we have

\[
N_{\mathrm{BSV}+}\le\frac{\nu\ell}{2s^{+}}\le N_{\mathrm{SV}+}
\tag{A25}
\]

Likewise, we have a similar result for the negative class:

\[
N_{\mathrm{BSV}-}\le\frac{\nu\ell}{2s^{-}}\le N_{\mathrm{SV}-}
\tag{A26}
\]
Dividing Equations A25 and A26 by ℓ+ and ℓ–, respectively, we obtain

\[
\frac{N_{\mathrm{BSV}+}}{\ell^{+}}\le\frac{\nu\ell}{2s^{+}\ell^{+}}\le\frac{N_{\mathrm{SV}+}}{\ell^{+}}
\tag{A27}
\]

\[
\frac{N_{\mathrm{BSV}-}}{\ell^{-}}\le\frac{\nu\ell}{2s^{-}\ell^{-}}\le\frac{N_{\mathrm{SV}-}}{\ell^{-}}
\tag{A28}
\]

where ℓ+ is the number of positive training samples, ℓ– the number of negative training samples and ℓ = ℓ+ + ℓ–. Since NBSV+/ℓ+ and NBSV–/ℓ– are the fractions of misclassified training samples, the quantities 1 – NBSV+/ℓ+ and 1 – NBSV–/ℓ– may be interpreted as the rates of accuracy of the positive and negative classes, respectively. From Equations A27 and A28, the error rate of the positive class is upper bounded by νℓ/(2s+ℓ+) and that of the negative class by νℓ/(2s–ℓ–). Therefore, in order to balance the two classes' rates of accuracy, we only need to force these two bounds to be equal, which gives the following equation:

\[
\frac{s^{+}}{s^{-}}=\frac{\ell^{-}}{\ell^{+}}
\tag{A29}
\]

In the weighted ν-SVM, by up-weighting the samples of the small class, the classification accuracy of the small class can be improved; meanwhile, because the bias toward the large class is eliminated, the prediction accuracy of the large class decreases slightly. Such a method can be applied directly to the prediction of membrane protein types, where the training sets of the five classes are highly uneven.
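Equation A29 fixes only the ratio s+/s– for one binary classifier. A minimal sketch of this weighting rule follows, assuming (our choice, not spelled out in the paper) that the weights are normalized so the smaller class receives weight 1; the function name is hypothetical:

```python
def pairwise_weights(n_pos, n_neg):
    """Weights (s+, s-) for one binary classifier per Equation A29:
    s+/s- = ell-/ell+, normalized so the smaller class gets weight 1."""
    biggest = max(n_pos, n_neg)
    s_pos = n_neg / biggest   # small positive class -> large weight
    s_neg = n_pos / biggest
    return s_pos, s_neg
```

For instance, with 40 positive and 160 negative samples the rule gives s+ = 1.0 and s– = 0.25, so s+/s– = 4 = ℓ–/ℓ+, exactly the ratio Equation A29 demands.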

In this paper, we use the ‘one-against-one’ approach (Knerr et al., 1990), in which k(k – 1)/2 classifiers are constructed, each trained on data from two different classes; k is the number of classes and here k = 5. During the training stage, we first calculate the weight ratio from the class sizes according to Equation A29 and then assign different weights to the training samples of the different classes as described in the next section. In classification we use a voting strategy: each binary classifier casts a vote for one of its two classes on every query point, and the query point is assigned to the class with the maximum number of votes.
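The voting stage can be sketched as follows. This is our illustration, not the authors' code: the classifier functions and labels are hypothetical, and the toy example uses k = 3 rather than the paper's k = 5 for brevity.

```python
from collections import Counter
from itertools import combinations

def ovo_predict(x, classes, pairwise):
    """One-against-one voting: pairwise maps each class pair (a, b) to a
    trained binary classifier returning a or b; the query point x is
    assigned to the class with the maximum number of votes."""
    votes = Counter(pairwise[(a, b)](x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Toy example with k = 3 classes, hence k(k-1)/2 = 3 binary classifiers.
clfs = {(1, 2): lambda x: 2, (1, 3): lambda x: 1, (2, 3): lambda x: 2}
```

Here class 2 collects two votes and class 1 one vote, so ovo_predict(x, [1, 2, 3], clfs) returns 2 for any x.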


    Acknowledgments
 
The authors wish to thank the three anonymous referees whose comments were very helpful in strengthening the presentation of this work.


    References
 
Bock,J.R. and Gough,D.A. (2001) Bioinformatics, 17, 455–460.

Cai,Y.D. (2001) Proteins, 43, 336–338.

Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2000) Mol. Cell Biol. Res. Commun., 4, 230–233.

Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002a) Comput. Chem., 26, 293–296.

Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002b) Internet Electron. J. Mol. Des., 1, 219–226.

Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002c) Peptides, 23, 205–208.

Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002d) J. Cell. Biochem., 84, 343–348.

Cai,Y.D., Lin,S. and Chou,K.C. (2003a) Peptides, 24, 159–161.

Cai,Y.D., Zhou,G.P. and Chou,K.C. (2003b) Biophys. J., 84, 3257–3263.

Cedano,J., Aloy,P., Perez-Pons,J.A. and Querol,E. (1997) J. Mol. Biol., 266, 594–600.

Chew,H.G., Crisp,D.J. and Bogner,R.E. (2000) Target detecting in radar imagery using support vector machines with training size biasing. Singapore.

Chew,H.G., Bogner,R.E. and Lim,C.C. (2001) Dual nu-support vector machine with error rate and training size biasing. Salt Lake City, pp. 1269–1272.

Chou,J.J. and Zhang,C.T. (1993) J. Theor. Biol., 161, 251–262.

Chou,K.C. (1988) Biophys. Chem., 30, 3–48.

Chou,K.C. (1989) Trends Biochem. Sci., 14, 212.

Chou,K.C. (1995) Proteins, 21, 319–344.

Chou,K.C. (1999) J. Protein Chem., 18, 473–480.

Chou,K.C. (2000) Biochem. Biophys. Res. Commun., 278, 477–483.

Chou,K.C. (2001) Proteins, 43, 246–255; Erratum, 2001, 44, 60.

Chou,K.C. (2002) In Weinrer,P.W. and Lu,Q. (eds), Gene Cloning and Expression Technologies. Westborough, MA: Eaton Publishing, pp. 57–70.

Chou,K.C. and Cai,Y.D. (2002) J. Biol. Chem., 277, 45765–45769.

Chou,K.C. and Cai,Y.D. (2003) J. Cell. Biochem., 90, 1250–1260; Addendum, 2004, 91, 1085.

Chou,K.C. and Elrod,D.W. (1999) Proteins, 34, 137–153.

Chou,K.C. and Zhang,C.T. (1994) J. Biol. Chem., 269, 22014–22020.

Chou,K.C. and Zhang,C.T. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 275–349.

Chou,P.Y. (1980) In Abstracts of Papers, Part I, Second Chemical Congress of the North American Continent, Las Vegas.

Cristianini,N. and Shawe-Taylor,J. (2000) Support Vector Machines. Cambridge: Cambridge University Press.

Ding,C.H. and Dubchak,I. (2001) Bioinformatics, 17, 349–358.

Fan,X.W., Du,S.X. and Wu,T.J. (2003) J. Image Graphics, 8, 1037–1042.

Guo,Z.M. (2002) Master's Thesis, Shanghai Jiaotong University.

Hua,S.J. and Sun,Z.R. (2001) J. Mol. Biol., 308, 397–407.

Karush,W. (1939) MSc Thesis, University of Chicago.

Knerr,S., Personnaz,L. and Dreyfus,G. (1990) Single-layer Learning Revisited: a Stepwise Procedure for Building and Training Neural Networks. Berlin: Springer.

Lee,Y.J. and Mangasarian,O.L. (2001) RSVM: Reduced Support Vector Machines.

Lin,C.F. and Wang,S.D. (2002) IEEE Trans. Neural Networks, 13, 464–471.

Liu,W. and Chou,K.C. (1999) Protein Eng., 12, 1041–1050.

Mahalanobis,P.C. (1936) Proc. Natl Inst. Sci. India, 2, 49–55.

Mardia,K.V., Kent,J.T. and Bibby,J.M. (1979) Multivariate Analysis. London: Academic Press, pp. 322–381.

Nakashima,H. and Nishikawa,K. (1994) J. Mol. Biol., 238, 54–61.

Nakashima,H., Nishikawa,K. and Ooi,T. (1986) J. Biochem., 99, 152–162.

Pan,Y.X., Zhang,Z.Z., Guo,Z.M., Feng,G.Y., Huang,Z.D. and He,L. (2003) J. Protein Chem., 22, 395–402.

Pillai,K.C.S. (1985) In Kotz,S. and Johnson,N.L. (eds), Encyclopedia of Statistical Sciences. New York: Wiley, pp. 176–181.

Schneider,G. and Wrede,P. (1994) Biophys. J., 66, 335–344.

Scholkopf,B., Smola,A. and Williamson,R.C. (2000) Neural Comput., 12, 1207–1245.

Suykens,J. and Vandewalle,J. (1999) Neural Process. Lett., 9, 293–300.

Vapnik,V.N. (1995) The Nature of Statistical Learning Theory. Berlin: Springer.

Vapnik,V.N. (1998) Statistical Learning Theory. New York: Wiley-Interscience.

Zhou,G.P. (1998) J. Protein Chem., 17, 729–738.

Zhou,G.P. and Assa-Munt,N. (2001) Proteins, 44, 57–59.

Zhou,G.P. and Doctor,K. (2003) Proteins, 50, 44–48.

Received May 10, 2004; revised July 23, 2004; accepted July 26, 2004.

Edited by Alan Fersht