Institute of Bioinformatics and System Biology, MOE Key Laboratory of Bioinformatics, State Key Laboratory of Biomembrane and Membrane Biotechnology, Department of Biological Science and Biotechnology, Tsinghua University, Beijing 100084, China
1 To whom correspondence should be addressed. E-mail: sunzhr{at}mail.tsinghua.edu.cn
Abstract
Keywords: classification/dipeptide composition/cytokine/prediction/support vector machine/SVM
Introduction
Besides the many novel functions of the cytokine superfamily, an increasing number of newly discovered molecules have been identified as its members. Although the sequences of these molecules are accumulating quickly, the precise function of a large proportion of them remains unclear. Laboratory work is essential and irreplaceable for confirming a protein's structure and function, but it is too expensive and slow to apply on a large scale. Computational methods offer a quicker and less expensive alternative. Although several methods, such as BLAST, hidden Markov models (HMMs) and artificial neural networks (ANNs), have been exploited for protein family prediction, less effort has been devoted to the prediction of cytokines from sequence data (Altschul et al., 1990; Papasaikas et al., 2003; Bhasin and Raghava, 2004).
This paper describes a support vector machine (SVM)-based method developed for the recognition of cytokines on the basis of dipeptide composition. The method uses a three-step strategy. First, a protein sequence is examined to determine whether it belongs to the cytokine superfamily. If it is recognized as a cytokine, the method then predicts to which family of cytokine it belongs. Finally, it classifies the protein to subfamily level if it belongs to the TGF-β family of cytokines. The performance of this method was evaluated in each step on independent and non-redundant datasets created in this study. An online web server was also developed on the basis of the above method and is freely accessible at http://bioinfo.tsinghua.edu.cn/~huangni/CTKPred/.
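The three-step strategy described above can be illustrated with a minimal sketch. The function name `predict_cascade`, the three classifier callables and the label string `"TGF-beta"` are hypothetical placeholders introduced here for illustration; they are not part of the published method or of the web server.

```python
from typing import Callable, Dict, Optional, Sequence

Vector = Sequence[float]


def predict_cascade(
    features: Vector,
    is_cytokine: Callable[[Vector], bool],    # step 1: hypothetical superfamily classifier
    family_of: Callable[[Vector], str],       # step 2: hypothetical family classifier
    subfamily_of: Callable[[Vector], str],    # step 3: hypothetical TGF-beta subfamily classifier
    tgf_beta_label: str = "TGF-beta",         # assumed label for the TGF-beta family
) -> Dict[str, Optional[object]]:
    """Run the three steps in order, stopping as soon as a step does not apply."""
    result: Dict[str, Optional[object]] = {
        "cytokine": False,
        "family": None,
        "subfamily": None,
    }
    # Step 1: is the protein a member of the cytokine superfamily?
    if not is_cytokine(features):
        return result
    result["cytokine"] = True
    # Step 2: which family does it belong to?
    result["family"] = family_of(features)
    # Step 3: only TGF-beta members are classified to subfamily level.
    if result["family"] == tgf_beta_label:
        result["subfamily"] = subfamily_of(features)
    return result
```

In practice, each callable would wrap a trained classifier that takes the 400-dimensional dipeptide composition vector as input.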
Method and procedure
Recognition of cytokine superfamily
First, we developed an SVM module for identifying cytokines from protein sequence data uncovered by various genome-sequencing projects. The original dataset, obtained from http://cytokine.medic.kumamoto-u.ac.jp/, consisted of 1173 cytokines belonging to the eight major classes. Next, we excluded highly homologous sequences from the dataset using the CD-HIT software (Li et al., 2001, 2002) with a 90% sequence identity threshold (id90), which left 437 sequences. The dataset was then extended with 673 negative examples randomly selected from the SCOP version 1.37 PDB90 domain data. The performance of the module was evaluated using a 7-fold cross-validation test. The SVM was trained on a fixed-dimension (400) vector derived from the dipeptide composition of each protein sequence.
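As an illustration of the 7-fold cross-validation mentioned above, the following sketch partitions the 437 + 673 = 1110 examples into seven folds. The function names, the shuffling seed and the round-robin assignment are assumptions made for this example. Since 1110 is not divisible by 7, the folds can only be nearly equal in size (158 or 159 examples each).

```python
import random


def k_fold_indices(n_samples, k=7, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary illustrative choice
    # Round-robin slicing: fold i receives every k-th index starting at position i.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k=7, seed=0):
    """Yield (train, test) index lists; each fold is used for testing exactly once."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 437 cytokines plus 673 negative examples, as stated in the text.
N_EXAMPLES = 437 + 673
splits = list(train_test_splits(N_EXAMPLES, k=7))
```

Each of the seven `(train, test)` pairs would be used to train one classifier and score it on the held-out fold.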
Recognition of cytokine family
Cytokines can be divided into seven major classes: FGF/HBGF, IL-6, LIF/OSM, MDK/PTN, NGF, TGF-β and TNF. The dataset consisted of 83 sequences from FGF/HBGF, 22 sequences from IL-6, 12 sequences from LIF/OSM, 10 sequences from MDK/PTN, 24 sequences from NGF, 190 sequences from TGF-β and 96 sequences from TNF. Because IL-6, LIF/OSM, MDK/PTN and NGF each lacked sufficient sequences, we merged them into a single class of 68 sequences for the rest of the procedure, leaving four major classes. Classification of cytokines into one of these four classes is a multi-class classification problem. Therefore, a multi-class SVM was employed to classify sequences from all possible classes. The input vectors were derived from the dipeptide composition of the proteins. The performance of the SVM classification was evaluated using 7-fold cross-validation.
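The class merging described above can be checked with a few lines of arithmetic. The family counts are taken from the text; the label `"Other"` for the merged class and the ASCII spelling `"TGF-beta"` are names chosen here for illustration only.

```python
# Sequence counts per family, as given in the text.
family_counts = {
    "FGF/HBGF": 83,
    "IL-6": 22,
    "LIF/OSM": 12,
    "MDK/PTN": 10,
    "NGF": 24,
    "TGF-beta": 190,
    "TNF": 96,
}

# Families with too few sequences to be trained as separate classes.
SMALL_FAMILIES = {"IL-6", "LIF/OSM", "MDK/PTN", "NGF"}


def merged_label(family):
    """Map a family to its training class; "Other" is a hypothetical label."""
    return "Other" if family in SMALL_FAMILIES else family


# Accumulate the counts under the merged labels.
class_counts = {}
for family, count in family_counts.items():
    label = merged_label(family)
    class_counts[label] = class_counts.get(label, 0) + count

total_sequences = sum(class_counts.values())
```

The merged class holds 22 + 12 + 10 + 24 = 68 sequences, and the four classes together account for all 437 non-redundant cytokine sequences.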
Recognition of subfamilies
Classifying a cytokine to the subfamily level is of greater significance to further specific studies. We chose to classify only the TGF-β family, which has the most known sequences, to this lower level, since the other families lack enough sequences for SVM training and cross-validation. As described in Figure 1, TGF-β can be divided into six major subfamilies: bone morphogenetic protein (BMP), growth differentiation factor (GDF), glial-derived neurotrophic factor (GDNF), inhibin (INHA/INHB), transforming growth factor β (TGFB) and others. Again, a multi-class SVM was constructed for this multi-class classification problem, and the performance was evaluated using 2-fold cross-validation because of the smaller number of sequences.
SVMs are a class of statistical learning algorithms whose theoretical basis was first presented by Vapnik (1982). After the 1990s, they became extremely popular in the machine-learning community (Cristianini and Shawe-Taylor, 2000; Hua and Sun, 2001a,b; Bhasin and Raghava, 2004; Guo et al., 2004). In this study, the SVM was implemented using the freely downloadable software package libsvm written by Chang and Lin (2001). The software, which features efficient multi-class classification, enables the user to define a number of parameters and to select from a choice of inbuilt kernel functions, including a radial basis function (RBF) and a polynomial kernel (of given degree). The experiments were conducted using an RBF kernel. Because the SVM requires fixed-length vector input, a fixed-length feature vector was obtained from proteins of variable length using dipeptide composition.
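For readers unfamiliar with the RBF kernel, the sketch below computes it in its usual form, K(x, y) = exp(-gamma * ||x - y||^2), for two fixed-length vectors. The function name is hypothetical, and the value of gamma is left to the caller because the text does not state which kernel parameters were used.

```python
import math


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Identical vectors give a kernel value of 1, and the value decays towards 0 as the vectors move apart, at a rate controlled by gamma.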
Dipeptide composition
The dipeptide composition used as input provides global information on protein features in the form of a fixed-length vector. Dipeptide composition encapsulates information about the fraction of amino acids and their local order. The dipeptide composition of each protein was calculated using the following equation:
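$$ f_i = \frac{n_i}{L - 1}, \qquad i = 1, \dots, 400 $$

where $n_i$ is the number of occurrences of dipeptide $i$ in the protein and $L$ is the length of the sequence, so that $L - 1$ is the total number of dipeptides in the protein.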
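The calculation of dipeptide composition can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the alphabetical ordering of the 400 features and the decision to skip pairs containing non-standard residues are all assumptions made here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 20 x 20 = 400 ordered pairs, in a fixed (assumed) alphabetical order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of dipeptide fractions for one protein."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    # Slide a window of width 2 along the sequence (overlapping pairs).
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]
```

For a sequence made up only of standard residues, the denominator equals the sequence length minus one, and the 400 fractions sum to 1 regardless of the protein's length.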
Performance evaluation
The performance of SVMs in distinguishing cytokines from non-cytokines was evaluated using 7-fold cross-validation. In this approach, the dataset was partitioned randomly into seven equal-sized sets. The training and testing of each classifier were carried out seven times, each time using one distinct set for testing and the other six sets for training. Four threshold-dependent parameters, sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC) (Hua and Sun, 2001b), were used to measure the performance of this module. The performance of the SVM modules constructed for recognizing cytokine family and subfamily was evaluated using 7- and 2-fold cross-validation, respectively, and was also measured by sensitivity, specificity, accuracy and MCC. Sensitivity, specificity, accuracy and MCC were calculated as follows:
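$$ \mathrm{Sensitivity} = \frac{TP}{TP + FN} $$

$$ \mathrm{Specificity} = \frac{TN}{TN + FP} $$

$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

$$ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$

where $TP$, $TN$, $FP$ and $FN$ are the numbers of true positives, true negatives, false positives and false negatives, respectively.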
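The four measures can be computed from the counts of a binary confusion matrix, as in the following sketch. It uses the standard definitions of sensitivity, specificity, accuracy and MCC; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0.0 when it is undefined.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Illustrative counts only; these are not results from the study.
example = binary_metrics(tp=90, tn=80, fp=20, fn=10)
```

Unlike accuracy, MCC stays informative when the positive and negative classes differ in size, as they do here with 437 cytokines and 673 negative examples.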
Results and discussion
To predict the family of cytokines, a multi-class SVM was constructed. The SVM was trained and tested using dipeptide composition and evaluated by 7-fold cross-validation. The performance in recognizing different classes of cytokines is summarized in Table II. As shown, our method discriminated the four families of cytokines with an accuracy of 96.9% and an MCC of 0.93 on average.
Cytokinepred server
Based on our study, we constructed a freely accessible web server at http://bioinfo.tsinghua.edu.cn/~huangni/CTKPred/ that allows users to recognize and classify cytokines from protein sequences. The common gateway interface (CGI) script is written in Perl version 5.8.4. Users can enter one or more protein sequences at a time in FASTA format, either by copy and paste or by file upload. The prediction results are displayed on the screen in a user-friendly format, or e-mailed to the user if a valid e-mail address is provided. The interface of our web server is shown in Figure 2.
References
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) J. Mol. Biol., 215, 403–410.
Benveniste,E.N. (1998) Cytokine Growth Factor Rev., 9, 259–275.
Bhasin,M. and Raghava,G.P. (2004) Nucleic Acids Res., 32(Web Server issue), W383–W399.
Chang,C.-C. and Lin,C.-J. (2001) http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cristianini,N. and Shawe-Taylor,J. (2000) An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.
Derouet,D., Rousseau,F., Alfonsi,F., Froger,J., Hermann,J., Barbier,F., Perret,D., Diveu,C., Guillet,C., Preisser,L., et al. (2004) Proc. Natl Acad. Sci. USA, 101, 4827–4832.
Dranoff,G. (2004) Nat. Rev. Cancer, 4, 11–22.
Guo,J., Chen,H., Sun,Z. and Lin,Y. (2004) Proteins, 54, 738–743.
Hua,S. and Sun,Z. (2001a) J. Mol. Biol., 308, 397–407.
Hua,S. and Sun,Z. (2001b) Bioinformatics, 17, 721–728.
Kleemann,R., Hausser,A., Geiger,G., Mischke,R., Burger-Kentischer,A., Flieger,O., Johannes,F.J., Roger,T., Calandra,T., Kapurniotu,A., et al. (2000) Nature, 408, 211–216.
Li,W., Jaroszewski,L. and Godzik,A. (2001) Bioinformatics, 17, 282–283.
Li,W., Jaroszewski,L. and Godzik,A. (2002) Bioinformatics, 18, 77–82.
Papasaikas,P.K., Bagos,P.G., Litou,Z.I. and Hamodrakas,S.J. (2003) SAR QSAR Environ. Res., 14, 413–420.
Ueki,K., Kondo,T., Tseng,Y.H. and Kahn,C.R. (2004) Proc. Natl Acad. Sci. USA, 101, 10422–10427.
Vapnik,V.N. (1979) Estimation of Dependencies Based on Empirical Data. Springer-Verlag, Berlin.
Received March 5, 2005; accepted May 18, 2005.
Edited by Paul Carter