Predicting proteasomal cleavage sites: a comparison of available methods
Patricia Saxová1,2,
Søren Buus3,
Søren Brunak1 and
Can Keşmir1,4
1 Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark, Lyngby, Denmark
2 Institute of Biology and Ecology, P. J. Šafárik University, Kosice, Slovakia
3 Institute for Medical Microbiology and Immunology, University of Copenhagen, Copenhagen, Denmark
4 Theoretical Biology/Bioinformatics, Utrecht University, Utrecht, The Netherlands
Correspondence to: C. Keşmir, Theoretical Biology/Bioinformatics, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands. E-mail: C.Kesmir{at}bio.uu.nl
Transmitting editor: M. J. Bevan
Abstract
The proteasome plays an essential role in the immune responses of vertebrates. By degrading intracellular proteins from self and non-self, the proteasome produces the majority of the peptides that are presented to cytotoxic T cells (CTL). There is accumulating evidence that the C-terminus of CTL epitopes, in particular, is cleaved precisely by the proteasome, whereas the N-terminus is produced with an extension and later trimmed by peptidases in the cytoplasm and in the endoplasmic reticulum. Recently, three publicly available methods have been developed for prediction of the specificity of the proteasome. Here, we compare the performance of these methods on a large set of CTL epitopes. The best method, NetChop at www.cbs.dtu.dk/Services/NetChop, captures about 70% of the C-termini correctly. This result suggests that the predictions can still be improved, particularly if more quantitative degradation data become available.
Keywords: artificial neural network, cleavage site prediction, MHC class I epitope, proteasome, protein degradation
Introduction
Proteasomes are multisubunit proteases that play a central role in the degradation of proteins in the cell (1). Some degradation products of the proteasome are taken up by the transporter associated with antigen processing (TAP) and transferred into the endoplasmic reticulum. Here they can associate with newly synthesized MHC class I molecules. Recognition of such MHC-peptide complexes on the cell surface by activated cytotoxic T lymphocytes (CTL) is essential for cellular immune responses (2).
The proteasome has at least three different catalytic activities: trypsin-like (i.e. cleavage after basic amino acids), chymotrypsin-like (i.e. cleavage after large, hydrophobic amino acids) and peptidyl-glutamyl-peptide-hydrolyzing activity (i.e. cleavage after acidic amino acids) (3). Since the overall enzymatic activity is the result of an interaction between these catalytic subunits, the cleavage-inhibiting or -enhancing motifs are quite complex. In the presence of IFN-γ, the three catalytic subunits of the proteasomes of vertebrates are replaced by their homologous subunits to form an immunoproteasome (4). The cleavage specificity of the constitutive proteasome and the immunoproteasome seems to be different (5,6), a factor that further increases the complexity of the enzymatic activity of the proteasome.
Due to the involvement of the proteasome in the generation of antigenic peptides, it is of general interest to obtain additional insight into the specificity of the proteasome and to predict which peptides will be generated from both pathogenic and human proteins. At present, three proteasome cleavage prediction methods are publicly available on the Internet: PAProC (www.paproc.de) developed at Tübingen University (7,8), MAPPP (www.mpiib-berlin.mpg.de/MAPPP/) developed at the Max-Planck Institute in Berlin (9,10) and NetChop (www.cbs.dtu.dk/services/NetChop/) developed at the Center for Biological Sequence Analysis at the Technical University of Denmark (11).
PAProC is a method for predicting cleavages by human proteasomes as well as wild-type and mutant yeast proteasomes. The influence of different amino acids at different positions is assessed using a stochastic hill-climbing algorithm (7) based on cleavage and non-cleavage sites verified experimentally in vitro (8).
MAPPP is a method that combines proteasome cleavage prediction with MHC-binding prediction. FragPredict is the part of the MAPPP package that deals with the proteasome cleavage prediction. FragPredict consists of two algorithms. The first algorithm uses a statistical analysis of cleavage-enhancing and -inhibiting amino acid motifs to predict potential proteasome cleavage sites (9). The second algorithm, which uses the results of the first algorithm as an input, predicts which fragments are most likely to be generated. This algorithm is based on a kinetic model of the 20S proteasome (10) and it takes the time-dependent degradation into account.
NetChop is a neural network-based method trained on MHC class I ligands generated by human proteasomes. Because every MHC ligand has to be generated by the proteasome, MHC class I ligands are expected to bear the closest resemblance to naturally processed in vivo cleavage products. However, as some of the products of the proteasome would not bind MHC molecules, MHC class I ligands represent only a subset of in vivo cleavage products. The MHC class I ligands used to develop NetChop were compiled from public databases (11). There are two versions of NetChop available, 1.0 and 2.0. The later version is trained with a data set that is 3 times larger.
The aim of this study is to compare the performance of the three publicly available methods mentioned above. Since there is increasing evidence that antigenic peptides result from proteasome cleavage especially at the C-terminal end [see, e.g. (12-15)], we test all the methods on a set of publicly available MHC class I ligands. We are concerned primarily with the ability of the methods (i) to predict correctly the C-terminus of a ligand and (ii) not to predict major cleavage sites within the ligand. We excluded N-terminal cleavage analysis, because the majority of the T cell epitopes are trimmed at their N-terminus by other peptidases, e.g. in the endoplasmic reticulum (15).
We find that the method developed using MHC class I ligands, i.e. NetChop, predicts CTL epitope boundaries more accurately than the methods based on in vitro degradation data.
Methods
Performance measurement
We require that a proteasome cleavage prediction method should be able to identify the C-terminal of any natural MHC class I ligand without predicting major cleavage sites within the ligand. Thus, for each ligand we test whether (i) the proteasome cleavage prediction methods can predict the C-terminal cleavage correctly and (ii) the same methods do not predict a cleavage site within the epitope (i.e. all positions except the C-terminal residue) which is more likely than at the C-terminal.
The predictions originate from scores that are compared with a threshold and they are classified as follows:
True positive (TP): if the prediction at the C-terminal, Pc, is above the threshold.
False negative (FN): if Pc is less than the threshold.
True negative (TN): if no cleavages are predicted within the epitope (excluding the C-terminal residue) or if the predicted cleavage sites within the epitope are less likely than at the C-terminal (i.e. less than Pc and the threshold).
False positive (FP): if there is at least one predicted cleavage site within the epitope which is more likely than at the C-terminal (i.e. higher than Pc).
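The four classes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_ligand` and the input layout (one score per ligand position, with the last entry being Pc) are hypothetical, and an internal site is treated as a major site only when it exceeds both the threshold and Pc.

```python
from typing import List, Tuple


def classify_ligand(scores: List[float], threshold: float = 0.5) -> Tuple[str, str]:
    """Classify one ligand from per-position cleavage scores.

    scores[i] is the predicted cleavage score after residue i of the ligand;
    the last entry is the score at the C-terminus (Pc). This layout is an
    assumption made for the sketch.
    Returns (call at the C-terminus, call for the internal positions).
    """
    if len(scores) < 2:
        raise ValueError("need at least one internal position plus the C-terminus")
    pc = scores[-1]
    internal = scores[:-1]

    # TP / FN: is the C-terminal cleavage predicted at all?
    c_terminal_call = "TP" if pc > threshold else "FN"

    # FP / TN: is any internal site predicted and more likely than the C-terminus?
    has_major_internal_site = any(s > threshold and s > pc for s in internal)
    internal_call = "FP" if has_major_internal_site else "TN"
    return c_terminal_call, internal_call
```

Each ligand therefore contributes exactly one positive example (its C-terminus) and one negative example (its interior), so the number of positives and negatives both equal the number of ligands.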
We use the following performance measures to compare NetChop, PAProC and MAPPP:
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
CC = (TP × TN − FP × FN)/√[(TP + FN)(TN + FP)(TP + FP)(TN + FN)]
The sensitivity gives the percentage of C-terminal cleavages that are predicted correctly and the specificity gives the percentage of epitopes with no major predicted cleavage sites (i.e. cleavage sites that are more likely than at the C-terminal) within the epitope. The correlation score, CC, is a measure of how well a method performs both in positive (i.e. true cleavage sites) and in negative (i.e. true non-cleavage sites) examples.
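As a minimal sketch, the three measures can be computed from the four counts as follows. It assumes that the correlation score CC is the Matthews correlation coefficient, which is the usual correlation measure built from TP, TN, FP and FN; the function name `performance` and the counts used in the example are illustrative and are not results from this study.

```python
import math
from typing import Dict


def performance(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Sensitivity, specificity and correlation coefficient from counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    # Matthews correlation coefficient (assumed definition of CC).
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # When every example is predicted the same way the denominator is zero;
    # returning 0.0 (no correlation) in that case is a convention of this sketch.
    cc = (tp * tn - fp * fn) / denominator if denominator > 0 else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "cc": cc}


# Illustrative counts only: perfect, uninformative and inverted predictions.
perfect = performance(tp=10, fn=0, tn=10, fp=0)
uninformative = performance(tp=50, fn=50, tn=50, fp=50)
inverted = performance(tp=0, fn=10, tn=0, fp=10)
```

With these toy counts CC is 1 for the perfect case, 0 for the uninformative case and -1 for the inverted case.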
Results
Organization of test data set
We focus on the prediction of the specificity of the human proteasome, and therefore we use only peptides associated with HLA-A and HLA-B molecules from the SYFPEITHI database (16) to test various methods. In October 2001 there were 977 unique ligands associated with 120 different HLA-A and HLA-B molecules in the SYFPEITHI database. These ligands are either known T cell epitopes or are naturally processed peptides eluted from MHC molecules. We discarded ligands <8 or >12 amino acids. We also excluded ligands that had already been used for developing NetChop 1.0 or 2.0. The source protein for each ligand was searched for in the SWISSPROT database (17). When an epitope was found in several homologous proteins, homologous proteins were aligned and the most representative protein was chosen unless some additional information about the source protein could be deduced from the original paper. Only epitopes originating from human proteins or from possible human pathogens were included in the data set. The resulting set of 402 peptides contained homologous ligands. In order to prevent possible biases in the analysis, the homologous ligands were excluded using the FASTA (18) and Hobohm-1 algorithms (19). The final set used in our analysis consisted of 249 unique ligands from 135 proteins. The process is described in Fig. 1. The list of ligands is given in Appendix A. Excluding overlapping epitopes, we tested each method on 231 ligands.
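The length filter, the removal of training ligands and the homology reduction can be sketched as below. This is a simplified illustration, not the pipeline used in the study: the greedy selection follows the idea of the Hobohm-1 algorithm (keep a sequence only if it is not similar to one already kept), but the similarity test here is a toy ungapped identity fraction standing in for a FASTA alignment, and the 0.8 cut-off, the function names and the example peptides are all hypothetical.

```python
from typing import Callable, Iterable, List, Set


def fraction_identical(a: str, b: str) -> float:
    """Toy similarity: identical residues over the shorter length, no gaps."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def hobohm1(seqs: Iterable[str], similar: Callable[[str, str], bool]) -> List[str]:
    """Greedy selection: keep a sequence only if it is unlike all kept ones."""
    kept: List[str] = []
    for s in seqs:
        if not any(similar(s, k) for k in kept):
            kept.append(s)
    return kept


def filter_ligands(ligands: Iterable[str], training: Set[str],
                   min_len: int = 8, max_len: int = 12,
                   max_identity: float = 0.8) -> List[str]:
    """Apply length filter, training-set exclusion and homology reduction."""
    unique = dict.fromkeys(ligands)  # drop exact duplicates, keep input order
    candidates = [p for p in unique
                  if min_len <= len(p) <= max_len and p not in training]
    return hobohm1(candidates,
                   lambda a, b: fraction_identical(a, b) >= max_identity)


# Made-up peptides, for illustration only.
example = filter_ligands(
    ["ACDEFGHIK", "ACDEFGHIL", "MNPQRSTVWY", "ACD", "KLMNPQRST"],
    training={"KLMNPQRST"},
)
```

In this toy run the 3-residue peptide fails the length filter, one peptide is removed because it is in the training set, and the second peptide is dropped as a near-duplicate of the first.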
Comparison of the methods predicting cleavage by the human proteasome
We use three performance measures to compare the publicly available methods for predicting proteasome cleavage. The formal definitions of these measures are given in Methods. Since there is accumulating evidence that the C-termini of MHC ligands are cleaved precisely by the proteasome, each method should be able to predict the C-termini of HLA ligands as possible cleavage sites. The sensitivity measure gives the percentage of cleavage sites predicted at the C-termini of the 231 MHC ligands. Note that while all three methods can predict proteasome cleavage sites, only FragPredict can predict fragments generated by the proteasome. In order to compare FragPredict with the other two methods, we use only its cleavage site predictions. For FragPredict and NetChop, which produce a probability score of cleavage for each position in a protein sequence, we used a threshold of 0.5 to classify the predictions, i.e. any position in the sequence with a predicted probability >0.5 is considered a predicted cleavage site. PAProC does not allow the use of a threshold value for predictions; we assume that the sites marked +++, ++ or + by this method are predicted cleavage sites. The performance measures of the methods for this data set are given in Table 1. FragPredict predicts the largest share of the C-termini as cleavage sites, followed by NetChop 2.0. In contrast, PAProC and NetChop 1.0 predict far fewer of the MHC ligand C-terminal residues as cleavage sites.
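The two ways of turning raw output into predicted cleavage sites can be illustrated as follows. The data layouts are assumptions made for this sketch: probability-type output is taken to be a list of floats, one per sequence position, and PAProC-type output is taken to be a list of strings holding "+++", "++", "+" or an empty string for positions without a predicted cleavage. The real output formats of the tools may differ.

```python
from typing import List, Sequence

# Marks that are counted as predicted cleavage sites (from the text above).
CLEAVAGE_MARKS = {"+", "++", "+++"}


def sites_from_scores(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Positions whose predicted probability is strictly above the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def sites_from_marks(marks: Sequence[str]) -> List[int]:
    """Positions carrying any of the +, ++ or +++ marks."""
    return [i for i, m in enumerate(marks) if m in CLEAVAGE_MARKS]


def c_terminal_sensitivity(predicted: Sequence[Sequence[int]],
                           c_terminal_positions: Sequence[int]) -> float:
    """Fraction of ligands whose C-terminal position is a predicted site."""
    if len(predicted) != len(c_terminal_positions) or not c_terminal_positions:
        raise ValueError("need one C-terminal position per set of predicted sites")
    hits = sum(1 for sites, c in zip(predicted, c_terminal_positions) if c in sites)
    return hits / len(c_terminal_positions)
```

A position with a score of exactly 0.5 is not counted as a cleavage site here, matching the ">0.5" rule stated in the text.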
Table 1. The performance of three publicly available methods for the prediction of proteasomal cleavage sites deduced from natural human MHC class I ligands
An effective prediction method should also be capable of identifying non-cleavage sites (i.e. sites that are not likely to be used by the proteasomes). When the MHC ligands are used as a test set for proteasome cleavage predictions, it is hard to define which sites are really non-cleavage sites. Many CTL epitopes contain minor cleavage sites [see, e.g. (20,21)]. Nevertheless, an epitope should not contain a major cleavage site, i.e. a cleavage site that is more likely than the cleavage site at the C-terminal. Therefore, one can assume that if a method does not predict any major cleavage sites within an epitope, it is able to classify non-cleavage sites correctly. In other words, an incorrect prediction of a non-cleavage site (i.e. a false positive) is one where at least one internal position within an epitope has a probability of cleavage higher than both the threshold and the probability of the cleavage at the C-terminal. Following this definition, the total number of true non-cleavage sites becomes the same as the number of epitopes. The specificity measure in Table 1 gives the percentage of the MHC ligands with no major predicted cleavage sites within the ligand. NetChop 1.0 is the most successful method in classifying non-cleavage sites, followed by NetChop 2.0 and PAProC. FragPredict predicts many major cleavage sites within ligands that would make them highly unlikely MHC ligands. The performance of this method does not change much when we use the full FragPredict package (i.e. including the fragment prediction method): 11% of MHC ligands are predicted to stay intact during the protein degradation (using the suggested value of P > 0.9). There are other ways of measuring the performance on non-cleavage sites and we have tried many of them, e.g. one can assume that each position within a ligand should have a cleavage probability lower than the threshold. 
In all cases, the ordering of the methods according to their success in classifying non-cleavage sites correctly did not change (results not shown).
The correlation coefficient (CC) is a measure of how well a method performs both on positive (i.e. true cleavage sites) and negative (i.e. true non-cleavage sites) examples. CC = 0 corresponds to random prediction and CC = 1.0 represents 100% correct prediction. A negative CC value means that the predictions are anti-correlated with the real values, i.e. worse than random. Only NetChop 2.0 has a positive CC (see Table 1). This suggests that NetChop 2.0 generates the most reliable predictions.
Different threshold values can be used in FragPredict and NetChop to classify positions as predicted cleavage sites or predicted non-cleavage sites. When a low threshold is used the methods predict more cleavage sites (and vice versa for a high threshold). We compared the performance of both methods at the standard threshold of 0.5 and at the threshold at which each method reaches its maximum correlation coefficient. Varying the threshold did not change the ranking of the methods according to their performance (results not shown).
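A threshold scan of this kind can be sketched as follows. The code is an illustration under assumptions: each ligand is given as a list of per-position scores with the C-terminal score last, an internal site counts as a false positive only if it exceeds both the threshold and the C-terminal score, CC is taken to be the Matthews correlation coefficient, and the scores in the example are invented.

```python
import math
from typing import List, Sequence, Tuple


def counts_at(ligand_scores: Sequence[Sequence[float]],
              threshold: float) -> Tuple[int, int, int, int]:
    """Return (TP, FN, TN, FP) over all ligands at one threshold."""
    tp = fn = tn = fp = 0
    for scores in ligand_scores:
        pc, internal = scores[-1], scores[:-1]
        if pc > threshold:
            tp += 1
        else:
            fn += 1
        if any(s > threshold and s > pc for s in internal):
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def matthews_cc(tp: int, fn: int, tn: int, fp: int) -> float:
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator > 0 else 0.0


def best_threshold(ligand_scores: Sequence[Sequence[float]],
                   thresholds: Sequence[float]) -> Tuple[float, float]:
    """Threshold with the highest CC, together with that CC."""
    scored = [(t, matthews_cc(*counts_at(ligand_scores, t))) for t in thresholds]
    return max(scored, key=lambda pair: pair[1])


# Invented scores for three ligands, for illustration only.
toy_scores: List[List[float]] = [[0.2, 0.3, 0.9], [0.7, 0.1, 0.6], [0.1, 0.2, 0.4]]
threshold, cc = best_threshold(toy_scores, [0.3, 0.5, 0.8])
```

Lowering the threshold raises the number of true positives at the C-terminus but can also turn internal positions into false positives, which is why the scan uses CC rather than sensitivity alone.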
The better performance of NetChop may be due to the fact that it was trained using MHC ligands. MHC ligand data reflect not only proteasome specificity, but they also reflect a combined specificity of the proteasome, TAP and MHC. Thus, it cannot be ruled out that NetChop captures this combined specificity and thus performs best when the C-termini of MHC ligands are used for proteasome cleavage predictions. To see if this is the case we also tested all three methods on in vitro degradation data generated by the human proteasome. We collected such data from the literature (see Appendix B) excluding the data used to develop PAProC and FragPredict. The results shown in Table 2 confirm that NetChop is able to capture the specificity of the proteasome better than the other methods.
Table 2. The performance of three publicly available methods for the prediction of proteasomal cleavage sites identified by in vitro degradation studies
Conclusion
We found that NetChop, an artificial neural network trained with MHC class I ligands, predicts the C-termini of CTL epitopes more reliably than the other publicly available methods. This is mainly because NetChop predicts non-cleavage sites better than any of the other methods (see Table 1). There are two possible explanations for this. First, artificial neural networks are much more non-linear than the other two methods and might therefore capture the complex specificity of the proteasome better. Second, both PAProC and FragPredict are based on a very limited set of in vitro degradation data, whereas NetChop is trained on a larger data set, i.e. MHC class I ligands.
The C-termini of MHC ligands represent only a subset of cleavage sites occurring during in vivo degradation because not all cleavages would result in protein fragments that can be transferred to the endoplasmic reticulum and can bind to an MHC class I molecule. Thus, the use of MHC ligands to develop a method that can predict proteasome cleavage has been the subject of much criticism (H. Margalit, pers. commun.). However, here we demonstrate that the C-termini of MHC ligands might even represent the specificity of the in vivo degradation better than the in vitro cleavage maps. Degradation data derived from in vitro experiments probably overestimate in vivo degradation, because the methods based on this type of data, e.g. FragPredict, predict that most of the MHC ligands in our data set will be destroyed due to major cleavage sites within the ligands.
Even the best method could predict only 73% of the C-termini of natural MHC class I ligands correctly. Moreover, only 42% of the natural MHC ligands are predicted to remain intact. The stochastic nature of degradation (22) and the differences between the immunoproteasome and the constitutive proteasome are just two of many reasons that can explain the poor performance. The use of quantitative data, i.e. concerning not only the cleavage sites used, but also how often a certain site is used, improves the prediction results significantly (C. Kesmir et al., unpublished). Thus, it should be possible to improve on current prediction methods when more quantitative data become available.
In a separate study we found that NetChop 2.0 can correctly discriminate the C-termini of natural MHC ligands from the rest of the protein (results not shown). Thus, NetChop can discriminate the regions that are most likely to be presented to T cells across a protein. This creates a promising future perspective to identify the immunogenic regions in the pathogenic and the human genomes.
Acknowledgements
We thank Lars Juhl Jensen for advice on the statistical evaluation of our results. We are grateful to S. M. McNab for linguistic advice.
Abbreviations
CTL: cytotoxic T lymphocyte
TAP: transporter associated with antigen processing
Appendix A
Appendix B
References
1. Rock, K. L. and Goldberg, A. L. 1999. Degradation of MHC class I-presented peptides. Annu. Rev. Immunol. 17:739.
2. Rammensee, H. G., Falk, K. and Rotzschke, O. 1993. Peptides naturally presented by MHC class I molecules. Annu. Rev. Immunol. 11:213.
3. Uebel, S. and Tampe, R. 1999. Specificity of the proteasome and the TAP transporter. Curr. Opin. Immunol. 11:203.
4. Tanaka, K. and Kasahara, M. 1998. The MHC class I ligand-generating system: roles of immunoproteasomes and the interferon-γ-inducible proteasome activator PA28. Immunol. Rev. 163:161.
5. Van den Eynde, B. J. and Morel, S. 2001. Differential processing of class-I-restricted epitopes by the standard proteasome and the immunoproteasome. Curr. Opin. Immunol. 13:147.
6. Toes, R. E., Nussbaum, A. K., Degermann, S., Schirle, M., Emmerich, N. P., Kraft, M., Laplace, C., Zwinderman, A., Dick, T. P., Muller, J., Schonfisch, B., Schmid, C., Fehling, H. J., Stevanovic, S., Rammensee, H. G. and Schild, H. 2001. Discrete cleavage motifs of constitutive and immunoproteasomes revealed by quantitative analysis of cleavage products. J. Exp. Med. 194:1.
7. Nussbaum, A. K., Kuttler, C., Hadeler, K. P., Rammensee, H. G. and Schild, H. 2001. PAProC: a prediction algorithm for proteasomal cleavages available on the WWW. Immunogenetics 53:87.
8. Kuttler, C., Nussbaum, A. K., Dick, T. P., Rammensee, H. G., Schild, H. and Hadeler, K. P. 2000. An algorithm for the prediction of proteasomal cleavages. J. Mol. Biol. 298:417.
9. Holzhutter, H. G., Frommel, C. and Kloetzel, P. M. 1999. A theoretical approach towards the identification of cleavage-determining amino acid motifs of the 20 S proteasome. J. Mol. Biol. 286:1251.
10. Holzhutter, H. G. and Kloetzel, P. M. 2000. A kinetic model of vertebrate 20S proteasome accounting for the generation of major proteolytic fragments from oligomeric peptide substrates. Biophys. J. 79:1196.
11. Kesmir, C., Nussbaum, A. K., Schild, H. and Brunak, S. 2002. Prediction of proteasome cleavage motifs by neural networks. Protein Eng. 15:287.
12. Craiu, A., Akopian, T., Goldberg, A. and Rock, K. L. 1997. Two distinct proteolytic processes in the generation of a major histocompatibility complex class I-presented peptide. Proc. Natl Acad. Sci. USA 94:10850.
13. Stoltze, L., Dick, T. P., Deeg, M., Pommerl, B., Rammensee, H. G. and Schild, H. 1998. Generation of the vesicular stomatitis virus nucleoprotein cytotoxic T lymphocyte epitope requires proteasome-dependent and -independent proteolytic activities. Eur. J. Immunol. 28:4029.
14. Paz, P., Brouwenstijn, N., Perry, R. and Shastri, N. 1999. Discrete proteolytic intermediates in the MHC class I antigen processing pathway and MHC I-dependent peptide trimming in the ER. Immunity 11:241.
15. Mo, X. Y., Cascio, P., Lemerise, K., Goldberg, A. L. and Rock, K. 1999. Distinct proteolytic processes generate the C and N termini of MHC class I-binding peptides. J. Immunol. 163:5851.
16. Rammensee, H. G., Bachmann, J., Emmerich, N. N., Bachor, O. A. and Stevanovic, S. 1999. SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics 50:213.
17. Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28:45.
18. Pearson, W. R. and Lipman, D. J. 1988. Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA 85:2444.
19. Hobohm, U., Scharf, M., Schneider, R. and Sander, C. 1992. Selection of representative protein data sets. Protein Sci. 1:409.
20. Lucchiari-Hartz, M., Van Endert, P. M., Lauvau, G., Maier, R., Meyerhans, A., Mann, D., Eichmann, K. and Niedermann, G. 2000. Cytotoxic T lymphocyte epitopes of HIV-1 Nef: generation of multiple definitive major histocompatibility complex class I ligands by proteasomes. J. Exp. Med. 191:239.
21. Morel, S., Levy, F., Burlet-Schiltz, O., Brasseur, F., Probst-Kepper, M., Peitrequin, A. L., Monsarrat, B., Van Velthoven, R., Cerottini, J. C., Boon, T., Gairin, J. E. and Van den Eynde, B. J. 2000. Processing of some antigens by the standard proteasome but not by the immunoproteasome results in poor presentation by dendritic cells. Immunity 12:107.
22. Nussbaum, A. K., Dick, T. P., Keilholz, W., Schirle, M., Stevanovic, S., Dietz, K., Heinemeyer, W., Groll, M., Wolf, D. H., Huber, R., Rammensee, H. G. and Schild, H. 1998. Cleavage motifs of the yeast 20S proteasome beta subunits deduced from digests of enolase 1. Proc. Natl Acad. Sci. USA 95:12504.
23. Ayyoub, M., Stevanovic, S., Sahin, U., Guillaume, P., Servis, C., Rimoldi, D., Valmori, D., Romero, P., Cerottini, J. C., Rammensee, H. G., Pfreundschuh, M., Speiser, D. and Levy, F. 2002. Proteasome-assisted identification of an SSX-2-derived epitope recognized by tumor-reactive CTL infiltrating metastatic melanoma. J. Immunol. 168:1717.