Center for Biological Sequence Analysis, Department of Biotechnology, The Technical University of Denmark, DK-2800 Lyngby, Denmark and Department of Biochemistry, Arrhenius Laboratory, Stockholm University, S-106 91 Stockholm, Sweden
Abstract |
---|
Introduction |
---|
By definition, the cell can recognize all kinds of protein sorting signals with almost 100% selectivity and specificity: the level of mis-sorting in vivo appears to be very low, although this aspect of the problem has not been studied in detail. Given that the sorting signals mentioned above seem to be, at least to a good approximation, defined by a linear, N-terminal stretch of the polypeptide, it would appear that we should be able to devise sequence-based methods that can recognize these signals with an efficiency approaching that of the cell itself. If such methods can be developed, they will clearly be of major use for genome analysis and automatic database annotation; at the same time, these massive data analysis tasks necessitate very accurate prediction methods.
While prediction of sorting signals has a long history, starting with the early work on secretory signal peptides (von Heijne, 1983; McGeoch, 1985; von Heijne, 1986b), it is only with the application of modern machine learning techniques, such as neural networks (NNs) and hidden Markov models (HMMs), that we seem to be approaching the necessary levels of accuracy (Baldi and Brunak, 1998; Durbin et al., 1998). Machine-learning techniques are ideally suited for pattern recognition tasks where relatively large amounts of data are present and where the patterns are 'noisy' and not easily described by a compact set of rules. The fundamental idea behind these approaches is to learn to discriminate automatically from the data, using experimentally verified examples, which most often are extracted from large public sequence and structure databases. While HMMs are best at recognizing, in an 'elastic' fashion, the sequential pattern in the amino acids or nucleotides, the NN algorithms are better at handling sequence features correlated over a longer range, especially if there is some degree of conservation in the positioning of the relevant features. Together, the NN and HMM methods can therefore handle a very substantial part of the sequence diversity created by evolution that is characteristic for many complex biological mechanisms. Thus, there now exist quite reliable machine learning-based methods for the identification of secretory signal peptides (SPs), mitochondrial targeting peptides (mTPs) and chloroplast transit peptides (cTPs).
In this review, we will concentrate on the present status and future perspectives of SP prediction, in particular the developments and applications of our own method, SignalP, since it was published in Protein Engineering two years ago (Nielsen et al., 1997a). Several NN-based methods for prediction of SPs have been developed (Ladunga et al., 1991; Schneider and Wrede, 1993), but only SignalP is publicly available. SignalP has been used extensively since it was made available over the internet, but the first version has some important shortcomings that necessitate further development and integration with other prediction methods. In addition, we will review a couple of methods for predicting other protein sorting signals, and discuss some general aspects of sorting signal prediction.
Constructing the training set for machine learning methods |
---|
Another problem is that a sequence database always contains numerous examples of genes belonging to gene families and homologous genes from various organisms. This can lead to statistical results that are biased for the over-represented sequences, and the performance of prediction methods will be overestimated if the test set contains sequences closely related to those used in the training. Thus, after selecting an initial set of sequences from SWISS-PROT, one has to remove homologous sequences (unless the training algorithm can deal with redundant data sets) using, for example, the Hobohm redundancy reduction method (Hobohm et al., 1992). The question of when two sequences are 'too closely related' to be kept within the reduced data set is far from trivial. For the SignalP data set, the similarity threshold is found from the principle that if it is possible to infer the position of the cleavage site in one SP by alignment to another SP, the sequences are too similar. Another approach, which uses the statistical theory of local alignments (Altschul and Gish, 1996), is to fit the alignment scores to an extreme value distribution and choose a threshold value above which there are more observations than expected from the distribution (Pedersen and Nielsen, 1997).
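The greedy character of homology reduction can be illustrated with a short sketch. It is written in the spirit of the second Hobohm algorithm (repeatedly discard the sequence with the most neighbours) and is not the procedure used for the SignalP data set. The `identity` function is a toy, ungapped similarity measure and the 0.5 threshold is a placeholder; both are assumptions made for illustration, and in practice the neighbour relation would be derived from local alignment scores.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Toy similarity (assumption): fraction of identical residues,
    compared position by position without gaps over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def reduce_redundancy(seqs: dict[str, str], threshold: float = 0.5) -> list[str]:
    """Drop the sequence with the most neighbours until no two kept
    sequences are more similar than the threshold."""
    neighbours = {name: set() for name in seqs}
    for p, q in combinations(seqs, 2):
        if identity(seqs[p], seqs[q]) > threshold:
            neighbours[p].add(q)
            neighbours[q].add(p)
    while True:
        worst = max(neighbours, key=lambda n: len(neighbours[n]), default=None)
        if worst is None or not neighbours[worst]:
            break  # nothing left, or no remaining pair is too similar
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
    return sorted(neighbours)
```

Removing the most connected sequence first tends to keep more sequences than removing arbitrary members of each similar pair, which matters when the initial data set is small.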
Unless the remaining set at this point is prohibitively large, it should be checked by hand against the primary publications. In our experience, features like cleavage sites for sorting signals are not always correctly annotated: sites not listed as 'putative' may in fact be based only on an informed guess (or even an existing prediction method), and experimentally verified sites are sometimes incorrectly entered into the database (database 'typos'). In a recent study of chloroplast transit peptides (O.Emanuelsson, H.Nielsen and G.von Heijne, manuscript submitted), we had to remove around 10% of the sequences in our homology-reduced data set for such reasons. Even experimentally verified data may be wrong if the interpretation of the results has been faulty. The most relevant example in this context is that an N-terminus of a mature protein, confirmed by amino acid sequencing, might derive not from cleavage by the signal peptidase but from a subsequent cleavage by another protease in the secretory pathway.
If the data set is too large to allow for manual inspection of all entries, some suspicious looking examples may be identified by automated methods. One possibility is to use alignments of the unreduced set to single out pairs of sequences that show a very high similarity but discrepancies in assignment of subcellular location or cleavage site position (Nielsen et al., 1996). Another method is to use the training algorithm itself to pick out cases which are more difficult to learn than others (Brunak, 1993). Both these approaches are necessarily biased; the first will never be able to pick up errors in sequences with no matching homologues, and both can fail to recognize systematic errors that occur in several entries. Still, experience has shown that machine learning methods can serve as extremely useful tools for data set validation; in several cases, NNs have been able to detect errors caused both by simple misprints and by incorrect interpretation of experiments (Brunak et al., 1990a,b).
Another aspect of the choice of training set is whether sequences from all, some subset of, or only a single organism should be included. If there is enough data, organism-specific methods should be expected to perform better than more general ones, but in most cases it is not possible to be this restrictive.
In the SignalP work, we trained two species-specific versions on human and Escherichia coli SPs, and concluded that there was no significant gain in performance when testing with networks trained on a single-species data set relative to networks trained on larger groups (Nielsen et al., 1997a). This result is not definitive, however. The reason why the E.coli-specific network did not show an improvement compared with one trained on a larger set of Gram-negative SPs might simply be that the E.coli set at that time was too small to achieve the same relative performance. Regarding the human-specific network, one should note that the eukaryotic set is dominated by mammals, i.e. rather close relatives to humans; and we cannot exclude the possibility that signal peptides from, for example, yeast (which are relatively underrepresented in the data set), are significantly different from those of mammals. Nevertheless, genomic sequencing opens up the possibility of constructing species-specific versions of the basic algorithm, perhaps by a bootstrapping procedure where a more general version trained on, for example, all eukaryotic sequences, is used to extract an initial set of reliably predicted sequences from, for example, yeast, which is then used to iteratively train a species-specific version.
Current status of the SignalP method |
---|
SignalP combines two different NNs, one that has been trained to classify each residue in the sequence as either belonging or not belonging to a SP (S-score), and one that has been trained only to recognize the site at the C-terminal end of the SP that is cleaved by the signal peptidase enzyme after targeting (C-score). Cleavage-site prediction performance is significantly enhanced by penalizing C-score peaks that are far away from the transition region between the SP and the mature polypeptide identified by the S-score. This is formalized by using the 'Y-score', a geometric average of the C-score and a numerical derivative of the S-score. In the example shown in Figure 1, the C-score has two peaks, where the upstream one is slightly higher but the downstream one occurs in the transition zone of the S-score and therefore has a higher Y-score.
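The combination of the two scores can be sketched in a few lines. The sketch takes the numerical derivative to be the drop in mean S-score between the `d` positions before and the `d` positions from a candidate site onwards; the window size `d = 3`, the clamping of negative drops to zero and the function name are assumptions made for illustration, not the values or code of SignalP itself.

```python
import math


def y_scores(c: list[float], s: list[float], d: int = 3) -> list[float]:
    """Geometric average of the C-score and a smoothed drop in the S-score.

    Positions closer than d to either end are left at 0.0 because the
    derivative window does not fit there (an assumption of this sketch).
    """
    n = len(c)
    y = [0.0] * n
    for i in range(d, n - d + 1):
        before = sum(s[i - d:i]) / d   # mean S-score just upstream of i
        after = sum(s[i:i + d]) / d    # mean S-score from i onwards
        drop = max(before - after, 0.0)
        y[i] = math.sqrt(c[i] * drop)
    return y
```

With an S-score that falls from a high plateau to a low one, a C-score peak inside the plateau receives a Y-score near zero, while a slightly lower C-score peak at the transition receives the highest Y-score, which is the situation described for Figure 1.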
The performance values of SignalP are shown in Table I, both for the original version and for a version retrained on a new data set, based on SWISS-PROT release 35 instead of 29. Note that the performance for cleavage site location has improved. Since the old and new data sets are extracted by the same method, and the sizes have changed only slightly, the most probable explanation for the improvement is that the quality of SWISS-PROT annotations concerning SPs is better in the newer version.
On the other hand, the performance values given in Table I are calculated under two limiting assumptions: that the correct N-terminus of the protein in question is known, and that the sequence does not contain an N-terminal transmembrane helix. The data sets on which SignalP is trained and tested contain only the N-terminal part (up to 70 amino acids) of each protein, and transmembrane proteins were not included in the negative set. The decision to use only the N-terminal part of each protein was based on the idea that SignalP should reproduce the recognition task met by the cell in vivo, where SP cleavage takes place only within a certain range from the N-terminus. The reason for the lack of transmembrane helices in the negative set is more practical: it is very hard to ensure that there is experimental evidence for absence of cleavage of a transmembrane protein. For a subset of transmembrane proteins, however, we have a reliable set: eukaryotic signal anchors (see below).
These two points constitute a problem for the application of SignalP to genome and EST data. As an illustration of this, the scanning of the Haemophilus influenzae genome which we reported in the SignalP paper (Nielsen et al., 1997a) produced a remarkably large variation in the estimate of the proportion of proteins with SPs: from 14% if using the maximal Y-score as discriminator, to 28% when using the maximal S-score, even though all these measures give high discrimination performances when used on the SignalP data set. This means that the performance of (at least) one of these measures is considerably lower when applied to genome data; and that SignalP, when used for this purpose, should ideally be combined with a transmembrane helix prediction and a start codon prediction.
SignalP-HMM: distinguishing signal peptides from signal anchors |
---|
The discrimination between signal anchors (SAs) and SPs has proved to be very difficult for the neural network: approximately 50% of the SAs are predicted as SPs according to the mean S-score. Since both the C-score and the S-score are calculated from sequence windows of a limited width, a feature such as region length is difficult to represent in the input. To solve this problem, we have developed SignalP-HMM, an HMM architecture for SPs and SAs (Figure 2).
Secretory signal peptides have three distinct regions: an N-terminal positively charged n-region, a central hydrophobic h-region, and a C-terminal c-region encompassing the signal peptidase cleavage site (von Heijne, 1985). Each of these is represented by a separate part of the model: the n- and h-regions are modeled in a simple way, with all states having the same amino acid frequencies, while the region around the cleavage sites is modeled in more detail (essentially like a weight matrix). Signal anchors have both an n- and an h-region, and no cleavage site. By having two parallel submodels of the HMM, it is possible to represent differences in both length distribution and amino acid frequencies between the n- and h-regions of SPs and SAs. A third branch (actually, just a shortcut) is added to represent those sequences that are neither SPs nor SAs. When threading a sequence through this model, one of the three branches is chosen, and this serves as the prediction of protein type. Additionally, this method provides an objective way to delineate the n-, h- and c-regions in a SP, and it may thus be used to compare the overall design of SPs from different organisms.
SignalP-HMM is able to discriminate between SPs and SAs with a correlation coefficient of 0.74 (see Table I), which is far from perfect, but much better than with the NNs. In a sense, this comparison is not quite fair, because the SAs were not used explicitly as negative examples during training of the NN, but this would have been problematic given the small size of the SA set. With the HMM, it is easy to take this limitation into account by using a simpler submodel (with a smaller number of free parameters) in the SA branch than in the SP branch. Regarding the identification of SPs versus soluble non-secretory proteins, the HMMs perform on a par with the NNs (and for Gram-negative bacteria even better), but they are less accurate for cleavage site prediction (see Table I).
Type II membrane proteins constitute only a minor fraction of transmembrane proteins. When scanning genome data, it is desirable to distinguish SPs not only from SAs, but also from other types of transmembrane helices. It is advisable to combine SignalP with one of the available prediction methods for transmembrane helices, e.g. PHDhtm (Rost et al., 1996) or TopPred (von Heijne, 1992). Of course, it would be preferable, both for usage on large data sets and from a theoretical point of view, to obtain one prediction of the presence and location of both SPs and transmembrane helices in the sequence. To this end, we plan to build an integrated HMM architecture based on SignalP-HMM and an HMM-based transmembrane helix prediction method, TMHMM (Sonnhammer et al., 1998).
Start codon prediction |
---|
For expressed sequence tags (ESTs) the problem can be even worse, since it is very difficult to decide whether a given sequence includes the start codon at all: it might be entirely untranslated, or correspond to an internal stretch of a protein. The last case can also produce false positive predictions, since non-cytoplasmic ends of transmembrane helices are often rather similar to SP cleavage sites, and the SignalP networks have never been trained to avoid predicting SPs there.
Therefore, it would be desirable to have a method which, given a nucleotide sequence, would provide a prediction of both ends of a SP, i.e. the start codon and the cleavage site. Such a method does not exist yet, but a partial solution would be a score describing the probability that any given triplet is the start codon. To this end, we have developed a NN-based method for start codon prediction in eukaryotes, NetStart (Pedersen and Nielsen, 1997). It is trained to recognize the start codon AUG against all other AUG triplets in the mRNA sequence. It performs this task by using both local context, i.e. the Kozak box (Kozak, 1984), and long-range context in the form of implicit reading frame detection. NetStart is designed to work with EST or cDNA data; for use with genomic DNA, the possible occurrence of introns shortly downstream of the start codon could be detrimental to the prediction.
Statistical analyses (A.G.Pedersen et al., manuscript in preparation) have shown that the local start codon context varies widely between different systematic groups of eukaryotes. The current NetStart 1.0 contains only two organism-specific versions, for vertebrates and Arabidopsis thaliana, but more will be added in future releases. Although NetStart 1.0 should be regarded as a 'first attempt' at this problem, it does show test set performances, measured by correlation coefficient, of 0.62 for vertebrates and 0.71 for A.thaliana.
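The classification task posed to a start codon predictor, one true AUG against all other AUG triplets in the same mRNA, can be made concrete with a small sketch that enumerates the candidates together with their flanking sequence. This only prepares the candidates; it does not reproduce the NetStart network. The function name and the flank width of six nucleotides are hypothetical choices for illustration.

```python
def candidate_starts(mrna: str, flank: int = 6) -> list[tuple[int, str, str]]:
    """Return (position, upstream context, downstream context) for every
    AUG triplet, in any reading frame, of an mRNA or cDNA sequence."""
    seq = mrna.upper().replace("T", "U")  # accept cDNA written with T
    candidates = []
    pos = seq.find("AUG")
    while pos != -1:
        upstream = seq[max(0, pos - flank):pos]
        downstream = seq[pos + 3:pos + 3 + flank]
        candidates.append((pos, upstream, downstream))
        pos = seq.find("AUG", pos + 1)    # overlapping frames are included
    return candidates
```

Each candidate would then be scored by a classifier; the upstream and downstream strings correspond to the local context, whereas reading frame information would require a considerably wider window than the one used here.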
Signal peptides of Archaea |
---|
We used a 'consensus' between the three SignalP versions in a first attempt at characterizing the SPs of Methanococcus jannaschii, the first archaeon to be completely sequenced (Bult et al., 1996). SPs should indeed be expected in this organism: a signal peptidase has been identified by homology in the genome, and it shows greater homology to its eukaryotic than to its bacterial counterpart. The underlying idea is that if we are able to find sequences in the genome which could function as SPs in all other domains of life (i.e. in eukaryotes and both groups of bacteria), they would presumably function as signal peptides in M.jannaschii as well.
Methanococcus jannaschii SPs might have been predicted by alignment to known SPs from other organisms, if significant matches to experimentally verified secretory proteins including the SP region could be found. We made local pairwise alignments between all the predicted M.jannaschii protein sequences and all sequences in the SignalP data set, but found only insignificant matches. Even the best pairwise alignment scores were considerably lower than the threshold required for using a local alignment of two SP sequences to predict the location of the cleavage site (Nielsen et al., 1996). This shows that we cannot expect to find M.jannaschii SPs by alignment; a prediction method is indeed necessary for this task.
We selected sequences where both the maximal Y-score and the mean S-score were above their cut-off values for all three SignalP versions (eukaryotic, Gram-positive and Gram-negative). This is a very conservative criterion: when tested on the SignalP data sets, it accepts 75% of the Gram-negative, 66% of the Gram-positive and only 39% of the eukaryotic SPs. Used on the M.jannaschii genome, it yielded 34 putative SPs, none of which had a known subcellular location. This number is too small to train a species-specific neural network (it might be used for an HMM but this has not yet been implemented), but it is enough to draw a few tentative conclusions about M.jannaschii SPs.
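The consensus criterion amounts to a conjunction over the three network versions and the two scores, as the following sketch shows. The `Scores` container, the version labels and in particular the cut-off values of 0.5 are placeholders assumed for illustration; they are not the published SignalP cut-offs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    max_y: float   # highest Y-score found in the sequence
    mean_s: float  # mean S-score over the predicted signal peptide


# Placeholder cut-offs for illustration only (assumption).
CUTOFFS = {
    "eukaryotic": Scores(0.5, 0.5),
    "gram_positive": Scores(0.5, 0.5),
    "gram_negative": Scores(0.5, 0.5),
}


def passes_consensus(predictions: dict[str, Scores],
                     cutoffs: dict[str, Scores] = CUTOFFS) -> bool:
    """True only if both scores exceed their cut-offs in every version."""
    return all(
        predictions[version].max_y > cut.max_y
        and predictions[version].mean_s > cut.mean_s
        for version, cut in cutoffs.items()
    )
```

Because a single score below its cut-off in any one version rejects the sequence, the criterion trades sensitivity for specificity, which is consistent with the low acceptance rates quoted above.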
The 34 sequences were divided into n-, h- and c-regions, and the amino acid content compared with that of eukaryotes and bacteria. The H.influenzae genome (Fleischmann et al., 1995) served as a reference example of a Gram-negative bacterium. In Figure 3, the 34 putative M.jannaschii SPs are represented as a sequence logo, i.e. a sequence of stacked letters, where the total height of the stack at each position shows the amount of information (conservation), while the relative height of each letter shows the relative abundance of the corresponding amino acid (Schneider and Stephens, 1990). When compared with logos of eukaryotic or bacterial SPs (Nielsen et al., 1997a), the following characteristics are observed.
In the c-region, the dominance of Ala at position -1 is typical for both bacterial and eukaryotic signal peptide cleavage sites, whereas the tolerance of other uncharged residues, such as Val, Leu and Ile, at -3 and the short length of the c-region clearly suggest a eukaryotic type of cleavage site. Around the cleavage site, a unique feature is also found: a high occurrence of Tyr (8% of the c-regions as opposed to 2% in H.influenzae), particularly visible at positions +1 and -2. This seems to be specific for SPs, since the general Tyr content is only slightly higher in M.jannaschii than in H.influenzae (4.3 versus 3.3%). Finally, the occurrence of negatively charged residues in the first few positions of the mature protein has previously been noted for bacterial but not for eukaryotic signal peptides (von Heijne, 1986a).
In conclusion, our analysis suggests that SPs from an archaeon have a eukaryotic-looking cleavage site, a bacterial-looking charge distribution and a unique composition of the hydrophobic region. The statistical description is of course to some extent affected by the fact that we use a consensus method, which only finds signal peptides and cleavage sites that would be acceptable in both eukaryotes and bacteria; chances are that signal peptides peculiar to archaea have gone undiscovered. In other words, we have if anything underestimated the unique characteristics of the M.jannaschii signal peptides.
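The quantities displayed in a sequence logo such as the one used for the putative M.jannaschii SPs can be computed as sketched here: the stack height at each position is the information content, log2(20) minus the Shannon entropy of the observed amino acid frequencies, and each letter takes a share of the stack proportional to its frequency. The small-sample correction used by Schneider and Stephens is deliberately left out, so with only a few dozen sequences the heights from this sketch would be somewhat overestimated; the function name is hypothetical.

```python
import math
from collections import Counter


def logo_heights(aligned: list[str], alphabet_size: int = 20) -> list[dict[str, float]]:
    """Per-position letter heights (in bits) for equal-length aligned sequences."""
    columns = []
    for i in range(len(aligned[0])):
        counts = Counter(seq[i] for seq in aligned)
        total = sum(counts.values())
        freqs = {aa: n / total for aa, n in counts.items()}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        info = math.log2(alphabet_size) - entropy   # total stack height
        columns.append({aa: f * info for aa, f in freqs.items()})
    return columns
```

A fully conserved position reaches the maximum of log2(20), about 4.32 bits, while a position shared equally between two residues has a stack of about 3.32 bits split evenly between the two letters.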
Other protein sorting prediction methods |
---|
The currently most developed method to predict mTPs is based on a linear combination of a number of sequence characteristics such as amino acid abundance, maximum hydrophobicity and maximum hydrophobic moment that are combined into an overall score (Claros and Vincens, 1996). Preliminary work using the same NN approach as for ChloroP suggests that similar performance levels can be reached using machine learning (our unpublished data).
In addition to the recognition of the sorting signals, prediction of protein sorting can exploit the fact that proteins of different subcellular compartments differ in global properties such as amino acid composition and residue-pair frequencies. While the signal prediction methods are probably closer to mimicking the information processing in the cell, methods based on global properties can complement imperfect signal-based methods, especially on incomplete sequences. Specifically, a composition-based method for recognizing extracellular proteins can be used without knowledge of the N-terminus, and could, for example, give correct predictions for EST-derived protein fragments where the signal peptide has not even been sequenced. The drawback is that such methods will not be able to distinguish between very closely related proteins that differ in the presence or absence of a SP. Most of the work on such methods has been based on traditional statistics (Nakashima and Nishikawa, 1994; Cedano et al., 1997), but machine learning has been employed in the NNPSL method, which uses NNs trained on overall amino acid composition to predict location to three (bacteria) or four (eukaryotes) possible subcellular compartments (Reinhardt and Hubbard, 1998).
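The input to a composition-based predictor is simply a vector of 20 relative frequencies, which can be computed from any fragment of the protein. The sketch below shows only this feature extraction; the classifier applied to the vector is not reproduced, and the function name and the decision to ignore non-standard residue codes are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq: str) -> list[float]:
    """Relative frequency of each of the 20 standard amino acids.

    Characters outside the standard alphabet (e.g. X) are ignored,
    which is a simplifying assumption of this sketch.
    """
    seq = seq.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [n / total for n in counts]
```

Since the vector does not depend on residue order, it is unchanged when the N-terminus is missing from the fragment in any way that leaves the overall composition similar, which is also why two proteins differing only in a short SP yield nearly identical vectors.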
The PSORT program (Nakai and Kanehisa, 1992; Horton and Nakai, 1997) is an integrated system of several prediction methods, using both sorting signals and global properties. Some of the components are developed within the PSORT group, others are implementations of methods published elsewhere. PSORT is the only publicly available system that shows this degree of integration, and it includes sorting predictions that are not found elsewhere (e.g. nuclear or peroxisomal targeting). However, it does not include the newest machine-learning methods, which means that PSORT prediction of the more extensively studied protein sorting problems, e.g. SPs or transmembrane helices, is in many cases not the best available.
The future |
---|
On the other hand, one big integrated system of all methods may not be the most desirable solution for all users. For automated annotation of very large data sets, integrated prediction systems are of course preferable, but the biologist working on one specific gene might be better off considering comprehensive graphical output from several prediction methods separately, and then deciding which conclusion should be drawn from the possibly conflicting predictions. In some cases (rare but interesting), the biologically correct answer will be something not anticipated by the method builders (e.g. dual targeting, double cleavage, non-standard use of sorting machineries), and uncritical use of a totally integrated prediction system could actually block new discoveries instead of promoting them.
Finally, any given application will require careful consideration of how to strike the best balance between sensitivity and specificity. For gene hunting, one may want high sensitivity (i.e. few false negatives) in order not to miss interesting candidate genes, whereas for database annotation it may be more prudent to ask for high specificity (i.e. few false positives) even if this will leave many sequences unannotated.
The trade-off between sensitivity and specificity illustrates a common aspect in the evaluation of prediction methods. Performances are given as percent correct, correlation coefficients etc., but these depend on the choice of cut-off and the definition of positive and negative data sets. In the signal peptide case, it is quite clear what the positive data sets should be, although it may be argued whether, for example, bacterial lipoproteins should be considered as positive examples. On the other hand, there are many questions to be asked about negative examples: should they comprise only soluble cytoplasmic and nuclear proteins, or include transmembrane and membrane-associated proteins? Should they be limited to N-terminal parts or include entire protein chains? There is no single correct answer to questions like these, which makes comparison of performances of different methods a very tricky business.
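The dependence of these performance measures on the counts of true and false predictions, and hence on the cut-off and on the composition of the negative set, is easiest to see when they are written out. The sketch below uses the common definitions: sensitivity as the fraction of positives found, specificity as the fraction of negatives correctly rejected, and the Matthews correlation coefficient. Note that some authors instead define specificity as the fraction of positive predictions that are correct; the definition used here is an assumption of the sketch.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true positives that are predicted as positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true negatives that are predicted as negative."""
    return tn / (tn + fp)


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 if any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

For example, 40 true positives, 10 false negatives, 45 true negatives and 5 false positives give a sensitivity of 0.8, a specificity of 0.9 and a correlation coefficient of about 0.70; adding transmembrane proteins to the negative set would typically increase the false positive count and lower the last two figures without any change in the method.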
Since numerical performance measures are mandatory for deciding whether methods have improved, the task of defining such measures is very important, and much more work is needed within the bioinformatics field in order to arrive at common testing standards for method comparison (Nielsen et al., 1996). However, we feel that the most informative test of the performance and applicability of a sequence-based prediction method is carried out by making it available to the biological community, both in academia and in industry, e.g. by implementing it as a server or a portable program. The feedback from users, either directly, or implicitly via usage and citation statistics, can tell us more about the quality of our bioinformatics work than percentages and correlation coefficients will ever be able to.
Availability of methods |
---|
Acknowledgments |
---|
Notes |
---|
References |
---|
Bailey,T. and Elkan,C. (1994) ISMB, 2, 28–36.
Bairoch,A. and Apweiler,R. (1997) Nucleic Acids Res., 25, 31–36.
Baldi,P. and Brunak,S. (1998) Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge.
Brunak,S. (1993) In Soumpasis,D. and Jovin,T. (eds) Computation of Biomolecular Structures: Achievements, Problems and Perspectives. Springer-Verlag, Berlin, pp. 43–54.
Brunak,S., Engelbrecht,J. and Knudsen,S. (1990a) Nature, 343, 123.
Brunak,S., Engelbrecht,J. and Knudsen,S. (1990b) Nucleic Acids Res., 18, 4797–4801.
Bult,C.J., White,O., Olsen,G.J. et al. (1996) Science, 273, 1058–1073.
Cedano,J., Aloy,P., Pérez-Pons,J. and Querol,E. (1997) J. Mol. Biol., 266, 594–600.
Chou,M.M. and Kendall,D.A. (1990) J. Biol. Chem., 265, 2873–2880.
Claros,M.G. and Vincens,P. (1996) Eur. J. Biochem., 241, 779–786.
Durbin,R.M., Eddy,S.R., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge.
Fleischmann,R.D., Adams,M.D., White,O. et al. (1995) Science, 269, 496–512.
Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417.
Horton,P. and Nakai,K. (1997) ISMB, 5, 147–152.
Kozak,M. (1984) Nucleic Acids Res., 12, 857–872.
Ladunga,I., Czakó,F., Csabai,I. and Geszti,T. (1991) CABIOS, 7, 485–487.
Matthews,B. (1975) Biochim. Biophys. Acta, 405, 442–451.
McGeoch,D.J. (1985) Virus Res., 3, 271–286.
Nakai,K. and Kanehisa,M. (1992) Genomics, 14, 897–911.
Nakashima,H. and Nishikawa,K. (1994) J. Mol. Biol., 238, 54–61.
Nielsen,H., Brunak,S., Engelbrecht,J. and von Heijne,G. (1997a) Protein Engng, 10, 1–6.
Nielsen,H., Brunak,S., Engelbrecht,J. and von Heijne,G. (1997b) Int. J. Neural Sys., 8, in press.
Nielsen,H., Engelbrecht,J., von Heijne,G. and Brunak,S. (1996) Proteins, 24, 165–177.
Nilsson,I., Whitley,P. and von Heijne,G. (1994) J. Cell Biol., 126, 1127–1132.
Olsen,G. and Woese,C. (1997) Cell, 89, 991–994.
Pedersen,A.G. and Nielsen,H. (1997) ISMB, 5, 226–233.
Reinhardt,A. and Hubbard,T. (1998) Nucleic Acids Res., 26, 2230–2236.
Richter,S. and Lamppa,G. (1998) Proc. Natl Acad. Sci. USA, 95, 7463–7468.
Rost,B., Fariselli,P. and Casadio,R. (1996) Protein Sci., 5, 1704–1718.
Schneider,G. and Wrede,P. (1993) J. Mol. Evol., 36, 586–595.
Schneider,T.D. and Stephens,R.M. (1990) Nucleic Acids Res., 18, 6097–6100.
Sonnhammer,E.L., von Heijne,G. and Krogh,A. (1998) ISMB, 6, 175–182.
von Heijne,G. (1983) Eur. J. Biochem., 133, 17–21.
von Heijne,G. (1985) J. Mol. Biol., 184, 99–105.
von Heijne,G. (1986a) J. Mol. Biol., 192, 287–290.
von Heijne,G. (1986b) Nucleic Acids Res., 14, 4683–4690.
von Heijne,G. (1988) Biochim. Biophys. Acta, 947, 307–333.
von Heijne,G. (1992) J. Mol. Biol., 225, 487–494.
Received November 23, 1998; accepted November 24, 1998.