From the Center for Experimental BioInformatics (CEBI), Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark
ABSTRACT
When analyzing peptide mixtures by liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS), a large number of fragmentation events occur. The tandem mass spectra are searched against amino acid sequence databases by one of a number of database search algorithms. The identified peptides receive a score and are combined into lists of identified proteins. A critical question in these experiments is what constitutes a reliable peptide and protein hit (2). Some laboratories save raw mass spectrometric data and interpret these raw data in all questionable cases. In some algorithms, the score is itself a probability and can be used to estimate levels of false positives (incorrect hits) and false negatives (missed hits). For other algorithms, this question has been addressed by analyzing defined mixtures of known proteins (3) or by searching in reversed databases that should not yield significant hits (4, 5). On the basis of these findings, a set of parameters for the scores is often defined that will yield a given trade-off of false positives and false negatives. Recently, more sophisticated statistical learning algorithms have been employed to estimate levels of false positives and negatives from parameters including search score, charge state, and length of peptides (6, 7).
A long-standing question in determining the reliability of peptide hits regards the occurrence of non-tryptic peptides: While trypsin is a very specific protease, it is often assumed to also cleave at residues other than arginine or lysine with a certain probability. Thus many research groups allow "non-tryptic" or "half-tryptic" peptides to match in database searches (albeit after requiring a higher identification score), whereas other groups only allow peptides generated by strict tryptic cleavage specificity.
In our group, we require tryptic cleavage specificity of potential sequence matches based on our experience with interpretation and verification of a large number of tandem mass spectra. Specifically, during the last 10 years, we have often used the peptide sequence tag algorithm for peptide identification (8). This algorithm does not require peptides to obey a certain cleavage pattern and is also able to find peptides with modifications and in the presence of sequence errors in databases. From these experiments, we have no clearly documented cases of identification of non-tryptic peptides, other than the C-terminal peptide of the protein and peptides with an N-terminal proline. (These were almost always accompanied by a tryptic peptide of extended N-terminal sequence, which was fully tryptic, indicating that the identified peptide was due to further fragmentation of a proline-directed y-ion.) However, as peptides that were not fully tryptic were often reported in studies using ion traps instead of the quadrupole time-of-flight mass spectrometers used by our group, it was possible that different fragmentation mechanisms made non-tryptic peptides more readily observable on such instruments.
The recently developed hybrid linear quadrupole ion trap-Fourier transform ion cyclotron resonance (FTICR) (Finnigan LTQ-FT; Thermo Electron, Bremen, Germany) mass spectrometer combines fragmentation in an ion trap instrument with the ability to obtain parent mass accuracies in the low or sub-parts per million (ppm) range in the ICR part (9). This mass accuracy is about a factor of 1000 higher than that normally obtained in an ion trap instrument alone and correspondingly allows more than a 100-fold higher discrimination in the identification of peptides. We therefore decided to use this mass accuracy to determine if trypsin does indeed exclusively cleave C-terminal to arginine or lysine. A complex protein mixture was enzymatically degraded, and more than 1000 peptides were identified in a sequence database search. Average absolute mass accuracies of less than 1 ppm were obtained. The only peptides of apparently non-tryptic origin (except the C-terminal peptides of the proteins and peptides seemingly non-tryptic because of database annotation issues) were peptides with an N-terminal proline. As mentioned above, these are well-known breakdown products either of acid conditions in solution or of "nozzle-skimmer" fragmentation. Thus we conclude that trypsin indeed solely cleaves C-terminal to arginine and lysine.
EXPERIMENTAL PROCEDURES
One-dimensional SDS-PAGE Protein Separation and In-Gel Digest of Mouse Liver Proteins
Protein concentration of the mouse liver fraction was determined by Bradford assay (Bio-Rad, Hercules, CA) and 90 µg of protein was applied on a 4–12% Bis-Tris gel (Novex; Invitrogen). After staining by colloidal Coomassie (Invitrogen), the entire gel lane was cut into 10 pieces of equal size and subjected to in-gel tryptic digestion essentially as described (10). Briefly, the gel pieces were destained and washed, and, after dithiothreitol reduction and iodoacetamide alkylation, the proteins were digested with porcine trypsin (modified sequencing grade; Promega, Madison, WI) overnight at 37 °C. The resulting tryptic peptides were extracted from the gel pieces with 30% acetonitrile, 0.3% trifluoroacetic acid, evaporated in a vacuum centrifuge to remove organic solvent, then desalted and concentrated on reversed-phase C18 StageTips as previously described (11).
Nanoflow LC-MS/MS and Data Analysis
All nanoflow LC-MS/MS experiments were done on a 7-Tesla Finnigan LTQ-FT mass spectrometer (Thermo Electron) equipped with a nanoelectrospray ion source (Proxeon Biosystems, Odense, Denmark). The liquid chromatography (LC) part of the analytical system consisted of an Agilent Series 1100 nanoflow LC system (Waldbronn, Germany) comprising a solvent degasser, a nanoflow pump, and a thermostated micro-autosampler. Chromatographic separation of the peptides took place in a 20-cm fused silica emitter (75-µm inner diameter; Proxeon Biosystems) packed in-house with a methanol slurry of reverse-phase ReproSil-Pur C18-AQ 3-µm resin (Dr. Maisch GmbH, Ammerbuch-Entringen, Germany) at a constant pressure (50 bar) of helium. Then 6 µl of the tryptic peptide mixtures were autosampled onto the packed emitter with a flow of 500 nl/min for 20 min and then eluted with a 90-min gradient from 4–40% acetonitrile (MeCN) in 0.5% acetic acid (AcOH) at a constant flow of 200 nl/min.
The mass spectrometer was operated in the data-dependent mode to automatically switch between MS and MS/MS acquisition. Survey MS spectra (from m/z 300–1500) were acquired in the FTICR with r = 25,000 at m/z 400 (after accumulation to a target value of 10,000,000). The three most intense ions were sequentially isolated for accurate mass measurements by an FTICR "SIM scan," which consisted of a 10-Da mass range, r = 50,000, and a target accumulation value of 50,000. These were then fragmented in the linear ion trap using collisionally induced dissociation with normalized collision energy of 30% and a target value of 2000. Former target ions selected for MS/MS were dynamically excluded for 30 s. Total cycle time was approximately 3 s. In total, 4755 tandem mass spectra were acquired in the LC experiment.
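The data-dependent selection logic described above (three most intense ions per survey scan, 30-s dynamic exclusion) can be illustrated with a minimal sketch. This is not the instrument's acquisition software; the function name `select_precursors`, the peak representation, and the m/z matching tolerance `tol_mz` are assumptions made only for illustration.

```python
def select_precursors(survey_peaks, scan_time_s, excluded,
                      top_n=3, exclusion_s=30.0, tol_mz=0.01):
    """Pick up to top_n most intense peaks that are not dynamically excluded.

    survey_peaks: list of (mz, intensity) tuples from one survey scan.
    excluded:     dict mapping a previously selected m/z to its selection time
                  in seconds; updated in place.
    tol_mz:       hypothetical tolerance for matching an excluded m/z.
    """
    # Forget precursors whose exclusion window has expired.
    expired = [mz for mz, t in excluded.items() if scan_time_s - t >= exclusion_s]
    for mz in expired:
        del excluded[mz]

    selected = []
    # Walk through the peaks from most to least intense.
    for mz, _intensity in sorted(survey_peaks, key=lambda p: p[1], reverse=True):
        if any(abs(mz - ex) <= tol_mz for ex in excluded):
            continue  # still under dynamic exclusion
        selected.append(mz)
        excluded[mz] = scan_time_s
        if len(selected) == top_n:
            break
    return selected
```

With a cycle time of about 3 s, a precursor selected once is skipped for roughly the next ten cycles, which lets less intense ions be sampled.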
Proteins were identified via automated database searching (Matrix Science, London, United Kingdom) of all tandem mass spectra against an in-house curated version of the Mouse International Protein Index protein sequence database (IPI, version 2.18, 40,402 protein sequences; European Bioinformatics Institute, www.ebi.ac.uk/IPI/) containing all mouse protein entries from Swiss-Prot, TrEMBL, RefSeq, and Ensembl as well as frequently observed contaminants (porcine trypsin and human keratins). Carbamidomethyl cysteine was set as fixed modification, and oxidized methionine and protein N-acetylation were searched as variable modifications. Initial mass tolerances for protein identification on MS and MS/MS peaks were 3 ppm and 0.8 Da, respectively. The instrument setting for the Mascot search was specified as "ESI-Trap."
RESULTS
Using the significance score provided by Mascot, we established a level that should lead to 99% confidence in peptide hits even without manual inspection of the spectra. For searches with tryptic peptides, two "missed cleavages" (that is, allowing up to two internal trypsin cleavage sites), and a mass accuracy of 3 ppm, this significance score was 26. Because the Mascot probability score is only an approximation, we independently tested the level of false positives by searching our data in a reverse database (that is, each polypeptide sequence is written in reverse from last residue to first) (4, 5). Only 35 out of the 4755 tandem mass spectra matched a peptide sequence with a score of at least 26. Furthermore, about half of these peptide sequences were equivalent to real peptides that were identified also in the normal, "forward" search. This leaves 17 false positives, which corresponds to 1.5% of the more than 1000 tryptic peptides identified (see below), roughly in line with the predicted level of 1% incorrect hits given by the Mascot significance score.
Having established criteria that would lead to 99% correct peptide hits even without manual inspection, we then searched the mouse IPI database with the full dataset under those conditions. Of the 4755 fragmentation spectra, 1131 matched to fully tryptic peptides with at least six amino acids and an identification score of at least 26. The average absolute mass accuracy was 0.7 ppm. Of these peptides, 607 matched the top 50 proteins (Supplemental Table I), which are analyzed in more detail below.
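The reversed-database check and the resulting false-positive estimate can be sketched in a few lines. The helper names and the `REV_` prefix are hypothetical; the count of 18 reverse hits shared with the forward search is inferred from the text (35 reverse hits, about half of them equivalent to real peptides, leaving 17) and is an assumption rather than a reported number.

```python
def reverse_database(proteins):
    """Return a decoy database in which every sequence is written last residue first.

    proteins: dict mapping an accession to its amino acid sequence.
    The 'REV_' prefix is a hypothetical naming convention.
    """
    return {f"REV_{name}": seq[::-1] for name, seq in proteins.items()}


def false_positive_rate(decoy_hits, shared_with_forward, forward_hits):
    """Estimate the fraction of forward hits expected to be incorrect.

    Decoy hits that are equivalent to real peptides found in the forward
    search are not counted as false positives.
    """
    return (decoy_hits - shared_with_forward) / forward_hits


# Numbers from the text: 35 reverse hits with score >= 26, 17 of them
# remaining as false positives, and 1131 fully tryptic forward hits.
rate = false_positive_rate(decoy_hits=35, shared_with_forward=18, forward_hits=1131)
print(f"estimated false-positive rate: {rate:.1%}")
```

Under these assumptions the estimate is 17/1131, or about 1.5%, the figure quoted above.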
Searches for Not Fully Tryptic Peptides
We next repeated our searches of the dataset with the parameter sets called "SemiTrypsin" or "no enzyme specificity" in the Mascot software. For the SemiTrypsin search, amino acids different from Arg and Lys are allowed at one of the termini of the peptides. This less-restrictive criterion increases the number of candidate sequences to be correlated with each tandem mass spectrum by about a factor of 10. Correspondingly, a higher significance score is required. In the case of "No Enzyme" specificity, both termini of the peptide are arbitrary and the peptide can have any number of internal cleavage sites, increasing the number of candidate sequences by more than 100-fold. The significance scores for 99% certain identification for the two searches are 37 and 45, respectively. However, even with the significance score left at 26, only a total of eight and six additional peptides, respectively, matched the top 50 proteins for the two searches. These peptides are listed in Table I along with their peptide sequence, flanking amino acids, and Mascot peptide score. The fact that only 14 not fully tryptic peptides were found compared with 607 tryptic peptides already suggests that using these relaxed search parameters may not yield any great advantage for protein identification. Furthermore, close inspection of the additional peptides revealed that they likely resulted from fully tryptic peptides too: Three of these peptides were in fact fully tryptic but were not identified in the other searches because they contained three internal cleavage sites. Two apparently semi-tryptic peptides were generated from cleavages between aspartic acid (D) and proline (P) in the N terminus (for example, (D)PAKAPNSPDVLEIEFKK(G)). It is well known that the amide bond between D-P residues in peptides is the weakest peptide bond (e.g. acid-labile in dilute formic acid) (16) and thereby easily hydrolyzed in solution as well as in gas phase (upon collision-induced dissociation MS/MS (17) or by nozzle-skimmer fragmentation). We believe the former to be the case here as the intact tryptic peptide, (K)TQDPAKAPNSPDVLEIEFKK(G), was also identified in our LC-MS/MS analysis but at a different retention time in the LC run. Another peptide likewise had an N-terminal proline and probably originated by the same mechanism (Table I).
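The distinction between fully tryptic, semi-tryptic, and non-tryptic matches can be made explicit with a small classifier that looks at the flanking residues. This is a sketch, not Mascot's implementation: the function names are hypothetical, a protein terminus is assumed to be written as "-", and no special rule is applied to Lys-Pro or Arg-Pro bonds.

```python
def classify_peptide(prev_res, peptide, next_res):
    """Classify a peptide match by the tryptic character of its two termini.

    prev_res / next_res: single flanking residues, or '-' at a protein terminus
    (an assumed convention).
    """
    # N terminus is tryptic if the preceding residue is K or R,
    # or if the peptide starts at the protein N terminus.
    n_tryptic = prev_res in {"K", "R", "-"}
    # C terminus is tryptic if the peptide ends in K or R,
    # or if the peptide ends at the protein C terminus.
    c_tryptic = peptide[-1] in {"K", "R"} or next_res == "-"

    if n_tryptic and c_tryptic:
        return "fully tryptic"
    if n_tryptic or c_tryptic:
        return "semi-tryptic"
    return "non-tryptic"


def missed_cleavages(peptide):
    """Count internal K/R residues, i.e. internal trypsin cleavage sites."""
    return sum(1 for aa in peptide[:-1] if aa in "KR")


# The two peptides discussed in the text:
print(classify_peptide("D", "PAKAPNSPDVLEIEFKK", "G"))     # semi-tryptic
print(classify_peptide("K", "TQDPAKAPNSPDVLEIEFKK", "G"))  # fully tryptic
print(missed_cleavages("TQDPAKAPNSPDVLEIEFKK"))            # 2
```

The D-P breakdown product is flagged as semi-tryptic only because its N terminus follows an aspartic acid; its C terminus and its parent peptide are both tryptic.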
Of the remaining four peptides, only one peptide is above the 99% significance score for these searches (see Table II). It had a score of 53, well above the significance score for semi-tryptic peptides of 37. Because it is the only such peptide out of 607, this single peptide may be a false positive, which would also be consistent with a 1% false positive rate as established with the reversed database above.
The distribution of peptide hits to correct and incorrect proteins is visualized in Fig. 3, where we plot the incorrect peptides (red) from the 2.0-Da search alongside the correct peptides (green) from the 3-ppm search. The vast majority of additional peptide matches registers as new peptides to random protein hits throughout the list. It has been noted previously that incorrect peptide hits will tend to distribute to incorrectly identified proteins because the correct peptide hits cluster on proteins correctly identified with several peptides (5).
DISCUSSION
Trypsin belongs to the serine protease family and is very similar to chymotrypsin in primary structure. The enzymatic mechanism entails recognition of a target amino acid in a binding pocket and subsequent cleavage of the C-terminal amide bond by a mechanism involving a serine residue on the protease (hence the name). Chymotrypsin has a substrate binding pocket that preferentially recognizes bulky aromatic residues (that is, F, Y, W), whereas trypsin's substrate binding pocket is deeper and narrower and has a negatively charged aspartate at the bottom that binds basic amino acids via an ionic interaction. Thus target amino acids for cleavage need to have long side chains and be positively charged to allow formation of the ionic bond. Only arginine and lysine fulfill these criteria and therefore trypsin might be expected to be a very specific protease on theoretical grounds. Our study identified 607 fully tryptic peptides matching to the top 50 proteins in a complex protein mixture. Only 2% half-tryptic or non-tryptic peptides matched to these 50 proteins. Upon further investigation, however, we found that these peptides also originated from specific trypsin cleavage. Some of them only appeared to be non-tryptic peptides but were tryptic with many internal cleavage sites or appeared non-tryptic because of database annotation issues such as the fact that the signal peptide in the database is not automatically removed. Others had been fully tryptic at the time of digestion but degraded in solution or by in-source decay at proline regions. Taken together, our data indicate that trypsin exclusively cleaves C-terminal to arginine or lysine. Even though the mechanism of trypsin cleavage readily explains such exquisite specificity, to our knowledge this report is the first to establish this fact for any proteolytic enzyme.
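Strict cleavage C-terminal to arginine or lysine translates directly into a simple in silico digest. The sketch below is illustrative only: the function name `tryptic_digest` is hypothetical, the defaults of two missed cleavages and a minimum length of six residues are taken from the search settings reported in RESULTS, and no exception is made for Lys-Pro or Arg-Pro bonds.

```python
def tryptic_digest(sequence, max_missed=2, min_length=6):
    """List peptides produced by cleavage C-terminal to every K or R.

    max_missed: maximum number of internal (uncleaved) K/R sites per peptide.
    min_length: shortest peptide to report.
    """
    # Cut positions: protein start, after each internal K/R, protein end.
    cuts = [0]
    cuts += [i + 1 for i, aa in enumerate(sequence[:-1]) if aa in "KR"]
    cuts.append(len(sequence))

    peptides = []
    for i in range(len(cuts) - 1):
        # Joining j - i adjacent fragments gives j - i - 1 missed cleavages.
        last = min(i + 1 + max_missed, len(cuts) - 1)
        for j in range(i + 1, last + 1):
            peptide = sequence[cuts[i]:cuts[j]]
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


# Toy sequence (not from the study) with two internal cleavage sites.
for peptide in tryptic_digest("GGGGGGKAAAAAARLLLLLL", max_missed=1):
    print(peptide)
```

Every peptide produced this way is fully tryptic by construction, so a search restricted to this list corresponds to the strict specificity advocated here.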
It has often in the past been suggested that trypsin preparations may contain some chymotryptic activity. Here we show that trypsin itself has no chymotryptic (or any other unspecific) activity. Such an activity has previously also been attributed to actual chymotryptic contamination of purified trypsin. While we have not observed this contamination here, it is possible that this problem existed several decades ago.
In practical terms, our results suggest that only peptides originating from specific trypsin cleavage be considered in database searches. In order to identify as many peptides as possible, several changes to database search algorithms could, however, be considered. First, peptides with many internal cleavage sites can still be correct hits. Search engines could accommodate this finding by allowing such peptides to match to already identified proteins. Second, peptide sequences with an N-terminal proline can also be correct and could be allowed in a search for fully tryptic peptides. It is well known that protein databases are currently not optimal for proteomic experiments. Clearly, some N-terminal peptides can be retrieved by considering the mature, fully processed form of the protein. Better isoform annotation would additionally allow retrieval of peptides that are not fully tryptic in some but not other isoforms of a protein. However, our results also suggest that all those changes would only add about 2% correctly identified peptides.
It has been known for some time that high mass accuracy can be very beneficial for database searches (18, 19), and this notion is strongly supported by these experiments. The precursor mass accuracy achieved in the experiments reported here allowed us to use database searches with 3 ppm, compared with 200 ppm frequently used for initial searches in quadrupole time-of-flight instruments. (However, the actually achieved mass accuracy in quadrupole time-of-flight experiments can be as low as 10 ppm, see for example Ref. 20.) In ion trap experiments, the database search windows are typically several Daltons wide. Thus many more sequences are compared with the tandem mass spectrum, increasing the chance for spurious matches. We have shown here that the LTQ-FT combination allows more than a 100-fold increased confidence in peptide matches, which we have here used to investigate the occurrence of non-tryptic peptides, but which should also be very beneficial in any other proteomic experiment.
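To make the comparison of search windows concrete, the following sketch converts ppm tolerances into absolute mass windows. The 1500-Da example mass and the 2-Da ion trap tolerance are illustrative assumptions, and treating the number of candidate sequences as proportional to the window width is a simplification.

```python
def ppm_error(measured, theoretical):
    """Relative mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def half_window_da(mass, tol_ppm):
    """Absolute tolerance in Da (plus or minus) for a ppm tolerance at a given mass."""
    return mass * tol_ppm / 1e6


mass = 1500.0  # hypothetical peptide mass in Da
ft_window = half_window_da(mass, 3.0)      # 3 ppm, as used in this study
qtof_window = half_window_da(mass, 200.0)  # 200 ppm initial Q-TOF search
trap_window = 2.0                          # assumed ion trap tolerance in Da

print(f"3 ppm   -> +/- {ft_window:.4f} Da")
print(f"200 ppm -> +/- {qtof_window:.4f} Da")
print(f"ion trap window is about {trap_window / ft_window:.0f} times wider")
```

For this example mass, 3 ppm corresponds to about ±0.0045 Da, so a window of a few Daltons is several hundred times wider, in line with the more than 100-fold gain in discrimination discussed above.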
This study did not address the causes of reported non-tryptic peptide matches in the literature, except to indicate that low mass accuracy may have contributed. While it is theoretically possible that trypsin was nonspecific under the conditions used by other experimenters, we believe that this is an unlikely cause of these reported peptides. Likewise, non-tryptic peptides are sometimes attributed to proteases in the sample itself. We consider this to be an unlikely possibility based on our experience with a broad range of samples. This notion could be tested by processing samples without adding trypsin.
In several cases, detailed studies in less complex mixtures have been made (such as our own nanoelectrospray studies, combined with peptide sequence tag database studies), and in these cases there was little, if any, evidence for non-tryptic peptides, apart from the N-terminal proline peptides discussed above. This was also the conclusion of a recent study of 1424 manually interpreted tandem spectra of a single LC-MS/MS run on a quadrupole time-of-flight instrument (21). A more likely explanation for most of the non-tryptic or half-tryptic peptides reported in the literature may be that automated matching of a large number of tandem mass spectra to a similarly large number of possible peptide sequences from the databases does produce a certain number of convincing but spurious hits.
In summary, we have used the LTQ-FT, a state-of-the-art instrument for sub-ppm accuracy analyses of a complex protein mixture, to reveal the proteolytic specificity of trypsin in proteomics experiments. Our data clearly demonstrate the extremely high specificity of trypsin and have important implications for proteomics researchers seeking the optimal search parameters to minimize false positives.
ACKNOWLEDGMENTS
We thank other members of our laboratory for help and fruitful discussions. We are grateful to Thermo Electron, especially Drs. S. Horning, R. Pesch, and A. Wieghaus, for help and tips for operation of the Finnigan LTQ-FT. We warmly thank Alexey Nesvizhskii (Institute of Systems Biology, Seattle, WA) for sharing his insights into database identifications of peptides.
FOOTNOTES
Published, MCP Papers in Press, March 19, 2004, DOI 10.1074/mcp.T400003-MCP200
1 The abbreviations used are: MS, mass spectrometry; ppm, parts per million; FTICR, Fourier transform ion cyclotron resonance; LC, liquid chromatography; MS/MS, tandem mass spectrometry; SIM-scan, scan-selected ion monitoring scan; IPI, International Protein Index.
* Work in the Center for Experimental BioInformatics (CEBI) is supported by a generous grant by the Danish National Research Foundation. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this manuscript (available at http://www.mcponline.org) contains supplemental material.
To whom correspondence should be addressed: Center for Experimental BioInformatics (CEBI), Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark. Tel.: 45-6550-2364; Fax: 45-6539-3929; E-mail: mann{at}bmb.sdu.dk