From the Proteomics Research Center, National Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, People's Republic of China 100005; and Institute of Automation, Chinese Academy of Sciences, Beijing, People's Republic of China 100080
ABSTRACT
Recently, several groups have applied different algorithms to evaluate SEQUEST database search results (11–15). Moore et al. described a probabilistic algorithm called Qscore (11). Its probability model included the expected number of matches from a given database, the effective database size, a correction for indistinguishable peptides, and a measurement of match quality. Anderson et al. (12) applied the support vector machine learning algorithm to distinguish between correctly and incorrectly identified peptides. Each peptide identification was described by a vector of parameters drawn from the SEQUEST output, covering both observed data (peptide mass, precursor ion intensity) and SEQUEST-calculated statistics (such as Xcorr, DeltaCn, Sp, and RSp). Keller et al. (13, 14) employed another machine learning algorithm, the expectation maximization algorithm. It incorporated four SEQUEST scores plus the number of tryptic peptide termini present in the matched peptides to estimate a peptide probability. The peptide probabilities are then combined to estimate the probability of the corresponding protein. More recently, Razumovskaya et al. (15) developed another method, which combines a neural network and a statistical model, to normalize SEQUEST scores and to provide a reliability estimate for each SEQUEST hit. These methods improve the separation between correct and incorrect peptides and reduce the number of SEQUEST protein identifications that have to be validated manually.
The above approaches are based on different algorithms. Here we address the same problem in a different way. Manual validation of a peptide match often makes use of various spectral properties to discriminate positives from negatives (16, 17). We encoded manual validation rules in a computer program to filter SEQUEST outputs automatically. Two rules are important for manual validation: the fragment ions should be clearly above baseline noise, and the spectrum should have continuous b or y ion matches (16). The facts underlying these rules are that "highly abundant fragment ions are more likely to be signals" and that "the MS/MS spectrum of an optimally fragmented peptide should theoretically contain continuous fragment ions of b or y series." Based on these two facts, two functions were programmed in the AMASS (advanced mass spectrum screener) software to calculate the match percentage of high-abundance fragment ions and the continuity of the b or y ion series. Tandem mass spectra datasets of known protein mixtures searched with SEQUEST were filtered by AMASS with relaxed Xcorr and DeltaCn settings, and the result was compared with that obtained using the common Xcorr and DeltaCn settings alone (17).
EXPERIMENTAL PROCEDURES
SEQUEST Search and Xcorr Filter
The 22 raw files were searched against the protein database with Bioworks 3.1 from ThermoFinnigan. The protein database was composed of 88,374 proteins, including the Swiss-Prot human protein database and the 18 proteins in the mixture. Tryptic cleavages at only Lys or Arg and up to two missed internal cleavage sites in a peptide were allowed. The maximal allowed uncertainty in the precursor ion mass was m/z 1.4. Peptides from m/z 400 to 4,500 and precursor charge states of +1, +2, and +3 were allowed. The minimum total ion current required for precursor ion fragmentation was 1.0 × 10^5, and the minimum number of ions was 25. Altogether, 47,907 spectra were searched against the database.
The output files were filtered with the Xcorr filter (Xcorr + DeltaCn). The following values of Xcorr and DeltaCn were used as the common setting (17): DeltaCn ≥ 0.1, and Xcorr:
Xcorr ≥ 1.9 for +1 charged peptides, with fully tryptic ends
Xcorr ≥ 2.2 for +2 charged peptides, with partially and fully tryptic ends
Xcorr ≥ 3.75 for +3 charged peptides, with partially and fully tryptic ends
The Xcorr filters used were derived from the common setting with a constant DeltaCn. For example, an 80% Xcorr filter meant 0.8 × (common setting), so the filter was actually Xcorr 0.8 × 1.9 = 1.52 for +1 charged peptides, and so forth. The Xcorr filters examined in the analysis ranged from 0 to 120%, in steps of 10%.
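The scaling of the common setting can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function names are hypothetical, the comparison is assumed to be "greater than or equal to", and the tryptic-end requirement is left out for brevity.

```python
# Common Xcorr thresholds by precursor charge, as listed in the text.
COMMON_XCORR = {1: 1.9, 2: 2.2, 3: 3.75}
# DeltaCn is held constant for every scaled filter.
COMMON_DELTA_CN = 0.1


def scaled_xcorr_filter(percent):
    """Return per-charge Xcorr thresholds at `percent` of the common setting."""
    return {charge: round(x * percent / 100.0, 4)
            for charge, x in COMMON_XCORR.items()}


def passes_xcorr_filter(charge, xcorr, delta_cn, percent=100):
    """Assumed rule: keep a hit if both scores reach their thresholds."""
    thresholds = scaled_xcorr_filter(percent)
    if charge not in thresholds:
        return False
    return xcorr >= thresholds[charge] and delta_cn >= COMMON_DELTA_CN


# The filters examined in the analysis: 0% to 120% in steps of 10%.
FILTER_PERCENTS = list(range(0, 130, 10))
```

With this sketch, `scaled_xcorr_filter(80)` gives thresholds of 1.52, 1.76, and 3.0 for the +1, +2, and +3 charge states.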
Positive and Negative Peptides
Peptides were labeled positive or negative according to whether they belonged to one of the 18 known proteins. Only the top-scoring peptide match was used to judge the presence of a particular protein. If a peptide passing the Xcorr filter above was part of the 18 known proteins, it was counted as a positive peptide; otherwise, it was counted as a negative peptide.
Common contaminants were not included when counting positives, which decreased the number of positives. We adopted this conservative strategy because we wanted to demonstrate the effect of the AMASS parameters under the most conservative settings.
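The labeling rule can be illustrated as follows. The record layout (dictionaries with `rank` and `protein` keys) and the function name are assumptions made for this sketch; they do not reflect the actual SEQUEST output format.

```python
def count_positives_negatives(hits, known_proteins):
    """Count positive and negative peptides among top-ranked hits.

    `hits` is a list of dicts with hypothetical keys 'rank' and 'protein';
    the hits are assumed to have already passed the Xcorr filter.
    """
    known = set(known_proteins)
    positives = negatives = 0
    for hit in hits:
        if hit["rank"] != 1:
            continue  # only the top-scoring match is considered
        if hit["protein"] in known:
            positives += 1
        else:
            negatives += 1  # contaminants also fall here (conservative)
    return positives, negatives
```

Because contaminants are not in the list of known proteins, they are counted as negatives, which matches the conservative strategy described above.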
Computer Programs: AMASS
The following rules are commonly applied in the manual validation of mass spectra (16): 1) the MS/MS spectrum must be of good quality with fragment ions clearly above baseline noise; and 2) there must be some continuity to the b or y ion series.
Based on these rules, we proposed two functions.
1. Match percentage, MatchPct:
MatchPct = [number of matched daughter ions with relative abundance higher than RACutoff / total number of daughter ions with relative abundance higher than RACutoff] × 100%
RACutoff (relative abundance cutoff) was a number between 0 and 100 serving as a relative abundance cutoff point in MS/MS spectra. For example, when RACutoff was 20, only the ions with relative abundance higher than 20 were included in the calculation of MatchPct. When a lower RACutoff value was used, more fragment ions were included in the calculation. A higher MatchPct value means that more of the fragment ions above a given RACutoff were matched. In general, the higher the value of MatchPct, the better the quality of the identification.
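A minimal sketch of the MatchPct calculation is given below. The peak representation (a list of relative abundance and matched-flag pairs) and the return value of 0.0 for a spectrum with no peaks above the cutoff are assumptions of this example.

```python
def match_pct(peaks, ra_cutoff=20.0):
    """Percentage of matched ions among ions above the abundance cutoff.

    `peaks` is a list of (relative_abundance, is_matched) pairs, with
    relative abundance on a 0-100 scale (hypothetical input layout).
    """
    above = [matched for abundance, matched in peaks if abundance > ra_cutoff]
    if not above:
        return 0.0  # assumed convention when no ion exceeds the cutoff
    n_matched = sum(1 for matched in above if matched)
    return 100.0 * n_matched / len(above)
```

For a spectrum with peaks `[(100, True), (50, False), (30, True), (10, True)]`, a cutoff of 20 keeps three ions, two of them matched, while a cutoff of 0 keeps all four, three of them matched.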
2. Continuity, Cont:
[Equation for Cont: image not recovered from the source]
where f(i) = 1 if the ith b or y series ion is matched, 0 otherwise; b(i) = n^2 if the (i + 1)th b series ion is not matched and n = the number of continuously matched b ions immediately before the ith (including the ith ion), 0 otherwise; y(i) = n^2 if the (i + 1)th y series ion is not matched and n = the number of continuously matched y ions immediately before the ith (including the ith ion), 0 otherwise; and l = the number of amino acids in the peptide.
Cont adds the squared lengths of the runs of continuously matched b series and y series ions to the total number of matched ions; the sum is then normalized by dividing by its maximum possible value and multiplying by 100. A higher Cont value means more continuous matching fragment ions.
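The verbal description of Cont can be turned into a short sketch. It rests on two assumptions that are not confirmed by the text: matched b and y ions are counted separately in the total number of matched ions, and the maximum possible value corresponds to every b and y ion being matched, that is, 2n + 2n^2 for n ions per series (n = l - 1). The function names are hypothetical.

```python
def run_square_sum(matched):
    """Sum of squared lengths of runs of consecutive matched ions."""
    total, run = 0, 0
    for is_matched in matched:
        if is_matched:
            run += 1
        else:
            total += run * run  # a run ends where the next ion is unmatched
            run = 0
    return total + run * run  # close a run that reaches the last ion


def cont(b_matched, y_matched):
    """Continuity score on a 0-100 scale for two equal-length ion series."""
    n = len(b_matched)
    if n == 0 or len(y_matched) != n:
        raise ValueError("b and y series must be non-empty and equal in length")
    n_matched = sum(1 for m in b_matched if m) + sum(1 for m in y_matched if m)
    raw = n_matched + run_square_sum(b_matched) + run_square_sum(y_matched)
    max_raw = 2 * n + 2 * n * n  # assumed maximum: all b and y ions matched
    return 100.0 * raw / max_raw
```

Squaring the run lengths rewards one long stretch of matched ions more than the same number of isolated matches, which is the intent of the continuity rule.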
When calculating MatchPct and Cont, all matched daughter ions in the different charge states were taken into account. To determine how well AMASS distinguishes positive from negative peptides, the values of RACutoff, MatchPct, and Cont were varied from 0 to 90 in incremental steps of 10 and applied to the SEQUEST results as a secondary filter in addition to the corresponding Xcorr filter. Proper parameter values should maximize the number of positive peptides without sacrificing the positive rate. The values of the AMASS parameters RACutoff, MatchPct, and Cont were estimated experimentally as 20, 60, and 40, respectively (data are shown in supplement 1).
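The parameter scan can be sketched as a simple grid search. This example assumes that MatchPct has already been computed at a fixed RACutoff for each hit, that a hit passes when both scores are greater than or equal to their thresholds, and that hits are supplied as hypothetical `(match_pct, cont, is_positive)` tuples.

```python
from itertools import product


def scan_thresholds(hits, steps=range(0, 100, 10)):
    """Count positives and negatives kept at each threshold pair.

    Returns a list of (min_match_pct, min_cont, n_positive, n_negative).
    """
    results = []
    for min_match_pct, min_cont in product(steps, steps):
        kept = [h for h in hits if h[0] >= min_match_pct and h[1] >= min_cont]
        n_positive = sum(1 for h in kept if h[2])
        results.append((min_match_pct, min_cont, n_positive, len(kept) - n_positive))
    return results
```

Scanning RACutoff as well would require recomputing MatchPct for every cutoff value, which is omitted here to keep the sketch short.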
RESULTS
The Effect of Each AMASS Parameter
The above result was based on the hypothesis that all peptides belonging to the 18 known proteins were positives. However, positives of poor quality would be considered false positives (FP) under manual validation. Therefore, to further demonstrate the effect of AMASS relative to manual validation, all identifications from the 22 datasets that passed the common Xcorr filter settings were manually assigned as true positives (TP) or FP according to the above manual validation rules (16). If a tandem mass spectrum assigned to a peptide met the manual validation rules, the peptide was considered a TP; otherwise, it was considered an FP.
To evaluate the different effects of each AMASS parameter, the tandem mass spectra assigned to FP were classified, according to our experience, into three categories. The first category was poor fragmentation, with much of the ion current in a few major peaks. The second was noisy spectra, which had a low signal-to-noise ratio. The third was false interpretation, in which the spectra had major peaks and a good signal-to-noise ratio, but most of the matched ions were noise. The final list of TP assignments consisted of 1,295 peptides, confidently identified in the mixture. The list of FP assignments contained 233 peptide hits by SEQUEST (73, 81, and 79 in the first, second, and third categories, respectively). We assigned fewer TP peptide identifications than Keller's result (10) because they assigned all the outputs to peptide identifications without any filter, whereas we assigned only the peptides passing the common Xcorr filter.
Fig. 2A shows the number of TP and FP under different filters, indicating that AMASS could decrease the number of FP at little cost in TP. Fig. 2B shows the different effects of the AMASS parameters on the three categories of FP. Cont and MatchPct filtered out most of the noisy and false interpretation FP, but only about half of the poor fragmentation ones.
When MatchPct and Cont were combined, more FP were filtered out (most of the noisy and false interpretation FP and about half of the poor fragmentation FP), which showed that the effects of the two parameters were different.
Combination of AMASS and Rscore
Our previous work, Rscore (18), is a score that evaluates the relative quality of an identification in terms of cross-correlation and matched intensity percentage. The notion underlying Rscore is that TP peptide identifications should be better than other, randomly generated identifications. For poor fragmentation spectra, the few high-abundance ions are likely to be matched by both the first- and second-ranked peptides, so the relative quality difference between the two is small and such identifications can be filtered out by Rscore. Because AMASS works best on the other two kinds of FP, AMASS and Rscore should be complementary. Fig. 3 shows that when the two filters were used together, the Xcorr filter could be lowered to 70% of the common settings and more positives (1,790) could be obtained with a similar number of negatives (102) compared with the common settings (99). This result was better than that obtained with either filter alone.
DISCUSSION
AMASS was based on the two manual validation rules. In our results, AMASS markedly increased the number of positives and the positive rate at Xcorr filter settings lower than the common ones. Manual validation showed that it can filter out most noisy MS/MS spectra and false interpretations, and about half of the poor fragmentation FP, at a low cost in TP. When AMASS and Rscore were both applied, more positives could be obtained with a similar number of negatives. These results showed that high-quality positive identifications can be achieved with AMASS, although it still failed to completely separate TP from FP.
AMASS makes use of a threshold model. We chose the threshold model because we want TP results to satisfy all of the AMASS criteria. The AMASS criteria are independent, in that a high value in one parameter cannot compensate for a deficit in another (for instance, a perfect Cont score does not guarantee that the matched ions are signals). A linear model does not have this property. Other models could also be used to tackle this problem. A quadratic model would be able to approximate it, but we decided to preserve the simplicity of the model, because a simple model should have better generalization ability (19) (supplement 2).
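The difference between the two kinds of model can be shown with a small example. The threshold values come from the parameter estimates reported above; the linear weights and cutoff are hypothetical and were chosen only to make the contrast visible.

```python
def threshold_model(match_pct, cont, min_match_pct=60, min_cont=40):
    """Pass only if every criterion is met on its own."""
    return match_pct >= min_match_pct and cont >= min_cont


def linear_model(match_pct, cont, w_match=0.5, w_cont=0.5, cutoff=50):
    """Pass if a weighted sum reaches a cutoff (hypothetical weights)."""
    return w_match * match_pct + w_cont * cont >= cutoff


# A hit with perfect continuity but few matched high-abundance ions.
hit = {"match_pct": 10, "cont": 100}
print(threshold_model(**hit))  # False: MatchPct alone fails its threshold
print(linear_model(**hit))     # True: the high Cont compensates
```

In the linear model the perfect Cont score offsets the poor MatchPct, which is exactly the compensation the threshold model is meant to rule out.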
To our knowledge, none of the present parameters or algorithms can completely distinguish positives from negatives. A possible reason is that the search results may not be a binary yes or no answer (11). Because many peptide matches are of intermediate quality, using score cutoffs and/or algorithms to force intermediate-quality results into positive or negative categories actually interferes with the goal of maximizing the data extracted from the system. Even with ideal evaluation parameters drawing on the detailed information in tandem mass spectra, peptide sequences, databases, etc., and with various algorithms, it may well remain impossible to completely distinguish positives from negatives.
Because the final aim of proteomics research is the identification of proteins, the probability that proteins are correctly identified is more important than that of peptides. Therefore, several steps may be applied to the problem. First, new parameters and algorithms still need to be proposed to improve the distinguishing efficiency. Second, the probability of protein identifications can be estimated based on peptide evaluations, as has been done by Keller's and Razumovskaya's groups (14, 15). Third, with the present parameters and algorithms, one approach to achieving high-confidence protein identification is to use relatively stringent filters, such as a higher Xcorr filter setting (17), two or more peptides for one protein identification (11), or a combination of different algorithms. The other approach is to require that a protein identification be reproducible across multiple experiments before drawing a conclusion.
There are two other rules for manual validation (16): the y ions that correspond to a proline residue should be intense ions, and unidentified, intense fragment ions correspond to the loss of one or two amino acids from one of the ends of the peptide. Because these two rules are difficult to quantify with functions such as MatchPct and Cont, they were not considered in the present AMASS program. Our future work will take them into account.
Some caveats should be mentioned here. First, our results were based on datasets of 18 known proteins, but proteomic results from tissues or protein complexes are much more complex than a mixture of 18 known proteins, and whether our results apply to such complex samples remains to be proved. Second, different Xcorr values are used for different charge states and lengths of the precursor ion, so various settings exist (16–18, 20). The one used in this article was the one producing a relatively higher positive rate (10), but other settings may perform better. Last, the database we used was only the human database and not the full Swiss-Prot or the nonredundant NCBI database, which may produce more random matches.
CONCLUSION
ACKNOWLEDGMENTS
FOOTNOTES
1 The abbreviations used are: FP, false positives; AMASS, advanced mass spectrum screener; TP, true positives.
* This work was partially supported by grants from Key Project for International Corporation (no. 2002AA229031), Pilot Study for Key Basic Research Project (no. 2002CCA04100), National Basic Research Program (no. 2004CB520804), and National Natural Science Foundation (nos. 30270657 and 30230150).
S The on-line version of this manuscript (available at http://www.mcponline.org) contains supplemental material.
Published, MCP Papers in Press, October 15, 2004, DOI 10.1074/mcp.M400120-MCP200
¶ To whom correspondence should be addressed: 5 Dong Dan San Tiao, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, People's Republic of China 100005. Tel.: 086-010-6787-2251-206; Fax: 086-010-6787-2251-201; E-mail: gaoyouhe{at}pumc.edu.cn, sunwei1018{at}hotmail.com
REFERENCES