From the Broad Institute, Cambridge, MA 02139; ¶ Institute for Systems Biology, Seattle, WA 98103; || University of California, San Francisco, CA 94143; and ** Millennium Pharmaceuticals, Cambridge, MA 02139
Over the past few years, the number and size of proteomic datasets composed of mass spectrometry-derived protein identifications reported in the literature have grown dramatically. This is a direct result of the widespread availability of instruments, methods, and easy-to-use software for collecting large amounts of data and for converting the observed peptide and fragment-ion masses to peptide and then protein identities. In particular, the analysis of samples containing large numbers of proteins by multidimensional liquid chromatography (LC/LC)1 coupled on-line with tandem mass spectrometry (MS/MS) is now a common component of many biological projects. Clearly it is in the interest of the scientific community to make such data readily available. However, the publication of large proteomic datasets poses new and significant challenges for authors, reviewers, and readers because universally accepted computational tools for validating the published results are not yet widely available (1). In an effort to ensure that high-quality, significant data are entering the proteomics literature, Molecular & Cellular Proteomics (MCP) is introducing guidelines for authors planning to submit manuscripts containing large numbers of proteins identified primarily by LC-MS/MS.
The need for these guidelines is driven in part by the fact that a significant but undefined number of the proteins being reported as "identified" in proteomics articles are likely to be false positives (2). These incorrect matches probably result most often from the use of low-quality peptide MS/MS data to search the database. However, even high-quality data can produce invalid identifications if, for example, the actual peptide sequence is not in the database being searched. Many different algorithms are being used for peptide and protein assignment (e.g. MSTag, Mascot, SEQUEST, SpectrumMill, and Sonar), and each has unique scoring rules for moving the most probable peptide assignment to the top of the "hit" list. In addition, new filtering criteria are being developed that, when layered onto the results from the above algorithms, help to eliminate a certain additional percentage of false positives (3, 4). It is very important that the users of these tools, our authors, have at least a working understanding of how the algorithm they use works. However, even the judicious use of scoring, threshold parameters, and additional filtering criteria for search engines, while serving the very important purpose of reducing the number of misassigned peptides and proteins, does not eliminate the problem. It is almost always possible to match an MS/MS spectrum to a peptide in the database; the difficult part is validating that the match is correct.
This is not to imply that the situation is bleak. In fact, most assignments of proteins made with high-quality data and using more than a single peptide to identify a protein are likely correct. Furthermore, improved methods are being developed at a rapid pace. Recently, application of statistical methods to validate peptide assignments to MS/MS spectra of peptides has been shown to be a promising approach, and a number of groups are working in this area (2, 5–11). However, these programs are only beginning to be widely used and they are not universally accepted. MCP fully supports continued development and testing of such programs and will publish new search and filtering approaches to make them widely available to the proteomics community. However, in the absence of accepted standards and widely available tools that operate on such standards, there are guidelines that the journal can formulate that would help ensure the publication of high-quality information and assist readers in making their own assessment of the validity of the assignments in manuscripts. Thus, we introduce the initial set of such guidelines in this issue of the journal. The rationale and purpose of each of these are described below.
The first (and obvious) guideline is to obtain sufficient information from authors to document what search engine was used and how peptide and protein assignments were made using that software.
Guideline 2 defines how peptides should be counted toward the identification of a protein. We do not at this time attempt to deal with the related issue of what constitutes a "unique" peptide with respect to the proteins identified. For example, the question arises of how to use peptides that match one member of a protein family but not any other. We are working toward a better definition of both the problem and possible ways to deal with it (also see explanation of guideline 6, below).
Guidelines 3 and 4 relate to the fact that, regardless of the search engine employed, the risk of a false-positive protein assignment is greater when only a single peptide is used to identify a protein than when multiple peptides (each satisfying the criteria for a good match in the given software) are used to make an identification by database searching (10). Therefore we are increasing the stringency of information required to use single-peptide identifications for protein assignment. This change is not meant to signal that proteins assigned with two or more peptides are automatically correct, only that there is a significantly higher potential for single-peptide assignments to be wrong.
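As an illustration of the distinction drawn above, the following minimal Python sketch separates proteins supported by a single peptide from those supported by two or more. It is not part of the guidelines: the function name `split_by_peptide_support`, the input layout (pairs of accession and peptide sequence that have already passed the search engine's acceptance criteria), and the choice to count distinct peptide sequences rather than spectra are all assumptions made for the example.

```python
from collections import defaultdict


def split_by_peptide_support(matches):
    """Split protein accessions by how many distinct peptides support them.

    matches: iterable of (accession, peptide_sequence) pairs. The pairs are
    assumed (hypothetically) to be accepted peptide matches already.
    Returns (single, multiple): two sorted lists of accessions.
    """
    peptides = defaultdict(set)
    for accession, sequence in matches:
        # Assumption: repeated observations of one sequence count only once.
        peptides[accession].add(sequence)

    single = sorted(a for a, seqs in peptides.items() if len(seqs) == 1)
    multiple = sorted(a for a, seqs in peptides.items() if len(seqs) >= 2)
    return single, multiple


if __name__ == "__main__":
    example = [
        ("P1", "AAK"), ("P1", "AAK"), ("P1", "CCR"),
        ("P2", "DDK"), ("P2", "DDK"),
    ]
    single, multiple = split_by_peptide_support(example)
    print("single-peptide:", single)
    print("multi-peptide:", multiple)
```

Proteins in the first list are the ones for which more stringent supporting information would be expected; membership in the second list does not by itself make an assignment correct.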
While the accepted standard for peptide identification is now sequence information obtained by MS/MS data, a significant portion of the proteomics community continues to employ peptide mass fingerprinting data for peptide identification. Guideline 5 addresses this type of data and increases the stringency for its acceptance.
Guideline 6 addresses the present difficulties search engines have with counting the number of unique proteins identified based on the peptides found. At present, there is no agreed upon approach. The issue is that essentially the same protein appears in many cases under different names and accession numbers in the databases. In some cases, this is due to redundant entries in the database being searched. However, some of this apparent "redundancy" is biological in that many genomes have multiple copies of similar genes as well as splice variants. In both cases, multiple proteins with similar (if not identical) sequences are identified by the search engine using subsets of the same group of mass spectra. This is a difficult problem for which there is no ideal solution at present. However, it is possible to formulate a practical definition of "similar" sequences and to then group proteins together according to the spectra used to identify them. This guideline addresses this issue.
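One way to picture grouping proteins "according to the spectra used to identify them" is the Python sketch below. It is only an illustration under stated assumptions, not the procedure the guideline prescribes: the function name `group_proteins`, the input layout (a mapping from accession to spectrum identifiers), and the rule that a protein whose spectra are a strict subset of another group's spectra is folded into that group are all hypothetical choices made for the example.

```python
from collections import defaultdict


def group_proteins(evidence):
    """Group protein accessions by the spectra that identify them.

    evidence: dict mapping accession -> iterable of spectrum IDs
    (a hypothetical input layout chosen for this sketch).
    Returns a list of groups, each a sorted list of accessions.
    """
    # Step 1: accessions supported by exactly the same spectra cannot be
    # told apart, so they start out in the same group.
    by_spectra = defaultdict(list)
    for accession, spectra in evidence.items():
        by_spectra[frozenset(spectra)].append(accession)

    # Step 2 (assumed rule): visit spectrum sets from largest to smallest and
    # fold a set into the first kept group that strictly contains it.
    groups = []  # list of (spectrum_set, member_accessions)
    for key in sorted(by_spectra, key=len, reverse=True):
        for kept_key, members in groups:
            if key < kept_key:  # strict subset of an existing group
                members.extend(by_spectra[key])
                break
        else:
            groups.append((key, list(by_spectra[key])))

    return [sorted(members) for _, members in groups]


if __name__ == "__main__":
    example = {
        "P1": {"s1", "s2", "s3"},
        "P1_dup": {"s1", "s2", "s3"},   # redundant database entry
        "P1_splice": {"s1", "s2"},      # variant seen with a subset of spectra
        "P2": {"s4", "s5"},
    }
    for group in group_proteins(example):
        print(group)
```

In this sketch the redundant entry and the variant end up in one group with `P1`, so the count of groups, rather than the count of accessions, would be reported as the number of distinct identifications. Proteins that merely overlap without one containing the other are left in separate groups, which is a simplification.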
Achieving the ultimate goal of publishing only high-quality datasets with small and known false-positive rates will require new data analysis tools and methods. As such tools become available, published results will be subject to reanalysis and interpretation. This is a healthy situation for the field, which MCP will assist by moving in the future to require that authors submit their data (in a suitable format) as a condition for acceptance of their manuscript. The goal is to make openly available the MS and LC-MS/MS data in published studies (in a suitable form and with necessary tools) to enable reanalysis, data mining, and development of improved algorithms. The guidelines introduced in this issue are a start. We consider this a work in progress and welcome your comments.
GUIDELINES FOR THE PUBLICATION OF PEPTIDE AND PROTEIN IDENTIFICATION DATA IN MOLECULAR & CELLULAR PROTEOMICS
The following guidelines have been developed to specifically address problems associated with articles containing peptide and protein identifications. These guidelines are under development and will likely undergo revisions as experience and the supporting technology develop further. Authors with questions regarding the guidelines are encouraged to contact the Editors prior to submission of their manuscripts for additional discussion and/or clarification.
Manuscripts containing protein identifications based on fragmentation and database searching must provide:
ACKNOWLEDGMENTS
We wish to acknowledge the thoughtful contributions of Robert Chalkley, Kirk Hansen, Kati Medzihradszky (University of California, San Francisco), Andrew Keller (Institute for Systems Biology), and Ron Beavis (Beavis Informatics, Ltd.).
Received, February 13, 2004, and in revised form, April 7, 2004.
FOOTNOTES
Published, MCP Papers in Press, April 8, 2004, DOI 10.1074/mcp.T400006-MCP200
1 The abbreviations used are: LC/LC, multidimensional liquid chromatography; MS/MS, tandem mass spectrometry; MCP, Molecular & Cellular Proteomics.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
To whom correspondence should be addressed: Steven Carr, Director of Proteomics, Broad Institute, 320 Charles Street, Cambridge, MA 02139. Tel.: 617-324-1483; E-mail: scarr@broad.mit.edu
REFERENCES