From MDS Inc., Denmark, Staermosegaardsvej 6, DK-5230 Odense M, Denmark
ABSTRACT |
---|
One of the initial challenges when analyzing LC MS/MS data is the assignment of peptides to precursor ions. Currently, this is typically achieved by statistical algorithms that match a theoretical peak list with the measured peak list; these include the cross-correlative Sequest algorithm (3) and probability-based algorithms such as Mascot (4). The number of incorrect peptide assignments made using probabilistic or cross-correlative algorithms alone can become an issue, as different peptides may have overlapping or even identical fragmentation patterns (e.g. Leu/Ile substitutions). This issue is particularly relevant for large LC MS/MS datasets and/or when a high sensitivity (i.e. the true positive rate) is required, e.g. in target discovery projects (5). The end result is that substantial time and resources are required for manual validation. The sensitivity of the peptide assignment can be improved at different levels, including: additional processing of the MS/MS data (6); improved charge-state determination (7, 8); removal of low-quality MS/MS data (9); and clustering of redundant spectra (10). Alternatively, a higher peptide assignment sensitivity can be achieved using sophisticated scoring schemes that exploit empirical information derived from MS/MS data and the search results. For instance, this could be the presence of consecutive fragment ions (i.e. sequence tag-like information); specific fragmentation signatures (e.g. a relatively intense proline ion); and the number of sibling peptides (NSP)¹ (11). For example, Colinge et al. introduced a new probabilistic scoring scheme termed OLAV (12) that exploits structural information in the MS/MS data to assign peptides. Another example is the SALSA algorithm, which seeks specific sequence-dependent features in MS/MS spectra (13). 
SALSA scores peptides based on how well the theoretical ion series for peptide sequence motifs correspond with the actual MS/MS product ion series, regardless of absolute position on the m/z-axis. The approach can be used in the identification of both unmodified and modified peptides (e.g. post-translationally or genetically). In the present study, we address the peptide assignment issue by exploiting various empirical parameters when validating the assignments returned by the Mascot search engine.
Once peptides have been assigned to the precursor ions, the next step is presentation of the protein evidence. This is a challenging task due to the degenerate nature of peptides, i.e. the same peptide can be derived from more than one protein entry. This redundancy may be derived from, e.g. homologous proteins or protein splice variants, or the database itself may be redundant. In many cases the MS/MS evidence therefore points toward a group of proteins rather than a single protein, and it may be impossible to determine which group members are present in the actual biological sample on the basis of the MS/MS evidence alone. Consequently caution should be taken when ranking protein hits, e.g. as seen in the result summary returned by Mascot (4). Nesvizhskii et al. addressed this issue by designing a statistical model for identifying proteins by LC MS/MS (11). Redundant protein identifications (i.e. assignments that can not be distinguished by the MS/MS evidence) were collapsed into a single identification, and a minimal protein list was generated using an expectation-maximization algorithm.
A major challenge in the analysis of proteomes is to maximize the information that can be extracted from a biological sample. There are several approaches, such as two-dimensional LC MS/MS, multi-step fractionation, and multiple analysis of the same sample using an exclusion list approach. The challenge remains, however, in how to deal with the huge amounts of data generated from proteomic analysis of complex biological samples, both in terms of database searches and data validation/mining. All these issues were central in the development of the new generic software platform presented here.
A peptide-centric relational database (Experimental Peptide Identification Repository, or EPIR) was developed for the storage, validation, and mining of LC MS/MS data. EPIR is a data storage area for all precursor ions to which peptides have been assigned by a given search engine.
At the same time, EPIR is cumulative, meaning that any number of datasets can be parsed into EPIR at any given time, and subsequently validated/mined as a single combined dataset. A set of software modules has been developed to automatically validate and mine datasets stored in EPIR. For instance, one module collapses proteins into groups on the basis of shared peptides. Protein evidence is thus presented in concise protein groups rather than as a ranked list of proteins, and this significantly reduces the complexity of the result summary. All proteins with conclusive, unambiguous MS/MS evidence are automatically highlighted within the group. Using a validation module, peptide assignments returned by the search engine (e.g. Mascot) are automatically validated or reassigned within EPIR on the basis of different empirical parameters, including the presence of consecutive y/b-ions, the relative intensity of proline fragment ions (14), and the NSP. This functionality greatly enhances the ability to validate peptide assignments from large datasets in an automatic fashion. A generic quantitative module compatible with non-coeluting labels has also been developed to extract quantitative information from any type of differential experiment, regardless of the labeling method used (chemical or metabolic). Statistical modules were developed to extract information related to the quality of the datasets stored in EPIR. A key feature of the system is that no evidence is lost during data validation and mining, because the core data (a list of precursor ions with all potential peptide identifications and protein associations) remains unaffected at all times. This is because the data validation and mining process simply provides a means of filtering and organizing the core data, with the aim of addressing specific biological or analytical questions. 
In the present study, the utility of EPIR and associated modules is demonstrated on LC MS/MS datasets generated on a Q-TOF mass spectrometer.
EXPERIMENTAL PROCEDURES |
---|
Cell Culture
The human breast carcinoma MCF-7 cell line was cultured in Dulbecco's modified Eagle's medium supplemented with 10% FCS, 1% penicillin/streptomycin, 0.01 mg/ml insulin, 1.5 g/liter sodium bicarbonate, and nonessential amino acids. The cells were maintained at 37 °C in a humidified atmosphere of 95% air and 5% CO2. For isotopic labeling, the cells were grown for at least six cell divisions in medium deficient in L-leucine supplemented with 10% double dialyzed FCS (Hyclone, Logan, UT) and 52 mg/ml normal L-leucine (LeuD0) or [5,5,5-D3]-L-leucine (LeuD3) from Sigma-Aldrich (St. Louis, MO).
Preparation of Enriched Plasma Membranes by Density Gradient Centrifugation
A total of 5 x 10⁸ cells were homogenized in 10 ml of GB buffer containing 0.25 M sucrose, 10 mM HEPES·NaOH, 2 mM CaCl2, 2 mM MgCl2, 1 mM AEBSF hydrochloride, 1 mM EDTA, 20 µM leupeptin hemisulfate, 150 µM aprotinin, pH 7.4 (buffer A) using a motor-driven Potter homogenizer (B. Braun Biotech, Allentown, PA). The homogenate was centrifuged at 1,000 x g for 10 min, and the supernatant collected. Homogenization and centrifugation were repeated. The post-nuclear supernatant was centrifuged at 50,000 x g for 30 min. The resultant pellet containing crude membranes (P2) was resuspended in 4 ml of GB buffer and mixed with 3.85 ml of 100% Percoll (Amersham Biosciences) and 0.55 ml 2 M sucrose in an 11.5-ml crimp tube (tube PA 11.5 ml; Sorvall, Asheville, NC). The tube was filled with GB buffer, capped, and centrifuged at 50,000 r.p.m. in a fixed-angle rotor T 890 (Sorvall) at 4 °C for 15 min. The gradient was fractionated from the top by the displacement method. In order to select fractions containing enriched plasma membranes, individual fractions were assayed for γ-glutamyl transpeptidase, cytochrome c oxidase, and NADH-cytochrome c reductase as described previously (15). Total protein was determined fluorometrically on solubilized and denatured proteins by measuring the fluorescence of tryptophan (excitation at 295 nm, emission at 360 nm) using tryptophanamide as a standard.
Protein Reduction, Alkylation, and Digestion
Percoll was removed by centrifugation of the fractions in 1-ml 1PC tubes at 900,000 x g in Sorvall RC M150 GX using the S150AT rotor at 4 °C for 20 min. The isolated membrane fractions were washed and the proteins reduced with DTT on membrane as described previously (15). Finally, the membranes were resuspended in 200 µl of 4 M urea in 0.1 M Tris·HCl, pH 8.0. Next, 20 µl of 1 M iodoacetamide was added and the mixture incubated at room temperature for 2 h. The membranes were collected by centrifugation at 900,000 x g at 4 °C for 20 min, and the pellet was resuspended in 200 µl of 4 M urea in 0.1 M Tris·HCl, pH 8.0. Five micrograms of endoproteinase Lys-C were added, and the membranes were incubated overnight at room temperature. The released peptides were separated from the membranes by centrifugation. This procedure yielded 88 ± 10 µg peptide per 5 x 10⁸ cells.
Reverse-phase Chromatography and Tryptic Digestion of the Lys-C Peptides
The Lys-C peptides were separated over a Dionex Acclaim 300 C18 3-µm column (i.d. 2.1 mm x 150 mm). The peptides were eluted with an ACN gradient in water containing 0.1% TFA. The flow rate was 100 µl/min. Next, 200-µl fractions were collected and lyophilized. The fractionated peptides were dissolved in 20 µl of 100 mM NH4HCO3 and incubated overnight at 37 °C with 0.5 µg trypsin.
LC MS/MS
All LC MS/MS experiments were performed on a QStar Pulsar XL (MDS Sciex, Toronto, Canada) connected to an LC Packings Ultimate system equipped with a Famos autosampler and Switchos unit (LC Packings, Sunnyvale, CA). All hardware systems were controlled from the Analyst QS software (MDS Sciex). Samples were loaded onto the precolumn (4 cm x 150 µm, Zorbax SB-C18 5-µm beads) using a flow rate of 5 µl/min solvent A (0.005% heptafluorobutyric acid and 0.4% acetic acid in HPLC-grade water) using the Switchos unit. The peptides were subsequently eluted at 300 nl/min from the precolumn over the analytical column (4 cm x 75 µm, Zorbax SB-C18 3.5-µm beads) using an 80-min gradient from 10–35% solvent B (90% ACN, 0.005% heptafluorobutyric acid and 0.4% acetic acid in HPLC-grade water) delivered by the Ultimate CAP pump. The total duration of the LC run was 120 min, including sample loading and column equilibration. The QStar XL was operated in information-dependent acquisition (IDA) mode. In MS mode, ions were screened from m/z 350–1,000, and MS/MS data were acquired from m/z 80–1,000 (QStar pulsing mode on). In standard acquisition mode, each acquisition cycle comprised a 1-s MS and a 2-s MS/MS. MS to MS/MS switch threshold was set to 40 cps. Five exclusion list runs were performed, where all precursor ions subjected to MS/MS in the previous run(s) were excluded for 9 min using a 3-amu window. The broad exclusion window (±4.5 min) was necessary as the retention time for individual precursor ions drifted up to 4 min during the 5 days required to exhaustively analyze a single biological sample. The exclusion list acquisition methods were generated manually by importing the precursor ion list (text file) into the Analyst method editor. In the first exclusion list analysis, the MS and MS/MS acquisition times and the MS to MS/MS switch threshold were unaltered (1 s, 2 s, and 40 cps, respectively). 
In the latter exclusion list analyses (runs 3–5), the MS/MS acquisition time was increased to 3 s and the MS to MS/MS switch threshold was lowered to 25 cps.
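The exclusion rule described above (a 9-min, i.e. ±4.5 min, retention-time window and a 3-amu m/z window around each previously fragmented precursor) can be sketched as follows. This is a minimal illustration, not the Analyst implementation: the names `ExcludedPrecursor` and `is_excluded` are hypothetical, and centering the 3-amu window on the stored m/z (±1.5) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExcludedPrecursor:
    """A precursor already subjected to MS/MS in an earlier run (hypothetical record)."""
    mz: float       # precursor m/z
    rt_min: float   # retention time in minutes when MS/MS was triggered


def is_excluded(mz, rt_min, exclusion_list, mz_window=3.0, rt_window_min=9.0):
    """Return True if (mz, rt_min) falls inside any exclusion window.

    Assumption: both windows are centered on the stored values,
    i.e. +/- 1.5 amu and +/- 4.5 min with the defaults.
    """
    half_mz = mz_window / 2.0
    half_rt = rt_window_min / 2.0
    return any(
        abs(mz - e.mz) <= half_mz and abs(rt_min - e.rt_min) <= half_rt
        for e in exclusion_list
    )


# Illustrative values only.
previous = [ExcludedPrecursor(mz=650.3, rt_min=42.0)]
print(is_excluded(651.0, 45.0, previous))  # inside both windows
print(is_excluded(651.0, 47.0, previous))  # retention time outside +/- 4.5 min
```

A wide retention-time window trades some lost precursors for robustness against the retention-time drift mentioned above.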
Database Searching
The IDA processor (Applied Biosystems, Foster City, CA) was used to generate Mascot msm files with peak lists from the Analyst wiff files. The IDA settings were as follows: default charge state was set to 2+, 3+, and 4+; MS centroid parameters were 50% height percentage and 0.05 amu merge distance; all MS/MS data were centroided, with a 50% height percentage and a merge distance of 0.05 amu. The threshold peak intensity was set to 2 cps; MS/MS averaging parameters were set to reject spectra with fewer than 5 peaks or precursor ions with less than 5 or more than 10,000 cps; the precursor mass tolerance for grouping was set to 1; and the maximum and minimum number of cycles between groups was set to 10 and 1, respectively. MS/MS data from the standard protein sample was searched as a single merged msm file against all entries in the public NCBInr database (downloaded November 23, 2003; 1,543,949 entries in total) from the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov) using the Mascot search engine (version 1.9.05; Matrix Science, London, United Kingdom). MCF-7 data were searched against human database entries only. Alkylation of cysteine residues was set as a fixed modification, and oxidation of methionine was set as a variable modification for all Mascot searches. One missed trypsin cleavage site was allowed, and the peptide MS and MS/MS tolerance was set to 0.3 and 0.13 Da, respectively. All common porcine trypsin autoproteolysis products were excluded after the data was entered into EPIR.
EPIR
EPIR was implemented as a standard SQL database using MySQL version 3.2.4 running on a PC equipped with RedHat Linux 9.0. EPIR contains structured information concerning samples (name, type, LIMS link); acquisitions (filename, time); raw data preprocessing (filtering parameters, processing application); database identification parameters (MS and MS/MS identification tolerance, database name, database version, species restrictions); spectrum identification results (peptide sequence, score, delta mass, expected ions, calculated ions, retention time); and relationship to the associated proteins.
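To illustrate the peptide-centric layout described above, the following sketch builds a tiny relational store and applies a score filter as a read-only query, so the core data stays untouched. It is not the EPIR schema: the table and column names are hypothetical, SQLite (via Python's standard library) stands in for MySQL, and all rows are invented placeholder values.

```python
import sqlite3

# Hypothetical, heavily reduced schema: precursor ions, every suggested
# peptide hit per precursor, and the proteins each hit maps to.
SCHEMA = """
CREATE TABLE precursor (
    id INTEGER PRIMARY KEY, mz REAL, charge INTEGER, rt_min REAL);
CREATE TABLE peptide_hit (
    id INTEGER PRIMARY KEY,
    precursor_id INTEGER REFERENCES precursor(id),
    sequence TEXT, score REAL, delta_mass REAL);
CREATE TABLE protein_link (
    peptide_hit_id INTEGER REFERENCES peptide_hit(id),
    accession TEXT);
"""


def build_demo_db():
    """Create an in-memory database filled with invented example rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO precursor VALUES (?, ?, ?, ?)",
                    [(1, 650.3, 2, 42.0), (2, 512.8, 2, 55.5)])
    con.executemany("INSERT INTO peptide_hit VALUES (?, ?, ?, ?, ?)",
                    [(1, 1, "PEPTIDEA", 52.0, 0.02),
                     (2, 1, "PEPTIDEB", 47.0, 0.05),
                     (3, 2, "PEPTIDEC", 18.0, 0.11)])
    con.executemany("INSERT INTO protein_link VALUES (?, ?)",
                    [(1, "PROT1"), (2, "PROT2"), (3, "PROT1")])
    con.commit()
    return con


def hits_above(con, min_score):
    """Filter peptide hits by score with a SELECT; nothing is deleted or updated."""
    rows = con.execute(
        "SELECT sequence FROM peptide_hit WHERE score >= ? ORDER BY id",
        (min_score,))
    return [r[0] for r in rows]


con = build_demo_db()
print(hits_above(con, 40.0))
print(con.execute("SELECT COUNT(*) FROM peptide_hit").fetchone()[0])
```

Because the filter is a query rather than an update, a different threshold can be applied later to the same stored hits.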
Parsing Identification Results in EPIR
A software module has been developed to extract peptide identification information from the original Mascot result file into EPIR. The following information is extracted: search parameters; query/precursor; peptide score; suggested modifications; protein assignments; retention time; and fragment ion matches. Parsers for other result formats, e.g. Sequest (ThermoElectron, Waltham, MA) and PepSea (MDS Inc., Odense, Denmark), are under development.
Quantitation
The precursor peak intensity (PPI, i.e., the maximum ion count observed for a precursor ion) is extracted automatically for all suggested peptide matches. The PPI is obtained directly from the raw MS acquisition file. The window in which the PPI is extracted is 60 s pre- and 90 s post-MS/MS acquisition time for both the identified peptides and nonidentified partner ions. For nonidentified peptide partners, the PPI is extracted using the theoretical precursor ion mass predicted from the identified peptide partner. The PPI elution profile is Gaussian fitted using nonlinear least squares. If the profile exceeds a 3-min elution time, and the PPI value is not observed within the analysis window, then the profile is excluded. A 3-min elution time was chosen as the maximum because most precursor ions eluted in less than 2 min. Furthermore, quantitative data was excluded if the difference in PPI times for non-coeluting peptides exceeded 30 s. This value was chosen because previous experiments have shown that more than 95% of light and heavy peptides have differences in PPI times that were less than 30 s (data not shown).
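The windowed PPI extraction and the 30-s pairing filter described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and the `(time_s, ion_count)` trace format are hypothetical, and the Gaussian fit and the 3-min elution-profile check are deliberately omitted.

```python
def extract_ppi(trace, msms_time_s, pre_s=60.0, post_s=90.0):
    """Return (ppi, ppi_time_s) for one precursor, or None if no data in the window.

    trace: list of (time_s, ion_count) points for the precursor m/z
           (hypothetical input format).
    The window spans 60 s before to 90 s after the MS/MS acquisition time.
    """
    window = [(t, c) for t, c in trace
              if msms_time_s - pre_s <= t <= msms_time_s + post_s]
    if not window:
        return None
    t_max, c_max = max(window, key=lambda point: point[1])
    return c_max, t_max


def pair_is_valid(light_ppi_time_s, heavy_ppi_time_s, max_shift_s=30.0):
    """Keep a light/heavy pair only if the PPI times differ by at most 30 s."""
    return abs(light_ppi_time_s - heavy_ppi_time_s) <= max_shift_s


# Invented ion trace; the point at 400 s lies outside the extraction window.
trace = [(100, 5), (150, 40), (200, 90), (260, 30), (400, 500)]
print(extract_ppi(trace, msms_time_s=190))
print(pair_is_valid(200, 225), pair_is_valid(200, 240))
```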
Automatic Peptide Assignment
Besides the standard search engine results used for peptide assignment (score, expected versus calculated fragment ions, delta mass), additional empirical information is computed by the EPIR peptide validation module to assist in the assignment. Currently these include: a) the NSP for all potential peptide hits; b) the presence of consecutive y/b fragment ions; and c) a proline score for potential peptide identifications containing proline residues.
![]() |
where i is the consecutive matched fragment ion and n is the total number of matched fragment ions of the partial tag. The total fragment ion score, s, is computed:
![]() |
where i is the tag index, n is the total number of tags, and pi is the number of nonmatching fragment ions between the partial tags.
![]() |
where pr is the intensity rank of the proline-containing y-ion and n is the total number of fragment ions in the MS/MS spectrum. A proline score of 100 therefore indicates that the proline fragment ion is the most intense peak, i.e. pr is 1.
The automatic assignment of peptides to precursor ions is initially based on the NSP information. If any of the suggested peptides have an NSP > 1, these peptides will be the only entries considered. In addition to the NSP information, a suggested peptide is removed from the potential list if the elution profile is invalid (see "Quantitation"); if the suggested number of labels does not match the number of labeling sites (isotopically labeled samples); or if the proline score is below 80 (proline-containing peptides). After the list of potential peptides has been generated, the peptide with the highest average group score (average Mascot score for all peptides in the group) and structural score (average y-ion or b-ion scores for all peptides in the group) is selected as the correct assignment. In cases where the same peptide has been identified multiple times, the identification with the highest score will be used. In situations where two or more peptides have the same group and structural score, no peptide is assigned and the spectrum is flagged for manual inspection.
If all suggested peptides have an NSP of 1, a list of possible peptides is generated based on a valid elution profile (see "Quantitation"); correct number of labels (isotopically labeled samples); a proline score greater than 80 (proline-containing peptides); and a minimum of five consecutive fragment ions. The peptide with the highest structural and identification score will be selected. When two or more peptides have the same group and structural score, the spectrum is flagged for manual inspection.
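The two-branch decision procedure described above can be sketched as follows. This is an interpretation rather than the EPIR validation module: the `Candidate` record and `assign` function are hypothetical, candidates are ranked by score first and structural score second (the text does not state how the two are combined), and a proline score of exactly 80 is treated as passing in both branches.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One suggested peptide for a precursor ion (hypothetical record)."""
    sequence: str
    nsp: int                               # number of sibling peptides
    score: float                           # group-average or identification score
    structural: float                      # y/b-ion structural score
    elution_ok: bool = True                # valid elution profile
    labels_ok: bool = True                 # label count matches labeling sites
    proline_score: Optional[float] = None  # None if the peptide has no proline
    consecutive_ions: int = 0              # longest run of consecutive fragment ions


def assign(candidates):
    """Return the assigned sequence, or None (no candidate left, or a tie
    that should be flagged for manual inspection)."""
    with_siblings = [c for c in candidates if c.nsp > 1]
    pool = with_siblings or list(candidates)
    require_tag = not with_siblings  # NSP == 1 for all: need >= 5 consecutive ions

    def passes(c):
        if not (c.elution_ok and c.labels_ok):
            return False
        if c.proline_score is not None and c.proline_score < 80:
            return False
        if require_tag and c.consecutive_ions < 5:
            return False
        return True

    ranked = sorted((c for c in pool if passes(c)),
                    key=lambda c: (c.score, c.structural), reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and (ranked[0].score, ranked[0].structural) == \
            (ranked[1].score, ranked[1].structural):
        return None  # tie: leave unassigned for manual inspection
    return ranked[0].sequence


# Invented candidates: the sibling-supported peptide wins despite a lower score.
a = Candidate("AAA", nsp=3, score=30, structural=10)
b = Candidate("BBB", nsp=1, score=50, structural=20, consecutive_ions=8)
print(assign([a, b]))
```

Ranking on NSP before score is what allows a lower-scoring but better-supported peptide to displace the top search engine hit.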
Protein Grouping
Proteins with shared peptides are collapsed into a group and reported as a single identification, with the highest-scoring protein entry as the anchor. All information on the proteins in a group is stored in a collapsed format. Consequently no protein evidence is removed or lost. Protein groups with unambiguous protein identifications are highlighted. ClustalW was used for aligning the entries within a protein group (16). The following ClustalW settings were used: -quicktree, -score = absolute, -output = gde, -outorder = input, -case = upper, -GAPOPEN = 200, and -GAPEXT = 100.
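Collapsing proteins that share peptides into groups can be sketched with a union-find structure. This is a minimal illustration, not the EPIR module: the function names are hypothetical, grouping is assumed to be transitive (A shares a peptide with B, and B with C, so all three form one group), and "unambiguous" is interpreted here as having at least one peptide that maps to no other protein.

```python
def group_proteins(protein_peptides):
    """Group proteins connected through shared peptides.

    protein_peptides: dict mapping accession -> set of peptide sequences.
    Returns a sorted list of sorted accession lists, one per group.
    """
    parent = {p: p for p in protein_peptides}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # peptide -> first protein seen carrying it
    for prot, peptides in protein_peptides.items():
        for pep in peptides:
            if pep in first_owner:
                parent[find(prot)] = find(first_owner[pep])
            else:
                first_owner[pep] = prot

    groups = {}
    for prot in protein_peptides:
        groups.setdefault(find(prot), []).append(prot)
    return sorted(sorted(members) for members in groups.values())


def uniquely_supported(protein_peptides):
    """Proteins with at least one peptide found in no other protein."""
    counts = {}
    for peptides in protein_peptides.values():
        for pep in peptides:
            counts[pep] = counts.get(pep, 0) + 1
    return sorted(p for p, peptides in protein_peptides.items()
                  if any(counts[pep] == 1 for pep in peptides))


# Invented accessions and peptide labels.
evidence = {"P1": {"a", "b"}, "P2": {"b", "c"}, "P3": {"c"}, "P4": {"d"}}
print(group_proteins(evidence))
print(uniquely_supported(evidence))
```

In this toy input, P2 and P3 remain in the group but are not highlighted, since every peptide supporting them is shared.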
Result Browser
The data entered into EPIR was managed, viewed, merged, and analyzed using a web application module. The module was developed using J2EE and runs under a JBoss web application server version 3.0.4.
RESULTS |
---|
Standard Protein Analysis
The mixture of six model proteins was analyzed by one standard and two exclusion list analyses. The individual msm files were merged and searched as a single file (supplemental material) using the Mascot search engine. A total of 1,883 MS/MS spectra were generated, and Mascot returned 88 confident protein identifications (defined by Mascot as entries with a total score ≥ 45), including the six proteins in the sample and Lys-C endopeptidase. Among the 88 returned identifications, 46 contained bold peptide assignments (data not shown). In other words, Mascot suggested that the sample contained 46 distinct protein hits. Most of the 88 proteins, however, were redundant identifications, reflecting either the same protein from different species or sequence redundancy within the database itself (e.g. entries with partial sequences) (data not shown). The results were parsed into EPIR, and proteins were grouped on the basis of shared peptide evidence. A total of 12 protein groups (NSP/group ≥ 2) were generated by EPIR (Fig. 1A), and seven of these groups were derived from the six model proteins (ferritin light and heavy chain each contributed one group). Fig. 1B shows the EPIR protein alignment view of the ovalbumin group, which contained eight members. Lys-C endopeptidase was also identified as a group, thereby explaining eight of the observed groups. The remaining four groups were hemoglobin, tropomyosin, lactate dehydrogenase, and phosphofructokinase, which had 5, 5, 13, and 2 unique peptide identifications, respectively. The MS/MS evidence for these proteins was conclusive, and it was therefore concluded that the proteins were contaminants of the original protein mixture. A total of 199 precursor ions, corresponding to 133 unique peptides, were assigned to the 12 protein groups. Among the 199 peptide assignments there were 14 examples where incorrect peptides had an identical or a higher Mascot score than the correct peptide. In all cases, EPIR assigned the correct peptide on the basis of the NSP, y- and b-ion scores, as well as the proline score (Fig. 2).
DISCUSSION |
---|
Another issue is the accuracy of assigning peptides to precursor ions. As stand-alone tools, probability-based algorithms have limited applicability in high-throughput proteomics. In part, this is due to the overlap in the distribution of true versus false peptide assignments as illustrated theoretically in Fig. 8A. In other words, the price to pay for a higher absolute number of true assignments is the presence of more false assignments. In high-throughput research, the number of false peptide assignments may severely impact the time and resources required for subsequent data validation. Ideally, the overlap between the two distributions should be minimized (Fig. 8B). Empirical information derived from the MS/MS data and database search results can be exploited to reduce this distribution overlap. Consequently this was one of the focus areas when developing the EPIR platform. There is also the issue of reporting protein level evidence derived from the peptide assignments. The presence of degenerate peptides means that the peptide evidence often points to a group of proteins rather than a single protein (11). Furthermore, protein evidence is nonexclusive, i.e. the unambiguous presence of one protein does not mean that other proteins in the same group are not present in the sample. Therefore protein-level evidence should ideally be presented as protein groups based on shared peptide evidence, rather than a ranked list of proteins. The use of such protein groups also effectively addresses the problem of database redundancy, because redundant entries are always collapsed into a single group.
The challenges and issues mentioned above inspired the development of the EPIR platform, which is shown schematically in Fig. 9. Several key features were chosen as the foundation on which to build the platform: 1) it should be a concise repository for LC MS/MS-derived peptide evidence; 2) peptide assignment should be based on additional empirical information, including structural information derived from the MS/MS data; 3) protein-level evidence should be presented in groups, based on shared peptide evidence; 4) quantitation should be generic, i.e. compatible with any labeling technology, and based on the peptide identifications; and finally 5) the platform should be cumulative in nature, i.e. data validation and mining should be possible at any time on any number of combined datasets. A key concept of EPIR is that the core data, i.e. a list of precursor ions with all suggested peptide identifications and protein associations, remains unaffected during data validation and mining. The validation and mining process itself is simply a means of filtering and organizing the core data, with the aim of answering specific biological and/or analytical questions. Therefore, evidence is always retained in the EPIR data repository, even though the investigator can apply a broad range of filters to the same datasets. All data manipulations in EPIR are performed in real time, and consequently there is no impact on the size of the database, i.e. the data processing does not generate new information that requires additional storage or hardware.
Once the peptides have been assigned, the next step is to associate the information at the protein level. The EPIR protein-grouping module collapses all proteins with shared or degenerate peptides into a single group. Such a group represents the most concise summary of all the protein-level evidence derived from the LC MS/MS data, and the bias toward a single protein entry observed in a ranked protein list is eliminated. This is critical as multiple protein variants without unambiguous MS/MS evidence may be present in a sample, and because MS/MS evidence alone is nonexclusive. The utility of protein grouping was demonstrated on the standard protein sample (Fig. 1). In this example, EPIR returned a concise protein summary in the form of 12 protein groups (NSP ≥ 2/group), eight of which were expected, and all with clear MS/MS evidence. The standard proteins used in this experiment ranged from small (13.7 kDa, RNase A) to large (669 kDa, thyroglobulin), and included a protein (ovalbumin) that is known to be resistant to trypsin digestion (19). Despite the high degree of heterogeneity among the studied proteins, all were represented in the protein group summary returned by EPIR. Fig. 4 shows a screenshot of a protein group list from the MCF-7 plasma membrane-enriched sample. The content of the EPIR protein group view can be defined by the user, and in the current example the protein group ID is displayed together with an anchor protein, species information, number of proteins in the group, number of peptide sequences in the group, and unambiguously identified proteins within the group. From the group ID (e.g. 11, in the given example) a protein group window can be accessed. This window contains database identifiers and names for all the members of the group, the peptides assigned to the group, and a sequence alignment of the group members. When selecting a protein in the group window (e.g. 2118484), all associated peptides are automatically highlighted. 
Each assigned peptide (e.g. NLSDVATK) can be expanded to show a list of potential peptides for the precursor ion. For each potential peptide, the MS/MS spectrum with assigned fragment ions can be accessed (data not shown). Peptides can be manually assigned to precursor ions by selecting the appropriate radio button, thereby allowing users to overrule the automatic peptide assignments.
Once the peptides have been assigned and proteins grouped on the basis of peptide degeneracy, a variety of different EPIR software modules can be utilized to mine the data. The dataset derived from 60 LC MS/MS analyses of a MCF-7 plasma membrane sample (10 off-line fractions analyzed 6 times by LC MS/MS) was employed to demonstrate the functionality of EPIR and associated modules. This dataset was also used to demonstrate the advantage of working with combined datasets, a key strength of the integrated platform. Fig. 3 summarizes the results of this study. It can be seen that when working with merged datasets in EPIR more protein groups were observed and the peptide coverage improved. Note that the increase in the NSP/protein group not only improves protein level evidence (i.e. more peptides are seen for a protein group); it also increases the chance of correctly assigning peptides to precursor ions, because this is strongly influenced by the NSP. In this example, 60 datasets from a single biological sample were merged and mined; however, datasets can also be analyzed across different biological samples or subfractions thereof (data not shown). EPIR also has a comparative functionality, which allows the investigator to determine differences or similarities between two or more datasets. One example could be to filter for proteins present in dataset 1, but not dataset 2, and vice versa, or perhaps the investigator wants to filter for protein groups that are present in both datasets. As datasets from a specific biological system accumulate in EPIR, so does the body of peptide and protein evidence. Accumulated evidence can be exploited for various purposes. It would be possible, for instance, to build a knowledge base of the most frequently observed peptides (and therefore more reliable identifiers) for any given protein. Such information could be used to estimate the accuracy of protein assignments in independent LC MS/MS acquisitions. 
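The comparative filtering described above (groups present in dataset 1 but not dataset 2, and vice versa, or present in both) reduces to set operations once each dataset is summarized as a collection of protein group identifiers. The sketch below is illustrative only; `compare_datasets` and the group IDs are hypothetical.

```python
def compare_datasets(groups_1, groups_2):
    """Split protein group IDs into those unique to each dataset and those shared."""
    a, b = set(groups_1), set(groups_2)
    return {
        "only_1": sorted(a - b),  # present in dataset 1 but not dataset 2
        "only_2": sorted(b - a),  # present in dataset 2 but not dataset 1
        "both": sorted(a & b),    # present in both datasets
    }


# Invented group IDs.
result = compare_datasets([11, 12, 15], [12, 15, 20])
print(result)
```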
From EPIR all relevant precursor ion information can be extracted. An investigator could, e.g. generate exclusion lists to minimize MS/MS acquisition time on previously identified precursor ions (trypsin, keratin, high-abundance housekeeping proteins, etc.). Alternatively, an inclusion list of peptides from known proteins could be generated to perform targeted LC MS/MS. The presence of specific proteins in a biological sample could then be rapidly confirmed or rejected. As exemplified by the above-mentioned cases, the EPIR platform allows the application of a broad range of validation and statistical tools across any number of dependent or independent experiments. This flexibility provides a solid foundation for maximizing the quantity and quality of information that can be derived from LC MS/MS experiments.
Analysis of Differential Samples
Differential analysis of isotopically labeled samples plays an important role in quantitative proteomics, particularly in target discovery using LC MS/MS-based approaches (1). There are different strategies for labeling samples prior to mass spectrometric analysis, including chemical labeling (ICAT (20) and HysTag (15)) or metabolic labeling of living cells (17). The various labeling technologies pose a challenge in terms of quantitative data extraction. For instance, the mass difference between light and heavy labels will depend on the approach (8 Da for ICAT, 4 Da for HysTag, 3 Da for triply deuteriated leucine, etc), as well as the number of labels on individual peptides. It is consequently difficult to develop generic software tools that assign light and heavy partners on the basis of m/z information alone. Furthermore, some labeling approaches produce light and heavy peptides that do not coelute, and therefore quantitation of the light and heavy labels must be performed in a time-independent manner. To address these issues, a quantitation strategy was chosen that is based on peptide identifications rather than m/z information alone. For each peptide assignment stored in EPIR, the quantitation module extracts a PPI. That is, it determines the maximum ion count for a precursor ion and stores the value in the database. The module pairs light and heavy PPIs on the basis of peptide identifications, and in cases where only one of a pair has been identified, the module extracts the PPI for the nonidentified precursor ion by predicting the number of labels and elution time from the identified partner. The quantitation module extracts PPI information in a time-independent manner for the light and heavy peptides, and consequently it is a generic approach that is applicable to both coeluting and non-coeluting labels.
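Predicting the m/z of a nonidentified partner from the identified peptide, as described above, amounts to shifting the precursor m/z by the number of labels times the per-label mass difference, divided by the charge. The sketch below is a hedged illustration: the function names are hypothetical, the nominal per-label mass differences quoted in the text (e.g. 3 Da for triply deuterated leucine) are used rather than exact isotopic masses, and every leucine is assumed to carry one label.

```python
def count_label_sites(sequence, residue="L"):
    """Count labeling sites, assuming one label per occurrence of the residue."""
    return sequence.count(residue)


def partner_mz(mz, charge, n_labels, label_delta_da, identified="light"):
    """Predict the m/z of the unidentified partner of a light/heavy pair.

    mz, charge     : m/z and charge state of the identified precursor
    n_labels       : number of labels on the peptide
    label_delta_da : nominal heavy-minus-light mass difference per label (Da)
    identified     : "light" if the identified peptide is the light form,
                     anything else is treated as the heavy form
    """
    shift = n_labels * label_delta_da / charge
    return mz + shift if identified == "light" else mz - shift


# A doubly charged light peptide with two leucines labeled at 3 Da each:
# the heavy partner is expected 2 * 3 / 2 = 3 m/z units higher.
print(partner_mz(500.0, charge=2, n_labels=2, label_delta_da=3.0))
print(count_label_sites("NLSDVATK"))
```

Because the shift depends on both the label count and the charge state, the same labeling chemistry produces different m/z spacings for different peptides, which is why pairing on m/z alone is unreliable.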
The functionality of the quantitation module was demonstrated on MCF-7 cells labeled with normal or deuteriated leucine. This type of labeling represents the most challenging approach from a quantitative point of view. First, the number of leucine residues can vary from zero to several within a peptide. In order to obtain the correct quantitative information, it is essential that the peptides are correctly identified, because the pairing of light and heavy partners is directly based on the information from the peptide identifications. Furthermore, in this example, the light and heavy peptides do not coelute. A total of 1,301 peptide pairs were generated after the filtering process (see "Experimental Procedures"), and the results are summarized in the three EPIR scatter plots shown in Fig. 6. Besides being generic, the quantitative EPIR module rapidly extracts PPI data (5 min for a 2-h LC MS/MS analysis). The speed is due to the fact that the data is extracted directly from the raw MS data. Previous attempts at developing generic quantitative software tools focused on quantitation of the area below the extracted ion current and the use of MS data alone. The data processing, however, was substantially slower, and the accuracy of the results was inadequate, because many false light/heavy pairs were generated, particularly when analyzing complex samples with many precursor ions per MS spectrum (data not shown). For these reasons, the PPI approach was chosen. Besides being generic, the strength of the approach lies in the fact that it is simple and rapid. The module filters and removes peptide pairs with unreliable quantitative data, such as overlapping isotopic clusters or weak signal-to-noise ion intensities. Fig. 7A shows an example of overlapping isotopic clusters, which is a common observation in highly complex samples, such as the total MCF-7 lysate used in the current study. 
We believe that many, although not all, of the overlapping isotopic clusters are detected during the Gaussian-fitting procedure. Another problem lies in the limited dynamic range of the detector of the mass spectrometer. One example is shown in Fig. 7B, where the light peptide saturates the detector, in contrast to the heavy peptide, which is present at lower concentrations. The issue of dynamic range of the detector is particularly relevant for differential pairs, because these are present at different concentrations. Solid statistics are consequently required with respect to reproducibility of quantitative data, and in general such data should be considered approximate rather than accurate. Differential information, however, is a useful parameter to filter data from large datasets generated by LC MS/MS. When presented as a scatter plot, a visual "normalization" of the total dataset is possible, which assists in pinpointing truly differential peptide pairs. In the current example, differential data was illustrated at the peptide level; however, it is also possible to view differential results at the protein group level or at the level of individual proteins (data not shown).
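As a rough numerical counterpart to the visual "normalization" of the scatter plot, the following Python sketch centers the log2 heavy/light PPI ratios on their median and flags pairs that deviate strongly from it. This is an assumption made for illustration only; the text does not state that EPIR normalizes ratios this way, and the function names and the twofold threshold are hypothetical.

```python
import math
from statistics import median


def centered_log2_ratios(pairs):
    """log2(heavy/light) for each (light_ppi, heavy_ppi) pair, median-centered.

    Both intensities must be positive; pairs with unreliable PPIs are
    assumed to have been filtered out beforehand.
    """
    logs = [math.log2(heavy / light) for light, heavy in pairs]
    center = median(logs)
    return [value - center for value in logs]


def flag_differential(pairs, min_abs_log2=1.0):
    """True for pairs whose centered ratio differs at least 2-fold by default."""
    return [abs(value) >= min_abs_log2 for value in centered_log2_ratios(pairs)]


# Hypothetical PPI pairs (light, heavy).
pairs = [(100, 100), (100, 200), (100, 100), (100, 25), (100, 100)]
flags = flag_differential(pairs)
```

Because detector saturation and overlapping isotopic clusters can distort individual PPIs, any such flag would only be a starting point for closer inspection, in line with the caution above that the quantitative data should be considered approximate.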
Conclusion and Future Perspectives |
---|
Bioinformatic modules adding biological context are currently under development and will be implemented in the near future. These will allow result filtering from single or combined datasets using a broad range of criteria, including genetic annotation (gene, isoform, allele), biochemical pathways, subcellular localization, disease association, etc. With these modules implemented, the EPIR platform will assist in addressing the LC MS/MS data-mining bottleneck in a concise and effective manner.
FOOTNOTES |
---|
Published, MCP Papers in Press, July 29, 2004, DOI 10.1074/mcp.T400004-MCP200
1 The abbreviations used are: NSP, number of sibling peptides; EPIR, Experimental Peptide Identification Repository; PPI, precursor peak intensity; IDA, information-dependent acquisition.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this manuscript (available at http://www.mcponline.org) contains supplemental material.
D. B. K. and J. C. B. contributed equally to the manuscript.
To whom correspondence should be addressed: MDS Inc., Denmark, Staermosegaardsvej 6, DK-5230 Odense M, Denmark. Tel.: 45-33262013; Fax: 45-65572001; E-mail: dkristensen{at}mdsdenmark.com
REFERENCES |
---|