From the Institute for Systems Biology, 1441 North 34th Street, Seattle, WA 98103
ABSTRACT
This method has proven quite successful for the cataloguing of large numbers of proteins in complex samples. However, the approach is highly repetitive, labor intensive, and difficult to automate. In addition, it necessarily selects only for proteins that can be resolved by 2DE, missing many larger and smaller proteins, in addition to proteins with lower solubility, such as membrane proteins. Also, due to sample loading limitations for 2DE, it generally selects for only the most abundant proteins in a biological sample (4, 7), thus missing many lower abundance, regulatory proteins, rarely detected when complex mixtures are analyzed. 2DE also typically resolves different posttranslationally modified forms of the same proteins. Given the high degree and variety of post-translational modifications occurring on the proteins of eukaryotic organisms, this results in great difficulties in obtaining accurate quantitative data on the many proteins that separate into multiple spots, as well as multiple proteins that co-migrate to the same spot, during 2DE. However, because the in vivo activities of many proteins are regulated by post-translational modification, the ability to readily resolve differentially modified forms of protein allows for the use of 2DE to monitor changes in the known "active" and "inactive" forms of many proteins.
The recently developed isotope-coded affinity tag (ICAT) technology instead allows for quantitative proteomic analysis based on differential isotopic tagging of related protein mixtures (8–11) and is summarized schematically in Fig. 1. ICAT reagents consist of three functional elements: a thiol-reactive group for the selective labeling of reduced Cys residues, a biotin affinity tag to allow for selective isolation of labeled peptides, and a linker synthesized in either an isotopically normal ("light") or "heavy" form (utilizing 2H or 13C) that allows for the incorporation of the stable isotope tags. In a typical experiment, protein disulfide bridges are reduced under denaturing conditions, and the free sulfhydryl groups of the proteins from the two related samples to be compared are labeled respectively with the isotopically "light" or "heavy" forms of the reagent. The samples are then combined, proteolyzed with trypsin, and the resulting peptides can be separated by any number of optional fractionation steps, including the removal of untagged peptides (i.e. not containing a Cys residue) via avidin-affinity chromatography. Peptide/protein identifications are made by MS/MS analyses of the individual fractions, followed by protein sequence database searching of the observed MS/MS spectra. Finally, the observed ratio between the signal intensities for the unfragmented isotopically "light" and "heavy" forms of the same peptide yields the relative abundances of that peptide, and hence the protein from which it was derived, in the original samples.
These experiments thus illustrated how statistical tools of this nature will greatly facilitate the timely processing of large proteomic datasets, currently a time-consuming and frequently manual process. Also, the application of such tools for assigning measures of confidence to each peptide and protein identified should offer some form of standardization for the interpretation of, in particular, large proteomic datasets. In turn, this should enable researchers to perform any experiment, interpret their results consistently, and then compare the results to those from any other related experiment. Finally, the general application of statistical tools such as these should allow, for the first time, the transparent comparison of related datasets from multiple laboratories.
EXPERIMENTAL PROCEDURES
Protein Labeling and Digestion
ICAT labeling and analysis were performed essentially according to the manufacturer's protocol (ICAT Kit for Protein Labeling; Applied Biosystems, Foster City, CA), with optimized conditions known to result in quantitative labeling (16). In short, following reduction of cysteines and labeling of control (d0-ICAT) and stimulated (d8-ICAT) samples, the samples were pooled and then diluted to 1 M urea, 0.01% SDS for proteolysis, using an excess of trypsin (Promega, Madison, WI).
Peptide Separation and Purification
The peptides were separated by cation exchange chromatography using a 4.6 x 200 mm Polysulfoethyl A column (5 µm particles, 300 Å pore size; Poly LC, Columbia, MD) at a flow rate of 800 µl/min. Peptides were eluted by a gradient of 0–25% B over 30 min, followed by 25–100% B over 20 min (buffer A: 5 mM K2HPO4, 25% CH3CN, pH 3.0; buffer B: 5 mM K2HPO4, 25% CH3CN, 600 mM KCl, pH 3.0). The elution profile of the cation exchange chromatography (Fig. 8A) determined which fractions were further analyzed. Forty-three cation exchange fractions (fractions 10–52) were individually processed over avidin cartridges (Applied Biosystems) according to the manufacturer's protocol (ICAT Kit for Protein Labeling; Applied Biosystems), to isolate the labeled Cys-containing peptides. Both the avidin column eluate and flow-through fractions were retained. To increase the concentration of Cys-containing peptides for microcapillary-liquid chromatography MS/MS (µLC-MS/MS) analysis, avidin column eluates were pooled in pairs (except fraction 52), making a total of 22 fractions for µLC-MS/MS. Because the flow-through fractions contained higher peptide concentrations, these were analyzed individually by µLC-MS/MS. Three sets of samples were generated for subsequent µLC-MS/MS analysis: the avidin-affinity eluates (i.e. mostly Cys-containing ICAT-labeled peptides) from the two iterations of the biological experiment and the avidin-affinity flow-through samples (i.e. unlabeled peptides) from the first iteration of the biological experiment. The resultant three data subsets generated from the analysis of these samples were termed ICAT 1, ICAT 2, and Flow-through 1, respectively.
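The fraction bookkeeping above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `pool_in_pairs` is hypothetical, and it assumes that consecutive fractions were the ones paired, which the text does not state explicitly.

```python
def pool_in_pairs(fraction_numbers):
    """Group fraction numbers into consecutive pairs.

    If the count is odd, the last fraction is left on its own,
    mirroring the unpaired final fraction described in the text.
    """
    return [fraction_numbers[i:i + 2] for i in range(0, len(fraction_numbers), 2)]


# Cation exchange fractions 10 through 52 inclusive (43 fractions).
fractions = list(range(10, 53))
pools = pool_in_pairs(fractions)
```

Under that pairing assumption, the 43 eluate fractions give 21 pairs plus fraction 52 alone, which matches the 22 samples reported for µLC-MS/MS.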
Statistical Analysis of Peptide Sequence Matches Using PeptideProphetTM
SEQUESTTM output files were automatically submitted to PeptideProphetTM (19) for computation of the probability that each peptide sequence assignment is correct (pcomp). The resultant outputs from SEQUESTTM and PeptideProphetTM were displayed using INTERACT (9), a software tool that allows for web/intranet-based data display, and data filtering and sorting via a range of user-definable parameters. INTERACT was used to restrict the datasets by filtering at different pcomp cut-offs, and its sorting functions were used to determine the number of "single hit" peptides and proteins (i.e. database entries identified via only one peptide with a pcomp above the predetermined threshold) that were contained within each filtered version of the data. The in-house software tool, INTERACT differential (IADIFF) was used for side-by-side comparison of identified peptide sequences contained within multiple INTERACT files. This allowed for determination of the overlap between the three datasets for both the peptide sequence matches made and the proteins (i.e. database entries) to which they corresponded. INTERACT also generates an Excel spreadsheet version of any filtered and/or sorted dataset for distribution and publication purposes.
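The filtering and "single hit" counting described above can be sketched as follows. This is not the INTERACT implementation; the record layout (dictionaries with `peptide`, `protein`, and `pcomp` keys) and the function name are assumptions made purely for illustration.

```python
from collections import defaultdict


def filter_and_count_single_hits(assignments, threshold):
    """Keep assignments with pcomp >= threshold and list single-hit proteins.

    `assignments` is an iterable of dicts with hypothetical keys
    'peptide', 'protein' (database entry), and 'pcomp'.
    A single hit is a database entry supported by exactly one
    distinct peptide sequence after filtering.
    """
    kept = [a for a in assignments if a["pcomp"] >= threshold]
    peptides_by_protein = defaultdict(set)
    for a in kept:
        peptides_by_protein[a["protein"]].add(a["peptide"])
    single_hits = sorted(
        protein for protein, peptides in peptides_by_protein.items()
        if len(peptides) == 1
    )
    return kept, single_hits
```

Running the function at several thresholds shows how the number of single hits changes as the pcomp cut-off is raised or lowered.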
Statistical Analysis of Protein Sequence Matches Using ProteinProphetTM
The INTERACT data files for all three datasets (ICAT 1, ICAT 2, and Flow-through 1) were submitted to ProteinProphetTM. ProteinProphetTM utilizes the list of peptide sequences and their respective pcomp scores to determine a minimal list of proteins (database entries) that can explain the observed data and to compute a probability (Pcomp) that each protein was indeed present in the original sample(s) (20). The ProteinProphetTM output groups together all peptides that (potentially) match a given protein (i.e. database entry). It deals with indistinguishable database entries by grouping them as one "protein." This commonly occurs when multiple sequences (mRNAs) and fragments of the same sequence are represented as multiple database entries. Highly homologous gene families are dealt with by formation of related "protein groups," again as single output results. ProteinProphetTM then generates a computed probability (Pcomp) for each protein or protein group match. These functions are discussed in detail below under "Results and Discussion." The ProteinProphetTM output is also web-based and can be readily exported to an Excel spreadsheet for sorting, distribution, and publication purposes.
More information on PeptideProphetTM, ProteinProphetTM, and INTERACT can also be found on the Proteomics pages at www.systemsbiology.org/. These applications are available upon request and are open source.
RESULTS AND DISCUSSION
From the two iterations of the experiment described above, the following fractions were carried forward for µLC-MS/MS analysis: all pooled avidin eluate fractions (i.e. Cys-containing, ICAT-labeled peptides) from both experiments, which will be referred to as the ICAT 1 and ICAT 2 datasets, respectively; the avidin flow-through fractions (i.e. non-Cys-containing peptides) from the first (ICAT 1) experiment, which will be referred to as the Flow-through 1 dataset. All recorded MS/MS spectra were searched against a human protein sequence database using SEQUESTTM software (18). Peptide and protein identifications inferred from these search results were determined using PeptideProphetTM (19) and ProteinProphetTM (20) software tools, respectively, summarized in Fig. 2, and further described below and under "Experimental Procedures."
The Need for Statistical Data Analysis for Validation of Peptide and Protein Identifications from Large Datasets
Currently, MS/MS data are searched via a range of database search tools that generate scores relating in some way to the quality of the peptide sequence assigned to each spectrum. To date, determination of the final list of "correct" peptide identifications has typically been based on a "threshold approach," where data is filtered on the basis of these scores alone, with everything below the threshold being discarded. Protein identifications are subsequently determined from the database entries from which the peptide sequences were derived. Typically, visual inspection of spectra is performed by the user to verify spectral quality, and hence the "correctness" of peptide/protein identifications. This is particularly the case when scores are close to the preset threshold, or in cases of "single hits," whereby a protein is identified via only a single peptide sequence identification.
This process is necessarily highly variable. Furthermore, each user/laboratory has their own opinion of a suitable minimum threshold score to set. This problem is compounded by the fact that the various laboratories use both a range of database search engines, each with their own unique scoring system, and different types of mass spectrometers, each producing MS/MS spectra with their own unique characteristics. In fact, due to a range of variable factors, MS/MS spectral quality can dramatically affect scores obtained for spectra derived from the same peptide. This means that, even if using the same filtering threshold for all experiments, direct comparison of experiments, whether in the same laboratory or another, is most problematic. This difficulty is further compounded by the fact that visual inspection of data is a matter of individual opinion, and thus varies greatly from one individual to the next. Indeed, it is highly unlikely that the same, experienced, user would make precisely the same judgment calls for every spectrum viewed in a large dataset upon a second visual inspection.
Thus there is a clear and recognized need for alternative methods of data analysis to help obviate the time-consuming and vacillatory nature of visual data interpretation. These are needed to help provide the consistency required for comparison of results generated in different experiments, and by different laboratories, using different machines and different database search engines (22). An obvious approach to addressing these issues is the application of statistical methods to the interpretation of proteomic data. Such approaches would replace the threshold method of determining which proteins have been identified by instead assigning confidence levels to potential identifications. The next few sections below describe the application of two new statistical tools, designed for such a purpose, applied to peptide and protein identifications, respectively. Fig. 2 summarizes the data flow for this process. Also discussed below are some of the limitations and pitfalls inherent in the interpretation of any such large proteomic dataset.
Statistical Analysis and Validation of Peptide Identifications
Following SEQUESTTM searching of recorded MS/MS spectra, rather than interpret the data solely on the basis of filtering by database search engine output scores (threshold approach) as in the past (18, 23), SEQUESTTM output files were submitted to a recently developed statistical data modeling algorithm, PeptideProphetTM. This algorithm generates its own discriminant score for the peptide sequence assigned to each MS/MS spectrum, based on weighting of a number of parameters for the peptide, including the various SEQUESTTM scores, the mass differential between the observed and calculated mass for the sequence in question, etc. (19). PeptideProphetTM then calculates the population distribution for the discriminant scores for all peptide matches. Next it learns the underlying distributions of "positive" (i.e. correct) and "negative" (i.e. incorrect) identifications that explain this observed distribution. PeptideProphetTM then employs an expectation maximization (EM) algorithm to perform an iterative process of refining the model to better fit the observed data. A detailed account of how this process works has been published elsewhere (19). PeptideProphetTM performs this modeling process separately for +1, +2, and +3 peptide ion distributions.
Fig. 3 shows the final modeled positive and negative discriminant score distributions, generated by PeptideProphetTM, for the 18,109 SEQUESTTM output files that comprised the +2 peptide ion subset of the ICAT 1 data subset. One thing immediately apparent is that, in this case, the model learned that only a small fraction of the peptide assignments made by SEQUESTTM were, in fact, correct. The final step PeptideProphetTM performs is to use these positive and negative distributions to compute, for each of the 18,109 peptide assignments, the probability (pcomp) that it is a member of the positive identification distribution. This computed probability is on a scale of 0 to 1, where 0 is "incorrect" and 1 is "correct," with pcomp = 0.5 occurring at the point at which the two distributions intersect. These pcomp values, along with SEQUESTTM output scores, peptide sequences, database entries assigned, etc., are then exported to a software application called INTERACT. INTERACT is a web-based application that allows the user to view the data, as well as sort and/or filter it according to a range of user-definable parameters, including peptide sequence, SEQUESTTM scores, pcomp, database accession, etc. (9).
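The general idea of learning two overlapping score distributions by EM and converting them into a per-assignment probability can be illustrated with a small sketch. This is not the PeptideProphetTM model, whose distribution forms and parameters are described in (19); as a simplifying assumption, both populations are treated here as Gaussians, and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_mixture_em(scores, n_iter=50, min_sd=1e-3):
    """Fit a two-component 1-D Gaussian mixture by EM (needs >= 2 scores).

    Returns, for each score, the posterior probability of belonging to
    the higher-scoring ("positive") component.
    """
    n = len(scores)
    ordered = sorted(scores)
    half = n // 2
    # Crude initialisation: lower half "negative", upper half "positive".
    mu_neg = sum(ordered[:half]) / half
    mu_pos = sum(ordered[half:]) / (n - half)
    mean_all = sum(scores) / n
    sd_all = math.sqrt(sum((x - mean_all) ** 2 for x in scores) / n)
    sd_neg = sd_pos = max(sd_all, min_sd)
    prior_pos = 0.5
    posteriors = [0.5] * n

    for _ in range(n_iter):
        # E-step: posterior probability that each score is "positive".
        posteriors = []
        for x in scores:
            pos = prior_pos * normal_pdf(x, mu_pos, sd_pos)
            neg = (1.0 - prior_pos) * normal_pdf(x, mu_neg, sd_neg)
            posteriors.append(pos / (pos + neg) if pos + neg > 0.0 else 0.5)
        # M-step: re-estimate the mixture from the weighted scores.
        w_pos = sum(posteriors)
        w_neg = n - w_pos
        if w_pos < 1e-9 or w_neg < 1e-9:
            break  # one component has collapsed; stop refining
        mu_pos = sum(p * x for p, x in zip(posteriors, scores)) / w_pos
        mu_neg = sum((1.0 - p) * x for p, x in zip(posteriors, scores)) / w_neg
        var_pos = sum(p * (x - mu_pos) ** 2 for p, x in zip(posteriors, scores)) / w_pos
        var_neg = sum((1.0 - p) * (x - mu_neg) ** 2 for p, x in zip(posteriors, scores)) / w_neg
        sd_pos = max(math.sqrt(var_pos), min_sd)
        sd_neg = max(math.sqrt(var_neg), min_sd)
        prior_pos = w_pos / n
    return posteriors
```

In this toy model the posterior equals 0.5 exactly where the two prior-weighted densities cross, which corresponds to the intersection point described above.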
Error versus Sensitivity: Compromising Maximal Return with False Positives
Another significant benefit of using the PeptideProphetTM data modeling algorithm is that the computed probabilities generated for all peptide identifications allow the user to know what is referred to as the "sensitivity" and "error rate" for the entire dataset. Sensitivity is defined as the percentage of the actual correct identifications contained in the restricted (filtered) dataset. Error rate is defined as the percentage of the identifications contained in the restricted (filtered) dataset that are incorrect (i.e. false positives). The sensitivity and error rate are directly related and are dependent on the pcomp threshold set to filter the dataset. Also, sensitivity and error rates vary for each dataset thus analyzed because, as described above, the pcomp values depend upon the calculated positive and negative assignment distributions, which vary from one dataset to another.
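One way to estimate these two quantities directly from the computed probabilities is sketched below. It assumes the pcomp values are accurate, so that their sum estimates the number of correct assignments; the function name is hypothetical and this is not presented as the tool's own code.

```python
def sensitivity_and_error(pcomps, threshold):
    """Estimate sensitivity and error rate at a given pcomp threshold.

    sensitivity = expected correct assignments passing the filter
                  / expected correct assignments in the whole dataset
    error rate  = expected incorrect assignments passing the filter
                  / number of assignments passing the filter
    Both are returned as fractions between 0 and 1.
    """
    passing = [p for p in pcomps if p >= threshold]
    expected_correct_total = sum(pcomps)
    expected_correct_passing = sum(passing)
    sensitivity = (
        expected_correct_passing / expected_correct_total
        if expected_correct_total > 0 else 0.0
    )
    error_rate = (
        (len(passing) - expected_correct_passing) / len(passing)
        if passing else 0.0
    )
    return sensitivity, error_rate
```

Evaluating the function over a range of thresholds gives the kind of trade-off curve discussed here: raising the threshold lowers the error rate at the cost of sensitivity.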
Fig. 4 shows plots of the peptide sensitivity and error rates for the ICAT 1 and Flow-through 1 data subsets. These were both derived from the same initial set of samples, separated after avidin-affinity chromatography: ICAT 1 being mostly the ICAT-labeled Cys-containing peptides, and Flow-through 1 being the unlabeled peptides. Fig. 4A shows how the error and sensitivity rates are affected as the pcomp threshold set to filter the datasets is altered. It is important to reiterate that these curves are not fixed and can vary substantially from one dataset to another.
Fig. 4B also illustrates the flexibility and power of a statistical data modeling approach. One can readily see that, compared with Flow-through 1, the ICAT 1 dataset yielded a curve closer to the ideal point: i.e. 100% sensitivity with a 0% error rate (indicated in Fig. 4B with a filled square). This is not a coincidence. This occurred because ICAT labeling targets cysteine residues. Thus we were able to include the presence of (labeled) cysteine in the peptide sequences assigned by SEQUESTTM, for the ICAT 1 (and ICAT 2) data subset, as an additional factor for PeptideProphetTM to model for its calculation of final pcomp values. This ultimately led to better discrimination between "correct" and "incorrect" identifications for ICAT 1 versus Flow-through 1, for which the additional constraint did not apply (and was thus not used for pcomp calculations for Flow-through 1). Indeed, this inherent flexibility of PeptideProphetTM makes it able to utilize the output results generated by almost any database search program, as well as improving its performance for "specialized" applications other than ICAT. For example, if one were searching for phosphorylated peptides, the presence of (phosphorylated) serine, threonine, and/or tyrosine in the matched sequence could be included, as appropriate, as additional contributory factors for pcomp calculation for that particular dataset.
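How an extra piece of evidence such as the presence of a labeled cysteine can sharpen a probability is shown in the sketch below. It assumes the extra feature is independent of the discriminant score given correctness (a naive Bayes assumption made here for illustration), and all numerical inputs are hypothetical rather than values from this study.

```python
def posterior_with_feature(prior_pos, score_like_pos, score_like_neg,
                           feature_like_pos=1.0, feature_like_neg=1.0):
    """Posterior probability of a correct assignment given two kinds of evidence.

    prior_pos        : prior fraction of correct assignments
    score_like_*     : likelihood of the discriminant score under the
                       positive / negative distribution
    feature_like_*   : likelihood of the extra feature (e.g. Cys present)
                       among correct / incorrect assignments
    Leaving the feature likelihoods at 1.0 ignores the extra feature.
    """
    numerator = prior_pos * score_like_pos * feature_like_pos
    denominator = numerator + (1.0 - prior_pos) * score_like_neg * feature_like_neg
    return numerator / denominator
```

For a score that is equally likely under both distributions, a feature that is common among correct assignments and rare among incorrect ones moves the posterior well away from 0.5, which is the qualitative effect described for the ICAT 1 data.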
False Positives Relating to Database Issues
An important caveat to bear in mind when interpreting proteomic data, even when applying statistical tools such as PeptideProphetTM (and ProteinProphetTM) to improve confidence in identifications, arises when studying higher eukaryotic organisms (including humans) where the sequence databases searched are incomplete and/or not fully annotated. Indeed, at the time of writing, only a few genomes have been fully completed, even fewer being eukaryotic. Furthermore, for most genomic sequences, it is not yet clear which sequences represent those that code for protein, nor is it yet clear what, in fact, constitutes one gene. This means that the sequence databases searched for a proteomic investigation of most organisms, particularly for humans, are de facto also incomplete. Any search algorithm used to search proteomic data, including SEQUESTTM, will only report the "best" match from the searched database, which is what they are designed to do. However, if a peptide/protein in the original sample is not represented in the database searched, or the sequence in the database is incorrect (due to a sequencing error or polymorphism, for example), then the "best" match reported will also be incorrect.
False positives of this nature are inherently hard to identify. This is because good MS/MS spectra can randomly yield "acceptable" matches to the wrong sequence. The "correct" match would of course yield a better search result, but is not represented in the database. It is difficult to know how frequently this occurs, though this is likely related to how much of the database is "missing" (also not known). However, if the organism of study has little sequence information available on it, then such events would likely be frequent. While not completely immune from this effect, because it is a data modeling algorithm rather than a database search engine, PeptideProphetTM evaluates multiple parameters to model the false identification population, and thus does not rely solely on scores generated by the search engines and visual data inspection. These parameters include the SEQUESTTM-generated cross-correlation (Xcorr) score (an indication of the number of peaks of common mass between observed and expected spectra) and preliminary SpRank (a preliminary indication of how well the assigned peptide scored relative to those of similar mass) (18), and for experiments where trypsin was used for proteolysis, the number of tryptic termini for the assigned peptide (19). This enables PeptideProphetTM to identify many false positives of this nature by assigning them a low pcomp. Furthermore, PeptideProphetTM is also impartial when it evaluates potential identifications, unlike even an experienced human user, who may make different judgment calls for the same data point on different days and be biased toward potential identifications that fit with the biology of the experiment in question.
Fig. 5 illustrates this point well. In two iterations of the same ICAT experiment, the same peptide from the same protein, macrophage inhibitory factor (MIF), was identified with SEQUESTTM scores that passed the filtering parameters commonly used to date (18, 23). Fig. 5A shows one such MS/MS scan, along with the search results for the given peptide sequence from both the ICAT 1 and ICAT 2 datasets. While the peptide sequence was only partially tryptic, the biology of MIF was in keeping with the biology of the experiment performed and made acceptance of the identification tempting. However, when PeptideProphetTM interpreted the data, it reported low values for pcomp, indicating that MIF was not identified. As shown in Fig. 5B, when the same data were searched against a nonredundant database, the same MS/MS scan (better) matched a peptide sequence for the bovine heterotrimeric G2 protein. The human homologue of this gene, though known, was not in the human database searched for some reason, resulting in the errant MIF identification. Because the human and bovine G2 amino acid sequences are conserved for the region spanning the assigned peptide, when the human G2 sequence was added to the human database and the data re-searched, the pcomp values for this new match were now very high (i.e. most likely correct). Indeed as discussed elsewhere,2 heterotrimeric G proteins are highly abundant in lipid rafts (the source of our initial sample), thus this result was also in keeping with the biology.
Statistical Analysis and Validation of Protein Identifications
When interpreting proteomic MS/MS data, there are two related but entirely separate steps to the process of identifying the protein(s) in the original sample. The first, as has been discussed above, is assigning individual MS/MS spectra to peptide sequences in a database. For this purpose, PeptideProphetTM was developed to calculate a level of confidence (computed probability) for each sequence so assigned. The second step is the determination of the proteins that these peptides, collectively, represent. This process is very different from the process of assigning peptide sequence to MS/MS spectra, but is also by no means simple, especially when dealing with complex higher eukaryotic organisms such as humans. ProteinProphetTM was thus developed to assist in the deconvolution of the complexities inherent in protein identification, again by calculating a probability (Pcomp) for each protein potentially identified (20).
ProteinProphetTM also uses an EM algorithm to derive the simplest list of proteins (i.e. database entries) that can explain the observed peptide data. ProteinProphetTM uses features of the observed peptides assigned to each database entry in question pertinent to the likelihood that this protein was actually present in the original sample(s), including number of sibling peptides (different peptide sequences matching the same database entry) and the pcomp values for each of these peptides, etc. In this way, ProteinProphetTM assigns each potential "protein" identification its own Pcomp value: i.e. the probability that the given protein identification is correct, on a scale of 0 (incorrect) to 1 (correct). A detailed description of how ProteinProphetTM models the peptide data and calculates Pcomp values can be found elsewhere (20).
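A much simplified sketch of how several peptide probabilities can be combined into one protein-level probability is given below. It treats the peptides as independent pieces of evidence, so the protein is taken to be present unless every peptide assignment is wrong. It deliberately omits the adjustments for sibling peptides and the handling of peptides shared between database entries, for which the full model in (20) should be consulted; the function name is hypothetical.

```python
def protein_probability(peptide_pcomps):
    """Combine peptide probabilities into a protein-level probability.

    Simplifying assumption: peptide assignments are independent, so
    P(protein present) = 1 - product over peptides of (1 - pcomp).
    An empty list yields 0.0 (no supporting evidence).
    """
    prob_all_wrong = 1.0
    for p in peptide_pcomps:
        prob_all_wrong *= (1.0 - p)
    return 1.0 - prob_all_wrong
```

Under this assumption two peptides with pcomp = 0.5 each give a protein probability of 0.75, illustrating why multiple sibling peptides raise confidence relative to a single hit.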
In order to generate a "final list" of the proteins (database entries) most likely copurifying with the T cell lipid rafts, the three datasets (ICAT 1, ICAT 2, and Flow-through 1) were combined and submitted to ProteinProphetTM (in this case, ProteinProphetTM included all peptides with a pcomp ≥ 0.05 for its calculations in order to speed up the process, without sacrificing data of any significance). As with the final peptide list, the resultant protein identification dataset was filtered at Pcomp ≥ 0.5. This final protein list, along with the Pcomp values for each protein identification match and its matching peptide sequences with their respective pcomp values, among other things, is given separately elsewhere (24). Similarly to the peptide identification results, the ProteinProphetTM output allows for the determination of error rate and sensitivity plots for these data (see Fig. 6). Again, ProteinProphetTM models each set of peptide data separately, thus the results are data-dependent and will be different for the same protein(s) identified in separate experiments. In this case, as shown in Fig. 6, when we restricted our protein data to list only the most confident identifications (Pcomp ≥ 0.95), we got a false positive (error) rate of 0.5%, but at the price of retrieving only 85.0% of the actual correct identifications. However, when we filtered the data at Pcomp ≥ 0.5, we instead retrieved 97.2% of the actual correct matches, but at a price of a 3.4% false positive rate. Again, the power of statistical data analysis with tools such as ProteinProphetTM is the control it gives the user in making informed and transparent decisions when interpreting data, allowing them to accurately know the likelihood that any specific potential protein identified was present in the original sample(s).
ProteinProphetTM deals with these problems, when necessary, by grouping proteins (database entries) in one of two ways, examples of which can be found within the full list of T cell lipid raft proteins identified (24). On occasion a peptide, or set of peptides, can be assigned to a single database entry. Other times, two or more database entries are essentially identical. This commonly occurs when one or more mRNA/cDNA/partial coding sequences for the same protein (or fragment thereof) have separate entries in the database being searched. Typically, when interpreting just SEQUESTTM output results, one gets multiple protein "identifications" from such cases, and the onus is on the user to rationalize the results. This process is very time consuming and difficult for large datasets, typically resulting in a higher number of proteins "identified" than are actually present in the dataset. In cases where a peptide, or set of peptides, match two or more essentially identical database entries, ProteinProphetTM groups these entries together to form one "protein," collectively making a single entry in its protein identification output. In effect, the software reports that this "protein" has been identified at a given Pcomp, but that it cannot distinguish between the two or more database entries listed for it, on the basis of the available peptide data.
On other occasions, a high degree of protein sequence homology makes it difficult to distinguish between conserved gene family members. ProteinProphetTM similarly deals with these scenarios via the formation of "protein groups," which again form single entries in its output file. ProteinProphetTM again assigns a Pcomp for the entire group, i.e. the probability that one or more of the family members were present in the original sample(s). The protein family members to which ProteinProphetTM has assigned one or more peptides are then listed under the group heading, along with the peptide(s) matched to each entry, in the same way that it is done for other proteins. Finally, ProteinProphetTM assigns Pcomp values to indicate which protein group members were most likely present in the original sample(s), based on the preponderance of the evidence (typically, though not necessarily, those with the highest number of unique peptide sequences).
One thing to note about such "protein groups" is that none of the assigned peptides will be exclusive to any one database entry. Proteins for which unique identifying peptides have been assigned by PeptideProphetTM (at high enough pcomp) will automatically be assigned their own individual output lines with corresponding Pcomp by ProteinProphetTM. Thus it is not possible to say with absolute certainty that any specific protein group family member was indeed present in the original sample(s), only that one or more were, at the given Pcomp for the group. Also, while many group members are assigned a Pcomp of zero, they similarly cannot be ruled out with any certainty. Finally, on occasions when the gene family is large and/or has a very high degree of sequence homology, ProteinProphetTM will assign a Pcomp of 1 to the group, but zero to all group members. This is an indication that while this class of protein was clearly present in the original sample(s), the peptides observed were shared by too many separate database entries to calculate which were most likely present. Four examples of this occurred when studying T cell lipid rafts: tubulin α and β chains, stomatin, and spectrin chain.2
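The simplest of the grouping cases above, in which database entries cannot be told apart on the basis of the available peptide data, can be sketched as follows. The criterion used here (entries matched by exactly the same set of peptide sequences are merged) is an assumption chosen for illustration and does not cover the homologous "protein group" case; the function name and input layout are hypothetical.

```python
from collections import defaultdict


def group_indistinguishable(peptides_by_entry):
    """Merge database entries supported by identical peptide sets.

    `peptides_by_entry` maps a database entry name to the set of
    peptide sequences assigned to it. Returns a sorted list of groups,
    each group being a sorted list of entry names that cannot be
    distinguished from one another by the observed peptides.
    """
    groups = defaultdict(list)
    for entry, peptides in peptides_by_entry.items():
        groups[frozenset(peptides)].append(entry)
    return sorted(sorted(members) for members in groups.values())
```

Each resulting group would then be reported as a single "protein" line, which avoids counting the same evidence as several separate identifications.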
Overlap of Peptide and Protein Identifications from Related Datasets
One of the observations made when performing large-scale proteomic µLC-MS/MS-based investigations on complex protein mixtures is that reanalysis of the same sample under essentially identical conditions leads to the identification of a somewhat different set of proteins than was observed in previous analyses (17, 25–27). The overlap between consecutive µLC-MS/MS runs of the same sample typically depends on the sample complexity. This can range from close to 100% overlap for a very small set of abundant proteins to as low as 20% overlap for complex mixtures spanning a wide range of abundances. When dealing with highly complex protein mixtures, multidimensional chromatography is required to simplify the peptide mixture for separate µLC-MS/MS analyses to allow for increased peptide/protein identification, as was performed in this study. Even when performing such additional upstream prefractionation, the overlap between the dataset obtained for the whole experiment and that obtained from further repetitions of the same protocol can likewise vary tremendously, again depending on the complexity of the original sample and the prefractionation protocol employed.
This variability in the overlap in the set of peptides/proteins identified when complex samples are repeatedly analyzed has led, indirectly, to an unfortunate and unforeseen trend in proteomics, whereby the proteins in two samples, related through a biological experiment, are separately determined. The absence (or presence) of a given protein in one sample versus the other is then, incorrectly, interpreted as being a consequence of the biological experiment. While it is not entirely clear why this often poor overlap occurs, it is most likely due in large part to the mass spectrometer's rate of sampling from the large set of overlapping peptide peaks eluting from the various chromatography columns employed. From this we can infer that when the overlap between multiple experiments is not 100% (100% overlap is very rare) then not all of the proteins in the original sample(s) were identified. Given this, it is thus not appropriate to draw the conclusion that a given protein was not in a given sample, simply on the grounds that it was not identified, even if the same protein was identified in a separate analysis of a highly related, or even identical, sample. It was to address this problem that stable isotope-tagging approaches, such as ICAT, were devised. With such an approach, the related samples are labeled with different isotopic versions of the same chemical and then combined. The original samples are then analyzed as one. Once a peptide is identified, reconstructing the ion chromatograms for the different isotopic versions of the same peptide determines the relative abundance of the peptide (hence protein) in the original samples. Thus if only one isotopic version is observed, it now is valid to assume that the peptide/protein was not present in the other original sample(s), or that its level was reduced sufficiently so as to be indistinguishable from the observed level of signal noise for the given experiment.
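The ratio calculation from reconstructed ion chromatograms can be sketched as below. The trapezoidal integration, the `noise_floor` parameter, and the function names are illustrative assumptions and do not describe the quantification software used in this study. In the labeling scheme used here, the light (d0) form corresponds to the control sample and the heavy (d8) form to the stimulated sample.

```python
def peak_area(times, intensities):
    """Trapezoidal area under a reconstructed ion chromatogram."""
    points = list(zip(times, intensities))
    area = 0.0
    for (t0, i0), (t1, i1) in zip(points, points[1:]):
        area += (t1 - t0) * (i0 + i1) / 2.0
    return area


def heavy_to_light_ratio(times, light, heavy, noise_floor=0.0):
    """Relative abundance of a peptide as heavy area / light area.

    Returns None when neither isotopic form rises above the
    (hypothetical) noise floor, and infinity when only the heavy
    form is observed, i.e. the light form is indistinguishable
    from noise.
    """
    light_area = peak_area(times, light)
    heavy_area = peak_area(times, heavy)
    if light_area <= noise_floor and heavy_area <= noise_floor:
        return None
    if light_area <= noise_floor:
        return float("inf")
    return heavy_area / light_area
```

Because both isotopic forms are measured in the same run, a missing partner peak carries information about abundance, in contrast to a protein simply going unidentified in a separate analysis.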
In order to look at this overlap effect more closely, and the effect, if any, that subsequent statistical data analyses had on it, we performed the ICAT experiment twice in its entirety. From these, we generated two sets of peptide/protein identification data from the avidin-affinity eluates (i.e. Cys-containing peptides) to compare the ICAT 1 and ICAT 2 datasets. We chose to focus our attention on the ICAT-labeled peptides because we were also interested in the reproducibility of the observed ICAT ratios for proteins identified in common between the two experiments. This second, equally important, aspect of the comparison is discussed in detail elsewhere.2 Finally, we analyzed the avidin-affinity flow-through fractions for one of the iterations of the experiment. This was done to assess the benefit, in terms of increased protein identifications and our confidence in them, versus the cost through additional machine and data processing time (i.e. the overlap between the ICAT 1 and Flow-through 1 datasets).
Fig. 7 shows the overlaps for both peptide and protein identifications made for all three datasets. These peptides represent the 2,669 unique peptide sequences derived from the list of total peptide identifications at pcomp ≥ 0.5, listed separately elsewhere (24). As would be expected, the overlap at the peptide level between the avidin-affinity eluate samples (ICAT-labeled Cys-containing peptides) and flow-through samples (unlabeled peptides) was very low (Fig. 7A). Indeed, only 29 of 5,152 assigned peptides (23 of the 1,843 unique peptides) in the flow-through fractions contained (unmodified) Cys (24), and only 17 peptides contained ICAT-labeled Cys upon re-searching of the data for ICAT modifications (data not shown). Also, all but one of the overlapping peptide sequences between the ICAT and flow-through samples were non-Cys-containing peptides, coming from the 9.5% (at pcomp ≥ 0.5) of unlabeled peptides nonspecifically binding and eluting from the avidin cartridges (see Table I). These observations confirmed that both the ICAT-labeling process and the avidin-affinity step to enrich for ICAT-labeled peptides worked efficiently.
Another reason why we expected that the number of identifications shown in Fig. 7B would be an overestimate of the actual number of unique proteins present was that, because the peptide data were initially filtered at pcomp ≥ 0.5, many lower-confidence "single hit" proteins (those to which only a single peptide was assigned) would likely be included in the final list. We would also expect many such "single hits" to be eliminated upon further processing using ProteinProphet™ and subsequent data filtering. Prior to the availability of tools such as ProteinProphet™, data reduction and simplification of such a list of identifications has been a manual process, typically involving numerous BLAST searches and the "weeding out" of poor quality hits, based on raw data inspection, one MS/MS scan at a time. For large datasets, this process is necessarily very slow. Also, because the manual process involves frequent and nonreproducible judgment calls by the user, assessing the confidence in each final curated list is almost impossible, making the comparison of results between different individuals and laboratories difficult at best. ProteinProphet™ was developed, in part, to try to address these problems.
Fig. 7C shows the overlap at the protein level determined solely by ProteinProphet™, with the only human input being setting the cut-off for protein inclusion in the list, again at Pcomp ≥ 0.5. Comparing Fig. 7, B and C, several things become apparent. We observed a reduction in total protein identifications, particularly in the Flow-through 1 dataset. As can also be seen from Table I, much of this was due, as expected, to the loss of "single hit" proteins, because they are frequently incorrect. ProteinProphet™ "penalizes" single hit identifications in a data-dependent fashion, based upon the learned number of sibling peptides distribution generated by the EM algorithm (20). In order to obtain a protein Pcomp ≥ 0.5, a "single hit" peptide score must typically be high, in these datasets requiring a peptide pcomp of 0.95 or higher. We also observed that the overlap between ICAT 1 and 2 was a little higher, but close to that in Fig. 7B: 52.0% of ICAT 1 confirmed in ICAT 2 and 71.9% of ICAT 2 confirmed in ICAT 1. However, the overlaps between the ICAT datasets and Flow-through 1 dataset were increased; now more than 70% of ICAT-identified proteins were confirmed by additionally analyzing the avidin flow-through fractions. We believe this increase may be due in part to the loss of single hits, but also because ProteinProphet™ condenses identical and related database entry matches into single "proteins," an effect that would also contribute to the reduction in total identifications made. This potential benefit of additionally analyzing avidin-affinity flow-through fractions should thus be considered when performing a quantitative ICAT-type experiment, balanced against the increased machine time and data interpretation time required, when one wishes to be as sure as possible of the identity of the proteins regulated in a biological experiment.
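Overlap figures of the kind quoted above are directional: the fraction of one dataset confirmed in another depends on which dataset is taken as the reference. A minimal sketch of that calculation follows, assuming each dataset has been reduced to a set of protein identifiers. The function name and the placeholder identifiers are hypothetical and are not data from this study.

```python
from typing import Set


def confirmed_fraction(reference: Set[str], other: Set[str]) -> float:
    """Fraction of the identifications in `reference` that also occur in `other`."""
    if not reference:
        raise ValueError("reference dataset is empty")
    return len(reference & other) / len(reference)


# Placeholder identifiers only, to show that the two directions differ.
icat_1 = {"protA", "protB", "protC", "protD"}
icat_2 = {"protC", "protD", "protE"}

icat_1_in_icat_2 = confirmed_fraction(icat_1, icat_2)  # 2 of 4 shared
icat_2_in_icat_1 = confirmed_fraction(icat_2, icat_1)  # 2 of 3 shared
```

Because the shared identifications are divided by the size of the reference set, the larger dataset always shows the lower confirmed fraction, which is why both directions are worth reporting.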
Sample Complexity Reduction Via Multidimensional Chromatography
As the data presented above demonstrate, proteomic analysis of complex biological samples still presents significant challenges. While statistical analysis of database search results clearly holds great promise for improving the speed, accuracy, and transparency of proteomic data interpretation, the reproducibility (overlap) of related experiments and the maximization of protein identifications for such experiments are more difficult to address. One thing that does seem clear from our work and that of others is that the simplification (i.e. fractionation) of peptide samples is a requirement for any attempt at optimal data return for complex protein mixtures (9, 28). This is also critical if one is to have any chance of identifying the lower abundance proteins in any such mixture (which often turn out to be the more interesting proteins biologically).
In the experiments presented here, we used a three-step peptide separation protocol, previously applied with some success to the identification (and quantification) of membrane proteins (9). The first step was an ion exchange fractionation, which separates peptides roughly according to charge state. The second step was an avidin-affinity column, which enriches for the ICAT-labeled (i.e. Cys-containing) peptides. This should simplify the peptide mixture, which in turn should help increase the number of protein identifications possible from a complex starting material. The final step was reversed-phase liquid chromatography, which is performed with online MS and MS/MS for both peptide identification and quantification. While it is almost impossible to compare different separation strategies in an attempt to determine an "ideal" approach for maximal data return, we were able to draw some conclusions about the effectiveness of the separation steps used here from our data. As mentioned above (see also Table I and Fig. 7A), the recovery and identification of essentially only Cys-containing peptides in the ICAT 1 and 2 datasets (avidin-affinity eluate) and non-Cys-containing peptides in the Flow-through 1 dataset confirmed the effectiveness of the avidin-affinity step in an ICAT experiment. The effectiveness of µLC for peptide separation prior to MS and MS/MS is also well established and widely documented. Because ion exchange fractions (both avidin-affinity eluates and flow-through fractions) were separately analyzed by µLC-MS/MS, we were also able to assess from our data the effectiveness of the ion exchange peptide fractionation step for a large-scale quantitative proteomic experiment.
We were interested to see whether there was a relationship between peptides that produced MS/MS data of sufficient quality for peptide sequence identifications (in this case yielding pcomp 0.5) and where they eluted from the ion exchange column. We did this by sorting the peptide identification datasets by cation exchange fraction number. Fig. 8 shows the ion exchange ultraviolet trace (Fig. 8A) aligned with the number of peptides that were subsequently identified via µLC-MS/MS from that portion of the profile in the first ICAT experiment, where both the avidin-affinity eluate (ICAT 1) and flow-through (Flow-through 1) fractions were analyzed. These data showed that subsequently useful peptide assignments were obtained throughout the ion exchange gradient for ICAT-labeled peptides (Fig. 8B), peptides also capable of yielding quantitative information. The fact that the labeled peptides did elute throughout the gradient used also suggested that the ICAT modification itself did not adversely affect the chromatographic properties of the peptides. When the flow-through peptide identifications were superimposed (Fig. 8C), we observed a similar distribution of hits (even though a few flow-through fractions were not successfully analyzed for various reasons). Thus we could conclude that the use of ion exchange as a preliminary fractionation step for large-scale proteomic experiments is an effective strategy for simplification of both ICAT-labeled and unlabeled peptide mixtures.
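The sorting step described above amounts to counting accepted peptide assignments per cation exchange fraction. A minimal sketch is given below, assuming each assignment is available as a (fraction number, peptide sequence, computed probability) record. The record layout, the function name, and the example records are hypothetical and are not taken from the datasets of this study.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (fraction number, peptide sequence, computed probability); hypothetical layout
Record = Tuple[int, str, float]


def identifications_per_fraction(
    records: Iterable[Record], min_probability: float = 0.5
) -> Dict[int, int]:
    """Count assignments passing the probability cut-off in each fraction,
    returned in order of increasing fraction number."""
    counts: Counter = Counter()
    for fraction, _sequence, probability in records:
        if probability >= min_probability:
            counts[fraction] += 1
    return dict(sorted(counts.items()))


# Made-up records for illustration only.
records = [
    (12, "PEPTIDEK", 0.98),
    (12, "ACDEFGHIK", 0.40),  # below the cut-off, not counted
    (13, "LMNPQR", 0.75),
    (15, "STVWYK", 0.51),
]
per_fraction = identifications_per_fraction(records)
```

This sketch counts every passing assignment; counting unique sequences instead would require collecting a set of sequences per fraction. Plotting the resulting counts against fraction number gives a profile that can be aligned with the ultraviolet trace.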
CONCLUSIONS
A logical solution to this problem is the use of statistical data analysis. By using statistical algorithms to interpret the results of protein sequence database searches, it should be possible to assign confidence (or probability) to each individual peptide and protein identification. One of the benefits of probability-based statistical analysis is that it also allows the user to know the likely error (false positive) rate of any large dataset restricted on the basis of calculated probability. This is, of course, far more realistic than the current method of reporting results simply as a list of proteins "successfully" identified, at the exclusion of all else. Thus the adoption of suitable statistical approaches to the interpretation of MS-based proteomic data should, for the first time, allow the investigator to compare results from completely separate experiments. Furthermore, if common statistical approaches are applied to the datasets in question, they should allow for the comparison of any one dataset to those generated in other laboratories, even using different machines and search algorithms. Additionally, datasets already published could be reprocessed using the latest versions of these new tools in order to facilitate such comparisons.
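The link between per-identification probabilities and a dataset-level error rate can be made concrete with a short sketch. If each accepted identification carries a probability p of being correct, and those probabilities are assumed to be accurate, then the expected number of incorrect identifications is the sum of (1 - p) over the accepted set, and the expected false positive rate is that sum divided by the number accepted. The function name and the example probabilities are hypothetical. This illustrates the general principle only and is not the implementation of any particular tool.

```python
from typing import Iterable, Tuple


def expected_false_positive_rate(
    probabilities: Iterable[float], threshold: float
) -> Tuple[float, int]:
    """Return (expected false positive rate, number accepted) for the
    identifications whose probability is at or above `threshold`.

    Assumes the probabilities are accurate estimates of correctness.
    """
    accepted = [p for p in probabilities if p >= threshold]
    if not accepted:
        return 0.0, 0
    expected_incorrect = sum(1.0 - p for p in accepted)
    return expected_incorrect / len(accepted), len(accepted)


# Made-up probabilities for illustration only.
rate, n_accepted = expected_false_positive_rate([0.99, 0.90, 0.60, 0.30], 0.5)
```

Raising the threshold lowers the expected error rate at the cost of fewer accepted identifications, which is the trade-off a probability cut-off makes explicit.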
Fortunately, this urgent need has been recognized by a number of groups working in proteomics, and several early attempts at providing statistical tools for the interpretation of (in particular MS/MS) proteomic data have recently emerged and are beginning to be used. One such attempt has been to generate statistical significances for each peptide assignment in an experiment, based upon the database search engine output scores generated (29). Other recent attempts have used training datasets to determine an algorithm that calculates distributions of "correct" and "incorrect" peptide assignments for any given dataset (of search engine output results), based on the training dataset (30, 31). While such approaches can allow for the calculation of probabilities from these distributions (30), they lack the ability to take data quality into account because they rely exclusively on the training data, rather than "learning" the distributions from the observed data, as PeptideProphet™ and ProteinProphet™ are capable of (19, 20). Nevertheless, all of these attempts at applying statistical methods to the interpretation of proteomic MS/MS data hold considerable promise and represent steps in the right direction.
It is thus hoped that the application of new, statistically validated, methodological approaches such as these will soon alleviate much of the confusion and complexity currently present in the MS-based proteomics field. This, in turn, will allow for a common platform for the presentation and dissemination (i.e. publication) of such proteomic data, allowing for the extraction of more and clearer information by the research community as a whole, and thus accelerate the already significant inroads MS-based proteomics is making into the study and understanding of human biology and disease.
FOOTNOTES
1 The abbreviations used are: 2DE, two-dimensional polyacrylamide gel electrophoresis; EM, expectation maximization; IADIFF, INTERACT differential; ICAT, isotope-coded affinity tag; LC, liquid chromatography; µLC-MS/MS, microcapillary-liquid chromatography tandem mass spectrometry; MIF, macrophage inhibitory factor; MS, mass spectrometry; MS/MS, tandem mass spectrometry; pcomp, computed probability that the given peptide sequence assignment is correct; Pcomp, computed probability that the given protein identification is correct; TCR, T cell receptor.
2 P. D. von Haller, E. Yi, S. Donohoe, K. Vaughn, A. Keller, A. I. Nesvizhskii, J. Eng, X. Li, B. Wollscheid, D. R. Goodlett, R. Aebersold, and J. D. Watts, manuscript in preparation.
* This work was supported in part by grants from the National Institutes of Health (RO1-AI-41109-01 and RO1-AI-51344-01 to R. A. and J. W., respectively), the National Heart, Lung, and Blood Institute Proteomics Center at the Institute for Systems Biology (N01-HV-28179), and a fellowship awarded by the Swiss National Science Foundation to P.D.H. We thank Oxford GlycoSciences (UK) for additional generous financial support. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Current address: MacroGenics, 1441 North 34th Street, Seattle, WA 98103.
Published, MCP Papers in Press, June 25, 2003, DOI 10.1074/mcp.M300041-MCP200
To whom correspondence should be addressed. Tel.: 206-732-1283; Fax: 206-732-1299; E-mail: jwatts{at}systemsbiology.org
REFERENCES