From the Discovery Proteomics and Small Molecule Research Center, Applied Biosystems, Framingham, MA 01701; ¶ Biocrates Life Sciences, 66 Innrain, Innsbruck, Austria; and || Department of Medical Genetics and Microbiology, University of Massachusetts Medical School, Worcester, MA 01655-0122
ABSTRACT
One simple statistic for determining the relative success of a proteomics experiment is to count the number of correctly identified peptides and proteins. Because identification of peptides by electrospray is dependent on appropriate automated selection of precursor ions, repetition of experiments has frequently been observed to result in a higher number of identifications. In the matrix-assisted laser desorption/ionization (MALDI) approach, the output of the reversed-phase columns is deposited on the MALDI plate, so that time is much less of a limitation in selecting the most informative set of precursors for fragmentation. However, even in this approach, one can expect a larger number of identifications upon repetition because of slight changes in elution times, differences in matrix crystallization, experiment-specific loss of peptides prior to reversed-phase chromatography by absorption, or limitations on sample consumption by the acquisition process itself. For both approaches, another desired output is the expression ratio: the ratio of expression of each protein in the experimental sample versus the control. In these experiments, in addition to learning about the ramifications of the deletion of UPF1, we sought to determine what limitations there are in both identification and quantification that might apply to proteomics experiments in general. In particular, we sought to develop additional scoring parameters so as to more easily identify false positives in a semi-automatic fashion, while retaining borderline identifications that are consistent with what is known about correct identifications. To that end, we report here on the results of five complete ICAT experiments, starting from exactly the same samples, so that biological variability does not contribute.
In these experiments, we combined identifications from two different instrument types that initially used two different search engines. After combining the data into a common relational database, we submitted all of the spectra corresponding to identified peptides to Mascot. In addition, we developed new measures of reliability of identification and quantification that proved useful in resolving discrepancies in identifications. These quantities are adapted from the parameters we recently described for peptide mass fingerprinting experiments (8). In addition to the percent intensity matched parameter, we describe a percent ChemScore matched parameter that semiquantitatively assesses what percentage of the critically important ion fragments were detected. A third parameter was the internal mass consistency of the fragment ions. We also describe a fourth parameter, called Fragment TriScore, that gives higher credit to spectra in which the fragments with the highest ChemScore are also the most intense ions. These parameters were combined in an overall MatchScore (MatSc) parameter that is useful in documenting the credibility of an identification, especially in marginal cases. The parent mass accuracy parameter was left separate as an independent measure of credibility of identification. Upon calibration, based on added calibrants or masses that can be used as calibrants because they can be confidently assigned, the mass of a precursor ion should not deviate from the theoretical peptide mass by more than 20 ppm.
In order to determine which factors were most significant in preventing additional protein identifications, considerable attention was focused on spectra that were not readily identifiable. In many cases, these spectra could be attributed to chemically modified forms of peptides that had already been identified many times and were therefore surely very abundant. A second class corresponded to peptides with one noncanonical tryptic terminus from these same abundant proteins. An additional problem was spectra whose measured masses differed by several mass units from the mass of the peptide to which they were matched. To quantify these problems, each spectrum was classified into groups corresponding to chemical modification status, mass accuracy, and tryptic specificity.
MATERIALS AND METHODS
Peptide Chemistry
ICAT Reagent Labeling Procedure
Two 500-µg aliquots from each strain were resuspended in 6 M guanidine-HCl, 1% Triton X-100, 50 mM Tris HCl, pH 8.5 (buffer B). The proteins were then reduced by the addition of 10 µl of 50 mM tricarboxyethylphosphine and boiled at 100 °C for 10 min. After cooling for 5 min to room temperature, 1 mg of the acid-cleavable form of the ICAT light reagent, dissolved in acetonitrile (ACN), was added to the wild type, whereas 1 mg of the acid-cleavable form of the ICAT heavy reagent was added to the upf1 knockout sample. After incubation for 2 h at 37 °C, the two aliquots were combined and precipitated with acetone (6:1 volume of acetone:volume of sample). The precipitated proteins were centrifuged for 10 min at 13,000 x g, the acetone was decanted, and the pellet was resuspended in 100 µl of ACN. The sample was then diluted with 900 µl of 50 mM Tris, pH 8.5, 10 mM CaCl2, 20% ACN. Then 12 µg of porcine trypsin (Promega, Madison, WI) was added, the sample was incubated for 2 h at 37 °C, then another 12 µg of porcine trypsin was added, followed by overnight digestion.
Ion Exchange Chromatography (IEX)
The sample (1 ml) was diluted to 10 ml with 10 mM K3PO4, 25% ACN, pH 2.5 (buffer C). In two batches, the sample was injected onto a 4.6 x 100 mm polysulfoethyl A cation exchange column at a flow rate of 1 ml/min. The high salt buffer contained 350 mM KCl, 10 mM K3PO4, 25% ACN, pH 2.5 (buffer D). To separate the peptides as efficiently as possible, four linear gradient segments were run on an Applied Biosystems Vision Workstation (Applied Biosystems, Foster City, CA): 2 min to 10% buffer D, 15 min to 20% buffer D, 3 min to 45% buffer D, and 10 min to 100% buffer D. Seven to 23 fractions (see Table V, column No. IEX) of 1.5 ml each were collected beginning 4 min into the gradient. Prior to affinity chromatography, 250 µl of 100 mM Na3PO4, 1500 mM NaCl, pH 10 was added to each fraction, which was sufficient to bring the pH to 7.2.
Cleavage of Biotin
Each eluate was dried completely using reduced pressure. A 200-µl aliquot of ICAT cleaving reagent from the ICAT reagent kit was added, followed by incubation at 37 °C for 2 h. Once again the sample was dried under reduced pressure until time for reversed-phase separation. At that time, each sample was resuspended in 100 µl of 2% ACN, 0.1% trifluoroacetic acid (TFA).
Mass Spectrometry
Electrospray Analysis
Three or four dependent scans were collected per mass spectrometry (MS) scan using a QStar® Pulsar I (Applied Biosystems) equipped with a nanospray source using Analyst software. Typically, data were collected over 110 min at a flow rate of 0.3 µl/min, with 3-s parent scans from 300 to 1500 m/z, and with tandem MS (MS/MS) scans from 70 to 1500 m/z every 3 s. Information-dependent acquisition was used to select the most intense parent masses, excluding singly charged ions and using dynamic exclusion to prevent the same parent mass from being chosen for fragmentation within a 45-s window. Samples were injected onto a capillary trap cartridge (Captrap, Michrom Bioresources, CA) at a flow rate of 20 µl/min to concentrate and desalt the samples. After 10 min, the trap cartridge was automatically switched in-line with the analytical column. Peptides were separated using a 75-µm x 15-cm reversed-phase C18 column, 3-µm particle size (PepMap; LC Packings, Hercules, CA), by means of an Ultimate™ System (Dionex Corporation, Sunnyvale, CA). The gradient was typically from 5 to 30% buffer B, where buffer B is 85% ACN/10% water/5% n-propanol/0.1% formic acid/0.01% TFA and buffer A is 98% water/2% ACN/0.1% formic acid/0.01% TFA.
MALDI Analysis
After cleavage, the peptides were separated using an Ultimate Chromatography system (Dionex-LC Packings, Hercules, CA) equipped with a Probot MALDI spotting device. A total of 50 µl of each digested protein fraction was injected and captured on a 0.3 x 5-mm trap column (3 µm, C18; Dionex-LC Packings) with a 0.1 x 150-mm resolving column (3 µm, C18; Dionex-LC Packings) connected in series. Peptides were resolved by dual-solvent gradient elution at a flow rate of 800 nl/min, using a gradient of 5–45% buffer B over 35 min, followed by a gradient of 35–90% buffer B over 5 min, where buffer A is 98% water, 2% ACN, 0.1% TFA and buffer B is 85% ACN, 10% water, 5% isopropanol, 0.1% TFA. Column effluent was monitored using a 3-nl ultraviolet flow cell and spotted directly onto a MALDI target using the Probot. Column effluent was mixed 1:2 with MALDI matrix (7.5 mg/ml α-cyano-4-hydroxycinnamic acid dissolved in 75:25 ACN:water containing 0.15 mg/ml dibasic ammonium citrate) by means of a 25-nl mixing tee (Upchurch Scientific, Oak Harbor, WA) and was spotted onto the target in a 12 x 12 array at 20-s intervals. All MALDI spectra were acquired using a 4700 Proteomics Analyzer (Applied Biosystems) equipped with GPS Explorer version 1.0, and peaks were selected for MS/MS analysis using a strategy to collect as many useful spectra as possible without regard to heavy-to-light (HL) ratio (11). To accomplish this, the parent spectra were collected first, and masses were chosen for fragmentation from the spots in which they were most intense using the PeakPicker program (11).
Peptide Identification Methods
QStar files were submitted to the ProICAT software package (Applied Biosystems) for both identification and quantification. The database used was Swiss-Prot release 36. ProICAT generates a set of tables housed in an Access relational database. From these tables, two new tables were generated using SQL queries: a table that contained one row for each mass spectrum (see Table III) and a table that contained the peak list information from each mass spectrum (see Table VI). Although the HL ratio and peptide sequence were originally in distinct tables, these values were incorporated into Table III.
A large number of spectra were examined manually (Fig. 1, Workflow 3). Special attention was paid to spectra that were derived from unmatched, intense MS/MS spectra (Fig. 1, Workflow 3, step 1a), or intense precursors that were not automatically matched (Fig. 1, Workflow 3, step 1b). We were also particularly interested in spectra that had precursor masses that appeared to belong to HL pairs that indicated differential expression (Fig. 1, Workflow 3, step 1c). Other spectra were selected for special attention because they matched to proteins that we did not expect to be abundant, or that contained unusual modifications, or did not conform to trypsin cleavage rules (Fig. 1, Workflow 3, step 1d). Some spectra were examined randomly as a spot check of the annotation process (Fig. 1, Workflow 3, step 1e). Some of the methods used to identify these spectra are listed in Fig. 1, Workflow 3, section 2. For some spectra, the precursor mass or charge state was adjusted (Fig. 1, Workflow 3, step 2b). In other cases, the peak list was refined manually because of inappropriate de-isotoping, or because some fragment ions appeared to be derived from substances other than peptides (Fig. 1, Workflow 3, step 2c). In this case, the peak list was usually not permanently altered, and MatSc was calculated using the peptide sequence that was determined manually.
After we had uncovered evidence for peptides with new modifications, selected high-performance liquid chromatography (HPLC) runs were resubmitted to Mascot, for example, with lysine-modified ICAT reagent (KICAT), oxidized CICAT, or N-terminal ICAT reagent modified as alternative variable modifications (Fig. 1, Workflow 4). Once again, MatSc_Calc was used as the criterion to resolve conflicts. Regardless of how many searches were performed, in the end, each spectrum was allowed to match to no more than one peptide sequence. Some spectra were manually placed into Y1 class 99 (see Table IVB), because manual examination of the spectrum indicated that a proposed sequence was implausible, yet MatSc was significant.
For peak density filtration, in many instances, the most important ions for determining the correct sequence are relatively high in molecular mass but weak in overall intensity. To ensure that the peak list contains the most significant ions from all regions of the mass spectrum, the peak list was filtered to eliminate all but the six most intense masses every 100 amu. Later, these same processed peak lists were resubmitted to Mascot so that the Mascot scores listed were derived from these altered peak lists.
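The filtering step can be illustrated with a minimal sketch. It assumes fixed, non-overlapping 100-amu bins that start at 0 amu (the text does not say whether the windows are fixed or sliding), and the function name `filter_peak_density` is hypothetical.

```python
from collections import defaultdict


def filter_peak_density(peaks, window=100.0, keep=6):
    """Keep at most `keep` of the most intense peaks in each `window`-amu bin.

    peaks: iterable of (mass, intensity) tuples.
    Assumption: bins are fixed and start at 0 amu.
    Returns the surviving peaks sorted by mass.
    """
    bins = defaultdict(list)
    for mass, intensity in peaks:
        bins[int(mass // window)].append((mass, intensity))

    kept = []
    for members in bins.values():
        # Most intense first, then truncate to the allowed count.
        members.sort(key=lambda p: p[1], reverse=True)
        kept.extend(members[:keep])
    return sorted(kept)


if __name__ == "__main__":
    # Eight peaks crowd the 100-200 amu bin; one weak peak sits alone at 950 amu.
    demo = [(100.0 + 10 * i, float(i + 1)) for i in range(8)] + [(950.0, 0.5)]
    for mass, intensity in filter_peak_density(demo):
        print(mass, intensity)
```

With this scheme the weak but isolated high-mass peak survives, while the two weakest peaks in the crowded low-mass bin are dropped.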
Percent Intensity Matched
PIM is calculated beginning with the filtered peak list. Only masses that are greater than 200 amu and more than 50 amu below the precursor mass are allowed to contribute to PIM, because most masses outside this range are not sequence specific (Table II, rules 6 and 7).
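The text does not give an explicit formula for PIM, so the following is only a sketch under the assumption that PIM is the matched intensity divided by the total intensity of the eligible peaks, times 100. The function name and argument layout are hypothetical.

```python
def percent_intensity_matched(peaks, matched_masses, precursor_mass,
                              low_cut=200.0, high_margin=50.0):
    """Percent of eligible intensity explained by matched peaks.

    peaks: list of (mass, intensity) from the filtered peak list.
    matched_masses: set of observed masses (taken from `peaks`) that were
        assigned to a theoretical ion.
    Assumption: PIM = 100 * matched intensity / total eligible intensity.
    """
    # Only masses above 200 amu and more than 50 amu below the precursor count.
    eligible = [(m, i) for m, i in peaks
                if low_cut < m < precursor_mass - high_margin]
    total = sum(i for _, i in eligible)
    if total == 0:
        return 0.0
    matched = sum(i for m, i in eligible if m in matched_masses)
    return 100.0 * matched / total


if __name__ == "__main__":
    peaks = [(150.0, 100.0), (300.0, 30.0), (500.0, 70.0), (980.0, 50.0)]
    print(percent_intensity_matched(peaks, {150.0, 300.0, 980.0}, 1000.0))
```

In the demo, the peaks at 150 and 980 amu are matched but fall outside the eligible range, so they do not raise the score.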
Calculation of Fragment Ion ChemScore
The Fragment Ion ChemScore is designed to approximate the theoretical expected intensity for each ion type (see Refs. 13 and 14 for alternative methods to do this). In this article, the ChemScore for each fragment ion is calculated based on the rules listed in Table I. If desired, these assumptions could be tuned to the instrument type (electrospray versus MALDI), but in this article we have used the same settings for all spectra.
In this scheme, it is possible to count any desired ion type. In this article, we have counted only y ions, b ions, a ions, y-17 ions, b-17 ions, certain immonium ions, and y and b ions generated by neutral loss of 64 amu from oxidized Met.
The first principle is that y ions and b ions are the most important. y ions have been assigned an arbitrary score of 6000, versus 5000 for b ions, etc. (Table I, rules 9 and 10). The ChemScore for any ion that contains a Lys or Arg is further multiplied by 1.5 (Table I, rules 3 and 4). Because we and others (15, 16) have observed preferential fragmentation before Pro (in all ion series), and after Asp, and to a lesser degree Glu, the ChemScores of all such ions have been multiplied by the factors described in Table I, rules 5, 7, and 8. In addition, the ChemScores of all ions C-terminal to Pro have been decreased by 1.5-fold (Table I, rule 6), because these ions are less frequently detected (13).
We have observed that y-17 and b-17 ions are more intense when the N-terminal amino acid (aa) is Gln, possibly because of facile elimination of ammonia by cyclization; thus such ions are given the same ChemScore as the corresponding y or b ion (Table I, rule 14). The remaining (y-17) and (b-17) ions are given 20-fold more credit than other -17 ions if they contained R, H, or Q (Table I, rule 15). We have also observed unusually intense y-17 and b-17 ions after ICAT reagent labeling and acid cleavage; thus ICAT reagent-labeled Cys-containing (CICAT) fragments are dealt with the same as R, H, and Q (Table I, rules 16 and 17).
Most neutral losses of water have not been considered here, except for the case of N-terminal Glu, for which the resulting ions are often unusually intense (Table I, rule 18).
Because the b2 ion and a2 ion often seem to be more intense than other a and b ions, the ChemScore for the b2 ion has been multiplied by 1.5-fold, whereas the ChemScore for the a2 ion is multiplied by 6.6-fold (Table I, rules 19 and 20).
Finally, it has been observed that an ion having the molecular mass of the parent ion minus the C-terminal aa is often present, especially when the sequence contains an internal arginine (which in general makes fragmentation of any kind more difficult). In this report, the ChemScore of the b(n−1)+18 ion is counted the same as an ordinary b ion (Table I, rule 21).
In order to promote detection of ions, the timed ion selector in MALDI mode is typically relaxed so that small amounts of ions from the unselected HL pair are often detected. We decided to add these ions to the fragment list and assign them ChemScores 10-fold below the value for the ICAT reagent-labeled peptide that was selected (Table I, rule 22).
Differential ChemScore values have also been applied to the most common immonium ions. The immonium ion for His at 110 is usually the most reliable immonium ion, thus it was assigned a ChemScore of 100. This value is too low to have much impact on scoring, but these small values would break ties between peptides that otherwise seemed to be equally plausible. If the starting peptide samples were less complex, then the reliability of immonium ions for identification could be increased, but even small amounts of His-containing peptides appear to render the His 110 ion detectable (Table I, rules 25–40).
It has also been observed that when methionine sulfoxide is present, a second series of ions is observed that is 64 amu below the canonical y and b ions (Table I, rule 23). A similar neutral loss from the oxidized ICAT side-chain has also been observed (Table I, rule 24).
Calculation of Percent ChemScore Matched (PCM)
All of the masses in the filtered peak list that match to predicted ions contribute to PCM. The total ChemScore for the sequence in question is the sum of the Fragment Ion ChemScores of all of the considered ions. The ChemScore Matched is calculated by summing the Fragment Ion ChemScores for those ions that were matched to masses in the filtered peak list within tolerance (Table I, rule 42). Thus, PCM is calculated by:
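$$\mathrm{PCM} = 100 \times \frac{\sum_{m=1}^{\mathrm{MFM}} \mathrm{ChemScore}_m}{\sum_{n=1}^{ti} \mathrm{ChemScore}_n}$$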
where ti is the total number of ions considered, MFM is the number of ions matched, m is the set of matched ions, and n is the set of all ions. If two theoretical ions matched within tolerance to the same mass, then the Fragment Ion ChemScore for both ions was used. Because of the quantitative nature of the Fragment ChemScore index, PCM is not significantly affected by consideration of additional ions, so long as these additional ions get assigned low ChemScores. In contrast, consideration of large numbers of additional ions can by itself destroy the usefulness of the PIM term, because almost any mass can be explained by some ion.
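As a rough illustration of the PCM calculation, the sketch below sums the ChemScores of every theoretical ion that matches some observed mass and divides by the total ChemScore. The matching tolerances (500 ppm at or above 200 amu, 0.5 amu below) are the ones quoted in the Fragment TriScore section of this article; the function names and input layout are hypothetical.

```python
def within_tolerance(observed, theoretical):
    """Match tolerance: 0.5 amu below 200 amu, 500 ppm otherwise."""
    if theoretical < 200.0:
        return abs(observed - theoretical) <= 0.5
    return abs(observed - theoretical) / theoretical * 1e6 <= 500.0


def percent_chemscore_matched(theoretical_ions, observed_masses):
    """theoretical_ions: list of (ion_mass, chemscore) for the candidate peptide.

    Every theoretical ion that matches an observed mass contributes its
    ChemScore, even when two theoretical ions share one observed mass.
    """
    total = sum(score for _, score in theoretical_ions)
    if total == 0:
        return 0.0
    matched = sum(score for ion_mass, score in theoretical_ions
                  if any(within_tolerance(obs, ion_mass)
                         for obs in observed_masses))
    return 100.0 * matched / total


if __name__ == "__main__":
    ions = [(500.0, 6000), (600.0, 5000), (700.0, 1000)]
    print(round(percent_chemscore_matched(ions, [500.1, 700.0]), 2))
```

Because low-ChemScore ions add little to either sum, adding more minor ion types to `theoretical_ions` changes the result only slightly, which is the robustness property described above.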
Calculation of Peptide Fragment TriScore (FrT)
It is to be expected that the ions with the highest ChemScore (ChS) should correspond to the most intense ions. In addition, the observed masses are more credibly identified if they match the calculated fragment ion mass within experimental accuracy. To make this quantitative, we define PpmMin as a lower limit of mass error in parts per million (ppm) (Table I, rule 41). Ions that match to a tolerance below PpmMin get increasingly less additional credit for doing so. PpmMin is in principle an instrument-specific factor, although in this study a value of 400 ppm was used for both instruments. The upper limits for matching were set to 500 ppm for ions with >200 amu (Table I, rules 42 and 43) and to 0.5 amu for ions <200 amu (Table I, rule 44). To prevent a small number of ions from dominating FrT, it was arbitrarily decided to truncate the ChS and intensity (Int) parameters to the fourth highest value (Table II, rules 8 and 9). To calculate FrT, the theoretical ion list is sorted by decreasing ChS, and the peak list is sorted by decreasing Int. To normalize the value of FrT to the intensity distribution observed, the MaximumFrT was calculated as follows:
![]() |
where i is the number of elements in the shorter of the two lists. If i < 5, then MaximumFrT is calculated using:
![]() |
The MatchedFrT was then calculated for each matched fragment according to:
![]() |
where ThIM is the number of matched ions, and the list is sorted by decreasing ChS x Int. As with MaximumFrT, the equation for MatchedFrT may need to be reduced to suit the number of elements. Finally, FrT was calculated according to:
![]() |
so that the peptide would receive a value of 100 if the intensity distribution of ions exactly matched the theoretical distribution postulated above, and with a mass accuracy of <<PpmMin. Significantly lower scores result whenever the intensity distribution of ions does not conform to expectations. Note that FrT can still have a high value (near 100) if a small number of ions are detected, so long as these ions are predicted to be the most intense. This is one reason why no match is considered plausible unless the sum of b and y ions matched exceeds 3.
The FrT term acts to buffer against devaluation of the PIM term by random matching of intense masses to minor ion types, because when this takes place, although PIM increases, FrT decreases significantly.
Internal Calibration of MSMS Spectra
In order to improve the internal accuracy of the fragment ions for the MALDI spectra, for each tentative identification, masses were selected to be used to calibrate the remaining fragments. To accomplish this, two masses corresponding to y or b ions were sought that were as intense as possible, and also well separated so that a slope measurement derived from them would be as accurate as possible. The rules to find appropriate masses are listed in Table II (rules 1–5). First, a low-molecular-mass fragment was sought that had a mass greater than 0.2 times the mass of the parent ion, and no greater than 0.4 times the mass of the parent ion. The second mass was then selected from the fragments >0.6 times the parent mass using the rules in Table II. A two-point calibration was then performed. If no appropriate masses could be found, then no calibration was performed.
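A minimal sketch of such a two-point calibration follows. The full selection rules are in Table II and are not reproduced here, so this sketch simply assumes the most intense matched y or b ion in each mass range is chosen; the function name is hypothetical.

```python
def two_point_calibration(matched, parent_mass):
    """Build a linear mass correction from two matched y or b ions.

    matched: list of (observed_mass, theoretical_mass, intensity).
    Assumption: within each mass range the most intense ion is used.
    Returns a function observed -> calibrated mass, or None when no
    suitable pair exists (in which case no calibration is applied).
    """
    low = [m for m in matched
           if 0.2 * parent_mass < m[0] <= 0.4 * parent_mass]
    high = [m for m in matched if m[0] > 0.6 * parent_mass]
    if not low or not high:
        return None

    lo_obs, lo_theo, _ = max(low, key=lambda m: m[2])
    hi_obs, hi_theo, _ = max(high, key=lambda m: m[2])

    # Straight line through the two (observed, theoretical) points.
    slope = (hi_theo - lo_theo) / (hi_obs - lo_obs)
    intercept = lo_theo - slope * lo_obs
    return lambda mass: slope * mass + intercept


if __name__ == "__main__":
    ions = [(300.03, 300.0, 50.0), (350.0, 350.1, 10.0), (800.08, 800.0, 80.0)]
    calibrate = two_point_calibration(ions, parent_mass=1000.0)
    print(round(calibrate(550.055), 3))
```

The two ranges cannot overlap, so the slope is always defined when both calibrants are found.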
Intensity Weighted Mass Error (ppw)
So that matches to low-intensity ions do not distort the mass error term, intensity-weighting was performed. Because the immonium ion region was often poorly calibrated after this procedure, no fragments less than a value of 200 amu were allowed to contribute to ppw.
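The exact formula for ppw is not given in this section. The sketch below assumes it is the intensity-weighted mean of the absolute ppm errors of the matched fragments, with fragments below 200 amu excluded as described; the function name is hypothetical.

```python
def intensity_weighted_ppm(matches, low_cut=200.0):
    """Intensity-weighted mean absolute mass error in ppm.

    matches: list of (observed_mass, theoretical_mass, intensity).
    Assumption: ppw = sum(Int * |ppm error|) / sum(Int) over fragments
    at or above `low_cut` amu. Returns None if nothing is eligible.
    """
    weighted_error = 0.0
    total_intensity = 0.0
    for observed, theoretical, intensity in matches:
        if theoretical < low_cut:
            continue  # immonium region is excluded
        ppm = abs(observed - theoretical) / theoretical * 1e6
        weighted_error += intensity * ppm
        total_intensity += intensity
    if total_intensity == 0:
        return None
    return weighted_error / total_intensity


if __name__ == "__main__":
    demo = [(100.05, 100.0, 1000.0),   # excluded: below 200 amu
            (500.05, 500.0, 90.0),     # 100 ppm error, intense
            (1000.0, 1000.0, 10.0)]    # exact match, weak
    print(round(intensity_weighted_ppm(demo), 3))
```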
Calculation of Overall Score (MatSc)
The overall score (MatSc) is a compound index that includes contributions from many of the parameters that can be used to judge the quality of an identification. It is not yet clear how these parameters should be optimally combined, as each of the individual parameters has limitations under certain conditions (for example, when the peak list is too large). Sophisticated mathematical techniques have been used by others to optimize the weighting of such parameters (17).
In this article, MatSc is calculated according to:
![]() |
where PpmMin is the minimum ppm value below which matches are of no greater significance. In these calculations, a PpmMin of 400 ppm was used. This seems rather high, but there were a few identifications that by all other criteria appeared to be correct for which this value was appropriate. In most instances, MatSc would have higher discriminating power if PpmMin were around 50 ppm. Note that the PpmMin term can break ties between peptides that are identical except for Lys versus Gln (0.036 amu difference) even when PpmMin is 400 ppm.
Calculation of SeqString
The SeqString describes how the ions that match the proposed peptide are dispersed along the length of the sequence of the peptide. It is not used for spectrum classification, but is a useful visual guide. A similar scheme has been used previously by others (David Fenyo, personal communication). Each ion type is assigned a score, as listed in Table I, rules 9–13, column SeqString. Each peptide bond in the sequence is assigned the sum of those scores, with the exception that the sum must not exceed a value of 9. The most meaningful way to interpret the SeqString is to place it in TrueType font directly over the sequence to which it corresponds, shifted by half a space.
For example, the SeqString 9004555929 might correspond to the sequence ACDEFGHILMK. It could be displayed as:
![]() |
This would indicate that both b and y ions were found that support the AC peptide bond, that no ions were found to corroborate the CD and DE bonds, a b ion corroborates the EF bond, etc.
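The construction can be sketched in a few lines. The per-ion scores come from Table I, which is not reproduced here, so the sketch assumes a b ion scores 4 and a y ion scores 5; these values are only inferred from the example above (9 for a bond supported by both b and y, 4 for a b ion alone). Other ion types are omitted, which is why a digit such as the 2 in the example cannot be produced by this sketch. All names are hypothetical.

```python
# Assumed scores, inferred from the worked example; see Table I for the real ones.
ION_SCORES = {"b": 4, "y": 5}


def seq_string(sequence, matched_ions):
    """One digit per peptide bond, capped at 9.

    matched_ions: iterable of (ion_type, index), e.g. ("b", 1) or ("y", 3).
    A b_k ion supports bond k; a y_j ion supports bond len(sequence) - j,
    with bonds numbered from 1 at the N terminus.
    """
    n = len(sequence)
    bonds = [0] * (n - 1)
    for ion_type, index in set(matched_ions):
        bond = index if ion_type == "b" else n - index
        bonds[bond - 1] = min(9, bonds[bond - 1] + ION_SCORES[ion_type])
    return "".join(str(v) for v in bonds)


def interleave(sequence, digits):
    """Plain-text view that places each digit between the residues it joins."""
    return "".join(aa + d for aa, d in zip(sequence, digits)) + sequence[-1]


if __name__ == "__main__":
    ions = [("b", 1), ("y", 10), ("b", 4), ("y", 6), ("y", 5), ("y", 4),
            ("b", 8), ("y", 3), ("b", 10), ("y", 1)]
    digits = seq_string("ACDEFGHILMK", ions)
    print(digits)
    print(interleave("ACDEFGHILMK", digits))
```

The interleaved view is an alternative to the half-space shift, which cannot be rendered in plain monospaced text.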
Counting y and b Ions
Probably the simplest way of assessing the quality of a database identification is to count the number of y and b ions matched (see Ref. 18 for an alternative way to do this). If this number is small (e.g. 3), then the identification is uncertain. If three y + b ions match, and the peak list contains only three or four masses, then the identification might be correct. If the peak list is much larger, then the number of b and y ions matched is no longer so useful, and it must be balanced with a term like % Intensity Matched. In this report, four or more y + b ions were required for all classifications of identified spectra. Table II, rules 10–12 list several additional considerations that apply to counting ions.
CRUDE RESULTS
Table S6 contains each of the peak lists that were automatically extracted for each of the 73,009 spectra in Table S3. This is an enormous table, and three examples of peak lists corresponding to two distinct spectra (one of which was obtained manually from the peak list in Table S6) are shown in Table VI.
Tracking
The tracking fields include such parameters as spectrum number SpID; instrument type EM, where E designates electrospray and M designates MALDI; experiment number Ex; elution time (electrospray) or well number (MALDI), designated TiW; the ion exchange fraction number Fr; and a plate number (MALDI) or replicate number (electrospray) Pla. An experiment consists of all data from the same initial ICAT reagent labeling reaction and is instrument-specific. In some cases, replicate electrospray HPLC runs were performed on the same IEX fraction, resulting in the need for Pla. Because in some cases one HPLC run was collected on more than one plate, there is also an HPLC run index EID. Whenever the same spectrum was submitted to more than one database search, MatSc_Calc was used to resolve sequence discrepancies, using SpID as a key field.
Parent Ions
The parent ion fields include the recalibrated parent ion mass CalMW; its intensity IntP; the mass of the corresponding peptide sequence PepMW; and the difference between these two masses in ppm Df. A fifth field defines the mass bin MB of the parent ion, which is obtained by dividing the parent ion mass by 1.0005, and then rounding to the nearest integer. MB is especially useful for comparing the results of multiple experiments, or adjacent ion exchange fractions, and is particularly useful as an index. Because a large number of identifications were made to peptides where CalMW-PepMW was nearly equal to a small integer, another pair of fields was calculated to classify these spectra. The mass bin class field MBC corresponds to the integer itself. When MBC = 1, CalMW is about 1 amu larger than PepMW. This happens if the peptide is deamidated, or if CalMW was incorrectly assigned to the second isotope of the isotope cluster, rather than to the monoisotopic mass. This is more likely to happen at higher peptide molecular masses where the monoisotopic mass is harder to distinguish. Because deamidation is more interesting than a mistake in de-isotoping, the mass difference in ppm DfC was calculated, assuming an ideal mass difference of 0.984 amu, which is the mass difference between an acid and an amide. If the explanation is de-isotoping, then the mass difference should be 1.0033 amu, which is the mass difference between 13C and 12C. Spectra were also classified according to mass accuracy class MAc, where class 1 is defined as within 20 ppm for MALDI spectra or within 80 ppm for electrospray spectra. Class MAc 2 is defined as within 80 ppm (MALDI only), whereas class 3 is defined as within 0.5 amu. Spectra of class 4 are off by more than 3.5 amu, because any mass difference between 0.5 amu and 3.5 amu would be grouped in a different MBC prior to calculation of MAc. Identifications with MAc 4 may be correct if the wrong isotope cluster was assigned to the spectrum, or nearly correct if the peptide contains an unassigned chemical modification.
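The mass bin fields can be sketched as follows. The divisor 1.0005 and the ideal offsets 0.984 amu and 1.0033 amu are taken from the text; treating MBC as the rounded value of CalMW minus PepMW is an assumption, and the function names are hypothetical.

```python
DEAMIDATION_SHIFT = 0.984   # amu, acid minus amide (value from the text)
ISOTOPE_SHIFT = 1.0033      # amu, second isotope peak (value from the text)


def mass_bin(mass):
    """MB: parent mass divided by 1.0005, rounded to the nearest integer."""
    return round(mass / 1.0005)


def mass_bin_class(cal_mw, pep_mw):
    """MBC: assumed here to be the nearest integer to CalMW - PepMW."""
    return round(cal_mw - pep_mw)


def ppm_difference(cal_mw, pep_mw, ideal_shift=0.0):
    """ppm error after removing an expected mass shift.

    ideal_shift=0.0 corresponds to Df; ideal_shift=DEAMIDATION_SHIFT
    corresponds to DfC for spectra with MBC = 1.
    """
    return (cal_mw - pep_mw - ideal_shift) / pep_mw * 1e6


if __name__ == "__main__":
    cal_mw, pep_mw = 2000.984, 2000.0
    print(mass_bin(cal_mw), mass_bin_class(cal_mw, pep_mw))
    # Near 0 ppm under the deamidation hypothesis, about -9.65 ppm
    # under the de-isotoping hypothesis.
    print(round(ppm_difference(cal_mw, pep_mw, DEAMIDATION_SHIFT), 2))
    print(round(ppm_difference(cal_mw, pep_mw, ISOTOPE_SHIFT), 2))
```

At 2000 amu the two explanations differ by roughly 10 ppm, so a well-calibrated parent mass can help distinguish deamidation from a de-isotoping error.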
In most cases, in the MALDI experiments, masses that were identified with high confidence were used to calibrate the spot in which the MS/MS spectrum was collected, as well as immediately adjacent spots. Such masses have a value of 1 in the Cal field. Thus, if the ppm for the parent ion is 0, it was probably used as a calibrant. In a few cases, the ppm difference rounded to 0 to four significant figures even when the mass in question was not used as a calibrant.
MS/MS Ions
The MS/MS fields include the Mascot score Sc; the overall match score MatSc, defined above; the number of y and b ions matched yb; SeqString defined above; and the four major components of MatSc: namely, Percent ChemScore Matched PCM, Percent Intensity Matched PIM, intensity-weighted average ppm deviation ppw, and Fragment TriScore FrT. The number of masses detected prior to peak filtering MFD, after peak filtering MFU, the number of masses matched MFM, and the number of theoretical ions matched ThIM are also listed. Note that it is possible for more than one theoretical ion to match the same mass, and vice versa. IntD lists the sum of the intensity of the ions counted in MFU. Note that all of these scores are dependent on peak detection. In an ideal world, it would be possible to detect peaks reliably and reproducibly. In these experiments, the peak lists for the MALDI data are usually reliable; that is, upon manual inspection, the peaks deposited in the database appear to reflect faithfully what one would expect from careful inspection of the raw spectra. For the electrospray data, however, it is difficult to define peak detection, de-isotoping, and de-charging parameters that result in a good peak list for all spectra. For this reason, we spent a lot of time validating electrospray identifications, as is regular practice. We hope to develop more robust peak extraction methods in the future.
HL Fields
The HL fields include the measured heavy-to-light ratio HL. In some cases, the ratio was not measurable because no HL partner was detected. HL_S summarizes how many possible HL partners were detected. HL_S has six digits, each of which can be either 0 or 1. If the first position = 1, then a potential HL partner was detected about 9.03 amu above the parent mass. The remaining five positions of HL_S correspond to the presence or absence of a peak at +18, +27, -9, -18, and -27 amu relative to the parent mass in question. Under ideal circumstances, the HL pair would be nonambiguous, meaning there would be only one non-zero digit in HL_S, which would correspond to the number of CICAT residues in the peptide. In this case, the binary field HL_1 is set to 1. We have included as "identified" (Y1 class 1, see below) only those peptides in which HL_S is consistent with the number of Cys residues in the corresponding sequence. A second parameter, Ippm, is the ppm difference between the observed masses of the HL pair and the theoretical mass difference of the HL pair (not shown in Table III). When the HL pair was ambiguous, the value for Ippm listed corresponds to the listed sequence. A final HL parameter is HLI_S. It is similar to HL_S, but lists instead whether there are any interfering masses within 2 amu of an HL pair at +9, +18, +27, -9, -18, or -27. When HL_S is nonambiguous, and HLI_S is all zeroes, then the ratio of the HL pair is more reliable (depending on the intensity of the peaks), and the binary field HLI_1 is set to 1. When HL_S is ambiguous, the exact value of the HL ratio may be unknowable, because more than one peptide may contribute to the intensity of either the c0 form of the ICAT reagent or the c9 form of the ICAT reagent. This problem is lessened if it turns out that the ambiguous ICAT reagent-labeled mass belongs instead to a nonoverlapping HL pair.
For example, if HL_S was 111000 and the corresponding peptide had one Cys, the potentially confounding masses could correspond to a second unrelated HL pair whose masses were 18 and 27 amu above the first peptide's mass. Because there are slight differences in HL ratio between experiments due to mixing inaccuracies, the HL values must be normalized. In the five experiments listed here, these normalization constants were small (between 0.855 and 1.11, see Table V, column Nor).
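As an illustration of how the HL_S flags can be read, the following minimal sketch decodes the six digits and derives an HL_1-style flag. The digit order (+9, +18, +27, -9, -18, -27) is an assumption taken from the HLI_S description, and the function names `decode_hl_s` and `hl_1` are hypothetical, not part of the original database.

```python
# Offsets (amu, relative to the parent mass) assumed for the six HL_S digits.
# The order is an assumption based on the HLI_S description in the text.
HL_OFFSETS = (+9, +18, +27, -9, -18, -27)


def decode_hl_s(hl_s):
    """Return the offsets at which a potential HL partner was detected."""
    if len(hl_s) != 6 or set(hl_s) - {"0", "1"}:
        raise ValueError("HL_S must be six digits, each 0 or 1")
    return [off for digit, off in zip(hl_s, HL_OFFSETS) if digit == "1"]


def hl_1(hl_s, n_cys):
    """1 if exactly one partner was detected and it lies 9 * n_cys amu away.

    This mirrors the nonambiguous case described in the text; the sign is
    ignored because the parent may be either the light or the heavy member.
    """
    offsets = decode_hl_s(hl_s)
    return int(len(offsets) == 1 and abs(offsets[0]) == 9 * n_cys)
```

With these assumptions, `decode_hl_s("111000")` reports partners at +9, +18, and +27 amu, so a one-Cys peptide with that HL_S value would not receive HL_1 = 1.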
Peptide and Protein Fields
These fields indicate which peptide sequence Seq was identified, and which protein was matched to that sequence, which is designated by (usually) a Swiss-Prot accession number Acc. The amino acid preceding the peptide is listed in < (SeqA in Access), whereas the amino acid following the peptide is listed in > (SeqZ in Access). If field "<" is empty, then the peptide was N terminal; similarly for field ">" and C-terminal peptides. When modified amino acids were detected, the normal capital letter abbreviation for the amino acid is altered in field Seq, whereas DSeq corresponds exactly to the sequence in the database (not shown in Table III). Table IVA lists all of the single-letter amino acid codes for altered amino acids in column Sym, where column RMW lists the molecular mass difference between the modified aa and the natural aa, or in the case of modified CICAT peptides the molecular mass difference between the modified and unmodified form of the CICAT residue. For example, "C" refers to the c0 form of the ICAT reagent, whereas "c" refers to the c9 form. "Z" refers to unmodified Cys. If an Asn or Gln has been deamidated, it is labeled as the corresponding acid residue in Seq, but not in DSeq. In some cases, the accession numbers are PIR accession numbers from csc-fserve.hh.med.ic.ac.uk/delphos.html (referred to hereafter as DelPhos) because there was no corresponding protein in Swiss-Prot. Table IVA also lists how many spectra are classified into each grouping, either at high confidence (field HiMac), at high confidence but also considering different MBC bins (field High), or at any confidence (field Low).
Overall Fields
Two final parameters, Y1 and Y2, group spectra into classes of overall reliability. The most important class is Y1 class 1, which corresponds to the highest confidence category of CICAT peptides. Table IVB lists each of these classes. Note that most of the classifications of spectra according to fields Y1 and Y2 depend on the classifications in fields ChM, MAc, NT, HL_S, HL, yb, MatSc, or Sc. Y1 classes 8 and 9 are special exceptions that correspond to Cys-modified tryptic peptides that derive from trypsin or human keratin, respectively, and therefore do not contribute to any of the yeast peptide or protein statistics. Column No. lists how many spectra are in each Y1 class. Field Y2 groups spectra according to how they are modified. Y2 class 1 refers to CICAT peptides, whereas Y2 class 3 corresponds to Lys-modified (KICAT) peptides. Y2 class 2 refers to peptides that are not modified at all, whereas Y2 class 4 peptides have an unalkylated Cys. In most cases, this appears to be a result of ISD. Y2 class 5 spectra are matched to peptides with chemical modifications that correspond to more than one of classes 1–4, whereas Y2 class 0 spectra have fewer than three y or b ions matched and are therefore essentially unmatched. Note, however, that such a spectrum could be matched with high confidence due to co-migration with a second spectrum with a high Mascot score (Sc) and yb. In addition, in MALDI experiments, such spectra could become identifiable after subsequent MS/MS experimentation.
Data Processing
Low Stringency Identifications
The first task in extracting information from proteomics experiments is to determine which identifications are to be considered reliable. We chose to start with identifications using either ProICAT (for electrospray samples) or Mascot (for MALDI samples). Normally, threshold criteria are chosen that appear to exclude the bulk of the incorrect assignments while retaining the bulk of the correct assignments. In these experiments, we were particularly interested in studying the borderline identifications, and therefore have collected statistics on many identifications that would normally be discarded. We decided upon four minimal requirements for tentative identification: 1) at least four y and b ions must match within 300 ppm; 2) the sequence must be at least 6 aa long; 3) MatSc must exceed 5000; and 4) the tentative sequence must match the spectrum in question better than any other considered sequence. This last requirement requires some judgment regarding which chemical modifications are at all plausible for consideration. Generally a chemical modification is considered reasonable only if the same modification has been found on a peptide that was matched with high confidence to a high-quality spectrum from within the same experiment, or alternatively if the unmodified form of the peptide in question is known to be so abundant that nearly any chemical modification seems plausible. There is a special category of spectra that we excluded from further consideration: they consist of spectra that by the criteria of MatSc and yb seem to match a certain sequence, yet manual inspection of the spectrum indicates that the identification is not credible (Y1 class 99). In some cases, this results because the spectrum is so weak that the process of extracting a peak list failed. In other cases, examination of the spectrum indicates that the substance that was fragmented does not correspond to a peptide at all and is probably derived from contamination. 
In a few cases, the spectrum seems to correspond to a peptide, but has intense ions that cannot be explained easily by the proposed sequence. This process has been selectively applied to the data, with special attention paid to spectra that uniquely define proteins. Therefore, there are surely still examples of spectra that are matched to peptide sequences that can be shown by manual examination to correspond to other sequences, or substances other than peptides. Next, we classified spectra as 1) ICAT reagent modified on Cys (CICAT), 2) not ICAT reagent labeled, 3) ICAT reagent modified on Lys (KICAT), or 4) incompletely alkylated. These classifications correspond to Y2 classes 1–4.
Using the criteria for tentative identification described above, there are 12,249 identifications in Y2 class 1, corresponding to 2181 distinct CICAT peptides, or 1029 distinct proteins, some of which are surely misidentifications (Fig. 2A).
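The four minimal requirements for tentative identification can be expressed as a simple predicate. This is only a sketch: the `Match` record and `is_tentative` function are hypothetical names, and using MatSc to decide which candidate sequence "matches better" is an assumption, since the text notes that this step also involves judgment about plausible modifications.

```python
from dataclasses import dataclass


@dataclass
class Match:
    seq: str       # candidate peptide sequence
    yb: int        # number of y and b ions matched within 300 ppm
    mat_sc: float  # MatSc for this spectrum/sequence pairing


def is_tentative(match, competitors=()):
    """Apply the four minimal requirements for a tentative identification.

    `competitors` are other candidate sequences considered for the same
    spectrum; comparing them by MatSc is an assumption of this sketch.
    """
    return (
        match.yb >= 4                  # 1) at least four y and b ions
        and len(match.seq) >= 6        # 2) sequence at least 6 aa long
        and match.mat_sc > 5000        # 3) MatSc must exceed 5000
        and all(match.mat_sc > c.mat_sc for c in competitors)  # 4) best match
    )
```

For instance, a nine-residue candidate with yb = 6 and MatSc = 36,188 would pass, provided no competing sequence scores higher for the same spectrum.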
It is not possible to derive the optimized list of proteins until all of the relevant data are consolidated together (see Ref. 19 for a discussion of this issue). In this report, the relevant data include identifications from all five experiments. Another complication is that some peptides are virtually identical by mass spectrometry (I versus L, and combinations of amino acids that add up to the same mass without distinguishing backbone fragment ions). Because of this, there is a danger that two distinct peptide sequences could get mapped to indistinguishable MS/MS spectra. To derive an optimized protein list, we submit the entire list of identified CICAT peptides to the Mascot search engine in the form of theoretical y ion spectra. Mascot then automatically selects the smallest number of distinct protein sequences that can explain the spectra. Later, the sequences are mapped to the other tracking fields like codon bias using the ICAT dictionary.
To generate our theoretical ICAT dictionary, we started with the yeast proteome at genome-ftp.stanford.edu/pub/yeast, updated 1/2003. In this database, there are 23,308 distinct Cys-containing peptides with no missed trypsin cleavage sites that contain at least 6 aa and have a MW of <4000. Of these, 481 (2%) are encoded by more than one distinct protein. The ICAT dictionary has a degeneracy field that enumerates how many genes encode each peptide. After the list of these sequences was generated, the sequences were grouped using an SQL query so that each unique sequence would be paired with the protein accession number that corresponded to the protein with the highest codon bias, as well as a number that listed how many distinct genes encoded the peptide. In some cases, it was necessary to add new sequences to the ICAT dictionary to accommodate additional sequences that were derived from other databases, for example, the Mascot nonredundant database. The final protein list contains Swiss-Prot accession numbers whenever possible, and PIR-derived "S" numbers from the Mascot nonredundant database when the proteins in question were not in Swiss-Prot. Extensive use was made of the web site csc-fserve.hh.med.ic.ac.uk/delphos.html to resolve questions of protein identity. At this web site, one can input a query sequence and rapidly obtain a list of matching proteins from public databases. The codon adaptation indices were downloaded from genome-ftp.stanford.edu/pub/yeast/data_download/protein_info/protein_properties.tab. Efforts were made to match Swiss-Prot accession numbers to the "Orf" field in this table by sequence.
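The construction of the ICAT dictionary can be sketched as an in-silico digest followed by a grouping step. The code below is an illustration under stated assumptions rather than the procedure actually used: it cleaves after Lys or Arg but not before Pro (the exact cleavage rule is not given in the text), computes unlabeled monoisotopic masses (the text does not say which mass the MW limit refers to), and replaces the SQL grouping query with a Python dictionary. The function names and the toy inputs in the usage note are hypothetical.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(seq):
    """Peptides with no missed cleavages (cut after K/R, not before P)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", seq) if p]


def peptide_mass(pep):
    """Unlabeled monoisotopic mass of a peptide (assumption of this sketch)."""
    return sum(MONO[aa] for aa in pep) + WATER


def build_icat_dictionary(proteome, codon_bias, min_len=6, max_mw=4000.0):
    """Map each Cys peptide to (accession with highest codon bias, degeneracy).

    proteome: dict of accession -> protein sequence
    codon_bias: dict of accession -> codon bias value
    """
    owners = {}
    for acc, seq in proteome.items():
        for pep in set(tryptic_peptides(seq)):
            if "C" in pep and len(pep) >= min_len and peptide_mass(pep) < max_mw:
                owners.setdefault(pep, set()).add(acc)
    return {
        pep: (max(accs, key=codon_bias.get), len(accs))
        for pep, accs in owners.items()
    }
```

Given two toy proteins that share one Cys-containing peptide, the shared peptide is assigned to the accession with the higher codon bias and receives a degeneracy of 2, which mirrors the degeneracy field described above.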
Identifications for Quantification
When the goal is to determine which proteins have changed in expression level, it is desirable to weed out all confounding data. To accomplish this, only measurements that corresponded to fully tryptic CICAT modifications were included in further quantification analyses. As before, at least four y and b ions were required for a match, but in addition MatSc ≥ 10,000 and an Sc of at least 20 were required. In addition, internally consistent mass accuracy was required; for MALDI experiments, the mass accuracy requirement was 20 ppm; for electrospray experiments, the mass accuracy requirement was 80 ppm (MAc = 1). Had we taken greater trouble to internally calibrate the QStar HPLC runs, the electrospray mass accuracy could have been additionally constrained, as there was typically <25 ppm systematic error in the electrospray measurements. Next, measurements were eliminated for which no HL data were extractable (using ProICAT data for QStar, using the parent spectra for MALDI data). At this point, 7441 spectra still qualify (Fig. 2B). Finally, MALDI measurements were excluded if there were two possible HL pairs that could contribute to the measured HL ratio, located 9, 18, or 27 amu above or below the HL pair that corresponded to the sequence in question (see HL_S in Table IIID), or if there were confounding isotope clusters (see HLI_S in Table IIID). In addition, 24 peptide identifications were eliminated because they had ratios that were either <0.2 or >5. Manual examination of these measurements indicated that they were measured incorrectly.
Reduction to Unique HL Measurements
The list now contains 5668 predominantly credible measurements (Fig. 2C), but in many cases these measurements are not independent because in some cases the measurements derive from each member of an HL pair, or from multiple charge states of the same HL pair. The ProICAT software automatically reduces these measurements to unique measurements, but this feature was overridden so that statistics on individual spectra could be collected. To restore uniqueness, electrospray HL measurements were binned by time and peptide sequence by dividing the elution time in minutes by 3, rounding to the nearest integer, and multiplying by 3 to get a time bin integer. In most cases, the HL ratio for such binned measurements was the same as each individual measurement because the HL ratio that was deposited into the database had been generated by the ProICAT software using similar logic. The MALDI measurements were binned by spot and sequence. Each bin was considered a separate measurement.
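The time-binning rule for electrospray measurements can be written out directly. The sketch below follows the arithmetic stated above (divide by 3, round, multiply by 3); the function names and the record layout are hypothetical. Note that Python's built-in `round` resolves exact halves to the nearest even integer, which may differ from the rounding used in the original database.

```python
from collections import defaultdict


def time_bin(elution_min, width=3):
    """Divide by the bin width, round to the nearest integer, multiply back."""
    return int(round(elution_min / width)) * width


def unique_measurements(records):
    """Group (seq, elution_min, hl_ratio) records by sequence and time bin.

    Each (seq, time bin) key is treated as one separate measurement.
    """
    bins = defaultdict(list)
    for seq, elution_min, hl in records:
        bins[(seq, time_bin(elution_min))].append(hl)
    return dict(bins)
```

For MALDI data, the same grouping would use the spot identifier in place of the time bin.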
Upon accomplishing this, the 5668 credible peptide identifications were reduced to 3726 measured HL ratios (Fig. 2C), which are attributable to 1361 distinct peptides and 702 different proteins. A total of 496 of these proteins had two different HL ratio measurements, but only 57 of them had standard deviations of >0.2. Manual examination of a large amount of discrepant data indicates that in most cases the reason for the large standard deviation has to do with overlapping isotope clusters (largely excluded previously) or low signal intensity. A second category of discrepancy may be based on low levels of reversible or isomeric chemical modifications to ICAT reagent-labeled peptides that cause MS/MS-identical peptides to separate by either IEX or reversed-phase chromatography, sometimes with distinguishable ratios. In this case, the most reliable measurement is from the major form of the peptide. A small minority of discrepancies may be due to false identifications or, more interestingly, biological complexity at the level of the proteome, based on multiple protein forms. The first two problems can be eliminated by assigning a single measurement per peptide per experiment based on an intensity-weighted average for that peptide. However, this removes much of the valuable redundancy in the data and should only be performed to test the seriousness of the third problem.
At this point, automated processes are not able to deal with all of the issues described above. Software continues to evolve by optimizing for the selection of precursors that are not compromised by confounding overlapping ion clusters. The soundest conclusions derive from examination of the data, first at the level of the identifications and then at the level of the spectra themselves in interesting cases. Proteins whose mean HL ratio is distinct from 1 thus need to be evaluated manually in cases where the changes are subtle, especially if the data are discordant.
Calculation of Peptide-specific and Protein-specific HL Ratios
To calculate the appropriate ratio for each peptide, the individual measurements for each peptide were combined. Distinct measurements for oxidized Met or Trp (+16 only) and unmodified Met or Trp were combined, and also for N-terminal pyroglutamic acid and unmodified Gln. First, each ratio was converted to a natural logarithm, and then an intensity-weighted average was calculated. Then, the logarithm of the ratio was converted back into the ratio. For proteins, the exact same methodology was used, but results were combined using Acc rather than the peptide sequence. The standard deviation was calculated using a complex formula that takes into account the intensity of the individual measurements (11).
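The averaging step described above amounts to an intensity-weighted geometric mean. A minimal sketch follows, assuming each measurement is supplied as a (ratio, intensity) pair; the function name is hypothetical, and the intensity-dependent standard deviation of Ref. 11 is not reproduced here.

```python
import math


def combined_ratio(measurements):
    """Intensity-weighted average of ln(HL), converted back to a ratio.

    measurements: iterable of (hl_ratio, intensity) pairs with positive
    ratios and a positive total intensity.
    """
    measurements = list(measurements)
    total = sum(intensity for _, intensity in measurements)
    mean_log = sum(i * math.log(r) for r, i in measurements) / total
    return math.exp(mean_log)
```

Averaging in log space treats a 2-fold increase and a 2-fold decrease symmetrically: two equally intense measurements of 2.0 and 0.5 combine to 1.0 rather than the arithmetic mean of 1.25. The same function applies at the protein level when measurements are grouped by Acc instead of by peptide sequence.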
Comparison of Elution Profiles
In each of the five HPLC experiments, the peptides were subjected to cation exchange chromatography followed by reversed-phase chromatography. We intentionally varied the ion exchange separations somewhat between experiments in an effort to find an optimal separation protocol. We were interested in determining why different peptides were identified in different experiments. For example, in order to compare the peptides that were identified from electrospray experiment 2 fraction 15 to the most similar fraction from MALDI experiment 5, we first determined that many of the same peptides were also identified in fraction 15 from experiment 5. To get at the question of reproducibility, we then compared the sequences and elution times for one HPLC experiment to the corresponding data for the second experiment. These comparisons were performed using SQL queries starting from replicates of Table III, using the appropriate EID values as filters, and the Seq parameter as a key (See Fig. 6A and Table XIV). The comparisons were more meaningful when the identifications were limited so that each peptide was paired to a single elution time; namely, the time that corresponded to the highest MatSc. Using this technique, it is possible to ask whether parent masses that were selected for fragmentation in one experiment are detectable in the second experiment, because of knowing exactly where to look. As expected, we usually found that corresponding parent masses were mapped to corresponding peptide sequences. When this was not the case, we examined the spectra and corrected the misidentified spectrum.
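The comparison of two runs can be sketched without SQL as follows. Each peptide is first reduced to the elution time of its highest-MatSc spectrum, and the two runs are then joined on the sequence. The function names and the tuple layout (Seq, elution time, MatSc) are assumptions made for illustration; the original analysis used SQL queries on replicates of Table III filtered by EID.

```python
def best_elution(rows):
    """Keep, per peptide, the elution time of its highest-MatSc spectrum.

    rows: iterable of (seq, elution_time, mat_sc) tuples from one run.
    """
    best = {}
    for seq, elution_time, mat_sc in rows:
        if seq not in best or mat_sc > best[seq][1]:
            best[seq] = (elution_time, mat_sc)
    return {seq: t for seq, (t, _) in best.items()}


def shared_peptides(run_a, run_b):
    """Join two runs on Seq; return (time_a, time_b, seq) sorted by time_a."""
    a, b = best_elution(run_a), best_elution(run_b)
    return sorted((a[s], b[s], s) for s in a.keys() & b.keys())
```

Plotting the first two elements of each returned tuple against one another gives the kind of elution-order comparison discussed for paired runs.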
RESULTS
Each spectrum was matched whenever possible to yeast peptides using Mascot or ProICAT, followed by Mascot as a search engine. A great deal of effort was spent in determining a plausible sequence that could explain high-quality spectra. In many cases, spectra matched chemically modified forms of abundant peptides (see Table IVB). In most cases, these chemical modifications were either oxidations of Met, Trp, or the ICAT reagent itself. Whenever such a modification was found, much of the original data was submitted to another round of database searching with Mascot, with altered settings so that additional instances of the same modification could be identified. This had the effect of increasing the total number of spectra that could be identified without increasing the number of peptides or proteins that could be matched. In some cases, spectra that had originally been matched to ICAT peptides with low confidence could be better explained by modified abundant peptides. In most cases, MatSc (described above) correctly determined which identification was most plausible. In a few cases, manual examination of the spectra indicated that these identifications were unlikely to be correct (due to strong peaks that could not be explained by known fragmentation patterns), and these spectra were manually excised from the list of identified spectra (Y1 class 99). By this means, 15,233 CICAT tryptic peptide identifications from yeast have been made that pass a minimum threshold of credibility (MatSc > 5000, at least four b or y ions matched, MAc = 1), and that could not be attributed to alternative sequences by automatic means (Table V).
When the criteria for matching were tightened (requiring in addition Sc > 20, NT = 2, MBC = 0, 0.2 < HL < 5), this list was reduced about 2-fold to 6337 identifications (Table V).
Example of a Borderline Identification
Table VI contains three peak lists corresponding to two different spectra. The first list corresponds to MALDI spectrum 52964 (Table III, spectrum 4), which was matched to peptide ITLHVDcLR from imidazoleglycerol-phosphate dehydrogenase, also known as His3p (see "Biological Significance" below). This is an example of a noisy, borderline MALDI spectrum (Fig. 3A) that has a low Mascot score of 9, but MatSc = 36,188, and yb = 6. Many spectra from Table S3 of similar low quality, MatSc, Sc, and # yb ions are probably incorrectly identified, based on selected manual examination of spectra. However, this spectrum derived from a spot with two other identified peptides. Immediately adjacent spots supplied an additional three peptides. Upon internal calibration, all five of these peptides matched each other to within 10 ppm. Peptide ITLHVDcLR matched to 4 ppm to this recalibrated spectrum, suggesting that ITLHVDcLR could indeed explain this spectrum.
Noncanonical Peptides
Database searching was also directed against three additional categories of peptides: peptides that have no Cys residues and were not modified that arise from incomplete avidin column washing (Y2 class 2), peptides that were modified by the ICAT reagent on Lys residues (KICAT; Y2 class 3), and CICAT peptides that were incompletely alkylated (Y2 class 4). A total of 2731 spectra fell into one of those three categories at low stringency (Table V), whereas 1144 spectra could be so classified at higher stringency (Mascot score of 20, two tryptic termini, and MAc = 1).
There were 886 spectra that corresponded to unlabeled tryptic peptides, many of which were observed repeatedly. Table VII lists each of the 24 proteins that were identified based on at least 10 different spectra. Most of these proteins are metabolic enzymes or ribosomal proteins with high codon adaptation indices (CAI); the lowest CAI in Table VII is 0.52.
With regard to reproducibility, two proteins, P03965 (rows 12–18) and P33302 (rows 29–37), have seven and nine peptides identified, respectively. All of the measurements for P03965 are >1.48, whereas all of the measurements for P33302 are below 0.90 (and mostly around 0.7). These proteins represent examples of up-regulated and down-regulated proteins, whose significance is discussed below. Most proteins that were identified based on multiple peptides had ratios much closer to 1.0.
Proteins Identified and Quantified
Table S13 lists all 1029 of the identified proteins, sorted from highest ratio to lowest ratio. Table XIII lists the 14 "interesting" proteins discussed below as well as all proteins that contributed at least eight distinct peptides. As with Table XII, the No. of Pep, No. of ID, HL, and StD fields are duplicated to show the consequences of including lower-stringency data. One representative peptide sequence, as well as the Mascot score, MatSc, and yb are also displayed to illustrate the confidence of the identification. As with the peptide data, including HL data from low-stringency identifications tends to support the data obtained from the high-stringency identifications. In all cases, we have used the ICAT dictionary to assign the identified peptides to the protein with the highest codon bias.
DISCUSSION
This leads to the second way to assess the reliability of an identification: for an MS/MS expert to look carefully at how well the spectrum matches the peptide that is assigned to it. We have introduced many of the parameters in Table III to make this process more automatic; namely MatSc and Sc, which quantify overall reliability, as well as yb, PCM, PIM, FrT, and ppw. The higher the values for all but the last of these parameters, the more confident the identification. Similarly, SeqString enables one to discern which part of the peptide sequence is best supported by the evidence. Thus, one can gauge the reliability of each identification in Table III by examining these parameters. In most cases, one is primarily interested in the reliability of identification of a particular protein or a peptide from across the whole set of data. For that reason, for each peptide and protein, the values of these parameters for the highest quality spectrum have been copied into the peptide and protein tables, respectively (Tables XII and XIII). Also, there are two additional parameters that separately can be used to gauge the reliability of identification; namely, the mass accuracy of the parent and the relative elution time of the peptide upon either ion exchange chromatography or reversed-phase chromatography, which are discussed below. Neither of these parameters is incorporated directly into Sc or MatSc (although mass accuracy is commonly used to limit the number of peptides to be considered in database searching). Finally, the more often a peptide is identified, the more likely the identification is correct. This is especially true when many alternative explanations have been considered. Alternative explanations can be tested by performing additional rounds of database searches directed to separate classes of chemical modifications, or by considering peptides without trypsin cleavage constraints.
Only selected spectra have been subjected to these additional rounds of searches. However, these spectra routinely included the spectra that were most convincing in the identification of each protein.
Mass Accuracy
It is to be expected that the parent mass of every correctly identified peptide should be close to its theoretical mass. In MALDI experiments, one can substantially improve mass accuracy by performing internal calibration. In these experiments, unfortunately, no calibration standard was added directly to each MALDI spot; however, many spots contained multiple identified components, all of which ought to be consistent with each other. To improve the mass accuracy of the MALDI data, peptides that exceeded a MatSc of 10,000 were used to calibrate the spots (see above). There were 4668 identifications from MALDI that matched within 1 amu (MBC = 0) and had MatSc ≥ 5000 with Y2 = 1 (from any Y1, ChM, or NT category). Of these, 1952 were used as calibrants or were a member of an HL pair that was used as a calibrant. That leaves 2716 identifications whose mass accuracy can be assessed. To study the effect of mass accuracy on credibility of identification, the identifications were broken up into four classes: 1) those with Mascot scores (Sc) ≥ 20, which are highly enriched in correct identifications due to the attention paid to questionable identifications; 2) those with MatSc > 5000 and Sc < 20; 3) those that matched at least four y or b ions and MatSc < 5000; and 4) those with four or more y or b ions. In Fig. 6, one can see that the highest confidence category also had high mass accuracy; that is, the most abundant categories were peptides that matched between 0 and 2.5 ppm and those between 2.5 ppm and 7.5 ppm. Only 22 peptides had mass errors greater than 100 ppm, which are probably due to badly calibrated spots or spots that were not calibrated at all. In the second category, an even larger number of spectra were matched to high mass accuracy, though there were also a few more peptides (53) that matched to >100 ppm.
The third and fourth categories of peptides had a few more identifications that matched to high mass accuracy, but a much larger number matched to >25 ppm, which indicates that many of these identifications are likely to be false. The spectra included in this analysis included all of the CICAT identifications regardless of chemical modifications or whether they adhere to trypsin cleavage rules, excluding only those matches that were off by >1 amu. The high percentage of high mass accuracy identifications in the second category indicates the power of the MatSc to distinguish borderline Mascot identifications from false-positive identifications. Nonetheless, the main power of MatSc is in discriminating between alternative sequences, and it probably would not be nearly as useful if it were used to identify the highest scoring sequences from all of sequence space.
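The mass accuracy figures above are expressed in parts per million of the theoretical parent mass. The following sketch computes that error and assigns it to a bin. The 2.5, 7.5, 25, and 100 ppm edges are taken from the values mentioned in the text, but whether the original histogram used exactly these bins is an assumption, and the names `ppm_error` and `ppm_bin` are hypothetical.

```python
from bisect import bisect_right

# Bin edges in ppm; the exact binning of the original figure is assumed.
PPM_EDGES = (2.5, 7.5, 25.0, 100.0)
PPM_LABELS = ("0-2.5", "2.5-7.5", "7.5-25", "25-100", ">100")


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million of the theoretical mass."""
    return (observed - theoretical) / theoretical * 1e6


def ppm_bin(observed, theoretical):
    """Label of the bin containing the absolute ppm error."""
    error = abs(ppm_error(observed, theoretical))
    return PPM_LABELS[bisect_right(PPM_EDGES, error)]
```

For a parent near 1000 Da, an observed mass that is 0.004 Da too high corresponds to an error of about 4 ppm and falls in the 2.5-7.5 ppm bin.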
Comparison of Elution Times
The data presented are derived from five independent ICAT experiments, starting from exactly the same preparations of yeast cells. To determine the reproducibility of the chromatography and to obtain more identifications, on some occasions individual ion exchange fractions were injected several times. The reproducibility of the chromatography can be estimated by selecting peptides that were identified in two different runs, and then plotting the elution times of the peptides against one another. To illustrate this, Fig. 7A shows the relationship between electrospray elution time and MALDI spot for fraction 15, which was chosen because it contains peptide ITLHVDcLR described above as an example of a borderline identification. There were a total of 57 peptides that were identified by both the electrospray and the MALDI runs, and they eluted in nearly the same order. This strongly suggests that each of these 57 pairs of spectra corresponds to the same component. Table XIV illustrates that for four peptides, one of the identifications was tentative because the MatSc was below 5000. Note that the reliability of each identification is no greater than the more credible of the two identifications, because although the evidence may be compelling that the same component has been detected, that component could be misidentified.
Chemical Modifications
Table IVA lists how many spectra were matched to each class of chemical modifications. Even when only high-confidence identifications are considered, about 5% of the spectra correspond to peptides containing modifications other than oxidation of Met. It is likely that many more spectra could be assigned to each of these classes of chemical modifications with more extended searching, and there are undoubtedly additional classes of chemical modifications yet to be discovered. We conclude from these studies that in most cases these additional identifications do not add to the number of distinct peptides or proteins that can be identified, and may even eliminate borderline identifications by providing alternative explanations (21). We found that relatively few unmodified CICAT peptides were eliminated by this means. However, when Mascot was used to search for additional variable modifications or if the requirements for trypsin specificity were relaxed, then a much greater percentage of spectra were incorrectly matched to peptides that instead corresponded to chemically modified peptides.
Chemical modifications can in principle be biological in origin, or may depend on the protocols used for peptide preparation and separation. Biological modifications will be consistently observed across experiments, unless a protocol variable partially masks the modification, like differential phosphatase activity. Several of the chemical modification classes we observed in these experiments represent oxidation of side chains, the most common of which is methionine sulfoxide formation (ChM class 1). Table V columns Met and MetOx list how many identifications were attributed to peptides that contained Met as a function of experiment number. It can be seen that the extent of Met oxidation is highly variable, across both instrument type and experiment, ranging from 8 to 95% oxidized. ICAT reagent oxidation is also variable; it was much more commonly observed with electrospray data, especially when the number of MetOx identifications was large. These data indicate that some subtle variation in protocol substantially affects the degree of oxidation, perhaps involving electrospray needle field strength (22). If oxidation was eliminated, the overall peptide complexity would be reduced, making it easier to identify more proteins. It is noteworthy that a great number of oxidized peptides are difficult to identify with Mascot even when they are explicitly considered (as a variable modification), because their spectra are often dominated by neutral losses that are not accounted for, especially by cleavage just before the sulfur atom of the side chain as has been previously observed for oxidized Met (20). Fig. 4 shows spectra for an HL pair containing oxidized CICAT in which the y ions containing the modified Cys residue appear as neutral loss species.
Unmodified Peptides
In addition to the modified peptides, another complication in ICAT reagent experiments is the presence of unlabeled peptides that should have been eliminated at the avidin affinity chromatography step (Table V, column N and Table VII). These peptides are a problem in two ways: they may suppress signals for CICAT peptides, and, more importantly, they commonly have HL ratios that on first glance appear to be biologically interesting. Of the 886 spectra that were identified that corresponded to unlabeled peptides, only 37 had apparent ratios between 0.2 and 5, presumably due to unrelated co-eluting peptides. Therefore, the majority of these spectra appeared to be "singlets" and thus were not quantifiable. These spectra corresponded to 473 different peptides, about 38% of which (183) were identified more than once. At the protein level, this corresponded to only 205 proteins, 120 of which were encountered more than once. These peptides arise from incomplete washing of the avidin affinity column; note that in experiments 3 and 6 only five high-stringency identifications were made to peptides in this category (Table V), indicating that many of these peptides can be eliminated if appropriate washing protocols are followed. Unfortunately, the degree to which this is a problem becomes evident only after the experiment has been completed. Many of these proteins were also identified from CICAT peptides, but many of these proteins do not have Cys. Despite the small number of these proteins, some proteins (like enolase) are separable into two distinct isoforms by their unmodified peptides, but encode identical Cys-containing peptides, and thus could not be distinguished in a normal ICAT experiment.
MatSc Versus Mascot Score
One of the purposes of these experiments was to derive new parameters that can be used to judge the validity of database identifications. One of the strengths and limitations of the Mascot score (12) is that the significance threshold of the score depends on the size of the database that is searched and how many chemical modifications are to be considered (or nonspecific cleavage sites). In general, this is reasonable, and in fact, it is easily demonstrated that the larger the number of chemical modifications that are considered, the more likely a spectrum that had previously been correctly identified will become matched instead to an incorrect peptide, often to a peptide with multiple chemical modifications. The significance threshold that Mascot calculates does not take into account the likelihood of the modification; instead, it assumes that any modification that is being considered is just as likely as a completely unmodified peptide. One of the main conclusions from this investigation is that as a general rule, chemically modified peptides are only likely to be correctly identified if they derive from the most abundant peptides in the sample. This is reasonable to expect if the degree of overall chemical modification is small. For this reason, it is important to determine how common each modification is within the experiment. One way to proceed is to determine the likelihood of a particular modification, and use that probability to adjust the threshold for correctness of a score (14, 23). Another alternative is to apply a different threshold for each modification (or nonspecific cleavage), as is commonly done with Sequest scores (24). Another way is to take into account which protein is being considered for modification, as suggested here.
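The idea of using the likelihood of a modification to adjust a score threshold can be illustrated with a minimal sketch. This is not the procedure of refs. 14 or 23. It assumes only that the score has the Mascot form of -10 log10(P), and it raises the threshold by -10 log10(prior), where prior is the estimated frequency of the modification relative to the unmodified peptide. The function name, the base threshold, and the prior values are hypothetical and chosen purely for illustration.

```python
import math

def adjusted_threshold(base_threshold: float, prior: float) -> float:
    """Raise a -10*log10(P)-style threshold for a modification of frequency `prior`.

    `prior` is an assumed relative frequency in (0, 1]; 1.0 means the
    modification is treated as just as likely as an unmodified peptide.
    """
    if not 0.0 < prior <= 1.0:
        raise ValueError("prior must be in (0, 1]")
    return base_threshold - 10.0 * math.log10(prior)

base = 30.0  # hypothetical significance threshold for unmodified peptides
priors = {   # hypothetical modification frequencies
    "unmodified": 1.0,
    "common_modification": 0.3,
    "rare_modification": 0.01,
}

for name, prior in priors.items():
    print(f"{name}: threshold {adjusted_threshold(base, prior):.1f}")
```

Under these assumptions, a modification seen in 1 of 100 peptides must score 20 units higher than an unmodified peptide to be accepted at the same confidence, which matches the observation that rare modifications are reliably identified only for the most abundant peptides.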
The MatSc is no different from the Mascot score with regard to significance thresholds, and in fact it is not wise to assume that peptides whose MatSc is greater than a threshold value are likely to be correct. However, it is more useful than the Mascot score in distinguishing peptides from one another, so long as the predictions as to which fragment ions should be most intense are reasonably accurate. In its current manifestation, short peptides typically get assigned higher MatSc than longer peptides with similar Mascot scores. Thus MatSc is not useful for grouping together identifications of equal confidence based on peptides with significantly different lengths. Another spectral feature that contributes mightily to both MatSc and the Mascot score is spectral quality. In many ways, each of the terms that contribute to MatSc and to the Mascot score are independent criteria for measuring both spectral quality and identification confidence.
Biological Significance
Generally speaking, few proteins changed in expression level upon knock-out of the UPF1 gene. Upf1p, previously shown to be present at 1600 copies per cell (25), was automatically fragmented and identified in wild-type yeast in one of the five experiments. It was definitely lower in the knockout, but there was an isotope cluster that was not selected for fragmentation close to the noise level at the nominal mass of the heavy form of the peptide. Therefore, from the data, one could conclude only that Upf1p was down-regulated by 5-fold or greater. Of course, in theory, Upf1p should not be expressed at all in the knockout strain.
A second protein that is notably lower in the UPF1 knockout strain is the pleiotropic drug-resistance protein (PDR5p), which was detected 27 different times from nine different peptides, with measurements that ranged between 0.52 and 0.99 and a median of 0.69. Only 132 of 5731 measurements were below 0.69; hence 10% of these measurements corresponded to PDR5p. At the mRNA level, this protein is only slightly repressed (0.88; see Ref. 7). Two transcription factors are thought to control the expression of PDR5, namely PDR1 and PDR3 (26). One hexapeptide from PDR1p was detected with a ratio of 0.98, which indicates that PDR1 is not responsible for the lower expression of PDR5p in the upf1 knockout strain. PDR3p, and two other proteins, SNQ2p and YOR1p, thought to be co-regulated with PDR5p (26), were not detected. Thus it is not obvious why PDR5p is down-regulated upon upf1 knockout.
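The 10% figure for PDR5p can be checked with a short calculation. The sketch below assumes that, because 0.69 is the median of the 27 PDR5p measurements, 13 of them lie below it; that count is our inference rather than a number stated in the text, and the variable names are ours.

```python
total_measurements = 5731   # all HL measurements (from the text)
below_cutoff = 132          # measurements with ratio below 0.69 (from the text)
pdr5_measurements = 27      # PDR5p measurements (from the text)

# Assumption: with an odd number of distinct values, half of them
# (rounded down) fall strictly below the median of 0.69.
pdr5_below_cutoff = pdr5_measurements // 2

pdr5_share = pdr5_below_cutoff / below_cutoff
background_share = below_cutoff / total_measurements

print(f"PDR5p share of measurements below 0.69: {pdr5_share:.1%}")
print(f"fraction of all measurements below 0.69: {background_share:.1%}")
```

Under this assumption a single protein accounts for roughly a tenth of the lowest ratios, although only about 2% of all measurements fall below the same cutoff.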
Both subunits of succinate dehydrogenase (SDH1p and SDH2p), a tricarboxylic acid cycle (TCA) enzyme, are down-regulated (0.5 and 0.7), but the other TCA enzymes are nearly unchanged. As is the case for most of the up-regulated proteins, the remaining down-regulated proteins are not obviously related to one another, whether proteins are categorized by biological process, molecular function, or cellular compartment, using the Gene Ontology (GO) system (27).
When the up-regulated proteins are considered, the arginine biosynthesis pathway stands out, as four out of five proteins in this category are observed, and all four have increased expression in the UPF1 knockout strain (Fig. 8). Twenty-four HL measurements and 11 distinct peptides map to these four proteins, and the ratios range from 1.3 to 4.9. Of these, the top four measurements all map to Cpa2p, which is the large subunit of carbamoyl phosphate synthetase. A previous study indicated that mutations in UPF1 result in increased synthesis of the small subunit of carbamoyl phosphate synthetase, in agreement with our data (28), as the genes encoding the two subunits of carbamoyl phosphate synthetase (CPA1 and CPA2) are known to be co-regulated in Saccharomyces based on other experiments (29). In addition, urea amidolyase (DUR1) is related to arginine metabolism, and is also up-regulated (1.83) in the UPF1 knockout strain. Of these proteins, only CPA1 and DUR1 appear to be up-regulated at the mRNA level (7), although DAL2, DAL3, DAL5, and DAL7, involved in allantoin/nitrogen metabolism, are all up-regulated at the mRNA level.
Other up-regulated proteins fit into smaller categories. Of the three subunits of glycine decarboxylase, two were detected (Gcv1p and Gcv2p), with ratios of 3.0 (2.96 mRNA) and 2.4 (1.62 mRNA); the third subunit has no Cys. These proteins may also be related to nitrogen metabolism (30) and therefore arginine metabolism. Another pair of proteins of related function consists of peroxisomal alkyl hydroperoxide reductase (Ahp1p) and glutathione peroxidase (Hyr1p), with ratios of 1.6 (0.93 mRNA) and 2.0 (0.77 mRNA), which are related to the stress response according to the GO system. There are many other proteins that appear to be up-regulated, but that are not easily explainable in terms of function.
Surprisingly, proteins whose functions are similar to that of UPF1 itself appear to be mostly unchanged; Dcp2p has a ratio of 0.96; Dbp2p has a ratio of 1.0 even though it partially controls nonsense-mediated decay (31), whereas Nmd2p, Upf3p (5), and Hrp1p (32) were not detected. Hhf2p, also known to be up-regulated upon deletion of UPF1 (33), has no Cys. In general, between two and three times more proteins are positively correlated with message expression than are negatively correlated, as can be seen from Fig. 9. Additional studies will be required to determine whether the observed changes are reproducible at the biological level, and to increase the number of proteins that can be identified and quantified to determine which of these biological processes are directly related to UPF1 function.
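One simple way to classify a protein as positively or negatively correlated with its message is to ask whether the protein ratio and the mRNA ratio lie on the same side of 1. The sketch below applies that rule to the handful of protein/mRNA pairs quoted in the text. The rule itself is an assumption made for illustration and is not necessarily the criterion behind Fig. 9, and this small hand-picked set is not representative of the full data set.

```python
import math

# (protein HL ratio, mRNA ratio) pairs quoted in the text.
ratios = {
    "PDR5p": (0.69, 0.88),   # median protein ratio
    "Gcv1p": (3.0, 2.96),
    "Gcv2p": (2.4, 1.62),
    "Ahp1p": (1.6, 0.93),
    "Hyr1p": (2.0, 0.77),
}

def same_direction(protein_ratio: float, mrna_ratio: float) -> bool:
    """True if both ratios are above 1 or both are below 1 (assumed criterion)."""
    return math.log2(protein_ratio) * math.log2(mrna_ratio) > 0

concordant = [name for name, (p, m) in ratios.items() if same_direction(p, m)]
discordant = [name for name in ratios if name not in concordant]

print("same direction:    ", concordant)
print("opposite direction:", discordant)
```

With this criterion, PDR5p, Gcv1p, and Gcv2p change in the same direction as their messages, whereas Ahp1p and Hyr1p are up-regulated at the protein level despite slightly reduced mRNA.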
CONCLUSIONS
Many chemical modifications of peptides were found in these experiments, as well as small numbers of peptides in which one peptide terminus did not follow the trypsin cleavage rules. However, all reliable identifications of these peptides indicated that they derived from the most abundant proteins in yeast, or were formed by biological processing of the protein's N terminus. The higher the percentage of spectra that are correctly attributed to these noncanonical tryptic peptides, the more reliable the remaining identifications, because noncanonical peptides from abundant proteins account for some incorrect borderline identifications of single-hit proteins. This problem is not too severe in yeast because of its small genome size, but it is likely to become a serious problem in mammalian studies. The spectrum scoring parameters described here allow many of the tenuous identifications to be classified and correctly distinguished from chemically modified peptides. These additional identifications are especially useful if they provide additional HL measurements that corroborate HL ratios of proteins that are significantly altered. The set of identifications made here should make it possible to perform additional biological experiments directed at the measurement of changes in HL ratio of those proteins that can most easily be detected. Additional levels of protein fractionation prior to digestion will be necessary to delve further into the proteome.
FOOTNOTES
Published, MCP Papers in Press, March 28, 2004, DOI 10.1074/mcp.M300110-MCP200
1 General notes: In the text, aa are referred to by standard three-letter abbreviations except when they are in protein names; however, in peptide sequences and some tables, single-letter codes are used, supplemented by the codes in Table IVA. Yeast genes are designated according to the standard name at www.yeastgenome.org. This name is italicized in capital letters when wild type and is italicized in small letters when mutated. The corresponding protein is designated by the standard name with a p suffix, with the first letter only capitalized (e.g. Upf1p).
2 The abbreviations used are: ICAT, isotope-coded affinity tag; aa, amino acid; Acc, Swiss-Prot accession number; ACN, acetonitrile; AutoSeq, a sequence determined by de novo sequencing; CAI, codon adaptation index; Cal, a key that designates whether the spectrum was used as a calibrant; CalMW, measured MW of parent mass post-calibration; ChM, chemical modification class defined by Table IVA; ChS, ChemScore; CICAT, Cys residue modified by the ICAT reagent; CysN, the number of modified Cys residues in the sequence; Df, mass difference in ppm between CalMW and PepMW; DfC, mass difference in ppm assuming MB is off by a small integer; DSeq, the database sequence corresponding to an identified peptide; EID, experiment ID (HPLC run index); EM, electrospray versus MALDI; Ex, experiment number; Fr, ion exchange fraction number; FrT, Peptide Fragment TriScore; HL, heavy-to-light ratio; HLI_S, a string that indicates whether there are confounding isotope clusters at the position of each possible HL pair; HLI_1, a key that indicates whether an identified HL pair is free of overlapping isotope clusters; HL_1, a key that indicates whether an identified HL pair is consistent with the number of Cys residues and HL_Type of the corresponding peptide sequence; HL_S, a string that indicates whether there are potential overlapping HL pairs; HL_Type, a key that indicates whether a peptide is in the c0 form, the c9 form, or mixed; HPLC, high-performance liquid chromatography; IEX, ion exchange chromatography; Int, intensity; IntD, sum of the intensity of the fragment masses in MFU; IntP, parent ion intensity; Ippm, ppm difference between the masses in a HL pair; ISD, indicates whether the parent ion may have been generated by in-source decay from a different parent ion; IT, the number of missed trypsin cleavage sites within a sequence; KICAT, Lys residue modified by the ICAT reagent; MAc, mass accuracy class; MALDI, matrix-assisted laser desorption/ionization; MatSc, Match score; MB, mass bin, equals rounded CalMW/1.0005; MBC, mass bin class; MFD, number of masses of fragments that were detected and stored in the data base; MFM, number of masses of fragments matched; MFU, number of masses of fragments used in searches; MetOx, oxidized Met or methionine sulfoxide; MS, mass spectrometry; MS/MS, tandem mass spectrometry; Nor, normalization constant for HL measurements; NT, the number of peptide termini that are consistent with the specificity of trypsin; PCM, Percent ChemScore Matched; PepMW, the MW of a peptide based on Seq; PIM, Percent Intensity Matched; Pla, Plate number; ppm, parts per million; PpmMin, the minimum ppm used in calculations of MatSc and FrT; ppw, intensity-weighted average ppm error; Sc, Mascot score; Seq, the sequence of a peptide, with modified aa indicated with special characters; SeqString, an ion fragmentation distribution string; SpID, Spectrum ID Number; TFA, trifluoroacetic acid; ThIM, theoretical ions matched; TiW, time (minutes for electrospray) or spot number (or well number for MALDI); yb, the number of y ions + b ions matched; Y1, a classification scheme for spectra (see Table IVB); Y2, a second classification scheme for spectra (see Table IVB).
* This work was supported in part by Grants GM27757 and GM61096 (to A. J.) from the National Institutes of Health. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this manuscript (available at http://www.mcponline.org) contains supplemental material.
To whom correspondence should be addressed: Discovery Proteomics and Small Molecule Research Center, Applied Biosystems, 500 Old Connecticut Path, Framingham, MA 01701. E-mail: parkerkc{at}appliedbiosystems.com
REFERENCES