Profiling Core Proteomes of Human Cell Lines by One-dimensional PAGE and Liquid Chromatography-Tandem Mass Spectrometry

Markus Schirle, Marie-Anne Heurtier and Bernhard Kuster‡

From Cellzome AG, Meyerhofstrasse 1, 69117 Heidelberg, Germany


    ABSTRACT
 
Protein expression profiles vary considerably between human cell lines and tissues, which is in part a reflection of their specialized roles within an organism. It is of considerable practical use to establish which proteins constitute the primary components of the respective proteomes. When compiled into databases, such information can facilitate the assessment of selectivity and specificity of a wide range of proteomic experiments. Here we describe the major constituents of proteomes of six human immortalized cell lines. By employing a combination of one-dimensional SDS-PAGE and nanocapillary liquid chromatography-tandem mass spectrometry (LC-MS/MS), we identified up to 1785 non-redundant cytoplasmic and nuclear proteins from a single cell line using 50 and 30 µg of total protein from the corresponding fractions. Up to 38 proteins could be identified from a single band in one liquid chromatography-MS/MS experiment. When combined with systematic gridding of gel lanes into 48 slices, a dynamic range for protein identification of ~1:2000 can be envisaged for this approach. Identified proteins range from 4–553 kDa in size, cover the pI range between 3.4 and 12.8, and include 255 proteins with predicted transmembrane domains. Repeated analysis of peptides derived from the same gel band showed that the reproducibility of nanocapillary liquid chromatography-MS/MS of such complex mixtures is about 60–70% suggesting that a particular analytical experiment would need to be repeated about three times to arrive at a representative estimate of the set of highly abundant proteins in a given proteome. Given its technical simplicity, sensitivity, and wealth of generated information, we have adopted this experimental approach to characterize every cell line and tissue that is the subject of experimentation in our laboratory. 
The combined dataset for the six cell lines consists of 2341 non-redundant human proteins and thus constitutes one of the largest collections of human proteomic data published to date.


Cells and tissues express many thousands of proteins at any one time whose expression levels differ enormously between different cell types and span at least six orders of magnitude. As a result, a key question in any biochemical experiment aiming at the identification of a set of proteins in a particular biological context is how significant the observation actually is. In other words, the ability to discriminate between specific protein identifications in the experiment and so-called background identifications is of utmost importance. A range of factors can contribute to the often large number of nonspecific or at least functionally irrelevant identifications of a protein in a given experiment. These may include particular functional properties of a detected protein (such as the binding of chaperones to proteins) that result in prominent identifications of such proteins but also biochemical properties such as multiple hydrophobic interactions between proteins (often observed in membrane-containing protein preparations), nonspecific adsorption of proteins on surfaces such as affinity matrices, or a poorly understood but exceptionally good general compatibility with standard protein separation and identification technologies. Undoubtedly, highly abundant proteins are likely to contribute to nonspecific background to a major extent because these proteins are often notoriously difficult to eliminate completely during the course of a biochemical experiment (e.g. albumin in serum, cytoskeletal proteins, and metabolic enzymes). The abundance aspect is of particular relevance because it differs widely between tissues of different origins (and thus may lead to varying results of identical experiments performed in different cell types). 
The presence of highly abundant proteins in a sample often limits purification yields for proteins of interest as well as the depth of analysis in proteomic experiments because these proteins take up a significant portion of the available analytical space and thus leave less room for the identification of proteins that may be more relevant to a particular experiment. As a result of these considerations, it is useful to generate a database of core proteomes of tissues and cell lines that are commonly used in biological research to facilitate the analysis of abundance-related protein background issues in expression proteomics, chemical proteomics, and protein-protein interaction studies.

The large scale analysis of gene expression patterns required for the establishment of such a database can in theory be performed at the level of mRNA or proteins; however, mRNA-based approaches such as high density oligonucleotide arrays (1) and serial analysis of gene expression (2) are generally not set up to provide a measure for the absolute number of transcripts of a particular gene. In addition, it has been shown in yeast that the correlation between mRNA and protein levels is insufficient to predict protein expression levels from quantitative mRNA data (3). As a result, direct protein profiling is required to approach a description of the protein composition of a particular cell under a given set of physiological conditions.

Nowadays, large scale protein profiling experiments are dominated by the use of mass spectrometry (MS)1 techniques for protein identification such as peptide mass fingerprinting (4) or peptide sequencing using tandem mass spectrometry and database searching (5–7). To reduce sample complexity, a number of separation steps on the protein or peptide level are usually employed prior to protein identification. Traditionally, two-dimensional polyacrylamide gel electrophoresis (2D PAGE) has been the method of choice in expression proteomics studies because it features very high protein separation capacity and because the visualized protein spot pattern provides additional semiquantitative information for comparative studies. Many 2D reference gels of bacterial, fungal, and mammalian proteomes are publicly available (www.expasy.org/ch2d/), but even though high quality 2D gels display thousands of protein spots, these typically comprise multiple isoforms and technical artifacts of no more than 200–300 highly abundant different gene products identified by microsequencing or mass spectrometry. Conference reports suggest that this figure might increase to 1000 genes if subcellular fractionation is used prior to 2D PAGE, but such data have not yet appeared in the public domain. Further limitations of 2D PAGE for general protein profiling of cell lines and tissues include the fact that the technique is still quite technically demanding and the observation that proteins with extreme biochemical properties (size, isoelectric point) as well as certain protein classes (e.g. transmembrane proteins) have been notoriously underrepresented in datasets from standard profiling experiments (8).

More recently, omitting a protein separation step altogether and instead combining two orthogonal chromatography steps and tandem mass spectrometry (LC/LC-MS/MS) to resolve and analyze peptides following in-solution digestion of proteins (9) has become increasingly popular. Controlling this technology is not trivial, but recent reports indicate great potential. For example, 1484 yeast proteins (9), 2363 rice proteins (10), 2415 plasmodium proteins (11), and 1610 rat proteins (12) have been identified using this approach, suggesting that 2D LC-MS/MS is much more powerful for the global characterization of proteins from cells and tissue compared with 2D PAGE. However, such high proteome coverage does come at a price, notably sensitivity. Typically, shotgun identification experiments require 0.5–5 mg of total protein. Although this will often not present a real limitation when working with cell lines or tissue, such quantities might not be available from needle biopsy material, collected by laser capture microdissection or from biochemical fractionation of cellular contents. From an analytical point of view, abandoning separation on the protein level and instead employing 2D LC separations of complex peptide mixtures might suffer from the fact that peptides from highly abundant proteins take up a large proportion of the available analytical space in both chromatographic dimensions, which should limit the dynamic range available for protein identification and thus might compromise the identification of proteins with low expression levels. This problem has been addressed by reducing sample complexity using isotope-coded affinity tag (ICAT) technology (13). Although the selective modification and isolation of cysteine-containing peptides in combination with LC/LC-MS/MS analysis allowed the identification of 986 yeast proteins (14), this three-dimensional chromatographic approach did not really constitute a benefit over the 2D approach reported by Peng et al. (15), who managed to identify 1504 yeast proteins without the use of the isotope-coded affinity tag step. Possibly, this is the result of restricting the accessible proteome space to proteins that contain a cysteine residue within a tryptic fragment in the mass range typically used for protein identification (800–2500 Da). About 14% of all human proteins do not have any cysteine-containing peptides in that mass range, and a further 19% only possess one such peptide. In addition, the amount of starting material for an isotope-coded affinity tag experiment is also in the mg range, which restricts the application of the approach to experiments in which relatively large amounts of proteins are available.
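The restriction to cysteine-containing tryptic peptides in the 800–2500 Da window can be illustrated with a small in silico digest. The following Python sketch is not the procedure used to derive the percentages quoted above; it assumes complete cleavage after K or R except before P, unmodified monoisotopic residue masses, and hypothetical function names chosen for illustration only.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of the neutral, unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def count_cys_peptides(sequence, low=800.0, high=2500.0):
    """Number of tryptic peptides that contain Cys and fall in [low, high] Da."""
    return sum(
        1
        for peptide in tryptic_peptides(sequence)
        if "C" in peptide and low <= peptide_mass(peptide) <= high
    )
```

Applied to every entry of a protein sequence database, a count of zero would mark a protein as inaccessible to a cysteine-selective approach under these assumptions, and a count of one would mark it as identifiable only by a single peptide.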

An approach that constitutes an interesting compromise between the advantages and shortcomings of the methods mentioned above is a combination of 1D PAGE protein separation and nanocapillary LC-MS/MS analysis (GeLC-MS/MS) of in-gel-generated peptides for protein identification. It is technically simple in nature and combines decent protein separation capability that also captures those proteins typically not accessible via 2D PAGE (notably large proteins and those with transmembrane domains) and the well established excellent sensitivity of gel-based protein identification using mass spectrometry for samples of low complexity (16). Recent examples from the literature indicate that this approach might also be viably applied for the analysis of complex protein mixtures as shown by the identification of 1289 plasmodium proteins (17) and of 271 proteins from the nucleolus (18). In this paper, we report on the application of GeLC-MS/MS to the analysis of core proteomes of six human immortalized cell lines as well as on the effect of subcellular fractionation on the total yield of protein identifications. The dataset currently includes cytoplasmic core proteomes of HEK293 (embryonic kidney cells), SKNBE2 (neuroblastoma), SW480 (colon carcinoma), HeLa (cervical adenocarcinoma), HeLaS3 (a clonal derivative of HeLa), and HepG2 (hepatocellular carcinoma) cells as well as a nuclear preparation from HEK293 cells. We identified between 268 and 1111 non-redundant cytoplasmic proteins for each cell line utilizing only 50 µg of total protein. The resulting information has proven very valuable in our laboratory for discriminating between specific protein signals and background in various types of affinity-based protein purification/identification experiments.


    EXPERIMENTAL PROCEDURES
 
Preparation of Protein Samples—
For cytoplasmic fractions, 2.5 × 10⁷ HEK293, SKNBE2, SW480, HeLa, HeLaS3, and HepG2 cells were harvested, washed once with phosphate-buffered saline, resuspended in lysis buffer, and lysed with 10 passages through a 15-gauge needle. Cell lysates were centrifuged to obtain a post-nuclear supernatant, which was further centrifuged to obtain a cytoplasmic fraction (100,000 × g supernatant). Protein concentration was determined by Bradford assay, and 50 µg of protein were separated on a 4–15% gradient gel. For the nuclear fraction of HEK293 cells, 5 × 10⁸ cells were harvested and washed three times with phosphate-buffered saline. Cells were lysed in hypotonic buffer (10 mM Tris/HCl, pH 7.4, 1.5 mM MgCl₂, 10 mM KCl, 25 mM NaF, 1 mM Na₃VO₄, 1 mM dithiothreitol) followed by a 10-min incubation on ice. After Dounce homogenization, a nuclear pellet was obtained by spinning the lysate for 10 min at 2000 × g. The pellet was resuspended in two volumes of a high salt buffer (50 mM Tris/HCl, pH 7.4, 1.5 mM MgCl₂, 20% glycerol, 420 mM NaCl, 25 mM NaF, 1 mM Na₃VO₄, 1 mM dithiothreitol) and incubated for 30 min on ice. After dilution to a final salt concentration of 110 mM, the fraction was centrifuged for 1 h at 100,000 × g. The protein concentration of the fraction was determined by Bradford assay, and 30 µg were applied on a 4–12% bis-Tris gel. After visualization by colloidal Coomassie, whole gel lanes were cut into 48 pieces of equal size and subjected to in-gel tryptic digestion essentially as described by Shevchenko et al. (19).

Mass Spectrometry and Data Analysis—
One-third of the total tryptic digest sample was subjected to 60 min of data-dependent nanocapillary reversed-phase LC-MS/MS analysis using self-packed 75-µm inner diameter columns (Reprosil, Maisch) on nanoLC systems (CapLC, Waters; Ultimate, LC Packings) coupled to quadrupole time-of-flight (QTOF) instruments (QTOF Ultima, QTOF Micro, QTOF II, Waters). Data-dependent acquisition was performed using three MS/MS channels and no exclusion time. Where possible, measurements were repeated up to two times using the remaining amount of sample.

Proteins were identified by automated database searching (Mascot Daemon, Matrix Science) against an in-house curated version of the monthly updated International Protein Index protein sequence database (IPI, versions 2.5–2.18, European Bioinformatics Institute, www.ebi.ac.uk/IPI/). This minimally redundant yet maximally complete compilation of entries from Swiss-Prot, TrEMBL, RefSeq, and Ensembl was complemented with frequently observed non-human contaminants and viral proteins expressed by the examined immortalized cell lines (e.g. expression of E1B protein from human adenovirus type 5 (Swiss-Prot accession number P03243) in HEK293 cells). Search parameters were as follows: MS and MS/MS tolerance of 0.4 Da, tryptic specificity allowing for up to 3 missed cleavages and K/R-P cleavages, fixed modification of carbamidomethylation of cysteine, and variable modification of oxidation of methionine. Results were read into an Oracle database for further data analysis and comparison of identified protein sets using standard database querying tools. To ensure the highest possible quality of identification, protein hits with Mascot scores between 30 and 80 were evaluated by visual examination of the corresponding MS/MS spectra. In the case of identifications of multiple protein database entries based on the same set of peptides, only a single entry (highest molecular weight) was considered. Identifications based on peptides comprising a subset of a larger set of peptides used for identification of another database entry were not included. Comparative analysis of protein identification sets was performed on the level of database accession numbers as well as on clusters of 97% sequence identity to minimize effects caused by the ongoing revision of protein database content. Prediction of transmembrane domains was done using TMHMM 2.0 (20), and isoelectric points were calculated using EMBOSS (21).
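The two redundancy rules described above (keep only the highest molecular weight entry among entries matched by the same peptide set, and discard entries whose peptides are a subset of another entry's peptides) can be sketched as follows. This is an illustrative Python reconstruction, not the database queries actually used; the input layout and the function name are assumptions.

```python
def collapse_protein_hits(hits):
    """Reduce redundant protein hits.

    hits maps an accession to (molecular_weight, set_of_matched_peptides).
    Returns the sorted accessions that survive both rules.
    """
    # Rule 1: among entries with identical peptide sets, keep the heaviest.
    best_per_peptide_set = {}
    for accession, (mol_weight, peptides) in hits.items():
        key = frozenset(peptides)
        current = best_per_peptide_set.get(key)
        if current is None or mol_weight > current[1]:
            best_per_peptide_set[key] = (accession, mol_weight)

    # Rule 2: drop entries whose peptide set is a proper subset of another's.
    kept = []
    for key, (accession, _mol_weight) in best_per_peptide_set.items():
        is_subset = any(key < other for other in best_per_peptide_set)
        if not is_subset:
            kept.append(accession)
    return sorted(kept)


example = {
    "A": (50.0, {"p1", "p2", "p3"}),
    "B": (60.0, {"p1", "p2", "p3"}),  # same peptides as A, higher MW
    "C": (30.0, {"p1", "p2"}),        # subset of the A/B peptide set
    "D": (20.0, {"p9"}),
}
print(collapse_protein_hits(example))  # ['B', 'D']
```

In this toy input, entry A is dropped in favor of the heavier entry B, and entry C is dropped because its peptides are fully contained in those of B.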


    RESULTS AND DISCUSSION
 
Analysis of Core Proteome Datasets Obtained by GeLC-MS/MS—
Using a simple 1D PAGE separation step prior to nanocapillary LC-MS/MS analysis, we outlined cytoplasmic core proteomes of different human cell lines that are frequently used for a wide range of experimental applications in our and many other laboratories. Table I summarizes the overall experimental results for cytoplasmic core proteomes from HEK293, SKNBE2, SW480, HeLa, HeLaS3, and HepG2 cells as well as the nuclear core proteome from HEK293 cells. The full lists of identified proteins can be found in the supplementary material (Tables S1–S7).


 
TABLE I Summary of results from protein profiling experiments by GeLC-MS/MS

MW, molecular weight.

 
Up to 1111 non-redundant proteins (991 clusters of 97% sequence identity, HEK293 cytoplasmic) could be identified from 50 µg of total protein separated by 1D PAGE followed by slicing gel lanes into 48 bands (gel fractions) and measuring tryptic peptides from all 48 gel fractions in triplicate by LC-MS/MS. Interestingly, the range of proteins identified in these cell lines varied by a factor of four despite the fact that identical amounts of total protein were used for every experiment. This probably reflects differences in dynamic range of individual protein expression levels in different cell lines and not technical limitations. Three different types of quadrupole time-of-flight instruments were used in this study, but no correlation between absolute sensitivity of the particular instrument and the number of identified proteins could be observed, indicating that the amount of protein available for detection was not limiting. This is probably not surprising because by the nature of the experiment, all of the identified proteins should represent highly abundant cellular proteins and should therefore not constitute a major challenge for the analytical sensitivity of the LC-MS/MS systems used. The raw data for all cytoplasmic proteomes comprises 34,242 best spectrum-to-peptide assignments by Mascot (not counting repeated fragmentation of the same peptide in one LC-MS/MS run) resulting in 8860 protein identifications (average of four matched peptides/protein). All individual identifications could be consolidated into a non-redundant set of 10,258 tryptic peptides (21 and 3% of which contain one and two or more missed cleavage sites, respectively) and 2341 proteins (1792 clusters). One-peptide identifications accounted for 27% of all identifications. 
It should be noted that the number of protein identifications is a rather conservative estimate of the number of proteins represented in the raw MS data because we opted to systematically reject all Mascot identifications for which the calculated probability of a random match was 50% or higher (corresponding to a Mascot score of 30 for the protein sequence database used in this study). However, our experience with data acquired on quadrupole time-of-flight instruments indicates that about 25% of all Mascot hits with scores as low as 30 can be rescued by manual interpretation of MS/MS spectra.2 When taken together with the rather low number of single peptide identifications and missed enzymatic cleavage sites, we can be confident that the reported protein identifications are genuine.

The identified core proteomes cover a wide range of biochemically diverse proteins in terms of size (4.4–553.0 kDa), isoelectric point (3.41–12.76), and presence of up to 16 transmembrane domains (TMDs, Tables I and S1–S7). Fig. 1 shows a comparison between the distribution of protein size in the protein sequence database and that obtained for all proteins identified here. The two distributions match quite closely for proteins larger than 20 kDa indicating that GeLC-MS/MS does provide a fair representation of the protein content of a cell. However, very small proteins tend to be underrepresented, and that trend becomes more severe the smaller the protein is. This can of course partly be attributed to the fact that small proteins may run off the bottom of the gel. However, our analysis shows that proteins as small as 4 kDa were identified, and the same effect is indeed observed in shotgun protein identification studies that do not involve the use of gels (Ref. 9 and data not shown). An alternative explanation is of course that small proteins yield few peptides for identification, but a third factor should also be considered, a potential overrepresentation of small proteins in the sequence database. This can arise from the fact that the protein sequence database contains a large number of entries that are predicted from genomic sequences. Small genes are difficult to assign with good confidence from genomic sequences (22), and as a result, many small coding sequences in the protein database might be overpredictions of the software tools used for gene finding. When analyzing our dataset for the presence of transmembrane proteins, we find that 11% of all proteins (255) contain at least one transmembrane domain and that 101 proteins contain more than one such domain. 
Although one might have expected a higher incidence of transmembrane proteins (20% of all proteins in IPI contain at least one predicted TMD), it is noteworthy that a relatively large number of transmembrane proteins could be identified here without any specific enrichment for this class of proteins.



 
FIG. 1. Molecular weight distribution of the human protein database IPI and the total set of identified proteins from all experiments. Although there is good overall correlation, the part of the proteome below 30 kDa is underrepresented.

 
Depth of Proteome Characterization by GeLC-MS/MS—
To a first approximation, the maximum number of proteins that can possibly be identified by GeLC-MS/MS should depend on the number of gel slices cut and the number of proteins the LC-MS/MS system can identify from such a gel band. In our case, we cut gel lanes with an effective separation range of about 7 cm into 48 equally spaced slices and used a 75-µm column for peptide separation. On average, the same protein was identified in 1.2 gel slices, and when a protein was found in more than one slice, the slices were almost invariably adjacent. Hence, the separation power provided by the 1D gel should be about a factor of 40 (48:1.2). NanoLC-MS/MS allowed the identification of up to 38 distinct proteins in a single analysis of a single sample as indicated in Table I. Additional titration experiments using in-solution protein digests of 100 fmol of bovine serum albumin spiked into digests of chicken conalbumin varying from 100 fmol to 5 pmol suggest a dynamic range of about 1:50 for the LC system used in this study (data not shown). Taken together, one would therefore not expect to be able to identify more than ~40 × 38 = 1520 proteins in a single LC-MS/MS analysis of all gel slices. In practice, we have generally not identified more than 800 proteins in one such analysis. There are two primary reasons why there is a discrepancy between the theoretically possible number of protein identifications and the one that was actually achieved. First, there is no even distribution of the number of proteins across the gel (see Fig. 1). Second, not all proteins are equally abundant. Both aspects are illustrated in Fig. 2. The number of identified proteins/gel band closely follows the protein size distribution shown in Fig. 1. Therefore, comparatively few proteins are identified in the high mass range of the gel, and many proteins are identified in the 20–50-kDa range. 
However, the number of peptides with which a particular protein is identified remains relatively constant across the gel, which means that the mass spectrometer is probably undersampling in the high mass region of the gel because few peptides make it above the detection threshold, whereas the opposite is the case in the lower parts of the gel, where many more peptide ion signals are competing for measurement time than the instrument can handle. This corresponds well with the observation that the number of acquired peptide spectra per LC-MS/MS run tends to peak in the 50-kDa region. However, the ratio of spectra that contributed to verified protein identification events and the number of all acquired spectra do not show a clear trend over the molecular weight range (data not shown). We, like many investigators before us, find that a large proportion of MS/MS spectra do not match a peptide sequence in the database with any reasonable confidence. The average ratio of MS/MS spectra used for protein identification from all different cell lines was 41% with a standard deviation of 19%. There are many possible explanations for this observation. For example, if the intensity thresholds for switching between MS and MS/MS are low, the mass spectrometer will select for precursor ions of which the intensity will never be sufficient to produce an informative MS/MS spectrum. Another important point to consider is the separation performance of the liquid chromatography system. If the chromatographic peak width is small, some ions that are detected in an MS survey spectrum might have an intensity that is too low by the time it is their turn to be fragmented in a particular data acquisition scheme. Hence, the number of precursor ions that are subjected to collision-induced dissociation between survey spectra should be adjusted to the separation characteristics of the LC system. 
Overall, these factors might introduce a slight bias against smaller yet more abundant proteins, which is probably not so much the case in shotgun identification approaches because the bias introduced by the gel does not apply. One way to deal with this potential bias is to repeat the measurement of the same sample several times to arrive at a fairer representation of the proteins present.
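The back-of-the-envelope estimates in this section can be restated as a short calculation. The Python sketch below only reproduces the arithmetic from the numbers given in the text; the function names are hypothetical, and multiplying the gel separation factor by the 1:50 dynamic range of the LC system to arrive at the ~1:2000 figure quoted in the abstract is our reading of how that value is obtained.

```python
def separation_factor(n_slices, mean_slices_per_protein):
    """Effective separation power of the 1D gel."""
    return n_slices / mean_slices_per_protein


def upper_bound_identifications(factor, max_proteins_per_run):
    """Theoretical ceiling on identifications over all gel slices."""
    return factor * max_proteins_per_run


def combined_dynamic_range(factor, lc_dynamic_range):
    """Assumed: the gel factor multiplies the dynamic range of the LC system."""
    return factor * lc_dynamic_range


factor = separation_factor(48, 1.2)
print(round(factor))                                   # 40
print(round(upper_bound_identifications(factor, 38)))  # 1520
print(round(combined_dynamic_range(factor, 50)))       # 2000
```

The ceiling of 1520 assumes that every slice yields the maximum of 38 identifications, which is precisely the assumption that the uneven protein distribution across the gel violates.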



 
FIG. 2. Distribution of protein identifications and number of matched peptides/protein identification over 1D gel regions in GeLC-MS/MS (average values from three experiments using the cytoplasmic fraction of HEK293). The number of protein identifications (ID) in different gel regions by and large follows the molecular weight distribution of human proteins (see Fig. 1). The average number of matched peptides/protein identification remains relatively constant.

 
Reproducibility of Proteome Characterization by GeLC-MS/MS—
The more complex a peptide mixture gets, the more the analytical system is strained in terms of reproducibility. This is not so much a problem in terms of confidence in the presence of the identified proteins but is more a question of how representative the list of identified proteins for a given sample is. In one published report, the reproducibility issue has been addressed by repeatedly analyzing an unfractionated tryptic digest of a total yeast lysate by LC-MS/MS. The results showed that only 35 of a total of 401 identified proteins were present in each of the triplicate measurements, which demonstrates that there is a limit as to how complex a mixture a one-dimensional LC system can handle (23). Reproducibility of measurements using GeLC-MS/MS should be higher because the complexity of the sample is reduced owing to the protein separation step that precedes the LC-MS/MS experiment. For the proteome datasets contained in this report, LC-MS/MS analysis was mostly done in triplicate. Analysis of the HEK293 nuclear fraction showed that reproducibility of protein identification (on the level of 97% identity clusters) was about 70% between duplicates and 60% among triplicates with no significant deviation over the molecular weight range as judged from repeat experiments run consecutively on the same instrument and searched against the same database version (Fig. 3). This shows that including an additional separation step on the protein level increases the overall depth of proteome coverage as well as reproducibility and thereby representation.
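One way to express such reproducibility figures is the fraction of all identified clusters that is found in every replicate run. The text does not state the exact formula used, so the following Python sketch assumes this definition (shared identifications divided by the union of identifications), which matches the way the yeast example is phrased (35 of 401); the function name is hypothetical.

```python
def shared_fraction(*runs):
    """Fraction of all identified clusters that occur in every run.

    Each run is an iterable of cluster (or accession) identifiers.
    Assumed definition: |intersection| / |union| over the given runs.
    """
    sets = [set(run) for run in runs]
    if not sets:
        return 0.0
    union = set().union(*sets)
    if not union:
        return 0.0
    common = set.intersection(*sets)
    return len(common) / len(union)


run1 = {"P1", "P2", "P3", "P4"}
run2 = {"P1", "P2", "P3", "P5"}
run3 = {"P1", "P2", "P6"}

print(shared_fraction(run1, run2))        # 0.6
print(shared_fraction(run1, run2, run3))  # 0.3333333333333333
```

Under this definition the yeast result corresponds to 35/401, or about 9%, and the value can only stay the same or drop as further replicates are added, in line with the lower figure reported for triplicates than for duplicates.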

We observed a somewhat unexpected but significant source of variation in reproducibility that is related to changes in the sequence database against which MS data is searched. Although two rounds of searching of identical primary LC-MS/MS data (here, single analysis of HEK293 cytoplasm) against different versions of the IPI sequence database (versions 2.5 and 2.12) produced a relatively constant number of non-redundant identifications (447 versus 431 IPI entries, 440 versus 424 clusters of 97% identity), the overlap was only 63% on the level of accession numbers and 83% on the cluster level. This is not a feature of this particular sequence database, but rather all sequence databases are constantly changing in size and content as the result of the increasing amounts of available genomic sequence information and more (or less) refined annotation of such sequences. As a result, protein identifications might "disappear" because of the removal of the sequence against which the spectrum was originally matched. Re-searching of primary mass spectrometry data against updated protein databases is not practical when it comes to large and rapidly growing datasets. In a lucky case, this would remove false positive identifications, but there are numerous examples from our laboratory where true protein sequences have been identified with one database version and have disappeared in the next release. Freezing a certain database version for extended periods of time is also not a viable solution because of the risk of missing newly identified gene products. One way out of that problem is to keep an in-house curated master database version in which novel entries of a new database release are added without removing entries that have been eliminated from the new database release. This procedure is, however, not practical for every laboratory because it requires the presence of an appropriate IT infrastructure. 
In addition, small but biochemically and functionally irrelevant changes in protein sequences might lead to an ever increasing database. An alternative is to cluster database entries based on sequence homology (here, 97% identity) to confine sequences for a particular gene product to a small number, which in turn helps to recognize proteins that are functionally identical. The downside of this approach is that homology clustering often cannot cope with subtle differences in amino acid sequence that are easily differentiated by MS, and the approach may thus lead to loss of information on e.g. isoforms or splice variants of gene products. We are currently evaluating clustering algorithms that reflect the information content of primary MS data more accurately and should provide a means to alleviate some of the problems mentioned above.
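The master database idea described above (add the novel entries of each release, never remove retired ones) amounts to a cumulative merge. The Python sketch below is a minimal illustration and not the in-house implementation; it assumes that each release is available as a mapping from versioned accession (for example IPI00024067.1) to sequence, and the function name is hypothetical.

```python
def update_master(master, release):
    """Merge a new database release into the master database.

    Both arguments map a versioned accession to a protein sequence.
    Entries missing from the new release are retained; entries that are
    new in the release are added; existing entries are left untouched.
    """
    merged = dict(master)
    for accession, sequence in release.items():
        merged.setdefault(accession, sequence)
    return merged


master = {"IPI0001.1": "MAAA", "IPI0002.1": "MCCC"}
release = {"IPI0002.1": "MCCC", "IPI0003.1": "MDDD"}  # IPI0001.1 was retired

merged = update_master(master, release)
print(sorted(merged))  # ['IPI0001.1', 'IPI0002.1', 'IPI0003.1']
```

Because every sequence revision receives a new versioned accession in this scheme, the master database grows with each release, which is the drawback noted in the text and the motivation for clustering entries by sequence identity.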

Compilation of Proteins into Abundance Lists—
All proteins identified in the cell lines analyzed are compiled in Tables S1–S7 and sorted by the sum of all Mascot protein scores with which a particular protein was found in the experiments. Mass spectrometry of proteins and peptides is not a quantitative method as such. Therefore, it is difficult to assess the abundance of a particular protein from MS data per se. Nevertheless, there are several empirical indications that can help to estimate the relative abundance of a protein in a mixture. In general, the higher the amount of protein, the higher the MS signal intensity, number of sequenced peptides/sequence coverage, and Mascot score. Any of these measures works fairly well as long as proteins in the mixture are of similar size. If that is not the case, signal intensities are probably a more suitable measure for abundance (24). However, it is not very practical to compute these values for all peptides in very large datasets acquired on different types of instrumentation. Therefore, we have chosen to rely on the sum of all Mascot protein scores with which a particular protein was found in a 1D PAGE gel lane. We are aware that this tends to overestimate the abundance of large proteins because these generate more peptides that contribute to the total Mascot protein score. However, we believe that this is not a fundamental problem because the protein lists of Tables S1–S7 are compiled from GeLC-MS/MS data for use with GeLC-MS/MS data; thus, the abundance assessment will always be made relative to where a protein was found on a gel. For example, clathrin heavy chain 1 (IPI00024067.1, 192 kDa) appears with a total Mascot score of 2161, whereas Ras GTPase-activating-like protein IQGAP1 (IPI00009342.1), a protein of almost identical size (189 kDa), is listed with a score of 250 (Table S1). Hence it is fair to assume that the former protein is much more abundant than the latter. 
We have also found that the bias toward larger proteins introduced by using total Mascot score for abundance classification is not overly large. When plotting total Mascot score versus protein size for all proteins of a cell line (data not shown), the maximum of the score distribution is found at about 40 kDa, which is not that far off the maximum of the size distribution of all identified proteins of 25–30 kDa (Fig. 1). In addition the "usual suspects" for abundant proteins, like cytoskeletal proteins, chaperones, and metabolic enzymes, populate the top positions of all lists.
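
As an illustration of the sorting described above, the following minimal sketch sums the Mascot protein scores of each protein over all gel slices of a lane and ranks the proteins by that total. The function name, the input layout, and the identifiers and scores are hypothetical and chosen only for illustration; they are not taken from Tables S1–S7.

```python
from collections import defaultdict


def rank_by_total_score(hits):
    """Rank proteins by summed Mascot protein score across gel slices.

    hits: iterable of (slice_index, protein_id, mascot_protein_score),
          one entry per protein identification in one LC-MS/MS run.
    Returns a list of (protein_id, total_score), highest total first.
    """
    totals = defaultdict(float)
    for _slice_index, protein_id, score in hits:
        totals[protein_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up identifications from one gel lane (slice, protein, score).
hits = [
    (1, "protein_A", 900.0),
    (2, "protein_A", 350.0),
    (2, "protein_B", 120.0),
    (7, "protein_C", 45.0),
    (8, "protein_B", 60.0),
]

ranking = rank_by_total_score(hits)
for protein_id, total in ranking:
    print(f"{protein_id}\t{total:.0f}")
```

Because larger proteins contribute more peptides to the total, such a ranking is best read as a comparison among proteins found in the same region of the gel, as in the clathrin and IQGAP1 example.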

Reported total Mascot scores in Tables S1–S7 span values between 30 and 10,000. The respective range for individual gel slices (LC-MS/MS runs) is ~30–3000. Clearly, all proteins in Tables S1–S7 are abundant cellular constituents, but the range of scores suggests that the absolute abundance range might be on the order of 1:100. Again, it should be stressed that the purpose of the sorting exercise is to reflect an overall trend in the dataset rather than to provide an abundance criterion for individual proteins. Nevertheless, proteins that are closer to the top of Table S1 can be expected to be more prone to contributing to nonspecific protein background in affinity purification experiments than those at the bottom of the list. It should also be kept in mind that even when identifying over a thousand proteins from a cell line or tissue using GeLC-MS/MS (or LC/LC-MS/MS for that matter), the depth of proteome coverage is still very limited, given that the expression levels of proteins in cells are estimated to span six orders of magnitude.

Comparative Analysis of Datasets from Different Cell Lines—
It is well known that different cell lines and tissues express different sets of proteins, which at least in part is a reflection of their specialized physiological roles. However, one would expect that a considerable fraction of that part of the proteome that is primarily occupied with housekeeping functions ought to be abundantly expressed in all cell types. When comparing our data (protein clusters of 97% sequence identity), we were surprised to find that only 104 of the 1543 non-redundant cytoplasmic protein clusters were shared among all cell lines analyzed. The overall overlaps between cell lines (Table II) are largely independent of whether proteins are among the top 10, 20, 50, or 100 proteins on the lists in Tables S1–S7 (data not shown). Fifty percent of the proteins are shared among HEK293, SKNBE2, and SW480 cells. For SKNBE2 and SW480 cells, the respective overlap is ~30%. Cell line-specific cytoplasmic clusters (i.e. proteins that were exclusively found in a single cell line) account for 36% of all identified proteins in HEK293, 23% in SKNBE2, 22% in HeLaS3, 21% in HepG2, 13% in SW480, and 6% in HeLa cells. Obviously, the fraction of unique identifications increases with the size of the dataset for a particular cell line, and it should be borne in mind that the absence of a protein from a particular dataset may simply mean that the expression level of that protein was below the detection limit of the system. Nevertheless, there are numerous examples of highly abundant proteins that are found in a cell line-specific manner. Examples include the dopamine monooxygenase precursor (IPI00012890.1) and the neuron-specific calcium-binding protein hippocalcin (IPI00145135.1), which were both found exclusively in the neuroblastoma cell line SKNBE2. 
Given the substantial differences in relative expression levels of proteins in core proteomes of different cell lines, we recommend analyzing every cell line or tissue used for affinity purification approaches in the way described here in order to define an abundance-related protein background that is meaningful for the particular source of protein.
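
The comparisons described above reduce to set operations on the cluster identifiers found in each cell line. The following minimal sketch, with hypothetical function names and made-up cluster identifiers, computes the clusters shared by all datasets, the clusters unique to one dataset, and the pairwise overlap counts. How a count is turned into a percentage depends on the chosen denominator, which is left open here.

```python
from itertools import combinations


def shared_by_all(datasets):
    """Clusters present in every dataset."""
    return set.intersection(*datasets.values())


def unique_to(datasets, name):
    """Clusters found in the named dataset and in no other dataset."""
    others = set().union(
        *(ids for key, ids in datasets.items() if key != name)
    )
    return datasets[name] - others


def pairwise_shared(datasets):
    """Number of shared clusters for every pair of datasets."""
    return {
        (a, b): len(datasets[a] & datasets[b])
        for a, b in combinations(sorted(datasets), 2)
    }


# Made-up cluster identifiers for three hypothetical cell lines.
datasets = {
    "line_1": {"c1", "c2", "c3", "c4"},
    "line_2": {"c1", "c2", "c5"},
    "line_3": {"c1", "c3", "c6", "c7"},
}

core = shared_by_all(datasets)
specific = {name: unique_to(datasets, name) for name in datasets}
overlaps = pairwise_shared(datasets)

print("shared by all:", sorted(core))
for name, ids in specific.items():
    fraction = 100.0 * len(ids) / len(datasets[name])
    print(f"{name}: {len(ids)} specific clusters ({fraction:.0f}% of its total)")
print("pairwise overlaps:", overlaps)
```

As noted above, a cluster that appears to be specific to one dataset may simply have been below the detection limit in the others, so such counts are upper estimates of true cell line specificity.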


TABLE II Overlap of non-redundant protein identifications (97% identity clusters) between pairs of cytoplasmic fractions from different human cell lines

Bold underlined numbers denote the total number of identified protein clusters in the corresponding cell line. exp, replicate experiments.

 
Subcellular Fractionation Increases Proteome Coverage—
For HEK293 cells we explored the effect of subcellular fractionation on the number of proteins identified by GeLC-MS/MS. Table III shows that almost 1000 proteins were identified in both the cytoplasmic and nuclear HEK293 preparations with an overlap between the two fractions of 57%, which increased the total number of identified non-redundant proteins from this cell line to 1785 (1412 clusters). As one would expect, many of the proteins identified specifically in the nuclear fraction are known nuclear constituents. These range from histones to RNA- and DNA-binding or -metabolizing proteins to nuclear pore components and proteins involved in DNA replication and repair. Data from extending the fractionation of cells by rather simple means (e.g. into membrane fractions, lipid rafts, etc.) prior to protein profiling indicate that the depth of protein profiling can relatively easily be increased to cover more than 3000 proteins of a single cell line (data not shown).


TABLE III Overlap of non-redundant protein identifications (97% identity clusters) between the cytoplasmic and nuclear fraction of human HEK293 cells

Bold underlined numbers denote the total number of identified protein clusters in the corresponding fraction. exp, replicate experiments.

 
Conclusions—
The presented data demonstrate the utility of GeLC-MS/MS for the description of core proteomes as a tool for determining abundance-related protein background in biochemical experiments. Compared with more elaborate separation techniques such as 2D gel electrophoresis or multidimensional chromatography prior to MS-based analysis, a simple protein separation step simplifies the overall procedure, shortens analysis time, minimizes sample loss in general, and captures proteins with a wide range of biochemical characteristics (size, pI, transmembrane domains). GeLC-MS/MS allowed the identification of 1785 proteins from a single human cell line, thus reaching the same depth of proteome coverage reported for shotgun experiments in the literature. Only 30–50 µg of total protein were required as starting material compared with typical amounts of 200–300 µg of protein for 2D PAGE and 0.5–2 mg for 2D LC/LC-MS/MS experiments. As a result, this method is also applicable to the study of cellular proteomes that are of very limited availability, such as samples obtained by needle biopsy or laser capture microdissection, where initial studies have reported expression proteomic analyses from as little as 1–5 µg of total protein (25).

On a cautionary note, one should be aware that protein abundance is only one, albeit important, factor contributing to the phenomenon of protein background in affinity purification and other experiments. It by no means excludes the possibility that a very abundant protein might serve a very specific function. For example, HSP90 appears repeatedly among the highest scoring proteins in our datasets, but at the same time, it has been shown to be required in conjunction with cdc37 for activation of the IκB kinase complex in tumor necrosis factor-α signaling (26).

We have adopted the use of GeLC-MS/MS to characterize every cell line and tissue that is subject to biochemical experimentation in our laboratory. The resulting information has been integrated into our in-house protein sequence database and has become an important tool for assessing the results of biochemical experiments. Although the data from this study are available as supplementary information and can be used for the same purpose, we would also encourage our colleagues to go through this worthwhile and limited effort for their favorite source of protein and to share the data with the scientific community.



FIG. 3. Reproducibility among replicate analyses in GeLC-MS/MS (HEK293, nuclear fraction). Overall reproducibility among total datasets is 59% among triplicate and 70% between duplicate measurements. Overlaps among datasets remain relatively constant over 1D gel regions. Reproducibility on the sample level among multiple analyses (a1–an) was calculated as follows: the number of proteins shared among a1–an of sample x divided by the average number of total distinct protein identifications in a1–an of sample x. For reproducibility between duplicate analyses, the average values from all possible comparisons (for n = 3; a1 versus a2, a1 versus a3, and a2 versus a3) are given. Overall reproducibility among total sets of identified proteins was calculated accordingly. (For example, three rounds of analysis of HEK293 nuclear fraction resulted in 422 proteins common to all three datasets comprising 783, 660, and 705 distinct proteins. 422/((783 + 660 + 703)/3) × 100% = 59%.)
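
The reproducibility measure defined in the legend of Fig. 3 can be written down as a minimal sketch. The function names and the toy protein identifiers are hypothetical. The worked example from the legend is reproduced from the reported counts only; the legend gives the third dataset size as 705 in the text and 703 in the formula, and both values round to the same 59%.

```python
from itertools import combinations


def reproducibility(replicates):
    """Percent reproducibility among replicate analyses of one sample.

    replicates: list of sets of protein identifiers (a1 ... an).
    Defined as the number of proteins shared among all replicates divided
    by the average number of distinct proteins per replicate.
    """
    shared = set.intersection(*replicates)
    mean_size = sum(len(r) for r in replicates) / len(replicates)
    return 100.0 * len(shared) / mean_size


def mean_pairwise_reproducibility(replicates):
    """Average of the duplicate reproducibility over all possible pairs."""
    pairs = list(combinations(replicates, 2))
    return sum(reproducibility(list(pair)) for pair in pairs) / len(pairs)


def reproducibility_from_counts(n_shared, sizes):
    """Same measure computed from counts instead of identifier sets."""
    return 100.0 * n_shared / (sum(sizes) / len(sizes))


# Toy replicate analyses (made-up identifiers).
a1 = {"p1", "p2", "p3", "p4"}
a2 = {"p2", "p3", "p4", "p5"}
a3 = {"p3", "p4", "p6"}

triplicate = reproducibility([a1, a2, a3])
duplicate = mean_pairwise_reproducibility([a1, a2, a3])
print(f"triplicate: {triplicate:.1f}%, mean duplicate: {duplicate:.1f}%")

# Worked example from the legend: 422 shared proteins, three datasets.
legend_value = reproducibility_from_counts(422, [783, 660, 705])
print(f"HEK293 nuclear fraction: {legend_value:.0f}%")
```

The triplicate value is always at most as large as the duplicate values because a protein must be found in every replicate to count as shared, which is consistent with the 59% versus 70% reported in the legend.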

 

    ACKNOWLEDGMENTS
 
We thank all of the members of the Mass Spectrometry Department at Cellzome for fruitful discussions, Yann Abraham and Cristina Cruciat for help with the preparation of protein samples, and Gitte Neubauer for critical review of the manuscript.


    FOOTNOTES
 
Received, August 28, 2003, and in revised form, September 23, 2003.

Published, MCP Papers in Press, October 6, 2003, DOI 10.1074/mcp.M300087-MCP200

1 The abbreviations used are: MS, mass spectrometry; MS/MS, tandem mass spectrometry; 1D, one-dimensional; 2D, two-dimensional; LC, liquid chromatography; LC/LC, two orthogonal liquid chromatography steps; GeLC-MS/MS, 1D PAGE protein separation followed by nanocapillary LC-MS/MS analysis; IPI, International Protein Index; TMD, transmembrane domain; bis-Tris, 2-[bis(2-hydroxyethyl)amino]-2-(hydroxymethyl)propane-1,3-diol.

2 M. Schirle, M.-A. Heurtier, and B. Kuster, manuscript in preparation.

* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

S The on-line version of this article (available at http://www.mcponline.org) contains Tables S1–S7.

‡ To whom correspondence should be addressed. E-mail: bernhard.kuster@cellzome.com


    REFERENCES
 

  1. Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., and Brown, E. L. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14, 1675–1680

  2. Velculescu, V. E., Zhang, L., Vogelstein, B., and Kinzler, K. W. (1995) Serial analysis of gene expression. Science 270, 484–487

  3. Gygi, S. P., Rochon, Y., Franza, B. R., and Aebersold, R. (1999) Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol. 19, 1720–1730

  4. Henzel, W. J., Billeci, T. M., Stults, J. T., Wong, S. C., Grimley, C., and Watanabe, C. (1993) Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases. Proc. Natl. Acad. Sci. U. S. A. 90, 5011–5015

  5. Mann, M., and Wilm, M. (1994) Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem. 66, 4390–4399

  6. Eng, J., McCormack, A., and Yates, J. (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in protein databases. J. Am. Soc. Mass Spectrom. 5, 976–989

  7. Perkins, D. N., Pappin, D. J., Creasy, D. M., and Cottrell, J. S. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567

  8. Rabilloud, T. (2002) Two-dimensional gel electrophoresis in proteomics: old, old fashioned, but it still climbs up the mountains. Proteomics 2, 3–10

  9. Washburn, M. P., Wolters, D., and Yates, J. R., III (2001) Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. 19, 242–247

  10. Koller, A., Washburn, M. P., Lange, B. M., Andon, N. L., Deciu, C., Haynes, P. A., Hays, L., Schieltz, D., Ulaszek, R., Wei, J., Wolters, D., and Yates, J. R., III (2002) Proteomic survey of metabolic pathways in rice. Proc. Natl. Acad. Sci. U. S. A. 99, 11969–11974

  11. Florens, L., Washburn, M. P., Raine, J. D., Anthony, R. M., Grainger, M., Haynes, J. D., Moch, J. K., Muster, N., Sacci, J. B., Tabb, D. L., Witney, A. A., Wolters, D., Wu, Y., Gardner, M. J., Holder, A. A., Sinden, R. E., Yates, J. R., and Carucci, D. J. (2002) A proteomic view of the Plasmodium falciparum life cycle. Nature 419, 520–526

  12. Wu, C. C., MacCoss, M. J., Howell, K. E., and Yates, J. R. (2003) A method for the comprehensive proteomic analysis of membrane proteins. Nat. Biotechnol. 21, 532–538

  13. Gygi, S. P., Rist, B., Gerber, S. A., Turecek, F., Gelb, M. H., and Aebersold, R. (1999) Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 17, 994–999

  14. Gygi, S. P., Rist, B., Griffin, T. J., Eng, J., and Aebersold, R. (2002) Proteome analysis of low-abundance proteins using multidimensional chromatography and isotope-coded affinity tags. J. Proteome Res. 1, 47–54

  15. Peng, J., Elias, J. E., Thoreen, C. C., Licklider, L. J., and Gygi, S. P. (2003) Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res. 2, 43–50

  16. Wilm, M., Shevchenko, A., Houthaeve, T., Breit, S., Schweigerer, L., Fotsis, T., and Mann, M. (1996) Femtomole sequencing of proteins from polyacrylamide gels by nano-electrospray mass spectrometry. Nature 379, 466–469

  17. Lasonder, E., Ishihama, Y., Andersen, J. S., Vermunt, A. M., Pain, A., Sauerwein, R. W., Eling, W. M., Hall, N., Waters, A. P., Stunnenberg, H. G., and Mann, M. (2002) Analysis of the Plasmodium falciparum proteome by high-accuracy mass spectrometry. Nature 419, 537–542

  18. Andersen, J. S., Lyon, C. E., Fox, A. H., Leung, A. K., Lam, Y. W., Steen, H., Mann, M., and Lamond, A. I. (2002) Directed proteomic analysis of the human nucleolus. Curr. Biol. 12, 1–11

  19. Shevchenko, A., Wilm, M., Vorm, O., and Mann, M. (1996) Mass spectrometric sequencing of proteins silver-stained polyacrylamide gels. Anal. Chem. 68, 850–858

  20. Moller, S., Croning, M. D., and Apweiler, R. (2001) Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics (Oxf.) 17, 646–653

  21. Rice, P., Longden, I., and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277

  22. Zhang, M. Q. (2002) Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet. 3, 698–709

  23. Yi, E. C., Marelli, M., Lee, H., Purvine, S. O., Aebersold, R., Aitchison, J. D., and Goodlett, D. R. (2002) Approaching complete peroxisome characterization by gas-phase fractionation. Electrophoresis 23, 3205–3216

  24. Rappsilber, J., Ishihama, Y., Mittler, G., Mortensen, P., Foster, L., and Mann, M. (2003) Proceedings of the 51st ASMS Conference on Mass Spectrometry and Allied Topics, June 8–12, 2003, American Society for Mass Spectrometry, Santa Fe, NM

  25. Wu, S. L., Hancock, W. S., Goodrich, G. G., and Kunitake, S. T. (2003) An approach to the proteomic analysis of a breast cancer cell line (SKBR-3). Proteomics 3, 1037–1046

  26. Chen, G., Cao, P., and Goeddel, D. V. (2002) TNF-induced recruitment and activation of the IKK complex require Cdc37 and Hsp90. Mol. Cell 9, 401–410