Global Analysis of the Cortical Neuron Proteome *,S

Li-Rong Yu‡, Thomas P. Conrads‡, Takuma Uo§, Yoshito Kinoshita§, Richard S. Morrison§, David A. Lucas‡, King C. Chan‡, Josip Blonder‡, Haleem J. Issaq‡ and Timothy D. Veenstra‡

From the ‡ Laboratory of Proteomics and Analytical Technologies, SAIC-Frederick, Inc., National Cancer Institute at Frederick, P.O. Box B, Frederick, MD 21702-1201; and § Department of Neurological Surgery, University of Washington School of Medicine, Box 356470, Seattle, WA 98195


    ABSTRACT
In this study, a multidimensional fractionation approach was combined with MS/MS to increase the capability of characterizing complex protein profiles of mammalian neuronal cells. Proteins extracted from primary cultures of cortical neurons were digested with trypsin followed by fractionation using strong cation exchange chromatography. Each of these fractions was analyzed by microcapillary reversed-phase LC-MS/MS. The analysis of the MS/MS data resulted in the identification of over 15,000 unique peptides from which 3,590 unique proteins were identified based on protein-specific peptide tags that are unique to a single protein in the searched database. In addition, 952 protein clusters were identified using cluster analysis of the proteins identified by the peptides not unique to a single protein. This identification revealed that a minimum of 4,542 proteins could be identified from this experiment, representing ~16% of all known mouse proteins. An evaluation of the number of false-positive identifications was undertaken by searching the entire MS/MS dataset against a database containing the sequences of over 12,000 proteins from archaea. This analysis allowed a systematic determination of the level of confidence in the identification of peptides as a function of SEQUEST cross correlation (Xcorr) and delta correlation (ΔCn) scores. Correlation charts were also constructed to show the number of unique peptides identified for proteins from specific classes. The results show that low-abundance proteins involved in signal transduction and transcription are generally identified by fewer peptides than high-abundance proteins that play a role in maintaining mammalian cellular structure and motility. The results presented here provide the broadest proteome coverage for a mammalian cell to date and show that MS-based proteomics has the potential to provide high coverage of the proteins expressed within a cell.


The achievements that have been made in the field of genomics over the past decade have spurred the movement toward the characterization of the proteome. Unfortunately, the biological differences between a genome and its related proteome make the challenge of proteomics much greater. The genome of a particular cell type is a static entity, while its corresponding proteome is dynamic, changing with the slightest perturbation. In studies where the goal is to completely sequence an organism’s genome, there is a point at which the analysis can be deemed complete, in that the entire base pair sequence of the genome has been determined. In proteomic analysis, there is no such definable end point, because it is presently impossible to accurately determine the number of proteins, and their possible isoforms, that are actually present within a cell at any given time. Presently, we are faced with trying to put together a cell’s proteomic jigsaw puzzle without the knowledge of how many pieces make up that final picture.

One of the major goals of proteomics is to develop technologies capable of measuring the dynamic nature of protein expression, protein interactions, and post-translational modifications as a time-dependent function of the cellular state (1). Accumulating this type of data requires many measurements; therefore, high-throughput protein characterization is essential. The first piece of data required is the identity of the proteins that are expressed within the cell under a given set of conditions. This information provides a foundation for understanding the proteins that are observable within a cell’s proteome and provides insight into the components that differentiate cells that have identical genomes.

The traditional approach for fractionating and identifying proteins within complex proteome mixtures has been a combination of two-dimensional (2D) PAGE followed by MS or MS/MS analysis of the visualized protein spots. The bias of 2D-PAGE against proteins with extreme isoelectric points and molecular masses, as well as its difficulty in resolving membrane proteins, has been well documented (2). Not only are certain classes of proteins underrepresented by 2D-PAGE, but the subsequent identification of the separated proteins is also laborious and time-consuming (3). Alternative approaches for the identification of proteins circumvent the need for 2D gel fractionation and rely solely on multidimensional chromatographic separation of proteolytically digested proteins prior to MS/MS analysis (1, 4). While these solution-based methods do not provide a direct method for protein quantitation, they are routinely capable of providing well over 1,000 protein identifications in rapid (i.e. hours to days) fashion. Recent studies have used such solution-based strategies to identify 490 proteins in serum (5), 1,504 proteins in yeast (6), 2,528 proteins in rice (7), and, most recently, 2,415 proteins in Plasmodium (8).

We have applied a multidimensional fractionation method followed by MS/MS resulting in the identification of over 15,000 unique peptides corresponding to at least 4,542 proteins within the proteome of mouse cortical neurons. The raw data analysis was performed to provide a statistical analysis of the confidence in the protein identifications. In addition, plots were constructed to determine if there exists a correlation between a protein’s abundance and the number of unique peptides identified for that protein. The results show that proteins anticipated as being in high abundance, such as structural proteins, are typically identified by the largest number of unique peptides. The fewest number of unique peptides was associated with proteins of low abundance such as transcription factors and signal transducers. Although this data is qualitative in the strictest sense, the results obtained from the present study can be used to ascertain information about the protein abundances in complex mixtures and also identify novel proteins that have not previously been shown to exist in neural cells.


    EXPERIMENTAL PROCEDURES
Cortical Neuron Proteome Sample Preparation—
Primary cortical neuron cultures were established from newborn mouse pups on a 129/Sv × C57BL/6 background as described previously (9). Briefly, cortical brain tissue was excised, trypsinized, and dissociated by trituration to obtain single cells. Cells were then plated onto poly-D-lysine-coated cultureware and maintained in Neurobasal-A medium with B27 supplements (Invitrogen, Carlsbad, CA). Cultures were maintained for 4 days and lysed directly in the culture dishes with a lysis buffer containing 50 mM Tris-HCl (pH 8.5), 2% Triton X-100, 10 mM NaF, and 1 mM Na3VO4. The lysed neurons were scraped into a microcentrifuge tube and sonicated five times (10 s each) (Branson digital sonifier 250; Danbury, CT) on ice. The lysate was centrifuged at 15,000 × g for 15 min at 4 °C. The supernatant was collected and desalted into 50 mM NH4HCO3, pH 8.3, using a PD-10 column (Amersham Biosciences, Uppsala, Sweden). Protein concentration was determined using a bicinchoninic acid protein assay. An aliquot of lysate containing 100 µg of solubilized protein was digested overnight at 37 °C with sequencing-grade modified trypsin (Promega, Madison, WI) at a ratio of 50:1 (w/w, protein-to-trypsin). The digestion reaction was terminated by boiling the sample in a water bath for 10 min. The sample was acidified with 1% formic acid and desalted using a 1-ml Oasis MCX extraction cartridge (Waters Corporation, Milford, MA). The desalted digestate was lyophilized and stored at –80 °C.

Strong Cation Exchange (SCX) Liquid Chromatographic Fractionation—
The cortical neuron protein digestate was dissolved in 25% ACN and 0.1% TFA and loaded onto a 1-mm inner diameter × 150-mm SCXLC column (PolyLC, Columbia, MD) that was pre-equilibrated with 25% ACN delivered by an Agilent 1100 capillary LC system (Agilent Technologies, Palo Alto, CA). The peptides were eluted by a gradient generated from mobile phase A (25% ACN in water) and mobile phase B (25% ACN with 0.5 M ammonium formate, pH 3) over 96 min. Ninety-six fractions were collected for microcapillary LC (µLC)-MS/MS analysis.

Reversed-phase (RP) µLC-MS/MS of SCX Fractions—
Ten-centimeter-long µRPLC-ESI columns were coupled online with an ion trap (IT) MS (LCQ Deca XP; Thermo Finnigan, San Jose, CA) to analyze each SCXLC fraction. To construct the µRPLC-ESI columns, 75-µm inner diameter fused-silica capillaries (Polymicro Technologies, Phoenix, AZ) were flame-pulled to construct a 10-cm fine inner diameter (i.e. 5–7 µm) tip against which Luna C18(2) (3-µm diameter, 100 Å pore size) (Phenomenex, Torrance, CA) RP particles were slurry-packed using a slurry packing pump (Model 1666; Alltech Associates, Deerfield, IL). The columns were connected via a stainless steel union to an Agilent 1100 capillary LC system (Agilent Technologies), which was used to deliver mobile phases A (0.1% formic acid in water) and B (0.1% formic acid in ACN). After loading one-third of the content of each SCXLC fraction, the peptides were eluted at a flow rate of 300 nl/min using a step gradient of 2–40% solvent B for 120 min and 40–85% solvent B for 30 min. The IT-MS was operated in a data-dependent MS/MS mode using a normalized CID energy of 30%. The voltage and temperature for the capillary of the ion source were set at 10 V and 180 °C, respectively.

Cross-database Search and MS Informatics for Global Proteome Characterization—
The raw MS/MS data were searched using SEQUEST (10) against a mouse proteome database from the National Center for Biotechnology Information (www.ncbi.nih.gov) (mouse database I, 12,000 protein entries) and a mouse proteome database from the European Bioinformatics Institute (EBI) (www.ebi.ac.uk/proteome/index.html) (mouse database II, 20,624 protein entries). The raw MS/MS data were also searched against an Archaean protein database (12,038 protein entries) from EBI, which consists of the protein sequences from five Archaean species (i.e. Aeropyrum pernix, Archaeoglobus fulgidus, Pyrobaculum aerophilum, Sulfolobus tokodaii, and Thermoplasma volcanium). For the SEQUEST analysis, the peptide mass tolerance was set to 2.5 Da and the fragment ion tolerance to 0.5 Da. A tryptic enzyme restriction with a maximum of two internal missed cleavage sites was used. No residues (i.e. cysteine and methionine) were considered as modified in the database search. Based on the cross-database search, the SEQUEST criteria for confident peptide identification, namely the cross correlation (Xcorr) and delta cross correlation (ΔCn) scores, were set and applied to the global mouse proteome identification by searching a larger mouse proteome database downloaded from EBI (mouse database III, 28,437 protein entries).

An in silico tryptic digestion (accounting for up to two missed cleavages) of the mouse protein fasta database (mouse database III) was performed to create a reference table that contains each tryptic fragment for each protein along with the corresponding Swiss-Prot accession numbers. The filtered TurboSEQUEST nonredundant peptide identification table (referred to as the ID table) was queried against the reference table using Microsoft Access. Because a single peptide in the ID table may correlate to more than one entry in the reference table, and therefore to more than one protein, it is necessary to count the number of times a peptide within the ID table correlates to a peptide in the reference table in order to assess the extent to which a given peptide uniquely identifies a given protein. Hence, a peptide within the ID table that correlates to exactly one entry in the reference table is unique to a single protein reference and is, by definition, called a protein-specific peptide tag (PPT). The final list of unique proteins identified in this work (Supplemental Table I) is derived solely from PPTs; other peptides that were identified and may correspond to these unique proteins have also been listed. To clarify which of these peptides uniquely identify a protein, the number of Swiss-Prot protein references to which each peptide is linked is included within Supplemental Table I (i.e. a 1 refers to a peptide that is distinct within the entire mouse database, whereas any number >1 means the peptide corresponds to more than one protein and cannot uniquely identify these proteins).
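The reference-table lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Microsoft Access workflow used in the study: the cleavage rule (cut after K or R unless the next residue is P), the function names, and the two toy protein sequences with accessions P1 and P2 are all hypothetical.

```python
from collections import defaultdict

def cleavage_fragments(sequence):
    """Split a protein after K or R, except before P (assumed trypsin rule)."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal fragment
    return fragments

def tryptic_peptides(sequence, max_missed=2):
    """All peptides obtained by joining up to max_missed + 1 adjacent fragments."""
    fragments = cleavage_fragments(sequence)
    peptides = set()
    for i in range(len(fragments)):
        for j in range(i, min(i + max_missed + 1, len(fragments))):
            peptides.add("".join(fragments[i:j + 1]))
    return peptides

def build_reference_table(proteins):
    """Map each in silico peptide to the set of accessions that contain it."""
    table = defaultdict(set)
    for accession, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence):
            table[peptide].add(accession)
    return table

# Hypothetical two-protein database and a hypothetical ID table.
toy_database = {"P1": "AAKCCRDDK", "P2": "AAKEERFF"}
reference = build_reference_table(toy_database)
id_table = ["AAK", "CCR", "EERFF"]

# Number of protein references each identified peptide links to.
match_counts = {pep: len(reference.get(pep, ())) for pep in id_table}
# A peptide linked to exactly one protein is a protein-specific peptide tag.
ppts = [pep for pep, n in match_counts.items() if n == 1]
ppt_proteins = {next(iter(reference[pep])) for pep in ppts}
```

In this toy case the peptide AAK occurs in both proteins and therefore cannot serve as a PPT, while CCR and EERFF each point to a single accession.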


TABLE I List of proteins identified by 20 or more unique peptides (Npeptides) along with their molecular function

 
Although a single protein may contain dozens of unique tryptic peptides, those peptides actually identified in a proteome experiment (the ID table) may correspond to more than one entry in the reference table and may not be unique to a single protein. In these cases, the peptide in the ID table does not correlate to a unique protein, and if this situation is not taken into account, the number of unique proteins identified will be over-counted. The proteins identified solely by these peptides rather than by PPTs were grouped by cluster analysis into protein clusters (ProCluster). To perform the cluster analysis, each peptide was tabulated against its corresponding proteins, via the protein accession numbers, to make a peptide-protein matrix in which a value of 1 was given where a peptide matched a protein and 0 where it did not. Average linkage hierarchical cluster analysis was then conducted using the program Cluster, and the data were visualized in TreeView (11). Each protein cluster consists of at least two proteins (based on accession number) that share the same peptides with each other. The protein clusters, with the accession number, name, peptides, and so forth for each protein within a cluster, are listed in Supplemental Table II.
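The peptide-protein matrix can be illustrated with a short Python sketch. Note the substitution: the study used average linkage hierarchical clustering in the program Cluster, whereas the grouping step below is a simpler stand-in that merges proteins connected through any shared peptide (a union-find over matrix columns). The peptide labels and accessions are hypothetical.

```python
def build_matrix(peptide_to_proteins):
    """Return (peptides, proteins, matrix) with matrix[i][j] = 1 if peptide i matches protein j."""
    peptides = sorted(peptide_to_proteins)
    proteins = sorted({p for ps in peptide_to_proteins.values() for p in ps})
    matrix = [
        [1 if prot in peptide_to_proteins[pep] else 0 for prot in proteins]
        for pep in peptides
    ]
    return peptides, proteins, matrix

def group_proteins(proteins, matrix):
    """Group proteins linked by shared peptides (simplified stand-in for clustering)."""
    parent = list(range(len(proteins)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for row in matrix:
        columns = [j for j, value in enumerate(row) if value]
        for j in columns[1:]:
            parent[find(j)] = find(columns[0])  # merge proteins sharing this peptide

    groups = {}
    for j, protein in enumerate(proteins):
        groups.setdefault(find(j), []).append(protein)
    return sorted(groups.values())

# Hypothetical non-unique peptides and the accessions they match.
shared = {
    "PEPA": {"X1", "X2"},
    "PEPB": {"X2", "X3"},
    "PEPC": {"Y1", "Y2"},
}
peptides, proteins, matrix = build_matrix(shared)
clusters = group_proteins(proteins, matrix)
```

Here X1, X2, and X3 end up in one group because X2 shares a peptide with each of the others, and Y1 and Y2 form a second group. A true average linkage analysis would additionally rank how similar the peptide profiles are within each group.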

Immunostaining—
Mice were perfused transcardially, under deep Nembutal anesthesia, first with heparinized saline followed by 4% paraformaldehyde in 0.1 M phosphate buffer (pH 7.4). The brain was immediately removed, post-fixed in the same fixative for 4 h, and cryoprotected in 10% and subsequently 30% sucrose in phosphate buffer. Parasagittal frozen sections were cut at 30 µm on a sliding microtome. After blocking in PBS containing 5% goat serum and 1% BSA, free-floating sections were incubated with the polyclonal pescadillo antibody (0.5 µg/ml) for 60 h at 4 °C. Sections were washed four times in PBS, including a wash with 1% H2O2 to quench endogenous peroxidase activity, and then incubated with a biotinylated goat anti-rabbit IgG secondary antibody (2 µg/ml; Jackson ImmunoResearch Laboratories, West Grove, PA) for 24 h at 4 °C. Immunoreactivity was visualized by incubating with avidin-biotinylated peroxidase complexes (Vectastain Elite ABC kit; Vector Laboratories, Burlingame, CA) overnight at 4 °C, followed by color development with diaminobenzidine as a peroxidase substrate.

Micrographs were taken using an Axiovert 100 inverted microscope (Carl Zeiss Microimaging, Thornwood, NY) with a cooled CCD camera (Cooke Corp, Auburn Hills, MI) and Slidebook image analysis software (Intelligent Imaging Innovations, Denver, CO). Immunostaining of samples requiring a direct comparison was done in single runs, and the subsequent processing of images was performed in an identical way for individual photographs using Slidebook and Photoshop (Adobe, San Jose, CA).


    RESULTS
SCXLC Fractionation and µRPLC-MS/MS Analysis of Cortical Neuron Proteome—
While µRPLC-MS/MS is ideally suited for detection and identification of peptides, it is unlikely that a single dimension of separation is sufficient to obtain broad proteome coverage. Therefore, multidimensional chromatographic fractionation was employed in this study to decrease the complexity of the samples being analyzed in any single µLC-MS/MS analysis, resulting in increased peak capacity and dynamic range of protein identifications of the overall measurements. Peptides derived from the cortical neuron digestate were separated into 96 fractions by SCXLC (Fig. 1). Each of these SCXLC fractions was analyzed by µRPLC-MS/MS. Even though the SCXLC fractionation had comparatively high resolution and 96 fractions were collected, the resulting µRPLC-MS/MS base peak chromatograms were quite complex, as shown for selected examples in Fig. 1. Examples of mass spectra acquired at different time points during the µRPLC-MS/MS analysis of each SCXLC fraction (Fig. 1) also show that the number of peptides eluted at any particular time can be well above the sampling capacity of a typical data-dependent MS/MS analysis. These µRPLC-MS/MS base peak chromatograms and mass spectra demonstrate the high complexity of cell lysates and clearly exemplify the importance of continued development of technologies and sample preparation strategies for enabling increased protein coverage, especially in the case of mammalian proteomics.



FIG. 1. A SCX chromatogram of 100 µg of trypsin digestate of total protein extract from the mouse cortical neurons and representative base peak chromatograms and representative MS scans of the µLC-MS/MS analysis of the SCX fractions. A total of 96 SCX fractions were collected at 1-min intervals.

 
Protein Database Searching and Filtering—
The data from the 96 µRPLC-MS/MS analyses were searched against mouse protein databases using SEQUEST as shown in Fig. 2. During the course of this study, the number of proteins within the available databases increased from 12,000 to over 28,000, warranting a re-search of the original data against the largest available proteomic database. To determine the confidence level of the peptide identifications at various SEQUEST Xcorr scores, three of the µRPLC-MS/MS fractions that showed a large number of peptide identifications when searched against the smaller mouse database (12,000 protein entries) were re-searched against a combined database (12,038 protein entries) of five different Archaean protein databases. The purpose of searching the data against the Archaean protein database is to gain a measure of the peptides identified through a random correlation between the MS/MS spectra and an arbitrary peptide sequence. These identifications should be considered false positives because the Archaean protein sequences are extremely divergent from those of mouse. The size of this Archaean database was equivalent to that of the initial mouse database used to search the MS/MS spectra, ensuring that the opportunity for random hits in the database search is equivalent for the two databases. The results of this cross-database search allowed us to determine confidence levels of peptide identifications based on their SEQUEST Xcorr value.



FIG. 2. A flowchart showing the cross-database search scheme for the determination of SEQUEST criteria for confident peptide identification. The MS/MS data from SCX fractions 32, 36, and 40 were searched against both the Archaean protein database and mouse protein database I, and the data from all 96 SCX fractions were searched against both mouse databases I and II. The confidence levels for peptide identification were calculated based on these cross-database searches, and the SEQUEST criteria were set for identifying peptides with a ≥90% confidence level by searching the largest mouse database available, mouse database III (details in the text).

 
The number of fully tryptic peptides identified within the mouse or Archaean databases in relation to different Xcorr cutoffs is shown in Fig. 3 for singly ([M+H]+), doubly ([M+2H]2+), and triply charged ([M+3H]3+) peptide molecular ions. When low Xcorr values are considered, the number of peptides identified using both databases is very similar. As the Xcorr value threshold is increased, however, the number of peptides identified by searching against the mouse database remains considerably higher than the number of peptides identified by searching against the Archaean database, although the absolute number of peptides decreases dramatically. These observations suggest that most of the mouse peptide identifications result from random matches at low Xcorr cutoffs such as 1.5 for [M+H]+ and [M+2H]2+ peptides and 2.0 for [M+3H]3+ peptides, and that confident peptide identification can only be reached when the Xcorr value thresholds are significantly increased. The probability of a positive peptide identification was then calculated, separately for each precursor charge state, by dividing the number of positive mouse peptide identifications (total mouse identifications minus total Archaean identifications) by the total number of mouse peptide identifications at a specific Xcorr threshold. From this calculation, minimum Xcorr values were determined that provide a 90% confidence limit in singly, doubly, and triply charged peptides identified by SEQUEST searching against the mouse database. These Xcorr thresholds are 2.2 for [M+H]+, 2.5 for [M+2H]2+, and 3.0 for [M+3H]3+ peptides.
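The confidence calculation described above reduces to a one-line formula, sketched here in Python. The peptide counts in the example are invented purely for illustration and are not data from the study; the function names are likewise hypothetical.

```python
def identification_confidence(n_mouse, n_archaea):
    """(mouse IDs - Archaean IDs) / mouse IDs at one Xcorr cutoff and charge state."""
    if n_mouse == 0:
        return None  # no identifications, so confidence is undefined
    return (n_mouse - n_archaea) / n_mouse

def minimum_cutoff(counts_by_cutoff, target=0.90):
    """Smallest Xcorr cutoff whose confidence reaches the target, or None."""
    for cutoff in sorted(counts_by_cutoff):
        n_mouse, n_archaea = counts_by_cutoff[cutoff]
        confidence = identification_confidence(n_mouse, n_archaea)
        if confidence is not None and confidence >= target:
            return cutoff
    return None

# Hypothetical counts for one charge state: cutoff -> (mouse hits, Archaean hits).
example_counts = {
    1.5: (1000, 900),
    2.0: (400, 120),
    2.5: (200, 18),
    3.0: (100, 2),
}
threshold = minimum_cutoff(example_counts, target=0.90)
```

With these made-up numbers the confidence rises from 10% at a cutoff of 1.5 to 91% at 2.5, so 2.5 would be the lowest cutoff meeting a 90% target. The calculation would be repeated for each precursor charge state.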



FIG. 3. Evaluation of different Xcorr thresholds on the number of peptide identifications and peptide identification confidence levels at +1 (left), +2 (middle), and +3 (right) precursor charge states. Cross-database searches were conducted by searching the MS/MS data of SCX fractions 32, 36, and 40 against both Archaean and mouse databases as described in the text, and results were filtered for fully tryptic peptides. A false positive was defined as a peptide identified from the Archaean database. The minimum Xcorr values were determined to be 2.2, 2.5, and 3.0 for singly, doubly, and triply charged precursor peptides, respectively, which provided a 90% confidence limit in peptide identification by SEQUEST.

 
Similar results were obtained when the peptide identifications from searching the datasets of 96 SCX fractions against the small mouse protein database (database I, 12,000 entries) were compared with those obtained from searching the larger mouse protein database (database II, 20,624 entries). We sought to determine if the same tryptic peptide was identified from the same MS/MS spectrum when searching these two databases, and considered such peptides as "positive identifications." The probability of a positive identification was calculated by dividing the number of "positive peptide identifications" by the total number of peptides identified from mouse database I at a certain Xcorr cutoff. As shown in Fig. 4A, the probability of a positive peptide identification increases with increasing Xcorr cutoffs for [M+H]+, [M+2H]2+, and [M+3H]3+ peptides. When the Xcorr cutoffs are set as 2.1, 2.5, and 2.9 for [M+H]+, [M+2H]2+, and [M+3H]3+ peptides, respectively, all the confidence levels of positive identification are above 90%.
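The database-concordance measure can be sketched as follows. The data layout (a dictionary from MS/MS scan number to the top-ranked peptide and its Xcorr), the inclusive cutoff comparison, and the example values are assumptions made for illustration only.

```python
def concordance(ids_db1, ids_db2, cutoff):
    """Fraction of database I identifications at or above the cutoff that
    database II assigns to the same peptide for the same MS/MS scan."""
    kept = {scan: pep for scan, (pep, xcorr) in ids_db1.items() if xcorr >= cutoff}
    if not kept:
        return None  # nothing passes the cutoff
    agreeing = sum(
        1 for scan, pep in kept.items()
        if scan in ids_db2 and ids_db2[scan][0] == pep
    )
    return agreeing / len(kept)

# Hypothetical search results: scan number -> (peptide, Xcorr).
db1_hits = {1: ("AAK", 1.6), 2: ("CCR", 2.6), 3: ("DDK", 3.0), 4: ("EER", 2.7)}
db2_hits = {1: ("GGK", 1.7), 2: ("CCR", 2.6), 3: ("DDK", 3.0), 4: ("EER", 2.7)}

low_cutoff = concordance(db1_hits, db2_hits, 1.5)
high_cutoff = concordance(db1_hits, db2_hits, 2.5)
```

In this toy example the low-scoring scan 1 is assigned different peptides by the two databases, so raising the cutoff from 1.5 to 2.5 removes the disagreement and the concordance rises from 0.75 to 1.0.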



FIG. 4. A, evaluation of Xcorr thresholds on the probability of positive peptide identification by cross-database search of MS/MS data from 96 SCX fractions against mouse database I (12,000 entries) and mouse database II (20,624 entries). A positive identification was defined as the same tryptic peptide identified from the same MS/MS scan when searching both databases. The probability of positive identification was calculated by dividing the number of "positive peptide identifications" by the total number of peptides identified from mouse database I at certain Xcorr cutoffs. Similar results were obtained as shown in Fig. 3. B, effect of peptide molecular mass of doubly charged peptides on the Xcorr threshold required to achieve a ≥90% confidence level in the peptide identification. C, an analysis of the effect of ΔCn on the confidence levels of peptide identification at different Xcorr cutoffs. Only doubly charged tryptic peptides were included in the analysis. Peptide identification confidence is more sensitive to ΔCn when low Xcorr thresholds are applied.

 
As illustrated by the increasing Xcorr thresholds required for confident identification of [M+H]+ and [M+3H]3+ peptides, the length of a peptide can also affect the Xcorr value needed to achieve ≥90% confidence for peptide identification. While [M+H]+ and [M+3H]3+ charged species are typically produced by peptides with low and high molecular masses (Mr), respectively, [M+2H]2+ peptides can span a wide mass range. Therefore, the effect of the molecular mass of [M+2H]2+ peptides on the Xcorr threshold required to achieve a ≥90% confidence level in peptide identification was evaluated. The peptides were arbitrarily divided into two sets, one with molecular mass <1,200 Da and the other with molecular mass ≥1,200 Da. A plot of the probability of positive identification of each set of peptides versus Xcorr thresholds (Fig. 4B) demonstrates that higher Xcorr cutoffs are required for high Mr peptides to achieve the same confidence level for peptide identification as for lower Mr peptides. While an Xcorr cutoff of 2.2 is required for identifying peptides with Mr < 1,200 Da at a confidence level of ≥90%, an Xcorr value of ~2.5 is needed for peptides with Mr ≥ 1,200 Da at the same confidence level.

The above investigations were conducted without consideration of a delta correlation (ΔCn) cutoff, another criterion commonly used in SEQUEST for peptide identification. The effect of ΔCn thresholds on the confidence in peptide identification for [M+2H]2+ peptides at different Xcorr cutoffs is shown in Fig. 4C. When Xcorr thresholds are set at a low level (i.e. 1.5 and 1.9), slight changes in the ΔCn threshold (when ΔCn < 0.3) have a large impact on the confidence in peptide identification. As the Xcorr threshold is increased, however, ΔCn has much less impact on the confidence of the identifications. For example, the slope of the line between ΔCn values of 0 and 0.1 is ~165 when using an Xcorr threshold of 1.5, but approaches zero when the Xcorr threshold is increased to 2.8. Therefore, at sufficiently high Xcorr, the confidence level for a positive identification becomes less dependent on the ΔCn value. Regardless of the Xcorr threshold, however, the confidence of the identifications exceeds 90% when the ΔCn cutoff is above ~0.25.

The parameter Xcorr measures the extent to which an experimental MS/MS spectrum corresponds to a mass and theoretical MS/MS spectrum of a peptide within a given proteomic database, while the ΔCn measures how far the Xcorr value of the first (top) candidate peptide is from that of the second candidate peptide. Based on the analyses presented, we determined that utilizing Xcorr thresholds for tryptic peptide identification of 2.1 for [M+H]+ peptides, 2.2 for [M+2H]2+ peptides with Mr < 1,200 Da, 2.5 for [M+2H]2+ peptides with Mr ≥ 1,200 Da, 2.9 for [M+3H]3+ peptides, and 0.08 for the ΔCn cutoff results in a 95% confidence in peptide identification. Utilizing the parameters described, a total of 15,300 unique tryptic peptides were identified from primary cultures of cortical neurons from a total of 33 µg of sample when the 96 datasets were searched against the mouse database with 28,437 protein entries. A histogram of the total number of peptides identified based on their charge state and Xcorr score is shown in Fig. 5. Of the 26,566 total (redundant) number of peptides identified, 7.5% were identified as being from [M+H]+ peptides, 71.7% were from [M+2H]2+ peptides, and 20.8% were from [M+3H]3+ peptides. This classification shows that a large percentage of the peptides were identified with Xcorr values much greater than the minimum values required to achieve a 90% confidence level for correct identification. For example, 67.6% of the [M+2H]2+ peptides were identified with Xcorr values greater than 3.1 and 61.5% of the [M+3H]3+ peptides were identified with Xcorr values greater than 3.5. In addition, two-thirds of all of the peptides identified had Xcorr values at least 25% greater than the minimal SEQUEST parameter values used for peptide identification (i.e. at least 3.1 for [M+2H]2+ ions greater than 1,200 Da in mass).
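The acceptance criteria listed above can be expressed as a small filter, sketched here in Python. The threshold values are those given in the text; the function names, the argument layout, and the choice to treat each cutoff as inclusive (score at or above the threshold passes) are assumptions.

```python
def xcorr_threshold(charge, mass):
    """Minimum Xcorr for a tryptic peptide of the given precursor charge and Mr (Da)."""
    if charge == 1:
        return 2.1
    if charge == 2:
        return 2.2 if mass < 1200 else 2.5
    if charge == 3:
        return 2.9
    return None  # other charge states are not covered by the stated criteria

def passes_filter(charge, mass, xcorr, delta_cn, min_delta_cn=0.08):
    """True if an identification meets both the Xcorr and the delta-Cn criteria."""
    threshold = xcorr_threshold(charge, mass)
    if threshold is None:
        return False
    return xcorr >= threshold and delta_cn >= min_delta_cn

# Hypothetical identifications: (charge, Mr in Da, Xcorr, delta-Cn).
candidates = [
    (2, 1500.0, 2.6, 0.10),  # passes: heavy doubly charged peptide above 2.5
    (2, 1500.0, 2.4, 0.10),  # fails: below the 2.5 cutoff for Mr >= 1,200 Da
    (2, 1000.0, 2.3, 0.10),  # passes: light doubly charged peptide above 2.2
    (3, 2500.0, 3.2, 0.05),  # fails: delta-Cn below 0.08
]
accepted = [c for c in candidates if passes_filter(*c)]
```

Splitting the doubly charged peptides by mass is what lets lighter peptides pass at 2.2 while heavier ones must reach 2.5.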



FIG. 5. A histogram illustrating the number of peptides identified (including redundant peptide identifications) with different charge states in different Xcorr score ranges.

 
More than 11,000 peptide identifications, accounting for 42% of the total peptide identifications, are redundant. The remaining 58% of the peptides were identified only once (i.e. in a single MS/MS spectrum) in the entire analysis. We examined where in the analytical process the redundant identifications arose. The results indicated that 14–16% of the total peptide identifications came from MS/MS acquisition of the same peptides in different charge states, and ~6% originated from repeated analysis of the same peptides eluting over a long time period in a single RP gradient (longer than the time window of dynamic exclusion). Approximately another 20% were contributed by peptide content overlap between two adjacent SCXLC fractions. The peptide overlap between one SCXLC fraction and the second adjacent fraction was only 2–4%. This low overlap between SCXLC fractions suggests the peptides were not over-fractionated when they were subjected to SCXLC fractionation into 96 fractions, and this highly efficient fractionation enabled more peptides to be analyzed by MS/MS and increased the overall dynamic range.
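The redundancy figures can be checked with simple arithmetic, shown here as a short Python sketch. The totals (26,566 identifications, 15,300 unique peptides) and the percentage ranges come from the text; treating the four listed sources as separate, additive contributions is an assumption of this sketch.

```python
total_identifications = 26566   # redundant peptide identifications
unique_peptides = 15300         # nonredundant peptides

redundant = total_identifications - unique_peptides
redundant_percent = 100 * redundant / total_identifications

# Reported contributions to redundancy, as (low, high) percent of all identifications.
sources = {
    "same peptide, different charge states": (14, 16),
    "long elution within one RP gradient": (6, 6),
    "overlap between adjacent SCX fractions": (20, 20),
    "overlap with the second adjacent fraction": (2, 4),
}
low_total = sum(low for low, _ in sources.values())
high_total = sum(high for _, high in sources.values())
```

Under that assumption the itemized sources sum to 42–46%, which brackets the roughly 42% redundancy computed from the totals.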

Proteins Identified in Mouse Cortical Neurons—
Although 15,300 nonredundant peptides were identified in this study, not every peptide could be assigned uniquely to a single protein. A total of 3,590 proteins (Supplemental Table I) were definitively identified from 12,839 peptides, with each protein supported by at least one peptide unique to it, defined as a PPT (described in detail in "Experimental Procedures"). A histogram illustrating the number of unique peptides and proteins identified in the various SCXLC fractions is shown in Fig. 6, along with the SCXLC fluorescence chromatogram. The remaining 2,461 peptides, however, could not be assigned to a single protein. These peptides mapped to 2,322 protein entries according to Swiss-Prot accession number, some of which are redundant. To address this issue, cluster analysis was conducted to group proteins by the peptides they share. This analysis resulted in 952 distinct protein clusters (each assigned a ProCluster number; Supplemental Table II). As shown in Fig. 7, protein cluster 1 consists of four accession numbers. Q9QXE7 and Q8BMM0 are the primary and secondary accession numbers, respectively, of the same protein, transducin β-like 1X protein. This protein has one identified peptide, HQEPVYSVAFSPDGK, that distinguishes it from the other protein in the cluster, which also has two accession numbers. Hence this cluster may contain two proteins. Similarly, cluster 2 contains various actin isoforms, with some identified peptides shared by two or more isoforms. In the protein database, proteins with similar sequences, protein fragments, mutations, and protein isoforms in the same or different cell types all give rise to identified peptides that are not unique to a single protein with a distinct accession number. The number of protein clusters reported here is preliminary; however, it represents the minimum number of proteins that could be identified. This issue is even more difficult to address in quantitative proteome studies, where further statistical analysis is necessary to achieve accurate quantitation of a single protein.
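The grouping step described above can be illustrated with a minimal sketch. It is not the cluster analysis software used in this study; it swaps in a simple connected-components (union-find) grouping, in which any two accession numbers that share an identified peptide fall into the same cluster. The function name, the placeholder peptides, and the accession numbers ACC1 through ACC4 are hypothetical; only Q9QXE7, Q8BMM0, and the peptide HQEPVYSVAFSPDGK come from the text.

```python
from collections import defaultdict

def split_and_cluster(peptide_to_proteins):
    """Separate protein-specific peptides from shared ones, then group
    accessions that are linked by shared peptides (connected components)."""
    unique_hits = defaultdict(set)   # accession -> peptides unique to it
    shared = {}                      # peptide -> accessions (two or more)
    for peptide, accessions in peptide_to_proteins.items():
        accessions = set(accessions)
        if len(accessions) == 1:
            unique_hits[next(iter(accessions))].add(peptide)
        else:
            shared[peptide] = accessions

    parent = {}  # union-find forest over accessions hit by shared peptides

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for accessions in shared.values():
        first, *rest = sorted(accessions)
        find(first)
        for other in rest:
            union(first, other)

    clusters = defaultdict(set)
    for accession in parent:
        clusters[find(accession)].add(accession)
    return dict(unique_hits), sorted(sorted(c) for c in clusters.values())

# Toy input; peptide strings other than HQEPVYSVAFSPDGK are placeholders.
toy = {
    "PEPTIDEONE": ["ACC1"],                      # unique to one accession
    "HQEPVYSVAFSPDGK": ["Q9QXE7", "Q8BMM0"],     # shared by two accessions
    "PEPTIDETWO": ["Q8BMM0", "ACC2"],
    "PEPTIDETHREE": ["ACC3", "ACC4"],
}
unique_hits, clusters = split_and_cluster(toy)
print(unique_hits)
print(clusters)
```

Counting one protein per cluster, as done in the text, gives a lower bound, because a single cluster may contain more than one real protein.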



FIG. 6. A histogram illustrating the number of unique peptides and proteins identified in µRPLC-MS/MS analysis of each SCX fraction of a cortical neuron protein digestate (SCX elution profile overlaid as a dotted line). A total of 15,300 unique peptides, from which 3,590 proteins (not including 952 protein clusters) were identified, were obtained from a total of 33 µg of sample based on the SEQUEST criteria described in the text.

 


FIG. 7. A representative 2D display of cluster analysis of the proteins with identified peptides not specific to a single protein. Protein isoforms or proteins with similar sequences are clustered based on the shared peptides identified. The protein names and accession numbers for protein clusters 1 and 2 are presented in the tables on the right.

 
As discussed above, one of the unknowns in proteomics is whether every constituent within a proteome has been identified. Making liberal assumptions (many of which are admittedly too simplistic), however, can provide a sense of what percentage of the proteins within the proteome of mouse cortical neurons has been identified by at least a single peptide in this study. There are 28,437 proteins listed in the mouse protein database used in this study, of which at least 4,542 (3,590 definitively identified proteins plus 952 protein clusters) could be identified, assuming that only one real protein was identified from each protein cluster. This corresponds to ~16% of the database. Assuming that only 35% of the proteins within the database are expressed at any given time, this suggests that ~46% of the expressed proteome was identified in this study. This approximation is obviously high because the calculation does not take into account the presence of modified (both post-transcriptionally and post-translationally) proteins. This calculation illustrates, however, that technology has advanced to the point of providing significant coverage of a proteome’s protein constituents.
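The arithmetic behind this estimate can be reproduced with a short sketch. The numbers are those given in the text; the function and variable names are illustrative, and the 35% expressed fraction is the assumption stated above, not a measured value.

```python
def coverage(identified, database_size, expressed_fraction=1.0):
    """Fraction of the (assumed) expressed proteome that was identified."""
    return identified / (database_size * expressed_fraction)

definitive = 3590        # proteins with at least one protein-specific peptide
clusters = 952           # one protein counted per cluster (a lower bound)
identified = definitive + clusters
database_size = 28437    # entries in the mouse protein database searched

of_database = coverage(identified, database_size)
# Assumption from the text: only 35% of database proteins are expressed.
of_expressed = coverage(identified, database_size, expressed_fraction=0.35)

print(f"{identified} proteins: {of_database:.0%} of the database, "
      f"{of_expressed:.0%} of the assumed expressed proteome")
```

Because the result scales inversely with the assumed expressed fraction, the estimate is only as reliable as that assumption.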

Of importance in global proteomic analysis is the ability to identify proteins from every compartment within the cell. This need is particularly focused on membrane proteins because of solubility issues with this class of proteins. In this study, almost 29% of the identified proteins were classified as membranous by gene ontology. This compares favorably with genomic analysis that predicts 20–30% of the genome encodes membrane proteins. In addition, a study of mouse brain homogenates that employed an enzymatic digestion strategy for the identification of membrane proteins identified about 28% of the proteins in its mixture as membranous (12). Obviously the technology used in this study is not biased against membrane proteins and provides very good coverage of proteins from all cellular compartments.

Another indicator of the bias of the survey presented in this study is the number of peptides identified per protein from particular classes. The overall distribution of the number of peptides identified per protein is shown in Fig. 8A. Approximately 61% of the proteins were identified by two or more peptides. A comparison of the number of peptides identified per intracellular and membrane protein is shown in Fig. 8, B and C. Approximately 65% of the intracellular proteins were identified by two or more peptides, while 57% of the membrane proteins were identified by two or more peptides. Overall, these plots do not show a striking bias against the identification of membrane proteins.
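The comparison above reduces to computing, for each class of proteins, the fraction identified by two or more peptides. A minimal sketch follows, assuming each protein is available as a (compartment, peptide count) pair; the toy values are hypothetical and are not data from this study.

```python
from collections import defaultdict

def fraction_with_at_least(peptide_counts, minimum=2):
    """Fraction of proteins identified by at least `minimum` peptides."""
    counts = list(peptide_counts)
    if not counts:
        return 0.0
    return sum(1 for n in counts if n >= minimum) / len(counts)

# Hypothetical toy input: (compartment, number of peptides identified).
toy_proteins = [
    ("intracellular", 1), ("intracellular", 3), ("intracellular", 5),
    ("membrane", 1), ("membrane", 2), ("membrane", 1), ("membrane", 4),
]

counts_by_class = defaultdict(list)
for compartment, n_peptides in toy_proteins:
    counts_by_class[compartment].append(n_peptides)

fractions = {c: fraction_with_at_least(v) for c, v in counts_by_class.items()}
overall = fraction_with_at_least(n for _, n in toy_proteins)
print(fractions, overall)
```

A large gap between the per-class fractions would point to a bias against one class; the 65% and 57% reported above differ only modestly.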



FIG. 8. Pie charts comparing the percentage of (A) total, (B) intracellular, and (C) membrane-associated proteins identified by a specific number of peptides.

 
Quantitative Assessment of the Identified Proteins—
Although the protein identification data in this study are largely qualitative, we also examined whether some quantitative results could be gleaned from the data. For instance, does the number of peptides identified per individual protein reflect the abundance of that protein within the cell? This hypothesis is reflected in the case of serum, where albumin peptides are continuously identified during a typical LC-MS/MS analysis of this sample. For the 3,590 definitively identified proteins, a plot of the percentage of proteins from specific functional classes versus the number of peptides identified per protein is shown in Fig. 9. A few interesting trends are observed in the data. First, proteins that regulate transcription (e.g. transcription factors) and signal transduction (e.g. kinases and phosphatases) are most likely to be identified by a single peptide and least likely to be identified by more than six peptides. This result is consistent with their low abundance within the cell. Proteins most likely to be identified by multiple peptides include structural proteins and proteins involved in translation, catalysis, binding, and motor activity. Proteins involved in translation are the most likely to be identified by more than six peptides. Overall, these results are consistent with the relative abundance of these functional classes within the cell.



FIG. 9. Histogram showing the percentage of proteins from various functional classes identified by a specific number of peptides.

 
A more detailed examination of the 44 proteins identified by 20 or more unique peptides is presented in Table I. Approximately 39% of these proteins are classified as playing a structural role or being involved in motor activity within the cell, consistent with their high level of abundance within cells. Chaperones represented about 16% of the proteins on this list, while the remaining proteins consisted of various binding proteins and proteins involved in protein translation, catalysis, and transport. Proteins involved in signal transduction or transcription were not represented within this list. The data obtained in this study are consistent with a correlation between the number of peptides identified per protein and the expected abundance of that protein within the cell.

A significant benefit gained from performing a global proteomic analysis such as that described in this study is the opportunity to identify proteins not previously associated with a particular cell type or biological process. One example that illustrates this point is the identification of the pescadillo protein in the neuronal proteome (Supplemental Table I). Pescadillo is a unique nucleolar protein involved in ribosome biogenesis and cell cycle control. It was recently identified as a protein expressed at abnormally high levels in malignant human brain tumors (13). The protein has not previously been characterized in neurons. To validate the proteomics result that was obtained using cultured postnatal cortical neurons, pescadillo expression was evaluated in adult mouse brain by immunostaining. Validating the proteomics finding, intense immunoreactivity was detected in neurons in all brain regions examined, while glial cells showed only weak immunoreactivity. Analysis of the hippocampus was particularly instructive because it contains a defined layer of pyramidal neurons surrounded by areas where neurons are very scarce and glial cells account for the majority of cells. As seen in Fig. 10, strong nuclear immunoreactivity was predominantly localized to neurons in the CA1 pyramidal cell layer and to displaced pyramidal neurons and/or interneurons (arrows). Glial cells outside the CA1 pyramidal cell layer (arrowheads) and in the corpus callosum, a fiber tract lacking neurons, were weakly immunostained. Thus, it appears that pescadillo is highly expressed in neurons, consistent with its identification in the neuronal proteome. Because it is not clear what role a cell cycle regulator would play in post-mitotic neurons, this finding provides fertile new ground for future study.



FIG. 10. Immunolocalization of pescadillo expression in adult mouse brain. Pescadillo-expressing cell types were determined by immunohistochemistry using frozen sections of the brain from 1-month-old p53+/+ mice. A representative staining result for a parasagittal section through the dorsal hippocampus is shown. Strong immunoreactivity was detected almost exclusively in neurons, while glial cells were only weakly immunopositive. Arrows and arrowheads indicate some of the neurons and glial cells, respectively, outside the CA1 pyramidal cell layer. See the text for specific details. Scale bar, 50 µm.

 
Global proteome analysis of cortical neurons identified many other proteins, in addition to pescadillo, that had not previously been identified in or associated with neurons. A number of hemopoietic proteins were identified such as mDomino, a myeloid cell differentiation factor (14), myeloid zinc finger protein-2, a zinc-finger transcription factor (15), and Clast3, a novel cell cycle-regulated protein first identified in activated and naive B lymphocytes (16). Unique transcriptional activators and repressors were identified in cortical neurons, such as myocardin-related transcription factor A and enhancer of rudimentary homologue. Interestingly, several tumor suppressor and oncogenic proteins were also identified in cortical neurons, such as the breast cancer type 2 susceptibility protein (BRCA2), and the epithelial cell transforming sequence 2 protein (ECT2). ECT2 is a guanine nucleotide exchange factor for Rho GTPases and a regulator of cytokinesis (17). Another unique cell cycle regulator identified in this analysis was the G2 and S phase expressed protein 1 (GTSE-1) (18). The GTSE-1 protein appears to be involved in regulating p53-dependent apoptosis following DNA damage. Because primary neuronal cultures may contain 2–3% of non-neuronal cells, it will be necessary to validate the expression of these proteins in neurons using complementary methods, as shown above for pescadillo. Nevertheless, this analysis of the neuronal proteome demonstrates the feasibility of identifying unique proteins that may play important and unanticipated roles in the biology of neurons.


    DISCUSSION
 
To achieve the goal of capturing a "proteomic snapshot" of the cell, analytical methodologies must demonstrate that a significant portion of the proteins present within the cell can be identified. The best available separations and MS-based technologies are able to successfully sequence on the order of 10,000 peptides corresponding to about 2,000–3,000 unique proteins from a sample of human cells. While these technologies clearly provide significant proteome coverage, the proteins are usually identified from relatively few peptides. Indeed, in most global proteome studies where the focus is on identifying a significant percentage of the proteome, many of the proteins are identified from MS/MS of a single peptide. This situation results in a minimal description of the proteins other than their identity. However, it is obvious why, on average, few peptides per protein are identified when one considers the complexity of the mixture being analyzed in these studies. A recent report estimates that the human proteome may comprise 25,931 proteins (not taking into account alternative splicing, etc.) corresponding to a total of 1,175,015 tryptic peptides (assuming complete digestion) (19). If 30% of the proteome is expressed at any given time, the analysis of ~7,800 proteins and 350,000 peptides would be required for a completely comprehensive characterization.

With the paucity of coverage of most proteins in a proteome obtainable using current MS-based technologies, the value of such studies as presented here needs to be considered. As shown in this study, pescadillo was identified by only one peptide (see Supplemental Fig. 1 for the MS/MS spectrum of this peptide); however, its presence in mouse neurons was validated using immunostaining. Pescadillo was identified in zebrafish less than a decade ago (20), and has only recently been shown to play a role in cell cycle progression (13). Prior to this study, pescadillo had not been shown to be present within brain tissue. Proteomic studies such as this can serve as an invaluable resource to researchers who have traditionally focused on a single (or handful of) protein(s) with the goal of in-depth characterization. An excellent example is vitamin D. The dietary need for vitamin D to prevent rickets was first shown in the 1920s (21), and its function in bone mineralization and calcium mobilization soon after (22). It was not until the 1980s, however, that evidence for a function for vitamin D within brain tissues was established (23). This discovery was made through autoradiography and sophisticated purification of the vitamin D receptor followed by the design of an activity assay. With the advent of proteomic studies of this type, investigators now have the opportunity to peruse databases to search for the tissue-specific distribution of their protein(s) of interest. Downstream validation of the identified peptides, however, will continue to be a critical issue in proteomics. Proteomics is protein biochemistry done on a grand scale. Studies such as that presented here provide a "30,000-foot view" of what proteins are present within a particular cell type. The result is a foundational database that can be compared with those generated from other cell types or organisms to decipher which proteins function to confer the specific attributes or properties of any particular cell. Because proteomics is a biological science, important results collected from such studies require validation. This validation needs to be considered on an individual protein basis, weighing what its ultimate value will be to the understanding of the cell’s character or function. So in a sense, global proteomic studies serve to filter the list of proteins that may be expressed by the organism’s genome to those that actually are expressed by the genome.


    FOOTNOTES
 
Received, March 4, 2004, and in revised form, June 23, 2004.

Published, MCP Papers in Press, June 30, 2004, DOI 10.1074/mcp.M400034-MCP200

1 The abbreviations used are: 2D, two dimensional; RP, reversed-phase; SCX, strong cation exchange; µLC, microcapillary LC; IT, ion trap; PPT, protein-specific peptide tag.

* By acceptance of this article, the publisher or recipient acknowledges the right of the United States Government to retain a nonexclusive, royalty-free license in and to any copyright covering the article. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organization imply endorsement by the United States Government. This project has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Contract NO1-CO-12400 and by grants from the National Institutes of Health NS35533 and NS42699 (to R. S. M.).

S The on-line version of this manuscript (available at http://www.mcponline.org) contains supplemental material.

To whom correspondence should be addressed: SAIC-Frederick, Inc., National Cancer Institute at Frederick, P.O. Box B, Building 469, Room 160, Frederick, MD 21702. Tel.: 301-846-7286; Fax: 301-846-6037; E-mail: veenstra@ncifcrf.gov.


    REFERENCES
 

  1. Lin, D., Tabb, D. L., and Yates, J. R. 3rd. (2003) Large-scale protein identification using mass spectrometry. Biochim. Biophys. Acta 1646, 1–10

  2. Celis, J. E., and Gromov, P. (2003) Proteomics in translational cancer research: Toward an integrated approach. Cancer Cell 3, 9–15

  3. Liu, H., Berger, S. J., Chakraborty, A. B., Plumb, R. S., and Cohen, S. A. (2002) Multidimensional chromatography coupled to electrospray ionization time-of-flight mass spectrometry as an alternative to two-dimensional gels for the identification and analysis of complex mixtures of intact proteins. J. Chromatogr. B Analyt. Technol. Biomed. Life Sci. 782, 267–289

  4. Issaq, H. J., Conrads, T. P., Janini, G. M., and Veenstra, T. D. (2002) Methods for fractionation, separation and profiling of proteins and peptides. Electrophoresis 23, 3048–3061

  5. Adkins, J. N., Varnum, S. M., Auberry, K. J., Moore, R. J., Angell, N. H., Smith, R. D., Springer, D. L., and Pounds, J. G. (2002) Toward a human blood serum proteome: Analysis by multidimensional separation coupled with mass spectrometry. Mol. Cell. Proteomics 1, 947–955

  6. Peng, J., Elias, J. E., Thoreen, C. C., Licklider, L. J., and Gygi, S. P. (2003) Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res. 2, 43–50

  7. Koller, A., Washburn, M. P., Lange, B. M., Andon, N. L., Deciu, C., Haynes, P. A., Hays, L., Schieltz, D., Ulaszek, R., Wei, J., Wolters, D., and Yates, J. R. 3rd. (2002) Proteomic survey of metabolic pathways in rice. Proc. Natl. Acad. Sci. U. S. A. 99, 11969–11974

  8. Carucci, D. J., Yates, J. R. 3rd, and Florens, L. (2002) Exploring the proteome of Plasmodium. Int. J. Parasitol. 32, 1539–1542

  9. Xiang, H., Kinoshita, Y., Knudson, C. M., Korsmeyer, S. J., Schwartzkroin, P. A., and Morrison, R. S. (1998) Bax involvement in p53-mediated neuronal cell death. J. Neurosci. 18, 1363–1373

  10. Eng, J. K., McCormack, A. L., and Yates, J. R. 3rd. (1994) An approach to correlate tandem mass-spectral data of peptides with amino-acid-sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989

  11. Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U. S. A. 95, 14863–14868

  12. Wu, C. C., MacCoss, M. J., Howell, K. E., and Yates, J. R. 3rd. (2003) A method for the comprehensive proteomic analysis of membrane proteins. Nat. Biotechnol. 21, 532–538

  13. Kinoshita, Y., Jarell, A. D., Flaman, J. M., Foltz, G., Schuster, J., Sopher, B. L., Kanning, K., Irvin, D. K., Kornblum, H. I., Nelson, P. S., Hieter, P., and Morrison, R. S. (2001) Pescadillo, a novel cell cycle regulatory protein abnormally expressed in malignant cells. J. Biol. Chem. 276, 6656–6665

  14. Ogawa, H., Ueda, T., Aoyama, T., Aronheim, A., Nagata, S., and Fukunaga, R. (2003) A SWI2/SNF2-type ATPase/helicase protein, mDomino, interacts with myeloid zinc finger protein 2A (MZF-2A) to regulate its transcriptional activity. Genes Cells 8, 325–339

  15. Murai, K., Murakami, H., and Nagata, S. (1998) Myeloid-specific transcriptional activation by murine myeloid zinc-finger protein 2. Proc. Natl. Acad. Sci. U. S. A. 95, 3461–3466

  16. Bahar, R., O-Wang, J., Kawamura, K., Seimiya, M., Wang, Y., Hatano, M., Okada, S., Tokuhisa, T., Watanabe, T., and Tagawa, M. (2002) Growth retardation, polyploidy, and multinucleation induced by Clast3, a novel cell cycle-regulated protein. J. Biol. Chem. 277, 40012–40019

  17. Tatsumoto, T., Xie, X., Blumenthal, R., Okamoto, I., and Miki, T. (1999) Human ECT2 is an exchange factor for Rho GTPases, phosphorylated in G2/M phases, and involved in cytokinesis. J. Cell Biol. 147, 921–928

  18. Monte, M., Benetti, R., Buscemi, G., Sandy, P., Del Sal, G., and Schneider, C. (2003) The cell cycle-regulated protein human GTSE-1 controls DNA damage-induced apoptosis by affecting p53 function. J. Biol. Chem. 278, 30356–30364

  19. Cagney, C., Amiri, S., Premawaradena, T., Lindo, M., and Emili, A. (2003) In silico proteome analysis to facilitate proteomics experiments using mass spectrometry. Proteome Sci. 1, 1–15

  20. Allende, M. L., Amsterdam, A., Becker, T., Kawakami, K., Gaiano, N., and Hopkins, N. (1996) Insertional mutagenesis in zebrafish identifies two novel genes, pescadillo and dead eye, essential for embryonic development. Genes Dev. 10, 3141–3155

  21. Goldblatt, H., and Soames, K. N. (1923) A study of rats on a normal diet irradiated daily by the mercury vapor quartz lamp or kept in darkness. Biochem. J. 17, 294–297

  22. Holick, M. F. (2000) Calcium and vitamin D. Diagnostics and therapeutics. Clin. Lab. Med. 20, 569–590

  23. Haussler, M. R., Manolagas, S. C., and Deftos, L. J. (1980) Evidence for a 1,25-dihydroxyvitamin D3 receptor-like macromolecule in rat pituitary. J. Biol. Chem. 255, 5007–5010