From The Plasma Proteome Institute, Washington DC 20009-3450; ¶ Large Scale Biology Corporation, Proteomics Division, Germantown, MD 20876; ** Laboratory of Proteomics and Analytical Technologies, SAIC-Frederick Inc., National Cancer Institute, Frederick, MD 21702-1201;
Biological Sciences Department, Pacific Northwest National Laboratory, Richland, WA 99352;
Inpharmatica Ltd., London, W1T 2NU, United Kingdom
ABSTRACT
At the same time, plasma is the most generally informative proteome from a medical viewpoint. Almost all cells in the body communicate with plasma directly or through extracellular or cerebrospinal fluids, and many release at least part of their contents into plasma upon damage or death. Some medical conditions, such as myocardial infarction, are officially defined based on the increase of a specific protein in the plasma (e.g. cardiac troponin-T), and it is difficult to argue convincingly that there is any disease state that does not produce some specific pattern of protein change in the body's working fluid. This immense diagnostic potential has spurred a rapid acceleration in the search for protein disease markers by a wide variety of proteomics strategies.
Current methods of proteomics are only beginning to catalog the contents of plasma. Two-dimensional electrophoresis was able to resolve 40 distinct plasma proteins in 1976 (2), but, because of the dynamic range problem, this number had only grown to 60 in 1992 (3) and is substantially unchanged today, a quarter century later. It is now clear that more than two dimensions of conventional resolution are required to progress beyond this point. Recently, several truly multidimensional survey efforts have been mounted, with the result that the number of distinct proteins detected has increased dramatically. Additional dimensions of separation can be introduced at any of three levels: a) separation of intact proteins, either by specific binding (e.g. subtraction of defined high-abundance proteins) or continuous resolution (e.g. electrophoresis or chromatography); b) separation of peptides derived from plasma proteins, either by specific binding (e.g. capture by anti-peptide antibodies) or continuous resolution (e.g. chromatography); and c) separation of peptides, and particularly their fragments, by mass spectrometry (MS). Many possible combinations of these dimensions can be implemented, the only limitations being the effort, cost, and time of analyzing many fractions or runs instead of one.
In this article, we have compared and combined data from three different multi-dimensional strategies with data from a fourth, classical source (the protein biochemistry and clinical chemistry literature) to provide a meta-level overview of both the contents and the rate of discovery of new components in plasma. The three experimental datasets are derived from 1) whole protein separation by a three-dimensional process (immunosubtraction/ion exchange/size exclusion) followed by two-dimensional electrophoresis (2DE) followed by MS identification of resolved spots (4); 2) Ig subtraction followed by trypsin digestion followed by two-dimensional liquid chromatography (LC) (ion exchange/reversed phase) followed by tandem MS (MS/MS) (5); and 3) molecular mass fractionation, followed by trypsin digestion followed by two-dimensional LC (cation exchange/reversed phase) followed by MS/MS (6). These three experimental approaches have two features in common (the removal of most Igs, by specific subtraction or size, and the use of MS for molecular identification) but otherwise they span the gamut of proteomics discovery approaches: separation at the protein level, separation at the tryptic peptide level, and a hybrid.
Combining experimental data with literature search results on proteins detected in plasma (representing a large body of accumulated "nonproteomics" data) should provide a broad perspective on plasma contents. Because the same proteins detected by various methods can be referred to by different names or accession numbers, we have used a sequence-based approach to eliminate redundancy and cluster all occurrences of the same protein. The resulting list makes it possible to examine the overlap between the various approaches and to see whether they are biased toward particular classes of proteins. In addition, a pooled nonredundant list should provide a relatively unbiased survey of the kinds of proteins present in plasma, which could have important diagnostic implications. Finally, a large list of proteins actually observed in plasma paves the way for top-down, targeted proteomics approaches to the discovery of disease markers: the development of accurate high-throughput specific assays for selected candidates from this list, as a supplement to the use of single methods for marker discovery in small sample sets. In the longer term, proteins with strong, mechanistic disease relationships may be viable therapeutic candidates as well.
DATA SOURCES AND METHODS
2DEMS: Separation of Serum Proteins (LC3/2-DE) + MS/MS Identification
Intact proteins were fractionated by chromatography and 2DE and identified by MS, generating the dataset described by Pieper et al. (7). Briefly, human blood sera were obtained in equal volumes from two healthy male donors (ages 40 and 80). Albumin, haptoglobin, transferrin, transthyretin, α-1-antitrypsin, α-1-acid glycoprotein, hemopexin, and α-2-macroglobulin were removed by immunoaffinity chromatography. The immunoaffinity-subtracted serum concentrate was fractionated further by sequential anion exchange and size exclusion chromatography. The resulting 66 samples were individually subjected to 2DE. All visible Coomassie Blue R250 spots were cut out, destained, reduced, alkylated, and digested with trypsin. All extracted peptides were analyzed by matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) MS on a Bruker Biflex or Autoflex mass spectrometer (Bruker, Billerica, MA) and searched against Swiss-Prot. Those samples that did not give a positive identification by MALDI-TOF were subjected to LC-MS/MS analysis by ion trap (IT) MS (Thermo Finnigan LCQ, Woburn, MA) and searched against the National Center for Biotechnology Information (NCBI) database using SEQUEST.
LCMS1: Separation of Peptide Digests of Serum Minus Ig (LC2) + MS/MS Identification
A published dataset prepared by Adkins et al. (5) was used. Briefly, human blood serum was obtained from a healthy anonymous female donor. Igs were depleted by affinity adsorption chromatography using protein A/G. The resulting Ig-depleted plasma was digested with trypsin and separated by strong cation exchange on a polysulfoethyl A column followed by reversed-phase separation on a capillary C18 column. The capillary column was interfaced to an IT-MS (Thermo Finnigan LCQ Deca XP) using electrospray ionization. The IT-MS was configured to perform MS/MS scans on the three most intense precursor masses from a single MS scan. All samples were measured over a mass/charge (m/z) range of 400–2,000, with fractions containing high complexity being measured with segmented m/z ranges. Tandem mass spectra were analyzed by SEQUEST as described using the NCBI May 2002 database.
LCMS2: Separation of Peptide Digests of Low-Molecular-Mass Serum Proteins (LC2) + MS/MS Identification
The fourth dataset is that described by Tirumalai et al. (6), focused on the lower-molecular-mass plasma proteome. Briefly, standard human serum was purchased from the National Institute of Standards and Technology. High-molecular-mass proteins were removed in the presence of acetonitrile using Centriplus centrifugal filters with a molecular mass cutoff of 30 kDa. The low-molecular-mass filtrate was reduced, alkylated, and digested with trypsin. The digested sample was fractionated by strong cation exchange chromatography on a polysulfoethyl A column. Reversed-phase LC was subsequently performed on a 300-Å Jupiter C-18 column coupled on line to an IT-MS (Thermo Finnigan LCQ Deca XP). Each full MS scan was followed by three MS/MS scans in which the three most abundant peptide molecular ions were selected. MS/MS spectra were searched against a human protein database using SEQUEST.
Bioinformatics
Sequence Clustering
The Blastp protein comparison algorithm (8, 9) was used to query the sequence of each protein identified against a database containing the aggregate sequences of all proteins identified by any method. Sequences sharing greater than 95% identity over an aligned region were grouped into "unique sequence clusters." Sequences were unmasked, and the minimum alignment length considered was 15 aa. This similarity-based approach was sufficient to group identical sequences, sequence fragments, and splice variants. Annotation in the nonredundant table was reported for the "best annotated" protein in the cluster set.
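The grouping rule described above can be illustrated with a minimal sketch. It assumes the pairwise Blastp results have already been reduced to tuples of (query, subject, percent identity, alignment length); the function name `cluster_sequences` and the use of single-linkage merging (any qualifying hit joins two clusters) are assumptions made for illustration, since the text does not state how transitive matches were resolved.

```python
from collections import defaultdict


def cluster_sequences(ids, hits, min_identity=95.0, min_length=15):
    """Group sequence IDs into clusters linked by qualifying pairwise hits.

    ids  : iterable of all sequence identifiers
    hits : iterable of (query_id, subject_id, percent_identity, alignment_length)
    Single-linkage merging is an assumption of this sketch.
    """
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        # Register unseen IDs, then walk to the cluster root with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, identity, length in hits:
        # Thresholds from the text: >95% identity, alignment of at least 15 aa.
        if identity > min_identity and length >= min_length:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q

    clusters = defaultdict(set)
    for seq_id in list(parent):
        clusters[find(seq_id)].add(seq_id)
    return sorted(sorted(members) for members in clusters.values())
```

With hypothetical identifiers, a hit at exactly 95.0% identity or with an alignment shorter than 15 residues does not link two sequences, so they remain in separate clusters.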
Signal Peptide Prediction
Signal peptides were predicted using the commercially available SignalP version 2.0 neural net and hidden Markov model (HMM) algorithms (10) and the sigmask (11) signal masking program developed as part of Inpharmatica's Biopendium (12) protein annotation database. Each sequence received a score of +1 for a statistically significant positive signal peptide prediction from any of the three algorithms. The scores 0, 1, 2, and 3 for a particular sequence were then converted to the qualitative terms "no," "possible signal," "signal," or "signal confident," respectively.
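The scoring scheme amounts to counting positive predictions and looking up a label. The sketch below assumes each predictor's output has already been reduced to a boolean (significant positive prediction or not); the names `SIGNAL_LABELS` and `signal_call` are hypothetical and do not correspond to any function in SignalP or sigmask.

```python
# Qualitative terms from the text, indexed by the number of positive predictions.
SIGNAL_LABELS = {
    0: "no",
    1: "possible signal",
    2: "signal",
    3: "signal confident",
}


def signal_call(nn_positive, hmm_positive, sigmask_positive):
    """Convert three boolean predictor outcomes into the qualitative term."""
    score = sum(bool(p) for p in (nn_positive, hmm_positive, sigmask_positive))
    return SIGNAL_LABELS[score]
```

For example, a sequence called positive by the neural net and by sigmask but not by the HMM scores 2 and is reported as "signal."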
Transmembrane Prediction
Transmembrane (TM) regions were predicted using the commercial version of TMHMM version 2.0 algorithm (13). The total number of TM helices predicted per sequence was reported for each protein sequence. When a predicted TM region overlapped a predicted signal sequence (as it did in 40 cases in H_Plasma_NR_v2), this was interpreted as a signal sequence only.
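The overlap rule can be sketched as a simple filter. This example assumes 1-based inclusive residue coordinates, a signal sequence that occupies residues 1 through `signal_end`, and TM helices supplied as (start, end) pairs; the function name and input layout are hypothetical, and parsing of the predictor output is not shown.

```python
def count_tm_helices(tm_regions, signal_end=None):
    """Count predicted TM helices, ignoring any that overlap the signal sequence.

    tm_regions : list of (start, end) residue positions, 1-based inclusive
    signal_end : last residue of the predicted signal sequence, or None
    """
    kept = [
        (start, end)
        for start, end in tm_regions
        # A helix overlaps residues 1..signal_end whenever it starts within them.
        if signal_end is None or start > signal_end
    ]
    return len(kept)
```

A sequence with helices at residues 5-27 and 60-82 and a predicted signal sequence ending at residue 22 would therefore be reported with one TM helix rather than two.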
Structural and Sequence-based Domain Annotation
Sequences were scanned against a library of BioPendium and iPSI-BLAST (9, 11)-like protein profiles constructed from SCOP (14), PFAM (15), PRINTS (16), and PROSITE (17) domain families. Hits to these profiles were reported at a statistical e-value cut-off of 1e-5. This cut-off was chosen to maximize profile coverage and minimize the occurrence of false positives. Sequences were not masked for low complexity or coiled coils prior to profile scanning.
Gene Ontology (GO) Term Annotation
NCBI GI number accessions for the sequences were matched to their SPTR (18) equivalents based on sequences sharing >95% sequence identity over 90% of the query sequence length. GO (19) component, process, and function terms were then extracted from text-based annotation files available for download from the GO database ftp site: ftp.geneontology.org/pub/go/gene-associations/gene_association.goa_human. For graphical reporting, a series of GO terms in each category were extracted by text searching of relevant keywords (indicated by the category names on plots) through all the assigned GO definitions. A GO component summary for the whole human proteome was prepared by applying the same approach to the complete GO human database referred to above.
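The keyword-based grouping used for graphical reporting can be sketched as follows. The example assumes that GO term text has already been attached to each protein; the keywords shown in the usage note are illustrative only, and case-insensitive substring matching is an assumption rather than a documented detail of the original procedure.

```python
def count_by_keyword(go_terms_by_protein, keywords):
    """Count proteins whose assigned GO term text contains each keyword.

    go_terms_by_protein : dict mapping a protein ID to a list of GO term strings
    keywords            : list of category keywords to search for
    """
    counts = {keyword: 0 for keyword in keywords}
    for terms in go_terms_by_protein.values():
        text = " | ".join(terms).lower()
        for keyword in keywords:
            # A protein is counted at most once per keyword.
            if keyword.lower() in text:
                counts[keyword] += 1
    return counts
```

Given hypothetical annotations in which one protein carries "extracellular space" and another carries both "integral to plasma membrane" and "nucleus", searching for "extracellular", "membrane", and "nucleus" yields a count of one for each category. Applying the same function to the complete human annotation set would give the whole-proteome summary used for comparison.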
Database Assembly
The nonredundant (NR) plasma database was assembled as a series of tables in a PostgreSQL relational database and queried to derive summary statistics for tables and figures shown here.
RESULTS
Protein Coverage by Data Source
Of the 1,175 nonredundant human proteins in H_Plasma_NR_v2, 195 entries, or 17%, were present in more than one dataset (set H_Plasma_195: Fig. 2 and Table II). Only 46 (4%) were found in all four sets of accessions (Total_sources = 4, shown in bold type in Table II). Of these only one (inter-α-trypsin inhibitor heavy chain H1) is predicted to have even a single transmembrane domain, and only one (the hemoglobin β chain presumably released from red cell lysis) is predicted not to have a signal sequence. These characteristics (presence of signal sequence and absence of transmembrane domains) are those expected for major plasma proteins secreted by organs such as the liver.
A further 102 proteins (9%) were found in two datasets (Table II). Of these, 43 proteins were found in two experimental datasets but not in the literature dataset. These include a number of proteins that would not typically be thought of as likely plasma components, including a chloride channel and a copper-transporting ATPase (with 10 and 7 predicted transmembrane domains, respectively), an oxygen-regulated protein, three hypothetical proteins, and a group of likely nuclear proteins including mismatch repair protein, mitotic kinesin-like protein 1, and centromere protein F.
The remaining 980 proteins (83% of NR) were found in only one of the four input datasets. Of these, 696 proteins (71%) were found only in the experimental sets, with LCMS1, LCMS2, and Lit having similar large percentages of source-unique proteins (70, 69, and 66%, respectively, versus 50% for 2DEMS).
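The coverage figures in this section follow from a tally of how many datasets each nonredundant protein appears in. The sketch below assumes a mapping from each protein to the set of source names in which it was found; the function names and the example identifiers are hypothetical, and the percentages are recomputed only from the counts quoted in the text.

```python
from collections import Counter


def tally_sources(sources_by_protein):
    """Count proteins by the number of datasets in which each was observed."""
    return Counter(len(sources) for sources in sources_by_protein.values())


def percent(part, whole):
    """Percentage rounded to the nearest whole number, as reported in the text."""
    return round(100 * part / whole)


# Consistency check on the counts quoted above.
total_nr = 1175
multi_source = 195
single_source = 980
print(multi_source + single_source == total_nr)  # True
print(percent(multi_source, total_nr))   # 17
print(percent(single_source, total_nr))  # 83
```

For a toy membership table such as `{"a": {"Lit"}, "b": {"Lit", "2DEMS"}, "c": {"LCMS1"}}`, `tally_sources` returns two proteins seen in one source and one protein seen in two.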
Characterization of the Plasma Proteome Via Annotation Statistics
Predicted Signal Sequences
The signal sequence prediction algorithm used yielded four levels of likelihood: "no" (strong probability of no signal sequence), "possible signal," "signal," and "signal confident" (strong probability of a signal sequence) in order of increasing likelihood of a signal sequence (i.e. the number of algorithms out of the three used that predict a signal sequence). The procedure does not distinguish between the signal sequences of secreted and membrane-bound proteins (including e.g. plasma, Golgi, and mitochondrial membranes), and thus does not directly predict final protein location. Most of the 1,175 H_Plasma_NR_v2 nonredundant sequences (83%) yielded a strong positive or negative prediction (i.e. good agreement between the three prediction algorithms used), with these two results occurring in about a 2:3 ratio overall. Approximately 49% of H_Plasma_NR_v2 had no evidence of a signal sequence, while only about 25% of H_Plasma_195 lacked such evidence; conversely, only 34% of H_Plasma_NR_v2 gave a "signal confident" prediction, while 54% of H_Plasma_195 did. Comparing the four data sources over H_Plasma_195 (Fig. 3A), entries from all four sources were likely to have signal sequences: the Lit set had the highest bias toward "signal confident" proteins (indicated by a ratio of "signal confident" to "no" signal of 5.05), while 2DEMS showed less bias (ratio of 3.44) and LCMS1 and LCMS2 showed a higher representation of "no" signal predictions (ratios of 2.03 and 1.96, respectively). Comparing the four sources across all the 1,175 H_Plasma_NR_v2 proteins (Fig. 3B), these ratios were reduced, reflecting a greater preponderance of "no" signal proteins, but the relative differences between the sources remained, yielding ratios for Lit, 2DEMS, LCMS1, and LCMS2 of 3.85, 0.91, 0.49, and 0.44, respectively.
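The per-source bias measure used here is the number of "signal confident" entries divided by the number of "no" entries within a given source. A minimal sketch, assuming each protein is represented as a (label, set of sources) pair; the function name and the toy records are hypothetical and are not data from this study.

```python
def confident_to_no_ratio(records, source):
    """Ratio of "signal confident" to "no" predictions among proteins in a source.

    records : iterable of (signal_label, set_of_source_names)
    source  : name of the dataset to evaluate, e.g. "Lit"
    """
    records = list(records)
    confident = sum(
        1 for label, sources in records
        if source in sources and label == "signal confident"
    )
    no_signal = sum(
        1 for label, sources in records
        if source in sources and label == "no"
    )
    # Avoid division by zero when a source has no "no" predictions.
    return confident / no_signal if no_signal else float("inf")
```

A ratio above 1 indicates a source weighted toward proteins with confidently predicted signal sequences, and a ratio below 1 indicates the opposite.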
The set of 195 accessions that occur in at least two of the four datasets represents a confirmed list of targets that should be accessible for routine measurement by multiple proteomics technologies in human plasma or serum: these proteins have been detected by at least two methods in different laboratories. This set actually comprises more than 195 observable protein subunits because of the collapse of multiple forms into a single entry: for example, haptoglobin α and β chains collapse onto one primary gene product (cleaved after synthesis to yield the individual chains), and all Ig chains (κ and λ light chains, and γ, α, µ, δ, and ε heavy chains) are lumped onto one accession because of their sequence similarity. The fact that a total of 96 Ig accessions occurred in the experimental input data (14 in 2DEMS, 76 in LCMS1, and 6 in LCMS2, in addition to the 11 Ig entries in Lit) indicates that a substantial number of heterogeneous Ig sequences remain in plasma/serum samples even after the treatments used in these studies to remove antibodies prior to digestion/fractionation.
H_Plasma_195 contains most of the expected high- and medium-abundance plasma components, but also contains a number of proteins that might not ordinarily be expected to be abundant enough to appear in a "common component" list. These include adiponectin (involved in the control of fat metabolism and insulin sensitivity), atrial natriuretic factor (a potent vasoactive substance synthesized in mammalian atria and thought to play a key role in cardiovascular homeostasis), various cathepsins (D, L, S), centromere protein F (involved in chromosome segregation during mitosis), creatine kinase M chain (an abundant muscle enzyme), glial fibrillary acidic protein (distinguishes astrocytes from other glial cells), psoriasin (S-100 family, highly up-regulated in psoriatic epidermis), interferon-induced viral-resistance protein MxA (confers resistance to influenza virus and vesicular stomatitis virus), melanoma-associated antigen p97 (a proposed cancer marker also expressed in multiple normal tissues), mismatch repair protein MSH2 (involved in postreplication mismatch repair, and whose defective forms are the cause of hereditary nonpolyposis colorectal cancer type 1), oxygen-regulated protein (which plays a pivotal role in cytoprotective cellular mechanisms triggered by oxygen deprivation), peroxisome proliferator-activated receptor binding protein (which plays a role in transcriptional coactivation), prostate-specific antigen (a protease involved in the liquefaction of the seminal coagulum, and one of the few successful cancer diagnostics), selenoprotein P (contains selenocysteines encoded by the opal codon, UGA), signal recognition particle receptor subunit (an integral membrane protein ensuring, in conjunction with SRP, the correct targeting of the nascent secretory proteins to the endoplasmic reticulum membrane system), squamous cell carcinoma antigen 1 (which may act as a protease inhibitor to modulate the host immune response against tumor cells), and V-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog (the receptor for stem cell factor). A number of these proteins have obvious relevance to important disease mechanisms, and thus are of potential diagnostic value. Cathepsin S, centromere protein F, psoriasin, mismatch repair protein MSH2, oxygen-regulated protein, and signal recognition particle receptor subunit did not occur in the Lit accession list, but were rather found via detection in two of the experimental datasets.
Two types of protein features (signal sequences and TM domains) were predicted from the H_Plasma_NR_v2 sequences. These two parameters are somewhat related, because transmembrane, as well as secreted, proteins are likely to contain signal sequences. In the 1,175 proteins of H_Plasma_NR_v2, approximately one-third were confidently predicted to contain signal sequences (34% overall, with 32% having 0 or 1 TM segments) as compared with 19% containing signal sequences over the whole human proteome (and only 10% containing signal sequences with 0 or 1 TM segments; R.F., unpublished observation). H_Plasma_NR_v2 is thus substantially (3:1) enriched in a set of proteins having signal sequences and 0 or 1 TM segments (compared with the genome), which is consistent with the presence of a large number of classical secreted proteins. However, because more than half of H_Plasma_NR_v2 proteins do not contain a signal sequence, the total representation of nonsecreted molecules (presumably cellular constituents) is high.
In the full NR set of 1,175 proteins (H_Plasma_NR_v2), several major groups of proteins occur in patterns that suggest interesting biases between our four data sources. At least 10 transcription factors were observed in the experimental sets (each by only a single method), and none of these were found in the Lit accession set. Similarly, proteins GO-annotated with a DNA-binding function were essentially absent from the Lit set. In contrast, only 4 of 39 cytokines and growth factors included were found in any of the experimental datasets (IL-6, IL-12A, ciliary neurotrophic factor, and FGF-12), while 37 occurred in the Lit set. These results suggest that while the experimental proteomics methods were not sensitive enough to detect most cytokines and hormones, they did detect important classes of proteins not detected in literature reports using targeted assay methods. On a more global level, the distribution of GO_component assignments (Fig. 6) shows substantial differences overall between the set of proteins found in a literature search versus the three experimental proteomics technologies. Predicted features of protein sequence also show major source-related differences. Most striking is the fact that the Lit set was strongly biased toward proteins that were confidently predicted to have signal sequences (ratio of "signal confident" to "no" of 3.85), while 2DEMS showed little preference (0.91) and the LCMS methods showed a moderate (2 to 1) bias toward proteins without signal sequences (ratios of 0.49 and 0.44). The strong bias of the Lit set toward signal sequences is likely due to the greater ease with which these more soluble proteins can be isolated and studied by biochemical techniques. Similarly, the difference between 2DEMS and LCMS may be due to the failure of many less-soluble proteins to focus in the first (isoelectric focusing) dimension of the 2DE procedure (e.g. intact membrane or very large proteins), or else to the presence of numerous isoforms that divide the protein among members of a charge train, and thus decrease the limit of detection (e.g. heavily glycosylated extracellular domains cleaved from membrane proteins). By placing fewer requirements on the behavior of sample proteins prior to digestion, and by providing identifications based on a few soluble peptides, the MS-based techniques provided a significantly less biased, though not necessarily more complete, view of the plasma proteome.
Because the four input protein sets were of similar size, it is clear that the literature, viewed as an historical summary of research on proteins in plasma, shows a bias toward secreted proteins and against investigation of cellular proteins in solution in blood. This effect cannot be due to detection sensitivity alone, because the low-abundance cytokines, generally not detected by the experimental proteomics methods, are accessible via immunoassay and widely reported in the literature. The bias may instead be due to a general skepticism that detectable amounts of many cellular proteins are being released into plasma (absent some major cause of tissue damage), or to a view that cellular protein release, if it occurred, would not be especially informative. The present results, built on experimental studies of multiple groups, demonstrate that many (perhaps all) cellular proteins are present in plasma. The demonstrated utility of cardiac muscle protein markers as serum diagnostics for myocardial infarction provides a persuasive argument that many may have diagnostic use.
The next major challenge thus becomes the systematic exploration of protein abundance and structural modification in relation to disease, normal physiological processes, and treatment effects. In a sense this shift can be seen as analogous to the current evolution in pharmaceutical target selection: the genome has provided a wealth (in practical terms an overabundance) of previously unknown therapeutic targets, creating a major challenge in selecting those that are "druggable" and specifically linked to disease mechanisms. A shift, in other words, from protein discovery to target validation. In the context of diagnostics based on proteins in blood, we now have in H_Plasma_NR_v2, and a growing body of other experimental data, a substantial set of candidate disease markers that can be detected in plasma. While these will continue to be supplemented by discovery techniques (22), the stage is set for systematic efforts to validate disease markers for near-term application in clinical trials, medium-term use in disease detection, staging, and therapy selection, and long-term use in population screening. While some of these proteins have been examined individually as potential markers and found to have low sensitivity or specificity as individual tests, growing evidence indicates that these limitations may be overcome using fingerprints of change across panels of proteins that together better represent patient status. Thus even the present H_Plasma_NR_v2 offers an abundance of candidates deserving of measurement in selected clinical sample sets. Making such measurements accurately and on a large scale presents a series of technical challenges likely to require substantial efforts for much of the next decade. The results of such a "targeted" proteomics effort can transform diagnostics, improve therapy, and lead to substantial and needed improvements in the economics of healthcare.
FOOTNOTES
Published, MCP Papers in Press, January 12, 2004, DOI 10.1074/mcp.M300127-MCP200
1 The abbreviations used are: Ig, immunoglobulin; MS, mass spectrometry; GO, Gene Ontology; 2DE, two-dimensional electrophoresis; NR, nonredundant; TM, transmembrane; LC, liquid chromatography; MS/MS, tandem MS; IT, ion trap.
2 J. N. Adkins, manuscript in preparation.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental materials.
|| Current address: The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850.
To whom correspondence should be addressed: The Plasma Proteome Institute, P.O. Box 53450, Washington DC 20009-3450. Tel.: 301-72811451; Fax: 202-234-9175; E-mail: leighanderson{at}plasmaproteome.org
REFERENCES