From the Department of Statistics, Yale University, New Haven, CT
Department of Mathematics, Florida Atlantic University, Boca Raton, FL
Program in Proteomics and Bioinformatics, Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada
Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA
School of Information Technology and Engineering, University of Ottawa, Ottawa, Ontario, Canada
ABSTRACT
Many of the successes of therapeutic intervention have evolved from improvements in the ability to diagnose, stage, and stratify subgroups of patients who may respond differently to various management strategies (4). Despite these advances, the treatment of many diseases, such as cancer and cardiovascular disease, still suffers from the fact that most patients present at late stages of illness. Earlier detection of pathology is highly beneficial to patient outcomes (5), yet there are few effective diagnostic tools for recognizing early stage disease or prognostic tools for identifying those at high risk of death or of nonresponsiveness to therapy (6). Development of satisfactory therapeutics is also hampered by a lack of informative bioassays (6). As a result, biological markers are urgently needed to improve the efficacy of clinical intervention, the reliability of clinical trials, and the validation of leads and targets (3, 6).
DNA microarrays are commonly used to detect differences in gene expression between different physiological states (7), including global changes in mRNA abundance across repeat experiments, distinct experimental perturbations, discrete time points, or large patient cohorts (7). Pattern recognition algorithms can then be applied to sort and classify samples based on their expression profiles (8). Nonetheless, it is likely that pathophysiological maladaptations associated with common pathologies, such as diabetes and cancer, are more accurately reflected in the proteomic patterns of disease-affected tissues (3), especially in samples with little messenger RNA (e.g. serum) (9).
To date, most clinically useful protein biomarkers have either been found serendipitously or through limited candidate evaluation based on hypotheses regarding disease action (3, 10). The lack of effective generic procedures for routinely detecting differences in global protein patterns across many different samples hinders the discovery of new biomarkers (3, 5). This constraint is particularly apparent in a clinical setting (5, 11), where specialized analytical procedures are often required to derive useful qualitative and quantitative information from the minuscule amounts of protein typically found in patient specimens, such as a biopsy. Furthermore, while sensitive immunoassays can be used to prospectively validate a biomarker, they are generally not well suited to the discovery of new biomarkers (3).
MS has emerged as a key enabling technology for protein expression profiling (1, 12). Recent ground-breaking studies have demonstrated the utility of combining MS-based profiling and computer-based pattern recognition as a means of detecting proteomic signatures of cancer in blood (13, 14). However, the relatively simple MS instrumentation used in these pioneering studies was biased toward the detection of low molecular mass proteins (13). Moreover, it did not allow for ready protein identification, which is critical if such biomarkers are to form the basis of a simplified, widely adopted diagnostic (5, 10). Reliable methods for determining both the identity and quantity of large numbers of proteins across many different clinical samples are therefore urgently needed to test and prospectively validate the hypothesis that compensatory responses to disease are reflected by changes in the proteomic patterns of blood or tissue (10, 15). Moreover, there is a parallel need to develop rigorous statistical methods to evaluate the significance of any differences detected (16). This is particularly true of global proteomic studies subject to numerous sources of variation, both experimental and biological in origin.
Gel-free protein profiling procedures coupling capillary-scale HPLC to data-dependent MS/MS (LC-MS) present an exciting new paradigm for proteomic screening (1, 12). In particular, multidimensional protein identification technology (17, 18) and isotope-coded affinity reagents (19) now allow for the "shotgun" profiling of hundreds of proteins in a single experiment, albeit with a significant expenditure of time and effort. The clinical impact of these methods has been limited to date (3), however, in part due to problems associated with the reproducibility of LC-MS (20, 21), as well as to difficulties in extracting clinically relevant information from the limited number of samples that can practicably be analyzed using these specialized methods (16).
In an effort to improve the reliability of LC-MS-based profiling studies, Smith and colleagues (22) have reported the utility of advanced equipment, FT-ICR MS and HPLC pumps capable of sustained performance at >10,000 psi. While this strategy circumvents many of the problems associated with traditional profiling procedures, it relies on technologies that are not widely available to the broader biomedical community. Moreover, it does not address fundamental issues concerning the statistical evaluation of multivariate proteomic datasets for the purpose of biomarker discovery (16).
Becker and colleagues (20) recently introduced an alternative computational method for detecting differential protein abundance by LC-MS without the need for isotopic labeling or advanced instrumentation. Their approach relies on the roughly linear relationship of MS signal as a function of peptide ion concentration. Proprietary data processing algorithms were then used to track quantitative variation in peptide signal across different LC-MS datasets. A key step in minimizing sample dispersion was the use of a "time warping" alignment algorithm to correct for spurious deviations in recorded ion maps, resulting in modest (25%) coefficients of variation across integrated peak intensities. Significant computational cost was observed with increasing sample complexity (20), restricting the effectiveness of this first-generation platform for pattern recognition across multiple complex proteomic datasets (10, 15, 20, 23, 24).
Experimental repetition, pattern recognition, and mathematical algorithms can minimize the effects of unwanted noise and spurious signal fluctuation (15, 25). Here, we report the development of a more advanced generation of computer algorithms, statistical data-mining procedures, and software built upon these principles that greatly facilitate large-scale protein expression profiling of mammalian tissue samples using basic gel-free shotgun profiling procedures and standard LC-MS instrumentation. We show that this informatics toolkit allows for systematic global comparison and classification of complex tissue proteomic samples, speeding discovery of biologically relevant proteomic biomarkers.
EXPERIMENTAL PROCEDURES
Signal Filtering Algorithm
After conversion of the raw LC-MS data files to text format, a Perl script is used to parse out all irrelevant MS/MS scan data. Peak m/z ratios in the retained scans are rounded off to the closest integer and binned (±0.5 Th).
Applying the M-N rule
The program processes individual nominal m/z ion traces $\{Z_i\}$ (where $Z_i$ is the intensity in the $i$th scan header and m/z is fixed) and computes a robust center, C. We suggest taking C as the 30% trimmed mean of $\{Z_i\}$, although the median of $\{Z_i\}$ also produces reliable results. The data are smoothed using moving averages; that is, for a given fixed m/z slice, the feature intensities are transformed by averaging over a fixed window of five consecutive scans. Next, for predefined constants M and N, the algorithm retains only those features $Z_i$ that exceed $M \cdot C$ for at least N observations in a row. For example, if C = 3,000 and the M-N rule is set to 5-3, we would declare $Z_i$ a pixel if $Z_{i-1}$, $Z_i$, and $Z_{i+1}$ are all larger than 15,000 (i.e. the ion signal intensity was $>5 \cdot C$ for at least three scans in a row). Finally, for the declared constants $L_i = 2^{i-1} \times 1{,}000$, $i = 1, \ldots, 5$, a set of M-N constants (rules) producing $L_i$ pixels is chosen.
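To illustrate the M-N rule, the following Python sketch filters a single nominal m/z ion trace. It is a minimal reimplementation under the assumptions stated above (30% trimmed mean as the robust center, five-scan moving average); the function and parameter names are ours, not those of the published software.

```python
import statistics

def mn_filter(trace, M, N, window=5, trim=0.30):
    """Return the scan indices of one nominal m/z trace declared as pixels,
    i.e. positions where the smoothed intensity exceeds M*C for at least
    N consecutive scans."""
    # Robust center C: a 30% trimmed mean of the trace (the median also works).
    ranked = sorted(trace)
    k = int(len(ranked) * trim / 2)
    trimmed = ranked[k:len(ranked) - k] or ranked
    C = statistics.mean(trimmed)

    # Smooth with a moving average over a fixed window of consecutive scans.
    half = window // 2
    smoothed = [statistics.mean(trace[max(0, i - half):i + half + 1])
                for i in range(len(trace))]

    # Keep only runs of at least N consecutive scans above the M*C threshold.
    pixels, run = [], []
    for i, z in enumerate(smoothed):
        if z > M * C:
            run.append(i)
            continue
        if len(run) >= N:
            pixels.extend(run)
        run = []
    if len(run) >= N:
        pixels.extend(run)
    return pixels

# Example: a 5-3 rule flags scans whose smoothed signal stays above 5*C
# for at least three scans in a row.
# pixels = mn_filter(intensities_for_one_mz, M=5, N=3)
```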
The M-N computation scales linearly with the number of experiments, and application of the algorithm (five levels per analysis, with each level taking 1 min of CPU time) is generally not a limiting factor.
Normalization
A basic form of data normalization is carried out in two steps. First, before application of the M-N rule, the feature intensities of each individual dataset are converted to integers (ranging between 1 and 10,000 arbitrary units) by dividing all feature intensities by a constant, K. The constant K is chosen as the minimum value such that the total number of features with intensities larger than $K \times 10{,}000$ equals 100; in other words, after scaling there are only 100 features with intensities above the cut-off value of 10,000. The second step is designed to detect potential normalization problems: it monitors both the K constants and the M-N rules produced for each corresponding pamphlet, and issues an alarm (error message) if these (feature intensities and pixel counts) differ by more than 20%. No error messages were generated for any of the datasets reported in this study.
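A minimal sketch of the first normalization step, choosing the scaling constant K as described above; the variable names and tie handling are ours.

```python
def normalize_features(intensities, top_n=100, ceiling=10_000):
    """Rescale raw feature intensities to integers of roughly 1-10,000 units.

    K is taken as the smallest constant for which only `top_n` features have
    raw intensity above K * ceiling, so that after dividing by K only those
    `top_n` features exceed the 10,000 cut-off."""
    ranked = sorted(intensities, reverse=True)
    if len(ranked) <= top_n:
        raise ValueError("dataset has fewer features than top_n")
    K = ranked[top_n] / ceiling   # (top_n + 1)-th largest intensity / 10,000
    if K <= 0:
        raise ValueError("too few non-zero intensities to normalize")
    return [max(1, round(z / K)) for z in intensities]
```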
Evaluating System Stringency
Generally, the first scans acquired during LC-MS consist of noise (assuming no sample bleed-through). Thus, by concatenating the first 10% of total acquired scans from 10 independent LC-MS analyses of a yeast cell tryptic digest, we constructed a virtual experimental dataset consisting entirely of nonpeptide ion noise. We then evaluated the performance and sensitivity of various elaborations of the M-N rule with each of the genuine LC-MS datasets versus the noise dataset alone. As seen in Table I below, the M-N rule approach proved to be highly stringent. Even a liberal Level 3-3 threshold, which extracts roughly 11,000 to 16,000 features on average from each of the peptide profiles, resulted in the detection of only a few spurious peaks in the control dataset (i.e. only 3–4 false positives satisfying the rule were detected).
Peak Extraction Algorithm
The algorithm proceeds in an iterative stepwise manner, starting from Level 1 (1,000 pixels) using the basic extraction principle. Each distinct peak (defined by discrete scan headers and m/z) is assigned a unique ID. Next, the algorithm progressively adds features from successive levels (Level 2, then 3, and so on). If these added features overlap with more than one peak, the groupings are split (IDs reassigned) such that, at most, only a single peak is retained from a previous level. That is, the overlapping features are bisected at each additional level by computing a discriminating line that separates (and hence preserves) the original peaks.
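The sketch below illustrates only the first part of this procedure, assigning unique IDs to distinct peaks in a single-level pamphlet by flood-filling adjacent pixels; the progressive level-by-level splitting with a discriminating line is specific to the published software and is not reproduced here.

```python
from collections import deque

def label_peaks(pixels):
    """Assign a unique ID to each distinct peak in a pamphlet.

    pixels: set of (scan, mz) tuples that survived the M-N filter.
    Pixels belong to the same peak if they are adjacent in the
    scan x m/z plane (8-connectivity).  Returns {(scan, mz): peak_id}."""
    labels = {}
    next_id = 0
    for start in pixels:
        if start in labels:
            continue
        labels[start] = next_id
        queue = deque([start])
        while queue:                      # breadth-first flood fill
            s, m = queue.popleft()
            for ds in (-1, 0, 1):
                for dm in (-1, 0, 1):
                    nb = (s + ds, m + dm)
                    if nb in pixels and nb not in labels:
                        labels[nb] = next_id
                        queue.append(nb)
        next_id += 1
    return labels
```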
Peak Alignment Algorithm
Because a pamphlet can be represented as a collection of pixels with X and Y coordinates, where X is the scan number and Y the nominal m/z (for the alignment problem we need not consider intensity values), we let $P_1 = \{X_{i,1}, Y_{i,1}\}_{i=1}^{L_1}$ and $P_2 = \{X_{i,2}, Y_{i,2}\}_{i=1}^{L_2}$ formally represent two different pamphlets. A relatively simple but robust measure of similarity between two datasets (Matching) is then calculated based on the percentage feature (pixel) overlap, defined as:
[Equation not reproduced: Matching($P_1$, $P_2$), the percentage of pixels shared between the two pamphlets.]
Given a smooth increasing function F(X,Y), we let $P_2^F = \{F(X_{i,2}, Y_{i,2}),\, Y_{i,2}\}_{i=1}^{L_2}$ serve as a time-transformed second pamphlet. The alignment problem now reduces to finding the function F that maximizes Matching($P_1$, $P_2^F$). We considered only functions of the form
[Equation not reproduced: the parametric form of F(X,Y), with coefficients $A_{i,j}$ and $C_{i,j}$ defined over the partitions $\{a_i\}$ and $\{b_j\}$.]
where $A_{i,j}$ and $C_{i,j}$ are chosen such that $|F(x_1,y) - F(x_2,y)| \le K_1 |x_1 - x_2|$ and $|F(x,y_1) - F(x,y_2)| \le K_2 |y_1 - y_2|$, with Lipschitz constants $0.9 \le K_1 \le 1.1$ and $-0.05 \le K_2 \le 0.05$. The partitions $\{a_i\}_{i=1}^{E}$ and $\{b_j\}_{j=1}^{D}$ were uniform, while the constants D and E were set to 5 and 6, respectively. The optimization now reduces to finding the optimal values for $A_{i,j}$ and $C_{i,j}$. Accelerated Random Search (26) was the optimization schema of choice because it is robust and easy to implement. As a further measure of peak matching, a final "wobble" function is applied wherein a peak is allowed to move (±1–2% of total scan headers) in order to find the nearest adjacent peak in a different experimental dataset. Generally, even for complex mixtures and higher level pamphlets, there is an extremely low probability that two (or more) peaks will be exactly equidistant. If that happens, the software picks one at random and an alarm (error message) is produced (we have rarely seen this form of error).
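The following sketch shows one way to compute a percentage-overlap Matching score together with the local "wobble" correction. Because the exact published formula is not reproduced above, the denominator used here (the smaller pamphlet) is an assumption, as are the function names.

```python
def matching(p1, p2):
    """Percentage pixel overlap between two pamphlets (sets of (scan, mz))."""
    if not p1 or not p2:
        return 0.0
    return 100.0 * len(p1 & p2) / min(len(p1), len(p2))

def wobble(p1, p2, max_shift=2):
    """Shift each p2 pixel by up to max_shift scans (same nominal m/z)
    onto the nearest still-unmatched p1 pixel, if one exists."""
    claimed, adjusted = set(), set()
    for scan, mz in p2:
        target = None
        for d in sorted(range(-max_shift, max_shift + 1), key=abs):
            candidate = (scan + d, mz)
            if candidate in p1 and candidate not in claimed:
                target = candidate
                break
        if target is None:
            adjusted.add((scan, mz))
        else:
            claimed.add(target)
            adjusted.add(target)
    return adjusted

# After the time transformation, matching(p1, wobble(p1, p2_transformed))
# gives the final similarity score for the aligned pair.
```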
The alignment algorithm is computationally intensive and scales with the square of the number of experiments (e.g. pair-wise dataset matchings). Sufficient RAM is therefore suggested to carry out the most demanding calculations in memory.
Peptide Quantitation
A peptide quantitation module processes input LC-MS datasets and outputs signature expression profiles, along with a measure of statistical variation. Peak integration is performed by summing the intensities of grouped features across adjacent MS scans recorded in full scan mode.
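A minimal sketch of the integration step, assuming peak IDs and filtered intensities keyed by (scan, m/z) as in the sketches above; the names are illustrative.

```python
def integrate_peaks(peak_ids, intensities):
    """Sum grouped feature intensities across adjacent full-scan MS scans.

    peak_ids:    {(scan, mz): peak_id} from peak extraction.
    intensities: {(scan, mz): ion intensity} for the same pamphlet.
    Returns {peak_id: integrated peak intensity}."""
    totals = {}
    for pixel, peak in peak_ids.items():
        totals[peak] = totals.get(peak, 0.0) + intensities.get(pixel, 0.0)
    return totals
```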
Protein Sample Preparation
Human serum was prepared according to standard practice. Purified human heart troponin complex was obtained from a commercial source. For the mouse fasting protocol, chow was removed from the fasted mice in the morning and all mice were sacrificed 24 h later. The strain used (27) was an inbred C57BL/6 × 129 cross. Liver extracts were prepared as reported by Kislinger et al. (28). Protein fractions were precipitated, solubilized in urea, resuspended in 100 mM NH4HCO3 with 1 mM CaCl2 (pH 8.5), and digested with Poroszyme trypsin beads (Applied Biosystems, Foster City, CA). The resulting peptide mixtures were solid-phase extracted with SPEC-Plus PT C18 cartridges (Ansys Diagnostics, Lake Forest, CA) and stored at −80°C until further use. Synthetic peptides were obtained from Sigma-Aldrich (St. Louis, MO).
LC-MS Analysis
Peptide mixtures were subjected to capillary-scale LC-MS using a quaternary HPLC pump coupled online to an LCQ DECA ion trap MS (Thermo Finnigan, San Jose, CA) essentially as described (29). Briefly, a fused-silica microcolumn (100 µm i.d. x 365 µm o.d.) was pulled with a Model P-2000 laser puller (Sutter Instrument Co., Novato, CA) and packed with 5 cm of 5-µm C18 reverse-phase material (Zorbax XDB-C18; Agilent, Palo Alto, CA). After loading, the column was placed in-line with the ion source and the peptides were eluted with a linear gradient [100% buffer A (5% ACN, 0.02% heptafluorobutyric acid, 0.5% acetic acid) to 80% solvent B (100% ACN) over 45, 60, or 90 min] at a tip flow rate of ~0.3 µl/min using a split line. Eluting peptide ions were analyzed with alternating MS modes, using a full-scan mass range of 400–1,600 m/z followed by data-dependent CID. A dynamic exclusion list was used to limit collection of redundant CID spectra.
Protein Identification
Peptide fragmentation product ion spectra were sequence-mapped against a database of nonredundant protein sequences (Swiss-Prot) using the SEQUEST software algorithm (30) running on a multiprocessor computer. The probability-based evaluation algorithm STATQUEST (28) was used to filter all putative matches based on a 95% likelihood of predicted accuracy. Functional annotation was obtained from Swiss-Prot.
RESULTS
Modern LC-MS systems can resolve hundreds of peptides (1, 12). A typical dataset, consisting of a multiplexed stream of co-eluting ion peaks (or ion map) acquired on a quadrupole ion-trap, is shown in Supplemental Fig. S1. We refer to such data as an empirical profile. Inspection of representative total ion chromatograms demonstrates the general reproducibility of LC-MS (Supplemental Fig. S2). Nonetheless, even under controlled conditions (20, 22), both stochastic system performance variation and chemical and electronic noise can affect the relative position, width, amplitude, and shape of individual peaks (Supplemental Fig. S3). We refer to these peak artifacts as drift and distortion.
Extracting Quantitative Information from LC-MS Datasets

Step 1: Data Filtering and Signal Extraction
A feature extraction algorithm is then used to select an optimal set of $(M,N)_i$ rules to acquire a predefined series of $L_i$ features for the geometrically increasing sequence $L_i = 2^{i-1} \times 1{,}000$. [Pamphlets with $L_i$ features are referred to as Level i pamphlets.] The algorithm starts conservatively, extracting the most prominent ion features first, and then progressively adds features until the cutoff is met. Statistical analysis (see "Experimental Procedures") suggests [M = 3, N = 3] as a generally acceptable lower threshold, resulting in many discrete features (depending on sample complexity) with little specious background. Examples of Level 5 and Level 2 pamphlets (16,000 and 2,000 features), generated by LC-MS analysis of a yeast cell extract, are shown in Fig. 1, a and b, respectively. Despite evidence of crowding, a higher resolution "zoom in" reveals good peak discrimination (Fig. 1c).
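For illustration, the sketch below searches for the M (at fixed N) whose rule yields a pixel count closest to the Level target $L_i = 2^{i-1} \times 1{,}000$. The search strategy and granularity are our assumptions, and `count_pixels` stands in for an M-N filter such as the one sketched under "Experimental Procedures".

```python
def level_targets(levels=5):
    """Pixel targets L_i = 2**(i-1) * 1000 for Levels 1..levels."""
    return [2 ** (i - 1) * 1000 for i in range(1, levels + 1)]

def choose_rule(traces, target, count_pixels, N=3, m_values=range(3, 51)):
    """Pick the M (with N fixed) whose M-N rule extracts a total pixel
    count closest to `target` over all nominal m/z traces.

    traces:       {mz: [intensity per scan]}
    count_pixels: callable (trace, M, N) -> number of pixels declared."""
    best_m, best_gap = None, float("inf")
    for M in m_values:
        total = sum(count_pixels(trace, M, N) for trace in traces.values())
        gap = abs(total - target)
        if gap < best_gap:
            best_m, best_gap = M, gap
    return best_m

# e.g. rules = [choose_rule(traces, t, lambda tr, M, N: len(mn_filter(tr, M, N)))
#               for t in level_targets()]
```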
Step 2: Peak Definition

Step 3: Correction of Peak Drift and Distortion and Peak Alignment
Last, to compensate for residual random (nonsmooth) variation, each of the peaks detected by contour mapping is "wobbled" to maximize peak overlap (see "Experimental Procedures"). Although this local optimization is limited in scope (±1% total scans), it provides an added measure of peak matching. Fig. 3b illustrates the considerably improved peak matching achieved by this multistage procedure.
We note that, just as data normalization is often used to correct for systemic signal discrepancies in microarray studies (34), global peak intensities of different datasets can likewise first be normalized by adjusting median feature intensities to unity prior to matching. However, many substantive issues are raised by normalization procedures (34). In our experience, well-controlled sample preparation and LC-MS procedures serve sufficiently well in most instances such that data normalization is not a major concern. Nevertheless, normalization may improve the inferences that can be drawn from comparisons of proteomic datasets generated by different sources and locations.
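A minimal sketch of the median-to-unity normalization mentioned above; it assumes the feature intensities of one dataset are available as a flat list.

```python
import statistics

def median_normalize(feature_intensities):
    """Scale one dataset so that its median feature intensity equals one."""
    med = statistics.median(feature_intensities)
    if med <= 0:
        raise ValueError("median feature intensity must be positive")
    return [z / med for z in feature_intensities]
```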
Computational time is another obvious constraint here. We have worked under the general guiding principle that the routine application of our informatics platform should not exceed the time necessary to complete the LC-MS analyses themselves (that is, the rate of data production should not exceed data processing capacity). In fact, running the software on a basic single Pentium CPU Win/PC workstation is generally more than sufficient to keep up with the data output of a dedicated LC-MS system collecting spectra more or less around-the-clock.
The alignment algorithm is by far the most computationally intensive and scales with the square of the number of experiments (e.g. pair-wise dataset matchings). While the inherent computational difficulties (multidimensional optimization generally requires 510 min of CPU time per matching) will be hard to speed up, the concept of "Mother pamphlet" was specifically designed to tackle the quadratic increase.
Step 4: Quantitative and Qualitative Proteomic Comparisons
For quantitative peak comparisons, grouped feature intensities are summed. Consistent with earlier reports (20), standard titration curves recorded for model peptides exhibited linear signal responses after peak processing and quantification, with residual variation mitigated by repeat analyses and signal averaging (Fig. 4a). Moreover, a good correlation (R2 = 0.84) was observed in log10 scatter plots of peak intensities measured for >400 peptides reproducibly detected in the two aligned datasets reported in Fig. 3b, with relatively few outliers and only modest dispersion at lower signal-to-noise ratios (Fig. 4b). Importantly, >93% of the peaks exhibited 2-fold or less deviation in observed signal intensities, an established benchmark of reproducibility used in microarray studies (34), across a 3–4 order of magnitude dynamic range.
We next tested the sensitivity of the platform to spurious experimental variations stemming from fluctuations in sample work-up. Duplicate aliquots of a protein mixture were processed in parallel and analyzed by repeat LC-MS. The resulting profiles were found to be highly similar (Supplemental Fig. S5). We concluded that the platform is relatively robust to artifacts stemming from standard sample handling procedures.
Sample Classification
In principle, one could use either quantitative patterns or qualitative (present/absent) differences in peptide abundance to discriminate between the samples. For the purpose of diagnostic development, it may be preferable to focus on the latter (5, 6, 15, 36). A good way to calibrate the software was to see whether it could reveal modest differences in protein abundance between individual mice within the two groups, as some biological variation is expected (37). Indeed, pair-wise comparisons of all intra-group datasets revealed greater differences in the proteomic patterns of individual mice (Supplemental Fig. S6b) than could be explained by experimental variation alone. With this added confidence, we then tested whether the software-generated patterns could be used to correctly classify mouse F. Indeed, the profile separation was pronounced (Fig. 5), allowing for its unambiguous assignment to the control (fed) group. Not only was each of the mouse F data profiles more similar "on average" to the control mice (A and B) than to the fasted mice (C and D), it was also more similar in every single comparison.
We made use of this relatively straightforward procedure for its simplicity, speed, and ability to handle multiway classifications, but alternative classification procedures and algorithms may be more effective in certain data-mining scenarios. Again, these algorithms can be readily incorporated as stand-alone modules within the platform. When the phenotypic difference (in terms of proteomic profiles) between classes is less pronounced, it is reasonable to expect that the two curves plotted in Fig. 5 could sometimes cross. However, our software can still handle this scenario, provided that the intra-group variations are sufficiently small to detect statistically significant differences in comparisons between the two respective classes. The reason is that the classification algorithm was used not only to process the data shown in Fig. 5, but also to take into account additional possible sources of experimental and biological variability, as reported in Supplemental Fig. S6.
We note that the matching score of an unknown test dataset against all other datasets obtained for a particular class (for example, all profiles acquired for the group of fed mice) can likewise be regarded as a "point" in high-dimensional space (typically known as the "feature space" in pattern recognition literature, where the dimension is the number of samples within a group). Generating such a point for each dataset (that is, for both the fed and starved mice) gives rise to two clusters of points in the space, one for each class. If the two clusters are sufficiently separable in space or exhibit relatively confined covariance structure, robust classifiers can be readily obtained using established, rigorous criteria, allowing ready classification of the unknown sample F even when the two curves shown in Fig. 5 cross. Of course, if these clusters are not sufficiently distinct, for instance due to severe intra-class profile variations, virtually all classification schema would be expected to fail. So, in the end, any data-mining approach will be data driven.
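One straightforward way to act on this representation is a nearest-centroid rule over the matching-score vectors, sketched below. This is an illustrative classifier consistent with the description above, not the specific procedure used in the study.

```python
def score_vector(dataset, reference_profiles, matching):
    """Represent a dataset as its vector of matching scores against a fixed
    panel of reference profiles (a point in the 'feature space')."""
    return [matching(dataset, ref) for ref in reference_profiles]

def nearest_centroid(test_point, class_points):
    """Assign test_point to the class whose centroid is closest (Euclidean).

    class_points: {class_label: [points belonging to that class]}."""
    def centroid(points):
        return [sum(coord) / len(points) for coord in zip(*points)]

    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    return min(class_points,
               key=lambda label: distance(test_point, centroid(class_points[label])))
```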
Sequence Validation

Repetition Improves Data Consistency
To gauge the extent to which experimental repetition might compensate for this problem, we developed the concept of a Mother Pamphlet (J,K,L), a data matrix combining the key elements (scan number, m/z, and signal intensity) that define all features reproducibly detected at least K times across J repeat Level L pamphlets. For example, a Mother Pamphlet (4,2,3) contains only those peaks detected in common in any two of four related input Level 3 pamphlets. Based on this new measure, each of the four profiles recorded for mouse F did, in fact, exhibit considerably better peak overlap (99%) with a Mother Pamphlet (3,1,1) constructed from the remaining three datasets (Supplemental Fig. S7a). That is, virtually all of the peaks detected in one pamphlet were likewise found in at least one of the other datasets. Hence, even limited repetition reveals essentially all of the principal peptide peaks that define a sample.
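A minimal sketch of the Mother Pamphlet construction, assuming the J input Level L pamphlets have already been aligned and are given as sets of (scan, m/z) features; intensities are omitted for brevity and the names are ours.

```python
from collections import Counter

def mother_pamphlet(pamphlets, K):
    """Build a Mother Pamphlet (J, K, L) from J aligned Level L pamphlets.

    pamphlets: list of J sets of (scan, mz) features.
    Keeps only features detected in at least K of the J inputs."""
    counts = Counter()
    for p in pamphlets:
        counts.update(set(p))
    return {feature for feature, n in counts.items() if n >= K}

# e.g. mother_pamphlet(four_level3_pamphlets, K=2) corresponds to a
# Mother Pamphlet (4, 2, 3) as described in the text.
```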
We expanded on this concept by establishing the commonality of proteomic patterns within a group of related mice. We compared a Mother Pamphlet (4,4,1) encompassing all peaks reproducibly detected in all four mouse F datasets to a more inclusive Mother Pamphlet (8,2,3) created for all eight datasets derived for the other two fed mice. As expected, a high degree of similarity (90.5%) was still detected (Supplemental Fig. S7b), with the modest residual variation due, in part, to biological variation between individuals (37).
Biomarker Identification

Case Study: Blood Profiling
DISCUSSION
While gel-free LC-MS-based profiling methods offer remarkable analytical speed and sensitivity (17), their variability has limited their general suitability for biomarker discovery (16). In an attempt to overcome this, several research groups have developed innovative chemical labeling strategies designed to improve the reliability of the quantitative inferences that can be made by LC-MS (19). In addition, an "accurate mass and elution time" profiling method based on ultra-high performance LC-MS systems has been reported (22). While effective, the impact of these approaches on the clinical domain has been restricted to date, in part due to the considerable dedicated instrument time, technical expertise, and costs associated with multisample analyses (38).
To overcome these limitations, we have developed and validated a complementary informatics strategy designed to derive reliable qualitative and quantitative protein profiling data using established, broadly applicable LC-MS procedures. The software described here corrects for spurious deviation between experiments, permitting meaningful comparisons of proteomic datasets for the purpose of identifying differential protein expression between samples. It also automates large-scale pattern recognition and mining of proteomic datasets for the purpose of sample classification and biomarker discovery. The Mother Pamphlet strategy, in particular, allows detection of reproducible differences in proteomic patterns, improving proteome coverage and dynamic range.
Using this approach, we have shown that informatics methods can reveal biologically significant changes in tissue protein patterns, allowing for sample classification without being overly sensitive to experimental noise (e.g. the particular day an LC-MS experiment was performed). The data strongly suggest that the software can serve as the basis for systematic molecular investigation of disease or therapeutic action. Many experimental variations can be envisaged, including strategies to monitor dynamic changes in the levels of protein post-translational modifications in response to stimuli. Importantly, this informatics platform can also accommodate the use of isotope-based chemical labeling methods (19, 29) to further enhance the accuracy of the quantitative measurements.
The threshold-like data filtering criteria used here can be implemented as a semi-quantitative measure of peptide abundance (for instance, by comparing the presence/absence of peak detection at different pamphlet filter levels). In this study, we performed a series of control experiments showing that integrated peptide peak intensities correlate well with abundance (Fig. 4). Thus, in order to compare the levels of peptides between classes, one need only revisit the appropriate Mother Pamphlets and compare the intensities corresponding to the peptides of interest. Indeed, we set forth with the long-term aim of developing the software as the basis for routinely quantifying differences (relative ratios or fold-changes in protein abundance) in peak intensities across even the most complex proteomic patterns. Implementation of this feature is now a relatively straightforward programming issue because there are no substantive mathematical difficulties, and we hope to address this desirable functionality in the next generation of the software. The effect of various methods of data normalization on the reliability of the inferences made from proteomic comparisons also needs to be more extensively evaluated. Of course, one is still unlikely to be able to detect all possible biomarkers due to limitations in protein and peptide extraction and to variations inherent to the experiments themselves, namely dynamic range constraints and extremes in biological complexity and/or variability.
The alignment process can be computationally demanding, particularly for larger sample sizes. In the mouse tissue profiling example provided, instead of a full-scale set of 400 alignments (5 mice × 4 pamphlets = 20 datasets, resulting in 20² = 400 matched pairs), the alignment problem can be stratified by first creating 5 Mother Pamphlets, one per individual mouse (i.e. 4 mice × 4² = 64 alignments), and then aligning the 5 Mother Pamphlets (5² = 25 matchings), totaling 89 alignments, with a corresponding increase in processing speed. Nevertheless, we have found that comparisons of upward of 100 individual datasets are quite manageable and are taxing only for the highest feature extraction pamphlet levels. Given the modular design of the software, and depending on the design of experiments and the available hardware, different and more appropriate computational methods to address this possible constraint can also be implemented if needed. Likewise, the brute force M-N algorithm could also be sped up, but since this computation scales linearly with the number of experiments, and since the application of the algorithm (5 levels per analysis, each taking ~1 min of CPU time, as compared with the 60–90 min typically required for most LC-MS analyses) is not a limiting factor, there is no pressing need at this point to optimize it further.
It should be noted that a key aspect of the profiling strategy outlined here is the ability to detect and evaluate candidate biomarkers without the need for carrying out time-consuming CID. Proteins of interest, such as those whose levels change reproducibly as a result of an experimental perturbation or which help differentiate between clinical samples, can then be identified in targeted follow-up sequencing experiments. While similar concepts have recently been introduced by others (18, 22, 45), our approach has major advantages in that it builds on established experimental techniques and existing instrumentation that are broadly available throughout the biomedical research community. Nonetheless, our toolkit can exploit the improved dynamic range, resolution, and mass accuracy of newer generation MS instrumentation. Moreover, although high-abundance proteins were preferentially detected in the mouse profiling experiments reported here (in part due to the limited dynamic range of the instrumentation used), we expect that proteome coverage can be significantly improved by using basic subcellular fractionation and affinity enrichment techniques prior to LC-MS (1, 2). By uncoupling sequence identification from peptide quantitation, a markedly expanded number of samples can be analyzed in a single day, increasing the precision and throughput of quantitative proteomic measurements, resulting in a better accounting of biological variation.
ACKNOWLEDGMENTS

FOOTNOTES
Published, MCP Papers in Press, July 21, 2004, DOI 10.1074/mcp.M400061-MCP200
* This work was supported in part by funding from the Natural Sciences and Engineering Research Council of Canada (NSERC), Genome Canada, and the Canadian Protein Engineering Network Centre of Excellence (PENCE) to A. E.
S The on-line version of this manuscript (available at http://www.mcponline.org) contains supplemental material.
Current address: Shoreline Community College, North Shoreline, WA 98133.
¶ To whom correspondence should be addressed: Dragan Radulovic, Department of Mathematics, Florida Atlantic University, Boca Raton, FL, E-mail: radulovi@fau.edu; or Andrew Emili, CH Best Institute, 112 College Street, Toronto, ON, Canada M5G 1L6, Tel.: 416-946-7281, Fax: 416-978-8528, E-mail: andrew.emili@utoronto.ca.
REFERENCES