From the Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3G4, Canada;
Program in Proteomics and Bioinformatics and ¶ Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario M5G 1L6, Canada
ABSTRACT
The field of expression proteomics seeks to answer the following questions: 1) which proteins and variant isoforms are expressed during the lifecycle of an organism; 2) which post-translational modifications occur in each of these proteins; 3) how do these patterns differ in different cell types and tissues and under different developmental, physiological, and disease conditions; and 4) how can biologists make use of this information to better understand the molecular basis for fundamental biological processes as well as for monitoring the course of disease so as to improve clinical diagnosis and treatment (1–3). These questions are made all the more difficult by the complexity of most biological systems, which increases exponentially as one goes downstream from DNA sequence to mRNA intermediates to the protein end-products. While it appears there are likely far fewer genes coded for by the human genome than first anticipated, it is estimated that >60% of the 25,000 putative ORFs encode more than one splice variant (often tens to hundreds), and these in turn are frequently subject to post-translational modification (4, 5). Moreover, because proteins typically function together as components of dynamic multisubunit macromolecular complexes, the final complexity of a biological system is enormous; hence the search for interesting and relevant proteomic patterns remains a challenging task.
Capillary-scale HPLC-MS/MS (LC-MS) is rapidly emerging as a method of choice for large-scale proteomic analysis (1, 2, 6–12). State-of-the-art LC-MS systems can be used to identify and track the relative abundance of thousands of molecules (6, 7). For standard bottom-up profiling experiments, the molecules in question are peptides derived by proteolysis of intact proteins. For very complex protein samples, such as blood, the peptide mixtures are physically resolved by chromatographic separation prior to injection into the mass spectrometer so as to generate a more richly informative map, consisting of both the unique elution characteristics (column retention times) and the m/z (mass-to-charge) ratios of individual peptides. Discrete peptides of interest are subject to collision-induced fragmentation followed by database matching for the purpose of sequence identification, while the recorded pattern of precursor ion peak intensities can be used to infer the relative quantities of the various proteins between samples. Nevertheless, comparisons of large-scale multivariate proteomic datasets are subject to a number of challenging analytical issues, such as experimental noise, systematic variations between experimental runs, the extreme overall range and dynamic nature of protein levels, and the huge number of measured features (e.g. protein levels), many of which are uncorrelated (or spuriously correlated) with the variables of interest.
In this review, we discuss important computational and statistical concepts that should be considered when performing comparative proteomic analyses and outline procedures for processing MS data to achieve more reliable quantitative analyses. As the broader domains of statistical and machine-learning methods are beyond the scope of this review, we limit our examination to studies applying these methods to shotgun profiling datasets. For more general information about statistical learning models, a good reference is Ref. 13. For an in-depth discussion of the technical nature of LC-MS, and the relative merits of different MS platforms, the reader is directed to other reviews present in this issue.
LC-MS-BASED PROTEOMIC PROFILING
A non-LC-based MS platform commonly used in profiling studies is SELDI (a refinement of MALDI), in which subsets of proteins in samples are selectively pre-adsorbed onto various proteomic chips (consisting of differing binding surfaces, such as hydrophobic or metal-binding surfaces) (8, 9).
The quality of profiling studies is determined by the overall sensitivity, detection coverage, dynamic range, fragmentation efficiency, mass resolution, and accuracy (6). Femtomole or better detection limits are commonly attained with LC-MS, even with mixtures exceeding several thousand components (7). With SELDI, much of the sample is discarded (not selectively bound to the matrix), whereas LC-MS enables full sample throughput, while providing the added opportunity for protein identification and mapping of post-translational modifications.
PARADIGMS FOR INFORMATION EXTRACTION FROM LC-MS DATASETS
Such an approach, in which discrete peptide peaks are extracted, identified, and then compared across samples, is complicated by a number of factors: 1) the LC time axis needs to be corrected to account for spurious deviations in peptide elution times across different experiments; 2) there may be confounding overlap of peptides across the (time, m/z) space; 3) LC-MS systems are subject to considerable noise and variability that is not fully characterized or accounted for; and 4) differences in overall sample composition, leading to differential ion context, may affect the apparent signal intensities recorded for peptides common to multiple samples.
To address these issues, the following tasks need to be resolved: 1) corrective alignment along the experimental LC time axis so that times are comparable across experiments; 2) combining replicate experimental LC-MS datasets so as to improve the overall signal-to-noise ratio; 3) developing statistically sound methods for distinguishing signal from background; 4) systematically studying sources of variability and signal characteristics inherent to LC-MS data and modeling them so as to take them into account (for example, systematic variation in peak amplitude and width, signal linearity, and background artifacts should be investigated); 5) developing algorithms to detect and quantify peptide ion peaks in LC-MS data; and 6) developing probabilistic algorithms to uncover interesting peptide patterns and to match new patterns to previously discovered ones.
In the aforementioned approach, peptide peaks are extracted and then identified, and only after these steps are performed does one pose queries related to the particular experiment at hand. An alternative approach to exploiting LC-MS data is to take a signal processing perspective. For instance, rather than starting with peak detection and peptide identification to find disease biomarkers using LC-MS patterns, the data could be treated as a signal matrix, allowing the application of established methods in signal processing, statistics, and machine learning (13, 16, 17) to tease out interesting and relevant patterns from the data. Nevertheless, many of the previous tasks, such as time alignment, background detection and correction, and accounting for systematic variations in signal amplitude, are relevant to this latter approach.
Several groups (1, 10, 18) have developed working systems to address the main problems described above. The system of Radulovic et al. (18) performs filtering, normalization, peak detection, quantification and alignment, and then classification; they also establish linearity of the LC-MS peak signal with peptide concentration. Although no algorithm details are provided, Kearney and Thibault (1) report a system for peptide detection and data alignment, which identifies the m/z, retention time, charge state, and maximum intensity of the principal peptide isotope. Wang et al. (10) performed baseline subtraction, peak detection, isotope detection, alignment, and quantification and demonstrated a linear LC-MS peak signal response to increasing peptide concentration in a complex mixture over a fairly large dynamic range.
Several studies of LC-MS have focused on particular low/mid-level data processing steps such as noise reduction, error models (which model the variance of the peptide abundance), or alignment in time, while other studies have been devoted strictly to the use of MS signal for sample classification. For studies attempting to cover the entire set of tasks, information extraction from LC-MS datasets typically involves a sequence of operations as listed in Table I, each treated largely independently of the others. The optimal order in which these should be performed so as to obtain the most information possible remains unclear. For instance, one could align the signals before or after peak detection, one could normalize before or after alignment, etc. There are several sequences of operations that make intuitive sense, but that differ significantly from one another. The optimal ordering depends in part on our ability to perform each task. For example, if alignment can be performed optimally on the raw data, without normalization or peak detection, then it may be desirable to do this first, potentially allowing for better normalization and peak detection. Alternatively, if one cannot align the raw signals well, then it may be desirable to normalize and find peaks before alignment. Lastly, it may be desirable to tackle some of these subtasks jointly, for example as in the continuous profile model (CPM),1 which aligns and normalizes abundance simultaneously, removing the necessity to choose one before the other (19).
LC-MS DATA PROCESSING
LOW-LEVEL PROCESSING: QUANTIZATION OF m/z VALUES
An issue to consider here is how to bin the m/z values. For example, one could opt for evenly spaced bins in either native or log m/z space. An optimal bin width would be large enough to keep the matrix tractable and not too sparse, but small enough that individual m/z values remain informative (i.e. not collapsing the information too much); this trade-off depends on the MS instrumentation used. For instance, Radulovic et al. (18) decided to round m/z values to the nearest integer, consistent with the nominal mass accuracy of the ion trap instruments used in their study, while Wang et al. and Anderle et al. (10, 20) used evenly spaced bins of width 0.2 Th to exploit the increased accuracy of their instruments. In a study of gas-phase chromatography (GC)-MS data by Stein (21), bin widths were chosen as a function of the measured mass accuracy and resolution and increased linearly with m/z. Tibshirani and co-workers (22) reported peak widths (ignoring time) to be reasonably constant in log m/z space. No methods have been reported for evaluating optimal bin width, nor for determining the sensitivity of further calculations to this parameter.
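To make the two binning strategies concrete, the sketch below (Python/NumPy) builds either fixed-width bins in native m/z space or bins whose width grows linearly with m/z (i.e. evenly spaced in log m/z), and then sums centroided intensities into them. The bin widths, array names, and toy readings are illustrative assumptions, not recommended settings.

```python
import numpy as np

def fixed_width_bins(mz_min, mz_max, width):
    """Evenly spaced bin edges in native m/z space (e.g. width = 0.2 Th)."""
    return np.arange(mz_min, mz_max + width, width)

def proportional_bins(mz_min, mz_max, rel_width):
    """Bin edges whose width grows linearly with m/z (constant relative width),
    i.e. evenly spaced in log m/z space."""
    n = int(np.ceil(np.log(mz_max / mz_min) / np.log(1.0 + rel_width)))
    return mz_min * (1.0 + rel_width) ** np.arange(n + 1)

def bin_spectrum(mz, intensity, edges):
    """Sum intensities of centroided (m/z, intensity) readings falling in each bin."""
    idx = np.digitize(mz, edges) - 1          # bin index for each reading
    binned = np.zeros(len(edges) - 1)
    valid = (idx >= 0) & (idx < len(binned))
    np.add.at(binned, idx[valid], intensity[valid])
    return binned

# Toy usage with made-up readings.
mz = np.array([400.18, 400.21, 612.77, 1203.40])
inten = np.array([120.0, 95.0, 300.0, 40.0])
vec = bin_spectrum(mz, inten, fixed_width_bins(400.0, 1400.0, 0.2))
```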
SIGNAL FILTERING AND BACKGROUND SUBTRACTION
Various filters for data smoothing along the LC time axis have been implemented (25). These include simple "moving average," median, and moving geometric mean filters, and the Savitzky-Golay filter, which preserves high-frequency content by fitting a higher-order polynomial to the data over a local window (26). For example, Wang et al. observed well-defined peaks and baseline after applying the Savitzky-Golay filter to LC-MS data (10). Data points belonging to the baseline were then hand-picked, fit with a low-order polynomial, and subtracted from the original data, together with a second application of the Savitzky-Golay filter for added peak smoothing. Nevertheless, manual delineation of background is a subjective, tedious, and error-prone process, and inconsistent with high-throughput analysis.
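A minimal sketch of this style of processing is given below (Python with SciPy), assuming the chromatogram is already extracted as a one-dimensional intensity trace. Because hand-picking baseline points does not suit an automated sketch, a low-percentile heuristic stands in for that step; the window length, polynomial orders, and percentile are illustrative assumptions, not the published parameters.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_and_debaseline(trace, window=11, polyorder=3, baseline_order=2):
    """Savitzky-Golay smoothing followed by subtraction of a low-order
    polynomial baseline fit to candidate background points."""
    smoothed = savgol_filter(trace, window_length=window, polyorder=polyorder)
    # Heuristic stand-in for hand-picked baseline points: treat points below
    # the 20th percentile of the smoothed trace as presumed background.
    t = np.arange(len(smoothed))
    bg = smoothed <= np.percentile(smoothed, 20)
    coeffs = np.polyfit(t[bg], smoothed[bg], deg=baseline_order)
    baseline = np.polyval(coeffs, t)
    corrected = smoothed - baseline
    # Optional second smoothing pass, as described, for added peak smoothing.
    return savgol_filter(corrected, window_length=window, polyorder=polyorder)
```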
Bylund et al. have noted that taking the second derivative of the raw MS signal, followed by matched filtering, whereby the signal (in this case its second derivative) is cross-correlated with a Gaussian template, reduces background noise and enhances peaks. This smoothing is similar to application of a low-pass "top hat" filter, as is done on SELDI-MS data along the m/z axis in Ref. 27, in which signal frequencies above some threshold are completely removed from the data. Note that taking the second derivative of the signal actually amplifies nonlinear noise (hence the need to filter afterward), while matched filtering shapes noise (specifically white noise) toward the template frequency, making it harder to identify. On the other hand, Radulovic et al. (18) reported a two-step procedure for noise reduction/binarization of LC-MS signal. First, a "moving average" filter (five-scan-header width) is used to smooth the dataset across discrete m/z bins. Then peak intensities exceeding a pre-defined threshold, T (related to the trimmed mean or median intensity of one m/z bin across all scan headers), for N consecutive scans are selected as being signal, with the rest of the intensities deemed to be noise. Processing of negative control spectra acquired during LC-MS analysis of solvent alone revealed few false-positive peaks. However, the number of missed genuine peptide peaks (false-negatives) is quite sensitive to the selected processing parameters (the values of T and N), and peak detection efficiency has not been fully optimized.2
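A rough sketch of this kind of two-step smoothing and thresholding follows. The window width, the run length, and the threshold rule (median plus a multiple of the median absolute deviation) are assumptions standing in for the published T and N, chosen only to illustrate the moving-average-then-persistence logic.

```python
import numpy as np

def binarize_mz_bin(intensities, window=5, n_consecutive=3, k=3.0):
    """Smooth one m/z bin across scan headers with a moving average, then flag
    as signal only runs of at least n_consecutive scans above a threshold
    derived from the bin's overall intensity distribution."""
    kernel = np.ones(window) / window
    smooth = np.convolve(intensities, kernel, mode="same")
    # Threshold T: median plus k * MAD of the smoothed bin (a stand-in for the
    # trimmed-mean/median rule described in the text).
    med = np.median(smooth)
    mad = np.median(np.abs(smooth - med))
    above = smooth > med + k * mad
    signal = np.zeros_like(above)
    run = 0
    for i, flag in enumerate(above):
        run = run + 1 if flag else 0
        if run >= n_consecutive:
            signal[i - n_consecutive + 1 : i + 1] = True
    return signal
```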
Wagner et al. (39) have made use of an iterative, nonparametric, local regression smoother (a robust loess smoother) to model the baseline in MALDI-MS datasets. Because distinct regions of m/z were different in nature, empirical selection of the size of the smoothing window for each region was necessary and ranged from 1% of the total number of m/z values for regions with small m/z to 70% for regions with larger m/z values. In contrast, Baggerly et al. found a single sinusoidal baseline noise component in their MALDI dataset, which they speculate was produced by use of an alternating current power source (28). The noise frequency was estimated by Fourier transform of a hand-selected data region and then eliminated by subtracting out a sinusoid of this frequency. Following this procedure, the residual baseline was modeled with a modified local minimum at each m/z value, and this modeled baseline was then removed.
Bylund et al. developed a unique "orthogonal background subtraction" approach for LC-MS baseline correction (25), wherein principal component analysis (PCA) is applied to time vectors (one vector per time point, over all m/z) from a region in the dataset expected to consist solely of background. The noise subspace was then characterized by taking the top principal components, and noise is removed by subtracting the components of the data that lie in this subspace. While such a model may prove useful, it is not clear whether it is appropriate for LC-MS analysis; PCA operates in a global way (across all time points in this case) and therefore does not allow for noise characteristics to change over time.
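The idea can be sketched as follows, assuming a region of the (time x m/z) matrix is known to contain only background; the variable names and the number of retained components are illustrative, not those of the original work.

```python
import numpy as np

def subtract_background_subspace(X, background_rows, n_components=3):
    """Estimate a noise subspace by PCA on time vectors from a background-only
    region of the (time x m/z) matrix X, then remove the projection of every
    time vector onto that subspace."""
    B = X[background_rows]                      # background-only time points
    B_centered = B - B.mean(axis=0)
    # Principal axes of the background region (rows of Vt).
    _, _, Vt = np.linalg.svd(B_centered, full_matrices=False)
    noise_basis = Vt[:n_components]             # top principal components
    projection = (X @ noise_basis.T) @ noise_basis
    return X - projection
```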
Concerns have been raised about the suitability of data filtering followed by parametric fitting (e.g. based on a peak model) (29), and thus one must carefully consider the ultimate goal of an LC-MS-profiling experiment when performing tasks in sequence. To our knowledge, no systematic comparison of the effects of various filtering/background subtraction techniques on MS data integrity has yet been reported, and it is advisable to investigate the consequences for downstream analysis (e.g. classification, peak detection). Because the filtering efforts published to date were performed on only one data dimension at a time, it will be interesting to see if filtering independently in both axes (time and m/z) or simultaneously is more beneficial.
MID-LEVEL PROCESSING: PEAK DETECTION AND QUANTIFICATION
Radulovic et al. used an iterative coarse-to-fine strategy to extract two-dimensional (in time and m/z) peaks from LC-MS data (18). Neighboring points in the data matrix deemed to be signal (rather than noise; see the previous section on filtering) were combined to form peaks at the coarsest level, and then iteratively through each of the more refined levels, with a bisection method used to avoid spurious peak mergers. Peaks were quantified by summing individual grouped feature intensities. On the other hand, Wang et al. detected LC-MS peaks based on coinciding local signal maxima in time and m/z; local maxima are defined as an increase in ion abundance greater than a prespecified threshold over a predefined range (10). Peaks were then quantified either by summing intensity over the component elution time or based on the maximum peak height. Unfortunately, the authors do not describe how the component elution time was determined. Similar techniques were used in Refs. 20 and 30.
Yasui et al. (31) defined peaks (in SELDI-MS data) as m/z elements exhibiting higher intensity than their N nearest neighbors, with N chosen empirically and with the added constraint that peaks have a higher intensity than the "broader" neighborhood as calculated by a local linear super-smoother method. In an effort to reduce dataset misalignment problems, m/z values within ±0.2% of these peaks were further selected as additional peaks. Although peak finding was not the emphasis of the study of Tibshirani et al. (22), these authors used a similar routine to scan for m/z peaks in SELDI/MALDI datasets exhibiting a higher intensity than a prespecified number of closest neighbors. In contrast, Randolf et al. made use of multiscale wavelet decomposition to detect peaks in MALDI-MS data (32), seeking to avoid ad hoc decisions pertaining to thresholds and filter parameters. The wavelet decomposition provides a breakdown of the signal according to scale (and location), and by taking the derivative at each scale these authors detected scale-specific peaks. In a histogram of peak locations extracted in this way from many samples, locations with high counts were considered as evidence of true peaks, although the authors did not offer a method for choosing an optimal scale.
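A bare-bones sketch of this neighborhood-maximum style of peak picking is shown below. The neighbor count, the broad window, and the use of a crude moving average in place of the super-smoother are all assumptions made only to illustrate the rule.

```python
import numpy as np

def neighborhood_peaks(intensity, n_neighbors=10, broad_window=101):
    """Flag m/z indices whose intensity is the maximum of their local window of
    n_neighbors points on each side and also exceeds a broad local average
    (a crude stand-in for the super-smoother used in the original work)."""
    peaks = []
    broad = np.convolve(intensity, np.ones(broad_window) / broad_window, mode="same")
    for i in range(len(intensity)):
        lo = max(0, i - n_neighbors)
        hi = min(len(intensity), i + n_neighbors + 1)
        window = intensity[lo:hi]
        if intensity[i] == window.max() and intensity[i] > broad[i]:
            peaks.append(i)
    return np.array(peaks)
```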
Instead of attempting to find peaks, Idborg et al. applied a curve resolution approach developed by the Chemometrics community for processing GC-MS profiles (see Ref. 33 for an overview of Chemometrics analysis techniques) to extract the major spectral components in LC-MS profiling of urine (34, 35), a notably simpler mixture than tissue or serum. In this manner, a data matrix, X, defining a single LC-MS experiment, with rows corresponding to time points and columns to m/z values, was resolved into a set of "pure" spectral profiles. These profiles were stored as column vectors of length m in the matrix S, with the complete set of profiles approximately spanning the entire space of all m/z spectra generated across the various time points. The relative amount of each spectral profile present at a given time point was then estimated in the "pure" concentration profiles, stored as the rows of the matrix C (one row per time point, one column per spectral profile). Unexplained variation was represented by the matrix E, such that X = CS^T + E. An iterative method was used to solve for C and S, starting from an initial set of "key" spectra and using constraints related to the non-negativity of the concentrations (i.e. related to the fact that one can only add in physical components, not remove them) to update the spectral profiles; a rough sketch of this style of iteration is given below. However, Idborg et al. do not describe how to select the number of spectral profiles, although various approaches to this problem, as well as other algorithms used for spectral resolution, are provided in Ref. 33. Because use of these techniques has been largely restricted to relatively simple mixtures, it is not clear how well these methods will scale to more complex proteomic samples.
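The sketch below illustrates alternating least squares with non-negativity imposed by clipping, under the factorization X = CS^T + E described above. The random initialization (rather than "key" spectra), the fixed number of components, and the fixed iteration count are simplifying assumptions; real curve-resolution implementations add many refinements.

```python
import numpy as np

def curve_resolution(X, n_components=5, n_iter=50, seed=0):
    """Approximate X (time x m/z) as C @ S.T + E with non-negative
    concentration profiles C (time x k) and spectral profiles S (m/z x k)."""
    rng = np.random.default_rng(seed)
    t, m = X.shape
    S = rng.random((m, n_components))           # initial spectra (random here)
    for _ in range(n_iter):
        C, _, _, _ = np.linalg.lstsq(S, X.T, rcond=None)   # solve S @ C.T ~ X.T
        C = np.clip(C.T, 0.0, None)                        # non-negative concentrations
        S, _, _, _ = np.linalg.lstsq(C, X, rcond=None)     # solve C @ S.T ~ X
        S = np.clip(S.T, 0.0, None)                        # non-negative spectra
    return C, S
```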
Overall, peak detection has generally been performed in a rather ad hoc manner, with little evaluation of the effectiveness of the various methods or parameter choices. The algorithms employed to date make no use of a priori or learned information with regard to peak shape, along either the time or m/z dimensions, and in some cases ion intensity values are only exploited very indirectly. Rather than retaining abundance information, peaks are frequently binarized (18, 31). Radulovic et al. do not motivate this decision, while Yasui et al. state that it helps to overcome noise in the signal. Such a step is lossy and is likely suboptimal for downstream analysis. Incorporating richer information would likely improve analytical performance, albeit at the cost of more computation. Machine-learning and statistical techniques are designed to account for random variation caused by noise, and their performance is likely degraded when they are applied to binarized MS data. A less-extreme approach, and one retaining more information, would be to apply filtering and/or baseline subtraction as well as intensity normalization rather than binarization. It has been suggested that at present LC-MS is still not generally a quantitative science (36). Peak detection and quantification, even if done optimally, does not guarantee linearity of peak signal relative to analyte concentration because of possible ion suppression effects, although compelling evidence of linearity of extracted LC-MS peak intensities, at least for spiked reference proteins, has been established using certain data-processing methods and technological platforms (10, 18).
DE-ISOTOPING, CHARGE DECONVOLUTION, AND PEAK MATCHING
Wang et al. and Anderle et al. applied a de-isotoping step (to assign isotope patterns) before peak quantification (10, 20). Though few details are provided, the algorithms were apparently based on cross-correlating the observed peak envelopes to reference isotopic tables, with the highest-scoring match identifying the most probable isotope shoulders. One can imagine that probabilistic modeling techniques, such as hidden Markov models (HMMs), may significantly improve upon this template-matching scheme. In contrast, Tibshirani et al. opted to smooth MALDI/SELDI data along the m/z axis as a simpler alternative to deconvolution (22). Extremely complex samples may prove to be less amenable to full de-isotoping.
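To make the template-matching idea concrete, the sketch below scores an observed peak envelope against candidate reference isotope-ratio templates by normalized correlation. The reference ratios here are invented placeholders, not a real averagine table, and a practical implementation would also consider charge-state spacing between isotope peaks.

```python
import numpy as np

def score_isotope_pattern(observed, template):
    """Normalized dot product between an observed peak envelope and a
    reference isotope-ratio template of the same length."""
    o = observed / np.linalg.norm(observed)
    t = template / np.linalg.norm(template)
    return float(o @ t)

def best_isotope_match(observed, templates):
    """Return the (label, score) of the best-matching reference template.
    `templates` maps a descriptive label to a ratio array."""
    scores = {label: score_isotope_pattern(observed, tpl)
              for label, tpl in templates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical usage: intensities of four consecutive isotope peaks and two
# made-up reference patterns for different nominal peptide masses.
observed = np.array([1.00, 0.62, 0.25, 0.08])
templates = {"~1100 Da": np.array([1.00, 0.60, 0.22, 0.06]),
             "~2200 Da": np.array([0.85, 1.00, 0.65, 0.30])}
print(best_isotope_match(observed, templates))
```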
Peak matching is another related topic relevant to quantitative proteomic comparisons. To measure reproducibility of peptide signal, experimental peaks must be matched across LC-MS datasets. Naive methods, based on simple proximity (in time or m/z space), are reported to be effective (10, 20, 39). For instance, Radulovic et al. used MS/MS-derived sequence identities to verify the correct matching of 200 putative peptides across multiple samples (18). However, given that MS/MS targets prominent peaks, this assessment is likely biased. Anderle et al. (20), on the other hand, found it necessary to remove ~2% of data points, presumed outliers thought to be attributable to mismatching. Incorporation of prior knowledge of peak shape, instrument m/z drift, and a more-probabilistic formulation might significantly improve the effectiveness of peak detection, quantification, and matching.
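A naive proximity-based matcher of the kind described above can be sketched as follows; the m/z and retention-time tolerances and the greedy one-to-one assignment are illustrative assumptions that would need to be tuned to the instrument and alignment quality.

```python
import numpy as np

def match_peaks(peaks_a, peaks_b, mz_tol=0.5, rt_tol=1.0):
    """Greedy nearest-neighbor matching of peaks between two runs.
    Each peak list is an (n, 2) array of (m/z, retention time)."""
    matches, used = [], set()
    for i, (mz, rt) in enumerate(peaks_a):
        d_mz = np.abs(peaks_b[:, 0] - mz)
        d_rt = np.abs(peaks_b[:, 1] - rt)
        candidates = np.where((d_mz <= mz_tol) & (d_rt <= rt_tol))[0]
        candidates = [j for j in candidates if j not in used]
        if candidates:
            # Pick the candidate closest in a simple tolerance-scaled distance.
            dist = [(d_mz[j] / mz_tol) ** 2 + (d_rt[j] / rt_tol) ** 2 for j in candidates]
            j = candidates[int(np.argmin(dist))]
            used.add(j)
            matches.append((i, j))
    return matches
```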
DATASET ALIGNMENT AND COMPARISON
Alignment algorithms typically involve either: i) maximizing some objective function over a parametric set of transformations (usually linear); or ii) nonparametric alignment, by way of dynamic programming (DP); or iii) some combination of these methods (e.g. piecewise linear transformations). Some of the main differences among algorithms are: i) whether alignment is performed before or after peak detection (i.e. using either continuous signal or peaks); ii) whether or not signal amplitude is used; and iii) whether or not changes in scale are corrected for (i.e. allowing for interplay between these two types of corrections). Most algorithms used to date require a template, specified a priori, to which all time series are pre-aligned. Because suboptimal template choice could result in poor alignments, it may be desirable to avoid this.
Nielsen et al.'s correlated optimized warping (COW) algorithm (40) and modifications of it were used by Bylund et al. (41) to align chromatographic data by dividing the time axis into segments and then performing a linear warp within each segment to optimize overlap while constraining segment boundaries to maintain agreement. An objective function defined the optimal set of transformations based on the sum of correlation coefficients or covariances between data segments in pairs of samples, and it was maximized by way of DP. COW can be applied to both scalar and vector time series by defining the correlation or covariance appropriately. Use of more than one data vector component (e.g. multiple m/z bins rather than the TIC) produced more stable alignments with respect to variation of free parameters such as the maximum allowable warp (40). Nielsen et al. established visually, using artificial chromatograms, that COW is robust to varying peak numbers, heights, and widths, and is superior to a global linear warp (40). An interesting evaluation is provided in Ref. 41, where PCA was performed on the base peak chromatograms (BPC). The amount of variance explained by the top two principal components was 70% before alignment and 98% afterward. Similarly, explained variance went from 60 to 97% with a seven-component parallel factor analysis (PARAFAC, a generalization of PCA to three-way data), indicating a reduction in the major sources of sample variation.
The approach of Radulovic et al. (18) to alignment in time used binarized data and one alignment per data block. The data matrix was divided into five equal m/z partitions and six equal time partitions, creating 30 blocks. One experimental dataset was warped to match the other by applying a linear transformation with offset to each block and constraining neighboring blocks to have similar transformations. Monte Carlo maximization was used to find the optimal set of linear transformations as defined by a "data overlap" objective function. Finally, a post hoc "wobble" of peaks in time was applied to compensate for residual peak drift. Empirical assessment of feature-wise overlap of the datasets implied a considerably improved alignment (18), although the authors did not use an objective measure because their evaluation was based on the function being optimized. Another drawback of the proposed method is that it works only with binarized data, and is therefore sensitive to the choice of binarization threshold, and does not exploit signal amplitude, which ultimately may be more informative.
Randolf et al. used coarse scale-specific peaks, extracted by multiscale wavelet decomposition, to align MALDI data along the m/z axis (32). Dominant peaks (above some threshold) were used to compute a single optimal shift for all peaks, and thus the alignment is not very flexible as it does not allow for even a simple linear stretch or compression. It is also not clear whether or not features detected at different scales should be aligned differently or could leverage one another in alignment. In contrast, Idborg et al. do not explicitly align their datasets, but compare detected components derived by curve resolution, using cross-correlation, and shifting individual components to account for constant time shifts between experiments. Components correlating above some threshold are said to be identical (34, 35).
We have recently developed the CPM, an HMM-based model to do multiple alignments of time series (for continuous-valued output, such as the abundances in an MS experiment) using LC-MS TIC traces (19). In the context of the CPM, one can think of the HMM (42, 43) as containing a series of hidden states, each of which represents some underlying "true" or canonical time, to which each scan header in the TIC is ultimately assigned. The alignment in time is dictated by which scan header gets (probabilistically) mapped to which hidden state. The states are called "hidden" because until the algorithm is run on the data (i.e. until the model is trained), we do not know which scan headers map to which states. In addition to the hidden time states mentioned, hidden states are also augmented by "scale" states, which allow scaling of the TIC amplitudes locally in time. Use of the model after training provides a mapping to both time and scale states, thereby performing alignment and normalization concurrently. Training, whereby the best parameters for the HMM are found, is performed by maximum likelihood (i.e. the objective function is defined to be the likelihood of the data under the HMM probabilistic model framework) by way of Expectation-Maximization (44). Both training and later use of the model (i.e. deduction of which scan headers map to which hidden states) are performed efficiently in HMMs by use of DP. The CPM has the advantages that no template is required, all experimental TICs are aligned simultaneously (leveraging the information across experiments), normalization local in time is concurrent with alignment, and the model is probabilistically formulated. As pointed out in Ref. 19, the algorithm can be extended to use vector time series rather than TICs and also to allow alignment of nonreplicate data (see end of this section).
The classical algorithm for aligning time series is dynamic time warping (DTW), a DP-based approach that originated in the speech recognition community as a robust distance metric between pairs of time series (45). DTW aligns one time series to a specified reference series. It is closely related to COW, except that instead of moving only nodes (time segment boundaries) around, every data point can be moved; thus, transformations are not restricted to piecewise linear. It is likely that Bylund et al. and Nielsen et al. avoided use of this less-restrictive model in order to reduce computational costs and avoid overfitting (40, 41). In contrast, Wang et al. used DTW on LC-MS data, but constrained the analysis to no more than 200 m/z bins so as to make it computationally practicable (10).
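For reference, a minimal DTW implementation on scalar traces (e.g. two TICs) is sketched below. It uses the standard DP recurrence with an absolute-difference local cost and no slope or band constraints; practical implementations usually add such constraints for speed and to avoid pathological warps.

```python
import numpy as np

def dtw(x, y):
    """Classical dynamic time warping between two 1-D traces.
    Returns the alignment cost and the warping path as (i, j) index pairs."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Trace back the optimal warping path from the end of both traces.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]
```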
Hierarchical clustering was used in a novel way by Tibshirani et al. (22) to align MALDI/SELDI spectra along (log) m/z space after peak detection. Input to the hierarchical clustering algorithm is a list of candidate extracted peaks as well as the Euclidean distance between peaks in log m/z space. After clustering was completed, the dendrogram was cut at an empirically determined level, with the mean m/z of each cluster defining an individual peak.
Sauve et al. (27) give a brief description of an alignment in m/z space for SELDI data, in which a tree, built for example by hierarchical clustering of samples, is used to guide the progressive warping of related experiments together. Although few details were reported, presumably the algorithm starts by aligning the two closest samples, forming a single pseudo-sample, to which the next closest sample is aligned, and so on. Such a method removes the need for a prespecified template, but is likely adversely affected by the fact that the distances between spectra are measured before alignment, and hence are largely meaningless. It is also not clear that starting with the two closest spectra is any less arbitrary or more effective than other approaches.
As one moves from a local alignment perspective to a global one, a bias-variance trade-off comes into effect. That is, with more data constraining the transformation, more stable alignments result. The optimal trade-off is determined by the informativeness of the LC-MS signal used and the type of misalignments present. In terms of what data are best for use in alignment, there are a number of choices. To date, these have largely been limited to the TIC, the BPC, and individual (or sets of) extracted m/z ion chromatograms. In practice, one could select ion chromatograms based on the highest ion count or the highest sum of second derivatives along the m/z axis, or instead use some dynamic binning of m/z in which the ion signal is evenly distributed, so as to extract a smaller number of informative m/z bins rather than simply choosing evenly spaced ones. Alternatively, one could apply a dimensionality reduction technique, such as PCA, to the m/z space, with the aim of using a smaller number of features that are still informative (in this case, the features would be "eigen-m/z," that is, pseudo-m/z bins made up of linear combinations of the original m/z bins).
Whether one uses piecewise linear transformations in small regions, such as in Refs. 18 and 41, or more flexible alignment schemes, such as in Refs. 10 and 19, may ultimately have little importance, so long as the overall transformation is not restricted to a global linear warp. Incorporating local scaling simultaneously with alignment may also prove to be advantageous, as reported in Ref. 19. Another issue to consider is whether to align m/z bins individually, together, or somewhere in between (i.e. in a smoothly varying way). The issue of whether one should do alignments before or after peak detection has not been clearly answered. Assuming it were possible to correct the LC time axis before peak detection, one could better leverage the information encoded across aligned datasets to achieve more reliable and sensitive peak detection. Historically, with LC-MS data, researchers have concentrated on correcting the time axis, ignoring the m/z axis. However, corrections are commonly performed along the m/z axis in SELDI/MALDI experiments, suggesting it may be desirable to do so with LC-MS data as well (though this may be instrument-dependent). Alignment algorithms are typically formulated to work on datasets that are very similar to each other. However, if one knows a priori that the samples may differ significantly in a few (unknown) locations (by an unknown amount), for example in comparisons of cancer and noncancer serum, then this should be incorporated into the model, as suggested in Ref. 19. This should improve the overall performance of alignment algorithms and may be a fruitful direction to pursue.
DATA NORMALIZATION
Global normalization refers to cases where all features are simultaneously used to determine a single normalization factor between two experiments, while local normalization refers to cases where a subset of features is used at a time (different subsets for different parts of the data). Locality can be defined by, say, similarity in m/z values, time (scan headers), or abundance (peak intensity) levels. For example, in an abundance-dependent, local normalization, peaks of similar abundance within the same MS experiment would be scaled in a similar way, while peaks of different abundance are scaled in a different way. If the mean of all features is made to agree across all experiments, the procedure is referred to as a global mean normalization, as used, for example, by Sauve et al. (27). By plotting the point-wise log ratio of matched features between datasets versus either m/z or abundance, Sauve et al. visually established that no trend existed along either the m/z or intensity axes (27) and hence that the normalization method need not take these factors into account. While several groups have opted for global abundance normalization, in the case of LC-MS data it may be necessary to normalize locally in time (19), because chromatography can produce irregular fluctuations in signal.
Many of the normalization techniques applicable to LC-MS data have also been applied to the results of microarray experiments (46). With gene expression profiles, the genes used for normalization have sometimes been restricted to so-called "housekeeping" genes presumed to remain constant across the experimental conditions. An analogous concept was applied to LC-MS data by Wang et al. (10), whereby a constant intensity ratio between pairs of experiments was computed based on reference peaks. These authors noted, however, that the use of all detected peaks provided similar results. Baggerly et al. likewise considered using "housekeeping" peaks to normalize, but stated that they were unable to find stable peaks across experiments (28).
Anderle et al. (20) opted to normalize multiple LC-MS samples to a single reference dataset, using median global normalization over pairs of matched peaks, while Wagner et al. and Baggerly et al. (28, 39) chose global mean normalization for processing MALDI data. Radulovic et al. also used global normalization, with each dataset multiplied by a factor K such that the total number of intensity values exceeding some predefined threshold was set to equal the somewhat arbitrarily chosen value of 100 (18). In contrast, Tibshirani et al. (22) normalized their MALDI/SELDI spectra by mapping the 10th and 90th percentiles to 0 and 1, respectively (with values linearly interpolated in between). In our own recent study (19), TIC traces were normalized collectively in conjunction with dataset alignment, leveraging information contained across all experiments simultaneously. Normalization was done locally at each scan header (but globally across m/z), with the constraint that neighboring scan headers have similar scaling factors.
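Two of these strategies are simple enough to sketch directly: a global median normalization over matched peak intensities and the percentile mapping described for MALDI/SELDI spectra. The function and variable names are illustrative, and the percentile choices simply mirror the 10th/90th convention mentioned above.

```python
import numpy as np

def global_median_normalize(sample, reference):
    """Scale one run so that the median ratio of matched peak intensities
    to a reference run equals 1 (a single, abundance-independent factor)."""
    ratios = reference / sample
    return sample * np.median(ratios)

def percentile_normalize(spectrum, lo_pct=10, hi_pct=90):
    """Linearly map the 10th and 90th intensity percentiles to 0 and 1."""
    lo, hi = np.percentile(spectrum, [lo_pct, hi_pct])
    return (spectrum - lo) / (hi - lo)
```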
Normalization is often evaluated by calculating the coefficient of variation (CV) between peaks across different experiments after normalization. While reasonable CVs (e.g. <30%) are commonly reported, a comparison to CVs from prenormalized data is often not provided. Moreover, because no systematic comparison of these various normalization techniques has been reported, it is difficult to assess their relative merits. While Sauve et al. reported no abundance-dependent artifacts with SELDI data (27), it will be interesting to see if this holds more generally across data sets, and also for LC-MS data.
DATA TRANSFORMATIONS AND ERROR MODELS
To examine patterns of variation and to deduce the variation attributable to sample preparation, Anderle et al. conducted a well-controlled LC-MS study, borrowing established parametric models of heteroscedasticity (i.e. unequal variance, in this case across peaks with different abundance levels) from the microarray community. Human serum was fractionated into 40 samples (after removal of the most-abundant proteins and following tryptic digestion), with half of these analyzed directly by LC-MS and the other half recombined and again resplit before analysis. The variation in the amplitude of matched peak intensities formed the basis of their study. A variance model, σ² = αµ² + βµ, was fit to the observed intensity values (with true mean abundance µ and constants α and β) and observed visually to be appropriate. The quadratic term dominates at higher intensity levels, resulting in a constant CV. Fitting the pooled and individual datasets to this model, it was shown that the pooled samples exhibited a CV of 11%, while the individually processed samples had a CV of 20%, suggesting that much of the variation is attributable to sample preparation. Anderle et al. also report that application of this error model to two randomly divided subsets of individually processed data, in conjunction with t tests on each of the matched peaks between the groups, resulted in far fewer false-positives as compared with no data transformation. The number of false-negatives was not assessed, however, and could be high. Moreover, it would be beneficial to evaluate this model with other LC-MS datasets to ensure that the patterns of variability are generally valid.
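A least-squares sketch of fitting this variance model to matched replicate peaks is given below, with the per-peak sample mean standing in for the true abundance µ. This is only a schematic fit; published analyses typically use more robust, iterative estimation.

```python
import numpy as np

def fit_variance_model(replicate_intensities):
    """Fit sigma^2 = alpha * mu^2 + beta * mu to matched peaks measured in
    replicate runs. `replicate_intensities` is an (n_peaks, n_replicates) array."""
    mu = replicate_intensities.mean(axis=1)           # estimate of true abundance
    var = replicate_intensities.var(axis=1, ddof=1)   # per-peak sample variance
    A = np.column_stack([mu ** 2, mu])                # design matrix for [alpha, beta]
    (alpha, beta), *_ = np.linalg.lstsq(A, var, rcond=None)
    return alpha, beta

def cv_at(mu, alpha, beta):
    """Predicted coefficient of variation at abundance mu; approaches
    sqrt(alpha) at high abundance, where the quadratic term dominates."""
    return np.sqrt(alpha * mu ** 2 + beta * mu) / mu
```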
Satten et al. (48), on the other hand, take the view that "all identification should be made using only features that well exceed a noise threshold, to ensure that the resulting classification algorithm has scientific validity." While these concerns are warranted, the emphasis appears to be in the wrong place. Classification algorithms, when used properly, for example in the context of Bayesian methods, or using cross-validation (with feature selection inside the cross-validation loop (49)), will not assign any importance to random structure in the data. Conversely, the approach offered by Satten et al. (48) can cause loss of small, but significant signals, reducing the efficacy of profiling. They also state that "the goal of analysis [is] to show that standardization and de-noising algorithms retain sufficient information to allow categorization." We counter that data processing should improve information availability. Otherwise, faced with more challenging tasks, unnecessarily conservative processing steps may obscure the answers we seek.
Stein et al. mention that event-counting detectors, such as electron multipliers, generate signals with "average random deviation" proportional to the square root of signal intensity (23). The proportionality constant was determined empirically (for GC-MS) and was invariant to m/z across a variety of MS platforms except at lower signal strengths where background noise becomes dominant. Assuming "average random deviation" refers to mean standard deviation, these results match closely to those of Anderle et al. (20).
We would conclude, largely on the basis of the Anderle study (20), that log transformation of MS data is appropriate, and that the error model they presented can be used to properly stabilize the variance of low-intensity peaks, or to obtain robust estimates of variance. It would also be interesting to find out whether these transformations could be applied directly to a raw data matrix rather than to extracted peak abundances.
HIGH-LEVEL PROCESSING
These tasks are listed from most to least tractable, although the latter two have not been fully addressed in the literature. Building classification algorithms to tease apart class-dependent data is often feasible. Determining the complete set of features (i.e. peaks, peptides, or m/z scan header tuples) responsible for pattern differences, however, in a statistically sound way, can be difficult. Frequently in machine-learning methods, a feature selection step is used to automatically select features thought to lead to a near-optimal classifier, sometimes independently from learning the classification model (filtering), sometimes interactively with learning (wrapper approaches), and sometimes by way of the classification algorithm itself (e.g. decision trees). Feature selection is an open, difficult research question. Even if a theoretically optimal set of features for a particular learning algorithm (in the sense of providing the best generalization) can be found, not all statistically discriminative features will necessarily be contained in this set (or only such features), because the optimal set of features is heavily dependent on the mathematical framework of the algorithm at hand. For example, the optimal set of features for a linear classifier would not contain important pairs of features that act only together to provide discriminative power (e.g. imagine two proteins for which knowing the expression of only one provides no information, but knowing both gives perfect discrimination; a toy illustration follows below). These artifacts, in which features optimal for particular classifiers do not necessarily correlate best with the classes, occur with virtually every classifier, offering an explanation to biologists seeking to know why different analyses of the same MS data often shed light on different features, without there being an error or problem in the analysis, as has been suggested (9).
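The toy illustration below uses synthetic data only: two binary "protein" features that are individually uncorrelated with the class label, yet determine it exactly when taken together, so any feature ranking performed one feature at a time would discard both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
f1 = rng.integers(0, 2, n)           # feature 1: protein A high/low (synthetic)
f2 = rng.integers(0, 2, n)           # feature 2: protein B high/low (synthetic)
label = f1 ^ f2                      # class depends only on the combination

# Individually, each feature carries essentially no class information...
print(np.corrcoef(f1, label)[0, 1], np.corrcoef(f2, label)[0, 1])  # both near 0
# ...but jointly the pair predicts the class exactly.
print(np.all((f1 ^ f2) == label))    # True
```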
STATISTICAL ISSUES
Typically, permutation tests are performed one feature at a time, resulting in thousands of significance tests. In such cases, although the p value threshold controls the false-positive rate (FPR) of each individual test, the number of false-positives over all tests will be extremely large at commonly used scientific p values (e.g. 0.01), and therefore some sort of multiple-hypothesis-testing correction needs to be applied. Many methods have been developed for this, including the conservative Bonferroni factor, which controls the family-wise error rate over all tests. However, as pointed out by Storey and Tibshirani (17), in the context of biological experiments it is rather the false-discovery rate (FDR) one seeks to control. Whereas the FPR predicts how many truly null features are expected to be called significant, the FDR predicts how many of the features called significant are in fact likely not to be. From a practical standpoint, the FDR tells us how many "hits" (e.g. peaks putatively different between two classes) are likely false.
Storey and Tibshirani recently developed the q value, which is similar to the p value, except that it controls for the FDR (rather than FPR) and automatically accounts for multiple testing (17). If one chooses a q value of 0.05, then one can expect that 5% of all features deemed significant will in fact not be. The q value is therefore a more appropriate statistical measure for biomarker discovery in the context of complex, multivariate datasets.
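The q-value procedure additionally estimates the proportion of truly null features; as a simpler stand-in in the same spirit, the sketch below applies the Benjamini-Hochberg step-up adjustment to a vector of per-feature p values (such as those obtained from permutation tests). It is not the q-value algorithm itself, only a minimal FDR-controlling adjustment.

```python
import numpy as np

def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p values (step-up procedure).
    Features with adjusted value <= q are reported at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / (np.arange(m) + 1)
    # Enforce monotonicity, working back from the largest p value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

# e.g. report features at an FDR of 5%:
# significant = np.where(bh_adjust(pvals) <= 0.05)[0]
```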
Permutation tests are sometimes used to test the validity of a classification algorithm (28, 39); the classifier is retrained using randomly permuted labels (with the number of samples in each class maintained) and its predictive power assessed. Such an analysis provides similar information to that obtained from cross-validation, but in a manner better suited to reporting the p values typically seen in the biomedical literature (as opposed to the machine-learning literature, which does not commonly use p values).
Whether seeking discriminative features or aiming to build a classifier, the type of features must be decided upon (e.g. intensity values in time, m/z space, or peak abundance). If not using peaks, then feature correlation, due for example to isotope shoulders, system imperfections blurring m/z values, and the fact that peptides do not elute instantaneously, should be carefully considered. Thus it might be preferable to use quantified peak abundances, combining ions belonging to a single peptide, because these have a clear, intuitive biological meaning. In a discussion concerning the validity of classification using m/z values in a set of prostate cancer proteomic studies with differing results, it was suggested that only features with meaning, such as peaks, should be used in biomarker discovery (36). Tibshirani et al. advocate the same approach (22). However, it may be wiser to first perform classification and then look for peak evidence post hoc, because imperfect peak detection can cause important information to be missed, and because the use of peaks over continuous signal does not in fact afford any additional statistical rigor. Assuming the proper methodology is used to validate the classification algorithm, for example by cross-validation, spurious signal should not be identified as a point of interest, regardless of the features used.
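As a sketch of the "feature selection inside the cross-validation loop" point made above (and in the earlier discussion of Satten et al.), the toy procedure below pairs a t-statistic filter with a nearest-centroid classifier. Both choices are arbitrary stand-ins; the essential point is only that feature selection is redone on the training portion of every fold.

```python
import numpy as np

def t_filter(X, y, k):
    """Indices of the k features with the largest absolute two-sample t statistic."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)
    return np.argsort(t)[-k:]

def nearest_centroid_cv_error(X, y, k=15, n_folds=5, seed=0):
    """Cross-validated error with feature selection repeated inside every fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = 0
    for f in folds:
        train = np.setdiff1d(idx, f)
        feats = t_filter(X[train], y[train], k)       # selection on training data only
        c0 = X[train][y[train] == 0][:, feats].mean(axis=0)
        c1 = X[train][y[train] == 1][:, feats].mean(axis=0)
        d0 = np.linalg.norm(X[f][:, feats] - c0, axis=1)
        d1 = np.linalg.norm(X[f][:, feats] - c1, axis=1)
        errors += np.sum((d1 < d0).astype(int) != y[f])
    return errors / len(y)
```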
MACHINE LEARNING AS APPLIED TO MS
The Petricoin study has sparked an ongoing debate because the putative biomarkers discovered were not conclusively proven to be proteins of interest, with some arguing for the need for identification prior to use as a diagnostic and others suggesting that discriminating patterns alone are sufficient (11, 36, 51). The ability to obtain near-perfect classification has since been widely reported in the MS literature (8, 28, 48, 52, 53), but reproducibility of results has been limited. There is speculation that improper experimental methodology is to blame, and that the patterns observed are not in fact related to disease state, but instead to extraneous factors such as sample collection, preparation, or storage (9, 11, 22, 36, 53). Thus, a major immediate goal is not so much to develop new classification and biomarker discovery algorithms, but instead to ensure that the datasets in use are in fact representative of the biological problems one seeks to address (an issue beyond the scope of this review).
Here, we briefly report on the major types of methods used to date. We note, however, that it is difficult to empirically judge these techniques as different sample sizes, error estimates (e.g. bootstrap, n-fold cross-validation with varying n, etc.), and MS platforms were used without recourse to a reference "gold standard" dataset, and there is a paucity of follow-up biological validation experiments.
Yasui et al. used boosted single-feature linear regression with an early stopping criterion based on sensitivity and specificity in their SELDI-MS training set (31). From an initial pool of 12,000 processed m/z values, the final classifier used 25 features (binarized m/z intensities).
Qu et al. applied boosted decision stumps on SELDI data to obtain near-perfect classification results, with feature selection (m/z peaks) based on receiver-operator characteristic (ROC) performance, although the type of classifier used to generate the ROC curve is not stated (52).
Satten et al. used a random forest (RF) algorithm to classify MALDI m/z values, while Lilien et al. used PCA followed by linear discriminant analysis (LDA) on SELDI data, computing feature importance by back-projecting the PCA hyperplane onto the m/z space (3, 48).
Wagner et al. used each of: k-nearest-neighbor (k = 6, Mahalanobis distance), support vector machine (SVM) with linear kernel, LDA, and quadratic discriminant analysis (QDA) to classify MALDI data (39), selecting the top 3–15 peaks as features with an F statistic. They correctly state that their evaluation is flawed in that some of the methods use the entire dataset for training (e.g. to calculate the covariance matrix for the Mahalanobis distance), even though this was not necessary.
Li et al. selected 10 m/z values as features in three SELDI datasets trying both a t-test filter and a genetic algorithm. These were used in conjunction with an SVM classifier, where the choice of kernel was reported to have little effect (settling on a linear one) (54). Overall, the genetic algorithm performed better than the t-test approach, suggesting that higher-order interactions between features provide discriminative power, or that the t-test is a less-ideal statistic in this scenario.
Wu et al. (47) compared the performance of LDA, QDA, k-nearest neighbor (k = 1–3, Euclidean metric), bagging and boosting classification trees, SVM (kernel not specified), and RF on MALDI data, using both a t-test rank and the by-product of the RF algorithm for m/z feature selection (15 and 25 features). Overall, no substantive differences in performance were reported, with QDA marginally best, although different error estimators (cross-validation or bootstrap) were used for different classifiers, complicating interpretation.
Baggerly et al. undertook an unusual feature selection approach (28). After computation of over 60,000 t-tests on individual MALDI m/z values, many of the most discriminative m/z values appeared as nonpeaks (i.e. on the slope of a peak). Dissatisfied with this, they reduced dimensionality by binning m/z (with bins growing smoothly in size with increasing m/z), retaining only those bins (1%) with detected signal in some minimal number of samples. They note that this binning partially corrects for the correlation between neighboring m/z values.
Tibshirani et al. set out to develop a classification algorithm, peak probability contrasts (PPC), for MALDI/SELDI data that provides a measure of discriminatory power for all features using simple peak information (22). First, split-points for each peak were found that most discriminate between the two classes. Next, binary features were formed from the split-points and then nearest shrunken centroid (NSC) was applied, with importance assigned to each feature by: i) defining a test statistic based on the difference in the shrunken class proportions at each peak site; ii) using permutation tests to estimate the FDR for different thresholds on the test statistic; and iii) choosing a test-statistic threshold that provided a low FDR (e.g. 5%). Using an artificial spike-in experiment, a nonshrunken centroid method reportedly identified spiked features better than the shrunken model, but at the expense of more false-positives and worse classification. Lastly, they compared their PPC algorithm to: i) LDA, using the same 15 features as PPC; ii) SVM (kernel not mentioned), using the same 15 features from PPC and also all m/z values; iii) binary peak probability features with the Lasso3; iv) PPC, but using only peak presence/absence as a feature (not using split-points); and v) six wavelet coefficients (spanning all m/z values) used as features with NSC. The all-feature SVM performed slightly better than standard PPC, which performed similarly to the Lasso, while LDA, the 15-feature SVM, the wavelet method, and presence/absence PPC all performed worse. Note that although the PPC classifier operates on binary data, it uses intensity information by way of split-points (but not fully, because split-point distance was not used). Better results might be obtained if this information were incorporated.
Classification in the context of LC-MS data has been more limited, with only a handful of articles (18, 34, 35). After resolving and aligning LC-MS data matrix components (see above), Idborg et al. used PCA, unfolded-PCA, partial least squares (PLS), and PARAFAC to tease apart the proteomic differences in the urine of normal mice and drug-administered mice (34). The dimensionality of the classification problem was kept reasonably small and tractable by using the abundance of each spectral component at each retention time rather than the full data matrix. In contrast, Radulovic et al. performed classification by simple plotting and visualization (18). Evaluation of this approach was limited to only two training cases per class and a single test sample (which can always be perfectly classified by choosing the appropriate point on an ROC curve). A comment was made that if two classes do not sufficiently cluster in the input space, then virtually all classification algorithms will fail; however, nonlinear methods can potentially overcome this problem. Moreover, a key motivation for using kernel methods (e.g. SVMs) is that even if two classes are not linearly separable in the input space, the data vectors can be projected into a higher-dimensional space in which they may become separable. Thus, inability to perform classification in the original input space (or a subset of it) should not stop one from further exploration of classification algorithms.
It is useful to ask how differences in the data generated by chromatographic versus nonchromatographic MS affect classification and, in particular, if the preprocessing methods reported for LC-MS data are good enough to allow comparable classification results to those reported for MALDI/SELDI (keeping in mind the fact that different peptides are often detected by these platforms). The ability to perform effective time alignments for LC-MS is far more crucial than the ability to align MALDI/SELDI m/z values, because the latter problem is less severe. If peak detection is not used prior to LC-MS classification, then correlation across features needs to be taken into account, especially in the time domain where peaks are almost never instantaneous. Aside from these issues, development and use of classification algorithms need not differ between LC-MS and MALDI/SELDI.
An area not studied in any great detail is the use of less-greedy feature selection. In the context of MS, most of the algorithms cited above rank features on an individual basis, using for example a t-test, and then greedily take the top n features. This completely ignores interaction between features. While genetic algorithms are less greedy, many other heuristics can be used to achieve a similar (or possibly better) end and may be worth exploring.
EVALUATING CURRENT METHODS
FUTURE PROSPECTS
One can imagine building a classifier directly on top of the CPM, in a fairly straightforward manner for SELDI/MALDI data. For example, after extending the CPM to non-replicate data, a latent trace could be obtained for each class. The simplest classifier would involve computing the likelihood of a sample under each of the class-specific latent traces and then assigning the class label of the trace producing the highest likelihood. Exploring differences between the latent traces could also be used for biomarker discovery, and prior information about expected differences (e.g. sparsity of the differences) could be incorporated. A model of this type would provide a unified framework in which preprocessing, classification, and biomarker discovery are systematically tackled, possibly providing better results than sequential, ad hoc approaches. We anticipate that existing and emerging statistical and computational techniques, side-by-side with rigorous and systematic evaluation, will help to unleash the full biomedical potential of proteomic profiling.
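The classification rule itself can be sketched independently of the CPM: given already aligned and normalized traces and a latent trace per class, score each sample under a simple Gaussian noise model and pick the higher-likelihood class. This is not the CPM, only an illustration of the proposed decision rule; the latent traces and noise level here are placeholders.

```python
import numpy as np

def log_likelihood(trace, latent, sigma):
    """Log likelihood of an aligned trace under an independent Gaussian model
    centered on a class-specific latent trace."""
    resid = trace - latent
    return float(-0.5 * np.sum((resid / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2)))

def classify(trace, latent_by_class, sigma=1.0):
    """Assign the class whose latent trace gives the highest likelihood."""
    scores = {c: log_likelihood(trace, mu, sigma) for c, mu in latent_by_class.items()}
    return max(scores, key=scores.get)
```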
FOOTNOTES
Published, MCP Papers in Press, March 1, 2005, DOI 10.1074/mcp.R500005-MCP200
1 The abbreviations used are: CPM, continuous profile model; GC, gas-phase chromatography; PCA, principal component analysis; TIC, total ion current; DP, dynamic programming; COW, correlated optimized warping; BPC, base peak chromatogram; PARAFAC, parallel factor analysis; HMM, hidden Markov model; DTW, dynamic time warping; CV, coefficient of variation; FPR, false-positive rate; FDR, false-discovery rate; ROC, receiver-operator characteristic; RF, random forest; LDA, linear discriminant analysis; SVM, support vector machine; QDA, quadratic discriminant analysis; PPC, peak probability contrasts; NSC, nearest shrunken centroid; PLS, partial least squares.
2 A. Emili, unpublished observations.
3 A linear regression model with L1 norm penalty on the regression coefficients.
|| To whom correspondence should be addressed: CH Best Institute, 112 College St., Toronto, Ontario M5G 1L6, Canada. Tel.: 416-946-7281; Fax: 416-978-8528; E-mail: andrew.emili{at}utoronto.ca
REFERENCES