From the Laboratory of Neuro-oncology, Department of Neurology, Dr Molewaterplein 40, 3015 GD, and
Department of Urology, P.O. Box 1738, 3000 DR, Erasmus MC, Rotterdam, The Netherlands, Departments of
Neurology and ** Clinical Chemistry, The Netherlands Cancer Institute, Plesmanlaan 121, 1066CX, Amsterdam, The Netherlands, ¶ Department of Neurology, University of Innsbruck, Anichstrasse 35, 6020, Innsbruck, Austria, || Chordiant, De Lairessestraat 150, 1075 HL, Amsterdam, The Netherlands,
Department of Neurology, Leiden University Medical Centre, P.O. Box 9600, 2300 RC, Leiden, The Netherlands, and ¶¶ Department of Neurology, University Medical Center Nijmegen, P.O. Box 9101, 6500 HB, Nijmegen, The Netherlands
ABSTRACT |
---|
One of the tumors most frequently associated with LM is breast cancer. During the course of the disease, 5% of patients with metastatic breast cancer will develop symptoms caused by LM. The response of this debilitating complication to therapy depends upon early treatment. However, diagnosis of LM remains challenging because 25% of samples tested are false negative at the first cytological examination of the CSF, probably because of sampling error (1).
Protein expression profiling of body fluids from patients with cancer has recently become a valuable tool for obtaining information on the state of protein circuits inside tumor cells and outside the cells at the host-tumor interface (2, 3). In serum and CSF, low molecular weight proteins and peptides that are related to this altered microenvironmental "cancerous" state can be detected.
We studied the differential tryptic peptide profiles in the CSF from patients with breast cancer with and without LM and in CSF from control subjects. Studying CSF has several advantages over studying serum. First, tumor cells in LM patients are located in the CSF and in the leptomeninges that are surrounded by CSF. Before their transport into serum, tumor-related proteins will therefore first be shed into the CSF. Second, the normal protein concentration of CSF is 100- to 400-fold lower than in serum (4). This results in a significant over-representation of LM-related proteins in CSF compared with serum. The identification of protein profiles specific for LM may be helpful in diagnosing patients with clinical suspicion of LM but negative cytology. In addition, such proteins may reveal cellular mechanisms relevant to the biology of LM.
With the advent of mass spectrometry into the field of clinical proteomics, the comparison of large numbers of proteins in complex biological samples such as serum and CSF has become feasible (3, 5). Until now, the most commonly used instrument was the SELDI-TOF MS. Using SELDI-TOF MS analysis of various body fluids, discriminatory protein expression profiles have been identified in various diseases (3, 6). However, SELDI-TOF MS does not allow a direct identification of the discriminatory proteins and suffers from low reproducibility and accuracy (7–10). To improve the reproducibility and accuracy and find better ways to identify relevant discriminatory proteins (11), we first digested our samples with trypsin and analyzed the resulting peptide mixtures by MALDI-TOF MS (12). The reproducibility of this type of analysis has been described elsewhere (13).
We analyzed CSF samples from 106 patients with active breast cancer, 54 of whom had LM, and CSF from 57 control subjects. Tryptic peptide mixtures were measured by MALDI-TOF MS and analyzed using a newly designed bioinformatics tool. We could identify unique peptide patterns that discriminated the LM patients from the other patients with breast cancer and from control subjects.
EXPERIMENTAL PROCEDURES |
---|
Sample Preparation and Measurement of Samples
All samples were blinded and analyzed in random order. From each sample, 20 µl of CSF was put into a 96-well plate, and 20 µl of 0.2% Rapigest (Waters, Milford, MA) in 50 mM ammonium bicarbonate buffer was added to each well. The samples were incubated for 2 min at 37 °C. Then 4 µl of 0.1 µg/µl gold grade trypsin (Promega, Madison, WI) in 3 mM Tris-HCl was added to each well, and the 96-well plates were incubated at 37 °C. After 2 h of incubation, 2 µl of 500 mM HCl was added to obtain a final concentration of 30–50 mM HCl, pH < 2. The 96-well plates were then incubated again for 45 min. A 96-well zip C18 microtiter plate (Millipore Corporation, Bedford, MA) was prewetted and washed twice with 200 µl of acetonitrile per well. Full vacuum was applied to the plate using a vacuum manifold (Millipore). Then 3 µl of acetonitrile was put on the C18 resin without vacuum to prevent it from drying. Each sample was mixed with 200 µl of HPLC grade water/0.1% TFA. Subsequently the samples were loaded on the washed and prewetted 96-well zip C18 plate (Millipore); a pressure differential of 5 inches of Hg vacuum was used. After the wells had been cleared, the wells were washed twice with 100 µl of 0.1% TFA. Full vacuum was applied until all wells were empty. The samples were eluted into a new 96-well plate with an elution volume of 15 µl of 50% acetonitrile/HPLC grade water, 0.1% TFA; a pressure differential of 5 inches of Hg vacuum was used. After elution, the samples were stored at 4 °C in the 96-well plates covered with aluminum seals. All samples were spotted on a MALDI target (600/384 anchor chip with transponder plate; Bruker Daltonik GmbH, Bremen, Germany) in triplicate. To do so, 2 µl of eluate was mixed with 10 µl of matrix solution (2 mg of α-cyano-4-hydroxycinnamic acid (Bruker Daltonik GmbH) dissolved in 1 ml of acetonitrile for 30 min using an ultrasonic bath). Afterward, samples were automatically measured on a MALDI-TOF MS (Biflex III; Bruker Daltonik GmbH).
The digestion step was performed twice for each sample, the purified peptides were spotted in triplicate, and all the spots were measured in triplicate. This resulted in 18 spectra for each sample. The standard method for peptide measurements on the MALDI-TOF MS was used (default file Bruker "12kD positive" with the measurement range changed to 300–3000 Da). For the automated measurements, the settings of initial laser power of 20% and a maximum of 35% were used. The highest peak above 750 Da had to have a signal-to-noise ratio of at least 5 and a minimum resolution of 5000. After every 30 laser shots, the sum spectrum was checked for these criteria. If the sum spectrum did not meet these criteria, it was rejected. If 13 sum spectra from 30 shots met the criteria, these were combined and saved; when 50 sum spectra from 30 shots were rejected, the measurement of that spot was then ended, and the next spot was measured.
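The acceptance rules for the automated measurements can be illustrated with a short Python sketch. It is not the instrument software: the function names and the representation of each 30-shot sum spectrum as a (signal-to-noise, resolution) pair are assumptions made for illustration, and the limits of 13 accepted and 50 rejected sum spectra are taken as printed above.

```python
from typing import Iterable, List, Optional, Tuple

MIN_SNR = 5.0          # signal-to-noise ratio of the highest peak above 750 Da
MIN_RESOLUTION = 5000  # minimum resolution of that peak
ACCEPT_TARGET = 13     # accepted 30-shot sum spectra that are combined and saved
REJECT_LIMIT = 50      # rejected 30-shot sum spectra after which the spot is abandoned


def acquire_spot(
    sum_spectra: Iterable[Tuple[float, float]],
) -> Optional[List[Tuple[float, float]]]:
    """Return the accepted sum spectra for one spot, or None if the spot fails.

    Each item is a hypothetical (snr, resolution) pair describing the highest
    peak above 750 Da in one 30-shot sum spectrum.
    """
    accepted: List[Tuple[float, float]] = []
    rejected = 0
    for snr, resolution in sum_spectra:
        if snr >= MIN_SNR and resolution >= MIN_RESOLUTION:
            accepted.append((snr, resolution))
            if len(accepted) == ACCEPT_TARGET:
                return accepted  # enough good sum spectra: combine and save
        else:
            rejected += 1
            if rejected == REJECT_LIMIT:
                return None  # too many rejections: move to the next spot
    return None  # the stream ended before either limit was reached


def spectra_per_sample(digestions: int = 2, spots: int = 3, measurements: int = 3) -> int:
    """Two digestions x three spots x three measurements per sample."""
    return digestions * spots * measurements
```

With the default arguments, `spectra_per_sample()` reproduces the 18 spectra per sample stated above.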
Analysis of Spectra
First, the raw binary data files were converted to ASCII files containing the measured intensities for all channel indices of the spectra. We then developed a peak detection algorithm in the statistical language R (www.r-project.org). In this algorithm, a peak (or local maximum) is defined as a position whose intensity is above a predefined threshold and is the highest intensity value in a surrounding mass window. This peak-finding algorithm was tested on a small set of spectra with different settings for the threshold and the mass window. The settings for the peak finding were chosen such that the resulting peak list most resembled peaks that would be manually assigned, thus optimizing the trade-off between signal sensitivity and noise detection. We chose a percentile threshold of 98.5% (the intensity of the position must belong to the 1.5% highest intensity values of the spectrum) and a mass window of 0.5 Da. A quadratic fit with a number of internal calibrants was used to calibrate the channel numbers to masses. For this mass calibration, five omnipresent albumin peaks (960.5631, 1000.6043, 1149.6156, 1511.8433, and 2045.0959 m/z) were used. The accurate masses of these albumin peaks were obtained by performing a "tryptic digest" on the human albumin amino acid sequence with MS-digest (prospector.ucsf.edu/ucsfhtml4.0/msdigest.htm).
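The peak definition given above can be sketched as follows. The original algorithm was written in R and is not reproduced here; this Python version is an illustration only. It assumes that the 0.5-Da mass window is a total width centered on the candidate position (0.25 Da on each side), that the mass axis is sorted in ascending order, and that ties on a plateau are resolved in favor of the leftmost point. The function names are hypothetical.

```python
from typing import List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Percentile with linear interpolation; pct is given on a 0-100 scale."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("empty spectrum")
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def find_peaks(
    masses: Sequence[float],
    intensities: Sequence[float],
    pct: float = 98.5,
    window: float = 0.5,
) -> List[int]:
    """Return indices that exceed the percentile threshold and are the
    highest point within the surrounding mass window."""
    threshold = percentile(intensities, pct)
    half = window / 2.0  # assumption: the window is centered on the candidate
    peaks: List[int] = []
    for i, height in enumerate(intensities):
        if height <= threshold:
            continue
        is_max = True
        j = i - 1
        while j >= 0 and masses[i] - masses[j] <= half:
            if intensities[j] >= height:  # an equal left neighbor wins the tie
                is_max = False
                break
            j -= 1
        j = i + 1
        while is_max and j < len(masses) and masses[j] - masses[i] <= half:
            if intensities[j] > height:
                is_max = False
            j += 1
        if is_max:
            peaks.append(i)
    return peaks
```

Lowering `pct` or narrowing `window` yields more candidate peaks at the cost of more noise, which is the trade-off described in the text.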
During the process of alignment and conversion, the quality of the spectra was checked as follows. If two or more of the omnipresent albumin peaks were not detected, the spectrum was not used in the further analysis. The peak finding algorithm was then used to create a list of peak positions for each individual spectrum. These peak lists were combined by comparing the lists one by one. If peak positions were present in a mass window of 0.5 Da in both spectra, these peak positions were combined. The combined peak list was then compared with a new spectrum until all peak lists had been combined. The final combined peak list was used to create a matrix displaying the frequency of each peak position for each sample. Peak positions that were present in fewer than 5% of the spectra were deleted from the matrix to reduce the number of noise peaks. The matrix created in this way was used for statistical analysis of the data. Using a univariate analysis in R, a p value was determined for every peak position. When comparing more than two groups, we used the Kruskal-Wallis test; when comparing two groups, we used the Wilcoxon-Mann-Whitney test.
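The combination of peak lists and the construction of the frequency matrix can be sketched in Python as below. This is an illustration under stated assumptions, not the original R code: the 0.5-Da window is treated as a total width (positions within 0.25 Da are considered the same peak), the first position encountered is kept as the representative of a combined peak, and all function names are hypothetical. The univariate tests are not covered by the sketch.

```python
from typing import Dict, List, Sequence


def _present(peaks: Sequence[float], position: float, half: float) -> bool:
    """True if any peak of a spectrum lies within `half` Da of `position`."""
    return any(abs(mass - position) <= half for mass in peaks)


def merge_peak_lists(peak_lists: Sequence[Sequence[float]], window: float = 0.5) -> List[float]:
    """Combine peak lists one by one into a single list of peak positions."""
    half = window / 2.0
    combined: List[float] = []
    for peaks in peak_lists:
        for mass in peaks:
            if not _present(combined, mass, half):
                combined.append(mass)  # new position, not seen before
    return sorted(combined)


def frequency_matrix(
    samples: Dict[str, List[List[float]]],
    positions: Sequence[float],
    window: float = 0.5,
) -> Dict[str, List[int]]:
    """For each sample, count the spectra in which each position is present."""
    half = window / 2.0
    return {
        name: [sum(1 for peaks in spectra if _present(peaks, pos, half)) for pos in positions]
        for name, spectra in samples.items()
    }


def drop_rare_positions(
    samples: Dict[str, List[List[float]]],
    positions: Sequence[float],
    min_fraction: float = 0.05,
    window: float = 0.5,
) -> List[float]:
    """Keep positions present in at least `min_fraction` of all spectra."""
    half = window / 2.0
    all_spectra = [peaks for spectra in samples.values() for peaks in spectra]
    kept: List[float] = []
    for pos in positions:
        hits = sum(1 for peaks in all_spectra if _present(peaks, pos, half))
        if hits / len(all_spectra) >= min_fraction:
            kept.append(pos)
    return kept
```

In this sketch each sample maps to its list of per-spectrum peak lists, so with seven spectra per sample every matrix entry lies between 0 and 7.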
To investigate whether differences in the total CSF protein concentration of the samples affected the performance of the MALDI-TOF, we first used the Bio-Rad detergent-compatible protein assay (Bio-Rad) to determine the protein concentration of all the CSF samples. We then calculated the sum of albumin peaks detected in all seven spectra of each sample (excluding the albumin peaks that had been used for calibration). Using Prism version 4.0 (GraphPad Software, San Diego, CA), we compared the total protein concentration and the sum of the albumin peaks of the three groups. To test for statistically significant differences between the three groups, we used one-way ANOVA followed by Bonferroni's multiple comparison test. All tests were two-sided, and p < 0.05 was considered statistically significant. In addition, the correlation between peak frequency and protein concentration was calculated for each individual peak position. A histogram of all correlation coefficients was created, and the distribution was compared with a normal distribution using the Kolmogorov-Smirnov test (SPSS Inc., Chicago, IL).
All peak positions with a frequency that was two times higher in group I than in the control groups II and III were selected. These peak values were submitted to the Mascot search engine (Matrix Science, London, UK) to search the MSDB human database using a 100-ppm tolerance.
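The selection rule and the mass tolerance can be made concrete with a small Python sketch. It does not reproduce the Mascot search itself. It assumes that "two times higher" means at least twice the frequency observed in each control group separately, that positions absent from group I are never selected, and that the 100-ppm tolerance is applied relative to the theoretical mass. The function names are hypothetical.

```python
from typing import List, Sequence


def ppm_error(observed: float, theoretical: float) -> float:
    """Mass error in parts per million relative to the theoretical mass."""
    return (observed - theoretical) / theoretical * 1e6


def within_tolerance(observed: float, theoretical: float, tol_ppm: float = 100.0) -> bool:
    """True if the observed mass matches the theoretical mass within tol_ppm."""
    return abs(ppm_error(observed, theoretical)) <= tol_ppm


def select_enriched(
    case_freq: Sequence[float],
    control_freqs: Sequence[Sequence[float]],
    fold: float = 2.0,
) -> List[int]:
    """Indices of peak positions whose frequency in the case group is at least
    `fold` times the frequency in every control group."""
    selected: List[int] = []
    for i, freq in enumerate(case_freq):
        if freq <= 0:
            continue  # assumption: a peak absent from the case group is never selected
        if all(freq >= fold * control[i] for control in control_freqs):
            selected.append(i)
    return selected
```

At 1000 m/z a 100-ppm tolerance corresponds to 0.1 Da, so an observed mass of 1000.05 would match a theoretical mass of 1000.00, whereas 1000.20 would not.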
Building a Predictive Model
A supervised multivariate analysis method was used to determine whether sample groups I and II could be separated on the basis of their peak positions. For each patient, seven mass spectra were used, and the peak positions of each of those seven spectra were combined. Therefore, the number of times that a peak was present varied between 0 and 7. To reduce noise, a peak had to be detected in at least two of the seven spectra to be scored as present in a sample (≥2, scored 1) rather than absent (<2, scored 0), allowing the formation of a binary data matrix. The required frequency was kept low to minimize loss of signal. To reduce the number of variables, a clustering was performed that combined peaks of similar behavior. Peptide peaks that often occurred simultaneously were grouped into the same cluster using a hierarchical clustering algorithm. The distance between each possible pair of peptide peaks was determined with the Manhattan distance measure (i.e. the sum of the absolute differences for all patients). The number of clusters was set at 50. With 50 clusters, isotope peaks were generally grouped into the same cluster. The clusters generated represented groups of peptide peaks that might be derived, at least in part, from the same protein or proteins. The clustering of the masses made it possible to compose a new data matrix. Each matrix cell contained the number of peaks present for a certain patient relative to the total number of peaks in a particular cluster. In other words, each cell in the new matrix defined the proportion of peptides that was present in a cluster for a certain patient. To further reduce the complexity of the data, we set a threshold for the presence of a cluster to obtain a binary data matrix. Using the clustered, binary variables, we constructed a non-linear predictive model that separated group I from group II. In the model thus generated, a maximum of eight clusters was allowed, and only those clusters with an area under the curve (AUC) greater than 0.62 were considered.
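The data reduction described above (binary scoring, Manhattan distance between peaks, hierarchical clustering, and per-cluster proportions) can be sketched in Python as follows. This is an illustration only. The text does not state the linkage criterion, so average linkage is an assumption here, and the threshold used to binarize the cluster proportions is not given and is therefore left out. The function names are hypothetical.

```python
from typing import List, Sequence


def binarize(counts: Sequence[Sequence[int]], min_count: int = 2) -> List[List[int]]:
    """Score a peak as present (1) if seen in at least `min_count` spectra."""
    return [[1 if c >= min_count else 0 for c in row] for row in counts]


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences over all patients."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_peaks(columns: Sequence[Sequence[int]], n_clusters: int) -> List[List[int]]:
    """Agglomerative clustering of peak columns down to `n_clusters` groups.

    Average linkage is an assumption; the source only names the distance.
    """
    clusters: List[List[int]] = [[i] for i in range(len(columns))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        total = sum(manhattan(columns[i], columns[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


def cluster_proportions(
    binary: Sequence[Sequence[int]], clusters: Sequence[Sequence[int]]
) -> List[List[float]]:
    """Per patient, the fraction of peaks in each cluster that are present."""
    return [
        [sum(row[i] for i in cluster) / len(cluster) for cluster in clusters]
        for row in binary
    ]
```

Rows of `counts` are patients and columns are peak positions, so the columns passed to `cluster_peaks` can be obtained with `list(zip(*binarize(counts)))`. The naive pairwise search is adequate for a sketch but would be slow for the 895 peak positions of the full matrix.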
Genetic programming was used to search for the model with the highest AUC (14). To obtain an unbiased estimate of the predictive accuracy of the model, we used the bootstrapping method (15, 16). Bootstrap data sets were created by randomly selecting patients with replacement from the original data set. As an extra precaution, the clustering step was included in the bootstrapping process as well. One hundred bootstrapped matrices were created from the original matrix in this way. The clustering was repeated for each of these resampled matrices and a predictive model was constructed. The AUC of each model was measured on the bootstrap data set as well as on the original data set. The average difference between the performance on the bootstrap data set and the performance on the original data set provided a correction factor that gave an estimate of the bias of our model development process. Finally, we developed a model on the original data and corrected its AUC with the correction factor, producing a conservative estimate of the performance of the model.
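The bootstrap correction can be illustrated with a minimal Python sketch. The genetic programming search and the clustering step are not reproduced: `fit_best_feature` is a deliberately simple, hypothetical stand-in that selects the single variable with the highest AUC, so that the resampling logic can be shown on its own. All function names are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve as the fraction of correctly ordered
    case/control pairs (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_best_feature(rows: Sequence[Sequence[float]], labels: Sequence[int]) -> int:
    """Hypothetical stand-in for model building: pick the best single variable."""
    return max(range(len(rows[0])), key=lambda j: auc([r[j] for r in rows], labels))


def optimism_corrected_auc(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    n_boot: int = 100,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (apparent AUC, mean optimism, corrected AUC)."""
    rng = random.Random(seed)
    n = len(rows)
    chosen = fit_best_feature(rows, labels)
    apparent = auc([r[chosen] for r in rows], labels)
    diffs: List[float] = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # patients drawn with replacement
        b_rows = [rows[i] for i in idx]
        b_labels = [labels[i] for i in idx]
        if len(set(b_labels)) < 2:
            continue  # a resample with one class cannot be scored
        j = fit_best_feature(b_rows, b_labels)  # rebuild the model on the resample
        on_boot = auc([r[j] for r in b_rows], b_labels)
        on_orig = auc([r[j] for r in rows], labels)
        diffs.append(on_boot - on_orig)
    optimism = sum(diffs) / len(diffs)
    return apparent, optimism, apparent - optimism
```

Because the model is rebuilt inside every resample, the average difference captures the optimism of the whole development process rather than that of one fixed model, which is the reason the clustering step was kept inside the bootstrap loop.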
RESULTS |
---|
Peak Detection
We detected an average of 350 peaks per spectrum (95% CI, 250 to 450). Spectra with more than 450 detectable peaks were excluded from the analysis because these spectra consisted mainly of noise peaks (peaks without an isotopic distribution). After alignment and quality control, the number of good quality spectra per sample was counted, and samples with fewer than seven good quality spectra were discarded. The threshold of seven spectra is based on an earlier reproducibility study (13). At this threshold, the remaining number of cases per patient group was 41 in group I, 46 in group II, and 43 in group III (Table II). Of the samples with more than seven spectra, only the first seven spectra were used. The combined peak list of all these spectra contained 2006 possible peak positions. After noise reduction, the matrix created from this list contained 895 peak positions.
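The quality control rules can be summarized in a short Python sketch. It combines the exclusion of spectra with more than 450 peaks with the rule, given under "Analysis of Spectra," that a spectrum is rejected when two or more of the five albumin calibrant peaks are missing. The tuple layout and function names are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical record: (spectrum id, detected peaks, albumin calibrant peaks found)
Spectrum = Tuple[str, int, int]


def is_good_spectrum(
    n_peaks: int,
    calibrants_found: int,
    n_calibrants: int = 5,
    max_peaks: int = 450,
) -> bool:
    """A spectrum passes if it is not dominated by noise peaks and at most
    one of the albumin calibrant peaks is missing."""
    missing = n_calibrants - calibrants_found
    return n_peaks <= max_peaks and missing < 2


def select_spectra(spectra: List[Spectrum], required: int = 7) -> Optional[List[Spectrum]]:
    """Return the first `required` good spectra, or None if the sample
    has too few good spectra and must be discarded."""
    good = [s for s in spectra if is_good_spectrum(s[1], s[2])]
    if len(good) < required:
        return None
    return good[:required]
```

A sample measured 18 times is therefore kept only if at least seven of its spectra pass both checks, and later spectra beyond the seventh are ignored.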
The CSF total protein concentration of the samples differed significantly between groups I and III (ANOVA, p < 0.001; Fig. 2A). However, the sum of albumin peaks detected per sample did not differ between the groups (ANOVA, p = 0.8; Fig. 2B). For all individual peak positions, the correlation between the peak frequency and the protein concentration was calculated and plotted in a histogram (Fig. 3). The distribution of the correlation coefficients did not significantly differ from a normal distribution (Kolmogorov-Smirnov test, p = 0.35). The constant number of albumin peaks and the lack of effect of protein concentration on the peak frequency indicated that differences in total protein concentration had not significantly affected MALDI-TOF performance.
Clustering Analysis
A clustering of the masses was performed on a matrix in which the samples had been sorted by group number. This clustering resulted in the detection of group-specific clusters (Fig. 4). Fig. 4A shows a zoomed-in view of the dendrogram with peak positions that have higher frequencies in patients with breast cancer without LM (group II) and healthy control subjects (group III) than in patients with breast cancer with LM (group I). Fig. 4B shows peak positions with a higher frequency in patients with breast cancer with LM (group I) than in groups II and III.
DISCUSSION |
---|
After the original highly intriguing report that the serum proteome profile can be used for the early detection of ovarian cancer (3), many researchers have applied the SELDI-TOF technology to detect proteome profiles specific for other forms of cancer and non-malignant disease (6, 19). However, criticism has focused on the low reproducibility of the SELDI-TOF analytical tool (7, 9, 10, 20–25). Models based on SELDI-TOF protein profiling data generally performed poorly upon external validation in time (26). This lack of reproducibility may be due to variation in chip batches, mass spectrometers, sample stability, the low reproducibility of peak height, and the low number of measurements per sample (7, 20, 27). We believe that our model is less affected by these variations for several reasons. First, the sample preparation is simple, fully automated, and does not require chips or fractionations. Second, we did not include the height of the peaks in the model because quantitative measurements of peak heights with both the MALDI and SELDI methods are poorly reproducible (28). In addition, we have carefully determined before analysis the number of replicates per sample that provided the optimal reproducibility (13). The number of replicates that we used (18) was much higher than in other studies. Third, the predictors that we used in our model were clusters of peaks and not single peaks, which improves the robustness of the model. A change in one peak position of a cluster used as a predictor will not have a dramatic effect on the predictive value of the entire cluster. In the future, the reliability of the method can be further improved by linking multiple peptide peaks to a single protein.
The direct identification of peptides from complex samples remains difficult because of the complexity of tryptic digests of body fluids. A direct MS/MS identification of the peptides using MALDI TOF/TOF is not possible as a result of the presence of multiple peptides, even in small mass windows. Although off-line nano LC-MALDI could solve this problem, we believe that the best method to identify peptides in complex mixtures is Fourier transform MS in which the exact mass of the peptide of interest is obtained. In most cases, the detection of multiple peptides derived from a single protein will allow identification of the protein. Our database search on the up-regulated peptides has demonstrated the feasibility of this approach for apolipoprotein A1. The up-regulation of different forms of apolipoprotein has been observed before in different SELDI-TOF studies (29–31). We are currently performing Fourier transform MS to identify the other up-regulated peptides as well.
Confounding factors in the present study could be differences in sample collection and storage between institutes, differences in total protein concentration and white cell count between the groups, reproducibility of the method, and patient selection bias. The number of peptides differentially expressed between the institutes was consistent with a chance distribution, which argues against biases introduced by differences in sample handling. The white cell count and protein concentration in CSF from patients with LM were increased compared with both patients with breast cancer without LM and healthy control subjects. All samples were routinely centrifuged after lumbar puncture, making contamination of the supernatant with cellular debris unlikely. The elevated protein concentration in the CSF from LM patients is well known and is caused by dysfunction of the blood-brain barrier (4), resulting in an increase of high abundance serum proteins in the CSF. A normalization on protein concentration could have been used to compensate for this difference. However, this implies that less CSF should be used from LM samples. This would result in a lower amount of CSF-specific proteins compared with the control samples. In our opinion, this would result in a bias among the three sample groups. To investigate the potential confounding effect of differences in total protein concentration, we calculated the average number of tryptic peptides derived from albumin in each group. The number of albumin-derived peptide peaks did not differ among the three groups. In addition, no significant negative or positive correlation between the number of peaks and the protein concentration could be detected. This provided strong evidence that the differences in protein concentration had not interfered with the analysis.
All patients with breast cancer in the present study had signs or symptoms compatible with LM, which led to the performance of a lumbar puncture. All patients in group I who were diagnosed with LM had positive cytology. All patients in group II had negative cytology combined with clinical follow-up, indicating an alternative diagnosis. At this stage, we did not include a group of patients with "false-negative" CSF cytology as indicated by MRI and/or clinical follow-up. It will be particularly interesting to investigate the performance of this proteome-based test in these patients as well, preferably in a prospective manner.
We conclude that MALDI-TOF analysis of tryptic peptide digests derived from the CSF of patients with breast cancer can support the diagnosis of LM. We expect that the use of more accurate and sensitive measurements by Fourier transform mass spectrometry will further improve the identification of disease-specific patterns and markers from body fluids in the near future.
FOOTNOTES |
---|
Published, MCP Papers in Press, June 21, 2005, DOI 10.1074/mcp.M500081-MCP200
1 The abbreviations used are: LM, leptomeningeal metastases; CSF, cerebrospinal fluid; AUC, area under the curve; ANOVA, analysis of variance.
* This study was supported by the Netherlands Proteomics Centre, by a grant from the Erasmus MC Revolving Fund, and by the European Union grant P-mark.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
|||| To whom correspondence should be addressed. Tel.: 31-104633327; Fax: 31-104633208; E-mail: p.sillevissmitt{at}erasmusmc.nl