CORRESPONDENCE

Re: The Central Role of Receiver Operating Characteristic (ROC) Curves in Evaluating Tests for the Early Detection of Cancer

Stefano Parodi, Alberto Izzotti, Marco Muselli

Affiliations of authors: Epidemiology and Biostatistics Section, Scientific Directorate, G. Gaslini Children's Hospital, Genoa, Italy (SP); Health Science Department, University of Genoa, Genoa, Italy (AI); Institute of Electronics, Computer and Telecommunication Engineering, Italian National Research Council, Genoa, Italy (MM)

Correspondence to: Stefano Parodi, PhD, Epidemiology and Biostatistics Section, Scientific Directorate, G. Gaslini Children's Hospital, Largo G. Gaslini, 5-16147 Genoa, Italy (e-mail: stefanoparodi{at}ospedale-gaslini.ge.it).

Receiver operating characteristic (ROC) curves are a powerful and flexible tool for identifying differentially expressed genes in microarray experiments. This use was discussed by Baker (1) in reference to a study by Pepe et al. (2) and was described by us (3) to illustrate a method of selecting cut-off values that separate different groups of cancer patients. Moreover, Baker outlined an innovative strategy for identifying cancer patients, based on grouping genes selected by the partial area under the ROC curve (pAUC), and suggested that multiple methods could be developed to generate classification rules. In our opinion, this strategy is promising and should not be limited to the early detection of cancer. In fact, when the more conventional whole area under the curve (AUC) is used for gene selection, bidirectional rules can be obtained that identify different classes of cancer patients or subjects differentially exposed to carcinogenic compounds.
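As a rough illustration of gene selection by whole AUC, the sketch below computes the empirical AUC of each gene as the proportion of class A/class B sample pairs in which the class B value is higher (ties count one half). An AUC close to 0 is as informative as one close to 1, only in the opposite direction, which is what makes rules in both directions possible. The function names and the toy data layout are assumptions made for this example and are not taken from the cited studies.

```python
def auc(values_a, values_b):
    """Empirical AUC: share of (a, b) pairs with b > a, ties counted as 1/2."""
    wins = 0.0
    for a in values_a:
        for b in values_b:
            if b > a:
                wins += 1.0
            elif b == a:
                wins += 0.5
    return wins / (len(values_a) * len(values_b))


def rank_genes(expr_a, expr_b):
    """Rank genes by how far their AUC lies from 0.5, in either direction.

    expr_a and expr_b map a gene name to the list of expression values
    observed in class A and class B (hypothetical layout).
    """
    scored = []
    for gene in expr_a:
        a = auc(expr_a[gene], expr_b[gene])
        direction = "higher in B" if a >= 0.5 else "higher in A"
        # max(a, 1 - a) treats strong genes of both directions alike
        scored.append((max(a, 1.0 - a), gene, direction))
    scored.sort(reverse=True)
    return scored


# Toy data: g1 is higher in class B, g2 is higher in class A, g3 is weaker.
expr_a = {"g1": [1, 2, 3], "g2": [5, 6, 7], "g3": [1, 5, 3]}
expr_b = {"g1": [4, 5, 6], "g2": [1, 2, 3], "g3": [2, 4, 6]}
ranked = rank_genes(expr_a, expr_b)
```

In this toy example both g1 and g2 reach the maximum score, although they discriminate in opposite directions.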

To assess the latter hypothesis, we developed a classification method that combines cut-off values at the highest observed specificity, as illustrated in Fig. 1 for two hypothetical genes. In the first step, the combination of a few genes with high AUC that best separates the two groups is identified. In the following steps, the same procedure is repeated recursively on the remaining genes to generate a panel of classification rules. For each class, a score proportional to the AUC is then assigned to each sample. The scores generated at each step are added to those from the previous steps, so as to give greater weight, in each rule, to the genes with the highest AUC.
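One possible reading of the scoring step can be sketched as follows. The letter does not give the exact formula, so the function names, the use of the raw AUC as the score, and the handling of ties are assumptions made purely for illustration.

```python
def accumulate_scores(step_votes):
    """Sum AUC-weighted votes per class across the rules applied to one sample.

    step_votes is a list of (class_label, auc) pairs, one per step
    (hypothetical layout). Votes for anything other than "A" or "B"
    are ignored.
    """
    totals = {"A": 0.0, "B": 0.0}
    for label, auc in step_votes:
        if label in totals:
            totals[label] += auc
    return totals


def resolve(totals):
    """Assign the class with the highest cumulative score ("C" on a tie)."""
    if totals["A"] > totals["B"]:
        return "A"
    if totals["B"] > totals["A"]:
        return "B"
    return "C"


# Three steps voted for a class; one step left the sample undecided.
votes = [("A", 0.9), ("B", 0.8), ("A", 0.7), ("C", 0.95)]
totals = accumulate_scores(votes)
assigned = resolve(totals)
```

Under these assumptions, rules built from genes with a higher AUC contribute more to the final assignment than rules built from weaker genes.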



Fig. 1. Method for the identification of classification rules in microarray data. a and b) For each selected gene, the two cut-off values that allow the highest number of samples to be classified without error (highest specificity) are identified (for example, G1A and G1B for gene 1 and G2A and G2B for gene 2). c) Two (or more) genes are then combined to reach the best classification rate. In cross-validation, samples falling in "A" areas are assigned to class A, whereas samples falling in "B" areas are assigned to the other class. Samples falling in "C" areas are assigned to the class receiving the highest score, based on the two AUCs (as described in the text).
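The cut-off idea in the caption can be sketched in a few lines. This is a minimal illustration that assumes each gene is expressed at a higher level in class B, takes the lower cut-off as the smallest class B training value and the upper cut-off as the largest class A training value, and leaves a sample in "C" when genes disagree. These choices, and all names below, are assumptions for the example and may differ from the procedure actually used.

```python
def error_free_cutoffs(values_a, values_b):
    """Return (low, high) for a gene assumed to be higher in class B.

    Below `low` only class A training samples occur; above `high`
    only class B training samples occur.
    """
    low = min(values_b)   # no class B training value lies below this
    high = max(values_a)  # no class A training value lies above this
    return low, high


def region(value, low, high):
    """Map one expression value to area "A", "B" or "C" (undecided)."""
    in_a = value < low
    in_b = value > high
    if in_a and not in_b:
        return "A"
    if in_b and not in_a:
        return "B"
    return "C"


def classify(sample, cutoffs):
    """Combine genes: assign a class only if all decided genes agree."""
    votes = {region(sample[g], lo, hi) for g, (lo, hi) in cutoffs.items()}
    votes.discard("C")
    if len(votes) == 1:
        return votes.pop()
    return "C"


# Toy training data for two hypothetical genes.
cutoffs = {
    "gene1": error_free_cutoffs([1, 2, 3, 5], [4, 6, 7, 8]),
    "gene2": error_free_cutoffs([2, 3, 4, 6], [5, 7, 8, 9]),
}
```

With these toy cut-offs, a sample that gene 1 alone leaves undecided can still be classified by gene 2, which is the benefit of combining genes.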

 
This method was applied to the data from two online databases (4,5), and the results were validated by complete leave-one-out cross-validation (6). The first data set (4), generated from 34 current smokers and 23 nonsmokers, used Affymetrix chips and included 7329 genes (http://pulm.bumc.bu.edu/aged/index.html). The second data set (5) used a specialized cDNA array (Lymphochip), designed at the Stanford University School of Medicine, and evaluated the expression of 4096 genes in neoplastic and nonneoplastic cells (http://llmpp.nih.gov/lymphoma/data/figure1). The analysis was restricted to the two largest groups of patients (42 patients with diffuse large-cell lymphomas and 20 patients with other lymphatic malignancies). After excluding missing values, 1332 genes were retained in the analyses.
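The point of complete leave-one-out cross-validation is that every data-dependent step, including gene selection, is repeated on each training set with the held-out sample excluded. The sketch below shows only that loop structure. The `fit` and `predict` callables are placeholders, and the one-dimensional midpoint classifier used to exercise the loop is a toy stand-in, not the method described in this letter.

```python
def leave_one_out_errors(samples, labels, fit, predict):
    """Count misclassified held-out samples over all leave-one-out splits."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # Gene selection and cut-off estimation would both happen inside
        # fit(), so the held-out sample never influences them.
        model = fit(train_x, train_y)
        if predict(model, samples[i]) != labels[i]:
            errors += 1
    return errors


def fit_midpoint(train_x, train_y):
    """Toy model: threshold halfway between the two class means."""
    a = [x for x, y in zip(train_x, train_y) if y == "A"]
    b = [x for x, y in zip(train_x, train_y) if y == "B"]
    return (sum(a) / len(a) + sum(b) / len(b)) / 2.0


def predict_midpoint(threshold, x):
    """Toy prediction: class B above the threshold, class A otherwise."""
    return "B" if x > threshold else "A"


values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
classes = ["A", "A", "A", "B", "B", "B"]
n_errors = leave_one_out_errors(values, classes, fit_midpoint, predict_midpoint)
```

Selecting genes once on the full data set and cross-validating only the final rule would let the held-out sample influence the selection, which is the pitfall that complete cross-validation is meant to avoid (6).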

Using the conventional P value threshold of 0.05, adjusted for multiple comparisons by Bonferroni's correction, 94 hallmark genes were extracted from the first database (4) and 444 from the second database (5). In the first database, three samples (5.3%) were misclassified during cross-validation at the first step (using two genes), whereas the error rate fell to 1.8% (one sample) during steps 3 to 20, corresponding to the inclusion of 12 to 90 genes. In the second database, cross-validation misclassifications ranged between three (4.8%) and zero samples during the first nine steps (1 to approximately 20 genes) and between one (1.6%) and zero samples during the next steps. After step 18, the number of misclassifications increased.
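The arithmetic behind these figures can be checked with a short sketch. It assumes that the number of comparisons equals the number of genes analysed in each data set (7329 and 1332), which the letter does not state explicitly. The sample sizes are the group totals given above (34 + 23 = 57 and 42 + 20 = 62), and the function names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni's correction."""
    return alpha / n_tests


def error_rate_percent(n_errors, n_samples):
    """Misclassification rate as a percentage, rounded to one decimal."""
    return round(100.0 * n_errors / n_samples, 1)


# Assumed number of tests: one per gene analysed in each data set.
threshold_first = bonferroni_threshold(0.05, 7329)    # about 6.8e-06
threshold_second = bonferroni_threshold(0.05, 1332)   # about 3.8e-05

rates_first = [error_rate_percent(k, 34 + 23) for k in (3, 1)]    # [5.3, 1.8]
rates_second = [error_rate_percent(k, 42 + 20) for k in (3, 1)]   # [4.8, 1.6]
```

The computed percentages agree with those reported for the two databases.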

The high performance observed in this study (zero or one misclassification during cross-validation for each database) indicates that, as suggested by Baker (1), ROC analysis is a powerful and reliable tool for supervised analyses of microarray data.

REFERENCES

(1) Baker SG. The central role of receiver operating characteristic (ROC) curves in evaluating tests for the early detection of cancer. J Natl Cancer Inst 2003;95:511–5.

(2) Pepe MS, Longton G, Anderson GL, Schummer M. Selecting differentially expressed genes from microarray experiments. Biometrics 2003;59:133–42.

(3) Parodi S, Muselli M, Fontana V, Bonassi S. ROC curves are a suitable and flexible tool for the analysis of gene expression profiles. Cytogenet Genome Res 2003;101:90–1.

(4) Spira A, Beane J, Shah V, Liu G, Schembri F, Yang X, et al. Effects of cigarette smoke on the human airway epithelial cell transcriptome. Proc Natl Acad Sci USA 2004;101:10143–8.

(5) Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000;403:503–11.

(6) Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003;95:14–8.



Copyright © 2005 Oxford University Press (unless otherwise stated)