Departments of 1 Breast Medical Oncology and 2 Biostatistics, The University of Texas M. D. Anderson Cancer Center, Houston, TX, USA
* Correspondence to: Dr L. Pusztai, Department of Breast Medical Oncology, Unit 424, The University of Texas M. D. Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, TX 77030-4009, USA. Tel: +1-713-792-2817; Fax: +1-713-794-4385; Email: lpusztai@mdanderson.org
Abstract
Key words: clinical trials, microarrays, gene-expression profiling, predictive markers, multigene predictors
Introduction
It may be useful to think of marker discovery studies as conceptually similar to the clinical trials that lead to the introduction of new drugs. The hallmark of clinical drug development is the multistage trial process. A similar focused, prospective, multistage evaluation of genomic markers could facilitate the introduction of new diagnostic markers into the clinic [8]. Phase II marker discovery studies would be expected to show that a technology can be reliably and reproducibly applied to clinical specimens and that the estimated predictive accuracy of the proposed test falls within a range considered clinically useful. Phase III marker validation studies would then evaluate the predictor in a larger number of cases to demonstrate that clinical outcome is better when the new marker is used for decision making than with the current standard, which may be another marker-based recommendation or no marker at all.
Tissue sampling
Gene expression data
Gene-expression profiling, particularly if performed by a high-volume central laboratory, may in fact lend itself to greater quality control than can be achieved with many current molecular diagnostic tests, which are most commonly performed by end users. RNA quality and quantity can be measured, and the efficiency of the reverse transcription reaction, probe labeling and the hybridization process can be monitored. Arrays routinely contain several types of positive and negative controls embedded in the array matrix. Global statistics can be applied to the gene-expression data to compare any new result with existing profiles in a previous dataset, enabling investigators to flag results that fall beyond acceptable limits of variation.
A quality-assured, properly normalized gene-expression dataset is a data matrix with a row for each sample and a column for each gene. Several standard algorithms and mathematical methods are available for analyzing such datasets (see the sections on gene selection and class prediction below). A unique challenge for investigators using microarray data is how to apply observations made on one platform to data generated on another. Different arrays contain different sets of genes, and even for the same set of genes, different oligonucleotide sequences may be used as probes, which can result in variable signal intensity. Investigators who report their gene-expression results as the ratio of the test sample to a common reference sample often use a different reference from laboratory to laboratory. Furthermore, different normalization methods and signal detection techniques yield different numeric results, even for the same raw data. Not surprisingly, cross-platform validation of results has proved difficult [24].
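To make the normalization point concrete, here is a toy Python sketch (all intensity values hypothetical, not any vendor's actual pipeline) showing how the same raw probe intensities produce different numbers under two common reporting conventions:

```python
import math
import statistics

raw = [120.0, 480.0, 960.0, 240.0]        # hypothetical probe intensities
reference = [100.0, 400.0, 800.0, 200.0]  # hypothetical common reference sample

# Convention 1: scale each value to the array's median intensity, then log2.
med = statistics.median(raw)
median_scaled = [round(math.log2(x / med), 3) for x in raw]

# Convention 2: log2 ratio of the test sample to the reference sample.
log_ratios = [round(math.log2(x / r), 3) for x, r in zip(raw, reference)]

print(median_scaled)  # [-1.585, 0.415, 1.415, -0.585]
print(log_ratios)     # [0.263, 0.263, 0.263, 0.263]
```

Neither set of numbers is wrong; they are simply not interchangeable, which is one reason cross-laboratory and cross-platform comparisons are difficult.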
Gene selection for outcome prediction
A common but erroneous assumption is that quick scrutiny of these genes will reveal the underlying biological differences between the two groups of samples (for example, that examination of the list of genes differentially expressed between chemotherapy-sensitive and -resistant tumors will reveal why one group is sensitive to treatment and the other is not). This biological information is partly embedded in the gene list, but it may be difficult to recognize for both analytical and biological reasons. Differentially expressed gene lists are unstable, particularly when they are generated from small sets of samples and when each gene has only limited discriminating value. Different statistical methods applied to the same data yield distinct but overlapping gene lists, and the rank order of genes is particularly unstable.
Even when truly differentially expressed genes are identified, these genes may or may not contribute to the most important biological differences between the two sets of samples. Many of these genes may represent bystanders rather than playing a causative role in the biological differences between the groups. Differentially expressed gene lists are thus best considered to be hypothesis generating.
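The instability of differentially expressed gene lists can be illustrated with a small simulation (all parameters hypothetical): two independent studies of the same two classes, each with ten samples per group and many weakly discriminating genes, typically agree on only part of their top gene lists.

```python
import math
import random
import statistics

random.seed(0)
N_GENES, N_PER_GROUP, TOP_K = 500, 10, 20
# The first 50 genes carry a weak true difference; the rest are pure noise.
effect = [0.8 if g < 50 else 0.0 for g in range(N_GENES)]

def simulate_study():
    """Expression rows (samples x genes) for two groups of 10 samples each."""
    grp1 = [[random.gauss(effect[g], 1.0) for g in range(N_GENES)]
            for _ in range(N_PER_GROUP)]
    grp2 = [[random.gauss(0.0, 1.0) for g in range(N_GENES)]
            for _ in range(N_PER_GROUP)]
    return grp1, grp2

def top_genes(grp1, grp2):
    """Rank genes by a two-sample t-like statistic; return the top set."""
    scores = []
    for g in range(N_GENES):
        a = [row[g] for row in grp1]
        b = [row[g] for row in grp2]
        se = math.sqrt(statistics.variance(a) / len(a)
                       + statistics.variance(b) / len(b))
        scores.append((abs(statistics.mean(a) - statistics.mean(b)) / se, g))
    return {g for _, g in sorted(scores, reverse=True)[:TOP_K]}

list1 = top_genes(*simulate_study())
list2 = top_genes(*simulate_study())
overlap = len(list1 & list2)
print(overlap, "of", TOP_K, "top genes shared between the two studies")
```

The two "studies" sample the same underlying biology, yet their top-20 lists only partially overlap; with more samples per group or stronger per-gene effects, the agreement improves.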
Despite the widespread use of univariate significance screening methods to select gene sets for class prediction, these methods have not been rigorously compared with the more conventional optimal feature-selection methods that form part of the classification analysis. Gene sets composed of the few most individually significant genes do not necessarily have better predictive value than other sets, particularly if the data do not contain several individually strong predictors [28, 29]. In other words, genes that are not individually predictive may predict well when used in combination with others. However, the search for such combinations in array data is a formidable computational challenge, given the astronomical number of potential combinations to be examined and the large number of spurious associations that may be found.
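A contrived illustration of how individually uninformative genes can predict well in combination (an XOR-like pattern, not drawn from any real dataset):

```python
import random
import statistics

random.seed(1)
samples = []
for _ in range(200):
    g1, g2 = random.gauss(0, 1), random.gauss(0, 1)
    label = 1 if g1 * g2 > 0 else 0   # class depends only on the combination
    samples.append((g1, g2, label))

def univariate_mean_diff(idx):
    """Between-class difference in means for a single gene."""
    pos = [s[idx] for s in samples if s[2] == 1]
    neg = [s[idx] for s in samples if s[2] == 0]
    return abs(statistics.mean(pos) - statistics.mean(neg))

# Univariate screening sees almost no signal in either gene alone...
print(univariate_mean_diff(0), univariate_mean_diff(1))

# ...yet a rule on the two genes jointly classifies every sample correctly.
accuracy = sum((1 if g1 * g2 > 0 else 0) == y
               for g1, g2, y in samples) / len(samples)
print(accuracy)  # 1.0 by construction
```

Univariate screening would discard both genes, yet together they determine the class exactly; finding such pairs among tens of thousands of genes is the combinatorial problem described above.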
Validation of differentially expressed genes
With regard to the experimental data, the first two applications assume that a particular bright spot on the microarray indicates that one particular gene, corresponding to the spot, is up- (or down-) regulated. The third application assumes only that a spot on the microarray lights up differently between different groups of samples. For this classification application, it is of secondary importance whether the signal originates from a single gene or represents a composite signal, as long as it is consistently associated with one outcome group. A composite signal could arise from non-specific hybridization or cross-hybridization of the probe sequence with genes other than the one corresponding to the spot on the array. Consequently, validation of the expression of particular genes by an independent method, such as RT-PCR, is critical for the first two applications, whereas the most appropriate validation for the third application is testing the predictor on independent sets of cases.
An inherent limitation of DNA microarrays is that not all of the thousands of genes included on the array are expected to generate a perfectly specific and linear signal. A single hybridization condition is applied to tens of thousands of distinct nucleic acid hybridization reactions, resulting in suboptimal conditions for some of them. Some degree of non-specific hybridization and cross-hybridization for a substantial number of probes is therefore unavoidable. In general, the reported RT-PCR confirmation rate of microarray data for individual genes is approximately 70% [30, 31].
Class prediction
Because there are hundreds or thousands of times more genes than samples, selecting a gene subset that yields maximal predictive accuracy is challenging. Several gene sets can be found that predict well on the original data but fail to predict accurately on independent data. The explanation for this phenomenon, called overfitting, is that artifacts or noise particular to the original samples are fitted to yield more accurate predictions for those samples; because these features are not present in other data, their inclusion leads to a loss of accuracy on independent data.
One way to minimize overfitting is repeated cross-validation within the training data, which is used both to determine the best classifier and to estimate the predictive accuracy that can be expected when the predictor is applied to independent cases. During this process, a subset of cases is omitted from the training set; discriminating genes are identified from the remaining cases and a class predictor is constructed, which is then tested on the held-out cases [33]. This process is repeated many times, with different sets of cases left out on each occasion, and the predictive accuracies are averaged to yield an estimated error rate for each classifier. Typically, the classifier requiring the smallest number of genes to achieve an acceptable rate of correct classification is considered the best. To determine the true predictive accuracy of this optimized classifier, it must be tested on an independent set of data. What counts as acceptable predictive accuracy depends on the clinical outcome to be predicted. For example, high accuracy (>90%) would be required for any clinically useful predictor of prognosis in breast cancer, because misclassification could lead to a recommendation against potentially life-saving adjuvant therapy. On the other hand, lower accuracy may suffice for a response predictor developed to select one drug over another: most patients would choose a drug offering a 60% chance of benefit over one offering only a 30% chance, particularly if the toxicities are comparable.
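The need to repeat gene selection inside the cross-validation loop can be demonstrated with a simulation on pure noise; a nearest-centroid classifier stands in here for any learning method, and all parameters are hypothetical.

```python
import random
import statistics

random.seed(2)
N, N_GENES, TOP_K = 20, 1000, 10
labels = [i % 2 for i in range(N)]
# Pure noise: no gene is truly associated with the labels.
data = [[random.gauss(0, 1) for _ in range(N_GENES)] for _ in range(N)]

def pick_top(rows, ys):
    """Genes with the largest between-class mean difference."""
    scores = []
    for g in range(N_GENES):
        a = [r[g] for r, y in zip(rows, ys) if y == 1]
        b = [r[g] for r, y in zip(rows, ys) if y == 0]
        scores.append((abs(statistics.mean(a) - statistics.mean(b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:TOP_K]]

def predict(rows, ys, genes, x):
    """Nearest-centroid prediction on the selected genes."""
    def centroid(cls):
        return [statistics.mean(r[g] for r, y in zip(rows, ys) if y == cls)
                for g in genes]
    dist = {cls: sum((x[g] - m) ** 2 for g, m in zip(genes, centroid(cls)))
            for cls in (0, 1)}
    return min(dist, key=dist.get)

def loo_accuracy(select_inside_loop):
    leaked_genes = pick_top(data, labels)      # selected using ALL cases
    correct = 0
    for i in range(N):
        rows = data[:i] + data[i + 1:]
        ys = labels[:i] + labels[i + 1:]
        genes = pick_top(rows, ys) if select_inside_loop else leaked_genes
        correct += predict(rows, ys, genes, data[i]) == labels[i]
    return correct / N

biased = loo_accuracy(select_inside_loop=False)
honest = loo_accuracy(select_inside_loop=True)
print(f"selection outside the loop: {biased:.2f}, inside the loop: {honest:.2f}")
```

With no true signal, selecting genes on the full dataset before cross-validating typically inflates the apparent accuracy well above the honest estimate obtained when selection is repeated inside each fold.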
With any number of genes randomly included in a classifier, chance alone could produce a certain number of correct predictions. To assess whether the classification error rate differs significantly from what chance could produce, a random-label permutation test is performed [34]. During this process, the outcome label of each case (i.e. responder versus non-responder) is randomly reassigned. Hundreds of such datasets with randomly permuted labels are created, and the classifier generated from the true data is applied to them. The observed error rate for the true dataset is compared with the distribution of error rates observed in the randomly permuted datasets to calculate a permutation P value. A model that yields significantly correct predictions is taken forward for validation on independent data to determine its true predictive accuracy.
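The procedure can be sketched in a few lines; one hypothetical "gene" and a deliberately simple threshold rule stand in for a real multigene classifier.

```python
import random
import statistics

random.seed(3)
n = 30
labels = [i % 2 for i in range(n)]
# Responders (label 1) express the gene ~1.5 units higher on average.
expr = [random.gauss(1.5 if y else 0.0, 1.0) for y in labels]

def error_rate(values, ys):
    """Misclassification rate of a fixed rule: call 'responder' above the mean."""
    cut = statistics.mean(values)
    return sum((1 if v > cut else 0) != y for v, y in zip(values, ys)) / len(ys)

observed = error_rate(expr, labels)

# Recompute the error under many random relabelings of the same cases.
perm_errors = []
for _ in range(1000):
    shuffled = labels[:]
    random.shuffle(shuffled)          # breaks any true association
    perm_errors.append(error_rate(expr, shuffled))

# Permutation P value: how often chance labelings do at least as well.
p_value = sum(e <= observed for e in perm_errors) / len(perm_errors)
print(f"observed error {observed:.2f}, permutation P = {p_value:.3f}")
```

A small P value indicates that the classifier's error rate is unlikely to arise from chance label assignments alone; it says nothing about accuracy on independent cases, which still requires a separate validation set.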
An important feature of learning-based clinical outcome predictors is that the performance of the predictor may improve as the training sample size increases, until it reaches a plateau. Figure 1 illustrates computer-simulated learning curves of support vector machine (SVM)-based classifiers for several gene-expression datasets [35]. It shows how the estimated cross-validation classification error rates decrease as the training sample size increases. For some prediction problems, where the difference in gene-expression profiles between the groups to be separated is substantial, the learning curves are steep and relatively few samples may yield good predictors. For other prediction problems, large training samples are needed to yield a predictor that operates close to its plateau of accuracy. The simulation used existing, publicly available datasets as starting data and projected how the predictor would improve if it were trained on increasingly larger sets of cases [35].
Sample size calculations for multigene predictive marker discovery
A few other methods for sample-size calculation for supervised classification have also been reported. The approaches of Hwang et al. [40] and Fisher and van Belle [41] are not based on gene selection, but rather on global tests of significance between outcome groups; in both approaches, the global test uses Fisher's linear discriminant analysis. The method recently proposed by Mukherjee et al. [35] estimates the optimal sample size for discovery by assessing the statistical significance of classification performance and extrapolating from preliminary results to larger sample sizes. This is an appealing strategy that exploits the assumption that the predictive accuracy of learning algorithm-based predictors improves as they are trained on progressively larger sample sets until it reaches a plateau. These authors fit inverse power-law models to the preliminary data to predict, for a given classifier, how quickly classification performance increases with sample size (Figure 1).
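A minimal sketch of this extrapolation idea, assuming the common inverse power-law form err(n) = a * n^(-b) + c; the pilot error rates below are hypothetical, and the fitting scheme (grid search over the plateau, then log-linear least squares) is one simple choice among several.

```python
import math

pilot = [(10, 0.42), (20, 0.33), (40, 0.27), (80, 0.23)]  # (training size, error)

def fit_inverse_power_law(points):
    """Grid-search the plateau c; fit log(err - c) = log(a) - b*log(n) by OLS."""
    best = None
    for c in (i / 200.0 for i in range(40)):          # c from 0.000 to 0.195
        if any(e <= c for _, e in points):
            continue
        xs = [math.log(n) for n, _ in points]
        ys = [math.log(e - c) for _, e in points]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        b = -slope
        a = math.exp(my + b * mx)
        sse = sum((a * n ** (-b) + c - e) ** 2 for n, e in points)
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]

a, b, c = fit_inverse_power_law(pilot)
projected = a * 320 ** (-b) + c   # extrapolated error rate at n = 320
print(f"fitted plateau c = {c:.3f}, projected error at n = 320: {projected:.3f}")
```

The projection answers the practical question posed above: given a pilot learning curve, roughly how much additional accuracy can be bought by collecting more training cases, and where does the curve flatten out.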
Study design for predictive marker validation
A higher level of evidence for the utility of a new predictive test may be generated through randomized trials that simultaneously address response rates in marker-selected and unselected patients and assess the treatment specificity of the predictor. A definitive validation study for a response predictor may randomize patients either to receive treatment only if the marker is positive or to receive therapy regardless of marker status [42]. In the case of a marker that predicts response to an existing chemotherapy, unselected use of the treatment represents the current standard of care. This two-arm design examines clinical utility directly, since comparison between the marker-selected and unselected trial arms can determine to what extent patient selection improves outcome compared with unselected use of the same treatment. The sample size for this design would be similar to that of conventional phase III clinical trials. Early stopping rules to halt the trial if the predictor performs too well or too poorly could be incorporated.
A different design may be necessary to assess whether the predictor is treatment-specific or merely a general marker of response to (any) cytotoxic therapy. Here the primary trial objective is to assess the interaction between a marker and the response to two or more different types of chemotherapy. Stratified randomization into two or more treatment arms on the basis of marker results is an appealing strategy. A study with two treatment arms may be designed as follows: the response marker is determined at the time of the patient's entry into the study, and each patient is assigned an expected outcome (e.g. response or no response to drug A, for which the marker was developed). The primary objective of the study would be to establish whether patients who test positive for the marker are significantly more likely to respond to drug A than to drug B. A secondary objective would be to show that patients who test negative for the marker do not benefit from drug A as much as marker-positive patients do, and that for these patients the alternative therapy with drug B may be the preferred treatment. Power and sample size calculations can be based on assumptions about the prevalence of the marker among the patients and the rates of response of their tumors to drugs A and B. For example, if the prevalence of marker positivity is 25% and the rate of tumor response to drug A is 60%, and we assume that the rate of response to treatment B in the same patients is only 20%, then a total sample size of 210 patients would be needed for 98% power to demonstrate that treatment A is superior to treatment B for marker-positive patients. This study could also be conducted by limiting treatment to marker-positive patients only; however, with that design, the negative predictive value (NPV) of the test could not be evaluated, nor could the comparative efficacy of the alternative therapy (i.e. drug B) in the patients predicted to respond poorly.
Including early discontinuation rules may be the preferred option to minimize the exposure of marker-negative patients to ineffective drugs.
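The power calculation in the worked example above can be approximated with a standard normal-approximation formula for comparing two response proportions. This sketch is illustrative only: the exact method and assumptions behind the quoted 98% figure are not specified in the text, and different tests or one-sided formulations would give different numbers.

```python
import math
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-sample test of proportions
    (normal approximation with unpooled variance)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)                 # critical value
    se = math.sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    return nd.cdf(abs(p1 - p2) / se - z_alpha)

# Scenario from the text: 210 patients, 25% marker prevalence, and response
# rates of 60% (drug A) vs 20% (drug B) among marker-positive patients.
total, prevalence = 210, 0.25
n_positive_per_arm = int(total * prevalence / 2)        # about 26 per arm
power = power_two_proportions(0.60, 0.20, n_positive_per_arm)
print(f"{n_positive_per_arm} marker-positive patients per arm, power = {power:.2f}")
```

The same function can be run over a range of prevalences and response rates to see how sensitive the required trial size is to the planning assumptions.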
Gene-expression profiling has demonstrated in several proof-of-principle studies that multigene signatures can predict important clinical outcomes and therefore have the potential to evolve into true diagnostic tests. Some are concerned that moving these observations into a validation phase may be premature, because the technology itself is constantly evolving and new, improved gene sets to predict a particular outcome are identified at regular intervals [43-46]. Technological development will never stop, but this need not discourage attempts to validate markers already discovered. The true clinical utility of any proposed gene marker set can only be established through independent validation. It is possible, and in fact probable, that several distinct gene sets measured by different profiling platforms may predict a given clinical outcome equally well [43-46]. It is also possible that, as larger and larger patient populations are used for discovery and training of multigene predictors, second-, third- and fourth-generation predictors will emerge with steadily increasing predictive accuracy. Although this process will lead to increased competition among aspiring diagnostic companies (and academic laboratories), if truly useful predictors are discovered, patients will ultimately benefit. High-throughput genomic (and proteomic) technologies represent perhaps one of the most exciting opportunities in diagnostic medicine since the discovery of monoclonal antibodies. However, their true diagnostic value can only be established expeditiously through a series of well-designed marker discovery and validation studies.
Received for publication March 23, 2004. Revision received June 15, 2004. Accepted for publication June 18, 2004.
References

2. Hortobagyi GN, Hayes D, Pusztai L. Integrating newer science into breast cancer prognosis and treatment. Molecular predictors and profiles. ASCO Annual Meeting Summaries. Alexandria (VA): American Society of Clinical Oncology 2002; 191-202.
3. Ramaswamy S, Golub TR. DNA microarrays in clinical oncology. J Clin Oncol 2002; 20: 1932-1941.
4. de Bolle X, Bayliss CD. Gene expression technology. Methods Mol Med 2003; 71: 135-146.
5. Ali TR, Li MS, Langford PR. Monitoring gene expression using DNA arrays. Methods Mol Med 2003; 71: 119-134.
6. Walker SJ, Worst TJ, Vrana KE. Semiquantitative real-time PCR for analysis of mRNA levels. Methods Mol Med 2003; 79: 211-227.
7. Paik S. Incorporating genomics into the cancer clinical trial process. Semin Oncol 2001; 28: 305-309.
8. Simon R, Altman DG. Statistical aspects of prognostic factor studies in oncology. Br J Cancer 1994; 69: 979-985.
9. Sotiriou C, Powles TJ, Dowsett M et al. Gene expression profiles derived from fine needle aspiration correlate with response to systemic chemotherapy in breast cancer. Breast Cancer Res 2002; 4: R3.
10. Assersohn L, Gangi L, Zhao Y et al. The feasibility of using fine needle aspiration from primary breast cancers for cDNA microarray analyses. Clin Cancer Res 2002; 8: 794-801.
11. Pusztai L, Ayers M, Stec J et al. Gene expression profiles obtained from single passage fine needle aspirations (FNA) of breast cancer reliably identify prognostic/predictive markers such as estrogen (ER) and HER-2 receptor status and reveal large scale molecular differences between ER-negative and ER-positive tumors. Clin Cancer Res 2003; 9: 2406-2415.
12. Ma X-J, Wang W, Salunga R et al. Gene expression signatures associated with clinical outcome in breast cancer via laser capture microdissection. Breast Cancer Res Treat 2003; 82 (Suppl 1): S15 (Abstr 29).
13. Baunoch D, Moore M, Reyes M et al. Microarray analysis of formalin-fixed paraffin-embedded tissue: the development of a gene expression staging system for breast carcinoma. Breast Cancer Res Treat 2003; 82 (Suppl 1): S116 (Abstr 474).
14. Paik S, Shak S, Tang G et al. Multi-gene RT-PCR assay for predicting recurrence in node negative breast cancer patients: NSABP studies B-20 and B-14. Breast Cancer Res Treat 2003; 82 (Suppl 1): S10 (Abstr 16).
15. Esteva FJ, Sahin AA, Coombes K et al. Multi-gene RT-PCR assay for predicting recurrence in node negative breast cancer patients: M. D. Anderson Clinical Validation Study. Breast Cancer Res Treat 2003; 82 (Suppl 1): S11 (Abstr 17).
16. Symmans WF, Ayers M, Clark E et al. Fine needle aspiration and core needle biopsy samples of breast cancer provide similar total RNA yield, but different stromal gene expression profiles. Cancer 2003; 97: 2960-2971.
17. Altman DG, Lyman GH. Methodological challenges in the evaluation of prognostic factors in breast cancer. Breast Cancer Res Treat 1998; 52: 289-303.
18. King HC, Sinha AA. Gene expression profile analysis by DNA microarrays: promise and pitfalls. JAMA 2001; 286: 2280-2288.
19. Miller LD, Long PM, Wong L et al. Optimal gene expression analysis by microarrays. Cancer Cell 2002; 2: 353-361.
20. Rhodes A, Jasani B, Anderson E et al. Evaluation of HER-2/neu immunohistochemical assay sensitivity and scoring on formalin-fixed and paraffin-processed cell lines and breast tumors: a comparative study involving results from laboratories in 21 countries. Am J Clin Pathol 2002; 118: 408-417.
21. Rhodes A, Jasani B, Barnes DM et al. Reliability of immunohistochemical demonstration of estrogen receptors in routine practice: interlaboratory variance in the sensitivity of detection and evaluation of scoring systems. J Clin Pathol 2000; 53: 125-130.
22. Ambros IM, Benard J, Boavida M et al. Quality assessment of genetic markers used for therapy stratification. J Clin Oncol 2003; 21: 2077-2084.
23. Liu ET. Molecular oncodiagnostics: where we are and where we need to go. J Clin Oncol 2003; 21: 2052-2055.
24. Kuo WP, Jenssen TK, Butte AJ et al. Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 2002; 18: 405-412.
25. Ringner M, Peterson C, Khan J. Analyzing array data using supervised methods. Pharmacogenomics 2002; 3: 403-415.
26. Simon R, Radmacher MD, Dobbin K et al. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003; 95: 14-18.
27. Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003; 4: 210.
28. Goldberg DE. Genetic Algorithms in Search, Optimization and Machine Learning. New York: Addison-Wesley 1989.
29. Li L, Weinberg CR, Darden TA, Pedersen LG. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 2001; 17: 1131-1142.
30. Rajeevan MS, Vernon SD, Taysavang N et al. Validation of array-based gene expression profiles by real-time (kinetic) RT-PCR. J Mol Diagn 2001; 3: 26-31.
31. Taniguchi M, Miura K, Iwao H et al. Quantitative assessment of DNA microarrays: comparison with Northern blot analyses. Genomics 2001; 71: 34-39.
32. Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. J Comput Biol 2002; 9: 505-511.
33. Shao J. Linear model selection by cross-validation. J Am Stat Assoc 1993; 88: 486-494.
34. Good PI. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. New York: Springer-Verlag 1994.
35. Mukherjee S, Tamayo P, Rogers S et al. Estimating dataset size requirements for classifying DNA microarray data. J Comput Biol 2003; 10: 119-142.
36. Simon R, Radmacher MD, Dobbin K. Design of studies using DNA microarrays. Genet Epidemiol 2002; 23: 21-36.
37. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc (Ser B) 1995; 57: 289-300.
38. Hatfield GW, Hung S, Baldi P. Differential analysis of DNA microarray gene expression data. Mol Microbiol 2003; 47: 871-877.
39. Pounds S, Morris SW. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of P-values. Bioinformatics 2003; 19: 1236-1242.
40. Hwang D, Schmitt WA, Stephanopoulos G et al. Determination of minimum sample size and discriminatory expression patterns in microarray data. Bioinformatics 2002; 18: 1184-1193.
41. Fisher LD, van Belle G. Sample size calculations in selecting continuous variables to discriminate between populations. In Fisher LD, van Belle G (eds): Biostatistics: A Methodology for the Health Sciences. New York: Wiley 1993; 851-858.
42. Sargent D, Allegra C. Issues in clinical trial design for tumor marker studies. Semin Oncol 2002; 3: 222-230.
43. Lossos IS, Czerwinski DK, Alizadeh AA et al. Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. N Engl J Med 2004; 350: 1828-1837.
44. Alizadeh AA, Eisen MB, Davis RE et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000; 403: 503-511.
45. Rosenwald A, Wright G, Chan WC et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 2002; 346: 1937-1947.
46. Shipp MA, Ross KN, Tamayo P et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 2002; 8: 68-74.