Clinical trial design for microarray predictive marker discovery and assessment

L. Pusztai1,* and K. R. Hess2

Departments of 1 Breast Medical Oncology and 2 Biostatistics, The University of Texas M. D. Anderson Cancer Center, Houston, TX, USA

* Correspondence to: Dr L. Pusztai, Department of Breast Medical Oncology, Unit 424, The University of Texas M. D. Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, TX 77030-4009, USA. Tel: +1-713-792-2817; Fax: +1-713-794-4385; Email: lpusztai@mdanderson.org


    Abstract

Transcriptional profiling technologies that simultaneously measure the expression of thousands of mRNA species represent a powerful new clinical research tool. Like earlier laboratory analytical methods, including immunohistochemistry, PCR and in situ hybridization, this new technology may also find its niche in routine diagnostics. Outcome predictors discovered by these methods may be quite different from previous single-gene markers: these novel tests will probably combine the information embedded in the expression of multiple genes with mathematical prediction algorithms to formulate classification rules and predict outcome. The performance of diagnostic tests based on machine-learning algorithms may improve as they are trained on larger and larger sets of samples, and several generations of tests with improving accuracy may be introduced sequentially. Several gene-expression profiling platforms are mature enough for clinical testing. The most important next step for further progress is the development and validation of multigene predictors in prospectively designed clinical trials to determine the true accuracy and clinical value of this new technology. This manuscript reviews methodological and statistical issues relevant to the design of clinical trials to discover and validate multigene predictors of response to therapy.

Key words: clinical trials, microarrays, gene-expression profiling, predictive markers, multigene predictors


    Introduction

Decades of extensive research have yielded few clinically useful single molecular markers predictive of response to chemotherapy in patients with cancer [1, 2]. With the advent of high-throughput genomic technologies it is now possible to survey the expression of a large number of genes simultaneously in cancer tissue [3–6]. It is hypothesized that the pretreatment gene expression profile of a cancer holds information about its sensitivity to chemotherapy, that this information can be extracted by transcriptional profiling, and that multigene predictors of response can be developed through mathematical analysis of the data [7]. However, the true accuracy and clinical value of these novel predictive tests are yet to be defined in prospectively designed marker discovery and validation trials.

It may be useful to think of marker discovery studies as conceptually similar to the clinical trials that lead to the introduction of new drugs. The hallmark of clinical drug development is the multistage trial process, and a similarly focused, prospective, multistage evaluation of genomic markers could facilitate the introduction of new diagnostic markers into the clinic [8]. Phase I–II marker discovery studies would be expected to show that a technology can be reliably and reproducibly applied to clinical specimens and that the estimated predictive accuracy of the proposed test falls within a range considered clinically useful. Phase III marker validation studies would then evaluate the predictor in a larger number of cases to demonstrate that clinical outcome is better when the new marker is used for decision making than under the current standard, which may be a recommendation based on another marker or on no marker at all.


    Tissue sampling

Gene expression profiling with DNA microarrays is best performed on fresh or frozen tissue, because the accuracy of the results depends on good-quality RNA. Many investigators use excisional biopsy specimens; however, core needle biopsies or fine needle aspiration of cancer can also yield sufficient amounts of RNA for microarray experiments [9–11]. Instantaneous RNA preservation can be achieved by collecting biopsy specimens into a one-step RNA-preserving reagent (RNAlater; Ambion, Austin, TX, USA). Recently, it has also been shown that reverse transcription-PCR (RT–PCR) and DNA microarray profiling can be performed on RNA extracted from formaldehyde-fixed, paraffin-embedded tissues [12–15]. These technological advances give clinical investigators a wide choice of tissue-sampling methods to suit their trial design. It is important to realize, however, that different sampling methods yield materials with different cellular composition and can produce distinct gene expression profiles [16]. The tissue-sampling technique should be selected to suit the experimental purpose and kept uniform throughout the study.


    Gene expression data

Reproducible measurements are key to developing reliable predictors of outcome. Measurement error leads to underestimation of the effect of a predictive marker and may also lower the power to detect any predictive effects [17]. Array data can be sensitive to many sources of variation in sample procurement, RNA extraction, array manufacture, hybridization protocol, scanning protocol, image quantitation and initial data processing [18, 19]. These quality control issues are not unique to microarray experiments, however; other molecular diagnostic methods, including immunohistochemistry, RT–PCR, fluorescence in situ hybridization and flow cytometry, are all subject to a large number of potential errors [20–23]. Nonetheless, it is important to determine the major sources of variability of gene-expression profiling in well-controlled intra- and inter-laboratory and cross-platform replicate experiments.

Gene-expression profiling, particularly if performed by a high-volume central laboratory, may in fact lend itself to greater quality control than can be achieved with many current molecular diagnostic tests, which are most commonly performed by end users. RNA quality and quantity can be measured, and the efficiency of the reverse transcription reaction, probe labeling and hybridization can be monitored. Arrays routinely contain several types of positive and negative controls embedded in the array matrix. Global statistics can be applied to the gene-expression data to compare any new result with existing profiles in a previous dataset, enabling investigators to flag results that fall beyond acceptable limits of variation.

A quality-assured, properly normalized gene-expression dataset is a data matrix with a row for each sample and a column for each gene. Several standard algorithms and mathematical methods are available for analyzing such datasets (see the sections on gene selection and class prediction below). A unique challenge for investigators using microarray data is how to apply observations made on one platform to data generated on another. Different arrays contain different sets of genes, and even for the same set of genes, different oligonucleotide sequences may be used as probes, which can result in variable signal intensity. Investigators who present their gene-expression results as the ratio of test samples to a common reference sample often use a different reference from laboratory to laboratory. Furthermore, different normalization methods and signal-detection techniques yield different numeric results, even for the same raw data. Not surprisingly, cross-platform validation of results has proved difficult [24].
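As a concrete illustration of what normalizing such a data matrix can involve, the sketch below applies quantile normalization, one widely used scheme among the many alluded to above, to a toy samples-by-genes matrix. It is a minimal numpy illustration with invented data and an invented function name, not the pipeline of any particular platform discussed here.

import numpy as np

def quantile_normalize(X):
    # Quantile-normalize a samples-x-genes matrix so that every sample
    # (row) shares the same empirical distribution of expression values.
    ranks = X.argsort(axis=1).argsort(axis=1)       # rank of each gene within its sample
    mean_by_rank = np.sort(X, axis=1).mean(axis=0)  # average expression at each rank
    return mean_by_rank[ranks]

# toy example: 4 samples x 6 genes of log2 intensities
rng = np.random.default_rng(0)
X = rng.normal(8.0, 2.0, size=(4, 6))
Xn = quantile_normalize(X)
print(np.sort(Xn, axis=1))  # every row now has identical sorted values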


    Gene selection for outcome prediction

The fundamental goal of gene-expression-profile-based prediction is to identify a set of informative genes and develop a class-prediction algorithm that formulates a rule, based on the expression levels of the informative genes, by which to categorize cases into outcome groups [25, 26]. The process often begins with the identification of differentially expressed genes in two groups of samples. This step is commonly performed by running parallel univariate tests of significance to compare the expression level of each gene between the groups [27]. Significance levels are often adjusted to account for multiple testing, or other methods are used to limit the false-discovery rate (FDR). Results are typically presented as a list of genes ranked by significance level, and attention is usually focused on the top few differentially expressed genes.
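To make this screening step concrete, the following sketch runs parallel per-gene t-tests and applies the Benjamini-Hochberg step-up adjustment [37] to limit the FDR. The simulated data and function name are our own, and real analyses would add the refinements discussed below.

import numpy as np
from scipy import stats

def screen_genes(X, y, alpha=0.05):
    # Per-gene two-sample t-tests between outcome groups, with
    # Benjamini-Hochberg adjustment to control the false-discovery rate.
    # X: samples x genes matrix; y: binary outcome labels (0/1).
    g0, g1 = X[y == 0], X[y == 1]
    t, p = stats.ttest_ind(g0, g1, axis=0)        # one test per gene (column)
    order = np.argsort(p)                         # genes ranked by significance
    m = len(p)
    # BH step-up rule: largest k with p_(k) <= (k/m) * alpha
    bh_crit = (np.arange(1, m + 1) / m) * alpha
    passed = p[order] <= bh_crit
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    return order[:k]                              # indices of "discovered" genes

# toy data: 40 samples, 500 genes, first 10 genes truly shifted in responders
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 500))
y = np.repeat([0, 1], 20)
X[y == 1, :10] += 1.5
print(screen_genes(X, y))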

A common and erroneous assumption is that a quick scrutiny of these genes will reveal the underlying biological differences between the two groups of samples (for example, that examination of the list of differentially expressed genes between chemotherapy-sensitive and -resistant tumors will reveal why one group is sensitive to treatment and the other is not). This biological information is partly embedded in the gene list, but may be difficult to recognize for both analytical and biological reasons. Differentially expressed gene lists are unstable, particularly when they are generated from small sets of samples and when each gene has only limited discriminating value. Different statistical methods applied to the same data yield distinct but overlapping gene lists, and the rank orders of genes are particularly unstable.

Even when truly differentially expressed genes are identified, these genes may or may not contribute to the most important biological differences between the two sets of samples. Many of these genes may represent bystanders rather than playing a causative role in the biological differences between the groups. Differentially expressed gene lists are thus best considered to be hypothesis generating.

Despite the widespread use of univariate significance screening to select gene sets for class prediction, these methods have not been rigorously compared with the more conventional optimal feature-selection methods that form part of the classification analysis. Gene sets composed of the few most individually significant genes do not necessarily have better predictive value than other sets, particularly if the data do not contain several individually strong predictors [28, 29]. In other words, genes that are not individually predictive may predict well when used in combination with others. However, the search for such combinations in array data is a formidable computational challenge, given the astronomical number of potential combinations to be examined and the large number of spurious associations that may be found.


    Validation of differentially expressed genes

The appropriate validation of a microarray experiment depends on the hypothesis being tested. Gene-expression profiling is commonly used for three different purposes: first, as a screening tool to identify individual genes of interest that might contribute to an important biological function; second, to obtain insight into complex biological processes by examining thousands of genes simultaneously; and third, as a classification tool to sort cases into clinically important categories.

With regard to the experimental data, the first two applications assume that a particular bright spot on the microarray indicates that one particular gene, corresponding to the spot, is up- (or down-) regulated. The third application assumes that a spot on the microarray lights up differently between different groups of samples. For this latter classification application, it is of secondary importance whether the signal originates from a single gene or results from a composite signal, as long as it is consistently associated with one outcome group. A composite signal could arise from non-specific or cross-hybridization of the probe sequence with genes other than the one corresponding to the spot on the array. Consequently, validation of the expression of particular genes by using different methods, such as RT–PCR, is critical for the first two applications, whereas the most appropriate validation for the third application is testing the predictor on independent sets of cases.

An inherent limitation of DNA microarrays is that not all of the thousands of genes included on the array are expected to generate a perfectly specific and linear signal. A single hybridization condition is applied to tens of thousands of distinct nucleic acid hybridization reactions, resulting in suboptimal conditions for some reactions. Some degree of non-specific and cross-hybridization for a substantial number of probes is unavoidable. In general, the reported RT–PCR confirmation rate of microarray data for individual genes is ~70% [30, 31].


    Class prediction

Class-prediction analysis typically involves supervised classification, in which statistical learning algorithms formulate the classification rules that connect gene expression profiles to observed patient outcomes [26, 32]. Many methods have been used for this purpose, including linear discriminant analysis, support vector machines (SVM), neural networks, k-nearest neighbors and recursive partitioning. In general, these methods are trained on cases with known outcome (the training set) and combine the expression values of multiple genes to predict the outcome of new cases (the test set). The multiple genes entered into these algorithms are either picked from differentially expressed gene lists, as described above, or selected as part of the classification analysis.

Because there are hundreds or thousands of times more genes than samples, the selection of a gene subset that yields maximal prediction is challenging. Several sets of genes can be selected that predict well on the original data but fail to predict accurately on independent data. The explanation for this phenomenon, called overfitting, is that the classifier fits artifacts or noise peculiar to the original samples; because these features are absent from other data, their inclusion degrades accuracy on independent cases.

One way to minimize overfitting is repeated cross-validation within the training data, both to determine the best classifier and to estimate the predictive accuracy that can be expected when the predictor is applied to independent cases. During this process, a subset of cases is omitted from the training set; discriminating genes are identified from the remaining cases and a class predictor is constructed, which is then tested on the held-out cases [33]. This process is repeated many times, with different sets of cases left out on each occasion, and the predictive accuracies are averaged to yield an estimated error rate for each classifier. Typically, the classifier requiring the fewest genes to achieve an acceptable rate of correct classification is considered the best. To determine the true predictive accuracy of this optimized classifier, it must be tested on an independent set of data. What counts as acceptable predictive accuracy depends on the clinical outcome to be predicted. For example, high (>90%) accuracy would be required of any clinically useful predictor of prognosis for breast cancer, because misclassification could lead to recommending against potentially life-saving adjuvant therapy. On the other hand, lower accuracy may suffice for a response predictor developed to select one drug over another: most patients would choose a drug offering a 60% chance of benefit over one offering only a 30% chance, particularly if the toxicities are comparable.
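The following scikit-learn sketch illustrates this loop under simplifying assumptions (simulated data, a linear SVM, univariate gene selection). Note that gene selection sits inside the pipeline, so it is repeated on every training fold, which is essential to avoid the information leakage described above.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2000))                  # 60 cases x 2000 genes
y = np.repeat([0, 1], 30)                        # responder / non-responder
X[y == 1, :25] += 0.8                            # 25 weakly informative genes

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for k in (10, 25, 50, 100):                      # candidate gene-set sizes
    # gene selection lives inside the pipeline, so it is redone on every
    # training fold; selecting genes on the full dataset first would leak
    # information into the held-out cases and overstate accuracy
    clf = make_pipeline(SelectKBest(f_classif, k=k), SVC(kernel="linear"))
    acc = cross_val_score(clf, X, y, cv=cv).mean()
    print(f"{k:4d} genes: estimated accuracy = {acc:.2f}")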

With any number of genes randomly included in a classifier, chance alone will produce a certain number of correct predictions. To assess whether the classification error rate differs significantly from what chance could produce, a random-label permutation test is performed [34]. During this process, the outcome label of each case (i.e. responder versus non-responder) is randomly reassigned. Hundreds of such datasets with randomly permuted labels are created, and the classifier generated from the true data is applied to each of them. The observed error rate for the true dataset is compared with the distribution of error rates observed on the permuted datasets to calculate a permutation P value. A model that yields significantly better-than-chance predictions is taken forward for validation on independent data to determine its true predictive accuracy.
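A minimal sketch of such a permutation test follows; the toy data in the usage stanza are invented, and scikit-learn's built-in permutation_test_score offers an equivalent ready-made routine.

import numpy as np
from sklearn.model_selection import cross_val_score

def permutation_p_value(clf, X, y, cv, n_perm=500, seed=0):
    # Compare the cross-validated error on the true labels with the error
    # distribution obtained after randomly permuting the outcome labels.
    rng = np.random.default_rng(seed)
    true_err = 1.0 - cross_val_score(clf, X, y, cv=cv).mean()
    perm_err = np.empty(n_perm)
    for i in range(n_perm):
        y_perm = rng.permutation(y)               # relabel cases at random
        perm_err[i] = 1.0 - cross_val_score(clf, X, y_perm, cv=cv).mean()
    # fraction of permutations doing at least as well as the true labels
    return (np.sum(perm_err <= true_err) + 1) / (n_perm + 1)

if __name__ == "__main__":
    from sklearn.svm import SVC
    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 200))                # 40 cases x 200 genes
    y = np.repeat([0, 1], 20)                     # responder labels
    X[y == 1, :10] += 1.0                         # 10 informative genes
    print(permutation_p_value(SVC(kernel="linear"), X, y, cv=5, n_perm=200))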

An important feature of learning-based clinical outcome predictors is that performance may improve as the training sample size increases, until it reaches a plateau. Figure 1 illustrates computer-simulated learning curves of SVM-based classifiers for different gene-expression datasets [35]. It shows how the estimated cross-validation classification error rate decreases as the training sample size increases. For some prediction problems, where the difference in gene-expression profiles between the groups to be separated is substantial, the learning curves are steep and relatively few samples may yield good predictors. For other prediction problems, large training sets are needed to yield a predictor that operates close to its plateau of accuracy. The simulation used existing, publicly available datasets as starting data and projected how the predictor would improve if it were trained on increasingly larger sets of cases [35].



Figure 1. Simulated cross-validation classification error rates of a support vector machine learning algorithm as a function of training sample size. The different lines represent calculated learning curves for predictors generated from eight different gene-expression datasets in the literature. The results show that prediction accuracy improves as the size of the training set increases. However, some prediction problems are easier and have steep learning curves, with rapid improvement that reaches a plateau of performance after only a few dozen cases; examples include distinguishing cancer from normal tissue or acute lymphoblastic from acute myeloid leukemia. Other prediction problems, such as predicting clinical outcome within a given cancer type, are more difficult and require larger training sets because the gene-expression profiles of the two groups to be separated are rather similar. The figure was re-plotted from results presented by Mukherjee et al. [35].

 

    Sample size calculations for multigene predictive marker discovery

To discover a predictive marker for a given treatment, a single-arm study design may be sufficient. A simple strategy is to base sample-size calculations on the number needed to ensure adequate power for the univariate screening of discriminating genes; in other words, how many training samples are needed to reliably identify an individually predictive gene? If we can assume that the array data are approximately normally distributed on some scale, then we can use standard two-sample testing methods to perform sample-size calculations [36]. The number of patients needed for a t-test to have adequate power to detect differential expression of a gene between responders and non-responders depends on the type I (α) and type II (β) error rates, the level of inter-patient variability, the size of the difference in mean expression values and the prevalence of response among patients. Differences in the mean expression value of a gene between responders and non-responders can be specified in terms of the standardized effect size (SES), the mean difference in expression values between the two groups divided by the standard deviation (SD). One can perform sample-size calculations by assuming a given prevalence of response (i.e. response rate) and by specifying acceptable α and β error rates. For example, for two-sided α=1% and β=10%, and assuming a 10% response rate, we would need a total of 96 patients to detect an SES of 1 or greater for any particular gene, 170 patients to detect an SES of 0.75 and 381 patients to detect an SES of 0.5 (Table 1). Adjusting for multiple comparisons by using a smaller α value would increase the sample size.
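The arithmetic behind such calculations can be sketched with a normal approximation, as below. The exact entries of Table 1 depend on the test formulation assumed (one- versus two-sided, t- versus normal distribution); under this simple approximation, the quoted figures of 96, 170 and 381 happen to correspond to a one-sided 5% test at 90% power.

import math
from scipy.stats import norm

def total_n(ses, response_rate, alpha=0.05, power=0.90, two_sided=False):
    # Normal-approximation total sample size for detecting a difference in
    # one gene's mean expression between responders and non-responders.
    # ses: standardized effect size; group sizes are r*N and (1-r)*N.
    z_a = norm.ppf(1 - alpha / 2 if two_sided else 1 - alpha)
    z_b = norm.ppf(power)
    r = response_rate
    return math.ceil((z_a + z_b) ** 2 / (ses ** 2 * r * (1 - r)))

# a one-sided 5% test at 90% power reproduces the 96/170/381 quoted above;
# the exact assumptions behind Table 1 may differ (e.g. a t- vs normal test)
for ses in (1.0, 0.75, 0.5):
    print(ses, total_n(ses, response_rate=0.10))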


Table 1. Sample size required to discover differentially expressed genes on the basis of standardized effect size and response rate

 
Using this approach to determine the sample size needed for discovery is problematic on at least two counts: (i) it requires specification of the SD, which can vary from gene to gene; and (ii) it can lead to a high FDR (the proportion of genes identified as significantly different when, in fact, they are not), because there are many times more variables (genes) than samples in each group. Preliminary gene-expression data from 15–20 cases in each outcome group can be used to estimate the SD for each gene. Several methods have been developed to control the FDR within pre-specified levels [37–39]. One such method treats the distribution of P values computed for all candidate genes as a mixture of two distributions, one from truly differentially expressed genes and one from non-differentially expressed genes [39]. Computing the parameters of these distributions and specifying a P value threshold for declaring significance allows the FDR to be estimated. One can specify a subjectively acceptable FDR threshold and find the P value threshold that leads to that FDR.
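The sketch below illustrates the idea of choosing a P value threshold to achieve a target FDR. For simplicity it estimates the proportion of null genes from the upper tail of the P value distribution (a Storey-type estimate) rather than fitting the full mixture model of [39]; the function names are ours.

import numpy as np

def estimate_fdr(pvals, threshold, lam=0.5):
    # Estimate the FDR incurred by calling genes with p <= threshold
    # significant. The fraction of truly null genes (pi0) is estimated
    # from p-values above lam, which are assumed to be mostly null.
    p = np.asarray(pvals)
    m = p.size
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))  # null p-values are uniform
    n_called = max(1, int(np.sum(p <= threshold)))
    expected_false = pi0 * m * threshold            # nulls expected below cutoff
    return expected_false / n_called

def threshold_for_fdr(pvals, target_fdr):
    # Largest p-value cutoff whose estimated FDR stays within the target.
    passing = [t for t in np.sort(np.asarray(pvals))
               if estimate_fdr(pvals, t) <= target_fdr]
    return passing[-1] if passing else None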

A few other methods of sample-size calculation for supervised classification have also been reported. The approaches of Hwang et al. [40] and Fisher and van Belle [41] are not related to gene selection but rather are based on global tests of significance between outcome groups; in both, the global test is based on Fisher's linear discriminant analysis. The method recently proposed by Mukherjee et al. [35] estimates the optimal sample size for discovery by assessing the statistical significance of classification performance and using preliminary results to extrapolate classification results to larger sample sizes. This is an appealing strategy that exploits the assumption that the accuracy of learning-algorithm-based predictors will improve as they are trained on larger and larger sets of samples until a plateau is reached. These authors fit inverse power-law models to the preliminary data to predict, for a given classifier, how fast classification performance increases with sample size (Figure 1).
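A minimal version of such an extrapolation can be sketched with a generic curve fit. The pilot error rates below are invented for illustration, and this is only in the spirit of, not a reproduction of, the cited method.

import numpy as np
from scipy.optimize import curve_fit

# error(n) = a * n**(-b) + c : an inverse power law whose asymptote c is
# the plateau error the classifier approaches with unlimited training data
def power_law(n, a, b, c):
    return a * n ** (-b) + c

# hypothetical pilot results: cross-validated error at several training sizes
n_obs = np.array([10, 20, 30, 40, 60])
err_obs = np.array([0.38, 0.27, 0.22, 0.19, 0.16])

params, _ = curve_fit(power_law, n_obs, err_obs, p0=(1.0, 0.5, 0.1),
                      bounds=([0, 0, 0], [np.inf, 2, 1]))
for n in (100, 200, 400):
    print(f"projected error at n={n}: {power_law(n, *params):.3f}")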


    Study design for predictive marker validation

Once a candidate predictor has been identified and its predictive accuracy estimated, the goals of an independent validation study are to: (i) define the sensitivity, specificity and positive (PPV) and negative (NPV) predictive values with greater precision; and (ii) prove the clinical utility of the test. Different trial designs may be needed for different clinical situations, and there may be no single best design for a particular clinical scenario; several designs could yield complementary information (Figure 2). An important question for a predictive marker validation study is whether the response rate is higher (and how much higher) in the group predicted to respond than in unselected patients, who may represent the current standard of care (in the case of chemotherapy, for example). Single-arm validation studies may be designed to address this issue, and the sample size can be computed from the known response rate in unselected patients and the estimated sensitivity and PPV of the proposed test. For example, we may assume that the response rate in unselected patients is 30% and that the PPV of the test is 60%, which indicates a two-fold greater chance of response in patients who test positive than in unselected patients. To prove that marker-positive patients have better response rates than unselected patients, the lower boundary of a two-sided 95% confidence interval should not fall below 0.3 (the expected response rate in unselected patients); this requires that the standard error of the PPV be <0.1, which in turn means that the study needs to include at least 24 patients who test positive for the marker. The proportion of patients who respond is expected to be ~30%, so an accurate (and sensitive) predictor should generate a similar 20–30% proportion of individuals who test positive, which would require a total sample size of 80 to 200 patients for validation.
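The arithmetic of this precision argument can be sketched as follows; the quoted range of 80 to 200 patients builds in margin beyond this minimal binomial calculation, and the function names are ours.

import math

def n_positive_for_se(ppv, max_se):
    # Marker-positive patients needed so that the binomial standard error
    # sqrt(p*(1-p)/n) of the estimated PPV stays at or below max_se.
    return math.ceil(ppv * (1 - ppv) / max_se ** 2)

def total_accrual(n_positive, positive_rate):
    # Total enrollment needed to observe n_positive marker-positive patients.
    return math.ceil(n_positive / positive_rate)

n_pos = n_positive_for_se(0.60, 0.10)    # 0.24 / 0.01 = 24, as in the text
# if 20-30% of patients test positive, at least ~80-120 patients in total
print(n_pos, total_accrual(n_pos, 0.30), total_accrual(n_pos, 0.20))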



Figure 2. Examples of trial designs to assess the clinical utility of a predictive marker. (A) All patients are treated regardless of marker status, and results are analyzed at the end in the context of the marker results. (B) Treatment is restricted to marker-positive patients only. (C) A randomized design that could prove that use of the predictive test improves patient outcome (i.e. response rates) for a particular treatment compared with unselected (current standard) use of the same treatment. (D) A design that could prove that drug A is a better choice than drug B for marker-positive patients, whereas marker-negative patients may be best treated with drug B.

 
Single-arm validation trials have several limitations. If treatment is restricted to test-positive cases only, the study provides only partial validation of the test, because response rates can only be compared with historical results and the NPV of the test is not evaluated (clinically important response rates may also occur in marker-negative cases). If all patients receive the same treatment and response is analyzed in the context of the marker results, the study can become rather inefficient when the overall response rate is low because of a low prevalence of marker-positive cases, particularly if the test has modest predictive values.

A higher level of evidence for the utility of a new predictive test may be generated through randomized trials, which can simultaneously address response rates in selected and unselected patients and assess the treatment specificity of the predictor. A definitive validation study for a response predictor may randomize patients either to receive treatment only if the marker is positive or to receive therapy regardless of marker status [42]. In the case of a marker that predicts response to an existing chemotherapy, unselected use of the treatment represents the current standard of care. This two-arm design examines clinical utility directly, since comparison of the marker-selected and unselected trial arms determines to what extent patient selection improves outcome compared with unselected use of the same treatment. The sample size for this design would be similar to that of a conventional phase III clinical trial. Early stopping rules to halt the trial if the predictor performs very well or very poorly could be incorporated.

A different design may be needed to assess whether the predictor is treatment-specific or only a general marker of response to (any) cytotoxic therapy. The primary trial objective here is to assess the interaction between the marker and response to two or more different types of chemotherapy. Stratified randomization into two or more treatment arms on the basis of marker results is an appealing strategy. A study with two treatment arms may be designed as follows: the response marker is determined at the time of the patient's entry into the study, and each patient is assigned an expected outcome (e.g. response or no response to drug A, for which the marker was developed). The primary objective would be to establish whether patients who test positive for the marker are significantly more likely to respond to drug A than to drug B. A secondary objective would be to show that patients who test negative for the marker do not benefit from drug A as much as marker-positive patients do, and that for these patients the alternative therapy with drug B may be the preferred treatment. Power and sample-size calculations can be based on assumptions about the prevalence of the marker among the patients and the rates of response of their tumors to drugs A and B. For example, if the prevalence of marker positivity is 25%, the rate of tumor response to drug A is 60%, and the rate of response to treatment B in the same patients is only 20%, then a total sample size of 210 patients would be needed for 98% power to demonstrate that treatment A is superior to treatment B for marker-positive patients. This study could also be conducted by limiting treatment to marker-positive patients only; however, with that design the NPV of the test would not be evaluated, and the comparative efficacy of the alternative therapy (i.e. drug B) could not be assessed in patients predicted to respond poorly. Including early discontinuation rules may be the preferred way to minimize the exposure of marker-negative patients to ineffective drugs.
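The power calculation for the marker-positive comparison can be approximated with standard two-proportion formulas. The sketch below uses the arcsine approximation at a generic 80% power; it does not attempt to reproduce the 210-patient figure quoted above, which rests on design assumptions not fully specified here.

from math import asin, ceil, sqrt
from scipy.stats import norm

def n_per_arm(p1, p2, power=0.80, alpha=0.05):
    # Marker-positive patients per arm for a two-sided two-proportion
    # comparison, using the arcsine variance-stabilizing approximation.
    h = 2 * abs(asin(sqrt(p1)) - asin(sqrt(p2)))   # Cohen's effect size h
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (z / h) ** 2)

# hypothetical: 60% response to drug A vs 20% to drug B in marker-positive
# patients, 25% marker prevalence in the accrued population
n = n_per_arm(0.60, 0.20, power=0.80)
prevalence = 0.25
print(n, ceil(2 * n / prevalence))   # marker-positive per arm, total accrual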

Gene expression profiling has demonstrated in several ‘proof-of-principle’ studies that multigene signatures can predict important clinical outcomes and therefore have the potential to evolve into true diagnostic tests. Some are concerned that moving these observations into a validation phase may be premature because the technology itself is constantly evolving and new, improved gene sets to predict a particular outcome are identified at regular intervals [43–46]. Technological development will never stop, but this need not discourage attempts to validate markers already discovered. The true clinical utility of any proposed gene-marker set can only be established through independent validation. It is possible, and in fact probable, that several distinct gene sets measured on different profiling platforms may predict a given clinical outcome equally well [43–46]. It is also possible that, as larger and larger patient populations are used for discovery and training of multigene predictors, second-, third- and fourth-generation predictors will emerge with steadily increasing predictive accuracy. Although this process will increase competition among aspiring diagnostic companies (and academic laboratories), patients will ultimately benefit if truly useful predictors are discovered. High-throughput genomic (and proteomic) technologies represent perhaps one of the most exciting opportunities in diagnostic medicine since the discovery of monoclonal antibodies. However, the true diagnostic value of this technology can only be established expeditiously through a series of well-designed marker discovery and validation studies.

Received for publication March 23, 2004. Revision received June 15, 2004. Accepted for publication June 18, 2004.


    References

1. Bast RC Jr, Ravdin P, Hayes DF et al. American Society of Clinical Oncology Tumor Markers Expert Panel. 2000 update of recommendations for the use of tumor markers in breast and colorectal cancer: clinical practice guidelines of the American Society of Clinical Oncology. J Clin Oncol 2001; 19: 1865–1878.

2. Hortobagyi GN, Hayes D, Pusztai L. Integrating newer science into breast cancer prognosis and treatment: molecular predictors and profiles. ASCO Annual Meeting Summaries. Alexandria (VA): American Society of Clinical Oncology 2002; 191–202.

3. Ramaswamy S, Golub TR. DNA microarrays in clinical oncology. J Clin Oncol 2002; 20: 1932–1941.

4. de Bolle X, Bayliss CD. Gene expression technology. Methods Mol Med 2003; 71: 135–146.

5. Ali TR, Li MS, Langford PR. Monitoring gene expression using DNA arrays. Methods Mol Med 2003; 71: 119–134.

6. Walker SJ, Worst TJ, Vrana KE. Semiquantitative real-time PCR for analysis of mRNA levels. Methods Mol Med 2003; 79: 211–227.

7. Paik S. Incorporating genomics into the cancer clinical trial process. Semin Oncol 2001; 28: 305–309.

8. Simon R, Altman DG. Statistical aspects of prognostic factor studies in oncology. Br J Cancer 1994; 69: 979–985.

9. Sotiriou C, Powles TJ, Dowsett M et al. Gene expression profiles derived from fine needle aspiration correlate with response to systemic chemotherapy in breast cancer. Breast Cancer Res 2002; 4: R3.

10. Assersohn L, Gangi L, Zhao Y et al. The feasibility of using fine needle aspiration from primary breast cancers for cDNA microarray analyses. Clin Cancer Res 2002; 8: 794–801.

11. Pusztai L, Ayers M, Stec J et al. Gene expression profiles obtained from single passage fine needle aspirations (FNA) of breast cancer reliably identify prognostic/predictive markers such as estrogen (ER) and HER-2 receptor status and reveal large scale molecular differences between ER-negative and ER-positive tumors. Clin Cancer Res 2003; 9: 2406–2415.

12. Ma X-J, Wang W, Salunga R et al. Gene expression signatures associated with clinical outcome in breast cancer via laser capture microdissection. Breast Cancer Res Treat 2003; 82 (Suppl 1): S15 (Abstr 29).

13. Baunoch D, Moore M, Reyes M et al. Microarray analysis of formalin-fixed, paraffin-embedded tissue: the development of a gene expression staging system for breast carcinoma. Breast Cancer Res Treat 2003; 82 (Suppl 1): S116 (Abstr 474).

14. Paik S, Shak S, Tang G et al. Multi-gene RT-PCR assay for predicting recurrence in node negative breast cancer patients: NSABP studies B-20 and B-14. Breast Cancer Res Treat 2003; 82 (Suppl 1): S10 (Abstr 16).

15. Esteva FJ, Sahin AA, Coombes K et al. Multi-gene RT-PCR assay for predicting recurrence in node negative breast cancer patients: M. D. Anderson clinical validation study. Breast Cancer Res Treat 2003; 82 (Suppl 1): S11 (Abstr 17).

16. Symmans WF, Ayers M, Clark E et al. Fine needle aspiration and core needle biopsy samples of breast cancer provide similar total RNA yield, but different stromal gene expression profiles. Cancer 2003; 97: 2960–2971.

17. Altman DG, Lyman GH. Methodological challenges in the evaluation of prognostic factors in breast cancer. Breast Cancer Res Treat 1998; 52: 289–303.

18. King HC, Sinha AA. Gene expression profile analysis by DNA microarrays: promise and pitfalls. JAMA 2001; 286: 2280–2288.

19. Miller LD, Long PM, Wong L et al. Optimal gene expression analysis by microarrays. Cancer Cell 2002; 2: 353–361.

20. Rhodes A, Jasani B, Anderson E et al. Evaluation of HER-2/neu immunohistochemical assay sensitivity and scoring on formalin-fixed and paraffin-processed cell lines and breast tumors: a comparative study involving results from laboratories in 21 countries. Am J Clin Pathol 2002; 118: 408–417.

21. Rhodes A, Jasani B, Barnes DM et al. Reliability of immunohistochemical demonstration of estrogen receptors in routine practice: interlaboratory variance in the sensitivity of detection and evaluation of scoring systems. J Clin Pathol 2000; 53: 125–130.

22. Ambros IM, Benard J, Boavida M et al. Quality assessment of genetic markers used for therapy stratification. J Clin Oncol 2003; 21: 2077–2084.

23. Liu ET. Molecular oncodiagnostics: where we are and where we need to go. J Clin Oncol 2003; 21: 2052–2055.

24. Kuo WP, Jenssen TK, Butte AJ et al. Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 2002; 18: 405–412.

25. Ringner M, Peterson C, Khan J. Analyzing array data using supervised methods. Pharmacogenomics 2002; 3: 403–415.

26. Simon R, Radmacher MD, Dobbin K et al. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003; 95: 14–18.

27. Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003; 4: 210.

28. Goldberg DE. Genetic Algorithms in Search, Optimization and Machine Learning. Reading (MA): Addison-Wesley 1989.

29. Li L, Weinberg CR, Darden TA, Pedersen LG. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 2001; 17: 1131–1142.

30. Rajeevan MS, Vernon SD, Taysavang N et al. Validation of array-based gene expression profiles by real-time (kinetic) RT-PCR. J Mol Diagn 2001; 3: 26–31.

31. Taniguchi M, Miura K, Iwao H et al. Quantitative assessment of DNA microarrays: comparison with Northern blot analyses. Genomics 2001; 71: 34–39.

32. Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. J Comput Biol 2002; 9: 505–511.

33. Shao J. Linear model selection by cross-validation. J Am Stat Assoc 1993; 88: 486–494.

34. Good PI. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. New York: Springer-Verlag 1994.

35. Mukherjee S, Tamayo P, Rogers S et al. Estimating dataset size requirements for classifying DNA microarray data. J Comput Biol 2003; 10: 119–142.

36. Simon R, Radmacher MD, Dobbin K. Design of studies using DNA microarrays. Genet Epidemiol 2002; 23: 21–36.

37. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc (Ser B) 1995; 57: 289–300.

38. Hatfield GW, Hung S, Baldi P. Differential analysis of DNA microarray gene expression data. Mol Microbiol 2003; 47: 871–877.

39. Pounds S, Morris SW. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of P-values. Bioinformatics 2003; 19: 1236–1242.

40. Hwang D, Schmitt WA, Stephanopoulos G et al. Determination of minimum sample size and discriminatory expression patterns in microarray data. Bioinformatics 2002; 18: 1184–1193.

41. Fisher LD, van Belle G. Sample size calculations in selecting continuous variables to discriminate between populations. In Fisher LD, van Belle G (eds): Biostatistics: A Methodology for the Health Sciences. New York: Wiley 1993; 851–858.

42. Sargent D, Allegra C. Issues in clinical trial design for tumor marker studies. Semin Oncol 2002; 29: 222–230.

43. Lossos IS, Czerwinski DK, Alizadeh AA et al. Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. N Engl J Med 2004; 350: 1828–1837.

44. Alizadeh AA, Eisen MB, Davis RE et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000; 403: 503–511.

45. Rosenwald A, Wright G, Chan WC et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 2002; 346: 1937–1947.

46. Shipp MA, Ross KN, Tamayo P et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 2002; 8: 68–74.