Identifying marker genes in transcription profiling data using a mixture of feature relevance experts

M. L. Chow1,3, E. J. Moler1,2 and I. S. Mian1

1 Radiation Biology and Environmental Toxicology Group, Department of Cell and Molecular Biology, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720
2 Chiron Corporation, Emeryville, California 94608
3 Gene Logic Incorporated, Berkeley, California 94704


    ABSTRACT
Transcription profiling experiments permit the expression levels of many genes to be measured simultaneously. Given profiling data from two types of samples, genes that most distinguish the samples (marker genes) are good candidates for subsequent in-depth experimental studies and for developing decision support systems for diagnosis, prognosis, and monitoring. This work proposes a mixture of feature relevance experts as a method for identifying marker genes and illustrates the idea using published data from samples labeled as acute lymphoblastic and myeloid leukemia (ALL, AML). A feature relevance expert implements an algorithm that calculates how well a gene distinguishes samples, reorders genes according to this relevance measure, and uses a supervised learning method [here, support vector machines (SVMs)] to determine the generalization performances of different nested gene subsets. The mixture of three feature relevance experts examined here implements two existing and one novel relevance measure. For each expert, a gene subset consisting of the top 50 genes distinguished ALL from AML samples as completely as all 7,070 genes. The 125 genes at the union of the top 50s are plausible markers for a prototype decision support system. Chromosomal aberration and other data support the prediction that the three genes at the intersection of the top 50s, cystatin C, azurocidin, and adipsin, are good targets for investigating the basic biology of ALL/AML. The same data were employed to identify markers that distinguish samples based on their labels of T cell/B cell, peripheral blood/bone marrow, and male/female. Selenoprotein W may discriminate T cells from B cells. Results from analysis of transcription profiling data from tumor/nontumor colon adenocarcinoma samples support the general utility of the aforementioned approach. Theoretical issues such as choosing SVM kernels and their parameters, training and evaluating feature relevance experts, and the impact of potentially mislabeled samples on marker identification (feature selection) are discussed.

marker genes; mixture of experts; support vector machines; adipsin; cystatin C; azurocidin


    INTRODUCTION
DNA MICROARRAY TECHNOLOGY generates a panoramic survey of genes expressed in a sample of cells. Comparing the transcription profiles of different types of samples permits identification of marker genes, genes that best distinguish samples. When the samples correspond to different pathological states of the same tissue or subtypes of the same malignancy, transcription profiling holds promise as a method for classifying and analyzing cancers from a molecular rather than morphological perspective (1, 2, 11). Despite difficulties in obtaining sufficient, high quality, homogeneous tissue samples from an in situ environment rather than, for example, cell lines, transcription profiling affords an opportunity to identify novel and/or uncharacterized genes that are potential candidates for developing faster and more reliable systems for clinical diagnosis, prognosis, and monitoring. Furthermore, these marker genes represent putative targets for therapeutic agents and understanding the basic biology of the disorder. A typical profiling study measures the expression levels of thousands of genes (features) L across tens of samples N, with each sample labeled as being of one type or another. The problem considered here is that of identifying marker genes given N labeled L-feature sample profile vectors.

A variety of techniques have been employed to address three statistical tasks associated with analysis of profile data (3, 9, 11, 13, 16–18, 20–22). The first, unsupervised learning, involves discovering and characterizing the classes present in unlabeled profile vectors. This clustering procedure can suggest previously unrecognized cancer (sub)types. The second task, supervised learning, involves discriminating between profile vectors with different labels and assigning the label of a new profile vector. Given profiling data for a sample of unknown origin, this classification and prediction procedure can indicate the origin of the sample, for example, whether it is from tumor or nontumor tissue. The third task and subject of this work is feature relevance, ranking, and selection. This involves defining a feature relevance expert which 1) implements an algorithm that quantitates the degree to which a gene distinguishes samples, 2) reorders genes according to this relevance value, 3) selects nested subsets of ranked genes and uses them to train a supervised learning system, and 4) identifies highly informative or marker genes based on the ability of subsets to assign accurately the label for samples not used for training, i.e., the generalization performance of the subset. Thus, gene subsets corresponding to marker genes can be identified by varying a single parameter, the number of ranked features used to train and evaluate the supervised learning system. For a given data set, different feature relevance experts can be compared via their generalization performance on the same number of ranked genes.
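To make this four-step procedure concrete, the following minimal Python sketch couples an arbitrary relevance measure to nested-subset evaluation with a leave-one-out SVM. The function names are illustrative, scikit-learn's SVC stands in for the SVMlight package used in this work, and the normalization and balancing steps described under METHODS are omitted for brevity.

    # Sketch of a generic feature relevance expert (illustrative names; scikit-learn's
    # SVC is used here as a stand-in supervised learner).
    import numpy as np
    from sklearn.svm import SVC

    def expert_generalization(X, y, relevance_fn, subset_sizes=(4, 5, 11, 25, 50, 100)):
        """X: N x L expression matrix; y: length-N binary labels (+1/-1);
        relevance_fn(X, y) -> length-L array of relevance values."""
        ranking = np.argsort(relevance_fn(X, y))[::-1]   # most relevant gene first
        performance = {}
        for k in subset_sizes:
            cols = ranking[:k]                           # nested Top-k gene subset
            correct = 0
            for i in range(len(y)):                      # leave-one-out cross validation
                train = np.delete(np.arange(len(y)), i)
                clf = SVC(kernel="linear")               # dot product kernel
                clf.fit(X[train][:, cols], y[train])
                correct += int(clf.predict(X[i:i + 1, cols])[0] == y[i])
            performance[k] = correct                     # generalization performance (max = N)
        return performance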

Recently, two independent studies employed different techniques to address the three aforementioned tasks. The first study applied naive Bayes models, support vector machines (SVMs) and naive Bayes global relevance (NBGR) (16) to sixty-two 1,988-feature experiment profile vectors derived from colon adenocarcinoma samples labeled as tumor or nontumor (2). The NBGR requires unlabeled profile vectors as input, since it is computed from the probability parameters of profile vector classes discovered by a naive Bayes model. The second study applied self-organizing maps (SOMs), neighborhood analysis, and weighted voting and gene/class correlation to seventy-two 7,070-feature experiment profile vectors derived from bone marrow (BM) and peripheral blood (PB) samples labeled as acute lymphoblastic or myeloid leukemia (ALL, AML) (11). The relevance measure, referred to here as the mean aggregate relevance (MAR), requires labeled profile vectors, since it is computed from the mean and standard deviation of the expression levels of genes in samples labeled ALL and AML. For the {Tumor, Nontumor} and {ALL, AML} binary supervised learning problems, each study identified 50 markers that had the same generalization performance as the full repertoire of, respectively, 1,988 or 7,070 genes (11, 16).

This work considers three distinct but interrelated feature relevance-, ranking-, and selection-related problems. Currently, the number of training examples, N sample profile vectors, is considerably smaller than their dimensionality, L measured gene expression levels (N << L). The first problem is identifying P marker genes for development of a robust decision support system to assign the cancer (sub)type for a new sample as accurately as or better than the original L genes (P << L). The second problem involves reducing the dimensionality even further by defining the Q marker genes best-suited for subsequent experimental investigations (Q < P << L). The third problem concerns multiply-labeled profile vectors and increasing the utility of profiling studies beyond their original purpose. Apart from the primary ALL and AML labels, each leukemia sample had 1–3 additional labels: {PB, BM}, {T cell, B cell}, and {Male, Female} (11). Since it is unlikely that all 7,070 genes are involved in differentiating ALL from AML, it is possible that some (or all) could provide a readout on other aspects of the samples. The question becomes whether the L genes analyzed to address a primary supervised learning problem can be employed to identify markers for secondary problems defined by additional sample labels. Here, a mixture of feature relevance experts is used to address the first and second problems. The validity of the premise underlying the third problem is demonstrated using data from the leukemia samples. Since submission of this work, a variety of approaches for identifying marker genes have been proposed (see, for example, Refs. 7, 8, 10, 12, and 23).


    METHODS AND APPROACH

Gene expression data.
The transcription profiling studies reconsidered here both employed Affymetrix technology to monitor gene expression levels. The adenocarcinoma study provides measurements for 1,988 probes in 62 human colon adenocarcinoma tissue samples, 40 labeled tumor and 22 nontumor (2). The leukemia study provides measurements for 7,070 probes in 72 human leukemia samples, 47 ALL and 25 AML (total 72), 10 PB and 62 BM (total 72), 9 T cell and 38 B cell (total 47), and 26 male and 23 female (total 49) (11). For these five aforementioned binary supervised learning problems, samples having the label shown in italics are, without loss of generality, defined to be positive training examples.

Although not examined in this work, a variety of other supervised learning problems can be derived from the leukemia data. For example, a binary problem might involve distinguishing leukemia subtypes on the basis of tissue of origin ({ALL+PB, ALL+BM} or {AML+PB, AML+BM}). A multiclass problem could include discriminating samples on the basis of tissue origin and subtype ({ALL+PB, ALL+BM, AML+PB, AML+BM}). For convenience, each functionally defined nucleic acid sequence probe whose expression level is monitored will be termed a "gene," irrespective of whether it is actually a gene, an expressed sequence tag, or DNA from another source.

Feature relevance experts.
The three feature relevance experts evaluated here implement relevance measures that are based upon labeled (MAR, MVR) or unlabeled (NBGR) sample profile vector training examples. These measures are designed to be illustrative rather than comprehensive, because, for example, all treat genes as independent of one another, whereas the transcription levels of some genes are likely to be correlated. In general, each measure generates a ranking of features and defines nested gene subsets Top1 ⊂ Top2 ⊂ ... ⊂ TopL, where L is the number of genes monitored in the profiling study (here L = 1,988 or 7,070). Top1 denotes the top-ranked or most distinctive gene according to the relevance measure, Top2 denotes the top 2, and so on. Evaluating all possible gene subsets in terms of how well they perform on a particular classification and prediction problem using a supervised learning method (here SVMs) is a computationally demanding task. Hence, the focus is on a small number of selected gene subsets, for example, Top4, Top5, Top11, Top25, Top50, Top100, and TopL, as well as the bottom and middle 50 ranked genes.

It remains to be determined whether the degrees of difficulty of the supervised learning and feature selection problems posed by the leukemia and adenocarcinoma data sets are typical of cancer profiling studies. The strategy deployed here is sufficiently general that other feature relevance measures, ranking and selection techniques, supervised learning methods, training and evaluation procedures, and methods for combining predictions from experts could be utilized.

Median vote relevance.
For gene Fl, let xln be its expression level in sample n. Let νi(Fl) and νj(Fl) be the median values for samples belonging to classes i (positive training examples) and j (negative examples). Each sample casts a vote V(n, l) according to whether the expression level is closer to the median value of class i or j. The median vote relevance (MVR) is the sum over all N samples

MVR(Fl) = Σn V(n, l)

The larger the score, the better the gene distinguishes classes (two or more genes can have the same value). Although both require labeled training examples, the MVR is less sensitive to outliers than the mean-based MAR, because the median is a more robust estimate of the center of a population sample. MVR values were computed using a spreadsheet program.
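A minimal sketch of the MVR computation is given below. The exact vote convention is not spelled out above, so it is assumed here that V(n, l) = 1 when a sample's expression level lies closer to the median of its own class and 0 otherwise.

    import numpy as np

    def mvr(X, y):
        """X: N x L expression matrix; y: length-N Boolean array (True = class i)."""
        med_i = np.median(X[y], axis=0)                  # nu_i(F_l), class i medians
        med_j = np.median(X[~y], axis=0)                 # nu_j(F_l), class j medians
        closer_to_i = np.abs(X - med_i) < np.abs(X - med_j)
        # V(n, l) = 1 when the vote agrees with the sample's own class (an assumption)
        votes = np.where(y[:, None], closer_to_i, ~closer_to_i)
        return votes.sum(axis=0)                         # MVR(F_l) = sum over all N samples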

Naive Bayes global relevance.
Given the K classes identified and characterized by a naive Bayes model estimated from N unlabeled L-feature profile vectors, the NBGR (16) is the sum of the relevance over pairwise combinations of classes

P(xln|ck,l) is the probability of the expression level given class k. The greater the absolute magnitude, the better the gene distinguishes all K classes. A naive Bayes model was estimated using AutoClass C version 3.3 (5) and the 72 unlabeled 7,070-feature leukemia sample profile vectors (the reported expression values were not shifted or scaled in any way). An expectation maximization algorithm finds a mixture of Gaussian probability distributions, and a Bayesian approach finds the maximum posterior probability classification and optimum number of classes K. Thus, P(xln|ck,l) = [2πσk,l²]^(−1/2) exp[−½{(xln − µk,l)/σk,l}²], where [µk,l, σk,l] is the [mean, standard deviation] of the Gaussian modeling class k. For each feature, gene l, a lower bound for σk,l was set to 1/10 of the standard deviation of all N expression levels, {xl1, ... , xlN}.
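The class-conditional density and its variance floor can be written compactly as below; this is a simplified stand-in for the AutoClass C machinery actually used, with illustrative parameter names.

    import numpy as np

    def class_conditional_density(x, mu_k, sigma_k, sigma_all):
        """x, mu_k, sigma_k, sigma_all: length-L arrays for one sample and one class k."""
        sigma = np.maximum(sigma_k, sigma_all / 10.0)    # lower bound: 1/10 of overall std
        z = (x - mu_k) / sigma
        return (2.0 * np.pi * sigma**2) ** -0.5 * np.exp(-0.5 * z**2)   # P(x_ln | c_k,l)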

A naive Bayes model of the adenocarcinoma experiment profile vectors identified four underlying classes (16) rather than the two indicated by the tumor and nontumor labels (2). NBGR values were calculated using Gaussian parameters determined directly from the values of gene Fl in the tumor and nontumor samples, i.e., K = 2. The generalization performance of the top 50 genes from this "supervised" NBGR expert was considerably worse than that of the top 50 from an "unsupervised" NBGR expert that employed the K = 4 classes estimated from data.

Mean aggregate relevance.
This is the correlation between a gene and the ALL/AML classes (11). Unlike the MVR, the MAR utilizes both the location and spread of samples in classes i and j

MAR(Fl) = (µi,l − µj,l)/(σi,l + σj,l)

where [µi,l, σi,l] and [µj,l, σj,l] are the mean and standard deviation of the log of the expression level of gene Fl in classes i and j. A large absolute magnitude signifies a strong correlation. A positive (negative) sign indicates that the gene is more highly expressed in class i (j). MAR(Fl) is related to the Fisher criterion score |(µi,l − µj,l)/(σi,l² + σj,l²)|.
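A minimal sketch of the MAR, assuming the expression levels have already been thresholded to positive values so that the logarithm is defined:

    import numpy as np

    def mar(X, y):
        """X: N x L matrix of positive expression levels; y: length-N Booleans (True = class i)."""
        logX = np.log(X)
        mu_i, sd_i = logX[y].mean(axis=0), logX[y].std(axis=0)
        mu_j, sd_j = logX[~y].mean(axis=0), logX[~y].std(axis=0)
        return (mu_i - mu_j) / (sd_i + sd_j)             # sign > 0: gene higher in class i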

Leukemia and adenocarcinoma genes: feature ranking and selection.
The 7,070 genes in the leukemia data were ranked separately according to their NBGR value and MVR value for the labels {ALL, AML}, {PB, BM}, {T cell, B cell}, and {Male, Female} (a total of five different rankings). The 1,988 genes in the adenocarcinoma data were ranked separately according to their NBGR value and MVR value for the label {Tumor, Nontumor} (two different rankings). For each of these seven rankings, nine representative gene subsets were created by selecting different numbers of top-, middle-, and bottom-ranked genes. Two additional gene subsets based on the {ALL, AML} labels were defined. The first, taken from figure 3A of Ref. 11 and referred to as the MAR 50, represents the 25 genes with the highest positive values and the 25 genes with the highest negative values. The second subset consists of genes common to the MAR 50, the NBGR top 50, and the MVR top 50. For the multiply-labeled leukemia data, the NBGR ranking reflects the importance of genes in distinguishing ALL from AML, so it may be uninformative in terms of the other labels.
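The nine representative subsets for a given ranking can be generated as sketched below (function name and the choice of Top-k sizes are illustrative):

    import numpy as np

    def representative_subsets(relevance, top_sizes=(4, 5, 11, 25, 50, 100)):
        """relevance: length-L array of relevance values; returns gene-index subsets."""
        ranking = np.argsort(relevance)[::-1]            # best-ranked gene first
        L = len(relevance)
        subsets = {f"top{k}": ranking[:k] for k in top_sizes}
        subsets["middle50"] = ranking[L // 2 - 25 : L // 2 + 25]
        subsets["bottom50"] = ranking[-50:]
        subsets["all"] = ranking
        return subsets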

SVMs: training and evaluation.
Because of the limited number of training examples, a leave-one-out cross validation strategy was utilized. A pool of N known positive and negative training examples was partitioned into two disjoint sets (here N = 62, 72). The estimation set, N - 1 examples, was used to determine the parameters of an SVM, and the test set, 1 example, was used to assess its generalization performance. The label assigned by a trained SVM to a test example can be a true positive (known positive test example, assigned positive label), true negative (negative example, negative label), false positive (negative example, positive label), or false negative (positive example, negative label). This procedure was repeated for each training example in turn. The generalization performance of these leave-one-out studies is the total number of SVMs that make true positive or true negative assignments (the maximum possible generalization performance is N). Elsewhere (11), the 72 leukemia training examples were partitioned into estimation and test sets containing 38 and 34 examples, respectively. The generalization performance of this "38 estimation, 34 test" partitioning is how many of the 34 test examples were assigned to be true positives or true negatives. The roles of the two sets were then reversed, and the generalization performance of a "34 estimation, 38 test" partitioning was determined in a similar manner.

In addition to training examples, estimating an SVM requires specifying an inner-product kernel function, a measure of similarity between two profile vectors XLi = {x1i, ... , xLi} and XLj = {x1j, ... , xLj}. Since there is no general theory for determining the most appropriate kernel for a particular learning problem, two kernels were employed. The first was the dot product kernel K(XLi, XLj) = Σl xli xlj. The second was a radial basis kernel function K(XLi, XLj) = exp(−||XLi − XLj||²/2σ²), where γ = 1/(2σ²) is a user-defined width parameter. Two different width parameters were used: 1) γf = 0.01, a data-independent value employed in earlier work (16); and 2) γd, a data-dependent value in which σ is set equal to the median of the Euclidean distances from each positive training example to the nearest negative training example (3).
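The data-dependent width can be computed as sketched below (illustrative function name):

    import numpy as np

    def data_dependent_gamma(X_pos, X_neg):
        """sigma = median distance from each positive example to its nearest negative example."""
        nearest = [np.min(np.linalg.norm(X_neg - x, axis=1)) for x in X_pos]
        sigma = np.median(nearest)
        return 1.0 / (2.0 * sigma**2)                    # gamma_d = 1/(2 sigma^2)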

SVMs were trained and evaluated using SVMlight version 3.02 (15). Each gene subset was employed to create training examples in which the input profile vectors contained only the selected genes. Rather than working directly with the reported expression levels, xln, each value was normalized using xln/[Σl∈S (xln)²]^(1/2), where S is the subset of interest. For simplicity and to illustrate the basic approach, genes were ranked once using all N training examples and not reranked for each estimation set. To account for unequal numbers of positive and negative examples, each estimation set was balanced by duplicating as many randomly chosen examples as necessary from the smaller set to yield the same number of examples as the larger set. Elsewhere (3), imbalanced data sets were handled by adding a diagonal to the kernel matrix (different values for positive and negative examples).
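The subset normalization and estimation-set balancing steps can be sketched as follows (illustrative names; the random seed is arbitrary):

    import numpy as np

    def normalize_subset(X, subset):
        """Restrict each profile to the genes in `subset` and scale it to unit Euclidean norm."""
        Xs = X[:, subset]
        return Xs / np.linalg.norm(Xs, axis=1, keepdims=True)

    def balance_by_duplication(X, y, seed=0):
        """Duplicate randomly chosen examples from the smaller class until class sizes match."""
        rng = np.random.default_rng(seed)
        pos, neg = np.where(y)[0], np.where(~y)[0]
        small, large = (pos, neg) if len(pos) < len(neg) else (neg, pos)
        extra = rng.choice(small, size=len(large) - len(small), replace=True)
        idx = np.concatenate([pos, neg, extra])
        return X[idx], y[idx]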


    RESULTS

Leukemia sample profile vector classes.
Each of the N = 72 samples could be assigned uniquely to one of three naive Bayes model classes because the probability of the profile vector for that class was 1.0 (Table 1). Although class 3 contains only ALL samples, none of the other labels exhibit any clear association with specific classes. The unsupervised learning method utilized here determines the number of classes from the data, whereas the published SOM approach (11) requires this number be specified a priori (only a four class SOM was reported). For the adenocarcinoma data set also (16), the number of classes estimated by a naive Bayes model is greater than the two that might be expected given the {Tumor, Nontumor} labels. Further research is required to ascertain whether these discrepancies are the result of deficiencies in the modeling method or reflect the fine structure and complexity of the data that is masked by the original (known) labels. Interestingly, ranking genes on the basis of these estimated (pure and mixed) classes does not diminish the ability of top-ranked gene subsets to address the {ALL, AML} and {Tumor, Nontumor} supervised learning problems.


Table 1. Naive Bayes model class assigned to leukemia samples by a model trained using 72 unlabeled 7,070-feature sample profile vectors

 
Markers for decision support systems.
Table 2 shows that the maximum generalization performance achieved is less than the maximum possible for both the {ALL, AML} (71 vs. 72) and {Tumor, Nontumor} (55 vs. 62) problems. This may be because both data sets contain outliers and potentially mislabeled samples. For the five {ALL, AML} experiments with a performance of 71, the single error is a false positive. Previously, 6/62 adenocarcinoma samples were assigned as false positive or false negative across the 17 gene subsets examined (16). Subsets with the same performance may differ in their false positive and false negative assignments. Decreasing the number of ranked genes below the top 11 degrades performance. Overall, the NBGR and MVR rankings are effective because the top 50 perform better than the middle 50 and significantly better than the bottom 50. Some subsets generalize as well as or better than the full repertoire of 1,988 or 7,070 genes. Thus the top 25–100 genes of each expert are potential markers for use in developing decision support systems aimed at distinguishing tumor from nontumor colon adenocarcinoma samples and ALL from AML samples.


Table 2. Identifying marker genes using two different feature relevance experts

 
SVM kernel function and kernel parameters.
No kernel function or parameter setting is optimal in terms of generalization performance. For example, the data-dependent width parameter γd gives superior results compared with the data-independent parameter γf for the {ALL, AML} problem. The reverse is true for the {Tumor, Nontumor} problem. The poorer performance of a data-dependent width parameter for the {Tumor, Nontumor} problem may be due to the larger number of potentially misclassified examples in the adenocarcinoma versus the leukemia data set. In previous analysis of the adenocarcinoma data (16), training examples that constituted support vectors in each of the 62 leave-one-out SVMs were used to pinpoint potentially mislabeled samples (support vectors are training examples that define the location of the decision surface). Similarly, it may be instructive to examine how the nature and number of such invariant support vector training examples vary according to feature subset, kernel function, and kernel parameters.

SVM training and evaluation.
Table 3 indicates that performance is influenced by how the training examples are partitioned (compare the false positives and false negatives in the "38 estimation, 34 test" and "34 estimation, 38 test" experiments). The MAR 50 subset and "38 estimation, 34 test" partitioning allow a direct comparison between the performance of SVMs and the published weighted vote predictor (11). In the latter, the estimation set was used to compute the MAR for each feature in the subset. This 50-feature predictor assigned the label for each of the 34 test examples as follows. Each gene Fl casts a weighted vote according to whether the expression level xl is closer to the value of the gene in class i ≡ ALL or j ≡ AML of the estimation set, v(Fl) = MAR(Fl)(xl − [µi,l + µj,l]/2). If the sum of the absolute values of the positive votes in the 50 genes is greater than the sum of the absolute values of the negative votes, then the test example is assigned to the positive class i. The weighted vote predictor made strong predictions for 29 of the 34 test examples, and in all instances, the assignments were true positives or true negatives. In contrast, an SVM makes true positive or true negative assignments for 33 of the 34 test examples.
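As a point of reference, the weighted vote rule described above can be sketched as follows; the prediction-strength threshold used in Ref. 11 to call "strong" predictions is omitted, and names are illustrative.

    import numpy as np

    def weighted_vote_predict(x_test, mar_weights, mu_i, mu_j):
        """x_test, mar_weights, mu_i, mu_j: length-50 arrays for the selected genes,
        with MAR values and class means computed from the estimation set."""
        votes = mar_weights * (x_test - (mu_i + mu_j) / 2.0)   # v(F_l)
        pos = votes[votes > 0].sum()
        neg = -votes[votes < 0].sum()
        return pos > neg          # True: assign to the positive class i (ALL)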


Table 3. The generalization performance of different partitionings of the {ALL, AML} training examples

 
ALL and AML markers for experimental studies.
The original leukemia study provided biological explanations as to why members of the MAR 50 might be involved in this disorder and could distinguish AML from ALL (11). The results here indicate that the NBGR top 50, MVR top 50, and MAR 50 generalize as well as all 7,070 genes (compare the "72 leave-one-out" entry in Table 3 with the "All 7,070" entry in Table 2). Tables 4–6 list the top 50 genes of each expert. Although the precise compositions of the top 50s differ, each set of 50 genes is effective in terms of discriminating between AML and ALL. The small overlap in terms of the specific genes suggests the presence of many gene subsets of a given cardinality that can generalize equally well.


Table 4. The leukemia NBGR top 50 genes

 

Table 5. The {ALL, AML} MVR top 50 genes

 

Table 6. The {ALL, AML} MAR 50 genes from figure 3A of Ref. 11

 
Only adipsin, azurocidin, and cystatin C are common to the NBGR top 50, MVR top 50, and MAR 50. Given the large number of genes assayed (7,070) and the extensive literature on leukemia, it should be possible to provide biologically based rationales as to why three particular genes might be involved in AML/ALL, even those chosen at random and having no actual role in the disease. Although such an explanation cannot be ruled out for adipsin, azurocidin, and cystatin C, circumstantial evidence suggests that they may, indeed, be robust and reliable markers and thus good candidates for additional experimental investigation. These genes are ranked highly by three independent experts and are located in chromosomal regions known to be sites of recurrent abnormalities in ALL and AML (Table 7). Chromatin reorganization of the 19p13.3 locus, which contains azurocidin, proteinase-3, neutrophil elastase, and adipsin, is associated with myeloid cell differentiation (24). The generalization performance achieved by these subsets of 3 (66) and 4 (64) genes is comparable to the NBGR top 4 (67) and higher than the MVR top 4 (49) (Table 8).


Table 7. Abnormalities associated with two chromosomal regions containing genes at the intersection of the NBGR top 50, MVR top 50, and MAR 50: azurocidin (Gene ID M96326), adipsin (M84526), and cystatin C (M27891)

 

Table 8. The generalization performance of gene subsets that are good candidates for further experimental studies of ALL and AML

 
Cystatins C, A (GenBank accession no. D88422), and S (X54667) and cathepsins G (J04990) and D (M63138) are common to two out of the three top 50s. Cystatins are endogenous protein inhibitors of cathepsins, so these specific protease-inhibitor pairs might be important in the etiology of ALL and AML. Human neutrophil-derived cathepsin G and azurocidin have been identified as chemoattractants for mononuclear cells and neutrophils (6). Experimental investigation of highly ranked genes may be warranted.

T cell/B cell, PB/BM, and male/female markers for experimental studies.
The MVR expert defines 25 markers for each of the additional leukemia problems that generalize as well as all 7,070 genes (Table 9). Comparing the maximum performance achieved and the maximum possible performance indicates that the data contain sufficient information for the {PB, BM} (68 vs. 72) and {T cell, B cell} (46 vs. 47) problems, but not for the {Male, Female} (31 vs. 49) problem. Furthermore, there is little difference in performance between the {Male, Female} top 50, middle 50, and bottom 50 gene sets. This suggests little association between these sample labels and the transcription profiling data. Possible explanations for the poorer {Male, Female} results include 1) transcription profile data are poor indicators of sex, 2) the 7,070 probe set did not include probes that can distinguish males from females, and 3) the patients (mostly children) had not achieved sexual maturity and thus had not manifested any differences.


Table 9. Marker genes that distinguish leukemia samples according to their {PB, BM}, {T cell, B cell}, and {Male, Female} labels and identified using the MVR expert

 
Of the 72 training examples, 47 have {T cell, B cell} labels, and there is only one false positive assignment when either all 7,070 or the top 25 genes are used. It is interesting to note that a dot product SVM trained using these 47 labeled 7,070-feature experiment profile vectors assigned a B cell label to each of the 72 - 47 = 25 test examples. These test examples are the AML samples listed in Table 1.

The three sets of MVR rankings appear to be biologically interesting (Tables 10–12). It should be noted, however, that they are valid only within the context of tissue samples derived from patients with ALL/AML. Bearing this in mind, the {T cell, B cell} top 50 contains many known T cell related genes. Genes that have no obvious annotation linking them to this cell type, such as protein disulfide isomerase, selenoprotein W, and Ras-related protein Rab-32, may be novel markers that can discriminate between T cells and B cells. Selenoprotein W is an intracellular protein that may be involved in protection against oxidative damage and muscle metabolism (4, 14). Overexpression of Lrp, the top ranked {PB, BM} gene, often predicts a poor response to chemotherapy in leukemia because it is one of the mechanisms by which cancer cells develop resistance to cytotoxic agents (reviewed in Ref. 19).


Table 10. The {PB, BM} MVR top 50 genes

 

Table 11. The {T cell, B cell} MVR top 50 genes

 

Table 12. The {Male, Female} MVR top 50 genes

 

    DISCUSSION
The principal requirement for identifying marker genes for use in developing a clinically relevant decision support system for cancer diagnosis, prognosis, and monitoring is that the resultant system generate accurate predictions. The generalization capacity of the system is of paramount importance, since the number and diversity of samples available for its development are likely to be far smaller than the samples for which predictions will need to be made. Undoubtedly, a variety of the extracellular and intracellular pathways that regulate and maintain interactions between cells and their microenvironment are perturbed during carcinogenesis. Hence, feature relevance experts should implement notions of relevance that are as fundamentally different as possible, so that each relevance measure captures a different, physiologically relevant pathway or mechanism leading to the biological end point. Given a mixture of experts, selecting gene subsets that are ranked highly by each expert and that generalize as well as or better than the full repertoire should help to pinpoint robust marker genes. Based on the results here, a prototype system for discriminating between ALL and AML samples could contain the 125 features that are the union of the NBGR top 50, MVR top 50, and MAR 50.

Although reducing the original 7,070 leukemia genes to 125 is appropriate in terms of a decision support system, this is still too many for in-depth experimental studies. Hence, the most informative experimental markers may be genes at the intersection of the top ranked genes: adipsin, azurocidin, and cystatin C. However, they are unlikely to be the sole determinants of the difference between ALL and AML because the generalization performance of these three genes is poorer than some of the larger gene subsets. The same is true for the four closely linked genes on chromosome 19p13.3 (azurocidin-proteinase 3-neutrophil elastase-adipsin). Nonetheless, the strategy proposed here provides a protocol for pinpointing experimentally informative marker genes and thus prioritizing subsequent investigations.

In transcription profiling studies, more genes are monitored than are probably required to understand the main problem. This "overdetermined" property suggests that broader questions could be answered if additional information were available for each sample. For the leukemia {T cell, B cell} and {PB, BM} secondary problems, the 7,070 genes are sufficiently informative that 25 markers can be defined that generalize as well as all 7,070 genes. It remains to be determined whether these markers are universal or are restricted to samples originating from ALL and AML patients.

Both the leukemia and adenocarcinoma data sets contain potentially misclassified samples, samples for which the original label (the "gold standard") may be incorrect (1/72 and 6/62, respectively). In a previous study of the latter data set (16), the subset of training examples that constituted support vectors across the entire series of leave-one-out SVMs was suggested to be indicative of samples most likely to have been misclassified (the set of support vectors does appear to depend upon which training example is withheld when estimating an SVM). Misclassification may be due to simple human error during sample handling, RNA preparation, data acquisition, data analysis, and so on. Standardized protocols stipulating rigorous procedures at each step of the process should reduce this type of problem and improve the chances of creating a coherent data set. The possibility of misclassification cannot be eliminated entirely because although a sample might appear to be visually and/or histologically of one type, it might be a member of the other class in reality. By training SVMs with hard margins, assuming no a priori labeling errors, potentially mislabeled samples can be pinpointed and subjected to additional investigation to verify their label. Given the nature of the underlying biology and technical issues surrounding generation of transcription profiling data, it is conceivable that many, if not all, cancer profiling experiments will contain noisy data and misclassified samples. Soft margin SVMs do take into consideration misclassified training examples, but it is difficult to estimate the underlying error rate at the present time. To improve the reliability of downstream analyses, it may be preferable to incorporate a preprocessing step that identifies, and subsequently corrects if necessary, any misclassified samples. Once achieved, the distance of a sample to the optimal hyperplane can be used to assess confidence in an assignment.
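One way such a preprocessing step could be sketched is to flag the training examples that remain support vectors in every leave-one-out SVM in which they appear. In the sketch below, scikit-learn's SVC with a large C approximates a hard margin; this is an illustration under those assumptions, not the procedure of Ref. 16 verbatim.

    import numpy as np
    from sklearn.svm import SVC

    def invariant_support_vectors(X, y, C=1e6):
        """Return indices of samples that are support vectors in all N - 1 runs that include them."""
        N = len(y)
        counts = np.zeros(N, dtype=int)
        for i in range(N):
            train = np.delete(np.arange(N), i)
            clf = SVC(kernel="linear", C=C).fit(X[train], y[train])
            counts[train[clf.support_]] += 1     # map support vector indices back to sample indices
        return np.where(counts == N - 1)[0]      # candidates for label verification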

The results from this and previous (16) work highlight a need for theoretical research in several areas. As illustrated here, the generalization performance of SVMs depends not only on the precise learning problem, but also on the training and testing procedure employed. Although leave-one-out cross-validation is costly and time-consuming, it provides a reasonable estimate of the expected generalization error. In view of uncertainties in the labels assigned to samples and the small, imbalanced sample set, a relatively simple assessment of the overall performance of SVMs was utilized: the cost function used to judge accuracy was the total number of true positive and true negative assignments. Principled, sophisticated methods need to be developed for areas such as 1) selecting features in the presence of an unknown number of misclassified training examples, 2) choosing the appropriate class of kernel function and determining (near) optimal kernel parameters automatically, 3) training and evaluating a learning system that is both computationally efficient and yields biologically meaningful results, and 4) generating an integrated prediction from a set of feature relevance experts that vary in how well they perform on the classification and prediction task at hand (boosting and bagging).

Despite the aforementioned limitations, utilizing a mixture of feature relevance experts that incorporate SVMs for supervised learning problems appears to be a promising method for identifying marker genes in cancer profiling studies. This approach can be applied directly to identifying markers in transcription profiling studies addressing other discrimination problems such as those encountered in aging and responses to different doses and dose rates of xenobiotic agents such as radiation. Similarly, the technique could be used to identify marker experiments as opposed to marker genes. These ideas can be extended to molecular profiling studies in which the features monitored are not genes, but are molecules such as proteins, metabolites, and so on.


    ACKNOWLEDGMENTS
 
This work was supported by the Director, Office of Science, Office of Biological and Environmental Research, Life Sciences Division, under US Department of Energy Contract No. DE-AC03-76SF0098.


    FOOTNOTES
 
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

Address for reprint requests and other correspondence: I. S. Mian, Dept. of Cell and Mol. Biol., MS 74-197, Life Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd., Berkeley, CA 94720 (E-mail: SMian{at}lbl.gov).


    REFERENCES

  1. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, and Staudt LM. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403: 503–511, 2000.
  2. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, and Levine AJ. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 96: 6745–6750, 1999. [The data are available at http://microarray.princeton.edu/oncology/]
  3. Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M Jr, and Haussler D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 97: 262–267, 2000.
  4. Burk RF and Hill KE. Orphan selenoproteins. Bioessays 21: 231–237, 1999.
  5. Cheeseman P and Stutz J. Bayesian classification (AutoClass): theory and results. In: Advances in Knowledge Discovery and Data Mining, edited by Fayyad UM, Piatetsky-Shapiro G, Smyth P, and Uthurusamy R. AAAI Press/MIT Press, 1996. [The software is available at http://ic-www.arc.nasa.gov/ic/projects/bayes-group/autoclass/index.html]
  6. Chertov O, Ueda H, Xu LL, Tani K, Murphy JM, Wang WJ, Howard OM, Sayers TJ, and Oppenheim JJ. Identification of human neutrophil-derived cathepsin G and azurocidin/CAP37 as chemoattractants for mononuclear cells and neutrophils. J Exp Med 186: 739–747, 1997.
  7. Dudoit S, Yang YH, Callow MJ, and Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments [Online]. Dept. of Statistics, Univ. of California at Berkeley. http://www.stat.berkeley.edu/users/terry/zarray/Html/papersindex.html [3 Sept. 2000].
  8. Efron B, Tibshirani R, Goss V, and Chu G. Microarrays and Their Use in a Comparative Experiment (Technical Report). Palo Alto, CA: Department of Statistics, Stanford University, 2000.
  9. Eisen MB, Spellman PT, Brown PO, and Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95: 14863–14868, 1998.
  10. Furey T, Cristianini N, Duffy N, Bednarski D, Schummer M, and Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16: 906–914, 2000.
  11. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, and Lander ES. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531–537, 1999. [The data are available at http://waldo.wi.mit.edu/MPR/data_sets.html]
  12. Guyon I, Weston J, Barnhill S, and Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning, in press.
  13. Hastie T, Tibshirani R, Eisen M, Brown P, Ross D, Scherf U, Weinstein J, Alizadeh A, Staudt L, and Botstein D. Gene shaving: a new class of clustering methods for expression arrays [Online]. Stanford University. http://www-stat.stanford.edu/~hastie/Papers/ [Jan. 2000].
  14. Holben DH and Smith AM. The diverse role of selenium within selenoproteins: a review. J Am Dietetic Assoc 99: 836–843, 1999.
  15. Joachims T. Making large-scale SVM learning practical. In: Advances in Kernel Methods: Support Vector Learning, edited by Schölkopf B, Burges C, and Smola A. MIT Press, 1999. [The software is available at http://ais.gmd.de/~thorsten/svm_light]
  16. Moler EJ, Chow ML, and Mian IS. Analysis of molecular profile data using generative and discriminative methods. Physiol Genomics 4: 109–126, 2000.
  17. Moler EJ, Radisky DC, and Mian IS. Integrating naive Bayes models and external knowledge to examine copper and iron homeostasis in S. cerevisiae. Physiol Genomics 4: 127–135, 2000.
  18. Raychaudhuri S, Stuart JM, and Altman RB. Principal components analysis to summarize microarray experiments: application to sporulation time series. In: Pacific Symposium on Biocomputing, 2000, vol. 5, p. 452–463.
  19. Ross DD. Novel mechanisms of drug resistance in leukemia. Leukemia 14: 467–473, 2000.
  20. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, and Golub TR. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96: 2907–2912, 1999.
  21. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, and Church GM. Systematic determination of genetic network architecture. Nat Genet 22: 281–285, 1999.
  22. Toronen P, Kolehmainen M, Wong G, and Castren E. Analysis of gene expression data using self-organizing maps. FEBS Lett 451: 142–146, 1999.
  23. Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, and Vapnik V. Feature selection for SVMs. Adv Neural Inform Process Syst 13: 2000.
  24. Wong ET, Jenne DE, Zimmer M, Porter SD, and Gilks CB. Changes in chromatin organization at the neutrophil elastase locus associated with myeloid cell differentiation. Blood 94: 3730–3736, 1999.