Department of Cell and Molecular Biology, Radiation Biology and Environmental Toxicology Group, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720
ABSTRACT
Key words: microarrays; biological networks; graphical models; support vector machines; decision support systems; comparative molecular profile data analysis
INTRODUCTION
Currently, the most prevalent molecular profile matrices are those from transcription profiling studies in which the molecules are genes. For convenience, each functionally defined nucleic acid sequence whose expression level is monitored will be termed a "gene," irrespective of whether it is actually a gene, an expressed sequence tag, or DNA from another source. Although still in their infancy, computational methods have proved adept at extracting experimentally and clinically useful information from transcription profile data. These techniques include hierarchical clustering (10), gene shaving (19), self-organizing maps (16, 49, 51), k-means clustering (50), Boolean networks (26, 30, 45), linear modeling (9), principal component analysis (42), nonlinear modeling (53), Bayesian networks (BNs) (14), dynamic Bayesian networks (DBNs) (36), Support Vector Machines (SVMs) (5), and Petri nets (17, 32).
This work proposes a modular framework for the analysis of molecular profile data and domain knowledge that combines generative and discriminative methods. Initially, the framework is designed to address the distinct, yet complementary tasks of elucidating basic biological mechanisms and pathways and developing decision support systems for diagnosis, prognosis, and monitoring. The long-term goal is creation of an object-oriented system for prediction, inference, and experimental planning in which local relations (network fragments) can be integrated to build models that exhibit greater complexity. Here, specific generative and discriminative methods are employed. Graphical models were selected because of their structured stochastic nature and concomitant ability to model complex relations. SVMs were chosen because of their predictive performance capabilities when applied to classification, prediction, and regression problems. These general techniques permit creation of increasingly sophisticated models and analytical methods capable of yielding useful predictions and insights during each phase of framework development. Such models have good predictive accuracy (generalization) and lend themselves to human interpretation (explanation). They can handle missing data and/or "noisy" data arising from the stochastic nature of the underlying biological process (model noise) and errors occurring during sample preparation and/or measurement (observation noise). The models can incorporate prior knowledge, model hierarchical relationships, and utilize heterogeneous data.
Here, working prototypes of tools for modules that address three statistical tasks associated with analysis of profile data are described. They are applied to published 1,988-feature experiment profile vectors from 62 human colon adenocarcinoma specimens labeled as tumor or nontumor (2). A naive Bayes model, a simple graphical model, is used to discover and characterize classes of experiment profile vectors (unsupervised learning). SVMs are employed to distinguish tumor from nontumor specimens and to assign the label of profile vectors not used for training (supervised learning). Two feature relevance experts are utilized to identify marker genes, genes that distinguish the two types of specimens (feature relevance, ranking, and selection). Insights into colon adenocarcinoma biology and future directions for the methodology are discussed.
MODELS
The aforementioned data can be represented as N {input, output} pairs

$$\mathcal{D} = \{(X_L^n, d^n)\}_{n=1}^{N}, \qquad X_L^n = [x_1^n, \ldots, x_L^n]$$

where X_L^n is an L-feature input vector (here, the expression levels of L = 1,988 genes in specimen n) and d^n is the corresponding output label (here, tumor or nontumor; N = 62).
Learning predictive models from training data falls under two general headings. Unsupervised learning finds "natural groupings" using only the input variables. Here, this translates to identifying classes of experiment profile vectors by clustering the N L-feature input vectors [X_{1988}^1, ..., X_{1988}^{62}]. Supervised learning estimates a function from paired values of input and output variables with the aim of predicting the outputs for future, unseen input variables. This maps to utilizing (X_{1988}^n, d^n) pairs to learn a model that can assign the output label tumor or nontumor to a new profile vector. Labeled input vectors are separated into positive and negative training examples. Here, tumor (nontumor) samples are considered to be positive (negative) training examples.
The generalization performance of a learning system is a measure of how well it performs on data not used for training. For a supervised learning system, labeled training examples are partitioned into two disjoint sets. The estimation set of positive and negative training examples is used to determine the parameters of the model and the test set to assess its performance. The label assigned by a trained model to a test example can be a true positive (known positive example, positive label), true negative (negative example, negative label), false positive (negative example, positive label), or false negative (positive example, negative label). Since the number of available training examples is limited (here N = 62), a "leave-one-out cross-validation" strategy is employed. A model estimated using N - 1 training examples is evaluated using the single test example. This procedure is repeated for each example in turn. The total number of models that make true positive, true negative, false positive, and false negative assignments is determined. Here, the generalization performance is defined as the sum of the true positive and true negative assignments (the maximum possible generalization performance is N).
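As a concrete illustration of this protocol, the following minimal sketch (Python with scikit-learn as a stand-in implementation; the arrays X and d holding the N profile vectors and their labels are assumptions, and the RBF kernel width mirrors the setting described in METHODS) counts the true positive and true negative assignments:

```python
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loo_generalization_performance(X, d):
    """Train N leave-one-out models and count correct (true positive
    plus true negative) assignments; the maximum possible value is N."""
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = SVC(kernel="rbf", gamma=0.01)   # one model per held-out example
        model.fit(X[train_idx], d[train_idx])
        correct += int(model.predict(X[test_idx])[0] == d[test_idx][0])
    return correct
```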
Graphical Models
Graphical models can be viewed as highly structured stochastic systems that provide a compact, intuitive, and probabilistic framework capable of learning complex relations between variables such as genes and other molecular or environmental factors. A BN is a graphical model annotated with conditional probabilities in which the graph is directed and contains no directed cycles (Fig. 1; for reviews see Refs. 23, 25, 40, and the introductory tutorial at www.cs.berkeley.edu/murphyk/Bayes/bayes.html). Learning a model from data can be decomposed into the problem of learning the topology and/or the parameters. Many of the discrete time models proposed for reconstructing genetic networks from time series data are special cases of DBNs (36). The advantages of DBNs include the ability to model stochasticity, incorporate prior knowledge, and handle hidden variables and missing data in a principled way (36).
A naive Bayes model.
In a naive Bayes model, a single unobserved variable is assumed to "generate" the observed data (here, sixty-two 1,988-feature experiment profile vectors). The hidden variable is discrete, and its possible states correspond to the underlying classes in the data. The data are produced by K models or data-generating mechanisms. These K models correspond to the K classes or clusters of biological interest (Fig. 2). A naive Bayes model can be viewed as a finite mixture model. If the functional form for the data-generating mechanism is a Gaussian, then the model is a Gaussian mixture model (Fig. 3).
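In symbols, the resulting density over a profile vector is the standard diagonal Gaussian mixture (a generic form consistent with the description above, with mixing proportions $\pi_k$ and per-class, per-feature parameters $\mu_{k,l}$ and $\sigma_{k,l}$):

$$P(X_L^n) = \sum_{k=1}^{K} \pi_k \prod_{l=1}^{L} \mathcal{N}\!\left(x_l^n \,;\, \mu_{k,l}, \sigma_{k,l}^2\right)$$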
An "unsupervised naive Bayes model" refers to a model in which both the number of classes K and the K × L sets of probability parameters are estimated from unlabeled profile vectors. A "supervised naive Bayes model" refers to a model in which the number of classes is fixed a priori and the probability parameters are calculated directly from the values of features assigned to classes. Here, the unsupervised naive Bayes model is one trained to discover and characterize the K classes present in the 62 unlabeled 1,988-feature experiment profile vectors. A supervised naive Bayes model is one estimated by first partitioning the profile vectors according to their tumor or nontumor label. The 2 × 1,988 sets of probability parameters are computed directly from the expression levels of the 1,988 genes in the 40 tumor (or 22 nontumor) samples.
Support Vector Machines
In the context of pattern classification, an SVM constructs a hyperplane as the decision surface such that the margin of separation between positive and negative training examples is maximized (Fig. 4; for review, see Ref. 52 and the introductory tutorials at www.kernel-machines.org). This is achieved via an approximate implementation of the method of structural risk minimization, a principled approach rooted in statistical learning theory. This induction principle is based on the fact that the error rate of a learning machine on test examples (the generalization error rate) is bounded by the sum of the training-error rate and a term that depends on the Vapnik-Chervonenkis (VC) dimension. For separable data, an SVM produces a value of zero for the first term and minimizes the second VC term. Thus, SVMs generalize well when applied to pattern recognition problems. Compared to other machine learning algorithms, SVMs provide flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the capacity to identify outliers.
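For reference, the trained classifier has the familiar support-vector expansion (a textbook form, not a detail specific to this study), where $d^n \in \{-1, +1\}$ are the labels, $\alpha_n$ the learned coefficients, and $K$ the kernel; the radial basis function kernel used in METHODS is shown:

$$f(X) = \operatorname{sign}\!\left(\sum_{n} \alpha_n d^n K(X^n, X) + b\right), \qquad K(X, X') = \exp\!\left(-\frac{\lVert X - X' \rVert^2}{2\sigma^2}\right)$$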
A MODULAR FRAMEWORK FOR ANALYSIS OF MOLECULAR PROFILE AND OTHER DATA USING GENERATIVE AND DISCRIMINATIVE METHODS
Unsupervised Learning
Discovering and characterizing classes of profile vectors using an unsupervised learning method yields information on the fine structure of the data. Identifying classes of gene profile vectors can suggest genes whose products may have related functions as well as those that could be regulated by common transcription, environmental, or other factors. Clustering experiment profile vectors can indicate relationships between conditions and pathways. For example, if mutants with similar phenotypes fall into different classes, then the homeostatic mechanisms by which the biological endpoint is reached could differ. Alternatively, if mutants displaying seemingly unrelated physiological behaviors have similar profiles, then they may operate via common pathways.
For labeled profile vectors, an unsupervised learning method can be used to group the input vectors, and the estimated classes can then be compared with those expected on the basis of the output labels. The discrepancy between the number of classes estimated from the data and the number of known classes provides an indication of the homogeneity of the problem being addressed and/or the quality of the data used. The estimated classes may correspond to subcategories of profile vectors with the same label, for example, tumor subtypes, or to profile vectors with different labels. If an input vector with one label is assigned to a class dominated by examples bearing a different label, then this could suggest a mislabeled example.
Here, naive Bayes models are utilized to cluster sixty-two 1,988-feature experiment profile vectors. Elsewhere, they have been used to cluster seventy-two 7,070-feature experiment (8) and 5,687 seventy-eight-feature gene (35) profile vectors.
Supervised Learning
Discriminating between profile vectors with different labels and assigning the label of a new profile vector using a supervised learning method is useful for a variety of tasks. For experiment profile vectors, this classification and prediction procedure can be a component of decision support systems for clinical and/or environmental diagnosis, prognosis, and monitoring. In cancer transcription profiling studies, for example, data from specimens that have different pathological characteristics or are subtypes of the same disorder can assist in developing systems for classifying and analyzing cancers from a molecular rather than morphological perspective. For gene profile vectors, this analytical approach can suggest potential biological roles for genes if the known labels correspond to biochemical, functional, or other physiological properties.
A trained model can assist in identifying input vectors that are most important in defining classes and pinpointing those that may have been mislabeled (outliers). In principle, an unsupervised method can be used to partition unlabeled input vectors into disjoint sets such that each class can be associated with an output label, which can then be employed subsequently by a supervised learning system.
Here, SVMs are used for supervised learning problems involving the sixty-two 1,988-feature labeled experiment profile vectors. A naive Bayes model trained to address the same discrimination problem performed considerably less well. Elsewhere, SVMs have been applied to seventy-two 7,070-feature experiment profile vectors with two to four different labels (8). SVMs have been utilized to classify 2,467 seventy-nine-feature yeast gene profile vectors and to assign functional roles for uncharacterized open reading frames (5).
Feature Relevance, Ranking, and Selection: Feature Relevance Experts
In a supervised learning problem, features in the input vector vary as to how relevant they are to the discrimination problem. For molecule profile vector classes, the extent to which experiments differentiate classes can vary. Given a compendium of gene profile vectors derived from experiments examining a range of genotypes and conditions, feature relevance can define experiments that best separate genes belonging to the same biochemical and/or functional classes. For experiment profile vector classes, molecules can vary as to how well they distinguish classes. In cancer profiling studies, molecules that discriminate tumor from nontumor samples are good candidates for subsequent in-depth experimental studies as well as developing decision support systems. Highly informative features, marker genes or marker experiments, can be identified by reducing the cardinality of input vectors such that the generalization performance of a supervised learning system is undiminished compared to one trained using all the features.
Assume that profile vectors are assigned to T ≥ 2 classes, either those estimated from unlabeled profile vectors or those designated on the basis of output labels. The "relevance of a feature F_l" is defined as how well it distinguishes class c_i from c_j. If the relevance is zero, then the behavior of the feature in the two classes is the same; larger values signify increasingly greater differences and thus a greater ability to distinguish classes. The absolute magnitude is augmented with a sign such that a negative (positive) value signifies that the value of the feature is lower (higher) in c_j than in c_i. "Multiclass relevance" denotes how well the feature distinguishes class c_i from all other T - 1 classes. "Global relevance" signifies how well the feature distinguishes all T classes. Ordering features based on their relevance value ranks them in terms of how well they distinguish two specific classes, whereas the global relevance ranks them with regard to how well they distinguish all T classes. Different numbers of features can be selected, either in terms of an absolute number, such as the m top-, middle-, or bottom-ranked features, or as those with values above (below) a specified threshold.
Markers can be identified in a systematic manner with the aid of a feature relevance expert. Such an expert 1) implements an algorithm for computing feature relevance, 2) reorders features according to this value, 3) selects subsets of ranked features for use in training a supervised learning system, and 4) identifies markers based on feature subsets that generalize well, namely, correctly assign the labels of input vectors not used for training. Preferably, a relevance measure should generate a monotonic ordering. Reducing the number of features by eliminating bottom-ranked ones should improve the generalization performance of supervised learning systems trained using the feature subsets. There should be optimal subsets that maximize the performance. Finally, retaining fewer features should degrade the performance. If optimal subsets have the same or better generalization performance than the full repertoire, then these features are likely to be particularly useful markers. Different feature relevance experts can be evaluated by determining the generalization performance of supervised learning systems trained using the m top-ranked features of each expert.
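A minimal sketch of such an expert's loop follows (hypothetical helper names: `relevance` can be any per-feature score, such as the NBGR defined in METHODS, and `evaluate` any leave-one-out generalization performance measure; both are assumed supplied):

```python
import numpy as np

def feature_relevance_expert(X, d, relevance, subset_sizes, evaluate):
    """1) compute per-feature relevance, 2) rank features,
    3) train/evaluate on nested top-m subsets, 4) report markers."""
    scores = relevance(X, d)                    # one score per feature
    ranking = np.argsort(scores)[::-1]          # best feature first
    results = {}
    for m in subset_sizes:
        top_m = ranking[:m]
        results[m] = evaluate(X[:, top_m], d)   # performance using only m genes
    best_m = max(results, key=results.get)      # subset maximizing performance
    return ranking[:best_m], results            # candidate markers + curve
```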
For profile vectors with multiple labels, feature relevance depends on the biological question being posed. For example, the relevance of a gene in differentiating pathological states may or may not be related to its ability to discriminate between tissue types. Consider experiment profile vectors with two sets of binary labels, "tumor/nontumor" and "liver/colon." The relevance of a gene for the tumor/nontumor problem requires comparing its expression values in profile vectors labeled (tumor,liver)/(tumor,colon) and (nontumor,liver)/(nontumor,colon). In contrast, the relevance of the same gene for the liver/colon problem requires comparing (liver,tumor)/(liver,nontumor) with (colon,tumor)/(colon,nontumor).
Here, specific algorithms for calculating the relevance and global relevance of a feature are described that are based on the probability parameters of naive Bayes model classes. A feature relevance expert based on an unsupervised naive Bayes model generalizes better than one employing a supervised naive Bayes model (see above, A naive Bayes model).
External Knowledge as an Aid to Interpretation
The time taken to explore complex relationships revealed by analysis of profile data can be reduced by a systematic environment that extracts, organizes, and integrates external knowledge into the interpretation procedure. Gene ontologies and controlled vocabularies (3, 6) are key components in the creation of such environments. Two-way unsupervised learning, discovering classes of experiment and gene profile vectors, integrated with a comprehensive knowledge base could highlight markers and correlations for further study. For example, if genes and experiments are cross-indexed to external information in a qualitative and quantitative manner, then it should be easier to uncover statistically and/or biologically significant associations between profile vector classes and, for example, cell type, developmental stage, small molecule concentration, environmental condition, signaling pathway, and so on. Specific gene vector classes may be correlated with protein products having similar functions, noncoding regions, protein-protein interactions, and so on.
Elsewhere (35), the associations between 45 classes estimated from 5,687 seventy-eight-feature Saccharomyces cerevisiae gene profile vectors and four types of external knowledge were determined. The results were used to suggest potential functions and physiological roles for specific genes.
Decision Support Systems for Diagnosis, Prognosis, and Monitoring
A decision support system is a knowledge-based system aimed at organizing relevant experimental and other data for the purpose of helping users make decisions about real-world problems. Experiment profile vectors from cancer transcription profiling studies can be used to distinguish between specimens of known (sub)type and to assign the label for new specimens. Since the consequences of misdiagnosis are potentially deleterious, the supervised learning method and training data underlying such a decision support system should maximize sensitivity and specificity. Not all the genes monitored in a profile study are required to assign the label for a specimen of unknown origin with a high degree of accuracy. Some genes may even decrease prediction accuracy. Hence, feature relevance, ranking, and selection is an important component of creating prototypes of clinically useful systems. For a given data set and fixed number of features, there are likely to be a number of feature subsets of this size that have similar generalization performances when used to address the same supervised learning problem. Thus, features ranked highly by a majority of, or all, the experts in a mixture of feature relevance experts should be robust and reliable markers.
Here, a feature relevance expert is used to identify markers for colon adenocarcinoma. Elsewhere (8), the top 50 genes from each of three different feature relevance experts were shown to generalize as well as one another and as well as the full repertoire of 7,070 genes. However, the specific genes in these subsets were not identical. Thus, the genes in the union of these subsets (125 genes in total) were proposed as candidates for developing a prototype decision support system for distinguishing two subtypes of leukemia.
Networks for Experimental Design, Planning, and Inference
Inferring or verifying networks for use in diagnostic reasoning, causal reasoning, and assessing the effects of intervention will require fusing data and results from the other modules of the framework. For example, groupings of gene profile vectors and the identification of common noncoding regions will provide important constraints on learning the topology and/or parameters of a network. Identifying markers using feature relevance experts can pinpoint molecules that should be represented explicitly in efforts to infer networks from profile data using techniques such as graphical models.
METHODS
Naive Bayes Models: AutoClass
In AutoClass C version 3.3 (7), the continuous F_l nodes are modeled using Gaussian probability density functions and the discrete classification node C using a Bernoulli distribution (Fig. 2). Training a model involves using profile vectors to estimate the number of classes K for node C and the probability parameters for each F_l node. Starting from random initial descriptions for a specified number of classes, a gradient descent search through the space of descriptors is performed. At each step of the model search procedure, the current descriptions are used to probabilistically assign each profile vector to each class. The observed values for each profile vector are used to update the class descriptions, and the procedure is repeated until a specified convergence criterion is reached. The program iterates through different numbers of classes to determine the best taxonomy.
Overfitting, finding a model in which the number of classes K is equal to the number of profile vectors N, is ameliorated as follows. A variant of the expectation-maximization (EM) algorithm is used to search through model-space with the condition that each profile vector belong to some class (the sum of all class probabilities is one). A penalty is incurred for adding more classes. Increasing the number of classes decreases the prior probability of each class unless the additional class improves the likelihood of the data. The model-space that needs to be searched can be constrained by setting a lower bound on the variance of the data-generating mechanism. For each gene l, the level of observation noise (measurement error) and/or natural variation in expression between samples (patients) can be used to set this value in a data-dependent manner. Thousands of models are estimated, each starting from different random number seeds. Each resultant model, a locally optimal solution in the parameter space, is scored by its marginal likelihood. These model marginals are compared to find the model that best describes the data.
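A rough analog of this search can be sketched with scikit-learn's diagonal Gaussian mixture, with `reg_covar` standing in for the variance floor and the BIC score standing in for AutoClass's marginal; this approximates the procedure rather than reimplementing AutoClass:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def autoclass_like_search(X, max_classes=10, n_seeds=25):
    """Fit diagonal Gaussian mixtures over a range of class counts and
    random restarts; keep the model whose BIC score is best (lowest)."""
    best_model, best_bic = None, np.inf
    for k in range(2, max_classes + 1):
        for seed in range(n_seeds):
            gmm = GaussianMixture(n_components=k,
                                  covariance_type="diag",  # one (mu, sigma) per class and gene
                                  reg_covar=1e-2,          # crude stand-in for the variance floor
                                  random_state=seed).fit(X)
            bic = gmm.bic(X)                               # penalizes extra classes
            if bic < best_bic:
                best_model, best_bic = gmm, bic
    return best_model  # means_/covariances_ ~ the K x L parameter matrix
```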
The input data are the sixty-two 1,988-feature experiment profile vectors [X_{1988}^1, ..., X_{1988}^{62}], where X_{1988}^n = [x_1^n, ..., x_{1988}^n]. The expression level of gene l in profile vector n, x_l^n, is used as is, i.e., the published data (2) are not rescaled, shifted, normalized, or modified. Since the measurement error and intrinsic variability are unknown, the minimum value of the standard deviation of the Gaussian for each class, σ_{k,l}, is set to 0.1 of the standard deviation of the Gaussian for the expression values across all N samples, x_l^1, ..., x_l^N. The output consists of 1) K, the number of classes; 2) an N × K likelihood matrix, where each element is the likelihood of experiment profile vector n given class c_k, P(X_L^n | c_k, M); and 3) a K × L parameter matrix, where each element is the mean and standard deviation of the Gaussian modeling class c_k and gene l, (μ_{k,l}, σ_{k,l}). For the data set here, the marginal for the best model is significantly higher than those of the other models. The final results do not depend on the order in which input vectors are entered into the model.
Support Vector Machines: SVMlight
SVMlight version 3.02 (24) has a fast optimization algorithm, can handle many thousands of support vectors, can be trained using tens of thousands of training examples, and supports a variety of kernel functions. The input data are labeled profile vectors and a kernel function plus any of its associated parameters. Although there is no formal mechanism for selecting the most appropriate class of kernel function for a particular problem, empirical evidence suggests that a radial basis function is a reasonable choice. This kernel performed well when applied to biological classification problems arising from transcription profiling (5, 8) and protein fold recognition (M. L. Chow and I. S. Mian, unpublished information) studies. Based on the latter work and tests using the data examined here, the width of the radial basis function, γ = 1/(2σ²), is set to 0.01. Elsewhere (5, 8), the value of σ is set in a data-dependent manner by choosing σ to be equal to the median of the Euclidean distances from each positive example to the nearest negative example (5). The output from the learning module is a binary classification model that can be used to assign the label for a test example.
SVMs are trained and evaluated using the leave-one-out cross-validation procedure described above (Learning Models from Profile Data). To account for unequal numbers of positive and negative training examples, each estimation set is balanced by duplicating as many randomly chosen examples as necessary from the smaller set to yield the same number of examples as the larger set. The generalization performance achieved is the total number of SVMs that make true positive and true negative assignments for their test example. A false positive or false negative assignment occurs when the test example falls on the wrong side of the decision boundary.
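A minimal sketch of the balancing step (function name hypothetical; assumes a binary labeling and NumPy arrays):

```python
import numpy as np

def balance_by_duplication(X, d, rng=np.random.default_rng(0)):
    """Duplicate randomly chosen minority-class examples until both
    classes contribute the same number of training examples."""
    labels, counts = np.unique(d, return_counts=True)   # two classes assumed
    minority = labels[np.argmin(counts)]
    deficit = counts.max() - counts.min()               # how many duplicates needed
    idx = np.flatnonzero(d == minority)
    extra = rng.choice(idx, size=deficit, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([d, d[extra]])
```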
Feature Relevance: Naive Bayes (Global) Relevance
The measure of feature relevance (see above, Feature Relevance, Ranking, and Selection: Feature Relevance Experts) proposed here is termed the naive Bayes relevance (NBR). It is based on the probability of a profile vector class c_k given the observed value of feature l, P(c_k | x_l^n). Using Bayes rule and assuming that classes c_i and c_j are independent and equally likely a priori, the posterior underlying the NBR is

$$P(c_i \mid x_l^n) = \frac{P(x_l^n \mid c_i)}{P(x_l^n \mid c_i) + P(x_l^n \mid c_j)}$$

where the class-conditional densities P(x_l^n | c_k) are the Gaussians of the naive Bayes model.
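One realization consistent with these definitions (a reconstruction sketch, not necessarily the authors' exact expressions) averages the separation of the class posteriors over the N samples, attaches the sign prescribed earlier (negative when the feature is lower in c_j than in c_i), and aggregates the pairwise magnitudes for the global measure:

$$\mathrm{NBR}_{ij}(F_l) = \operatorname{sign}\!\left(\mu_{j,l} - \mu_{i,l}\right) \frac{1}{N} \sum_{n=1}^{N} \left| P(c_i \mid x_l^n) - P(c_j \mid x_l^n) \right|, \qquad \mathrm{NBGR}(F_l) = \sum_{i<j} \left| \mathrm{NBR}_{ij}(F_l) \right|$$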
Naive Bayes Model-Based Feature Relevance Expert
The probability parameters for the K classes of an unsupervised naive Bayes model (see above, Graphical Models) are used to calculate NBR_ij(F_1), ..., NBR_ij(F_1988) and NBGR(F_1), ..., NBGR(F_1988). The 1,988 genes are reordered according to their NBR and NBGR values. The ranking based on the NBGR values is termed the "K-class unsupervised NBGR ranking." The probability parameters for the K = 2 classes of a supervised naive Bayes model are used to calculate NBGR values. The ranking based on these NBGR values is termed the "K = 2 supervised NBGR ranking."
For each ranking, representative gene subsets are created by selecting different numbers of top-ranked genes. Each subset is employed to create training examples for leave-one-out cross-validation studies in which the input vectors contain only the selected genes. Rather than working directly with the original expression levels, x_l^n, each value is normalized using

$$x_l^n \Big/ \left[\sum_{l \in S} (x_l^n)^2\right]^{1/2}$$

where S is the gene subset of interest. For simplicity and to illustrate the basic approach, genes are ranked once using all N training examples and not for each N - 1 estimation set.
Supervised Learning System: SVM vs. Naive Bayes Model
In addition to being a generative model for unsupervised learning, a naive Bayes model can be used for supervised learning and prediction. Given a model that has grouped training data into K classes, the posterior probability of each class given a test example, P(c_k | X_L^n), is computed. The test example is assigned to the class that maximizes this value. To compare SVMs and naive Bayes models as supervised learning systems, N supervised naive Bayes models are trained and tested using the same leave-one-out cross-validation strategy employed to evaluate SVMs (see above, Support Vector Machines: SVMlight). The generalization performance of these two systems is compared using feature subsets derived from the K-class unsupervised NBGR ranking and the K = 2 supervised NBGR ranking.
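Concretely, under the naive Bayes assumption this prediction rule takes the standard form

$$\hat{k} = \arg\max_{k} P(c_k \mid X_L^n), \qquad P(c_k \mid X_L^n) \propto P(c_k) \prod_{l=1}^{L} \mathcal{N}\!\left(x_l^n \,;\, \mu_{k,l}, \sigma_{k,l}^2\right)$$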
Outliers and Potentially Mislabeled Specimens
Support vectors define the location of the decision surface (solid symbols in Fig. 4), whereas nonsupport vectors (open symbols) do not participate in its specification. One method for identifying outliers and potentially mislabeled specimens is to pinpoint which positive and negative training examples are support vectors and which are nonsupport vectors. For each leave-one-out SVM, the training examples that constitute the support vectors and nonsupport vectors are ascertained. An "invariant support vector training example" is one that is a support vector in all the N - 1 SVMs that placed it in the estimation set. Similarly, an "invariant nonsupport vector training example" is one that is never a support vector. This approach presumes no mislabeled examples and uses a hard margin for SVM training. A soft margin would permit training examples to violate the decision boundary subject to some penalty.
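A sketch of this bookkeeping (scikit-learn's SVC exposes support-vector indices via `support_`; a large C approximates the hard margin; function name hypothetical):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def invariant_support_vectors(X, d):
    """For each example, count how often it is a support vector across
    the N-1 leave-one-out SVMs whose estimation sets contain it."""
    n = len(d)
    sv_counts = np.zeros(n, dtype=int)
    for train_idx, _ in LeaveOneOut().split(X):
        svm = SVC(kernel="rbf", gamma=0.01, C=1e6).fit(X[train_idx], d[train_idx])
        sv_counts[train_idx[svm.support_]] += 1   # map back to original indices
    always_sv = np.flatnonzero(sv_counts == n - 1)  # invariant support vectors
    never_sv = np.flatnonzero(sv_counts == 0)       # invariant nonsupport vectors
    return always_sv, never_sv
```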
RESULTS
The published study clustered the profile vectors by means of a binary tree computed using an algorithm based on deterministic annealing (2). Cluster 1 contained 3 nontumor and 35 tumor specimens, whereas cluster 2 contained 19 nontumor and 5 tumor specimens; 36 of the 38 specimens assigned to cluster 1 belong to classes 1, 2, or 4. Although class 4 is a subset of cluster 1, its specimens are scattered throughout the clustering tree. Hence, the unsupervised naive Bayes model defines an important subgroup of tumor specimens not detected by the binary clustering method.
Supervised Learning Using SVMs
Table 1 shows that when all 1,988 genes are used, 55 (89%) of the SVMs make consistent assignments. Classes 1 and 4 are associated with consistent assignments, whereas classes 2 and 3 contain the three nontumor and four tumor inconsistent assignments (boxed in Table 1). The discrepancy between the generalization performance achieved, 55, and the maximum possible, 62, indicates the divergence between the known labels in Alon et al. (2) and the assigned labels here. The seven false negative and false positive assignments (boxed) are valid only within the context of the original labels. Possible explanations for these seven "differences," especially the two for patient 36, include 1) deficiencies in the SVM learning method used, 2) specimens may have been mislabeled as a result of human error, and 3) pathologically "normal" regions of the colon could have substantial tumor-like properties from a molecular standpoint.
Overall, the results suggest the presence of three subtypes of specimens: those that are clearly tumor (classes 1 and 4), those that are mainly nontumor (class 3), and those that are heterogeneous or have a mixed tissue composition (class 2). The two tumor classes could, for example, indicate different pathways for reaching the same biological endpoint and/or variation in the treatment schedules or clinical histories of the patients.
NBR: Genes That Distinguish Class 4 From 1, 2, or 3
The NBR measure quantitates the degree to which gene F_l distinguishes class i from class j [see above, Feature Relevance: Naive Bayes (Global) Relevance]. Genes with the highest and lowest values are the top- and bottom-ranked genes, respectively. Since Table 1 suggests that class 4 is perhaps the most interesting, this class will be employed as an exemplar to illustrate the utility of the overall approach (Fig. 5). The aim of the subsequent analysis is not to provide a detailed discussion of all the genes and their potential roles, but to demonstrate that NBR values provide a useful mechanism for pinpointing biologically plausible candidates for subsequent in-depth studies. For example, genes that distinguish class 4 from the other three classes include immunoglobulin superfamily receptors (Fc receptor hFcRn) and laminin receptors. Immunoglobulin receptors are known to be associated with malignant transformation and dissemination of colon tumors (48).
Precursors for both complement C1s and C1r.
These proteases are responsible for the lectin pathway activation and proteolytic activity of the C1 complex of complement, an activation system designed for the elimination of pathogens. The lectin pathway plays an important role in innate immunity.
Fibulin-2.
This extracellular matrix (ECM) protein is present in the basement membrane and stroma of many tissues, and its expression pattern suggests an essential role in organogenesis, particularly in embryonic heart development.
Hevin.
This ECM protein is important for the adhesion and trafficking of cells through the endothelium. Hevin has been shown to be downregulated in non-small cell lung cancer (4) and metastatic prostate adenocarcinoma (38).
Vasoactive intestinal peptide.
Vasoactive intestinal peptide (VIP) has been implicated as an important factor in several inflammatory conditions of the human gut.
Tumor necrosis factor-α-inducible protein A20.
This putative DNA binding protein is a Cys2/Cys2 zinc finger protein induced by a variety of inflammatory stimuli and characterized as an inhibitor of cell death.
The cytoskeletal proteins actin and myosin and endothelial actin-binding protein.
Relative to class 4, these proteins are downregulated in classes 1, 2, and 3 (blue, Fig. 5).
Polyadenylate-binding protein.
This protein recognizes the 3' mRNA poly(A) tail and plays critical roles in eucaryotic translation initiation and mRNA stabilization/degradation.
DNA-apurinic or apyrimidinic site lyase APE1/HAP1.
This protein plays an important role in DNA repair and in the resistance of cancer cells to radiotherapy.
KAP-1.
This protein (TIF1β/KRIP-1; human unknown protein mRNA; R37428) may be a corepressor for the large class of KRAB-containing zinc finger proteins (1).
Calnexin precursor.
This protein is a chaperone that promotes the correct folding and oligomerization of many glycoproteins. A study of protein changes associated with ionizing radiation-induced apoptosis in human prostate epithelial tumor cells indicated that the protein levels of this molecular chaperone are higher in such dying cells (41).
Inosine 5'-monophosphate dehydrogenase 2.
Inosine 5'-monophosphate dehydrogenase 2 (IMPDH isoform 2) is the rate-limiting enzyme in the de novo synthesis of guanine nucleotides. Of the two isoforms, IMPDH isoform 2 is selectively upregulated in neoplastic and replicating cells and is thus considered to be a sensitive target for cancer chemotherapy (reviewed in Ref. 12).
Overall, the results suggest that tumor specimens belonging to classes 1 and 4 have very distinctive properties. For example, NDP kinase (nm23-H2S) is known to be associated with tumor metastasis (13), but the levels in these classes are very different. There are marked differences in genes related to cell growth, protein synthesis, energy metabolism, oxidative stress, and apoptosis. Greater knowledge of the clinical histories of the patients from which these tumor specimens were taken may reveal the origins of these differences. One possibility, based on the expression patterns of calnexin and IMPDH-2, is that patients whose tumor samples are assigned to class 4 may have received radiation or other therapy.
In some instances, differential expression at the gene level is mirrored at the protein level. Prohibitin and IMPDH-2 have been shown to exhibit differential protein expression in normal and neoplastic human breast epithelial cell lines (54). The levels of the latter enzyme in tumor cell lines were elevated 2- to 20-fold relative to the levels in normal cells. Relative to tumor class 4, the expression levels of the genes for these enzymes exhibit a similar pattern in that they are downregulated in the other classes.
NBGR: Genes That Distinguish All Classes
The NBGR measure quantitates the degree to which gene F_l distinguishes all four classes. Genes with the highest and lowest values are the top- and bottom-ranked genes, respectively. The top 50 NBGR-ranked genes are listed in Table 2. Of the feature subsets examined (discussed below, Naive Bayes Model-Based Feature Relevance Experts), the top 50 represents the smallest number of features that generalize as well as all 1,988 genes. Selected genes of potential interest are as follows.
Ferritin.
Low serum ferritin levels are associated with patients having serious gastrointestinal pathologies such as neoplasia and acid peptic disease (29). Previous work has shown that the majority of colorectal adenocarcinomas exhibit ferritin expression (20), but the clinical significance remains unknown.
Tra1/GRP94/GP96.
This molecular chaperone has been suggested to be useful in cancer immunotherapy (39). The level of the protein is higher in human breast cancer cell lines compared with normal basal epithelial cell lines (15). Figure 5 indicates that HSP 90-β, another member of the heat shock protein 90 family to which Tra1 belongs, is downregulated in classes 1, 2, and 3 relative to class 4.
In a manner analogous to comparative sequence analysis, comparative analysis of molecular profile data may be useful for inferring the potential physiological roles of genes. Such comparison of the expression patterns of orthologous and paralogous proteins can be illustrated using "translationally controlled tumor protein" (TCTP, HRF P23), the ninth-ranked gene. TCTP is a eucaryotic cytoplasmic protein found in several normal and tumor cells that is suggested to have a general, yet unknown, housekeeping function (44). Comparative sequence analysis (data not shown) provides few insights into the biological role of this evolutionarily conserved protein, a protein that may have a role in colon cancer. A naive Bayes model trained using 5,687 seventy-eight-feature yeast gene profile vectors found 45 classes (35). The yeast TCTP homologue (TCTP_YEAST; YKL056C) is found in a class populated with genes from the MIPS (33) protein functional category "PROTEIN SYNTHESIS: ribosomal proteins." Physiologically, therefore, and consistent with other genes in the top 50, human TCTP may be a ribosome-associated protein.
Marker Genes for Understanding Colon Adenocarcinoma Biology
One mechanism for generating a set of candidates for subsequent study is by taking the union of the NBGR top 50 listed in Table 2 and the genes shown in Fig. 5. Experimental data support the notion that these 89 genes may be biologically relevant. For example, the set includes genes shown to be differentially expressed in mucus-secreting cells and undifferentiated HT-29 colon cancer cells: transcripts encoded by the mitochondrial genome, components of the protein synthesis machinery, ferritin, and TCTP (37). Alterations in the distribution and/or adhesiveness of laminin receptors in colon cancer cell lines may be associated with increased tumorigenicity (27). A study of cultured colon cancer cells suggests that laminin may play an important role in hematogenous metastasis by mediating tethering and spreading of colon cancer cells under blood flow (28). In general, the markers are involved in cell signaling, adhesion and communication, immune response, heat shock, and DNA repair. Adhesion receptors and cell surface-associated molecules mediating cell-matrix and cell-cell interactions are known to play an important role in tumor cell migration, invasion, and metastasis.
Selecting markers for use in understanding colon adenocarcinoma biology, creating diagnostic tools, and highlighting targets for therapeutic intervention and drug-design requires reducing the original 1,988 genes to smaller, more manageable subsets. The aforementioned markers were defined using a fairly stringent threshold, |NBR_ij(F_l)| ≥ 0.7, and a small fixed number of top-ranked genes (i.e., 50). This "low-hanging fruit" approach is unable to detect genes involved in more subtle interactions.
Naive Bayes Model-Based Feature Relevance Experts
Developing a decision support system may require using a larger number of genes than an experimental investigator might be interested in pursuing. A learning system designed to discriminate between tumor and nontumor specimens should optimize specificity and generalization performance rather than minimize the number of genes proposed as being important. The consequences of a tumor specimen labeled incorrectly as nontumor (a false negative) may be more severe than overpredicting false positives.
One approach to identifying markers for prototype decision support systems is by means of a feature relevance expert. Table 3 shows the generalization performance of leave-one-out SVMs trained using 11 feature subsets. The maximum generalization performance achieved, 55, is less than the maximum possible, 62. The top 50 genes perform as well as the full repertoire of 1,988 genes. Using only the top two degrades the overall performance by only three (55 to 52). Further studies are required to assess whether, for example, any 2 in the top 10 would have the same performance as the top 2. The NBGR ranking appears to be meaningful because the performance of the top 500, 50, 25, 10, 5, and 2 subsets is consistently higher than the equivalent number of bottom-ranked genes. As the number of genes used decreases from 500 to 2, the difference in performance increases from 52 - 50 = 2 to 52 - 27 = 25. As shown elsewhere (8), there are likely to be other subsets of 50 genes that have some or no overlap with the NBGR top 50 but which have the same generalization performance.
Although the exact shape of the function relating performance to the number of top-ranked genes is unknown, it is possible to improve the performance by examining subsets in the 500-50 range. Table 4 shows that of all the subsets examined, the maximum generalization performance is achieved with the top 200 genes (56). The original 62 training examples were partitioned such that the 56 consistently assigned specimens (N or T in Table 4) formed the estimation set. The remaining six specimens formed the test examples. The assignments made by an SVM trained using the top 200 genes did not change, i.e., the false positive and false negative assignments support the notion that these six specimens are likely to be outliers. The results suggest that the 200 top-ranked genes from the 56 aforementioned specimens could be used to develop a prototype diagnostic tool. Further studies are required to ascertain the success of such a tool when used for large-scale colon adenocarcinoma screening studies.
DISCUSSION
A more thorough interpretation of and explanation for the results would be possible if information such as the sizes, sites, and disease stages of cancer for the tumors and patient histories were available. Given such information, the gene expression measurements could be correlated with potential clinical outcomes such as radiosensitivity and response of the tumor to chemotherapy. The strategy utilized here is sufficiently general that it can be applied to other transcription profiling studies as well as other types of molecular profile data.
The results reiterate the importance of controlling and optimizing the experimental techniques used to obtain and handle in vivo specimens because of their impact on the information that can be extracted. The aforementioned markers implicate the microenvironment, cell-matrix interactions, cell-cell communication, and the immune system as key factors that differentiate nontumor from tumor colon adenocarcinoma tissue specimens. It remains to be seen whether transcription profiles derived from cell lines or cultures would have highlighted the role of tissue biology in this disorder. It will be necessary to compare tumor and nontumor specimens with those from individuals having no record of adenocarcinoma. Arrays containing the full complement of human genes and not just the selected set employed here are likely to reveal additional marker genes. For the purpose of developing and using a robust decision support system, it is critical that collection and preparation of all specimens conform to a standardized procedure in order to minimize heterogeneity in the cell types assayed.
Learning biologically realistic networks from data even with the aid of domain experts remains a challenging task. This stems from the nature and quality of the available data, theoretical issues of learning models with large numbers of noisy variables, and efficient implementations of the modeling methods. For example, mRNA transcript and protein levels are not necessarily correlated (18). Clearly, genetic networks inferred from molecular profile data alone will be insufficient to understand many aspects of the behavior of cells and tissues. Nonetheless, the results in this and related work (8, 35) suggest that the framework and techniques proposed here have the potential for creating robust decision support systems and learning plausible networks. The successful integration of discriminative and generative methods in the analysis of molecular sequence data (21, 34) augurs well for their application to molecular profile data.
ACKNOWLEDGMENTS
Present addresses: E. J. Moler, Chiron Corp., 4560 Horton St., Emeryville, CA 94608; M. L. Chow, Gene Logic Inc., 2001 Center St., Berkeley, CA 94704.
FOOTNOTES
Address for reprint requests and correspondence: I. S. Mian, Dept. of Cell and Mol. Biol., MS 74-197, Radiation Biology and Environ. Toxicol. Group, Life Sci. Div., Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720 (E-mail: SMian@lbl.gov).
* E. J. Moler and M. L. Chow contributed equally to this work.
References