Department of Cell and Molecular Biology, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720
![]() |
ABSTRACT |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
molecular profile matrix; gene profile vectors; naive Bayes model; copper and iron metabolism; Bayesian networks
![]() |
INTRODUCTION |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Working prototypes of techniques and tools that address tasks in specific modules were applied to transcription profile data from human colon adenocarcinoma tissue specimens, namely sixty-two 1,988-feature experiment profile vectors labeled tumor or nontumor (2). A naive Bayes model discovered and characterized three classes (clusters) of profile vectors and thus subtypes of the specimens (unsupervised learning). SVMs distinguished tumor from nontumor specimens and assigned the label of profile vectors not used for training (supervised learning). Fifty to 200 genes were identified that distinguished the two types of specimens as well as or better than the 1,988 assayed originally (feature relevance, ranking, and selection). This small subset of marker genes defined biologically plausible candidates for subsequent studies.
Clustering gene profile vectors has shown that genes encoding proteins with related functions tend to have similar expression patterns (2, 6, 25, 26, 30). For example, one cluster may be associated with genes involved in DNA repair and another with transcription. However, because proteins often have multiple functions and/or roles, it should be possible for a gene profile vector to belong more than one class and for this membership to be quantifiable. The widely used hierarchical clustering approach (6) has sharp rather than smooth cluster boundaries and cannot assign a new profile vector to an existing class. Methods such as SOMs require the number of classes to be specified a priori. The AutoClass (4) implementation of a naive Bayes model clustered the 62 aforementioned experiment profile vectors using a mixture of Gaussian probability distributions and employed Bayesian methods to derive both the maximum posterior probability of classification and the optimum number of classes.
Here, a novel suite of tools developed for the analysis, display, and visualization of genome wide features are applied to 5,687 seventy-eight-feature gene profile vectors from a published Saccharomyces cerevisiae molecular profile matrix (25). The eight studies (78 experiments) examined the cell cycle, responses to stress (heat and cold shock), and diauxic shift. AutoClass is used to discover and characterize classes of gene profile vectors (raw intensity measurements are transformed in a manner that differs from the standard log-ratio). A feature relevance measure different from that proposed earlier (17) is employed to identify experiments that are most important in defining each class. The fraction of genes assigned to a known biological category that belong to a given class is used to pinpoint classes most associated with specific functions. This novel enrichment measure allows external knowledge to be incorporated into the analysis and interpretation procedure in a systematic and quantitative manner. The results are utilized to make inferences about copper and iron homeostasis, a problem that encompasses pathway and mechanisms that were not the primary subject of the original studies. Classes of coexpressed genes suggest those that may be regulated by common factors and thus provide constraints for learning genetic networks within the same graphical model formalism as that used for clustering (17). Future directions for the methodology are discussed.
![]() |
METHODS |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
A Naive Bayes Model
Graphical models are highly structured stochastic systems that provide a compact, intuitive, and probabilistic framework capable of learning complex relations between variables (for reviews see Refs. (11, 12, and 19) and the introductory tutorial on Bayesian Networks at www.cs.berkeley.edu/murphyk/Bayes/bayes.html). A naive Bayes model is a simple graphical model in which a single unobserved variable C is assumed to generate the observed data (here, 5,687 seventy-eight-feature gene profile vectors). The hidden variable is discrete, and its possible states correspond to the underlying classes in the data. Profile vectors are believed to be generated by K models or data-generating mechanisms. These K models correspond to K clusters or classes of biological interest (Fig. 1). This model-based approach to clustering can handle missing data, noisy data, and uncertainty about class membership in a probabilistic manner. There is direct control over the variability allowed within each class (the variance characteristics of each data-generating mechanism). The question of how many classes the data suggest can be treated in an objective manner. Finding the optimal weights, locations, and shapes of the component classes can be performed in a principled manner.
|
![]() | (1) |
Given the relationships depicted in Fig. 1, the likelihood of profile vector XLn given class k is
![]() | (2) |
![]() | (3) |
Given only the observed data D and a functional form for the mechanism that generates D, the task is to find a model M that best describes D and hence the classes. Since the network topology shown in Fig. 1 is fixed, the learning problem becomes one of estimating the parameters of M. For L observed variables, these are the number of classes K and K x L local probability models and conditional probability distributions. Here, the probability parameters are the Gaussian mean and standard deviation, so K x L x 2 parameters need to be determined. Estimating these parameters from data involves finding a maximum a posteriori model: a model which maximizes the posterior probability of the model given the data (Bayes rule)
![]() | (4) |
AutoClass Implementation of Naive Bayes Models
AutoClass C version 3.3 (4) models the continuous experiment Fl nodes in Fig. 1 using Gaussians and the discrete classification C node using a Bernoulli distribution. Starting from random initial descriptions for a specified number of classes, a gradient descent search through the space of descriptors is performed. At each step of the search procedure, the current descriptions are used to assign probabilistically each profile vector to each class. The observed values for each profile vector are used to update class descriptions, and the procedure is repeated until a specified convergence criterion is reached. A variant of the expectation-maximization (EM) algorithm is employed with the additional assumption that each profile vector belongs to some class (the sum of all class probabilities is one). There is a penalty for adding more classes and thus overfitting the data. Increasing the number of classes will decrease the prior probability of each class unless the additional class improves the likelihood of the data (Eq. 3). AutoClass iterates through different numbers of classes to determine the best taxonomy.
The model-space that needs to be searched can be constrained by setting a lower bound on the variance of the data-generating mechanism. For each experiment l, the level of observation noise (measurement error) and/or the natural variation in expression between genes can be used to set this value in a data-dependent manner. However, since neither the noise nor intrinsic variability are known, a lower bound on the standard deviations of Gaussians modeling classes is set to 1/10; of the standard deviation of N expression levels {xl1, ... , xlN}. Thousands of models are estimated, each starting from different random number seeds. Each resultant model, a locally optimum solution in the parameter space, is scored. The model marginals are compared to find the model that best describes the data. The results do not depend on the order in which profile vectors are entered into a model.
The input data for AutoClass are the 5,687 seventy-eight-feature gene profile vectors [X781, ... , X785687], where X78n = [x1n, ... , x78n]. The output consists of 1) K, the number of classes, 2) an N x K likelihood matrix where each element is the likelihood of a gene profile vector n given class k P(XLn|ck, M), and 3) a K x L parameter matrix where each element is the mean and standard deviation of the Gaussian modeling class k and experiment l, (µk,l, k,l). The marginal for the best model, the one used in all subsequent analyses, is significantly higher than the next nine models. Since all L = 78 experiments are used, the gene profile vector classes discovered and characterized by the best model capture the expression behavior of genes across this specific range of conditions. Although some of the 8 studies consisted of more than one experiment, the topology of the naive Bayes model treats the 78 experiments as being independent. The notion of conditional independence of gene profile vectors given a class does not mean independence from the experimental conditions, an unlikely assumption (1).
Intensity Transformation
Frequently, the "expression level" of gene n in experiment l, xln, is taken to be the log of the background-corrected intensity measurements for samples tagged with the Cy5 and Cy3 dyes, xln = log (ICy5/ICy3). Here, the expression level is calculated by normalising background corrected intensities using xln = (ICy5 - ICy3)/(ICy5 + ICy3). This transformation has the same advantages as the log-ratio but has two additional useful properties. It minimizes errors associated with background subtraction from low-intensity signals and constrains expression levels to lie in the -1 I
+1 domain. The expression levels in the gene profile vectors used to train a naive Bayes model are not shifted, rescaled, or modified in any other way.
AutoClass Implementation of Feature Relevance
AutoClass implements a feature relevance measure termed the relative influence Ilk. Here, it signifies how important experiment l is in determining class k. It is defined as the relative entropy between two distributions representing the expression level for the N genes in the trained model P(xln|ck,l, M) and a model M* describing the entire data set by means of a single class P(xln|M*)
![]() | (5) |
Integrating External Knowledge to Aid Interpretation: Enrichment
To create a systematic and quantitative environment in which to interpret naive Bayes model classes, external knowledge about genes is integrated in the following manner. An x K enrichment matrix is calculated as the product of an
x N ontology matrix and the N x K AutoClass likelihood matrix (see above, AutoClass Implementation of Naive Bayes Models). Given a specific type of external knowledge, let
be one of the
categories to which the ontology assigns a gene. For example, an ontology could classify genes based upon cell type, developmental stage, and so on. Each element of an ontology matrix is the probability of gene n given category
, P(Xn|
). The enrichment matrix is a product of the ontology and likelihood matrices normalized so that the sum across all classes in a single category is one. Each element of this enrichment matrix,
k or enrichment, is the fraction of genes in category
that belong to class k
![]() | (6) |
Four enrichment matrices are calculated using ontologies that assign genes to the following sets of categories: 1) protein function, = 106 highest level categories in the MIPS functional catalog (16); 2) M/G1, G1, S, S/G2, and G2/M stage of the cell cycle,
= 5 stages (this matrix uses only the 800 genes identified using Fourier analysis of the cell cycle experiments (25) as being most associated with these stages); 3) target of transcription factor,
= 265 categories given by the YPD version 9.36a (10); and 4) chromosome number,
= 16 categories based upon chromosomal location given by the YPD version 9.36a (10). For simplicity, the assignment of a gene to a category is taken to binary: P(Xn|
) is set to 1.0 (0.0) if the gene is (is not) a member of category
.
![]() |
RESULTS |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
Association Between Classes and Gene Categories: Enrichment
Figure 3 shows the relationship(s) between gene profile vector classes and four types of external knowledge. Because of limitations in how enrichment values are computed, subsequent discussions will focus on a qualitative assessment of selected observations that highlight the overall utility of enrichment matrices. For example, identical enrichment values can be obtained that do not have the same significance, and there is no correction for classes having different numbers of members. In addition, the assignment of genes to categories is neither comprehensive nor complete, so it is difficult to estimate the actual false positive and false negative rates. The emphasis will be on classes shown in Fig. 2.
|
M/G1, G1, S, S/G2, and G2/M stage of the cell cycle.
These 5 categories are associated with only a subset of the 45 classes. For example, M/G1 is associated with class 39, but genes assigned to G1 are distributed across classes 26 and 37. The results indicate that one class (category) can be associated with more than one category (classes).
Target of transcription factor.
Classes 29, 39, 40, 41, 42, 43, and 44 are associated with specific transcription factors. For class 40, these factors regulate cell cyclins and chromatin assembly genes (BCK2, HTA1, HTA2, HTB1, SIT4, SPT5, SPT6, SPT10, SPT12). This observation is consistent with the protein function category most associated with class 40 being "CELLULAR ORGANIZATION; organization of chromosome structure". Thus, members of class 40 with "unknown" YPD roles may be involved directly or indirectly in cellular organization and could be regulated by the aforementioned transcriptions factors. These ORFs are YDR451C, YKR012C, YMR215W, YMR305C, YNL300W, YNR009W, YOL007C, YOL019W, YOR247W, and YOR248W. A similar type of prediction can be made for ORFs in class 43: YCR013C and YKL153W may be involved in energy metabolism and could be regulated by PDC2 (pyruvate decarboxylase regulatory protein), GCR2 (transcriptional activator involved in regulation of glycolytic gene expression), and REG1 (regulatory subunit for protein phosphatase Glc7p required for glucose repression).
Chromosome number.
At the level of resolution of entire chromosomes, there is little association between class and chromosome number. It is conceivable that partitioning the genome into smaller segments might reveal classes associated with specific regions of a chromosome and thus potential common noncoding regulatory regions.
Genes Assigned to More than One Class
Table 1 of the Supplementary Material lists genes that best illustrate partial membership of multiple classes. (Tables 14 have been published online as Supplementary Material and can be viewed at the Physiological Genomics web site.)1 The genes that are almost equally distributed between the largest number of classes are YLR020C, YLR113W (HOG1), and YOR023C (AHC1). For YLR160C (ASP3D) and YHR004C (NEM1), not only is the maximum value of the likelihood of the profile vector given a class less than 0.5, but they are assigned to 5 different classes (the maximum possible). For genes that belong to multiple classes, the eight YPD cellular roles assigned to two or more genes are Pol II transcription, protein modification, cell stress, small molecule transport, vesicular transport, signal transduction, amino acid metabolism, and protein degradation.
For genes in Table 1 (of the Supplementary Material) with known YPD roles, the annotations for other genes assigned to the same set of classes was examined. This analysis yields some novel and interesting observations, especially with respect to yeast genes important in regulating metal transport (reviewed in Ref. (20)). YAL021C (CCR4) is assigned to classes 2, 5, and 7: these classes contain many other genes that have been linked genetically with CCR4. YGL233W (SEC15) and YFL025C (BST1) are assigned to classes 2 and 7: these classes contain other genes shown to be involved in vesicle transport (SEC15 also falls into class 4). YGL167C (PMR1), which encodes a protein involved in import of calcium, copper, and manganese into the Golgi, is assigned to classes 8, 12, 14, and 24: these classes contain many other genes known to be involved in intracellular metal metabolism. For example, classes 8 and 24 contain GEF1, a gene encoding an intracellular chloride transporter essential for proper assembly of copper into the iron transport protein Fet3p. Class 12 contains MMT2, which encodes a mitochondrial iron transporter that results in enhanced survival on low iron conditions. Class 14 contains both YGL071W (RCS1/AFT1), which encodes the key transcriptional regulator of iron metabolism, and BSD2, which encodes a posttranslational regulator of Smf1p, a manganese transporter. PBN1, RPN4, ASP3, HOG1, and CIN1 are genes with complex functions that are likely to be associated with many pathways and/or interactions.
To predict potential biological functions for genes in Table 1 with an "unknown" YPD role, other genes that exhibit the same spectrum of class memberships were identified (see Table 2 of the Supplementary Material). Comparing these two tables suggests that these eight genes could be involved in Pol II transcription, protein degradation, cell stress, amino acid metabolism, lipid, fatty acid, and sterol metabolism, and RNA turnover.
Genes Involved in Copper and Iron Metabolism
The original studies were designed to reveal expression patterns associated with the cell cycle, responses to stress, and diauxic shift. It is not surprising, therefore, that many of the classes identified here are highly enriched in genes involved in processes such as release or storage of energy (see Fig. 3); these pathways are likely to be coordinated with periods of the cell cycle in which varying amounts of energy is required. An important issue is whether analysis of this same data set can reveal other, potentially unknown regulatory pathways that might be less dependent upon the cell cycle. To address this question, genes involved in metal transport were selected for further study. Many yeast genes involved in metal metabolism have been characterized, and a priori such genes might not be expected to have a strong cell cycle dependency. Although yeast has proven to be a valuable model system for identification and characterization of biological processes relevant to a number of human diseases related to metal metabolism, many aspects of metal physiology remain to be discovered.
Genes known to be important in iron and copper metabolism were identified and examined in more detail (see Table 3 in the Supplementary Material). Classes which contain more than one of these genes are as follows: class 4, TAF17 (P(XLn|M) = 0.88); FRE4 (0.47); class 5, GEF1 (0.02), MMT1 (0.02); class 6, LYS7 (1.00), YFH1 (0.03); class 8, GEF1 (0.91), MNR2 (0.01); class 9, SMF2 (1.00), CTR2 (1.00), YAH1 (1.00), TAF19 (0.07), TAF17 (0.01); class 12, MAC1 (1.00), MMT2 (1.00); class 14, RCS1 (0.99), TAF145 (1.00); class 18, SMF1 (1.00), FRE7 (1.00); class 20, ATX1 (1.00), ISU2 (1.00); class 22, SLF1 (1.00), CRT2 (1.00), CRS5 (1.00), MNR2 (0.94); class 30, RIP1 (1.00), CUP5 (1.00), FET5 (1.00), SDH2 (1.00), ISU1 (0.99), FTH1 (1.00); class 34, ATM1 (1.00), GEF1 (0.02); class 35, CUP1A (1.00), CUP1B (1.00); class 41, FTR1 (1.00), FET3 (1.00); and class 42, SIT1 (1.00), ARN1 (1.00), TAF1 (1.00), FRE6 (1.00), FRE1 (1.00), ENB1 (1.00), CTR1 (1.00). There are numerous instances in which genes with similar functions are assigned to the same class. For example, SMF2 and CTR2, genes which encode low-affinity transporters of manganese and copper, respectively, are assigned to class 9. FET3 and FTR1, which together encode the yeast high-affinity iron transporter, is assigned to class 41.
The 38 members of class 42 are of particular interest since 7 are known to be involved in copper and iron transport (20). The proteins encoded by CTR1, FRE1, and FRE6 have all been implicated in copper uptake. SIT1, which encodes a protein involved in uptake of siderophore-bound iron, is classified with three closely related genes (ARN1, TAF1, and ENB1) that are known to be tightly regulated by iron need and that are collectively required for normal growth on low-iron medium (31). It remains to be seen whether the three transcription factors most associated with class 42 (Fig. 3) do indeed play a role in regulating copper and iron transport. These transcription factors are Met31 and Met32 (zinc-finger proteins involved in transcriptional regulation of methionine metabolism) and Met4 (transcriptional activator of the sulfur assimilation pathway).
Table 4 of the Supplementary Material shows the remaining 31 members of class 42, some of which have known roles. YDR040C (ENA1) is required for high-salt tolerance. YDR340W is similar to HAP1, a gene which encodes a complex transcriptional regulator of many genes involved in electron-transfer reactions and which is essential in anaerobic or heme-depleted conditions. These data suggest that the 17 members with an "unknown" YPD role represent good candidates for new genes involved in metal metabolism, especially copper and iron homeostasis. The precise mechanisms by which they regulate and maintain metal ion homeostasis and the pathways in which they participate can only be inferred by the function of the other members of class 42. The regulatory mechanism(s) that unites members of class 42 is as yet undiscovered, although one good possibility is that it may involve the activity of the transcription factors encoded by MET4, MET31, and MET32. Specific experiments using yeast mutants with deletions of genes listed in Table 4 could help clarify the situation as well as suggest additional new candidates. Interestingly, two of the uncharacterized ORFs have human homologs (YJR033C, YDR534C), suggesting that as yet unidentified human disorders may result from aberrant regulation or functioning of these proteins.
The ability of the current analysis to provide insights into unknown regulatory mechanisms relevant to metal metabolism suggests that other classes which contain a high proportion of unknown ORFs could be used to investigate other physiological processes.
![]() |
DISCUSSION |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The published studies characterized genes involved in several housekeeping functions. Currently available hierarchical clustering methods identified a large number of genes associated with the cell cycle but did not focus on associations between genes involved in other biological pathways and metabolic functions. The results here suggest a number of connections to genes involved in copper and iron homeostasis even though a link between metal metabolism and the cell cycle might not be expected a priori. The 17 ORFs in class 42 predicted to be involved directly or indirectly in these processes are good targets for investigation by gene deletion and/or other techniques. Since many genes involved in metal metabolism in yeast have human homolog that are altered in disease states, studies of these new yeast candidates may yield unanticipated information on biochemical mechanisms relevant to human disorders. Using techniques different from those employed here, Tavazoie et al. (27) addressed the issue of identifying other pathways and noncoding regulatory motifs (an area not considered in this work). The association between class 42 and the MET4, MET31, and MET32 transcription factors suggests that this class may be equivalent to cluster 30 in their work. A more complete comparison of their MIPS functional category enrichment with the enrichment measure computed here may be warranted.
In the graphical model used for clustering gene profile vectors, experiments were treated as independent of each other. Despite the simple nature of this probabilistic model and the assumption of a Gaussian functional form for the data-generating mechanism, novel, biologically plausible observations could be made. Consequently, future experiments of particular interest include comparisons of wild-type yeast with single and multiple deletion mutants of genes predicted here to be involved in copper and iron homeostasis as well as those known to play roles in these processes. The utility of gene profile vector classes could be enhanced by computing a hierarchical structure capturing the relationships between classes, i.e., clustering the classes. As the number and variety of experiments performed increases, a straightforward extension is two-way clustering, finding classes of experiment and gene profile vectors together with associations between them. For the same data set, it will be useful to compare the posterior probabilities of a K-class naive Bayes model with a K-class (Gaussian) mixture model computed from the clusters generated by a K-class SOM.
For the time series studies, the independence supposition may be erroneous, since the measured expression level at one time point (modeled by node Fl) might be related to those in one or more previous time points (Fj<l). Such temporal dependencies could be encoded in the topology of the graphical model by adding edges between time points (Fig. 4). Nodes could be included that represent gene profile vector labels such as biochemical activity, protein fold, and so on. An important area of future research is extending the graphical model formalism used for clustering to graphical models for inferring genetic networks. For example, the "cdc" time series experiments could be modeled using dynamic Bayesian networks (18).
|
![]() |
ACKNOWLEDGMENTS |
---|
Present address of E. J. Moler: Chiron Corp., 4560 Horton St., Emeryville CA 94608.
![]() |
FOOTNOTES |
---|
Address for reprint requests and correspondence: I. S. Mian, Dept. of Cell and Mol. Biol., MS 74-197, Radiation Biology and Environ. Toxicol. Group Life Sci. Div., Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley CA 94720 (E-mail: SMian{at}lbl.gov).
* E. J. Moler and D. C. Radisky contributed equally to this work.
1 Supplementary Material to this article (Tables 14) is available online at http://physiolgenomics.physiology.org/cgi/content/full/4/2/127/DC1.
![]() |
REFERENCES |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|