1 Bioinformatics, Department of Molecular Biology, Parke-Davis Pharmaceutical Research, Warner-Lambert, Ann Arbor, Michigan 48105
2 Department of Chemical Engineering, University of Michigan, Ann Arbor, Michigan 48109
ABSTRACT
Keywords: gene expression profiling; gene regulatory model; data mining
INTRODUCTION
Expression profiling assays generate huge data sets that are not amenable to simple analysis. The greatest challenge in maximizing the use of this data is to develop algorithms that interpret and interconnect results for different genes under different conditions. Currently, most expression data is analyzed using clustering techniques, algorithms that identify distinct expression patterns by grouping genes with similar expression patterns (1, 9). Clustering can therefore only separate genes that share an expression profile from genes that do not. However, genes in the cell make up a complex network that cannot be revealed with current techniques such as clustering. To determine the network describing how the genes interrelate, more elaborate data mining techniques need to be developed.
Fuzzy logic is a technique drawn from engineering and other applied sciences, where it is used to control systems as diverse as washing machines and autofocus cameras (2, 10). It provides a way to transform precise numbers, such as 32.43, into qualitative descriptors, such as "high," in a process called "fuzzification." Although other techniques can be used to change precise values into discrete descriptors, fuzzy logic provides a systematic and unbiased way to perform this transformation, thereby removing the need for expert knowledge about the system. For example, is 32.43 a high value? If 32.43 is a measure of the ambient air temperature in degrees Celsius, then most people would say that 32.43°C is a high temperature. But this analysis requires our own expert knowledge, which can vary from person to person. Someone from a tropical climate may feel that 32.43°C is a medium temperature, whereas someone from a very cold climate may take 32.43°C as a very high temperature. When dealing with gene expression data, the problem is even more complicated, because no expert exists to determine what defines a "high" expression level. Using fuzzy logic, the full range of data is first measured and is then broken into discrete subsections based on the observed data. These discrete subsections then provide a qualitative description of the data. Once transformed, this qualitative data can be analyzed using heuristic rules, which in turn generate fuzzy solutions. For example, the heuristic rule "if high then move fast" takes "high" as a fuzzy input and "fast" as a fuzzy solution. In another process called "defuzzification," this heuristic solution can be transformed from a qualitative descriptor back into a precise number.
There are three main advantages of applying fuzzy logic to the analysis of gene expression data. First, fuzzy logic inherently accounts for noise in the data because it extracts trends, not precise values. Second, in contrast to other automated decision-making algorithms, such as neural networks or polynomial fits, algorithms in fuzzy logic are cast in the same language used in day-to-day conversation. As a result, predictions made using fuzzy logic are easily interpretable and can be extrapolated in predictable ways. Third, fuzzy logic techniques are computationally efficient and can be scaled to include an unlimited number of components. Thus they are able to recognize a large number of biologically important patterns.
In this work we present a fuzzy logic-based algorithm for analyzing gene expression data. Using fuzzy logic, we have developed an analysis technique that can identify logical relationships between genes and in some cases even predict the function of an unknown gene. This algorithm was validated using yeast expression data gathered from the Affymetrix GeneChip system. By using yeast gene expression data collected at different time points of the cell cycle, we were able to identify many regulatory elements and their target genes within the cell that work together to maintain and control certain cellular processes. Several cases are validated by available experimental results, including the signaling network controlled by the transcription factors HAP1 and ROX1, which control the transition from anaerobic to aerobic growth. These results suggest that our fuzzy logic technique can indeed find biologically relevant connections between sets of genes, which in turn could help to describe the complex web of interactions that regulate gene expression.
METHODS
Data filtering.
Before the expression data was analyzed, it was filtered to ensure that 1) the expression data is above the noise level determined by the GeneChip software and 2) the data set includes only genes whose expression levels differ significantly across the measurements. In chip hybridization, the noise level is defined as the standard deviation in fluorescence intensities of the nonhybridizing probes. For the yeast data set, the noise level was determined to be 30 in fluorescence intensity; thus, to filter out the nondetectable genes in the samples, the highest measurement for a particular gene had to exceed the noise level to allow that gene into the calculation. Also, for a gene to be selected, the maximum value in the series of measurements had to be at least three times greater than the minimum value, ensuring that the observed signal change was significant. The threshold factor of 3 was chosen after evaluating the variation among multiple measurements from repeated assays; the 1,898 genes (30%) that met both criteria were selected.
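The two filtering criteria can be sketched in a few lines of Python. This is only an illustration, not the original C implementation: the function names, the dictionary layout of the profiles, and the example values are assumptions, while the noise level of 30 and the factor of 3 are the values quoted above.

```python
NOISE_LEVEL = 30.0      # fluorescence intensity reported for the yeast data set
MIN_FOLD_CHANGE = 3.0   # required ratio between maximum and minimum measurement


def passes_filter(profile, noise=NOISE_LEVEL, fold=MIN_FOLD_CHANGE):
    """Return True if one gene's series of measurements meets both criteria."""
    highest, lowest = max(profile), min(profile)
    if highest <= noise:
        return False  # criterion 1: highest measurement must exceed the noise level
    # Criterion 2, written as a product so that zero or negative minima
    # (hypothetical here) do not cause a division error.
    return highest >= fold * lowest


def filter_genes(profiles):
    """Keep only the genes (name -> list of measurements) that pass the filter."""
    return {gene: series for gene, series in profiles.items() if passes_filter(series)}


# Hypothetical example profiles, not taken from the study.
example = {
    "geneA": [40, 100, 150],   # above noise, 150 >= 3 * 40: kept
    "geneB": [10, 20, 25],     # never exceeds the noise level: dropped
    "geneC": [100, 150, 200],  # above noise, but 200 < 3 * 100: dropped
}
selected = filter_genes(example)
```

How measurements at or below zero were treated is not stated in the text, so the handling shown here should be read as one possible choice.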
Gene regulatory model.
In developing the algorithm, we chose to search for genes that follow the pattern of a gene product (C) controlled by both an activator (A) and repressor (B), although in theory any pattern can be searched for. In general for the activator-repressor model, when the activator is high and the repressor is low, the concentration of the target C would be high. Conversely, when the repressor concentration is high, and the activator is low, the concentration of the target is low. These qualitative, or heuristic, rules are similar to the judgement calls made by an expert analyzing the data and were used as a basis for developing our fuzzy algorithm.
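As a rough illustration of how such heuristic rules can be evaluated, the sketch below encodes only the two rules stated above and combines them with the common min/max interpretation of fuzzy AND and OR. The rule table, the set centers, the function names, and the weighted-average defuzzification are assumptions made for the example; the complete rule table used in the study is not given in this section.

```python
# Only these two rules are stated in the text; a complete rule table would
# need an entry for every (activator, repressor) combination.
RULES = {
    ("HI", "LO"): "HI",  # activator high, repressor low  -> target high
    ("LO", "HI"): "LO",  # activator low,  repressor high -> target low
}

# Assumed representative values of each fuzzy set on the normalized 0-1 scale.
CENTERS = {"LO": 0.0, "MED": 0.5, "HI": 1.0}


def apply_rules(activator, repressor):
    """Combine fuzzy activator and repressor values into a fuzzy target value."""
    target = {"LO": 0.0, "MED": 0.0, "HI": 0.0}
    for (a_label, b_label), c_label in RULES.items():
        strength = min(activator[a_label], repressor[b_label])  # fuzzy AND as min
        target[c_label] = max(target[c_label], strength)        # fuzzy OR as max
    return target


def defuzzify(fuzzy_value):
    """Turn a fuzzy value back into a number by a weighted average of set centers."""
    total = sum(fuzzy_value.values())
    if total == 0:
        return None  # no rule fired, so no prediction is made
    return sum(CENTERS[label] * weight for label, weight in fuzzy_value.items()) / total


# Hypothetical fuzzy inputs: activator mostly high, repressor mostly low.
a = {"LO": 0.0, "MED": 0.2, "HI": 0.8}
b = {"LO": 0.6, "MED": 0.4, "HI": 0.0}
c = apply_rules(a, b)      # only the "target high" rule fires, with strength 0.6
prediction = defuzzify(c)  # crisp prediction on the normalized 0-1 scale
```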
Fuzzy logic algorithm.
In analyzing gene expression data, the data is transformed from crisp values to fuzzy values in a process called "fuzzification." Data is fuzzified by first normalizing it to the range from 0 to 1 and then assigning each normalized value a degree of membership in each of several membership classes. For example, Fig. 1 shows the three fuzzy sets used in this algorithm, "HI," "MED," and "LO," as a function of the normalized value. For a normalized value of 0.25, the fuzzy value is 0.5 LO, 0.5 MED, and 0 HI; or said another way, 0.25 is 50% low, 50% medium, and 0% high. The three fuzzy sets HI, MED, and LO were chosen after manually examining expression data and finding that the abundance of most transcripts was either high, medium, or low. Other schemes that include a different number or shape of fuzzy sets could also be used to better represent the data; however, these modifications tend to make the analysis less general and more complex and therefore were not pursued in this study.
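The fuzzification step can be sketched as follows. Because Fig. 1 is not reproduced here, the triangular membership functions below are an assumption, chosen so that they reproduce the worked value in the text (0.25 maps to 0.5 LO, 0.5 MED, 0 HI); the function names and the min-max normalization are likewise illustrative.

```python
def normalize(series):
    """Rescale one gene's measurements linearly to the range 0 to 1."""
    lowest, highest = min(series), max(series)
    if highest == lowest:
        return [0.0] * len(series)  # flat profile: no range to rescale
    return [(x - lowest) / (highest - lowest) for x in series]


def fuzzify(x):
    """Map a normalized value in [0, 1] to membership degrees in LO, MED, HI.

    Assumed shapes: LO falls from 1 at x=0 to 0 at x=0.5, MED is a triangle
    peaking at x=0.5, and HI rises from 0 at x=0.5 to 1 at x=1.
    """
    return {
        "LO": max(0.0, 1.0 - 2.0 * x),
        "MED": 1.0 - abs(2.0 * x - 1.0),
        "HI": max(0.0, 2.0 * x - 1.0),
    }


quarter = fuzzify(0.25)              # the worked example from the text
scaled = normalize([10, 20, 30])     # hypothetical raw measurements
```

With these shapes the three membership degrees always sum to 1, which is a property of this particular choice rather than a requirement stated in the text.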
To get an overall idea of how well the assertion fits the data, the r² value and the variance are multiplied and scaled by a factor of 100,000 to give an overall score. Thus triplets with low r² values and low variance will have the lowest scores and should also be the most credible statements. Other data that are low in only one parameter may be filtered out because either the fit is too poor or the data set is biased.
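The scoring rule itself is simple arithmetic and can be written out as below. The sketch assumes that the fit value and the variance have already been computed for each triplet (how they are obtained is not described in this section) and that, as stated above, lower values of both are better. The function names and the example numbers are hypothetical.

```python
SCALE = 100_000  # scaling factor quoted in the text


def overall_score(r2, variance):
    """Overall score of a triplet: product of fit value and variance, scaled."""
    return r2 * variance * SCALE


def rank_triplets(triplets):
    """Sort (name, r2, variance) tuples so the lowest (best) score comes first."""
    return sorted(triplets, key=lambda t: overall_score(t[1], t[2]))


# Hypothetical triplets, not taken from the study.
candidates = [
    ("t1", 0.5, 0.5),    # poor fit and uneven coverage
    ("t2", 0.01, 0.02),  # low in both parameters
    ("t3", 0.01, 0.9),   # good fit, but low in only one parameter
]
ranked = [name for name, _, _ in rank_triplets(candidates)]
```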
All fuzzy logic analyses were written in the C programming language and run on an 8-processor SGI Origin 2000 system, which required 200 h to analyze the relationships between 1,898 genes. Because all combinations of triplets are checked, the algorithm scales as O(n³) with the number of genes examined, n. However, because the problem consists of solving a large number of smaller, independent comparisons, the algorithm lends itself to parallel computing and scales nearly linearly with the number of available processors.
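The cubic scaling and the independence of the individual comparisons can be illustrated with a short sketch. It assumes that a triplet is an ordered choice of three distinct genes in the roles of activator, repressor, and target; the text does not spell out this counting convention, and the function names are hypothetical.

```python
from itertools import permutations


def count_triplets(n):
    """Number of ordered (activator, repressor, target) triplets of distinct genes."""
    return n * (n - 1) * (n - 2)


def iter_triplets(genes):
    """Lazily enumerate every ordered triplet of distinct genes."""
    return permutations(genes, 3)


def triplets_for_target(genes, target):
    """All triplets sharing one target gene: an independent unit of work
    that could be handed to a separate processor."""
    others = [g for g in genes if g != target]
    return [(a, b, target) for a, b in permutations(others, 2)]


small = ["g1", "g2", "g3", "g4"]            # hypothetical gene names
all_small = list(iter_triplets(small))      # 4 * 3 * 2 triplets
for_g1 = triplets_for_target(small, "g1")   # 3 * 2 triplets with target g1
```

Under this convention, 1,898 genes give about 6.8 billion candidate triplets, which is consistent with the long run time reported above. Splitting the work by target gene is only one possible partitioning.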
A US patent application (serial no. 60/181477) has been filed on this algorithm. Copies of the program are available upon request.
RESULTS
To evaluate and validate the algorithm, we examined the best scoring triplets to see if they made biological sense. The complete table of all the triplets can be obtained from us. One of the best scoring triplets, CYB2-HAP1-CYC7, is shown in Fig. 3. Figure 3, top, shows the tight fit of the fuzzy logic prediction for the CYC7 expression level compared with the experimental data. This triplet has the second best score in our selection, and the functions of all three components have been extensively studied. The overall score for this triplet is 1,295 when the calculated expression data of CYC7 is compared with the observed data (Fig. 3, top), indicating a very high confidence for the correlation. Moreover, the expression data of CYB2 (A) and HAP1 (B) show a fairly wide range of values and are evenly distributed throughout the decision matrix (Fig. 3, bottom); thus the variance score is low and the resulting predictions are credible. It is worthwhile to note that neither CYB2 (A) nor HAP1 (B) could be categorized in the same cluster as CYC7 (C), as shown in Fig. 3, middle; thus only through the use of this fuzzy logic algorithm could this triplet be uncovered.
The algorithm also predicts that CYB2 should activate CYC7, again in agreement with experimental findings. CYB2, L-(+)-lactate cytochrome c oxidoreductase, is a soluble protein from the intermembrane space of mitochondria. This protein transfers electrons from L-(+)-lactate to cytochrome c and is upstream of cytochrome c on the electron transport chain. Experimental findings indicate that CYB2 interacts preferentially with CYC7 during the electron transfer process (4) and as such should positively regulate the expression of CYC7, as found by the algorithm.
Following up on the relationships revealed by this triplet, we selected all the triplets that contain either HAP1 or HAP1-regulated genes in an effort to generate an interconnected network describing the control roles of HAP1. The network predicted by the fuzzy logic algorithm is described in Fig. 4 and is highly consistent with the experimental data obtained from previous studies. Moreover, we could assign functions to unidentified genes involved in this process and generate hypotheses for future experimental tests. For example, previous studies show that an unidentified protein X masks the activation domain of HAP1 and allows HAP1 to act as a repressor under anaerobic conditions (5, 11). Currently, protein X remains uncharacterized; however, the fuzzy logic prediction suggests that several genes, including YDL174C, YGL037C, YLR251W, YLR252W, and YNL007C, could encode this uncharacterized protein. Functionally, these proteins appear to repress the ability of HAP1 to activate CYC7; however, further experiments are needed to determine the exact protein involved.
Experimentally, it has been shown that HAP1 regulates ROX1, a gene that encodes a repressor protein for the hypoxic genes (3, 12). When cells are grown under aerobic conditions, heme accumulates to levels sufficient to induce ROX1 expression and the hypoxic genes are repressed. When cells are limited for oxygen, heme levels fall, ROX1 repressor levels are reduced, and hypoxic gene expression is derepressed. The relationship between HAP1 and ROX1 was not revealed by the fuzzy logic algorithm, because the expression of ROX1 is transient and highly unstable. However, two genes, CYT1 and GPD2, were found by the algorithm as targets of positive regulation by ROX1, whereas other hypoxic genes were not identified. This result suggests that the current view of hypoxic gene regulation is correct in terms of the phenomenology but that there is a great deal more complexity to ROX1 regulation.
Pairs.
Table 1 lists the most frequently occurring pairs of genes in triplets identified by the fuzzy logic algorithm, many of which appear to be biologically relevant. In several cases, both gene products function in the same cellular process. For example, AGP1 and MEP2 are often found together. Functionally, AGP1 encodes a broad substrate range amino acid permease whose expression is subject to nitrogen repression, whereas MEP2 is a high-affinity ammonia permease induced by nitrogen starvation. Thus it makes sense that AGP1 and MEP2 are found in the same triplet. Similarly, HAP1 is a transcription factor with a broad spectrum of targets, including genes involved in sterol biosynthesis such as FAA1 and ARE2. FAA1 is a long-chain fatty acyl:CoA synthetase involved in lipid metabolism and protein N-myristoylation. ARE2 is a sterol-ester synthetase involved in ergosterol esterification. In general, HAP1 is known as a repressor; thus the fact that this algorithm identifies HAP1 as repressing FAA1 and ARE2 is consistent with known biological data.
In addition, there are also pairs of genes predicted by the fuzzy logic algorithm where one or both of the genes are uncharacterized. By analogy to the examples shown above, it may be possible to infer the cellular function of these unknown proteins by examining what known proteins are found to associate with the set. This ability to bootstrap functional information out of the expression data could be particularly useful in analyzing human data, where a much larger percentage of proteins are uncharacterized.
Transcription factors.
Because we used our algorithm to search for activator-repressor-target triplets, we expected to find that a disproportionately large number of triplets would include transcription factors. Among the 1,898 genes in our selected data set, we found 124 genes annotated as transcription factors in GenBank descriptions. The expected probability of finding a transcription factor in our data set is therefore 6.5%. After our initial screen by fuzzy logic analysis, we discovered that transcription factors were found at 9.0% in activator or repressor positions, representing a 36% enrichment over what would be expected from their frequency in the original data set. When looking only at the 100 best scoring triplets, we found transcription factors at 14%, that is, 110% more frequently than expected.
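The enrichment arithmetic can be checked with a few lines of Python. The function names are illustrative, and the inputs are the rounded percentages quoted in the text rather than the underlying counts, which are not given here.

```python
def percent(part, whole):
    """Share of `part` in `whole`, expressed as a percentage."""
    return 100.0 * part / whole


def enrichment(observed_pct, expected_pct):
    """Percent increase of the observed frequency over the expected frequency."""
    return 100.0 * (observed_pct / expected_pct - 1.0)


expected = percent(124, 1898)              # transcription factors in the filtered set
screen_gain = enrichment(9.0, expected)    # activator or repressor positions
top100_gain = enrichment(14.0, expected)   # 100 best scoring triplets
```

With these rounded inputs the expected frequency comes out at about 6.5%, and the two enrichments at roughly 38% and 114%, close to the 36% and 110% reported above. The small differences may reflect rounding of the reported percentages.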
DISCUSSION
Although in this study the algorithm was only used to search for triplets of activator, repressor, and target genes, other variations of the algorithm are also possible. The choice of the activator-repressor model provided a simple method to demonstrate that this technology can yield biologically meaningful results. However, the technique is general and can be applied to other relationships and more complicated systems. Examples include other classes of relationships such as coactivators and corepressors or more complicated systems that involve genes whose transcription is regulated in complex ways by any number of transcription factors. Potentially, the technology could also be extended to describe complete general networks of gene interactions based on expression data alone.
However, using fuzzy logic to analyze expression data does have some limitations. To a first approximation, the interaction between multiple proteins is essentially linear, thus the algorithm searched for linear behavior. However, in the case of multiple redundant promoter binding sites (such as the two HAP1 binding sites on the CYC1 gene), this linear approximation is not accurate, causing the algorithm to overlook these biologically relevant connections. This situation could be remedied by including a more sophisticated "fuzzification" step to include nonlinear effects; however, this added complexity may only correct for a few missed connections while edging out many of the more common near linear relationships. Also, the goal of this algorithm is not to yield quantitative predictions, but instead to draw general trends that connect the regulation of multiple genes. Thus including specific nonlinear effects would not help to draw many connections, but would end up adding a significant computational burden to an already difficult problem.
The fuzzy logic algorithm found a disproportionately large number of transcription factors in the roles of activators and repressors; however, not all of the activators and repressors found were transcription factors. Two possible reasons for this discrepancy are that 1) transcription factors are expressed at low levels and as such are difficult to detect, and/or 2) other gene products such as enzymes can indirectly regulate transcription. Transcription factors are generally present only at a very low concentration; thus changes in transcription factor expression levels can be difficult to detect using current expression profiling techniques. Presumably, if expression profiling technology were to become more sensitive, then the fuzzy logic algorithm would detect an even greater bias of transcription factors in the activator and repressor roles. However, in many cases the expression level of a particular protein is not governed by the expression of a transcription factor, but instead by the concentration of some intracellular compound, such as Ca²⁺ concentration or cAMP levels, which in turn are controlled by enzymes inside the cell. In these cases, changes in the expression level of the enzyme have a "transcription-factor-like" effect and would be detected by the algorithm as an activator or repressor. From a drug design point of view, these "transcription-factor-like" enzymes are possibly more interesting than true transcription factors, because it is generally easier to change the activity of an enzyme in the cytosol with a drug than to block a true transcription factor in the nucleus. Moreover, the data set used in this study came from a single experiment in which cell cycle control was the main process of study. Transcription factors that are not involved in pathways related to this cellular process might not show significant change in their expression and thus could not be evaluated by the fuzzy logic algorithm.
To perform a more comprehensive survey on transcription factors, we are analyzing a data set that includes gene expression profiles of both wild-type and various mutant yeast cells. Many more transcription factors can be evaluated because the cellular processes they control have been perturbed.
Although the validation of this algorithm was performed using GeneChip data in this report, the fuzzy logic algorithm should work equally well with other expression profiling techniques such as Serial Analysis of Gene Expression (SAGE). SAGE has the advantage that it can detect completely unknown proteins, whereas GeneChip technologies require that at least the sequence of the protein's mRNA be known. This ability to detect unknown proteins would be particularly well suited to the functional characterization that the fuzzy logic algorithm makes possible.
An additional advantage of the fuzzy logic algorithm is that data can come from any source within an organism (tissue, cell type, treatment, or physiological state), and the output will actually be improved by a deeper and more diverse data set. The reason for this improvement is that the algorithm needs to observe changes in the expression level of a protein relative to changes in other expression levels. Each new data set provides a different set of expression levels that can be tested to see whether they fit the proposed regulatory model. In our studies, many data sets were eliminated solely because they did not sufficiently explore the combinations of expression levels (too high a sigma value), making their predictions impossible to believe. By including data sets from cells in different states, the algorithm gains more information about the details of the regulatory network.
A primary application of this algorithm is to independently validate or discover drug targets. Traditional techniques for drug target discovery require a detailed understanding of the biology underlying the disease, which can be slow and difficult to obtain. In contrast, expression profiling is a rapid high-throughput process that gives a large amount of information about the cell in a form that could be easily processed on a computer. By using a fuzzy logic approach to analyzing expression profile data, it is possible to confirm the mechanism of a known target. Moreover, because the fuzzy logic algorithm does not require biological information about the gene, genes with unknown functions can be included just as easily as genes with known functions. This ability to identify functional clues for uncharacterized genes is a great advantage in drug target discovery, because potential drug targets then can be followed up with the detailed biology.
FOOTNOTES
Address for reprint requests and other correspondence: Y. Wang, Bioinformatics, Dept. of Molecular Biology, Parke-Davis Pharmaceutical Research, Warner-Lambert, Ann Arbor, MI 48105 (E-mail: yixin.wang{at}wl.com).
REFERENCES