A meta-clustering analysis indicates distinct pattern alteration between two series of gene expression profiles for induced ischemic tolerance in rats
Makoto Kano1,
Shuichi Tsutsumi2,
Nobutaka Kawahara3,4,
Yan Wang3,
Akitake Mukasa2,3,
Takaaki Kirino3,4 and
Hiroyuki Aburatani2
1 Intelligent Cooperative System, Department of Information Systems, Research Center for Advanced Science and Technology, University of Tokyo, Tokyo
2 Genome Science Division, Research Center for Advanced Science and Technology
3 Department of Neurosurgery, Faculty of Medicine, University of Tokyo, Tokyo
4 Solution-Oriented Research for Science and Technology/Japan Science and Technology, Kawaguchi, Saitama, Japan
ABSTRACT
We have developed a visualization methodology, called a "cluster overlap distribution map" (CODM), for comparing the clustering results of time series gene expression profiles generated under two different conditions. Although various clustering algorithms for gene expression data have been proposed, there are few effective methods to compare clustering results for different conditions. With CODM, the utilization of three-dimensional space and color allows intuitive visualization of changes in cluster set composition, changes in the expression patterns of genes between the two conditions, and relationship with other known gene information, such as transcription factors. We applied CODM to time series gene expression profiles obtained from rat four-vessel occlusion models combined with systemic hypotension and time-matched sham control animals (with sham operation), identifying distinct pattern alteration between the two. Comparisons of dynamic changes of time series gene expression levels under different conditions are important in various fields of gene expression profiling analysis, including toxicogenomics and pharmacogenomics. CODM will be valuable for various types of analyses within these fields, because it integrates and simultaneously visualizes various types of information across clustering results.
time series; transcription factor; visualization
INTRODUCTION
ADVANCES IN MICROARRAY TECHNOLOGIES have made it possible to comprehensively measure gene expression profiles. Observation of dynamic changes of gene expression levels provides important markers to clarify cellular responses, differentiation, and genetic regulatory networks. In particular, a comparison of dynamic changes of time series gene expression levels under various conditions (e.g., administration of different drugs) is expected to make a major contribution to the understanding of complex biological processes. In general, we observe the influence of each condition through the results of clustering analysis, which is the most popular analysis for gene expression profiles. Therefore, a comparison between the results of clustering analyses under different conditions will allow interpretation of the different macroscopic phenomena that occurred under those conditions. However, although many clustering algorithms, including hierarchical clustering (1, 2, 4, 15), k-nearest neighbor (17), and self-organizing maps (10, 13, 16), have been proposed, there are few effective methods for comparing clustering results under different conditions. We have defined four issues to be addressed for a comparison of clustering results, especially for a comparison of time series gene expression data under two different conditions: changes in the composition of the cluster sets, changes in the expression patterns, integration with other known gene information, and threshold problems.
Changes in the Composition of the Cluster Sets
In this report, we focused on hierarchical clustering, since it is the most popular method for gene expression analysis. Here we define the composition of a cluster set as the hierarchical structure of clustering results and "cluster set" as the set of all clusters in the structure. A comparison of cluster compositions shows which clusters are conserved under different conditions and how the genes in a cluster for one condition are distributed into a cluster set under another condition. Genes that cluster under a single condition may be regulated by the same factors for that condition. However, under different conditions, some of those genes would be regulated by other factors and generate different clusters. Thus changes in the cluster compositions could provide key information for interpreting the effects of the different conditions. To get a full picture of the relationships of two cluster sets, the overlap between each pair of clusters under the two different conditions should be evaluated. However, since clustering analysis, especially hierarchical clustering, almost always generates a great number of clusters, there are a very large number of combinations of clusters. Simple line connections of the genes between the dendrograms of two hierarchical clustering results (14) provide insufficient information about the relationships between the clusters. Therefore, an effective presentation method that provides a full picture of the relationships of the cluster sets would be desirable.
Recently, a statistical model for performing meta-analysis of independent microarray data sets was proposed (12). This model revealed, for example, that four prostate cancer gene expression data sets shared significantly similar results, independent of the method and technology used. However, in a comparison of the cluster sets based on different conditions, the objective is not to confirm that several data sets share significantly similar results, but to detect the differences between them. Several statistical algorithms have been proposed for evaluating how clusters based on expression profiles include genes with well-known functions (3, 17). However, the number of clusters that were compared was limited, and an effective presentation method was not required in those situations.
Changes in the Expression Pattern
Where two clusters under different conditions have a statistically meaningful number of genes in common, it is also important to examine the differences in their expression patterns. The differences of macroscopic phenomena that the conditions exhibit result from the differences of expression of multiple, rather than single, genes. Therefore, the genes whose expression patterns changed in a similar fashion between different conditions provide markers for the different phenomena. In other words, if the genes in a certain cluster based on one condition also constitute a cluster for another condition, but the expression patterns are greatly different between the two conditions, then these genes are candidate markers for the cause of the phenotypic difference.
In general, there will be many false candidate genes whose expression patterns coincidentally match between the two different conditions. Therefore, it is important to simultaneously evaluate the statistical significance of the overlaps between clusters and the differences in their expression patterns.
Integration with Other Known Gene Information
In gene expression analysis, it is important to biologically interpret the results after integrating them with other known gene information. Therefore, changes in the composition of the cluster sets and changes in the expression patterns between different conditions should be associated with other known gene information such as transcription factors.
Threshold Problems
In a comparison of cluster sets on gene expression profiles, we have to handle four types of thresholds: 1) a threshold for generating clusters for each condition; 2) a threshold for evaluating the number of common genes that two clusters have; 3) a threshold for evaluating the differences in the expression patterns between two clusters; and 4) a threshold for evaluating the relationship with other known gene information. Among these, determining the threshold for generating clusters is most challenging, because the clustering result strongly depends on this threshold, and a change of this threshold greatly affects the number and composition of clusters. It is generally difficult to determine optimal values for these four types of thresholds, and the results of analysis are greatly affected by the threshold values specified. Arbitrary selection of thresholds involves a risk of overlooking important genes, so the number of thresholds should be reduced, and, if used, it is necessary to allow users to interactively change the thresholds.
We focused on visualization technology to address these four issues. Interactive visualization is effective for handling ambiguous threshold problems and for providing a wide variety of information at one time. In previous work, we developed a "cluster overlap distribution map" (CODM), which is a visualization method for comparing cluster sets based on different sets of gene expression profiles (7). In this report, we extended it for time series gene expression analysis. In the CODM, the relationships of all possible pairings of clusters can be examined, and both the changes in the composition of the cluster sets and the changes in the expression patterns of the clusters can be effectively visualized as three-dimensional (3D) histograms, without any arbitrary thresholds. In addition, relationships with other known gene information such as transcription factors can also be elucidated. We applied the CODM to a comparison between the gene expression data sets of double ischemia rats and sham control rats (with sham operation) and confirmed that CODM identified distinct patterns between the two.
CODM, available on our web site (http://www.genome.rcast.u-tokyo.ac.jp/CODM), runs on a PC with Windows 2000 or Windows XP. The memory requirement is proportional to the square of the number of genes to be analyzed. The analysis for 4,000 genes, presented in this paper, required 250 megabytes. In addition, since the analysis results of the CODM are visualized using OpenGL, a machine with a graphics board that provides hardware acceleration for OpenGL is recommended.
MATERIALS AND METHODS
Experiment Design
In this report, CODM is illustrated using time series gene expression data sets obtained from rat four-vessel occlusion models combined with systemic hypotension and time-matched control animals with sham operation. In the experiment, we used 2-min ischemia rats with induced ischemic tolerance (tolerant rats, TOL) and rats with sham operation (sham rats, SHAM), after confirming the histological outcomes. Note that the sham rats did not acquire ischemic tolerance. Three days after the operation, we conducted a 6-min ischemia operation on the two groups. Because of their ischemic tolerance, very little neuronal death of CA1 hippocampal neurons was observed in the tolerant rats (9). With duplicate assessments of 6 time points ({0 h, 1 h, 3 h, 12 h, 24 h, 48 h} x 2) after the second ischemia, microdissected CA1 regions from each of the two groups were subjected to oligonucleotide-based microarray analysis.
All animal-related procedures were conducted in accordance with guidelines for the care and use of laboratory animals set out by the National Institutes of Health and were approved by the committee for the use of laboratory animals in the University of Tokyo. More detailed experimental design is described in our previous report (8).
GeneChip Experiment
Five micrograms of total RNA from each sample was used to synthesize biotin-labeled cRNA, which was then hybridized to a high-density oligonucleotide array (GeneChip Rat RG-U34A array, Affymetrix) essentially following a previously published protocol (6). The arrays contain probe sets for 8,737 rat genes and expressed sequence tags (ESTs), which were selected from Build 34 of the UniGene Database (derived from GenBank 107, dbEST/11-18-98). Sequences and GenBank accession numbers of all probe sets are available from the Affymetrix home page (http://www.affymetrix.com/index.affx). Washing and staining were performed in a Fluidics Station 400 (Affymetrix) using the protocol EukGE-WS2. Scanning was performed on an Affymetrix GeneChip scanner to collect primary data. The Affymetrix Microarray Suite v4.0 was used to calculate the average difference for each gene probe on the array, which was shown as an intensity value of gene expression defined by Affymetrix using their algorithm. The average difference has been shown to quantitatively reflect the abundance of a particular mRNA molecule in a population (6). To allow comparison among multiple arrays, the average differences were normalized for each array by assigning the mean of overall average difference values to be 100. This data set has been submitted as GSE1357 to the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/info/linking.html).
Preprocessing and Clustering
In the following analysis, we treated the data as 12-time-point data sets ({0a, 0b, 1a, 1b, 3a, 3b, ..., 48a, 48b} = {Ti} (i = 1,2,...,12)) for TOL and SHAM, since the CODM does not depend on the intervals between the time points.
Standard clustering analysis for gene expression profiles is based on the correlation coefficients between genes. Therefore, this approach cannot handle genes with expression profiles that have almost no changes for a condition. However, if the expression profiles of those genes have meaningful changes in expression levels for other conditions, then these provide markers to interpret the influence that the conditions exerted, because these are possibly regulated by different factors. To handle those genes and to align the baselines of the expression patterns between the different data sets, preprocessing (i.e., filtering and normalization) was conducted for all of the data sets where TOL and SHAM were merged. More specifically, 3,363 probes with mean expression above 50 and coefficient of variation (CV = standard deviation/mean) above 0.1 were selected. After logarithmic transformation of the gene expression data, the expression levels were normalized to satisfy the following equations:
 | (1) |
 | (2) |
where xi and yi are normalized expression levels of a gene at time point Ti (i = 1,2,...12) on conditions TOL and SHAM, respectively. Using these normalized data sets, we performed hierarchical clustering analysis based on Euclidian distances, for each data set independently. Clustering analysis using Euclidian distances instead of correlation coefficients allows us to handle genes whose expression levels are downregulated or upregulated. In addition, due to the common normalization, gene expression patterns can be compared within a data set and between data sets.
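The filtering and joint normalization described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the logarithm base (2) and the use of the population standard deviation for the CV are assumptions, and the normalization target (zero mean and unit sum of squares over the merged TOL and SHAM values) is an assumed form, since Eqs. 1 and 2 are not reproduced in this text.

```python
import math
from statistics import mean, pstdev


def passes_filter(values, min_mean=50.0, min_cv=0.1):
    """Keep a probe whose merged TOL+SHAM values have mean > min_mean and CV > min_cv.

    ASSUMPTION: CV is computed with the population standard deviation.
    """
    m = mean(values)
    if m <= min_mean:
        return False
    return pstdev(values) / m > min_cv


def normalize_merged(tol, sham):
    """Log-transform, then centre and scale the merged profile of one probe.

    ASSUMPTION: the merged values are positive, log base 2 is used, and the
    result has zero mean and unit sum of squares over TOL and SHAM together.
    """
    logged = [math.log2(v) for v in tol + sham]
    centre = mean(logged)
    centred = [v - centre for v in logged]
    norm = math.sqrt(sum(v * v for v in centred))
    scaled = [v / norm for v in centred]
    return scaled[:len(tol)], scaled[len(tol):]


# Toy probe (4 time points for brevity): changes in TOL, flat in SHAM.
tol_values = [100.0, 120.0, 150.0, 90.0]
sham_values = [100.0, 100.0, 100.0, 100.0]
if passes_filter(tol_values + sham_values):
    x, y = normalize_merged(tol_values, sham_values)
    print(x, y)
```

Because the scaling is applied to the merged profile, the flat SHAM profile stays finite and constant rather than being divided by a zero variance, which is the property the text relies on for handling genes that change under only one condition.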
In general, Euclidian-distance-based clustering after normalization, in terms of mean and standard deviation, is equivalent to correlation-coefficient-based clustering. That is, a Euclidian-distance-based clustering analysis for the merged data of TOL and SHAM with the above preprocessing is equivalent to a correlation-coefficient-based clustering analysis for the original merged data. In the analysis of the CODM, the preprocessing is conducted for the merged data, and Euclidian-based clustering is individually conducted for each data set. Roughly speaking, this analysis provides us with results similar to those of normal correlation-coefficient-based clustering, while it allows us to handle genes with expression profiles that have changes for only one condition but not for the other.
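The equivalence stated above follows from a standard identity: for two profiles scaled to zero mean and unit sum of squares, the squared Euclidean distance equals 2(1 - r), where r is the Pearson correlation coefficient. The short check below illustrates this numerically; the helper names and the toy vectors are illustrative only.

```python
import math


def standardize(v):
    """Scale a profile to zero mean and unit sum of squares."""
    m = sum(v) / len(v)
    centred = [a - m for a in v]
    norm = math.sqrt(sum(a * a for a in centred))
    return [a / norm for a in centred]


def pearson(u, v):
    """Pearson correlation as the dot product of the standardized profiles."""
    return sum(a * b for a, b in zip(standardize(u), standardize(v)))


def squared_distance(u, v):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


u = [1.0, 2.0, 4.0, 3.0]
v = [2.0, 1.0, 5.0, 7.0]
r = pearson(u, v)
d2 = squared_distance(standardize(u), standardize(v))
print(d2, 2.0 * (1.0 - r))  # the two values agree up to rounding
```

Since the distance is a monotone function of r, ranking gene pairs by distance or by correlation gives the same ordering after this kind of normalization.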
As Fig. 1, A and B, shows, there are a large number of clusters generated at various levels. Although the composition and number of cluster sets depend on the threshold value of the distance, it is generally difficult to identify an optimum value. These aspects make it difficult to compare cluster sets derived from different sources.

Fig. 1. Hierarchical clustering of TOL (A) and SHAM (B). We obtained time series ({0 h, 1 h, 3 h, 12 h, 24 h, 48 h} x 2) microarray data from rats with induced ischemic tolerance (tolerant rats, TOL) and rats with sham operation (sham rats, SHAM). In the analysis, we used these data sets as 12 time point ({0a, 0b, 1a, 1b, 3a, 3b, ..., 48a, 48b} = {Ti} (i = 1,2,...,12)) data sets on TOL and SHAM, respectively. After preprocessing and normalization, hierarchical clustering analysis based on Euclidian distances was then performed for each data set independently.
The Cluster Overlap Distribution Map
The CODM is a visualization methodology for pair-wise comparison between cluster sets generated from different gene expression data sets. In this methodology, two types of cluster sets (i.e., dendrograms of hierarchical clustering results) are mapped, respectively, to the x-axis and to the y-axis, and the relationship between them is displayed as a 3D histogram (Fig. 2). In this report, the dendrogram of TOL is mapped to the x-axis, and that of SHAM is mapped to the y-axis. The statistical evaluation values of the overlaps between two clusters selected from the respective cluster sets are displayed as the height of the blocks (Fig. 2). More specifically, we evaluated the number of common genes between the two different clusters by using hypergeometric probability distributions (17). Assuming that the generation of gene clusters is a random selection from among the total set of genes, the probability of observing at least k overlapping genes between randomly selected n1 genes and n2 genes from among all of the g genes is given by:
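$$P(g, n_1, n_2, k) = \sum_{i=k}^{\min(n_1, n_2)} \frac{\binom{n_1}{i}\binom{g - n_1}{n_2 - i}}{\binom{g}{n_2}} \tag{3}$$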
When the P value is small, the overlap is regarded as statistically meaningful. Thus we defined the evaluation value of the overlap as:
$$E(g, n_1, n_2, k) = -\log_{10} P(g, n_1, n_2, k) \tag{4}$$
Then in the area (Rij) determined by a cluster on the x-axis (Xi) and a cluster on the y-axis (Yj), a block whose height represents E(g,nxi,nyj,kij) is displayed, where nxi is the number of genes in Xi, nyj is the number of genes in Yj, and kij is the number of overlapping genes between Xi and Yj (Fig. 2). We term this block an "overlap block." Note that the number of UniGenes, to which probes in a cluster correspond through their original GenBank accession number, was used as the number of genes. In this report, all 8,737 probes on RG-U34A corresponded to 5,249 UniGenes (g = 5,249).
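The height of an overlap block can be computed along the following lines. This is a sketch under stated assumptions: the P value is the upper tail of the hypergeometric distribution as described in the text, and the evaluation value is taken to be the negative base-10 logarithm of P, which is an assumption of this sketch. The function names and the example cluster sizes are hypothetical.

```python
from math import comb, log10


def overlap_p_value(g, n1, n2, k):
    """Probability of at least k shared genes between random sets of n1 and n2 genes out of g."""
    tail = sum(comb(n1, i) * comb(g - n1, n2 - i)
               for i in range(k, min(n1, n2) + 1))
    return tail / comb(g, n2)


def evaluation_value(g, n1, n2, k):
    """Block height, ASSUMED here to be -log10(P).

    Computed from exact integers so that very small P values do not underflow.
    """
    tail = sum(comb(n1, i) * comb(g - n1, n2 - i)
               for i in range(k, min(n1, n2) + 1))
    if tail == 0:
        raise ValueError("an overlap of k genes is impossible for these cluster sizes")
    return log10(comb(g, n2)) - log10(tail)


# Hypothetical cluster sizes; g = 5,249 UniGenes as in the text.
print(evaluation_value(5249, 40, 50, 20))
```

Working with exact binomial coefficients keeps the tail sum accurate even when a pair of large clusters shares most of its genes and P is far below floating-point range.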

Fig. 2. Overlap block of two clusters. The dendrogram of TOL is mapped to the x-axis, and that of SHAM is mapped to the y-axis. Then, for the area (Rij) determined by a cluster on the x-axis (Xi) and a cluster on the y-axis (Yj), a block whose height represents E(g,nxi,nyj,kij) (statistical evaluation values of the overlaps between Xi and Yj) is displayed, where g is the total number of genes, nxi is the number of genes in Xi, nyj is the number of genes in Yj, and kij is the number of overlap genes between Xi and Yj.
For hierarchical clustering, there are a large number of clusters generated at various distance levels. Our algorithm examines the overlaps of the genes between all combinations of two clusters with smaller "distance level" values than the "cut level," which is a threshold value specified by users (Fig. 1). In other words, we evaluated and visualized any clusters with a smaller distance level than the cut level, even if they were included in other clusters. Note that conventional hierarchical clustering does not focus on subclusters that are included in other clusters. Since all of the statistically significant combinations between cluster sets can be visualized simultaneously, users can grasp the overall picture of the relationships between the two different cluster sets.
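The enumeration of every cluster below the cut level, including subclusters nested inside larger ones, can be sketched as follows. The dendrogram encoding used here (a list of merges, each joining two earlier node ids at a given distance, with leaves numbered from 0) is an assumption chosen for illustration, as are the function names and the toy dendrograms; singleton leaves are not treated as clusters in this sketch.

```python
def clusters_below_cut(n_leaves, merges, cut_level):
    """Return the gene set of every internal node whose distance level is below cut_level.

    merges[s] = (left_id, right_id, distance) creates node id n_leaves + s.
    Nested subclusters are all kept, not only the top-level ones.
    """
    members = {i: frozenset([i]) for i in range(n_leaves)}
    selected = []
    for step, (left, right, dist) in enumerate(merges):
        node = n_leaves + step
        members[node] = members[left] | members[right]
        if dist < cut_level:
            selected.append(members[node])
    return selected


def overlap_counts(clusters_x, clusters_y):
    """Count shared genes for every pair of clusters that has a non-empty overlap."""
    table = {}
    for i, cx in enumerate(clusters_x):
        for j, cy in enumerate(clusters_y):
            shared = len(cx & cy)
            if shared:
                table[(i, j)] = shared
    return table


# Toy dendrograms over four genes (ids 0..3) for the two conditions.
tol_merges = [(0, 1, 0.2), (2, 3, 0.3), (4, 5, 0.9)]
sham_merges = [(0, 2, 0.1), (1, 3, 0.4), (4, 5, 0.8)]
x_clusters = clusters_below_cut(4, tol_merges, 0.74)
y_clusters = clusters_below_cut(4, sham_merges, 0.74)
print(overlap_counts(x_clusters, y_clusters))
```

Each entry of the resulting table supplies the k of one cluster pair; together with the two cluster sizes and the total gene count it is all that is needed to evaluate the overlap statistic for that pair.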
In the CODM, all of the clusters are dealt with equally without regard to their difference level (i.e., their homogeneity). Even if they are included in other clusters, all of the statistical significance of the number of common genes between clusters is simultaneously visualized. Therefore, there is a risk that a small overlap block may be hidden by a large block. For example, assume that the clusters Xj and Yn are included in Xi and Ym respectively. Then, if the evaluation value Ejn is less than Eim, then the small block Bjn will be hidden in the large block Bim (Fig. 3A). To avoid this problem, the CODM allows the user to change the cut level interactively. That is, if the user decreases the cut level, some small blocks that are hidden in larger blocks will emerge. Therefore, in consideration of the homogeneity of clusters and the relationships with other gene information, the user can find important genes displayed as blocks in the CODM.
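The hiding condition described above can be checked programmatically. In this sketch a block is represented as a tuple of its x-axis gene set, its y-axis gene set, and its evaluation value; this representation and the function name are hypothetical and are not taken from the CODM software.

```python
def hidden_blocks(blocks):
    """Return indices of blocks that sit inside a taller enclosing block.

    blocks: list of (x_genes, y_genes, evaluation_value), gene sets as frozensets.
    Block a is hidden by block b when a's clusters are contained in b's
    clusters on both axes and a's evaluation value is lower.
    """
    hidden = []
    for a, (xa, ya, ea) in enumerate(blocks):
        for b, (xb, yb, eb) in enumerate(blocks):
            if a != b and xa <= xb and ya <= yb and ea < eb:
                hidden.append(a)
                break
    return hidden


all_genes = frozenset({0, 1, 2, 3})
blocks = [
    (all_genes, all_genes, 5.0),                # large enclosing block
    (frozenset({0, 1}), frozenset({0, 1}), 3.0),  # lower than the enclosing block
    (frozenset({2, 3}), frozenset({2, 3}), 7.0),  # taller, so it stays visible
]
print(hidden_blocks(blocks))
```

Lowering the cut level removes the enclosing clusters from the map, which is why previously hidden blocks become visible.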

Fig. 3. Relationships of two blocks. In CODM, all of the clusters are dealt with equally, regardless of their difference levels (i.e., their homogeneity). Even if they are included in other clusters, all of the statistical significance of the number of common genes between clusters is simultaneously visualized. There is a risk that a small overlap block may be hidden in a large block. Assume that the clusters Xj and Yn are included in Xi and Ym, respectively. Then, if the evaluation value Ejn is less than Eim, the small block Bjn will be hidden within the large block Bim (A).
Color of Each Overlap Block
Since the statistical significance of the number of common genes between two different clusters is represented as the height of a block, the color of a block can be used to represent other information. In the current prototype, the CODM provides three color modes.
1) Redundant visualization.
The first mode is a representation of the evaluation values of overlaps using a gray scale. This redundant representation helps users comprehend the distribution of the relative evaluation values of overlaps.
2) Similarity of expression patterns.
The second mode is a representation of the similarity of expression patterns between two clusters, from red to blue. The similarity f(T,S) of expression patterns between cluster T on TOL and cluster S on SHAM was defined using the average of the square of the Euclidean distance between them. Assuming that NTS is the number of common genes in T and S, xki and yki are normalized expression levels of a common gene k at time Ti on TOL and SHAM, respectively. The similarity f(T,S) was defined as follows:
 | (5) |
Since {xti} and {ysi} (i = 1,2,...12) satisfy Eqs. 1 and 2, the range of f(T,S) is -1 to 1, and f(T,S) can be rewritten as follows (see APPENDIX):
 | (6) |
In the CODM, the similarity f(T,S) was represented as the color of the block from red (f(T,S) = 1) to blue (f(T,S) = -1). Roughly speaking, red indicates that expression patterns between the two clusters are similar, and blue indicates they have a negative correlation. In addition, purple (f(T,S) = 0) indicates they have no correlation, or genes of one cluster have no changes in expression levels, i.e.,
As mentioned above, if genes in a certain cluster based on SHAM also constitute a cluster in TOL, but the expression level in SHAM is significantly different from that in TOL, then these genes provide potential markers for the cause of ischemic tolerance. Strong candidates will appear as tall blue or purple blocks. CODM allows users to easily look for such blocks by interactively controlling the thresholds.
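The red-to-blue color scale for the similarity value can be sketched as a simple blend. The text fixes only the end points (red for similar patterns, blue for negatively correlated patterns) and purple at zero; the linear interpolation between them and the function name are assumptions of this illustration.

```python
def similarity_to_rgb(f):
    """Map a similarity in [-1, 1] to an (r, g, b) triple with components in [0, 1].

    ASSUMPTION: linear blend, so 1 gives pure red, -1 gives pure blue,
    and 0 gives an even red/blue mix (purple). Out-of-range input is clamped.
    """
    f = max(-1.0, min(1.0, f))
    red = (1.0 + f) / 2.0
    blue = (1.0 - f) / 2.0
    return (red, 0.0, blue)


for value in (1.0, 0.0, -1.0):
    print(value, similarity_to_rgb(value))
```

With this mapping, tall blocks whose color moves from purple toward blue correspond to cluster pairs that share many genes but differ in expression pattern.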
3) Relationship with a known gene classification.
The third type of information is a representation of the relationship between overlapping genes and a known gene classification. If statistically significant representation of genes within a particular class is observed among the overlapping genes, then the block is color coded according to the class. The level of statistical significance of the representation of genes within a particular class is evaluated using Eq. 3, where g is the total number of genes that are classified by the known classification, n1 is the number of genes that are classified by the known classification among overlapping genes, n2 is the total number of genes within a class based on the known gene classification, and k is the observed number of genes found in both the given overlapping genes and the given class according to the known gene classification.
In this report, we associated overlapping genes with eight types of transcription factors (HIF, ARNT, and EGR families) that were reported to have a relationship with ischemia (5, 8, 18, 19). We extracted complete sequences of 1.0 kb upstream and 0.1 kb downstream for 2,816 UniGenes among the 5,249 UniGenes corresponding to 8,737 probes on the RG-U34A microarray. The 1.1-kb sequences of the 2,816 UniGenes were searched to determine whether they correspond to the TRANSFAC matrices v7.2 (11) with the threshold set to "minimum false negative." Table 1 shows the names of the transcription factors, the number of UniGenes that correspond to each transcription factor, and the thresholds for matching. In CODM, we color coded overlap blocks that contain statistically meaningful numbers of genes with putative transcription factor binding sites. If an overlap block shows statistical significance for the putative binding sites of multiple transcription factors, then only the single transcription factor with the highest evaluation value is visualized. However, the CODM allows users to click overlap blocks and browse description messages (in a console window) for the relationships with all of the transcription factors.
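The class-based coloring described above can be sketched as follows: each class is scored with the hypergeometric tail using the parameter roles given in the text (g, classified genes; n1, classified genes in the overlap; n2, class size; k, genes in both), and the block takes the color of the best-scoring class. The score is assumed to be the negative base-10 logarithm of the P value, the 2.0 cutoff is passed in as a parameter, and the gene ids and class memberships below are toy data, not the TRANSFAC results.

```python
from math import comb, log10


def enrichment_score(g, n1, n2, k):
    """ASSUMED score: -log10 of the probability of at least k shared genes."""
    tail = sum(comb(n1, i) * comb(g - n1, n2 - i)
               for i in range(k, min(n1, n2) + 1))
    return log10(comb(g, n2)) - log10(tail)


def best_class(overlap_genes, classes, classified_genes, min_score=2.0):
    """Return (class_name, score) for the highest-scoring class, or None.

    Only genes covered by the known classification are counted, and only
    classes reaching min_score are eligible to color the block.
    """
    in_block = overlap_genes & classified_genes
    best = None
    for name, members in classes.items():
        k = len(in_block & members)
        if k == 0:
            continue
        score = enrichment_score(len(classified_genes), len(in_block), len(members), k)
        if score >= min_score and (best is None or score > best[1]):
            best = (name, score)
    return best


# Toy data: 100 classified genes and two hypothetical binding-site classes.
classified = set(range(100))
toy_classes = {"HIF1": set(range(0, 5)), "EGR1": set(range(50, 90))}
print(best_class({0, 1, 2, 3, 60, 200}, toy_classes, classified))
```

Gene 200 in the example is ignored because it lies outside the classified set, mirroring the restriction of the analysis to genes with extracted promoter sequences.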
RESULTS AND DISCUSSION
Figure 4 shows the visualization results of the comparison between TOL and SHAM in the mode of redundant visualization, the similarity of the expression patterns, and the relationships with known gene classifications (transcription factors). In Fig. 4, the cut level for the distance for hierarchical clustering was 0.74, and all overlap blocks with 2.0 or higher evaluation values are displayed as a 3D histogram. As Fig. 4 shows, the CODM provides not only a 3D mode but also a two-dimensional (2D) mode where users can see a projected overhead view of the 3D mode. In the 3D mode, the statistical significance of the overlaps between clusters and the differences in expression levels between the clusters can be simultaneously represented, since we can use the height and color of blocks. However, it is somewhat difficult to recognize the expression patterns of clusters that generate an overlapping block. For this purpose, the 2D mode is better, although the 2D mode of CODM can visualize only a single type of information at a time, i.e., the statistical significance of the overlaps or the differences in expression levels between clusters, or relationships with known gene classification. Therefore, it is useful to interactively change the mode as required. Exploration by changing the color mode and the 2D and 3D modes allowed us to pick up three potentially important overlap blocks (Fig. 4). The information for these three overlap blocks is shown in Table 2, their gene lists are shown in the Supplemental Material, and their expression patterns are shown in Fig. 5. (The Supplemental Material is available at the Physiological Genomics web site.)

Fig. 4. Visualizations for comparison of clustering results of TOL and SHAM. These are visualization results of the comparisons between TOL and SHAM in the mode of redundant visualization (A and B), similarity of the expression patterns (C and D), and the relationships with transcription factors (E and F). Here, the cut level of the distance for hierarchical clustering was 0.74, and all of the overlap blocks with 2.0 or higher evaluation values are displayed as three-dimensional (3D) histograms. As shown, the CODM provides not only a 3D mode (B, D, and F) but also a two-dimensional (2D) mode (A, C, and E) where users can see a projected overhead view of the 3D mode. In the mode showing the relationships with the transcription factors (E and F), we considered the relationships with 8 types of transcription factors (HIF, ARNT, and EGR families) that are known to mediate response to ischemia. Here, only overlap blocks with 2.0 or higher evaluation values of the number of genes with putative transcription factor binding sites were color coded. Where an overlap block represents statistical significance for multiple transcription factors putative binding sites, only the transcription factor with the highest evaluation value was visualized. Exploration through changing the color mode and the 2D and 3D mode allowed us to pick up three potentially important overlap blocks that represented high evaluation values of the number of genes with the binding sites (E > 2.0).

Fig. 5. Expression patterns of genes in the three overlap blocks. These are the expression patterns of common genes for the three overlap blocks that were picked up through exploration with CODM (Fig. 4). The "Expression Patterns of Cluster Ti(/Si)" (i = a,b,c) are the expression patterns of the common genes of the overlap block i in TOL(/SHAM).
As stated above, we assumed that there are four issues for a comparison of clustering results: changes in the composition of the cluster sets, changes in the expression patterns, relationships with other known gene information, and threshold problems. The CODM enables us to address these issues as follows.
Changes in the Composition of the Cluster Sets
As shown in Fig. 4, A and B, the CODM can intuitively visualize changes in the composition of the cluster sets as 3D histograms. That is, the dissimilarity of the expression level under SHAM divides each cluster on TOL into specific subclusters, and these subclusters are displayed along the y-axis. In the same manner, the relationships between each cluster of SHAM and all of the clusters of TOL are displayed on the x-axis. If a clustering analysis is conducted for the merged data of TOL and SHAM, then these subclusters would be scattered and it would be difficult to intuitively observe the relationships of the compositions of the cluster sets.
Changes in the Expression Pattern
A comparison of the dynamic changes of gene expression level across time under various conditions provides a useful tool for interpreting complex biological processes. However, there are generally many false candidate genes whose expression patterns between two different conditions are different purely by chance. For the comparison between TOL and SHAM, only 357 probes (of the 3,363 selected probes) had 0.8 or higher correlation coefficient values of expression pattern between the two conditions. On the other hand, 756 probes had negative correlation coefficient values. As stated above, the difference of macroscopic phenomena that the conditions exhibit results from the difference of expression of not a single gene but of multiple genes. Therefore, it is quite important to search for genes whose expression patterns changed in a similar fashion between different conditions. Figure 4, C and D, shows that the CODM can simultaneously depict the statistical significance of the overlaps between clusters and the differences in their expression patterns. In this mode, tall blocks colored blue or purple, such as blocks B and C, would be good candidates, since their similarities of expression patterns were negative (-0.28 and -0.23), while the two clusters under different conditions share a statistically meaningful number of common genes (E = 53.3 and E = 34.8). Note that the objective of the CODM is to identify such potentially important pairs of clusters from massive combinations. To further understand the significance of the expression patterns, it would be desirable to combine CODM with other visualization tools that provide a line graph view of expression patterns, as shown in Fig. 5. The expression of genes in TOL in block B was upregulated, compared with SHAM, at an early stage, i.e., 1 h, 3 h, and 12 h. On the other hand, the expression of genes in TOL in block C was downregulated, compared with SHAM, at an early stage, i.e., 1 h and 3 h.
Once again, CODM enabled us to easily detect candidate genes of this type.
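The probe-level comparison described above (counting probes whose expression patterns correlate strongly, or negatively, between the two conditions) can be sketched as follows. This is a minimal illustration only: it assumes the correlation coefficient is the Pearson coefficient computed across matched time points, and the function names and toy profiles are hypothetical, not taken from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # flat profile: correlation undefined, treated as 0 here
    return sxy / sqrt(sxx * syy)

def summarize(tol, sham, high=0.8):
    """Count probes with r >= high and probes with r < 0.

    tol, sham: dicts mapping probe id -> expression levels per time point.
    """
    r = {p: pearson(tol[p], sham[p]) for p in tol if p in sham}
    n_high = sum(1 for v in r.values() if v >= high)
    n_neg = sum(1 for v in r.values() if v < 0)
    return r, n_high, n_neg

# Toy profiles, made up for illustration (not data from the study).
tol = {"p1": [1, 2, 3, 4], "p2": [4, 3, 2, 1], "p3": [2, 1, 4, 3]}
sham = {"p1": [2, 4, 6, 8], "p2": [1, 2, 3, 4], "p3": [1, 2, 3, 4]}
r, n_high, n_neg = summarize(tol, sham)
print(n_high, n_neg)
```

A per-probe count of this kind shows how many patterns differ, but not which of the differences are shared by groups of genes; that is the gap the cluster-level view in CODM is meant to fill.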
Integration with Other Known Gene Information
In gene expression analysis, interpretation and validation of the results should be performed in the context of what is already known about the genes being analyzed. CODM allows us to associate the results with such gene information and to narrow down candidates. Figure 4, E and F, shows the relationships of the overlap blocks with eight types of transcription factors (HIF, ARNT, and EGR families; see Table 1) that were reported to have a relationship with ischemia (5, 8, 18, 19). In Fig. 4, E and F, overlap blocks with evaluation values of 2.0 or higher for the representation of genes with putative transcription factor binding sites were color coded. Table 2 shows that overlap blocks A, B, and C implied a relationship with the transcription factors (E > 2.0). This example illustrates the utility of representing relationships with other known gene-associated information by the color of overlap blocks, although it may be difficult to draw biological conclusions because of the limited number of genes with putative binding sites in the overlap blocks. If binding site information becomes available for more genes, then more detailed analysis of the results will be possible. Furthermore, representation of relationships with other known gene classifications should provide deeper insights.
Threshold Problems
Arbitrary selection of thresholds involves a risk of overlooking important genes. In a comparison of cluster sets on gene expression profiles, there are four types of thresholds: 1) a threshold for generating clusters for each condition; 2) a threshold for evaluating the number of common genes that two clusters share; 3) a threshold for evaluating the differences in the expression patterns between two clusters; and 4) a threshold for evaluating the relationship with other known gene information. The CODM reduces the number of thresholds and allows users to interactively change the thresholds as follows.
1) Threshold for generating clusters for each condition.
Since conventional hierarchical clustering does not focus on subclusters that are included in other clusters, there is a risk that the important subclusters could be overlooked. In the CODM, overlaps of genes between any two clusters of TOL and SHAM are statistically evaluated, even if these are included in other clusters. In addition, the CODM allows users to interactively change the cut level, to reduce the risk that a small overlap block may be hidden in a large block (Fig. 6). Therefore, by considering the homogeneity of clusters and the relationships with other known gene information, the user should be able to find the important genes displayed as blocks.

Fig. 6. Interactive changes of cut levels. In CODM, there is a risk that a small overlap block may be hidden in a large block. To avoid this problem, CODM allows the user to change the cut level interactively. If the user decreases the cut level, then some small blocks that are hidden in larger blocks will emerge. By considering the homogeneity of clusters and the relationships with other gene information, the user can find important genes displayed as blocks in the CODM.
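The effect of changing the cut level can be illustrated with a small sketch. It assumes a dendrogram stored as nested tuples of the form (merge height, left subtree, right subtree) with gene names as leaves; this representation and the function names are hypothetical and are not the CODM implementation.

```python
# A dendrogram node is either a leaf (a gene name, str) or a tuple
# (height, left, right), where height is the merge distance.

def leaves(node):
    """Return all gene names under a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def cut(node, level):
    """Cut the dendrogram at `level`.

    A subtree whose merge height is <= level is kept as one cluster;
    otherwise the cut descends into its two children.
    """
    if isinstance(node, str) or node[0] <= level:
        return [sorted(leaves(node))]
    _, left, right = node
    return cut(left, level) + cut(right, level)

# Toy dendrogram, made up for illustration.
tree = (0.9, (0.4, (0.1, "g1", "g2"), "g3"), (0.3, "g4", "g5"))
for level in (1.0, 0.5, 0.2):
    print(level, cut(tree, level))
```

Lowering the level from 1.0 to 0.2 splits the single large cluster into progressively smaller subclusters, which is the mechanism by which small blocks hidden inside larger ones can emerge.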
2) Threshold for evaluating the number of common genes shared by two clusters.
In CODM, the statistical significance of the number of common genes between two different clusters is represented as the height of a block, and statistical significances of the overlap of all combinations of clusters are displayed as a 3D histogram at the same time. Therefore, without the selection of an arbitrary threshold, the distribution of the statistical significance of the overlap is effectively displayed. Although (to reduce the rendering load) Fig. 4 shows only overlap blocks with 2.0 or higher evaluation values of the overlap, users can interactively change this value.
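One common way to score the overlap between two clusters is a hypergeometric tail probability. The sketch below is an assumption for illustration only: this section does not restate how the evaluation value E is computed, so the score used here (the negative base-10 logarithm of the tail probability) and all function names are hypothetical.

```python
from math import comb, log10

def overlap_pvalue(n_total, n_a, n_b, n_common):
    """P(X >= n_common) for a hypergeometric draw.

    n_b genes are drawn without replacement from n_total genes,
    of which n_a belong to the other cluster.
    """
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, k) * comb(n_total - n_a, n_b - k)
               for k in range(n_common, upper + 1))
    return tail / comb(n_total, n_b)

def evaluation_value(n_total, n_a, n_b, n_common):
    """Hypothetical score: -log10 of the tail probability (assumed form)."""
    return -log10(overlap_pvalue(n_total, n_a, n_b, n_common))

# Toy numbers, made up for illustration: two clusters of 5 genes out of 10
# that share all 5 genes.
print(evaluation_value(10, 5, 5, 5))
```

Under this assumed score, a display threshold of 2.0 would correspond to a tail probability of 0.01; a block height proportional to the score then shows the whole distribution of overlaps without fixing a cutoff in advance.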
3) Threshold for evaluating the differences in the expression patterns between two clusters.
CODM represents the differences in the expression patterns between two clusters by the color of the blocks ranging from red to blue. Therefore, the distribution of differences in the expression patterns of all combinations of clusters is displayed at the same time, without any selection of an arbitrary threshold.
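A continuous color scale of this kind can be sketched as a linear blend between blue and red. The mapping below is an assumption for illustration (the exact color scale used by CODM is not given here): it takes a similarity value in the range −1 to 1, with blue for the most dissimilar patterns and red for the most similar, and the function name is hypothetical.

```python
def similarity_to_rgb(f):
    """Map a similarity f in [-1, 1] to an (r, g, b) tuple.

    -1 maps to pure blue, +1 to pure red, and values near 0 to purple.
    Values outside the range are clamped.
    """
    f = max(-1.0, min(1.0, f))
    t = (f + 1.0) / 2.0  # rescale to [0, 1]
    return (round(255 * t), 0, round(255 * (1.0 - t)))

print(similarity_to_rgb(-1.0), similarity_to_rgb(0.0), similarity_to_rgb(1.0))
```

Because every block receives a color from the same continuous scale, no cutoff on pattern difference has to be chosen before the map is drawn.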
4) Threshold for evaluating the relationships with other known gene information.
Although only overlap blocks with 2.0 or higher evaluation values for the representation of genes with putative transcription factor binding sites were color coded in Fig. 4E and Fig. 4F, users can interactively change this value.
Conclusion
In this report we described the characteristics of the CODM method, a visualization tool for comparing clustering results of gene expression profiles under two different conditions. In CODM, the utilization of 3D space and color allows us to intuitively visualize changes in the composition of cluster sets, changes in the expression patterns of genes between the two conditions, and the relationships with a known gene classification such as transcription factors. Comparison of dynamic changes of gene expression levels across time under different conditions is required in a wide variety of fields of gene expression analysis, including toxicogenomics and pharmacogenomics. Since CODM integrates and simultaneously visualizes various types of information across clustering results, it can be applied to various analyses in these fields.
APPENDIX
Similarity f(T,S)
The similarity f(T, S) satisfies the following inequality:
−1 ≤ f(T, S) ≤ 1
Proof.
Since f(T, S) ≤ 1 is obvious, we only need to prove −1 ≤ f(T, S). We begin by showing that
where
We consider the Lagrangian function
where λ is a Lagrange undetermined multiplier. By taking the derivative, we convert the constrained optimization problem into an unconstrained problem as follows:
The solutions of this problem are
or
Therefore,
FOOTNOTES
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).
Address for reprint requests and other correspondence: M. Kano, Intelligent Cooperative System, Dept. of Information Systems, Research Center for Advanced Science and Technology, Univ. of Tokyo, Tokyo 153-8904, Japan (E-mail: mkano{at}cyber.rcast.u-tokyo.ac.jp).
10.1152/physiolgenomics.00107.2004.
1 The Supplemental Material (Supplemental Tables S1–S3) for this article is available online at http://physiolgenomics.physiology.org/cgi/content/full/00107.2004/DC1.
REFERENCES
- Alizadeh AA and Staudt LM. Genomic-scale gene expression profiling of normal and malignant immune cells. Curr Opin Immunol 12: 219–225, 2000.
- Chiang LW, Grenier JM, Ettwiller L, Jenkins LP, Ficenec D, Martin J, Jin F, DiStefano PS, and Wood A. An orchestrated gene expression component of neuronal programmed cell death revealed by cDNA array analysis. Proc Natl Acad Sci USA 98: 2814–2819, 2001.
- Cho RJ, Huang M, Campbell MJ, Dong H, Steinmetz L, Sapinoso L, Hampton G, Elledge SJ, Davis RW, and Lockhart DJ. Transcriptional regulation and function during the human cell cycle. Nat Genet 27: 48–54, 2001.
- Eisen MB, Spellman PT, Brown PO, and Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95: 14863–14868, 1998.
- Huang LE, Arany Z, Livingston DM, and Bunn HF. Activation of hypoxia-inducible transcription factor depends primarily upon redox-sensitive stabilization of its alpha subunit. J Biol Chem 271: 32253–32259, 1996.
- Ishii M, Hashimoto S, Tsutsumi S, Wada Y, Matsushima K, Kodama T, and Aburatani H. Direct comparison of GeneChip and SAGE on the quantitative accuracy in transcript profiling analysis. Genomics 68: 136–143, 2000.
- Kano M, Nishimura K, Tsutsumi S, Aburatani H, Hirota K, and Hirose M. Cluster overlap distribution map: visualization for gene expression analysis using immersive projection technology. Presence: Teleoperators and Virtual Environments 12: 96–109, 2003.
- Kawahara N, Wang Y, Mukasa A, Furuya K, Shimizu T, Hamakubo T, Aburatani H, Kodama T, and Kirino T. Genome-wide gene expression analysis for induced ischemic tolerance and delayed neuronal death following transient global ischemia in rats. J Cereb Blood Flow Metab 24: 212–223, 2004.
- Kirino T. Ischemic tolerance. J Cereb Blood Flow Metab 22: 1283–1296, 2002.
- Manger ID and Relman DA. How the host "sees" pathogens: global gene expression responses to infection. Curr Opin Immunol 12: 215–218, 2000.
- Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Munch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, and Wingender E. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 31: 374–378, 2003.
- Rhodes DR, Barrette TR, Rubin MA, Ghosh D, and Chinnaiyan AM. Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res 62: 4427–4433, 2002.
- Saban MR, Hellmich H, Nguyen NB, Winston J, Hammond TG, and Saban R. Time course of LPS-induced gene expression in a mouse model of genitourinary inflammation. Physiol Genomics 5: 147–160, 2001.
- Seo J and Shneiderman B. Interactively exploring hierarchical clustering results. IEEE Computer 35: 80–86, 2002.
- Shiffman D, Mikita T, Tai JT, Wade DP, Porter JG, Seilhamer JJ, Somogyi R, Liang S, and Lawn RM. Large scale gene expression analysis of cholesterol-loaded macrophages. J Biol Chem 275: 37324–37332, 2000.
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, and Golub TR. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96: 2907–2912, 1999.
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, and Church GM. Systematic determination of genetic network architecture. Nat Genet 22: 281–285, 1999.
- Wang GL, Jiang BH, Rue EA, and Semenza GL. Hypoxia-inducible factor 1 is a basic-helix-loop-helix-PAS heterodimer regulated by cellular O2 tension. Proc Natl Acad Sci USA 92: 5510–5514, 1995.
- Yan SF, Lu J, Zou YS, Soh-Won J, Cohen DM, Buttrick PM, Cooper DR, Steinberg SF, Mackman N, Pinsky DJ, and Stern DM. Hypoxia-associated induction of early growth response-1 gene expression. J Biol Chem 274: 15030–15040, 1999.
Copyright © 2005 by the American Physiological Society.