Clustering gene expression data using adaptive double self-organizing map

Habtom Ressom, Dali Wang and Padma Natarajan

Intelligent Systems Laboratory, Department of Electrical and Computer Engineering, University of Maine, Orono, Maine 04469


    ABSTRACT
This paper presents a novel clustering technique known as the adaptive double self-organizing map (ADSOM). ADSOM has a flexible topology and performs clustering and cluster visualization simultaneously, thereby requiring no a priori knowledge about the number of clusters. ADSOM is based on a recently introduced technique known as the double self-organizing map (DSOM). DSOM combines features of the popular self-organizing map (SOM) with two-dimensional position vectors, which serve as a visualization tool for deciding how many clusters are needed. Although DSOM addresses the problem of identifying an unknown number of clusters, its free parameters are difficult to control so as to guarantee correct results and convergence. ADSOM updates its free parameters during training, and its position vectors converge to a fairly consistent number of clusters provided that the initial number of nodes is greater than the expected number of clusters. The number of clusters can be identified by visually counting the clusters formed by the position vectors after training. A novel index based on hierarchical clustering of the final locations of the position vectors is introduced. The index allows automated detection of the number of clusters, thereby reducing the human error that could be incurred by counting clusters visually. The reliability of ADSOM in identifying the number of clusters is demonstrated by applying it to publicly available gene expression data from multiple biological systems, including yeast, human, and mouse. ADSOM’s performance in detecting the number of clusters is compared with that of a model-based clustering method.

microarray; detecting number of clusters; unsupervised learning; cluster visualization


    INTRODUCTION
RECENT TECHNOLOGICAL ADVANCES allow us to measure expression levels for thousands of genes simultaneously (24). Efficient techniques are urgently needed to analyze the resulting gene expression data effectively. The availability of reliable and accurate analysis tools will help the research community identify the genetic makeup of diseases and lead to suitable medical interventions.

Key and early processing steps in the analysis of gene expression data include clustering groups of genes that manifest similar expression patterns. Genes with similar expression profiles might be transcriptionally regulated through the same transduction pathway. Thus one rationale for clustering gene expression data is to identify new transduction pathways or novel genes that may be coregulated through the same known pathway.

A wide range of techniques has been applied for clustering gene expression data. Examples include hierarchical clustering (4), adaptive resonance theory (ART) (21), self-organizing map (SOM) (28), k-means (29), graph-theoretic approaches (1, 9), fuzzy ART (31), fuzzy c-means (8), fuzzy Kohonen (7), and growing cell structures network (7). However, most of the above-mentioned clustering algorithms are heuristically motivated, and the issues of determining the "correct" number of clusters and choosing a "good" clustering algorithm have not yet been rigorously solved. Clustering gene expression data with hierarchical clustering (4) and SOM has been very popular in the bioinformatics research community.

Hierarchical clustering organizes the expression profiles in a hierarchical tree structure, which allows detection of higher-order relationships between clusters of profiles. Although hierarchical clustering has proven valuable for describing gene expression (4, 25), it has several shortcomings. The hierarchical trees do not reflect the multiple ways in which expression patterns of genes can be similar. The deterministic nature of hierarchical clustering can cause the data points to be clustered on the basis of local decisions, with no opportunity to reevaluate. As the amount of data increases, this problem is exacerbated. Mangiameli et al. (19) compared SOM with the hierarchical clustering method and found that SOM is superior in both robustness and accuracy.

The design of SOM starts with defining a geometric configuration for the partitions in a one- or two-dimensional grid. Then, random weight vectors are assigned to each partition. During training, a gene expression profile (input vector) is picked randomly. The weight vector closest to the expression profile is identified. The identified weight vector and its neighbors are adjusted to look similar to the expression profile. This process is repeated until the weight vectors converge. During operation, SOM maps gene expression profiles to the relevant partitions based on the weight vectors to which they are most similar. However, in circumstances where the expected number of partitions (clusters) available in gene expression data is unknown, the validation of the SOM clustering result becomes a critical issue. One may heuristically validate the clustering results to identify a reasonable number of clusters.
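The SOM training loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the grid size, learning-rate decay, and Gaussian neighborhood width are illustrative choices.

```python
import numpy as np

def train_som(data, grid_h=3, grid_w=3, epochs=50, eta=0.5, sigma=1.0, seed=0):
    """Minimal 2-D SOM: a grid_h x grid_w grid of weight vectors trained
    with a Gaussian neighborhood that shrinks over time."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.normal(size=(grid_h * grid_w, dim))      # random initial weights
    # fixed grid coordinates of each node (the SOM's geometric configuration)
    coords = np.array([(r, c) for r in range(grid_h) for c in range(grid_w)], float)
    for k in range(epochs):
        lr = eta * (1.0 - k / epochs)                      # decaying learning rate
        sig = sigma * (1.0 - k / epochs) + 1e-2            # shrinking neighborhood
        for x in data[rng.permutation(len(data))]:         # profiles in random order
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            d2 = np.sum((coords - coords[winner]) ** 2, axis=1)
            h = np.exp(-d2 / (2.0 * sig ** 2))             # neighborhood function
            weights += lr * h[:, None] * (x - weights)     # pull winner and neighbors
    return weights

def winning_node(x, weights):
    """Map an expression profile to the node whose weight vector is closest."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))
```

During operation, `winning_node` plays the role of mapping each profile to its most similar partition.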

Many validation techniques have been implemented to evaluate clustering results. Yeung et al. (35) introduced the figure of merit (FOM); Tibshirani (30) applied the "gap statistic"; Jain and Dubes (14) referred to the "Hubert and Jaccard index"; Hubert and Arabie (12) addressed the use of the "adjusted Rand index"; Lubovac et al. (18) proposed an "entropy measure"; Musavi et al. (21) used the "change in internode distance per cluster." All of these approaches have proven valuable in experiments. However, the use of a heuristic evaluation technique makes the clustering process extremely time-consuming and complicated given the large volume and high dimension of gene expression data.

A number of methods have been proposed in the literature to accomplish data partitioning and cluster validation/visualization either simultaneously or independently. Fraley and Raftery (5, 6) developed model-based clustering that provides functionality for displaying/visualizing cluster results. Ramoni et al. (23) introduced a Bayesian method for model-based clustering of gene expression dynamics. Kaski et al. (15, 16) introduced methods for detecting, visualizing, and interpreting clusters generated by SOM. They improved the SOM-based U-matrix method (32) for visualization of cluster information. Nikkilä et al. (22) and Kaski (17) have applied this improved method to the analysis and visualization of gene expression data. Herrero et al. (10, 11) have proposed a combination of SOM and a hierarchical method for clustering gene expression data.

Su and Chang (26) developed a new technique known as the double self-organizing map (DSOM). In DSOM, each center in the network has an N-dimensional weight vector and a two-dimensional position vector. The position vectors are projections of the weight vectors into a two-dimensional space and serve as a visualization tool for deciding how many clusters are needed, thus combining clustering and cluster visualization in one computational procedure. In other words, with the help of position vectors, DSOM adjusts its network structure during the learning phase so that neurons that respond to similar stimuli not only have similar weight vectors but also move spatially nearer to each other.

Although DSOM provides a visualization tool for cluster evaluation, the selection of its free parameters is vital for a proper projection of the position vectors into a two-dimensional space. Some combinations of these parameters make all the position vectors converge too quickly into a small dense area; others cause the updating process to "get stuck" after several epochs and yield a wrong number of clusters. Thus the regulation of these parameters is very important.

In this paper, we propose an adaptive double self-organizing map (ADSOM), which updates the free parameters involved in DSOM during the training process. This is achieved by carefully analyzing the mathematical relationships between the parameters and the updating processes. The new approach improves on the original DSOM by making its results more reliable. Unlike DSOM, ADSOM gives a fairly consistent number of clusters provided that the initial number of nodes is greater than the expected number of clusters.

To demonstrate its effectiveness, ADSOM is applied to cluster gene expression data from multiple biological systems, including yeast, human, and mouse. The results show that ADSOM is a reliable technique for clustering gene expression data. ADSOM addresses the issue of identifying an unknown number of clusters while simultaneously partitioning the data.


    SYSTEMS AND METHODS
In addition to updating its weights and position vectors, ADSOM offers a systematic method of adapting its free parameters during the training process, thereby providing improved convergence. Let wj be the N-dimensional synaptic weight vector of node j; let pj be a two-dimensional position vector corresponding to node j; pw(k) and ww(k) represent the position and weight vectors of the winning neuron when the kth input vector x(k) is applied, respectively. The weight and position vectors are updated as follows

(1)

(2)
where

(3)

(4)

(5)

(6)

(7)

(8)
where ||·|| denotes the Euclidean distance; ηw and ηp are initial learning rates; sw is a scalar parameter that regulates how fast the function Λw(k,j) changes; kmax is the maximum number of epochs; sp(k) and sx(k) are the values at epoch k of the scalar parameters that regulate the speed of movement of the position vectors; ti(k) specifies the ratio of sx(k) to sp(k) within a given boundary; f is a continuous function of α; and α(k) is the maximum change of the position vectors at epoch k. Equations 1–8 are repeated until α is steadily less than a prespecified threshold β.
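The update equations themselves appear as images in the original and are not reproduced here. The following sketch therefore illustrates only a generic DSOM-style coupled weight/position update consistent with the textual description; the particular neighborhood functions, the adaptive schedules for sp(k) and sx(k), and the function f are assumptions, not the authors' exact formulas.

```python
import numpy as np

def dsom_step(x, weights, positions, eta_w=0.1, eta_p=0.1, sigma_w=1.0, sigma_p=1.0):
    """One DSOM-style update for a single input x (illustrative sketch only).
    Weights move toward x with a neighborhood measured in position space;
    position vectors of nodes whose weights resemble the winner's move
    toward the winner's position vector."""
    w_dists = np.linalg.norm(weights - x, axis=1)
    win = int(np.argmin(w_dists))                        # winning neuron
    # weight update: SOM-like, Gaussian neighborhood over position distances
    p_d2 = np.sum((positions - positions[win]) ** 2, axis=1)
    h_w = np.exp(-p_d2 / (2.0 * sigma_w ** 2))
    weights = weights + eta_w * h_w[:, None] * (x - weights)
    # position update: similarity in weight space controls the step size,
    # so dissimilar nodes barely move toward the winner's position
    sim = np.exp(-((w_dists - w_dists[win]) ** 2) / (2.0 * sigma_p ** 2))
    positions = positions + eta_p * sim[:, None] * (positions[win] - positions)
    return weights, positions
```

Iterating such steps over all inputs and epochs, with the step sizes adapted as in Eqs. 1–8, is what drives similar nodes to cluster together in the two-dimensional position space.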

Detailed mathematical derivations of the above updating equations are provided in the APPENDIX. Copies of the source code implementing ADSOM are available from the authors upon request. A scheme for selecting the initial weight vectors is outlined below.

Initialization scheme.
Proper initialization of the weight vectors of ADSOM helps the parameter updating process converge quickly. The weight initialization scheme provided by Su et al. (27) is used in ADSOM. Assuming the size of the weight array is K x L (Fig. 1), the following steps are used for initialization.



Fig. 1. The arrangement of a K x L weight array.

 
First, we select the pair of input patterns separated by the largest distance in the whole input data set and initialize the weight nodes on the lower left and upper right corners (i.e., wK,1 and w1,L) to these two input patterns, respectively. From the remaining input data, the pattern farthest from the two chosen patterns is used to initialize the weight vector on the upper left corner (i.e., w1,1). The weight on the lower right corner (i.e., wK,L) is set to the input pattern farthest from the three previously selected patterns.

Second, we proceed to initialize the weights of the neurons on the four edges by uniformly partitioning each edge into L - 1 or K - 1 segments.

Third, we initialize the remaining weights from left to right and from top to bottom as follows
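The interpolation formula for the interior weights appears as an image in the original and is not reproduced here. The sketch below follows the three steps above, assuming row-wise linear interpolation for the remaining weights; that interior formula is an assumption.

```python
import numpy as np

def init_weights(data, K, L):
    """Corner/edge/interior weight initialization in the spirit of Su et al. (27).
    The interior formula is assumed here to be row-wise linear interpolation
    (the original equation appears as an image and is not reproduced)."""
    X = np.asarray(data, float)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)        # pairwise distances
    i, j = np.unravel_index(np.argmax(D), D.shape)          # most distant pair
    W = np.zeros((K, L, X.shape[1]))
    W[K - 1, 0], W[0, L - 1] = X[i], X[j]                   # lower-left, upper-right
    score = D[:, i] + D[:, j]
    score[[i, j]] = -np.inf
    k3 = int(np.argmax(score))                              # farthest from both
    W[0, 0] = X[k3]                                         # upper-left corner
    score = D[:, i] + D[:, j] + D[:, k3]
    score[[i, j, k3]] = -np.inf
    W[K - 1, L - 1] = X[int(np.argmax(score))]              # lower-right corner
    for r in range(K):                                      # left and right edges
        t = r / (K - 1)
        W[r, 0] = (1 - t) * W[0, 0] + t * W[K - 1, 0]
        W[r, L - 1] = (1 - t) * W[0, L - 1] + t * W[K - 1, L - 1]
    for r in range(K):                                      # remaining columns
        for c in range(1, L - 1):
            t = c / (L - 1)
            W[r, c] = (1 - t) * W[r, 0] + t * W[r, L - 1]
    return W
```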

Identifying number of clusters from final position vectors.
Once the final locations of the position vectors are obtained, one can visually determine the number of clusters. To minimize possible human error in counting the clusters visually, we developed a technique that provides an index for each candidate number of clusters. The index is calculated from the outcome of clustering the position vectors themselves with a hierarchical method. Consider a 2-by-N matrix T that contains the final locations of all two-dimensional position vectors, where N denotes the total number of position vectors. We calculate the Euclidean distance between each pair of position vectors in T. Based on these distances and the subsequent distances between merged clusters, we create a hierarchical cluster tree that grows vertically upward. During clustering, we store the distance between the two clusters merged through a branch, computed by the single linkage method; hence, the height of each branch corresponds to the distance assigned to it. A horizontal line that perpendicularly cuts c branches of the hierarchical cluster tree selects the c clusters subtended by the branches it cuts. The index corresponding to this horizontal line is designated γ(c) and is defined as the sum of the heights of the branches it cuts

where zc(i) is the height contributed by the ith branch when a horizontal line cuts c branches of the tree. The index, referred to as the tree-based index, is calculated for c = 2,...,N. Note that γ(1) = 0, since no horizontal line can cut the tree exactly once. The value of c at which the index reaches its highest peak indicates the maximum separation between clusters in the two-dimensional space and is therefore used to detect the number of clusters.


    EXPERIMENTAL TESTS AND RESULTS
Experimental tests were conducted using gene expression data from multiple biological systems. The performance of ADSOM is compared with a model-based clustering method introduced by Fraley and Raftery (6) and implemented in a computer program called MCLUST (Model-Based Clustering Software). The software is publicly available at http://www.stat.washington.edu/fraley/mclust/.

We also investigated other algorithms that detect the number of clusters in gene expression data. These include a Bayesian method for model-based clustering introduced by Ramoni et al. (23) and the adaptive quality-based clustering method proposed by De Smet et al. (3). The two algorithms are implemented in computer programs called CAGED ("cluster analysis of gene expression dynamics") and Adap_Cluster (adaptive quality-based clustering) and are publicly available at http://genomethods.org/caged/ and http://www.esat.kuleuven.ac.be/~thijs/Work/Clustering.html, respectively. In both algorithms, the number of clusters detected depends on input parameter values specified by the user. In CAGED, these parameters are the Markov order, prior precision, and Bayes factor. In Adap_Cluster, the results were sensitive to a user-defined estimate of the minimal probability of a gene belonging to a cluster and to the minimum number of genes in a cluster. In this paper, we therefore compared the results obtained using ADSOM and MCLUST, because neither method requires the user to define sensitive parameters that significantly affect the outcome.

To test the consistency of ADSOM in data partitioning, we counted the number of common profiles in the clusters obtained by varying the initial number of nodes. We present the percentage of common profiles for the data sets where ADSOM resulted in a constant number of clusters. This provides a confidence measure for the clusters formed by ADSOM and illustrates the effect of different initial numbers of nodes on data partitioning.

Yeast cell cycle data with the 5-phase criterion.
We used a subset of the gene expression data provided by Cho et al. (2) that shows the fluctuation of expression levels of ~6,000 genes over two cell cycles (17 time points) corresponding to the mitotic cell cycle of Saccharomyces cerevisiae. As in Yeung et al. (34–36), a subset of 384 genes whose expression levels peak at different time points corresponding to the five phases of the cell cycle was used for our experiments. These data were obtained from supplementary material provided by Yeung et al. at http://staff.washington.edu/kayee/cluster/raw_cellcycle_384_17.txt. Prior to clustering, the data were normalized to zero mean and unit variance. ADSOM was applied to cluster this data set with different numbers of initial nodes: 9, 12, 15, and 20. As depicted in Fig. 2, the final locations of the position vectors converge into fewer groups in a two-dimensional space regardless of the initial number of nodes. We applied the tree-based validation index to estimate the optimal number of clusters from the final locations of the position vectors. As described before, the number of clusters at which the tree-based index reaches its highest peak can be used as an indicator of the optimal number of clusters. The results in Fig. 2 indicate that the optimal number of clusters is five, regardless of the number of initial nodes. This result agrees with that of Yeung et al. (34) and was as expected from a data set corresponding to the five phases of the yeast cell cycle.



Fig. 2. Final position vectors (left) after using the adaptive double self-organizing map (ADSOM) to cluster yeast cell cycle data with 9, 12, 15, and 20 initial nodes and the corresponding tree-based validation results (right).

 
Table 1 shows the number of genes in each cluster as well as the number of common genes found when clustering the yeast cell cycle data with different numbers of initial nodes (N), i.e., N = 9, 12, 15, and 20. The percentage of common genes between clusters formed with N = 9 and 12 was 88%; with N = 9, 12, and 15, 79.7%; and with N = 9, 12, 15, and 20, 70.8%.


Table 1. Number of genes as well as number of common genes for yeast cell cycle data using ADSOM with 9, 12, 15, and 20 initial nodes

 
To compare the performance of ADSOM with the model-based clustering method, we ran MCLUST for several models. The models are characterized by their covariance matrices: equal-volume spherical (EI); unequal-volume spherical (VI); ellipsoidal with equal volume, shape, and orientation (EEE); ellipsoidal with varying volume, shape, and orientation (VVV); diagonal with varying volume and shape (VVI); etc. The Bayesian information criterion (BIC), as applied to model-based clustering of gene expression data by Yeung et al. (34), was used to compare models with different numbers of clusters and different covariance matrix parameterizations. A large BIC score indicates strong preference for the corresponding model. Figure 3 shows the BIC scores obtained by clustering the yeast cell cycle data for cluster numbers ranging from 2 to 32 using five different models. As can be seen from Fig. 3, the model-based approach favors the EEE model, which reached its maximum BIC score at 6 clusters.
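Model selection by BIC over Gaussian mixtures can be sketched with scikit-learn rather than MCLUST. This is an assumption-laden analogy: sklearn's "tied" covariance type roughly corresponds to MCLUST's EEE model, and sklearn's BIC is lower-is-better, the opposite sign convention from the scores plotted in the paper. The function name is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_k_by_bic(data, k_range=range(2, 11), covariance_type="tied", seed=0):
    """Fit Gaussian mixtures for each candidate k and pick the k with the
    best (lowest, in sklearn's convention) BIC. 'tied' covariance is a
    rough analog of MCLUST's EEE model."""
    bics = []
    for k in k_range:
        gm = GaussianMixture(n_components=k,
                             covariance_type=covariance_type,
                             random_state=seed).fit(data)
        bics.append(gm.bic(data))
    return list(k_range)[int(np.argmin(bics))]
```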



Fig. 3. Bayesian information criterion (BIC) scores for yeast cell cycle data with 5 phases. EI, equal volume spherical; VI, unequal volume and spherical; EEE, ellipsoidal with equal volume, shape, and orientation; VVI, diagonal, varying volume, varying shape; and VVV, ellipsoidal, varying volume, shape, and orientation.

 
Yeast sporulation data.
Gene expression data in the budding yeast Saccharomyces cerevisiae from spotted cDNA microarrays studied during the diauxic shift, the mitotic cell division cycle, sporulation, and temperature and reducing shocks (http://genome-www.stanford.edu/clustering/) were considered. In particular, the data corresponding to sporulation were used for this experiment. From the yeast sporulation data, we extracted 218 x 11 gene expression data belonging to four classes: ribosomal proteins, respiration, mitochondrial organization, and the tricarboxylic acid pathway. Variance normalization was applied to the data. This functional classification was made using information from the Munich Information Center for Protein Sequences (MIPS) (20). ADSOM clustered these data into the four distinct classes without knowledge of the actual number of clusters. A strength of the ADSOM algorithm is that it converges to the actual number of clusters without requiring tedious cluster validation techniques. To demonstrate this, two experiments were conducted with this data set, once with 9 initial nodes and once with 20. The position vectors assigned to these initial nodes converged to four final clusters regardless of the initial number of nodes chosen. Figure 4 shows the locations of the final position vectors for 9 initial nodes (Fig. 4A) and for 20 initial nodes (Fig. 4B), and the results obtained after applying the tree-based validation index (Fig. 4C). The four final clusters, which represent the true number of clusters in the data, are shown by enclosed boundaries; the figure also shows which of the initial 9 or 20 nodes were grouped together to form the four classes. As can be seen from Fig. 4C, the optimal number of clusters is 4, regardless of the number of initial nodes (9 or 20).



Fig. 4. Final position vectors with 9 (A) and 20 (B) initial nodes and the tree-based validation results (C) using ADSOM for yeast sporulation data with 4 classes.

 
It was observed that ADSOM partitioned the data with 70% accuracy, calculated with respect to the biological classification made by MIPS. For gene expression data, one does not normally have a priori knowledge of the true number of classes; what distinguishes ADSOM from currently available techniques is that it can find the true number of classes without this knowledge.

To evaluate the consistency of the clustering, the number of common genes between corresponding clusters for different numbers of initial nodes was investigated. Table 2 lists the number of genes in the four clusters formed by ADSOM and the number of common genes between the cluster pairs formed with 9 and 20 initial nodes. The robustness and consistency of ADSOM are evident from the large percentage of common genes (90.4%) in the clusters.


Table 2. Number of genes as well as number of common genes for yeast sporulation data using ADSOM with 9 and 20 initial nodes

 
In a separate experiment, the performance of ADSOM for different cluster sizes was investigated. To accomplish this, yeast sporulation data from two, three, and four MIPS classes were extracted and clustered independently. As shown in Fig. 5, the number of clusters was correctly identified by ADSOM using nine initial nodes.



Fig. 5. Results of tree-based validation for yeast sporulation data with 2, 3, and 4 classes: 9 initial nodes.

 
Figure 6 shows the BIC scores obtained using MCLUST for four different models with cluster numbers ranging from 1 to 20. As can be seen from Fig. 6, the model-based approach favors the EEE model, whose first maximum BIC score is reached at 10 clusters and whose second maximum is reached at 5 clusters. Note that the expected number of clusters for the yeast sporulation data is 4.



Fig. 6. BIC scores for yeast sporulation data.

 
Yeast cdc15 and elu.
The spotted cDNA microarray data used for this experiment were those corresponding to cdc15 and elu (http://genome-www.stanford.edu/clustering/); the cdc15 data consisted of 15 time points following arrest of the cdc15 temperature-sensitive mutant, and the elu data contained 14 time points following elutriation. After removing genes with missing values, 1,882 genes remained in the cdc15 data and 2,020 genes in the elu data. Before applying ADSOM, both data sets were normalized to zero mean and unit variance. ADSOM was used to cluster each data set with different numbers of initial nodes: 25, 30, 36, and 42. The tree-based indices corresponding to the final locations of the position vectors were used to identify the number of clusters in both data sets. ADSOM gave a fairly consistent number of clusters in these data sets, for which the true number of clusters is unknown, demonstrating its effectiveness and reliability. As can be seen from Table 3, this experiment indicates that there are 16 or 17 clusters in both data sets.


Table 3. Number of clusters obtained using ADSOM

 
Figure 7 shows the BIC scores obtained using MCLUST for five different models in clustering the yeast elu (Fig. 7A) and yeast cdc15 (Fig. 7B) data. As can be seen from Fig. 7, the maximum BIC score is reached at 21 clusters for the EEE model when clustering the yeast elu data. For the yeast cdc15 data, the EEE model reached its first, second, and third maximum BIC scores at 27, 21, and 19 clusters, respectively.



Fig. 7. BIC scores for yeast elu data (A) and yeast cdc15 data (B).

 
UNC 9 mouse tumor data.
UNC 9 mouse tumor data were provided by the Statistical Genetics Group at the Jackson Laboratory (http://www.jax.org/staff/churchill/labsite/index.html). The data contained nine tumor samples from different mouse strains. The 9 independent tumor samples were assayed with 18 cDNA microarrays using a dye-swap reference design. There were 15,488 rows of genes and 36 columns representing the samples. After removing the rows with less than a threefold change, 12,866 rows remained, to which range normalization (i.e., between 0 and 1) was applied. We used the proposed ADSOM to cluster these tumor samples. As shown in Fig. 8B, ADSOM identified three distinct clusters: the first cluster contained samples 1, 2, 3, 8, and 9; the second contained samples 4, 5, and 6; and the third contained sample 7. As shown in Fig. 8C, the tree-based index cluster validation result also identifies three clusters. Interpreting the results, we observed that the first cluster contained all mammary tumor tissue samples, the second cluster contained all normal mammary tissue samples, and the third cluster contained Waptag liver control samples. An interesting observation is that samples cluster by tissue and tumor type rather than by mouse strain.



Fig. 8. Results obtained using hierarchical clustering (A) and ADSOM visual (B) and ADSOM tree-based validation (C) for UNC 9 tumor data: 9 initial nodes.

 
The ADSOM results were compared with hierarchical clustering in partitioning these tumor samples. MAANOVA (33) was used to perform hierarchical clustering. The results obtained using the hierarchical method are shown in Fig. 8A. Three clusters were identified which were similar to those obtained using ADSOM. However, interpretation of the number of clusters is user dependent. This can be a problem when there are a large number of clusters.

Human fibroblast data.
A subset of the gene expression data provided by Iyer et al. (13) was used for this experiment. The full data show the response of human fibroblasts to serum, using cDNA microarrays representing about 8,600 distinct human genes to observe the temporal program of transcription that underlies this response. DNA microarray hybridization was used to measure the temporal changes in mRNA levels of 8,613 human genes at 12 times, ranging from 15 min to 24 h after serum stimulation. The subset of 517 genes chosen by Iyer et al. on the basis of substantial change in expression was considered. The data containing normalized R/G ratios were obtained from http://genome-www.stanford.edu/serum/data.html.

ADSOM was applied to cluster this data set with different numbers of initial nodes: 16, 20, 25, and 30. Repeated trials resulted in either 10 or 11 final clusters. The locations of the final position vectors and the results of the tree-based index for 16 and 20 initial nodes are shown in the top (A and C) and bottom (B and D) of Fig. 9, respectively. Figure 10 depicts the BIC scores for five different models using MCLUST for the human fibroblast data. As can be seen from Fig. 10, model-based clustering favors the EEE model, whose maximum BIC score is reached at 11 clusters. Note that ADSOM also detected 11 clusters in most cases. A similar number of clusters was obtained using hierarchical clustering by Iyer et al. (13). Interpretation of the correct number of clusters using hierarchical clustering is, however, user dependent.



Fig. 9. Final position vectors after using ADSOM to cluster human fibroblasts data with 16 initial nodes (A) and 20 initial nodes (B) and corresponding tree-based validation results (C and D).

 


Fig. 10. BIC scores for the human fibroblast data.

 

    CONCLUSION
This paper shows that ADSOM gives a consistent final topology regardless of the initial topology. This topological flexibility provides credible information about the number of clusters in the data. ADSOM accomplishes cluster visualization and data partitioning simultaneously without compromising clustering accuracy, a feature achieved by adapting its free parameters during the training process. In addition, the tree-based index, computed by hierarchical clustering of the final position vectors, accomplishes cluster validation reliably. Comparisons between ADSOM and a model-based clustering method showed that ADSOM has comparable or, in some cases, better performance.

In summary, the methods described in this paper offer two main advantages. First, ADSOM converges to a consistent number of groups regardless of the initial number of nodes. This is demonstrated by testing ADSOM on real-world gene expression data. Second, ADSOM can visually present the clusters while clustering the data. With the help of a tree-based index, the number of clusters is easily identified. The approaches eliminate the trial-and-error process as well as the heuristic validating process, thus saving significant computational time.


    APPENDIX
Derivation of Updating Equations for ADSOM
The nodes of DSOM are represented not only by their weight vectors but also by two-dimensional position vectors. Let node j be represented by an N-dimensional synaptic weight vector wj and a two-dimensional position vector pj. As described in Ref. 26, the DSOM updating formulas for both vectors at epoch k are given in Eqs. A1–A6

(A1)

(A2)
where

(A3)

(A4)

(A5)

(A6)

In Eqs. A5 and A6, several parameters, such as sx, sp, ηp, and kmax, mainly affect the movement of the position vectors. In Eq. A2, pw(k) is the position vector corresponding to the winning weight at epoch k; pj(k) and pj(k + 1) are the old and new jth position vectors, respectively; η2(k,j) and hw(k,j) are both positive scalars and are always less than 1. Let us define a new positive scalar Φj(k) as follows

(A7) Φj(k) = η2(k,j)hw(k,j)

In Fig. 11, OA, OB, and OC represent the vectors pj(k), pj(k + 1), and pw(k), respectively. According to Eq. A2, one can easily show that points A, B, and C lie on a straight line. As shown in Fig. 11, the ratio between the lengths of vectors AB and AC, which represent the vectors pj(k + 1) - pj(k) and pw(k) - pj(k), respectively, is Φj(k). We can therefore conclude that the larger Φj(k) is, the faster the jth position vector pj moves toward pw.



Fig. 11. Movement of position vector.
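The geometry of the update in Eq. A2 can be sketched in a few lines of Python; a minimal illustration only, where the value of Φj(k) is arbitrary and the helper names are not from the paper:

```python
# Position-vector update of Eq. A2 written as a convex step:
# p_j(k+1) = p_j(k) + Phi_j(k) * (p_w(k) - p_j(k)), with 0 < Phi_j(k) < 1.
# The new point B lies on the segment from A = p_j(k) to C = p_w(k),
# and |AB| / |AC| equals Phi_j(k), as in Fig. 11.
import math

def update_position(p_old, p_winner, phi):
    """Move a 2-D position vector the fraction phi of the way to the winner."""
    return tuple(a + phi * (c - a) for a, c in zip(p_old, p_winner))

p_j = (0.0, 0.0)          # point A
p_w = (4.0, 3.0)          # point C
phi = 0.25                # arbitrary illustrative value of Phi_j(k)

p_new = update_position(p_j, p_w, phi)               # point B
ratio = math.dist(p_new, p_j) / math.dist(p_w, p_j)  # |AB| / |AC|
# Collinearity of A, B, C: the cross product of AB and AC vanishes.
cross = (p_new[0] - p_j[0]) * (p_w[1] - p_j[1]) - (p_new[1] - p_j[1]) * (p_w[0] - p_j[0])
print(p_new, ratio, cross)
```

Running the sketch confirms that the ratio equals the chosen Φj(k) and that the cross product is zero, i.e., the three points are collinear.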

 
Let us assume that there are two nodes with weight vectors w1 and w2 and position vectors p1 and p2, respectively. At epoch k and for a given input pattern x(k), if w1(k) is closer to x(k) than w2(k), i.e., ||w1(k) - x(k)|| is less than ||w2(k) - x(k)||, one requires that p1(k) move toward pw(k) faster than p2(k). However, Eqs. A1–A6 (the original DSOM algorithm) cannot guarantee this, which is one of the reasons why the updating process sometimes gets stuck with unexpected results. Therefore, the relationship given below ought to hold to keep the position vectors from "getting stuck."

(A8) [||wj(k) - x(k)|| - ||wi(k) - x(k)||][Φj(k) - Φi(k)] ≤ 0
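To see how the requirement can fail, note that Φj(k) = η2(k,j)hw(k,j) decays with the position distance ||pj(k) - pw(k)|| as well as with the weight distance. The sketch below assumes simple exponential-decay forms with arbitrary constants (not the exact Eqs. A5 and A6); it shows a node whose weight is closer to the input nevertheless moving more slowly because its position vector starts far from the winner:

```python
# Assumed illustrative form: Phi = eta_p * exp(-s_x * A**2 - s_p * d),
# where A is the weight-distance gap to the winner and d = ||p - p_w||.
# Because Phi also decays with the position distance d, a node with the
# smaller weight gap can still end up with the smaller Phi, violating
# the requirement that it move toward p_w faster.
import math

def phi(weight_gap, pos_dist, eta_p=0.9, s_x=1.0, s_p=1.0):
    """Assumed combined factor Phi = eta_2 * h_w (illustrative only)."""
    return eta_p * math.exp(-s_x * weight_gap**2 - s_p * pos_dist)

# Node 1: weight close to the input, but position far from the winner.
# Node 2: weight farther from the input, but position near the winner.
phi_1 = phi(weight_gap=0.1, pos_dist=3.0)
phi_2 = phi(weight_gap=0.5, pos_dist=0.1)
print(phi_1 < phi_2)  # True: the node with the larger weight gap moves faster
```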

Instead of using constant values for sp and sx, we suggest updating them at every epoch; thus we define sp(k) and sx(k). This improves the model's adaptability. Moreover, to analyze the proposed algorithm in a more general way, we change the power applied to ||wi(k) - x(k)|| - ||ww(k) - x(k)|| in Eq. A6 from 2 to a variable n (n > 0) and the power applied to ||pi(k) - pw(k)|| in Eq. A5 from 1 to a variable m (m > 0).

Defining Ai(k) = ||wi(k) - x(k)|| - ||ww(k) - x(k)||, Eq. A8 can be rewritten as

(A9) [Aj(k) - Ai(k)][Φj(k) - Φi(k)] ≤ 0
If Aj(k) = Ai(k), then Eq. A9 is always valid regardless of the relationship between Φi(k) and Φj(k). So in the following, we analyze the cases when Aj(k) ≠ Ai(k). Letting

sp'(k) = sp(k)[1 + (k/kmax)]

and

sx'(k) = sx(k)[1 + (k/kmax)]

Case 1: if Aj(k) > Ai(k), then Φj(k) ≤ Φi(k)

Since sp'(k) and sx'(k) have the same positive coefficient, i.e., 1 + (k/kmax), we get the following

(A10) Φj(k)/Φi(k) = exp{-sx'(k)[Aj^n(k) - Ai^n(k)] - sp'(k)[||pj(k) - pw(k)||^m - ||pi(k) - pw(k)||^m]}

Since Aj(k) > Ai(k) > 0, the inequality Aj^n(k) > Ai^n(k) > 0 holds. Also, since sx(k) > 0 and sp(k) > 0, Eq. A10 can be rewritten as

(A11) sx(k)/sp(k) ≥ [||pi(k) - pw(k)||^m - ||pj(k) - pw(k)||^m]/[Aj^n(k) - Ai^n(k)]

Case 2: if Aj(k) < Ai(k), then Φj(k) ≥ Φi(k). Following the same steps used in case 1, we can prove that

(A12) sx(k)/sp(k) ≥ [||pj(k) - pw(k)||^m - ||pi(k) - pw(k)||^m]/[Ai^n(k) - Aj^n(k)]

From Eqs. A11 and A12, we conclude that Eq. A8 holds if

(A13) sx(k)/sp(k) ≥ [||pi(k) - pw(k)||^m - ||pj(k) - pw(k)||^m]/[Aj^n(k) - Ai^n(k)] for all i, j with Aj(k) > Ai(k)
Equation A13 gives us one end of the boundary for the ratio sx(k)/sp(k); the other end is not specified. Bounding the ratio on both sides would provide a more complete range and further improve the accuracy and effectiveness of ADSOM. In this paper, we use a particular choice of this ratio, as follows.

Defining

we specify the following relationship that allows Eq. A13 to be satisfied and which we found to be effective in our experiments.

(A14)

Defining a stopping criterion is important. On the one hand, overtraining could worsen the results; on the other hand, undertraining will produce an uncertain outcome. So, instead of directly finding a way to intelligently select the number of epochs, we define a stopping criterion. Letting the maximum change of the position vectors at epoch k be

α = maxj ||pj(k) - pj(k - 1)||

and defining a threshold β as 0.1% of the largest distance among all the original position vectors, we cease the training process if α remains less than β. This definition is not general, but it is suitable for our experiments.
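This stopping rule can be sketched as follows; the interpretation of "always less than" as the per-epoch displacement simply remaining below the threshold, and all helper names, are assumptions:

```python
# Stop training once alpha, the maximum displacement of any position
# vector in one epoch, falls below beta = 0.1% of the largest pairwise
# distance among the initial position vectors.
import math
from itertools import combinations

def threshold_beta(initial_positions):
    """beta = 0.1% of the largest distance among the original positions."""
    return 0.001 * max(math.dist(a, b) for a, b in combinations(initial_positions, 2))

def should_stop(prev_positions, positions, beta):
    """alpha = max_j ||p_j(k) - p_j(k-1)||; stop when alpha < beta."""
    alpha = max(math.dist(p, q) for p, q in zip(prev_positions, positions))
    return alpha < beta

initial = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
beta = threshold_beta(initial)                 # 0.1% of the 10*sqrt(2) diagonal
nudged = [(x + 0.001, y) for x, y in initial]  # tiny movement -> stop
shifted = [(x + 1.0, y) for x, y in initial]   # large movement -> keep training
print(should_stop(initial, nudged, beta), should_stop(initial, shifted, beta))
```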

If α becomes larger, which means the position vectors move faster, then one needs to reduce the speed of movement of the position vectors, which means that sp needs to be increased. Similarly, if α becomes smaller, then sp needs to be decreased. So, at epoch k, the relationship between α and sp can be represented by a continuous mathematical function f as follows

(A15)
Once we know sp(k), we can use Eq. A14 to adjust the parameter sx(k). We suggest using any function f(y) that satisfies the following three requirements: 1) for y ∈ (-1, +∞), f(y) ∈ (0, C), where C is a positive constant; 2) f(y) is a continuous, increasing, and nonlinear function; and 3) f(0) = 1. In fact, many functions have all of the properties mentioned above. In our experiments, we used


    FOOTNOTES
 
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

Address for reprint requests and other correspondence: H. Ressom, Intelligent Systems Laboratory, Dept. of Electrical and Computer Engineering, Univ. of Maine, Orono, ME 04469 (E-mail: ressom{at}eece.maine.edu).

10.1152/physiolgenomics.00138.2002.


    REFERENCES

  1. Ben-Dor A, Shamir R, and Yakhini Z. Clustering gene expression patterns. J Comput Biol 6: 281–297, 1999.
  2. Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, and Davis RW. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 2: 65–73, 1998.
  3. De Smet F, Mathys J, Marchal K, Thijs G, De Moor B, and Moreau Y. Adaptive quality-based clustering of gene expression profiles. Bioinformatics 18: 735–746, 2002.
  4. Eisen MB, Spellman PT, Brown PO, and Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95: 14863–14868, 1998.
  5. Fraley C and Raftery AE. MCLUST: Software for Model-Based Clustering, Discriminant Analysis and Density Estimation (Technical Report 415). Seattle, WA: University of Washington, Department of Statistics, October 2002.
  6. Fraley C and Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Am Statist Assoc 97: 611–631, 2002.
  7. Granzow M, Berrar D, Dubitzky W, Shuster A, Azuaje FJ, and Eils R. Tumor classification by gene expression profiling: comparison and validation of five clustering methods. In: SIGBIO Special Interest Group on Biomedical Computing of the ACM. New York: ACM Press, 2001, vol. 21, no. 1, p. 16–22.
  8. Guthke R, Schmidt-Heck W, Hahn D, and Pfaff M. Gene expression data mining for functional genomics. In: Proceedings of the European Symposium on Intelligent Techniques (ESIT 2000), Aachen, Germany, 14–15 September 2000, p. 170–177.
  9. Hartuv E, Schmitt A, Lange J, Meirer-Ewert S, Lehrach H, and Shamir R. An algorithm for clustering cDNAs for gene expression analysis. In: Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB99), Lyon, France, 1999. New York: ACM Press, 1999, p. 188–197.
  10. Herrero J, Valencia A, and Dopazo J. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 17: 126–136, 2001.
  11. Herrero J and Dopazo J. Combining hierarchical clustering and self-organizing maps for exploratory analysis of gene expression patterns. J Proteome Res 1: 467–470, 2002.
  12. Hubert L and Arabie P. Comparing partitions. J Classification 2: 193–218, 1985.
  13. Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JFC, Trent JM, Staudt LM, Hudson J, Boguski MS, Lashkari D, Shalon D, Botstein D, and Brown PO. The transcriptional program in the response of human fibroblasts to serum. Science 283: 83–87, 1999.
  14. Jain AK and Dubes RC. Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
  15. Kaski S, Nikkilä J, and Kohonen T. Methods for exploratory cluster analysis. In: Proceedings of SSGRR 2000, International Conference on Advances in Infrastructure for Electronic Business, Science, and Education on the Internet, L'Aquila, July 31–August 6. Scuola Superiore G. Reiss Romoli, 2000. (Proceedings on CD-ROM, ISBN 88-85280-52-8).
  16. Kaski S, Nikkilä J, and Kohonen T. Methods for interpreting a self-organized map in data analysis. In: Proceedings of the 6th European Symposium on Artificial Neural Networks (ESANN '98), Bruges, Belgium, April 22–24, edited by Verleysen M. Brussels, Belgium: D-Facto, 1998, p. 185–190.
  17. Kaski S. SOM-based exploratory analysis of gene expression data. In: Advances in Self-Organizing Maps, edited by Allinson N, Yin H, Allinson L, and Slack J. London: Springer, 2001, p. 124–131.
  18. Lubovac Z, Olsson B, Jonsson P, Laurio K, and Andersson ML. Biological and statistical evaluation of gene expression profiles. In: Proceedings of Mathematics and Computers in Biology and Chemistry 2001, p. 149–155.
  19. Mangiameli P, Chen SK, and West D. A comparison of SOM neural network and hierarchical clustering methods. Eur J Operational Res 93: 402–417, 1996.
  20. Mewes HW, Albermann K, Heumann K, Liebl S, and Pfeiffer F. MIPS: a database for protein sequences, homology data and yeast genome information. Nucleic Acids Res 25: 28–30, 1997.
  21. Musavi MT, Wang D, and Chelian S. Gene expression data clustering using Adaptive Resonance Theory. In: Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS), Las Vegas, NV, June 24–27, 2002, edited by Arabnia HR. CSREA Press, p. 230–235 (ISBN 1-892512-39-4).
  22. Nikkilä J, Törönen P, Kaski S, Venna J, Castrén E, and Wong G. Analysis and visualization of gene expression data using self-organizing maps. In: Neural Networks: Special Issue on New Developments on Self-Organizing Maps. Oxford, UK: Pergamon, 2002, vol. 15, issue 8–9, p. 953–966.
  23. Ramoni M, Sebastiani P, and Kohane L. Cluster analysis of gene expression dynamics. Proc Natl Acad Sci USA 99: 9121–9126, 2002.
  24. Schena M, Shalon D, Davis RW, and Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467–470, 1995.
  25. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, and Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9: 3273–3297, 1998.
  26. Su M and Chang H. A new model of self-organizing neural networks and its application in data projection. IEEE Trans Neural Networks 12: 153–158, 2001.
  27. Su M, Liu T, and Chang H. An efficient initialization scheme for the self-organizing feature map algorithm. In: Proceedings of the IEEE International Joint Conference on Neural Networks, Washington, DC, 1999, p. 1906–1910.
  28. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, and Golub TR. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96: 2907–2912, 1999.
  29. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, and Church GM. Systematic determination of genetic network architecture. Nat Genet 22: 281–285, 1999.
  30. Tibshirani R, Walther G, and Hastie T. Estimating the Number of Clusters in a Dataset via the Gap Statistic (Technical Report 208). Stanford, CA: Stanford University, Dept. of Statistics, 2000.
  31. Tomida S, Hanai T, Honda H, and Kobayashi T. Gene expression analysis using fuzzy ART. Genome Informatics 12: 245–246, 2001.
  32. Ultsch A and Siemon H. Kohonen's self-organizing maps for exploratory data analysis. In: Proceedings of the International Neural Network Conference (INNC '90). Dordrecht, Netherlands: Kluwer, 1990, p. 305–308.
  33. Wu H, Kerr MK, Xiangqin C, and Churchill GA. MAANOVA: a software package for the analysis of spotted cDNA microarray experiments. In: The Analysis of Gene Expression Data: Methods and Software, edited by Parmigiani G, Garrett ES, Irizarry RA, and Zeger SL. New York: Springer, 2003. (http://www.jax.org/staff/churchill/labsite/pubs/Wu_maanova.pdf).
  34. Yeung KY, Fraley C, Murua A, Raftery AE, and Ruzzo WL. Model-based clustering and data transformations for gene expression data. Bioinformatics 17: 977–987, 2001.
  35. Yeung KY, Haynor DR, and Ruzzo WL. Validating clustering for gene expression data. Bioinformatics 17: 309–318, 2001 (http://www.cs.washington.edu/homes/kayee/cluster).
  36. Yeung KY and Ruzzo WL. Principal component analysis for clustering gene expression data. Bioinformatics 17: 763–774, 2001.