Expression imbalance map: a new visualization method for detection of mRNA expression imbalance regions

Makoto Kano1, Kunihiro Nishimura2, Shumpei Ishikawa3, Shuichi Tsutsumi3, Koichi Hirota4, Michitaka Hirose4 and Hiroyuki Aburatani3

1 School of Engineering, University of Tokyo, Tokyo 113-8655
2 School of Information Science and Technology, University of Tokyo, Tokyo 113-8655
3 Genome Science Division, Department of Information Systems, Research Center for Advanced Science and Technology, University of Tokyo, 153-8904, Japan
4 Intelligent Cooperative System, Department of Information Systems, Research Center for Advanced Science and Technology, University of Tokyo, 153-8904, Japan


    ABSTRACT
We describe a new visualization method, the expression imbalance map (EIM), for detecting mRNA expression imbalance regions, which reflect genomic losses and gains, at a much higher resolution than conventional technologies such as comparative genomic hybridization (CGH). Simple spatial mapping of microarray expression profiles onto chromosomal locations provides little information about genomic structure, because mRNA expression levels do not completely reflect genomic copy number and some microarray probes are of low quality. The EIM combines a hypergeometric distribution-based algorithm with a visualization that avoids arbitrary selection of thresholds, and it has a high tolerance of these complicating factors. Applied to oligonucleotide microarray gene expression data from lung cancer specimens, the EIM detected regionally underexpressed or overexpressed genes (called here expression imbalance regions). Many known as well as potential loci with frequent genomic losses or gains were detected as expression imbalance regions. The EIM should therefore provide the user with further insight into genomic structure through mRNA expression.

gene expression profiling; allelic imbalance; chromosome mapping; hypergeometric distribution; computing methodologies


    INTRODUCTION
THE RECENT DEVELOPMENT of microarray technology has enabled simultaneous measurement of genome-wide expression profiles. Many research studies have revealed strong correlations between the expression profiles and cancer classifications. The next era of gene expression analysis would involve systematic integration of expression profiles and other types of gene information, such as locus, gene function, and sequence information. In particular, integration between expression profiles and locus information should be effective in detecting gene structural abnormalities such as genomic gains and losses.

In general, cancer progression is not a single-step but a multistep process that involves many genomic structural abnormalities. Among them, genomic gains and losses, particularly deletion of tumor suppressor genes and amplification of oncogenes, are associated with cancer progression and its malignant phenotype, although the affected regions vary among different types of cancers. Comparative genomic hybridization (CGH), which detects genome-wide abnormalities such as copy number changes, has been applied to various types of cancers (5), but its low resolution (~20 Mb, corresponding to about 200 genes) makes it difficult to identify the causal genes whose structural alteration is critical for the biological behavior of the cancer.

Integration of gene expression profiles and gene locus information might allow detection of copy number changes at a much higher resolution. Several studies using oligonucleotide probe arrays have suggested a strong relationship between genomic structural abnormalities and expression imbalances (underexpression or overexpression). Mukasa et al. (7) reported that the expression levels of a significant number of genes in the 1p region were reduced to about 50% in oligodendrogliomas with 1p loss of heterozygosity (LOH). Furthermore, Virtaneva et al. (12) reported that acute myeloid leukemia with trisomy 8 was associated with overexpression of genes on chromosome 8. Recently, a genome-wide transcriptome map of non-small cell lung carcinomas, based on gene expression profiles generated by serial analysis of gene expression (SAGE), was reported (3). However, simple spatial mapping of expression profiles onto chromosomal locations sometimes provides little information about genomic structure, for the following reasons: 1) since some microarray probes are of low quality, microarray signal intensities do not always reflect the expression levels of their target mRNAs; and 2) the mRNA expression level does not completely reflect genomic copy number. The aim of the present study was to develop a new method with high tolerance of such complicating factors, designed to detect regionally underexpressed or overexpressed genes in cancer specimens compared with the corresponding normal tissues. The expression imbalance region constituted by these genes likely reflects genomic structural changes such as chromosomal gain and loss.

When developing the methodology that integrates the expression profiles and locus information, two significant problems have to be dealt with. First, a definition of what constitutes an expression imbalance region is not yet clarified. How many base pairs on chromosome should be considered as a genomic region (referred to below as chromosomal proximity)? To consider that a certain gene is differentially expressed in cancer and normal tissue, how much difference in the gene expression level is needed between the two (referred to below as cancer specificity)? It is generally very difficult to determine adequate thresholds for chromosomal proximity and cancer specificity. Arbitrary selection of thresholds would involve a risk of overlooking significant genes (that is, "threshold problem"). In addition, to detect expression imbalance regions, it is necessary to search for genes with both cancer specificity and chromosomal proximity. Because determining these two thresholds synergistically increases the risk of overlooking significant genes, the "threshold problem" is more critical in this case.

Statistical theories such as hypothesis testing are helpful when selecting thresholds. However, commonly used statistical criteria are themselves arbitrarily determined. If thresholds are determined automatically on the basis of statistical theory, the user cannot search for additional genes of potential significance, because almost no information remains about the genes that were overlooked. Therefore, to detect as many significant genes as possible, a comprehensive presentation of the distribution of the "false balance" (that is, the balance between false negatives and false positives) is more valuable than an attempt to find a single, potentially optimal statistical criterion.

Second, there are many candidate expression imbalance regions. Some of them may be a family of genes that are tandemly repeated and are under similar transcriptional regulations. To confirm that a candidate locus is biologically significant, human curation is necessary, using a variety of biological information. Therefore, it is important to present large genome-wide data in a comprehensive manner, indicating which genes are to be further examined. That is, a broadband interface between humans and computers is essential.

We focused on visualization technology as the key technology to solve these two problems. Visualization is effective in providing, genome-wide, the false-balance distribution and indication of the genes that are worth examining. The visualization used in our report would make it possible to present the images of all genes that have both cancer specificity and chromosomal proximity.

In this study, we developed a novel visualization method for detecting expression imbalance regions at much higher resolution than conventional technologies such as CGH, called the expression imbalance map (EIM). The EIM was applied to gene expression data of lung squamous cell carcinoma measured by oligonucleotide microarray and detected many known as well as potential loci with frequent genomic losses or gains as regional signal images on chromosomes (expression imbalance regions). In addition, the EIM could detect not only the expression imbalance common to all cancer specimens, but also individual differences among cancer specimens.


    MATERIAL AND METHODS
Data Sets
In this article, the EIM is illustrated using the gene expression data of lung cancer from the study of Bhattacharjee et al. (1). In this experiment, total mRNA was extracted from histologically defined specimens of squamous cell lung carcinomas (abbreviated here as "SQ"; n = 21) and normal lung tissues (abbreviated here as "NL"; n = 17). The expression profiles were obtained using human U95A oligonucleotide probe arrays (GeneChip; Affymetrix, Santa Clara, CA). The SQ-NL gene expression data set (SQ, n = 21; NL, n = 17) was then analyzed using the EIM.

Feature Selection and Logarithmic Transformation
To compensate for distortion in the expression level, expression values were limited to the range from 1 to 8,000. In addition, 4,083 probes with a mean expression above 50 and a coefficient of variation (CV = standard deviation/mean) above 0.2 were selected to eliminate potential low-quality probes. The common logarithm of the gene expression data was used for the following analysis.
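The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the toy probe IDs, the use of the population standard deviation, and the choice to compute the mean and CV after clipping are all assumptions. CV is taken in its usual sense of standard deviation divided by mean.

```python
import math
import statistics


def select_and_log_transform(profiles, floor=1.0, ceiling=8000.0,
                             min_mean=50.0, min_cv=0.2):
    """Clip, filter, and log10-transform probe intensities.

    profiles: dict mapping a probe ID to its raw intensities across arrays.
    Returns a dict holding only the retained probes, in log10 units.
    """
    selected = {}
    for probe_id, values in profiles.items():
        # Limit every value to the range [floor, ceiling].
        clipped = [min(max(v, floor), ceiling) for v in values]
        mean = statistics.fmean(clipped)
        # Coefficient of variation = standard deviation / mean
        # (population standard deviation is an assumption here).
        cv = statistics.pstdev(clipped) / mean
        if mean > min_mean and cv > min_cv:
            selected[probe_id] = [math.log10(v) for v in clipped]
    return selected


# Hypothetical toy data, for illustration only.
toy = {
    "probe_flat": [100.0, 101.0, 99.0, 100.0],   # high mean, CV too small
    "probe_low": [5.0, 20.0, 1.0, 40.0],         # variable, mean too low
    "probe_ok": [100.0, 400.0, 50.0, 9000.0],    # kept; 9000 is clipped to 8000
}
result = select_and_log_transform(toy)
print(sorted(result))
```

Only the third toy probe passes both filters, and its largest value enters the analysis as log10(8000).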

Translation from Probe to UniGene
To associate gene locus information with gene expression profiles, each "probeID" on the U95A array was translated to UniGene, using information on the UniGene web site of the National Center for Biotechnology Information (NCBI), by referring to the corresponding original GenBank accession number of each probe set. Then, 11,334 of 12,533 probes on the U95A array were translated into 8,851 UniGenes.

Gene Locus Information
Gene locus information was obtained from the web sites for Genes On Sequence Map (Homo sapiens build 27) of NCBI and is defined as "LocusID." Among the LocusIDs on chromosome 1 to 22 of Genes On Sequence Map, the 12,063 LocusIDs, which had the corresponding UniGenes, were utilized to identify the chromosome locations of genes. Since the gene expression data utilized in this study were obtained from both sexes, the X and Y chromosomes were excluded. However, by using the data obtained from only males or females, the EIM can be applied to the analysis of chromosome X and Y. Since the 12,063 LocusIDs had one-to-one correspondence with UniGenes, they were translated into 12,063 UniGenes. However, only 6,652 of the 12,063 UniGenes were in common with the 8,851 UniGenes translated from the probes on the U95A array (Fig. 1). In this article, these 6,652 UniGenes are called "Key-UniGenes." The distributions of the UniGenes and Key-UniGenes on each arm of the chromosome are shown in Table 1. The number of total Key-UniGenes was defined as U (=6,652).
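The definition of Key-UniGenes amounts to a set intersection between the UniGenes reached from the array probes and those reached from the LocusIDs. The sketch below assumes two hypothetical lookup tables with made-up identifiers; it only illustrates the bookkeeping, not the actual NCBI data files.

```python
def key_unigenes(probe_to_unigene, locus_to_unigene):
    """Return the UniGene IDs that have both an array probe and a mapped locus."""
    from_probes = set(probe_to_unigene.values())   # several probes may share one UniGene
    from_loci = set(locus_to_unigene.values())     # one-to-one in the text
    return from_probes & from_loci


# Hypothetical identifiers, for illustration only.
probe_to_unigene = {"probe_a": "Hs.1", "probe_b": "Hs.1", "probe_c": "Hs.2"}
locus_to_unigene = {"locus_10": "Hs.2", "locus_11": "Hs.3"}
keys = key_unigenes(probe_to_unigene, locus_to_unigene)
print(keys)
```

In the study, the same intersection of 8,851 probe-derived UniGenes with 12,063 locus-derived UniGenes gave U = 6,652.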



Fig. 1. Correspondence between probeIDs and LocusIDs. To associate gene locus information with gene expression profiles, probeIDs on the Affymetrix U95A oligonucleotide arrays and the LocusIDs on Genes On Sequence Map (Homo sapiens build 27) of NCBI were translated into UniGenes. We utilized the 12,063 LocusIDs, which had the corresponding UniGenes, on chromosome 1 to 22 of Genes On Sequence Map. The X and Y chromosomes were excluded, because the gene expression data utilized in this study were obtained from both sexes. Since these 12,063 LocusIDs had one-to-one correspondence with UniGenes, these were translated into 12,063 UniGenes. Out of 12,533 probes on the U95A array, 11,334 were translated into unduplicated 8,851 UniGenes, by referring to the corresponding original GenBank accession number of each probe set. Although the 12,063 UniGenes were obtained from Genes On Sequence Map, only 6,652 of the 12,063 UniGenes were in common with the 8,851 UniGenes translated from the probes on the U95A array. In this article, these 6,652 UniGenes are called "Key-UniGenes."

 

Table 1. Number of the UniGenes and Key-UniGenes on Genes On Sequence Map

 
Quantization of Each Chromosome Arm Region
For easier handling of the gene locus information, each chromosome arm region was quantized into unit regions called "buckets," each 100,000 base pairs (100 kbp) long, and the Key-UniGenes were assigned to the corresponding buckets according to their reading positions (Fig. 2, A and B). A reading position indicates the start position for gene transcription and was obtained from Genes On Sequence Map. The number of buckets on a given chromosome arm, arm, was defined as Larm.
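Bucket assignment reduces to an integer division of the reading position by the bucket length. The sketch below assumes bucket indices counted from the start of the chromosome, which matches the later example in which cluster C3q_1894_5 begins at 189,400 kbp; the function name and the gene label are hypothetical.

```python
from collections import defaultdict

BUCKET_SIZE_BP = 100_000  # 100 kbp per bucket, as stated in the text


def assign_buckets(reading_positions):
    """Group genes by bucket index.

    reading_positions: dict mapping a gene ID to its reading position in base pairs.
    Returns a dict mapping a bucket index to the list of genes in that bucket.
    """
    buckets = defaultdict(list)
    for gene, position in reading_positions.items():
        buckets[position // BUCKET_SIZE_BP].append(gene)
    return dict(buckets)


# Hypothetical gene at 189.45 Mbp falls in bucket 1894.
buckets = assign_buckets({"gene_x": 189_450_000, "gene_y": 189_499_999})
print(buckets)
```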



Fig. 2. Formation of clusters of genes with chromosomal proximity. A: for easier handling of the gene locus information, each chromosome arm region was quantized by unit region called "bucket" whose length was 100 kbp, and the Key-UniGenes were assigned the corresponding buckets according to their reading positions, which were obtained from Genes On Sequence Map (Homo sapiens build 27) of NCBI. The number of buckets on chromosome arm arm was defined as Larm. To evaluate the proximity of genes on chromosome arm arm, the Key-UniGenes on the length neighbor buckets from (begin)-th were defined as a cluster Carm_length_begin. B: to avoid considering a region containing large gaps between genes as "one region," the gaps between Key-UniGenes which lie next to each other in Carm_length_begin were calculated and the maximal gap was defined as gaparm_length_begin. The expression imbalance map (EIM) allows the user to filter out the clusters whose gaparm_length_begin is more than gapmax, which can be changed interactively. In other words, the user can exclude regions containing large gaps by controlling gapmax. C: repeating the sufficiently minute changes of length and begin formed the exhaustive uncertainty cluster set of locus information. The EIM allows even the clusters that overlap each other or include others. Therefore, all neighbor buckets in any area of each chromosome arm were defined as clusters.

 
Formation of Locus Cluster
To evaluate the proximity of genes on chromosome arm arm, the Key-UniGenes located in the length consecutive buckets starting from the begin-th bucket were defined as a cluster, Carm_length_begin (Fig. 2A). Repeating sufficiently small changes of length and begin formed the exhaustive uncertainty cluster sets of Key-UniGenes with chromosomal proximity (Fig. 2C). The EIM allows clusters that overlap each other or include others. Therefore, every run of neighboring buckets in any area of each chromosome arm was defined as a cluster. The number of Key-UniGenes in the cluster Carm_length_begin was defined as narm_length_begin. Carm_length_begin was defined for all admissible combinations of arm, length, and begin.

In addition, to avoid considering a region that contains large gaps between genes as "one region," the gaps between the Key-UniGenes that lie next to each other in Carm_length_begin were calculated and the maximal gap was defined as gaparm_length_begin (Fig. 2B). The EIM allows the user to filter out the cluster(s) whose gaparm_length_begin is more than gapmax, which can be changed interactively. In other words, the user can exclude regions containing large gaps by controlling gapmax. When gapmax values were 500 kbp, 1 Mbp, 2 Mbp, and 3 Mbp, the percentages of the gaps that were less than gapmax were 77.6, 89.4, 96.0, and 98.2%, respectively, among all gaps between the Key-UniGenes that lie next to each other.
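The exhaustive formation of locus clusters and the gapmax filter can be sketched for a single chromosome arm as below. This is an illustrative brute-force enumeration under stated assumptions: the function names, the dictionary layout, the restriction of begin to the span between the first and last occupied bucket, and the skipping of empty clusters are choices made here, not details taken from the article.

```python
def enumerate_clusters(gene_positions, bucket_size=100_000):
    """List every run of consecutive buckets that contains at least one gene.

    gene_positions: dict mapping a gene ID to its reading position (bp) on one arm.
    Each cluster records begin, length (in buckets), its genes, and its maximal gap.
    """
    genes = sorted(gene_positions.items(), key=lambda item: item[1])
    if not genes:
        return []
    first = genes[0][1] // bucket_size
    last = genes[-1][1] // bucket_size
    clusters = []
    for begin in range(first, last + 1):
        for length in range(1, last - begin + 2):
            low = begin * bucket_size
            high = (begin + length) * bucket_size
            members = [(g, p) for g, p in genes if low <= p < high]
            if not members:
                continue
            # Gaps between genes that lie next to each other inside the cluster.
            gaps = [b[1] - a[1] for a, b in zip(members, members[1:])]
            clusters.append({
                "begin": begin,
                "length": length,
                "genes": [g for g, _ in members],
                "max_gap": max(gaps, default=0),
            })
    return clusters


def filter_by_gap(clusters, gap_max):
    """Keep only the clusters whose maximal gap does not exceed gap_max."""
    return [c for c in clusters if c["max_gap"] <= gap_max]


# Hypothetical positions: g3 lies 780 kbp away from g2.
positions = {"g1": 50_000, "g2": 120_000, "g3": 900_000}
clusters = enumerate_clusters(positions)
kept = filter_by_gap(clusters, gap_max=500_000)
print(len(clusters), len(kept))
```

With gap_max set to 500 kbp, every cluster that spans the large gap between g2 and g3 is dropped, while overlapping and nested clusters on either side of the gap remain.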

EIM for Detection of Expression Imbalance Specific To Squamous Cell Carcinomas
Clusters consisting of genes with expression profiles specific to SQs.
Probes with expression profiles specific to SQs were extracted as a cluster from 4,083 probes of SQ-NL data sets. Although the EIM does not depend on the type of statistical method used for evaluating the difference between two groups, nonparametric tests such as the Mann-Whitney test have the advantage that no assumption is needed about the distribution of data, compared with parametric tests such as the t-test. Thus we explain the case of the Mann-Whitney test as an example.

More specifically, the difference in the level of expression of each gene between two groups (SQs and NLs) was defined using the statistical probability, P, of rank sum. Assume that there are two groups (Ga, n = Na; Gb, n = Nb) and the rank sums in Ga and Gb are Suma and Sumb, respectively, when all elements (Na + Nb) are sorted in order. For simplicity, assume that Suma/Na is greater than or equal to Sumb/Nb. P is the probability of observing the rank sum of the Na elements, which are randomly selected from all elements, to be more than Suma.

Based on this P value, the differential level D1(g), in which g is the probe name, was defined as follows

\[ D_1(g) = -\log_{10} P \tag{1} \]
Probes whose differential level D1 was equal to or more than diff were defined as a cluster of probes with expression profiles specific to SQs, Csign_diff (Fig. 3). The suffix sign indicates the differential direction (+, overexpression; -, underexpression in SQs). Repeating sufficiently small changes of diff formed the exhaustive uncertainty set of the clusters specific to SQs. Csign_diff was defined for both directions of sign and for all values of diff.

For example, C+3 was a cluster of probes whose differential level D1(g) of overexpression was 3 or more. The EIM was constructed by all the clusters Csign_diff with diff greater than or equal to the minimum acceptable differential level dmin (Fig. 3). Since the default value of dmin is 2, all the clusters, Csign_diff, would be utilized. The EIM allows the user to control dmin interactively for narrowing down the probes, if needed.
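The rank-sum probability and the resulting differential level can be sketched as follows. Several points are assumptions made for this illustration: the differential level is taken as the negative common logarithm of P, the tail is counted as "observed rank sum or more" so that the most extreme case has nonzero probability, tied values are not handled, and the function names are hypothetical. The exact null distribution of the rank sum is obtained by counting subsets with a small dynamic program.

```python
import math


def upper_tail_probability(n_a, n_total, observed_sum):
    """P(rank sum of a random n_a-subset of ranks 1..n_total >= observed_sum)."""
    max_sum = n_total * (n_total + 1) // 2
    # ways[j][s]: number of j-element subsets of the ranks whose sum is s.
    ways = [[0] * (max_sum + 1) for _ in range(n_a + 1)]
    ways[0][0] = 1
    for rank in range(1, n_total + 1):
        # Descending j ensures each rank is used at most once per subset.
        for j in range(min(rank, n_a), 0, -1):
            for s in range(max_sum, rank - 1, -1):
                ways[j][s] += ways[j - 1][s - rank]
    favourable = sum(ways[n_a][observed_sum:])
    return favourable / math.comb(n_total, n_a)


def differential_level(cancer, normal):
    """Return (sign, D1) for one probe; assumes no tied values."""
    values = sorted(cancer + normal)
    rank = {v: i + 1 for i, v in enumerate(values)}
    sum_c = sum(rank[v] for v in cancer)
    sum_n = sum(rank[v] for v in normal)
    # The group with the higher mean rank plays the role of Ga in the text.
    if sum_c / len(cancer) >= sum_n / len(normal):
        sign, p = "+", upper_tail_probability(len(cancer), len(values), sum_c)
    else:
        sign, p = "-", upper_tail_probability(len(normal), len(values), sum_n)
    return sign, -math.log10(p)  # assumed form of the differential level


# Hypothetical log10 expression values with complete separation of the groups.
sign, d1 = differential_level([2.1, 2.2, 2.3], [1.1, 1.2, 1.3])
print(sign, d1)
```

In this toy case only 1 of the 20 possible rank assignments is as extreme as the observed one, so P = 1/20. A probe would belong to C+diff whenever its sign is "+" and its level is at least diff.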



Fig. 3. Probe permutation arranged in order of the difference in gene expression level between squamous cell lung carcinomas (SQs) and normal lungs (NLs). Probes on the U95A arrays are lined up in order of the D1(g) level, which represents the difference in the gene expression level between SQs and NLs. Only probes with differential levels of 2 or more were arranged. Probes with underexpression and overexpression in SQs are arranged on the left and right side, respectively. Probes whose differential level D1(g) is equal to or more than diff, are defined as a cluster of probes with expression profiles specific to SQs, Csign_diff. The suffix sign indicates the differential direction (+, overexpression; -, underexpression in SQs). Repeating the sufficiently minute changes of diff formed the exhaustive uncertainty set of the clusters specific to SQs. The EIM was constructed by all clusters Csign_diff with diff that were greater than or equal to the minimum acceptable differential level dmin. Since the default value of dmin is 2, all the clusters, Csign_diff, would be utilized. The EIM allows the user to control dmin interactively for narrowing down the probes, if needed.

 
The numbers of probes, UniGenes, and Key-UniGenes of each cluster are shown in Table 2; nsign_diff is the number of Key-UniGenes translated from probes of Csign_diff. When multiple probes in a cluster could be mapped to a single UniGene, only the probe with the highest D1 value was adopted. In addition, Fig. 3 shows probe permutations whose differential levels are 2 or more, arranged in the order of the differential level. Probes with under- and overexpression are arranged on the left and the right of Fig. 3, respectively.
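The rule that only the probe with the highest D1 value represents a UniGene can be sketched with a single pass over the cluster. The function name, the probe and UniGene identifiers, and the decision to skip probes without a UniGene mapping are assumptions for this illustration.

```python
def collapse_to_unigenes(cluster_probes, probe_to_unigene, d1):
    """Map each UniGene in a cluster to its probe with the highest D1 value."""
    best = {}
    for probe in cluster_probes:
        unigene = probe_to_unigene.get(probe)
        if unigene is None:
            continue  # probe could not be translated to a UniGene
        if unigene not in best or d1[probe] > d1[best[unigene]]:
            best[unigene] = probe
    return best


# Hypothetical values, for illustration only.
d1 = {"p1": 2.5, "p2": 4.0, "p3": 3.0, "p4": 6.0}
probe_to_unigene = {"p1": "Hs.10", "p2": "Hs.10", "p3": "Hs.20"}
representatives = collapse_to_unigenes(["p1", "p2", "p3", "p4"], probe_to_unigene, d1)
print(representatives)
```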


Table 2. Clusters of probes with expression profiles specific to the group of squamous cell lung carcinomas

 
Construction of the EIM.
To detect the expression imbalance regions, it is necessary to search for genes with both cancer specificity and chromosomal proximity. The fundamental algorithm of the EIM is to statistically evaluate the overlaps between clusters of genes with cancer specificity and clusters of genes with chromosomal proximity. The clusters specific to the group of SQs, Csign_diff, are arranged on the abscissa, and the locus clusters, Carm_length_begin, are on the ordinate, as shown in Fig. 4. The variable k is the number of common Key-UniGenes between Csign_diff and Carm_length_begin.



Fig. 4. Clusters of genes specific to the group of SQs vs. clusters of genes with proximity on chromosomes. A: to detect expression imbalance regions, it is necessary to search for genes with both cancer specificity and chromosomal proximity. The fundamental algorithm of the EIM is to evaluate statistically the overlaps between clusters of genes with cancer specificity and clusters of genes with chromosomal proximity. The clusters of probes with expression specific to the group of SQ, Csign_diff, are arranged on the abscissa, and those of Key-UniGenes with proximity on chromosomes, Carm_length_begin, on the ordinate. Among Csign_diff values, the clusters of probes with underexpression and overexpression in SQs are arranged on the left and right side, respectively. The nsign_diff and narm_length_begin are the numbers of Key-UniGenes in Csign_diff and Carm_length_begin, respectively; k is the number of common Key-UniGenes both in Csign_diff and Carm_length_begin. The statistical significance of the overlap between Csign_diff and Carm_length_begin was visualized in the intersection area Rsign_diff_arm_length_begin as a gray scale. B: the area where the multiple Rsign_diff_arm_length_begin overlapped was overwritten at the maximum E value. Therefore, when the E value of R1 is higher than that of R2, the area where R1 and R2 overlapped is overwritten at that of R1.

 
The variable k can be evaluated using the hypergeometric probability, H, of observing at least k common elements between n1 and n2 elements randomly selected from all U elements, where n1 is nsign_diff and n2 is narm_length_begin:

\[ H(U, n_1, n_2, k) = \sum_{i=k}^{\min(n_1, n_2)} \frac{\binom{n_1}{i}\binom{U - n_1}{n_2 - i}}{\binom{U}{n_2}} \tag{2} \]
When the H value is small, the overlap between Csign_diff and Carm_length_begin is considered statistically significant; that is, the overlap is unlikely to have occurred by chance. Thus the evaluation value, E, is defined as follows

\[ E(U, n_1, n_2, k) = -\log_{10} H(U, n_1, n_2, k) \tag{3} \]
For any combination of Csign_diff and Carm_length_begin, if both (begin)-th and (begin + length - 1)-th buckets of Carm_length_begin have the Key-UniGenes that are included in Csign_diff, then their E values were calculated. This calculation was preprocessing for the EIM. Then, in real-time processing, if both Csign_diff and Carm_length_begin met dmin and gapmax, respectively, then the E value was represented in the intersection area Rsign_diff_arm_length_begin as a gray scale. The user can control dmin and gapmax interactively. The area where the multiple Rsign_diff_arm_length_begin values overlapped is overwritten at the maximum E value (Fig. 4B). A flowchart that details these steps is shown in Fig. 5. The EIM for detecting expression imbalance specific to SQs is shown in Fig. 6. In addition, Fig. 7 shows chromosome 3 of the EIM and the influence of gapmax and dmin on the detection of the expression imbalance regions specific to SQs.
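The evaluation of one overlap can be sketched as an upper-tail hypergeometric probability followed by a logarithm. The sketch assumes that E is the negative common logarithm of H; the function names are hypothetical. As a plausibility check, it is applied to the figures quoted for the most significant region on 3q (U = 6,652, n+5 = 205, n3q_1894_5 = 9, k = 6).

```python
import math


def hypergeometric_tail(U, n1, n2, k):
    """Probability of at least k common elements between two random subsets
    of sizes n1 and n2 drawn from U elements."""
    upper = min(n1, n2)
    favourable = sum(
        math.comb(n1, i) * math.comb(U - n1, n2 - i)
        for i in range(k, upper + 1)
    )
    return favourable / math.comb(U, n2)


def evaluation_value(U, n1, n2, k):
    """E = -log10(H); this form is an assumption of the sketch."""
    return -math.log10(hypergeometric_tail(U, n1, n2, k))


# Values reported for intersection area R+5_3q_1894_5.
e_3q = evaluation_value(U=6652, n1=205, n2=9, k=6)
print(round(e_3q, 1))  # approximately 7.2, in line with the reported E value
```

Exact integer binomial coefficients are used so that very small tail probabilities are not lost to floating-point cancellation. For much larger clusters, a log-space computation would be the safer choice.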



Fig. 5. Flowchart for construction of the EIM for detecting expression imbalance regions specific to SQs. This flowchart provides details of the steps of the EIM for detecting expression imbalance regions specific to SQs. For the steps of "Definition of clusters with cancer specificity," please refer to Fig. 3. For the steps of "Definition of clusters with chromosomal proximity," please refer to Fig. 2. For the steps of "Construction of the EIM" and "Visualization of EIM," please refer to Fig. 4. The user can interactively control the steps in real-time processing by changing gapmax and dmin.

 


Fig. 6. The EIM applied for detecting expression imbalance regions specific to SQs. The regions of under- and overexpression in SQs were visualized on the left and right side, respectively, as gray regional signals. All statistical evaluation values of any combinations between the exhaustive uncertainty cluster sets of cancer specificity and chromosomal proximity are visualized on the EIM as the gradation of gray scale simultaneously. Each exhaustive uncertainty cluster set was formed by repetition of the sufficiently minute changes of the threshold of cancer specificity or chromosomal proximity. While the area with high luminance corresponds to the more probable expression imbalance region, the EIM enables the user to search as many genes as possible by referring to more expanded area with lower luminance. The EIM presented the most significant overexpression regions on 3q (the evaluation value E = 7.2), which is a well-known locus with frequent genomic gains, as detected by comparative genomic hybridization (CGH) (6, 8, 9). Note the high resolution of the EIM compared with CGH resolution (~20 Mbp).

 


Fig. 7. Expression imbalance regions specific to SQs on chromosome 3. A-I: chromosome 3 of the EIM and the influence of gapmax and dmin on the detection of the expression imbalance regions specific to SQs. The EIM represents the E values whose Csign_diff and Carm_length_begin meet dmin and gapmax, respectively. The EIM allows the user to control gapmax and dmin interactively. The user can narrow down the possible expression imbalance regions by changing gapmax and dmin. In particular, as shown in A-I, changing gapmax, which allows exclusion of regions containing large gaps between genes, markedly affected the detection of expression imbalance regions. J: magnified view of the encircled region from panel A. Intersection area R+5_3q_1894_5 shows the most significant overexpression region, which is a well-known locus with frequent genomic gains as previously detected by CGH (6, 8, 9). That is, the overlap (k = 6) between C+5 and C3q_1894_5 was statistically the most significant (E = 7.2). C+5 was the cluster of probes with overexpression whose differential level D1(g) was 5 or more, and its number of Key-UniGenes, n+5, was 205. C3q_1894_5 was the region from 189,400 to 189,900 kbp on chromosome 3 and contained 9 Key-UniGenes (n3q_1894_5 = 9). The maximum gap (gap3q_1894_5) between Key-UniGenes in C3q_1894_5 was 146 kbp. In addition, the evaluation values of all combinations between the exhaustive uncertainty cluster sets of cancer specificity and chromosomal proximity are visualized simultaneously on the EIM as a gradation of the gray scale. This gradation pattern conveys the distribution of the false balance to the user through visual perception and enables the detection of as many significant genes as possible. In addition, note the high resolution of the EIM compared with CGH resolution (~20 Mbp).

 
EIM for Detection of Individual Differences in Expression Imbalance Among SQs
It is effective to extract probes with expression profiles specific to the group of cancers using statistical analyses, such as the Mann-Whitney analysis. However, because this type of analysis treats all specimens with the same pathological diagnosis as one group, the variation in a group is unobservable. This is sometimes a significant problem because cancer specimens generally have a great number of variations. Thus we also developed the EIM for detecting individual differences in expression imbalance among SQs.

Clusters of probes with expression imbalance in each SQ.
The first step in the development of the EIM for detecting individual differences in expression imbalance among SQ specimens was to extract, in each SQ specimen independently, the probes with under- or overexpression compared with the NL specimens. Assuming that the expression levels of a certain probe, g, in NL specimens follow a lognormal distribution, if the expression level in an SQ specimen, Si, falls within the outermost 100p% of the NL distribution (both tails combined), its differential level D2 was defined as follows

\[ D_2(g, S_i) = -\log_{10} p \tag{4} \]
Regarding each SQ specimen Si (i = 1, 2,..., 21), the probes whose differential levels D2(g,Si) were equal to or more than diff were defined as the individual-specimen cluster, Csign_diff_Si, where sign is the differential direction (+, overexpression; -, underexpression in each SQ specimen). Csign_diff_Si was defined for both directions of sign, all values of diff, and all specimens Si.


For example, C+2_Si and C-2_Si were clusters of probes whose expression levels in Si fell within the outermost 1% of the NL distribution (both tails combined). More specifically, C+2_Si was a cluster of probes whose expression levels were equal to or higher than (aveNL + 2.58 stddevNL) in a specimen Si, where aveNL is the mean and stddevNL is the standard deviation of the expression level in NL specimens. In the same manner, C-2_Si was a cluster of probes whose expression levels were equal to or less than (aveNL - 2.58 stddevNL); nsign_diff_Si is the number of Key-UniGenes in Csign_diff_Si. If multiple probes in a cluster could be mapped to a single UniGene, then only the probe with the highest D2 value was adopted. The average numbers, n̄sign_diff, of {nsign_diff_Si}(i = 1, 2,..., 21) are shown in Table 3.
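The per-specimen differential level can be sketched with a normal model fitted to the log-transformed NL values. The sketch assumes that D2 is the negative common logarithm of the two-sided tail fraction p, which is consistent with the example above (diff = 2 corresponds to 1%, that is, 2.58 standard deviations). The function name and the use of the sample standard deviation are also assumptions.

```python
import math
from statistics import NormalDist, fmean, stdev


def individual_differential_level(log_value, normal_log_values):
    """Return (sign, D2) for one probe in one cancer specimen.

    log_value: log10 expression of the probe in the SQ specimen.
    normal_log_values: log10 expression of the same probe in the NL specimens.
    """
    ave = fmean(normal_log_values)
    sd = stdev(normal_log_values)  # sample standard deviation (assumption)
    z = (log_value - ave) / sd
    # Two-sided tail fraction of the fitted normal distribution.
    p = 2.0 * NormalDist().cdf(-abs(z))
    sign = "+" if z >= 0 else "-"
    return sign, -math.log10(p)  # assumed form of the differential level


# Hypothetical NL values with mean 2.0 and standard deviation 0.1 (log10 units).
nl = [1.9, 2.0, 2.1]
sign_up, d2_up = individual_differential_level(2.0 + 2.5758 * 0.1, nl)
sign_down, d2_down = individual_differential_level(2.0 - 2.5758 * 0.1, nl)
print(sign_up, round(d2_up, 2), sign_down, round(d2_down, 2))
```

A specimen value lying about 2.58 standard deviations from the NL mean therefore receives a level of about 2 in either direction, which is the boundary of C+2_Si and C-2_Si.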


Table 3. Clusters of probes with under- or overexpression profiles in each squamous cell lung carcinoma

 
Construction of the EIM.
The EIM for detecting individual differences in expression imbalance among SQs was constructed in a manner similar to the EIM for detecting the expression imbalance of the SQ group. The individual-specimen clusters, Csign_diff_Si, were arranged on the abscissa with respect to each Si, and the locus clusters on the ordinate (Fig. 8). Underexpression clusters were arranged on the left side and overexpression clusters on the right. Since the abscissa represented an array of Si, it was impossible to represent diff on the abscissa as in Fig. 4. Therefore, the EIM for individual specimens was visualized by Csign_diff_Si with a defined diff, and allowed the user to change diff interactively.



Fig. 8. Individual-specimen clusters vs. locus clusters. The EIM for detecting individual differences in expression imbalance among SQ specimens was constructed in a manner similar to the EIM for detecting the expression imbalance of the SQ specimen group. In an SQ specimen Si (i = 1, 2,..., 21), probes whose differential level D2(g,Si) was equal to or higher than diff compared with NL specimens were extracted as an individual-specimen cluster, Csign_diff_Si. This extraction was performed independently for each SQ specimen. The individual-specimen clusters, Csign_diff_Si, were arranged on the abscissa with respect to each Si, and the locus clusters, Carm_length_begin, on the ordinate. Among the Csign_diff_Si, the clusters of under- and overexpression were arranged on the left and right side, respectively. Since the abscissa represented an array of Si, it was impossible to represent diff on the abscissa as in Fig. 4. Therefore, the EIM for individual specimens was visualized by Csign_diff_Si with a defined diff, and allowed the user to change diff interactively; n̄sign_diff is the average number of Key-UniGenes in {Csign_diff_Si}(i = 1, 2,..., 21); narm_length_begin is the number of Key-UniGenes in Carm_length_begin; k is the number of common Key-UniGenes between Csign_diff_Si and Carm_length_begin. The significance of the overlap between Csign_diff_Si and Carm_length_begin was visualized in the intersection area Rsign_diff_Si_arm_length_begin as a gray scale.

 
The number of common Key-UniGenes between Csign_diff_Si and Carm_length_begin, k, could also be evaluated using E(U, n1, n2, k) (Eq. 3), where n1 was n̄sign_diff and n2 was narm_length_begin. If different specimens have the same number of genes with under- or overexpression in the same local region, they should be evaluated as similar. Therefore, n̄sign_diff, instead of nsign_diff_Si, was used for the evaluation of the overlap between Csign_diff_Si and Carm_length_begin. The E value for any combination of Csign_diff_Si and Carm_length_begin was calculated when both the (begin)-th and (begin + length - 1)-th buckets of Carm_length_begin had Key-UniGenes that were included in Csign_diff_Si. This calculation was preprocessing for the EIM. Then, in real-time processing, after a certain diff was selected, each E value was represented in the intersection area, Rsign_diff_Si_arm_length_begin, as a gray scale, if Carm_length_begin met gapmax. The user can control diff and gapmax interactively.
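The use of a common, averaged cluster size for all specimens can be sketched as below. The sketch assumes that E is the negative common logarithm of the hypergeometric tail probability, rounds the average cluster size to an integer (the hypergeometric model needs integer sizes), and omits the end-bucket condition described above. All names and the toy gene sets are hypothetical.

```python
import math


def evaluation_value(U, n1, n2, k):
    """-log10 of the probability of at least k common elements (assumed form of E)."""
    favourable = sum(
        math.comb(n1, i) * math.comb(U - n1, n2 - i)
        for i in range(k, min(n1, n2) + 1)
    )
    if favourable == 0:
        return math.inf  # k exceeds what the averaged size allows
    return -math.log10(favourable / math.comb(U, n2))


def individual_e_values(specimen_clusters, locus_cluster, U):
    """Evaluate one locus cluster against every specimen's cluster.

    specimen_clusters: dict mapping a specimen ID to its set of Key-UniGenes.
    locus_cluster: set of Key-UniGenes in one locus cluster.
    """
    sizes = [len(genes) for genes in specimen_clusters.values()]
    # The same averaged size is used for every specimen, so equal overlaps
    # in the same region receive equal E values (rounding is an assumption).
    n_bar = round(sum(sizes) / len(sizes))
    return {
        specimen: evaluation_value(U, n_bar, len(locus_cluster), len(genes & locus_cluster))
        for specimen, genes in specimen_clusters.items()
    }


# Hypothetical specimens and locus cluster, with U = 100 Key-UniGenes.
specimens = {"S1": {"a", "b", "c"}, "S2": {"a", "x", "y"}}
e_values = individual_e_values(specimens, locus_cluster={"a", "b", "c"}, U=100)
print(e_values)
```

Specimen S1 shares all three genes with the locus cluster and therefore scores higher than S2, which shares only one.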

A flowchart that details these steps is shown in Fig. 9. The EIM for detecting individual difference of expression imbalance among SQ specimens is shown in Fig. 10. Figure 11 shows chromosome 3 of the EIM and the influence of gapmax and diff on the detection of the individual differences in expression imbalance among SQs.



 
Fig. 9. Flowchart for construction of the EIM for detecting individual differences in expression imbalance among SQs. This flowchart details the steps of the EIM for detecting individual differences in expression imbalance among SQs. For the step "Definition of clusters with chromosomal proximity," please refer to Fig. 2. For the steps "Construction of the EIM" and "Visualization of EIM," please refer to Fig. 8. In this type of EIM, because the abscissa represented an array of Si, it was impossible to represent diff on the abscissa as in Fig. 4. Therefore, the EIM for individual specimens was visualized using Csign_diff_Si with a defined diff, which the user could change interactively. In addition, regions containing large gaps between genes can be excluded by changing gapmax interactively.

 


 
Fig. 10. The EIM for detecting individual difference of expression imbalance among SQs. The EIM was applied for detecting individual differences of expression imbalance among the SQs. Regions of underexpression and overexpression were visualized on the left and right side, respectively, as gray regional signals. The expression imbalance regions in each SQ were evaluated independently. Note the high resolution of EIM compared with CGH resolution (~20 Mbp).

 


 
Fig. 11. Individual difference of expression imbalance on chromosome 3. A–I: chromosome 3 of the EIM and the influence of gapmax and diff on the detection of individual differences in expression imbalance among SQs. For each SQ specimen, the under- and overexpression regions were visualized on the left and right side, respectively. Because the expression imbalance regions in each SQ were evaluated independently, this type of EIM clarified the individual differences in the overexpression region on 3q, which was detected as the most significant region in the group of SQs by the other type of EIM. The user can narrow down the possible expression imbalance regions by changing gapmax and diff. J: macrograph of the encircled region A from panel A. When gapmax was 1 Mbp and diff was 2, the EIM showed that 17 of 21 SQs had overexpression regions on 3q, which is comparable to other data sets obtained by CGH (6, 8, 9). In addition, note the high resolution of the EIM compared with CGH resolution (~20 Mbp).

 

    RESULTS AND DISCUSSION
 
Detection of Expression Imbalance Specific to SQs
The EIM showed the distribution of expression imbalance specific to SQs (Fig. 6). This distribution is highly comparable to previous CGH data on lung cancer reported by other investigators (6, 8, 9). These CGH data differ significantly from one another because of variation in methods and sample preparation (especially the tumor fraction of clinical samples), so comparing details with individual CGH experiments may be of limited value. However, the most frequent abnormal loci reported in most of these studies, such as loss of 3p, 4q, 5q, and 8p and gain of 1q, 3q, and 12p (6, 8, 9), were also detected by the EIM as regional signal images on chromosomes (expression imbalance regions). The major difference from the CGH image is that signals are detected in a more confined area, which reflects the high resolution of the EIM. Figures 6, 7, 10, and 11 clearly show the high resolution of the EIM compared with the CGH image. In particular, the intersection area R+5_3q_1894_5 showed the most significant overexpression region on 3q (Fig. 7), which is reported to be the most frequent aberration in SQs by CGH (6, 8, 9). That is, the overlap (k = 6) between C+5 (the cluster of probes with overexpression whose differential level D1(g) is more than 5: n+5 = 205) and C3q_1894_5 (the region from 189,400 to 189,900 kbp on chromosome 3: n3q_1894_5 = 9, gap3q_1894_5 = 146 kbp) was statistically the most significant (E = 7.2). Here, the overlap was evaluated using the hypergeometric probability of observing at least 6 (=k) common elements between 205 (=n+5) and 9 (=n3q_1894_5) elements selected at random from among 6,652 (=U) elements. The user can narrow down the possible expression imbalance regions by changing gapmax and dmin interactively. In particular, as shown in Fig. 7, A–I, changing gapmax, which allows exclusion of regions containing large gaps between genes, markedly influenced the detection of expression imbalance regions.
In addition, the evaluation values of all combinations between the exhaustive uncertainty cluster sets of cancer specificity and chromosomal proximity are visualized simultaneously on the EIM as a gray-scale gradation, which is clearly shown in Fig. 7J. This gradation pattern conveys the distribution of the false balance to the user through visual perception and enables the detection of as many significant genes as possible.
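As a worked check of the 3q example, the following Python sketch computes the hypergeometric upper-tail probability for the reported values (U = 6,652, n+5 = 205, n3q_1894_5 = 9, k = 6). The definition of E (Eq. 3) does not appear in this section, so the sketch assumes that E is the negative base-10 logarithm of that probability; the function name `hypergeom_tail` is hypothetical.

```python
from math import comb, log10


def hypergeom_tail(U, n1, n2, k):
    """Probability of at least k common elements between two random subsets.

    One subset of size n1 is treated as fixed; the other, of size n2, is
    drawn at random without replacement from the U elements.
    """
    upper = min(n1, n2)
    favorable = sum(comb(n1, x) * comb(U - n1, n2 - x) for x in range(k, upper + 1))
    return favorable / comb(U, n2)


# Values reported in the text for the intersection area R+5_3q_1894_5.
U, n1, n2, k = 6652, 205, 9, 6

p = hypergeom_tail(U, n1, n2, k)
E = -log10(p)  # assumption: E is the negative log10 of the tail probability
print(f"P(X >= {k}) = {p:.3g}")
print(f"-log10(P)  = {E:.2f}")
```

Under this assumption the result rounds to 7.2, in line with the E value reported for this region, which suggests (but does not confirm) that Eq. 3 has this form.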

Table 4 shows the gene list of C3q_1894_5. Although this overexpression region strongly reflected the known genomic gain detected by CGH, several probes without overexpression were also detected in this region. There may be several reasons for this. First, because several low-quality probes were possibly included in this region, signal intensity does not always reflect the expression level of the target mRNA. Improving probe quality would make it possible to detect the overexpression region more clearly. Second, mRNA expression levels would not completely reflect genomic copy number changes caused by chromosomal gain or loss, although there is a strong correlation between them, because the genes are under various transcriptional controls, including feedback pathways of the lost or gained genes themselves. Mukasa et al. (7) also reported that several genes without reduced expression were detected in the 1p LOH region of oligodendrogliomas. In addition, it should be noted that the cancer tissues used here contained a significant number of noncancerous stromal or inflammatory cells, which add noise to the cancer expression profiles.


 
Table 4. Gene list of the overexpression region on 3q detected by the EIM

 
Because of the complex factors discussed above, simple spatial mapping of microarray expression profiles on chromosomal location gives little information about genomic structure (Fig. 12, left). In addition, it is very difficult to define adequate thresholds for cancer specificity and chromosomal proximity, because the distribution of "false balance" is unclear and the risk of overlooking significant genes through arbitrary selection of thresholds is high (i.e., the "threshold problem"). However, the EIM, which uses a new methodology without arbitrary selection of thresholds in conjunction with a hypergeometric distribution-based algorithm, has a high tolerance of these complex factors and controls the risk of overlooking expression imbalance regions. This advantage of the EIM over simple spatial mapping is clearly shown in Fig. 12. The EIM detected the underexpression regions, A and B, and the overexpression region, C, on chromosome 11, which are known loci with frequent genomic gain or genomic loss (6, 8, 9), although they were difficult to detect from the simple spatial mapping of the D1 value.
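The idea of avoiding a single arbitrary threshold can be sketched as an exhaustive scan: every expression threshold is combined with every chromosomal window, and each combination receives its own evaluation value. The Python toy below is only an illustration under stated assumptions. Genes are assigned to integer buckets along one chromosome arm, a window is a run of consecutive buckets, E is assumed to be the negative log10 of the hypergeometric tail probability, and all names and data (`scan`, `diff_by_gene`, `bucket_by_gene`) are hypothetical.

```python
from math import comb, log10


def e_value(U, n1, n2, k):
    """Assumed form of E: -log10 of P(overlap >= k) under random selection."""
    upper = min(n1, n2)
    hits = sum(comb(n1, x) * comb(U - n1, n2 - x) for x in range(k, upper + 1))
    p = hits / comb(U, n2)
    return -log10(p) if p > 0 else float("inf")


def scan(diff_by_gene, bucket_by_gene, thresholds, max_length):
    """Return {(threshold, begin, length): E} for every threshold and window."""
    U = len(diff_by_gene)
    buckets = sorted(set(bucket_by_gene.values()))
    grid = {}
    for d in thresholds:
        # Expression cluster: genes whose differential level is at least d.
        expr = {g for g, v in diff_by_gene.items() if v >= d}
        for begin in buckets:
            for length in range(1, max_length + 1):
                # Locus cluster: genes in buckets begin .. begin + length - 1.
                locus = {g for g, b in bucket_by_gene.items()
                         if begin <= b < begin + length}
                k = len(expr & locus)
                if k:
                    grid[(d, begin, length)] = e_value(U, len(expr), len(locus), k)
    return grid


# Toy data: 40 genes, one per bucket; genes 10..14 are overexpressed.
bucket_by_gene = {g: g for g in range(40)}
diff_by_gene = {g: 0 for g in range(40)}
diff_by_gene.update({10: 3, 11: 6, 12: 6, 13: 6, 14: 3})

grid = scan(diff_by_gene, bucket_by_gene, thresholds=[2, 5], max_length=5)
best = max(grid, key=grid.get)
print(best, round(grid[best], 2))
```

In this toy case a strict threshold of 5 would recover only three of the five overexpressed genes, whereas scanning both thresholds lets the five-bucket window at the lower threshold emerge as the most significant combination.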



 
Fig. 12. Advantages of the EIM over the simple spatial mapping of expression profiles. Left: a simple spatial mapping of the D1 value, which was calculated from the expression profiles of SQs, on chromosome 11. Right: the EIM of the same region. The EIM allowed detection of the underexpression regions, A and B, and the overexpression region, C, on chromosome 11, which are known loci with genomic gain or genomic loss (6, 8, 9), although they are difficult to detect by simple spatial mapping.

 
Detection of Individual Difference in Expression Imbalance Among SQ Specimens
Analysis that extracts probes with expression profiles specific to a group of cancers is very effective and popular. However, this type of analysis sometimes raises a critical problem, because individual differences within a group are unobservable. In this context, the ability of the EIM to detect individual differences in expression imbalance within a group is very significant. Figure 11, A–I, shows that the user can narrow down the possible expression imbalance regions on chromosome 3 by changing gapmax and diff interactively. Furthermore, Fig. 11J shows the individual differences in the most significant overexpression regions on 3q (gapmax = 1 Mbp, diff = 2), where 17 of 21 SQs had overexpression regions, a finding comparable with other data sets analyzed by CGH (6, 8, 9).

The high-resolution spatial map of expression profiles described in this report, i.e., the EIM, has several significant advantages. Its validity is clearly shown by the fact that many known loci with highly frequent genomic losses or gains were detected by this method as regional signals at high resolution.

Recently, several studies on microarray-based CGH for detecting genome-wide copy number changes have been reported (10). However, to our knowledge, no spatial mapping data with such validity and genome-wide coverage have previously been reported from the array-CGH method. The experimental difficulty of genome hybridization and the limited number of probes on CGH arrays could be major obstacles. There may be several reasons for the success of our alternative approach, the inference of genomic structure from expression profiles. The first reason is the use of the Affymetrix-type GeneChip. The large number of probes available (12,533) enables detection of relatively short abnormal regions (chromosomal loss can frequently affect areas as short as a few hundred kbp), although this method can easily be applied to other types of microarrays. The second and most important reason is that the EIM is a visualization method that uses a new methodology without arbitrary selection of thresholds, in conjunction with a hypergeometric distribution-based algorithm. By processing the complex factors and the threshold problems that hinder the user’s visual perception of essential information, the EIM presents the user with a comprehensive visual image of genome-wide information, clearly indicating where expression imbalance regions are and which genes are to be examined. It has an obvious advantage over simple spatial mapping of the expression profiles. For further curation, clicking a selected expression imbalance region on the EIM image leads directly to a file that contains the gene names of the region, their expression scores, and other biological information. In addition, if the user inputs the UniGene number of a gene of interest, the EIM indicates its position on the chromosome. Therefore, the EIM can be a broadband interface that enables the user’s visual perception of complex data and further curation.

Using the EIM, we might be able to detect regional under- or overexpression that is independent of copy number changes, such as gene silencing by methylation and/or imprinting abnormalities (11). In addition, by using the Kruskal-Wallis test (4), a rank sum test that handles three or more data groups, instead of the Mann-Whitney test, the EIM can easily be extended to multiple phenotypes.
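For readers unfamiliar with the Kruskal-Wallis test, the following Python sketch computes its H statistic for one probe measured in three groups. It is a generic textbook implementation using mid-ranks and the standard tie correction, not the authors' code; how the statistic would be integrated into the EIM is not specified here, and the function name and expression values are hypothetical.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with mid-ranks and tie correction."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                              # size of this group of ties
        rank_of[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j

    rank_sums_sq = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sums_sq - 3 * (n + 1)

    correction = 1 - tie_term / (n**3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


# Hypothetical expression values of one probe in three phenotype groups.
normal = [1.0, 2.0, 3.0]
phenotype_a = [4.0, 5.0, 6.0]
phenotype_b = [7.0, 8.0, 9.0]
print(kruskal_wallis_h(normal, phenotype_a, phenotype_b))
```

For large enough groups, H is usually compared against a chi-square distribution with (number of groups - 1) degrees of freedom to obtain a P value.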

In conjunction with the microdissection technique, which can isolate tumor-cell-specific RNA (2), the EIM could detect potential genomic structural changes more precisely, which may offer greater diagnostic and therapeutic impact.

Conclusion
In this report, we describe the development of the expression imbalance map, or EIM, a visualization method for detecting expression imbalance regions that avoids arbitrary selection of thresholds and is used in conjunction with a hypergeometric distribution-based algorithm. With this method, many known as well as potential loci with highly frequent genomic losses or gains were detected as regional signals at much higher resolution than with conventional methods, such as CGH. The EIM can be a broadband interface that enables the user’s visual perception of complex data and further curation, and it has an obvious advantage over simple spatial mapping of expression profiles on chromosomal location. Therefore, the EIM should provide the user with further insight into genomic structure through mRNA expression.


    ACKNOWLEDGMENTS
 
This work was supported by Grant-in-Aid for Scientific Research on Priority Areas (C) "Genome Information Science" from the Ministry of Education, Culture, Sports, Science and Technology of Japan.


    FOOTNOTES
 
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

Address for reprint requests and other correspondence: M. Kano, Tokyo Research Laboratory, IBM Japan, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan (E-mail: mkano{at}jp.ibm.com).

10.1152/physiolgenomics.00116.2002.


    References
 

  1. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, and Meyerson M. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 98: 13790–13795, 2001.
  2. Bonner RF, Emmert-Buck M, Cole K, Pohida T, Chuaqui R, Goldstein S, and Liotta LA. Laser capture microdissection: molecular analysis of tissue. Science 278: 1481–1483, 1997.
  3. Fujii T, Dracheva T, Player A, Chacko S, Clifford R, Strausberg LS, Buetow K, Azumi N, Travis WD, and Jen J. A preliminary transcriptome map of non-small cell lung cancer. Cancer Res 62: 3340–3346, 2002.
  4. Hayter AJ. Probability and Statistics for Engineers and Scientists (2nd ed.). Florence, KY: Duxbury Press, 2002.
  5. Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, and Pinkel D. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 258: 818–821, 1992.
  6. Lu YJ, Dong XY, Shipley J, Zhang RG, and Cheng SJ. Chromosome 3 imbalances are the most frequent aberration found in non-small cell lung carcinoma. Lung Cancer 23: 61–66, 1999.
  7. Mukasa A, Ueki K, Matsumoto S, Tsutsumi S, Nishikawa R, Fujimaki T, Asai A, Kirino T, and Aburatani H. Distinction in gene expression profiles of oligodendrogliomas with and without allelic loss of 1p. Oncogene 21: 3961–3968, 2002.
  8. Pei J, Balsara BR, Li W, Litwin S, Gabrielson E, Feder M, Jen J, and Testa JR. Genomic imbalances in human lung adenocarcinomas and squamous cell carcinomas. Genes Chromosomes Cancer 31: 282–287, 2001.
  9. Petersen S, Aninat-Meyer M, Schluns K, Gellert K, Dietel M, and Petersen I. Chromosomal alterations in the clonal evolution to the metastatic stage of squamous cell carcinomas of the lung. Br J Cancer 82: 65–73, 2000.
  10. Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams CF, Jeffrey SS, Botstein D, and Brown PO. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat Genet 23: 41–46, 1999.
  11. Reik W and Walter J. Imprinting mechanisms in mammals. Curr Opin Genet Dev 8: 154–164, 1998.
  12. Virtaneva K, Wright FA, Tanner SM, Yuan B, Lemon WJ, Caligiuri MA, Bloomfield CD, de La Chapelle A, and Krahe R. Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetics. Proc Natl Acad Sci USA 98: 1124–1129, 2001.