RealSpot: software validating results from DNA microarray data analysis with spot images
Zhongming Chen and
Lin Liu
Department of Physiological Sciences, Oklahoma State University, Stillwater, Oklahoma
ABSTRACT
The spot images from DNA microarrays strongly affect the discovery of biological knowledge from gene expression data. However, results from quality analysis, normalization, differential expression, and cluster analysis are rarely validated with spot images in current data analysis methods or software packages. We designed RealSpot, a software package, to validate the results by directly associating spot quality and data with spot images in a spreadsheet table. RealSpot splits hybridization images into individual spots stored in a spreadsheet table. It subsequently associates microarray data with spot images and performs data validation through standard table operations such as sorting, searching, and editing. RealSpot has several built-in functions to facilitate data validation, including spot quality analysis, data organization, one-way ANOVA, gene ontology association, verification, import, and export. We used RealSpot to evaluate 77 slides (30,000 features each) from real hybridization experiments and to validate results from each step of data analysis. It took ~10 min to validate results of spot quality after initial evaluation and correct ~0.3% of falsely assigned qualities of 10,000 spots. We validated 1,641 of 2,110 differentially expressed genes identified by SAM analysis in ~1/2 h by comparing each gene with its respective spot image. Furthermore, we found that 6 of 48 genes in one cluster from the k-means clustering method showed inconsistent trends of spot images. RealSpot is efficient for validating microarray results and thus helpful for improving the reliability of the whole microarray experiment for experimentalists.
spot quality; data normalization; data filtering
INTRODUCTION
DNA MICROARRAY IS A HIGH-THROUGHPUT technique for the investigation of mRNA abundance (5). Gene probes on one slide are hybridized simultaneously with cDNA or cRNA samples labeled with fluorescence dyes, yielding images containing thousands of spots. The mRNA abundance is obtained by extracting quantification information from all spots on the images ("raw data"). According to the Minimum Information About a Microarray Experiment (MIAME) guidelines, a microarray experiment is organized as gene probes, print layout, samples, hybridization images, raw data, and data normalization (2).
Raw data from the hybridization images are the fundamental information for further data analysis, including data normalization (8), statistical inference (10), cluster analysis (4), principal component analysis (PCA) (9), pathway construction (3, 11), and data interpretation. The results of a microarray experiment depend on the quality of hybridization images and the respective raw data sets. However, results from each analysis step are rarely validated with spot images in current data analysis methods or software packages. A typical microarray data analysis procedure is as follows: extracting quantification data from images, filtering data with chosen standards (e.g., background-to-noise ratio >2), normalizing log2 ratios, identifying differentially expressed genes (e.g., 2-fold change of expression or statistical inference with P = 0.05), clustering genes (e.g., hierarchical or k-means clustering), and selecting target genes for further study. Microarray data analyses face the challenge of diverse methods, standards, and software packages. An example is spot quality evaluation, discussed below.
The image quality varies from spot to spot due to printing, sample quality, and hybridization. Ideally, all the spots with poor quality should be filtered before further data analysis. There are two main approaches for filtering poor-quality spots: manually flagging spots and automatically filtering genes (1, 6, 12, 13). Manually flagging spots is time consuming because of the large number of spots in a microarray slide. Automatic methods based on quantification information are fast and efficient. These methods calculate composite scores from spot size, intensity, signal-to-background ratio (SBR), and/or circularity coefficient (area-to-perimeter ratio). Generally, the composite scores represent several aspects of spot images. These methods may fail at spots with irregular morphology, e.g., donuts, black holes ("ghost images"), and spots containing tiny dust specks. Although they are suitable for large-scale data analysis, different methods or parameters frequently generate nontrivially different results from an identical raw data set. In such a case, validating results and choosing an appropriate method is not straightforward. The association of spot images with derived data may identify irregular spots. Some software packages such as Acuity and Longhorn/Stanford Microarray Database (LMD/SMD) can locate each spot image on a single slide (7). Acuity locates a spot image on a scanned slide image using GenePix results, whereas LMD can show a spot image as well as other retrieved data in a data query report. In both cases, only a single spot image can be retrieved to validate the derived data from one slide.
Here, we report a software package, RealSpot, for validating results from dual-color DNA microarray hybridizations. RealSpot evaluates spot quality and validates it with spot images. RealSpot links the images and raw data of each spot side by side and organizes them in a spreadsheet table. Through standard table operations such as sorting, searching, and editing, a user can directly compare the spot images, raw data, and processed data in an efficient and reliable way. Furthermore, RealSpot provides tools for one-way ANOVA, gene ontology, and web page export, which are also helpful in choosing target genes for further study and thus improving the reliability of the whole microarray experiment. It is freely available for academic use and can be obtained at http://www.lungmicroarray.org or via an electronic mail request (liulin{at}okstate.edu).
IMPLEMENTATION
General workflow.
The basic workflow of RealSpot is composed of five modules: data import, quality evaluation, data organization, data verification, and data export (Fig. 1). The data import module converts raw data, images, and sample information into a spreadsheet table. The quality evaluation module evaluates data quality using spot images and raw data from each slide and stores the result as a binary slide file. The data organization module organizes slide files as an experiment, visualizes data in a metatable based on the sample information, performs one-way ANOVA, and associates gene ontology information (http://www.geneontology.org) with the gene list in a microarray experiment. The data verification module validates the results by searching for a group of genes obtained from downstream data analysis, e.g., a k-means cluster, and then comparing them with raw data and spot images. The data export module exports spot images, quality index, and normalized data as images, web pages, and text files for data presentation, internet data communication, and further analysis. Each component is implemented as a module to guide a user through the respective operations step by step. The following is a detailed description of these components.

Fig. 1. Overview of RealSpot. There are 5 components shown as gray boxes in RealSpot. Import module imports raw data file and TIFF image files into RealSpot. Quality module evaluates spot images and data and saves them as a slide file. Data sort function facilitates the quality evaluation. Organizer module organizes multiple slides in a metatable. Verification module confirms the final results with spot images and raw data by searching selected genes. Export module exports spot images or data quality and gene expression data of selected genes. Data normalizer normalizes raw data and draws a scatter plot. The solid text boxes are the functional modules in RealSpot, the rectangle and oval white text boxes are sample and data information and the files used by RealSpot, respectively. Arrows indicate the workflows of data or module calling.
Data import.
An import module imports raw data and image files and collects sample information for the respective slide. The raw data files, such as GenePix result (GPR) files, are tab-delimited flat text files containing gene identification (ID), gene name, geometry of subarray grids (block or metarow and metacolumn), and spots in each subarray (row and column). The GPR file also holds spot geometry (x, y, and diameter) for image splitting, and gene expression data (fluorescence intensity, background, and ratio). The image files are 16-bit tagged image file format (TIFF) files (16-bit image pixel: 0–65,535) directly generated by a scanner, e.g., ScanArray Express. During data import, each 16-bit TIFF image is split into spot images using the spot geometry x, y, and diameter from the respective raw data file. The spot images are then linearly transformed into 8 bit for visualization. The linear transformation is based on the whole image or individual subgrid. By default, the lowest 5% of image data are converted to 0, the highest 5% to 255, and the rest to values between 0 and 255, calculated from
$$f_8 = 255 \times \frac{F_{16} - P_5}{P_{95} - P_5}$$
where f8 is the intensity of a transformed 8-bit image pixel (0–255 for image visualization), F16 is the original 16-bit fluorescence intensity of each pixel, and P5 and P95 are the 16-bit intensities at the 5th and 95th percentiles from a slide image or a subgrid, respectively.
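The percentile-based transformation described above can be sketched in a few lines of Python. This is a minimal illustration rather than RealSpot's own code: the function names `percentile` and `to_8bit` are hypothetical, and the interpolation used to locate the 5th and 95th percentiles is an assumption, since the text does not say how they are computed.

```python
def percentile(sorted_vals, q):
    """Percentile (q in 0..100) of pre-sorted values, with linear interpolation.

    The interpolation rule is an assumption made for this sketch.
    """
    if not sorted_vals:
        raise ValueError("empty input")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def to_8bit(pixels16, low_q=5.0, high_q=95.0):
    """Linearly map 16-bit intensities to 0..255.

    Values at or below the low percentile become 0 and values at or above
    the high percentile become 255; everything else is stretched linearly.
    """
    ordered = sorted(pixels16)
    p_low = percentile(ordered, low_q)
    p_high = percentile(ordered, high_q)
    if p_high <= p_low:
        return [0 for _ in pixels16]  # flat image: nothing to stretch
    out = []
    for f16 in pixels16:
        scaled = 255.0 * (f16 - p_low) / (p_high - p_low)
        out.append(int(round(min(255.0, max(0.0, scaled)))))
    return out


if __name__ == "__main__":
    # Pixels from one slide image or one subgrid, flattened into a list.
    print(to_8bit([120, 300, 950, 4000, 20000, 65535]))
```

Passing the pixels of a single subgrid instead of the whole slide gives the per-subgrid variant mentioned in the text.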
During data import, a user is also asked to import sample information, including the sample names and the respective dye channels (green or red). In a RealSpot spreadsheet, each row represents a gene, and the columns contain gene probe information (gene ID and name), array print layout information such as block and subarray, spot images, and raw data such as fluorescence intensity and background (Fig. 2). Additional columns are added to the table, e.g., quality index, 16-bit spot signal, and SBR calculated directly from 16-bit TIFF images (see below).

Fig. 2. Data visualized in a metatable. Tables from 4 slides are grouped as a metatable. Each row represents the information of a gene. Columns, from left to right, are Gene Ontology (GO), gene ID, gene name, signal summary, quality index (QI) summary, spot images of each sample from each slide, and data column (P value) for sorting the table. Yellow frames on spot images highlight current sample (brain) and table (slide 3). The error bars of signal summary and QI summary are standard deviations. Significantly differentially expressed genes are highlighted with thick lines (P value < 0.05).
Quality evaluation.
The main purpose of RealSpot is the quality evaluation of spot images and raw data, and the verification of the final data analysis results with the spot images. RealSpot first evaluates spot quality based on the signal intensity and SBR, and assigns a quality index (QI) to each spot. QIs 0–4 indicate empty, weak, middle, strong, and saturated spots, respectively (Table 1). By default, QIs 0 and 4 are assigned to the empty and saturated spots, the intensities of which are <30% and >95%, respectively. QIs 1–3 are calculated on the basis of the intensity of 16-bit spot signals as
where QIij is the QI of spot j on slide i, and Iij is the intensity of spot j on slide i. By default, I0 is the intensity at the 30th percentile and I1 is the intensity at the 95th percentile of the plot (intensity vs. gene rank percentage) of the slide image. These settings may be adjusted based on visual estimation (Fig. 3). A QI of 5 is assigned to a contaminated or bad spot based on SBR. By default, any spot with an SBR of <2.0 is given a QI of 5. QI is visualized as an icon (Table 1). QIs 0–4 are shown as columns: the shorter the height, the weaker the intensity. A QI of 5 is shown as a prohibiting cross.
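To make the default rules concrete, the sketch below assigns a QI to one spot from its intensity and SBR. It is an illustration under stated assumptions, not RealSpot's implementation: the function name `assign_qi` is hypothetical, the equal-width split of the range between I0 and I1 into QIs 1–3 is an assumed stand-in for the formula used by the software, and checking SBR before intensity is likewise an assumption.

```python
def assign_qi(intensity, sbr, i0, i1, sbr_min=2.0):
    """Return a quality index (0..5) for a single spot.

    i0, i1  : slide-level intensity cutoffs (30th and 95th percentiles
              by default, as described in the text).
    sbr_min : SBR threshold below which a spot is treated as bad.

    ASSUMPTIONS of this sketch: the SBR test takes precedence, and
    QIs 1..3 come from three equal-width bins between i0 and i1.
    """
    if sbr < sbr_min:
        return 5  # contaminated or bad spot
    if intensity < i0:
        return 0  # empty spot
    if intensity > i1:
        return 4  # saturated spot
    if i1 <= i0:
        return 1  # degenerate cutoffs: fall back to the weakest class
    frac = (intensity - i0) / (i1 - i0)
    return min(3, 1 + int(3 * frac))


if __name__ == "__main__":
    i0, i1 = 200.0, 1000.0  # hypothetical cutoffs for one slide
    for intensity, sbr in [(100, 3.0), (250, 3.0), (600, 3.0), (5000, 3.0), (600, 1.2)]:
        print(intensity, sbr, assign_qi(intensity, sbr, i0, i1))
```

Because I0, I1, and the SBR threshold are parameters, the visual adjustment described for Fig. 3 amounts to calling the function again with different cutoffs.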

Fig. 3. A plot for defining QI. Chart was plotted as signal fluorescence intensity vs. gene rank percentage (1–99%). By default, QIs of 0 and 4 were assigned to the genes whose intensities were <30% and >95%, respectively. These percentages can be adjusted based on the pseudocolor spots in the top grid. QIs 1–3 are calculated as described (see IMPLEMENTATION). A QI of 5 for bad spots was determined by signal-to-background ratio (<2 by default). Check box: "flags" from upstream image quantification (e.g., GPR file from GenePix) is an optional column for QI evaluation.
A small portion of spots (<0.3%, typically 100 of 30,000 features) may be falsely assigned a QI by the initial evaluation, based on signal intensity and SBR, and thus a manual correction is required (see RESULTS). RealSpot facilitates the manual correction in three ways. First, as an option, a user may choose an additional data column from upstream software, e.g., the "flags" column from GPR files. If such a column is provided, RealSpot compares QI with this column. If a spot is marked as "bad" (SBR <2.0) by RealSpot but as "good" by GenePix (flag = 100), RealSpot ignores the SBR criterion and reassigns a QI. Second, some good spots with a very large diameter may be found to have a QI of 5. Such falsely flagged spots frequently have a much larger diameter than an average spot of a whole slide. RealSpot marks these spots with a question mark "?" to prompt a user to manually check and correct them. These spots are assigned a temporary QI (=21) to distinguish them from other spots. This QI is either replaced by a user-corrected QI or restored to 5, after reopening the file, if a user ignores these spots. Third, RealSpot offers tools to facilitate the manual correction by sorting similar-quality spots together, using QI, spot diameter, signal intensity, or spot image, and correcting them at once. Some irregular and contaminated spots may be assigned a QI of 1–4. A user may select these spots and assign a correct QI to them using an annotation tool. RealSpot also offers another tool to identify the spots with a similar shape by comparing the images of a selected spot (e.g., a contaminated spot) and all other spots in this channel, and calculating an image similarity (IS) for each spot based on the Pearson correlation coefficient
where ISf is the image similarity of a spot image F, which has n pixels marked as Fi (i = 1–n), to the selected spot C, marked as Ci (i = 1–n). The summations are based on index i and calculated from i = 1 to n. The IS value ranges from 1 (identical spots) to 0 (entirely different spots). RealSpot then sorts the spots by IS, so that spots with similar images are arranged together. After the sorting, the selected spot moves to the top row of the table, followed by other similar spots. A user may manually check these similar spots and correct QI accordingly, since these spots have similar morphology.
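The similarity sort can be illustrated with a short Python sketch that scores every spot against a selected spot and orders the table by that score. The names `image_similarity` and `rank_by_similarity` are hypothetical. The score is the standard Pearson correlation over pixel intensities; because the text gives an IS range of 0 to 1, negative correlations are floored at 0 here, which is an assumption of this sketch rather than a documented detail of RealSpot.

```python
import math


def image_similarity(f, c):
    """Pearson correlation between two equal-size pixel lists, floored at 0."""
    if len(f) != len(c) or not f:
        raise ValueError("spot images must be non-empty and the same size")
    mean_f = sum(f) / len(f)
    mean_c = sum(c) / len(c)
    cov = sum((a - mean_f) * (b - mean_c) for a, b in zip(f, c))
    var_f = sum((a - mean_f) ** 2 for a in f)
    var_c = sum((b - mean_c) ** 2 for b in c)
    if var_f == 0 or var_c == 0:
        return 0.0  # a perfectly flat image has no shape to correlate
    # ASSUMPTION: anti-correlated images are reported as 0 (entirely different).
    return max(0.0, cov / math.sqrt(var_f * var_c))


def rank_by_similarity(selected, spots):
    """Return (spot_id, IS) pairs sorted so the most similar spots come first."""
    scored = [(spot_id, image_similarity(pixels, selected))
              for spot_id, pixels in spots.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    selected = [0, 10, 10, 0]            # flattened pixels of the selected spot
    spots = {
        "same": [0, 10, 10, 0],
        "inverse": [10, 0, 0, 10],
        "close": [0, 20, 18, 1],
    }
    for spot_id, score in rank_by_similarity(selected, spots):
        print(spot_id, round(score, 3))
```

Because the selected spot correlates perfectly with itself, it ends up in the first row after sorting, which matches the behavior described above.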
Data organization.
When there is more than one hybridization or slide in one DNA microarray experiment, RealSpot organizes the evaluated slides as an experiment for calculating QI summary and spot signal summary, performing one-way ANOVA, associating gene ontology information with each gene, and retrieving data to verify results. RealSpot uses sample information of each slide and aligns slides by sample names. A user can import multiple slide files into a metatable at the same time by selecting the file names. A column for summarizing QI of each spot from multiple slides is added into the metatable. This column is calculated as follows: the contaminated or bad spots are first removed; the mean and SD of the QIs are calculated from spots with a QI of 0, 1, 2, 3, and 4; the mean QI is rounded to an integer and shown as an icon, as shown in Table 1; and the SD is shown as an error bar (Fig. 2). A column of spot signal summary from multiple slides is calculated as follows. The 16-bit signal intensities of each channel are scaled to an arbitrary range (1–1,000 in RealSpot). The data scaling serves as a global normalization, so that the gene expression data from different slides are comparable. The data scaling is based on an assumption that the lowest 5% of genes are not expressed (i.e., negative spots) and are converted to 1, and the highest 5% of spots are highly expressed (i.e., saturated spots, typically from housekeeping genes) and are converted to 1,000. The rest of the spot signals are linearly scaled to 1–1,000. The mean and SD of scaled spot signals for each sample group are calculated and visualized in the column as bar plots. A one-way ANOVA is performed, based on the above globally normalized signals, if three or more slides are used in an experiment. Before ANOVA, normalized signals are logarithm transformed, which can improve the homogeneity of standard deviations among sample groups (a prerequisite of ANOVA). A P value of each gene is obtained from ANOVA. RealSpot highlights the bar plots of significant genes (P value < significance level, default = 0.05) with thick lines. This indicates that there is a significant difference of gene expression for at least two samples. RealSpot also accepts gene ontology association files [tab-delimited text files with columns of Gene Ontology (GO) ID, gene symbol, gene ID, GO term, and GO part]. If a GO file is read, a column of ontology is added to display the functions of known genes (Fig. 2). In the metatable, some columns are the same for all the slides, such as gene ID and name, printing layout information, and summary QI. Other columns, such as QI and spot images, are specific for each slide and are organized as a subtable within the respective row of each gene. A sorting column is used for showing information for sorting, e.g., the P value column in Fig. 2.
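The scaling and ANOVA steps can be sketched as follows. This is a hedged illustration, not the program's source: `percentile`, `scale_signals`, and `anova_f` are hypothetical names, the percentile interpolation and the use of log10 are assumptions, and only the F statistic is computed. Turning F into a P value requires the F distribution, which a statistics library would normally supply (for example `scipy.stats.f_oneway`).

```python
import math


def percentile(sorted_vals, q):
    """Percentile (q in 0..100) of pre-sorted values; interpolation is an assumption."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def scale_signals(signals, low=1.0, high=1000.0):
    """Scale one channel so the 5th percentile maps to `low` and the 95th to `high`."""
    ordered = sorted(signals)
    p5, p95 = percentile(ordered, 5), percentile(ordered, 95)
    if p95 <= p5:
        return [low for _ in signals]
    scaled = []
    for s in signals:
        t = min(1.0, max(0.0, (s - p5) / (p95 - p5)))  # clip the two tails
        scaled.append(low + t * (high - low))
    return scaled


def anova_f(groups):
    """One-way ANOVA F statistic for one gene; each group holds one sample's values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and some replication")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


if __name__ == "__main__":
    # Scaled signals of one gene in three sample groups (made-up numbers).
    gene = {"lung": [820.0, 790.0, 845.0], "heart": [95.0, 110.0, 102.0], "brain": [98.0, 120.0, 101.0]}
    logged = [[math.log10(v) for v in values] for values in gene.values()]  # log base assumed
    print("F =", round(anova_f(logged), 2))
```

Scaling each slide separately before pooling is what makes signals from different slides comparable in this scheme; the log transform is applied only afterwards, just before the test.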
Data verification.
The data verification module directly compares DNA microarray data with spot images, providing an additional step for quality control and, more importantly, a method to validate data analysis results. After data quality evaluation, the filtered data set can be used for further data analysis such as cluster analysis or identification of differentially expressed genes. From the downstream analysis, a researcher may obtain a list of genes and associated data, such as the genes with similar normalized ratios, differential expression, or expression patterns across a series of conditions. Before further analysis or functional studies, a user may compare the final genes with the respective spot images by clicking the "search genelist" button in Fig. 2 to search and group these genes and, optionally, the associated data. RealSpot shows all the found genes at the top of the metatable. It is relatively easy for a human being to pick out a few distinct spots among other spots showing a similar pattern. Consequently, search by genelist in RealSpot is efficient for identifying inconsistencies between data analysis results and spot images. The inconsistent genes may be eliminated from further analysis, or an alternative method may be chosen to analyze the same data set to see whether consistent results are achieved.
Data export.
Spot images and raw data as well as QI can be exported by selecting genes or items of interest. An export module guides a user through exporting the respective information. For image export, the spot images of selected genes are exported as a Windows Enhanced Metafile (WMF), bitmap file (BMP), and web pages (hypertext markup language files; HTML). The WMF file is the default format. It contains the instructions for drawing the text and spot images and has a very high resolution. It is best for printing high-quality images. The file formats can be read by most image and word processing software packages. RealSpot also exports the summary QI and gene expression ratio of two samples. RealSpot can export hundreds or thousands of genes as an HTML file. A user may directly post it on the internet for data communication among internal lab members or external DNA microarray communities. Before data export, RealSpot provides tools for data normalization and scatter plotting. A user may select two samples and filter spots by QI or directly select spots from the metatable. The scatter plot visualizes the global distribution of the signal intensity of the selected genes in two samples. Global or intensity-dependent normalization methods are provided. LOWESS normalization (15) based on print tip is the default. A user may select an appropriate normalization method based on the scatter plot. RealSpot exports the gene ID and name, QI, and normalized expression ratio of selected genes and samples as a tab-delimited text file.
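Of the two normalization families, the global one is simple enough to show in a few lines. The sketch below takes global normalization to mean shifting the log2 ratios of a slide so that their median becomes 0, which is how it is applied in RESULTS; the function name `global_normalize` and the requirement that both channel signals be positive are assumptions of this example. Print-tip LOWESS, the default, fits an intensity-dependent curve per print tip and is not reproduced here.

```python
import math
import statistics


def global_normalize(red, green):
    """Return log2(red/green) per spot, shifted so the slide median is 0.

    `red` and `green` are background-subtracted signals of the same spots.
    ASSUMPTION: spots with non-positive signal have been filtered out
    beforehand (for example by QI), so a ValueError is raised otherwise.
    """
    if len(red) != len(green) or not red:
        raise ValueError("channels must be non-empty and the same length")
    if any(r <= 0 for r in red) or any(g <= 0 for g in green):
        raise ValueError("signals must be positive before taking log ratios")
    ratios = [math.log2(r / g) for r, g in zip(red, green)]
    center = statistics.median(ratios)
    return [x - center for x in ratios]


if __name__ == "__main__":
    red = [200.0, 400.0, 800.0, 1600.0, 100.0]
    green = [100.0, 100.0, 100.0, 100.0, 100.0]
    print(global_normalize(red, green))
```

A scatter plot of the two channels before and after this shift is one way to judge, as the text suggests, whether a global correction is enough or an intensity-dependent method is needed.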
RESULTS
Environment requirement.
RealSpot runs under a Windows 98 (or later) operating system. The minimal hardware configuration is a 250-MHz Pentium III CPU, 64-MB memory, and 200-MB free hard disk space for storing data. For experiments with multiple slides, a current mainstream computer with a 2.6-GHz Pentium IV CPU, 512-MB memory, and 2-GB free hard disk space is recommended. The following performance results are based on the recommended computer.
Performance.
It took ~10 s for RealSpot to import raw data and two images of a slide with 30,000 features. The table file created by RealSpot was ~20 MB or one-third of the total size of the imported raw data and image files (60–70 MB). RealSpot evaluates one slide, based on intensity and SBR, in ~500 ms. It takes ~5–10 min for a user to semi-automatically correct spot QI. The current version of RealSpot can manage an experiment with hundreds of slides. For loading 77 slide files (30,000 spots each) from a whole experiment, RealSpot only spent ~10 s because of the compact size and binary format of table files. The slowest operation in RealSpot was data exporting, due to the intensity-dependent LOWESS normalization. RealSpot spent ~10 s to normalize a table, or 3 min for a whole experiment of 20 slides.
Quality evaluation.
During the above performance test, ~0.3% of the spots (100 of 30,000) of each slide were semi-automatically corrected after initial quality evaluation, based on visual and subjective observation of spot images. By sorting, these spots were grouped at the end of the table. Most of these spots were extremely big and were falsely identified as bad spots. It took 5–10 min to correct these spots for each slide. This is a substantial time saving, compared with GenePix, where we normally spent several hours on manually evaluating the location and quality of individual spots through a whole slide.
To assess the data quality after the quality evaluation, we compared the scatter plots after data filtering using different ranges of QIs. As shown in Fig. 4, most of the bad spots and empty spots were in the lower-intensity end. After these spots were filtered (QI = 0 or 5), more consistent results were obtained.

Fig. 4. Scatter plots after data filtering based on QI; x- and y-axes are background-subtracted signal intensity from a lung-lung self-hybridization image after quality evaluation. Data were plotted to show spots within various QI ranges.
Furthermore, we compared RealSpot with GenePix for quality evaluation by counting false-positive spots from negative control probes and false-negative spots from positive control probes (Table 2). There are 169 Arabidopsis probes used as negative controls in our 10,000 rat DNA microarray. The number of false-positive spots identified by RealSpot from these negative control probes (a QI of 1) was slightly lower than that identified by GenePix (a flag of
0). It is noteworthy that the false-positive spots from RealSpot had a QI of 1, which means weak or ambiguous spots. A user may filter such weak spots in a particular experiment, and these spots might not be false positives in such a circumstance. For positive control probes, we chose 86 highly abundant genes such as ribosomal proteins and GAPDH. There was a significantly lower false-negative spot count in RealSpot than in GenePix (P < 0.05).
Result validation.
Using RealSpot, we also compared spot images with the results from different stages of data analysis, such as data filtering, data normalization, differential gene expression, and cluster analysis (Fig. 5). The spots shown in Fig. 5A had an identical flag of −100 for both red and green channels, generated by GenePix. These spots were identified as bad spots by GenePix. However, some of the spots are clearly good spots as evaluated by spot images and QIs generated by RealSpot (Fig. 5A). We used two methods to normalize data: log2-transformed ratios from background-subtracted signals (e.g., the "F535 mean − B535" data column in GenePix raw data) were normalized by LOWESS, based on print tip (Fig. 5B), or by globally adjusting the log2 ratio median to 0 (Fig. 5C). The genes shown had a twofold increase of expression (heart vs. lung). Compared with spot images, it appears that the log ratios from global normalization were more consistent with spot images than those from LOWESS normalization in this particular example. On the basis of the normalized log2 ratios, we identified a set of differentially expressed genes by significance analysis of microarrays (SAM; Fig. 5D) or ANOVA (Fig. 5E). The genes shown in Fig. 5, D and E, should be expressed significantly higher in lung than in heart and were generally consistent with spot images. However, the first gene in Fig. 5, D and E, might be considered for elimination from further studies, since there were two strong replicated spots in the heart column (2) for the former, and the gene expression level appeared to be very weak for the latter. We validated 1,641 of 2,110 differentially expressed genes identified by the SAM test in ~1/2 h in this way. About 22% of the differentially expressed genes were eliminated because of inconsistency between log ratio and spot images or very low gene expression levels (weak spots). Using k-means cluster analysis (4), we identified 10 clusters from 6 tissue hybridizations (lung, heart, kidney, liver, spleen, and brain). One of the clusters consists of spleen-specific genes (Fig. 5, F and G). The spot images were generally consistent with expression patterns from cluster analysis, except the second gene, which was apparently a false-positive result and thus eliminated from further study. We found that 6 of 48 genes in a whole cluster showed inconsistent trends between gene expression levels (normalized signal in arbitrary units) and the respective spot images.

Fig. 5. Validation of the results at different stages of microarray data analysis. A: spots with a spot flag of −100 (bad spot) generated by GenePix from lung (column 1)-lung (column 2) self-hybridization. The 2 icons in each row of column Q were the QI of columns 1 and 2 from RealSpot, respectively. B and C: 11 of the spots have a normalized log2 ratio of 1.0 (lung/heart, column 1/2) by LOWESS (B) or global (C) normalization. D and E: genes with 4 replicated spots were differentially expressed in lung (column 1) vs. heart (column 2) identified by SAM (q-value = 0.05; D) or ANOVA (P value = 0.05; E). Column S indicates QI summary of the respective spots. F and G: 1 cluster of genes identified by k-means cluster analysis (k = 10) from tissue hybridizations: lung (1), heart (2), kidney (3), liver (4), spleen (5), and brain (6). Column S in F represents QI summary. Signals in the y-axis of G were relative gene expression levels of genes among the 6 organs in arbitrary units (AU).
DISCUSSION
We designed a software package, RealSpot, to validate the results from DNA microarray experiments at any stage of data analysis by directly associating spot quality and results with spot images. RealSpot evaluates DNA microarray data quality by assigning a QI to each spot, based on signal intensity and SBR, followed by direct comparison of spot images, raw data, and analysis results. The reliability of quality evaluation is dependent on both spot images and the respective raw data. The direct comparison of spot images and raw data makes the quality evaluation more reliable than using either of them separately. RealSpot also has several built-in functions, including search, sort, data organization, one-way ANOVA, GO, data normalization, plotting, and web page generation. Furthermore, step-by-step modules make RealSpot easy to use.
The spot images are directly imported from scanned microarray slides and linearly transformed to 0–255, representing original image information. This linear transformation trims the extreme signals, i.e., the lowest and the highest 5% of image signals, since the former is usually background noise and the latter the saturated signal. Trimming these signals does not lose much image information, but makes 90% of pixels visible without adjusting brightness and contrast. This method is similar to Affymetrix data normalization, which linearly transforms fluorescence intensity to an arbitrary range, e.g., 0–10,000. The advantage is that different slides are comparable after transformation. We noticed that some slides show differences among printing tips. We therefore added an option in RealSpot for a separate transformation of each subarray or block from an identical printing tip to compensate for such differences, similar to the printing tip-based LOWESS normalization (8). We generated quantitative data (signal intensity and SBR) by RealSpot from original 16-bit spot images located by geometric data from GenePix (x, y, and diameter) for quality evaluation and verification of the original raw data from image analysis software packages. Signal intensity and SBR were affected by spot alignment algorithms but not spot segmentation algorithms (14). In RealSpot, the image of each spot was split from a whole slide at the respective spot center (x, y). The size of each spot image was identical, i.e., the average distance between two adjacent spots. The signal intensity was the average intensity of the whole transformed image of a spot, and the SBR was estimated from the center one-fourth area (for signal) and the four corners (for background). The problem in terms of assigning a QI to large spots was associated with the estimation of SBR in RealSpot. The problematic spots usually had a diameter larger than the average distance between two adjacent spots. The four corners of these spot images were largely occupied by the spots. Consequently, the estimated SBR was lower than the true SBR, and thus a larger spot may be falsely assigned the QI of a contaminated spot. This problem would be even worse in slide areas where large spots were clustered together. A potential solution would be the estimation of SBR using a global background through a whole slide, since this problem resulted from the estimation of local background. We have not yet tested a global background, since local background works well with 99.7% of spots for identifying low-quality spots in our results.
However, in RealSpot, these falsely assigned QIs can be manually corrected through quick table operations: sorting, selecting, and editing. The quick manual correction of a table in RealSpot is more efficient than the time-consuming manual spot correction of a whole slide in GenePix.
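The local SBR estimate, and the reason it fails for oversized spots, can be reproduced with a small sketch. The function name `estimate_sbr` is hypothetical, and the corner patch size (one-quarter of the image side) is an assumption, since the text specifies only "the four corners"; the center block spans half the height and half the width, i.e., one-fourth of the image area.

```python
def mean(values):
    return sum(values) / len(values)


def estimate_sbr(img, corner_frac=0.25):
    """Estimate signal-to-background ratio of one square spot image.

    img is a list of pixel rows. Signal is the mean of the centre block
    covering one-fourth of the area; background is the mean of four corner
    patches. ASSUMPTION: each corner patch spans `corner_frac` of each side.
    """
    h, w = len(img), len(img[0])
    r0, c0 = h // 4, w // 4
    r1, c1 = r0 + max(1, h // 2), c0 + max(1, w // 2)
    signal = mean([img[r][c] for r in range(r0, r1) for c in range(c0, c1)])

    ch, cw = max(1, int(h * corner_frac)), max(1, int(w * corner_frac))
    corner_pixels = []
    for rows in (range(0, ch), range(h - ch, h)):
        for cols in (range(0, cw), range(w - cw, w)):
            corner_pixels.extend(img[r][c] for r in rows for c in cols)
    background = mean(corner_pixels)
    if background <= 0:
        return float("inf")
    return signal / background


if __name__ == "__main__":
    # A normal spot: bright 4x4 centre on a dim 8x8 tile.
    normal = [[100 if 2 <= r <= 5 and 2 <= c <= 5 else 10 for c in range(8)] for r in range(8)]
    # An oversized spot that fills the whole tile, corners included.
    oversized = [[100 for _ in range(8)] for _ in range(8)]
    print(estimate_sbr(normal))     # well above the default threshold of 2
    print(estimate_sbr(oversized))  # about 1, so the spot would be given a QI of 5
```

The second case shows the failure mode described above: when the spot spills into the corners, the "background" is really signal, the ratio collapses toward 1, and a good spot is marked as bad until it is corrected manually.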
The data imported from raw data and transformed images are useful for manual evaluation. For instance, the contaminated spots group together when the table from a slide is sorted by flags, signal-to-noise ratio (SNR), or the "B535 mean" column (from the GPR file). These contaminated spots can then be selected manually and re-marked simultaneously. Sorting the table also helps to correct errors, in particular for weak, noisy, or irregular spots. On the other hand, some spots may be marked as bad spots based on SBR even though they are good spots by visual assessment. Most of the reported automatic methods were based on raw data, using criteria such as SNR, SBR, circularity coefficient, or composite score (1, 6, 12). We found that these criteria were sometimes inconsistent with spot images. For instance, many weak spots have high SBRs because both their signals and backgrounds are close to 0. Some spots with good morphology and intensity may be contaminated with a few tiny dust specks. These spots are good spots if manually marked but may be identified as bad spots because their SNRs are low. They can be corrected by sorting the QI column, followed by the spot diameter column. In RealSpot, the QIs of all the spots can be corrected in 5–10 min for 30,000 spots, and thus the mistakes that are unavoidable for automatic tools are minimized.
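The sort-select-edit workflow can be sketched in a few lines of Python. The record fields (`id`, `qi`, `diameter`) and the helper `remark` are hypothetical names for illustration, not RealSpot's data model, and the QI values are arbitrary scores in the 0–4 range.

```python
# Hypothetical spot table: one dict per spot.
spots = [
    {"id": "s1", "qi": 4, "diameter": 110},
    {"id": "s2", "qi": 1, "diameter": 150},
    {"id": "s3", "qi": 4, "diameter": 95},
    {"id": "s4", "qi": 1, "diameter": 90},
]

# Sort by QI, then by spot diameter, so that similar spots become
# adjacent rows and misassigned ones are easy to pick out.
ordered = sorted(spots, key=lambda s: (s["qi"], s["diameter"]))

def remark(rows, selected_ids, new_qi):
    """Assign new_qi to every selected spot in one step (bulk edit)."""
    for row in rows:
        if row["id"] in selected_ids:
            row["qi"] = new_qi

# Example: after visual inspection, the large spot s2 is judged to be
# good and is re-marked together with any other selected spots.
remark(ordered, {"s2"}, 4)
```

Sorting on a composite key mirrors sorting the QI column followed by the diameter column: spots with the same QI end up ordered by size, which brings the oversized ones to the end of each QI group.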
The one-way ANOVA implemented in RealSpot simplifies multiple statistical factors (e.g., dye, slide, and sample treatment) to one factor (sample treatment). The original 16-bit gene expression data from each slide are globally scaled to an identical range (1–1,000) and logarithm transformed for calculation of P values. The P values from this simplified one-way ANOVA can be used for fast screening of differentially expressed genes. For instance, a gene with a P value lower than the significance cutoff (P = 0.05) may be considered significantly differentially expressed and selected for further investigation. The significance cutoff may be adjusted, e.g., by the Bonferroni adjustment, which sets the P value cutoff to 0.05/n (n = total number of genes). The adjusted cutoff may decrease type I errors (false positives).
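As a rough illustration of these steps, the following Python sketch scales one slide's values to 1–1,000, log-transforms them, computes a one-way ANOVA F statistic, and derives a Bonferroni cutoff. It makes several assumptions: the scaling is taken to be a linear min-max rescaling, the logarithm base (2) is arbitrary, and all function names are hypothetical. The P value itself, which comes from the F distribution with (k - 1, N - k) degrees of freedom, is not computed here because it needs a statistics library.

```python
import math

def scale_to_range(values, lo=1.0, hi=1000.0):
    """Linearly rescale so that min -> lo and max -> hi.

    Assumed form of the global scaling; requires max(values) > min(values).
    """
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of measurements.

    Requires at least two groups and nonzero within-group variation.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def bonferroni_cutoff(alpha, n_genes):
    """Bonferroni-adjusted significance cutoff: alpha / n."""
    return alpha / n_genes

# Scale one slide's raw 16-bit values, then log-transform them.
raw = [0, 32767.5, 65535]
scaled = scale_to_range(raw)
logged = [math.log2(v) for v in scaled]

# F statistic for one gene measured under three sample treatments
# (made-up numbers on an arbitrary log scale).
f_stat = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

# Cutoff for a slide with 30,000 features.
cutoff = bonferroni_cutoff(0.05, 30000)
```

With 30,000 features the Bonferroni cutoff is about 1.7e-6, which shows how strict this adjustment is compared with the unadjusted 0.05.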
The GO information of known genes is helpful for understanding gene functions. In a typical ontological annotation file, a gene is assigned multiple GO IDs, reflecting the molecular function, biological process, and cellular location. RealSpot summarizes all the ontological annotations of each gene and thus provides a user with comprehensive information on a gene. On the basis of the P value from one-way ANOVA and ontological annotation, a user can quickly find interesting genes and their potential roles in a particular experiment.
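The per-gene summary can be pictured as a grouping step over the annotation file. The sketch below is only an assumption about how such a summary might be built: the row layout (gene, GO ID, category), the function name `summarize_go`, and the gene names and GO IDs are placeholders, not real annotations.

```python
from collections import defaultdict

def summarize_go(rows):
    """Group GO IDs by gene and by GO category.

    rows is an iterable of (gene, go_id, category) tuples.
    """
    summary = defaultdict(lambda: defaultdict(list))
    for gene, go_id, category in rows:
        summary[gene][category].append(go_id)
    # Convert nested defaultdicts to plain dicts for display or export.
    return {gene: dict(cats) for gene, cats in summary.items()}

# Placeholder rows; gene names and GO IDs are illustrative only.
rows = [
    ("gene1", "GO:0000001", "molecular function"),
    ("gene1", "GO:0000002", "biological process"),
    ("gene1", "GO:0000003", "biological process"),
    ("gene2", "GO:0000004", "cellular location"),
]
go_summary = summarize_go(rows)
```

Each gene then carries all of its annotations in one record, which can be shown next to its P value in the same table row.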
RealSpot is efficient. It evaluates and assigns a QI to each spot immediately after data import. A user can also identify incorrectly evaluated spots by sorting similar spots together using spot images and raw data. Another feature is the icons used for QIs. RealSpot uses standard scores 0–4 and represents them as bar plot-like icons to visualize spot quality (Table 1). It is easier for a user to compare a spot with the respective icon than with a number. The icons also help to visualize trends in gene expression level when several slides or samples are grouped together.
RealSpot is flexible and easy to use. First, the images from many DNA microarray scanners and raw data from commonly used image analysis software packages can be imported directly. The import module guides a user through importing data step by step with detailed help information. RealSpot skips the descriptive header of some raw data files, e.g., GenePix GPR files, and imports user-selected columns of raw data. Because raw data and spot images are organized in a table, the user interface of RealSpot is similar to a Microsoft Excel worksheet. In such an operating environment, a user can focus on data evaluation without learning new instructions for operating the software. The export module exports the table as a text file for import into a database or data analysis software. Images can be exported in bitmap or metafile formats, which are the most widely supported formats in Windows operating systems. Organizing data by samples and slides clearly displays microarray experimental designs, such as a loop or reference design, and helps a user to interpret data from biological samples.
RealSpot is designed for quality evaluation of raw data and spot images, not for data analysis, although some simple data-processing tools such as data normalization are included. Further improvements to RealSpot may include image transformation, a direct link to a database, and data analysis tools. In the current version of RealSpot, a 16-bit image is linearly transformed to an 8-bit image for display. A square-root transformation may be used to strengthen weak spots. A direct link between RealSpot and a database will help a user manage microarray data. More powerful sort and search tools may be implemented in the metatable. Another limitation of the current RealSpot version is that it works only with images from dual-color hybridization; this issue should be addressed in a future version. In summary, the software package RealSpot is efficient for validating microarray results and thus helpful for improving the reliability of the whole microarray experiment. The improvement results from the association of microarray data with the respective spot images.
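The difference between the linear 16-bit to 8-bit display transformation and the proposed square-root transformation can be seen in a small Python sketch. The exact formulas and the function names are assumptions, since the text does not specify them; the sketch simply maps the full 16-bit range (0–65,535) onto the 8-bit range (0–255) in both cases.

```python
import math

MAX_16BIT = 65535
MAX_8BIT = 255

def linear_to_8bit(value):
    """Map a 16-bit intensity linearly onto 0-255."""
    return round(value * MAX_8BIT / MAX_16BIT)

def sqrt_to_8bit(value):
    """Map a 16-bit intensity onto 0-255 with a square-root curve,
    which brightens weak pixels relative to the linear mapping."""
    return round(math.sqrt(value / MAX_16BIT) * MAX_8BIT)

# A weak pixel at 1/16 of full scale.
weak = 4096
linear_level = linear_to_8bit(weak)
sqrt_level = sqrt_to_8bit(weak)
```

For this weak pixel the linear mapping gives display level 16, while the square-root mapping gives 64, so weak spots become much easier to see while both mappings still send 0 to 0 and 65,535 to 255.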
 |
GRANTS
|
---|
This study was supported by National Heart, Lung, and Blood Institute Grants R01-HL-52146 and R01-HL-071628 and American Heart Association (AHA) Grant 0255992Z (to L. Liu). Z. Chen was supported by AHA Predoctoral Fellowship 0315260Z.
 |
ACKNOWLEDGMENTS
|
---|
We thank Keyu He and Tingting Weng for improving RealSpot user interface and Nili Jin and Jiwang Chen for helpful discussions during the preparation of the manuscript.
 |
FOOTNOTES
|
---|
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).
Address for reprint requests and other correspondence: L. Liu, Dept. of Physiological Sciences, Oklahoma State Univ., 264 McElroy Hall, Stillwater, OK 74078 (E-mail: liulin{at}okstate.edu).
10.1152/physiolgenomics.00236.2004.
 |
REFERENCES
|
---|
- Bozinov D and Rahnenfuhrer J. Unsupervised technique for robust target separation and analysis of DNA microarray spots through adaptive pixel clustering. Bioinformatics 18: 747–756, 2002.
- Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, and Vingron M. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29: 365–371, 2001.
- Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, and Conklin BR. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet 31: 19–20, 2002.
- Eisen MB, Spellman PT, Brown PO, and Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95: 14863–14868, 1998.
- Forster T, Roy D, and Ghazal P. Experiments using microarray technology: limitations and standard operating procedures. J Endocrinol 178: 195–204, 2003.
- Jain AN, Tokuyasu TA, Snijders AM, Segraves R, Albertson DG, and Pinkel D. Fully automatic quantification of microarray image data. Genome Res 12: 325–332, 2002.
- Killion PJ, Sherlock G, and Iyer VR. The Longhorn Array Database (LAD): an open-source, MIAME compliant implementation of the Stanford Microarray Database (SMD). BMC Bioinformatics 4: 32, 2003.
- Smyth GK and Speed T. Normalization of cDNA microarray data. Methods 31: 265–273, 2003.
- Tamames J, Clark D, Herrero J, Dopazo J, Blaschke C, Fernandez JM, Oliveros JC, and Valencia A. Bioinformatics methods for the analysis of expression arrays: data clustering and information extraction. J Biotechnol 98: 269–283, 2002.
- Tusher VG, Tibshirani R, and Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98: 5116–5121, 2001.
- van Someren EP, Wessels LF, Backer E, and Reinders MJ. Genetic network modeling. Pharmacogenomics 3: 507–525, 2002.
- Wang X, Ghosh S, and Guo SW. Quantitative quality control in microarray image processing and data acquisition. Nucleic Acids Res 29: E75, 2001.
- Yang MC, Ruan QG, Yang JJ, Eckenrode S, Wu S, McIndoe RA, and She JX. A statistical method for flagging weak spots improves normalization and ratio estimates in microarrays. Physiol Genomics 7: 45–53, 2001.
- Yang YH, Buckley MJ, and Speed TP. Analysis of cDNA microarray images. Brief Bioinform 2: 341–349, 2001.
- Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, and Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30: e15, 2002.