1 Division of Pulmonary and Critical Care Medicine, Department of Medicine and Pulmonary Bioinformatics, the Lung Biology Center, Brigham and Women's Hospital, Boston, Massachusetts 02115
2 Children's Hospital Informatics Program, Harvard Medical School, Boston, Massachusetts 02115
3 Department of Obstetrics and Gynecology, Washington University School of Medicine, St. Louis, Missouri 63110
ABSTRACT
Affymetrix; Agilent; reference sequence; RefSeq
INTRODUCTION
Commercial microarrays have been widely available, and the Affymetrix oligonucleotide microarrays (GeneChip technology; Ref. 19) represent a large proportion of previously published studies. GeneChip technology utilizes multiple independent probe hybridization events to measure the expression level for each gene investigated. Additionally, for each hybridization event, the technology utilizes 25-nt oligomers (25-mers) and corresponding single nucleotide mismatch 25-mers to measure specificity. The individual 25-mers are derived from publicly available nucleotide sequence information.
There has been tremendous success applying microarray technology to disease diagnostic applications. For instance, multiple groups have shown that microarray data can identify previously unappreciated molecular subtypes of lung cancer that differ in their prognoses (1, 2, 5). Unfortunately, poor reproducibility of results exists across these studies. This indicates either an underlying distinction in the nature of the disease investigated or, more likely, a limitation of the technology in reliably capturing the underlying biology. Microarray technology has been criticized for a lack of individual measurement accuracy. However, the technology is rapidly advancing, and improvements in reagents (11) and data analysis (12, 16) have increased measurement precision. Some sources of noise, such as those due to hybridization intensity differences, are systematic and have been successfully defined (4, 9, 12, 15, 20, 33, 40).
As a tremendous volume of data has been generated (particularly from human clinical specimens, which cannot be duplicated), strategies to improve analysis of ("clean up") existing data sets are of great value. One limitation in applying this technology may be that similar studies fail to measure identical biological parameters. For instance, Sorlie et al. (31) recently showed that limiting analysis of data to probes that are verified to query identical UniGenes improves concordance of results from one microarray data set to another.
Probe sequence inaccuracies are known to exist for both oligonucleotide and cDNA microarrays. However, there is a general lack of information regarding the scope of probe sequence inaccuracies on currently available Affymetrix platforms. In this study we report a global analysis of the probes used by Affymetrix technology, where we have systematically attempted to confirm the accuracy of individual probe sequences. We find that for a significant number of probe sets, on both old and current platforms, the probe sequences do not perfectly correspond with the appropriate mRNA as defined by the reference sequence (RefSeq). Given the approach Affymetrix uses to determine true mRNA hybridization from background (e.g., single nucleotide mismatch), any sequence discrepancy likely renders the probe uninformative. Furthermore, we report that data derived from sequence-verified probes shows vastly improved precision. Therefore, removing information from inaccurate probes should significantly improve the validity of results from this technology.
EXPERIMENTAL PROCEDURES
Data Processing and Probe Matching
For each Affymetrix microarray experiment, array image files were analyzed with Affymetrix Microarray Suite ver. 5 (MAS5) software, and signal intensities were calculated for MAS5, robust multi-array averaging (RMA) (12), and DNA-Chip Analyzer (dChip) (16) with the aid of Bioconductor software. Expression values for Agilent cDNA data were calculated using the manufacturer's standard software and normalization procedures. Correlations were calculated using MATLAB, and clustering diagrams were generated by the TIGR Multiexperiment Viewer (MeV) (28) software package.
When comparing measurements across generations of Affymetrix platforms, probe sets were matched if they contained at least one probe that corresponds to the same UniGene. When comparing measurements between Agilent and Affymetrix technologies, probe sets were matched if the Affymetrix probe set was for a UniGene containing an Agilent cDNA sequence.
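The cross-platform matching rule can be sketched as a join on UniGene identifiers. This is a minimal illustration, not the authors' code: the function name, the input format (a dictionary from probe set ID to UniGene ID, assumed to be derived already from the probe-level annotation), and the toy identifiers are all hypothetical.

```python
from collections import defaultdict


def match_probe_sets(platform_a, platform_b):
    """Pair probe sets from two platforms that map to the same UniGene.

    Each argument is a dict {probe_set_id: unigene_id} (hypothetical format).
    Returns a sorted list of (probe_set_a, probe_set_b, unigene_id) tuples.
    """
    # Index the second platform by UniGene so each lookup is constant time.
    by_unigene = defaultdict(list)
    for probe_set, unigene in platform_b.items():
        by_unigene[unigene].append(probe_set)

    pairs = []
    for probe_set, unigene in platform_a.items():
        for partner in by_unigene.get(unigene, []):
            pairs.append((probe_set, partner, unigene))
    return sorted(pairs)


# Toy identifiers, for illustration only.
older = {"a1": "Hs.1", "a2": "Hs.2", "a3": "Hs.3"}
newer = {"b1": "Hs.1", "b2": "Hs.1", "b3": "Hs.3", "b4": "Hs.9"}
matched = match_probe_sets(older, newer)
```

When several probe sets share a UniGene, this sketch returns every pairing; whether to keep all pairs or select one per UniGene is a design choice the text does not specify.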
Microarray Data Sets and Analysis Details
Affymetrix technical replicate experiment.
The microarray data set used for this analysis has been previously described (20) and is available at the NCBI Gene Expression Omnibus (GEO) (GSE1302). Three independent mRNA samples were studied on the Hu95v2 platform (subarrays A-E). For each sample, a single mRNA target was generated and hybridized to five replicate sets of microarrays (U95Av2, U95Bv2, ...). Therefore, for each subarray tested, the three conditions produced three independent sets of five technical replicate samples. Each set of five replicates generated ten comparisons (e.g., replicate 1 vs. 2, 1 vs. 3, ..., 4 vs. 5). The probe sets for each platform were classified as verified or unverified as described above. Pearson and Spearman correlations were calculated for each set of ten replicate comparisons, for either the verified or unverified probe sets, using MAS5.
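The replicate comparison scheme (five replicates giving ten pairwise comparisons, each summarized by a Pearson and a Spearman coefficient) can be sketched as follows. The study computed correlations in MATLAB; this standard-library Python version is only an illustration, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def replicate_correlations(replicates):
    """Return {(i, j): (pearson, spearman)} for every pair of replicates.

    `replicates` is a list of signal vectors, one per replicate array,
    with probe sets in the same order in each vector.
    """
    return {
        (i, j): (pearson(replicates[i], replicates[j]),
                 spearman(replicates[i], replicates[j]))
        for i, j in combinations(range(len(replicates)), 2)
    }
```

Running `replicate_correlations` once on the verified probe sets and once on the unverified probe sets of the same five arrays yields two sets of ten coefficients that are paired by comparison.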
Breast cancer cell line experiments.
This microarray data set has been previously described (39) and is available at GEO (GSE1299). For the microarray data set used for this analysis, mRNA from two cancer cell lines and one normal cell line were hybridized to replicate Affymetrix Hu95Av2, Hu133A, and Hu133B and Agilent Human cDNA arrays. Each replicate experiment was compared with a similar experiment on the other platforms (e.g., cancer cell line 1, replicate 1 on the Hu95Av2 platform was compared with the cancer cell line 1, replicate 1 on the Hu133A, Hu133B, and Agilent platforms), resulting in six independent comparisons for each platform. Probe sets were matched across Affymetrix platforms and between Affymetrix and Agilent technologies as described above.
When comparing data across Affymetrix platforms, Pearson and Spearman correlation coefficients were calculated from signal intensity data for the matching verified and unverified probe sets for each of the six comparisons. As Agilent technology reports expression levels as a ratio between two samples, when comparing across technologies the Affymetrix data had to be transformed. Here, the expression level for each Affymetrix probe set was transformed into the log base 2 of the ratio of its signal intensity in a cancer sample to its signal intensity in a normal sample. Pearson and Spearman correlation coefficients were calculated from signal intensity data for both the verified and unverified probe sets.
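The transformation that puts single-channel Affymetrix data on the same scale as two-sample ratios is simple arithmetic. A minimal sketch follows; the function name, the dictionary input format, and the choice to skip probe sets with missing or non-positive signal are assumptions, not details from the study.

```python
from math import log2


def log2_ratios(cancer, normal):
    """Return {probe_set_id: log2(cancer signal / normal signal)}.

    Both arguments are dicts {probe_set_id: signal_intensity}.
    Probe sets absent from either sample, or with a non-positive signal,
    are skipped (an assumption made for this sketch).
    """
    ratios = {}
    for probe_set, cancer_signal in cancer.items():
        normal_signal = normal.get(probe_set)
        if normal_signal is None or cancer_signal <= 0 or normal_signal <= 0:
            continue
        ratios[probe_set] = log2(cancer_signal / normal_signal)
    return ratios


# Toy values: a fourfold increase gives 2.0, a fourfold decrease gives -2.0.
example = log2_ratios({"up": 800.0, "down": 100.0}, {"up": 200.0, "down": 400.0})
```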
Breast cancer tissue experiments.
Both breast cancer studies were performed on the U95Av2 and have been previously described. The Dana Farber Cancer Institute (DFCI) study included 101 breast cancer tumors and 8 normal breast tissue samples (30) and is publicly available (http://lungtranscriptome.bwh.harvard.edu/Breast_Cancer_Data.html). The Duke University study included 89 breast cancer tumors (38) and is publicly available (http://data.cgt.duke.edu/lancet.php). A 198 x 12,626 matrix was generated that contained RMA signal intensity values for each probe set in all samples from both the Duke and DFCI studies. Hierarchical average linkage clustering (2), with the Pearson correlation as the distance metric, was performed on the following three sets of data: 1) the 1,000 probe sets that exhibited the largest standard deviation relative to their mean intensity, 2) the subset of these 1,000 probe sets that are RefSeq-verified, and 3) the subset of these 1,000 probe sets that are unverified. Note that the Affymetrix control probe sets (e.g., AFFX-PheX-M_at) were excluded from the clustering so that no bias was introduced by different processing methods that either center might have used.
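The probe set filter applied before clustering (largest standard deviation relative to mean intensity) might be sketched as below. The function names and input format are hypothetical, and the use of the sample standard deviation is an assumption; the clustering step itself is not reproduced here.

```python
from statistics import mean, stdev


def top_variable_probe_sets(matrix, n):
    """Return the n probe set IDs with the largest stdev/mean ratio.

    `matrix` is a dict {probe_set_id: [intensity in each sample]}.
    Probe sets with a non-positive mean are ignored (an assumption).
    """
    scores = {}
    for probe_set, values in matrix.items():
        m = mean(values)
        if m > 0:
            scores[probe_set] = stdev(values) / m
    return sorted(scores, key=scores.get, reverse=True)[:n]


def split_by_verification(probe_sets, verified_ids):
    """Split a list of probe set IDs into (verified, unverified) sublists."""
    verified = [p for p in probe_sets if p in verified_ids]
    unverified = [p for p in probe_sets if p not in verified_ids]
    return verified, unverified
```

With `n = 1000`, the three lists (all selected probe sets, the verified subset, the unverified subset) correspond to the three clustering inputs described above.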
Significance Testing
Because of hybridization quality variation, the significance of the difference in correlations between sequence-verified and -unverified probe sets was determined using a paired t-test.
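Pairing is natural here because each comparison (for example, replicate 1 vs. 2) yields one correlation for verified probe sets and one for unverified probe sets from the same hybridizations. A minimal sketch of the paired t statistic follows; the function name is hypothetical, and converting the statistic to a P value needs Student's t distribution, which the Python standard library does not provide (`scipy.stats.ttest_rel` is one existing option).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(verified_r, unverified_r):
    """Paired t statistic and degrees of freedom for two matched lists.

    Element k of each list must come from the same comparison.
    """
    diffs = [v - u for v, u in zip(verified_r, unverified_r)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1
```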
RESULTS
We found that a very high proportion of the individual probes could not be verified as measuring their associated RefSeq (Supplemental Table S1, available at the Physiological Genomics web site).1 Of particular note, the percentages of verified probes on the more recent platforms were no greater than those for the older platforms: U133A, 72% vs. HuFL, 80%; 430A, 75% vs. Mu11kA, 74%; and 230A, 80% vs. U34A, 81%. Moreover, for each platform that contains a set of arrays (e.g., Hu95A-E), there was always a decrease in the percentage of verified probes on the secondary array(s) (e.g., Hu95B-E, max 47%) compared with that of the primary array in the set (e.g., Hu95A, 72%). Interestingly, platforms with higher verification rates generally contained fewer EST sequence-derived probes.
As Affymetrix technology utilizes "sets" of probes to interrogate a single gene, we investigated the distribution of verified probes into probe sets (Fig. 1). A probe set was classified as either 1) entirely verified if it contained only verified probes, 2) entirely unverified if none of its probes were verified, or 3) partially verified if some, but not all, of its probes were verified. For human platforms, the proportion of entirely verified probe sets ranged from 13.8% (U95D) at worst to 62.5% (U133A) at best. Again, the older and newer platforms showed similar percentages of entirely verified probe sets. Probe sets were generally either entirely verified or entirely unverified, with only 10% (U133B) to 24.7% (HuFL) being partially verified. Additionally, a large number of the partially verified probe sets contained only one or two unverified probes. For U133A, more than 50% of the partially verified probe sets have 9 or 10 out of 11 probes verified (Supplemental Fig. S1). This distribution indicates that the probes in a set tend to identify their intended target together, as a set, rather than independently.
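The three-way classification of probe sets can be written directly from the definitions above. This is an illustrative sketch; the function names and the boolean-list input format are hypothetical.

```python
from collections import Counter


def classify_probe_set(probe_flags):
    """Classify one probe set from per-probe verification flags.

    `probe_flags` is a non-empty list of booleans, one per probe,
    True when the probe was verified against its RefSeq.
    """
    if not probe_flags:
        raise ValueError("probe set has no probes")
    if all(probe_flags):
        return "entirely verified"
    if not any(probe_flags):
        return "entirely unverified"
    return "partially verified"


def class_percentages(platform):
    """Percentage of probe sets in each class for a whole platform.

    `platform` is a dict {probe_set_id: [bool, ...]}.
    """
    counts = Counter(classify_probe_set(flags) for flags in platform.values())
    total = sum(counts.values())
    return {label: 100.0 * count / total for label, count in counts.items()}
```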
Replicate Precision of Verified Probe Sets
Although we may have decreased confidence in the reliability of unverified probes, these probes may not necessarily show a difference in measurement precision. We sought to test whether precision was related to probe sequence accuracy using data from a previously published replicate microarray data set (20), including three mRNA targets each hybridized to five individual microarrays on each of the U95 (A, B, C, D, and E) platforms. The three conditions produced three sets of five technical replicate data sets, with each set of replicates generating ten comparisons (e.g., replicate 1 vs. replicate 2, 1 vs. 3, ... 4 vs. 5). Verified probe sets were defined as those containing at least one verified probe, and unverified probe sets contained no verified probes. We focused upon a comparison of the behavior between these two groups. There are two major benefits of this approach: 1) it provides a measure of "quality" for both the verified and unverified probe sets, and 2) the independence of the verified and unverified sets allows for a statistical comparison to be made.
Pearson and Spearman correlation coefficients were calculated for replicate measurements for both verified and unverified probe sets on each subarray using MAS5 data (Table 1 and Supplemental Table S2). For data generated on U95A, Pearson correlation coefficients ranged from 0.949 to 0.991 for verified probe sets and 0.883 to 0.987 for unverified probe sets, with P values of 0.014 to 0.000076. Spearman correlation coefficients showed a greater increase in accuracy for verified probe sets. The consistently significant increase in the correlations derived from verified probe sets indicated that probe accuracy affects data reproducibility.
These differences were somewhat surprising, given the measurements were from technical (and not experimental) replicates. However, it is likely that unverified probes capture background hybridization to one or more transcripts. Whether individual unverified probes measure background, an alternate transcript, or multiple transcript variants, it is intuitive that unverified probe sets would have greater variability than probe sets that are verified to measure a single transcript. The difference in correlations clearly indicated a difference in measurement reproducibility and led us to further test the effects of probe sequence accuracy on data reliability.
Accuracy of Verified Probe Sets Across Platforms and Technologies
Having shown that verified probe sets capture information with greater precision in replicate experiments performed on one platform, we tested the effects of sequence verification on measurement accuracy across complementary microarray technologies. We used a data set consisting of experiments on each of two breast cancer cell lines and one normal mammary epithelial cell line (22). For each cell line, replicates were performed on the U95A, U133A, U133B, and Agilent cDNA microarray platforms.
We tested the effect of sequence verification upon measurement accuracy across generations of Affymetrix platforms. Probe sets from the U95 and U133 platforms were matched using UniGene identifiers. Probe sets that were shared across platforms were classified as verified if at least one probe matched its RefSeq. For example, UniGene Hs.20952 is measured by two probe sets on both the Hu95Av2 (37761_at, 33386_at) and Hu133A (209821_s_at, 208886_at) platforms. However, the RefSeq for this UniGene, NM_001682, is only measured by 37761_at and 209821_s_at. Therefore, we classified these probe sets as verified, whereas 33386_at and 208886_at were classified as unverified. Spearman and Pearson correlation coefficients were calculated for both the verified and unverified (matched) probe set signal intensities. To confirm that the observed effects were not due to specific data processing algorithms, we repeated the analyses using MAS5, RMA, and dChip.
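The "at least one probe matches its RefSeq" rule can be illustrated with a toy check. The matching criterion used here, an exact case-insensitive substring match of the probe sequence within the RefSeq sequence, is an assumption made for this sketch, as are the function names and the short made-up sequences; real probes are 25-mers, and strand orientation is ignored.

```python
def probe_matches(probe_seq, refseq_seq):
    """True if the probe occurs exactly within the reference sequence.

    Exact substring matching is an assumption made for illustration.
    """
    return probe_seq.upper() in refseq_seq.upper()


def probe_set_is_verified(probe_seqs, refseq_seq):
    """A probe set counts as verified if at least one probe matches."""
    return any(probe_matches(p, refseq_seq) for p in probe_seqs)


# Made-up sequences, far shorter than real probes or transcripts.
reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
set_one = ["GCGTACGTTAGC", "TTTTTTTTTTTT"]   # first probe matches
set_two = ["CCCCCCCCCCCC", "GGGGGGGGGGGG"]   # no probe matches
```

Under this rule `set_one` would be treated as verified and `set_two` as unverified, mirroring how 37761_at and 33386_at are separated in the example above.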
For each comparison between two platforms there were six correlations calculated. Consistently significant (P < 0.00001) increases in correlations were observed for verified probe sets compared with unverified probe sets (Table 2 and Supplemental Fig. S3A) using all analysis methods. For example, P values for the Pearson correlation coefficients in the U95A to U133A comparison were all less than 0.000001 for MAS5, RMA, and dChip. The differences were unrelated to sample size, as RefSeq-verified probe sets comprised a majority in the U95A to U133A comparison and a minority in the U95A to U133B comparison. An interesting aspect of this analysis is the extremely low correlations for unverified probe sets in the U95A to U133B comparison, likely due to the large number of EST-derived probe sequences on the U133B platform.
Another technique used by some researchers to remove low-quality information is to utilize the Affymetrix "detection call" algorithm. We investigated the rate of "absent" and "present" calls for both the verified and unverified probe sets. For each of the U95A, U133A, and U133B platforms, the percentage of verified and unverified probe sets called present, absent, or "marginal" was calculated (Supplemental Table S5). Verified probe sets were scored present at a higher rate than unverified probe sets: 56% vs. 45% for U95A, 58% vs. 46% for U133A, and 53% vs. 34% for U133B. However, the on-average greater than 40% score of present for unverified probe sets clearly indicates that using the detection call as a basis for removing low-quality information from microarray experiments does not compensate for probe sequence inaccuracies. Additionally, we recalculated the correlations for the verified and unverified probe sets between the Hu95A and Hu133A platforms after removing all probe sets not consistently scored present. Verified probe sets retained significantly higher correlations than unverified probe sets (P < 0.00001, data not shown). These data further support the conclusion that unverified probes are measuring a nonspecific transcript, albeit at lower reproducibility, and this likely explains the improved measurement precision of verified probe sets in the technical replicate experiment.
Finally, we investigated the effects of sequence verification on measurement accuracy across Affymetrix and Agilent cDNA microarray technologies. There is greater concordance between cDNA and Affymetrix data when the probes used to measure expression contain similar information (22). Therefore, we only included probe sets from the U95 or U133 platforms if they overlapped a sequence used on the Agilent array. Probe sets that matched Agilent cDNA probe sequences were defined as verified when at least one Affymetrix probe was RefSeq verified. For example, we identified a match between probe set 207038_at on the U133A platform and Agilent clone U79745. All 11 Affymetrix probes from probe set 207038_at were verified in RefSeq NM_004694 and the GenBank mRNA for U79745. Probe sets that matched Agilent cDNA probe sequences were defined as unverified when no Affymetrix probe was RefSeq verified. For example, we identified a match between probe set 215264_at on the U133A platform and Agilent clone X68879. None of the probes from probe set 215264_at was contained in a RefSeq. As described in EXPERIMENTAL PROCEDURES, log base 2-transformed Affymetrix expression ratios were correlated with sequence-matched cDNA expression ratios. We used MAS5, RMA, and dChip expression values to control for any bias introduced by various data processing algorithms.
Again, significantly increased correlations were consistently observed for sequence-verified probe sets (Table 2 and Supplemental Fig. S3B). For example, P values for the Pearson correlation coefficients in the U95A to Agilent comparison were 0.003 for MAS5, 0.022 for RMA, and 0.019 for dChip. Spearman correlation coefficients for U133A to Agilent were equivalent to Pearson correlation coefficients, with similar P values. However, when comparing data from U133B to Agilent, Spearman correlation coefficients were dramatically lower and not significantly increased for verified probe sets. This suggests poorer overall quality measurements on the U133B subarray, with improvements due to sequence verification being insufficient to reach significance.
Diagnostic Accuracy of Verified Probe Sets
In addition to the value of expression profiling in understanding biological mechanism, this technology has been used as a predictive measure for defining disease states. One limitation of this application has been the inability to translate predictive value between complementary experiments (31). The goal of any cancer classification study is to uncover shared biology that can be used to identify additional cases of cancer tumorigenesis or metastasis. Moreover, the most basic test of any discrimination method should involve the detection of a difference between diseased and normal samples. We used these assumptions to test the effects of probe sequence accuracy in data from two independent breast cancer expression profiling studies performed on the U95A platform (30, 38) (10). Note that we did not address the results from these individual studies but investigated the effects of probe sequence verification upon the ability to compare multiple patient-related data sets.
To minimize artifacts arising from the experiments being performed at separate facilities (handling, hybridization conditions, scanner settings, etc.), we generated a single expression matrix using RMA. For unsupervised hierarchical clustering, we identified the 1,000 probe sets that exhibited the largest standard deviation relative to their mean intensity. This filtering strategy was used to minimize the effects of noise or artifacts. Cluster analysis was then performed (Supplemental Fig. S4) for 1) all 1,000 probe sets, 2) the subset of these 1,000 probe sets that were unverified, and 3) the subset of these 1,000 probe sets that were sequence verified. When all 1,000 probe sets were used for clustering, the samples separated into two major nodes, each predominantly composed of samples from either study alone. Moreover, the normal samples were split into three separate groups. This suggested the noise in the system was much greater than the captured underlying biology. Clustering of the samples using only the unverified probe sets produced similar results. However, clustering of the samples using only the verified probe sets produced a striking improvement. First, all normal samples clustered tightly, separated from a majority of tumors. Additionally, although samples still separated into two major nodes, there was significant mixing of the two nodes with samples from both studies. The observed increase in diseased sample overlap (shared biology) and grouping of normal samples as highly similar indicates that restricting data to sequence-verified probes can improve the diagnostic power of microarray technology. This result does not address a particular classification scheme but indicates that removing unverified probe sets allows for the major component of change to be related to the underlying biology of breast cancer as opposed to the source of the experiments.
DISCUSSION
As probe sequence inaccuracies would seem to be a likely source for measurement error, we directly tested this possibility. Within large technical replicate experiments (Table 1, and Supplemental Tables S2 and S3), across multiple generations of Affymetrix platforms (Table 2 and Supplemental Table S4) and across Affymetrix and (Agilent) cDNA technologies (Table 2), we consistently found significantly increased measurement accuracy for sequence-verified probe sets. We classified probe sets as completely verified, partially verified, or completely unverified using the individual probe verification information (Fig. 1, Supplemental Fig. S1, and Supplemental Table S1). For analysis of measurement accuracy, we considered probe sets verified if they included a single verified probe, as it is likely that the RefSeq database is not exhaustive and some transcript sequences are truncated. Further limiting verified probe sets to completely verified probes would likely increase the benefit of sequence verification but at the cost of excluding another 10-20% of the data set. We tested measurement accuracy for verified probe sets and compared them to the accuracy of unverified probe sets; this allowed for a quantification of "quality" and statistical comparison of these independent sets. Compared with the entire data set (all probe sets), sequence-verified probe sets consistently showed higher correlations. Therefore, the improvement in data accuracy, when filtering for verified probe sets, is directly related to the number of unverified probe sets. Although the benefit of sequence verification can be either modest or large, dependent upon the platform used, the rationale for including data derived from inaccurate probes is unclear. The logical conclusion of these studies is that interrogation of the most accurate sequences generates the most accurate data.
The utility of using multiple probes (probe sets) for monitoring individual gene expression has been refined by the implementation of probe-level normalization methods such as those utilized by dChip and RMA (12, 16). In theory, inaccuracy of individual probe measurements could be compensated for by these algorithms. However, our data clearly show that these methods do not completely compensate for probe sequence inaccuracies and that sequence verification adds additional benefit to microarray data analysis. Another simple explanation for the benefit of sequence verification would be that unverified probe sets showed lower signal intensities, due to failure to accurately measure any transcript, and that removal of unverified probe sets was equivalent to removing low signal intensity measurements. Others have shown a dependency of measurement accuracy upon signal intensity (24). Indeed, signal intensities for unverified probe sets were less than those for verified probe sets. However, verified probe sets showed greater measurement precision than unverified probe sets, even after removing low signal intensity probe sets (Supplemental Tables S3 and S4). Furthermore, although verified probe sets were more often scored "present" by the MAS5 detection call, nearly half of all unverified probe sets were also present (Supplemental Table S5). Although there is a relationship between verified probe sequences and signal intensity, these data reveal one limitation of simply removing probe sets with low signal intensities from microarray data and further show the benefit of sequence verification independent of thresholding for signal intensity or detection call.
Importantly, we have shown that the benefit of probe sequence verification extends beyond controlled in vitro experimental samples to potential diagnostic and predictive applications of microarrays (Supplemental Fig. S4). Unsupervised clustering of two independent human breast cancer data sets resulted in nearly complete separation of all patients from each study. A possible explanation could be that the breast cancer entities are distinct and related to different, regional biological mechanisms. A more plausible explanation for this result is that the signal-to-noise ratio of the data is insufficient to reliably capture the consistent underlying biology. Our data clearly show that sequence-verified probe sets improve the capture of the underlying biology as evidenced by the improved grouping of normal samples and improved mixing of samples from the two data sets. Unfortunately, since no clear subclassification for breast cancer exists, it is difficult to prove the clustering is "better." However, the increased clustering of normal samples, which represent a distinct group, using verified probes supports this conclusion.
We are not the first to use probe sequence-based information to assess microarray data accuracy. For instance, Tan et al. (34) reported increased consistency of replicate measurements across Affymetrix, Agilent, and Amersham technologies. Sorlie et al. (31) used UniGene-matched probes to combine information from Affymetrix and Stanford cDNA microarrays in an effort to improve breast cancer classification. In all cases, postexperiment filtering using sequence information improves data quality. As combining data from multiple microarray platforms/technologies is certain to prove a common method, our results showing increased accuracy of sequence-verified probes across platforms (oligo vs. oligo and oligo vs. cDNA) substantiate the importance of using the most reliable information to verify equivalence of measurement across technologies. This can be facilitated by using the probe mapping files available at http://lungtranscriptome.bwh.harvard.edu, which include lists of verified and unverified probe sets for each Affymetrix platform described in this study, as well as additional information (Supplemental Fig. S2) regarding the location of individual probes within RefSeqs. Alternatively, we encourage end-user verification with the most recent, publicly available sequence information.
GRANTS
ACKNOWLEDGMENTS
Present address of D. Z. Wetmore: Neuroscience Graduate Program, Stanford University School of Medicine, Stanford, CA 94305.
FOOTNOTES
Address for reprint requests and other correspondence: T. J. Mariani, Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Harvard Medical School, 75 Francis St., Boston, MA 02115 (E-mail: tmariani{at}rics.bwh.harvard.edu).
10.1152/physiolgenomics.00066.2004.
1 The Supplementary Material for this article (Supplemental Tables S1-S5 and Supplemental Figs. S1-S4) is available online at http://physiolgenomics.physiology.org/cgi/content/full/00066.2004/DC1.
REFERENCES