INTRODUCTION
The recent availability of complete genomic sequences and/or large
numbers of cDNA clones from model organisms, coupled with technical
advances in DNA arraying technology, has made it possible to study
genome-wide patterns of gene expression. Most high density microarray
experiments fall into one of two types: those examining changes in gene
expression over a temporal or treatment gradient (1) and those comparing gene
expression between two different cell sample types or genotypes (2-5).
Fluorescently or isotopically labeled cDNA or RNA probes are
hybridized to high density arrays of cDNA clones on glass supports
(1-4, 6-8), nylon membranes (9-15), or oligonucleotides directly
synthesized on silica wafers (16, 17). Signals are quantified using
phosphorimaging, photomultiplier tubes, or CCD imaging, and a data set
is created that consists of expression measurements for all of the
elements of the array.
Despite rapid technological developments, the statistical tools
required to analyze these fundamentally different types of DNA
microarray data are not in place. Data often consist of expression measures for thousands of genes, but experimental replication at the
level of single genes is often low. This creates problems of
statistical inference, because many genes will show fairly large
changes in gene expression purely by chance. Therefore, to
interpret data from DNA microarrays it is necessary to employ statistical methods capable of distinguishing chance occurrences from
biologically meaningful data.
The t test can be used to determine whether the observed
difference between two means is statistically significant (18). The
t test incorporates a measure of within treatment error into the statistical test; as a result only genes showing a large change in
gene expression relative to the within treatment variance are considered to have significantly changed. In a perfect world, all DNA
microarray experiments would be highly replicated. Such replication
would allow accurate estimates of the variance within experimental
treatments to be obtained, and the t test would perform well. However, samples may be available in limited supply, and DNA
microarray experiments are expensive and time consuming to carry out.
As a result, the level of replication within experimental treatments is
often low. This results in poor estimates of variance and a
correspondingly poor performance of the t test itself.
An alternative to the t test is to ignore the within
treatment variance and only look at fold change as a proxy for
statistical significance. Intuition suggests that larger observed fold
changes can be more confidently interpreted as a stronger response to the experimental manipulation than smaller observed fold changes. However, an implicit assumption of this reasoning is that the variance
among replicates within treatments is the same for every gene. In
reality, the variance varies among genes (e.g. see Fig. 2),
and it is critical to incorporate this information into a statistical
test. These different approaches demonstrate the statistical problem
faced when analyzing DNA microarray data. Ignoring the sampling
variance is incorrect, yet incorporating it in the traditional manner
may not be much better because the number of replicate experiments is
often quite low.
Although there is no substitute for experimental replication,
confidence in the interpretation of DNA microarray data with a low
number of replicates can be improved by using a Bayesian statistical
approach (19) that incorporates prior information about within treatment
measurement error. The Bayesian prior assumes that genes of similar
expression levels have similar measurement errors, amounting to
parallel pseudo-replication of the experiment. For example, the
variance of any single gene can be estimated from the variance of a
number of genes of similar expression level. More specifically, the
variance of any gene within any given treatment can be estimated by a
weighted average of a prior estimate of the variance for that gene and
the observed gene-specific variance.
This weighting factor, or hyperparameter, is controlled by the
experimenter and will depend on how confident the experimenter is that
the background variance of a closely related set of genes approximates
the variance of the gene under consideration. In the Bayesian approach
employed in this study, the weight given to the within gene variance
estimate is a function of the number of observations contributing to
that value. This leads to the desirable property of the Bayesian
approach converging to the t test as the experimenter
carries out additional replications and thus becomes more confident in
the observed estimate of within treatment variance.
Commonly used software packages are poorly suited for implementing the
Bayesian statistical methods we develop in this work. However, we have
created a program, Cyber-T, which accommodates this approach. Cyber-T
is available for on-line use at the genomics web site at the University
of California at Irvine. This program is ideally suited to experimental
designs in which replicate control cDNA samples are being compared
with replicate experimental cDNA samples.
In this study we use the statistical tools incorporated into
Cyber-T to compare and analyze the gene expression profiles obtained from a wild-type strain of Escherichia coli and an otherwise
isogenic strain lacking the gene for the global regulatory protein,
integration host factor
(IHF), previously reported
by Arfin et al. (5). We apply different statistical methods
for identifying genes showing changes in expression to this data set
and show that a Bayesian approach identifies a stronger set of genes as
being significantly up- or down-regulated based on our biological
understanding of IHF regulation. We show that commonly used approaches
for identifying genes as being up- or down-regulated (i.e.,
simple t test or fold change thresholds) require more
replication to approach the same level of reliability as Bayesian
statistical approaches applied to data sets with more modest levels of
replication. We further show that statistical tests identify a
different set of genes than those based on fold change and argue that
the set of genes identified by fold change is more likely to harbor
experimental artifacts.
MATERIALS AND METHODS
Statistical Framework--
As data are commonly
background-subtracted, zeros in the input file do not actually
represent no expression but instead an expression level that escapes
detection. A decision may be made to statistically analyze only genes
for which a predetermined number of replicates have above zero
expression level estimates. Cyber-T allows the user to define a
minimum number of replicates required to
perform the t test. Depending on this minimum and the
nature of the data, different statistical tests are carried out. For
example, when fewer than the minimum number of non-zero expression levels are measured
for both the control and experimental treatments, no statistical test is
carried out. The other two cases are detailed below.
In the case of the minimum or greater number of non-zero observations for both the
control and experimental treatments, a two-sample t test is carried out (see Equation 9.2 in Ref. 18). In the case of equal variances and sample sizes for each of the control and experimental treatments, the t statistic reduces to
    t = \frac{\bar{x}_E - \bar{x}_C}{\sqrt{2\sigma^2/n}}    (Eq. 1)

where n is the number of replicates within each of
the control and experimental treatments, σ is the standard deviation
in expression level observed within each treatment, and
x̄_C and x̄_E are the mean expression
levels of the control and experimental treatments, respectively. It can
be seen from the above expression that large differences in expression
level only result in large values of the t statistic (and
hence reach statistical significance) when the within treatment
variance is small. This test has intuitive appeal because large
differences in expression are unlikely to be of biological significance
(or at least less worthy of further investigation) if they are
inherently unreplicable.
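To make Eq. 1 concrete, the following R sketch computes the statistic and a two-sided p value for one gene. This is a minimal illustration, not Cyber-T's own code; the vectors control and experimental are hypothetical replicate measurements, and the equal within-treatment variances assumed by Eq. 1 are estimated by pooling.

    ## Eq. 1: equal replication n in each treatment, variances pooled
    eq1.t <- function(control, experimental) {
      n <- length(control)                           # equal n in both treatments
      s2 <- (var(control) + var(experimental)) / 2   # pooled within-treatment variance
      t.stat <- (mean(experimental) - mean(control)) / sqrt(2 * s2 / n)
      c(t = t.stat, p.value = 2 * pt(-abs(t.stat), df = 2 * n - 2))
    }

    ## Usage: four hypothetical replicates per treatment for one gene
    eq1.t(control = c(210, 195, 202, 220), experimental = c(340, 310, 355, 330))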
When the control treatment has fewer than the minimum number of
non-zero observations, but
the experimental treatment has the minimum or greater number of
non-zero observations, a
different statistical test is carried out. Data of this form correspond
to the biological situation of a gene that is expressed at an
undetectable level in the control treatment but induced by the
experimental treatment. In this case a t statistic is
constructed that has the following form.
    t = \frac{\bar{x}_E - c}{\sigma_E/\sqrt{n}}    (Eq. 2)

where n is the number of replicates within the
experimental treatment, σ_E is the standard deviation in
expression level observed within the experimental treatment,
x̄_E is the mean expression level of the
experimental treatment, and c is a constant whose definition depends
on the number of non-zero observations in the control treatment. If
there are any non-zero control observations, then c is equal to the
mean of the non-zero control observations. Otherwise c will reflect
the smallest expression level reliably detected. That is, we test
whether the observed experimental expression level is greater than the
smallest expression level that can be experimentally detected. The
smallest expression level that can be experimentally detected is
defined as the average of the two 0.25% quantiles associated with the
average non-zero control and experimental expression levels over all
genes on the array. Less precisely, c equals the expression level
above which 99.75% of the average expression levels fall. In the case
where the experimental treatment appears greatly repressed relative to
the control treatment, the statistic is appropriately modified.
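Under the definitions above, the detection limit and the Eq. 2 statistic could be computed along the following lines in R. The function and variable names (detection.limit, eq2.t, ctrl.means, exp.means) are ours, not the program's, and the data are hypothetical.

    ## c: average of the two 0.25% quantiles of the per-gene average
    ## expression levels over all genes on the array
    detection.limit <- function(ctrl.means, exp.means) {
      mean(c(quantile(ctrl.means, 0.0025), quantile(exp.means, 0.0025)))
    }

    ## Eq. 2: one-sample t of the experimental replicates y against the constant c
    eq2.t <- function(y, c0) {
      n <- length(y)
      t.stat <- (mean(y) - c0) / (sd(y) / sqrt(n))
      c(t = t.stat, p.value = 2 * pt(-abs(t.stat), df = n - 1))
    }

    ## Usage with hypothetical per-gene averages and one induced gene
    ctrl.means <- runif(2000, 10, 1000)
    exp.means  <- runif(2000, 10, 1000)
    eq2.t(y = c(95, 120, 104, 88), c0 = detection.limit(ctrl.means, exp.means))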
Generally high density array experiments are replicated only a few
times. In this case the estimate of σ² obtained may be a
poor estimate of the true variation present among replicates within
treatment. It is possible to improve upon the poor performance of the
t test through the use of Bayes' formula, which gives the
probability P(M|D) of a model M conditional on the observed data D:

    P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}    (Eq. 3)
At its most basic level, the Bayes formula can be used to
calculate the posterior probability of any model or hypothesis in light
of the data and any prior information one may have. In the case of high
density array data it is convenient to model expression levels using a
Gaussian probability distribution function of unknown mean and
variance. We use the corresponding self-conjugate prior (self-conjugate
refers to the case where both the prior and the posterior have the same
functional form) of the Gaussian distribution with unknown mean
(µ₀) and variance (σ₀²). Without going into the mathematical details, which will be presented elsewhere (20), it can be shown that the posterior estimate of the
within sample variation is as follows.
    \sigma^2 = \frac{\nu_0\sigma_0^2 + (n-1)s^2}{\nu_0 + n - 2}    (Eq. 4)

where s² is the actual estimate of within
treatment among replicate variation, n is the number of
replicates, and ν₀ is a weighting factor which
scales with confidence in the prior (the hyperparameter). That is, the
variance is estimated as if one had, in addition to the sample square
deviation, a set of ν₀ observations with variance σ₀²
(21). There are a number
of reasonable choices for σ₀², and we
choose to estimate σ₀² based on a
local average of the standard deviations for genes showing similar
within treatment expression levels to the gene under consideration.
Local averaging is carried out by ordering all genes within a given
treatment based on their average expression level and then taking the
average standard deviation as the standard deviation observed for any given gene and the k next higher and lower expressing genes
(where k is a user defined constant). This local averaging
is carried out using a C function described by Spector (22) that
accounts for "edge" effects. The t tests employed are
the same as those described above with the Bayesian estimate of σ²
substituted for s².
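A minimal R sketch of this procedure follows, assuming the Eq. 4 shrinkage and a simple rank window truncated at the edges (Cyber-T itself uses the C routine of Spector (22)). The names bayes.sd, expr, win, and v0 are ours, not those of the hdarray library.

    ## expr: genes-by-replicates matrix for one treatment;
    ## win: window size w; v0: hyperparameter
    bayes.sd <- function(expr, win = 101, v0 = 10) {
      m <- rowMeans(expr)
      s <- apply(expr, 1, sd)
      n <- ncol(expr)
      ord <- order(m)                  # rank genes by average expression level
      k <- (win - 1) / 2
      prior.sd <- numeric(length(s))
      for (i in seq_along(ord)) {      # running mean of sds over the window,
        lo <- max(1, i - k)            # truncated at the edges of the ranking
        hi <- min(length(ord), i + k)
        prior.sd[ord[i]] <- mean(s[ord[lo:hi]])
      }
      ## posterior (regularized) standard deviation, Eq. 4
      sqrt((v0 * prior.sd^2 + (n - 1) * s^2) / (v0 + n - 2))
    }

    ## Usage: hypothetical 2000 genes x 4 replicates of log expression values
    expr <- matrix(rnorm(2000 * 4, mean = 8), ncol = 4)
    sd.reg <- bayes.sd(expr, win = 101, v0 = 10)

The regularized standard deviations returned would then replace the observed s in the t statistics above.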
Although there exist methods for choosing an optimal
hyperparameter, these
complex approaches are implemented neither here nor in the current version
of the Cyber-T software. Instead, a heuristic approach is taken whereby
the user can select the constant v0 representing
the total number of data points deemed necessary to compute reasonable
estimates of means and variances. In Cyber-T, the default value for
v0 is 10. If for a given gene n < v0 values are empirically measured, Cyber-T
complements these measurements with an additional
v0 "pseudo-observations" equal to the
background value derived by forming an average over neighboring genes.
Given the typical levels of replication we have seen in microarray
experiments (i.e. 2-5), we have found the default value for
the hyperparameter of 10 to work well. Analysis of actual data sets
(this paper) and simulated data sets shows that neither
doubling nor halving the default value of 10 has a great effect on the
subsequent ranking of results.
In the case of high density arrays a very large number of t
tests will be carried out on the data. Carrying out such a large number
of statistical tests will result in an elevated false positive rate if
the p value used for statistical significance is comparable with the value used when only a single t test is performed
(e.g. p < 0.05 or p < 0.01). To overcome this problem a statistical threshold must be set on
individual p values so that the experiment-wide false
positive rate is held constant at a fixed level. Such a correction
applied to individual p values is often referred to as a
Bonferroni correction. If one wishes to hold the experiment-wide false
positive rate to α_Bonf and carries out N
independent statistical tests, then the marginal p value
associated with any individual test must be set as follows.

    p \le \frac{\alpha_{\mathrm{Bonf}}}{N}    (Eq. 5)
In Cyber-T a slightly more stringent Bonferroni correction is
applied that requires the α_Bonf/N criterion to be met by the
t tests carried out on both the log-transformed and raw
data. Because a Bonferroni correction assumes p values for
different genes to be independent, it will often be overly
conservative. Nonetheless, it has utility when it is important to
demonstrate strict statistical significance.
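In R, Eq. 5 reduces to a one-line threshold. The vector p.values below is hypothetical; base R's p.adjust performs the equivalent adjustment on the p values themselves.

    p.values <- c(1e-6, 0.0004, 0.03, 0.2)   # hypothetical per-gene p values
    alpha.Bonf <- 0.05
    N <- length(p.values)
    p.values < alpha.Bonf / N                               # per-test threshold (Eq. 5)
    p.adjust(p.values, method = "bonferroni") < alpha.Bonf  # equivalent adjusted form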
Analysis of IHF Data--
The IHF data set is modified from that
published by Arfin et al. (5). The four replicates within a
given cell type are each averaged over two independent hybridizations
of the same RNA sample and two measures of intensity from each double
spotted membrane. Only genes for which all eight resulting measures of gene expression are above background are used (1973 genes in total). We
refer to the resulting eight measures of gene expression per gene as
1-8, where 1-4 are from IHF+ cells and 5-8 are from
IHF− cells. For the comparisons involving 2-fold
replication, all 12 possible 2 by 2 comparisons differing by at least
two replicates were carried out (i.e., 12v56, 12v78, 13v57,
13v68, 14v58, 14v67, 23v58, 23v67, 24v57, 24v68, 34v56, and 34v78). A
corresponding set of 9 possible 3 by 3 comparisons differing by at
least one replicate was also carried out (i.e., 123v567,
123v568, 123v678, 124v567, 124v568, 124v678, 234v567, 234v568, and
234v678). For both the 2 by 2 and 3 by 3 comparisons, the 120 genes
with the lowest p values were identified for seven
statistical tests (i.e., Bayesian t test, simple
t test, and ratio of means on the raw data; Bayesian t test, simple t test, and ratio of means on the
log-transformed data; and a difference of means for the log-transformed
data). In the case of the two Bayesian t tests three
different sliding window sizes (i.e. 41, 101, and 161) as
well as three different hyperparameter values (i.e., 2, 4, and 16) were tested in all possible combinations. The number of genes
commonly identified was tallied for each of the 66 possible
combinations of the 12 different 2 by 2 comparisons and for each of the
36 possible combinations of the 9 different 3 by 3 comparisons. For
each statistical approach the mean number of genes in common is
presented in Table I. With the exception of the two statistics based on
ratios of means (standard deviation for the ratio of raw means for 2 by
2, ratio of raw means for 3 by 3, ratio of mean of logs for 2 by 2 and
ratio of mean of logs for 3 by 3 were 25, 32, 26, and 13, respectively), standard deviations over comparisons were generally
small and between 7 and 13. The different combinations of window size
and hyperparameter seemed to have little effect on the consistency of
the Bayesian approach so only outcomes corresponding to the best and
worst parameter combinations are presented in Table I. The comparisons among statistical approaches presented in Table II were generated in a
manner similar to those described above.
To identify those genes differentially expressed in strains IH100
(IHF+) and IH105 (IHF
) most likely as a
direct consequence of IHF-mediated effects on transcription initiation,
500 base pairs upstream of the ORF for each of these genes were
examined for high affinity IHF-binding sites. To identify these sites
we demanded a 12 of 13 match with the core consensus sequence
(5'-WATCAANNNNTTR-3', where W = A/T and R = purine) and
required that at least 10 of 15 of the base pairs immediately upstream
of the core sequence were AT base pairs (5). These are very stringent
criteria that certainly identify high affinity IHF-binding sites; in
fact, they exceed the criteria for several documented high affinity
IHF-binding sites. Also, IHF-binding sites that perform a DNA looping
function located farther than 500 base pairs upstream of an ORF would not be
identified in our search. Nevertheless, these stringent criteria have
been used in an earlier report to identify IHF-binding sites upstream of 46 operons demonstrated to be regulated by IHF (5).
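A sketch of this screen in R might look as follows. It is our own illustration of the stated criteria, not the code used by Arfin et al. (5); iupac.match and has.ihf.site are hypothetical names.

    ## IUPAC codes used in the consensus: W = A/T, R = purine (A/G), N = any base
    iupac.match <- function(base, code) {
      switch(code,
             N = TRUE,
             W = base %in% c("A", "T"),
             R = base %in% c("A", "G"),
             base == code)
    }

    ## region: upstream sequence; TRUE if any 13-bp window matches the core
    ## consensus at >= 12 of 13 positions with >= 10 of 15 AT bp just upstream
    has.ihf.site <- function(region, core = "WATCAANNNNTTR") {
      seq.v  <- strsplit(toupper(region), "")[[1]]
      core.v <- strsplit(core, "")[[1]]
      k <- length(core.v)
      for (start in seq_len(length(seq.v) - k + 1)) {
        window <- seq.v[start:(start + k - 1)]
        hits <- sum(mapply(iupac.match, window, core.v))
        if (hits >= 12 && start > 15) {           # need 15 bp of upstream flank
          flank <- seq.v[(start - 15):(start - 1)]
          if (sum(flank %in% c("A", "T")) >= 10) return(TRUE)
        }
      }
      FALSE
    }

    ## Usage on a hypothetical upstream sequence
    has.ihf.site("AATTATTTAAATTTTAAAAATCAATTTTTTAGCG")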
Computer Software--
Cyber-T is designed to accept data in the
large data spreadsheet format that is generated as output by the
software typically used to analyze array experiment images. An element
may correspond to a single spot on the array (typical of membrane- or
glass slide-based arrays) or a set of spots (typical of GeneChips (16,
17) designed to query labeled RNA). This data file is uploaded to
Cyber-T using the "Browse" button in the Cyber-T browser window.
Fig. 1 shows a typical Cyber-T window.
This window will vary somewhat depending on the exact analysis that the
user wishes to carry out. Detailed instructions for using Cyber-T are
accessed from this web page. We briefly describe the use of Cyber-T
below.
After uploading the data file, the user defines the columns on which
analysis will be performed. One button allows the user to define a
Bonferroni significance level: the probability of a single gene being
identified as significantly different between the control and
experimental treatments by chance alone given the number of genes
examined. A second button allows the user to define an integer
"confidence" given to the Bayesian prior estimate of within
treatment variance (referred to as the Bayesian prior). Larger
confidence gives greater weight to the Bayesian prior and smaller
confidence gives greater weight to the experimentally observed variance
for the particular gene being tested. We have observed satisfactory
performance when the actual number of observations and the confidence
value total approximately ten. A third button allows the user to define
the size of the window over which the Bayesian prior is
calculated. If a window of size w is selected, the Bayesian
prior for a given gene is estimated as the average standard deviation
of a window of w genes ranked by expression level, centered
on the gene of interest. Window sizes larger than 100 genes often
perform well. Additional buttons exist to control data formatting (Fig.
1).
Cyber-T generates three output files, two of which (allgenes.txt and
siggenes.txt) can either be viewed in the browser window or downloaded
and imported into a spreadsheet application for user-specific
formatting. The file siggenes.txt contains a set of columns similar to
those of allgenes.txt except only those genes, if any, that pass a
Bonferroni test are listed. These files return the original data and a
number of additional columns containing summary statistics such as the
mean and standard deviation of both raw and log-transformed data,
estimates of the standard deviations employing the Bayesian prior,
t tests incorporating the Bayesian prior on both the raw and
log-transformed data, p values associated with t
tests, and "signed fold change" associated with the experiment. The
exact content of these files is detailed on line. In addition to
allgenes.txt and siggenes.txt, an additional postscript file, CyberT.ps, is generated. CyberT.ps contains a self-explanatory set of
six graphs useful in visualizing the data submitted to the program
(A-F of Fig. 2 contain
examples of these plots). In some cases, two additional graphs are
generated (not shown) that plot the p values of genes that
pass the Bonferroni test against their absolute fold change in
expression. These plots are suppressed if only a few genes pass the
Bonferroni test, which will often be the case.

Fig. 2.
An example of the graphical output generated
by Cyber-T. The data presented are described in the text. In all
cases experimental refers to IHF− E. coli cells and control refers to IHF+
E. coli cells. B-D are analyses on raw data,
whereas the remainder of the panels are analyses carried out on
log-transformed data. Log transformations of raw data generally change
both average expression level and the standard deviation in expression
over replicates. E and F incorporate the Bayesian
approach described in the text to stabilize estimates of the within
gene standard deviation, whereas panels G and H
do not incorporate a Bayesian prior. The output is essentially that
generated automatically by the Cyber-T program, except the axis labels
and figure titles have been edited and/or rescaled.
All statistical analysis is carried out using the
hdarray library of functions written in R. R is a
freely available statistical analysis environment adhering to the Open
Source development model. The hdarray functions are normally invoked
through the Cyber-T web-based interface but can also be used directly
in an X-Window session. A brief tutorial on how to analyze data
directly in R is available at the University of California at Irvine
genomics web site. This tutorial also lists the functions available as part of the hdarray library and other R resources. The library and
Cyber-T web interface also include routines for analyzing paired
samples, which would be produced from two-dye glass slide DNA
microarray experiments (1-4, 6-8).
RESULTS
An Application of Cyber-T to Microarray Data--
Cyber-T
was used to analyze a data set that examined gene expression levels in
IHF mutant and wild-type strains of E. coli (5). This data
set was obtained from 33P-labeled cDNA reverse
transcribed from total RNA with random hexamer oligonucleotides and
hybridized to commercially available nylon arrays obtained from Sigma
Genosys. These arrays are spotted in duplicate with each of the 4,290 predicted E. coli ORFs. This is an excellent data set for
examining some of the statistical inferences that can be drawn from
high density arrays, because four separate RNA preparations from each
of the IHF+ and IHF− strains were
independently hybridized to two separate arrays for each independent
experiment. Cyber-T was used to analyze an edited IHF data set
consisting of the 1973 genes for which an above background signal was
detected in all eight replicates of the experiment.
We first applied the simple t test to the IHF data set and
identified 128 genes significant at p < 0.01 (properties of the genes identified are discussed below). We then
employed a Bayesian analysis with a sliding window size of 101 and a
Bayesian confidence value of 10. The Bayesian analysis resulted in the
detection of 68 genes that showed a significant change in expression at
p < 0.01 (not corrected for multiple tests). Of these
genes the smallest fold change detected (1.9-fold) was larger than the
smallest fold change detected by the simple t test
(1.3-fold), indicating that the Bayesian approach is slightly more
conservative. It is interesting to note that genes showing very high
fold changes in expression often showed little statistical support
associated with their expression change when using the Bayesian
approach. For example, one gene (b2734) showed an 8-fold reduction in
expression but only had a p value of 0.48. Large
nonsignificant fold changes in expression often result from a single
extreme observation likely to be an experimental artifact. The 128 genes with the smallest p values identified using the two
methods have 86 genes in common. This indicates that although the two
statistical approaches are consistent in identifying a large number of
genes, there are also many differences in the set of genes identified
(~33%). We have found many instances where the simple t
test identifies genes as showing significant changes in expression when
the fold change is quite modest. Generally these instances are
associated with "too small to be true" within treatment variance
estimates, reflecting the large sampling variance on the estimate of
the within treatment variance when the number of observations is small.
Bayesian estimates of the within treatment variation tend to reduce the
rate of this source of false positives.
Fig. 2 shows the plots that result from the Bayesian analysis of the
IHF data. A and B show plots of expression level
in control versus experimental for the raw and
log-transformed data. C-F show plots of the standard
deviation in expression level versus its mean over genes for
raw and log-transformed data for both the IHF+ and
IHF
cells. It can be seen that the variance in expression
level is a strong function of the mean. This is a justification for
carrying out statistical analyses on log-transformed data. After a log transformation the relationship between the mean and the standard deviation is somewhat uncoupled. The coefficient of variation, which is equal to the standard deviation divided by the mean expressed as a percentage, is a measure of the relationship between the mean and
the standard deviation, and a log transformation is often considered
desirable before statistical analyses when the coefficient of variation
is large. In the case of the data set of this paper, the coefficient of
variation is smaller for every gene for log-transformed (average
coefficient of variation = 4.5%) than for untransformed data
(average coefficient of variation = 38%). Once the data are log-transformed the standard deviation in gene expression is larger for
lowly expressed genes than for highly expressed genes. This relationship makes biological sense because it is difficult to accurately measure and quantify genes showing very low expression levels using high density arrays. Based on our experience with other
data sets, the strong relationship between the variance and the mean
expression levels are not peculiar to the radioisotope/filter array
technology employed here but appear to be a general feature of DNA
microarray experiments.
G and H of Fig. 2 show plots of the standard
deviation as a function of the mean on the log-transformed data without
incorporating the Bayesian prior. In comparing G and
H with E and F it can be seen that
Bayesian conditioning reduces the large stochastic variance in within
treatment estimates of the standard deviation as evidenced by the
spread of the points.
A Comparison of Statistical Approaches Used to Identify Genes
Showing Changes in Gene Expression--
The primary objective of
hypothesis testing is to strike a balance between minimizing the
likelihood of detecting false positives while maximizing the likelihood
of detecting true positives. If high density array data sets existed in
which the true positives and false positives were known, then the
evaluation of different statistical methods for detecting these genes
would be straightforward. Such data sets do not exist. An alternate
means of assessing the appropriateness of a statistical method is to
measure its consistency over independent realizations of an experiment.
In the case of the IHF data set we are able to compare the ability of
the different statistical methods to consistently identify the same set
of genes as being up- or down-regulated in different subsets of the
data. For the IHF data set there are 12 possible IHF−
versus IHF+ comparisons of size two replicates
and 9 possible comparisons of size three replicates. For each
comparison we sorted all genes by the p value associated
with the statistical test (or ratio score in the case of the ratio
tests) and examined the top 120 genes. For each comparison we were able
to measure the consistency in identifying the same set of genes as
being up- or down-regulated over a number of quasi-independent
comparisons. In the case of the comparisons of sample size two, 66 measures of consistency were obtained from comparison pairs having no
more than two replicates in common. Similarly, in the case of the
comparisons of size three, 36 measures of consistency were made from
comparisons having one replicate in common.
Table I summarizes the consistency of the
different statistical approaches. As might be expected additional
replications of an experiment result in greater consistency at
identifying the same set of genes as being up- or down-regulated. Log
transformations of expression data result in more consistent
statistical inferences, except for statistical tests that were already
consistent, in which case a log transformation of the data makes little
difference. In the case of little experimental replication (the
comparisons of size two) both the simple t test and commonly
used ratio tests performed poorly, identifying the same set of
genes only 25% of the time. In the case of little replication, Bayesian
statistical approaches performed the best, although a modified ratio
approach based on the difference of the logs was almost as good as the Bayesian approach. With additional replication, all tests performed better with the exception of ratio tests based on raw data (which performed poorly). The disparity between the poorly performing tests
(the t test and ratio of logs) and the better performing tests (Bayesian and difference of logs) was lessened by increasing replication. It is important to note that even in the best case the
average consistency was only about 75%, adding credence to the idea
that high density array experiments with modest levels of replications
are still largely an exploratory tool.
Table I
Average number of genes of the 120 with the smallest p values
identified in common based on analyzing subsamples of the IHF data set
For the Bayesian analysis the number of commonly identified genes is
given only for the best and worst window size and hyperparameter
combination.
Although the consistency of different statistical approaches appears
fair with modest levels of replication, it should be noted that the
different statistical approaches are not necessarily consistently
identifying the same set of genes. Table
II lists the number of genes of the 120 co-identified using the seven different statistical approaches, when
all four replicates of the IHF data set are considered. It is apparent
from this table that the different statistical approaches based on the
t test (i.e., the simple t test and the t test with a Bayesian prior on the variance, on both
log-transformed and raw data) converge to a similar set of genes as the
amount of replication increases. Statistical approaches based on the t test have 73% of their genes in common on average. For
the t test-based tests the decision to log transform the
data has less of an effect on the number of genes shared than the
decision to incorporate a Bayesian prior. Similarly, statistical tests
based on ratios of the means (ratio of the means of raw and
log-transformed data and the difference of the means of log-transformed
data) also converge on a similar set of up- or down-regulated genes, having 77% of their genes in common on average. Of the three
ratio-based tests, the two tests based on a log transformation of the
data appear to identify more genes in common than the test based on the
raw data. It is important to note that the tests based on the
t test and those based on ratios do not necessarily converge on the same set of genes. Comparisons among approaches yield only 54%
of their genes in common on average. It seems likely that the tests
based on ratios are consistently identifying a set of genes with a
large fold change on average but a great deal of variation among within
treatment replicates.
Table II
The number of genes identified in common by the different statistical
approaches for the entire IHF data set
The Bayesian Approach Allows the Identification of More True
Positives with Fewer Replicates--
The data in Table I show that
additional experimental replications result in the identification of a
more consistent set of up- or down-regulated genes and that the
Bayesian statistical approach identifies a more consistent set than a
simple t test. The natural question that arises is whether
these genes are true positives. That is, whether these are
IHF-regulated genes. This question is addressed by the data shown in
Fig. 3. Here we define genes whose
differential expression is likely to be due to a direct effect of IHF,
and therefore a true positive, as those genes that possess a documented
or predicted high affinity IHF binding site within 500 base pairs
upstream of the ORF for each gene, or the operon containing the gene
(see "Materials and Methods"). Of the120 genes differentially
expressed between IHF+ and IHF
strains with
the lowest p values identified by a simple t test based on two experiments (2 by 2 t test), 51 genes contain
an upstream IHF site, whereas the Bayesian analysis of these same two
data sets (2 by 2 Bayesian) identifies 59, or 15% more genes with
upstream IHF sites. Furthermore, comparison of the differentially expressed genes identified by the 2 by 2 t test or the 2 by
2 Bayesian to the differentially expressed genes identified by a simple
t test performed on four experimental data sets (4 by 4 t test) shows that the 2 by 2 Bayesian analysis also
identifies 15% more genes (38 versus 33) in common with the
genes identified with the 4 by 4 t test. The same results
are observed when the log-transformed data are compared in this way.
These data illustrate the utility of the Bayesian approach; replicating
an experiment 2-fold and performing a Bayesian analysis is comparable
in inference to replicating an experiment 4-fold and using a
traditional t test.

Fig. 3.
Analysis of IHF data with and without
Bayesian treatment. The numbers in
parentheses represent the number of the 120 genes
with the lowest p values that contain a documented or
predicted IHF binding site less than 500 base pairs upstream of each
ORF. In each case the raw or log-transformed data from two
(2 × 2) or four (4 × 4) independent
experiments were analyzed either with or without Bayesian
treatment.
DISCUSSION
At present, the significance of high density array data is often
judged solely on the basis of observed fold change in expression. An
arbitrary fold change threshold is created, with genes showing greater
change than that threshold declared significant and all others declared
nonsignificant. It is apparent from Fig. 2 that fold change in
expression may not be a good proxy for statistical significance. If
analyses are carried out on nontransformed data, any given threshold of
fold change in expression will be liberal for genes expressed at a high
level and conservative for genes expressed at a low level. Conversely,
if analyses are carried out on log-transformed data, the threshold will
be conservative for genes expressed at a high level and liberal for
genes expressed at a low level. In cases where either the control or
experimental observations are replicated, it is possible to assess the
significance of the difference between the control and experimental
data relative to the observed level of within class variation. This
results in smaller fold changes being significant for genes whose
expression levels are measured with great accuracy and large fold
changes being nonsignificant for genes whose expression levels cannot be measured very accurately.
Ignoring the variation among replicates, or not carrying out replication
at all and determining significance based solely on fold change, does not negate
the problems discussed above; it amounts to assuming that the variance among
replicates is equal for all genes. The support for an expression level
difference being meaningful relative to the observed variation within
treatments is conveniently represented by the t statistic
(see "Materials and Methods").
Observed differences between a single replication of a control
versus experimental treatment can be due to inadequately
controlled experimental factors as opposed to the experimental
condition itself. Examples of such variables may include small
differences in the time or method of harvesting cells, differences
between tissue samples not related to the experiment, and variation
induced by the RNA isolation or labeling protocol. For this reason,
replicates of high density array experiments are particularly useful.
Replication will increase the likelihood of detecting subtle changes in
expression between treatments while decreasing the likelihood of false
positives. In an ideal world, high density array experiments would be
replicated a "large" number of times (e.g. >10), and
the t statistic would measure the relative support for a
difference between control and experimental treatments being due to
chance alone. In practice, high density array experiments are rarely
replicated this many times, and the sampling variance on the
t statistic is therefore quite large. In this context, using
only the t statistic (or a corresponding p value)
as a measure of whether or not a gene is significant can be misleading.
Nonetheless, we have found it useful to sort genes on p
values as an exploratory tool for identifying potentially interesting
genes (5).
A Bayesian approach to estimating the within treatment variation among
replicates has been implemented within Cyber-T. The use of the weighted
average of the "local" standard deviation for genes with similar
expression levels and the observed gene-specific standard deviation
stabilizes within treatment variance estimates. Increasing the
precision of variance estimates in both the control and experimental
treatments results in more stable t statistics. This allows
inferences to be drawn from high density array experiments that have
been carried out with nominal levels of replication. This is
demonstrated in Fig. 3 where statistical inference using the Bayesian
approach with only two replicates approaches that normally achieved
with more replication. There remains a possibility that different genes
of similar expression levels have widely differing true variances.
Under this possibility and the current prior, poorly replicable genes
will be falsely declared significant, and intrinsically highly
replicable genes will be falsely declared not significant. Although
this would represent an undesirable outcome, when experiments show
little replication the relative error in inference introduced by an
incorrect prior is likely to be less than the error in inference
introduced from very poor estimates of within treatment variance.
Ultimately, it will be important to derive empirical guidelines for the
determination of the correct hyperparameter to use in weighting the
prior information (i.e. the local average standard
deviation) relative to the observed within treatment estimate of
variation. It is possible that the best weighting will depend on
factors such as the biological system being studied, experimental
conditions employed, and high density array technology used.
Analyses in Cyber-T are performed on both log-transformed and
nontransformed data. Log transformations are carried out for three
reasons. First, in plots of raw data (Fig. 2A) many of the data points are clustered at the low end of the values. Plots of
log-transformed data tend to expand these low values and make them
easier to examine visually (Fig. 2B). Second, an assumption of the t test is that the variances of the two groups being
tested are equal. Although the t test is fairly robust with
respect to violations of this assumption (especially when the sample
sizes of the two groups are equal), if the variances of the two
treatments are widely different the statistical test for a difference
between means may not be valid. Often, unequal variances between
treatments result from the variance in a set of observations scaling
with their mean. Log transformations often reduce or eliminate this dependence. It can be seen from C and D of Fig. 2
that the variance in raw expression level is a function of the mean,
and in this case a log transformation may be appropriate. E
and F of Fig. 2 show that the dependence of the variance on
the mean is somewhat uncoupled following a log transformation.
Interestingly, in these plots it appears that the variance in
log-transformed expression levels is higher for genes expressed at
lower rather than at higher levels. These plots suggest that genes
expressed at low or near background levels may be good candidates for
ignoring in expression analyses. The variance in the measurement of
genes expressed at a low level is large enough that in many cases it is
difficult to detect significant changes in expression for this class of loci. Third, statistical tests of log-transformed data have an intuitive appeal. Exponentiating
the difference between the logs of two numbers recovers the ratio of the two numbers
(i.e., a/b = e^{\ln a - \ln b}). Thus a
test of the significance of the difference between the log expression
levels of two genes is equivalent to a test of whether or not their
fold change is significantly different from one.
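A small R check of this identity, with hypothetical replicate values:

    a <- c(400, 520, 460)   # hypothetical experimental replicates
    b <- c(110, 130, 120)   # hypothetical control replicates
    exp(mean(log(a)) - mean(log(b)))          # estimated fold change (ratio of geometric means)
    t.test(log(a), log(b), var.equal = TRUE)  # tests whether this fold change differs from one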
We have shown that statistical tests for changes in genes that
incorporate the within treatment variance and a Bayesian prior on the
estimate of the within treatment variance have a number of desirable
properties. They are generally more consistent than tests not employing
a Bayesian prior, implying that they give similar results when high
density array experiments are replicated. Tests incorporating a measure
of experimental error into the test statistic do not identify genes
showing large fold changes in expression that also show little
correspondence over within treatment replicates. The IHF data presented
in Fig. 3 suggest that a Bayesian statistical framework facilitates the
identification of more true positives and fewer false positives with
fewer replications. A primary deterrent to a more widespread adoption
of statistical approaches incorporating a Bayesian prior for the
analysis of high density array data is the lack of software that can
easily be used to carry out such analyses. We have implemented the
approaches described in this work and have created a simple-to-use web
interface that makes these tools widely available and accessible.
In summary, Cyber-T provides an easily accessible interface that allows
routine assessment of high density array data for statistical
significance. The incorporation of a Bayesian prior into the commonly
accepted t test allows statistical inferences to be drawn
from high density array data that are not highly replicated. Although it
is often difficult to achieve the levels of statistical significance
necessary to satisfy a stringent criterion for experiment-wide significance, the p values generated in Cyber-T can be used
to rank genes and determine those differences most likely to be real.