1 Department of Anatomy and Embryology, Academic Medical Center, University of Amsterdam, 1105 AZ, Amsterdam, the Netherlands
2 Bioinformatics Laboratory, Academic Medical Center, University of Amsterdam, 1105 AZ, Amsterdam, the Netherlands
3 Neurogenetics Laboratory, Academic Medical Center, University of Amsterdam, 1105 AZ, Amsterdam, the Netherlands
ABSTRACT
critical values; hypothesis test; two-sided test; library size; power; serial analysis of gene expression
SERIAL ANALYSIS OF GENE EXPRESSION (SAGE; 17) was introduced as a method to quantitatively analyze the differential expression of genes. The method has since been applied successfully to cells and tissues obtained from different developmental stages or resulting from a variety of pathological processes. The SAGE procedure results in a library of short tags, each representing an expressed gene. The main assumption in the interpretation of the data in this library is that every mRNA copy in the tissue has the same chance of ending up as a tag in the library. This selection of a specific tag sequence from the total pool of transcripts can be well approximated as sampling with replacement (15).
The aim of most SAGE studies is to identify genes of interest by comparing the number of specific tags found in two different SAGE libraries. In statistical terms, the aim is to reject the null hypothesis that the observed tag counts in both libraries are equal. Testing of this hypothesis is hampered by the fact that each SAGE library is only one measurement: the necessary information on biological variation and experimental precision is not available. Therefore, each of the published statistical tests for comparing SAGE libraries is based on its own assumptions about the statistical distribution of SAGE tags from which a measure of variance is obtained.
In comparing two SAGE libraries, a large number of pairwise tests, one for each specific tag, is performed. It is possible that most pairwise differences between two libraries are just the result of random sampling from two populations that do not differ. Therefore, before starting a pairwise comparison of specific tags in two libraries, the null hypothesis that the differences between libraries result from such a random sampling has to be rejected. A similar line of reasoning is applied in the comparison of the means of more than two groups: before a multiple comparison of groups can be carried out, an overall analysis of variance has to reject the null hypothesis that all groups originate from the same population (2). In the context of SAGE research, only one reference to such an overall test has been published (14). This overall test is based on a simulation of a large number of possible distributions of two libraries within the pooled marginal totals of the observed SAGE libraries. By calculating a chi-squared statistic for each simulated pair of libraries, a distribution of this statistic under the null hypothesis can be constructed. From this simulated distribution and the chi-squared statistic of the observed libraries, one can determine the probability of obtaining the observed tag distributions by chance. Rejection of the null hypothesis that all differences between SAGE libraries are just the result of random sampling then opens the way for pairwise comparisons.
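To make this overall test concrete, below is a minimal Python sketch of such a randomization test. The function name `overall_chi2_test` and the use of multinomial resampling from the pooled tag proportions are illustrative assumptions on our part; the published procedure (14) may differ in exactly how the simulated libraries are drawn within the marginal totals.

```python
import numpy as np

def overall_chi2_test(counts1, counts2, n_sim=10_000, seed=1):
    """Overall test: can all differences between two SAGE libraries be
    explained by random sampling from one pooled tag population?"""
    counts1, counts2 = np.asarray(counts1), np.asarray(counts2)
    N1, N2 = counts1.sum(), counts2.sum()

    def chi2(a, b):
        # Pearson chi-squared over all tags, expected counts from the margins
        tot = a + b
        e1 = tot * N1 / (N1 + N2)
        e2 = tot * N2 / (N1 + N2)
        mask = tot > 0
        return ((a - e1)[mask] ** 2 / e1[mask]).sum() + \
               ((b - e2)[mask] ** 2 / e2[mask]).sum()

    observed = chi2(counts1, counts2)
    p_tag = (counts1 + counts2) / (N1 + N2)   # pooled tag proportions
    rng = np.random.default_rng(seed)
    exceed = sum(
        chi2(rng.multinomial(N1, p_tag), rng.multinomial(N2, p_tag)) >= observed
        for _ in range(n_sim)
    )
    return (exceed + 1) / (n_sim + 1)         # P value from the simulated null
```

Only when this overall null hypothesis is rejected does it make sense to proceed to the pairwise tests discussed below.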
In the seminal paper of Velculescu et al. (17), tag numbers in different libraries are compared pairwise with a test based on a Monte Carlo simulation of tag counts. This test has been included in the SAGE software package SAGE300 (19). For each pairwise comparison of tags, SAGE300 estimates the chance of obtaining a difference in tag counts equal to or greater than the observed difference from the number of simulation trials needed to generate this difference 100 times. The resulting chance serves as P value in a one-sided test.
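The sketch below mimics this resampling logic. The function name `sage300_like_p`, the use of the pooled proportion as the null model, and the comparison of differences in proportions (rather than raw counts, so that unequal library sizes are handled) are our assumptions; the exact SAGE300 algorithm may differ in detail. Note also that the absolute difference is used here, which makes the sketch two-sided, whereas SAGE300 reports a one-sided P value.

```python
import numpy as np

def sage300_like_p(n1, N1, n2, N2, target_hits=100, max_trials=1_000_000, seed=1):
    """Monte Carlo pairwise P value in the spirit of SAGE300: simulate tag
    counts under the pooled null proportion until a difference at least as
    large as the observed one has occurred `target_hits` times, then
    estimate P as hits/trials."""
    rng = np.random.default_rng(seed)
    p0 = (n1 + n2) / (N1 + N2)          # pooled tag proportion under H0
    obs = abs(n1 / N1 - n2 / N2)        # observed difference in proportions
    hits = trials = 0
    while hits < target_hits and trials < max_trials:
        trials += 1
        s1 = rng.binomial(N1, p0)       # simulated specific-tag count, library 1
        s2 = rng.binomial(N2, p0)       # simulated specific-tag count, library 2
        if abs(s1 / N1 - s2 / N2) >= obs:
            hits += 1
    return hits / trials                # Monte Carlo P value
```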
In other papers dealing with SAGE, several pairwise test procedures have been proposed. Most of these tests have been incorporated into public database systems and analysis programs (5, 8, 10, 11, 13, 15). The test suggested by Madden et al. (11) is based only on the number of observed specific tags in each SAGE library, and the calculated statistic (Table 1) is compared with the normal distribution. Audic and Claverie (3) derived a new equation (Table 1) for the probability, P(n2|n1), of finding n2 tags in one library given the fact that n1 tags have already been observed in the other library. The sum of this probability over n2 or more tags then serves as a one-sided test. The test proposed by Kal and coworkers (7) focuses on the proportions of specific tags in each library. Because these proportions can be approximated to result from sampling with replacement, the probability of the resulting tag counts follows a binomial distribution (15). The proposed test is therefore based on the normal approximation of the binomial distribution (Z test; 7). The test statistic Z is calculated as the observed difference between the proportions of specific tags in both libraries divided by the standard error of this difference under the null hypothesis (Table 1). This Z statistic is approximately normally distributed and can be compared with the critical value Zα/2 for the two-sided significance level α (2).
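As a concrete illustration, the Z test as described above can be implemented in a few lines. The function name `z_test` is ours, and scipy is assumed to be available:

```python
from math import sqrt
from scipy.stats import norm

def z_test(n1, N1, n2, N2):
    """Two-sided Z test of Kal et al. (7): compares the proportions of a
    specific tag in two SAGE libraries of total sizes N1 and N2, using the
    normal approximation of the binomial distribution."""
    p1, p2 = n1 / N1, n2 / N2
    p0 = (n1 + n2) / (N1 + N2)                    # pooled proportion under H0
    se = sqrt(p0 * (1 - p0) * (1 / N1 + 1 / N2))  # SE of p1 - p2 under H0
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))                 # Z statistic, two-sided P value
```

For example, `z_test(5, 10_000, 30, 10_000)` gives Z of about -4.2 and a two-sided P of about 2 × 10⁻⁵, significant at α = 0.001.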
Recently, the chi-squared test, Fisher's exact test, and the Audic and Claverie test were compared with respect to their power and robustness (12). The Madden test and SAGE300 were not included in this comparison, nor did the comparison report the differences that are needed to reach a statistically significant result. The latter omission hampers the comparison of test results across papers. Therefore, and to further help the user of SAGE decide between the available tests, the present review compares the critical values of five tests (excluding the chi-squared test). Critical values, sometimes called "first significant values" (3), are defined as the highest or lowest number of tags that, given an observed number of tags in one library, needs to be found in the other library to result in a P value below the significance level when the pairwise test is carried out.
Table 1 lists the five tests for pairwise testing of SAGE libraries that have been compared. It also gives the test statistic and the decision rule of each test. For details on the statistical basis of each of these tests, the reader is referred to the original papers. For all tests the null hypothesis (H0) is that there is no difference in tag numbers between the two libraries. The five tests were compared by determining their critical values for a significance level of 0.001. Such a low significance level was chosen to safeguard against accumulation of type I error. The use of a significance level of 0.001 is equivalent to an overall significance level of 0.05 and a Bonferroni correction to allow for 50 hypothesis tests (2).
In this review only the upper critical values are considered. Critical values were determined by taking a fixed tag count in the first library and subsequently performing the statistical test for an increasing number of tags in the second library until the resulting P value led to rejection of the null hypothesis at the required level of significance. Because the Monte Carlo-based test of SAGE300 does not give the same P value every time the same input is tested, for each input the test was run six times and the mean P value was used. A comparable average P value, based on three trials, is given by SAGE300 itself in its "analyze"-"entire project" option.
All critical values were determined for 1) a total number of 10,000 tags in both SAGE libraries (N1 = N2 = 10,000) and 2) a total of 10,000 tags in the first and 50,000 tags in the second library (N1 = 10,000; N2 = 50,000). The values for the number of specific tags observed in the first library (n1) ranged from 1 to 100, effectively testing an abundance range of 0.0001 to 0.01. The critical values are the number of specific tags that have to be found in the second library (n2) and are determined by systematic simulation of an increasing difference between the two libraries. It should be kept in mind that in most comparisons between specific tags in SAGE libraries, there is no a priori knowledge about the direction of the effect. Therefore, all pairwise tests have to be carried out as two-sided tests. To do this, the test statistic Z (7, 11) was compared with Zα/2, whereas the one-sided P values of SAGE300, as well as the integrated probabilities of the Audic and Claverie test and of Fisher's exact test, were compared with α/2 (Table 1).
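The search for an upper critical value described above can be written down directly, reusing the `z_test` sketch from earlier; the starting point at the equal-proportion count and the function name are our choices:

```python
from math import ceil

def upper_critical_value(n1, N1, N2, alpha=0.001, max_n2=100_000):
    """Smallest n2 for which the two-sided Z test on n1/N1 vs n2/N2 becomes
    significant at `alpha`, searching upward from the equal-proportion count."""
    for n2 in range(ceil(n1 * N2 / N1), max_n2):
        _, p = z_test(n1, N1, n2, N2)   # z_test as sketched above
        if p < alpha:
            return n2
    return None                          # not reached below max_n2
```

With these assumptions, `upper_critical_value(10, 10_000, 10_000)` returns 32: given 10 specific tags in the first library, at least 32 have to be found in an equally sized second library before the Z test rejects H0 at the 0.001 level.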
The upper critical values for a 0.001 level of significance for the Z test of Kal et al. (7) are given in Fig. 1 for two SAGE libraries of equal size (Fig. 1A; both 10,000 tags) and for two SAGE libraries of different size (Fig. 1B; 10,000 and 50,000 tags, respectively) as continuous lines. Note that for a larger SAGE library the confidence level of an observed tag count is higher (7). Therefore, with a large second SAGE library, smaller differences in proportions can be detected as statistically significant. For two libraries of the same size (N) and relatively low tag counts (n1 + n2 less than 1% of 2N) the test statistic Z of the Z test (Table 1) reduces to Z = (n1 − n2)/√(n1 + n2). Thus, for low tag counts and two large libraries of the same size, the critical values of the Z test are independent of library size.
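The reduction follows in two steps, writing N1 = N2 = N and p0 = (n1 + n2)/(2N) for the pooled proportion:

```latex
Z \;=\; \frac{p_1 - p_2}{\sqrt{p_0(1-p_0)\left(\tfrac{1}{N}+\tfrac{1}{N}\right)}}
  \;=\; \frac{(n_1 - n_2)/N}{\sqrt{\tfrac{n_1+n_2}{2N}\,(1-p_0)\,\tfrac{2}{N}}}
  \;\approx\; \frac{(n_1 - n_2)/N}{\sqrt{(n_1+n_2)/N^2}}
  \;=\; \frac{n_1 - n_2}{\sqrt{n_1+n_2}},
```

where the approximation uses (1 − p0) ≈ 1, which holds under the stated condition that n1 + n2 is less than 1% of 2N.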
The test of Audic and Claverie (3), Fisher's exact test, and SAGE300 (19) all have critical values that are on average within 1.5% of those of the Z test for libraries of equal size (Fig. 1A). This equivalence of the four tests holds for tag counts as low as 1 tag per 10,000 in the first library. Only for libraries of different size and low specific tag counts does the Z test require slightly higher critical values (Fig. 1B). Also, for other levels of significance, the critical values of the Z test are almost the same as those published for the Audic and Claverie test (3). This comparison of tests shows that, apart from the test of Madden et al. (11), all tests perform with similar resolution in detecting differences between SAGE libraries. Also, except for the Madden test, all tests can handle SAGE libraries of equal as well as different size. Therefore, the tests published by Kal et al. (7), Audic and Claverie (3), and Zhang et al. (19), as well as Fisher's exact test, will give the same test results when applied for pairwise comparison of SAGE libraries.
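For completeness, the Audic and Claverie probability can be computed directly from the published equation, here evaluated in log space for numerical stability; the function name, the log-space trick, and the summation cutoff are implementation choices of this sketch, and Fisher's exact test is taken from scipy:

```python
from math import exp, lgamma, log
from scipy.stats import fisher_exact

def audic_claverie_p(n1, N1, n2, N2):
    """One-sided Audic and Claverie (3) P value: the summed probability of
    finding n2 or more tags in library 2, given n1 tags in library 1."""
    r = N2 / N1
    def log_p(k):   # log P(k | n1), from the equation of Ref. 3
        return (k * log(r) + lgamma(n1 + k + 1) - lgamma(n1 + 1)
                - lgamma(k + 1) - (n1 + k + 1) * log(1 + r))
    total, k = 0.0, n2
    while True:                  # terms decay geometrically (ratio -> r/(1+r))
        term = exp(log_p(k))
        total += term
        k += 1
        if term < 1e-15 * max(total, 1e-300):
            return total

# Fisher's exact test on the corresponding 2x2 table (tag vs. all other tags):
_, p_fisher = fisher_exact([[10, 10_000 - 10], [32, 10_000 - 32]])
```

For the equal-size example above (10 vs. 32 tags per 10,000), `audic_claverie_p` compared with α/2 and the two-sided `p_fisher` compared with α yield closely agreeing results, consistent with the near-identical critical values reported above.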
In addition, a recent paper by Man and coworkers (12) compared the chi-squared test, the test of Audic and Claverie (3), and Fisher's exact test. This comparison was based on Monte Carlo simulations of SAGE libraries. The specificity, power, and robustness of the tests were determined for simulated SAGE libraries of various sizes and at severalfold differences. This comparison showed that the chi-squared test consistently has a higher power and is more robust than the other tests, especially at low expression levels (<15 tags/50,000). Therefore, the chi-squared test, which is equivalent to the Z test, was concluded to be the preferred choice for evaluating SAGE experiments (12).
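The stated equivalence of the chi-squared test and the Z test is an algebraic identity for a 2 × 2 table with 1 degree of freedom and can be verified numerically, reusing the `z_test` sketch above together with scipy's chi-squared test without continuity correction:

```python
from scipy.stats import chi2_contingency

n1, N1, n2, N2 = 10, 10_000, 32, 10_000
z, p_z = z_test(n1, N1, n2, N2)
chi2_stat, p_chi2, dof, expected = chi2_contingency(
    [[n1, N1 - n1], [n2, N2 - n2]], correction=False)
assert abs(chi2_stat - z ** 2) < 1e-8   # Pearson chi-squared (1 df) equals Z^2
assert abs(p_chi2 - p_z) < 1e-10        # so the two-sided P values coincide
```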
The normal approximation of the binomial distribution that forms the basis of the Z test can also be used to easily construct confidence intervals for the observed proportion of specific tags as well as for the difference in proportions between two SAGE libraries (7). This approximation also enables the determination of the statistical power of the comparison of two SAGE libraries and the calculation of the sample size needed to detect an expected difference, both of which are essential in the planning of future SAGE analyses. A similar decision about sample size can be reached with a Monte Carlo-based program that calculates the power of a test for a given difference and sample size (POWER_SAGE; 12). Figure 2 shows a rearrangement of the equation of the Z test in such a way that it can be used for the evaluation and planning of SAGE experiments. In this form this equation can be used in several ways.
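A minimal power calculation based on the same normal approximation is sketched below. The function name and the standard two-proportion power formula are our choices; the exact rearranged equation of Fig. 2 is not reproduced here, so results may deviate slightly from the published curves.

```python
from math import sqrt
from scipy.stats import norm

def power_two_proportions(p1, N1, p2, N2, alpha=0.001):
    """Approximate power of the two-sided Z test when the true tag
    proportions are p1 and p2 (standard normal-approximation formula)."""
    z_crit = norm.ppf(1 - alpha / 2)                     # critical Z value
    p0 = (p1 * N1 + p2 * N2) / (N1 + N2)                 # pooled proportion
    se0 = sqrt(p0 * (1 - p0) * (1 / N1 + 1 / N2))        # SE under H0
    se1 = sqrt(p1 * (1 - p1) / N1 + p2 * (1 - p2) / N2)  # SE under H1
    return norm.sf((z_crit * se0 - abs(p1 - p2)) / se1)
```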
The power of a performed test tells the user how large the chance is that a real difference has been overlooked or, in statistical terms, that a false null hypothesis is not rejected. The effect of the differences between libraries on the power of the statistical comparison of these libraries is illustrated in Fig. 4. Figure 4 shows this power as a function of the difference between a first library with 50 specific tags per 50,000 tags and second libraries of various sizes and with different numbers of specific tags. Note that the power is at its lowest when the differences in abundance are low. From this graph it can be read that when the abundance increases 1.5-fold, the maximum power of the significance test will only be about 0.25: even when a second library of 100,000 tags is generated, a real 1.5-fold increase would be missed 75% of the time. To reach an acceptable power of 0.9, at least 190 tags per 100,000 should be observed. A smaller library requires relatively larger differences: at least 40 specific tags have to be observed in a library of 10,000 tags to reach the same power.
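As a rough cross-check of this scenario with the power sketch above (which uses a slightly different formula than the exact equation of Fig. 2, so the numbers need not match the published figure exactly):

```python
# 50 specific tags per 50,000 in the first library, a true 1.5-fold increase,
# and a second library of 100,000 tags: this approximation yields a power of
# about 0.20, of the same order as the ~0.25 maximum read from Fig. 4.
p = power_two_proportions(50 / 50_000, 50_000, 1.5 * 50 / 50_000, 100_000)
```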
Instead of looking at detectable differences and power, one can also calculate the number of tags (N2) needed to detect a 2- to 20-fold difference between the new library and a library known from previous work or the literature (6). The number of tags needed to observe an x-fold difference as significant increases exponentially with decreasing abundance of the transcripts in the first library (Fig. 5, A and B, x-axis) and with decreasing difference between conditions (Fig. 5, separate lines), making the detection of small differences for low-abundance transcripts a practical impossibility. When the number of tags in the first library is low, differences for the low-abundance transcripts may never be detectable. Because the standard error of a proportion is a function of the proportion and the library size [SE = √(p(1 − p)/N); Ref. 2], a small difference may never exceed the critical value. In such a case, one also has to increase the size of the first library. A comparison of Fig. 5 with Fig. 6 shows that, when no prior knowledge on transcript abundance is available, the most efficient way to set up a SAGE study is to compile two SAGE libraries of equal size. For example, detecting a 10-fold difference for a gene that occurs 10 times in a library of 10,000 tags would take a second library of at least 50,000 tags (Fig. 5A), whereas two new libraries of 14,000 tags each would be sufficient (Fig. 6).
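The required second-library size can be found by a simple numeric search over N2, again using the approximation above rather than the exact procedure behind Figs. 5 and 6; the function name, step size, and defaults are our choices:

```python
def required_N2(n1, N1, fold, power=0.9, alpha=0.001,
                step=1_000, max_N2=10_000_000):
    """Smallest library size N2 (searched in steps of `step`) at which a
    `fold`-fold change from the proportion n1/N1 is detected with the
    requested power by the two-sided Z test."""
    p1 = n1 / N1
    p2 = fold * p1
    N2 = step
    while N2 <= max_N2:
        if power_two_proportions(p1, N1, p2, N2, alpha) >= power:
            return N2
        N2 += step
    return None   # difference not detectable at this power for any N2 tried
```

Because the standard error under the alternative contains a term that depends only on the first library, this loop can return None even for very large N2, which is the formal counterpart of the statement above that one sometimes also has to increase the size of the first library.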
Other tests for pairwise comparison of SAGE libraries may be proposed in the future. The usefulness of such tests will be limited by the fact that each SAGE library, no matter how large, represents only one experimental measurement. Consequently, one has no information about the biological variation and the precision of the observed tag counts. Such a measure of experimental variance is crucial for hypothesis testing. In the currently available tests, this measure of variance is obtained from simulation (19) or based on the putative properties of the tag distribution (3, 7, 11). The test results will be dependent on the validity of these assumptions. However, the above comparison shows that the test results of SAGE300, Fisher's exact test, the Z test, and the Audic and Claverie test differ only marginally. Additional tests will therefore only be a meaningful addition to SAGE statistics when these issues of experimental variance and accuracy are addressed. The modeling of sampling error, sequencing error, and other aspects of SAGE experiments (15) will probably play a role in the development of such hypothesis tests and the calculation of more accurate P values.
When only P values are published, it should be noted that SAGE300 and the Audic and Claverie test, as well as the conversion from the Z statistic to a P value for the Kal test and the Madden test, will result in a one-sided P value. The authors should be aware of this and should mention whether a one-sided or a two-sided P value is tabulated (see, for instance, Ref. 8). However, since in SAGE experiments no a priori knowledge about the direction of the effects is available, the publication of two-sided P values would be the most appropriate and should be encouraged. This would enable the direct comparison of published P values with the required level of significance and simplify the comparison of different papers on the same tissues. However, the significance of the P value of the observed difference between tag counts should not be overemphasized: the rank order of the P values may well be all the information the reader needs to pinpoint important genes and to plan future research.
ACKNOWLEDGMENTS
SAGE300 is available from http://www.sagenet.org. The test of Audic and Claverie (3) is available from http://igs-server.cnrs-mrs.fr/~audic/significance.html. SAGEstat, for the application of the Z test (7) as well as the calculation of critical values and of the number of tags needed to detect an assumed difference, is available on request (E-mail: bioinfo@amc.uva.nl; subject, SAGEstat). An R (S-plus) implementation of SAGEstat, with the possibility to compare public domain SAGE libraries and to plot graphs of the required number of SAGE tags, is incorporated in USAGE (16), which can be reached at http://www.cmbi.kun.nl/usage/. Another program that calculates the number of required tags and performs a chi-squared test between SAGE libraries is POWER_SAGE (E-mail: michael.man@pfizer.com; Ref. 12), which is based on Monte Carlo simulations.
FOOTNOTES
Address for reprint requests and other correspondence: J. M. Ruijter, Dept. of Anatomy and Embryology, Academic Medical Center, Meibergdreef 15, K2-283, 1105 AZ Amsterdam, the Netherlands (E-mail: j.m.ruijter@amc.uva.nl).
doi: 10.1152/physiolgenomics.00042.2002.
References