* Biostatistics Branch, National Institute of Environmental Health Sciences, PO Box 12233, Research Triangle Park, North Carolina 27709;
Department of Mathematics and Statistics, Miami University, Oxford, Ohio 45056-1641;
Division of Biometry and Risk Assessment, National Center for Toxicological Research, Jefferson, Arkansas 72079;
Analytical Sciences, Inc., Durham, North Carolina 27713; and
¶ Department of Statistics, University of Florida, Gainesville, Florida 32611-0339
Received December 8, 2000; accepted January 24, 2001
ABSTRACT
INTRODUCTION
This meeting was unique in that all invited speakers agreed to provide in advance the raw data from those experiments that the meeting's organizing committee had preselected as critical to the low-dose issue. Specifically, the organizing committee identified and requested raw data from 58 studies involving 15 different sets of investigators. Those studies that provided raw data for statistical analysis are listed in the Appendix.
The organizing committee identified specific variables that they wanted the Statistics Subpanel to evaluate from each study and prioritized the studies in terms of the desired order for data evaluation. These data were primarily from published studies, but included some unpublished data sets for which a manuscript (or abstract) had been prepared. The study investigators were also asked to submit responses to 23 specific questions that were designed to help the subpanels better understand important features of their individual studies.
The Statistics Subpanel was asked to reevaluate the authors' experimental design, data analysis, and interpretation of experimental results and to provide a written report prior to the meeting, which would be distributed to the other four subpanels to aid in their deliberations. The purpose of this statistical reevaluation was to provide an independent assessment of the experimental design, data analysis, and interpretation of experimental results for each of the studies and to identify and discuss key statistical issues relevant to the evaluation and interpretation of all endocrine disruptor studies. In this paper such issues are illustrated using examples from the various studies that the Statistics Subpanel evaluated. However, the statistical principles and experimental design and data analysis issues discussed below are not unique to endocrine disruptor studies, but apply to many other types of laboratory investigations as well.
The complete report of the Statistics Subpanel, which includes a detailed evaluation of individual studies, is available on the National Toxicology Program web site (http://ntp-server.niehs.nih.gov/). Hard copies of all subpanel reports and the investigator responses to the 23 questions can be obtained by contacting the U.S. EPA OPPTS docket-42208A, (202) 260-7099.
MATERIALS AND METHODS
There were several important limitations associated with the data reevaluation. First, because of the limited time frame (4-6 weeks) available for completing the statistical analyses prior to the meeting, only 38 studies from 12 different investigators were reevaluated. Second, the reevaluation focused primarily on the experimental design, data analysis, and interpretation of experimental results of each individual study within the context of its own experimental conditions, rather than on comparisons of results across studies. Finally, the scientific value of the statistical reanalyses depends on the quality of the data provided, which was beyond our control; it was not possible for the Statistics Subpanel to assess the validity or reliability of the data we received.
The variables of interest were primarily reproductive system responses, including organ weights (prostate, testis, uterus, ovary, etc.), daily sperm production, anogenital distances, vaginal opening, and preputial separation. The raw data included individual animal response measures of these various parameters.
The 38 studies that we evaluated used a variety of statistical methods. Although each study had its own objectives, we decided to apply a uniform statistical approach to all studies in our evaluation. Analysis of variance (ANOVA) was used to account for specific design effects (e.g., replicate effects) in addition to dose effects. Linear mixed-effects models were often employed, using litter as a random effect and allowing responses from littermates to be correlated. Because the primary objective of most of these studies was to determine whether significant effects were present in selected dosed groups relative to controls, pairwise comparisons were made by Dunnett's test (based on litter means for those studies having significant litter effects).
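As a concrete illustration of this general approach, the following sketch (in Python, using the statsmodels and SciPy libraries) fits a linear mixed-effects model with litter as a random effect and then applies Dunnett's test to litter means. The data frame, column names, and responses are hypothetical, and scipy.stats.dunnett is available only in SciPy 1.11 or later.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical data: 8 litters per dose group, 3 pups sampled per litter,
# with a shared litter effect so that littermates are correlated.
rng = np.random.default_rng(1)
records = []
for dose in (0, 1, 10, 100):
    for litter in range(8):
        litter_effect = rng.normal(scale=0.3)
        for _ in range(3):
            records.append({"dose": dose,
                            "litter": f"d{dose}_l{litter}",
                            "response": 5.0 + 0.002 * dose + litter_effect
                                        + rng.normal(scale=0.2)})
df = pd.DataFrame(records)

# Linear mixed-effects model: dose as a fixed effect, litter as a random effect.
mixed = smf.mixedlm("response ~ C(dose)", data=df, groups="litter").fit()
print(mixed.summary())

# Dunnett's test on litter means, comparing each dosed group with control.
litter_means = df.groupby(["dose", "litter"], as_index=False)["response"].mean()
control = litter_means.loc[litter_means["dose"] == 0, "response"]
dosed = [g["response"] for d, g in litter_means.groupby("dose") if d != 0]
print(stats.dunnett(*dosed, control=control))  # requires SciPy >= 1.11
```

Basing Dunnett's test on litter means, as in this sketch, is one simple way to respect the litter as the experimental unit when litter effects are present.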
The Statistics Subpanel used analysis of covariance (ANCOVA) in the evaluation of organ weights to adjust for body weight differences among groups. As low-dose effects were of interest, regression models (linear and quadratic) were applied to study dose-response trends. When appropriate, a logarithmic transformation was used to eliminate heterogeneity of variances across treatment groups. In those few instances in which heterogeneity could not be removed by a log transformation, the data were evaluated by nonparametric techniques.
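A minimal sketch of this organ weight analysis, again with hypothetical data: the organ weight is log-transformed to stabilize variances, and body weight enters the model as a covariate.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical organ and body weights for four dose groups.
rng = np.random.default_rng(2)
dose = np.repeat([0, 1, 10, 100], 12)
body_wt = rng.normal(25, 2, size=dose.size) - 0.01 * dose
organ_wt = np.exp(0.04 * body_wt - 0.0005 * dose + rng.normal(0, 0.1, dose.size))
df = pd.DataFrame({"dose": dose, "body_wt": body_wt, "organ_wt": organ_wt})

# Log-transform the organ weight to reduce variance heterogeneity, then
# adjust for body weight differences with an analysis of covariance.
df["log_organ_wt"] = np.log(df["organ_wt"])
ancova = smf.ols("log_organ_wt ~ C(dose) + body_wt", data=df).fit()
print(anova_lm(ancova, typ=2))  # dose effect adjusted for body weight
```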
Although a uniform statistical approach was followed, the Statistics Subpanel retained the flexibility of carrying out any statistical analyses of the data that any individual statistical evaluator deemed to be appropriate. More specific details on the statistical methods used in our analyses of each study are given in the complete report available on the National Toxicology Program web site or from the U.S. EPA Docket.
RESULTS AND DISCUSSION
Experimental Design Issues
Study sensitivity (power).
The power of a study is the probability of detecting a treatment effect if it is present in the data. Study sensitivity or power is influenced by a number of factors, including a) sample size; b) the underlying variability of the data; c) the magnitude of the treatment effect that is present; d) the method of statistical analysis; and e) the level of significance.
A larger study will generally have more power for detecting similar chemical-related effects than a smaller study. Moreover, the interpretation of a study as negative should be given more weight when relatively large sample sizes are used. The number of animals per group ranged from 3 to 179 in the studies that were reevaluated, and this is a factor that must be considered when comparing and interpreting study results.
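To illustrate how strongly group size drives power, the sketch below (using statsmodels, with purely illustrative numbers) computes the group size needed to detect a one-standard-deviation difference between a dosed group and control in a two-sample t test, and the power available with only three animals per group.

```python
from statsmodels.stats.power import TTestIndPower

# Animals per group needed to detect a difference of one standard deviation
# (standardized effect size 1.0) with 80% power at a two-sided alpha of 0.05.
n_per_group = TTestIndPower().solve_power(effect_size=1.0, alpha=0.05, power=0.8)
print(round(n_per_group))  # about 17 animals per group

# Conversely, the power achieved with only 3 animals per group is low.
print(TTestIndPower().power(effect_size=1.0, nobs1=3, alpha=0.05))
```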
Although we would in general anticipate that larger group sizes would lead to greater statistical power, this may not be realized if larger group sizes are obtained at the cost of introducing new, uncontrolled sources of variability. Reducing variability should be an important study objective, and it can be achieved in a variety of ways. One way is to identify and control those factors most likely to produce variability in response. These are discussed in more detail below as general experimental design issues.
Replication.
Reproducibility of experimental results is an important and necessary requirement before any scientific finding can be generally accepted as valid. There are several types of replication, the first being replication within an individual experiment.
If multiple replications are used within a study, then each experimental group should be represented in each replicate. In one experiment we evaluated, three replicates were used, but the mid- and high-dose groups were represented only once and in different replicates. Additionally, there were significant differences among the control groups in the three replicates, although the study authors ignored these differences and pooled these groups in their statistical analysis. This experimental design and statistical analysis greatly limited the study's scientific value.
In another study, a control and three dosed groups were each evaluated in separate time frames extending over a period of 1 year. The Statistics Subpanel felt that the lack of concurrent controls was a serious deficiency of the experimental design that greatly limited the general inferences that could be drawn from this study.
Another type of replication is the reproducibility of results among separate experiments within a given laboratory. In one study, the investigator carried out eight similar experiments with the same chemical, although these technically were not replicates, because dose levels of the test compound were not identical across all experiments. This investigator found statistically positive effects on uterus weight in four experiments and no effect in the other four experiments, including the two that used the highest doses. This demonstrates that in some cases even the same investigator may be unable to repeat experimental findings.
Perhaps the most important type of replication is reproducibility among different laboratories trying to confirm the findings of another laboratory. Among the data sets we evaluated, there were several studies that attempted to duplicate the studies of other investigators. Some confirmed the original results, but many did not, and the reasons for the different experimental outcomes were not obvious.
Litter effects.
Using data from littermates is neither an inherently good nor bad experimental strategy, but if littermate data are to be used, it is essential that this source of variability be taken into account in the statistical analysis. In the studies we evaluated that used littermates, there was generally a significant litter effect or dam effect, indicating that the pups within a litter were responding more alike than pups from different litters. Failure to adjust for litter effects (e.g., to regard littermates as independent observations and thus the individual pup as the experimental unit) can greatly exaggerate the statistical significance of experimental findings.
Increasing sample size by sampling more pups within a litter will not necessarily yield correspondingly increased statistical power. This is because within-group variation may be dominated by variation among dams, resulting in high correlation among pups within dams. When this dam effect is properly taken into account in the data analysis, the effective sample size will be smaller than the apparent sample size.
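The familiar design-effect formula makes this concrete: with k litters, m pups per litter, and an intralitter correlation r, the effective sample size is approximately km/[1 + (m - 1)r]. The short sketch below, with hypothetical numbers, shows how little is gained by sampling more pups per litter when that correlation is strong.

```python
def effective_n(k_litters, m_pups, icc):
    """Effective number of independent observations when m pups are sampled
    from each of k litters and littermates have intraclass correlation icc."""
    return k_litters * m_pups / (1 + (m_pups - 1) * icc)

print(effective_n(10, 1, 0.6))  # 10 litters x 1 pup  -> 10.0
print(effective_n(10, 5, 0.6))  # 10 litters x 5 pups -> ~14.7, far below 50
print(effective_n(25, 2, 0.6))  # adding litters helps much more: ~31.3
```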
Elswick et al. (2000) carried out a simulation study in which they concluded that one-pup-per-litter experimental designs should not be used when assessing effects on highly variable organ weights and other reproductive end points, because such a design results in a substantial percentage of incorrect conclusions about the presence or absence of treatment effects. However, their simulation study and its conclusions were flawed (see the complete report for more details).
For a fixed total number of litters, increasing the number of pups per litter will increase power (reduce the false-negative rate), but it has no impact on the false-positive rate, which is fixed by the selection of alpha (typically 0.05). None of the authors' simulations indicated that the sampling strategy used (one, two or three pups per litter) had any impact on what they regarded to be the false-positive rate. Nevertheless, the authors concluded that the sampling of only one or two pups per litter may have been a contributing factor to the positive low-dose effects observed by some investigators, effects that were not confirmed by others who used more than two pups per litter.
Another problem is that the simulations of Elswick et al. were based on comparisons of subsamples selected without replacement from two (or more) samples of litters. In those instances in which the samples were not statistically different, the authors considered the samples themselves to be identical, and then in effect regarded them as populations in their simulation study. However, because the samples (now regarded as populations) were in fact different, all the authors' reported false-positive rates resulting from the simulation were in fact power calculations.
Finally, if strong litter effects are present, then a gain in power (for a fixed total number of pups) is best achieved by increasing the number of litters, not by increasing the number of pups per litter. Thus, the authors' emphasis on increasing the number of pups per litter rather than increasing the number of litters is somewhat misguided.
In an ANOVA-based statistical analysis, if littermates are used only within a single treatment group, then litter is a nested factor. However, in some experimental designs, one pup from the same litter may be assigned to each experimental group prior to treatment. In such instances, litter is a crossed, not a nested, factor. This distinction is important, as regarding crossed factors as nested factors or vice versa in an ANOVA can result in a misleading test (Kutner et al., 1996).
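To make the distinction concrete, the sketch below (with hypothetical data and column names, using statsmodels) contrasts one valid analysis for each layout: a one-way ANOVA on litter means when litters are nested within dose groups, and a randomized-block ANOVA when one pup from each litter is assigned to every dose group so that litter is crossed with dose.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
doses = [0, 1, 10]

# Nested layout: each litter belongs to a single dose group (3 pups per litter).
rows = []
for d in doses:
    for l in range(6):
        litter_effect = rng.normal(0, 0.3)
        for _ in range(3):
            rows.append({"dose": d, "litter": f"d{d}_l{l}",
                         "response": 5 + litter_effect + rng.normal(0, 0.2)})
nested = pd.DataFrame(rows)

# Dose must be tested against litter-to-litter variation; a simple valid
# analysis is a one-way ANOVA on litter means.
nested_means = nested.groupby(["dose", "litter"], as_index=False)["response"].mean()
print(anova_lm(smf.ols("response ~ C(dose)", data=nested_means).fit()))

# Crossed layout: one pup from each litter is assigned to every dose group,
# so litter acts as a blocking factor that enters the model alongside dose.
rows = []
for l in range(6):
    litter_effect = rng.normal(0, 0.3)
    for d in doses:
        rows.append({"dose": d, "litter": f"l{l}",
                     "response": 5 + litter_effect + rng.normal(0, 0.2)})
crossed = pd.DataFrame(rows)
print(anova_lm(smf.ols("response ~ C(dose) + C(litter)", data=crossed).fit()))
```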
Another potentially complicating litter effect is the location of the pup within the uterus (e.g., whether it is located between two males, between two females, or between one male and one female). At least one investigator has data indicating that the gender of the adjacent pups in utero can influence certain biological responses. One approach for dealing with this intrauterine effect would be to stratify the data according to the various possible combinations of adjacent pups, if this information is available; generally, it is not.
Potential investigator bias.
To avoid the possibility of subtle bias, postexperimental measurements on pups should be made without prior knowledge of whether they are from dosed or control groups. In other contexts (e.g., histopathology evaluation), it has been argued that control animals need to be examined in an unblinded fashion to identify what is normal variability, and only then can experimental groups be evaluated. We do not accept this argument (which is debatable even in other contexts), as the primary variables of interest (organ weight, anogenital distance, sperm counts, etc.) are objective measures that do not require prior knowledge of control values to be assessed accurately.
A related issue is the order of experimental evaluation. Some have argued that the controls must be examined first to ascertain what is normal variability. We reject this argument in the current context, and feel that dosed and control groups should be examined together in a blinded fashion. Although it may be impractical to use a completely randomized approach, at a minimum the experimental design should ensure that there is no bias associated with any systematic ordering of experimental evaluation.
For example, one laboratory reported that it evaluated all controls (unblinded) on one day, followed by all the low-dose animals on the next day, etc. This lab also indicated that within a day, a single technician looked at all the pups, but different technicians might be used on different days. There is obvious potential bias in such a strategy. Even if the technicians have been uniformly trained, it is difficult to avoid differences among them.
Different types of control groups.
Some studies used both an untreated and a vehicle control group to evaluate the possible effect of the test vehicle. In most studies it was not expected that the vehicle would have an effect (an expectation that was generally confirmed), and the control groups were subsequently pooled. We agree that the pooling of vehicle and untreated control groups is reasonable if there is no evidence of a difference between them. If there is any evidence of a possible vehicle effect, then the two control groups should not be pooled, and the primary comparisons of interest should be relative to the vehicle control group.
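A simple way to implement this rule, sketched below with hypothetical data, is to compare the two control groups directly and pool them only when no difference is evident.

```python
import numpy as np
from scipy import stats

# Hypothetical organ weights for the two types of control group.
rng = np.random.default_rng(4)
untreated_control = rng.normal(10.0, 1.0, size=12)
vehicle_control = rng.normal(10.1, 1.0, size=12)

# Pool the two control groups only if there is no evidence they differ;
# otherwise, make the primary comparisons against the vehicle control alone.
t_stat, p_value = stats.ttest_ind(untreated_control, vehicle_control)
if p_value >= 0.05:
    control = np.concatenate([untreated_control, vehicle_control])
else:
    control = vehicle_control
```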
Quality control.
The experimental design of a study should include procedures to ensure the accuracy of data recording/transcription. Although subsequent statistical tests for outliers can identify questionable data points, by that time it may be too late to know for certain the accuracy of such values. In some studies that the Statistics Subpanel evaluated, the organ weights within a group varied by as much as 10-fold or even 100-fold. Visual inspection of the data might have identified such outliers prior to statistical analysis.
There were also quality control issues with respect to the raw data provided to us for statistical analysis. In two studies, the raw data had significant errors. In one case (involving two different studies from the same investigator) multiple pups were misassigned to litters, and multiple litters were misassigned to dosed groups. Another investigator inadvertently omitted entire blocks of data dealing with eight high-dose animals. These errors were detected (and corrected) only because the Statistics Subpanel had access to summary data in published papers for comparative purposes.
Data Analysis Issues
Choice of statistical methodology.
Although many different statistical procedures may be used in a given experimental setting, it should be recognized that these procedures may have different objectives, make different underlying assumptions, and have different degrees of protection against false-positive and false-negative outcomes. In this context, one procedure is more conservative than another if it tends to have a lower false-positive rate and a correspondingly higher false-negative rate. Balancing false-positive and false-negative rates in the selection of statistical methodology is to some extent a matter of scientific judgement.
To take a specific example, for normally distributed data for which the desired comparisons are limited to dosed groups versus controls, Dunnett's test is a widely used and appropriate test for this purpose. This is the method of statistical analysis the Statistics Subpanel used, as noted above. Dunnett's test controls the experimentwide error rate by taking into account the multiple comparisons being made, and thus is more appropriate than, for example, multiple applications of Student's t test, which could result in an unacceptably high false-positive rate. Williams (1971, 1972) proposed a modification of Dunnett's test that is appropriate if it is reasonable to assume a monotonic dose response, but because of the potential for low-dose effects not seen at higher doses (which would invalidate the monotonicity assumption), we decided not to use the Williams procedure.
In our reanalysis, we found that even if two investigators chose the same test procedure, they may have applied it differently. For example, Dunnett's test is a standalone test that does not require statistical significance of an overall ANOVA to be valid. However, many investigators who used Dunnett's test required statistical significance of an overall ANOVA before making pairwise comparisons. Because the critical values for Dunnett's test were derived without consideration of an overall ANOVA, requiring this additional significance may result in a somewhat conservative test. Specifically, there were a few instances in which our reanalysis found significant pairwise differences by Dunnett's test that were not reported as significant by the study investigators who themselves also used Dunnett's test. Such differences were apparently due to the extra requirement of a significant overall ANOVA imposed on Dunnett's test by the study investigator.
If all possible pairwise comparisons are desired, there are many multiple comparisons tests that could be used. For a comparison of some of these methods, see Carmer and Swanson (1973), Hochberg and Tamhane (1987), and Hsu (1996). One such method is the widely used Fisher's (protected) least significant difference (LSD) test, which, unlike Dunnett's test, does require the significance of an overall ANOVA before pairwise comparisons can be made, to control the experimentwide error rate. The conditional pairwise comparisons are then made by a statistic similar in form to Student's t test, but one that uses all the data to estimate the underlying variability.
Although the protected LSD test is a valid test that can be used for endocrine disruptor data (depending upon study objectives), the overall ANOVA should not include the positive control group or any group known a priori to produce a positive response. Otherwise, overall significance would be virtually guaranteed, and the benefits of a preliminary ANOVA would be lost. The result would likely be a test with an unacceptably high false-positive rate. As a practical matter, even if done correctly, the protected LSD is generally more liberal (i.e., more prone to false-positive outcomes, as discussed above) than certain alternative multiple comparisons procedures.
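The sketch below outlines the protected LSD logic just described: an overall one-way ANOVA, from which any positive control group has already been excluded, gates the pairwise comparisons, and the comparisons themselves use the pooled error variance from all groups. The function name and data are hypothetical.

```python
from itertools import combinations
import numpy as np
from scipy import stats

def protected_lsd(groups, labels, alpha=0.05):
    """Fisher's protected LSD: pairwise t tests using the pooled error variance,
    made only if the overall one-way ANOVA is significant. Any positive control
    group should be excluded from `groups` before calling."""
    if stats.f_oneway(*groups).pvalue >= alpha:
        return {}                                  # overall ANOVA not significant
    k = len(groups)
    df_error = sum(len(g) for g in groups) - k
    mse = sum((len(g) - 1) * np.var(g, ddof=1) for g in groups) / df_error
    results = {}
    for (lab_i, g_i), (lab_j, g_j) in combinations(zip(labels, groups), 2):
        se = np.sqrt(mse * (1 / len(g_i) + 1 / len(g_j)))
        t = (np.mean(g_i) - np.mean(g_j)) / se
        results[(lab_i, lab_j)] = 2 * stats.t.sf(abs(t), df_error)
    return results

# Hypothetical control, low-dose, and high-dose groups.
rng = np.random.default_rng(5)
data = [rng.normal(m, 1.0, size=10) for m in (5.0, 5.2, 5.9)]
print(protected_lsd(data, ["control", "low", "high"]))
```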
There was also some possible confusion regarding the use of Jonckheere's test, a very useful nonparametric trend test. One investigator implied that it was a test for linear trend, but as it is a nonparametric test, it assumes no specific shape of the dose-response curve and is simply a test for a monotonic (i.e., nonincreasing or nondecreasing) trend. Another investigator used Jonckheere's test as the sole method of statistical analysis. The disadvantage of this approach is that nonmonotonic dose-response trends (e.g., U-shaped or inverted U-shaped dose-response curves with significant low-dose effects) would probably not be detected by Jonckheere's test. The data that we evaluated provided several examples of this.
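Jonckheere's test itself is straightforward to sketch; the implementation below uses the usual normal approximation, assumes few or no ties, and tests only for a monotonic ordering of the dose groups, not for any particular (e.g., linear) shape.

```python
import numpy as np
from scipy import stats

def jonckheere_terpstra(groups):
    """Jonckheere-Terpstra test for a monotonic trend across ordered groups.
    Normal approximation to the null distribution; assumes few or no ties."""
    J = 0.0
    for i in range(len(groups) - 1):
        for j in range(i + 1, len(groups)):
            x = np.asarray(groups[i])[:, None]
            y = np.asarray(groups[j])[None, :]
            # count pairs in which the higher-dose group exceeds the lower-dose group
            J += np.sum(y > x) + 0.5 * np.sum(y == x)
    n = np.array([len(g) for g in groups], dtype=float)
    N = n.sum()
    mean_j = (N**2 - np.sum(n**2)) / 4.0
    var_j = (N**2 * (2 * N + 3) - np.sum(n**2 * (2 * n + 3))) / 72.0
    z = (J - mean_j) / np.sqrt(var_j)
    return z, 2 * stats.norm.sf(abs(z))  # two-sided p value

# Hypothetical dose groups ordered from control to high dose.
rng = np.random.default_rng(6)
groups = [rng.normal(5 + 0.3 * k, 1.0, size=10) for k in range(4)]
print(jonckheere_terpstra(groups))
```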
Although there are many advantages to tests that assume underlying normality, an investigator must be aware of situations in which the data are extremely skewed, which would invalidate a normal theory-based approach. For example, one investigator assessed mammary gland differentiation (number of structures per square millimeter of mammary gland) and reported a highly significant (p < 0.01) effect of diethylstilbestrol (DES) by Student's t test. However, this investigator failed to realize that 39 of the 40 animals in the various dosed and control groups had a zero response. The one single positive response in the DES group was solely responsible for the apparent statistical significance. For such highly skewed data, we prefer a nonparametric approach, because it clearly (and correctly, in our view) would demonstrate no statistical significance associated with a single positive response.
Some investigators used a Bonferroni correction to the p values when making pairwise comparisons. This technique divides the nominal significance level alpha by the number of comparisons being made. Although there is nothing inherently wrong with such an approach, it is rather conservative and would have a relatively high false-negative rate. Such an approach should be unnecessary if an investigator uses one of the multiple comparison procedures discussed above, or when the significance of an overall test is required to ensure that the proper experimentwide error rate is maintained.
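Equivalently, the raw p values can be multiplied by the number of comparisons and compared with alpha, which is how the adjustment is usually implemented in software; the sketch below, with hypothetical data, illustrates this.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical control group and three dosed groups.
rng = np.random.default_rng(7)
control = rng.normal(5.0, 1.0, size=10)
dosed_groups = [rng.normal(5.0 + shift, 1.0, size=10) for shift in (0.2, 0.8, 1.5)]

# Raw p values from each dosed-versus-control comparison, then the
# Bonferroni adjustment (equivalent to comparing raw p values with alpha/k).
raw_p = [stats.ttest_ind(g, control).pvalue for g in dosed_groups]
reject, p_adjusted, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
print(raw_p, p_adjusted, reject)
```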
It is not our purpose to specify a methodology that must be used for the statistical analysis of endocrine disruptor data. Our main points are that if pairwise comparisons are of interest, then a) a valid multiple comparisons test should be used; and b) the choice of a specific multiple comparisons test should depend upon study objectives. If a parametric test is to be used, the investigator should evaluate whether the data are consistent with the underlying assumption of normality. Moreover, it should be recognized that among the valid multiple comparison procedures, some are inherently more conservative than others.
Heterogeneity of the data.
Virtually all procedures that assume an underlying normal distribution (ANOVA, Dunnett's test, etc.) also assume homogeneity of variance, that is, within-group variability is constant across all groups. Occasionally, a simple transformation (e.g., a logarithmic transformation) can eliminate heterogeneity. Although the use of ANOVA-based procedures is not invalidated by modest variance heterogeneity, nonparametric methods should be considered if the heterogeneity is extreme even after a data transformation. Significant heterogeneity may indicate the presence of an outlier in one or more of the test groups or the presence of an important covariate in the data.
An apparent failure to recognize heterogeneity was a common feature in many of the data sets we evaluated. For example, one investigator reported that by Dunn's test (a valid nonparametric alternative to Dunnett's test), the ovary/body weight ratio was significantly increased in a dosed group relative to controls. The author based this interpretation on the fact that the mean of the dosed group's response exceeded the mean control response, and Dunn's test indicated statistical significance.
However, this investigator failed to recognize that the dosed group contained a single ovary weight that was approximately 10 times the value of the group mean. As a result, although the mean response was indeed slightly elevated in the dosed group relative to controls, the preponderance of the individual animal data showed the opposite trend. Thus, Dunn's test actually identified a statistically significant decrease, not a significant increase, in the organ/body weight ratio in the dosed group. Visual inspection of the individual animal data prior to statistical analysis might have avoided this problem.
As another example, in one study an observed seminal vesicle weight was reported to be more than four times the testis weight of that animal and approximately 100 times greater than the mean seminal vesicle weight of the other animals in that group. An appeal to the original lab book may be required to resolve such unusual observations. Even if the value is real, as a minimum it should be identified as a statistical outlier. The presence of such outliers in the data can have a substantial impact on study results.
Although many of the data sets we evaluated had significant variance heterogeneity, a simple log transformation was generally successful in eliminating this heterogeneity. Ignoring heterogeneity and carrying out ANOVA-based tests and pairwise comparisons can produce false-negative outcomes, because the group showing the excessive variability unduly inflates the error term. False-positive outcomes can also occur if the cause of the heterogeneity is improper randomization. Thus, the Statistics Subpanel recommends that an investigator take appropriate measures to ensure that heterogeneity of variance is not a problem in his or her study. Possible measures include a) a statistical procedure to identify potential outliers in the data; b) a preliminary test for heterogeneity (e.g., Levene's test); c) a simple data transformation (e.g., a logarithmic transformation) to eliminate significant heterogeneity; and/or d) the use of nonparametric methods or parametric procedures that are robust to heterogeneity of the data.
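These measures can be chained into a simple workflow; the sketch below, using hypothetical right-skewed data, applies Levene's test, falls back on a logarithmic transformation, and resorts to a nonparametric Kruskal-Wallis test only if the heterogeneity persists.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed organ weights with variance increasing with dose.
rng = np.random.default_rng(8)
groups = [rng.lognormal(mean=1.0 + 0.1 * k, sigma=0.3 + 0.1 * k, size=12)
          for k in range(3)]

# (b) preliminary test for heterogeneity of variance
if stats.levene(*groups).pvalue < 0.05:
    logged = [np.log(g) for g in groups]      # (c) logarithmic transformation
    if stats.levene(*logged).pvalue < 0.05:
        print(stats.kruskal(*groups))         # (d) nonparametric fallback
    else:
        print(stats.f_oneway(*logged))        # ANOVA on the transformed scale
else:
    print(stats.f_oneway(*groups))            # ANOVA on the original scale
```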
Adjusting organ weight for body weight.
One statistical issue that is especially relevant for endocrine disruptor data is how to take body weight differences into account when assessing possible changes in organ weight. The possible strategies include analyzing absolute organ weights without adjustment, analyzing organ/body weight ratios, and using an analysis of covariance (ANCOVA) with body weight as a covariable.
Using body weight as a covariable is not without potential difficulties, especially when the test chemical affects both organ and body weight. Technically, in an ANCOVA, the covariable should be independent of treatment. If the test chemical affects both body weight and organ weight, then it may be difficult to disentangle the effects of body weight and the effects of treatment on organ weight, as is discussed in more detail below.
Moreover, ANCOVA analyses generally assume that the effect of the covariate is the same at all dose levels. This assumption is usually assessed with a test for a possible interaction between the covariate (body weight in our analyses) and the factor of interest (chemical dose). Significant interactions between body weight and treatment were observed in some of our analyses. This implies that the effect of body weight was not the same at all dose levels of the test chemical, which clearly complicates the evaluation of experimental results.
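One way to carry out this check, sketched below with hypothetical data, is to compare the common-slope ANCOVA with a model that allows a separate body-weight slope in each dose group; a significant F statistic for the comparison indicates the interaction just described.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical data in which the body-weight slope differs across dose groups.
rng = np.random.default_rng(9)
dose = np.repeat([0, 1, 10], 20)
body_wt = rng.normal(25, 2, size=dose.size)
slope = np.where(dose == 0, 0.05, 0.02)            # dose-dependent covariate slope
organ_wt = 1.0 + slope * body_wt + rng.normal(0, 0.05, dose.size)
df = pd.DataFrame({"dose": dose, "body_wt": body_wt, "organ_wt": organ_wt})

common_slope = smf.ols("organ_wt ~ C(dose) + body_wt", data=df).fit()
separate_slopes = smf.ols("organ_wt ~ C(dose) * body_wt", data=df).fit()
# A significant F for this comparison means the covariate effect is not the
# same at all dose levels, so a common-slope ANCOVA is questionable.
print(anova_lm(common_slope, separate_slopes))
```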
Figures 1 and 2, using data from Lee (1998), illustrate a possible problem when a test chemical affects both organ and body weight. In this study, all three animals in the top dose group had very low body weights and extremely low testis weights relative to the other animals. A regression line through the origin (consistent with the use of the organ/body weight ratio) resulted in the testis weights of all three high-dose animals falling below the regression line. This implies a reduced testis/body weight ratio in this group, the conclusion reached by Lee (1998).
[Figures 1 and 2. Testis weight plotted against body weight, with a regression line through the origin; data from Lee (1998).]
This example, and others like it, illustrate that when the test chemical reduces both organ weight and body weight, the relative impact of the test chemical and reduced body weight on organ weight may become confounded. If a chemical reduces organ weight simply by making an animal smaller, then this is inherently different from a situation in which a chemical has a direct effect on organ weight. However, often these two outcomes are indistinguishable. An investigator must maintain an awareness of this when interpreting organ weight changes using an ANCOVA (or any other) approach.
Figure 3, which uses data from Alworth et al. (1999), illustrates another potential difficulty with ANCOVA. In this instance (ignoring the test chemical), there appears to be a negative association between uterus weight and body weight, with the heavier DES animals having reduced uterus weights relative to controls. However, a closer examination of the data reveals that within the DES and control groups there is no significant association between uterus weight and body weight. This suggests that the effects of DES are to increase body weight and, independently, to decrease uterus weight. Thus, despite the apparent negative association between body weight and uterus weight shown in Figure 3, these two variables appear to be independent within the DES and control groups, so no adjustment for body weight is needed.
[Figure 3. Uterus weight plotted against body weight for DES-treated and control animals; data from Alworth et al. (1999).]
Regression versus ANOVA.
Regression models relate responses to some function of dose, whereas ANOVA models treat the dose levels as categories. Thus, if there is an underlying pattern of dose response, regression models will be more sensitive in detecting these trends than an omnibus ANOVA. For this reason, our analyses frequently examined the data for linear and quadratic trends. A quadratic pattern becomes especially important if nonmonotonic dose-response patterns are possible for endocrine-disrupting compounds.
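A brief sketch of this kind of trend analysis, with hypothetical data: the linear term captures a monotonic trend, while the quadratic term can flag a U-shaped or inverted-U-shaped pattern that an analysis treating dose purely categorically might miss.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical inverted-U dose response: an effect at low doses that fades at high doses.
rng = np.random.default_rng(10)
dose = np.repeat([0.0, 0.1, 1.0, 10.0], 15)
response = 5 + 0.8 * dose - 0.08 * dose**2 + rng.normal(0, 0.5, dose.size)
df = pd.DataFrame({"dose": dose, "response": response})

linear = smf.ols("response ~ dose", data=df).fit()
quadratic = smf.ols("response ~ dose + I(dose ** 2)", data=df).fit()
print(linear.pvalues)     # test for a linear trend
print(quadratic.pvalues)  # the quadratic term tests for curvature
```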
Biological interpretation.
Not all end points exhibiting statistically significant treatment effects provide evidence of biologically important responses. Biological interpretation may overrule statistical significance when the treatment effect is real but the magnitude of the response is small or the nature of the change is not interpretable as an adverse response.
Some isolated treatment effects that cannot be replicated under identical experimental conditions may be false-positive outcomes. As with true positives, interpretation of false positives is a matter for scientific judgement of the investigator and lies outside the realm of statistical analysis.
On the other hand, end points not exhibiting statistically significant treatment effects may nevertheless be affected by treatment in potentially important ways. When end points responding to treatment are mistakenly overlooked, it may compromise the value of a study. The extent to which this can occur is embodied in the concept of statistical power, which is discussed earlier in this paper.
Data selectivity.
There are several valid reasons for discarding data in a given experimental setting. For example, there may have been a technical problem during the execution of a study and/or in measuring the variables of interest that compromised the scientific value of the data. Alternatively, there may be a statistical outlier in a given group that should be discarded. Another example may be a study in which the positive control does not produce the expected positive response. The prudent course of action in this latter case may be to declare the study inadequate and repeat it, regardless of the experimental outcome in the test groups.
However, data should not be discarded simply because the test groups did not produce the expected (or desired) response. Similarly, if several replicates are used, it is not appropriate to report only the one(s) producing the strongest (or weakest) response. The data evaluation and the reporting of experimental outcomes should be evenhanded, not selective.
Conclusion
This paper intentionally draws no bottom-line conclusion with regard to whether there are low-dose endocrine disruptor effects. That was not our objective. We hope that interested readers will examine the complete report of the Statistics Subpanel, as well as the reports of the other four subpanels, and reach their own conclusions on this important issue.
However, we do have a conclusion related to statistical issues. Our paper has identified a number of important statistical considerations that must be addressed in the design, analysis, and interpretation of experimental studies. Increased awareness of these issues should reduce the frequency of problems such as those encountered in our reanalysis of the low-dose endocrine disruptor data. Moreover, the statistical principles that we discussed are not limited to endocrine disruptor studies, but apply to the general conduct and evaluation of all laboratory investigations.
APPENDIX
ACKNOWLEDGMENTS
NOTES
REFERENCES
Carmer, S. G., and Swanson, M. R. (1973). An evaluation of ten pairwise multiple comparison procedures by Monte Carlo techniques. J. Am. Statist. Assoc. 68, 66-74.
Elswick, B. A., Welsch, F., and Janszen, D. (2000). Effect of different sampling designs on outcome of endocrine disruptor studies. Reprod. Toxicol. 14, 359-367.
Hochberg, Y., and Tamhane, A. C. (1987). Multiple Comparison Procedures. John Wiley and Sons, New York.
Hsu, J. C. (1996). Multiple Comparisons: Theory and Methods. Chapman and Hall, London, U.K.
Kutner, M. H., Nachtsheim, C. J., Wasserman, W., and Neter, J. (1996). Applied Linear Statistical Models, 4th ed. Irwin, Homewood, IL.
Lee, P. C. (1998). Disruption of male reproductive tract development by administration of the xenoestrogen, nonylphenol, to male newborn rats. Endocrine 9, 105-111.
Williams, D. A. (1971). A test for differences between treatment means when several dose levels are compared with a zero dose control. Biometrics 27, 103-117.
Williams, D. A. (1972). The comparison of several doses with a zero dose control. Biometrics 28, 519-531.