University College London, Department of Statistical Science, Gower Street, London WC1E 6BT, UK. E-mail: mervyn{at}stats.ucl.ac.uk
Wouldn't it be wonderful if Berkson's quotation of Karl Pearson were true for all applications of the higher statistics? Readers will know from experience that it is not, and that journals such as this must keep alive the search for that elusive common sense by letting voices from the past speak again and provoke responses that may help reduce the number of misapplications.
From the perspective of Berkson's 1942 paper1 we can look both ways: backward to the heady ferment of ideas in the four decades since Pearson pushed out the statistics boat into uncharted biological waters, or forward to the decades in which statistical thinking was heavily influenced by war-time demand for industrial utility and operational effectiveness, and then to the later decades in which not so much thinking as practice was free to blossom with the speed of electronic computation.
1901–1942: Formative years
With its emphasis on alternative hypotheses, it is difficult to conceive of a greater challenge to the Fisherian school, but Berkson went even further with:
How blind is the procedure of doing some test of significance, when there is no knowledge at hand as to whether it is likely to show a significant result or not....
Berkson's paper consists of six salvos fired into the Fisherian camp, each based on a different statistical problem. Fisher responded,2 reproduced here, to only one of these salvos, which leads one to wonder: why the one and why not the other five?
Fisher always had a straightforwardly simple view of significance testing, as just one element of scientific investigation. It was for him an activity with a natural and theoretically influential order to its four components:

(1) Data feature of potential scientific interest
(2) Representation of the feature as a test statistic S
(3) Sceptical null hypothesis
(4) P-value as the probability under the null hypothesis that S equals or exceeds its observed value.

Off-stage, alternative hypotheses may play a role as inchoate concepts in the perception of the data feature, or as components of any scientific afterthought or development.
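For readers who like to see component (4) in computational terms, here is a minimal sketch, not from Berkson or Fisher, of a P-value obtained as the null-hypothesis probability that S equals or exceeds its observed value, estimated by simulation; the test statistic and null model chosen below are purely illustrative assumptions.

```python
# Minimal sketch of component (4): P-value = Pr(S >= s_obs | H0), by Monte Carlo.
# The statistic (mean of 10 observations) and the null model (standard normal data)
# are illustrative assumptions, not anything taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

def p_value_by_simulation(s_obs, simulate_null_statistic, n_sim=100_000):
    """Estimate Pr(S >= s_obs) under H0 by simulating the null distribution of S."""
    null_draws = np.array([simulate_null_statistic() for _ in range(n_sim)])
    return float(np.mean(null_draws >= s_obs))

# Hypothetical example: S is the mean of 10 observations, H0: data are N(0, 1).
simulate_mean_under_null = lambda: rng.normal(size=10).mean()
print(p_value_by_simulation(0.8, simulate_mean_under_null))  # ~0.006: 0.8 is about 2.5 null SDs from 0
```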
In Section 7 of The Design of Experiments, Fisher had written:
It is usual and convenient for experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results.
A later, less circumspect, expression of this (and more) is to be found in Fisher's response3 to the psychologist WE Hick, who had asked him inter alia why anyone should bother about the second tail of a two-tailed test:
... they [Neyman & Pearson] approach the problem entirely from the wrong end, i.e. not from the point of view of a research worker, with a basis of well grounded knowledge on which a very fluctuating population of conjectures and incoherent observations is continually under examination. In these circumstances the experimenter does know what observation it is that attracts his attention. What he needs is a confident answer to the question 'Ought I to take any notice of that?' This question can, of course, and for refinement of thought should, be framed as 'Is this particular hypothesis overthrown, and if so at what level of significance, by this particular body of observations?' It can be put in this form unequivocally only because the genuine experimenter already has the answers to all the questions that the followers of Neyman and Pearson attempt, I think vainly, to answer by merely mathematical consideration.
In other words, significance testing was for Fisher a screening device that either can protect the experimenter (or observer) from following chimeras created by chance variation from an uninteresting (null) hypothesis H0, or can positively signpost directions in which scientific gold may be discovered. The test statistic horse S has to go before any alternative hypothesis cart H1, even before thinking about the relevant null hypothesis. It is S that attracts the experimenter's attention or that, from the Latin roots of the word significance, 'makes the sign'. It is what engenders the concept of one or more alternatives H1 as whatever in Nature would lead to larger values of S than would be expected when calibrated by the chance hypothesis H0. If the P-value resulting from this calibration is small enough, the experimenter is encouraged to follow the signpost and develop the science that it may point to, but must also know how to conduct 'an experiment which will rarely fail to give ... a statistically significant result' (quoting Section 7 of The Design of Experiments again).
We are now in a position to analyse and judge the strength of the five salvos (counter-examples to the Fisherian doctrine) to which Fisher did not return fire, and to comment somewhat presumptuously on the one where he did.
Salvo of the 100-faced die ...
Had he responded to this salvo, Fisher would probably have retaliated with the pungency he expressed in the letter to Hick already quoted:
... the practical experimenter does not often put up a damn-fool test of significance but it is a labour of many years and much art for the Theory of Testing Hypotheses to avoid such tests.
This would not have been bombast. In the mid-1930s Fisher had already developed two ideas more fundamental than Neyman-Pearsonian power, in order to maximize the informativeness of both statistical estimation and significance testing: sufficiency and ancillarity. Applied to the die example, the random sample itself, excluding the outcome of the throw, is a sufficient statistic for the normality question, since the (conditional) distribution of that outcome (given the random sample values) does not depend at all on the shape of the distribution that generated the random sample. The sufficiency principle ('Use no more than a sufficient statistic') then excludes the die from any consideration. Exclusion is also assured if we apply the ancillarity principle: the throw outcome is an ancillary statistic because its (unconditional) distribution does not depend at all on the shape of the sample distribution, and the principle requires that we should use the (conditional) distribution of the random sample (given the ancillary) as the basis of any inference (here a significance test of normality).
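The argument can be put compactly in notation of my own choosing (not Berkson's or Fisher's): write the data as the pair (X, D), where X is the random sample with unknown distribution F and D is the die outcome, generated independently of X with a known distribution q.

```latex
% Illustrative notation: X = (X_1,\dots,X_n) is the random sample from the unknown F;
% D is the 100-faced-die outcome, independent of X, with known distribution q. Then
\[
  p(x, d \mid F) \;=\; p_F(x)\, q(d)
  \;\;\Longrightarrow\;\;
  p(d \mid x, F) = q(d)
  \quad\text{and}\quad
  p(x \mid d, F) = p_F(x).
\]
% The first identity says D is ancillary (its distribution is free of F); the second says
% that, for the normality question, X alone is sufficient. Either principle excludes the die.
```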
... of the problem of middling P-values for tests of Poissonianity
... of Student and his haemocytometer
the probabilities 0.04, 0.68, 0.25, and 0.64, though not particularly high, are not at all unlikely in four trials, supposing our theoretical law [Poissonianity] to hold, and we are not likely to be very far wrong in assuming it to do so.
In Section 15 of the 1930 edition of Statistical Methods for Research Workers, Fisher looked at the set of haemocytometer counts that gave Student the P-value of 0.64 and commented that the expected frequencies '... agree well with those observed'. In Section 20 (The χ2 distribution) Fisher also wrote:
The term Goodness of Fit has caused some to fall into the fallacy of believing that the higher the value of P the more satisfactorily is the hypothesis verified. Values over .999 have been reported which, if the hypothesis were true, would only occur once in a thousand trials.... In these cases the hypothesis considered is as definitely disproved as if P had been .001.
The 'agree well with' and 'verified' in these quotations suggest that for χ2 tests of goodness of fit Fisher was not far from being able to agree with Berkson and accept in some sense an unrejected null hypothesis, one that could, after all, be determined by increasing without limit the sample size (the sum of the observed frequencies). But this harmony would not have extended so easily to significance tests in general, as the following extract from a letter3 that Fisher wrote to W Edwards Deming in 1935 makes clear:
There is a good deal in the approach chosen by Neyman and Pearson that I disagree with ... It is ... a pity that these writers have introduced the concept of errors of the second kind, i.e. of accepting an hypothesis when it is false, seeing that until the true hypothesis is specified, such errors are undefined both in magnitude and in frequency. Their phraseology also encourages the very troublesome fallacy that when a deviation is not significant the hypothesis tested should be accepted as true.
... of fetal sex
... of hospital mortality rates
I am a little sorry that you have been worrying yourself at all with that unnecessarily portentous approach to tests of significance represented by the Neyman and Pearson critical regions, etc ....
Those are the generalities touched on in this salvo. There is, however, a particularity of the hospital mortality data, presented as an actual experience of mortalities but without reference to any publication, that deserves comment. It is revealed when we try to reproduce what Fisher would probably have done. The principal feature of the data is the apparent benefit of vaccination for all six types of operation, which gives P = 0.016 by the one-tailed sign test. For this feature (which implicitly points to a class of alternative hypotheses) Fisher would surely have chosen as test statistic not the one nominated for him by Berkson, but the product of the six one-tailed exact P-values that condition on the ancillary information in the marginal frequencies and that Fisher had presented to the Royal Statistical Society 7 years previously.4 Employing the continuity correction in ref. 5 (where one proof may also require correction!) that puts weight of only one-half on the probability of the datum actually observed, these exact P-values are 0.33, 0.31, 0.49, 0.36, 0.34, and 0.37. They are all well short of individual statistical significance despite the continuity correction, and together they deliver a quite insignificant Fisher-combination P-value of 0.43, in spite of the fact that all six exact P-values are less than 1/2. Unless the provenance of the data is questioned, this discrepancy must be the handiwork of pure chance, without the portentous significance that Berkson attributes to it. But Fisher might well have responded to the accusation of blindness by asking Berkson to explain the P-value of 0.98 in the 4th test: such close-to-unity values are usually indicative of spurious provenance.
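Readers who wish to check the arithmetic can do so with the following sketch, which reproduces the two figures quoted above from the six exact P-values; it assumes only the standard chi-squared form of Fisher's combination statistic and is my restatement, not a calculation taken from either paper.

```python
# Reproduce the quoted figures: P = 0.016 for the one-tailed sign test (benefit in all
# six operation types) and ~0.43 for Fisher's combination of the six exact P-values.
from math import log
from scipy.stats import chi2

p_sign = 0.5 ** 6                                   # = 0.015625, quoted as P = 0.016

p_exact = [0.33, 0.31, 0.49, 0.36, 0.34, 0.37]      # the six one-tailed exact P-values
fisher_statistic = -2 * sum(log(p) for p in p_exact)         # ~12.2, chi-squared on 2*6 = 12 df under H0
p_combined = chi2.sf(fisher_statistic, df=2 * len(p_exact))  # ~0.43, as quoted
print(round(p_sign, 3), round(p_combined, 2))
```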
Salvo six and Return Fire: The Drosophila eye-facets
a small P is to be expected frequently if the regression is linear and a value of the abscissal variate, in this case the temperature, is not constant but subject to fluctuation, that on inspection it appears as straight a line as one can expect to find in biological material, and that his own judgement would be, not that the regression is nonlinear, but that the temperature has varied....
So Fisher was wrong to reject the linearity hypothesis! As it happens, the geneticist in Fisher had taken an interest in the deviations from linearity that Hersh related to a question of heterozygosity, while Hersh himself had openly conceded that the recorded temperature may have been subject to error. Fortunately, we have the pleasure of reading here Fisher's responding salvo,2 an entertaining matter after 60 years but one that, at the time, Berkson is unlikely to have welcomed.
1943–2003: Sixty years of schism and realignment
if we possess a unique sample ... on which significance tests are to be performed, there is always ... a multiplicity of populations to each of which we can legitimately regard our sample as belonging.
However, in the intervals between his innovative researches as the world's leading statistical geneticist, Fisher did not elaborate the relative simplicity of his significance test doctrine, but concentrated on justifying his pre-war ideas of inductive fiducial inference. These were based on so-called pivotal quantities such as the error of observation e = x − θ in a single observation x of a parameter θ. Fisher developed a subtle line of thinking that allowed an objective and scientifically established probability distribution for e to be assigned to θ, as its fiducial distribution, with x fixed at its observed value. (This transfer has never been widely accepted. Its mathematical structure has been sympathetically explored in ref. 7.)
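A minimal sketch of the fiducial transfer may help, under the illustrative assumption (mine, not stated above) that the observational error is standard normal:

```latex
% Illustrative assumption: the pivotal quantity e = x - \theta has a known N(0,1) distribution.
\[
  e = x - \theta \sim N(0,1)
  \qquad\leadsto\qquad
  \theta = x - e \sim N(x,\,1) \quad\text{with } x \text{ fixed at its observed value},
\]
% the right-hand distribution being what Fisher called the fiducial distribution of \theta.
```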
Fisher moved down this road into open dispute with the Neyman/Wald/Lehmann version of statistical science that was being developed in North America and that generated a highly mathematical theory of hypothesis testing and estimation. Both sides of an increasingly bitter controversy had to find ingenious ways of dealing with different counter-examples to their internal consistencies. Both sides were unprepared for the discovery8 that acceptance of the Sufficiency and Ancillarity Principles implied acceptance of the Likelihood Principle: statistical evidence lies wholly in the shape of the likelihood function determined purely by the observed data. They also became increasingly vulnerable to Bayesian onslaughts from those subscribing to one or other set of axioms of rational subjective choice. By 2003, both the Fisherian and the Neyman-Pearsonian schools can be said to have failed to achieve their initiators' expectations (and may even have reached the buffers, some would say), whereas the Bayesian school appears by now to be confident that it will own the professional future. Readers may find refs 9 and 10 useful in coming to their own view of the current state of affairs. What would also help would be a careful analysis of the Bayesian doctrine's pretension to identify itself with the science of statistics in its widest and most useful sense, i.e. as a methodology that should inspire or instill in its practitioners a resistance to unethical temptations or pressures, whatever the forum in which it is invoked, and that ensures its techniques comply with the unwritten code of scientific integrity.
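The discovery in question8 can be stated compactly as follows; the notation Ev(E, x) for the 'evidential meaning' of outcome x of experiment E about a common parameter θ is my paraphrase of Birnbaum's formulation, not a quotation.

```latex
% Paraphrase of Birnbaum's theorem: the Sufficiency and Ancillarity (Conditionality)
% Principles together imply the Likelihood Principle. E_1, E_2 are experiments about
% a common parameter \theta, with likelihood functions L_{E_1}, L_{E_2}.
\[
  \text{Sufficiency} \;+\; \text{Ancillarity}
  \;\Longrightarrow\;
  \Bigl[\; L_{E_1}(\theta; x_1) \propto L_{E_2}(\theta; x_2)
  \;\Rightarrow\; \mathrm{Ev}(E_1, x_1) = \mathrm{Ev}(E_2, x_2) \;\Bigr].
\]
```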
A personal footnote
members of my present audience will know from their own personal and professional experience that it is to the statistician that the present age turns for what is most essential in all its more important activities. They are the backroom boys of every significant enterprise.
Sadly, many backrooms in Britain are now subject to a different sort of expansion, not of statisticians but of overseers (an influential mix of academics manqués, accountants, auditors, ... and so on through the alphabet), whose purposes are often indifferent to the high standards of truthful inquiry that Fisher had in mind in 1952.
Since hope springs eternal, I would like to end on a happier note. My own backroom, now free of overseers, has seen the completion12 of some analyses of several years' experimental data on haemopoietic stem cells from Dr Martin Rosendaal's University College London laboratory mice. The analyses are little more than a combination of exploratory data analysis and simple significance tests based on Fisherian techniques. But together they point to a demographic explanation of the principal feature of the data, namely, an enhancement of blood-cell repopulation by stem cell grafts of a heterozygous genotype, a finding that echoes the feature of Hersh's eye-facet data that probably caught the eye of the geneticist in Fisher.
References
2 Fisher RA. Note on Dr. Berkson's criticism of tests of significance. J Am Statist Assoc 1943;38:103–04. Reprinted Int J Epidemiol 2003;32:692.
3 Bennett JH (ed.). Statistical Inference and Analysis: Selected Correspondence of R A Fisher. Oxford: Clarendon Press, 1990.
4 Fisher RA. The logic of inductive inference. J Roy Statist Soc 1935;98:39–54.
5 Stone M. The role of significance testing: Some data with a message. Biometrika 1969;56:485–93.
6 Fisher RA. Statistical methods and scientific induction. J Roy Statist Soc B 1955;17:69–78.
7 Dawid AP, Stone M. The functional model basis of fiducial inference. Ann Statist 1982;10:1054–74.
8 Birnbaum A. On the foundations of statistical inference (with Discussion). J Am Statist Assoc 1962;57:269–326.
9 Cox DR. The role of significance tests. Scand J Stat 1977;4:49–70.
10 Salsburg D. Hypothesis Testing. Entry in: Armitage P, Colton T (eds). Encyclopedia of Biostatistics. Vol. 3. New York: John Wiley & Sons, 1998.
11 Fisher RA. The expansion of statistics. J Roy Statist Soc A 1953;116:1–6.
12 Rosendaal M, Stone M. Demographic explanation of a remarkable enhancement of repopulation haemopoiesis by heterozygous connexin43/45 stem cells seeded on wildtype connexin43 stroma. Clin Sci 2003; in press.