The need for statistical rigour when pooling data from a variety of sources

D.E. Walters

Thorpes, The Grip, Linton, Cambridge CB1 6NR, UK

Dear Sir,

The publication of a letter in a recent issue of Human Reproduction (James, 1999) has prompted this appeal for more statistical rigour when pooling several sets of data to provide a single composite finding. The datasets in question are summarized in Table I, which displays the proportions of male offspring for two periods of conception: the 'most fertile' days of the menstrual cycle and the remaining period. A previous paper by Gray et al. (1998) had failed to detect any importance for this factor, but James (1999) assembled the data from five published papers and produced a composite finding suggesting a higher than expected proportion of males when conception occurred outside the 'most fertile' period.


Table I. Proportions of male births originating from two conception periods (James, 1999)
 
The purpose of this response is to point out some serious shortcomings of the James analysis, and to emphasize the need for absolute rigour when analysing an assembly of datasets from a variety of sources.

It would appear that James made his inference by simply pooling the relevant frequencies over the five references to produce a χ² statistic of 9.9 on one degree of freedom, a highly significant result.
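The pooling described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `pooled_chi2` is hypothetical, the counts are invented (they are not the data of Table I), and the Pearson statistic is computed without a continuity correction, which may or may not match the calculation James used.

```python
def pooled_chi2(trials):
    """Pearson chi-square (1 df, no continuity correction) on summed counts.

    Each trial is a tuple (males_a, females_a, males_b, females_b), where
    a and b denote the two conception periods. Trial identity is discarded:
    the counts are simply added up before testing.
    """
    a = sum(t[0] for t in trials)  # males, period a
    b = sum(t[1] for t in trials)  # females, period a
    c = sum(t[2] for t in trials)  # males, period b
    d = sum(t[3] for t in trials)  # females, period b
    n = a + b + c + d
    # Standard shortcut formula for a 2 x 2 table
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts for two trials, NOT the data of Table I
example_trials = [(30, 40, 50, 40), (20, 25, 45, 35)]
print(pooled_chi2(example_trials))
```

Because the trials are collapsed before the statistic is formed, nothing in this calculation can reveal whether the individual trials agree with one another.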

In fact, that simple test would be valid only if all five references were estimating the same unknown but constant proportion within each of the two columns of data. Experience has shown that there is very often a degree of heterogeneity between datasets, which represent differing conditions that are generally unknown to the analyst and completely beyond the analyst's control. It is therefore absolutely necessary to investigate heterogeneity before applying the simplest of tests to the total frequencies. The effect of wrongly applying the test would invariably be to exaggerate the importance of the effect being investigated: heterogeneity inflates the true error, so a test that ignores it understates that error and overstates the evidence for the effect.
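One classical way of checking heterogeneity before pooling is Woolf's test of homogeneity of odds ratios, sketched below. This is a swapped-in technique offered for illustration and is not the logistic regression analysis used in this letter. The function name `woolf_heterogeneity` is hypothetical, and the sketch assumes every cell count is non-zero.

```python
import math


def woolf_heterogeneity(trials):
    """Return (Q, df) for Woolf's test of homogeneity of odds ratios.

    Each trial is (males_a, females_a, males_b, females_b). Q is referred
    to a chi-square distribution on df = (number of trials - 1) degrees
    of freedom. All four counts in every trial must be non-zero.
    """
    log_ors, weights = [], []
    for a, b, c, d in trials:
        log_ors.append(math.log((a * d) / (b * c)))
        # Weight = reciprocal of the approximate variance of the log odds ratio
        weights.append(1.0 / (1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d))
    pooled = sum(w * y for w, y in zip(weights, log_ors)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_ors))
    return q, len(trials) - 1
```

A large value of Q relative to its degrees of freedom indicates that the trials disagree about the size of the effect, in which case a test on the summed frequencies is not appropriate.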

The complete and rigorous analysis of the data in Table I would formerly have presented severe problems, but fairly recent advances in statistical computing have now resolved these difficulties. Logistic regression may be used to model the data and investigate all the effects. The statistical package GENSTAT (1988) is particularly well suited to this type of analysis and summarizes the findings very concisely. It is not appropriate here to delve into the mathematical complexities of the analysis, but, stripped of statistical jargon, we fit a model such that the logistic transform of the proportion in any of the 10 cells in Table I is computed as follows:

logit(p) = Constant + Trial effect + Period effect + Error

where the 'Error' component here corresponds to the Trial × Period effect. The results of analysing Table I are presented in Table II, where the column headed 'deviance' may be regarded as a χ² statistic on the quoted degrees of freedom. Crucially, the Trial × Period component gives a χ² of 9.40 on 4 degrees of freedom (P = 0.05). This denotes significant (trial) heterogeneity, and a simple χ² test on totals is therefore not appropriate. We may instead carry out a test by computing the ratio of the mean deviance for Period (= 9.46) to that of Trial × Period (= 2.35), which may be treated as an F statistic on 1 and 4 degrees of freedom. The P value for this ratio of 4.02 is 0.12, which is hardly remarkable. The essential ingredient of the more rigorous analysis is the use of the interaction mean deviance as the 'Error' instead of the figure of 1.0, which is implicit in the use of the χ² test and is indeed the figure that would be expected if there were no trial heterogeneity. Note that the deviance of 9.46 for 'Period' in Table II corresponds well with the figure of 9.9 computed by James (1999).
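The arithmetic in the preceding paragraph can be checked from the quoted deviances alone. The sketch below uses only the Python standard library, so the tail probabilities are written as closed forms that hold for these particular degrees of freedom (1 and 4). The helper names are hypothetical, and the only inputs are the figures quoted in the text.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of chi-square on 1 df: P(|Z| > sqrt(x))."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_4df(x):
    """Upper-tail probability of chi-square on 4 df (closed form for even df)."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


def f_sf_1_4(f):
    """Upper-tail probability of F on (1, 4) df.

    Uses the identity F(1, 4) = T^2 with T distributed as Student's t on
    4 df, and the closed-form t distribution function for 4 df.
    """
    t = math.sqrt(f)
    s = 1.0 + t * t / 4.0
    cdf = 0.5 + 0.375 * (t / math.sqrt(s)) * (1.0 - (t * t / s) / 12.0)
    return 2.0 * (1.0 - cdf)


# Figures quoted in the text
period_deviance = 9.46       # on 1 df
interaction_deviance = 9.40  # Trial x Period, on 4 df
interaction_df = 4

mean_interaction_deviance = interaction_deviance / interaction_df  # 2.35
f_ratio = period_deviance / mean_interaction_deviance

print("heterogeneity P:", chi2_sf_4df(interaction_deviance))
print("naive chi-square P for Period:", chi2_sf_1df(period_deviance))
print("F ratio:", f_ratio, "P:", f_sf_1_4(f_ratio))
```

The contrast between the last two lines is the point of the letter: the same Period deviance is highly significant when referred to χ² on 1 degree of freedom, but unremarkable once it is judged against the Trial × Period mean deviance.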


Table II. Results of a logistic regression carried out on the data displayed in Table I
 
In considering the analyses for this present investigation, the question to be answered is: if these five trials represent a random sample from the entire population, is there evidence of a genuine relationship between sex ratio and the time of conception? Because of substantial systematic variation between trials in the relative sex ratios, the rigorous analysis failed to detect statistical evidence, whereas the simplest pooling analysis did produce a significant result. This is a clear demonstration of the care that is needed when analysing pooled datasets, in what is now widely referred to as 'meta-analysis'. A thorough examination of the sources of variation is necessary in order to give a proper summary of the findings.

Before the advent of GLM (Generalized Linear Modelling) computer analyses, data such as those displayed in Table I were generally analysed using weighted least squares of either the proportions or, preferably, a suitable transform of the proportions such as the inverse sine or the logistic transform. The ANOVAR would typically be as in Table III.


Table III. Outline analysis of variance for the data structure of Table I
 
The test of whether the period of conception influenced the sex ratio is an F test on the quotient B/C, with 1 and 4 degrees of freedom. Note the similarity, in both structure and test statistic, between this ANOVAR and the analysis of deviance in Table II.
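For the weighted least squares approach mentioned above, each cell proportion is transformed and given a weight equal to the reciprocal of its approximate variance. The sketch below shows the two transforms named in the text with their standard large-sample weights. The function names are hypothetical, and the weights are the usual approximations rather than anything specified in this letter.

```python
import math


def logit_with_weight(males, total):
    """Logistic transform of a proportion and its least squares weight.

    The approximate variance of logit(p) is 1 / (n p (1 - p)), so the
    weight is n p (1 - p). Requires 0 < males < total.
    """
    p = males / total
    return math.log(p / (1.0 - p)), total * p * (1.0 - p)


def arcsine_with_weight(males, total):
    """Inverse sine transform (in radians) and its least squares weight.

    The approximate variance of asin(sqrt(p)) is 1 / (4n), so the weight
    is 4n and, unlike the logit weight, does not depend on p.
    """
    p = males / total
    return math.asin(math.sqrt(p)), 4.0 * total
```

The transformed values and weights for the 10 cells would then be passed to a weighted analysis of variance with the Trial, Period and Trial × Period structure outlined in Table III.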

It has not been the intention in this letter to discuss the scientific proposition relating to conception time, but rather to highlight the shortcomings of the simple analysis based on a χ² test of the summed frequencies. Unfortunately, the bias introduced by this unsatisfactory approach will almost certainly be in one direction: that of exaggerating the importance of any effect being investigated. This is because the 'Error' implicit in the adoption of the χ² test is very frequently a gross underestimate of the true error. Analysts need to resist the temptation simply to aggregate frequencies before carrying out a careful study of trial heterogeneity.

In view of the points outlined above, readers will appreciate that a lack of attention to study heterogeneity in meta-analyses will, more often than not, lead to exaggerated claims for effects of which there is little or no real statistical evidence. Fortunately, the widespread availability of statistical software to carry out GLM analyses now permits a detailed, rigorous analysis for data of that sort.

References

Gray, R.H., Simpson, J.L., Bitto, A.C. et al. (1998) Sex ratio associated with timing of insemination and length of the follicular phase in planned and unplanned pregnancies during use of natural family planning. Hum. Reprod., 13, 1397–1400.

GENSTAT (1988) The GENSTAT V Reference Manual. Clarendon Press, Oxford, UK.

James, W.H. (1999) The status of the hypothesis that the human sex ratio at birth is associated with the cycle day of conception. Hum. Reprod., 14, 2177–2178.