MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge CB2 2SR, UK. E-mail: ian.white{at}mrc-bsu.cam.ac.uk
The original paper of Gunnell et al.1 explored whether reporting error in various quantities is correlated with the true values: for example, whether heavier people are more likely to under-report their weight. In the present debate, the original analysis is rightly criticized both by Gunnell et al.2 and by Gilthorpe and Tu.3 Gunnell et al. propose an alternative analysis which Gilthorpe and Tu agree with only when no adjustment is made for covariates. For the adjusted case, Gilthorpe and Tu propose a multilevel analysis. I will argue (1) that the alternative analysis of Gunnell et al. is flawed even in this unadjusted case, and (2) that the multilevel analysis of Gilthorpe and Tu is not necessarily an improvement in the adjusted case.
Unadjusted case
I will use the following notation:
Formally, Model C tests the hypothesis that x1 and x2 have equal variances. Since negative correlation between e1 and X reduces the variance of x1, this appears a sensible hypothesis to test. However, the variances of x1 and x2 are also affected by the errors e1 and e2, which are unlikely to have equal variances. In fact, the reporting error e1 is likely to have larger variance than the measurement error e2. This will tend to make the variance of x1 larger than that of x2. So the proposed regression of x1 - x2 on ½(x1 + x2) could give a zero coefficient either when e1 and X are negatively correlated and e1 has larger variance than e2, or when e1 and X are uncorrelated and e1 and e2 have equal variance.
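The two routes to a zero coefficient can be seen in a short derivation. This is a sketch under assumptions not stated explicitly in the surviving text: that self-reported and measured values decompose as x1 = X + e1 and x2 = X + e2, and that the measurement error e2 is uncorrelated with both X and e1.

```latex
\begin{align*}
\operatorname{Cov}\!\left(x_1 - x_2,\ \tfrac{1}{2}(x_1 + x_2)\right)
  &= \tfrac{1}{2}\left[\operatorname{Var}(x_1) - \operatorname{Var}(x_2)\right] \\
  &= \tfrac{1}{2}\left[\operatorname{Var}(e_1) - \operatorname{Var}(e_2)\right]
     + \operatorname{Cov}(X, e_1).
\end{align*}
```

Under these assumptions the Model C slope is zero whenever Cov(X, e1) = -½[Var(e1) - Var(e2)], so a negative correlation between e1 and X can be exactly offset by an excess of reporting-error variance over measurement-error variance.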
To summarize, Model B implicitly assumes that the error e2 is zero, while Model C implicitly assumes that the errors e1 and e2 have equal variances. In the present context, e2 is likely to have much smaller variance than e1, so ironically the original Model B may well be more correct than the new Model C.
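A small simulation may make the comparison concrete. It is only a sketch: the additive error structure (x1 = X + e1, x2 = X + e2), the reading of Model B as a regression of x1 - x2 on x2, and all numerical values are illustrative assumptions rather than quantities taken from the papers under discussion. Reporting error is generated independently of the true value, so an ideal analysis would return a slope of zero.

```python
import numpy as np

def slope(y, x):
    """Ordinary least-squares slope from a simple regression of y on x."""
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]

def simulate(n=200_000, sd_true=10.0, sd_e1=4.0, sd_e2=0.5, seed=0):
    """Return (Model B slope, Model C slope) when e1 is independent of X.

    All standard deviations are hypothetical values chosen so that the
    reporting error e1 is much more variable than the measurement error e2.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(70.0, sd_true, n)   # unobserved true value
    e1 = rng.normal(0.0, sd_e1, n)     # reporting error, independent of X
    e2 = rng.normal(0.0, sd_e2, n)     # measurement error, independent of X
    x1 = X + e1                        # self-reported value
    x2 = X + e2                        # measured value
    d = x1 - x2
    slope_b = slope(d, x2)             # assumed form of Model B
    slope_c = slope(d, (x1 + x2) / 2)  # Model C
    return slope_b, slope_c

if __name__ == "__main__":
    b, c = simulate()
    print(f"Model B slope: {b:.4f}")
    print(f"Model C slope: {c:.4f}")
```

With these illustrative settings the Model B slope should sit close to zero (it is pulled slightly negative by the small variance of e2), while the Model C slope should be clearly positive even though e1 and X are uncorrelated.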
Is there a better approach? A full analysis would allow the error variances to differ and would allow e1 to be correlated with X. Unfortunately this model is unidentified: it is not possible to estimate all the parameters simultaneously. To make progress, we would have to make further untestable assumptions. The most sensible approach would be to make a realistic assumption about the variance of e2.
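The lack of identification can be sketched by counting moments. Assume, as an illustration, that x1 = X + e1 and x2 = X + e2, that e2 is uncorrelated with X and e1, and that only second moments carry information (as under joint normality). The symbols \sigma_X^2, \sigma_1^2, \sigma_2^2 and \sigma_{X1} for Var(X), Var(e1), Var(e2) and Cov(X, e1) are introduced here for convenience.

```latex
\begin{align*}
\operatorname{Var}(x_1)      &= \sigma_X^2 + \sigma_1^2 + 2\sigma_{X1}, \\
\operatorname{Var}(x_2)      &= \sigma_X^2 + \sigma_2^2, \\
\operatorname{Cov}(x_1, x_2) &= \sigma_X^2 + \sigma_{X1}.
\end{align*}
```

There are three observable moments but four unknown parameters. Once a value is assumed for \sigma_2^2, the rest follow in turn: \sigma_X^2 = Var(x2) - \sigma_2^2, then \sigma_{X1} = Cov(x1, x2) - \sigma_X^2, then \sigma_1^2 = Var(x1) - \sigma_X^2 - 2\sigma_{X1}.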
Covariate-adjusted case
Now set aside the above difficulties by assuming that the two error variances are indeed equal, so that Model C above is correct. It is straightforward to include covariates, z say, in Model C. Gilthorpe and Tu rightly question whether such an analysis still addresses the hypothesis of interest.
However, Gilthorpe and Tu are wrong to suggest that changing from regression modelling to multilevel modelling automatically solves the problem. It is straightforward to show that the regression coefficient from Model C is zero if and only if x1 and x2 have equal variances conditional on the covariates z. This is exactly the same hypothesis that would be tested by the multilevel model. (Here I am assuming that we do not use the freedom of multilevel models to enter one set of covariates for x1 and a different set of covariates for x2.) I therefore support Gilthorpe and Tu's proposal to use a multilevel model because assumptions may be made more explicit, but much care is still needed in setting up the model to address the question of interest. Further, the multilevel model should be set up to explicitly include the parameter of interest, rather than working indirectly through comparisons of variances.
To illustrate a possible difficulty in setting up the model, consider an analysis where X is weight and we adjust for self-reported height z. Suppose that individuals who under-report their weight tend also to over-report their height. Self-reported height is therefore negatively correlated with the error e1, and this will tend to bias the regression coefficient in Model C. On the other hand, if z is measured height then no such bias arises. The precise choice of z is therefore essential.
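The height example can be explored with a simulation along the following lines. Everything in it is a hypothetical construction for illustration: the additive error structure, the equal weight-error variances, the strength of the negative association between the weight and height reporting errors, and every numerical value. The weight reporting error is generated independently of true weight, so the Model C coefficient on the mean ought to be zero.

```python
import numpy as np

def model_c_slope(d, m, z):
    """Coefficient on m from least squares of d on an intercept, m and z."""
    design = np.column_stack([np.ones_like(m), m, z])
    coef, *_ = np.linalg.lstsq(design, d, rcond=None)
    return coef[1]

def simulate_height_adjustment(n=200_000, seed=1):
    """Return Model C slopes adjusting for (self-reported, measured) height."""
    rng = np.random.default_rng(seed)
    # True height (cm) and true weight (kg), positively correlated.
    height = rng.normal(170.0, 10.0, n)
    weight = 70.0 + 0.8 * (height - 170.0) + rng.normal(0.0, 8.0, n)
    # Weight errors with equal variances, both independent of true weight.
    e1 = rng.normal(0.0, 3.0, n)   # reporting error in weight
    e2 = rng.normal(0.0, 3.0, n)   # measurement error in weight
    # Height reporting error moves opposite to e1: people who under-report
    # weight tend to over-report height.
    u = -0.5 * e1 + rng.normal(0.0, 1.0, n)
    x1, x2 = weight + e1, weight + e2
    d, m = x1 - x2, (x1 + x2) / 2
    z_self = height + u                            # self-reported height
    z_measured = height + rng.normal(0.0, 0.5, n)  # measured height
    return model_c_slope(d, m, z_self), model_c_slope(d, m, z_measured)

if __name__ == "__main__":
    s_self, s_meas = simulate_height_adjustment()
    print(f"Adjusting for self-reported height: {s_self:.4f}")
    print(f"Adjusting for measured height:      {s_meas:.4f}")
```

In this setup the coefficient should stay near zero when adjusting for measured height but move away from zero when adjusting for self-reported height, purely because the covariate shares error with x1.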
The best approach to this problem would be to jointly model self-reported weight and height and measured weight and height. These four observed values would be explicitly related to the unobserved true values. The model would then allow the error in self-reported weight (for example) to be associated with the error in self-reported height, with the true weight, and with the true height. Such a model would be under-identified and the data analyst would be forced to be explicit about the identifying assumptions to be made.
References
2 Gunnell D, Berney L, Holland P et al. Does the misreporting of adult body size depend upon an individual's height and weight? Methodological debate. Int J Epidemiol 2004; 33:1398–99.
3 Gilthorpe MS, Tu Y-K. Mathematical coupling: a multilevel approach. Int J Epidemiol 2004; 33:1399–400.