Assessing correlation between reporting errors and true values: untestable assumptions are unavoidable

Ian R White

MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge CB2 2SR, UK. E-mail: ian.white{at}mrc-bsu.cam.ac.uk

The original paper of Gunnell et al.1 explored whether reporting error in various quantities is correlated with the true values—for example, whether heavier people are more likely to under-report their weight. In the present debate, the original analysis is rightly criticized both by Gunnell et al.2 and by Gilthorpe and Tu.3 Gunnell et al. propose an alternative analysis which Gilthorpe and Tu agree with only when no adjustment is made for covariates. For the adjusted case, Gilthorpe and Tu propose a multilevel analysis. I will argue (1) that the alternative analysis of Gunnell et al. is flawed even in this unadjusted case, and (2) that the multilevel analysis of Gilthorpe and Tu is not necessarily an improvement in the adjusted case.

Unadjusted case
I will use the following notation:

X = the true (unobserved) quantity, e.g. true weight
x1 = the self-report = X + e1
x2 = the objective measurement = X + e2
Here e1 is a reporting error and e2 is a measurement error. The question of interest is whether e1 is correlated with X. e2 is assumed to be uncorrelated with X. Three possible models are:
Model A: regress x1 – X on X.
Model B: regress x1 – x2 on x2.
Model C: regress x1 – x2 on ½ (x1 + x2).
In each case, our interest is in whether the coefficient is zero. Model A would be ideal if we knew X, but we do not. The original analysis of Gunnell et al. used Model B, which is correct only if x2 contains no error. Otherwise, the presence of the error e2 on both sides of the regression equation leads to a negative bias in the regression coefficient. The revised analysis of Gunnell et al. uses Model C. This implicitly attempts to balance a negative bias arising from the presence of e2 on both sides of the regression equation by a positive bias arising from the presence of e1 on both sides.
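
The negative bias in Model B can be written out as a short worked derivation. This is a sketch rather than part of the original analyses: it assumes, in addition to the stated assumption that e2 is uncorrelated with X, that e1 and e2 are uncorrelated with each other, and the symbols \beta_A and \beta_B for the population slopes of Models A and B are introduced here for illustration only.

```latex
\begin{align*}
\beta_A &= \frac{\operatorname{Cov}(x_1 - X,\, X)}{\operatorname{Var}(X)}
         = \frac{\operatorname{Cov}(e_1, X)}{\operatorname{Var}(X)}, \\[4pt]
\beta_B &= \frac{\operatorname{Cov}(x_1 - x_2,\, x_2)}{\operatorname{Var}(x_2)}
         = \frac{\operatorname{Cov}(e_1 - e_2,\, X + e_2)}{\operatorname{Var}(X) + \operatorname{Var}(e_2)}
         = \frac{\operatorname{Cov}(e_1, X) - \operatorname{Var}(e_2)}{\operatorname{Var}(X) + \operatorname{Var}(e_2)}.
\end{align*}
```

Under these assumptions, when e1 is uncorrelated with X the Model B slope is –Var(e2) / (Var(X) + Var(e2)), which is negative unless e2 has zero variance.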

Formally, Model C tests the hypothesis that x1 and x2 have equal variances. Since negative correlation between e1 and X reduces the variance of x1, this appears a sensible hypothesis to test. However, the variances of x1 and x2 are also affected by the errors e1 and e2, which are unlikely to have equal variances. In fact, the reporting error e1 is likely to have larger variance than the measurement error e2. This will tend to make the variance of x1 larger than that of x2. So the proposed regression of x1 – x2 on ½ (x1 + x2) could give a zero coefficient either when e1 and X are negatively correlated and e1 has larger variance than e2, or when e1 and X are uncorrelated and e1 and e2 have equal variance.
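
The link between Model C and the two variances can be made explicit. The following is a sketch of the algebra, using only the notation defined above and the stated assumption that e2 is uncorrelated with X; the first equality holds for any two variables.

```latex
\begin{align*}
\operatorname{Cov}\!\left(x_1 - x_2,\ \tfrac{1}{2}(x_1 + x_2)\right)
  &= \tfrac{1}{2}\left[\operatorname{Var}(x_1) - \operatorname{Var}(x_2)\right] \\
  &= \tfrac{1}{2}\left[\,2\operatorname{Cov}(e_1, X) + \operatorname{Var}(e_1) - \operatorname{Var}(e_2)\right].
\end{align*}
```

The Model C slope is this covariance divided by the variance of ½ (x1 + x2), so it is zero exactly when 2 Cov(e1, X) + Var(e1) = Var(e2). A negative covariance between e1 and X can therefore be offset by a larger variance of e1.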

To summarize, Model B implicitly assumes that the error e2 is zero, while Model C implicitly assumes that the errors e1 and e2 have equal variances. In the present context, e2 is likely to have much smaller variance than e1, so ironically the original Model B may well be more correct than the new Model C.
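
A small simulation may help to make this comparison concrete. It is a sketch only and is not taken from any of the papers in this debate: the numerical values (a true-weight SD of 10, a reporting-error SD of 3, a measurement-error SD of 0.5), the normal distributions and the helper names `slope` and `simulate` are all assumptions chosen for illustration. The reporting error is generated as e1 = beta × (X – mean) + noise, so that beta is the Model A coefficient.

```python
import numpy as np

def slope(y, x):
    """Least-squares slope from regressing y on x (with an intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def simulate(beta=0.0, n=200_000, sd_true=10.0, sd_noise1=3.0, sd_e2=0.5, seed=0):
    """Return the slopes of Models A, B and C for one simulated data set.

    All parameter values are hypothetical. beta is the true dependence of
    the reporting error e1 on the true value X (the Model A coefficient).
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(70.0, sd_true, n)                            # true weight
    e1 = beta * (X - 70.0) + rng.normal(0.0, sd_noise1, n)      # reporting error
    e2 = rng.normal(0.0, sd_e2, n)                              # measurement error
    x1, x2 = X + e1, X + e2
    return {
        "A": slope(x1 - X, X),                    # needs the unobserved X
        "B": slope(x1 - x2, x2),
        "C": slope(x1 - x2, 0.5 * (x1 + x2)),
    }

# Case 1: e1 uncorrelated with X, but e1 much more variable than e2.
no_corr = simulate(beta=0.0)

# Case 2: beta chosen so that 2*Cov(e1, X) + Var(e1) = Var(e2) under the
# assumed SDs, i.e. the negative correlation is offset by the larger Var(e1).
balanced = simulate(beta=-0.04475)

print("no correlation:", no_corr)
print("balanced case: ", balanced)
```

Under these assumed values, Model B stays close to the Model A slope in the first case while Model C reports a clearly positive slope, and in the second case Model C reports a slope near zero even though e1 is negatively related to X.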

Is there a better approach? A full analysis would allow the error variances to differ and would allow e1 to be correlated with X. Unfortunately this model is unidentified: it is not possible to estimate all the parameters simultaneously. To make progress, we would have to make further untestable assumptions. The most sensible approach would be to make a realistic assumption about the variance of e2.
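
The lack of identification can be seen by counting second moments. The sketch below assumes that e1 and e2 are uncorrelated with each other and that only variances and covariances carry information (as under joint normality); the \sigma symbols are notation introduced here for illustration.

```latex
\begin{align*}
\operatorname{Var}(x_1)      &= \sigma_X^2 + 2\sigma_{X e_1} + \sigma_{e_1}^2, \\
\operatorname{Var}(x_2)      &= \sigma_X^2 + \sigma_{e_2}^2, \\
\operatorname{Cov}(x_1, x_2) &= \sigma_X^2 + \sigma_{X e_1}.
\end{align*}
```

There are three observable quantities but four unknowns (\sigma_X^2, \sigma_{e_1}^2, \sigma_{e_2}^2 and \sigma_{X e_1}). If a value is assumed for \sigma_{e_2}^2, the second equation gives \sigma_X^2, the third then gives \sigma_{X e_1}, and the first gives \sigma_{e_1}^2.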

Covariate-adjusted case
Now set aside the above difficulties by assuming that the two error variances are indeed equal, so that Model C above is correct. It is straightforward to include covariates, z say, in Model C. Gilthorpe and Tu rightly question whether such an analysis still addresses the hypothesis of interest.

However, Gilthorpe and Tu are wrong to suggest that changing from regression modelling to multilevel modelling automatically solves the problem. It is straightforward to show that the regression coefficient from Model C is zero if and only if x1 and x2 have equal variances conditional on the covariates z. This is exactly the same hypothesis that would be tested by the multilevel model. (Here I am assuming that we do not use the freedom of multilevel models to enter one set of covariates for x1 and a different set of covariates for x2.) I therefore support Gilthorpe and Tu's proposal to use a multilevel model because assumptions may be made more explicit, but much care is still needed in setting up the model to address the question of interest. Further, the multilevel model should be set up to explicitly include the parameter of interest, rather than working indirectly through comparisons of variances.
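
One way to see the equal-variance result is through the standard partialling-out (Frisch–Waugh–Lovell) property of least squares. The sketch below is an illustration under assumptions not stated in the text: covariates enter linearly, and \tilde{x}_1 and \tilde{x}_2 denote the residuals of x1 and x2 after linear regression on z, so their variances equal the conditional variances only if the conditional means are linear in z and the conditional variances are constant. The symbol \beta_{C \mid z} is introduced here for the covariate-adjusted Model C coefficient.

```latex
\begin{align*}
\beta_{C \mid z}
  &= \frac{\operatorname{Cov}\!\left(\tilde{x}_1 - \tilde{x}_2,\ \tfrac{1}{2}(\tilde{x}_1 + \tilde{x}_2)\right)}
          {\operatorname{Var}\!\left(\tfrac{1}{2}(\tilde{x}_1 + \tilde{x}_2)\right)}
   = \frac{\tfrac{1}{2}\left[\operatorname{Var}(\tilde{x}_1) - \operatorname{Var}(\tilde{x}_2)\right]}
          {\operatorname{Var}\!\left(\tfrac{1}{2}(\tilde{x}_1 + \tilde{x}_2)\right)}.
\end{align*}
```

The coefficient is therefore zero if and only if the two residual variances are equal.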

To illustrate a possible difficulty in setting up the model, consider an analysis where X is weight and we adjust for self-reported height z. Suppose that individuals who under-report their weight tend also to over-report their height. Self-reported height is therefore negatively correlated with the error e1, and this will tend to bias the regression coefficient in Model C. On the other hand, if z is measured height then no such bias arises. The precise choice of z is therefore essential.
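
The effect of the choice of z can be illustrated by simulation. The sketch below is not an analysis of the Gunnell et al. data: every numerical value, the normal distributions, the way height error is tied to weight error, and the helper name `adjusted_model_c_slope` are assumptions made for illustration. In line with this section, e1 and e2 are given equal variances and e1 is generated independently of true weight, so any non-zero adjusted coefficient reflects the covariate alone.

```python
import numpy as np

def adjusted_model_c_slope(x1, x2, z):
    """Coefficient on (x1 + x2)/2 when x1 - x2 is regressed on it and z."""
    d = x1 - x2
    m = 0.5 * (x1 + x2)
    design = np.column_stack([np.ones_like(m), m, z])   # intercept, mean, covariate
    coef, *_ = np.linalg.lstsq(design, d, rcond=None)
    return coef[1]

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical true height (cm) and true weight (kg), positively related.
H = rng.normal(170.0, 7.0, n)
X = 70.0 + 0.8 * (H - 170.0) + rng.normal(0.0, 8.0, n)

# Equal error variances, and e1 independent of X (Model C's assumption holds).
e1 = rng.normal(0.0, 3.0, n)    # reporting error in weight
e2 = rng.normal(0.0, 3.0, n)    # measurement error in weight
x1, x2 = X + e1, X + e2

# Self-reported height: people who under-report weight (e1 < 0) over-report height.
z_self = H - 0.5 * e1 + rng.normal(0.0, 1.0, n)
# Measured height: error unrelated to e1.
z_meas = H + rng.normal(0.0, 0.5, n)

slope_self = adjusted_model_c_slope(x1, x2, z_self)
slope_meas = adjusted_model_c_slope(x1, x2, z_meas)
print("adjusted for self-reported height:", slope_self)
print("adjusted for measured height:     ", slope_meas)
```

Under these assumptions the coefficient adjusted for measured height is close to zero, while the coefficient adjusted for self-reported height is not, even though e1 is unrelated to true weight in both cases.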

The best approach to this problem would be to jointly model self-reported weight and height and measured weight and height. These four observed values would be explicitly related to the unobserved true values. The model would then allow the error in self-reported weight (for example) to be associated with the error in self-reported height, with the true weight, and with the true height. Such a model would be under-identified and the data analyst would be forced to be explicit about the identifying assumptions to be made.


References
1 Gunnell D, Berney L, Holland P et al. How accurately are height, weight and leg length reported by the elderly, and how closely are they related to measurements recorded in childhood? Int J Epidemiol 2000; 29:456–64.

2 Gunnell D, Berney L, Holland P et al. Does the misreporting of adult body size depend upon an individual's height and weight? Methodological debate. Int J Epidemiol 2004; 33:1398–99.

3 Gilthorpe MS, Tu Y-K. Mathematical coupling: a multilevel approach. Int J Epidemiol 2004; 33:1399–400.