1 Department of Epidemiology, School of Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC.
2 Department of Epidemiology and Biostatistics, McGill University, Montreal, Quebec, Canada.
Received for publication March 14, 2001; accepted for publication October 14, 2002.
ABSTRACT
bias (epidemiology); epidemiologic methods; measurement error; regression analysis
INTRODUCTION
Errors in exposure measurement may lead to biased estimates of exposure-disease associations. A number of authors have discussed the direction and magnitude of bias resulting from specified patterns of exposure measurement error and have provided models describing these associations (1–3). These models can help in assessing the bias and uncertainty that result from commonly encountered problems of exposure measurement error. Such models have been used in a range of epidemiologic investigations, including studies of nutritional factors, environmental contaminants, and occupational hazards (4–6).
Taking a similar approach, we begin with a model in which exposure measurement error is assumed to be nondifferential (that is, independent of disease status) and randomly distributed. However, we focus on how the effects of measurement error change when an exposure variable is constrained by a lower limit. It is common in epidemiologic studies for recorded exposures to be constrained by a lower limit, such as zero or a minimal detection threshold for a measurement process (7–9). For example, in studies of workers in the nuclear industry, a lower boundary for recorded radiation doses often reflects the inability of a measurement tool to accurately obtain values below a specified minimal threshold of detection (10). In this case, measurement error conforms to a nonlinear rather than an additive model, and the assumption that these errors are randomly distributed is no longer accurate. Consequently, a constraint on the minimal recorded exposure will influence the distribution of exposure measurement error and, importantly for epidemiologists, influence the effect of measurement error on estimates of exposure-disease associations.
In this paper, we investigate the direction and magnitude of bias in exposure-response associations when there is random measurement error and a lower threshold for recorded exposures. We develop a general equation for the coefficient of bias in exposure-response associations resulting from random measurement error in the presence of a recording threshold limit, and we illustrate our findings under a range of specified model conditions.
METHODS
We begin with a linear model for the exposure-response association:

$$y = \alpha + \beta x + \varepsilon, \qquad (1)$$

where y is a health response measured on a continuous scale, x is a continuous exposure (assumed to be nonnegative for our purposes), and ε is the outcome error term, which is assumed to be uncorrelated with x and to have a mean equal to zero and a constant variance. The parameters α and β denote the intercept and the average change in y per unit change in x, respectively.
If the health outcome under study is described by a binary variable, a logistic regression model may be preferable. In this case, equation 1 is replaced by a model in which y is a binary random variable with parameter p = E(y | x) = Pr(y = 1 | x) and

$$\log\left(\frac{p}{1 - p}\right) = \alpha + \beta x. \qquad (1')$$
In the discussion that follows, we will assume that associations conform to equation 1. However, the theory remains valid even when the correct model is the logistic one (equation 1'), provided that the following assumptions are satisfied: 1) the outcome is rare; 2) β is not too large in absolute terms; and 3) the measurement error is not too severe (11). In many cases of epidemiologic interest, the above assumptions will be satisfied.
Bias due to measurement error
Typically, in an epidemiologic study, the true exposure, x, is not measurable without error. Consequently, one can consider that the study uses a surrogate exposure variable, z, which provides an imperfect measure of the true exposure, x. A simple case considered in epidemiology is that of the "classical error" model:

$$z = x + \delta, \qquad (2)$$

where δ is a random variable with variance σ² that is uncorrelated with x. It is usually assumed that cov(δ, ε) = 0, where ε is the outcome error term. Several generalizations of this model are also often used in epidemiology (5); they usually imply a linear relation between z and x and/or a relaxation of the assumption of zero correlation between δ and x, while maintaining the assumption that cov(δ, ε) = 0.
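Suppose we fitted a linear regression model between a surrogate measure of exposure, z, and a continuous measure of disease, y:

$$y = \alpha' + \beta' z + \varepsilon', \qquad (3)$$

obtaining the usual least squares estimator, where the subscript s denotes the usual sample estimates of variance and covariance:

$$\hat{\beta}' = \frac{\mathrm{cov}_s(y, z)}{\mathrm{var}_s(z)}. \qquad (4)$$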
Let us write z = x + δ as in equation 2, where δ is the measurement error, but without the assumptions that the distribution of δ is independent of x and that there is no correlation between x and δ. However, we will continue to assume that cov(δ, ε) = 0, where ε is the outcome error term. Then, as the sample size of the study population tends towards infinity, it can be shown simply from cov(y, z) = cov(βx + ε, z) = βcov(x, z) that the estimated association between the surrogate measure of exposure and disease, β', is equal to the association between the true exposure and disease, β, multiplied by a coefficient of bias, λ, as follows:

$$\beta' = \lambda\beta, \qquad (5)$$

where

$$\lambda = \frac{\mathrm{cov}(x, z)}{\mathrm{var}(z)}. \qquad (6)$$
Therefore, provided that the measurement error, δ, and the outcome error, ε, are uncorrelated, equation 6 is valid in general, not only in the familiar case of additive error.
We will call standard the well-studied case described by equation 2 and its linear generalizations, characterized by the assumption that the measurement error is distributed with mean equal to zero and constant variance, σ². In this standard case, measurement error always leads to attenuation of the exposure-disease association in addition to diminishing the goodness-of-fit of the regression model (5). The coefficient of bias, λ, described by equation 6 will always take values less than 1; and, for the additive error model, the coefficient, λ, is equal to the ratio of the variance of x to the sum of the variance of x and σ², that is, λ = var(x)/[var(x) + σ²] (12).
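The relation between the slope on the surrogate, the slope on the true exposure, and the coefficient of bias can be checked numerically. The following is a minimal simulation sketch in Python (standard library only). The sample size, seed, parameter values (`alpha`, `beta`, `sigma`), and helper names are illustrative assumptions and are not taken from the paper.

```python
import random

def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)

def bias_coefficient(x, z):
    """Coefficient of bias, cov(x, z) / var(z), as in equation 6."""
    return cov(x, z) / cov(z, z)

rng = random.Random(0)               # fixed seed (assumption) for reproducibility
n = 100_000                          # illustrative sample size
alpha, beta, sigma = 1.0, 2.0, 1.0   # hypothetical parameter values

x = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]        # true exposure
z = [xi + rng.gauss(0.0, sigma) for xi in x]                # classical error model
y = [alpha + beta * xi + rng.gauss(0.0, 1.0) for xi in x]   # linear outcome model

lam = bias_coefficient(x, z)
slope_on_z = cov(y, z) / cov(z, z)                    # least squares slope of y on z
lam_additive = cov(x, x) / (cov(x, x) + sigma ** 2)   # var(x) / [var(x) + sigma^2]

print(f"coefficient of bias = {lam:.3f}, additive formula = {lam_additive:.3f}")
print(f"slope on z = {slope_on_z:.3f}, coefficient of bias * beta = {lam * beta:.3f}")
```

Under these assumptions, the slope fitted on z should be close to the coefficient of bias multiplied by `beta`, and the coefficient itself should be close to the additive-error ratio and below 1.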
However, we are interested in the range of possible values for the coefficient of bias, λ, in the nonstandard case in which recorded values for z are constrained by a minimal threshold limit, d, and exposures below this threshold limit are set equal to a value, a. In this nonstandard case, the model of measurement error in equation 2 can be replaced by the following model, in which δ denotes the random measurement error:

$$z = \begin{cases} x + \delta, & \text{if } x + \delta \ge d, \\ a, & \text{if } x + \delta < d. \end{cases} \qquad (7)$$
We first examine the case, which we call the pure threshold model, in which there is no random measurement error (the variance of the measurement error, δ, denoted by σ², is equal to zero). In this case, the only source of exposure measurement error is the inability of the measuring instrument to detect x values that are below the threshold limit, d.
Then we examine the case, which we call the threshold model with error, in which there is random measurement error (a nonzero variance for the measurement error, δ). To explore these cases, we developed a general formula for bias due to exposure measurement error, as described in the Appendix. Under the threshold model with error, the relation between the coefficient of bias, λ, and the threshold limit, d, depends on the distributions of x and δ and on the value assigned to below-threshold measurements, a. Thus, rather than attempting a general analytical study of λ as a function of d, a, and the parameters of the distributions involved, we give some specific examples using distributions that mimic reasonably well what can be expected in real-life situations.
To explore these examples, we generated simulation data for 1,000,000 study subjects. A true exposure value, x, was assigned to each study subject by sampling from the lognormal (0,1) or gamma (1,1) distribution (figure 1). Using equation 7, we calculated values for z for the following cases: 1) a pure threshold model (σ² = 0) with x distributed according to the lognormal (0,1) distribution; 2) a pure threshold model (σ² = 0) with x distributed according to the gamma (1,1) distribution; 3) a threshold model with error with x distributed according to the lognormal (0,1) distribution; and 4) a threshold model with error with x distributed according to the gamma (1,1) distribution.
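The four simulation cases can be sketched as follows in Python (standard library only). This is an illustrative reconstruction rather than the authors' code: the reduced sample size, the seed, the threshold `d = 1`, the assigned value `a = d/2`, the error standard deviation `sigma_error = 0.5`, and all function names are assumptions chosen for the example.

```python
import random

def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)

def record_exposure(x, delta, d, a):
    """Threshold model (equation 7): keep x + delta if it reaches d, else record a."""
    w = x + delta
    return w if w >= d else a

def simulate_bias(x, sigma, d, a, rng):
    """Coefficient of bias cov(x, z) / var(z) for one simulated data set."""
    z = [record_exposure(xi, rng.gauss(0.0, sigma) if sigma > 0 else 0.0, d, a)
         for xi in x]
    return cov(x, z) / cov(z, z)

rng = random.Random(1)   # fixed seed (assumption)
n = 100_000              # the paper used 1,000,000 subjects; reduced here for speed
d, a = 1.0, 0.5          # hypothetical threshold limit and assigned value (a = d/2)
sigma_error = 0.5        # hypothetical standard deviation of the random error

exposures = {
    "lognormal(0,1)": [rng.lognormvariate(0.0, 1.0) for _ in range(n)],
    "gamma(1,1)": [rng.gammavariate(1.0, 1.0) for _ in range(n)],
}

results = {}
for name, x in exposures.items():
    results[(name, "pure threshold")] = simulate_bias(x, 0.0, d, a, rng)
    results[(name, "threshold with error")] = simulate_bias(x, sigma_error, d, a, rng)

for key, lam in results.items():
    print(key, round(lam, 3))
```

Changing `d`, `a`, or `sigma_error` in this sketch gives a quick way to explore how the coefficient of bias responds to recording practice and error variance.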
RESULTS
Let us consider some specific examples. If the value assigned to below-threshold measurements, a, is equal to 0, the coefficient of bias will always be less than or equal to 1. In contrast, if the value assigned to below-threshold measurements is equal to the threshold limit, d, the coefficient of bias will always be greater than or equal to 1. The common practice of setting a equal to d/2 may result in upward or downward bias, depending on the magnitude of d/2 with respect to E[x | x ≤ d]. Figures 2 and 3 illustrate the relation between the coefficient of bias and a, the value assigned to exposure measurements below a threshold limit, d = 1 or d = 2. Figure 2 pertains to the case in which the population distribution of exposure is lognormal. Figure 3 pertains to the case in which the population distribution of exposure conforms to the gamma distribution.
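The dependence of the coefficient of bias on the assigned value, a, in the pure threshold model can be illustrated with a short Python sketch (standard library only). The sample size, seed, threshold `d = 1`, and function names are assumptions made for the example; the conditional mean is estimated from the simulated sample rather than computed analytically.

```python
import random

def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v)) / (n - 1)

def pure_threshold_bias(x, d, a):
    """Coefficient of bias when values of x below d are recorded as a (no random error)."""
    z = [xi if xi >= d else a for xi in x]
    return cov(x, z) / cov(z, z)

rng = random.Random(2)   # fixed seed (assumption)
x = [rng.lognormvariate(0.0, 1.0) for _ in range(100_000)]
d = 1.0                  # hypothetical threshold limit

below = [xi for xi in x if xi < d]
conditional_mean = sum(below) / len(below)   # sample estimate of E[x | x < d]

bias_by_a = {
    "a = 0": pure_threshold_bias(x, d, 0.0),
    "a = d/2": pure_threshold_bias(x, d, d / 2),
    "a = E[x | x < d]": pure_threshold_bias(x, d, conditional_mean),
    "a = d": pure_threshold_bias(x, d, d),
}

for label, lam in bias_by_a.items():
    print(f"{label}: coefficient of bias = {lam:.4f}")
```

In this sketch the coefficient is below 1 for a = 0, above 1 for a = d, and equal to 1 (up to rounding) when a is the sample mean of the below-threshold exposures; the result for a = d/2 depends on where d/2 falls relative to that mean.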
We begin with the situation in which the population distribution of exposure conforms to the lognormal distribution. In the absence of random measurement error, assigning a value of zero to exposure measurements below a threshold limit attenuates exposure-response associations (see above). Figure 4 illustrates the situation in which random measurement error occurs and a value of zero is assigned to exposure measurements below a threshold limit, d = 1 or d = 2. In this situation, the degree of attenuation increases as the standard deviation of the random error, σ, increases; and the degree of attenuation is larger when the threshold limit is equal to two units than when the threshold limit is equal to one unit (figure 4). Notably, one can see that in comparison with the classical model of exposure measurement error (depicted by the solid line), the effect of random measurement error differs in the case where there is a threshold limit (depicted by the dashed lines). At small values of σ, the coefficient of bias is closer to 1 under the classical model of measurement error (in which there is not a threshold limit) than under the threshold model with error (figure 4). In contrast, at large values of σ, the coefficient of bias is closer to 1 under the threshold model with error than under the classical model of measurement error (figure 4). This is because the decline in the magnitude of the coefficient of bias, λ, with increasing values of σ is consistently greater in the case where there is no threshold limit (the classical model of exposure measurement error) than in the nonstandard case where there is a threshold limit.
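The comparison between the classical model and the threshold model with error can be sketched numerically in Python (standard library only). The sample size, seed, the grid of error standard deviations, and the function names are assumptions made for illustration; the threshold `d = 1` and assigned value `a = 0` follow the scenario described above.

```python
import random

def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v)) / (n - 1)

rng = random.Random(3)   # fixed seed (assumption)
n = 100_000              # illustrative sample size
x = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]   # lognormal (0,1) exposure

def bias_coefficients(sigma, d, a=0.0):
    """Return (classical, threshold) coefficients of bias for one draw of errors."""
    w = [xi + rng.gauss(0.0, sigma) for xi in x]   # classical surrogate, no threshold
    z = [wi if wi >= d else a for wi in w]         # same surrogate with threshold d
    return cov(x, w) / cov(w, w), cov(x, z) / cov(z, z)

# Hypothetical grid of error standard deviations
table = {sigma: bias_coefficients(sigma, d=1.0) for sigma in (0.25, 1.0, 3.0)}

for sigma, (classical, threshold) in table.items():
    print(f"sigma = {sigma}: classical = {classical:.3f}, threshold = {threshold:.3f}")
```

With these settings, the classical coefficient is nearer to 1 at the smallest error standard deviation, whereas the threshold coefficient is nearer to 1 at the largest, which mirrors the pattern described for figure 4.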
DISCUSSION
In this paper, we investigated the effect of measurement error when the surrogate exposure was constrained by a lower threshold value. We have shown that the direction and magnitude of bias in estimated associations may vary depending on recording practices, the variance in the surrogate exposure variable due to measurement error, and the population distribution of the exposure. While the assumptions underlying the simulation analyses in this paper were informed by empirical studies on the health effects of ionizing radiation, we have attempted to examine situations that are comparable to those commonly encountered by epidemiologists. In some of the examples, a large proportion of the study data had exposure values lower than the minimal threshold limit; this was particularly true when we assumed that the population distribution of the true exposure conformed to the gamma distribution. However, in studies of environmental and occupational exposures, highly skewed exposure distributions are commonly encountered; consequently, the conclusions drawn from these examples should be useful for considerations about bias in such settings (7, 15).
Measurement error was assumed to conform to a symmetrical normal distribution. In studies of ionizing radiation in which radiation doses are determined from film badge dosimetry, one source of exposure measurement error is the uncertainty that arises from laboratory processes (including film badge calibration, chemical processing of films, measurements of film optical densities, and comparison of the optical densities of badges to calibration films). In these settings, measurement error due to laboratory uncertainties may be assumed to conform to the normal distribution (16).
We focused on several situations that are illustrative of recording practices for below-threshold exposure measurements: situations in which below-threshold measurements are set equal to either zero, one half of the threshold limit, or the threshold limit. Other recording practices might be considered, such as assigning a value equal to the threshold limit divided by the square root of 2 to below-threshold measurements (17). The choice of an appropriate recording practice is often informed by knowledge or assumptions about the underlying distribution of true exposures (8). However, in regulatory settings, an upper value, such as the threshold limit, is sometimes recorded in order to ensure that exposures have not been underestimated.
As figures 2 and 3 show, in the case of the pure threshold model, the closer the assigned value is to the expected value of x in the below-threshold range, the smaller is the degree of bias in the exposure-response association. The special situation in which the assigned value, a, equals the expected value of x conditional on x being below the threshold limit is an instance of Berkson error; the true exposure is distributed around the surrogate exposure with an average error equal to zero, producing no bias in estimates of exposure-response associations (14).
Our primary interest in this paper, however, was in the more general case of the threshold model with error. We developed a formula for the coefficient of bias due to exposure measurement error for the case of the threshold model with error. When compared with the classical model of measurement error, the slope of the decline in the magnitude of the coefficient of bias with increasing measurement error is lower in the case where there is a threshold than in the "classical" case where it is assumed that there is no recording threshold. We have suggested that the threshold model with error provides a better description of the exposure measurement error encountered in many occupational and environmental studies than does the "classical model" of exposure measurement error. However, the threshold model with error is still a relatively simple model, and the results presented in this paper are best viewed as examples for understanding the potential effects of measurement error under simplified conditions. We emphasize that this paper explored simple patterns of measurement error, not the effects of the complex patterns of exposure measurement error that often appear in research settings. We focused on linear estimates of exposure-response associations (which are often examined in environmental and occupational epidemiology); however, patterns of measurement error that arise when there is a recording threshold may in some cases lead to departures from linearity in estimated exposure-response associations despite the presence of a true linear association. In addition, these analyses focused on the situation where there is a single exposure measurement. In settings of chronic exposure, in which a cumulative measure of exposure is derived from a series of measurements made on each individual, the problems of measurement error may be more complex.
Furthermore, the assumption of nondifferential random variation in exposure estimates is often an inadequate description of the true extent of problems that measurement error entails. In epidemiologic studies, researchers may encounter complex patterns of measurement error, biased estimates of exposure, varying distributions of measurement error at different values of the true exposure, and even patterns of measurement error that are differential with regard to disease status. Patterns of measurement error that are differential with respect to disease status may occur in occupational settings, for example, if there is health-related selection of workers into jobs or areas where the exposure conditions lead to greater problems of measurement error. In addition, exposure information available for epidemiologic studies is often incomplete and may not reflect all relevant sources of exposure or periods of exposure. These are equally important considerations as sources of bias in estimates of exposure-disease associations. Investigative analyses, in which subcohorts are examined or assumptions about etiologically relevant exposures are varied, can often improve our understanding of the problems involved in measurement error.
Despite these limitations, the observations in this paper contribute to a growing body of epidemiologic literature on the effects of exposure misclassification. Bias resulting from the inaccurate assignment of study subjects to exposure groups has been the subject of a great deal of discussion. Much of the early literature on this topic focused on studies in which exposure variables were categorized (18, 19). More recently, discussions have expanded to cover the effects of measurement error in continuous exposure variables. As this paper illustrates, conclusions about the effects of measurement error should be sensitive to data collection and recording processes that influence the distribution of recorded exposures. In epidemiologic studies, it is common to have exposure data that are constrained by a lower threshold limit. The direction and magnitude of bias resulting from measurement error will depend on the population distribution of the exposure, the variance due to measurement error, and the recording practices used for below-threshold measurements.
APPENDIX
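The threshold model with error (equation 7) can be written as

$$z = (x + \delta)\,I[x + \delta \ge d] + a\,I[x + \delta < d],$$

where δ is the random measurement error and I["logical expression"] equals 1 if "logical expression" is true and 0 if it is false. In what follows, f(x) (F(x)) and g(δ) (G(δ)) shall denote the densities (cumulative distribution functions) of x and δ, respectively.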
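Let us first consider the pure threshold model, var(δ) = 0. Define, for k = 0, 1, 2,

$$M_{f,k}(d) = \int_0^d x^k f(x)\,dx.$$

Notice in particular that M_{f,0}(d) = F(d). Then a direct calculation yields

$$\mathrm{cov}(x, z) = \mathrm{var}(x) - M_{f,2}(d) + a M_{f,1}(d) + E(x)\,[M_{f,1}(d) - aF(d)],$$

$$\mathrm{var}(z) = E(x^2) - M_{f,2}(d) + a^2 F(d) - [E(x) - M_{f,1}(d) + aF(d)]^2.$$

Therefore,

$$\lambda = \frac{\mathrm{var}(x) - M_{f,2}(d) + a M_{f,1}(d) + E(x)\,[M_{f,1}(d) - aF(d)]}{E(x^2) - M_{f,2}(d) + a^2 F(d) - [E(x) - M_{f,1}(d) + aF(d)]^2}, \qquad (A1)$$

from which it follows that, for 0 ≤ a ≤ d, the coefficient of bias λ ≥ 1 if and only if

$$a \ge \frac{M_{f,1}(d)}{F(d)} = E(x \mid x \le d).$$

In addition, the following proposition holds.
Proposition
For a = 0, λ < 1 and, for a = d, λ > 1. Furthermore, λ = 1 for a = E(x | x ≤ d).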
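One way to see why the proposition holds is the following sketch for the pure threshold model, which is offered as an illustration and is not necessarily the authors' own derivation. It assumes a continuous exposure with 0 < F(d) < 1 and 0 ≤ a ≤ d, and uses the shorthand F = F(d) and M_1 for the integral of x f(x) from 0 to d, so that M_1/F = E(x | x < d).

```latex
\begin{align*}
E(z) &= E(x) - M_1 + aF, \\
E(xz) - E(z^2) &= aM_1 - a^2F = a\,(M_1 - aF), \\
E(x) - E(z) &= M_1 - aF, \\
\operatorname{cov}(x,z) - \operatorname{var}(z)
  &= [E(xz) - E(z^2)] - E(z)\,[E(x) - E(z)] \\
  &= (M_1 - aF)\,[a - E(z)] \\
  &= F\,(1 - F)\,[E(x \mid x < d) - a]\,[a - E(x \mid x \ge d)].
\end{align*}
```

Because a ≤ d < E(x | x ≥ d), the last factor is negative, so cov(x, z) − var(z) has the same sign as a − E(x | x < d). Dividing by var(z) > 0 shows that the coefficient of bias, cov(x, z)/var(z), is below 1 for a = 0, above 1 for a = d, and equal to 1 when a = E(x | x < d).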
We now give the formulas for the case var(δ) ≠ 0. Define, for k = 0, 1, 2,
Then, proceeding as above, one obtains
We finally obtain the expression for the coefficient of bias, λ,
which can be seen to be a generalization of equation A1, since, for σ = 0, the L terms become zero and the remaining terms reduce to their pure threshold counterparts. Notice also that the threshold model with error (equation A3) reduces to the formula for the familiar case of the classical model of measurement error when d = 0, since the L terms and the other threshold-dependent terms become zero.