Department of Occupational and Environmental Medicine, Lund, Sweden.
Jonas Björk, Department of Occupational and Environmental Medicine, Lund University Hospital, SE-221 85 Lund, Sweden. E-mail: Jonas.Bjork{at}ymed.lu.se
Abstract
Background In ecologic studies, group-level rather than individual-level exposure data are used. When using group-level exposure data, established by sufficiently large samples of individual exposure assessments, the bias of the effect estimate due to sampling errors or random assessment errors at the individual-level is generally negligible. In contrast, systematic assessment errors may produce more pronounced errors in the group-level exposure measures, leading to bias in ecologic analyses.
Methods We focus on effects of systematic exposure assessment errors in partially ecologic case-control studies. Individual-level information on disease status, group membership, and covariates is obtained from registries, whereas the exposure is a group-level measure obtained from an established exposure database. Effects on bias and coverage of 95% CI in various error situations are investigated under the linear risk model, using both simulated and empirical ecologic data on exposures that are binary at the individual level.
Results Our simulations suggest that the bias produced by systematic exposure assessment errors under the linear risk model is generally approximately equal to the ratio of the slope bias and the intercept bias in ordinary linear regression with measurement errors in the independent variable. Consequently, bias in either direction can occur. Exposure assessment errors that systematically distort the group-level exposure measures have more pronounced effects on bias and coverage than errors producing random fluctuations of the group-level measures, which imply bias towards the null.
Conclusions The results indicate the need for careful consideration of potential effects of systematic distortions of the group-level exposure measures when constructing and applying group-level exposure databases, such as probabilistic job exposure matrices.
Keywords Bias (epidemiology), case-control studies, ecologic studies, epidemiological methods, models, occupational exposure, odds ratio, statistical
Accepted 30 May 2001
Ecologic studies usually have a cohort-like design, with comparisons of incidence rates and group-level exposure measures such as means or proportions across groups/populations.1 Ecologic exposure measures can also be used in case-control settings. We focus on partially ecologic case-control studies, where individual-level information on disease status, group membership, and covariates is obtained from registries, whereas the exposure is a group-level measure obtained from an established exposure database (cf. Kauppinen et al.2) based on the group membership of each study individual. The group membership may be defined by determinants such as occupation or residential area. Partially ecologic case-control studies have so far been considered in occupational epidemiology, by applying a probabilistic job exposure matrix, in which each entry is an exposure proportion, also interpreted as the estimated exposure probability of a randomly selected group member.3
Prentice and Sheppard4 considered cohort-like settings where the exposure measures were obtained from samples of individuals; the exposures were assessed as a part of the study, in contrast to the study situations we address with pre-existing group-level exposure data. Indeed, such group-level exposure databases can also be constructed by assessing the exposure individually for samples of people in each exposure group. Prentice and Sheppard4 showed that replacing group means with averages over samples of size 100 or more generally leads to negligible bias of the effect estimate. Furthermore, the effect estimates based on samples of individual exposure assessments are robust to random measurement errors.4,5 For systematic assessment errors, Brenner et al.6 showed that non-differential individual-level misclassification of a binary exposure usually produces bias away from the null in ecologic analyses if the sample survey is equally sensitive and specific in all study groups. This finding was confirmed by Webster7 using a general error model. The aim of this paper is to investigate effects of more general systematic exposure assessment errors in partially ecologic case-control studies, using a linear risk model. We focus on binary exposures on the individual level, but it should be possible to extend the results to continuous exposures if the exposure-disease association is linear.
The outline of this paper is as follows. First, we briefly review the arguments for using a linear risk model in ecologic analyses of binary exposures. Second, we show under what conditions the bias resulting from measurement errors in ordinary linear regression can be generalized to exposure assessment errors in partially ecologic case-control studies under the linear model. Third, we investigate effects on bias and coverage of 95% CI in various error situations, using both simulated and empirical data.
The Linear Risk Model
Let I0 and I1 denote the disease incidence among truly unexposed and exposed individuals, respectively. Assume that there is no confounding or effect modification from extraneous risk factors across or within groups,8 i.e. I0 and I1 do not vary across groups. Hence, the incidence I(x) of a group where a proportion x is exposed is

$$I(x) = (1 - x)\,I_0 + x\,I_1 = I_0\,(1 + \beta x), \qquad \beta = \frac{I_1}{I_0} - 1 = \mathrm{RR} - 1. \tag{1}$$
For partially ecologic case-control studies, Equation 1 corresponds to a linear odds ratio (OR) model, if the OR can be interpreted as a RR3

$$\mathrm{OR}(x) = 1 + \beta x. \tag{2}$$
Let $\Omega(x)$ denote the disease odds in a group where a proportion x is exposed. By letting $\Omega(0) = \exp(\alpha)$, $\Omega(x)$ can be expressed as

$$\Omega(x) = \exp(\alpha)\,(1 + \beta x). \tag{3}$$
Confounding across groups arises when extraneous risk factors are associated with the determinants of group membership.8 Such confounding from general covariates, e.g. gender and age, can be adjusted for if individual register data on the covariates are available for each case and control. Assuming a common OR for exposed versus unexposed individuals across various levels of the covariates $s_i$ (i.e. no effect modification) yields3

$$\Omega(x, s) = \exp\Big(\alpha + \sum_i \gamma_i s_i\Big)\,(1 + \beta x), \tag{4}$$

where the $\gamma_i$ are the regression coefficients of the covariates.
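As a minimal sketch of the linear odds model in Equations 2 and 3, the following Python functions evaluate the group-level OR and odds for a given exposed proportion. The function names and the example parameter values are illustrative assumptions, not taken from the paper.

```python
import math

def group_odds_ratio(x, beta):
    """OR of a group with exposed proportion x versus an entirely unexposed group."""
    return 1.0 + beta * x

def group_odds(x, alpha, beta):
    """Disease odds of a group; exp(alpha) is the odds when nobody is exposed (x = 0)."""
    return math.exp(alpha) * group_odds_ratio(x, beta)

# Illustrative values: an individual-level OR of 3.0 corresponds to beta = 2.0.
beta = 2.0
half_exposed_or = group_odds_ratio(0.5, beta)    # group where half the members are exposed
fully_exposed_or = group_odds_ratio(1.0, beta)   # equals the individual-level OR
```

The group-level OR moves linearly from 1 (no one exposed) to the individual-level OR (everyone exposed), which is what makes the exposure proportion usable as a regressor.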
Bias Due to Assessment Errors under the Linear Odds Model
Bias under the linear odds model (Equation 3) occurs when the estimated group-level exposure $\hat{x}$ of a group differs from the truth x. We assume that there is a single effective exposure, that there is no confounding or effect modification, and that any group-level error e is non-differential and follows a general additive error model, $\hat{x} = x + e$, where x and e may be correlated. Let corr(x,e) denote this correlation. If group-level errors exist, the constructed response $Z(\hat{x})$ (cf. Appendix) will differ from the response $Z(x)$ that would have been used if no errors in $\hat{x}$ existed. Let $f = Z(\hat{x}) - Z(x)$ denote this difference at the final iteration of the iterative weighted least squares estimation procedure. Thus, in the final iteration, we have an ordinary linear regression situation with errors both in the estimated group-level measure $\hat{x}$ and the response $Z(\hat{x})$. In such a situation, the slope bias factor c, i.e. the ratio between the expected slope $E[\exp(\hat{\alpha})\hat{\beta}]$ in the presence of errors and the true slope $\exp(\alpha)\beta$, is12

$$c = c_0 + c_1,$$

$$c_0 = \frac{\mathrm{var}(x) + \mathrm{cov}(x,e)}{\mathrm{var}(x) + \mathrm{var}(e) + 2\,\mathrm{cov}(x,e)}, \tag{5}$$

$$c_1 = \frac{\mathrm{cov}(\hat{x}, f)}{\exp(\alpha)\,\beta\,\mathrm{var}(\hat{x})}. \tag{6}$$

Similarly, the intercept bias factor a, i.e. the ratio between the expected intercept $E[\exp(\hat{\alpha})]$ and the true intercept $\exp(\alpha)$, is12

$$a = a_0 + a_1,$$

$$a_0 = 1 + \beta\,\{E[x] - c_0\,E[\hat{x}]\}, \tag{7}$$

$$a_1 = \frac{E[f]}{\exp(\alpha)} - c_1\,\beta\,E[\hat{x}]. \tag{8}$$
The bias factor for OR = 1 + β depends not only on c but also on a. From the model parameterization it follows that b, the bias factor of β, is

$$b = \frac{E[\hat{\beta}]}{\beta} = \frac{c}{a} - \frac{\mathrm{cov}[\exp(\hat{\alpha}), \hat{\beta}]}{a\,\exp(\alpha)\,\beta}.$$

Note that the contribution of the covariance between the parameter estimates, $\mathrm{cov}[\exp(\hat{\alpha}), \hat{\beta}]$, to the bias factor b implies that $\hat{\beta}$ is not necessarily unbiased even if a = c = 1. It turns out, however, that the impact of $\mathrm{cov}[\exp(\hat{\alpha}), \hat{\beta}]$ is essentially eliminated if the statistical inference is based on log transformed parameter estimates (see Simulations below). Thus, if the bias is evaluated using the median (or the geometric mean) rather than the mean of the empirical distribution of $\hat{\beta}$,

$$b \approx \frac{c}{a}.$$
The bias factors a0 and c0 can be directly evaluated based on assumptions about the error structure of $\hat{x}$. In contrast, evaluating a1 and c1, produced by errors in the response $Z(\hat{x})$, is less straightforward. Our simulation results below suggest, however, that both a1 and c1 are generally negligible compared to a0 and c0 for non-differential group-level errors under the linear odds model such that

$$b \approx \frac{c_0}{a_0}. \tag{9}$$
The bias factor c0 of the ordinary regression slope has been presented by Wacholder13 for a continuous measure x with range (−∞, +∞) under different assumptions about var(e)/var(x) and corr(x,e). We present the bias factor c0 under some assumptions that are adequate for the linear odds model when x is a proportion (Table 1). Strong negative correlation between the group-level error and the truth can result in a bias factor above unity (positive bias) when var(e) < var(x), whereas zero or positive correlation leads to a bias factor below unity (negative bias). Reversal of direction of effect can occur when var(e) > var(x) for strong negative correlations.13
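The dependence of c0 on the variance ratio and the correlation can be explored numerically. The closed form below is the usual errors-in-variables slope factor, cov(x, x̂)/var(x̂) with x̂ = x + e, rewritten in terms of r = var(e)/var(x) and ρ = corr(x,e). The function name and the example inputs are assumptions chosen for illustration; they are not entries copied from Table 1.

```python
import math

def slope_bias_factor(var_ratio, corr):
    """c0 = cov(x, x_hat) / var(x_hat) for x_hat = x + e.

    var_ratio: var(e) / var(x);  corr: corr(x, e).
    """
    s = corr * math.sqrt(var_ratio)              # cov(x, e) / var(x)
    return (1.0 + s) / (1.0 + var_ratio + 2.0 * s)

c0_negative = slope_bias_factor(0.04, -1.0)      # strong negative correlation, var(e) < var(x)
c0_uncorrelated = slope_bias_factor(0.25, 0.0)   # uncorrelated error
c0_reversal = slope_bias_factor(4.0, -1.0)       # var(e) > var(x) with corr = -1
print(c0_negative, c0_uncorrelated, c0_reversal)
```

The three calls reproduce the qualitative pattern described in the text: a factor above unity, a factor below unity, and a negative factor (reversal of the direction of effect).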
The bias arising from errors in estimated exposure proportions is evaluated in three hypothetical examples under the assumption that the bias factors a1 and c1 (Equations 6 and 8) are negligible, an assumption that is supported by the simulations of the next section. We assume that the estimated exposure proportions are obtained from a public exposure database, which was established by large sample surveys in the various exposure groups. Furthermore, we assume that there is a group 0 comprising unexposed individuals only, that the true OR is 3.0 (β = 2.0) and that an unmatched partially ecologic case-control study is conducted.
Example 1: Group-level error perfectly negatively correlated with truth
Suppose that the study population can be divided into 11 groups of equal size with the true exposure proportions 0, 0.1, 0.2,..., 0.9, 1.0. Consider the situation discussed by Brenner et al.,6 where the exposure proportions were assessed with constant sensitivity (Se) and specificity (Sp) across all groups. Then, the resulting exposure assessment error can be expressed as6

$$e = \hat{x} - x = (1 - \mathrm{Sp}) - (2 - \mathrm{Se} - \mathrm{Sp})\,x. \tag{10}$$
We assume that the exposure assessment is imperfect but better than a random classification, i.e. 0 < (Se + Sp − 1) < 1. Hence, corr(x,e) = −1 and var(e)/var(x) < 1, which produces positive bias. As an illustration, suppose that Se = Sp = 0.9, which corresponds to var(e)/var(x) = 0.04 and thus c0 = 1.25 (Table 1). Using Equation 7, a0 can be calculated to be 0.75, implying that the bias factor of β is approximately

$$b \approx \frac{c_0}{a_0} = \frac{1.25}{0.75} \approx 1.67$$

(Equation 9). The expected estimate of β is the bias factor multiplied by the true value of β, i.e. 1.67 × 2 ≈ 3.3. Thus, the expected OR estimate is 1 + 3.3 ≈ 4.3.
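The arithmetic of this example can be checked with a short script. The sketch assumes that a1 and c1 are negligible (Equation 9) and uses the fact that, with constant Se and Sp, the estimated proportion is an exact linear function of the true one, x̂ = (Se + Sp − 1)x + (1 − Sp), so that the slope and intercept bias factors have simple closed forms. The function name is hypothetical.

```python
def expected_or_constant_se_sp(se, sp, true_or):
    """Expected ecologic OR estimate when every group is assessed with the same Se and Sp.

    Assumes the linear odds model and negligible error in the constructed response.
    """
    beta = true_or - 1.0
    k = se + sp - 1.0                   # slope of x_hat on x; assumed to lie in (0, 1]
    c0 = 1.0 / k                        # slope bias factor
    a0 = 1.0 - beta * (1.0 - sp) / k    # intercept bias factor; assumed positive
    b = c0 / a0                         # approximate bias factor of beta
    return c0, a0, b, 1.0 + b * beta

c0, a0, b, or_expected = expected_or_constant_se_sp(0.9, 0.9, 3.0)
print(round(c0, 2), round(a0, 2), round(b, 2), round(or_expected, 1))
# prints: 1.25 0.75 1.67 4.3
```

In this special case the intercept bias factor does not depend on the distribution of x across groups, only on Sp, Se and β.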
If there is large heterogeneity between the groups, however, the exposure assessment on the individual level may not perform equally well in all groups. As an example, assessing the occupational exposure to organic solvents, dichotomized at some pre-defined average intensity, in a group with exposed subjects (e.g. painters) is not likely to be equally sensitive and specific as the assessment of a truly unexposed group (e.g. teachers). Furthermore, even if the exposure assessment is robust to group heterogeneity, group-level errors that cannot be evaluated under assumptions about constant sensitivity and specificity across groups may result if the response rate of the sample survey is associated with exposure status and varies between groups. Instead, in order to evaluate effects of more general systematic exposure assessment errors, we will in Examples 2 and 3 make assumptions about validity and precision of the exposure assessments of the various groups.
Example 2: Group-level error positively correlated with truth
Suppose the study population can be divided into five groups and that the true exposure prevalence is 14%. Group 0 comprises 60% of the study population, whereas the remaining four groups comprise 10% each and have true exposure proportions 0.2, 0.3, 0.4, and 0.5, respectively. Suppose that the exposure proportion of group 0 is correctly assessed, whereas the proportions of the four other groups are systematically overestimated by 0.1 units. Such a situation would for example occur if the sensitivity was constant (0.9) in the assessment of the four groups, whereas the specificity varied from 0.7 in the group with highest exposure prevalence to 0.85 in the group with lowest exposure prevalence (cf. Equation 10). In this situation, corr(x,e) = 0.92 and var(e)/var(x) = 0.07, corresponding to a slope bias factor, c0, of 0.80 (not in Table 1). The intercept bias factor, a0, is calculated to be 0.99. Thus, the expected β estimate is 0.80/0.99 × 2 ≈ 1.6 and the expected OR estimate is 1 + 1.6 = 2.6.
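For error patterns that are not linear in x, the bias factors can be computed directly from population-weighted moments of the true and estimated proportions. The following sketch does this for the five groups of Example 2, again assuming a1 = c1 = 0 (Equation 9); the helper names are hypothetical, and the moment expressions for c0 and a0 are the ordinary regression slope and intercept of the true odds on x̂.

```python
def bias_factors(weights, x_true, x_est, beta):
    """Return (c0, a0, b) from group shares and true/estimated exposure proportions.

    weights: population share of each group (assumed to sum to 1).
    """
    def mean(values):
        return sum(w * v for w, v in zip(weights, values))

    mx, mxh = mean(x_true), mean(x_est)
    cov_x_xh = mean([(x - mx) * (xh - mxh) for x, xh in zip(x_true, x_est)])
    var_xh = mean([(xh - mxh) ** 2 for xh in x_est])
    c0 = cov_x_xh / var_xh                  # slope bias factor
    a0 = 1.0 + beta * (mx - c0 * mxh)       # intercept bias factor
    return c0, a0, c0 / a0

weights = [0.6, 0.1, 0.1, 0.1, 0.1]
x_true = [0.0, 0.2, 0.3, 0.4, 0.5]
x_est = [0.0, 0.3, 0.4, 0.5, 0.6]           # group 0 correct, others overestimated by 0.1
c0, a0, b = bias_factors(weights, x_true, x_est, beta=2.0)
expected_or = 1.0 + b * 2.0
print(round(c0, 2), round(a0, 2), round(expected_or, 1))
```

Because group 0 is large and correctly assessed, a0 stays close to 1 and the bias in β is driven almost entirely by c0.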
Example 3: No correlation between group-level error and truth
Consider a population, still with a true exposure prevalence of 14%, but with a large number of groups. Group 0, comprising 60% of the study population, is assessed correctly. The true mean exposure probability for subjects not belonging to group 0 is 0.35. Suppose that the response rate varied between the groups in a complex but non-systematic manner such that the estimated exposure probability of subjects not belonging to group 0 can be modelled as
![]() |
Simulations
For each of the examples above, we simulated 1000 partially ecologic case-control studies, each including 250 cases and 250 controls. The OR of each study was estimated twice; first without exposure assessment errors and then with the errors outlined in the examples. Width and coverage were calculated for 95% CI both for the untransformed OR and for the log transformed OR. The 95% CI for the untransformed OR was calculated as
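$$\widehat{\mathrm{OR}} \pm 1.96\,\sqrt{\widehat{\mathrm{var}}(\hat{\beta})}, \qquad \widehat{\mathrm{OR}} = 1 + \hat{\beta}. \tag{11}$$

For the log transformed OR, we used the delta method14 to calculate the 95% CI as

$$\exp\!\left[\ln \widehat{\mathrm{OR}} \pm 1.96\,\frac{\sqrt{\widehat{\mathrm{var}}(\hat{\beta})}}{\widehat{\mathrm{OR}}}\right]. \tag{12}$$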
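The two interval types can be sketched as follows. The code assumes the standard Wald form for the untransformed OR and the standard first-order delta-method variance for ln(OR), var(ln OR) ≈ var(β̂)/(1 + β̂)²; the function names and the example inputs are illustrative assumptions.

```python
import math

def ci_untransformed(beta_hat, var_beta, z=1.96):
    """Wald interval for OR = 1 + beta on the original scale."""
    or_hat = 1.0 + beta_hat
    half_width = z * math.sqrt(var_beta)
    return or_hat - half_width, or_hat + half_width

def ci_log_transformed(beta_hat, var_beta, z=1.96):
    """Interval computed on the ln(OR) scale (delta method) and back-transformed."""
    or_hat = 1.0 + beta_hat
    se_log = math.sqrt(var_beta) / or_hat
    return or_hat * math.exp(-z * se_log), or_hat * math.exp(z * se_log)

# Illustrative inputs only: beta_hat = 1.0 (OR = 2.0) with var(beta_hat) = 0.25.
wald_ci = ci_untransformed(1.0, 0.25)
log_ci = ci_log_transformed(1.0, 0.25)
print(wald_ci, log_ci)
```

The log-scale interval is asymmetric around the point estimate and cannot include non-positive values, which is why it tends to behave better for right-skewed estimates.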
The log transformation leads to more proper CI if the distribution of the estimated OR is skewed to the right, i.e. when the mean is above the median. Without assessment errors, the 95% CI of the untransformed OR too often, i.e. in more than 2.5% of the simulations, underestimated the true OR (Table 2). In the error-free scenarios, the log transformation gave not only satisfactory coverage but also a better balance in coverage, although the CI were too often overestimations in Examples 2 and 3. We chose to evaluate effects of assessment errors on the CI only for the best performing transformation strategy, i.e. the log transformation. The medians of the estimated OR conformed closely to the corresponding expected values calculated above under the assumptions of negligible impact of the errors in the constructed response (i.e. a1 and c1 eliminated; cf. Equation 9). Clearly, bias implies that the percentage of OR estimates above the true value deviates markedly from 50% (cf. Sorahan and Gilthorpe15). The change in spread of the OR estimates, measured by the 10th and 90th percentiles of the sampling distribution, as well as the change in width of the CI followed the direction and magnitude of the bias. The systematic errors of Example 1 lead to substantially decreased coverage. In Example 1, the non-covering CI were far too often overestimations (13.2% of the simulations), whereas underestimations were abundant (5.0%) in Example 2. In contrast, the random errors of Example 3 had only minor effects on coverage and balance.
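The data-generating step of such a simulation might be sketched as below. This is an assumption about how a study could be drawn, not the authors' code: controls are sampled according to the population shares of the groups, and cases according to those shares multiplied by the relative odds 1 + βx of each group (i.e. the OR is treated as a RR). All names are hypothetical.

```python
import random

def case_group_probs(weights, x_true, beta):
    """Group distribution among cases: population share times relative odds 1 + beta * x."""
    raw = [w * (1.0 + beta * x) for w, x in zip(weights, x_true)]
    total = sum(raw)
    return [r / total for r in raw]

def draw_study(weights, x_true, beta, n_cases=250, n_controls=250, rng=None):
    """Return the group indices of the cases and controls of one simulated study."""
    rng = rng or random.Random()
    groups = list(range(len(weights)))
    cases = rng.choices(groups, weights=case_group_probs(weights, x_true, beta), k=n_cases)
    controls = rng.choices(groups, weights=weights, k=n_controls)
    return cases, controls

# Group structure of Example 2 with true OR = 3.0 (beta = 2.0).
weights = [0.6, 0.1, 0.1, 0.1, 0.1]
x_true = [0.0, 0.2, 0.3, 0.4, 0.5]
probs = case_group_probs(weights, x_true, 2.0)
cases, controls = draw_study(weights, x_true, 2.0, rng=random.Random(1))
```

Each simulated study would then be analysed twice, once with the true proportions and once with the error-prone ones attached to the same group indices.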
Albin et al.16 conducted a population-based case-control study comprising 372 cases of acute myeloid leukaemia in Southern Sweden 1976–1993 and an equal number of controls. The individual exposure data were obtained from structured telephone interviews and assessed by occupational hygienists. For occupational exposure to aromatic hydrocarbons, the age- and gender-adjusted OR was close to unity (OR = 1.2, 95% CI: 0.73–1.9). For the same case series, we conducted a partially ecologic case-control study, with occupational titles for 1960 and for every fifth year between 1970 and 1990 obtained from national census data and with exposure probabilities for the occupational titles obtained from a Swedish translation of a Finnish probabilistic job exposure matrix.2 First, we included the same controls as above. For aromatic hydrocarbons, the group with zero exposure probability (group 0) comprised 85% of the subjects, whereas the mean exposure probability among the remaining 15% was 0.19 (range 0.01–0.88; 36 groups), which corresponds to an exposure probability of 3% among all subjects. Using the linear odds model for the partially ecologic data, adjusted for individual-level data on age and gender available for all cases and controls (cf. Equation 4), the corresponding OR estimate was elevated but imprecise (OR = 2.0, 95% CI: 0.67–6.2; based on the log transformation). If we consider the individual exposure data as being the truth, group 0 had a true exposure prevalence of 0.10, whereas the true exposure prevalence among the remaining subjects was 0.52. The corresponding estimated mean exposure probabilities were 0.0 and 0.19, respectively, based on the partially ecologic data. Generally, the magnitude of the underestimation increased with the true exposure probability. Hence, a negative correlation between the true probability and the error was indicated, which often produces overestimations if adequate confounding control is assumed (Table 1). When we added another two controls for each case in the ecologic study, positive bias remained and the precision improved (OR = 2.5, 95% CI: 1.1–5.5).
Discussion
The partially ecologic case-control design combines group-level exposure data with individual-level data on disease status, group membership, and covariates and focuses on exposure-disease associations on the individual level. Others have referred to similar study settings as aggregate data studies,5 semi-individual studies,17,18 or hybrid studies.1 The partially ecologic case-control design may well be a rapid and cost-efficient approach to examine a suspected exposure-disease association. Ecologic exposure data can be used prior to individual-level exposure data. A more stable hierarchical19 risk estimate may then be formed as a weighted average of the OR of the linear model based on ecologic data and the OR obtained from the individual-level analysis. In addition, an ecologic approach is useful as a means of obtaining risk estimates for subsets of subjects with missing individual-level exposure data. Methods for analysing multilevel case-control studies are a topic for further research.
A partially ecologic design does not guarantee absence of confounding or effect modification by group.18 Confounding control is more problematic in ecologic analyses than in individual-level analyses.1,8,20 Bias due to confounding is not necessarily reduced even if adjustments based on accurate summaries of covariates are made.8,20 Using a more detailed grouping would, however, tend to reduce such ecologic bias. Furthermore, data on covariates available for all study individuals through registries can be used to adjust for confounding across groups (cf. Equation 4). Confounding within groups can be handled if separate exposure probabilities for various levels of a covariate are provided by the exposure database (cf. Kauppinen et al2). Residual confounding within groups may produce errors in the estimated exposure probabilities.
We considered a fixed model for the baseline risk (pseudo-odds), which may produce CI with decreased coverage if large random fluctuations in the baseline risk across groups exist. Under such circumstances, a random effect model would be more appropriate; however it does not reduce bias due to systematic fluctuations in the baseline risk (cf. Prentice and Sheppard4).
Another concern relates to precision. In a partially ecologic case-control study of occupational exposures, the precision of the effect estimate may well be about three times lower than that of a corresponding case-control study based on individual exposure data.21 In our empirical example, the ecologic approach led to substantially reduced precision compared with the individual-level analysis. Cost-efficient improvement of the precision was, however, possible by including additional controls. Furthermore, a more detailed grouping generally improves precision in aggregate data studies.5
When using group-level exposure data based on sufficiently large samples of individual exposure assessments in ecologic analyses, effects of sampling errors or random assessment errors on the individual level are negligible.4,5 In contrast, systematic assessment errors may produce more pronounced errors in the group-level exposure measures, correlated or uncorrelated with the truth, which usually leads to bias. If there is no effect of exposure, however, errors in group-level measures do not produce any bias. When the exposure has an effect, the direction and the magnitude can be evaluated under assumptions about the validity and precision of the group-level exposure measures. Group-level exposure measures have theoretical, or practical, lower and upper boundaries, occasionally implying that certain error structures are more likely to occur than others.
The presented simulation results, as well as extensive simulations under various error scenarios not presented in the text, suggest that the extra bias of the linear odds model produced by errors in the constructed response variable is negligible. Thus, the bias factor of the regression parameter of epidemiological interest, β, can be approximated accurately as the ratio between the slope bias factor and the intercept bias factor of ordinary linear regression with measurement errors in the independent variable only. It turns out, however, that if the group of entirely unexposed individuals (group 0) constitutes a large fraction of the study population and this group is assessed correctly, the intercept bias is negligible (cf. Examples 2 and 3). In such situations, the magnitude of the bias in β is given directly by Table 1. On the other hand, if group 0 is small or assessed with extensive errors, the intercept bias may reinforce the bias in β (cf. Example 1).
Effects of exposure assessment errors in probabilistic job exposure matrices in case-control settings have been simulated by Bouyer et al.,21 but with emphasis on statistical power rather than CI width and coverage. Contrary to our error scenarios, they considered fixed assessments of variable true exposure proportions combined with systematic errors that implied bias towards the null. Our examples illustrate that errors in the group-level exposure measures that are correlated with the truth are of more concern than random fluctuations when establishing and using group-level exposure databases. In the empirical example, the elevated point estimate of the OR may be attributed to negatively correlated estimated exposure probabilities and true exposure proportions. Possible reasons for such systematic distortions of the ecologic exposure measures include the translation of Finnish to Swedish occupational codes, the underrepresentation of occupations with short duration in the census data, and differences in exposure definitions between the occupational hygienists and the job exposure matrix.
We used computationally easy formulas for the CI calculation, based on $\mathrm{var}(\hat{\beta})$. The consistent underestimation of the CI for the untransformed OR strongly suggests that a transformation is needed in order to obtain a proper CI. In our simulations without assessment errors, the log transformation performed reasonably well, although some overestimation was noted for skewed exposure distributions. Prentice and Mason22 have suggested two computationally more complex but potentially more accurate CI calculation methods.
For binary exposures, the true relation between the exposure probability and the OR in a partially ecologic case-control study is linear.3 Non-linearity hence indicates violated model assumptions, such as inappropriate dichotomization of a continuous exposure, group-level confounding20 by level of exposure or by other risk factors, or systematic errors in the exposure probabilities. Using means rather than individual exposure measures in group-level analyses, known as the Berkson type error23 in regression modelling, does not distort a linear exposure-disease association.24 Thus, the bias calculation procedure of the present paper should be possible to extend to group means of individual continuous exposures under the linear odds model. For logistic exposure-disease associations, however, more specific assumptions about the error structure (e.g. errors proportional to the true values) generally have to be made in order to quantify the bias due to measurement errors.25,26
Appendix
Estimating the parameters of the linear odds model
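Let $\pi(x)$ denote the disease probability in a group with a true group-level exposure x. Thus, the linear odds model (Equation 3) can be written as

$$\Omega(x;\alpha,\beta) = \frac{\pi(x;\alpha,\beta)}{1 - \pi(x;\alpha,\beta)} = \exp(\alpha)\,(1 + \beta x).$$

Suppose the current parameter estimates of the iterative weighted least squares procedure11 are $\hat{\alpha}$ and $\hat{\beta}$. For each individual i, the binary response $Y_i$ (disease status) is replaced by a continuous response $Z_i$, constructed as the first order Taylor expansion of $\Omega(x;\alpha,\beta)$ around $\pi(\hat{x}_i;\hat{\alpha},\hat{\beta})$, where $\hat{x}_i$ is the estimated group-level exposure of the individual. Thus,

$$Z_i = \Omega(\hat{x}_i;\hat{\alpha},\hat{\beta}) + \frac{Y_i - \pi(\hat{x}_i;\hat{\alpha},\hat{\beta})}{[1 - \pi(\hat{x}_i;\hat{\alpha},\hat{\beta})]^2}.$$

Treating $\Omega(\hat{x}_i;\hat{\alpha},\hat{\beta})$ and $\pi(\hat{x}_i;\hat{\alpha},\hat{\beta})$ as fixed entities, parameter estimates of the next iteration are obtained by weighted linear regression of $Z_i$ onto $\hat{x}_i$ with weights (cf. Hastie and Tibshirani11)

$$w_i = \frac{[1 - \pi(\hat{x}_i;\hat{\alpha},\hat{\beta})]^3}{\pi(\hat{x}_i;\hat{\alpha},\hat{\beta})}.$$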
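The iteration described in this appendix can be sketched in a few lines of pure Python. The working response and weights used below are the standard iteratively reweighted least squares quantities for a binary outcome with the odds as linear predictor, Z = Ω + (Y − π)/(1 − π)² and w = (1 − π)³/π; they, the starting values, the positivity guard and all names are assumptions of this sketch rather than the authors' implementation.

```python
import math

def fit_linear_odds(x_hat, y, max_iter=100, tol=1e-10):
    """Fit odds = a + b * x_hat with a = exp(alpha), b = exp(alpha) * beta.

    x_hat: estimated group-level exposure of each individual; y: 0/1 disease status.
    Returns (alpha, beta).
    """
    p_bar = sum(y) / len(y)
    a, b = p_bar / (1.0 - p_bar), 0.0             # start: common odds, zero slope
    for _ in range(max_iter):
        z, w = [], []
        for xi, yi in zip(x_hat, y):
            odds = max(a + b * xi, 1e-10)         # ad hoc guard: keep fitted odds positive
            pi = odds / (1.0 + odds)
            z.append(odds + (yi - pi) / (1.0 - pi) ** 2)   # constructed response
            w.append((1.0 - pi) ** 3 / pi)                 # regression weight
        sw = sum(w)
        x_bar = sum(wi * xi for wi, xi in zip(w, x_hat)) / sw
        z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sw
        sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x_hat))
        sxz = sum(wi * (xi - x_bar) * (zi - z_bar) for wi, xi, zi in zip(w, x_hat, z))
        b_new = sxz / sxx                         # weighted least squares slope
        a_new = z_bar - b_new * x_bar             # weighted least squares intercept
        converged = abs(a_new - a) < tol and abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return math.log(a), b / a

# Toy data: 20/100 diseased when x_hat = 0 (odds 0.25), 50/100 when x_hat = 1 (odds 1.0).
x_hat = [0.0] * 100 + [1.0] * 100
y = [1] * 20 + [0] * 80 + [1] * 50 + [0] * 50
alpha, beta = fit_linear_odds(x_hat, y)
print(math.exp(alpha), 1.0 + beta)                # baseline odds and estimated OR
```

With only two distinct exposure values the model is saturated, so the fitted baseline odds and OR reproduce the observed odds (0.25 and 1.0/0.25 = 4), which gives a simple check of the iteration.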
Acknowledgments
The authors are grateful to Maria Albin and Timo Kauppinen, who gave access to the empirical data set presented in the text, and to Roland Perfekt, Håkan Tinnerberg, and Lars Rylander, who contributed constructive suggestions. The Swedish Council for Working Life and Social Research provided financial support.
References
1 Morgenstern H. Ecologic studies. In: Rothman K, Greenland S (eds). Modern Epidemiology. 2nd Edn. Philadelphia: Lippincott-Raven, 1998, pp. 459–80.
2 Kauppinen T, Toikkanen J, Pukkala E. From cross-tabulations to multipurpose exposure information systems: a new job-exposure matrix. Am J Ind Med 1998;33:409–17.
3 Bouyer J, Hémon D. Comparison of three methods of estimating odds ratios from a job exposure matrix in occupational case-control studies. Am J Epidemiol 1993;137:472–81.
4 Prentice R, Sheppard L. Aggregate data studies of disease risk factors. Biometrika 1995;82:113–26.
5 Sheppard L, Prentice RL, Rossing MA. Design considerations for estimation of exposure effects on disease risk, using aggregate data studies. Stat Med 1996;15:1849–58.
6 Brenner H, Savitz DA, Jockel KH, Greenland S. Effects of nondifferential exposure misclassification in ecologic studies. Am J Epidemiol 1992;135:85–95.
7 Webster T. Analysis of nondifferential exposure misclassification in ecologic studies using a general error model. Epidemiology 1999;10:S134.
8 Greenland S, Morgenstern H. Ecological bias, confounding, and effect modification. Int J Epidemiol 1989;18:269–74.
9 Beral V, Chilvers C, Fraser P. On the estimation of relative risk from vital statistical data. J Epidemiol Community Health 1979;33:159–62.
10 Rothman K, Greenland S. Case-control studies. In: Rothman K, Greenland S (eds). Modern Epidemiology. 2nd Edn. Philadelphia: Lippincott-Raven, 1998, pp. 93–114.
11 Hastie TJ, Tibshirani RJ. Generalized Additive Models. London: Chapman and Hall, 1990.
12 Haitovsky Y. On errors of measurement in regression analysis in economics. Int Statist Rev 1972;40:23–35.
13 Wacholder S. When measurement errors correlate with truth: surprising effects of nondifferential misclassification. Epidemiology 1995;6:157–61.
14 Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman & Hall, 1993.
15 Sorahan T, Gilthorpe MS. Non-differential misclassification of exposure always leads to an underestimate of risk: an incorrect conclusion. Occup Environ Med 1994;51:839–40.
16 Albin M, Björk J, Welinder H et al. Acute myeloid leukemia and clonal chromosome aberrations in relation to past exposure to organic solvents. Scand J Work Environ Health 2000;26:482–91.
17 Künzli N, Tager IB. The semi-individual study in air pollution epidemiology: a valid design as compared to ecologic studies. Environ Health Perspect 1997;105:1078–83.
18 Webster T. Can semi-individual studies have ecologic bias? Epidemiology 2000;11:S95.
19 Greenland S. Principles of multilevel modelling. Int J Epidemiol 2000;29:158–67.
20 Greenland S, Robins J. Invited commentary: ecologic studies – biases, misconceptions, and counterexamples. Am J Epidemiol 1994;139:747–60.
21 Bouyer J, Dardenne J, Hémon D. Performance of odds ratios obtained with a job-exposure matrix and individual exposure assessment with special reference to misclassification errors. Scand J Work Environ Health 1995;21:265–71.
22 Prentice RL, Mason MW. On the application of linear relative risk regression models. Biometrics 1986;42:109–20.
23 Berkson J. Are there two regressions? J Am Statist Assoc 1950;45:164–80.
24 Armstrong BG. Effect of measurement error on epidemiological studies of environmental and occupational exposures. Occup Environ Med 1998;55:651–56.
25 Rosner B, Willett WC, Spiegelman D. Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error. Stat Med 1989;8:1051–69; discussion 1071–73.
26 Carroll RJ, Ruppert D, Stefanski LA. Measurement Error in Nonlinear Models. London: Chapman and Hall, 1995.