Bias due to Aggregation of Individual Covariates in the Cox Regression Model

Michal Abrahamowicz1,2 , Roxane du Berger2, Daniel Krewski3, Richard Burnett4, Gillian Bartlett5, Robyn M. Tamblyn1,5 and Karen Leffondré1,2

1 Department of Epidemiology and Biostatistics, McGill University, Montreal, Quebec, Canada.
2 Division of Clinical Epidemiology, The Montreal General Hospital, Montreal, Quebec, Canada.
3 McLaughlin Centre for Population Health Risk Assessment, University of Ottawa, Ottawa, Ontario, Canada.
4 Biostatistics and Epidemiology Division, Safe Environments Directorate, Healthy Environments and Consumer Safety Branch, Health Canada, Ottawa, Ontario, Canada.
5 Department of Medicine, McGill University, Montreal, Quebec, Canada.

Received for publication May 3, 2003; accepted for publication March 30, 2004.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 EXAMPLE
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
The impact of covariate aggregation, well studied in relation to linear regression, is less clear in the Cox model. In this paper, the authors use real-life epidemiologic data to illustrate how aggregating individual covariate values may lead to important underestimation of the exposure effect. The issue is then systematically assessed through simulations, with six alternative covariate representations. It is shown that aggregation of important predictors results in a systematic bias toward the null in the Cox model estimate of the exposure effect, even if exposure and predictors are not correlated. The underestimation bias increases with increasing strength of the covariate effect and decreasing censoring and, for a strong predictor and moderate censoring, may exceed 20%, with less than 80% coverage of the 95% confidence interval. However, covariate aggregation always induces smaller bias than covariate omission does, even if the two phenomena are shown to be related. The impact of covariate aggregation, but not omission, is independent of the covariate-exposure correlation. Simulations involving time-dependent aggregates demonstrate that bias results from failure of the baseline covariate mean to account for nonrandom changes over time in the risk sets and suggest a simple approach that may reduce the bias if individual data are available but have to be aggregated.

aggregation bias; confidentiality; Cox model; ecological bias; epidemiologic methods; simulation


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 EXAMPLE
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Many epidemiologic studies of mortality use the Cox proportional hazards model (1) to assess the effects of aggregate variables representing groups rather than individuals, such as the average income or median socioeconomic status in an area (26). Using aggregate covariates facilitates data collection and is especially appealing in semi-individual studies (7) that assess the impact of group-level exposure, such as city-specific air pollution, on individual mortality (8, 9). Growing concerns about confidentiality of individual health data (1012) will likely lead to more reliance of future epidemiologic research on aggregate covariates.

There is an enormous amount of literature on the various biases in ecologic studies that use group-level aggregates of exposures, covariates, and/or outcomes (13, 14). However, implications of covariate aggregation in Cox regression remain unclear. During follow-up, frailer subjects die out (15), and the initial mean covariate value may inadequately represent actual survival in a given subgroup (16). Ecologic aggregates have been well studied in linear regression (17, 18), but transferring results from linear to Cox regression is risky (19, 20). For example, excluding an uncorrelated predictor induces underestimation of the exposure effect in Cox regression (21).

To motivate further study of these issues, we illustrate below the impact of covariate aggregation in a real-life Cox regression analysis.


    EXAMPLE
 TOP
 ABSTRACT
 INTRODUCTION
 EXAMPLE
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
When one assesses the impact of physician characteristics on the outcomes of their patients (22), aggregation of individual patients’ data within physician practice ensures confidentiality. To illustrate the impact, we focus on the different propensity of specialists versus general practitioners to prescribe flurazepam rather than another benzodiazepine to elderly patients. Benzodiazepines are sedative-hypnotics, and their adverse effects include fall-related injuries (2325), with flurazepam seemingly associated with the highest risk (24; Tamblyn R, Abrahamowicz M, McLeod P, et al. Hazards of benzodiazepine use in the elderly: the risk of injury. Montreal, Quebec, Canada: National Health Research and Development Program, 1998. (Unpublished report 952249)).

We identified 52,847 elderly, new benzodiazepine users in Quebec, Canada, between 1990 and 1994 (Tamblyn et al., unpublished report 952249). Time to event was defined as time to first flurazepam prescription. Subjects were assigned to the physician identified on their first benzodiazepine prescription, and "exposure" was a binary indicator of specialist versus general practitioner. Of 2,598 physicians, 454 (17 percent) were specialists. Exposure was adjusted for five patient-level covariates associated with flurazepam use (26): age, sex, recent injuries, use of psychotropic medications, and use of drugs affecting motor stability.

Flurazepam was prescribed to 3,393 (6.4 percent) elderly. Row 1 of table 1 shows that, in the Cox model with individual values for the five covariates, patients seen by a specialist have a 55 percent higher hazard of receiving flurazepam.


View this table:
[in this window]
[in a new window]
 
TABLE 1. Comparison of the adjusted Cox model estimates of the association between physician specialty and choice of flurazepam for 52,847 elderly benzodiazepine users in Quebec, Canada, 1990–1994
 
Row 2 of this table shows that replacing individual patient demographics by, respectively, mean age and proportion of males among all patients of a given physician results in a 30 percent relative decrease in the estimated effect of specialty. This decrease largely exceeds Greenland’s (27) 10 percent criterion for confounding, even if both covariates are only very weakly correlated with exposure (R2 <0.04). Finally, aggregating sex alone (row 3), which is a stronger predictor of flurazepam use than age (chi-square statistics: 179.3 vs. 15.1), has a greater impact than aggregating age alone (row 4). Overall, these results suggest that aggregation of individual covariates in the Cox model may lead to an underestimation of the exposure effect, even if covariates and exposure are almost uncorrelated.

In the sections that follow, we use simulations to systematically assess the impact of aggregating covariates in the Cox regression.


    METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 EXAMPLE
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Simulation design and data generation
We simulate a hypothetical semi-individual study (7) with cluster-level exposure and individual mortality outcomes. We focus on the impact of aggregating individual covariate values on the adjusted Cox model estimates of the exposure effect. N study subjects represent k clusters of size (N/k). All individuals in cluster i (i = 1, ... , k) have the same exposure value zi of exposure Z. The value of a continuous fixed-in-time covariate X for the jth individual in the ith cluster is denoted by xij. Assuming that Z and X are not correlated, we generate cluster-specific mean covariate Mi and exposure zi values from two independent uniform[0,1] distributions. To induce covariate-exposure correlation, we first generate zi values and then assign Mi = bzi + {varepsilon}i, where {varepsilon}i has the standard normal distribution N[0,1] and b controls the strength of the correlation. Finally, individual covariate values xij (j = 1, ... , N/k) in cluster i are generated from the normal distribution N[Mi,1] with mean Mi.

We use a similar approach to generate binary exposure and covariate values, except that 1) Mi represents the expected proportion of individuals with X = 1 in cluster i, and 2) xij values are generated from the binary distribution with probability Mi. Because Mi is the mean of a binary covariate, we present results for continuous covariates only.

Survival times tij are generated from the individual-specific exponential distribution, with hazard rate {lambda}ij determined by the proportional hazards model (1):

{lambda}ij = {lambda}0 exp[ßzzi + ßxxij],

where {lambda}0 represents the baseline hazard and ßz and ßx represent the changes in the logarithm of hazard associated with a unit increase of Z and X, respectively, the latter corresponding to one standard deviation of the within-cluster covariate distribution.

To simulate loss to follow-up, we introduce random right censoring. For each subject, we generate the expected censoring time tc,ij and compare it with the corresponding survival time tij. The ith subject is considered censored at tc,ij whenever tc,ij < tij; otherwise, death is observed at tij. All censoring times are generated from the same exponential distribution, with the hazard selected to approximately control the expected censoring rate C (28). Because both exposure and mean covariate values are generated from uniform distributions with mean = 0.5, the average of the individual hazard rates for mortality {lambda}ij, defined by the equation above, is close to {lambda}s = {lambda}0 exp[0.5(ßz + ßx)]. Then, the hazard rate for generating the censoring times tc,ij is determined by solving the equation P(ts > tc) = C, where ts and tc have exponential distributions with hazards {lambda}s and {lambda}c, respectively. This approach ensures only that the expected censoring rate is close to C; the actual censoring level varies across samples. Moreover, in simulations where zi and Mi were correlated, the initial {lambda}c value often had to be corrected by trial and error (28). Initially, we set the expected censoring rate at 35 percent, resulting in actual censoring between 30 percent and 45 percent, considered moderate censoring (2932). We then investigated how varying the censoring rate between 0 percent and 85 percent affected the results.

We considered different combinations of cluster number (k), size (N/k), total sample size (N), the "true" effects of exposure z) and covariate (ßx), and censoring level. One thousand random samples were generated, except for simulations involving model 5 (table 2) with time-dependent cluster means that used 300 samples.


View this table:
[in this window]
[in a new window]
 
TABLE 2. Covariate definitions for the Cox regression models
 
Data analysis
Simulated data were analyzed by using six versions of the Cox proportional hazards model (1):

{lambda}(t | zij, vij) = {lambda}0(t) exp(ßzzij + ßvvij),

where zij and vij indicate, respectively, the values of exposure and of a model-specific variable representing the covariate for subject j in cluster i. All models used individual mortality outcomes and cluster-level exposure (zij = zi for j = 1, ... , ni). However, the six models differed in their covariate representation, as explained in table 2.

Model 1 requires individual covariate values, whereas models 2 and 3 use cluster-level aggregates. Although model 2 uses the cluster-specific population mean Mi, independent of the study sample, model 3 calculates the baseline mean {kwh266eq1} of actual study participants from cluster i. Model 4 excludes the covariate to yield an unadjusted exposure effect.

In model 5, vij is an aggregate that changes over time. Each time t that a death occurs, the time-dependent cluster mean {kwh266eq2} is recalculated by using only those cluster members who still remain at risk. In model 6, all individuals in a given cluster, regardless of when they died or were censored, are assigned a fixed-in-time value representing the mean {kwh266eq3} of only those at risk until the sample median of all uncensored times of death t50.

The bias of the adjusted exposure effect was defined as the mean difference between each of 1,000 sample-specific estimates and the true ßz. To estimate the corresponding 95 percent confidence interval, the standard error was calculated based on the empirical standard deviation of 1,000 differences. Bias was considered statistically significant if the confidence interval excluded zero. The ratio of the bias to ßz yields the relative bias. The root mean square error was calculated as the square root of the sum of the squared bias and the variance of the estimates. The coverage rate of the 95 percent confidence interval was estimated as the proportion of samples in which the confidence interval included ßz.


    RESULTS
 TOP
 ABSTRACT
 INTRODUCTION
 EXAMPLE
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Bias toward the null due to aggregation of an uncorrelated covariate
Throughout this section, exposure and covariate are not correlated. Table 3 compares the results of three models across 1,000 samples generated by using 10 clusters, each with 100 individuals, 30–45 percent censoring, a strong covariate effect (ßx = 0.81), and five different exposure effects (ßz). As expected, individual-data model 1 yields unbiased estimates of the exposure effect (data not shown). By contrast, models 2 and 3, which use aggregate covariates, show statistically significant bias toward the null (row 1), unless exposure has no effect (ßz = 0). The relative bias is almost independent of ßz (row 2). Model 2 performs marginally worse than model 3 because of discrepancies between the population and sample means. Thus, in what follows, we focus on model 3.


View this table:
[in this window]
[in a new window]
 
TABLE 3. Impact of aggregating values of a single uncorrelated covariate with a strong effect (ßx = 0.81) on the adjusted estimate of the exposure effect ßz in the Cox model (results of 1,000 simulations)
 
Aggregate covariates yield low coverage rates of the 95 percent confidence interval that, for a strong exposure effect, may fall below 80 percent (table 3, row 6). The low coverage seems entirely due to bias, because the standard errors of ßz in model 3 are slightly overestimated (row 4). Because of bias, estimates in model 3 are less accurate than those in model 1 (row 5), even if their variance is slightly lower (row 3). With a weaker covariate effect (ßx = 0.41), results were similar, but the impact of aggregation decreased (data not shown).

Model 4 shows the largest bias toward the null (table 3). Thus, covariate omission has a stronger impact on the estimated effect of an uncorrelated exposure than aggregation does. A covariate value xij may be decomposed into cluster mean {kwh266eq4} and residual {kwh266eq5} . Model 3 includes only {kwh266eq6} and omits rij. Residuals are uncorrelated with the cluster-level exposure (the mean residual in each cluster equals zero) but are associated with the hazard whenever ßx != 0. Thus, model 3 omits a predictor uncorrelated with exposure, which induces underestimation bias in Cox regression (21). The bias is even stronger in model 4 because it excludes both {kwh266eq7} and rij.

Additional analyses showed that bias in the exposure estimate is independent of size and number of clusters, and of between-cluster variance of covariate means (data not shown). To predict the impact of aggregating uncorrelated covariates, figure 1 shows the relative bias toward the null of the model 3 estimates for the exposure effect ßz (y-axis) for different censoring levels (different symbols) and covariate effects ßx (x-axis) representing the log hazard ratio for one standard deviation of the within-cluster covariate distribution. Regardless of the sign of ßx, bias toward the null increases with increasing covariate effect and with decreasing censoring, that is, increasing mortality.



View larger version (22K):
[in this window]
[in a new window]
 
FIGURE 1. Relative bias toward the null in model 3 estimates of the exposure effect ßz (y-axis) as a function of the covariate effect ßx (x-axis) and censoring levels (the five different symbols). The results were obtained from 1,000 random samples generated by assuming uncorrelated data, a moderate impact of exposure on hazard (ßz = 0.41), and 10 clusters each with 100 individual study subjects.

 
Because replacing individual (0 or 1) values of a binary covariate with the proportion of 1 in a cluster is equivalent to using the cluster-level mean, similar results were obtained for binary covariates (data not shown). For example, aggregating an uncorrelated binary covariate with a very strong (hazard ratio = 5.0) or moderate (hazard ratio = 2.25) effect underestimates the exposure effect by 24 and 10 percent, respectively.

Model 3 performance deteriorates with increasing numbers of aggregate covariates. Aggregating five uncorrelated continuous predictors, all with a moderate (ßx = 0.55) to strong x = 1.1) effect, underestimates the exposure effect by 25–50 percent (data not shown).

Exploring the sources of bias
Table 4 compares mean bias of the exposure effect estimates for three models that all use aggregate covariates but calculate them at different times (refer to table 2). Model 3 uses baseline cluster means {kwh266eq8} and yields considerably biased estimates, in contrast to the other two models that rely on updated means. In model 5, the time-dependent means are calculated at each event time by using only those subjects still at risk. In model 6, the updated means {kwh266eq9} are calculated once, using only those individuals at risk until the median event time. Yet, both models 5 and 6 produce relatively unbiased estimates of the exposure effect with small bias only when both exposure and covariate have very strong effects. Thus, bias in model 3 may be due not to aggregation per se but to failure to account for changes in the cluster means during follow-up.


View this table:
[in this window]
[in a new window]
 
TABLE 4. Comparison of bias of the exposure effect ßz between models using different representations of aggregated values of a single uncorrelated covariate in the Cox model (results of 300 simulations)
 
Figure 2A illustrates how, in a random sample in which ßx = 0.81, all cluster-specific covariate means gradually decrease because individuals with high covariate values die earlier. Additional analyses demonstrated that changes in the covariate mean over time {kwh266eq10} were bigger in clusters in which exposure was higher (zi); the average correlation between {kwh266eq11} and zi was above 0.4 (data not shown). Thus, bias in model 3 may be due to residual confounding resulting from differential misclassification of a predictor (33). Misclassification occurs because the baseline mean {kwh266eq12} is systematically different from the updated mean {kwh266eq13} and the latter better reflects the impact of the covariate on mortality in a given cluster, as average deviance for model 6 is lower than that for model 3 (data not shown). The misclassification is differential because its magnitude is correlated with exposure. Since the covariate means change mostly because of earlier deaths of individuals with "worse" values, bias increases with decreasing censoring, as shown in figure 1. By contrast, when the covariate has no effect (ßx = 0), cluster means remain stable (figure 2B) and there is no bias (table 3).



View larger version (24K):
[in this window]
[in a new window]
 
FIGURE 2. Changes over time (x-axis) in cluster covariate means (y-axis) of a single random sample. Panel A shows that under a strong covariate effect (ßx = 0.81), all cluster covariate means (identified with different symbols) decrease during follow-up. By contrast, panel B shows that when a covariate has no effect (ßx = 0), cluster covariate means tend to remain stable over time. In both panels, the random sample was generated by assuming uncorrelated data, a strong impact of exposure on hazard (ßz = 1.1), and 10 clusters each with 100 individual study subjects.

 
Reducing bias
Table 4 suggests a simple approach that may reduce bias if individual data have to be aggregated, for example, to protect confidentiality (1012). For instance, replacing age of individual patients of a physician by their mean, as in model 3, might considerably underestimate the exposure effect. Yet, this bias can be largely reduced by using updated means (model 6). Doing so would require 1) determining the median of all observed event times, 2) identifying all individuals who remain at risk until that time, and 3) calculating their cluster-specific covariate means. The need to recalculate time-dependent covariate means at each event time makes model 5 impractical. By contrast, model 6 requires calculating updated means only once.

We now revisit the flurazepam example. Row 5 of table 1 corresponds to model 6 with the updated physician-level aggregates of age and sex, calculated only for patients at risk until the median event time. Model 6 reduces the discrepancy from the unbiased model 1 estimates, from 30 percent (row 2) to 9 percent (row 5), while still preventing access to individual data.

Impact of covariate-exposure correlation
Figure 3 shows that the impact of the covariate effect (ßx) and censoring level on the underestimation of the exposure effect in model 3 does not depend on covariate-exposure correlation (rMZ).



View larger version (18K):
[in this window]
[in a new window]
 
FIGURE 3. Relative bias toward the null in model 3 estimates of the exposure effect ßz (y-axes) as a function of the covariate effect ßx (x-axes) and two censoring levels (35 percent, solid line; 85 percent, dashed line). The four panels represent different values of the correlation (rMZ) between exposure (zi) and cluster covariate mean (Mi). The results were obtained from 1,000 random samples generated by assuming a strong impact of exposure on hazard (ßz = 0.81) and 10 clusters each with 100 individual study subjects.

 
Table 5 compares the results of models 3, 4, and 6 (refer to table 2) for different combinations of covariate-exposure correlation (rMZ, columns), exposure effect (ßz, rows), and either positive or negative covariate effect (ßx). The impact of covariate aggregation in model 3 is practically independent of 1) the sign and strength of the covariate-exposure correlation and 2) the sign of the covariate effect.


View this table:
[in this window]
[in a new window]
 
TABLE 5. Comparison of bias of the exposure effect ßz between models using different representations of a single covariate at selected values of covariate-exposure correlation (rMZ) in the Cox model (results of 1,000 simulations)
 
By contrast, both 1) and 2) strongly affect results of model 4 that excludes the covariate. The bias induced by covariate omission increases with increasing covariate-exposure correlation, and its direction depends on whether the signs of rMZ and ßx agree (table 5), reflecting the well-known impact of excluding an important confounder. However, the bias toward the null is always greater than the corresponding bias away from the null. In fact, for a given combination of ßz and rMZ, the observed bias is approximately equal to the sum of the biases for 1) the same rMZ and ßz = 0, and 2) rMZ = 0 and the same ßz (table 5). The former is due to excluding the confounder (mean {kwh266eq14} , which is correlated with exposure), and the latter is due to excluding the uncorrelated residuals {kwh266eq15} . For example, for rMZ = –0.75, ßz = –0.81, and ßx = –0.81, the bias of +2.56 is close to the sum of +2.24 and +0.24 (table 5). Hence, the bias due to omitting a confounder in the Cox model is a compound of 1) classical confounding bias, whose direction depends on the correlations between exposure, confounder, and outcome; and 2) systematic bias toward the null, independent of these correlations.

Finally, table 5 indicates that bias is always much smaller in model 6 than in model 3. Thus, the advantages of replacing the baseline cluster means with the means updated at median event time apply also to covariates correlated with exposure. The updated mean model 6 offers almost unbiased estimates of the exposure effect unless the exposure-covariate correlation and/or the exposure effect is very strong (table 5).


    DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 EXAMPLE
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Our simulations show that aggregating individual covariate values in the Cox model induces a systematic underestimation of the adjusted exposure effect. Unless the covariate and/or exposure has no effect on mortality, bias toward the null occurs even if the covariate is not correlated with exposure, that is, does not meet the classic conditions for confounding. Bias induces low coverage of the confidence interval for the exposure effect estimates, even if their variance is estimated correctly. Bias is independent of the sample size and number of clusters, and relative bias is independent of the exposure effect.

The main determinant of bias is the strength of the covariate effect in terms of hazard ratio per one standard deviation of the within-cluster covariate distribution, implying that bias increases with increasing heterogeneity of individual covariate values within the clusters. Aggregating a covariate with moderate effect has only a small impact. However, aggregating a strong predictor may underestimate the exposure effect by 20 percent and reduce the coverage rates of the 95 percent confidence interval to below 80 percent. Bias increases with increasing mortality rates, that is, decreasing censoring, and with increasing number of aggregated covariates. However, covariate aggregation always induces smaller bias than its exclusion does. Thus, if individual values of important covariates are not available, it is still preferable to include their aggregates in the Cox model. Then, one can use the data in figures 1 and 3 to assess the relative bias as a function of the censoring rate and the covariate effect.

The mechanism responsible for bias due to covariate aggregation in the Cox model is different from classical confounding in linear or logistic regression, because bias occurs even if the aggregate covariates are uncorrelated with exposure. Aggregation is equivalent to omitting "residuals," which represent the discrepancies between individual covariate values and their cluster-specific mean and are not correlated with exposure. Excluding important uncorrelated covariates from the Cox model is known to induce an underestimation bias (1921, 34).

Simulations involving time-dependent, updated cluster means demonstrate that bias is not due to aggregation per se but to residual confounding, caused by differential misclassification of a predictor. Misclassification occurs because the baseline cluster means cannot account for gradual filtering out of the high-risk individuals during follow-up (15), which affects the average risk in a given cluster. Misclassification is differential because the change over time in the covariate mean is correlated with exposure. Thus, the omitted residual, initially not correlated with exposure, starts to act as a confounder with increasing follow-up.

Interestingly, the impact of covariate aggregation does not depend on either the strength or the direction of the covariate-exposure correlation because the mean of omitted "residuals" in each cluster equals zero. By contrast, this correlation does affect the bias due to covariate omission; the pattern is consistent with classical confounding. However, the bias toward the null is always stronger than that away from the null (model 4 in table 5). Indeed, the bias due to omitting a confounder in the Cox model is a compound of 1) classical confounding bias, which depends on the correlations between exposure, confounder, and outcome; and 2) systematic bias toward the null, independent of these correlations. Whereas the impact of excluding or categorizing an uncorrelated predictor in the Cox model is well described in the statistical literature (1921, 34), we explain the underlying mechanisms in terms familiar to epidemiologists. Moreover, we extend previous findings to the case of excluding or aggregating a covariate correlated with exposure.

We do not address other biases that affect ecologic studies (13, 14). For example, we assume the absence of contextual effects, which would require multilevel analyses (35). By considering a relatively simple model, similar to models 2 and 3 discussed by Greenland (35), we focus on problems, often ignored in the ecologic literature (13), specific to covariate aggregation in Cox regression.

Consistent with most simulations involving survival analyses (2932), censoring times were randomly distributed throughout follow-up, presumably to reflect dropouts. Yet, in some cohorts, individuals who had no events until follow-up was ended administratively may be censored, implying that most subjects are censored at the same time. To verify that our results are robust with respect to censoring mechanisms, we replicated the table 3 analyses while assuming that the earliest 65 percent of the generated events were all uncensored and the remaining 35 percent study subjects were all censored at the same time. Results were very similar to those shown in table 3, obtained by censoring about 37 percent at random times. Discrepancies in bias usually did not exceed 0.01, with the largest difference being 0.03 (data not shown). The reason why bias depends only on the censoring rate and not on the censoring mechanism is that bias results from systematic changes in covariate means over time because of earlier deaths of frailer individuals whose covariate values are higher. Such changes become more important in studies in which mortality is higher (lower censoring), but distribution of censoring times does not affect the results because censoring is independent of covariate and exposure values.

As in all simulations, we considered a restricted range of relevant parameters. However, the stable pattern of results suggests that our conclusions are robust. We focus on the estimated effect of a cluster-level exposure in semi-individual studies (79), but simulations involving several covariates show that aggregating important predictors affects both aggregate and individual-level variables (data not shown), which helps explain some published empirical findings. For example, Greenwald et al. (36) do not comment on why their table 3 shows that the adjusted Cox model effects of age and sex on mortality are reduced by 50 percent if individual incomes are replaced by their aggregate, a finding consistent with our simulation.

We also suggest an approach that may largely reduce bias due to covariate aggregation if individual values are, in principle, available but cannot be released. It is difficult to predict how frequently such a situation may occur in real-life epidemiologic research, but increasing confidentiality concerns will likely increase its practical relevance (1012). Indeed, current guidelines recommend different strategies for compression, categorization, aggregation, or transformation of individual data if precise values of individual covariates may allow reidentification of the study participants (3739). These concerns may be particularly important in semi-individual studies (7) because many subjects in a practice of an individual physician or a small geographic area may have unique covariate values. In this instance, one may consider replacing individual values with the cluster-specific means. Our results show that in such cases, using the updated means (model 6) rather than the baseline means (model 3) substantially improves the accuracy of the estimated exposure effect. Model 6 estimates are almost unbiased unless the exposure-covariate correlation and/or the exposure effect is very strong (table 5). In general, careful manipulation of available data and efficient use of additional information may often help reduce biases induced by data aggregation. For example, Prentice and Sheppard (40) propose an elegant method for incorporating individual-level confounders, measured for random subsamples of large cohorts, that largely reduces confounding bias in an aggregate multicohort study. While further empirical studies are necessary to evaluate the practical advantages of such approaches, simulations are instrumental to assess their potential to reduce bias because, in real life, the true parameters remain unknown.


    ACKNOWLEDGMENTS
 
This work was initiated during a reanalysis of a large-scale study of particulate air pollution and mortality, which involved a number of ecologic covariates, performed under contract to the Health Effects Institute in Cambridge, Massachusetts. The support of both the Health Effects Institute and the Natural Sciences and Engineering Research Council of Canada (Drs. Michal Abrahamowicz and Daniel Krewski) is gratefully acknowledged.

Michal Abrahamowicz is a James McGill Professor at McGill University. Daniel Krewski is the NSERC/SSHRC/McLaughlin Chair in Population Health Risk Assessment at the University of Ottawa. Robyn Tamblyn is an Associate Professor in the Departments of Medicine and of Epidemiology and Biostatistics, McGill University. Gillian Bartlett is a CIHR New Emerging Team New Investigator in the Department of Medicine at McGill University, and Karen Leffondré has received a postdoctoral award from the Division of Research in Environmental Etiology of Cancer (PREECAN).

The authors thank Drs. Jack Siemiatycki, Mark Goldberg, and David Jacob for helpful discussions and Andrea Benedetti for insightful comments on an earlier draft of the manuscript. They also thank Jennifer Gardner and Elena Toto for their technical assistance on this paper.


    NOTES
 
Correspondence to Dr. Michal Abrahamowicz, Division of Clinical Epidemiology, Montreal General Hospital Research Institute, 1650 Cedar Avenue, Room L10-513, Montreal, Quebec, Canada H3G 1A4 (e-mail: michal{at}epimgh.mcgill.ca). Back


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 EXAMPLE
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 

  1. Cox DR. Regression models and life tables (with discussion). J R Stat Soc (B) 1972;34:187–220.[ISI]
  2. Anderson RT, Sorlie P, Backlund E, et al. Mortality effects of community socioeconomic status. Epidemiology 1997;8:42–7.[ISI][Medline]
  3. Dayal HH, Polissar L, Dahlberg S. Race, socioeconomic status, and other prognostic factors for survival from prostate cancer. J Natl Cancer Inst 1985;74:1001–6.[ISI][Medline]
  4. Katz MH, Hsu L, Lingo M, et al. Impact of socioeconomic status on survival with AIDS. Am J Epidemiol 1998;148:282–91.[Abstract]
  5. LeMarchand L, Kolonel LN, Nomura AM. Breast cancer survival among Hawaii Japanese and Caucasian women: ten year rates and survival by place of birth. Am J Epidemiol 1985;122:571–8.[Abstract]
  6. Mackillop WJ, Zhang-Salomons J, Groome PA, et al. Socioeconomic status and cancer survival in Ontario. J Clin Oncol 1997;15:1680–9.[Abstract]
  7. Kunzli N, Tager IB. The semi-individual study in air pollution epidemiology: a valid design as compared to ecologic studies. Environ Health Perspect 1997;105:1078–83.[ISI][Medline]
  8. Dockery DW, Pope CA III, Xu X, et al. An association between air pollution and mortality in six U.S. cities. N Engl J Med 1993;329:1753–9.[Abstract/Free Full Text]
  9. Pope CA III, Thun MJ, Namboodiri MM, et al. Particulate air pollution as a predictor of mortality in a prospective study of U.S. adults. Am J Respir Crit Care Med 1995;151:669–74.[Abstract]
  10. Denley I, Smith SW. Privacy in clinical information systems in secondary care. BMJ 1999;318:1328–31.[Free Full Text]
  11. Behlen FM, Johnson SB. Multicenter patient records research: security policies and tools. J Am Med Inform Assoc 1999;6:435–43.[Abstract/Free Full Text]
  12. Gostin LO, Hadley J. Health services research: public benefits, personal privacy, and proprietary interests. Ann Intern Med 1998;129:833–5.[Free Full Text]
  13. Greenland S, Robins J. Invited commentary: ecologic studies—biases, misconceptions, and counterexamples. Am J Epidemiol 1994;139:747–60.[Abstract]
  14. Greenland S. Ecologic versus individual-level sources of bias in ecologic estimates of contextual health effects. Int J Epidemiol 2001;30:1343–50.[Abstract/Free Full Text]
  15. Thomsen BL, Keiding N, Altman DG. A note on the calculation of expected survival, illustrated by the survival of liver transplant patients. Stat Med 1991;10:733–8.[ISI][Medline]
  16. Nieto FJ, Coresh J. Adjusting survival curves for confounders: a review and a new method. Am J Epidemiol 1996;143:1059–68.[Abstract]
  17. Goodman LA. Some alternatives to ecological correlation. Am J Sociol 1959;64:610–25.[CrossRef][ISI]
  18. Langbein LI, Lichtman AJ. Ecological inference. Beverly Hills, CA: Sage Publications, 1978.
  19. Bretagnolle J, Huber-Carol C. Effects of omitting covariates in Cox’s model for survival data. Scand J Stat 1988;15:125–38.[ISI]
  20. Schmoor C, Schumacher M. Effects of covariate omission and categorization when analysing randomized trials with the Cox model. Stat Med 1997;16:225–37.[CrossRef][ISI][Medline]
  21. Gail MH, Wieand S, Piantadosi S. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika 1984;71:431–44.[ISI]
  22. Lloyd DC, Harris CM, Roberts DJ. Specific therapeutic group age-sex related prescribing units (STAR-PUs): weightings for analysing general practices’ prescribing in England. BMJ 1995;311:991–4.[Abstract/Free Full Text]
  23. Ray WA, Griffin MR, Downey W. Benzodiazepines of long and short elimination half-life and the risk of hip fracture. JAMA 1989;262:3303–7.[Abstract]
  24. Neutel CI, Hirdes JP, Maxwell CJ, et al. New evidence on benzodiazepine use and falls: the time factor. Age Ageing 1996;25:273–8.[Abstract]
  25. Ebly EM, Hogan DB, Fung TS. Potential adverse outcomes of psychotropic and narcotic drug use in Canadian seniors. J Clin Epidemiol 1997;50:857–63.[CrossRef][ISI][Medline]
  26. Bartlett G, Abrahamowicz M, Tamblyn R, et al. Longitudinal patterns of new benzodiazepine use in the elderly. Pharmacoepidemiol Drug Saf (in press). (Published online, December 2003 (http://www3.interscience.wiley.com/cgi-bin/jtoc/5669/), DOI: 10.1002/pds.908).
  27. Greenland S. Modeling and variable selection in epidemiologic analysis. Am J Public Health 1989;79:340–9.[Abstract]
  28. Abrahamowicz M, Ciampi A, Ramsay JO. Non-parametric density estimation for censored survival data: regression spline approach. Can J Stat 1992;20:171–85.[ISI]
  29. Wang Y, Taylor JMG. Jointly modeling longitudinal and event time data with application to acquired immunodeficiency syndrome. J Am Stat Assoc 2001;96:895–905.[CrossRef][ISI]
  30. Shih JH, Chatterjee N. Analysis of survival data from case- control family studies. Biometrics 2002;58:502–9.[CrossRef][ISI][Medline]
  31. Orbe J, Ferreira E, Nunez-Anton V. Comparing proportional hazards and accelerated failure time models for survival analysis. Stat Med 2002;21:3493–510.[CrossRef][ISI][Medline]
  32. Huang YJ. Calibration regression of censored lifetime medical cost. J Am Stat Assoc 2002;97:318–27.[CrossRef][ISI]
  33. Brenner H, Blettner M. Controlling for continuous confounders in epidemiologic research. Epidemiology 1997;8:429–34.[ISI][Medline]
  34. Struthers CA, Kalbfleisch JD. Misspecified proportional hazard models. Biometrika 1986;73:363–9.[ISI]
  35. Greenland S. A review of multilevel theory for ecologic analyses. Stat Med 2002;21:389–95.[CrossRef][ISI][Medline]
  36. Greenwald HP, Polissar NL, Borgatta EF, et al. Detecting survival effects of socioeconomic status: problems in the use of aggregate measures. J Clin Epidemiol 1994;47:903–9.[ISI][Medline]
  37. Duncan GT, Jabine TB, de Wolf VA, eds. Private lives and public policies: confidentiality and accessibility of government statistics. Panel on Confidentiality and Data Access, National Research Council. Washington, DC: National Academies Press, 1993.
  38. Institute of Medicine. Protecting data privacy in health services research. Washington, DC: National Academies Press, 2000.
  39. Privacy and confidentiality of health information at CIHI: principles and policies for the protection of personal health information and policies for institution-identifiable information. 3rd ed. Ottawa, Ontario, Canada: Canadian Institute for Health Information (CIHI), 2002.
  40. Prentice R, Sheppard L. Aggregate data studies of disease risk factors. Biometrika 1995;82:113–25.[ISI]




This Article
Abstract
FREE Full Text (PDF)
Alert me when this article is cited
Alert me if a correction is posted
Services
Email this article to a friend
Similar articles in this journal
Similar articles in ISI Web of Science
Similar articles in PubMed
Alert me to new issues of the journal
Add to My Personal Archive
Download to citation manager
Search for citing articles in:
ISI Web of Science (4)
Disclaimer
Request Permissions
Google Scholar
Articles by Abrahamowicz, M.
Articles by Leffondré, K.
PubMed
PubMed Citation
Articles by Abrahamowicz, M.
Articles by Leffondré, K.