1 Department of Population and Family Health Sciences, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD.
2 Department of Hospital Epidemiology and Infection Control, Johns Hopkins Medical Institutions, Baltimore, MD.
Received for publication July 26, 2002; accepted for publication October 16, 2003.
ABSTRACT
case-control studies; epidemiologic methods; logistic models; missing data; regression analysis
INTRODUCTION
For analysis of matched case-control data with missing exposure values, three approaches have been suggested in the literature. The first is McNemar's test for 2 × 2 tables or conditional logistic regression for analyzing matched pairs with complete exposure data; pairs with missing values are simply excluded from the analysis. The second method is unconditional logistic regression analysis for cases and controls with complete data; again, subjects with missing values are ignored. However, unconditional logistic regression is not considered in this paper, because it is biased whenever the matching factors are confounders, even when there are no missing data. The third method is the missing-indicator method, a conditional logistic regression approach that introduces a missing indicator for all pairs, set to 1 if there is a missing exposure value and 0 otherwise (2).
The missing-indicator method has been extended to the context of matched case-control studies and applied to analyses of the risk of childhood leukemia (2) and endometrial cancer (3). The fact that this method uses all of the data and still preserves the matched data structure makes it appealing for matched case-control studies with missing exposure values. However, few studies have assessed the performance of this method under different missing-value scenarios. The objective of our study was to use Monte Carlo simulation to evaluate the performance of the missing-indicator method in comparison with conditional logistic regression for 1:m individually matched case-control studies, as well as for a 1:1 matched data set on low birth weight with one and two exposure variables.
METHODS
Exposure status for cases and matched controls
Each matched case-control pair or group in the simulations included one case and up to four controls. A case's exposure status, which was dichotomized as yes or no, was generated using a probability of 0.3, 0.5, or 0.7. A control's exposure status (yes/no) was generated using the calculated probabilities based on the following three scenarios, given an odds ratio for exposure and disease of 2, 4, or 6. In scenario 1, there was no confounding, because a control's exposure status was independent of that of the matched case; that is, the probability of being exposed was equal for all controls. Matched analysis and unmatched analysis yielded identical odds ratios (4). In scenario 2, negative confounding was introduced. A control matched with an exposed case was 50 percent more likely to be exposed than a control matched with an unexposed case. The odds ratio from the unmatched analysis was closer to unity than that from the matched analysis. Thus, with negative confounding, ignoring matching in the analysis could bias the estimates toward unity. In scenario 3, positive confounding was introduced by ascribing a control matched with an exposed case a one-third-lower likelihood of exposure than a control matched with an unexposed case. The odds ratio from the unmatched analysis exceeded that from the matched analysis. In all scenarios, an individual control's exposure status was always independent of that of any other control. Given an odds ratio of 4 and a probability of 0.5 for a case's exposure, we can calculate the probabilities of a control's being exposed under the three scenarios (table 1).
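The calculation of control exposure probabilities described above can be sketched as follows. This is one reading of the text, not a reproduction of table 1: it assumes that "50 percent more likely" and "one-third lower" are ratios of exposure probabilities (1.5 and 2/3), and that the target odds ratio is the matched-pair odds ratio p(1 − q1) / ((1 − p) q0), where p is the case exposure probability and q1, q0 are the exposure probabilities for a control matched with an exposed or an unexposed case. The function and argument names are hypothetical.

```python
def control_exposure_probabilities(odds_ratio, p_case, relative_likelihood):
    """Return (q_exposed_case, q_unexposed_case) for a matched control.

    relative_likelihood is the assumed ratio q_exposed_case / q_unexposed_case:
    1.0 for scenario 1, 1.5 for scenario 2, and 2/3 for scenario 3.
    Solves p(1 - r*q0) / ((1 - p) * q0) = odds_ratio for q0.
    """
    q_unexposed_case = p_case / (
        odds_ratio * (1 - p_case) + p_case * relative_likelihood
    )
    q_exposed_case = relative_likelihood * q_unexposed_case
    return q_exposed_case, q_unexposed_case


# Odds ratio of 4 and case exposure probability of 0.5, as in the text.
for label, ratio in [("scenario 1", 1.0), ("scenario 2", 1.5), ("scenario 3", 2 / 3)]:
    q1, q0 = control_exposure_probabilities(4, 0.5, ratio)
    print(label, round(q1, 4), round(q0, 4))
```

Under these assumptions, the implied unmatched odds ratio is about 3.4 in scenario 2 and about 4.6 in scenario 3, which agrees with the direction of the bias described in the text.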
Data analysis
We analyzed the generated data using both conditional logistic regression, by excluding pairs with missing values, and the missing-indicator method, by introducing a missing indicator into the regression.
Conditional logistic regression. The general model was logit(Y) = βX, where Y represented case or control status and X indicated a subject's exposure status (equal to 1 if exposed and 0 if unexposed). The conditional maximum likelihood approach was used to compute estimates. In the Stata software, the command was "clogit Y X, group(pair)," where "pair" indicated a matched pair. This regression can also be performed in SAS using the survival analysis PHREG procedure, in which a discrete logistic model is applied with a stratum formed for each matched pair. The command in SAS was "proc phreg; model time*Y(0) = X / ties = discrete; strata pair;" where time was set so that cases had the same event time while controls had later censored times (e.g., set time as "2 − (case status coded as 0 or 1)"), Y(0) indicated that controls were censored observations, and "ties = discrete" specified that the proportional hazards model was to be replaced by the discrete logistic model. When m > 1, we excluded a matched set if either the case or all of its matched controls had missing values, which is the default method of dealing with observations with missing values in both Stata and SAS conditional logistic regression.
Missing-indicator method. With the missing-indicator method, we employed conditional logistic regression by introducing a new indicator variable Z into the regression. The indicator was set to 1 if exposure information was missing and 0 otherwise. Missing exposure values were then replaced with 0 in the regression model (2). Estimation was performed using conditional logistic regression and the commands noted above, except that the missing-indicator variable Z was included in the regression model.
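The recoding step of the missing-indicator method can be illustrated with a short sketch. It assumes that exposure is stored per subject as 1, 0, or None (missing); the function name is hypothetical, and the code covers only the data preparation, not the conditional logistic regression fit itself.

```python
def missing_indicator_recode(exposures):
    """Return (x, z) lists for the missing-indicator method.

    x: exposure with missing values replaced by 0.
    z: 1 where the exposure was missing, 0 otherwise.
    """
    x = [0 if value is None else value for value in exposures]
    z = [1 if value is None else 0 for value in exposures]
    return x, z


# Example: three subjects, the second with a missing exposure value.
x, z = missing_indicator_recode([1, None, 0])
print(x, z)
```

Both X and Z would then enter the model, for example as "clogit Y X Z, group(pair)" in Stata.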
We generated 2,000 matched pairs in each simulation to achieve stable statistical estimation. We summarized mean estimates (regression coefficients), standard deviations, and empirical 95 percent confidence interval coverage from 1,000 simulations. All analyses were conducted using Stata 7 (5).
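A simplified version of one simulation setting is sketched below, under several assumptions that are not taken from the paper: only 1:1 pairs are generated, scenario 1 (no confounding) is used with a case exposure probability of 0.5 and a control exposure probability of 0.2 (odds ratio of 4), each exposure value is deleted completely at random with a probability of 0.3, and only 200 replicates are run to keep the example fast. For 1:1 pairs and a single binary exposure, the conditional maximum likelihood estimate has the closed form log(n10 / n01), which is used here in place of a call to Stata or SAS. All names are hypothetical.

```python
import math
import random


def simulate_complete_pair_estimate(n_pairs, p_case, q_if_case_exposed,
                                    q_if_case_unexposed, p_missing, rng):
    """One replicate: log(n10 / n01) from pairs with complete exposure data."""
    n10 = 0  # case exposed, control unexposed
    n01 = 0  # case unexposed, control exposed
    for _ in range(n_pairs):
        case_exposed = rng.random() < p_case
        q = q_if_case_exposed if case_exposed else q_if_case_unexposed
        control_exposed = rng.random() < q
        # Completely-at-random missingness, independent for each subject.
        case_missing = rng.random() < p_missing
        control_missing = rng.random() < p_missing
        if case_missing or control_missing:
            continue  # pair excluded, as in complete-pair analysis
        if case_exposed and not control_exposed:
            n10 += 1
        elif control_exposed and not case_exposed:
            n01 += 1
    return math.log(n10 / n01)


rng = random.Random(2003)  # fixed seed so the sketch is repeatable
estimates = [
    simulate_complete_pair_estimate(2000, 0.5, 0.2, 0.2, 0.3, rng)
    for _ in range(200)
]
mean_estimate = sum(estimates) / len(estimates)
sd_estimate = math.sqrt(
    sum((b - mean_estimate) ** 2 for b in estimates) / (len(estimates) - 1)
)
print(round(mean_estimate, 3), round(sd_estimate, 3))
```

The mean estimate should lie close to log(4), in line with the unbiasedness of complete-pair conditional logistic regression under completely-at-random missingness.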
Power and efficiency analysis
Empirical power was defined as the probability that zero was excluded from the 95 percent confidence interval for the regression coefficient of exposure (6). Relative power was defined as the ratio of the average empirical power calculated from conditional logistic regression with missing values, or from the missing-indicator method, to that from conditional logistic regression without missing values. Asymptotic relative efficiency was the ratio of the variance of the exposure coefficient from conditional logistic regression with missing values, or from the missing-indicator method, to the variance of the exposure coefficient from conditional logistic regression without missing values (7).
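These definitions can be written out as a brief sketch. Two assumptions are made that the text does not state: the 95 percent confidence interval is taken to be a Wald interval (estimate ± 1.96 standard errors), and the variance in the efficiency ratio is taken to be the empirical variance of the estimates across repetitions. Function names are hypothetical.

```python
def empirical_power(estimates, standard_errors, z_crit=1.96):
    """Share of repetitions whose Wald 95% interval excludes zero."""
    excluded = sum(
        1
        for b, se in zip(estimates, standard_errors)
        if b - z_crit * se > 0 or b + z_crit * se < 0
    )
    return excluded / len(estimates)


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def relative_power(power_method, power_complete):
    """Power of a method with missing data relative to complete data."""
    return power_method / power_complete


def asymptotic_relative_efficiency(estimates_method, estimates_complete):
    """Variance ratio; values above 1 indicate lower efficiency."""
    return sample_variance(estimates_method) / sample_variance(estimates_complete)
```

With this definition, a larger ratio corresponds to a larger variance relative to the analysis of complete data and therefore to lower efficiency.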
We determined the numbers of pairs in the simulation to achieve an empirical power of 0.950 to 0.999 for conditional logistic regression analyses using data without missing values. We generated data for each analysis by performing 1,000 repetitions. Nonetheless, results from 20 repetitions that had the highest standard errors according to the conditional logistic regression without missing values were excluded from the summary.
Application to a low birth weight data set
The low birth weight data introduced by Hosmer and Lemeshow (8) are a 1:1 age-matched case-control data set with a total of 56 pairs of mothers. This data set was based on low birth weight data originally collected in 1986 at Baystate Medical Center, Springfield, Massachusetts. The data set comprises maternal covariates, including the mother's history of previous preterm delivery, smoking status, race, the presence of uterine irritability, and hypertension. The primary infant outcome was low birth weight (yes/no).
For this study, we augmented the data set 10 times to generate 560 pairs in order to obtain a sufficient sample size. We constructed two models including either one variable (smoking) or two variables (smoking and preterm delivery) to investigate the performance of the two methods in the one- and two-variable settings. Both smoking and preterm delivery were dichotomized. The smoking variable (for both smokers and nonsmokers) was randomly deleted with a probability of 0.1, 0.2, or 0.3 under the completely-at-random missing exposure scenario, or smokers were 50 percent more likely to have missing values under the exposure-dependent missing-value scenario. We repeated the analysis 1,000 times to obtain mean estimates, standard deviations, and coverage of the 95 percent confidence intervals in Stata 7.
RESULTS
Figure 1 and table 2 show the results of regression analysis in which the log odds ratio was 1.39 (equivalent to an odds ratio of 4) and the proportion of exposed cases equaled 0.5. The four parts of figure 1 show mean estimates and standard deviations of the estimates for the four different scenarios. The estimates from the first three missing-value scenarios (completely-at-random, case-dependent, and exposure-dependent) were similar, but all differed from the estimates from the fourth scenario, where the missing values were case/exposure-dependent. In the first three missing-value scenarios, the estimates from conditional logistic regression were unbiased, and the empirical 95 percent confidence interval coverage was similar to the nominal coverage. This finding held irrespective of the presence of confounding effects. However, the estimates obtained from the missing-indicator method varied, depending on the presence of confounding effects. Negative confounding biased the estimates from the missing-indicator method toward zero, whereas positive confounding biased the estimates away from zero. The confidence interval coverage from the missing-indicator method ranged from 86 percent to 93 percent with negative confounding and from 89 percent to 96 percent with positive confounding.
The relative empirical power and asymptotic relative efficiency are presented in table 3. In each scenario, an increased missing proportion was associated with a stable or slightly decreased relative power and an elevated asymptotic relative efficiency (i.e., decreased efficiency). The missing-indicator method was slightly more powerful and efficient than conditional logistic regression.
When missing values were case/exposure-dependent and there was no confounding effect, estimates from the two methods for 1:4 matched studies were similar to those from 1:1 matched studies. In the presence of negative or positive confounding, the estimates from the two methods for 1:4 matched studies were smaller or larger than those for 1:1 matched case-control studies, and the confidence interval coverage ranged widely from 0 percent to 96 percent.
As the missing proportion increased, the relative empirical power and asymptotic relative efficiency showed trends similar to those in the 1:1 matched studies.
In both 1:1 and 1:4 matched studies, as the proportions of missing values increased, the standard deviations increased and the confidence interval coverage remained essentially unchanged. Exceptions were found for the two methods under the case/exposure-dependent missing-value scenario and for the missing-indicator method in the presence of confounding; in those cases, the confidence interval coverage decreased. The standard deviations from the missing-indicator method were slightly lower. We examined changes of estimates and standard deviations when the proportion of cases being exposed was 0.3, 0.5, or 0.7 and the odds ratio was 2, 4, or 6 and found a similar pattern in each setting. However, given a fixed odds ratio, as the case exposure proportion increased, biases, if they existed, increased but standard deviations decreased, and the confidence interval coverage decreased except for conditional logistic regression or for scenarios without confounding. As the odds ratio increased, there was no significant change in bias/coefficient ratios and confidence interval coverage, though standard deviations increased.
Application to a low birth weight data set
The first panel in figure 2 shows results for the one-variable model with the smoking covariate only. The first mean estimate and standard deviation in this panel, which are considered the true values, are from the conditional logistic regression analysis of the full data set.
The second panel in figure 2 presents the estimates and standard deviations obtained with smoking and preterm delivery in a two-covariate regression model. The first mean estimates in the two subpanels, derived from conditional logistic regression for the complete data set, are the true values for the effects of smoking and preterm delivery, respectively. The estimates from the missing-indicator method were close to the unbiased estimates from conditional logistic regression. In addition, the 95 percent confidence interval coverage from conditional logistic regression and the missing-indicator method was 95 percent or more, higher than the nominal coverage for the reason given above. Similar trends in estimation precision held for both the effect of smoking (the variable with missing values) and the effect of preterm delivery (the variable without missing values).
DISCUSSION
We evaluated the performance of the two methods, conditional logistic regression and the missing-indicator method, for the analysis of data with missing exposure values in 1:m matched case-control studies. The results indicated that the precision of estimation depended on the underlying missing-value scenarios and on whether there was confounding between exposure and disease. It is well known that in 1:1 matched case-control studies, the odds ratio is estimated from the number of matched pairs of exposed cases and unexposed controls divided by the number of matched pairs of unexposed cases and exposed controls. When missing values were completely-at-random, case-dependent, or exposure-dependent, as designed in this study, a comparable proportion of pairs would be dropped from both the numerator and the denominator in the estimation of the odds ratio. Thus, the estimates should be the same as the true odds ratios. Under these three scenarios, data with missing values still represented the complete data. Consequently, conditional logistic regression generated unbiased estimates with empirical confidence interval coverage comparable to the nominal coverage.
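The argument about comparable proportions can be written out for 1:1 pairs. The notation is introduced here for illustration only, and the scenario definitions are inferred from their names: n10 and n01 are the numbers of discordant pairs (exposed case with unexposed control, and unexposed case with exposed control), r10 and r01 are the probabilities that both exposure values in such a pair are observed, and each π is a probability of a missing value, assumed independent between the two members of a pair.

```latex
\widehat{OR} = \frac{n_{10}}{n_{01}}, \qquad
\widehat{OR}_{\text{complete pairs}} \approx \frac{r_{10}\, n_{10}}{r_{01}\, n_{01}}

% Completely at random (same \pi for every subject):
r_{10} = r_{01} = (1 - \pi)^2

% Case-dependent (\pi depends only on case or control status):
r_{10} = r_{01} = (1 - \pi_{\text{case}})(1 - \pi_{\text{control}})

% Exposure-dependent (\pi depends only on exposure status E or U):
r_{10} = r_{01} = (1 - \pi_{E})(1 - \pi_{U})

% Case/exposure-dependent (\pi depends on both):
r_{10} = (1 - \pi_{\text{case},E})(1 - \pi_{\text{control},U}), \qquad
r_{01} = (1 - \pi_{\text{case},U})(1 - \pi_{\text{control},E})
```

In the first three cases the retention probabilities cancel, so the ratio of discordant pairs is preserved in expectation. In the fourth case r10 and r01 generally differ, which is consistent with the bias observed for both methods under the case/exposure-dependent scenario.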
In the absence of confounding, the exposure status of matched cases and controls was completely independent. Consequently, the missing-indicator method also produced unbiased estimates. However, if there was any confounding in the matching process such that the exposure status for matched cases and controls was no longer independent, the estimates from the missing-indicator method were slightly biased and the confidence interval coverage was slightly lower than the nominal level, though the method had slightly greater statistical power and a lower asymptotic relative efficiency ratio (i.e., higher efficiency). It appeared that the missing-indicator method did not eliminate the confounding effect completely. It is noteworthy that biases produced by the missing-indicator method were relatively small in view of the fact that considerable confounding and up to 30 percent missing exposure values were assumed in our simulations.
In addition, we extended the missing-indicator method to a two-variable setting, with one of the two variables having missing values. By applying the two methods to an augmented data set, we found that estimates obtained from the two methods were similar for variables with and without missing values. However, whether the missing-indicator method provides an appropriate approach for handling missing exposure values depends on the investigator's willingness to accept a trade-off between gains in estimating efficiency and losses in estimating accuracy.
The sample for our analysis included 2,000 pairs in the simulations and 560 pairs in the augmented data analysis. Although the performance of the missing-indicator method was only slightly poorer than that of conditional logistic regression in such a large data sample, the results might not hold in a smaller data set. We found that even conditional logistic regression produced significantly biased estimates when the sample size was reduced to 100 pairs of cases and controls (not shown). Consequently, the missing-indicator method would be expected to yield more biased estimates in studies with small sample sizes. More importantly, when missing values were case/exposure-dependent, neither of the two methods could appropriately handle the missing values. Therefore, we suggest that other, more sophisticated techniques, such as the multiple imputation method (11, 12) or the semiparametric inference approach introduced by Rathouz et al. (13), be considered.
In this study, we examined the performance of the missing-indicator method in handling missing exposure values in matched case-control studies. We compared the estimates and confidence interval coverage computed from the missing-indicator method with those obtained from conditional logistic regression. We conclude that conditional logistic regression provided a slight advantage for bias and coverage probability, at the cost of slightly reduced statistical power and efficiency. The missing-indicator method was slightly less reliable than conditional logistic regression, and the performance of this method was affected by confounding and missing-value scenarios. Therefore, the missing-indicator method should be used cautiously.