Comparison of the Missing-Indicator Method and Conditional Logistic Regression in 1:m Matched Case-Control Studies with Missing Exposure Values

Xianbin Li1, Xiaoyan Song2, and Ronald H. Gray1

1 Department of Population and Family Health Sciences, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD.
2 Department of Hospital Epidemiology and Infection Control, Johns Hopkins Medical Institutions, Baltimore, MD.

Received for publication July 26, 2002; accepted for publication October 16, 2003.


    ABSTRACT
The missing-indicator method and conditional logistic regression have been recommended as alternative approaches for data analysis in matched case-control studies with missing exposure values. The authors evaluated the performance of the two methods using Monte Carlo simulation. Data were generated from a 1:m matched design based on McNemar’s 2 x 2 tables with four scenarios for missing values: completely-at-random, case-dependent, exposure-dependent, and case/exposure-dependent. In their analysis, the authors used conditional logistic regression for complete pairs and the missing-indicator method for all pairs. For 1:1 matched studies, given no confounding between exposure and disease, the two methods yielded unbiased estimates. Otherwise, conditional logistic regression produced unbiased estimates with empirical confidence interval coverage similar to nominal coverage under the first three missing-value scenarios, whereas the missing-indicator method produced slightly more bias and lower confidence interval coverage. An increased number of matched controls was associated with slightly more bias and lower confidence interval coverage. Under the case/exposure-dependent missing-value scenario, neither method performed satisfactorily; this indicates the need for more sophisticated statistical methods for handling such missing values. Overall, compared with the missing-indicator method, conditional logistic regression provided a slight advantage in terms of bias and coverage probability, at the cost of slightly reduced statistical power and efficiency.

case-control studies; epidemiologic methods; logistic models; missing data; regression analysis


    INTRODUCTION
Case-control studies are an efficient method of investigating associations between exposure and disease, but in many epidemiologic studies, confounding factors are a concern because they distort the associations. To control for confounding effects, the case-control approach often matches cases to one or more controls. Through individual matching and group matching, controls are selected so that they are similar to cases in terms of certain characteristics, such as age, race, sex, socioeconomic status, and occupation (1). Statistical methods developed to analyze matched case-control data without missing exposure values are available in many statistical packages, including SAS (SAS Institute, Inc., Cary, North Carolina) and Stata (Stata Corporation, College Station, Texas). McNemar’s test is used to analyze 1:1 individually matched case-control data; the Mantel-Haenszel method is commonly used for 1:m (m >= 1) matched case-control data with one exposure of interest; and conditional logistic regression is applied to estimate the associations between one or more exposures and disease. However, there is no single, readily available method for handling matched case-control data with missing exposure values, which, unfortunately, is a common problem in practice.

For analysis of matched case-control data with missing exposure values, three approaches have been suggested in the literature. The first is McNemar’s test for 2 x 2 tables or conditional logistic regression for analyzing matched pairs with complete exposure data. Pairs with missing values are simply excluded from the analysis. The second method is unconditional logistic regression analysis for cases and controls with complete data; again, subjects with missing values are excluded. However, unconditional logistic regression is not considered in this paper, because it is biased whenever the matching factors are confounders, even when there are no missing data. The third method is the missing-indicator method, a conditional logistic regression approach that introduces a missing indicator for all pairs, set to 1 if the exposure value is missing and 0 otherwise (2).

The missing-indicator method has been extended to the context of matched case-control studies and applied to analyses of the risk of childhood leukemia (2) and endometrial cancer (3). The fact that this method uses all of the data and still preserves the matching data structure makes it appealing for matched case-control studies with missing exposure values. However, few studies have assessed the performance of this method under different missing-value scenarios. The objective of our study was to use Monte Carlo simulation to evaluate the performance of the missing-indicator method in comparison with conditional logistic regression for 1:m individually matched case-control studies, as well as for a 1:1 matched data set on low birth weight with one- and two-exposure variables.


    METHODS
Monte Carlo simulation
Our simulations were based on McNemar’s 2 x 2 contingency table. Given an exposure proportion among cases, an association between exposure and disease measured by the odds ratio, and a prespecified degree to which controls matched with exposed cases were more or less likely to have exposure than controls matched with unexposed cases, the probability that each matched control was exposed was derived using the contingency table. Although confounders were not explicitly generated in simulations, different confounding effects—no confounding, negative confounding, and positive confounding—were implicitly present because of the differential probabilities of exposure among controls matched with exposed or unexposed cases.

Exposure status for cases and matched controls
Each matched case-control pair or group in the simulations included one case and up to four controls. A case’s exposure status, which was dichotomized as yes or no, was generated using a probability of 0.3, 0.5, or 0.7. A control’s exposure status (yes/no) was generated using the calculated probabilities based on the following three scenarios, given an odds ratio for exposure and disease of 2, 4, or 6. In scenario 1, there was no confounding, because a control’s exposure status was independent of that of the matched case; that is, the probability of being exposed was equal for all controls. Matched analysis and unmatched analysis yielded identical odds ratios (4). In scenario 2, negative confounding was introduced. A control matched with an exposed case was 50 percent more likely to be exposed than a control matched with an unexposed case. The odds ratio from the unmatched analysis was closer to unity than that from the matched analysis. Thus, with negative confounding, ignoring matching in the analysis could bias the estimates towards unity. In scenario 3, positive confounding was introduced by ascribing a control matched with an exposed case a one-third-lower likelihood of exposure than a control matched with an unexposed case. The odds ratio from the unmatched analysis exceeded that from the matched analysis. In all scenarios, an individual control’s exposure status was always independent of that of any other control. Given an odds ratio of 4 and a probability of 0.5 for a case’s exposure, we can calculate the probabilities of a control’s being exposed under the three scenarios (table 1).
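The control exposure probabilities described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the matched odds ratio equals the ratio of the two discordant-pair probabilities, and that "50 percent more likely" and "one-third lower" are multiplicative ratios of 1.5 and 2/3 between the two control probabilities. The function names are hypothetical, and the numbers are derived here rather than copied from table 1.

```python
def control_exposure_probs(case_exposure_prob, odds_ratio, ratio):
    """Return (q1, q0): exposure probabilities for a control matched with an
    exposed case (q1) and with an unexposed case (q0).

    Assumed relations (illustrative, not taken from the paper's code):
      odds_ratio = p * (1 - q1) / ((1 - p) * q0)   # discordant-pair ratio
      q1 = ratio * q0                              # confounding ratio
    """
    p = case_exposure_prob
    q0 = p / (odds_ratio * (1 - p) + p * ratio)
    q1 = ratio * q0
    return q1, q0


def unmatched_odds_ratio(case_exposure_prob, q1, q0):
    """Odds ratio obtained when the matching is ignored."""
    p = case_exposure_prob
    control_prob = p * q1 + (1 - p) * q0  # marginal control exposure
    return (p / (1 - p)) / (control_prob / (1 - control_prob))


# Ratios assumed for scenarios 1 to 3: none, negative, positive confounding
scenarios = {"none": 1.0, "negative": 1.5, "positive": 2.0 / 3.0}
for name, ratio in scenarios.items():
    q1, q0 = control_exposure_probs(0.5, 4.0, ratio)
    print(name, round(q1, 3), round(q0, 3),
          round(unmatched_odds_ratio(0.5, q1, q0), 2))
```

Under these assumptions the unmatched odds ratio equals 4 with no confounding, falls below 4 with negative confounding, and exceeds 4 with positive confounding, which agrees with the direction of the effects described in the text.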


TABLE 1. A control’s probability of being exposed under three different scenarios for confounding (case’s exposure probability = 0.5 and odds ratio = 4)
 
Generation of missing exposure values
Missing exposure values were generated completely at random or dependent on disease and/or exposure status. The dependent (or differential) missing-value scenarios were designed as 1) case-dependent missing values in which cases were set to be 50 percent less likely to have missing exposure values than controls (reference group) or vice versa; 2) exposure-dependent missing values in which exposed subjects were 50 percent less likely to have missing values than unexposed subjects (reference group) or vice versa; and 3) case/exposure-dependent missing values in which exposed cases were 50 percent less likely to have missing values than unexposed cases and all controls (reference group). The missing value for each subject was independent of the missing values for other subjects, including those in the same matched pairs. The proportion of missing values (hereafter called the missing proportion) was set to be 0.1, 0.2, or 0.3 for subjects in the completely-at-random missing-value scenario or for reference groups in the dependent missing-value scenarios.
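The four missing-value scenarios can be sketched as a single probability rule. This is an illustrative reading only: it assumes that "50 percent less likely" means the reference missing proportion is multiplied by 0.5, it covers only the direction in which cases, exposed subjects, or exposed cases have the lower proportion (not the "vice versa" variants), and the function and scenario names are hypothetical.

```python
import random


def missing_probability(is_case, exposed, scenario, base, factor=0.5):
    """Probability that a subject's exposure value is missing.

    base   -- missing proportion in the reference group (0.1, 0.2, or 0.3)
    factor -- assumed multiplier for the less-often-missing group
    """
    if scenario == "completely_at_random":
        return base
    if scenario == "case_dependent":
        return base * factor if is_case else base
    if scenario == "exposure_dependent":
        return base * factor if exposed else base
    if scenario == "case_exposure_dependent":
        return base * factor if (is_case and exposed) else base
    raise ValueError("unknown scenario: " + scenario)


def observe(is_case, exposed, scenario, base, rng=random):
    """Return the exposure (1/0), or None if it is drawn as missing.
    Each subject is drawn independently of all other subjects."""
    if rng.random() < missing_probability(is_case, exposed, scenario, base):
        return None
    return 1 if exposed else 0
```

For example, `missing_probability(True, True, "case_exposure_dependent", 0.2)` gives 0.1 for an exposed case, while every other subject in that scenario keeps the reference proportion of 0.2.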

Data analysis
We analyzed the generated data using both conditional logistic regression, by excluding pairs with missing values, and the missing-indicator method, by introducing a missing indicator into the regression.

Conditional logistic regression. The general model was logit(Y) = βX, where Y represented case or control status and X indicated a subject’s exposure status (equal to 1 if exposed and 0 if unexposed). The conditional maximum likelihood approach was used to compute estimates. In the Stata software, the command was "clogit Y X, group (pair)," where "pair" indicated a matched pair. This regression can also be performed in SAS using the survival analysis PHREG procedure, in which a discrete logistic model is applied with a stratum formed for each matched pair. The command in SAS was "proc phreg; model time*Y(0) = X/ties = discrete; strata pair," where time was set so that cases had the same event time while controls had later censored times (e.g., time set to "2 – (case status coded as 0 or 1)"), Y(0) indicated that controls were censored observations, and "ties = discrete" specified that the proportional hazards model be replaced by the discrete logistic model. When m > 1, we excluded a matched set if either the case or all of its matched controls had missing values, which is the default method of dealing with observations with missing values in both Stata and SAS conditional logistic regression.
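For readers without Stata or SAS, the conditional maximum likelihood estimate for a single binary exposure can be sketched in a few lines. This is not the clogit or PHREG implementation; it is a plain Newton-Raphson iteration on the standard conditional log-likelihood for sets with one case, and the function name and data layout are assumptions made for illustration.

```python
import math


def clogit_fit(matched_sets, iterations=25):
    """Conditional logistic regression with one binary exposure.

    matched_sets -- list of (case_x, [control_x, ...]) with x coded 0 or 1;
                    each set has one case and one or more controls.
    Returns the estimated log odds ratio (beta).
    """
    beta = 0.0
    for _ in range(iterations):
        gradient = 0.0
        hessian = 0.0
        for case_x, control_xs in matched_sets:
            xs = [case_x] + list(control_xs)
            weights = [math.exp(beta * x) for x in xs]
            total = sum(weights)
            mean = sum(w * x for w, x in zip(weights, xs)) / total
            mean_sq = sum(w * x * x for w, x in zip(weights, xs)) / total
            gradient += case_x - mean          # score contribution of the set
            hessian -= mean_sq - mean ** 2     # minus the within-set variance
        if hessian == 0.0:                     # no informative sets
            break
        beta -= gradient / hessian             # Newton-Raphson update
    return beta


# 1:1 example: 40 pairs with exposed case/unexposed control,
# 10 pairs with unexposed case/exposed control, 50 concordant pairs
pairs = [(1, [0])] * 40 + [(0, [1])] * 10 + [(1, [1])] * 25 + [(0, [0])] * 25
print(clogit_fit(pairs), math.log(40 / 10))
```

In the 1:1 example, the concordant pairs contribute nothing, and the estimate coincides with the log of the ratio of discordant pairs, log(40/10).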

Missing-indicator method. With the missing-indicator method, we employed conditional logistic regression by introducing a new indicator variable Z into the regression. The indicator was set to 1 if exposure information was missing and to 0 otherwise. Missing exposure values were then replaced with 0 in the regression model (2). Estimation was performed using conditional logistic regression and the commands noted above, except that the missing-indicator variable Z was included in the regression model.
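The recoding step of the missing-indicator method can be sketched as follows. The function name is hypothetical, and `None` is an assumed marker for a missing exposure value.

```python
def missing_indicator_code(exposure):
    """Map an exposure value (1, 0, or None for missing) to (X, Z).

    Z is 1 when the exposure is missing and 0 otherwise;
    a missing exposure is replaced with X = 0.
    """
    if exposure is None:
        return 0, 1
    return exposure, 0


# One matched pair: exposed case, control with a missing exposure value.
case_row = missing_indicator_code(1)        # X = 1, Z = 0
control_row = missing_indicator_code(None)  # X = 0, Z = 1
print(case_row, control_row)
```

With this coding the pair stays in the model with X and Z as covariates, whereas the complete-pair analysis would drop it.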

We generated 2,000 matched pairs in each simulation to achieve stable statistical estimation. We summarized mean estimates (regression coefficients), standard deviations, and empirical 95 percent confidence interval coverage from 1,000 simulations. All analyses were conducted using Stata 7 (5).
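Empirical confidence interval coverage over repeated simulations can be computed as in the sketch below. It assumes Wald-type intervals (estimate ± 1.96 standard errors), which the text does not state explicitly, and the function names are hypothetical.

```python
import math


def wald_interval(estimate, std_error, z=1.96):
    """Assumed Wald 95 percent confidence interval for one simulation."""
    return estimate - z * std_error, estimate + z * std_error


def empirical_coverage(estimates, std_errors, true_value, z=1.96):
    """Share of simulations whose interval contains the true coefficient."""
    covered = 0
    for estimate, std_error in zip(estimates, std_errors):
        lower, upper = wald_interval(estimate, std_error, z)
        if lower <= true_value <= upper:
            covered += 1
    return covered / len(estimates)


# Toy input: three simulated estimates of a true log odds ratio of log(4)
print(empirical_coverage([1.3, 1.5, 2.0], [0.1, 0.1, 0.1], math.log(4)))
```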

Power and efficiency analysis
Empirical power was defined as the probability that zero was excluded from the 95 percent confidence interval for the regression coefficient of exposure (6). Relative power was defined as the ratio of the average empirical power from conditional logistic regression with missing values, or from the missing-indicator method, to that from conditional logistic regression without missing values. Asymptotic relative efficiency was defined as the ratio of the variance of the exposure coefficient estimate from conditional logistic regression with missing values, or from the missing-indicator method, to the variance of the exposure coefficient estimate from conditional logistic regression without missing values (7); values above 1 therefore indicate lower efficiency.
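These definitions translate into short summary functions. The sketch below assumes Wald intervals for the power calculation and uses the empirical variance of the simulated coefficients for the efficiency ratio; the text does not say whether empirical or model-based variances were used, so that choice is an assumption, as are the function names.

```python
import statistics


def empirical_power(estimates, std_errors, z=1.96):
    """Share of simulations whose 95 percent interval excludes zero."""
    rejected = sum(1 for b, se in zip(estimates, std_errors) if abs(b) > z * se)
    return rejected / len(estimates)


def relative_power(power_method, power_complete):
    """Power of a method on data with missing values, relative to
    conditional logistic regression on the complete data."""
    return power_method / power_complete


def relative_efficiency(estimates_method, estimates_complete):
    """Variance ratio; values above 1 mean the method is less efficient
    than conditional logistic regression on the complete data."""
    return (statistics.variance(estimates_method)
            / statistics.variance(estimates_complete))


print(empirical_power([1.0, 0.1], [0.2, 0.2]))                  # 0.5
print(relative_efficiency([1.0, 2.0, 3.0], [1.0, 1.5, 2.0]))    # 4.0
```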

We chose the number of pairs in each simulation to achieve an empirical power of 0.950–0.999 for conditional logistic regression analyses of data without missing values. We generated data for each analysis by performing 1,000 repetitions. However, results from the 20 repetitions that had the highest standard errors according to the conditional logistic regression without missing values were excluded from the summary.

Application to a low birth weight data set
The low birth weight data introduced by Hosmer and Lemeshow (8) are a 1:1 age-matched case-control data set with a total of 56 pairs of mothers. This data set was based on low birth weight data originally collected in 1986 at Baystate Medical Center, Springfield, Massachusetts. The data set comprises maternal covariates, including the mother’s history of previous preterm delivery, smoking status, race, the presence of uterine irritability, and hypertension. The primary infant outcome was low birth weight (yes/no).

For this study, we augmented the data set 10 times to generate 560 pairs in order to obtain a sufficient sample size. We constructed two models including either one variable (smoking) or two variables (smoking and preterm delivery) to investigate the performance of the two methods in the one- and two-variable settings. Both smoking and preterm delivery were dichotomized. The smoking variable (for both smokers and nonsmokers) was randomly deleted with a probability of 0.1, 0.2, or 0.3 under the completely-at-random missing exposure scenario, or smokers were 50 percent more likely to have missing values under the exposure-dependent missing-value scenario. We repeated the analysis 1,000 times to obtain mean estimates, standard deviations, and coverage of the 95 percent confidence intervals in Stata 7.


    RESULTS
Monte Carlo simulation for 1:1 matched studies
The results of conditional logistic regression analysis for the complete 1:1 matched data indicated that the simulation designs were valid. The mean estimated regression coefficients (log odds ratio) were in accord with the true log odds ratio and 95 percent confidence interval coverage within the range of (0.936, 0.964), which was similar to the nominal level for 95 percent confidence intervals over 1,000 simulations.

Figure 1 and table 2 show the results of regression analysis in which the log odds ratio was 1.39 (equivalent to an odds ratio of 4) and the proportion of exposed cases equaled 0.5. The four parts of figure 1 show mean estimates and standard deviations of the estimates for the four different missing-value scenarios. The estimates from the first three missing-value scenarios (completely-at-random, case-dependent, and exposure-dependent) were similar, but all differed from the estimates from the fourth scenario, where the missing values were case/exposure-dependent. In the first three missing-value scenarios, the estimates from conditional logistic regression were unbiased, and the empirical 95 percent confidence interval coverage was similar to the nominal coverage. This finding held irrespective of the presence of confounding effects. However, the estimates obtained from the missing-indicator method varied, depending on the presence of confounding effects. Negative confounding biased the estimates from the missing-indicator method toward zero, whereas positive confounding biased the estimates away from zero. The confidence interval coverage from the missing-indicator method ranged from 86 percent to 93 percent with negative confounding and from 89 percent to 96 percent with positive confounding.



FIGURE 1. Regression coefficients and standard deviations (SDs) obtained using conditional logistic regression (C) and the missing-indicator method (M) with a log odds ratio of 1.39 (odds ratio = 4) and a case exposure proportion of 0.5. Solid and dotted lines represent results from 1:1 and 1:4 matched studies, respectively. The symbols o, C, x, and I indicate missing exposure proportions of 0.0, 0.1, 0.2, and 0.3, respectively. Parts a, b, c, and d represent different missing-value scenarios (cases, exposed subjects, and exposed cases had 50% lower missing proportions than indicated in parts b, c, and d, respectively).

 

TABLE 2. Confidence interval coverage for the conditional logistic regression and missing-indicator methods under different missing-value scenarios and different proportions of missing data (odds ratio = 4 and case exposure proportion = 0.5)
 
In the case/exposure-dependent missing-value scenario, in which exposed cases were 50 percent less likely to have missing values than unexposed cases and controls, matched pairs with exposed cases and unexposed controls were more likely to be retained in the analysis. Thus, the estimates were expected to be higher than the true values. Without confounding, the results from the two regression methods were consistent with this expectation and had similar magnitudes of bias. With confounding, the estimates from the two methods deviated away from zero. The confidence interval coverage from the two methods ranged widely, from 8 percent to 90 percent.
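The expected direction and rough size of this bias for the complete-pair analysis of 1:1 data can be worked out directly. The sketch below is a large-sample approximation under stated assumptions, not a result from the paper: exposed cases are missing with probability 0.5 times the reference proportion, all other subjects with the reference proportion, and missingness is independent across subjects. The function name is hypothetical.

```python
import math


def expected_complete_pair_odds_ratio(true_or, base_missing, factor=0.5):
    """Large-sample odds ratio from complete 1:1 pairs when exposed cases
    are less likely to have missing exposure values than everyone else."""
    keep_exposed_case = 1 - factor * base_missing
    keep_other = 1 - base_missing
    # Numerator pairs: exposed case, unexposed control
    keep_numerator = keep_exposed_case * keep_other
    # Denominator pairs: unexposed case, exposed control
    keep_denominator = keep_other * keep_other
    return true_or * keep_numerator / keep_denominator


for m in (0.1, 0.2, 0.3):
    biased = expected_complete_pair_odds_ratio(4.0, m)
    print(m, round(biased, 3), round(math.log(biased) - math.log(4.0), 3))
```

Under these assumptions the retained-pair odds ratio exceeds the true value of 4 for every reference missing proportion, and the gap widens as that proportion grows.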

The relative empirical power and asymptotic relative efficiency are presented in table 3. In each scenario, an increased missing proportion was associated with a stable or slightly decreased relative power and an elevated asymptotic relative efficiency (i.e., decreased efficiency). The missing-indicator method was slightly more powerful and efficient than conditional logistic regression.


TABLE 3. Relative empirical power/asymptotic relative efficiency for the conditional logistic regression and missing-indicator methods under different missing-value scenarios and different proportions of missing data (odds ratio = 4 and case exposure proportion = 0.5)
 
Monte Carlo simulation for 1:4 matched studies
The results obtained using the Mantel-Haenszel method for the complete matched data with multiple controls per case confirmed the validity of the simulation design. With the first three missing-value scenarios, the results from 1:4 matched studies were compared with those from 1:1 matched studies. When no confounding existed, the estimates and the confidence interval coverage were similar to those from 1:1 matched studies. Once a confounding effect was present, the estimates from conditional logistic regression for 1:4 matched studies differed slightly from those for 1:1 matched studies. The difference occurred with and without missing values. Likewise, all empirical confidence interval coverage was slightly lower than the nominal coverage. Although the estimates from the missing-indicator method for 1:4 matched studies showed a pattern similar to that from 1:1 matched studies, they deviated more from the true values, and the empirical confidence interval coverage was significantly lower than the nominal coverage.

When missing values were case/exposure-dependent and there was no confounding effect, estimates from the two methods for 1:4 matched studies were similar to those from 1:1 matched studies. In the presence of negative or positive confounding, the estimates from the two methods for 1:4 matched studies were smaller or larger than those for 1:1 matched case-control studies, and the confidence interval coverage ranged widely from 0 percent to 96 percent.

As the missing proportion increased, the relative empirical power and asymptotic relative efficiency showed trends similar to those in the 1:1 matched studies.

In both 1:1 and 1:4 matched studies, as the proportions of missing values increased, the standard deviations increased and the confidence interval coverage remained essentially unchanged. Exceptions were found for the two methods under the case/exposure-dependent missing-value scenario and for the missing-indicator method in the presence of confounding: In these settings, the confidence interval coverage decreased. The standard deviations from the missing-indicator method were slightly lower than those from conditional logistic regression. We examined changes in estimates and standard deviations when the proportion of cases being exposed was 0.3, 0.5, or 0.7 and the odds ratio was 2, 4, or 6 and found a similar pattern in each setting. However, given a fixed odds ratio, as the case exposure proportion increased, biases, if they existed, increased but standard deviations decreased, and the confidence interval coverage decreased except for conditional logistic regression or for scenarios without confounding. As the odds ratio increased, there was no significant change in bias/coefficient ratios or confidence interval coverage, though standard deviations increased.

Application to a low birth weight data set
The first panel in figure 2 shows results for the one-variable model with the smoking covariate only. The first mean estimate and standard deviation in this panel, which are considered the true values, are from the conditional logistic regression analysis of the full data set.



FIGURE 2. Regression coefficients and standard deviations (SDs) for smoking and preterm delivery (PTD) obtained using conditional logistic regression (C) and the missing-indicator method (M) for low birth weight data. The symbols o, C, x, and I indicate missing exposure proportions of 0.0, 0.1, 0.2, and 0.3, respectively, for both smokers and nonsmokers under the completely-at-random missingness scenario (solid lines) or for nonsmokers under the exposure-dependent missingness scenario in which smokers were 50% more likely to have missing values (dotted lines).

 
The estimates from the conditional logistic regression analysis of data with missing values were unbiased. The missing-indicator method yielded estimates that were close to the true value and had smaller standard deviations. In this model, all 95 percent confidence interval coverage was above 99 percent, which was higher than the nominal coverage, because the augmented data set lacked the variability in data generation usually observed in conventional Monte Carlo simulations.

The second panel in figure 2 presents the estimates and standard deviations obtained with smoking and preterm delivery in a two-covariate regression model. The first mean estimates in the two subpanels, derived from conditional logistic regression for the complete data set, are the true values for the effects of smoking and preterm delivery, respectively. The estimates from the missing-indicator method were close to the unbiased estimates from conditional logistic regression. In addition, the 95 percent confidence interval coverage from conditional logistic regression and the missing-indicator method was 95 percent or more, higher than the nominal coverage for the reason given above. Similar trends in estimation precision held for both the effect of smoking (the variable with missing values) and the effect of preterm delivery (the variable without missing values).


    DISCUSSION
In 1:1 matched studies with one exposure variable of interest, the estimates obtained by means of the Mantel-Haenszel method and the conditional logistic regression method are interchangeable. When the number of matched controls exceeds 1, however, these two methods no longer produce identical estimates, though the estimates can be quite close (9). As we have shown in the present simulations, the estimates from conditional logistic regression for matched studies with multiple matched controls per case were similar to those obtained from 1:1 matched studies, if there was no confounding. When there was confounding, the results for studies with one matched control per case and studies with multiple matched controls per case were the same if the Mantel-Haenszel method was used, but they differed if conditional logistic regression was applied. This finding of discrepancies between conditional logistic regression and the Mantel-Haenszel method is consistent with previous reports (9, 10).

We evaluated the performance of the two methods—conditional logistic regression and the missing-indicator method—for the analysis of data with missing exposure values in 1:m matched case-control studies. The results indicated that the precision of estimation depended on the underlying missing-value scenarios and on whether there was confounding between exposure and disease. It is well known that in 1:1 matched case-control studies, the odds ratio is estimated from the number of matched pairs of exposed cases and unexposed controls divided by the number of matched pairs of unexposed cases and exposed controls. When missing values were completely-at-random, case-dependent, or exposure-dependent, as designed in this study, a comparable proportion of pairs would be dropped from both the numerator and the denominator in the estimation of the odds ratio. Thus, the estimates should be the same as the true odds ratios. Under these three scenarios, data with missing values still represented the complete data. Consequently, conditional logistic regression generated unbiased estimates with empirical confidence interval coverage comparable to the nominal coverage.
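The cancellation argument above can be written out for 1:1 pairs. The notation is introduced here for illustration and is not the authors': n_10 and n_01 are the counts of discordant pairs (exposed case with unexposed control, and unexposed case with exposed control), P_10 and P_01 are the corresponding pair probabilities in the complete data, and r_case(e) and r_ctrl(e) are the probabilities that a case or control with exposure e has an observed exposure value. Missingness is taken to be independent across subjects, as in the simulation design.

```latex
\[
\widehat{\mathrm{OR}} = \frac{n_{10}}{n_{01}},
\qquad
\frac{E\!\left[n_{10}^{\mathrm{obs}}\right]}{E\!\left[n_{01}^{\mathrm{obs}}\right]}
= \frac{P_{10}\, r_{\mathrm{case}}(1)\, r_{\mathrm{ctrl}}(0)}
       {P_{01}\, r_{\mathrm{case}}(0)\, r_{\mathrm{ctrl}}(1)}
= \mathrm{OR} \times
  \frac{r_{\mathrm{case}}(1)\, r_{\mathrm{ctrl}}(0)}
       {r_{\mathrm{case}}(0)\, r_{\mathrm{ctrl}}(1)}
\]

\begin{align*}
\text{completely at random: } & r_{\mathrm{case}}(e) = r_{\mathrm{ctrl}}(e) = r
  && \Rightarrow \text{factor} = 1 \\
\text{case-dependent: } & r_{\mathrm{case}}(1) = r_{\mathrm{case}}(0),\;
  r_{\mathrm{ctrl}}(1) = r_{\mathrm{ctrl}}(0)
  && \Rightarrow \text{factor} = 1 \\
\text{exposure-dependent: } & r_{\mathrm{case}}(e) = r_{\mathrm{ctrl}}(e) = r(e)
  && \Rightarrow \text{factor} = \frac{r(1)\, r(0)}{r(0)\, r(1)} = 1 \\
\text{case/exposure-dependent: } & r_{\mathrm{case}}(1) > r_{\mathrm{case}}(0)
  = r_{\mathrm{ctrl}}(0) = r_{\mathrm{ctrl}}(1)
  && \Rightarrow \text{factor} = \frac{r_{\mathrm{case}}(1)}{r_{\mathrm{case}}(0)} > 1
\end{align*}
```

In the first three scenarios the retention probabilities cancel, so the complete pairs estimate the same odds ratio as the full data; in the fourth they do not, which is consistent with the upward bias reported for that scenario.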

In the absence of confounding, the exposure status of matched cases and controls was completely independent. Consequently, the missing-indicator method also produced unbiased estimates. However, if there was any confounding in the matching process such that the exposure status for matched cases and controls was no longer independent, the estimates from the missing-indicator method were slightly biased and the confidence interval coverage was slightly lower than the nominal level, though the method had slightly greater statistical power and a lower asymptotic relative efficiency ratio (i.e., higher efficiency). It appeared that the missing-indicator method did not eliminate the confounding effect completely. It is noteworthy that biases produced by the missing-indicator method were relatively small in view of the fact that considerable confounding and up to 30 percent missing exposure values were assumed in our simulations.

In addition, we extended the missing-indicator method to a two-variable setting, with one of the two variables having missing values. By applying the two methods to an augmented data set, we found that estimates obtained from the two methods were similar for variables with and without missing values. However, whether the missing-indicator method provides an appropriate approach for handling missing exposure values depends on the investigator’s willingness to accept a balance between the gains of estimating efficiency and the losses of estimating accuracy.

The sample for our analysis included 2,000 pairs in the simulations and 560 pairs in the augmented data analysis. Although the performance of the missing-indicator method was only slightly poorer than that of conditional logistic regression in such large samples, the results might not hold in a smaller data set. We found that even conditional logistic regression produced significantly biased estimates when the sample size was reduced to 100 pairs of cases and controls (not shown). Consequently, the missing-indicator method would be expected to produce more biased estimates in studies with small sample sizes. More importantly, when missing values were case/exposure-dependent, neither of the two methods could appropriately handle the missing values. Therefore, we suggest that more sophisticated techniques, such as the multiple imputation method (11, 12) or the semiparametric inference approach introduced by Rathouz et al. (13), be considered.

In this study, we examined the performance of the missing-indicator method in handling missing exposure values in matched case-control studies. We compared the estimates and confidence interval coverage computed from the missing-indicator method with those obtained from conditional logistic regression. We conclude that conditional logistic regression provided a slight advantage for bias and coverage probability, at the cost of slightly reduced statistical power and efficiency. The missing-indicator method was slightly less reliable than conditional logistic regression, and the performance of this method was affected by confounding and missing-value scenarios. Therefore, the missing-indicator method should be used cautiously.


    NOTES
 
Correspondence to Dr. Xianbin Li, Department of Population and Family Health Sciences, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205 (e-mail: xli@jhsph.edu).


    REFERENCES

  1. Gordis L. Epidemiology. Philadelphia, PA: W B Saunders Company, 1996.
  2. Huberman M, Langholz B. Application of the missing-indicator method in matched case-control studies with incomplete data. Am J Epidemiol 1999;150:1340–5.
  3. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol 1995;142:1255–64.
  4. Breslow NE, Day NE, eds. Statistical methods in cancer research. Vol 1. The analysis of case-control studies. (IARC Scientific Publication no. 32). Lyon, France: International Agency for Research on Cancer, 1980.
  5. Stata Corporation. Stata statistical software, release 6.0. College Station, TX: Stata Corporation, 1999.
  6. Cepeda MS, Boston R, Farrar JT, et al. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. Am J Epidemiol 2003;158:280–7.
  7. Knight K. Mathematical statistics. Boca Raton, FL: CRC Press, 1999.
  8. Hosmer DW Jr, Lemeshow S. Applied logistic regression. New York, NY: John Wiley and Sons, Inc, 1989.
  9. Stata Press. Stata reference manual, release 6. Vol 1. College Station, TX: Stata Press, 1999.
  10. Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic research: principles and quantitative methods. New York, NY: Van Nostrand Reinhold Company, 1982.
  11. Rubin DB. Multiple imputation for nonresponse in surveys. New York, NY: John Wiley and Sons, Inc, 1987.
  12. Patrician PA. Multiple imputation for missing data. Res Nurs Health 2002;25:75–84.
  13. Rathouz PJ, Satten GA, Carroll J. Semiparametric inference in matched case-control studies with missing covariate data. Biometrika 2002;89:905–16.



