Estimating Crude or Common Odds Ratios in Case-Control Studies with Informatively Missing Exposure Data
Robert H. Lyles and
Andrew S. Allen
From the Department of Biostatistics, The Rollins School of Public Health, Emory University, Atlanta, GA.
ABSTRACT
In case-control studies, the crude odds ratio derived from a 2 x 2 table and the common odds ratio adjusted for stratification variables are staple measures of exposure-disease association. While missing exposure data are encountered in the majority of such studies, formal attempts to deal with them are rare, and a complete-case analysis is the norm. Furthermore, the probability that exposure is missing may depend on true exposure status, so the missing-at-random assumption is often unreasonable. In this paper, the authors present an adjustment to the usual product binomial likelihood to properly account for missing data. Estimation of model parameters without restrictive assumptions requires an additional data collection effort akin to a validation study. Closed-form results are provided to facilitate point and confidence interval estimation of crude and common odds ratios after properly accounting for informatively missing data. Simulations assess performance of the likelihood-based estimates and inferences, and they display the potential for bias in complete-case analyses. An example is presented to illustrate the approach.
adjustment; bias; delta method; nonignorable nonresponse; odds ratio
Abbreviations:
MAR, missing at random; MLE, maximum likelihood estimator; OR, odds ratio; SE, standard error
INTRODUCTION
In case-control studies, epidemiologists generally encounter missing data because they are unable to obtain information on some subjects by using the chosen data collection instrument. However, most serious discussions of methods for dealing with missing data remain confined to the statistics literature. Exceptions include a review of methods for handling missing covariate information in regression analysis (1) and a study comparing approaches for estimating a common odds ratio when some values of a confounding variable are missing (2). Nevertheless, in crude and adjusted epidemiologic analyses, it is common practice to ignore missing data and to report results based solely on the observed (complete-case) data.
In many statistical treatments of missing data, an assumption is made that data are missing at random (MAR) (3). The implication is that the probability of a missing value does not depend on any missing data, although it may arbitrarily depend on observed data. While this assumption often helps facilitate a formal analytical treatment, it can be invalid in epidemiologic settings (1). In particular, the probability that exposure is missing may depend on whether a person was exposed, in which case missingness is nonignorable and adopting the MAR assumption can lead to bias. Therefore, generality with regard to the assumed missing data mechanism is beneficial when such generality is supported by feasible study design and analytical procedures.
In the non-MAR situation upon which we focus, parameters underlying the missing data mechanism are inestimable based on the observed data alone. Statistical papers exploring analysis subject to nonignorable missingness often deal with this type of inestimability by making assumptions about the missing data process (4–7). A drawback here is that goodness-of-fit tests to assess models for the missing data mechanism can be inconclusive (3), and sensitivity analysis might be the only recourse (8).
In this paper, we impose relatively few assumptions at the analysis stage, and our approach relates more to prior literature considering two-stage study designs (9) and surveys with multiple waves of respondents (10). Estimability of parameters relies on supplemental data similar to those obtained in validation studies to facilitate correction for misclassification, although here we assume that data can be missing but not misclassified. To emphasize the subtle distinction, we use the term reassessment study. This term implies that, of those persons for whom data are missing, a sample is chosen at random; then, every effort is made to acquire the information by using an assessment method that encourages a high response rate. For example, if exposure was first obtained by questionnaire, nonrespondents chosen for reassessment, or their relatives, might be contacted by telephone to obtain the information and/or incentives for response might be offered. Note that, while validation samples should represent a target population, reassessment samples are drawn from only those for whom data were originally missing.
For simplicity, we assume that the investigator is ultimately successful in acquiring the true data for subjects selected for reassessment, although our approach remains valid if data are MAR subsequent to reassessment (we revisit this point in the Conclusions section). Since the case-control study involves sampling by disease status, we assume that only exposure information is missing. While the method described also applies to estimating relative risks and odds ratios in cohort studies with missing disease data and a fixed follow-up period for all subjects (i.e., cumulative incidence studies), obtaining near-complete reassessment information may be more problematic in that context because of losses to follow-up.
PRELIMINARIES
We focus on likelihood-based estimation and inference, but our assumptions regarding the underlying data and the missing data mechanism are natural, and model misspecification is of little concern. We use E (= 1 if exposed, 0 if not) to denote exposure status and D (= 1 if diseased, 0 if not) for disease status. For case-control designs, the natural model is a product binomial one: we have two independent samples that separately provide estimates of the underlying binomial proportions $\pi_1 = \Pr(E = 1 \mid D = 1)$ and $\pi_0 = \Pr(E = 1 \mid D = 0)$. Independence of the two samples essentially reduces the analysis to a pair of one-sample problems, with the estimated odds ratio being a function of the separate exposure probability estimates.
Consider the familiar 2 x 2 data layout in table 1; for now, assume that the data are complete. In the case-control setting, we are concerned with $\hat{\pi}_1 = n_{11}/n_{d1}$ and $\hat{\pi}_0 = n_{01}/n_{d0}$. In addition to being intuitive estimators, these are maximum likelihood estimators (MLEs) under the product binomial model, and the MLE for the odds ratio (OR) follows as $\widehat{OR} = \hat{\pi}_1(1 - \hat{\pi}_0)/[\hat{\pi}_0(1 - \hat{\pi}_1)]$. To form a confidence interval, one typically computes $\widehat{\mathrm{Var}}[\ln(\widehat{OR})] = n_{11}^{-1} + n_{10}^{-1} + n_{01}^{-1} + n_{00}^{-1}$, which is derived by using the delta method (11) and reflects the well-known Woolf's method (12).
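As a minimal sketch of the complete-data calculation just described, the following Python function computes the crude odds ratio and its Woolf confidence interval. The function name `woolf_ci` and the argument order are illustrative choices, and the counts are assumed to be indexed as n_de (first subscript disease status, second subscript exposure status). The counts in the usage line are the observed table 4 cell counts quoted in the Example section.

```python
import math

def woolf_ci(n11, n10, n01, n00, z=1.96):
    """Crude odds ratio with Woolf confidence limits from complete 2 x 2 counts.

    Assumed indexing: n11/n10 are exposed/unexposed cases,
    n01/n00 are exposed/unexposed controls.
    """
    pi1_hat = n11 / (n11 + n10)          # Pr(E = 1 | D = 1)
    pi0_hat = n01 / (n01 + n00)          # Pr(E = 1 | D = 0)
    or_hat = pi1_hat * (1 - pi0_hat) / (pi0_hat * (1 - pi1_hat))
    # Delta-method variance of ln(OR): sum of reciprocal cell counts
    se_log_or = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    lower = math.exp(math.log(or_hat) - z * se_log_or)
    upper = math.exp(math.log(or_hat) + z * se_log_or)
    return or_hat, lower, upper

or_hat, lower, upper = woolf_ci(683, 2537, 1498, 8747)
print(round(or_hat, 2), round(lower, 2), round(upper, 2))
```

With these counts the result should agree with the complete-case estimate and interval reported in the Example section.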
If data are missing, the standard estimators $\hat{\pi}_1$ and $\hat{\pi}_0$ given above may not be the MLEs and, more importantly, may no longer be unbiased in large samples. Assume that we have the correct MLEs ($\hat{\pi}_1$, $\hat{\pi}_0$) and their estimated variances. Then, by the delta method,

$$\widehat{\mathrm{Var}}[\ln(\widehat{OR})] = [\hat{\pi}_1(1 - \hat{\pi}_1)]^{-2}\,\widehat{\mathrm{Var}}(\hat{\pi}_1) + [\hat{\pi}_0(1 - \hat{\pi}_0)]^{-2}\,\widehat{\mathrm{Var}}(\hat{\pi}_0),$$

where the first and second terms represent the approximate variance of the log-odds estimator for diseased and nondiseased subjects, respectively. In the absence of missing data, this formula reduces to the standard variance estimate given previously. As usual, we assume that sample sizes are large enough to ensure approximate normality for $\ln(\widehat{OR})$, after which upper and lower limits of a confidence interval may be computed in the usual way with exponentiation applied to obtain the interval for the odds ratio. In scenario 2 below, we describe how to obtain MLEs $\hat{\pi}_1$ and $\hat{\pi}_0$ and their estimated variances in the presence of nonignorably missing data.
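The general delta-method formula can be sketched in a few lines of Python. The function names below are hypothetical; the inputs are any pair of exposure-probability estimates with their estimated variances. The check at the end illustrates the reduction to Woolf's variance when the binomial variance estimates $\hat{\pi}(1-\hat{\pi})/n$ are supplied, using the observed table 4 counts from the Example section.

```python
import math

def log_or_variance(pi1, var1, pi0, var0):
    """Delta-method variance of ln(OR) from two independent proportion estimates."""
    return var1 / (pi1 * (1 - pi1)) ** 2 + var0 / (pi0 * (1 - pi0)) ** 2

def or_with_ci(pi1, var1, pi0, var0, z=1.96):
    """Odds ratio and approximate confidence limits (computed on the log scale)."""
    or_hat = pi1 * (1 - pi0) / (pi0 * (1 - pi1))
    half_width = z * math.sqrt(log_or_variance(pi1, var1, pi0, var0))
    return (or_hat,
            math.exp(math.log(or_hat) - half_width),
            math.exp(math.log(or_hat) + half_width))

# Complete-data check: binomial variances reproduce Woolf's sum of reciprocals.
n11, n10, n01, n00 = 683, 2537, 1498, 8747
p1, p0 = n11 / (n11 + n10), n01 / (n01 + n00)
v1, v0 = p1 * (1 - p1) / (n11 + n10), p0 * (1 - p0) / (n01 + n00)
woolf_var = 1 / n11 + 1 / n10 + 1 / n01 + 1 / n00
print(abs(log_or_variance(p1, v1, p0, v0) - woolf_var) < 1e-12)
```

The same `or_with_ci` call applies unchanged once the complete-data inputs are swapped for MLEs and variances that account for missing exposure data.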
SCENARIOS
Scenario 1: MAR exposure data
We first describe a simple hypothetical situation. Suppose that exposure (E) is missing for some persons in a case-control study, and the probability that E is missing depends on disease status (D) but not on E. Technically, the data are MAR. However, note that, conditional on D, which is the level of analysis, missingness of E does not depend on the data in any way. Thus, the usual sample proportions $\hat{\pi}_1$ and $\hat{\pi}_0$ based on the observed data are MLEs, as is the complete-case estimate of the odds ratio.
Scenario 2: non-MAR exposure data with reassessment
Now consider another case-control study in which E is missing for some subjects. This time, let Pr(E missing) depend on both D and E via the four additional parameters $p_{mde} = \Pr(E \text{ missing} \mid D = d, E = e)$ ($d, e = 0, 1$). We want to estimate $\pi_1 = \Pr(E = 1 \mid D = 1)$ and $\pi_0 = \Pr(E = 1 \mid D = 0)$, but since the same estimation procedure is used separately for the two independent samples, we focus on the one-sample problem of obtaining $\hat{\pi}_1$ by using exposure data from the diseased subjects. Therefore, assume we have $n$ subjects with D = 1, and suppose that $n_1$ of them are found to be exposed, $n_0$ are found to be unexposed, and $n_m$ of them have E missing. The log-likelihood for the observed data is

$$l(\pi_1, p_{m11}, p_{m10}) = n_1 \ln[(1 - p_{m11})\pi_1] + n_0 \ln[(1 - p_{m10})(1 - \pi_1)] + n_m \ln[p_{m11}\pi_1 + p_{m10}(1 - \pi_1)]. \tag{1}$$
However, the parameters $p_{m11}$ and $p_{m10}$ are not estimable in the original sample, and it is unrealistic to assume that their values are known. Thus, estimation requires prior information about these two probabilities, additional assumptions, or supplemental data.
If estimates of $p_{m11}$ and $p_{m10}$ and their standard errors are available from a prior study involving the same exposure variable, data collection method, and study population, then one could apply pseudo-likelihood methodology (13, 14). Using this technique, one would maximize equation 1 with respect to $\pi_1$ after inserting estimates in place of the unknown $p_{m11}$ and $p_{m10}$ and then adjust the standard error of the estimate of $\pi_1$ for uncertainty in the inserted estimates.
More commonly, prior estimates are unavailable, and we are reluctant to impose strong assumptions about unknown parameters. Thus, we focus on a study-design-based strategy whereby we obtain additional data to facilitate estimation of pm11 and pm10, and we construct the joint log-likelihood for the original and additional data. Our proposed design requires randomly choosing a sample for reassessment from among the nm subjects for whom exposure data are missing by applying a fixed selection probability pr to each subject. On average, this process generates a reassessment sample of size nr = nm x pr. An acceptable alternative approach is to select the value nr and assume that pr = nr/nm.
This design gives rise to five distinct types of likelihood contributions, specifically, 1) $n_1$ subjects for whom E is originally observed and E = 1; 2) $n_0$ subjects for whom E is observed and E = 0; 3) $n_{r1}$ subjects for whom E is originally missing but subsequently verified as E = 1 by reassessment; 4) $n_{r0}$ subjects for whom E is missing but who are reassessed and found to have E = 0; and 5) $n_{m*}$ subjects for whom E is missing and who are not selected for reassessment. To construct the log-likelihood, we determine each type of contribution; for example, type 1 subjects contribute $p_{n1}$ = Pr(E observed and E = 1 | D = 1) = Pr(E observed | D = 1, E = 1) x Pr(E = 1 | D = 1) = $(1 - p_{m11}) \times \pi_1$. Similarly, those of types 2–5 contribute $p_{n0} = (1 - p_{m10}) \times (1 - \pi_1)$, $p_{nr1} = p_r \times p_{m11} \times \pi_1$, $p_{nr0} = p_r \times p_{m10} \times (1 - \pi_1)$, and $p_{nm*} = (1 - p_r) \times [p_{m11} \times \pi_1 + p_{m10} \times (1 - \pi_1)]$, respectively. If any combinatoric terms are ignored, the log-likelihood follows as

$$l(\pi_1, p_{m11}, p_{m10}) = n_1 \ln p_{n1} + n_0 \ln p_{n0} + n_{r1} \ln p_{nr1} + n_{r0} \ln p_{nr0} + n_{m*} \ln p_{nm*}. \tag{2}$$
Closed-form expressions are available for the corresponding MLEs under equation 2. In particular, we obtain $\hat{p}_{m11} = n_{r1}(n_1 n_r/n_m + n_{r1})^{-1}$ and $\hat{p}_{m10} = n_{r0}(n_0 n_r/n_m + n_{r0})^{-1}$. The MLE for $\pi_1$ is a solution to a quadratic equation (details omitted) and reduces to the intuitive form

$$\hat{\pi}_1 = \frac{n_1 + n_m[n_{r1}/(n_{r1} + n_{r0})]}{n}. \tag{3}$$
Thus, the MLE effectively replaces the unknown number of originally missing values for which E = 1 by a logical estimate based on the reassessment study, namely, $n_m[n_{r1}/(n_{r1} + n_{r0})]$. Such a "fill-in" estimator is familiar in problems involving contingency tables with missing data (2, 3). Equation 3 is equivalent to an estimator presented previously (10) for estimating prevalence in surveys with two waves of respondents. In contrast to equation 3, note that the complete-case estimate for $\pi_1$ is simply $n_1/(n_1 + n_0)$.
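The closed-form estimators can be checked numerically. The sketch below codes the five likelihood contributions of scenario 2, evaluates the log-likelihood, and confirms on a set of purely hypothetical counts that the fill-in estimate and the two missingness estimates are not beaten by nearby parameter values. All function names and the counts (120, 300, 80, 12, 8) are illustrative assumptions, and the reassessment probability is taken as $p_r = n_r/n_m$.

```python
import math

def loglik(pi, pm1, pm0, n1, n0, nr1, nr0, nm_star, pr):
    """Log-likelihood built from the five contribution types (constants dropped)."""
    return (n1 * math.log((1 - pm1) * pi)
            + n0 * math.log((1 - pm0) * (1 - pi))
            + nr1 * math.log(pr * pm1 * pi)
            + nr0 * math.log(pr * pm0 * (1 - pi))
            + nm_star * math.log((1 - pr) * (pm1 * pi + pm0 * (1 - pi))))

def reassessment_mle(n1, n0, nm, nr1, nr0):
    """Closed-form estimates; nm counts ALL subjects with E originally missing."""
    n, nr = n1 + n0 + nm, nr1 + nr0
    return {
        "pi": (n1 + nm * nr1 / nr) / n,           # fill-in estimator
        "pm1": nr1 / (n1 * nr / nm + nr1),        # Pr(missing | exposed)
        "pm0": nr0 / (n0 * nr / nm + nr0),        # Pr(missing | unexposed)
        "pi_complete_case": n1 / (n1 + n0),
    }

# Hypothetical one-sample data: 80 missing, 20 of them reassessed.
n1, n0, nm, nr1, nr0 = 120, 300, 80, 12, 8
est = reassessment_mle(n1, n0, nm, nr1, nr0)
pr, nm_star = (nr1 + nr0) / nm, nm - nr1 - nr0
best = loglik(est["pi"], est["pm1"], est["pm0"], n1, n0, nr1, nr0, nm_star, pr)
for dpi, d1, d0 in [(0.01, 0, 0), (-0.01, 0, 0), (0, 0.01, 0),
                    (0, -0.01, 0), (0, 0, 0.01), (0, 0, -0.01)]:
    nearby = loglik(est["pi"] + dpi, est["pm1"] + d1, est["pm0"] + d0,
                    n1, n0, nr1, nr0, nm_star, pr)
    assert nearby <= best
print(est)
```

In this hypothetical data set the fill-in estimate (0.336) differs noticeably from the complete-case proportion (about 0.286) because exposed subjects are more likely to have missing exposure.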
In the case-control study, we also maximize the equivalent of equation 2 separately by using exposure data for nondiseased subjects to obtain $\hat{\pi}_0$, $\hat{p}_{m01}$, and $\hat{p}_{m00}$. The MLE of the odds ratio follows directly once we have $\hat{\pi}_1$ and $\hat{\pi}_0$. For standard error estimates of $\hat{\pi}_1$ and $\hat{\pi}_0$ to facilitate an approximate confidence interval as described previously, we provide details for computing the expected information matrix in Appendix A.
Table 2 describes simulations illustrating potential for bias and invalid inference in complete-case analysis, as contrasted with the maximum likelihood approach. Fifteen settings were considered to cover a range of values for the crude odds ratio and for the missingness parameters [$p_{mde}$ ($d, e = 0, 1$)]. For each setting, 5,000 data sets were generated, each with 250 cases and 250 controls, missing data according to the missingness probabilities, and 25 percent reassessment samples on average (i.e., $p_r = 0.25$ regardless of disease). Means for $\hat{\pi}_1$ and $\hat{\pi}_0$, means and standard deviations of the odds ratio estimates, and mean widths and coverages of 95 percent confidence intervals are displayed under the complete-case and maximum likelihood strategies. Note that the "complete-case" approach evaluated reflects what would be done in practice, that is, it ignores the reassessment data.
With significant disparity among the four missingness probabilities, table 2 illustrates both severe overestimation (settings 1–3) and severe underestimation (settings 13–15) of the odds ratio when complete-case analysis is used, together with abysmal confidence interval performance (coverages near 10 percent). With less disparity in the probabilities (settings 4–6 and 10–12), complete-case analysis remains subject to (less extreme) bias and subnominal confidence interval coverage. Should all missingness probabilities be equal (settings 7–9), there are no problems with complete-case analysis. The maximum likelihood analysis displays minimal bias and nearly nominal confidence interval coverage throughout, and it outperforms the complete-case approach except in the special cases (settings 7–9).
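A scaled-down simulation in the same spirit can be sketched as follows. This is not a reproduction of table 2: the exposure probabilities (0.40 and 0.25, giving a true odds ratio of 2), the missingness probabilities, the 500 replicates, and the seed are all assumptions chosen for illustration; only the 250 cases, 250 controls, and 25 percent reassessment probability follow the description above.

```python
import random

def simulate_group(n, pi, pm_exposed, pm_unexposed, pr, rng):
    """Return (n1, n0, nr1, nr0, nm_star) for one disease group."""
    n1 = n0 = nr1 = nr0 = nm_star = 0
    for _ in range(n):
        exposed = rng.random() < pi
        missing = rng.random() < (pm_exposed if exposed else pm_unexposed)
        if not missing:
            if exposed:
                n1 += 1
            else:
                n0 += 1
        elif rng.random() < pr:          # reassessed: true exposure recovered
            if exposed:
                nr1 += 1
            else:
                nr0 += 1
        else:                            # missing and not reassessed
            nm_star += 1
    return n1, n0, nr1, nr0, nm_star

def odds(p):
    return p / (1.0 - p)

def estimates(counts):
    """Complete-case and fill-in exposure proportions for one group."""
    n1, n0, nr1, nr0, nm_star = counts
    nm = nr1 + nr0 + nm_star
    complete_case = n1 / (n1 + n0)
    fill_in = (n1 + nm * nr1 / (nr1 + nr0)) / (n1 + n0 + nm)
    return complete_case, fill_in

def run(reps=500, seed=2001):
    """Mean complete-case and fill-in odds ratio estimates over replicates."""
    rng = random.Random(seed)
    cc_ors, ml_ors = [], []
    for _ in range(reps):
        cases = simulate_group(250, 0.40, 0.10, 0.40, 0.25, rng)
        controls = simulate_group(250, 0.25, 0.40, 0.10, 0.25, rng)
        # Skip degenerate replicates (empty observed cell or no reassessments).
        if any(g[0] == 0 or g[1] == 0 or g[2] + g[3] == 0
               for g in (cases, controls)):
            continue
        cc1, ml1 = estimates(cases)
        cc0, ml0 = estimates(controls)
        cc_ors.append(odds(cc1) / odds(cc0))
        ml_ors.append(odds(ml1) / odds(ml0))
    return sum(cc_ors) / len(cc_ors), sum(ml_ors) / len(ml_ors)

if __name__ == "__main__":
    cc_mean, ml_mean = run()
    print("complete-case mean OR:", round(cc_mean, 2))
    print("fill-in MLE mean OR:  ", round(ml_mean, 2))
```

Under these assumed settings the complete-case estimates should center far above the true value of 2, while the fill-in estimates should center close to it.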
Scenario 3: estimating a common odds ratio across strata with non-MAR exposure data
Here we extend the previous considerations to allow adjustment for confounding variables via stratification (i.e., a Mantel-Haenszel-type approach). Specifically, suppose there are S strata defined by such variables and we want to estimate a common odds ratio subject to missing exposure data. In the event of MAR exposure data (i.e., if scenario 1 applies within all strata), then the usual Mantel-Haenszel common odds ratio estimate and its confidence interval (12) are valid. In the non-MAR case, missingness probabilities may vary by disease and exposure status, as before, as well as across strata. We adjust our prior notation to accommodate these features by using the subscript s to identify the stratum. The five parameters specific to the sth stratum are $p_{smde} = \Pr(E \text{ missing} \mid D = d, E = e, \text{stratum} = s)$ and $\pi_{s1} = \Pr(E = 1 \mid D = 1, \text{stratum} = s)$ ($d, e = 0, 1$; $s = 1, \ldots, S$), while the parameter of main interest is $\beta$ (the common log odds ratio). Since the investigator may vary reassessment probabilities across strata, we denote these as $p_{rs}$ ($s = 1, \ldots, S$).
For stratum s, 10 types of observations are possible. Specifically, we have $n_{sde}$ subjects for whom E is originally observed, D = d, and E = e ($d, e = 0, 1$). Likewise, there are $n_{rsde}$ subjects for whom D = d and E is originally missing but who are reassessed and found to have E = e ($d, e = 0, 1$). Finally, let $n_{msd*}$ represent the number of subjects with D = d and E missing who are not selected for reassessment ($d = 0, 1$). The first four types of subjects make likelihood contributions of the forms $p_{sd1} = (1 - p_{smd1}) \times \pi_{sd}$ and $p_{sd0} = (1 - p_{smd0}) \times (1 - \pi_{sd})$ ($d = 0, 1$), where $\pi_{s0} = \Pr(E = 1 \mid D = 0, \text{stratum} = s) = \pi_{s1}/[\pi_{s1} + (1 - \pi_{s1}) \times \exp(\beta)]$. The next four types of subjects make contributions of the forms $p_{rsd1} = p_{rs} \times p_{smd1} \times \pi_{sd}$ and $p_{rsd0} = p_{rs} \times p_{smd0} \times (1 - \pi_{sd})$ ($d = 0, 1$), while contributions from the final two types of subjects are $p_{msd*} = (1 - p_{rs}) \times [p_{smd1} \times \pi_{sd} + p_{smd0} \times (1 - \pi_{sd})]$ ($d = 0, 1$).
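The likelihood specific to the sth stratum is a product of terms corresponding to these 10 types and is analogous to that given in equation 2 for crude analysis. The overall log-likelihood is a sum of stratum-specific log-likelihoods and (neglecting combinatoric terms) can be written as

$$l(\beta, \boldsymbol{\pi}, \mathbf{p}_m) = \sum_{s=1}^{S}\sum_{d=0}^{1}\left[n_{sd1}\ln p_{sd1} + n_{sd0}\ln p_{sd0} + n_{rsd1}\ln p_{rsd1} + n_{rsd0}\ln p_{rsd0} + n_{msd*}\ln p_{msd*}\right], \tag{4}$$

where $\boldsymbol{\pi}$ and $\mathbf{p}_m$ are (S x 1) and (4S x 1) vectors, respectively, containing the exposure and missingness probabilities $\pi_{s1}$ and $p_{smde}$ ($s = 1, \ldots, S$; $d, e = 0, 1$).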
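The stratum-specific contributions described above can be sketched in Python as follows. The function names, the use of `beta` for the common log odds ratio, and the dictionary layout for missingness probabilities and counts are assumptions made for illustration; this is not the SAS IML program mentioned in the text. A numerical optimizer would be needed to maximize the sum of such terms over strata.

```python
import math

def pi_s0_from(pi_s1, beta):
    """Control exposure probability implied by pi_s1 and a common log odds ratio."""
    return pi_s1 / (pi_s1 + (1 - pi_s1) * math.exp(beta))

def cell_probs(pi_sd, pm_d1, pm_d0, pr_s):
    """Five contribution probabilities for one disease group within a stratum."""
    return (
        (1 - pm_d1) * pi_sd,                                  # observed, E = 1
        (1 - pm_d0) * (1 - pi_sd),                            # observed, E = 0
        pr_s * pm_d1 * pi_sd,                                 # reassessed, E = 1
        pr_s * pm_d0 * (1 - pi_sd),                           # reassessed, E = 0
        (1 - pr_s) * (pm_d1 * pi_sd + pm_d0 * (1 - pi_sd)),   # never reassessed
    )

def stratum_loglik(beta, pi_s1, pm, pr_s, counts):
    """Log-likelihood for one stratum (combinatoric terms dropped).

    pm:     {(d, e): Pr(E missing | D = d, E = e)} for this stratum
    counts: {d: (n_sd1, n_sd0, n_rsd1, n_rsd0, n_msd_star)}
    """
    total = 0.0
    for d in (1, 0):
        pi_sd = pi_s1 if d == 1 else pi_s0_from(pi_s1, beta)
        probs = cell_probs(pi_sd, pm[(d, 1)], pm[(d, 0)], pr_s)
        total += sum(n * math.log(p) for n, p in zip(counts[d], probs) if n > 0)
    return total

# Hypothetical stratum: common OR of 2, so pi_s0 = 0.25 when pi_s1 = 0.4.
pm = {(1, 1): 0.1, (1, 0): 0.3, (0, 1): 0.3, (0, 0): 0.1}
counts = {1: (90, 100, 3, 11, 46), 0: (45, 170, 8, 5, 22)}
print(pi_s0_from(0.4, math.log(2)), stratum_loglik(math.log(2), 0.4, pm, 0.25, counts))
```

Because the five probabilities for each disease group sum to one, each stratum contributes two independent five-category multinomial terms.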
An advantage of the likelihood approach is flexibility to consider more parsimonious models (e.g., one could make and test the hypothesis that all or some missingness parameters are equal across strata). However, while a program implemented by using SAS IML software (15) to numerically maximize equation 4 and obtain standard errors is available from the authors, we can no longer readily derive closed-form expressions for the MLEs and the expected information matrix. An attractive alternative that makes use of the closed-form results is the following asymptotically unbiased, weighted-average estimator:
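$$\hat{\beta}_w = \frac{\sum_{s=1}^{S} \hat{\beta}_s/\hat{V}_s}{\sum_{s=1}^{S} 1/\hat{V}_s}, \tag{5}$$

where $\hat{\beta}_s$ and $\hat{V}_s$ represent the sth stratum-specific MLE for the log-odds ratio and its estimated variance, respectively. A valid estimator for $\mathrm{Var}(\hat{\beta}_w)$ is

$$\widehat{\mathrm{Var}}(\hat{\beta}_w) = \left(\sum_{s=1}^{S} 1/\hat{V}_s\right)^{-1};$$

thus, one can conduct point and confidence interval estimation of the common odds ratio with informatively missing exposure data by applying the methodology for the crude analysis within separate strata and combining the results.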
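A weighted-average combination of stratum-specific estimates might be coded as below. The sketch assumes that the weights are the reciprocals of the estimated stratum variances (the usual inverse-variance choice), which should be checked against the published form of equation 5; the function name and the input values are hypothetical.

```python
import math

def combine_log_odds_ratios(log_ors, variances, z=1.96):
    """Inverse-variance weighted average of stratum-specific log odds ratios.

    Assumption: weights are 1 / (estimated variance) for each stratum.
    Returns the common OR estimate with approximate confidence limits.
    """
    weights = [1.0 / v for v in variances]
    beta_w = sum(w * b for w, b in zip(weights, log_ors)) / sum(weights)
    var_w = 1.0 / sum(weights)
    half_width = z * math.sqrt(var_w)
    return {
        "log_or": beta_w,
        "variance": var_w,
        "or": math.exp(beta_w),
        "lower": math.exp(beta_w - half_width),
        "upper": math.exp(beta_w + half_width),
    }

# Hypothetical stratum results (log OR and variance from the crude method per stratum).
result = combine_log_odds_ratios([0.2, 0.4], [0.04, 0.04])
print(result)
```

With equal variances the combined estimate is the simple average of the stratum estimates, and the combined variance is half the common stratum variance.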
Table 3 summarizes a brief simulation study to evaluate the common odds ratio estimator in equation 5. For comparison, we also evaluated the MLE based on equation 4 and the standard Mantel-Haenszel odds ratio estimator based on complete-case analysis (12). Missingness probabilities were varied across three strata as indicated, and the following settings are illustrated: 1) exposure missing completely at random in each stratum, 2) moderate association between missingness and exposure status, and 3) extreme association between missingness and exposure status. Each setting was simulated 5,000 times with a true common odds ratio of 1 and with 40 percent reassessment probabilities and 500 subjects in each stratum; other conditions were as specified in the footnote to table 3. Note that the weighted estimator (equation 5) performed just as well as the MLE; in simulations with smaller sample sizes (not shown), more discrepancy was seen but results remained supportive of using equation 5. As expected, the Mantel-Haenszel estimator performed well in setting 1 but was subject to bias and poor confidence interval coverage in setting 2, and these problems were more pronounced for setting 3.
Comments and special cases
Before committing to additional sampling, it would help to obtain some sense of the magnitude of bias to be expected with complete-case analysis or what conditions would ensure no bias. Regarding the latter, we offer the following about the crude (single-stratum) analysis: 1) If, among both diseased and nondiseased subjects, the missingness probabilities do not depend on exposure (i.e., $p_{m11} = p_{m10}$ and $p_{m01} = p_{m00}$), then the complete-case approach is valid and produces the MLE of the odds ratio (this reverts to scenario 1, and a special case was illustrated in settings 7–9 (table 2)); 2) if missingness is related to exposure but not to disease status (i.e., $p_{m11} = p_{m01}$ and $p_{m10} = p_{m00}$ but $p_{m11} \neq p_{m10}$ and $p_{m01} \neq p_{m00}$), then the complete-case estimators $\hat{\pi}_1$ and $\hat{\pi}_0$ are biased but the complete-case odds ratio estimator is valid. The validity of the complete-case odds ratio when missingness is unrelated to disease also holds in the case of continuous exposure.
Aside from the preceding special cases, note that complete-case analysis yields biased estimators and confidence interval coverage approaching zero percent as sample size increases. The magnitude of bias in complete-case exposure probability (i.e., $\pi_1$ or $\pi_0$) estimates depends primarily on two factors: 1) the amount of missing exposure data and 2) the extent of association between missingness and the true exposure classification (the latter determined by the values of $p_{mde}$ ($d, e = 0, 1$)). In a spirit similar to that of prior authors (e.g., Vach and Blettner (2)), one can show that, in large samples, the complete-case odds ratio estimator approaches the quantity $OR \times \{[(1 - p_{m11}) \times (1 - p_{m00})]/[(1 - p_{m10}) \times (1 - p_{m01})]\}$. This expression provides a preliminary guide to assess potential for bias based on various assumed values of the missingness probabilities.
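The large-sample expression lends itself to a quick sensitivity check before any reassessment sampling is planned. The helper below (a hypothetical name) simply evaluates the limiting complete-case odds ratio for assumed missingness probabilities; the two special cases discussed above give a multiplier of exactly 1.

```python
def complete_case_limit(or_true, pm11, pm10, pm01, pm00):
    """Large-sample limit of the complete-case odds ratio estimator.

    pmde = Pr(E missing | D = d, E = e); or_true is the true odds ratio.
    """
    multiplier = ((1 - pm11) * (1 - pm00)) / ((1 - pm10) * (1 - pm01))
    return or_true * multiplier

# Assumed values for illustration: strong, opposite dependence on exposure
# in cases and controls inflates a true OR of 2 by a factor of 2.25.
print(complete_case_limit(2.0, 0.10, 0.40, 0.40, 0.10))

# Special case 1: missingness unrelated to exposure within disease groups.
print(complete_case_limit(2.0, 0.30, 0.30, 0.10, 0.10))

# Special case 2: missingness related to exposure but not to disease.
print(complete_case_limit(2.0, 0.40, 0.10, 0.40, 0.10))
```

Scanning a grid of plausible missingness probabilities with this function gives a rough sense of whether a complete-case analysis is likely to be seriously biased.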
EXAMPLE
For illustration, we use observed cell counts from an example by Rosner (12). The data stem from a case-control study of association between breast cancer and age at first birth among mothers from six countries (16). To exemplify the methods, we augment the observed data with missing and reassessment study data in an arbitrary but reasonable way.
Data from 13,465 women in the breast cancer study are presented in table 4. Suppose hypothetically that there were 300 additional cases for whom exposure (age at first birth) was missing. Although this information would be unavailable for analysis, let us further assume that 72 (24 percent) of these cases were in fact aged ≥30 years at first birth. Likewise, suppose that 2,500 additional controls had been identified but could not be classified as to exposure and that 700 (28 percent) of these controls were aged ≥30 years at first birth.
Using the observed data in table 4, we obtain $\hat{\pi}_1$ = 683/3,220 = 0.212 (standard error (SE), 0.0072) and $\hat{\pi}_0$ = 1,498/10,245 = 0.146 (SE, 0.0035). The complete-case $\widehat{OR}$ = 1.57, with (via Woolf's method) a 95 percent confidence interval of 1.42, 1.74. In contrast, cell counts for the actual table are (in clockwise order) 755, 2,198, 2,765, and 10,547. Were all data observed, we would have $\hat{\pi}_1$ = 755/3,520 = 0.214 (SE, 0.0069) and $\hat{\pi}_0$ = 2,198/12,745 = 0.172 (SE, 0.0033), producing $\widehat{OR}$ = 1.31 (95 percent confidence interval: 1.19, 1.44).
In applying the proposed method, we focus first on the cases. Assume that a 25 percent reassessment probability is applied and yields for reassessment a selection of 75 of the 300 cases for whom exposure is missing. Of these 75 cases, suppose that 20 are found to have been aged ≥30 years at first birth (E = 1) and the remaining 55 are verified as aged <30 years (E = 0). It follows that $\hat{p}_{m11}$ = 20/[683(75/300) + 20] = 0.105 and $\hat{p}_{m10}$ = 55/[2,537(75/300) + 55] = 0.080. By using equation 3 and Appendix A to estimate the variance, we obtain the MLE $\hat{\pi}_1$ = 0.217 (SE, 0.0079). Next, assume that a 25 percent reassessment probability yields 625 of the 2,500 controls for reassessment for whom exposure is missing. Of these 625, suppose that 172 are found to have been exposed and the remaining 453 unexposed. We obtain $\hat{p}_{m01}$ = 172/[1,498(625/2,500) + 172] = 0.315 and $\hat{p}_{m00}$ = 453/[8,747(625/2,500) + 453] = 0.172, which leads to $\hat{\pi}_0$ = 0.172 (SE, 0.0045). The MLE for the odds ratio follows as 1.34 (95 percent confidence interval: 1.20, 1.49). Thus, reasonable estimates of parameters driving the missing data mechanism, as obtained from partial reassessment samples, produced results much closer to those we would have obtained if we had observed the full table.
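The point estimates in this example can be retraced with a short script. The counts are those quoted in the text; the helper names are illustrative, and standard errors are left to the Appendix A calculation.

```python
def fill_in_proportion(n1, n0, nm, nr1, nr0):
    """Fill-in estimate of the exposure probability for one disease group."""
    return (n1 + nm * nr1 / (nr1 + nr0)) / (n1 + n0 + nm)

def missingness_estimate(n_obs, nm, nr_e, nr_total):
    """Estimate of Pr(E missing | D, E = e) from observed and reassessed counts."""
    return nr_e / (n_obs * nr_total / nm + nr_e)

def odds_ratio(pi1, pi0):
    return pi1 * (1 - pi0) / (pi0 * (1 - pi1))

# Cases: 683 exposed, 2,537 unexposed, 300 missing; 75 reassessed (20 exposed).
pi1_hat = fill_in_proportion(683, 2537, 300, 20, 55)
pm11_hat = missingness_estimate(683, 300, 20, 75)
pm10_hat = missingness_estimate(2537, 300, 55, 75)

# Controls: 1,498 exposed, 8,747 unexposed, 2,500 missing; 625 reassessed (172 exposed).
pi0_hat = fill_in_proportion(1498, 8747, 2500, 172, 453)
pm01_hat = missingness_estimate(1498, 2500, 172, 625)
pm00_hat = missingness_estimate(8747, 2500, 453, 625)

print("missingness:", [round(p, 3) for p in (pm11_hat, pm10_hat, pm01_hat, pm00_hat)])
print("pi1, pi0:", round(pi1_hat, 3), round(pi0_hat, 3))
print("OR (fill-in MLE):", round(odds_ratio(pi1_hat, pi0_hat), 2))
print("OR (complete case):", round(odds_ratio(683 / 3220, 1498 / 10245), 2))
print("OR (full table):", round(odds_ratio(755 / 3520, 2198 / 12745), 2))
```

The three odds ratios printed at the end correspond to the maximum likelihood, complete-case, and full-table analyses compared in the text.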
CONCLUSIONS
We have described a likelihood-based approach that permits valid estimation and inference for crude and common odds ratios in case-control studies subject to nonignorably missing exposure data. As with more familiar procedures for correcting misclassification, the approach requires an additional data collection effort for a subsample of participants. This additional effort, which we called a reassessment study, permits identification of parameters associated with the missing data mechanism and facilitates estimation without imposing assumptions or restrictions. The approach for estimating the common odds ratio allows missingness probabilities to vary (as might be expected) by stratification variables.
The practical benefit of the design considered is that an exceptional (here, meaning likely to obtain complete information) data collection effort need be applied for only some of the participants. Thus, valid estimates are derived without incurring the full cost of applying that same effort across the board. As mentioned earlier, we assumed that the investigator is successful in obtaining the exposure information for subjects selected for reassessment; however, this goal may well be overly optimistic. Missing information that cannot be recovered for some such subjects might reasonably be accommodated by considering those persons as making type 5 likelihood contributions (scenario 2), so they would not contribute to estimating the MLE via equation 3. The implicit assumption is that, unlike for the original data collection effort, the probability of the redoubled effort producing a missing value does not depend on actual exposure status (although it can depend on disease status). This assumption connects equation 3 with a prevalence estimator given previously (10). The MAR assumption at this second (as opposed to first) level of sampling will often be more reasonable (17); in any case, the impact from violation of MAR at the second level should be small unless the redoubled effort is unsuccessful for a substantial number of subjects.
Nevertheless, in the case of large-scale studies with a considerable amount of missing data that remains after the second stage, one might consider extending design and analytical considerations to incorporate one or more additional reassessment stages. As a hypothetical example, initial exposure assessment might be made by questionnaire, the first reassessment by requesting information over the telephone, and a second reassessment among those not responding to the first one by offering incentives. Further stages could be contemplated, although their usefulness decreases greatly. Appendix B illustrates construction of the likelihood based on a three-stage scenario, and a program suited for such analysis is available from the authors. Limited simulation studies in this context indicate that the original two-stage approach (i.e., with a single reassessment) can provide markedly better estimates and inferences than complete-case analysis, and it suffers relative to a three-stage design only with substantially incomplete initial reassessment and under unusually extreme non-MAR conditions after the second stage.
Use of the proposed method is most feasible in studies in which a noteworthy amount of data is missing and for which resources permit a reassessment study extensive enough to obtain reasonable precision in the estimates of pmde (d,e = 0,1) within each stratum. If very little information is available for this purpose, then variance estimation as described in Appendix A can be unreliable. When a significant amount of data is missing, even a mild association between missingness and exposure status could introduce bias into complete-case results, making reassessment a potentially attractive option.
The ability to conduct an essentially assumption-free analysis facilitated by supplemental data owes partly to the fact that the 2 x 2 table is the foundation of the considerations given here. Nevertheless, the Mantel-Haenszel-like approach is the fundamental facilitator for covariate-adjusted analyses that are usually the ultimate basis for epidemiologic findings, and our method extended naturally to that setting. Adaptation of our approach to multilevel categorical exposures should be straightforward. Further considerations to incorporate supplemental data into more general models, that is, those with continuous exposure and covariates, effect modification, and/or missing covariate as well as exposure data, would be of interest and might be facilitated via the expectation-maximization (EM) algorithm (18). Such considerations may help reduce concerns about model misspecification that can linger when existing approaches for handling nonignorable missing data are applied in the context of regression analysis for case-control studies.
APPENDIX A
Expected Information Matrix Associated with Equation 3
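Here, it is useful to rewrite the log-likelihood in equation 2 as

$$l = n_1 \ln[(1 - p_{m11})\pi_1] + n_0 \ln[(1 - p_{m10})(1 - \pi_1)] + n_{r1} \ln(p_r p_{m11}\pi_1) + n_{r0} \ln[p_r p_{m10}(1 - \pi_1)] + n_{m*} \ln\{(1 - p_r)[p_{m11}\pi_1 + p_{m10}(1 - \pi_1)]\},$$

where $n_{m*} = n_m - n_{r1} - n_{r0}$. Note that $E(n_1) = np_1$, $E(n_0) = np_2$, $E(n_{r1}) = np_3$, $E(n_{r0}) = np_4$, and $E(n_{m*}) = np_5$, with the $p_j$'s equal to the respective arguments of the five logarithms in the above log-likelihood.
The estimated (3 x 3) variance-covariance matrix $\widehat{\mathrm{Var}}[(\hat{\pi}_1, \hat{p}_{m11}, \hat{p}_{m10})'] = \hat{I}^{-1}$, where $\hat{I}$ has individual terms

$$I_{kl} = -E\left(\frac{\partial^2 l}{\partial\theta_k\,\partial\theta_l}\right) = n\sum_{j=1}^{5}\frac{1}{p_j}\,\frac{\partial p_j}{\partial\theta_k}\,\frac{\partial p_j}{\partial\theta_l} \quad (k, l = 1, 2, 3),$$

with $\theta = (\pi_1, p_{m11}, p_{m10})'$, and we replace the unknown parameters ($\pi_1$, $p_{m11}$, $p_{m10}$) by their respective MLEs. Recall that the selection proportion $p_r$ is treated as a fixed known constant. These calculations were used to obtain standard error estimates for $\hat{\pi}_1$ and $\hat{\pi}_0$ and subsequently (as described in the Preliminaries section) standard errors for the odds ratio in the simulations summarized in tables 2 and 3. In the unlikely event that $\hat{p}_{m11}$ or $\hat{p}_{m10}$ equals zero, the above results must be amended slightly after removing a null term from the log-likelihood.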
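One way to carry out this calculation numerically is sketched below. It treats the five observation types for a disease group as a single multinomial with n trials, so the expected information is n times the sum over cells of the outer product of each cell probability's gradient divided by that cell probability. That representation, the function names, and the hand-coded 3 x 3 inversion are assumptions of this sketch and not necessarily the authors' exact expressions; the inputs are the example data from the text with $p_r$ = 0.25.

```python
import math

def cell_probs_and_grads(pi, a, b, pr):
    """Cell probabilities and gradients w.r.t. (pi, a, b), a = pm_d1, b = pm_d0."""
    probs = [(1 - a) * pi,
             (1 - b) * (1 - pi),
             pr * a * pi,
             pr * b * (1 - pi),
             (1 - pr) * (a * pi + b * (1 - pi))]
    grads = [(1 - a, -pi, 0.0),
             (-(1 - b), 0.0, -(1 - pi)),
             (pr * a, pr * pi, 0.0),
             (-pr * b, 0.0, pr * (1 - pi)),
             ((1 - pr) * (a - b), (1 - pr) * pi, (1 - pr) * (1 - pi))]
    return probs, grads

def variance_of_pi_hat(n1, n0, nm, nr1, nr0, pr):
    """First diagonal element of the inverse expected information at the MLE."""
    n, nr = n1 + n0 + nm, nr1 + nr0
    pi = (n1 + nm * nr1 / nr) / n
    a = nr1 / (n1 * nr / nm + nr1)
    b = nr0 / (n0 * nr / nm + nr0)
    probs, grads = cell_probs_and_grads(pi, a, b, pr)
    m = [[n * sum(g[k] * g[l] / p for p, g in zip(probs, grads))
          for l in range(3)] for k in range(3)]
    det = (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
           - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
           + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    cofactor_00 = m[1][1] * m[2][2] - m[1][2] * m[2][1]
    return pi, cofactor_00 / det

pi1, var1 = variance_of_pi_hat(683, 2537, 300, 20, 55, 0.25)       # cases
pi0, var0 = variance_of_pi_hat(1498, 8747, 2500, 172, 453, 0.25)   # controls
or_hat = pi1 * (1 - pi0) / (pi0 * (1 - pi1))
se_log = math.sqrt(var1 / (pi1 * (1 - pi1)) ** 2 + var0 / (pi0 * (1 - pi0)) ** 2)
lower = math.exp(math.log(or_hat) - 1.96 * se_log)
upper = math.exp(math.log(or_hat) + 1.96 * se_log)
print(round(math.sqrt(var1), 4), round(math.sqrt(var0), 4))
print(round(or_hat, 2), round(lower, 2), round(upper, 2))
```

If the sketch is right, the printed standard errors and interval should be close to the values reported in the Example section (SE 0.0079 and 0.0045; odds ratio 1.34 with limits 1.20, 1.49).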
APPENDIX B
Likelihood to Incorporate a Second Reassessment Study
Let $p_{ra}$ and $p_{rb}$ represent initial and subsequent reassessment probabilities designated by the investigator. We seek to estimate $\pi_1 = \Pr(E = 1 \mid D = 1)$ by using data from diseased subjects together with the missingness probabilities $p_{mea} = \Pr(E \text{ missing} \mid D = 1, E = e)$ and $p_{meb} = \Pr(E \text{ missing} \mid D = 1, E = e)$ ($e = 0, 1$), where the a and b subscripts refer to the initial and first reassessment sampling conditions, respectively.
Assume that there are $n_e$ subjects with E originally observed and E = e and $n_{rea}$ subjects with E missing but selected for the first reassessment and found to have E = e ($e = 0, 1$). Furthermore, let there be $n_{reb}$ subjects with E missing who are selected for the first reassessment but still fail to provide data but are then selected for the second reassessment and found to have E = e ($e = 0, 1$). Finally, let $n_{ma*}$ be the number with E missing but not selected for the first reassessment, and let $n_{mb*}$ be the number with E missing, initially reassessed but still failing to provide data, who are not selected for the second reassessment. Nonrespondents to the second reassessment study can be viewed as of the final type, corresponding to a "MAR at the third level" assumption (refer to the Conclusions section of the text). When the same subscripts are used, the eight possible types of observations contribute $p_1 = (1 - p_{m1a}) \times \pi_1$,
NOTES
Correspondence to Dr. Robert H. Lyles, Department of Biostatistics, The Rollins School of Public Health, Emory University, 1518 Clifton Road NE, Atlanta, GA 30322 (e-mail: rlyles{at}sph.emory.edu).
REFERENCES
1. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol 1995;142:1255–64.
2. Vach W, Blettner M. Biased estimation of the odds ratio in case-control studies due to the use of ad hoc methods of correcting for missing values for confounding variables. Am J Epidemiol 1991;134:895–907.
3. Little RJA, Rubin DB. Statistical analysis with missing data. New York, NY: John Wiley & Sons, 1987.
4. Fay RE. Causal models for patterns of nonresponse. J Am Stat Assoc 1986;81:354–65.
5. Baker SG, Laird NM. Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. J Am Stat Assoc 1988;83:62–9.
6. Park T, Brown MB. Models for categorical data with nonignorable nonresponse. J Am Stat Assoc 1994;89:44–52.
7. Baker SG. The analysis of categorical case-control data subject to nonignorable nonresponse. Biometrics 1996;52:362–9.
8. Vach W, Blettner M. Logistic regression with incompletely observed categorical covariates: investigating the sensitivity against violations of the missing at random assumption. Stat Med 1995;14:1315–29.
9. Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika 1988;75:11–20.
10. Brenner H. Alternative approaches for estimating prevalence in epidemiologic surveys with two waves of respondents. Am J Epidemiol 1995;142:1236–45.
11. Sen PK, Singer JM. Large sample methods in statistics. New York, NY: Chapman and Hall, 1993.
12. Rosner B. Fundamentals of biostatistics. 4th ed. Belmont, CA: Wadsworth Publishing Company, 1995.
13. Gong G, Samaniego FJ. Pseudo maximum likelihood estimation: theory and applications. Ann Stat 1981;9:861–9.
14. Liang KY, Self SG. On the asymptotic behavior of the pseudo-likelihood ratio test statistic. J R Stat Soc (B) 1996;59:785–96.
15. SAS Institute, Inc. SAS/IML software: changes and enhancements through release 6.11. Cary, NC: SAS Institute, Inc, 1995.
16. MacMahon B, Cole P, Lin TM, et al. Age at first birth and breast cancer risk. Bull World Health Organ 1970;43:209–21.
17. Bartholomew DJ. A method of allowing for 'not-at-home' bias in sample surveys. Appl Stat 1961;10:52–9.
18. Dempster AP, Laird NM, Rubin DB. Maximum likelihood estimation from incomplete data via the EM algorithm. J R Stat Soc (B) 1977;39:1–38.
Received for publication January 25, 2001.
Accepted for publication July 19, 2001.