Use of Missing-Data Methods to Correct Bias and Improve Precision in Case-Control Studies in which Cases Are Subtyped but Subtype Information Is Incomplete

Jane C. Schroeder1 and Clarice R. Weinberg2

1 Epidemiology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC.
2 Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC.


    ABSTRACT
 
Histologic and genetic markers can sometimes make it possible to refine a disease into subtypes. In a case-control study, an attempt to subcategorize a disease in this way can be important to elucidating its etiology if the subtypes tend to result from distinct causal pathways. Using subtyped case outcomes, one can carry out either a case-case analysis to investigate etiologic heterogeneity or do polytomous logistic regression to estimate odds ratios specific to subtypes. Unfortunately, especially when such an analysis is undertaken after the study has been completed, it may be compromised by the unavailability of tissue specimens, resulting in missing subtype data for many enrolled cases. The authors propose that one can more fully use the available data, including that provided by cases with missing subtype, by using the expectation-maximization algorithm to estimate risk parameters. For illustration, they apply the method to a study of non-Hodgkin's lymphoma in the midwestern United States. The simulations then demonstrate that, under assumptions likely to hold in many settings, the approach eliminates bias that would arise if unclassified cases were ignored and also improves the precision of estimation. Under the same assumptions, empirical confidence interval coverage is consistent with the nominal 95%.

algorithms; bias (epidemiology); case-control studies; computer simulation; epidemiologic methods; logistic models

Abbreviations: EM, expectation-maximization; NHL, non-Hodgkin's lymphoma


    INTRODUCTION
 
Histologic and molecular markers are used to subclassify cases in an increasing number of epidemiologic studies, with the goal of elucidating etiologic heterogeneity among outcome groups. Unfortunately, the expense of specimen retrieval and assay may limit the number of cases that can be evaluated, and biologic material needed to ascertain subtypes may not be available for every case. As we will show, exposure-outcome associations may be estimated with bias if covariates are related to the probability that case-subtype data are missing, that is, if subtype data are not missing completely at random (1). Nevertheless, the potential influence of missing-subtype data is often ignored.

There are two commonly followed approaches for analysis of data based on a case-control study in which cases have been subtyped. One can carry out polytomous logistic regression (2, 3), which compares each case subtype with the same referent sample. Alternatively, one can carry out a case-case analysis, which ignores the controls and makes the comparisons across case subtypes. The latter approach, first proposed by Begg and Zhang (4), can provide more efficient estimation of etiologic contrast parameters; that is, it can identify factors that are differentially related to the various disease subtypes. However, the case-case approach does not permit estimation of "main effects" for those factors because it is restricted to those who develop the disease.

We recently collaborated on a study that subtyped non-Hodgkin's lymphoma cases according to the presence or absence of a chromosomal translocation (5). Archival biopsies required for the translocation assay were not available for cases from a few large institutions in one of the two states from which the study population was drawn. Consequently, state of origin was spuriously associated with disease status, and several exposures of greater a priori interest were also artificially over- or underrepresented among classifiable cases relative to population-based controls. Case-subtype:control comparisons would have been biased if unclassified cases were simply omitted from analyses, an analytic strategy we will refer to as the complete-data-only method. Case:case comparisons may have been unbiased under these circumstances, but case:case odds ratios only indicate the relative strength of covariate-outcome associations between case-subtypes; they do not provide information with regard to either the direction or the magnitude of the subtype-specific relative risks (main effects). Therefore, we considered using the expectation-maximization (EM) algorithm, a classical statistical method that should be able to correct bias secondary to incomplete case-subtype classification (6) by maximizing the likelihood that fully incorporates information from cases that could not be assigned to subtype. The EM algorithm has been used in other contexts for analysis of categorical data (1, 7), including longitudinal data with binary outcomes (8) and case-control data (9), but epidemiologists may not be aware of these methods for analysis of incompletely classified cases.

In this report, we describe how the EM algorithm can be used when outcome subtype classification is incomplete but covariate information is available for cases with missing subtype data. Other approaches for handling missing data, such as multiple imputation (10), would sacrifice the efficiency advantages associated with maximum likelihood methods. The EM method is based on the assumption that subtype data are missing at random with regard to the true outcome so that, among cases with a given set of covariates, the probability that case subclassification is missing does not vary across case subtypes.

In brief, our application of the EM method involves repeated iterations between an expectation step, in which unclassified cases are fractionally assigned to case subtypes according to current model predictions, and a maximization step, in which the resulting pseudocomplete data are analyzed to provide new estimates; these iterations continue until convergence is achieved (6) and the likelihood is maximized. We provide an example of the method on the basis of our data and then report the results of a simulation study designed to characterize and compare the performance of the EM and complete-data-only methods under conditions likely to occur in epidemiologic studies.


    MATERIALS AND METHODS
 
Risk model
Let Y denote the outcome j, which will be 0 for controls or 1 or 2 for the two possible case subtypes (for simplicity, we assume that there are only two subtypes, but the approach can be generalized in a straightforward way if there are more). For convenience, we code this as two "dummy" variables, which each become 1 for the respective case subtype and 0 otherwise. Assume that the disease is rare and the risk in the population is multiplicative: $\Pr(Y = j \mid X) = \exp(\alpha_j + \sum_k \beta_{jk} X_k)$, where X is a vector of covariates. Applying the key result of the paper by Prentice and Pyke (11), we can estimate odds ratio parameters (although not the $\alpha_j$ intercepts) based on retrospective case-control data by maximizing the likelihood that corresponds to a prospective study.

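Let the estimated probability of outcome j (j = 0, 1, 2), given case-control sampling and the ith covariate vector $X_i$, be denoted $P(Y = j \mid X_i)$. Setting controls as the common referent group, the polytomous regression model for the estimated probability of each outcome given $X_i$ can be written as

$$P(Y = 0 \mid X_i) = \frac{1}{1 + \sum_{l=1}^{2} \exp(\alpha_l + \sum_k \beta_{lk} X_{ik})}$$

and

$$P(Y = j \mid X_i) = \frac{\exp(\alpha_j + \sum_k \beta_{jk} X_{ik})}{1 + \sum_{l=1}^{2} \exp(\alpha_l + \sum_k \beta_{lk} X_{ik})}, \quad j = 1, 2.$$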

If no subtype data are missing, the corresponding prospective likelihood (the product of all of these conditional probabilities) is maximized by using widely available software for polytomous logistic regression. If some of the subtype data are missing, the likelihood contribution for each case with missing subtype becomes the collapsed probability $P(Y = 1 \text{ or } 2 \mid X_i) = P(Y = 1 \mid X_i) + P(Y = 2 \mid X_i)$, and the likelihood for the data is again simply the product of all of the conditional probabilities. To maximize this product over choices of the risk parameters, we make use of a statistical device called the EM algorithm (6).
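Written out, the observed-data likelihood described above takes the following form. The index sets $C_0$ (controls), $C_1$ and $C_2$ (cases classified as subtype 1 or 2), and $M$ (cases with missing subtype) are labels introduced here for illustration only; they are not part of the original notation.

```latex
L(\alpha, \beta) =
  \prod_{i \in C_0} P(Y = 0 \mid X_i)
  \prod_{i \in C_1} P(Y = 1 \mid X_i)
  \prod_{i \in C_2} P(Y = 2 \mid X_i)
  \prod_{i \in M} \left[ P(Y = 1 \mid X_i) + P(Y = 2 \mid X_i) \right]
```

The last product is the only place where cases with missing subtype enter, each through the sum of its two subtype probabilities.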

EM algorithm
To apply the EM algorithm validly to estimate the odds ratios in this setting, we must be able to assume that the missing classification data are missing at random, that is, that among cases with any given set of measured covariates, the probability that a case is unclassifiable is unrelated to the true unknown subtype. Under this plausible scenario, missingness is uninformative within each stratum defined by covariates, although this does not require that, overall, missingness is unrelated to the measured covariates.

The EM algorithm maximizes the likelihood described above by carrying out an iterative sequence of fits to an iteratively evolving pseudocomplete data set. A set of fitted values is first obtained based on a polytomous regression run on the complete data only (controls along with classified cases). Next, unclassified cases are fractionally assigned to each of the case-subtypes to create pseudocomplete data, with the fractional apportionment based on the current fitted values (subtype specific) for the corresponding covariate stratum.

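To do this, one uses the current estimates to fractionally assign each unclassified case i:

$$\hat{P}(Y = 1 \mid X_i, Y \in \{1, 2\}) = \frac{\exp(\hat{\alpha}_1 + \sum_k \hat{\beta}_{1k} X_{ik})}{\exp(\hat{\alpha}_1 + \sum_k \hat{\beta}_{1k} X_{ik}) + \exp(\hat{\alpha}_2 + \sum_k \hat{\beta}_{2k} X_{ik})}$$

and

$$\hat{P}(Y = 2 \mid X_i, Y \in \{1, 2\}) = \frac{\exp(\hat{\alpha}_2 + \sum_k \hat{\beta}_{2k} X_{ik})}{\exp(\hat{\alpha}_1 + \sum_k \hat{\beta}_{1k} X_{ik}) + \exp(\hat{\alpha}_2 + \sum_k \hat{\beta}_{2k} X_{ik})}.$$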

Note that these estimated values sum to 1.

Once all cases with missing subtype have been fractionally assigned to the subtypes in this way and these have been added in with the observed data, the resulting pseudodata set (which typically now includes noninteger counts) is modeled by using polytomous regression in the second maximization step. New model estimates are then used to derive new fractional assignments for unclassified cases, as above, in the second expectation step, and then the model is refit. Iteration between expectation and maximization continues in this manner until the estimates stop changing. Standard missing-data theory guarantees that the corresponding observed-data likelihood will increase at each step of the EM algorithm (6). As a practical rule, we declare convergence and stop the iterating when the difference in the log-likelihoods between two consecutive iterations is less than $10^{-6}$.
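The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Stata macros: the function names (`probs`, `m_step`, `observed_loglik`, `em_fit`), the coding of missing subtype as -1, and the hand-written Newton-Raphson fit are all choices made here for the example. The design matrix `X` is assumed to include a leading column of ones for the intercept.

```python
import numpy as np

def probs(X, B):
    """P(Y=j|X) for j=0,1,2 with controls (j=0) as referent. B has shape (p, 2)."""
    eta = np.column_stack([np.zeros(len(X)), X @ B])
    eta -= eta.max(axis=1, keepdims=True)           # guard against overflow
    e = np.exp(eta)
    return e / e.sum(axis=1, keepdims=True)

def m_step(X, Z, B, n_newton=50):
    """Polytomous logistic fit by Newton-Raphson; Z (n, 2) may hold fractions."""
    p = X.shape[1]
    for _ in range(n_newton):
        P = probs(X, B)[:, 1:]
        score = (X.T @ (Z - P)).ravel(order="F")     # stacked as (beta_1, beta_2)
        info = np.zeros((2 * p, 2 * p))
        for j in range(2):
            for k in range(2):
                v = P[:, j] * (float(j == k) - P[:, k])
                info[j * p:(j + 1) * p, k * p:(k + 1) * p] = X.T @ (v[:, None] * X)
        step = np.linalg.solve(info, score)
        B = B + step.reshape(p, 2, order="F")
        if np.max(np.abs(step)) < 1e-10:
            break
    return B

def observed_loglik(X, y, B):
    """y: 0 = control, 1 or 2 = classified case, -1 = case with missing subtype."""
    P = probs(X, B)
    known = y >= 0
    Pk = P[known]
    ll = np.log(Pk[np.arange(len(Pk)), y[known]]).sum()
    return ll + np.log(P[~known, 1] + P[~known, 2]).sum()

def em_fit(X, y, tol=1e-6, max_iter=5000):
    known = y >= 0
    Z = np.column_stack([y == 1, y == 2]).astype(float)
    # Starting values: fit to the complete data only (controls and classified cases).
    B = m_step(X[known], Z[known], np.zeros((X.shape[1], 2)))
    ll_old = observed_loglik(X, y, B)
    for it in range(1, max_iter + 1):
        P = probs(X[~known], B)[:, 1:]
        Z[~known] = P / P.sum(axis=1, keepdims=True)  # E step: fractions sum to 1
        B = m_step(X, Z, B)                           # M step on pseudocomplete data
        ll = observed_loglik(X, y, B)
        if abs(ll - ll_old) < tol:                    # stopping rule from the text
            break
        ll_old = ll
    return B, ll, it
```

Calling `em_fit(X, y)` returns the coefficient matrix (one column per subtype), the observed-data log-likelihood at the final iteration, and the number of EM iterations used. Convergence is monitored on the observed-data log-likelihood rather than on the pseudocomplete-data likelihood that the M step maximizes.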

We caution that the likelihood the standard software is using, that is, the likelihood based on the final pseudocomplete data, is not itself a valid likelihood and cannot be used directly for variance estimation or likelihood ratio testing. The observed-data likelihood (which includes factors formed as sums of conditional probabilities, as described above) must be used for variance calculation. Accordingly, one must rely on additional programming to derive valid variance estimates for coefficients estimated using the EM. Details on how to carry out valid estimation of the variances and covariances for the risk parameters are given in appendix 1.

Example
Our example is drawn from a population-based case-control study of non-Hodgkin's lymphoma (NHL) (12), in which cases were subtyped according to the presence or absence of the t(14;18) chromosomal translocation (5). t(14;18) is believed to contribute to NHL pathogenesis by stimulating constant production of the antiapoptotic protein bcl-2, which effectively immortalizes the t(14;18)-positive lymphocyte (13). We reasoned that some NHL risk factors might act preferentially through t(14;18)-associated pathogenic mechanisms and that such risk factors would be more strongly associated with t(14;18)-positive NHL than with NHL in the aggregate. The example uses actual data from 1,245 controls, 68 cases classified as subtype 1 (t(14;18) positive), 114 cases classified as subtype 2 (t(14;18) negative), and 440 cases who could not be subtyped because archival tumor tissue was either unavailable or inadequate for the t(14;18) assay. For simplicity, we use two dichotomous predictors, state (0 if from Iowa, 1 if from Minnesota) and soybeans (0 if never worked on a farm that produced soybeans, 1 otherwise), corresponding to four possible covariate vectors. Neither covariate was related to NHL (as a single outcome) when all of the original study cases were compared with controls. However, both covariates were associated with missing subtype information. Specifically, 64 percent of the Iowa cases were missing subtype information compared with 77 percent of the Minnesota cases, and 74 percent of the cases who did not farm soybeans were missing subtype information compared with 63 percent of the cases who did farm soybeans.

To illustrate the method, we provide the data and estimates produced by the first two EM iterations (table 1). The first block shows the actual data, and the second shows the first set of parameter estimates computed using only the complete data. The third block shows the newly computed pseudodata, constructed by using the first set of parameter estimates. For example, the fraction of the missing 119 (row 1) assigned to subtype 1 is simply the best current estimate of $\exp(\alpha_1)/\{\exp(\alpha_1) + \exp(\alpha_2)\}$, which is 0.358. Hence, we assign 0.358 × 119 = 42.60 to the Y = 1 category in the second E step. Similarly, the fraction of 53 (row 4) assigned to subtype 1 is simply the best current estimate of $\exp(\alpha_1 + \beta_{11} + \beta_{12})/\{\exp(\alpha_1 + \beta_{11} + \beta_{12}) + \exp(\alpha_2 + \beta_{21} + \beta_{22})\}$, which is 0.395. This process is continued until the log-likelihoods from consecutive EM iterations differ by less than $10^{-6}$. The fourth block of table 1 shows the estimated risk coefficients associated with state and soybeans at convergence, along with the corresponding standard error estimates based on the observed data likelihood (appendix 1). Note that the invalid estimates based on complete data only (the M-step estimates from the first iteration) are quite different from the EM-based estimates (at convergence) in this example.


TABLE 1. Estimation step data from the first two expectation-maximization iterations used in the example and maximization step estimates from the first and the final expectation-maximization iterations

 
Although table 1 seems to represent a rather tedious process, readers should be aware that the algorithm can be automated, as in a conventional logistic regression. In our example, convergence was achieved quite rapidly (after only 47 EM iterations); this number will vary depending on the specific data analyzed, but even complex models should converge in a matter of minutes. We encourage readers to use and adapt our Stata macros, which are available on request, for actual applications.

Simulation study
To compare the performance of various analytic strategies, we carried out a set of simulations. Each simulation was based on 1,000 independent, simulated case-control studies. For each person simulated, two independently distributed dichotomous covariates (A and B) were assigned according to the probabilities given in table 2. Case-subtype probabilities were based on a subtype-specific multiplicative risk model, with baseline risks of 0.00004 for subtype 1 and 0.00006 for subtype 2 (table 2). Cases of the two subtypes were generated by comparing a random uniform number with cutpoints corresponding to the risk of subtype 1 and the sum of the risks of the two subtypes. Random numbers were generated using a 32-bit uniform pseudorandom number generator with an overall period of approximately $2^{126}$ (14). Case-subtypes were sampled, with replacement, until 600 total cases were accrued. A control group of 1,200 observations was also sampled from the source population at random with replacement to give a total of 1,800 observations in each simulated study (which approximates the size of the lymphoma study used as our example).
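The sampling scheme just described might be sketched as follows. The baseline risks, case total, and control total are taken from the text; the exposure prevalences (`p_a`, `p_b`) and the relative risks (`rr1`, `rr2`) are hypothetical placeholders, because the values in table 2 are not reproduced here. The function name `simulate_study`, the chunked generation, and the use of NumPy's default generator (rather than the Stata generator the authors used) are likewise assumptions of this sketch.

```python
import numpy as np

def simulate_study(rng, n_cases=600, n_controls=1200,
                   p_a=0.3, p_b=0.5,                  # hypothetical prevalences
                   rr1=(1.0, 2.0), rr2=(1.0, 1.0),    # hypothetical (A, B) relative risks
                   base1=0.00004, base2=0.00006, chunk=1_000_000):
    """Return an array with columns (outcome, A, B); outcome 0 marks controls."""
    found, n_found = [], 0
    while n_found < n_cases:
        a = (rng.random(chunk) < p_a).astype(int)
        b = (rng.random(chunk) < p_b).astype(int)
        # Subtype-specific multiplicative risk model.
        risk1 = base1 * rr1[0] ** a * rr1[1] ** b
        risk2 = base2 * rr2[0] ** a * rr2[1] ** b
        # One uniform draw per person, compared with two cutpoints.
        u = rng.random(chunk)
        y = np.where(u < risk1, 1, np.where(u < risk1 + risk2, 2, 0))
        hit = y > 0
        found.append(np.column_stack([y[hit], a[hit], b[hit]]))
        n_found += int(hit.sum())
    cases = np.vstack(found)[:n_cases]
    # Controls: a random sample of covariates from the source population.
    ca = (rng.random(n_controls) < p_a).astype(int)
    cb = (rng.random(n_controls) < p_b).astype(int)
    controls = np.column_stack([np.zeros(n_controls, dtype=int), ca, cb])
    return np.vstack([controls, cases])
```

With risks on the order of 0.0001, each chunk of one million simulated people yields roughly a hundred cases, so several chunks are needed to accrue 600.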


TABLE 2. Data structure for simulations, including exposure distribution in the source population and risk model used to assign case-subtype status

 
Missing case-subtype status was assigned by generating a random uniform number for each simulated case observation (as above) and comparing its value with the probability of classification dictated by the missingness scenario being simulated. Four missingness mechanisms were used to assign a random fraction of cases to have missing subtype data. The probability of classification was either 1) equal for every case (missing completely at random); 2) associated with one of the two model covariates, but unrelated to the case-subtype conditional on covariates (missing at random); 3) associated with the case-subtype, but not with either covariate within subtype groups; or 4) associated with both case-subtype and one covariate. The probabilities used to assign cases to unclassified status under the different missingness scenarios are given in table 3.
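As a rough illustration, the four mechanisms could be coded as below. The probabilities are hypothetical placeholders and are not the values given in table 3; the mechanism labels, the function name `assign_missing`, and the use of -1 to mark an unclassified case are also assumptions made for this sketch. The function is meant to be applied to case observations only.

```python
import numpy as np

def assign_missing(rng, subtype, cov_a, mechanism):
    """Return the observed subtype for each case, with -1 where it is missing."""
    subtype = np.asarray(subtype)
    cov_a = np.asarray(cov_a)
    if mechanism == "mcar":                 # 1) same probability for every case
        p_miss = np.full(subtype.shape, 0.5)
    elif mechanism == "mar":                # 2) depends on covariate A only
        p_miss = np.where(cov_a == 1, 0.7, 0.3)
    elif mechanism == "subtype":            # 3) depends on the true subtype only
        p_miss = np.where(subtype == 1, 0.7, 0.3)
    elif mechanism == "subtype+covariate":  # 4) depends on subtype and covariate A
        p_miss = np.where(subtype == 1, 0.4, 0.2) + np.where(cov_a == 1, 0.3, 0.0)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    # One uniform draw per case, compared with that case's missingness probability.
    u = rng.random(subtype.shape)
    return np.where(u < p_miss, -1, subtype)
```

Only mechanisms 1 and 2 satisfy the missing-at-random assumption that the EM method requires.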


TABLE 3. Probability of missing subtype among cases, according to the underlying missingness mechanism being simulated

 
Five different sets of effect estimates were generated for each covariate (A and B) for each of the 1,000 simulated case-control studies used to evaluate a given scenario, using the Stata software package (14). These five estimates included 1) log-odds from an unconditional polytomous logistic regression with subtype information for all observations (hypothetical full-data results); 2) log-odds from the same polytomous logistic regression that instead excluded unclassified observations (using only complete data); 3) log-odds from a binary logistic regression based on data from classified cases only (case:case model (4)); 4) log-odds from polytomous logistic regression models fit using the EM algorithm to incorporate the partial information from cases that could not be subtyped; and 5) case:case log-odds derived from the same EM models (4).
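To see how a case:case log-odds can be derived from a fitted polytomous model, as in estimate 5, note that the shared denominator of the two subtype probabilities cancels in their ratio. The sketch below uses the $\alpha_j$, $\beta_{jk}$ notation of the risk model; the variance expression is the standard one for a difference of two estimates and is offered as an assumption about how such a contrast would be handled, since the text does not spell it out.

```latex
\log \frac{P(Y = 1 \mid X)}{P(Y = 2 \mid X)}
  = (\alpha_1 - \alpha_2) + \sum_k (\beta_{1k} - \beta_{2k}) X_k ,
\qquad
\widehat{\operatorname{Var}}(\hat{\beta}_{1k} - \hat{\beta}_{2k})
  = \widehat{\operatorname{Var}}(\hat{\beta}_{1k})
  + \widehat{\operatorname{Var}}(\hat{\beta}_{2k})
  - 2\,\widehat{\operatorname{Cov}}(\hat{\beta}_{1k}, \hat{\beta}_{2k})
```

The case:case log-odds for covariate k is therefore the difference $\hat{\beta}_{1k} - \hat{\beta}_{2k}$ between the two subtype-specific coefficients.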

For each simulated data set, we stored the estimated log-odds for each covariate-outcome combination, each corresponding variance and upper and lower limits of the 95 percent confidence interval, and the widths of each confidence interval (upper limit – lower limit). The estimated log-odds and the widths of the confidence intervals were averaged over 1,000 simulations, and the percent coverage was calculated for comparison with the nominal coverage rate of 95 percent.
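The summary measures described above are simple to compute once the per-study estimates are stored. The sketch below assumes Wald-type intervals (estimate ± 1.96 standard errors), which the text does not state explicitly; the function name `summarize` and the dictionary keys are hypothetical.

```python
import numpy as np

def summarize(estimates, std_errors, true_value, z=1.96):
    """Average log-odds, average CI width, and percent coverage over simulations."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    lower = est - z * se
    upper = est + z * se
    covered = (lower <= true_value) & (true_value <= upper)
    return {
        "mean_log_odds": est.mean(),
        "mean_ci_width": (upper - lower).mean(),   # upper limit minus lower limit
        "coverage_pct": 100.0 * covered.mean(),
    }
```

Bias shows up as a gap between `mean_log_odds` and the true value, loss of precision as a larger `mean_ci_width`, and invalid inference as `coverage_pct` well away from 95.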

Results of simulations
Full-data results confirmed that the sampling scheme used in the simulations was valid, with mean estimated log-odds in accord with the underlying risk model and empirical confidence interval coverage within the range (93.5–96.2 percent) that is compatible with the nominal 95 percent over 1,000 simulations. The results based on full data from one set of simulations are given in table 4; full-data results from all other simulations were comparable.


TABLE 4. Results for 1,000 independent simulations with all cases subclassified (full-data results)

 
Table 5 lists results from a simulation in which missingness was random with regard to both model covariates and the true case-subtype (the missing-completely-at-random scenario). Results based on complete-data-only and EM methods were unbiased, but EM case-subtype:control estimates were more precise than corresponding complete-data-only estimates, as indicated by the smaller width of the average confidence intervals from corresponding simulations.


TABLE 5. Comparison of complete-data-only and expectation-maximization methods when missingness of case-subtype classification was not associated with covariates or case-subtype (classification data missing completely at random)*

 
Table 6 lists simulation results for scenarios in which missingness was associated with covariate A but not with covariate B or the case-subtype (data missing at random). As expected, case-subtype:control estimates based on the complete-data-only method were biased for covariate A. For example, when 70 percent of the case observations were unclassified, the average estimated case-subtype 1:control coefficient for covariate A was 1.0 compared with the true value of 0; only 13 of 1,000 estimated 95 percent confidence intervals covered the true value. Bias increased, and the percent coverage decreased as the proportion of unclassified cases increased from 20 to 70 percent. EM results for covariate A case-subtype:control estimates were in marked contrast to those based on complete data only: EM estimates were compatible with the expected log-odds, and coverage was consistent with the nominal 95 percent level, even when the proportion of missing cases was high. Case:case comparisons based on complete data only and EM model estimates were both unbiased, with similar precision and coverage in accord with expectations (data not shown).


TABLE 6. Comparison of complete-data-only and expectation-maximization methods, based on simulations with different proportions of unclassified cases, in which missingness of case-subtype classification was related to a covariate but was not associated with case-subtype conditional on covariates (classification data missing at random)*

 
Covariate B was not associated with missingness in this set of simulations, and both the complete-data and EM estimates for its coefficient were compatible with the expected log-odds, although coverage for estimates based on the complete data only fell slightly (but significantly) below the 95 percent level when 70 percent of the cases were unclassified (table 6). Smaller average confidence interval widths for corresponding simulations indicate that EM results for covariate B were more precise than complete-data-only results, as expected based on the fact that the EM approach uses more of the available information.

Results presented in table 7 are from simulations in which the probability of classification depended on case-subtype, but not on covariates. All complete-data-only and case:case estimates were unbiased, but EM case-subtype:control estimates for covariate B were biased. In addition, coverage was above the nominal 95 percent for both EM-based case-subtype:control estimates for covariate A and below the nominal 95 percent for EM-based case-subtype 1:control estimates for covariate B. These results reflect the violation of the assumption needed for the EM method to be valid, namely that the missing data should be missing at random, which ensures that the observed data are stochastically representative of the missing data at each fixed stratum defined by the covariate data.


TABLE 7. Comparison of complete-data-only and expectation-maximization methods when missingness of case-subtype classification was associated with case-subtype, but not with covariates within case-subtypes*

 
In the scenario assessed in table 8, missingness depended on case-subtype and one covariate. All methods, including both case-case analyses, produced biased estimates.


TABLE 8. Comparison of complete-data-only and expectation-maximization methods when missingness of case-subtype classification was associated with case-subtype and one covariate*

 

    DISCUSSION
 
Simulation results confirmed our expectation that wholesale exclusion of unclassified cases (complete-data-only analysis) would bias estimation when missingness was related to the study covariates. In contrast, EM estimates based on the same data produced unbiased results as long as missingness was not also affected by the true case-subtype (as in table 6). However, other simulations confirmed our expectation that the EM method may not succeed in eliminating bias when the required missing-at-random assumption is violated. Nevertheless, if plausible mechanisms that produce difficulty in classifying cases are related to covariates but are not affected by the case-subtype, the EM algorithm should perform well.

While the sample sizes used in our simulations, 600 cases and 1,200 controls, are larger than those available for many case-control studies, the reader needs to be aware that detection of etiologic heterogeneity (similar to interaction) will generally require large sample sizes. Our numbers were selected on the basis of the study that inspired this work, in which about 70 percent of the classification data were missing. Clearly, fewer cases would be required if classification data were more complete, always a desirable goal.

The performance of complete-data-only, EM, and case-case methods under different missingness mechanisms is summarized in table 9. Case-case estimates and confidence intervals based on complete-data-only and EM models were comparable for all missing-data scenarios. The finding that case-case methods often gave unbiased results is understandable on theoretical grounds. If availability for classification is related to subtype (row 3), then this is analogous to the fact that sampling depends on case status in case-control data; we already know that this produces no bias (except in estimating the intercept). The finding that bias is not present when availability of classification depends on covariates is also not surprising. One can easily show that even in such a circumstance there will be no bias in the case-case odds ratio, provided availability is not differential by classification outcome, within covariate strata. This is analogous to case-control data in which participation rates can be biased by covariates (e.g., if younger people have lower response rates), but odds ratios are estimable without bias, provided the probability of sampling is the same for cases and controls within each covariate stratum (15). For case-subtype:control comparisons, the EM method is clearly preferable to the complete-data-only method when only the covariates are related to case classifiability and also, because of increased precision, in scenarios in which classification data are missing completely at random. On the other hand, complete-data-only methods are preferable in what may be a highly unusual scenario, in which the true case-subtype is the sole predictor of missingness. What is not clear is the best method to use when the probability of classification is related to the covariates and also to case-subtype conditional on covariates; under this worst-case scenario, the EM, complete-data-only, and case-case methods all produce biased estimates.
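The argument for the case-case odds ratio can be made explicit in one line. In the sketch below, $R = 1$ indicates that a case was successfully classified; the indicator $R$ and the function $\pi$ are notation introduced here for illustration.

```latex
\frac{P(Y = 1 \mid X, R = 1)}{P(Y = 2 \mid X, R = 1)}
  = \frac{P(R = 1 \mid Y = 1, X)}{P(R = 1 \mid Y = 2, X)}
    \times \frac{P(Y = 1 \mid X)}{P(Y = 2 \mid X)}
```

If the probability of classification depends on covariates only, $P(R = 1 \mid Y, X) = \pi(X)$, the first factor equals 1 and the case-case odds among classified cases match those among all cases. If it depends on subtype only, the first factor is a constant that shifts the intercept but leaves the covariate coefficients unchanged. If it depends on both, the first factor generally varies with $X$, and the case-case coefficients can be biased.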


TABLE 9. Summary of complete-data-only method, expectation-maximization method, and case:case model performance according to the distribution of missing case-subtype data

 
In practice, the choice of method for analyzing data with incomplete case-subtype information will depend on a combination of concrete data and conjecture. One may often have good a priori grounds for thinking that the mechanisms that lead to some classification data being missing are related to certain measured covariates but are probably not influenced by outcome subtype. For example, in the non-Hodgkin's lymphoma study in our example, we know that certain hospitals refused to provide tissue specimens, implying a known missing-data mechanism that would probably not imply violation of the missing-at-random assumption needed for EM. Associations between covariates and the probability of classification can be crudely empirically assessed by comparing the covariate distribution among classified cases with unclassified cases, but the distribution of case-subtypes among unclassified observations obviously is not measurable. A differential distribution of case-subtypes between classified and unclassified cases should be suspected when classification success (in particular, the availability of tissue) is related to factors that are known or suspected predictors of case-subtype, for example, survival time or clinical grade. The potential for subtype-associated differences in diagnostic practices, biopsy practices, or biopsy characteristics should also be considered.

In conclusion, we have shown that case-subtype:control odds ratio estimates may be badly biased when unclassified cases are ignored and that case:case comparisons can also be biased. We have also demonstrated that the EM method will eliminate bias secondary to incomplete case-subtype ascertainment when model assumptions are met and data are missing at random and that the EM method may also serve to increase precision. For simplicity, we used data with only two case-subtypes, but the EM method can be extended to accommodate data with three or more case-subtypes. Programs are available on request that facilitate the practical application of the EM method in this context. The EM method, therefore, has the potential to increase the validity, utility, and practicality of outcome subtyping within case-control studies by allowing full use of covariate information available for incompletely classified study participants.


    APPENDIX 1.
 
Variance estimation
The variance-covariance estimates from the final EM iteration fitted to the final pseudodata do not correctly account for the contribution of unclassified cases (16). However, an appropriate information matrix can be derived by altering the contribution of unclassified cases to the information matrix from the final iteration, as described below. The corrected information matrix is then inverted to provide a valid estimated variance-covariance matrix.

Let n be the number of unclassified case observations and p be the total number of model covariates, including a "1" coding for an intercept for each case-subtype. A matrix Im is constructed based on information from unclassified cases

(1)
where X is an n x p matrix whose rows are covariate vectors for unclassified cases, and W is an n x n diagonal matrix with the diagonal element

The contribution of the unclassified cases to the 2p x 2p information matrix I is introduced by adding Im to the pseudocomplete-data information matrix (the inverse of the software-estimated variance-covariance matrix) from the final EM iteration (If), so that I = If + Im. This corrected information matrix is then inverted to give the estimated variance-covariance matrix.
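For readers who want to see where such a correction comes from, the following is one form of $I_m$ that is consistent with the dimensions stated above and with the missing-information identity of Louis (16). It is a derivation sketched here under those assumptions and is not a transcription of equation 1. The symbols $\hat{q}_{i1}$ and $\hat{q}_{i2}$, denoting the final fractional assignments of unclassified case i to subtypes 1 and 2, are introduced for illustration.

```latex
\hat{q}_{i1} = \frac{\hat{P}(Y = 1 \mid X_i)}{\hat{P}(Y = 1 \mid X_i) + \hat{P}(Y = 2 \mid X_i)},
\qquad
\hat{q}_{i2} = 1 - \hat{q}_{i1},
\qquad
W = \operatorname{diag}(\hat{q}_{i1}\,\hat{q}_{i2}),

I_m = -\begin{pmatrix} X^{\top} W X & -X^{\top} W X \\ -X^{\top} W X & X^{\top} W X \end{pmatrix},
\qquad
I = I_f + I_m
```

The reasoning is that the pseudocomplete-data score for an unclassified case depends on the unknown subtype only through a single Bernoulli indicator with variance $\hat{q}_{i1}\hat{q}_{i2}$, which enters the two subtype-specific blocks of the score with opposite signs. The observed information equals the pseudocomplete-data information minus the variance of that score, so $I_m$ is negative semidefinite and the corrected variances are at least as large as those reported from the final pseudodata fit.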

To calculate the variance for the data example of table 1, we created the 6 x 6 pseudocomplete data matrix, If, by inverting the estimated variance-covariance matrix from the final EM iteration. Next, we derived the 6 x 6 matrix, Im, based on information from unclassified cases (equation 1). In our example, X would have been a 440 x 3 matrix whose rows are the covariate vectors for the 440 unclassified cases, and W would have been a 440 x 440 diagonal matrix, with the diagonal element based on estimated subtype probabilities from the final EM model for each unclassified case. Fortunately, we can generate Im without explicitly creating X and W, using Stata's "matrix accum" command (14). If and Im are then summed to give the corrected information matrix I, which is inverted to give the corrected estimated variance-covariance matrix. The code used to automate this process is embedded within our EM program.


    ACKNOWLEDGMENTS
 
The authors thank Drs. A. F. Olshan, G. E. Dinse, J. A. Taylor, and A. Blair for comments and support.


    NOTES
 
Correspondence to Dr. Jane C. Schroeder, Epidemiology Branch, NIEHS, P.O. Box 12233, MD A3-05, 111 TW Alexander Drive, Research Triangle Park, NC 27709 (e-mail: schroed1@niehs.nih.gov).


    REFERENCES
 

  1. Little R, Rubin D. Statistical analysis with missing data. New York, NY: John Wiley & Sons, 1987.
  2. Dubin N, Pasternack BS. Risk assessment for case-control subgroups by polytomous logistic regression. Am J Epidemiol 1986;123:1101–17.
  3. Hosmer D, Lemeshow S. Applied logistic regression. New York, NY: John Wiley & Sons, 1989.
  4. Begg CB, Zhang ZF. Statistical analysis of molecular epidemiology studies employing case-series. Cancer Epidemiol Biomarkers Prev 1994;3:173–5.
  5. Schroeder J, Olshan A, Baric R, et al. Agricultural risk factors for t(14;18) subtypes of non-Hodgkin's lymphoma. Epidemiology 2001 (In press).
  6. Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 1977;39:1–38.
  7. Fuchs C, Greenhouse JB. The EM algorithm for maximum likelihood estimation in the mover-stayer model. Biometrics 1988;44:605–13.
  8. Fitzmaurice GM, Laird NM, Lipsitz SR. Analysing incomplete longitudinal binary responses: a likelihood-based approach. Biometrics 1994;50:601–12.
  9. Wacholder S, Weinberg CR. Flexible maximum likelihood methods for assessing joint effects in case-control studies with complex sampling. Biometrics 1994;50:350–7.
  10. Schafer JL. Multiple imputation: a primer. Stat Methods Med Res 1999;8:3–15.
  11. Prentice R, Pyke R. Logistic disease incidence models and case-control studies. Biometrika 1979;66:403–11.
  12. Cantor KP, Blair A, Everett G, et al. Pesticides and other agricultural risk factors for non-Hodgkin's lymphoma among men in Iowa and Minnesota. Cancer Res 1992;52:2447–55.
  13. Magrath I. Molecular basis of lymphomagenesis. Cancer Res 1992;52(suppl 19):5529s–40s.
  14. StataCorp. Stata statistical software. College Station, TX: Stata Corporation, 1999.
  15. Weinberg CR, Sandler DP. Randomized recruitment in case-control studies. Am J Epidemiol 1991;134:421–32.
  16. Louis T. Finding the observed information matrix when using the EM algorithm. J R Stat Soc B 1982;44:226–33.
Received for publication July 12, 2000. Accepted for publication May 25, 2001.