Multiple Imputation of Baseline Data in the Cardiovascular Health Study

Alice M. Arnold and Richard A. Kronmal

From the Department of Biostatistics, University of Washington, Seattle, WA.

Received for publication November 16, 2001; accepted for publication July 22, 2002.


    ABSTRACT
 
Most epidemiologic studies will encounter missing covariate data. Software packages typically used for analyzing data delete any cases with a missing covariate to perform a complete case analysis. The deletion of cases complicates variable selection when different variables are missing on different cases, reduces power, and creates the potential for bias in the resulting estimates. Recently, software has become available for producing multiple imputations of missing data that account for the between-imputation variability. The implementation of the software to impute missing baseline data in the setting of the Cardiovascular Health Study, a large, observational study, is described. Results of exploratory analyses using the imputed data were largely consistent with results using only complete cases, even in a situation where one third of the cases were excluded from the complete case analysis. There were few differences in the exploratory results across three imputations, and the combined results from the multiple imputations were very similar to results from a single imputation. An increase in power was evident, and variable selection was simplified, when using the imputed data sets.

biometry; epidemiologic methods; imputation; missing data; regression analysis

Abbreviation: NHANES, National Health and Nutrition Examination Survey.


    INTRODUCTION
 
Most epidemiologic studies will encounter missing data. The problems that missing data present in the analysis and interpretation of results have been widely studied, as have methods for imputing missing data (1–9). Recently, software has become available to perform multiple imputations of missing data (10, 11). We describe our experience in the Cardiovascular Health Study of imputing missing data on approximately 150 variables collected at baseline. Then, with multiple copies of our filled-in baseline data available, we explored the questions of whether or not results from complete case analyses, which use observed data and delete any cases with missing data, differ from those using the imputed data sets and of how much results from a single imputation differ from those of the combined results from multiple imputation.


    BACKGROUND
 
The Cardiovascular Health Study
The Cardiovascular Health Study is a population-based study designed to identify risk factors for cardiovascular disease in individuals aged 65 or more years. In 1989, 5,201 participants were enrolled, and a supplemental cohort of 687 African Americans was added in 1992–1993. Invited participants were a random sample of Health Care Financing Administration eligibility lists and persons living in their households. Participants provided informed consent, and study methods were approved by the institutional review committees at each participating center. Details of the design and recruitment have been published (12, 13).

At the baseline visit, participants were given an extensive clinical examination that included medical and personal histories, assessment of physical functioning and activity, cognitive testing, phlebotomy, electrocardiogram, and carotid ultrasound. The original cohort also had echocardiograms and spirometry tests. Despite efforts to obtain complete data, nearly all examination components were missing data on one or more participants. The reasons range from participant refusal or inability to answer certain questions or perform some of the examination components to technical difficulties resulting in unreadable images on ultrasound or echocardiogram (14).

Missing data and multiple imputation
Missing covariate data in epidemiologic studies present several problems to the analyst including difficulties in variable selection, reduced power, and the potential for bias in the resulting estimates (1–7). For these reasons, we sought to impute missing data and to study the impact of the imputation on previously published findings from complete case analyses. We wanted the imputed data sets to be available to other analysts using Cardiovascular Health Study data, requiring that the imputation be done once centrally and not repeatedly in the context of each particular analysis.

Greenland and Finkle (6) reviewed several methods of handling missing covariates in regression analysis including stratification on missing-data status, conditional-mean imputation, and multiple imputation, and they concluded that the more complex methods of multiple imputation were preferable but challenging to implement because of a lack of software. Barnard and Meng (15) conclude that Rubin’s method of multiple imputation is "without serious competition" in incomplete-data problems when analysis files will be distributed to researchers other than those who created and maintained the database, as is the case in the Cardiovascular Health Study. Several programs are available for multiple imputation (11). We used S-PLUS software (MathSoft, Inc., Seattle, Washington) created by Dr. Joseph L. Schafer (10).

Multiple imputation has been used and reported on in the US National Health and Nutrition Examination Survey (NHANES) (16, 17). Ezzati-Rice et al. (16) performed a simulation study to demonstrate that the confidence intervals of regression estimates from multiple imputation have the correct coverage. Schafer et al. (17) discussed the imputation process in a subset of NHANES data and showed that the distributions of the imputed variables were consistent with those from the observed data. The current manuscript adds to the literature by describing the process and difficulties of imputing over 100 variables and by comparing results from complete case analyses with those from both singly and multiply imputed data sets in the realistic setting of a large epidemiologic study.


    MULTIPLE IMPUTATION METHODS
 
A full description of multiple imputation is beyond the scope of this report, but we provide a brief overview, including some key considerations for the analyst utilizing the software. The method for imputation and subsequent analysis of the filled-in data involves three steps: 1) imputing data under an appropriate model and repeating the imputation to obtain m copies of the filled-in data set; 2) analyzing each data set separately to obtain desired parameter estimates and standard errors; and 3) combining results of the m analyses by computing the mean of the m parameter estimates and a variance estimate that includes both a within-imputation and an across-imputation component.

In the first step, a group of correlated variables containing some missing values was imputed together in an iterative process that allowed the missing values for each variable to be predicted from all of the other variables in the correlated group. The model we used specified a log-linear distribution for categorical variables and a multivariate normal regression model for continuous data. The parameters included the cell probabilities for each distinct cell defined by the categorical variables and, within each cell, the mean and variance of the continuous variables. The variance-covariance matrix of the continuous variables was assumed constant across cells, whereas the mean values were cell dependent. Because the model parameters are estimated from the observed and filled-in data, the parameters themselves can be considered to have a probability distribution: a Bayesian prior distribution specified before estimating the missing data and a posterior distribution determined afterward. In the absence of any information regarding the mean and variance of the parameters, a noninformative prior is recommended and that is what we used (8). Sampling from the posterior distribution allows for adjustment of the variability of the parameter estimates for the uncertainty introduced by the missing value replacement. An assumption of this model-based method of imputation is that the missing values are missing at random; that is, the probability that a value is missing may depend on the values of other observed data but not on data that have not been observed.

Rubin (1) and Schafer (8) have shown that 3–5 imputations are usually all that is needed and, based on the minimal missingness observed in most of our variables (table 1), we chose to create three copies of the baseline data set. In step 2, sample analyses were replicated three times, using variables from each of the imputed data sets in turn. At this stage, any standard statistical software package may be used, provided the parameter estimates of interest (e.g., regression coefficients) and their standard errors can be saved. We used SPSS for Windows, version 8, software (SPSS, Inc., Chicago, Illinois). In step 3, the parameter estimates and standard errors from each of the three separate analyses were combined to give the mean of the point estimates and a standard error that accounts for the average variability observed within (W) and between (B) the separate analyses. With the statistical definition of information as the average negative second derivative of the log posterior density of the parameters, then $W/B$ estimates $(1 - \gamma)/\gamma$, where $\gamma$ is the fraction of information missing due to nonresponse, and $(1 + \gamma/m)^{-1}$ estimates the relative efficiency of an estimate based on m imputations compared with one based on an infinite number of imputations (2, 8).
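The combining step can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function name `pool_estimates` and the example numbers are hypothetical, the total variance uses the standard combining rule W + (1 + 1/m)B, and the fraction of missing information is taken from the large-sample relation between W and B given above, without the finite-m refinements that imputation software may apply.

```python
from math import sqrt
from statistics import mean, variance


def pool_estimates(estimates, std_errors):
    """Combine one coefficient's results from m imputed data sets (hypothetical helper)."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching results from at least two imputations")

    pooled = mean(estimates)                       # mean of the m point estimates
    within = mean([se ** 2 for se in std_errors])  # W: average within-imputation variance
    between = variance(estimates)                  # B: sample variance of the m estimates
    total = within + (1 + 1 / m) * between         # total variance of the pooled estimate

    # Assumption: solve W/B = (1 - gamma)/gamma for gamma; this ignores
    # finite-m corrections.
    fraction_missing = between / (within + between)

    return {
        "estimate": pooled,
        "std_error": sqrt(total),
        "within": within,
        "between": between,
        "fraction_missing": fraction_missing,
    }


if __name__ == "__main__":
    # Made-up coefficients and standard errors from three imputed data sets.
    print(pool_estimates([2.0, 3.0, 4.0], [1.0, 1.0, 1.0]))
```

Because only the saved estimates and standard errors are needed, the analysis in step 2 can be run in any package and the pooling done separately.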


 
TABLE 1. Amount of missing data in 156 variables used in imputation
 

    VARIABLE SELECTION AND DATA PREPARATION
 
Table 1 shows the amount of missing data on the variables considered in the imputation for each cohort. Although nearly three fourths of the variables were missing data for 2 percent or fewer cases in the original cohort, these were not always the same cases, so that in multivariable analyses the combined effect of the missing data is a loss of a greater percentage of cases. The data missing the most cases were from the M-mode echocardiography, where images were not readable for approximately one third of the original cohort participants.

In order to prepare the data for imputation, we needed to consider the distributional assumptions of the method and the selection of variables to impute together. We decided to impute the two Cardiovascular Health Study cohorts separately because they were enrolled in different years, had some differences in the data collected, and differed substantially in racial mix, with African Americans comprising only 4.7 percent of the original cohort. More difficult to determine was which of the variables of interest to impute together. If variables that are related are not imputed together, and then subsequently used in analyses together, the relations among them will be dampened by the fact that the imputed subset of each variable will not be related to the other variables. To our advantage is the fact that the imputed subset of most variables would be small. For practical reasons of computation time and memory requirements, the programs have a default maximal size of 30 variables to impute together, and we chose to stay within that variable limit. The richness and slight redundancy of the Cardiovascular Health Study data set allowed us to impute groups of similar and highly correlated variables together (table 2). For example, heart rate was measured at four different times during the baseline examination, and three different height measurements were taken. A wealth of covariate data was available, which became especially important when an entire group of measurements was missing. For example, 49 people from the original cohort were missing all echocardiogram data, but other deterministic variables such as sex, body size, disease status, and electrocardiogram data were available.
Covariates were selected by reviewing published papers of associations, by examining bivariate and partial correlation coefficients to find the variables most highly related to those selected for imputation, and by examining regression coefficients in models of many potential variables for inclusion. Ten separate imputations were run on the original cohort and nine on the new cohort. Once the imputations were completed, we explored correlations of variables in different blocks to determine empirically if correlations were dampened between variables not imputed together.
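The dampening of relations between variables that are not imputed together can be illustrated with a small simulation. The sketch below is a deliberately simplified stand-in for the imputation model described in this report: all names, the correlation of 0.6, and the 30 percent missingness are assumptions chosen for illustration, values are deleted completely at random, and each missing value is filled in once, either by a random draw from the observed values of the same variable (ignoring the related variable) or by a regression prediction plus a random residual (using it).

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate_dampening(n=5000, rho=0.6, frac_missing=0.3, seed=1):
    """Return correlations of x with y: full data, y imputed without x, y imputed with x."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * xi + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
    is_missing = [rng.random() < frac_missing for _ in range(n)]

    obs = [(xi, yi) for xi, yi, miss in zip(x, y, is_missing) if not miss]
    obs_x = [pair[0] for pair in obs]
    obs_y = [pair[1] for pair in obs]

    # Least-squares fit of y on x among the complete cases.
    mean_x, mean_y = sum(obs_x) / len(obs), sum(obs_y) / len(obs)
    sxx = sum((xi - mean_x) ** 2 for xi in obs_x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in obs) / sxx
    intercept = mean_y - slope * mean_x
    resid_sd = sqrt(
        sum((yi - intercept - slope * xi) ** 2 for xi, yi in obs) / (len(obs) - 2)
    )

    # (a) y imputed without reference to x: random draw from the observed y values.
    y_separate = [rng.choice(obs_y) if miss else yi for yi, miss in zip(y, is_missing)]
    # (b) y imputed using x: regression prediction plus a random residual.
    y_together = [
        intercept + slope * xi + rng.gauss(0, resid_sd) if miss else yi
        for xi, yi, miss in zip(x, y, is_missing)
    ]
    return pearson(x, y), pearson(x, y_separate), pearson(x, y_together)


if __name__ == "__main__":
    full, separate, together = simulate_dampening()
    print(f"full data: {full:.3f}  imputed apart: {separate:.3f}  imputed together: {together:.3f}")
```

In this setup the correlation computed after imputing apart shrinks roughly in proportion to the fraction of values imputed, which is why a small imputed subset limits the damage.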


 
TABLE 2. Sets of variables imputed together
 
All data were scrutinized for outliers or errors, both univariately and bivariately when another highly correlated variable was available. For example, blood pressure was measured twice with the participant seated, once standing, and once supine, allowing these values to be compared against one another. Once it was decided which variables to impute together, outliers in the multivariate space were identified by large residuals in regression analyses. Any gross outliers indicative of errors were set to missing and subsequently imputed, because they could exert undue influence on the parameter estimates and inflate the variability of the multiple imputations (18).
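A minimal sketch of this kind of screening follows, assuming a single correlated companion variable and a simple linear fit. The function name, the cutoff of four residual standard deviations, and the example values are illustrative assumptions; the report does not state the rule that was actually applied.

```python
from math import sqrt


def flag_outliers(x, y, threshold=4.0):
    """Return a copy of y with gross outliers set to None (to be imputed later).

    A value is flagged when its residual from the least-squares line of y on x
    exceeds `threshold` residual standard deviations (an assumed cutoff).
    """
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least three paired measurements")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    resid_sd = sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return [
        None if abs(r) > threshold * resid_sd else yi
        for yi, r in zip(y, residuals)
    ]


if __name__ == "__main__":
    # Made-up seated and standing systolic pressures; one standing value is mis-keyed.
    seated = [float(v) for v in range(100, 200)]
    standing = [s - 2 + (i % 5) for i, s in enumerate(seated)]
    standing[40] = 14.0
    cleaned = flag_outliers(seated, standing)
    print([i for i, v in enumerate(cleaned) if v is None])
```

Flagged values should still be reviewed before being set to missing, since a large residual can reflect a real but unusual measurement.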

Continuous variables that were not normally distributed were transformed. Careful consideration was given to the choice of variables to be considered categorical in the program, since the choice of categorical variables determined the number of cells or stratification of the data within which estimation of the mean value of continuous covariates would occur. Unordered categorical variables such as clinic site needed to be modeled categorically; others could be modeled continuously and then rounded after imputation. Studies have shown that the programs are quite robust to modeling ordered categorical variables or indicator variables as continuous (8). Biologic rather than statistical considerations often influenced the selection of the categorical variables for stratification, at times choosing variables which contained no missing data but which could influence the mean of other variables in the imputation, for example, sex, race, or the presence of cardiovascular disease.
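The transform-then-round handling described above might look like the following sketch. The log transformation is only one possible choice (the report does not say which transformations were used), and the helper names and category codes are hypothetical.

```python
from math import exp, log


def to_model_scale(value):
    """Log-transform a right-skewed, strictly positive measurement before imputation."""
    return log(value)


def from_model_scale(value):
    """Back-transform an imputed value to the original measurement scale."""
    return exp(value)


def round_to_category(imputed, categories):
    """Map a continuous imputed value to the nearest valid category code."""
    return min(categories, key=lambda code: abs(code - imputed))


if __name__ == "__main__":
    # An ordered five-level item and a 0/1 indicator, both modeled as continuous.
    print(round_to_category(2.4, [1, 2, 3, 4, 5]))
    print(round_to_category(-0.3, [0, 1]))
    print(from_model_scale(to_model_scale(150.0)))
```

Rounding to the nearest valid code also handles imputed values that fall outside the observed range, such as a negative value for a 0/1 indicator.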


    IMPUTATION RESULTS
 
Univariate distributions of the imputed variables were consistent with those of the observed data. Bivariate correlations among 16 selected variables representing each of the different imputation blocks were compared pre- and postimputation. Of the 87 pairs of variables in different blocks, complete case correlations were less than 0.1 for 58 pairs (67 percent). Twenty-one pairs had correlations between 0.1 and 0.2, and five had correlations between 0.2 and 0.3. Three pairs had correlations between 0.3 and 0.5, and their pre- and postimputation correlations differed by 0, 0.014, and 0.003. Of the 29 pairs with correlations greater than 0.1, the maximum difference between the complete case and imputed correlations was 0.025 for two pairs. All other differences were 0.015 or less.


    RESULTS IN COMPARATIVE ANALYSES
 
We replicated results from several previously published reports, using both singly and multiply imputed data. We present results from three analyses: 1) a stroke prediction model based on 7.5 years of follow-up (19), 2) a linear regression of left ventricular mass based on a model by Gardin et al. (20), and 3) a survival analysis of mortality in the smaller, African-American cohort, using results from a study of 5-year mortality in the original cohort (21). The complete case results reported here differ slightly from those published because of continual updates in our database and, in the model for left ventricular mass, because of a decision to incorporate only the nonechocardiographic predictors into the current model.

The stroke prediction model used variables measured at baseline to predict future stroke among participants with no history of stroke at baseline. Variables in the model are shown in table 3. The variable that accounted for eliminating the most participants from the complete case analysis was left ventricular mass, which was missing on 34 percent of the participants at risk. Covariate values were compared for participants included in and excluded from the complete case analysis, using one set of imputed values to estimate the missing data for those excluded (table 3). Significant differences were found for all variables except regular aspirin use. Those excluded from the complete case analysis were older, more likely to be male and, in general, sicker than those included, illustrating the potential bias in using complete case prevalences to estimate population prevalences.


 
TABLE 3. Comparison of variables in stroke risk model by completeness of data
 
Table 4 displays results of the stroke prediction model for the complete case analysis (model 1, n = 3,088) and for two analyses using imputed data (n = 5,002), representing a single imputation (model 2) and the combined results from three imputations (model 3). The addition of the extra participants in the imputed data models resulted in tighter confidence bounds and smaller p values, most noticeably for categories with few participants represented by the group aged 85 years and older, those with frequent falls, and those with abnormal left ventricular wall motion. The hazard ratios were similar between the observed and imputed models, with the exception of the highest category of carotid stenosis, which was not precisely estimated in any model because of the small number of participants with this degree of stenosis (1 percent of the cohort). The complete case point estimate fell within the 95 percent confidence interval of the estimate from the imputed data set. There were no differences in results between the single imputation and the combined results from three imputations. The fraction of missing information was less than 6 percent for all variables except left ventricular mass, for which it was 27 percent. Therefore, the relative efficiency of the estimates based on three imputations compared with those based on an infinite number of imputations was 98 percent or more for all but left ventricular mass, for which it was 92 percent, suggesting that three imputations were adequate.
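The efficiency figures quoted here can be checked against the relative efficiency expression given in the methods section, with three imputations and the reported fractions of missing information. The symbol RE is introduced only for this worked example.

```latex
\mathrm{RE}(\gamma, m) = \left(1 + \frac{\gamma}{m}\right)^{-1}, \qquad
\mathrm{RE}(0.06, 3) = \frac{1}{1.02} \approx 0.98, \qquad
\mathrm{RE}(0.27, 3) = \frac{1}{1.09} \approx 0.92
```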


 
TABLE 4. Multiple imputation of risk factors for stroke: hazard ratios and confidence intervals from Cox models
 
Tables 5, 6, and 7 display the results for modeling left ventricular mass by linear regression. Table 5 contains the means and frequencies of variables for those included in versus excluded from the complete case analysis. Men were less likely than women to be included in the analysis. Participants excluded from the complete case analysis were older and had slightly higher mean systolic blood pressure than those included. The women excluded were heavier, had lower mean levels of high density lipoprotein, and were more likely to have a history of hypertension or a minor electrocardiogram abnormality. The men who were excluded were more likely to have had a previous myocardial infarction.


 
TABLE 5. Variables in left ventricular mass model for cases included versus excluded from complete case analysis
 
Table 6 shows results from a complete case analysis against those from the combined multiple imputation. Given the number of cases omitted from the complete case analysis and the apparent differences in health status between the two groups, the results are surprisingly similar. Only one variable, total cholesterol, goes from highly statistically significant to not significant. Although the point estimate for age is higher in the imputation model, it is less significant because of variability across imputations (table 7). The coefficient for current smoking also varies substantially across imputations, ranging from 1.9 to 3.9, which is less than the value of 4.2 from the complete case analysis. There were no significant differences in cholesterol or smoking between those included in or excluded from the complete case analysis, even though these were the two variables with the greatest difference in regression coefficients between the complete case analysis and the imputation analysis. The fraction of missing information ranges from 7 percent to 68 percent, with the largest fraction associated with the coefficient on age. The relative efficiency of the estimate for age from the combined imputations was 82 percent.


 
TABLE 6. Comparison of complete case and multiple imputation model results for left ventricular mass
 

 
TABLE 7. Comparison of results from three imputations of left ventricular mass
 
Finally, we present an example using fewer cases and a frequently used modeling strategy of backward selection. Using predictors of mortality in the original cohort (21), we explored their significance as predictors of death in the smaller African-American cohort. The variables considered are presented in table 8, and those remaining in the backward selection model for the complete case and imputed data analyses are shown in table 9. The same set of variables was selected for each of the three imputed data sets, and these did not coincide with those selected using the complete cases. When the variables chosen by the selection procedure on the imputed data set were entered into a Cox model for the observed data, point estimates were similar, but not all variables attained statistical significance. There was little difference between the results for a single imputation and those for the combined multiple imputation (table 10), and the fraction of missing information was less than 10 percent for all covariates.


 
TABLE 8. Comparison of data by completeness status: death in the African-American cohort
 

 
TABLE 9. Comparison of results from backward elimination procedure: death in the African-American cohort
 

 
TABLE 10. Comparison of results for variables found significant in imputed data set: death in the African-American cohort
 

    DISCUSSION
 
Using available statistical software, we imputed missing baseline data on over 150 variables in the Cardiovascular Health Study. The process involved detailed explorations of the data in order to select from among dozens of correlated variables the ones to impute together, to identify gross outliers, to transform continuous variables that were not normally distributed in order to satisfy the model assumptions of the imputation method, and to decide which variables to treat categorically, defining the cells within which the continuous variables would have a common mean value. We created three filled-in copies of the baseline data and used these to replicate previously published results based on analyses of observed data. We also compared covariate values for those included in versus excluded from the complete case analyses and found significant differences on most variables. Despite these differences, the results of the complete case analyses and the analyses using imputed data were similar. Results from a single imputed data set differed little from the combined results of three imputations. Bivariate correlations of imputed variables in different blocks were similar to the complete case correlations.

The consistency of our results in comparative analyses is not unexpected, given that data were missing on 5 percent or fewer cases for approximately 85 percent of the variables imputed, implying that the imputed subset for most variables is relatively small. We explored many more models than are presented and found no greater differences than those reported, either between the observed and imputed data analysis results or across results from multiple imputations. We included left ventricular mass in two of the models reported because it was missing on 35 percent of the original cohort, therefore presenting what we believe may be a worst case scenario in the Cardiovascular Health Study. As a predictor in the stroke model, the hazard ratio for elevated left ventricular mass was similar across all models. When left ventricular mass was the outcome variable, there were some differences between the complete case and imputation results. Age and cholesterol were no longer significant in the imputed model, and the coefficient for smoking was reduced by 35 percent compared with the complete case model. The relative efficiency of 82 percent for the estimated coefficient of age quantifies the variability across imputations and indicates the desirability of more than three imputations with this amount of missing data.
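How many imputations would be desirable can be gauged from the same relative efficiency expression, (1 + gamma/m)^-1. The sketch below is illustrative only: the function names are hypothetical, and the 95 percent target is an assumed threshold, not one proposed in this report.

```python
def relative_efficiency(gamma, m):
    """Efficiency of m imputations relative to infinitely many: 1 / (1 + gamma / m)."""
    if not 0 <= gamma <= 1 or m < 1:
        raise ValueError("gamma must lie in [0, 1] and m must be at least 1")
    return 1 / (1 + gamma / m)


def imputations_needed(gamma, target):
    """Smallest m whose relative efficiency reaches the target (0 < target < 1)."""
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    m = 1
    while relative_efficiency(gamma, m) < target:
        m += 1
    return m


if __name__ == "__main__":
    # Fraction of missing information reported for the age coefficient: 68 percent.
    for m in (3, 5, 10, 20):
        print(m, round(relative_efficiency(0.68, m), 3))
    print(imputations_needed(0.68, 0.95))
```

With 68 percent missing information, three imputations give the 82 percent efficiency reported above, and under the assumed 95 percent target the loop stops at 13 imputations.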

The strengths of our imputation approach are its comprehensiveness and generality. The intent is for all Cardiovascular Health Study analysts to be able to utilize the imputed data without having to run separate imputations for each analysis. There are also limitations associated with our approach. We did not include any follow-up variables in the imputations, leaving the potential for missing associations with variables such as stroke and death (8, 9). Our exploratory analyses of these variables and the fact that most covariates were missing values for fewer than 5 percent of the participants suggest that it is unlikely that important associations will be missed. We chose not to include outcomes in the baseline data imputation, because event data collection is ongoing and because data would have been imputed on the basis of only the earliest outcomes. Similarly, associations among correlated variables not in the same block have the potential of being dampened because they were not imputed together. Our postimputation exploration of correlations across blocks is reassuring in this regard. To further examine the issue, we investigated the five variables in the regression model for left ventricular mass in table 6 that were not included in the imputation of left ventricular mass. Those variables were total and high density lipoprotein cholesterol, smoking, diastolic blood pressure, and minor electrocardiogram abnormality, which, although correlated with left ventricular mass, did not add significantly to its prediction given the other variables in the block of echocardiography data, which accounted for 91 percent of the variability in left ventricular mass. To quantify the effect of omitting these covariates from the imputation, we reimputed left ventricular mass adding these five variables and reran the multiple imputation model in table 6. The coefficient for smoking increased from 2.7 to 4.0, closer to the complete case value, but remained nonsignificant. 
The significance of the correlated variables high density lipoprotein and total cholesterol alternated, with high density lipoprotein cholesterol becoming less significant (β = –0.10 (standard error, 0.06); p = 0.17) and total cholesterol becoming more significant (β = –0.059 (standard error, 0.02); p = 0.001). In view of these results, we suggest that, to tease out associations involving highly correlated covariates in a regression analysis of a variable that has a substantial percentage of missing data, a model-specific imputation be done. For the majority of analyses in the Cardiovascular Health Study, our explorations suggest that the centrally created imputed data sets would preserve associations while increasing power.

There are advantages to using the imputed data in terms of power and variable selection. The sample size in the Cardiovascular Health Study is large enough that inadequate power is rarely a concern. In smaller studies or in subset analyses, the increase in power from using imputed data may be substantial. In model selection, identifying an optimal set of covariates from among the many correlated variables collected in the Cardiovascular Health Study and other large, observational studies is always a challenge. Missing data complicate that process and may influence the choice of variables to consider, with a preference for excluding from consideration those missing many cases. With filled-in data available on all cases, variables need not be eliminated because of missing data, and models resulting from different variable groups would always include the same cases, providing consistency in numbers reported within a paper or between papers on the same study.

We would like to encourage investigators in epidemiologic studies to avail themselves of the programs available (10) for state-of-the-art imputation of missing data. Although the data preparation and variable selection are time consuming, much of that work is done in the context of data analysis. The programs themselves run very quickly, and the method has been shown to be superior to other methods of imputation. As Rubin has stated, the multiple imputation method provides statistically valid inferences in the challenging setting where ultimate users of the data are not the database constructors, where a variety of analyses and models will be used, and where there is no one reason for the missing data (5). In the setting of a large, observational study where it would be impractical to impute all data in one large model, we have demonstrated an approach to creating multiple imputed data sets. In observational studies with minimal missing data and with no reason to suspect that data are not missing at random, our exploration provides some reassurance that findings published prior to implementation of missing data replacement would not have changed much had an optimal method for missing data imputation been used.


    ACKNOWLEDGMENTS
 
The research reported in this article was supported by contracts N01-HC-85079 through N01-HC-85086, N01-HC-35129, and N01-HC-15103 from the National Heart, Lung, and Blood Institute.

The following institutions and principal investigators participated in this study: Wake Forest University School of Medicine, Dr. Gregory L. Burke; Wake Forest University—Electrocardiogram Reading Center, Dr. Pentti Rautaharju; University of California, Davis, Dr. John Robbins; The Johns Hopkins University, Dr. Linda P. Fried; The Johns Hopkins University—MRI Reading Center, Dr. Nick Bryan and Dr. Norm J. Beauchamp; University of Pittsburgh, Dr. Lewis H. Kuller; University of California, Irvine—Echocardiography Reading Center (baseline), Dr. Julius M. Gardin; Georgetown Medical Center—Echocardiography Reading Center (follow-up), Dr. John Gottdiener; New England Medical Center, Boston—Ultrasound Reading Center, Dr. Daniel H. O’Leary; University of Vermont—Central Blood Analysis Laboratory, Dr. Russell P. Tracy; University of Arizona, Tucson—Pulmonary Reading Center, Dr. Paul Enright; Retinal Reading Center, University of Wisconsin, Dr. Ron Klein; University of Washington—Coordinating Center, Dr. Richard A. Kronmal; National Heart, Lung, and Blood Institute Project Office, Dr. Diane Bild.


    NOTES
 
Correspondence to Dr. Alice Arnold, Collaborative Health Studies Coordinating Center, Building 29, Suite 310, 6200 NE 74th Street, Seattle, WA 98115 (e-mail: arnolda@u.washington.edu).


    REFERENCES

  1. Rubin DB. Multiple imputation for nonresponse in surveys. New York, NY: Wiley, 1987.
  2. Little RJA, Rubin DB. Statistical analysis with missing data. New York, NY: Wiley, 1989.
  3. Rubin DB, Schenker N. Multiple imputation in health-care databases: an overview and some applications. Stat Med 1991;10:585–98.
  4. Little RJA. Regression with missing x’s: a review. J Am Stat Assoc 1992;87:1227–37.
  5. Rubin DB. Multiple imputation after 18+ years. J Am Stat Assoc 1996;91:473–89.
  6. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analysis. Am J Epidemiol 1995;142:1255–64.
  7. Vach W, Blettner M. Biased estimation of the odds ratio in case-control studies due to the use of ad hoc methods of correcting for missing values for confounding variables. Am J Epidemiol 1991;134:895–907.
  8. Schafer JL. Analysis of incomplete multivariate data. New York, NY: Chapman & Hall, 1997.
  9. Schafer JL. Multiple imputation: a primer. Stat Methods Med Res 1999;8:3–15.
  10. Schafer JL. Software for multiple imputation. University Park, PA: The Pennsylvania State University Department of Statistics, 1999. (http://www.stat.psu.edu/~jls/misoftwa.html).
  11. van Buuren S, Oudshoorn K, eds. Multiple imputation online. Leiden, Netherlands: TNO Prevention and Health, Department of Statistics, 2001. (http://www.multiple-imputation.com).
  12. Fried LP, Borhani NO, Enright PL, et al. The Cardiovascular Health Study: design and rationale. Ann Epidemiol 1991;1:263–76.
  13. Tell GS, Fried LP, Hermanson B, et al. Recruitment of adults 65 years and older as participants in the Cardiovascular Health Study. Ann Epidemiol 1993;3:358–66.
  14. Manolio T, Hermanson B, Hill J, et al. Respondent burden in studies of the elderly: experience from the Cardiovascular Health Study (CHS). Inclusion of elderly individuals in clinical trials. In: Proceedings of an American College of Cardiology Workshop, 1993:135–47. Seattle, WA: The Cardiovascular Health Study, 2001. (http://chs3.chs.biostat.washington.edu/chs/abstract/maan93.htm).
  15. Barnard J, Meng XL. Applications of multiple imputation in medical studies: from AIDS to NHANES. Stat Methods Med Res 1999;8:17–36.
  16. Ezzati-Rice TM, Johnson W, Khare M, et al. A simulation study to evaluate the performance of model-based multiple imputations in NCHS Health Examination Surveys. In: Proceedings of the Bureau of the Census 11th Annual Research Conference. Washington, DC: US Department of Commerce, 1995:257–66.
  17. Schafer JL, Khare M, Ezzati-Rice TM. Multiple imputation of missing data in NHANES III. In: Proceedings of the Bureau of the Census Ninth Annual Research Conference. Washington, DC: US Department of Commerce, 1993:459–87.
  18. Schafer JL, Olsen MK. Multiple imputation for multivariate missing-data problems: a data analyst’s perspective. Multivariate Behav Res 1998;33:545–71.
  19. Manolio TA, Kronmal RA, Burke GL, et al. Short term predictors of incident stroke in older adults. The Cardiovascular Health Study. Stroke 1996;27:1479–86.
  20. Gardin JM, Arnold A, Gottdiener JS, et al. Left ventricular mass in the elderly. The Cardiovascular Health Study. Hypertension 1997;29:1095–103.
  21. Fried LP, Kronmal RA, Newman AB, et al. Risk factors for 5-year mortality in older adults. The Cardiovascular Health Study. JAMA 1998;279:585–92.