1 Division of Pharmacoepidemiology and Pharmacoeconomics, Brigham and Women's Hospital and Harvard Medical School, Boston, MA.
2 Department of Epidemiology, Harvard School of Public Health, Boston, MA.
3 Pharmacare, Ministry of Health, British Columbia, Canada.
4 Department of Preventive Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA.
ABSTRACT
comorbidity; confounding factors (epidemiology); databases; epidemiologic studies; health services
Abbreviations: CDS, Chronic Disease Score; ICD-9, International Classification of Diseases, Ninth Revision; OR, odds ratio; RR, relative risk; SD, standard deviation
INTRODUCTION
The predictive performance of claims-based comorbidity scores depends on several factors, including 1) the clinical conditions included in a score and their relative weights; 2) the distribution of comorbid conditions in the source population; 3) the endpoint of a study, for example, 1-year mortality; and 4) the accuracy of the administrative data (3). The predictive performance of two scores can validly be compared when factors 2–4 are held constant. Several studies have explored the predictive validity of comorbidity measures in claims data (4–13). However, only a few publications compared the performance of two comorbidity scores in the same populations and for the same endpoints (11, 12, 14). We are unaware of any direct comparison of medication-based versus diagnosis-based scores or of more than two scores in the same population.
In this study, we compared the performance of six claims-based comorbidity scores in predicting 1-year mortality, long-term-care admission, number of hospitalizations, physician visits, and expenditures for physician services. The study population was a cohort of British Columbia, Canada, residents aged 65 years or more who had hypertension.
MATERIALS AND METHODS
Scores
Original research on the metric properties of comorbidity indices for claims data was identified by a literature search using MEDLINE (National Library of Medicine, Bethesda, Maryland) and HealthStar (HealthStar, Inc., Long Beach, California) databases, bibliographies, and expert consultations. We identified six distinct indices of comorbidity for use in administrative databases (4, 7–9, 11, 14). Four of the six scores use diagnostic information from International Classification of Diseases, Ninth Revision (ICD-9) codes and are based on the Charlson index originally designed for clinical data (16). Two of the scores are based on outpatient drug utilization data.
Diagnosis-based scores. The Charlson index is a list of 19 conditions; each is assigned a weight (1 to 6). The Charlson index score is the sum of the weights for all conditions that a patient has. Although the index might seem rather simple, it was associated with a 2.3-fold (95 percent confidence interval: 1.9, 2.8) increase in the 10-year risk of death per increment in comorbidity level in a cohort of 685 breast cancer patients (16), and similar results were found for postoperative survival in patients with hypertension or diabetes (17).
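The summation described above can be sketched in a few lines of Python. This is only an illustration: the dictionary holds a subset of the 19 Charlson conditions with their commonly cited weights, and the condition labels and the function name `charlson_score` are hypothetical names chosen for this example. A real implementation would first map ICD-9 codes to conditions, which is where the Deyo, Romano, D'Hoore, and Ghali adaptations differ.

```python
# Illustrative subset of Charlson conditions and weights (the full index
# lists 19 conditions weighted 1 to 6). Labels are hypothetical.
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "diabetes": 1,
    "hemiplegia": 2,
    "moderate_or_severe_renal_disease": 2,
    "moderate_or_severe_liver_disease": 3,
    "metastatic_solid_tumor": 6,
    "aids": 6,
}


def charlson_score(conditions, weights=CHARLSON_WEIGHTS):
    """Sum the weights of the distinct conditions recorded for one patient.

    Conditions that are not in the weight table contribute 0, and a
    condition recorded several times is counted once.
    """
    return sum(weights.get(condition, 0) for condition in set(conditions))


# A patient with diabetes, hemiplegia, and a metastatic solid tumor:
print(charlson_score(["diabetes", "hemiplegia", "metastatic_solid_tumor"]))  # 1 + 2 + 6 = 9
```
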
For the Deyo and Romano implementations of the Charlson index, we used the corresponding sets of five-digit ICD-9-CM (Clinical Modification) diagnoses, as delineated in these authors' original publications (5, 8). These two scores differ only modestly in the ICD-9-CM codes that map the Charlson index conditions (5).
For the D'Hoore implementation of the Charlson comorbidity index, we used the first three digits of the ICD-9 code, as described by D'Hoore et al. (9). The Ghali adaptation of the Charlson index was calculated with the reduced set of diagnoses specified by Ghali et al. (11).
The four scores were calculated by using ICD-9 codes derived from all hospital discharges, which can contain up to 16 diagnoses. In addition to these original scores based on hospitalization only, we also calculated scores based on the diagnoses associated with all inpatient and outpatient physician services or procedures received during the baseline year.
The original Charlson weights were applied to the Deyo, Romano, and D'Hoore scores. The published weights were applied to the Ghali score (refer to table 4 in reference 11).
Prescription-medication-based scores. For the Chronic Disease Score (CDS), outpatient pharmacy dispensing data are used to assign patients to chronic disease groups. An integer weight is given to each comorbidity category represented by selected medication classes, and all weights are summed to obtain an overall score. The CDS was developed by an interdisciplinary expert group of researchers and practitioners and was refined after several pilot studies. CDS-1 was tested among 122,911 Group Health Cooperative (Washington State) enrollees. A multivariate logistic regression model showed that with an increasing CDS-1 score, the probabilities of 1-year hospitalization and of 1-year mortality increased steadily. Compared with patients who were in the lowest CDS-1 score category, those in the highest category (7+) had a 10-fold higher probability of dying in the next year. An extended version of the score, CDS-2 (14), was designed specifically to predict future health care utilization.
To calculate the CDS-1 score, we followed the original coding (7). For the CDS-2 score, the published weights used to predict primary care visits were adopted (14). Drugs that have become available since 1992 were assigned to an appropriate category based on the condition for which the medication is prescribed. For example, only cimetidine was originally specified as an indicator for ulcer disease, and we expanded this list to include any histamine H2 antagonist or proton pump inhibitor. For drugs that were available when the score was developed but for which their indications have since been expanded to include one of the scored chronic diseases, the disease categories were not changed for that drug (e.g., methotrexate for cancer but now used more frequently for rheumatoid arthritis).
We used number of distinct prescription drugs (distinct chemical entities) dispensed during the baseline year as a crude comorbidity measure. Medications whose first eight digits of the American Hospital Formulary Services code (18) were equal were considered the same substance.
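As a rough sketch of this counting rule, the snippet below treats two dispensings as the same substance when the first eight digits of their codes match. The code strings are made-up placeholders rather than real American Hospital Formulary Services codes, and the function name `count_distinct_drugs` is a hypothetical name for this example.

```python
def count_distinct_drugs(drug_codes, prefix_length=8):
    """Count distinct substances among one patient's dispensings.

    Two dispensings count as the same substance when the first
    `prefix_length` digits of their codes are equal. Non-digit
    characters (separators) are ignored before comparing.
    """
    prefixes = set()
    for code in drug_codes:
        digits = "".join(ch for ch in code if ch.isdigit())
        if digits:
            prefixes.add(digits[:prefix_length])
    return len(prefixes)


# Placeholder codes: the first two share an eight-digit prefix, so the
# patient received two distinct substances during the baseline year.
baseline_dispensings = ["2408200001", "2408200002", "2812080001"]
print(count_distinct_drugs(baseline_dispensings))  # 2
```
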
Other utilization measures. Two other simple utilization measures were also considered as predictors: 1) Number of hospitalizations for any reason and any length during the baseline year. Elective hospitalizations and unplanned emergency hospitalizations were differentiated. 2) Number of physician visits for any reason during the baseline year.
Endpoints
The primary endpoint was mortality during the follow-up year. Secondary endpoints were long-term-care admissions, hospitalizations (elective and emergency), number of physician visits (including services in hospitals), and expenditures for physician services during the follow-up year. Expenditures were measured by payments by the provincial government. For patients who left the cohort for reasons other than dying during the follow-up year, numbers of physician visits and expenditures were extrapolated to an annual count (19). The rate of emigration from British Columbia is very low among residents aged 65 years or more (20).
Data quality
In British Columbia, pharmacists enter pharmacy-dispensing data, including medication, strength, and number of units, into a computer network when a prescription is filled, and underreporting and misclassification appear to be minimal (21). Although previous reports indicate reasonable levels of accuracy and completeness of diagnostic coding (22), misclassification of ICD-9 diagnoses is probably similar to that found in research in which other administrative databases are used (23–26). British Columbia pays all medication and medical services costs for residents aged 65 years or more. Data on medical services include accurate information on the amount paid in Canadian dollars.
Data analysis
For each endpoint, three baseline regression models were fitted to the data by modeling endpoints as a function of age, gender, and age plus gender combined. For each of the six comorbidity scores, models were constructed containing only the score as well as the score plus age and gender. Dichotomous endpoints (mortality, long-term-care admissions) were modeled by fitting logistic regression models; c statistics (i.e., the area under the receiver operating characteristic (ROC) curve) were calculated as measures of discrimination (27). The c statistic ranges from 0 to 1, with 1 indicating a perfect prediction and 0.5 a chance prediction; for example, the Framingham Heart Study could predict the incidence of coronary heart disease based on age, blood pressure, smoking, diabetes, and low density and high density lipoprotein cholesterol levels with a c statistic of 0.77 (28). It has been suggested that c statistics of 0.7–0.8 could be considered acceptable and those of 0.8–0.9 excellent (29); higher values are rarely observed and are described as outstanding. Asymptotic 95 percent confidence limits were reported for c statistics (30). Because multiple hospitalizations occurred in less than 5 percent of patients during follow-up, we categorized patients as those without and those with one or more hospitalizations. For continuous outcomes (expenditures for physician services), we fitted linear regression models and reported R² statistics to reflect the proportions of explained variance (31). Since number of physician visits per year varied widely around a mean of 10.9 (standard deviation (SD), 12.4), we considered it a continuous variable. Expenditure and visit data were considerably skewed to the right and therefore were log-transformed (32). Predictive performance should not be compared across outcomes but across scores within outcomes. Spearman's correlation coefficients with two-sided p values were calculated among scores and utilization measures during the baseline year.
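To make the c statistic concrete, the following minimal Python sketch computes it directly from its pairwise definition: the proportion of (event, non-event) patient pairs in which the patient with the event has the higher predicted value, with ties counted as one half. The function name and the toy data are assumptions made for illustration; this is not the software used in the study, and for large cohorts a rank-based routine would be far faster than this quadratic loop.

```python
def c_statistic(predicted, outcomes):
    """Area under the ROC curve via pairwise concordance.

    `predicted` holds scores or fitted probabilities; `outcomes` holds
    1 for patients with the event (e.g., death) and 0 otherwise.
    """
    events = [p for p, y in zip(predicted, outcomes) if y == 1]
    non_events = [p for p, y in zip(predicted, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1.0   # pair ranked correctly
            elif p_event == p_non_event:
                concordant += 0.5   # tie counts as half
    return concordant / (len(events) * len(non_events))


# Toy example: 5 of the 6 event/non-event pairs are ranked correctly.
toy_scores = [0, 1, 2, 3, 4]
toy_deaths = [0, 0, 1, 0, 1]
print(c_statistic(toy_scores, toy_deaths))
```

A value of 0.5 results when the score carries no information (for example, when every patient has the same score), and 1.0 when every patient with the event outranks every patient without it.
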
Another way to quantify the performance of scores is to estimate how much confounding by comorbidity would be avoided by adjusting for each of the six scores, assuming an underlying null association between an exposure and outcome. Since the scores represent measurable confounding by comorbidity, it can be controlled for in stratified analyses. The true amount of confounding caused by comorbidity might be larger but remains unknown. The difference in confounding that can be adjusted between scores reflects the relative capacity of each score to adjust for confounding. Scores that perform equally may do so by controlling for different qualities of comorbidity; that is, the scores are not necessarily nested within each other.
To determine how much confounding would be avoided by adjusting for each score, we used actual outcome data and the observed associations between scores and outcome, and we considered various assumptions about the prevalence of exposure and the exposure-comorbidity association. For simplicity, we assumed a dichotomous exposure and a dichotomous comorbidity measure. The apparent or crude relative risk (RR) of an exposure (E)-outcome (O) association in the presence of confounding (crude RR_EO) is related to the associations between confounder (C) and exposure (OR_EC) as well as confounder and outcome (RR_CO; refer to the Appendix). To ensure comparability, and on the basis of the observed distribution of scores, we dichotomized all six scores by choosing cutpoints closest to the 75th percentile. That is, only 25 percent of patients with the highest scores were coded as having a notable degree of comorbidity. We used the observed prevalence of comorbidity Pr(C) and the observed confounder-mortality association and varied OR_EC from 0.2 to 8. The prevalence of exposure Pr(E) was varied between 0.1 and 0.3. The underlying exposure-outcome association was assumed to be constant, with RR_EO = 1.
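The calculation outlined above can be sketched as follows. This is an illustration under stated assumptions, not the authors' program: it assumes a single dichotomous confounder whose effect on the outcome is the same in exposed and unexposed patients, obtains the joint prevalence e of exposure and confounder from the exposure-confounder odds ratio and the two marginal prevalences by solving a quadratic equation, and then forms the crude relative risk. The function names and the value RR_CO = 3.0 are hypothetical.

```python
import math


def joint_prevalence(or_ec, pr_e, pr_c):
    """Joint prevalence e = Pr(E and C) implied by OR_EC, Pr(E), Pr(C)."""
    if math.isclose(or_ec, 1.0):
        return pr_e * pr_c  # exposure and confounder are independent
    a = or_ec - 1.0
    b = 1.0 + a * (pr_e + pr_c)
    discriminant = b * b - 4.0 * a * or_ec * pr_e * pr_c
    # The smaller root is the one lying between 0 and min(Pr(E), Pr(C)).
    return (b - math.sqrt(discriminant)) / (2.0 * a)


def crude_rr(or_ec, rr_co, pr_e, pr_c, rr_eo=1.0):
    """Crude exposure-outcome relative risk in the presence of confounding."""
    e = joint_prevalence(or_ec, pr_e, pr_c)
    pr_c_in_exposed = e / pr_e
    pr_c_in_unexposed = (pr_c - e) / (1.0 - pr_e)
    # Risks relative to unexposed patients without the confounder.
    risk_exposed = rr_eo * (pr_c_in_exposed * rr_co + (1.0 - pr_c_in_exposed))
    risk_unexposed = pr_c_in_unexposed * rr_co + (1.0 - pr_c_in_unexposed)
    return risk_exposed / risk_unexposed


# Null exposure effect (RR_EO = 1), Pr(C) = 0.25, Pr(E) = 0.2, and a
# hypothetical confounder-outcome relative risk of 3.0.
for or_ec in (0.2, 0.5, 1.0, 2.0, 8.0):
    print(or_ec, round(crude_rr(or_ec, rr_co=3.0, pr_e=0.2, pr_c=0.25), 3))
```

With OR_EC = 1 the crude relative risk equals the true value of 1; it moves away from 1 as the exposure-confounder association strengthens, which is the bias that adjustment for a score is meant to remove.
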
RESULTS
Performance of scores based on hospital and ambulatory ICD-9 codes was only slightly better than using hospital discharge codes alone (table 4). We observed a 1.3 percent improvement when the Romano score was based on both ambulatory and hospital data (c = 0.770) rather than on hospital discharge diagnoses alone (c = 0.757). Only number of distinct diagnoses performed better when hospital discharge diagnoses, and not ambulatory codes, were used.
In the regression analyses, each score and age were modeled as linear terms. When age and the scores were divided into tertiles and were included in the models as ordinal variables, their predictive performance for mortality decreased marginally (<0.5 percent). Scores were also divided into two categories, with cutpoints chosen to be closest to the 75th percentile. Doing so decreased performance an average of 1.7 percent except for the D'Hoore score, which decreased by 2.7 percent (table 3). When quadratic terms of the scores were added to regression models, the c statistics improved less than 0.5 percent for all scores except those for the CDS-2, which improved by 0.9 percent.
Because ICD-9-based and medication-based scores were not strongly correlated, we fitted models including both types of scores to improve the predictive value (table 5). Combining the CDS-1 with ICD-9-based scores improved the prediction for all outcomes. The improvements in c statistics were smaller for ICD-9-based scores (e.g., Deyo + CDS-1 = 2 percent; Romano + CDS-1 = 1.7 percent at predicting mortality) but larger for medication-based scores (e.g., CDS-1 + Romano = a 6.2 percent improvement over CDS-1 alone). The combination of ICD-9-based scores and number of medications performed as well as or better than the combination of ICD-9-based scores and CDS-1 score (table 5).
DISCUSSION
The enhanced Chronic Disease Score (CDS-2), which was designed to predict future physician visits, performed better than its predecessor (CDS-1) in predicting visits and expenditures. However, both were outperformed by number of distinct medications received during the baseline year, which was the best predictor of future physician services and expenditures. This simple count also performed better than both CDSs in predicting mortality, hospitalizations, and long-term-care admissions, perhaps because conversion of number of distinct drugs into number of chronic diseases involves loss of information on disease severity. Although the CDS considers multiple drug therapy versus monotherapy of heart disease and respiratory illness, it fails to do so for other diagnoses and does not account for medication changes as disease progresses.
Zhang et al. (35) suggested combining multiple Deyo scores based on ICD-9 diagnoses from different data sources, including hospital discharge, outpatient physician services, and auxiliary services (nursing facilities, home health aid, etc.), to improve performance. With a model that adjusted for age and gender, these authors reported a 3 percent improvement in the c statistic to predict mortality (0.702 to 0.724) in a random sample of Medicare enrollees. When we constructed a model that included the same set of covariates but without auxiliary information, we observed only a 1.5 percent improvement (0.757 to 0.768), which is closer to the 1.1 percent improvement observed in a recent study of breast cancer patients (36). Additional improvement (2 percent) was achieved when we combined the ICD-9-based score with the medication-based CDS-1 score. Since the combination that included number of distinct medications received during the baseline year performed equally well, we suggest its combination with the Romano or Deyo score as an easily applicable and improved measure of comorbidity.
Our data support earlier findings (12) of almost no difference between modeling comorbidity scores as a continuous variable or as several categories. Binary coding is not recommended for D'Hoore's score, since it lost 2.7 percent of its c statistic when compared with a continuous model. Including quadratic terms of the scores makes interpretation of coefficients more difficult, with almost no gain in prediction.
In their original publication, Ghali et al. claimed that their score performed almost 15 percent better in predicting mortality than the Deyo score did (c = 0.70 vs. c = 0.61) (11). However, they empirically chose the weights of their abridged Charlson score to optimize prediction of mortality in their sample of patients with coronary bypass surgery. In our study of elderly recipients of antihypertensive medications, who constitute about one third of the total British Columbia population aged 65 years or more, generic scores such as those of Deyo or Romano performed better. This conclusion confirms earlier findings of Roos et al. (37) that performance of the Deyo score in predicting 1-year mortality can change considerably in specific disease groups, such as patients undergoing prostatectomy (c = 0.64), cholecystectomy (c = 0.70), or bypass surgery (c = 0.75).
Although the c statistics of the CDS-1 and Romano scores are statistically different, the question remains whether it is worthwhile to purchase and process diagnostic data in addition to pharmacy data to improve the c statistic from 0.738 to 0.783 (CDS-1 combined with Romano), an improvement of 9 percent in terms of the range between chance (c = 0.5) and perfect (c = 1.0) prediction. On the basis of detailed discharge data that included demographics and up to four comorbidities per patient, Hannan et al. (38) reported a c statistic of 0.742 for prediction of in-hospital mortality in patients with bypass surgery in New York State. After important clinical predictors were added, including ejection fraction, >90 percent narrowing of the left main vessel, and reoperation, the c statistic improved to 0.790, that is, 9.6 percent of the range from chance to perfect. Other authors (39) concluded that there is a significant difference between c = 0.72 and c = 0.74 in National Cholesterol Education Program guidelines I and II in predicting cardiovascular mortality. From this and other examples, it appears that large investments yield only small numeric gains in c statistics above 0.75. Whether those gains are worthwhile depends on the benefits of a "truer" analysis and the costs of error, which are unique to each problem.
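The "percent of the range between chance and perfect prediction" quoted above is the gain in the c statistic divided by the width of that range (1.0 − 0.5 = 0.5). A short Python check of the two figures in the text follows; the function name is a hypothetical label for this arithmetic.

```python
def share_of_range_gained(c_before, c_after, chance=0.5, perfect=1.0):
    """Gain in the c statistic as a fraction of the chance-to-perfect range."""
    return (c_after - c_before) / (perfect - chance)


# CDS-1 alone versus CDS-1 combined with the Romano score.
print(round(100 * share_of_range_gained(0.738, 0.783), 1))  # 9.0 percent

# Hannan et al.: discharge data alone versus added clinical predictors.
print(round(100 * share_of_range_gained(0.742, 0.790), 1))  # 9.6 percent
```
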
In addition to measuring the relative predictive abilities of scores, we estimated their relative abilities to reduce confounding bias. Although our analyses of the effects on confounding bias relied on simplifying assumptions (e.g., dichotomous comorbidity measures and a single confounder), they suggest that more confounding could possibly be controlled by the Romano and Deyo scores than by the other scores.
The present study estimated and ranked the performance of six published comorbidity scores for a variety of endpoints in claims databases, but the generalizability of our results may be limited to an elderly, predominantly White population aged 65 years or more with equal access to state-funded health care. Performance of the Deyo score in the British Columbia population was better than in a random sample of Medicare enrollees (35). We caution against assuming performances will be similar in patient subgroups with specific diagnoses or of low-income (Medicaid) status. Relative performance depends on data quality. Similar studies of comparative performance are needed with other databases.
Although comorbidity scores are useful because they are easy to use and they save time and resources (a major issue when analyzing massive health care databases), they provide only a limited ability to control for confounding (1). Adjusting for a score should not be regarded as successfully controlling for confounding, because a summary score imposes on the analysis a fixed model of the relation between comorbidities and outcome, which is likely to differ among populations (40, 41). In addition, when the outcomes of a particular disease are studied, effects may be underestimated if the disease is a major ingredient of the score. If the goal is to control confounding as best as the data permit, scores are still useful for preliminary analyses to indicate the direction and magnitude of confounding, which can guide decisions about further analyses. The benefit versus the cost of using more thorough approaches to control confounding versus comorbidity scores is a topic that requires further research.
APPENDIX
Assuming a 2-by-2 table of a dichotomous exposure and a dichotomous confounder, let e be the prevalence of exposed patients with the confounder present. The association between confounder and exposure can then be measured by the confounder-exposure odds ratio or OR_CE, which is a function of e and the marginal probabilities of exposure Pr(E) and confounder Pr(C) (e.g., Walker (42)):
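[Equation 1: OR_CE as a function of e, Pr(E), and Pr(C)]
[Equation 2: crude RR_EO as a function of e, RR_EO, RR_CO, Pr(E), and Pr(C)]
[Derived term for e]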
Substituting the derived term for e in equation 2 yields the crude RR_EO as a function of OR_CE, RR_EO, RR_CO, and the marginal probabilities Pr(E) and Pr(C).
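The relations described in this Appendix can be written out as below. This is a reconstruction from the definitions given in the text, not a copy of the published equations: it takes e as the joint prevalence Pr(E and C), assumes a single dichotomous confounder, and assumes that the confounder-outcome relative risk is the same among exposed and unexposed patients. The first line is the odds ratio of the 2-by-2 table, the second solves it for e (for OR_CE ≠ 1; when OR_CE = 1, e = Pr(E)Pr(C)), and the third is the ratio of the average risk among the exposed to that among the unexposed.

```latex
\begin{align}
\mathrm{OR}_{CE} &= \frac{e\,\bigl[1 - \Pr(E) - \Pr(C) + e\bigr]}
                        {\bigl[\Pr(E) - e\bigr]\,\bigl[\Pr(C) - e\bigr]} \\[6pt]
e &= \frac{B - \sqrt{B^{2} - 4\,(\mathrm{OR}_{CE} - 1)\,\mathrm{OR}_{CE}\,\Pr(E)\,\Pr(C)}}
          {2\,(\mathrm{OR}_{CE} - 1)},
\qquad B = 1 + (\mathrm{OR}_{CE} - 1)\,\bigl[\Pr(E) + \Pr(C)\bigr] \\[6pt]
\text{crude } \mathrm{RR}_{EO} &= \mathrm{RR}_{EO} \times
  \frac{\bigl[e\,(\mathrm{RR}_{CO} - 1) + \Pr(E)\bigr]\,\bigl[1 - \Pr(E)\bigr]}
       {\Pr(E)\,\bigl[(\Pr(C) - e)\,(\mathrm{RR}_{CO} - 1) + 1 - \Pr(E)\bigr]}
\end{align}
```

As a check, setting OR_CE = 1 gives e = Pr(E)Pr(C), and the fraction in the last line reduces to 1, so the crude and true relative risks coincide when exposure and confounder are unassociated.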