a Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA.
b Division of Pharmacoepidemiology and Pharmacoeconomics, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.
c Pharmacare, Ministry of Health, British Columbia, Canada.
Reprint requests to: Sebastian Schneeweiss, Division of Pharmacoepidemiology and Pharmacoeconomics, Brigham and Women's Hospital and Harvard Medical School, 221 Longwood Ave, Boston, MA 02115, USA. E-mail: sschneew@hsph.harvard.edu
Abstract
Methods The literature was searched for studies of the validity of comorbidity scores as predictors of mortality and health service use, as measured by the change in the area under the receiver operating characteristic (ROC) curve for dichotomous outcomes and the change in R² for continuous outcomes.
Results Six scores were identified, including four versions of the Charlson Index (CI), which use either three-digit International Classification of Diseases, Ninth Revision (ICD-9) codes or the full ICD-9-CM (clinical modification) codes, and two versions of the Chronic Disease Score (CDS), which use outpatient pharmacy records. Depending on the population and exposure under study, predictive validities varied between c = 0.64 and c = 0.77 for in-hospital or 30-day mortality. This is only a slight improvement over age adjustment. In one study a simple count of the number of diagnoses outperformed the CI (c = 0.73 versus c = 0.65). Proprietary scores like Ambulatory Diagnosis Groups and Patient Management Categories do not necessarily perform better in predicting mortality. Comorbidity indices are susceptible to a variety of coding errors.
Conclusions Comorbidity scores, particularly the CDS or D'Hoore's CI based on three-digit ICD-9 codes, may be useful in exploratory data analysis. However, residual confounding by comorbidity is inevitable, given how these scores are derived. How much residual confounding usually remains is something that future studies of comorbidity scores should examine. In any given study, better control for confounding can be achieved by deriving study-specific weights, to aggregate comorbidities into groups with similar relative risks of the outcomes of interest.
Keywords Comorbidity, confounding, risk adjustment, health services epidemiology, clinical epidemiology
Accepted 13 March 2000
Introduction
The simplest comorbidity score is also the most widely used measure of confounding in epidemiology: age. Although it is a relatively poor index of comorbidity, it is recorded accurately and ubiquitously in administrative databases, and methods of adjusting for age are standard. It is known to be a usually necessary but often insufficient adjustment in most non-experimental studies. We began this review with the question of whether any comorbidity score would improve on adjustment for age while remaining almost as simple.
On first thought, there would seem to be at least two major advantages of a perfect single comorbidity score, if such a score could be created: (1) in multivariate modelling, a single score summarizing comorbidity would increase the statistical efficiency of analysis, compared with modelling each individual morbidity separately, especially when other risk factors modify the effects of comorbidity, necessitating use of two- or three-way product terms; (2) a validated comorbidity instrument available as a standard off-the-shelf product would simplify the process of variable selection both in the design and analysis of a study, and might increase the comparability of findings from different studies.
On second thought, after an initial look at the literature on the difficulty of adjusting for comorbidity in administrative datasets,3,4 we developed the impression that these advantages might be distant or unattainable. A prominent example of controversy over comorbidity adjustment was the claim by Roos et al. that, in contrast to suprapubic prostatectomy for benign prostatic hyperplasia, transurethral prostatectomy appeared to be associated with increased 5-year mortality after comorbidity adjustment based on discharge information (RR = 1.45, 95% CI: 1.15-1.83).5,6 Concato et al. repeated the study and found similarly elevated risks with the same adjustment method.7 However, the increase in mortality vanished (RR = 1.03, 95% CI: 0.51-2.07) when the same method was applied to comorbidity information from medical record review. The authors concluded that comorbidity-adjusted results from automated databases might be insufficient, tending to underrepresent some comorbid conditions,6 and should be interpreted cautiously.
Chart review is rarely possible in the growing number of low-budget studies using administrative data. Therefore, comorbidity adjustment based on administrative data with all its limitations is here to stay. To proceed with such research and avoid erroneous inferences requires a mixture of scepticism about the ability of comorbidity scores to fully control for confounding, and cautious optimism that comorbidity scores are nevertheless useful tools in hypothesis-screening studies, provided they have been validated. To achieve the right mixture of scepticism and optimism, we need to understand the evidence from validation studies of existing comorbidity scores.
Assessing metrical properties
For continuous outcomes the most common measure of validity is the improvement in explained variance, R², of a linear regression model after adding the score to some baseline model; R² ranges from 0 to 1 with increasing explained variance. For dichotomous outcomes, there are measures of discrimination and measures of calibration. Measures of discrimination compare the predicted outcome with the actual outcome, e.g. the c statistic, which is equivalent to the area under the curve (AUC) of the receiver operating characteristic (ROC). The AUC, or c, ranges from 0 to 1, with 1 indicating perfect prediction and 0.5 indicating chance prediction. Measures of calibration, like the Hosmer-Lemeshow goodness-of-fit statistic (H-L), assess the agreement between predicted and observed outcomes within strata of the index values and summarize that information.8 The statistic or its corresponding P-value is usually reported. In early papers in particular, the validity of prediction is often assessed by the strength of the association between the comorbidity index and the outcome, in terms of the odds ratio (OR) or relative risk (RR) per increment in score.
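The sketch below illustrates, with simulated data, how the two measures reported throughout this review are obtained: the gain in the c statistic when a comorbidity score is added to a baseline logistic model for a dichotomous outcome, and the gain in R² for a continuous outcome. All variable names, effect sizes and data are illustrative assumptions, not values from the studies reviewed.

```python
# Minimal sketch: change in c statistic (ROC AUC) and in R^2 after adding a
# comorbidity score to a baseline model. All data are simulated for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, roc_auc_score

rng = np.random.default_rng(0)
n = 5_000
age = rng.normal(75, 8, n)                      # baseline covariate
score = rng.poisson(1.5, n)                     # hypothetical comorbidity score

# Simulated outcomes that depend on both age and comorbidity
logit = -8 + 0.06 * age + 0.35 * score
death = rng.binomial(1, 1 / (1 + np.exp(-logit)))          # e.g. 30-day mortality
los = 3 + 0.05 * age + 1.2 * score + rng.normal(0, 3, n)   # e.g. length of stay

X_base = age.reshape(-1, 1)
X_full = np.column_stack([age, score])

# Discrimination: c statistic = area under the ROC curve
c_base = roc_auc_score(death, LogisticRegression(max_iter=1000).fit(X_base, death).predict_proba(X_base)[:, 1])
c_full = roc_auc_score(death, LogisticRegression(max_iter=1000).fit(X_full, death).predict_proba(X_full)[:, 1])

# Explained variance for the continuous outcome
r2_base = r2_score(los, LinearRegression().fit(X_base, los).predict(X_base))
r2_full = r2_score(los, LinearRegression().fit(X_full, los).predict(X_full))

print(f"c statistic: {c_base:.3f} -> {c_full:.3f}")
print(f"R^2:         {r2_base:.3f} -> {r2_full:.3f}")
```

In a real validation study the models would be fitted and evaluated on separate samples; fitting and evaluating on the same data, as in this sketch, slightly overstates both measures.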
Reliability
Computation of an index itself is completely reproducible in the same set of recorded data. Computerized indices from administrative databases are in that sense completely reliable but depend on the accuracy of information stored in the database. Administrative data are derived from what is documented in the medical or pharmacy record. Therefore, the real focus of reliability for code-based measures involves how accurately and completely the coded information was gathered. Thus reliability of code-based comorbidity indices is usually not tested directly but only inferred from reports of other investigators addressing coding accuracy.1 We will raise issues of data accuracy in a separate section.
Structural characteristics and predictive validity
Melfi et al.23 used the Deyo-CI in studying 249 744 Medicare patients who underwent total knee replacement between 1985 and 1989. An increase in the Deyo-CI of one point increased the probability of 30-day post-operative mortality by 17%. The Deyo-CI model showed a c statistic of 0.653 in predicting 30-day post-operative mortality (baseline model = 0.645) and an R² of 0.175 in predicting length of stay (baseline model = 0.174). The improvement in predictive power of models including the Deyo-CI is marginal, mostly because the baseline model already contains many known important predictors: gender, age, race, socioeconomic status, area of residence and level of care to which the patient was discharged. The same model with the Deyo-CI replaced by the number of distinct diagnoses shows better prediction of mortality (c = 0.73).
Poses et al.24 used Deyo's index to predict in-hospital death at a university hospital. They reported areas under the ROC curve of c = 0.64 for the Deyo-CI, c = 0.62 for duration in hospital, and c = 0.61 for age. The ICD-based Charlson Index contributed in a multivariate model to the prediction of in-hospital death, even when clinical data were added (OR = 1.2, 95% CI: 1.1-1.4).
Dartmouth-Manitoba
The Dartmouth-Manitoba version of the Charlson Index (DM-CI) by Roos et al.10,11 was the first adaptation of the Charlson Index to administrative databases. It includes additional, conceptually similar ICD diagnoses that were not explicitly listed in Charlson's original list of 19 conditions, which was meant to increase the sensitivity of the DM-CI. In a study of one-year mortality following several types of surgery, a multivariate analysis controlling for age and sex showed that the DM-CI contributed significantly to the fit for prostatectomy and bypass surgery but not for cholecystectomy.
Ghali et al.17 assessed the agreement between scores from DM-CI and Deyo-CI in a population of 6791 bypass surgery patients. They found 90% of patients were assigned identical scores by the two methods and the two scores differed by only 1 among a further 5% of patients.
Romano et al.12 compared RR estimates in the presence of 16 individual comorbid conditions that are part of the DM-CI and Deyo-CI in the same bypass patients (n = 4121) and in patients with lumbar discectomy (n = 55 296). They showed that individual diagnoses defined by the Deyo-CI and DM-CI methods have similar RR estimates within the same population (bypass or discectomy patients) and for the same outcome (death or complication). However, the RR changed strongly between populations, e.g. for metastatic tumour, RR = 1.5 in predicting mortality among bypass patients versus RR = 4.4 in predicting complications among lumbar discectomy patients. They provided no direct quantitative comparison of the two scores.
In 1997, Roos et al.18 published an augmented version of the DM-CI which added coagulopathy, neurological disorders, hypertension, arrhythmia, pneumonia and malnutrition. The predictive validity of models already controlling for age and gender improved from c = 0.64 to c = 0.68 for bypass surgery, from c = 0.70 to c = 0.77 for pacemaker surgery, and from c = 0.75 to c = 0.76 for hip fracture repair.
Ghali et al.
Rather than adding diagnoses, Ghali et al.17 reduced the number of diagnoses by selecting those that best predicted in-hospital mortality among 6326 bypass surgery patients identified from 257 333 Massachusetts hospital discharges. Weights were changed in order to further improve the predictive performance (Table 4). The Ghali-CI predicted in-hospital mortality slightly better than the DM-CI in a similar population of bypass patients one year later. The Ghali-CI had discrimination measures of c = 0.74 and R² = 0.034, compared with c = 0.704 and R² = 0.018 for the Deyo-CI. This relatively small gain is noteworthy because the weights for the Ghali-CI were derived from a similar group of bypass patients, whereas those for the Deyo-CI were based on the original CI, developed in a study of breast cancer patients using 19 comorbidities.
D'Hoore et al.
Motivated by the fact that some institutions, frequently outside the US, use only ICD-9 codes without the Clinical Modification (CM), and that coding of the trailing digits in ICD-9-CM is less reliable, D'Hoore et al. designed a Charlson Index (D'Hoore-CI) using only the first three digits of ICD-9.15 Using the MED-ECHO database restricted to 78 hospitals with 792 839 discharges (Table 2), they evaluated the prediction of in-hospital mortality among 33 940 patients with a principal diagnosis of ischaemic heart disease.16 In a model with age, sex, and acute myocardial infarction as the principal diagnosis, the D'Hoore-CI showed good predictive performance for two consecutive 2-year periods in the same population: 1989/90: c = 0.87 and R² = 0.14; 1990/91: c = 0.86 and R² = 0.13.16 Discrimination for other primary diagnoses was also reported: ischaemic heart disease (c = 0.81), congestive heart failure (c = 0.67), stroke (c = 0.66), and bacterial pneumonia (c = 0.82).15 Primary diagnoses were not excluded from the Charlson Index.
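To make the mechanics concrete, the sketch below shows how a D'Hoore-style score can be computed from the first three digits of recorded ICD-9 codes. The code-to-weight mapping is only an illustrative subset using the original Charlson weights; the complete three-digit code list must be taken from D'Hoore et al.,15 and the function name is our own.

```python
# Minimal sketch of a D'Hoore-style Charlson Index from 3-digit ICD-9 codes.
# The mapping below is an illustrative subset only, not the published list.
CHARLSON_3DIGIT = {
    "410": 1,                     # myocardial infarction
    "428": 1,                     # congestive heart failure
    "250": 1,                     # diabetes
    "571": 1,                     # chronic liver disease (mild)
    "585": 2,                     # chronic renal failure
    "196": 6, "197": 6, "198": 6, # metastatic cancer
    # ... remaining Charlson conditions omitted for brevity
}

def dhoore_style_score(icd9_codes):
    """Sum Charlson weights over a patient's distinct 3-digit ICD-9 codes."""
    three_digit = {code.strip()[:3] for code in icd9_codes}
    return sum(CHARLSON_3DIGIT.get(code, 0) for code in three_digit)

# Example: secondary diagnoses from one hospital discharge record
print(dhoore_style_score(["41071", "4280", "25000", "1983"]))  # -> 1 + 1 + 1 + 6 = 9
```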
Comorbidity scores based on outpatient pharmacy data
The Chronic Disease Score (CDS) uses pharmacy dispensing data to assign patients to chronic disease groups. An integer weight is given to each comorbidity group, represented by selected medication classes, and these weights are summed to give the overall score.13 The CDS was developed using the judgement of an interdisciplinary expert group of researchers and practitioners. In several pilot studies with varying populations within the Group Health Cooperative of Puget Sound (GHC), the derived CDS was compared to clinical judgement and to standard instruments measuring self-rated health status, psychological impairment, chronic pain status, and functional disability. The CDS was eventually tested among all 122 911 GHC enrolees. A multivariate logistic regression model showed that, with an increasing CDS, the probabilities of one-year hospitalization and one-year mortality steadily increased. The highest CDS category (7+) had a 10 times higher probability of death in the next year than CDS = 0. This effect diminished by up to 50% when age, gender and number of ambulatory visits were included in the prediction model.
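The sketch below illustrates the CDS construction just described: dispensed drug classes are mapped to chronic disease groups, and the integer weights of the groups present are summed. The disease groups, drug classes and weights shown are placeholders for illustration, not von Korff's published mapping.13

```python
# Minimal sketch of a Chronic Disease Score from pharmacy dispensing data.
# Groups, drug classes and weights are hypothetical placeholders.
CDS_WEIGHTS = {
    "heart_disease": 3,
    "diabetes": 2,
    "asthma_copd": 2,
    "hypertension": 1,
}

DRUG_CLASS_TO_GROUP = {
    "insulin": "diabetes",
    "oral_hypoglycaemic": "diabetes",
    "beta_agonist_inhaler": "asthma_copd",
    "ace_inhibitor": "hypertension",
    "nitrate": "heart_disease",
}

def chronic_disease_score(dispensed_classes):
    """Map each dispensed drug class to a disease group and sum each group's weight once."""
    groups = {DRUG_CLASS_TO_GROUP[c] for c in dispensed_classes if c in DRUG_CLASS_TO_GROUP}
    return sum(CDS_WEIGHTS[g] for g in groups)

# One year of dispensings for one patient
print(chronic_disease_score(["insulin", "oral_hypoglycaemic", "nitrate"]))  # -> 2 + 3 = 5
```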
An extended version, the CDS-2,25 used a 50% development sample from 250 000 managed care enrolees to derive empirical weights from a multiple logistic regression model. The CDS-2 was then tested, together with the original CDS, in the remaining sample of 125 000 enrolees from the same insurance plan and the same year. The authors claimed that the CDS-2 had a stronger association with one-year mortality, one-year hospitalization and health care utilization, although no original data were shown. It remains unclear to what extent the improved prediction is due to the close similarity of the development and test samples or to intrinsically better performance. In their study, the proportion of explained variance in predicting costs and health care utilization was reported for both the CDS and the CDS-2, making the two instruments directly comparable (Table 5). The CDS showed correlations with the SF-36 health status instrument and the BSI-8 depression disorder screening instrument.26
Data accuracy
Complications and diagnostic examinations
Another problem, particularly in studies of in-hospital mortality, is that it is often not clear whether some diagnoses are comorbidities at hospital admission or complications during the hospital stay. Treating complications as comorbidities can result in an overoptimistic interpretation of an index's predictive capability for unfavourable outcomes. Roos et al., however, showed that the impact of misinterpreting complications as comorbidities on the Charlson index is minor in surgical procedures.18
Diagnoses are sometimes recorded as present when the health service was actually provided to rule out that diagnosis. Such rule-out diagnoses can often be distinguished only indirectly, using longitudinal data on subsequent encounters and procedures.
Completeness
Inaccuracies can occur when diagnoses are omitted because the data fields have been exhausted by more important diagnoses. Romano et al. showed that the sensitivity for capturing specific diagnoses in administrative databases with five diagnosis fields was reduced by an average of 13 percentage points compared with records with 25 fields.28 Specificity was almost unchanged.
Prescription drugs used as proxies for diagnoses may have reduced validity because many drugs have mixed indications, and because physicians tend to avoid prescribing additional drugs to patients who are already taking several and to reduce preventive medication in sicker patients.31,32
Discussion
One major reason why comorbidity scores add little predictive power is that scores of any kind summarize a complex construct in an over-simplistic way and thereby make erroneous assumptions. Scores may perform reasonably well in the setting they were designed for and worse in other settings because they fail to represent a more general construct of comorbidity. The finding that a simple count of distinct diagnoses performed equally well in one study illustrates this, but cannot be generalized.
Even if it provides only a modest improvement in ability to control for confounding, a simple score still would be appealing if, like age, it had excellent data accuracy and completeness and was widely used and understood. Unfortunately the accuracy of diagnostic data varies among databases, depending on the financial incentives and disincentives for recording it and the processes for error checking. The quality of data on drug use is less variable between databases, but drug use varies not only by disease status but also by ability to pay, prescribing customs and patient attitudes, which vary among health systems and regions.
Nevertheless, we conclude that an off-the-shelf comorbidity score can still be a useful tool for exploratory data analysis. It enables one to assess the existence and direction of confounding by comorbidity. Although the magnitude of confounding is inevitably underestimated with such a score, it provides a preliminary quantification, which can guide the development of weights tailored to the particular study. For practical reasons, D'Hoore's15 adaptation of the Charlson Index, which uses only the first three digits of ICD-9, could be a starting point. If reliable pharmacy data are available, von Korff's Chronic Disease Scores13,25 might be an alternative with reasonable predictive validity for selected outcomes.
With this more modest purpose in mind, several criticisms of comorbidity scores are avoided. A major criticism of all comorbidity adjustment methods is that they were developed to predict a particular type of outcome (e.g. morbidity or mortality) but are used to adjust for the risk of other outcomes (e.g. health service use or costs). Although there are common cases when the severity of one outcome is inversely related to the intensity of health care use (e.g. sudden cardiac death), in the aggregate, adverse outcomes are positively correlated with each other. Therefore, a comorbidity score developed in one setting can be applied in a very different setting as long as it is only for exploratory purposes.
A major criticism of summary scores in general, including such diverse summaries as body mass index,33 study quality scores in meta-analyses,34 and even age as a continuous variable, is that summarization into a single value forces a relationship on the data that may be unrealistic. Even if the original Charlson weights, derived from regression coefficients predicting survival in patients at Cornell Medical Center, had been the optimal fit for those data, it is unlikely that they would fit the data as well in another population or for other outcomes. An alternative is to test alternative weights and to include several indicator terms in the model for different dimensions or categories of the variable. The numerically most efficient way of doing so is to model the outcome as a function of all comorbidity information, including interaction terms, and to use the resulting regression coefficients to weight the individual items of the score within the same study (see the sketch below). This criticism applies much more to final rigorous analyses than to initial exploratory data analysis. Although supported by only one published study, it may not be necessary to categorize scores, as epidemiologists usually do with age.
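The sketch below shows one way the study-specific weighting described above could be implemented: regress the study outcome on indicator terms for the individual comorbid conditions and use the fitted coefficients as weights when aggregating them into a tailored score. The conditions, prevalences and effect sizes are illustrative assumptions; interaction terms and other covariates are omitted for brevity.

```python
# Minimal sketch: derive study-specific comorbidity weights from the study's own
# outcome model, then aggregate them into a tailored score. Data are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000
conditions = ["chf", "diabetes", "renal_disease", "metastatic_cancer"]

# Indicator matrix: one column per comorbid condition (hypothetical prevalences)
X = rng.binomial(1, [0.15, 0.20, 0.05, 0.02], size=(n, len(conditions)))
logit = -4 + X @ np.array([0.8, 0.3, 1.0, 2.0])
death = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Fit a logistic model of the outcome on the individual comorbidity indicators
fit = sm.Logit(death, sm.add_constant(X)).fit(disp=0)
weights = dict(zip(conditions, fit.params[1:]))  # study-specific log-odds weights
print(weights)

# Aggregate the indicators into a tailored score for each patient in the same study
tailored_score = X @ fit.params[1:]
```

In an actual analysis, the exposure of interest and other confounders would be included in the adjustment model alongside the tailored score or the individual indicators.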
With this reframing, what additional studies of comorbidity scores are needed? Their comparative utility and convenience in exploratory analysis need to be tested directly in identical populations predicting the same endpoints. Are they almost as good as expensive case-mix adjustment packages at guiding the investigator on the path to full, rigorous control for confounding by multiple variables? How much residual confounding is apparent in score-adjusted RR, as judged by the change in RR when more rigorous methods are used to control confounding by comorbidities?
In conclusion, comorbidity scores (1) can simplify the data analysis process and might be useful for exploring confounding, (2) are unlikely to provide adequate confounder adjustment despite their popularity, and (3) do not standardize confounder adjustment across studies. Published evaluations of the predictive performance of comorbidity indices are limited, and more cross-validation of different indices is needed to understand their utility.
Acknowledgments
References
2 Goldfield N. Physician Profiling and Risk Adjustment. 2nd Edn. Gaithersburg: Aspen Publication, 1999.
3 Park RE, Brook RH, Kosecoff J et al. Explaining variation in hospital death rates. JAMA 1990;264:484-90.
4 Greenfield S, Aronow HU, Elashoff RM, Watanabe D. Flaws in mortality data. JAMA 1988;260:2253-55.
5 Roos N, Wennberg JE, Malenka DJ et al. Mortality and reoperation after open and transurethral resection of the prostate for benign prostatic hyperplasia. N Engl J Med 1989;320:1120-24.
6 Malenka DJ, Roos N, Fisher ES et al. Further study of the increased mortality following transurethral prostatectomy: a chart-based analysis. J Urol 1990;144:224-28.
7 Concato J, Horwitz RI, Feinstein AR, Elmore JG, Schiff SF. Problems of comorbidity in mortality after prostatectomy. JAMA 1992;267:1077-82.
8 Malenka DJ, McLerran D, Roos N et al. Using administrative data to describe casemix: a comparison with the medical record. J Clin Epidemiol 1994;47:1027-32.
9 Hosmer DW, Lemeshow S. Confidence interval estimation on an index of quality performance based on logistic regression models. Stat Med 1995;14:2161-72.
10 Roos LL, Sharp SM, Cohen MM, Wajda A. Risk adjustment in claims-based research: the search for efficient approaches. J Clin Epidemiol 1989;42:1193-206.
11 Romano PS, Roos LL, Jollis JG. Adapting a clinical comorbidity index for use with ICD-9-CM administrative data: differing perspectives. J Clin Epidemiol 1993;46:1075-79.
12 Romano PS, Roos LL, Jollis JG. Further evidence concerning the use of a clinical comorbidity index with ICD-9-CM administrative data. J Clin Epidemiol 1993;46:1085-90.
13 Von Korff M, Wagner EH, Saunders K. A chronic disease score from automated pharmacy data. J Clin Epidemiol 1992;45:197-203.
14 Deyo RA, Cherkin DC, Ciol MA. Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases. J Clin Epidemiol 1992;45:613-19.
15 D'Hoore W, Sicotte C, Tilquin C. Risk adjustment in outcome assessment: the Charlson Comorbidity Index. Meth Inform Med 1993;32:382-87.
16 D'Hoore W, Bouckaert A, Tilquin C. Practical considerations on the use of the Charlson index with administrative data bases. J Clin Epidemiol 1996;49:1429-33.
17 Ghali WA, Hall RE, Rosen AK, Ash AS, Moskowitz MA. Searching for an improved clinical comorbidity index for use with ICD-9-CM administrative data. J Clin Epidemiol 1996;49:273-78.
18 Roos LL, Stranc L, James RC, Li J. Complications, comorbidities, and mortality: improving classification and prediction. Health Serv Res 1997;32:229-38.
19 Elixhauser A, Steiner C, Harris R, Coffey RM. Comorbidity measures for use with administrative data. Med Care 1998;36:8-27.
20 Starfield B, Weiner J, Mumford L, Steinwachs D. Ambulatory care groups: a categorization of diagnoses for research and management. Health Serv Res 1991;26:53-74.
21 Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chron Dis 1987;40:373-83.
22 Charlson ME, Szatrowski TP, Peterson J, Gold J. Validation of a combined comorbidity index. J Clin Epidemiol 1994;47:1245-51.
23 Melfi C, Holleman E, Arthur D, Katz B. Selecting a patient characteristics index for the prediction of medical outcomes using administrative claims data. J Clin Epidemiol 1995;48:917-26.
24 Poses RM, Smith WR, McClish DK, Anthony M. Controlling for confounding by indication for treatment. Are administrative data equivalent to clinical data? Med Care 1995;33:AS36-AS46.
25 Clark DO, von Korff M, Saunders K, Baluch WM, Simon GE. A chronic disease score with empirically derived weights. Med Care 1995;33:783-95.
26 Johnson RE, Hornbrook MC, Nichols GA. Replicating the chronic disease score (CDS) from automated pharmacy data. J Clin Epidemiol 1994;47:1191-99.
27 Fowles JB, Lawthers AG, Weiner JP et al. Agreement between physicians' office records and Medicare Part B claims data. Health Care Fin Rev 1995;16:189-99.
28 Romano PS, Mark DH. Bias in the coding of hospital discharge data and its implications for quality assessment. Med Care 1994;32:81-90.
29 Romano PS, Roos LL, Luft HS et al. A comparison of administrative versus clinical data: coronary artery bypass surgery as an example. J Clin Epidemiol 1994;47:249-60.
30 Kieszak SM, Flanders WD, Kosinski AS, Shipp CC, Karp H. A comparison of the Charlson comorbidity index derived from medical records data and administrative billing data. J Clin Epidemiol 1999;52:137-42.
31 Redelmeier DA, Tan SH, Booth GL. The treatment of unrelated disorders in patients with chronic medical diseases. N Engl J Med 1998;338:1516-20.
32 Glynn RJ, Monane M, Gurwitz JH, Choodnovskiy I, Avorn J. Agreement between drug treatment data and a discharge diagnosis of diabetes mellitus in the elderly. Am J Epidemiol 1999;149:541-49.
33 Michels KB, Greenland S, Rosner BA. Does body mass index adequately capture the relation of body composition and body size to health outcomes? Am J Epidemiol 1998;147:167-72.
34 Greenland S. Quantitative methods in the review of epidemiologic literature. Epidemiol Rev 1987;9:1-30.