Problems in the development, validation and adaptation of prognostic models for acute renal failure

Robert L. Lins1, Monique Elseviers2, Ronald Daelemans and Marc E. De Broe2

1 Department of Nephrology–Hypertension, A.C.Z.A. Campus Stuivenberg, Antwerp and 2 Department of Nephrology–Hypertension, University Hospital Antwerp, Edegem/Antwerp, Belgium

Keywords: acute renal failure; disease management; prognostic models; scoring; treatment

Historical perspective of prognostic model development

Clinicians are frequently asked to make predictions concerning diagnosis, risk assessment or prognosis. In recent years, the science of prognostication (quantitative prediction) has evolved rapidly, and critical care practice has been at the forefront of this important international trend [1]. Various scoring systems, mostly illness severity scores, have been developed to optimize the use of clinical experience in the intensive care unit (ICU) and to address questions of effectiveness, efficiency, quality of care and correct allocation of scarce resources [2]. Originally, these scores were developed by experts who selected variables and weighted them according to their perceived importance, on the basis of personal experience (the Acute Physiology and Chronic Health Evaluation, APACHE, system) [3]. Later, scores were based on multiple logistic regression of rather small databases (Mortality Probability Models, MPM I) [3]. After that, large databases were compiled prospectively from many ICUs (APACHE III, the Simplified Acute Physiology Score SAPS II, MPM II). Finally, scores with multiple measuring points were developed, also based on large multicentre databases (SOFA; MPM0, MPM24, MPM48 and MPM72) [3,4].

The general severity scoring systems are, however, inappropriate for disease-specific populations, such as patients with acute renal failure (ARF) [1,2,5,6]. Since ARF creates an additional risk of mortality [7], disease-specific scoring systems were developed, e.g. the Liano [8,9] and Stuivenberg Hospital Acute Renal Failure (SHARF) [10,11] scores.

Single centre or multicentre development of scoring systems

It is generally observed that models developed in one centre fail to confirm their predictive value when tested in other centres and patient groups [12]. To overcome this problem, third-generation prediction models were developed, based on large databases collected from many centres [2,3,13].

Several considerations made us challenge this approach. First, scores developed in a multicentre study, such as APACHE or MPM, are also likely to perform insufficiently well when used in a different setting [6]. Moreover, in a multicentre setting, translation, conversion and definition ambiguities are a source of inter-observer variability that increases the risk of bias in the results [14]. Differences between centres in technical and therapeutic resources, administration, level of organization and staffing can also have an uncontrollable impact [3].

Statistical problems in model design and testing

The technique most frequently recommended for model development is logistic regression [3]. The 1994 Consensus Conference in Intensive Care Medicine [2] recommended that, to test the performance of such models, discrimination and calibration be assessed in both the development and the testing samples.
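To make the procedure concrete, the core of such a model is a logistic regression fitted to a development cohort. The following minimal sketch (Python; the file name and the covariates age, creatinine and ventilated are hypothetical placeholders, not the actual SHARF or APACHE specification) shows the principle:

```python
# Minimal sketch of model development by logistic regression.
# "arf_cohort.csv" and all column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("arf_cohort.csv")                      # development cohort
X = sm.add_constant(df[["age", "creatinine", "ventilated"]])
model = sm.Logit(df["died"], X).fit()                   # maximum-likelihood fit
print(model.summary())   # the fitted coefficients define the prognostic score
```

The fitted coefficients are then frozen and applied unchanged to new patients; discrimination and calibration, discussed next, judge how well this works.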

Discrimination evaluates the ability of a model to distinguish dying patients from those who survive. The area under the receiver operating characteristic (ROC) curve [15] is most commonly used to quantify discriminative power. In the literature, an area under the ROC curve of >0.70 is considered satisfactory for discrimination [3]. However, this level is too low to be useful for mortality prediction in individual patients [16].
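The area under the ROC curve can be read as the probability that a randomly chosen non-survivor is assigned a higher predicted risk than a randomly chosen survivor [15]. A minimal sketch with invented toy data (not patient data):

```python
# Discrimination: area under the ROC curve for toy data.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 0, 1, 1, 0, 1]                    # 1 = hospital death
y_prob = [0.1, 0.3, 0.6, 0.7, 0.8, 0.5, 0.4, 0.9]    # model-predicted risks
print(f"AUC = {roc_auc_score(y_true, y_prob):.2f}")  # prints AUC = 0.88
```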

Calibration evaluates the degree of correspondence between the predicted probabilities of mortality and the observed mortality. Good calibration, or fit, is assessed using tables of observed versus predicted mortality and the Hosmer–Lemeshow goodness-of-fit χ2 statistic [17]. These techniques have been proposed in recent literature for the so-called third-generation prognostic systems [3,18]. The result of goodness-of-fit analysis for the calibration of observed versus predicted mortality data is, however, sensitive to sample size [19]. In addition, we found that the calibration result was strongly influenced by the way the data were classified.
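Common statistical libraries do not ship a single standard Hosmer–Lemeshow routine, so it is usually computed by hand. A minimal sketch of the familiar 'deciles of risk' form follows (our own illustrative implementation, not a reference version); note that the `groups` argument is precisely the classification choice that can sway the result:

```python
# Calibration: Hosmer-Lemeshow goodness-of-fit statistic over risk strata.
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Chi-square over `groups` strata of predicted risk; returns (stat, p)."""
    order = np.argsort(y_prob)                     # sort patients by risk
    y_true = np.asarray(y_true, dtype=float)[order]
    y_prob = np.asarray(y_prob, dtype=float)[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(y_prob)), groups):
        n = len(idx)
        observed = y_true[idx].sum()               # observed deaths in stratum
        expected = y_prob[idx].sum()               # predicted deaths in stratum
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n))
    return stat, chi2.sf(stat, groups - 2)         # p-value on g-2 df
```

Re-running such a function with a different number of strata, or with different handling of tied risks, can change the verdict on the same model, illustrating the sensitivity to classification noted above.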

It may be concluded that these development and testing techniques, currently considered the gold standard, could become obsolete in the future.

Performance and customization in subsets of patients and in other institutions

Validation samples must be assembled either by collecting data on a new cohort of patients or by randomly splitting the database into two parts: one for developing the model and the other for testing it [3]. If validation fails, adaptation (customization) of the model is necessary [2].
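The random split is a one-line operation in most statistical environments. A minimal sketch, reusing the hypothetical cohort file from the development sketch above:

```python
# Random 50/50 split into a development sample and a testing sample.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("arf_cohort.csv")        # hypothetical pooled ARF database
development, testing = train_test_split(
    df, test_size=0.5, random_state=42,   # fixed seed for reproducibility
    stratify=df["died"],                  # keep mortality rate equal in halves
)
```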

Different assumptions must be made when applying a model in different centres or subsets of patients (Table 1) [16,20]. First, the sample taken for the development of the score must be representative of the total population of interest [16], e.g. all ARF patients must be included [11]. If the patients are heterogeneous with respect to the risk of adverse outcome, e.g. cardiac surgery patients with a high risk of ARF and a rather variable prognosis [21], the study will be much more useful if the investigators define subgroups at lower and higher risk in comparison with the group as a whole [16]. Secondly, the case-mix within ICUs varies over time, necessitating periodic re-evaluation of the score in order to adapt the model [13]. This is an important factor to consider for the future, especially if a model were to be used for research or quality assurance purposes. The third problem is lead-time bias, i.e. the erroneous estimation of risk at the time of admission to the ICU due to the results of therapeutic actions taken previously [21]. In a French study [22], there was a difference in prognosis between patients admitted to the ICU immediately and those admitted later during their hospitalization. Finally, variability between centres in ICU characteristics, such as organization, culture and therapeutic strategies, and in the content and quality of data collection, can have an important impact on the reliability of the severity score [14].


Table 1. Assumptions for use of probability models at any institution

 

The use of scores in individual patients

To estimate the prognosis for an individual patient, additional requirements must be met (Table 2). The individual patient can differ from the study group [18], so the patient must conform to the inclusion and exclusion criteria that were used to develop the original model, or selection bias may occur [13].


 
Table 2. Assumptions for use of probability models in individual patients

 
To calculate the score and derive the probability of mortality, all information concerning the parameters included in the model must be available. The calculated probability must also carry an acceptable degree of confidence, statistically reflected in narrow confidence intervals [18]; otherwise decisions with respect to diagnosis, prognosis and treatment cannot be based on it.
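For a model fitted by logistic regression, such an interval for one patient can be obtained directly from the fitted model. A minimal sketch with simulated data (the covariate values of the 'new patient' are hypothetical):

```python
# Predicted mortality probability with a 95% CI for one individual patient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)                    # simulated development data
X = sm.add_constant(rng.normal(size=(200, 2)))
p = 1 / (1 + np.exp(-X @ np.array([-0.5, 1.0, 0.8])))
fit = sm.GLM(rng.binomial(1, p), X, family=sm.families.Binomial()).fit()

x_new = np.array([[1.0, 0.3, -1.2]])              # hypothetical new patient
pred = fit.get_prediction(x_new)
print(pred.predicted_mean, pred.conf_int())       # probability and 95% CI
```

A wide interval signals that the model, however well it performs on average, cannot support a decision for this particular patient.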

If a prognostic score differs markedly from personal observations of outcome or from other reports, it is important to find other scores that confirm the calculated prognosis [16]. One must bear in mind, however, that in the few examples where this has been studied, clinical judgement seems to be equal to, if not better than, the available predictive models for individual patients [13].

Some authors have defined a cut-off point above which no survival is predicted [6,8], implying that therapy should be withheld. From a statistical point of view, however, a degree of uncertainty always remains, making such a decision debatable. Moreover, from an ethical point of view, such a decision can never be based on a score alone, however well the score performs.

Perspectives for the future

Prognostication will continue to increase in importance, for the simple reason that scoring systems fit well within evidence-based disease management. Disease management can be defined as the management of the total patient, based on the best available evidence and across the continuum of health care, with the goal of enhancing patient outcomes and reducing the total cost of care. Three factors are important for the future: inclusion of costs, adaptation of scores to different populations and periods, and choice of the most appropriate outcome parameter.

If costs are incorporated in the development and validation of scores, therapeutic options and the cost-effectiveness of diagnostic and therapeutic procedures can be studied. Additional applications of this strategy include support for decision-making when considering diagnostic or therapeutic options, evaluation of new therapies and evaluation of allocated resources. For comparing institutions with regard to quality of care, indirect standardization through the calculation of observed versus predicted outcome ratios (such as the standardized mortality ratio) has been advocated as an important technique [20]. Better tools will probably be developed in the future.
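The standardized mortality ratio itself is simple: observed deaths divided by the number of deaths the model predicts, i.e. the sum of the individual predicted probabilities. A toy illustration with invented numbers:

```python
# Standardized mortality ratio (SMR) for one institution; numbers invented.
predicted_risks = [0.24] * 100           # model-predicted death probabilities
deaths_expected = sum(predicted_risks)   # 24.0 expected deaths
deaths_observed = 30                     # observed deaths in the same patients

smr = deaths_observed / deaths_expected
print(f"SMR = {smr:.2f}")                # 1.25: more deaths than predicted
```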

To overcome the problems of adapting a model to a different population, some authors have proposed multilevel modelling, because it is often necessary to combine patient data from different ICUs in order to have enough patients to evaluate a particular patient group. This technique controls for both ICU- and patient-related factors that can influence outcome [23]. The key technical advance of multilevel modelling is that factors specific to each ICU are assumed to vary randomly across ICUs, i.e. at the ICU level. Currently, however, there is no gold standard technique for the adaptation of a model to different populations.
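In its simplest random-intercept form (our notation, chosen for illustration rather than taken from reference [23]), such a model for patient i in ICU j can be written as

logit(p_ij) = β0 + x_ij'β + u_j,  with u_j ~ N(0, σ²_ICU),

where the random intercept u_j is exactly the ICU-specific factor assumed to vary randomly across ICUs, and β captures the patient-level effects.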

An important aspect of the future development of prognostic scoring systems is the choice of the most appropriate outcome parameter to study. Hospital mortality is probably not of the greatest interest to the patient. Long-term survival and quality of life have become the most important outcome parameters studied in other areas of epidemiological research, particularly in evidence-based disease management. According to a Consensus Conference in Intensive Care Medicine, these outcome measures should be incorporated in future research [2].

Conclusion

Developing a ‘good’ scoring system for ARF, allowing the calculation of the probability of hospital death or of other outcome variables, is a daunting but very useful task. The score will need adaptation for many subsets of patients, with regular correction for evolution in organization and treatment. Only a continuous study effort and a well-designed computer programme with regular updates can fulfil this task. Ideally, the end point should be one basic score for all diseases, with adaptation to the centre, the subset of patients and the individual disease.

But even if one day such an ‘ideal’ score is found, allowing perfect prediction for an individual patient and excellent quality assessment, it will always be necessary to use our clinical skills and rational thinking before making an ‘evidence-based’ decision on behalf of the patient. Probability models are there to support clinical judgement, not to replace it. The final decision will always remain the responsibility of the individual physician, based on different criteria, of which a good scoring system is one powerful tool among others.

Notes

Correspondence and offprint requests to: Robert Lins, Department of Nephrology–Hypertension, A.C.Z.A. Campus Stuivenberg, Lange Beeldekensstraat 267, B-2060 Antwerpen, Belgium.

References

1. Chew SL, Lins RL, Daelemans R, De Broe ME. Outcome in acute renal failure. Nephrol Dial Transplant 1993; 8: 101–107
2. Consensus Conference organized by the ESICM and the SRLF. Predicting outcome in ICU patients. Intens Care Med 1994; 20: 390–397
3. Lemeshow S, Le Gall J-R. Modeling the severity of illness of ICU patients. A systems update. JAMA 1994; 272: 1049–1055
4. Vincent JL, Moreno R, Takala J et al. The SOFA (sepsis-related organ failure assessment) score to describe organ dysfunction/failure. Intens Care Med 1996; 22: 707–710
5. Schaefer JH, Jochimsen F, Keller K, Wegscheider K, Distler A. Outcome prediction of acute renal failure in medical intensive care. Intens Care Med 1991; 17: 19–24
6. Douma CE, Redekop WK, Van Der Meulen JH et al. Predicting mortality in intensive care patients with acute renal failure treated with dialysis. J Am Soc Nephrol 1997; 8: 111–117
7. Levy EM, Viscoli CM, Horwitz RI. The effect of acute renal failure on mortality. JAMA 1996; 275: 1489–1494
8. Liano F, Pascual J, Garcia-Martin F et al. Prognosis of acute tubular necrosis: an extended prospectively contrasted study. Nephron 1993; 63: 21–31
9. Paganini EP, Halstenberg WK, Goormastic M. Risk modeling in acute renal failure requiring dialysis: the introduction of a new model. Clin Nephrol 1996; 46: 206–211
10. Lins R, Elseviers M, Daelemans R et al. Prognostic value of a new scoring system for hospital mortality in acute renal failure. Clin Nephrol 2000; 53: 10–17
11. Lins RL, Elseviers MM, De Broe ME. Validation and adaptation of the Stuivenberg hospital acute renal failure (SHARF) score in a multicentre study. J Am Soc Nephrol 1998; 9: 154
12. Halstenberg WK, Goormastic M, Paganini EP. Validity of four models for predicting outcome in critically ill acute renal failure patients. Clin Nephrol 1997; 47: 81–86
13. Schuster DP. Predicting outcome after ICU admission. The art and science of assessing risk. Chest 1992; 102: 1861–1870
14. Fery-Lemonnier E, Landais P, Loirat P, Kleinknecht D, Brivet F. Evaluation of severity scoring systems in ICUs—translation, conversion, and definition ambiguities as a source of inter-observer variability in APACHE II, SAPS, and OSF. Intens Care Med 1995; 21: 356–360
15. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143: 29–36
16. Randolph AG, Guyatt GH, Richardson WS. Prognosis in the intensive care unit: finding accurate and useful estimates for counseling patients. Crit Care Med 1998; 26: 767–772
17. Lemeshow S, Hosmer DW. A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol 1982; 115: 92–106
18. Braitman LE, Davidoff F. Predicting clinical states in individual patients. Ann Intern Med 1996; 125: 406–412
19. Nouira S, Belghith M, Elatrous S et al. Predictive value of severity scoring systems: comparison of four models in Tunisian adult intensive care units. Crit Care Med 1998; 26: 852–859
20. Randolph AG, Guyatt GH, Carlet J, for the Evidence Based Medicine in Critical Care Group. Understanding articles comparing outcomes among intensive care units to rate quality of care. Crit Care Med 1998; 26: 773–781
21. Mangano CM, Diamondstone LS, Ramsay JG, Aggarwal A, Herskowitz A, Mangano DT, for the Multicentre Study of Perioperative Ischemia Research Group. Renal dysfunction after myocardial revascularisation: risk factors, adverse outcomes, and hospital resource utilisation. Ann Intern Med 1998; 128: 194–203
22. Brivet FG, Kleinknecht DJ, Loirat P, Landais PJM, the French Study Group on Acute Renal Failure. Acute renal failure in intensive care units: causes, outcome, and prognostic factors of hospital mortality: a prospective, multicentre study. Crit Care Med 1996; 24: 192–198
23. Paterson L, Goldstein H. New statistical methods for analyzing social structures: an introduction to multilevel models. Br Educ Res J 1991; 17: 387–393