1 Department of Nephrology-Hypertension, A.C.Z.A. Campus Stuivenberg, Antwerp; 2 Department of Nephrology-Hypertension, University Hospital Antwerp, Edegem/Antwerp, Belgium
Keywords: acute renal failure; disease management; prognostic models; scoring; treatment
Historical perspective of prognostic model development
Clinicians are frequently asked to make predictions concerning diagnosis, risk assessment or prognosis. In recent years, the science of prognostication (quantitative prediction) has evolved rapidly, and critical care practice has been at the forefront of this important international trend [1]. Various scoring systems, mostly illness severity scores, have been developed to optimize the use of clinical experience in the intensive care unit (ICU) and to address questions of effectiveness, efficiency, quality of care and correct allocation of scarce resources [2]. Originally, experts selected the variables for these scores and weighted them according to their perceived importance, on the basis of personal experience (Acute Physiology and Chronic Health Evaluation, APACHE system) [3]. Later, scores were based on multiple logistic regression of rather small databases (Mortality Probability Models, MPM I) [3]. Subsequently, large databases were compiled prospectively from many ICUs (APACHE III, Simplified Acute Physiology Score, SAPS II, MPM II). Finally, scores with multiple measuring points were developed, also based on large multicentre databases (SOFA; MPM0, MPM24, MPM48 and MPM72) [3,4].
The general severity scoring systems are, however, inappropriate for disease-specific populations, such as patients with acute renal failure (ARF) [1,2,5,6]. Since ARF carries an additional risk of mortality [7], disease-specific scoring systems have been developed, e.g. the Liaño [8,9] and Stuivenberg Hospital Acute Renal Failure (SHARF) [10,11] scores.
Single centre or multicentre development of scoring systems
It is generally observed that models developed in one centre fail to confirm their predictive value when tested in other centres and patient groups [12]. To overcome these problems, third-generation prediction models were developed, based on large databases collected from many centres [2,3,13].
Several considerations made us challenge this approach. First, scores developed in a multicentre study, such as APACHE or MPM, are also likely to perform insufficiently well when used in a different setting [6]. Moreover, in a multicentre setting, ambiguities of translation, conversion and definition are a source of inter-observer variability that increases the risk of bias in the results [14]. Differences in technical and therapeutic resources, in administration, and in the level of organization and staffing can also have an uncontrollable impact [3].
Statistical problems in model design and testing
The technique most frequently recommended for model development is logistic regression [3]. The 1994 Consensus Conference in Intensive Care Medicine [2] recommended that for testing the performance of such models, discrimination and calibration must be assessed in the development and the testing sample.
Discrimination evaluates the ability of a model to distinguish dying patients from those who survive. The area under the receiver operating characteristic (ROC) curve [15] is most commonly used to indicate discriminative power. In the literature, a value of >0.70 for the area under the ROC curve is considered satisfactory for discrimination [3]. However, this level is too low to be useful for mortality prediction in individual patients [16].
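As an illustration of how discrimination is usually quantified, the sketch below computes the area under the ROC curve for a set of predicted death probabilities and observed outcomes. The data are simulated and the use of scikit-learn's roc_auc_score is merely one convenient implementation, not part of any of the scoring systems discussed here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Simulated example: predicted death probabilities from a prognostic model
# and the observed outcome (1 = died in hospital, 0 = survived).
rng = np.random.default_rng(42)
predicted_risk = rng.uniform(0.05, 0.95, size=200)
died = rng.binomial(1, predicted_risk)

# Area under the ROC curve: the probability that a randomly chosen non-survivor
# is assigned a higher predicted risk than a randomly chosen survivor.
auc = roc_auc_score(died, predicted_risk)
print(f"Area under the ROC curve: {auc:.2f}")  # values > 0.70 are usually called satisfactory
```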
Calibration evaluates the degree of correspondence between the predicted probabilities of mortality and the observed mortality. Good calibration or fitting is determined using tables of observed versus predicted mortality and the Hosmer-Lemeshow goodness-of-fit χ² statistic [17]. These techniques have been proposed in recent literature for so-called third-generation prognostic systems [3,18]. The result of goodness-of-fit analysis for the calibration of observed versus predicted mortality data is, however, sensitive to sample size [19]. In addition, we found that the calibration was strongly influenced by the way the data were classified.
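For readers who want to see how the calibration test is typically computed, the following sketch implements a basic Hosmer-Lemeshow statistic and shows how the result changes with the number of risk strata, echoing the sensitivity to data classification noted above. The helper hosmer_lemeshow and the simulated data are purely illustrative assumptions, not the procedure used in any published score.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(predicted, observed, n_groups=10):
    """Hosmer-Lemeshow chi-square for observed vs predicted mortality,
    grouping patients into n_groups strata of increasing predicted risk."""
    order = np.argsort(predicted)
    strata = np.array_split(order, n_groups)
    stat = 0.0
    for g in strata:
        n = len(g)
        expected = predicted[g].sum()        # expected deaths in the stratum
        observed_deaths = observed[g].sum()  # observed deaths in the stratum
        # contributions of deaths and of survivors to the chi-square statistic
        stat += (observed_deaths - expected) ** 2 / expected
        stat += ((n - observed_deaths) - (n - expected)) ** 2 / (n - expected)
    p_value = chi2.sf(stat, df=n_groups - 2)
    return stat, p_value

rng = np.random.default_rng(0)
predicted = rng.uniform(0.05, 0.95, 500)
observed = rng.binomial(1, predicted)

# The result depends on how the data are classified: compare 10 vs 8 strata.
for k in (10, 8):
    stat, p = hosmer_lemeshow(predicted, observed, n_groups=k)
    print(f"{k} strata: chi-square = {stat:.1f}, P = {p:.2f}")
```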
It may be concluded that these development and testing techniques, currently considered the gold standard, could become obsolete in the future.
Performance and customization in subsets of patients and in other institutions
Validation samples must be assembled either by collecting data on a new cohort of patients or by randomly splitting the database into two parts: one for developing the model and the other for testing it [3]. If validation fails, adaptation (customization) of the model is necessary [2].
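A minimal sketch of the split-sample approach is given below, assuming hypothetical severity variables and a simple logistic regression model; the variables, coefficients and split are invented for illustration. A marked drop in discrimination in the validation half would signal the need for customization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical severity variables (e.g. age, creatinine, need for ventilation) and outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_beta = np.array([0.8, 0.5, -0.4])
p = 1.0 / (1.0 + np.exp(-(X @ true_beta - 1.0)))
y = rng.binomial(1, p)

# Randomly split the database: one half to develop the model, the other to test it.
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.5, random_state=1)
model = LogisticRegression().fit(X_dev, y_dev)

# Discrimination in the development and the validation sample; a clear drop in the
# validation sample would call for customization (re-estimating the coefficients).
auc_dev = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
auc_val = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"development AUC: {auc_dev:.2f}, validation AUC: {auc_val:.2f}")
```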
Several assumptions must be made when applying a model in different centres or subsets of patients (Table 1) [16-20]. First, the sample of the population taken for the development of the score must be representative of the total population of interest [16], e.g. all ARF patients must be included [11]. If the patients are heterogeneous with respect to the risk of adverse outcome, e.g. cardiac surgery patients with a high risk of ARF and a rather variable prognosis [21], the study will be much more useful if the investigators define subgroups at lower and higher risk in comparison with the group as a whole [16]. Secondly, case-mix within ICUs varies periodically, so the score must be re-evaluated and the model adapted over time [13]. This is an important factor to consider for the future, especially if a model were to be used for research or quality assurance purposes. The third problem is lead-time bias, i.e. the erroneous estimation of risk at the time of admission to the ICU due to the results of therapeutic actions taken previously [21]. In a French study [22], prognosis differed between patients admitted to the ICU immediately and those admitted later during their hospitalization. Finally, variability between centres in ICU characteristics, such as organization, culture and therapeutic strategies, and in the content and quality of data collection, can have an important impact on the reliability of the severity score [14].
The use of scores in individual patients
To estimate the prognosis for an individual patient, additional requirements are necessary (Table 2). The individual patient can differ from the study group [18], so the patient must conform to the inclusion and exclusion criteria that were used to develop the original model, or selection bias may occur [13].
If a prognostic score differs markedly from personal observations of outcome or from other reports, it is important to find other scores that confirm the calculated prognosis [16]. One must bear in mind, however, that in the few examples where this has been studied, clinical judgement seems to be equal to, if not better than, the available predictive models [13] for individual patients.
Some authors have defined a cut-off point above which no survival is predicted [6,8], in which case therapy should be withheld. From a statistical point of view, however, a degree of uncertainty always remains, making such a decision debatable. Moreover, from an ethical point of view, such a decision can never be based on a score alone, whatever the patient's performance.
Perspectives for the future
Prognostication will continue to increase in importance, for the simple reason that scoring systems fit well within evidence-based disease management: the management of the total patient, based on the best available evidence and across the continuum of health care, with the goal of enhanced patient outcomes and reduced total cost of care. Three factors are important for the future: inclusion of costs, adaptation of scores to different populations and periods, and choice of the most appropriate outcome parameter.
If costs are incorporated in the development and validation of scores, therapeutic options and cost-effectiveness of diagnostic and therapeutic procedures can be studied. Some additional applications of this strategy are help with decision-making when considering diagnostic or therapeutic options, evaluation of new therapies and evaluation of allocated resources. For a comparison of institutions regarding quality of care, indirect standardization through the calculation of observed versus predicted outcome ratios (such as the standardized mortality ratio) has been advocated as an important technique [20]. Better tools will probably be developed in the future.
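As a small worked example of indirect standardization, the sketch below computes a standardized mortality ratio for one hypothetical institution from observed outcomes and model-predicted death probabilities; all numbers are invented for illustration.

```python
import numpy as np

# Hypothetical data for one institution: observed outcomes (1 = died) and
# model-predicted death probabilities for the same patients.
died = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
predicted_risk = np.array([0.6, 0.2, 0.1, 0.7, 0.3, 0.5, 0.2, 0.1, 0.8, 0.3])

observed_deaths = died.sum()
expected_deaths = predicted_risk.sum()  # expected deaths under the reference model

# Standardized mortality ratio: observed versus predicted deaths (indirect standardization).
smr = observed_deaths / expected_deaths
print(f"SMR = {smr:.2f}")  # > 1 suggests more deaths than predicted, < 1 fewer
```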
To overcome the problems of adapting a model to a different population, some authors have proposed multilevel modelling, because it is often necessary to combine patient data from different ICUs to have enough data to evaluate a particular patient group. This technique controls for both ICU- and patient-related factors that can influence outcome [23]. The key technical advance in multilevel modelling is the assumption that a factor specific to each ICU varies randomly across ICUs, i.e. at the ICU level. Currently, however, there is no gold standard technique for adapting a model to different populations.
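To make the random ICU-level factor concrete, the following sketch simulates a two-level (random-intercept) logistic model in which each ICU contributes its own deviation to the log-odds of death. The number of ICUs, the coefficients and the severity variable are all assumed purely for illustration and do not correspond to any published model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_icus, patients_per_icu = 20, 100
sigma_u = 0.5                    # SD of the ICU-level random intercept (assumed)
beta = np.array([-2.0, 0.08])    # intercept and severity-score coefficient (illustrative)

# ICU-level random effect: one draw per centre, shared by all of its patients.
u = rng.normal(0.0, sigma_u, size=n_icus)

icu = np.repeat(np.arange(n_icus), patients_per_icu)
severity = rng.normal(20, 8, size=icu.size)  # hypothetical severity score

# Two-level logistic model: logit P(death) = beta0 + beta1 * severity + u[icu].
logit = beta[0] + beta[1] * severity + u[icu]
p_death = 1.0 / (1.0 + np.exp(-logit))
death = rng.binomial(1, p_death)

# Observed mortality differs between ICUs even with a similar case-mix,
# reflecting the centre-level effect that multilevel models estimate explicitly.
for j in range(3):
    print(f"ICU {j}: observed mortality {death[icu == j].mean():.2f}")
```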
An important aspect of the future development of prognostic scoring systems is the choice of the most appropriate outcome parameter to study. Hospital mortality is probably not of the greatest interest to the patient. Long-term survival and quality of life have become the most important outcome parameters studied in other areas of epidemiological research, particularly in evidence-based disease management. According to a Consensus Conference in Intensive Care Medicine, these outcome measures should be incorporated in future research [2].
Conclusion
Developing a good scoring system for ARF, allowing the calculation of the probability of hospital death or other outcome variables, is a daunting but very useful task. The score will need adaptation for many subsets of patients, with regular correction for evolution in organization and treatment. Only a continuous study effort and a well-designed computer programme with regular updates can fulfil this task. Ideally, the end point should be one basic score for all diseases, with adaptation to the centre, the subset of patients and the individual diseases.
But even if one day such an ideal score is found, allowing perfect prediction for an individual patient and excellent quality assessment, it will always be necessary to use our clinical skills and rational thinking before making an evidence-based decision on behalf of the patient. Probability models are there to support clinical judgement, not to replace it. The final decision will always remain the responsibility of the individual physician, based on different criteria, of which a good scoring system is one among several powerful tools.
Notes
Correspondence and offprint requests to: Robert Lins, Department of Nephrology-Hypertension, A.C.Z.A. Campus Stuivenberg, Lange Beeldekensstraat 267, B-2060 Antwerpen, Belgium.
References