Editorial II: Solid as a ROC

Helen F. Galley

Academic Unit of Anaesthesia and Intensive Care, Institute of Medical Sciences, University of Aberdeen, Aberdeen AB25 2ZD, UK. E-mail: h.f.galley@abdn.ac.uk

Receiver operating characteristic (ROC) curves were developed to assess the performance of radar operators during the Second World War. The radar operators had to be able not only to distinguish friend from foe among the blips on their screens, but also to distinguish signal from noise. The ability of the radar operators to make these life-or-death distinctions was termed the receiver operating characteristic. Plots of right vs wrong answers were called ROC curves, and these curves were taken up in the 1970s by the medical profession to characterize the relationship between the sensitivity (true positive rate) and specificity (true negative rate) of diagnostic tests. ROC curves are now used in medical imaging, materials testing, weather forecasting, information retrieval, polygraph lie detection, and aptitude testing. Although the ROC itself is sound, the values obtained from diagnostic tests often require qualification because the test data on which they are based are of dubious quality.1

Clinical researchers are frequently confronted with the problem of determining how accurately a scoring system or biochemical test discriminates between ‘healthy’ and ‘diseased’ patients. Healthy equates to the ‘noise’ and diseased to the ‘signal’ of the radar operators in the 1940s. In the paper by Kerbaul and colleagues in this issue of the British Journal of Anaesthesia,2 ROC curves were constructed to define the sensitivity and specificity of measurements of the cardiac hormone, pro-brain natriuretic peptide (N-BNP), in identifying patients with severe systemic inflammatory response syndrome (SIRS) after coronary artery bypass grafting (CABG).

A test to discriminate ‘diseased’ (patients with severe SIRS in this case) from ‘healthy’ (no SIRS) rarely gives perfect segregation between the two groups, as shown in Fig. 1. The distributions of N-BNP concentrations, or indeed of any other test, will overlap. For every possible cut-off N-BNP level, there will be some patients correctly identified as having SIRS (true positives) but also some identified as having SIRS when in fact they do not (false positives). Conversely, patients without SIRS may be correctly classified as negative (true negatives), while some patients who do have SIRS will be missed (false negatives).



Fig 1 Test results for two imaginary patient populations, one with and one without SIRS. The cut-off value is the test result above which patients are classified as having SIRS. In (A) the cut-off value is set fairly low (high sensitivity but low specificity), resulting in a small number of false negatives but a large number of false positives. If the cut-off value is increased (B), the sensitivity decreases and the specificity increases, so the number of false positives decreases but the number of false negatives increases, and the number of true positives decreases.

 
Sensitivity can be defined as the probability that the N-BNP result will be positive (i.e. above a predefined cut-off level) when SIRS is present. Specificity is the probability that the N-BNP result will be below the defined level (negative) when SIRS is not present. Thus, the ability of N-BNP to detect SIRS accurately is a balance between sensitivity and specificity.
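
As a concrete illustration, the minimal Python sketch below computes these two quantities from a 2 x 2 classification table; the counts are invented for illustration and are not the data of Kerbaul and colleagues.

    # Hypothetical 2 x 2 table for one N-BNP cut-off (illustrative counts only)
    true_positives  = 30   # SIRS present, N-BNP above the cut-off
    false_negatives = 10   # SIRS present, N-BNP below the cut-off
    true_negatives  = 45   # SIRS absent,  N-BNP below the cut-off
    false_positives = 15   # SIRS absent,  N-BNP above the cut-off

    sensitivity = true_positives / (true_positives + false_negatives)   # 0.75
    specificity = true_negatives / (true_negatives + false_positives)   # 0.75
    print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")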

Kerbaul and colleagues used ROC curves in two ways: to assess the performance of N-BNP in predicting SIRS after CABG, and to compare N-BNP with procalcitonin (PCT), another inflammatory marker shown to increase after cardiac surgery. The best possible test for predicting SIRS would give 100% sensitivity (all patients with SIRS correctly identified) and 100% specificity (no patients without SIRS wrongly identified). Kerbaul and colleagues suggested a cut-off point of 500 pg ml−1 N-BNP (before surgery), which resulted in a sensitivity of 0.75 (i.e. 75% of patients with SIRS correctly identified). Increasing the cut-off value would have increased the specificity but decreased the sensitivity. Plotting sensitivity (the true positive rate) against 1−specificity (the false positive rate) for each possible cut-off value constitutes a ROC curve.
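
To make this construction explicit, the following sketch sweeps every candidate cut-off over two small sets of hypothetical N-BNP values (invented for illustration, not taken from the study) and records the point that each cut-off contributes to the ROC curve.

    # Hypothetical N-BNP concentrations (pg ml-1); not the study data
    nbnp_sirs    = [620, 850, 480, 930, 710, 560, 1040, 390]   # patients with severe SIRS
    nbnp_no_sirs = [210, 340, 520, 180, 460, 290, 610, 250]    # patients without SIRS

    def roc_points(positives, negatives):
        """Return (1 - specificity, sensitivity) for every candidate cut-off."""
        cutoffs = sorted(set(positives + negatives), reverse=True)
        points = [(0.0, 0.0)]
        for c in cutoffs:
            sens = sum(x >= c for x in positives) / len(positives)   # true positive rate
            spec = sum(x < c for x in negatives) / len(negatives)    # true negative rate
            points.append((1 - spec, sens))
        points.append((1.0, 1.0))
        return points

    for fpr, tpr in roc_points(nbnp_sirs, nbnp_no_sirs):
        print(f"1 - specificity = {fpr:.2f}, sensitivity = {tpr:.2f}")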

A good diagnostic test has small false-positive and false-negative areas across a range of cut-off values. A bad test is one in which the only cut-off values that make the false-positive rate low have a high false-negative rate (and vice versa). A ROC curve for a good test climbs rapidly towards the top left-hand corner of the graph (Fig. 2). The area under the ROC curve quantifies how good the test is: the larger the area, the better the test. If the area is 1.0, there is a cut-off giving 100% sensitivity and 100% specificity. If the area is 0.5, the curve lies along the diagonal and the test discriminates no better than flipping a coin. In the N-BNP study, the area under the ROC curve was around 0.8 for N-BNP (before surgery) but only 0.4 for PCT at the same time point. An area of 0.8 can be classed as good and is certainly better than 0.4. It is perhaps worth noting that substantive evidence that PCT is good at discriminating between patients with and without early SIRS is lacking.
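
The area itself can be obtained from those points by the trapezoidal rule. The sketch below reuses the hypothetical roc_points helper and data from the previous example; it is intended only to show the arithmetic, not to replace a proper statistical package.

    def auc(points):
        """Trapezoidal area under a ROC curve given (1 - specificity, sensitivity) points."""
        pts = sorted(points)
        area = 0.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            area += (x1 - x0) * (y0 + y1) / 2
        return area

    print(f"AUC = {auc(roc_points(nbnp_sirs, nbnp_no_sirs)):.2f}")
    # 1.0 would mean perfect discrimination; 0.5 is no better than a coin toss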



Fig 2 Examples of what very good, good and useless ROC curves look like.

 
The decision as to where to place the cut-off point on a ROC curve is important and depends on several key factors. If the disease in question is rare, even the most specific test will be associated with many false-positive results, but if the disease is common the positives will most likely be true positives (Fig. 1). Clearly we would like to see few false alarms, but not at the expense of missing real hits.1 Thus, the cut-off point depends on the incidence of the disease and several other factors, such as the cost to the patient of wrong classification. For example, this would include the discomfort or damage caused by wrongly treating someone misclassified as having the disease, or the increased mortality, perhaps, of failing to spot a patient with the disease and thus delaying treatment. SIRS after surgery is very common and the consequences of a false positive are minimal. But would a false negative (i.e. failing to identify the patient who does have SIRS) endanger the patient? One could argue that early identification may not affect treatment, but a false negative may lead to complacency, which would put the patient at risk. All these factors should be considered in setting the cut-off point for N-BNP. It is not clear what criteria were used in the study by Kerbaul and colleagues to decide the cut-off point.2
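
One way to make such a decision explicit is to attach a cost to each kind of error and choose the cut-off that minimizes the expected cost. The sketch below does this for the hypothetical data above, with an assumed prevalence and assumed relative costs that would, in practice, have to come from clinical judgement.

    # Assumed prevalence of severe SIRS and assumed relative costs (illustrative only)
    prevalence          = 0.3
    cost_false_negative = 5.0   # cost of missing a patient who does have severe SIRS
    cost_false_positive = 1.0   # cost of unnecessary concern or treatment

    def expected_cost(cutoff, positives, negatives):
        sens = sum(x >= cutoff for x in positives) / len(positives)
        spec = sum(x < cutoff for x in negatives) / len(negatives)
        return (prevalence * (1 - sens) * cost_false_negative
                + (1 - prevalence) * (1 - spec) * cost_false_positive)

    candidates = sorted(set(nbnp_sirs + nbnp_no_sirs))
    best = min(candidates, key=lambda c: expected_cost(c, nbnp_sirs, nbnp_no_sirs))
    print(f"cut-off minimizing expected misclassification cost: {best} pg ml-1")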

Of course, there are other confounding issues relating to ROC curves. The accurate identification of the ‘diseased’ and ‘healthy’ patients is paramount in correctly classifying the positives and negatives. In other words, the choice of gold standard for the accurate classification of patients is crucial. In some situations this is quite straightforward; for example, patients with a fracture can be identified very accurately by x-ray. For SIRS it is not so simple. In the study by Kerbaul and colleagues,2 SIRS was identified using the American College of Chest Physicians/Society of Critical Care Medicine (ACCP/SCCM) criteria.3 Although these criteria are widely accepted, it is important to remember that SIRS, and each criterion used to identify it, is a continuum. Each criterion (e.g. temperature) also has its own sensitivity and specificity. The definition of SIRS is over-inclusive: almost all patients will fulfil the criteria after surgery, and even healthy people can fulfil them, for example after exercise. The use of high or low values for temperature and leucocyte count will also exclude patients in transition from high to low or low to high values, and the ventilatory frequency is difficult to interpret if a patient is ventilated. The other problem with ROC curves arises when the test and the gold standard used to classify the disease are dependent upon each other. For example, if the gold standard (in this case the ACCP/SCCM criteria for SIRS) were compared with itself, the area under the curve would be 1.0, regardless of whether the gold standard is actually a good test. In addition, if the gold standard is poor but independent of the test in question, the performance of the test will be underestimated.
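
The effect of an imperfect but independent gold standard can be seen in a small simulation: randomly flipping a fraction of the reference labels lowers the apparent area under the curve even though the underlying test has not changed. This is a sketch with simulated data, not a re-analysis of the study.

    import random
    random.seed(1)

    # Simulated test values: 'diseased' patients score higher on average than 'healthy' ones
    diseased = [random.gauss(2.0, 1.0) for _ in range(500)]
    healthy  = [random.gauss(0.0, 1.0) for _ in range(500)]

    def auc_from_labels(values, labels):
        """AUC via the rank identity: the probability that a diseased value exceeds a healthy one."""
        pos = [v for v, lab in zip(values, labels) if lab == 1]
        neg = [v for v, lab in zip(values, labels) if lab == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    values = diseased + healthy
    true_labels = [1] * len(diseased) + [0] * len(healthy)
    # Imperfect gold standard: 15% of reference labels are wrong, independently of the test
    noisy_labels = [lab if random.random() > 0.15 else 1 - lab for lab in true_labels]

    print(f"AUC against the true labels:  {auc_from_labels(values, true_labels):.2f}")
    print(f"AUC against the noisy labels: {auc_from_labels(values, noisy_labels):.2f}")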

Of course, it is important not to forget other factors; for example, how good is the test at discriminating between early or mild disease and late or severe disease? In the study by Kerbaul and colleagues, N-BNP was used only to detect SIRS beyond the ACCP/SCCM criteria; no classification of severity was addressed.2 The control patients are also important; for example, if the non-SIRS patients were healthy controls, it is easy to see that N-BNP would perform better than if comparative controls, for example patients without SIRS after CABG, were used. Comorbidity may also affect a test, either positively or negatively. Another potential source of error, when clinicians are not blinded to the results of the test, is bias. For example, a negative result may induce complacency, which may jeopardize later acceptance that the patient actually has the disease. Ironically, a good test is more likely to be associated with stronger bias.4

In summary, ROC curves are graphical representations of the trade-off between sensitivity and specificity. The cut-off point used to define the test performance depends on many factors, including disease incidence and the cost of misdiagnosis. There is no such thing as a bad ROC curve: it is the test that is good or bad. The ROC is solid.

References

1 Swets JA. Measuring the accuracy of diagnostic systems. Science 1988; 240: 1285–93

2 Kerbaul F, Giorgi R, Oddoze C, et al. High concentrations of N-BNP are related to non-infectious severe SIRS associated with cardiovascular dysfunction occurring after off-pump coronary artery surgery. Br J Anaesth 2004; 93: 639–44

3 Bone RC, Balk RA, Cerra FB, et al. Definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis. The ACCP/SCCM Consensus Conference Committee. American College of Chest Physicians/Society of Critical Care Medicine. Chest 1992; 101: 1644–55

4 Begg CB, McNeil BJ. Assessment of radiologic tests: control of bias and other design considerations. Radiology 1988; 167: 565–9