1 Département de Biostatistique, Pavillon Saint-Jacques, Hôpital COCHIN, 27 rue du Faubourg Saint-Jacques, 75674 Paris Cedex 14, France.
2 Service de Médecine Interne, Hôpital Louis Mourier, 178 rue des Renouillers, 92700 Colombes, France.
Correspondence: Dr Joël Coste, Département de Biostatistique, Pavillon Saint-Jacques, Hôpital Cochin, 27 rue du Faubourg Saint-Jacques, 75674 Paris Cedex 14, FRANCE. E-mail : coste{at}cochin.univ-paris5.fr
Abstract |
---|
Methods and Results We show that the width of this grey zone depends on the difference between the means of test results for subjects with and without the disease, the variability of the test results and its components (biological, measurement), and the level of the misclassification risks (false positive, false negative) required by the context of use. We illustrate the method by application to the tuberculin skin test and iron deficiency markers in children.
Conclusion This method can be used both to display the discriminatory performance of a quantitative test in a variety of contexts and to scrutinize its components of variability. Due to the simplicity of the graphical representations, the grey zone approach may be useful during the development of quantitative tests and the publication of their performance.
Accepted 28 October 2002
Diagnostic and screening discrimination problems require a rule that enables a new subject to be assigned to the correct population, e.g. with or without a given disease, with the lowest rate (or cost) of misclassification. In the case of a quantitative test and a binary decision, the discrimination rule classifies as diseased a subject whose value is above (or below) an optimal cutpoint, determined under given constraints (e.g. rate or cost of false positives/negatives).1,2 However, since most tests do not discriminate perfectly between subjects with and without a given disease, certainty about disease status cannot be obtained for results within a given range of (intermediate) values. To deal with this problem, the construction of a three-zone partition, including a middle inconclusive zone of intermediate values, has been proposed3 and applied to categorical and ordinal tests.4,5
In this paper, we extend the grey zone approach to quantitative diagnostic and screening tests, and illustrate it by application to the tuberculin skin test and iron deficiency markers in children. The graphical representations allowed by this approach are intended to help in the development of quantitative tests, and the evaluation and reporting of their measurement properties.
Construction of the grey zone for diagnostic or screening discrimination |
---|
In situations where perfect discrimination between subjects with and without a given disease is possible, e.g. some IgM-antibody serological tests, the construction of a grey zone is irrelevant. However, there is often a significant overlap between distributions of test results for subjects with and without the disease, and the grey zone may be wide. Its width depends on the degree of overlap between the distributions, which reflects both true overlap and measurement error. It also depends on the requirements of the clinical or screening context in terms of likelihood ratios (LR). Indeed, to confirm or exclude the presence of a target disease, high positive LR (LR+) and low negative LR (LR-) values are necessary to ensure post-test probabilities close to 1 or 0 for a range of pre-test probability values. The required degree of closeness of post-test probabilities to 1 or 0 depends on the context. Sometimes, post-test probabilities over 0.99 or even 0.999 (or under 0.01 or 0.001) are required to confirm or exclude the presence of a target disease. Examples include confirming the diagnosis of human immunodeficiency virus (HIV) infection before initiating antiretroviral therapy, or excluding Down's syndrome by a prenatal screening test. Otherwise, clinicians or Public Health professionals may find slightly lower probabilities, e.g. 0.95 (0.05), sufficient to decide whether a target disease is present.
Once the analysis of the context has provided suitable values of LR, the identification of the two cut-off points delimiting the grey zone is straightforward: one, gup, associated with the minimal desirable value of LR+; the other, glow, associated with the maximal desirable value of LR-. These cut-off points define an area of inconclusive values: the grey zone.
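The search for the two cut-off points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`likelihood_ratios`, `grey_zone_from_lr`) are hypothetical, the test is taken to be positive when the value is at or above the cut-off, and the sensitivity and specificity curves come from two invented normal distributions rather than from any data set used in this paper.

```python
from statistics import NormalDist


def likelihood_ratios(sens, spec):
    """Return (LR+, LR-) for a cut-off with the given sensitivity and specificity."""
    return sens / (1.0 - spec), (1.0 - sens) / spec


def grey_zone_from_lr(cutoffs, sens_at, spec_at, lr_pos_min, lr_neg_max):
    """Scan candidate cut-offs and return (g_low, g_up).

    g_up  is the smallest cut-off whose LR+ reaches lr_pos_min.
    g_low is the largest cut-off whose LR- stays at or below lr_neg_max.
    Raises ValueError if no candidate cut-off satisfies a requirement.
    """
    ups, lows = [], []
    for c in cutoffs:
        lr_pos, lr_neg = likelihood_ratios(sens_at(c), spec_at(c))
        if lr_pos >= lr_pos_min:
            ups.append(c)
        if lr_neg <= lr_neg_max:
            lows.append(c)
    return max(lows), min(ups)


# Hypothetical distributions (not the tuberculin or CHr data).
healthy = NormalDist(mu=0.0, sigma=1.0)
diseased = NormalDist(mu=3.0, sigma=1.0)

g_low, g_up = grey_zone_from_lr(
    cutoffs=[i / 10 for i in range(-30, 61)],
    sens_at=lambda c: 1.0 - diseased.cdf(c),  # P(value >= c | disease)
    spec_at=healthy.cdf,                      # P(value <  c | no disease)
    lr_pos_min=44.0,
    lr_neg_max=0.022,
)
print(f"grey zone: {g_low} to {g_up}")
```

If the returned `g_low` exceeds `g_up`, the two requirements are met without any inconclusive interval, which corresponds to the situation where a grey zone is unnecessary.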
The tuberculin skin test
Results of the tuberculin skin test are important for care management decisions, in particular whether to initiate antituberculous therapy in HIV-seropositive patients.6–9 Let us suppose that clinicians in a specialized AIDS unit want to use the tuberculin skin test to rule in or rule out a diagnosis of tuberculosis in an HIV-1 infected patient with signs of probable tuberculosis, the pre-test probability being estimated to be between 0.30 and 0.70. Clinicians require post-test probabilities of (1) about 0.95 after a positive result, and thus an LR+ over 44, to accept the hypothesis and treat the patient without further exploration; and (2) about 0.05 after a negative result, and thus an LR- under 0.022, to reject the diagnosis and seek another (these values are subjective probabilities provided by clinicians working in such AIDS units). The construction of a grey zone (Figure 1, panels A and B) for these LR values (using reference data on the distribution of tuberculin skin test results in healthy and tuberculosis-infected subjects10) gives the interval 7.8–16.6 mm. An immediate inspection of the width of the grey zone shows the poor discrimination ability of this test in the context considered: the grey zone corresponds to one-third of the range of possible values. Also, the usual cut-off point of 10 mm, shown in bold in Figure 1, is clearly located well within the grey zone.
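The LR requirements quoted in this example can be checked with Bayes' theorem in odds form (post-test odds = pre-test odds × LR). The sketch below uses a hypothetical helper, `required_lr`. It assumes that the LR+ requirement is driven by the lowest pre-test probability (0.30) and the LR- requirement by the highest one, taken here as 0.70, the upper bound stated for this example in the later section on misclassification risks.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def required_lr(pre_test, post_test):
    """LR needed to move from a pre-test to a post-test probability."""
    return odds(post_test) / odds(pre_test)


lr_pos_min = required_lr(pre_test=0.30, post_test=0.95)  # about 44.3
lr_neg_max = required_lr(pre_test=0.70, post_test=0.05)  # about 0.0226
print(round(lr_pos_min, 1), round(lr_neg_max, 4))
```

Both values agree with the thresholds of 44 and 0.022 used to construct the grey zone.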
Conclusive tests are also required for screening.13 There should be as few false negative cases as possible, and false positives are also unwelcome because they are often further investigated by more invasive or costly methods. The grey zone approach would allow a differentiated attitude towards results: definitely negative (no further action), most certainly positive (requiring verification) and grey (requiring another test or a follow-up).
For a given clinical or screening context, and a range of estimated pre-test probability values, our method can therefore be used with a candidate quantitative test to construct a three-zone partition including the grey zone according to LR requisites. Similarly, it could help evaluation of the discriminatory performance of a test in various clinical or screening contexts with different LR requisites, and help to choose between several tests or thresholds in a given context.
Identifying the proportion of results that will fall within the grey zone will also help assess the usefulness of a test in practice (see below).
Entering the grey zone and analysing its determinants |
---|
Risks of misclassification and expected proportion of test values in the grey zone
The construction of a grey zone for a test therefore implies three possible responses: positive, inconclusive or grey, and negative. A subject with the disease should ideally be classified as positive and a subject without the disease as negative, and consequently there are four risks of misclassification (Figure 3):
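(1) $\alpha$, the risk of a false positive: a subject without the disease is classified as positive;
(2) $\beta$, the risk of a false negative: a subject with the disease is classified as negative;
(3) $\alpha'$, the risk that a subject without the disease is classified as grey;
(4) $\beta'$, the risk that a subject with the disease is classified as grey.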
Estimating these risks is straightforward using a plot of sensitivity and specificity (Figure 4).
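For a pre-test probability $p(D)$, the expected proportion of test values falling in the grey zone is
$$P(\text{grey}) = p(D)\,\beta' + [1 - p(D)]\,\alpha'$$
where $\alpha'$ and $\beta'$ are the probabilities of a grey result for subjects without and with the disease, respectively.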
The tuberculin skin test
The grey zone determined above in the context of a probable diagnosis of tuberculosis (p(D) between 0.30 and 0.70) is 7.8–16.6 mm. These limits correspond (Figure 1, panel C) to $\alpha = \beta = 0.025$, $\alpha' = 0.435$ and $\beta' = 0.295$, and the expected proportion of values inside the grey zone would therefore be between 0.34 and 0.39.
The reticulocyte haemoglobin content test
The grey zone determined above in the context of screening in a population where p(D) is 0.10 is 22.0–28.2 pg. These limits give $\alpha = 0.04$, $\beta = 0.02$, $\alpha' = 0.46$ and $\beta' = 0.83$, and the expected proportion of values inside the grey zone would be 0.50 (Figure 2, panel C).
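The expected proportions quoted in both examples can be reproduced with a short calculation. The sketch below assumes that the two primed risks are the probabilities of a grey result for subjects without the disease (0.435 and 0.46 in the examples) and with the disease (0.295 and 0.83), and that they are weighted by the pre-test probability; the function name is hypothetical.

```python
def expected_grey_proportion(p_disease, alpha_prime, beta_prime):
    """Expected share of results in the grey zone.

    alpha_prime: P(grey | no disease); beta_prime: P(grey | disease).
    """
    return p_disease * beta_prime + (1.0 - p_disease) * alpha_prime


# Tuberculin skin test, pre-test probability between 0.30 and 0.70.
tst_high = expected_grey_proportion(0.30, alpha_prime=0.435, beta_prime=0.295)
tst_low = expected_grey_proportion(0.70, alpha_prime=0.435, beta_prime=0.295)

# Reticulocyte haemoglobin content, pre-test probability 0.10.
chr_prop = expected_grey_proportion(0.10, alpha_prime=0.46, beta_prime=0.83)

print(round(tst_low, 2), round(tst_high, 2), round(chr_prop, 2))
```

Under this reading, the tuberculin values fall between about 0.34 and 0.39 and the reticulocyte haemoglobin value is about 0.50, matching the figures given in the text.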
Width of the grey zone and its determinants
Once the requisites in terms of LR values have been determined by the analysis of the clinical or screening context, the grey zone can be constructed for the test under consideration. The width of this grey zone depends on the overlap of the distributions of test values for subjects with and without the disease, and in turn, on the difference of location and level of dispersion of these distributions.
Where normal distributions of test values can be obtained, possibly after transformation, the limits of the grey zone gup and glow can be expressed in a relatively simple analytical form (Appendix) and computed with the simplest parameters of the distributions of the test result:
$$g_{up} = \bar{x}_H + z_\alpha s_H, \qquad g_{low} = \bar{x}_D - z_\beta s_D = \bar{x}_D - z_\beta k s_H$$
where $\bar{x}_H$ ($s_H$) and $\bar{x}_D$ ($s_D = ks_H$) are the sample means (standard deviations) of the test results for subjects without and with the disease, respectively (we suppose the test gives higher values in subjects with the disease); and $z_\alpha$ and $z_\beta$ are the upper $(1 - \alpha)$th and $(1 - \beta)$th quantiles of the standard normal distribution.
The width of the grey zone is directly and positively dependent on the overall variability of the test results and on the stringency required for the risks $\alpha$ (risk of false positive) and $\beta$ (risk of false negative), since smaller risks give larger $z_\alpha$ and $z_\beta$; and negatively dependent on the true difference between the means of the distributions (Appendix).
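Under the normality assumption, the limits and width of the grey zone can be computed directly from the summary statistics. The sketch below is illustrative only: the function name and the numerical inputs are hypothetical, and the limits are taken as the healthy mean plus $z_\alpha$ standard deviations and the diseased mean minus $z_\beta$ standard deviations, which is consistent with the width formula of the Appendix.

```python
from statistics import NormalDist


def grey_zone_normal(mean_h, sd_h, mean_d, k, alpha, beta):
    """Grey zone limits under normality.

    mean_h, sd_h: mean and SD in subjects without the disease
    mean_d:       mean in subjects with the disease (SD assumed k * sd_h)
    alpha, beta:  accepted risks of false positive and false negative
    Returns (g_low, g_up, width).
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    z_beta = NormalDist().inv_cdf(1.0 - beta)
    g_up = mean_h + z_alpha * sd_h
    g_low = mean_d - z_beta * k * sd_h
    return g_low, g_up, g_up - g_low


# Hypothetical test: means 3 SD apart, equal SDs, 2.5% risks on each side.
g_low, g_up, width = grey_zone_normal(
    mean_h=0.0, sd_h=1.0, mean_d=3.0, k=1.0, alpha=0.025, beta=0.025
)
print(round(g_low, 2), round(g_up, 2), round(width, 2))
```

Halving the distance between the means, or doubling the standard deviation, widens the zone, which is the dependence described above.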
The grey zone for two or several tests |
---|
The authors of the study detailed above11 reported data for several markers of iron deficiency in children. We used these data to establish distributions in iron-deficient and healthy children, and then to construct the grey zones according to the screening context (pre- and post-test probabilities and LR requisites as detailed above). Figure 5 shows the graphical representation of the grey zones for both reticulocyte haemoglobin content (CHr, [22.0–28.2 pg], see above) and mean corpuscular haemoglobin (MCH, [20.9–28.8 pg]): the CHr grey zone is narrower than the MCH grey zone, for which the expected proportion of grey values is 0.92. Plotting the data for individuals (not available in this report) onto this figure would have shown that the proportion of subjects grey for both tests (in the central grey intersection zone) is smaller than that observed with each test individually.
The grey zone and the evaluation and minimization of the measurement error |
---|
The components of variance and the grey zone
The total variance of a test, $\sigma^2$, has two main components: a between-subject variance, $\sigma_B^2$, i.e. the true variability component; and a within-subject variance, $\sigma_W^2$, i.e. the measurement variability component.
Therefore, we can construct and delimit two sub-zones of uncertainty inside the grey zone, which reflect these two components:
(1) The first subzone, associated with $\sigma_B^2$, reflects the true uncertainty due to the overlap of the distributions, which is biological in nature. This dark grey zone is incompressible and inherent to the particular test. Its limits, $g_{low,DARK}$ and $g_{up,DARK}$, are determined (Appendix) using the estimate $s_B$ of $\sigma_B$ instead of $s_H$ (or $s_D$):
$$g_{up,DARK} = \bar{x}_H + z_\alpha s_B, \qquad g_{low,DARK} = \bar{x}_D - z_\beta k s_B$$
(2) The second subzone, associated with $\sigma_W^2$, reflects the measurement error, including observer, instrumental and possibly biological components of variance (see below). Unlike the dark grey zone, this light grey zone may be reduced by the organization of measurement (standardization, repetition, etc.). The width of this light grey zone depends on the absolute value of the between-subject variability ($\sigma_B$), but also and especially on the value of the intraclass correlation coefficient (ICC) (Appendix).
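The split of the grey zone into its dark and light parts can be illustrated numerically. The sketch below assumes the usual definition of the ICC as the ratio of between-subject to total variance, so that the between-subject standard deviation equals the total standard deviation multiplied by the square root of the ICC. The function name and the input values are hypothetical and are not taken from the examples of this paper.

```python
from math import sqrt


def grey_subzones(z_sum, sd_total, delta, icc):
    """Widths of the whole, dark and light grey zones.

    z_sum:    z_alpha + k * z_beta
    sd_total: total SD of the test in subjects without the disease
    delta:    difference between the means (diseased minus healthy)
    icc:      intraclass correlation coefficient, assumed equal in both groups
    """
    sd_between = sd_total * sqrt(icc)     # true (biological) variability
    w_total = z_sum * sd_total - delta
    w_dark = z_sum * sd_between - delta   # incompressible part
    w_light = w_total - w_dark            # part due to measurement error
    return w_total, w_dark, w_light


# Hypothetical inputs: z_alpha + k*z_beta = 3.92, unit SD, means 3 SD apart.
w_total, w_dark, w_light = grey_subzones(z_sum=3.92, sd_total=1.0, delta=3.0, icc=0.84)
print(round(w_total, 3), round(w_dark, 3), round(w_light, 3))
```

With these inputs roughly a third of the grey zone is light grey; raising the ICC towards 1 shrinks that share towards zero while the dark grey zone grows towards the total width.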
The graphical representation of the grey zone in reliability studies
As graphical representations of the grey zone are easily understandable, it would be convenient to couple its representation with the graphical method to evaluate reliability described by Bland and Altman (Appendix).19,20
Indeed simultaneous representation of the differences between assessments and the grey zone and its sub-zones would allow the reliability of a quantitative test to be analysed, and in particular, visualization of the proportion of subjects inside the grey zone.
The tuberculin skin test
A recent study21 evaluated the reliability of two techniques of tuberculin skin test measurement. The diameter of skin induration was measured along the long axis of the forearm both by the customary palpation method (P) and by the ballpoint-pen technique (BP).
The differences between the measures recorded by the two observers for both techniques in 69 patients with non-null values are shown in Figure 6 (panels P1 and BP1). The differences were related to the means, so log-transformations were needed. The mean (SD) of differences on the log-scale was 0.01 (0.29) for palpation and -0.04 (0.25) for BP, giving the limits of agreement shown in Figure 6 (panels P2 and BP2). The values of the ICC were 0.84 for palpation and 0.88 for BP.
Further uses of the grey zone: to scrutinize and minimize the components of variance
As the within-subject variance may include inter-observer, intra-observer, instrumental, and possibly biological components of variance, different Bland and Altman analyses of the differences in measurement are therefore possible: difference between observers, between evaluations for a single observer, between times for a single subject, etc. These analyses can be coupled to the construction of subzones of uncertainty, reflecting each component of the variability of the test.
If we consider a test with inter- and intra-observer (or residual) components of variability (as, for example, in the tuberculin skin test analysis presented above), the within-subject variance may be decomposed into two components: $\sigma_W^2 = \sigma_{INTER}^2 + \sigma_{INTRA}^2$. It can be shown (Appendix) that the ratio of the width of the sub-zone associated with inter-observer variability, $w_{LIGHT/INTER}$, to the width of the light grey zone, $w_{LIGHT}$, is equal to the ratio of the standard deviations: $w_{LIGHT/INTER}/w_{LIGHT} = \sigma_{INTER}/\sigma_W$. The width of the subzone associated with intra-observer variability is calculated by subtraction.
The magnitude of the components of variance as displayed by the width of their associated sub-zones may help optimize strategies to limit measurement error.
(1) The mean of measures of a repeated test can be used, instead of individual values, to decrease the intra-observer component of the within-subject variability, and therefore to shrink the compressible light grey zone.
(2) Using a sole observer allows the measurement component of variability to be decreased to the intra-observer variability. The intra-observer reliability value is considered as the asymptotic value when there are several observers who are similarly trained and experienced with the measurement method.
The tuberculin skin test
For the ballpoint technique, a two-way random effect analysis of variance allowed the ratio $\sigma_{INTER}/\sigma_W$ to be estimated at 0.97, suggesting that the main source of variability in the test is inter-observer variability. The sub-zones associated with the inter- and intra-observer variability components of the test (still for use in an HIV unit) are shown in Figure 7 (panel 1): 97% of the light grey zone width corresponds to inter-observer variability.
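The partition of the light grey zone is a simple proportional split, sketched below. The ratio of 0.97 is the value reported above; the light grey zone width of 2.0 mm is a purely hypothetical figure chosen for illustration, since the width itself is only shown graphically, and the function name is likewise hypothetical.

```python
def split_light_zone(w_light, sd_ratio_inter):
    """Split the light grey zone into inter- and intra-observer sub-zones.

    sd_ratio_inter: ratio of the inter-observer SD to the within-subject SD.
    The intra-observer width is obtained by subtraction.
    """
    w_inter = w_light * sd_ratio_inter
    return w_inter, w_light - w_inter


# Hypothetical light grey zone of 2.0 mm with the reported ratio of 0.97.
w_inter, w_intra = split_light_zone(w_light=2.0, sd_ratio_inter=0.97)
print(round(w_inter, 2), round(w_intra, 2))
```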
Discussion |
---|
Above all, our approach allows the binary constraint of a black or white decision to be avoided, as this is often inappropriate to clinical or screening practice. A test result falling in the grey zone is not uninformative as it could lead one to seek further evidence, thereby transforming the test result from a decisive to contributory role. Several controversies concerning suitable thresholds for quantitative tests would have probably been avoided if such an approach had been used. A good example is the recent debate concerning the change in the criteria for the diagnosis of type 2 diabetes, and the shift in the threshold from 7.8 mmol/l to 7.0 mmol/l of fasting plasma glucose.22
Our approach also provides a complementary or alternative representation to effect scores23 and especially to ROC curves for the evaluation of the discriminatory performance of a quantitative test and the choice of thresholds. The conventional ROC curves give symmetrical parts to sensitivity and specificity, and only recent refinements of the ROC curve methodology have dealt with unequal costs of misclassification; however, these refinements are complex.24
Another advantage of our method is that it gives a visual representation of both the relationship between the width of the grey zone and the range of possible values, and the proportion of observations within this zone. This can be done by coupling the grey zone construction to the Bland and Altman method to assess reliability, a method now familiar to many clinicians and biologists. A quantitative test whose grey zone contains one-third or a half of observed values (as was the case for the two examples) is obviously of little value in practice.
In assessing reliability by this method, the light grey zone reflects the measurement component of variability in a given design. Thus, the subzones give a simple representation of the components of variance of a measurement method. In the absence of transformation, the width of the compressible light grey zone is proportional to the within-subject standard deviation for the design. A simultaneous representation of the light grey zone and the limits of agreement provided by the Bland-Altman method exploits this proportionality.
The main difficulty in implementing the grey zone approach is determining appropriate values of LR. This involves analysis of the clinical or screening context (expressed in terms of pre-test probabilities) and requirements (expressed in terms of post-test probabilities) and may be difficult. In particular, pre-test probabilities may vary according to the epidemiological context, the care facility, information already gathered about diagnostic or risk factors, and other factors; furthermore, subjective probabilities produced by clinicians or experts may be unreliable. (Post-test probability requisites may also vary, albeit to a lesser extent.) The rule of thumb proposed by the Evidence-Based Medicine group, i.e. to consider LR+ over 10 and LR- below 0.1 as indicating conclusive tests,25 may be used as a first approximation, although much higher/lower values of LR+/LR- (seldom attained by current screening or diagnostic tests) would be required in many contexts. Another approach would be to consider the sensitivity of the LR values and the resulting limits of the grey zone associated with various scenarios or hypotheses. A two-way sensitivity analysis, varying pre- and post-test probabilities simultaneously and studying the effect on LR, should be performed. The location of the resulting interval of values on the LR curves would further indicate the stability of the grey zone limits: their location in (or near) the straight vertical parts of the LR curves would be reassuring (as in our second example, see above). Sensitivity analyses would also allow the stability of the grey zone limits to be tested when empirical data concerning the test are limited and cannot provide reliable estimates of LR (i.e. when the confidence intervals for LR are wide) and/or do not include many cut-off points.
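A two-way sensitivity analysis of this kind can be tabulated with a few lines of code. The sketch below is an assumption-laden illustration: the function names and the grid of probabilities are hypothetical, and the required LR+ is obtained from Bayes' theorem in odds form.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def required_lr(pre_test, post_test):
    """LR needed to move from a pre-test to a post-test probability."""
    return odds(post_test) / odds(pre_test)


def lr_sensitivity_table(pre_tests, post_tests):
    """Required LR for every combination of pre- and post-test probability."""
    return {
        (pre, post): required_lr(pre, post)
        for pre in pre_tests
        for post in post_tests
    }


# Hypothetical grid of scenarios for the minimal desirable LR+.
table = lr_sensitivity_table(pre_tests=[0.10, 0.30, 0.50],
                             post_tests=[0.90, 0.95, 0.99])
for (pre, post), lr in sorted(table.items()):
    print(f"pre-test {pre:.2f} -> post-test {post:.2f}: LR+ >= {lr:.1f}")
```

Over this grid the required LR+ ranges from 9 to about 891, which shows how strongly the grey zone limits can depend on the assumed context. The same table can be built for LR- by supplying low post-test probabilities.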
Another limitation of this method is its reliance on several assumptions for evaluation and minimization of the measurement error. The use of analysis of variance and ICC requires: the distributions of the test results to be normal in both healthy and diseased subjects; and the measurement error to be constant across the range of test values. Logarithmic transformations may in general allow these requirements to be satisfied, but render the computation more complex and assessment of the graphical representations less immediate. Further investigation with non-parametric ICC is needed before the grey zone can be adapted for the evaluation and minimization of the measurement error, when distributions cannot be normalized or measurement cannot be made constant across the range of test values. For a simple application to evaluation of diagnostic or screening discrimination, no assumption is necessary: the grey zone construction only requires plotting both LR+ and LR- against the values of the test. Otherwise, the methodology is non-specific and the recommendations of Reid et al.26 must be followed to avoid the various biases (spectrum bias, verification bias, review bias) affecting the evaluation of the performance of screening and diagnostic tests.
In conclusion, our method allows simple graphical representation of both the discriminatory performance and the components of variability of quantitative diagnostic and screening tests. These representations may be useful supports during the development, evaluation and publication of the performances of such tests.
Appendix |
---|
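$$g_{up} = \mu_H + z_\alpha \sigma_H, \qquad g_{low} = \mu_D - z_\beta \sigma_D$$
where $z_\alpha$ and $z_\beta$ are the $(1 - \alpha)$th and $(1 - \beta)$th quantiles of the standard normal distribution (Figure 4). Replacing population values of means and standard deviations by their sample estimates, we obtain:
$$g_{up} = \bar{x}_H + z_\alpha s_H, \qquad g_{low} = \bar{x}_D - z_\beta s_D$$
If we let $\delta = \mu_D - \mu_H$ and $\sigma_D = k\sigma_H$, the width, $w$, of the grey zone is:
$$w = g_{up} - g_{low} = (z_\alpha + k z_\beta)\,\sigma_H - \delta \qquad (1)$$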
Components of variance and the light and dark grey zones
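The variance of a test, $\sigma^2$, is decomposed into a between-subject variance, $\sigma_B^2$, and a within-subject variance, $\sigma_W^2$: $\sigma^2 = \sigma_B^2 + \sigma_W^2$.
Let $I = \sigma_B^2/(\sigma_B^2 + \sigma_W^2)$ be the one-way random effect intraclass correlation coefficient (ICC),16 which we suppose is identical in subjects with and without the disease (note that a log transformation leading to a measurement error independent of the magnitude of the measurement may, in general, allow this assumption to be satisfied). Replacing $\sigma_H$ of equation (1) by $\sigma_B/\sqrt{I}$, the width of the grey zone becomes:
$$w = (z_\alpha + k z_\beta)\,\frac{\sigma_B}{\sqrt{I}} - \delta \qquad (2)$$
When $I \to 1$ or $\sigma_W \to 0$, $w \to w_{DARK} = (z_\alpha + k z_\beta)\,\sigma_B - \delta$, which is the incompressible dark grey zone, the limits of which are:
$$g_{up,DARK} = \mu_H + z_\alpha \sigma_B, \qquad g_{low,DARK} = \mu_D - z_\beta k \sigma_B$$
The width of the compressible light grey zone is therefore:
$$w_{LIGHT} = w - w_{DARK} = (z_\alpha + k z_\beta)\,\sigma_B\left(\frac{1}{\sqrt{I}} - 1\right) \qquad (3)$$
Note that since $\sigma_B = \sigma_W\sqrt{I/(1 - I)}$, we can also express the light grey zone width as a function of $\sigma_W$:
$$w_{LIGHT} = (z_\alpha + k z_\beta)\,\sigma_W\,\frac{1 - \sqrt{I}}{\sqrt{1 - I}} \qquad (4)$$
The estimations of $g_{up,DARK}$, $g_{low,DARK}$, $w_{DARK}$ and $w_{LIGHT}$ require prior computation of the estimated component of variance $s_B$ (or $s_W$) and the ICC $\hat{I}$. This can be done by a one-way random effect analysis of variance.16,17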
The Bland and Altman method and the grey zone
The Bland and Altman method is based on the construction of a residual-like plot of the difference between the results of two measures against their mean. The mean $\bar{d}$ and standard deviation $s_d$ of the differences between pairs of repeated measurements are combined to define the limits of agreement $\bar{d} \pm 2s_d$, which correspond to a 95% range for the difference between two repeated measurements. The method assumes that $s_d$ is constant across the range of measurements and, in the frequent case of the measurement error being proportional to the mean, requires a log-transformation: the limits of agreement antilogged back onto the natural scale give a range of proportional agreement between repeated measurements.
Since $s_d = \sqrt{2}\,s_W$,19 there is a proportional relationship between the interval between the limits of agreement, $4s_d$, and the estimated width of the light compressible grey zone.
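The limits of agreement and the implied within-subject standard deviation can be computed from paired measurements as sketched below. The data and the function name are hypothetical, and the conversion from the standard deviation of the differences to the within-subject standard deviation (division by the square root of 2) assumes two measurements per subject with no systematic difference between them.

```python
from math import sqrt
from statistics import mean, stdev


def limits_of_agreement(first, second):
    """Bland and Altman limits of agreement for paired measurements.

    Returns (lower limit, upper limit, within-subject SD).
    """
    diffs = [a - b for a, b in zip(first, second)]
    d_bar = mean(diffs)
    s_d = stdev(diffs)  # sample SD of the differences
    return d_bar - 2.0 * s_d, d_bar + 2.0 * s_d, s_d / sqrt(2.0)


# Hypothetical induration diameters (mm) read by two observers.
observer_1 = [10.0, 12.0, 14.0, 16.0]
observer_2 = [11.0, 11.0, 15.0, 15.0]

lower, upper, s_within = limits_of_agreement(observer_1, observer_2)
print(round(lower, 2), round(upper, 2), round(s_within, 2))
```

When the measurement error grows with the mean, the same function can be applied to log-transformed values and the limits antilogged afterwards.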
Components of variance and the inter- and intra-observer light grey zones
The within-subject variance is decomposed into an inter-observer component, $\sigma_{INTER}^2$, and an intra-observer component, $\sigma_{INTRA}^2$: $\sigma_W^2 = \sigma_{INTER}^2 + \sigma_{INTRA}^2$.
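If we let $\rho = \sigma_{INTER}/\sigma_W$, which we will assume is identical in healthy and diseased subjects, the width of the light grey zone becomes:
$$w_{LIGHT} = (z_\alpha + k z_\beta)\,\frac{\sigma_{INTER}}{\rho}\,\frac{1 - \sqrt{I}}{\sqrt{1 - I}} \qquad (5)$$
When $\sigma_{INTRA} \to 0$ or $\rho \to 1$,
$$w_{LIGHT} \to w_{LIGHT/INTER} = (z_\alpha + k z_\beta)\,\sigma_{INTER}\,\frac{1 - \sqrt{I}}{\sqrt{1 - I}} \qquad (6)$$
Thus $w_{LIGHT/INTER}/w_{LIGHT} = \rho = \sigma_{INTER}/\sigma_W$: the ratio of the width of the sub-zone associated with inter-observer variability to the width of the light grey zone is equal to the ratio of the standard deviations. The width of the subzone associated with intra-observer variability can be easily calculated by subtraction.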
The estimation of $w_{LIGHT/INTER}$ requires prior computation of the estimated inter- and intra-observer components of variance. A two-way random effect analysis of variance is therefore necessary.16,17
References |
---|
2 Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561–77.
3 Feinstein AR. The inadequacy of binary models for the clinical reality of three-zone diagnostic decisions. J Clin Epidemiol 1990;43:109–13.
4 Simel DL, Samsa GP, Matchar DB. Likelihood ratios for continuous test results, making the clinicians' job easier or harder? J Clin Epidemiol 1993;46:85–93.
5 Jamart J. Chance-corrected sensitivity and specificity for three-zone diagnostic tests. J Clin Epidemiol 1992;45:1035–39.
6 Subcommittee of the Joint Tuberculosis Committee of the British Thoracic Society. Guidelines on the management of tuberculosis and HIV infection in the United Kingdom. BMJ 1992;304:1231–33.
7 Pape JW, Jean SS, Ho JL, Hafner A, Johnson WD Jr. Effect of isoniazid prophylaxis on incidence of active tuberculosis and progression of HIV infection. Lancet 1993;342:268–72.
8 Bass JB Jr, Farer LS, Hopewell PC et al. Treatment of tuberculosis and tuberculosis infection in adults and children. Am J Respir Crit Care Med 1994;149:1359–74.
9 De Cock KM, Grant A, Porter JD. Preventive therapy for tuberculosis in HIV-infected persons: international recommendations, research, and practice. Lancet 1995;345:833–36.
10 Rose DN, Schechter CB, Adler JJ. Interpretation of the tuberculin skin test. J Gen Intern Med 1995;10:635–42.
11 Brugnara C, Zurakowski D, DiCanzio J, Boyd T, Platt O. Reticulocyte hemoglobin content to diagnose iron deficiency in children. JAMA 1999;281:2225–30.
12 Kassirer JP, Kopelman RI. Learning Clinical Reasoning. Baltimore: Williams & Wilkins, 1991.
13 Morrison AS. Screening. In: Rothman KJ, Greenland S (eds). Modern Epidemiology, 2nd Edn. Philadelphia: Lippincott, Williams & Wilkins, 1998.
14 Reid MC, Lane DA, Feinstein AR. Academic calculations versus clinical judgments: practicing physicians' use of quantitative measures of test accuracy. Am J Med 1998;104:374–80.
15 Healy MJ. Measuring measuring errors. Stat Med 1989;8:893–906.
16 Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420–28.
17 Müller R, Büttner P. A critical discussion of intraclass correlation coefficients. Stat Med 1994;13:2465–76.
18 Bland JM, Altman DG. A note on the use of the intraclass correlation coefficient in the evaluation of agreement between two methods of measurement. Comput Biol Med 1990;20:337–40.
19 Altman DG, Bland JM. Measurement in medicine: the analysis of method comparison studies. The Statistician 1983;32:307–17.
20 Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307–10.
21 Pouchot J, Grasland A, Collet C, Coste J, Esdaile JM, Vinceneux P. The reliability of tuberculin skin test measurement. Ann Intern Med 1997;126:210–14.
22 Davidson MB, Schriger DL, Peters AL, Lorber B. Relationship between fasting plasma glucose and glycosylated hemoglobin: potential for false-positive diagnoses of type 2 diabetes using new diagnostic criteria. JAMA 1999;281:1203–10.
23 Blakeley DD, Oddone EZ, Hasselblad V, Simel DL, Matchar DB. Noninvasive carotid artery testing. A meta-analytic review. Ann Intern Med 1995;122:360–67.
24 Hilden J, Glasziou P. Regret graphs, diagnostic uncertainty and Youden's Index. Stat Med 1996;15:969–86.
25 Jaeschke R, Guyatt G, Sackett DL and the Evidence-Based Medicine Working Group. Users' guides to the medical literature. III. How to use an article about a diagnostic test. Are the results of the study valid? JAMA 1994;271:389–91.
26 Reid MC, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic test research. Getting better but still not good. JAMA 1995;274:645–51.