Affiliations of authors: L. G. Kessler, Fred Hutchinson Cancer Research Center, Seattle, WA, and Office of Surveillance and Biometrics, Center for Devices and Radiological Health, Food and Drug Administration, Rockville, MD; M. R. Andersen, R. Etzioni, Fred Hutchinson Cancer Research Center.
Correspondence to: M. Robyn Andersen, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. North, MP 900, Seattle, WA 98109 (e-mail: rander@fhcrc.org).
Variability in quality is the Achilles' heel of medical care, and one in great need of repair. But how do we improve the quality of a procedure such as mammography and ensure that variability is minimized? This fundamental issue underlies the article by Elmore et al. in this issue of the Journal (1).
For almost 10 years, Elmore and colleagues have examined the issue of variability in radiologists' readings of mammograms (2–4). After reestablishing the well-known phenomenon of variability in radiologic reading in general (5,6), these authors have explored the factors that influence this variability, hoping to find ways to reduce it (1). In their community-based study, they found that false-positive rates ranged from 2.6% to 15.9%. They can account for half of this range by controlling for patient, radiologist, and testing characteristics (e.g., year of mammographic examination or availability of a previous mammographic examination for comparison), but that still leaves substantial variability in this one indicator of mammography performance. These authors examined this variability in a real-world setting rather than in the testing situations that have been used to assess radiologists' performance in other studies.
Screening tests in the population require trade-offs. In the case of mammography, clinicians can judge many examinations abnormal (or positive) and spend a great deal of their patients' and their own time tracking down minor or low-level suspicious findings in order to minimize missed cancers. But the price of calling many examinations positive is a consequent increase in the false-positive proportion (that is, the specificity is low), which carries considerable financial and psychological costs (3,7–10). It is important to find a screening strategy that is optimal from a population perspective. Women who receive a screening mammographic examination would like to know that their chance of having a cancer detected, if present, is very high (that is, the sensitivity of the test is very high). They would also like to know that any abnormality detected has a relatively high chance of being clinically significant (that is, the test has a high positive predictive value).
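These quantities are easy to confuse, so a small numerical sketch may help; the counts below are hypothetical, chosen only for illustration, and are not data from the study:

```python
# Illustrative only: hypothetical screening counts, not data from
# Elmore et al. (1). Suppose 10,000 women are screened, 50 have
# cancer, and the reader recalls 8% of examinations.

true_positives = 45      # cancers correctly flagged (assumed)
false_negatives = 5      # cancers missed (assumed)
false_positives = 755    # healthy women recalled (assumed)
true_negatives = 9195    # healthy women correctly cleared (assumed)

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)
ppv = true_positives / (true_positives + false_positives)

print(f"sensitivity = {sensitivity:.2f}")   # 0.90
print(f"specificity = {specificity:.3f}")   # 0.924
print(f"PPV         = {ppv:.3f}")           # 0.056
```

Even with 90% sensitivity and 92% specificity, the positive predictive value is under 6% because cancer is rare in a screening population; this is the arithmetic behind the trade-off described above.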
In an ideal mammography system, women would know the sensitivity and specificity of mammographic tests performed by specific physicians or facilities. They could then choose to attend screening where they can get the most sensitive tests, with false-positive rates they consider acceptable. Indeed, women likely vary with respect to what they consider acceptable rates of false positives and, given the opportunity to make informed choices, would likely differ in their preferences for examinations in which they risk high rates of false positives for small increases in sensitivity. By making such choices, they would indicate their willingness to go through the anxiety and expense of extra tests to reduce their chance of a missed diagnosis.
The Fundamental Issue
Elmore et al.'s (1) attempt to explain variability in false-positive rates raises the fundamental issue of the source of this variability. There are two potential sources, and they have very different implications for screening: 1) Perhaps the radiologists are roughly equal in quality (i.e., they have the ability to achieve equivalent sensitivity and specificity) and have chosen different thresholds for their level of suspicion and for which findings they call positive; 2) Perhaps the radiologists vary in quality, with some having higher sensitivity and specificity than others. Their ability, described in terms of sensitivity and specificity, is reflected in the receiver operating characteristic (ROC) curve.
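To make the distinction concrete, here is a minimal sketch under a standard binormal ROC model; the separation and threshold values are our assumptions for illustration, not estimates from the study:

```python
# Binormal ROC sketch: latent suspicion scores are N(0, 1) for
# normal films and N(separation, 1) for cancers; a reader calls
# "positive" when the score exceeds a personal threshold.
# All parameter values below are assumptions for illustration.
from scipy.stats import norm

def operating_point(threshold, separation):
    """Return (false-positive rate, true-positive rate) for a reader."""
    fpr = norm.sf(threshold)               # P(score > t | no cancer)
    tpr = norm.sf(threshold - separation)  # P(score > t | cancer)
    return fpr, tpr

# Explanation 1: equal ability (same ROC curve), different thresholds.
print(operating_point(threshold=1.2, separation=2.0))  # strict reader
print(operating_point(threshold=0.6, separation=2.0))  # lenient reader

# Explanation 2: same threshold, different ability (different curves).
print(operating_point(threshold=1.0, separation=2.0))
print(operating_point(threshold=1.0, separation=1.2))
```

In the first pair, the readers have identical ability and differ only in threshold, so their operating points lie on the same curve; in the second, equal thresholds on different curves yield the same false-positive rate but very different sensitivities.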
Because of the relatively small number of cancers in the study population, Elmore et al. (1) could examine only the false-positive proportion and not the ROC curves. Their analysis cannot reveal whether the clinicians are operating on different ROC curves, meaning they have different abilities to detect cancer, or whether they are operating on similar ROC curves but have chosen different thresholds. To fully understand the variability findings for false positives, we need to know the underlying ROC curve. Studies with carefully chosen films that provide an estimate of the ROC curves therefore complement the community-based effort used in the study by Elmore et al. The wide range in false-positive rates suggests differences in both reader quality and choice of threshold.
The sophisticated analysis with the mixed-model approach used by Elmore et al. (1) allows them to quantify the variability in false-positive rates across radiologists, which is their objective. Although this approach allows them to present results in terms of interpretable quantities related to the false-positive measure, it does not provide information on the ROC curves. There is no substitute for knowing about the true accuracy of the examination.
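The flavor of such a random-effects model can be sketched briefly; the overall rate and between-reader variance below are assumptions chosen for illustration, not estimates from the paper:

```python
# Sketch of the idea behind a random-intercept (mixed) model for
# false-positive rates: each radiologist has a personal intercept on
# the log-odds scale. All numbers are assumptions for illustration,
# not estimates from Elmore et al. (1).
import numpy as np

rng = np.random.default_rng(0)

mean_logit = np.log(0.07 / 0.93)  # overall FP rate near 7% (assumed)
sd_between = 0.5                  # between-reader SD on logit scale (assumed)

logits = mean_logit + sd_between * rng.standard_normal(100)
fp_rates = 1 / (1 + np.exp(-logits))   # each radiologist's FP rate

print(f"range across radiologists: {fp_rates.min():.3f} to {fp_rates.max():.3f}")
```

Even a modest between-reader standard deviation on the log-odds scale produces a severalfold spread in false-positive rates, which is why a mixed model is a natural way to summarize this variability.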
Where Do We Go From Here in Terms of Public Policy?
What is the cost to repair this Achilles' heel? Is there a national consensus on where we should be on the ROC curve? Should there be minimum performance standards for mammography readers? Is a cautious radiologist (one with a high false-positive proportion but excellent sensitivity) a problem? Reducing the false-positive rate usually comes at the expense of decreases in sensitivity. Although a reduction in false-positive variability could provide better results in general, it is important to learn whether reducing variability will be accompanied by a decrease in sensitivity and therefore an increase in missed cancers.
One finding of Elmore et al. (1) is a time trend in false-positive proportion during the study. The data they used are now 9 to 17 years old, suggesting that a contemporary study is needed. Ideally, a new study would focus on the ROC curve and would first attempt to improve all curves so that they are as close to optimal as possible while investigating where to operate on each curve. At the same time, we need to better define, at a population level, what "optimal" means in this case. Knowing where to trade off an increase in sensitivity for a decrease in specificity is not a matter of a formula; it needs to be carefully considered in a policy context.
Elmore et al. (1) do not directly address the importance of multiple reading (i.e., the practice of having more than one radiologist read each film), but their findings nonetheless point to the importance of this approach. As a matter of principle, it is important to operate at an optimal point on the ROC curve, but individual clinicians have only a limited ability to improve their reading of mammograms. It has been shown repeatedly [e.g., (11,12)] that double reading can improve both sensitivity and specificity. Data from 22 countries participating in the International Breast Cancer Screening Network indicate that about half of the countries have implemented independent double reading of mammograms, considering it a key component of high-quality mammography screening (12). Why does the United States refuse to take such a low-tech approach to improving the mammography system?
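A simple calculation, assuming two independent readers of identical accuracy (the rates below are illustrative, not taken from (11,12)), shows why the rule for combining two readings matters:

```python
# Double reading under an independence assumption; the single-reader
# rates are hypothetical values chosen for illustration.
sens, spec = 0.85, 0.90   # single-reader sensitivity and specificity (assumed)

# "Recall if either reader calls positive": sensitivity rises,
# specificity falls.
sens_or = 1 - (1 - sens) ** 2        # 0.9775
spec_or = spec ** 2                  # 0.81

# "Recall only if both agree": specificity rises, sensitivity falls.
sens_and = sens ** 2                 # 0.7225
spec_and = 1 - (1 - spec) ** 2       # 0.99

print(sens_or, spec_or, sens_and, spec_and)
```

Under independence, a mechanical rule improves one measure at the expense of the other; programs that report gains in both sensitivity and specificity typically resolve discordant readings by consensus or arbitration rather than by a fixed rule.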
Elmore et al. (1) state that "the ultimate goal is to enhance mammography performance by reducing the rate of false-positive interpretations while maintaining high levels of sensitivity and accuracy." Their research suggests training as an important factor in quality of mammographic interpretation. We suggest that public policy needs both a focus on training and an immediate re-examination of double reading as national policy.
REFERENCES
1 Elmore JG, Miglioretti DL, Reisch LM, Barton MB, Kreuter W, Christiansen CL, et al. Screening mammograms by community radiologists: variability in false-positive rates. J Natl Cancer Inst 2002;94:1373–80.
2 Elmore J, Wells C, Lee C, Howard D, Feinstein A. Variability in radiologists' interpretations of mammograms. N Engl J Med 1994;331:1493–9.
3 Elmore JG, Barton MB, Moceri VM, Polk S, Arena PJ, Fletcher SW. Ten-year risk of false-positive screening mammograms and clinical breast examinations. N Engl J Med 1998;338:1089–96.
4 Christiansen CL, Wang F, Barton MB, Kreuter W, Elmore JG, Gelfand AE, et al. Predicting the cumulative risk of false-positive mammograms. J Natl Cancer Inst 2000;92:1657–66.
5 Garland LH. Studies on the accuracy of diagnostic procedures. Am J Roentgenol 1959;82:25–38.
6 Yerushalmy J. The statistical assessment of the variability in observer perception and description of roentgenographic pulmonary shadows. Radiol Clin North Am 1969;7:381–92.
7 Brett J, Austoker J, Ong G. Do women who undergo further investigation for breast screening suffer adverse psychological consequences? A multi-centre follow-up study comparing different breast screening result groups five months after their last breast screening appointment. J Public Health Med 1998;20:396–403.
8 Ellman R, Angeli N, Christians A, Moss S, Chamberlain J, Maguire P. Psychiatric morbidity associated with screening for breast cancer. Br J Cancer 1989;60:781–4.
9 Gram IT, Lund E, Slenker SE. Quality of life following a false positive mammogram. Br J Cancer 1990;62:1018–22.
10 Lerman C, Trock B, Rimer BK, Boyce A, Jepson C, Engstrom PF. Psychological and behavioral implications of abnormal mammograms. Ann Intern Med 1991;114:657–61.
11 Blanks RG, Wallis MG, Moss SM. A comparison of cancer detection rates achieved by breast cancer screening programmes by number of readers, for one and two view mammography: results from the UK National Health Service breast screening programme. J Med Screen 1998;5:195–201.
12 Ballard-Barbash R, Klabunde C, Paci E, Broeders M, Coleman EA, Fracheboud J, et al. Breast cancer screening in 21 countries: delivery of services, notification of results and outcomes ascertainment. Eur J Cancer Prev 1999;8:417–26.