Affiliations of authors: C. A. Beam, Department of Radiology, Medical College of Wisconsin, Milwaukee, and the H. Lee Moffitt Cancer Center & Research Institute at the University of South Florida, Tampa; E. F. Conant, Department of Radiology, University of Pennsylvania, Philadelphia; E. A. Sickles, Department of Radiology, University of California, San Francisco.
Correspondence to: Craig Beam, Ph.D., Biostatistics Core, H. Lee Moffitt Cancer Center & Research Institute, University of South Florida, 12902 Magnolia Dr., Tampa, FL 33612-9497 (e-mail: beamca@moffitt.usf.edu).
ABSTRACT
INTRODUCTION
To the best of our knowledge, six studies (10-15) have investigated the relationship between expertise and reading volume in mammography. In two of these studies (13,14), the relationship observed between volume and expertise was believed by the researchers to be strong enough to proffer health policy recommendations. For example, Esserman et al. (14) recommended establishment of high-volume centers with mammogram interpretations to be made by high-volume, experienced, and dedicated radiologists. Kan et al. (13) suggested that a yearly minimum of 2500 interpretations is sufficient to ensure high quality. In contrast, findings from the four other published studies (10-12,15) suggest that the relationship between volume and expertise in mammography may be strongly influenced by other factors. For example, three of the studies (10-12) suggest that the quality of feedback given to the radiologist, rather than simply the volume of reading, is an important determinant of the effectiveness of gaining expertise from experience.
This disparity in the published research is important because the implications for improving mammography in each study are so different. To understand how such disparities have occurred, it is important to recognize that each of the four studies (10-12,15) that provide findings that question the unique role of volume controlled for characteristics of the readers, whereas the other two studies (13,14) did not. This difference in study design could account for the different findings because it is possible that the apparent relationship between volume and expertise is confounded with, or altered by, other unrecognized factors.
Unfortunately, the four studies (10-12,15) that question the solitary volume-expertise relationship used small and/or highly selected reader samples. They were thus not able to transect the full range of variability in the U.S. population of radiologists. In contrast, the two studies supporting the solitary importance of volume are equally limited. Kan et al. (13) did not sample U.S. radiologists. Esserman et al. (14) sampled radiologists exclusively from California and achieved only a 30% participation rate. In addition, the study by Esserman et al. also has limited external validity for U.S. populations of women being screened for breast cancer because the authors chose to treat the case sample as fixed, thus prohibiting them from making inferences about typical case populations.
To address the deficiencies and inconsistencies in existing research and knowledge, we conducted a multifactor population study to determine whether a radiologist's reading volume and other factors were associated with accuracy in screening mammography. Our study was designed to be relevant to typical clinical populations and to the population of radiologists interpreting screening mammograms in the United States. Our study was also designed to more successfully capture variability and to control for the possible influence of factors that a priori were thought to be possible confounders or modifiers of the relationship between volume and expertise.
It should be emphasized that our study focuses strictly on accuracy in the interpretation of screening mammograms. It does not analyze factors associated with accuracy in the interpretation of diagnostic mammograms. Ability in screening mammogram interpretation (a task focused on the decision to call women back for further work-up) may or may not imply ability in diagnostic interpretation (a task focused on the interpretation of additional work-up that can culminate in the recommendation for tissue biopsy examination). Hence, the associations found and not found in this study should not be assumed to apply to skill in the diagnostic interpretation of mammograms by U.S. physicians.
METHODS
Radiologists were recruited to participate in the Variability In Diagnostic Interpretation (VIDI) screening mammography study (16). This study is a research program devoted to the population-based assessment of interpretation variability in diagnostic medicine. Participants for the screening mammography study came from randomly sampled mammography facilities accredited by the U.S. Food and Drug Administration as of January 1, 1998. Stratified random sampling of the 9916 geographically contiguous accredited facilities ensured approximately equal representation across geographic regions (four regions as defined by the U.S. Census) and by minority composition of local screening populations (based on percent minority composition in the ZIP code area of the facility: "less than 50% nonwhite" versus "greater than or equal to 50% nonwhite"). Thus, stratified sampling gave approximately equal numbers of facilities within each of eight strata.
All radiologists at each randomly sampled facility were invited to participate. The procedure followed for recruitment began with a letter to the lead interpreting physician at a sampled facility asking them to distribute our recruitment material to all radiologists who interpret mammograms for their facility. In this way, we sampled permanent faculty as well as temporary faculty. The recruitment material explained the study, requirements, and benefits of participation in the study and asked the radiologists whether they would be willing to participate if randomly sampled. In all, 412 radiologists were contacted, and 292 (71%) expressed willingness to participate in the study, if sampled. The 292 radiologists, grouped by facility, provided our sampling frame for random sampling. Again, we sampled facilities (and hence willing radiologists within facilities) within the strata formed by geographic region and minority composition to arrive at approximately equal numbers of radiologists per stratum.
One hundred ten radiologists were randomly selected to participate in this study. There were no statistically significant differences in any of the characteristics summarized in Tables 1 and 2 between the radiologists who did and did not participate in this study.
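To make the equal-allocation design concrete, the following minimal Python sketch (our own illustration, not the study's actual sampling code; the facility records, stratum labels, and per-stratum quota are hypothetical) groups facilities into the eight region-by-minority strata and draws the same number from each.

```python
import random
from collections import defaultdict

# Hypothetical facility records: (facility_id, census_region, minority_composition)
facilities = [
    ("F001", "Northeast", ">=50% nonwhite"),
    ("F002", "South", "<50% nonwhite"),
    # ... one record per accredited, geographically contiguous facility
]

def stratified_sample(facilities, per_stratum, seed=0):
    """Draw an equal number of facilities from each region-by-minority stratum."""
    random.seed(seed)
    strata = defaultdict(list)
    for fac_id, region, minority in facilities:
        strata[(region, minority)].append(fac_id)
    sample = []
    for members in strata.values():
        k = min(per_stratum, len(members))   # guard against small strata
        sample.extend(random.sample(members, k))
    return sample

selected = stratified_sample(facilities, per_stratum=14)  # quota shown is illustrative
```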
For this research study, we define a "mammogram" as consisting of the four radiographic views that are standard in screening for breast cancer in the United States. Each mammogram (index or comparison examination), therefore, consisted of mediolateral oblique and craniocaudal views of each breast (hence, four views per mammogram).
Index mammograms were obtained from 148 women who were randomly sampled from a large screening program (affiliated with the University of Pennsylvania) covering the period from January 1993 through December 1997. All mammograms selected for this study were reviewed for quality by one of the authors (E. F. Conant), who serves as Director of the Breast Imaging Program at the University of Pennsylvania. No mammogram was rejected because of poor technical quality.
Original film mammograms were used in the reading study. Comparison original film mammograms were provided, as available, to parallel usual clinical practice. Sixty-seven (45%) women had comparison mammographic examinations. Each set of mammograms was from low-dose, film screen mammography performed on dedicated mammography units using single emulsion film. Each set consisted of mediolateral oblique and craniocaudal views of each breast. The index examination of a woman was defined as the one leading to the first biopsy or as the next-to-last mammogram for those women with at least 2 years of follow-up without a biopsy examination. A comparison examination was defined as the screening examination performed immediately before the index examination.
Mammogram sampling was stratified on the disease status of the women screened ("cancer" or "cancer-free," determined by a biopsy examination or a minimum follow-up of 2 years) and age. We used the electronic patient data and biopsy databases maintained by the Breast Imaging Program to stratify women by age at the time of their index mammogram and by disease status. Women were stratified as younger than 50 years, 50-59 years, 60-69 years, and 70 years or older. Once the women were stratified, sampling of women (and hence, mammograms) was done at random within strata. Differences in the availability of mammograms prevented us from meeting our initial goal of equal numbers of women in each age group of women who had cancer and women who were cancer-free.
Although we attempted an equal split, our sampling resulted in a mixture in which 64 (43%) of 148 mammograms were from women with cancer. Ages of women whose mammograms were selected ranged from 40 to 85 years, with a mean of 58 years. Patients with breast cancer tended to be older than cancer-free women (P = .011, χ2 test). This imbalance reflects differences in the availability of original films after the women, already stratified by age, were randomly selected. The examinations from younger patients with breast cancer tended more often to be in clinical use than the examinations from older patients with breast cancer.
Reading Study
All radiologists interpreted the mammograms in a controlled reading environment during two 3-hour periods. All readings were done at a central site, dedicated solely to the study, that permitted the investigators to control ambient light. Eight readers participated at a time.
Mammograms were mounted in random sequence on dedicated mammography alternators (RADX Corp., Houston, TX). The only information presented to the reader was the age of the patient. Before reading, radiologists were instructed that the set of mammograms to be read did not have the mixture of mammograms expected from a typical screening population (two to six cases of breast cancer per 1000 mammograms). Pilot studies done by the investigators have established that this instruction adequately controls for context bias (17) (details available from C. A. Beam). Before the reading session began, a member of the study team led the radiologists through a hands-on orientation session with a set of practice mammograms and provided instruction on using the computer data collection system.
Reading data were captured immediately into a database through laptop computers. A custom computer program operating in real time during the reading session captured the reading data described below and ensured data reliability.
Readers were asked 1) to identify findings, 2) to make a recommendation for further work-up, 3) to report what they believed would be the result of additional work-up, and 4) to give a subjective assessment of the presence of breast cancer for each mammogram. Responses to item 3, which relate to the management of the woman after screening, used the Breast Imaging Reporting and Data System [BI-RADS (18) scale: 1 = normal, return to normal screening; 2 = benign, return to normal screening; 3 = probably benign, 6-month follow-up recommended; 4 = possibly malignant, biopsy recommended; 5 = probably malignant, biopsy strongly recommended] and were used in the receiver operating characteristic (ROC) curve analysis for this study (described below). BI-RADS, a scale for the standardized reporting of mammograms, was developed by the American College of Radiology.
Reader Factors
Two surveys were used to collect data about the readers in our study. One survey collected data about each individual reader, and another collected data about the facility with which the radiologist was affiliated. Among other things, radiologists were asked to report their "Recent Reading Volume," which is the total number of mammograms read in the year before their participation in the study. All survey items were self-reported and not independently verified. Several survey variables were omitted from the regression analysis because of missing data or fewer than 20 observations in any level of a categorical variable. We used radiologist-level factors and facility-level factors in the analysis.
Several levels of the variable "Practice Setting" were combined to ensure adequate sample size for analysis. For analysis, the category "Hospital Radiology Department" was combined with the category "Multispecialty Medical Clinic" (combined n = 55). The category "Comprehensive Breast Diagnostic/Screening Center" was combined with the category "Freestanding Mammography Center" (n = 21).
Statistical Analysis
The expertise of each reader was assessed with two standard measures of screening accuracy based on the ROC curve (19). "Am" is the area under the ROC curve estimated nonparametrically (20). This measure can be interpreted as the ability of the diagnostician to discriminate a mammogram showing breast cancer from one not showing breast cancer when two such mammograms have been randomly selected and presented together. The area under the ROC curve includes high false-positive rates that are not relevant to screening (21,22). "pAz" is the partial area under the binormal ROC curve (23-26) restricted to the interval in which false-positive probability is less than 10%. This measure can be interpreted as the average sensitivity for the diagnostician who reads within a clinically desirable range of false-positive values (23).
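As an illustration of the two accuracy measures, the sketch below (our own illustration, not the analysis code used in the study) computes a nonparametric area under the ROC curve and an average sensitivity restricted to false-positive fractions below 0.10 for one reader. The truth labels and ratings are invented, and the partial area is taken over the empirical ROC staircase rather than under a fitted binormal model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical reader data: truth (1 = cancer, 0 = cancer-free) and BI-RADS-style ratings
truth   = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
ratings = np.array([5, 4, 3, 5, 2, 1, 3, 2, 1, 2])

# Am: nonparametric (Mann-Whitney) area under the full ROC curve
Am = roc_auc_score(truth, ratings)

# pAz-like measure: average sensitivity over false-positive fractions below 0.10
fpr, tpr, _ = roc_curve(truth, ratings)
grid = np.linspace(0.0, 0.10, 101)                      # FPF grid restricted to the screening range
sens = np.array([tpr[fpr <= f].max() for f in grid])    # empirical ROC treated as a step function
avg_sens_low_fpf = sens.mean()                          # ≈ partial area divided by 0.10

print(f"Am = {Am:.3f}, average sensitivity for FPF < 0.10 = {avg_sens_low_fpf:.3f}")
```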
It is important to point out that sensitivity in our study refers to sensitivity in the context of screening. In screening, the central decision is whether to conduct additional work-up (i.e., the callback decision, which could include a recommendation for another mammogram after a short interval). It is not the goal of screening interpretation to provide a definitive diagnosis or to recommend biopsy without further consideration. Thus, a true-positive result in screening occurs whenever a woman with breast cancer is given a callback recommendation, and this determination is made without reference to correct localization of the cancer by the radiologist. Hence, our measures of skill refer to the skill of the radiologist to detect cancer in the screening mammogram but not to then localize and correctly identify it. Such skills pertain to the diagnostic interpretation of mammograms and are distinct from skill in screening.
After controlling for the possible influence of other factors, the influence of volume on accuracy was assessed with bootstrapped, multiple-regression analysis for ROC curves (27). This analysis also allowed us to investigate the independent association of the other factors with accuracy. The factors we tested are listed in Table 4. The confidence intervals (CIs) that we used to assess associations are 95% biased-corrected and accelerated CIs (28), as implemented by S-Plus 2000 (Mathsoft, Seattle, WA).
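The study's analysis was carried out with the two-stage ROC regression model of reference (27) and BCa intervals in S-Plus; the fragment below is only a schematic Python illustration, on invented reader-level data, of the underlying idea of resampling readers to obtain a bootstrap confidence interval for a regression coefficient. It is simplified to a percentile interval rather than the bias-corrected and accelerated interval actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reader-level data: annual reading volume and an accuracy measure (e.g., Am)
volume   = rng.uniform(480, 10000, size=110)
accuracy = 0.85 + 1e-6 * volume + rng.normal(0, 0.05, size=110)

def slope(x, y):
    """Least-squares slope of accuracy on volume."""
    return np.polyfit(x, y, 1)[0]

# Resample readers (rows) with replacement and recompute the coefficient each time
boot = np.array([
    slope(volume[idx], accuracy[idx])
    for idx in (rng.integers(0, 110, 110) for _ in range(2000))
])
ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5])
print(f"slope = {slope(volume, accuracy):.2e}, 95% percentile CI = ({ci_lo:.2e}, {ci_hi:.2e})")
```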
Finally, it is important to point out that the study was designed specifically to evaluate the statistical significance of the relationship between reading volume and accuracy, after controlling for the influence of other concomitant variables; that is, the study had a single hypothesis. Because we consider the estimation and statistical significance testing of the other independent variables to be purely exploratory, we have not exercised statistical control for multiple testing and/or estimation. We consider that the purpose of that portion of our study was to raise hypotheses to be tested in subsequent research. Nonetheless, approximate Bonferroni-adjusted (29) CIs, which provide composite 95% coverage, are reported as a guide to interpretation by the reader. All statistical tests were two-sided.
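As a worked illustration of that adjustment (the number of comparisons shown is hypothetical, not the count of factors in Table 4): with m exploratory estimates, each interval is computed at the 1 - 0.05/m confidence level so that the whole set retains approximately 95% joint coverage.

```python
from scipy.stats import norm

m = 15                                   # hypothetical number of exploratory comparisons
alpha_each = 0.05 / m                    # Bonferroni split of the overall 5% error rate
level_each = 1 - alpha_each              # per-comparison confidence level (~99.67%)
z_each = norm.ppf(1 - alpha_each / 2)    # critical value replacing the usual 1.96

print(f"each CI at {100 * level_each:.2f}% confidence (z = {z_each:.2f})")
```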
RESULTS
A great deal of variation in accuracy among the 110 readers was observed, and any trend (indicated by the superimposed least squares line) was slight, as shown in Figs. 2 and 3, which present the two measures of accuracy as a function of reading volume. A 1% increase in accuracy requires increasing the annual reading volume by about 3000 mammograms for Am and by about 1200 mammograms for pAz, as indicated by the least squares regression line. The diffuse nature of this relationship is confirmed numerically by the fact that the linear relationship with volume accounts for only 2.42% of the variation observed in the accuracy measure Am and only 1.48% of the variation in the measure pAz.
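To make these numbers concrete, the small sketch below uses simulated readers (not the study's data) to show how a fitted least-squares slope translates into "additional annual mammograms per 1% gain in accuracy" and how the proportion of variation explained (R^2) is obtained.

```python
import numpy as np

rng = np.random.default_rng(1)
volume = rng.uniform(480, 10000, 110)               # annual reading volume per reader
Am = 88 + volume / 3000 + rng.normal(0, 4, 110)     # accuracy in percent; deliberately weak trend

slope_per_mammogram, intercept = np.polyfit(volume, Am, 1)
mammograms_per_1pct = 1.0 / slope_per_mammogram     # extra volume implied per 1% accuracy gain
r_squared = np.corrcoef(volume, Am)[0, 1] ** 2      # variation in accuracy explained by the line

print(f"~{mammograms_per_1pct:,.0f} extra mammograms per 1% gain; R^2 = {r_squared:.3f}")
```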
Several of the other factors, however, were statistically significantly associated with both measures of accuracy. The number of years since residency was statistically significantly and negatively associated with both measures of accuracy. Having a formal rotation in mammography during residency was also negatively associated with both measures of accuracy. Our model estimated a change in average Am of -0.30% (95% CI = -0.60% to -0.09%) and a change in average pAz of -0.76% (95% CI = -1.75% to -0.02%) for each year after residency. In addition, the model estimated that radiologists who had a formal mammography rotation during residency had, on average, a change in Am of -0.55% (95% CI = -2.75% to -0.00%) and a change in pAz of -1.44% (95% CI = -10.66% to -2.32%) relative to those without a formal rotation.
Other factors were associated uniquely with each accuracy measure. Being an owner of the practice was statistically significantly associated with increased accuracy (i.e., change in mean Am = 0.59% [95% CI = 0.02% to 2.46%]). The presence of a computerized system to monitor and track screening was statistically significantly associated with decreased accuracy (i.e., change in mean Am = -0.60% [95% CI = -2.71% to -0.23%]). An increased number of diagnostic breast imaging examinations and image-guided breast interventional procedures performed at the facility was statistically significantly and positively associated with accuracy (i.e., change in mean Am = 0.55% [95% CI = 0.11% to 2.40%]). If the facility was in a hospital radiology department or multispecialty medical clinic, there was an associated decrease in accuracy (i.e., change in mean Am = -1.39% [95% CI = -3.82% to -0.15%]) compared with the accuracy expected when the facility was classified as a comprehensive breast diagnostic and/or screening center or freestanding mammography center.
Two variables were associated uniquely with pAz. The presence of double reading (the practice of having two radiologists interpret each screening mammogram) at a facility was associated with increased accuracy (i.e., change in mean pAz = 1.61% [95% CI = 1.99% to 11.65%]). However, the presence of a formal pathology correlation conference (in which physicians jointly and retrospectively review the tissue pathology associated with mammographic findings that lead to biopsy) was statistically significantly associated with a decrease in mean accuracy (change in mean pAz = -5.46% [95% CI = -15.18% to -3.21%]).
DISCUSSION
Our study found that radiologists trained more recently interpret screening mammograms, on average, with statistically significantly greater accuracy. The effect appears substantial: our models estimate a mean 0.30% reduction in Am and a 0.76% reduction in pAz for each year after residency. We now explore various mechanisms by which this observation could have occurred.
Selection Bias
One way that our finding could have occurred is if a large percentage of radiologists with a larger number of years since residency and very high accuracy elected not to participate in our study. Because we did not, of course, measure the accuracy of those who did not participate, we cannot assess this issue directly from our data. We can, however, assess whether readers with a larger number of years since residency differentially selected to participate. This differential selection is a necessary condition for the sort of selection bias that could lead to an erroneous finding of a declining relationship. Logistic regression, however, suggests that individuals with a larger number of years since residency were more likely to be willing to participate. Yet this association was not statistically significant (P = .074; odds ratio estimate = 1.023, 95% CI = 0.998 to 1.050). We therefore conclude that there is no evidence from our study to support selection bias as a cause of this finding.
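A sketch of this kind of check appears below, using Python's statsmodels rather than the software used in the study and entirely invented data for the 412 contacted radiologists; the point is only to show how the odds ratio and its 95% CI are obtained from the fitted logistic coefficient.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical data for 412 contacted radiologists: years since residency and
# whether each expressed willingness to participate (1 = willing, 0 = not willing)
years = rng.uniform(1, 35, 412)
willing = rng.binomial(1, 0.71, 412)

X = sm.add_constant(years)                  # intercept plus years-since-residency term
fit = sm.Logit(willing, X).fit(disp=0)

odds_ratio = np.exp(fit.params[1])          # OR per additional year since residency
or_ci = np.exp(fit.conf_int()[1])           # 95% CI for the odds ratio
print(f"OR = {odds_ratio:.3f}, 95% CI = ({or_ci[0]:.3f}, {or_ci[1]:.3f}), P = {fit.pvalues[1]:.3f}")
```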
Confounding
A confounder is an independent variable that is correlated with the dependent variable and with the primary independent variable (31). It is important to identify confounding because failure to do so can lead to erroneous conclusions about the true relationship between the primary independent variable and the dependent variable.
There may be variables not considered by our analysis that confound the apparent relationship between accuracy and the number of years since residency. One possibility is that the number of years since residency is a surrogate for some factor, such as perceptual acuity, that might be related to physician age. Another possibility is that differences in types and quality of training, which naturally evolve in residency programs across time, are the operative factors behind the negative association found between accuracy and years since residency. Our study cannot rule out confounding, and further research is needed to assess the role of confounders in the apparent relationship between accuracy and the number of years since residency.
Failure of Skill Maintenance or Improvement Mechanisms
Screening mammogram interpretation is a skill that must be maintained. Declining accuracy with an increased number of years since residency could come about because of the failure of radiologists to maintain skill level against an incipient tendency of skill loss often found in human activities. Another possibility is related to the potential absence of effective methods for skill improvement during the course of a professional career coupled with initial disparities in the quality of training. Our study was not able to discern the nature of the relationship between accuracy and number of years since residency. Further research is needed to explore pathways of the relationship that we have established.
Other factors were found to be statistically significant. Some (such as double reading) seem to support the quality of feedback hypothesis. Others (such as having a rotation in mammography) are nonintuitive. As demonstrated in Figs. 4 and 5, nonintuitive factors could have come about in our study because we have adjusted for other variables. In addition, caution should be used when interpreting such findings because some of the subgroups are small, with a sample size of only three, and caution should be used because testing so many hypotheses increases the likelihood of false-positive findings. Further research is needed to better understand the true effect of these factors and the influence of other confounding factors.
We conclude that the phenomenon of expertise in mammography reflects a complex multifactorial process that needs to be better understood. We believe that scientific research and health policy recommendations aimed at improving the quality of interpretation will likely be misleading and ineffectual if they consider only radiologist reading volume.
NOTES
REFERENCES
1 Hughes RG, Hunt SS, Luft HS. Effects of surgeon volume and hospital volume on quality of care in hospitals. Med Care 1987;25:489-503.
2 McArdle CS, Hole D. Impact of variability among surgeons on postoperative morbidity and mortality and ultimate survival. BMJ 1991;302:1501-5.
3 Romano PS, Mark DH. Patient and hospital characteristics related to in-hospital mortality after lung cancer resection. Chest 1992;101:1332-7.
4 Begg CB, Cramer LD, Hoskins WJ, Brennan MF. Impact of hospital volume on operative mortality for major cancer surgery. JAMA 1998;280:1747-51.
5 Begg CB, Riedel E, Bach P, Kattan MW, Schrag D, Warren JL, et al. Variations in morbidity after radical prostatectomy. N Engl J Med 2002;346:1138-44.
6 Birkmeyer JD, Siewers AE, Finlayson EV, Stukel TA, Lucas FL, Batista I, et al. Hospital volume and surgical mortality in the United States. N Engl J Med 2002;346:1128-37.
7 Sainsbury R, Haward B, Rider L, Johnston C, Round C. Influence of clinician workload and patterns of treatment on survival from breast cancer. Lancet 1995;345:1265-70.
8 Gillis CR, Hole DJ. Survival outcomes of care by specialist surgeons in breast cancer: a study of 3786 patients in the west of Scotland. BMJ 1996;312:145-53.
9 Ma M, Bell J, Campbell S, Basnett I, Pollack A, Taylor I. Breast cancer management: is volume related to quality? Br J Cancer 1997;75:1652-9.
10 Nodine CF, Kundel HL, Lauver SC, Toto LC. Nature of expertise in searching mammograms for breast masses. Acad Radiol 1996;3:1000-6.
11 Elmore JG, Wells CK, Howard DH. Does diagnostic accuracy in mammography depend on radiologists' experience? J Womens Health 1998;7:443-9.
12 Nodine CF, Kundel HL, Mello-Thoms C, Weinstein SP, Orel SG, Sullivan DC, et al. How experience and training influence mammography expertise. Acad Radiol 1999;6:575-85.
13 Kan L, Olivotto IA, Sickles EA, Coldman AJ. Standardized abnormal interpretation and cancer detection ratios to assess reading volume and reader performance in a breast screening program. Radiology 2000;215:563-7.
14 Esserman L, Cowley H, Eberle C, Kirkpatrick A, Chang S, Berbaum K, et al. Improving the accuracy of mammography: volume and outcome relationships. J Natl Cancer Inst 2002;94:369-75.
15 McKee MD, Cropp DM, Hyland A, Watroba N, McKinley B, Edge SB. Provider case volume and outcome in the evaluation and treatment of patients with mammogram-detected breast carcinoma. Cancer 2002;95:704-12.
16 Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists. Arch Intern Med 1996;156:209-13.
17 Egglin TK, Feinstein AR. Context bias. A problem in diagnostic radiology. JAMA 1996;276:1752-5.
18 American College of Radiology (ACR). Illustrated breast imaging reporting and data system (BI-RADS™). 3rd ed. Reston (VA): American College of Radiology; 1998.
19 Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986;21:720-33.
20 Hanley JA, McNeil BJ. The meaning and use of the area under an ROC curve. Radiology 1982;143:29-36.
21 Halpern EJ, Albert M, Krieger AM, Metz CE, Maidment AD. Comparison of receiver operating characteristic curves on the basis of optimal operating points. Acad Radiol 1996;3:245-53.
22 Thompson ML, Zucchini W. On the statistical analysis of ROC curves. Stat Med 1989;8:1277-90.
23 Wieand S, Gail MH, James KL, James BR. A family of nonparametric statistics for comparing diagnostic tests with paired or unpaired data. Biometrika 1989;76:585-92.
24 McClish DK. Analyzing a portion of the ROC curve. Med Decis Making 1989;9:190-5.
25 Jiang Y, Metz CE, Nishikawa RM. A receiver operating characteristic partial area index for highly sensitive diagnostic tests. Radiology 1996;201:745-50.
26 Hanley JA. The use of the binormal model for parametric ROC analysis of quantitative diagnostic tests. Stat Med 1996;15:1575-85.
27 Beam CA. A two-stage ROC regression model when sampling a population of diagnosticians. In: Chakraborty DP, Krupinski EA, editors. Medical imaging 2002: image perception, observer performance, and technology assessment. Proc SPIE 2002;4684:236-47.
28 Efron B, Tibshirani RJ. An introduction to the bootstrap. New York (NY): Chapman & Hall; 1993. p. 178-88.
29 Snedecor GW, Cochran WG. Statistical methods. 7th ed. Ames (IA): The Iowa State University Press; 1980. p. 166-7.
30 Beam CA. Reader strategies: variability and error: methodology, findings and health policy implications from a study of the US population of mammographers. In: Chakraborty DP, Krupinski EA, editors. Medical imaging 2002: image perception, observer performance, and technology assessment. Proc SPIE 2002;4686:157-68.
31 Hosmer DW, Lemeshow S. Applied logistic regression. New York (NY): John Wiley & Sons, Inc.; 1989. p. 63.
Manuscript received June 13, 2002; revised November 26, 2002; accepted January 3, 2003.