From the Department of Research, Olmsted Medical Center, Rochester, MN
Correspondence to Dr. Barbara P. Yawn, Department of Research, Olmsted Medical Center, 210 Ninth Street SE, Rochester, MN 55904 (e-mail: yawnx002@umn.edu).
Received for publication October 1, 2004. Accepted for publication January 5, 2005.
ABSTRACT
Key words: abstracting and indexing; data collection; epidemiologic methods; medical records; reproducibility of results; research design
INTRODUCTION
Of the few studies that report information on interrater reliability (14, 15), most have found agreement for a single data element such as birth date or the presence of a particular disease, information that may be relatively simple to identify and collect (16). This paper reports on the process of testing interrater reliability by using multiple types of data elements during the course of a long and complex data abstraction process.
MATERIALS AND METHODS
Three major types of data elements were selected for comparison: demographic data, such as age or a numerical test result; "free-text" data, such as the chief complaint, which requires copying of "natural language"; and information that requires a judgment (15, 17), such as whether coronary heart disease was considered as a potential diagnosis during the course of a medical visit. Within each category, several data items were selected for review and comparison. The data elements evaluated for initial interrater reliability testing included demographic or numerical data (birth date, date of incident myocardial infarction, date of first visit during the period of observation, and cholesterol level reported at the time of hospitalization for incident myocardial infarction); free-text data (all chief complaints for the last five visits made before the incident myocardial infarction occurred, all diagnoses on the first three visits during the period of observation, and the summary diagnosis for any chest radiographs taken during the period of observation); and judgment data (presence or absence of consideration of heart disease on three visits selected from the midpoint of the observation period, presence or absence of treatment for smoking, and marital status at the time of the myocardial infarction).
At 1, 6, 9, 12, 18, and 24 months into the study, each of the nurses actively abstracting data (nine total over the course of the 2.5 years of data abstraction, but no more than six at any time) was asked to abstract the same data from two designated patients' records. All of the nurse abstractors had registered nurse and bachelor of science in nursing degrees. Two nurses had additional degrees, one master of public health and one master of science in nursing. None of the nurse abstractors was told which items were to be included in the interrater reliability analysis or when testing was to occur. After testing and analysis for each time point, results were discussed with the nurses as part of quality monitoring for data management.
This work was part of a study of the primary care diagnosis of coronary heart disease in men and women prior to their first myocardial infarction. Data abstraction required reviewing information on up to 10 years of medical care prior to the incident myocardial infarction and required from 2 to 20 hours of data abstraction per case.
All nurse abstractors' entries were reviewed by both study investigators (B. P. Y. and P. W.). They rated the nurses' entries for each selected data element as all six the same, five the same, and so on down to none being the same.
For the demographic data, the entries had to agree exactly to be considered the same. For free-text items, the core of the text had to be the same, but entries that added words on either end of the core content were not counted as different. For judgment items, the entries had to agree exactly, and only those items that were answered as yes/no or present versus not documented were assessed. The two investigators concurred on all reviews of all items.
All six entries the same was considered excellent agreement, five was considered very good agreement, and four was considered good agreement. For any data element for which there was agreement among only three or fewer nurse abstractors, agreement was considered unacceptable. Kappa statistics were not used because their primary purpose, to adjust for chance agreement among raters choosing from a small number of nominal responses (18), was not relevant to this evaluation.
Percent agreement was calculated as the total number of "same" data elements divided by the total number of data elements reviewed.
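As a minimal sketch of this scoring scheme (the study reports no code, so the function names below are illustrative, and treating "the same" as agreement with the most common entry is an assumption on our part), the rules described above might be expressed as:

```python
from collections import Counter

def agreement_category(n_same: int) -> str:
    """Map the number of abstractors (out of six) giving the same entry
    to the agreement labels used in the text."""
    if n_same == 6:
        return "excellent"
    if n_same == 5:
        return "very good"
    if n_same == 4:
        return "good"
    return "unacceptable"  # three or fewer abstractors agreeing

def percent_agreement(n_same_entries: int, n_entries_reviewed: int) -> float:
    """Percent agreement: 'same' data elements over all elements reviewed."""
    return 100.0 * n_same_entries / n_entries_reviewed

def modal_count(entries: list[str]) -> int:
    """Number of abstractors whose entry matches the most common entry."""
    return Counter(entries).most_common(1)[0][1]

# Hypothetical example: six abstractors record a cholesterol value for one case;
# five entries match, so agreement for this element is rated "very good".
entries = ["212", "212", "212", "212", "212", "221"]
print(agreement_category(modal_count(entries)))  # very good
```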
RESULTS
The initial review of interrater reliability at 1 month showed variation across the three types of data elements: agreement for demographic data was, on average, very good (five of six nurses agreeing on the selected demographic items), while agreement was lower for free-text data (good; on average, four of six nurses agreeing) and for data requiring judgment (unacceptable; on average, 3.4 of six agreeing). Immediate retraining was undertaken, and an additional, unplanned evaluation at 3 months showed that the level of agreement had risen in all categories to at least very good, with excellent agreement for the demographic category.
Table 1 provides examples of the "differences" seen for the three types of data elements; table 2 displays results for the assessments at 6, 12, and 24 months. For example, the information on hospital discharge diagnoses shows that, at 6 months, 245 discharge diagnoses were reviewed from the medical records of the two cases selected and the data collected by the six nurses. Of those 245 diagnoses, only three were not the same as those listed by the other nurses, resulting in excellent interrater reliability. For each specific entry, none had more than one diagnosis that was not the same as the others. Looking across the same row in this table, the data for 12 months show three differences among 180 hospital discharge diagnoses and, at 24 months, one diagnosis that was not the same among 300 entries. Altogether, of the 7,426 total data entries reviewed at 1, 6, 9, 12, 18, and 24 months of the study, 90 were assessed to be "not the same," for an overall rate of agreement of 98.8 percent and excellent interrater reliability in all areas from the 6-month review through the final review at 24 months.
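The overall figure follows directly from the counts reported above:

$$\frac{7{,}426 - 90}{7{,}426} \times 100 \approx 98.8\ \text{percent agreement}.$$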
[Table 1. Examples of differences among nurse abstractors' entries for the three types of data elements]
[Table 2. Interrater reliability results at the 6-, 12-, and 24-month assessments]
DISCUSSION
It is also important to assess the potential impact of the "errors" on final interpretation of the data. For example, a birth date of 1812 versus 1912 would be identified when the raw data are "cleaned" and checked for age ranges. The differences in the chief complaints listed might affect the assessment of whether the patient had a health maintenance examination or the analysis that links presenting complaints to coronary heart disease diagnoses. The only judgment difference that would be significant to the primary outcome analysis is the one visit in which there was a difference between whether or not coronary heart disease was considered.
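As an illustration of the kind of range check referred to here (a hypothetical sketch; the study's actual cleaning routines are not described), a discrepant birth year could be flagged during data cleaning:

```python
def flag_implausible_birth_year(birth_year: int,
                                visit_year: int,
                                min_age: int = 0,
                                max_age: int = 110) -> bool:
    """Return True if the implied age at the visit falls outside a
    plausible range, so the record can be sent back for review."""
    age = visit_year - birth_year
    return not (min_age <= age <= max_age)

# A birth year of 1812 for a visit in 1995 implies an age of 183 and is flagged;
# 1912 implies an age of 83 and passes.
print(flag_implausible_birth_year(1812, 1995))  # True (flagged)
print(flag_implausible_birth_year(1912, 1995))  # False
```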
Most medical records review studies in the literature fail to report any information on interrater reliability. In a review of data from emergency medicine journals (14), fewer than 5 percent of such published studies included any discussion of interrater reliability (10, 12, 16, 19, 20). The small number of studies that do present interrater reliability information seldom provide any details regarding what types of data elements were assessed, how frequently interrater reliability was assessed, or the number of times it was assessed (12, 21, 22). Different levels of interrater reliability might be anticipated for data elements requiring only transcription versus those requiring interpretation and subject to judgment errors (13, 14, 21, 23–25). Therefore, several types of data elements should be included. The best presentation for the results is not known; here, simple ratios appeared to convey the results adequately.
Evaluation of this type of reliability is limited by the absence of a standard or accepted format, a standard measure, and a specified level of agreement generally deemed acceptable. Even the typology of data elements is taken from a single publication with no validation, although the authors perceived it to have face validity.
Assessing interrater reliability is not synonymous with assessing data accuracy. No gold standards were used for comparisons, so it is possible that, when five of six nurses' data agreed, the one was correct and the five were incorrect. Validity testing of the data is also important and was completed, but it was not the focus of this study.
The cost of assessing interrater reliability will depend on the complexity of the larger study. This data abstraction process was large and complicated and included many nurse abstractors over an extended period of time. Each time interrater reliability was assessed (six time periods), 10 extra abstractions were required, for a total of 60 extra abstractions that on average required 3 hours each. This process required a total of 180 additional hours at a cost of approximately $33 per hour. Computer programming to generate the data for review required 6 extra hours at $50 per hour, and the review itself required an additional 12 hours from the two investigators at $60 per hour. The total cost for this work was approximately $6,960, or 0.5 percent of the study's $1,283,000 in direct costs over 4 years. These costs were considered part of the training program for the nurse abstractors.
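For transparency, the component costs reported above combine as follows:

$$(180 \times \$33) + (6 \times \$50) + (12 \times \$60) = \$5{,}940 + \$300 + \$720 = \$6{,}960 \approx 0.5\ \text{percent of}\ \$1{,}283{,}000.$$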
Researchers have a responsibility to report the reliability of their data. Just as all published studies need to describe their sample and sample size, the percentage of eligible cases whose data are included, and the funding source, medical records review studies should present interrater reliability data. If the reliability of the data is unknown, it may not be possible to assess the reliability of the results or recommendations made in the publication.
In conclusion, interrater reliability can be assessed and reported. Standardized methods of assessing, analyzing, and reporting interrater reliability results would make the information and its implications clearer and more generalizable.
ACKNOWLEDGMENTS
References