Interrater Reliability: Completing the Methods Description in Medical Records Review Studies

Barbara P. Yawn and Peter Wollan

From the Department of Research, Olmsted Medical Center, Rochester, MN

Correspondence to Dr. Barbara P. Yawn, Department of Research, Olmsted Medical Center, 210 Ninth Street, SE, Rochester, MN 55904 (e-mail: yawnx002@umn.edu).

Received for publication October 1, 2004. Accepted for publication January 5, 2005.


    ABSTRACT
 
In medical records review studies, information on the interrater reliability (IRR) of the data is seldom reported. This study assesses the IRR of data collected for a complex medical records review study. Elements selected for determining IRR included "demographic" data that require copying explicit information (e.g., gender, birth date), "free-text" data that require identifying and copying (e.g., chief complaints and diagnoses), and data that require abstractor judgment in determining what to record (e.g., whether heart disease was considered). Rates of agreement were assessed by the greatest number of answers (one to all n) that were the same. The IRR scores improved over time. At 1 month, the reliability for demographic data elements was very good, for free-text data elements was good, but for data elements requiring abstractor judgment was unacceptable (only 3.4 of six answers agreed, on average). All assessments after 6 months showed very good to excellent IRR. This study demonstrates that IRR can be evaluated and summarized, providing important information to the study investigators and to the consumer for assessing the reliability of the data and therefore the validity of the study results and conclusions. IRR information should be required for all large medical records studies.

abstracting and indexing; data collection; epidemiologic methods; medical records; reproducibility of results; research design


    INTRODUCTION
 
Medical records are an important source of information for studying epidemiology and natural history (1–5) and are used as the "gold standard" to identify comorbidity, treatment, and past medical history in health services research (6–10). The limitations of medical records data in research are often discussed (9–13), but papers seldom address the potential limitations associated with abstraction of medical records data for research purposes (9, 12, 13), specifically limitations associated with interrater reliability (12, 14).

Of the few studies that report information on interrater reliability (14, 15), most have found agreement for a single data element such as birth date or the presence of a particular disease, information that may be relatively simple to identify and collect (16). This paper reports on the process of testing interrater reliability by using multiple types of data elements during the course of a long and complex data abstraction process.


    MATERIALS AND METHODS
 
To our knowledge, there are no standard published methods for assessing interrater reliability. The process used here included repeated comparisons of data elements taken from the same data sources by all nurses abstracting data for the primary study. Two prior studies have suggested a typology of data elements to select for those comparisons (14, 15).

Three major types of data elements were selected for comparison: demographic data, such as age or a numerical test result; "free-text" data, such as the chief complaint, which requires copying of "natural language"; and information that requires a judgment (15, 17), such as whether coronary heart disease was considered as a potential diagnosis during the course of a medical visit. Within each category, several data items were selected for review and comparison. The data elements evaluated for initial interrater reliability testing included demographic or numerical data (birth date, date of incident myocardial infarction, date of first visit during the period of observation, and cholesterol level reported at the time of hospitalization for incident myocardial infarction); free-text data (all chief complaints for the last five visits made before the incident myocardial infarction occurred, all diagnoses on the first three visits during the period of observation, and the summary diagnosis for any chest radiographs taken during the period of observation); and judgment data (presence or absence of consideration of heart disease on three visits selected from the midpoint of the observation period, presence or absence of treatment for smoking, and marital status at the time of the myocardial infarction).

At 1, 6, 9, 12, 18, and 24 months into the study, each of the nurses actively abstracting data (nine total over the course of the 2.5 years of data abstraction, but no more than six at any time) was asked to abstract the same data from two designated patients' records. All of the nurse abstractors had registered nurse and bachelor of science in nursing degrees. Two nurses had additional degrees, one master of public health and one master of science in nursing. None of the nurse abstractors was told which items were to be included in the interrater reliability analysis or when testing was to occur. After testing and analysis for each time point, results were discussed with the nurses as part of quality monitoring for data management.

This work was part of a study of the primary care diagnosis of coronary heart disease in men and women prior to their first myocardial infarction. Data abstraction required reviewing information on up to 10 years of medical care prior to the incident myocardial infarction and required from 2 to 20 hours of data abstraction per case.

All nurse abstractors' entries were reviewed by both study investigators (B. P. Y. and P. W.). They rated the nurses' entries for each selected data element as all six the same, five the same, and so on down to none being the same.

For the demographic data, the entries had to agree exactly to be considered the same. For free-text items, the core of the text had to be the same, but entries that added words on either end of the core content were not counted as different. For judgment items, the entries had to agree exactly, and only those items that were answered as yes/no or present versus not documented were assessed. The two investigators concurred on all reviews of all items.

All six entries the same was considered excellent agreement, five was considered very good agreement, and four was considered good agreement. For any data element for which there was agreement among only three or fewer nurse abstractors, agreement was considered unacceptable. Kappa statistics were not used because their primary purpose, to adjust for chance agreement among raters choosing from a small number of nominal responses (18), was not relevant to this evaluation.

Percent agreement was calculated as the total number of "same" data elements divided by the total number of data elements reviewed.
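
As an illustration only, the following sketch (in Python; the six-abstractor panel size comes from this study, but the example values and function names are ours) shows how this scoring can be expressed: the most common entry among the abstractors determines the agreement category, and percent agreement is the share of entries judged the same.

from collections import Counter

# Agreement categories used in this paper, keyed by the largest number of
# abstractors (out of six) whose entries for a data element were "the same";
# agreement among three or fewer is treated as unacceptable.
CATEGORY = {6: "excellent", 5: "very good", 4: "good"}

def score_element(entries):
    """Return (number agreeing, category) for one data element, where
    'entries' holds the value recorded by each abstractor."""
    n_agree = Counter(entries).most_common(1)[0][1]
    return n_agree, CATEGORY.get(n_agree, "unacceptable")

def percent_agreement(n_same, n_total):
    """Percent agreement: 'same' entries over all entries reviewed."""
    return 100.0 * n_same / n_total

# Hypothetical example: six abstractors record marital status for one case.
print(score_element(["married"] * 5 + ["single"]))  # (5, 'very good')
print(percent_agreement(96, 100))                   # 96.0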


    RESULTS
 
For the 600 total cases and controls, more than 1,200 individual medical records were reviewed, including over 30,000 ambulatory care visits and 1,000 hospitalizations. The average period of available follow-up was 6.8 years.

The initial review of interrater reliability at 1 month showed variation by data type: agreement for demographic data was, on average, very good (five of six nurses agreeing on the selected demographic items). Interrater reliability was lower for free-text data, which was good (on average, four of six nurses agreeing), and for data requiring judgment, which was unacceptable (on average, 3.4 of six agreeing). Immediate retraining was undertaken, and an additional, unplanned evaluation at 3 months showed that agreement had risen in all categories to at least very good, with excellent agreement for the demographic category.

Table 1 provides examples of the "differences" seen for the three types of data elements; table 2 displays results for the assessments at 6, 12, and 24 months. For example, the information on hospital discharge diagnoses shows that, at 6 months, 245 discharge diagnoses were reviewed from the medical records of the two cases selected and the data collected by the six nurses. Of those 245 diagnoses, only three were not the same as those listed by the other nurses, resulting in excellent interrater reliability. For each specific entry, none had more than one diagnosis that was not the same as the others. Looking across the same row in this table, the data for 12 months show three differences among 180 hospital discharge diagnoses and, at 24 months, one diagnosis that was not the same among 300 entries. Altogether, of the 7,426 total data entries reviewed at 1, 6, 9, 12, 18, and 24 months of the study, 90 were assessed to be "not the same," for an overall rate of agreement of 98.8 percent and excellent interrater reliability in all areas from the 6-month review through the final review at 24 months.
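
The overall figure can be traced directly from the counts in this paragraph; the brief check below simply recomputes it.

# 90 of the 7,426 entries reviewed were judged "not the same."
total_entries = 7426
not_same = 90
print(round(100 * (total_entries - not_same) / total_entries, 1))  # 98.8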


 
TABLE 1. Examples of data entry differences

 

 
TABLE 2. Interrater reliability at three points in time, presented as the number of nurse abstractor entries assessed to be "not the same" divided by the total number of entries reviewed

 

    DISCUSSION
 
The interrater reliability of the data from this study was high and improved after the first 3 months. The initial reliability tested at 1 month varied by data type, with the level of agreement for the most complex data failing to be acceptable. Knowing the level of interrater reliability prompted intensive retraining, after which reliability was reassessed with improved results. Thereafter, the interrater reliability was very good to excellent throughout the remainder of the study and varied little by type of data collected. Providing this information to manuscript reviewers, editors, and critical readers enables an informed assessment of the quality of the data and therefore the value of the results.

It is also important to assess the potential impact of the "errors" on final interpretation of the data. For example, a birth date of 1812 versus 1912 would be identified when the raw data are "cleaned" and checked for age ranges. The differences in the chief complaints listed might affect the assessment of whether the patient had a health maintenance examination or the analysis that links presenting complaints to coronary heart disease diagnoses. The only judgment difference that would be significant to the primary outcome analysis is the one visit for which the abstractors disagreed about whether coronary heart disease had been considered.

Most medical records review studies in the literature fail to report any information on interrater reliability. In a review of studies published in emergency medicine journals (14), fewer than 5 percent included any discussion of interrater reliability (10, 12, 16, 19, 20). The small number of studies that do present interrater reliability information seldom provide any details regarding what types of data elements were assessed, how frequently interrater reliability was assessed, or the number of times it was assessed (12, 21, 22). Different levels of interrater reliability might be anticipated for data elements requiring only transcription versus those requiring interpretation and subject to judgment errors (13, 14, 21, 23–25). Therefore, several types of data elements should be included. The best presentation for the results is not known; here, simple ratios appeared to convey the results adequately.

Evaluating this type of reliability is limited by having no standard or accepted format, no standard measure, and no specified level of agreement generally deemed acceptable. Even the typology of data elements is taken from a single publication with no validation, although it was perceived by the authors to have face validity.

Assessing interrater reliability is not synonymous with assessing data accuracy. No gold standards were used for comparisons, so it is possible that, when five of six nurses' data agreed, the one was correct and the five were incorrect. Validity testing of the data is also important and was completed, but it was not the focus of this study.

The cost of assessing interrater reliability will depend on the complexity of the larger study. This data abstraction process was large and complicated and included many nurse abstractors over an extended period of time. Each time interrater reliability was assessed (six time points), 10 extra abstractions were required, for a total of 60 extra abstractions that required, on average, 3 hours each. This process required a total of 180 additional hours at a cost of approximately $33 per hour. Computer programming to generate the data to review required 6 extra hours at $50 per hour, and the review required an additional 12 hours from the two investigators at $60 per hour. The total cost for this work was approximately $6,960, or 0.5 percent of the study's $1,283,000 in direct costs over 4 years. These costs were considered part of the training program for the nurse abstractors.
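
These cost figures follow directly from the reported hours and hourly rates; the short tally below (variable names are ours) reproduces them.

# 6 assessment points x 10 extra abstractions x 3 hours each, at about $33/hour
abstraction_cost = 6 * 10 * 3 * 33          # $5,940
programming_cost = 6 * 50                   # $300 for generating the review data
investigator_review_cost = 12 * 60          # $720 for the two investigators
total_cost = abstraction_cost + programming_cost + investigator_review_cost
print(total_cost)                            # 6960
print(round(100 * total_cost / 1283000, 2))  # 0.54, reported as roughly 0.5% of direct costs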

Researchers have a responsibility to report the reliability of their data. Just as all published studies need to describe their sample and sample size, the percentage of eligible cases whose data are included, and the funding source, medical records review studies should present interrater reliability data. If the reliability of the data is unknown, it may not be possible to assess the reliability of the results or recommendations made in the publication.

In conclusion, interrater reliability can be assessed and reported. Standardized methods of assessing, analyzing, and reporting interrater reliability results would make the information and its implications clearer and more generalizable.


    ACKNOWLEDGMENTS
 
Funding for the study was obtained from the Agency for Healthcare Research and Quality (R01-HS10239).


    References
 

  1. Rocca LG, Yawn BP, Wollan P, et al. Management of patients with hepatitis C in a community population: diagnosis, discussions, and decisions to treat. Ann Fam Med 2004;2:116–24.
  2. Yawn BP, Wollan P, McKeon K, et al. Temporal changes in rates and reasons for medical induction of term labor from 1980 to 1996. Am J Obstet Gynecol 2001;184:611–19.
  3. Roger VL, Jacobsen SJ, Weston S, et al. Trends in myocardial infarction incidence and survival: Olmsted County, Minnesota, 1979 to 1994. Ann Intern Med 2002;136:341–8.
  4. Yawn BP, Wollan P, Kurland MJ, et al. A longitudinal study of the prevalence of asthma in a community population of school age children. J Pediatr 2002;140:576–81.
  5. HEDIS Health Employers Data and Information Set. Data quality checks for HEDIS 2004 DST general submission files: validations requiring revision. National Committee for Quality Assurance (NCQA) website: http://www.ncqa.org/Programs/HEDIS/data%20quality%20checks%20for%20HEDIS%202004%20Submission%20Files_revise.pdf. Accessed June 1, 2004.
  6. Kurland LT, Molgaard CA. The patient record in epidemiology. Sci Am 1981;245:54–63.
  7. Melton LJ III. History of the Rochester Epidemiology Project. Mayo Clin Proc 1996;71:266–74.
  8. Stang PE, Yanagihara T, Swanson JW, et al. A population-based study of migraine headaches in Olmsted County, Minnesota: case ascertainment and classification. Neuroepidemiology 1991;10:297–307.
  9. Cassidy LD, Marsh GM, Holleran MK, et al. Methodology to improve data quality from chart review in the managed care setting. Am J Manag Care 2002;8:787–93.
  10. Peabody JW, Luck J, Glassman P, et al. Comparison of vignettes, standardized patients, and chart abstraction: a prospective validation study of 3 methods for measuring quality. JAMA 2000;283:1715–22.
  11. Roberts CM, Lowe D, Bucknall CE, et al. Clinical audit indicators of outcome following admission to hospital with acute exacerbation of chronic obstructive pulmonary disease. Thorax 2002;57:137–41.
  12. Allison JJ, Wall TC, Spettell CM, et al. The art and science of chart review. Jt Comm J Qual Improv 2000;26:115–36.
  13. Reyes A, Lacalle JR, Montero G, et al. Reliability of data abstraction in a study of appropriateness of care in chronic angina. (Abstract). Proceedings of the annual meeting of the International Society of Technology Assessment in Health Care 1999;15:135.
  14. Gilbert EH, Lowenstein SR, Koziol-McLain J, et al. Chart reviews in emergency medicine research: where are the methods? Ann Emerg Med 1996;27:305–8.
  15. Herrmann N, Cayten CG, Senior J, et al. Interobserver and intraobserver reliability in the collection of emergency medical services data. Health Serv Res 1980;15:127–43.
  16. Zadnik K, Mannis MJ, Kim HS, et al. Inter-clinician agreement on clinical data abstracted from patients' medical charts. Optom Vis Sci 1998;75:813–16.
  17. Owen JL, Bolenbaucher RM, Moore ML. Trauma registry databases: a comparison of data abstraction, interpretation, and entry at two level I trauma centers. J Trauma 1999;46:1100–4.
  18. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York, NY: John Wiley & Sons, 1981.
  19. Kosecoff J, Chassin MR, Fink A, et al. Obtaining clinical data on the appropriateness of medical care in community practice. JAMA 1987;258:2538–42.
  20. Froelicher ES, Alexander J, Beall G, et al. Reliability of medical record abstraction for assessing the spectrum of HIV-AIDS disease in adults. Presented at the VII International Conference on AIDS, Florence, Italy, June 16–21, 1991. (Abstract no. W.C.3054).
  21. Huff ED. Comprehensive reliability assessment and comparison of quality indicators and their components. J Clin Epidemiol 1997;50:1395–404.
  22. Kung HC, Hanzlick R, Spitler JF. Abstracting data from medical examiner/coroner reports: concordance among abstractors and implications for data reporting. J Forensic Sci 2001;46:1126–31.
  23. Roger VL, Killian J, Henkel M, et al. Coronary disease surveillance in Olmsted County: objectives and methodology. J Clin Epidemiol 2002;55:593–601.
  24. Luck J, Peabody JW, Dresselhaus TR, et al. How well does chart abstraction measure quality? A prospective comparison of standardized patients with the medical record. Am J Med 2000;108:642–9.
  25. Assaf AR, Lapane KL, McKenney JL, et al. Coronary heart disease surveillance: field application of an epidemiologic algorithm. J Clin Epidemiol 2000;53:419–26.