Validation of Self-reported Cancers in the California Teachers Study
Arti Parikh-Patel1,
Mark Allen2,
William E. Wright and
the California Teachers Study Steering Committee3
1 Cancer Epidemiology Research Unit, Public Health Institute, Sacramento, CA.
2 Cancer Surveillance Research Unit, Public Health Institute, Sacramento, CA.
3 California Department of Health Services, Cancer Surveillance Section, Sacramento, CA.
Received for publication July 5, 2002; accepted for publication October 1, 2002.
ABSTRACT
Self-reported cancer data from the California Teachers Study were validated by using California Cancer Registry data. The California Teachers Study cohort consists of 133,479 active and retired California teachers. In 1995–1996, data from a mailed questionnaire were linked to California Cancer Registry data. Sensitivity and specificity of self-reports of 11 types of cancer were calculated. Multivariate analyses were conducted to evaluate correlates of false-positive and false-negative reporting. Sensitivities showed great variation by cancer site. The highest sensitivities were observed for breast (96.4%) and thyroid (92.9%) cancers, whereas the lowest sensitivities were those for cervical (44.3%), endometrial (69.1%), and other skin (53.6%) cancers. The sensitivities for in situ cancers (at the time of diagnosis) were considerably lower than those for invasive cancers in about half of the cancer types surveyed. The specificities for individual cancer sites ranged from 90% to 99%; the highest were those for lung cancer, leukemia, and Hodgkin's disease (all 99.9%). The lowest specificity was for other skin cancer (90.2%). In situ stage at diagnosis and older age were significantly associated with false-negative reporting, whereas older age and non-White race were associated with false-positive reporting. These findings suggest that the feasibility of using self-reported data without verification in epidemiologic studies of cancer varies by site.
cohort studies; neoplasms; questionnaires; recall; registries
Abbreviations: SEER, Surveillance, Epidemiology, and End Results; SES, socioeconomic status.
INTRODUCTION
For a variety of reasons, self-reported disease outcomes are frequently used without verification in epidemiologic research. One such reason is the difficulty of verifying responses in studies with large samples and limited funds. Although a number of studies have examined agreement between self-reported outcomes and medical records (1–7), few have verified self-reported cancers with cancer registry data (8–12). Reported estimates of the overall sensitivity of self-reported cancer range from 27 percent to more than 90 percent (1, 5, 9, 12). Accuracy of reporting has been found to vary by cancer site. Self-reports of breast and colon cancers were found to have higher sensitivities, for instance, whereas ovarian and uterine cancers were reported to have lower sensitivities (1, 5, 9, 10). Accuracy of self-report also varies by demographic characteristics across the study population. In a comparison of self-reported responses with registry data, false-negative reporting was found to correlate with older age, non-White race, and increased time since cancer diagnosis (10). Schrijvers et al. reported that underestimation of cancer by a survey was higher among men and among urban residents (12). Although past research has shed some light on the validity of self-reported data, patterns of misclassification of self-report are still not well understood, especially in the context of large population-based studies.
The objectives of this study were to determine the sensitivity and specificity of self-reported cancers within the California Teachers Study cohort and to identify determinants of false-positive and false-negative reporting within the cohort by using California Cancer Registry data as the "gold standard." California Teachers Study and California Cancer Registry data files were linked to accomplish these objectives.
MATERIALS AND METHODS
Description of the cohort/questionnaire
The California Teachers Study is a prospective study of 133,479 current and former female public school teachers and administrators. Members of this cohort belong to the California State Teachers' Retirement System (STRS). Demographic characteristics of the study population and details of cohort recruitment and maintenance have been described elsewhere (13). During the period 1995–1996, cohort members completed a mailed survey, which included questions regarding cancer history, diet, and reproductive history. The response rate for the mailed survey was 41 percent. The questions on the various types of cancer were phrased as follows: "Have you ever had <type> of cancer?" The 11 cancer types were breast, endometrial (body of the uterus/womb), cervix, ovary, lung, leukemia, Hodgkin's disease or lymphoma, colon and rectum, thyroid, melanoma, and other skin. Respondents were asked about the date of diagnosis for breast cancer only; self-reported dates of diagnosis were not available from the questionnaire for any of the other types of cancer.
Overview of California Cancer Registry data
The California Cancer Registry is the largest population-based cancer registry for a geographically contiguous area in the world, collecting incidence reports on more than 130,000 new cases of cancer diagnosed annually in California. Cancer reporting in California is legally mandated and was fully implemented in 1988 with standardized data collection and quality control procedures (14–17). The California Cancer Registry participates in the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) program. On several standard measures of the quality of case finding recommended by the North American Association of Central Cancer Registries (NAACCR), the California Cancer Registry performs very favorably (18). The completeness of case reporting exceeds 95 percent within 18 months after the end of each calendar year. Case reporting for 1999 was estimated to be 100 percent complete as of April 2002.
Data linkage/data management
Probabilistic record linkage of California Teachers Study and California Cancer Registry databases was performed by using Integrity software (19). A probabilistic linkage uses a set of identifiers contained in both data sets to calculate the probability that records from the different data sets are matches. Weights corresponding to different probabilities are assigned to each set of matches. Matches with weights higher than a predetermined cutoff weight are accepted, whereas those that are much lower than the cutoff are rejected. The variables used to calculate weights in this record linkage were complete name, social security number, date of birth, and complete address of residence. Any California Cancer Registry record matched to more than one California Teachers Study record, or vice versa, was reviewed manually. Checks were conducted for transposed digits in social security numbers and to make sure that the date of death listed on the California Cancer Registry record was not before the date on which the questionnaire was completed. A total of 14,410 tumors were matched to 12,739 California Teachers Study members.
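The weighting scheme can be illustrated with a small sketch. The code below is not the Integrity software used for the actual linkage; the identifiers, agreement/disagreement weights, and cutoffs are hypothetical values chosen only to show how candidate record pairs are scored and classified in a Fellegi-Sunter-style probabilistic linkage.

```python
# Minimal sketch of probabilistic record linkage weighting.
# NOT the Integrity software used in the study; weights and cutoffs are
# illustrative assumptions rather than estimated m/u probabilities.
from dataclasses import dataclass

# Illustrative (agreement, disagreement) log-weights per identifier.
WEIGHTS = {
    "ssn":        (9.0, -4.0),
    "name":       (4.0, -2.0),
    "birth_date": (5.0, -3.0),
    "address":    (3.0, -1.0),
}
ACCEPT_CUTOFF = 10.0   # assumed threshold; pairs scoring above it are accepted
REVIEW_CUTOFF = 5.0    # pairs scoring in between would go to manual review

@dataclass
class Record:
    ssn: str
    name: str
    birth_date: str
    address: str

def match_weight(a: Record, b: Record) -> float:
    """Sum agreement/disagreement weights over the linkage identifiers."""
    total = 0.0
    for field, (agree_w, disagree_w) in WEIGHTS.items():
        total += agree_w if getattr(a, field) == getattr(b, field) else disagree_w
    return total

def classify_pair(a: Record, b: Record) -> str:
    w = match_weight(a, b)
    if w >= ACCEPT_CUTOFF:
        return "match"
    if w >= REVIEW_CUTOFF:
        return "manual review"
    return "non-match"

# Example: identical SSN, name, and birth date; slightly different address.
cts = Record("123-45-6789", "JANE DOE", "1950-02-01", "12 OAK ST SACRAMENTO")
ccr = Record("123-45-6789", "JANE DOE", "1950-02-01", "12 OAK STREET SACRAMENTO")
print(classify_pair(cts, ccr))  # -> "match" (weight 9 + 4 + 5 - 1 = 17)
```

In practice, the weights are derived from the estimated probabilities that an identifier agrees among true matches versus non-matches, and ambiguous pairs, like the duplicate matches described above, are resolved by manual review.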
Exclusion criteria
Data on only the first tumor of each type were retained for analysis, although a respondent could be counted multiple times if she had more than one type of cancer. Only cancers diagnosed prior to completion of the California Teachers Study questionnaire were included in the analysis file. Additionally, respondents who were not California residents at the time of the survey were excluded, as were those whose breast cancer was diagnosed before 1988 according to either their survey response (the only cancer for which a diagnosis date was collected) or the date of diagnosis in the California Cancer Registry database. This exclusion was made because standardized, statewide, population-based cancer reporting in California was not fully implemented until 1988, and cancers diagnosed before this date may not have been captured by the registry. These exclusions resulted in an analytic file containing 9,023 tumors in 8,499 people. Approximately 4 percent of the women had more than one type of cancer. Separate analysis files were created for each of the 11 types of cancer included in the questionnaire.
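A minimal sketch of this exclusion logic is shown below, assuming a flat file of matched tumors with hypothetical column names (study_id, site, dx_date, survey_date, ca_resident); it is illustrative only and simplifies the breast-cancer-specific handling of self-reported pre-1988 diagnosis dates.

```python
# Illustrative sketch of the analytic-file exclusions; column names are
# assumptions, not the study's actual variable names.
import pandas as pd

def build_analytic_file(tumors: pd.DataFrame) -> pd.DataFrame:
    """tumors: one row per matched tumor, with datetime columns dx_date and
    survey_date and a boolean ca_resident column."""
    df = tumors.copy()
    # Keep only the first tumor of each type per respondent.
    df = df.sort_values("dx_date").drop_duplicates(subset=["study_id", "site"], keep="first")
    # Keep tumors diagnosed before questionnaire completion.
    df = df[df["dx_date"] < df["survey_date"]]
    # Exclude respondents who were not California residents at the survey date.
    df = df[df["ca_resident"]]
    # Exclude diagnoses before statewide reporting began in 1988 (the study
    # additionally used the self-reported diagnosis year for breast cancer).
    df = df[df["dx_date"].dt.year >= 1988]
    return df
```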
Analytic methods
The sensitivity (the proportion of California Teachers Study members in the registry who self-reported their cancer) and specificity (the proportion of California Teachers Study members not found in the registry who did not report cancer) were calculated for each of the 11 types of cancer by using the cancer registry data as the gold standard. In addition, separate sensitivity estimates for invasive and in situ cancers were calculated. For breast cancer, the only cancer type with a self-reported date of diagnosis, we examined correlates of false-positive and false-negative reporting. Unadjusted odds ratios were calculated for covariates associated with false-positive and false-negative reporting separately. The variables examined in relation to false-negative reporting were race/ethnicity, age, time between diagnosis and questionnaire response date, stage at diagnosis, marital status, and socioeconomic status (SES). Since, for false-positive reports, we were limited to the information available on the questionnaire, we looked at only age, race/ethnicity, and SES.
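These definitions reduce, for each cancer site, to a two-by-two comparison of self-report against registry status. The sketch below illustrates that calculation; it is not the study's analytic code, and the input format is assumed.

```python
# Illustrative sensitivity/specificity calculation for one cancer site,
# with registry status treated as the gold standard.

def sensitivity_specificity(records):
    """records: list of (self_reported, in_registry) boolean pairs, one per respondent."""
    records = list(records)
    tp = sum(1 for sr, reg in records if sr and reg)          # true positives
    fn = sum(1 for sr, reg in records if not sr and reg)      # false negatives
    tn = sum(1 for sr, reg in records if not sr and not reg)  # true negatives
    fp = sum(1 for sr, reg in records if sr and not reg)      # false positives
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Example: 96 of 100 registry-confirmed cases were self-reported, and 10 of
# 1,000 registry-negative respondents falsely reported the cancer.
data = ([(True, True)] * 96 + [(False, True)] * 4
        + [(False, False)] * 990 + [(True, False)] * 10)
print(sensitivity_specificity(data))  # -> (0.96, 0.99)
```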
These variables were then entered into a logistic regression model. Race/ethnicity was divided into two categories: non-Hispanic White and all others. Age at questionnaire completion was categorized into the following groups: <45, 45–64, 65–74, 75–84, and ≥85 years. The index of SES used in this analysis was a composite variable created by principal components analysis using a number of variables from 1990 US Census data at the block-group level (20). Block-group quintiles based on statewide measurement of the SES variable were used in the analysis.
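A sketch of such a model for false-negative reporting of breast cancer is given below, using the statsmodels formula interface; the variable names, category labels, and reference levels are assumptions for illustration rather than the study's actual coding, and the analysis could equally be run in any standard statistical package.

```python
# Illustrative multivariate logistic model for false-negative reporting of
# breast cancer. Variable names and reference categories are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_false_negative_model(df: pd.DataFrame) -> pd.DataFrame:
    """df columns (assumed): false_negative (0/1), age_group, race, stage,
    years_since_dx, marital_status, ses_quintile."""
    model = smf.logit(
        "false_negative ~ C(age_group, Treatment(reference='<45'))"
        " + C(race, Treatment(reference='non-Hispanic White'))"
        " + C(stage, Treatment(reference='invasive'))"
        " + years_since_dx + C(marital_status) + C(ses_quintile)",
        data=df,
    ).fit()
    # Report adjusted odds ratios with 95% confidence intervals.
    ors = np.exp(model.params).rename("OR")
    ci = np.exp(model.conf_int()).rename(columns={0: "2.5%", 1: "97.5%"})
    return pd.concat([ors, ci], axis=1)
```

A parallel model for false-positive reporting would include only age, race/ethnicity, and SES, since registry-derived variables such as stage are unavailable for respondents without a registry record.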
RESULTS
Sensitivity/specificity
Site-specific sensitivities showed great variation (table 1). The highest sensitivities were observed for breast (96.4 percent) and thyroid (92.9 percent) cancers; the lowest sensitivities were those for cervical (44.3 percent), endometrial (69.1 percent), and other skin (53.6 percent) cancers. The sensitivities for in situ cancers (at the time of diagnosis) were considerably lower than those for invasive cancers in about half of the cancer types surveyed (table 2). Although the survey picked up about 98 percent of invasive breast cancers, the proportion of in situ breast cancers dropped to 88 percent. The sensitivity for endometrial cancer diagnosed at the in situ stage was 42.6 percent compared with 70.9 percent diagnosed at the invasive stage. Only 43.0 percent of in situ melanomas found in the registry were reported on the survey compared with 85.0 percent of invasive cases. All of the cases of cancer were invasive for the following types: ovarian, lung, leukemia, Hodgkin's disease, thyroid, and other skin. Specificity ranged from 90 percent to 99 percent, depending on the type of cancer; lung cancer, leukemia, and Hodgkin's disease had the highest specificities, while other skin cancer had the lowest.
TABLE 1. Sensitivity of self-reported cancers (in situ and invasive) in the California Teachers Study, by site, 1995–1996*
TABLE 2. Sensitivity of cancer questions on the California Teachers Study questionnaire, by stage at diagnosis, 1995–1996
Predictors of false-negative reports of breast cancer
The results of both the unadjusted and multivariate analyses of correlates of false-negative reporting of breast cancer are presented in table 3. Stage at diagnosis was the most significant determinant of false-negative reporting in both sets of analyses. Breast cancer cases that were in situ at diagnosis according to the California Cancer Registry database were approximately seven times more likely than invasive cases not to be reported by respondents on the California Teachers Study survey. Cases among older respondents were also less likely to be reported on the survey. When <45 years was used as the referent, the likelihood of false-negative reporting increased as age categories increased. Those cases in respondents aged ≥85 years at the time of survey completion were approximately five times more likely to be false negative. False-negative reporting did not differ significantly by race, time between dates of diagnosis and survey completion, marital status, or SES.
The multivariate model for false-negative reporting produced results similar to the unadjusted estimates (table 3). Older respondents were less likely to report breast cancer compared with those less than 45 years of age after we adjusted for the other variables. This effect was particularly pronounced for those in the oldest age categories. Respondents in the age categories 75–84 and ≥85 years were three and nine times more likely, respectively, to not report their breast cancer. The variable most strongly associated with false-negative reporting after adjustment was, again, stage at diagnosis. After multivariate adjustment, race, time between dates of diagnosis and survey completion, marital status, and SES were not significant determinants of false-negative reporting.
Predictors of false-positive reports of breast cancer
Older age also increased the likelihood of false-positive reporting (table 4). Compared with respondents less than age 45 years, those aged 45–64 years were three times more likely to falsely report breast cancer on the survey. The odds ratios increased as age increased. Respondents aged ≥85 years were 16 times more likely than those younger than age 45 years to falsely report breast cancer. The odds ratios for all age categories were significant at the 0.05 level. SES appeared to be protective for false-positive reporting, although the odds ratio for only the highest SES quintile was significant at the 0.05 level. Race was not significantly associated with false-positive reporting.
The logistic regression model for false-positive reporting yielded slightly different estimates than those from the unadjusted analyses (table 4). Non-White respondents were 34 percent more likely to falsely report breast cancer on the California Teachers Study survey, after we adjusted for age and SES. Again, a dose-response relation was detected for age and false-positive reporting. As age increased, the likelihood of false-positive reporting increased. The odds ratios in all age categories were significant at the 0.0001 level. Higher SES was increasingly protective for false-positive reporting, although these results were not statistically significant.
DISCUSSION
Our validation study of self-reported cancer diagnosis using registry data has a few major advantages over previous studies. First, the quality and completeness of California's registry data exceed those of most cancer registries. California's registry adheres to the high standards instituted by the North American Association of Central Cancer Registries (18). In addition, this validation study is by far the largest to date using cancer registry data. To our knowledge, ours is also the only study to include a measure of SES.
Despite these strengths, this study has a few limitations that are important to consider when interpreting our results. First, the California Cancer Registry did not initiate statewide data collection until 1988. We were able to exclude women who indicated breast cancer before this time because the survey included a question about date of diagnosis for this cancer. Questions about dates of diagnosis of other cancers were not included in the survey. As a result, it is likely that cancers diagnosed before 1988 were included in our analyses. Such inclusion would have affected specificity estimates more than sensitivity estimates. Second, if a case were diagnosed outside of California, it is possible that the registry might have missed it, although we did exclude from the analysis all California Teachers Study participants who resided out of state at the time of the survey. It is also possible that we may have missed cases because of our linkage algorithm, although we used a widely accepted method. Finally, general errors in completing the survey might account for the discrepancy between California Teachers Study and California Cancer Registry data. Generalizability of our results is limited to female populations, because no males were included in our study. Additionally, since the California Cancer Registry is a well-established, high-quality registry, these results may not be generalizable to newer, non-SEER state cancer registries.
The sensitivities of the cancer questions on the California Teachers Study survey showed great variation by cancer type, which is consistent with findings from previous studies using registry data as the gold standard (9–12). As in most other studies, breast cancer was the most accurately reported cancer by survey respondents (1, 5, 12). Self-reports of endometrial, cervical, and other skin cancers had the lowest sensitivities in our study. Several explanations are possible for the reporting patterns observed in our study of different types of cancers. It has been suggested that cancers with very clear-cut diagnostic criteria, such as breast and thyroid cancer, are more likely to be reported than cancers whose diagnostic procedures are more ambiguous (1). Chambers et al. suggested that reporting might be lower for cancers that have a large proportion of less severe histologic types, such as cervical cancer (21). For all types of cancer, respondents were less likely to report in situ cancers than invasive ones. Respondents may be less likely to report in situ cancers because they do not regard them as cancer, which may be an issue that stems from how the physician presents the cancer diagnosis to the patient.
Older age at the time of questionnaire completion was positively associated with both false-positive and false-negative reporting in our study. Whether this finding may be due to declining cognitive functioning or perceived taboos regarding cancer among older people is unclear. It has also been suggested that a lack of communication between physicians and older patients regarding their cancer diagnoses may affect whether they think they have cancer (12, 22, 23). Higher SES levels were protective for false-positive reporting but did not affect false-negative reporting.
Our findings suggest that the feasibility of using self-reported data without verification in epidemiologic studies of cancer varies by site. Breast cancer was reported quite accurately on the California Teachers Study survey, whereas endometrial and cervical cancers were reported less accurately. In situ cancers had much higher rates of misclassification than invasive cancers, regardless of site. Use of self-reported data without validation in studies that include in situ cancers and the elderly could lead to especially biased prevalence estimates. Validation may be particularly valuable in studies of certain cancers, such as those involving the cervix or uterus.
ACKNOWLEDGMENTS
This work was supported by grants from the National Institutes of Health (5R01CA077398-04) and the Centers for Disease Control and Prevention–National Program of Cancer Registries (Cooperative Agreement U75/CCU910677).
The authors thank Jennifer Rader for her help in formatting the manuscript.
NOTES
The California Teachers Study Steering Committee includes the following members: Dr. Hoda Anton-Culver, Dr. Leslie Bernstein, Dr. Dennis Deapen, Dr. Pamela L. Horn-Ross, Dr. David Peel, Rich Pinder, Dr. Peggy Reynolds, Dr. Ronald K. Ross, Dr. Dee W. West, and Dr. Argyrios Ziogas. 
Reprint requests to Dr. Arti Parikh-Patel, Public Health Institute, California Cancer Registry, 1700 Tribute Road, Suite 100, Sacramento, CA 95815-4402 (e-mail: arti@ccr.ca.gov).
REFERENCES
- Colditz GA, Martin P, Stampfer MJ, et al. Validation of questionnaire information on risk factors and disease outcomes in a prospective cohort study of women. Am J Epidemiol 1986;123:894–900.
- Harlow SD, Linet MS. Agreement between questionnaire data and medical records. The evidence for accuracy of recall. Am J Epidemiol 1989;129:233–48.
- Linet MS, Harlow SD, McLaughlin JK, et al. A comparison of interview data and medical records for previous medical conditions and surgery. J Clin Epidemiol 1989;42:1207–13.
- Paganini-Hill A, Ross RK. Reliability of recall of drug usage and other health-related information. Am J Epidemiol 1982;116:114–22.
- Paganini-Hill A, Chao A. Accuracy of recall of hip fracture, heart attack, and cancer: a comparison of postal survey data and medical records. Am J Epidemiol 1993;138:101–6.
- Nevitt MC, Cummings SR, Browner WS, et al. The accuracy of self-report of fractures in elderly women: evidence from a prospective study. Am J Epidemiol 1992;135:490–9.
- Tretli S, Lund-Larsen PG, Foss OP. Reliability of questionnaire information on cardiovascular disease and diabetes: cardiovascular disease study in Finnmark County. J Epidemiol Community Health 1982;36:269–73.
- Bergmann MM, Calle EE, Mervis CA, et al. Validity of self-reported cancers in a prospective cohort study in comparison with data from state cancer registries. Am J Epidemiol 1998;147:556–62.
- Berthier F, Grosclaude P, Bocquet H, et al. Prevalence of cancer in the elderly: discrepancies between self-reported and registry data. Br J Cancer 1997;75:445–7.
- Desai MM, Bruce ML, Desai RA, et al. Validity of self-reported cancer history: a comparison of health interview data and cancer registry records. Am J Epidemiol 2001;153:299–306.
- Kerber RA, Slattery ML. Comparison of self-reported and database-linked family history of cancer data in a case-control study. Am J Epidemiol 1997;146:244–8.
- Schrijvers CT, Stronks K, van de Mheen DH, et al. Validation of cancer prevalence data from a postal survey by comparison with cancer registry records. Am J Epidemiol 1994;139:408–14.
- Bernstein L, Allen MA, Anton-Culver H, et al. High breast cancer incidence rates among California teachers: results from the California Teachers Study (United States). Cancer Causes Control 2002;13:625–35.
- Cancer reporting in California: abstracting and coding procedures for hospitals. California Cancer Reporting System standards. Vol I. Sacramento, CA: California Department of Health Services, Cancer Surveillance Section, 1997.
- Cancer reporting in California: standards for automated reporting. California Cancer Reporting System standards. Vol II. Sacramento, CA: California Department of Health Services, Cancer Surveillance Section, 1997.
- Cancer reporting in California: data standards for regional registries and California Cancer Registry. California Cancer Reporting System standards. Vol III. Sacramento, CA: California Department of Health Services, Cancer Surveillance Section, 1997.
- Cancer reporting in California: reporting procedures for physicians. California Cancer Reporting System standards. Vol IV. Sacramento, CA: California Department of Health Services, Cancer Surveillance Section, 1998.
- Chen VW, Howe HL, Wu XC, et al. Cancer in North America, 1993–1997. Vol 1: incidence. Springfield, IL: North American Association of Central Cancer Registries, 2000.
- Vality Technology Inc. Integrity program: data reengineering environment software, version 3.3. (Computer program). Boston, MA: Vality Technology Inc, 1999.
- Yost K, Perkins C, Cohen R, et al. Socioeconomic status and breast cancer incidence in California for different race/ethnic groups. Cancer Causes Control 2001;12:703–11.
- Chambers LW, Spitzer WO, Hill GB, et al. Underreporting of cancers in medical surveys: a source of systematic error in cancer research. Am J Epidemiol 1976;104:141–5.
- Cassileth BR, Zupkis RV, Sutton-Smith K, et al. Information and participation preferences among cancer patients. Ann Intern Med 1980;92:832–6.
- Holland JC, Geary N, Marchini A, et al. An international survey of physician attitudes and practice in regard to revealing the diagnosis of cancer. Cancer Invest 1987;5:151–4.