False positive outcomes and design characteristics in occupational cancer epidemiology studies

Gerard GMH Swaen, Olga Teggeler and Ludovic GPM van Amelsvoort

Department of Epidemiology, University of Maastricht, The Netherlands.

Gerard GMH Swaen, Department of Epidemiology, University of Maastricht, PO Box 616, 6200 MD Maastricht, The Netherlands. E-mail: g.swaen@epid.unimaas.nl

Abstract

Background Recently there has been considerable debate about possible false positive study outcomes. Several well-known epidemiologists have expressed concern that epidemiological research may lose credibility with policy makers as well as with the general public.

Methods We have identified 75 false positive studies and 150 true positive studies, all published reports and all epidemiological studies reporting results on substances or work processes generally recognized as being carcinogenic to humans. All studies were scored on a number of design characteristics and factors relating to the specificity of the research objective. These factors included type of study design, use of cancer registry data, adjustment for smoking and other factors, availability of exposure data, dose- and duration-effect relationship, magnitude of the reported relative risk, whether the study was considered a ‘fishing expedition', affiliation and country of the first author.

Results The strongest factor associated with a false positive or true positive study outcome was whether the study had a specific a priori hypothesis. Fishing expeditions had a more than threefold odds ratio of being false positive. Factors that decreased the odds ratio of a false positive outcome included observing a dose-effect relationship, adjusting for smoking and not using cancer registry data.

Conclusion The results of the analysis reported here clearly indicate that a study with a specific a priori study objective should be valued more highly in establishing a causal link between exposure and effect than a mere fishing expedition.

Keywords Epidemiology, false positive studies, methods, occupational cancer

Accepted 24 April 2001

Recently a number of unexpected outcomes of epidemiological studies fuelled the discussion among epidemiologists about the limits and validity of epidemiological research. In a number of articles and editorials epidemiologists argued that epidemiology is producing false results, and that the public and policy makers are developing reservations towards epidemiology in general.

Gary Taubes interviewed a number of prominent epidemiologists and published the responses in an article on the limits of epidemiologic research.1 Taubes' work was triggered by research on associations between radon exposure and lung cancer, DDT and breast cancer, cancer risks from electromagnetic fields and a number of other topics, all characterized by conflicting and contradictory study results. The interviewed epidemiologists explained that part of the problem lies in ‘the very nature of epidemiologic studies, in particular those that try to isolate causes of non-infectious disease, known variously as observational or risk factor or environmental epidemiology'. Confounding factors, exposure misclassification and recall bias were mentioned as important factors in producing conflicting study results. Trichopoulos, one of the prominent epidemiologists interviewed, also expressed his concern, stating, ‘we (epidemiologists) are fast becoming a nuisance to society. People don't take us seriously anymore', and suggested that the press should become more sceptical about epidemiological findings.1 Later Trichopoulos published an article in which he wrote that ‘concern has recently arisen that epidemiology has either exhausted its potential or, worse, is generating conflicting results that confuse the public and disorient policy makers'.2 Koplan et al.3 noted that epidemiology is sometimes regarded as a science that brings bad news or worse, and often reports contradictory findings. In the same volume of the American Journal of Public Health it was stated that epidemiology is accused of being ‘the source of spurious, confusing and misleading findings'.4 This issue was also addressed in a recent article by Bankhead5 in which the scientific aspects of epidemiology were discussed.

The problem of false positive findings in epidemiology was already addressed nearly two decades ago. Following the reporting of a presumably false positive finding of an association between coffee drinking and pancreatic cancer, Feinstein stated, ‘the credibility of risk factor epidemiology can withstand only a limited number of false alarms'.6

Clearly, as indicated by these quotes, scientists are concerned about the possibility that epidemiological research may produce false results, either in the form of false negative or false positive results. False study results may have their roots in chance. For some reason, not necessarily causally related to the exposure under investigation, the exposed population may have a different a priori risk for developing the disease than the unexposed population. In such a situation even a well-designed epidemiological study will produce a false result. In addition, it is possible that shortcomings in the study design may lead to false results, either false positive or false negative results.

There are numerous examples of conflicting study results. A collection of 56 topics with contradictory results in case-control research has been reported by Mayes et al.7 Thirty of these topics concerned issues related to cancer risks. Mayes suggested that most of the disagreement is caused by the lack of a rigorous set of scientific principles and proposed using the principles of experimental trials to develop scientific standards for case-control research. We can only partially agree with this proposal. Clearly an experimental design must be regarded as superior to non-experimental designs. However, many epidemiological questions are not open to research with an experimental design. Long-term health effects of possibly toxic agents cannot be studied experimentally, as this would be unethical. Despite the notion that the experimental design should be regarded as the more powerful research design, we are left with the situation that, in many fields, experiments are unethical, would take too long to conduct because of the long latency between exposure and effect, or both.

An association between a certain exposure and a specific type of tumour can be reported in one study but not in another. Unfortunately, there is no 100% certainty or gold standard telling us which result is a true positive and which a false positive. It has been suggested that animal carcinogenicity data can be taken as the gold standard. However, since it is a matter of great debate how well animal studies predict the carcinogenicity of chemicals in humans, we have decided not to take these as the gold standard.

In a number of instances it is generally accepted that some specific causal relation exists and any other association is likely to be spurious. For instance, it is generally accepted that occupational exposure to asbestos fibres in the air will increase the risk for lung cancer, mesothelioma and possibly laryngeal cancer. A study reporting an association between asbestos exposure and leukaemia can probably be regarded as a false positive finding, since many cohort studies, with sufficient statistical power to detect such an association, have not reported this result.

In general terms, false positive results can be a result of chance, confounding or bias. There are many more specific hypotheses about which design characteristics may lead to false positive findings. For instance, in his introductory book on occupational epidemiology, Hernberg specifically addresses the problem of false positive and false negative study results.8 He states that the most common reasons for false positive studies are information bias and confounding. According to Hernberg, case-control studies have a tendency to produce false positive results, since information on exposures can be biased by the disease status of the respondent (information bias). In retrospective cohort studies, confounding is the more probable cause of false positive results, because its effect cannot be controlled for. Empirical data to support hypotheses such as these are lacking.

Another possible source of false positive studies may be publication bias. This bias towards more likely publication of positive results as compared to negative results has been described in the literature.9,10 Publication bias may tend to increase the number of reported positive findings in general. We have focused on a comparison of false positive studies with true positive studies; the occurrence of publication bias was therefore not the focus of our research. Publication bias may, however, have an effect on our analysis, because false positive results may be more likely than true positive results to be affected by it. Possibly, review boards and editors are somewhat hesitant to publish positive findings that have not yet been reported elsewhere. The contrary may also be true: a new finding, perhaps a false positive one, may be better able to attract the reader's attention.

The study reported here was specifically designed to compare false positive studies with true positive studies. It was thought that if design characteristics that increase the probability of a false positive study result can be identified, studies with these characteristics should be interpreted with caution. If, for instance, false positive studies more often turn out to be ecologic studies than true positive studies, this could have an impact on the interpretation of the results of ecologic studies.

In order to compare false positive studies and true positive studies we searched the scientific literature focusing on occupational cancer epidemiological studies: studies aimed at investigating the possible carcinogenic effects of occupational exposures. The advantage of focusing on the literature of occupational cancer epidemiology is that the risk factors under investigation are more uniform and we have a set of agents or occupational exposure conditions that have been generally accepted as being carcinogenic.

Methods

We have distinguished false positive from true positive studies. Since there is no gold standard for this distinction, we have based our classification on the International Agency for Research on Cancer (IARC) classification.11 IARC has evaluated a range of chemicals and occupational exposure circumstances regarding their possible carcinogenic effect on humans. A small number of the evaluated chemicals have been classified as carcinogenic to humans on the basis of the available epidemiological study results. The IARC expert groups have critically reviewed the available epidemiological evidence for these chemicals and have concluded that there is sufficient evidence of a carcinogenic effect. These chemicals or substances have been classified as category 1 carcinogens and were selected for the study reported here. From the category 1 carcinogens, only the substances that occur as occupational exposures were selected; agents such as drugs for treating cancer were excluded, in order to arrive at a homogeneous group of substances, all investigated in an occupational setting.

These exclusions resulted in a list of 20 occupationally related carcinogenic substances or processes, all classified as category 1 human carcinogens on the basis of occupational epidemiology studies. These substances or work processes were regarded as true positive carcinogenic substances. In addition, the IARC classifications were used to identify the target organ(s) in which the carcinogenic effect was observed. Studies reporting a carcinogenic effect in other organs were regarded as false positive. For instance, a study reporting an elevated risk of leukaemia in benzene-exposed workers was regarded as a true positive study, whereas a study reporting an elevated lung cancer risk in benzene-exposed workers was regarded as a false positive.
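
To illustrate this classification rule, a minimal sketch in Python follows. The agent and organ entries shown are an illustrative fragment only, not the full target-organ table used in the study (which is given in Table 1), and all names are ours:

# Illustrative fragment of the IARC category 1 target-organ table (see Table 1);
# the actual study used 19 agents with their IARC-listed target organs.
IARC_TARGET_ORGANS = {
    "benzene": {"leukaemia"},
    "asbestos": {"lung cancer", "mesothelioma", "laryngeal cancer"},
}

def classify_report(agent, reported_cancer):
    """Classify a reported positive association as true or false positive.
    A positive finding in an IARC-listed target organ for the agent counts as
    true positive; a positive finding in any other organ as false positive."""
    if reported_cancer in IARC_TARGET_ORGANS[agent]:
        return "true positive"
    return "false positive"

print(classify_report("benzene", "leukaemia"))    # true positive
print(classify_report("benzene", "lung cancer"))  # false positive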

Next, a literature search was performed to identify epidemiological studies reporting on these substances. This was done using Medline with a time window limited to 1984 until 1997. A simple search combining the name of each substance as listed in Table 1 with ‘cancer' and ‘epidemiology' produced a set of over 20 000 hits. Articles reporting negative findings, either true or false negative, were excluded, as were articles not based on original data collection on exposed people, such as reviews and meta-analyses. Pilot studies were also excluded. Seventy-five false positive studies were identified in this set by reviewing the abstracts of the articles. The dataset for each agent was sorted by year of publication. For each false positive study, the two true positive studies from the same search published immediately before or after it were selected. This procedure resulted in 75 triplets, each consisting of one false positive study and two true positive studies. A list of the 225 included studies is available on request. For beryllium, no false positive studies were found in the literature search, so the final list of carcinogenic substances or processes under investigation contained 19 substances or processes. In Table 1 the list of carcinogenic substances or work processes and their true and false positive effects is given.
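
As we read the matching step described above, it can be sketched as follows (a minimal Python illustration; the function and field names are ours and purely hypothetical): within the search results for one agent, sorted by publication year, each false positive study is paired with the two true positive studies nearest to it in that ordering.

def build_triplets(studies):
    """Pair each false positive study with the two nearest true positive
    studies in publication order (1:2 matching, one triplet per false
    positive study). Each study is a dict with 'year' and 'false_positive'
    keys; these field names are illustrative."""
    ordered = sorted(studies, key=lambda s: s["year"])
    true_pos = [(i, s) for i, s in enumerate(ordered) if not s["false_positive"]]
    triplets = []
    for i, study in enumerate(ordered):
        if not study["false_positive"]:
            continue
        # Take the two true positives closest to the index study in the ordering.
        nearest = sorted(true_pos, key=lambda t: abs(t[0] - i))[:2]
        if len(nearest) == 2:
            triplets.append((study, nearest[0][1], nearest[1][1]))
    return triplets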


Table 1 List of 20 carcinogenic substances or work processes encountered in the occupational environment and their true positive and false positive effects
 
The 225 articles were looked up in the university library or requested by interlibrary loan. Copies were made and coded by us by means of a simple coding form, after each copy was blinded with respect to the false or true positivity of the results. Blinding sometimes failed because the coder already knew whether a study was a false positive or a true positive. Each study was scored by GS and OT independently; the two score forms were then compared and discrepancies were resolved. It must be noted, however, that some scoring required a certain degree of interpretation and judgement by us. For instance, all studies were scored by us as being a ‘fishing expedition' or not. This was partly a matter of judgement, since in none of the articles was it stated that the study was a fishing expedition without a specific hypothesis.

Each study was scored on a number of items: journal of publication, research design, specific or general study objective (a study was coded as having a specific study objective if one hypothesis between a specific exposure and a specific effect was postulated; if the study had two, three or four specific hypotheses it was separately coded as such), type of exposure data, correction for smoking habits or other confounders, testing for and observing a dose-response relationship, number of statistical tests performed, study size for the specific reported association, affiliation of the primary investigator, collaboration with other research institutes and being a fishing expedition or not.

A study was coded as a fishing expedition if no clear underlying hypothesis on a specific cause (chemical agent or profession) and a specific effect (the type of tumours postulated to have an increased occurrence as a result of the exposure) was mentioned in the paper. There is only a small difference between the variables ‘fishing expedition' and ‘specific hypothesis yes/no'. Any study with one clearly defined hypothesis between a specific exposure and a specific effect was coded as yes on the variable ‘specific hypothesis' and no on ‘fishing expedition'. Similarly, a study with two, three or four clearly defined hypotheses was coded as no on the variable ‘specific hypothesis', yes on ‘specific hypothesis 2, 3 or 4' and no on ‘fishing expedition'. However, a study focusing on five or more specific hypotheses was coded as no on ‘specific hypothesis', but was also not coded as a fishing expedition. An example: in one article the long-term health effects of asbestos were studied, and the authors clearly hypothesized that asbestos exposure may cause lung cancer, mesothelioma and laryngeal cancer. This study was not regarded as a fishing expedition, but neither was it regarded as having a specific hypothesis. Such a study could of course still report an excess of, say, colon cancer and thus could still be coded as false positive.

Simple odds ratios (OR) were calculated to investigate the crude associations between the study characteristics and the status of the study in terms of false or true positivity. Next, multiple logistic regression was used to adjust the individual OR for the effect of other variables that appeared to be related to true or false positivity. A number of variables were selected for the model; the selection criterion was that the variable was relatively strongly associated with true or false positivity. Collinearity between the variables ‘fishing expedition' and ‘specific hypothesis' prevented the simultaneous inclusion of both predictors in the model. We did not choose statistical significance of the unadjusted OR as the primary selection criterion for inclusion in the model because of the relatively limited sample size. The variables presented in Table 3 come from the full logistic regression model, which takes the matched design into account by giving each stratum its own unique intercept term.
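
A minimal sketch of these two analysis steps in Python follows, assuming a one-row-per-study data file; the file and column names are our own illustrative choices, not those of the original coding form:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One row per study: fp (1 = false positive), predictors such as fishing,
# smoke_adj, registry, dose_response and log_or, plus the matched-triplet
# identifier 'stratum'. File and column names are hypothetical.
df = pd.read_csv("studies.csv")

# Crude OR for one characteristic, e.g. adjustment for smoking (2x2 table).
a = ((df.fp == 1) & (df.smoke_adj == 1)).sum()  # false positive, adjusted
b = ((df.fp == 0) & (df.smoke_adj == 1)).sum()  # true positive, adjusted
c = ((df.fp == 1) & (df.smoke_adj == 0)).sum()  # false positive, not adjusted
d = ((df.fp == 0) & (df.smoke_adj == 0)).sum()  # true positive, not adjusted
crude_or = (a * d) / (b * c)

# Multiple logistic regression with a separate intercept per matched stratum,
# as described above; C(stratum) adds indicator terms for the triplets.
model = smf.logit(
    "fp ~ fishing + smoke_adj + registry + dose_response + log_or + C(stratum)",
    data=df,
).fit()
print(np.exp(model.params))  # adjusted ORs (the stratum terms can be ignored)

An alternative with the same intent would be conditional logistic regression, which conditions the stratum intercepts out rather than estimating them; the text above describes the intercept-per-stratum variant.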


Table 3 Associations between selected study characteristics and false positive status of the study in a multiple logistic model, odds ratios (OR) and 95% CI
 
Results

A number of study characteristics were investigated in association with the true or false positive status of the study. The OR in Table 2 were calculated by taking the cruder study design as the standard. For instance, the OR for adjustment for smoking was calculated by dividing the odds of a false positive outcome among studies with adjustment for smoking by the odds of a false positive outcome among studies without adjustment for smoking, the standard. An OR below 1 indicates that the factor under investigation is protective against a false positive outcome.
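
As a worked illustration with invented counts (not the study's data): if 20 of 100 studies that adjusted for smoking were false positive, against 55 of 125 studies that did not adjust, the respective odds would be 20/80 and 55/70, giving OR = (20/80)/(55/70) ≈ 0.32, i.e. adjustment for smoking would be protective against a false positive outcome.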


Table 2 Associations between study characteristics and true or false positive status of the study in terms of odds ratios (OR) and 95% CI
 
Study objective
Several factors relating to the study objective were associated with true or false positivity. Studies with a broader scope than occupation and cancer, for instance focusing on all causes of a disease or all possible health effects of an exposure, had a marginally elevated chance of being false positive (OR = 1.14, 95% CI : 0.47–2.76). Studies that were clearly aimed at a specific hypothesis, in terms of a specific exposure and a specific carcinogenic effect, had a strongly decreased chance of being false positive (OR = 0.23, 95% CI : 0.11–0.49). This was also reflected in studies reporting more than 50 significance tests (OR = 0.62, 95% CI : 0.35–1.08). In addition, studies that we classified as not being fishing expeditions were less likely to be false positive (OR = 0.33, 95% CI : 0.18–0.57).

Study design
Several design characteristics were associated with true or false positivity. Retrospective cohort studies had a slightly smaller chance of yielding a false positive finding (OR = 0.80, 95% CI : 0.44–1.47). The number of prospective cohort studies was too small to be analysed in detail; however, all five prospective cohort studies were categorized as true positives. Case-control studies or cohort studies not based on cancer registries had a decreased likelihood of yielding false positive outcomes compared with studies in which cancer registries played an important role (OR = 0.35, 95% CI : 0.19–0.62). The likelihood of a false positive study tended to be smaller if adjustments for other risk factors were made. Studies in which an adjustment for smoking was made had a decreased likelihood of yielding false positive results (OR = 0.51, 95% CI : 0.29–0.90). The same was true for studies in which adjustments were made for factors other than smoking (OR = 0.67, 95% CI : 0.36–1.22). In general terms, studies with information on exposure concentration had a decreased likelihood of yielding false positive results (OR = 0.70, 95% CI : 0.34–1.44). If a study reported a positive dose-response relationship, the likelihood of a false positive finding was substantially smaller than if a dose-response relationship was not tested for (OR = 0.26, 95% CI : 0.10–0.71). This was not the case if a dose-response relationship was tested for but not found, compared with studies in which the relationship was not tested for. Thus the presence of a dose-response association appeared to be associated with a lower risk of a study finding being a false positive. A decreased OR was found both for studies that reported a positive duration-effect relationship and for studies that found none, compared with studies in which this relationship was not tested for. It must therefore be concluded that finding a positive duration-effect relationship is not related to the odds of a false positive finding, but that testing for this relationship is.

Other variables
The log-transformed reported OR was a strong predictor of a study not being false positive (OR = 0.56 per unit increase in the logarithm of the reported OR, 95% CI : 0.37–0.86); that is, each e-fold (roughly 2.7-fold) increase in the reported relative risk multiplied the odds of being false positive by 0.56. There was only a marginal difference in the likelihood of a false positive study between journals specifically focused on the occupational environment and other journals. The difference between countries was considerable, although not statistically significant at conventional levels. Taking the US as the reference group, the countries most likely to report false positive results were the Scandinavian countries.

Affiliation of the first author was also related to true or false positivity. The chance of a false positive study outcome was smallest if the first author was affiliated with industry. The highest chance of a false positive result was observed if the first author was affiliated with a university.

Adjustment for other factors
A logistic model that included the most important factors from the univariate analyses reported above is presented in Table 3. The variable selection for adjustment in the logistic regression model was restricted to variables describing study design characteristics and was further based on the magnitude of the univariate OR. Since there was a strong correlation between the variables ‘fishing expedition' and ‘specific hypothesis', the latter was not included in the model. From this model the factors ‘fishing expedition', ‘adjustment for smoking', ‘investigation of a dose-response relationship', ‘log odds ratio' and ‘use of cancer registry data' emerged as important and statistically significant factors, independent of the other included variables.

Discussion

This comparative study focused on identifying factors in research objectives and design characteristics that may affect the likelihood of a false positive finding in occupational cancer epidemiology.

We selected the IARC classification as the gold standard for distinguishing between true positive and false positive studies. This choice leaves some room for circular reasoning, since the IARC classification is itself based on a judgement of the true/false positivity of the epidemiological study results. However, a line must be drawn somewhere, and our choice was to let this line be drawn by IARC. The IARC classification has the advantage of being based on an evaluation of the full evidence on the carcinogenicity of the compound. Clearly the research design least likely to yield false positive results is the experimental design, with control over the exposure, randomization and blinded observation of changes in health. However, this option is normally not practicable, for ethical reasons and because of possible long latency periods. Intuitively, however, it can be assumed that the study that most resembles a true experiment should be least likely to yield false (positive) results, as has also been argued by Mayes et al.7 Our results support this conclusion, particularly if one realizes that an experiment is usually performed to test a highly specific hypothesis. Our study also supports the strong need to set up studies designed to test a specific hypothesis.

Over-reporting of positive results, either false positive or true positive, is thought to exist in the open literature because of publication bias.12 Publication bias may have had an effect on the studies selected for our analysis. However, it is questionable whether it has had an effect on our conclusions, since such an effect is only possible if publication bias selectively influences which false positive or true positive studies are published in the literature.

In summary, several design characteristics, but in particular the specificity of the study aim and whether or not the study was a fishing expedition, were associated with the likelihood of a false positive study result. The strongest associations were found for adjustment for confounders and for the specificity of the hypothesis under investigation. The latter association was confirmed by the finding that fishing expeditions were over three times more likely to yield a false positive finding than studies based on specific hypotheses to be tested. This finding persisted in a logistic model that included other important design characteristics.

With respect to study outcome, there were several factors associated with the likelihood of a false positive finding. Studies that reported a dose-response relationship were much more likely to yield a true positive finding than studies in which no dose-response relationship was reported or in which this was not investigated. Whether or not a duration-response relationship was reported was not associated with the likelihood of a true positive study result. As expected, the strength of the reported OR was also associated with true or false positivity.

The finding that researchers affiliated with industry report fewer false positive findings deserves some discussion. Since our study does not have an intrinsic gold standard we cannot distinguish between two explanations for this finding. First, it is possible that researchers from industry might be less likely to report findings that are not in agreement with other findings. Second, it is possible that researchers from academia are more likely to be driven by the need to publish results. The ‘publish or perish' paradigm is more applicable to researchers from universities than to researchers from industry.

Given the intrinsic limitations of an observational, non-experimental study it is difficult to draw solid causal inferences from our study. In addition, it was not possible to take into account the possibility that researchers present their objectives and conclusions from a different perspective when dealing with a possible false positive outcome than with a possible true positive outcome. Despite these limitations it can be concluded that several factors are strongly associated with the likelihood of a false positive or true positive study result. Some of these factors, such as the strength of the association between exposure and effect and a dose-response relationship, are already strongly embedded in the criteria for causality described by Hill13 and later modified by Susser.14 However, a similarly strong factor, whether or not the study has a specific a priori hypothesis, is not included in these well-known criteria for causality. The type of study (cohort or case-control), a factor often mentioned in the context of inferior or superior designs, did not emerge as an important factor in the analysis, although the numbers of cohort studies and case-control studies included were comparable (97 retrospective cohort, 95 case-control). It should be noted that most occupational cohort studies are retrospective cohort studies, a design that is often thought to be less robust than the prospective cohort design.

The results of the analysis reported here clearly indicate that a study with a specific a priori hypothesis should be valued more highly in establishing a causal link between exposure and effect than a mere fishing expedition. We therefore suggest using results from ‘fishing expedition' studies only for hypothesis generation, and not as a basis for conclusions regarding the potential carcinogenicity of the substance under study. This is especially true if cancer registry data are used. Also, results from studies without correction for smoking or other confounders, and from studies that tested for a dose-response relationship but did not find one, have to be handled with care.


KEY MESSAGES
  • Studies testing a specific a priori hypothesis are less likely to report false positive outcomes.
  • Adjustment for other factors, especially smoking, decreases the risk of a false positive study outcome.
  • A positive dose-response relationship and a substantial relative risk decrease the risk of a false positive finding.

 

References

1 Taubes G. Epidemiology faces its limits. Science 1995;269:164–69.

2 Trichopoulos D. The future of epidemiology. Br Med J 1996;313:436–37.

3 Koplan JP, Thacker SB, Lezin NA. Epidemiology in the 21st century: calculation, communication and intervention. Am J Public Health 1999;89:1153–55.

4 Bhopal R. Paradigms in epidemiology textbooks: in the footsteps of Thomas Kuhn. Am J Public Health 1999;89:1162–65.

5 Bankhead C. Debate swirls around the science of epidemiology. J Natl Cancer Inst 1999;91:1914–16.

6 Feinstein AR, Horwitz RI, Spitzer WO, Battista RN. Coffee and pancreatic cancer: the problem of etiologic science and epidemiological case-control research. JAMA 1981;246:957–61.

7 Mayes LC, Horwitz RI, Feinstein AR. A collection of 56 topics with contradictory results in case-control research. Int J Epidemiol 1988;17:680–85.

8 Hernberg S. Introduction to Occupational Epidemiology. Chelsea, MI: Lewis Publishers, 1992.

9 Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research. Lancet 1991;337:867–72.

10 Dickersin K, Min YI, Meinert CL. Factors influencing publication of research results. JAMA 1992;267:374–78.

11 International Agency for Research on Cancer. IARC Monographs on the Evaluation of Carcinogenic Risks to Humans: List of IARC Evaluations. Lyon: IARC, October 1996.

12 Thornton A, Lee P. Publication bias in meta-analysis: its causes and consequences. J Clin Epidemiol 2000;53:207–16.

13 Hill AB. The environment and disease: association or causation? Proc R Soc Med 1965;58:295–300.

14 Susser M. What is a cause and how do we know one? A grammar for pragmatic epidemiology. Am J Epidemiol 1991;133:635–48.