Our best friend in epidemiology, it seems, is the confounder. The confounder preoccupies our thinking, we respect its omnipresence, and we are endlessly entertained by attempting to identify one in someone else's study. As epidemiologists we spend our days chasing the confounder like detectives, anticipating its disturbing appearance when designing a study, considering potential confounders in our analysis, and trying to illuminate unconsidered or residual confounders when the results of our study do not conform with the expected.1
Other toys have also come to occupy our minds. Advanced and fancy analytical methods increasingly find their way into epidemiological analyses. They challenge the epidemiologist and impress the reader. Some real progress has been made with using more refined methods such as hierarchical models,2 structural causal models,3 and the improved graphical display of data.4
But when we contemplate how to further improve our trade, maybe we have to regress to our roots and reconsider one of our oldest acquaintances,5 one that we seem to have neglected over the years and that has apparently lost favour in the epidemiological community: measurement error.
It seems as if measurement error has been pushed into the role of the unwanted child whose existence we would rather deny. Maybe because measurement error is common, insipid and unsophisticated. Unlike the hidden confounder challenging our intellect, to discover measurement error is a no-brainer: it simply lurks everywhere. Our epidemiological fingerprints are contaminated with measurement error. Everything we observe, we observe with error. Since observation is our business, we would probably rather deny that what we observe is imprecise and maybe even inaccurate, but the time has come to unveil the secret: measurement error is threatening our profession. The threat is all the more serious because it is mostly difficult, if not impossible, to know whether the misclassification is random or differential, and thus whether it affects mainly the precision of the results or also the validity of the study.
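The distinction matters numerically. The following sketch uses invented 2x2 counts (not data from any study cited here) to show how nondifferential misclassification attenuates an odds ratio toward the null, whereas differential misclassification, recall bias for example, can push the estimate in either direction.

```python
# Hypothetical 2x2 sketch (numbers invented): nondifferential misclassification
# pulls the odds ratio toward the null; differential misclassification may not.

def misclassify(exposed, unexposed, sensitivity, specificity):
    """Apply imperfect exposure assessment to true exposed/unexposed counts."""
    observed_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    observed_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return observed_exposed, observed_unexposed

def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    return (case_exp / case_unexp) / (ctrl_exp / ctrl_unexp)

# True table: 200/100 exposed/unexposed cases, 100/100 controls -> OR = 2.0
cases, controls = (200, 100), (100, 100)

# Nondifferential: the same sensitivity/specificity in cases and controls.
nd_cases = misclassify(*cases, sensitivity=0.8, specificity=0.9)
nd_controls = misclassify(*controls, sensitivity=0.8, specificity=0.9)
print(odds_ratio(*nd_cases, *nd_controls))   # ~1.6, attenuated toward 1

# Differential: cases recall exposure better than controls (recall bias).
d_cases = misclassify(*cases, sensitivity=0.95, specificity=0.9)
d_controls = misclassify(*controls, sensitivity=0.7, specificity=0.9)
print(odds_ratio(*d_cases, *d_controls))     # ~3.0, biased away from the null
```

Under these invented error rates the true odds ratio of 2.0 is observed as roughly 1.6 in the nondifferential scenario and 3.0 in the differential one, which is precisely why not knowing the error structure is so unsettling.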
Having entered the age of circular epidemiology,6 we have dozens of studies at hand on any particular topic. So why is it, the world rightly asks, that these studies have produced inconsistent, sometimes contradictory results? Why are we not finding the same result in studies that ask the same question? How can it be? Of course, study designs differ. We study different populations and measure different confounders. Of course, there is residual confounding here and there. But would all that heterogeneity really be removable by fixing some unmeasured confounding? Or is it not rather that our instruments are not good enough to measure what we are trying to measure?
Let us look at our tools, let us look at our questionnaires: How often do you work up a sweat? How many flights of stairs do you walk up a day? What are we measuring? We are trying to capture energy expenditure. But everybody would answer these questions their own way. Interpret them differently. Respond differently. Some questions leave more room for interpretation. Some questions are harder to answer than others. How many prunes did you eat on average during the past year? The desperation of respondents is exemplified by a phone call recently received from a participant in a large cohort study: How the hell should I know how many blueberries I ate last year? Nutritional epidemiologists may have one of the most challenging tasks. We so much want to know what people eat. But can we really measure diet?
So how bad is it then? We are interested in the effects of physical activity because it is a modifiable lifestyle factor. Some studies have indicated that physical activity may reduce the risk of breast cancer.7,8 Given how little we know about the aetiology of breast cancer we are desperate to identify new pathways and methods of prevention. But early studies were countered by more recent ones that failed to identify any association with physical activity.9,10 Since every study used a different instrument, a different set of questions to extract and best characterize the exercise habits of the female study participants, the obvious question is: how good was each of these instruments? Could they differ so much that some of them might and others might not capture a glimpse of true energy expenditure?
There is only one truth. If God came down and told us what the true relation is between physical activity and breast cancer risk, about half of the epidemiological studies on this issue would be proven incorrect. In our search for true relations, we are trying very hard to measure what people do. Do we stand a chance?
The inconsistencies among studies on diet and disease are no less concerning. For example, we thought we were clear on the issue of fruit and vegetable consumption and colorectal cancer risk. The majority of evidence on an inverse association stems from case-control studies.11 The more recent prospective studies did not detect any association.12–14 Were the earlier findings artefacts of the retrospective study design? Recall bias? Residual confounding? Or are the differences due to an inadequate dietary assessment instrument which might have introduced bias in either direction? Most of the studies used a food frequency questionnaire (FFQ). People often cannot remember what they ate for lunch the day before. People misjudge their diet. They tend to overreport good foods and often are in denial of their dietary sins. Can we assess what people eat by asking them what they ate? When confronted with the question 'How much broccoli did you eat on average during the past year?', everyone contemplates their past behaviour their own way, and everyone's recollection and ability to average across time differ. No doubt, there will be considerable error. The crucial question is: how much? Responders who mark the option '6+ times per day' (which the FFQ provides) probably want to tell us they ate a lot of broccoli. And vice versa, the broccoli despisers will mark the lowest response category. Thus, much of the measurement error probably concentrates in the middle categories, which mostly do not even enter our analytical model. In epidemiological analyses we tend to compare extremes, e.g. high and low food consumption. Thus, despite the indisputable measurement error, are we able to separate individuals with extreme dietary habits? We had better make sure! Or are there other explanations for the inconsistencies in our findings across studies, such as differences among the populations studied, which are likely to differ in their dietary variation?
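Whether the extremes really remain separable can be checked with a toy simulation. The sketch below is purely hypothetical (invented error structure and sample size, not data from any cited study): it adds multiplicative reporting error to a simulated 'true' intake and asks how often the observed quintiles recover the true ones.

```python
# Hypothetical simulation: does error let us still separate extreme consumers?
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true = rng.lognormal(mean=0.0, sigma=0.5, size=n)    # simulated long-term intake
observed = true * rng.lognormal(0.0, 0.5, size=n)    # multiplicative reporting error

def quintile(x):
    """Assign each value to a quintile (0-4) of its own distribution."""
    return np.searchsorted(np.quantile(x, [0.2, 0.4, 0.6, 0.8]), x)

tq, oq = quintile(true), quintile(observed)
for q in range(5):
    agree = np.mean(oq[tq == q] == q)
    print(f"true quintile {q + 1}: {agree:.0%} land in the same observed quintile")
```

Under these assumptions, agreement tends to be highest in the bottom and top quintiles and lowest in the middle, some comfort for analyses that compare extremes, but the picture degrades quickly as the error variance grows.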
Of course, an additional consideration is that we may not even be sure that we know exactly what we want to measure. Dietary patterns may influence disease risk over a long time period, but how do we capture long-term diet with a single assessment? And besides intra-subject variability in diet on a day-to-day basis, dietary habits may change over time. Furthermore, diet (and physical activity) during childhood and adolescence may affect chronic disease risk differently than habits during adult life. Repeated measures of the exposure may improve estimates, but we have yet to understand how to use them optimally. Average the information? Update it? Create a cumulative update? But if so, how to weight it?
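The options just listed can be made concrete with a minimal sketch; the functions and numbers below are hypothetical illustrations, not a recommended analysis.

```python
# A minimal sketch (hypothetical) of three common ways to use repeated
# exposure measurements across questionnaire cycles.

def simple_average(measures):
    """One summary value: the mean over all cycles."""
    return sum(measures) / len(measures)

def most_recent(measures):
    """Simple updating: use only the latest report."""
    return measures[-1]

def cumulative_averages(measures):
    """Cumulative updating: the running mean after each cycle."""
    out, total = [], 0.0
    for i, m in enumerate(measures, start=1):
        total += m
        out.append(total / i)
    return out

servings = [3.0, 5.0, 4.0]             # hypothetical servings/day, three cycles
print(simple_average(servings))        # 4.0
print(most_recent(servings))           # 4.0
print(cumulative_averages(servings))   # [3.0, 4.0, 4.0]
```

How to weight these contributions, equally, by recency, or by a hypothesized latency period, is exactly the open question raised above.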
Other factors popular in epidemiology may be easier to assess than diet or physical activity. For example parity: the number of children a woman has. And yet, in cohorts I have studied there were female participants who, when asked repeatedly about their parity status, regressed from having four children in one questionnaire to having one in the next and none in the one thereafter. Postmenopausal women turn premenopausal again, and current smokers into never smokers, if sufficient time passes between questionnaires. How can this happen? Measurement error, or misclassification, is haunting us.
Measurement error does not stop at our primary exposures. What is residual confounding other than misclassified confounders that we cannot sufficiently account for in our analysis?15 Smoking is the classic among the residual confounders. Do people not know how many cigarettes they smoke? Maybe they know today's number, but not over a lifetime. Residual confounding by smoking is one of the ultimate threats to many of our analyses.
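A small simulation illustrates the mechanism (all variables are invented; think of the confounder as true lifetime smoking and its measured version as an error-laden self-report): adjusting for a mismeasured confounder removes only part of the confounding, so a truly null exposure still shows an association.

```python
# Hypothetical simulation of residual confounding from a mismeasured confounder.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
confounder = rng.normal(size=n)              # e.g. true lifetime smoking
exposure = confounder + rng.normal(size=n)   # exposure correlated with it
outcome = confounder + rng.normal(size=n)    # outcome caused only by the confounder
measured = confounder + rng.normal(size=n)   # error-laden self-report of smoking

def adjusted_slope(y, x, z):
    """Coefficient of x from a least-squares regression of y on [1, x, z]."""
    X = np.column_stack([np.ones_like(x), x, z])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

print(adjusted_slope(outcome, exposure, confounder))  # ~0: full adjustment
print(adjusted_slope(outcome, exposure, measured))    # ~0.33: residual confounding
```

In this toy setup the exposure has no effect at all, yet adjusting for the noisy self-report leaves roughly a third of the crude confounded association behind.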
And even consistency between epidemiological studies may not be strong evidence of validity: if residual confounding introduces bias in the same direction across studies, consistent results are merely consistently biased.
If self-reported data are so contaminated with measurement error, are we better off linking registry databases? Maybe for some purposes. But even there assessment error is likely, and we have to evaluate the quality of the information in the registry. In some instances it may be superior, but the major limitation is the general unavailability of confounder information. A Catch-22.16
How can the quality of epidemiological research be improved? We need to understand better how bad the problem of measurement error really is. We need improved validation of our research methods, of our assessment instruments. In the face of the currently available evidence on inconsistencies we must re-evaluate our tools and not take their validity for granted, even though they may have previously withstood some validation test. Suspicion of serious measurement error in our assessment methods should lead us to evaluate them again.
Measurement error correction models may be one way out of the mess. But they had better be good models, based on the right assumptions. Even small errors in the assumptions may make corrections do more harm than good.17 To make correction methods good we have to understand measurement error well, and we have to have gold standard data available to correct appropriately each time we conduct a study. Can we grasp the error well enough to improve our estimates using statistical corrections? The preferable option is to sharpen our tools, to improve our instruments. If we neglect to do so we will increasingly produce inconsistent results and lose credibility in the scientific community.18
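As one concrete example, regression calibration is a standard correction method when a validation substudy with gold standard measurements exists. The sketch below is a deliberately simplified illustration, assuming classical nondifferential error and a linear model; none of the numbers come from the studies discussed here.

```python
# Regression calibration sketch, assuming classical nondifferential error.
import numpy as np

rng = np.random.default_rng(2)

# Validation substudy: both the gold standard and the error-prone measure.
true_val = rng.normal(5.0, 1.0, 500)                 # gold standard intake
noisy_val = true_val + rng.normal(0.0, 1.0, 500)     # error-prone questionnaire

# Main study: only the error-prone measure is available.
true_main = rng.normal(5.0, 1.0, 20_000)
noisy_main = true_main + rng.normal(0.0, 1.0, 20_000)
outcome = 0.3 * true_main + rng.normal(0.0, 1.0, 20_000)   # true slope = 0.3

# Step 1: regress the gold standard on the noisy measure in the validation data.
lam, intercept = np.polyfit(noisy_val, true_val, 1)

# Step 2: replace each noisy value by its calibrated expectation E[true | noisy].
calibrated = intercept + lam * noisy_main

print(np.polyfit(noisy_main, outcome, 1)[0])   # ~0.15: attenuated naive slope
print(np.polyfit(calibrated, outcome, 1)[0])   # ~0.30: roughly recovered
```

The correction works here only because the simulated error really is classical and nondifferential and the calibration sample is representative; violate those assumptions, as reference 17 warns, and the 'corrected' estimate can be worse than the naive one.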
Notes
Obstetrics & Gynecology Epidemiology Center, Harvard Medical School and Brigham & Women's Hospital, 221 Longwood Avenue, and Department of Epidemiology, Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115, USA.
References
1 Rothman KJ, Greenland S. Modern Epidemiology. 2nd edn. Philadelphia: Lippincott-Raven, 1998.
2 Greenland S. Principles of multilevel modelling. Int J Epidemiol 2000;29:158–67.
3 Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology 1999;10:37–48.
4 Greenland S, Michels KB, Robins JM, Poole C, Willett WC. Presenting statistical uncertainty in trends and dose-response relations. Am J Epidemiol 1999;149:1077–86.
5 Armstrong BK, White E, Saracci R. Principles of Exposure Measurement in Epidemiology. Oxford: Oxford University Press, 1992.
6 Kuller LH. Circular epidemiology. Am J Epidemiol 1999;150:897–903.
7 Wyshak G, Frisch RE. Breast cancer among former college athletes compared to non-athletes: a 15-year follow-up. Br J Cancer 2000;82:726–30.
8 Thune I, Brenn T, Lund E, Gaard M. Physical activity and risk of breast cancer. N Engl J Med 1997;336:1269–75.
9 Rockhill B, Willett WC, Hunter DJ et al. Physical activity and breast cancer risk in a cohort of young women. J Natl Cancer Inst 1998;90:1155–60.
10 Moore DB, Folsom AR, Mink PJ, Hong CP, Anderson KE, Kushi LH. Physical activity and incidence of postmenopausal breast cancer. Epidemiology 2000;11:292–96.
11 Steinmetz KA, Potter JD. Vegetables, fruit, and cancer prevention: a review. J Am Diet Assoc 1996;96:1027–39.
12 Steinmetz KA, Kushi LH, Bostick RM, Folsom AR, Potter JD. Vegetables, fruit, and colon cancer in the Iowa Women's Health Study. Am J Epidemiol 1994;139:1–15.
13 Michels KB, Giovannucci E, Joshipura KJ et al. A prospective study of fruit and vegetable consumption and colorectal cancer incidence. J Natl Cancer Inst 2000;92:1740–52.
14 Voorrips LE, Goldbohm RA, van Poppel G, Sturmans F, Hermus RJJ, van den Brandt PA. Vegetable and fruit consumption and risks of colon and rectal cancer in a prospective cohort study: The Netherlands Cohort Study on Diet and Cancer. Am J Epidemiol 2000;152:1081–92.
15 Greenland S. The effect of misclassification in the presence of covariates. Am J Epidemiol 1980;112:564–69.
16 Heller J. Catch-22. New York: Simon & Schuster, 1961 (reissued 1999).
17 Phillips AN, Smith GD. Bias in relative odds estimation owing to imprecise measurement of correlated exposures. Stat Med 1992;11:953–61.
18 Taubes G. Epidemiology faces its limits. Science 1995;269:164–69.