1 Obstetrics and Gynecology Epidemiology Center, Brigham and Women's Hospital, Harvard Medical School, Boston, MA.
2 Department of Epidemiology, Harvard School of Public Health, Boston, MA.
3 Strangeways Research Laboratory, Institute of Public Health, University of Cambridge, Cambridge, United Kingdom.
4 Medical Research Council Dunn Human Nutrition Unit, Cambridge, United Kingdom.
Received for publication July 22, 2003; accepted for publication January 21, 2004.
ABSTRACT |
---|
collinearity; confounding; diet; linear regression; measurement error; multivariate analysis; nutrition assessment
Abbreviations: 7DD, 7-day diet diary; EPIC Norfolk, European Prospective Investigation into Cancer and Nutrition in Norfolk, United Kingdom; FFQ, food frequency questionnaire; SD, standard deviation.
INTRODUCTION |
---|
Error in self-reported dietary intake consists of different error components: imprecision (random error) and bias in reporting (random or differential with respect to the outcome of interest). The reporting bias could be systematic with respect to the individual (e.g., consistent under- or overreporting independent of the dietary instrument) or with respect to the instrument (e.g., underreporting of foods not prespecified on a structured questionnaire or incorrect assignment of foods to a particular category or portion weight).
The magnitude of the measurement error has remained unclear. Understanding direction and magnitude of the error is essential for interpreting observed associations between diet and disease outcomes in epidemiologic studies (1). Several authors have tried to estimate the measurement error introduced by the food frequency questionnaire (FFQ) and the resulting distortion of the association between diet and disease (2–6).
The FFQ is the most commonly used diet assessment instrument in large observational studies (7). The FFQ generally aims to capture the average typical diet during the previous year or the prior 6 months (7). The most commonly used FFQ in epidemiologic studies, the semiquantitative FFQ developed by Willett (7), targets dietary composition rather than quantification of foods. The 7-day diet diary (7DD) is a written record kept by participants of all food and drink consumed over a 7-day period (7). Diet records may be less representative of usual diet but provide more detailed information on quality and quantity than the FFQ during the limited time interval and are not restricted to a certain number of foods as is the FFQ (8, 9).
Different measurement error correction models have been suggested, which estimate the attenuation factor and provide means of adjustment of the relative risk estimate (4, 10–13). Since true intake is never known, however, several assumptions have to be made when estimating the error. Diet records have been used to test the validity of FFQs in various populations because 7DDs are considered an "alloyed gold standard" of intake and because the errors of the 7DD and the FFQ were considered to be independent (7, 10, 11, 14). The measurement errors inherent in these methods, however, have been found to be correlated (2, 3, 6). Moreover, typical validation samples included an average of 100–150 subjects (7). This sample size has been deemed too small to establish validity (15, 16).
The magnitude of the measurement error may depend on the dietary assessment instrument and may differ for different foods or food groups. Furthermore, while measurement error in a model with one food or nutrient is likely to attenuate any real association, correlated errors may make models including various foods or nutrients even more difficult to interpret because the correlated error structure makes the direction of the bias difficult to predict. Many analytical models that relate diet and disease include more than one food or nutrient to reduce confounding.
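The mechanism described above can be illustrated with a small simulation. This is a minimal sketch and not the analysis performed in this paper: all parameter values (true correlation 0.5, error standard deviation 1, error correlation 0.8) and the function names `simulate` and `ols_two_predictors` are assumptions chosen for illustration. Only the first simulated nutrient truly affects the outcome; the second has no effect but is correlated with the first.

```python
import math
import random


def ols_two_predictors(y, w1, w2):
    """Least-squares slopes for y ~ w1 + w2 (intercept handled by centering)."""
    n = len(y)
    my, m1, m2 = (sum(v) / n for v in (y, w1, w2))
    s11 = sum((a - m1) ** 2 for a in w1)
    s22 = sum((b - m2) ** 2 for b in w2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(w1, w2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(w1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(w2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def simulate(error_corr, n=20000, true_corr=0.5, error_sd=1.0, seed=1):
    """Return the slopes from regressing the outcome on two error-prone nutrients.

    Illustrative assumptions: true intakes are standard normal with
    correlation `true_corr`; only nutrient 1 affects the outcome (slope 1);
    reporting errors have standard deviation `error_sd` and correlation
    `error_corr`, and are independent of true intake.
    """
    rng = random.Random(seed)
    y, w1, w2 = [], [], []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = true_corr * x1 + math.sqrt(1 - true_corr ** 2) * rng.gauss(0, 1)
        u1 = rng.gauss(0, 1)
        u2 = error_corr * u1 + math.sqrt(1 - error_corr ** 2) * rng.gauss(0, 1)
        y.append(x1 + rng.gauss(0, 0.5))   # outcome depends on nutrient 1 only
        w1.append(x1 + error_sd * u1)      # reported nutrient 1
        w2.append(x2 + error_sd * u2)      # reported nutrient 2
    return ols_two_predictors(y, w1, w2)


independent = simulate(error_corr=0.0)
correlated = simulate(error_corr=0.8)
```

Under these assumed parameters, the large-sample limits of the two slopes are about (0.47, 0.13) with independent errors and about (0.58, -0.13) with an error correlation of 0.8; a finite simulated sample scatters around these values. The nutrient with no true effect thus acquires a spurious coefficient whose sign depends on the error correlation.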
In this paper, we consider a situation where one dietary variable, dietary vitamin C, is the major dietary determinant of the dependent variable, plasma vitamin C. We explore the impact of error correlations in analytical models designed to estimate simultaneously the effect of individual nutrients. We will show here that the correlated measurement error results in spurious associations that may lead to misinterpretations of the observed associations. We will further show that these distortions differ for dietary data collected with the 7DD and the FFQ.
We used data from the European Prospective Investigation into Cancer and Nutrition in Norfolk, United Kingdom (EPIC Norfolk), that provides unique information on self-reported diet assessed with both 7DDs and FFQs on the same subjects and on vitamin C plasma level, which is used as the dependent variable. We used data from 3,074 women and men who had complete information on dietary and plasma vitamin C and who did not take vitamin supplements.
MATERIALS AND METHODS |
---|
We used data from 3,074 participants of the EPIC Norfolk cohort who completed both a 7DD and an FFQ between 1993 and 1998, who did not take vitamin supplements, from whom blood was available, and for whom the baseline 7DD had been computerized.
Diet assessment
7-Day diet diary
At the clinic visit, trained nurses taught participants how to fill in the diary using the previous day as an example. Participants completed the second and subsequent 5 days of the diary at home, recording in as much detail as possible all foods and beverages consumed. The 7DD booklets included colored photographs of 17 foods, each with three different portion sizes, to aid the participants in estimating their portion size consumed. The diaries were posted back to the coordinating center at the University of Cambridge. Diaries were coded and analyzed using a specially developed program to extract average daily nutrient intake (8, 9).
Food frequency questionnaire
The self-administered semiquantitative FFQ was designed to measure the average consumption of 130 food items during the year prior to the baseline health check. The questionnaire was based on the FFQ developed by Willett et al. (7, 8, 18) and adapted as previously described (8, 19). For each food item, participants were asked to indicate their usual consumption from nine frequency categories, ranging from "never or <1/month" to "six or more times per day." The FFQ did not include specific questions on portion size but, rather, specified average portions and unit sizes (e.g., piece, slice) or household units (e.g., glass, cup, spoon).
Nutrient intake
Nutrient intake for both the 7DD and the FFQ was calculated on the basis of published food composition tables for the United Kingdom that are frequently updated with data on new food items and nutrients (20). Fiber values were calculated as nonstarch polysaccharides using the method developed by Englyst et al. (21, 22).
Biomarker
Plasma vitamin C was measured from blood samples taken by venipuncture into citrate-covered tubes. Plasma was stabilized in a standardized volume of metaphosphoric acid and stored at -70°C. The plasma vitamin C concentration was measured with a fluorometric assay within 1 week of sampling (23). The coefficient of variation ranged between 4.6 percent and 5.6 percent across the plasma vitamin C distribution.
Statistical analysis
EPIC Norfolk participants were included in this analysis if they had plasma measurements for vitamin C and diet values from both dietary instruments. Regular users of vitamin supplements containing vitamin C were excluded from this analysis. We also excluded participants below the first percentile and above the 99th percentile of the measured vitamin C plasma level and participants with a total caloric intake assessed with either instrument below 500 kcal or above 4,200 kcal. Linear regression was used to model the association between plasma vitamin C levels and self-reported consumption of nutrients assessed with the 7DD and the FFQ. Regression models were adjusted for potential confounders assessed concurrently: age, sex, body mass index, height, and current smoking. Participants with missing values for any of the covariates were excluded from this analysis. This left a study population of 3,074 EPIC Norfolk participants.
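The exclusion rules above can be sketched as a simple filter. This is an illustration only: the record layout, the field names, and the function name `apply_exclusions` are hypothetical, and the order of the steps and the population on which the percentiles are computed are assumptions, since the text does not specify them.

```python
import statistics

# Hypothetical field names; the source does not describe its data layout.
COVARIATES = ("age", "sex", "bmi", "height", "current_smoker")
ENERGY_FIELDS = ("energy_7dd_kcal", "energy_ffq_kcal")


def apply_exclusions(participants):
    """Filter a list of participant dicts using the stated exclusion criteria."""
    # Drop regular vitamin C supplement users and anyone missing a covariate.
    eligible = [
        p for p in participants
        if not p["vit_c_supplement_user"]
        and all(p.get(name) is not None for name in COVARIATES)
    ]
    # 1st and 99th percentiles of plasma vitamin C (assumed: computed among
    # the participants who are still eligible at this point).
    cuts = statistics.quantiles([p["plasma_vit_c"] for p in eligible], n=100)
    low, high = cuts[0], cuts[-1]
    # Keep plasma values within the percentile bounds and energy intake
    # between 500 and 4,200 kcal on both instruments.
    return [
        p for p in eligible
        if low <= p["plasma_vit_c"] <= high
        and all(500 <= p[field] <= 4200 for field in ENERGY_FIELDS)
    ]
```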
To permit comparability across the two diet assessment instruments, all dietary values were standardized. Nutrients were standardized by dividing them by their standard deviation. Nutrient residuals were derived by regressing nutrients on the log scale on mean-centered log values of energy and exponentiating the resulting residuals (7). The residuals were then standardized by dividing them by their standard deviation. For nutrient density models, nutrients were divided by total energy intake, and this fraction was standardized by dividing it by its standard deviation.
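The three transformations described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the software used for the analysis: the function names are hypothetical, and the residual is taken from a log-log regression that includes an intercept, so that exponentiated residuals are ratios centered near 1.

```python
import math
import statistics


def standardize(values):
    """Divide each value by the sample standard deviation (no centering)."""
    sd = statistics.stdev(values)
    return [v / sd for v in values]


def nutrient_density(nutrient, energy):
    """Nutrient intake divided by total energy intake."""
    return [n / e for n, e in zip(nutrient, energy)]


def nutrient_residuals(nutrient, energy):
    """Exponentiated residuals of log(nutrient) regressed on centered log(energy)."""
    log_n = [math.log(v) for v in nutrient]
    log_e = [math.log(v) for v in energy]
    mean_e = statistics.fmean(log_e)
    mean_n = statistics.fmean(log_n)
    centered_e = [v - mean_e for v in log_e]
    # With a mean-centered predictor, the intercept equals the mean of log_n.
    slope = (
        sum(x * (y - mean_n) for x, y in zip(centered_e, log_n))
        / sum(x * x for x in centered_e)
    )
    residuals = [y - (mean_n + slope * x) for x, y in zip(centered_e, log_n)]
    return [math.exp(r) for r in residuals]
```

In use, a nutrient would be passed through `nutrient_residuals` or `nutrient_density` first and then through `standardize`, so that each regression coefficient refers to a one standard deviation change.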
All coefficients for nutrients resulting from the linear regression models are interpretable as the µmol/liter change in vitamin C plasma levels per one standard deviation change in nutrient intake.
We included four nutrients in our analyses: vitamin C, nonstarch polysaccharide fiber, fat, and total calories. The purpose of the model was to estimate the strength of the association between dietary vitamin C and plasma vitamin C. The other three nutrients were added to explore 1) how their inclusion in the model would affect the estimation of the association of interest and 2) whether we would observe a spurious association between any of these nutrients and the dependent variable.
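The models described in this section can be summarized in a single generic form. The notation below is introduced here for illustration and does not appear in the original text: Y is plasma vitamin C (µmol/liter), S is the subset of the four nutrients included in a given model, X_k is the intake of nutrient k (or its energy-adjusted version), and z collects the confounders (age, sex, body mass index, height, and current smoking).

```latex
Y_i = \beta_0
      + \sum_{k \in S} \beta_k \, \frac{X_{ik}}{\mathrm{SD}(X_k)}
      + \boldsymbol{\gamma}^{\top} \mathbf{z}_i
      + \varepsilon_i
```

Each coefficient \beta_k is then the change in plasma vitamin C, in µmol/liter, per one standard deviation change in nutrient k, holding the other terms in the model fixed.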
RESULTS |
---|
Introducing higher order terms of dietary vitamin C into the models did not appreciably change the results or interpretation (data not shown). Fitting a quadratic term was only a marginal improvement and did not affect the linear coefficient.
Data were also analyzed separately for women and men, but the results did not differ by gender (data not shown). Thus, all the results presented here are for women and men combined.
DISCUSSION |
---|
As expected, in the current analysis we found that models with single nutrients derived from a 7DD or an FFQ are confounded if they do not include dietary vitamin C. Models including more than one nutrient provided distorted estimates for fiber from the 7DD and for fat from the FFQ. Adding a term for total energy to the multivariate nutrient models magnified this distortion.
We confined our study population to individuals who reported no vitamin supplement intake. Plasma vitamin C levels are largely determined by diet (25). We are assuming that the plasma level of vitamin C is largely determined by vitamin C intake, although smoking, metabolic expenditure, and other lifestyle factors also influence plasma vitamin C levels. The relation between group means of plasma vitamin C values and vitamin C intake has been reported as an S-shaped curve with a linear relation between 30 and 90 mg of vitamin C intake (26). Very low vitamin C intake is almost completely absorbed by the tissue, resulting in low circulating levels. Very high levels of vitamin C are excreted in the urine (27). The association between plasma vitamin C and dietary intake of vitamin C reported previously (26) is considerably stronger than the associations reported here. When the dietary intake is expressed as mg per day, rather than as standard deviations of intake, the regression coefficient for the 7DD is approximately five times smaller, and the regression coefficient for the FFQ is 10 times smaller than the previously reported estimate. These values imply correlations between the observed and the true vitamin C intake of 0.44 for the 7DD and 0.33 for the FFQ, adjusted for age, gender, anthropometric variables, and smoking. The difference between the 7DD and the FFQ in the observed strength of the association with plasma vitamin C might be due to the different time periods that the two assessment methods cover: Blood samples and the 7DD spanned the same calendar time, while the FFQ captured the usual diet during the year prior to blood collection.
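One way to read the implied correlations is through the classical measurement error model. This is an assumption made here for illustration, since the text does not state how the figures were derived. In that model, reported intake W equals true intake T plus an error U that is independent of T and of the plasma level, the previously reported slope is treated as the true slope, and the observed slope is attenuated by a factor equal to the squared correlation between reported and true intake:

```latex
\beta_{\text{obs}} = \lambda \, \beta_{\text{true}}, \qquad
\lambda = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_U^2} = \rho_{WT}^2
\quad \Longrightarrow \quad
\rho_{WT} = \sqrt{\beta_{\text{obs}} / \beta_{\text{true}}}
```

A slope five times smaller gives \sqrt{1/5} \approx 0.45, and a slope 10 times smaller gives \sqrt{1/10} \approx 0.32, close to the values of 0.44 and 0.33 quoted above; the small differences are consistent with the ratios being rounded.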
In our analytical models that controlled for differences in age, sex, body mass index, height, and current smoking status, we would expect vitamin C intake to have the strongest relation with plasma vitamin C. Indeed, we found the vitamin C intake estimated from either dietary assessment instrument to be strongly correlated with plasma vitamin C levels. Models that did not include dietary vitamin C should reflect the positive or negative confounding effect of dietary vitamin C on other dietary variables, the direction of the confounding being determined by the direction of the association between the true intake and plasma vitamin C. Indeed, we found fiber to be directly related and fat to be inversely related to plasma vitamin C in univariate models. Thus, when fitted on their own, fiber and fat displayed positive and negative confounding effects, respectively.
In multivariate models that include dietary vitamin C among the dietary variables, other dietary variables should have little impact. Thus, we would expect neither fiber nor fat to be correlated with plasma vitamin C, if vitamin C intake is included in the model. In a model with dietary vitamin C and fiber, the association between fiber intake and plasma vitamin C was much reduced from the univariate relation but remained of borderline statistical significance. This pattern is consistent with residual confounding due to measurement error in the estimation of dietary vitamin C and fiber: The coefficients for both nutrients decrease, but the coefficient for the confounder (fiber) remains elevated. Adding energy to this model introduced modest negative confounding among the 7DD-derived nutrients. In a model including dietary vitamin C and fat, fat intake remained significantly inversely associated with plasma vitamin C. Among FFQ-derived nutrients, the coefficients for both vitamin C and fat increased in magnitude. The most plausible explanation for this observation is a combination of negative residual confounding due to measurement error in the dietary variables and a correlation of these errors, which would bring the observed estimates farther apart. When energy was added to this model, the regression coefficient for fat increased in magnitude, in particular among the FFQ values, introducing spurious negative confounding due to high error correlations, producing collinearity in estimates of the two macronutrients. Adding a separate term for energy intake to a model leads to unstable estimates, which may obscure the relative importance of the different nutrients in their association with plasma vitamin C levels. Adding energy intake to the model had little impact on the coefficient for vitamin C but greatly magnified the coefficient for fat.
Willett (28) has suggested that adjustment for total energy intake reduces the correlation among nutrients estimated from the FFQ as the errors in nutrient and energy estimation are correlated and partly cancel out. Furthermore, energy adjustment removes a substantial part of the extraneous variation in the nutrients studied (28). Adjusting for energy by adding a separate term to the regression model proved to be an inferior method of adjustment. The highly correlated errors in the estimation of fat intake and total energy did not cancel out but seemed to introduce collinearity. Furthermore, while energy intake estimated from either dietary assessment instrument was not related to plasma vitamin C, the coefficient for energy became significant when added to an individual nutrient model.
Willett (28) introduced nutrient residuals to adjust for energy intake. Indeed, the use of nutrient residuals or energy densities provided the least distorted estimates in multivariate models. Both methods improved estimates in the multivariate models for the FFQ, with dietary vitamin C appearing clearly as the variable with the strongest association with plasma vitamin C.
Problems with collinearity when fitting a model including fat and total energy have been reported previously by Howe et al. (29). A model including saturated fat and total energy did not converge. Howe et al. proceeded to fit a model including saturated fat and energy not from saturated fat. However, the interpretation of the coefficient for saturated fat in this model changes: It indicates the association between the disease outcome and a change in absolute intake of saturated fat, not in relative intake.
Our data suggest that correlated errors might have distorted the performance of nutrient values derived using self-administered dietary assessment instruments; the errors appear to be somewhat more correlated among nutrients derived from the FFQ. A diet assessment instrument such as the FFQ with a prespecified list of foods is likely to induce overreporting of the food items listed and grouped together; that is, a person overreporting consumption of oranges is likely to overreport consumption of apples and the other fruits listed. This error correlation between foods translates into error correlations between the nutrients calculated from the reported foods. Indeed, we found that the fruit and vegetable consumption reported on the FFQ was almost double that on the 7DD. It is therefore plausible and consistent with our findings that the error correlation among the nutrients derived from a structured food questionnaire is larger than that of an unstructured food diary.
When trying to unveil the true effect of diet on disease, we can test a large number of dietary variables. Because the true associations are unknown, spurious associations emerging from multivariate models may be mistaken for causal associations and may erroneously identify random dietary variables as predictors of disease. Our data indicate that correlated errors may lead to distortion of the underlying associations. In a multivariate diet model, the direction of such distortion cannot be predicted without knowing the full multivariate error structure. In the search for dietary candidates of disease prevention or causation, we may be easily misled.
In conclusion, the interpretation of standard multivariate models of diet may be misleading as a result of the correlated errors producing spurious associations of unpredictable direction. These problems can be largely overcome by choosing appropriate analytical models, such as nutrient residuals or nutrient densities.
ACKNOWLEDGMENTS |
---|
The authors are grateful to Dr. Walter C. Willett for helpful comments on the manuscript.