1 Center for Health Research, School of Public Health, Loma Linda University, Loma Linda, CA.
2 Department of Preventive Medicine, School of Medicine, University of Southern California, Los Angeles, CA.
ABSTRACT
bias correction; bias (epidemiology); measurement error; models, statistical; regression analysis; regression calibration; statistical power; statistical significance
Abbreviations: FFQ, food frequency questionnaire; SE, standard error
BACKGROUND
Recently, statistical techniques have been developed to help correct this bias, usually in the context of a calibration study (1–3). These methods include regression calibration and the analysis of structural models with latent variables (4). In both cases, the intent is to estimate the magnitude of disease associations with the unobservable true dietary intake.
The regression calibration method has been carefully described for multivariate logistic and proportional hazards models (5, 6), both of which have application to nutritional epidemiology. Thus, we focus here on regression calibration and illustrate the loss of statistical power caused by errors in exposure variables. This loss depends partly on the size of the calibration study, but the details of this relation are not well understood. Power also depends on the magnitude of the correlation between the data from the FFQ (XFFQ) and the true data (XT), as well as the degree of collinearity with other variables in the model.
REGRESSION CALIBRATION
Regression calibration for linear models will correct such biases if the assumptions are met. The concept is simple: rather than use true dietary data in the regression, one substitutes the expected value of the true variable, conditional on observed data measured with error (11). For nonlinear models, the estimates obtained are approximately unbiased (1, 5, 6). Specifically, for logistic regression, Rosner et al. (12) have shown good performance for odds ratios of 3.0 or less.
The expected values, conditional on the observed food frequency and other data not measured with error, are derived from a calibration study conducted in a much smaller sample that represents the cohort. Then the calibration equation is

X̂T = E(XT | XFFQ, Z) = αc + βc XFFQ + γ′Z,

where Z denotes the covariates measured without error.
A problem with this simple substitution algorithm is that estimates of standard errors (SEs) of the corrected coefficients produced by standard analytical packages will be too small, because no account is taken of the sampling variability of the calibration study regression coefficients. Fortunately, there are methods that provide unbiased estimates of the standard errors (5, 6), but these methods are not yet included in standard analytical packages. Thus, a simple substitution strategy using standard linear, logistic, or proportional hazards analyses will produce a set of coefficients that are approximately unbiased estimates of the true coefficients, βd, but special routines are necessary to calculate their standard errors.
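The substitution algorithm can be sketched in a small simulation (an illustrative linear-model example with assumed parameter values, not the authors' S-Plus code). A crude regression of Y on the error-prone measurement is attenuated, while substituting the calibration-study prediction recovers the true slope; note that, per the caveat above, the naive SE of the corrected coefficient from such a fit would still be too small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true exposure X_T drives a continuous outcome Y,
# but the main cohort observes only an error-prone surrogate X_FFQ.
n_cohort, n_calib = 20_000, 1_000
beta_d = 0.5                                     # assumed true disease-model slope

x_t = rng.normal(0.0, 1.0, n_cohort)
x_ffq = x_t + rng.normal(0.0, 1.0, n_cohort)     # classical measurement error
y = beta_d * x_t + rng.normal(0.0, 1.0, n_cohort)

# Calibration study: a subsample in which both X_T (reference) and X_FFQ are seen.
calib = rng.choice(n_cohort, n_calib, replace=False)
A = np.column_stack([np.ones(n_calib), x_ffq[calib]])
alpha_c, beta_c = np.linalg.lstsq(A, x_t[calib], rcond=None)[0]

# Substitute E(X_T | X_FFQ) for X_T in the disease regression.
x_hat = alpha_c + beta_c * x_ffq
B = np.column_stack([np.ones(n_cohort), x_hat])
_, beta_hat = np.linalg.lstsq(B, y, rcond=None)[0]

# The crude (uncorrected) slope is attenuated toward zero.
C = np.column_stack([np.ones(n_cohort), x_ffq])
_, beta_crude = np.linalg.lstsq(C, y, rcond=None)[0]
print(round(beta_crude, 2), round(beta_hat, 2))  # crude ≈ 0.25, corrected ≈ 0.50
```

With equal true-exposure and error variances, the crude slope is attenuated by a factor of one half, and the substitution recovers the assumed value of 0.5.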
Why is there loss of statistical power when regression calibration is used, as compared with the case where XT values are known for the whole cohort? A primary reason is the reduced variability of the predictor variables in the new regression model. With regression calibration, we substitute X̂T = E(XT | XFFQ, Z) as the predictor variable in place of XT, and X̂T is always less variable than XT, because

var(XT) = var(E(XT | XFFQ, Z)) + E(var(XT | XFFQ, Z)) (13).
In general, the greater the variability of the predictor variables, the greater the power to detect a nonzero slope in the regression analysis. In addition, there may be more variability in Y given XFFQ than in Y given XT, which also reduces power. However, for binary Y, this second issue can be neglected, since the variance of Y is uniquely determined by its mean value.
For a univariate linear calibration equation, var(X̂T) = R² var(XT), where R is the correlation coefficient between XT and XFFQ. Thus, to a first approximation, 1/R² more subjects are required to detect a nonzero regression between Y and XFFQ using regression calibration than if XT were available. Since R is rarely above 0.7 in dietary assessments, the implication is that at least twice as many subjects are needed to make up for errors in estimating XT using the FFQ. This loss of power is due to the presence of measurement errors, not to the use of the regression calibration method, and it would be present even if the regression calibration coefficients were known exactly.
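The relation var(X̂T) = R² var(XT), and the resulting 1/R² inflation in required sample size, can be checked numerically (an illustrative simulation with assumed unit variances for the true intake and the measurement error, so that R ≈ 0.71):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated check that var(E[X_T | X_FFQ]) = R^2 * var(X_T) for a linear
# calibration equation; R is the X_T-X_FFQ correlation.
n = 200_000
x_t = rng.normal(0.0, 1.0, n)
x_ffq = x_t + rng.normal(0.0, 1.0, n)    # classical error with unit variance

r = np.corrcoef(x_t, x_ffq)[0, 1]

# Fitted calibration line: x_hat = a + b * x_ffq
b = np.cov(x_t, x_ffq)[0, 1] / np.var(x_ffq)
a = x_t.mean() - b * x_ffq.mean()
x_hat = a + b * x_ffq

print(round(np.var(x_hat), 3), round(r**2 * np.var(x_t), 3))  # both ≈ 0.5
print(round(1 / r**2, 2))  # ≈ 2.0: about twice as many subjects needed
```

Here R² ≈ 0.5, so the predicted values carry half the variance of the true exposure, and roughly twice as many subjects are needed for equivalent power.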
In the univariate situation, it is well known (2) that β̂d = β̂s/β̂c. If β̂c is treated as a constant, SE(β̂d) = SE(β̂s)/β̂c. The ratio β̂/SE(β̂) determines power (see below), and it can be seen that this ratio is identical for β̂s and β̂d (with a very large calibration study). Thus, univariate analyses using uncorrected nutrient measurements will also require approximately 1/R² more subjects for equivalent power than when XT is available.
For multivariate problems, the issues are similar but more complicated, since the correlations between X1,T and X2,T become potentially important determinants of study power, and biases can point both toward and away from the null. In univariate analyses, uncorrected estimates and their standard errors can be used to perform appropriate tests for nonzero regression parameters β. However, in multivariate analyses, the uncorrected estimates will usually provide invalid tests of the hypothesis that β1 = 0 (see below), in that the actual type I error rate is not the nominal one. This emphasizes the importance of correcting for measurement errors when assessing the independent effect of one exposure given the effect of another.
In summary, power may be adversely affected in regression calibration by a number of mechanisms: 1) a small study size, as expressed by the number of events, Y; 2) multicollinearity between the predictor variables in the disease regression (the typical collinearity problem); 3) poor validity, as measured by the correlation between truth and FFQ (this depends in part on the residual error in the calibration equations); and 4) imprecision in the estimate of E(XT | XFFQ, Z) due to imprecision of the calibration equation coefficients, βc, which is also adversely affected by collinearity between variables in the calibration equation.
Designing a calibration study of sufficient size will diminish the fourth type of problem by decreasing the variance of the βck's. The third type of problem will persist despite a large calibration study, but its effect on power is diminished by a larger cohort. If the validity correlation (that between XFFQ and XT) is low, the expected values of Xi,k,T will often differ substantially from the actual individual true values. However, it is these true values that determine the individual disease outcomes, so although β̂dk is unbiased, its variance will be increased.
Referring to the work of Rosner et al. (5) and applying this to models with two dietary variables and two variables measured without error, the crude coefficients are related, approximately, to the true ones by

βsk = Σj βdj βjck, written in matrix form as βs = Γβd, (1)

where the elements of Γ are the coefficients βjck from the calibration equations.
As a consequence of equation 1, in a crude analysis a particular βsk represents a linear combination of the true βdk's. This confounding is particularly influential when variables are importantly correlated, because then the off-diagonal values of the upper rows of the calibration coefficient matrix are greater. Since X2,T is poorly measured by X2,FFQ, a correlation between X1,T and X2,T implies that the measurements X1,FFQ contain additional information about X2,T, an example of residual confounding. Then the estimate of the effect of X1 given X2 is biased.
It is apparent that even when βdk = 0, βsk in general is not zero. For instance, βs1 = βd1 × β1c1 + βd2 × β2c1, where there are two dietary variables in the model, but only the first term becomes zero under the null for the first variable. Thus, type I error rates are distorted in a crude analysis when the true value of X1 is correlated with the measured X2, even after conditioning on the measured X1, a problem also alluded to by Armstrong et al. (7).
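This residual confounding can be reproduced in a small simulation (an illustrative sketch, not the authors' code; the correlation of 0.6, the error variances, and the true coefficients are all assumed). With βd1 = 0 and βd2 = 0.5, a crude two-variable regression on the error-prone measurements yields a clearly nonzero coefficient for the first variable:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two correlated true exposures (rho = 0.6); only X2 truly affects Y.
n, rho = 200_000, 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
x_t = rng.multivariate_normal([0.0, 0.0], cov, n)
x_ffq = x_t + rng.normal(0.0, 1.0, (n, 2))       # independent classical errors
y = 0.5 * x_t[:, 1] + rng.normal(0.0, 1.0, n)    # beta_d1 = 0, beta_d2 = 0.5

# Crude regression of Y on both error-prone measurements.
X = np.column_stack([np.ones(n), x_ffq])
beta_s = np.linalg.lstsq(X, y, rcond=None)[0][1:]
print(np.round(beta_s, 2))  # beta_s1 ≈ 0.08 despite beta_d1 = 0
```

A naive test of β1 = 0 based on this crude fit would therefore reject far more often than its nominal level, and increasingly so as the cohort grows.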
METHODS
Thus, we obtained estimates of nutrient intake from the FFQ and also from a "reference" method, constructing a synthetic week by appropriate choices of weights for weekend and weekday 24-hour recall data. In the following simulation, these reference data are used as a model with which to provide realistic covariance structures and calibration β's when generating hypothetical true dietary data. Both reference and food frequency data were energy-adjusted by the residual method (16), and the variables XT and XFFQ below refer to energy-adjusted values. For the purposes of this illustration, we shall ignore the fact that the subjects of the above calibration study were clustered by church.
Simulating a cohort population and a separate calibration study
The steps were as follows for each combination of factors (e.g., cohort size, calibration study size, dietary variables), with programming in S-Plus (Insightful Corporation, Seattle, Washington).

[The numbered simulation steps (1–8), presented as display material in the original, are not reproduced here.]
Each repetition of steps 3 and 5–8 produces one estimate of SE(β). We could have based our entire analysis upon a single set of simulated data for each configuration of parameter values considered below. However, for the finite data set sizes used, some sampling variability remains in the estimate of SE(β) obtained from a single simulation. In order to reduce this sampling variability to an ignorable level, we chose to repeat these steps a total of 30 times, as we found that p = 20–30 repetitions provided very stable estimates for each parameter configuration. If, instead of using the Rosner et al. (5) formula, we had empirically attempted to estimate SE(β) by simulation alone, we would have needed many more simulations.
RESULTS
Along with other correlations, those between the energy-adjusted food frequency nutrient estimates and the corresponding reference (used as a surrogate for true) data in the calibration study are shown in table 1. The correlation for vitamin C, at 0.34, is noticeably lower than the others. Fat, a constituent of all three models, whether measured by repeated 24-hour recalls or the FFQ, is highly correlated with the corresponding dietary fiber variable, weakly correlated with calcium, and modestly correlated with vitamin C.
In contrast, when regression calibration is used (calibration study size = 1,000), the mean values of 30 simulated beta coefficients are 0.174 (fat) and -0.161 (fiber) for the fat/fiber model, 0.186 (fat) and -0.103 (calcium) for the fat/calcium model, and 0.181 (fat) and -0.382 (vitamin C) for the fat/vitamin C model. Apart from sampling errors, these results are close to expectation (see the "True β" columns on the right side of table 2) and a great improvement on the crude results. In addition, the means of the estimated variances produced by the Rosner et al. (5) method were close to the empirical variances of the β̂d, as expected.
A comparison of power under three different models is shown in table 3. The three models are the regression analysis in which true values are known, the regression calibration model (with a large calibration study used to give stability to expected values), and the uncorrected (crude) analysis. The quite dramatic loss of power in both of the latter models is clear, as predicted above.
In table 4, each entry is the observed size of the test of the null hypothesis for that variable in a crude analysis. The true coefficient for that variable was set to zero in the simulation, with the other coefficients at alternative-hypothesis values. As expected, the size with regression calibration (not shown) was always nominal at 0.05. The results demonstrate that a crude analysis can result in markedly supernominal type I error rates. Events are modeled as depending on FFQ data, but in fact they are conditional on true data (as they were simulated in table 4). When the true effect is null, a nonnull apparent effect, often of substantial magnitude, persists. To reiterate, this is due to the correlation between the nutrients as measured. Because of the distortions in type I error rates, power calculations based on crude estimates may be quite incorrect.
DISCUSSION
The reference data that we used to simulate "true" data in these analyses may not properly represent the actual errors between crude and (unobservable) true data. Thus, the calculated results may not represent true fat, calcium, fiber, and vitamin C intakes in this population, could they be measured. However, our purpose was to further explore some aspects of the regression calibration method, and we have used a variety of correlational structures for illustration.
A little-recognized fact is that a null true effect will generally not correspond to a null effect in a crude multivariate analysis, although this has been described by other investigators (8, 21, 22). A test of both dietary coefficients treated together would not have this problem, but separate tests are subject to residual confounding by the other dietary coefficient(s). This will result in either overestimates or underestimates of statistical significance, or even reversal of statistical significance between two variables (10). A larger study size will not change the biased nonzero "null" beta coefficients. In fact, the likelihood of spurious statistical significance is then especially great because of the smaller standard errors.
We have demonstrated that exposure measurement error can substantially reduce statistical power, even after bias reduction by regression calibration. This reduction in power is less when both food frequency variables have good validity and are not highly correlated. The effect of moderate collinearity between measured dietary variables in reducing power can be just as severe as that of relatively poor validity, a point also noted by Wong et al. (23). Many epidemiologists prefer confidence intervals over statistical significance; reduced power corresponds to greater standard errors of the coefficients, and hence, of course, to wider confidence intervals after bias correction.
Although a validation study size of 100–200 subjects has often been selected in practice (4, 23, 24), our results show that this may be inadequate when multivariate calibration is the goal. A validation study that produces an acceptably precise validity correlation may be suboptimal for regression calibration where there is collinearity between the variables. In a large cohort, the cost of increasing a calibration study from 200 individuals to 1,000 may be only a modest proportion of the total budget, and it can lead to important gains in statistical power. Other researchers have considered the optimal size of a validation study for the purpose of calibration, and the number of recalls/diaries needed per person when costs are fixed (25, 26), but not in the context of multivariate analyses.
It is of interest that regression calibration with approximately uncorrelated variables and with a relatively large calibration study does not lose much power as compared with an uncorrected analysis using FFQ data only. This is similar to the result for a univariate regression calibration analysis. However, when variables in a regression calibration analysis are importantly correlated, this has a "double" effect, increasing the standard errors of both the calibration β's and the disease regression β's.
Of course, misspecification of the calibration model would be a serious problem and might lead to new biases in both beta coefficients and their standard errors. Many nutritional variables are not normally distributed. Hence, study of the goodness of fit of a linear regression for the calibration model, the use of transformations where necessary, and a search for influential outliers are all mandatory.
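As a concrete illustration of such a goodness-of-fit check (a sketch with an assumed lognormal "true" intake and multiplicative measurement error; the skewness statistic is computed directly rather than via a formal test), fitting the calibration line on the raw scale leaves markedly skewed residuals, while a log transformation restores an approximately linear model with symmetric residuals:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical right-skewed nutrient with multiplicative reporting error.
n = 50_000
x_t = rng.lognormal(0.0, 0.5, n)            # skewed "true" intake
x_ffq = x_t * rng.lognormal(0.0, 0.3, n)    # multiplicative error

def skew(v):
    """Sample skewness (third standardized central moment)."""
    d = v - v.mean()
    return (d**3).mean() / (d**2).mean() ** 1.5

# Calibration residuals on the raw scale.
b = np.cov(x_t, x_ffq)[0, 1] / np.var(x_ffq)
resid_raw = x_t - (x_t.mean() + b * (x_ffq - x_ffq.mean()))

# Calibration residuals after log transformation of both variables.
lt, lf = np.log(x_t), np.log(x_ffq)
bl = np.cov(lt, lf)[0, 1] / np.var(lf)
resid_log = lt - (lt.mean() + bl * (lf - lf.mean()))

# Raw-scale residuals are noticeably right-skewed; log-scale near symmetric.
print(round(skew(resid_raw), 2), round(skew(resid_log), 2))
```

In practice such diagnostics would be run on the actual calibration-study data, together with an inspection of residual plots for influential outliers.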
Although in practice we cannot observe true dietary values for individuals, if a reference method can be assumed to follow the classical error model Xref,i = XT,i + εi, with E(εi) = 0 for all i, then calibrating to this reference method will also give unbiased estimates of βd (27). This is because, in such a situation, the calibration β's, βc, will be unbiased estimates of those produced when calibrating to XT and so will produce unbiased estimates of E(XT | XFFQ, Z) for regression calibration. Expressed a little differently, in this situation, errors in estimating XT from Xref are independent both of errors in estimating XT using XFFQ and of the outcome Y, conditional on XT (27).
In conclusion, if a reference method is found such that the classical error model can reasonably be assumed, regression calibration produces unbiased estimates of the true β's and thereby becomes a very useful tool. It is clear that the crude regression using only FFQ data may produce markedly biased estimates due to residual confounding, and spurious statistical significance when testing the null hypothesis in models with two or more correlated nutritional variables. The cost of measurement errors is a loss of statistical power in a corrected analysis, which may be quite extreme in multivariate analyses if the variables are highly correlated. Where correlations are modest, our examples show that important gains in power can accrue with calibration studies of up to 1,000 subjects.
ACKNOWLEDGMENTS
NOTES
REFERENCES