1 Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD.
2 Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD.
Received for publication May 6, 2002; accepted for publication June 25, 2002.
![]() |
ABSTRACT |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
air pollution; algorithms; backfitting; generalized additive models; time series
Abbreviations: Abbreviations: GAM, generalized additive model(s); GLM, generalized linear model(s); NMMAPS, National Morbidity, Mortality, and Air Pollution Study; PM10, particulate matter <10 µm in diameter.
![]() |
INTRODUCTION |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
GAM extends traditional generalized linear models (GLM) (16) by replacing linear predictors of the form =
j ßjxj with
=
j fj(xj), where fj(xj) are unspecified nonparametric functions. Methods for estimating fj include smoothing splines (17) or LOESS smoothers (1820). GLMs with regression splines (to which we refer here as the fully parametric alternative of the GAM with nonparametric smoothers) commonly define fj to be regression splines, such as natural cubic splines or B-splines with a prespecified number of knots at known locations (21, 22).
Estimation in GAM is based on a combination of the local scoring algorithm (1) and the backfitting algorithm (23, 24). The local scoring algorithm is a generalization of the Fisher scoring procedure for finding maximum likelihood estimates in GLM (16, 25). The backfitting algorithm is suitable for fitting any additive model, and in GAM it is used within the local scoring iteration when several smooth functions are included in the model.
Unlike linear regression models, which are fitted by using weighted least squares and have an exact solution, the estimation procedure for a GAM (or a GLM) requires iterative approximations in order to find the optimal estimates. More specifically, the backfitting procedure in GAM with smoothing splines maximizes a penalized log likelihood defined as lp(, y) = l(
, y) + P, where y is the vector of the observations, l(
, y) is the likelihood function of the linear predictor
, and P is a quadratic penalty term used to account for smoothness (2, 24). This is equivalent to maximizing a posterior distribution in a Bayesian analysis using a prior that favors a smoother relation. From a frequentist standpoint, this may result in improved mean squared error, but it may also result in bias, the extent of which depends on P (26). Asymptotic bias and variance properties of backfitting estimators have been explored by Opsomer and Rupper (27) and Opsomer (28).
Convergence of the local scoring algorithm is controlled by two user-defined parameters: 1) , which controls the convergence precision, and 2) M, which controls the maximum number of iterations allowed. Convergence of the backfitting algorithm is controlled by two additional user-defined parameters: 1)
bf, which controls the convergence precision, and 2) Mbf, which controls the maximum number of iterations to be used in backfitting.
In this paper, we discuss the use of GAM for estimating relative rates of mortality/morbidity associated with exposure to air pollution in time-series analyses of air pollution and health. Our simulation studies show that when the relative rates to be estimated are small and at least two nonparametric smooth functions are included in the model (smoothing splines or LOESS smoothers), as is often done in time-series studies of air pollution and mortality, the default convergence parameters in the S-Plus statistical function gam may be too lax to assure convergence of the backfitting algorithm and may lead to biased estimates.
Below we provide details on a simulation study evaluating the impact of default implementation of the gam function in S-Plus software (Insightful Corporation, Seattle, Washington) on published analyses. We next reanalyze the NMMAPS data using three different methods: 1) Poisson regression with natural cubic splines to achieve nonlinear adjustments for confounding factors; 2) GAM with smoothing splines and default convergence parameters; and 3) GAM with smoothing splines and more stringent convergence parameters than the default settings. We then discuss advantages and disadvantages of using GAM as compared with GLM. Implementation of the GAM estimation procedure for air pollution time-series data is detailed in the Appendix.
![]() |
SIMULATION STUDY FOR CITY-SPECIFIC ANALYSES |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Yt Poisson (µt)
log µt = + ßPM10t + s1(temperature, 6) + s2(time, 7/year) +
Idow, (1)
where Yt denotes the daily number of deaths among people older than 75 years, ß denotes the log relative rate of mortality associated with a 10-unit increase in particulate matter less than 10 µm in diameter (PM10), s1(temperature, 6) and s2(time, 7/year) are smooth functions (smoothing splines) of temperature and calendar time designed to control for trend, seasonality, and weather, and Idow are indicator variables for days of the week. This is a simplified version of the GAM originally used in the published NMMAPS analyses (4, 7), with only two smooth functions and no overdispersion. In the simulation, the true ß was set to a 0.51 percent increase in mortality per 10-µg/m3 increase in PM10, the value estimated from the actual Pittsburgh data.
We then simulated 1,000 mortality time series from a Poisson distribution with mean equal to the predictive values ( t) from model 1, also shown in figure 1 (top left). Each of the simulated data sets was analyzed using three methods: 1) gam with smoothing splines s and default parameters (gam + s + default); 2) gam with smoothing splines s and more stringent convergence parameters (gam + s); and 3) Poisson regression with natural cubic splines ns, with equally spaced knots and more stringent convergence parameters (glm + ns). The default settings in version 3.4 of the S-Plus software package and our suggested, more stringent parameter values for the S-Plus function gam convergence control are summarized in table 1.
|
|
|
log µt = + ßPM10t + lo(temperature, 0.024) + lo(time, 0.4) +
Idow. (2)
In model 2, we chose the spans in the LOESS smoothers to obtain a correlation of 0.99 between fitted values of model 1 and model 2. The estimated ß was a 0.53 percent increase in mortality per 10-µg/m3 increase in PM10. We then simulated 1,000 mortality time series from a Poisson distribution with mean equal to the predictive values ( t) from model 2, also shown in figure 1 (top right). Each of the simulated data sets was analyzed using three methods: 1) gam with LOESS smoother lo and default parameters (gam + lo + default); 2) gam with LOESS smoother lo and more stringent convergence parameters (gam + lo); and 3) Poisson regression with natural cubic splines ns (glm + ns) and more stringent convergence parameters.
In the bottom panels of figure 2, we plot the 1,000 estimates of ß obtained under glm + ns (x-axis) versus the 1,000 estimates of ß obtained under gam + lo + default and versus the 1,000 estimates of ß obtained under gam + lo. The horizontal and vertical lines are placed at the true values (0.53). Smoothing splines and LOESS smoothers show similar bias in the relative rate estimates.
To further investigate the relations among bias, size of the true coefficient, and amount of control from trend and seasonality in models 1 and 2, we repeated the simulation study by lowering the number of degrees of freedom in the smoothing splines from 7 to 3 per year or, equivalently, by increasing the span in the LOESS smoother from 0.024 to 0.067. The fitted values of models 1 and 2 are shown in figure 1 (bottom left and right). Note that with less control for trend and seasonality, the estimated ß for Pittsburgh becomes larger (i.e., 0.73 percent vs. 0.51 percent). Table 2 summarizes the results of our simulations. The first column indicates the nonparametric smoother included in GAM models 1 and 2; the second column represents the true log relative rate (ß); and the third column summarizes the percent bias of gam + s + default, gam + s, and glm + ns with respect to the true ß. Even when the data are generated from a GAM with smoothing splines or LOESS smoothers, gam + s + default and gam + lo + default produce a bias of 3642 percent. The bias is not negligible (718 percent) when more stringent convergence parameters are used and when a smaller number of degrees of freedom (or a larger span in the LOESS smoother) is used in the smooth function of time to adjust for trend and seasonality.
|
|
![]() |
NMMAPS REANALYSES |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
Figure 4, bottom left, shows the standard errors of 90 city-specific estimates at lag 1 obtained with glm + ns (x-axis) versus gam + s + defaults (y-axis). Figure 4, bottom right, shows the standard errors of 90 city-specific estimates at lag 1 obtained with glm + ns (x-axis) versus gam + s (y-axis). GAM consistently gives smaller estimated standard errors than glm + ns. The estimated standard errors in the two models are roughly proportional to each other. This is consistent with recent work by Ramsay et al. (29), who showed that inability of the GAM to detect concurvity can lead to underestimation of the standard errors of relative rate estimates. Standard errors are underestimated even if more stringent convergence parameters are used (figure 4, bottom right).
Figure 5 compares pooled estimates across 90 cities under the three regression methods. The 90 city-specific estimates are pooled across cities under 1) a fixed-effect model, 2) a random-effect model with a moment estimator of the across-city variance (30), and 3) a Bayesian two-stage model with a noninformative prior for the across-city variance and a Markov chain Monte Carlo estimation procedure (31), as in the papers by Dominici et al. (7, 13). Relative to glm + , we observe that 1) gam + s + default gives a larger pooled estimate and 2) gam + s with restricted convergence criteria gives a similar but slightly larger pooled estimate. The pooled estimates show little sensitivity to the specific pooling procedure used. The Markov chain Monte Carlo method provides slightly more conservative confidence intervals.
|
![]() |
SIMULATION STUDY FOR POOLED ANALYSES |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
For each city, we obtained the maximum likelihood estimate of the linear predictor
using a fully parametric version of the GAM of Dominici et al. (7, 13). We then pooled the 90 city-specific estimates ( ) using a two-stage hierarchical normal model with a noninformative prior on the across-city variance. The pooled estimate under this model was 0.21 percent per 10-µg/m3 increase in PM10 (posterior standard error 0.06), as reported above.
For each city, we then generated 100 mortality time series from an overdispersed Poisson distribution with mean equal to the city-specific fitted values
. We analyzed each city-specific simulated data set with gam + s + default and glm + ns to obtain two sets of 90 x 100 relative rate estimates and their standard errors. We then pooled the city-specific estimates across cities under a random-effect model (30) using an estimate of the across-city variance obtained from the Markov chain Monte Carlo procedure.
Figure 6 shows histograms for the 100 pooled estimates obtained under glm + ns (left) and under gam + s + default (right). The vertical lines are placed at the true valuethat is, at the pooled estimate from the glm + ns. First, the distribution of the 100 pooled estimates under glm + ns is centered at the true pooled value (pooled estimate from glm + ns), indicating that the glm + ns model provides unbiased estimates not only of the city-specific effects but also of the pooled effects. Second, as in the simulation study for the city-specific analyses, the 100 pooled estimates under gam + s are all larger than the true value, showing a bias similar to that indicated by the simulation study for city-specific analyses.
|
![]() |
DISCUSSION |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
We observed that for time-series studies of air pollution and mortality, the bias is dependent on the nonparametric adjustment for confounding factors such as trend, seasonality, and weather. A smaller number of degrees of freedom in the smoothing spline or a larger span in LOESS leads to larger estimated relative rates, smaller concurvity, and therefore less bias from lack of convergence. For example, in our simulations for Pittsburgh, we used 7 degrees of freedom per year. If we change the number of degrees of freedom to 3, the bias drops from 36 percent to 22 percent. However, if we compare the relative bias for 3 versus 7 degrees of freedom for a fixed particle effect and concurvity, we find greater bias with fewer degrees of freedom.
In simulation studies using NMMAPS data for the 90 largest US cities and using city-specific regression models with smooth functions (smoothing splines) to adjust for nonlinear confounders as described elsewhere (13, 15), we found that pooled relative rate estimates obtained under GLMs with natural cubic splines and iteratively reweighted least squares better detect true relative rates than GAMs with smoothing splines and default convergence parameters for the local scoring and the backfitting algorithm.
Although GAM with nonparametric smoothers provides a more flexible approach for adjusting for nonlinear confounders compared with fully parametric alternatives in time-series studies of air pollution and health, we have found that the use and implementation of GAMs requires extreme caution. Specifically: 1) default convergence parameters need to be modified; 2) GAM optimizes a penalized likelihood that can itself lead to increased bias in pollution effect estimates in exchange for decreased variance; and 3) even when the convergence of the backfitting algorithm is guaranteed, failure of the GAM to detect concurvity can also lead to underestimation of the standard error of relative rate estimates (29). The NMMAPS reanalysis empirically confirmed the theoretical results of Ramsey et al. (29). It showed that the degree of bias in the standard errors is proportional to the size of the standard errors and that this underestimation remains even when more stringent convergence parameters are used.
Imposing stricter convergence criteria altered the NMMAPS results quantitatively but not qualitatively. In the reanalysis, the pooled estimate across 90 cities at lag 1 moved from a 0.41 percent (posterior standard error 0.05) increase in total mortality for a 10-unit increase in PM10 to a 0.27 percent (posterior standard error 0.05) increase. When GLMs with natural cubic splines were used, the pooled estimate was 0.21 percent (posterior standard error 0.06). In every analysis, however, there was strong evidence for a positive association between acute exposure to PM10 and death, even with very conservative adjustments for trend, seasonality, and weather. The differences among these pooled time-series estimates are also small relative to the difference between the effects of acute exposures obtained from the time-series studies (like NMMAPS) and the effects of chronic exposures estimated from cohort studies (3336).
The findings of this analysis in no way diminish the utility of GAM and other nonparametric regression techniques in epidemiologic or other research. A significant advantage of nonparametric regression is that epidemiologists need not rely on difficult-to-verify assumptions about the functional form of the dependence of the outcome on individual risk factors or confounders. To guard against the problems identified here, convergence criteria must be made substantially more stringent by users and/or distributors of statistical software.
Use of default settings was standard practice in environmental epidemiology until we started investigating this issue. Researchers can guard against such problems by regularly assessing the sensitivity of their findings to the convergence criteria used. An excellent overview of software reliability for S-Plus, SAS, and SPSS is provided by McCullogh (37, 38).
As a result of the simulation studies described here, the default parameters in the gam function of S-Plus, version 6.1, have already been revised, with somewhat more stringent convergence parameters ( = 107, M = 30,
bf = 107, Mbf = 30) (39). This problem is likely to be shared by other software packages.
![]() |
ACKNOWLEDGMENTS |
---|
The authors thank Professor Trevor Hastie and Drs. Rick Burnett, Tim Ramsay, Thomas Lumley, Lianne Sheppard, David Smith, Rafael Irizarry, and Giovanni Parmigiani for their comments.
![]() |
APPENDIX |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Here we briefly review the local scoring algorithm used to fit a generalized additive model (GAM) and its implementation for air pollution time-series data. The local scoring algorithm (2, 24) is analogous to the use of iteratively reweighted least squares (16, 25) for solving likelihood and nonlinear regression equations. At each iteration, an adjusted dependent variable is formed and an additive regression model is fitted. The estimation procedure of the additive regression model depends on the number and nature of the smooth functions included in the GAM. When the smooth functions are parametric (e.g., regression splines), the additive regression model is fitted using weighted least squares, and the GAM is equivalent to a generalized linear model. When the smooth functions in the linear predictors are nonparametric (smoothing splines or LOESS), the additive regression model is fitted using the backfitting algorithm (40, 41). The backfitting algorithm cycles through the variables and estimates each coordinate function by smoothing partial residuals. If a single smooth function is included in the model, the backfitting algorithm provides a closed-form solution of the parameter estimates.
In most time-series analyses of air pollution and health that have used GAM, the models include several nonparametric smooth functions of calendar time and weather variables. More specifically, let yt be the daily air pollution counts and let x1t, x2t, and x3t be the daily time series of air pollution, calendar time, and temperature, respectively. A typical model assumes
yt Poisson(µt)
log µt = + ßx1t + s2(x2t,
2) + s3(x3t,
3) =
t, (3)
where ß denotes the log relative rate of mortality associated with a 10-µg/m3 change in air pollution levels and s2(x2t) and s3(x3t) are nonparametric smooth functions with degrees of freedom 2 and
3 modeled as smoothing splines or LOESS functions.
The local scoring algorithm for model 3 consists of the following steps:
Initialize: = ß = s2 = s3 = 0 and m = 0, where sj = (sj(xj1), ..., sj(xjT)), j = 2, 3.
Iterate: m = m + 1 < M (outer loop).
1. Form the adjusted dependent variable:
z = m1 + (y µm1)
/
µm1,
where
m1 =
m1 + ßm1x + s2m1 + s3m1 = log(µm1)
and z = (z1, ..., zT), x = (x1, ..., xT) (similarly for , µ, y).
2. Form the weights w = (µ/
m1)2 V1, where V is the variance of y at µm1.
3. Fit an additive model to z using the backfitting algorithm with weights w and estimate m, ßm, s2m, s3m, and
m, as follows:
3.1. Cycle j = 1, 2, 3 (inner loop).
3.2. Calculate residuals by removing the estimated functions or covariate effects of all of the other variables:
rt1 = yt s2m1 s3m1
rt2 = yt m1 ßm1x s3m1
rt3 = yt m1 ßm1x s2m1, t = 1, ..., T
3.3. Estimate the sjm by smoothing the residuals with respect to the next covariate:
2m(x1t) = smooth(r2|x2t)
3m(x1t) = smooth(r3|x3t)
The term "smooth(rj|xjt)" denotes a smoothing of the data (rj, xj) at the point xjt. The parameter estimates (m, ßm) are obtained by fitting weighted least squares on the data (r1, x1).
3.4. Compute the backfitting convergence criterion:
3.5. Stop when |RSSm RSSm1| < bf.
Compute the local scoring convergence criterion:
m = |(Dm1 Dm)/(Dm1 +
)|,
where Dm1 = D(y;m1) is the deviance for a fitted model
m (2).
Stop for m <
.
![]() |
NOTES |
---|
![]() |
REFERENCES |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|