Bayesian Projections: What Are the Effects of Excluding Data from Younger Age Groups?

A. Baker1 and I. Bray2

1 School of Mathematics and Statistics, University of Plymouth, Plymouth, United Kingdom
2 Department of Social Medicine, University of Bristol, Bristol, United Kingdom

Correspondence to Dr. Isabelle Bray, Department of Social Medicine, University of Bristol, Canynge Hall, Whiteladies Road, Bristol BS8 2PR, United Kingdom (e-mail: Issy.Bray@bristol.ac.uk).

Received for publication September 23, 2004. Accepted for publication May 18, 2005.


    ABSTRACT
 
Bayesian age-period-cohort models are used increasingly to project cancer incidence and mortality rates. Data for younger age groups for which rates are low are often discarded from the analysis. The authors explored the effect of excluding these data, in terms of the precision and accuracy of projections, for selected cancer mortality data sets. Projections were made by using a generalized Bayesian age-period-cohort model. Smoothing was applied to each time scale to reduce random variation between adjacent parameter estimates. The sum of squared standardized residuals was used to assess the accuracy of projections, and 90% credible intervals were calculated to assess precision. For the data sets considered, inclusion of all age groups in the analysis provided more precise age-standardized and age-specific projections as well as more accurate age-specific projections for younger age groups. An overall improvement in the accuracy of age-standardized rates was demonstrated for males but not females, which may suggest that analysis of the full data set is beneficial when projecting cancer rates with strong cohort effects.

Bayesian analysis; mortality; neoplasms; prediction; projection


Abbreviations: APC, age-period-cohort; SSRs, squared standardized residuals


    INTRODUCTION
 
Much work has been undertaken in recent decades with the aim of producing projections of future cancer incidence and mortality rates from observed rates by using age-period-cohort (APC) models. This area of study is of particular importance with regard to estimating current cancer rates, because of the delay in compiling and publishing these data, and to public health planning for the future. Although age-standardized projections are a convenient summary measure of cancer incidence or mortality, facilitating comparison between countries and over time, they may mask important differences between age groups (1). Age-specific projections are therefore also of interest to epidemiologists and those planning future health services (2).

Classical APC models have been widely used to make cancer projections (3–5). In recent years, a Bayesian APC approach to projecting cancer rates has been implemented, which incorporates prior information about smoothness on each time scale to reduce random variation and improve the precision of the projections. These models are described in detail elsewhere (6–8). Cancer incidence and mortality data are typically given as counts stratified into 5-year age groups and 5-year calendar periods. There is little published evidence to suggest from which age group the analysis should begin (9). This paper examines the issue of discarding data from young age groups from a Bayesian APC analysis and the effect it has on the resulting projections.

In classical analyses, zero counts in the observed data may lead to problems in model fitting, such as unstable parameter estimates. This problem has been discussed in terms of generalized linear models (10) and specifically APC models (6). In some analyses, it has been assumed that younger age groups in which very few events are observed can be excluded from Bayesian APC analyses with little effect on the projections of interest (9, 11, 12). However, since zero counts do not pose implementational problems when fitting APC models in the Bayesian framework, others (7, 8, 13–15) have included all data for the youngest age groups (e.g., age 0–4 years).

It has already been suggested that early observations may strongly influence the width of the credible intervals around the projections (6). It has also been reported that when primarily zero counts are given for younger age groups, then discarding these data leads to projections similar to those based on all age groups, but that differences in results are found when deaths are commonly observed among younger age groups (6). In an analysis of lung cancer mortality, projections for recent birth cohorts differed markedly depending on whether data for younger age groups were included (16). We postulate that, even when observed rates in the younger age groups are very low, where a strong cohort effect exists (such as is typically observed in smoking-related cancers), data from younger age groups will be important to improve the accuracy of projections. The purpose of this paper is to explore the effect on projections of excluding young age groups displaying negligible mortality rates, in terms of both accuracy and precision, in the application of Bayesian APC models.


    MATERIALS AND METHODS
 
Contrary to the trend in North America and western Europe, smoking-related cancers are currently increasing in some countries in central Europe, with a particularly high smoking prevalence and tobacco-related mortality observed in Hungary (17). We considered three smoking-related cancer sites: larynx, esophagus, and oral cavity (including pharynx). Data for Hungary were extracted from the World Health Organization mortality database for 15 age groups (0–4, ..., 70–74 years) and for six calendar periods from 1965–1969 to 1990–1994. In a previous analysis based on age 30–74 years only (12), projections were made for a further three calendar periods (i.e., 15 years) up to 2005–2009. Here, we compare 15-year projections based on data for all ages (0–74 years) with those based on ages 30–74 years.

Projections were made by using a generalized Bayesian APC model (6–8), implemented in the BUGS software program (18). The model describes mortality (or incidence) rates in terms of additive effects for age, period, and cohort by using a log-link function. Period effects are those that apply to an entire population at a certain point in time, while cohort effects are those that affect a group of people born around the same time. The Bayesian formulation allows information to flow from observed into projected periods (11). The program and further statistical details are provided in the Appendix.

For our initial analysis, projections were made for each of the cancer sites by using all of the age groups available (0–4, ..., 70–74 years). Then, the analyses were repeated with the younger age groups (0–4, ..., 25–29 years) excluded. Projected mortality rates based on the full and reduced data sets were compared by means of age-specific plots. The fit of the projections to rates observed in 1995–1999 was summarized across all age groups by calculating the sum of squared standardized residuals (SSRs), and age-standardized rates were calculated by using the world standard population (19).
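
As an illustration of the age-standardization step, the following Python sketch computes a directly standardized rate as a weighted average of age-specific rates. It assumes that the weights are the Segi world standard population truncated to ages 0–74 years; the function name and the example rates are hypothetical and are not taken from the data analyzed here.

```python
# Segi world standard population (per 100,000) for the 15 five-year
# age groups 0-4, ..., 70-74 years (ASSUMED to be the standard used).
WORLD_STANDARD_0_74 = [
    12000, 10000, 9000, 9000, 8000, 8000,  # 0-4 ... 25-29
    6000, 6000, 6000, 6000, 5000,          # 30-34 ... 50-54
    4000, 4000, 3000, 2000,                # 55-59 ... 70-74
]


def age_standardized_rate(age_specific_rates, weights):
    """Directly standardized rate: weighted mean of age-specific rates."""
    if len(age_specific_rates) != len(weights):
        raise ValueError("rates and weights must have the same length")
    total_weight = sum(weights)
    weighted_sum = sum(r * w for r, w in zip(age_specific_rates, weights))
    return weighted_sum / total_weight


# Hypothetical rates per 100,000 person-years (illustrative only):
# zero below age 30, then rising with age.
example_rates = [0.0] * 6 + [0.5, 1.0, 3.0, 6.0, 10.0, 14.0, 18.0, 20.0, 22.0]
print(round(age_standardized_rate(example_rates, WORLD_STANDARD_0_74), 3))
```

Because the weights for ages under 30 years are large, near-zero rates in those groups still pull the standardized rate down, even though they contribute few deaths.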


    RESULTS
 
Both age-specific and age-standardized projected rates were compared by using the full and the reduced data sets. An example of an age-specific plot (larynx cancer mortality for males aged 30–34 years) is shown in figure 1. The most striking feature is the increased width of the 90 percent credible intervals based on the reduced compared with the full data set, particularly for longer term (i.e., 15-year) projections. Plots for the other data sets considered were similar, and, in each case, age-specific projections for ages 30–34 years were more accurate when based on the full data set. In progressively older age groups, the differences became less apparent, and the fitted and projected rates for the two sets of results tended toward each other.



 
FIGURE 1. Results for larynx cancer mortality in males aged 30–34 years in Hungary. Solid lines, posterior estimates of fitted rates and projections with 90% credible intervals based on the full data set (ages 0–74 years); dashed lines, posterior estimates of fitted rates and projections with 90% credible intervals based on the analysis when younger age groups were excluded (ages 30–74 years); points, observed rates; vertical line, the end of the observation period and the beginning of the projections.

 
From the age-standardized mortality rates given in table 1 we see that the fitted and projected rates were similar when the full and the reduced data sets were used. Although there was no systematic difference in estimates based on the two models for 5- and 10-year projections, 15-year projections tended to be lower when the full data sets were analyzed. As expected, the credible intervals for the three projected periods were generally narrower when all of the age groups were included in the analyses, and differences in the width of the credible interval became greater the longer term the projections were. Thus, there may be little or no difference in credible interval width when projecting one 5-year period; however, when projecting 15 years into the future, the credible intervals based on all age groups are distinctly narrower.


 
TABLE 1. Posterior estimates of fitted age-standardized mortality rates and 15-year projections* per 100,000 for selected cancers in Hungary

 
The accuracy of projections for the subsequently observed period 1995–1999 was assessed by summing SSRs across age groups (table 2). These results confirmed that, as noted above, the fit of the 5-year projections was similar when the full and the reduced sets of data were used. Overall, the full data sets led to projections closer to the observed values for two of the three cancer sites (larynx and oral cavity). When the SSRs for males and females were summed across cancer sites, the full data sets produced the most accurate projections for males, while the opposite was true for females.
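
The accuracy measure can be sketched in a few lines of Python. The sketch assumes a Pearson-type standardization for Poisson counts, (observed − projected) / sqrt(projected); the exact definition of the standardized residual is not given in the text, so this form is an assumption, and the counts below are hypothetical.

```python
def sum_squared_standardized_residuals(observed, projected):
    """Sum over age groups of ((observed - projected) / sqrt(projected))**2.

    ASSUMPTION: Pearson-type standardization for Poisson counts; the
    paper's exact definition is not stated in the text.
    """
    if len(observed) != len(projected):
        raise ValueError("observed and projected must have the same length")
    total = 0.0
    for y, mu in zip(observed, projected):
        if mu <= 0:
            raise ValueError("projected counts must be positive")
        total += (y - mu) ** 2 / mu
    return total


# Hypothetical observed deaths by age group and two sets of projections.
observed = [2, 5, 12, 30, 41]
projected_full = [2.5, 5.5, 11.0, 28.0, 45.0]
projected_reduced = [4.0, 7.0, 11.5, 28.5, 44.0]

ssr_full = sum_squared_standardized_residuals(observed, projected_full)
ssr_reduced = sum_squared_standardized_residuals(observed, projected_reduced)
print(f"full: {ssr_full:.3f}, reduced: {ssr_reduced:.3f}")
```

A smaller sum indicates projections closer to the observed values. Under this standardization, a single poorly fitted age group with a large count can dominate the total.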


 
TABLE 2. Results of summing squared standardized residuals for 5-year projections (1995–1999) based on data from 1965–1994 using reduced and full mortality data sets for selected cancers in Hungary

 

    DISCUSSION
 
We compared projections of cancer mortality based on a Bayesian APC model with and without data from the youngest age groups included in the analysis. We used the sum of SSRs to assess the accuracy of 5-year projections and used 90 percent credible intervals to assess the precision of projections for up to 15 years based on data from a 30-year period. Our results suggest that including all age groups in an analysis reduces the width of credible intervals for age-specific and age-standardized rates (i.e., produces projections of greater precision) and that this benefit increases with length of projection. On the basis of our findings, age-specific projections based on full data sets are more accurate for the younger age groups. Evidence for an improvement in accuracy is less consistent in age-standardized projections because the contribution to the sum of SSRs from the younger age groups is dominated by larger residuals in the older age groups. In the examples considered here, using the full data sets was found to improve accuracy of age-standardized projections for males but not for females. Our results suggest that including younger age groups will improve projections if the data contain information about important cohort effects, such as were observed in these data sets for males (12). In other cases, it could be argued that including childhood cases in cancer projection analyses would result in less accurate projections because of differences in childhood and adult etiology, for example, acute lymphoblastic leukemia (20).

There is still much work to be done in determining the best modeling approach in different situations. For example, further methodological work could develop optimal methods of making short- and long-term projections. While short-term projections are important for allocation of resources in the immediate future, long-term projections play a key role in estimating the burden of cancer in the coming decades (21) and prompting governments to take action to prevent predicted increases in morbidity and mortality.

We explored whether the magnitude of the excluded rates dictates whether their exclusion will affect projections, as suggested previously (6). We tabulated both the total number of cases excluded from the analyses and the rate for the last observed calendar period for the eldest excluded age group (these tables are available on request from the authors), but we were unable to identify any associations between these factors and our observations on the accuracy and precision of projections based on full and reduced data sets.

On the basis of our results and conclusions, we have several recommendations for further work. Firstly, when we compared results from the analysis of the reduced data sets for Hungary with those previously published (12), we noticed some small discrepancies in both fitted and projected rates. The maximum percentage difference observed was 3 percent (laryngeal cancer in females, 1965–1969); all other differences were of the order of 1 percent or less. This discrepancy is probably due to the process of sampling inherent in the methodology but should nevertheless be overcome by using sufficiently large numbers of iterations. This finding indicates that there is scope for improving convergence and mixing in applying these models.

Secondly, 15-year projections were consistently lower when the full data set was used (with the exception of cancer of the esophagus in males). Therefore, the central European data (12) should be reanalyzed, including all age groups and data for the period 1995–1999.

Finally, there is a similar ambiguity with regard to which older age groups should be included in the analysis. Since death certification in the very old is often unreliable (22), it is reasonable to exclude age groups above a certain point. However, as with the young age groups, the cutoff point varies between analyses. In the Bayesian literature, some authors have used an open category of 85 years or older (6, 7, 9, 16) as the highest age group. Others have used an upper limit of 74 years (8, 12, 14, 15) or 80 years (11). Since, in these models, the highest age groups feed information into neighboring age groups, the discarded data for older age groups may influence projected rates. This area warrants further investigation.

In conclusion, our results suggest that using all available data may improve Bayesian APC projections when there are strong cohort effects and when etiology is similar across all ages.


    APPENDIX
 
For each age group a (a = 1, 2, ..., A) and period p (p = 1, 2, ..., P), we have the number of deaths ($y_{ap}$) and the number of person-years ($n_{ap}$). We assume that the number of deaths is Poisson distributed, $y_{ap} \sim \mathrm{Poisson}(\mu_{ap})$, and that

$$\log(\mu_{ap}) = \log(n_{ap}) + \alpha_a + \beta_p + \gamma_c,$$

where the cohort c (c = 1, 2, ..., C) is the approximate year of birth. Note that problems with identifying individual parameters in the model, caused by the linear association between age, period, and cohort effects, do not affect projected rates. To improve numeric stability and mixing, the linear trend of the age effects $\alpha_a$ is constrained to zero, and the period effects $\beta_p$ and cohort effects $\gamma_c$ are made to sum to zero by subtracting their means.
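
The structure of the linear predictor can be sketched in Python as follows. The sketch assumes the usual indexing convention for equal-width age groups and periods, c = A − a + p, so that C = A + P − 1 (this convention is not stated in the text); the function names and the effect values are hypothetical.

```python
import math


def cohort_index(a, p, n_age):
    """1-based cohort index for age group a and period p (both 1-based).

    ASSUMPTION: the usual convention c = A - a + p for equal-width age
    groups and periods, so the oldest group in the first period has c = 1.
    """
    return n_age - a + p


def expected_deaths(person_years, alpha, beta, gamma, a, p):
    """Poisson mean: person-years * exp(age + period + cohort effect)."""
    c = cohort_index(a, p, len(alpha))
    log_rate = alpha[a - 1] + beta[p - 1] + gamma[c - 1]
    return person_years * math.exp(log_rate)


# Hypothetical effects for A = 3 age groups and P = 2 periods (C = 4 cohorts).
alpha = [-9.0, -8.0, -7.0]
beta = [-0.1, 0.1]
gamma = [0.2, 0.0, -0.1, -0.1]
print(expected_deaths(100_000, alpha, beta, gamma, a=3, p=1))

# With 15 age groups and 6 observed periods, this indexing gives 20 cohorts.
print(cohort_index(1, 6, 15))
```

Under this indexing, the most recent cohorts are observed only in the youngest age groups, which is one way to see why discarding those groups removes information about recent cohort effects.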

An autoregressive prior smoothing model is applied to each time scale to reduce variation between adjacent parameter estimates. We define prior distributions for age, period, and cohort effects that smooth each point on the two preceding points (random walk 2). The prior distribution for the age effects is

$$\alpha_a \mid \alpha_{a-1}, \alpha_{a-2} \sim N\left(2\alpha_{a-1} - \alpha_{a-2},\ \tau_\alpha^{-1}\right), \quad a = 3, \ldots, A.$$

Prior distributions for period and cohort effects are similar. The hyperparameters $\tau_\alpha$, $\tau_\beta$, and $\tau_\gamma$, which control the degree of smoothing on each time scale, are given noninformative gamma(0.001, 0.001) priors.
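
To make the smoothing prior concrete, the Python sketch below evaluates a second-order random walk log density, conditional on the first two effects. It assumes that each effect from the third onward is normally distributed around the linear extrapolation of the two preceding effects, with the smoothing hyperparameter treated as a precision (inverse variance); that parameterization, the function names, and the example values are assumptions made for illustration.

```python
import math


def second_differences(effects):
    """Return e[i] - 2*e[i-1] + e[i-2] for the third effect onward."""
    return [effects[i] - 2 * effects[i - 1] + effects[i - 2]
            for i in range(2, len(effects))]


def rw2_log_prior(effects, precision):
    """Log density of a random walk 2 prior, given the first two effects.

    ASSUMPTION: effect[i] ~ Normal(2*effect[i-1] - effect[i-2], 1/precision).
    """
    diffs = second_differences(effects)
    n = len(diffs)
    return (0.5 * n * math.log(precision / (2 * math.pi))
            - 0.5 * precision * sum(d * d for d in diffs))


linear = [0.0, 0.5, 1.0, 1.5, 2.0]  # straight line: second differences are zero
jagged = [0.0, 0.5, 0.2, 1.5, 0.9]  # erratic sequence: heavily penalized
print(rw2_log_prior(linear, precision=10.0))
print(rw2_log_prior(jagged, precision=10.0))
```

A prior of this form penalizes departures from a straight line rather than the slope itself, so a linear trend in the effects is left unpenalized and carries forward into projected periods and cohorts.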

This model is implemented in BUGS, and the program is given below. Fitted and projected rates are estimated from samples of 10,000 drawn from the posterior distribution after excluding the first 1,000 iterations as "burn-in." The samples are summarized by their median and 90 percent credible interval, calculated from the 5th and 95th percentiles.
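
The posterior summary described above can be sketched in Python as follows. The chain here is simulated from an arbitrary distribution purely as a stand-in for sampler output; the function name, the seed, and the simulated values are hypothetical, and the percentile interpolation rule is an assumption, since the text does not specify one.

```python
import random
import statistics


def summarize_posterior(draws, burn_in=1000):
    """Median and 90% credible interval (5th, 95th percentiles) after burn-in."""
    kept = draws[burn_in:]
    # quantiles(n=20) returns 19 cut points in steps of 5 percent:
    # index 0 is the 5th percentile and index 18 is the 95th percentile.
    cuts = statistics.quantiles(kept, n=20, method="inclusive")
    return statistics.median(kept), cuts[0], cuts[18]


# Stand-in chain: 1,000 burn-in iterations followed by 10,000 retained draws.
rng = random.Random(1)
chain = [rng.lognormvariate(1.0, 0.2) for _ in range(11_000)]
median, lower, upper = summarize_posterior(chain)
print(f"{median:.2f} ({lower:.2f}, {upper:.2f})")
```

The width of the interval, `upper - lower`, is the quantity compared between the full and reduced data sets when precision is assessed.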

The BUGS program follows: [program listing presented as three image panels in the original; not recoverable here]


    ACKNOWLEDGMENTS
 
Conflict of interest: none declared.


    References
 

  1. dos Santos Silva I. Cancer epidemiology: principles and methods. Lyon, France: International Agency for Research on Cancer, 1999.
  2. Etzioni DA, Liu JH, Maggard MA, et al. Workload projections for surgical oncology: will we need more surgeons? Ann Surg Oncol 2003;10:1112–17.
  3. Osmond C. Using age, period and cohort models to estimate future mortality rates. Int J Epidemiol 1985;14:124–9.
  4. Negri E, La Vecchia C, Levi F, et al. The application of age, period and cohort models to predict Swiss cancer mortality. J Cancer Res Clin Oncol 1990;116:207–14.
  5. Quinn MJ, d'Onofrio A, Moller B, et al. Cancer mortality trends in the EU and acceding countries up to 2015. Ann Oncol 2003;14:1148–52.
  6. Bashir SA, Estève J. Projecting cancer incidence and mortality using Bayesian age-period-cohort models. J Epidemiol Biostat 2001;6:287–96.
  7. Knorr-Held L, Rainer E. Projections of lung cancer mortality in West Germany: a case study in Bayesian prediction. Biostatistics 2001;2:109–29.
  8. Bray I. Application of Markov chain Monte Carlo methods to projecting cancer incidence and mortality. Appl Stat 2002;51:151–64.
  9. Hodgen E. Cancer forecasting in New Zealand. Unpublished master's thesis. University of Wellington, Wellington, New Zealand, 2003.
  10. McCullagh P, Nelder JA. Generalised linear models. 2nd ed. London, United Kingdom: Chapman & Hall, 1989:117.
  11. Berzuini C, Clayton D. Bayesian analysis of survival on multiple time scales. Stat Med 1994;13:823–38.
  12. Bray I, Brennan P, Boffetta P. Projections of alcohol- and tobacco-related cancer mortality in central Europe. Int J Cancer 2000;87:122–8.
  13. Verdecchia A, De Angelis G, Capocaccia R. Estimation and projections of cancer prevalence from cancer registry data. Stat Med 2002;21:3511–26.
  14. Brennan P, Bray I. Recent trends and future directions for lung cancer mortality in Europe. Br J Cancer 2002;87:43–8.
  15. Bray I, Brennan P, Boffetta P. Recent trends and future projections of lymphoid neoplasms: a Bayesian age-period-cohort analysis. Cancer Causes Control 2001;12:813–20.
  16. Kaneko S, Ishikawa KB, Yoshimi I, et al. Projection of lung cancer mortality in Japan. Cancer Sci 2003;94:919–23.
  17. Banoczy J, Squier C. Smoking and disease. Eur J Dental Educ 2004;8:7–10.
  18. Spiegelhalter DJ, Thomas A, Best NG, et al. BUGS: Bayesian Inference Using Gibbs Sampling. Version 0.5 (version ii). Cambridge, United Kingdom: MRC Biostatistics Unit, 1996.
  19. Plummer M. Age standardisation. In: Parkin DM, Whelan SL, Ferlay J, et al, eds. Cancer incidence in five continents. Vol VII. Lyon, France: International Agency for Research on Cancer, 1987. (IARC scientific publication no. 143).
  20. Plasschaert SL, Kamps WA, Vellenga E, et al. Prognosis in childhood and adult acute lymphoblastic leukaemia: a question of maturation? Cancer Treat Rev 2004;30:37–51.
  21. Boyle P. Global burden of cancer. Lancet 1997;349:23–6.
  22. Doll R, Peto R. The causes of cancer: quantitative estimates of avoidable risks of cancer in the United States today. J Natl Cancer Inst 1981;66:1193–308.



