1 Department of Statistics, Minas Gerais Federal University (UFMG), Belo Horizonte, Minas Gerais, Brazil
2 Spatial Statistics Laboratory (LESTE/UFMG) and Belo Horizonte Municipal Health Division (SMSA-BH), Belo Horizonte, Minas Gerais, Brazil
Correspondence: Dr R Assunção, Department of Statistics, Minas Gerais Federal University (UFMG), Av. Antonio Carlos 662M, Belo Horizonte CEP 31.270-901, Minas Gerais, Brazil. E-mail: assuncao{at}est.ufmg.br
Abstract
Methods Our method works with a multivariate vector of rates for different cancer sites in several areas, and it borrows information both across geographical areas and across cancer sites. We applied our method to data from a survey carried out in 18 Brazilian cities in São Paulo State in 1991. We estimated age- and sex-standardized incidence rates (indirect method) for the six most common cancers in men and women, and calculated 95% interval estimates for the incidence rates.
Results The usual indirect standardized incidence rates had very wide confidence intervals for many cancers and cities owing to the small expected numbers of cases. The use of the multivariate Bayesian method led to more precise estimates.
Conclusions More precise age-standardized cancer incidence rates can be calculated using data from other cancers. The method is conceptually simple, easy to perform, has low cost, and can substantially improve the estimation of cancer incidence and other vital rates.
Accepted 3 October 2003
Health authorities frequently require morbidity and mortality rates for different populations, such as rates by age, sex and ethnic group or rates for different areas. If the populations at risk are small, the disease is rare, or the observation period is short, the usual rates have little precision and the estimates are unstable. The extreme rate values are usually associated with the smaller populations and bear no relationship to the underlying risks.1 If many populations are under investigation, we can observe great variation in the rate estimates that poorly reflects genuine geographical heterogeneity of the underlying rates. Therefore, more efficient estimation is essential if accurate information is desired for surveillance and decision making.
One example of these problems appears in developing countries, which have great difficulties in creating and maintaining continuous cancer registries from which incidence rates can be regularly calculated. Occasionally, these countries make large and expensive efforts to collect incidence data for several cancers in one or more geographical locations. However, due to the short period of data collection, many of the estimated rates have large variances, becoming unreliable measures of risk for their respective populations.
To overcome this kind of difficulty, as well as for other practical purposes, Bayesian methods have been proposed in the literature.2–7 Among their main advantages, Bayesian methods allow for the incorporation of prior information, ease the computation of complex models, and permit the use of posterior probabilities to make inferences about the unknown parameters. The large literature on these methods is due to recent developments in Markov chain Monte Carlo (MCMC) methodology, which facilitates the implementation of Bayesian analysis of complex models and datasets.8
The main idea underlying the use of Bayesian methods to estimate incidence rates for several geographical areas is the recognition that other populations' data contain useful information for estimating a given population's cancer rate.9,10 The resulting Bayesian relative risk estimate for each local population is a form of shrinkage of the observed risk (i.e. the indirect standardized incidence ratio (SIR), observed/expected cases) towards the mean relative risk in the global population. The amount by which each SIR is shrunk towards the global mean is inversely related to the expected count in that local population. Since smaller populations tend to have smaller expected counts, the amount of shrinkage is usually larger for those populations where there is less confidence in the observed risk. Alternatively, the Bayesian estimates can be seen as adjustments for overdispersion with respect to a Poisson model for the counts. That is, in addition to known risk factors such as the age–sex structure, other unobserved local population characteristics can affect the expected counts, making the observed risk variation larger than that assumed in a Poisson model based solely on the covariates.10,11 The main consequence of using Bayesian estimators is that the relative risks of all populations are estimated with greater precision than with naive rate estimators, which is why Bayesian techniques are widely used for simultaneous estimation of related parameters.
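The shrinkage behaviour described above can be illustrated with a small numerical sketch. It assumes a conjugate Poisson-gamma model, which is simpler than the multivariate model of this paper and is used here only for illustration. The function name, the prior parameters `a` and `b`, and the counts are all hypothetical.

```python
def shrunken_relative_risk(cases, expected, a=2.0, b=2.0):
    """Posterior mean of a relative risk under an assumed Gamma(a, b) prior.

    Illustrative model (not the paper's): cases ~ Poisson(theta * expected),
    theta ~ Gamma(shape=a, rate=b), so theta | cases ~ Gamma(a + cases, b + expected).
    """
    sir = cases / expected                # observed/expected ratio
    prior_mean = a / b                    # global mean towards which the SIR is shrunk
    weight = expected / (expected + b)    # weight given to the observed SIR
    posterior_mean = weight * sir + (1.0 - weight) * prior_mean
    return sir, weight, posterior_mean


# Two hypothetical areas with the same SIR of 2.0 but different expected counts.
small_area = shrunken_relative_risk(cases=2, expected=1.0)
large_area = shrunken_relative_risk(cases=200, expected=100.0)
```

In this sketch the weight on the observed SIR grows with the expected count, so the small area is pulled much further towards the prior mean than the large one, which is the qualitative behaviour the paragraph describes.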
The Bayesian method has been intensely developed in the context of mapping cancer incidence rates. Besag et al. proposed a model that generated much interest.11 They imposed a plausible spatial relationship structure among the geographical areas in a map and modelled the relative risks with a spatial Markov distribution. As a consequence, the information on neighbouring areas was used to improve the estimation in a given area. Their model has been extended in several directions to deal, for example, with space and time simultaneously, with missing observations, with mismatched geographical data structures, and with errors in covariates.12–14
In these methods, only information from a single cancer site rate is used. In contrast with the large literature dealing with univariate rates, there has been little work done on methods for vectors of different cancer rates. There have been a few exceptions such as the empirical Bayes method proposed by Longford15 and the fertility rates pattern in small areas proposed by Assunção et al.16 Additional work in this direction has been the search for common factors present in more than one disease as in Knorr-Held and Best,17 Kim et al.,18 and Wang and Wall.19
In this paper, we present a Bayesian method to estimate the incidence rates of multiple cancers simultaneously in a given population. The main idea of the method is to exploit the correlation between the rates of different cancers, borrowing information from other cancers in order to estimate a given cancer rate. Our method works with a multivariate vector of rates for different cancer sites in several geographical areas, and it borrows information both across geographical areas and across cancer sites.
We illustrate the method using cancer incidence rates data from 18 Brazilian cities in São Paulo state. In Brazil, there is no continuous cancer registry system with coverage wider than some large cities. Even in those large cities, the registries have not been continuous in time. In 1995, there were only six registries operating, all of them covering only large city geographical areas and with recent and different starting dates. A recent study used simple methods to geographically interpolate from those sparsely dispersed registries, but it is likely to produce biased results because there were large variations in risk among the few centres used in the study, which are in general geographically far from each other.20
In an attempt to collect more geographically refined data, a cancer incidence survey was conducted in 18 cities spread in São Paulo state.21 Incidence data for the six most common cancers for each sex in those cities provided estimated rates that showed large variations, especially in smaller cities. These variations were attributed to a variety of factors, including small population sizes and the short reference time period. We applied our method to these data, illustrating a number of practical issues in the analysis.
Methods
Initially, statistical evidence of correlation between different cancer sites was sought through an analysis of the indirect SIR of the 12 cancer sites in the 18 cities. In each area $i$, we have the number $C_{ij}$ of cases from cancer site $j$, which we assume to be a Poisson random variable with expected value $\theta_{ij}E_{ij}$, where $E_{ij}$ is the expected number of cases under some incidence rates schedule and $\theta_{ij}$ is the area-specific relative risk for cancer site $j$. The counts $C_{ij}$ are assumed to be conditionally independent given the underlying relative risks. Usually, the $E_{ij}$ are calculated by summing, over the age and sex strata, the population at risk in each stratum of area $i$ multiplied by the reference incidence rate of cancer site $j$ in that stratum.

The unknown parameter $\theta_{ij}$ is the area $i$ relative risk for cancer site $j$. Hence, each geographical area has an unknown vector $\theta_i = (\theta_{i1}, \ldots, \theta_{ik})$ of true underlying incidence relative risks, where $k$ is the number of cancer sites in the study. The usual estimator of $\theta_{ij}$ is the standardized incidence ratio (SIR) $C_{ij}/E_{ij}$, which is the maximum likelihood estimator under the above distributional assumptions.
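As a minimal sketch of the quantities just defined, the following computes an expected count by indirect standardization and the corresponding SIR for one area and one cancer site. The function names, strata, populations and reference rates are hypothetical and are not taken from the São Paulo survey.

```python
def expected_cases(stratum_populations, reference_rates):
    """Expected count: sum over strata of population at risk times reference rate."""
    if len(stratum_populations) != len(reference_rates):
        raise ValueError("one reference rate is needed per stratum")
    return sum(n * r for n, r in zip(stratum_populations, reference_rates))


def standardized_incidence_ratio(observed, expected):
    """SIR: observed cases divided by expected cases."""
    return observed / expected


# Hypothetical area with three age strata (person-years at risk) and
# hypothetical reference rates per person-year for one cancer site.
populations = [20000, 10000, 5000]
rates = [0.00001, 0.0001, 0.0005]
E = expected_cases(populations, rates)       # about 0.2 + 1.0 + 2.5 = 3.7
sir = standardized_incidence_ratio(5, E)     # 5 observed cases
```

With an expected count this small, one case more or less changes the SIR by more than 0.25, which illustrates why the raw ratios are unstable in small areas.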
There can be a large positive correlation between the true underlying incidence relative risks of two different cancer sites, as we will demonstrate in the next section. If this is the case, when the incidence of cancer site A in a given area is greater (or smaller) than the global cancer site A incidence, the incidence of cancer site B in that same area also tends to be greater (or smaller) than the global cancer site B incidence. Not all pairs of cancers are highly correlated, but when some of them are, the Bayesian method explained next exploits this correlation between multiple cancers to better estimate the incidence rates.
Rather than modelling the $\theta_i$ vectors directly, it is usual to work with the logarithm of the relative risks, represented by

$$\phi_i = (\phi_{i1}, \ldots, \phi_{ik}) = (\log\theta_{i1}, \ldots, \log\theta_{ik}).$$
We assume that the vector $\phi_i$ of log relative risks of each geographical area is drawn independently from a multivariate probability distribution with mean $\mu = (\mu_1, \ldots, \mu_k)$ and a $k \times k$ covariance matrix $\Sigma$. If internal standardization is used, the value $\mu_j$ should be 0 or very close to 0. The covariance matrix $\Sigma$ will usually, but not always, have non-negative elements, reflecting the fact that, if a given location $i$ has the log relative risk $\phi_{ij}$ of cancer $j$ larger than the mean $\mu_j$, then we would expect the other cancer sites $l \neq j$ also to be above their means $\mu_l$. The correlation between different cancer sites commented on previously is the empirical justification for this model.
The Bayesian approach adopts a prior distribution to express our uncertainty about the vector $\mu$ and the covariance matrix $\Sigma$. This can also be interpreted as introducing overdispersion in the counts, above that accounted for by the Poisson variability, to accommodate the presence of unobserved risk factors. For the vector $\mu$, we assumed that each of its entries was independently distributed as a uniform random variable on the interval $[-2, 2]$, which is wide enough to cover any reasonable deviation from 0 due to a mismatch between the reference rates and the areas under study. As a sensitivity analysis on the choice of $U[-2, 2]$ as the prior distribution for the entries of $\mu$, we also ran the MCMC procedure assuming a normal distribution with mean zero and large standard deviation (1000) for each entry of the vector $\mu$. For the matrix $\Sigma$, a useful and flexible choice is the inverse Wishart distribution with parameters $h \geq k$ and $R$, where $R$ is a $k \times k$ symmetric non-singular matrix. In this paper, we used a diagonal matrix $R$ with entries equal to 0.1 and $h = 12$. Since we know little about the correlation structure between the cancer rates, we chose the minimum possible value (12) for the $h$ parameter of the Wishart distribution in order to allow the maximum variability in the probability distribution of the random matrix. This model is easily implemented in WinBUGS, a freely available software package for Bayesian data analysis.22 We provide the source code to run this model in the Appendix. We ran the MCMC chain for 100 000 iterations as the burn-in period and for an additional 400 000 iterations, saving every 100th. We ended with a sample of 4000 vectors $\theta_i$ from the posterior distribution, and inferences are based on these values. The Bayesian estimate is taken as the posterior mean of the parameter, and 95% CI are obtained from the quantiles of the posterior distribution. Convergence to the posterior distribution was checked in WinBUGS by means of the Gelman-Rubin diagnostic,23 using two chains with widely different initial values.
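For reference, the hierarchical model described in this section can be collected in a single display. This is a summary sketch that follows the WinBUGS code in the Appendix: the symbols $\theta$, $\phi$ and $\Sigma$ are the names assumed here for the relative risks, their logarithms and the covariance matrix, and the multivariate normal form is taken from the `dmnorm` call in that code.

$$
\begin{aligned}
C_{ij} \mid \theta_{ij} &\sim \text{Poisson}(\theta_{ij} E_{ij}), \qquad i = 1, \ldots, 18, \quad j = 1, \ldots, k, \\
\phi_{ij} &= \log \theta_{ij}, \\
\phi_i = (\phi_{i1}, \ldots, \phi_{ik}) \mid \mu, \Sigma &\sim N_k(\mu, \Sigma), \\
\mu_j &\sim U(-2, 2), \\
\Sigma^{-1} &\sim \text{Wishart}(R, h), \qquad R = 0.1\, I_k, \quad h = k = 12.
\end{aligned}
$$

Placing the Wishart prior on the precision matrix $\Sigma^{-1}$, as the Appendix code does, is equivalent to the inverse Wishart prior on $\Sigma$ mentioned in the text.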
For comparative purposes, we also ran the Bayesian model assuming no correlation between different cancer sites. That is, the covariance matrix $\Sigma$ was taken to be diagonal, with $j$-th diagonal element equal to $\sigma_j^2$, each following an inverse Gamma prior distribution with parameters 0.001 and 0.001. This second model is equivalent to fitting a univariate Bayesian model for each cancer site separately. The comparison between the two fitted models, with the full and with the diagonal $\Sigma$ matrix, is made by means of the Deviance Information Criterion (DIC) proposed by Spiegelhalter et al.24 This measure incorporates two aspects of model adequacy: the deviance of the predicted from the observed values and the effective number of model parameters. It is the sum of two components, one representing the fit of the predicted to the observed values and the other a penalty for increasing model complexity. Smaller values of DIC indicate a better fitting model.
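The two components of the DIC can be made concrete with a short sketch for Poisson counts. It is an illustration under simplifying assumptions, not the WinBUGS computation: the deviance is taken as minus twice the full Poisson log-likelihood, and the plug-in deviance is evaluated at the posterior mean of the Poisson means, whereas WinBUGS plugs in the posterior means of the stochastic parent nodes. The function names and the toy inputs are hypothetical.

```python
import math


def poisson_deviance(counts, means):
    """Minus twice the log-likelihood of independent Poisson counts."""
    loglik = 0.0
    for c, m in zip(counts, means):
        loglik += c * math.log(m) - m - math.lgamma(c + 1)
    return -2.0 * loglik


def dic(counts, posterior_mean_samples):
    """Return (DIC, pD, Dbar) from posterior samples of the Poisson means.

    posterior_mean_samples: list of samples, each a list with one Poisson
    mean per observation.
    """
    n_samples = len(posterior_mean_samples)
    deviances = [poisson_deviance(counts, s) for s in posterior_mean_samples]
    d_bar = sum(deviances) / n_samples        # posterior mean deviance (fit)
    plug_in = [sum(col) / n_samples for col in zip(*posterior_mean_samples)]
    d_hat = poisson_deviance(counts, plug_in) # deviance at the posterior mean
    p_d = d_bar - d_hat                       # effective number of parameters
    return d_bar + p_d, p_d, d_bar


# Toy inputs: three observed counts and three posterior samples of their means.
toy_counts = [3, 0, 7]
toy_samples = [[2.5, 0.5, 6.0], [3.5, 0.4, 8.0], [3.0, 0.6, 7.0]]
toy_dic, toy_pd, toy_dbar = dic(toy_counts, toy_samples)
```

Two competing models would each yield such a DIC value from their own posterior samples, and the smaller value would be preferred.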
Our method allows for the introduction of covariates, or additional confounding factors. Our previous model is changed simply by giving the log relative risk vector $\phi_i$ a non-constant mean $\mu_i = (\mu_{i1}, \ldots, \mu_{ik})$, where each entry $\mu_{ij}$ is a linear combination of the covariates and unknown regression parameters. These parameters would receive a vague prior distribution, such as a uniform with a large range, and the MCMC procedure would be run as usual. It is straightforward to implement this extension in WinBUGS.
Results
In general, the univariate Bayesian model produced relative risk estimates with wider 95% credibility intervals. In fact, of the 216 estimates from the 18 cities and 12 cancer sites, only seven had multivariate-based intervals wider than the corresponding univariate-based intervals (Figure 3). The univariate intervals are wider because the multivariate method uses the extra information contained in the correlated data from other cancer sites in the same area. One additional argument favouring the multivariate model is its smaller DIC value, 1338.14, as compared with that of the univariate model, 1372.51.
Concerning convergence, we ran the MCMC procedure with one additional set of suitably overdispersed starting values for the vector $\mu$ (uniform random values varying from −5 to 5) and the $\phi_{ij}$ elements (uniform random values varying from −5 to 5) and calculated the Gelman-Rubin convergence statistic R, as modified by Brooks and Gelman.23 To run this procedure with dispersed initial values beyond the limits of the (−2, 2) interval, we adopted the normal prior discussed above for the vector $\mu$. We monitored the convergence of the statistic R for all $\theta_{ij}$ parameters and found all of them within 0.01 of the target value of 1. Therefore, we assumed the chains converged adequately.
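The convergence check can be sketched as follows for a single scalar parameter. This is the basic potential scale reduction factor computed from several chains, not the Brooks and Gelman modification implemented in WinBUGS, and the function name and toy chains are hypothetical.

```python
def gelman_rubin(chains):
    """Basic potential scale reduction factor for one scalar parameter.

    chains: list of m >= 2 chains, each a list of n >= 2 draws of equal length.
    Assumes the within-chain variance is not zero.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [sum(c) / n for c in chains]
    grand_mean = sum(chain_means) / m
    # Variance of the chain means (the between-chain variance divided by n).
    between_over_n = sum((cm - grand_mean) ** 2 for cm in chain_means) / (m - 1)
    # Mean of the within-chain sample variances.
    within = sum(
        sum((x - cm) ** 2 for x in c) / (n - 1)
        for c, cm in zip(chains, chain_means)
    ) / m
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


# Toy chains: two that agree, and one shifted far away from the first.
chain_a = [i % 10 for i in range(100)]
chain_b = [i % 10 for i in range(100)]
chain_c = [x + 10 for x in chain_a]
r_agree = gelman_rubin([chain_a, chain_b])
r_disagree = gelman_rubin([chain_a, chain_c])
```

Values close to 1 suggest that the chains are sampling the same distribution, while values well above 1, as for the shifted toy chain, indicate that they have not yet mixed.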
Discussion
It is important to emphasize that the method is not proposed as an alternative to good quality routine data. Even where this kind of data exists, more reliable estimation of incidence rates can be obtained with the Bayesian approach if the expected counts of cases are small as is usual in small geographical areas or rare disease studies. In the situation examined in this paper, the method has the additional advantage of making better use of rarely available good quality incidence information, collected at great cost for a sample of cities in a given year.
The idea that other cancer sites incidence can provide information to better estimate a given cancer site incidence is not usual. In fact, as far as we know, our paper is the first attempt to implement this idea with many cancers simultaneously, including cancers not obviously related, such as female breast and male lung cancers, for example. Due to the highly specialized nature of cancer research, studies generally address only one specific cancer type at a time. Exceptions to this are epidemiological studies analysing incidence or mortality of several cancers, sometimes comparing different regions. However, these studies did not explore multiple correlations among different cancers as we have here. Additional research is merited to explore the possibility of routinely using other cancers incidence to estimate a given cancer incidence when reliable rates are difficult to obtain due to the small number of expected cases.
We want to emphasize the ecological nature of the statistical correlations we found. In fact, at the individual level, there is no reason to believe that one type of cancer will be associated with another one in the same person, with the obvious exceptions of primary cancer metastases. However, at the aggregate level, we believe that it is justifiable to use other cancer sites information to better estimate one site incidence or mortality, as we argue next.
Besides the empirical correlations we found in our analysis, some aspects of cancer epidemiology corroborate our hypothesis that one cancer site rate gives information about other cancer sites rates. The variation in cancer rates between countries, the differing rates among migrants from one place to another, and the trends over time suggest that most cancers can be environmentally induced. Furthermore, there is evidence that many of these determinants are risk factors for several cancer sites simultaneously. It is estimated that approximately 75–80% of all cancers could have environmental determinants26,27 and these include lifestyle habits, like tobacco and alcohol consumption, food and medicine intake, water, soil and air contaminants, and occupational exposures.26 Tobacco consumption is responsible for nearly one-third of all cancer cases in the US27 and it is the major cause of lung, larynx, oral cavity, pharynx and oesophagus cancers.26,27 Alcohol interacts with tobacco to cause oral cavity, pharynx, oesophagus and larynx cancers.26,27 It is estimated that one-third of all cancers could be related to dietary and nutritional practices.27 There is an inverse association between risk of certain epithelial cancers, particularly oral, oesophageal, stomach and lung cancers, and intake of fresh fruits and vegetables. High-fat, low-fibre diets are positively associated with the risk of colon, breast and prostate cancers.26,27
Since the aetiology of many cancers is not completely understood, the association of tobacco and alcohol consumption, inadequate diet and other common environmental factors could justify why many cancers were highly correlated in our analysis. Also, similarities in the cities' health care systems can contribute to the correlation, since incidence rates are highly dependent on the health system's early detection capacity.
As we mentioned in Methods, covariates can be introduced to account for known or suspected risk factors. For example, if information on tobacco consumption is available, it should be introduced in the model for lung, larynx, and oesophagus cancers. Similarly, if information on alcohol consumption is available, it should be introduced in the model for oesophagus and larynx cancers. In general, if we suspect that the $j$-th cancer relative risk in area $i$ is associated with $p$ variables $x_{1ij}, \ldots, x_{pij}$, then our model can be easily extended by making $\mu_{ij} = \beta_0 + \beta_1 x_{1ij} + \cdots + \beta_p x_{pij}$. As prior distributions, the parameters $\beta_0, \ldots, \beta_p$ have independent normal distributions with mean zero and large variance. If these risk factors are available, which was not the case for the data we used, the model can be easily implemented in WinBUGS. One clear advantage of this model is that the risk factors can be the same for some cancer sites, although this is not necessary.
We believe that our methodological contribution is useful for improving the risk assessment of such an important health problem in modern societies as cancer, although it can also be used to improve rates estimation for other health problems. Besides that, we believe that our paper contributes to reinforcing the notion that environmental factors present in many modern populations can act as common risk factors for many types of cancers and, because of this, that cancer can be a largely preventable disease.
Appendix
model {
   for (i in 1:N) {                               # areas
      for (j in 1:K) {                            # cancer sites
         Y[i, j] ~ dpois(lambda[i, j])            # distribution of observations
         lambda[i, j] <- E[i, j] * theta[i, j]    # expected cases times relative risk
         theta[i, j] <- exp(phi[i, j])            # log parametrization
      }
      phi[i, 1:K] ~ dmnorm(mu[], Omega[, ])       # Omega is the precision matrix
   }
   for (j in 1:K) {
      mu[j] ~ dunif(-2, 2)                        # uniform prior on [-2, 2]
   }
   Omega[1:K, 1:K] ~ dwish(R[, ], 12)             # Wishart on prec. matrix
}
References
2 Devine OJ, Louis TA, Halloran ME. Empirical Bayes methods for stabilizing incidence rates before mapping. Epidemiology 1994;5:622–630.
3 Dunson DB. Practical advantages of Bayesian analysis of epidemiologic data. Am J Epidemiol 2001;153:1222–26.
4 Etzioni RD, Kadane JB. Bayesian statistical methods in public health and medicine. Annu Rev Public Health 1995;16:23–41.
5 Freedman L. Bayesian statistical methods: a natural way to assess clinical evidence. BMJ 1996;313:569–70.
6 Pascutto C, Wakefield JC, Best NG et al. Statistical issues in the analysis of disease mapping data. Stat Med 2000;19:2493–519.
7 Spiegelhalter DJ, Myles JP, Jones DR et al. An introduction to Bayesian methods in health technology assessment. BMJ 1999;319:508–12.
8 Gilks WR, Richardson S, Spiegelhalter DJ (eds). Markov Chain Monte Carlo in Practice. Boca Raton: CRC Press, 1996.
9 Jarup L, Best N, Toledano M et al. Geographical epidemiology of prostate cancer in Great Britain. Int J Cancer 2002;97:695–99.
10 Toledano MB, Jarup L, Best N et al. Spatial variation and temporal trends of testicular cancer in Great Britain. Br J Cancer 2001;84:1482–87.
11 Besag J, York J, Mollié A. Bayesian image restoration with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics 1991;43:1–59.
12 Assunção RM, Reis IA, Oliveira CL. Diffusion and prediction of Leishmaniasis in a large metropolitan area in Brazil with a Bayesian space-time model. Stat Med 2001;20:2319–35.
13 Bernardinelli L, Pascutto C, Montomoli C, Gilks W. Investigating the genetic association between diabetes and malaria: an application of Bayesian ecological regression models with errors in covariates. In: Elliott P, Wakefield JC, Best NG, Briggs DJ (eds). Spatial Epidemiology. New York: Oxford University Press, 2000, pp. 286–301.
14 Gelfand AE, Zhu L, Carlin BP. On the change of support problem for spatial-temporal data. Biostatistics 2001;2:31–45.
15 Longford NT. Multivariate shrinkage estimation of small area means and proportions. J R Statist Soc A 1999;162:227–45.
16 Assunção RM, Potter JE, Cavenaghi SM. A Bayesian space varying parameter model applied to estimating fertility schedules. Stat Med 2002;21:2057–75.
17 Knorr-Held L, Best NG. A shared component model for detecting joint and selective clustering of two diseases. J R Statist Soc A 2001;164:73–85.
18 Kim H, Sun D, Tsutakawa RK. A bivariate Bayesian method for improving estimators of mortality rates with 2-fold CAR model. J Am Statist Assoc 2001;96:1506–21.
19 Wang F, Wall MM. Modelling multivariate data with a common spatial factor. Research Report No. 2001008 (2001), Division of Biostatistics, University of Minnesota, Minneapolis, MN.
20 Instituto Nacional do Câncer (INCA), 2002. Estimativa da Incidência e Mortalidade por Câncer no Brasil. Available at http://www.inca.gov.br/cancer/epidemiologia/estimativa2002/
21 Andreoni GI, Veneziano DB, Gianotti Filho O, Marigo C, Mirra AP, Fonseca LAM. Cancer incidence in eighteen cities of the State of São Paulo, Brazil. Rev Saude Publica 2001;35:362–67.
22 Spiegelhalter DJ, Thomas A, Best NG et al. WinBUGS: Bayesian Inference Using Gibbs Sampling. Version 1.4. Cambridge: MRC Biostatistics Unit, 2003. Available at http://www.mrc-bsu.cam.ac.uk/bugs/
23 Brooks SP, Gelman A. Alternative methods for monitoring convergence of iterative simulations. J Comp Graph Stat 1998;7:434–55.
24 Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit. J R Statist Soc B 2002;64:583–639.
25 Estève J, Benhamou E, Raymond L. Statistical Methods in Cancer Research. Vol. IV. Descriptive Epidemiology. Lyon: International Agency for Research on Cancer (IARC), 1994.
26 Instituto Nacional do Câncer (INCA), 2002. Prevenção e Detecção: Fatores de Risco, Tabagismo, Alcoolismo, Hábitos Alimentares. Available at http://www.inca.gov.br/cancer/prevenção/
27 Blot WJ. The Epidemiology of Cancer. In: Bennett JC, Plum F (eds). Cecil Textbook of Medicine. Philadelphia: W. B. Saunders Company, 1996, pp. 1020–24.