1 Centre for Research into Ecological and Environmental Modelling, University of St. Andrews, St. Andrews, United Kingdom
2 Medical Research Council Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom
3 Department of Statistics and Modelling Science, University of Strathclyde, Glasgow, United Kingdom
4 Statistical Laboratory, University of Cambridge, Cambridge, United Kingdom
5 Health Protection Scotland, Glasgow, United Kingdom
6 Public Health and Health Policy Section, University of Glasgow, Glasgow, United Kingdom
7 Centre for Drug Misuse Research, University of Glasgow, Glasgow, United Kingdom
Correspondence to Dr. Sheila M. Bird, MRC Biostatistics Unit, University of Cambridge, Robinson Way, Cambridge CB2 2SR, United Kingdom (e-mail: sheila.bird{at}mrc-bsu.cam.ac.uk).
Received for publication December 21, 2004. Accepted for publication April 29, 2005.
ABSTRACT |
---|
Bayes theorem; data collection; epidemiologic methods; models, statistical; mortality; prevalence; substance abuse, intravenous
INTRODUCTION |
---|
In Scotland, Frischer et al. (6) were the first investigators to use capture-recapture methods to estimate the 1989 prevalence of current injecting drug users (IDUs) in Glasgow. Subsequently, Hay et al. (7) used stratified capture-recapture methods to estimate numbers of current IDUs in Scotland in 2000, separately for 11 health boards. Four data sources were available: lists of current IDUs known to Scotland's Drug Misuse Database (DMD), obtained via reports made by 1) drug treatment agencies or 2) family practitioners; 3) social inquiry reports about IDUs; and 4) reported diagnoses of hepatitis C virus (HCV) infection among persons who had ever injected drugs. Health board-specific overlaps among the data sources (captures) were the basis for estimating the uncaptured or hidden IDUs in each region.
Hay et al.'s (7) decision to apply capture-recapture methods separately by health board reflected not only the fact that regional estimates were their primary concern but also an appreciation of potential source heterogeneity by region. For example, the reporting of IDUs to Scotland's DMD may be more comprehensive in some regions than in others, and likewise IDUs' utilization of HCV testing. Demographic characteristics of injectors, notably sex and age, may also differentially determine IDUs' propensity to be featured in data sources, as transpired in England (8).
Bird et al. (9) used Hay et al.'s capture-recapture estimates for Scotland's current IDUs in 2000 (7) and other published data on Scottish IDUs' sex and age distribution to draw inferences about drug-related deaths in 2000 + 2001 by region, sex, and age group per 100 IDUs.
In this paper, we present a Bayesian approach to the problem of differential propensity to be featured in data sources by explicitly modeling the extent to which region (summarized as Greater Glasgow vs. elsewhere in Scotland), sex, and age group (15–34 years vs. ≥35 years in 2000) influence injectors' propensity to be listed in Hay et al.'s four capture-recapture data sources (7). Resulting posterior distributions (for example, for Greater Glasgow's number of current IDUs by sex and age group) then serve as denominators for the city's drug-related deaths in 2000 + 2001 + 2002 when it comes to providing credible intervals for demographic influences on Scotland's drug-related death rate per 100 IDUs. (See appendix table 1 for a glossary of the Bayesian terms used in this paper.)
MATERIALS AND METHODS |
---|
For illustration, we also include classical results for two sets of capture-recapture models incorporating stratification by sex or age group. Unreliable estimates are risked when the stratum-specific data are sparse in some cells. The usual solution is to pool sources or covariate levels.
Data sources
The four capture-recapture data sources for persons who reported injecting drug use, accessed for the period January 1, 1999–December 31, 2000, and described in detail by Hay et al. (7), were: 1) Scotland's database of HCV diagnoses; 2) social inquiry reports; 3) general practitioners' reports to Scotland's DMD; and 4) new drug treatment agency contacts reported to Scotland's DMD.
Bayesian capture-recapture modeling with covariates
From the four data sources given above and three potential covariates (region, sex, age group), each with two levels, we constructed a 2^7 contingency table (11) of observed counts by data sources and covariate values. The number of persons who are unobserved by all data sources is unknown for each combination of covariate values. A log-linear model describes the relations among cell probabilities, data sources, and covariates (2, 6). First-order (or main-effect) log-linear parameters represent the effect of the corresponding data source or covariate on the underlying capture rate. Higher-order terms correspond to interaction effects between different data sources or covariates. We assume that all main-effect terms are present, but we use Bayesian model discrimination techniques (12, 13) to determine which, if any, interactions are supported by the data. Allowing only second-order interaction terms, we have 2^21 (approximately 2 million) distinct models corresponding to different presence/absence combinations of the 21 distinct second-order terms. Conditioning on any particular model, we can estimate the numbers of unobserved persons for each combination of covariates.
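The size of the table and of the model space follows from simple counting. The sketch below is illustrative only: the factor labels are our own shorthand, not the authors' notation. It enumerates the pairwise interactions among the seven two-level factors and counts the models that result when each interaction may be present or absent.

```python
from itertools import combinations

# Four data sources and three covariates, as listed in the text.
# The short labels are illustrative, not the authors' notation.
factors = ["hcv", "social_inquiry", "gp_dmd", "agency_dmd",
           "region", "sex", "age_group"]

# Every factor has two levels, so the full table has 2^7 cells.
n_cells = 2 ** len(factors)

# Pairwise interaction terms: one per unordered pair of factors.
pairwise_terms = list(combinations(factors, 2))

# Each pairwise term is either present or absent in a given model.
n_models = 2 ** len(pairwise_terms)

print(n_cells, len(pairwise_terms), n_models)
```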
For IDUs outside Greater Glasgow (per combination of age group and sex), we multiply the estimated population sizes by 1.15 (9). This multiplier takes into account the fact that the four health service areas outside Greater Glasgow which lacked capture-recapture data had 58 drug-related deaths in 2000 + 2001, when the rest of Scotland had 566 drug-related deaths and, Hay et al. (7) estimated, 22,805 IDUs. The health service areas with missing data were assumed to have 2,337 IDUs (58/566 × 22,805 = 2,337), an addition of 15 percent to Hay's capture-recapture total of 15,618 IDUs outside Greater Glasgow (9); this gave rise to a multiplier of 1.15 (1 + 2,337/15,618). (Note that all Bayesian estimates presented in this paper include this factor of 1.15.)
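The arithmetic behind the multiplier can be retraced in a few lines. This is a sketch using only the figures quoted above; the variable names are ours.

```python
# Figures quoted in the text (drug-related deaths in 2000 + 2001 and IDU estimates).
deaths_missing_areas = 58        # four health service areas without capture-recapture data
deaths_rest_of_scotland = 566
idus_rest_of_scotland = 22_805   # Hay et al.'s estimate
idus_outside_glasgow = 15_618    # capture-recapture total outside Greater Glasgow

# Assume IDUs scale with drug-related deaths in the areas lacking data.
idus_missing_areas = deaths_missing_areas / deaths_rest_of_scotland * idus_rest_of_scotland
multiplier = 1 + idus_missing_areas / idus_outside_glasgow

print(round(idus_missing_areas), round(multiplier, 2))
```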
Estimates can vary substantially, even between models that fit the data well (14, 15). Rather than relying on a single selected model, the alternative is to associate weights, in the form of Bayesian posterior model probabilities, with each model and produce model-averaged inferences by taking, as the estimated population size (here, for IDUs), the weighted average of the corresponding estimates under each model. Known as Bayesian model-averaging (16), the resulting estimate and its highest posterior density interval (HPDI) reflect both parameter and model uncertainty. Posterior model probabilities within the Bayesian approach allow a formal quantitative comparison among different competing models. Bayes factors are also often used to compare two models, where the Bayes factor equals the posterior odds ratio for the pair of models divided by their prior odds ratio. A Bayes factor greater than 3 provides positive support for one model over another (17).
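A minimal sketch of the model-averaged point estimate follows. The posterior model probabilities and within-model estimates are made-up numbers chosen for illustration, not results from this paper, and the function name is hypothetical.

```python
# Hypothetical posterior model probabilities and within-model estimates of
# the IDU total; these numbers are made up for illustration only.
models = [
    {"posterior_prob": 0.50, "estimate": 22_000},
    {"posterior_prob": 0.30, "estimate": 27_000},
    {"posterior_prob": 0.20, "estimate": 18_000},
]

def model_averaged_estimate(models):
    """Weighted average of within-model estimates, weighted by posterior model probability."""
    total_weight = sum(m["posterior_prob"] for m in models)
    return sum(m["posterior_prob"] * m["estimate"] for m in models) / total_weight

averaged = model_averaged_estimate(models)
print(averaged)
```

In practice the averaging is carried out over posterior samples rather than over a short list of point estimates, so that the interval as well as the mean reflects model uncertainty.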
Summarizing the posterior distribution for any statistic of interest
Posterior distributions are formed by combining the likelihood of the data (for a given model) with prior distributions which reflect our beliefs about the model and its parameters before observing data (17). Here, the parameters of interest are the log-linear parameters together with the unknown population sizes in the unobserved cells of the contingency table, that is, the numbers of persons who remain unobserved by all four data sources. To discriminate between competing models, we treat the model itself as an unknown parameter which is to be estimated. The resulting posterior distribution is typically both high-dimensional and complex, but it can be obtained via a computational device known as the Markov chain Monte Carlo method (18), which samples from the posterior distribution. It is by this very sampling that posterior means and variances for all parameters of interest, and indeed empirical estimates for any statistics of interest, can be obtained. The reversible jump Markov chain Monte Carlo method is used to obtain the corresponding posterior model probabilities (10, 19).
Incorporation of prior information
Initially, we consider uninformative (but conventional) priors so that information from the data (represented by the likelihood) dominates the posterior. In particular, we use Jeffreys' prior (20) for the unobserved population sizes and independent normal priors with zero mean (for neutral influence) and variance σ² for the log-linear parameters present in each model. To reflect prior uncertainty about the values to expect for the log-linear parameters, we place a vague gamma prior (to be wide-ranging) on the prior variance parameter σ². Finally, since we have model uncertainty, we place a prior on the models themselves, with regard to the interactions present. We specify equal prior probability (for agnosticism) for each possible model, which corresponds to a prior probability of 0.5 that each second-order interaction is present. In the Results section, we describe the incorporation of informative priors into this framework.
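The link between an equal prior on models and a prior probability of 0.5 per interaction can be checked by enumeration. The sketch below uses three hypothetical interaction terms instead of the paper's 21 so that all models can be listed; the term labels are illustrative.

```python
from itertools import product

# Three hypothetical interaction terms (the paper has 21; three keeps enumeration small).
terms = ["source1:sex", "source2:age", "sex:region"]

# A model is a tuple of present/absent flags, one flag per term.
models = list(product([False, True], repeat=len(terms)))
prior = {m: 1 / len(models) for m in models}   # equal prior probability per model

# Marginal prior probability that each term is present.
inclusion = {
    term: sum(p for m, p in prior.items() if m[i])
    for i, term in enumerate(terms)
}
print(inclusion)
```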
RESULTS |
---|
External information.
Bayesian models incorporate expert opinion into analyses via the priors specified on the parameters in their models. We have additional but indirect external information about Scotland's total number of IDUs, which derives from an informative prior for the drug-related death rate of Europe's IDUs (21–24). In addition, upon reflection, there is also prior information, independent of the data collected here, concerning the ratio of male IDUs to female IDUs in European countries (21) and, for Scotland, expert opinion about the direction of possible interactions between data sources and covariates. Within a Bayesian analysis, we are able to include this information by simply specifying prior distributions on the corresponding parameters to reflect these prior beliefs. The influence that the priors have on the posterior distribution is dependent on the relative amount of information contained in the prior versus in the data (via the likelihood). We discuss first the prior information that we have (21–24) and then how to construct priors which we can use to represent these beliefs.
Incorporating prior information.
Initially, we consider the total population size. We have additional, independent information from the General Register Office for Scotland about the annual number of drug-related deaths in Scotland, which totaled 1,006 in 2000–2002 (an average of 335.3 per annum), of which the vast majority (but not all) would be deaths of IDUs. We can couple this information with our European preconception of IDUs' annual drug-related death rate to obtain a prior estimate of the total population size for Scotland's IDUs. The drug-related death rate for Europe's IDUs is generally taken to be 0.5–2 percent (21–24), which, to allow for uncertainty, we take to represent the lower and upper fifth percentiles of our prior distribution. A corresponding prior on Scotland's total number of IDUs should then have lower and upper fifth percentiles of 16,767 and 67,066, respectively (335.33/2 percent = 16,767; 335.33/0.5 percent = 67,066), which can be translated conveniently onto the required multiplicative scale by a lognormal distribution (log(33,533), 0.1776) with appropriate variance; its median, 33,533, corresponds to a death rate of 1 percent.
European prior information (21) is that the male:female IDU ratio most often ranges from 60:40 to 90:10, which are interpreted as lower and upper 10th percentiles. To represent this prior belief about the male:female IDU ratio, we again specify the corresponding prior to be an appropriate lognormal distribution (log(3.6742), 0.489). We have no additional information about the effect of age or region within Scotland on this ratio. Thus, conditional on the number of male IDUs (likewise female), we specify a Dirichlet (1) prior (i.e., a uniform prior over the simplex) on the proportion of male IDUs in each combination of age group and region. This completes our prior on the unknown cell entries.
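The lognormal parameters quoted for the two informative priors can be recovered by matching percentiles on the log scale. The sketch below is one way to do this and is not necessarily the authors' procedure; the helper `lognormal_from_tails` is a hypothetical name. It places the log of the median midway between the two log percentiles and sets the log-scale standard deviation from the standard normal quantile.

```python
from math import exp, log
from statistics import NormalDist

def lognormal_from_tails(lower, upper, tail_prob):
    """Return (median, log-scale variance) of the lognormal whose lower and
    upper tail_prob quantiles are `lower` and `upper`."""
    z = NormalDist().inv_cdf(1 - tail_prob)
    mu = (log(lower) + log(upper)) / 2          # log of the median
    sigma = (log(upper) - log(lower)) / (2 * z)
    return exp(mu), sigma ** 2

# Total IDUs: average annual deaths at rates of 2 and 0.5 percent,
# treated as the lower and upper fifth percentiles.
deaths_per_year = 1006 / 3
total_median, total_var = lognormal_from_tails(
    deaths_per_year / 0.02, deaths_per_year / 0.005, 0.05)

# Male:female ratio: 60:40 and 90:10 as lower and upper 10th percentiles.
ratio_median, ratio_var = lognormal_from_tails(60 / 40, 90 / 10, 0.10)

print(round(total_median), round(total_var, 4))
print(round(ratio_median, 4), round(ratio_var, 3))
```

Under these assumptions the medians and variances agree with the values quoted in the text to the precision given there.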
We now consider interactions between the data sources and covariates. There is no strong prior information about the presence of different interaction terms, and so we once again specify an equal prior weighting for each model. However, there is prior information on the direction (or sign) of some of the interaction terms, if they are present: namely, that females are relatively less likely to appear in social inquiry reports; older injectors are relatively more likely to live in Greater Glasgow; and older injectors are relatively more likely to appear in the HCV diagnoses database (S. M. B.). We represent this prior information by using a mixture distribution of half-normals on the corresponding log-linear parameters: one positive and one negative, each with common variance parameter σ². The mixture weight for each half-normal represents the strength of prior belief associated with the corresponding sign for the interaction term. Weights of 0.5 for each half-normal simply reproduce the normal distribution (which is without preference of sign) and are used for all log-linear terms where there is no directional prior information. Conversely, mixture weights of 0 and 1 reduce the distribution to a single half-normal distribution so that only a negative (or positive) interaction is possible.
Consistent with three strong, directional prior beliefs, we set weights of 0.95 and 0.05 on the half-normals, with the larger weight on the direction specified above for the interactions where there is robust expert opinion. Thus, this does not, a priori, place zero probability (i.e., impossibility) on the opposite effect's being present, but it is heavily weighted against. To specify the prior uncertainty on the variance, we use a vague gamma distribution.
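The directional prior can be written down as a density. This is a sketch under assumptions: the function name is hypothetical and the scale is fixed at an arbitrary value, whereas the paper places a gamma prior on the variance. It shows that equal weights recover the usual zero-mean normal and that a 0.95/0.05 split favors one sign by a factor of 19 without ruling out the other.

```python
from statistics import NormalDist

def sign_mixture_pdf(x, weight_positive, sigma):
    """Density of a two-component mixture of half-normals with a common scale:
    weight_positive on the positive half, 1 - weight_positive on the negative half."""
    half_normal = 2 * NormalDist(0, sigma).pdf(x)   # half-normal density evaluated at |x|
    weight = weight_positive if x >= 0 else 1 - weight_positive
    return weight * half_normal

sigma = 1.0  # hypothetical value; the paper places a gamma prior on the variance

# Equal weights recover the ordinary zero-mean normal density.
print(sign_mixture_pdf(0.8, 0.5, sigma), NormalDist(0, sigma).pdf(0.8))

# A 0.95/0.05 split favors one sign without excluding the other.
print(sign_mixture_pdf(0.8, 0.95, sigma), sign_mixture_pdf(-0.8, 0.95, sigma))
```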
Using informative priors.
Estimates obtained under the informative priors are given in table 5. Repetition of the Markov chain Monte Carlo simulations obtained mean estimates of the parameters within 1 percent of their reported values, so estimates have converged sufficiently for our purposes. Models 1–3 in table 5 are close neighbors of each other, but the model which was identified as inconsistent with national preconceptions in the initial analysis using vague priors (model 2 in table 4) is out of the top three and has much lower support (0.012). This is a direct result of incorporating these preconceptions as prior beliefs within the analysis. The model with the largest posterior support is the same, irrespective of the prior placed on the parameters. However, its central estimate for total population size is slightly higher under the informative prior because of more prior support for larger population sizes.
Table 5 highlights the fact that relative to covariate main effects, within the age group 15–34 years, both males and residents of Greater Glasgow are relatively underrepresented; likewise, the sex differential appears to be largely tilted away from males in Greater Glasgow, where the phenomenon of injection drug use has had a tenaciously long hold (25) in comparison with newer epidemics elsewhere in Scotland. The latter interaction is not present in model 2 in table 5, but it has an overall posterior probability of 0.75, or equivalently a Bayes factor of 3.
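The equivalence between a posterior probability of 0.75 and a Bayes factor of 3 follows from the definition of the Bayes factor as posterior odds divided by prior odds, given the prior probability of 0.5 assigned to each interaction. A small sketch, with a function name of our own choosing:

```python
def bayes_factor(posterior_prob, prior_prob):
    """Posterior odds divided by prior odds for 'term present' vs. 'term absent'."""
    posterior_odds = posterior_prob / (1 - posterior_prob)
    prior_odds = prior_prob / (1 - prior_prob)
    return posterior_odds / prior_odds

# Posterior inclusion probability 0.75 from the text; prior 0.5 per interaction.
print(bayes_factor(0.75, 0.5))
```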
The propensity to be listed in data sources 1 (the HCV diagnosis database) and 2 (social inquiry reports) is also covariate-dependent. Thus, table 5 shows that younger injectors are less likely to be diagnosed with HCV but more likely to be subject to social inquiry reports. Male injectors are also less likely to be diagnosed with HCV. Whether this is part of a general male tendency toward later diagnosis or is due to the fact that HCV testing is specifically offered more often to female injectors, particularly if they are pregnant (26), is unclear. It appears that social inquiry reports on IDUs are relatively less likely to be made in Greater Glasgow than elsewhere.
Bayesian inference about IDU total.
The within-model estimates in table 6 have much greater precision than the corresponding model-averaged results, which reflect both parameter uncertainty and model uncertainty. The posterior mean and 95 percent HPDI bounds are higher than corresponding estimates in table 4 under the vague prior, which is a clear consequence of the informative prior's having placed more weight (a priori) on a larger number of IDUs.
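For readers unfamiliar with the HPDI, a common sample-based approximation takes the narrowest interval that contains the required share of the sorted posterior draws. The sketch below illustrates that idea on made-up draws; it is not the authors' code, and it assumes a unimodal posterior.

```python
import math

def hpdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the sorted samples
    (a common sample-based approximation for unimodal posteriors)."""
    ordered = sorted(samples)
    n = len(ordered)
    n_in = max(1, math.ceil(mass * n - 1e-9))   # number of points the interval must cover
    best = min(range(n - n_in + 1),
               key=lambda i: ordered[i + n_in - 1] - ordered[i])
    return ordered[best], ordered[best + n_in - 1]

# Hypothetical right-skewed "posterior draws" (not the paper's output).
draws = [x ** 2 for x in range(100)]
print(hpdi(draws))
```

For a right-skewed sample such as this one, the interval sits toward the lower values, unlike a symmetric interval cut at the 2.5th and 97.5th percentiles.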
Bayesian denominators.
We also obtained posterior estimates for the number of IDUs with each combination of covariate values (see table 6). Younger male injectors (aged 15–34 years) predominated to a greater extent outside of Greater Glasgow, whereas among females there was lesser age disparity by region. In Greater Glasgow, females accounted for 30 percent (95 percent HPDI: 27.5, 33.3) of IDUs aged 15–34 years but for only 27 percent (95 percent HPDI: 24.7, 29.4) of young IDUs elsewhere. Among older IDUs, only 20 percent (95 percent HPDI: 17.5, 23.4) were female.
Secondary analysis: covariate influences on Scotland's annual drug-related death rate in 2000–2002 per 100 current IDUs
By sampling from the posterior distribution averaged over all models, which was obtained via the Markov chain Monte Carlo algorithm and the use of informative priors, we derived the posterior means and 95 percent HPDIs for the annual drug-related death rate per 100 IDUs shown in table 7 for eight major cross-classifications of injectors by sex, age group, and region.
DISCUSSION |
---|
Interactions within and between covariates and data sources, whether incorporated in a classical framework or a Bayesian capture-recapture framework with vague priors, can still give rise to preferred models which, because of model uncertainty, give markedly different answers for total numbers of IDUs. Yet, they all agree tolerably well in terms of how those IDUs are apportioned relatively across major cross-classifications, for example, as defined by sex, age group, and region.
Incorporation of informative priors displaced from among the models with the highest posterior probability any model which suggested that Scotland could have significantly fewer than 17,000 drug injectors. This contingency was also ruled out by a symmetric 90 percent (but not by 95 percent) credible interval around the model-averaged central estimate. We must enter a reservation here: No formal prior account was taken that some drug-related deaths occurred among non-IDUs. The more this is so, the more table 7's annual drug-related death rates per 100 current IDUs could be overestimates.
Important insights into data source-by-covariate interactions were gleaned, and in secondary analyses, there was confirmation of a higher drug-related death rate among IDUs aged ≥35 years (9) and (a novel finding) among IDUs resident in Scotland outside Greater Glasgow. The observation that female IDUs' lower drug-related death rate was not sustained into the age group ≥35 years has clear consequences for watchfulness by drug treatment agencies. We also note that females represented only 20 percent of IDUs in this older age group.
Methodologically, this paper breaks new ground in its use of an informative international prior about a rate when deriving the local denominator for that rate, the local numerator for which is known. This problem is not uncommon in epidemiology. Notice, in particular, that our informative prior on the overall drug-related death rate did not constrain drug-related death rates for individual cross-classifications to be within the same range of 0.5–2 percent: See, for example, the higher rates for IDUs aged ≥35 years in table 7.
In public health terms, we have demonstrated demographic influences on injectors' propensity to be listed in data sources, on their number and make-up in Greater Glasgow versus elsewhere in Scotland, and on injectors' drug-related death rate. A lower proportion of females among older IDUs may have concealed the fact that female IDUs' lower vulnerability to drug-related death is not sustained beyond 34 years of age. Health officials need to examine why drug-related death rates among IDUs in Scotland seem to be higher outside of Greater Glasgow.
APPENDIX TABLE 1. Glossary of Bayesian terminology
ACKNOWLEDGMENTS |
---|
Professor Sheila M. Bird holds stock in GlaxoSmithKline but is not currently conducting any research sponsored by the company.
References |
---|