1 Harvard Medical School, Boston, MA.
2 Harvard Pilgrim Health Care, Boston, MA.
3 Harvard Vanguard Medical Associates, Boston, MA.
4 Eastern Massachusetts Prevention Epicenter, Centers for Disease Control and Prevention, Boston, MA.
5 Center for Education and Research in Therapeutics, HMO Research Network, Boston, MA.
6 Brigham and Women's Hospital, Boston, MA.
7 University of Sydney School of Public Health, Sydney, Australia.
Received for publication June 3, 2002; accepted for publication May 12, 2003.
ABSTRACT
bioterrorism; communicable diseases; epidemiologic methods; generalized linear mixed model; population surveillance; spatial analysis
Abbreviations: GLMM, generalized linear mixed models; ICD-9-CM, International Classification of Diseases, Ninth Revision, Clinical Modification; NT, number of tests.
INTRODUCTION
Recent bioterrorist attacks have highlighted the public health need to detect bioterrorism events as quickly as possible. In the case of anthrax, the initial clinical presentation of persons with acute lower respiratory infection could precede the laboratory diagnosis of anthrax by several days (1). This suggests that surveillance based on rapid reporting of an unusually large number of incident episodes of lower respiratory illness in a community might present an opportunity for early detection of a bioterrorism event. Less dramatically, episodes of small-scale foodborne illness, intentionally caused or otherwise, may currently go unremarked but could be detected through improved surveillance methods.
One way to improve surveillance is to develop methods of identifying temporal and geographic clusters of events that may merit additional evaluation, rather than relying on purely temporal methods. This improvement can be efficient in that data suitable for such analysis are collected routinely by health maintenance organizations and physicians' groups. The data are increasingly available in a timely fashion as electronic medical records and other real-time data collection systems are adopted. Additionally, geographic information is often available (2). Spatial data are important: Clusters of interest may be localized and of insufficient size to be detected in analyses that consider only an entire region, and a lack of a specific location may interfere with treatment and prevention.
Available techniques that can be used with such data include scan statistic (3) and cu-sum (4) approaches, which explicitly aim at surveillance. The referenced articles also offer comprehensive surveys of spatiotemporal surveillance methods. In addition, time-series methods (5) could be employed, and spatial clustering techniques (6) could be adapted to temporal surveillance. As an alternative, we propose a method based on generalized linear mixed models (GLMM) (7).
MATERIALS AND METHODS
Conceptually, one might begin with an ordinary logistic regression model for the probability that an individual under surveillance is a case on a given day:

logit(Pr[case]) = β0 + β1spring + β2summer + β3fall + β4weekday,

where "spring," "summer," "fall," and "weekday" are indicator variables describing the season and a nonweekend day. For a given day, we could generate an estimate of the probability that an individual is a case using the estimated β parameters from the above model.
An improvement to this approach would be to incorporate geographic data. This would allow for the possibility that some areas contained populations more likely to become sick, such as older people, or more likely to seek health care, as might be the case for certain cultural groups. We might then modify the conceptual model:
logit(Pr[case]) = β0 + β1spring + β2summer + β3fall + β4weekday + αi, (1)

where i indexes the areas and each αi is the log odds of being a case in area i relative to the Ith of the I areas.
Instead of the ordinary logistic regression discussed conceptually above, we propose the use of GLMM (7) to estimate the probability that each subject under surveillance is a case, in each area, on a given day. (The approach can be used for other summary periods, but we use day as an example.) The proposed model takes the form
E(yijt|bi) = pijt and logit(pijt) = xijtβ + bi, (2)

where yijt indicates whether person j in area i is a case on day t, pijt is the probability that he/she is a case, xijt is a set of covariates measured for person j and/or area i and/or day t, β is a vector of fixed effects, and bi is a random effect for area i. One can think of the bis as allowing different baseline risks in each area i. The bis are typically assumed to have a multivariate normal distribution with a mean of 0.
The estimated bis are sometimes called the "shrinkage" estimators, because they are, loosely, a weighted average of the log odds ratio αi from model 1 and the mean of these log odds ratios. The weights depend on the number of persons in area i, among other things. The fewer persons in area i, the greater the weight of the overall mean; that is, the greater the shrinkage. The shrinkage estimators have a smaller mean squared error than the simple logistic regression estimators for the areas, though they are not unbiased. Shrinkage estimators are sometimes described as "borrowing strength" across the clusters (6). The borrowed strength is why we generally prefer the GLMM to the simple logistic regression.
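Loosely, this weighted-average behavior can be illustrated with a toy calculation. The weighting scheme and all counts below are invented for illustration; the GLMM derives its weights from the fitted variance components, not from a fixed prior weight.

```python
import math

def log_odds(cases, n):
    # Empirical log odds of being a case in an area (0.5 added to avoid log(0)).
    return math.log((cases + 0.5) / (n - cases + 0.5))

def shrink(cases, n, overall_mean, prior_weight=50.0):
    # Toy shrinkage: weight the area's own log odds by its size n and the
    # overall mean by a fixed "prior" weight. Small areas are pulled harder
    # toward the mean; large areas keep estimates close to their own data.
    w = n / (n + prior_weight)
    return w * log_odds(cases, n) + (1 - w) * overall_mean

overall = -7.0  # hypothetical overall mean log odds across areas

small = shrink(cases=2, n=40, overall_mean=overall)     # small area
large = shrink(cases=50, n=1000, overall_mean=overall)  # large area

# The small area is shrunk far toward the overall mean; the large area
# stays near its own raw log odds.
print(small, log_odds(2, 40))
print(large, log_odds(50, 1000))
```

The point of the sketch is only the direction of the pull: the 40-person area moves most of the way toward the overall mean, while the 1,000-person area barely moves.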
The covariates xijt could include 1) measures on the individuals j, such as their age and gender; 2) measures on area i or its population, such as the population density or the median income; or 3) descriptors of day t, such as measures of yearly patterns (be they seasonal indicators or trigonometric functions, etc.), the relative humidity, or the day of the week. Any covariates could vary over time. Finally, the model might incorporate autocorrelation by including recent days' counts as predictors.
We recommend that individual-level covariates be used whenever possible, since their predictive value may be great. In practice, however, their inclusion can make fitting models computationally difficult. In such cases, where individual-level covariates are excluded, the data are effectively a count of the number of cases and noncases in each area i on each day t. A simpler form of the model can then be used, with the notation
E(yit|bi) = nitpit and logit(pit) = xitβ + bi, (3)

where yit = Σj yijt is the number of cases in area i on day t, nit is the number of individuals in area i on day t, and pit is the probability that any resident of community i is a case on day t. Covariates that differ between members of any community are omitted from xijt, leading to the simpler xit. Model 3 might be referred to as "binomial logistic regression," as opposed to the Bernoulli logistic regression depicted in model 2, which is commonly referred to simply as logistic regression. The practical effect of this change is wholly computational, assuming that no individual-level covariates are used in either case. The interpretation of the model estimates is identical, but instead of Σi nit lines of data for each day, there are only I lines of data for each day.
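The collapse from person-level rows (model 2) to area-level counts (model 3) is purely mechanical. A minimal sketch, with invented person-level records:

```python
from collections import defaultdict

# Hypothetical person-level records: (area, day, is_case).
records = [
    ("tract_A", "1999-10-25", 1),
    ("tract_A", "1999-10-25", 0),
    ("tract_A", "1999-10-25", 0),
    ("tract_B", "1999-10-25", 1),
    ("tract_B", "1999-10-25", 1),
]

cases = defaultdict(int)   # y_it: number of cases in area i on day t
totals = defaultdict(int)  # n_it: number of individuals in area i on day t

for area, day, is_case in records:
    totals[(area, day)] += 1
    cases[(area, day)] += is_case

# One line of data per area per day instead of one per person.
for key in sorted(totals):
    print(key, cases[key], totals[key])
```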
In most practical applications, including the example presented below, we expect that individual-level covariates will be impractical, and we use the representation shown in model 3 throughout the remainder of the article.
The model is fitted using data from a "historical period" including only days before the surveillance day. Days included might be all available previous days, or certain days could be omitted because of known anomalies in counts or in conditions such as weather. After fitting of the model, estimated values of the fixed effects β̂ and the random effects b̂i are used to calculate p̂it, the estimated probability of being a case for a given surveillance day t in a given area i:

logit(p̂it) = xitβ̂ + b̂i.
Note that day t is specifically excluded from the historical period. This is the surveillance problem: to use the past to evaluate the present.
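Once the model is fitted, the estimated probability for a surveillance day is just the inverse logit of the linear predictor. A minimal sketch, with invented values standing in for the estimated fixed effects and one tract's random effect:

```python
import math

def inv_logit(eta):
    # Inverse of the logit: maps a log odds back to a probability.
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical estimates: intercept, a winter indicator, a weekday indicator.
beta = {"intercept": -7.5, "winter": 0.9, "weekday": 0.4}
b_i = 0.3  # hypothetical random effect for one census tract

# Linear predictor x_it * beta + b_i for a winter weekday in that tract.
eta = beta["intercept"] + beta["winter"] * 1 + beta["weekday"] * 1 + b_i
p_hat = inv_logit(eta)
print(p_hat)  # estimated probability that a resident is a case that day
```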
The probability of seeing z or more cases is calculated from p̂it, using the binomial probability mass function:

Pr(Z ≥ z) = 1 − Σ Pr(Z = k), with the sum taken over k = 0, ..., z − 1, and

Pr(Z = k) = (nit choose k) p̂it^k (1 − p̂it)^(nit − k).

This probability can be thought of as a p value for the null hypothesis that the cases in the area came from a binomial distribution with parameter p̂it.
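The tail probability can be computed directly from the binomial probability mass function; a straightforward implementation (the numbers in the usage line are arbitrary):

```python
from math import comb

def binom_tail(z, n, p):
    # Pr(Z >= z) for Z ~ Binomial(n, p), computed as
    # 1 minus the sum of Pr(Z = k) over k = 0, ..., z - 1.
    return 1.0 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(z))

# Probability of 2 or more cases among 100 people, each with probability 0.01.
print(binom_tail(2, 100, 0.01))  # about 0.264
```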
There can be many areas, and the surveillance is repeated daily. Thus, there may be a considerable problem with multiple testing. To avoid this, we suggest not reporting the p value directly. Instead, consider the following transformation: (p value × NT)^−1 = "expected time," where NT is a relevant number of tests.
The expected time is the number of times NT tests would have to be performed so that one would expect to see exactly one p value as small as or smaller than the one that was observed. For example, if NT = 100, a p value of 0.005 would be expected to appear once in two sets of 100 tests, that is, (0.005 × 100)^−1 = (0.5)^−1 = 2. A p value of 0.0001 would be expected to appear once in 100 sets of 100 trials. Note that because of the discrete nature of the binomial distribution, the expected time may be conservative (8). A useful NT would be the number of tests performed in a year; this would allow statements such as "the observed event is expected once every 10 years." One advantage of this approach is that larger values of the expected time are more alarming or unusual, as opposed to the counterintuitive interpretation of smaller p values as more unusual. It also allows the user to focus on the unusualness of the observed data in terms of time, rather than on attempting to evaluate the meaning of a p value in this context.
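The transformation itself is a one-liner; this sketch reproduces the two worked examples in the text:

```python
def expected_time(p_value, nt):
    # Number of rounds of nt tests needed before one p value this small
    # is expected: (p value x NT)^-1.
    return 1.0 / (p_value * nt)

print(expected_time(0.005, 100))   # 2.0: once in two sets of 100 tests
print(expected_time(0.0001, 100))  # 100.0: once in 100 sets of 100 tests
```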
Example setting and data
We previously described surveillance that uses an automated electronic medical record system among members of a Massachusetts health maintenance organization (2). Briefly, for each health-care encounter, a clinician enters diagnoses, to which International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM), codes are attached. Diagnoses are then grouped into syndromes (2). The diagnosis is available for data analysis immediately. The work was approved by the Harvard Pilgrim Health Care (Boston, Massachusetts) institutional review board.
Patients' addresses are geocoded, which provides the US Census tract of residence. Although the accuracy of commercial geocoding has recently been called into question (9), the geocoding used in this article attached an exact location to 94 percent of addresses. A census tract location was found for an additional 1 percent of subjects. The addresses are used for billing and other business purposes, which suggests that they are sufficiently accurate for surveillance. The data set used here represents the ambulatory medical encounters of approximately 240,000 geocodable individuals, about 10 percent of the population in the region of eastern Massachusetts described below.
A cluster of inhalational anthrax is one of the conditions the system is intended to identify. Typically, inhalational anthrax begins with a nonspecific prodromal phase in which the sufferer may experience fever, dyspnea, cough, and chest discomfort. During this phase, neither physical examination nor any widely used diagnostic test suggests an unusual illness. Diagnosis usually occurs after 2–4 days, when respiratory failure and hemodynamic collapse may ensue. By that time, chest radiographs show an unusual pattern of mediastinal widening (1).
The lower respiratory infection syndrome used in our system was designed to capture cases of anthrax in the prodromal phase. It incorporates 119 ICD-9-CM codes including influenza, pneumonia, bronchitis, and cough. As would be expected, incidence rates are much higher in the winter than in the summer (2). Spatial clusters as well as temporal clusters are expected, because of the contagious nature of illnesses that dominate the visits associated with lower respiratory infection.
The proposed statistical method is intended to alert authorities when local increases in lower respiratory infection occur. This will allow health officials to perform additional evaluation of affected individuals and/or advise clinicians to have heightened suspicion when evaluating other persons with respiratory illnesses. If the local increases reflect prodromal cases of anthrax, the system will help to alert authorities in a timely fashion. The nonanthrax cases of lower respiratory infection are the "noise" against which a "signal" of anthrax-related complaints must be detected. Because of the prodrome, an unusual cluster of cases caused by anthrax might be detected in this way before it was possible to diagnose its cause.
The data set used here incorporates cases of lower respiratory infection syndrome that occurred between January 1, 1996, and October 31, 1999. There were 80,683 lower respiratory infection episodes during this period for which the patient's residence was in one of the 529 census tracts with centroids in the Greater Boston, Massachusetts, area, between west longitudes 70.85 and 71.4 and north latitudes 42.15 and 42.65. A map depicting the number of patients residing in each census tract in October 1999 shows that while a handful of tracts had very few subjects, most had more than 400 (figure 1).
The model was fitted using the GLIMMIX macro (10) in SAS, version 8.2 (11). Note that GLIMMIX uses penalized quasilikelihood estimation, which causes some bias (12); the many repeated values for each census tract should have diminished the bias here. The advantage of this fitting method is that it uses a great deal less computer memory and time than methods that employ numeric integration. This is a meaningful advantage for data sets as large as the one used in this example. (In general, closed-form solutions are not available, so either approximation or numeric integration is necessary.)
We assumed that the random effects for the census tracts were independent. An assumption of spatially correlated random effects could also be appropriate; including spatial correlation would imply that areas close to one another have similar baseline risks of illness. However, a practical problem with incorporating such correlation is that much available software (e.g., S-Plus (13) or Stata (14)) does not allow spatially correlated random effects with longitudinally repeated nonnormal observations. SAS (10, 11) allows it with some modification but could not be employed in this large data set using available hardware. While BUGS (15), which uses Markov chain Monte Carlo methods to fit Bayesian models, might be suitable, this method requires a great deal of care and time, which would preclude the timely repeated fitting of the model necessary for ongoing surveillance. Use of a computationally complex method would also limit applicability because of the specialized knowledge required.
There is also a plausible rationale for avoiding spatial correlation in this example. This region around Boston is made up of many demographically and economically distinct towns and neighborhoods. Towns that border one another may share little besides physical proximity. In this context, smoothing across area may actually not be helpful; there is little reason to think that neighboring areas should have similar baseline risks.
RESULTS
Table 2 shows 11 census tracts approximately at the deciles of the random effects; we show the exact rank. We chose these 11 census tracts as an illustrative sample of the 525 populated census tracts in the model. In the table, we show the exponentiated random effects (interpreted as the odds ratio in the tract relative to the average tract), the estimated p̂it from the model, the number nit of subjects living in the census tract, the number of visits on October 25, 1999, and the probability of seeing this many or more cases. Note that only data through October 24, 1999, were used in estimating the model. A detailed example of the calculation of p̂it is provided in the Appendix.
Among these 11 census tracts, only one, tract 250173336, has a count that seems unusual, with a p value for two or more visits of 0.0062. This is Pr(≥2 cases) = 1 − Pr(0 cases) − Pr(1 case) = 1 − (0.999633)^315 − 315 × (0.000367) × (0.999633)^314, using the estimated p̂it = 0.000367 from table 2. For a day, NT = 525, so the expected time is 0.31 of a day ((0.0062 × 525)^−1 = 0.31), that is, more than three times every day. This count is unlikely to merit further attention.
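The arithmetic for tract 250173336 can be checked directly; this short script reproduces the p value and the expected time:

```python
p_hat = 0.000367  # estimated case probability for the tract, from table 2
n = 315           # subjects living in the tract

# Pr(>= 2 cases) = 1 - Pr(0 cases) - Pr(1 case)
p_value = 1 - (1 - p_hat) ** n - n * p_hat * (1 - p_hat) ** (n - 1)
expected_time = 1 / (p_value * 525)  # NT = 525 census tracts tested per day

print(round(p_value, 4))        # 0.0062
print(round(expected_time, 2))  # 0.31 of a day
```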
This simple example mimics surveillance: We fit the model using all of the data available up to a given point, estimate the parameters of the binomial distribution for a future point, and evaluate the observed counts. Table 2 shows the extent of the variability between census tracts and the effect that this can have. Three census tracts in the table had two visits, but while the numbers of subjects living in those census tracts were roughly similar, the p values differ by a factor of 3. Of the three, the census tract with the largest random effect has the largest p value, despite being only 80 percent the size of the largest census tract of the three.
We carry the example further in figure 2, where we plot the three census tracts with the smallest p values. (This limit was selected because three census tracts could be plotted comfortably with text.) We note the number of cases seen in these census tracts and the p value associated with that observed count. A p value of 0.0062 was calculated above to have an expected time of 0.31 days. A p value of 0.0036 has an expected time of 0.52 days. A p value of 0.00005 has an expected time of 38 days, meaning that we should expect counts this unusual only once every 38 days if the model is correct. This is unusual enough that a public health department might investigate further by evaluating the individuals contributing to the unusual count; these persons can be easily identified from health-care-provider records.
DISCUSSION
The logistic GLMM is not the only way to estimate the probability of being a case. An ordinary logistic regression model could be used, with an indicator variable for each area. We recommend the GLMM mainly because the size of the population under surveillance in each area is often variable, and mixed models are often used in such cases to "borrow strength" across the areas (6). A Poisson regression analysis, either using GLMM or with indicators, could also be an appropriate substitute. Finally, this is a relatively crude space-time model. More sophisticated models have appeared (17, 18) and could be adapted to the current surveillance purpose. However, the implementation of a sophisticated model in a large data set would be costly in time and effort. The relatively large scale of surveillance data sets does make some compromise attractive. This is particularly relevant because of the appeal of expanding real-time cluster detection to include considerably larger populations.
Data on individuals can be included with individual-level covariates, summarized within area, or stratified by some individual-level covariates and then summarized within area. We have alluded to the computational advantage of summarizing the data. Another advantage of summarizing within area is that tables with threshold values can be generated ahead of time, reducing surveillance to a simple look-up task once the model has been fitted. If individual-level covariates were included, the p value and the expected time would depend on which subjects were cases on a given day, making this useful and simple implementation impossible.
The proposed method has advantages and disadvantages relative to alternative techniques. Scan statistic (3) and spatial cu-sum (4) approaches can detect elevated counts over several contiguous areas in a natural fashion, while the proposed technique cannot. On the other hand, these techniques cannot easily incorporate continuous covariates and may be less sensitive to elevated rates in a single area. Time-series methods (5) detect elevated rates across an entire region, a situation against which the proposed method may lack power, but would be less sensitive to a smaller number of spatially focused cases, a particular strength of the proposed method.
There are several important limitations of the proposed statistical approach. Problems associated with any division of space are well known in cluster-identification research. For instance, a cluster divided between two areas may fail to attract attention (19). A similar question is how to choose among available spatial divisions. A conservatism of the "expected time" procedure is that it assumes that all tests are independent. This would be troubling if spatially correlated random effects were found useful; there would then be a discrepancy between the model and the evaluation method.
No analysis of a data set of this type is likely to detect an occurrence like the 2001 US bioterrorism attack using anthrax: There were too few cases, and they were more connected by work site than by home address. Nonetheless, it is easy to imagine an attack that could be detected in this kind of data set: exposure through the ventilation system of a restaurant, for example. In general, the data collection system described has an advantage in detecting ailments with nonspecific prodromes, such as anthrax, or nonspecific symptoms more generally. Arguably, this includes anthrax, botulism, plague, smallpox, and tularemiaall of the Centers for Disease Control and Prevention category A bioterrorism agents except viral hemorrhagic fever (20).
There are other limitations of the type of data considered in the example. The nonrepresentativeness of the population under surveillance raises questions about its suitability for bioterrorism surveillance. This concern is heightened by the fact that 5 percent of addresses in the data set could not be geocoded. Persons subject to surveillance under this system are generally better educated, wealthier, and more likely to be White than the general public. To the extent that these characteristics would protect them from exposure or decrease the chances that they would seek health care, the example data set would be less likely to reflect bioterrorism than a representative population. However, while the sample is not representative, it is quite large, which would help officials avoid completely missing a bioterrorist attack just by chance.
Additional work in statistical models for cluster surveillance will be required to resolve some of the limitations described here. One important practical development will be to assess adjacent areas with p values that are small but not small enough to suggest further investigations by themselves. A formal preliminary step for determination of the best spatial areas might be developed. The impact on sensitivity of varying proportions under surveillance in different areas should be assessed. Finally, as computing speed increases and coding efficiency improves, it may become practical to fit models that consider each individual, rather than summing data within an area.
ACKNOWLEDGMENTS
The authors appreciate the contributions of Inna Dashevsky, who generated the basic data sets used in the example, and Courtney Adams, who coordinated the project and provided valuable suggestions on the text. In addition, Drs. Nick Horton and Russ Wolfinger provided comments that contributed to or improved this paper.
APPENDIX
REFERENCES