1 Epidemiology and Information Sciences, INSERM Unit 444, Université Pierre et Marie Curie, Paris, France.
2 WHO Collaborating Center for Electronic Diseases Surveillance, Paris, France.
3 Assistance Publique-Hôpitaux de Paris, Paris, France.
Received for publication January 3, 2002; accepted for publication May 7, 2003.
ABSTRACT |
---|
communicable disease control; diffusion; epidemiologic methods; forecasting; influenza; statistics, nonparametric
Abbreviations: CV, cross-validation; ILI, influenza-like illnesses.
INTRODUCTION |
---|
A large body of work has been devoted to the real-time detection of influenza outbreaks, defined as some increase above a historical baseline threshold (5, 8, 13, 14). A limited range of approaches has been developed to predict the spread of the epidemic process (15-19). These approaches fall into two categories: those that model the diffusion mechanisms and those that model the epidemic curve.
Large-scale mathematical deterministic models, also known as susceptible-infected-recovered models, are based on the mechanism of serial person-to-person transmission (20) and describe the time and geographic spread of influenza (15-18). Application of these models to retrospective data in the former Soviet Union, in Western countries (16, 18), and on a global scale (15, 17) provided useful insight into the diffusion mechanisms but did not prove efficient for prospective forecasting. More flexible models such as chain binomial models allowed for simulation of the spread of influenza epidemics in structured micropopulations, that is, families or small cities (19). Although these models were potentially excellent exploratory tools, they were not designed primarily for larger communities because of their computational complexity (21).
A second approach is based on time-series modeling of the epidemic curve. Autoregressive seasonal linear models (22) have previously been applied to influenza surveillance data (5, 23). However, these models did not take into account the geographic correlations relating to the diffusion process and could not be adjusted to sudden changes in dynamics (21). Nevertheless, although these models were not used in the past for operational forecasts of influenza epidemics on a national or regional geographic scale, they can be considered reference models against which new methods should be tested.
The method of analogues is a nonparametric approach first developed by Lorenz in 1969 to forecast meteorologic time series (24). It was then applied for prediction purposes in other fields, including physics (25), finance (25), hydrology (26), and geophysics (27). A similar approach has also been used to detect chaotic behavior in epidemiologic time series (28, 29). The aim of the present study was to evaluate application of this approach to the prediction of surveillance data. First, we applied the method of analogues to prediction of weekly national incidences of influenza-like illnesses (ILI) in France. We then extended this method to forecasting of the regional spread of ILI epidemics.
MATERIALS AND METHODS |
---|
For this study, we used the time series of weekly ILI incidences for France and its 21 administrative regions. The series covered the 938 weeks spanning the period from November 1984 to October 2002, during which 18 epidemic seasons occurred. National and regional ongoing ILI incidence estimates are published on the following Web site (in French): http://www.u444.jussieu.fr/sentiweb. Epidemic weeks were defined according to a periodic seasonal regression model (13, 31), used routinely in the French Sentinel Network. A similar model is applied to pneumonia and influenza mortality surveillance data by the Centers for Disease Control and Prevention (Atlanta, Georgia) (32). Epidemic onset is defined as the first week during which the national ILI incidence exceeds a baseline nonepidemic threshold given by the upper limit of the 95 percent confidence interval of the periodic model, provided the incidence remains above this threshold for at least 2 consecutive weeks. During the epidemic periods spanning 1984-2002, national ILI incidence estimates ranged from 105 to 1,793 cases per 100,000. For each epidemic, we also defined a preepidemic period as the 4 weeks immediately preceding onset of the epidemic. The preepidemic period captured periods of early increase in influenza activity prior to epidemic onset. The number of cases rises abruptly at the start of an influenza epidemic (21), and, in the fourth week before epidemic onset, the incidence was only 12 percent (range, 8-16 percent) of that observed during the epidemic. Prior to the fourth week before epidemic onset, influenza activity was even lower and hence was considered negligible. The total duration of the epidemic and preepidemic periods was 241 weeks.
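The onset rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the Sentinel Network's implementation: the threshold series is assumed to be supplied already (the periodic regression model that produces it is not reproduced here), the function names `find_epidemic_onset` and `preepidemic_weeks` are hypothetical, and "at least 2 consecutive weeks" is read as a run of two weeks that includes the onset week.

```python
from typing import Optional, Sequence


def find_epidemic_onset(incidence: Sequence[float],
                        threshold: Sequence[float],
                        min_weeks: int = 2) -> Optional[int]:
    """Return the index of the first week opening a run of at least
    `min_weeks` consecutive weeks with incidence above the threshold."""
    run_length = 0
    for week, (observed, limit) in enumerate(zip(incidence, threshold)):
        if observed > limit:
            run_length += 1
            if run_length == min_weeks:
                return week - min_weeks + 1  # first week of the run
        else:
            run_length = 0  # the run is broken, start counting again
    return None  # no onset found in this stretch of the series


def preepidemic_weeks(onset: int, length: int = 4) -> range:
    """Weeks immediately preceding onset, clipped at the series start."""
    return range(max(0, onset - length), onset)


# Hypothetical incidences (cases per 100,000) and a flat threshold.
incidence = [40, 55, 120, 60, 130, 150, 300, 500]
threshold = [100] * len(incidence)
onset = find_epidemic_onset(incidence, threshold)
```

In this toy series the isolated exceedance in the third week is ignored because the following week falls back below the threshold.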
Principle of the method of analogues
Time-series prediction by using the method of analogues has been described in detail elsewhere (25, 33, 34). Here, we first briefly introduce the method of analogues for predicting national ILI incidences and then describe its extension to multivariate time series for predicting regional incidences.
National ILI incidence forecasts
Let I(t) denote the observed national ILI incidence at week t, 1 ≤ t ≤ N, where N = 938 denotes the total number of weeks in the surveillance record. Suppose we wish to forecast the future from current week T. The principle of the method of analogues is to select historical sections of the time series that most closely match the observations at T. Prediction of future observations is based on the values that follow these closely matching sections (table 1 and figure 1).
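We define influenza activity at current week T by the vector X(T) = (I(T - r)), 0 ≤ r ≤ l, and use a distance criterion to compare it with historical vectors X(t), l + 1 ≤ t < T. The best matches will be referred to as the "nearest neighbors" of X(T). We write {X(T_1), ..., X(T_i), ..., X(T_v)}, 1 ≤ i ≤ v, for the v nearest neighbors of X(T) and F(T + h) for the h-week-ahead forecast from week T (h ≥ 1, l + 1 ≤ T ≤ N - h). F(T + h) is computed as the weighted mean of the incidences that follow the nearest neighbors; therefore,

$$F(T + h) = \frac{\sum_{i=1}^{v} w_i \, I(T_i + h)}{\sum_{i=1}^{v} w_i},$$

where w_i is the weight assigned to neighbor i (we used previously published weights (34); details below).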
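The national forecast can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the text: weeks are indexed from 0, the distance is a sum of squared differences, only analogues whose follow-up value is already observed at week T are used, and the function name `analogue_forecast` and the small constant guarding against division by zero are hypothetical.

```python
from typing import List, Sequence, Tuple


def analogue_forecast(series: Sequence[float], T: int, h: int, l: int, v: int,
                      inverse_distance_weights: bool = False) -> float:
    """h-week-ahead forecast from week T (0-based) by the method of analogues."""
    current = series[T - l:T + 1]  # X(T) = I(T - l), ..., I(T)
    candidates: List[Tuple[float, int]] = []
    for t in range(l, T - h + 1):  # the value at t + h must already be known
        window = series[t - l:t + 1]
        distance = sum((x - y) ** 2 for x, y in zip(current, window))
        candidates.append((distance, t))
    candidates.sort()  # nearest neighbours first
    neighbours = candidates[:v]
    if not neighbours:
        raise ValueError("not enough history to select any neighbour")
    if inverse_distance_weights:
        # Assumed form: weight proportional to 1 / distance.
        weights = [1.0 / (d + 1e-12) for d, _ in neighbours]
    else:
        weights = [1.0] * len(neighbours)  # equally distributed weights
    weighted_sum = sum(w * series[t + h] for w, (_, t) in zip(weights, neighbours))
    return weighted_sum / sum(weights)


# Hypothetical, perfectly periodic series: the analogues repeat exactly.
toy_series = [1.0, 2.0, 3.0, 4.0] * 5
one_week_ahead = analogue_forecast(toy_series, T=17, h=1, l=1, v=3)
two_weeks_ahead = analogue_forecast(toy_series, T=17, h=2, l=1, v=3)
```

On real surveillance data the neighbours never match exactly, which is why the number of neighbours v and the window length l have to be tuned.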
Regional ILI incidence forecasts
In the setting of the geographic spread of an epidemic disease, we now consider multivariate time series, that is, the set of {I_k(t)}, where I_k(t) denotes the observed ILI incidence at week t, 1 ≤ t ≤ N, in region k, 1 ≤ k ≤ 21. We define influenza activity at current week T by the matrix X(T) = (I_k(T - r)), 0 ≤ r ≤ l, 1 ≤ k ≤ 21, and use a distance criterion to compare it with historical matrices expressed as

$$X(t) = \bigl(I_k(t - r)\bigr), \quad 0 \le r \le l, \quad 1 \le k \le 21, \quad l + 1 \le t < T.$$

Computation of the h-week-ahead forecast in region k, similar to that described in the preceding section, follows as

$$F_k(T + h) = \frac{\sum_{i=1}^{v} w_i \, I_k(T_i + h)}{\sum_{i=1}^{v} w_i},$$

where the v nearest neighbors {X(T_1), ..., X(T_i), ..., X(T_v)}, 1 ≤ i ≤ v, are selected by minimization of the distance criterion, and w_i is the weight assigned to neighbor i. Note that, for a given week, identical weights w_i are used in all regions.
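The regional extension can be sketched along the same lines. In this illustration the neighbours are chosen once from the whole matrix of regional incidences and then reused for every region, as the text describes. The 0-based week index, the sum-of-squares distance over all regions and lags, the equal weights, and the name `regional_analogue_forecast` are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def regional_analogue_forecast(regions: Sequence[Sequence[float]],
                               T: int, h: int, l: int, v: int) -> List[float]:
    """Forecast every region h weeks ahead from week T (0-based),
    using one shared set of nearest neighbours."""

    def window(t: int) -> List[Sequence[float]]:
        # Matrix X(t): one row per region, columns I_k(t - l), ..., I_k(t).
        return [series[t - l:t + 1] for series in regions]

    current = window(T)
    candidates: List[Tuple[float, int]] = []
    for t in range(l, T - h + 1):  # follow-up week t + h must be observed
        distance = sum((x - y) ** 2
                       for cur_row, hist_row in zip(current, window(t))
                       for x, y in zip(cur_row, hist_row))
        candidates.append((distance, t))
    candidates.sort()
    neighbours = candidates[:v]
    if not neighbours:
        raise ValueError("not enough history to select any neighbour")
    # Equal weights, identical in all regions for the given week.
    return [sum(series[t + h] for _, t in neighbours) / len(neighbours)
            for series in regions]


# Two hypothetical regions with periodic incidences.
region_a = [1.0, 2.0, 3.0, 4.0] * 5
region_b = [10.0, 20.0, 30.0, 40.0] * 5
regional = regional_analogue_forecast([region_a, region_b], T=17, h=1, l=1, v=3)
```

Because the distance sums over regions, regions with larger incidences weigh more in the neighbour selection; whether and how to rescale is a design choice this sketch leaves open.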
Parameter estimation
We used a cross-validation (CV) criterion based on the root mean square error (35) to select the combination of parameters (l, v, and wi) that yielded the most accurate forecasts in the retrospective series. The CV criterion is defined as the error that occurs when predicting the future from week T, excluding the incidences surrounding I(T) from the library of historical observations. Specifically, we exclude the l consecutive incidences preceding I(T) and the h consecutive incidences following I(T). This algorithm avoids redundancy between the forecasts and the model (28, 29). The CV criterion for an h-week-ahead national prediction was expressed as
where I(T + h) is the observed national incidence at week T + h, F(T + h) is the forecasted national incidence at week T + h, and Ī is the average national incidence. At a regional level, the CV criterion for an h-week-ahead prediction in region k was expressed as
where I_k(T + h) is the observed incidence at week T + h in region k, F_k(T + h) is the forecasted incidence at week T + h in region k, and Ī_k is the average incidence in region k. The mean error for the h-week-ahead prediction of ILI incidences in the 21 regions was estimated as
A grid search was conducted in the parameter space (for h = 1-10 weeks, with v = 2-16 nearest neighbors, l = 0-10 weeks, and w_i = equally distributed weights or w_i = weights proportional to the inverse of the distance criterion (34)) to select the combination of values that minimized CV_h over the 241 epidemic and preepidemic weeks.
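The cross-validation and grid search can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the error is a plain root mean square error (the normalization by the average incidence is not reproduced), the library contains every historical section that does not overlap the block from T - l to T + h, weights are equal, and the names `cv_forecast`, `cv_rmse`, and `grid_search` are hypothetical.

```python
import math
from itertools import product
from typing import Iterable, Sequence, Tuple


def cv_forecast(series: Sequence[float], T: int, h: int, l: int, v: int) -> float:
    """Forecast I(T + h) using analogues drawn from the whole record,
    except sections that overlap the weeks T - l, ..., T + h."""
    current = series[T - l:T + 1]
    candidates = []
    for t in range(l, len(series) - h):
        if t - l <= T + h and t + h >= T - l:
            continue  # section overlaps the excluded block around T
        window = series[t - l:t + 1]
        distance = sum((x - y) ** 2 for x, y in zip(current, window))
        candidates.append((distance, t))
    candidates.sort()
    neighbours = candidates[:v]
    return sum(series[t + h] for _, t in neighbours) / len(neighbours)


def cv_rmse(series: Sequence[float], weeks: Sequence[int],
            h: int, l: int, v: int) -> float:
    """Root mean square error of the h-week-ahead forecasts over `weeks`."""
    squared = [(series[T + h] - cv_forecast(series, T, h, l, v)) ** 2 for T in weeks]
    return math.sqrt(sum(squared) / len(squared))


def grid_search(series: Sequence[float], weeks: Sequence[int], h: int,
                l_values: Iterable[int], v_values: Iterable[int]) -> Tuple[int, int]:
    """Return the (l, v) pair with the smallest cross-validated error."""
    return min(product(l_values, v_values),
               key=lambda lv: cv_rmse(series, weeks, h, lv[0], lv[1]))


# Hypothetical series in which the value 2 is followed by 3 or by 1,
# so one week of history (l = 1) is needed to forecast correctly.
toy_series = [1.0, 2.0, 3.0, 2.0] * 10
weeks = list(range(10, 30))
best_l, best_v = grid_search(toy_series, weeks, h=1, l_values=[0, 1], v_values=[2])
```

In the toy series a window of length zero cannot tell the rising 2 from the falling 2, so the search prefers l = 1.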
Overall accuracy of predictions
We used two measures of prediction accuracy to evaluate the method of analogues: the correlation coefficient and the root mean square error between observed and forecasted incidences, denoted by CVh in the text (25, 29, 33, 34, 36). We defined the prediction horizon (denoted by h in the text) as the number of weeks in advance that the prediction is made. Both measures of prediction accuracy were plotted against the prediction horizon.
Comparisons with linear methods
The forecasts estimated by using the method of analogues were compared with those estimated by using the "naive" method of prediction and by using regional linear autoregressive models (29, 36). The naive method constitutes a baseline in the comparison. It is defined as the method in which predicted incidences are equal to the current incidence, hence F(T + h) = I(T) for national predictions and F_k(T + h) = I_k(T) for regional predictions, h ≥ 1.
An autoregressive model of order p can be written as I*(T + 1) = φ_0 + φ_1 I*(T) + φ_2 I*(T - 1) + ... + φ_p I*(T - p + 1) + ε(T), where I*(.) is the detrended series of incidence measures, φ_i, 0 ≤ i ≤ p, are autoregressive parameters to be estimated from the sample data, and ε(.) are independent random normal deviates. For the 21 regions separately, the series of incidence measures were detrended, and the autocovariances were computed. The autoregressive parameters were calculated from the autocovariances in a Yule-Walker framework, and the model was built with a backward selection procedure (Forecast, SAS software, version 8; SAS Institute, Inc., Cary, North Carolina). To account for the seasonality of the disease, we allowed for annual terms to be included in the model.
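The Yule-Walker step can be illustrated with a small pure-Python sketch. It is not the SAS Forecast procedure: detrending is reduced to subtracting the sample mean, the backward selection and annual terms are omitted, the Yule-Walker equations are solved with the Levinson-Durbin recursion (one standard way to do so), and all function names are hypothetical.

```python
from typing import List, Sequence


def autocovariances(series: Sequence[float], max_lag: int) -> List[float]:
    """Sample autocovariances at lags 0..max_lag of the mean-centred series."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    return [sum(centred[t] * centred[t + lag] for t in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def yule_walker(r: Sequence[float], p: int) -> List[float]:
    """Solve the Yule-Walker equations for an AR(p) model
    with the Levinson-Durbin recursion."""
    phi: List[float] = []
    error = r[0]
    for k in range(1, p + 1):
        acc = r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))
        reflection = acc / error
        phi = [phi[j] - reflection * phi[k - 2 - j] for j in range(k - 1)]
        phi.append(reflection)
        error *= 1.0 - reflection ** 2
    return phi


def ar_one_step_forecast(series: Sequence[float], p: int) -> float:
    """One-week-ahead forecast from the end of the series."""
    mean = sum(series) / len(series)
    phi = yule_walker(autocovariances(series, p), p)
    recent = list(series)[::-1][:p]  # I(T), I(T - 1), ..., I(T - p + 1)
    return mean + sum(c * (x - mean) for c, x in zip(phi, recent))


# Autocovariances of an AR(1) process with coefficient 0.5:
# fitting an AR(2) should give a second coefficient of zero.
coefficients = yule_walker([1.0, 0.5, 0.25], p=2)
# Hypothetical alternating series: strong negative lag-1 dependence.
alternating = [1.0, -1.0] * 50
next_value = ar_one_step_forecast(alternating, p=1)
```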
The performances of the three methods (naïve, autoregressive, and analogues) were assessed by computing the correlation coefficient and the root mean square error between predicted and observed incidences (29, 36). The prediction horizon ranged from 1 to 10 weeks for the 241 epidemic and preepidemic weeks.
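The two accuracy measures, and the naive forecast they are applied to, can be written compactly in Python. The sketch below assumes 0-based week indices and a plain (unnormalized) root mean square error; the function names and the toy series are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between observed and predicted incidences."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted))
                     / len(observed))


def naive_forecasts(series: Sequence[float], weeks: Sequence[int]) -> List[float]:
    """Naive method: F(T + h) = I(T), whatever the horizon h."""
    return [series[T] for T in weeks]


# Hypothetical steadily rising series, evaluated 1 week ahead.
toy_series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
weeks = [0, 1, 2, 3]
h = 1
observed = [toy_series[T + h] for T in weeks]
predicted = naive_forecasts(toy_series, weeks)
naive_correlation = pearson(observed, predicted)
naive_error = rmse(observed, predicted)
```

The toy case shows why both measures are reported: the naive forecast is perfectly correlated with the observations yet is off by one unit every week.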
Mapping of the 1999-2000 influenza epidemic
Observed versus predicted regional incidences were mapped for the 1999-2000 influenza epidemic. Spearman correlation coefficients were used to compare the two series of incidences.
RESULTS |
---|
National ILI incidence forecasts
The correlation coefficients between 1-, 2-, and 3-week-ahead predicted and observed national incidences were, respectively, 0.90, 0.76, and 0.63 over the 241 epidemic and preepidemic weeks (figure 4). Over the same period, the corresponding coefficients obtained with the "naïve" method were, respectively, 0.84, 0.53, and 0.21. Although the correlation decreased with the prediction horizon, the degree of accuracy of the 3-week-ahead forecasts by the method of analogues was high. For a prediction horizon of 10 weeks, the correlation coefficient was still 0.58 for the method of analogues, compared with 0.19 for the "naïve" method.
The autoregressive method gave higher correlation coefficients (similarly, a lower root mean square error) than the naive method did, but the difference was not large (figure 5). In the same way as for national predictions, the quality of the predictions decreased with the prediction horizon. For 1- to 10-week-ahead predictions, the correlation coefficients between the observed and forecasted regional incidences ranged from 0.81 to 0.66 for the method of analogues and from 0.73 to 0.09 for the autoregressive models (p < 0.001). For the method of analogues, there was a relative decrease of 21 percent in the correlation coefficients between the 1- and 10-week-ahead predictions. The corresponding decreases were 112 percent and 125 percent, respectively, for the autoregressive and naive methods. The root mean square error was 10 percent lower with the method of analogues for 1-week-ahead predictions and 105 percent lower for 10-week-ahead predictions. Hence, for up to 10-week-ahead predictions, the method of analogues provided better predictions than the autoregressive and naive methods.
Mapping of the 1999-2000 influenza epidemic
Extension of the method of analogues to regional ILI incidence forecasts is illustrated in figure 6 by a set of maps corresponding to the 1999-2000 influenza season. The first set of predictions was based on the influenza activity recorded during the first week of December 1999 (i.e., the week of epidemic onset, week 49 of 1999, started on December 6 (figure 6a)). The method of analogues forecasted the regional incidence levels for the last 3 weeks of December 1999, with a correlation coefficient between the observed and forecasted incidences of 0.68 (p < 0.001). A second set of predictions was based on the influenza activity recorded during the last week of December (i.e., 1 week before peak incidence, week 52 of 1999, started on December 27 (figure 6b)). The method of analogues forecasted the regional incidence levels for the first 3 weeks of the year 2000, with a correlation coefficient between the observed and forecasted incidences of 0.78 (p < 0.001). Because information about the incidence of influenza is often delivered dichotomously, that is, as being above or below the epidemic threshold, we also examined the accuracy of the forecasts in this discrete way. Thus, for the first 3 weeks of 2000, the numbers of regions for which the observed incidence exceeded the threshold were, respectively, 20, 19, and 19. For 18 of the 21 regions, the method of analogues forecasted an epidemic status that matched the observed one over the whole 3 weeks.
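The dichotomous comparison can be sketched as below. The data are invented for illustration (three regions, three weeks), each region is assumed to have a single constant threshold, and the function names are hypothetical.

```python
from typing import Sequence


def regions_above_threshold(incidences: Sequence[Sequence[float]],
                            thresholds: Sequence[float], week: int) -> int:
    """Number of regions whose incidence exceeds the threshold in a given week."""
    return sum(1 for series, limit in zip(incidences, thresholds)
               if series[week] > limit)


def count_matching_regions(observed: Sequence[Sequence[float]],
                           forecast: Sequence[Sequence[float]],
                           thresholds: Sequence[float]) -> int:
    """Number of regions whose forecasted epidemic status (above or below
    the threshold) matches the observed status in every week."""
    matches = 0
    for obs, fc, limit in zip(observed, forecast, thresholds):
        if all((o > limit) == (f > limit) for o, f in zip(obs, fc)):
            matches += 1
    return matches


# Hypothetical incidences: rows are regions, columns are forecast weeks.
observed = [[120, 150, 90], [80, 130, 140], [60, 70, 65]]
forecast = [[110, 160, 95], [105, 120, 150], [50, 60, 55]]
thresholds = [100, 100, 100]
matching = count_matching_regions(observed, forecast, thresholds)
above_first_week = regions_above_threshold(observed, thresholds, week=0)
```

Here the second region counts as a mismatch because the forecast crosses the threshold one week too early, even though the other two weeks agree.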
DISCUSSION |
---|
A reason that might explain why the method of analogues outperforms autoregressive-like models in influenza series is the absence of exact periodicity regarding this disease. Although influenza epidemics occur in wintertime, the time of year of epidemic onset varies between November and March in France as well as in other temperate areas of the Northern Hemisphere (20). For this reason, models including seasonal terms, as autoregressive-like models do, do not fit influenza series well. The method of analogues is nonparametric and makes no assumption about the distribution or seasonality of the disease, which may explain the improvement in fit. In addition, several authors have suggested that local models (such as the method of analogues) outperformed global models (such as autoregressive-like models), especially when the system under study was complex (33, 37). In particular in epidemiology, Sugihara and May (36) reported a similar result in their study of measles time series.
Various distance criteria have been used for past applications of the method of analogues (24-27, 33, 34, 36, 37), but, to our knowledge, the issue of whether the forecasts are sensitive to the choice of distance criteria has not been studied. In our study, we used the Euclidean distance criterion defined by the squared differences of raw incidence rates. ILI incidences are likely to be Poisson distributed, whereas this distance criterion is more appropriate for normal random variables. We therefore also studied three other distance criteria: a Euclidean distance applied to the log-transformed incidences, an exponentially weighted Euclidean distance, and a chi-square-type distance. In particular, we used the log transformation as a variance-stabilizing transformation to obtain normal-like variables. Similarly, the chi-square-type distance is more appropriate for count data characterized by increasing variance with increasing mean. This analysis revealed that the selection of neighbors was almost identical for all of the distance criteria; hence, the forecasts were not substantially different (data not shown). The choice of the distance criterion thus had very little impact on this study.
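The four distance criteria can be compared on a toy library of historical sections. The exact forms used in the study are not given in the text, so the definitions below are assumptions: the log distance adds an offset of 1 before taking logarithms, the exponential weights halve with each week of age, and the chi-square-type distance divides each squared difference by the sum of the two incidences. All names and data are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def log_euclidean(a: Vector, b: Vector, offset: float = 1.0) -> float:
    # The offset avoids log(0) for weeks without reported cases.
    return sum((math.log(x + offset) - math.log(y + offset)) ** 2
               for x, y in zip(a, b))


def exp_weighted_euclidean(a: Vector, b: Vector, decay: float = 0.5) -> float:
    # Windows are ordered oldest to newest; the newest week has weight 1.
    n = len(a)
    return sum(decay ** (n - 1 - j) * (x - y) ** 2
               for j, (x, y) in enumerate(zip(a, b)))


def chi_square_type(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)


def nearest_neighbours(current: Vector, library: Sequence[Vector],
                       distance: Callable[[Vector, Vector], float],
                       v: int) -> List[int]:
    """Indices of the v library sections closest to the current one."""
    ranked = sorted((distance(current, section), i)
                    for i, section in enumerate(library))
    return [i for _, i in ranked[:v]]


# Hypothetical current activity and historical sections.
current = [100.0, 200.0]
library = [[110.0, 190.0], [300.0, 400.0], [90.0, 210.0], [10.0, 20.0]]
criteria = [euclidean, log_euclidean, exp_weighted_euclidean, chi_square_type]
selections = [sorted(nearest_neighbours(current, library, d, v=2)) for d in criteria]
```

In this small example all four criteria pick the same two sections, which mirrors the near-identical neighbour selection reported above without demonstrating it in general.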
The method of analogues was applied here to surveillance data collected during interpandemic periods. An influenza pandemic is a major epidemic due to a completely novel or reemerging influenza virus spreading on a global scale. The viruses isolated and reported in France during the 18 influenza epidemics that we studied were restricted to the types and subtypes isolated worldwide in humans in recent years: B, A/H1N1, and A/H3N2. In the case of a pandemic, the method of analogues would probably yield worse performances than those reported here. During a pandemic, incidences are expected to be much higher than those recorded during interpandemic periods. For a pandemic, other types of models, such as susceptible-infected-recovered mechanistic models similar to those used in the retrospective forecast of the global spread of the 1968-1969 Hong Kong influenza pandemic (15), would probably be more appropriate.
Exogenous covariates such as meteorologic or demographic factors may alter the diffusion process of influenza (21), but the determinants of the spread of this disease are still controversial (20). The method of analogues could be refined so as to include these covariates in the process of selecting the nearest neighbors, that is, in the definition of influenza activity. In previous applications of this method, a weighting algorithm was implemented to assign relative importance to covariates on the basis of the magnitude of their association with the dynamics under study (27). However, for influenza, these covariates and their corresponding weights are not yet known.
In conclusion, the method of analogues described here constitutes a nonparametric approach that, at least during interpandemic periods, is available for forecasting the diffusion of influenza epidemics. This method makes extensive use of past observed epidemic patterns to estimate the temporal and geographic dynamics of the diffusion process. The method of analogues is probably suitable for predicting other communicable diseases such as acute diarrhea, as long as historical observations are numerous enough to provide an extensive description of likely patterns and the disease displays recurring cycles. Like any other prediction method, it also requires real-time collection of data to make prediction worthwhile. Lastly, the method of analogues is a self-learning process, which allows for accuracy to improve with the length of the time series (24, 33).
ACKNOWLEDGMENTS |
---|
NOTES |
---|
REFERENCES |
---|