a Medical Research Council (South Africa), 771 Umbilo Road, Congella, Durban 4001, South Africa.
b Malaria Research and Training Center DEAP/FMPOS, Université du Mali, Bamako, Mali.
c Department of Statistics and Biometry, University of Natal, Pietermaritzburg, South Africa.
Reprint requests to: Immo Kleinschmidt, Medical Research Council (South Africa), 771 Umbilo Road, Congella, Durban 4001, South Africa. E-mail: kleinsci{at}mrc.ac.za
Abstract |
---|
Method We describe, by way of an example, a simple two-stage procedure for producing maps of predicted risk: we use logistic regression modelling to determine approximate risk on a larger scale and we employ geo-statistical (kriging) approaches to improve prediction at a local level. Malaria prevalence in children under 10 was modelled using climatic, population and topographic variables as potential predictors. After the regression analysis, spatial dependence of the model residuals was investigated. Kriging on the residuals was used to model local variation in malaria risk over and above that which is predicted by the regression model.
Results The method is illustrated by a map showing the improvement of risk prediction brought about by the second stage. The advantages and shortcomings of this approach are discussed in the context of the need for further development of methodology and software.
Keywords Malaria risk, disease maps, geo-statistics, spatial analysis, kriging, climatic factors
Accepted 22 July 1999
Introduction |
---|
The production of malaria maps relies on modelling to predict the risk for most of the map, with actual observations of malaria prevalence usually only known at a limited number of specific locations. Accurate prediction of risk is dependent on knowledge of a number of environmental and climatic factors that are related to malaria transmission.6–8 However, the estimation is complicated by the fact that there is often local variation of risk that cannot easily be accounted for by the known covariates. A further complication arises from the fact that data points of measured malaria prevalence are not evenly or randomly spread across a country, but are often closely clustered in areas of high risk. Any modelling of risk has to take account of spatial autocorrelation of the data, and allow for local deviation from predictions that are based on the known climatic covariates.
In this project a two-stage procedure was followed: (1) generalized linear regression modelling was applied to determine approximate risk on a larger scale by identifying important climatic and environmental determinants and (2) the geo-statistical kriging method was used to improve prediction at a local level.
Data Collection and Data Preparation |
---|
For each survey the total sample size and number of individuals testing positive was known. The geographical co-ordinates of each survey were established using paper maps, electronic maps and global positioning systems. The distribution of surveys across Mali was uneven, with higher concentrations of surveys in more densely populated areas and in areas where malaria risk was perceived to be high. The location of each survey is shown in Figure 1.
All climatic variables were available as long-term averages for each calendar month, but not by individual year. The individual monthly averages of the climatic variables are highly correlated within climatic seasons. The question arises over what period climatic variables should be sensibly averaged. The shorter the aggregation period the stronger the likelihood of a high degree of serial autocorrelation in the values. For the purpose of selecting climatic variables for explaining the variation in malaria prevalence it was decided to average monthly climatic data over climatic seasons in order to reflect the variation in weather. Temperature and rainfall were averaged over 3-month periods, with the first quarter starting in December to coincide with the beginning of the dry season. The vegetation index NDVI was aggregated over two 6-month periods corresponding approximately to the dry season (December–May) and the wet season (June–November), respectively.
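The seasonal aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `seasonal_means` and the monthly rainfall values are hypothetical, and only the period lengths and the December start come from the text.

```python
from statistics import mean

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def seasonal_means(monthly, start_month="Dec", period=3):
    """Average 12 long-term monthly values over consecutive periods.

    monthly: dict mapping month abbreviation to a long-term monthly average.
    Returns a list of (label, mean) tuples; the first period starts at start_month.
    """
    start = MONTHS.index(start_month)
    # Re-order the calendar so that it begins at the chosen start month.
    ordered = [MONTHS[(start + k) % 12] for k in range(12)]
    out = []
    for i in range(0, 12, period):
        block = ordered[i:i + period]
        label = f"{block[0]}-{block[-1]}"
        out.append((label, mean(monthly[m] for m in block)))
    return out

# Hypothetical long-term monthly rainfall (mm) at one location, for illustration only.
rain = dict(zip(MONTHS, [0, 0, 2, 10, 40, 90, 180, 220, 130, 40, 5, 0]))

quarters = seasonal_means(rain, "Dec", 3)  # 3-month periods, as for temperature and rainfall
halves = seasonal_means(rain, "Dec", 6)    # two 6-month periods, as for NDVI

for label, value in quarters + halves:
    print(label, round(value, 1))
```

The same helper covers both aggregation schemes because only the period length differs between the quarterly climate variables and the half-yearly NDVI.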
Methods and Results |
---|
Regression analysis
The relationship between malaria parasite prevalence and each individual potential explanatory variable was first investigated by inspection of scatter-plots and by single variable regression analysis. Since parasite prevalence data are binomial fractions, a logistic regression model for grouped (blocked) data was used as is standard practice for the analysis of such data.12 Predictions of prevalence made from the logistic model will always fall within the interval 0 to 1. Larger surveys are implicitly accorded more weight than the smaller ones. The glm command in the statistical package STATA13 was used for the analysis.
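To illustrate why larger surveys carry more weight in a grouped logistic model, the sketch below fits logit(p) = b0 + b1·x to grouped binomial data by Newton-Raphson. It is an illustration under stated assumptions and not a reproduction of the STATA analysis: the single covariate, the function name `fit_grouped_logistic` and the survey counts are all hypothetical.

```python
import math

def fit_grouped_logistic(x, positives, totals, iters=25):
    """Fit logit(p) = b0 + b1*x to grouped binomial data by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, positives, totals):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = ni * p * (1.0 - p)  # binomial weight: grows with survey size ni
            r = yi - ni * p         # observed minus expected number positive
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system (X'WX) * step = X'(y - n*p) explicitly.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical surveys: covariate value, number positive, number examined.
x = [0.0, 1.0, 2.0, 3.0]
positives = [5, 20, 30, 120]
totals = [50, 100, 80, 200]

b0, b1 = fit_grouped_logistic(x, positives, totals)
predicted = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
print(round(b0, 3), round(b1, 3), [round(p, 3) for p in predicted])
```

Each survey enters the fit through its counts rather than its proportion, so a survey of 200 children influences the estimates more than one of 50, and the back-transformed predictions necessarily lie between 0 and 1.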
Each of the explanatory variables was adjusted for all of the others by performing multiple regression in the usual way. Non-linearity in the relationship between parasite prevalence and a predictor variable was explored by adding polynomial terms and then grouping the values of continuous variables into categorical ones. Variable selection for the multiple logistic regression model was carried out by a combination of automatic (stepwise) procedures, goodness-of-fit criteria and by using judgement in selecting variables that explain malaria prevalence in terms of vector, host and parasite dynamics of malaria. An additional criterion for selection of the final model was the degree of spatial correlation of the model residuals (see below).
The final multiple logistic regression model contained four significant explanatory variables for the prediction of malaria prevalence. These were distance to water (categorical), average NDVI during the wet season (June–November, also categorical), number of months with >60 mm rainfall, and average maximum temperature during the quarter March–May. The detailed results are discussed in the companion paper. Table 1 summarizes these results.
For each variable used in the model an image covering the whole of Mali was produced in the GIS package IDRISI.15 In the case of categorical variables this entailed creating the equivalent Boolean indicator variables as used in the statistical model. The prediction formula of the model was then used with the IDRISI image calculator to produce a prediction image. The predicted risks were then grouped into four categories: <10%, 10–30%, 30–70% and >70%. As an additional validation exercise, the predicted frequencies in these four categories were compared with those of the known values. Of the 101 survey results, 70 fall within their predicted group. The resulting map of malaria risk is shown in Figure 2.
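The validation step, grouping prevalences into the four risk categories and counting how many surveys fall in their predicted group, might be sketched like this. The handling of values lying exactly on a category boundary is an assumption (the text does not specify it), and the observed and predicted values are hypothetical.

```python
def risk_category(p):
    """Assign a prevalence (0..1) to one of the four map categories.

    Boundary handling is an assumption: 0.10 and 0.30 go to the higher
    category, 0.70 stays in the 30-70% category.
    """
    if p < 0.10:
        return "<10%"
    if p < 0.30:
        return "10-30%"
    if p <= 0.70:
        return "30-70%"
    return ">70%"

def agreement(observed, predicted):
    """Count surveys whose observed and predicted prevalence share a category."""
    return sum(risk_category(o) == risk_category(p)
               for o, p in zip(observed, predicted))

# Hypothetical observed and model-predicted prevalences for five surveys.
observed = [0.05, 0.25, 0.50, 0.80, 0.15]
predicted = [0.08, 0.35, 0.45, 0.75, 0.20]

matches = agreement(observed, predicted)
print(f"{matches} of {len(observed)} surveys fall within their predicted group")
```

Applied to the Mali data, the same count is what yields the figure of 70 out of 101 surveys reported above.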
The malaria prevalence data and the residuals of the regression model were analysed for the presence of spatial pattern. We used two separate methods to investigate spatial pattern: the D-statistic and the variogram.
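The non-parametric D-statistic16 is a weighted average of rank differences in the values of observations, with the average taken over all pairs of points. If $y_i$ refers to the rank of the value at any point $i$, then D is defined by

$$D = \frac{\sum_{i}\sum_{j} w_{ij}\,\lvert y_i - y_j \rvert}{\sum_{i}\sum_{j} w_{ij}},$$

where $w_{ij}$ is a weight reflecting the nearness of points $i$ and $j$.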
A significance test was obtained by simulation. The simulation consists of randomly assigning ranks to the data points and then calculating D assuming the particular pattern of weights given by the spatial layout of the data. This process is repeated many times over, and the distribution of the simulated D is then compared to the actual value of D calculated from the observed data. This directly yields a P-value for significant evidence of spatial autocorrelation. For mutual binary weights an analytical test was used,17 which is computationally less demanding.
Since it is based on the ranks of the data rather than the actual values, the D-statistic is not dependent on normality of the data. In the malaria data (and generally) negative autocorrelation is not likely, since this would assume distant points to be more similar than near ones. Therefore, a one-sided significance test was used, rejecting the null hypothesis of random spatial pattern if the value of D is sufficiently small.
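The simulation test can be sketched as a one-sided permutation test. This is a minimal sketch under assumptions: D is taken to be the weighted mean of absolute rank differences over all pairs, the weights are inverse distances, and the coordinates and prevalence values are hypothetical.

```python
import math
import random

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def inverse_distance_weights(coords):
    """Symmetric weight matrix w[i][j] = 1 / distance, zero on the diagonal."""
    n = len(coords)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] = 1.0 / math.dist(coords[i], coords[j])
    return w

def d_statistic(rank, weights):
    """Weighted mean absolute rank difference (assumed form of D)."""
    num = den = 0.0
    n = len(rank)
    for i in range(n):
        for j in range(i + 1, n):  # weights are symmetric, so i < j suffices
            num += weights[i][j] * abs(rank[i] - rank[j])
            den += weights[i][j]
    return num / den

def permutation_p_value(values, weights, n_sim=999, seed=0):
    """One-sided P-value: proportion of random rank layouts with D <= observed."""
    rng = random.Random(seed)
    r = ranks(values)
    d_obs = d_statistic(r, weights)
    count = 0
    for _ in range(n_sim):
        shuffled = r[:]
        rng.shuffle(shuffled)
        if d_statistic(shuffled, weights) <= d_obs:
            count += 1
    return d_obs, (count + 1) / (n_sim + 1)

# Hypothetical survey sites on a transect with a smooth prevalence gradient.
coords = [(float(k), 0.0) for k in range(8)]
values = [0.10, 0.15, 0.20, 0.30, 0.50, 0.60, 0.70, 0.80]
w = inverse_distance_weights(coords)
d_obs, p = permutation_p_value(values, w)
print(round(d_obs, 3), p)
```

Because near neighbours have similar ranks under positive autocorrelation, the observed D is small relative to the shuffled layouts, which is why the test rejects for small values only.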
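The semi-variogram18–20 (often simply called the variogram) also measures spatial dependency, but there is no significance test associated with this measure. It is normally used to obtain a spatial model for kriging, but it also serves to examine spatial pattern. The semi-variance $\gamma(h)$ measures half the average squared difference between pairs of data values separated by the so-called lag distance, $h$:

$$\gamma(h) = \frac{1}{2N(h)} \sum_{(i,j):\, d_{ij} \approx h} (z_i - z_j)^2,$$

where $z_i$ is the data value at point $i$, $d_{ij}$ is the distance between points $i$ and $j$, and $N(h)$ is the number of pairs of points separated by lag $h$.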
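An empirical semi-variogram of the kind described above can be computed by binning pairs of points by distance. The sketch below is illustrative only: the function name `empirical_semivariogram`, the bin width and the data are hypothetical, and the pair count per bin is returned because bins with few pairs give unreliable estimates.

```python
import math

def empirical_semivariogram(coords, values, bin_width, max_lag):
    """Half the mean squared difference between pairs, grouped into lag bins.

    Returns a list of (bin centre, semi-variance, number of pairs) tuples
    for the bins that contain at least one pair.
    """
    n_bins = int(math.ceil(max_lag / bin_width))
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            h = math.dist(coords[i], coords[j])
            if h < max_lag:
                b = int(h // bin_width)
                sums[b] += (values[i] - values[j]) ** 2
                counts[b] += 1
    return [((b + 0.5) * bin_width, sums[b] / (2 * counts[b]), counts[b])
            for b in range(n_bins) if counts[b] > 0]

# Hypothetical residuals at four sites spaced 1 km apart.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 4.0, 7.0]

result = empirical_semivariogram(coords, values, bin_width=1.0, max_lag=3.5)
for centre, gamma, n_pairs in result:
    print(centre, round(gamma, 3), n_pairs)
```

In this toy example the semi-variance rises with lag distance, the pattern expected when nearby values are more alike than distant ones.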
Table 2 shows that the observed malaria prevalence for Mali is highly autocorrelated in space, as one would expect on account of its strong link with climatic factors. The model residuals still show evidence of spatial pattern, but some of this has been removed by the modelling process. This result holds whether spatial pattern is assessed using the D-statistic with inverse distance weights or binary neighbourhood weights. It can be seen from the P-value for binary weights that the spatial pattern is more distinct over short distances. The semi-variogram of residuals (Figure 3) shows that there is some evidence of spatial correlation over short ranges <20 km.
Since the variogram describes the spatial dependence between the observed measurements as a function of the distance between them, it allows us to estimate the value of malaria prevalence at any point from the observed data. The value of prevalence, $Z$, at the coordinates $(x_0, y_0)$ can be estimated from the $n$ nearest sampling values $Z_{obs}(x_1, y_1), Z_{obs}(x_2, y_2), \ldots, Z_{obs}(x_n, y_n)$ by the linear formula

$$\hat{Z}(x_0, y_0) = \sum_{i=1}^{n} \lambda_i \, Z_{obs}(x_i, y_i),$$

where the $\lambda_i$ are weights derived from the variogram.
The extreme variation in the Mali malaria prevalence data invalidates the assumption that a common mean exists. There is clearly a need to take covariates into account due to the strong association between malaria risk and climatic factors, and due to the wide variation of the latter across Mali. Residuals from the logit model should be free of covariate effects and the logit transformation will moderate any non-homogeneity in variance of the residuals.
Inspection of the variogram based on the residuals (Figure 3) shows that there is spatial dependence (not taken into account by the model) over short distances up to about 15 or 20 km. A variogram of logit scale model residuals was constructed, confirming a short range spatial pattern up to distances of about 18 km, although the relatively small number of pairs of points that are less than this distance apart makes the variogram less reliable in this region. This means that there is small area variation in malaria prevalence which cannot be modelled well by climatic factors presumably because these do not vary much over this short distance.
Kriging performed on residuals is equivalent to kriging a variable which has an underlying (stationary) mean of zero. To carry out this process residuals for all observed points were calculated on the logit (ln[p/(1 − p)]) scale of the logistic model. Spatial dependence of these was modelled using the previously constructed variogram. An exponential model was fitted to the variogram using a sill and nugget of 0.7 and 0.4, respectively, and a range of 18 km. This geo-statistical model was then used in the kriging procedure of the package GEO-EAS22 to map predictions of residuals in an 18 km radius around each observation. These logit scale kriged residual predictions were then added to the logit scale predicted values produced from the original logistic model. The resultant map predictions were transformed back to prevalences in the usual way {exp(Xβ + kriged residuals)/[1 + exp(Xβ + kriged residuals)]} to produce a new prediction map (Figure 4). This map takes into account local spatial dependence and allows local deviation from the prediction of the logistic model.
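The second stage can be sketched as zero-mean (simple) kriging of logit-scale residuals followed by back-transformation. This is not the GEO-EAS procedure and rests on several assumptions that the text does not settle: the sill of 0.7 is taken to include the nugget of 0.4, 18 km is treated as the practical range of the exponential model (hence the factor 3), the nugget is treated as measurement error, and all coordinates, residuals and the linear predictor value are hypothetical.

```python
import math

SILL, NUGGET, RANGE_KM = 0.7, 0.4, 18.0  # values quoted in the text

def signal_cov(h):
    """Covariance of the spatially structured part (assumed exponential model)."""
    return (SILL - NUGGET) * math.exp(-3.0 * h / RANGE_KM)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def krige_residual(target, coords, residuals, radius=RANGE_KM):
    """Zero-mean kriging prediction of the logit-scale residual at target."""
    near = [i for i, c in enumerate(coords) if math.dist(target, c) <= radius]
    if not near:
        return 0.0  # no survey within the radius: keep the regression prediction
    # The nugget is added on the diagonal only, so single surveys are not honoured exactly.
    cmat = [[signal_cov(math.dist(coords[i], coords[j])) + (NUGGET if i == j else 0.0)
             for j in near] for i in near]
    c0 = [signal_cov(math.dist(target, coords[i])) for i in near]
    lam = solve(cmat, c0)
    return sum(l * residuals[i] for l, i in zip(lam, near))

def two_stage_prevalence(xb, kriged):
    """Back-transform the linear predictor plus kriged residual to a prevalence."""
    return 1.0 / (1.0 + math.exp(-(xb + kriged)))

# Hypothetical survey sites (km) and their logit-scale residuals.
coords = [(0.0, 0.0), (5.0, 0.0), (100.0, 100.0)]
residuals = [0.8, 0.6, -1.0]

k_near = krige_residual((2.0, 1.0), coords, residuals)
k_far = krige_residual((50.0, 50.0), coords, residuals)
print(round(k_near, 3), k_far)
print(round(two_stage_prevalence(-1.0, 0.0), 3), round(two_stage_prevalence(-1.0, k_near), 3))
```

Near the two surveys with positive residuals the prediction is pulled upwards, but by less than either residual, while a point more than 18 km from any survey keeps the regression prediction unchanged.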
Discussion |
---|
A concern with spatial data is the potential for spatial correlation in the observations, which could lead to incorrect estimates. Spatial clustering of disease is almost inevitable since human populations generally live in spatial clusters rather than being randomly distributed in space. An infectious disease that is heavily associated with climatic variables is likely to be spatially clustered even if the population were not clustered. The model derived here explains some of the spatial pattern of malaria risk, but there is still significant spatial correlation, particularly over short distances <20 km. (This result holds for differing ways of defining nearness in the D-statistic and is confirmed by the variogram method.) The reduction in spatial structure in the residuals lends credence to the correctness of the model.
Overdispersion in the logistic model does indicate that there may be important covariates missing from the model. Some of these unknown predictors are likely to be spatially distributed, particularly at a local level.
Kriging with a non-stationary mean (universal kriging) is a refinement of ordinary kriging in that it allows for covariate adjustment by means of regression modelling.20 This would be more appropriate in the case of malaria risk where we know that climatic factors are strong predictors. Since the mean prevalence is now a function of the covariates, rather than a constant, the model assumptions would not be violated as in the case of ordinary kriging. Universal kriging offers the most comprehensive approach to the mapping of malaria risk: it uses the values of the covariates (climate data) at the point at which the prediction has to be made, as well as the position of the point in relation to points at which observed values of malaria risk are available. Universal kriging applied to generalized linear models such as the logistic model, is currently not available and we have therefore not been able to apply it as such.
The two-stage approach that we used offers an appealing alternative to universal kriging and it is somewhat similar in approach. The non-spatial model provides the covariate adjustment and prediction of mean risk in an area. It thereby allows for non-stationarity in the data by modelling the long range differentials in the malaria risk pattern. Kriging of the resulting residuals allows for local deviation from the predicted mean and for spatial dependence in points that are close together. In the MARA project it is unlikely that local predictors affecting malaria risk over and above what is predicted by climatic factors will ever be available. For this reason local variation from the more global area prediction has to be taken into account by spatial modelling.
Whilst the kriging process will give minimized unbiased prediction error (of residuals) on the logit scale, this cannot be guaranteed for the backtransformed predictions.25 However, the kriged logit scale residuals are only a component (in most cases a small component) of the linear predictor which is backtransformed to produce the final prediction for the point on the map.
Prediction based on regression alone has a tendency to produce predicted values that are pulled towards the mean. For example, two observations in different parts of the country with very similar climatic data may differ in their observed malaria prevalence value. Regression modelling would predict for these two places a value close to the mean prevalence of the two points. This would result in large residuals. Kriging the residuals and adding the predicted residuals to the model predictions will produce predictions that are closer to the observed prevalences in each neighbourhood, particularly if the deviation from the model prediction is supported by other points in the neighbourhood.
As one might expect therefore, the range of final predictions from the two-stage method is wider than that produced by the regression model alone, with predictions ranging from about 0% to 92% (compared to a range of 0% to 80% for the logistic model alone). As can be seen from the new prediction map (Figure 4) and the difference map (Figure 5), the changes brought about by this process are confined to areas around most of the survey locations. For the rest of the map the data are too sparse to be affected by this process, i.e. most places are more than 18 km removed from the nearest survey.
A problem with this approach is that often there are insufficient data points to give us a good basis for estimating the local variability. In the case of malaria maps this problem is less serious in those areas where malaria prevalence is highest, simply because the frequency of surveys is greatest in these areas. The map is therefore likely to be at its most accurate where it matters most: in places where malaria prevalence is high.
It should be noted that universal kriging might have resulted in a different model to the one obtained here, since it attempts to simultaneously obtain good estimation of covariate effects and allow for residual spatial pattern. In this particular example, however, the residual spatial correlation was weak and therefore we would not expect that universal kriging would have produced a model that differs much from the present one. We are currently investigating an iterative approach that would be applicable in situations where the residual spatial pattern is substantial.
The specification of a nugget variance makes allowance for measurement error at a location. This avoids the prediction honouring every observation, which would result in a very spiky map. Future development in this area should include a method of weighting the observations in such a way that large surveys draw the map prediction closer to their observed value than small surveys.
Additional further work in this area would be to develop goodness-of-fit indicators for this two-stage method. For example, how much of the overdispersion in the model has been taken up by local kriging? What proportion of variation in the data is explained by kriging? It would also be important to produce combined prediction errors for the whole map, taking into account both components of the process of prediction.
In conclusion, our view is that the model produced here is a reasonable representation of malaria risk in Mali. The reduction of residual spatial pattern enhances our confidence in the fidelity of the model and residual spatial dependence has been modelled by kriging wherever the density of observed points allows for this. Kriging has been made possible by levelling the map through the regression model, and applying the kriging process to the residuals. The final predictions make sense from the entomological perspective. However, a more systematic approach to this work in future would be a full mixed model with universal kriging to take account of spatial pattern.
References |
---|
2 Binka FN. Impact and Determinants of Permethrin Impregnated Bednets on Child Mortality in Northern Ghana. PhD thesis. Swiss Tropical Institute, Basel, 1997.
3 Towards an Atlas of Malaria Risk in Africa. First technical report of the MARA/ARMA collaboration. MARA/ARMA, 771 Umbilo Road, Congella, Durban, South Africa. December 1998.
4 Snow RW, Marsh K, Le Sueur D. The need for maps of transmission intensity to guide malaria control in Africa. Parasitol Today 1996;12:455–57.
5 Kitron U, Pener H, Costin C, Orshan L, Greenberg Z, Shalom U. Geographic information systems in malaria surveillance: mosquito breeding and imported cases in Israel, 1992. Am J Trop Med Hyg 1994;50:550–56.
6 Craig MH, Snow RW, Le Sueur D. A climate-based distribution model of malaria transmission in Africa. Parasitol Today 1999 (In Press).
7 Snow RW, Gouws E, Omumbo JA et al. Models to predict the intensity of Plasmodium falciparum transmission: applications to the burden of disease in Kenya. Trans R Soc Trop Med Hyg 1998;92:601–06.
8 Beck LR, Rodriguez MH, Dister SW et al. Remote sensing as a landscape epidemiologic tool to identify villages at high risk for malaria transmission. Am J Trop Med Hyg 1994;51:271–80.
9 NDVI Image Bank Africa 1981–1991 (CD-ROM). Food and Agriculture Organization (FAO) of the United Nations Remote Sensing Centre; Africa Real Time Environmental Monitoring Information System (ARTEMIS), NASA Goddard Space Flight Centre, Greenbelt, MD 20771, USA, 1991.
10 Hutchinson MF, Nix HA, McMahon JP, Ord KD. Climate Data: A Topographic and Climate Database (CD-ROM). Centre for Resource and Environmental Studies, The Australian National University, Canberra, ACT 0200, Australia, 1995.
11 African Data Sampler (CD-ROM). World Resources Institute (WRI). 1709 New York Ave, NW, Washington, DC 20006, USA, 1995.
12 Hosmer DW, Lemeshow S. Applied Logistic Regression. New York: John Wiley & Sons, 1989.
13 Stata Corp. Stata® Statistical Software: Release 5.0. College Station, TX: Stata Corporation, 1997.
14 Littell RC, Milliken GA, Stroup WW, Wolfinger RD. SAS® System for Mixed Models. Cary, NC: SAS Institute Inc., 1996.
15 Clark Labs. Idrisi for Windows Version 2.008. The Idrisi Project, Clark University, Worcester, MA, 1998.
16 Walter SD. The analysis of regional patterns in health data, Part 1. Am J Epidemiol 1992;136:730–41.
17 Walter SD. A simple test for spatial pattern in regional health data. Stat Med 1994;13:1037–44.
18 Oliver MA, Muir KR, Webster R et al. A geostatistical approach to the analysis of pattern in rare disease. J Public Health Med 1992;14:280–89.
19 Carrat F, Valleron AJ. Epidemiologic mapping using the kriging method: application to an influenza-like illness epidemic in France. Am J Epidemiol 1992;135:1293–300.
20 Diggle PJ, Tawn JA, Moyeed RA. Model based geostatistics. J R Statist Soc C 1998;47:299–350.
21 Krige DG. Two dimensional weighted moving average trend surfaces for ore-evaluation. J S Afr Inst Mining Metall 1966;66:13–38.
22 Geostatistical Environmental Assessment Software. GEO-EAS 1.2.1. Las Vegas, NV: US Environmental Protection Agency, 1991.
23 Altman DG. Practical Statistics for Medical Research. London: Chapman & Hall, 1991.
24 Doumbo O, Ouattara NI, Koita O et al. Approche eco-geographique du paludisme en milieu urbain: ville de Bamako au Mali. Ecol Hum 1989;8:3–15.
25 Cressie NAC. Statistics for Spatial Data. New York: John Wiley & Sons, Inc., 1991. Section 3.2.2.