a El Paso County Department of Health and Environment, Colorado Springs, CO, USA.
b Emory University School of Medicine, Department of Family and Preventive Medicine, Atlanta, GA, USA.
Reprint requests to: Stephen Muth, El Paso County Department of Health and Environment, STD/HIV Programs, 301 South Union Boulevard, Colorado Springs, CO 80910-3123, USA. E-mail: smuth{at}uswest.net
Abstract |
---|
Methods Based on residential address co-ordinates for each subgroup member, we used the subgroup's centroid as the origin and constructed a 360° series of overlapping box plots of the distance of subgroup members to the origin, thereby producing closed polygons for each of the box plot demarcators.
Results These rotational box plots revealed similar geographical distributions for most participant and non-participant subgroups, with the exception of African-American men and women.
Conclusions Observed differences resulted in part from the study design, and provided some insight into sampling problems encountered in social network studies. Based on Tobler's supposition that nearby things tend to be alike, the rotational box plot is a useful additional tool for investigating sample bias.
Keywords Geography, geographical information systems (GIS), exploratory spatial data analysis (ESDA), sampling bias
Accepted 24 March 2000
Introduction |
---|
Modern geographical information systems (GIS) are not limited to predetermined small area designation, and permit considerable flexibility in examining geographical distributions.5 They too can illustrate the assumption that, except in areas that have experienced intense migration, two people drawn from the same neighbourhood are likely to share many characteristics. As an example, Latkin et al. have demonstrated that the pattern of specific drug use in Baltimore is a neighbourhood phenomenon.6
In many epidemiological studies, particularly those that are community-based, ascertainment bias is often assessed by comparing sociodemographic characteristics of participants and non-participants. The considerable power of geographical comparisons has not been an important tool in this process, in part because of the administrative and logistic complexity involved in geocoding (translating locating information to map co-ordinates) and making meaningful geographical comparisons by group. In this communication, we report on the feasibility and usefulness of a geographical technique that provides visualization of population distributions to assist in evaluating sample bias.
Materials and Methods |
---|
Sample comparisons
We used standard methods to examine and report differences in sex, age, ethnicity, and recruitment characteristics, attempting to duplicate the usual way in which possible sample bias is evaluated in epidemiological studies. To compare geographical distributions, we first used the ArcView8 GIS to geocode all available residential addresses to their x and y co-ordinates on a regional street map based on the Colorado central state plane co-ordinate system. SAS9 was then employed to deliver the appropriately constructed data subsets to ArcView. Though nearly two-thirds of non-participants could not be located, we obtained information on their location. Such information was available from clinic and public agency records, and from respondents' reports.
We devised a system for classifying the quality of study subjects' geographical information. The highest accuracy was achieved by geocoding exact addresses; these co-ordinates were found by interpolation within the nearest city block. The next highest level of accuracy we called cross-street, meaning we were given street intersections as approximate locations. We classified co-ordinates in this manner when we knew that the true co-ordinates of the residence were within two blocks of the given cross-street. When the true co-ordinates were known to be within four blocks of a given landmark (as when a subject reported being near a school or police station, for example), we classified the quality of the geographical data as approximate. Occasionally, only the subject's neighbourhood was known, which allowed for accuracy to within 5 km, the maximum distance encompassed by a neighbourhood in Colorado Springs. Lastly, a few points were geocoded when we knew only the side of town in which the subject resided.
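As an illustration only, the five quality tiers described above could be recorded as an ordered code attached to each geocoded point, so that analyses can be repeated with the least accurate points excluded. The names `GeocodeQuality` and `at_least`, and the record layout, are hypothetical and are not taken from the study's software.

```python
from enum import IntEnum


class GeocodeQuality(IntEnum):
    """Hypothetical ordered codes for the five tiers; lower value = more accurate."""
    EXACT = 1          # exact address, interpolated within the nearest city block
    CROSS_STREET = 2   # true location within two blocks of a given intersection
    APPROXIMATE = 3    # true location within four blocks of a reported landmark
    NEIGHBOURHOOD = 4  # only the neighbourhood known (accuracy to within 5 km)
    SIDE_OF_TOWN = 5   # only the side of town known


def at_least(records, worst_allowed):
    """Keep records whose quality is as good as, or better than, worst_allowed."""
    return [r for r in records if r["quality"] <= worst_allowed]


# Example with made-up records: drop anything coarser than "approximate".
records = [
    {"id": 1, "quality": GeocodeQuality.EXACT},
    {"id": 2, "quality": GeocodeQuality.NEIGHBOURHOOD},
    {"id": 3, "quality": GeocodeQuality.APPROXIMATE},
]
usable = at_least(records, GeocodeQuality.APPROXIMATE)
```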
Geographical comparisons
By geocoding the entire group of participants and non-participants, we constructed sets of points corresponding to subgroups of the overall study cohort. In general, these traditional spot maps provided little information because of substantial point clutter and the presence of topographical artefacts (e.g. mountains, rivers and subdivision boundaries), which can cause dissimilar point patterns to appear to be similar.
To calculate a summary measure of distribution, we used the subgroup's centroid (mean x and y co-ordinates) as the origin on a map of Colorado Springs and selected a 90° sector extending due east from the centroid. The width of this sector was determined empirically by trying a variety of possible widths (data not shown). We calculated the median distance from the centroid to all points within this sector; the medians of the upper and lower halves (fences, or the approximate 25th and 75th percentiles); and the outer whisker, which is 1.5 times the interquartile distance added to the median, thereby constructing a box plot10,11 from the centroid. The box plot parameters were plotted along the line bisecting the sector. The sector was then displaced by 1° and the calculation repeated through 360° of rotation. Once the points corresponding to a given statistic are connected as closed polygons, a rotational moving box plot is revealed, providing a visual representation of the distribution of the subgroup about its centroid.
These moving box plots were then compared between subgroups as well as to that of the overall population for Colorado Springs. Since individual residence data for the general population are not released by the US Bureau of the Census, block-level data from the 1990 Census,12,13 consisting of polygons (blocks) weighted by their number of residents, were used. A function within ArcView GIS computed the geographical centroid for each block. Because the resulting block data points represent groups of people, the calculation of centroids and rotational box plots was weighted by the number of people within each block.
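The sector-by-sector calculation described in this section can be sketched in a few lines of Python. This is a minimal, unweighted illustration and not the SAS/ArcView code used in the study: the function names are hypothetical, planar x/y co-ordinates are assumed, and for an odd number of points the median is assumed to belong to both halves when the fences are computed. The whisker follows the definition in the text (median plus 1.5 times the interquartile distance).

```python
import math
from statistics import median


def centroid(points):
    """Mean x and y co-ordinates of a list of (x, y) points."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))


def box_stats(distances):
    """Return (lower fence, median, upper fence, whisker) for a list of distances."""
    d = sorted(distances)
    n = len(d)
    half = (n + 1) // 2  # assumption: the median sits in both halves when n is odd
    med = median(d)
    lower = median(d[:half])
    upper = median(d[n - half:])
    whisker = med + 1.5 * (upper - lower)  # as defined in the text
    return lower, med, upper, whisker


def rotational_box_plot(points, sector_width=90.0, step=1.0):
    """Build one polygon (list of vertices) per box plot statistic."""
    cx, cy = centroid(points)
    # Distance and bearing (degrees anticlockwise from due east) of every point.
    polar = [(math.hypot(x - cx, y - cy),
              math.degrees(math.atan2(y - cy, x - cx)) % 360.0)
             for x, y in points]
    names = ("lower", "median", "upper", "whisker")
    polygons = {name: [] for name in names}
    for i in range(int(round(360.0 / step))):
        theta = i * step  # bearing of the line bisecting the current sector
        in_sector = [r for r, a in polar
                     if abs((a - theta + 180.0) % 360.0 - 180.0) <= sector_width / 2.0]
        if not in_sector:
            continue  # no points in this sector, so no vertex is added
        t = math.radians(theta)
        for name, r in zip(names, box_stats(in_sector)):
            # Plot each statistic along the bisecting line.
            polygons[name].append((cx + r * math.cos(t), cy + r * math.sin(t)))
    return polygons
```

Calling `rotational_box_plot(points)` on a list of geocoded `(x, y)` pairs returns four vertex lists that can be drawn as closed polygons over a base map. Handling the census block data would additionally require weighted centroids and weighted medians, which this sketch omits.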
Results |
---|
Discussion |
---|
Traditional epidemiological evaluation informed us that participants and non-participants differed with regard to their ethnicity, their behavioural category, and the venue through which they were ascertained. These differences, however, were consonant with the goals and conduct of the study. For example, we had a greater proportion of participants who were partners of prostitutes compared to non-participants, reflecting our active recruitment of this group. Similarly, non-participants were identified preferentially from sources outside public clinic and street venues. In view of the differences in risk behaviour among ethnic groups, however, the differences in their participation in this study were more troubling.
The geographical assessment revealed little difference in the residential distribution of Whites (data not shown, but are reflected in the lack of difference in distributions for our predominantly White IDU), but substantial differences for African Americans. The rotational box plots informed us that male African-American participants were drawn from a wider geographical area than were non-participants, suggesting greater heterogeneity within the participant group and thus the potential for greater representativeness. On the other hand, in view of the heightened concentration of non-participants in the centre city area, we certainly included, but may have undersampled, people at the very highest risk for infection. Similarly, we observed a different geographical distribution for female African-American participants compared to non-participants. The former were more centred in the inner city area, suggesting that we may have oversampled the group at highest risk. Our analyses of this population confirmed that the recruited participants did engage in substantial risky behaviour, and were thus representative of the populations we wished to evaluate, but these geographical assessments furnish several caveats that we were not able to include in the previous reports.7,14,15
The technique used here, which we have termed a rotational box plot, is a variant of other exploratory spatial data analysis (ESDA) techniques currently in use.16 For example, the box map or spatial box plot is a choropleth map depicting areas (by colour or shading) that fall into various portions of the box plot distribution. The major purpose of such analysis, and its particular application in this setting, is to provide visual access to underlying patterns. Formal statistical testing is more difficult, in part because of the substantial statistical problems of assessing comparability of geographical patterns (e.g. simultaneous assessment of size, conformation, location, and intensity of mapped phenomena). These statistical issues are currently the subject of intense development,16-18 with a primary focus on techniques to manage spatial autocorrelation. Jacquez has reviewed map comparison methods and suggested an overlap statistic which he tested against a null hypothesis with a series of randomization simulations.19 Though not directly applicable to the approach presented here, these efforts, and the larger body of relevant literature not reviewed here, are important advances in statistical methods for geographical assessment.
The notion underlying our approach, that geographical contiguity imparts homogeneity, has been termed Tobler's first law of geography,19 after Dr Waldo Tobler, who noted that nearby things tend to be alike.20 In some circumstances, the notion can certainly be challenged, however. Social mobility, economic upheaval, urban decay and renewal, and changing social mores can render neighbourhoods more heterogeneous than may have been the case earlier in their history. Our experience in Colorado Springs, based on 25 years of continuous observation of populations at risk for sexually transmitted diseases and HIV, suggests that the city's neighbourhoods continue to cohere despite growth and in- and out-migration, and Tobler's first law is a reasonable assumption. Such cohesion cannot be automatically expected, however, and generalization of this technique requires external confirmation by investigators of a community's homogeneity.
A further impediment to general use of this approach may be technological. Though extraordinary gains have been made in the development of GIS,5 their use is neither easy nor inexpensive. Though the authors enjoyed free access to an extensive street database, developed at substantial cost by an extramural agency, considerable manual checking of the automated geocoding process was required to ensure correct placement of data points. Hand-held global positioning systems (GPS) could provide an alternative to geocoding addresses; with an inexpensive GPS unit, one could obtain point co-ordinates directly, obviating the need for a geocoded reference map. The advantages of direct verification of reported address information and ease of obtaining co-ordinates would have to be weighed against the logistic difficulty of visiting every location to take a GPS reading, however. In any event, the power of programs such as ArcView8 is matched by their complexity, and the user must develop considerable technical expertise before the program will deign to be friendly. Finally, the rotational box-plot technique can only be applied in studies that have an intrinsic reason to collect detailed locating information on participants. In our case, the cohort design, with multiple interviews over a 4-year period, necessitated the initial collection of such information.
Given that many studies will have legitimate reasons for collecting geographical information, and that the programming techniques are likely to become more accessible, geographical information has the potential to be used more widely for assessing ascertainment bias. Increasing accessibility, at any rate, will provide researchers with the tools to evaluate the technique in diverse settings and to establish the experience needed to define its place in epidemiological methods.
Acknowledgments |
---|
References |
---|
2 Elliott P, Martuzzi M, Shaddick G. Spatial statistical methods in environmental epidemiology: a critique. Stat Meth Med Res 1995;4:137-59.
3 US Department of Commerce, Economics and Statistics Administration, Bureau of the Census. Geographic Areas Reference Manual 1994;10:17. <http://www.census.gov/geo/www/garm.html>
4 US Department of Commerce, Economics and Statistics Administration, Bureau of the Census. Geographic Areas Reference Manual 1994;2:27-30. <http://www.census.gov/geo/www/garm.html>
5 Clarke KC, McLafferty SL, Tempalski BJ. On epidemiology and geographical information systems: a review and discussion of future directions. Emerg Infect Dis 1996;2:85-92.
6 Latkin C, Glass GE, Duncan T. Using geographic information systems to assess spatial patterns of drug use, selection bias and attrition among a sample of injection drug users. Drug Alc Dep 1998;50:167-75.
7 Woodhouse DE, Rothenberg RR, Potterat JJ et al. Mapping a social network of heterosexuals at high risk for human immunodeficiency virus infection. AIDS 1994;8:1331-36.
8 Environmental Systems Research Institute, Inc. ArcView GIS Version 3.0a. Redlands, CA: 1999. <http://www.esri.com/software/index.html>
9 SAS Institute Inc. SAS Language and Procedures: Usage, Version 6. 1st Edn. Cary, NC: SAS Institute Inc., 1989. <http://www.sas.com/SASHome.html>
10 Tukey JW. Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.
11 Williamson DF, Parker RA, Kendrick JS. The box plot: a simple visual method to interpret data. Ann Int Med 1989;110:916-21.
12 US Department of Commerce, Economics and Statistics Administration, Bureau of the Census. Data Access Tools, 1999. <http://www.census.gov/cgi-bin/build.html>
13 US Department of Commerce, Economics and Statistics Administration, Bureau of the Census. Geographic Areas Reference Manual. 1994;11:1-11. <http://www.census.gov/geo/www/garm.html>
14 Rothenberg RB, Potterat JJ, Woodhouse DE, Darrow WW, Muth SQ, Klovdahl AS. Choosing a centrality measure: epidemiologic correlates in the Colorado Springs study of social networks. Soc Networks 1995;17:273-97.
15 Rothenberg RB, Potterat JJ, Woodhouse DE, Muth SQ, Darrow WW, Klovdahl AS. Social network dynamics and HIV transmission. AIDS 1998;12:1529-36.
16 Anselin L, Bao S. Exploratory spatial data analysis linking SpaceStat and ArcView. In: Fischer M, Getis A (eds). Recent Developments in Spatial Analysis. Berlin: Springer-Verlag, 1997;35-59.
17 Anselin L, Kelejian H. Testing for spatial error autocorrelation in the presence of endogenous regressors. Int Regional Sci Rev 1997;20:153-82.
18 Anselin L. Rao's score test in spatial econometrics. J Stat Plan Infer (in press). Available at: <http://www.skywalker.utdallas.edu/anselin.htm#Top>
19 Jacquez GM. The map comparison problem: tests for the overlap of geographic boundaries. Stat Med 1995;14:2343-61.
20 Tobler W. A computer movie simulating urban growth in the Detroit region. Econ Geogr 1970;46:234-40.