Birds of a feather: using a rotational box plot to assess ascertainment bias

Stephen Q Muth (a), John J Potterat (a) and Richard B Rothenberg (b)

(a) El Paso County Department of Health and Environment, Colorado Springs, CO, USA.
(b) Emory University School of Medicine, Department of Family and Preventive Medicine, Atlanta, GA, USA.

Reprint requests to: Stephen Muth, El Paso County Department of Health and Environment, STD/HIV Programs, 301 South Union Boulevard, Colorado Springs, CO 80910–3123, USA. E-mail: smuth{at}uswest.net


    Abstract
 
Background Comparability of study participants with non-participants is customarily assessed by contrasting the distributions of sociodemographic characteristics. Such comparisons do not necessarily provide insight into whether or not participants of a given subgroup are similar to non-participants of the same subgroup. A geographical information system (GIS) may provide such insight by visually displaying the spatial distributions of participants and non-participants. In a previously reported study of heterosexuals at elevated risk for human immunodeficiency virus (HIV), traditional methods suggested distributional differences in the demographic characteristics of participants and non-participants.

Methods Based on residential address co-ordinates for each subgroup member, we used the subgroup's centroid as the origin and constructed a 360° series of overlapping box plots of the distance of subgroup members to the origin, thereby producing closed polygons for each of the box plot demarcators.

Results These rotational box plots revealed similar geographical distributions for most participant and non-participant subgroups, with the exception of African-American men and women.

Conclusions Observed differences resulted in part from the study design, and provided some insight into sampling problems encountered in social network studies. Based on Tobler's supposition that ‘nearby things tend to be alike’, the rotational box plot is a useful additional tool for investigating sample bias.

Keywords Geography, geographical information systems (GIS), exploratory spatial data analysis (ESDA), sampling bias

Accepted 24 March 2000


    Introduction
 
Small area designations (census tracts, enumeration districts) were developed to facilitate demographic and epidemiological observations on small population groupings, an enterprise now generally known as ‘small area analysis’.1,2 As originally constructed, census tracts were designed to encompass groups of 1500–8000 people who constituted a coherent neighbourhood. The tenet underlying such designations has been that people in a neighbourhood are similar with regard to social group, ethnic background, income level, occupational class, and educational attainment.3 A corollary to this tenet is that as the unit of geographical analysis increases, heterogeneity also increases, and large area designations (metropolitan statistical areas, states, regions) will be more diverse.4

Modern geographical information systems (GIS) are not limited to predetermined small area designations, and permit considerable flexibility in examining geographical distributions.5 They too can illustrate the assumption that, except in areas that have experienced intense migration, two people drawn from the same neighbourhood are likely to share many characteristics. As an example, Latkin et al. have demonstrated that the pattern of specific drug use in Baltimore is a neighbourhood phenomenon.6

In many epidemiological studies, particularly those that are community-based, ascertainment bias is assessed by comparing sociodemographic characteristics of participants and non-participants. Geographical comparisons, despite their considerable power, have not been an important tool in this process, in part because of the administrative and logistic complexity involved in geocoding (translating locating information to map co-ordinates) and making meaningful geographical comparisons by group. In this communication, we report on the feasibility and usefulness of a geographical technique that provides visualization of population distributions to assist in evaluating sample bias.


    Materials and Methods
 
The original sample
To delineate the social, sexual, and drug-using networks of heterosexual men and women at risk for human immunodeficiency virus (HIV) in Colorado Springs, CO, we attempted to enrol women who work as prostitutes, injecting drug users (IDU), and people who had sex or used drugs with members of either category. As described elsewhere,7 we identified 1079 people between 1988 and 1991 who met eligibility criteria for the study. We successfully enrolled 595 (55%). Of those eligible but not enrolled, 63% were not located, 28% refused to participate, 3% were incarcerated or died before we could follow up with them, and 6% were not recruited because of our concern that breaches of confidentiality might ensue.

Sample comparisons
We used standard methods to examine and report differences in sex, age, ethnicity, and recruitment characteristics, attempting to duplicate the usual way in which possible sample bias is evaluated in epidemiological studies. To compare geographical distributions, we first used the ArcView8 GIS to geocode all available residential addresses to their x and y co-ordinates on a regional street map based on the Colorado central state plane co-ordinate system. SAS9 was then employed to deliver the appropriately constructed data subsets to ArcView. Though nearly two-thirds of non-participants could not be located for enrolment, information on where they lived was available from clinic and public agency records, and from respondents' reports.

We devised a system for classifying the quality of study subjects' geographical information. The highest accuracy was achieved by geocoding exact addresses; these co-ordinates were found by interpolation within the nearest city block. The next highest level of accuracy we called ‘cross-street’, meaning we were given street intersections as approximate locations. We classified co-ordinates in this manner when we knew that the true co-ordinates of the residence were within two blocks of the given cross-street. When the true co-ordinates were known to be within four blocks of a given landmark (as when a subject reported being near a school or police station, for example), we classified the quality of the geographical data as ‘approximate’. Occasionally, only the subject's neighbourhood was known, which allowed for accuracy to within 5 km, the maximum distance encompassed by a neighbourhood in Colorado Springs. Lastly, a few points were geocoded when we knew only the side of town on which the subject resided.
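
The five accuracy levels described above can be expressed as an ordered classification. The following is only a minimal sketch: the names `GeocodeQuality` and `classify`, the `source` labels, and the fall-through rule for locations that miss the block thresholds are illustrative assumptions, not part of the original study procedure.

```python
from enum import IntEnum


class GeocodeQuality(IntEnum):
    """Lower value = more accurate location, in the order given in the text."""
    EXACT = 1          # exact street address, interpolated within the city block
    CROSS_STREET = 2   # within two blocks of a reported street intersection
    APPROXIMATE = 3    # within four blocks of a reported landmark
    NEIGHBOURHOOD = 4  # only the neighbourhood is known (accuracy within 5 km)
    SIDE_OF_TOWN = 5   # only the side of town is known


def classify(source, blocks_off=None):
    """Assign a quality level to one location report.

    `source` is a hypothetical label for the kind of locating information;
    `blocks_off` is the known maximum error in city blocks, if any.
    """
    if source == "address":
        return GeocodeQuality.EXACT
    if source == "intersection" and blocks_off is not None and blocks_off <= 2:
        return GeocodeQuality.CROSS_STREET
    if source in ("intersection", "landmark") and blocks_off is not None and blocks_off <= 4:
        return GeocodeQuality.APPROXIMATE
    if source == "side_of_town":
        return GeocodeQuality.SIDE_OF_TOWN
    # Assumption: anything else is treated as neighbourhood-level information.
    return GeocodeQuality.NEIGHBOURHOOD
```

Because the levels are ordered integers, records can be sorted or filtered by accuracy, for example to repeat an analysis using only `EXACT` and `CROSS_STREET` points.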

Geographical comparisons
By geocoding the entire group of participants and non-participants, we constructed sets of points corresponding to subgroups of the overall study cohort. In general, these traditional spot maps provided little information because of substantial point clutter and the presence of topographical artefacts (e.g. mountains, rivers and subdivision boundaries), which can cause dissimilar point patterns to appear to be similar.

To calculate a summary measure of distribution, we used the subgroup's centroid (mean x and y co-ordinates) as the origin on a map of Colorado Springs and selected a 90° sector extending due east from the centroid. The width of this sector was determined empirically by trying a variety of possible widths (data not shown). We calculated the median distance from the centroid to all points within this sector; the median of the upper and lower halves (‘fences’, or the approximate 25th and 75th percentiles); and the outer ‘whisker’, which is 1.5 times the interquartile distance added to the median, thereby constructing a box plot10,11 from the centroid. The box plot parameters were plotted along the line bisecting the sector. The sector was then displaced by 1° and the calculation repeated through 360° of rotation. Once the points corresponding to a given statistic are connected as closed polygons, a rotational moving box plot is revealed, providing a visual representation of the distribution of the subgroup about its centroid. These moving box plots were then compared between subgroups as well as to that of the overall population for Colorado Springs. Since individual residence data for the general population are not released by the US Bureau of the Census, block-level data from the 1990 Census,12,13 consisting of polygons (blocks) weighted by their number of residents, were used. A function within ArcView GIS computed the geographical centroid for each block. Because the resulting block data points represent groups of people, the calculation of centroids and rotational box plots was weighted by the number of people within each block.
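
The rotating-sector calculation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the ArcView and SAS procedure actually used: the function names (`centroid`, `box_stats`, `rotational_box_plot`) are hypothetical, co-ordinates are assumed to be planar (x east, y north), the sector is centred on its bisector, and the population weighting used for census blocks is not handled.

```python
import math
from statistics import median


def centroid(points):
    """Mean x and y co-ordinates of a list of (x, y) points."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))


def box_stats(distances):
    """Box plot values as described in the text: the median, the medians of
    the lower and upper halves, and a whisker at median + 1.5 * (q3 - q1)."""
    d = sorted(distances)
    n = len(d)
    half = n // 2
    lower = d[:half] if n % 2 == 0 else d[:half + 1]  # median shared when n is odd
    upper = d[half:]
    med, q1, q3 = median(d), median(lower), median(upper)
    return {"q1": q1, "median": med, "q3": q3, "whisker": med + 1.5 * (q3 - q1)}


def rotational_box_plot(points, sector_width=90.0, step=1.0):
    """Return the centroid and one closed outline (list of vertices) per statistic."""
    cx, cy = centroid(points)
    # Convert every point to (bearing in degrees, distance) about the centroid;
    # a bearing of 0 degrees points due east.
    polar = [(math.degrees(math.atan2(y - cy, x - cx)) % 360.0,
              math.hypot(x - cx, y - cy)) for x, y in points]
    outline = {"q1": [], "median": [], "q3": [], "whisker": []}
    for i in range(int(round(360.0 / step))):
        bisector = i * step
        # Keep points whose bearing lies within half a sector width of the bisector.
        in_sector = [r for theta, r in polar
                     if abs((theta - bisector + 180.0) % 360.0 - 180.0) <= sector_width / 2.0]
        if not in_sector:
            continue  # empty sector: no vertex for this bearing
        rad = math.radians(bisector)
        for name, r in box_stats(in_sector).items():
            # Plot each statistic along the line bisecting the sector.
            outline[name].append((cx + r * math.cos(rad), cy + r * math.sin(rad)))
    return (cx, cy), outline
```

Calling `rotational_box_plot(points)` on a subgroup's geocoded residences yields four vertex lists; connecting the vertices of each list in order gives the closed polygon for that statistic, which can then be overlaid on a map or compared with the polygons of another subgroup.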


    Results
 
Demographic comparisons
Although similar in mean age, male participants differed significantly from non-participants with respect to the distribution of ethnicity, participant categories, and enrolment venue (Table 1). Compared to non-participants, there were proportionally fewer African-American men in the participant group. There were more partners to prostitutes among the participants than among the non-participants, and examination of recruitment venues suggests that those found through public clinics and outreach activities were more likely to be enrolled. Women participants and non-participants differed slightly in mean age but were similar with regard to ethnicity and enrolment category. As with men, women participants were more likely to have been recruited in public clinics than at other sites.


Table 1 Comparison of study eligibles by characteristic and participant/non-participant status, Colorado Springs, 1988–1991
 
Geographical comparisons
Though participants' co-ordinates tended to be more accurate than those for non-participants (Table 2), the two groups were similar in their overall geographical distribution, both varying substantially from that of the general population (Figure 1). The moving median configurations for both groups fit almost entirely within that for the general population and were considerably smaller, indicating substantial geographical concentration. Participants and non-participants displayed approximately 70% overlap (shaded) between their two moving-median polygons (Figure 2).
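
The text does not state how the overlap percentage was computed. One plausible definition, shown here purely as an assumption, is the area shared by the two polygons divided by the area covered by either. The sketch below estimates that ratio by testing the centres of a regular grid of cells; `point_in_polygon` and `overlap_fraction` are hypothetical helper names, and the grid resolution `cells` is an arbitrary choice.

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test: True if (x, y) lies inside the closed polygon `poly`."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def overlap_fraction(poly_a, poly_b, cells=100):
    """Approximate intersection area / union area (an assumed definition)
    by sampling the centres of a cells x cells grid over both polygons."""
    pts = list(poly_a) + list(poly_b)
    x0, x1 = min(p[0] for p in pts), max(p[0] for p in pts)
    y0, y1 = min(p[1] for p in pts), max(p[1] for p in pts)
    dx, dy = (x1 - x0) / cells, (y1 - y0) / cells
    both = either = 0
    for i in range(cells):
        x = x0 + (i + 0.5) * dx
        for j in range(cells):
            y = y0 + (j + 0.5) * dy
            in_a = point_in_polygon(x, y, poly_a)
            in_b = point_in_polygon(x, y, poly_b)
            both += in_a and in_b
            either += in_a or in_b
    return both / either if either else 0.0
```

Other definitions are equally defensible, for example dividing the shared area by the area of the smaller polygon, and they would give different percentages for the same pair of outlines.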


Table 2 Accuracy of ascertainment of the addresses of study recruits by participant/non-participant status, Colorado Springs, 1988–1991
 


Figure 1 Residential geographic distributions for the general population, compared to the study population by participant/non-participant status, Colorado Springs, 1988–1991

 


Figure 2 Overlap between rotational moving median plots for participants and non-participants, Colorado Springs, 1988–1991

 
In the usual comparisons (Table 1), the distributions of ethnicity among participant and non-participant men were significantly different. The geographical distributions of African-American participants and non-participants in the study (Figure 3) show substantial differences for both men and women. For men, participants are drawn from a broader group; the non-participants were more densely concentrated around their centroid. For women, the geographical distributions are close in size but differ in shape. Substantially more non-participant women reside to the northeast of their centroid (which is located in the downtown city centre). In comparison, the geographical distributions of both male and female IDU participants and non-participants were virtually identical (Figure 4).



Figure 3 Residential distribution by gender among African-American participants and non-participants, Colorado Springs, 1988–1991

 


Figure 4 Residential distribution by gender among injecting drug users, Colorado Springs, 1988–1991

 

    Discussion
 
A comparison of geographical distributions is best assessed in the overall epidemiological context in which it is performed. The particular study at issue here was designed to determine the influence of social network structure on the transmission of HIV. To this end, we purposely recruited people at highest risk for HIV: men and women who inject drugs and women who work as prostitutes. Our venues for primary recruitment were clinics serving these populations and street sites that they frequent. Our secondary recruitment efforts used the network connections of the participants themselves. As a result, we expected considerable geographical clustering. This geographical assessment confirmed that both participants and non-participants in general were recruited from the central city area of Colorado Springs, and both exhibited similar distributions that were considerably more circumscribed than the distribution for the general population (Figure 1). These observations have also provided a basis for work currently in progress: examination of the geography of these groups' social networks in ‘real space’.

Traditional epidemiological evaluation informed us that participants and non-participants differed with regard to their ethnicity, their behavioural category, and the venue through which they were ascertained. These differences, however, were consonant with the goals and conduct of the study. For example, we had a greater proportion of participants who were partners of prostitutes compared to non-participants, reflecting our active recruitment of this group. Similarly, non-participants were identified preferentially from sources outside public clinic and street venues. In view of the differences in risk behaviour among ethnic groups, however, the differences in their participation in this study were more troubling.

The geographical assessment revealed little difference in the residential distribution of Whites (data not shown, but are reflected in the lack of difference in distributions for our predominantly White IDU), but substantial differences for African Americans. The rotational box plots informed us that male African-American participants were drawn from a wider geographical area than were non-participants, suggesting greater heterogeneity within the participant group and thus the potential for greater representativeness. On the other hand, in view of the heightened concentration of non-participants in the centre city area, we certainly included, but may have undersampled, people at the very highest risk for infection. Similarly, we observed a different geographical distribution for female African-American participants compared to non-participants. The former were more centred in the inner city area, suggesting that we may have oversampled the group at highest risk. Our analyses of this population confirmed that the recruited participants did engage in substantial risky behaviour, and were thus representative of the populations we wished to evaluate, but these geographical assessments furnish several caveats that we were not able to include in the previous reports.7,14,15

The technique used here, which we have termed a ‘rotational box plot’, is a variant of other exploratory spatial data analysis (ESDA) techniques currently in use.16 For example, the ‘box map’ or ‘spatial box plot’ is a choropleth map depicting areas (by colour or shading) that fall into various portions of the box plot distribution. The major purpose of such analysis, and its particular application in this setting, is to provide visual access to underlying patterns. Formal statistical testing is more difficult, in part because of the substantial statistical problems of assessing comparability of geographical patterns (e.g. simultaneous assessment of size, conformation, location, and intensity of mapped phenomena). These statistical issues are currently the subject of intense development,16–18 with a primary focus on techniques to manage spatial autocorrelation. Jacquez has reviewed map comparison methods and suggested an overlap statistic, which he tested against a null hypothesis with a series of randomization simulations.19 Though not directly applicable to the approach presented here, these efforts, and the larger body of relevant literature not reviewed here, are important advances in statistical methods for geographical assessment.

The notion underlying our approach—that geographical contiguity imparts homogeneity—has been termed Tobler's first law of geography,19 after Dr Waldo Tobler, who noted that ‘nearby things tend to be alike’.20 In some circumstances, the notion can certainly be challenged, however. Social mobility, economic upheaval, urban decay and renewal, and changing social mores can render neighbourhoods more heterogeneous than may have been the case earlier in their history. Our experience in Colorado Springs, based on 25 years of continuous observation of populations at risk for sexually transmitted diseases and HIV, suggests that the city's neighbourhoods continue to cohere despite growth and in- and out-migration, and Tobler's first law is a reasonable assumption. Such cohesion cannot be automatically expected, however, and generalization of this technique requires external confirmation by investigators of a community's homogeneity.

A further impediment to general use of this approach may be technological. Though extraordinary gains have been made in the development of GIS,5 their use is neither easy nor inexpensive. Though the authors enjoyed free access to an extensive street database, developed at substantial cost by an extramural agency, considerable manual checking of the automated geocoding process was required to ensure correct placement of data points. Hand-held global positioning systems (GPS) could provide an alternative to geocoding addresses; with an inexpensive GPS unit, one could obtain point co-ordinates directly, obviating the need for a geocoded reference map. The advantages of direct verification of reported address information and ease of obtaining co-ordinates would have to be weighed against the logistic difficulty of visiting every location to take a GPS reading, however. In any event, the power of programs such as ArcView8 is matched by their complexity, and the user must develop considerable technical expertise before the program will deign to be friendly. Finally, the rotational box-plot technique can only be applied in studies that have an intrinsic reason to collect detailed locating information on participants. In our case, the cohort design, with multiple interviews over a 4-year period, necessitated the initial collection of such information.

Given that many studies will have legitimate reasons for collecting geographical information, and that the programming techniques are likely to become more accessible, geographical information has the potential to be used more widely for assessing ascertainment bias. Increasing accessibility, at any rate, will provide researchers with the tools to evaluate the technique in diverse settings, and establish the experience needed to define its place in epidemiological methods.


    Acknowledgments
 
This work was supported by grant R01-DA09928 from the National Institute on Drug Abuse, National Institutes of Health, Bethesda, Maryland.


    References
 
1 Stoto MA. Public health assessment in the 1990s. Ann Rev Public Health 1992;13:59–78.

2 Elliott P, Martuzzi M, Shaddick G. Spatial statistical methods in environmental epidemiology: a critique. Stat Meth Med Res 1995;4:137–59.

3 US Department of Commerce, Economics and Statistics Administration, Bureau of the Census. Geographic Areas Reference Manual 1994;10:1–7. <http://www.census.gov/geo/www/garm.html>

4 US Department of Commerce, Economics and Statistics Administration, Bureau of the Census. Geographic Areas Reference Manual 1994;2:27–30. <http://www.census.gov/geo/www/garm.html>

5 Clarke KC, McLafferty SL, Tempalski BJ. On epidemiology and geographical information systems: a review and discussion of future directions. Emerg Infect Dis 1996;2:85–92.

6 Latkin C, Glass GE, Duncan T. Using geographic information systems to assess spatial patterns of drug use, selection bias and attrition among a sample of injection drug users. Drug Alc Dep 1998;50:167–75.

7 Woodhouse DE, Rothenberg RB, Potterat JJ et al. Mapping a social network of heterosexuals at high risk for human immunodeficiency virus infection. AIDS 1994;8:1331–36.

8 Environmental Systems Research Institute, Inc. ArcView GIS Version 3.0a. Redlands, CA: 1999. <http://www.esri.com/software/index.html>

9 SAS Institute Inc. SAS Language and Procedures: Usage, Version 6. 1st Edn. Cary, NC: SAS Institute Inc., 1989. <http://www.sas.com/SASHome.html>

10 Tukey JW. Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.

11 Williamson DF, Parker RA, Kendrick JS. The box plot: a simple visual method to interpret data. Ann Int Med 1989;110:916–21.

12 US Department of Commerce, Economics and Statistics Administration, Bureau of the Census. Data Access Tools, 1999. <http://www.census.gov/cgi-bin/build.html>

13 US Department of Commerce, Economics and Statistics Administration, Bureau of the Census. Geographic Areas Reference Manual 1994;11:1–11. <http://www.census.gov/geo/www/garm.html>

14 Rothenberg RB, Potterat JJ, Woodhouse DE, Darrow WW, Muth SQ, Klovdahl AS. Choosing a centrality measure: epidemiologic correlates in the Colorado Springs study of social networks. Soc Networks 1995;17:273–97.

15 Rothenberg RB, Potterat JJ, Woodhouse DE, Muth SQ, Darrow WW, Klovdahl AS. Social network dynamics and HIV transmission. AIDS 1998;12:1529–36.

16 Anselin L, Bao S. Exploratory spatial data analysis linking SpaceStat and ArcView. In: Fischer M, Getis A (eds). Recent Developments in Spatial Analysis. Berlin: Springer-Verlag, 1997;35:59.

17 Anselin L, Kelejian H. Testing for spatial error autocorrelation in the presence of endogenous regressors. Int Regional Sci Rev 1997;20:153–82.

18 Anselin L. Rao's score test in spatial econometrics. J Stat Plan Infer (in press). Available at: <http://www.skywalker.utdallas.edu/anselin.htm#Top>

19 Jacquez GM. The map comparison problem: tests for the overlap of geographic boundaries. Stat Med 1995;14:2343–61.

20 Tobler W. A computer movie simulating urban growth in the Detroit region. Econ Geogr 1970;46:234–40.