Detecting Patterns of Occupational Illness Clustering with Alternating Logistic Regressions Applied to Longitudinal Data

John S. Preisser1, Thomas A. Arcury2, and Sara A. Quandt3

1 Department of Biostatistics, School of Public Health, University of North Carolina, Chapel Hill, NC.
2 Department of Family and Community Medicine, Wake Forest University School of Medicine, Winston-Salem, NC.
3 Section on Epidemiology, Department of Public Health Sciences, Wake Forest University School of Medicine, Winston-Salem, NC.

Received for publication March 8, 2002; accepted for publication January 30, 2003.


    ABSTRACT
 
In longitudinal surveillance studies of occupational illnesses, sickness episodes are recorded for workers over time. Since observations on the same worker are typically more similar than observations from different workers, statistical analysis must take into account the intraworker association due to workers’ repeated measures. Additionally, when workers are employed in groups or clusters, observations from workers in the same workplace are typically more similar than observations from workers in different workplaces. For such cluster-correlated longitudinal data, alternating logistic regressions may be used to model the pattern of occupational illness clustering. Data on 182 Latino farmworkers from a 1999 North Carolina study on green tobacco sickness provided an estimated pairwise odds ratio for within-worker clustering of 3.15 (95% confidence interval (CI): 1.84, 5.41) and an estimated pairwise odds ratio for within-camp clustering of 1.90 (95% CI: 1.22, 2.97). After adjustment for risk factors, the estimated pairwise odds ratios were 2.13 (95% CI: 1.18, 3.86) and 1.41 (95% CI: 0.89, 2.24), respectively. In this paper, a comparative analysis of alternating logistic regressions with generalized estimating equations and random-effects logistic regression is presented, and the relative strengths of the three methods are discussed.

agricultural workers’ diseases; generalized estimating equation; logistic models; odds ratio; random effect; repeated measures

Abbreviations: ALR, alternating logistic regression; GEE, generalized estimating equation; GLMM, generalized linear mixed model; GTS, green tobacco sickness; POR, pairwise odds ratio.


    PROBLEM DEVELOPMENT
 
Occupational illness often occurs in clusters (1). Consider the case of agricultural workers (2, 3). Because of shared behaviors and environmental working conditions, two workers from the same work site are likely to be more similar to each other than are a pair of workers from different work sites. Moreover, because of shared genetic traits, two workers from the same family who work together are likely to be more similar than a pair of unrelated workers who work together. The similarity of workers within work sites often leads to intraclass correlation of an illness outcome.

The nesting of individuals (e.g., workers) within subclusters (e.g., families) that are nested within clusters (e.g., work sites) characterizes multistage cluster samples in many cross-sectional studies of illness and its causes or correlates (4–8). Often, multistage sampling is chosen for practical considerations (9), and the resulting intracluster correlation is considered a nuisance that must be addressed in the statistical analysis of the relation of explanatory variables to outcomes. Generalized linear mixed models (GLMMs) and generalized estimating equations (GEEs) are two statistical methods used for this purpose. These approaches are called "subject-specific" and "population-averaged," respectively, because of differences in the interpretation of their regression coefficients (10–12).

Alternating logistic regressions (ALRs) are useful for modeling the pattern of clustering and the marginal probability of illness in populations (4–8). Knowledge of disease clustering is important because it may provide insights into the etiology of disease and risk factors operating within clusters. Modeling the pattern of clustering with and without covariate adjustment may reveal whether known risk factors explain or mitigate the clustering. Persistence of a sizable within-cluster correlation after covariate adjustment often indicates a need to search for further risk factors, which, in turn, may guide the design and targeting of future interventions aimed at reducing the incidence of disease.

ALRs model the within-cluster association with pairwise odds ratios (PORs). A model that has been commonly used estimates two PORs: a within-subcluster odds ratio and a within-cluster/between-subcluster odds ratio. This model may apply to agricultural workers nested within families (subclusters) that are nested within work sites (clusters).

This paper shows that ALR is applicable to detecting patterns of clustering of binary responses in cluster-correlated longitudinal data. Here, work sites are the clusters, and the repeated measurements on each worker are the subclusters. Models for within-cluster association that include time as a covariate are applied to longitudinal data on migrant tobacco workers. A comparative analysis of ALR with GEEs and random-effects logistic regression (a kind of GLMM) is presented, and the relative strengths of the three methods are discussed.


    REGRESSION MODELS
 
General population-averaged model formulation
We use the POR to describe the association of observations within clusters. In a model without covariates, the POR can be calculated by constructing a list of all possible pairs of observations from the same cluster organized into a 2 × 2 table, where discordant pairs are evenly split between the (0,1) and (1,0) cells and the odds ratio is calculated as the cross-product ratio (4, 13, 14). The POR is interpreted similarly to the usual odds ratio, where a value of one indicates no association. Except when all clusters are of size two, such as in studies of twins (13), the standard variance estimate of the odds ratio is not valid because not all pairs are independent.
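The covariate-free calculation just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `pairwise_odds_ratio` and the toy data are assumptions introduced here, and only the point estimate is returned because the usual variance formula does not apply to dependent pairs.

```python
from itertools import combinations


def pairwise_odds_ratio(clusters):
    """Covariate-free pairwise odds ratio (POR) from clustered 0/1 outcomes.

    `clusters` is a list of lists; each inner list holds the binary
    outcomes observed in one cluster. All unordered within-cluster pairs
    are tabulated, discordant pairs are split evenly between the (0,1)
    and (1,0) cells, and the cross-product ratio is returned.
    """
    both_ill = 0      # pairs with outcomes (1, 1)
    neither_ill = 0   # pairs with outcomes (0, 0)
    discordant = 0    # pairs with outcomes (0, 1) or (1, 0)

    for outcomes in clusters:
        for first, second in combinations(outcomes, 2):
            if first == 1 and second == 1:
                both_ill += 1
            elif first == 0 and second == 0:
                neither_ill += 1
            else:
                discordant += 1

    half_discordant = discordant / 2  # even split between off-diagonal cells
    if half_discordant == 0:
        raise ValueError("no discordant pairs: the odds ratio is undefined")
    return (both_ill * neither_ill) / (half_discordant ** 2)


if __name__ == "__main__":
    # Hypothetical toy data: three clusters of sizes 3, 3, and 4.
    toy_clusters = [[1, 1, 0], [0, 0, 0], [1, 0, 0, 0]]
    print(pairwise_odds_ratio(toy_clusters))
```

Pairs are formed only within clusters, which matches the assumption that observations from different clusters are independent.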

ALR is used to fit a model for the within-cluster association after adjustment for the effect of covariates on the marginal probability of illness. Let $Y_{ijt}$ be a binary response (e.g., $Y_{ijt} = 1$ if ill, 0 if not ill) on the $t$th observation within the $j$th subcluster (e.g., worker) of the $i$th cluster (e.g., work site). A marginal model for the probability of illness is

$$\mathrm{logit}\,\Pr(Y_{ijt} = 1) = x'_{ijt}\beta, \tag{1}$$

where $x'_{ijt} = (x_{1ijt}, x_{2ijt}, \ldots, x_{pijt})$ is a $1 \times p$ covariate vector, including cluster-level, subcluster-level, and observation-level covariates. GEEs with a "working correlation matrix" may be used to estimate $\beta$ (11). Alternatively, ALR fits model 1 and a model for the POR among observations within a cluster

$$\log \mathrm{POR}(Y_{ijt}, Y_{ij't'}) = z'_{ijtj't'}\alpha, \tag{2}$$

where $z'_{ijtj't'} = (z_{1ijtj't'}, \ldots, z_{qijtj't'})$ is a $1 \times q$ covariate vector for pairs of observations indexed by $(jt, j't')$ in the $i$th cluster. The log odds ratio of any two observations from different clusters is assumed to be zero, implying independence among them. A simple model widely used in cross-sectional studies with three-level data specifies two types of pairwise associations

$$\log \mathrm{POR}(Y_{ijt}, Y_{ij't'}) = \alpha_0 + \alpha_1 z_{ijtj't'}, \tag{3}$$

where $z_{ijtj't'} = 1$ if the pair of observations is from the same subcluster ($j = j'$) and 0 otherwise. In model 3, $\alpha_0$ is the log odds ratio for association among observations from different subclusters, and $\alpha_0 + \alpha_1$ is the log odds ratio for within-subcluster association. Second-order GEEs, or GEE2 (14–17), are also useful for estimating $\alpha$ but are not feasible for larger clusters. ALR provides estimates of $\alpha$ nearly as efficiently as GEE2 does, and, like GEEs, gives valid estimates of $\beta$ when the model for $\alpha$ is incorrect (18). In this paper, ALR is applied by using the GENMOD procedure in SAS software; model 3 is fit by using GENMOD with the NEST1 and SUBCLUST options on the REPEATED statement (19).
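As a worked illustration of how the two parameters of model 3 map to the two PORs, the unadjusted estimates quoted in the abstract (within-camp POR of 1.90, within-worker POR of 3.15) can be converted back to the log scale. The arithmetic below is an illustration only, rounded to three decimals, with hats denoting estimates:

```latex
\begin{aligned}
\hat\alpha_0 &= \log 1.90 \approx 0.642, \\
\hat\alpha_0 + \hat\alpha_1 &= \log 3.15 \approx 1.147, \\
\hat\alpha_1 &= \log(3.15 / 1.90) \approx 0.506,
\qquad e^{\hat\alpha_1} = 3.15 / 1.90 \approx 1.66 .
\end{aligned}
```

The quantity $e^{\alpha_1}$ is therefore the factor by which the odds ratio for two observations from the same subcluster exceeds the odds ratio for two observations from different subclusters of the same cluster.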

General subject-specific model formulation
A subject-specific model analogous to models 1 and 3 is the three-level random-effects model

$$\mathrm{logit}\,\Pr(Y_{ijt} = 1 \mid \nu_i, \eta_{ij}) = x'_{ijt}\beta^* + \nu_i + \eta_{ij}, \tag{4}$$

where, generally, $\beta^* \ne \beta$. The random effects $\nu_i$ and $\eta_{ij}$ are assumed to be independently distributed as $N(0, \sigma^2_\nu)$ and $N(0, \sigma^2_\eta)$, respectively. The variance components, $\sigma^2_\nu$ and $\sigma^2_\eta$, address between-cluster and within-cluster/between-subcluster variation, respectively. This model generalizes standard logistic regression (both variance components equal to zero) and the random intercept model (a single nonzero variance component). Although both population-averaged and subject-specific approaches extend standard logistic regression to clustered data, they estimate different parameters for the covariates $x_{ijt}$ and provide different interpretations (10, 11). Whereas odds ratios given by model 1 represent differences in risk among populations, odds ratios given by model 4 relate change in risk due to changes in covariates for an individual. Random-effects logistic regression models were fit by using restricted pseudolikelihood implemented with the SAS GLIMMIX macro (20). A review of available software is available elsewhere (21).
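The difference between subject-specific and population-averaged coefficients can be illustrated numerically. The sketch below assumes a normally distributed random intercept with standard deviation `sigma`, averages the subject-specific probability over it by trapezoid integration, and compares the resulting marginal logit with the approximation given by Zeger et al. (11), in which the subject-specific coefficient is divided by sqrt(1 + c²σ²) with c = 16√3/(15π). For a model with two independent normal random effects, σ² would be the sum of the two variance components. Function names and input values are illustrative assumptions, not quantities from the GTS analysis.

```python
import math

# Constant from the logistic-normal approximation of Zeger et al. (11).
C = 16 * math.sqrt(3) / (15 * math.pi)


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Log odds."""
    return math.log(p / (1.0 - p))


def marginal_logit(linear_predictor, sigma, n_steps=4000, z_max=8.0):
    """Logit of E[expit(linear_predictor + sigma * Z)], with Z ~ N(0, 1).

    The expectation is computed with the trapezoid rule on [-z_max, z_max].
    """
    step = 2 * z_max / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        z = -z_max + k * step
        weight = 0.5 if k in (0, n_steps) else 1.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        total += weight * expit(linear_predictor + sigma * z) * density
    return logit(total * step)


def approximate_marginal_logit(linear_predictor, sigma):
    """Closed-form attenuation approximation on the logit scale."""
    return linear_predictor / math.sqrt(1 + (C * sigma) ** 2)


if __name__ == "__main__":
    # Hypothetical values: subject-specific log odds of 1.0, sigma of 1.0.
    print(marginal_logit(1.0, 1.0), approximate_marginal_logit(1.0, 1.0))
```

With no random-effect variance the two scales coincide; as the variance grows, the population-averaged log odds shrinks toward zero relative to the subject-specific value.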

Models for cluster-correlated longitudinal data
For cluster-correlated longitudinal data, model 4 or, alternatively, models 1 and 3 apply; the repeated measures of persons constitute a subcluster and all measurements within geographically defined groups of persons a cluster (e.g., work sites, schools, medical practices, communities). In model 4, $\nu_i$ is the random deviation from the average risk for the $i$th group, and $\eta_{ij}$ is the random deviation for the $j$th person in the $i$th group. Alternatively, $\alpha_0$ from model 3 relates the clustering of observations from different persons, and $\alpha_0 + \alpha_1$ relates two observations from the same person. More technically, the within-group/between-person odds ratio, $e^{\alpha_0}$, for any pair of observations $t$ from person $j$ and $t'$ from person $j'$ ($j \ne j'$) is defined as the odds of person $j$ at observation $t$ having the outcome given person $j'$ at observation $t'$ having the outcome divided by the odds of person $j$ at observation $t$ having the outcome given person $j'$ at observation $t'$ not having the outcome. The within-person odds ratio, $e^{\alpha_0 + \alpha_1}$, for any pair of observations $t$ and $t'$ from the same person is defined as the odds of the outcome at observation $t$ given the outcome at observation $t'$ divided by the odds of the outcome at observation $t$ given the absence of the outcome at observation $t'$. Time may be considered as a factor that expands model 3. A general class of models is

$$\log \mathrm{POR}(Y_{ijt}, Y_{ij't'}) = \alpha_0 + z'_{ijtj't'}\alpha_1 + u'_{ijtj't'}\alpha_2 + w'_{ijtj't'}\alpha_3, \tag{5}$$

where $z'_{ijtj't'}$ pertains to the association structure from the cross-sectional design, as in model 2; $u'_{ijtj't'}$ is a vector of time variables; and $w'_{ijtj't'}$ is a vector of covariate-by-time interactions that enable the POR corresponding to different levels of clustering in the cross-sectional component of the data to vary by the difference in time between paired observations. Fitting model 5 requires the SAS PROC GENMOD user to specify the data set for the POR model with the ZDATA option on the REPEATED statement. Details of fitting such models, including practical strategies for addressing the complexities of data with large clusters consisting of multiple strings of unbalanced and irregularly timed repeated measures data, are available from the first author.

Model 5 is applicable to prospective occupational epidemiologic studies of disease clustering. Geographic clustering of an occupational illness among workers may occur when workplaces are the first-stage sampling units and workers within workplaces are the second-stage sampling units. In cluster-correlated longitudinal data, model 5 is useful for characterizing temporal effects on clustering when the duration of illness or its effects is short relative to the timing of repeated measures. Generally, the sickness status of a worker observed on two occasions close in time may be more alike than that from occasions observed farther apart.
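To see what a POR implies on the probability scale, the joint probability of a pair of binary responses can be recovered from the two marginal probabilities and the odds ratio by solving the quadratic equation that defines the cross-product ratio (a standard result for 2 × 2 tables). The sketch below is illustrative only: the function names are assumptions, and the inputs in the example are the crude overall GTS incidence (1.61 per 100 days at risk) and the unadjusted within-worker POR (3.15) reported in this paper, applied as if every worker-day carried the same marginal risk.

```python
import math


def joint_probability(p1, p2, odds_ratio):
    """Pr(Y1 = 1, Y2 = 1) given Pr(Y1 = 1), Pr(Y2 = 1), and their odds ratio."""
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("marginal probabilities must lie strictly in (0, 1)")
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    if math.isclose(odds_ratio, 1.0):
        return p1 * p2  # independence
    # Root of (psi - 1) p11^2 - [1 + (psi - 1)(p1 + p2)] p11 + psi p1 p2 = 0
    # that yields a valid 2 x 2 table.
    a = 1 + (p1 + p2) * (odds_ratio - 1)
    discriminant = a * a - 4 * odds_ratio * (odds_ratio - 1) * p1 * p2
    return (a - math.sqrt(discriminant)) / (2 * (odds_ratio - 1))


def conditional_probability(p1, p2, odds_ratio):
    """Pr(Y1 = 1 | Y2 = 1) implied by the marginals and the odds ratio."""
    return joint_probability(p1, p2, odds_ratio) / p2


if __name__ == "__main__":
    incidence = 0.0161  # crude daily incidence reported in the paper
    print(conditional_probability(incidence, incidence, 3.15))
```

Under these simplifying assumptions, the implied probability of GTS on one at-risk day given GTS on another at-risk day for the same worker is roughly 0.05, compared with the crude 0.016.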


    EXAMPLE
 
The data set
In this section, ALR is applied to repeated-measures data collected in a surveillance study of green tobacco sickness (GTS) among Latino migrant farmworkers. GTS is acute nicotine poisoning caused by dermal absorption of nicotine from mature tobacco plants. Its symptoms include vomiting, nausea, headaches, and dizziness (22). Our goal was to identify risk factors and estimate the within-camp and within-worker clustering of GTS.

The study site included two nonadjacent counties in central North Carolina, Wake and Granville. A total of 182 workers from 37 residence sites were surveyed one to five times, over the 12-week growing season, from June 15 to September 5, 1999. Residence sites are referred to as work camps since, in most circumstances, workers who lived together worked together. A worker’s survey occasions were approximately 2 weeks apart and gathered daily information on the 7-day period that included the day of the survey and the 6 days immediately prior to the survey. Only days for which workers were at risk for GTS were included as observations in the data set. A worker was at risk for GTS on a given day if he or she worked in tobacco that day or the day before and was not already sick with GTS. Thus, each respondent contributed from zero to seven observations per survey. There were 660 surveys providing a total of 4,049 daily observations on 182 workers. Two to nine workers were sampled per camp; cluster sizes, the number of worker-days at risk per camp for all workers combined, ranged from 20 to 208. Farmworkers contributed 1–35 observation days, with an average of 22.3 days per worker. A key feature of the data is that they span 83 consecutive study days, resulting in irregularly timed, repeated-measures data for workers.

The response variable is a daily indicator for GTS. On any given day that a worker was at risk, GTS was defined as present if the worker reported vomiting or nausea together with headache or dizziness. If these conditions were met on 2 consecutive days, only one GTS event was recorded and the worker was considered not at risk on the second day. The reason for this stipulation is that symptoms of nicotine poisoning persist as long as nicotine remains in the body and has not been metabolized to its other forms. Because the mean elimination half-life of serum nicotine is approximately 4 hours (23), exposure on a workday may result in symptoms that persist into the second day without further exposure to tobacco plants. Because symptoms should not persist to a third day, 3 or 4 consecutive days of symptoms were treated as two GTS events. There were a total of 65 GTS events from 57 surveys, and the overall incidence of GTS was 1.61 in 100 days at risk. Forty-four workers (24.2 percent) reported at least one occurrence of GTS during the study. Table 1 shows marked between-camp variability in observed GTS incidence rates. Of the 65 GTS events reported, 29.2 percent (n = 19) occurred in clusters of workers from the same sites sick with GTS on the same day. Over 70 percent (n = 47) of GTS events occurred among workers who worked together in camps during the same week.
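The counting rule for consecutive symptomatic days can be made concrete with a short sketch. It encodes one reading of the rule, in which symptomatic days within an unbroken run alternate between a recorded event and a day not at risk, so that runs of 2, 3, and 4 days yield 1, 2, and 2 events. The function name and labels are hypothetical, and the sketch ignores the separate requirement that the worker had worked in tobacco that day or the day before.

```python
def classify_days(symptomatic):
    """Label each day in a sequence of daily symptom flags.

    `symptomatic` is a list of booleans for consecutive calendar days.
    Within a run of symptomatic days, odd-numbered days (1st, 3rd, ...)
    are recorded as events and even-numbered days are treated as not at
    risk. Runs longer than 4 days are not addressed in the text; this
    sketch simply continues the alternation for them.
    """
    labels = []
    position_in_run = 0
    for has_symptoms in symptomatic:
        if not has_symptoms:
            position_in_run = 0
            labels.append("at_risk_no_event")
            continue
        position_in_run += 1
        if position_in_run % 2 == 1:
            labels.append("event")
        else:
            labels.append("not_at_risk")
    return labels


if __name__ == "__main__":
    # Hypothetical week: symptoms on days 2-4 only.
    week = [False, True, True, True, False, False, False]
    print(classify_days(week))
```

Days labeled `not_at_risk` would be dropped from the analysis data set, which is why each respondent can contribute fewer than seven observations per survey.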


 
TABLE 1. Distribution of incidence rates of green tobacco sickness by work camp in a North Carolina study, 1999
 
This analysis included the risk factors for GTS identified in a previous analysis (22). A biobehavioral model posits that the rate of transdermal absorption is determined by the amount of dermal exposure to tobacco plants, as well as several other factors (22). The amount of contact with tobacco, measured by two variables, is thought to increase dermal exposure to nicotine. Type of work was defined as the dominant activity reported for a worker during the current and previous days. "Topping" refers to breaking the flower off the top of the plant. "Priming" is picking or harvesting the tobacco leaves. "Barning" refers to putting the harvested tobacco into a barn for curing. "Other" relates to any other activity, such as driving a tractor or not working in tobacco. Learned avoidance through work experience may reduce dermal exposure to tobacco. Years worked in tobacco was grouped into the categories of first year, 2–4 years, and 5 or more years. Two variables are thought to be positively related to transdermal absorption of nicotine. Worked in wet clothes (yes/no) indicated whether the worker worked in wet clothes at any time during the current or previous day. Heat was taken as the maximum recorded temperature over the previous and current days in each county; for variables recorded on a daily basis, covariates were defined by using the current and previous days’ measurements. Tobacco use reduces transdermal absorption and was defined as present if use occurred at least once during the week to which the survey applied. Finally, the agricultural season was divided into three periods: early (June 15–July 18), middle (July 19–August 8), and late (August 9–September 5).

Analysis
Table 2 reports the incidence of GTS by risk factor subgroups. An increased incidence of GTS was associated with season (middle or late), type of work (priming), having less than 5 years of experience, and working in wet clothes. A decreased incidence of GTS was associated with using tobacco. Most GTS events occurred in the last 2 weeks of July through the end of August (study days 34–78). Evidence suggested that high temperatures were positively associated with GTS (Spearman rank correlation, 0.42). Temperature should be considered in the context of other confounding variables. Because season is associated with both maximum daily temperature (highest temperatures occur in midseason) and type of work (e.g., priming, the task with the greatest exposure, is done only from mid- to late season), we did not consider season in our model for computing adjusted within-cluster odds ratios.


 
TABLE 2. Distribution of green tobacco sickness by risk factors in a North Carolina study, 1999
 
Although spatial clustering was the main focus of inference, an initial model was fit to examine whether the within-camp or within-farmworker PORs varied according to time dichotomized at less than 7 days versus 7 days or more. This time cutoff was natural because surveys were administered approximately every 2 weeks, and information across the 7 days in a given survey was collected on the same occasion. We considered model 1 with no risk factors together with a model that is a special case of model 5:

$$\log \mathrm{POR}(Y_{ijt}, Y_{ij't'}) = \alpha_0 + z_{ijtj't'}\alpha_1 + u_{ijtj't'}\alpha_2 + w_{ijtj't'}\alpha_3,$$

where $t$ indexes time in days, $z_{ijtj't'} = 1$ if the pair of observations is from the same worker and 0 otherwise, $u_{ijtj't'} = 1$ if $|t - t'| < 7$ and 0 otherwise, and $w_{ijtj't'} = z_{ijtj't'} \times u_{ijtj't'}$. The 95 percent confidence intervals for the four estimated PORs of interest are (0.90, 3.16), (1.32, 3.08), (1.71, 5.12), and (1.47, 10.20). These are within-camp PORs, respectively, for two observations 1) from different workers 7 days or more apart, 2) from different workers less than 7 days apart, 3) from the same worker 7 days or more apart, and 4) from the same worker less than 7 days apart. Although pairs of observations from the same week appeared to be more highly associated than similar pairs from different weeks, the time effects, considered jointly, were not statistically significant. With $\theta = (\alpha_2, \alpha_3)'$, this result corresponds to the Wald test of the hypothesis $\theta = 0$, whose test statistic has a large-sample chi-square distribution with 2 degrees of freedom.
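The pair-level covariates of this model can be constructed mechanically from each camp's list of (worker, study day) observations. The sketch below is a hedged illustration of that construction, not the data set the authors supplied to SAS: the function names, the dictionary layout, and the coefficient values in the example are all hypothetical.

```python
import math
from itertools import combinations


def pair_covariates(observations, window_days=7):
    """Build z, u, and w for every pair of observations within one camp.

    `observations` is a list of (worker_id, study_day) tuples from a
    single camp. For each unordered pair:
      z = 1 if both observations come from the same worker, else 0
      u = 1 if the observations are fewer than `window_days` apart, else 0
      w = z * u (same worker and close in time)
    """
    rows = []
    for (worker_a, day_a), (worker_b, day_b) in combinations(observations, 2):
        z = 1 if worker_a == worker_b else 0
        u = 1 if abs(day_a - day_b) < window_days else 0
        rows.append({"z": z, "u": u, "w": z * u})
    return rows


def pairwise_odds_ratio_from_model(alpha, row):
    """POR implied by coefficients (alpha0, alpha1, alpha2, alpha3) for one pair."""
    alpha0, alpha1, alpha2, alpha3 = alpha
    log_por = alpha0 + alpha1 * row["z"] + alpha2 * row["u"] + alpha3 * row["w"]
    return math.exp(log_por)


if __name__ == "__main__":
    # Hypothetical camp: worker A observed on days 1, 3, 20; worker B on day 2.
    camp = [("A", 1), ("A", 3), ("A", 20), ("B", 2)]
    made_up_alpha = (0.5, 0.6, 0.2, 0.1)  # illustrative values only
    for row in pair_covariates(camp):
        print(row, pairwise_odds_ratio_from_model(made_up_alpha, row))
```

The four distinct covariate patterns (z, u) = (0, 0), (0, 1), (1, 0), and (1, 1) correspond to the four PORs enumerated in the text, with log values of α0, α0 + α2, α0 + α1, and α0 + α1 + α2 + α3.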

Proceeding with the simpler model given by model 3, table 3 shows the estimated PORs and their 95 percent confidence intervals for three different risk factor models (model 1). Not adjusting for risk factors resulted in a statistically significant clustering both within farmworkers and between farmworkers within camps. The estimated POR relating two observations from the same farmworker was 3.15. This finding suggests that some workers have a significantly greater propensity for GTS than others do. The estimated POR relating GTS in one worker to GTS in another worker within the same camp was 1.90. When we adjusted for those risk factors that affect exposure to green tobacco leaves, the within-worker and within-camp PORs were smaller than their unadjusted counterparts. The nonsignificance of the within-camp POR suggests that exposure factors largely explain the clustering of GTS among workers in a camp. When the model included factors affecting absorption of nicotine as well as exposure, the within-worker POR, but not the within-camp POR, was further reduced to a notable degree. The adjusted within-worker POR remained significantly different from one, suggesting that even after adjusting for known risk factors, some workers have a greater propensity than others for GTS. The stipulation that 2 days of symptoms be treated as one GTS event had little impact on the results. A sensitivity analysis based on 81 days with GTS instead of 65 GTS events provided similar odds ratios for the risk factors; not surprisingly, the adjusted within-worker odds ratio was greater (2.69 vs. 2.13).
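Confidence intervals for odds ratios of this kind are conventionally built on the log scale and then exponentiated. Assuming the reported intervals are such Wald intervals (the text does not say so explicitly), each point estimate should equal the geometric mean of its limits, and the standard error of the log POR can be backed out of the interval width. The sketch checks this against the four estimates quoted in the abstract; the function names are illustrative.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def wald_interval(log_odds_ratio, standard_error, z=Z_975):
    """95% Wald interval for an odds ratio, built on the log scale."""
    lower = math.exp(log_odds_ratio - z * standard_error)
    upper = math.exp(log_odds_ratio + z * standard_error)
    return lower, upper


def implied_estimate_and_se(lower, upper, z=Z_975):
    """Point estimate and log-scale standard error implied by an interval.

    Valid only if the interval is symmetric on the log scale.
    """
    estimate = math.sqrt(lower * upper)  # geometric mean of the limits
    standard_error = (math.log(upper) - math.log(lower)) / (2 * z)
    return estimate, standard_error


# (label, reported POR, lower limit, upper limit) as quoted in the abstract.
REPORTED = [
    ("within-worker, unadjusted", 3.15, 1.84, 5.41),
    ("within-camp, unadjusted", 1.90, 1.22, 2.97),
    ("within-worker, adjusted", 2.13, 1.18, 3.86),
    ("within-camp, adjusted", 1.41, 0.89, 2.24),
]

if __name__ == "__main__":
    for label, reported, lower, upper in REPORTED:
        estimate, se = implied_estimate_and_se(lower, upper)
        print(f"{label}: reported {reported}, implied {estimate:.2f}, SE(log POR) {se:.3f}")
```

Agreement to within rounding for all four rows is consistent with intervals constructed on the log scale, although it does not prove which variance estimator was used.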


 
TABLE 3. Within-cluster pairwise odds ratios, estimated by using alternating logistic regressions, for green tobacco sickness in a North Carolina study, 1999
 
The estimated odds ratios for risk factors from the fully adjusted model in table 3 (E + A) appear in table 4 in the ALR column, along with results from comparative analyses. Regardless of modeling approach, there was a strong effect of type of work on incidence of GTS. The ALR analysis suggests that workers who engaged in priming as the dominant activity had 3.58 times higher odds of GTS than workers engaged in topping. Workers who worked in wet clothes had 2.45 times higher odds of GTS than workers who did not work in wet clothes. Temperature was also found to be positively associated with GTS. Ignoring clustering (standard logistic regression) resulted in statistical significance for less than 5 years of experience and tobacco use, in contradiction to methods that adjust for clustering. Reflecting this adjustment, confidence intervals were generally wider for GEEs and ALR than for standard logistic regression. ALR with exchangeable POR (model 3 with $\alpha_1 = 0$ and an estimated common POR of 1.57) gave odds ratio estimates for risk factors (not shown) very similar to those of GEEs in table 4.


 
TABLE 4. Population-averaged and cluster-specific odds ratio estimates from logistic models for green tobacco sickness in a North Carolina study, 1999
 
As shown in table 4, estimated subject-specific odds ratios for risk factors from the random-intercept two-level model were similar to those from GEEs, and those from the three-level random-effects model were similar to those from ALR. Both the population-averaged and subject-specific model results showed that within-worker clustering, after adjustment for covariates, was statistically significant (ALR) or nearly significant (GLMM), whereas within-camp clustering was not (ALR or GLMM).


    DISCUSSION
 
By applying ALR to data on GTS among Latino migrant farmworkers, we demonstrated that the method is useful for simultaneously estimating effects of risk factors and clustering among binary responses in cluster-correlated longitudinal data. Our application involved an acute response to short-term exposures. The ALR method is also applicable to longitudinal studies of chronic effects of long-term exposures, such as pulmonary function data, that may be collected across the workplace or in the general population (e.g., multiple cities).

Knowledge of the clustering of GTS provided by ALR is important for several reasons. First, the fact that the ALR estimate of within-worker association remained significantly different from one after adjusting for known risk factors suggests that unmeasured factors play a role in GTS. One possibility is between-individual differences in nicotine metabolism. Analysis is currently under way of urinary metabolites of nicotine for a clinic-based case-control study conducted to complement the present survey. This analysis may identify distinct patterns of nicotine metabolism. Second, when equations 3 and 4 of Katz et al. (4) were used, the intracluster correlation of GTS estimated by using ALR with exchangeable PORs led to computation of a design effect of 3.18 that may be used in planning future studies. This value is somewhat inflated because of varying cluster sizes in the GTS study. For example, if the cluster size was 109 for every camp, the design effect would be 2.87.
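The equal-cluster-size figure can be related to an intracluster correlation through the familiar design-effect formula, 1 + (m − 1)ρ. Whether this is exactly the form of the equations in Katz et al. (4) is an assumption here, since those equations are not reproduced in this paper. Under that assumption, a design effect of 2.87 with m = 109 observations per camp implies an intracluster correlation of roughly 0.017. Function names are illustrative.

```python
def design_effect(cluster_size, intracluster_correlation):
    """Design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * intracluster_correlation


def implied_intracluster_correlation(deff, cluster_size):
    """Invert the equal-cluster-size design-effect formula for rho."""
    if cluster_size <= 1:
        raise ValueError("cluster size must exceed 1")
    return (deff - 1) / (cluster_size - 1)


def effective_sample_size(n_observations, deff):
    """Number of independent observations carrying the same information."""
    return n_observations / deff


if __name__ == "__main__":
    rho = implied_intracluster_correlation(2.87, 109)
    print(f"implied intracluster correlation: {rho:.4f}")
    # 4,049 worker-days and a design effect of 3.18 are taken from the text.
    print(f"effective sample size: {effective_sample_size(4049, 3.18):.0f}")
```

Read this way, the reported design effect of 3.18 means that the 4,049 clustered worker-days carry about as much information about overall incidence as roughly 1,270 independent observations would.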

Ignoring within-cluster association leads to invalid inference for the risk factors predicting illness. In the presence of clustering, standard errors from standard logistic regression tend to be inappropriately small for cluster-level covariates and too large for observation-level covariates (10, 11). For the GTS data, with the exception of temperature, which may vary daily, standard errors were always larger for GEEs. Use of the GEE and ALR methods found that temperature, but not work experience (a subcluster-level covariate), was a statistically significant predictor of GTS; standard logistic regression led to the reverse conclusion.

The similarity of ALR and GEE odds ratio estimates for GTS risk factors in model 1 is not surprising. Both methods involve computations that alternate between estimation of ß and within-cluster association. In fact, the ß-estimation step is the same, and both rely on empirical sandwich variance estimation to account for the within-cluster association to give valid results for the risk factor model even when the association model is misspecified. Our analysis illustrates that careful modeling of within-cluster correlation may result in more efficient estimation of parameters in the marginal mean model, with potentially nonnegligible benefits for studies with a small number of large-size clusters (15). Standard errors resulting in the confidence intervals for GTS risk factors in the three-level ALR model in table 4 were slightly smaller than their GEE (or ALR) two-level model counterparts for five of seven covariates. Thus, use of ALR was mildly beneficial in clarifying the role of risk factors for GTS, even though results based on GEEs were qualitatively similar.

GTS results for GLMM were also qualitatively similar to those from using ALR. Generally, the presence of random effects in nonlinear models, such as random-effects logistic models, causes the regression coefficients ß* to be numerically different and to have interpretations different from corresponding coefficients ß from population-averaged models (10, 11). The fact that GTS risk factor odds ratio estimates from the two approaches were similar may be due to the moderately small size of the within-cluster association or to finite sample biases of the procedures.

ALR has several advantages compared with GLMM analysis of multilevel binary data. While both offer similar degrees of flexibility in modeling multiple sources of variation, ALR provides estimates of PORs that have natural interpretations for quantifying the magnitude of within-cluster association. In contrast, GLMM variance component estimates apply to the scale of the link function, and, as in the GTS study, their meaning may be more elusive. Next, valid inference for the effect of risk factors on the probability of illness with GLMM requires correct specification of the distribution of the random effects and the link function, whereas ALR requires only the latter. However, valid inference for within-cluster association with ALR requires that model 2 be specified correctly. A disadvantage of ALR is the requirement that data be missing completely at random compared with the less stringent missing-at-random assumption of GLMM (24).

The GTS data illustrate limitations in the scope of ALR. One concern is that the number of parameters estimated should not be excessive because of sparse data represented by a limited number of GTS events. A second concern is that the poor performance of the empirical sandwich estimator when applied with a small number of clusters in GEE applications (e.g., fewer than 40) may also be pertinent to ALR in the GTS study. GLMM as estimated by using restricted pseudolikelihood also has limitations for binary data; it may give biased estimates of variance components (12). Other estimation approaches to GLMMs are available (25, 26). The usefulness of software for fitting models similar to those considered in this paper depends on whether the software can handle multilevel models (21).


    ACKNOWLEDGMENTS
 
This research was supported by grant R01 OH03648 from the National Institute for Occupational Safety and Health.


    NOTES
 
Correspondence to Dr. John S. Preisser, Department of Biostatistics, School of Public Health, University of North Carolina, CB #7420, Chapel Hill, NC 27599-7420 (e-mail: jpreisse@bios.unc.edu).


    REFERENCES
 

  1. Fleming LE, Ducatman AM, Shalat SL. Disease clusters in occupational medicine: a protocol for their investigation in the workplace. Am J Ind Med 1992;22:33–47.
  2. McKnight RH, Koetke CA, Donnelly C. Familial clusters of green tobacco sickness. J Agromed 1996;3:51–9.
  3. McKnight RH, Kryscio RJ, Mays JR, et al. Spatial and temporal clustering of an occupational poisoning: the example of green tobacco sickness. Stat Med 1996;15:747–57.
  4. Katz J, Carey VJ, Zeger SL, et al. Estimation of design effects and diarrhea clustering within households and villages. Am J Epidemiol 1993;138:994–1006.
  5. Bobashev GV, Anthony JC. Clusters of marijuana use in the United States. Am J Epidemiol 1998;148:1168–74.
  6. Bobashev GV, Anthony JC. Use of alternating logistic regression in studies of drug-use clustering. Subst Use Misuse 2000;35:1051–73.
  7. Delva J, Bobashev G, Gonzalez G, et al. Clusters of drug involvement in Panama: results from Panama’s 1996 National Youth Survey. Drug Alcohol Depend 2000;60:251–7.
  8. Petronis KR, Anthony JC. Perceived risk of cocaine use and experience with cocaine: do they cluster within US neighborhoods and cities? Drug Alcohol Depend 2000;57:183–92.
  9. Korn EL, Graubard BI. Analysis of health surveys. 1st ed. New York, NY: John Wiley & Sons, 1999.
  10. Hu FB, Goldberg J, Hedeker D, et al. Comparison of population-averaged and subject-specific approaches for analyzing repeated binary outcomes. Am J Epidemiol 1998;147:694–703.
  11. Zeger SL, Liang KY, Albert PS. Methods for longitudinal data: a generalized estimating equations approach. Biometrics 1988;44:1049–60.
  12. Breslow NE, Clayton DG. Approximate inference in generalized linear models. J Am Stat Assoc 1993;88:9–25.
  13. Ananth CV, Preisser JS. Bivariate logistic regression: modelling the association of small for gestational age births in twin gestations. Stat Med 1999;18:2011–23.
  14. Qaqish BF, Liang KY. Marginal models for correlated binary responses with multiple classes and multiple levels of nesting. Biometrics 1992;48:939–50.
  15. Liang KY, Zeger SL, Qaqish BF. Multivariate regression analyses for categorical data (with discussion). J R Stat Soc (B) 1992;54:3–40.
  16. Podgor MJ, Hiller R. Associations of types of lens opacities between and within eyes of individuals: an application of second-order generalised estimating equations. The Framingham Eye Studies Group. Stat Med 1996;15:145–56.
  17. Zhao LP, Prentice RL. Correlated binary regression using a quadratic exponential model. Biometrika 1990;77:642–8.
  18. Carey VJ, Zeger SL, Diggle P. Modeling multivariate binary data with alternating logistic regressions. Biometrika 1993;80:517–26.
  19. SAS Institute, Inc. SAS/STAT user’s guide, version 8. Cary, NC: SAS Institute, Inc, 1999.
  20. Littell RC, Milliken GA, Stroup WW, et al. SAS system for mixed models. Cary, NC: SAS Institute, Inc, 1996.
  21. Zhou XH, Perkins AJ, Hui SL. Comparisons of software packages for generalized linear multilevel models. Am Stat 1999;53:282–90.
  22. Arcury TA, Quandt SA, Preisser JS, et al. The incidence of green tobacco sickness among Latino farmworkers. J Occup Environ Med 2001;43:601–9.
  23. Quandt SA, Arcury TA, Preisser JS, et al. Environmental and behavioral predictors of salivary cotinine in Latino tobacco workers. J Occup Environ Med 2001;43:844–52.
  24. Little RJA, Rubin DB. Statistical analysis of missing data. New York, NY: John Wiley & Sons, 1987.
  25. Hedeker D, Gibbons RD. A random effects ordinal regression model for multilevel analysis. Biometrics 1994;50:933–44.
  26. Diggle PJ, Liang KY, Zeger SL. Analysis of longitudinal data. Oxford, United Kingdom: Oxford Science Publications, 1994.