1 Department of Biostatistics, School of Public Health, University of North Carolina, Chapel Hill, NC.
2 Department of Family and Community Medicine, Wake Forest University School of Medicine, Winston-Salem, NC.
3 Section on Epidemiology, Department of Public Health Sciences, Wake Forest University School of Medicine, Winston-Salem, NC.
Received for publication March 8, 2002; accepted for publication January 30, 2003.
ABSTRACT
agricultural workers' diseases; generalized estimating equation; logistic models; odds ratio; random effect; repeated measures
Abbreviations: ALR, alternating logistic regression; GEE, generalized estimating equation; GLMM, generalized linear mixed model; GTS, green tobacco sickness; POR, pairwise odds ratio.
PROBLEM DEVELOPMENT
The nesting of individuals (e.g., workers) within subclusters (e.g., families) that are nested within clusters (e.g., work sites) characterizes multistage cluster samples in many cross-sectional studies of illness and its causes or correlates (4–8). Often, multistage sampling is chosen for practical considerations (9), and the resulting intracluster correlation is considered a nuisance that must be addressed in the statistical analysis of the relation of explanatory variables to outcomes. Generalized linear mixed models (GLMMs) and generalized estimating equations (GEEs) are two statistical methods used for this purpose. These approaches are called "subject-specific" and "population-averaged," respectively, because of differences in the interpretation of their regression coefficients (10–12).
Alternating logistic regressions (ALRs) provide a procedure for modeling both the pattern of clustering and the marginal probability of illness in populations (4–8). Knowledge of disease clustering is important because it may provide insights into the etiology of disease and risk factors operating within clusters. Modeling the pattern of clustering with and without covariate adjustment may reveal whether known risk factors explain or mitigate the clustering. Persistence of a sizable within-cluster correlation after covariate adjustment often indicates a need to search for further risk factors, which, in turn, may guide the design and targeting of future interventions aimed at reducing the incidence of disease.
ALRs model the within-cluster association with pairwise odds ratios (PORs). A model that has been commonly used estimates two PORs: a within-subcluster odds ratio and a within-cluster/between-subcluster odds ratio. This model may apply to agricultural workers nested within families (subclusters) that are nested within work sites (clusters).
This paper shows that ALR is applicable to detecting patterns of clustering of binary responses in cluster-correlated longitudinal data. Here, work sites are the clusters and the repeated measurements among workers the subclusters. Models for within-cluster association that include time as a covariate are applied to longitudinal data on migrant tobacco workers. A comparative analysis of ALR with GEEs and random-effects logistic regression (a kind of GLMM) is presented, and the relative strengths of the three methods are discussed.
REGRESSION MODELS
ALR is used to fit a model for the within-cluster association after adjustment for the effect of covariates on the marginal probability of illness. Let $Y_{ijt}$ be a binary response (e.g., $Y_{ijt} = 1$ if ill, 0 if not ill) on the $t$th observation within the $j$th subcluster (e.g., worker) of the $i$th cluster (e.g., work site). A marginal model for the probability of illness is
$$\operatorname{logit} \Pr(Y_{ijt} = 1) = x'_{ijt}\beta, \tag{1}$$
where $x'_{ijt} = (x_{1ijt}, x_{2ijt}, \ldots, x_{pijt})$ is a $1 \times p$ covariate vector, including cluster-level, subcluster-level, and observation-level covariates. GEEs with a "working correlation matrix" may be used to estimate $\beta$ (11). Alternatively, ALR fits model 1 and a model for the POR among observations within a cluster
$$\log \operatorname{POR}(Y_{ijt}, Y_{ij't'}) = z'_{ijtj't'}\alpha, \tag{2}$$
where $z'_{ijtj't'} = (z_{1ijtj't'}, \ldots, z_{qijtj't'})$ is a $1 \times q$ covariate vector for pairs of observations indexed by $(jt, j't')$ in the $i$th cluster. The log odds ratio of any two observations from different clusters is assumed to be zero, implying independence among them. A simple model widely used in cross-sectional studies with three-level data specifies two types of pairwise associations
$$\log \operatorname{POR}(Y_{ijt}, Y_{ij't'}) = \alpha_0 + \alpha_1 z_{ijtj't'}, \tag{3}$$
where $z_{ijtj't'} = 1$ if the pair of observations is from the same subcluster ($j = j'$) and 0 otherwise. In model 3, $\alpha_0$ is the log odds ratio for association among observations from different subclusters, and $\alpha_0 + \alpha_1$ is the log odds ratio for within-subcluster association. Second-order GEEs, or GEE2 (14–17), are also useful for estimating $\alpha$ but are not feasible for larger clusters. ALR provides estimates of $\alpha$ nearly as efficiently as GEE2 does, and, like GEEs, gives valid estimates of $\beta$ when the model for $\alpha$ is incorrect (18). In this paper, ALR is applied by using the GENMOD procedure in SAS software; model 3 is fit by using GENMOD with the NEST1 and SUBCLUST options on the REPEATED statement (19).
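As a minimal sketch of how model 3 assigns an odds ratio to each within-cluster pair, the Python fragment below enumerates the pairs of observations in a single cluster, sets the indicator $z$ for a shared subcluster, and evaluates $\exp(\alpha_0 + \alpha_1 z)$. The function name `model3_pairs`, the record layout, and the $\alpha$ values are hypothetical choices made for illustration; they are neither the GENMOD implementation nor estimates from this study.

```python
import math
from itertools import combinations

def model3_pairs(observations, alpha0, alpha1):
    """Return (obs_a, obs_b, z, POR) for every pair of observations in one cluster.

    observations: list of (subcluster_id, obs_index) tuples from a single cluster.
    z is 1 when both observations come from the same subcluster and 0 otherwise,
    so the pairwise odds ratio under model 3 is exp(alpha0 + alpha1 * z).
    """
    rows = []
    for a, b in combinations(observations, 2):
        z = 1 if a[0] == b[0] else 0
        rows.append((a, b, z, math.exp(alpha0 + alpha1 * z)))
    return rows

# Hypothetical cluster: worker "w1" observed twice, worker "w2" observed once.
cluster = [("w1", 1), ("w1", 2), ("w2", 1)]

# Illustrative parameter values only (assumed, not taken from the paper).
pairs = model3_pairs(cluster, alpha0=math.log(1.5), alpha1=math.log(2.0))
for a, b, z, por in pairs:
    print(a, b, z, round(por, 2))
```

Pairs drawn from different clusters never appear in the enumeration, which mirrors the assumption that their log odds ratio is zero.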
General subject-specific model formulation
A subject-specific model analogous to models 1 and 3 is the three-level random-effects model
$$\operatorname{logit} \Pr(Y_{ijt} = 1 \mid b_i, b_{ij}) = x'_{ijt}\beta^{*} + b_i + b_{ij}, \tag{4}$$
where, generally, $\beta^{*} \neq \beta$. The random effects $b_i$ and $b_{ij}$ are assumed to be independently distributed as $N(0, \sigma_1^2)$ and $N(0, \sigma_2^2)$, respectively. The variance components, $\sigma_1^2$ and $\sigma_2^2$, address between-cluster and within-cluster/between-subcluster variation, respectively. This model generalizes standard logistic regression (both variance components equal to zero) and the random intercept model (one variance component equal to zero). Although both population-averaged and subject-specific approaches extend standard logistic regression to clustered data, they estimate different parameters for the covariates $x_{ijt}$ and provide different interpretations (10, 11). Whereas odds ratios given by model 1 represent differences in risk among populations, odds ratios given by model 4 relate change in risk due to changes in covariates for an individual. Random-effects logistic regression models were fit by using restricted pseudolikelihood implemented with the SAS GLIMMIX macro (20). Available software is reviewed elsewhere (21).
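The difference between subject-specific and population-averaged coefficients can be illustrated numerically. The sketch below assumes normally distributed random effects, so that the two independent random effects in model 4 sum to a single normal deviate with variance equal to the sum of the two variance components. It averages the subject-specific probability over that distribution with a simple trapezoid rule and compares the implied marginal log odds ratio with the subject-specific coefficient. All parameter values and function names are hypothetical.

```python
import math

def expit(v):
    return 1.0 / (1.0 + math.exp(-v))

def logit(p):
    return math.log(p / (1.0 - p))

def marginal_prob(eta, sigma, n_grid=2001, width=8.0):
    """Average expit(eta + b) over b ~ N(0, sigma^2) using the trapezoid rule."""
    if sigma == 0:
        return expit(eta)
    lo, hi = -width * sigma, width * sigma
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        b = lo + k * step
        density = math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * expit(eta + b) * density * step
    return total

# Hypothetical subject-specific intercept, slope for a binary covariate,
# and combined random-effect standard deviation.
beta0_star, beta1_star, sigma = -2.0, 1.0, 1.5

p_unexposed = marginal_prob(beta0_star, sigma)
p_exposed = marginal_prob(beta0_star + beta1_star, sigma)
marginal_log_or = logit(p_exposed) - logit(p_unexposed)

print("subject-specific log odds ratio:", beta1_star)
print("population-averaged log odds ratio:", round(marginal_log_or, 3))
```

With a nonzero random-effect variance, the population-averaged log odds ratio is closer to zero than the subject-specific one; with zero variance the two coincide.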
Models for cluster-correlated longitudinal data
For cluster-correlated longitudinal data, model 4 or, alternatively, models 1 and 3 apply; the repeated measures of persons constitute a subcluster and all measurements within geographically defined groups of persons a cluster (e.g., work sites, schools, medical practices, communities). In model 4, the cluster-level random effect, $b_i$, is the random deviation from the average risk for the $i$th group, and the subcluster-level random effect, $b_{ij}$, is the random deviation for the $j$th person in the $i$th group. Alternatively, $\alpha_0$ from model 3 relates the clustering of observations from different persons, and $\alpha_0 + \alpha_1$ relates two observations from the same person. More technically, the within-group/between-person odds ratio, $e^{\alpha_0}$, for any pair of observations $t$ from person $j$ and $t'$ from person $j'$ ($j \neq j'$) is defined as the odds of person $j$ at observation $t$ having the outcome given person $j'$ at observation $t'$ having the outcome divided by the odds of person $j$ at observation $t$ having the outcome given person $j'$ at observation $t'$ not having the outcome. The within-person odds ratio, $e^{\alpha_0 + \alpha_1}$, for any pair of observations $t$ and $t'$ from the same person is defined as the odds of the outcome at observation $t$ given the outcome at observation $t'$ divided by the odds of the outcome at observation $t$ given the absence of the outcome at observation $t'$. Time may be considered as a factor that expands model 3. A general class of models is
$$\log \operatorname{POR}(Y_{ijt}, Y_{ij't'}) = \alpha_0 + z'_{ijtj't'}\alpha_1 + u'_{ijtj't'}\alpha_2 + w'_{ijtj't'}\alpha_3, \tag{5}$$
where $z'_{ijtj't'}$ pertains to the association structure from the cross-sectional design, as in model 2; $u'_{ijtj't'}$ is a vector of time variables; and $w'_{ijtj't'}$ is a vector of covariate-by-time interactions that enable the POR corresponding to different levels of clustering in the cross-sectional component of the data to vary by the difference in time between paired observations. Fitting model 5 requires the SAS PROC GENMOD user to specify the data set for the POR model with the ZDATA option on the REPEATED statement. Details of fitting such models, including practical strategies for addressing the complexities of data with large clusters consisting of multiple strings of unbalanced and irregularly timed repeated measures data, are available from the first author.
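The verbal definition of the pairwise odds ratio given above can be checked with a small calculation. The sketch below takes the four joint probabilities for a pair of binary observations, forms the two conditional odds exactly as the definition states, and confirms that their ratio equals the familiar cross-product ratio. The joint probabilities are hypothetical numbers chosen only for illustration.

```python
def pairwise_odds_ratio(p11, p10, p01, p00):
    """Pairwise odds ratio for binary Y and Y', where pab = Pr(Y = a, Y' = b).

    Numerator: odds that Y = 1 given Y' = 1.
    Denominator: odds that Y = 1 given Y' = 0.
    """
    odds_given_outcome = p11 / p01
    odds_given_no_outcome = p10 / p00
    return odds_given_outcome / odds_given_no_outcome

# Hypothetical joint distribution for one pair of observations (sums to 1).
p11, p10, p01, p00 = 0.02, 0.08, 0.08, 0.82

por = pairwise_odds_ratio(p11, p10, p01, p00)
cross_product = (p11 * p00) / (p10 * p01)
print(round(por, 4), round(cross_product, 4))
```

A value of 1 corresponds to independence of the two observations, which is the value assumed for pairs from different clusters.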
Model 5 is applicable to prospective occupational epidemiologic studies of disease clustering. Geographic clustering of an occupational illness among workers may occur when workplaces are the first-stage sampling units and workers within workplaces are the second-stage sampling units. In cluster-correlated longitudinal data, model 5 is useful for characterizing temporal effects on clustering when the duration of illness or its effects is short relative to the timing of repeated measures. Generally, the sickness status of a worker observed on two occasions close in time may be more alike than that from occasions observed farther apart.
EXAMPLE
The study site included two nonadjacent counties in central North Carolina, Wake and Granville. A total of 182 workers from 37 residence sites were surveyed one to five times, over the 12-week growing season, from June 15 to September 5, 1999. Residence sites are referred to as work camps since, in most circumstances, workers who lived together worked together. A worker's survey occasions were approximately 2 weeks apart, and each survey gathered daily information on the 7-day period that included the day of the survey and the 6 days immediately prior to the survey. Only days for which workers were at risk for GTS were included as observations in the data set. A worker was at risk for GTS on a given day if he or she worked in tobacco that day or the day before and was not already sick with GTS. Thus, each respondent contributed from zero to seven observations per survey. There were 660 surveys providing a total of 4,049 daily observations on 182 workers. Two to nine workers were sampled per camp; cluster sizes, the number of worker-days at risk per camp for all workers combined, ranged from 20 to 208. Farmworkers contributed 1–35 observation days, with an average of 22.3 days per worker. A key feature of the data is that they span 83 consecutive study days, resulting in irregularly timed, repeated-measures data for workers.
The response variable is a daily indicator for GTS. On any given day that a worker was at risk, GTS was defined as present if the worker reported vomiting or nausea and headache or dizziness. If these conditions were met on 2 consecutive days, only one GTS event was recorded and the worker was considered not at risk on the second day. The reason for this stipulation is that symptoms of nicotine poisoning persist as long as nicotine remains in the body and has not been metabolized to its other forms. Because the mean elimination half-life of serum nicotine is approximately 4 hours (23), exposure on a workday may result in symptoms that persist into the second day without further exposure to tobacco plants. Because symptoms should not persist to a third day, 3 or 4 consecutive days of symptoms were treated as two GTS events. There were a total of 65 GTS events from 57 surveys, and the overall incidence of GTS was 1.61 in 100 days at risk. Forty-four workers (24.2 percent) reported at least one occurrence of GTS during the study. Table 1 shows marked between-camp variability in observed GTS incidence rates. Of the 65 GTS events reported, 29.2 percent (n = 19) occurred in clusters of workers from the same sites sick with GTS on the same day. Over 70 percent (n = 47) of GTS events occurred among workers who worked together in camps during the same week.
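The rule for counting GTS events on consecutive symptomatic days can be sketched as a short routine. The Python function below is an illustrative reading of the rule described above, not the authors' code: within a run of consecutive symptomatic days, the first day is an event, the next day is removed from the risk set, the third day starts a new event, and so on. Its behavior for runs longer than 4 days is an assumption, since the text addresses only runs of 2 to 4 days, and it does not model the requirement that a worker handled tobacco that day or the day before.

```python
def gts_events(symptom_days):
    """Split symptomatic study days into event days and days not at risk.

    symptom_days: iterable of integer study days on which the symptom
    criteria were met for one worker.
    Returns (event_days, not_at_risk_days), both sorted lists.
    """
    events, not_at_risk = [], []
    last_event = None
    for day in sorted(set(symptom_days)):
        if last_event is not None and day == last_event + 1:
            # Symptoms carried over from the previous day's event.
            not_at_risk.append(day)
        else:
            events.append(day)
            last_event = day
    return events, not_at_risk

print(gts_events([10, 11]))        # 2 consecutive days
print(gts_events([3, 4, 5, 6]))    # 4 consecutive days
print(gts_events([1, 3]))          # nonconsecutive days
```

Under this reading, 2 consecutive days yield one event and 3 or 4 consecutive days yield two events, in agreement with the stipulation in the text.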
Analysis
Table 2 reports the incidence of GTS by risk factor subgroups. An increased incidence of GTS was associated with season (middle or late), type of work (priming), having less than 5 years of experience, and working in wet clothes. A decreased incidence of GTS was associated with using tobacco. Most GTS events occurred in the last 2 weeks of July through the end of August (study days 34–78). Evidence suggested that high temperatures were positively associated with GTS (Spearman rank correlation, 0.42). Temperature should be considered in the context of other confounding variables. Because season is associated with both maximum daily temperature (highest temperatures occur in midseason) and type of work (e.g., priming, the task with the greatest exposure, is done only from mid- to late season), we did not consider season in our model for computing adjusted within-cluster odds ratios.
$$\log \operatorname{POR}(Y_{ijt}, Y_{ij't'}) = \alpha_0 + z_{ijtj't'}\alpha_1 + u_{ijtj't'}\alpha_2 + w_{ijtj't'}\alpha_3,$$
where $t$ indexes time in days, $z_{ijtj't'} = 1$ if the pair of observations is from the same worker and is equal to 0 otherwise, $u_{ijtj't'} = 1$ if $|t - t'| < 7$ and is equal to 0 otherwise, and $w_{ijtj't'} = z_{ijtj't'} \times u_{ijtj't'}$. The estimated PORs of interest and their 95 percent confidence intervals are $\exp(\hat{\alpha}_0)$ (95 percent confidence interval: 0.90, 3.16), $\exp(\hat{\alpha}_0 + \hat{\alpha}_2)$ (95 percent confidence interval: 1.32, 3.08), $\exp(\hat{\alpha}_0 + \hat{\alpha}_1)$ (95 percent confidence interval: 1.71, 5.12), and $\exp(\hat{\alpha}_0 + \hat{\alpha}_1 + \hat{\alpha}_2 + \hat{\alpha}_3)$ (95 percent confidence interval: 1.47, 10.20). These are within-camp PORs, respectively, for two observations 1) from different workers 7 days or more apart and 2) from different workers less than 7 days apart; and 3) from the same worker 7 days or more apart and 4) from the same worker less than 7 days apart. Although pairs of observations from the same week appeared to be more highly associated than similar pairs from different weeks, the time effects, considered jointly, were not statistically significant. When we let $\theta = (\alpha_2, \alpha_3)'$, the result corresponded to the Wald test of the hypothesis $\theta = 0$, whose test statistic, $\hat{\theta}'[\widehat{\operatorname{Var}}(\hat{\theta})]^{-1}\hat{\theta}$, had a large-sample chi-square distribution with 2 degrees of freedom.
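The pair-level covariates and the joint test of the time effects can be sketched as follows. The fragment builds the same-worker indicator, the within-7-days indicator, and their product for a pair of observations, evaluates the log POR for given parameters, and computes a Wald statistic for two parameters together with its chi-square p value (for 2 degrees of freedom the upper tail probability is $e^{-x/2}$). Function names, parameter values, and the covariance matrix are hypothetical; none are estimates from the GTS data.

```python
import math

def pair_covariates(worker_a, day_a, worker_b, day_b):
    """Return (z, u, w): same-worker indicator, within-7-days indicator, product."""
    z = 1 if worker_a == worker_b else 0
    u = 1 if abs(day_a - day_b) < 7 else 0
    return z, u, z * u

def log_por(alpha, z, u, w):
    a0, a1, a2, a3 = alpha
    return a0 + a1 * z + a2 * u + a3 * w

def wald_2df(estimate, covariance):
    """Wald statistic estimate' V^{-1} estimate for a 2-vector, with its p value."""
    (v11, v12), (v21, v22) = covariance
    det = v11 * v22 - v12 * v21
    e1, e2 = estimate
    # Quadratic form written out with the explicit 2 x 2 inverse.
    stat = (e1 * (v22 * e1 - v12 * e2) + e2 * (-v21 * e1 + v11 * e2)) / det
    return stat, math.exp(-stat / 2.0)

# Hypothetical parameter values (alpha0, alpha1, alpha2, alpha3).
alpha = (0.5, 0.6, 0.2, 0.1)
for label, pair in [
    ("different workers, 7+ days apart", ("w1", 10, "w2", 20)),
    ("different workers, <7 days apart", ("w1", 10, "w2", 12)),
    ("same worker, 7+ days apart", ("w1", 10, "w1", 20)),
    ("same worker, <7 days apart", ("w1", 10, "w1", 12)),
]:
    z, u, w = pair_covariates(*pair)
    print(label, (z, u, w), round(math.exp(log_por(alpha, z, u, w)), 2))

# Hypothetical estimates and covariance matrix for (alpha2, alpha3).
stat, p_value = wald_2df((0.2, 0.1), ((0.04, 0.0), (0.0, 0.09)))
print(round(stat, 3), round(p_value, 3))
```

The four printed combinations correspond to the four PORs listed in the text, in the same order.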
Proceeding with the simpler model given by model 3, table 3 shows the estimated PORs and their 95 percent confidence intervals for three different risk factor models (model 1). Not adjusting for risk factors resulted in statistically significant clustering both within farmworkers and between farmworkers within camps. The estimated POR relating two observations from the same farmworker was 3.15. This finding suggests that some workers have a significantly greater propensity for GTS than others do. The estimated POR relating GTS in one worker to GTS in another worker within the same camp was 1.90. When we adjusted for those risk factors that affect exposure to green tobacco leaves, the within-worker and within-camp PORs were smaller than their unadjusted counterparts. The nonsignificance of the within-camp POR suggests that exposure factors largely explain the clustering of GTS among workers in a camp. When the model included factors affecting absorption of nicotine as well as exposure, the within-worker POR, but not the within-camp POR, was further reduced to a notable degree. The adjusted within-worker POR remained significantly different from one, suggesting that even after adjusting for known risk factors, some workers have a greater propensity than others for GTS. The stipulation that 2 days of symptoms be treated as one GTS event had little impact on the results. A sensitivity analysis based on 81 days with GTS instead of 65 GTS events provided similar odds ratios for the risk factors; not surprisingly, the adjusted within-worker odds ratio was greater (2.69 vs. 2.13).
DISCUSSION
Knowledge of the clustering of GTS provided by ALR is important for several reasons. First, the fact that the ALR estimate of within-worker association remained significantly different from one after adjusting for known risk factors suggests that unmeasured factors play a role in GTS. One possibility is between-individual differences in nicotine metabolism. Analysis of urinary metabolites of nicotine is currently under way for a clinic-based case-control study conducted to complement the present survey. This analysis may identify distinct patterns of nicotine metabolism. Second, when equations 3 and 4 of Katz et al. (4) were used, the intracluster correlation of GTS estimated by using ALR with exchangeable PORs led to computation of a design effect of 3.18 that may be used in planning future studies. This value is somewhat inflated because of varying cluster sizes in the GTS study. For example, if the cluster size were 109 for every camp, the design effect would be 2.87.
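The effect of unequal cluster sizes on the design effect can be illustrated with the familiar approximation $1 + (m - 1)\rho$, where $m$ is the cluster size and $\rho$ the intracluster correlation. The sketch below assumes that this approximation applies, which may differ in detail from equations 3 and 4 of Katz et al. It backs out the intracluster correlation implied by a design effect of 2.87 at a common cluster size of 109, then shows that replacing $m$ by a size-weighted mean cluster size gives a larger design effect when cluster sizes vary. The unequal cluster sizes used are hypothetical, not the 37 camp sizes from the study.

```python
def implied_icc(design_effect_value, cluster_size):
    """Solve deff = 1 + (m - 1) * icc for icc at a common cluster size m."""
    return (design_effect_value - 1.0) / (cluster_size - 1.0)

def design_effect(cluster_sizes, icc):
    """Design effect using the size-weighted mean cluster size sum(m^2) / sum(m)."""
    total = sum(cluster_sizes)
    weighted_size = sum(m * m for m in cluster_sizes) / total
    return 1.0 + (weighted_size - 1.0) * icc

icc = implied_icc(2.87, 109)          # about 0.0173 under the assumed formula
equal_sizes = [109] * 37
unequal_sizes = [20, 60, 109, 150, 208]   # hypothetical camp sizes

print(round(icc, 4))
print(round(design_effect(equal_sizes, icc), 2))
print(round(design_effect(unequal_sizes, icc), 2))
```

Because large clusters contribute disproportionately to the weighted mean, greater variation in cluster size raises the design effect for a fixed intracluster correlation.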
Ignoring within-cluster association leads to invalid inference for the risk factors predicting illness. In the presence of clustering, standard errors from standard logistic regression tend to be inappropriately small for cluster-level covariates and too large for observation-level covariates (10, 11). For the GTS data, with the exception of temperature, which may vary daily, standard errors were always larger for GEEs. Use of the GEE and ALR methods found that temperature, but not work experience (a subcluster-level covariate), was a statistically significant predictor of GTS; standard logistic regression led to the reverse conclusion.
The similarity of ALR and GEE odds ratio estimates for GTS risk factors in model 1 is not surprising. Both methods involve computations that alternate between estimation of ß and within-cluster association. In fact, the ß-estimation step is the same, and both rely on empirical sandwich variance estimation to account for the within-cluster association to give valid results for the risk factor model even when the association model is misspecified. Our analysis illustrates that careful modeling of within-cluster correlation may result in more efficient estimation of parameters in the marginal mean model, with potentially nonnegligible benefits for studies with a small number of large-size clusters (15). Standard errors resulting in the confidence intervals for GTS risk factors in the three-level ALR model in table 4 were slightly smaller than their GEE (or ALR) two-level model counterparts for five of seven covariates. Thus, use of ALR was mildly beneficial in clarifying the role of risk factors for GTS, even though results based on GEEs were qualitatively similar.
GTS results for GLMM were also qualitatively similar to those from using ALR. Generally, the presence of random effects in nonlinear models, such as random-effects logistic models, causes the regression coefficients ß* to be numerically different and to have interpretations different from corresponding coefficients ß from population-averaged models (10, 11). The fact that GTS risk factor odds ratio estimates from the two approaches were similar may be due to the moderately small size of the within-cluster association or to finite sample biases of the procedures.
ALR has several advantages compared with GLMM analysis of multilevel binary data. While both offer similar degrees of flexibility in modeling multiple sources of variation, ALR provides estimates of PORs that have natural interpretations for quantifying the magnitude of within-cluster association. In contrast, GLMM variance component estimates apply to the scale of the link function, and, as in the GTS study, their meaning may be more elusive. Next, valid inference for the effect of risk factors on the probability of illness with GLMM requires correct specification of the distribution of the random effects and the link function, whereas ALR requires only the latter. However, valid inference for within-cluster association with ALR requires that model 2 be specified correctly. A disadvantage of ALR is the requirement that data be missing completely at random compared with the less stringent missing-at-random assumption of GLMM (24).
The GTS data illustrate limitations in the scope of ALR. One concern is that the number of parameters estimated should not be excessive because of sparse data represented by a limited number of GTS events. A second concern is that the poor performance of the empirical sandwich estimator when applied with a small number of clusters in GEE applications (e.g., fewer than 40) may also be pertinent to ALR in the GTS study. GLMM as estimated by using restricted pseudolikelihood also has limitations for binary data; it may give biased estimates of variance components (12). Other estimation approaches for GLMMs are available (25, 26). The usefulness of software for fitting models similar to those considered in this paper depends on whether the software can handle multilevel models (21).