1 Department of Epidemiology and Biostatistics, Faculty of Medicine, McGill University, Montreal, Quebec, Canada.
2 Division of Clinical Epidemiology, Royal Victoria Hospital, Montreal, Quebec, Canada.
3 Division of Epidemiology and Biostatistics, Department of Epidemiology and Social Medicine, Albert Einstein College of Medicine of Yeshiva University, Bronx, NY.
4 Department of Family Medicine and Community Health, School of Medicine, Tufts University, Boston, MA.
Received for publication January 7, 2000; accepted for publication August 7, 2002.
ABSTRACT
correlation; epidemiologic methods; generalized estimating equation; longitudinal studies; odds ratio; statistics
Abbreviation: GEE, generalized estimating equations.
INTRODUCTION
Although some articles do discuss how much statistical information is obtainable from observations on paired organs (9) or individuals in clusters such as classrooms or physicians' practices (10), investigators often take a conservative approach. In one example, where all eligible children in a household were randomized to the same treatment (11), statistics were computed as if the observations were independent, but standard errors were based on the numbers of households. In another (12), where one fourth of the subjects had a sibling in the study, the authors excluded the data obtained from one of the two siblings.
In this expository article, we show how the GEE approach uses weighted combinations of observations to extract the appropriate amount of information from correlated data. We first motivate and introduce the approach using hand calculations on small hypothetical data sets. We use households as clusters, with the letter "h" (household) as a subscript. We use the Greek letters µ and σ and the uppercase letters P, B, and R when referring to a parameter (a mean, standard deviation, proportion, or regression or correlation coefficient); and we use the symbol ȳ and the lowercase letters s, p, b, and r for the corresponding statistic (empirical value, calculated from a sample) which serves as an estimate of the parameter.
ELEMENTS
The variance of a weighted combination of observations is obtained by forming the product

w_row × w_column × σ_row × σ_column × R_{row,column}

for each {row, column} combination, and then summing these products over the n² row-column combinations.
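As a concrete sketch of this double sum (in Python, not part of the original article), the variance of a weighted combination can be computed directly from the weights, the standard deviations, and the correlation matrix:

```python
# Variance of a weighted combination sum_i w_i * y_i, computed as the
# double sum described in the text:
#   sum over all {row, column} pairs of
#     w_row * w_col * sigma_row * sigma_col * R[row][col]
def var_weighted_sum(w, sigma, R):
    n = len(w)
    return sum(w[i] * w[j] * sigma[i] * sigma[j] * R[i][j]
               for i in range(n) for j in range(n))

# Check with n = 3 uncorrelated observations (R = identity), equal sigmas,
# and weights of 1/n: the result is the familiar sigma**2 / n.
n, sigma = 3, 10.0
R_identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
v = var_weighted_sum([1.0 / n] * n, [sigma] * n, R_identity)
print(v)  # sigma**2 / n = 100/3
```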
For the remainder of this section, we will assume that the σs are all equal.
Variability of statistics derived from uncorrelated observations
When the observations are uncorrelated, the off-diagonal elements in the correlation matrix are zero. If each of the n weights equals 1/n, then the weighted sum is the mean, ȳ. Its variance (the sum of the diagonal elements in part b of figure 1) is thus n × (1/n)² × σ² = σ²/n, yielding the familiar formula SD[ȳ] = σ/√n.
With equal statistical weights of 1 each, the variance of the simple sum Σy is Var[Σy] = nσ², so that SD[Σy] = √n × σ. Although our main example involves "physical" heights and "statistical" weights, a side example is instructive. Assume that the "physical" weights of elevator-taking adults vary from person to person with, for example, σ = 10 kg. Then elevator loads of 16 persons each (i.e., n = 16), randomly chosen from among these, will vary from load to load with a standard deviation of (only!) √16 × (10) = 40 kg, while the average weight per person in each load of 16 will vary with a standard deviation of only 10/√16 = 2.5 kg.
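The two "laws" in the elevator example can be verified in a line or two (a sketch, not from the article):

```python
import math

# Elevator example: per-person standard deviation of 10 kg, loads of n = 16
sigma, n = 10.0, 16

sd_total = math.sqrt(n) * sigma   # SD of the total load: sqrt(n) * sigma
sd_mean = sigma / math.sqrt(n)    # SD of the per-person average: sigma / sqrt(n)
print(sd_total, sd_mean)          # 40.0 2.5
```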
Variability of statistics derived from correlated observations
In the elevator example, the "σ/√n" and "√n × σ" laws for the variability of the two statistics do not hold if the variable of interest on sampled individuals tends to be similar from one individual to the other ("co-related"), for example, if elevators are sometimes used by professional football teams and sometimes by ballet dance classes. The variance of a weighted combination of such observations now involves, in addition to the 1s on the diagonal, the pairwise nonzero off-diagonal elements of the correlation matrix.
When the ys of individuals in a cluster are positively correlated, as is typical, the additional off-diagonal elements in part b of figure 1 make the standard deviation of the unweighted average greater than σ/√n.
Preamble to GEE: optimal combination of correlated observations
Suppose, for simplicity, that households have either one or two children and that the mean (µ) and standard deviation (σ) of the variable being measured are the same in both types of households (in some applications (see Hoffman et al. (13), p. 440), µ may vary systematically with cluster size, but that situation will not be considered here). Let the correlation of measurements within two-child households be R. Consider the estimation of µ using a measurement on each of three children (n = 3), one from a randomly chosen single-child household and two from a two-child household. The 3 × 3 correlation matrix (figure 2) for the three ys is made up of a 1 × 1 matrix for the response from the singleton, a 2 × 2 matrix for the two responses from the two siblings, and zeroes for pairs of responses from unrelated children. The ys for some actual pairs of unrelated children will both be above or below µ, but on average, across all possible such pairs, the expected product of deviations is zero.
For any given R, there is a less variable estimator than the three considered. Suppose that, relative to a weight of 1 for the observation on the singleton, the weight for the y of each sibling is w, yielding the weighted average

ȳ_w = (y_singleton + w × y_sib1 + w × y_sib2)/(1 + 2w).

One can show that its variance, σ²(1 + 2w² + 2Rw²)/(1 + 2w)², is lowest when w = 1/(1 + R). For example, if R = 0.5, the optimal (relative) weights are in the ratio 1:(2/3):(2/3), and the variance is (3/7)σ², smaller than that of the competitors.
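The variance formula and the claimed optimum can be checked numerically (a sketch, not from the article):

```python
# Variance of the weighted average (y_singleton + w*y_sib1 + w*y_sib2)/(1 + 2w),
# in units of sigma**2, when the two siblings have correlation R:
def rel_var(w, R):
    return (1 + 2 * w**2 + 2 * R * w**2) / (1 + 2 * w) ** 2

R = 0.5
w_opt = 1 / (1 + R)        # = 2/3, the optimum claimed in the text
print(rel_var(w_opt, R))   # 3/7, about 0.4286

# A coarse grid search confirms no other weight does better:
grid = [i / 1000 for i in range(1, 2001)]
assert all(rel_var(w_opt, R) <= rel_var(w, R) + 1e-12 for w in grid)
```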
More generally, suppose the sample of n consists of several sets of children from 1-, 2-, ..., k-child households, with the same µ and the same pairwise within-household correlation R in all households, regardless of size. If the responses are ordered by household, then the n × n correlation matrix consists of several repetitions of various "block-diagonal" patterns, as in figure 2. One can show by calculus that the optimal weights for combining the responses of individual children from households of sizes 1, 2, 3, ..., k are 1, 1/(1 + R), 1/(1 + 2R), ..., 1/(1 + (k − 1)R). These values can be obtained by summing the entries in any row or column of the inverses of the 1 × 1, 2 × 2, ..., k × k submatrices in the overall n × n block-diagonal matrix used in the GEE equations (see next subsection).
With data from paired organs, all "clusters" are of size k = 2. Rosner and Milton (9) illustrate this idea of "effective sample size" using responses of a person's left and right eyes to the same treatment: If these have a correlation of 0.54, then 200 eyes, two from each of 100 persons, contribute the "statistical equivalent" of one-eye contributions from each of 130 persons (200 × 1/(1 + 0.54) = 130). The closer the correlation is to 1, the closer the effective sample size is to 100.
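Both claims, the row-sum identity and the effective sample size, can be verified directly (a sketch, not from the article). Rather than inverting the matrix, note that each row of the k × k exchangeable correlation matrix A sums to 1 + (k − 1)R, so the constant vector with entries 1/(1 + (k − 1)R) solves A x = (1, ..., 1); its entries are therefore the row sums of A⁻¹:

```python
# Check that the weight 1/(1 + (k - 1)R) for a child in a household of
# size k equals the row sum of the inverse of the k x k exchangeable
# correlation matrix (1s on the diagonal, R elsewhere).
def row_sum_of_inverse_is_weight(k, R):
    A = [[1.0 if i == j else R for j in range(k)] for i in range(k)]
    x = [1.0 / (1 + (k - 1) * R)] * k
    # verify A @ x equals the vector of 1s, i.e. x holds the row sums of A inverse
    return all(abs(sum(A[i][j] * x[j] for j in range(k)) - 1.0) < 1e-12
               for i in range(k))

print(all(row_sum_of_inverse_is_weight(k, 0.5) for k in range(1, 7)))  # True

# Rosner-Milton effective sample size: 200 eyes in pairs (k = 2) with
# correlation 0.54 are "worth" about 130 independent eyes.
effective = 200 * 1 / (1 + (2 - 1) * 0.54)
print(round(effective))  # 130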
Estimation by GEE: the "EE" in GEE
In the n = 3 example in figures 2 and 3, consisting of just one household of size k = 1 and one of size k = 2, each y is a separate legitimate (unbiased) estimator, µ̂, of µ (the circumflex or "hat" over µ denotes an estimate of it, calculated from data). As was the practice in the pre-least-squares era (14), one can combine the three separate estimating equations, y_singleton − µ̂ = 0, y_sib1 − µ̂ = 0, and y_sib2 − µ̂ = 0, using the weights w_singleton, w_sib1, and w_sib2, to obtain a single estimating equation

w_singleton × (y_singleton − µ̂) + w_sib1 × (y_sib1 − µ̂) + w_sib2 × (y_sib2 − µ̂) = 0.

In this simple case, the solution is the weighted average µ̂ = (w_singleton × y_singleton + w_sib1 × y_sib1 + w_sib2 × y_sib2)/(w_singleton + w_sib1 + w_sib2).
In this didactic example, the value of R used to construct the weights was considered "known"; in practice, it must be estimated, along with µ. The process is illustrated in figure 4, using a total of five observations (n = 5) from two clusters. Beginning with R = 0, one calculates five weights and, from them, an estimate of µ; from the degree of similarity of the within-cluster residuals, one obtains a new estimate, r, of R. The cycle is repeated until the estimates stabilize, that is, until "convergence" is achieved.
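The iterative cycle just described can be sketched in a few lines. The moment estimator of R used here (the average cross-product of standardized within-cluster residuals) is one common choice; figure 4 may use a slightly different variant, so this is an illustration rather than a reproduction:

```python
# Toy GEE iteration for a common mean mu from clustered data with an
# exchangeable within-cluster correlation R.
def gee_mean(clusters, tol=1e-8, max_iter=100):
    r = 0.0  # start by assuming independence, as in the text
    for _ in range(max_iter):
        # per-member weight: 1 / (1 + (k - 1) r) for a cluster of size k
        num = sum(y / (1 + (len(c) - 1) * r) for c in clusters for y in c)
        den = sum(len(c) / (1 + (len(c) - 1) * r) for c in clusters)
        mu = num / den
        # re-estimate r from the similarity of within-cluster residuals
        resid = [[y - mu for y in c] for c in clusters]
        s2 = sum(e * e for c in resid for e in c) / sum(len(c) for c in resid)
        pairs = [(e[i] * e[j]) / s2
                 for e in resid
                 for i in range(len(e)) for j in range(i + 1, len(e))]
        r_new = sum(pairs) / len(pairs) if pairs else 0.0
        if abs(r_new - r) < tol:  # "convergence"
            return mu, r_new
        r = r_new
    return mu, r

# one singleton household and one two-child household (hypothetical data)
mu_hat, r_hat = gee_mean([[100.0], [104.0, 106.0]])
print(round(mu_hat, 2), round(r_hat, 2))
```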
Estimation of a proportion (or odds) rather than a mean: the "G" in GEE
Figure 5 shows the GEE estimation of the expected proportion P from 0/3 and 4/5 positive responses in eight subjects in two households. The weights are 1/(1 + 2R) and 1/(1 + 4R) for the individuals in households of size three and five, respectively. The final estimate of P is p = 0.42, corresponding to r = 0.45. It is a weighted average of the eight 0s and 1s, with weights of 1/(1 + 2r) = 0.53 for each of the three responses in household 1 and 1/(1 + 4r) = 0.36 for each of the five responses in household 2; that is,

p = {0.53 × (0 + 0 + 0) + 0.36 × (1 + 1 + 1 + 1 + 0)}/(3 × 0.53 + 5 × 0.36) = 1.44/3.39 = 0.42.

The sum of the eight weights, 0.53 each for the three persons in household 1 and 0.36 each for the five persons in household 2, can be viewed as the "effective" sample size of 3.39. Estimation of logit[P] = log[P/(1 − P)] involves the same core calculations.
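The figure 5 arithmetic, using the rounded weights quoted in the text, can be reproduced directly (a sketch, not from the article):

```python
# r = 0.45 gives per-person weights 1/(1 + 2r) ~ 0.53 (household of 3)
# and 1/(1 + 4r) ~ 0.36 (household of 5), rounded as in the text
w1, w2 = 0.53, 0.36
responses_h1 = [0, 0, 0]        # 0/3 positive
responses_h2 = [1, 1, 1, 1, 0]  # 4/5 positive

effective_n = 3 * w1 + 5 * w2   # the "effective" sample size, 3.39
p = (w1 * sum(responses_h1) + w2 * sum(responses_h2)) / effective_n
print(round(effective_n, 2), round(p, 2))  # 3.39 0.42
```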
If P is different for different covariate patterns or strata, then the "unit" variance σ² = P(1 − P) is no longer homogeneous. Nonconstant variances can be allowed for by incorporating a function of σ² into the weight for each observation (this is the basis of the iteratively reweighted least squares algorithm used with the usual logistic regression for uncorrelated responses).
Indeed, using different weights for each of n uncorrelated outcomes allows a unified approach to the maximum likelihood estimation of a family of "generalized linear models" (15, 16). Parameters are fitted by minimizing the weighted sum of squared residuals, using functions of the σ²s as weights. For binomial and Poisson responses, where σ² is a function of the mean, weights are reestimated after each iteration. With GLIM software (17), Wacholder (18) illustrated how the risk difference, risk ratio, and odds ratio are estimated using the identity, log, and logit "links," respectively. This unified approach to uncorrelated responses has since become available in most other statistical packages. GEE implementations for correlated data use this same unified approach but use a quasi-likelihood rather than a full likelihood approach (3). Since correct specification of the mean and variance functions is sufficient for unbiased estimates, the model used does not fully specify the distribution of the responses in each cluster.
Standard errors: model-based or data-based (empirical)?
Two versions of the standard error are available for accompanying GEE estimates. The difference between them can be illustrated using the previously cited estimate, p = 0.42, of the parameter P. The "model-based" standard error is based on the estimated (exchangeable) correlation r = 0.45. This in turn implies the "effective sample size" of Σw = 3 × 0.53 + 5 × 0.36 = 1.59 + 1.80 = 3.39 shown above and in the last footnote of figure 5. Thus, based on the binomial model,

SE_model-based(p) = {p × (1 − p)/Σw}^1/2 = {0.42 × 0.58/3.39}^1/2 = 0.27.

The "empirical" or "robust" standard error uses the actual variations in the cluster-level statistics, that is, p1 = 0/3 = 0 and p2 = 4/5 = 0.8, and the "effective sizes" of the subsamples:

SE_empirical(p) = {[1.59²(0 − 0.42)² + 1.80²(0.8 − 0.42)²]/3.39²}^1/2 = 0.28.
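Both standard errors follow from the quantities already computed (a sketch, not from the article):

```python
import math

# The two standard errors for p = 0.42, as in the text
p = 0.42
w1_total, w2_total = 1.59, 1.80   # summed weights, households 1 and 2
eff_n = w1_total + w2_total       # effective sample size, 3.39
p1, p2 = 0 / 3, 4 / 5             # cluster-level proportions

# model-based: binomial variance over the effective sample size
se_model = math.sqrt(p * (1 - p) / eff_n)

# empirical ("robust"): actual variation of the cluster-level statistics
se_empirical = math.sqrt((w1_total**2 * (p1 - p)**2 +
                          w2_total**2 * (p2 - p)**2) / eff_n**2)
print(round(se_model, 2), round(se_empirical, 2))  # 0.27 0.28
```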
Unless data are sparse, the empirical standard error is likely to be more trustworthy than the model-based one. Agreement between the model-based and empirical standard errors suggests that the assumed correlation structure is reasonable. However, the robust variance estimator, also known as the "sandwich" estimator, was developed for uncorrelated observations, and its theoretical behavior with correlated data has only recently received attention (19). Methods designed to improve on the poor performance in small samples (20) include bias-correction and explicit small-sample adjustments, that is, use of t rather than z (21). A second concern has been the case in which the cluster size itself is related to the outcome and so is "nonignorable." In such instances, within-cluster resampling, coupled with the use of a generalized linear model for uncorrelated data (13), provides more valid confidence intervals than GEE.
APPLICATION
Figure 7 contrasts the Low and High socioeconomic status groups with respect to µ, the mean height, and P, the proportion of children with z scores less than −1. We can estimate a difference by subtracting the group-specific estimates, and we can estimate its standard error from the rules for the variance of a difference between two independent estimates. Alternatively, the difference can be estimated as the coefficient of the indicator variable I_High (1 if high socioeconomic status, 0 if not) in a regression model applied to the combined data. For height measured quantitatively, the intercept represents the mean height of low socioeconomic status children, and the coefficient of I_High represents the High − Low difference in means.
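With an indicator variable and balanced data, ordinary least squares recovers exactly these quantities: the intercept equals the Low-group mean and the slope equals the difference in means. The heights below are hypothetical, purely for illustration:

```python
# Hypothetical heights (cm) for three Low-SES and three High-SES children
low = [98.0, 101.0, 103.0]
high = [104.0, 106.0, 108.0]

y = low + high
x = [0] * len(low) + [1] * len(high)  # the indicator I_High

# simple least-squares slope and intercept
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) /
      sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
print(b0, b1)  # Low-group mean, then the High minus Low difference
```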
In the above examples, groups can be compared directly. However, to assess trends in responses over levels of one or more quantitative variables measured at a cluster level (here, household level), a regression approach is more practical. Since GEE analysis is carried out at the child level, it can also include covariates, such as illness histories, that differ from child to child within a household.
DISCUSSION
The GEE approach differs in a fundamental conceptual way from the techniques included under the rubric of "random-effects," "multilevel," and "hierarchical" models (e.g., the MIXED and NLMIXED procedures in SAS, MLn (23, 24), or other software described in the paper by Burton et al. (5)). Besides seeking more efficient estimators of regression parameters, the main benefit of GEE is the production of reasonably accurate standard errors, hence confidence intervals with the correct coverage rates. The procedures in the other set of techniques explicitly model and estimate the between-cluster variation and incorporate this, together with the residual variance, into standard errors. The GEE method does not explicitly model between-cluster variation; instead it focuses on and estimates its counterpart, the within-cluster similarity of the residuals, and then uses this estimated correlation to reestimate the regression parameters and to calculate standard errors. With GEE, the computational complexity is a function of the size of the largest cluster rather than of the number of clusters: an advantage, and a source of reliable estimates, when there are many small clusters.
However, because the GEE approach does not contain explicit terms for the between-cluster variation, the resulting parameter estimates for the contrast of interest do not have the usual "keeping other factors constant" interpretation. To appreciate this, consider the (admittedly extreme) situation in table 2. If all N clusters are sufficiently large, one can fit an unconditional logistic regression model to the data. If clusters are small, one can avoid fitting one nuisance parameter per cluster (and the consequent bias in the estimated parameter of interest) and instead fit a more economical conditional logistic regression model, using each cluster as a "set." The appropriate logistic regression model "recovers" the common within-cluster ratio of 9, as does the nonregression Mantel-Haenszel approach. However, the GEE approach, with clusters identified as such, yields an odds ratio of only 5.4. The 5.4 contrasts the P1 for an individual selected randomly from the population with the P0 for another individual selected randomly from the population, that is, without "matching" on cluster. In addition, this "population averaged" measure, from the marginal model (5) used in the GEE approach, is specific to the mix of clusters studied. In contradistinction to this, the odds ratio of 9 contrasts the P1 for an individual with the P0 for another individual from the same cluster.
The above extreme examples are quite hypothetical. In practice, with much less variation in P0 across clusters, the discrepancy is usually relatively minor. The discrepancy does not arise with absolute differences, since, with balanced sample sizes, the difference in an aggregate is the aggregate of the within-cluster differences. Table 2 confirms this, showing that the GEE approach, with the identity link, accurately recovers the common 40 percent "risk difference" within each cluster.
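Table 2 itself is not reproduced in this extract, but hypothetical cluster-specific risks of P0 = 0.1, P1 = 0.5 in one set of clusters and P0 = 0.5, P1 = 0.9 in the other (an assumed reconstruction, chosen to match the numbers quoted in the text) display exactly the behavior described: a within-cluster odds ratio of 9 in each cluster, a "population averaged" odds ratio of about 5.4, and a constant risk difference of 0.4:

```python
def odds_ratio(p1, p0):
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

# (P0, P1) for the two hypothetical cluster types, balanced sizes
clusters = [(0.1, 0.5), (0.5, 0.9)]
within = [odds_ratio(p1, p0) for p0, p1 in clusters]

# marginal (population-averaged) risks, ignoring cluster membership
p0_marg = sum(p0 for p0, _ in clusters) / 2   # 0.3
p1_marg = sum(p1 for _, p1 in clusters) / 2   # 0.7

print([round(r, 6) for r in within])              # [9.0, 9.0]
print(round(odds_ratio(p1_marg, p0_marg), 1))     # 5.4
print([round(p1 - p0, 1) for p0, p1 in clusters]) # [0.4, 0.4]
```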
Unfortunately, as currently implemented in most software, the GEE approach cannot handle several levels of clustering/hierarchy, such as households selected from randomly selected villages that in turn were selected from selected counties. For binary responses, it is possible to use alternating logistic regression (28), an extension of GEE, implemented in S-PLUS, to model different correlations at different levels, but this procedure is not yet available in SPSS, Stata, and SAS implementations of GEE. Likewise, unlike multilevel models, the GEE approach cannot accommodate both cluster-specific intercepts and slopes in longitudinal data.
In our height example, several children within the household are measured cross-sectionally, that is, just once, each at a different age. Consider a different study, in which (unrelated) children are followed and their heights and covariates are measured at several different ages (times). In such longitudinal data, now with the child as the "cluster," unless the model includes at least a separate intercept for each child, the successive residual heights of a child will be correlated, with stronger correlations among residuals that are closer together in time. Autoregressive correlation structures are commonly used for longitudinal data. The main analytical challenges are accounting appropriately for missing data and dealing with observations spaced unevenly in time. The reader is referred to the work of Liang and Zeger and colleagues (1, 2, 4) for a treatment of the GEE analysis of quantitative longitudinal data.
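An autoregressive working correlation of the kind mentioned above can be written down in one line; the AR(1) form below (correlation R^|t − t'| between times t and t') is a standard choice, not a structure prescribed by this article, and the same formula accommodates the unevenly spaced visits noted as a challenge:

```python
# AR(1) working correlation: residuals at times t and u have correlation
# R ** |t - u|, so the correlation weakens with separation in time.
def ar1_matrix(times, R):
    return [[R ** abs(t - u) for u in times] for t in times]

# equally spaced visits:
M = ar1_matrix([0, 1, 2, 3], 0.6)
print([round(x, 3) for x in M[0]])  # [1.0, 0.6, 0.36, 0.216]

# the same formula handles unevenly spaced visits:
M_uneven = ar1_matrix([0, 1, 4], 0.6)
```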
ACKNOWLEDGMENTS
The authors are grateful to Rolf Heinmueller, Machelle Wilchesky, and several other students for comments on and reactions to the manuscript.
APPENDIX
NOTES
REFERENCES