Application of a Generalized Random Effects Regression Model for Cluster-correlated Longitudinal Data to a School-based Smoking Prevention Trial
Andreas I. Sashegyi1,
K. Stephen Brown2 and
Patrick J. Farrell3
1 Eli Lilly and Company, Indianapolis, IN.
2 Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada.
3 Department of Mathematics and Statistics, Acadia University, Wolfville, Nova Scotia, Canada.
ABSTRACT
In cluster-randomized trials, groups of subjects (clusters) are assigned to treatments, whereas observations are taken on the individual subjects. Since observations on subjects in the same cluster are typically more similar than observations from different clusters, analyses of such data must take intracluster correlation into account rather than assuming independence among all observations. Random effects models are useful for this purpose. The problem becomes more complicated if, in addition, repeated observations are taken on subjects over time. This introduces intraindividual correlation, which is typical for longitudinal studies. The Waterloo Smoking Prevention Project, study 3 (WSPP3), 1989-1996, is a study giving rise to cluster-correlated longitudinal data, in which schools were randomized to either a smoking intervention program or a control condition. Smoking status was assessed on grade 6 students in these schools, with annual follow-up observations throughout elementary and high school years. The authors illustrate the use of a generalized random effects model for analyzing this type of data. This model obtains appropriate estimates and standard errors for both individual-level covariates and those at the level of the cluster. Am J Epidemiol 2000;152:1192-200.
clinical trials; health education; logistic models; models, statistical; schools; smoking; students
Abbreviations:
GEE, generalized estimating equations; TVSFP, Television, School, and Family Project; WSPP3, Waterloo Smoking Prevention Project, study 3.
PROBLEM DEVELOPMENT
In cluster-randomized intervention trials, groups of subjects such as families, schools, or entire communities are randomly assigned to intervention conditions, whereas observations are taken on the individual subjects, often on several occasions. It is a common characteristic of such studies that while observations from different groups or clusters can be assumed to be statistically independent, the responses from different individuals in the same cluster are normally dependent on one another to some extent (1). In school-based smoking prevention trials, schools are often the unit of randomization, and students within the same school tend to behave more similarly than students from different schools (2).
McKinlay et al. (3) and Murray et al. (4) provide a general discussion of some of the design and analysis issues that are of direct relevance to school-based intervention studies. This is well complemented by a similar development from a more statistical point of view in the report from Cnaan et al. (5). We describe two examples of such investigations.
First, the Television, School, and Family Project (TVSFP) was an experimental trial to determine the efficacy of a school-based smoking prevention curriculum in conjunction with a television-based prevention program, in terms of preventing smoking onset and aiding in smoking cessation. Forty-seven schools in Los Angeles and San Diego, California, were randomized to various study conditions, in which seventh-grade students were identified as the study sample. Pretest data were collected on all students, and the intervention was followed by a posttest questionnaire as well as follow-up questionnaires 1 and 2 years after the intervention. The TVSFP study is described in detail in Flay et al. (6).
A second example of a cluster-randomized smoking prevention trial giving rise to cluster-correlated longitudinal data is the Waterloo Smoking Prevention Project, study 3 (WSPP3), the third in a series of randomized, controlled smoking prevention trials, designed to develop, evaluate, and disseminate an effective school-based social influences smoking prevention program (see Best et al. (7) and Cameron et al. (8)). In this study 100 schools in southern Ontario were randomized to either a control or one of four treatment conditions, determined by the type of provider delivering the prevention curriculum (teacher or nurse) and the method of training the provider had undergone (print or workshop). In these schools an initial cohort of grade 6 students was identified, and a baseline measure of smoking was taken on these students. The social influences curriculum was administered in intervention schools in grade 6 with booster sessions in grades 7 and 8. Annual measurements of smoking status were obtained on the same cohort. The elementary school phase of the WSPP3 study was complemented by a follow-up protocol implemented in secondary schools, so that some of the initial study subjects contributed as many as seven observations, constituting annual measures of smoking status from grade 6 through grade 12. We shall examine the WSPP3 data more closely later.
Intervention studies such as the Community Intervention Trial for Smoking Cessation (9), which randomly assign whole communities to conditions and follow participants longitudinally, are another example of this type of study.
Statistical analyses of such data must account for the intraclass correlation among observations from the same cluster in order to yield inferences that take into account the cluster-level randomization. Analyzing the data according to standard methods that assume independence of observations both within and between clusters leads to standard errors for regression coefficients that are too small and, hence, to inflated test statistics and potentially erroneous assertions regarding the significance of covariate effects (1). One way to avoid this problem is to analyze the data on the level of the cluster. This, however, normally results in a less powerful analysis and poses problems for the incorporation of individual-level covariates.
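To give a rough sense of how much standard errors can be understated, the following minimal sketch uses the familiar design effect approximation, 1 + (m - 1) x ICC, for clusters of equal size m. This formula is a standard textbook result and is not taken from this paper; the function names and the numerical inputs (40 students per school, an intracluster correlation of 0.02, a naive standard error of 0.10) are purely hypothetical.

```python
import math


def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation factor for equal-sized clusters: 1 + (m - 1) * icc."""
    return 1.0 + (cluster_size - 1.0) * icc


def corrected_se(naive_se: float, cluster_size: float, icc: float) -> float:
    """Inflate a standard error computed under independence by sqrt(design effect)."""
    return naive_se * math.sqrt(design_effect(cluster_size, icc))


if __name__ == "__main__":
    # Hypothetical inputs, chosen only for illustration.
    deff = design_effect(cluster_size=40, icc=0.02)
    print("design effect:", round(deff, 2))
    print("corrected SE:", round(corrected_se(0.10, 40, 0.02), 3))
```

The approximation applies most directly to cluster-level comparisons with roughly equal cluster sizes; it is meant only to show why even a small intracluster correlation matters when clusters are large.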
Murray et al. (2) describe estimates of intracluster correlation commonly encountered in cluster-randomized smoking prevention studies, and they caution against ignoring the nonindependence in the data. Norton et al. (10) use GEE (11) to allow for the school-level correlation in their analysis of data from Project DARE, a youth substance abuse program. Cameron et al. (8) report test statistics for school-level variables based on Pearson's goodness-of-fit adjustment (12) in an evaluation of the WSPP3 study; they report that both GEE and quasilikelihood models (13) yield similar results.
The use of random effects models (14, 15) is also widely adopted for the analysis of cluster-correlated data. Random effects models readily handle a very general data structure, in which clusters can be of varying sizes and covariates can be specific to either the cluster or the individual. Intracluster correlation, manifested through larger than nominal cluster-to-cluster variation, or overdispersion, is accommodated by positing the existence of an unobserved effect for each cluster that is common to and influences each observation in the given cluster. These effects in turn are assumed to derive from a distribution with mean zero and variance σ². The larger the estimated value of σ², the greater the within-cluster dependence of the observations and the larger the overdispersion in the data. Random effects models normally also specify fixed regression effects as part of the complete model formulation, and for this reason they are also referred to as mixed effects models. Laird and Ware (14) provide an excellent discussion of such mixed models for the analysis of continuous data. Stiratelli et al. (15) extend this by considering similar models (generalized linear mixed models) for the analysis of binary data. Cnaan et al. (5) discuss application of and estimation for the linear mixed model in school-based intervention studies.
Hedeker et al. (16) analyze data from the TVSFP trial using a linear mixed model. However, they confine themselves to largely cross-sectional analyses, using only the pretest and postintervention data. In one instance they analyze the latter with the former taken as a predictor, and in another, the change score obtained by subtracting the pretest from the postintervention response. Their analyses do not consider the available follow-up data.
More theoretical accounts dealing with inference in the generalized linear mixed model are given in the reports by Breslow and Clayton (17), McCullagh and Nelder (12), and McCulloch (18).
As indicated, smoking prevention studies are frequently longitudinal as well, with repeated observations taken on the same subjects over time. This, for instance, facilitates the study of smoking onset and enables the study of the longer-term effects of the intervention. However, the longitudinal nature of these designs requires the consideration of the correlation among responses on the same subject at different points in time (19).
Modeling approaches for the longitudinal response on each subject that allow for the within-subject correlation include GEE models and random effects models. These approaches lead to population-averaged and subject-specific models, respectively. Hu et al. (19) provide a summary and comparison of these approaches for handling within-subject correlation using data from the Midwestern Prevention Project, a longitudinal smoking prevention trial (20); however, while they mention the between-subject within-school correlation, they ignore it in their analyses.
In analyzing data from multiple time points in studies such as the TVSFP or WSPP3, one must simultaneously adjust for the fact that there will likely be extraneous cluster-to-cluster (school-to-school) variability, and that repeated observations made on the same subject will be correlated. In this paper we describe such a model. The estimation procedure we propose combines empirical Bayes' methods as illustrated in the report by MacGibbon and Tomberlin (21) for the estimation of school-level random effects, with an approach based on GEE used by Liang and Zeger (11) for estimating intraindividual correlation parameters. The methodology is illustrated in an analysis of both the elementary and the secondary school data from the WSPP3 study.
THE MODEL
The model we describe in this section differs from that discussed by Hedeker et al. (16) in that it is designed for the analysis of binary, not continuous, responses. Our model generalizes the model by Hedeker et al. further by allowing the analysis of data from multiple time points. We describe the essential features of the model in this paper and refer the interested reader to the dissertation by Sashegyi (22) for a more detailed account.
In this development we adopt the terminology appropriate for school-based smoking prevention trials, since this is the area of application with which we are concerned; however, the model is applicable in general to data in which multiple observations are taken on experimental units observed and randomized in clusters.
Let the data consist of observations on N students, each observed at T times. Let the subscripts i and t refer to an individual and a time point, respectively. Furthermore, let the N × T observations be collected in K schools, with the subscript k referencing schools. Vectors or matrices with a single subscript i or k refer to collections of observations on the corresponding individual or school, respectively. Consider the following model for the binary response indicating smoking status (Y = 1 indicating a smoker, Y = 0 a nonsmoker):

$$Y_{it(k)} \mid b_k \sim \mathrm{Bernoulli}\left(p_{it(k)}\right),$$

where

$$\operatorname{logit}\left(p_{it(k)}\right) = x_{it(k)}'\beta + b_k, \qquad b_k \sim N(0, \sigma^2), \qquad \operatorname{corr}\left(Y_{it(k)}, Y_{it'(k')} \mid b_k, b_{k'}\right) = \rho_{tt'}, \qquad (1)$$

where xit(k) = (xit1(k), ..., xitp(k))' is a p × 1 covariate vector and bk is the random effect associated with the school attended by individual i at time t. The purpose of the school random effect is to account for the influence of unobserved covariates that help explain the variation in smoking rates across schools. The model assumes that the effect of school environment is common to all observations collected in a given school, not just those gathered at one specific time point. Furthermore, the correlation between two observations on the same individual at times t and t' must be interpreted as conditional on, or adjusted for, the random effect(s) of the school(s) attended by that individual at times t and t'. It is therefore reasonable to model this conditional correlation as a function of time only. Observations from different individuals in the same school are taken to be independent, conditional on the random effect for that school. There is no restriction on which school an individual attends at any given time.
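As a rough illustration of the mean structure just described, the sketch below computes the conditional smoking probability from a logistic linear predictor plus a school random effect, and draws school effects from a normal distribution with mean zero. It is a minimal sketch only: the function names, the seed, and any coefficient values a reader supplies are hypothetical, and the sketch deliberately does not reproduce the intraindividual correlation or the estimation procedure of the paper.

```python
import math
import random


def inv_logit(eta: float) -> float:
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def smoking_probability(x, beta, b_k: float) -> float:
    """P(Y = 1 | b_k) for covariate vector x, fixed effects beta, school effect b_k."""
    eta = sum(xj * bj for xj, bj in zip(x, beta)) + b_k
    return inv_logit(eta)


def draw_school_effects(n_schools: int, sigma: float, seed: int = 0):
    """Draw one random effect per school from N(0, sigma^2)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n_schools)]


if __name__ == "__main__":
    effects = draw_school_effects(n_schools=5, sigma=0.3)
    # Hypothetical covariates (intercept, one predictor) and coefficients.
    for k, b_k in enumerate(effects, start=1):
        print(k, round(smoking_probability([1.0, 1.0], [-2.0, 0.5], b_k), 3))
```

A larger value of `sigma` spreads the school-specific probabilities further apart, which is the overdispersion the random effect is meant to capture.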
Note that if the intraindividual correlations ρtt' were all zero, the model given by expression 1 would be equivalent to the standard logistic-normal random effects model, which assumes conditional independence between two observations in the same school, given the random effect for that school. Conversely, if σ² were zero, the model would not include random effects and would therefore reduce to a logistic model for longitudinal data, wherein repeated observations on the same subject would be correlated, with different subjects responding independently. In the composite formulation above, neither the observations from a given student nor those from a given school are independent, even conditional on the appropriate random effect. Suppose for instance that T = 2 and that school k gives rise to six observations, corresponding to students 1, 2, and 3 each observed at times 1 and 2; for simplicity we suppose here that no students leave school k over the course of the study, and no new students enter. This implies the following correlation structures:
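$$\operatorname{corr}\left(Y_i \mid b_k\right) = \begin{pmatrix} 1 & \rho_{12} \\ \rho_{12} & 1 \end{pmatrix}, \qquad i = 1, 2, 3,$$

and, with the six observations in school k ordered by student and then by time,

$$\operatorname{corr}\left(Y_k \mid b_k\right) = \begin{pmatrix} 1 & \rho_{12} & 0 & 0 & 0 & 0 \\ \rho_{12} & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & \rho_{12} & 0 & 0 \\ 0 & 0 & \rho_{12} & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & \rho_{12} \\ 0 & 0 & 0 & 0 & \rho_{12} & 1 \end{pmatrix}.$$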
The nonzero entries in the latter matrix reflect the fact that the observation vector for school k does not consist of independent responses, even conditional on bk, since the same students will normally contribute multiple observations in the same school. These correlation matrices need to be incorporated into the estimating equations one will solve to obtain estimates of the fixed effects β and the random effects b. We propose a generalized form of the empirical Bayes' estimating equations for the logistic-normal model to estimate the parameters in the model given in expression 1. This estimation procedure is described in the Appendix and was implemented in an Splus (Becker et al. (23)) program, used for model fitting.
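The conditional correlation structure for the three-student, two-time-point example above can be assembled in a few lines. This is a minimal sketch, not the authors' Splus program: it assumes the observations within a school are ordered by student and then by time, and the function names and the value 0.4 for the correlation are hypothetical.

```python
def individual_corr(rho12: float):
    """2 x 2 conditional correlation matrix for one student observed at two times."""
    return [[1.0, rho12], [rho12, 1.0]]


def school_corr(n_students: int, rho12: float):
    """Block-diagonal conditional correlation matrix for one school.

    Rows and columns are ordered by student, then by time; observations on
    different students are conditionally independent given the school effect.
    """
    size = 2 * n_students
    mat = [[0.0] * size for _ in range(size)]
    block = individual_corr(rho12)
    for s in range(n_students):
        for a in range(2):
            for b in range(2):
                mat[2 * s + a][2 * s + b] = block[a][b]
    return mat


if __name__ == "__main__":
    for row in school_corr(n_students=3, rho12=0.4):  # hypothetical rho
        print(row)
```

Setting `rho12` to zero returns the identity matrix, which corresponds to the standard logistic-normal model in which all observations in a school are conditionally independent.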
EXAMPLE
The data set
As indicated in the introduction, the WSPP3 aimed to evaluate a school-based social influences smoking prevention program. This study consisted of an elementary and a high school component, enrolling a total of approximately 6,000 students. Initially 100 southern Ontario elementary schools from seven school boards were stratified by school risk (high, medium, or low, based on the proportion of senior (grade 8) students who were smoking) and then randomized within strata and board to one of five study conditions. Four of these were treatment conditions, corresponding to the four combinations of the type of provider who administered the intervention curriculum (nurse or teacher) and the type of training the provider received (workshop participation or self-preparation through printed material). The fifth was a "usual care" control condition. Starting in grade 6, students in the four treatment conditions received the "Keep It Clean" smoking prevention curriculum, consisting of six 40-minute lessons in grade 6, three lessons in grade 7, and six lessons in grade 8. A baseline measure of smoking status was taken prior to any intervention at the beginning of grade 6, and subsequently smoking status was measured on the same students at the end of grades 7 and 8, after which they moved on into secondary schools. Smoking status was determined based on self-report. However, to promote truthful response (see Patrick et al. (24)), prior to data collection students were advised that a breath sample would be taken to measure carbon monoxide, a marker for recent smoking. Samples were then taken from each student, and carbon monoxide levels were recorded. For the purpose of these analyses, a smoker was defined as anyone who had smoked more than once and reported not having quit smoking. Thus, smokers included both experimental and regular smokers.
As part of the high school component of this study, the students of the WSPP3 elementary cohort were followed to the end of grade 12, and their smoking status was measured on an annual basis in grades 9 through 12. In addition, 30 schools, each of which enrolled 30 or more students from the original cohort, were matched in pairs according to location (urban vs. rural), size, and the proportion of cohort students from elementary school intervention conditions. The schools in each pair were then randomized to either an intervention or a control condition. The high school intervention program covered the period to the end of grade 10 for the cohort and consisted of a school mobilization effort to involve students in activities supportive of nonsmoking. Systematic attempts were initiated by a selected staff member in each school to maximize such student participation in promoting the smoke-free cause (25).
Analysis
We begin by considering data from the first 3 years of the study, corresponding to the time the cohort of students spent in elementary schools. The subset of observations we selected includes the responses (self-reported smoking status) in grades 7 and 8 (t = 1 and 2, respectively) of those students who were nonsmokers at baseline, which was preintervention in grade 6. To illustrate the methodology, we considered a complete-case analysis, corresponding to two observations on each of 3,380 students who attended a total of 99 schools. This facilitated computational aspects of fitting the model. However, students with missing data because of temporary absence or drop-out may be different from those with complete data. Frangakis and Rubin (26) describe the possible biases that can result from a complete-case analysis. Inferences are limited to this particular subpopulation. Methods for incorporating incomplete data are not explored here, but several model refinements and alternatives are addressed in the Discussion.
We examined a logistic model formulation expressing the probability of smoking at time t as a function of the following covariates, which were found to be of most relevance:
- Cond: study condition (Cond = 1 for schools in one of the four treatment conditions described earlier, and 0 otherwise);
- Irisk: a time-dependent individual-level smoking risk score defined in terms of the smoking habits of a student's parents, siblings, and friends (27) (Irisk = 1 for students classified on the basis of social models' factors to be at low risk for smoking, Irisk = 2 for students at medium risk, and 3 for students at high risk);
- Srisk: a school-level risk score, coded as a continuous covariate ranging between 0 and 100, with larger values indicative of higher-risk schools; this was derived from an examination of the proportion of smokers among the senior students in each school and was also used in the stratification of schools as discussed above;
- Gr8: a grade effect (Gr8 = 1 for a grade 8 observation, 0 otherwise).
In addition, the interaction between Cond and Srisk (C x Srisk) was taken into consideration. Since students' individual-level risk could change over time, the value reported at time t - 1 was used to predict the observation at time t. In this analysis we focused on marginal smoking rates at each time point, assigning a student to the smoking state (Yit(k) = 1) if he or she reported being either an experimental or a regular smoker and to the nonsmoking state (Yit(k) = 0) otherwise. Letting xit(k) refer to the realization of the covariate vector (1, Cond, Irisk, Srisk, Gr8, C x Srisk) for student i at time t attending school k, the results of fitting the composite model (expression 1) are given in table 1, along with the estimates from the three models that are special cases of the more general formulation: the ordinary logistic fit, assuming independence among all observations; the GEE fit, ignoring the random school effects; and the standard empirical Bayes' logistic-normal model, assuming repeated observations on the same individual to be independent.
TABLE 1. Various model fits to elementary school data, Waterloo Smoking Prevention Project, study 3, Ontario, Canada, 1989-1996
The properties of parameter estimation based on the composite model were examined in detail via simulation by Sashegyi (22). The fixed effects estimates under the proposed approach appear to have little bias, and the standard errors obtained from the model accurately reflect the standard deviation of the estimates from the simulation results. Furthermore, the estimates appear to be approximately normally distributed, and hence Wald-type test statistics (estimate divided by standard error of estimate) can be referred to the standard normal distribution. Departures of the results for the logistic, GEE, and empirical Bayes' models from those of the composite model are indicative of limitations in the ability of these three simpler models to capture the essential features of the study design.
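The Wald-type statistic mentioned above is simple to compute. The sketch below divides an estimate by its standard error and refers the result to the standard normal distribution for a two-sided p value; the function name and the example inputs are hypothetical and are not values from table 1.

```python
import math


def wald_test(estimate: float, std_error: float):
    """Return (z, two-sided p value) for a Wald-type test against the standard normal."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    z, p = wald_test(estimate=0.30, std_error=0.12)  # hypothetical inputs
    print(round(z, 2), round(p, 4))
```

Because the test depends directly on the standard error, a model that understates the standard error of a school-level covariate will overstate the statistic and understate the p value.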
Note that the standard errors for the two school-level covariates Cond and Srisk are similar for the empirical Bayes' and composite model fits and are smaller for the other models. This is not surprising, since the effect of these covariates is measured in terms of a comparison among schools, and the precision with which it can be estimated is therefore reduced in the face of extraneous school-to-school variability. The logistic and GEE models ignore this overdispersion and hence understate the variability of parameter estimates associated with school-level covariates.
All models suggest that a student's individual risk score and that of the school(s) he/she attends are highly predictive of smoking status, and that the risk of smoking is greater in grade 8 than in grade 7. While the estimated effects at the 0-level of school risk suggest that the intervention conditions actually increase the risk of smoking, these estimates must be interpreted taking into account the C x Srisk interaction. Thus, for the typical risk range of schools in this study, lower smoking rates were observed in the intervention schools. A graphical representation of the results of fitting the composite model, illustrating this phenomenon, is shown in figure 1. While further comparing parameter estimates with their standard errors, we see that both the empirical Bayes' and the composite model fits indicate that the Cond and Srisk interaction is marginally significant, suggesting that the intervention program may be effective in lowering smoking rates in high-risk schools. Both the logistic and GEE model fits lead to a similar conclusion but estimate this interaction to be more significant than it is in actual fact. Interestingly, there was no discernible difference among the four different treatment conditions, in terms of the effect on students' smoking behavior.
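The way the C x Srisk interaction changes the sign of the intervention effect can be made concrete. Under a logistic model containing Cond, Srisk, and their product, the log odds ratio for intervention versus control at school risk s is the Cond coefficient plus s times the interaction coefficient. The sketch below uses hypothetical coefficients (0.5 and -0.02), chosen only to show the pattern; they are not the estimates reported in table 1.

```python
import math


def intervention_log_odds_ratio(beta_cond: float, beta_interaction: float,
                                srisk: float) -> float:
    """Log odds ratio (intervention vs. control) at a given school risk score."""
    return beta_cond + beta_interaction * srisk


def crossover_risk(beta_cond: float, beta_interaction: float) -> float:
    """School risk score at which the intervention log odds ratio equals zero."""
    return -beta_cond / beta_interaction


if __name__ == "__main__":
    beta_cond, beta_cxs = 0.5, -0.02  # hypothetical, not from table 1
    print("crossover at Srisk =", round(crossover_risk(beta_cond, beta_cxs), 1))
    for s in (0, 25, 50, 75):
        lor = intervention_log_odds_ratio(beta_cond, beta_cxs, s)
        print(s, round(math.exp(lor), 2))  # odds ratio at school risk s
```

With coefficients of this pattern, the odds ratio exceeds 1 only at school risk scores below the crossover point, which is why the coefficient of Cond at the 0-level of school risk cannot be read on its own.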

FIGURE 1. Estimated probability of smoking in grade 8, for high-risk individuals (risk = 3) (dotted line) and low-risk individuals (risk = 1) (solid line), Waterloo Smoking Prevention Project, study 3, Ontario, Canada, 1989-1996.
Consider now data from the secondary phase of WSPP3. In order to assess the postintervention impact of the elementary school smoking prevention program and any additional effect due to the high school intervention, we examined the high school smoking behavior of those students who were in one of the five original study conditions in grade 6, reported to be nonsmokers in grade 8, attended one of the 30 study high schools in grade 9, and provided complete data until grade 12. This resulted in four observations (grades 9 through 12) on each of 1,381 students, attending at any given time either one of the 30 study schools or a nonstudy high school (a student could for instance transfer to a nonstudy school after grade 9). This completers analysis does not take into account high school drop-outs. As before, we examined logistic model formulations for the probability of student i smoking at time t. The covariates we examined in this case were the following:
- Gr10, Gr11, Gr12: indicator variables taking value 1 for observations in grades 10, 11, and 12, respectively, and 0 otherwise;
- HScond: high school study condition (HScond = 1 for intervention schools and 0 otherwise);
- EScond: elementary school study condition (defined previously as Cond);
- Sex(F): taking value 1 for female students, 0 for males;
- Irisk: individual-level smoking risk score (as defined previously).
In addition, the interaction between Sex(F) and HScond, denoted as S x HScond, was also considered. Here we carried out an intent-to-treat analysis, treating individuals as though they remained in the same study condition throughout their high school careers. That is, the value of HScond assigned to schools in grade 9, and hence to all students therein, was taken to be fixed for each student, even if the individual moved to a school of the opposite study condition or to a nonstudy school at a later time. We considered a comparison of the same four models as are listed in table 1. In this case we are modeling T = 4 observations per student, and hence estimate, for the GEE and composite model fits, the six parameters in the intraindividual correlation matrix

$$R = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} & \rho_{14} \\ \rho_{12} & 1 & \rho_{23} & \rho_{24} \\ \rho_{13} & \rho_{23} & 1 & \rho_{34} \\ \rho_{14} & \rho_{24} & \rho_{34} & 1 \end{pmatrix}.$$
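The count of six parameters follows from the number of distinct pairs among four time points, T(T - 1)/2. A minimal sketch of how an unstructured intraindividual correlation matrix could be assembled from its pairwise entries is given below; the function names and the correlation values are hypothetical.

```python
from itertools import combinations


def n_correlation_params(T: int) -> int:
    """Number of distinct pairwise correlations among T time points."""
    return T * (T - 1) // 2


def unstructured_corr(T: int, rho: dict):
    """Build a symmetric T x T correlation matrix.

    rho maps 1-based pairs (t, u) with t < u to the correlation between
    observations at times t and u.
    """
    R = [[1.0 if r == c else 0.0 for c in range(T)] for r in range(T)]
    for t, u in combinations(range(1, T + 1), 2):
        R[t - 1][u - 1] = R[u - 1][t - 1] = rho[(t, u)]
    return R


if __name__ == "__main__":
    # Hypothetical correlations for grades 9 through 12 (t = 1, ..., 4).
    rho = {(1, 2): 0.5, (1, 3): 0.4, (1, 4): 0.3,
           (2, 3): 0.6, (2, 4): 0.5, (3, 4): 0.7}
    print(n_correlation_params(4))
    for row in unstructured_corr(4, rho):
        print(row)
```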
As indicated above, the empirical Bayes' and composite model formulations model Yit(k) conditional on bk, where bk ~ N(0, σ²), k = 1, ..., 30. These random effects correspond to the 30 study schools. An additional category (k = 31) was included and the effect b31 specified for those observations taken in any other nonstudy school (since all students in this data set attended one of the 30 study schools in grade 9, such observations were necessarily responses in grade 10 or later). Model summaries are provided in table 2.
TABLE 2. Various model fits to secondary school data, Waterloo Smoking Prevention Project, study 3, Ontario, Canada, 1989-1996
Examining the standard errors of the regression coefficients, we note the similar values for individual-level covariates under the GEE and composite model formulations, which are understated by the other two models (see in particular EScond, Sex(F), and the interaction between Sex(F) and HScond; EScond can be considered an individual as opposed to school-level covariate in this analysis, insofar as students from both elementary school study conditions are represented in a given high school). In contrast to the comments made above regarding school-level covariates, the effect of an individual-level covariate is measured in terms of a comparison between individuals but largely within schools. The variability of the parameter estimate associated with such an effect is therefore primarily a function of the within-individual correlation structure and not the variance of the random school effects. Hence, the logistic and empirical Bayes' models that neglect to adjust for intraindividual correlation tend to understate the variability of parameter estimates associated with individual-level covariates. HScond as defined is also an individual-level covariate, though insofar as most students do remain in the same study condition over time, it should also behave like a school-level covariate. This is indeed the case. Referring to table 2, the standard error of the coefficient for HScond is inflated under the GEE and empirical Bayes' fit as compared with the logistic fit, and the composite model provides the largest standard error estimate of all four models.
There is no discernible difference in the high school smoking rates between nonsmoking grade 8 students who had received the WSPP3 elementary intervention and those who had not. The secondary intervention also shows little impact. A previous analysis of the data suggested that males who were nonsmokers in grade 8 and subsequently entered a secondary intervention school showed significantly lower smoking rates than did females at the end of grade 10, and that this difference was maintained to the end of grade 12 by those males from high-risk elementary schools (25). Nevertheless, considering all the data over the entire span of the high school observation period, the effects of intervention, gender, and their interaction are slight. It would be worthwhile to consider separate analyses of specific portions of the data to avoid unduly large models containing complicated higher-order interactions. For example, one might examine males and females separately and within each gender look at groups of students with similar risk profiles.
From the various model fits in table 2 we note that responses from the same individual over time tend to be more strongly correlated in later years (compare also to the estimates of the correlation between grade 7 and 8 observations from table 1). In addition, note that the odds of a student's smoking in grade 10 as compared with grade 9 are about e^0.79 = 2.2 times larger, but that the analogous increase in comparing grades 11 and 10 is only a factor of e^(1.05 - 0.79) = 1.3, and almost negligible in comparing grades 12 and 11 (taken from the composite model fit in table 2). This suggests that smoking behavior in adolescents becomes more firmly set throughout the high school years, that is, less easily influenced by intervention programming. Launching such programs in much earlier grades seems to provide some measure of success, though it is not clear how to maintain these positive results as students move to secondary schools, apart from providing continued intensive intervention throughout the high school years.
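The odds ratios quoted above follow directly from the grade coefficients: because grade 9 is the reference category, the odds ratio between two grades is the exponential of the difference between their coefficients. The short sketch below reproduces the arithmetic using the two coefficients quoted in the text (0.79 for Gr10 and 1.05 for Gr11); the function name is hypothetical.

```python
import math


def odds_ratio(coef_later: float, coef_earlier: float = 0.0) -> float:
    """Odds ratio between two grades from their logistic regression coefficients.

    The default of 0.0 corresponds to the reference category (grade 9).
    """
    return math.exp(coef_later - coef_earlier)


if __name__ == "__main__":
    gr10, gr11 = 0.79, 1.05  # composite model coefficients quoted in the text
    print(round(odds_ratio(gr10), 1))        # grade 10 vs. grade 9 -> 2.2
    print(round(odds_ratio(gr11, gr10), 1))  # grade 11 vs. grade 10 -> 1.3
```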
DISCUSSION
The composite model described in this paper combines empirical Bayes' methods and GEE in a useful and relatively straightforward manner. It allows the modeling of more complicated correlation structures than either technique can reasonably support on its own and can thus provide an analysis that better accommodates the complex study designs of cluster-randomized longitudinal intervention trials. The model makes efficient and appropriate use of the available data in that it neither sacrifices power by collapsing observations over clusters, nor overstates the amount of information contained in the data by ignoring dependencies among the observations. It yields inferences for both individual- and cluster-level covariates that are adjusted for intracluster as well as intraindividual correlation, in a manner that is consistent with the way the study was designed. Note for instance that an alternative to the GEE approach for modeling the longitudinal correlation structure would be to impose a second level of random effects. The GEE approach, however, seems a more natural approach to this end, since it allows more flexibility in modeling the intraindividual correlations. It is reasonable, for instance, to expect observations adjacent in time to be more strongly correlated than those that are farther apart. In this paper we advocated the use of an unstructured intraindividual correlation matrix R, which imposes no restrictions on the pairwise correlations. This is recommended when the number of repeated observations per individual is relatively small. In other cases, certain autoregressive or possibly an exchangeable correlation structure may also be suitable; see the report by Liang and Zeger (11).
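For comparison with the unstructured matrix R, the two more restrictive working correlation structures mentioned above can be written down directly. The sketch below is illustrative only; it uses a first-order autoregressive form as one example of an autoregressive structure, and the function names and the value 0.5 are hypothetical.

```python
def exchangeable_corr(T: int, rho: float):
    """Exchangeable structure: every pair of time points shares one correlation."""
    return [[1.0 if r == c else rho for c in range(T)] for r in range(T)]


def ar1_corr(T: int, rho: float):
    """First-order autoregressive structure: correlation rho ** |t - t'|."""
    return [[rho ** abs(r - c) for c in range(T)] for r in range(T)]


if __name__ == "__main__":
    for row in ar1_corr(4, 0.5):  # hypothetical rho
        print(row)
    for row in exchangeable_corr(4, 0.5):
        print(row)
```

Each of these structures has a single correlation parameter regardless of T, whereas the unstructured matrix has T(T - 1)/2. The autoregressive form encodes the expectation that observations adjacent in time are more strongly correlated than those farther apart.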
The example of the WSPP3 study illustrates that standard regression models applied to correlated data can considerably underestimate the standard errors of parameter estimates. This in turn can affect the conclusions drawn from a study, and thus it is of central importance to address the sources of nonindependence. A number of extensions of the proposed composite model are possible but are not examined in this paper. We considered complete-case analyses to illustrate the proposed methodology. Missing data on students could be accommodated by deriving the estimate of ρtt' only from students who contribute data at times t and t' and implementing an approach that allows variable numbers of observations on individuals. Various imputation methods for missing data are available as well, but they would add considerable complexity to the model and are beyond the scope of this paper. Implementation of the composite model is computationally intensive and not yet possible with currently marketed software. However, the SAS macro GLIMMIX, described in the report by Littell et al. (28) and provided by the SAS Institute, Inc., offers a flexible tool for fitting generalized linear mixed models, including the composite model. The approach to estimation differs slightly from that described here, though the results are very similar. Although this macro is not as yet a validated product and an understanding of its appropriate use is necessary before one can comfortably apply it in practice, it holds much promise for users as a convenient tool for fitting models with complicated correlation structures.
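The missing-data idea described above, estimating each pairwise correlation only from students observed at both time points, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the residuals are taken to be Pearson residuals from a fitted binary model, and the simple averaging used here is one possible moment estimator, not necessarily the one used in the paper.

```python
import numpy as np

def pearson_residuals(y, p):
    """Standardized residuals (y - p) / sqrt(p (1 - p)); NaN in y marks a missing response."""
    return (y - p) / np.sqrt(p * (1.0 - p))

def pairwise_correlation(resid):
    """Estimate R from an (N, T) residual array using pairwise-available data.

    Entry (t, u) averages resid[:, t] * resid[:, u] over the students
    who have a non-missing residual at both time t and time u.
    """
    n_students, T = resid.shape
    R = np.eye(T)
    for t in range(T):
        for u in range(t + 1, T):
            both = ~np.isnan(resid[:, t]) & ~np.isnan(resid[:, u])
            if both.any():  # a pair with no jointly observed students is left at 0
                R[t, u] = R[u, t] = np.mean(resid[both, t] * resid[both, u])
    return R
```

Because each entry is based on a different subset of students, the resulting matrix is not guaranteed to be positive definite, which would need to be checked before it is used as a working correlation matrix.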
Other analyses of the type of data provided by WSPP3 are also of interest from a public health perspective. One could model the odds of smoking onset among initial nonsmokers and, conversely, the odds of quitting among initial smokers. Time-to-event analyses are appropriate in these cases, and a discrete proportional hazards model, using a complementary log-log link function, could be applied to model such data. These models accommodate the longitudinal nature of the data by definition. However, intraschool correlation would still need to be addressed. Transition models that examine the odds of smoking conditional on smoking status at the previous time point are another alternative; interest in these models is focused on state transitions over time. Details relating to these approaches are given by Sashegyi (22).
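As a minimal sketch of the discrete-time hazard setup mentioned above, the complementary log-log link and the usual person-period data layout can be written as below. The function names and the layout are illustrative assumptions; the sketch ignores intraschool correlation, which, as noted, would still need to be addressed.

```python
import numpy as np

def cloglog_hazard(eta):
    """Inverse complementary log-log link: hazard h = 1 - exp(-exp(eta))."""
    return 1.0 - np.exp(-np.exp(eta))

def person_period(event_time, observed, n_periods):
    """Expand one subject into (period, y) rows for a discrete-time hazard model.

    event_time: period of the event (if observed) or of last follow-up (if censored).
    observed:   True if the event (for example, smoking onset) occurred at event_time.
    Rows stop at the event or censoring period, or at n_periods, whichever is first.
    """
    last = min(event_time, n_periods)
    rows = [(t, 0) for t in range(1, last)]
    had_event = observed and event_time <= n_periods
    rows.append((last, 1 if had_event else 0))
    return rows
```

Fitting a binary regression with this link to the stacked person-period rows gives coefficients that can be read on the log hazard ratio scale.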
 |
APPENDIX
|
---|
Here we propose an estimation approach for the model described in expression 1. If repeated observations on the same student were independent, empirical Bayes' estimation of ß, b, and σ² would be straightforward. Let S be an N × T matrix whose (i, t) element S(i, t) is k, reflecting the school attended by individual i at time t. Following MacGibbon and Tomberlin (21), for a given initial value of σ² (to be updated empirically), the estimating equations for ß and b could be expressed as follows:
 | (A1) |
 | (A2) |
Here Xi is the portion of the full covariate or design matrix corresponding to individual i, Yi = (Yi1(·), ..., YiT(·))', and pi = (pi1(·), ..., piT(·))'. The notation leaves the school assignment unspecified, allowing students to change schools over time. For example, considering only two time periods, if the ith student attends school 4 at time 1 and school 7 at time 2, then Yi = (Yi1(4), Yi2(7))'. For school k, the vectors Yk = {Yit(k)}S(i,t)=k and pk = {pit(k)}S(i,t)=k collect the responses and probabilities for all (i, t) with S(i, t) = k. Further, 1 denotes a vector of ones of appropriate length, which allows equation A2 to be expressed in matrix notation.
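For reference, under a logit link with school effects bk distributed as N(0, σ²), the score equations of the logistic-normal log-posterior would take the following form in the notation above. This is a sketch derived from the standard model, stated as an assumption rather than as the authors' exact display; K, the number of schools, is a symbol introduced here for illustration.

```latex
% Log-posterior (up to a constant), assuming logit(p_{it(k)}) = x_{it}'\beta + b_k:
%   \ell(\beta, b) = \sum_{i,t} \{ Y_{it} \log p_{it} + (1 - Y_{it}) \log(1 - p_{it}) \}
%                    - \sum_{k=1}^{K} \frac{b_k^2}{2\sigma^2}
\begin{align}
U(\beta) &= \sum_{i=1}^{N} X_i' (Y_i - p_i) = 0, \\
U(b_k)   &= \mathbf{1}'(Y_k - p_k) - \frac{b_k}{\sigma^2} = 0, \qquad k = 1, \dots, K.
\end{align}
```

The first equation is the ordinary logistic score; the second sums the residuals within school k and shrinks bk toward zero through the prior term bk/σ².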
When the assumption of independence among repeated observations on the same student is not warranted, the intraindividual correlations ρtt' also need to be incorporated into the estimation procedure. Following the approach of Liang and Zeger (11), we propose generalized empirical Bayes' estimating equations to extend estimation for clustered data to clustered longitudinal data. These are given by
 | (A3) |
| (A4) |
where Ai = diag{pi1(·)(1 - pi1(·)), ..., piT(·)(1 - piT(·))} and Vi is the working covariance matrix Vi = Ai^(1/2) Ri Ai^(1/2), where Ri = R is the working correlation matrix with (t, t') entry ρtt'. Furthermore, Ak = diag{pit(k)(1 - pit(k))}S(i,t)=k and Vk = Ak^(1/2) Rk Ak^(1/2) is the working covariance matrix defined for school k. The entries of the corresponding correlation matrix Rk are defined in terms of the conditional correlations between repeated observations on the same individual in the same school, given the random effect for that school.
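The working covariance construction can be illustrated with a small numerical sketch. It assumes the usual Liang and Zeger form for a logit link, V = A^(1/2) R A^(1/2), and a fixed-effects score contribution of the form X'AV⁻¹(y - p); both the function names and that exact form are assumptions made for illustration, not a transcription of equations A3 and A4.

```python
import numpy as np

def working_covariance(p, R):
    """V = A^{1/2} R A^{1/2}, where A = diag{p_t (1 - p_t)} and R is the working correlation."""
    s = np.sqrt(p * (1.0 - p))
    return R * np.outer(s, s)

def score_contribution(X, y, p, R):
    """One individual's contribution X' A V^{-1} (y - p) to the equation for the fixed effects.

    X: (T, q) covariate rows, y: (T,) binary responses, p: (T,) fitted probabilities.
    """
    A = np.diag(p * (1.0 - p))
    V = working_covariance(p, R)
    return X.T @ A @ np.linalg.solve(V, y - p)
```

With R equal to the identity matrix, V reduces to A and the contribution becomes X'(y - p), the score under independent repeated observations.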
For a given value of σ², equations A3 and A4 can be set to zero and solved iteratively using a Newton-Raphson algorithm. Letting θ̂(ν) be the estimate of θ = (ß', b')' after the νth iteration, an updated estimate is obtained as
 | (A5) |
where U(θ) = (U(ß)', U(b1), ..., U(bk))' and I(θ) is a generalization of the matrix of negative second derivatives from the logistic-normal log-posterior, defined as
 | (A6) |
Moment estimates for the intraindividual correlation parameters are computed after each iteration toward a solution to the estimating equations. The correlation matrix R has diagonal elements equal to 1, and off-diagonal elements ρtt' estimated by the off-diagonal elements of the matrix
For a given value of σ², equation A5 is solved repeatedly until convergence is achieved. After the nth estimation cycle conditional on σ²(n), the prior variance is updated empirically using the formula
| (A7) |
where the variance term for school k is the (p + k, p + k) element of I(θ)⁻¹, evaluated at the nth cycle. The final parameter estimates are the values obtained upon convergence in both θ̂ and σ̂². I(θ̂)⁻¹ serves as an approximate covariance matrix for θ̂.
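The overall iteration described in this appendix can be sketched in a simplified setting. The code below assumes working independence (R equal to the identity), a logit link with one random intercept per school, and an empirical variance update that averages the squared school effects plus their approximate posterior variances. It is an illustration of the two-level iteration only, not the authors' implementation, and the function name and arguments are hypothetical.

```python
import numpy as np

def fit_logistic_normal(X, school, y, sigma2=1.0, outer=50, inner=25, tol=1e-8):
    """Empirical Bayes fit of a logistic-normal model under working independence.

    X: (n, p) covariates, school: (n,) integer labels 0..K-1, y: (n,) 0/1 responses.
    Returns (beta, b, sigma2, cov), where cov approximates the covariance of (beta', b')'.
    """
    n, p = X.shape
    K = int(school.max()) + 1
    Z = np.zeros((n, K))
    Z[np.arange(n), school] = 1.0            # one indicator column per school
    W = np.hstack([X, Z])                    # design for theta = (beta', b')'
    theta = np.zeros(p + K)
    for _ in range(outer):
        # prior precision: 0 for fixed effects, 1/sigma2 for each school effect
        prior = np.concatenate([np.zeros(p), np.full(K, 1.0 / sigma2)])
        for _ in range(inner):               # Newton-Raphson for a given sigma2
            prob = 1.0 / (1.0 + np.exp(-(W @ theta)))
            U = W.T @ (y - prob) - prior * theta
            info = W.T @ (W * (prob * (1.0 - prob))[:, None]) + np.diag(prior)
            step = np.linalg.solve(info, U)
            theta = theta + step
            if np.max(np.abs(step)) < tol:
                break
        cov = np.linalg.inv(info)
        b = theta[p:]
        # empirical update: squared effects plus their approximate variances
        new_sigma2 = np.mean(b ** 2 + np.diag(cov)[p:])
        converged = abs(new_sigma2 - sigma2) < tol
        sigma2 = new_sigma2
        if converged:
            break
    return theta[:p], theta[p:], sigma2, cov
```

Extending this sketch to the composite model would replace the independence score and information with their working-covariance counterparts and re-estimate R after each Newton-Raphson step.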
 |
ACKNOWLEDGMENTS
|
---|
This work was supported through funds from the Natural Sciences and Engineering Research Council of Canada, the National Health Research and Development Program (Canada), and the National Heart, Lung, and Blood Institute (United States).
 |
NOTES
|
---|
Reprint requests to Dr. Patrick J. Farrell, School of Mathematics and Statistics, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario, Canada K1S 5B6 (e-mail: pfarrell{at}math.carleton.ca).
 |
REFERENCES
|
---|
1. Donner A, Brown KS, Brasher P. A methodological review of non-therapeutic intervention trials employing cluster randomization, 1979–1989. Int J Epidemiol 1990;19:795–800.
2. Murray D, Rooney B, Hannan P, et al. Intraclass correlation among common measures of adolescent smoking: estimates, correlates, and applications in smoking prevention studies. Am J Epidemiol 1994;140:1038–50.
3. McKinlay SM, Stone EJ, Zucker DM. Research design and analysis issues. Health Educ Q 1989;16:307–13.
4. Murray DM, Hannan PJ, Zucker DM. Analysis issues in school-based health promotion studies. Health Educ Q 1989;16:315–20.
5. Cnaan A, Laird NM, Slasor P. Using the general linear mixed model to analyse unbalanced repeated measures and longitudinal data. Stat Med 1997;16:2349–80.
6. Flay B, Brannon B, Anderson Johnson C, et al. The television school and family smoking prevention and cessation project. Prev Med 1988;17:585–607.
7. Best JA, Brown KS, Cameron R, et al. Gender and predisposing attributes as predictors of smoking onset: implications for theory and practice. J Health Educ 1995;26:S52–60.
8. Cameron R, Brown KS, Best JA, et al. Effectiveness of a social influences smoking prevention program as a function of provider-type, training method, and school risk. Am J Public Health 1999;89:1827–31.
9. Green SB, COMMIT Research Group. Community Intervention Trial for Smoking Cessation (COMMIT). 1. Cohort results from a four-year community intervention. Am J Public Health 1995;85:183–91.
10. Norton EC, Bieler GS, Ennett ST, et al. Analysis of prevention program effectiveness with clustered data using generalized estimating equations. J Consult Clin Psychol 1996;64:919–26.
11. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986;73:13–22.
12. McCullagh P, Nelder JA. Generalized linear models. London, United Kingdom: Chapman and Hall, 1989.
13. Williams DA. Extra-binomial variation in linear logistic models. Appl Stat 1982;31:144–8.
14. Laird NM, Ware J. Random effects models for longitudinal data. Biometrics 1982;38:963–74.
15. Stiratelli R, Laird N, Ware J. Random-effects models for serial observations with binary response. Biometrics 1984;40:961–71.
16. Hedeker D, Gibbons RD, Flay BR. Random-effects regression models for clustered data with an example from smoking prevention research. J Consult Clin Psychol 1994;62:757–65.
17. Breslow NE, Clayton DG. Approximate inference in generalized linear mixed models. J Am Stat Assoc 1993;88:9–25.
18. McCulloch CE. Maximum likelihood algorithms for generalized linear mixed models. J Am Stat Assoc 1997;92:162–70.
19. Hu FB, Goldberg J, Hedeker D, et al. Comparison of population-averaged and subject-specific approaches for analyzing repeated binary outcomes. Am J Epidemiol 1998;147:694–703.
20. Pentz MA, Dwyer JH, MacKinnon DP, et al. A multicommunity trial for primary prevention of adolescent drug use. JAMA 1989;261:3259–66.
21. MacGibbon B, Tomberlin TJ. Small area estimates of proportions via empirical Bayes techniques. Surv Methodol (Stat Can) 1989;15:237–52.
22. Sashegyi AI. Models for correlated binary responses: applications for the Waterloo Smoking Prevention Projects data. (PhD dissertation). Waterloo, Canada: Department of Statistics and Actuarial Science, University of Waterloo, 1998.
23. Becker RA, Chambers JM, Wilks AR. The new S language: a programming environment for data analysis and graphics. Pacific Grove, CA: Wadsworth and Brooks/Cole, 1988.
24. Patrick DL, Cheadle A, Thompson DC, et al. The validity of self-reported smoking: a review and meta-analysis. Am J Public Health 1994;84:1086–93.
25. Brown KS, Cameron R. Long-term evaluation of an elementary and secondary school smoking intervention. (Final report). Ottawa, Canada: National Health Research and Development Program, Health Canada, 1997.
26. Frangakis CE, Rubin DB. Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes. Biometrika 1999;86:365–79.
27. Santi S, Best JA, Brown KS, et al. Social environment and smoking initiation. Int J Addict 1990;25:881–903.
28. Littell RC, Milliken GA, Stroup WW, et al. SAS system for mixed models. Cary, NC: SAS Institute, Inc, 1996.
Received for publication June 15, 1999.
Accepted for publication February 14, 2000.