1 Biological Psychiatry Laboratory, McLean Hospital, Belmont, MA.
2 Department of Psychiatry, Harvard Medical School, Boston, MA.
3 Department of Biostatistics, Harvard School of Public Health, Boston, MA.
4 Department of Epidemiology, Harvard School of Public Health, Boston, MA.
ABSTRACT |
---|
Key words: family; family characteristics; generalized estimating equation; logistic models
Abbreviations: GEE, generalized estimating equations.
INTRODUCTION |
---|
PREVIOUS ANALYTICAL APPROACHES |
---|
The first approach, following the usual case-control study analysis, models disorder status of the proband as a function of level of family history of disease. Unfortunately, there is no standard choice of predictor. Presence of disorder in at least one relative is often used, but this choice discards information and is biased: it overestimates measures of relative risk when there is aggregation (5).
The second approach adopts the traditional analysis of the cohort study and models disorder in relatives as a function of disorder in the associated proband (proband predictive model, also sometimes called a marginal model). It permits a straightforward model for the log odds of disorder in a relative as a function of the disorder status of the associated proband and covariates. Consider a family with n members, including the proband. Let Yj, j = 1, ..., n, denote the disorder outcome status for each family member; let Y1 denote the outcome for the proband, with Y1 = 1 if the proband is a case and Y1 = 0 if a control. Let Zj denote the vector of covariates for family member j; these covariates could also include variables measured for the proband. The proband predictive model for an individual family, for j = 2, ..., n, is
![]() | (1) |
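The verbal description above pins down the form of model 1. A hedged sketch in LaTeX follows; $\beta_1$ is the aggregation parameter named in the text, while the intercept $\beta_0$ and the covariate coefficient vector $\boldsymbol{\beta}_2$ are assumed symbol names chosen here for illustration.

```latex
% Sketch of the proband predictive model (model 1).
% beta_0 and beta_2 are assumed names; only beta_1 appears in the text.
\operatorname{logit} \Pr(Y_j = 1 \mid Y_1, Z_j)
  = \beta_0 + \beta_1 Y_1 + \boldsymbol{\beta}_2^{T} Z_j,
  \qquad j = 2, \ldots, n
```

Under this form, $\exp(\beta_1)$ is the odds ratio for disorder in a relative of a case proband versus a relative of a control proband, adjusted for covariates.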
Unlike the typical cohort study analysis, we cannot assume that all observations are independent, because observations are clustered within families. To account for this correlation, we can make inferences on β1 by using the generalized estimating equations (GEE) method (6). Alternatively, we can use ordinary logistic regression to estimate parameters and empirical variances (7) to provide asymptotically unbiased estimates of the covariances; this method is equivalent to GEE with an independent working covariance structure. Liang and Pulver (8) have discussed use of GEE in familial aggregation studies. Zhao et al. (9) have provided an estimating equations approach to fit similar models.
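To make the second option concrete, the following is a minimal sketch of ordinary logistic regression with an empirical (cluster-robust "sandwich") variance, where families are the clusters. It is deliberately restricted to the simplest case of model 1, a single binary predictor (proband status) and no covariates, so that the estimates have a closed form. The function name `fit_proband_predictive` and the toy data are hypothetical, and no small-sample correction is applied; in practice one would use standard statistical software.

```python
import math
from collections import defaultdict


def fit_proband_predictive(rows):
    """Fit logit Pr(Y_j = 1 | Y_1) = b0 + b1 * Y_1 with a cluster-robust SE.

    rows: list of (family_id, proband_status, relative_status) tuples,
    one per relative, with 0/1 statuses. Assumes every cell of the
    2x2 table is non-empty so the closed-form estimates exist.
    """
    # With one binary predictor, the logistic MLE reproduces the observed
    # proportion of affected relatives in each proband group.
    count = [[0, 0], [0, 0]]  # count[proband_status][relative_status]
    for _, x, y in rows:
        count[x][y] += 1
    p = [count[x][1] / (count[x][0] + count[x][1]) for x in (0, 1)]
    b0 = math.log(p[0] / (1 - p[0]))
    b1 = math.log(p[1] / (1 - p[1])) - b0

    # "Bread": information matrix A = sum_i w_i x_i x_i^T with x_i = (1, Y_1).
    a00 = a01 = a11 = 0.0
    # Per-family sums of score contributions (y - p) * x_i, for the "meat".
    score = defaultdict(lambda: [0.0, 0.0])
    for fam, x, y in rows:
        w = p[x] * (1 - p[x])
        a00 += w
        a01 += w * x
        a11 += w * x  # x is 0/1, so x * x == x
        score[fam][0] += y - p[x]
        score[fam][1] += (y - p[x]) * x

    det = a00 * a11 - a01 * a01
    inv_row1 = (-a01 / det, a00 / det)  # second row of A^{-1}

    # Sandwich variance of b1: element [1][1] of A^{-1} B A^{-1},
    # where B = sum over families of s_f s_f^T.
    var_b1 = sum((inv_row1[0] * s0 + inv_row1[1] * s1) ** 2
                 for s0, s1 in score.values())
    return b0, b1, math.sqrt(var_b1)


# Toy data (hypothetical): families 1-3 have case probands, 4-6 controls.
rows = [
    (1, 1, 1), (1, 1, 1), (1, 1, 0),
    (2, 1, 1), (2, 1, 0),
    (3, 1, 0), (3, 1, 0), (3, 1, 1),
    (4, 0, 0), (4, 0, 0), (4, 0, 0),
    (5, 0, 1), (5, 0, 0),
    (6, 0, 0), (6, 0, 0),
]
b0, b1, se_b1 = fit_proband_predictive(rows)
```

The point estimates are those of ordinary logistic regression; only the standard error changes, because score contributions are summed within each family before being squared.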
The third approach is based on an underlying multivariate probability model for the responses of all family members (family predictive model, also sometimes called a conditional model). The response of each family member is modeled, conditional on the disorder status of other family members.
In this paper, we consider a model derived from the quadratic exponential model (10) for the joint probabilities of disorder in all family members; this model was developed previously (4, 11). Others have proposed similar models (8, 12, 13). The model implies the following logistic regression model for the response of each family member, conditional on all other family members' responses:
![]() | (2) |
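As a hedged illustration of the kind of conditional logistic model described here, the sketch below writes the log odds for family member j as linear in the number of affected relatives. The Greek letters $\gamma_0$, $\gamma_1$, and $\boldsymbol{\gamma}_2$ are assumed names introduced for this sketch and are not taken from the source; $S_{-j}$ denotes the number of person j's relatives with disorder.

```latex
% Sketch of the family predictive model (model 2); symbol names are assumed.
S_{-j} = \sum_{k \neq j} Y_k, \qquad
\operatorname{logit} \Pr(Y_j = 1 \mid Y_{-j}, Z_j)
  = \gamma_0 + \gamma_1 S_{-j} + \boldsymbol{\gamma}_2^{T} Z_j
```

In this notation, $\gamma_1$ plays the role of the measure of aggregation: the coefficient of $S_{-j}$.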
The measure of aggregation, the coefficient of S-j (the number of person j's relatives with disorder), is the increase in log odds of disorder in a family member associated with each additional relative with disorder, adjusted for covariates. The model assumes that a family member's log odds of disorder increase linearly with the number of relatives with disorder, regardless of family size. The model is not reproducible, in that parameters have different interpretations for different family sizes (11, 14).
Because probands are sampled on the basis of disorder status, only the disorder statuses of relatives (j = 2, ..., n) are included as outcomes. We can also include the proband's disorder status as an outcome, provided that the intercept for the proband is different (4, 14). Because the disorder status of the proband is included in the predictor S-j, the analysis is always conditional on proband status.
Model 2 assumes that 1) the same disorder is being assessed in probands and relatives; and 2) the parameter of aggregation (the coefficient of S-j) is the same for all pairs of family members, including both relative-relative pairs and relative-proband pairs. These assumptions are plausible when we can consider that cases are randomly selected from those with disorder in a family and that controls are randomly selected from those without disorder in a family; this feature is referred to as interchangeability of proband and relatives. This condition is met in many family studies. If it is not, parameters can be added for proband-relative associations, for example,
![]() |
The major advantage of model 2 compared with model 1 is that it uses information from all pairs of family members, not simply proband-relative pairs. The major disadvantage is that parameter interpretations change with family size; thus, it may not be suitable for analyzing families of widely varying size.
Familial aggregation of two disorders
Two types of associations are of primary interest in analyzing aggregation of two disorders. The first is association within families regarding the occurrence of each single disorder. That is, to what extent does disorder A aggregate within families, and to what extent does disorder B aggregate within families? The second is association within families between the presence of one disorder and the presence of the other. That is, to what extent do A and B aggregate together (or "coaggregate") in families?
Previous studies of two disorders have almost always used a series of univariate models or an equivalent multinomial model, with logistic regression similar to model 1 or other proband predictive measures of association, such as relative morbidity risk or relative hazard (15–17). One limitation of these studies is that, with few exceptions (17), they do not account for the correlation of observations within families. However, as with univariate models, use of GEE can correct this problem. Another shortcoming is that they force possibly unnecessary terms into the models by not considering the extent to which chance co-occurrence of the two disorders, A and B, accounts for the co-occurrence of A plus B in persons (discussed later in this paper).
The most serious limitation is that these analyses cannot combine information across responses to estimate coaggregation. For example, they estimate separately the associations between 1) disorder A in a relative and disorder B in a proband, and 2) disorder B in a relative and disorder A in a proband.
Multivariate methods addressing these deficiencies have only recently been proposed (11, 18). Betensky and Whittemore (11) developed a multivariate likelihood model for correlated bivariate responses (discussed below) and applied it to the familial aggregation of breast and ovarian cancer.
ANALYTICAL MODELS AND METHODS FOR TWO DISORDERS |
---|
Proband predictive model for two disorders
In this model, the unit of analysis is the bivariate binary response (disorder status for each of the two disorders) of each relative. The responses of each relative are modeled as a function of the disorder status of his or her proband and the relative's covariates (table 1).
|
![]() | (3) |
![]() |
![]() |
![]() | (4) |
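As a hedged sketch of the pair of logistic equations that the surrounding text describes for model 4, the LaTeX below shows one plausible form. None of the Greek letters is taken from the source: $\alpha_A, \alpha_B$ (baseline intercepts), $\boldsymbol{\theta}_A, \boldsymbol{\theta}_B$ (covariate terms), $\delta$ (within-person association of A and B), $\beta_A, \beta_B$ (aggregation of each disorder), and $\beta_{A*B}$ (coaggregation) are all assumed names. $Y_{A1}$ and $Y_{B1}$ are the proband's statuses, and $j = 2, \ldots, n$ indexes relatives.

```latex
% Sketch of the two conditional logistic equations of model 4.
% All Greek symbol names are assumptions made for this illustration.
\begin{aligned}
\operatorname{logit} \Pr(Y_{Aj} = 1 \mid Y_{Bj}, Y_{A1}, Y_{B1}, Z_j)
  &= \alpha_A + \delta\, Y_{Bj} + \beta_A Y_{A1} + \beta_{A*B} Y_{B1}
     + \boldsymbol{\theta}_A^{T} Z_j, \\
\operatorname{logit} \Pr(Y_{Bj} = 1 \mid Y_{Aj}, Y_{A1}, Y_{B1}, Z_j)
  &= \alpha_B + \delta\, Y_{Aj} + \beta_B Y_{B1} + \beta_{A*B} Y_{A1}
     + \boldsymbol{\theta}_B^{T} Z_j.
\end{aligned}
```

In this sketch the coaggregation coefficient multiplies $Y_{B1}$ in the equation for disorder A and $Y_{A1}$ in the equation for disorder B, which is what allows information on coaggregation to be combined across the two responses.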
The equations contain baseline intercepts for the two disorders and intercepts for covariates. There are four main aggregation parameters (refer to table 1 for interpretations). Two parameters, the within-person association between A and B and the coaggregation parameter (subscripted A*B), appear in both equations. The within-person association parameter is always identical in both equations because model 4 is derived from model 3. However, the A*B coaggregation parameter is identical in general only when we assume that this association is independent of whether the proband has A disorder response and a relative has B response or a relative has A response and the proband has B response. This assumption is plausible when we can view probands with a given combination of disorders as being randomly selected from those with the same combination in a family; that is, when proband and relatives are interchangeable. However, in many cases, interchangeability is not plausible, particularly if the proband is selected in such a manner that his or her disease characteristics and the presence of co-occurring disorders differ from those of the proband's relatives. If we do not accept interchangeability, we can add a parameter by replacing the single A*B coefficient with two: one (subscripted A*B1) multiplying YB1 in the first equation and another (subscripted A*B2) multiplying YA1 in the second equation of model 4.
Family predictive model for two disorders
In this model, the unit of analysis is the bivariate binary response (disorder status for each of the two disorders) of each family member. Each subject's responses are modeled as a function of the disorder status of other family members and the subject's covariates (table 1). We use logistic regression models implied by the multivariate quadratic exponential model for familial aggregation of two disorders described by Betensky and Whittemore (11). The logistic regression analysis is less efficient than the full likelihood analysis based on the quadratic exponential model. However, the loss of efficiency may not be large; for example, Connolly and Liang (13) showed that model 2 for aggregation of a single disorder is highly efficient compared with maximum likelihood estimation.
We use notation described for models 3 and 4, except that we allow j to range from 1 to n. In addition, we let $S_{A,-j} = \sum_{k \neq j} Y_{Ak}$ and $S_{B,-j} = \sum_{k \neq j} Y_{Bk}$; that is, $S_{A,-j}$ is the number of relatives of person j with disorder A, and $S_{B,-j}$ is the number with disorder B. We let $Y_{-j}$ denote the bivariate vector of responses of person j's relatives. Based on the quadratic exponential model, in which three-way and higher-order associations are set to zero, the model for the responses of family member j, $(Y_{Aj}, Y_{Bj})^T$, given the responses of j's relatives, $(Y_{A,-j}, Y_{B,-j})^T$, for j = 1, ..., n is
![]() | (5) |
![]() |
![]() |
![]() | (6) |
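As a hedged sketch of the pair of logistic equations that the text describes for model 6, the LaTeX below replaces the proband's statuses with the counts of affected relatives, $S_{A,-j}$ and $S_{B,-j}$. All Greek letters are assumed names introduced for this illustration and are not taken from the source: $\alpha_A, \alpha_B$ (intercepts), $\delta$ (within-person association), $\gamma_A, \gamma_B$ (aggregation of each disorder), $\gamma_{A*B}$ (coaggregation), and $\boldsymbol{\theta}_A, \boldsymbol{\theta}_B$ (covariate terms).

```latex
% Sketch of the two conditional logistic equations of model 6.
% All Greek symbol names are assumptions made for this illustration.
\begin{aligned}
\operatorname{logit} \Pr(Y_{Aj} = 1 \mid Y_{Bj}, Y_{-j}, Z_j)
  &= \alpha_A + \delta\, Y_{Bj} + \gamma_A S_{A,-j} + \gamma_{A*B} S_{B,-j}
     + \boldsymbol{\theta}_A^{T} Z_j, \\
\operatorname{logit} \Pr(Y_{Bj} = 1 \mid Y_{Aj}, Y_{-j}, Z_j)
  &= \alpha_B + \delta\, Y_{Aj} + \gamma_B S_{B,-j} + \gamma_{A*B} S_{A,-j}
     + \boldsymbol{\theta}_B^{T} Z_j.
\end{aligned}
```

Each coefficient of a count is interpreted per additional affected relative, conditional on the other terms, which is why interpretation depends on family size.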
In this model, we assume that 1) the intercepts for the two disorders are the same for relative responses and proband responses not fixed by design; 2) the within-person association of the two disorders is the same for proband and relatives; 3) pairwise measures of association are the same for relative-relative pairs and proband-relative pairs; and 4) the coaggregation of disorder A with disorder B in different family members (the parameter subscripted A*B) is independent of whether a given family member has A response and another family member has B response (including proband-relative pairs) or a family member has B response and another family member has A response. These assumptions are plausible when all family members are interchangeable; the model can be modified to include appropriate extra parameters if we do not accept these assumptions.
As with model 4, there are four main aggregation parameters. Their interpretation changes because of the conditioning on relatives' responses (table 1).
We include as outcomes any part of the proband disorder status not fixed by design. For example, if probands are sampled on the basis of A disorder status but the B outcome is allowed to be random, then only B disorder statuses of probands are used as outcomes. Alternatively, we can include proband disorder status fixed by design if it has separate intercept(s) (11). Because disorder status of probands is included in SA,-j and SB,-j, the analysis is always conditional on proband status.
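The data layout implied by this paragraph can be sketched as follows: one row per family member and disorder, with the leave-one-out counts as predictors, skipping any proband outcome that was fixed by design. This is a minimal illustration only; the function name `family_predictive_rows`, the dictionary keys, and the convention that the proband is listed first are all hypothetical choices, and covariates are omitted.

```python
def family_predictive_rows(family, proband_fixed=("A",)):
    """Build regression rows for one family under the family predictive model.

    family: list of dicts with 0/1 entries under keys "A" and "B";
            index 0 is assumed to be the proband.
    proband_fixed: disorders on which the proband was sampled; those
            proband statuses are not used as outcomes.
    """
    total_a = sum(member["A"] for member in family)
    total_b = sum(member["B"] for member in family)
    rows = []
    for j, member in enumerate(family):
        # Leave-one-out counts S_{A,-j} and S_{B,-j}; the proband's status
        # is included whenever j is not the proband.
        s_a = total_a - member["A"]
        s_b = total_b - member["B"]
        for outcome, other in (("A", "B"), ("B", "A")):
            if j == 0 and outcome in proband_fixed:
                continue  # fixed by design, so not a random outcome
            rows.append({
                "member": j,
                "outcome": outcome,
                "y": member[outcome],
                "other_disorder": member[other],   # within-person term
                "s_same": s_a if outcome == "A" else s_b,   # aggregation
                "s_other": s_b if outcome == "A" else s_a,  # coaggregation
            })
    return rows


# Hypothetical family: proband sampled for A, plus two relatives.
family = [{"A": 1, "B": 0}, {"A": 1, "B": 1}, {"A": 0, "B": 1}]
rows = family_predictive_rows(family, proband_fixed=("A",))
```

Stacking rows like these across families gives a data set to which a logistic model can be fit with GEE, using the family as the cluster.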
Interactions
We have written the proband and family predictive models without the three interactions relating to how the presence of both disorders in a person modifies the parameters. The proband and family predictive models with these terms are, respectively,
![]() |
![]() |
![]() |
![]() | (7) |
![]() |
![]() |
![]() |
![]() | (8) |
Model testing and fitting
We suggest starting with the basic model 4 or model 6; interaction terms can be added, but there is often little power to test their significance. The interpretations of the main effects given in table 1 hold only when the three interaction terms are zero. In the Appendix of the companion paper (2), we show how to implement the models by using Stata (Stata Corporation, College Station, Texas) and SAS (SAS Institute, Inc., Cary, North Carolina) software.
Use of mutually exclusive categories of disorder as responses
Alternatively, responses can be defined as mutually exclusive categories: A alone, B alone, A plus B, and no disorder. The models could be written as three logistic equations for the first three combinations and then combined. These models are reparameterizations of models 7 and 8 (derivation available from the authors upon request). Therefore, we can make the same interpretations about associations of interest for the models with interactions by using either definition.
Despite the equivalence of the underlying full models, we prefer the non-mutually-exclusive parameterization because it models more flexibly the co-occurring form of A and B, allowing for tests of interactions and potential model reduction. Whereas the mutually exclusive definition forces a full model, the non-mutually-exclusive parameterization decomposes the associations related to the co-occurrence of A plus B in persons into components due to 1) the chance co-occurrence of A alone plus B alone and 2) extra components not attributable to chance co-occurrence (modeled as interactions).
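The correspondence between the two ways of coding responses can be sketched as a simple one-to-one mapping. The function names below are hypothetical; the category labels follow the text.

```python
# Mutually exclusive categories and their bivariate (Y_A, Y_B) codes.
CATEGORY_TO_BIVARIATE = {
    "no disorder": (0, 0),
    "A alone": (1, 0),
    "B alone": (0, 1),
    "A plus B": (1, 1),
}
BIVARIATE_TO_CATEGORY = {v: k for k, v in CATEGORY_TO_BIVARIATE.items()}


def to_category(y_a, y_b):
    """Map non-mutually-exclusive statuses (Y_A, Y_B) to one category."""
    return BIVARIATE_TO_CATEGORY[(int(bool(y_a)), int(bool(y_b)))]


def to_bivariate(category):
    """Map a mutually exclusive category back to (Y_A, Y_B)."""
    return CATEGORY_TO_BIVARIATE[category]
```

Because the mapping loses no information, the choice between the two codings concerns parameterization only, as discussed above.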
DISCUSSION |
---|
Multivariate methods have not been widely used for two main reasons. First, the correlation structure is difficult to model. However, with the development of GEE and similar methods, it is not necessary to explicitly model the correlation structure to make asymptotically valid inferences about the parameters. Second, until recently, there has been little development of models, such as the quadratic exponential model, for multivariate correlated binary data (11, 18). Although we know of no commercial software to fit them, many of these models (as was shown for the quadratic exponential model) imply logistic regression equations that can be fit with GEE by using commercial software.
Comparison of proband predictive and family predictive models
The major advantage of the family predictive versus the proband predictive model is that it uses more information in the data about coaggregation. The major disadvantage of the family predictive model is that interpretation of parameters changes with family size. It assumes that the increase in risk depends on the number of ill relatives, regardless of the total number of relatives. The degree to which variability in family size in a data set may influence interpretation of the findings is unknown. For example, in a study of the familial aggregation of a single disorder, lung cancer, Laird et al. (4) found that family size influenced the results of a family predictive model. By contrast, as reported in our companion paper (2), family size did not have much influence on the aggregation of eating and mood disorders. Betensky and Whittemore (11) also found little influence in a study of ovarian and breast cancer and showed that when both disorders are rare, variation in family size is unlikely to affect the findings substantially.
While each model has merits, we recommend the proband predictive model as a first choice. Another approach would be to apply both models. Convergence of findings would increase confidence in the conclusions, and discrepancies might reveal problems that should be evaluated further.
Design considerations
We propose several guidelines when designing studies in which these models are used. To begin with, to maximize the efficiency of coaggregation estimates, probands and relatives should meet the same definition of disorder. Otherwise, interchangeability of probands and relatives is not plausible and extra parameters must be added to the model, effectively reducing the analysis to a series of univariate models. An example of a nonuniform definition of disorder is a study of melanoma in which probands have melanoma and relatives have either melanoma or premelanotic lesions.
Because the two disorders whose aggregation we want to assess are likely to be uncommon, some form of selecting probands with disorder will usually be necessary. Furthermore, because many relatives will have neither disorder, it is not necessary to recruit a group of probands with neither disorder, provided that there are adequate numbers of probands with A alone and with B alone. Finally, because proband disorder statuses not fixed by design can be used as outcomes in this analysis, it is desirable to fix as little of the proband disorder status as possible. For example, in a study of colon and skin cancer, we might select probands with 1) colon cancer, regardless of whether they had skin cancer, and 2) skin cancer, regardless of whether they had colon cancer.
Proband selection should minimize the possibility of ascertainment bias. This bias can occur when family members are not interchangeable; thus, the probability of selection is different for pairs of family members with different aggregation parameters, after adjustment for terms already in the model. This bias can often be removed by analysis, including intercepts for additional covariates (including proband status) and interactions between covariates and predictors. However, because we may not have measured the relevant covariates and because addition of parameters reduces efficiency, it is best to avoid sampling that may still result in significant bias after simple adjustments.
A common potential cause of ascertainment bias is the effect of treatment seeking, given that many family studies use patients as probands. For example, for a family study of depression and panic disorder, suppose we select probands seeking treatment for depression. Those probands may be more likely than depressed persons not seeking treatment to have panic disorder. Because panic disorder in turn aggregates within families, we might get a spuriously high estimate for familial coaggregation of depression and panic disorder. We can adjust for the effect of an increase in the prevalence of the co-occurring disorder in probands compared with relatives by treating both the depression and panic disorder outcomes of the proband as fixed. This adjustment will likely reduce bias due to treatment seeking but may not remove it entirely, as suggested by one study (19) of the familial aggregation of depression.
Conclusions
This paper has presented multivariate proband predictive and family predictive binary response models for assessing aggregation of two disorders within families. These models are more realistic, flexible, and powerful than univariate models. Furthermore, they can be easily fit by using commercial software (refer to the Appendix in our companion paper (2)), and the interpretation of model parameters is straightforward. Future research directions include development of family predictive models that are less sensitive to family size, extensions to analyze aggregation of more than two disorders, and determination of optimal sampling.
ACKNOWLEDGMENTS |
---|
The authors thank Drs. Garrett Fitzmaurice and Bernard Rosner for their comments on the manuscript.
NOTES |
---|
REFERENCES |
---|