Multivariate Logistic Regression for Familial Aggregation of Two Disorders. I. Development of Models and Methods

James I. Hudson1,2,3,4, Nan M. Laird3 and Rebecca A. Betensky3

1 Biological Psychiatry Laboratory, McLean Hospital, Belmont, MA.
2 Department of Psychiatry, Harvard Medical School, Boston, MA.
3 Department of Biostatistics, Harvard School of Public Health, Boston, MA.
4 Department of Epidemiology, Harvard School of Public Health, Boston, MA.


    ABSTRACT
 
The question of whether two disorders cluster together, or coaggregate, within families often arises. This paper considers how to analyze familial aggregation of two disorders and presents two multivariate logistic regression methods that model both disorder outcomes simultaneously. The first, a proband predictive model, predicts a relative's outcomes (the presence or absence of each of the two disorders) by using the proband's disorder status. The second, a family predictive model derived from the quadratic exponential model, predicts a family member's outcomes by using all of the remaining family members' disorder statuses. The models are more realistic, flexible, and powerful than univariate models. Methods for estimation and testing account for the correlation of outcomes among family members and can be implemented by using commercial software.

family; family characteristics; generalized estimating equation; logistic models

Abbreviations: GEE, generalized estimating equations.


    INTRODUCTION
 
Demonstrating that two disorders cluster together, or coaggregate, within families is often the first step in the "chain of genetic epidemiologic research" (1, p. 82) leading to exploration of potentially common genetic factors. In this paper, we propose two multivariate models that can be used to analyze aggregation of two disorders in family studies. In a companion paper (2), we apply these models to studies of eating and mood disorders.


    PREVIOUS ANALYTICAL APPROACHES
 
Familial aggregation of a single disorder
The most common design used to assess aggregation of a single disorder samples probands with and without the disorder (case-control sampling) and then evaluates their relatives for the presence of the disorder. Compared with the typical case-control study, two features of this design allow a wider range of analytical techniques to be applied. First, we can arbitrarily consider either variable of interest (the disorder in relatives or the disorder in the proband) as exposure and the other as disease. Second, if the disorder is the same for relatives and probands, the exposure-disease distinction is unnecessary, and we can examine aggregation of the disorder among all family members, regardless of their status as proband or relative. The three main analytical approaches in the literature reflect these options (3, 4).

The first approach, following the usual case-control study analysis, models disorder status of the proband as a function of level of family history of disease. Unfortunately, there is no standard choice of predictor. Presence of disorder in at least one relative is often used, but this choice discards information and is biased, overestimating measures of relative risk when there is aggregation (5).

The second approach adopts the traditional analysis of the cohort study and models disorder in relatives as a function of disorder in the associated proband (proband predictive model, also sometimes called a marginal model). It permits a straightforward model for the log odds of disorder in a relative as a function of the disorder status of the associated proband and covariates. Consider a family with n members, including the proband. Let Yj, j = 1, ... n, denote the disorder outcome status for each family member; let Y1 denote the outcome for the proband, with Y1 = 1 if case and Y1 = 0 if control. Let Zj denote the vector of covariates for the family member j; these covariates could also include variables measured for the proband. The proband predictive model for an individual family, for j = 2, ..., n, is

(1)
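Written out under the definitions above, model 1 would take the following form. This is a sketch consistent with the surrounding text; the intercept name β0 and the covariate coefficient vector β2 are labels assumed here, and only β1 is named in the text.

```latex
\[
\operatorname{logit} \Pr(Y_j = 1 \mid Y_1, Z_j)
  = \beta_0 + \beta_1 Y_1 + \beta_2^{T} Z_j ,
\qquad j = 2, \ldots, n .
\]
```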
In this model, β1 is the log odds ratio measuring the increase in the log odds of disorder in a relative of an ill proband compared with a relative of a healthy proband. The covariates of the two relatives and their probands are held constant, but the covariates or responses of other family members are not controlled.

Unlike the typical cohort study analysis, we cannot assume that all observations are independent, because observations are clustered within families. To account for this correlation, we can make inferences on β1 by using the generalized estimating equations (GEE) method (6). Alternatively, we can use ordinary logistic regression to estimate parameters and empirical variances (7) to provide asymptotically unbiased estimates of the covariances; this method is equivalent to GEE with an independent working covariance structure. Liang and Pulver (8) have discussed use of GEE in familial aggregation studies. Zhao et al. (9) have provided an estimating equations approach to fit similar models.
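The second route described above (ordinary logistic regression for the point estimates, with empirical variances computed from scores summed within families) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes a model with only an intercept and the proband's status (no covariates), and the function name `fit_proband_model` and the row layout `(family_id, proband_status, relative_status)` are hypothetical.

```python
import math
from collections import defaultdict


def _prob(b0, b1, x):
    """Fitted probability under logit P(y = 1) = b0 + b1 * x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def fit_proband_model(rows, n_iter=25):
    """Fit logit P(relative ill) = b0 + b1 * proband_status.

    rows: iterable of (family_id, proband_status, relative_status), one row
    per relative (hypothetical layout). Returns (b0, b1, robust_var_b1),
    where the variance is the empirical (sandwich) estimate with scores
    summed within families.
    """
    rows = list(rows)
    b0 = b1 = 0.0
    for _ in range(n_iter):  # Newton-Raphson on the ordinary log-likelihood
        g0 = g1 = h00 = h01 = h11 = 0.0
        for _fam, x, y in rows:
            p = _prob(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det

    # Information matrix and per-family score sums at the final estimate.
    h00 = h01 = h11 = 0.0
    score = defaultdict(lambda: [0.0, 0.0])
    for fam, x, y in rows:
        p = _prob(b0, b1, x)
        w = p * (1.0 - p)
        h00 += w
        h01 += w * x
        h11 += w * x * x
        score[fam][0] += y - p
        score[fam][1] += (y - p) * x
    det = h00 * h11 - h01 * h01
    r0, r1 = -h01 / det, h00 / det  # second row of the inverse information
    m00 = sum(s[0] * s[0] for s in score.values())
    m01 = sum(s[0] * s[1] for s in score.values())
    m11 = sum(s[1] * s[1] for s in score.values())
    robust_var_b1 = r0 * r0 * m00 + 2.0 * r0 * r1 * m01 + r1 * r1 * m11
    return b0, b1, robust_var_b1
```

With no covariates, the estimate of β1 is the log of the ordinary 2 x 2 odds ratio; the family clustering affects only the variance, which is why the point estimates from this route and from ordinary logistic regression coincide.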

The third approach is based on an underlying multivariate probability model for the responses of all family members (family predictive model, also sometimes called a conditional model). The response of each family member is modeled, conditional on the disorder status of other family members.

In this paper, we consider a model derived from the quadratic exponential model (10) for the joint probabilities of disorder in all family members; this model was developed previously (4, 11). Others have proposed similar models (8, 12, 13). The model implies the following logistic regression model for the response of each family member, conditional on all other family members' responses:

(2)
where Yj is as defined previously, j = 1, ..., n. Here, Y-j denotes the (n - 1) vector (Y1, ..., Yj-1, Yj+1, ..., Yn), which excludes Yj; S-j = ΣYk for k ≠ j; Zj is a vector of covariates; and δ0, δ1, and δ2 are parameters. As with the proband predictive model, we can make inferences about the parameters by using GEE.
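Using only the symbols defined above, model 2 would read as follows. This is a sketch consistent with those definitions; the transpose on δ2 assumes that δ2 is a vector conformable with Zj.

```latex
\[
\operatorname{logit} \Pr(Y_j = 1 \mid Y_{-j}, Z_j)
  = \delta_0 + \delta_1 S_{-j} + \delta_2^{T} Z_j ,
\qquad S_{-j} = \sum_{k \neq j} Y_k ,
\qquad j = 1, \ldots, n .
\]
```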

The measure of aggregation, δ1, is the increase in log odds of disorder in a family member associated with each additional relative with disorder, adjusted for covariates. The model assumes that a family member's log odds of disorder increase linearly with the number of relatives with disorder, regardless of family size. The model is not reproducible, in that parameters have different interpretations for different family sizes (11, 14).

Because probands are sampled on the basis of disorder status, only the disorder statuses of relatives (j = 2, ..., n) are included as outcomes. We can also include the proband's disorder status as an outcome, provided that the intercept for the proband is different (4, 14). Because the disorder status of the proband is included in the predictor S-j, the analysis is always conditional on proband status.
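The data layout this implies can be sketched as follows: one analysis row per relative, with the predictor S-j counting affected family members other than j, the proband included. The function name `family_predictive_rows` and the convention that index 0 holds the proband are assumptions made for illustration.

```python
def family_predictive_rows(family_id, statuses):
    """Build analysis rows for the single-disorder family predictive model.

    statuses: list of 0/1 disorder indicators; index 0 is the proband
    (assumed convention). Returns one (family_id, outcome, s_minus_j) tuple
    per relative, where s_minus_j counts affected family members other than
    j. The proband contributes to s_minus_j but not to the outcomes.
    """
    total = sum(statuses)
    return [(family_id, y, total - y)
            for j, y in enumerate(statuses) if j > 0]
```

For a family with an affected proband and three relatives, of whom one is affected, `family_predictive_rows("f1", [1, 0, 1, 0])` yields three rows; the affected relative has S-j = 1 (the proband only) and each unaffected relative has S-j = 2.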

Model 2 assumes that 1) the same disorder is being assessed in probands and relatives; and 2) the parameter of aggregation, δ1, is the same for all pairs of family members, including both relative-relative pairs and relative-proband pairs. These assumptions are plausible when we can consider that cases are randomly selected from those with disorder in a family and that controls are randomly selected from those without disorder in a family; this feature is referred to as interchangeability of proband and relatives. This condition is met in many family studies. If it is not, parameters can be added for proband-relative associations, for example,

where S-1j = ΣYk for k ≠ 1 and k ≠ j. δ1 measures relative-relative associations, whereas the added coefficient on the proband's status measures relative-proband associations.
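One way to write such an extended model, given the definition of S-1j, is sketched below. The symbol δP for the proband-relative coefficient is a label assumed here for illustration and is not taken from the text; the other symbols follow model 2.

```latex
\[
\operatorname{logit} \Pr(Y_j = 1 \mid Y_{-j}, Z_j)
  = \delta_0 + \delta_1 S_{-1j} + \delta_P Y_1 + \delta_2^{T} Z_j ,
\qquad S_{-1j} = \sum_{k \neq 1,\; k \neq j} Y_k ,
\qquad j = 2, \ldots, n .
\]
```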

The major advantage of model 2 compared with model 1 is that it uses information from all pairs of family members, not simply proband-relative pairs. The major disadvantage is that parameter interpretations change with family size; thus, it may not be suitable for analyzing families of widely varying size.

Familial aggregation of two disorders
Two types of associations are of primary interest in analyzing aggregation of two disorders. The first is association within families regarding the occurrence of each single disorder. That is, to what extent does disorder A aggregate within families, and to what extent does disorder B aggregate within families? The second is association within families between the presence of one disorder and the presence of the other. That is, to what extent do A and B aggregate together (or "coaggregate") in families?

Previous studies of two disorders have almost always used a series of univariate models or an equivalent multinomial model, with logistic regression similar to model 1 or other proband predictive measures of association, such as relative morbidity risk or relative hazard (15–17). One limitation of these studies is that, with few exceptions (17), they do not account for the correlation of observations within families. However, as with univariate models, use of GEE can correct this problem. Another shortcoming is that they force possibly unnecessary terms into the models by not considering the extent to which chance co-occurrence of the two disorders, A and B, accounts for the co-occurrence of A plus B in persons (discussed later in this paper).

The most serious limitation is that these analyses cannot combine information across responses to estimate coaggregation. For example, they estimate separately the associations between 1) disorder A in a relative and disorder B in a proband, and 2) disorder B in a relative and disorder A in a proband.

Multivariate methods addressing these deficiencies have only recently been proposed (11, 18). Betensky and Whittemore (11) developed a multivariate likelihood model for correlated bivariate responses (discussed below) and applied it to the familial aggregation of breast and ovarian cancer.


    ANALYTICAL MODELS AND METHODS FOR TWO DISORDERS
 
This section describes multivariate proband and family predictive models. These models are extensions of univariate models.

Proband predictive model for two disorders
In this model, the unit of analysis is the bivariate binary response (disorder status for each of the two disorders) of each relative. The responses of each relative are modeled as a function of the disorder status of his or her proband and the relative's covariates (table 1).


TABLE 1. Interpretation of the main parameters of aggregation in the proband predictive and family predictive models

 
To write the model, we consider within each family the bivariate vector of responses (YAj, YBj)T. The first subscript denotes which disorder outcome is being considered (A for disorder A, B for disorder B), and subscript j indexes family members, with j = 1 for the proband and j > 1 for relatives. The responses are not mutually exclusive; that is, YAj = 1 means that the jth family member has disorder A (regardless of the presence of B), and YBj = 1 means that the jth family member has disorder B (regardless of the presence of A). The presence of A plus B is represented by the product of A status and B status, and the presence of neither disorder is represented by the absence of A and B. A log-linear model for the responses of a relative (YAj, YBj)T for j = 2, ..., n, given the responses of the corresponding proband (YA1, YB1)T, is

(3)

This model can be expressed by two ordinary logistic regression equations with shared coefficients:


(4)
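Under the notation above, models 3 and 4 can be written out as follows. This is a sketch built from the definitions and parameter descriptions in the text: the labels γA and γB for the within-disorder aggregation parameters, θA and θB for the covariate terms, and cj for the normalizing constant are assumptions, whereas the α's, δ, and γA*B are named in the text. The two logits follow from the joint form by taking the log ratio of the probabilities with the response of interest set to 1 and to 0.

```latex
% Model 3: joint log-linear form, for j = 2, ..., n
\[
\begin{aligned}
\log \Pr(Y_{Aj}, Y_{Bj} \mid Y_{A1}, Y_{B1}, Z_j)
  = c_j
  &+ (\alpha_A + \gamma_A Y_{A1} + \gamma_{A*B} Y_{B1} + \theta_A^{T} Z_j)\, Y_{Aj} \\
  &+ (\alpha_B + \gamma_B Y_{B1} + \gamma_{A*B} Y_{A1} + \theta_B^{T} Z_j)\, Y_{Bj}
   + \delta\, Y_{Aj} Y_{Bj}
\end{aligned}
\]

% Model 4: implied logistic regressions with shared coefficients
\[
\begin{aligned}
\operatorname{logit} \Pr(Y_{Aj} = 1 \mid Y_{Bj}, Y_{A1}, Y_{B1}, Z_j)
  &= \alpha_A + \delta Y_{Bj} + \gamma_A Y_{A1} + \gamma_{A*B} Y_{B1} + \theta_A^{T} Z_j \\
\operatorname{logit} \Pr(Y_{Bj} = 1 \mid Y_{Aj}, Y_{A1}, Y_{B1}, Z_j)
  &= \alpha_B + \delta Y_{Aj} + \gamma_B Y_{B1} + \gamma_{A*B} Y_{A1} + \theta_B^{T} Z_j
\end{aligned}
\]
```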
The model includes outcomes for relatives only. However, we could also include proband disorder status not fixed by design by using modifications similar to those discussed above for these responses in the univariate family predictive model.

The α's correspond to baseline intercepts for the two disorders; the θ's correspond to intercepts for covariates. There are four main aggregation parameters (refer to table 1 for interpretations).

Two parameters, δ and γA*B, appear in both equations. δ is always identical in both equations because model 4 is derived from model 3. However, γA*B is identical in general only when we assume that this association is independent of whether the proband has A disorder response and a relative has B response or a relative has A response and the proband has B response. This assumption is plausible when we can view probands with a given combination of disorders as being randomly selected from those with the same combination in a family, that is, when proband and relatives are interchangeable. However, in many cases, interchangeability is not plausible, particularly if the proband is selected in such a manner that his or her disease characteristics and the presence of co-occurring disorders differ from those of the proband's relatives. If we do not accept interchangeability, we can add a parameter by changing γA*BYB1 to γA*B1YB1 in the first equation and γA*BYA1 to γA*B2YA1 in the second equation of model 4.
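The idea of two logistic equations with shared coefficients can be made concrete by stacking the data so that each relative contributes two rows, one per disorder, with a design vector whose columns line up with the shared parameters. The sketch below assumes the A equation contains the terms δ·YBj, γA·YA1, and γA*B·YB1 and the B equation contains δ·YAj, γB·YB1, and γA*B·YA1, with covariates omitted; the function name `stack_proband_rows` and the column order are hypothetical.

```python
def stack_proband_rows(family_id, proband, relatives):
    """Stack bivariate responses for the proband predictive model.

    proband: (yA1, yB1); relatives: list of (yAj, yBj) pairs.
    Returns (family_id, outcome, design) tuples with design columns
    [alpha_A, alpha_B, delta, gamma_A, gamma_B, gamma_AxB] (assumed order).
    """
    y_a1, y_b1 = proband
    rows = []
    for y_a, y_b in relatives:
        # Row for the A outcome: own B status, proband A, proband B.
        rows.append((family_id, y_a, [1, 0, y_b, y_a1, 0, y_b1]))
        # Row for the B outcome: own A status, proband B, proband A.
        rows.append((family_id, y_b, [0, 1, y_a, 0, y_b1, y_a1]))
    return rows
```

Fitting a single logistic regression to the stacked rows, with variances clustered on `family_id`, would then yield one estimate each of δ and γA*B that draws on both equations. Relaxing interchangeability as described above would amount to splitting the last column into two, one for each row type.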

Family predictive model for two disorders
In this model, the unit of analysis is the bivariate binary response (disorder status for each of the two disorders) of each family member. Each subject's responses are modeled as a function of the disorder status of other family members and the subject's covariates (table 1). We use logistic regression models implied by the multivariate quadratic exponential model for familial aggregation of two disorders described by Betensky and Whittemore (11). The logistic regression analysis is less efficient than the full likelihood analysis based on the quadratic exponential model. However, the loss of efficiency may not be large; for example, Connolly and Liang (13) showed that model 2 for aggregation of a single disorder is highly efficient compared with maximum likelihood estimation.

We use notation described for models 3 and 4, except that we allow j to range from 1 to n. In addition, we let SA,-j = ΣYAk and SB,-j = ΣYBk, for k ≠ j; that is, SA,-j is the number of relatives of person j with disorder A, and SB,-j is the number with disorder B. We let Y-j denote the bivariate vector of responses of person j's relatives. Based on the quadratic exponential model, in which three-way and higher-order associations are set to zero, the model for the responses of family member j, (YAj, YBj)T, given the responses of j's relatives, (YA,-j, YB,-j)T, for j = 1, ..., n is

(5)

Model 5 can be expressed by two logistic regression equations with shared coefficients:


(6)
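With the sums SA,-j and SB,-j taking the place of the proband's statuses, models 5 and 6 would take the following form. As a sketch built from the definitions in the text, it assumes the labels γA, γB, θA, θB, and the normalizing constant cj; the α's, δ, and γA*B are named in the text.

```latex
% Model 5: joint log-linear form, for j = 1, ..., n
\[
\begin{aligned}
\log \Pr(Y_{Aj}, Y_{Bj} \mid Y_{A,-j}, Y_{B,-j}, Z_j)
  = c_j
  &+ (\alpha_A + \gamma_A S_{A,-j} + \gamma_{A*B} S_{B,-j} + \theta_A^{T} Z_j)\, Y_{Aj} \\
  &+ (\alpha_B + \gamma_B S_{B,-j} + \gamma_{A*B} S_{A,-j} + \theta_B^{T} Z_j)\, Y_{Bj}
   + \delta\, Y_{Aj} Y_{Bj}
\end{aligned}
\]

% Model 6: implied logistic regressions with shared coefficients
\[
\begin{aligned}
\operatorname{logit} \Pr(Y_{Aj} = 1 \mid Y_{Bj}, Y_{A,-j}, Y_{B,-j}, Z_j)
  &= \alpha_A + \delta Y_{Bj} + \gamma_A S_{A,-j} + \gamma_{A*B} S_{B,-j} + \theta_A^{T} Z_j \\
\operatorname{logit} \Pr(Y_{Bj} = 1 \mid Y_{Aj}, Y_{A,-j}, Y_{B,-j}, Z_j)
  &= \alpha_B + \delta Y_{Aj} + \gamma_B S_{B,-j} + \gamma_{A*B} S_{A,-j} + \theta_B^{T} Z_j
\end{aligned}
\]
```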

In this model, we assume that 1) the intercepts for the two disorders (α's) are the same for relative responses and proband responses not fixed by design; 2) the within-person association of two disorders (δ) is the same for proband and relatives; 3) pairwise measures of association (γ's) are the same for relative-relative pairs and proband-relative pairs; and 4) the coaggregation of disorder A with disorder B in different family members (γA*B) is independent of whether a given family member has A response and another family member has B response (including proband-relative pairs) or a family member has B response and another family member has A response. These assumptions are plausible when all family members are interchangeable; the model can be modified to include appropriate extra parameters if we do not accept these assumptions.

As with model 4, there are four main aggregation parameters. Their interpretation changes because of the conditioning on relatives' responses (table 1).

We include as outcomes any part of the proband disorder status not fixed by design. For example, if probands are sampled on the basis of A disorder status but the B outcome is allowed to be random, then only B disorder statuses of probands are used as outcomes. Alternatively, we can include proband disorder status fixed by design if it has separate intercept(s) (11). Because disorder status of probands is included in SA,-j and SB,-j, the analysis is always conditional on proband status.
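The rule for which proband responses enter as outcomes can be sketched in the stacked layout. The example assumes the A equation has terms δ·YBj, γA·SA,-j, and γA*B·SB,-j and the B equation has δ·YAj, γB·SB,-j, and γA*B·SA,-j, with covariates omitted. The function name `stack_family_rows`, the `proband_fixed` argument, and the convention that index 0 holds the proband are hypothetical.

```python
def stack_family_rows(family_id, members, proband_fixed=("A",)):
    """Stack bivariate responses for the family predictive model.

    members: list of (yA, yB) pairs; index 0 is the proband (assumed).
    proband_fixed: disorders whose proband status was fixed by design and
    is therefore dropped as an outcome. Design columns (assumed order):
    [alpha_A, alpha_B, delta, gamma_A, gamma_B, gamma_AxB].
    """
    total_a = sum(a for a, _ in members)
    total_b = sum(b for _, b in members)
    rows = []
    for j, (y_a, y_b) in enumerate(members):
        # Counts of other family members with A and with B; the proband's
        # statuses always enter these sums, even when not used as outcomes.
        s_a, s_b = total_a - y_a, total_b - y_b
        is_proband = (j == 0)
        if not (is_proband and "A" in proband_fixed):
            rows.append((family_id, y_a, [1, 0, y_b, s_a, 0, s_b]))
        if not (is_proband and "B" in proband_fixed):
            rows.append((family_id, y_b, [0, 1, y_a, 0, s_b, s_a]))
    return rows
```

With probands sampled on A status only, a three-member family contributes five rows: the proband's B response plus both responses of each relative. The alternative mentioned above, retaining a response fixed by design with its own intercept, would need an extra indicator column, which this sketch omits.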

Interactions
We have written the proband and family predictive models without the three interactions relating to how the presence of both disorders in a person modifies the γ parameters. The proband and family predictive models with these terms are, respectively,




(7)
and




(8)
where SAB,-j = ΣYAkYBk for k ≠ j. We can also expand the models to include interactions between covariates and other terms.

Model testing and fitting
We suggest starting with the basic model 4 or model 6; interaction terms can be added, but there is often little power to test their significance. The interpretations of the main effects given in table 1 hold only when the three interaction terms are zero. In the Appendix of the companion paper (2), we show how to implement the models by using Stata (Stata Corporation, College Station, Texas) and SAS (SAS Institute, Inc., Cary, North Carolina) software.

Use of mutually exclusive categories of disorder as responses
Alternatively, responses can be defined as mutually exclusive categories: A alone, B alone, A plus B, and no disorder. The models could be written as three logistic equations for the first three combinations and then combined. These models are reparameterizations of models 7 and 8 (derivation available from the authors upon request). Therefore, we can make the same interpretations about associations of interest for the models with interactions by using either definition.

Despite the equivalence of the underlying full models, we prefer the non-mutually-exclusive parameterization because it models more flexibly the co-occurring form of A and B, allowing for tests of interactions and potential model reduction. Whereas the mutually exclusive definition forces a full model, the non-mutually-exclusive parameterization decomposes the associations related to the co-occurrence of A plus B in persons into components due to 1) the chance co-occurrence of A alone plus B alone and 2) extra components not attributable to chance co-occurrence (modeled as interactions).
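The correspondence between the two response definitions is a simple recoding, sketched below. The category labels follow the text; the function names `to_indicators` and `to_category` are hypothetical.

```python
# Mutually exclusive category -> non-mutually-exclusive indicators (yA, yB).
CATEGORY_TO_INDICATORS = {
    "no disorder": (0, 0),
    "A alone": (1, 0),
    "B alone": (0, 1),
    "A plus B": (1, 1),
}
INDICATORS_TO_CATEGORY = {v: k for k, v in CATEGORY_TO_INDICATORS.items()}


def to_indicators(category):
    """Return (yA, yB) for one of the four mutually exclusive categories."""
    return CATEGORY_TO_INDICATORS[category]


def to_category(y_a, y_b):
    """Return the mutually exclusive category for indicators (yA, yB)."""
    return INDICATORS_TO_CATEGORY[(y_a, y_b)]
```

Under the indicator coding, the co-occurring form is the product yA·yB, which equals 1 only for "A plus B". That product is what lets the extra association tied to co-occurrence be modeled as a separate, droppable interaction.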


    DISCUSSION
 
Advantages of multivariate models
Multivariate approaches provide a more realistic and flexible model for coaggregation, an inherently multivariate concept. By contrast, univariate methods reduce a multivariate problem to a series of univariate analyses and provide more-restricted interpretations of the parameters. Multivariate techniques also have greater power to detect coaggregation because they use more information. For example, a univariate proband predictive analysis would fit separately the first and second equations of model 4, yielding two estimates for γA*B, which measures coaggregation. However, the multivariate analysis yields a single estimate, using information from both equations.

Multivariate methods have not been widely used for two main reasons. First, the correlation structure is difficult to model. However, with the development of GEE and similar methods, it is not necessary to explicitly model the correlation structure to make asymptotically valid inferences about the parameters. Second, until recently, there has been little development of models, such as the quadratic exponential model, for multivariate correlated binary data (11, 18). Although we know of no commercial software to fit them, many of these models, as was shown for the quadratic exponential model, imply logistic regression equations that can be fit with GEE by using commercial software.

Comparison of proband predictive and family predictive models
The major advantage of the family predictive versus the proband predictive model is that it uses more information in the data about coaggregation. The major disadvantage of the family predictive model is that interpretation of parameters changes with family size. It assumes that the increase in risk depends on the number of ill relatives, regardless of the total number of relatives. The degree to which variability in family size in a data set may influence interpretation of the findings is unknown. For example, in a study of the familial aggregation of a single disorder, lung cancer, Laird et al. (4) found that family size influenced the results of a family predictive model. By contrast, as reported in our companion paper (2), family size did not have much influence on the aggregation of eating and mood disorders. Betensky and Whittemore (11) also found little influence in a study of ovarian and breast cancer and showed that when both disorders are rare, variation in family size is unlikely to affect the findings substantially.

While each model has merits, we recommend the proband predictive model as a first choice. Another approach would be to apply both models. Convergence of findings would increase confidence in the conclusions, and discrepancies might reveal problems that should be evaluated further.

Design considerations
We propose several guidelines when designing studies in which these models are used. To begin with, to maximize the efficiency of coaggregation estimates, probands and relatives should meet the same definition of disorder. Otherwise, interchangeability of probands and relatives is not plausible and extra parameters must be added to the model, effectively reducing the analysis to a series of univariate models. An example of a nonuniform definition of disorder is a study of melanoma in which probands have melanoma and relatives have either melanoma or premelanotic lesions.

Because the two disorders whose aggregation we want to assess are likely to be uncommon, some form of selecting probands with disorder will usually be necessary. Furthermore, because many relatives will have neither disorder, it is not necessary to recruit a group of probands with neither disorder, provided that there are adequate numbers of probands with A alone and with B alone. Finally, because proband disorder statuses not fixed by design can be used as outcomes in this analysis, it is desirable to fix as little of the proband disorder status as possible. For example, in a study of colon and skin cancer, we might select probands with 1) colon cancer, regardless of whether they had skin cancer, and 2) skin cancer, regardless of whether they had colon cancer.

Proband selection should minimize the possibility of ascertainment bias. This bias can occur when family members are not interchangeable; thus, the probability of selection is different for pairs of family members with different aggregation parameters, after adjustment for terms already in the model. This bias can often be removed by analysis, including intercepts for additional covariates (including proband status) and interactions between covariates and predictors. However, because we may not have measured the relevant covariates and because addition of parameters reduces efficiency, it is best to avoid sampling that may still result in significant bias after simple adjustments.

A common potential cause of ascertainment bias is the effect of treatment seeking, given that many family studies use patients for probands. For example, for a family study of depression and panic disorder, suppose we select probands seeking treatment for depression. Those probands may be more likely than depressed persons not seeking treatment to have panic disorder. Because panic disorder in turn aggregates within families, we might get a spuriously high estimate for familial coaggregation of depression and panic disorder. We can adjust for the effect of an increase in the prevalence of the co-occurring disorder in probands compared with relatives by treating both the depression and panic disorder outcome of the proband as fixed. This adjustment will likely reduce bias due to treatment seeking but may not remove it entirely, as suggested by one study (19) of the familial aggregation of depression.

Conclusions
This paper has presented multivariate proband predictive and family predictive binary response models for assessing aggregation of two disorders within families. These models are more realistic, flexible, and powerful than univariate models. Furthermore, they can be easily fit by using commercial software (refer to the Appendix in our companion paper (2)), and the interpretation of model parameters is straightforward. Future research directions include development of family predictive models that are less sensitive to family size, extensions to analyze aggregation of more than two disorders, and determination of optimal sampling.


    ACKNOWLEDGMENTS
 
Supported in part by National Institute of Mental Health grant T32 MH-017119.

The authors thank Drs. Garrett Fitzmaurice and Bernard Rosner for their comments on the manuscript.


    NOTES
 
Reprint requests to Dr. James I. Hudson, Biological Psychiatry Laboratory, McLean Hospital, 115 Mill Street, Belmont, MA 02478 (e-mail: jhudson@hsph.harvard.edu).


    REFERENCES
 

  1. Faraone SV, Tsuang MT. Methods in psychiatric genetics. In: Tsuang MT, Tohen M, Zahner EP, eds. Textbook in psychiatric epidemiology. New York, NY: Wiley-Liss, 1995:81–134.
  2. Hudson JI, Laird NM, Betensky RA, et al. Multivariate regression for familial aggregation of two disorders: II. Analysis of studies of eating and mood disorders. Am J Epidemiol 2001;153:506–14.
  3. Khoury MJ, Beaty TH, Cohen BH. Fundamentals of genetic epidemiology. New York, NY: Oxford University Press, 1993:164–99.
  4. Laird NM, Fitzmaurice GM, Schwartz AG. The analysis of case-control data: epidemiologic studies of familial aggregation. In: Sen PK, Rao CR, eds. Handbook of statistics. Vol 18. New York, NY: Elsevier Science, 2000:465–81.
  5. Khoury MJ, Flanders DW. Bias in using family history as a risk factor in case-control studies of disease. Epidemiology 1995;6:511–19.
  6. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986;73:13–22.
  7. Diggle PJ, Liang KY, Zeger SL. Analysis of longitudinal data. Oxford, England: Oxford University Press, 1994.
  8. Liang KY, Pulver AE. Analysis of case-control/family sampling design. Genet Epidemiol 1996;13:253–70.
  9. Zhao LP, Hsu L, Holte S, et al. Combined association and aggregation analysis of data from case-control family studies. Biometrika 1998;85:299–315.
  10. Zhao LP, Prentice RL. Correlated binary regression using a quadratic exponential model. Biometrika 1990;77:642–8.
  11. Betensky RA, Whittemore AS. An analysis of correlated multivariate binary data: application to familial cancers of the ovary and breast. Appl Stat 1996;45:411–29.
  12. Hopper JL, Hannah MC, Mathews JD. Genetic analysis workshop II: pedigree analysis of a binary trait without assuming an underlying liability. Genet Epidemiol 1984;1:183–8.
  13. Connolly MA, Liang KY. Conditional logistic regression models for correlated binary data. Biometrika 1988;75:501–6.
  14. Whittemore AS. Logistic regression of family data from case-control studies. Biometrika 1995;82:57–67.
  15. Hudson JI, Pope HG Jr, Jonas JM, et al. A controlled family history study of bulimia. Psychol Med 1987;17:883–90.
  16. Weissman MM, Wickramaratne P, Adams PB, et al. The relationship between panic disorder and major depression: a new family study. Arch Gen Psychiatry 1993;50:767–80.
  17. Merikangas KR, Stolar M, Stevens DE, et al. Familial transmission of substance use disorders. Arch Gen Psychiatry 1998;55:973–9.
  18. Molenberghs G, Ryan LM. An exponential family model for clustered multivariate binary data. Environmetrics 1999;10:279–300.
  19. Kendler K. Is seeking treatment for depression predicted by a history of depression in relatives? Implications for family studies of affective disorder. Psychol Med 1995;25:807–14.
Received for publication December 10, 1999. Accepted for publication June 5, 2000.