From the Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT.
Received for publication August 2, 2002; accepted for publication May 8, 2003.
ABSTRACT |
---|
genetic predisposition to disease; genetics; interaction; research design; sample size
INTRODUCTION |
---|
Gene-gene interaction can be studied in both linkage studies and association studies. Linkage studies include model-based methods in which a detailed model for the disease mode of inheritance is specified and model-free methods in which no details such as allele frequencies and modes of inheritance are specified. For example, Cordell et al. (6, 7) investigated multilocus linkage tests of joint genetic effects using affected relative pairs and statistical modeling of interlocus interactions. Mitchell et al. (8) studied epistasis using variance component linkage analysis. Because association studies are likely to be more powerful than linkage studies for identifying genes with small-to-moderate effects in humans, in this article we focus our discussion on association designs.
Association designs can be broadly categorized into family-based designs and population-based designs. The family-based study design (9, 10) has received great attention in the last decade, both because of its robustness to population stratification and because of its power to identify genes with small-to-moderate effects (11). This design compares alleles transmitted to the affected children with those not transmitted. The population-based design has been criticized for possibly inducing spurious association due to population stratification. However, for certain diseases it may be easier and less expensive to collect DNA samples from unrelated persons in the general population, and previous studies have shown that population-based studies such as traditional case-control studies can be more powerful than family-based studies in identifying disease genes, both for qualitative traits (12, 13) and for quantitative traits (14). Moreover, genomic markers can be used to control for population stratification in population-based association studies (15-18).
In a recent paper, Gauderman (19) discussed sample size requirements for detecting gene-gene interaction using four different study designs: the matched case-control design, the case-sibling design, the case-parent design, and the case-only design. He used different statistical models for different designs, and he assumed a very low disease prevalence rate so that the parameter estimates would have comparable meanings. In this article, using the same logistic regression model for disease risks across different study designs, we calculate the sample sizes needed to detect gene-gene interaction with the case-parent design, the matched case-control design, and the unmatched case-control design. We make comparisons for different levels of gene-gene interaction under three genetic models: the additive model, the dominant model, and the recessive model. We find that the unmatched case-control design is more powerful than both the matched case-control design and the case-parent design, whereas the matched case-control design is more powerful than the case-parent design when the disease prevalence is moderate (10 percent) and less powerful when the disease prevalence is low (1 percent).
METHODS |
---|
$$\log\left[\frac{p(D \mid G_o)}{1 - p(D \mid G_o)}\right] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2,$$
where $X_1$ and $X_2$ are the codings for the genotypes at two candidate genes, and the codings depend on the specific genetic model being studied. We assume that we will study each candidate gene at a polymorphic site with two allelic variants, a high-risk allele (denoted by capital letters, A and B) and a low-risk allele (denoted by small letters, a and b). Therefore, for each individual, there are nine possible genotype combinations at these two marker loci. We consider three genetic models (the additive model, the dominant model, and the recessive model) by coding genotypes differently, as described in table 1. For example, under the additive model, $X_1$ and $X_2$ take the value 2 for the genotypes AA and BB, 1 for the genotypes Aa and Bb, and 0 for the genotypes aa and bb. We use the same coding system for both family-based designs and population-based designs. Note that Schaid (20) has proposed the logistic regression model for assessment of gene-environment interaction, whereas Gauderman (19) investigated gene-gene interaction using the conditional logistic regression model. In our model, the parameters to be estimated are $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$, where $\beta_0$ corresponds to the intercept, $\beta_1$ and $\beta_2$ correspond to the main effects at the two candidate genes (denoted by $X_1$ and $X_2$), and $\beta_3$ corresponds to the interaction effect (denoted by $X_1 X_2$, the product of $X_1$ and $X_2$). When there is gene-gene interaction, that is, when $\beta_3$ is not equal to 0, the effect of one gene varies over the levels of the other gene.
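As an illustration, the genotype coding and the logistic risk model can be sketched in a few lines of Python. The additive coding follows the text. The dominant and recessive codings shown are the conventional ones and are an assumption here, since table 1 is not reproduced. The function names `code_genotype` and `penetrance` are hypothetical.

```python
import math

def code_genotype(n_risk_alleles, model):
    """Map the number of high-risk alleles (0, 1, or 2) at one locus to X."""
    if n_risk_alleles not in (0, 1, 2):
        raise ValueError("n_risk_alleles must be 0, 1, or 2")
    if model == "additive":
        return n_risk_alleles             # aa -> 0, Aa -> 1, AA -> 2 (as in the text)
    if model == "dominant":
        return int(n_risk_alleles >= 1)   # assumed: one high-risk allele suffices
    if model == "recessive":
        return int(n_risk_alleles == 2)   # assumed: two high-risk alleles required
    raise ValueError(f"unknown model: {model}")

def penetrance(beta, x1, x2):
    """p(D | G) under the logistic model with an X1*X2 interaction term."""
    b0, b1, b2, b3 = beta
    eta = b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2
    return 1.0 / (1.0 + math.exp(-eta))
```

For example, `penetrance((b0, 0.0, 0.0, math.log(4)), code_genotype(2, "recessive"), code_genotype(2, "recessive"))` gives the risk for an AA/BB individual under a pure interaction model with an interaction odds ratio of 4, for whatever intercept `b0` is chosen.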
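The conditional likelihood for a set of independent family trios, based on the probability $p(G_o \mid G_p, D)$ of the offspring genotype $G_o$ given the parental genotypes $G_p$ and the child's being affected ($D$), is

$$L = \prod_{G_o, G_p} p(G_o \mid G_p, D)^{n_{G_o G_p}},$$

with the log-likelihood being

$$\ln L = \sum_{G_o, G_p} n_{G_o G_p} \ln p(G_o \mid G_p, D),$$

where $n_{G_o G_p}$ is the number of families whose diseased child has genotype $G_o$ and whose parents have genotypes $G_p$. The probability $p(G_o \mid G_p, D)$ can be calculated through the Bayes rule as

$$p(G_o \mid G_p, D) = \frac{p(D \mid G_o)\, p(G_o \mid G_p)}{\sum_{G^*} p(D \mid G^*)\, p(G^* \mid G_p)},$$

where $G_o$, $G_p$, and $D$ are defined as above. The conditional analysis is robust to population stratification in the testing of linkage or association between a candidate gene and disease using family trios. However, although conditional analysis is robust to population stratification for detecting main genetic effects, it is no longer robust for the detection of gene-gene interactions (more details are provided in the Discussion section). Therefore, we also consider unconditional analysis below as an alternative approach to studying gene-gene interactions.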
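The Bayes rule step for the conditional analysis can be illustrated with a minimal Python sketch. It assumes two unlinked loci with Mendelian transmission, and it represents each genotype by its number of high-risk alleles. The function names and the `risk` callable (standing in for the disease risk given the child's genotype) are hypothetical.

```python
from itertools import product

def offspring_dist(father, mother):
    """Mendelian distribution of the child's risk-allele count at one locus."""
    pf, pm = father / 2.0, mother / 2.0   # chance each parent transmits the risk allele
    return {0: (1 - pf) * (1 - pm),
            1: pf * (1 - pm) + (1 - pf) * pm,
            2: pf * pm}

def conditional_genotype_probs(parents_a, parents_b, risk):
    """Probability of each child genotype (a, b) given parents and an affected child.

    parents_a and parents_b are (father, mother) risk-allele counts at each locus;
    risk(a, b) returns the disease risk for a child with genotype (a, b).
    """
    dist_a = offspring_dist(*parents_a)
    dist_b = offspring_dist(*parents_b)   # loci assumed unlinked
    weights = {(a, b): risk(a, b) * dist_a[a] * dist_b[b]
               for a, b in product(range(3), repeat=2)}
    total = sum(weights.values())         # denominator of the Bayes rule
    return {g: w / total for g, w in weights.items()}
```

When the risk does not depend on genotype, the result reduces to the Mendelian transmission probabilities, which is a convenient sanity check.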
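For the unconditional formulation, let $p(G_o, G_p \mid D)$ be the probability that an affected offspring has genotype $G_o$ and his/her parents have mating type $G_p$, conditional on the child's being affected. The unconditional likelihood for a set of independent family trios is

$$L = \prod_{G_o, G_p} p(G_o, G_p \mid D)^{n_{G_o G_p}},$$

with the log-likelihood being

$$\ln L = \sum_{G_o, G_p} n_{G_o G_p} \ln p(G_o, G_p \mid D),$$

where $n_{G_o G_p}$ is the number of families in which the diseased child has genotype $G_o$ and the parents have genotypes $G_p$. The probability $p(G_o, G_p \mid D)$ can be evaluated as

$$p(G_o, G_p \mid D) = \frac{p(D \mid G_o)\, p(G_o \mid G_p)\, p(G_p)}{\sum_{G_o^*, G_p^*} p(D \mid G_o^*)\, p(G_o^* \mid G_p^*)\, p(G_p^*)}.$$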
To determine the sample sizes required to detect gene-gene interactions, we use the noncentral chi-squared distribution to approximate the distribution of the likelihood ratio statistics. To derive the noncentrality parameter, we need to maximize the expected log-likelihood for a set of independent families with the expected number of each type of family,

$$\ln L_E = \sum_{G_o, G_p} E(n_{G_o G_p}) \ln p(G_o \mid G_p, D)$$

and

$$\ln L_E = \sum_{G_o, G_p} E(n_{G_o G_p}) \ln p(G_o, G_p \mid D),$$

for the conditional and unconditional analyses, respectively, where $E(n_{G_o G_p})$ is the expected number of families whose diseased child has genotype $G_o$ and whose parents have genotypes $G_p$. It is easy to see that

$$E(n_{G_o G_p}) = N\, p(G_o, G_p \mid D),$$

where $N$ is the number of trios. The total number of persons in the sample is $3N$. For sample size calculation, we consider various genetic models with different interaction effects, and the null hypothesis assumes a genetic model with no gene-gene interactions. The likelihood ratio statistic has an approximate noncentral chi-squared distribution with 1 degree of freedom and noncentrality parameter $\lambda = N\Lambda^2$, which is the expected log-likelihood ratio test statistic, $2(\ln L_{E1} - \ln L_{E0})$; here $\ln L_{E1}$ is the expected log-likelihood allowing the presence of interaction, and $\ln L_{E0}$ is that without interaction. We maximize the likelihood by means of the simplex method (21). For a prespecified power (for example, $b$ = 80 percent) and a prespecified significance level (for example, $a$ = 5 percent), the sample size $N$ can be calculated as $(z_{a/2} + z_{1-b})^2/\Lambda^2$, where $z_a$ is the $(1 - a)$th percentile of the standard normal distribution.
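The sample size step can be sketched as follows, taking the noncentrality contributed by one sampling unit as given (the value used in the usage line is purely illustrative, not a result from this article). The helper names are hypothetical. The power function uses the fact that a noncentral chi-squared variable with 1 degree of freedom is the square of a unit-variance normal variable with mean equal to the square root of the noncentrality parameter.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def required_units(per_unit_noncentrality, alpha=0.05, power=0.80):
    """Number of sampling units N needed for a 1-df likelihood ratio test."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = _STD_NORMAL.inv_cdf(power)           # quantile matching the target power
    return (z_alpha + z_power) ** 2 / per_unit_noncentrality

def power_1df(noncentrality, alpha=0.05):
    """Power of a 1-df chi-squared test with the given total noncentrality."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    root = math.sqrt(noncentrality)
    return _STD_NORMAL.cdf(root - z_alpha) + _STD_NORMAL.cdf(-root - z_alpha)

# Illustrative only: a hypothetical per-trio noncentrality of 0.01.
n_trios = math.ceil(required_units(0.01))
total_persons = 3 * n_trios
```

Multiplying the number of trios by 3 (or the number of case-control pairs by 2) converts sampling units into the total number of subjects used for comparison across designs.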
In the following discussion, we call the conditional analysis using the case-parent design the "conditional case-parent design" and the unconditional analysis the "unconditional case-parent design."
Case-control design
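In the case-control study design, we consider both matched and unmatched case-control designs. For the unmatched case-control design, we assume that we sample $N_D$ cases and $N_{\bar{D}}$ controls, with $N_{\bar{D}} = R N_D$, where $R$ can be any prespecified positive number. The total sample size is $N_D + N_{\bar{D}} = (1 + R) N_D$. Let $p(G_i \mid D)$ be the probability that the $i$th diseased individual has genotype $G_i$ and $p(G_j \mid \bar{D})$ be the probability that the $j$th normal individual has genotype $G_j$. The likelihood for the case-control data is

$$L = \prod_{i=1}^{N_D} p(G_i \mid D) \prod_{j=1}^{N_{\bar{D}}} p(G_j \mid \bar{D}),$$

where

$$p(G_i \mid D) = \frac{p(D \mid G_i)\, p(G_i)}{\sum_{G} p(D \mid G)\, p(G)}$$

and

$$p(G_j \mid \bar{D}) = \frac{[1 - p(D \mid G_j)]\, p(G_j)}{\sum_{G} [1 - p(D \mid G)]\, p(G)},$$

in which $p(G_i)$ is the genotype frequency summarized in table 2.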
To determine the sample size, we need to maximize the expected log-likelihood for a sample with the expected numbers of cases and controls:

$$\ln L_E = \sum_{G_i} E(n_{G_i}^{D}) \ln p(G_i \mid D) + \sum_{G_j} E(n_{G_j}^{\bar{D}}) \ln p(G_j \mid \bar{D}),$$

where $E(n_{G_i}^{D})$ is the expected number of cases with genotype $G_i$ and $E(n_{G_j}^{\bar{D}})$ is the expected number of controls with genotype $G_j$. It is easy to see that

$$E(n_{G_i}^{D}) = N_D\, p(G_i \mid D) \quad \text{and} \quad E(n_{G_j}^{\bar{D}}) = R N_D\, p(G_j \mid \bar{D}).$$

The likelihood ratio statistic has an approximate noncentral chi-squared distribution with 1 degree of freedom and noncentrality parameter $\lambda = N_D \Lambda^2 = 2(\ln L_{E1} - \ln L_{E0})$, where $\ln L_{E1}$ is the expected log-likelihood under a model that allows interactions and $\ln L_{E0}$ is that without interactions. The number of required samples is $(1 + R) N_D$. In this article, we assume an equal number of cases and controls, that is, $R = 1$.
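The genotype distributions among cases and controls that enter the case-control likelihood follow from the Bayes rule, and a minimal sketch is given here. Because table 2 is not reproduced, the population genotype frequencies are assumed to follow Hardy-Weinberg proportions at each locus with independence between loci. The function names and the `risk` callable are hypothetical.

```python
from itertools import product

def hwe(q):
    """Hardy-Weinberg frequencies of 0, 1, and 2 risk alleles (assumed model)."""
    return {0: (1 - q) ** 2, 1: 2 * q * (1 - q), 2: q ** 2}

def joint_genotype_freq(q_a, q_b):
    """Two-locus genotype frequencies, assuming independence between loci."""
    fa, fb = hwe(q_a), hwe(q_b)
    return {(a, b): fa[a] * fb[b] for a, b in product(range(3), repeat=2)}

def case_control_distributions(freq, risk):
    """Genotype distributions among cases and controls, plus the prevalence."""
    prevalence = sum(f * risk(*g) for g, f in freq.items())
    cases = {g: f * risk(*g) / prevalence for g, f in freq.items()}
    controls = {g: f * (1 - risk(*g)) / (1 - prevalence) for g, f in freq.items()}
    return cases, controls, prevalence
```

Multiplying the case distribution by the number of cases, and the control distribution by the number of controls, gives the expected genotype counts used in the expected log-likelihood.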
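For the matched case-control design, we consider the 1:1 matching situation. Let $p(G_i, G_j)$ denote the probability that the diseased individual has genotype $G_i$ and the normal individual has genotype $G_j$ in a matched case-control pair. The conditional likelihood for $N$ sets of independent matched case-control pairs is

$$L = \prod_{G_i, G_j} p(G_i, G_j)^{n_{G_i G_j}},$$

where

$$p(G_i, G_j) = \frac{p(D \mid G_i)\,[1 - p(D \mid G_j)]}{p(D \mid G_i)\,[1 - p(D \mid G_j)] + p(D \mid G_j)\,[1 - p(D \mid G_i)]}$$

and $n_{G_i G_j}$ is the number of pairs in which the case has genotype $G_i$ and the control has genotype $G_j$.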
To determine the sample size, we need to maximize the expected log-likelihood for a sample with the expected number of pairs with a specific case genotype and control genotype combination:

$$\ln L_E = \sum_{G_i, G_j} E(n_{G_i G_j}) \ln p(G_i, G_j),$$

where $E(n_{G_i G_j})$ is the expected number of matched pairs with case genotype $G_i$ and control genotype $G_j$. It is easy to see that

$$E(n_{G_i G_j}) = N \cdot \frac{p(D \mid G_i)\, p(G_i)}{\sum_{G} p(D \mid G)\, p(G)} \cdot \frac{[1 - p(D \mid G_j)]\, p(G_j)}{\sum_{G} [1 - p(D \mid G)]\, p(G)},$$

where $p(G_i)$ is the genotype frequency summarized in table 2. The likelihood ratio statistic has an approximate noncentral chi-squared distribution with 1 degree of freedom and noncentrality parameter $\lambda = N\Lambda^2 = 2(\ln L_{E1} - \ln L_{E0})$, where $\ln L_{E1}$ is the maximum expected log-likelihood under a model that allows interactions and $\ln L_{E0}$ is that without interactions.
The total numbers of subjects required under different gene-gene interaction alternatives are compared by calculating the relative efficiencies, defined as the required sample size under one scenario divided by the smallest required sample size among all possible scenarios (three genetic models: the additive model, the dominant model, and the recessive model; four study designs: the conditional case-parent design, the unconditional case-parent design, the matched case-control design, and the unmatched case-control design; and the number of alternative hypotheses of gene-gene interaction). The relative efficiency of 1 indicates the most efficient design and model setup, and relative efficiencies greater than 1 mean that more samples are needed to achieve the same statistical power as the most efficient method. Similar asymptotic relative efficiency was used both by Schaid (20) in gene-environment interaction studies and by Gauderman (19) in gene-gene interaction studies.
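The relative efficiency defined above is a simple ratio, sketched here. The scenario labels and sample sizes in the usage note are hypothetical placeholders, not values from the tables, and the function name is an assumption.

```python
def relative_efficiencies(sample_sizes):
    """Divide each required sample size by the smallest one across scenarios.

    sample_sizes maps a scenario label (design, genetic model, alternative)
    to the total number of subjects required under that scenario.
    """
    smallest = min(sample_sizes.values())
    return {scenario: n / smallest for scenario, n in sample_sizes.items()}
```

For instance, `relative_efficiencies({"design 1": 200, "design 2": 300})` assigns 1.0 to the first scenario and 1.5 to the second, meaning the second needs 50 percent more subjects for the same power.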
We fix the population prevalence in the parameter estimation. Our experience is that unless population prevalence is fixed throughout the estimation, the estimate of β0 is very unstable; this may lead to an unreasonably high or low population disease prevalence and cause biologic meaning to be lost.
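One way to hold the prevalence fixed is to solve for the intercept that reproduces the target prevalence once the other coefficients are given. The sketch below uses bisection, which is our choice for illustration and not necessarily the procedure used in this article. The `cells` argument (genotype frequency with its two codings) and the function names are hypothetical.

```python
import math

def prevalence(b0, b1, b2, b3, cells):
    """Population prevalence implied by the logistic model.

    cells is an iterable of (frequency, x1, x2), one entry per genotype combination.
    """
    return sum(f / (1.0 + math.exp(-(b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2)))
               for f, x1, x2 in cells)

def solve_intercept(target, b1, b2, b3, cells, lo=-30.0, hi=30.0, tol=1e-12):
    """Find the intercept giving the target prevalence by bisection.

    Prevalence is strictly increasing in the intercept, so bisection converges.
    """
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if prevalence(mid, b1, b2, b3, cells) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2.0
```

With the intercept tied to the prevalence in this way, only the main-effect and interaction coefficients remain free when the expected log-likelihood is maximized.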
RESULTS |
---|
We consider both pure interaction models without main effects, that is, β1 = 0 and β2 = 0, and interaction models with main effects, in which we fix the main effects of the candidate genes at R1 = 3 and R2 = 3. The magnitude of the interaction is varied at Rinter = 2, Rinter = 4, and Rinter = 6, with β3 equal to log(2), log(4), and log(6). The population prevalence of the disease is varied at 10 percent and 1 percent, which correspond to a common disease and a disease with relatively low prevalence. The sample sizes for different models and designs are summarized in tables 3, 4, 5, and 6. Each table lists the total number of subjects required. For the family-based design, this number is 3Nc-p, where Nc-p is the number of case-parent trios; for the population-based design, it is 2Nc-c, where Nc-c is the number of case-control pairs.
With regard to the comparison between the matched case-control design and the conditional case-parent design, the sample size requirement depends on both the population prevalence and the interaction models. For pure interaction models without main effects, the conditional case-parent design is more efficient than the matched case-control design when the population prevalence is 1 percent. This result is consistent with Gauderman's finding for a rare disease, even though he compared sample units rather than total sample sizes. On the other hand, the matched case-control design is more efficient than the conditional case-parent design when the disease prevalence is 10 percent. For interaction models with main effects, when the disease has a population prevalence of 10 percent, the matched case-control design is more efficient than the conditional case-parent design under the additive and dominant models, but it is slightly less efficient under the recessive models. When the disease prevalence is 1 percent, the conditional case-parent design is more efficient than the matched case-control design.
For each susceptible proportion combination between the two genes, we observe a decreasing trend in the sample size requirement for the three genetic models as the magnitude of the gene-gene interaction increases from Rinter = 2 to Rinter = 6.
Regarding the sample sizes needed to detect gene-gene interactions under different genetic models, there is no clear pattern for family-based designs, but there is a clear and consistent pattern for population-based designs. For population-based designs, we have higher power under the additive model and the same power for the dominant and recessive models across both population prevalence cases and different interaction models. For the family-based design, we first examine the situation where the population prevalence is 10 percent. In this case, we have the highest power to detect gene-gene interactions under a recessive model across different interaction models and susceptible proportions at two loci for the conditional case-parent design. For the unconditional case-parent design, we have the highest power to detect gene-gene interaction under a recessive model when the proportion of susceptible persons at one gene is 0.2 or less. When the susceptible proportion is 0.25 at both genes, we have higher power to detect interactions under the additive model than under the recessive model. When the population prevalence is 1 percent, the power is higher under the additive model for most susceptible proportion combinations. These comparisons can be better visualized in figures 1, 2, 3, 4, 5, 6, 7, and 8, where the total sample sizes (3Nc-p for the case-parent designs and 2Nc-c for the case-control designs) required under three gene-gene interaction alternatives are compared according to the relative efficiencies. We note that the relative efficiencies among different models are similar between pure interaction models and interaction models with main effects, and more samples are needed under interaction models with main effects.
DISCUSSION |
---|
For simplicity, we have considered candidate genes with biallelic markers. Although it is possible to extend our models to incorporate multiple alleles at a marker, biallelic markers are much more common than other types of markers, and it is possible to group multiple alleles into two groups in modeling interactions. We have also focused only on gene-gene interactions between two markers. A complete analysis of interaction among all important markers is often not feasible, because of the rapid increase in the number of analyses required and the consequent cost in type I error. We have used 0.05 as the type I error rate. This is valid if the two marker loci being tested are known from previous studies to be associated with the disease of interest. However, for a more general study, when it is not known whether the marker loci are associated with the disease, a more stringent criterion is needed.
In his paper, Gauderman (19) focused on a disease with a population prevalence of 0.01 percent. For a population-based association study, he investigated the matched case-control design instead of the unmatched case-control design to avoid population stratification and other confounding biases. Under these scenarios, it was found that the conditional case-parent design is more efficient than the matched case-control design. We were able to reproduce Gauderman's results when fixing the population prevalence at this low level and at a more moderate level of 1 percent. However, when we considered a more common disease with a population prevalence of 10 percent, we found that the matched case-control design is more powerful than the conditional case-parent design. Our results also showed greater efficiency for the matched case-control design than for the unconditional case-parent design when the susceptible proportion of one locus is 10 percent. Compared with the results for the matched case-control design, the sample sizes required by the unmatched case-control design are systematically lower, which suggests that more efficiency can be obtained through the use of the unmatched case-control design. Consistent with another finding made by Gauderman for a very low population prevalence, our results for a population prevalence of 10 percent suggest that there is higher power to detect interactions under recessive models than under dominant models for the family-based designs and the same power under the recessive and dominant models for the population-based designs. For family-based designs, the relative sample sizes needed under additive models as compared with the other two genetic models change with different main effects, different population prevalences, and different susceptible proportions at two candidate loci.
We note that when disease prevalence is moderate (10 percent) and when both the susceptible proportions of the two loci and the interaction effect are moderate (≥0.2 and Rinter ≥ 4), we need at most several hundred subjects to detect an effect of gene-gene interaction for the family-based designs across the three genetic models. For the population-based designs, we need fewer than 250 people. When both the susceptible proportions of the two loci and the interaction effect are very small (≤0.1 and Rinter ≤ 2), the sample size requirements for the three genetic models for the family-based design and the matched case-control design are unrealistically large, requiring several thousand subjects or more to achieve reasonable power. For the unmatched case-control design, the required sample sizes are possible to achieve but still relatively large for genetic studies. When the disease prevalence is low (1 percent), the sample size requirements for the unmatched case-control design are reasonable for all of the parameter combinations considered, but the required sample sizes are too large for the matched case-control design and family-based designs when the gene-gene interaction effect is small.
Note that in the matched case-control design, we assume the same genotype probabilities for all matched case-control pairs. This assumption may not be realistic for the matched case-control design. For example, if we consider age as the matching variable, in an age-matched study of candidate disease loci and cardiovascular disease, the independence assumption would imply that cardiovascular disease incidence among persons with genotype aa does not vary with age. In this case, the sample size calculated above may seriously underestimate the actual sample size needed for the required power. Many research groups have addressed this issue in the literature. The method proposed by Dupont (22) can be used to obtain an accurate power calculation by modeling the dependence between genotypes of cases and genotypes of controls within matched case-control pairs using the correlation coefficient for genotypes. The procedure of Fleiss and Levin (23) can also be used; it first employs Schlesselman's calculation (24) and then introduces odds ratios for genotypes for matched cases and controls. Although the implications of the departure from our simple assumption for the matched case-control design need further study, our general conclusion is nevertheless not affected, because our results indicate that the unmatched case-control design is more powerful than the matched case-control design even when the sample size is probably underestimated for the matched design under our assumption.
We note that even though the biologic principle behind epistasis is intuitive, phenotypes represent unpredictable results from disease determinants. Therefore, the detection of a statistical interaction does not necessarily imply interaction on the biologic level, and it may be problematic to interpret the interaction biologically. Moreover, there are many ways in which genes can interact. The scale of the measurement is another problem we may face when modeling gene-gene interaction. It is well known that certain measurement scales of phenotype can give the impression that the interaction exists when it is actually an artifact of the scale used. Since the presence of a statistical interaction depends on the measurement scale, that is, on whether we are modeling the penetrance or the log odds, it is even harder to interpret its biologic meaning. Clearly, issues of scale and interpretation of underlying biologic meaning need additional investigation.
We have assumed a homogeneous population in this article. However, there may be potential population heterogeneity. The unmatched case-control design and the unconditional case-parent design are valid only when the study population is homogeneous, and these designs may be biased in the presence of population stratification. The major advantage of the case-parent design in comparison with the case-control design is its robustness against population stratification in the detection of genes underlying traits of interest. However, this robustness for gene identification no longer holds for the detection of gene-gene interaction. To demonstrate this, we have conducted simulations by considering two populations with different disease prevalences and different allele frequencies at each of the two candidate genes. We have found that even when these two candidate genes have no interaction effects on the disease risk under the logistic model in either population, an interaction effect may be identified when a single logistic model is fitted to the case-parent samples from the combined population. Such interaction effects can be substantial, especially when the main effects of the same allele are opposite in the two populations, for example, when allele A increases disease risk in the first population but reduces disease risk in the second population.
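A related effect can be seen in a small deterministic calculation. The sketch below is not the case-parent simulation described above. It is a simpler illustration, under assumed and entirely hypothetical parameter values, that pooling two populations with no interaction on the logit scale can yield pooled risks that do show interaction. Genotypes are reduced to binary carrier codes, and all names are hypothetical.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def pooled_risks(populations):
    """Disease risk by carrier status (x1, x2) in the pooled population."""
    risks = {}
    for x1 in (0, 1):
        for x2 in (0, 1):
            num = den = 0.0
            for pop in populations:
                f1 = pop["carrier1"] if x1 else 1 - pop["carrier1"]
                f2 = pop["carrier2"] if x2 else 1 - pop["carrier2"]
                weight = pop["share"] * f1 * f2   # P(population and this genotype)
                # Within each population the model has main effects only.
                risk = expit(pop["b0"] + pop["b1"] * x1 + pop["b2"] * x2)
                num += weight * risk
                den += weight
            risks[(x1, x2)] = num / den
    return risks

def interaction_contrast(risks):
    """Logit-scale interaction; zero when no X1*X2 term is needed."""
    return (logit(risks[(1, 1)]) - logit(risks[(1, 0)])
            - logit(risks[(0, 1)]) + logit(risks[(0, 0)]))

# Hypothetical populations: different carrier frequencies, baselines, and
# opposite main effects, with no interaction within either population.
pop_a = dict(share=0.5, carrier1=0.1, carrier2=0.1,
             b0=-3.0, b1=math.log(3), b2=math.log(3))
pop_b = dict(share=0.5, carrier1=0.5, carrier2=0.5,
             b0=-1.0, b1=-math.log(3), b2=-math.log(3))

within = interaction_contrast(pooled_risks([pop_a]))        # single population
pooled = interaction_contrast(pooled_risks([pop_a, pop_b]))  # combined population
```

Here `within` is zero up to rounding, whereas `pooled` is clearly nonzero, so a single logistic model fitted to the combined population would need an interaction term even though neither population has one.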
Therefore, all of the designs considered in this article are potentially subject to bias due to population stratification. Conditional case-parent analysis is no longer superior to unconditional case-parent analysis in terms of population stratification. Approaches other than the case-parent design are needed to ensure that the identification of gene-gene interaction is not subject to bias caused by population stratification. It is likely that genomic control methods can be used as a possible correction (14). Such methods use genomic markers believed to be independent of the disease and the candidate genes to estimate background association due to population stratification. If a positive association of disease with the genomic marker is detected, that indicates the existence of population structure. The candidate gene can then be adjusted for population stratification. However, the magnitude of gene-gene interaction between two adjusted candidate markers requires further investigation.
ACKNOWLEDGMENTS |
---|
The authors are grateful to Dr. Shuanglin Zhang for helpful discussions.