From the Department of Preventive Medicine, School of Medicine, University of Southern California, 1540 Alcazar Street, Suite 220, Los Angeles, CA 90089 (e-mail: jimg{at}usc.edu).
ABSTRACT
association; case-control studies; genetics; interaction; research design; sample size
Abbreviations: G x G, gene-gene [interaction]; G x E, gene-environment [interaction]; GST, glutathione S-transferase
INTRODUCTION
Case-control studies are widely used in epidemiology for studying associations between disease and potential risk factors. It is critical to the success of such studies that an adequately sized sample be recruited. For case-control studies of gene-environment (G x E) interaction, several authors have developed methods for estimating required sample sizes in both unmatched (1-4) and matched (5, 6) designs. In this paper, I describe a general method of computing required sample size for tests of gene-gene (G x G) interaction in the context of four designs: the matched case-control design (7), the case-sibling design (8-10), the case-parent design (11, 12), and the case-only design (13-15). For a range of genetic models, I provide estimates of sample size needed for detecting G x G interaction, with the primary goal of comparing requirements across designs to infer their efficiencies relative to one another.
METHODS
Sampling designs and likelihood formation
Below I describe the design and analytical approach for each of the four case-based designs considered in this paper. In each design, cases are subjects affected by the disease of interest, perhaps with restrictions on age of onset, disease subtype, etc.
Matched case-control design.
In the matched case-control design, controls are subjects who are unaffected by the disease of interest and are assumed to be genetically unrelated to cases. They should be selected from the same source population as the cases. Since genotype frequencies may vary across ethnic groups, controls will typically be matched to cases according to ethnicity in order to avoid confounding, also known as population stratification bias. Age is also likely to be a matching factor for complex diseases, since disease risk often varies substantially by age.
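In the simpler situation of testing for an association between a disease and a single gene, McNemar's χ² test and the associated matched odds ratio could be used for analysis. The natural extension of this method to allow modeling of two genes and their interaction is conditional logistic regression (7). The corresponding conditional likelihood in a sample of N matched sets has the form

$$L(\beta) = \prod_{i=1}^{N} \frac{\exp(\beta_g G_{i1} + \beta_h H_{i1} + \beta_{gh} G_{i1} H_{i1})}{\sum_{j \in M(i)} \exp(\beta_g G_{ij} + \beta_h H_{ij} + \beta_{gh} G_{ij} H_{ij})} \qquad (1)$$

where $G_{ij}$ and $H_{ij}$ code the genotypes at the two loci for member j of matched set i, j = 1 denotes the case, M(i) denotes all members of set i (the case and its controls), and exp(βg), exp(βh), and exp(βgh) are the odds ratios Rg and Rh and the odds-ratio ratio Rgh.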
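As a minimal sketch of how a conditional likelihood of this kind is evaluated, the function below sums the log contribution of each matched set: the case's linear predictor minus the log of the summed exponentiated predictors over the whole set. The 0/1 genotype coding, the "case listed first" convention, and the function names are assumptions made for illustration, not part of the original method description.

```python
import math

def linear_predictor(g, h, beta_g, beta_h, beta_gh):
    """Log relative risk for one subject with genotype indicators g and h (0 or 1)."""
    return beta_g * g + beta_h * h + beta_gh * g * h

def conditional_log_likelihood(matched_sets, beta_g, beta_h, beta_gh):
    """Sum of per-set log conditional-likelihood contributions.

    Each matched set is a list of (g, h) tuples. By convention (assumed here)
    the first tuple is the case and the remaining tuples are its controls.
    """
    total = 0.0
    for members in matched_sets:
        scores = [linear_predictor(g, h, beta_g, beta_h, beta_gh) for g, h in members]
        # log of exp(case score) / sum over the set of exp(score)
        total += scores[0] - math.log(sum(math.exp(s) for s in scores))
    return total

# Two hypothetical 1:1 matched pairs (case first, control second).
pairs = [[(1, 1), (0, 1)], [(1, 0), (1, 0)]]
loglik_null = conditional_log_likelihood(pairs, 0.0, 0.0, 0.0)
loglik_alt = conditional_log_likelihood(pairs, 0.0, 0.0, math.log(3.0))
```

With all coefficients at zero, each 1:1 pair contributes log(1/2). Only the first pair is discordant for the interaction term, so only it changes when βgh moves away from zero.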
Case-sibling design.
In the case-sibling design, controls are selected from unaffected siblings of the case. For a disease with variation in age of onset, an eligible sibling should have attained the age of the case free of disease, which often will restrict the sample to older siblings. While this restriction can be problematic in the analysis of environmental factors if there are secular changes in exposure levels (10), it should not bias a study of two genes and their interaction. The conditional likelihood in equation 1 can be used to estimate odds ratios Rg and Rh and the odds-ratio ratio Rgh (9, 10).
Case-parent design.
In the case-parent design, genotypes are measured in the parents of the case, but parental disease status is neither required nor used in the analysis. The most commonly used approach to the analysis of a single gene is the transmission disequilibrium test (11), which is equivalent to McNemar's χ² test comparing the distributions of alleles transmitted and nontransmitted from parents to the case. As in the case-control settings, this approach can be generalized to the analysis of two or more genes and their interaction(s) using conditional logistic regression (5, 9, 10, 12, 20). The likelihood is the same as that shown in equation 1, where the denominator now includes a contribution from the case and from 15 "pseudosiblings" of the case, the latter formed as the 15 possible joint genotypes that the case could have inherited from the parents but did not. For example, if the father's genotype was Aa/Bb (at loci G/H, respectively), the mother's was aa/bb, and the case's was Aa/Bb, the 15 pseudosibling genotypes would include three copies of Aa/Bb, four of Aa/bb, four of aa/Bb, and four of aa/bb. The exp(β) quantities based on equation 1 represent genetic relative risks (Rg and Rh) and the relative-risk ratio (Rgh), rather than odds-ratio parameters as in the case-control design (21). Of course, these will be equivalent to the odds-ratio parameters provided that the disease is rare in all genetic subcategories. Some researchers have described the application of Poisson regression to the case-parent design (22, 23), an approach that can be extended to allow for maternally mediated effects (24) and imprinting (25) and can be used when there are missing parental data (26).
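The pseudosibling construction can be sketched by enumerating the 4 x 4 = 16 joint transmissions from the parents and removing the one that produced the case. The genotype-string representation and the function names below are illustrative assumptions. The sketch also assumes unlinked loci, so that all 16 transmissions are equally likely.

```python
from collections import Counter
from itertools import product

def locus_transmissions(father, mother):
    """All four (paternal allele, maternal allele) transmissions at one locus,
    each written as a sorted two-letter genotype string such as 'Aa'."""
    return ["".join(sorted(p + m)) for p, m in product(father, mother)]

def pseudosiblings(father, mother, case):
    """Return the 15 joint genotypes the case could have inherited but did not.

    Genotypes are (locus G, locus H) pairs of two-letter strings,
    e.g. ('Aa', 'Bb'). Raises ValueError if the case genotype is not
    compatible with the parental genotypes.
    """
    joint = [
        (g, h)
        for g in locus_transmissions(father[0], mother[0])
        for h in locus_transmissions(father[1], mother[1])
    ]
    joint.remove(case)  # drop one copy: the transmission that produced the case
    return joint

# The example from the text: father Aa/Bb, mother aa/bb, case Aa/Bb.
sibs = pseudosiblings(("Aa", "Bb"), ("aa", "bb"), ("Aa", "Bb"))
counts = Counter(sibs)
```

For the example in the text, `counts` holds three copies of Aa/Bb and four copies each of Aa/bb, aa/Bb, and aa/bb.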
Case-only design.
In the case-only design, no controls are selected. Such a sample cannot be used to estimate genetic main effects, but it can be used to test and estimate G x E (14, 15) or G x G (13) interaction effects. The analysis is a standard χ² test of association between the genes, or, equivalently, it can be based on unconditional logistic regression with the likelihood form
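$$L(\beta) = \prod_{i=1}^{N} \frac{\exp[H_i(\beta_0 + \beta_{gh} G_i)]}{1 + \exp(\beta_0 + \beta_{gh} G_i)} \qquad (2)$$

where $G_i$ and $H_i$ code the genotypes of case i at the two loci and $\beta_0$ is an intercept.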
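A minimal sketch of the case-only analysis, assuming binary susceptibility coding at each locus: the Pearson χ² statistic for the 2 x 2 table of G by H among cases, together with the cross-product odds ratio as the estimate of Rgh. The counts are hypothetical, and the function name is illustrative.

```python
import math

def case_only_test(n11, n10, n01, n00):
    """Pearson chi-square test (1 df) of G-H association in a case series.

    n11: cases susceptible at both loci; n10: at G only;
    n01: at H only; n00: at neither.
    Returns (interaction odds ratio estimate, chi-square statistic, p-value).
    """
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    chi2 = n * (n11 * n00 - n10 * n01) ** 2 / (row1 * row0 * col1 * col0)
    odds_ratio = (n11 * n00) / (n10 * n01)
    # Upper tail of the chi-square distribution with 1 df, via the normal tail.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value

# Hypothetical genotype counts for 116 cases.
or_hat, chi2, p = case_only_test(29, 29, 15, 43)
```

The odds ratio is interpretable as an interaction effect only if the two genes are independent in the source population, a point taken up in the Discussion.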
Calculation of sample size
For each of the four designs, I will provide examples of the minimum number (N) of sampling units that will provide a given power for detecting a gene-gene interaction. Depending on the design, a sampling unit is defined as a case-control pair (design 1), a case-sibling pair (design 2), a case-parent trio (design 3), or simply a case (design 4). The null hypothesis (H0) is βgh = 0, i.e., that there is no G x H interaction on a multiplicative scale. In all models, I assume that the disease is rare enough that the test of the odds-ratio parameter in the case-control and case-sibling designs is equivalent to the test of the relative-risk parameter in the case-parent and case-only designs. I adopt an approach to sample-size determination that has been previously described (6, 27); it is summarized in the Appendix. In all calculations, I assume a significance level of 5% and a power of 80%, and I allow for a two-sided alternative hypothesis. For comparative purposes, I compute the ratio of N for the case-control design to N for each of the other three designs, which provides a measure of asymptotic relative efficiency per sampling unit of the latter to the former.
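The link between sample size and power can be sketched as follows, under the assumption that the interaction test has 1 df and that the usual normal approximation is adequate. Here `lam` stands for the expected likelihood ratio statistic contributed by one sampling unit (the quantity derived in the Appendix). The value 0.05 used below is purely hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def sample_size(lam, alpha=0.05, power=0.80):
    """Sampling units needed when each unit contributes noncentrality `lam`."""
    return (Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)) ** 2 / lam

def power_for(n, lam, alpha=0.05):
    """Approximate power of the two-sided, 1-df test with n sampling units."""
    return Z.cdf(math.sqrt(n * lam) - Z.inv_cdf(1 - alpha / 2))

lam = 0.05                          # hypothetical per-unit value, illustration only
n = sample_size(lam)                # units needed for 80% power at the 5% level
achieved = power_for(n, lam)        # recovers the requested power
```

The two functions are inverses of one another, so either sample size or power can be solved for once the per-unit quantity is known.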
Computer software
A colleague (John Morrison) and I have developed a user-friendly Windows-based software program called QUANTO for computing either sample size or power in studies of G x G or G x E interaction (28). Inputs to the program include the design (case-control, case-sibling, case-parent, case-only), true model parameters, and the significance level. Required sample size will be computed for a given power or vice versa. The program is available at no charge and may be downloaded from a University of Southern California website (http://hydra.usc.edu/gxe).
EXAMPLES
Suppose one wants to conduct a study that has as one of its aims a test of whether there is an interaction between the genes GSTM1 and GSTT1. For some specific values of model parameters, I will demonstrate sample sizes that would be required by each of the four designs to address this hypothesis. For simplicity, I assume that subjects will be selected from a single population. For both loci, it is the "null/null" genotype that is suspected of increasing asthma risk, indicating use of the recessive model at both loci. I assume that the prevalence of the null/null genotype is 40 percent for GSTM1 and 25 percent for GSTT1 (30). Letting g = GSTM1 and h = GSTT1, the frequencies of the corresponding null alleles in the population are qA = 0.63 (i.e., $\sqrt{0.40}$) and qB = 0.5, respectively. I also assume a pure interaction model, in which neither GSTM1 nor GSTT1 increases risk by itself (i.e., Rg = 1.0 and Rh = 1.0) but risk is increased in subjects with the null/null genotype at both loci (Rgh > 1.0).
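The allele frequencies quoted above follow from taking the square root of the recessive genotype prevalence, which assumes Hardy-Weinberg proportions at each locus. A short sketch (the function name is illustrative, and the joint prevalence additionally assumes the two loci are independent in the population):

```python
import math

def recessive_allele_frequency(genotype_prevalence):
    """Allele frequency q such that q**2 equals the homozygote prevalence."""
    return math.sqrt(genotype_prevalence)

q_a = recessive_allele_frequency(0.40)   # GSTM1 null allele
q_b = recessive_allele_frequency(0.25)   # GSTT1 null allele
both_null = 0.40 * 0.25                  # null/null at both loci, if loci are independent
```

Under these assumptions, about 10 percent of the population carries the null/null genotype at both loci, which is the subgroup whose risk is raised in the pure interaction model.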
For the four designs and a range of values for Rgh, table 1 shows the number of required sampling units. The required sample size in a case-control study exceeds 2,000 pairs when Rgh ≤ 1.5 but declines sharply as the interaction becomes stronger. The case-sibling design requires a sample approximately 20 percent larger than that of the case-control design. The case-parent and case-only designs, however, require substantially fewer sampling units than the other two designs. For instance, when Rgh = 3.0, the case-control and case-sibling designs would require 270 and 319 matched pairs, respectively, while the case-parent design would require 146 trios and the case-only design 116 cases.
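The relative efficiencies implied by the figures quoted for Rgh = 3.0 can be worked out directly. The sketch below also tallies the number of persons genotyped, assuming two persons per pair, three per trio, and one per case. That tally is an addition for illustration, since the efficiency measure used in this paper is per sampling unit.

```python
# Sampling units required when Rgh = 3.0, as quoted in the text.
required_units = {"case-control": 270, "case-sibling": 319,
                  "case-parent": 146, "case-only": 116}
# Persons genotyped per sampling unit (pair = 2, trio = 3, single case = 1).
persons_per_unit = {"case-control": 2, "case-sibling": 2,
                    "case-parent": 3, "case-only": 1}

def relative_efficiency(units, reference="case-control"):
    """N for the reference design divided by N for each other design."""
    ref = units[reference]
    return {design: ref / n for design, n in units.items() if design != reference}

efficiency = relative_efficiency(required_units)
genotyped = {d: n * persons_per_unit[d] for d, n in required_units.items()}
```

Values above 1 indicate a design that needs fewer sampling units than the matched case-control design.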
Table 2 shows the sample sizes required to detect an interaction effect of magnitude Rgh = 3.0 in each design, assuming a pure-interaction model and various combinations of susceptible-genotype prevalence and dominance model. When genetic susceptibility is rare (0.01) at both loci, the required number of sampling units is impractically large (more than 30,000) regardless of the design. Sample size requirements are substantially less (in the range of 130-400) if susceptibility is common (0.25) at both loci. For the case-control and case-only designs, sample size requirements do not depend on the dominance model, since there is no familial relationship among subjects. The case-sibling design requires more pairs than the case-control design in all models, with asymptotic relative efficiencies ranging from about 0.75 to 0.90. The case-parent and case-only designs are more efficient than the case-control design, with asymptotic relative efficiencies ranging from 1.7 to 2.2 in the former and from 2.5 to 2.7 in the latter. Sample size requirements in the case-sibling and case-parent designs are lower if at least one locus is recessive than if they are both dominant.
DISCUSSION
For characterizing genetic and interactive effects, designs alternative to those considered in this paper have been suggested. These include hybrid designs such as those combining the case-parent and case-sibling designs (26, 34), the case-parent and case-control designs (35), and the case-sibling and case-control designs (36). Other investigators have developed methods for the analysis of whole pedigrees, allowing for collection of genetic data on only a subset of family members (37, 38). In the context of the case-control design, Andrieu et al. (39) have proposed counter-matching of cases to controls as a technique for using available data at the time of sampling (e.g., family history of disease) to enrich the sample for informative matched sets. Additional work is required to compare the sample size requirements of these alternative designs with the designs considered in this paper.
The investigator planning a new study is likely to have parameter choices that differ from the specific values used in this paper. For this reason, my colleague and I distribute software that investigators may use to compute power or required sample size for their particular design parameters. For unmatched case-control studies of G x E interaction, Garcia-Closas and Lubin (1) also distribute a software program for computing sample size or power. Their program could be used to obtain sample size for an unmatched case-control study of G x G interaction, with their "E" being replaced by the second gene and exposure prevalence being replaced by the corresponding prevalence of the susceptibility genotype. However, their program is not directly applicable to a matched case-control study, and it will not provide calculations for the case-sibling, case-parent, or case-only design.
In the sample size comparisons, I assumed that the loci G and H were independently transmitted from parents to offspring (unlinked) and that they were independently distributed in the population (no disequilibrium). The linkage assumption will be violated if the two genes are in close physical proximity to one another. The disequilibrium assumption can be violated for this same reason, or because of other mechanisms that cause correlation among alleles in the population (e.g., admixture or selective forces that favor or discourage specific alleles at both loci). For each design, I describe below how the validity of the test of G x G interaction is affected by deviations from these assumptions.
Case-control and case-sibling designs.
The case-control and case-sibling designs are valid in the presence of linkage and/or disequilibrium.
Case-parent design.
The case-parent design is valid when there is disequilibrium between G and H but invalid if there is linkage. The problem if there is linkage is that the 16 possible joint genotypes (that of the case plus those of the 15 pseudosiblings) are not equally likely under the null hypothesis of no genetic effects; rather, the distribution of genotypes depends on the recombination fraction (θ) between G and H. If θ were known, which may be possible given that G and H have known chromosomal locations, a valid test could be recovered by including it as an offset to the pseudosibling genotype distribution. Determining the sensitivity of a G x G interaction test to misspecification of θ requires further investigation.
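To illustrate why linkage disturbs the equal weighting of pseudosibling genotypes, the sketch below gives the gamete distribution for a doubly heterozygous parent as a function of the recombination fraction. The phase (AB on one chromosome, ab on the other) and the function name are assumptions made for the example.

```python
def gamete_probabilities(theta):
    """Gamete distribution for a doubly heterozygous parent with phase AB/ab.

    theta is the recombination fraction between the two loci
    (0.5 for unlinked loci, smaller values for linked loci).
    """
    nonrecombinant = (1.0 - theta) / 2.0
    recombinant = theta / 2.0
    return {"AB": nonrecombinant, "ab": nonrecombinant,
            "Ab": recombinant, "aB": recombinant}

unlinked = gamete_probabilities(0.5)   # all four gametes equally likely
linked = gamete_probabilities(0.1)     # nonrecombinant gametes dominate
```

With unlinked loci every gamete has probability 0.25, which is what justifies treating the 16 joint transmissions as equally likely. With linkage, the joint transmissions would need to be weighted by these probabilities instead.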
Case-only design.
The case-only design is valid if there is linkage between G and H but invalid if there is disequilibrium. The problem with the latter is that a population-level association between G and H will also be reflected in a case series, in the absence of any interaction.
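A small numerical sketch shows how population-level association between the genes carries over to a case series. Under a multiplicative relative-risk model, the joint genotype distribution among cases is proportional to the population frequency times the relative risk. The population frequencies below are hypothetical, and the function name is illustrative.

```python
def case_only_odds_ratio(p11, p10, p01, p00, r_g=1.0, r_h=1.0, r_gh=1.0):
    """G-H odds ratio among cases.

    p11, p10, p01, p00: population frequencies of being susceptible at
    both loci, at G only, at H only, and at neither locus.
    """
    w11 = p11 * r_g * r_h * r_gh
    w10 = p10 * r_g
    w01 = p01 * r_h
    w00 = p00
    return (w11 * w00) / (w10 * w01)

# Independent loci (prevalences 0.40 and 0.25): the case-only OR equals Rgh.
no_interaction = case_only_odds_ratio(0.10, 0.30, 0.15, 0.45)
true_interaction = case_only_odds_ratio(0.10, 0.30, 0.15, 0.45, r_gh=3.0)
# Associated loci and no interaction: a spurious "interaction" appears.
spurious = case_only_odds_ratio(0.15, 0.25, 0.10, 0.50)
```

In the last call the population odds ratio between the genes is 3, and the case series reproduces it even though Rgh = 1.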
In practice, the linkage assumption is easily assessed, since the investigator will know (approximately) the chromosomal locations of G and H. If G and H are on different chromosomes, for example, the two loci are unlinked with certainty. However, it will be difficult to evaluate the disequilibrium assumption unless one has genotypic data for G and H on a random sample of persons drawn from the same population as the people under study. Finally, it is possible that one will not observe G and H directly but rather markers M1 and M2 that are in linkage disequilibrium with G and H, respectively. The same conditions for test validity in each design apply to analysis of M1 x M2 interaction, although one will suffer a loss in power relative to a study of G and H directly. For a test of association between disease and G using M1 (or H using M2), it is known that one must modify the analytical approach in the following two situations: 1) when there is m:n matching (with m > 1 and/or n > 1) in the case-sibling design (40) and 2) when there are two or more affected offspring in the case-parent design (41). In these two situations, similar corrections will be required for valid analysis of M1 x M2 interaction.
In the example calculations, I assumed that all subjects were obtained from a single population. However, there may be variations across subgroups of the population (e.g., ethnic groups) in overall disease prevalence and in susceptible-allele prevalences (qA, qB). In fact, this may be the reason one chooses to use a matched design. The sample size and power calculation approach described above may be modified to account for population stratification, by including in equation A1 (see Appendix) the stratum-specific parameters and the frequency of each stratum in the population (6). Although the absolute sample sizes depend on these additional parameters, the relative efficiencies among designs are similar (calculations not shown). However, one should be prepared to assume that the relative risks Rg, Rh, and Rgh are the same in all population subgroups. If this cannot be assumed, one should conduct separate sampling and estimation within each stratum, since there is no single set of parameters to estimate.
Previous papers have focused on design comparisons for case-control studies of genetic main effects (9, 10) and G x E interactions (5, 6). For testing of genetic main effects, the case-parent design is typically more efficient than the matched case-control design, and both are more efficient than the case-sibling design. For testing of G x E interaction, the case-parent design is also more efficient than the case-control design, but the case-sibling design can be the most efficient provided that there is not a high degree of sharing of the environmental exposure between siblings (6). These findings, in addition to those presented in this paper, indicate that the case-parent design is a good choice for studies of genetic main and interaction effects. The case-only design might best be viewed as a screening tool with which to identify promising interactions, with follow-up by one of the other three designs to rule out the possibility of population association between genes.
APPENDIX
![]() | (A1) |
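For each design, I maximize the expected log-likelihood in equation A1 twice, once letting βgh be a free parameter and once fixing βgh = 0 (i.e., under H0). I let $\hat{\ell}_1$ and $\hat{\ell}_0$ denote the maximum values of the corresponding log-likelihoods. In both maximizations, βg and βh are free parameters. The quantity $\Lambda = 2(\hat{\ell}_1 - \hat{\ell}_0)$ is the expected likelihood ratio test statistic for a single sampling unit, and $N\Lambda$ is the noncentrality parameter of the χ² distribution under the alternative hypothesis (27, 42). Since I assume that each gene can be coded by a single covariate, the test of βgh has 1 df, and sample size may be computed as

$$N = \frac{(z_{1-\alpha/2} + z_{\pi})^2}{\Lambda} \qquad (A2)$$

where $z_{1-\alpha/2}$ and $z_{\pi}$ are the standard normal quantiles corresponding to a two-sided significance level α and a power π. For a significance level of 5% and a power of 80%, the numerator is approximately 7.85.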
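For the case-only design, the two maximizations have closed forms, which makes it a convenient worked example. The sketch below makes several assumptions: binary susceptibility coding, independent loci in the population, a multiplicative relative-risk model, and the normal approximation for a 1-df test (required noncentrality equal to the squared sum of the two standard normal quantiles). The function name and arguments are illustrative and do not represent the QUANTO implementation.

```python
import math
from statistics import NormalDist

def case_only_sample_size(p_g, p_h, r_gh, r_g=1.0, r_h=1.0, alpha=0.05, power=0.80):
    """Approximate number of cases needed to detect r_gh in a case-only design.

    p_g, p_h: population prevalences of the susceptible genotype at each locus.
    """
    # Joint distribution of (G, H) among cases: population frequency x relative risk.
    weights = {
        (1, 1): p_g * p_h * r_g * r_h * r_gh,
        (1, 0): p_g * (1 - p_h) * r_g,
        (0, 1): (1 - p_g) * p_h * r_h,
        (0, 0): (1 - p_g) * (1 - p_h),
    }
    total = sum(weights.values())
    cell = {k: w / total for k, w in weights.items()}
    marg_g = {g: cell[(g, 1)] + cell[(g, 0)] for g in (0, 1)}
    marg_h = {h: cell[(1, h)] + cell[(0, h)] for h in (0, 1)}
    # Expected likelihood ratio statistic per case: full model vs. independence.
    lam = 2.0 * sum(p * math.log(p / (marg_g[g] * marg_h[h]))
                    for (g, h), p in cell.items())
    z = NormalDist()
    noncentrality = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    return noncentrality / lam

# GSTM1/GSTT1 example: prevalences 0.40 and 0.25, pure interaction, Rgh = 3.0.
n_cases = case_only_sample_size(0.40, 0.25, 3.0)
```

For the GSTM1/GSTT1 example this gives a value of about 116, in line with the 116 cases quoted in the Examples section. The other three designs require numerical maximization of the expected log-likelihood and are not covered by this sketch.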