From the Genetic Epidemiology Division, Cancer Research UK Clinical Centre at Leeds, Leeds, United Kingdom.
Received for publication January 14, 2003; accepted for publication May 23, 2003.
ABSTRACT
case-control studies; epidemiologic methods; interaction; research design; statistics
INTRODUCTION
The power of a case-control study to detect interactions is low compared with the power to detect main effects (1). Consequently, many different designs have been proposed as strategies for improving power. In addition to matching strategies, because genetic risk factors are often under study, family designs have also been considered for studies of gene-gene or gene-environment interactions (6–8). Unlike family studies of risk factor main effects (9), these designs have been found to have the potential to improve power to detect interactions in some situations. A design that samples only cases has also been proposed (10). The improvement in power for this design is large. However, if the risk factors under study are not independent in the population from which the cases are sampled, the false-positive rate with this design can become greatly inflated (11). Therefore, matching strategies are one of several approaches to improving the power to detect gene-environment interactions.
Sturmer and Brenner (3) recently proposed the use of flexible matching to address this problem. By increasing the prevalence of the environmental exposure in controls above the prevalence in cases, the authors showed that this method could offer a substantial improvement in statistical power. There are many scenarios in which environmental exposure, or at least a proxy thereof, may be measured in a relatively large set of potential controls. For example, in a case-control study of the interaction between genetic risk factors and smoking in relation to bladder cancer, potential controls could be screened by means of a simple question asking whether or not they had ever been a regular smoker. When interest is in interaction rather than the main effect of smoking, controls could then be selected for genotyping and detailed exposure evaluation by sampling according to their response to this question. Similarly, in a case-control study of sun exposure and the genes involved in melanoma risk, potential controls who had lived for some time in a hot country might be oversampled to improve power to test for gene-environment interactions. Using frequency matching strategies, the researchers would sample controls to have the same exposure frequency as the cases, whereas with a flexible matching strategy they would seek to sample exposure at the frequency among controls that maximized the power to test for interactions.
Sturmer and Brenner's simulations showed that the optimal degree of matching for exposure could be found in different situations. However, they concluded, "Given the strong dependence of the power and efficiency gains by matching on the multiple parameters, general recommendations as to the best degree of matching in all settings are difficult, if not impossible" (3, p. 599).
In this paper, we use a large-sample approximation of the variance of the interaction odds ratio to show that the exposure frequency among flexibly matched controls that minimizes the variance of the interaction odds ratio, and thus maximizes the power for this design, can be estimated.
METHODS
The variance of the log of the interaction odds ratio for departure from multiplicative joint effects can be estimated as follows for a population-based case-control study (12):
By a similar argument, the variance of the interaction effect from a study using flexible matching can be estimated by
Here, the contribution to the variance of the log of the interaction odds ratio due to the population-based controls in equation 1,
is replaced by the contribution from the flexibly matched controls,
in equation 2. Therefore, the degree of matching that optimizes the efficiency of the flexible matching design will be the degree that minimizes this variance. Because the flexible matching technique samples population-based cases, the variance in the interaction term that is due to the cases is unaffected by the matching strategy. Thus, the optimum strategy can be determined by finding the frequency of the environmental factor among controls that minimizes
or, equivalently, that minimizes $1/p_{00m} + 1/p_{01m} + 1/p_{10m} + 1/p_{11m}$.
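As a sketch of the structure being described, suppose there are $n_1$ cases with exposure/genotype cell proportions $q_{ij}$ and $n_0$ controls with cell proportions $p_{ij}$ (population-based) or $p_{ijm}$ (flexibly matched). The symbols $n_1$, $n_0$, and $q_{ij}$ are assumptions introduced here for illustration and may differ from the notation of equations 1 and 2. Under these assumptions, the usual large-sample variance of a log odds ratio ratio (the sum of reciprocal expected cell counts) takes the following form, in which only the control term changes between the two designs:

```latex
% Population-based controls (assumed notation: n_1 cases, n_0 controls)
\operatorname{Var}\left(\ln \widehat{OR}_{\mathrm{int}}\right)
  \approx \frac{1}{n_1}\sum_{i,j\in\{0,1\}} \frac{1}{q_{ij}}
        + \frac{1}{n_0}\sum_{i,j\in\{0,1\}} \frac{1}{p_{ij}}

% Flexibly matched controls: the case term is unchanged,
% and p_{ij} is replaced by p_{ijm} in the control term
\operatorname{Var}_m\left(\ln \widehat{OR}_{\mathrm{int}}\right)
  \approx \frac{1}{n_1}\sum_{i,j\in\{0,1\}} \frac{1}{q_{ij}}
        + \frac{1}{n_0}\sum_{i,j\in\{0,1\}} \frac{1}{p_{ijm}}
```

Because $n_0$ is fixed by the design, minimizing the second variance over the matching frequency amounts to minimizing the sum of the reciprocals of the $p_{ijm}$.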
Let $M_E$ be the frequency of the matching factor (exposure) among flexibly matched controls, and let $P_G$ be the frequency of the genotype in the source population. When the two risk factors are independent in the source population, this term can be written as
$1/[(1 - M_E)(1 - P_G)] + 1/[(1 - M_E)P_G] + 1/[M_E(1 - P_G)] + 1/[M_E P_G]$,
which simplifies to $[P_G(1 - P_G)M_E(1 - M_E)]^{-1}$. Finding the value for $M_E$ that minimizes this variance is equivalent to finding a maximum for $P_G(1 - P_G)M_E(1 - M_E)$, which can be done by differentiating with respect to $M_E$ and setting the derivative equal to 0. Unsurprisingly, the variance is minimized when $M_E = 0.5$.
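The differentiation step for the independent case can be written out as a short worked derivation. Only $M_E$ and $P_G$ from the text are used; the function name $f$ is shorthand introduced here.

```latex
f(M_E) = P_G(1 - P_G)\,M_E(1 - M_E)

\frac{df}{dM_E} = P_G(1 - P_G)\,(1 - 2M_E) = 0
  \quad\Longrightarrow\quad M_E = \tfrac{1}{2}

\frac{d^2 f}{dM_E^2} = -2\,P_G(1 - P_G) < 0
  \quad \text{for } 0 < P_G < 1
```

The negative second derivative confirms that $M_E = 0.5$ is a maximum of $f$, and hence a minimum of the variance term, whatever the genotype frequency.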
When the two risk factors are not independent, the most efficient frequency for the exposure sampling depends on both the odds ratio for the association between the genotype and the exposure in the source population (see the Appendix in Sturmer and Brenner (3)) and the frequencies of the two risk factors. The optimum frequency at which to sample exposure among controls ($M_E$) can be estimated using the following equation, where $P_E$ is the population exposure frequency and $p_{00}$, $p_{01}$, $p_{10}$, and $p_{11}$ are, as before, the proportions of the population/unmatched controls with the different exposure/genotype combinations. Further details are given in the Appendix.
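The following is a minimal Python sketch of this calculation, not the authors' code. It assumes the first index of $p_{ij}$ denotes exposure and the second denotes genotype, as the Appendix formulas imply, and it uses one closed form obtained by solving the Appendix condition, $M_E = a/(a + b)$ with $a = \sqrt{p_{00}p_{01}}/(1 - P_E)$ and $b = \sqrt{p_{10}p_{11}}/P_E$; this may be arranged differently from equation 3. The function names are hypothetical.

```python
from math import sqrt


def control_variance_term(me, p00, p01, p10, p11):
    """Control contribution to the variance of the log interaction odds ratio
    (per control) when exposure is sampled at frequency `me`.

    Assumption: first index = exposure (0/1), second index = genotype (0/1).
    """
    pe = p10 + p11  # population exposure frequency P_E
    return ((1 - pe) / (p00 * (1 - me)) + (1 - pe) / (p01 * (1 - me))
            + pe / (p10 * me) + pe / (p11 * me))


def optimal_exposure_frequency(p00, p01, p10, p11):
    """Closed-form M_E that minimizes control_variance_term (a sketch)."""
    if abs(p00 + p01 + p10 + p11 - 1.0) > 1e-9:
        raise ValueError("cell proportions must sum to 1")
    if min(p00, p01, p10, p11) <= 0:
        raise ValueError("all cell proportions must be positive")
    pe = p10 + p11
    a = sqrt(p00 * p01) / (1 - pe)
    b = sqrt(p10 * p11) / pe
    return a / (a + b)


if __name__ == "__main__":
    # Independent risk factors (P_E = 0.2, P_G = 0.1): optimum should be 0.5.
    print(optimal_exposure_frequency(0.72, 0.08, 0.18, 0.02))
    # Illustrative (made-up) proportions with genotype and exposure associated.
    print(optimal_exposure_frequency(0.70, 0.10, 0.15, 0.05))
```

Comparing `control_variance_term` at the returned frequency with its value at 0.5 or at $P_E$ gives a quick check of how much a given sampling scheme gains over balanced or frequency-matched sampling.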
The sample size required to detect interactions is calculated using the method of Self et al. (13–15). Briefly, the likelihood ratio test statistic for the interaction asymptotically follows a noncentral chi-squared distribution under the alternative hypothesis. A large exemplary data set with the risk factor frequencies among cases and controls expected under the alternative hypothesis is analyzed using standard statistical software. The likelihood ratio test statistic obtained from this analysis is the noncentrality parameter for this distribution at the exemplary sample size. The required sample size is inversely proportional to this noncentrality parameter, which allows the method to be applied to a wide range of designs.
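The scaling step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure used in the paper: it assumes a 1-degree-of-freedom interaction test (binary exposure by binary genotype), a two-sided significance level of 0.05, and a target power of 80 percent, none of which are specified in the text above. The function names and the example numbers are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_1df(noncentrality, alpha=0.05):
    """Power of a 1-df chi-squared test with the given noncentrality.

    A noncentral chi-squared variable with 1 df is (Z + sqrt(ncp))**2,
    so its tail probability can be written with the normal CDF.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    root = noncentrality ** 0.5
    return _STD_NORMAL.cdf(root - z_crit) + _STD_NORMAL.cdf(-root - z_crit)


def required_noncentrality(power=0.8, alpha=0.05):
    """Smallest noncentrality giving the target power, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if power_1df(mid, alpha) < power:
            lo = mid
        else:
            hi = mid
    return hi


def required_sample_size(exemplary_n, exemplary_lrt, power=0.8, alpha=0.05):
    """Scale the exemplary data set to the size that reaches the target power.

    exemplary_lrt is the likelihood ratio statistic from the exemplary data
    set of size exemplary_n; noncentrality grows linearly with sample size.
    """
    noncentrality_per_subject = exemplary_lrt / exemplary_n
    return required_noncentrality(power, alpha) / noncentrality_per_subject


if __name__ == "__main__":
    # Made-up example: an exemplary data set of 10,000 subjects gives LRT = 20.
    print(round(required_sample_size(10_000, 20.0)))
```

Because the noncentrality parameter is proportional to sample size, halving the likelihood ratio statistic obtained from the same exemplary data set doubles the required sample size.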
RESULTS
DISCUSSION
Strategies similar to flexible matching for interactions have been discussed previously. Cain and Breslow (16) discussed a strategy similar to the one detailed above for improving power to detect interactions and main effects. They considered a situation where exposure information on cases and controls was available before sampling of the particular controls for which more detailed information would be collected (in this case, genotyping). They advocated a strategy in which controls are sampled with balanced numbers from each exposure stratum. Cain and Breslow found that the balanced design is always much more powerful than the unstratified design for detecting interactions. Indeed, the only time they found the strategy less efficient was when there is a strong negative correlation between the variables that are measured in the first and second stages; this is also reflected here in the case where the optimum sampling frequency for the exposure is potentially greater or less than 50 percent when the two risk factors are strongly associated.
Breslow and Cain (17) similarly recognized for the two-stage design that unbiased estimates of the interaction parameter can be obtained from an unmatched analysis even though the exposure is used as a matching factor, in the same way as for the flexible matching design. However, estimates of the population exposure frequency can also be used to additionally allow estimation of the exposure main effects. This is an aspect that could also be applied to the flexible matching design if, at the control sampling stage of the study, an estimate of the population exposure frequency could be made, or if the controls were being sampled from a preexisting cohort for which exposure information was available. At the analysis stage, the log of the exposure group frequency (i.e., exposed or unexposed) is used as an offset in the logistic regression model, to retrieve unbiased estimates of exposure main effects. One advantage of this result is that the offset has no effect on the power of the design to detect interactions (17). Thus, if this information is not available, this does not detract from the strength of the design for detecting interactions.
Understanding how the power of the flexible matching design can be optimized is helpful in understanding comparisons between different designs that have been proposed as strategies for detecting interactions. Table 2 shows that, even when the exposure frequency among controls is chosen to minimize the variance, the decrease in the required sample size is still small in comparison with the case-only design, where there is no component of variance in the interaction estimate due to the controls. The inappropriateness of the case-only design in the presence of risk factor association and concerns about the false-positive rate when the independence assumption is violated (11, 18) mean that alternative strategies are still attractive and should be explored.
By considering the large-sample approximation to the variance of the interaction parameter for the flexible matching design, one can see why using family controls has the potential to improve the power to detect interactions (7, 8). When risk factors are rare (and this is the situation in which most improvement in power from family designs has been observed), the exposure frequencies among controls are raised above the population levels toward the optimal frequency of 50 percent because of within-family correlation of genetic, and to a lesser extent environmental, risk factors. Similar arguments can be considered for other designs, such as the design that compares case subjects who have two primary cancers with cases who have only one primary cancer (19). This sampling strategy will increase the prevalence of rare risk factors among all study participants, again decreasing the variance of the interaction parameter estimate and increasing power.
Matching strategies such as flexible matching are often the most rational approach to choosing an efficient design for detecting interactions, if the assumption of independence of genotype and exposure that is required for the case-only design proves untenable (11). The strategies described here can be used to find the most informative risk factor frequencies. If the population exposure frequency is known, the theory from two-stage designs can be incorporated at the analysis stage to estimate the main effects of the matching variables. This further increases the attractiveness of these designs.
APPENDIX
The $p_{ij}$ are calculated following the method of Sturmer and Brenner (3). They depend on the genotype and exposure frequencies and the magnitude of the association between the two factors. Alternatively, if an unmatched control group were available, then the values of the proportions $p_{ij}$ could be observed directly. Therefore, the proportions of persons with each genotype/exposure combination when controls are selected under a flexible matching scheme, $p_{ijm}$, are calculated as follows, such that the frequency of exposure among controls is $M_E$.
$p_{00m} = p_{00}(1 - M_E)/(1 - P_E)$.
$p_{01m} = p_{01}(1 - M_E)/(1 - P_E)$.
$p_{10m} = p_{10}M_E/P_E$.
$p_{11m} = p_{11}M_E/P_E$.
Therefore, the variance of the log of the interaction odds ratio due to the flexibly matched controls can be estimated by
$(1 - P_E)/[p_{00}(1 - M_E)] + (1 - P_E)/[p_{01}(1 - M_E)] + P_E/[p_{10}M_E] + P_E/[p_{11}M_E]$.
By differentiating this function with respect to $M_E$ and finding the value of $M_E$ at which the derivative is zero, one can find the value of $M_E$ that minimizes this variance. After some simple algebra, the derivative can be expressed, up to a nonzero multiplicative factor, as
$p_{01u}p_{00u}(1 - M_E)^2/(1 - P_E)^2 - p_{10u}p_{11u}M_E^2/P_E^2$.
Setting this to zero, the equation can be solved for $M_E$ by factorization, since the expression is the difference of two squares, providing the solution in equation 3.
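The factorization step can be written out as follows. The shorthand $a$, $b$, and $V$ is introduced here and is not notation from the paper; $p_{ij}$ denotes the unmatched (population) proportions, with $p_{00} + p_{01} = 1 - P_E$ and $p_{10} + p_{11} = P_E$. The closed form at the end is what this derivation implies and may be arranged differently from equation 3.

```latex
% Control contribution to the variance, using p_{00}+p_{01} = 1-P_E
% and p_{10}+p_{11} = P_E:
V(M_E) = \frac{(1 - P_E)^2}{p_{00}p_{01}}\cdot\frac{1}{1 - M_E}
       + \frac{P_E^2}{p_{10}p_{11}}\cdot\frac{1}{M_E}

% Stationarity condition dV/dM_E = 0, rearranged:
\frac{p_{00}p_{01}}{(1 - P_E)^2}(1 - M_E)^2
  - \frac{p_{10}p_{11}}{P_E^2}M_E^2 = 0

% With a = \sqrt{p_{00}p_{01}}/(1 - P_E) and b = \sqrt{p_{10}p_{11}}/P_E:
\bigl[a(1 - M_E) - bM_E\bigr]\bigl[a(1 - M_E) + bM_E\bigr] = 0

% The second factor is positive for 0 < M_E < 1, so
M_E = \frac{a}{a + b}
    = \frac{P_E\sqrt{p_{00}p_{01}}}
           {P_E\sqrt{p_{00}p_{01}} + (1 - P_E)\sqrt{p_{10}p_{11}}}
```

As a check, under independence $p_{00}p_{01} = (1 - P_E)^2P_G(1 - P_G)$ and $p_{10}p_{11} = P_E^2P_G(1 - P_G)$, so $a = b$ and the expression reduces to $M_E = 0.5$, in agreement with the independent case in the Methods.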