Flexible Matching in Case-Control Studies of Gene-Environment Interactions

Catherine L. Saunders and Jennifer H. Barrett 

From the Genetic Epidemiology Division, Cancer Research UK Clinical Centre at Leeds, Leeds, United Kingdom.

Received for publication January 14, 2003; accepted for publication May 23, 2003.


    ABSTRACT
 
Because of the lack of power of case-control study designs to detect gene-environment interactions, flexible matching has recently been proposed as a method of improving efficiency. In this paper, the authors consider a large-sample approximation method that allows estimation of the most efficient matching strategy when genotype and exposure are either independent or associated. The authors provide tables of the sample sizes required to detect gene-environment interactions if this flexible matching strategy is followed, and they make brief comparisons with other study designs.

case-control studies; epidemiologic methods; interaction; research design; statistics


    INTRODUCTION
 
The potential of matching strategies to improve statistical power for detection of gene-environment interactions has been debated (1–4). Detecting the departure from multiplicative joint effects of two risk factors for disease (as in gene-environment interaction) is important in understanding how risk factors act together in complex diseases (5) and in identifying high-risk groups.

The power of a case-control study to detect interactions is low compared with the power to detect main effects (1). Many different designs have therefore been proposed as strategies for improving power. In addition to matching strategies, because genetic risk factors are often under study, family designs have also been considered for studies of gene-gene or gene-environment interactions (6–8). Unlike family studies of risk factor main effects (9), these designs have been found to have the potential to improve power to detect interactions in some situations. A design that samples only cases has also been proposed (10). The improvement in power for this design is large. However, if the risk factors under study are not independent in the population from which the cases are sampled, the false-positive rate with this design can become greatly inflated (11). Matching strategies are thus one of several approaches to improving the power to detect gene-environment interactions.

Sturmer and Brenner (3) recently proposed the use of flexible matching to address this problem. They showed that increasing the prevalence of the environmental exposure in controls above the prevalence in cases could offer a substantial improvement in statistical power. There are many scenarios in which environmental exposure, or at least a proxy thereof, may be measured in a relatively large set of potential controls. For example, in a case-control study of the interaction between genetic risk factors and smoking in relation to bladder cancer, potential controls could be screened by means of a simple question asking whether or not they had ever been a regular smoker. When interest is in interaction rather than the main effect of smoking, controls could then be selected for genotyping and detailed exposure evaluation by sampling according to their response to this question. Similarly, in a case-control study of sun exposure and the genes involved in melanoma risk, potential controls who had lived for some time in a hot country might be oversampled to improve power to test for gene-environment interactions. With frequency matching, researchers would sample controls to have the same exposure frequency as the cases, whereas with flexible matching they would sample controls at the exposure frequency that maximizes the power to test for interactions.

Sturmer and Brenner’s simulations showed that the optimal degree of matching for exposure could be found in different situations. However, they concluded, "Given the strong dependence of the power and efficiency gains by matching on the multiple parameters, general recommendations as to the best degree of matching in all settings are difficult, if not impossible" (3, p. 599).

In this paper, we use a large-sample approximation of the variance of the interaction odds ratio to show that the exposure frequency among flexibly matched controls that minimizes the variance of the interaction odds ratio, and thus maximizes the power for this design, can be estimated.


    METHODS
 
Using the notation of Sturmer and Brenner (3), let $p_{ij}$, $p_{ijc}$, and $p_{ijm}$ be the proportions of persons with level $i$ of the environmental exposure (the matching factor; $i = 1$ if the exposure is present and $i = 0$ if it is absent) and level $j$ of genetic susceptibility ($j = 1$ if the susceptibility is present and $j = 0$ if it is absent) in the population, in cases, and in matched controls, respectively. In the same way, let $n_{ij}$, $n_{ijc}$, and $n_{ijm}$ be the numbers of persons in each group in a study.

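The variance of the log of the interaction odds ratio for departure from multiplicative joint effects can be estimated as follows for a population-based case-control study (12):

$$\operatorname{Var}[\ln(OR_{int})] \approx \frac{1}{n_{00c}} + \frac{1}{n_{01c}} + \frac{1}{n_{10c}} + \frac{1}{n_{11c}} + \frac{1}{n_{00}} + \frac{1}{n_{01}} + \frac{1}{n_{10}} + \frac{1}{n_{11}}. \qquad (1)$$

By a similar argument, the variance of the interaction effect from a study using flexible matching can be estimated by

$$\operatorname{Var}[\ln(OR_{int})] \approx \frac{1}{n_{00c}} + \frac{1}{n_{01c}} + \frac{1}{n_{10c}} + \frac{1}{n_{11c}} + \frac{1}{n_{00m}} + \frac{1}{n_{01m}} + \frac{1}{n_{10m}} + \frac{1}{n_{11m}}. \qquad (2)$$

Here, the contribution to the variance of the log of the interaction odds ratio due to the population-based controls in equation 1,

$$\frac{1}{n_{00}} + \frac{1}{n_{01}} + \frac{1}{n_{10}} + \frac{1}{n_{11}},$$

is replaced by the contribution from the flexibly matched controls,

$$\frac{1}{n_{00m}} + \frac{1}{n_{01m}} + \frac{1}{n_{10m}} + \frac{1}{n_{11m}},$$

in equation 2. Therefore, the degree of matching that optimizes the efficiency of the flexible matching design will be the degree that minimizes this variance. Because the flexible matching technique samples population-based cases, the variance in the interaction term that is due to the cases is unaffected by the matching strategy. Thus, the optimum strategy can be determined by finding the frequency of the environmental factor among controls that minimizes

$$\frac{1}{n_{00m}} + \frac{1}{n_{01m}} + \frac{1}{n_{10m}} + \frac{1}{n_{11m}}$$

or, equivalently, that minimizes $1/p_{00m} + 1/p_{01m} + 1/p_{10m} + 1/p_{11m}$.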
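As an illustration of the variance expressions in equations 1 and 2, the following Python sketch computes the log interaction odds ratio and its large-sample variance from cell counts. The function names and the dictionary layout keyed by (exposure, genotype) are assumptions made for this example, and the counts are invented. The point estimate is written as the exposure-genotype cross-product ratio in cases divided by that in controls, which is the usual form for departure from multiplicative joint effects.

```python
from math import log


def log_interaction_or(cases, controls):
    """Log interaction odds ratio from two tables of counts.

    Each table maps (exposure i, genotype j) -> count, with i, j in {0, 1}.
    """
    def log_cross_product(table):
        return log(table[(1, 1)] * table[(0, 0)]
                   / (table[(1, 0)] * table[(0, 1)]))

    return log_cross_product(cases) - log_cross_product(controls)


def log_interaction_or_variance(cases, controls):
    """Large-sample variance: sum of reciprocal cell counts over all 8 cells."""
    return (sum(1.0 / n for n in cases.values())
            + sum(1.0 / n for n in controls.values()))


# Hypothetical counts, for illustration only.
cases = {(0, 0): 100, (0, 1): 50, (1, 0): 50, (1, 1): 100}
controls = {(0, 0): 100, (0, 1): 50, (1, 0): 50, (1, 1): 25}

estimate = log_interaction_or(cases, controls)
variance = log_interaction_or_variance(cases, controls)
```

Only the control terms change between a population-based and a flexibly matched design, so the same function applies to either by passing the appropriate control counts.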

Let $M_E$ be the frequency of the matching factor (exposure) among flexibly matched controls, and let $P_G$ be the frequency of the genotype in the source population. When the two risk factors are independent in the source population, this term can be written as

$$\frac{1}{(1 - M_E)(1 - P_G)} + \frac{1}{(1 - M_E)P_G} + \frac{1}{M_E(1 - P_G)} + \frac{1}{M_E P_G},$$

which simplifies to $[P_G(1 - P_G)M_E(1 - M_E)]^{-1}$. Finding the value of $M_E$ that minimizes this variance is equivalent to maximizing $P_G(1 - P_G)M_E(1 - M_E)$, which can be done by differentiating with respect to $M_E$ and setting the derivative to zero. Unsurprisingly, the variance is minimized when $M_E = 0.5$.
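The independence result can be checked numerically. The sketch below, with hypothetical function names, evaluates the sum of reciprocal control proportions on a grid of exposure sampling frequencies and picks the minimizer; under the independence assumption it should land on 0.5 whatever genotype frequency is supplied.

```python
def control_variance_term(m_e, p_g):
    """Sum of 1/p_ijm when exposure and genotype are independent.

    m_e: exposure frequency among flexibly matched controls
    p_g: genotype frequency in the source population
    """
    cells = [
        (1 - m_e) * (1 - p_g),  # unexposed, non-susceptible
        (1 - m_e) * p_g,        # unexposed, susceptible
        m_e * (1 - p_g),        # exposed, non-susceptible
        m_e * p_g,              # exposed, susceptible
    ]
    return sum(1.0 / c for c in cells)


def best_exposure_frequency(p_g, steps=999):
    """Grid search over m_e = 1/1000, ..., 999/1000 (grid size is arbitrary)."""
    grid = [k / (steps + 1) for k in range(1, steps + 1)]
    return min(grid, key=lambda m_e: control_variance_term(m_e, p_g))


for p_g in (0.05, 0.1, 0.3, 0.5):
    print(p_g, best_exposure_frequency(p_g))
```

A grid search is used here only to keep the check independent of the algebra; it is not needed in practice because the minimum is known in closed form.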

When the two risk factors are not independent, the most efficient frequency for the exposure sampling depends on both the odds ratio for the association between the genotype and the exposure in the source population (see the Appendix in Sturmer and Brenner (3)) and the frequencies of the two risk factors. The optimum frequency at which to sample exposure among controls ($M_E$) can be estimated using the following equation:

$$M_E = \frac{P_E\sqrt{p_{00}p_{01}}}{P_E\sqrt{p_{00}p_{01}} + (1 - P_E)\sqrt{p_{10}p_{11}}}, \qquad (3)$$

where $P_E$ is the population exposure frequency and $p_{00}$, $p_{01}$, $p_{10}$, and $p_{11}$ are, as before, the proportions of the population/unmatched controls with the different exposure/genotype combinations. Further details are given in the Appendix.
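A minimal Python sketch of this calculation follows. The closed form used here is our own rearrangement of the zero-derivative condition described in the Appendix (it reduces to 0.5 under independence), so it should be read as an assumption rather than as the authors' printed formula; the function names and the example proportions are likewise hypothetical.

```python
from math import sqrt


def optimum_exposure_frequency(p00, p01, p10, p11):
    """Exposure sampling frequency among controls that minimizes the
    control contribution to the variance (assumed closed form).

    p_ij: population proportion with exposure i and genotype j.
    """
    p_e = p10 + p11                      # population exposure frequency
    a = p_e * sqrt(p00 * p01)
    b = (1.0 - p_e) * sqrt(p10 * p11)
    return a / (a + b)


def matched_control_variance_term(m_e, p00, p01, p10, p11):
    """Sum of 1/p_ijm, using the Appendix definitions of p_ijm."""
    p_e = p10 + p11
    return ((1.0 - p_e) / (1.0 - m_e)) * (1.0 / p00 + 1.0 / p01) \
        + (p_e / m_e) * (1.0 / p10 + 1.0 / p11)


# Hypothetical population with a positive genotype-exposure association
# (exposure frequency 0.2, genotype frequency 0.14).
cells = dict(p00=0.72, p01=0.08, p10=0.14, p11=0.06)
m_opt = optimum_exposure_frequency(**cells)   # about 0.396
```

In this invented example the optimum falls below 0.5, in line with the observation that a strong positive association combined with a low genotype frequency calls for a smaller degree of matching.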

The sample size required to detect interactions is calculated using the method of Self et al. (13–15). Briefly, the likelihood ratio test statistic for the interaction asymptotically follows a noncentral chi-squared distribution under the alternative hypothesis. A large exemplary data set with the risk factor frequencies among cases and controls expected under the alternative hypothesis is analyzed using standard statistical software. The likelihood ratio test statistic from this analysis estimates the noncentrality parameter of this distribution for a study of that size. Because the noncentrality parameter is proportional to the sample size, the required sample size is obtained by simple rescaling, which allows the application of this method to a wide range of designs.
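The rescaling step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the interaction test is taken to have 1 degree of freedom (binary exposure and binary genotype), the likelihood ratio statistic is assumed to have been obtained separately from an exemplary data set of known size, and all function names are hypothetical. For 1 degree of freedom, the noncentral chi-squared tail probability can be written with the standard normal distribution, so only the standard library is needed.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_1df(noncentrality, alpha=0.05):
    """Power of a 1-df chi-squared test with the given noncentrality.

    Uses the fact that a 1-df noncentral chi-squared variable is the
    square of a normal variable with mean sqrt(noncentrality).
    """
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    root = noncentrality ** 0.5
    return (1.0 - _STD_NORMAL.cdf(z - root)) + _STD_NORMAL.cdf(-z - root)


def required_noncentrality(power=0.80, alpha=0.05):
    """Bisection for the noncentrality that gives the target power."""
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if power_1df(mid, alpha) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def required_sample_size(lr_statistic, exemplary_n, power=0.80, alpha=0.05):
    """Rescale the exemplary sample size so that the noncentrality
    parameter reaches the value needed for the target power."""
    return exemplary_n * required_noncentrality(power, alpha) / lr_statistic


lam = required_noncentrality()   # roughly 7.85 for 80% power, alpha = 0.05
```

For example, if an exemplary data set of 2,000 cases (with equal controls) gave a likelihood ratio statistic of about 15.7, `required_sample_size(15.7, 2000)` would return roughly 1,000 cases.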


    RESULTS
 
Table 1 shows the exposure frequency that maximizes the efficiency of a study to detect interactions over a range of control group genotype frequencies and magnitudes of risk factor associations (odds ratio for the association between genotype and exposure (ORGE)). We give the optimum exposure frequencies at particular matched control genotype frequencies rather than for specific population exposure and genotype frequencies. In practice, when exposure is sampled at a specific frequency among controls, this will also affect the frequency of the genotype, unless risk factors are independent; thus, the genotype frequency among flexible matching controls will not always be the same as that in the source population. (This is reflected in table 3, where the optimum matching frequencies for exposure are expressed with respect to the population genotype frequency and are slightly different from those in table 1.) When exposure/genotype combination frequencies are known among unmatched controls, applying equation 3 is the simplest way to calculate the optimum exposure frequency if risk factors are not independent. When the association between risk factors is small or the genotype frequency among controls is close to 0.5, a frequency for the matching factor of 0.5 remains the most efficient. In addition, the values shown in table 1 confirm the finding in Sturmer and Brenner (3) that when there is a strong positive association between risk factors and genotype frequency is low, the optimum degree of matching is smaller than when there is less association or no association.


 
TABLE 1. Optimum flexible matching exposure frequencies when risk factors are not independent
 
To consider the practical use of the flexible matching design, we calculated required sample sizes under this optimal matching strategy for a range of magnitudes of risk factor effects and frequencies. These complement the relative efficiencies presented by Sturmer and Brenner (3). Sample sizes needed (number of cases required, assuming equal numbers of cases and controls) for a statistical power of 80 percent and a two-sided significance level of 0.05 are presented in table 2. Situations in which exposure and genotype are independent are considered first; therefore, in the flexible matching design, exposure frequency among controls is simply sampled at 50 percent, and required sample sizes are provided for comparison with an unmatched population-based study and a case-only design. We consider a situation with a rare disease (population frequency = 0.1 percent) and genotype main effect (relative risk of disease among unexposed people with the susceptibility genotype compared with people exposed to neither risk factor) equaling 2. Required sample sizes are provided for a range of genotype and exposure frequencies and magnitudes of interaction and main effects. It can be seen from table 2 that the sample size requirements for the flexible matching design are always lower, and can be substantially lower, than those for the population-based case-control design, especially when the exposure is relatively rare (frequency 0.1). Although the sample size requirements for the case-only design are lower still, the flexible matching design does not require the assumption of independence of risk factors that makes the case-only design untenable in many situations.


 
TABLE 2. Numbers of cases required to detect interactions for the flexible matching, case-control, and case-only study designs*
 
Situations where exposure and genotype are not independent are also shown. Sample size requirements are not presented for the case-only design, because it would not be an appropriate strategy in these situations. The flexible matching strategy, however, still shows a substantial reduction in the required sample size in comparison with the population-based case-control design in all situations. Table 3 shows the optimal frequencies at which exposure is sampled for table 2. When genotype and exposure are independent, this frequency is 50 percent. Because changing the frequency at which exposure is sampled will also alter the genotype frequency among flexibly matched controls when genotype and exposure are not independent, both genotype and exposure population frequencies, as well as the magnitude of their association, affect the optimal matching frequency for exposure.


 
TABLE 3. Flexible matching exposure frequencies for table 2
 

    DISCUSSION
 
The relative efficiency under the four scenarios given in Sturmer and Brenner’s (3) table 2 can also be estimated by the ratio of the variances calculated using this large-sample approximation method. Although, for each scenario, both methods (simulation in the paper by Sturmer and Brenner (3) and approximation here) gave the same degree of matching as the most efficient, the magnitudes of the relative efficiencies were slightly different. Discrepancies can be attributed to equation 2, which, though widely used, only calculates an asymptotic approximation to the variance of the interaction odds ratio.

Strategies similar to flexible matching for interactions have been discussed previously. Cain and Breslow (16) discussed a strategy similar to the one detailed above for improving power to detect interactions and main effects. They considered a situation where exposure information on cases and controls was available before sampling of the particular controls for which more detailed information would be collected (in this case, genotyping). They advocated a strategy in which controls are sampled with balanced numbers from each exposure stratum. Cain and Breslow found that the balanced design is always much more powerful than the unstratified design for detecting interactions. Indeed, the only time they found the strategy less efficient was when there is a strong negative correlation between the variables that are measured in the first and second stages; this is also reflected here in the case where the optimum sampling frequency for the exposure is potentially greater or less than 50 percent when the two risk factors are strongly associated.

Breslow and Cain (17) similarly recognized for the two-stage design that unbiased estimates of the interaction parameter can be obtained from an unmatched analysis even though the exposure is used as a matching factor, in the same way as for the flexible matching design. However, estimates of the population exposure frequency can also be used to additionally allow estimation of the exposure main effects. This is an aspect that could also be applied to the flexible matching design if, at the control sampling stage of the study, an estimate of the population exposure frequency could be made, or if the controls were being sampled from a preexisting cohort for which exposure information was available. At the analysis stage, the log of the exposure group frequency (i.e., exposed or unexposed) is used as an offset in the logistic regression model, to retrieve unbiased estimates of exposure main effects. One advantage of this result is that the offset has no effect on the power of the design to detect interactions (17). Thus, if this information is not available, this does not detract from the strength of the design for detecting interactions.

Understanding how the power of the flexible matching design can be optimized is helpful in understanding comparisons between different designs that have been proposed as strategies for detecting interactions. Table 2 shows that although the exposure frequency among controls is chosen to minimize the variance, the decrease in the required sample size is still small in comparison with the case-only design, where there is no component of variance in the interaction estimate due to the controls. The inappropriateness of the case-only design in the presence of risk factor association and concerns about the false-positive rate when the independence assumption is violated (11, 18) mean that alternative strategies are still attractive and should be explored.

By considering the large-sample approximation to the variance of the interaction parameter for the flexible matching design, one can see why using family controls has the potential to improve the power to detect interactions (7, 8). When risk factors are rare (and this is the situation in which most improvement in power from family designs has been observed), the risk factor frequencies among controls are raised above the population levels toward the optimal frequency of 50 percent because of within-family correlation of genetic, and to a lesser extent environmental, risk factors. Similar arguments can be considered for other designs, such as the design that compares case subjects who have two primary cancers with cases who have only one primary cancer (19). This sampling strategy will increase the prevalence of rare risk factors among all study participants, again decreasing the variance of the interaction parameter estimate and increasing power.

Matching strategies such as flexible matching are often the most rational approach to choosing an efficient design for detecting interactions, if the assumption of independence of genotype and exposure that is required for the case-only design proves untenable (11). The strategies described here can be used to find the most informative risk factor frequencies. If the population exposure frequency is known, the theory from two-stage designs can be incorporated at the analysis stage to estimate the main effects of the matching variables. This further increases the attractiveness of these designs.


    APPENDIX
 
As before, let $p_{ij}$ and $p_{ijm}$ be the proportions of persons with level $i$ of the environmental exposure (the matching factor; $i = 1$ if the exposure is present and $i = 0$ if it is absent) and level $j$ of genetic susceptibility ($j = 1$ if the susceptibility is present and $j = 0$ if it is absent) in the population and in matched controls, respectively. Let $P_E$ be the exposure frequency in the source population, and let $M_E$ be the exposure frequency among flexibly matched controls.

The $p_{ij}$ are calculated following the method of Sturmer and Brenner (3). They depend on the genotype and exposure frequencies and the magnitude of the association between the two factors. Alternatively, if an unmatched control group were available, then the values of the proportions $p_{ij}$ could be observed directly. The proportions of persons with each genotype/exposure combination when controls are selected under a flexible matching scheme, $p_{ijm}$, are then calculated as follows, such that the frequency of exposure among controls is $M_E$.

$$p_{00m} = \frac{p_{00}(1 - M_E)}{1 - P_E}.$$

$$p_{01m} = \frac{p_{01}(1 - M_E)}{1 - P_E}.$$

$$p_{10m} = \frac{p_{10}M_E}{P_E}.$$

$$p_{11m} = \frac{p_{11}M_E}{P_E}.$$
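The two steps above can be sketched in Python. For the population proportions, the sketch uses the standard quadratic solution for a 2 x 2 table with fixed margins and a given odds ratio; this is an assumption on our part and may differ in form from the calculation in Sturmer and Brenner's appendix. The rescaling to flexibly matched controls follows the four formulas directly. All function names and input values are hypothetical.

```python
from math import sqrt


def joint_proportions(p_e, p_g, odds_ratio):
    """Population cell proportions keyed by (exposure i, genotype j),
    given the marginal frequencies and the genotype-exposure odds ratio."""
    if abs(odds_ratio - 1.0) < 1e-12:
        p11 = p_e * p_g                      # independence
    else:
        k = odds_ratio - 1.0
        b = 1.0 + k * (p_e + p_g)
        disc = b * b - 4.0 * k * odds_ratio * p_e * p_g
        p11 = (b - sqrt(disc)) / (2.0 * k)   # root lying in the valid range
    return {
        (1, 1): p11,
        (1, 0): p_e - p11,
        (0, 1): p_g - p11,
        (0, 0): 1.0 - p_e - p_g + p11,
    }


def flexibly_matched_proportions(p, m_e):
    """Rescale each exposure stratum so that the exposure frequency
    among controls becomes m_e."""
    p_e = p[(1, 0)] + p[(1, 1)]
    scale = {0: (1.0 - m_e) / (1.0 - p_e), 1: m_e / p_e}
    return {(i, j): value * scale[i] for (i, j), value in p.items()}


# Hypothetical inputs: exposure 0.2, genotype 0.14, odds ratio 27/7.
population = joint_proportions(0.2, 0.14, 27.0 / 7.0)
matched = flexibly_matched_proportions(population, 0.5)
```

Because each exposure stratum is rescaled as a whole, the genotype distribution within a stratum is preserved, while the overall genotype frequency among matched controls can shift when the two factors are associated.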

Therefore, the variance of the log of the interaction odds ratio due to the flexibly matched controls is proportional to

$$\frac{1 - P_E}{p_{00}(1 - M_E)} + \frac{1 - P_E}{p_{01}(1 - M_E)} + \frac{P_E}{p_{10}M_E} + \frac{P_E}{p_{11}M_E}.$$

By differentiating this function with respect to $M_E$ and finding the value of $M_E$ at which the derivative is zero, one can find the value of $M_E$ that minimizes this variance. After some simple algebra, the derivative is zero exactly when the following expression is zero:

$$\frac{p_{00}p_{01}(1 - M_E)^2}{(1 - P_E)^2} - \frac{p_{10}p_{11}M_E^2}{P_E^2}.$$

Because this expression is the difference of two squares, the resulting equation can be solved for $M_E$ by factorization, providing the solution in equation 3.
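The intermediate steps can be written out as follows. This is our own working from the definitions in this Appendix, using $p_{00} + p_{01} = 1 - P_E$ and $p_{10} + p_{11} = P_E$, and taking the root with $0 < M_E < 1$; the final line is therefore a reconstruction and may differ in form from the published equation 3.

```latex
\begin{align*}
V(M_E) &= \frac{1 - P_E}{1 - M_E}\left(\frac{1}{p_{00}} + \frac{1}{p_{01}}\right)
        + \frac{P_E}{M_E}\left(\frac{1}{p_{10}} + \frac{1}{p_{11}}\right)
        = \frac{(1 - P_E)^2}{p_{00}p_{01}(1 - M_E)}
        + \frac{P_E^2}{p_{10}p_{11}M_E}, \\
\frac{dV}{dM_E} &= \frac{(1 - P_E)^2}{p_{00}p_{01}(1 - M_E)^2}
        - \frac{P_E^2}{p_{10}p_{11}M_E^2} = 0
        \;\Longleftrightarrow\;
        \frac{p_{00}p_{01}(1 - M_E)^2}{(1 - P_E)^2}
        = \frac{p_{10}p_{11}M_E^2}{P_E^2}, \\
\frac{\sqrt{p_{00}p_{01}}\,(1 - M_E)}{1 - P_E}
        &= \frac{\sqrt{p_{10}p_{11}}\,M_E}{P_E}
        \;\Longrightarrow\;
        M_E = \frac{P_E\sqrt{p_{00}p_{01}}}
                   {P_E\sqrt{p_{00}p_{01}} + (1 - P_E)\sqrt{p_{10}p_{11}}}.
\end{align*}
```

Under independence, $\sqrt{p_{00}p_{01}} = (1 - P_E)\sqrt{P_G(1 - P_G)}$ and $\sqrt{p_{10}p_{11}} = P_E\sqrt{P_G(1 - P_G)}$, so the expression reduces to $M_E = 0.5$, in agreement with the result given in the Methods.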


    NOTES
 
Correspondence to Dr. Jennifer Barrett, Genetic Epidemiology Division, Cancer Research UK Clinical Centre at Leeds, Leeds LS9 7TF, United Kingdom (e-mail: jenny.barrett{at}cancer.org.uk).


    REFERENCES
 

  1. Smith PG, Day NE. The design of case-control studies: the influence of confounding and interaction effects. Int J Epidemiol 1984;13:356–65.
  2. Thomas DC, Greenland S. The efficiency of matching in case-control studies of risk-factor interactions. J Chronic Dis 1985;38:569–74.
  3. Sturmer T, Brenner H. Flexible matching strategies to increase power and efficiency to detect and estimate gene-environment interactions in case-control studies. Am J Epidemiol 2002;155:593–602.
  4. Sturmer T, Brenner H. Potential gain in efficiency and power to detect gene-environment interactions by matching in case-control studies. Genet Epidemiol 2000;18:63–80.
  5. Brennan P. Gene-environment interaction and aetiology of cancer: what does it mean and how can we measure it? Carcinogenesis 2002;23:381–7.
  6. Gauderman WJ. Sample size requirements for association studies of gene-gene interaction. Am J Epidemiol 2002;155:478–84.
  7. Gauderman WJ. Sample size requirements for matched case-control studies of gene-environment interaction. Stat Med 2002;21:35–50.
  8. Witte JS, Gauderman WJ, Thomas DC. Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol 1999;149:693–705.
  9. Schaid DJ, Rowland C. Use of parents, sibs, and unrelated controls for detection of associations between genetic markers and disease. Am J Hum Genet 1998;63:1492–506.
  10. Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat Med 1994;13:153–62.
  11. Albert PS, Ratnasinghe D, Tangrea J, et al. Limitations of the case-only design for identifying gene-environment interactions. Am J Epidemiol 2001;154:687–93.
  12. Cuzick J. Interaction, subgroup analysis and sample size. In: Boffetta P, Caporaso N, Cuzick J, et al, eds. Metabolic polymorphisms and susceptibility to cancer. Lyon, France: International Agency for Research on Cancer, 1999:109–21. (IARC Scientific Publication no. 148).
  13. Self SG, Mauritsen RH, Ohara J. Power calculations for likelihood ratio tests in generalized linear models. Biometrics 1992;48:31–9.
  14. Brown BW, Lovato J, Russell K. Asymptotic power calculations: description, examples, computer code. Stat Med 1999;18:3137–51.
  15. Longmate JA. Complexity and power in case-control association studies. Am J Hum Genet 2001;68:1229–37.
  16. Cain KC, Breslow NE. Logistic regression analysis and efficient design for two-stage studies. Am J Epidemiol 1988;128:1198–206.
  17. Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika 1988;75:11–20.
  18. Saunders CL, Gooptu C, Bishop DT, et al. The use of case only studies for the detection of interactions, and the non-independence of genetic and environmental risk factors for disease. (Abstract). Genet Epidemiol 2001;21:174.
  19. Begg CB, Berwick M. A note on the estimation of relative risks of rare genetic susceptibility markers. Cancer Epidemiol Biomarkers Prev 1997;6:99–103.