1 Unité de Recherche en Epidémiologie des Cancers, Institut de la Santé et de la Recherche Médicale (INSERM) U521, Institut Gustave-Roussy, 94805 Villejuif, France.
2 Genetic Epidemiology Branch, National Cancer Institute, Bethesda, MD.
3 Department of Preventive Medicine, School of Medicine, University of Southern California, Los Angeles, CA.
ABSTRACT |
---|
case-control studies; cohort studies; epidemiologic methods; interaction; matched-pair analysis; research design; statistics
Abbreviations: ARE, asymptotic relative efficiency; RR, rate ratio.
INTRODUCTION |
---|
To increase a study's power to detect a gene-environment interaction when one of the factors under study is rare, one possible alternative is the counter-matching design. Counter-matching was introduced by Langholz and Clayton (5) as a method of sampling controls from a cohort, or more generally from an at-risk population, for nested case-control studies. One purpose of the design is to increase the numbers of cases and controls with the rare factor of interest without prohibitively increasing the number of measurements that must be performed. The goal of counter-matching is to maximize the number of discordant case-control pairs, which supply the information in a matched case-control study. The efficiency of this method in assessing main effects of uncommon factors has already been evaluated: counter-matching has been shown to increase the efficiency of main effect estimation by approximately 25 percent in comparison with classical random sampling (6). In recent work, Cologne and Langholz found that counter-matching was advantageous in assessing interaction between two factors for which data on one were available for the entire cohort and data on the other were to be obtained from the sample (J. B. Cologne, Radiation Effects Research Foundation (Hiroshima, Japan) and B. Langholz, University of Southern California (Los Angeles, California), personal communication, 1999). The design they explored is quite different from that considered here, because the exposure used for the counter-matching is known exactly for the entire cohort rather than through a surrogate. Furthermore, Cologne and Langholz were interested only in the interaction effect of the exposures, not in the main effects. Assessment of the effects of gene-environment interaction without knowledge of both the genetic and the environmental main effects might be of little use for public health or individual risk assessment. In this paper, we propose counter-matching designs that allow for estimation of the gene-environment interaction effect as well as both the genetic and the environmental main effects. The efficiency and feasibility of these counter-matching designs are evaluated and discussed for different interaction scenarios.
MATERIALS AND METHODS |
---|
The classical definition of interaction was used for this analysis. Gene-environment interaction exists if the joint effect of the genetic factor (G) and the environmental exposure (E) differs from the product of the effects of the individual factors, that is, on a multiplicative scale, RRint = RREG/(RRE × RRG). An interaction effect greater than 1 indicates a greater than multiplicative effect between E and G, while an interaction effect less than 1 indicates a less than multiplicative effect. Interaction may also be defined on an additive scale, where it exists when the joint effect of the genetic factor and the environmental exposure differs from the sum of the background disease rate and the excess rates for the environmental exposure and the genetic factor (RRint = RREG/(RRE + RRG - 1)). However, this analysis focuses on the multiplicative model, the model most commonly used in chronic disease epidemiology.
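As a small illustration of the two definitions above, the following sketch computes the interaction ratio on both scales. The function name and the example rate ratios are hypothetical and chosen only for illustration.

```python
def interaction_ratios(rr_e: float, rr_g: float, rr_eg: float) -> dict:
    """Compare the joint rate ratio with its multiplicative and additive expectations.

    rr_e  : rate ratio for the environmental exposure alone
    rr_g  : rate ratio for the genetic factor alone
    rr_eg : rate ratio for subjects with both factors
    """
    multiplicative = rr_eg / (rr_e * rr_g)    # RRint on the multiplicative scale
    additive = rr_eg / (rr_e + rr_g - 1.0)    # analogous ratio on the additive scale
    return {"multiplicative": multiplicative, "additive": additive}


# Hypothetical values: RRE = 2, RRG = 3, joint rate ratio RREG = 12
ratios = interaction_ratios(rr_e=2.0, rr_g=3.0, rr_eg=12.0)
print(ratios)
```

With these illustrative inputs the joint effect is twice the multiplicative expectation (2 × 3 = 6) and three times the additive expectation (2 + 3 - 1 = 4), which shows that the same data can yield different interaction values depending on the scale.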
Counter-matching for gene-environment interaction
The design and analysis of counter-matched studies are presented in detail elsewhere (6–8) and thus are only briefly described below. In counter-matching, controls are selected to increase the variation in the factors of interest within a case-control set relative to random sampling. The goal is thus the opposite of that of matching, where controls are selected to resemble the cases on the matching factors. A partial likelihood method has been developed for estimating different exposure effects in counter-matching (9) using weighting that takes into account the probabilities that subjects were selected from specific strata.
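The weighting idea can be sketched as follows. This is a minimal illustration, assuming that each sampled subject carries a weight equal to the number of risk-set members in his or her sampling stratum divided by the number sampled from that stratum; the function name, the coefficient values, and the stratum counts are hypothetical and are not taken from the cited program.

```python
import math


def set_log_likelihood(beta, covariates, weights, case_index=0):
    """Weighted log partial-likelihood contribution of one sampled risk set.

    beta       : list of log rate ratios
    covariates : one covariate vector per sampled subject (case and controls)
    weights    : assumed sampling weight per subject (stratum size / number sampled)
    case_index : position of the case within the sampled set
    """
    scores = [
        w * math.exp(sum(b * z for b, z in zip(beta, zs)))
        for zs, w in zip(covariates, weights)
    ]
    return math.log(scores[case_index] / sum(scores))


# Hypothetical 1-1-1-1 set with covariates (E, G, E x G); the case comes first.
beta = [math.log(2.0), math.log(3.0), math.log(2.0)]
covariates = [[1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 0]]
weights = [6.0, 34.0, 23.0, 137.0]  # illustrative stratum sizes, one subject sampled from each
print(set_log_likelihood(beta, covariates, weights))
```

With equal weights the expression reduces to the usual conditional likelihood contribution for a matched set, which is one way to check the sketch.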
In general, the number of case-control subjects from each factor-of-interest status group is fixed by the design. Here, three different variants of counter-matching for assessment of gene-environment interaction are proposed. In order to make assessment of main effects possible, we suppose that surrogates for G and E are available for the entire cohort/population at risk in which the case-control study is nested. Thus, counter-matching is performed either on the genetic factor or on the environmental factor, or on both the genetic and the environmental factors. Each case's risk set would be stratified by either a surrogate of G or a surrogate of E, or by surrogates of both G and E, and controls for that risk set would be selected from the strata other than the case's stratum.
We compared the following population-based designs to evaluate the efficiency of counter-matching in a study examining gene-environment interaction: 1) a full cohort study with no matching and with infinite numbers of controls for each case; 2) a standard nested case-control study with three controls per case; 3) a 2-2 case-control design with counter-matching on a surrogate of E; 4) a 2-2 case-control design with counter-matching on a surrogate of G; and 5) a 1-1-1-1 case-control design with counter-matching on surrogates of both E and G.
The third and fourth designs have two individuals exposed and two unexposed for either the E surrogate (Esur) or the G surrogate (Gsur), respectively. For example, if a case is exposed for a given surrogate, then one exposed control and two unexposed controls for the given surrogate are drawn. If a case is unexposed for a given surrogate, then one unexposed control and two exposed controls for the given surrogate are drawn. Design 5 includes one individual who is unexposed for both surrogates, one who is exposed for both surrogates, one who is exposed for Esur and unexposed for Gsur, and one who is exposed for Gsur and unexposed for Esur. Thus, for designs 2–5, each sampled risk set includes one case and three controls; in designs 3–5, the controls are sampled according to Esur and/or Gsur status, depending on the design.
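The sampling rules for the counter-matched designs can be sketched as below. This is an illustrative implementation under stated assumptions: subjects are hypothetical tuples of (id, Esur, Gsur), the risk set includes the case, and the helper name `counter_match` and the returned stratum weights (stratum size divided by the number sampled) are choices made for this sketch rather than part of the original analysis.

```python
import random


def counter_match(case, risk_set, stratum_of, targets, rng):
    """Sample controls so that the set, case included, holds targets[s] subjects in stratum s.

    Returns the sampled controls and an assumed weight per stratum
    (number of risk-set members in the stratum / number sampled from it).
    """
    controls, weights = [], {}
    case_stratum = stratum_of(case)
    for stratum, m in targets.items():
        members = [p for p in risk_set if stratum_of(p) == stratum]
        weights[stratum] = len(members) / m
        eligible = [p for p in members if p != case]
        # The case already fills one slot in its own stratum.
        need = m - 1 if stratum == case_stratum else m
        controls.extend(rng.sample(eligible, need))
    return controls, weights


def by_both(subject):
    """Stratum defined by both surrogates (design 5)."""
    return (subject[1], subject[2])


def by_esur(subject):
    """Stratum defined by the E surrogate only (design 3)."""
    return subject[1]


# Hypothetical risk set of 200 subjects: (id, Esur, Gsur)
risk_set = [(i, int(i % 5 == 0), int(i % 7 == 0)) for i in range(200)]
case = risk_set[35]  # exposed for both surrogates

# Design 5: 1-1-1-1 counter-matching on both surrogates
targets_5 = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 1}
controls_5, weights_5 = counter_match(case, risk_set, by_both, targets_5, random.Random(0))

# Design 3: 2-2 counter-matching on the E surrogate
targets_3 = {0: 2, 1: 2}
controls_3, weights_3 = counter_match(case, risk_set, by_esur, targets_3, random.Random(0))
print(controls_5, controls_3)
```

Design 4 follows from the same function by stratifying on the third tuple element (Gsur) instead of the second.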
Calculating asymptotic relative efficiency
To evaluate efficiency, we calculate the asymptotic relative efficiency (ARE). For assessment of gene-environment interaction, ARE compares the variance of the gene-environment interaction estimate under each counter-matched case-control design with the variance under either the classical 1:3 nested case-control study or the full cohort, expressed so that values above 1 indicate that the counter-matched design is more efficient than the reference design, as in the results reported below. The ratio indicates proportionally how many more (or fewer) observations (in large samples) are needed by one design to achieve the same precision as the other (10). Asymptotic variances are calculated as described by Langholz and Borgan (9), using a FORTRAN program developed by Langholz (11).
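A minimal sketch of the ARE calculation follows. The orientation used here (reference-design variance in the numerator) is an assumption, chosen so that values above 1 favor the counter-matched design, which is how the AREs are read in the Results. The function name and the example variances are hypothetical.

```python
def asymptotic_relative_efficiency(var_reference: float, var_counter_matched: float) -> float:
    """Ratio of asymptotic variances of the log interaction estimate.

    Assumed orientation: reference variance / counter-matched variance,
    so a value above 1 means the counter-matched design is more efficient.
    """
    if var_reference <= 0 or var_counter_matched <= 0:
        raise ValueError("variances must be positive")
    return var_reference / var_counter_matched


# Hypothetical asymptotic variances for the interaction parameter
are = asymptotic_relative_efficiency(var_reference=0.50, var_counter_matched=0.25)
print(are)
```

Under this reading, an ARE of 2 would mean that the reference design needs about twice as many sets as the counter-matched design to reach the same precision.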
Calculations of sample size
To calculate sample sizes for a gene-environment interaction study using counter-matching, we use the following classical formulation as described by Breslow and Day (12):
![]() |
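A minimal sketch of a sample-size calculation of this general kind is given below. It assumes the usual normal-approximation form for a two-sided Wald test on ln(RRint), in which the number of case-control sets is proportional to the per-set asymptotic variance of the log interaction estimate; the exact expression used in reference 12 may differ, and the function name, default error rates, and example variance are hypothetical.

```python
import math
from statistics import NormalDist


def required_sets(rr_int: float, unit_variance: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of case-control sets needed to detect RRint.

    unit_variance : assumed asymptotic variance of the log interaction
                    estimate contributed by a single case-control set
    alpha         : two-sided type I error rate
    power         : desired power (1 - type II error rate)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha + z_beta) ** 2 * unit_variance / math.log(rr_int) ** 2
    return math.ceil(n)


# Hypothetical per-set variance of 100 for the log interaction estimate
print(required_sets(rr_int=2.0, unit_variance=100.0))
print(required_sets(rr_int=5.0, unit_variance=100.0))
```

Because the required number of sets scales with the variance, a design whose variance is halved needs roughly half as many sets under this approximation.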
RESULTS |
---|
Since counter-matching on both G and E appears to be the most efficient case-control design in the detection of gene-environment interaction, we use this design to examine the efficiency when parameters such as frequencies of G and E, main effects of G and E, or G x E interaction effects are varied.
Efficiency of counter-matching according to the sensitivity and specificity of surrogates. Figure 1 shows the effects of the sensitivity (proportion of truly exposed (to either E or G) subjects who are so identified by the surrogate (Esur or Gsur, respectively)) and specificity (proportion of truly nonexposed subjects who are so identified by the nonexposed surrogate) of Gsur on ARE for different frequencies of G (P(G) = 0.01, 0.1, 0.2). The AREs are calculated comparing the design that counter-matches on both G and E with the standard 1:3 case-control study. All other parameters are fixed with RRE = RRG = RRint = 2 and P(E) = 0.1, and the sensitivity and specificity of E are fixed at 0.8. AREs increase as the sensitivity of Gsur increases (specificity of Gsur fixed at 0.8 (figure 1, lines with circles)) or as the specificity of Gsur increases (sensitivity of Gsur fixed at 0.8 (figure 1, lines with asterisks)) for different frequencies of G. In this scenario, the counter-matched design is more efficient than or at least as efficient as the 1:3 case-control study regardless of the specificity, and it becomes more efficient than the 1:3 case-control study when the sensitivity of G is greater than 0.1. Actually, the threshold of specificity and sensitivity of G for obtaining a gain in efficiency depends mainly on the fixed values of the sensitivity and specificity of E and vice versa (data not shown). For example, when the sensitivity and specificity of E are equal to 0.5, the threshold (ARE = 1) for the sensitivity of G is equal to 0.3 and the specificity of G is equal to 0.5. In other words, if one of the two surrogates has low sensitivity and specificity, the other factor must be highly sensitive and specific to produce a gain in efficiency for a gene-environment interaction study. This gain increases as the specificity or sensitivity increases. 
Thus, high sensitivity and specificity for the surrogates of G and E are important for making counter-matching on G and E an efficient study design.
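To make the role of surrogate quality concrete, the following sketch computes how often a surrogate is positive and what fraction of surrogate-positive subjects truly carry the factor, given the factor frequency, sensitivity, and specificity as defined above. The function name is hypothetical; the inputs mirror values used in this section (P(G) = 0.01, sensitivity and specificity of 0.8).

```python
def surrogate_profile(prevalence: float, sensitivity: float, specificity: float) -> dict:
    """Summarize how well a binary surrogate identifies a true binary factor."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    p_surrogate_positive = true_positive + false_positive
    return {
        "p_surrogate_positive": p_surrogate_positive,
        # Share of surrogate-positive subjects who truly have the factor
        "positive_predictive_value": true_positive / p_surrogate_positive,
    }


profile = surrogate_profile(prevalence=0.01, sensitivity=0.8, specificity=0.8)
print(profile)
```

With these inputs, only about 4 percent of surrogate-positive subjects truly carry the rare factor, which illustrates why imperfect specificity limits how strongly counter-matching can enrich a sample for a rare gene.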
The results of these calculations are summarized in table 2. Over a wide range of genetic parameters, the specificities P(FH = 0|G1 = aa) were consistently high, ranging from approximately 68 percent to 84 percent. Under a recessive model, the specificities for heterozygous (Aa) cases/controls P(FH = 0|G1 = Aa) were also fairly high, generally about 65–75 percent. Sensitivities were more variable from one set of model parameters to another, but they generally ranged from 40 percent to 90 percent for homozygous (AA) cases/controls. For heterozygous cases/controls, the sensitivities ranged from about 45 percent to 75 percent under a dominant model.
Thus, we fixed sensitivity and specificity at 80 percent to examine the efficiency when frequencies of G and E, main effects of G and E, or G x E interaction effects are varied.
Efficiency of counter-matching according to the main effects of G and E. In parts a and b of figure 2, we examine the AREs for different RRE and RRG values with sensitivity and specificity set at 0.8 for both G and E, P(E) equal to 0.2, and P(G) equal to 0.01. AREs are calculated for two different values of G x E interaction (RRint = 3, 10). The results show an increase in ARE as RRG increases. For example, when RRint = 3 and RRE = 3 (figure 2, part a, dashed line with asterisks), the AREs increase from 1.35 when RRG = 1 to 2.30 when RRG = 10. The slopes of the AREs barely change across the range of RRint. For example, when RRE = 3, comparison of AREs for RRG = 10 versus RRG = 1 shows increases of 0.93 when RRint = 3 (part a, dashed line), 0.95 when RRint = 5 (data not shown), and 0.96 when RRint = 10 (part b, dashed line).
These results show that as the main effect of the rarer factor increases, AREs for detecting gene-environment interaction increase. In contrast, as the main effect of the more common factor increases, AREs for detecting gene-environment interaction may increase slightly but usually decrease. When the frequencies of G and E are equal, AREs increase as the main effect of either E or G increases (data not shown).
Efficiency of counter-matching according to the frequencies of G and E. Figure 3 presents AREs from counter-matching on both G and E for different frequencies of G and E, with the sensitivity and specificity of both G and E fixed at 0.8, RRE equal to 2, and RRG equal to 3. AREs are calculated for three different values of G x E interaction (RRint = 3, 5, 10). The results show a decrease in the ARE as the frequency of G increases. For example, when RRint = 5 and P(E) = 0.1 (figure 3, part b, line with diamonds), AREs decrease from 2.40 when P(G) = 0.001 to 1.17 when P(G) = 0.5. At a given frequency of E, the slope of ARE increases as RRint increases. For example, when P(E) = 0.1, comparison of AREs for P(G) = 0.001 versus P(G) = 0.5 shows a decrease in ARE of 1.04 when RRint = 3 (part a, line with diamonds), 1.23 when RRint = 5 (part b, line with diamonds), and 1.44 when RRint = 10 (part c, line with diamonds).
Efficiency of counter-matching according to the G x E interaction value. The ARE of counter-matching on both G and E is calculated according to different values of G x E interaction for various frequencies of G (figure 4). As before, the sensitivity and specificity of both G and E are fixed at 0.8; RRE = 2, RRG = 3, and P(E) = 0.2. Figure 4 shows a slight increase in the ARE as RRint increases. The increase becomes greater as the frequency of G becomes smaller. For example, when P(G) = 0.01 (figure 4, line with circles), AREs increase from 1.50 when RRint = 1 to 1.98 when RRint = 20. The slopes of the AREs increase as the frequency of G decreases. For example, comparing RRint = 1 with RRint = 20 produces an increase in ARE of 0.09 when P(G) = 0.3, an increase of 0.26 when P(G) = 0.1, and an increase of 0.52 when P(G) = 0.001. These results show increases in AREs for detecting gene-environment interaction as RRint increases. The greater the interaction effect, the greater the efficiency of counter-matching on both G and E, particularly when the frequency of G is small (P(G) < 0.1). However, the AREs are much more strongly influenced by P(G) than by RRint.
When G and E are rare (e.g., 0.01) and gene-environment interaction is moderate (e.g., RRint ≤ 5), the required sample size is very large (>8,000 sets), often reaching unrealistic numbers of sets. When the frequency of G is very low (P(G) = 0.001), as might be observed for major genes such as BRCA1/BRCA2 or CDKN2A, the needed sample size is realistic only when E is common (P(E) ≥ 0.2) and there is a strong gene-environment interaction effect: RRint > 5 when RRG = 2 and RRint ≥ 5 when RRG = 10.
DISCUSSION |
---|
Since the sensitivity and specificity of the surrogates are very important for the gain in efficiency of counter-matching, the choice of highly specific and sensitive surrogates in the first stage of this method is critical. However, the requirement for highly specific and sensitive surrogates must be balanced against the need to use surrogates on which data are available or are easily measured in the cohort/population at risk. For example, at present, it would probably be too costly to genotype an entire cohort for a specific gene. As such, one might consider using a family history of the disease under study as a surrogate for the genetic factor. Family history is relatively easily measured, and the information is not too expensive to obtain. However, before designing the study, one must assess how predictive family history of disease is for the particular gene of interest. Indeed, we have shown that both the sensitivity and the specificity of family history as a surrogate tend to be higher for rare major susceptibility genes than for common low-penetrance genes. Thus, in many complex and chronic diseases, family history may not be highly sensitive or specific for G if the disease under study is genetically heterogeneous (i.e., if more than one gene is involved, leading to low sensitivity), if most gene carriers are not affected (i.e., there is low penetrance), or if family sizes are not sufficiently large (the latter two conditions leading to low specificity). In such scenarios, family history would be expected to be only a weak surrogate of G and thus to produce only a modest or minimal gain in efficiency for the counter-matching design. When genes of interest have low penetrance, physiologic G surrogates (such as inexpensive phenotypic assays of urine, saliva, hair, etc.) may be considered and may be expected to be more sensitive and specific than family history.
The results of this analysis show that as the main effect of the rarer factor increases, the relative efficiency of counter-matching on both G and E for detecting gene-environment interaction increases. Conversely, as the main effect of the more common factor increases, the relative efficiency usually decreases. Moreover, the larger the gene-environment interaction and the rarer the risk factors G and E, the greater the efficiency of counter-matching. However, the gain in efficiency must be balanced against the feasibility of the study, as measured by the needed sample size. Indeed, when the two factors are rare (i.e., frequency < 0.1) and the interaction value is moderate (i.e., RRint ≤ 5), the relative efficiency of the counter-matching design is very high but the corresponding required sample size is very large (i.e., >8,000 sets, comprising >8,000 cases and >24,000 controls). Although sample sizes for alternative designs would be even larger than those needed for counter-matching, sample sizes of this magnitude are generally not realistic, and studies of this size tend to be prohibitively expensive.
When the frequency of G is very small (e.g., 0.001), as might be observed for major genes in cancer or other chronic diseases, the needed sample size might remain realistic only when factor E is common (i.e., >0.2) and when the interaction effect is high (i.e., >5).
For more common factors, the gain in efficiency derived from use of the counter-matching design, which may be minimal in some situations, must be balanced against the complexity of the design, particularly the difficulty involved in obtaining two specific and sensitive surrogates for the risk factors of interest. At present, identification of good surrogates for the factor(s) of interest and the costs associated with measuring these surrogates in large numbers of subjects (i.e., the entire cohort) may be the major determinants in deciding whether or not to conduct a counter-matched study.
This evaluation of the counter-matching design for assessment of gene-environment interaction used unrelated individuals drawn from a cohort or population at risk. The potential problem of population stratification or genetic admixture might affect the efficiency of this design in a gene-environment interaction study, although the potential loss of efficiency would be expected to be small (14). The use of related individuals as controls from a family population-based cohort may be an alternative. This design has recently been proposed for counter-matching in assessment of the effect of genetic factors and their interaction with environmental exposures (15).
Another multistage design has been proposed for the study of rare factors: the so-called "balanced design" (16). In this study design, rather than choosing a subset at random, one selects cases and controls in order to oversample for the rare factor of interest. The oversampling is taken into account in the analysis to obtain unbiased estimates of the effects of the individual factors and their interaction. One important difference between this design and counter-matching is that the balanced design is for grouped data (i.e., each case-control set must contain multiple cases), while the counter-matching design we have used is individually matched. In some situations, grouping may offer some logistical advantages, while in others the ability to match finely may be desirable. When the individual factor effects and their interaction are to be estimated, the balanced design appears to be as complex to implement as the counter-matching design. Cain and Breslow (17) investigated the efficiency of a balanced design versus a random sampling case-control design in estimating exposure-covariate interaction. Like the counter-matching design, the balanced design was always more efficient than a random sampling design for estimating interaction. Limited direct comparisons between the "balanced" and "counter-matching" designs showed similar efficiencies in interaction estimation (9, 18). Additional research on the efficiency of the balanced design using different gene-environment interaction schemes, such as those presented in this paper, would facilitate better comparison of these two complex study designs.
The development of designs for specific gene-environment interaction studies should consider factors specific to each particular study. For instance, the frequencies of the genetic and environmental variables, their main effects, and the a priori supposed interaction effects would determine the efficiency and feasibility of various study designs. In addition, the costs of contacting and enrolling cases and controls, measuring surrogate variables, genotyping, measuring environmental variables, and establishing a reliable administrative structure with which to accurately implement a complex study design (e.g., multistage selection and data collection) should be assessed. Furthermore, if the specific genes of interest are highly prevalent metabolic genes for which relatively inexpensive phenotypic assays are available, one could consider a design that would screen large numbers of potential study subjects with a phenotypic assay and then genotype only those subjects selected for counter-matching. The frequency of the disease also matters: the sampling strategies suited to a rare disease may differ from those that would be used if the disease were common. If the disease is common, such that many cases are available, a one-control-per-case counter-matching design could be envisioned. Again, the specifics of a particular study will determine what type(s) of designs to consider. Whatever the design, the principle of using surrogate measures to inform the sampling may be considered in order to increase power for a gene-environment interaction study.
The increased interest in evaluating gene-environment interaction for many chronic diseases and the requirement of larger sample sizes for such studies have led to the evaluation of epidemiologic study designs that differ from traditional case-control or cohort designs. Counter-matching is one such alternative design. Although the efficiency of counter-matching relative to a 1:3 case-control study is greatest when the risk factors of interest are very rare, the study of such very rare factors is not realistic unless one is interested only in detecting very strong interaction effects (e.g., RRint > 10). Nevertheless, a 1-1-1-1 counter-matching design appears to be more appropriate than most traditional epidemiologic methods for the study of gene-environment interaction involving rare genes or uncommon environmental exposures. However, both efficiency and feasibility must be evaluated before one considers using a counter-matching design, since the design is more complex than that of a standard nested case-control study.