From the Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC.
ABSTRACT |
---|
bias; causality; epidemiologic research design; genetics; linkage disequilibrium; logistic models; polymorphism (genetics); sampling
Abbreviations: E, exposure; G, genetic factor.
INTRODUCTION |
---|
Geneticists have tended to mistrust case-control "associational" studies because of their potential vulnerability to effects of genetic "admixture." Admixture, or "stratification," arises when genetically diverse subpopulations are incompletely mixed. In an admixed population, some subgroups may carry both a higher prevalence of an allele of interest (and/or an exposure under study) and a higher risk of disease. In such a population, an association between a variant allele and risk could be noncausal, reflecting only confounding with subpopulation. Although some have argued that admixture causes little bias in a diverse population such as the United States (1), family-based approaches offer robustness against this problem. For this and other (perhaps better) reasons, for example, convenience of implementation, recruitability of suitable controls, or statistical efficiency, family-based designs are worth considering.
The purpose of this commentary is to lay out some of the issues that should be considered in choosing a retrospective design for studying joint effects of a candidate gene (or a marker gene) and an exposure. The issues related to using a prospective cohort study are important but lie beyond our present scope. To the extent that nested retrospective designs might also be incorporated in that context, however, some of our comments will be relevant there as well. We begin by describing a range of retrospective designs that could be used, then discuss available methods of analysis for these designs, and finally contrast the assumptions needed by the various strategies.
RETROSPECTIVE DESIGNS |
---|
The most economical alternative is the case-only design, first set forth as a method for studying gene-by-environment interaction for susceptibility genes in 1994 (3). If an inherited dichotomous genetic factor, G, and a dichotomous exposure, E, can be assumed to occur independently in the population, then one can show algebraically that the odds ratio for G in relation to E among cases can be treated as an estimate of the multiplicative interaction, the ratio of relative risks (4), between G and E in causing the disease. Consequently, one can assess such interactions by carrying out logistic regression analysis using only cases, treating G as the outcome. In fact, under the required independence assumption, the precision of the estimate based only on cases is superior to that based on a classical logistic regression analysis of data that has included any number of controls for the same cases. A recent application of the case-only approach demonstrated an interaction between inheritance of the null-null genotype for glutathione S-transferase M1 and exposure to environmental tobacco smoke in causing lung cancer in women who had never smoked (5). This study made good use of existing archived tissue specimens for cases, but, as was pointed out in an accompanying editorial (6), the investigators did not fully use their existing data: Even though genotype data for controls were lacking, what data were available could have been used to estimate exposure effects within each specific genotype category (7). Such a set of estimates is epidemiologically more informative than is a single interaction parameter.
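The algebra behind the case-only estimate can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names, the risks, and the prevalences are hypothetical and are not taken from any study cited here. It assumes that G and E are independent in the population, so that the expected share of cases in each G-by-E cell is proportional to the cell's risk multiplied by its population frequency.

```python
def case_only_odds_ratio(risk, p_g, p_e):
    """Odds ratio between G and E among cases.

    risk[(g, e)] is a hypothetical disease risk for genotype g and exposure e
    (0 = absent, 1 = present). p_g and p_e are population prevalences, and
    G and E are ASSUMED independent in the population.
    """
    # Joint population frequency of each cell under independence of G and E.
    freq = {
        (g, e): (p_g if g else 1 - p_g) * (p_e if e else 1 - p_e)
        for g in (0, 1)
        for e in (0, 1)
    }
    # Expected share of cases in each cell is proportional to risk * frequency.
    cases = {cell: risk[cell] * freq[cell] for cell in freq}
    return (cases[1, 1] * cases[0, 0]) / (cases[1, 0] * cases[0, 1])


def interaction_ratio(risk):
    """Multiplicative interaction: joint relative risk over the product of
    the two single-factor relative risks."""
    rr_joint = risk[1, 1] / risk[0, 0]
    rr_g = risk[1, 0] / risk[0, 0]
    rr_e = risk[0, 1] / risk[0, 0]
    return rr_joint / (rr_g * rr_e)


# Made-up risks, for illustration only.
risk = {(0, 0): 0.01, (1, 0): 0.02, (0, 1): 0.015, (1, 1): 0.09}
```

With these made-up numbers, `interaction_ratio(risk)` and `case_only_odds_ratio(risk, p_g=0.3, p_e=0.4)` agree up to floating-point rounding. The prevalences cancel out of the case-only odds ratio, which is why independence of G and E in the population is the assumption that matters.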
Independence of G and E is often a natural assumption, but it can certainly fail to hold. The prevalences of E and G can both vary across subpopulations, inducing a correlation. Some genetic factors, for example, those governing metabolic processes, can actively influence certain exposures. In addition, genetic factors strongly related to risk, for example, BRCA1 or BRCA2 with breast cancer, could become increasingly related to other risk factors across age-time, through selective attrition. While the required assumption of the independence of genotype and exposure cannot be verified without including a sample of controls, the case-only design nevertheless can be useful for inexpensively exploring possible gene-by-exposure interactions by using archived tissue from cases. But once an E-by-G association has been noted in a case-only study, methods that are epidemiologically more rigorous, such as a population-based case-control study, should be used for follow-up confirmation.
Both approaches, however, are theoretically subject to bias due to genetic admixture, as noted above. With the case-only approach, bias could arise because of incomplete mixing of subpopulations that differ by exposure prevalence and genotype prevalence, even if their baseline risks of disease do not differ. Hence, family-based approaches might offer some advantages over both the case-only and the case-control designs.
In contrast with the population-based case-control design, family-based designs enroll related persons to serve as controls, a strategy that confers robustness against the potentially biasing effects of genetic admixture. Such controls could include the sibling(s) of affected persons or perhaps, as suggested recently by Witte et al. (8), cousins of affected persons. Of course, use of siblings (or cousins) raises the troublesome issue that some cases may have to be excluded only because they do not have a suitable relative willing to participate. Another approach, particularly well adapted to studying a disease with early onset, such as a birth defect, insulin-dependent diabetes, or schizophrenia, is instead to genotype the parents of affected persons and study the apparent transmission of particular alleles from such parents to their affected offspring.
Studies in which siblings or cousins are used as controls are useful for diseases with early or later onset and can be analyzed by using conditional logistic regression (8), with fine stratification on family. Under the usual assumptions required for logistic modeling, this analysis permits estimation of the within-family odds ratios and the within-family interaction parameter. However, the investigator should recognize that, even without bias from genetic admixture, the within-family odds ratio generally will not be the same as the corresponding odds ratio based on a population-based case-control study. This difference arises because the within-family parameter is subject specific, the "subject" being the family, while the population-based odds ratio is averaged across the population. (The within-family odds ratio has implicitly adjusted out the unknowable baseline risks that vary across families.) Neuhaus et al. have shown that the population-based odds ratio will generally be closer to 1 than the within-family odds ratio (9). For similar reasons, a study that uses siblings will be estimating a parameter that is further from the null value than the same parameter estimated by a study in which cousins are used as controls. Meta-analysts should take notice.
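A small numerical sketch may help make the distinction between the two odds ratios concrete. Everything in it is hypothetical: two equally common family types share the same within-family odds ratio but differ in baseline log odds, and exposure is assumed to be distributed identically in both types, so that no confounding is present.

```python
import math


def expit(x):
    """Inverse logit: convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def marginal_odds_ratio(baselines, log_or):
    """Population-averaged odds ratio when equally common family types share
    one within-family log odds ratio but differ in baseline log odds."""
    risk_unexposed = sum(expit(b) for b in baselines) / len(baselines)
    risk_exposed = sum(expit(b + log_or) for b in baselines) / len(baselines)
    return odds(risk_exposed) / odds(risk_unexposed)


# Hypothetical values: within-family odds ratio of 3 and two family types
# with very different baseline risks.
within_family_or = 3.0
baselines = [-4.0, 0.0]
population_or = marginal_odds_ratio(baselines, math.log(within_family_or))
```

Under these assumptions `population_or` falls between 1 and `within_family_or`, in line with the attenuation described above. If the baselines are made equal, the two odds ratios coincide.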
Designs in which the parents provide the genetic control offer some practical and analytical advantages, particularly for studying diseases with early onset, such as birth defects. In this design, a sample of unrelated cases and their parents are enrolled, and each threesome of persons is genotyped. The resulting triads (of genotypes) can be assumed to be mutually statistically independent.
Several methods of analysis have also been proposed for case-parent data. Consider a diallelic gene producing three genetic categories of risk corresponding to whether a person carries zero, one, or two copies of the variant allele. The data can be analyzed by considering the "pseudosiblings" of the affected proband, that is, the three other equally likely potential offspring (with respect to the gene under study) that those two parents could have produced. In the pseudosib method of analysis, first proposed by Self et al. (10), one can use conditional logistic regression; each case is considered matched to the pseudosibs in a 1:3 match. Interaction between genotype and an exposure is assessed multiplicatively by comparing the estimated genetic relative risks among exposed probands with the corresponding genetic relative risks among unexposed probands. Because of the associated computational trick, the case-parent design is sometimes referred to as the pseudosib design (8), although "pseudosib" really refers to a method of analysis and not the design (11). One feature of this approach that has caused some confusion (8, 11) is that what can be estimated is the within-family relative risk, not the within-family odds ratio, despite the fact that the computational gimmick applies conditional logistic software to the pseudosib data.
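The construction of pseudosibs described above can be sketched in a few lines. This is a minimal illustration, not the procedure of Self et al.: the function names are hypothetical, genotypes are coded as the number of copies of the variant allele (0, 1, or 2), and parent-of-origin is ignored.

```python
from itertools import product


def parental_alleles(copies):
    """Alleles (0 = common, 1 = variant) carried by a parent who has
    `copies` copies of the variant allele."""
    return {0: (0, 0), 1: (0, 1), 2: (1, 1)}[copies]


def pseudosibs(mother, father, case):
    """Genotypes of the three pseudosibs matched to an affected proband.

    The four equally likely offspring are formed by pairing one maternal
    allele with one paternal allele; the one corresponding to the case is
    set aside and the remaining three are returned.
    """
    offspring = [
        m + f
        for m, f in product(parental_alleles(mother), parental_alleles(father))
    ]
    if case not in offspring:
        raise ValueError("case genotype is incompatible with the parental genotypes")
    offspring.remove(case)  # drops exactly one copy of the case's genotype
    return sorted(offspring)
```

For two heterozygous parents and a proband with two copies, `pseudosibs(1, 1, 2)` returns `[0, 1, 1]`. Each proband and its three pseudosibs would then form one matched set for conditional logistic software.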
An alternative approach to the same data explicitly treats the family, that is, the triple of genotypes, as the unit of analysis. One first counts the triads, categorized on the basis of the three genotypes, and then applies Poisson regression. The log-linear model must stratify on the parental genotypes (12, 13), with further possible stratification on the exposure status of the proband (14). The use of parents in this way confers robustness against confounding due to genetic admixture. For effects due to the inherited genes and their multiplicative interactions with exposures, maximum likelihood analysis based on the log-linear Poisson model gives results identical to those based on conditional logistic regression (10) of cases and their pseudosibs. The estimated genetic relative risks and their standard errors produced by using the two methods will be the same (regardless of whether the disease is rare). Moreover, both are also equivalent to the "conditional on parental genotype" likelihood approach advanced by Schaid and Sommer (15, 16). These are not three different models but, in this limited context, should be thought of as three different ways to do the same thing.
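One way to see why the three formulations agree is to write down the quantity they share: the probability of the affected offspring's genotype, given the parental genotypes. The sketch below is our own hedged illustration with hypothetical function names. It assumes Mendelian transmission and that, within a parental mating type, risk depends only on the offspring's own genotype through the relative risks R1 and R2.

```python
from itertools import product


def mendelian_probs(mother, father):
    """Probability of each offspring genotype (0, 1, or 2 variant copies),
    given the parents' copy numbers, under Mendelian transmission."""
    alleles = {0: (0, 0), 1: (0, 1), 2: (1, 1)}
    probs = {0: 0.0, 1: 0.0, 2: 0.0}
    for m, f in product(alleles[mother], alleles[father]):
        probs[m + f] += 0.25
    return probs


def case_genotype_probs(mother, father, r1, r2):
    """Genotype distribution of an AFFECTED offspring, given the parents.

    Each Mendelian probability is weighted by the relative risk for that
    genotype (1, r1, r2 for 0, 1, 2 copies) and the weights are renormalized
    within the parental mating type.
    """
    rel_risk = {0: 1.0, 1: r1, 2: r2}
    weights = {
        g: rel_risk[g] * p for g, p in mendelian_probs(mother, father).items()
    }
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}
```

For example, with two heterozygous parents and hypothetical values R1 = 2 and R2 = 4, the affected offspring carries zero, one, or two copies with probabilities 1/9, 4/9, and 4/9. Multiplying such terms across independent triads gives a likelihood in R1 and R2 that does not involve the allele frequencies.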
The log-linear approach has some important advantages, however, over the explicitly conditional alternatives. Standard Poisson regression software can be used. The log-linear model can be extended easily to allow for effects of the maternal genotype (13), which can be important, for example, in the study of birth defects; in addition, it can be modified to assess imprinting (17). It can also efficiently use incomplete triads by allowing for missing parents (18). (Although their simulations suggest that their approach is less efficient than others, Sun et al. (19, 20) provided closed-form estimators for the genetic relative risks that remain valid when some parents are missing.)
Table 1 summarizes the primary analytical options available for data from a case-parents design. R1 and R2 denote the relative risks of disease for heterozygous and homozygous carriers of the variant allele, respectively, relative to those with no copies. While seven methods are listed, the first two are not useful for estimation, as has been discussed (14), and the third and fourth would be expected to be less efficient than full maximum likelihood because they discard information from families in which all three persons are heterozygous. The remaining three methods are all equivalent in a setting in which interest centers on the effects of the inherited gene and its multiplicative interactions with exposure. All seven methods shown in table 1 share the same set of assumptions required for valid analysis. It is important to understand what this design can and cannot do in relation to other designs.
ASSUMPTIONS COMPARED |
---|
The two key additional limitations imposed by the case-parents design are its inability to assess nonmultiplicative models and its inability to estimate exposure effects. Another practical limitation is that the case-parents design is most useful when many cases' parents will still be living, as in studies of diseases such as juvenile-onset diabetes and birth defects.
The inability of the case-parents design to estimate exposure effects is perhaps its most important limitation. With such a study, one can only estimate effects of genotype in the absence and presence of an exposure (or at multiple levels of exposure), but such estimates can be difficult to interpret without an estimate of the main effect of the exposure. Thus, for example, one would not be able to distinguish a scenario in which an exposure confers excess risk but only in the presence of a mutant allele from a scenario in which the exposure confers protection but only in the absence of the allele. Such distinctions can be key to understanding the biology of the joint effects of the allele and the exposure.
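The two scenarios just described can be made concrete with made-up numbers. In the sketch below, the risks are hypothetical and are expressed relative to unexposed noncarriers; the function name is ours. The point is only that both scenarios yield the same exposure-specific genetic relative risks, which are the quantities a case-parents study can estimate.

```python
def genetic_relative_risks(risk):
    """Genetic relative risk within each exposure stratum.

    risk[(g, e)] is the risk for genotype g and exposure e (0 = absent,
    1 = present). Returns a dict mapping exposure status to the relative
    risk for carriers versus noncarriers in that stratum.
    """
    return {e: risk[1, e] / risk[0, e] for e in (0, 1)}


# Scenario A: exposure doubles risk, but only in carriers of the allele.
scenario_a = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

# Scenario B: exposure halves risk, but only in noncarriers.
scenario_b = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 1.0}
```

Both `genetic_relative_risks(scenario_a)` and `genetic_relative_risks(scenario_b)` give a genetic relative risk of 1 among the unexposed and 2 among the exposed, so the two scenarios cannot be told apart without an estimate of the exposure effect itself.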
The case-sibling design has a different set of limitations, many of which are listed in table 2. One assumption not listed in table 2 is required when case-sibling data are analyzed with conditional logistic regression (and some families contribute more than two persons). In that setting, the model requires that the outcomes be conditionally independent within families. If the locus under study is in linkage with a locus that influences risk, and if some probands have been matched to more than one related control, then the usual conditional logistic model will not fully capture the dependency structure in sibship- or cousin-based data; hence, conditional logistic regression is not strictly statistically valid. Other methods have been proposed that avoid the dependency issue, for example, the sign-test-inspired testing method of Horvath and Laird (22), which neither fully uses the data nor permits estimation of relative risk parameters. Another method for avoiding the problem of within-family dependency repeatedly samples case-control pairs from each informative sibship and then carries out a matched-pairs logistic analysis, which relies only on the statistical independence (across clusters) of the paired differences. The estimates resulting from a large number of such resamplings are then averaged, and a corresponding adjusted variance implicitly handles the dependency among the resampled data sets (23). The dependency-structure problem is not listed in table 2 because it is not a feature of the design but rather arises when conditional logistic regression is the selected analytical method.
Some important, more practical concerns also arise when siblings are used as controls, some of which were mentioned by Witte et al. (8). One problem with use of unaffected siblings is that some may really be "affected" but just not have lived long enough to express the disease phenotype. Witte et al. handle this issue, which they describe in terms of risk-set sampling, by requiring the unaffected sibling to be older than the proband. (This approach presumes that incident cases are studied immediately. A slightly less-restrictive design might require the unaffected sibling to be at least as old as the case was at the time of diagnosis.) The requirement that a sibling of a suitable age exists and can be studied can substantially reduce the number of eligible cases, as occurred in the California breast cancer study cited (8), in which only a quarter of cases had an eligible sibling control. Witte et al. propose to fix this problem by advising the investigator to enroll a cousin when a suitable sibling is not available. As they acknowledge, this strategy may improve sample size but weakens the design's protection against genetic admixture. Moreover, use of either siblings or siblings supplemented by cousins could result in sampling in which the cases and controls are not comparable with respect to factors such as region of residence.
Correlation in exposure status can cause overmatching and consequent loss of efficiency when siblings are used (8), but bias also may be produced by asymmetric structure in the exposures of older versus younger offspring. If birth-order effects exist, then any asymmetry in eligibility criteria for cases versus controls could produce bias. For example, birth order is evidently related to the likelihood that a person becomes a smoker, with the younger siblings being at increased risk (24). Such asymmetry could result in an apparent effect of smoking on risk, whether one exists or not.
Other temporal trends can also introduce bias. For example, suppose a birth defect is under study and the exposure of interest is maternal smoking. If two babies born to the same woman are discrepant with respect to maternal smoking, it may be more likely that the more recent gestation was the smoke-free one. The point is that women may be more likely to quit than to take up smoking between pregnancies. Such a temporal pattern can produce a spurious protective effect of maternal smoking if the case is matched more often to an older sibling. Of course, with birth defects (being prevalent events at birth), there is no good reason to require the control sibling to be older. (On the other hand, if incident birth defects are studied, there will be no younger siblings to sample!) Even for diseases with later onset, such as leukemia, the same kind of bias can arise if one wishes to consider early life exposures, such as parental smoking. In general, young couples may experience systematic trends over time that differentially affect their children, such as increasing affluence and increasing exposure to infectious agents with increasing numbers of children. All such temporal trends can produce bias if one requires the sibling control to have been born before the proband.
Clearly, despite having certain practical and theoretical advantages, both the case-parents design and the sibling-control design suffer from important limitations. On the other hand, family members are often well motivated, and the sampling assumptions required for a family-based design are weaker than those required for the corresponding population-based case-control study.
In practice, the optimal family-based design may be a hybrid that genotypes all available first-degree relatives; then the analysis takes all available information into account. Procedures have been proposed for testing for genetic effects based on such hybrid data (18, 20, 25), as have methods for estimating genetic relative risks (18). However, to our knowledge, a comprehensive analytical approach that would allow full exploitation of data from family members in such a setting has not been set forth.
CONCLUSIONS |
---|
Our assessment of problems associated with family-based designs should be tempered by the recognition that the population-based case-control design is also not problem free. There is, again, the issue of possible confounding due to incomplete genetic mixing of sectors of the population under study, which can produce bias in assessing both genetic effects and gene-by-environment interactions. Family-based studies avoid this potential source of bias by making within-family inferences.
Another issue not always appreciated is that the maternal genotype can serve as a confounder. Genetic effects can take two forms: the inherited genotype can affect the offspring's risk (the usual mechanism assumed), or the mother's genotype can influence the offspring's prenatal environment (13). If the genetic effect is fully or partly mediated by the mother's genotype, a case-control study could to some extent misattribute such effects to the inherited genotype and also estimate with bias the interaction between genotype and exposure. Hence, when using a case-control design, one needs to assume that no maternally mediated genetic effects exist (table 2) or recognize that the two cannot be fully separated.
Another problem that can potentially impact a population-based case-control study is that random sampling of suitable controls from a defined population may become increasingly difficult for both practical and cultural reasons (26). Although data are hard to find on trends over time in recruitment rates for controls, a trend toward decreasing participation among healthy potential controls could be anticipated. Many people use telephone answering machines to avoid unwanted solicitations. Although leaving a message on the answering machine has been shown to improve response rates (27), such machines do present a barrier that could become increasingly difficult to surmount, particularly if telemarketing continues to expand. Moreover, with recent press attention focused on privacy and the abuse of genetic and other types of information, aversion to genetic testing may grow. In the face of these obstacles, the key assumption that recruitability in relation to genotype and exposure is not differential by case status may become less tenable. If healthy people increasingly resist recruitment as controls into epidemiologic studies, family-based designs may become increasingly important tools in the 21st century for the epidemiologist who wishes to study joint effects of genes and exposures.