Ascertainment Bias in Family-based Case-Control Studies
Kimberly D. Siegmund and
Bryan Langholz
From the Department of Preventive Medicine, University of Southern California, 1540 Alcazar Street, Suite 220, Los Angeles, CA 90089-9011 (e-mail: kims@usc.edu)
ABSTRACT
In a family-matched case-control study, a population-based sample of cases is selected from a well-defined geographic region over a fixed period of time. For diseases of adult onset, the control is generally a sibling or cousin who is matched on sex and age without regard to location of residence. Such a design can lead to biased estimates of environmental relative risk if the prevalence of an environmental risk factor varies by the geographic region from which the cases and controls are drawn. However, assuming the independence of genotype and environmental exposure, the estimators for the gene and gene-environment interaction effects are consistent. This suggests that we must use caution in interpreting parameters that estimate environmental main effects from a family-based case-control study if controls are selected from outside the case-ascertainment region.
bias; case-control studies; family relations
Abbreviations:
USC, University of Southern California
INTRODUCTION
In genetic epidemiology, case-family member control designs have been proposed both for the search for new genes that predispose to disease and for gene characterization, including the study of the joint effects of genes and environment on disease risk. The use of family controls in studies involving genes has been motivated by the desire to control for confounding by mixed ethnicity, the problem of "population stratification" (1–5). In population-based case-family-control studies, the case is often identified through a disease registry or center (in the "case-ascertainment region"), and the family control is identified from the case. This family control may reside inside the region in which the cases are being ascertained, but this is not required. If the control does not reside within the case-ascertainment region, the "study base" principle is violated; this principle essentially requires that, conditional on the case-control set, the probability of ascertaining the set does not depend on which subject is the case.
We take as an example the University of Southern California (USC) Consortium Colorectal Cancer Family Registry, one of six international centers participating in the National Cancer Institute Collaborative Family Registry for Colorectal Cancer Studies. The USC Consortium is a multicenter study developing a population-based registry of colorectal cancer patients and their families. Its goal is to characterize cloned genes for colorectal cancer (e.g., MLH1, MSH2) and to detect new susceptibility genes (6). In a simplified description of the study design, a population-based sample of cases was ascertained from 1997 to 1999 in six regions: Minnesota; Colorado; Arizona; New Hampshire; Los Angeles County in southern California; and 33 counties in North Carolina. When available, one or more unaffected siblings and two unaffected cousins are selected as controls for each enrolled case. Siblings and cousins are identified as part of the interview with the case and are selected as controls on the basis of their age and gender. They may or may not reside in the regions covered by the registries. For study participants, blood samples and risk factor and food frequency questionnaires are collected to study the joint effects of genes and environment on the risk of colorectal cancer.
Data collected at the end of year 1 showed that at the time of recruitment, approximately 2 years after diagnosis for the case, the proportion of cases living in the ascertainment regions ranged from 92 to 100 percent, the proportion of matched, unaffected siblings living there ranged from 19 to 58 percent, and the proportion of matched, unaffected cousins living there ranged from 3 to 50 percent. In general, Arizona showed the smallest proportion of unaffected relatives living in the ascertainment region (3 percent of cousins and 19 percent of siblings), presumably because of people moving to Arizona for retirement or spending their winters there. Los Angeles County showed the next lowest proportions of unaffected relatives living in the sampling region (10 percent of cousins and 33 percent of siblings). We attribute this to the large immigrant population in Los Angeles. North Carolina, Colorado, and Minnesota all showed similar proportions of unaffected siblings residing in the ascertainment regions (approximately 50 percent). At the time of inquiry, there were no data available from New Hampshire.
These year 1 data show that a large proportion of the controls reside outside the case-ascertainment region and, strictly speaking, violate the study base principle (i.e., the case-control set would not have been ascertained by the study if the control had contracted colon cancer and the ascertained case had not).
In a study design in which cases must reside in the ascertainment region to be ascertained into the study, but there is no such requirement for the sibling controls, we artificially make residence in the ascertainment region a very strong risk factor for disease status. The consequence is that this artificial risk factor can now act as a "confounder" for the true exposure of interest. Hence, we can think of the ascertainment bias in terms of classical confounding. Relative risks for exposures that are positively (or negatively) correlated with ascertainment region will be confounded by ascertainment region, and the marginal relative risk will be inflated toward infinity (or deflated toward zero). However, methods for control of confounding are well known, so what about simply controlling for region of ascertainment in the analysis? There are two ways of doing this: stratifying, that is, including only pairs that are concordant for ascertainment-region status, or entering ascertainment region as a covariate in the analysis. Either way, only pairs in which the control sibling lives in the ascertainment region contribute to the analysis, which greatly restricts the number of usable matched case-control sets.
What does this suggest about ascertainment bias in the USC Consortium study of colorectal cancer? Within a case-family member set, genotype is unlikely to be related to geographic region; thus, we would not expect ascertainment bias in estimating the main effect of genes. However, to the extent that lifestyle risk factors for colorectal cancer, such as smoking, fruit and vegetable consumption, and exercise, vary by region, there is a potential for ascertainment bias in our estimates of environmental effects. Our example does not immediately suggest whether estimates of gene × environment interaction will be biased. We develop methods to study this question analytically and to calculate the sensitivity of our parameter estimates to heterogeneity in environmental exposure by geographic region.
MATERIALS AND METHODS
Let D denote disease status (1 = case, 0 = control), E denote the environmental exposure (1 = exposed, 0 = unexposed), and G the trait susceptibility genotype. For subjects who carry two bad copies of the gene, G is 1, and for those who carry no bad copies, G is 0. For carriers of one bad copy, G is 0, 1/2, or 1 depending on whether the trait is recessive, multiplicative (log-additive), or dominant. We use a logistic model for the probability of disease, $\operatorname{logit} \Pr(D = 1 \mid G, E) = \alpha + \beta_G G + \beta_E E + \beta_{GE} G E$. The coefficients $\beta = (\beta_G, \beta_E, \beta_{GE})$ measure the effects of gene, environment, and their interaction.
Let $r_{jk} = \exp(\beta_G G_{jk} + \beta_E E_{jk} + \beta_{GE} G_{jk} E_{jk})$ denote the odds ratio of disease for the kth subject in the jth matched set, where k = 1 for the case and k = 2 for the control. Assuming conditional independence of the disease outcomes in matched case-control pairs given the genotypes and environmental exposures, the conditional likelihood for N pairs is
$$L(\beta) = \prod_{j=1}^{N} \frac{r_{j1}}{r_{j1} + r_{j2}}.$$
We study the asymptotic bias of the regression coefficients, the difference between the asymptotic value of the maximum likelihood estimate and the true value, when the probability of exposure varies by geographic region. We estimate the regression coefficients using the score vector $U(\beta)$, with components $U(\beta_i) = \partial \ln L(\beta) / \partial \beta_i$, i = 1, ..., 3, and the information matrix $I(\beta)$, with ijth entry $-\partial^2 \ln L(\beta) / \partial \beta_i \partial \beta_j$. For instance, the score contribution of the first coefficient, $\beta_G$, from a single case-control set is
$$U_j(\beta_G) = G_{j1} - \frac{G_{j1} r_{j1} + G_{j2} r_{j2}}{r_{j1} + r_{j2}}.$$
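The conditional likelihood and score for 1:1 matched pairs can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' software: the function names, the toy pairs, and the coefficient values are hypothetical, and each pair is assumed to be stored as ((G, E) of the case, (G, E) of the control).

```python
import math

def odds(beta, g, e):
    """Odds ratio r = exp(bG*g + bE*e + bGE*g*e) for one subject."""
    b_g, b_e, b_ge = beta
    return math.exp(b_g * g + b_e * e + b_ge * g * e)

def log_lik(beta, pairs):
    """Conditional log-likelihood summed over matched case-control pairs."""
    total = 0.0
    for (g1, e1), (g2, e2) in pairs:
        r1, r2 = odds(beta, g1, e1), odds(beta, g2, e2)
        total += math.log(r1 / (r1 + r2))
    return total

def score(beta, pairs):
    """Analytic score vector: derivatives with respect to (bG, bE, bGE)."""
    u = [0.0, 0.0, 0.0]
    for (g1, e1), (g2, e2) in pairs:
        r1, r2 = odds(beta, g1, e1), odds(beta, g2, e2)
        x1 = (g1, e1, g1 * e1)  # covariates of the case
        x2 = (g2, e2, g2 * e2)  # covariates of the control
        for i in range(3):
            # case covariate minus its odds-weighted average within the pair
            u[i] += x1[i] - (x1[i] * r1 + x2[i] * r2) / (r1 + r2)
    return u

# Hypothetical toy data and coefficients, for illustration only
pairs = [((1, 1), (0, 0)), ((0, 1), (1, 0)), ((1, 0), (1, 1)), ((0, 0), (0, 1))]
beta = (0.2, -0.3, 0.1)
print(log_lik(beta, pairs), score(beta, pairs))
```

A central finite difference of `log_lik` should agree with `score` in each coordinate, which is a convenient check before passing both to a Newton-Raphson routine.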
We apply the Newton-Raphson algorithm to solve the expected score equations, taking the expectation of the score and information matrix over the distribution of genotypes and exposures, conditional on the ascertainment regions, familial relationship (R), and disease status of the family-matched case-control relative pair. Let e1 and e2 denote the environmental exposure for the case and control, respectively, g1 and g2 denote their respective genotypes (coded as described above), and d1 and d2 their disease statuses (d1 = 1, d2 = 0). Furthermore, let A1 and A2 denote the case and control ascertainment region, respectively (1 = case-ascertainment region, 0 = otherwise). Then the expected score for a family-matched case-control set in which the control is selected from outside the ascertainment region is
$$E[U(\beta)] = \sum_{g_1, g_2} \sum_{e_1, e_2} U(\beta; g_1, g_2, e_1, e_2) \Pr(g_1, g_2, e_1, e_2 \mid d_1 = 1, d_2 = 0, A_1 = 1, A_2 = 0, R).$$
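To make the expected-score calculation concrete, the sketch below solves the expected score equation for the exposure coefficient alone by Newton-Raphson. It is a simplified, hypothetical setup and not the authors' program: it assumes a binary exposure, no genetic effect, a true exposure effect of zero, independent exposures within a pair, and illustrative exposure prevalences of 0.30 for the case and 0.20 for the control.

```python
import math

def expected_score_and_info(beta_e, p_case, p_ctrl):
    """Expected score and information for beta_E over the four (e1, e2) cells."""
    u = info = 0.0
    for e1 in (0, 1):
        for e2 in (0, 1):
            # joint probability of the pair's exposures (assumed independent)
            pr = (p_case if e1 else 1 - p_case) * (p_ctrl if e2 else 1 - p_ctrl)
            r1, r2 = math.exp(beta_e * e1), math.exp(beta_e * e2)
            w1 = r1 / (r1 + r2)                # weight on the case
            mean = e1 * w1 + e2 * (1 - w1)     # odds-weighted mean exposure
            u += pr * (e1 - mean)
            info += pr * (e1 * e1 * w1 + e2 * e2 * (1 - w1) - mean ** 2)
    return u, info

def newton_raphson(p_case, p_ctrl, beta_e=0.0, tol=1e-12, max_iter=50):
    """Iterate beta <- beta + U / I until the step is negligible."""
    for _ in range(max_iter):
        u, info = expected_score_and_info(beta_e, p_case, p_ctrl)
        step = u / info
        beta_e += step
        if abs(step) < tol:
            break
    return beta_e

limit = newton_raphson(0.30, 0.20)
print(round(limit, 3))
```

Under these assumptions the iteration converges to the log of the ratio of discordant-pair probabilities, ln[(0.30 × 0.80)/(0.70 × 0.20)], which is about 0.539; because the true effect was set to zero, `limit` is the asymptotic bias for this configuration.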
The probability of the paired genotype depends on the genetic relation between the case and the family-matched control and is assumed unrelated to exposure and ascertainment region, $\Pr(g_1, g_2 \mid R)$. The probability of exposure is allowed to depend on the ascertainment region and is assumed independent of genotype and degree of relation. In computing genotype probabilities, the usual assumptions of Hardy-Weinberg equilibrium, random mating, and Mendelian inheritance are made. Let q denote the susceptibility gene frequency. Then the probability that a parent carries two copies of a susceptibility gene is $q^2$, the probability of one copy is $2q(1 - q)$, and that of zero copies is $(1 - q)^2$. For a case-sibling pair, we write
$$\Pr(g_1, g_2 \mid R) = \sum_{g_m, g_f} \Pr(g_1 \mid g_m, g_f) \Pr(g_2 \mid g_m, g_f) \Pr\nolimits_q(g_m) \Pr\nolimits_q(g_f),$$
where $g_m$ and $g_f$ denote the genotype of the mother and father, respectively, and $\Pr(g_i \mid g_m, g_f)$ is the Mendelian transmission probability, which takes on the values 0, 1/4, 1/2, or 1. For a case-cousin pair, the summation would be extended over the possible genotypes for the shared grandparents and the parents of the unaffected cousin.
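The sibling-pair genotype distribution described above can be enumerated directly. The following is a sketch under the stated assumptions (Hardy-Weinberg equilibrium, random mating, Mendelian inheritance); the function names and the allele frequency q = 0.1 are hypothetical, and genotypes are stored as the number of susceptibility alleles (0, 1, or 2), which would still need to be mapped to the coding of G for a given genetic model.

```python
from itertools import product

def hwe(q):
    """Hardy-Weinberg genotype frequencies keyed by number of risk alleles."""
    return {0: (1 - q) ** 2, 1: 2 * q * (1 - q), 2: q ** 2}

def transmission(child, mother, father):
    """Mendelian Pr(child genotype | parental genotypes)."""
    pm, pf = mother / 2, father / 2  # chance each parent passes a risk allele
    return {0: (1 - pm) * (1 - pf),
            1: pm * (1 - pf) + (1 - pm) * pf,
            2: pm * pf}[child]

def sib_pair_probs(q):
    """Joint genotype distribution of two siblings, summing over the parents."""
    freq = hwe(q)
    probs = {}
    for g1, g2 in product(range(3), repeat=2):
        probs[(g1, g2)] = sum(
            transmission(g1, gm, gf) * transmission(g2, gm, gf)
            * freq[gm] * freq[gf]
            for gm, gf in product(range(3), repeat=2))
    return probs

probs = sib_pair_probs(0.1)  # q = 0.1 is an illustrative value
for pair, p in sorted(probs.items()):
    print(pair, round(p, 6))
```

The nine probabilities sum to one, and each sibling's marginal distribution reproduces the Hardy-Weinberg frequencies, as expected under random mating.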
We begin by describing the distribution of three colorectal cancer risk factors (exercise, smoking, and fruit and vegetable consumption) and the distribution of prevalence by state, as reported by the Behavioral Risk Factor Surveillance System at the Centers for Disease Control and Prevention (http://apps.nccd.cdc.gov/brfss/index.asp). The variable exercise is defined as participating in regular activity at least five times each week for a period of at least 30 minutes (yes/no), smoking status is defined as current smoker (yes/no), and fruit and vegetable consumption is defined as eating a total of at least five fruits and vegetables each day (yes/no). Regular exercise and fruit and vegetable consumption are known to be protective against colorectal cancer, and smoking increases risk. Next, we study the asymptotic bias in our parameter estimates when the prevalence of exposure in the case-ascertainment region is k times that in the control ascertainment region (k =
-3). The bias is defined as the asymptotic value of the maximum likelihood estimate of the regression coefficient minus its true value.
RESULTS
Table 1 shows the average prevalence of exercise, smoking, and fruit and vegetable consumption in the United States in 1996 and the distribution of prevalence by state, as reported to the Behavioral Risk Factor Surveillance System at the Centers for Disease Control and Prevention. The study average is the average of prevalences across states, weighting each state prevalence by the anticipated number of colorectal cancer cases collected by the USC Consortium in that state. We note that there is some variability in exposure prevalences across the states, with the greatest variability occurring for the consumption of fruits and vegetables. The greatest consumption of fruits and vegetables is in California, with a prevalence nearly twice that of North Carolina and 35 percent above the national average. Because of the larger number of cases being sampled from states with higher prevalence of fruit and vegetable consumption, we anticipate the prevalence in our study to be nearly 20 percent higher than the national average. At the state level, North Carolina shows a high-risk profile across all three categories: the lowest prevalence of exercise and of fruit and vegetable consumption and the highest prevalence of current smokers. The healthiest lifestyles were in Colorado (for frequent exercise) and California (for the fewest smokers and the most fruit and vegetable eaters).
TABLE 1. Prevalence of exposures (%) by geographic region in 1996 provided by the Behavioral Risk Factor Surveillance System at the Centers for Disease Control and Prevention
We find that when the exposure prevalence differs across geographic regions (case-ascertainment region vs. otherwise), the maximum likelihood estimate of the log-odds ratio for exposure ($\beta_E$) is asymptotically biased, i.e., it will estimate the wrong quantity (figure 1). This bias depends on the prevalence of exposure in the two geographic regions. If the prevalence of exposure is greater in the case-ascertainment region, the bias will be positive. The bias is negative if the exposure prevalence is lower in the case-ascertainment region. When the prevalence in the two regions is the same, the estimate of $\beta_E$ is unbiased. For instance, suppose that the prevalence of exposure in the case-ascertainment region is 30 percent and the prevalence outside the case-ascertainment region is only 20 percent. The prevalence of exposure is 1.5 times higher in the case-ascertainment region than outside that region (k = 1.5). For this scenario, the bias in $\beta_E$ is 0.539, and the estimated odds ratio for an exposure that is unrelated to disease is 1.7 (= exp(0.539)).
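The size of this bias can be checked by hand in the simplest setting. As a sketch, assume a binary exposure, no genetic effect, a true exposure effect of zero, and exposures that are independent within a pair; the symbols $p_1$ and $p_0$ (exposure prevalence inside and outside the case-ascertainment region) are introduced here for illustration. The conditional maximum likelihood estimate of the odds ratio in a 1:1 matched design is the ratio of the two kinds of exposure-discordant pairs, so its limit is the ratio of their probabilities:

```latex
\hat{\beta}_E \;\longrightarrow\;
\ln \frac{\Pr(e_1 = 1, e_2 = 0)}{\Pr(e_1 = 0, e_2 = 1)}
= \ln \frac{p_1 (1 - p_0)}{(1 - p_1)\, p_0}
= \ln \frac{0.30 \times 0.80}{0.70 \times 0.20}
= \ln 1.714 \approx 0.539 .
```

Under these assumptions the bias is simply the log odds ratio of exposure between the two regions, which reproduces the value quoted above.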

View larger version (16K):
[in this window]
[in a new window]
|
FIGURE 1. Asymptotic bias for ße when ßg = ße = ßge = 0. The x-axis is the relative prevalence of exposure in the case-ascertainment region relative to outside the case-ascertainment region.
|
|
The bias in $\beta_E$ is constant, regardless of the baseline rate of disease or the strength of the association between the exposure and disease ($\beta_E$). This corresponds to a constant multiplicative effect on the true odds ratio. Suppose the exposure is protective and the true odds ratio is 0.5 ($\beta_E$ = -0.693). The estimate of the odds ratio under the example described above would be 0.5 × 1.71 = 0.86 (= exp(-0.693 + 0.539)). If the true odds ratio is 2.0 ($\beta_E$ = 0.693), the estimated odds ratio would be 2.0 × 1.71 = 3.4 (= exp(0.693 + 0.539)). Interestingly, even though estimates of $\beta_E$ can be asymptotically biased, under a rare disease model, the estimates of $\beta_G$ and $\beta_{GE}$ are unbiased so long as genotype is independent of exposure.
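The constant multiplicative distortion can be illustrated numerically. The sketch below is not the authors' calculation: it assumes a binary exposure, no genetic effect, a rare disease (so that the odds ratio approximates the risk ratio and unaffected controls have the regional exposure prevalence), and independent exposures within a pair. The function names are hypothetical; the prevalences 0.30 and 0.20 are those of the example above.

```python
def case_exposure_prob(p_region, true_or):
    """Exposure prevalence among cases, using the rare-disease approximation."""
    return p_region * true_or / (p_region * true_or + 1 - p_region)

def asymptotic_or(p_case_region, p_outside, true_or):
    """Limit of the matched-pair odds ratio: ratio of discordant-pair probabilities."""
    pc = case_exposure_prob(p_case_region, true_or)
    return (pc * (1 - p_outside)) / ((1 - pc) * p_outside)

for true_or in (0.5, 1.0, 2.0):
    est = asymptotic_or(0.30, 0.20, true_or)
    print(f"true OR {true_or:.1f} -> estimated OR {est:.2f}, "
          f"ratio {est / true_or:.3f}")
```

For each true odds ratio, the ratio of estimated to true value is the same factor (about 1.71 here), giving estimated odds ratios of about 0.86 and 3.4 for true values of 0.5 and 2.0.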
DISCUSSION
Family-based case-control studies have been proposed for characterizing the joint effects of genes and environment. It is important to recognize that selecting family-based controls who live in a different geographic region from the case may result in biased estimates of disease-exposure association if the prevalence of the exposure also varies by ascertainment region of cases and controls. As a result, the genotype-specific estimates of the environmental effect can be incorrect. However, in a multiplicative odds model, if genotype is independently distributed from environmental exposure and the rare disease assumption holds, we find that the ratio of the odds ratio for exposure in gene carriers to that in noncarriers will estimate the correct quantity. More simply stated, the study of gene-environment interaction remains unbiased.
We describe the asymptotic bias that can be introduced from the very simple situation of cases and controls sampled from two geographic regions with different exposure prevalence or baseline rates of disease. All cases come from one region (case-ascertainment region), and all controls come from the second region (outside the case-ascertainment region). The extent to which bias is introduced in our sample depends on the number of pairs that are not matched on region and the variability in exposure prevalence by geographic region. The possible differential ascertainment of cases and controls based on region of residence has been noted previously by Weinberg and Umbach (5). For convenience, we assumed that there is no familial aggregation of exposure among family-matched sets.
The concern for ascertainment bias is greater for diseases with adult onset, in which cases and controls are likely to live in different regions. It will also depend on the actual ascertainment region. For example, in the USC Consortium study, the proportions of case-sibling pairs in which the sibling lives in the ascertainment region are much lower for Arizona and Los Angeles County than for Minnesota or North Carolina. Therefore, fewer pairs ascertained in Los Angeles County and Arizona will be matched on region. However, by designing a multicenter study in which some cases are from regions with higher exposure prevalences than the average and some are from regions with lower exposure prevalences than the average, we may expect the biases across centers to offset one another.
To minimize ascertainment bias, one might additionally match on geographic region, but this could lead to the exclusion of a large number of cases. Alternatively, one could opt for a population-based case-control design and attempt to correct for unobserved population structure in the analysis. Methods have recently been proposed for such an analysis (7).
The most important result to the study investigators of the USC Consortium was that we found no bias in studies of gene-environment interaction as long as genotype is independent of exposure. This is true if the gene we are studying is unrelated to exposure. Therefore, we ask ourselves whether, in case-control studies using unrelated population controls, estimates of gene-environment interaction effects are unbiased in the presence of population stratification (confounding due to ethnicity). Here, it is not ascertainment bias that can distort our estimates of gene main effects, but actual confounding due to different allele frequencies between two subpopulations from which our cases and controls are drawn. For example, suppose there exist two subpopulations in our ascertainment region, each having a different susceptibility allele frequency at a given genetic locus. Furthermore, there exists a second unobserved risk factor for disease in which the exposure frequency also varies by subpopulation. Then, subpopulation acts as a surrogate for the unobserved risk factor, and we have the situation of classical confounding in which the exposure and disease prevalence differ by subpopulation. As a result, in a study of the effects of genotype on disease, if we do not control for subpopulation we will obtain biased estimates of our main effect of genotype. However, as long as there is no interaction between the unobserved risk factor and exposure and exposure is independent of genotype and subpopulation, the true gene-environment interaction effect remains unbiased (Appendix). This is similar to the result we found in studying the ascertainment bias problem. Although the assumptions may not always hold, it is comforting to know that biased estimates of genetic or environmental effects alone do not necessarily distort interaction effects.
This paper has focused on the analysis of 1:1 matched sets for characterizing a causal gene. Suppose instead that we have a sample of sibships with variable numbers of cases and/or controls and we have measured a marker genotype that is not functional but is in linkage disequilibrium with a functional variant. Then the conditional independence assumption of disease status given genotype required by conditional logistic regression is violated, and the variance of the coefficients will be underestimated. For this situation, Siegmund et al. (8) proposed a jackknife estimate of variance that provides robust standard errors that are valid for sibships of arbitrary size. Kraft (9) proposed the sandwich estimator of variance from the score equation. A simulation study showed that the sandwich estimator of variance provided smaller variance estimates than the jackknife for the analysis of data from multiple-case sibships when the odds ratios are large (
4) (10). In addition, we have simplified our discussion of the USC Consortium sampling design to a random sample of cases in the population. In fact, cases were stratified based on family history of disease, and cases with a positive family history of disease were oversampled. As a result, the proper analysis for the study is conditional logistic regression with the sampling fractions used as offsets (11).
Finally, the choice of controls in a case-control study will always depend on the study hypotheses. Our primary aim is simply to caution the genetic epidemiologist against interpreting main effects of environmental exposures from family-based studies without first checking in their analysis for possible ascertainment bias.
APPENDIX
For the analysis of studies using unrelated population controls, we describe the effect of confounding on our estimate of gene-environment interaction using standard unconditional logistic regression. Let C be a factor that is associated with both gene G and disease but, like G, is independent of the exposure E. Further, assume a logistic model for the effects of C, G, and E of the form
$$\operatorname{logit} \Pr(D = 1 \mid C, G, E) = \alpha + \beta_C C + \beta_G G + \beta_E E + \beta_{GC} G C + \beta_{GE} G E,$$
i.e., there can be interactive effects between C and G but not between C and E. Now, when C is not known, the probability of disease conditional only on G and E is obtained by taking the expectation over C,
$$\Pr(D = 1 \mid G, E) = E[\Pr(D = 1 \mid C, G, E) \mid G, E].$$
When disease is rare, the odds are very close to the probability, so taking the expectation leads to the "induced" logistic model
$$\frac{\Pr(D = 1 \mid G, E)}{1 - \Pr(D = 1 \mid G, E)} \approx \exp(\alpha + \beta_G G + \beta_E E + \beta_{GE} G E)\, E[\exp(\beta_C C + \beta_{GC} G C) \mid G].$$
Now, $E[\exp(\beta_C C + \beta_{GC} G C) \mid G]$ takes on two values depending on whether G is 0 or 1, say $\mu_0$ and $\mu_1$, so that the above can be written
$$\operatorname{logit} \Pr(D = 1 \mid G, E) \approx (\alpha + \ln \mu_0) + \{\beta_G + \ln(\mu_1 / \mu_0)\} G + \beta_E E + \beta_{GE} G E.$$
Standard logistic regression methods will thus be estimating the parameters in this induced odds model; the intercept and G main effect will be "biased," but the E and G × E effects will be closely estimated (up to the rare disease assumption).
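The induced model can be checked numerically by averaging the full model over C and reading off the log-odds contrasts in the four (G, E) cells. The sketch below assumes binary C, G, and E; every parameter value and the distribution of C given G are hypothetical choices made for illustration (with a low intercept so that the disease is rare), not quantities from the study.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical parameter values; ALPHA is low so that the disease is rare
ALPHA, B_C, B_G, B_E, B_GC, B_GE = -9.0, 1.0, 0.5, 0.4, 0.3, 0.7
# Pr(C = 1 | G): C is associated with G but independent of E
PR_C_GIVEN_G = {0: 0.2, 1: 0.6}

def pr_c(c, g):
    return PR_C_GIVEN_G[g] if c else 1 - PR_C_GIVEN_G[g]

def marginal_risk(g, e):
    """Pr(D = 1 | G, E), averaging the full logistic model over unobserved C."""
    return sum(
        pr_c(c, g) * expit(ALPHA + B_C * c + B_G * g + B_E * e
                           + B_GC * g * c + B_GE * g * e)
        for c in (0, 1))

log_odds = {(g, e): logit(marginal_risk(g, e)) for g in (0, 1) for e in (0, 1)}
induced_g = log_odds[1, 0] - log_odds[0, 0]
induced_e = log_odds[0, 1] - log_odds[0, 0]
induced_ge = (log_odds[1, 1] - log_odds[1, 0]
              - log_odds[0, 1] + log_odds[0, 0])

# mu_g = E[exp(B_C*C + B_GC*g*C) | G = g], as in the induced model
mu = {g: sum(pr_c(c, g) * math.exp(B_C * c + B_GC * g * c) for c in (0, 1))
      for g in (0, 1)}
predicted_g = B_G + math.log(mu[1] / mu[0])

print(f"G:   induced {induced_g:.3f}, predicted {predicted_g:.3f}, true {B_G}")
print(f"E:   induced {induced_e:.3f}, true {B_E}")
print(f"GxE: induced {induced_ge:.3f}, true {B_GE}")
```

With these illustrative values, the induced G main effect is shifted away from its true value by roughly ln(mu_1/mu_0), while the E and G × E contrasts stay close to their true values, in line with the argument above.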
ACKNOWLEDGMENTS
Supported by National Cancer Institute grants CA-52862, CA-78296, GM-58897, and ES-10421.
The authors thank the investigators of the USC Consortium for the use of their data and for their help in obtaining the year 1 estimates.
NOTES
(Reprint requests to Dr. Kimberly D. Siegmund at this address.)
REFERENCES
1. Self SG, Longton G, Kopecky KJ, et al. On estimating HLA/disease association with application to a study of aplastic anemia. Biometrics 1991;47:53–61.
2. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 1993;52:506–16.
3. Spielman RS, Ewens WJ. A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test. Am J Hum Genet 1998;62:450–8.
4. Siegmund KD, Gauderman WJ, Thomas DC. Association tests using unaffected-sibling versus pseudo-sibling controls. Genet Epidemiol 1999;17(suppl 1):S731–6.
5. Weinberg CR, Umbach DM. Choosing a retrospective design to assess joint genetic and environmental contributions to risk. Am J Epidemiol 2000;152:197–203.
6. Haile RW, Siegmund KD, Gauderman WJ, et al. Study-design issues in the development of the University of Southern California Consortium's Colorectal Cancer Family Registry. J Natl Cancer Inst Monogr 1999;26:89–93.
7. Pritchard JK, Stephens M, Rosenberg NA, et al. Association mapping in structured populations. Am J Hum Genet 2000;67:170–81.
8. Siegmund KD, Langholz B, Kraft P, et al. Testing linkage disequilibrium in sibships. Am J Hum Genet 2000;67:244–8.
9. Kraft P. A robust score test for linkage disequilibrium in general pedigrees. Genet Epidemiol 2001;21:S447–52.
10. Kraft P, Siegmund KD, Langholz B. Testing linkage disequilibrium in sibships using conditional logistic regression with robust variance estimators. (Abstract). Genet Epidemiol 2000;19:257.
11. Siegmund KD, Langholz B. Stratified case sampling and the use of family controls. Genet Epidemiol 2001;20:316–27.
Received for publication June 4, 2001.
Accepted for publication November 27, 2001.