From the Graduate Institute of Epidemiology, College of Public Health, National Taiwan University, Taipei, Taiwan, Republic of China.
Received for publication August 7, 2002; accepted for publication June 10, 2003.
ABSTRACT |
---|
epidemiologic methods; genetics; polymorphism, single nucleotide
Abbreviations: df, degree of freedom; MDT, mating disequilibrium test; PDT, pedigree disequilibrium test; TDT, transmission/disequilibrium test.
INTRODUCTION |
---|
It has been argued that the genetic analysis of complex human diseases will rely more and more on the epidemiologic association paradigm (1, 2). In particular, the application of the transmission/disequilibrium test (TDT) in a case-parents study has received much attention (3, 4). For a marker with two alleles, the TDT compares the number of heterozygous parents who transmit one allele with the number of heterozygous parents who transmit the other allele to the affected offspring. Because the comparison is within family, it is not affected by population stratification, which can produce an excess of false positive results in a conventional case-control study (4).
Although it successfully removes the population stratification bias, using parents as the control group can create a problem of its own. Parents may have died already, making it impossible to genotype them. This is particularly true when the disease under study has an age at onset in adulthood or later, such as non-insulin-dependent diabetes, cardiovascular diseases, Alzheimer's disease, many forms of cancer, and so on. Without parental genotypes, one cannot trace the transmission of the alleles from parents down to their offspring. Assuming noninformative missingness, several authors (5–7) have proposed methods to tackle this missing-parent problem. However, the assumption can be violated in several ways (5–8).
A case-sibling study using siblings as the control group is an alternative design option (9–11). It is true that siblings are more often still alive than parents are when cases are recruited. However, siblings of an adult case normally do not live together with the case, making it difficult to recruit them for study. Furthermore, it is possible that some of the cases in the study do not have siblings at all.
In this paper, I propose two alternatives, the case-spouse and the case-offspring designs, that recruit the spouses and the offspring, respectively, as the control group. These designs are particularly useful for genetic study of adult-onset diseases, because of the ease in recruiting the control subjects; an adult normally will get married and live together with his/her mate and, if any, with their child(ren). A new test, the mating disequilibrium test (MDT), is proposed to analyze the case-spouse and case-offspring data. The conditions for an MDT to be a valid test for genetic association with a disease-susceptibility gene are discussed and examined through computer simulation. (In this paper, a disease-susceptibility gene refers to a gene that will by itself influence the risk of disease, or that will predispose a subject to risk factor(s) of the disease and thereby indirectly influence the disease risk.) Finally, a power formula for a genome-wide scan using case-spouse and case-offspring designs is given, and the number of families required is compared with that required using the case-parents design.
THE CASE-SPOUSE AND THE CASE-OFFSPRING DESIGNS |
---|
Suppose that a sample of n cases (i = 1, ..., n) has been recruited. Genotyping has been done at a particular marker locus with two alleles, M and m. (Note that this paper does not consider markers on the sex chromosomes.) For the ith case, the M-allele count is denoted as $C_i$. The spouse of the ith case, if available for study, is also genotyped, with an M-allele count of $S_i$. If the spouse of the ith case is missing but his/her offspring is/are available for genotyping, one calculates $O_i$, the average M-allele count of the offspring of the ith case, and then imputes the M-allele count of the missing spouse by $\hat{S}_i = 2O_i - C_i$. (The imputation is based on the obvious fact that $E(O_i \mid C_i, S_i) = (C_i + S_i)/2$. Note that $\hat{S}_i$ defined in this way can sometimes become negative and should not be reset to zero should that happen.) The difference in the M-allele count between the ith case and his/her spouse (or the imputed spouse) is denoted as $D_i = C_i - S_i$ (or $D_i = C_i - \hat{S}_i$ when the spouse is imputed). The $D_i$ values are the basic data to be analyzed.
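The construction of the differences can be sketched in a few lines of Python. This is only an illustration: the function name `allele_difference` and its arguments are hypothetical, and the imputation rule is assumed to be twice the offspring average minus the case count, which follows from an offspring's expected M-allele count being the average of its parents' counts.

```python
from statistics import mean

def allele_difference(case_count, spouse_count=None, offspring_counts=None):
    """Return the case-minus-spouse M-allele difference for one family.

    If the spouse genotype is missing, the spouse count is imputed from the
    average offspring count (assumed rule: 2 * average - case count).
    """
    if spouse_count is not None:
        return case_count - spouse_count
    if not offspring_counts:
        raise ValueError("need a spouse genotype or at least one offspring genotype")
    offspring_average = mean(offspring_counts)
    # The imputed count may be negative; it is deliberately not reset to zero.
    imputed_spouse = 2 * offspring_average - case_count
    return case_count - imputed_spouse

# Case with genotype MM (count 2) and a genotyped Mm spouse (count 1).
print(allele_difference(2, spouse_count=1))
# Case Mm (count 1), spouse missing, two mm offspring: imputed spouse is -1.
print(allele_difference(1, offspring_counts=[0, 0]))
```

The second call shows the situation mentioned in the text, where the imputed spouse count is negative and is left as it is.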
The following two assumptions are invoked in the case-spouse and the case-offspring studies. 1) The marker genotype frequencies at conception should be the same for both sexes. This assumption is likely to be met in practice because of random segregation of sex chromosomes and autosomes. 2) There is no selective attrition of marker allele(s) through gestation and over time. In other words, the marker studied should not be in linkage disequilibrium with a gene, or be a gene itself, that affects survival through gestation or over time. This is the same assumption invoked in the case-parents and the case-sibling studies (12).
If the validity of the assumptions is a concern, one should check the genotypes of the unaffected individuals recruited in a study (the sibling in the case-sibling study, the spouse in the case-spouse study, and the offspring in the case-offspring study) to see if the frequencies vary over sex or age.
Under the null hypothesis that the marker is not genetically associated with the disease in question (by genetic association, we mean that the marker is in linkage disequilibrium with a disease-susceptibility gene or that the marker is a disease-susceptibility gene itself), the expected value of Di will be zero if both assumptions are met and if the study population is a homogeneous population or a stratified population but mating is restricted to subjects in the same stratum.
Note that the above assumptions suffice to ensure that the spouse and the offspring of a case are his/her legitimate controls. The expected value of Di will be zero, even if only one sex can get the disease, even if cases/spouses/offspring all have different risk factor profiles, and even under assortative mating.
THE MATING DISEQUILIBRIUM TEST |
---|
With the assumptions stated above, the following $X^2$ statistic is a valid test for genetic association with a disease-susceptibility gene:

$$X^2 = \frac{\left(\sum_{i=1}^{n} D_i\right)^2}{\sum_{i=1}^{n} D_i^2}.$$

Under the null hypothesis, $X^2$ asymptotically follows a chi-square distribution with 1 degree of freedom (df). The MDT is based on this statistic.
The data of a case-spouse study have the same structure as the data of a pair-matched case-control study, and they can alternatively be analyzed using a logistic regression based on the Di values (13). It can be shown that the above X2 is the efficient score statistic of such a logistic model. If the data consist exclusively of pairs of cases and their single offspring, it can also be shown that the above X2 is algebraically the same as the 1-TDT statistic proposed by Sun et al. (5), except that the 1-offspring now plays the role of the 1-parent in the 1-TDT.
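A minimal sketch of the test in Python follows. It assumes that the statistic is the squared sum of the differences divided by the sum of their squares, which is the form of the efficient score statistic for a matched-pair logistic model; the function name `mdt` is hypothetical, and the p value uses the 1-df chi-square tail written with the complementary error function.

```python
import math

def mdt(differences):
    """Return (X2, p value) for a list of case-minus-spouse differences.

    Assumed form: X2 = (sum of D)^2 / (sum of D^2), referred to a
    chi-square distribution with 1 degree of freedom.
    """
    total = sum(differences)
    sum_of_squares = sum(d * d for d in differences)
    if sum_of_squares == 0:
        raise ValueError("all differences are zero; the statistic is undefined")
    x2 = total * total / sum_of_squares
    # Upper tail of chi-square(1): P(Z^2 > x2) = erfc(sqrt(x2 / 2)).
    p_value = math.erfc(math.sqrt(x2 / 2.0))
    return x2, p_value

x2, p = mdt([1, 1, 0, -1, 2, 1, 0, 1])
print(x2, p)
```

Pairs with a difference of zero add nothing to either sum, so only discordant pairs carry information, as in a matched case-control analysis.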
Monte-Carlo simulation was performed to study the empirical type I error rates of the MDT. The study population was assumed to be composed of two strata (the first stratum constitutes 40 percent, and the second, 60 percent). The two strata do not intermix. Disease prevalences were assumed to be different between the sexes. In the first stratum, the prevalences were $3 \times 10^{-5}$ for males and $2 \times 10^{-6}$ for females. In the second stratum, the prevalences were $3 \times 10^{-5} \cdot r$ for males and $2 \times 10^{-6} \cdot r$ for females, where r is the disease prevalence ratio between the second and the first strata. The effect of varying the extent of population stratification was examined for r = 1, ..., 10. In each round of simulation, the M-allele frequencies for the first and the second strata were generated by taking two numbers at random from the interval (0.05, 0.95). (This represents an average of 0.3 in the absolute allele-frequency difference between these two strata.) Both random mating and assortative mating within the stratum were considered. For the assortative mating, 90 percent of the subjects in a given stratum performed random mating, and the remaining 10 percent performed mating strictly within the same genotype. (To make sense of this contrived scenario, one can think of a genetic marker that is associated with, e.g., body height. In the population, 90 percent of the subjects are not choosy about the physique of their potential mates. Yet, the remaining 10 percent won't mate unless they find someone with similar height.) A total of 200 cases were recruited from the population at large. Because the disease prevalence is very low, it was assumed that these cases were from different families. Control subjects were recruited according to the conventional case-control design (200 control subjects), the case-spouse design, and the case-offspring design (three offspring per family), respectively. The Armitage trend test (14) was applied for the case-control design. (Sasieni (15) has pointed out that the usual Pearson chi-square statistic comparing allele frequencies between cases and controls is inappropriate when Hardy-Weinberg equilibrium does not hold.) The MDT was applied for the case-spouse and the case-offspring designs. The nominal levels were set in turn at 0.05 and 0.01. Ten thousand simulations were performed for each scenario created.
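A much-reduced version of this simulation can be sketched as follows. It is not the simulation program used for the paper: it covers only the case-spouse design with random mating within the stratum, ignores sex (sex-specific prevalences cancel under the null when computing which stratum a case comes from), and assumes the statistic is the squared sum of differences over the sum of squared differences. All function names are hypothetical.

```python
import math
import random

def mdt_p_value(diffs):
    """P value of the MDT, assuming X2 = (sum D)^2 / sum D^2 on 1 df."""
    sum_of_squares = sum(d * d for d in diffs)
    if sum_of_squares == 0:
        return 1.0  # no discordant pairs, so nothing to test
    x2 = sum(diffs) ** 2 / sum_of_squares
    return math.erfc(math.sqrt(x2 / 2.0))

def draw_genotype(rng, freq):
    """M-allele count (0, 1, or 2) under Hardy-Weinberg proportions."""
    return (rng.random() < freq) + (rng.random() < freq)

def type1_error(r, n_cases=200, n_sims=2000, alpha=0.05, seed=1):
    """Empirical type I error of the case-spouse MDT for prevalence ratio r."""
    rng = random.Random(seed)
    # Under the null, a case comes from stratum 2 with probability
    # proportional to stratum size (0.6) times relative prevalence (r).
    prob_stratum2 = 0.6 * r / (0.4 + 0.6 * r)
    rejections = 0
    for _ in range(n_sims):
        freqs = (rng.uniform(0.05, 0.95), rng.uniform(0.05, 0.95))
        diffs = []
        for _ in range(n_cases):
            freq = freqs[1] if rng.random() < prob_stratum2 else freqs[0]
            # The spouse is drawn from the same stratum as the case.
            diffs.append(draw_genotype(rng, freq) - draw_genotype(rng, freq))
        rejections += mdt_p_value(diffs) < alpha
    return rejections / n_sims

if __name__ == "__main__":
    print(type1_error(r=5))
```

Because case and spouse share a stratum, their genotypes have the same distribution whatever the value of r, which is why the rejection rate in this sketch should stay near the nominal level.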
Figure 1 presents the results of random mating within the stratum. It can be seen that the case-control study produces grossly inflated type I error rates for the disease prevalence ratio of >1 (figure 1, part A), whereas the case-spouse (figure 1, part B) and case-offspring (figure 1, part C) designs maintain the nominal levels in the range of disease prevalence ratios that were studied. The same conclusion can be drawn when there is assortative mating within the stratum (results not shown).
CORRECTION FOR POPULATION ADMIXTURE |
---|
I suggest using the principle of multiplicative scaling of the chi-square distribution proposed by Reich and Goldstein (16) to correct the MDT statistic. (Their method was proposed originally to correct the allelic chi-square statistic of a case-control design, under the model of "population stratification.") To be precise, a number of markers unlinked to (or in linkage equilibrium with) the candidate marker are also genotyped in the same set of cases and spouses (offspring). (These "null markers" are to be chosen at random throughout the whole genome, so that it is unlikely that any one is tightly linked to a disease-susceptibility gene.) The average of the MDT statistics across the null markers provides a measure of the amount of admixture. By dividing the candidate MDT by this average (the principle of multiplicative scaling of the chi-square distribution (16)), one can obtain a p value that corrects for admixture.
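The correction itself is a single division, sketched below. The function names are hypothetical, the statistic is assumed to be the squared sum of differences over the sum of squared differences, and the corrected value is referred to a 1-df chi-square distribution as an approximation.

```python
import math
from statistics import mean

def mdt_statistic(diffs):
    """Assumed form of the MDT statistic: (sum D)^2 / sum D^2."""
    sum_of_squares = sum(d * d for d in diffs)
    if sum_of_squares == 0:
        raise ValueError("all differences are zero; the statistic is undefined")
    return sum(diffs) ** 2 / sum_of_squares

def admixture_corrected_mdt(candidate_diffs, null_marker_diffs):
    """Divide the candidate MDT by the average MDT of the null markers.

    null_marker_diffs holds one list of differences per null marker, all
    computed on the same families as the candidate marker.
    """
    scale = mean([mdt_statistic(diffs) for diffs in null_marker_diffs])
    if scale == 0:
        raise ValueError("null markers give no scale estimate")
    corrected = mdt_statistic(candidate_diffs) / scale
    p_value = math.erfc(math.sqrt(corrected / 2.0))
    return corrected, p_value

candidate = [1, 1, 1, 1]
null_markers = [[1, 1, 0, 0], [1, -1, 1, 1]]
print(admixture_corrected_mdt(candidate, null_markers))
```

With only two null markers the scale estimate is very noisy; the simulations described in the text use 50.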
Monte-Carlo simulation was performed to study the effectiveness of such an admixture correction. The two-strata population in the previous section was considered again (with a disease prevalence ratio of 5). This time, however, varying degrees of admixing between strata were allowed (admixture proportions of 0.0, 0.1, ..., 1.0 were studied). An admixture proportion of 0.0 implies random mating within the stratum but no intermarriage between strata. At the other extreme of an admixture proportion of 1.0, there is random mating in the population at large. In the middle, for example, an admixture proportion of 0.3, 30 percent of the population mate randomly without regard to population stratification, and the remaining 70 percent mate only within the stratum. In addition to the candidate marker, a total of 50 null markers were typed. The allele frequencies for the candidate marker as well as for the null markers in the first and the second strata were generated by taking random numbers from (0.05, 0.95) in each round of simulation (one pair of numbers for the candidate marker plus 50 pairs of numbers for the 50 null markers within each simulation). For the case-control study, the Armitage trend statistic of the candidate marker was divided by the average of the same statistics of the 50 null markers. For the case-spouse and the case-offspring studies, the candidate-marker MDT was divided by the average MDT of the 50 null markers. All the other simulation settings are the same as in the previous section.
Figure 2 presents the results without admixture correction (candidate-marker Armitage trend test and MDT, without being divided by the corresponding statistics of the null markers). It can be seen that the case-control study (figure 2, part A) produces grossly inflated type I error rates, irrespective of the admixture proportion in the population. Without admixture correction, the case-spouse (figure 2, part B) and the case-offspring (figure 2, part C) studies also produce inflated type I error rates. The inflation worsens as the admixture proportion becomes larger.
POWER FORMULA AND NUMBER OF FAMILIES REQUIRED |
---|
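Let us consider the signed square root of the MDT statistic,

$$X = \frac{\sum_{i=1}^{n} D_i}{\sqrt{\sum_{i=1}^{n} D_i^2}}.$$

The distribution of X (under either H0 or H1) can be approximated by a normal distribution. Using the multivariate delta method (19), one can show that, for large n, such a distribution has a mean of

$$\mu_X = \sqrt{n}\,\frac{m_1}{\sqrt{m_2}}$$

and a variance of

$$\sigma_X^2 = \frac{m_2 - m_1^2}{m_2} - \frac{m_1\,(m_3 - m_1 m_2)}{m_2^2} + \frac{m_1^2\,(m_4 - m_2^2)}{4 m_2^3},$$

where $m_k = E(D_i^k)$ for k = 1, ..., 4. Let $z_w$ denote the w quantile of a standard normal distribution, and let $\Phi$ denote its distribution function. Then, the power of the two-sided test at level $\alpha$ is approximately

$$1 - \Phi\left(\frac{z_{1-\alpha/2} - \mu_X}{\sigma_X}\right) + \Phi\left(\frac{-z_{1-\alpha/2} - \mu_X}{\sigma_X}\right).$$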
To compare the powers of the MDT in the case-spouse (case-offspring) design and the TDT in the case-parents design, one assumes that the marker is a disease-susceptibility gene per se (two alleles, D and d) and considers the same modes of inheritance as used by Knapp (17) ($\psi_1$: genotype relative risk of Dd over dd; $\psi_2$: DD over dd): 1) the multiplicative model ($\psi_1 = \gamma$ and $\psi_2 = \gamma^2$), 2) the additive model ($\psi_1 = \gamma$ and $\psi_2 = 2\gamma$, according to Camp's definition (20) of additive mode of inheritance for the sake of comparability), 3) the recessive model ($\psi_1 = 1$ and $\psi_2 = \gamma$), and 4) the dominant model ($\psi_1 = \psi_2 = \gamma$). The test was two sided with the $\alpha$ level set at $10^{-7}$. This corresponds to $\alpha = 5 \times 10^{-8}$ for the genomewide one-sided TDTs used by Risch and Merikangas (1). (If allele D is positively associated with the disease and $\alpha$ is small, the power of one-sided TDT for allele D with a type I error rate of $\alpha$ is very near the power of two-sided TDT with a type I error rate of $2\alpha$.) For each combination of mode of inheritance, risk parameter $\gamma$ ($\gamma$ = 1.5, 2, 4), and allele frequency P of D in the source population (P = 0.01, 0.1, 0.5, 0.8), the numbers of families required to achieve 80 percent power for the MDT in a case-spouse design and a case-offspring (number of offspring: 1, ..., 5) design were calculated by solving the above power formula using a bisection method (a root-finding method (21)). To check the precision of the power approximation, 100,000 simulated data sets at the above-calculated sample sizes were generated. For each round of simulation, the MDT was calculated, and the true power was estimated as the proportion of simulations rejecting the null hypothesis at $\alpha = 10^{-7}$. (The sample size for the case-spouse design can be calculated alternatively using the method of Julious and Campbell (22) for matched ordinal data. However, simulation shows that the method will sometimes lead to a gross underestimation of sample size (results not shown).)
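The sample-size search for the case-spouse design can be sketched as below. This is an illustration under stated assumptions rather than the program used for the tables: the test statistic is taken to be the sum of the differences divided by the square root of the sum of their squares, its mean and variance come from a first-order delta-method approximation, genotypes follow Hardy-Weinberg proportions with random mating, and the names `difference_moments`, `mdt_power`, `families_required`, `psi1`, and `psi2` are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def difference_moments(p, psi1, psi2):
    """Raw moments E(D^k), k = 1..4, of D = case count - spouse count.

    p is the frequency of the risk allele; psi1 and psi2 are the genotype
    relative risks of carrying one and two copies, respectively.
    """
    q = 1.0 - p
    spouse = [q * q, 2 * p * q, p * p]  # Hardy-Weinberg proportions
    weights = [spouse[0], psi1 * spouse[1], psi2 * spouse[2]]
    total = sum(weights)
    case = [w / total for w in weights]  # genotype distribution among cases
    return [
        sum(case[c] * spouse[s] * (c - s) ** k for c in range(3) for s in range(3))
        for k in (1, 2, 3, 4)
    ]

def mdt_power(n, moments, alpha=1e-7):
    """Approximate two-sided power with n case-spouse pairs."""
    m1, m2, m3, m4 = moments
    mean_x = n ** 0.5 * m1 / m2 ** 0.5
    var_x = ((m2 - m1 * m1) / m2
             - m1 * (m3 - m1 * m2) / m2 ** 2
             + m1 * m1 * (m4 - m2 * m2) / (4 * m2 ** 3))
    sd_x = var_x ** 0.5
    z = -STD_NORMAL.inv_cdf(alpha / 2)
    upper = 1.0 - STD_NORMAL.cdf((z - mean_x) / sd_x)
    lower = STD_NORMAL.cdf((-z - mean_x) / sd_x)
    return upper + lower

def families_required(moments, target=0.80, alpha=1e-7):
    """Smallest n reaching the target power, found by integer bisection."""
    if abs(moments[0]) < 1e-15:
        raise ValueError("no effect: the target power is never reached")
    low, high = 0, 1
    while mdt_power(high, moments, alpha) < target:  # bracket the answer
        high *= 2
    while high - low > 1:
        middle = (low + high) // 2
        if mdt_power(middle, moments, alpha) >= target:
            high = middle
        else:
            low = middle
    return high

# Multiplicative model with risk parameter 2 and allele frequency 0.1.
print(families_required(difference_moments(0.1, 2.0, 4.0)))
```

Integer bisection is used here because the approximate power increases monotonically with the number of families whenever the expected difference is nonzero, so the smallest adequate n is well defined.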
Tables 1, 2, 3, and 4 present the number of families required to achieve 80 percent power by the MDT under various conditions. The empirical powers based on simulations (in parentheses) match the expected value of 0.80 very well, indicating that the power formula presented in this paper is quite accurate. For the purpose of comparison, these tables also present the numbers of families required by a genomewide TDT scan, taken from table 3 of Knapp (17). It can be seen that the differences in numbers of families required between the case-spouse and the case-parents designs are inconsequential (slightly higher in the multiplicative, additive, and recessive modes of inheritance and slightly lower in the dominant mode of inheritance for the case-spouse design compared with the case-parents design), whereas the number of families required is higher for the case-offspring design compared with the case-spouse design. As the number of offspring increases, the number of families required for a case-offspring design decreases. These findings are as expected, because imputed data were used for calculating the MDT in a case-offspring design, and the more offspring a case has, the more precise the imputation of his/her missing spouse can be. Tables 1, 2, 3, and 4 suggest that the number of families should be doubled when only single-offspring families are available. If five offspring in a family are available for study, a case-offspring study is comparable with a case-spouse study in terms of the number of families required.
TWO-DEGREE-OF-FREEDOM LIKELIHOOD RATIO TESTS |
---|
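Here, $D_{1i} = C_{1i} - S_{1i}$ and $D_{2i} = C_{2i} - S_{2i}$ are the differences in genotype counts for case versus spouse in the ith pair (13), with corresponding regression coefficients $\beta_1$ and $\beta_2$. A standard logistic regression program can be used to obtain the maximum likelihood estimates, $\hat{\beta}_1$ and $\hat{\beta}_2$. The deviance statistic for this model is

$$G_1 = 2 \sum_{i=1}^{n} \log\left[1 + \exp\left(-\hat{\beta}_1 D_{1i} - \hat{\beta}_2 D_{2i}\right)\right],$$

and the deviance statistic for the null model is $G_0 = 2n \times \log 2$. Consequently, the 2-df likelihood ratio test for testing $H_0$ ($\beta_1 = \beta_2 = 0$) is based on $G_0 - G_1$.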
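For readers without a logistic regression package at hand, the 2-df test can be sketched directly. Several things here are assumptions: the model is taken to be a logistic regression without intercept in which every pair has response 1 and covariates D1i and D2i, the genotype counts are taken to be indicators of the heterozygous and homozygous genotypes, and the fit uses a plain Newton-Raphson iteration with no safeguard against separation. The function name `lrt_2df` is hypothetical.

```python
import math

def lrt_2df(d1, d2, max_iter=50, tol=1e-10):
    """Return (G0 - G1, p value, (beta1, beta2)) for paired differences."""
    b1 = b2 = 0.0
    for _ in range(max_iter):
        g1 = g2 = h11 = h12 = h22 = 0.0
        for x1, x2 in zip(d1, d2):
            prob = 1.0 / (1.0 + math.exp(-(b1 * x1 + b2 * x2)))
            g1 += (1.0 - prob) * x1          # score vector
            g2 += (1.0 - prob) * x2
            w = prob * (1.0 - prob)          # information matrix entries
            h11 += w * x1 * x1
            h12 += w * x1 * x2
            h22 += w * x2 * x2
        det = h11 * h22 - h12 * h12
        if det <= 1e-12:
            raise ValueError("information matrix is singular")
        step1 = (h22 * g1 - h12 * g2) / det
        step2 = (h11 * g2 - h12 * g1) / det
        b1 += step1
        b2 += step2
        if abs(step1) + abs(step2) < tol:
            break
    deviance_fitted = 2.0 * sum(
        math.log1p(math.exp(-(b1 * x1 + b2 * x2))) for x1, x2 in zip(d1, d2)
    )
    deviance_null = 2.0 * len(d1) * math.log(2.0)
    statistic = max(deviance_null - deviance_fitted, 0.0)
    p_value = math.exp(-statistic / 2.0)  # chi-square upper tail, 2 df
    return statistic, p_value, (b1, b2)

# Six pairs: differences in heterozygote and homozygote indicators.
print(lrt_2df([1, 1, 1, -1, 0, 0], [0, 0, 0, 0, 1, -1]))
```

Pairs in which case and spouse share a genotype have both differences equal to zero and add the same amount to both deviances, so they do not affect the test.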
Using the method of Longmate (25), I calculate the number of families required by the 2-df likelihood ratio test to achieve 80 percent power at $\alpha = 10^{-7}$ for the case-parents and the case-spouse studies under various modes of inheritance (table 5). It can be seen that the number of families required by the case-spouse study is larger than that needed by the case-parents study, but the difference is small. Compared with the 1-df tests (tables 1, 2, 3, and 4), we see as expected that the 2-df likelihood ratio tests can indeed reduce the number of families required to achieve the same power in both the case-parents and the case-spouse studies, under the recessive mode of inheritance and in several instances of the dominant mode of inheritance.
DISCUSSION |
---|
In practice, we usually have various configurations of families in a study, with the following hierarchy: 1) genotype available for both parents; 2) genotype available for unaffected sibling(s) but not for both parents; 3) genotype available for the spouse but not for the unaffected sibling and not for both parents; and 4) genotype available for offspring but not for spouse, not for unaffected sibling, and not for both parents. The method described in this paper can easily be extended to deal with all these families, by redefining Di, respectively, as: 1) the number of transmitted M alleles minus the number of nontransmitted M alleles; 2) the M-allele count of the case minus the (average) M-allele count of his/her unaffected sibling(s); 3) the M-allele count of the case minus the M-allele count of his/her spouse; and 4) the M-allele count of the case minus the M-allele count of his/her imputed spouse. Each Di defined in this way has the expectation of zero under the null hypothesis, irrespective of family configurations. One can then proceed to use the same X2 statistic to combine Di values across all these families.
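The hierarchy above can be sketched as one function that returns the difference for a family according to the best control available, followed by the pooled statistic. The names are hypothetical. For families with both parents, the number of transmitted M alleles equals the case's own count, so the nontransmitted count is the parents' total minus the case's count. The pooled statistic is assumed to be the squared sum of differences over the sum of squared differences.

```python
from statistics import mean

def family_difference(case, parents=None, siblings=None, spouse=None, offspring=None):
    """Difference for one family, using the first available control type."""
    if parents is not None:
        father, mother = parents
        nontransmitted = father + mother - case
        return case - nontransmitted
    if siblings:
        return case - mean(siblings)
    if spouse is not None:
        return case - spouse
    if offspring:
        imputed_spouse = 2 * mean(offspring) - case
        return case - imputed_spouse
    raise ValueError("family has no usable control genotype")

def pooled_statistic(differences):
    """Assumed form of the combined statistic: (sum D)^2 / sum D^2."""
    return sum(differences) ** 2 / sum(d * d for d in differences)

families = [
    dict(case=1, parents=(1, 1)),
    dict(case=2, siblings=[1, 0]),
    dict(case=1, spouse=0),
    dict(case=1, offspring=[0, 0]),
]
differences = [family_difference(**family) for family in families]
print(differences, pooled_statistic(differences))
```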
In this paper, the correction for population admixture by typing null markers is based on the principle of multiplicative scaling of a chi-square distribution (16). The approach is simple and convenient compared with the computer-intensive "latent class method," where the number of strata in a population and each subject's probability of membership in each of these strata have to be estimated before a formal genetic association test can be done (28–30). Furthermore, the breakdown of the multiplicative principle for extremes of stratification in a conventional case-control study, as noted by Reich and Goldstein (16), is not necessarily a serious concern here. Provided that interstrata marriages are not too common, the expected value of the Di under the null of the case-spouse (case-offspring) design should not deviate too far from zero even if the population at large is an extremely stratified one. This explains why the type I error rates can be more effectively controlled in a case-spouse (case-offspring) study than in the conventional case-control study (figure 3). If, however, a case-spouse (case-offspring) study is to be conducted in a population with frequent interstrata marriages, one can reduce the amount of admixture in the sample by excluding those mating couples with clearly different ethnic backgrounds (couples with non-zero expected Di under the null) before applying the correction method.
The present paper assumed the markers to be biallelic. This is not a major restriction, because a dense map of biallelic single nucleotide polymorphisms will be ready for use in the very near future (31, 32). With the cost of genotyping single nucleotide polymorphisms dropping and the cost of recruiting subjects rising, genotyping additional null markers for an admixture correction in a candidate-gene study will not constitute too much of a burden. Furthermore, one could be interested in performing a genomewide scan at the outset. In that case, a multitude of markers across the genome is to be typed anyway.
Epidemiologists, practicing and theoretical alike, have long been troubled by the issue of control selection in a case-control study (33–35). In the recent decade, a better understanding of counterfactual definitions of causation has led to the invention of a series of new designs (36–40). The present paper expands the list of legitimate counterfactual controls to include such members as the spouse and the offspring of a case. With the ease of recruiting subjects, effective control of the type I error rate, and satisfactory power, the case-spouse and the case-offspring designs represent viable alternatives for genetic association studies of adult-onset diseases. It will be of interest to test whether the MDT lives up to expectations when applied to real data.