From the National Institute of Environmental Health Sciences, Research Triangle Park, NC.
Received for publication March 21, 2003; accepted for publication April 4, 2003.
INTRODUCTION
To identify such genes, one must be able to correlate the genetic variation in a population with variation in risk of disease. The extent of variation in the human genome, however, seems daunting. The number of known single-nucleotide polymorphisms (SNPs), now almost 2 million, continues to rise (2) (see http://snp.cshl.org/), with regular updates through the SNP Consortium. At the same time, high-throughput genotyping is becoming increasingly feasible, and genome-wide scans are worth contemplating. Such scans will be affordable within the near future.
In the current issue, Lee (3), following earlier work by Feder et al. (4) and Nielsen et al. (5), develops a sample size formula for a strategy by which one can search for disease susceptibility genes by carrying out a genome-wide SNP scan using only affected individuals. Lee provides a power formula and verifies its excellent performance through simulations. This rather ingenious strategy exploits the distortion from Hardy-Weinberg equilibrium that may be induced among cases for genes with alleles related to risk. The method relies on an assumption that, in the population at large, all of the SNP markers under study are in Hardy-Weinberg equilibrium. Lee provides the sample sizes that would be required to achieve a statistical power of 0.80, based on an alpha level of 0.0000001. The alpha level is small to account for multiple testing involving some 500,000 selected SNPs, to control the rate of false detection for a genome scan. The proposed approach is shown to be more powerful (for a given number of individuals to be genotyped), under either a recessive or a dominant mode of inheritance, than genotyping cases and their parents and applying either the 1-df transmission disequilibrium test or the 2-df likelihood ratio test.
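As a rough illustration of the kind of test under discussion, the following Python sketch applies a standard 1-df Pearson chi-square test of Hardy-Weinberg proportions to genotype counts from cases and compares the p value with the alpha level of 0.0000001 quoted above. It is not Lee's formula or code; the function name `hwe_chi_square`, the constant `ALPHA`, and the example counts are assumptions made purely for illustration.

```python
import math

ALPHA = 1e-7  # alpha level quoted in the text for a scan of some 500,000 SNPs


def hwe_chi_square(n_aa, n_ab, n_bb):
    """1-df Pearson chi-square test of Hardy-Weinberg proportions.

    n_aa, n_ab, n_bb are observed genotype counts (0, 1, 2 copies of the
    variant allele) among cases. Returns (statistic, p_value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("need at least one genotyped individual")
    q = (2 * n_bb + n_ab) / (2 * n)  # estimated variant allele frequency
    p = 1.0 - q
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        raise ValueError("monomorphic marker: test is undefined")
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variate with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical case counts, chosen only to exercise the function.
    stat, p_value = hwe_chi_square(700, 200, 100)
    print(f"chi-square = {stat:.2f}, p = {p_value:.3g}, "
          f"significant at genome-wide level: {p_value < ALPHA}")
```

The p value uses the identity that a 1-df chi-square tail probability equals `erfc(sqrt(x / 2))`, which keeps the sketch free of third-party dependencies.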
As the author points out, testing a null hypothesis of Hardy-Weinberg equilibrium among cases is not useful under any scenario where $\psi_2 = \psi_1^2$, where $\psi_2$ denotes the relative risk (relative penetrance) for those with two copies of the variant SNP compared with those with none, and $\psi_1$ denotes the relative risk for those with one copy. This is because the Hardy-Weinberg equilibrium present in the larger population is preserved among cases under such a scenario. In effect, testing for departures from Hardy-Weinberg equilibrium amounts to testing a null hypothesis that $\psi_2 = \psi_1^2$, rather than that $\psi_2 = \psi_1 = 1$. Consequently, the proposed method should have little power for scenarios where $\psi_1$ lies between 1 and $\psi_2$ (as in the leftmost column of the table of Lee (3)). Such scenarios may be the rule rather than the exception for any SNP that is not itself the important variant within a risk-related gene but is associated with the disease only because of linkage disequilibrium. Such a marker will display a gene-dose relation to risk, even if the linked risk-related gene for which it serves as a surrogate works according to a recessive or a dominant model. Hence, the poor statistical power for the proposed method that is evident in the leftmost column of the table by Lee gives cause for concern.
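The preservation of Hardy-Weinberg equilibrium among cases under a multiplicative model can be checked numerically. The sketch below assumes a population in Hardy-Weinberg equilibrium with variant allele frequency `q`, and relative risks for one and two copies of the variant that are called `psi1` and `psi2` here (names and numerical values chosen for illustration, not taken from Lee). It computes genotype frequencies among cases and a disequilibrium coefficient that is zero exactly when the cases are in Hardy-Weinberg proportions.

```python
def case_genotype_freqs(q, psi1, psi2):
    """Genotype frequencies (0, 1, 2 variant copies) among cases.

    Assumes the source population is in Hardy-Weinberg equilibrium with
    variant allele frequency q, and that carriers of one or two copies have
    relative risks psi1 and psi2 compared with noncarriers.
    """
    p = 1.0 - q
    weights = (p * p, 2 * p * q * psi1, q * q * psi2)
    total = sum(weights)
    return tuple(w / total for w in weights)


def disequilibrium(freqs):
    """Homozygote excess: P(two copies) minus the squared allele frequency.

    Zero under Hardy-Weinberg proportions, positive with an excess of
    homozygotes, negative with an excess of heterozygotes.
    """
    _, f1, f2 = freqs
    q_cases = f2 + f1 / 2.0
    return f2 - q_cases ** 2


if __name__ == "__main__":
    q = 0.2  # hypothetical allele frequency
    scenarios = {
        "multiplicative (psi2 = psi1**2)": (2.0, 4.0),
        "recessive": (1.0, 4.0),
        "dominant": (4.0, 4.0),
        "intermediate gene-dose": (2.0, 3.0),
    }
    for label, (psi1, psi2) in scenarios.items():
        d = disequilibrium(case_genotype_freqs(q, psi1, psi2))
        print(f"{label}: disequilibrium among cases = {d:+.4f}")
```

In this parameterization the squared heterozygote weight equals four times the product of the homozygote weights exactly when `psi2 == psi1**2`, which is why the coefficient vanishes for the multiplicative scenario and is small for nearby gene-dose scenarios.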
It has recently been recognized that linkage disequilibrium is more extensive than had been thought (6). Consequently, the number of existing haplotypes is far smaller than the number theoretically possible on the basis of the number of SNPs. The search for common haplotypes in the human genome, the international "HapMap" project, is ongoing and may permit us to pare down the number of informative SNPs to a smaller number of carefully selected sentinel ("tag") SNPs that will almost fully capture the genetic variation in a population. This prospect is appealing: It means that once a smaller set of sentinel SNPs has been identified, the required sample sizes will be smaller than those shown in the table of Lee, because the alpha level will not need to be nearly so low. One way to think about this is that the Bonferroni correction being imposed inherently (and incorrectly) assumes that the 500,000 tests being carried out are independent; the high degree of dependency across the genome should be exploited in the future to create much more powerful testing procedures.
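To make the dependence of the alpha level on the number of SNPs concrete, here is a minimal sketch of the Bonferroni arithmetic. It assumes a family-wise error rate of 0.05, which is an inference from the figures quoted (0.0000001 multiplied by 500,000 tests) rather than something stated explicitly; the smaller panel sizes are hypothetical.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test alpha level that bounds the family-wise error rate."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


if __name__ == "__main__":
    FAMILY_ALPHA = 0.05  # assumed family-wise error rate
    # 500,000 is the figure from the text; the smaller panels are made up.
    for n_snps in (500_000, 100_000, 50_000):
        alpha = bonferroni_alpha(FAMILY_ALPHA, n_snps)
        print(f"{n_snps:>7} SNPs -> per-test alpha = {alpha:.1e}")
```

A less stringent per-test alpha translates into smaller required sample sizes, although the sketch does not attempt to quantify that gain.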
Other methods for reducing the number of SNPs that must be tested have been proposed, such as restricting attention to SNPs in coding regions (1), on the premise that coding regions are more likely to be close to and in linkage disequilibrium with a risk-related gene. In light of the relatively poor power associated with rare SNPs, as is evident in the table of Lee, another potentially useful restriction would be to limit the analysis to SNPs having some minimal prevalence, for example, 0.10. These more common SNPs may be more likely to be old and well mixed in the population and not subject to founder effects (1).
How safe is it to assume that a population has all of its SNPs in Hardy-Weinberg equilibrium? As discussed by Lee, even in a genetically homogeneous population this assumption is not necessarily safe. Regardless of the overall health of the population, genetic susceptibility factors will affect the survival of individuals, and hence certain SNPs will have suffered preferential attrition in the older age groups. This age dependence may be particularly marked for genes that govern fundamental processes, such as fetal development, immune function, apoptosis, and DNA repair, and also the "environmental genes" (see www.niehs.nih.gov/envgenom/snpsdb.htm) involved in detoxifying harmful exposures, such as tobacco smoke or components of air pollution.
Suppose that one is studying a disease with median age at onset above 50 years and that the analyses proposed by Lee reveal a statistically significant departure from Hardy-Weinberg equilibrium among cases. One might then partition cases into those above the median age versus those below the median age. If the evident violation of Hardy-Weinberg equilibrium is confined to the older group, this discrepancy should raise the suspicion that the finding is due to selective attrition rather than to a causal relation between an SNP under study (or some nearby risk-related gene) and the disease.
Even among the younger segment of the population, as Lee recognizes, many SNPs may violate Hardy-Weinberg equilibrium for noncausal reasons, especially if the population is ethnically diverse. Indeed, SNP markers have been proposed as a means of stratifying the analysis of case-control data on the basis of unmeasured ethnicity (7); for example, one allele strongly related to hereditary hemochromatosis may be restricted largely to those with northern European ancestry (4). Such an ethnicity-specific marker would depart from Hardy-Weinberg equilibrium in an analysis of an ethnically mixed group of cases, if one failed to stratify based on the genetically different subpopulations.
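The effect of pooling genetically different subpopulations can be sketched numerically. The example below assumes two subpopulations, each in Hardy-Weinberg equilibrium but with different allele frequencies, and shows that the pooled group has fewer heterozygotes than Hardy-Weinberg proportions would predict. The mixing weights, allele frequencies, and function names are hypothetical.

```python
def hwe_freqs(q):
    """Hardy-Weinberg genotype frequencies for variant allele frequency q."""
    p = 1.0 - q
    return (p * p, 2 * p * q, q * q)


def mix(populations):
    """Pool subpopulations given as (weight, allele_frequency) pairs.

    Each subpopulation is assumed to be in Hardy-Weinberg equilibrium.
    Weights are normalized to sum to 1.
    """
    total_weight = sum(w for w, _ in populations)
    mixed = [0.0, 0.0, 0.0]
    for weight, q in populations:
        for i, f in enumerate(hwe_freqs(q)):
            mixed[i] += (weight / total_weight) * f
    return tuple(mixed)


def heterozygote_deficit(freqs):
    """Expected heterozygote frequency under HWE minus the observed one."""
    _, f1, f2 = freqs
    q = f2 + f1 / 2.0
    return 2 * q * (1.0 - q) - f1


if __name__ == "__main__":
    # Hypothetical marker: rare in one ancestry group, common in the other.
    pooled = mix([(0.5, 0.1), (0.5, 0.5)])
    print("pooled genotype frequencies:", pooled)
    print("heterozygote deficit:", heterozygote_deficit(pooled))
```

Stratifying by subpopulation removes the deficit, because each stratum on its own satisfies Hardy-Weinberg proportions in this sketch.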
Apparent violations of Hardy-Weinberg equilibrium might also arise because of errors in genotyping, as suggested by Xu et al. (8). In a literature survey, they looked at the incidence of departure from Hardy-Weinberg equilibrium in the control groups (using a significance level of 0.05) in 75 articles that included case-control data, for a total of 133 SNPs. Some 12 percent of the SNPs departed from Hardy-Weinberg equilibrium, with p values ranging from $10^{-30}$ to 0.049. The reasons for such departures presumably include genotyping misclassification.
In our view, Lee's focus on hypothesis testing may be misplaced. Most diseases presumably do have a genetic component, so a global null hypothesis that no allele has anything to do with risk is of modest interest, unless one believes that the disease under study could plausibly be purely environmental or purely random. If one accepts that most diseases have a genetic component, then the localization of the susceptibility genes becomes more important than the rejection of a null hypothesis that no risk-related genes exist. Under a scenario where there is a susceptibility variant at a particular gene, if the variant is of relatively recent origin in the population, then diseased individuals will have a higher degree of relatedness than will nondiseased individuals, and this relatedness (9) can induce departures from Hardy-Weinberg equilibrium across the entire genome, not just in the region linked to the disease locus. This kind of generalized Hardy-Weinberg disequilibrium among cases will frustrate attempts to localize risk-related loci based on assessments involving only affected individuals.
Finally, suppose that one is studying a disease with fairly young onset, for which many parents would be available, and that the number of cases available is the limiting factor, rather than the costs of genotyping individuals. In such a setting, one should divide the sample sizes shown in the table (3) for the transmission disequilibrium test and for the likelihood ratio test by three, to derive the number of cases required, because those sample sizes count individuals to be genotyped and each case-parents triad contributes three. Moreover, the sources of bias described above would not apply to a transmission-based method. On a per-case-studied basis, case-parents methods offer much better power (and robustness) than the case-only approach.
In summary, a number of caveats must be considered before carrying out and interpreting the analysis described by Lee. A finding of departure from Hardy-Weinberg equilibrium must be subjected to further scrutiny, using designs that incorporate appropriate controls. A negative finding is similarly inconclusive and does not imply that the disease under study has no genetic component to its etiology. Much depends on luck in selecting at least one SNP that is in strong linkage disequilibrium with a risk-related allele that has a reasonably high relative risk. Nevertheless, in a setting where one has ready access to a sample of tissues from affected individuals and where the population under study is relatively homogeneous, the proposed method may offer an economical approach to generating hypotheses aimed at localizing susceptibility genes. With luck, some of that high fruit may come down, even with methods that amount to vigorously shaking the tree.