Institute of Human Genetics and Department of Cardiology, University of Newcastle-upon-Tyne, UK.
Dr Bernard Keavney, Institute of Human Genetics, Central Parkway, Newcastle-upon-Tyne NE1 3BZ, UK. E-mail: b.d.keavney{at}ncl.ac.uk
Keywords Coronary heart disease, genetics
Accepted 26 March 2002
Introduction |
---|
There are a number of reasons why genetic studies of complex diseases such as coronary heart disease (CHD) have moved to a central position in the thinking of both geneticists and epidemiologists. Firstly, genetic studies offer the opportunity to identify determinants of disease that are very likely to be causative. This is because genotypes are unchanged throughout life, and therefore not susceptible to confounding via mechanisms related either to the presence of disease itself or the body's response to disease (in the way that measurement of hypothesized risk factors in plasma, for example, could be). Secondly, the current pharmacological armamentarium for complex diseases consists of drugs which act on a very small number (several hundred) of the 30 000 or so genes in the human genome. Identification of novel genes that contribute to CHD risk could therefore lead to new therapeutic targets with strong molecular underpinning. Thirdly, identification of risk genotypes might enable more accurate determination of an individual's risk of CHD than is currently possible from measurement of known risk factors. This is partly because genotypes could measure the activity of novel or otherwise unmeasurable biological pathways, and partly because genotypes would not be susceptible to short-term fluctuations and measurement error, in contrast to measurements of plasma and other quantitative phenotypic risk factors.
How large is the genetic contribution to CHD risk? |
---|
How many genes? |
---|
It could thus be argued that CHD is a singularly unpromising phenotype for genetic investigation, since susceptibility to disease is very significantly influenced by a single environmental exposure (cigarette smoking), and since the other recognized major risk factors are themselves probably under the control of multiple genes and the environment. This leads to the expectation that genes affecting CHD risk will be many, and of small individual effect. Whilst this expectation is the single most important issue in the discussion of CHD genetics, there are a number of reasons why studies of CHD endpoints rather than only of risk factors remain important. Firstly, it is as yet unclear whether the observed effect of genetic polymorphisms on risk will be commensurate with that expected from the composite associations between polymorphisms and risk factors, and between risk factors and disease. This is because particular polymorphisms may act on a number of pathways relevant to disease (such genetic effects are said to be pleiotropic), and if so their effect on risk could be substantially different from what would be predicted based on their effect on a single risk factor. The reliable assessment of the contribution of such genes to CHD risk would require that adequate numbers of CHD events be studied, rather than extrapolation made from the effects of polymorphisms on risk factors. Hitherto it has not been possible to assess this issue because of the absence in the literature of suitably sized studies with both genotypes at candidate genes and plasma risk factor measurements. The second reason to study endpoints rather than only risk factors is that genetic determinants of risk may affect pathways whose activity is difficult to measure in samples that could be collected in blood-based epidemiological studies (for example pathways involved in apoptosis or neoangiogenesis). 
Thirdly, genetic studies focussed on clinical CHD endpoints may have a unique part to play in testing hypotheses about causal pathways potentially amenable to intervention, as will be discussed.
How many alleles? |
---|
Available data show that either model may apply in different diseases. A good example of a common allele of large effect, which consequently contributes significantly to the population risk of disease, is the Factor V Leiden mutation, which causes activated protein C resistance (APCR). APCR is present in 20–50% of patients with venous thromboembolism, and is in most cases due to a single mutation (a G to A transition at nucleotide 1691) in the gene for coagulation factor V. This mutation is common (with an allele frequency of about 3–5%, yielding a carrier frequency of 5–10% in most Caucasian populations), and confers a relative risk of venous thrombosis to carriers of about fivefold.6 Conversely, in the case of Crohn's disease, recent data have demonstrated that a number of rare variants clustered within the NOD2 gene underlie susceptibility in a particular subset of patients.7,8 So far, no allele of similar population significance to Factor V Leiden has been found in the study of CHD.
Thus, considering the available evidence, it is likely that the effect of genotype at any individual polymorphism on the risk of CHD will be small. This has been borne out in perhaps the only robust genetic association with CHD thus far described, that of the apolipoprotein E ε4 allele, which confers on carriers a relative risk of myocardial infarction of 1.2–1.3 that has been confirmed in several thousand cases of disease.9 This contrasts with a relative risk to smokers of about 5.0 compared to non-smokers in similar age groups.10
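The relationship between allele frequency and carrier frequency quoted for Factor V Leiden can be checked with a short calculation. The sketch below assumes Hardy-Weinberg proportions and takes the allele frequency as 3% to 5%, which gives a carrier frequency between 5% and 10%. It also applies Levin's standard formula for the population attributable fraction as one way of expressing "population significance"; that formula and both function names are illustrative additions, not part of the studies cited.

```python
def carrier_frequency(q: float) -> float:
    """Proportion of people with at least one copy of an allele of frequency q.

    Assumes Hardy-Weinberg proportions (random mating, no selection).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return 1.0 - (1.0 - q) ** 2


def attributable_fraction(exposed: float, relative_risk: float) -> float:
    """Levin's population attributable fraction for a binary exposure."""
    excess = exposed * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Allele frequencies and the fivefold relative risk are the figures quoted in the text.
for q in (0.03, 0.05):
    carriers = carrier_frequency(q)
    fraction = attributable_fraction(carriers, 5.0)
    print(f"q = {q:.2f}: carriers = {carriers:.3f}, attributable fraction = {fraction:.2f}")
```

The same function shows why a rare allele, however penetrant, accounts for little disease in the population: the attributable fraction shrinks with the carrier frequency.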
Case-control genetic association studies of CHD |
---|
Overwhelmingly the most frequently used association study design is that involving unrelated cases of disease and unrelated controls, principally because such studies are far easier to collect than are family-based studies. Allele frequencies at the candidate polymorphisms of interest are compared between cases and controls. There are well-recognized caveats regarding the conduct and interpretation of such studies in classical epidemiology: studies involving large numbers of cases produce more precise estimates of any association of a particular factor with risk, and in the case of small effects, very large studies may be needed; there is potential for confounding if cases and controls are not well matched; false positive results may occur if excessive subgroup analysis is carried out, particularly if the number of events in the study is small; and replication in an independent cohort provides strong evidence in favour of the correctness of the conclusions.
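As an illustration of the basic comparison, the following sketch tests whether allele counts differ between cases and controls using a Pearson chi-square statistic on the 2 x 2 table of alleles. The counts are invented for the example, the function name is hypothetical, and the allelic test treats the two alleles of each person as independent, which is an assumption that holds only under Hardy-Weinberg proportions.

```python
import math


def allelic_chi_square(case_a: int, case_b: int, control_a: int, control_b: int):
    """Pearson chi-square (1 df, no continuity correction) on allele counts.

    Returns the statistic, its two-sided P value and the allelic odds ratio.
    """
    n = case_a + case_b + control_a + control_b
    numerator = n * (case_a * control_b - case_b * control_a) ** 2
    denominator = (
        (case_a + case_b)
        * (control_a + control_b)
        * (case_a + control_a)
        * (case_b + control_b)
    )
    chi2 = numerator / denominator
    # For 1 degree of freedom the upper tail of chi-square equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (case_a * control_b) / (case_b * control_a)
    return chi2, p_value, odds_ratio


# Hypothetical study: 1000 cases and 1000 controls, i.e. 2000 alleles per group.
chi2, p_value, odds_ratio = allelic_chi_square(340, 1660, 300, 1700)
print(f"chi2 = {chi2:.2f}, P = {p_value:.3f}, odds ratio = {odds_ratio:.2f}")
```

With these invented counts an odds ratio of the size expected for CHD does not reach conventional significance even with 1000 cases, which is the practical meaning of the caveat that small effects need very large studies.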
A caveat specific to genetic association studies of this type relates to the potential for confounding due to subtle, undetected ethnic stratification between cases and controls. If there is such an ethnic difference between cases and controls then that will be reflected by differences in allele frequency at a large number of genetic markers, few if any of which will be indicating true association with disease arising from chromosomal proximity with a disease-causing locus. Although this issue is frequently cited as a potential cause of false positive results, it has only rarely been convincingly implicated.12,13 Even in these studies it could be convincingly argued that recent ethnic admixture of differing degrees in cases and controls was quite clear from available demographic data, and should have been detected by vigilant investigators. Recent evidence would suggest that in outbred populations selected on the basis of self-reported ethnicity, and in whom reasonable safeguards are applied in the ascertainment process to avoid clearly admixed populations, the risk of unsuspected ethnic stratification sufficient to cause false positive results is rather low.14 Additionally, recently developed mathematical techniques involving comparison of allele frequencies at a large number of randomly selected polymorphisms have the potential to discover whether significant stratification is indeed present between cases and controls in such a study.15,16
Linkage disequilibrium: islands in a stream of variation? |
---|
Recent molecular studies, however, show that linkage disequilibrium (LD) throughout the human genome is structured in blocks, within which there is very substantial disequilibrium between polymorphisms, separated by recombination hotspots within which there is little disequilibrium.5,18,19 The length of these blocks of disequilibrium appears to be variable, but in some cases they may extend over tens of kilobases within which just one or two single nucleotide polymorphisms (SNP; polymorphisms which result from a simple substitution of one nucleotide for another, and are thus biallelic) would capture the majority of the variation in certain populations. This is extremely good news for genetic epidemiologists contemplating association studies, since the amount of genotyping required to achieve coverage suddenly looks much less than before. However, much remains to be done in defining the limits of the blocks of disequilibrium, and views differ regarding whether effort should be put into a genome-wide characterization of LD blocks or whether regions in which specific candidate genes for particular conditions are located should be prioritized. Another important caveat is that patterns of LD are dependent not only on the genetic distance between polymorphisms and the presence of hotspots, but also on the population history. Most populations not of African origin seem to have fairly extensive blocks of LD; however, a number of studies have shown that there is far less LD in African populations.19,20 Thus, long-distance mapping using LD will be far more difficult in African origin populations, with potentially a far larger number of loci needing to be typed in order to achieve coverage of a specific region or gene. It is also clear that, within a block of LD, it will be difficult to identify the causative polymorphism(s) since all the polymorphisms within a block, causative and non-causative, will be in LD with each other and therefore hard to distinguish.
These differences between populations may be turned to geneticists' advantage, however: one possible approach, successfully used by some groups, is trans-ethnic fine mapping, wherein the original coarse localization is made in a population with extensive LD, and finer localization made in an African origin population.21 This approach does, however, assume that the variants causing disease will be the same, or at least in the same part of a gene, in both populations, which may not be the case in some diseases. In general, since association studies depend not only on the relationship between specific genes and disease but also on the history of the population studied, results may not be replicable between populations unless causative polymorphisms can be tested. Whether the majority of the effort is first focussed on candidate genes or genome-wide, the recent findings regarding LD structure in a few regions have illustrated an urgent need for an internationally co-ordinated effort to describe this structure in multiple populations.
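To make the notion of disequilibrium between two biallelic polymorphisms concrete, the sketch below computes three standard pairwise measures (D, the normalized |D'| and r²) from a haplotype frequency and the two allele frequencies. These measures are brought in here for illustration and are not taken from the studies cited; the function name and the example frequencies are hypothetical.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float):
    """Pairwise LD between allele A at one locus and allele B at another.

    p_ab is the frequency of the haplotype carrying both A and B;
    p_a and p_b are the two allele frequencies.
    Returns (D, |D'|, r squared).
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    # The largest |D| attainable depends on the sign of D and the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max
    r_squared = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r_squared


# Two invented situations for alleles of frequency 0.3 at both loci.
within_block = ld_measures(p_ab=0.30, p_a=0.3, p_b=0.3)    # A always travels with B
across_hotspot = ld_measures(p_ab=0.10, p_a=0.3, p_b=0.3)  # close to independence
print("within block:   D = %.3f, |D'| = %.2f, r2 = %.3f" % within_block)
print("across hotspot: D = %.3f, |D'| = %.2f, r2 = %.3f" % across_hotspot)
```

When r² between two polymorphisms is close to 1, typing one of them is nearly as informative as typing both, which is why one or two SNP can capture most of the variation inside a block and also why the causative variant within a block is hard to single out.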
Candidate genes or large hypothesis approach? |
---|
With respect to the number of polymorphisms needing to be typed in any candidate gene to confirm or exclude that gene from involvement in disease aetiology, it is increasingly apparent from the data discussed above regarding LD that the degree to which any particular polymorphism or set of a few polymorphisms describes all the variation in a candidate gene is variable between genes. For example, at the lipoprotein lipase (LPL) gene, a candidate for CHD susceptibility, there is a recombination hotspot over a 1.9-kb segment in the middle of the gene, and within this hotspot a number of polymorphisms would have to be typed in order to obtain a full picture of the variability even in this short region. However, outside the hotspot, the remaining variation in the gene can be described by typing relatively few polymorphisms.22,23 In the angiotensin-I converting enzyme (ACE) gene, an extensively investigated candidate gene for CHD risk, it has been shown that there are only three common haplotype groups (a haplotype describes the phased array of genotypes along a chromosome) in Caucasian populations, and that one of the haplotype groups arises from an ancestral recombination event between the two commonest haplotypes around exon 7 of the gene. Thus, the majority of the significant population variability at this locus (which extends over 25 kb) in Caucasians could be described by the genotyping of just two of the 78 polymorphisms described within the gene, one on either side of the recombination event in the middle of the gene.24,25 The principal change in candidate gene association studies within the next few years will be the explicit consideration of the haplotype structure of the genes under investigation and selective genotyping of haplotype-tagging SNP selected based on knowledge of the variation present in the population.
This may not be so easy: despite the very large number of SNP (in excess of 2 million) that have been deposited in the SNP Database, a recent study showed that, for several candidate genes, the available SNP did not adequately describe the variation present.5 Thus, a large amount of preliminary work on SNP definition, haplotype identification and SNP selection in candidate genes is to be expected in the next few years, although several groups have already produced large datasets describing the frequent SNP in exonic and 5' sequences for a number of candidate genes for CHD.26
The increasing facility with which SNP can be typed, the availability of a large number of frequent SNP in databases, and the recent information regarding the structure of LD in a variety of genomic regions, raise the possibility that a large hypothesis experiment consisting of a genome-wide SNP survey in CHD may be possible within a few years. An advantage of such an approach would be its unbiased nature and its potential to discover genes participating in novel causative pathways, which might not have been studied in a candidate gene approach. Further advances in genotyping technology, in particular with regard to cost and throughput, would be necessary before such an approach could be successful; also, the interpretation of the vast amounts of data such an experiment would generate would present unique challenges.
Why have results thus far been so unreliable? |
---|
Perhaps the principal conceptual difficulty in the conduct of these studies is that a far more Bayesian viewpoint than is customary in many biological experiments is necessary. There are estimated to be approximately 5 million single nucleotide polymorphisms with a minor allele frequency of at least 10%, and 11 million SNP with a minor allele frequency of at least 1%. The prior probability that any of these is causally associated with CHD is vanishingly small. Even when polymorphisms are randomly selected in candidate genes known or suspected to lie within a causal pathway for disease, the prior probability of association, while not numerically calculable, must remain low. In this situation, differences in allele frequency between cases and controls significant at the conventionally accepted P < 0.05 level are more likely to represent a false than a true positive, because the prior probability of such a difference being causal is so low. This problem tends to be exacerbated by the ease with which genetic polymorphisms can be typed and multiple hypotheses tested with post hoc justification. Although the appropriate threshold for significance in such studies is not, so far, agreed, some authorities have suggested it should be very much more extreme than hitherto customary (for example Risch has suggested P = 5 × 10⁻⁸ for randomly selected SNP).29 In order to provide adequate power to detect small effects even with substantially less stringent significance criteria (for example, P < 0.001, which if adopted would still result in the reclassification of most positive CHD case-control gene-association studies as negative), far larger studies than have hitherto been usual will be required.
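The Bayesian argument can be made explicit with Bayes' theorem. The sketch below computes the probability that a nominally significant result reflects a true association, given a prior probability, a significance threshold and the study's power. The prior of 1 in 1000 and the power of 80% are assumptions chosen purely for illustration (the text notes that the real prior is not numerically calculable), and the function name is hypothetical.

```python
def prob_true_positive(prior: float, alpha: float, power: float) -> float:
    """Probability that a 'significant' association is real (Bayes' theorem).

    prior: probability, before the study, that the polymorphism is truly associated
    alpha: significance threshold, i.e. the false positive rate under the null
    power: probability of reaching significance when the association is real
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


ASSUMED_PRIOR = 1e-3  # illustrative only: 1 candidate polymorphism in 1000 is causal
ASSUMED_POWER = 0.8   # illustrative only

# Thresholds discussed in the text: conventional, intermediate and Risch's proposal.
for alpha in (0.05, 0.001, 5e-8):
    posterior = prob_true_positive(ASSUMED_PRIOR, alpha, ASSUMED_POWER)
    print(f"P < {alpha:g}: probability the finding is real = {posterior:.3f}")
```

Under these assumptions a result at P < 0.05 is far more likely to be false than true, whereas a result at the most stringent threshold is almost certainly real; raising the prior, as for a well-characterized functional variant, shifts every line upwards.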
It may be possible to revise the estimate of an appropriate level of significance based on factors which could, in the case of particular polymorphisms, significantly affect what the prior probability of a causal association is likely to be. In the case of the ApoE ε2/ε3/ε4 polymorphism, genotype significantly affects plasma levels of cholesterol and its subfractions, certain genetic defects in ApoE are associated with severe lipid abnormalities and premature CHD, and the different ApoE isoforms have been shown to have biologically significantly different properties in a variety of actions potentially relevant to the development of atherosclerosis. All these factors act to increase the prior probability that an association of ApoE genotype with CHD discovered at a given level of statistical significance is causal. The long history of robust genetic associations at the HLA locus with a variety of autoimmune diseases (most of which were initially discovered and replicated by association studies in unrelated cases and controls, and subsequently confirmed by family studies) provides another example which strongly confirms the view that case-control studies can reliably detect genetic effects if the prior probability of a causal association is sufficiently high.
It is highly likely that individuals with certain genotypes are differentially susceptible to the effects of environmental exposures, with particularly adverse consequences of an environmental exposure in those of particular genotype (gene-environment interaction). A number of studies have focussed on attempts to identify such interactions by examining the effects of environmental exposures in subgroups of genotyped cases and controls.30 In almost every case, such analysis has included very small numbers of cases and controls in the subgroups claimed to show differences, and the caveats of classical epidemiology regarding undue emphasis on the results of such analysis have gone unheeded. Since the overall effect of any particular allele is likely to be small, attempts to detect heterogeneity between subgroups of individuals carrying such an allele will be very unreliable unless they are carried out in far larger numbers (many thousands) of cases of disease than has been hitherto usual. In practical terms, Clayton and McKeigue have recently pointed out that the public health benefits of targeting interventions towards those individuals of particular genotype that are unusually susceptible to a specific adverse environmental factor are likely to be limited, and that greater benefits are likely to result from interventions directed at the whole population.31 This calls into question the emphasis on gene-environment interaction currently influencing the design of several large cohort studies into the genetic epidemiology of complex disease, including a study of 500 000 individuals over 10 years in the UK (for details see http://www.wellcome.ac.uk/en/1/biovenpop.html).
Three issues, namely the small anticipated size of any genetic effect, the need for high levels of statistical significance to counteract the very low prior probability of association, and the reasonable aim to examine the effect of particular polymorphisms in a limited number of carefully chosen subgroups, therefore all lead to the conclusion that far larger studies than have been hitherto usual (involving many thousands of cases of disease and controls) are necessary in this area for reliable results to be obtained. However, the number of such very large studies will be few. Where does this leave those investigators with access to small or medium-sized cohorts who wish to contribute to the field, and how will replication of the results of a very large study take place? In my view, collaborative efforts involving deposition of raw genotype and phenotype data, suitably anonymized, in a central database that can be accessed and analysed by all contributors are needed. Such databases could be set up by particular groups with agreed co-ordinating responsibilities, by national funding agencies, or by journals with a particular interest in resolving whether specific hypothesized associations are indeed real. Science pursued according to the traditional competitive model has not produced impressive results in this uniquely difficult area and large sums of money have probably been wasted in genotyping studies that were individually far too small to yield accurate results.
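A rough sense of the study sizes implied can be obtained from the usual normal-approximation formula for comparing two proportions, applied to allele frequencies in cases and controls. The sketch below is one simple way of doing the arithmetic, not a method taken from the studies cited: the control allele frequency of 0.15, the power of 90% and the function name are assumptions for illustration, equal numbers of cases and controls are assumed, and each person is treated as contributing two independent alleles.

```python
import math
from statistics import NormalDist


def cases_needed(control_freq: float, odds_ratio: float,
                 alpha: float = 0.05, power: float = 0.9) -> int:
    """Approximate number of cases (with as many controls) for an allelic test.

    Uses the normal approximation for a two-sided comparison of two proportions.
    """
    p0 = control_freq
    # Allele frequency in cases implied by the allelic odds ratio.
    p1 = odds_ratio * p0 / (1.0 - p0 + odds_ratio * p0)
    p_bar = (p0 + p1) / 2.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_power * math.sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))
    ) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    return math.ceil(alleles_per_group / 2.0)  # two alleles per person


# An odds ratio of 1.2 is of the order reported for the apolipoprotein E ε4 allele.
for alpha in (0.05, 0.001, 5e-8):
    print(f"alpha = {alpha:g}: about {cases_needed(0.15, 1.2, alpha=alpha)} cases")
```

Under these assumptions the required number of cases runs into the thousands even at the conventional threshold and rises steeply as the threshold is tightened, before any allowance is made for subgroup analyses.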
Genetic associations as a test for causality |
---|
Conclusions |
---|
KEY MESSAGES
References |
---|
2 Williams RR, Hunt SC, Hopkins PN et al. Genetic basis of familial dyslipidemia and hypertension: 15-year results from Utah. Am J Hypertens 1993;6(11 Pt 2):319S–27S.
3 Livshits G, Gerber LM. Familial factors of blood pressure and adiposity covariation. Hypertension 2001;37:928–35.
4 Kruglyak L, Nickerson DA. Variation is the spice of life. Nat Genet 2001;27:234–36.
5 Johnson GC, Esposito L, Barratt BJ et al. Haplotype tagging for the identification of common disease genes. Nat Genet 2001;29:233–37.
6 Simioni P, Prandoni P, Lensing AW et al. The risk of recurrent venous thromboembolism in patients with an Arg506→Gln mutation in the gene for factor V (factor V Leiden). N Engl J Med 1997;336:399–403.
7 Hugot JP, Chamaillard M, Zouali H et al. Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn's disease. Nature 2001;411:599–603.
8 Ogura Y, Bonen DK, Inohara N et al. A frameshift mutation in NOD2 associated with susceptibility to Crohn's disease. Nature 2001;411:603–06.
9 Wilson PW, Schaefer EJ, Larson MG, Ordovas JM. Apolipoprotein E alleles and risk of coronary disease. A meta-analysis. Arterioscler Thromb Vasc Biol 1996;16:1250–55.
10 Parish S, Collins R, Peto R et al. Cigarette smoking, tar yields, and non-fatal myocardial infarction: 14 000 cases and 32 000 controls in the United Kingdom. The International Studies of Infarct Survival (ISIS) Collaborators. BMJ 1995;311:471–77.
11 Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science 1996;273:1516–17.
12 Gardner LI, Jr, Stern MP, Haffner SM et al. Prevalence of diabetes in Mexican Americans. Relationship to percent of gene pool derived from native American sources. Diabetes 1984;33:86–92.
13 Knowler WC, Williams RC, Pettitt DJ, Steinberg AG. Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet 1988;43:520–26.
14 Collins A, Lonjou C, Morton NE. Genetic epidemiology of single-nucleotide polymorphisms. Proc Natl Acad Sci USA 1999;96:15173–77.
15 Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics 2000;155:945–59.
16 Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. Association mapping in structured populations. Am J Hum Genet 2000;67:170–81.
17 Kruglyak L. Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat Genet 1999;22:139–44.
18 Jeffreys AJ, Kauppi L, Neumann R. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet 2001;29:217–22.
19 Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES. High-resolution haplotype structure in the human genome. Nat Genet 2001;29:229–32.
20 McKenzie CA, Julier C, Forrester T et al. Segregation and linkage analysis of serum angiotensin I-converting enzyme levels: evidence for two quantitative-trait loci. Am J Hum Genet 1995;57:1426–35.
21 McKenzie CA, Abecasis GR, Keavney B et al. Trans-ethnic fine mapping of a quantitative trait locus for circulating angiotensin I-converting enzyme (ACE). Hum Mol Genet 2001;10:1077–84.
22 Templeton AR, Weiss KM, Nickerson DA, Boerwinkle E, Sing CF. Cladistic structure within the human Lipoprotein lipase gene and its implications for phenotypic association studies. Genetics 2000;156:1259–75.
23 Templeton AR, Clark AG, Weiss KM, Nickerson DA, Boerwinkle E, Sing CF. Recombinational and mutational hotspots within the human lipoprotein lipase gene. Am J Hum Genet 2000;66:69–83.
24 Keavney B, McKenzie CA, Connell JM et al. Measured haplotype analysis of the angiotensin-I converting enzyme gene. Hum Mol Genet 1998;7:1745–51.
25 Farrall M, Keavney B, McKenzie C, Delepine M, Matsuda F, Lathrop GM. Fine-mapping of an ancestral recombination breakpoint in DCP1. Nat Genet 1999;23:270–71.
26 Halushka MK, Fan JB, Bentley K et al. Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat Genet 1999;22:239–47.
27 MacMahon S, Peto R, Cutler J et al. Blood pressure, stroke, and coronary heart disease. Part 1, Prolonged differences in blood pressure: prospective observational studies corrected for the regression dilution bias. Lancet 1990;335:765–74.
28 Keavney B, McKenzie CA, Parish S et al. Large-scale test of hypothesised associations between the angiotensin-converting enzyme insertion/deletion polymorphism and myocardial infarction in about 5000 cases and 6000 controls. Lancet 2000;355:434–42.
29 Risch NJ. Searching for genetic determinants in the new millennium. Nature 2000;405:847–56.
30 Humphries SE, Talmud PJ, Hawe E, Bolla M, Day IN, Miller GJ. Apolipoprotein E4 and coronary heart disease in middle-aged men who smoke: a prospective study. Lancet 2001;358:115–19.
31 Clayton D, McKeigue PM. Epidemiological methods for studying genes and environmental factors in complex diseases. Lancet 2001;358:1356–60.
32 Youngman L, Keavney B, Palmer A et al. Plasma fibrinogen and fibrinogen genotypes in 4685 cases of myocardial infarction and in 6002 controls: test of causality by Mendelian randomisation. Circulation 2000;102(Suppl.II):31–32.
33 Rosenberg N, Murata M, Ikeda Y et al. The frequent 5,10-methylenetetrahydrofolate reductase c677t polymorphism is associated with a common haplotype in Whites, Japanese, and Africans. Am J Hum Genet 2002;70:758–62.
34 Vickers M, Green FR, Terry CA et al. Genotype at a promoter polymorphism of the interleukin-6 gene is associated with usual plasma levels of C-reactive protein. Cardiovasc Res In press.