Skilled use of DNA polymorphisms as a tool for polygenic cancers

Kari Hemminki,2 and Peter G. Shields1

Department of Biosciences, Karolinska Institute, Novum, S-141 57 Huddinge, Sweden and
1 Cancer Genetics and Epidemiology Program, Lombardi Cancer Center, Georgetown University Medical Center, Washington DC, USA


    Abstract
 
Association studies are assumed to be an efficient method of deciding whether a gene or its variant is important for cancer. Sequencing data on 30 000 human genes suggest that an average gene contains one to two single nucleotide polymorphisms (SNPs), and high-throughput technologies have become available for fast genotyping. Because no functional data are available for most SNPs, the result of the large-scale genotyping effort will be a huge amount of data of unknown biological significance. We discuss here the approaches in study design and reporting that will reduce the spread of false positive data and optimize scientific progress in the genotyping field.

Abbreviations: SNP, single nucleotide polymorphism.


    Introduction
 
Most types of cancer are complex diseases caused by the interaction of many genetic and environmental factors (1,2). Each gene in a polygenic pathway is likely to have a small effect on the risk of cancer, but because the mutations or polymorphisms in these genes affect a large proportion of the population, the resulting attributable risk may be high (3–5). There is an explosion of data on polymorphisms in the human genome; some of these data will provide new insights into carcinogenic mechanisms, whereas others will be spurious and lead research down side roads. It is important that those who carry out the research recognize these issues and plan their work to avoid such negative effects. We want to point out some basic problems in the large-scale analysis of polymorphisms and propose principles for the design and reporting of these studies.
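
The point that a weak but common variant can carry a high attributable risk can be illustrated with Levin's formula for the population attributable fraction. The sketch below is only an illustration: the carrier frequencies and relative risks are hypothetical values chosen for the example, not estimates from any study, and the function name is our own.

```python
def attributable_fraction(carrier_frequency: float, relative_risk: float) -> float:
    """Levin's population attributable fraction for a single risk factor.

    carrier_frequency: proportion of the population carrying the variant
    relative_risk: risk in carriers divided by risk in non-carriers
    """
    excess = carrier_frequency * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical example values (assumptions for illustration only).
common_low_risk = attributable_fraction(carrier_frequency=0.40, relative_risk=1.5)
rare_high_risk = attributable_fraction(carrier_frequency=0.001, relative_risk=10.0)

print(f"common variant, modest risk: {common_low_risk:.3f}")
print(f"rare variant, high risk:     {rare_high_risk:.3f}")
```

Under these assumed values the common, low-penetrance variant accounts for a far larger share of cases in the population than the rare, high-risk one, which is the situation described for polygenic pathways.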

The International SNP Map Working Group has located 1.42 million SNPs and Celera 2.1 million (6,7). Estimates of the number of SNPs in exonic DNA range from 20 000 to 60 000. Assuming that there are 30 000 genes in the human genome, an average gene could contain one to two SNPs (6,7). About one-half of the SNPs change the amino acid and in 45% of the cases the change is non-conservative (7). SNPs are not equally distributed among genes coding for different classes of proteins, but if they were, a large proportion of proteins would harbor an amino acid change. However, highly conserved proteins may not contain any polymorphisms (8,9). On the other hand, even some intronic polymorphisms, including those at promoter and other regulatory sequences, may be of importance.
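
The per-gene figure follows from simple division of the quoted estimates, as the sketch below shows. The variable names are ours, and we assume that the 45% figure applies to the amino-acid-changing SNPs; that reading is our interpretation of the text.

```python
# Figures quoted in the text.
GENES = 30_000
EXONIC_SNPS_LOW = 20_000
EXONIC_SNPS_HIGH = 60_000
FRACTION_AMINO_ACID_CHANGING = 0.5
# Assumption: the 45% refers to the amino-acid-changing SNPs.
FRACTION_NON_CONSERVATIVE = 0.45

snps_per_gene_low = EXONIC_SNPS_LOW / GENES
snps_per_gene_high = EXONIC_SNPS_HIGH / GENES
non_conservative_share = FRACTION_AMINO_ACID_CHANGING * FRACTION_NON_CONSERVATIVE

print(f"exonic SNPs per gene: {snps_per_gene_low:.2f} to {snps_per_gene_high:.2f}")
print(f"share of exonic SNPs with a non-conservative change: {non_conservative_share:.3f}")
```

The lower bound comes out somewhat below one SNP per gene, so "one to two" should be read as a rounded range.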

Traditionally, the approach to the identification of SNPs as cancer risk factors was to identify candidate genes and polymorphisms within them, and then to test these in an epidemiological study. Great weight would be given to those SNPs that result in a functional phenotypic effect (i.e. altered protein levels or function), known a priori and tested in the context of a biologically based hypothesis. However, it is now recognized that this approach, while the most rational, is yielding to the power of new technologies that have identified a large number of SNPs for which phenotypic effects have not yet been identified. This has resulted in a very different approach to the identification of genetic risk factors for cancer.

Large laboratories are planning or carrying out extensive SNP analysis in disease candidate genes and there will be an explosion of genotype data in the near future. There will be a large number of positive associations, and the recent history of conflicting polymorphism studies in the few metabolic genes is frightening. Lander and Kruglyak have warned about genetic analysis of complex traits: ‘Scientific disciplines erode their credibility when a substantial proportion of claims cannot be replicated – even more so when the claims reach not only the professional journals but also the evening news’ (10). To illustrate what is to come, let us assume that there are 100 distinct types of cancer (1). Almost all of them are multifactorial, caused by gene–environment interactions (11,12). SNP analysis of candidate genes in cases and controls is a preferred approach for such diseases (3). For any cancer, candidate genes may number in the hundreds, including those involved in cell-cycle control, signal transduction, DNA repair, cell-to-cell communication and metabolism. Assuming that 100 critical genes will be selected for SNP analysis in any cancer, these may harbor an equal number of relevant SNPs. Thus, among all 100 types of cancer, 10 000 polymorphisms will be tested. Usually the study population will be stratified by age, sex, ethnicity, life-style and exposures; gene–gene and gene–environment interactions may also be tested. The number of comparisons may become huge in a single study: two categories on three variables will result in eight strata, and on five variables in 32 strata. Thus, even in one cancer, assayed for 100 genes and analyzed only for five variables, 3200 statistical tests will be carried out, and about 160 of them are expected to be significant at the 5% level by chance alone. This is a typical multiple testing problem.
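
The arithmetic behind these numbers can be retraced in a few lines. This is a minimal sketch with function names of our own choosing; it assumes that every null hypothesis is true and, for the last quantity only, that the tests are independent, which will rarely hold exactly in real data.

```python
def n_strata(n_variables: int, categories_per_variable: int = 2) -> int:
    """Number of strata when each variable splits the sample into equal-count categories."""
    return categories_per_variable ** n_variables


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests significant at level alpha when all nulls are true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


strata_three = n_strata(3)               # two categories on three variables
strata_five = n_strata(5)                # two categories on five variables
total_tests = 100 * strata_five          # 100 SNPs, each tested in every stratum
expected_hits = expected_false_positives(total_tests, alpha=0.05)
single_snp_risk = prob_at_least_one_false_positive(strata_five, alpha=0.05)

print(strata_three, strata_five, total_tests)
print(f"expected chance findings: {expected_hits:.0f}")
print(f"P(>=1 chance finding for one SNP across all strata): {single_snp_risk:.2f}")
```

Even for a single SNP, testing in every stratum makes at least one chance finding more likely than not under these assumptions.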

There are various ways of addressing the multiple testing problem imminent in SNP analyses. In mutation analysis and whole genome linkage analysis, the significance levels have been adjusted based on the number of tests carried out, e.g. by the Bonferroni correction (10,13). Such statistical corrections may be useful and, sample size permitting, it may be advisable to carry out the analysis in two halves, one for generating and the other for testing the hypothesis. However, in SNP analysis there are four biologically motivated considerations for the interpretation of the results: (i) selection of the genes and pathways has to be based on mechanistic knowledge; (ii) a known functional effect induced by the polymorphism makes positive results persuasive, as do SNPs that induce non-conservative amino acid changes in important functional domains of the protein; (iii) as a first approximation, an additive model of gene dosage effect can be assumed, i.e. the effect in heterozygotes should lie between those in the two homozygotes, a test that can always be done. The sample size should be large enough to enable proper testing of the three genotypes. Gene dosage effects may be non-linear, but heterozygotes should lie at or between homozygotes; (iv) finally, demonstration of loss of heterozygosity (LOH) of the wild-type allele in the tumor would be strong evidence of an effect (14). However, this will require a phenotypic effect that offsets the normal function, such as loss of function, and the availability of tumor samples.
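
Two of these checks lend themselves to a short sketch: the Bonferroni-adjusted significance level and the gene dosage ordering of consideration (iii). The genotype counts below are hypothetical, the function names are ours, and the ordering check compares point estimates only; it ignores sampling error, so in practice it would be accompanied by confidence intervals or a formal trend test.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


def odds_ratio(cases_genotype: int, controls_genotype: int,
               cases_reference: int, controls_reference: int) -> float:
    """Odds ratio of a genotype relative to the reference homozygote."""
    return (cases_genotype * controls_reference) / (controls_genotype * cases_reference)


def heterozygote_between_homozygotes(or_heterozygote: float,
                                     or_variant_homozygote: float,
                                     or_reference_homozygote: float = 1.0) -> bool:
    """True if the heterozygote estimate lies at or between the two homozygote estimates."""
    low, high = sorted((or_reference_homozygote, or_variant_homozygote))
    return low <= or_heterozygote <= high


# Hypothetical genotype counts (assumptions for illustration only).
cases = {"AA": 400, "Aa": 450, "aa": 150}
controls = {"AA": 500, "Aa": 400, "aa": 100}

or_het = odds_ratio(cases["Aa"], controls["Aa"], cases["AA"], controls["AA"])
or_hom = odds_ratio(cases["aa"], controls["aa"], cases["AA"], controls["AA"])
threshold = bonferroni_alpha(0.05, 3200)

print(f"OR heterozygote: {or_het:.3f}, OR variant homozygote: {or_hom:.3f}")
print(f"consistent with gene dosage ordering: {heterozygote_between_homozygotes(or_het, or_hom)}")
print(f"Bonferroni per-test level for 3200 tests: {threshold:.2e}")
```

A result in which the heterozygotes fall outside the range spanned by the homozygotes would not by itself refute an association, but it would be a reason for caution before reporting it.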

As new SNPs are screened and associations identified, they become candidate gene polymorphisms. The results are subject to interpretation and caveats, as are any new epidemiological findings. Specifically, no single study is definitive, each has its own limitations, and all results await replication in different populations. Moreover, once a candidate gene polymorphism has been found, complementary studies should begin to identify a phenotype associated with the nucleotide base change. Only the combined consideration of studies replicating similar results in different populations and of supportive laboratory evidence will justify the belief that a candidate gene polymorphism is indeed a cancer risk factor. Therefore, pending additional studies, we as scientists need to inform the media and the public that new candidate gene polymorphism reports are preliminary. Also, following the publication of positive results, journals should be amenable to publishing null results. Otherwise, publication bias can result (15).

We believe that the large-scale genotyping of samples from cancer patients will lead to important breakthroughs in understanding gene–environment and gene–gene interactions as the mechanistic basis for the common polygenic cancers. Several authors have considered design issues in molecular epidemiology studies and we concur that these are fundamental to the success of genotyping studies (16). However, we are concerned about the overwhelming volume of data that is about to be generated. To guarantee scientific progress, the results need to be skillfully and responsibly processed by the individual research teams before publication.


    Notes
 
2 To whom correspondence should be addressed. Email: kari.hemminki@cnt.ki.se


    References
 

  1. Hanahan,D. and Weinberg,R. (2000) The hallmarks of cancer. Cell, 100, 57–70.
  2. Hemminki,K. and Mutanen,P. (2001) Genetic epidemiology of multistage carcinogenesis. Mutat. Res., 473, 11–21.
  3. Risch,N. and Merikangas,K. (1996) The future of genetic studies of complex diseases. Science, 273, 1516–1517.
  4. Easton,D. (1999) How many more breast cancer predisposition genes are there? Breast Cancer Res., 1, 14–17.
  5. Shields,P. and Harris,C. (2000) Cancer risk and low-penetrance susceptibility genes in gene–environment interactions. J. Clin. Oncol., 18, 2309–2315.
  6. The International SNP Map Working Group. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409, 928–933.
  7. Venter,J., Adams,M., Myers,E. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.
  8. Cargill,M., Altshuler,D., Ireland,J. et al. (1999) Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet., 22, 231–238.
  9. Ma,X., Jin,Q., Försti,A., Hemminki,K. and Kumar,R. (2000) Single nucleotide polymorphism analyses of the human proliferating cell nuclear antigen (PCNA) and flap endonuclease 1 (FEN1) genes. Int. J. Cancer, 88, 938–942.
  10. Lander,E. and Kruglyak,L. (1995) Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nature Genet., 11, 241–247.
  11. Fearon,E.R. (1997) Human cancer syndromes: clues to the origin and nature of cancer. Science, 278, 1043–1050.
  12. Lichtenstein,P., Holm,N., Verkasalo,P. et al. (2000) Environmental and heritable factors in the causation of cancer. N. Engl. J. Med., 343, 78–85.
  13. Rothman,K. and Greenland,S. (1998) Modern Epidemiology, 2nd Edn. Lippincott-Raven, Philadelphia.
  14. Ma,X., Yang,K., Lindblad,P., Egevad,L. and Hemminki,K. (2001) VHL gene alterations in renal cell carcinoma patients: novel hotspot or founder mutations and linkage disequilibrium. Oncogene, 20, 5393–5400.
  15. Shields,P. (2000) Publication bias is a scientific problem with adverse ethical outcomes: the case for a selection for null studies. Cancer Epidemiol. Biomarkers Prev., 9, 771–772.
  16. Vineis,P., Malats,N., Lang,M., d'Errico,A., Caporaso,N., Cuzick,J. et al. (eds) (1999) Metabolic Polymorphisms and Susceptibility to Cancer. IARC, Lyon.
Received October 25, 2001; revised October 29, 2001; accepted November 6, 2001.