NEWS

Narrowing the Field: Identifying Ways to Improve SNP Association Studies

Rabiya S. Tuma

There are millions of single nucleotide polymorphisms (SNPs) within the human population, and some of them are undoubtedly associated with an increased risk of cancer and other diseases. Thus far, however, genome-wide association studies and even intragenic association studies have produced inconsistent results, and the reported associations do not always seem biologically plausible, even when they appear statistically significant. To improve the reproducibility and plausibility of such studies, some researchers are mining the genome and the literature to help them prioritize and focus their SNP studies.

Speaking in October at the American Association for Cancer Research’s second annual International Conference on Frontiers in Cancer Prevention Research, Timothy Rebbeck, Ph.D., from the Center for Clinical Epidemiology and Biostatistics at the University of Pennsylvania School of Medicine, presented a "laundry list" of approaches researchers can use to maximize the efficacy of their association studies ahead of time—without having to jump into actual laboratory experiments (see box, p. 95). "We don’t want all of our results to be published in the Journal of Irreproducible Results," Rebbeck said.

Rebbeck described his strategy in the context of his own work on the melanocortin-1 receptor gene (MC1R), which has been associated with melanoma risk. The first step is to examine the background experimental data and biology of a gene for hints about which SNPs will be the most interesting. For example, MC1R is known to be required for the synthesis of black and brown pigments, and individuals who have lower levels of MC1R activity have more red or yellow pigments and have red or blonde hair. This sort of general knowledge, plus information about the protein structure, is helpful when thinking about the biological plausibility of association study results, Rebbeck said.

The MC1R gene has about 30 missense SNPs reported, which Rebbeck said may be a bit of an extreme example in terms of number but can illustrate his main point: "The first question we face is, if you have 30, 40, or 50 variants in a gene, how are you going to pick out which ones you are going to study and which ones you are going to ignore because they aren’t likely to be important for function?"

Both experimental and epidemiologic evidence can be helpful. Unfortunately, for many genes, especially newly identified ones, there is often little of either type of information available. Also, not all animal data will translate to humans, so one should not put undue weight on such data when they exist, Rebbeck said.

In the case of epidemiologic evidence, P values (a measure of statistical significance) from past association studies may provide information about where to focus the new work. For example, three MC1R variants, R151C, R160W, and D294H, have been found to be associated with red hair, and those three variants plus D84E appear to be associated with an increased risk of melanoma. That seems interesting, but these are also the most common variants, Rebbeck pointed out. Thus, other SNPs may be more important in disease but have not yet been detected because the case–control studies lacked the power to show statistical significance.
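The point about power can be made concrete with a rough calculation. The sketch below is only an illustration, not a method described by Rebbeck: it uses a standard normal approximation for comparing two proportions, and the carrier frequencies, odds ratio, and sample size are all hypothetical values chosen for the example.

```python
from math import sqrt
from statistics import NormalDist


def case_frequency(control_freq, odds_ratio):
    """Carrier frequency among cases implied by a control frequency and an odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def approx_power(control_freq, odds_ratio, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion test (normal approximation).

    n_per_group is the number of cases, assumed equal to the number of controls.
    """
    p0 = control_freq
    p1 = case_frequency(p0, odds_ratio)
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    numerator = abs(p1 - p0) * sqrt(n_per_group) - z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    denominator = sqrt(p0 * (1 - p0) + p1 * (1 - p1))
    return NormalDist().cdf(numerator / denominator)


# Hypothetical scenario: same odds ratio and study size, different allele frequencies.
common_variant_power = approx_power(control_freq=0.10, odds_ratio=2.0, n_per_group=500)
rare_variant_power = approx_power(control_freq=0.005, odds_ratio=2.0, n_per_group=500)
print(f"common variant: {common_variant_power:.2f}, rare variant: {rare_variant_power:.2f}")
```

Under these assumed numbers, the same study that is well powered for a common variant has little chance of detecting a rare variant with an identical effect size, which is the situation Rebbeck described.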

To get at rare alleles, Rebbeck said, it is important to look at the nucleotide sequence of the gene in question and determine what type of changes the SNPs induce. Missense or non-synonymous SNPs (nsSNPs) are more likely to be functionally relevant than ones that do not change the protein sequence. A SNP that truncates the protein is likewise an indication of a functional alteration.
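Determining the type of change a coding SNP induces amounts to translating the reference and variant codons and comparing the results. The following is a minimal sketch under the standard genetic code; the function name and the example codons are hypothetical and are not taken from the actual MC1R sequence.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, alt_codon):
    """Classify a single-codon change by its effect on the encoded protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense (truncating)"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


# Illustrative codon pairs only; they are not real MC1R codons.
for ref, alt in [("CTG", "CTA"), ("GAT", "GAA"), ("TGG", "TGA")]:
    print(ref, "->", alt, ":", classify_coding_snp(ref, alt))
```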

John Potter, Ph.D., head of the Cancer Prevention Research Program at the Fred Hutchinson Cancer Research Center, Seattle, emphasized that in evaluating amino acid changes induced by nsSNPs, one should take into account whether the change is a non-conservative amino acid substitution or a conservative one, where the native and introduced residues are functionally similar in terms of charge or size.
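One crude way to apply this distinction is to assign each amino acid to a physicochemical class and call a substitution conservative when both residues fall in the same class. The sketch below assumes one common textbook grouping; other groupings exist and would give different answers, and nothing here is presented as the scheme Potter's group used.

```python
import re

# Assumed grouping of the 20 amino acids (one of several schemes in use).
RESIDUE_CLASSES = {
    "nonpolar": "GAVLIMP",
    "aromatic": "FWY",
    "polar uncharged": "STCNQ",
    "basic": "KRH",
    "acidic": "DE",
}
CLASS_OF = {aa: name for name, members in RESIDUE_CLASSES.items() for aa in members}


def parse_variant(variant):
    """Split a variant such as 'D84E' into (native residue, position, new residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", variant)
    if match is None:
        raise ValueError(f"unrecognized variant name: {variant!r}")
    native, position, introduced = match.groups()
    return native, int(position), introduced


def is_conservative(variant):
    """True when the native and introduced residues share a class in this grouping."""
    native, _, introduced = parse_variant(variant)
    return CLASS_OF[native] == CLASS_OF[introduced]


for name in ["R151C", "R160W", "D294H", "D84E"]:
    label = "conservative" if is_conservative(name) else "non-conservative"
    print(name, label)
```

Under this particular grouping, D84E counts as conservative because aspartate and glutamate are both acidic, which is a reminder that such a classification is one line of evidence rather than a verdict.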

But even more than just looking at the individual gene sequence, Potter and Rebbeck have turned to comparisons between evolutionarily related proteins to identify SNPs that are likely to alter protein function and thus play a role in disease.

With MC1R, Rebbeck’s team has used a program called SIFT, which stands for Sorting Intolerant From Tolerant. The program, written by Steven Henikoff, Ph.D., and colleagues from the Fred Hutchinson Cancer Research Center, compares the sequence of interest to publicly available protein databases to find similar sequences, both within the same species and in other species, and then uses that information to predict which SNPs are most likely to be functionally important. Because the program ranks the SNPs, researchers can easily use it to prioritize their efforts.

In Rebbeck’s current study, the team compared MC1R to 37 related human sequences and 29 from other species. Based on these, the program identified regions of the protein where changes were more likely to be tolerated and highlighted specific SNPs likely to have more or less functional impact. Of all the known MC1R variants, SIFT predicted that 13 would have an impact: all three of the red hair alleles, as well as 10 others that had not been widely studied. Therefore, Rebbeck said, electronic data from free Web sites and publicly available programs can be a source of information that can improve association studies.
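The general idea behind conservation-based prediction can be illustrated with a toy example. The sketch below is not SIFT's scoring method; it simply asks how often the variant residue appears at the same position in a set of aligned homologs and flags residues that are rarely or never seen there. The alignment, the threshold, and all function names are hypothetical.

```python
from collections import Counter


def column_frequencies(alignment, position):
    """Frequency of each residue at a 0-based alignment position, ignoring gaps."""
    column = [seq[position] for seq in alignment if seq[position] != "-"]
    if not column:
        return {}
    counts = Counter(column)
    return {aa: n / len(column) for aa, n in counts.items()}


def predict_tolerance(alignment, position, alt_residue, threshold=0.05):
    """Call a substitution tolerated if homologs carry that residue often enough."""
    frequency = column_frequencies(alignment, position).get(alt_residue, 0.0)
    return "tolerated" if frequency >= threshold else "possibly damaging"


def rank_variants(alignment, variants):
    """Order (position, residue) variants from least to most frequently observed."""
    def observed(variant):
        position, residue = variant
        return column_frequencies(alignment, position).get(residue, 0.0)
    return sorted(variants, key=observed)


# Toy alignment of made-up homologous sequences (not MC1R).
TOY_ALIGNMENT = ["MDRCLA", "MDKCLS", "MDRCIA", "MDRCLT", "MDKCVA"]

print(predict_tolerance(TOY_ALIGNMENT, 1, "E"))  # position is invariant D in the toy data
print(predict_tolerance(TOY_ALIGNMENT, 2, "K"))  # K already occurs at this position
```

Because the variants can be sorted by how unusual they are, an approach of this kind lends itself to the ranking use that the article describes.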

In addition to SIFT, Rebbeck and others are comparing structural alignments between related proteins to get hints about which SNPs are likely to be associated with specific diseases. Rebbeck’s team collaborated with Roland J. Dunbrack, Ph.D., of the University of Pennsylvania School of Medicine, and used the crystal structure of bovine rhodopsin to guide them in an in silico model of MC1R. Surprisingly, they found that, although the amino acids encoded by the D84E and D294H SNPs are far apart in the primary amino acid sequence, they come together in the folded protein such that they are neighbors in a protein pocket, hinting that this region of the protein has an important functional role.

Henikoff pointed out that, although such structural alignments are helpful, they require that a related sequence or the protein itself has been modeled—and that is still a minority of proteins. By contrast, the number of primary protein sequences added to the databases is growing at a rapid clip, simultaneously increasing the efficiency of programs like SIFT.

Potter’s group has used both SIFT and a Bayesian phylogenetic approach, which compares orthologous proteins aligned along an evolutionary tree, to identify functionally important regions in the BRCA1 protein. "We obtained essentially the same results with the two methods," said Potter. The phylogenetic comparison identified 38 nsSNPs as being of interest, while SIFT identified 36 of those and another 34 as potentially important. In their paper, the research team reasoned that SIFT includes the additional SNPs because the program compares protein sequences without regard to evolutionary relationships and therefore gives equal weight to changes from all of the different organisms studied. By comparison, the phylogenetic approach gives more weight to changes that arise between closely related species.

One catch with SIFT and other such comparative approaches is that regions of the protein that confer specificity may appear functionally unimportant, said Henikoff. For example, if a researcher is studying a DNA binding protein, the DNA binding domain will have conserved residues that likely resemble those of other DNA binding proteins but will also contain some amino acids that specify the binding site. SIFT may not identify the amino acids that provide specificity as being important because the comparator proteins carry substantial variation at the same position, so the residue will simply seem to tolerate many changes. Thus, said Henikoff, users need to think about what their results mean and not just trust the program entirely.

"There are a lot of ways of thinking of function that don’t require going into the lab and making a [gene] knock-out model animal that may help us understand functional SNPs," said Rebbeck. In the end, Rebbeck takes all of the information gathered and builds a grid with the SNPs listed down one side and the relevant data sets above. It is not likely that there will be information in every data set about all of the SNPs in question, but when he sees a SNP showing up over and over again in the different assays, then he begins to think it is one of the important ones—even if it is rare.
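A grid of this kind is easy to keep as a small table in code. The sketch below fills in only what this article reports for four MC1R variants and marks everything else as unknown; the layout, names, and tallying rule are assumptions made for illustration and are not Rebbeck's actual grid.

```python
# Columns of the grid: the data sets mentioned in this article.
EVIDENCE_SOURCES = ["red hair", "melanoma", "SIFT", "structure"]

# True = supporting evidence reported in the article; None = not stated there.
GRID = {
    "R151C": {"red hair": True, "melanoma": True, "SIFT": True, "structure": None},
    "R160W": {"red hair": True, "melanoma": True, "SIFT": True, "structure": None},
    "D294H": {"red hair": True, "melanoma": True, "SIFT": True, "structure": True},
    "D84E": {"red hair": None, "melanoma": True, "SIFT": None, "structure": True},
}


def support_count(row):
    """Number of data sets in which the SNP shows up as potentially important."""
    return sum(1 for value in row.values() if value is True)


def rank_snps(grid):
    """SNPs ordered by how many data sets support them, ties broken by name."""
    return sorted(grid, key=lambda snp: (-support_count(grid[snp]), snp))


def format_grid(grid, sources):
    """Plain-text grid: SNPs down the side, data sets across the top."""
    symbols = {True: "+", False: "-", None: "?"}
    lines = ["SNP".ljust(8) + "".join(source.ljust(12) for source in sources)]
    for snp in rank_snps(grid):
        cells = "".join(symbols[grid[snp][source]].ljust(12) for source in sources)
        lines.append(snp.ljust(8) + cells)
    return "\n".join(lines)


print(format_grid(GRID, EVIDENCE_SOURCES))
```

Keeping unknown cells distinct from negative ones reflects the point that not every data set will have information on every SNP.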



Copyright © 2004 Oxford University Press (unless otherwise stated)