* National Laboratory of Medical Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences (CAMS) and Peking Union Medical College (PUMC), Beijing, People's Republic of China; Chinese National Human Genome Center, Beijing, People's Republic of China;
MOE Key Laboratory of Bioinformatics, Department of Automation, Tsinghua University, Beijing, People's Republic of China; and
SGDP, Institute of Psychiatry, Kings College, London, United Kingdom
Correspondence: E-mail: sheny{at}ms.imicams.ac.cn.
Abstract
Key Words: recombination, linkage disequilibrium, haplotype block, haplotype-tagging SNPs, polymorphisms
Introduction
Thus far, a consensus definition for haplotype blocks based on the LD structure has not been established. However, a range of operational definitions has been used to identify haplotype-block structures (Patil et al. 2001; Gabriel et al. 2002; Wang et al. 2002; Zhang et al. 2002), which can be roughly classed into three groups. First, there are methods based on diversity in the sequence, such as those of Patil et al. (2001) and Zhang et al. (2002), which define blocks so as to enforce low sequence diversity, by some diversity measure, within each block. The second group consists of LD methods, such as that of Gabriel et al. (2002), which define blocks so as to enforce generally high pairwise LD within blocks and generally low pairwise LD between blocks. Finally, there are methods that look for direct evidence of recombination, such as that of Wang et al. (2002), which uses the four-gamete test developed by Hudson and Kaplan (1985) and defines blocks as apparently recombination-free regions. Recently, Schwartz et al. (2003) examined the validity of the haplotype-block concept by comparing block decompositions derived from two public empirical data sets by several leading methods of block detection. They concluded that the different block-finding algorithms identify similar structure to an extent that cannot be explained by chance, but that the absolute correspondence between block assignments can differ markedly in response to changes in both block definition and optimization criterion. However, studies that systematically compare the identification of haplotype blocks and the selection of htSNPs under these various definitions are still lacking. Whether the different haplotype-block definitions behave similarly with respect to haplotype-block partition and htSNPs selection remains unclear.
Using simulation studies, we explored three popular methods for defining haplotype blocks and their behaviors under different population-genetics scenarios and two distinct recombination models. Furthermore, we compared average haplotype-block size, average htSNPs number, and average information loss identified by each method, because these variables are critical in association mapping. We also compared the proportion of the genome covered by blocks. Our article aims to address three issues: (1) How are the properties of different haplotype-block definitions affected by population-genetics parameters? (2) Are different haplotype-block definitions affected by different recombination models? (3) What is the impact of haplotype-block definitions on htSNPs selection?
Materials and Methods
For a diversity-based test, Patil et al. (2001) defined a haplotype block as a region in which a fraction of α percent or more of all the observed haplotypes are represented at least n times, or at a given threshold, in the sample. The particular case implemented in our study required that, in haplotype blocks, at least 80% of the observed haplotypes should be observed in at least 5% of the sample. To implement this method, we applied the optimization criteria outlined by Zhang et al. (2002). Their paper describes a general algorithm that defines block boundaries in a way that minimizes the number of SNPs required to identify all the haplotypes in a region. We solved it optimally using the dynamic programming algorithm of Zhang et al. (2002). Secondly, for an LD-based test, we used the method of Gabriel et al. (2002), which defines a block as a region in which only a small proportion of marker pairs show evidence of historical recombination. We modified the criteria suggested by Wall and Pritchard (2003) for handling haplotype data instead of unphased genotype data. Blocks are partitioned according to whether the upper and lower confidence limits on estimates of the pairwise D' measure fall within certain threshold values. Finally, we used the four-gamete test of Hudson and Kaplan (1985) as the example of a recombination-based test (Wang et al. 2002), which defines blocks as apparently recombination-free regions under the infinite-sites assumption.
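To make the recombination-based definition concrete, the following is a minimal sketch of a four-gamete block partition for phased, biallelic haplotypes. The function names and the greedy left-to-right scan are illustrative assumptions, not the implementation used in the study: a new block is opened as soon as a site shows all four gametes with any earlier site of the current block.

```python
def shows_four_gametes(haplotypes, i, j):
    """Return True if all four gametes (00, 01, 10, 11) occur at sites i and j."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) == 4


def four_gamete_blocks(haplotypes):
    """Greedy left-to-right partition into apparently recombination-free blocks.

    haplotypes: list of equal-length strings over {'0', '1'} (an assumption).
    Returns a list of (first_site, last_site) index pairs, inclusive.
    """
    n_sites = len(haplotypes[0])
    blocks = []
    start = 0
    for j in range(1, n_sites):
        # Close the current block if site j conflicts with any site inside it.
        if any(shows_four_gametes(haplotypes, i, j) for i in range(start, j)):
            blocks.append((start, j - 1))
            start = j
    blocks.append((start, n_sites - 1))
    return blocks
```

For the four haplotypes `["000", "010", "100", "111"]`, sites 0 and 1 show all four gametes whereas sites 1 and 2 do not, so the sketch returns `[(0, 0), (1, 2)]`.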
Statistic for Comparing Block Partitions
To compare the similarity of the different methods, we used the number of shared block boundaries as a statistic for the similarity of two block partitions (Schwartz et al. 2003; Bafna et al. 2003). If the partitions are independent of one another, the probability that they share exactly m boundaries can be calculated as follows:
[Equation (1)]
[Equation (2)]
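One common way to formalise this independence null, given here only as a hedged sketch, is to treat the boundaries of the second partition as drawn uniformly at random from the candidate inter-marker positions, which makes the number of shared boundaries hypergeometric. The symbols `n_gaps` (number of candidate boundary positions), `k1` and `k2` (boundary counts of the two partitions) are hypothetical names introduced for illustration and are not taken from the text.

```python
from math import comb


def prob_shared_exactly(n_gaps, k1, k2, m):
    """Hypergeometric probability that two independent partitions share exactly m boundaries."""
    if m < 0 or m > min(k1, k2) or k2 - m > n_gaps - k1:
        return 0.0
    return comb(k1, m) * comb(n_gaps - k1, k2 - m) / comb(n_gaps, k2)


def p_value_shared(n_gaps, k1, k2, m_observed):
    """Upper-tail P value: probability of sharing at least m_observed boundaries by chance."""
    upper = min(k1, k2)
    return sum(prob_shared_exactly(n_gaps, k1, k2, m)
               for m in range(m_observed, upper + 1))
```

Under this assumption, a small upper-tail value (for example, below 0.05) would indicate more shared boundaries than expected for independent partitions.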
htSNPs Selection Algorithm
There is currently no consensus on the best criterion for selecting a set of htSNPs that will capture most of the information in a haplotype block (Goldstein et al. 2003). We used the criterion that at least 80% of the haplotypes that occur in at least 5% of the sample could be explained by the htSNPs (Zhang et al. 2002; Patil et al. 2001). Because optimal htSNPs selection has been proved to be an NP-complete problem (Garey and Johnson 1979), we used the branch-and-bound algorithm (De Bontridder et al. 2002) to decrease the running time while guaranteeing the optimal selection of htSNPs in each haplotype block of the simulated data sets.
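As a simplified illustration of what an optimal htSNPs selection means, the sketch below searches exhaustively for the smallest set of sites that distinguishes all common haplotypes in a block. It is not the branch-and-bound algorithm of De Bontridder et al. (2002), and it replaces the 80% coverage criterion with the stricter requirement that every haplotype at frequency of at least 5% be distinguishable; both simplifications, and all function names, are assumptions made for brevity.

```python
from collections import Counter
from itertools import combinations


def common_haplotypes(haplotypes, min_freq=0.05):
    """Return the distinct haplotypes whose sample frequency is at least min_freq."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    return [h for h, c in counts.items() if c / n >= min_freq]


def smallest_tag_set(haplotypes, min_freq=0.05):
    """Exhaustive search for a minimum set of site indices that separates
    all common haplotypes (exponential time; suitable for small blocks only)."""
    common = common_haplotypes(haplotypes, min_freq)
    n_sites = len(haplotypes[0])
    for size in range(n_sites + 1):
        for subset in combinations(range(n_sites), size):
            # Project each common haplotype onto the candidate sites.
            projected = {tuple(h[i] for i in subset) for h in common}
            if len(projected) == len(common):
                return list(subset)
    # Not reached for distinct haplotypes: the full site set always separates them.
    return list(range(n_sites))
```

A branch-and-bound search explores the same space of subsets but prunes branches that cannot improve on the best solution found so far, which is why it can remain exact while running faster than plain enumeration.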
Evaluation of htSNPs Selection in the Simulated Haplotype
We used the information-theoretic quantity known as Shannon Entropy (Shannon 1948) as one way of measuring information within the whole haplotype (Judson et al. 2002; Avi-Itzhak, Su and De La Vega 2003).
[Equation (3)]
[Equation (4)]
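For orientation, the Shannon entropy of a haplotype sample is H = −Σ p_i log2 p_i, where p_i is the frequency of the i-th distinct haplotype. The sketch below computes it and a relative information loss after reducing haplotypes to a subset of sites. The loss measure, 1 − H(htSNPs)/H(all SNPs), is an assumption chosen for illustration and may differ from the definition used in the study.

```python
from collections import Counter
from math import log2


def shannon_entropy(haplotypes):
    """Shannon entropy (in bits) of the haplotype frequency distribution."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def relative_information_loss(haplotypes, tag_sites):
    """Fraction of haplotype entropy lost when only tag_sites are retained.

    Assumed definition: 1 - H(reduced) / H(full); returns 0.0 when H(full) is 0.
    """
    full = shannon_entropy(haplotypes)
    reduced = shannon_entropy([tuple(h[i] for i in tag_sites) for h in haplotypes])
    return 0.0 if full == 0 else 1.0 - reduced / full
```

With four equally frequent haplotypes `00`, `01`, `10`, and `11`, the full entropy is 2 bits; keeping only the first site leaves 1 bit, so the assumed loss measure is 0.5.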
Simulation Data Sets
The coalescent process is a powerful tool in population genetics and is used to model a wide variety of biological phenomena (Hudson 1983; Fu and Li 1999). Simulation tools such as mksamples (Hudson 2002) can generate realistic data under various assumptions about the underlying biology and demography. We used mksamples to simulate the genetic data under the uniform recombination rate. To simulate the genetic data under the simple single-hotspot model for recombination variation, we used the algorithm in Li and Stephens (2003) to postprocess the output from mksamples.
Coalescent Model with Uniform-Recombination Rate
We followed the methods suggested in Wang et al. (2002) to simulate the data set under the uniform recombination rate. The simulations had a sample size of n = 50 (n is the number of chromosomes) under variable population-mutation rate (θ = 4Neµ, where Ne is the effective population size and µ is the mutation rate per locus per generation) and population-recombination rate (ρ = 4Ner, where r is the recombination rate per locus per generation). To examine the effects of population parameters on haplotype-block pattern and htSNPs selection under the different haplotype-block definitions, we simulated data sets using three values of θ (5, 10, and 25) and varying ρ from 0.1θ to 5.0θ. The three values of θ correspond to Ne of 2,000, 4,000, and 10,000, respectively; the mutation rate µ is fixed at 10⁻⁹ per site per year.
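The correspondence between the population-mutation rate θ = 4Neµ and Ne can be checked with a short calculation. The sketch below assumes a 25-kb region (inferred from the later statement that a hotspot spanning 0.04 of the region is 1 kb wide) and a generation time of 25 years; the generation time is not stated in this section and is an assumption, as are the function and parameter names.

```python
def population_mutation_rate(ne, mu_per_site_per_year=1e-9,
                             generation_years=25, region_sites=25_000):
    """theta = 4 * Ne * mu, with mu converted to a per-locus, per-generation rate.

    generation_years and region_sites are illustrative assumptions.
    """
    mu_per_locus_per_generation = (mu_per_site_per_year
                                   * generation_years * region_sites)
    return 4 * ne * mu_per_locus_per_generation


for ne in (2_000, 4_000, 10_000):
    print(ne, round(population_mutation_rate(ne), 6))
```

Under these assumptions the per-locus mutation rate is 6.25 × 10⁻⁴ per generation, and the three effective population sizes give θ of approximately 5, 10, and 25.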
Furthermore, we examined the contribution of θ to haplotype-block characteristics by allowing θ to vary for fixed values of ρ (ρ = 0.4, 2, 6, and 10, respectively). We set θ = 1, 2, 3, 4, 5, 6, 7, 8, 10, 15, 20, and 25.
Coalescent Model with Recombination Hotspot
To simulate the genetic data under a recombination hotspot, we assumed that haplotype blocks with low recombination rates are separated by short recombination hotspots. We used the algorithm in Li and Stephens (2003) to postprocess the output from mksamples (Hudson 2002) to simulate data under the single-hotspot model for recombination variation. The algorithm is outlined as follows:
Suppose a sample was simulated with approximately S segregating sites. The background recombination rate is ρ. A hotspot of width w = (b − a) lies between positions a and b, with recombination rate λρ, where λ > 1 quantifies the magnitude of the recombination hotspot. We followed the steps given by Li and Stephens (2003).
Each data set was simulated to have about 120 segregating sites and n = 50, with variable ρ (from 1 to 25) and variable λ (20, 50, and 100). The recombination hotspot was located in the center of the region, where a = 0.48 and b = 0.52 (i.e., the recombination hotspot has a width of 1 kb).
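The idea of postprocessing a uniform-rate simulation into a single-hotspot map can be sketched as a change of coordinates. The code below is not the published Li and Stephens (2003) procedure; it is a hedged illustration that assumes site positions from the uniform simulation are interpreted as fractions of total genetic length and are mapped back to physical positions through a piecewise-constant recombination map. It does not correct the resulting SNP density inside the hotspot, and all function names are hypothetical.

```python
def total_relative_rate(a, b, lam):
    """Total recombination over [0, 1] in units of the background rate."""
    w = b - a
    return (1.0 - w) + lam * w


def uniform_equivalent_rho(rho_background, a, b, lam):
    """Region-wide rho to use in a uniform-rate simulation (assumed scheme)."""
    return rho_background * total_relative_rate(a, b, lam)


def genetic_to_physical(g, a, b, lam):
    """Map a genetic-map fraction g in [0, 1] to a physical position in [0, 1]."""
    total = total_relative_rate(a, b, lam)
    g_a = a / total                    # genetic coordinate of hotspot start
    g_b = (a + lam * (b - a)) / total  # genetic coordinate of hotspot end
    if g <= g_a:
        return g * total
    if g <= g_b:
        # Inside the hotspot, genetic distance accumulates lam times faster.
        return a + (g - g_a) * total / lam
    return b + (g - g_b) * total
```

With a = 0.48, b = 0.52, and λ = 20, the hotspot occupies 4% of the physical region but about 45% of the genetic map, so many of the simulated crossovers fall within it.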
We ran 1,000 replicates for all of the simulations. Programs for the analyses were written in C++ or Perl and are available from the authors on request.
Empirical Data Sets
Polymorphism data (both SNPs and indels) were downloaded from SeattleSNPs on August 23, 2003. A total of 130 loci were available on that date. The data were obtained by DNA resequencing of 24 unrelated African Americans and 23 unrelated European Americans from the Coriell Cell Repository. We reconstructed the haplotypes from these unphased genotypic data by using PHASE (Stephens, Smith, and Donnelly 2001).
Results
To further characterize the haplotype-block pattern in the simulated data set, that is, to illustrate whether the haplotype-block model can explain the whole LD structure, we prepared plots showing the parts of each region that were contained in haplotype blocks of various sizes (fig. 2). These plots indicated that both the population-genetics parameters and the haplotype-block definition affected the distribution of haplotype-block sizes. As might be expected, there was an inverse correlation between the proportion of sequence contained in haplotype blocks and the recombination rate. Relatively smaller blocks were identified at higher effective population sizes. The block-size distribution also differed among the haplotype-block definitions. For the diversity-based method (fig. 2a–c), approximately 20% of the chromosome regions could not be covered by haplotype blocks across the different ratios of ρ/θ for a fixed θ; even with the effective population size (Ne) increasing from 2,000 or 4,000 to 10,000, approximately 20% of the chromosome regions could not be covered, and the proportion of relatively small blocks (0–5 kb) increased from approximately 20% to approximately 60%. There was more variance in the proportion that could not be covered by haplotype blocks for the LD-based method (fig. 2d–f), which increased from approximately 20% to approximately 80% when ρ/θ increased from 0.1 to 5 under the different effective population sizes (Ne). Larger haplotype blocks (>10 kb) were observed only when ρ/θ was small. For the recombination-based method (fig. 2g–i), approximately 10% to approximately 30% of the chromosome regions could not be covered by haplotype blocks. When ρ/θ ≤ 2, there was much variance in the haplotype-block distribution, whereas when ρ/θ > 2, the distribution was not affected by the recombination rate ρ.
The Block Comparison
The measure derived by Bafna et al. (2003) and Schwartz et al. (2003) was used to determine whether the block boundaries derived from the different methods were comparable. For each sample in the simulated data set, we calculated the P value for the intersection of the two partitions being random. If the P value is less than a threshold (0.05), the null hypothesis that the two partitions are independent can be rejected. We calculated the proportion of the 1,000 simulated samples that had a significant P value. Tables 1 and 2 show the pairwise comparisons of the different methods applied to the simulated data. Under the uniform recombination model, fewer than half of the 1,000 simulated samples indicated that two partitions were related, with proportions ranging from 0.005 to 0.498. In comparison, more samples demonstrated dependence under the recombination-hotspot model. The results from both models show some degree of similarity between the LD-based method and the diversity-based method. However, there appears to be no relationship between the diversity-based and the recombination-based methods. This is consistent with the result obtained by analyzing the chromosome 21 haplotype data set and the human lipoprotein lipase (LPL) data set (Schwartz et al. 2003). The average block sizes in figures 1 and 5 also provide evidence that the recombination-based method and the LD-based method behave more similarly in haplotype-block partition than any other pair of methods.
Discussion
We have observed varying haplotype-block patterns under different population-genetics scenarios and two coalescent models, that is, with a uniform recombination rate versus recombination hotspots. For the uniform recombination model, the diversity-based method is less sensitive to the population-recombination rate (fig. 1a–c) and population history (figs. 3a–d and 4a–c) than both the LD-based and the recombination-based methods. In the recombination-hotspot model, the diversity-based method is unable to recognize the haplotype-block structure in the simulated data set (fig. 5a–c).
The block partition and htSNPs selection results should provide information about the genotyping effort needed to cover a region or the whole genome sufficiently. The proportion of the genome covered by blocks varies considerably with population-genetics parameters when either the LD-based method (figs. 2d–f and 6a–c) or the recombination-based method (figs. 2g–i and 6d–f) is used. The proportion covered by the diversity-based method (figs. 2a–c and 6g–i) is much less variable.
Based on the descriptive statistics and the P value for testing whether two haplotype-block methods are related, we conclude that the recombination-based method appears to be much closer to the LD-based method than either of those is to the diversity-based method under these two recombination models. This conclusion is consistent with the empirical data analysis of the Patil et al. (2001) chromosome 21 haplotype data and the Nickerson et al. (2000) LPL data by Schwartz et al. (2003). These results should be tested on more empirical data because fewer than half of the P values indicated that the two methods are related in our simulations and empirical data analysis.
How can these differences across the different haplotype-block definitions be explained? In our simulation study, under both the recombination-uniform model and the recombination-hotspot model, our results indicate that population-recombination rate and population history have critical effects on haplotype-block partitioning under the different haplotype-block definitions. In the recombination-hotspot model, the haplotype-block structure cannot be recognized when the background recombination rate is high. The imperfect nature of the haplotype-block concept has been proposed as the cause of the differences across the different definitions (Schwartz et al. 2003). Several studies have suggested that haplotype blocks can arise not only by recombination (Daly et al. 2001; Gabriel et al. 2002; Goldstein 2001) but also by factors such as natural selection, population bottlenecks, population admixture, choices of marker spacing, and allele frequencies (Phillips et al. 2003; Stumpf and Goldstein 2003).
There is currently no consensus on the criterion that best measures the performance of a set of htSNPs in capturing information on haplotype structure within a genome region of interest. The criteria used by Johnson et al. (2001) and Weale et al. (2003) can be split into two classes: those based on capturing as much as possible of the original haplotype diversity present in the set of known SNPs when they are reduced to the smaller set of htSNPs (Patil et al. 2001; Zhang et al. 2002; Johnson et al. 2001; Clayton 2002) and those based on establishing as high an association as possible between the reduced htSNPs set and the larger set (Weale et al. 2003). Different criteria for htSNPs selection in each haplotype block would lead to different performance of the htSNPs or tagging-SNPs selection (Weale et al. 2003). We used only one of the diversity-based criteria (Patil et al. 2001; Zhang et al. 2002) to perform an initial comparison of htSNPs selection under different haplotype-block definitions. The different haplotype-block definitions lead to different numbers of SNPs, which results in different haplotype information loss as measured by entropy. The alternative selections of htSNPs under different haplotype-block definitions will probably make haplotype-block-based association mapping for complex disease quite variable. Whether the haplotype-block definitions have different effects on the statistical power of haplotype-block-based candidate-gene or whole-genome association studies is still an important issue to be resolved in the future. However, both diversity-based and association-based criteria for selecting htSNPs could, in fact, be applied without regard to the underlying block structure, as was advocated by several previous papers (Weale et al. 2003; Goldstein et al. 2003).
In conclusion, we performed an initial systematic simulation study to compare haplotype-block definitions both under various population-genetics parameters and under two different recombination models. Based on our study, we conclude the following: (1) The behaviors of haplotype blocks under different haplotype-block definitions are affected by population-genetics parameters, especially by population-mutation rate and population-recombination rate for the coalescent with uniform recombination framework. (2) The recombination intensity has no effect on the haplotype-block partition and htSNPs selection for the coalescent with recombination-hotspot framework. (3) Under both recombination models, the LD-based definition is more similar or related to the recombination-based definition. (4) Under both recombination models, there is more variance in illustrating the LD structure under the LD-based and recombination-based definitions because they appear to be affected by the population-recombination rate, whereas the diversity-based definition is not as sensitive to population-recombination rate. (5) Different haplotype-block definitions lead to different average numbers of selected htSNPs; under the diversity-based definition, the reduced number of htSNPs selected leads to an increase in the haplotype information loss. To perform haplotype-block-based association mapping, care is needed when choosing the haplotype-block definitions and htSNPs.
References
Avi-Itzhak, H. I., X. Su, and F. M. De La Vega. 2003. Selection of minimum subsets of single nucleotide polymorphisms to capture haplotype block diversity. Pacific Symp. Biocomput. 8:466–477.
Bafna, V., B. V. Halldórsson, R. Schwartz, A. G. Clark, and S. Istrail. 2003. Haplotype and informative SNP selection algorithms: don't block out information. RECOMB'03:19–27.
Clayton, D. 2002. Choosing a set of haplotype tagging SNPs from a larger set of diallelic loci. http://www-gene.cimr.cam.ac.uk/clayton/software/stata/htSNP/htsnp.pdf.
Daly, M. J., J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander. 2001. High-resolution haplotype structure in the human genome. Nat. Genet. 29:229–232.
De Bontridder, K. M. J., B. J. Lageweg, J. K. Lenstra et al. 2002. Branch-and-bound algorithms for the test cover problem. Pp. 223–233 in Algorithms – ESA 2002 LNCS. Springer, Berlin.
Fu, Y. X., and W. H. Li. 1999. Coalescing into the 21st century: an overview and prospects of coalescent theory. Theor. Popul. Biol. 56:1–10.
Gabriel, S. B., S. F. Schaffner, H. Nguyen et al. (18 co-authors). 2002. The structure of haplotype blocks in the human genome. Science 296:2225–2229.
Garey, M. R., and D. S. Johnson. 1979. Computers and intractability: a guide to the theory of NP-completeness. WH Freeman, New York.
Goldstein, D. B. 2001. Islands of linkage disequilibrium. Nat. Genet. 29:109–111.
Goldstein, D. B., K. R. Ahmadi, M. E. Weale, and N. W. Wood. 2003. Genome scans and candidate gene: approaches in the study of common diseases and variable drug responses. Trends Genet. 19:615–622.
Hoehe, M. R. 2003. Haplotypes and the systematic analysis of genetic variation in genes and genomes. Pharmacogenomics 4:547–570.
Hudson, R. R. 1983. Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23:183–201.
Hudson, R. R. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.
Hudson, R. R., and N. Kaplan. 1985. Statistical properties of the number of recombination events in the history of a sample of sequences. Genetics 111:147–164.
Jeffreys, A. J., L. Kauppi, and R. Neumann. 2001. Intensely punctuate meiotic recombination in the class II region of the major histocompatibility complex. Nat. Genet. 29:217–222.
Johnson, G. C., L. Esposito, B. J. Barratt et al. (21 co-authors). 2001. Haplotype tagging for the identification of common diseases genes. Nat. Genet. 29:233–237.
Judson, R., B. Salisbury, J. Schneider, A. Windemuth, and J. C. Stephens. 2002. Pharmacogenomics 3:379–391.
Kruglyak, L. 1999. Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22:139–144.
Li, N., and M. Stephens. 2003. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165:2213–2233.
Nickerson, D. A., S. L. Taylor, S. M. Fullerton, K. M. Weiss, A. G. Clark, J. H. Stengaard, V. Salomaa, E. Boerwinkle, and C. F. Sing. 2000. Sequence diversity and large-scale typing of SNPs in the human apolipoprotein E gene. Genome Res. 10:1532–1545.
Patil, N., A. J. Berno, D. A. Hinds et al. (22 co-authors). 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294:1719–1723.
Phillips, M. S., R. Lawrence, R. Sachidanandam et al. (35 co-authors). 2003. Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat. Genet. 33:382–387.
Satta, Y., C. Ohuigin, N. Takahata, and J. Klein. 1993. The synonymous substitution rate of the major histocompatibility complex loci in primates. Proc. Natl. Acad. Sci. USA 90:7480–7484.
Schwartz, R., B. V. Halldórsson, V. Bafna, A. G. Clark, and S. Istrail. 2003. Robustness of inference of haplotype block structure. J. Comp. Biol. 10:13–19.
SeattleSNPs. NHLBI Program for Genomic Applications, UW-FHCRC, Seattle, Wash. (http://pga.gs.Washington.edu) (accessed August 2003).
Shannon, C. E. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27:379–423.
Stephens, M., N. J. Smith, and P. Donnelly. 2001. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68:978–989.
Stumpf, M. P. H. 2002. Haplotype diversity and the block structure of linkage disequilibrium. Trends Genet. 18:226–228.
Stumpf, M. P. H., and D. B. Goldstein. 2003. Demography, recombination hotspot intensity, and the block structure of linkage disequilibrium. Curr. Biol. 17:502–510.
Wall, J. D., and J. K. Pritchard. 2003. Assessing the performance of the haplotype block model of linkage disequilibrium. Am. J. Hum. Genet. 73:502–515.
Wang, N., J. M. Akey, K. Zhang, R. Chakraborty, and J. Li. 2002. Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am. J. Hum. Genet. 71:1227–1234.
Weale, M. E., D. Chantal, S. J. Macdonald, A. Smith, P. S. Lai, S. D. Shorvon, N. W. Wood, and D. B. Goldstein. 2003. Selection and evaluation of tagging SNPs in the neuronal-sodium-channel gene SCN1A: implications for linkage-disequilibrium gene mapping. Am. J. Hum. Genet. 73:551–565.
Wiuf, C., and D. Posada. 2003. A coalescent model of recombination hotspots. Genetics 164:407–417.
Zhang, K., M. Deng, T. Chen, M. S. Waterman, and F. Z. Sun. 2002. A dynamic programming algorithm for haplotype block partitioning. Proc. Natl. Acad. Sci. USA 99:7335–7339.