Department of Radiation Oncology, Ottawa Regional Cancer Centre, Ottawa, Ontario, Canada
Abstract |
---|
Key Words: exponential dispersion model; coalescent model; population genetics; scale invariance
Introduction |
---|
The International SNP Map Working Group (ISMWG) has mapped the density of SNPs along all 24 human chromosomes, down to a resolution of 200 kb per sample bin (International SNP Map Working Group [ISMWG] 2001). At the scales of both the sample bins and the reads used to assess each bin, their maps agreed with the conventional coalescent model. Evidence will be presented here that the distribution of SNPs from the ISMWG map, when examined over a range of measurement scales, exhibits a power function relationship between the estimated variance and the mean. This power function will be shown to be quantitatively consistent with the coalescent model in the presence of recombination.
This variance to mean power function can also be considered in the context of exponential dispersion models (Jørgensen 1997), a comprehensive family of statistical models. Exponential dispersion models serve as error distributions for generalized linear models, and they provide a formal basis to study a wide variety of non-normal data. The reader is referred to the monograph by Jørgensen (1997) for general information regarding these models; details specific to the distribution of SNPs are provided here in Appendix A.
If the distribution of SNPs were to be described by an exponential dispersion model (a reasonable assumption given the generality of these models), the variance to mean power function would implicate one particular model represented by the scale invariant sum of a Poisson number of independently distributed gamma random variables. Evidence will be presented here that this Poisson gamma (PG) model accurately depicts the distribution of SNPs in the presence of recombination. Because coalescent models with significant recombination generally require Monte Carlo simulation (Hudson 1991) for their depiction, an algebraic approach as proposed here could be of interest.
The intent of the present study is to examine how well this scale invariant PG model describes the distribution of human SNPs. The variance to mean power function, a characteristic feature of scale invariant exponential dispersion models, provides an initial test. A more specific evaluation follows, based on the predicted PG cumulative distribution function (CDF) and its compliance with the empirical CDFs derived from the ISMWG map. It will be argued that this PG distribution of SNPs implies a genomic substructure comprised of multiple sequential and nonoverlapping segments.
To better relate the PG model to the current paradigm for genomic structure, a hypothesis will be proposed here that the segments inferred by the PG model represent haplotype blocks, and that the number of segments within a sample bin is indirectly related to the number of recombinations that have accumulated within that bin. One may estimate population genetic parameters based on the fits of the PG distribution to data from each chromosome. The heterozygosities so estimated will be compared to more direct estimates taken from the ISMWG map, and the number of putative segments within a region will be used to estimate the number of haplotype blocks. At this juncture it will be helpful to first review pertinent features of the coalescent theory and the ISMWG map.
Predictions from the Conventional Coalescent Model |
---|
The estimate was defined at the scale of cistrons so as to exclude the possibility of recombination. In this equation $\theta = 4N_e u$ was a population genetic parameter, with $N_e$ being the population's effective size and $u$ the mutation rate per sequence per generation. The mean and variance for K were, respectively, $E(K) = \theta$ and $\operatorname{var}(K) = \theta + \theta^2$; thus the predicted variance to mean relationship for this case with no recombination was:

$$\operatorname{var}(K) = E(K) + E(K)^2 \qquad (2)$$
Watterson assumed a constant rate of neutral mutations and an infinite site model in his derivation. With the additional assumption of random mating within the population, he also showed that for the sequence length $L$ and heterozygosity $\pi$, $E(\pi) = \theta/L$.
With the development of the coalescent approach came the means to describe K in the presence of recombination. The variance of K became $\operatorname{var}(K) = \theta + \theta^2 \operatorname{var}(T)$, where T represented the weighted average length of the genealogies (Hudson 1991). With high rates of recombination the variance of T would approach zero, and K would then be Poisson distributed. The variance to mean relationship equation (2) was thus modified (Kaplan and Hudson 1985):

$$\operatorname{var}(K) = E(K) + E(K)^2 \operatorname{var}(T) \qquad (3)$$
In the expression for var(T) the recombination parameter R was related to both the recombination rate $\rho$ (events/generation/bp) and L, by either $R = 4N_e \rho L$ for autosomes or $R = 2N_e \rho L$ for the X chromosome. For the Y chromosome, R = 0, because recombination does not occur on the Y chromosome, except within two small pseudoautosomal regions (Tilford et al. 2001). Given a sufficiently large and dense SNP map, it would presumably be possible to fit equation (3) to a set of estimated variances and means calculated over a range of scales L in order to estimate the magnitude of R. Such a map became available through the efforts of the ISMWG (2001).
The Human SNP Map |
---|
Discordant pairs of chromosomes, which differed at specific sites, were therefore identified. The heterozygosity analysis considered only pair-wise comparisons of chromosomes, and so the sample size for this analysis was exactly 2 (D. Altshuler, personal communication). When the ISMWG examined more than two chromosomes, each pair was considered separately in the analysis. Human population heterozygosity (SNPs/bp) was thus estimated by dividing the number of discordant sites within each bin by the number of comparisons that did and did not show discordance at a specific locus. This gave a probability that, at a randomly chosen site within the bin, two randomly chosen chromosomes would be discordant. From this probability one could then estimate the number of SNPs per bin.
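The arithmetic described in this paragraph can be illustrated as follows. The function names and the numbers in the example are hypothetical and are not taken from the ISMWG data; the sketch only mirrors the two steps named in the text (a per-site discordance probability, then a count per bin).

```python
def heterozygosity(discordant_sites: int, compared_sites: int) -> float:
    """Probability that two randomly chosen chromosomes differ at a site.

    compared_sites counts every pairwise comparison in the bin,
    whether or not it showed discordance.
    """
    return discordant_sites / compared_sites


def expected_snps_per_bin(het: float, bin_length_bp: int) -> float:
    """Scale the per-site heterozygosity up to a whole bin."""
    return het * bin_length_bp


# Hypothetical bin: 15 discordant sites among 20,000 compared sites,
# scaled to a 200 kb bin.
het = heterozygosity(15, 20_000)
snps = expected_snps_per_bin(het, 200_000)
```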
As indicated, repetitive DNA was excluded from study. Alignments of sample sequences to the genome were excluded if less than 99% identical. Any individual sample sequence was excluded if it aligned to more than 1 bin in the genome assembly. Bins were excluded if more than 10% of the sample sequences that mapped to them also mapped to another chromosome.
With the sampling methods and exclusions employed here, there was a possibility that some unrecognized nonrandom bias might be introduced that could affect the proposed analysis. For this reason a number of additional controls were included. These will be presented in due course, but first we will examine the relationship between the variance and the mean number of SNPs per bin.
The Variance to Mean Relationship over a Range of Scales |
---|
These variance estimates were plotted versus the corresponding mean estimates of SNPs per bin for each chromosome on log-log plots over the range of bin sizes (fig. 2). The logarithmic transform of the theoretical variance to mean relationship (eq. 3) was fitted to the similarly transformed estimates for the variance and mean number of SNPs. As can be seen from figure 2, the ISMWG data (circles) and the predictions from the coalescent model (crosses) were often superimposed.
To further test whether spurious influences might have affected the estimation methods, simulations based on physical distance were performed according to the method of Hudson (1991) using a fixed recombination rate and a range of bin sizes to estimate the variance and the mean. The simulations employed a constant effective population size of 1 million and recombination and mutation rates that were constant and homogeneous for all loci. There was no admixture, inbreeding, or any bottleneck assumed in the simulated population structure.
Equation (3) was fitted to the simulated data and, when the same recombination rate as used in the simulation was substituted into equation (3), the theoretical estimates (crosses) became superimposed on the simulated data (circles; fig. 2, lower right-hand corner). The method to estimate recombination rates from the ISMWG data thus did not seem to have been influenced by any spurious effects from the estimation process.
The ISMWG data plotted in figure 2 revealed apparently linear relationships between the logarithms of the variance and the mean. Indeed, regression analyses of data from each chromosome revealed strong correlations with the linearized relationship (table 1),

$$\log(\text{variance}) = \log a + p \, \log(\text{mean}) \qquad (4)$$
Goodness of fit between the logarithmically transformed ISMWG data and equation (4) was assessed through the standard deviations of the residuals SD as well as with the squared correlation coefficients r² (table 1). The standard deviations provided a measure of the statistical scatter about the regression line; r² gave an estimate for the proportion of variance accounted for by the model. As such these two measures provided independent assessments of the regression. The residuals obtained tended to be small and normally distributed about zero (fig. 3a) and the values for r² tended to be high, indicating good fits.
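A minimal sketch of this kind of log-log regression is given below, using ordinary least squares on base-10 logarithms. It is not the author's code: the function name is hypothetical, the choice of base-10 logarithms and of n - 2 degrees of freedom for SD are assumptions, and the example data are synthetic.

```python
import math


def fit_power_function(means, variances):
    """Fit log(variance) = log(a) + p * log(mean) by least squares.

    Returns the fitted a and p, the squared correlation coefficient r2,
    and the standard deviation of the residuals (n - 2 degrees of freedom).
    """
    x = [math.log10(m) for m in means]
    y = [math.log10(v) for v in variances]
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    p = s_xy / s_xx
    log_a = y_bar - p * x_bar
    residuals = [yi - (log_a + p * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "a": 10 ** log_a,
        "p": p,
        "r2": 1.0 - ss_res / ss_tot,
        "sd": math.sqrt(ss_res / (n - 2)),
    }


# Synthetic example: an exact power function with a = 3 and p = 1.33.
example_means = [1.0, 2.0, 4.0, 8.0, 16.0]
example_variances = [3.0 * m ** 1.33 for m in example_means]
fit = fit_power_function(example_means, example_variances)
```

On noise-free synthetic input the fit recovers a and p to rounding error; with real per-bin estimates the residual SD and r² would carry the goodness-of-fit information discussed above.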
Based on an assessment of the values of SD, the fits to equation (4) were compared to those with equation (3). The transformed power function provided better fits (paired t-test: t = 2.2, df = 23, P = 0.04). However, the power function employed two adjustable parameters (a and p) whereas equation (3) employed only one (R). Nevertheless, in view of the complexity of equation (3), the empirical variance to mean relationship was much more simply represented by the power function.
The best fits to equation (4) for each chromosome were plotted in figure 2 as solid lines, and for the most part they were indistinguishable from both the coalescent predictions (eq. 3) and the ISMWG data. From these fits, the power function's exponent p from all 24 chromosomes ranged from 1.07 to 1.86 (mean 1.33, 95% confidence interval [CI] 1.26 to 1.40), a result that will be shown to have greater significance below.
The transformed variance to mean power function also correlated strongly with the simulated data, calculated on the basis of physical distance (r² = 0.999, $P = 2 \times 10^{-16}$; fig. 2, lower right-hand corner). In the simulations the population structure was assumed to be free of inbreeding, recent admixture, bottlenecks, and heterogeneity. The residuals from the fit of equation (4) were relatively small and normally distributed about zero (fig. 3a). As well, the value for the power function exponent p = 1.33, obtained from the simulated data, agreed with the ISMWG estimates. Because the variance to mean power function was consistent with these simulated data, the possibility that the power function relationship could be attributed to either artifacts of data collection or estimation seemed unlikely.
To summarize the findings to this point: the fits of the coalescent model (eq. 3) and the transformed variance to mean power function (eq. 4) to both the ISMWG and simulated data were compared. Both relationships fitted these data well. The more complicated coalescent model, however, had the advantage that it was based on a population genetic mechanism, whereas at this juncture the variance to mean power function might be considered simply a functional coincidence. To dispel this concern, a model will be proposed that explains how the variance to mean power function could arise from population genetic mechanisms. But before any possible mechanisms can be considered, some background on scale invariant exponential dispersion models will be helpful.
Scale Invariant Exponential Dispersion Models |
---|
The family of scale invariant exponential dispersion models can be further subcategorized according to the values taken by the exponent p. The scale invariant PG exponential dispersion model is uniquely specified by 1 < p < 2; for p = 2 the unique family member is the gamma distribution. Thus if the distribution of SNPs were to be described by an exponential dispersion model, the PG distribution would be the candidate distribution for cases with recombination; the gamma distribution, for cases without.
Additive scale invariant exponential dispersion models are described by the cumulant generating function (CGF):

$$K^*(s; \phi, \lambda) = \lambda \, \kappa(\phi) \left[ \left( 1 + \frac{s}{\phi} \right)^{\alpha} - 1 \right] \qquad (6)$$

Here, Z is a continuous random variable, $\phi$ and $\lambda$ are the canonical and index parameters, respectively, $\kappa(\phi)$ is the cumulant function, the constant $\alpha = (p-2)/(p-1)$, and s is the index variable for the generating function. (In the mathematical literature the canonical parameter $\phi$ is conventionally denoted by $\theta$, but this was changed to avoid confusion with the population genetic parameter $\theta$.) This CGF directly implies the variance to mean power function (eq. 5) and, when this power function's exponent is restricted such that 1 < p < 2, it yields the PG distribution. In this case the distribution represents the sum of N independent and identically distributed random variables with gamma distributions, where N obeys a Poisson distribution.
Jørgensen (1997, p 141) has derived the probability density function for the additive form of the PG distribution:
|
|
We will next consider how this probabilistic model can be adapted to describe the distribution of SNPs and haplotype blocks within the genome.
A Poisson Gamma Distribution of SNPs |
---|
We would like to construct a probabilistic model for the distribution of haplotype blocks and SNPs within the genome. To begin, we divide the genome into a sequence of non-overlapping and equal-sized bins of size large enough that, on average, each bin would be expected to contain multiple haplotype blocks. Let the continuous random variable Z denote the estimated number of SNPs per bin, and let the discrete random variable N describe the number of haplotype blocks per bin. Given current knowledge of the distribution of haplotype blocks within the genome, it would seem reasonable to assume that N is a random (Poisson distributed) number.
Now, if we also assume that the mean number of SNPs per haplotype block is described by a gamma distribution, the total number of SNPs per bin would be represented by the sum of N independent and identically distributed gamma random variables, and it would thus obey a PG distribution. If one were to further require that Z conform to an exponential dispersion model, the scale invariant PG distribution (eq. 7) would result. A gamma distribution for the mean number of SNPs per block can be viewed as a continuous valued generalization of Watterson's equation (eq. 1), and it is consistent with observed patterns of rate variation in nuclear and mitochondrial DNA (Wakeley 1994).
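The construction just described (a Poisson number of blocks per bin, each contributing a gamma distributed number of SNPs) can be simulated directly. The sketch below is illustrative only: the function names, the parameter values and the use of Knuth's Poisson sampler are assumptions, and the gamma shape and scale are free inputs rather than the scale invariant parameterization of equation (7).

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw a Poisson variate by Knuth's multiplication method (small means only)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_snps_per_bin(rng, mean_blocks, gamma_shape, gamma_scale):
    """Sum a Poisson number of gamma variates; a bin with no blocks gives 0."""
    n_blocks = sample_poisson(rng, mean_blocks)
    return sum(rng.gammavariate(gamma_shape, gamma_scale) for _ in range(n_blocks))


def simulate_bins(n_bins, mean_blocks, gamma_shape, gamma_scale, seed=0):
    """Simulate Z for n_bins independent bins with a fixed random seed."""
    rng = random.Random(seed)
    return [
        sample_snps_per_bin(rng, mean_blocks, gamma_shape, gamma_scale)
        for _ in range(n_bins)
    ]


# Hypothetical parameters: 3 blocks per bin on average, gamma(shape=2, scale=1.5).
draws = simulate_bins(20_000, mean_blocks=3.0, gamma_shape=2.0, gamma_scale=1.5, seed=1)
sample_mean = sum(draws) / len(draws)  # theory: 3 * 2 * 1.5 = 9
zero_fraction = sum(1 for z in draws if z == 0) / len(draws)  # theory: exp(-3)
```

The point mass at zero (bins that happen to contain no blocks) is a distinctive feature of the PG distribution and is visible in the simulated `zero_fraction`.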
In this context, with canonical parameter $\phi$ and index parameter $\lambda$, the mean number of segments per bin would be $E(N) = \lambda \cdot \kappa(\phi)$; the mean number of SNPs per bin would give the population genetic parameter $\theta = 4N_e u = \lambda \cdot \alpha \cdot \kappa(\phi)/\phi$; the mean number of SNPs per segment would be $\alpha/\phi$; and the variance to mean power function would have the form $\operatorname{var}(Z) = \lambda^{1/(\alpha-1)} E(Z)^p$. (A reproductive form for the scale invariant PG distribution exists (Appendix A), which could be applied to describe the statistical behavior of the density of SNPs, i.e., the heterozygosity (SNPs/bp).) Presumably the number of segments would mainly depend upon the amount of recombination, but other processes such as chromosomal events could be involved too.
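These moment relations can be checked numerically. The sketch below assumes the standard Tweedie cumulant function from Jørgensen (1997), $\kappa(\phi) = \frac{\alpha-1}{\alpha}\left(\frac{\phi}{\alpha-1}\right)^{\alpha}$ with $\alpha = (p-2)/(p-1)$, which is not reproduced in this text; the symbol names `phi` and `lam` (canonical and index parameters) and the example values are hypothetical.

```python
def pg_moments(phi: float, lam: float, p: float) -> dict:
    """Moments of the scale invariant Poisson gamma model.

    phi : canonical parameter (must be negative).
    lam : index parameter (must be positive).
    p   : power function exponent, with 1 < p < 2.
    """
    alpha = (p - 2.0) / (p - 1.0)  # negative when 1 < p < 2
    # Assumed Tweedie cumulant function (Jørgensen 1997).
    kappa = (alpha - 1.0) / alpha * (phi / (alpha - 1.0)) ** alpha
    mean_blocks = lam * kappa                  # E(N)
    mean_snps = lam * alpha * kappa / phi      # E(Z), i.e. theta
    snps_per_block = alpha / phi               # mean of each gamma term
    variance = lam ** (1.0 / (alpha - 1.0)) * mean_snps ** p
    return {
        "alpha": alpha,
        "mean_blocks": mean_blocks,
        "mean_snps": mean_snps,
        "snps_per_block": snps_per_block,
        "variance": variance,
    }


# Hypothetical parameter values, with p set to the mean exponent reported above.
moments = pg_moments(phi=-0.5, lam=2.0, p=1.33)
```

Under this assumed cumulant function, the variance returned by the power function agrees with the second derivative of the cumulant, $\lambda \, (\phi/(\alpha-1))^{\alpha-2}$, which is one way to confirm that the coefficient $\lambda^{1/(\alpha-1)}$ is consistent with the other three expressions.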
Tests of the Scale Invariant Model |
---|
One note of caution is warranted with respect to the fit of the PG model to the Y chromosome data: 95% of the Y chromosome does not recombine, and much of it consists of repetitive DNA (Tilford et al. 2001). The PG model was thus probably not appropriate for the Y chromosome, and the data themselves were more difficult to analyze because of the proportion of repetitive content. Nevertheless, an analysis of the Y chromosomal data was included for completeness.
As an additional control, a Monte Carlo simulation was carried out according to the algorithm of Hudson (1991). The CDF obtained from the simulated data was fitted to the PG CDF (fig. 4, bottom right-hand corner). This plot exhibited a close agreement between simulation and theory (sum of the squared residuals, 0.044, Dmax = 0.035, P = 0.96), and the majority of the residuals from the nonlinear regression were symmetrically distributed about zero (fig. 3b). The close agreement of the PG model with both the ISMWG data and the simulated data served to confirm the model.
Estimates from the Poisson Gamma Model |
---|
The estimated mean numbers of blocks per bin (table 2) for each chromosome were compared to the respective estimates for the number of recombinations, R (table 1). These values were plotted versus each other (fig. 5b) to yield an apparent linear relationship with a high degree of correlation. With the exception of chromosome Y, the number of blocks per bin was less than R.
This observation that the number of blocks generally was less than the estimates for the recombination rate was not surprising. Recombination rates determined from changes in the location of SNPs might underestimate the total amount of recombination, as some recombinations might not cause any perceptible alteration in the pattern of SNPs (Nordborg 2000). Hudson and Kaplan (1985) have studied this issue, and they demonstrated that the number of recombinations that can be parsimoniously inferred from a sample of sequences, RM, is indeed less than the total number of recombinations from the history of a sample of sequences, R.
A set of Monte Carlo simulations based on Hudson's (1991) algorithm revealed a similar relationship between the number of blocks and R (fig. 5b, inset), providing some confirmation of this observation. A subject for further research would be to investigate the relationship between the mean number of blocks per section E(N) postulated in the PG model and the number of parsimonious recombinations RM (Hudson and Kaplan 1985), as determined by simulation.
In addition to this point, there is good evidence that the regions between haplotype blocks contain recombination "hot spots" (Jeffreys, Kauppi, and Neumann 2001). These hot spots are characterized by clusters of multiple recombination sites. Thus the estimated mean number of blocks per bin should more directly relate to the number of clusters per bin rather than the number of recombinations per se.
Discussion |
---|
The second assumption related to the correspondence between the PG model and the segmental substructure of haplotype blocks within the genome. The number of haplotype blocks within a larger genomic region was assumed to obey a Poisson distribution; the numbers of SNPs contained within those blocks were attributed to a gamma distribution. The justification for this PG model was based on a number of empirical observations: the power function relationship between the variance and mean number of SNPs per bin, the range restrictions on the power function exponent, and the close fit of the PG distribution function to both the ISMWG map and the simulated data.
Another feature of exponential dispersion models that made them applicable to the description of SNPs relates to the ease with which these models can describe certain scale invariant processes. Scale invariance has become increasingly recognized in biological systems, but its origins and significance are unclear (Gisiger 2001). In physical systems scale invariance may relate to critical phenomena (Goldenfeld 1992, chapter 1). In exponential dispersion models, the variance to mean power function (one manifestation of scale invariance) results from the asymptotic properties of distributions (Jørgensen, Martinez, and Tsao 1994). The summation of multiple independent distributions can yield distributions with scale invariant properties (Hill 1996), and given that the estimation process for heterozygosity could potentially have involved sampling from different distributions, and that more than one stochastic process likely influences the distribution of SNPs within the genome, one might speculate that the scale invariance seen within the ISMWG map reflects an asymptotic property of multiple summed processes.
Caution would seem appropriate with the interpretation of the PG model in general, because an algebraically complicated model such as this, one that required complicated numerical methods for its assessment, could potentially have been affected by artifact. For this reason a number of controls were included in the analysis. The first control involved the fit of the variance to mean relationship from the coalescent model (eq. 3) to real data. Over the range of values observed, this coalescent relationship seemed to approximate a power function. Other controls involved simulations based on the coalescent model: data generated for a range of bin sizes fitted well with both the transformed variance to mean power function and the coalescent model; data generated under the assumption of zero recombination fitted well to a gamma distribution; data generated with non-zero recombination fitted well to a PG distribution; and data generated over a range of recombination rates yielded an approximately linear relationship between the putative number of identical by descent segments and the estimated recombinational parameter. These results served to reduce concerns that the fits of the variance to mean power function and the PG CDF to real data reflected numerical artifact rather than biologically interpretable results.
Inherent in the method of fitting complicated equations like the PG CDF and Kaplan and Hudson's equation (3) to data from whole chromosomes was the assumption that recombination occurs with a constant probability over each chromosome. However, there is significant evidence that recombination is inhomogeneously distributed throughout the genome (Jeffreys, Kauppi, and Neumann 2001). Nonetheless, estimates for recombination rates are conventionally determined by averages over long distances, and thus it is not unreasonable to estimate recombination as a mean over an entire chromosome. One consequence of the excellent fits of the PG CDF and equation (3) to the ISMWG data is the implication that any localized inhomogeneity in recombination appears to be minimized by averaging over large scales.
As noted in the Introduction, there has been considerable interest in the structure of haplotype blocks. Fits of the PG model to the ISMWG data gave a range for the block sizes from 17 to 53 kb. In a population of European descent Reich et al. (2001) found that LD extended typically 60 kb; for Yorubans it extended to less than 5 kb. Patil et al. (2001) observed an average block length of 7.8 kb in human chromosome 21; Daly et al. (2001) estimated a mean length of 34.6 kb within 5q31. Gabriel et al. (2002) estimated that half of the human genome exists in blocks of 22 kb or larger in African samples, and 44 kb or larger in European or Asian samples. In another analysis of the ISMWG data, which examined the autocorrelation in polymorphism levels, Reich et al. (2002) found blocks of sequence similarity of the same order of size as obtained from the fits of the PG distribution. Although Reich's latter study used the same ISMWG data as the present study, the independence of the two methods and their similar findings lend credibility to the PG model.
Obviously there is variability in the lengths of these blocks; the estimates of block size appear method dependent and population dependent. At present there are insufficient data to directly determine the form of the size distribution for these blocks. However, Clark (1999) has argued that this distribution should obey an exponential form. In the model presented here it was assumed that the number of blocks per bin should be distributed according to a Poisson distribution. This would imply that the lengths of the blocks within each chromosome should be distributed as the interarrival distances of a Poisson process, which follow an exponential distribution.
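If block lengths are exponentially distributed, as this argument implies, then the fraction of blocks longer than a given length and the median block length follow from the mean alone. The sketch below illustrates this; the function names are hypothetical and the 30 kb mean is an arbitrary value chosen from within the 17 to 53 kb range reported above.

```python
import math


def block_length_survival(length_kb: float, mean_block_kb: float) -> float:
    """P(block length > length_kb) under an exponential length distribution."""
    return math.exp(-length_kb / mean_block_kb)


def median_block_length(mean_block_kb: float) -> float:
    """Median of an exponential distribution: mean * ln(2)."""
    return mean_block_kb * math.log(2.0)


# Hypothetical mean block length of 30 kb.
median_kb = median_block_length(30.0)
fraction_over_60_kb = block_length_survival(60.0, 30.0)  # exp(-2)
```

One consequence worth noting is that the median is shorter than the mean by a factor of ln 2, so studies that report medians and studies that report means would be expected to differ even if they sampled the same exponential distribution.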
A Poisson distribution for the number of blocks per bin would also imply that the junctions between blocks occur independently of each other with a constant probability of occurrence along the chromosome. Although this might seem inconsistent with the observation of the clustering of recombination crossovers into hot spots (Jeffreys, Kauppi, and Neumann 2001), as pointed out above, these junctions should more directly relate to the presence of hot spots than to individual recombinations.
Earlier it was demonstrated that the mean number of blocks per bin for the chromosomes correlated well with the respective recombination rates R. One can also compare these estimates to data provided by Kong et al. (2002), who constructed a high-density map of recombination rates for the human genome. When the chromosomal averages for the mean number of blocks per bin were compared to the sex-averaged recombination rates for each chromosome provided by Kong et al. (2002), no correlation was observed (r² = 0.008, P = 0.7), and neither did the respective variances show any correlation. Similarly, the chromosomal values for R obtained herein did not correlate with Kong's averaged rates (r² = 0.006, P = 0.7).
This lack of correlation may have a simple explanation: In the present study, the mean number of blocks per bin and the mean recombination rate R for each chromosome were obtained, respectively, by the nonlinear regression of the PG CDF and equation (3) with the ISMWG data and, as such, represent population-based estimates. The recombination rates for each chromosome provided by Kong et al. (2002) represent averages obtained from their genetic map of 146 Icelandic families. One would have expected better correlations between the different estimates, provided that the neutral model could reasonably model the demographic processes within the population, and that recombination rates at hot spots did not vary significantly between populations. However, Gabriel et al. (2002) have demonstrated significant differences in the haplotype structure between Yoruban and Eurasian populations, and so it might be possible that the discrepancies seen are attributable to differences between the Icelandic and ISMWG populations.
As mentioned above, Kaplan and Hudson's equation (eq. 3) and the exponential dispersion model are based on random processes inherent in a neutral model. Indeed, the excellent fits of equation (3) and the PG model to the ISMWG data would seem to indicate that neutral processes, such as demography, recombination, and mutation, dominate in shaping genomic polymorphism patterns. There was otherwise only weak evidence that natural selection might be operative at the HLA loci, and this effect (if true) appeared localized.
One final concern relates to how the findings here might relate to the feasibility of association studies. The estimates for the average block lengths obtained here ranged from 17 to 53 kb. If the human genome were spanned by blocks with high degrees of linkage disequilibrium, ranging in size from 5 to 100 kb and punctuated by boundary regions of intense recombinational activity that extended for 1 to 5 kb, then the possibility that relatively few haplotype-tagging SNPs would be required to identify disease variants appears reasonable (Reich et al. 2002; Stumpf 2002). Indeed, Gabriel et al. (2002) have estimated that for an average block size of 11 to 22 kb, and 3 to 5 haplotypes per block, association studies could be done with 300,000 to 1,000,000 well-chosen haplotype-tagging SNPs. The average block lengths inferred by the present study are consistent with these estimates and thus association studies appear feasible.
Conclusion |
---|
Appendix A |
---|
A second and related family of distributions, with random variable $Y = Z/\lambda \sim ED(\mu, \sigma^2)$, where $\mu$ and $\sigma^2 = 1/\lambda$ are respectively the mean value and dispersion parameters, are termed reproductive exponential dispersion models (Jørgensen 1997). For n independent reproductive random variables $Y_i \sim ED(\mu, \sigma^2/w_i)$, with weighting factors $w_i$ such that $w = \sum_{i=1}^{n} w_i$, weighted averaging of the variables gives

$$\frac{1}{w} \sum_{i=1}^{n} w_i Y_i \sim ED(\mu, \sigma^2/w).$$
Thus for reproductive models the weighted average of independent random variables with fixed $\mu$ and $\sigma^2$ and various $w_i$ will belong to the same family of distributions. The additive and reproductive models may be transformed into each other by $Y \mapsto Z = Y/\sigma^2$.
The cumulant function $\kappa(\phi)$ may be used to construct the cumulant generating function $K^*(s; \phi, \lambda)$ for $ED^*(\phi, \lambda)$,
|
|
Certain exponential dispersion models are also scale invariant. These have come to be called Tweedie exponential dispersion models in honor of M. C. K. Tweedie, who first described them (1984). For the reproductive exponential dispersion model $ED(\mu, \sigma^2)$ to be scale invariant, we require that for any positive constant c,
|
|
For scale invariant exponential dispersion models, and when p ≠ 1 or 2, the cumulant function takes the form
|
|
|
For the case where p = 2 the cumulant function is $\kappa(\phi) = -\ln(-\phi)$, which corresponds to the gamma distribution. The probability density function for the gamma distribution may be expressed in the additive form,
|
Appendix B |
---|
The theoretical CDF was obtained by numerical integration of the Poisson gamma probability density function (eq. 7). For each chromosome, the theoretical CDF was fitted to the empirical CDF numerically by minimizing the sum of the squared deviations (i.e., the summed least squares). Because there were three adjustable parameters estimated from the data, the Kolmogorov-Smirnov test was applied to the composite hypothesis, and the critical values were estimated by Monte Carlo simulation.
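The two fit statistics used here, the sum of squared deviations and the Kolmogorov-Smirnov distance Dmax, can be computed as sketched below. The code is an illustration only: the function names are hypothetical, the theoretical CDF is passed in as any callable (the numerical integration of equation (7) is not reproduced), and the minimization over parameters and the Monte Carlo critical values are left out.

```python
def empirical_cdf(sample):
    """Return the sorted sample and the empirical CDF value at each point."""
    xs = sorted(sample)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def fit_statistics(sample, theoretical_cdf):
    """Sum of squared deviations and Kolmogorov-Smirnov distance Dmax.

    theoretical_cdf : callable mapping a value to a probability in [0, 1].
    """
    xs, ecdf = empirical_cdf(sample)
    n = len(xs)
    sum_sq = 0.0
    d_max = 0.0
    for i, x in enumerate(xs):
        f = theoretical_cdf(x)
        sum_sq += (ecdf[i] - f) ** 2
        # The empirical CDF jumps at x, so compare both sides of the step.
        d_max = max(d_max, abs((i + 1) / n - f), abs(i / n - f))
    return sum_sq, d_max


# Toy example against a uniform(0, 1) CDF.
sum_sq, d_max = fit_statistics([0.25, 0.75], lambda x: x)
```

In a full analysis, `sum_sq` would be the objective minimized over the model parameters, and `d_max` from the fitted model would be compared with the distribution of Dmax obtained from data simulated and refitted under that same model.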
Acknowledgements |
---|
Footnotes |
---|
Wolfgang Stephan, Associate Editor
Literature Cited |
---|
Altshuler, D., V. J. Pollara, C. R. Cowles, W. J. Van Etten, J. Baldwin, L. Linton, and E. S. Lander. 2000. A SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407:513-516.
Clark, A. G. 1999. The size distribution of homozygous segments in the human genome. Am. J. Hum. Genet. 65:1489-1492.
Daly, M. J., J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander. 2001. High-resolution haplotype structure in the human genome. Nat. Genet. 29:229-232.
Gabriel, S. B., S. F. Schaffner, and H. Nguyen, et al. (18 co-authors). 2002. The structure of haplotype blocks in the human genome. Science 296:2225-2229.
Gisiger, T. 2001. Scale invariance in biology: coincidence or footprint of a universal mechanism? Biol. Rev. 76:161-209.
Goldenfeld, N. 1992. Lectures on phase transitions and the renormalization group. Perseus Books, Reading, Mass.
Goldstein, D. B. 2001. Islands of linkage disequilibrium. Nat. Genet. 29:109-111.
Hill, T. 1996. A statistical derivation of the significant-digit law. Stat. Sci. 10:354-363.
Horton, R., D. Niblett, S. Milne, S. Palmer, B. Tubby, J. Trowsdale, and S. Beck. 1998. Large-scale sequence comparisons reveal unusually high levels of variation in the HLA-DQB1 locus in the class II region of the human MHC. J. Mol. Biol. 282:71-97.
Hudson, R. R. 1982. Estimating genetic variability with restriction endonucleases. Genetics 100:711-719.
Hudson, R. R. 1991. Gene genealogies and the coalescent process. Pp. 1-44 in D. Futuyma and J. Antonovics, eds. Oxford Surveys in Evolutionary Biology. Oxford University Press, Oxford.
Hudson, R. R., and N. L. Kaplan. 1985. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111:147-164.
International SNP Map Working Group (ISMWG). 2001. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409:928-933.
Jeffreys, A. J., L. Kauppi, and R. Neumann. 2001. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat. Genet. 29:217-222.
Jørgensen, B. 1997. The theory of dispersion models. Chapman & Hall, London.
Jørgensen, B., J. R. Martinez, and M. Tsao. 1994. Asymptotic behaviour of the variance function. Scand. J. Stat. 21:223-243.
Kaplan, N. L., and R. R. Hudson. 1985. The use of sample genealogies for studying a selectively neutral M-loci model with recombination. Theoret. Popul. Biol. 28:382-396.
Kingman, J. F. C. 1982. The coalescent. Stochast. Proc. Appl. 13:235-248.
Kong, A., D. F. Gudbjartsson, and J. Sainz, et al. (16 co-authors). 2002. A high-resolution recombination map of the human genome. Nat. Genet. 31:241-247.
Miller, R. D., P. Taillon-Miller, and P.-Y. Kwok. 2001. Regions of low single nucleotide polymorphism incidence in human and orangutan Xq: deserts and recent coalescences. Genomics 71:78-88.
Nordborg, M. 2000. Coalescent theory. Chapter 7 in D. Balding, M. Bishop, and C. Cannings, eds. Handbook of statistical genetics. Wiley, Chichester, UK.
Patil, N., A. J. Berno, and D. A. Hinds, et al. (21 co-authors). 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294:1719-1723.
Reich, D. E., M. Cargill, and S. Bolk, et al. (11 co-authors). 2001. Linkage disequilibrium in the human genome. Nature 411:199-204.
Reich, D. E., S. F. Schaffner, M. J. Daly, G. McVean, J. C. Mullikin, J. M. Higgins, D. J. Richter, E. S. Lander, and D. S. Altshuler. 2002. Human genome sequence variation and the influence of gene history, mutation and recombination. Nat. Genet. 32:135-142.
Stumpf, M. P. 2002. Haplotype diversity and the block structure of linkage disequilibrium. Trends Genet. 18:226-228.
Tilford, C. A., T. Kuroda-Kawaguchi, and H. Skaletsky, et al. (12 co-authors). 2001. A physical map of the human Y chromosome. Nature 409:943-945.
Tweedie, M. C. K. 1984. An index which distinguishes between some important exponential families. Pp. 579-604 in J. K. Ghosh and J. Roy, eds. Statistics: applications and new directions. Proceedings of the Indian Statistical Institute Golden Jubilee International Conference. Indian Statistical Institute, Calcutta, India.
Wakeley, J. 1994. Substitution-rate variation among sites and the estimation of transition bias. Mol. Biol. Evol. 11:436-442.
Watterson, G. A. 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256-276.