An Exponential Dispersion Model for the Distribution of Human Single Nucleotide Polymorphisms

Wayne S. Kendal

Department of Radiation Oncology, Ottawa Regional Cancer Centre, Ottawa, Ontario, Canada


    Abstract
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
An analysis of 1.42 million human single nucleotide polymorphisms (SNPs), mapped by the International SNP Map Working Group, revealed an apparent power function relationship between the estimated variance and mean number of SNPs per sample bin. This relationship could be explained by the assumption that a scale invariant Poisson gamma (PG) exponential dispersion model could describe the distribution of SNPs within the bins. In this model the sample bins would contain random (Poisson distributed) numbers of identical by descent genomic segments, each with independently distributed and gamma distributed numbers of SNPs. This model was both qualitatively and quantitatively consistent with the conventional coalescent model. It agreed with the empirical cumulative distribution functions derived from the SNP maps as well as with simulated data. The model was used to estimate the heterozygosity {pi}, and the mean number and size of haplotype blocks for each chromosome. These estimates were consistent with measurements from conventional studies. This PG model thus provides an alternative to Monte Carlo simulation for description of the distribution of SNPs.

Key Words: exponential dispersion model • coalescent model • population genetics • scale invariance


    Introduction
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Single nucleotide polymorphisms (SNPs) appear to be nonuniformly dispersed throughout the human genome (fig. 1), as predicted by conventional coalescent theory (Hudson 1982; Kingman 1982). This nonuniformity can be attributed to a segmented genealogical substructure within the genome, where older segments may have accumulated greater numbers of SNPs (Miller, Taillon-Miller, and Kwok 2001). The ancestral haplotypes, from which each chromosome had its origins, presumably have been affected by population dynamics, mutation, recombination, and selection to yield these segments, which are conventionally delineated by shared genealogy and linkage disequilibrium (LD) (Reich et al. 2001).



View larger version (40K):
[in this window]
[in a new window]
 
FIG. 1. The distribution of SNPs within chromosome 1. From the data provided by the ISMWG for chromosome 1 (ISMWG 2001), the numbers of SNPs within sequential 200,000 bp bins along the chromosome were estimated. The distribution of SNPs along the length of the chromosome was highly variable, as expected from conventional coalescent theory. This variable pattern was similarly evident within the remaining human chromosomes

 
Recent studies, which have examined allelic associations by means of high-density markers, have revealed regions of linkage disequilibrium of variable length, termed haplotype blocks, which are bounded by concentrated regions of meiotic recombination (Daly et al. 2001; Jeffreys, Kauppi, and Neumann 2001; Patil et al. 2001; Gabriel et al. 2002). At present, efforts are being directed to examine high-density SNP maps to better determine the number, size distribution, and diversity of these blocks within various human populations (Gabriel et al. 2002).

The International SNP Map Working Group (ISMWG) has mapped the density of SNPs along all 24 human chromosomes, down to a resolution of 200 kb per sample bin (International SNP Map Working Group [ISMWG] 2001). At the scales of both the sample bins and the reads used to assess each bin, their maps agreed with the conventional coalescent model. Evidence will be presented here that the distribution of SNPs from the ISMWG map, when examined over a range of measurement scales, exhibits a power function relationship between the estimated variance and the mean. This power function will be shown to be quantitatively consistent with the coalescent model in the presence of recombination.

This variance to mean power function can also be considered in the context of exponential dispersion models (Jørgensen 1997), a comprehensive family of statistical models. Exponential dispersion models serve as error distributions for generalized linear models, and they provide a formal basis to study a wide variety of non-normal data. The reader is referred to the monograph by Jørgensen (1997) for general information regarding these models; details specific to the distribution of SNPs are provided here in Appendix A.

If the distribution of SNPs were to be described by an exponential dispersion model (a reasonable assumption given the generality of these models), the variance to mean power function would implicate one particular model represented by the scale invariant sum of a Poisson number of independently distributed gamma random variables. Evidence will be presented here that this Poisson gamma (PG) model accurately depicts the distribution of SNPs in the presence of recombination. Because coalescent models with significant recombination generally require Monte Carlo simulation (Hudson 1991) for their depiction, an algebraic approach as proposed here could be of interest.

The intent of the present study is to examine how well this scale invariant PG model describes the distribution of human SNPs. The variance to mean power function, a characteristic feature of scale invariant exponential dispersion models, provides an initial test. A more specific evaluation follows, based on the predicted PG cumulative distribution function (CDF) and its compliance with the empirical CDFs derived from the ISMWG map. It will be argued that this PG distribution of SNPs implies a genomic substructure comprised of multiple sequential and nonoverlapping segments.

To better relate the PG model to the current paradigm for genomic structure, a hypothesis will be proposed here that the segments inferred by the PG model represent haplotype blocks, and that the number of segments within a sample bin is indirectly related to the number of recombinations that have accumulated within that bin. One may estimate population genetic parameters based on the fits of the PG distribution to data from each chromosome. The heterozygosities so estimated will be compared to more direct estimates taken from the ISMWG map, and the number of putative segments within a region will be used to estimate the number of haplotype blocks. At this juncture it will be helpful to first review pertinent features of the coalescent theory and the ISMWG map.


    Predictions from the Conventional Coalescent Model
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
The coalescent model (Hudson 1982; Kingman 1982) describes the random history of gene genealogies within a sample of homologous sequences. It provides a means to analyze DNA sequence polymorphism in the presence of recombination. Watterson (1975) laid some groundwork for this model when he showed that the number of segregating sites K, within corresponding cistrons from a population, should have an approximately geometric distribution:


The estimate was defined at the scale of cistrons so as to exclude the possibility of recombination. In this equation {theta} = 4Neu was a population genetic parameter, with Ne being the population's effective size, and u the mutation rate per sequence per generation. The mean and variance for K were, respectively, E(K) {approx} {theta} and var(K) {approx} {theta} + {theta}2; thus the predicted variance to mean relationship for this case with no recombination was:


Watterson assumed a constant rate of neutral mutations and an infinite site model in his derivation. With the additional assumption of random mating within the population, he also showed that for the sequence length L and heterozygosity {pi}, E({pi}) = {theta}/L.

With the development of the coalescent approach came the means to describe K in the presence of recombination. The variance of K became var(K) {approx} {theta} + {theta}2 var(T), where T represented the weighted average length of the genealogies (Hudson 1991). With high rates of recombination the variance of T would approach zero, and K would then be Poisson distributed. The variance to mean relationship equation (2) was thus modified (Kaplan and Hudson 1985):


where


In the expression for var(T) the recombination parameter R was related to both the recombination rate (events/generation/bp) and L, by either R = 4NeL for autosomes or R = 2NeL for the X chromosome. For the Y chromosome, R = 0, because recombination does not occur on the Y chromosome, except within two small pseudoautosomal regions (Tilford et al. 2001). Given a sufficiently large and dense SNP map, it would presumably be possible to fit equation (3) to a set of estimated variances and means calculated over a range of scales L in order to estimate the magnitude of R. Such a map became available through the efforts of the ISMWG (2001).


    The Human SNP Map
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
The ISMWG provided their genomic map as supplementary information to their publication (http://www.nature.com). To identify SNPs, the ISMWG divided each chromosome into non-overlapping sequential 200-kb bins, and for each bin a number of randomly chosen reads were obtained from whole genomic libraries. These reads were aligned with each other, and to the available GenBank sequences, and then candidate SNPs were identified from sequence differences in overlapping regions within paired sequences. Each candidate SNP was screened to eliminate false positives and spurious matches that might arise from repetitive sequences. The molecular and bioinformatic methods used by the ISMWG to identify SNPs have been rigorously validated and reported elsewhere (Altshuler et al. 2000).

Discordant pairs of chromosomes, which differed at specific sites, were therefore identified. The heterozygosity analysis considered only pair-wise comparisons of chromosomes, and so the sample size for this analysis was exactly 2 (D. Altshuler, personal communication). When the ISMWG examined more than two chromosomes, each pair was considered separately in the analysis. Human population heterozygosity (SNPs/bp) was thus estimated by dividing the number of discordant sites within each bin by the number of comparisons that did and did not show discordance at a specific locus. This gave a probability that, at a randomly chosen site within the bin, two randomly chosen chromosomes would be discordant. From this probability one could then estimate the number of SNPs per bin.

As indicated, repetitive DNA was excluded from study. Alignments of sample sequences to the genome were excluded if less than 99% identical. Any individual sample sequence was excluded if it aligned to more than 1 bin in the genome assembly. Bins were excluded if more than 10% of the sample sequences that mapped to them also mapped to another chromosome.

With the sampling methods and exclusions employed here, there was a possibility that some unrecognized nonrandom bias might be introduced that could affect the proposed analysis. For this reason a number of additional controls were included. These will be presented in due course, but first we will examine the relationship between the variance and the mean number of SNPs per bin.


    The Variance to Mean Relationship over a Range of Scales
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
From the SNP maps of each chromosome, adjacent bins were combined successively to construct new sequences of non-overlapping bins (of 400 kb, 600 kb, and so on upwards to 3,200 kb); the variance var(Z) and mean number of SNPs per bin E(Z) were estimated for each set of bin sizes. About 8% of bins were found to have significant repetitive DNA content, and thus they were excluded from these analyses. When adjacent bins were combined, if any of the component bins had been excluded because of significant repetitive content, those bins too were excluded from analysis. Consequent to the exclusion of repetitive DNA, successively more information was censored from analysis, as adjacent bins were recombined into larger bins. The possibility existed that the progressive censoring of data might introduce heteroscedasticity into the variance to mean relationship; this was considered in the subsequent analysis.

These variance estimates were plotted versus the corresponding mean estimates of SNPs per bin for each chromosome on log-log plots over the range of bin sizes (fig. 2). The logarithmic transform of the theoretical variance to mean relationship (eq. 3), was fitted to the similarly transformed estimates for the variance and mean number of SNPs. As can be seen from figure 2, the ISMWG data (circles) and the predictions from the coalescent model (crosses) were often superimposed.



View larger version (42K):
[in this window]
[in a new window]
 
FIG. 2. Plots of the variance versus mean number of SNPs. The variance and mean number of SNPs per bin were calculated for a range of bin sizes for each human chromosome (numbered 1 through 22, X, and Y), and the variances were plotted against their respective means on log-log plots. The crosses represent the best fit of the conventional coalescent model equation (3) to these data; the solid lines represent the variance to mean power function. Both the coalescent model and the power function fitted the data well; for all autosomes and the X chromosome the two models were essentially superimposed. For the Y chromosome, the conventional model without recombination (eq. 2) did not fit as well as did the variance to mean power function. The plot in the lower right-hand corner represents data from multiple simulations based on physical distance (L = 100 to 2,000 bp, Ne = 106, sample size 2, {theta} = 0.05, = 1.25 x 10-8); both the variance to mean power function and equation (3) fitted well to these data

 
By a series of regression analyses the recombination parameter R and the squared correlation coefficients r2 were estimated for each chromosome (table 1). For all chromosomes, except the Y chromosome, r2 was greater than 0.8, indicating a high degree of correlation between the coalescent model and the ISMWG data. The estimates of R from the autosomes ranged widely, from 5.7 to 36 for chromosomes 8 and 13, respectively (table 1).


View this table:
[in this window]
[in a new window]
 
Table 1 Variance and Mean Measurements for SNPs/Bin Compared to Kaplan and Hudson's (1985) Coalescent Model with Recombination, and to the Variance to Mean Power Function.

 
The greatest discordance between the data and coalescent theory was seen with the Y chromosome, where recombination was assumed to be nil, and therefore equation (2) had been employed. But even here a chi-squared goodness-of-fit test revealed that the discrepancy was not significant ({chi}2 = 0.41, df = 7, P = 1). The variance to mean relationships predicted from conventional theory were consistent with the ISMWG data. This agreement between the ISMWG data and the coalescent model (eq. 3) helped to quell concerns that spurious effects might have been introduced by the estimation methods.

To further test whether spurious influences might have affected the estimation methods, simulations based on physical distance were performed according to the method of Hudson (1991) using a fixed recombination rate and a range of bin sizes to estimate the variance and the mean. The simulations employed a constant effective population size of 1 million and recombination and mutation rates that were constant and homogeneous for all loci. There was no admixture, inbreeding, or any bottleneck assumed in the simulated population structure.

Equation 3 was fitted to the simulated data and, when the same value for as used in the simulated data was substituted into equation (3), the theoretical estimates (crosses) became superimposed on the simulated data (circles; fig. 2, lower right-hand corner). The method to estimate recombination rates from the ISMWG data thus did not seem to have been influenced by any spurious effects from estimation process.

The ISMWG data plotted in figure 2 revealed apparently linear relationships between the logarithms of the variance and the mean. Indeed, regression analyses of data from each chromosome revealed strong correlations with the linearized relationship (table 1),


derived from the variance to mean power function,


Goodness of fit between the logarithmically transformed ISMWG data and equation (4) was assessed through the standard deviations of the residuals SD as well as with the squared correlation coefficients r2 (table 1). The standard deviations provided a measure of the statistical scatter about the regression line; r2 gave an estimate for the proportion of variance accounted for by the model. As such these two measures provided independent assessments of the regression. The residuals obtained tended to be small and normally distributed about zero (fig. 3a) and the values for r2 tended to be high, indicating good fits.



View larger version (46K):
[in this window]
[in a new window]
 
FIG. 3. Box and whisker plots of residuals. (a) Residuals from the fits of the transformed variance to mean power function (eq. 4) were plotted for each chromosome and for the simulation done in figure 2. The error bars delineate the ranges of the residuals, the boxes define the 3rd and 4th quartiles, and the dots delineate the median values for the residuals. The labels along the horizontal axis denote the respective chromosomes and the simulated data. The residuals tended to be normally distributed about zero, indicating acceptable fits. (b) Residuals from the fits of the cumulative distribution functions. The labels along the horizontal axis again identify data from individual chromosomes, obtained through the nonlinear regression of the PG distribution. Also, Sim R = 0 denotes data simulated with no recombination (L = 103, Ne = 106, sample size 2, {theta} = 0.05) and fitted to a gamma distribution. Sim R = 50, denotes the simulation reported in figures 2, 3a, and 4 with nonzero recombination. Most fits revealed residuals of relatively small size with values symmetrically distributed about zero, indicating acceptable fits

 
Comparison of the ISMWG data to the simulated data within the plots of figure 2 revealed increasing deviations of the data points from the power function and coalescent models as the mean values of SNPs/bin increased. This was evident with the ISMWG data but not with the simulated data (which contained no repetitive content). Neither was any such heteroscedasticity found when the nonrepetitive regions from the ISMWG data were merged together in sequence. This heteroscedasticity was thus attributable to the method of progressive censorship of repetitive content from the ISMWG data used here. The degree of heteroscedasticity was small enough that it did not appear to have a significant influence on the results, and thus more complicated methods of analysis to correct for it were not deemed necessary.

Based on an assessment of the values of SD, the fits to equation (4) were compared to those with equation (3). The transformed power function provided better fits (paired t-test: t = 2.2, df = 23, P = 0.04). However, the power function employed two adjustable parameters (a and p) whereas equation (3) employed only one (R). Nevertheless, in view of the complexity of equation (3), the empirical variance to mean relationship was much more simply represented by the power function.

The best fits to equation (4) for each chromosome were plotted in figure 2 as solid lines, and for the most part they were indistinguishable from both the coalescent predictions (eq. 3) and the ISMWG data. From these fits, the power function's exponent p from all 24 chromosomes ranged from 1.07 to 1.86 (mean 1.33, 95% confidence interval [CI] 1.26 to 1.40), a result that will be shown to have greater significance below.

The transformed variance to mean power function also correlated strongly with the simulated data, calculated on the basis of physical distance (r2 = 0.999, P = 2 x 10-16; fig 2, lower right-hand corner). In the simulations the population structure was assumed to be free of inbreeding, recent admixture bottlenecks, and heterogeneity. The residuals from the fit of equation (4) were relatively small and normally distributed about zero (fig. 3a). As well, the value for the power function exponent p = 1.33, obtained from the simulated data, agreed with the ISMWG estimates. Because the variance to mean power function was consistent with these simulated data, the possibility that the power function relationship could be attributed to either artifacts of data collection or estimation seemed unlikely.

To summarize the findings to this point: the fits of the coalescent model (eq. 3) and the transformed variance to mean power function (eq. 4) to both the ISMWG and simulated data were compared. Both relationships fitted these data well. The more complicated coalescent model, however, had the advantage that it was based on a population genetic mechanism, whereas at this juncture the variance to mean power function might be considered simply a functional coincidence. To dispel this concern, a model that explains the origin of the variance to mean power function and its possible origin from population genetic mechanisms will be proposed. But before any possible mechanisms can be considered, some background on scale invariant exponential dispersion models will be helpful.


    Scale Invariant Exponential Dispersion Models
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Quite remarkably there is a subgroup of exponential dispersion models characterized by a power function relationship between the variance and the mean (Appendix A). This power function implies that these particular models are scale invariant. Scale invariance means that a physical object, or mathematical equation, has the same form regardless of the measurement scale employed. The variance to mean power function (eq. 5) is itself scale invariant. If the measurement scale is increased so that the mean is increased by a factor c then, a · [c · E(Z)]p = (acp) · [E(Z)]p, and the variance to mean relationship retains the form of a power function with exponent p. Indeed, this variance to mean power function is the only possible scale invariant relationship between variance and the mean (Jørgensen 1997, p. 128). Scale invariance also appears as an abstract property of exponential dispersion models (Appendix A), but even in this form the scale invariance can be related to the variance to mean power function.

The family of scale invariant exponential dispersion models can be further subcategorized according to the values taken by the exponent p. The scale invariant PG exponential dispersion model is uniquely specified by 1 < p < 2; for p = 2 the unique family member is the gamma distribution. Thus if the distribution of SNPs were to be described by an exponential dispersion model, the PG distribution would be the candidate distribution for cases with recombination; the gamma distribution, for cases without.

Additive scale invariant exponential dispersion models are described by the cumulant generating function (CGF):


Here, Z is a continuous random variable, N and {lambda} are the canonical and index parameters, respectively, {kappa}(N) is the cumulant function, the constant {alpha} = , and s is the index variable for the generating function. (In the mathematical literature the canonical parameter N is conventionally denoted by {theta}, but this was changed to avoid confusion with the population genetic parameter {theta}.) This CGF directly implies the variance to mean power function (eq. 5) and, when this power function's exponent is restricted such that 1 < p < 2, it yields the PG distribution. In this case the distribution represents the sum of N independent and identically distributed random variables with gamma distributions, where N obeys a Poisson distribution.

Jørgensen (1997, p 141) has derived the probability density function for the additive form of the PG distribution:


with


We will next consider how this probabilistic model can be adapted to describe the distribution of SNPs and haplotype blocks within the genome.


    A Poisson Gamma Distribution of SNPs
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
As mentioned above, the human genome appears to be structured, with blocks of high LD being punctuated by regions of concentrated recombinational activity (Daly et al. 2001; Goldstein 2001; Reich et al. 2001). These blocks of high LD are thought to represent the remnants of ancestral haplotypes, which over evolution have been disrupted by mutation and recombination (Gabriel et al. 2002). Genomic structure thus could be represented by a sequence of non-overlapping haplotype blocks, of random lengths joined together by "hotspots" of recombination (Jeffreys, Kauppi, and Neumann 2001).

We would like to construct a probabilistic model for the distribution of haplotype blocks and SNPs within the genome. To begin, we divide the genome into a sequence of non-overlapping and equal-sized bins of size large enough that, on average, each bin would be expected to contain multiple haplotype blocks. Let the continuous random variable Z denote the estimated number of SNPs per bin, and let the discrete random variable N describe the number of haplotype blocks per bin. Given current knowledge of the distribution of haplotype blocks within the genome, it would seem reasonable to assume that N is a random (Poisson distributed) number.

Now, if we also assume that the mean number of SNPs per haplotype block is described by a gamma distribution, the total number of SNPs per bin would be represented by the sum of N number of identically distributed gamma distributions, and it would thus obey a PG distribution. If one were to further require that Z conform to an exponential dispersion model, the scale invariant PG distribution (eq. 7) would result. A gamma distribution for the mean number of SNPs per block can be viewed as a continuous valued generalization of Watterson's equation (eq. 1), and it is consistent with observed patterns of rate variation in nuclear and mitochondrial DNA (Wakeley 1994).

In this context, the mean number of segments per bin would be E(N) = {lambda} · {kappa}(N); the mean number of SNPs per bin would give the population genetic parameter {theta} = 4Neu = {lambda} · {alpha} · {kappa}(N)/N; the mean number of SNPs per segment would be {alpha}/N; and the variance to mean power function would have the form var(Z) = {lambda}1/({alpha}-1)E(Z)p. (A reproductive form for the scale invariant PG distribution exists (Appendix A), which could be applied to describe the statistical behavior of the density of SNPs—i.e., the heterozygosity (SNPs/bp).) Presumably the number of segments would mainly depend upon the amount of recombination, but other processes such as chromosomal events could be involved too.


    Tests of the Scale Invariant Model
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
The scale invariant PG model was, by virtue of its definition, consistent with the variance to mean power functions seen with the ISMWG map. The cumulative distribution function (CDF) of the PG distribution provided a more specific criterion with which to compare to the ISMWG map. The PG CDF was therefore fitted to the empirical CDF (Appendix B), derived from the SNP estimates from each chromosome (table 2). The distribution parameters N, {lambda}, and p were obtained by minimizing the sum of the least squares differences between the theoretical and empirical CDFs at each data point. For all chromosomes, except chromosome 6, the Kolmogorov Smirnov test indicated acceptable fits of the PG CDF to the data. For chromosome 6 the deviations from the theoretical model, although significant, were not more than 4.5%.


View this table:
[in this window]
[in a new window]
 
Table 2 Fit of the Poisson-Gamma CDF to the Empirical CDF from the SNP Maps.

 
The theoretical CDFs (solid lines) for all 24 chromosomes were plotted along with the empirical CDFs (dots) in figure 4. Visual inspection of these plots confirmed the acceptable agreement between the theoretical model and observation. As well, an analysis of the residuals revealed, by and large, symmetrical distributions about zero (fig. 3b), further supporting the validity of the nonlinear regressions.



View larger version (44K):
[in this window]
[in a new window]
 
FIG. 4. Cumulative distribution functions for the number of SNPs per bin from 24 human chromosomes and simulated data. The number of SNPs/bin was estimated for sequences of non-overlapping and sequential 200,000 bp bins spanning each human chromosome. The empirical CDFs (dots) were calculated and compared to the theoretical CDFs, derived from the scale invariant Poisson-gamma model (solid lines). The empirical and theoretical CDFs agreed well. In the bottom right-hand corner, a coalescent simulation is presented for a sample size of 2, L = 1,000, Ne = 106, R = 50, and {theta} = 0.05. The PG model agreed well with this simulation

 
Chromosome 6 carries the HLA loci, which are associated with relatively high heterozygosities (Horton et al. 1998). When the HLA region (bins 161 to 207) was excluded from analysis, the discrepancy between the theoretical and empirical CDFs was resolved (table 2). One thus might speculate that this discrepancy could be attributable to additional processes not accounted for by the PG model, such as the balancing selection thought to occur at the HLA loci.

One note of caution is warranted with respect to the fit of the PG model to the Y chromosome data: 95% of the Y chromosome does not recombine, and much of it consists of repetitive DNA (Tilford et al. 2001). The PG model was thus probably not appropriate for the Y chromosome, and the data itself were more difficult to analyze because of the proportion of repetitive content. Nevertheless, an analysis of the Y chromosomal data was included for completeness.

As an additional control, a Monte Carlo simulation was carried out according to the algorithm of Hudson (1991). The CDF obtained from the simulated data was fitted to the PG CDF (fig. 4, bottom right-hand corner). This plot exhibited a close agreement between simulation and theory (sum of the squared residuals, 0.044, Dmax = 0.035, P = 0.96), and the majority of the residuals from the nonlinear regression were symmetrically distributed about zero (fig. 3b). The close agreement of the PG model with both the ISMWG data and the simulated data served to confirm the model.


    Estimates from the Poisson Gamma Model
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Each fit of the PG distribution to the chromosomal data yielded individual values for the dispersion parameters p, {lambda}, and N (table 2); these values were, in turn, used to estimate population genetic parameters. The estimates for {pi}, so obtained from each chromosome, were compared to more direct measurements published by the ISMWG (2001) (fig. 5a), and they correlated well with one another (r2 = 0.96, P = 1 x 10-16). This close agreement served to validate the nonlinear regression methods used here.



View larger version (31K):
[in this window]
[in a new window]
 
FIG. 5. Estimates of population genetic parameters from the Poisson gamma fits. (a) Heterozygosity estimates. On the basis of the fits of the PG CDF to the ISMWG data the heterozygosity {pi} was estimated for each chromosome. These estimates were plotted on the vertical axis and compared on the horizontal axis to estimates for {pi} taken directly from the ISMWG publication (2001). The two sets of estimates correlated strongly. The dotted lines represent the 95% confidence intervals for the regression. (b) Number of genomic segments per bin. The mean number of segments per bin was estimated from the fits of the PG CDF to the SNP data from each chromosome. These values were compared to estimates of recombination parameter R obtained for each chromosome, using equation (3) from coalescent theory. The mean number of segments correlated well with the recombination parameter and the relationship appeared to be approximately linear. With the exception of the Y chromosome, the mean number of segments was less than R. A series of Monte Carlo simulations were performed for a range of values of R(L = 103, Ne = 106, sample size 2, {theta} = 0.05). The PG distribution was fitted to data derived from each simulation, and the mean number of segments was also estimated. The relationship between the mean number of segments per bin and the recombination parameter appeared similarly linear with the mean being less than R (insert)

 
The second set of estimates obtained from the PG fits dealt with the mean number of genomic segments per sample bin. The hypothesis was proposed earlier (see Introduction) that many of these smaller segments should correspond to haplotype blocks and the mean number of blocks should be indirectly related to the number of recombinations. For chromosome 1 there was an average of E(N) {approx} 9.5 blocks per 200 kb bin. The average size of the blocks on chromosome 1 was therefore 21 kb. Similar estimates were made for all the recombining chromosomes, and they ranged from 17 to 53 kb (mean 23 kb, variance 185 kb2; table 2).

The estimated mean numbers of blocks per bin (table 2) for each chromosome were compared to the respective estimates for the number of recombinations, R (table 1). These values were plotted versus each other (fig. 5b) to yield an apparent linear relationship with a high degree of correlation. With the exception of chromosome Y, the number of blocks per bin was less than R.

This observation that the number of blocks generally was less than the estimates for the recombination rate was not surprising. Recombination rates determined from changes in the location of SNPs might underestimate the total amount of recombination, as some recombinations might not cause any perceptible alteration in the pattern of SNPs (Nordborg 2000). Hudson and Kaplan (1985) have studied this issue, and they demonstrated that the number of recombinations that can be parsimoniously inferred from a sample of sequences, RM, is indeed less than the total number of recombinations from the history of a sample of sequences, R.

A set of Monte Carlo simulations based on Hudson's (1991) algorithm revealed a similar relationship between the number of blocks and R (fig. 5b, insert), providing some confirmation of this observation. A subject for further research would be to investigate the relationship between the mean number of blocks per section E(N) postulated in the PG model and the number of parsimonious recombinations RM (Hudson and Kaplan 1985), as determined by simulation.

In addition to this point, there is good evidence that the regions between haplotype blocks contain recombination "hot spots" (Jeffreys, Kauppi, and Neumann 2001). These hot spots are characterized by clusters of multiple recombination sites. Thus the estimated mean number of blocks per bin should more directly relate to the number of clusters per bin rather than the number of recombinations per se.


    Discussion
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Two assumptions were mentioned in the development of this model. First, the numbers of SNPs within sequential equal-sized bins were assumed to be described by an exponential dispersion model. This seemed plausible given the general nature of exponential dispersion models, which include the normal, Poisson, gamma, binomial, negative binomial, positive stable, and extreme stable distributions. However, without prior knowledge of the variance to mean relationship, other unrelated models might have seemed equally plausible. This assumption was used as a means to introduce exponential dispersion models; it was not critical to the application of the PG model to the distribution of SNPs.

The second assumption related to the correspondence between the PG model and the segmental substructure of haplotype blocks within the genome. The number of haplotype blocks within a larger genomic region was assumed to obey a Poisson distribution; the numbers of SNPs contained within those blocks was attributed to a gamma distribution. The justification for this PG model was based on a number of empirical observations: the power function relationship between the variance and mean number of SNPs per bin, the range restrictions on the power function exponent, and the close fit of the PG distribution function to both the ISMWG map and the simulated data.

Another feature of exponential dispersion models that made them applicable to the description of SNPs relates to the ease with which these models can describe certain scale invariant processes. Scale invariance has become increasingly recognized in biological systems, but its origins and significance are unclear (Gisiger 2001). In physical systems scale invariance may relate to critical phenomena (Goldenfeld 1992, chapter 1). In exponential dispersion models, the variance to mean power function (one manifestation of scale invariance) results from the asymptotic properties of distributions (Jørgensen, Martinez, and Tsao 1994). The summation of multiple independent distributions can yield distributions with scale invariant properties (Hill 1996), and given that the estimation process for heterozygosity could potentially have involved sampling from different distributions, and that more than one stochastic process likely influences the distribution of SNPs within the genome, one might speculate that the scale invariance seen within the ISMWG map reflects an asymptotic property of multiple summed processes.

Caution would seem appropriate with the interpretation of the PG model in general, because an algebraically complicated model such as this—one that required complicated numerical methods for its assessment—could potentially have been affected by artifact. For this reason a number of controls were included in the analysis. The first control involved the fit of the variance to mean relationship from the coalescent model (eq. 3) to real data. Over the range of values observed, this coalescent relationship seemed to approximate a power function. Other controls involved simulations based on the coalescent model: Data generated for a range of bin sizes fitted well with both the transformed variance to mean power function and the coalescent model; data generated under the assumption of zero recombination fitted well to a gamma distribution; data generated with non-zero recombination fitted well to a PG distribution; and data generated over a range of recombination rates yielded an approximately linear relationship between the putative number of identical by descent segments and the estimated recombinational parameter. These results served to reduce concerns that the fits of the variance to mean power function and the PG CDF to real data reflected numerical artifact rather than biologically interpretable results.

Inherent in the method of fitting complicated equations like the PG CDF and Kaplan and Hudson's equation (3) to data from whole chromosomes was the assumption that recombination occurs with a constant probability over each chromosome. However, there is significant evidence that recombination is inhomogeneously distributed throughout the genome (Jeffreys, Kauppi, and Neumann 2001). Nonetheless, estimates for recombination rates are conventionally determined by averages over long distances, and thus it is not unreasonable to estimate recombination as a mean over an entire chromosome. One consequence of the excellent fits of the PG CDF and equation (3) to the ISMWG data is the implication that any localized inhomogeneity in recombination appears to be minimized by averaging over large scales.

As noted in the Introduction, there has been considerable interest in the structure of haplotype blocks. Fits of the PG model to the ISMWG data gave a range for the block sizes from 17 to 53 kb. In a population of European descent Reich et al. (2001) found that LD extended typically 60 kb; for Yorubans it extended to less than 5 kb. Patil et al. (2001) observed an average block length of 7.8 kb in human chromosome 21; Daly et al. (2001) estimated a mean length of 34.6 kb within 5q31. Gabriel et al. (2002) estimated that half of the human genome exists in blocks of 22 kb or larger in African samples, and 44 kb or larger in European or Asian samples. In another analysis of the ISMWG data, which examined the autocorrelation in polymorphism levels, Reich et al. (2002) found blocks of sequence similarity of the same order of size as obtained from the fits of the PG distribution. Although Reich's latter study used the same ISMWG data as the present study, the independence of the two methods, and similar findings, lends credibility to the PG model.

Obviously there is variability in the lengths of these blocks; the estimates of block size appear method dependent and population dependent. At present there are insufficient data to directly determine the form of the size distribution for these blocks. However, Clark (1999) has argued that this distribution should obey an exponential form. In the model presented here it was assumed that the number of blocks per bin should be distributed according to a Poisson distribution. This would imply that the lengths of the blocks within each chromosome should be distributed according to the interarrival lengths of a Poisson distribution, which would be an exponential distribution.

A Poisson distribution for the number of blocks per bin would also imply that the junctions between blocks occur independently of each other with a constant probability of occurrence along the chromosome. Although this might seem inconsistent with the observation of the clustering of recombination crossovers into hot spots (Jeffreys, Kauppi, and Neumann 2001), as pointed out above, these junctions should more directly relate to the presence of hot spots than to individual recombinations.

Earlier it was demonstrated that the mean number of blocks per bin for the chromosomes correlated well with the respective recombination rates R. One can also compare these estimates to data provided by Kong et al. (2002), who constructed a high-density map of recombination rates for the human genome. When the chromosomal averages for the mean number of blocks per bin were compared to the sex-averaged recombination rates for each chromosome provided by Kong et al. (2002), no correlation was observed (r2 = 0.008, P = 0.7), and neither did the respective variances show any correlation. Similarly, the chromosomal values for R obtained herein did not correlate with Kong's averaged rates (r2 = 0.006, P = 0.7).

This lack of correlation may have a simple explanation: In the present study, the mean number of blocks per bin and the mean recombination rate R for each chromosome were obtained, respectively, by the nonlinear regression of the PG CDF and equation (3) with the ISMWG data and, as such, represent population-based estimates. The recombination rates for each chromosome provided by Kong et al. (2002) represent averages obtained from their genetic map of 146 Icelandic families. One would have expected better correlations between the different estimates, provided that the neutral model could reasonably model the demographic processes within the population, and that recombination rates at hot spots did not vary significantly between populations. However, Gabriel et al. (2002) have demonstrated significant differences in the haplotype structure between Yoruban and Eurasian populations, and so it might be possible that the discrepancies seen are attributable to differences between the Icelandic and ISMWG populations.

As mentioned above, Kaplan and Hudson's equation (eq. 3) and the exponential dispersion model are based on random processes inherent in a neutral model. Indeed, the excellent fits of equation (3) and the PG model to the ISMWG data would seem to indicate that neutral processes, such as demography, recombination, and mutation, dominate in shaping genomic polymorphism patterns. There was otherwise only weak evidence that natural selection might be operative at the HLA loci, and this effect (if true) appeared localized.

One final concern relates to how the findings here might relate to the feasibility of association studies. The estimates for the average block lengths obtained here ranged from 17 to 53 kb. If the human genome were spanned by blocks with high degrees of linkage disequilibrium, ranging in size from 5 to100 kb and punctuated by boundary regions of intense recombinational activity that extended for 1 to 5 kb, then the possibility that relatively few haplotype-tagging SNPs would be required to identify disease variants appears reasonable (Reich et al. 2002; Stumpf 2002). Indeed, Gabriel et al. (2002) have estimated that for an average block size of 11 to 22 kb, and 3 to 5 haplotypes per block, association studies could be done with 300,000 to 1,000,000 well-chosen haplotype-tagging SNPs. The average block lengths inferred by the present study are consistent with these estimates and thus association studies appear feasible.


    Conclusion
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
A scale invariant model for the distribution of SNPs within the genome was proposed whereby sample bins were assumed to contain Poisson distributed numbers of smaller DNA blocks, each with its own genealogical history and each with a gamma distributed number of SNPs. The junctional regions between these blocks were assumed to contain concentrations of recombination, and the block lengths were exponentially distributed. The PG model, by virtue of its probabilistic structure, was considered analogous to a contemporary model where blocks of limited haplotype diversity span the genome interspersed by recombination hot spots. Multiple tests of this PG model, based on real and simulated data, served to support its application to the distribution of SNPs.


    Appendix A
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
Exponential dispersion models may be expressed in canonical form using the interrelated measures {nu}{lambda} such that


where


is the cumulant function, N and {lambda} are the canonical and index parameters, and z is the canonical statistic. The distribution function P{lambda},N represents a family of distributions for a range of values of N, ED*(N,{lambda}), determined by N, {lambda}, and the cumulant function. This family has the additive property such that the sum of independent random variables Z+ = Z1 + ··· + Zn, each member of the family Zi ~ ED*(N, {lambda}i) with fixed N and various {lambda}, itself belongs to the same family Z+ ~ ED*(N,{lambda}1 + ··· + {lambda}n).

A second and related family of distributions with random variable Y = Z/{lambda} ~ ED(µ,{sigma}2), where µ and {sigma}2 = 1/{lambda} are respectively the mean value and dispersion parameters and are termed reproductive exponential dispersion models (Jørgensen 1997). For n independent reproductive random variables Yi ~ ED(µ,{sigma}2/wi), with weighting factors wi such that w = wi, gives with weighted averaging of the variables


Thus for reproductive models the weighted average of independent random variables with fixed µ and {sigma}2 and various wi, will belong to the same family of distributions. The additive and reproductive models may be transformed into each other by Y -> Z = Y/{sigma}2.

The cumulant function, {kappa}(N) may be used to construct the cumulant generating function K*(s; N,{lambda}) for ED*(N,{lambda}),


where s is an index variable. This cumulant generating function completely determines the distribution function P{lambda},N, and its first two derivatives with s = 0 yield the corresponding mean and variance. Another useful function derived from the cumulant function is the mean value mapping {tau}(N) = {kappa}'(N), which describes the relationship between the canonical parameter and the mean µ = {kappa}'(N). From this mapping one may define the variance function


where {tau}-1(µ) denotes the inverse function. The mean and variance of the additive random variable Z are, respectively, E(Z) = {lambda}µ and var(Z) = {lambda}V(µ).

Certain exponential dispersion models are also scale invariant. These have come to be called Tweedie exponential dispersion models in honor of M. C. K. Tweedie who first described them (1984). For the reproductive exponential dispersion model ED(µ,{sigma}2) to be scale invariant, we require that for any positive constant c,


where p is a positive real-valued constant. Under this scale transformation we obtain a new random variable, = cY, which belongs to the same family of distributions with fixed µ and {sigma}2, but different values of c, and for this reason the model can be considered scale invariant. The variance function here obeys the relation V(cµ) = g(c) · V(µ), for appropriate values of the function g(c). Scale invariance then implies that g(c) = V(c) and V(µ) = µp. It follows for the reproductive model that var(Y) = {sigma}2E(Y)p, and for the corresponding additive model


where {alpha} = .

For scale invariant exponential dispersion models, and when p != 1 or 2, the cumulant function takes the form


In this case the cumulant generating function for the additive model is


The corresponding reproductive model has the cumulant generating function,


In both models here the domain for s is the positive real numbers, and the domains for N and {alpha} are the negative real numbers.

For the case where p = 2 the cumulant function is {kappa}(N) = -ln(-N), which corresponds to the gamma distribution. The probability density function for the gamma distribution may be expressed in the additive form,



    Appendix B
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
To derive the empirical CDF, the SNP estimates obtained from each chromosome were ranked from smallest to largest and for each measurement the ratio i/n was calculated, where i was the numeric rank of the measurement and n the total number of measurements. The plot of i/n versus the respective number of SNPs provided the empirical CDF.

The theoretical CDF was obtained by numerical integration of the Poisson-gamma probability density function (eq. 7). For each chromosome, the theoretical CDF was fitted to the empirical CDF numerically by minimizing the sum of the squared deviations (i.e., the summed least squares). Because there were three adjustable parameters estimated from the data, the Kolmogorov Smirnov test was applied to the composite hypothesis, and the critical values were estimated by Monte Carlo simulation.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 
This work was supported by funds provided by the Beattie Library of the Ottawa Regional Cancer Centre. The author thanks D. Altshuler and C. Ng for their helpful advice in the preparation of this manuscript, as well as S. Moussett for providing simulation software.


    Footnotes
 
E-mail: wayne.kendal{at}orcc.on.ca. Back

Wolfgang Stephen, Associate Editor Back


    Literature Cited
 TOP
 Abstract
 Introduction
 Predictions from the...
 The Human SNP Map
 The Variance to Mean...
 Scale Invariant Exponential...
 A Poisson Gamma Distribution...
 Tests of the Scale...
 Estimates from the Poisson...
 Discussion
 Conclusion
 Appendix A
 Appendix B
 Acknowledgements
 Literature Cited
 

    Altshuler, D., V. J. Pollara, C. R. Cowles, W. J. Van Etten, J. Baldwin, L. Linton, and E. S. Lander. 2000. A SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407:513-516.[CrossRef][ISI][Medline]

    Clark, A. G. 1999. The size distribution of homozygous segments in the human genome. Am. J. Hum. Genet. 65:1489-1492.[CrossRef][ISI][Medline]

    Daly, M. J., J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander. 2001. High-resolution haplotype structure in the human genome. Nat. Genet. 29:229-232.[CrossRef][ISI][Medline]

    Gabriel, S. B., S. F. Schaffner, and H. Nguyen, et al. (18 co-authors). 2002. The structure of haplotype blocks in the human genome. Science 296:2225-2229.[Abstract/Free Full Text]

    Gisiger, T. 2001. Scale invariance in biology: coincidence or footprint of a universal mechanism? Biol. Rev. 76:161-209.[CrossRef][Medline]

    Goldenfeld, N. 1992. Lectures on phase transitions and the renormalization group. Perseus Books, Reading, Mass.

    Goldstein, D. B. 2001. Islands of linkage disequilibrium. Nat. Genet. 29:109-111.[CrossRef][ISI][Medline]

    Hill, T. 1996. A statistical derivation of the significant-digit law. Stat. Sci. 10:354-363.[ISI]

    Horton, R., D. Niblett, S. Milne, S. Palmer, B. Tubby, J. Trowsdale, and S. Beck. 1998. Large-scale sequence comparisons reveal unusually high levels of variation in the HLA-DQB1 locus in the class II region of the human MHC. J. Mol. Biol. 282:71-97.[CrossRef][ISI][Medline]

    Hudson, R. R. 1982. Estimating genetic variability with restriction endonucleases. Genetics 100:711-719.[Abstract/Free Full Text]

    Hudson, R. R. 1991. Gene genealogies and the coalescent process. Pp. 1–44 in D. Futuyma and J. Antonovics, eds. Oxford Surveys in Evolutionary Biology. Oxford University Press, Oxford.

    Hudson, R. R., and N. L. Kaplan. 1985. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111:147-164.[Abstract/Free Full Text]

    International SNP Map Working Group (ISMWG). 2001. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409:928-933.[CrossRef][ISI][Medline]

    Jeffreys, A. J., L. Kauppi, and R. Neumann. 2001. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat. Genet. 29:217-222.[CrossRef][ISI][Medline]

    Jørgensen, B. 1997. The theory of dispersion models. Chapman & Hall, London.

    Jørgensen, B., J. R. Martinez, and M. Tsao. 1994. Asymptomatic behaviour of the variance function. Scand. J. Stat. 21:223-243.[ISI]

    Kaplan, N. L., and R. R. Hudson. 1985. The use of sample genealogies for studying a selectively neutral M-loci model with recombination. Theoret. Popul. Biol. 28:382-396.[ISI]

    Kingman, J. F. C. 1982. The coalescent. Stochast. Proc. Appl. 13:235-248.[CrossRef]

    Kong, A., D. F. Gudbjartsson, and J. Sainz, et al. (16 co-authors). 2002. A high-resolution recombination map of the human genome. Nat. Genet. 31:241-247.[CrossRef][ISI][Medline]

    Miller, R. D., P. Taillon-Miller, and P.-Y. Kwok. 2001. Regions of low single nucleotide polymorphism incidence in human and orangutan Xq: deserts and recent coalescences. Genomics 71:78-88.[CrossRef][ISI][Medline]

    Nordborg, M. 2000. Coalescent theory. Chapter 7 in D. Balding, M. Bishop, and C. Cannings, eds. Handbook of statistical genetics. Wiley, Chichester, UK.

    Patil, N., A. J. Berno, and D. A. Hinds, et al. (21 co-authors). 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294:1719-1723.[Abstract/Free Full Text]

    Reich, D. E., M. Cargill, and S. Bolk, et al. (11 co-authors). 2001. Linkage disequilibrium in the human genome. Nature 411:199-204.[CrossRef][ISI][Medline]

    Reich, D. E., S. F. Schaffner, M. J. Daly, G. McVean, J. C. Mullikin, J. M Higgins, D. J. Richter, E. S. Lander, and D. S. Altshuler. 2002. Human genome sequence variation and the influence of gene history, mutation and recombination. Nat. Genet. 32:135-142.[CrossRef][ISI][Medline]

    Stumpf, M. P. 2002. Haplotype diversity and the block structure of linkage disequilibrium. Trends Genet. 18:226-228.[CrossRef][ISI][Medline]

    Tilford, C. A., T. Kuroda-Kawaguchi, and H. Skaletsky, et al. (12 co-authors). 2001. A physical map of the human Y chromosome. Nature 409:943-945.[CrossRef][ISI][Medline]

    Tweedie, M. C. K. 1984. An index which distinguishes between some important exponential families. Pp. 579–604 in J. K. Ghosh and J. Roy, eds. Statistics: applications and new directions. Proceedings of the Indian Statistical Institute Golden Jubilee International Conference. Indian Statistical Institute, Calcutta, India.

    Wakeley, J. 1994. Substitution-rate variation among sites and the estimation of transition bias. Mol. Biol. Evol. 11:436-442.[Abstract]

    Watterson, G. A. 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256-276.[ISI][Medline]

Accepted for publication November 26, 2002.