Department of Ecology and Evolution, University of Chicago
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Two main approaches have been employed for estimating C. One method observes the frequency of sequence exchange between distant markers (e.g., Ashburner 1989
; True, Mercer, and Laurie 1996
; Bouffard et al. 1997
; Nagaraja et al. 1997
), while the other method estimates C from the patterns of sequence variation expected in random population samples (e.g., Hudson and Kaplan 1985
; Hudson 1987
; Griffiths and Marjoram 1996
; Hey and Wakeley 1997
; Wakeley 1997
; Kuhner, Yamato, and Felsenstein 1999
). If there is local variation in recombination rates (e.g., Fullerton et al. 1994
), then the former method might be quite inaccurate for a particular region of interest. In this paper, I focus on ways of estimating C from sequence variation in random samples.
Researchers have used different aspects of the expected patterns of sequence variation to estimate C. For example, Hudson (1987)
used the observed variance of the number of pairwise differences to construct an estimator of C (here called Chud). This variance is a measure of the amount of linkage disequilibrium in a sample; as the recombination rate increases, the expected amount of linkage disequilibrium between segregating sites decreases, as does the expected variance of the number of pairwise differences. Wakeley (1997)
made two small technical improvements in Chud to obtain a new estimator, which I call Cwak. Although these two estimators are easy to calculate, they ignore most of the available information by essentially describing the data with a single summary statistic. Another concern is that their method of moments approach is probably not optimal; maximum-likelihood methods might be more appropriate. Perhaps the most elegant approach is to condition on the complete data set to obtain a maximum-likelihood estimate of C (Griffiths and Marjoram 1996
; Kuhner, Yamato, and Felsenstein 1999
; Nielsen, personal communication). The full likelihood approach has the advantage of using all of the information that is available, but it is extremely computationally intensive. For example, it is not yet computationally feasible to estimate C for many human polymorphism data sets (e.g., Harding et al. 1997
; Zietkiewicz et al. 1997
; Nickerson et al. 1998
) using maximum-likelihood estimation programs provided by R. C. Griffiths and M. Kuhner. One possible compromise is the approach of Hey and Wakeley (1997
). Their estimator
is close to the average of maximum-likelihood estimates of many small subsets of the data. Another possible compromise (and the one adopted here) is to describe sequence data using summary statistics, but then to use maximum-likelihood methods on these summaries of the data to estimate C (see below).
Recently, some researchers have proposed statistical tests based on the observed number of distinct haplotypes in a sample (Strobeck 1987
; Fu 1996, 1997
; Depaulis and Veuille 1998
). The motivation is that some forces tend to decrease the expected number of distinct haplotypes (e.g., population subdivision, balancing selection, or decreasing population size) (Fu 1996
; Wall 1999
), while other forces act in the opposite direction (e.g., linkage to a selective sweep or increasing population size) (Fu 1997
). However, the latter forces are confounded with the effect of intragenic recombination: the expected number of distinct haplotypes increases rapidly as the recombination rate increases. As a result, the above statistical tests are inappropriate when the recombination rate is unknown.
Instead, I use the number of distinct haplotypes (denoted hereinafter as H) to estimate C. The general approach is to find what value of C maximizes the probability of obtaining the observed value of H. This method of using maximum likelihood on summary statistics has been used before (e.g., Fu and Li 1997
; Takahata and Satta 1997
; Weiss and von Haeseler 1998
), and it is expected to be useful if data sets have large sample sizes and relatively few segregating sites. These criteria are satisfied by most recent studies of human genetic variation (e.g., Harding et al. 1997
; Zietkiewicz et al. 1997
; Nickerson et al. 1998
), but not by most studies of Drosophila sequence variation. Although exact results are available for the joint distribution of H and S (the observed number of segregating sites) when there is no recombination (Griffiths 1982
), it is difficult, if not impossible, to obtain analytic results in the presence of intragenic recombination. Therefore, simulations are used to estimate likelihoods (see below).
Other summary statistics can also be used for maximum-likelihood estimation. Hudson and Kaplan (1985)
proposed using the minimum number of recombination events (RM) to estimate C. Recombination was inferred to have occurred between a pair of segregating sites when all four gametic types were present. This implicitly assumes an infinite-sites model of mutations (i.e., that each new mutation occurs at a previously unmutated site). Similar to the above, I construct an estimator that maximizes the joint probability of obtaining the observed values of both RM and H.
The simulations that are used to estimate likelihoods can be run in two different ways. They can be run with as an unknown parameter or they can be run conditional on S (Hudson 1990, 1993
). Hudson (1993)
proposed using the latter method, since S is known, whereas
must be estimated. The drawback is that conditioning on S results in a null model that is slightly different from the standard coalescent one. Instead of a constant mutation rate per generation, there is a constant number of mutations regardless of the length of the tree. I run both types of simulations and thus directly test the effect of conditioning on S. The new estimators of C are compared with many previously proposed ones (Hudson 1987
; Griffiths and Marjoram 1996
; Hey and Wakeley 1997
; Wakeley 1997
; Kuhner, Yamato, and Felsenstein 1999; Nielsen, personal communication).
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
One or more summary statistics are used to characterize the data. The ones considered are S, RM, and H. Most simulations are run conditional on S (cf. Hudson 1993
). The estimators CH and CHRM are defined as the values of C that maximize the likelihoods L(C|H) and L(C|H, RM), respectively. For each possible value of S, a range of C values from 0.0 to 300.0 is considered, and 105 trials are run for each value of C. Pr(H|C0) and Pr(H, RM|C0) are estimated as the proportion of trials with C = C0 that show a particular configuration of {H} or {H, RM}, respectively.
Standard coalescent simulations (cf. Hudson 1990
) are also run. The estimator CSH is defined as the value of C that maximizes the joint likelihood L(
, C|S, H). A range of parameter values is considered, with
varying from 0.5 to 20.0 and C varying from 0.0 to 260.0. For each combination of
and C, 105 trials are run, and Pr(S, H|
0, C0) is estimated as the proportion of trials with
=
0 and C = C0 showing a particular configuration of S and H. CSH is very similar in motivation to CH. However, it is much quicker to calculate CH, since simulations are needed for a single value of S instead of for many possible values of
.
Coalescent trials were run with fixed values of and C, and the distributions of estimator values for these trials are compared with each other. The other estimators of C considered were Chud (Hudson 1987
), Cwak (Wakeley 1997
), CGM (Griffiths and Marjoram 1996
; the program recom58 uses Monte Carlo recursion methods to estimate C and was kindly provided by R. C. Griffiths), CKYF (Kuhner, Yamato, and Felsenstein 1999
; the program RECOMBINE, version 1.0, incorporates a Metropolis-Hastings [Metropolis et al. 1953
; Hastings 1970
] Markov chain Monte Carlo [MCMC] algorithm with a finite-sites model and estimates the likelihood surface using importance sampling), CN1 (R. Nielsen, personal communication; a preliminary version of the program Baysim was kindly provided by R. Nielsen. Baysim also uses Metropolis-Hastings MCMC, but with an infinite- sites model and a Bayesian approach to parameter estimation assuming a uniform prior. CN1 is the maximum-likelihood estimate.), CN2 (R. Nielsen, personal communication; CN2 is the posterior mean for C), and
(Hey and Wakeley 1997
; the program SITES is available at http://heylab.rutgers.edu). CH, CSH, Chud, Cwak, and
were compared using a sample size of n = 50 and C =
= 1.0, 2.0, ..., 12.0, C = 2
= 2.0, 4.0, ..., 24.0, C = 4
= 4.0, 8.0, ..., 48.0, and
= 5 with C = 0.0, 5.0, 10.0, ..., 45.0. Ten thousand trials were run for each estimator and each pair of parameter values. Due to computational constraints, the other estimators (CHRM, CGM, CKYF, CN1, and CN2) were compared only with n = 50 and C =
= 3.0. CHRM was calculated for 1,000 different trials, while 100 trials were run for the other four estimators. For each trial, CGM was estimated from 500,000 replicates with a single set of generating parameters (C =
= 3.0). Trials for CKYF had 10 short chains (40,000 iterations each), one long chain (600,000 iterations), and starting parameters C =
=
W (Wattersons [1975
] estimate of
), while those for CN1 and CN2 had a burn-in time of 106 iterations and a single chain of 5 x 106 iterations. Although more replicates would have been desirable, the amount of time required to run these simulations was already substantial (i.e., 56 h per trial of CGM or CKYF on a 400-MHz Pentium II processor). Note that the simulations conducted for this study were meant to be exploratory, not exhaustive.
The mean and the mean squared error (MSE) are calculated for the estimators distributions. These two summaries are inadequate, both because the distributions are not approximately normal (see Results) and because estimators do not always return defined and finite values. Chud, Cwak, and are undefined when there is not enough information (i.e., not enough segregating sites) or when the patterns of variation are consistent with free recombination. Similarly, CH, CSH, and CHRM are undefined whenever S < 2 or H
2S. In addition, sometimes CH, CSH, and CHRM could not be calculated (and were considered undefined) because simulations with high enough values of C were not run. This latter case was always rare. It did not occur in any of the trials shown in table 1
and does not affect the summary statistic described below. I present histograms of the distributions when C =
= 3.0, and for the other simulations I summarize the results using
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
|
g was calculated for CH, CSH, Chud, Cwak, and under a wider range of parameter values. figure 2A
shows simulations with C =
, figure 2B
assumes C = 2
, and figure 2C
assumes C = 4
. As expected, all estimators improve as
increases. As above, CH and CSH are roughly of the same quality, and both usually have higher g values than the other three estimators.
is poor for small data sets but improves more rapidly with increasing
than do the others in figure 2A and B
. However,
is relatively less accurate in figure 2C
for higher recombination rates. When the actual recombination rate is so high,
s downward bias has a large influence in lowering the value of g. This trend is highlighted in figure 3
, which plots g as a function of C when
= 5. While other estimators get better as C increases (presumably because less linkage disequilibrium provides more useful information),
s g value starts decreasing when C > 20. In both figures 2 and 3
, Chud and Cwak perform almost identically, with the latter having marginally higher g values. In summary, all of the test statistics are not very good unless both C and
are reasonably high. For example, in figure 2A,
when C =
= 5.0 (and the expected number of segregating sites is roughly 22), all estimators are within a factor of 2 of the true value less than half the time. In contrast, a simple estimator of
(based on the observed value of S; cf. Watterson 1975
) for these simulations is within a factor of 2 of the true value of
almost all of the time.
|
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Previous researchers have noted that Chud and Cwak are effective only for very large data sets (e.g., C =
100) (Hudson 1987
; Wakeley 1997
). To my knowledge, there are no sequence polymorphism data sets with sample sizes greater than three and
> 50. Thus, there is a premium on estimators that are still effective even when
is relatively small. The results obtained here suggest that while some estimators are reasonable for intermediate-sized data sets, all of them (with the possible exceptions of CN2 and CKYF) behave poorly for small data sets (e.g.,
< 5).
figure 2
implies that for most of the parameter values considered, all ad hoc estimators use no more information than a simple count of the number of distinct haplotypes that are present (i.e., other estimators are generally not as good as CH). Although more clever ways of summarizing the data will lead to better (yet still easy to calculate) estimators of the population recombination rate, it will always be easier to estimate than to estimate C. This is not only because it is conceptually more difficult to understand the effects of recombination on the patterns of observed sequence variation, but also because given data sets can be consistent with a wide range of possible recombination rates. For example, nearly half of the trials shown in figure 1 and table 1
are consistent with no recombination at all.
One open question is whether estimators that summarize the data (e.g., CH, CSH, CHRM, Chud, and Cwak) or estimators that only consider subsets of the data (e.g., ) are throwing away valuable information. As computational power increases, methods that condition on all of the data will become feasible for a wider range of data sets. In addition, technical advancements will increase the efficiency of existing likelihood algorithms. A program written by P. Fearnhead which incorporates an importance sampling scheme appears to be a substantial improvement over the Griffiths and Marjoram (1996
) method (results not shown; P. Fearnhead, personal communication). Although in certain circumstances full maximum-likelihood estimators are expected to be asymptotically optimal, there is no guarantee that they are the best estimators for actual data sets (for which there is dependency and the small sizes are far from the asymptotic limit). It is possible that the MLE is highly biased (see, e.g., CGM and CN1 in tables 1 and 2 ), so other likelihood-based methods like the Bayesian approach of CN2 might be best. In any case, table 2
suggests that there are probably still some methodological problems with more than one of the programs used.
Ad hoc estimators will still be necessary for analyzing larger data sets. Of these, compromise approaches that use likelihood methods on summaries of the data (such as CH, CSH, CHRM, and ) look promising. An estimator of C based on the likelihood of observed pairwise sample configurations appears to be superior to CH (Hudson 1993
; R. Hudson, personal communication). Of the estimators considered here, CH or CHRM is preferred as long as the sample size is not too small; otherwise,
is probably the best, although one should keep in mind that
s downward bias is large when the recombination rate is high.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
1 Keywords: recombination,
coalescent simulations,
evolutionary inference,
statistical tests of neutrality.
2 Address for correspondence and reprints: Jeffrey D. Wall, Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, Illinois 60637. E-mail: jdwall{at}midway.uchicago.edu
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Ashburner, M. 1989. Drosophila: a laboratory handbook. Cold Spring Harbor Laboratory Press, New York.
Bouffard, G. G., J. R. Idol, V. V. Braden et al. (14 co-authors). 1997. A physical map of human chromosome 7: an integrated YAC contig map with average STS spacing of 79 kb. Genome Res. 7:673692.
Depaulis, F., and M. Veuille. 1998. Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol. 15:17881790.
Fu, Y.-X. 1996. New statistical tests of neutrality for DNA samples from a population. Genetics 143:557570.
. 1997. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147:915925.
Fu, Y.-X., and W.-H. Li. 1997. Estimating the age of the common ancestor of a sample of DNA sequences. Mol. Biol. Evol. 14:195199.[Abstract]
Fullerton, S. M., R. M. Harding, A. J. Boyce, and J. B. Clegg. 1994. Molecular and population analysis of allelic sequence diversity at the human betaglobin locus. Proc. Natl. Acad. Sci. USA 91:18051809.
Griffiths, R. C. 1981. Neutral two-locus multiple allele models with recombination. Theor. Popul. Biol. 19:169186.[ISI]
. 1982. The number of alleles and segregating sites in a sample from the infinite-alleles model. Adv. Appl. Probab. 14:225239.[ISI]
Griffiths, R. C., and P. Marjoram. 1996. Ancestral inference from samples of DNA sequences with recombination. J. Comp. Biol. 3:479502.[ISI]
Harding, R. M., S. M. Fullerton, R. C. Griffiths, J. Bond, M. J. Cox, J. A. Schneider, D. S. Moulin, and J. B. Clegg. 1997. Archaic African and Asian lineages in the genetic ancestry of modern humans. Am. J. Hum. Genet. 60:772789.[ISI][Medline]
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97109.
Hey, J., and J. Wakeley. 1997. A coalescent estimator of the population recombination rate. Genetics 145:833846.
Hudson, R. R. 1983. Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23:183201.[ISI][Medline]
. 1987. Estimating the recombination parameter of a finite population model without selection. Genet. Res. 50:245250.[ISI][Medline]
. 1990. Gene genealogies and the coalescent process. Pp. 144 in D. Futuyma and J. Antonovics, eds. Oxford surveys in evolutionary biology. Vol. 7. Oxford University Press, New York.
. 1993. The how and why of generating gene genealogies. Pp. 2336 in N. Takahata and A. G. Clark, eds. Mechanisms of molecular evolution. Sinauer, Sunderland, Mass.
Hudson, R. R., and N. L. Kaplan. 1985. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111:147164.
Kaplan, N. L., and R. R. Hudson. 1985. The use of sample genealogies for studying a selectively neutral m-loci model with recombination. Theor. Popul. Biol. 28:382396.[ISI][Medline]
Kingman, J. F. C. 1982a. On the genealogy of large populations. J. Appl. Probab. 19A:2743.
. 1982b. The coalescent. Stochastic Proc. Appl. 13:235248.
Kuhner, M. K., J. Yamato, and J. Felsenstein. 1999. RECOMBINE. Version 1.0. Available at http://evolution.genetics.washington.edu/lamarc.html.
Labate, J. A., C. H. Biermann, and W. F. Eanes. 1999. Nucleotide variation at the runt locus in Drosophila melanogaster and Drosophila simulans. Mol. Biol. Evol. 16:724731.[Abstract]
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21:10871091.[ISI]
Nagaraja, R., S. Macmillan, J. Kere et al. (25 co-authors). 1997. X chromosome map at 75-kb STS resolution, revealing extremes of recombination and GC content. Genome Res. 7:210222.[Abstract]
Nickerson, D. A., S. L. Taylor, K. M. Weiss, A. G. Clark, R. G. Hutchinson, J. Stengard, V. Salomaa, E. Vartiainen, E. Boerwinkle, and C. F. Sing. 1998. DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nat. Genet. 19:233240.[ISI][Medline]
Pluzhnikov, A., and P. Donnelly. 1996. Optimal sequencing strategies for surveying molecular genetic diversity. Genetics 144:12471262.
Strobeck, C. 1987. Average number of nucleotide differences in a sample from a single subpopulation: a test for population subdivision. Genetics 117:149153.
Tajima, F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585595.
Takahata, N., and Y. Satta. 1997. Evolution of the primate lineage leading to modern humans: phylogenetic and demographic inferences from DNA sequences. Proc. Natl. Acad. Sci. USA 94:48114815.
True, J. R., J. M. Mercer, and C. C. Laurie. 1996. Differences in crossover frequency and distribution among three sibling species of Drosophila. Genetics 142:507523.
Wakeley, J. 1997. Using the variance of pairwise differences to estimate the recombination rate. Genet. Res. 69:4548.[ISI][Medline]
Wall, J. D. 1999. Recombination and the power of statistical tests of neutrality. Genet. Res. 74:6579.[ISI]
Watterson, G. A. 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256276.[ISI][Medline]
Weiss, G., and A. von Haeseler. 1998. Inference of population history using a likelihood approach. Genetics 149:15391546.
Zietkiewicz, E., V. Yotova, M. Jarnik et al. (11 co-authors). 1997. Nuclear DNA diversity in worldwide distributed human populations. Gene 205:161171.