* Section of Evolutionary Biology, Department of Biology II, University of Munich, Munich, Germany
Laboratoire Populations Génétique et Evolution, CNRS, Gif-sur-Yvette, France
Laboratoire d'Ecologie, Ecole Pratique des Hautes Etudes, Université Pierre et Marie Curie, Paris, France
Correspondence: E-mail: mousset{at}zi.biologie.uni-muenchen.de.
Key Words: neutrality test; DNA polymorphism; mismatch distribution; selective sweep; demography
Introduction
On the one hand, the effects of population expansion on the mismatch distribution (the distribution of the number of differences between pairs of sequences in a sample) have been studied for cases in which recombination is weak or absent (Rogers and Harpending 1992). In such cases, population expansion typically leaves a wave in the mismatch distribution (Slatkin and Hudson 1991; Rogers and Harpending 1992). These results have been applied to infer the demographic history of populations using nonrecombining DNA sequences such as mitochondrial DNA (Comas et al. 1996; Excoffier and Schneider 1999) or the nonrecombining part of the Y chromosome (Pereira et al. 2001), mostly from human data (reviewed in Harpending and Rogers 2000). These inferences are based on a population expansion model, using a maximum likelihood approach to determine the parameters that best fit the observations.
On the other hand, Hudson (1987) studied the effects of recombination on the mismatch distribution and proposed an estimator of the recombination parameter R = 4Ner, where r is the intragenic recombination rate, based on the variance of this distribution. Assuming no recombination and a constant population size, the mismatch distribution F converges toward an equilibrium (Watterson 1975), and the probability that two randomly chosen sequences have exactly i differences is
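$$F_i = \frac{1}{1+\theta}\left(\frac{\theta}{1+\theta}\right)^{i},$$

where $\theta$ is the mutational parameter.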
When population expansion from a very small initial population size or a complete selective sweep occurs, the genealogy tends to become star-like (figure 3 in Harpending et al., 1998), and the number of differences between pairs of sequences tends to conform to a Poisson distribution with parameter 2. In this extreme case the mean and variance of this distribution are
equal, because the mean and the variance of a Poisson distribution both equal its parameter.
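The two limiting cases described above can be compared numerically. The sketch below assumes that CV denotes the coefficient of variation (standard deviation divided by the mean) and that the equilibrium case is the geometric distribution with mean θ; the function names are illustrative and not taken from the article. The values describe a single pair of sequences drawn from independent genealogies, so a CV computed within one sample, where pairs share ancestry, need not match them.

```python
import math


def geometric_mismatch_pmf(i, theta):
    """Equilibrium probability that two sequences differ at exactly i sites."""
    return (1.0 / (1.0 + theta)) * (theta / (1.0 + theta)) ** i


def cv_equilibrium(theta):
    """CV of the geometric case: mean theta, variance theta + theta**2."""
    return math.sqrt(theta + theta ** 2) / theta


def cv_poisson(lam):
    """CV of the star-like (Poisson) case: mean and variance both equal lam."""
    return math.sqrt(lam) / lam
```

Under these assumptions the equilibrium CV, sqrt(1 + 1/θ), always exceeds 1, whereas the Poisson CV, 1/sqrt(λ), falls below 1 as soon as the parameter λ exceeds 1. This is the contrast between mean and variance that a CV-based statistic can exploit.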
In this article, we characterize the effects of recombination and selection on the mismatch distribution. Based on this distribution, we propose a test detecting any deviation from the expectations of a standard neutral equilibrium model. Unlike maximum likelihood approaches, this test assumes no model of selection or of population expansion.
Materials and Methods
Power Analysis of the CV-Test
We tested our ability to detect selective sweeps using the CV statistic and compared the power of this new test to that of five other tests: Tajima's D (Tajima 1989), Fay and Wu's H statistic (Fay and Wu 2000, hereafter FW-test), the haplotype partition test based on the frequency of the major haplotype HP (Hudson et al. 1994, hereafter HP-test), and the K and H haplotype tests (Depaulis and Veuille 1998; hereafter K-test and H-test, respectively).
In a first stage, we computed the thresholds of these tests using neutral coalescent simulations with constant population size, conditional on the sample size n = 20 and various numbers of segregating sites S (Depaulis, Mousset, and Veuille 2001, 2003), with the conservative assumption of no recombination. The tests were one-tailed (using the lower bounds of Tajima's D, Fay and Wu's H, Depaulis and Veuille's K and H, and the upper bound of HP and CV), and the nominal rejection rate was fixed at 0.05.
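A threshold of this kind can be sketched with a small coalescent simulation. The code below is a minimal illustration, not the authors' program: it assumes a standard neutral coalescent without recombination, places exactly S mutations on branches with probability proportional to branch length, and takes the CV to be the population standard deviation of the pairwise differences divided by their mean. All function names and the default number of replicates are hypothetical.

```python
import random
import statistics
from itertools import combinations


def coalescent_branches(n, rng):
    """Return (leaf set, length) for every branch of a neutral coalescent tree."""
    lineages = [(frozenset([i]), 0.0) for i in range(n)]
    branches = []
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time until the next coalescence among k lineages.
        t = rng.expovariate(k * (k - 1) / 2.0)
        lineages = [(leaves, length + t) for leaves, length in lineages]
        i, j = rng.sample(range(k), 2)
        a, b = lineages[i], lineages[j]
        branches.extend([a, b])
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append((a[0] | b[0], 0.0))
    return branches


def mismatch_cv(n, s, rng):
    """CV of the pairwise differences for one genealogy carrying s mutations."""
    branches = coalescent_branches(n, rng)
    weights = [length for _, length in branches]
    mutated = rng.choices(branches, weights=weights, k=s)
    diffs = [
        sum((i in leaves) != (j in leaves) for leaves, _ in mutated)
        for i, j in combinations(range(n), 2)
    ]
    return statistics.pstdev(diffs) / statistics.fmean(diffs)


def cv_upper_threshold(n, s, replicates=1000, level=0.05, seed=0):
    """Empirical upper critical value of the CV under neutrality, given n and S."""
    rng = random.Random(seed)
    values = sorted(mismatch_cv(n, s, rng) for _ in range(replicates))
    index = min(replicates - 1, int((1.0 - level) * replicates))
    return values[index]
```

Calling `cv_upper_threshold(20, 30)` would give one upper bound for n = 20 and S = 30; a lower bound for a one-tailed test in the other direction would read the opposite end of the sorted values.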
There are three parameters in our model of positive selection: the selection parameter, the time elapsed since the fixation of the mutant allele (in units of Ne generations), and the distance C between the studied locus and the target of selection. In this article, we focus on the effect of the last two parameters, the time since fixation and C. In a second stage, we simulated coalescent trees conditional on the sample size n = 20, the mutational parameter θ = 10, and the recombination parameter R = 10 in a sample of sequences where selection occurs either at a fixed genetic distance C = 90 from the observed locus and at increasing times in the past, or at a fixed time of 0.1 and at increasing distances C from the observed locus. For each simulated set of sequences, the observed statistics were compared to the thresholds obtained at the first stage (conditional on the observed number of segregating sites S); we then recorded the rejection rate for each test. This procedure was designed to mimic the conditions faced in empirical studies, where the mutation parameter θ is usually unknown (Fay and Wu 2000; Wall and Hudson 2001; Przeworski 2002; Depaulis et al. 2004). Detailed power analyses for neutrality tests and population expansion tests have been published by Depaulis et al. (2004) and Ramos-Onsins and Rozas (2002), respectively.
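The second stage amounts to looking up, for each simulated data set, the threshold that matches its observed number of segregating sites and counting how often the statistic falls beyond it. A minimal sketch follows, assuming thresholds are stored in a dictionary keyed by S and that rejection uses a strict inequality; both choices, and the function name, are illustrative assumptions.

```python
def rejection_rate(observations, thresholds, tail="upper"):
    """Fraction of simulated data sets rejected by a one-tailed test.

    observations: list of (S, statistic) pairs, one per simulated data set.
    thresholds:   dict mapping S to the critical value obtained under neutrality.
    tail:         "upper" rejects large values, "lower" rejects small values.
    """
    rejected = 0
    for s, value in observations:
        limit = thresholds[s]
        if tail == "upper":
            rejected += value > limit
        else:
            rejected += value < limit
    return rejected / len(observations)
```

Under this scheme the upper tail would apply to the HP and CV statistics and the lower tail to Tajima's D, Fay and Wu's H, and the K and H haplotype statistics, as stated in the text.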
Pairwise Comparisons Between Neutrality Tests
To determine whether tests are redundant or complementary with each other, we performed pairwise power comparisons based on the previous simulations. For each pair of tests, we recorded the rate of rejection, either by the two tests simultaneously or by only one of them. A higher common rejection rate indicates a higher redundancy.
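The pairwise comparison can be sketched as a simple tally over simulated data sets. The function below is an illustration under the assumption that each test's outcome is stored as one boolean per replicate; the name and the returned keys are hypothetical.

```python
def pairwise_rejection(reject_a, reject_b):
    """Rates at which two tests reject together or separately.

    reject_a, reject_b: equal-length lists of booleans, one entry per
    simulated data set, True when the corresponding test rejects neutrality.
    """
    n = len(reject_a)
    both = sum(a and b for a, b in zip(reject_a, reject_b)) / n
    only_a = sum(a and not b for a, b in zip(reject_a, reject_b)) / n
    only_b = sum(b and not a for a, b in zip(reject_a, reject_b)) / n
    return {"both": both, "only_a": only_a, "only_b": only_b}
```

A large "both" rate relative to "only_a" and "only_b" points to redundancy, whereas large one-sided rates point to complementarity.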
Experimental Data
We also applied the CV-test to previously published data. We used a sample of the Sod region of Drosophila melanogaster, in which positive selection had already been documented (Hudson et al. 1994), and a human sample of mitochondrial DNA in which population expansion had already been characterized (Comas et al. 1996). The CV values were obtained from the distribution of the number of pairwise differences in the Sod sample (Hudson et al. 1994, table 2) or using DnaSP 3.53 (Rozas and Rozas 1999). The CV-tests were performed using a lower recombination rate (5 × 10⁻⁹/bp/generation) than the observed rate in the Sod chromosomal region (3.85 × 10⁻⁸/bp/generation; Comeron, Kreitman, and Aguadé 1999), and no recombination for the mitochondrial data. The effective population size Ne of D. melanogaster was assumed to be 10⁶ (Li, Satta, and Takahata 1999).
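Both quantities used here are easy to compute by hand. The sketch below assumes that the CV is the standard deviation of the number of pairwise differences divided by its mean, and that the scaled recombination parameter for a locus is 4Ner multiplied by its length; the function names and the 1,000-bp length in the usage note are hypothetical and do not describe the Sod data.

```python
import math


def cv_from_counts(counts):
    """CV of a mismatch distribution given as counts[i] = pairs with i differences."""
    total = sum(counts)
    mean = sum(i * c for i, c in enumerate(counts)) / total
    var = sum(c * (i - mean) ** 2 for i, c in enumerate(counts)) / total
    return math.sqrt(var) / mean


def scaled_recombination(ne, r_per_bp, length_bp):
    """Recombination parameter R = 4 * Ne * r for a locus of length_bp base pairs."""
    return 4.0 * ne * r_per_bp * length_bp
```

With Ne = 10⁶ and the conservative rate of 5 × 10⁻⁹/bp/generation, `scaled_recombination(1e6, 5e-9, 1000)` gives about 20 for a hypothetical 1,000-bp locus, that is, 0.02 per base pair.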
Results
Power of the CV-Test and Other Tests
Our analysis confirms previous findings about neutrality tests (Depaulis, Mousset, and Veuille 2004). Most tests (including the new CV-test but not Tajima's D) effectively detected recent selective sweep events (fig. 3a), provided intergenic recombination occurred during the selective phase between the selected locus and the observed locus (fig. 3b). Simulations showed that the CV-test performs on average better than the others, except Tajima's D in the case of ancient sweeps (fig. 3a) or when the target of selection is close to the studied locus (fig. 3b). It is worth noting that our probability of detecting deviation from neutrality when combining the results of all six tests falls below the 5% threshold when selection is ancient (dashed line, fig. 3a) or occurs too far from the studied locus (dashed line, fig. 3b).
At a fixed time of 0.1 after a selective event occurred, the CV-test seems reasonably complementary to most other tests, especially the K- and FW-tests (fig. 5a). The CV-test seems to be more efficient than other tests, except Tajima's D, at detecting nearby selective events (fig. 5).
The HP- and H-tests are highly redundant over a wide range of values of C relative to the selection parameter, and together with the FW-test, they greatly improve the power to detect selection at distant loci when used in combination with other tests (fig. 5b, 5d, and 5e). Once again, the K-test is highly redundant with most of the other tests (fig. 5c). Finally, Tajima's D is more effective at detecting selection occurring at short distances (fig. 5f).
Experimental Data
Hudson et al. (1994) found an excess of a haplotype family consisting of sequences differing by a single replacement at the Superoxide dismutase (Sod) locus in a sample of Drosophila melanogaster from different locations. We applied the CV-test to the three Random Constructed Samples (RCS) used by these authors, assuming a conservatively low recombination rate (see Materials and Methods). All three tests showed a significantly higher than expected CV (total RCS, n = 25, S = 63, CV = 0.773, P = 0.002; Barcelona RCS, n = 10, S = 55, CV = 0.763, P = 0.007; Culver City RCS, n = 11, S = 30, CV = 0.713, P = 0.018). These results are consistent with positive selection acting in the neighborhood of this gene, as previously described by Hudson et al. (1994).
Whereas partial selective sweeps increase the CV of the mismatch distribution, population expansion tends to reduce it (Rogers and Harpending 1992, especially their figure 6). We applied the CV-test to a human mtDNA population sample (n = 45, S = 56, Comas et al. 1996; see also Ramos-Onsins and Rozas [2002], table 1, for other tests on this data set). We were able to detect a significantly lower than expected CV, consistent with population expansion documented in this sample (CV = 0.4226, P = 0.004). This trend remains significant when assuming a recombination rate of R = 2 for this region (mitochondrial control region, not recombining, 360 nucleotides, P = 0.02). This assumption could be used to design a more conservative test if the absence of recombination is uncertain.
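The P values reported for these data sets can be read as the fraction of neutral simulations at least as extreme as the observed CV, in the upper tail for the Sod samples and in the lower tail for the mtDNA sample. The sketch below assumes that the null values come from simulations conditional on the same n and S; the function name and the inclusive comparison are illustrative assumptions.

```python
def empirical_p_value(observed, null_values, tail):
    """One-tailed empirical P value of an observed CV against simulated values.

    tail="upper" tests for a CV higher than expected (e.g. partial sweeps);
    tail="lower" tests for a CV lower than expected (e.g. population expansion).
    """
    if tail == "upper":
        hits = sum(v >= observed for v in null_values)
    else:
        hits = sum(v <= observed for v in null_values)
    return hits / len(null_values)
```

The precision of such a P value is limited by the number of simulated replicates, so small values such as 0.002 require at least several thousand of them.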
Discussion
Here we propose to use the mismatch distribution as the basis of a test that would detect any deviation from the standard model, including selection, and we performed power analyses to compare its power to detect selective sweeps to that of five other tests. Most neutrality tests compare different estimates of the mutational parameter θ obtained from empirical data: Tajima's D compares Watterson's estimate based on the total number of segregating sites in the sample (Watterson 1975) to Tajima's estimate based on the mean number of differences between pairs of sequences (Tajima 1989); Fay and Wu's FW-test compares Tajima's estimate to another estimate weighted by the homozygosity of the derived variants (Fay and Wu 2000); the H-test (Depaulis and Veuille 1998) uses haplotype heterozygosity, H = θ/(1 + θ) under an infinite-allele model (Malécot 1951; Kimura and Crow 1964). Finally, the K-test uses the estimate of θ as derived from the number of haplotypes according to Ewens's (1972) sampling theory under an infinite-allele model. This test follows the same rationale as a neutrality test put forward by Strobeck (1987) and as Fu's Fs test (1997). The three tests are actually very similar.
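The estimates being compared can be written down in a few lines. The sketch below uses the standard forms of Watterson's estimator (S divided by the harmonic sum over 1 to n − 1), the mean number of pairwise differences, and the expected heterozygosity θ/(1 + θ) under the infinite-allele model; function names are illustrative, and sequences are assumed to be aligned strings of equal length.

```python
from itertools import combinations


def watterson_theta(s, n):
    """Watterson's estimate of theta from S segregating sites in n sequences."""
    return s / sum(1.0 / i for i in range(1, n))


def mean_pairwise_differences(seqs):
    """Tajima's estimate of theta: mean number of differences between pairs."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def expected_heterozygosity(theta):
    """Expected haplotype heterozygosity under the infinite-allele model."""
    return theta / (1.0 + theta)
```

Tajima's D is built on the difference between the second and the first of these estimates; the normalizing variance term is omitted here.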
The proposed CV-test relies only on the mismatch distribution of a sample of DNA sequences. Its value may be strongly reduced by recombination (equation 4) and complete selective sweep events (equation 5), or it may be strongly increased by partial selective sweep events (fig. 1). Finally, the CV-test will be sensitive to any underlying cause for discrepancies between the mean and the variance of the distribution of pairwise differences.
Pairwise Comparisons Between Neutrality Tests
Figure 3 can be used to compare the overall power of neutrality tests. Such comparisons should be considered with caution, however, because the power of neutrality tests is sensitive to the overall level of variation, and thus it also depends on the relative values of the different parameters in the model (θ, R, C, the selection parameter, and the time since fixation).
Pairwise comparisons between tests showed different levels of redundancy and of complementarity between them (figs. 4 and 5). On the whole, Tajima's D appears to be very different from all other tests, because its applicability covers regions that are closer to the target of selection and extends over a longer time after the selective sweep. It clearly outcompetes the other tests in such contexts. This is because Tajima's D is based on the restoration of variability by new mutations after a complete or extensive sweeping of variation. In contrast, the other tests are mostly based on the frequency distribution of mutations that have escaped the sweeping process. Broadly speaking, the FW- and K-tests are sensitive to a change in mutation frequency or in haplotype frequency, respectively, whereas the H- and HP-tests are sensitive to the structuring of variation into haplotypes, that is to linkage disequilibrium. The K- and FW-tests were not very effective at detecting selection in our simulations. The H- and HP-tests were more effective and yielded almost identical results. The curves corresponding to these two tests are virtually superimposed in figures 4 and 5. Note that in the context of selection, a given haplotype tends to become the major haplotype. Therefore, the H-test tends to become an HP-test (based on the frequency of a single, major haplotype), and hence conveys roughly the same information in this special case. In the case of population bottlenecks, the H- and K-tests are more efficient than the HP-test, because haplotype variation is usually reduced to more than a single surviving haplotype in such a context (Depaulis et al., 2003). The CV-test did as well as the H- and HP-tests in our simulations. Like these tests, it is based on linkage disequilibrium between polymorphic sites. Clearly, any of these three tests could be used for detecting recent selective sweeps at some distance from the selected locus. 
The utility of the CV-test rests on a somewhat easier interpretation in a series of contexts involving selection, demography, and recombination. Further studies using actual data should tell which of these tests is the best for a general approach.
Population Expansion
The CV-test belongs to the Ramos-Onsins and Rozas (2002) class III tests that use information from the mismatch distribution. Its power to detect population expansion may thus be close to that of other tests based on this distribution (see Ramos-Onsins and Rozas 2002, for more details). Population expansion, like intragenic recombination, tends to reduce the CV value. It is therefore important never to use an underestimate of R when running one-tailed CV-tests to detect population expansion from data obtained in recombining regions of the genome. This means that the upper bound of the confidence interval of experimentally derived recombination rate estimates should be used. When this upper bound is unknown, a high recombination rate should systematically be used to make sure that the test is conservative. In this respect, this test may be difficult to apply to highly recombining regions, because available coalescent simulation programs may not be able to handle such high rates.
Positive Selection at a Linked Site
Unlike recombination or population expansion, partial selective sweeps tend to increase the value of the CV statistic. Assuming a recombination rate lower than the actual one is thus conservative when performing a one-tailed CV-test to detect partial sweeps. Using a nonzero (but conservatively low) recombination rate may, however, increase the power of this test (see fig. 2). Power analyses (figs. 3a and 3b) show the power of this test to be close to that of the other tests used for detecting partial selective sweeps. Pairwise comparisons between tests (figs. 4 and 5) show that the CV-test is partly redundant with other tests and partly complementary, as a number of deviations detected by this test are not detected by other tests, in particular the K-test and the FW-test.
Finally, our power analysis shows that the power to detect deviation from neutrality is improved by using several different tests. Our simulations showed that the parameters we used to design our test and perform the power analysis did not lead to a higher than expected type I error (or "achieved level of significance"; Fu 1996). In principle, a special correction should be applied when multiple tests are performed, but authors of experimental studies often use several neutrality tests without such a correction. Our study raises the question of the tendency toward multiple testing in molecular population genetics. Clearly, different tests have different properties and can detect different kinds of deviations from the standard neutral equilibrium model. It is therefore tempting to run several tests simultaneously on the same data set. However, all tests also have overlapping ranges of application. One solution, suggested by our results, is to run preliminary tests using realistic values for key parameters such as the recombination rate R and the mutational parameter θ. It is then possible to determine the extent of redundancy from pairwise comparisons. An assessment of this property should also be considered when carrying out multiple tests. In our simulations the achieved level of significance always remained below the nominal value, even though the results of the tests were combined (fig. 3). To perform conservative neutrality tests, the intragenic recombination rate R was intentionally set to a lower value in the testing procedure (R = 0) than in the data-generating procedure (R = 10). Because the actual type I error of individual tests was thus lower than the nominal 5% rate, and because the tests are partly redundant, the achieved level of significance obtained when all tests were combined did not rise above the nominal threshold.
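The achieved level of significance for a battery of tests can be checked directly from neutral simulations by counting how often at least one test rejects. The sketch below assumes one row of booleans per simulated data set and one column per test; the Bonferroni function shows one common correction and is included only for comparison, as the article does not specify which correction it has in mind. Both function names are hypothetical.

```python
def combined_rejection_rate(rejections):
    """Fraction of data sets in which at least one test rejects neutrality.

    rejections: list of rows, one per simulated data set, each row holding
    one boolean per test (True when that test rejects).
    """
    return sum(any(row) for row in rejections) / len(rejections)


def bonferroni_level(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests
```

If the combined rate estimated under neutrality stays below the nominal 0.05, as reported for these simulations, no further correction is needed for that parameter range; otherwise a per-test level such as `bonferroni_level(0.05, 6)` is one conservative option.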
Literature Cited
Braverman, J. M., R. R. Hudson, N. L. Kaplan, C. H. Langley, and W. Stephan. 1995. The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics 140:783-796.
Comas, D., F. Calafell, E. Mateu, A. Perez-Lezaun, and J. Bertranpetit. 1996. Geographic variation in human mitochondrial DNA control region sequence: the population history of Turkey and its relationship to the European populations. Mol. Biol. Evol. 13:1067-1077.
Comeron, J. M., M. Kreitman, and M. Aguadé. 1999. Natural selection on synonymous sites is correlated with gene length and recombination in Drosophila. Genetics 151:239-249.
Depaulis, F., S. Mousset, and M. Veuille. 2001. Haplotype tests using coalescent simulations conditional on the number of segregating sites. Mol. Biol. Evol. 18:1136-1138.
Depaulis, F., S. Mousset, and M. Veuille. 2003. Power of neutrality tests to detect bottlenecks and hitchhiking. J. Mol. Evol. 57:s190-s200.
Depaulis, F., S. Mousset, and M. Veuille. 2004. Detecting selective sweeps with haplotype tests. In D. Nurminsky, ed., Selective Sweep. Landes Biosciences, Georgetown, Tex. In press.
Depaulis, F., and M. Veuille. 1998. Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol. 15:1788-1790.
Ewens, W. J. 1972. The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3:87-112.
Excoffier, L., and S. Schneider. 1999. Why hunter-gatherer populations do not show signs of Pleistocene demographic expansions. Proc. Natl. Acad. Sci. USA 96:10597-10602.
Fay, J. C., and C.-I. Wu. 2000. Hitchhiking under positive Darwinian selection. Genetics 155:1405-1413.
Fu, Y.-X. 1996. New statistical tests of neutrality for DNA samples from a population. Genetics 143:557-570.
Fu, Y.-X. 1997. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147:915-925.
Glinka, S., L. Ometto, S. Mousset, W. Stephan, and D. De Lorenzo. 2003. Demography and natural selection have shaped genetic variation in Drosophila melanogaster: a multi-locus approach. Genetics 165:1269-1278.
Harpending, H., and A. Rogers. 2000. Genetic perspectives on human origins and differentiation. Annu. Rev. Genomics Hum. Genet. 1:361-385.
Harpending, H. C., M. A. Batzer, M. Gurven, L. B. Jorde, A. R. Rogers, and S. T. Sherry. 1998. Genetic traces of ancient demography. Proc. Natl. Acad. Sci. USA 95:1961-1967.
Hudson, R. R. 1987. Estimating the recombination parameter of a finite population model without selection. Genet. Res. 50:245-250.
Hudson, R. R. 1990. Gene genealogies and the coalescent process. Pp. 1-45 in D. Futuyma and J. Antonovics, eds., Oxford surveys in evolutionary biology, vol. 7, Oxford University Press, Oxford, U.K.
Hudson, R. R. 1993. The how and why of generating gene genealogies. Pp. 23-36 in N. Takahata and A. Clark, eds., Mechanisms of molecular evolution, Sinauer Associates, Sunderland, Mass.
Hudson, R. R., K. Bailey, D. Skarecky, J. Kwiatowski, and F. J. Ayala. 1994. Evidence for positive selection in the superoxide dismutase (sod) region of Drosophila melanogaster. Genetics 136:1329-1340.
Kim, Y., and W. Stephan. 2002. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160:765-777.
Kimura, M., and J. Crow. 1964. The number of alleles that can be maintained in a finite population. Genetics 49:725-738.
Li, Y. J., Y. Satta, and N. Takahata. 1999. Paleo-demography of the Drosophila melanogaster subgroup: application of the maximum likelihood method. Genes Genet. Syst. 74:117-127.
Malécot, G. 1951. Un traitement stochastique des problèmes linéaires (mutation, linkage, migration) en génétique de population. Ann. Univ. Lyon Sci. Sec. A 14:79-117.
Maynard Smith, J., and J. Haigh. 1974. The hitch-hiking effect of a favourable gene. Genet. Res. 23:23-35.
Mousset, S., L. Brazier, M. L. Cariou, F. Chartois, F. Depaulis, and M. Veuille. 2003. Evidence of a high rate of selective sweeps in African Drosophila melanogaster. Genetics 163:599-609.
Pereira, L., I. Dupanloup, Z. H. Rosser, M. A. Jobling, and G. Barbujani. 2001. Y-chromosome mismatch distributions in Europe. Mol. Biol. Evol. 18:1259-1271.
Przeworski, M. 2002. The signature of positive selection at randomly chosen loci. Genetics 160:1179-1189.
Ramos-Onsins, S. E., and J. Rozas. 2002. Statistical properties of new neutrality tests against population growth. Mol. Biol. Evol. 19:2092-2100.
Rogers, A. R., and H. Harpending. 1992. Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9:552-569.
Rozas, J., and R. Rozas. 1999. DnaSP version 3: an integrated program for molecular population genetics and molecular evolution analysis. Bioinformatics 15:174-175.
Slatkin, M., and R. R. Hudson. 1991. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129:555-562.
Strobeck, C. 1987. Average number of nucleotide differences in a sample from a single subpopulation: a test for population subdivision. Genetics 117:149-153.
Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437-460.
Tajima, F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595.
Wall, J., and R. Hudson. 2001. Coalescent simulations and statistical tests of neutrality. Mol. Biol. Evol. 18:1134-1135.
Watterson, G. 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256-276.