Departament de Genètica, Universitat de Barcelona, Barcelona
Introduction
Tajima (1989b), Slatkin and Hudson (1991), and Rogers and Harpending (1992) pioneered the study of the effect of demographic events on DNA sequence data. They showed that a relatively recent demographic event, such as population growth, causes most coalescent events to occur before the expansion; consequently, samples from such populations have gene genealogies that are stretched near the external nodes and compressed near the root (i.e., star genealogies). Thus, population size changes can leave a particular footprint that may eventually be detected in DNA sequence data. This theoretical framework prompted the development of statistical tests for detecting population expansion.
The analysis of the distribution of pairwise differences, or mismatch distribution (Slatkin and Hudson 1991; Rogers and Harpending 1992), provides a method for inferring such demographic events. These authors showed that, for nonrecombining DNA regions, the mismatch distributions of constant size populations bear very little resemblance to those expected in growing populations. This prompted the development of several statistical tests for detecting expansion processes (Harpending et al. 1993; Harpending 1994; Eller and Harpending 1996; Rogers et al. 1996). One of the most frequently used tests is the raggedness statistic rg (Harpending et al. 1993). Although the distribution of the rg statistic is unknown, its confidence intervals can be obtained by computer simulations based on the coalescent algorithm. However, because methods based on the mismatch distribution use little of the information contained in the data (Felsenstein 1992), tests based on the mismatch distribution should be very conservative.
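As an illustration of the quantities discussed above, the following minimal Python sketch computes a mismatch distribution from aligned sequences and a raggedness value. It assumes the usual form of the raggedness statistic, the sum of squared differences between adjacent classes of the mismatch distribution, r = Σ (x_i − x_{i−1})² for i = 1, ..., d + 1, with x_{d+1} = 0. The function names are hypothetical and are not taken from any published program.

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(seqs):
    """Relative frequency of sequence pairs that differ at i sites, i = 0..d."""
    diffs = [sum(a != b for a, b in zip(s, t)) for s, t in combinations(seqs, 2)]
    counts = Counter(diffs)
    n_pairs = len(diffs)
    return [counts.get(i, 0) / n_pairs for i in range(max(diffs) + 1)]


def raggedness(freqs):
    """Sum of squared differences between adjacent mismatch classes.

    Assumed form: r = sum_{i=1}^{d+1} (x_i - x_{i-1})^2, with x_{d+1} = 0.
    """
    padded = list(freqs) + [0.0]
    return sum((padded[i] - padded[i - 1]) ** 2 for i in range(1, len(padded)))


if __name__ == "__main__":
    sample = ["AAAA", "AAAT", "AATT", "TTTT"]
    x = mismatch_distribution(sample)
    print(x, raggedness(x))
```

A smooth, unimodal mismatch distribution gives a small value, whereas a multimodal (ragged) one gives a larger value.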
In recent years, a number of authors have developed several methods of statistical inference and statistical tests using different approaches (e.g., Griffiths and Tavaré 1994; Bertorelle and Slatkin 1995; Rogers 1995; Aris-Brosou and Excoffier 1996; Fu 1996, 1997; Kuhner, Yamato, and Felsenstein 1998; Weiss and Von Haeseler 1998; Galtier, Depaulis, and Barton 2000; Furlong and Brookfield 2001). More recently, specific methods for detecting population expansions have also been developed for the analysis of microsatellite data (e.g., Kimmel et al. 1998; Reich and Goldstein 1998; Beaumont 1999; Reich, Feldman, and Goldstein 1999; King, Kimmel, and Chakraborty 2000).
Here we report the development of new statistical tests for detecting past population growth. We performed an extensive analysis of their statistical power against different alternative hypotheses, and we compared their performance with that of other tests published in the literature. Although some authors (Braverman et al. 1995; Simonsen, Churchill, and Aquadro 1995; Fu 1996, 1997) have also investigated the power of some statistical tests against population growth and genetic hitchhiking (which leave similar footprints in DNA sequences), at present there is no exhaustive comparative analysis. The main population growth model investigated was sudden (instantaneous) growth, although we also studied the power under the logistic model of population growth. The power of these tests was evaluated using random data sets generated by computer simulations based on the coalescent (Hudson 1990).
Materials and Methods
Class I Statistics
Class I statistics use information on the mutation (segregating site) frequency. These statistics could be appropriate for distinguishing population growth from constant population size because growth generates an excess of mutations in external branches of the genealogy (i.e., recent mutations) and therefore an excess of singletons (substitutions present in only one sampled sequence) (Tajima 1989a, 1989b; Slatkin and Hudson 1991).
We studied the following test statistics: Tajima's D, and Fu and Li's D*, F*, D (named DF), and F statistics (Tajima 1989a; Fu and Li 1993; see also Simonsen, Churchill, and Aquadro 1995). These tests are based on the difference between two alternative estimates of the mutational parameter θ = 2Nu, where N is the effective number of gene copies in the population (the number of females in the population for mtDNA regions, or double the population size for an autosomal region) and u is the mutation rate. Tajima's D and Fu and Li's D* and F* statistics use only intraspecific data, whereas Fu and Li's DF and F statistics use the number of recent mutations and therefore require an outgroup to be computed.
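To make the idea of contrasting two estimates of θ concrete, here is a minimal Python sketch of Tajima's D, following the standard formulas of Tajima (1989a). The input format (a list of 0/1 rows, one per sequence) and the function name are assumptions made for illustration only.

```python
from math import sqrt


def tajimas_d(matrix):
    """Tajima's D for a 0/1 matrix (rows = sequences, columns = sites)."""
    n = len(matrix)
    # Keep only segregating (polymorphic) columns.
    sites = [col for col in zip(*matrix) if 0 < sum(col) < n]
    s = len(sites)
    if n < 4 or s == 0:
        raise ValueError("need at least 4 sequences and 1 segregating site")

    # Average number of pairwise differences (k).
    k = sum(2 * sum(c) * (n - sum(c)) for c in sites) / (n * (n - 1))

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between the two estimates of theta, scaled by its standard error.
    return (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    singletons_only = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
    print(tajimas_d(singletons_only))  # negative: excess of rare variants
```

An excess of singletons, as expected after population growth, lowers k relative to S/a1 and therefore gives negative values of D.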
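We developed a number of tests based on the difference between the number of singleton mutations and the average number of nucleotide differences. The R2 statistic is defined as

$$R_2 = \frac{\left(\sum_{i=1}^{n} \left(U_i - \frac{k}{2}\right)^{2} / n\right)^{1/2}}{S} \qquad (1)$$

where n is the sample size, S is the total number of segregating sites, k is the average number of nucleotide differences between two sequences, and U_i is the number of singleton mutations in sequence i.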
We also built two R2-related tests, namely R3 and R4. These statistics differ from the R2 test in the exponent values: in R3 and R4, the exponents 2 and 1/2 (eq. 1) are replaced by 3 and 1/3, and by 4 and 1/4, respectively.
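The following Python sketch shows one way R2 could be computed from a 0/1 alignment matrix. It assumes the form R2 = (Σ_i (U_i − k/2)² / n)^{1/2} / S, where U_i is the number of singleton mutations in sequence i, k is the average number of nucleotide differences, and S is the number of segregating sites. The function name and input format are hypothetical, and without an outgroup a singleton is taken to be the variant carried by exactly one sequence (n ≥ 3 is assumed).

```python
from math import sqrt


def r2_statistic(matrix):
    """R2 for a 0/1 matrix (rows = sequences, columns = segregating sites)."""
    n = len(matrix)
    sites = [col for col in zip(*matrix) if 0 < sum(col) < n]
    s = len(sites)
    if n < 3 or s == 0:
        raise ValueError("need at least 3 sequences and 1 segregating site")

    # Average number of pairwise nucleotide differences.
    k = sum(2 * sum(c) * (n - sum(c)) for c in sites) / (n * (n - 1))

    # U[i]: number of singleton variants carried by sequence i.
    u = [0] * n
    for col in sites:
        ones = sum(col)
        if ones == 1:
            u[col.index(1)] += 1
        elif ones == n - 1:
            u[col.index(0)] += 1

    return sqrt(sum((ui - k / 2) ** 2 for ui in u) / n) / s


if __name__ == "__main__":
    star_like = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    print(r2_statistic(star_like))  # 0.0: every sequence carries k/2 singletons
```

In a perfectly star-like sample each sequence carries about k/2 singletons, so the statistic approaches zero, which is why low values point toward growth.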
We have constructed three additional test statistics (R2E, R3E, and R4E) that use information on the number of mutations in external branches; thus, an outgroup will be required for their estimation. The R2E test is defined as
|
We have also developed two other tests (Ch and Che) based on the difference between the number of singleton mutations (Ch) or of recent mutations (Che) and the corresponding expected value:
|
|
Class II Statistics
In class II, we include statistical tests that use information from the haplotype distribution. Within this class we studied only Fu's FS test statistic (Fu 1997). This statistic, which is based on the Ewens sampling distribution (Ewens 1972), takes low values when there is an excess of singleton mutations, as caused by an expansion.
Class III Statistics
Class III statistical tests use information from the distribution of the pairwise sequence differences (or mismatch distribution). It has been shown that population expansions leave a particular signature in this distribution (Slatkin and Hudson 1991; Rogers and Harpending 1992); therefore, statistics based on the mismatch distribution can be used to test for demographic events. We evaluated the following statistics. (1) The raggedness statistic rg (Harpending et al. 1993; Harpending 1994). The raggedness statistic, which measures the smoothness of the mismatch distribution, differs between constant size and growing populations: lower rg values are expected under the population growth model. (2) The mean absolute error (MAE) between the observed and the theoretical mismatch distribution (Rogers et al. 1996). (3) We also developed a new statistical test, the ku test, based on the fourth central moment (i.e., on the kurtosis) of the mismatch distribution. Given that population expansion generates more smoothly peaked distributions, this statistic can distinguish between constant size and growing populations. Let d, nc, and Wi be the maximum number of differences in the mismatch distribution, the number of pairwise comparisons (= n(n - 1)/2), and the frequency of pairs of DNA sequences that differ by i mutations, respectively. We define:
|
|
Empirical Distributions
We obtained the empirical distribution of each statistical test by Monte Carlo simulations based on the coalescent process for a neutral infinite-sites model, assuming a large population size (Kingman 1982a, 1982b; Hudson 1990). We also assumed that there is neither intragenic recombination nor migration and that the mutation rate is homogeneous across the DNA region. We performed the simulations conditional on the number of segregating sites (S); that is, S mutations were placed randomly along the tree (the so-called fixed S method). Given that the actual value of θ is usually unknown, this method seems to be appropriate for testing purposes (Hudson 1993). The routine ran1 (Press et al. 1992) was used as the random number generator. We conducted coalescent simulations for constant population size (null hypothesis) and for population growth (alternative hypothesis); the empirical distribution was estimated from 100,000 computer replicates for both the null and the alternative hypotheses.
For the constant size model (null hypothesis), the samples were generated using conventional procedures (Hudson 1990); in this model only two parameters are required: the sample size and the number of segregating sites. The sudden population growth model (Rogers and Harpending 1992) considers a population that was formerly at equilibrium, but te generations before the present the population grew suddenly to its current size. Coalescent simulations under the sudden expansion model require four parameters: n, S, te, and De, where De, the degree of the expansion event, is:
![]() |
|
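As a rough illustration of how coalescence times could be drawn under a sudden expansion, the Python sketch below assumes that De is the ratio of the current population size to the size before the expansion, and that time is scaled so that k lineages coalesce at rate k(k - 1)/2 in the current population. Both are assumptions of this sketch, as are the function and parameter names.

```python
import random


def coalescence_times_sudden(n, te, de, rng):
    """Times (looking backward) of the n - 1 coalescent events.

    te: time of the expansion in scaled units; de: assumed ratio of the
    current population size to the pre-expansion size.
    """
    times = []
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2
        if t < te:
            t_new = t + rng.expovariate(rate)
            if t_new > te:
                # No event before the expansion: by memorylessness, restart at te
                # with the faster rate of the smaller ancestral population.
                t_new = te + rng.expovariate(rate * de)
        else:
            t_new = t + rng.expovariate(rate * de)
        t = t_new
        times.append(t)
    return times


if __name__ == "__main__":
    print(coalescence_times_sudden(20, 0.1, 100.0, random.Random(3)))
```

With a large De, most events pile up just before te (further back in time), which produces the long external branches of a star-like genealogy.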
We also conducted some coalescent simulations assuming that the population follows the logistic model of growth. In this model
|
Coalescent simulations under the logistic model of growth were generated changing the times of the nodes according to the population size. These times are given by
|
Critical Values and the Power of the Tests
We determined the critical values of each statistical test from its empirical distribution. The power of each test, that is, the probability of rejecting the null hypothesis (constant size population) when the alternative hypothesis (population growth) is true, was estimated as the proportion of computer replicates generated under the alternative hypothesis for which the null hypothesis was rejected. For the analysis, we fixed a significance level of α = 0.05. Because the critical region for all alternative hypotheses consists of only one side of the distribution, we conducted one-tailed tests. Specifically, all analyzed statistics, except Ch, Che, and ku, had lower values under the population growth model.
Given that under the null hypothesis the empirical distribution of some statistics presented a reduced number of points (e.g., the distribution of D* statistic; see Results), the actual probability of rejecting the null hypothesis when it is true (i.e., the size of the test) could be lower than the nominal significance level of 0.05.
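The procedure for critical values and power can be sketched in Python as below, for a lower-tailed test. The function names are hypothetical, and the rule used for discrete distributions (choosing the largest cutoff whose cumulative null probability does not exceed the nominal level) is one reasonable reading of the text, not necessarily the exact rule used by the authors.

```python
from collections import Counter


def lower_critical_value(null_values, alpha=0.05):
    """Largest cutoff c with P(X <= c) <= alpha under the null, or None."""
    n = len(null_values)
    counts = Counter(null_values)
    cumulative = 0
    critical = None
    for value in sorted(counts):
        cumulative += counts[value]
        if cumulative / n > alpha:
            break
        critical = value
    return critical


def power(null_values, alt_values, alpha=0.05):
    """Fraction of alternative replicates falling in the lower critical region."""
    critical = lower_critical_value(null_values, alpha)
    if critical is None:
        return 0.0  # the null distribution is too coarse to reject at this level
    return sum(v <= critical for v in alt_values) / len(alt_values)


if __name__ == "__main__":
    null = list(range(100))
    alt = [0, 1, 2, 3, 50, 60, 70, 80, 90, 99]
    print(lower_critical_value(null), power(null, alt))
```

When the null distribution has few distinct values, the achieved size of the test falls below the nominal level, which is the effect described in the paragraph above.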
Results
The tests R3, R4, R3E, and R4E showed power similar to that of R2, and their results will not be presented here. Nevertheless, for some specific sets of parameters the R4 and R4E tests had slightly higher power than R2. In general, the statistical tests that use interspecific data had power similar to that of their equivalents using only intraspecific information (figures not shown).
Figure 1 shows the effect of Te, the time elapsed since the expansion event, on the statistical power of the different tests. R2 and Fu's FS are the most powerful tests: R2 is the most powerful for small sample sizes, whereas Fu's FS performs better for large samples. The power of Tajima's D and of Fu and Li's F* is lower than that of R2 and FS. The results also indicate that some commonly used tests based on the mismatch distribution, rg and MAE, are among the least powerful. All statistical tests show a peak in power at intermediate values of Te (Te ≈ 0.1); thus, a population expansion is unlikely to be detected when Te is too small or too large. This result agrees with those obtained by Simonsen, Churchill, and Aquadro (1995) and Fu (1997).
Logistic Population Growth Model
We also analyzed the power under a more realistic population growth scenario, the logistic population growth model. Using this model, we performed an exploratory analysis of the most relevant cases to validate the conclusions of our work. We found that assuming logistic population growth does not change the major conclusions. Even so, in comparison with the sudden growth model, the maximum power of the tests is reached at higher values of the elapsed time; for instance, for the parameter set used in Fu (1997) (r = 10, c = 1) the maximum power is at Ts ≈ 1.2. In general, as expected, (1) all statistical tests have less power under the logistic than under the sudden growth model, although the decrease in power is relatively uniform across tests, and (2) the larger the value of r, the more power the tests have.
Application to DNA Sequence Data
The present results have been applied to two published DNA data sets: the mtDNA variation analysis of a Turkish human population (Comas et al. 1996), and the survey of a human noncoding autosomal region (Alonso and Armour 2001). Comas et al. (1996) sequenced 360 base pairs of region I of the mtDNA D-loop in 45 individuals. From the mismatch distribution analysis, the authors suggested that the Turkish population had expanded recently. We determined the power of the different tests to identify which is the most powerful against population growth. For the total data (n = 45; S = 56), and considering De = 100 and Te = 0.4 (scaled in terms of N generations), most tests were powerful enough, and several of them could reject the null hypothesis of constant size. We also determined whether the tests could reject the null hypothesis for small sample sizes. To this end, we reanalyzed a subset of 10 randomly chosen sequences from the data of Comas et al. (1996). Table 1 shows the estimates of the power and the P values of some statistical tests. The results clearly illustrate that the constant size hypothesis can be rejected by the most powerful tests (Fu's FS and R2).
Discussion
We have shown that tests based on the mismatch distribution have little power against population growth. The MAE test is the least powerful; although rg is more powerful than MAE, it performs worse than nearly all class I and class II tests examined. The newly developed class III test, ku, although better than MAE and rg, is clearly inferior to the class I and class II tests.
On the other hand, several class I and class II tests can detect population expansion even for small De values. We have shown that two of the surveyed tests (R2 and FS) are the most powerful under a variety of conditions. These tests should therefore be chosen to test constant population size versus population growth. In particular, we suggest using the R2 test for small sample sizes and FS for large ones. Nevertheless, because the R2 and FS statistics use different kinds of information, discrepancies between these tests could provide information about the action of other evolutionary processes, for example, intragenic recombination (see below).
Fu (1997) studied the power of some statistics under the logistic model of population growth. He conducted coalescent simulations fixing θ (θ = 5, θ = 10) instead of fixing S. To check the behavior of R2 and of the mismatch-based statistics under these conditions, we performed some additional simulations conditional on θ. We found that R2 and FS are again the most powerful statistics (see an example in fig. 5). Interestingly, rg and MAE perform better when θ is fixed than when S is fixed.
Coalescent Simulations Conditional on the Number of Segregating Sites
The present power analyses were performed using coalescent simulations conditional on the number of segregating sites. Given that the actual value of θ is usually unknown, and that estimates of θ are usually obtained from DNA polymorphism data, the method seems to be appropriate (Hudson 1993; Depaulis, Mousset, and Veuille 2001; Wall and Hudson 2001). However, Markovtsova, Marjoram, and Tavaré (2001) correctly pointed out that the power of coalescent-based tests is not independent of θ and, therefore, that the statistical power might vary as a function of θ for a given n and S. To check this effect on R2, we performed a prospective analysis generating samples conditional on θ and S using the rejection algorithm of Tavaré et al. (1997). The results yield the same conclusions as those of Depaulis, Mousset, and Veuille (2001) and Wall and Hudson (2001), i.e., the fixed S method seems to be appropriate unless the actual value of θ is far from Watterson's (1975) estimate of θ.
Competing Alternative Hypotheses
It should be stressed that a significant result (a significant departure from the null hypothesis) should be interpreted cautiously: there are several possible alternative hypotheses to a single null hypothesis. Indeed, processes other than population expansion, such as genetic hitchhiking (Maynard Smith and Haigh 1974), could also produce similar genealogies (i.e., departures of the statistical tests in the same direction). Therefore, additional analyses may be necessary to discriminate between competing alternative hypotheses. For instance, because genetic hitchhiking in regions undergoing recombination will affect a relatively small fraction of the genome (close to the advantageous mutation), surveys of different gene regions across the genome could provide the opportunity to discriminate between population expansion and genetic hitchhiking (see Galtier, Depaulis, and Barton 2000).
To summarize, FS and R2 are the best statistical tests for detecting population growth. R2 performs better for small sample sizes, whereas FS performs better for larger sample sizes. Additionally, preliminary results indicate that R2 should be superior when intragenic recombination is considered. On the other hand, some popular statistics based on the mismatch distribution, rg and MAE, are very conservative.
Footnotes
Keywords: population growth, population expansion, coalescent simulations, neutrality tests
Address for correspondence and reprints: Julio Rozas, Departament de Genètica, Facultat de Biologia, Universitat de Barcelona, Diagonal 645, E-08071 Barcelona, Spain. E-mail: julio{at}bio.ub.es
References
Alonso S., J. A. L. Armour, 2001 A highly variable segment of human subterminal 16p reveals a history of population growth for modern humans outside Africa Proc. Natl. Acad. Sci. USA 98:864-869
Aris-Brosou S., L. Excoffier, 1996 The impact of population expansion and mutation rate heterogeneity on DNA sequence polymorphism Mol. Biol. Evol 13:494-504
Beaumont M. A., 1999 Detecting population expansion and decline using microsatellites Genetics 153:2013-2029
Bertorelle G., M. Slatkin, 1995 The number of segregating sites in expanding human populations, with implications for estimates of demographic parameters Mol. Biol. Evol 12:887-892
Braverman J. M., R. R. Hudson, N. L. Kaplan, C. H. Langley, W. Stephan, 1995 The hitchhiking effect on the site frequency spectrum of DNA polymorphism Genetics 140:783-796
Cann R. L., 2001 Genetic clues to dispersal in human populations: retracing the past from the present Science 291:1742-1748
Comas D., F. Calafell, E. Mateu, A. Pérez-Lezaun, J. Bertranpetit, 1996 Geographic variation in human mitochondrial DNA control region sequence: the population history of Turkey and its relationship to the European populations Mol. Biol. Evol 13:1067-1077
Depaulis F., S. Mousset, M. Veuille, 2001 Haplotype tests using coalescent simulations conditional on the number of segregating sites Mol. Biol. Evol 18:1136-1138
Donnelly P., S. Tavaré, 1995 Coalescents and genealogical structure under neutrality Annu. Rev. Genet 29:401-421
Eller E., H. C. Harpending, 1996 Simulations show that neither population expansion nor population stationarity in a West African population can be rejected Mol. Biol. Evol 13:1155-1157
Ewens W. J., 1972 The sampling theory of selectively neutral alleles Theor. Pop. Biol 3:87-112
Felsenstein J., 1992 Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates Genet. Res 59:139-147
Fu Y.-X., 1996 New statistical tests of neutrality for DNA samples from a population Genetics 143:557-570
Fu Y.-X., 1997 Statistical tests of neutrality against population growth, hitchhiking and background selection Genetics 147:915-925
Fu Y.-X., W.-H. Li, 1993 Statistical tests of neutrality of mutations Genetics 133:693-709
Fu Y.-X., W.-H. Li, 1999 Coalescing into the 21st century: an overview and prospects of coalescent theory Theor. Pop. Biol 56:1-10
Furlong R. F., J. F. Y. Brookfield, 2001 Inference of past population expansion from the timing of coalescence events in a gene genealogy J. Theor. Biol 209:75-86
Galtier N., F. Depaulis, N. H. Barton, 2000 Detecting bottlenecks and selective sweeps from DNA sequence polymorphism Genetics 155:981-987
Griffiths R. C., S. Tavaré, 1994 Sampling theory for neutral alleles in a varying environment Philos. Trans. R. Soc. Lond. B 344:403-410
Harpending H. C., 1994 Signature of ancient population growth in a low-resolution mitochondrial DNA mismatch distribution Hum. Biol 66:591-600
Harpending H. C., M. A. Batzer, M. Gurven, L. B. Jorde, A. R. Rogers, S. T. Sherry, 1998 Genetic traces of ancient demography Proc. Natl. Acad. Sci. USA 95:1961-1967
Harpending H. C., S. T. Sherry, A. R. Rogers, M. Stoneking, 1993 Genetic structure of ancient human populations Curr. Anthr 34:483-496
Hudson R. R., 1983 Properties of a neutral allele model with intragenic recombination Theor. Pop. Biol 23:183-201
Hudson R. R., 1990 Gene genealogies and the coalescent process Oxf. Surv. Evol. Biol 9:1-44
Hudson R. R., 1993 The how and why of generating gene genealogies Pp. 23-36 in N. Takahata and A. G. Clark, eds. Mechanisms of molecular evolution. Sinauer Associates, Inc., Sunderland, Mass
Jorde L. B., M. Bamshad, A. R. Rogers, 1998 Using mitochondrial and nuclear DNA markers to reconstruct human evolution Bioessays 20:126-136
Kimmel M., R. Chakraborty, J. P. King, M. Bamshad, W. S. Watkins, L. B. Jorde, 1998 Signatures of population expansion in microsatellite repeat data Genetics 148:1921-1930
King J. P., M. Kimmel, R. Chakraborty, 2000 A power analysis of microsatellite-based statistics for inferring past population growth Mol. Biol. Evol 17:1859-1868
Kingman J. F. C., 1982a. The coalescent Stoch. Proc. Applns 13:235-248
Kingman J. F. C., 1982b. On the genealogy of large populations J. Appl. Prob 19A:27-43
Kuhner M. K., J. Yamato, J. Felsenstein, 1998 Maximum likelihood estimation of population growth rates based on the coalescent Genetics 149:429-434
Markovtsova L., P. Marjoram, S. Tavaré, 2001 On a test of Depaulis and Veuille Mol. Biol. Evol 18:1132-1133
Maynard Smith J., J. Haigh, 1974 The hitch-hiking effect of a favourable gene Genet. Res 23:23-35
Nordborg M., 2001 Coalescent theory Pp. 179-212 in D. J. Balding, M. Bishop, and C. Cannings, eds. Handbook of statistical genetics. John Wiley and Sons, West Sussex, U.K
Press W. H., S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, 1992 Numerical recipes in C: the art of scientific computing. Cambridge University Press, New York
Reich D. E., M. W. Feldman, D. B. Goldstein, 1999 Statistical properties of two tests that use multilocus data sets to detect population expansions Mol. Biol. Evol 16:453-466
Reich D. E., D. B. Goldstein, 1998 Genetic evidence for a paleolithic human population expansion in Africa Proc. Natl. Acad. Sci. USA 95:8119-8123
Rogers A. R., 1995 Genetic evidence for a pleistocene population explosion Evolution 49:608-615
Rogers A. R., 1997 Population structure and modern human origins Pp. 55-79 in P. Donnelly and S. Tavaré, eds. Progress in population genetics and human evolution. Springer-Verlag, New York
Rogers A. R., A. E. Fraley, M. J. Bamshad, W. S. Watkins, L. B. Jorde, 1996 Mitochondrial mismatch analysis is insensitive to the mutational process Mol. Biol. Evol 13:895-902
Rogers A. R., H. C. Harpending, 1992 Population growth makes waves in the distribution of pairwise genetic differences Mol. Biol. Evol 9:552-569
Rozas J., R. Rozas, 1999 DnaSP version 3: an integrated program for molecular population genetics and molecular evolution analysis Bioinformatics 15:174-175
Rozas J., C. Segarra, G. Ribó, M. Aguadé, 1999 Molecular population genetics of the rp49 gene region in different chromosomal inversions of Drosophila subobscura Genetics 151:189-202
Simonsen K. L., G. A. Churchill, C. F. Aquadro, 1995 Properties of statistical tests of neutrality for DNA polymorphism data Genetics 141:413-429
Slatkin M., R. R. Hudson, 1991 Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations Genetics 129:555-562
Stephens M., 2001 Inference under the coalescent theory Pp. 213-238 in D. J. Balding, M. Bishop, and C. Cannings, eds. Handbook of statistical genetics. John Wiley and Sons, West Sussex, U.K
Tajima F., 1989a. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism Genetics 123:585-595
Tajima F., 1989b. The effect of change in population size on DNA polymorphism Genetics 123:597-601
Takahata N., 1996 Neutral theory of molecular evolution Curr. Opin. Genet. Dev 6:767-772
Tavaré S., D. Balding, R. C. Griffiths, P. Donnelly, 1997 Inferring coalescence times for molecular sequence data Genetics 145:505-518
Wall J. D., 1999 Recombination and the power of statistical tests of neutrality Genet. Res 74:65-79
Wall J. D., R. R. Hudson, 2001 Coalescent simulations and statistical tests of neutrality Mol. Biol. Evol 18:1134-1135
Watterson G. A., 1975 On the number of segregating sites in genetical models without recombination Theor. Popul. Biol 7:256-276
Weiss G., A. Von Haeseler, 1998 Inference of population history using a likelihood approach Genetics 149:1539-1546