Department of Mathematics
Biostatistics Division, Department of Preventive Medicine
Program in Molecular Biology, Department of Biological Sciences, University of Southern California
In a recent letter to this journal, Depaulis and Veuille (1998)
discussed two possible tests of neutrality, the "haplotype number test" and the "haplotype diversity test." They present in their tables 1 and 2
means and percentage points of the distribution of the number Kn of haplotypes and the sample heterozygosity Hn = 1 - , where are the relative frequencies of those haplotypes in a sample of size n for different values of the number of segregating sites s observed in the data. They assume a neutral infinitely-many-sites model of mutation with no recombination. These percentage points were found by repeatedly simulating a random coalescent tree with n tips, randomly distributing s mutations on the tree, and calculating the observed values of Kn and Hn. See Hudson (1990)
for a description of how such simulations can be performed.
|
|
Depaulis and Veuille (1998)
suggested that the percentage points of their statistics could be used as a test of neutrality: for a given sample size n and observed value of s, one compares the observed values of Kn and Hn with the given 95% credible intervals. Values falling outside those intervals lead to rejection of neutrality. Given that the true conditional distributions of Kn and Hn in fact depend on the unknown mutation rate
, it is likely that the power of this test varies dramatically as a function of
for given n and s. To assess this hypothesis, we simulated observations from the true joint conditional distribution using equation (1) and then estimated the probability that either statistic would fall outside the limits given by Depaulis and Veuille. This gave an empirical estimate of the probability that their test would reject neutrality for different values of
.
We used a Markov chain Monte Carlo (MCMC) approach to simulate observations from the conditional distribution of (, T) given Sn = s. MCMC methods produce correlated samples, but these samples may be made approximately independent by sampling the output at widely spaced intervals. The results below were generated using the approach in Markovtsova, Marjoram, and Tavaré (2000). An alternative approach is the rejection algorithm of Tavaré et al. (1997)
. Table 1
shows the fraction of 10,000 observations that fell inside the DV nominal 95% intervals for three different scenarios. As noted by Fu and Li (1993)
, Sn is not a sufficient statistic for
, so the fraction of observations that fall within the DV nominal 95% intervals varies greatly. It appears that if the true
value is well supported by the data, the test of neutrality based on the DV intervals will work well. However, if the true
value is not well supported by the data, the test will be inaccurate, leading to incorrect rejections of neutrality. For further discussion, see Wall and Hudson (2001).
As we have seen, the power of the DV test depends on the unknown parameter . Even if the mutation rate is known, the compound parameter
still depends on the underlying effective population size at the time of sampling, and this is not known in practice. Depaulis and Veuille (1998)
discuss a data set from the Su(H) locus in Drosophila melanogaster for which n = 20, s = 44. The observed values of K20 and H20 were 7 and 0.76, respectively. The nominal P values were estimated to be 0.011 and 0.017, respectively. We used the simulation approach outlined above to find empirical estimates of these P values for different values of
, using 10,000 simulated values once more. We used values of
= 1, 5, 12.4, 50.0, and 100 for illustration; the value 12.4 corresponds to Watterson's (1975)
segregating-sites estimator. Results are shown in table 2
.
|
Computer programs that implement both the MCMC approach and the rejection method to generate observations from the joint conditional distribution of, for example, (Kn, Hn) given the value of Sn for a given distribution for under a variety of demographic scenarios can be obtained from the authors.
Acknowledgements
We thank Y.-X. Fu and an anonymous reviewer for helpful comments. We were supported in part by NSF grant DBI95-04393 and NIH grant GM 58897.
Footnotes
1 Keywords: Markov chain Monte Carlo
coalescent
2 Address for correspondence and reprints: Simon Tavaré, Program in Molecular Biology, Department of Biological Sciences, SHS 172, University of Southern California, Los Angeles, California 90089-1340. stavare{at}gnome.usc.edu
literature cited
Depaulis, F., and M. Veuille. 1998. Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol. 15:17881790
Fu, Y. X., and W. H. Li. 1993. Maximum likelihood estimation of population parameters. Genetics 134:12611270
Griffiths, R. C. 1982. The number of alleles and segregating sites in a sample from the infinite-alleles model. Adv. Appl. Prob. 14:225239[ISI]
Griffiths, R. C., and S. Tavaré. 1996. Monte Carlo inference methods in population genetics. Math. Comput. Modelling 23:141158
Hudson, R. R. 1990. Gene genealogies and the coalescent process. Pp. 144 in D. Futuyma and J. Antonovics, eds. Oxford surveys in evolutionary biology. Vol. 7. Oxford University Press, Oxford, England
Markovtsova, L., P. Marjoram, and S. Tavaré. 2000. The effects of rate variation on ancestral inference in the coalescent. Genetics 156:14271436
Tavaré, S., D. Balding, R. C. Griffiths, and P. Donnelly. 1997. Inferring coalescence times for molecular sequence data. Genetics 145:505518
Wall, J. D., and R. R. Hudson. 2001. Coalescent simulations and statistical tests of neutrality. Mol. Biol. Evol. 18:11341135
Watterson, G. A. 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256276[ISI][Medline]