On a Test of Depaulis and Veuille

Lada Markovtsova, Paul Marjoram and Simon Tavare

Department of Mathematics
Biostatistics Division, Department of Preventive Medicine
Program in Molecular Biology, Department of Biological Sciences, University of Southern California

In a recent letter to this journal, Depaulis and Veuille (1998)Citation discussed two possible tests of neutrality, the "haplotype number test" and the "haplotype diversity test." They present in their tables 1 and 2 means and percentage points of the distribution of the number Kn of haplotypes and the sample heterozygosity Hn = 1 - , where are the relative frequencies of those haplotypes in a sample of size n for different values of the number of segregating sites s observed in the data. They assume a neutral infinitely-many-sites model of mutation with no recombination. These percentage points were found by repeatedly simulating a random coalescent tree with n tips, randomly distributing s mutations on the tree, and calculating the observed values of Kn and Hn. See Hudson (1990)Citation for a description of how such simulations can be performed.


View this table:
[in this window]
[in a new window]
 
Table 1 Fraction (%) of 10,000 Observations Falling Within the DV Interval

 
Depaulis and Veuille's (1998)Citation procedure produces observations whose distribution is independent of the underlying neutral mutation rate {theta}. The authors present the method as though the resulting simulated values had the conditional distribution of Kn and Hn given Sn = s, respectively. However, this is not true, as the following argument shows. Denote the coalescent tree by {Lambda} and the coalescence times by T = (Tn, ... , T2), so that Tj is the time for which there are j distinct ancestors in the tree {Lambda}. We see that for mutation rate {theta}, the conditional distribution of (Kn, Hn) given Sn = s can be represented as


where f{theta}({lambda}, t | s) is the joint conditional density of ({Lambda}, T) given Sn = s. Because of the Poisson nature of the mutation process, the first term under the integral signs in equation (1) does not depend on {theta}. It follows that in order to simulate observations from the joint conditional distribution of (Kn, Hn) given Sn = s, one first simulates from the conditional distribution of ({Lambda}, T) given Sn = s and then randomly distributes the s mutations over the resulting tree and calculates the values of Kn and Hn. Notice that Depaulis and Veuille's (1998)Citation procedure simulates from the unconditional distribution of ({Lambda}, T) instead of the conditional distribution of ({Lambda}, T) given Sn = s. The joint distribution of (Kn, Sn) in the case of constant population size is discussed by Griffths (1982)Citation , and that in the variable population size case is discussed by Griffths and Tavaré (1996)Citation .

Depaulis and Veuille (1998)Citation suggested that the percentage points of their statistics could be used as a test of neutrality: for a given sample size n and observed value of s, one compares the observed values of Kn and Hn with the given 95% credible intervals. Values falling outside those intervals lead to rejection of neutrality. Given that the true conditional distributions of Kn and Hn in fact depend on the unknown mutation rate {theta}, it is likely that the power of this test varies dramatically as a function of {theta} for given n and s. To assess this hypothesis, we simulated observations from the true joint conditional distribution using equation (1) and then estimated the probability that either statistic would fall outside the limits given by Depaulis and Veuille. This gave an empirical estimate of the probability that their test would reject neutrality for different values of {theta}.

We used a Markov chain Monte Carlo (MCMC) approach to simulate observations from the conditional distribution of ({Lambda}, T) given Sn = s. MCMC methods produce correlated samples, but these samples may be made approximately independent by sampling the output at widely spaced intervals. The results below were generated using the approach in Markovtsova, Marjoram, and Tavaré (2000). An alternative approach is the rejection algorithm of Tavaré et al. (1997)Citation . Table 1 shows the fraction of 10,000 observations that fell inside the DV nominal 95% intervals for three different scenarios. As noted by Fu and Li (1993)Citation , Sn is not a sufficient statistic for {theta}, so the fraction of observations that fall within the DV nominal 95% intervals varies greatly. It appears that if the true {theta} value is well supported by the data, the test of neutrality based on the DV intervals will work well. However, if the true {theta} value is not well supported by the data, the test will be inaccurate, leading to incorrect rejections of neutrality. For further discussion, see Wall and Hudson (2001).

As we have seen, the power of the DV test depends on the unknown parameter {theta}. Even if the mutation rate is known, the compound parameter {theta} still depends on the underlying effective population size at the time of sampling, and this is not known in practice. Depaulis and Veuille (1998)Citation discuss a data set from the Su(H) locus in Drosophila melanogaster for which n = 20, s = 44. The observed values of K20 and H20 were 7 and 0.76, respectively. The nominal P values were estimated to be 0.011 and 0.017, respectively. We used the simulation approach outlined above to find empirical estimates of these P values for different values of {theta}, using 10,000 simulated values once more. We used values of {theta} = 1, 5, 12.4, 50.0, and 100 for illustration; the value 12.4 corresponds to Watterson's (1975)Citation segregating-sites estimator. Results are shown in table 2 .


View this table:
[in this window]
[in a new window]
 
Table 2 P Values for Drosophila melanogaster Data

 
From table 2 , we see that for {theta} in the range 12.4 or larger, the data are highly unlikely under a neutral scenario. However, for a range of smaller {theta} values, including, for example, {theta} = 5, the data become much more likely. Thus, the ability to reject the supposed neutral scenario depends on the true value of {theta}. It should be noted that one cannot reject neutrality on the basis of this test; rather, one can reject the model upon which the test is based. As Depaulis and Veuille (1998)Citation note, this model does not include recombination (although it is easy to alter it to do so); neither does it include any population demographics such as stratification. If the model is rejected, any of these missing factors could be the cause, not necessarily the assumption of neutrality.

Computer programs that implement both the MCMC approach and the rejection method to generate observations from the joint conditional distribution of, for example, (Kn, Hn) given the value of Sn for a given distribution for {theta} under a variety of demographic scenarios can be obtained from the authors.

Acknowledgements

We thank Y.-X. Fu and an anonymous reviewer for helpful comments. We were supported in part by NSF grant DBI95-04393 and NIH grant GM 58897.

Footnotes

Yun-Xin Fu, Reviewing Editor

1 Keywords: Markov chain Monte Carlo coalescent Back

2 Address for correspondence and reprints: Simon Tavaré, Program in Molecular Biology, Department of Biological Sciences, SHS 172, University of Southern California, Los Angeles, California 90089-1340. stavare{at}gnome.usc.edu Back

literature cited

    Depaulis, F., and M. Veuille. 1998. Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol. 15:1788–1790[Free Full Text]

    Fu, Y. X., and W. H. Li. 1993. Maximum likelihood estimation of population parameters. Genetics 134:1261–1270

    Griffiths, R. C. 1982. The number of alleles and segregating sites in a sample from the infinite-alleles model. Adv. Appl. Prob. 14:225–239[ISI]

    Griffiths, R. C., and S. Tavaré. 1996. Monte Carlo inference methods in population genetics. Math. Comput. Modelling 23:141–158

    Hudson, R. R. 1990. Gene genealogies and the coalescent process. Pp. 1–44 in D. Futuyma and J. Antonovics, eds. Oxford surveys in evolutionary biology. Vol. 7. Oxford University Press, Oxford, England

    Markovtsova, L., P. Marjoram, and S. Tavaré. 2000. The effects of rate variation on ancestral inference in the coalescent. Genetics 156:1427–1436

    Tavaré, S., D. Balding, R. C. Griffiths, and P. Donnelly. 1997. Inferring coalescence times for molecular sequence data. Genetics 145:505–518

    Wall, J. D., and R. R. Hudson. 2001. Coalescent simulations and statistical tests of neutrality. Mol. Biol. Evol. 18:1134–1135[Free Full Text]

    Watterson, G. A. 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256–276[ISI][Medline]

Accepted for publication January 29, 2001.