*Department of Statistics, Rice University;
and
Human Genetics Center, University of Texas Health Science Center, Houston, Texas
Introduction
More recently, several statistics were developed for detection of past population growth based on short tandem repeat DNA sequences, also referred to as microsatellite loci (Di Rienzo et al. 1998; Kimmel et al. 1998; Reich and Goldstein 1998; Reich, Feldman, and Goldstein 1999). Some of their properties were investigated in the original communications. However, a systematic comparison has yet to be carried out. Microsatellites are an important source of information about the phylogenetic relationships among human populations (Bowcock et al. 1994; Deka et al. 1995; Mountain and Cavalli-Sforza 1997). Also, they were found to be of crucial importance in gene mapping (Dib et al. 1996) and estimation of individuals' relatedness and identity (Chakraborty and Jin 1993). The estimates vary, but there exist at least several tens of thousands of these loci throughout the human genome. Most of them are considered neutral, but some are associated with coding sequences, and some are known to cause disease in individuals carrying highly expanded alleles (reviewed by Rubinsztein 1999).
The purpose of this paper is to present results concerning the power to detect past population growth using three microsatellite-based statistics available in the current literature: (1) the interlocus g statistic (Reich and Goldstein 1998), (2) the within-locus k statistic (Reich and Goldstein 1998), and (3) the imbalance index β (Kimmel et al. 1998). The analysis was based on the single-step stepwise mutation model (SSMM). The power of the statistics was evaluated under both constant and variable mutation rates. The latter case is important because it is standard procedure to pool data collected at a number of microsatellite loci, and mutation rates at these loci are known to be variable (Chakraborty et al. 1997; American Association of Blood Banks 1998; Brinkmann et al. 1998; Deka et al. 1999a). These power studies are preceded by a brief summary of relevant theoretical concepts and definitions of the indices. The effects of variations in the mutation process, particularly the presence of multistep mutations, on the power of the statistics are also presented.
Microsatellite Evolution Under Demographic Change
The evolution of a microsatellite locus is shaped by demography, as well as by stepwise mutation. This is because genetic drift (the loss of alleles through random sampling of genotypes of new individuals from the gamete pool) acts with strength inversely proportional to the effective population size. This effect determines the distribution of the branch lengths in the genealogy of the sampled locus. For a sample of n individuals, the genealogy may be partitioned according to the times Tk, k = 2, 3, ..., n, where Tk denotes the time for which the sample represents k distinct lineages (see fig. 1). In the case of constant population size, it is known that the coalescence times Tk are independent, exponentially distributed random variables, each with parameter $\binom{k}{2}/(2N)$, where N is the number of diploid individuals in the population.
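The constant-size case can be sketched in a few lines of Python. This is only an illustration, not the authors' simulation code: the function name, the seed, and the value N = 10,000 are assumptions. It draws each Tk from an exponential distribution with parameter C(k, 2)/(2N) and compares the average time to the most recent common ancestor with the sum of the expected values, 4N(1 - 1/n).

```python
import numpy as np

def sample_coalescence_times(n, N, rng):
    """Draw T_n, ..., T_2 (in generations) for a constant population of N diploids."""
    times = {}
    for k in range(n, 1, -1):
        rate = (k * (k - 1) / 2) / (2 * N)      # C(k, 2) / (2N)
        times[k] = rng.exponential(1.0 / rate)  # NumPy takes the mean (scale), not the rate
    return times

rng = np.random.default_rng(1)
n, N = 40, 10_000                               # N is a hypothetical value
tmrca = [sum(sample_coalescence_times(n, N, rng).values()) for _ in range(2000)]
expected_tmrca = 4 * N * (1 - 1 / n)            # sum over k of 2N / C(k, 2)
print(np.mean(tmrca), expected_tmrca)
```

Most of the expected depth comes from T2, whose mean is 2N generations. This is why mutations on the deepest branches weigh so heavily on allele size variance in a constant population.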
| (1) |
As a result, the distributions of the coalescence times Tk in a population of variable size will be distorted relative to their counterparts in a constant population.
The growth pattern of interest here is rapid expansion. Consider, for example, a population originally of size N0 which undergoes a stepwise expansion te generations in the past to its current size N, where N >> N0. Looking backward in time, this demographic expansion corresponds to a sudden increase in the coalescence intensity from 1/(2N) to 1/(2N0). The effect of this change on the genealogy of a sample of n chromosomes depends on the time since the expansion. If the expansion event is very ancient, even the coalescence times closest to the root will reflect the current population size 2N. However, if the growth is sufficiently recent, the lineages in the genealogy at the time of expansion will be subject to the pre-expansion coalescence intensity 1/(2N0), and expected coalescence times for these lineages will be much shorter. With high probability, then, the most recent common ancestor of the sample is found close to time te.
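A minimal sketch of the stepwise scenario, assuming the coalescence intensity per pair of lineages is 1/(2N) more recently than te and 1/(2N0) before it. The function name and all numerical values are hypothetical. Because exponential waiting times are memoryless, a waiting time that would cross te is restarted at te with the pre-expansion intensity.

```python
import numpy as np

def stepwise_coalescence_times(n, N, N0, te, rng):
    """Return [T_n, ..., T_2] for a population of size N0 that jumped to N at time te."""
    t, times = 0.0, []
    for k in range(n, 1, -1):
        pairs = k * (k - 1) / 2
        start = t
        if t < te:
            t += rng.exponential(2 * N / pairs)           # post-expansion intensity
            if t > te:
                # No coalescence before te: restart at te with the pre-expansion intensity.
                t = te + rng.exponential(2 * N0 / pairs)
        else:
            t += rng.exponential(2 * N0 / pairs)
        times.append(t - start)
    return times

rng = np.random.default_rng(2)
n, N, N0, te = 40, 100_000, 1_000, 500                     # hypothetical values
mean_tmrca = np.mean([sum(stepwise_coalescence_times(n, N, N0, te, rng))
                      for _ in range(200)])
print(mean_tmrca, 4 * N * (1 - 1 / n))                     # expansion vs. constant size N
```

With these values the simulated genealogies are far shallower than in a constant population of size N, which is the compression of the deepest branches described above.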
Figure 2b depicts a genealogy sampled from a stepwise expansion scenario. The mutation rate and final population size are the same as those used to simulate the tree in figure 2a, so that coalescence times and the number of mutations in the two scenarios are identically distributed over the time period t = 0 to t = te. In figure 2b, the lineages which remain at time te coalesce quickly, and mutation has little time to act on the deepest branches of the genealogy. Relative to the constant-population scenario, observed allelic variation depends more on mutation events which occur near the tips of the tree, and observed allele sizes correspond more closely to independent, identically distributed random variables. These features are apparent in figure 3b, which shows three simulated allele size distributions based on samples of size n = 40 from the expansion scenario just discussed. In comparison with figure 3a, these distributions are less variable and more clearly unimodal.
The three tests for demographic expansion discussed in this paper represent three attempts to detect these qualitative and quantitative differences between samples of microsatellite alleles from constant and recently expanded populations.
Indices of Demographic Expansion
| (2) |
Under a generalized stepwise mutation model (i.e., a model in which U, the allele size change in a single mutation event, is an arbitrary integer-valued random variable), mutation-drift equilibrium is mathematically predicted (Kimmel and Chakraborty 1996). Under the SSMM (U restricted to ±1), both V(t) and P0(t) converge to limits which depend only on the composite parameter θ = 4Nν, where ν is the mutation rate per locus per generation:

$$V(\infty) = \theta, \qquad P_0(\infty) = \frac{1}{\sqrt{1 + 2\theta}}. \tag{4}$$

The imbalance index (Kimmel et al. 1998) is based on comparison of two estimators of θ derived from these limiting values:

$$\hat\theta_V = \hat V, \qquad \hat\theta_{P_0} = \frac{1}{2}\left(\frac{1}{\hat P_0^{2}} - 1\right), \tag{5}$$

called the (allele size) variance estimator of θ and the homozygosity estimator of θ, respectively. At mutation-drift equilibrium,

$$\beta(t) = \frac{V(t)}{\tfrac{1}{2}\left(1/P_0(t)^{2} - 1\right)} = 1. \tag{6}$$
Deviation of β(t) from 1 at a microsatellite locus is a signature of disequilibrium or departure from the SSMM. In particular, populations which have recently expanded exhibit values of β(t) less than 1. The stepwise expansion scenario discussed above illustrates this effect. After expansion, both V(t) and the heterozygosity 1 - P0(t) increase as the population approaches a new mutation-drift equilibrium. If the expansion is sufficiently recent, typical sample genealogies resemble the one in figure 2b, with mutation events largely restricted to the postexpansion epoch and distributed among relatively many lineages. Such mutations are effectively placed to generate heterozygosity, but not allele size variance, since the latter quantity is more sensitive to mutations which distinguish the oldest lineages. As a result, heterozygosity approaches its new limit value faster than genetic variance, causing the index β(t) to deviate downward from 1.
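The single-locus computation can be sketched as follows. The sketch rests on assumptions that should be checked against the original definitions: V is taken to be the expected squared size difference between two chromosomes (twice the allele size variance), and the SSMM equilibrium limits are taken to be V = θ and P0 = 1/√(1 + 2θ) with θ = 4Nν. The function name and the toy sample are hypothetical.

```python
import numpy as np

def theta_estimates(allele_sizes):
    """Return (variance estimator, homozygosity estimator) of theta for one locus."""
    x = np.asarray(allele_sizes, dtype=float)
    n = x.size
    # Mean squared difference over pairs of distinct chromosomes = 2 * unbiased variance.
    v_hat = 2.0 * x.var(ddof=1)
    # Probability that two distinct sampled chromosomes carry the same allele.
    _, counts = np.unique(x, return_counts=True)
    p0_hat = (n * np.sum((counts / n) ** 2) - 1.0) / (n - 1.0)
    theta_v = v_hat                                  # from the assumed limit V = theta
    theta_p0 = (1.0 / p0_hat ** 2 - 1.0) / 2.0       # from P0 = 1 / sqrt(1 + 2 theta)
    return theta_v, theta_p0                         # undefined if no allele repeats (p0_hat = 0)

sizes = [10] * 20 + [11] * 12 + [12] * 8             # toy sample of n = 40 chromosomes
theta_v, theta_p0 = theta_estimates(sizes)
beta_hat = theta_v / theta_p0
print(theta_v, theta_p0, beta_hat)
```

For this toy sample the ratio falls below 1, the direction associated with recent expansion, although a single locus carries little evidence on its own.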
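As is the case with any ratio-type index, β(t) can be estimated in several ways. Here, we consider two estimators of ln β(t) which yield normal-like distributions. Let L be the number of microsatellite loci in the sample. The first estimator is the log ratio of means, used by Kimmel et al. (1998) and given by

$$\ln\hat\beta_1 = \ln\bar{\theta}_V - \ln\bar{\theta}_{P_0}, \tag{7}$$

where $\bar{\theta}_V$ and $\bar{\theta}_{P_0}$ denote the averages over the L loci of the single-locus variance and homozygosity estimators.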
| (8) |
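The two multilocus estimators can be sketched as below. The first follows the "log ratio of means" description in the text. The second is written as the mean of the single-locus log ratios, which is an assumption consistent with its later description as a within-locus statistic rather than a formula confirmed by the text. The function name and inputs are hypothetical.

```python
import numpy as np

def ln_beta_estimators(theta_v, theta_p0):
    """theta_v, theta_p0: positive single-locus estimates, one entry per locus."""
    theta_v = np.asarray(theta_v, dtype=float)
    theta_p0 = np.asarray(theta_p0, dtype=float)
    # Log ratio of means: loci are pooled in the numerator and, separately, in the denominator.
    ln_beta1 = np.log(theta_v.mean()) - np.log(theta_p0.mean())
    # Mean of log ratios (assumed form): each locus contributes its own ratio.
    ln_beta2 = np.mean(np.log(theta_v) - np.log(theta_p0))
    return ln_beta1, ln_beta2

ln_b1, ln_b2 = ln_beta_estimators([1.0, 4.0], [2.0, 2.0])
print(ln_b1, ln_b2)
```

In this two-locus toy input, one locus with a large variance estimate dominates the pooled ratio but not the averaged log ratios. This is the kind of difference that matters when mutation rates vary across loci.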
Between-Locus Variability and the g Statistic
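Introduced by Reich and Goldstein (1998) (also see Di Rienzo et al. 1998), the g statistic is defined by the following ratio:

$$g = \frac{\text{observed variance of } V \text{ across loci}}{\text{predicted variance of } V}. \tag{9}$$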
The predicted variance of V is based on the mutation-drift equilibrium expression derived by Roe (1992), Zhivotovsky and Feldman (1995), and, in a more general setting, by Kimmel and Chakraborty (1996):
| (10) |
For a sample of L loci, this quantity can be estimated by substituting the average of the L single-locus estimates of V into the equilibrium expression. The observed variance of V is simply the sample variance of these same single-locus estimates. The statistical properties of g were studied extensively by Reich, Feldman, and Goldstein (1999).
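A sketch of the g computation. The equilibrium expression used here, Var(V) = (4/3)V² + (1/6)V with V the allele size variance at a locus, is the form commonly quoted from Zhivotovsky and Feldman (1995) and Reich and Goldstein (1998). It is an assumption of this sketch and should be checked against the displayed equation. The function name and the three-locus input are hypothetical.

```python
import numpy as np

def g_statistic(per_locus_variances):
    """Ratio of observed to predicted interlocus variance of the allele size variance."""
    v = np.asarray(per_locus_variances, dtype=float)
    observed = v.var(ddof=1)                               # sample variance across loci
    v_bar = v.mean()                                       # average single-locus estimate
    predicted = (4.0 / 3.0) * v_bar ** 2 + v_bar / 6.0     # assumed equilibrium expression
    return observed / predicted

g = g_statistic([1.0, 2.0, 3.0])
print(g)
```

Values of g well below 1 correspond to the reduced interlocus variability expected after expansion.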
The behavior of the g statistic can be illustrated by considering the genealogical effect of expansion. Mutations in the deepest branches of a genealogy strongly influence the allele size variance estimated by V. In a population of constant size, the high interlocus variability of these branch lengths inflates the variance of V across loci. In expanded populations, however, mutation has little opportunity to act on the oldest lineages. As a result, Var(V) is decreased (compare fig. 3a and b), and we expect g < 1.
Shape of the Allele Size Distribution and the Within-Locus k Test
Developed by Reich and Goldstein (1998), the within-locus k test is based on the observation that allele size distributions in expanding populations tend to be unimodal, more peaked than the ragged distributions associated with constant populations (compare fig. 3a and b). The test statistic is
| (11) |
Simulation Studies
In the simulation studies which follow, a single replicate consists of a sample of n = 40 individuals, each typed at L = 30 unlinked microsatellite loci. A replicate corresponds to a data set of modest size, and all four tests would perform at least as well with more data. Reich, Feldman, and Goldstein (1999) have studied the effect of increasing both n and L on the g and k statistics. For each replicate, we used the coalescent to generate 30 independent genealogies and applied single-step mutations according to a Poisson process. (The rate of the mutation process was either fixed at ν = 5 × 10⁻⁴ or sampled from a lognormal distribution with mean 5 × 10⁻⁴; details are given below.) From the resulting 30 allele size distributions we computed the test statistics ln β1, ln β2, g, and k.
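One locus of such a replicate can be sketched as follows, for the constant-size null only. This is an illustrative reimplementation under stated assumptions, not the authors' program: the function name, the seed, and the value N = 10,000 are hypothetical, and each mutation is assumed to add or remove one repeat with equal probability.

```python
import numpy as np

def simulate_locus(n, N, nu, rng):
    """Allele sizes (relative to the root allele) of n chromosomes at one locus."""
    node_time = [0.0] * n                 # tips are sampled at time 0
    parent = [-1] * n
    active = list(range(n))
    t = 0.0
    while len(active) > 1:                # build the genealogy backward in time
        k = len(active)
        t += rng.exponential(2 * N / (k * (k - 1) / 2))
        i, j = rng.choice(k, size=2, replace=False)
        a, b = active[i], active[j]
        new = len(node_time)              # index of the new ancestral node
        node_time.append(t)
        parent.append(-1)
        parent[a] = parent[b] = new
        active = [x for x in active if x not in (a, b)] + [new]
    size = [0] * len(node_time)           # the root (last node) has size 0
    for node in range(len(node_time) - 2, -1, -1):    # parents always have larger indices
        length = node_time[parent[node]] - node_time[node]
        m = rng.poisson(nu * length)                   # mutations on this branch
        size[node] = size[parent[node]] + 2 * rng.binomial(m, 0.5) - m
    return np.array(size[:n])

rng = np.random.default_rng(0)
alleles = simulate_locus(40, 10_000, 5e-4, rng)        # N = 10,000 is a hypothetical value
print(np.unique(alleles, return_counts=True))
```

Calling the function 30 times with independent draws gives the 30 unlinked loci of one replicate. A variable-size genealogy would require replacing the waiting-time line with the appropriate time-dependent intensity.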
We simulated 1,000 replicates for each null hypothesis. Critical values for the tests based on β and g were determined by the 0.05 quantile of the resulting empirical distributions of each test statistic. The cutoff for the kurtosis-based k test was taken from the binomial distribution with n = 30 and p = 0.515. We chose to reject the null hypothesis if 11 or fewer loci yielded positive k values, resulting in a significance level of approximately 0.074, slightly higher than the type I error rate of the other three tests.
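The significance level of the k test can be checked directly from the binomial distribution. The sketch below uses only the quantities given in the text (30 loci, probability 0.515 of a positive k, rejection at 11 or fewer positives). The helper name is an assumption.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(x + 1))

alpha_k = binom_cdf(11, 30, 0.515)    # probability of 11 or fewer positive k values
print(round(alpha_k, 3))
```

Because the binomial distribution is discrete, an exact 0.05 level is not attainable, so the cutoff whose tail probability falls closest to the desired level has to be chosen.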
For each alternative hypothesis, we simulated 1,000 replicates, yielding 1,000 realizations of each test statistic. The power of each test was estimated by the proportion of these 1,000 realizations that were less than or equal to the critical value.
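The critical-value and power calculation for the β and g tests amounts to an empirical quantile followed by a proportion. In the sketch below the function names are hypothetical, and the normal draws are synthetic stand-ins for simulated test statistics, not results from this study.

```python
import numpy as np

def critical_value(null_stats, alpha=0.05):
    """Lower-tail critical value: the alpha quantile of the simulated null distribution."""
    return np.quantile(null_stats, alpha)

def estimated_power(alt_stats, crit):
    """Proportion of alternative-hypothesis realizations at or below the critical value."""
    return float(np.mean(np.asarray(alt_stats) <= crit))

rng = np.random.default_rng(3)
null_stats = rng.normal(0.0, 1.0, size=1000)     # synthetic placeholder values
alt_stats = rng.normal(-1.5, 1.0, size=1000)     # synthetic placeholder values
crit = critical_value(null_stats)
print(crit, estimated_power(alt_stats, crit))
```

With 1,000 realizations, the standard error of a power estimate near 0.5 is about 0.016, so differences of a few percentage points between tests are within simulation noise.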
Power to Detect Stepwise Expansion
Let N(t) denote the size of the population t generations in the past (so that N(0) gives the population size at the time of sampling), and consider the null hypothesis H0: N(t) = N for all t ≥ 0. Using the procedure just described, we estimated the power of the four statistical tests to detect alternatives of the form
$$N(t) = \begin{cases} N, & 0 \le t < t_e, \\ N_0, & t \ge t_e, \end{cases} \qquad N \gg N_0.$$
Power estimates for these scenarios are shown in figure 6. In panel a, ν is sampled from a lognormal distribution with mean µ = 5 × 10⁻⁴ and standard deviation σ = 1 × 10⁻⁴. The interquartile range for this distribution is (4.3 × 10⁻⁴, 5.6 × 10⁻⁴). Panel b corresponds to µ = 5 × 10⁻⁴, σ = 3 × 10⁻⁴, and interquartile range (2.9 × 10⁻⁴, 6.2 × 10⁻⁴), while panel c represents µ = 5 × 10⁻⁴, σ = 6 × 10⁻⁴, and interquartile range (1.7 × 10⁻⁴, 6.1 × 10⁻⁴).
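The quoted interquartile ranges are consistent with reading the mean and standard deviation as moments of the lognormal distribution itself, not of its logarithm. The sketch below converts those moments to log-scale parameters and recomputes the quartiles. The function names are assumptions.

```python
import math
from statistics import NormalDist

def lognormal_params(mean, sd):
    """Log-scale location and scale of a lognormal with the given mean and sd."""
    s2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - s2 / 2.0, math.sqrt(s2)

def interquartile_range(mean, sd):
    m, s = lognormal_params(mean, sd)
    z = NormalDist().inv_cdf(0.75)        # upper quartile of the standard normal
    return math.exp(m - z * s), math.exp(m + z * s)

for sd in (1e-4, 3e-4, 6e-4):             # panels a, b, and c
    print(sd, interquartile_range(5e-4, sd))
```

To draw locus-specific rates with NumPy, the two values returned by `lognormal_params` can be passed as the `mean` and `sigma` arguments of `Generator.lognormal`, which expects log-scale parameters.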
Figure 7 plots the estimated quantiles of ln β1, ln β2, and ln g as a function of the probability p of a two-step mutation. As p ranges from 0 to 0.2, the quantiles of ln g are roughly constant, while the quantiles of the statistics based on β increase with p. This effect is more pronounced for larger mutation step sizes, as shown for 10-step mutations in figure 8. In this case, even the g statistic is affected, and again the effect is conservative for all three statistics. Table 2 presents estimated 0.05 quantiles for ln β1, ln β2, and ln g as a function of … and the probability p of a 10-step mutation. Reich, Feldman, and Goldstein (1999) have shown that the k test is nonconservative in the presence of multistep mutations.
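A multistep mutation process of the kind examined here can be sketched as a two-point mixture of step sizes. The sketch assumes that each mutation is a jump of `big_step` repeats with probability p and a single-repeat change otherwise, and that expansions and contractions are equally likely. Both function names are hypothetical, and the symmetry is an assumption of this sketch.

```python
import numpy as np

def net_size_change(m, p, big_step, rng):
    """Net change in repeat number after m mutations under the mixture model."""
    magnitudes = np.where(rng.random(m) < p, big_step, 1)
    signs = rng.choice([-1, 1], size=m)
    return int(np.sum(magnitudes * signs))

def step_second_moment(p, big_step):
    """E[U^2] for one mutation: (1 - p) * 1^2 + p * big_step^2."""
    return (1 - p) * 1 + p * big_step ** 2

rng = np.random.default_rng(4)
print(net_size_change(25, 0.2, 10, rng))
print(step_second_moment(0.2, 2), step_second_moment(0.2, 10))
```

At p = 0.2 the second moment of the step distribution is 1.6 for two-step mutations but 20.8 for 10-step mutations. The contrast suggests why large steps would disturb variance-based statistics much more than two-step changes.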
Discussion
When the mutation rate is fixed, statistics based on the imbalance index β are more sensitive to population increase than the g and k statistics, and they maintain their sensitivity over a longer time interval. The power estimates for ln β1 and ln β2 nearly coincide. For 100-fold stepwise growth (fig. 4c), all three tests achieve high power over a range of times which are relevant to the evolution of early human populations. In all three stepwise scenarios, the β statistics are more responsive than g and k to expansions of recent origin. As the expansion time becomes more remote, tests based on β maintain power comparable or superior to that of the g and k tests.
The exponential expansion scenario (fig. 5) yields results similar to those of the stepwise case, with the following exception. The expansion signal appears later and persists longer in all four tests. Both facts are related to the gradual nature of exponential versus stepwise growth. In the stepwise scenario, coalescence times begin to reflect the final population size N immediately after the expansion event. If the growth is exponential, however, the coalescence process is governed by values of N(t) << N for some time after the expansion, and the signal arises more slowly. This fact also accounts for the persistence of the signal under exponential expansion. Stepwise growth goes undetected if $\sum_{k=2}^{n} T_k < t_e$, that is, if the most recent common ancestor of the sample postdates the expansion. Under the analogous exponential expansion (with the same values for N0, N, and te), the most remote coalescence times will reflect values of N(t) only slightly greater than N0, corresponding to the early stages of growth. Sampled genealogies will be qualitatively similar to the one in figure 2b, and the expansion will be detectable with positive probability.
It should be stressed that in figure 5, increasing values of te correspond to smaller exponential growth rates. This fact explains the eventual loss of power in all tests. If we fix the growth rate (so that the final population size increases with te), we would expect the expansion signal to persist for all time. It can be shown, for example, that under exponential growth, the imbalance index β(t) → 1/… as t → ∞ (unpublished data).
When the mutation rate varies across loci, the power to detect expansion also depends on how each statistic incorporates information from multiple loci. If the variation is relatively small, the results are similar to those for the case of constant mutation rate (compare figs. 6a and 4b). As Var(ν) increases, tests based on ln β1 and g lose sensitivity, while the power of tests based on ln β2 and k is maintained. In the terminology of Reich and Goldstein (1998), the latter two quantities are "within-locus" statistics: when L loci are sampled, they combine L single-locus estimates, each of which is informative provided ν is not too small. The tests which suffer, however, depend on the "interlocus" statistics ln β1 and g, both of which combine multilocus information in the numerator and separately in the denominator.
It should be noted that the power analyses performed in the present work were under the assumptions of the SSMM. Kimmel and Chakraborty (1996) showed that when individual mutations may result in multistep repeat size changes, at mutation-drift equilibrium (see eq. 4), the within-population allele size variance has the form V(∞) = 4Nνψ''(1), where ψ''(1) is the second moment of the distribution of repeat size changes produced by individual mutations. An analogous closed-form expression for the homozygosity, P0(∞), does not exist under a generalized stepwise mutation model (Kimmel and Chakraborty 1996). Therefore, although any contraction-expansion bias of mutations does not affect V(∞) or P0(∞), the sampling properties of all of the statistics considered here are only approximate if mutation events do not obey the SSMM. Empirical data on mutations, although sparse at present, show a reasonable degree of agreement with the SSMM assumption. For example, nearly 84% of the mutation events encountered in parental testing laboratories (which use tri- and tetranucleotide repeat loci) involve only single-repeat size changes (American Association of Blood Banks 1998). Likewise, Brinkmann et al. (1998) reported that 22 of 23 mutation events observed during parentage testing in Germany were single-step changes, with the remaining event being a double-step change. Even for loci with considerably higher mutation rates, the SSMM approximation may be reasonable. For example, Deka et al. (1999b) showed that for the gene-associated CAG repeat locus ERDA1, where the mutation rate is over 6%, only 11 of 46 observed mutation events were multistep. Our analysis of the impact of multistep allele size changes indicates that a substantial fraction of such mutations is required to significantly reduce the power of the β and g tests.
In summary, the ln β2 test is the most sensitive to population expansion over all the scenarios considered here. This sensitivity may not always be a virtue. When only ancient, massive expansions are of interest, the g or k statistic may be preferred.
Footnotes
1 Keywords: stepwise mutation model, coalescence, population expansion, microsatellites, repeat DNA, indices of imbalance
2 Address for correspondence and reprints: Ranajit Chakraborty, Human Genetics Center, University of Texas Health Science Center, P.O. Box 20334, Houston, Texas 77225. E-mail: rc{at}hgc9.sph.uth.tmc.edu
Literature Cited
American Association of Blood Banks. 1998. Annual report of the American Association of Blood Banks. Arlington, Va
Bowcock, A. M., R.-A. Linares, J. Tomfohrde, E. Minch, J. R. Kidd, and L. L. Cavalli-Sforza. 1994. High resolution of human evolutionary trees with polymorphic microsatellites. Nature 368:455–457
Brinkmann, B., M. Klintschar, F. Neuhuber, J. Huhne, and B. Rolf. 1998. Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat. Am. J. Hum. Genet. 62:1408–1415
Chakraborty, R., and L. Jin. 1993. A unified approach to study hypervariable polymorphisms: statistical considerations of determining relatedness and population distances. Pp. 153–175 in S. D. J. Pena, R. Chakraborty, J. T. Epplen, and A. J. Jeffreys, eds. DNA fingerprinting: state of the science. Birkhäuser, Basel, Switzerland
Chakraborty, R., M. Kimmel, D. N. Stivers, L. J. Davison, and R. Deka. 1997. Relative mutation rates at di-, tri-, and tetranucleotide microsatellite loci. Proc. Natl. Acad. Sci. USA 94:1041–1046
Deka, R., M. D. Shriver, L. M. Yu, R. E. Ferrell, and R. Chakraborty. 1995. Intra- and interpopulation diversity at short tandem repeat loci in diverse populations of the world. Electrophoresis 16:1659–1664
Deka, R., S. Guangyun, D. Smelser, Y. Zhong, M. Kimmel, and R. Chakraborty. 1999a. Rate and directionality of mutations and effects of allele size constraints at anonymous, gene-associated, and disease-causing trinucleotide loci. Mol. Biol. Evol. 16:1166–1177
Deka, R., S. Guangyun, J. Wiest, D. Smelser, S. Chunhua, Y. Zhong, and R. Chakraborty. 1999b. Patterns of instability of expanded CAG repeats at the ERDA1 locus in general populations. Am. J. Hum. Genet. 65:192–199
Dib, C., S. Fauré, C. Fizames et al. (14 co-authors). 1996. A comprehensive map of the human genome based on 5624 microsatellites. Nature 380:152–154
Di Rienzo, A., P. Donnelly, C. Toomajian, B. Sisk, A. Hill, M. L. Petzl-Erler, G. K. Haines, and D. H. Barch. 1998. Heterogeneity of microsatellite mutations within and between loci, and implications for human demographic histories. Genetics 148:1269–1284
Ewens, W. J. 1979. Mathematical population genetics. Springer, New York
Griffiths, R. C., and S. Tavaré. 1994. Sampling theory for neutral alleles in a varying environment. Philos. Trans. R. Soc. Lond. B Biol. Sci. 344:403–410
Kimmel, M., and R. Chakraborty. 1996. Measures of variation at DNA repeat loci under a general stepwise mutation model. Theor. Popul. Biol. 50:345–367
Kimmel, M., R. Chakraborty, J. P. King, M. Bamshad, W. S. Watkins, and L. B. Jorde. 1998. Signatures of population expansion in microsatellite repeat data. Genetics 148:1921–1930
Mountain, J. L., and L. L. Cavalli-Sforza. 1997. Multilocus genotypes, a tree of individuals, and human evolutionary history. Am. J. Hum. Genet. 61:705–718
Polanski, A., M. Kimmel, and R. Chakraborty. 1998. Application of a time-dependent coalescence process for inferring the history of population size changes from DNA sequence data. Proc. Natl. Acad. Sci. USA 95:5456–5461
Reich, D. E., M. W. Feldman, and D. B. Goldstein. 1999. Statistical properties of two tests that use multilocus data sets to detect population expansions. Mol. Biol. Evol. 16:453–466
Reich, D. E., and D. B. Goldstein. 1998. Genetic evidence for a Paleolithic human population expansion in Africa. Proc. Natl. Acad. Sci. USA 95:8119–8123
Relethford, J. 1998. Mitochondrial DNA and ancient population growth. Am. J. Phys. Anthropol. 105:1–7
Roe, A. 1992. Correlations and interactions in random walks and genetics. Ph.D. dissertation, University of London, London
Rogers, A. R., and H. C. Harpending. 1992. Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9:552–569
Rubinsztein, D. C. 1999. Trinucleotide expansion mutations cause diseases which do not conform to classical Mendelian expectations. Pp. 80–96 in D. B. Goldstein and C. Schlötterer, eds. Microsatellites: evolution and applications. Oxford University Press, Oxford, England
Weber, J. L., and C. Wong. 1993. Mutation of human short tandem repeats. Hum. Mol. Genet. 2:1123–1128
Zhivotovsky, L. A., and M. W. Feldman. 1995. Microsatellite variability and genetic distances. Proc. Natl. Acad. Sci. USA 92:11549–11552