* Laboratoire Modélisation et Biologie Evolutive, CBGP-INRA, Montferrier sur Lez, France
Laboratoire Génétique et Environnement, CNRS-UMR 5554, Montpellier, France
Abstract
Key Words: coalescence, dispersal, isolation by distance, microsatellite DNA, nonparametric ABC bootstrap
Introduction
In numerous species, individual dispersal is restricted in space. This means that individuals are more likely to mate with individuals born close to them than with individuals born far away. Several studies on animals or plants have shown such restricted dispersal (e.g., for plant data, see Crawford 1984; and for animal data, Rousset 1997, 2000; Spong and Creel 2001; Sumner et al. 2001). Isolation by distance models taking into account this biological feature were introduced by Wright (1943, 1946). Under these models the genetic differentiation at neutral loci is expected to increase with geographical distance (e.g., Malécot 1950, 1967; Sawyer 1977). Empirical data indicate that such a relationship holds for many species (Endler 1977; Slatkin 1993). Recently, a method of analysis was developed based on the increase, at a local scale, of genetic differentiation between individuals with geographical distance in a "continuous" population evolving under isolation by distance (Rousset 2000). The method regresses estimators of a parameter analogous to FST/(1 - FST), calculated between individuals, on the logarithm of the geographical distance to estimate the product Dσ², where D is the density of adults and σ² the average squared axial parent-offspring distance. It is expected to perform better than previous methods for several reasons. First, the demographic model on which the method is based makes weak assumptions about the shape of the distribution of dispersal distances. In particular, the method is valid for leptokurtic distributions of dispersal distance (Rousset 2000), a feature commonly observed in natural populations (for review and data, see Endler 1977; Portnoy and Willson 1993; Clark et al. 1999). Second, analysis of genetic differentiation is made at a small (local) geographical scale so that heterogeneity of demographic parameters such as dispersal or density is reduced and hence its influence on genetic differentiation is also reduced (Slatkin 1993; Rousset 2001b). In a similar way, influence of non-neutrality of the genetic markers may be less problematic for studies at local scale because selection parameters may be less heterogeneous at a small geographical scale. On the other hand, the theory on which the method is based shows that only estimations from analysis over short distances will be accurate (Rousset 1997). These expectations have been confirmed by several comparisons of direct and indirect estimates of Dσ² (Rousset 1997, 2000; Sumner et al. 2001). Although the geographical scale at which the sampling has been done is expected to influence the quality of the estimation of Dσ², very few analytical or simulation studies have formally addressed this question.
Since their discovery in the 1980s, microsatellite loci have been increasingly used as genetic markers. Rapid progress in molecular biology technologies, especially the development of the polymerase chain reaction, and attractive evolutionary features (e.g., high level of polymorphism) explain why this category of markers is progressively replacing, or at least complementing, classical markers such as allozymes for numerous applications in molecular systematics, population genetics, and ecology (reviewed in Estoup and Angers 1998; Estoup, Jarne, and Cornuet 2002). However, the mutation processes (i.e., the nature of mutations) at microsatellite loci are complex and not yet well understood (e.g., Estoup and Cornuet 1999). The effect of the mutation processes on evolutionary inferences depends in large part on the method, the statistics, and the evolutionary time scale considered (e.g., Estoup, Jarne, and Cornuet 2002). Some authors have discussed the effect of the nature of the mutation on FST values (Slatkin 1995; Rousset 1996). Because a stepwise mutation process occurs at microsatellite loci, several statistics taking into account the allele size have been proposed (Goldstein et al. 1995; Slatkin 1995; Michalakis and Excoffier 1996). Their utility, however, has often been criticized (e.g., Takezaki and Nei 1996; Gaggiotti et al. 1999). Overall, the potential interest of the different statistics has never been addressed in the context of the estimation of demographic parameters under isolation by distance.
In this study, we developed an original simulation algorithm based on coalescent theory to study the sensitivity of the estimation of Dσ² to different factors: (1) the sampling scale of individuals, (2) the mutation model of markers, and (3) their mutation rate, with particular reference to microsatellite markers for the latter two points. This algorithm was also used to test a nonparametric ABC bootstrap procedure allowing the construction of confidence intervals on the Dσ² estimate. Finally, we draw guidelines that could be useful for empirical investigators using the individual-based method of Rousset (2000).
Models and Methods
The life cycle is divided into five steps: (1) at each reproductive event, each individual produces a large number of gametes and then dies; (2) gametes undergo mutation; (3) gametes disperse; (4) diploid individuals are formed; and (5) competition reduces the number of adults in each deme to one.
Coalescent Algorithm
The genealogical tree of a sample of n genes taken from a panmictic population of constant size N can be modeled using a stochastic process known as the n-coalescent. This process was introduced by Kingman (1982a, 1982b) as an approximation of a gene genealogy under the "Wright-Fisher" neutral model (see also Hudson 1990, Tajima 1983). More sophisticated models have since been developed for analysis of more complex evolutionary scenarios with recombination, selfing, and variable population size (reviewed in Nordborg 2001).
The n-coalescent approximation can be used in the same context as diffusion equations (Nordborg 2001). It is thus valid for a restricted number of models of population structure, e.g., panmictic populations or the infinite island model. In the present work, we focused on isolation by distance. For this category of models, no analytical treatment of coalescence time or coalescence probabilities has been done for more than two genes. Algorithms such as those developed for likelihood estimation by Griffiths and collaborators (see Nath and Griffiths 1996; Bahlo and Griffiths 2000) could in principle deal with continuous models; however, they are not ready for demographic inferences (De Iorio and Griffiths, personal communication). The coalescent algorithm we developed is not based on the n-coalescent theory; rather, it is an algorithm in which coalescence and migration events are considered "generation by generation" until the common ancestor of the sample has been found. The idea of tracing lineages back in time generation by generation is fundamental in coalescent theory, and is well described in Nordborg (2001). At least one study already used this simple concept for simulations (i.e., Pope, Estoup, and Morris 2000). Although such a generation-by-generation algorithm leads to less efficient simulations in terms of computation time than those based on the n-coalescent theory, it is much more flexible when complex demographic and dispersal features are considered. The algorithm described below and the program used in this study were checked at every step during elaboration by comparison with exact analytical results for probabilities of identity in models of isolation by distance on a finite lattice (e.g., Malécot 1975 for the lattice model, adapted to different mutation models following Rousset 1996).
These comparisons show that estimates of identity probabilities from our program and analytical expectations differ by less than one per thousand for sufficiently long runs.
Let us consider, at a given time and on a two-dimensional lattice, a sample of n(0) genes numbered 1 to n(0). The position of each gene on this lattice is given by a pair of coordinates (x, y). The set of coordinates of sampled genes is given by the two vectors X(0) = [x_1(0), ..., x_{n(0)}(0)] and Y(0) = [y_1(0), ..., y_{n(0)}(0)], where x_i(0) and y_i(0) are the coordinates of gene i at G = 0, with G corresponding to the number of generations since sampling.
This algorithm goes backward in time, generation by generation (considering discrete generations). At G = 1, parents of our n(0) sampled genes have coordinates x_i(1) = x_i(0) + dx, y_i(1) = y_i(0) + dy, where dx and dy are random variables representing dispersal distance in one dimension, expressed in number of steps on the lattice. Under a two-dimensional model, the density function of the random variable (dx, dy) is given by b_{dx,dy}, the "backward" dispersal function. The term backward is used because the position of the parental gene is determined knowing the position of its descendant gene. This function is calculated using f_{dx,dy}, the forward dispersal density function describing where descendants go. The dispersal functions are detailed in the next section. We assume that dispersal is independent in each direction, so that f_{dx,dy} = f_dx × f_dy. Considering that density is homogeneous in space, backward dispersal functions are equal to forward dispersal functions, so that b_{dx,dy} = f_{dx,dy} = f_dx × f_dy.
Once the position of the parents on the lattice is known, the coalescence events occurring at G = 1 are assessed. In other words, we determine whether some genes share a common parent at G = 1. This step corresponds to the idea of "individuals picking their parents at random from the previous generation" (Nordborg 2001). A coalescence event occurs if genes are both on the same lattice node and if they originate from the same parental gene. Multiple coalescences are allowed. The probability for a coalescence of k genes in a given parental gene is 1/2^(k-1) under the model with one individual per lattice node. In this case, the remaining j genes from the same lattice node coalesce in the other parental gene. For convenience, we keep the numbering (i ∈ [1, ..., n(0)]) of descendant genes for their parents when these genes do not coalesce and attribute new numbers (i ∈ [n(0) + 1, ..., n(1)]) for the parents of the coalesced genes. A gene i at G = 0 and its parent at G = 1 have the same number if there was no coalescence event between the gene i and another gene at G = 0. Thus our numbering refers more to the branches of the coalescent tree than to the genes themselves. This particular numbering of branches, nodes, and genes is illustrated in figure 1. At G = 1, we have X(1) = [x_1(1), ..., x_{n(1)}(1)] and Y(1) = [y_1(1), ..., y_{n(1)}(1)], the n(1) geographic coordinates at G = 1 for each branch corresponding to a lineage of our sample. We keep in memory the ages of the tree "nodes" (corresponding to coalescence events) and the labels of the branches descending from each "node." The entire process is repeated over generations until the most recent common ancestor of our entire gene sample has been found.
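The generation-by-generation procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the program used in the study: the axial dispersal function `draw_step` is a hypothetical placeholder (the dispersal distribution actually used is not reproduced here), the lattice is wrapped as a torus by assumption, and all function and variable names are ours.

```python
import random

def draw_step(rng, max_step=3, stay=0.5):
    """Hypothetical symmetric axial dispersal (placeholder distribution):
    stay put with probability `stay`, else move 1..max_step nodes."""
    if rng.random() < stay:
        return 0
    return rng.choice([-1, 1]) * rng.randint(1, max_step)

def simulate_genealogy(sample, size, rng):
    """Trace lineages backward until the most recent common ancestor.

    sample: list of (x, y) positions, one per sampled gene.
    size:   lattice side; edges are wrapped as a torus (assumption).
    Returns (tmrca, nodes) where nodes is a list of
    (generation, parent_label, child_labels).
    """
    lineages = {i: pos for i, pos in enumerate(sample)}  # label -> (x, y)
    next_label = len(sample)
    nodes = []
    g = 0
    while len(lineages) > 1:
        g += 1
        by_parent = {}
        for label, (x, y) in lineages.items():
            # 1) backward dispersal, independent in each dimension
            px = (x + draw_step(rng)) % size
            py = (y + draw_step(rng)) % size
            # 2) pick one of the two gene copies of the diploid parent
            copy = rng.randrange(2)
            by_parent.setdefault((px, py, copy), []).append(label)
        # 3) lineages that picked the same parental gene coalesce
        lineages = {}
        for (px, py, _copy), children in by_parent.items():
            if len(children) == 1:
                lineages[children[0]] = (px, py)  # branch keeps its label
            else:
                lineages[next_label] = (px, py)   # new label for the node
                nodes.append((g, next_label, children))
                next_label += 1
    return g, nodes

rng = random.Random(1)
sample = [(x, y) for x in range(0, 10, 2) for y in range(0, 4, 2)]  # 10 genes
tmrca, nodes = simulate_genealogy(sample, 20, rng)
print(tmrca, len(nodes))
```

As in the text, a branch keeps its label while it does not coalesce, and each coalescence node stores its age and the labels of the branches descending from it. Multiple coalescences in one generation are allowed because any number of lineages may pick the same parental gene.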
|
By suitable choice of the two parameter values, large kurtosis can be obtained with high migration rates (Rousset 2000). For all of our simulations, we used a dispersal distribution with a moderate σ² value (σ² = 4), corresponding to a dispersal distribution with parameters:
|
Mutation Processes
One interesting feature of the coalescent-based approach is that, for neutral loci, genealogical and mutation processes are totally independent, so that the effects of mutation are simply superimposed on the genealogical tree obtained for the gene sample.
Two theoretical mutation models, the infinite allele model (IAM: Kimura and Crow 1964) and the K-allele model (KAM: Crow and Kimura 1970), have sometimes been used for microsatellite loci. However, the most widely adopted model for microsatellite mutation is the stepwise mutation model (SMM: Ohta and Kimura 1973) in which the mutant allele differs from its parent by one repeat. Direct and indirect studies have shown that mutations of several repeats also occurred, indicating that a strict one-step model is inappropriate (Estoup and Angers 1998; Gonser et al. 2000; Ellegren 2000). In practice, modeling assumptions are commonly limited to the SMM (e.g., Reich and Goldstein 1998; Wilson and Balding 1998), and sensitivity of the final inferences to this assumption may be substantial, although this is rarely investigated. In several studies (e.g., Pritchard et al. 1999), a generalization of the SMM was adopted in which the change in the number of repeat units forms a geometric random variable. This generalization was named the GSM (generalized stepwise mutation) model. The geometric distribution in our GSM model refers to a change expressed in an (absolute) number of repeat units subsequently added or withdrawn to the mutating allele with equal probability. Under this model, the large data set of microsatellite mutations of Dib et al. (1996) in humans suggests an estimate of the variance of the geometric distribution near 0.36 (Estoup et al. 2001). The GSM does not capture all the complexity of the mutation process at microsatellite loci. In particular, constraints on allele size occur at some microsatellite loci (reviewed in Amos 1999; Estoup and Cornuet 1999; Ellegren 2000) and potentially affect various statistics in population genetics (Estoup et al. 2002). This evolutionary feature, particular to microsatellite loci, was thus tested on our method. 
Allele size constraints were included in our simulations by imposing reflecting boundaries on the allele size range (e.g., Feldman et al. 1997; Estoup et al. 1999). Another outstanding feature of the microsatellite mutation process is that the within-locus mutation rate increases with allele length (Ellegren 2000; Huang et al. 2002). Whether this increase is linear with the number of repeats remains subject to further investigation (Schlötterer 2000; Stumpf and Goldstein 2001; Brohede et al. 2002). In our simulations, we considered a linear model in which (1) the mutation rate was fixed to 5 × 10⁻⁴ for the allelic state of the root of the tree (fixed at 100 repeat units and considered the "middle size allele"); (2) a decrease in mutation rate with allele size of 0.1% or 1% per repeat unit, for a weak or a strong variation, respectively, is simulated for alleles shorter than 100 repeat units; and (3) a similar increase is simulated for alleles longer than 100 repeat units. In other words, this leads to the linear form µ(L) = µ0 + s × L, where µ(L) is the mutation rate for an allele of size L, µ0 the mutation rate for the smallest allele, and s the increase per repeat unit. We set s = 0.1% or 1% for a weak or a strong variation, respectively, to be close to the value given in Brohede et al. (2002).
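The length-dependent mutation rate can be illustrated with a short function. The text does not state what the percentage change is relative to, so this sketch assumes that s is a relative change per repeat unit applied to the rate of the 100-repeat root allele, and that the rate cannot fall below zero. The function name and defaults are ours.

```python
def mutation_rate(length, mu_root=5e-4, root_length=100, s=0.001):
    """Mutation rate for an allele of `length` repeat units.

    Assumption: `s` is a relative change per repeat unit, applied to the
    rate at the root allele size; the rate is floored at zero.
    """
    rate = mu_root * (1.0 + s * (length - root_length))
    return max(rate, 0.0)

for length in (90, 100, 110):
    # weak (s = 0.1%) and strong (s = 1%) variation
    print(length, mutation_rate(length, s=0.001), mutation_rate(length, s=0.01))
```

Under this reading, an allele of 110 repeats mutates 1% (weak variation) or 10% (strong variation) faster than the root allele.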
Interlocus variability in the mutation rate potentially decreases the precision of parameter estimation in population genetics (Takezaki and Nei 1996; Gonser et al. 2000). The effect of variable mutation rate was thus tested as well. Little information is available on the interlocus variance of the mutation rate at microsatellite loci. Several pedigree studies show that mutation rates can differ substantially across loci (reviewed in Schlötterer 2000). Without more information, we modeled variable mutation rates at microsatellite loci by drawing single-locus mutation rate values from a gamma distribution with parameters (shape, scale) equal to (2, 2.5 × 10⁻⁴). This distribution has a mean equal to 5 × 10⁻⁴, a value considered as the average mutation rate in many species (reviewed in Estoup and Angers 1998), and 2.5% and 97.5% quantiles equal to 6 × 10⁻⁵ and 1.4 × 10⁻³, respectively. These values are similar to the mean and 95% confidence interval values typically considered for autosomal microsatellites in humans (Weber and Wong 1993).
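The properties quoted for this gamma distribution can be checked directly. The sketch below uses the closed-form distribution function of a gamma with shape 2 to check the two quantiles, and draws one rate per locus with Python's standard library. Function names are ours.

```python
import math
import random

SHAPE, SCALE = 2.0, 2.5e-4

def gamma2_cdf(x, scale=SCALE):
    """CDF of a gamma distribution with shape 2:
    1 - (1 + x/scale) * exp(-x/scale)."""
    z = x / scale
    return 1.0 - (1.0 + z) * math.exp(-z)

def draw_locus_rates(n_loci, rng):
    """One mutation rate per locus; random.gammavariate takes (shape, scale)."""
    return [rng.gammavariate(SHAPE, SCALE) for _ in range(n_loci)]

rng = random.Random(42)
rates = draw_locus_rates(13, rng)
print(SHAPE * SCALE)        # mean of the distribution (shape * scale)
print(gamma2_cdf(6e-5))     # close to 0.025
print(gamma2_cdf(1.4e-3))   # close to 0.975
```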
The following step-by-step procedure was used to add mutations to the genealogical tree. Take at random two genes i and j and their most recent common ancestor, the gene l, and let state_i, state_j, and state_l be their respective allelic states. The number of mutations that occurred in lineage i depends on the length L_i (expressed in number of generations) of branch i (from l to i) and follows a binomial distribution with parameters (µ, L_i), which can be approximated by a Poisson distribution with parameter µL_i. Let m_i be the number of mutations that occurred on branch i. One can easily deduce state_i from state_l through m_i successive steps, each step corresponding to a mutation event under the chosen mutation model. The allelic states of the various genes of the sample were obtained starting from a given state for the common ancestor of the sample (root of the genealogical tree) and going forward in time on each branch.
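A minimal sketch of this step for a single branch under the GSM with reflecting boundaries is given below. It is an illustration under stated assumptions rather than the authors' code: the Poisson sampler is Knuth's simple method (adequate only for moderate µL), the geometric parameter p = 0.78 is chosen so that the variance (1 - p)/p² is about 0.36, and the bounds `lo` and `hi` as well as all names are hypothetical.

```python
import math
import random

def poisson(rng, mean):
    """Knuth's multiplication method; fine for the moderate means used here."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def reflect(size, lo, hi):
    """Fold an allele size back into [lo, hi] (reflecting boundaries, lo < hi)."""
    while size < lo or size > hi:
        size = 2 * lo - size if size < lo else 2 * hi - size
    return size

def gsm_step(rng, p=0.78):
    """Signed GSM step: geometric magnitude (>= 1), direction chosen with
    equal probability. p = 0.78 gives a variance of about 0.36."""
    magnitude = 1
    while rng.random() > p:
        magnitude += 1
    return magnitude if rng.random() < 0.5 else -magnitude

def mutate_branch(state, branch_length, mu, rng, lo=1, hi=200):
    """Allelic state at the bottom of a branch, given the state at its top."""
    for _ in range(poisson(rng, mu * branch_length)):
        state = reflect(state + gsm_step(rng), lo, hi)
    return state

rng = random.Random(3)
print(mutate_branch(100, 20000, 5e-4, rng, lo=95, hi=105))
```

Applying `mutate_branch` from the root downward along every branch of a simulated tree yields the allelic states of the sampled genes. Setting p = 1 in `gsm_step` reduces the model to the strict SMM.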
Method of Analysis
Each simulation iteration gave the genotypes at l polymorphic loci for (n x n) individuals denoted by their coordinates on the lattice. l independent coalescent trees were used to simulate multi-locus genotypes. This process was repeated 1,000 times giving 1,000 multilocus samples sharing the same demographic conditions. We computed estimates of the parameter
|
|
|
|
To test the effect of using a statistic that takes into account the allele length differences (and hence the stepwise mutational process occurring at microsatellite loci), we defined another parameter b_r, equivalent to a_r, except that it is defined in terms of squared differences in microsatellite allele lengths (SD) instead of probabilities of non-identity in state (1 - Q). Thus, we have
|
|
|
|
For each of the 1,000 repetitions, the slope of the regression line between â (or b̂) and the logarithm of geographical distance was computed. In the limit of low mutation rates, the inverse of the slope is an estimate of the product 4πDσ², where D is the density of adults and σ² the average squared axial parent-offspring distance (Rousset 1997). It is worth noting that high mutation rates should not result in an asymptotic bias as long as the focus is on local processes involving distances between sampled individuals
|
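The regression step can be written out explicitly. The sketch below fits an ordinary least-squares line of the pairwise statistic on the logarithm of distance and converts the slope into an estimate of Dσ² as 1/(4π × slope). The input values are synthetic and constructed to lie exactly on the expected line; the function names are ours.

```python
import math

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def estimate_d_sigma2(distances, a_hat):
    """Estimate D*sigma^2 as 1 / (4 * pi * slope) of a_hat on ln(distance)."""
    slope = ols_slope([math.log(r) for r in distances], a_hat)
    return 1.0 / (4.0 * math.pi * slope)

# Synthetic check: values lying exactly on the expected line for D*sigma^2 = 4
true_value = 4.0
distances = [2, 4, 6, 8, 10, 14, 20]
a_hat = [0.01 + math.log(r) / (4 * math.pi * true_value) for r in distances]
print(estimate_d_sigma2(distances, a_hat))  # close to 4
```

With real data the slope can be zero or negative, in which case no meaningful point estimate of Dσ² is obtained; the Results count such negative slopes separately.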
An accurate estimate of the uncertainty associated with parameter estimates is important to avoid misleading inferences. The nonparametric ABC bootstrap procedure described in DiCiccio and Efron (1996) was adapted to compute 95% confidence intervals around the regression slope. ABC bootstrap is a procedure that generates approximate bootstrap confidence intervals without actual resampling. It is useful for estimation methods that require long computation times. In this procedure, we considered genotypic data at each locus as independent replicates of the genealogical process. Tests of this procedure were performed using the same simulation program described above by calculating the coverage probability of the confidence intervals for 1,000 simulated data sets. We arbitrarily chose a dispersal distribution with σ² = 4 [parameters given in equation (1)]. For each repetition, 100 individuals were sampled every two lattice nodes within an area of (10σ × 10σ) on a (100 × 100) lattice. Estimates of a_r and 95% confidence intervals were calculated for 7, 13, or 25 loci evolving under a SMM with a mutation rate equal to 5 × 10⁻⁴.
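To make the idea of treating loci as independent replicates concrete, the sketch below shows an ordinary percentile bootstrap that resamples loci with replacement. This is deliberately not the ABC procedure of DiCiccio and Efron (1996), which approximates such intervals analytically without resampling; it is a simpler stand-in. The per-locus values are synthetic numbers generated for illustration only, and the mean is used as a hypothetical multilocus estimator.

```python
import random

def percentile_bootstrap_ci(per_locus, estimator, rng, n_boot=2000, level=0.95):
    """Percentile interval obtained by resampling loci with replacement."""
    n = len(per_locus)
    stats = sorted(
        estimator([per_locus[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = stats[int(tail * n_boot)]
    upper = stats[int((1.0 - tail) * n_boot) - 1]
    return lower, upper

def mean(values):
    return sum(values) / len(values)

rng = random.Random(7)
# Synthetic per-locus slopes for 13 loci (illustrative values only)
slopes = [rng.gauss(0.02, 0.005) for _ in range(13)]
low, high = percentile_bootstrap_ci(slopes, mean, rng)
print(low, mean(slopes), high)
```

Coverage can then be assessed as in the text: repeat the simulation many times and count how often the interval contains the expected value.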
Results
|
|
Influence of the Sampling Scale
Previous simulations with two-allele loci suggested that the regression method would be efficient if one can sample all individuals within an area of about 10σ × 10σ, giving a sample size of 100Dσ² individuals (Rousset 2000). It is worth noting that if Dσ² is greater than, say, 5, it becomes difficult in practice to sample and genotype all individuals (>500 individuals). Hence, since the number of individuals to sample is necessarily limited, the method should be less efficient when Dσ² increases. In practice, biologists collect samples of a reasonably large number of individuals (say 100) within an area larger or smaller than the recommended (10σ × 10σ) area when Dσ² is small or large, respectively. In order to assess the effect of such practical "non-scaled sampling," we simulated a distribution of dispersal with σ² = 4 [parameters given in expression (1)] and four different sampling schemes. One hundred individuals were taken: (1) every lattice node within an area of (5σ × 5σ), for the first sampling scheme; (2) every two lattice nodes within an area of (10σ × 10σ), for the second one; (3) every five lattice nodes within an area of (25σ × 25σ), for the third one; and (4) every ten lattice nodes within an area of (50σ × 50σ), for the last one. For each repetition the parameter estimated is a_r for 13 loci evolving under a SMM with a mutation rate equal to 5 × 10⁻⁴. We considered that a set of 13 loci represents a reasonable number of loci in empirical studies using microsatellites. A two-dimensional lattice of (200 × 200) individuals was considered for the first three sampling schemes and of (500 × 500) individuals for the last one, to avoid edge effects on the estimations when considering samples larger than half the length of the lattice. Figure 2 shows that lattice size has no major effect on the estimation, except if it is less than ten times the mean dispersal distance (simulation parameters are those used in this paragraph). Unless the lattice size is very small (50 × 50), the bias and the MSE do not differ notably from those for a very large lattice size (1,000 × 1,000).
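The four sampling schemes can be checked with a few lines of code. The sketch assumes that an area of (kσ × kσ) is measured in lattice nodes with σ = 2 (since σ² = 4), which is our reading of the text; the helper name and the `origin` argument are ours.

```python
SIGMA = 2  # sigma^2 = 4, so sigma = 2 lattice steps (our interpretation)

def sampling_grid(area_in_sigma, spacing, origin=(0, 0)):
    """Coordinates of individuals sampled every `spacing` nodes within a
    square whose side is area_in_sigma * SIGMA nodes."""
    side = area_in_sigma * SIGMA
    x0, y0 = origin
    return [(x0 + x, y0 + y)
            for x in range(0, side, spacing)
            for y in range(0, side, spacing)]

schemes = {"5 sigma": (5, 1), "10 sigma": (10, 2),
           "25 sigma": (25, 5), "50 sigma": (50, 10)}
for name, (area, spacing) in schemes.items():
    print(name, len(sampling_grid(area, spacing)))  # 100 individuals each
```

Each scheme keeps the sample size at 100 individuals while the sampled area grows from 10 × 10 to 100 × 100 nodes, so only the geographical scale changes between schemes.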
|
|
Simulations were run considering a sample of 100 individuals for 13 loci evolving in a two-dimensional lattice of (100 × 100) individuals. For each repetition of the simulation process the parameter estimated is a_r. As it is often not easy in practice to sample most individuals from a small area, we considered a sample of (10 × 10) individuals taken every two nodes from an area of (20 × 20) nodes in the lattice. By doing so, we approximated the sampling scheme typically used in empirical studies. We also chose a dispersal distribution with a relatively large σ² value [i.e., σ² = 4, parameters given in equation (1)]. The logic underlying this choice is that the method may be inaccurate in this case and that it is more relevant to distinguish differences in efficiency when the method does not perform extremely well than when it performs well, whatever the mutation model.
The mutation rate was first fixed at 5 × 10⁻⁴ for all loci for each mutation model. Our results show that the nature of the mutation model has little influence on the estimation of the product Dσ² (table 3). Whatever mutation model is considered, the bias is positive and around 10%. Although the precision of the method is maximum under the IAM (MSE = 6%) and minimum under the GSM with strong constraints (K = 10, MSE = 11%), these differences are small. For all mutation models more than 97% of the estimations are within a factor of 2 from the expected Dσ² value.
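The summary statistics used throughout the Results can be computed as follows. The exact definitions are not reproduced in this text, so the sketch assumes a relative bias and a relative mean square error (both scaled by the expected value) and counts estimates lying between half and twice the expected value. The function name and the example numbers are ours.

```python
def summarize(estimates, expected):
    """Relative bias, relative MSE, and share of estimates within a factor
    of 2 of the expected value (assumed definitions)."""
    n = len(estimates)
    rel_errors = [(e - expected) / expected for e in estimates]
    bias = sum(rel_errors) / n
    mse = sum(err ** 2 for err in rel_errors) / n
    within = sum(expected / 2 <= e <= 2 * expected for e in estimates) / n
    return bias, mse, within

# Toy input with an expected value of 4: three estimates fall within a factor of 2
print(summarize([2.0, 4.0, 6.0, 10.0], 4.0))  # (0.375, 0.6875, 0.75)
```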
|
Influence of the Mutation Rate
The influence of the mutation rate (or the genetic diversity) has been studied for the GSM, a mutation model considered as more realistic for microsatellite loci than the SMM, the KAM, or the IAM (e.g., Estoup and Cornuet 1999). All other simulation parameters are those used for evaluating the influence of the mutation model. Our simulations showed that the mutation rate has a substantial effect on the bias and the MSE (fig. 3 and table 4). The MSE is more strongly influenced by the mutation rate than the bias. For "low" genetic diversities (i.e., H = 0.5), the observed bias is positive and never greater than 12%. In contrast, for genetic diversity lower than 0.6, the MSE is greater than 20% and increases relatively rapidly when the genetic diversity decreases. However, even for a genetic diversity lower than the mean genetic diversity observed in most microsatellite studies (e.g., about 0.5), 85% of the estimations are within a factor of 2 from Dσ², although 15 negative slopes were found (table 4).
|
|
|
|
It is sometimes considered that large variation in mutation rate between loci decreases the precision of parameter estimation in population genetics (e.g., Takezaki and Nei 1996; Gonser et al. 2000). To address this question, we considered 13 loci evolving under the GSM with mutation rates drawn for each locus from a gamma distribution of mean 5 × 10⁻⁴ (see earlier under Models and Methods: Mutation Processes), all other simulation parameter values being the same as those used in the previous section. Our simulation results show that variable mutation rates for microsatellite loci have little effect on the estimation of Dσ² (table 4). The bias and the MSE values are 11% and 11%, respectively, which does not differ much from the values of 10% and 9% obtained with a fixed mutation rate of 5 × 10⁻⁴. More than 98% of the estimations are within a factor of 2 from Dσ² and no negative estimates were found. Finally, our simulation results show that a linear increase in mutation rates with allele length has little effect on the estimation of Dσ² (table 4). Strong or weak variations give similar results. The bias and the MSE values are about 10%-11% and 8%, respectively, which again does not differ much from the values of 10% and 9% obtained with a fixed mutation rate of 5 × 10⁻⁴. No negative estimates were found, and more than 99% of the estimations are within a factor of 2 from Dσ².
Test for a Statistic Taking into Account Allele Size Differences
The behavior of the statistic b_r, an equivalent of a_r based on allele sizes, has been studied under both the SMM (i.e., the mutation model under which this statistic is expected to perform optimally) and the GSM with a mutation rate fixed at 5 × 10⁻⁴. All other simulation parameter values are those used in the two previous sections. Table 5 shows that the method of estimation of Dσ² performs poorly when b_r is used. Under both the SMM and GSM, the increase in MSE as well as in the number of negative slopes is spectacular. For instance, the MSE goes from about 10% when using the classical measure a_r to values greater than 100% when using b_r. In contrast, the bias is only slightly increased compared to estimations using a_r. Although slight, the bias increase appears higher under the GSM than under the SMM (+9% versus +4%).
|
Discussion
A second major conclusion of this study is that the mutation rate, or the genetic diversity (the latter being largely dependent on the mutation rate), has a strong influence on the estimation of Dσ². This is in agreement with previous studies demonstrating that mutation rate is a more important feature than mutation processes for the estimation of demographic parameters through F-statistics (reviewed in Rousset 2001a; Estoup, Jarne, and Cornuet 2002). Interestingly, the heterozygosities at microsatellite loci are typically between 0.5 and 0.8 (reviewed in Estoup and Angers 1998), a range of values corresponding to the level of genetic diversity that was found to maximize the efficiency of the estimation of Dσ². Moreover, the potential effect on the estimation of interlocus and intralocus variability in the mutation rate seems to be weak. Therefore microsatellites are more appropriate to estimate the product Dσ² than less polymorphic markers such as allozymes. The importance of the level of variability of the loci used to estimate population parameters has been illustrated by several theoretical and empirical studies. For example, Robertson and Hill (1984) showed that precision in estimates of heterozygote deficiency (Fis) increases with the level of variability of the markers. Goudet et al. (1996) also showed that the power of statistical tests of differentiation increases with the number of alleles. In practice, although precise information on mutation rate is difficult to obtain, it is straightforward to calculate a genetic diversity index for a set of markers from which a level of efficiency can be inferred for the estimation of Dσ². Our simulations also indicate that future studies should avoid loci with a very high level of genetic diversity (higher than, say, 0.85), because those loci were found to strongly bias the estimations of Dσ² downward.
Many studies emphasize that traditional FST does not make use of the additional information provided by the difference in the number of repeat units at microsatellite loci. However, statistics developed for this purpose often have higher variance than statistics based on allele frequencies (e.g., Gaggiotti et al. 1999). In agreement with this finding, using a statistic that takes allele size differences into account increases the MSE by at least a factor of 10 compared with a statistic based on identity in state. This result parallels those of Gaggiotti et al. (1999), which showed that in many cases, especially when sample size and number of loci are "small" (i.e., under the conditions of most empirical studies), population structure measures based on allele frequencies alone are more reliable than measures specifically designed for microsatellite loci. Takezaki and Nei (1996) also showed that even for loci evolving under a strict SMM, genetic distances taking into account allele size differences are less efficient for phylogenetic inference than those based on identity in state, especially for short to moderate divergence times. The poor efficiency of this category of statistics appears to be a general feature of studies of evolutionary events, especially those referring to fine geographical and temporal scales.
The effects of the mutation processes and high mutation rates on the estimation of Dσ² are expected to be more important at large geographical scales (Rousset 1997). In agreement with this expectation, our results showed that sampling at large distance leads to an underestimation of the regression slope and thus to an overestimation of Dσ². Therefore sampling at large distance makes it less likely to detect a pattern of isolation by distance. In contrast, sampling from too small an area leads to an overestimation of the regression slope and thus to an underestimation of the product Dσ². A possible explanation for this overestimation is that the linear relationship between estimates of a_r and the logarithm of the geographical distance is expected to hold less well over very short distances (Rousset 1997). However, using a sample not exactly appropriate to the biological case studied [i.e., a few times larger or smaller than the recommended area of (10σ × 10σ)] still gives reasonably robust estimations because, in most cases, the estimated Dσ² fell within a factor of 2 from the expected Dσ² value.
Given our result on bootstrap confidence intervals, we alert biologists using this method on a standard-sized data set (10 loci and 150 individuals, e.g., Sumner et al. 2001) that ABC confidence intervals overestimate the lower bound for the regression slope and thus underestimate the upper bound for Dσ². Construction of reliable confidence intervals based on the bootstrap is an ongoing problem for which a satisfactory solution has not yet been found, especially when the number of replications is limited computationally (DiCiccio and Efron 1996). Nevertheless, the ABC bootstrap procedure evaluated here should give an idea of the uncertainty of the Dσ² estimate, namely a correct lower bound for Dσ² and a minimal value for the upper bound. This procedure will be implemented in the next version of the population genetics package Genepop (Raymond and Rousset 1995).
Conclusion
Three conclusions inferred from our simulation study have important consequences for empirical investigations. First, we recommend using loci with high levels of polymorphism (genetic diversity around 0.7), although loci with too high a genetic diversity, e.g., more than 0.85, should be avoided. Because the mutational processes, specifically size homoplasy and allele size constraints, have little influence on Dσ² estimations, microsatellite markers seem to be the best choice at the present time. Second, using statistics based on allele size differences at microsatellite loci gives unreliable estimations of Dσ² because of the very high variance of those estimations. Third, it is important to restrict the sampling design to a relatively small geographical area in order to work at a local geographical scale; however, it is necessary to sample on a relatively large scale when σ is high. Optimizing the method studied here requires previous knowledge of σ, and we therefore recommend using a preliminary estimate of σ to allow subsequent design of an appropriate sampling scheme. In the absence of a preliminary estimate of σ, a rough estimate of this parameter deduced from consideration of known dispersal mechanisms should be useful to define the minimal scale of the study (e.g., Leblois et al. 2000). If these aspects are approximately satisfied, the method should give estimates of the product Dσ² with low bias and low mean square error. Finally, the ABC bootstrap procedure, as implemented in the package Genepop (Raymond and Rousset 1995), should be useful to estimate a 95% confidence interval on Dσ², although the upper bound of this interval is likely to be underestimated.
Literature Cited |
---|
Amos, W. 1999. A comparative approach to the study of microsatellite evolution. Pp. 66-79 in D. B. Goldstein and C. Schlötterer, eds. Microsatellites: evolution and applications. Oxford University Press, Oxford.
Bahlo, M., and R. C. Griffiths. 2000. Inference from GeneTree in a subdivided population. Theor. Popul. Biol. 57:79-95.
Barton, N. H., F. Depaulis, and A. M. Etheridge. 2002. Neutral evolution in spatially continuous populations. Theor. Popul. Biol. 61:31-48.
Brohede, J., C. Primmer, A. Møller, and H. Ellegren. 2002. Heterogeneity in the rate and pattern of germline mutation at individual microsatellite loci. Nucleic Acids Res. 30:1997-2003.
Clark, J. S., M. Silman, R. Kern, E. Macklin, and J. HilleRisLambers. 1999. Seed dispersal near and far: patterns across temperate and tropical forests. Ecology 80:1475-1494.
Crawford, T. J. 1984. The estimation of neighborhood parameters for plant populations. Heredity 52:273-283.
Crow, J. F., and M. Kimura. 1970. An introduction to population genetics theory. Harper & Row, New York.
Dib, C., S. Faure, C. Fizames, et al. 1996. A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature 380:152-154.
DiCiccio, T. J., and B. Efron. 1996. Bootstrap confidence intervals (with discussion). Stat. Sci. 11:189-228.
Ellegren, H. 2000. Heterogeneous mutation processes in human microsatellite DNA sequences. Nat. Genet. 24:400-402.
Endler, J. A. 1977. Geographical variation, speciation, and clines. Princeton University Press, Princeton, N.J.
Estoup, A., and B. Angers. 1998. Microsatellites and minisatellites for molecular ecology: theoretical and empirical considerations. Pp. 55-86 in G. Carvalho, ed. Advances in molecular ecology. NATO ASI series. IOS Press, Amsterdam.
Estoup, A., and J.-M. Cornuet. 1999. Microsatellite evolution: inferences from population data. Pp. 49-65 in D. B. Goldstein and C. Schlötterer, eds. Microsatellites: evolution and applications. Oxford University Press, Oxford.
Estoup, A., P. Jarne, and J.-M. Cornuet. 2002. Homoplasy at microsatellite loci and its consequences for population genetics analysis. Mol. Ecol. 11:1591-1604.
Estoup, A., I. J. Wilson, C. Sullivan, J.-M. Cornuet, and C. Moritz. 2001. Inferring population history from microsatellite and enzyme data in serially introduced cane toads, Bufo marinus. Genetics 159:1671-1687.
Felsenstein, J. 1975. A pain in the torus: some difficulties with models of isolation by distance. Am. Nat. 109:359-368.
Gaggiotti, O. E., O. Lange, K. Rassmann, and C. Gliddon. 1999. A comparison of two methods for estimating average levels of gene flow using microsatellites data. Mol. Ecol. 8:1513-1520.
Goldstein, D. B., A. R. Linares, L. L. Cavalli-Sforza, and M. W. Feldman. 1995. Genetic absolute dating based on microsatellites and the origin of modern humans. Proc. Natl. Acad. Sci. USA 92:6723-6727.
Gonser, R., P. Donnelly, G. Nicholson, and A. Di Rienzo. 2000. Microsatellite mutations and inferences about human demography. Genetics 154:1793-1807.
Goudet, J., M. Raymond, T. de Meeüs, and F. Rousset. 1996. Testing differentiation in diploid populations. Genetics 144:1931-1938.
Hastings, A., and S. Harrison. 1994. Metapopulation dynamics and genetics. Annu. Rev. Ecol. Syst. 25:167-188.
Huang, Q.-Y., F.-H. Xu, H. Shen, H.-Y. Deng, Y.-J. Liu, Y.-Z. Liu, J.-L. Li, R. R. Becker, and H.-W. Deng. 2002. Mutation patterns at dinucleotide microsatellite loci in humans. Am. J. Hum. Genet. 70:625-634.
Hudson, R. R. 1990. Gene genealogies and the coalescent process. Pp. 1-44 in D. Futuyma and J. Antonovics, eds. Oxford surveys in evolutionary biology. Oxford University Press, Oxford.
Kimura, M., and J. F. Crow. 1964. The number of alleles that can be maintained in a finite population. Genetics 49:725-738.
Kingman, J. F. C. 1982a. The coalescent. Stochast. Proc. Appl. 13:235-248.
Kingman, J. F. C. 1982b. On the genealogy of large populations. J. Appl. Prob. 19A:27-43.
Koenig, W. D., D. Van Vuren, and P. N. Hooge. 1996. Detectability, philopatry, and the distribution of dispersal distances in vertebrates. Trends Ecol. Evol. 11:514-517.
Kot, M., M. A. Lewis, and P. van den Driessche. 1996. Dispersal data and the spread of invading organisms. Ecology 77:2027-2042.
Leblois, R., F. Rousset, D. Tikel, C. Moritz, and A. Estoup. 2000. Absence of evidence for isolation by distance in expanding cane toad (Bufo marinus) population: an individual-based analysis of microsatellite genotypes. Mol. Ecol. 9:1905-1909.
Malécot, G. 1950. Quelques schémas probabilistes sur la variabilité des populations naturelles. Ann. Univ. Lyon A 13:37-60.
Malécot, G. 1967. Identical loci and relationship. Pp. 317-332 in L. M. Lecam and J. Neyman, eds. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 4. California University Press, Berkeley.
Malécot, G. 1975. Heterozygosity and relationship in regularly subdivided populations. Theor. Popul. Biol. 8:212-241.
Maruyama, T. 1972. Rate of decrease of genetic variability in a two-dimensional continuous population of finite size. Genetics 70:639-651.
Michalakis, Y., and L. Excoffier. 1996. A generic estimation of population subdivision using distances between alleles with special interest to microsatellite loci. Genetics 142:1061-1064.
Nath, H. B., and R. C. Griffiths. 1996. Estimation in an island model using simulation. Theor. Popul. Biol. 50:227-253.
Nauta, M. J., and F. J. Weissing. 1996. Constraints on allele size at microsatellite loci: implications for genetic differentiation. Genetics 143:1021-1032.
Nordborg, M. 2001. Coalescent theory. Pp. 179-208 in D. A. Balding, M. Bishop, and C. Cannings, eds. Handbook of statistical genetics. John Wiley & Sons, Chichester, U.K.
Ohta, T., and M. Kimura. 1973. A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet. Res. 22:201-204.
Pope, L. C., A. Estoup, and C. Moritz. 2000. Phylogeography and population structure of an ecotonal marsupial, Bettongia tropica, determined using mtDNA and microsatellites. Mol. Ecol. 9:2041-2053.
Portnoy, S., and M. F. Willson. 1993. Seed dispersal curves: behavior of the tail of the distribution. Evol. Ecol. 7:25-44.
Pritchard, J. K., M. T. Seielstad, A. Perez-Lezaun, and M. W. Feldman. 1999. Population growth of human Y chromosome microsatellites. Mol. Biol. Evol. 16:1791-1798.
Raymond, M., and F. Rousset. 1995. GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism. J. Hered. 86:248-249.
Reich, D. E., and D. B. Goldstein. 1998. Genetic evidence for a paleolithic human population expansion in Africa. Proc. Natl. Acad. Sci. USA 95:8119-8123.
Robertson, A., and W. G. Hill. 1984. Deviations from Hardy-Weinberg proportions: sampling variances and use in estimation of inbreeding coefficients. Genetics 107:703-718.
Rousset, F. 1996. Equilibrium values of measures of population subdivision for stepwise mutation processes. Genetics 142:1357-1362.
Rousset, F. 1997. Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics 145:1219-1228.
Rousset, F. 2000. Genetic differentiation between individuals. J. Evol. Biol. 13:58-62.
Rousset, F. 2001a. Genetic approaches to the estimation of dispersal rates. Pp. 18-28 in J. Clobert, E. Danchin, A. A. Dhondt, and J. D. Nichols, eds. Dispersal: individual, population and community. Oxford University Press, Oxford.
Rousset, F. 2001b. Inferences from spatial population genetics. Pp. 239-265 in D. A. Balding, M. Bishop, and C. Cannings, eds. Handbook of statistical genetics. John Wiley & Sons, Chichester, U.K.
Sawyer, S. 1977. Asymptotic properties of the equilibrium probability of identity in a geographically structured population. Adv. Appl. Prob. 9:268-282.
Schlötterer, C. 2000. Evolutionary dynamics of microsatellite DNA. Chromosoma 109:365-371.
Slatkin, M. 1993. Isolation by distance in equilibrium and non-equilibrium populations. Evolution 47:264-279.
Slatkin, M. 1994. Gene flow and population structure. Pp. 3-17 in L. A. Real, ed. Ecological genetics. Princeton University Press, Princeton, N.J.
Slatkin, M. 1995. A measure of population subdivision based on microsatellite allele frequencies. Genetics 139:457-462.
Spong, G., and S. Creel. 2001. Deriving dispersal distances from genetic data. Proc. R. Soc. Lond. Ser. B 268:2571-2574.
Stumpf, M. P. H., and D. B. Goldstein. 2001. Genealogical and evolutionary inference with the human Y chromosome. Science 291:1738-1742.
Sumner, J., F. Rousset, A. Estoup, and C. Moritz. 2001. "Neighborhood" size, dispersal and density estimates in the prickly forest skink (Gnypetoscincus queenslandiae) using individual genetic and demographic methods. Mol. Ecol. 10:1917-1927.
Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437-460.
Takezaki, N., and M. Nei. 1996. Genetic distances and reconstruction of phylogenetic trees from microsatellites DNA. Genetics 144:389-399.
Weber, J. L., and C. Wong. 1993. Mutation of human short tandem repeats. Hum. Mol. Genet. 2:1123-1128.
Wilson, I. J., and D. J. Balding. 1998. Genealogical inference from microsatellite data. Genetics 150:499-510.
Wright, S. 1943. Isolation by distance. Genetics 28:114-138.
Wright, S. 1946. Isolation by distance under diverse systems of mating. Genetics 31:39-59.