A Maximum-Likelihood Approach to Fitting Equilibrium Models of Microsatellite Evolution

Richard M. Sibly, John C. Whittaker and Melanie Talbot

*School of Animal and Microbial Sciences
{dagger}Department of Applied Statistics, University of Reading, Reading, England


    Abstract
 TOP
 Abstract
 Introduction
 Models
 Results
 Discussion
 Acknowledgements
 literature cited
 
Here, we develop a new approach to Markov chain modeling of microsatellite evolution through polymerase slippage and introduce new models: a "constant-slippage-rate" model, in which there is no dependence of slippage rate on microsatellite length, as envisaged by Moran; and a "linear-with-constant" model, in which slippage rate increases linearly with microsatellite length, but the line of best fit is not constrained to go through the origin. We show how these and a linear no-constant model can be fitted to data hierarchically using maximum likelihood. This has advantages over previous methods in allowing statistical comparisons between models. When applied to a previously analyzed data set, the method allowed us to statistically establish that slippage rate increases with microsatellite length for dinucleotide microsatellites in humans, mice, and fruit flies, and suggested that no slippage occurs in very short microsatellites of one to four repeats. The suggestion that slippage rates are zero or close to zero for very short microsatellites of one to four repeats has important implications for understanding the mechanism of polymerase slippage.


    Introduction
 TOP
 Abstract
 Introduction
 Models
 Results
 Discussion
 Acknowledgements
 literature cited
 
Microsatellite loci, alleles of which are distinguished by different numbers of repeats of short sequences of nucleotides, are widely dispersed throughout the genomes of eukaryotes. They may have relatively high mutation rates (10-3 per generation) and large numbers of alleles, but the largest alleles generally have fewer than 100 repeats (Tautz 1993Citation ). Recently, there has been much interest in obtaining an improved understanding of the evolutionary processes that produce microsatellites (Garza, Slatkin, and Freimer 1995Citation ; Bell and Jurka 1997Citation ; Wierdl, Dominska, and Petes 1997Citation ; Kruglyak et al. 1998, 2000Citation ; Falush and Iwasa 1999Citation ). Models of the evolutionary processes that produced allelic frequency distributions have mostly been based on or developed from Moran's (1975)Citation analysis of random-walk models and are known collectively as stepwise-mutation models because alleles are assumed to mutate upward or downward in "steps" consisting (usually) of one repeat. Recent examples that do not assume that the mutation step is one repeat include those of Fu and Chakraborty (1998)Citation and Morris et al. (1995)Citation . In the early versions of these models, there were no evolutionary processes restricting the excursions of the random walks. Negative integers were sometimes achieved, together with arbitrarily high integers. Neither of these outcomes is realistic for microsatellites, which are necessarily positive and generally less than 100 (Tautz 1993Citation ).

The nonequilibrium nature of the original random-walk models poses the problem that some populations are expected to reach allele sizes larger than those observed in nature. This raises the question of what prevents populations reaching large allele sizes. A plausible explanation of the absence of long alleles has recently been suggested whereby point mutations occurring within microsatellites break them into smaller units (Bell and Jurka 1997Citation ; Kruglyak et al. 1998Citation ). The proposed evolutionary process can be modeled as a Markov chain model in which point mutations and slippage act simultaneously (Kruglyak et al. 1998, 2000Citation ; Durrett and Kruglyak 1999Citation ). Kruglyak et al. (1998)Citation investigated a model in which slippage mutation rate increases linearly with microsatellite length, measured as the number of repeats. This model was fitted to DNA data for di-, tri-, and tetranucleotide repeats obtained by scanning approximately one million base pairs of nonoverlapping DNA each from humans, mice, and fruit flies. The data were compiled as frequency distributions, with each microsatellite locus contributing one datum. We consider this model to be a major advance, but we have some reservations. In the first place, the definition of a microsatellite is unconventional, referring to a stretch of repeats bounded at one end, not both ends, by nonrepeating DNA. With microsatellites so defined it is natural to fit the calculated evolutionary equilibrium distributions to cumulated frequency data, but this has the consequence that some observations contribute many times rather than just once to the observed cumulative frequencies, such that observations are not independent. Second, the fitting procedure minimized the sum of absolute differences between observed and expected cumulative frequencies. The validity of this procedure depends on the assumption, which cannot be correct, that the variance of cumulative frequency is independent of microsatellite length. Finally, there is no way other than bootstrapping to estimate the standard errors of the parameter estimates.

Here, we show how these problems can be avoided using a modeling approach that defines microsatellites in the conventional way. We show how maximum-likelihood estimates of model parameters can be calculated using as an example a data set previously analyzed by Kruglyak et al. (1998)Citation , and we introduce two new, nested, models of slippage. The new approach allows statistical demonstration that the slippage mutation rate increases with microsatellite length and that the line of best fit intercepts the x-axis to the right of length 1. This has the important biological implication that very short microsatellites are not at risk of slippage mutation.


    Models
 TOP
 Abstract
 Introduction
 Models
 Results
 Discussion
 Acknowledgements
 literature cited
 
The models described here are continuous time Markov chain models. The states of the chain are the positive integers, 1, 2, 3, ... , each of which corresponds to the number of repeats at a microsatellite locus. Only perfect microsatellites are considered, where "perfect" means that (1) the microsatellite does not contain mutations or interruptions, and (2) the left-hand flanking region of the microsatellite does not contain further repeats. Thus, in the case of an interrupted microsatellite containing one mutation, only the region to the left of the mutation would qualify as a perfect microsatellite. A perfect allele with i repeats is said to be in state i. The general transition matrix for the Markov chain models is given in table 1 and justified below.


View this table:
[in this window]
[in a new window]
 
Table 1 Transition Matrix for the Markov Chain Models, Corresponding to Equations (1)–(3)

 
For each generation, three types of transition may occur.

  1. The number of repeats may change by ±1 unit because of polymerase slippage. For i >= 2, let si be the slippage mutation rate from state i, defined here as the per-generation probability that a microsatellite of length i >= 2 mutates by slippage to length i + 1; si is also the probability of a slippage mutation to length i - 1. Our three main models differ in the dependence of this slippage mutation rate on length. There may be no length dependency—here termed the "constant" model. This was the original model of Moran (1975)Citation . In this case, si = s, where s is the length-independent slippage rate. In the second model, the slippage mutation rate increases linearly with microsatellite length at rate si = (i - 1)·b, where b is the per-repeat slippage rate. This is like the model introduced by Kruglyak et al. (1998)Citation ; we shall refer to it as the "linear no-constant" model. In the third model, termed here the "linear-with-constant" model, slippage occurs at the rate si = b1 + (i - 1)·b2.
  2. Point mutation may occur at any point within a microsatellite, breaking it into two smaller microsatellites. Only the one to the left of the point mutation is considered perfect within the terms of the definition above, and the other disappears from the model. A microsatellite of i repeat units will move to any of the states 1, 2, ... , i - 1 at point mutation rate a.
  3. A transition from i = 1 to i = 2 due to specific base substitutions occurs at rate c. This assumption is necessary to prevent i = 1 from being an absorbing state.

Let pi be the probability that a randomly selected perfect microsatellite is of length i, i >= 1, with


(1)
We now derive the equilibrium distributions of these slippage models (stationarity is guaranteed by the results of Durrett and Kruglyak 1999Citation ). We treat the cases i >= 2 and i = 1 separately, starting with the former. It is convenient to distinguish states i and smaller, which we refer to as set A, and those states larger than i, which we term set B. Transitions from set A to set B occur only by slippage from state i to state i + 1, with frequency pisi. Transitions from B to A occur by both slippage and point mutation. Transitions occur by slippage from state i + 1 to state i, with frequency pi+1si+1. Transitions occur by point mutation from state j within B to any given state in A with frequency pja, and since there are i states in A, the total frequency of transitions by point mutation from state j to A is pjai. The total frequency of transitions by point mutation from set B to A is therefore ai {Sigma}{infty}j>i pj. Since in equilibrium the frequency of transitions from A to B equals the frequency of transitions the other way, we have for i >= 2:


(2)
The treatment for i = 1 is the same, except that transitions from A (now comprising only state 1) to B do not occur by slippage but by specific base substitutions, with frequency p1c (see assumption 3, above). This gives


(3)
The three slippage models described here differ only in the form of si. In the constant model si = s, in the linear no-constant model si = b (i - 1), and in the linear-with-constant model si = b1 + (i - 1)·b2. The slippage mechanism has one parameter (s or b) in the constant and linear no-constant models, and two (b1 and b2) in the linear-with-constant model. In practice, the parameter a is redundant, as can be seen by dividing equations (2) and (3) by a, and one works with si/a and c/a. In the application presented below, the fitting procedure was conditioned on the absence of microsatellites smaller than five repeats in the data set, and for this reason c/a was not estimated. Given values of the slippage mechanism parameters, equations (1)–(3) can now be used to obtain the equilibrium distribution numerically.

The observed frequencies of microsatellite lengths in a sample of data have a multinomial distribution with parameters given by the above equilibrium distribution and the sample size, n (see, e.g., Ross 1997Citation ). The likelihood of the data can therefore be written in terms of n and the pi of the equilibrium distribution; since the pi are functions of the model parameters (e.g., b1/a and b2/a for the linear-with-constant model), we have the likelihood as a function of the model parameters and can estimate these parameters by maximum likelihood in the usual way (see, e.g., Silvey 1970Citation ). Standard errors of parameter estimates can be obtained via Fisher's information and nested models compared via likelihood ratio statistics in the usual manner. Maximum-likelihood estimates were found using the Quasi-Newton optimization routine in SAS, estimating derivatives and second derivatives numerically where necessary, and Fisher's information was estimated using the finite difference approximation routine (programs available on request).

In addition to the three slippage models described above, parameter estimates were calculated for two further models: the full model, in which a parameter is fitted for each observed microsatellite length—we used lengths between 5 and 34 repeats—and a "geometric model," according to which pi = pi-1(1 - p), where p = 1/16 for dinucleotide microsatellites. The geometric model represents the dinucleotide microsatellite frequency distribution which would result from the chance assemblage of bases if each base occurred with equal chance 1/4.


    Results
 TOP
 Abstract
 Introduction
 Models
 Results
 Discussion
 Acknowledgements
 literature cited
 
The observed and expected frequency distributions for each of the three slippage models are shown in figure 1a . The best fitting slippage models are illustrated in figure 1b . Note that the increase of slippage rate with length is greater in the linear-with-constant than in the linear no-constant model. Parameter values with SEs are given in table 2 , along with the log likelihoods of each of the three slippage models and of two further models: the full model, in which one parameter is estimated for each of the 30 observed microsatellite lengths, and the geometric model, in which the frequency distribution is calculated a priori so that no parameters are estimated from the data. Nested models can be compared within species using deviances calculated as twice the difference in log likelihood. The data are statistically independent because microsatellite loci evolve independently (Bell and Jurka 1997Citation ), and only one allele is measured per locus, so deviance may be distributed approximately as chi-square, with degrees of freedom calculated as the difference between the number of estimated parameters in the two models being compared. However, since some of the estimated frequencies were low, the chi-square assumption could be unreliable, and we therefore carried out parametric bootstrapping as a check. The agreement was generally good. We report the P values obtained by parametric bootstrapping.



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 1.—a, Frequency distribution (log10 scale) of perfect dinucleotide microsatellites in humans, mice, and fruit flies (data of Kruglyak et al. 1998Citation ), together with fitted distributions of the three slippage models. Where no microsatellite was observed, the frequency has been plotted as -1. b, Best-fitting slippage rates si in relation to microsatellite length for each of the three slippage models, with the parameter a assigned the value 2 x 10-8 per dinucleotide repeat per generation, as in Kruglyak et al. (1998)Citation . Dotted lines, constant model; dashed lines, linear no-constant model; solid lines, linear-with-constant model

 

View this table:
[in this window]
[in a new window]
 
Table 2 Log Likelihoods and Estimated Parameter Values of the Models Described in the Text Applied to Data of Kruglyak et al. (1998) from Humans, Mice, and Fruit Flies

 
The slippage models fit much better than the geometric model (table 2 ), but this is not tested formally because the geometric model is not nested within the others. For each species, the linear-with-constant model is better than the constant model (P < 0.001). This suggests that slippage rate increases with microsatellite length in each species. For each species, the linear-with-constant model is better than the linear no-constant model (P < 0.01), dramatically so for humans. Inspection of parameter values (table 2 ) shows that although the slippage mutation rate increases with microsatellite length, the line of best fit intercepts the x-axis to the right of the microsatellite of length 1 (fig. 1b ). Zero slippage is predicted to occur at length 4 in humans and fruit flies, and at length 3 in mice. Slippage is predicted to be close to 0 but negative for length 2, which is biologically impossible. To see if allowing negative slippage rates affected the form of the equilibrium distribution, we modified the best-fitting linear-with-constant models, replacing negative with small positive slippage rates. This had little effect on the form of the equilibrium frequency distribution, so we conclude that, at least in the present analyses, it did not matter that negative slippage rates were allowed. The full model is not significantly better than the linear-with-constant model for the data from fruit flies, although it is for mice (P < 0.001) and humans (P < 0.05). This suggests a more complicated relationship between microsatellite length and slippage in mice and humans.


    Discussion
 TOP
 Abstract
 Introduction
 Models
 Results
 Discussion
 Acknowledgements
 literature cited
 
Using a new approach to Markov chain modeling that defines microsatellites in the conventional way, we have introduced and fitted two new models, a "constant" model, in which there is no dependence on microsatellite length, and a "linear-with-constant" model. Fitting these and the linear no-constant model hierarchically to published data allowed us to establish that slippage rates increase with microsatellite length for dinucleotide microsatellites in humans, mice, and fruit flies and that slippage rates are around 0 for very short microsatellites of one to four repeats. The new modeling approach, together with the fitting procedure based on it, has a number of advantages. We discuss these points in turn, starting with the last.

The fitting procedure based on our new modeling approach avoids using cumulative frequencies. This avoids multiple representation of some observations and allows calculation of likelihoods. Maximum-likelihood methods can then be deployed to calculate parameter estimates and their standard errors. The likelihood approach also allows hierarchical comparison of models, and we used it to show that slippage rate increases with microsatellite length. An increase in slippage rate with length is plausible because longer microsatellites give more opportunity for slippage. The result is difficult to demonstrate in studies of individual microsatellite loci, although a number of studies have suggested it (Weber 1990Citation ; Primmer et al. 1996Citation ; Wierdl, Dominska, and Petes 1997Citation ). Following Kruglyak et al. (1998)Citation , models have here been fitted to data pooled over different types of repeat motif. The legitimacy of this procedure, however, depends on slippage rates being unaffected by motif type, and we suspect this assumption is not wholly correct. Further work is currently in progress to characterize the effects of motif type.

The suggestion that slippage rates are close to 0 for very short microsatellites of one to four repeats has important implications for the likely mechanism of polymerase slippage. During DNA replication, the elongating DNA strand and the template strand temporarily dissociate; the maximum length of the elongating strand at this stage is not known. During reassociation, the end of the elongated strand may attach to the wrong repeat, producing the phenomenon referred to as polymerase slippage. The suggestion that this does not occur for dinucleotide microsatellites shorter than five repeats suggests that the reattaching end is about 8 bp long. Interestingly, Rose and Falush (1998)Citation also reached the conclusion that little slippage mutation occurs below 8 bp on the basis of analysis of the mono-, di-, and tetranucleotide microsatellites of Saccharomyces cerevisiae.

A number of further developments of these models are possible. In particular, mutation bias, in which for each microsatellite length upward mutations are more frequent than downward mutations, could be readily incorporated into the models and would be interesting in view of the demonstrated importance of this asymmetry at some loci in barn swallows (Primmer et al. 1998Citation ) and humans (Weber and Wong 1993Citation ).

It would be attractive to develop an intuitive understanding of the relationship between the slippage models in figure 1b and the frequency distributions they produce (fig. 1a ). The constant models (dotted lines) produced log frequency distributions that are convex viewed from above, whereas both linear models are concave. The increase in slippage rate with microsatellite length seems to be necessary to account for the observed number of larger microsatellites. The better fit of the linear-with-constant than of the no-constant model may result from its use of an extra parameter to increase the curvature of the fitted frequency distribution. The linear no-constant models did not fit well the frequencies of the smallest reported microsatellites (fig. 1a ) because their slippage rates were constrained to go near the origin. It can be seen from the graphs that the first observations are highly influential for the linear no-constant model and much less so for the linear-with-constant model. The linear no-constant models are consequently considerably more sensitive than linear-with-constant models to the cutoff point used to define minimum microsatellite length (five repeats in the data of Kruglyak et al. "1998"Citation ).


    Acknowledgements
 TOP
 Abstract
 Introduction
 Models
 Results
 Discussion
 Acknowledgements
 literature cited
 
We thank M. Wilkinson and M. Pagel for stimulating discussion, and M. Pagel, D. J. Balding, R. N. Curnow, R. Durrett, and S. Kruglyak for comments on an earlier version of the manuscript. M.T. was funded by EPSRC.


    Footnotes
 
Yun-Xin Fu, Reviewing Editor

1 Keywords: microsatellite evolution polymerase slippage Markov chain equilibrium models Back

2 Address for correspondence and reprints: Richard M. Sibly, School of Animal and Microbial Sciences, P.O. Box 228, University of Reading, Reading RG6 6AH, United Kingdom. E-mail: r.m.sibly{at}reading.ac.uk Back


    literature cited
 TOP
 Abstract
 Introduction
 Models
 Results
 Discussion
 Acknowledgements
 literature cited
 

    Bell, G. I., and J. Jurka. 1997. The length distribution of perfect dimer repetitive DNA is consistent with its evolution by an unbiased single-step mutation process. J. Mol. Evol. 44:414–421.[ISI][Medline]

    Durrett, R., and S. Kruglyak. 1999. A new stochastic model of microsatellite evolution. J. Appl. Prob. 36:621–631.[ISI]

    Falush, D., and Y. Iwasa. 1999. Size-dependent mutability and microsatellite constraints. Mol. Biol. Evol. 16:960–966.[Free Full Text]

    Fu, Y.-X., and R. Chakraborty. 1998. Simultaneous estimation of all parameters of a stepwise mutation model. Genetics 150:487–497.

    Garza, J. C., M. Slatkin, and N. B. Freimer. 1995. Microsatellite allele frequencies in humans and chimpanzees, with implications for constraints on allele size. Mol. Biol. Evol. 12:594–603.[Abstract]

    Kruglyak, S., R. T. Durrett, M. D. Schug, and C. F. Aquadro. 1998. Equilibrium distributions of microsatellite repeat length resulting from a balance between slippage events and point mutations. Proc. Natl. Acad. Sci. USA 95:10774–10778.

    ———. 2000. Distribution and abundance of microsatellites in the yeast genome can be explained by a balance between slippage events and point mutations. Mol. Biol. Evol. 17:1210–1219.[Abstract/Free Full Text]

    Moran, P. A. P. 1975. Wandering distributions and the electrophoretic profile. Theor. Popul. Biol. 8:318–330.[ISI][Medline]

    Morris, A., N. E. Morton, A. Collins, J. Macpherson, D. Nelson, and S. Sherman. 1995. An n-allele model for progressive amplification in the FMR1 locus. Proc. Natl. Acad. Sci. USA 92:4833–4837.

    Primmer, C. R., N. Saino, H. Ellegren, and A. P. Moller. 1998. Unravelling the processes of microsatellite evolution through analysis of germ line mutations in barn swallows Hirundo rustica. Mol. Biol. Evol. 15:1047–1054.[Free Full Text]

    Rose, O., and D. Falush. 1998. A threshold size for microsatellite expansion. Mol. Biol. Evol. 15:613–615.[Free Full Text]

    Ross, S. 1997. A first course in probability. Prentice Hall, Upper Saddle River, N.J.

    Silvey, S. D. 1970. Statistical inference. Chapman and Hall, London.

    Tautz, D. 1993. Notes on the definition and nomenclature of tandemly repetitive DNA sequences. Pp. 21–28 in S. D. J. Pena, R. Chakraborty, J. T. Epplen, and A. J. Jeffreys, eds. DNA fingerprinting: state of the science. Birkhauser, Basel, Switzerland.

    Weber, J. L. 1990. Informativeness of human (dC-dA)n(dC-dT)n polymorphisms. Genomics 7:517–524.

    Weber, J. L., and C. Wong. 1993. Mutation of human short tandem repeats. Hum. Mol. Genet. 2:1123–1128.[Abstract]

    Wierdl, M., M. Dominska, and T. D. Petes. 1997. Microsatellite instability in yeast: dependence on the length of the microsatellite. Genetics 146:769–779.

Accepted for publication November 20, 2000.