* Department of Mathematics
Program in Molecular and Computational Biology, Department of Biological Sciences, University of Southern California
Correspondence: E-mail: fsun{at}hto.usc.edu.
Abstract |
---|
Key Words: microsatellites, Markov processes, branching processes
Introduction |
---|
Based on phylogenetic analysis, Messier, Li, and Stewart (1996) suggested a minimum number of repeat units for slippage mutations. Using a simple mathematical model, Rose and Falush (1998) demonstrated the existence of a minimum threshold size for slippage mutations by studying the ratio between the observed frequency and the expected frequency of microsatellites. The estimated threshold size was about eight nucleotides long irrespective of different motifs for mononucleotides, dinucleotides, and tetranucleotides. The study suggested more complicated mechanisms for microsatellite slippage mutations (Rose and Falush 1998). However, Pupko and Graur (1999) debated the existence of threshold sizes for slippage mutations.
In experimental studies of human microsatellite mutations in vivo, high mutation rates of about $10^{-4}$ to $10^{-2}$ per locus per generation were observed. Besides single-step mutational events, some multistep mutational events were also observed. Zhang et al. (1994) observed that longer trinucleotide repeats had much higher mutation rates than short ones and that contractions occurred more frequently than expansions. Xu et al. (2000) observed more mutations and contractions for longer tetranucleotide repeats. Bacon, Dunlop, and Farrington (2001) observed high mutation rates for mononucleotides. Huang et al. (2002) observed that the mutation rate increased and the probability of expansion given mutation occurrence decreased as the number of repeat units increased for dinucleotides. Length-dependent mutation patterns of microsatellites were also observed in different organisms, such as flies (Harr and Schlötterer 2000) and yeast (Wierdl, Dominska, and Petes 1997). In all those experiments, the numbers of observed mutations were not large enough to give clear patterns for the relationship between microsatellite slippage mutation rate and the number of repeat units.
With the whole genome sequence available, it is possible to collect a large volume of data for microsatellite distributions. The equilibrium assumption states that the observed distributions in this generation are the same as those in the next generation. Combined with the equilibrium assumption, these data make it possible to estimate microsatellite mutation rates. Bell and Jurka (1997) first proposed such an approach and applied it to some genome sequences. Kruglyak et al. (1998, 2000) extended this idea and proposed a novel estimation method. Sibly, Whittaker, and Talbot (2001) further generalized it with a maximum likelihood estimation method. Those studies were based on the symmetric single stepwise model, which assumes the expansion rate to be the same as the contraction rate. A recent study by Sibly et al. (2003) found that the symmetric single stepwise model for microsatellite slippage mutations cannot explain the observed human sequence data. Calabrese and Durrett (2003) found that it was difficult to model microsatellite slippage mutations using simple functions. They observed a bias toward contraction for long microsatellites by assuming a quadratic model or piecewise linear model for slippage mutation rates. Most of the previous approaches were based on the single stepwise mutation model. This simplified model can reflect microsatellite mutation mechanisms because single-step mutational events were the major mutational events observed in experiments. In previous studies (Bell and Jurka 1997; Kruglyak et al. 1998, 2000; Sibly, Whittaker, and Talbot 2001; Calabrese and Durrett 2003; Sibly et al. 2003), a constant, linear, or quadratic relationship between microsatellite slippage mutation rate and the number of repeat units was assumed. Such assumptions are not strongly supported by the experimental results (Zhang et al. 1994; Xu et al. 2000; Bacon, Dunlop, and Farrington 2001; Huang et al. 2002).
In this study, we propose a novel method using two sets of equations based on two stochastic processes to estimate microsatellite slippage mutation rates. This study differs from previous studies by introducing a new multi-type branching process in addition to the stationary Markov process proposed before (Bell and Jurka 1997; Kruglyak et al. 1998, 2000; Sibly, Whittaker, and Talbot 2001; Calabrese and Durrett 2003; Sibly et al. 2003). The distributions from the two processes make it possible to estimate microsatellite slippage mutation rates without assuming any relationship between microsatellite slippage mutation rate and the number of repeat units. We apply our method to the sequence data from the human genome. We also develop a novel method for estimating the threshold size for slippage mutations. In the following paragraphs, we first explain our method for data collection and the mathematical model; we then present estimation results.
Materials and Methods |
---|
Data Collection
We downloaded the human genome sequence from the National Center for Biotechnology Information database ftp://ftp.ncbi.nih.gov/genbank/genomes/H_sapiens/OLD/ (updated on September 06, 2001). We collected mono-, di-, tri-, tetra-, penta-, and hexanucleotides in two different schemes. The first scheme is simply to collect all repeats, that is, microsatellites without interruptions among the repeats. The second scheme is to collect perfect repeats (Sibly, Whittaker, and Talbot 2001), such that there are no interruptions among the repeats and the left flanking region (up to $2l$ nucleotides) does not contain the same motifs when microsatellites (of motif with $l$ nucleotide bases) are collected. Mononucleotides were excluded when di-, tri-, tetra-, penta-, and hexanucleotides were collected; dinucleotides were excluded when tetra- and hexanucleotides were collected; trinucleotides were excluded when hexanucleotides were collected. For a fixed motif of $l$ nucleotide bases, microsatellites with the number of repeat units greater than 1 were collected in the above manner. The number of microsatellites with one repeat unit was roughly calculated by $\left[(\text{total number of counted nucleotides}) - \sum_{i>1} l \times i \times (\text{number of microsatellites with } i \text{ repeat units})\right]/l$. All the human chromosomes were processed in such a manner. Table 1 gives an example of the two schemes.
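The first collection scheme and the rough one-unit count can be illustrated with a minimal Python sketch. It is a simplification and not the pipeline used in this study: it handles one fixed motif on one strand, and it does not apply the exclusion rules for shorter motifs or the flanking-region rule of the second scheme. The function names `count_all_repeats` and `one_unit_count` are hypothetical.

```python
import re
from collections import Counter


def count_all_repeats(seq, motif):
    """Count maximal uninterrupted tandem runs of `motif` in `seq`.

    Returns a Counter mapping the number of repeat units to the number
    of runs observed with that many units.
    """
    counts = Counter()
    # (?:motif)+ greedily matches one maximal tandem run of the motif.
    for match in re.finditer(f"(?:{re.escape(motif)})+", seq):
        counts[len(match.group(0)) // len(motif)] += 1
    return counts


def one_unit_count(total_nucleotides, counts, motif_len):
    """Rough number of one-unit microsatellites, following the formula in the text:
    [total nucleotides - sum over i > 1 of l * i * (number with i units)] / l.
    """
    used = sum(motif_len * i * n for i, n in counts.items() if i > 1)
    return (total_nucleotides - used) / motif_len


# Toy sequence (an assumption for illustration): one run of 3 units, one of 1 unit.
toy_seq = "ACACACGTACGG"
toy_counts = count_all_repeats(toy_seq, "AC")
toy_single = one_unit_count(len(toy_seq), toy_counts, motif_len=2)
```

Note that the one-unit count is derived from the nucleotide total rather than from the scan, as described above, so it differs from the number of isolated motif copies found by the regular expression.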
Modeling all repeats
For all repeats, we consider the following stochastic process:
Let $Z_n = (Z_{n1}, Z_{n2}, \ldots, Z_{nN})$, $n = 0, 1, 2, \ldots$, be the corresponding stochastic process, where $Z_{nk}$ is the number of microsatellites with $k$ repeat units after $n$ generations. $\{Z_n\}$ forms a multi-type branching process. The first moments matrix (Harris 1963) is given by
|
From the Perron-Frobenius Theorem (Harris 1963), the Perron-Frobenius eigenvalue of $M$ is greater than 1, and we denote it as $1 + \lambda$. We denote the left Perron-Frobenius eigenvector of $M$ as $p = (p_1, p_2, \ldots, p_N)$. From the theory of multi-type branching processes (Harris 1963; Athreya and Ney 1972; Athreya and Vidyashankar 1995), we have $\lim_{n \to \infty} Z_n/|Z_n| = p$. Here $|\cdot|$ means to sum over all the entries of a vector. Therefore, the distribution of all repeats will converge to $p$. From the equation $pM = (1 + \lambda)p$, we have
|
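As an illustration of the convergence statement, the following Python sketch computes the Perron-Frobenius eigenvalue and the normalized left eigenvector of a first moments matrix with NumPy. The 3-type matrix `M_toy` is a made-up example chosen only so that its leading eigenvalue exceeds 1; it is not the matrix of this study, and the function name is hypothetical.

```python
import numpy as np


def limiting_type_distribution(M):
    """Return (rho, p): the Perron-Frobenius eigenvalue of M and its left
    eigenvector normalized so that its entries sum to 1.
    """
    # Right eigenvectors of M^T are left eigenvectors of M.
    eigvals, eigvecs = np.linalg.eig(M.T)
    idx = np.argmax(eigvals.real)
    rho = eigvals[idx].real
    # The Perron-Frobenius eigenvector has entries of one sign; abs() fixes the sign.
    p = np.abs(eigvecs[:, idx].real)
    return rho, p / p.sum()


# Hypothetical mean-offspring matrix: entry (i, j) is the expected number of
# type-j microsatellites produced by one type-i microsatellite per generation.
M_toy = np.array([[1.00, 0.01, 0.00],
                  [0.02, 0.98, 0.01],
                  [0.03, 0.02, 0.97]])

rho, p = limiting_type_distribution(M_toy)
lam = rho - 1.0  # plays the role of lambda in "1 + lambda"
```

Repeatedly applying `z = z @ M_toy` to any positive starting vector and renormalizing approaches the same `p`, which mirrors the limit of $Z_n/|Z_n|$ in expectation.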
Modeling perfect repeats
For perfect repeats, we consider the following Markov process proposed in previous studies (Bell and Jurka 1997; Kruglyak et al. 1998, 2000; Sibly, Whittaker, and Talbot 2001; Calabrese and Durrett 2003; Sibly et al. 2003).
Let $X_n$, $n = 0, 1, 2, \ldots$, be the corresponding stochastic process. $\{X_n\}$ forms a Markov process. The transition matrix is given by
|
From the theory of Markov processes, there is a stationary distribution $q = (q_1, q_2, \ldots, q_N)$ with $qP = q$, which is equivalent to
|
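The stationary condition $qP = q$ can be solved numerically as a linear system. The sketch below is a generic illustration in Python with NumPy; the 3-state transition matrix `P_toy` is a made-up birth-death chain and not the transition matrix of this study, and the function name is hypothetical.

```python
import numpy as np


def stationary_distribution(P):
    """Solve q P = q together with sum(q) = 1 by least squares."""
    n = P.shape[0]
    # Stack the n equations (P^T - I) q = 0 with the normalization row sum(q) = 1.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    q, *_ = np.linalg.lstsq(A, b, rcond=None)
    return q


# Hypothetical chain: each state moves one step up, one step down, or stays.
P_toy = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.0, 0.3, 0.7]])

q = stationary_distribution(P_toy)  # approximately [0.6, 0.3, 0.1]
```

For this toy chain the result can be checked by hand with detailed balance: $0.1\,q_1 = 0.2\,q_2$ and $0.1\,q_2 = 0.3\,q_3$.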
Two Sets of Equations
Note that $a$ is a nuisance parameter in both models. We can only estimate the expansion and contraction slippage rates relative to the point mutation rate. We divide both sides of equations (1) and (2) by $a$ and write $\lambda$, $e_k$, and $c_k$ for the previous $\lambda/a$, $e_k/a$, and $c_k/a$, respectively. We have the following two fundamental equations.
|
Compared to microsatellite slippage mutation rates, point mutation rates are relatively small. The difference between the matrices M and P is on the order of the point mutation rate a, which is very small. Therefore, we expect only slight differences between the two distributions p and q when they are normalized.
For convenience, the above point mutation rate a is the point mutation rate for the whole motif. We will apply our estimation method to sequence data of mono-, di-, tri-, tetra-, penta-, and hexa- nucleotides. Therefore, a is different for microsatellites with motifs of different numbers of nucleotide bases. The estimation results are the relative ratios between the slippage mutation rate and point mutation rate. To keep the estimation results comparable, we will multiply the estimated slippage mutation rates by the motif length l.
Threshold Size
We define the microsatellite slippage threshold size $T$ as the number of repeat units such that $c_k = 0$ for $2 \le k \le T$ and $c_k > 0$ for $k > T$. Below this threshold size $T$, there are almost no slippage mutations; above $T$, microsatellite slippage mutations dominate point mutations.
For the observed distributions $\{p_k\}$ for all repeats and $\{q_k\}$ for perfect repeats, we consider their sequential ratios $\{p_{k+1}/p_k\}$ and $\{q_{k+1}/q_k\}$. A null hypothesis is that there is no microsatellite slippage mutation and that microsatellites are generated by random arrangement of different nucleotides (Pupko and Graur 1999; Rose and Falush 1998). Under this hypothesis, $\{p_k\}$ and $\{q_k\}$ should follow a geometric distribution. Therefore, we expect the sequential ratios to all be at a relatively low and constant level.
If the sequential ratios can keep a relatively low and constant level up to L, then the observed fractions of states up to L + 1 can be explained by the above null hypothesis. This implies that there is almost no slippage mutation from L + 2 to L + 1. Therefore, we can estimate the threshold size T by L + 2.
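The rule above can be sketched in Python as follows. The text does not give a numerical criterion for "relatively low and constant", so the sketch assumes one: a ratio counts as constant while it stays within a relative tolerance `rel_tol` of the first ratio. Both that criterion and the function names are assumptions for illustration, and all counts are assumed to be positive.

```python
def sequential_ratios(counts):
    """counts[k - 1] is the number of microsatellites with k repeat units.

    Returns the list of ratios count(k + 1) / count(k) for k = 1, 2, ...
    """
    return [counts[k + 1] / counts[k] for k in range(len(counts) - 1)]


def estimate_threshold(counts, rel_tol=0.25):
    """Estimate the threshold size T = L + 2, where L is the largest k such that
    the ratios for 1..k all stay within rel_tol of the first ratio (assumed rule).
    """
    ratios = sequential_ratios(counts)
    baseline = ratios[0]
    L = 0
    for r in ratios:
        if abs(r - baseline) > rel_tol * baseline:
            break
        L += 1
    return L + 2


# Toy counts (an assumption): geometric decay with ratio 0.1 up to 4 units,
# then a much slower decay once slippage is taken to set in.
toy_counts = [1_000_000, 100_000, 10_000, 1_000, 800, 700]
T_hat = estimate_threshold(toy_counts)
```

In the toy data the first three ratios equal 0.1, so $L = 3$ and the estimate is $T = 5$. In practice the cutoff might be chosen by inspecting a plot of the ratios rather than by a fixed tolerance.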
Estimating Slippage Mutation Rates
When the number of repeat units is below $T$, microsatellite slippage mutation rates are small and can be regarded as 0. In the following paragraphs, we will examine only slippage mutation rates of microsatellites with a number of repeat units greater than $T$. Statistically, the estimated results will be reliable only when we have a large number of observations. Therefore, we estimate slippage mutation rates of microsatellites with a number of repeat units ranging from $T + 1$ to $H - 1$, where $H$ is the minimum number of repeat units for which either the observed number of all repeats or perfect repeats with $H$ repeat units is less than 100.
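The choice of the estimation range can be written as a short Python helper. This is a sketch with a hypothetical function name; it assumes both count lists are indexed so that position `k - 1` holds the count for `k` repeat units, and it returns one past the last available class if no count falls below the cutoff.

```python
def estimation_range(all_counts, perfect_counts, T, min_obs=100):
    """Return (first, last, H): rates are estimated for first..last = T+1..H-1,
    where H is the smallest repeat number whose count of all repeats or of
    perfect repeats is below min_obs.
    """
    n = min(len(all_counts), len(perfect_counts))
    H = next(
        (k for k in range(1, n + 1)
         if all_counts[k - 1] < min_obs or perfect_counts[k - 1] < min_obs),
        n + 1,  # assumed fallback when every class has enough observations
    )
    return T + 1, H - 1, H


# Toy counts (assumptions for illustration only).
toy_all = [1000, 500, 300, 150, 99, 50]
toy_perfect = [900, 400, 200, 120, 101, 40]
first, last, H = estimation_range(toy_all, toy_perfect, T=2)
```

Here the count of all repeats first drops below 100 at 5 repeat units, so $H = 5$ and the range is 3 to 4.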
The estimated threshold size for microsatellite slippage mutation is useful for computing the Perron-Frobenius eigenvalue $1 + \lambda$. At the threshold size $T$, we set the contraction slippage mutation rate $c_T = 0$. Then $\lambda$ and $e_{T-1}$ can be obtained by directly solving equations (3) for $k = T - 1$. With $\lambda$ available, we can estimate $e_k$ and $c_{k+1}$ using equations (3) for $k \ge T$. Owing to random variation in the observations, some of the solved values for $e_k$ and $c_{k+1}$ are negative. It was observed from experiments (Zhang et al. 1994; Xu et al. 2000; Huang et al. 2002) that the contraction slippage mutation rate increased with the number of repeat units. We thus use the following strategy to guarantee non-negative solutions: if the direct solutions $e_k$ and $c_{k+1}$ from equations (3) are both non-negative, we accept them. Otherwise, we set $c_{k+1} = c_k$ and compute $e_k$ using the least squares method for equations (3). The confidence intervals for our estimated slippage mutation rates can be obtained using the bootstrap method (Efron 1979).
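A percentile bootstrap for count data of this kind might look like the following Python sketch. The resampling scheme is an assumption, since the text does not specify it: each replicate redraws the counts per repeat-number class from a multinomial distribution with the observed frequencies, which corresponds to resampling individual microsatellites with replacement. The estimator `ratio_2_over_1` is a stand-in chosen for illustration, not the rate estimator defined by equations (3).

```python
import numpy as np


def bootstrap_ci(counts, estimator, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for estimator(counts).

    counts: observed number of microsatellites per repeat-number class.
    estimator: function mapping a count vector to a number (or an array).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    total = counts.sum()
    probs = counts / total
    # Recompute the estimator on each multinomial replicate of the counts.
    stats = np.array([estimator(rng.multinomial(total, probs))
                      for _ in range(n_boot)])
    alpha = (1.0 - level) / 2.0
    return np.quantile(stats, [alpha, 1.0 - alpha], axis=0)


def ratio_2_over_1(c):
    """Stand-in estimator: sequential ratio between classes 2 and 1."""
    return c[1] / c[0]


toy_counts = [5000, 1000, 300]  # made-up counts for illustration
lower, upper = bootstrap_ci(toy_counts, ratio_2_over_1)
```

Because rare classes contribute few observations, intervals computed this way widen as the number of repeat units grows.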
Results |
---|
Figure 1 shows the observed frequency on a logarithmic scale for all repeats $\{p_k\}$ and perfect repeats $\{q_k\}$ (see Materials and Methods for details). We observe that mononucleotides are the most abundant microsatellites in the human genome, followed by dinucleotides, trinucleotides, etc. Microsatellites can contain a large number of repeat units, with more than 65 observed for mononucleotides and more than 49 for dinucleotides. Overall, microsatellite frequencies decrease exponentially as the number of repeat units increases. But the shape of the frequency distribution is not regular, with different slopes in different intervals of the number of repeat units. Around repeat 36 for mononucleotides and repeat 10 for tetranucleotides, we observe "humps." The complicated shape of microsatellite frequency distributions indicates that the microsatellite mutation mechanism is complicated.
We obtained 95% confidence intervals using the bootstrap method. Those confidence intervals for our estimations are shown in figure 3 and figure 4. Because the number of microsatellites decreases rapidly as the number of repeat units increases, the interval becomes wider as the number of repeat units increases.
Discussion |
---|
Using two sets of equations based on a multi-type branching process and a Markov process, we estimated mutation rates of microsatellites in the human genome without assuming any relationship between microsatellite slippage mutation rate and the number of repeat units. The novelty of this study is the introduction of a multi-type branching process. In previous studies involving only the Markov process, some relationship between the microsatellite slippage mutation rate and the number of repeat units had to be assumed. Our method can also be applied to estimate microsatellite mutation mechanisms for other organisms when large amounts of genome sequence data are available. This makes it possible to compare microsatellite mutation mechanisms among different organisms.
We observed an exponentially increasing trend for the estimated slippage mutation rates and a decreasing trend for the estimated slippage expansion ratios. The total slippage mutation rate may differ by up to $10^3$- to $10^4$-fold for different numbers of repeat units. Our estimation results are consistent with experimental studies (Zhang et al. 1994; Xu et al. 2000; Bacon, Dunlop, and Farrington 2001; Huang et al. 2002) and computational studies (Calabrese and Durrett 2003). Long microsatellites are highly unstable and likely to mutate. When slippage mutations happen, expansions occur more frequently if the number of repeat units is small, and contractions occur more frequently if the number of repeat units is large. When mutations happen, long microsatellites are likely to mutate to shorter ones; short microsatellites are likely to mutate to longer ones. The scarcity of large numbers of repeat units in a microsatellite locus can be explained by the high mutation rate and downward mutation bias when the number of repeat units is large.
As Calabrese and Durrett (2003) have pointed out, it is difficult to describe microsatellite slippage mutation rates using simple functions. We observe complicated patterns in our estimated results, which suggests that the microsatellite slippage mutation mechanism is complicated.
It is possible that genetic characteristics of local sequences influence the microsatellite mutation mechanism. Calabrese and Durrett (2003) applied comparative studies to show that local dinucleotide distributions were not significantly different for regions with different local recombination rates, proximity to genes, local GC contents, location on the chromosome, and proximity to Alu repeats. Such results support the approach of estimating microsatellite slippage mutation rates using whole genome sequence data.
There are several limitations to our approach. One is that we grouped all the motifs with the same length together in this study. Different motifs may have different mutation mechanisms, and their mutation mechanisms need to be studied separately when enough data become available. In the present study, we assumed that the distribution of the number of perfect repeats and all repeats had achieved equilibrium, a common assumption in almost all the studies of similar type. An important question is how to test if the distributions have achieved equilibrium. These questions need to be considered in future studies.
Acknowledgements |
---|
Footnotes |
---|
Adam Eyre-Walker, Associate Editor
Literature Cited |
---|
Ashley, C. T., and S. T. Warren. 1995. Trinucleotide repeat expansion and human disease. Annu. Rev. Genet. 29:703-728.
Athreya, K. B., and P. E. Ney. 1972. Branching processes. Springer-Verlag, Berlin.
Athreya, K. B., and A. N. Vidyashankar. 1995. Large deviation rates for branching processes. II. The multitype case. Ann. Appl. Probab. 5:566-576.
Bacon, A., M. G. Dunlop, and S. M. Farrington. 2001. Hypermutability at a poly(A/T) tract in the human germline. Nucleic Acids Res. 29:4405-4413.
Bell, G. I., and J. Jurka. 1997. The length distribution of perfect dimer repetitive DNA is consistent with its evolution by an unbiased single-step mutation process. J. Mol. Evol. 44:414-421.
Calabrese, P., and R. Durrett. 2003. Dinucleotide repeats in the Drosophila and human genome have complex, length-dependent mutation processes. Mol. Biol. Evol. 20:715-725.
Efron, B. 1979. Bootstrap methods: another look at the jackknife. Ann. Stat. 7:1-26.
Harr, B., and C. Schlötterer. 2000. Long microsatellite alleles in Drosophila melanogaster have a downward mutation bias and short persistence times, which cause their genome-wide underrepresentation. Genetics 155:1213-1220.
Harris, T. E. 1963. The theory of branching processes. Springer-Verlag, Berlin.
Huang, Q. Y., F. H. Xu, H. Shen, H. Y. Deng, Y. J. Liu, Y. Z. Liu, J. L. Li, R. R. Recker, and H. W. Deng. 2002. Mutation patterns at dinucleotide microsatellite loci in humans. Am. J. Hum. Genet. 70:625-634.
Kong, A., D. F. Gudbjartsson, and J. Sainz, et al. (16 co-authors). 2002. A high-resolution recombination map of the human genome. Nat. Genet. 31:241-247.
Kruglyak, S., R. T. Durrett, M. D. Schug, and C. F. Aquadro. 1998. Equilibrium distribution of microsatellite repeat length resulting from a balance between slippage events and point mutations. Proc. Natl. Acad. Sci. USA 95:10774-10778.
Kruglyak, S., R. T. Durrett, M. D. Schug, and C. F. Aquadro. 2000. Distribution and abundance of microsatellites in the yeast genome can be explained by a balance between slippage events and point mutations. Mol. Biol. Evol. 17:1210-1219.
Lai, Y., D. Shinde, N. Arnheim, and F. Z. Sun. 2003. The mutation process of microsatellites during the polymerase chain reaction. J. Comp. Biol. 10:143-155.
Lai, Y., and F. Z. Sun. 2003. Microsatellite mutations during the polymerase chain reaction: mean field approximations and their applications. J. Theor. Biol. 224:127-137.
Leeflang, E. P., S. Tavaré, P. Marjoram, C. O. S. Neal, J. Srinidhi, M. E. MacDonald, M. Young, N. S. Wexler, J. F. Gusella, and N. Arnheim. 1999. Analysis of germline mutation spectra at the Huntington's disease locus supports a mitotic mutation mechanism. Hum. Mol. Genet. 8:173-183.
Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Messier, W., S. H. Li, and C. B. Stewart. 1996. The birth of microsatellites. Nature 381:483.
Ott, J. 1999. Analysis of human genetic linkage. The Johns Hopkins University Press, Baltimore and London.
Pupko, T., and D. Graur. 1999. Evolution of microsatellites in the yeast Saccharomyces cerevisiae: role of length and number of repeated units. J. Mol. Evol. 48:313-316.
Rose, O., and D. Falush. 1998. A threshold size for microsatellite expansion. Mol. Biol. Evol. 15:613-615.
Rosenberg, N. A., J. K. Pritchard, J. L. Weber, H. M. Cann, K. K. Kidd, L. A. Zhivotovsky, and M. W. Feldman. 2002. Genetic structure of human populations. Science 298:2381-2385.
Shinde, D., Y. Lai, F. Z. Sun, and N. Arnheim. 2003. Taq DNA polymerase slippage mutation rates measured by PCR and quasi-likelihood analysis: (CA/GT)n and (A/T)n microsatellites. Nucleic Acids Res. 31:974-980.
Schlötterer, C., and D. Tautz. 1992. Slippage synthesis of simple sequence DNA. Nucleic Acids Res. 20:211-215.
Sibly, R. M., J. C. Whittaker, and M. Talbot. 2001. A maximum-likelihood approach to fitting equilibrium models of microsatellite evolution. Mol. Biol. Evol. 18:413-417.
Sibly, R. M., A. Meade, N. Boxall, M. J. Wilkinson, D. W. Corne, and J. C. Whittaker. 2003. The structure of interrupted human AC microsatellites. Mol. Biol. Evol. 20:453-459.
Sturzeneker, R., R. A. U. Bevilacqua, L. A. Haddad, A. J. G. Simpson, and S. D. J. Pena. 2000. Microsatellite instability in tumors as a model to study the process of microsatellite mutations. Hum. Mol. Genet. 9:347-352.
Viguera, E., D. Canceill, and S. D. Ehrlich. 2001. Replication slippage involves DNA polymerase pausing and dissociation. EMBO J. 20:2587-2595.
Weber, J., and P. May. 1989. Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am. J. Hum. Genet. 44:388-396.
Weber, J. L., and C. Wong. 1993. Mutation of human short tandem repeats. Hum. Mol. Genet. 2:1123-1128.
Wierdl, M., M. Dominska, and T. D. Petes. 1997. Microsatellite instability in yeast: dependence on the length of the microsatellite. Genetics 146:769-779.
Xu, X., M. Peng, Z. Fang, and X. Xu. 2000. The direction of microsatellite mutations is dependent upon allele length. Nat. Genet. 24:396-399.
Zhang, L., E. P. Leeflang, J. Yu, and N. Arnheim. 1994. Studying human mutations by sperm typing: instability of CAG trinucleotide repeats in the human androgen receptor gene. Nat. Genet. 7:531-535.