A Relationship Between Lengths of Microsatellites and Nearby Substitution Rates in Mammalian Genomes

Mauro F. Santibáñez-Koref, Rathithevy Gangeswaran and John M. Hancock

Comparative Sequence Analysis Group, MRC Clinical Sciences Centre, Imperial College School of Medicine, Hammersmith Hospital, London, England

A number of studies have indicated an influence of sequences flanking tandem repeats on repeat stability (Monckton et al. 1994; Shimizu et al. 1996; Bowater et al. 1997; Jeffreys, Murray, and Neumann 1998; Kruglyak et al. 1998; Brock, Anderson, and Monckton 1999). In particular, Kruglyak et al. (1998) have postulated a model to explain differences in average microsatellite length between species. According to this model, the average microsatellite length in a genome depends on two parameters: the tendency of microsatellites to undergo slippage-like mutation, and the rate of base substitution. We hypothesized that such a model might also apply within genomes, such that, for example, local variations in point substitution rate (Wolfe, Sharp, and Li 1989) could give rise to differences in average microsatellite length in different locations. To test this hypothesis, we investigated the relationship between microsatellite length and the substitution rates in flanking sequences.

We first examined the relationship between flanking-sequence divergence and array length for CA microsatellites, which are common in a variety of eukaryotes (Hamada, Petrino, and Kakunaga 1982). We compared homologous loci in the rat and the mouse, two species for which a substantial amount of sequence information is available and which are related closely enough to frequently share microsatellites at orthologous positions. Sequences were retrieved from the GenBank (release 99.0) or EMBL (release 49.0) database using the GCG package. The databases were screened for rat or mouse sequences containing at least (CA)10 or (TG)10, i.e., at least 10 uninterrupted dinucleotide units. The 50 bases 5' of the start of the CA or TG block, in the orientation represented in the database, were used to search for homologs in the other species. We excluded entries containing composite repeats and entries with other microsatellites within 150 bp, since these would complicate the determination of the flanking-sequence boundaries. The sequences were aligned with the program BESTFIT (GCG 1994) using default parameter values.
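The screening step described above can be illustrated with a minimal Python sketch. It is not the GCG-based procedure the authors used; the function name `find_ca_tracts` and the parameter `probe_len` are hypothetical, and the sketch only locates perfect (CA)n or (TG)n tracts with n of at least 10 and returns the bases 5' of each tract that would serve as the homology-search query. Filtering of composite repeats and nearby microsatellites is not attempted.

```python
import re

# At least ten uninterrupted CA or TG dinucleotide units.
TRACT = re.compile(r"(?:CA){10,}|(?:TG){10,}")


def find_ca_tracts(seq, probe_len=50):
    """Return (start, end, probe) for each (CA)>=10 or (TG)>=10 tract.

    start/end are 0-based, end-exclusive coordinates of the tract.
    probe holds up to probe_len bases immediately 5' of the tract,
    in the orientation of the input sequence (hypothetical helper).
    """
    seq = seq.upper()
    hits = []
    for match in TRACT.finditer(seq):
        start, end = match.span()
        probe = seq[max(0, start - probe_len):start]
        hits.append((start, end, probe))
    return hits
```

A tract closer than `probe_len` bases to the start of the entry yields a shorter probe, so a real screen would probably need to discard or handle such cases separately.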

As shown in figure 1A, the boundaries between a microsatellite and its flanking sequences are not always easily defined. Surrounding the (CA)n tract is often a region that contains reiterations of the repeat motif interrupted by other sequences. Consequently, an unambiguous alignment in this region is often difficult to define, making estimates of sequence divergence unreliable. Previous analyses have suggested that substitution rates in regions immediately adjacent to microsatellite arrays are elevated (Djian, Hancock, and Chana 1996; Brohede and Ellegren 1999; Hancock, Worthey, and Santibáñez-Koref 2001). Sequence changes in regions immediately adjacent to tandem repeats may reflect mutational events involving the repeats themselves rather than processes in nearby regions of the genome. Because of this, we avoided the region immediately adjacent to the microsatellite when estimating the sequence divergence of flanking sequences. The region that we excluded from the analysis was designated the transitional zone (see fig. 1). We defined the limits of this region to be the first two identical residues 5' or 3' of the microsatellite core that were not involved in a reiteration of the microsatellite motif in either of the two sequences of the alignment.



Fig. 1.—Relationship between flanking-sequence divergence and repeat length of CA microsatellites. A, Alignment between two homologous sequences containing a microsatellite. B, Diagrammatic representation of the different regions: (I) core—the longest uninterrupted array, represented as shaded text in the alignment; (II) transitional zone—the region surrounding the longest uninterrupted array (see text), represented by the text that is boxed but not shaded; and (III) flanking region—sequence more distant from the repetitive core, where an unambiguous alignment is often possible. C, Correlation between repeat length and divergence of the flanking sequences for (CA)n microsatellites in the rat and the mouse. Vertical axis: divergence between pairs of homologous flanking sequences, expressed as the proportion of sites differing between species; horizontal axis: mean of the (CA)n array lengths seen in the two species (in bp). Accession numbers of the sequences used are as follows: J02582, D00466; M18668, M26669, M74149; L02427, X60367; K02246, M10021; J04963, M61907, M73534; M65150, AA499623; M31670, M14872; M32754, M95526; L36125, M29660; J05206, M33961; M64092, L02241; K02248, X51468; M95735, D45207; M64488, D37793; M31076, U65016; D00475, M20155; Z22607, X56848; Y00396, L00038; J00373, J00374; X82152, X94998; X17215, X57155; U72995, W71716

 
The numbers of substitutions between the 150-bp flanking sequence adjacent to the rat array and the corresponding mouse sequence were determined, excluding the two paired bases used to define the boundary of the flanking sequence. Twenty-four alignments were included in the calculations. The average lengths of repeats were calculated from the lengths of the longest uninterrupted CA arrays as determined from the database entries for the mouse and the rat. Lengths of repeats were measured in base pairs and did not require repetition of a full unit. They could therefore adopt any integral value. The relationship between array size and flanking sequence divergence for these pairs of microsatellites is illustrated in figure 1C. Regression analysis showed that the amount of sequence divergence between the homologous flanking sequences diminished with increasing size of the array (R = -0.64, P = 0.0008). Our use of 50 bases 5' of the start of the CA or TG block to search for the homologs in the other species might have influenced the observed correlation. We therefore repeated the calculations for the flanks not used to search for the homolog. The correlation with array length was again significant (R = -0.58, P = 0.006, 21 alignments). It was also significant if we included deletions or insertions in the analysis (R = -0.64, P = 0.0007). No significant correlation was found between array length and flanking sequence base composition, as measured by the frequency of each of the four bases, indicating that a shift in base composition is not an explanation for the observed correlation. No correlation was found with the number of interruptions of the CA array.
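The quantities used in this analysis can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the array-length helper assumes the sequence is given in the CA orientation, and gapped alignment columns are simply skipped, which corresponds to the analysis that ignores insertions and deletions.

```python
from math import sqrt

CA = {"C", "A"}


def proportion_differing(flank_a, flank_b):
    """Proportion of ungapped aligned sites that differ between two flanks."""
    pairs = [(x, y) for x, y in zip(flank_a.upper(), flank_b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped aligned sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def longest_ca_run_bp(seq):
    """Length in bp of the longest uninterrupted alternating C/A run.

    Partial repeat units count, so the result can be any integer.
    """
    best = run = 0
    prev = ""
    for base in seq.upper():
        if base in CA and prev in CA and base != prev:
            run += 1          # the alternation continues
        elif base in CA:
            run = 1           # a new run starts at this base
        else:
            run = 0           # the run is interrupted
        best = max(best, run)
        prev = base
    return best


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

For each locus, the x value would be the mean of `longest_ca_run_bp` over the rat and mouse entries and the y value the `proportion_differing` of the 150-bp flanks; `pearson_r` over all loci then gives the correlation coefficient.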

To test the generality of this observation for other classes of repeats, we investigated whether a similar correlation could be observed for CAG arrays. These arrays are often found in coding regions (Stallings 1994) and have recently attracted considerable attention because of their involvement in a number of inherited human diseases (Rubinsztein 1999). We examined a set of homologous mouse and rat coding sequences containing CAG arrays. We selected sequences with a CAG or CTG tract with more than five repeat units in either the rat or the mouse homolog. To restrict the influence of possible constraints at the amino acid level, only sequences coding for polyglutamine tracts were included. Total (K), synonymous (Ks), and nonsynonymous (Ka) divergences were then calculated with the method of Li, Wu, and Luo (1985) using the program LI93 (K. H. Wolfe, unpublished). The criteria defined above for CA repeats were used to delimit the transitional zone. The calculations were again based on 150-bp flanking sequence on either side of the array as far as this was possible, but introns, 5' and 3' untranslated regions, and regions in which the alignments became ambiguous were excluded. Twelve sequence pairs were included in the analysis. Note that since many of the sequences included in this analysis were only available as cDNAs, the results (see table 1) are probably confounded by not taking into account distances separating array and flanking sequence at the genomic level and by any effects of intronic sequences on flanking-sequence evolution.


Table 1 Correlations Between the Length of CAG Arrays and Flanking-Sequence Divergence

 
We did not observe significant correlations between the divergence in the flanking sequences and the average length of the pure CAG array. A significant correlation did, however, become apparent when the lengths of CAN tracts (where N designates any base) were considered instead. CAN repeats in this context are defined by extending the CAG repeat in both the 5' and the 3' directions so that they include all other codons starting with CA. In some cases, this had the effect of merging two CAG repeats interrupted by a single synonymous (CAA) or nonsynonymous (CAC, CAT) codon. A correlation between sequence divergence and CAN repeat length is consistent with an underlying correlation which is a function not only of the present state of the array, but also of its length during its evolutionary history, as interrupted arrays are likely to have evolved from pure ancestors. Interestingly, the data in table 1 suggest that the correlation gains strength when the length of the smaller CAN array of both homologs is considered instead of the average. This is consistent with a preferential increase in array length during evolution (Rubinsztein et al. 1995; Amos et al. 1996; Primmer et al. 1996). Under such a scenario, the size of the shorter array would be closer to the size of the ancestral array, and therefore would better represent the lengths of both arrays during the period of divergence, than would the average array length. The data in table 1 also indicate that the decrease in the rate of substitution with increasing array size affects both synonymous and nonsynonymous changes. It should be noted that the Ks value in our sample is significantly lower than the values determined by Wolfe and Sharp (1993) for a collection of 363 mouse and rat sequences (P = 0.03; Mann-Whitney U-test).
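The CAN-tract definition given above can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the coding sequence is assumed to be supplied in frame on the CAG strand, and the helper names `codons`, `longest_cag_run`, and `extend_to_can` are hypothetical rather than taken from the original analysis.

```python
def codons(cds):
    """Split an in-frame coding sequence into complete codons."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # drop any trailing partial codon
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def longest_cag_run(cod):
    """Codon indices (start, end-exclusive) of the longest pure CAG run."""
    best = (0, 0)
    i = 0
    while i < len(cod):
        if cod[i] == "CAG":
            j = i
            while j < len(cod) and cod[j] == "CAG":
                j += 1
            if j - i > best[1] - best[0]:
                best = (i, j)
            i = j
        else:
            i += 1
    return best


def extend_to_can(cod, start, end):
    """Extend a CAG run 5' and 3' over every adjacent codon starting with CA."""
    while start > 0 and cod[start - 1].startswith("CA"):
        start -= 1
    while end < len(cod) and cod[end].startswith("CA"):
        end += 1
    return start, end
```

For example, the codon series ATG (CAG)3 CAA (CAG)4 CAC GGT has a longest pure CAG run of four codons, while the CAN tract obtained by extension spans nine codons because the CAA interruption merges the two CAG runs and the trailing CAC is included.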

The observed correlation between flanking-sequence divergence and repeat length raised the possibility that changes in the local rate of sequence change could give rise to changes in repeat length. Such changes should be observable in phylogenetic analysis of a region containing a repeat known to change in length during evolution. Examples of such repeats are CAG repeats found in genes involved in human neurological diseases (reviewed by Rubinsztein 1999). To investigate whether this may be the case, we carried out a phylogenetic analysis of changes within primate orthologs of the human dentatorubral-pallidoluysian atrophy (DRPLA) locus (Nagafuchi et al. 1994). We PCR-amplified 435 bp of genomic sequence 3' of the DRPLA CAG repeat. This region corresponds to positions 1735–2169 of the published human sequence (accession number D31840; Nagafuchi et al. 1994) and lies within exon 5 of the gene. The region was amplified from six nonhuman primate species: the bonobo (Pan paniscus, accession number AJ133270), the gorilla (Gorilla gorilla, AJ133271), the orangutan (Pongo pygmaeus, AJ133272), the gibbon (Hylobates lar, AJ133273), the cynomolgus monkey (Macaca fascicularis, AJ133274), and the tufted capuchin (Cebus apella, AJ133275). These sequences were compared with the published human sequence. We compared substitution rates rather than divergences to take into account divergence times between species. Synonymous and nonsynonymous changes were identified by assigning the reading frame from the human sequence, and rates were calculated using divergence times published by Kumar and Hedges (1998). The values in table 1 are the correlation coefficients derived from 21 pairwise comparisons.
To estimate the significance of these coefficients given the phylogenetic relationships and array sizes, we simulated a set of sequences related by the corresponding phylogenetic tree (Kumar and Hedges 1998) 10,000 times, assuming a constant substitution rate of 1.3 × 10^-3 per residue per million years (the observed rate average for these species at that locus). We then calculated simulated correlation coefficients between estimated substitution rates and array length. The analyses again showed a significant inverse correlation between substitution rate and array length.
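The logic of such a simulation test can be sketched in Python, with several simplifications that are assumptions of this sketch and not details taken from the text. Instead of simulating sequences, it draws Poisson substitution counts per branch and ignores multiple hits; the tree is ultrametric and passed in as branch lengths plus root-to-tip paths; each pairwise comparison is matched with the mean array length of the two species; and the P value is the one-sided fraction of simulated coefficients at or below the observed one. All function and parameter names are hypothetical.

```python
import math
import random
from itertools import combinations


def poisson(lam, rng):
    """Draw a Poisson variate by Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx * syy > 0 else 0.0


def simulate_null_r(branch_myr, tip_paths, array_len, rate, n_sites,
                    n_sims=10000, seed=0):
    """Correlation coefficients expected under one constant substitution rate.

    branch_myr: branch id -> length in million years
    tip_paths:  species -> branch ids on its path to the root
    array_len:  species -> repeat array length
    rate:       substitutions per site per million years
    """
    rng = random.Random(seed)
    pairs = list(combinations(sorted(tip_paths), 2))
    mean_len = [(array_len[a] + array_len[b]) / 2 for a, b in pairs]
    results = []
    for _ in range(n_sims):
        subs = {b: poisson(rate * n_sites * t, rng)
                for b, t in branch_myr.items()}
        rates = []
        for a, b in pairs:
            # Branches separating the two tips: on one path but not both.
            path = set(tip_paths[a]) ^ set(tip_paths[b])
            diffs = sum(subs[x] for x in path)
            myr = sum(branch_myr[x] for x in path)
            rates.append(diffs / (n_sites * myr))
        results.append(pearson_r(mean_len, rates))
    return results


def one_sided_p(sim_rs, observed_r):
    """Fraction of simulated coefficients at least as negative as observed."""
    return sum(r <= observed_r for r in sim_rs) / len(sim_rs)
```

Because pairwise comparisons on a tree share branches, the 21 rate estimates are not independent, which is presumably why a simulation on the tree is used here instead of a standard significance test for a correlation coefficient.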

Our findings suggest that a relationship between flanking-sequence divergence and repeat length applies to both noncoding CA microsatellites and coding CAG repeats. The observation of this relationship at synonymous sites and in noncoding sequences suggests that the effect is not the result of selection at the peptide level and may instead reflect genuine differences in mutation rates. While some investigators have not detected any differences between the substitution rates of sequences adjacent to microsatellites (Brohede and Ellegren 1999), others have reported surprisingly low divergence, or conservation, of microsatellite loci across different species (Schlötterer, Amos, and Tautz 1991; FitzSimmons, Moritz, and Moore 1995; Rico, Rico, and Hewitt 1996; Ezenwa et al. 1998; Zhu, Queller, and Strassmann 2000). Observations of this kind would be consistent with our results, as higher conservation of flanking sequences of long microsatellites would improve the ability of primers to amplify across species provided these flanking sequences are located far enough from the array, i.e., outside the transitional zone.

Although the wider generality of these phenomena remains to be tested, the observations reported here are consistent with the hypothesis that local variations in point substitution rate around a genome could influence the lengths of microsatellites. This represents an extension of the model proposed by Kruglyak et al. (1998), who suggested that such a relationship gave rise to the genomewide length distribution of microsatellites. On this basis, we suggest that long microsatellites are likely to be more stable (and/or form preferentially) in regions of mammalian genomes with low point mutation rates. The results also raise the possibility that evolutionary processes that affect substitution rates at a genomic locality may also affect the lengths of microsatellites in that region.

Our results could also be explained by an inverse effect of array expansion on the rate of sequence change in flanking regions. The presence of a microsatellite has been associated with effects on adjacent sequences such as alteration of chromatin structure (Wang et al. 1994; Otten and Tapscott 1995), transcription (Bidichandani, Ashizawa, and Patel 1998), and stimulation of gene conversion (Wahls, Wallace, and Moore 1990). These or similar processes could affect the mutation rate by, for example, modulating accessibility to DNA-damaging agents or components of the repair machinery. However, any influence of these processes on the substitution rate remains to be established.

In summary, our findings provide evidence for an association between the size of CA and CAG microsatellites and the rate of evolutionary change of the adjacent DNA. They suggest that the presence of this widespread class of sequence elements may signal the presence of genome regions with relatively low point mutation rates.

Acknowledgements

We thank the U.K. Medical Research Council for financial support, Andy Porter for supplying orangutan DNA, Hector Seuanez for C. apella DNA, Philippe Djian for DNA from the other primates, and Ken Wolfe for LI93.

Footnotes

Jeffrey Long, Reviewing Editor

1 Present address: Department of Bioinformatics, Max Delbrück Centrum, Berlin, Germany.

2 Present address: Department of Computer Science, Royal Holloway University of London, Egham, Surrey, England.

3 Keywords: microsatellites, genome evolution, flanking sequences, transitional zone, DRPLA, mutation rate

4 Address for correspondence and reprints: John M. Hancock, Department of Computer Science, Royal Holloway University of London, Egham, Surrey TW20 0EX, United Kingdom. j.hancock{at}cs.rhul.ac.uk.

References

    Amos W., S. J. Sawcer, R. W. Feakes, D. C. Rubinsztein. 1996. Microsatellites show mutational bias and heterozygote instability. Nat. Genet. 13:390-391.

    Bidichandani S. I., T. Ashizawa, P. I. Patel. 1998. The GAA triplet-repeat expansion in Friedreich ataxia interferes with transcription and may be associated with an unusual DNA structure. Am. J. Hum. Genet. 62:111-121.

    Bowater R. P., A. Jaworski, J. E. Larson, P. Parniewski, R. D. Wells. 1997. Transcription increases the deletion frequency of long CTG·CAG triplet repeats from plasmids in Escherichia coli. Nucleic Acids Res. 25:2861-2868.

    Brock G. J., N. H. Anderson, D. G. Monckton. 1999. Cis-acting modifiers of expanded CAG/CTG triplet repeat expandability: associations with flanking GC content and proximity to CpG islands. Hum. Mol. Genet. 8:1061-1067.

    Brohede J., H. Ellegren. 1999. Microsatellite evolution: polarity of substitutions within repeats and neutrality of flanking sequences. Proc. R. Soc. Lond. B Biol. Sci. 266:825-833.

    Djian P., J. M. Hancock, H. S. Chana. 1996. Codon repeats in genes associated with human diseases: fewer repeats in the genes of nonhuman primates and nucleotide substitutions concentrated at the sites of reiteration. Proc. Natl. Acad. Sci. USA 93:417-421.

    Ezenwa V. O., J. M. Peters, Y. Zhu, E. Arevalo, M. D. Hastings, P. Seppä, J. S. Pedersen, F. Zacchi, D. C. Queller, J. E. Strassmann. 1998. Ancient conservation of trinucleotide microsatellite loci in polistine wasps. Mol. Phylogenet. Evol. 10:168-177.

    FitzSimmons N. N., C. Moritz, S. S. Moore. 1995. Conservation and dynamics of microsatellite loci over 300 million years of marine turtle evolution. Mol. Biol. Evol. 12:432-440.

    GCG. 1994. Program manual for the Wisconsin package. Version 8. Genetics Computer Group, Madison, Wis.

    Hamada H., M. G. Petrino, T. Kakunaga. 1982. A novel repeated element with Z-DNA-forming potential is widely found in evolutionarily diverse eukaryotic genomes. Proc. Natl. Acad. Sci. USA 79:6465-6469.

    Hancock J. M., E. A. Worthey, M. F. Santibáñez-Koref. 2001. A role for selection in regulating the evolutionary emergence of disease-causing and other coding CAG repeats in humans and mice. Mol. Biol. Evol. 18:1014-1023.

    Jeffreys A. J., J. Murray, R. Neumann. 1998. High-resolution mapping of crossovers in human sperm defines a minisatellite-associated recombination hotspot. Mol. Cell 2:267-273.

    Kruglyak S., R. T. Durrett, M. D. Schug, C. F. Aquadro. 1998. Equilibrium distributions of microsatellite repeat length resulting from a balance between slippage events and point mutations. Proc. Natl. Acad. Sci. USA 95:10774-10778.

    Kumar S., S. B. Hedges. 1998. A molecular timescale for vertebrate evolution. Nature 392:917-920.

    Li W. H., C. I. Wu, C. C. Luo. 1985. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2:150-174.

    Monckton D. G., R. Neumann, T. Guram, N. Fretwell, K. Tamaki, A. MacLeod, A. J. Jeffreys. 1994. Minisatellite mutation rate variation associated with a flanking DNA sequence polymorphism. Nat. Genet. 8:162-170.

    Nagafuchi S., H. Yanagisawa, E. Ohsaki, T. Shirayama, K. Tadokoro, T. Inoue, M. Yamada. 1994. Structure and expression of the gene responsible for the triplet repeat disorder, dentatorubral and pallidoluysian atrophy (DRPLA). Nat. Genet. 8:177-182.

    Otten A. D., S. J. Tapscott. 1995. Triplet repeat expansion in myotonic dystrophy alters the adjacent chromatin structure. Proc. Natl. Acad. Sci. USA 92:5465-5469.

    Primmer C. R., N. Saino, A. P. Møller, H. Ellegren. 1996. Directional evolution in germline microsatellite mutations. Nat. Genet. 13:391-393.

    Rico C., I. Rico, G. Hewitt. 1996. 470 million years of conservation of microsatellite loci among fish species. Proc. R. Soc. Lond. B Biol. Sci. 263:549-557.

    Rubinsztein D. C. 1999. Trinucleotide expansion mutations cause diseases which do not conform to classical Mendelian expectations. Pp. 80–97 in D. B. Goldstein and C. Schlötterer, eds. Microsatellites: evolution and applications. Oxford University Press, Oxford, England.

    Rubinsztein D. C., W. Amos, J. Leggo, S. Goodburn, S. Jain, S. H. Li, R. L. Margolis, C. A. Ross, M. A. Ferguson-Smith. 1995. Microsatellite evolution–evidence for directionality and variation in rate between species. Nat. Genet. 10:337-343.

    Schlötterer C., B. Amos, D. Tautz. 1991. Conservation of polymorphic simple sequence loci in cetacean species. Nature 354:63-65.

    Shimizu M., R. Gellibolian, B. A. Oostra, R. D. Wells. 1996. Cloning, characterization and properties of plasmids containing CGG triplet repeats from the FMR-1 gene. J. Mol. Biol. 258:614-626.

    Stallings R. L. 1994. Distribution of trinucleotide microsatellites in different categories of mammalian genomic sequence: implications for human genetic diseases. Genomics 21:116-121.

    Wahls W. P., L. J. Wallace, P. D. Moore. 1990. The Z-DNA motif d(TG)30 promotes reception of information during gene conversion events while stimulating homologous recombination in human cells in culture. Mol. Cell. Biol. 10:785-793.

    Wang Y. H., S. Amirhaeri, S. Kang, R. D. Wells, J. D. Griffith. 1994. Preferential nucleosome assembly at DNA triplet repeats from the myotonic dystrophy gene. Science 265:669-671.

    Wolfe K. H., P. M. Sharp. 1993. Mammalian gene evolution: nucleotide sequence divergence between mouse and rat. J. Mol. Evol. 37:441-456.

    Wolfe K. H., P. M. Sharp, W. H. Li. 1989. Mutation rates differ among regions of the mammalian genome. Nature 337:283-285.

    Zhu Y., D. C. Queller, J. E. Strassmann. 2000. A phylogenetic perspective on sequence evolution in microsatellite loci. J. Mol. Evol. 50:324-338.

Accepted for publication May 11, 2001.