Plant Molecular Biology Unit, Division of Biochemical Sciences, National Chemical Laboratory, Pune, India
Abstract |
---|
Introduction |
---|
As in other regions of DNA, SSRs can also originate in coding regions, leading to the appearance of repetitive patterns in protein sequences. In protein sequence database studies, we have observed that tandem repeats are common in many proteins (Katti et al. 2000), and mechanisms involved in their genesis may contribute to the rapid evolution of proteins (Green and Wang 1994; Huntley and Golding 2000). During the past decade, several human neurodegenerative diseases have been found to be associated with dynamic mutations occurring at microsatellite loci within or near specific genes (Ashley and Warren 1995), leading to an increased interest in understanding the molecular mechanisms involved in the origin, evolution, and expansion/deletion of microsatellites.
Frequencies of various microsatellite sequences in different genomes have been estimated experimentally by hybridization techniques (e.g., Tautz and Renz 1984; Panaud, Chen, and McCouch 1995). However, this could not be done accurately with self-complementary oligonucleotides such as (AT)n and (GC)n. With the growth of sequence databases, several authors have reported an abundance of simple sequence repeats in different genomes (e.g., Hancock 1995; Jurka and Pethiyagoda 1995; Richard and Dujon 1996; Bachtrog et al. 1999; Kruglyak et al. 2000). In a recent survey, Toth, Gaspari, and Jurka (2000) examined the distribution of microsatellites in exonic, intronic, and intergenic regions of several eukaryotic taxa. Differential abundance of repeats in different genomes led them to suggest that strand-slippage theories alone are insufficient to explain characteristic microsatellite distributions.
Most of the previous studies on microsatellite distribution were based on DNA sequence databases in which coding or gene-rich regions were overrepresented. On the other hand, the availability of complete genome sequences now permits the determination of frequencies of SSRs at the whole-genome level. Such estimates should reflect the basal level of SSR dynamics within a species. The present paper details occurrences of SSRs in eukaryotic genomes that have been completely sequenced or for which complete chromosome sequences are available. Moreover, nonredundant complete genome-coding DNA sequences of Drosophila, Caenorhabditis elegans, and yeast have been analyzed to assess the extent of codon reiterations in protein-coding regions.
Materials and Methods |
---|
A polyA repeat is the same as a polyT repeat on a complementary strand. Similarly, (AC)n is equivalent to (CA)n, (TG)n, and (GT)n, while (AGC)n is equivalent to (GCA)n, (CAG)n, (CTG)n, (TGC)n, and (GCT)n in different reading frames or on a complementary strand. Thus, 2 unique classes are possible for mononucleotide repeats, whereas 4 classes are possible for dinucleotide, 10 for trinucleotide, and 33 for tetranucleotide repeats (Jurka and Pethiyagoda 1995). We determined individual repeat frequencies for all of these classes.
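The class counts given above can be checked by enumeration. The following is a minimal sketch, not the procedure used in the study: it assumes that two motifs belong to the same class when one is a rotation of the other or of its reverse complement, and that motifs which are merely repeats of a shorter unit (for example, ACAC) are excluded. All function names are hypothetical.

```python
from itertools import product

# Translation table for complementing a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the reverse complement of a DNA motif."""
    return motif.translate(COMPLEMENT)[::-1]


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def canonical_class(motif):
    """Alphabetically smallest rotation of the motif or its reverse complement."""
    rc = reverse_complement(motif)
    rotations = [s[i:] + s[:i] for s in (motif, rc) for i in range(len(motif))]
    return min(rotations)


def count_classes(k):
    """Number of distinct repeat classes for motifs of length k."""
    motifs = ("".join(p) for p in product("ACGT", repeat=k))
    return len({canonical_class(m) for m in motifs if is_primitive(m)})


if __name__ == "__main__":
    for k in range(1, 5):
        print(k, count_classes(k))
```

Under these assumptions, `canonical_class("CTG")` and `canonical_class("CAG")` both reduce to `AGC`, which matches the class naming used in the text.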
Complete genome coding DNA sequences of all predicted peptides of Drosophila, C. elegans, and yeast were obtained from the Berkeley Drosophila Genome Project (http://www.fruitfly.org), the Sanger Centre's Wormpep Database (http://www.sanger.ac.uk/Projects/C_elegans/wormpep), and the Saccharomyces Genome Database (http://genome-www.stanford.edu/Saccharomyces), respectively. A codon repeat was counted only when the codon was tandemly repeated at least seven times, allowing one mismatch for every 10 nt.
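As an illustration of the codon-repeat criterion, the sketch below scans a coding sequence in frame and reports runs of one codon repeated at least seven times. It is a simplified assumption-laden version: it detects perfect repeats only and omits the allowance of one mismatch per 10 nt, because the exact mismatch scoring is not specified here. The function name and return format are hypothetical.

```python
def codon_runs(cds, min_repeats=7):
    """Return (codon, start_nt, copies) for in-frame runs of a single codon.

    Only perfect tandem repeats are detected; mismatch tolerance is
    deliberately left out of this sketch.
    """
    # Split the coding sequence into complete codons, starting at frame 0.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]

    runs = []
    i = 0
    while i < len(codons):
        j = i
        # Extend the run while the same codon continues.
        while j < len(codons) and codons[j] == codons[i]:
            j += 1
        if j - i >= min_repeats:
            runs.append((codons[i], i * 3, j - i))
        i = j
    return runs


if __name__ == "__main__":
    example = "ATG" + "CAG" * 8 + "TAA"
    print(codon_runs(example))
```

Working on whole codons rather than raw nucleotides keeps the search in the reading frame, so a CAG run is not confused with the AGC or GCA runs that occupy the other two frames.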
Results and Discussion |
---|
Analysis of complete chromosome/genome sequences of humans, Drosophila, Arabidopsis, C. elegans, and yeast for occurrences of various microsatellites (table 1 and fig. 1) revealed that compared with other genomes, human chromosomes 21 and 22 are rich in mono- and tetranucleotide repeats. On the other hand, the Drosophila chromosomes have higher frequencies of di- and trinucleotide repeats. Surprisingly, the C. elegans genome contains fewer SSRs per million base pairs of sequence than the yeast genome. Moreover, the frequency of trinucleotide repeats in yeast is higher than that observed in human chromosomes 21 and 22.
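Frequencies expressed per million base pairs, as compared above, could be computed along the following lines. This is only a sketch: the search criteria used in the study are not given in this text, so the 12-nt minimum length, the restriction to perfect repeats, and the function names are all assumptions made for illustration.

```python
import re


def find_ssrs(seq, max_unit=4, min_len=12):
    """Return (unit_length, unit, start, length) for perfect tandem repeats.

    min_len is an assumed threshold, not the one used in the paper.
    """
    hits = []
    for k in range(1, max_unit + 1):
        min_copies = -(-min_len // k)  # ceiling division
        # A unit of k bases followed by at least (min_copies - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are repeats of a shorter unit (e.g. AA, CACA);
            # those runs are reported under the shorter unit length.
            if any(unit == unit[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((k, unit, match.start(), match.end() - match.start()))
    return hits


def density_per_mb(hits, seq_len):
    """Number of repeats per million base pairs of sequence."""
    return len(hits) * 1_000_000 / seq_len


if __name__ == "__main__":
    seq = "A" * 15 + "G" + "CA" * 8 + "T" + "CAG" * 5
    hits = find_ssrs(seq)
    print(hits)
    print(density_per_mb(hits, len(seq)))
```

Normalizing by sequence length is what allows chromosomes and genomes of very different sizes to be compared on one scale.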
The length distributions of all SSRs indicated that the frequency of repeats decreases exponentially with repeat length (data not shown). This may be because longer repeats have higher mutation rates and hence are more unstable (Wierdl, Dominska, and Petes 1997; Kruglyak et al. 1998). The paucity of longer microsatellites could also be due to their downward mutation bias and short persistence time (Harr and Schlotterer 2000). Recent studies have shown that compared with expansion mutation events, contraction mutations occur more frequently with increases in allele size (Xu et al. 2000), and long alleles tend to mutate to shorter lengths, thus preventing their infinite growth (Ellegren 2000).
Among the repeats longer than 40 nt, the dinucleotide repeats were more frequent, whereas mononucleotide repeats seemed to be less common (table 1). Large numbers of tetranucleotide repeats in human chromosomes and trinucleotide repeats in Drosophila were also longer than 40 nt. Slippage rates have been estimated to be highest in dinucleotide repeats, followed by tri- and tetranucleotide repeats (Chakraborty et al. 1997; Kruglyak et al. 1998; Schug et al. 1998). Probably, shorter repeating units allow more possible slippage events per unit length of DNA and hence are likely to be more unstable. However, shorter lengths of mononucleotide repeats in all genome sequences and an abundance of tetranucleotide repeats in human sequences suggest the involvement of additional mechanisms.
Our study shows that compared with human chromosome 21, chromosome 22 has significantly higher frequencies of mono-, tri-, and tetranucleotide repeats but lower frequencies of dinucleotide repeats (t-test: t = 5.60 for mononucleotide repeats, t = 3.42 for dinucleotide repeats, t = 4.59 for trinucleotide repeats, and t = 3.94 for tetranucleotide repeats; P < 0.01 in all the cases). In C. elegans, among a total of 60 chromosome pair/repeat type combinations, 15 combinations showed significant differences in density of repeats (at P < 0.05). On the other hand, the densities of repeats in Arabidopsis chromosomes 2 and 4 were similar. For Drosophila, the sex chromosome (X) contained 1.53 times as many repeats per million base pairs of sequence as autosomes (chromosomes 2 and 3) (significance not calculated). Such differences for dinucleotide repeats in the Drosophila sex chromosome and autosomes have been reported (Pardue et al. 1987; Bachtrog et al. 1999). Thus, although the trends for different repeat classes are similar between chromosomes within a genome, the density of repeats may vary between different chromosomes of the same species. This can be expected, since different chromosomes in a genome can have different organizations of genes, euchromatin, and heterochromatin.
Relative Frequencies of Various Di- and Trinucleotide Repeats
All dinucleotide repeat combinations excluding homomeric dinucleotides can be grouped into four unique classes, namely, (AT)n, (AG)n, (AC)n, and (GC)n. It is evident that in human and Drosophila chromosomes, AC dinucleotide repeats are more frequent, followed by AT and AG repeats (fig. 2). In contrast, Arabidopsis chromosomes contain more AT repeats, followed by AG repeats. However, in the yeast genome, AT repeats seem to be predominant compared with other dinucleotide repeats. Interestingly, GC dinucleotide repeats are extremely rare in all of the genomes studied. Lower frequencies of CpG dinucleotides in vertebrate genomes have been attributed to methylation of cytosine, which, in turn, increases its chances of mutation to thymine by deamination (Schorderet and Gartler 1992). However, CpG suppression by this mechanism cannot explain the rarity of (CG)n dinucleotide repeats in yeast, C. elegans, and Drosophila, since they do not show cytosine methylation.
DNA strand slippage can occur during transient dissociation and reannealing in the repeat region, and this can mislead the DNA processing machinery, leading to expansions or deletions in the repeat tracts. It has been suggested that if the nucleotides on the single strand are self-complementary, they can base-pair to form loops or hairpins and stabilize strand slippage (Gacy et al. 1995; Moore et al. 1999). If these mechanisms favor repeat expansions/deletions, repeats with higher hairpin propensities like (CTG)n and (CCG)n (Gacy et al. 1995; Mitas et al. 1995) or self-complementary repeats like (AT)n and (GC)n are likely to be more abundant. However, relative frequencies of various di- and trinucleotide repeat classes within and between different genomes do not seem to support such an association. For example, trinucleotide repeats of the AGC class (representing CAG/CTG repeats) are predominant in Drosophila, whereas in humans, Arabidopsis, and C. elegans genome sequences, they are less frequent. In contrast, human chromosomes 21 and 22 contain more AAT and AAC trinucleotide repeats, although their relative hairpin propensity is low (Gacy et al. 1995; Mitas et al. 1995). Similarly, trinucleotide repeats of the AAG class that can adopt triple-helical structures (Pearson and Sinden 1998) are comparatively more numerous in Arabidopsis, C. elegans, and yeast and less numerous in human and Drosophila sequences. This suggests that in addition to alternative DNA structures formed by repeat motifs, species-specific cellular factors interacting with them are likely to play an important role in the genesis of repeats (Toth, Gaspari, and Jurka 2000).
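To make the notion of a self-complementary repeat concrete, the sketch below tests whether a motif's reverse complement is a rotation of the motif itself, which holds for (AT)n and (GC)n. It is an illustration under that one assumption and covers perfect self-complementarity only; it does not model the hairpin propensity of motifs such as CTG or CCG. The function name is hypothetical.

```python
# Translation table for complementing a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_self_complementary(motif):
    """True if the reverse complement of the motif is a rotation of the motif.

    A tandem array of such a motif reads the same on both strands, so a
    single strand can pair perfectly with itself.
    """
    if not motif:
        return False
    rc = motif.translate(COMPLEMENT)[::-1]
    # Every rotation of a string appears as a substring of the string doubled.
    return rc in motif + motif


if __name__ == "__main__":
    for motif in ("AT", "GC", "AC", "AG", "CTG", "AAT"):
        print(motif, is_self_complementary(motif))
```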
Codon Repetitions in Complete Genome Coding DNA Sequences
Among all SSRs, slippage-mediated expansions/deletions of only trinucleotide repeats or multiples thereof can be tolerated in coding regions, since they do not disturb the reading frame. Coding DNA sequences of all the predicted peptides of Drosophila, C. elegans, and yeast genomes were analyzed for the occurrence of the same codon (trinucleotide) consecutively repeated seven or more times (table 2). It is evident that codon repetitions are far more frequent in Drosophila than in C. elegans, even though C. elegans has more predicted proteins than Drosophila. This is to be expected, since the frequency of microsatellites is very low in C. elegans (fig. 1). In Drosophila coding sequences, CAG codon (encoding glutamine) repetitions are predominant, followed by AGC (serine), GAG (glutamic acid), GCA (alanine), and AAC (asparagine) repeats. On the other hand, in C. elegans coding sequences, GAT (aspartic acid), CCA (proline), CAA (glutamine), GAA (glutamic acid), and AAG (lysine) codon repeats are comparatively more frequent, although very few of them are repeated 14 or more times. In yeast open reading frames (ORFs), GAA (glutamic acid), CAA (glutamine), GAT (aspartic acid), AAT (asparagine), and CAG (glutamine) codon repeats are more numerous. Such trends for triplet repeats in yeast ORFs have also been reported previously and are thought to reflect functional selection acting on amino acid reiterations in the encoded proteins (Alba, Santibanez-Koref, and Hancock 1999).
The trends observed for codon repeats in complete genome coding DNA sequences are consistent with our previous study of a protein sequence database, where we observed that single amino acid repeat stretches of small/hydrophilic amino acids were more frequent in proteins (Katti et al. 2000). This might perhaps explain why the majority of the repeat-associated diseases are due to expansions of CAG repeats in specific genes. Since glutamine repeats are tolerated more in proteins, the initial small (CAG)n expansions in coding regions are likely to have enough survival value to remain in a population. However, as their instability increases with increasing length, their effect on protein structure and function could be deleterious beyond a certain limit, leading to protein malfunction (Perutz 1999). On the other hand, initial small expansions of hydrophobic and basic amino acid residues could be lethal and hence would be eliminated from the population as soon as they appeared. The availability of a complete coding DNA sequence set of the human genome will enable us to test this hypothesis.
Conclusions |
---|
The locations and sequences of all of the microsatellite loci reported in this study are available at http://www.ncl-india.org/ssr. This information could be useful for the selection of a wide range of microsatellite loci for studying their location- and sequence-dependent evolution. These loci can also be used as markers for the fine analysis of recombination events along individual chromosomes. Availability of data on microsatellite content of complete chromosome sequences should also facilitate comprehensive studies on the direct role of microsatellites in genome organization, recombination, gene regulation, quantitative genetic variation, and evolution of genes.
Acknowledgements |
---|
Footnotes |
---|
1 Keywords: microsatellites, DNA strand slippage, codon repeats, genome sequences, database
2 Address for correspondence and reprints: Vidya S. Gupta, Plant Molecular Biology Unit, Division of Biochemical Sciences, National Chemical Laboratory, Pune 411 008, India. E-mail: vidya{at}ems.ncl.res.in.
References |
---|
Adams M. D., S. E. Celniker, R. A. Holt, et al. (195 co-authors) 2000 The genome sequence of Drosophila melanogaster. Science 287:2185-2195
Alba M. M., M. F. Santibanez-Koref, J. M. Hancock, 1999 Amino acid reiterations in yeast are overrepresented in particular classes of proteins and show evidence of a slippage-like mutational process J. Mol. Evol 49:789-797
Ashley C. T., S. T. Warren, 1995 Trinucleotide repeat expansion and human disease Annu. Rev. Genet 29:703-728
Bachtrog D., S. Weiss, B. Zangerl, G. Brem, C. Schlotterer, 1999 Distribution of dinucleotide microsatellites in the Drosophila melanogaster genome Mol. Biol. Evol 16:602-610
Beckmann J. S., M. Soller, 1990 Toward a unified approach to genetic mapping of eukaryotes based on sequence tagged microsatellite sites Biotechnology 8:930-932
C. elegans Sequencing Consortium. 1998 Genome sequence of the nematode C. elegans: a platform for investigating biology Science 282:2012-2018
Chakraborty R., M. Kimmel, D. N. Stivers, L. J. Davison, R. Deka, 1997 Relative mutation rates at di-, tri-, and tetranucleotide microsatellite loci Proc. Natl. Acad. Sci. USA 94:1041-1046
Dunham I., N. Shimizu, B. A. Roe, et al. (217 co-authors) 1999 The DNA sequence of human chromosome 22 Nature 402:489-495
Ellegren H., 2000 Heterogeneous mutation processes in human microsatellite DNA sequences Nat. Genet 24:400-402
Gacy A. M., G. Goellner, N. Juranic, S. Macura, C. T. McMurray, 1995 Trinucleotide repeats that expand in human disease form hairpin structures in vitro. Cell 81:533-540
Goffeau A., B. G. Barrell, H. Bussey, et al. (16 co-authors) 1996 Life with 6000 genes Science 274:546-567
Green H., N. Wang, 1994 Codon reiterations and the evolution of proteins Proc. Natl. Acad. Sci. USA 91:4298-4302
Hancock J. M., 1995 The contribution of slippage-like processes to genome evolution J. Mol. Evol 41:1038-1047
Harr B., C. Schlotterer, 2000 Long microsatellite alleles in Drosophila melanogaster have a downward mutation bias and short persistence times, which cause their genome-wide underrepresentation Genetics 155:1213-1220
Harr B., B. Zangerl, C. Schlotterer, 2000 Removal of microsatellite interruptions by DNA replication slippage: phylogenetic evidence from Drosophila Mol. Biol. Evol 17:1001-1009
Hattori M., A. Fujiyama, T. D. Taylor, et al. (63 co-authors) 2000 The DNA sequence of human chromosome 21 Nature 405:311-319
Huntley M., G. B. Golding, 2000 Evolution of simple sequence in proteins J. Mol. Evol 51:131-140
Jurka J., C. Pethiyagoda, 1995 Simple repetitive DNA sequences from primates: compilation and analysis J. Mol. Evol 40:120-126
Katti M. V., R. Sami-Subbu, P. K. Ranjekar, V. S. Gupta, 2000 Amino acid repeat patterns in protein sequences: their diversity and structural-functional implications Protein Sci 9:1203-1209
Kruglyak S., R. T. Durrett, M. D. Schug, C. F. Aquadro, 1998 Equilibrium distributions of microsatellite repeat length resulting from a balance between slippage events and point mutations Proc. Natl. Acad. Sci. USA 95:10774-10778
Kruglyak S., R. T. Durrett, M. D. Schug, C. F. Aquadro, 2000 Distribution and abundance of microsatellites in the yeast genome can be explained by a balance between slippage events and point mutations Mol. Biol. Evol 17:1210-1219
Levinson G., G. A. Gutman, 1987 Slipped-strand mispairing: a major mechanism for DNA sequence evolution Mol. Biol. Evol 4:203-221
Lin X., S. Kaul, S. Rounsley, et al. (37 co-authors) 1999 Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402:761-768
Mayer K., C. Schuller, R. Wambutt, et al. (230 co-authors) 1999 Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 402:769-777
Mitas M., A. Yu, J. Dill, T. J. Kamp, E. J. Chambers, I. S. Haworth, 1995 Hairpin properties of single-stranded DNA containing a GC-rich triplet repeat: (CTG)15 Nucleic Acids Res 23:1050-1059
Moore H., P. W. Greenwell, C. P. Liu, N. Arnheim, T. D. Petes, 1999 Triplet repeats form secondary structures that escape DNA repair in yeast Proc. Natl. Acad. Sci. USA 96:1504-1509
Morgante M., A. M. Olivieri, 1993 PCR-amplified microsatellites as markers in plant genetics Plant J 3:175-182
Panaud O., X. Chen, S. R. McCouch, 1995 Frequency of microsatellite sequences in rice (Oryza sativa L.) Genome 38:1170-1176
Pardue M. L., K. Lowenhaupt, A. Rich, A. Nordheim, 1987 (dC-dA)n. (dG-dT)n sequences have evolutionarily conserved chromosomal locations in Drosophila with implications for roles in chromosome structure and function EMBO J 6:1781-1789
Pearson C. E., R. R. Sinden, 1998 Trinucleotide repeat DNA structures: dynamic mutations from dynamic DNA Curr. Opin. Struct. Biol 8:321-330
Perutz M. F., 1999 Glutamine repeats and neurodegenerative diseases: molecular aspects Trends Biochem. Sci 24:58-63
Petes T. D., P. W. Greenwell, M. Dominska, 1997 Stabilization of microsatellite sequences by variant repeats in the yeast Saccharomyces cerevisiae. Genetics 146:491-498
Richard G. F., B. Dujon, 1996 Distribution and variability of trinucleotide repeats in the genome of the yeast Saccharomyces cerevisiae. Gene 174:165-174
Schorderet D. F., S. M. Gartler, 1992 Analysis of CpG suppression in methylated and nonmethylated species Proc. Natl. Acad. Sci. USA 89:957-961
Schug M. D., C. M. Hutter, K. A. Wetterstrand, M. S. Gaudette, T. F. Mackay, C. F. Aquadro, 1998 The mutation rates at di-, tri- and tetranucleotide repeats in Drosophila melanogaster. Mol. Biol. Evol 15:1751-1760
Tautz D., M. Renz, 1984 Simple sequences are ubiquitous repetitive components of eukaryotic genomes Nucleic Acids Res 12:4127-4138
Tautz D., M. Trick, G. A. Dover, 1986 Cryptic simplicity in DNA is a major source of genetic variation Nature 322:652-656
Toth G., Z. Gaspari, J. Jurka, 2000 Microsatellites in different eukaryotic genomes: survey and analysis Genome Res 10:967-981
Wierdl M., M. Dominska, T. D. Petes, 1997 Microsatellite instability in yeast: dependence on the length of the microsatellite Genetics 146:769-779
Xu X., M. Peng, Z. Fang, X. Xu, 2000 The direction of microsatellite mutations is dependent upon allele length Nat. Genet 24:396-399