*Structure et dynamique des génomes, Institut Jacques Monod, Paris, France;
and
Atelier de Bioinformatique, Université Paris VI, Paris, France
Abstract
The complete genome of the yeast Saccharomyces cerevisiae was investigated for intrachromosomal duplications at the level of nucleotide sequences. The analysis was performed by looking for long approximate repeats (from 30 to 3,885 bp) present on each of the chromosomes. We show that direct and inverted repeats exhibit very different characteristics: the two copies of direct repeats are more similar and longer than those of inverted repeats. Furthermore, in contrast to the inverted repeats, a large majority of direct repeats appear to be closely spaced: the distance (delta) between the two copies is generally smaller than 1 kb. Further analysis of these "close direct repeats" shows a negative correlation between delta and the percentage of identity between the two copies, and a positive correlation between delta and repeat length. Moreover, in contrast to the other categories of repeats, close direct repeats are mostly located within coding sequences (CDSs). We propose two hypotheses in order to interpret these observations: first, the deletion/conversion rate is negatively correlated with delta; second, there exists an active duplication mechanism which continuously creates close direct repeats, the other intrachromosomal repeats being the result, through chromosomal rearrangements, of these "primary repeats."
Introduction
Since the publication of the first complete bacterial genome (Fleischmann et al. 1995), the sequences of 22 new eubacteria, 6 archaebacteria, and 3 eukaryotes have been completed, and several new genomics fields, such as "functional genomics" (deciphering the function of genes) and "comparative genomics" (comparison of entire genomes) (Chervitz et al. 1998), have emerged. Here, we focus on "dynamical genomics," which can be seen as the study of chromosome history and dynamics through the analysis of the structure of current genomes. One way of studying these phenomena is through the analysis of chromosomal rearrangement remnants, such as duplications. Among eukaryotes, the budding yeast Saccharomyces cerevisiae, which has been completely sequenced (Goffeau et al. 1996), is a good model because of its small genome size (12.1 Mb) and its comprehensive annotation.
The first evidence of sequence duplication in S. cerevisiae came from Lalo et al. (1993), who found a large duplication event between chromosomes II and XIV. More exhaustive studies, based on translated coding sequence (CDS) alignments, have brought large interchromosomal duplications to prominence (Coissac, Maillier, and Netter 1997; Wolfe and Shields 1997). Further analysis revealed that, for the two copies of a duplicated CDS, the distances to the closest telomere are similar (Coissac, Maillier, and Netter 1997). The importance of telomeres underlines the relation between nuclear organization and genome dynamics. Other studies were undertaken at the DNA level, leading to the development of a "duplication databank" (Mewes et al. 1997) and to the definition of the X2 element in the subtelomeric region (Britten 1998).
However, to our knowledge, apart from the description of gene tandem duplication (CUP1, PMR2, rDNA, ASP3), no systematic study has yet been undertaken on intrachromosomal duplications. In the present work, we searched for intrachromosomal repeats at the level of nucleotide sequences. Through this analysis, we show that direct and inverted repeats exhibit very different characteristics. Moreover, we identify a special class of direct repeats (named close direct repeats) exhibiting several particular features. Finally, we propose a model based on the active flow of creation of these close direct repeats and their dispersion by chromosomal rearrangements.
Materials and Methods
Data
The S. cerevisiae complete sequences and annotations were extracted from the Saccharomyces Genome Database (SGD; http://genome-www.stanford.edu/Saccharomyces/). The total size of the 16 chromosomes is 12.1 Mb. We used the entire nuclear sequences as given in the database, including the three tandem clusters (CUP1, rDNA, and PMR2), which were reduced to a single repeat. We additionally built 10 "random genomes" by shuffling each chromosome independently with respect to its dinucleotide composition.
Construction of the Repeat Database
Our primary goal was to look for approximate repeats, i.e., repeats whose copies may not be strictly identical but may contain errors (mismatches and indels). The usual procedure for this purpose derives from dynamic programming (Smith and Waterman 1981) but is unfortunately not amenable to the study of very long sequences because of its quadratic time complexity. Although several heuristics have already been proposed to work around this problem (Leung et al. 1991; Vincens et al. 1998), we chose here to develop our own procedure in order to fit the biological problem more closely. Like most of the already-proposed heuristics, this procedure first looks for "seeds" of exact repeats and then extends the seeds by using dynamic programming techniques. This is done for each chromosome independently in four consecutive steps, which are described as follows.
First Step: Searching for Seeds
Exact repeats were detected by using the Karp-Miller-Rosenberg (KMR) algorithm (Karp, Miller, and Rosenberg 1972), which finds the largest subword present at least rmin times (here rmin = 2) in a text (here, each chromosome). Since we were interested in "unusually" large repeats (i.e., repeats which did not appear by chance), we set a threshold (Lmin) on the minimal length of repeats of interest. Lmin was calculated using the statistics developed by Karlin and Ost (1985). For each chromosome, we chose Lmin such that the probability of finding a three-copy repeat in a random sequence with the same length and base composition as the chromosome was less than 0.001. Lmin typically ranges from 15 to 17 bp depending on the chromosome length.
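As a rough illustration of how such a threshold depends on chromosome length and base composition, the sketch below uses a simple first-moment bound, not the Karlin and Ost (1985) statistics themselves: the expected number of r-copy exact repeats of length L in a random sequence of length N is approximated by C(N, r) (Σ p_b^r)^L, and Lmin is taken as the smallest L for which this expectation falls below 0.001. The function name, the base frequencies, and the chromosome lengths used in the usage note are illustrative assumptions.

```python
import math

def min_seed_length(n, base_freqs, copies=3, alpha=1e-3):
    """Smallest L such that the expected number of `copies`-copy exact
    repeats of length L in a random sequence of length n is below alpha.

    First-moment approximation (an assumption of this sketch):
        E ~ C(n, copies) * (sum_b p_b**copies) ** L
    """
    # Probability that `copies` independent positions carry the same base.
    q = sum(f ** copies for f in base_freqs.values())
    # Work in log space to avoid overflow for large n.
    log_tuples = math.log(math.comb(n, copies))
    length = 1
    while log_tuples + length * math.log(q) >= math.log(alpha):
        length += 1
    return length
```

With an AT-rich composition such as {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31} (assumed values), calling `min_seed_length(230_000, freqs)` and `min_seed_length(1_500_000, freqs)` shows how the threshold grows slowly with chromosome length; the values obtained this way need not coincide exactly with those of the original statistics.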
In order to avoid the problem of any two subwords of a repeated word being themselves repeated, we devised the following heuristic (Rocha, Danchin, and Viari 1999): first, the longest repeat on the chromosome is sought, and its length is compared with the minimum preset value Lmin. When the test is successful, both copies of the repeat are masked and excluded from further analysis. The process is iterated up to the point where the length of the longest repeat becomes smaller than Lmin. It should be pointed out that this process is a heuristic. In particular, if there is a three-copy repeat where the third copy appears a little bit shorter, then this copy will be missed by the method. This explains the rationale behind the procedure used to set the threshold Lmin (see above). We devised two versions of the program: one to detect direct repeats and the other to detect inverted repeats (repeats for which the second copy has the reverse orientation). The two orientation classes (direct and inverted) were further handled separately.
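The iterative "find the longest repeat, mask it, repeat" heuristic can be sketched as follows for direct repeats. This is a minimal illustration and not the KMR implementation used in the study: exact repeats are found by naive k-mer hashing combined with a binary search on the repeat length, and the masking character and function names are assumptions of the sketch.

```python
MASK = "N"  # masking symbol; assumes the input sequence contains no N

def find_repeat(seq, k):
    """Return (i, j) for the first unmasked k-mer seen twice, else None."""
    seen = {}
    for j in range(len(seq) - k + 1):
        word = seq[j:j + k]
        if MASK in word:
            continue
        if word in seen:
            return seen[word], j
        seen[word] = j
    return None

def longest_repeat(seq, lmin):
    """Binary search on k: a repeated k-mer implies a repeated (k-1)-mer."""
    lo, hi, best = lmin, len(seq) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = find_repeat(seq, mid)
        if hit is not None:
            best = (hit[0], hit[1], mid)
            lo = mid + 1
        else:
            hi = mid - 1
    return best

def collect_seeds(seq, lmin):
    """Repeatedly take the longest exact direct repeat (>= lmin) and mask
    both of its copies, until no repeat of length >= lmin remains."""
    seeds = []
    while True:
        hit = longest_repeat(seq, lmin)
        if hit is None:
            return seeds
        i, j, k = hit
        seeds.append(hit)
        for start in (i, j):
            seq = seq[:start] + MASK * k + seq[start + k:]
```

Each seed is reported as (position of first copy, position of second copy, length) in 0-based coordinates. Inverted repeats could be handled in the same spirit by also indexing the reverse complement of each k-mer, which is left out here for brevity.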
Second Step: Removing Low-Complexity, Overlapping, and Telomeric Seeds
In order to remove low-complexity repeats (like microsatellites), we used an entropy filter. The entropy is taken here in the sense of Shannon (Schneider et al. 1986) for dinucleotide distribution:

$$H = -\sum_{i=1}^{16} p_i \log p_i$$

where $p_i$ is the frequency of the ith dinucleotide. The entropy (H) is computed on the sequence of a repeat (Hrepeat) and on the whole chromosome (Hchromosome). The values are then compared by computing the ratio Hrepeat/Hchromosome. This ratio was calibrated by using artificial stretches of mono-, di-, tri-, and tetranucleotides to define a threshold: only repeats whose ratio was greater than 0.6 were kept.
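The entropy filter can be illustrated with a short sketch. It assumes that dinucleotides are counted in overlapping windows and that the logarithm is taken in base 2 (the base cancels in the ratio); the function names are hypothetical, and only the 0.6 threshold comes from the text.

```python
import math
from collections import Counter

def dinucleotide_entropy(seq):
    """Shannon entropy (in bits) of the overlapping dinucleotide distribution."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def passes_entropy_filter(repeat_seq, chromosome_seq, threshold=0.6):
    """Keep a repeat only if H_repeat / H_chromosome exceeds the threshold."""
    h_chromosome = dinucleotide_entropy(chromosome_seq)
    if h_chromosome == 0:
        return False  # degenerate reference sequence, nothing to compare to
    return dinucleotide_entropy(repeat_seq) / h_chromosome > threshold
```

A mononucleotide run has an entropy of zero and a perfect dinucleotide microsatellite has an entropy of one bit, so both fall well below a typical chromosome-wide value and are rejected by the ratio test.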
Next, we discarded all repeats for which the two copies overlap. At this stage, these repeats generally correspond to multicopies of small words.
Finally, we removed all subtelomeric duplications. Several well-known elements are located in the subtelomeric regions (Y' sequence [Louis and Haber 1990], X2 [Britten 1998], seripauperine [Viswanathan et al. 1994]). These elements have already been widely studied and are known to exhibit a highly special plasticity (for review, see Pryde, Gorham, and Louis 1997). We arbitrarily set a subtelomeric barrier at 30 kb and removed all repeats with at least one copy in a subtelomeric region.
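The overlap and subtelomeric filters amount to simple coordinate tests, sketched below. The sketch assumes 0-based, half-open coordinates, that the first copy starts before the second, and that a copy counts as subtelomeric as soon as any part of it lies within 30 kb of a chromosome end; these conventions and the function name are assumptions, since the text does not specify them.

```python
def keep_seed(start1, start2, length, chromosome_length, barrier=30_000):
    """Return True if a seed is neither self-overlapping nor subtelomeric.

    start1, start2: 0-based start positions of the two copies (start1 <= start2).
    """
    end1, end2 = start1 + length, start2 + length

    # Copies overlap when the second one starts before the first one ends.
    if start2 < end1:
        return False

    def subtelomeric(start, end):
        # Any part of the copy within `barrier` bp of either chromosome end.
        return start < barrier or end > chromosome_length - barrier

    return not (subtelomeric(start1, end1) or subtelomeric(start2, end2))
```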
Third Step: Extending the Seeds
Exact repeats (seeds) were extended into larger nonstrict repeats by using a local alignment program (Smith and Waterman 1981) developed by P. Hardy and M. Waterman (http://www.hto.usc.edu/software/seqaln/). The sequence of a seed was substituted with X's, and 100 bp were picked on both sides. For example, a seed of 30 bp will become (A/C/G/T)100-(X)30-(A/C/G/T)100. The scoring matrix retained for the alignment was as follows: match(A/T/C/G) = +4; match(X) = +99; mismatch(A/T/G/C) = -4; mismatch(X) = -99; Gapopen = -16; Gapextension = -4. The value +99 forces the program to always align the two copies of the seed. When the best local alignment found by the program ended less than 10 bp from one of the sequence termini, the sequences were further extended by 200 bp and a new run was performed. This operation was iterated until the alignment ended more than 10 bp from both termini. It should be pointed out that after this step, several different initial seeds may give rise to the same (or a similar) extended repeat. Therefore, when two or more extended repeats occurred at the same location (with a tolerance of 20% of their length), we kept only the longest one.
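To make the role of the X-masked seed concrete, the following sketch computes a local alignment score with affine gaps using the scoring scheme given above. It is a stand-in written for illustration, not the Hardy and Waterman program; it returns only the best score, and it assumes that a gap of length k costs Gapopen + (k - 1) × Gapextension, a convention the text does not state.

```python
def pair_score(a, b):
    """Scoring scheme from the text: X only pairs profitably with X."""
    if a == "X" or b == "X":
        return 99 if a == b else -99
    return 4 if a == b else -4

def local_alignment_score(s, t, gap_open=-16, gap_extend=-4):
    """Best Smith-Waterman score with affine gaps (Gotoh recurrences).

    Assumption: the first gap position costs gap_open, each further
    position costs gap_extend.
    """
    neg_inf = float("-inf")
    prev_h = [0] * (len(t) + 1)        # best score ending at (i-1, j)
    prev_f = [neg_inf] * (len(t) + 1)  # best score ending with a gap in t
    best = 0
    for i in range(1, len(s) + 1):
        h = [0] * (len(t) + 1)
        f = [neg_inf] * (len(t) + 1)
        e = neg_inf                    # best score ending with a gap in s
        for j in range(1, len(t) + 1):
            e = max(h[j - 1] + gap_open, e + gap_extend)
            f[j] = max(prev_h[j] + gap_open, prev_f[j] + gap_extend)
            diag = prev_h[j - 1] + pair_score(s[i - 1], t[j - 1])
            h[j] = max(0, diag, e, f[j])
            best = max(best, h[j])
        prev_h, prev_f = h, f
    return best
```

Because each X-X pair contributes +99 while an X paired with a nucleotide costs -99, any optimal local alignment of two masked windows passes through the seed, and the flanking nucleotides only add to the score where they still resemble each other.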
Fourth Step: Removing Short or Biologically Trivial Repeats
In order to remove repeats that were too short or too different, we decided to keep repeats with (1) a minimum percentage of identity and (2) a minimum number of matches between their two copies. These minima were arbitrarily set at 50% identity and 30 matches. Finally, we applied a last filter in order to remove all "biologically trivial" duplications, which have their own dynamics. Actually, many of the repeats were due to the 275 tRNAs, 2 rRNAs, 50 Ty's, or 385 solos widespread in the yeast genome and were therefore removed. The positions of these known repeated elements were extracted from the SGD annotations (http://genome-www.stanford.edu/Saccharomyces).
Results
The application of the previously described method yields a total of 275 direct repeats and 340 inverted repeats on the yeast genome. In comparison, the random genomes (see Materials and Methods) produce an average of 25 direct repeats and 24 inverted repeats. The number and distribution of repeats differ from one chromosome to the other (data not shown). However, in the rest of this analysis, we pooled together all of the repeats in order to get sufficient statistics to study their global properties. In order to examine more closely the characteristics of the repeats, we focused on three parameters: "length" simply denotes the mean length of the two copies; "identity" is defined as the ratio of the number of matches between the two copies over the length of the largest copy; and "delta," also called "spacer" in the literature (Klein 1995), is defined as the distance between the two copies. For both orientations, delta begins after the 3' end of the first copy. It stops at the 5' end of the second copy for direct repeats and at its 3' end for inverted ones.
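The three parameters, together with the final filter of the fourth step (at least 50% identity and at least 30 matches), can be written down as a small data structure. The sketch assumes 0-based, half-open coordinates on the forward strand, so that delta is the gap between the end of the first copy and the start of the second for both orientations; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Repeat:
    start1: int   # first copy, forward-strand coordinates [start1, end1)
    end1: int
    start2: int   # second copy, forward-strand coordinates [start2, end2)
    end2: int
    matches: int  # number of matching positions in the alignment

    @property
    def length(self):
        """Mean length of the two copies."""
        return ((self.end1 - self.start1) + (self.end2 - self.start2)) / 2

    @property
    def identity(self):
        """Matches divided by the length of the longer copy."""
        return self.matches / max(self.end1 - self.start1,
                                  self.end2 - self.start2)

    @property
    def delta(self):
        """Spacer between the two copies."""
        return self.start2 - self.end1

    def passes_final_filter(self, min_identity=0.5, min_matches=30):
        return self.identity >= min_identity and self.matches >= min_matches
```

For example, a repeat whose copies span 100 and 90 bp with 85 matches and a 400-bp spacer has a length of 95 bp and an identity of 0.85, and is retained by the filter.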
Differences Between Direct and Inverted Repeats
Figure 1 shows the distributions of the three parameters described above for the two orientation classes and for real and random genomes. The comparison of real direct repeats with random ones in figure 1a reveals important differences: random repeats are all shorter than 100 bp, whereas a significant number (146/275) of real ones are much longer than 100 bp (up to 3,885 bp coming from the ENA family on chromosome IV). On the contrary, real inverted repeats behave much like random ones: only a few (71/340) real inverted repeats are significantly longer than 100 bp. Thus, on the sole basis of their length, it seems clear that real direct repeats are different from real inverted ones.
Finally, the histograms of delta (fig. 1c) highlight another important structural difference between real direct and inverted repeats. Most (139/275) real direct repeats have deltas shorter than 1 kb, while random repeats exhibit almost exclusively deltas longer than 1 kb. In contrast, real inverted repeats display about the same distribution as random inverted ones.
In summary, these results show that both orientation classes are different from random distribution and that real direct and inverted repeats constitute two different populations with distinct properties. The main difference concerns the delta parameter, with the majority of direct repeats being closely spaced (delta smaller than 1 kb). Hereafter, we refer to them as "close" (as opposed to "distant") direct repeats.
Identity Is Negatively Correlated with Delta for Close Direct Repeats
In order to reveal possible correlations between the parameters, we plotted, for both orientation classes and for real and random genomes, the identity as a function of delta. Figure 2a suggests that, for close direct repeats, identity is negatively correlated with delta. This visual observation is further confirmed by the Kendall tau correlation measurement ($\tau = -0.36$; $P < 10^{-10}$). In contrast, no such correlation is found for larger deltas or for random repeats.
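For readers unfamiliar with the statistic, a rank correlation of this kind can be computed as in the sketch below, which implements Kendall's tau-b by direct enumeration of pairs. Whether the original analysis used the tau-a or tau-b variant is not stated, so the choice of tau-b is an assumption, and no P value is computed here.

```python
import math
from itertools import combinations

def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equally long sequences (O(n^2) pairs)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied on both variables: ignored
        if dx == 0:
            ties_x += 1         # tied on x only
        elif dy == 0:
            ties_y += 1         # tied on y only
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    if denom == 0:
        raise ValueError("tau is undefined when one variable is constant")
    return (concordant - discordant) / denom
```

Applied to the (delta, identity) pairs of the close direct repeats, a negative value indicates that repeats with larger spacers tend to have less similar copies.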
In contrast, a much lower percentage of distant repeats (58%; 79/136) and inverted repeats (40.6%; 138/340) are completely included within CDSs. Moreover, only 50.6% (40/79) of distant direct DNA repeats and 34.1% (47/138) of inverted DNA repeats correspond to repeats at the protein level.
Discussion
This investigation of intrachromosomal duplications allows us to bring out several biological results and hypotheses about the dynamics of repeats. The first set of arguments comes from the analysis of the data presented in figures 2 and 3: for direct repeats, the main differences observed between close repeats (delta < 1 kb) and distant repeats (delta > 1 kb) were as follows:
Some of the interpretations invoked above could be considered contradictory. In particular, the high percentage of identity of the close direct repeats is considered the mark of a recent duplication event (point 1, i), whereas their short length could be a consequence of the long time elapsed from the initial duplication event (point 2, ii). Therefore, as explained below, we shall consider the second explanation less probable.
Conversion Versus Deletion: A Plausible Explanation
For close direct repeats, it seems reasonable to think that the extent of the exchange process is negatively correlated with delta. The exchange process can be either deletion (loss of one copy by reciprocal recombination or replication slippage) or conversion (homogenization of the two copies by nonreciprocal recombination). In fact, experimental studies undertaken on Bacillus subtilis (Chedin et al. 1994) and Escherichia coli (Lovett et al. 1994) have highlighted similar results. Thus, the decrease in identity as a function of delta (fig. 2) could be explained by a decrease in the conversion rate.
To understand the correlation between length and delta (fig. 3), we must put the genetic exchange back in its dynamic context: each close repeat can undergo either a deletion or a conversion event. If a deletion occurs, there is no way back. On the contrary, if a conversion occurs, the two copies are still present and a new round of exchange (i.e., conversion or deletion) is possible. So, over a long period, a bias in favor of deletion of one copy should be observed. Furthermore, several experiments have demonstrated a positive correlation between recombination rate and repeat length in yeast (Jinks, Michelitch, and Ramcharan 1993), bacteria (Peeters et al. 1988), and phages (Pierce, Kong, and Masker 1991). Briefly, for a short delta, a long repeat should be too unstable to persist, but by increasing delta, longer repeats could be maintained. This "length tolerance" effect could explain the observed positive correlation between length and delta.
Functional Pressures: A Protection from Deletion
Another important difference between close and distant repeats is related to their presence within CDS: close direct repeats are located mainly within CDSs and in the same frame, therefore giving rise to repeats at the protein level as well. On the contrary, distant direct repeats give rise to fewer protein repeats. These observations lead to two nonexclusive hypotheses:
It should be pointed out that Marcotte et al. (1998) recently reported that there is a high frequency of internal repeats within protein sequences of eukaryotes (as compared with prokaryotes). This observation is in line with the results presented in this study. To summarize, we probably observe a combination of DNA mechanisms which tend to (1) delete the close repeats and (2) keep them identical and which are constrained by functional pressures at the protein level.
Direct Versus Inverted Repeats
The last group of results we take into account concerns the important differences observed between direct and inverted repeats. These differences can be summarized as follows:
A Dynamic Model for Intrachromosomal Repeats
We propose a simple model, illustrated in figure 4, to explain all these observations and solve the apparent contradictions. In this model, the initial event is the continuous production of close direct repeats. Whatever the mechanism giving rise to it (unequal crossing over or replication slippage), when a close direct repeat is created, it can be either modified by mutation (although the high conversion rate will tend to maintain the two copies identical) or deleted (since the deletion rate is high). As long as the repeat remains, a new exchange (conversion/deletion) is possible. Therefore, the fate of a close direct repeat is to disappear sooner or later (depending on the conversion rate vs. the deletion rate). As a consequence, only two kinds of direct repeats can be conserved on a large time scale:
This model implies that most intrachromosomal repeats originate from close direct duplications but does not preclude any mechanism. Furthermore, it gives rise to several predictions that can be experimentally tested, like a negative correlation of the deletion/conversion rate with delta.
Acknowledgements
We thank E. Rocha, J. Pothier, E. Maillier, and D. Higuet for their scientific help and their friendly support. This work was supported by grants from Association pour la Recherche sur le Cancer. E.C. and P.N. are members of the Université Pierre et Marie Curie, Paris.
Footnotes
1 Abbreviation: CDS, coding sequence.
2 Keywords: genome dynamics, evolution, duplication, direct repeats, Saccharomyces cerevisiae.
3 Address for correspondence and reprints: Guillaume Achaz, Structure et dynamique des génomes, IJM, Tour 4344, 1° étage, 4, place Jussieu, 75251 Paris CEDEX 05, France. E-mail: achaz{at}ijm.jussieu.fr
Literature Cited
Britten, R. J. 1998. Precise sequence complementarity between yeast chromosome ends and two classes of just-subtelomeric sequences. Proc. Natl. Acad. Sci. USA 95:5906-5912.
Chedin, F., E. Dervyn, R. Dervyn, S. D. Ehrlich, and P. Noirot. 1994. Frequency of deletion formation decreases exponentially with distance between short direct repeats. Mol. Microbiol. 12:561-569.
Chervitz, S. A., L. Aravind, G. Sherlock et al. (13 co-authors). 1998. Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science 282:2022-2028.
Coissac, E., E. Maillier, and P. Netter. 1997. A comparative study of duplications in bacteria and eukaryotes: the importance of telomeres. Mol. Biol. Evol. 14:1062-1074.
Fleischmann, R. D., M. D. Adams, O. White et al. (11 co-authors). 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496-512.
Goffeau, A., B. G. Barell, H. Bussey et al. (16 co-authors). 1996. Life with 6000 genes. Science 274:546-567.
Jinks, R. S., M. Michelitch, and S. Ramcharan. 1993. Substrate length requirements for efficient mitotic recombination in Saccharomyces cerevisiae. Mol. Cell. Biol. 13:3937-3950.
Karlin, S., and F. Ost. 1985. Maximal segmental match length among random sequences from a finite alphabet. Pp. 225-243 in L. M. L. Cam and R. A. Olshen, eds. Proceedings of the Berkeley Conference in honour of Jerzy Neyman and Jack Kiefer. Vol. 1. Association for Computing Machinery, New York.
Karp, R. M., R. E. Miller, and A. L. Rosenberg. 1972. Rapid identification of repeated patterns in strings, trees and arrays. Pp. 125-126 in Proceedings 4th Annual ACM Symposium Theory of Computing, New York.
Klein, H. L. 1995. Genetic control of intrachromosomal recombination. Bioessays 17:147-159.
Lalo, D., S. Stettler, S. Mariotte, P. P. Slonimski, and P. Thuriaux. 1993. Two yeast chromosomes are related by a fossil duplication of their centromeric regions. C. R. Acad. Sci. 316:367-373.
Leung, M. Y., B. E. Blaisdell, C. Burge, and S. Karlin. 1991. An efficient algorithm for identifying matches with errors in multiple long molecular sequences. J. Mol. Biol. 221:1367-1378.
Louis, E. J., and J. E. Haber. 1990. The subtelomeric Y' repeat family in Saccharomyces cerevisiae: an experimental system for repeated sequence evolution. Genetics 124:533-545.
Lovett, S. T., T. J. Gluckman, P. J. Simon, V. J. Sutera, and P. T. Drapkin. 1994. Recombination between repeats in Escherichia coli by a recA-independent, proximity-sensitive mechanism. Mol. Gen. Genet. 245:294-300.
Marcotte, E. M., M. Pellegrini, T. O. Yeates, and D. Eisenberg. 1998. Census of protein repeats. J. Mol. Biol. 293:151-160.
Mewes, H. W., K. Albermann, M. Bahr et al. (12 co-authors). 1997. Overview of the yeast genome. Nature 387:7-65.
Peeters, B. P., B. J. De, S. Bron, and G. Venema. 1988. Structural plasmid instability in Bacillus subtilis: effect of direct and inverted repeats. Mol. Gen. Genet. 212:450-458.
Pierce, J. C., D. Kong, and W. Masker. 1991. The effect of the length of direct repeats and the presence of palindromes on deletion between directly repeated DNA sequences in bacteriophage T7. Nucleic Acids Res. 19:3901-3905.
Pryde, F. E., H. C. Gorham, and E. J. Louis. 1997. Chromosome ends: all the same under their caps. Curr. Opin. Genet. Dev. 7:822-828.
Rocha, E. P. C., A. Danchin, and A. Viari. 1999. Analysis of long repeats in bacterial genomes reveals alternative evolutionary mechanisms in Bacillus subtilis and other competent prokaryotes. Mol. Biol. Evol. 16:1219-1230.
Schneider, T. D., G. D. Stormo, L. Gold, and A. Ehrenfeucht. 1986. Information content of binding sites on nucleotide sequences. J. Mol. Biol. 188:415-431.
Semple, C., and K. H. Wolfe. 1999. Gene duplication and gene conversion in the Caenorhabditis elegans genome. J. Mol. Evol. 48:555-564.
Smith, T. F., and M. S. Waterman. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.
Vincens, P., L. Buffat, C. Andre, J. P. Chevrolat, J. F. Boisvieux, and S. Hazout. 1998. A strategy for finding regions of similarity in complete genome sequences. Bioinformatics 14:715-725.
Viswanathan, M., G. Muthukumar, Y. S. Cong, and J. Lenard. 1994. Seripauperins of Saccharomyces cerevisiae: a new multigene family encoding serine-poor relatives of serine-rich proteins. Gene 148:149-153.
Wolfe, K. H., and D. C. Shields. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708-713.