Analysis of Intrachromosomal Duplications in Yeast Saccharomyces cerevisiae: A Possible Model for Their Origin

Guillaume Achaz3,*, Eric Coissac*, Alain Viari† and Pierre Netter*

*Structure et dynamique des génomes, Institut Jacques Monod, Paris, France; and
†Atelier de Bioinformatique, Université Paris VI, Paris, France

Abstract

The complete genome of the yeast Saccharomyces cerevisiae was investigated for intrachromosomal duplications at the level of nucleotide sequences. The analysis was performed by looking for long approximate repeats (from 30 to 3,885 bp) present on each of the chromosomes. We show that direct and inverted repeats exhibit very different characteristics: the two copies of direct repeats are more similar and longer than those of inverted repeats. Furthermore, contrary to the inverted repeats, a large majority of direct repeats appear to be closely spaced. The distance (delta) between the two copies is generally smaller than 1 kb. Further analysis of these "close direct repeats" shows a negative correlation between delta and the percentage of identity between the two copies, and a positive correlation between delta and repeat length. Moreover, contrary to the other categories of repeats, close direct repeats are mostly located within coding sequences (CDSs). We propose two hypotheses in order to interpret these observations: first, the deletion/conversion rate is negatively correlated with delta; second, there exists an active duplication mechanism which continuously creates close direct repeats, the other intrachromosomal repeats being the result, by chromosomal rearrangements, of these "primary repeats."

Introduction

Since the first complete bacterial genome (Fleischmann et al. 1995), 22 new eubacteria, 6 archaebacteria, and 3 eukaryote sequences have been completed, and several new genomics fields, such as "functional genomics" (deciphering the function of genes) and "comparative genomics" (comparison of entire genomes) (Chervitz et al. 1998), have emerged. Here, we focus on "dynamical genomics," which can be seen as the study of chromosome history and dynamics through the analysis of the structure of current genomes. One way of studying these phenomena is through the analysis of chromosomal rearrangement remnants, such as duplications. Among eukaryotes, budding yeast Saccharomyces cerevisiae, which has been completely sequenced (Goffeau et al. 1996), is a good model because of its small size (12.1 Mb) and its comprehensive annotation.

The first evidence of sequence duplication in S. cerevisiae came from Lalo et al. (1993), who found a large duplication event between chromosomes II and XIV. More exhaustive studies, based on translated coding sequence (CDS) alignments, have brought large interchromosomal duplications to prominence (Coissac, Maillier, and Netter 1997; Wolfe and Shields 1997). Further analysis revealed that, for the two copies of a duplicated CDS, the distances to the closest telomere are similar (Coissac, Maillier, and Netter 1997). The importance of telomeres underlines the relation between nuclear organization and genome dynamics. Other studies were undertaken at the DNA level, leading to the development of a "duplication databank" (Mewes et al. 1997) and to the definition of the X2 element in the subtelomeric region (Britten 1998).

However, to our knowledge, apart from the description of gene tandem duplication (CUP1, PMR2, rDNA, ASP3), no systematic study has yet been undertaken on intrachromosomal duplications. In the present work, we searched for intrachromosomal repeats at the level of nucleotide sequences. Through this analysis, we show that direct and inverted repeats exhibit very different characteristics. Moreover, we identify a special class of direct repeats (named close direct repeats) exhibiting several particular features. Finally, we propose a model based on the active flow of creation of these close direct repeats and their dispersion by chromosomal rearrangements.

Materials and Methods

Data
The S. cerevisiae complete sequences and annotations were extracted from the Saccharomyces Genome Database (SGD; http://genome-www.stanford.edu/Saccharomyces/). The total size of the 16 chromosomes is 12.1 Mb. We used the entire nuclear sequences as given in the database, including the three tandem clusters (CUP1, rDNA, and PMR2), which were reduced to a single repeat. We additionally built 10 "random genomes" by shuffling each chromosome independently with respect to its dinucleotide composition.
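As an illustration of what a dinucleotide-aware random chromosome might look like, the sketch below samples a first-order Markov chain trained on the input sequence. This is a substitute technique, not the shuffling procedure used in the study (which is not detailed in the text): it preserves the dinucleotide composition only approximately. The function name `markov_shuffle` and the toy sequence are hypothetical.

```python
import random
from collections import defaultdict

def markov_shuffle(seq, rng):
    """Return a random sequence of the same length whose dinucleotide
    frequencies approximate those of `seq` (first-order Markov chain).
    Assumption: this only approximates an exact dinucleotide shuffle."""
    if len(seq) < 2:
        return seq
    # successors[b] lists every base that follows b in the original sequence
    successors = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        successors[a].append(b)
    out = [seq[0]]
    while len(out) < len(seq):
        choices = successors.get(out[-1])
        # Fall back to the overall base composition if a base has no successor
        out.append(rng.choice(choices) if choices else rng.choice(seq))
    return "".join(out)

rng = random.Random(0)                          # fixed seed for reproducibility
chromosome = "ACGTTGCAACGTAGCTAGGATCCA" * 20    # toy stand-in for one chromosome
random_genomes = [markov_shuffle(chromosome, rng) for _ in range(10)]
```

Each chromosome would be processed independently in this way, once per random genome.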

Construction of the Repeat Database
Our primary goal was to look for approximate repeats, i.e., repeats whose copies may not be strictly identical but may contain errors (mismatches and indels). The usual procedure for this purpose derives from dynamic programming (Smith and Waterman 1981) but is unfortunately not amenable to the study of very long sequences because of its quadratic time complexity. Although several heuristics have already been proposed to work around this problem (Leung et al. 1991; Vincens et al. 1998), we chose here to develop our own procedure in order to fit the biological problem more closely. Like most of the already-proposed heuristics, this procedure first looks for "seeds" of exact repeats and then extends the seeds by using dynamic programming techniques. This is done for each chromosome independently in four consecutive steps which are described as follows.

First Step: Searching for Seeds
Exact repeats were detected by using the Karp-Miller-Rosenberg (KMR) algorithm (Karp, Miller, and Rosenberg 1972), which finds the largest subword present at least rmin times (here rmin = 2) in a text (here, each chromosome). Since we were interested in "unusually" large repeats (i.e., repeats which did not appear by chance), we set a threshold (Lmin) on the minimal length of repeats of interest. Lmin was calculated using the statistics developed by Karlin and Ost (1985). For each chromosome, we chose Lmin such that the probability of finding a three-copy repeat in a random sequence with the same length and base composition as the chromosome was less than 0.001. Lmin typically ranges from 15 to 17 bp depending on the chromosome length.

In order to avoid the problem of any two subwords of a repeated word being themselves repeated, we devised the following heuristics (Rocha, Danchin, and Viari 1999): first, the longest repeat on the chromosome is sought, and its length is compared with the minimum preset value Lmin. When the test is successful, both copies of the repeat are masked and excluded from further analysis. The process is iterated up to the point where the length of the largest repeat becomes smaller than Lmin. It should be pointed out that this process is a heuristic. In particular, if there is a three-copy repeat where the third copy appears a little bit shorter, then this copy will be missed by the method. This explains the rationale behind the procedure to set up the threshold Lmin (vide supra). We devised two versions of the program: one to detect direct repeats and the other to detect inverted repeats (repeats for which the second copy has the reverse orientation). The two orientation classes (direct and inverted) were further handled separately.
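The greedy "find the longest repeat, mask it, iterate" loop can be sketched as follows. This is only an illustration for direct repeats: it replaces the KMR algorithm with a plain substring dictionary and a binary search on the repeat length, and the names `longest_repeat`, `greedy_seeds` and the mask character `#` are hypothetical.

```python
def longest_repeat(seq, lmin, mask="#"):
    """Return (length, pos1, pos2) for the longest exact direct repeat of
    length >= lmin that avoids masked positions, or None if there is none."""
    def find(k):
        seen = {}
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if mask in word:
                continue                # skip words touching a masked copy
            if word in seen:
                return seen[word], i    # first pair of occurrences found
            seen[word] = i
        return None

    if find(lmin) is None:
        return None
    # "A repeat of length k exists" is monotone in k, so binary search works
    lo, hi = lmin, len(seq) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if find(mid) is not None:
            lo = mid
        else:
            hi = mid - 1
    pos1, pos2 = find(lo)
    return lo, pos1, pos2

def greedy_seeds(seq, lmin, mask="#"):
    """Repeatedly take the longest repeat, record it, and mask both copies."""
    seeds = []
    while True:
        hit = longest_repeat(seq, lmin, mask)
        if hit is None:
            return seeds
        k, pos1, pos2 = hit
        seeds.append((pos1, pos2, k))
        buf = list(seq)
        for p in (pos1, pos2):
            buf[p:p + k] = mask * k     # exclude this copy from later rounds
        seq = "".join(buf)

# Toy sequence: an 11-bp word present twice, at 0-based positions 2 and 21
toy = "TC" + "GATTACAGATC" + "ATGCCGTA" + "GATTACAGATC" + "GG"
seeds = greedy_seeds(toy, lmin=8)       # -> [(2, 21, 11)]
```

An inverted-repeat version would compare each word with the reverse complement of the sequence instead of the sequence itself.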

Second Step: Removing Low-Complexity, Overlapping, and Telomeric Seeds
In order to remove low-complexity repeats (like microsatellites), we used an entropy filter. The entropy is taken here in the sense of Shannon (Schneider et al. 1986) for dinucleotide distribution:

$$H = -\sum_{i=1}^{16} p_i \log_2 p_i$$

where p_i is the frequency of the ith dinucleotide. The entropy (H) is computed on the sequence of a repeat (Hrepeat) and on the whole chromosome (Hchromosome). The values are then compared by computing the ratio Hrepeat/Hchromosome. This ratio was calibrated by using artificial stretches of mono-, di-, tri-, and tetranucleotides to define a threshold: only repeats whose ratio was greater than 0.6 were kept.
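A minimal sketch of such an entropy filter is given below. It assumes that dinucleotides are counted in overlapping windows and that the logarithm is taken in base 2 (the ratio Hrepeat/Hchromosome does not depend on the base); the function names are hypothetical.

```python
import math
from collections import Counter

def dinucleotide_entropy(seq):
    """Shannon entropy (in bits) of the overlapping dinucleotide distribution."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def passes_entropy_filter(repeat_seq, chromosome_seq, threshold=0.6):
    """Keep a repeat only if Hrepeat / Hchromosome exceeds the threshold."""
    ratio = dinucleotide_entropy(repeat_seq) / dinucleotide_entropy(chromosome_seq)
    return ratio > threshold

# Toy "chromosome" containing each of the 16 dinucleotides exactly once (H = 4 bits)
toy_chromosome = "AACAGATCCGCTGGTTA"
microsatellite = "ACACACACAC"           # low-complexity repeat, H close to 1 bit
keep = passes_entropy_filter(microsatellite, toy_chromosome)   # -> False
```

With these toy inputs the microsatellite yields a ratio of about 0.25 and is discarded.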

Next, we discarded all repeats for which the two copies overlap. At this stage, these repeats generally correspond to multicopies of small words.

Finally, we removed all subtelomeric duplications. Several well-known elements are located in the subtelomeric regions (Y' sequence [Louis and Haber 1990], X2 [Britten 1998], seripauperin [Viswanathan et al. 1994]). These elements have already been widely studied and are known to exhibit a highly special plasticity (for review, see Pryde, Gorham, and Louis 1997). We arbitrarily set a subtelomeric barrier at 30 kb and removed all repeats with at least one copy in a subtelomeric region.
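The overlap and subtelomeric filters amount to simple coordinate tests, sketched below. The sketch assumes 0-based start positions on the forward strand, two copies of equal length (exact seeds), and that a copy counts as subtelomeric as soon as any part of it lies within 30 kb of a chromosome end; all function names are hypothetical.

```python
def copies_overlap(start1, start2, length):
    """True if two exact copies of the same length share at least one position."""
    return abs(start2 - start1) < length

def in_subtelomere(start, length, chrom_len, barrier=30_000):
    """True if any part of the copy lies within `barrier` bp of a chromosome end."""
    return start < barrier or start + length > chrom_len - barrier

def keep_seed(start1, start2, length, chrom_len):
    """Apply the overlap filter, then the subtelomeric filter, to one seed."""
    if copies_overlap(start1, start2, length):
        return False
    return not (in_subtelomere(start1, length, chrom_len)
                or in_subtelomere(start2, length, chrom_len))
```

For example, on a 500-kb chromosome, `keep_seed(100_000, 100_500, 40, 500_000)` keeps the seed, whereas a seed with one copy starting at position 10,000 is rejected.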

Third Step: Extending the Seeds
Exact repeats (seeds) were extended into larger nonstrict repeats by using a local alignment program (Smith and Waterman 1981) developed by P. Hardy and M. Waterman (http://www.hto.usc.edu/software/seqaln/). The sequence of a seed was substituted with X's, and 100 bp were picked on both sides. For example, a seed of 30 bp will become (A/C/G/T)100-(X)30-(A/C/G/T)100. The scoring matrix retained for the alignment was as follows: match(A/T/C/G) = +4; match(X) = +99; mismatch(A/T/G/C) = -4; mismatch(X) = -99; Gapopen = -16; Gapextension = -4. The value +99 will force the program to always align the two copies of the seed. When the best local alignment found by the program ended less than 10 bp from one of the sequence termini, the sequences were further extended by 200 bp and a new run was performed. This operation was iterated until the alignment eventually ended more than 10 bp from both sides. It should be pointed out that after this step, several different initial seeds may give rise to the same (or a similar) extended repeat. Therefore, when two or more extended repeats occurred at the same location (with a tolerance of 20% of their length), we just kept the longest one.
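The effect of the X-masking and of the scoring scheme can be illustrated with a score-only local alignment using affine gaps (Gotoh recurrences). This is a sketch, not the seqaln program used in the study: it assumes that a gap of k positions costs Gapopen + (k - 1) x Gapextension, it omits the traceback and the iterative 200-bp extension, and the function names are hypothetical.

```python
NEG_INF = float("-inf")

def sub_score(a, b):
    """Substitution score: +4/-4 for nucleotides, +99/-99 when an X is involved."""
    if a == "X" or b == "X":
        return 99 if a == b else -99
    return 4 if a == b else -4

def local_affine_score(s, t, gap_open=-16, gap_ext=-4):
    """Best local alignment score of s and t with affine gap penalties."""
    m = len(t)
    h_prev = [0] * (m + 1)      # H values of the previous row
    f = [NEG_INF] * (m + 1)     # best score ending with a vertical gap, per column
    best = 0
    for i in range(1, len(s) + 1):
        h_cur = [0] * (m + 1)
        e = NEG_INF             # best score ending with a horizontal gap in this row
        for j in range(1, m + 1):
            e = max(h_cur[j - 1] + gap_open, e + gap_ext)
            f[j] = max(h_prev[j] + gap_open, f[j] + gap_ext)
            diag = h_prev[j - 1] + sub_score(s[i - 1], t[j - 1])
            h_cur[j] = max(0, diag, e, f[j])
            best = max(best, h_cur[j])
        h_prev = h_cur
    return best

def masked_window(chrom, start, length, flank=100):
    """Replace the seed by X's and keep `flank` bases on each side."""
    left = chrom[max(0, start - flank):start]
    right = chrom[start + length:start + length + flank]
    return left + "X" * length + right

# Two toy windows sharing a 5-bp seed; the second has one mismatch in its left flank
w1 = "ACGTACGT" + "X" * 5 + "TTGA"
w2 = "ACGTTCGT" + "X" * 5 + "TTGA"
score = local_affine_score(w1, w2)      # 7*4 - 4 + 5*99 + 4*4 = 535
```

Because an X-to-X match is worth far more than any nucleotide match, the best alignment always runs through the seed, and the flanks are included only as far as they remain similar.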

Fourth Step: Removing Short or Biologically Trivial Repeats
In order to remove repeats that were too short or too different, we decided to keep repeats with (1) a minimum percentage of identity and (2) a minimum number of matches between their two copies. These minima were arbitrarily set at 50% identity and 30 matches. Finally, we applied a last filter in order to remove all "biologically trivial" duplications, which have their own dynamics. Indeed, many of the repeats were due to the 275 tRNAs, 2 rRNAs, 50 Ty's, or 385 solos widespread in the yeast genome and were therefore removed. The positions of these known repeated elements were extracted from the SGD annotations (http://genome-www.stanford.edu/Saccharomyces).

Results

The application of the previously described method yields a total of 275 direct repeats and 340 inverted repeats on the yeast genome. In comparison, the random genomes (see Materials and Methods) produce an average of 25 direct repeats and 24 inverted repeats. The number and distribution of repeats differ from one chromosome to the other (data not shown). However, in the rest of this analysis, we pooled together all of the repeats in order to get sufficient statistics to study their global properties. In order to examine more closely the characteristics of the repeats, we focused on three parameters: "length" simply denotes the mean length of the two copies; "identity" is defined as the ratio of the number of matches between the two copies over the length of the largest copy; and "delta," also called "spacer" in the literature (Klein 1995), is defined as the distance between the two copies. For both orientations, delta begins after the 3' end of the first copy. It stops at the 5' end of the second copy for direct repeats and at its 3' end for inverted ones.
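The three parameters can be computed from the coordinates of the two copies, as in the sketch below. It assumes half-open forward-strand coordinates with the first copy upstream of the second. Under that convention the 3' end of an inverted second copy is its leftmost forward-strand position, so delta reduces to the same expression for both orientations; this reading of the definition, like the class and function names, is an assumption. The final check applies the 50% identity and 30-match minima stated in Materials and Methods.

```python
from dataclasses import dataclass

@dataclass
class Repeat:
    start1: int            # first copy, half-open interval [start1, end1)
    end1: int
    start2: int            # second copy, half-open interval [start2, end2)
    end2: int
    matches: int           # number of matching positions in the alignment
    inverted: bool = False

    @property
    def length(self):
        """Mean length of the two copies."""
        return ((self.end1 - self.start1) + (self.end2 - self.start2)) / 2

    @property
    def identity(self):
        """Matches divided by the length of the largest copy."""
        return self.matches / max(self.end1 - self.start1, self.end2 - self.start2)

    @property
    def delta(self):
        """Distance between the two copies on the forward strand."""
        return self.start2 - self.end1

def passes_final_filter(r, min_identity=0.5, min_matches=30):
    return r.identity >= min_identity and r.matches >= min_matches

# Toy repeat: copies of 50 and 48 bp, 45 matches, 250 bp apart
r = Repeat(start1=100, end1=150, start2=400, end2=448, matches=45)
```

Here `r.length` is 49.0, `r.identity` is 0.9 and `r.delta` is 250, so this toy repeat would be classified as a close direct repeat.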

Differences Between Direct and Inverted Repeats
Figure 1 shows the distributions of the three parameters described above for the two orientation classes and for real and random genomes. The comparison of real direct repeats with random ones in figure 1a reveals important differences: random repeats are all shorter than 100 bp, whereas a significant number (146/275) of real ones are much longer than 100 bp (up to 3,885 bp coming from the ENA family on chromosome IV). On the contrary, real inverted repeats behave much like random ones: only a few (71/340) real inverted repeats are significantly longer than 100 bp. Thus, on the sole basis of their length, it seems clear that real direct repeats are different from real inverted ones.



Fig. 1. Distribution of the three parameters (length, identity, and delta) used in this study for each orientation (direct or inverted) of the repeats. Black boxes represent data observed for the real yeast genome, and gray boxes correspond to shuffled data. Since far fewer repeats are observed on random data (see text), repeats from 10 random genomes (each chromosome is shuffled with respect to the dinucleotide composition) have been pooled. a, Histogram of the length of the repeats. b, Histogram of the percentage of identity between the two copies of a repeat. c, Histogram of the distance (delta) between the two copies of a repeat.

 
As shown in figure 1b, both real direct and inverted repeats show a higher percentage of identity than random ones. Moreover, by comparing the two orientation classes for real data, a major difference appears: direct repeats exhibit a higher degree of similarity than inverted ones (for instance, 103 direct repeats, against only 29 inverted repeats, are found above 90% identity).

Finally, the histograms of delta (fig. 1c) highlight another important structural difference between real direct and inverted repeats. Most (139/275) real direct repeats have deltas shorter than 1 kb, while random repeats exhibit almost exclusively deltas longer than 1 kb. In contrast, real inverted repeats display about the same distribution as random inverted ones.

In summary, these results show that both orientation classes are different from random distribution and that real direct and inverted repeats constitute two different populations with distinct properties. The main difference concerns the delta parameter, with the majority of direct repeats being closely spaced (delta smaller than 1 kb). Hereafter, we refer to them as "close" (as opposed to "distant") direct repeats.

Identity Is Negatively Correlated with Delta for Close Direct Repeats
In order to reveal possible correlations between the parameters, we plotted, for both orientation classes and for real and random genomes, the identity as a function of delta. Figure 2a suggests that, for close direct repeats, identity is negatively correlated with delta. This visual observation is further confirmed by a Kendall tau correlation measurement (τ = -0.36; P ≈ 10^-10). In contrast, no such correlation is found for larger deltas or for random repeats.
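For readers unfamiliar with the statistic, a Kendall rank correlation can be computed as in the sketch below. It implements the simple tau-a variant (no tie correction, no P value); the text does not say which variant was used, so this choice is an assumption, and the data are invented toy values, not values from the study.

```python
def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Toy data: identity decreases steadily as delta increases
toy_delta = [10, 50, 200, 600, 900]
toy_identity = [98, 95, 90, 80, 70]
tau = kendall_tau(toy_delta, toy_identity)      # -> -1.0
```

A value near -1 means that almost every pair of repeats is ordered oppositely for the two variables; the value reported above for close direct repeats is a much weaker but highly significant -0.36.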



Fig. 2. Negative correlation between the percentage of identity and the spacing (delta) between the two copies of a repeat. The percentage of identity (y-axis) is plotted as a function of delta (x-axis on a logarithmic scale) for both real (left side) and shuffled (right side) yeast genomes. Direct repeats are given in a, and inverted repeats are given in b. Since far fewer repeats are observed on random data (see text), and in order to give rise to a comparable total number of points, the plots on the right actually correspond to the sum of 10 random genomes. The black curve (for real data) represents the mean of the y values (identity) computed on a sliding window spanning 20 data points. This visual negative correlation is further confirmed by Kendall tau rank tests (see text).

 
Length Is Positively Correlated with Delta for Close Direct Repeats
Similarly, we searched for a correlation between length and delta for both orientation classes and for real and random genomes. Figure 3a reveals a peculiar variation of the length as a function of delta. More precisely, close direct repeats exhibit a positive correlation (τ = +0.26; P ≈ 3 × 10^-6) between length and delta. It should be noted that no significant rank correlation between length and identity was observed for close direct repeats.



Fig. 3. Positive correlation between the length and the spacing (delta) between the two copies of a repeat. The mean length of the two copies of a repeat (y-axis) is plotted as a function of delta (x-axis on a logarithmic scale) for both real (left side) and shuffled (right side) yeast genomes. Direct repeats are given in a, and inverted repeats are given in b. Since far fewer repeats are observed on random data (see text), and in order to give rise to a comparable total number of points, the plots on the right actually correspond to the sum of 10 random genomes. The black curve (for real data) represents the mean of the y values (length) computed on a sliding window spanning 20 data points. This visual positive correlation is further confirmed by Kendall tau rank tests (see text).

 
Close Direct Repeats Are Mostly "Coding" Sequences
In order to find out whether repeats are located inside CDSs, we examined the positions of repeats in relation to the CDSs. This brought out two main results. Close direct repeats are mainly located within coding sequences: 85.6% (119/139) of them have their two copies completely included within CDSs and, with two exceptions, both copies are located within the same CDS. Moreover, for 115 of the remaining 117 repeats, the two copies are in the same coding frame, therefore giving rise to repeats at the protein level too.

In contrast, a much lower percentage of distant repeats (58%; 79/136) and inverted repeats (40.6%; 138/340) are completely included within CDSs. Moreover, only 50.6% (40/79) of distant direct DNA repeats and 34.1% (47/138) of inverted DNA repeats correspond to repeats at the protein level.
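The two tests used in this section (both copies completely included within a CDS, and both copies in the same coding frame) can be sketched as follows. The sketch assumes intronless CDSs given as half-open forward-strand intervals and treats two copies within the same CDS as being in the same frame when their start positions differ by a multiple of 3; these simplifications and the function names are assumptions, not the procedure actually used.

```python
def containing_cds(start, end, cds_list):
    """Return the first CDS interval that completely contains [start, end), or None."""
    for cds_start, cds_end in cds_list:
        if cds_start <= start and end <= cds_end:
            return (cds_start, cds_end)
    return None

def same_cds_and_frame(copy1, copy2, cds_list):
    """True if both copies lie within the same CDS and share its reading frame."""
    cds1 = containing_cds(copy1[0], copy1[1], cds_list)
    cds2 = containing_cds(copy2[0], copy2[1], cds_list)
    if cds1 is None or cds1 != cds2:
        return False
    return (copy2[0] - copy1[0]) % 3 == 0

toy_cds = [(1000, 2500)]                         # one hypothetical CDS
in_frame = same_cds_and_frame((1100, 1160), (1400, 1460), toy_cds)    # -> True
out_of_frame = same_cds_and_frame((1100, 1160), (1401, 1461), toy_cds)  # -> False
```

Only repeats passing both tests would be counted as repeats at the protein level.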

Discussion

This investigation of intrachromosomal duplications allows us to bring out several biological results and hypotheses about the dynamics of repeats. The first set of arguments comes from the analysis of the data presented in figures 2 and 3: for direct repeats, the main differences observed between close repeats (delta < 1 kb) and distant repeats (delta > 1 kb) are as follows:

  1. In figure 2, for close direct repeats, one can observe a negative correlation between the percentage of identity and delta: the shorter the delta, the higher the identity. Similar results have already been suggested for Caenorhabditis elegans (Semple and Wolfe 1999). This result could be understood if a high percentage of identity is (i) the mark of a recent adjacent duplication event and/or (ii) the result of an active conversion process (homogenization of the two copies). This latter process may depend upon the relative distance of the two copies (vide infra).
  2. Figure 3 shows that, for close direct repeats, there is a positive correlation between length and delta: the shorter the delta, the shorter the length of the repeat. This correlation could be interpreted as the result of (i) a specific mechanism preferentially deleting large repeats (the loss of one copy leading to a single sequence) and/or (ii) genetic erosion due to the mutational events accumulated since the initial duplication, which would also lead to a lower percentage of identity.

Some of the interpretations invoked above could be considered contradictory. In particular, the high percentage of identity of the close direct repeats is considered the mark of a recent duplication event (point 1, i), whereas their short length could be a consequence of the long time elapsed from the initial duplication event (point 2, ii). Therefore, as explained below, we shall consider the second explanation less probable.

Conversion Versus Deletion: A Plausible Explanation
For close direct repeats, it seems reasonable to think that the extent of the exchange process is negatively correlated with delta. The exchange process can be either deletion (loss of one copy by reciprocal recombination or replication slippage) or conversion (homogenization of the two copies by nonreciprocal recombination). In fact, experimental studies undertaken on Bacillus subtilis (Chedin et al. 1994) and Escherichia coli (Lovett et al. 1994) have highlighted similar results. Thus, the decrease in identity as a function of delta (fig. 2) could be explained by a decrease in the conversion rate.

To understand the correlation between length and delta (fig. 3), we must put the genetic exchange back in its dynamic context: actually, each close repeat can be submitted to a deletion or to a conversion event. If a deletion occurs, there is no way back. On the contrary, if a conversion occurs, the two copies are still present and a new round of exchange (i.e., conversion or deletion) is possible. So, during a long period, a bias in favor of deletion of one copy should be observed. Furthermore, several experiments have demonstrated a positive correlation between recombination rate and repeat length in yeast (Jinks, Michelitch, and Ramcharan 1993), bacteria (Peeters et al. 1988), and phages (Pierce, Kong, and Masker 1991). Briefly, for a short delta, a long repeat should be too unstable to persist, but by increasing delta, longer repeats could be maintained. This "length tolerance" effect could explain the observed positive correlation between length and delta.

Functional Pressures: A Protection from Deletion
Another important difference between close and distant repeats is related to their presence within CDS: close direct repeats are located mainly within CDSs and in the same frame, therefore giving rise to repeats at the protein level as well. On the contrary, distant direct repeats give rise to fewer protein repeats. These observations lead to two nonexclusive hypotheses:

  1. Close direct DNA repeats are most probably submitted to an active recombination pressure leading to the deletion of one of the copies. However, the repeat can be fixed if it is submitted to functional pressures at the protein level. The consequence is that one very rarely observes close direct repeats which have not been protected from deletion by this selective advantage (i.e., located outside CDSs), because they have been massively lost by recombination. Finally, close direct repeats which have been fixed by functional pressures at the protein level are still submitted to an active conversion process (vide infra), preventing any further evolution.
  2. On the other hand, distant repeats are submitted to less active recombination and conversion pressures. This allows the creation of different proteins from the same repeated DNA sequence located in different CDSs, and even sometimes translated in different reading frames.

It should be pointed out that Marcotte et al. (1998) recently reported that there is a high frequency of internal repeats within protein sequences of eukaryotes (as compared with prokaryotes). This observation is in line with the results presented in this study. To summarize, we probably observe a combination of DNA mechanisms which tend to (1) delete the close repeats and (2) keep them identical and which are constrained by functional pressures at the protein level.

Direct Versus Inverted Repeats
The last group of results we take into account concerns the important differences observed between direct and inverted repeats. These differences can be summarized as follows:

  1. The direct repeats exhibit a higher similarity than the inverted ones (fig. 1b). If we assume that there is a higher conversion rate for the numerous close direct repeats (vide supra), this observation is not surprising.
  2. The direct repeats are clearly longer than the inverted ones (fig. 1a). This result is surprising, since, as already mentioned, long direct repeats are experimentally known to be more easily deleted (Jinks, Michelitch, and Ramcharan 1993). Therefore, our interpretation is that the observed long direct repeats were produced more recently than inverted ones and have not yet been eliminated.
  3. Finally, the distribution of deltas is the main difference between the two groups of repeats (fig. 1c). Inverted repeats do not seem to be constrained by delta (with the corresponding distributions being almost identical for real and random genomes). On the contrary, as shown above, close direct repeats are overrepresented and distant direct repeats are underrepresented as compared with inverted ones. We now discuss a possible model to account for these differences.

A Dynamic Model for Intrachromosomal Repeats
We propose a simple model, illustrated in figure 4, to explain all these observations and solve the apparent contradictions. In this model, the initial event is the continuous production of close direct repeats. Whatever the mechanism giving rise to it (unequal crossing over or replication slippage), when a close direct repeat is created, it can be either modified by mutation (although the high conversion rate will tend to maintain the two copies identical) or deleted (since the deletion rate is high). As long as the repeat remains, a new exchange (conversion/deletion) is possible. Therefore, the fate of a close direct repeat is to disappear sooner or later (depending on the conversion rate vs. the deletion rate). As a consequence, only two kinds of direct repeats can be conserved on a large time scale:



Fig. 4. A model for the origin and dynamics of intrachromosomal repeats. The initial event is the close duplication of a sequence (oriented black boxes). The two copies can then diverge (oriented gray boxes) or be maintained identical through a conversion process. Alternatively, the repeat can also be deleted, leading back to a single copy. On a long timescale, this second situation prevails. Therefore the only two ways to maintain both copies are (case 1) to move them away through chromosomal rearrangements, since the relative conversion rate then decreases (thin arrows), and (case 2) to protect them from deletion by functional pressures; the two copies are located within CDSs.

 
  1. Coding repeats, which can be conserved by functional pressures. In this case, they must be short due to the length tolerance effect. One should note, however, that strong functional pressures and/or multiple-copy repeats can lead to maintenance of large tandem clusters (rDNAs, ENA family, ASP3 cluster, CUP1 cluster, etc).
  2. Repeats in which one of the two copies is moved away by an interchromosomal (not represented in fig. 4) or intrachromosomal rearrangement: an inversion will lead to inverted repeats, and an insertion will lead to distant direct repeats. Under this model, we could explain the underrepresentation of distant direct repeats by a lower rate of insertion than of inversion.

This model implies that most intrachromosomal repeats originate from close direct duplications but does not preclude any mechanism. Furthermore, it gives rise to several predictions that can be experimentally tested, like a negative correlation of the deletion/conversion rate with delta.

Acknowledgements

We thank E. Rocha, J. Pothier, E. Maillier, and D. Higuet for their scientific help and their friendly support. This work was supported by grants from Association pour la Recherche sur le Cancer. E.C. and P.N. are members of the Université Pierre et Marie Curie, Paris.

Footnotes

Manolo Gouy, Reviewing Editor

1 Abbreviation: CDS, coding sequence.

2 Keywords: genome dynamics, evolution, duplication, direct repeats, Saccharomyces cerevisiae.

3 Address for correspondence and reprints: Guillaume Achaz, Structure et dynamique des génomes, IJM, Tour 43–44, 1° étage, 4, place Jussieu, 75251 Paris CEDEX 05, France. E-mail: achaz@ijm.jussieu.fr

Literature Cited

    Britten, R. J. 1998. Precise sequence complementarity between yeast chromosome ends and two classes of just-subtelomeric sequences. Proc. Natl. Acad. Sci. USA 95:5906–5912.

    Chedin, F., E. Dervyn, R. Dervyn, S. D. Ehrlich, and P. Noirot. 1994. Frequency of deletion formation decreases exponentially with distance between short direct repeats. Mol. Microbiol. 12:561–569.

    Chervitz, S. A., L. Aravind, G. Sherlock et al. (13 co-authors). 1998. Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science 282:2022–2028.

    Coissac, E., E. Maillier, and P. Netter. 1997. A comparative study of duplications in bacteria and eukaryotes: the importance of telomeres. Mol. Biol. Evol. 14:1062–1074.

    Fleischmann, R. D., M. D. Adams, O. White et al. (11 co-authors). 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512.

    Goffeau, A., B. G. Barell, H. Bussey et al. (16 co-authors). 1996. Life with 6000 genes. Science 274:546–567.

    Jinks, R. S., M. Michelitch, and S. Ramcharan. 1993. Substrate length requirements for efficient mitotic recombination in Saccharomyces cerevisiae. Mol. Cell. Biol. 13:3937–3950.

    Karlin, S., and F. Ost. 1985. Maximal segmental match length among random sequences from a finite alphabet. Pp. 225–243 in L. M. L. Cam and R. A. Olshen, eds. Proceedings of the Berkeley Conference in honour of Jerzy Neyman and Jack Kiefer. Vol. 1. Association for Computing Machinery, New York.

    Karp, R. M., R. E. Miller, and A. L. Rosenberg. 1972. Rapid identification of repeated patterns in strings, trees and arrays. Pp. 125–126 in Proceedings 4th Annual ACM Symposium Theory of Computing, New York.

    Klein, H. L. 1995. Genetic control of intrachromosomal recombination. Bioessays 17:147–159.

    Lalo, D., S. Stettler, S. Mariotte, P. P. Slonimski, and P. Thuriaux. 1993. Two yeast chromosomes are related by a fossil duplication of their centromeric regions. C. R. Acad. Sci. 316:367–373.

    Leung, M. Y., B. E. Blaisdell, C. Burge, and S. Karlin. 1991. An efficient algorithm for identifying matches with errors in multiple long molecular sequences. J. Mol. Biol. 221:1367–1378.

    Louis, E. J., and J. E. Haber. 1990. The subtelomeric Y' repeat family in Saccharomyces cerevisiae: an experimental system for repeated sequence evolution. Genetics 124:533–545.

    Lovett, S. T., T. J. Gluckman, P. J. Simon, V. J. Sutera, and P. T. Drapkin. 1994. Recombination between repeats in Escherichia coli by a recA-independent, proximity-sensitive mechanism. Mol. Gen. Genet. 245:294–300.

    Marcotte, E. M., M. Pellegrini, T. O. Yeates, and D. Eisenberg. 1998. Census of protein repeats. J. Mol. Biol. 293:151–160.

    Mewes, H. W., K. Albermann, M. Bahr et al. (12 co-authors). 1997. Overview of the yeast genome. Nature 387:7–65.

    Peeters, B. P., B. J. De, S. Bron, and G. Venema. 1988. Structural plasmid instability in Bacillus subtilis: effect of direct and inverted repeats. Mol. Gen. Genet. 212:450–458.

    Pierce, J. C., D. Kong, and W. Masker. 1991. The effect of the length of direct repeats and the presence of palindromes on deletion between directly repeated DNA sequences in bacteriophage T7. Nucleic Acids Res. 19:3901–3905.

    Pryde, F. E., H. C. Gorham, and E. J. Louis. 1997. Chromosome ends: all the same under their caps. Curr. Opin. Genet. Dev. 7:822–828.

    Rocha, E. P. C., A. Danchin, and A. Viari. 1999. Analysis of long repeats in bacterial genomes reveals alternative evolutionary mechanisms in Bacillus subtilis and other competent prokaryotes. Mol. Biol. Evol. 16:1219–1230.

    Schneider, T. D., G. D. Stormo, L. Gold, and A. Ehrenfeucht. 1986. Information content of binding sites on nucleotide sequences. J. Mol. Biol. 188:415–431.

    Semple, C., and K. H. Wolfe. 1999. Gene duplication and gene conversion in the Caenorhabditis elegans genome. J. Mol. Evol. 48:555–564.

    Smith, T. F., and M. S. Waterman. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195–197.

    Vincens, P., L. Buffat, C. Andre, J. P. Chevrolat, J. F. Boisvieux, and S. Hazout. 1998. A strategy for finding regions of similarity in complete genome sequences. Bioinformatics 14:715–725.

    Viswanathan, M., G. Muthukumar, Y. S. Cong, and J. Lenard. 1994. Seripauperins of Saccharomyces cerevisiae: a new multigene family encoding serine-poor relatives of serine-rich proteins. Gene 148:149–153.

    Wolfe, K. H., and D. C. Shields. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708–713.

Accepted for publication May 2, 2000.