Genome-wide Analysis of the Emigrant Family of MITEs of Arabidopsis thaliana

Néstor Santiago*, Cristina Herráiz†, J. Ramón Goñi†, Xavier Messeguer† and Josep M. Casacuberta*

*Department of Genètica Molecular, IBMB-CSIC, Barcelona;
†Department of Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya


    Abstract
Miniature inverted-repeat transposable elements (MITEs) are structurally similar to defective class II elements, but their high copy number and the size and sequence conservation of most MITE families suggest that they can be amplified by a replicative mechanism. Here we present a genome-wide analysis of the Emigrant family of MITEs from Arabidopsis thaliana. To detect divergent ancient copies and low copy number subfamilies with a different internal sequence, we have developed a computer program that looks for Emigrant elements based solely on the terminal inverted-repeat sequence. We have detected 151 Emigrant elements of different subfamilies. Our results show that different bursts of amplification, probably of few active, or master, elements, have occurred at different times during Arabidopsis evolution. The analysis of the insertion sites of the Emigrant elements shows that recently inserted Emigrant elements tend to be located far from open reading frames, whereas more ancient Emigrant subfamilies are preferentially found associated with genes.


    Introduction
Transposable elements (TEs) are important components of eukaryote genomes and account for a large fraction of their sequence. Most TEs can be grouped into two major classes according to their mode of transposition: class I elements transpose by a replicative mechanism involving an RNA intermediate, whereas class II elements usually transpose by a nonreplicative "cut-and-paste" mechanism. Although most genomes contain elements of both classes, class I elements are in general much more abundant than class II elements, probably as a consequence of their replicative mode of transposition. This is the case for the human genome, where the most abundant TEs are the L1 long interspersed nuclear element (LINE) and the Alu family of short interspersed nuclear elements (SINEs) (International Human Genome Sequencing Consortium 2001), both belonging to the class I group of TEs. Plant genomes also contain retrotransposons in high copy number, although in this case retroviral-like long terminal repeat–retrotransposons are the most abundant (Kumar and Bennetzen 1999). In addition, plant genomes contain miniature inverted-repeat transposable elements (MITEs) in very high copy number (Wessler, Bureau, and White 1995; Le, Wright, and Bureau 2000; Turcotte, Srinivasan, and Bureau 2001).

MITEs are a particular class of TE first described in plants but later found in other eukaryote genomes (Bureau and Wessler 1992; Oosumi, Garlick, and Belknap 1996; Tu 1997). They are structurally similar to defective class II elements, but their high copy number and the size and sequence conservation of most MITE families suggest that they can be amplified by a replicative mechanism. It has recently been proposed that MITEs could be a particular type of defective class II element (Feschotte, Jiang, and Wessler 2002), some of them related to the pogo subclass of Mariner transposons or to bacterial insertion sequence elements (Feschotte and Mouchès 2000a; Le, Wright, and Bureau 2000; Zhang et al. 2001). Nevertheless, although it has been proposed that some MITE families could still be active in plants (Casacuberta et al. 1998; Zhang, Arbuckle, and Wessler 2000; Zhang et al. 2001), no mobile MITE copy that would allow the analysis of its transposition mechanism has yet been characterized. In this context, analyzing the evolution of MITE families within their host genomes is probably the best approach to understanding the lifestyle of these elements and the impact of their mobility on host genomes.

Here we present a genome-wide analysis of the Emigrant family of MITEs in Arabidopsis thaliana. To detect elements with a divergent internal sequence, representing either ancient Emigrant elements or previously undescribed low copy number Emigrant subfamilies, we have developed a computer program that detects putative MITEs in a genomic sequence based solely on their terminal inverted-repeat (TIR) sequences. This approach has allowed us to perform, for the first time, an evolutionary analysis of a family of MITEs within a particular genome. Our results show that the different Emigrant subfamilies have probably been generated by the amplification of a small number of founder elements. Our results also show that, although Emigrant elements target very AT-rich regions for insertion, elements closely linked to genes are more frequently maintained during evolution.


    Materials and Methods
TRANSPO and SPAT Programs
TRANSPO implements the fast bit-vector algorithm (Myers 1998) that finds all locations at which a query (the Emigrant TIR sequence in this case) approximately matches a sequence (the sequence of the five Arabidopsis chromosomes). The expected running time is linear in the length of the sequence. Although this algorithm is based on dynamic programming, whose cost is quadratic (the product of the lengths of the sequence and the query), linearity is achieved when the query is short enough to fit within one computer word.
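The bit-vector idea can be illustrated with a minimal sketch, which is not the TRANSPO source code. It follows the published bit-parallel recurrence for approximate string searching and reports every text position at which the query ends with at most k differences (substitutions, insertions, or deletions). The function name `myers_search` and its return format are assumptions made for this illustration; Python integers are unbounded, so the word size is emulated with an explicit mask.

```python
def myers_search(pattern, text, k):
    """Return (end_index, edit_distance) for every position in `text`
    where `pattern` matches with at most `k` differences."""
    m = len(pattern)
    mask = (1 << m) - 1          # emulates an m-bit machine word
    high = 1 << (m - 1)          # bit of the last pattern row

    # One bit-vector per symbol: bit i is set when pattern[i] == symbol.
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv, score = mask, 0, m   # vertical +1 / -1 deltas, last-row score
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 deltas
        mh = pv & xh                    # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            hits.append((j, score))
    return hits
```

Each text character is processed with a constant number of word operations, which is why the 20-nt TIR query used here gives a running time proportional to the length of the genome sequence.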

The clustering program SPAT groups the sequences into a hierarchical classification, i.e., a nested sequence of partitions (Gordon 1999). Given a similarity measure between each pair of sequences, a complete weighted graph is built in which the nodes are the sequences and the weight of each edge is the similarity between the pair of sequences it connects. Then the maximum spanning binary tree (MST) is found. This tree "spans" the graph, connecting all the nodes in such a way that the sum of the weights of the edges is maximized. The algorithm proceeds by removing the edge with minimum weight, which divides the tree into two disjoint subtrees (Zahn 1971; Delattre and Hansen 1980). A cluster is determined by the value of the removed edge that creates it and the value of the edge that divides it. A cluster of sequences is cohesive when the consecutive removal of a significant number of edges does not change the composition of the cluster.
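The clustering step described above can be sketched as follows. This is an illustration under stated assumptions, not the SPAT implementation: it builds an ordinary maximum spanning tree with Kruskal's method (the text refers to a binary tree, which is not reproduced here), and it cuts a fixed number of weakest edges instead of evaluating cluster cohesion. The names `max_spanning_tree` and `split_clusters` are hypothetical.

```python
class UnionFind:
    """Disjoint-set structure used to track connected components."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def max_spanning_tree(sim):
    """Kruskal's algorithm on a symmetric similarity matrix,
    taking the most similar pairs first."""
    n = len(sim)
    edges = sorted(
        ((sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    uf = UnionFind(n)
    return [(w, i, j) for w, i, j in edges if uf.union(i, j)]


def split_clusters(n, tree, n_clusters):
    """Remove the weakest tree edges until `n_clusters` subtrees remain."""
    kept = sorted(tree, reverse=True)[: len(tree) - (n_clusters - 1)]
    uf = UnionFind(n)
    for _, i, j in kept:
        uf.union(i, j)
    groups = {}
    for x in range(n):
        groups.setdefault(uf.find(x), []).append(x)
    return sorted(groups.values())
```

Calling `split_clusters` with increasing values of `n_clusters` yields the nested sequence of partitions; a group that survives many successive cuts unchanged corresponds to what the text calls a cohesive cluster.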

TRANSPO and SPAT are available at www.lsi.upc.es/~alggen.

Emigrant Element Mining
The TRANSPO program was used to search the entire available Arabidopsis genome sequence (www.arabidopsis.org) for inverted-repeat sequences at least 75% identical to the first 20 nt of the previously defined Emigrant TIR (CAGTAAAACCTCTATAAATT) and located within a range of 200–700 nt of each other. Overlapping elements generated from subterminal inverted-repeat sequences were eliminated. A pairwise similarity matrix was calculated, the sequences were grouped using the SPAT program, and a graphical distribution of the different elements in the Arabidopsis chromosomes was obtained using the program CLUPH.
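The search criteria above can be made concrete with a small sketch. It is a simplification and an assumption-laden illustration: mismatches are counted by position only (no insertions or deletions, unlike the approximate matching TRANSPO performs), the 200–700 nt range is interpreted as the total length from the start of the left TIR to the end of the right TIR, and the function names are hypothetical.

```python
import math

TIR = "CAGTAAAACCTCTATAAATT"  # first 20 nt of the Emigrant TIR (from the text)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def motif_hits(seq, motif, max_mismatches):
    """Start positions where `motif` occurs with at most `max_mismatches`
    substitutions (Hamming distance; indels are ignored in this sketch)."""
    m = len(motif)
    hits = []
    for i in range(len(seq) - m + 1):
        mismatches = sum(a != b for a, b in zip(seq[i:i + m], motif))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


def find_candidates(seq, tir=TIR, min_identity=0.75, min_len=200, max_len=700):
    """Pair each left TIR with each inverted right TIR at a compatible
    distance; returns (start, end) with `end` exclusive."""
    max_mm = len(tir) - math.ceil(min_identity * len(tir))  # 5 for 20 nt at 75%
    left = motif_hits(seq, tir, max_mm)
    right = motif_hits(seq, reverse_complement(tir), max_mm)
    candidates = []
    for start in left:
        for r in right:
            end = r + len(tir)
            if min_len < end - start < max_len:
                candidates.append((start, end))
    return candidates
```

A real run would still need the overlap-removal step mentioned in the text, since subterminal repeats can produce several nested candidates for one element.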

To obtain information about the open reading frames (ORFs) located close to the Emigrant elements, the 30 nt flanking each Emigrant element upstream and downstream were used as probes in sequence similarity searches (BLAST 2.0; Altschul et al. 1990; http://www.ncbi.nlm.nih.gov/blast/). A table containing the BAC accession number and the nucleotide position of each Emigrant element, as well as the name of the closest upstream and downstream ORFs and the distance of the element to each, can be obtained as additional information. Sequence similarity searches (BLAST) with the sequences flanking Emigrant elements were also used to look for related empty sites (RESites).

Phylogenetic Analysis
Sequences belonging to a particular group obtained with SPAT were aligned using the multiple-alignment program CLUSTAL W with the default parameters (version 1.5; Thompson, Higgins, and Gibson 1994), followed by some minor refinements. DNADIST in Felsenstein's package PHYLIP (Felsenstein 1989) was used to generate a distance matrix based on the Jukes-Cantor algorithm (Jukes and Cantor 1969). This matrix was used to generate neighbor-joining trees (Saitou and Nei 1987). Bootstrap analyses were performed using the programs Seqboot and Consense from PHYLIP (Felsenstein 1989). Sequence variability, as measured by Nei's measure of nucleotide diversity, π (Nei 1987), and its standard deviation were calculated using the program DnaSP (Rozas and Rozas 1999).
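For readers unfamiliar with the two measures named above, the following sketch shows what they compute on an existing alignment. It is not a substitute for DNADIST or DnaSP: gap handling is reduced to skipping gapped columns pair by pair, π is taken as the plain mean of pairwise proportions of differing sites, and no standard deviation is estimated. The function names are hypothetical.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences,
    ignoring columns where either sequence has a gap ('-')."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -(3/4) ln(1 - 4p/3)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def nucleotide_diversity(aligned_seqs):
    """Average pairwise proportion of differences across all sequence pairs."""
    dists = [p_distance(a, b) for a, b in combinations(aligned_seqs, 2)]
    return sum(dists) / len(dists)
```

The Jukes-Cantor correction grows faster than the raw proportion of differences because it accounts for multiple substitutions at the same site, which matters when comparing the more divergent subfamilies.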


    Results
Different Groups of the Emigrant Element can be found in the Genome of A. thaliana
We have designed a new computer program named TRANSPO that searches a given sequence for the presence of a particular inverted-repeat motif within a given range of distances. Because the search does not depend on a conserved internal sequence, it can locate previously undetected low copy number subfamilies of a particular MITE that share the TIR sequences but differ in their internal sequence, as well as ancient MITE copies that have lost most of their sequence homology through insertions, deletions, or point mutations within the sequence between the TIRs.

We have searched the complete available Arabidopsis sequence, which covers 115.4 Mbp of the 125-Mbp genome and does not include telomeres, centromeres or the rDNA repeated regions, for Emigrant elements by looking for Emigrant TIRs (tolerating up to 25% divergence) separated by more than 200 nt and less than 700 nt. We have localized 151 sequences that could represent Emigrant elements. All these sequences present Emigrant-like TIRs, are very AT-rich, and do not have coding capacity, and most of them are flanked by the dinucleotide TA. Although all these characteristics are reminiscent of MITE-related sequences, most of these sequences are not annotated as Emigrant elements, MITEs, or possible TEs in the databases.

The high variability of the internal sequence does not allow the correct alignment of the 151 sequences and their analysis by phylogenetic methods. We have developed the program SPAT, which proceeds by elimination of the most divergent sequence of a given group, in order to tentatively group the sequences and be able to apply conventional phylogenetic methods. SPAT gave a tentative classification of most of the 151 sequences into three main groups, which we have named EmiA (41 sequences), EmiB (26 sequences), and EmiC (37 sequences), based on pairwise identity comparisons. Forty-seven sequences were too divergent to be included in any of the defined groups and have been named Emi0 elements.

All the previously described Emigrant elements (Casacuberta et al. 1998) belong to the EmiA group. We have previously demonstrated that EmiA elements were mobile in the recent past because some of them were found to be polymorphic among Arabidopsis ecotypes (Casacuberta et al. 1998). In order to obtain data on the possible mobility of the other groups of elements, we searched the Arabidopsis genome for RESites, representing genome duplication events occurring prior to the transposition of these newly described elements. The presence of RESites within a genome has been successfully used as an indication of mobility when analyzing possible TEs within a single genome (Le, Wright, and Bureau 2000; Tu 2001). We found more than 20 well-conserved RESites corresponding to the different groups of Emigrant elements, although we found more RESites corresponding to the EmiA, EmiB, and EmiC classes than to the Emi0 class (not shown). These data, and the presence in each case of a TA duplication accompanying the insertion of the element, strongly suggest that the different elements described here are indeed mobile elements related to the Emigrant family of MITEs.

Analysis of the Sequence and Size Variability of the Different Emigrant Subfamilies
The relatively high sequence identity within each of the EmiA, EmiB, and EmiC groups of elements has allowed us to perform conventional phylogenetic analysis. Sequences belonging to each group were aligned using CLUSTAL W, and the alignments were used to obtain neighbor-joining trees. Figure 1 presents the trees obtained. Different monophyletic groups supported by high bootstrap values can be defined within each tree. Within each Emigrant group most of the sequences can be subdivided into three different subfamilies (A1, A2, and A3; B1, B2, and B3; C1, C2, and C3). By performing new alignments with the sequences belonging to each subfamily, we have deduced a consensus sequence for each of them and compared these consensus sequences in order to obtain information about the phylogenetic relationships among the different Emigrant subfamilies. A neighbor-joining tree, obtained by comparing the consensus sequences of each subfamily, is also shown in figure 1. The three EmiA subfamilies, the three EmiB subfamilies, and the three EmiC subfamilies seem phylogenetically related because the three different groups cluster together with high bootstrap values.



Fig. 1.—Phylogenetic analysis of the Emigrant elements. Neighbor-joining trees obtained comparing the EmiA (A), EmiB (B), and the EmiC (C) elements. The subfamilies defined within each Emigrant group are shown as A#, B#, or C#. (D) Neighbor-joining tree obtained comparing the consensus sequences of the different Emigrant subfamilies. Bootstrap values above 60% supporting major clusters are shown. Distances are proportional to evolutionary divergence expressed in substitutions per hundred sites

 
The alignments of the sequences belonging to each subfamily were also used to calculate the nucleotide diversity, π (Nei 1987), and the size variability for each Emigrant subfamily. These results show that each Emigrant subfamily displays a different degree of variability (table 1). Whereas some subfamilies such as EmiA2 are highly homogeneous both in size and sequence, other subfamilies like the EmiB3 subfamily are highly variable. Each Emigrant group contains subfamilies of different variability. Within the EmiA group the A2 subfamily is more homogeneous than the A1 subfamily; within the EmiB group the most homogeneous is the B2 subfamily, and the most variable is B3, the B1 subfamily displaying an intermediate degree of variability. Within the EmiC group the C1 subfamily is the most homogeneous in sequence although relatively variable in size, and C3 seems to be the most variable group.


Table 1 Sequence, Size variability and Position with Respect to the Closest Predicted Genes of the Different Emigrant Subfamilies

 
Analysis of Emigrant Insertion Sites
Figure 2 shows the distribution of Emigrant elements in Arabidopsis chromosomes. All five chromosomes of Arabidopsis contain Emigrant elements, although the density of insertions varies slightly among them. Chromosome 4 has the highest concentration of Emigrant elements (1.8 Emigrant [Emi]/Mbp) and chromosome 5 the lowest (1 Emi/Mbp), with chromosomes 1, 2, and 3 showing intermediate concentrations. Although the concentration of the different Emigrant groups varies slightly between the five chromosomes, they all contain representatives of each Emigrant group. The concentration of EmiA elements in chromosome 4 is six times higher than in chromosome 5, the concentration of EmiC elements in chromosomes 2 and 4 is twice that in chromosome 3, and there are three times more Emi0 elements in chromosome 4 than in chromosome 1.



Fig. 2.—Distribution of Emigrant elements of the different groups in the five chromosomes of Arabidopsis. The number of Emigrant elements per million base pairs is indicated for each subfamily

 
The analysis of 60 nt around the insertion site of the 151 Emigrant insertions shows that, in addition to inserting within TA sequences, Emigrant elements, like other MITEs and mariner-like elements (MLEs) (Le, Wright, and Bureau 2000), target regions of very high AT content. The average AT content of the Arabidopsis chromosomes ranges from 64.5% to 66.6%, rising to 67.6% in noncoding regions (The Arabidopsis Genome Initiative 2000), whereas sequences flanking Emigrant elements have 74.3% AT. Although the most frequent microsatellite in Arabidopsis consists of TA repetitions (Casacuberta, Puigdomènech, and Monfort 2000), Emigrant does not seem to target microsatellites for integration: whereas 25% of the Emigrant elements analyzed here lie in a TATA sequence, only 3% are found in a sequence containing a repetition of more than 4 TAs.

Position of Emigrant Elements Relative to ORFs
Although MITEs seem to target AT-rich regions, they have often been found close to transcribed sequences (Wessler, Bureau, and White 1995; Yang et al. 2001). We analyzed the regions flanking the 151 Emigrant insertions and calculated the distance from the ATG or STOP codon of the closest predicted gene. Ten percent of the elements lie within a predicted gene (7% in introns and 3% in exons), 24% lie at less than 500 nt from an ORF, and 23% lie at more than 500 nt and less than 1,000 nt from an ORF. Twenty-nine percent are located at more than 1,000 nt from any ORF, and 13% are inserted within a repetitive region. Nevertheless, the position of the Emigrant elements with respect to the predicted genes greatly varies among the different subfamilies analyzed (see table 1). While 55.5% of the EmiB3 elements are found at less than 500 nt from an ORF, and 42% of Emi0 are located within or close to a predicted gene, the vast majority of the EmiA2 elements (85%) are located at more than 500 nt from any ORF.

Among the 53 Emigrant elements located at less than 500 nt from the closest ORF, 46.5% are located downstream, 27.5% are located upstream, and 26% are located within a predicted gene. These elements can affect promoter activity, splicing, transcriptional termination, or RNA stability, as well as the coding capacity of the ORF. We have thus analyzed these insertions in some detail, and figure 3 shows examples of such gene-proximal insertions. Figure 3A shows an Emi0 element (Emi116), found within the transcribed downstream region of the Det1 gene, as an example of an element lying downstream of an ORF. The availability of the genomic and the cDNA sequence for the Det1 gene has allowed us to determine that the transcription of the Det1 gene stops within the Emigrant element, probably using polyadenylation sequences provided by Emi116. Figure 3B shows an example of an Emi0 element (Emi130) located within a predicted gene coding for a GATA-like transcription factor. The insertion of the element has provided a new putative ATG and 48 new amino acids within the C-terminal region of the protein. We also found five Emigrant elements lying at less than 500 nt from two different ORFs. The insertion of those elements could potentially affect the expression of both the upstream and the downstream genes. Alternatively, the insertion of an Emigrant element in these extremely short intergenic regions could help to avoid transcriptional interference between the two genes. In this respect, it is interesting to note that some MITEs have been proposed to act as matrix attachment regions isolating their neighboring genes (Tikhonov, Bennetzen, and Avramova 2000). This possible effect of MITE insertion could be particularly useful in Arabidopsis, which has a very compact genome in which genes are sometimes found extremely close to one another.



Fig. 3.—Emigrant elements inserted within ORFs. Schematic representation of the insertions of Emi116 (A) and Emi130 (B) within ORFs. Open boxes represent coding sequences, and filled boxes represent Emigrant elements. The name of the gene or the accession number of the ORF in which the Emigrant elements are inserted is indicated. The nucleotide and deduced amino acid sequences are shown under the scheme. Sequences corresponding to the Emigrant element are in bold, and the TIR sequence is boxed. The poly-A site is also indicated, and the polyadenylation signal is underlined

 

    Discussion
A New Approach to Study MITE Evolution
Most MITE families described to date are characterized by a high degree of sequence and size conservation (Bureau and Wessler 1992; Casacuberta et al. 1998; Feschotte and Mouchès 2000b; Oosumi, Garlick, and Belknap 1996; Tu 1997; Yang et al. 2001). However, after insertion MITEs are subjected to random mutation, and the sequence and size homogeneity of a particular MITE family will decrease with time. Ancient divergent copies, and low copy number families of MITEs, are difficult to detect by sequence similarity–based search methods and have probably been missed in searches performed to date. In order to gain access to divergent Emigrant elements and analyze the evolution of this particular family of MITEs, we developed a computer program, TRANSPO, based solely on the presence of relatively conserved TIR sequences. A similar approach has recently been used to describe new MITE families in Anopheles gambiae. Tu (2001) developed a computer program, named FINDMITE, that looks in a given sequence for unknown TIRs flanked by short direct repeats. The sequences of both the TIR and the direct repeat are not fixed, which makes FINDMITE suitable for looking for new MITEs but not for finding old elements that have very imperfect TIRs or have lost the short direct duplications. TRANSPO can search for a particular TIR sequence at low stringency and can detect elements that have lost the short direct duplication flanking them, which makes it more suitable for looking for ancient elements with a known TIR. The program TRANSPO has allowed us to detect all the sequences present within the genome of Arabidopsis that contain TIRs at least 75% identical to the previously defined Emigrant TIR (Casacuberta et al. 1998), separated by more than 200 nt and less than 700 nt. The presence of target site duplications and RESites for the different Emigrant groups described here has allowed us to confirm the mobile nature of these sequences.

Different Amplification Bursts of Emigrant Elements have Occurred During Arabidopsis Evolution
The 47 Emi0 sequences are too divergent to be included in any of the nine Emigrant subfamilies defined here. The high divergence of these elements suggests that they represent old Emigrant insertions that have accumulated a high number of mutations. The phylogenetic analysis of the other Emigrant elements shows that they belong to different subfamilies with different degrees of variability. Whereas the 20 EmiA2 elements are highly homogeneous, the EmiB3 subfamily is highly variable. This suggests that different amplification bursts have occurred at different times during the evolution of Arabidopsis, giving rise to these different subfamilies: the more variable a subfamily is, the more ancient the amplification burst that generated it should be. The star-type topology of the Emigrant subfamilies in the different trees suggests that each subfamily has been generated from the amplification of a single Emigrant element. This could be explained by the presence of only one active or master element capable of amplification at a particular moment, as predicted by the master gene model developed for SINEs (Deininger and Batzer 1995), or simply by assuming that the amplification of MITEs is an extremely rare event occurring stochastically on any Emigrant element, which would thus act as a founder element for a new subfamily. In either scenario the result is that only very few Emigrant elements have been amplified during the evolution of Arabidopsis and that the insertion dynamics of Emigrant elements have been very similar to those of most SINEs, in spite of the important differences in their transposition mechanisms.

The evolutionary dynamics shown here for the Emigrant element are probably shared by other elements, as the presence of highly conserved subfamilies within a single host genome has been described for other MITEs (Tu 2001; Yang et al. 2001).

Elements Close to Genes have been Preferentially Maintained During Arabidopsis Evolution
Although MITEs seem to target very AT-rich regions, they have often been found close to transcribed sequences (Wessler, Bureau, and White 1995; Yang et al. 2001; Feschotte, Jiang, and Wessler 2002). Nevertheless, it is not known whether this preferential location is the result of their insertion specificity. On the other hand, a recent survey failed to detect transposon insertions in A. thaliana coding regions, suggesting purifying selection against deleterious mutations (Le, Wright, and Bureau 2000). The presence of mobile elements at particular locations within a genome is the result of both their transpositional activity and the selection of the best-fit genomes. Thus, elements transposing randomly within a genome can be found at particular locations as the result of positive selection of their insertion within those sites or negative selection of insertions in other locations. The effect of target site specificity should be more easily detected for recently inserted elements, whereas the effect of selection will be more apparent for ancient insertions. The comparison of the distribution of ancient versus recent elements should therefore reveal the effect of selection and thus the impact of transposon insertions. We therefore compared the distribution, with respect to predicted genes, of the different subfamilies of Emigrant elements, which represent amplification bursts occurring at different times in the evolution of Arabidopsis, in order to determine their insertion specificity as well as the effect of selection and the impact of Emigrant insertions.

EmiA2 is the most homogeneous subfamily described here, both in sequence and size, and probably represents the most recent burst of amplification of Emigrant elements. Eighty-five percent of the 20 EmiA2 elements lie at more than 500 nt from the closest ORF (see table 1). The genome of Arabidopsis is extremely compact, and the intergenic regions are very short. The mean size of Arabidopsis genes is 2 kbp and there is one gene every 5 kbp, which implies that the mean distance between two genes is only 3 kbp (The Arabidopsis Genome Initiative 2000). Thus genic regions occupy 40% of the genome space, and the regions closely linked to the genes, which most probably contain gene regulatory regions (arbitrarily taken here as 500 nt), occupy 20% of the genome space, which means that 60% of the genome is occupied by genes and their potentially regulatory regions. The regions not linked to genes occupy only 40% of the genome space (20% for the region arbitrarily defined here as between 500 and 1,000 nt, and 20% for the region arbitrarily defined here as >1,000 nt). The distribution of EmiA2 elements is thus far from random, with Emigrant elements inserting preferentially far from ORFs. This difference is statistically highly significant, with a chi-square value of 19.76, whereas the chi-square value with three degrees of freedom and 99% probability is 11.34. The strict specificity of Emigrant and other MITEs for the TA dinucleotide as insertion site, as well as the preference for very AT-rich regions (74.3% AT in the case of Emigrant), probably helps these elements to avoid genes even in extremely compact genomes such as that of Arabidopsis.
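The genome-space arithmetic above, and the direction of the departure from randomness, can be checked with a short sketch. It does not reproduce the four-category test reported in the text, because the per-category EmiA2 counts are in table 1 and are not restated here. Instead it collapses the data into two categories (within a gene or less than 500 nt from one, versus farther away), using only the figures quoted in the paragraph; the 500-nt window is assumed to apply on each side of a gene, and 6.635 is the standard 99% critical value of chi-square with one degree of freedom.

```python
MEAN_GENE_KBP = 2.0        # mean gene size (from the text)
GENE_SPACING_KBP = 5.0     # one gene every 5 kbp (from the text)
REGULATORY_NT = 500        # arbitrary window, assumed on each side of a gene

genic = MEAN_GENE_KBP / GENE_SPACING_KBP                  # 0.4
near = 2 * REGULATORY_NT / 1000 / GENE_SPACING_KBP        # 0.2
far = 1 - genic - near                                    # 0.4


def chi_square(observed, expected):
    """Pearson goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


n_emia2 = 20
observed_far = round(0.85 * n_emia2)                      # 17 elements > 500 nt
observed = [n_emia2 - observed_far, observed_far]         # [3, 17]
expected = [n_emia2 * (genic + near), n_emia2 * far]      # [12, 8] if random

stat = chi_square(observed, expected)                     # about 16.9
CRITICAL_1DF_99 = 6.635   # chi-square critical value, 1 df, 99%
print(f"chi-square = {stat:.3f}, exceeds critical value: {stat > CRITICAL_1DF_99}")
```

Even in this collapsed form the excess of EmiA2 elements far from ORFs is well beyond what random insertion into a genome with these proportions would be expected to produce.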

This preference for regions far from genes is less pronounced for other Emigrant subfamilies. More than 50% of the EmiB3 elements, and 43% of the Emi0 group of elements, are less than 500 nt from the closest ORF. Interestingly, the Emi0 group contains the most divergent Emigrant elements, and the EmiB3 subfamily is one of the most variable subfamilies, suggesting that both the Emi0 group and the EmiB3 subfamily represent the most ancient insertions and have been subjected to selection for a relatively long period of time. In particular, the differences in distribution between Emi0 and EmiA2 elements are statistically significant, with a chi-square value of 19.76 (three degrees of freedom; 99% probability χ² = 11.34). The other Emigrant subfamilies show different distribution patterns with respect to ORFs. Although the low number of elements makes it difficult to draw conclusions in some cases, the more variable a subfamily is, the more closely it is associated with genes.

A possible explanation for these results would be a domestication of the Emigrant transposase that would have learned to avoid genes during evolution, inserting Emigrant elements farther and farther from genes. Alternatively, these results suggest that whereas Emigrant elements preferentially insert far from ORFs, the elements closely linked to genes are more frequently maintained during evolution. This is reminiscent of what has been shown for the Alu family of SINEs in the human genome. Alu elements tend to insert in AT-rich regions, and recently transposed Alu subfamilies are found in gene-poor regions, whereas ancient Alu subfamilies are found preferentially in GC-rich regions closely associated with genes (International Human Genome Sequencing Consortium 2001). Although other possible explanations have been pointed out (Brookfield 2001; Batzer and Deininger 2002), it has been proposed that positive selection in favor of the minority of Alu elements in GC-rich DNA could explain the difference in distribution between old and new Alu subfamilies (International Human Genome Sequencing Consortium 2001). Emigrant and other MITEs resemble SINEs in their short size and their high copy number, and here we have shown that their amplification dynamics are also very similar. Thus, although the redistribution of Emigrant elements could also be the result of a preferential loss of those elements located far from genes, it is tempting to hypothesize that, as has been proposed for Alu elements within the human genome, there has been positive selection for Emigrant elements lying within or close to genes during Arabidopsis evolution.

A Role for Emigrant Elements in the Evolution of Arabidopsis Genes
Over the last 10 years a growing body of evidence has pointed toward a modular nature for the regulation of gene expression. Promoters, and probably terminators, are constituted by a complex array of regulatory elements. Most of these elements are found in many different promoters or terminators, although each promoter-terminator contains a particular combination of them. With the completion of genome sequencing projects it has become more and more clear that coding regions of eukaryote genes are also often composed of domains or modules that have been reshuffled during evolution. There are many different mechanisms that can account for the amplification and distribution of particularly successful coding or regulatory modules, but short replicative elements such as SINEs and MITEs would be particularly suitable candidates for such a function. SINEs are frequently found within or close to genes in Arabidopsis (Lenoir et al. 2001Citation ) and other organisms (Makalowski 1995Citation ), and it has been recently found that some of them can play an important biological role as coding or transcriptional regulatory regions (Shimamura et al. 1998Citation ; Ferrigno et al. 2001Citation ; Goodyer, Zheng, and Hendy 2001Citation ; Landry, Medstrand, and Mager 2001Citation ; Ackerman et al. 2002Citation ). Moreover, it has been proposed that B2 SINEs may have the potential to distribute a functional pol II promoter throughout the genome (Ferrigno et al. 2001Citation ). Here we show that a high number of Emigrant elements within potential promoters, terminators, introns, and coding sequences which may affect gene coding capacity or regulation have been conserved during evolution. Although molecular experiments to determine unambiguously the impact of these insertions have yet to be performed, our results suggest that the insertion of Emigrant elements has played an important role in the evolution of Arabidopsis genes. 
MITEs, as has been proposed for SINEs, could have been recruited by genomes in an evolutionary mechanism to generate novel coding or regulatory sequences. The fact that MITEs can probably be excised (Petersen and Seberg 2000; Yang et al. 2001) makes them even more suitable for such a function.


    Acknowledgements
We would like to thank E. Casacuberta, M. L. Espinás, J. Martinez-García, and P. Puigdomènech for critical reading of the manuscript. This work was supported by a grant from the Ministerio de Ciencia y Tecnología to J.M.C. (grant BIO2000-0953).


    Footnotes
 
Pierre Capy, Reviewing Editor

Keywords: Arabidopsis evolution MITE Emigrant master element

Address for correspondence and reprints: Josep M. Casacuberta, Department of Genètica Molecular, IBMB-CSIC, Jordi Girona 18, 08034 Barcelona, Spain. E-mail: jcsgmp@cid.csic.es.


    References

    Ackerman H., I. Udalova, J. Hull, D. Kwiatkowski, 2002 Evolution of a polymorphic regulatory element in interferon-gamma through transposition and mutation Mol. Biol. Evol 19:884-890

    Altschul S. F., W. Gish, W. Miller, E. W. Myers, D. J. Lipman, 1990 Basic local alignment search tool J. Mol. Biol 215:403-410

    Batzer M. A., P. L. Deininger, 2002 Alu repeats and human genomic diversity Nat. Rev. Genet 3:370-379

    Brookfield J. F., 2001 Selection on Alu sequences? Curr. Biol 11:R900-R901

    Bureau T. E., S. R. Wessler, 1992 Tourist: a large family of small inverted repeat elements frequently associated with maize genes Plant Cell 4:1283-1294

    Casacuberta E., J. M. Casacuberta, P. Puigdomènech, A. Monfort, 1998 Presence of miniature inverted-repeat transposable elements (MITEs) in the genome of Arabidopsis thaliana: characterisation of the Emigrant family of elements Plant J 16:79-85

    Casacuberta E., P. Puigdomènech, A. Monfort, 2000 Distribution of microsatellites in relation to coding sequences within the Arabidopsis thaliana genome Plant Sci 157:97-104

    Deininger P. L., M. A. Batzer, 1995 SINE master genes and population biology Pp. 43–60 in R. J. Maraia, ed. The impact of short interspersed elements (SINEs) on the host genome. RG Landes Company, Austin, Tex.

    Delattre M., P. Hansen, 1980 Bicriterion cluster analysis IEEE Trans. Pattern Anal. Mach. Intelligence 4:277-291

    Felsenstein J., 1989 PHYLIP—phylogeny inference package (version 3.56) Cladistics 5:164-166

    Ferrigno O., T. Virolle, Z. Djabari, J. P. Ortonne, R. J. White, D. Aberdam, 2001 Transposable B2 SINE elements can provide mobile RNA polymerase II promoters Nat. Genet 28:77-81

    Feschotte C., N. Jiang, S. R. Wessler, 2002 Plant transposable elements: where genetics meets genomics Nat. Rev. Genet 3:329-341

    Feschotte C., C. Mouches, 2000a. Evidence that a family of miniature inverted-repeat transposable elements (MITEs) from the Arabidopsis thaliana genome has arisen from a pogo-like DNA transposon Mol. Biol. Evol 17:730-737

    ———. 2000b. Recent amplification of miniature inverted-repeat transposable elements in the vector mosquito Culex pipiens: characterization of the Mimo family Gene 250:109-116

    Goodyer C. G., H. Zheng, G. N. Hendy, 2001 Alu elements in human growth hormone receptor gene 5' untranslated region exons J. Mol. Endocrinol 27:357-366

    Gordon A. D., 1999 Classification Chapman & Hall/CRC, New York.

    International Human Genome Sequencing Consortium. 2001 Initial sequencing and analysis of the human genome Nature 409:860-922

    Jukes T. H., C. R. Cantor, 1969 Evolution of protein molecules Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.

    Kumar A., J. Bennetzen, 1999 Plant retrotransposons Annu. Rev. Genet 33:479-532

    Landry J. R., P. Medstrand, D. L. Mager, 2001 Repetitive elements in the 5' untranslated region of a human zinc-finger gene modulate transcription and translation efficiency Genomics 76:110-116

    Le Q. H., S. Wright, T. Bureau, 2000 Transposon diversity in Arabidopsis thaliana Proc. Natl. Acad. Sci. USA 97:7376-7381

    Lenoir A., L. Lavie, J. L. Prieto, C. Goubely, J. C. Cote, T. Pelissier, J. M. Deragon, 2001 The evolutionary origin and genomic organization of SINEs in Arabidopsis thaliana Mol. Biol. Evol 18:2315-2322

    Makalowski W., 1995 SINEs as a genomic scrap yard: an essay on genomic evolution Pp. 81–104 in R. J. Maraia, ed. The impact of short interspersed elements (SINEs) on the host genome. RG Landes Company, Austin, Tex.

    Myers G., 1998 A fast bit-vector algorithm for approximate string matching based on dynamic programming. Proc. Ninth Combinatorial Pattern Matching Conference Springer-Verlag LNCS Series 1448:1-13

    Nei M., 1987 Molecular evolutionary genetics Columbia University Press, New York

    Oosumi T., B. Garlick, W. R. Belknap, 1996 Identification of putative nonautonomous transposable elements associated with several transposon families in Caenorhabditis elegans J. Mol. Evol 43:11-18

    Petersen G., O. Seberg, 2000 Phylogenetic evidence for excision of Stowaway Miniature Inverted-Repeat Transposable Elements in Triticeae (Poaceae) Mol. Biol. Evol 17:1589-1596

    Rozas J., R. Rozas, 1999 DnaSP version 3: an integrated program for molecular population genetics and molecular evolution analysis Bioinformatics 15:174-175

    Saitou N., N. Nei, 1987 The neighbour-joining method: a new method for reconstructing phylogenetic trees Mol. Biol. Evol 4:406-425

    Shimamura M., M. Nikaido, K. Ohshima, N. Okada, 1998 A SINE that acquired a role in signal transduction during evolution Mol. Biol. Evol 15:923-925

    Thompson J. D., D. G. Higgins, T. J. Gibson, 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, population-specific gap penalties and weight matrix choice Nucleic Acids Res 22:4673-4680

    The Arabidopsis Genome Initiative. 2000 Analysis of the genome sequence of the flowering plant Arabidopsis thaliana Nature 408:796-815

    Tikhonov A. P., J. L. Bennetzen, Z. V. Avramova, 2000 Structural domains and matrix attachment regions along colinear chromosomal segments of maize and sorghum Plant Cell 12:249-264

    Tu Z., 1997 Three novel families of miniature inverted-repeat transposable elements are associated with genes of the yellow fever mosquito, Aedes aegypti Proc. Natl. Acad. Sci. USA 94:7475-7480

    ———. 2001 Eight novel families of miniature inverted repeat transposable elements in the African malaria mosquito, Anopheles gambiae Proc. Natl. Acad. Sci. USA 98:1699-1704

    Turcotte K., S. Srinivasan, T. Bureau, 2001 Survey of transposable elements from rice genomic sequences Plant J 25:169-179

    Wessler S., T. Bureau, S. E. White, 1995 LTR-retrotransposons and MITEs: important players in the evolution of plant genomes Curr. Opin. Genet. Dev 5:814-821

    Yang G., J. Dong, M. B. Chandrasekharan, T. C. Hall, 2001 Kiddo, a new transposable element family closely associated with rice genes Mol. Genet. Genomics 266:417-424

    Zahn C. T., 1971 Graph-theoretical methods for detecting and describing gestalt clusters IEEE Trans. Comput C-20:68-86

    Zhang Q., J. Arbuckle, S. R. Wessler, 2000 Recent, extensive, and preferential insertion of members of the miniature inverted-repeat transposable element family Heartbreaker into genic regions in maize Proc. Natl. Acad. Sci. USA 97:1160-1165

    Zhang X., C. Feschotte, Q. Zhang, N. Jiang, W. R. Eggleston, S. R. Wessler, 2001 P instability factor: an active maize transposon system associated with the amplification of Tourist-like MITEs and a new superfamily of transposases Proc. Natl. Acad. Sci. USA 98:12572-12577

Accepted for publication August 22, 2002.