Department of Biochemistry, Virginia Polytechnic Institute and State University
Correspondence: E-mail: jaketu{at}vt.edu.
Abstract
Key Words: genome, molecular evolution, non-LTR retrotransposon, polyadenylation, retrotransposition, reverse transcriptase
Introduction
Non-LTRs are generally 3-8 kb long and have been found in virtually all eukaryotes studied. Most non-LTRs have a reverse transcriptase (RT) domain that is essential for their retrotransposition. The RT domain has been used for phylogenetic classification of non-LTR retrotransposons into 11 clades, all of which date back to the Precambrian era, approximately 600 MYA (Malik, Burke, and Eickbush 1999). The total number of clades has recently been increased to 15, with the addition of the NeSL-1 (Malik and Eickbush 2000), Ingi, Rex1 (Eickbush and Malik 2002), and L2 clades (Lovsin, Gubensek, and Kordiš 2001). In addition to the RT domain, many non-LTRs encode an additional protein, related to the retroviral Gag genes, that is a nucleic acid binding protein or nucleocapsid. Studies of this Gag-like protein from L1 in mice show that it acts as a nucleic acid chaperone (Martin and Bushman 2001). Some elements also encode a ribonuclease H (RNase H) and/or AP endonuclease (APE) domain. Other typical structural characteristics found in various non-LTR families are internal pol II promoters and 3' ends containing canonical AATAAA polyadenylation signals, poly (dA) tails, or adenosine-rich tandem repeats. Target-primed reverse transcription has been proposed as the mechanism of retrotransposition for R2 of Bombyx mori and may apply generally to all non-LTR elements (Luan et al. 1993). Target site duplications (TSDs) are generated by most non-LTRs as a result of insertion of a new copy.
Non-LTRs may have had significant influences on eukaryotic genome evolution (Brosius 1999; Makalowski 2000; Kidwell and Lisch 2001). They have been shown to occupy a large portion of some genomes, contain sequences that affect the expression of nearby genes, serve as homologous sites for recombination, and contribute to novel exons. For example, L1 from humans and a SINE named Alu, which is thought to be retrotransposed by L1, make up more than 27% of the human genome (Lander et al. 2001). In addition, L1 has been shown to be capable of 3' transduction (Moran, DeBerardinis, and Kazazian 1999), and is believed to be responsible for the duplication of up to 1% of the human genome (Goodier, Ostertag, and Kazazian 2000; Pickeral et al. 2000). L1 is also believed to be responsible for the production of processed pseudogenes because they have the characteristics of retrotransposed sequences. Although non-LTRs occupy a significant bulk of the human genome, they represent only two of the 15 described clades (Malik, Burke, and Eickbush 1999; Lander et al. 2001). Large-scale surveys of Arabidopsis thaliana, Caenorhabditis elegans, and a few other smaller eukaryotic genomes showed similar lack of diversity. However, in Drosophila melanogaster (Berezikov, Bucheton, and Busseau 2000), more than 10 families of non-LTRs were discovered which represent six of the 15 clades. The recently reported Fugu rubripes genome (Aparicio et al. 2002) also showed significant diversity, containing non-LTRs from five clades. Here we report the discovery and characterization of a large number of non-LTRs in the newly released A. gambiae genome assembly.
Materials and Methods
Identification of Full-Length Elements
The nucleotide sequences of each family, plus flanking genomic sequence, were aligned to determine transposon boundaries and full-length elements (fig. 1B). The alignment was also used to identify TSDs. Alignments were produced with ClustalW version 1.81 for the Linux operating system (Thompson, Higgins, and Gibson 1994) and viewed with ClustalX version 1.81 for Windows (Thompson et al. 1997). The following parameters were used: pairwise gap penalty (open = 10, extension = 0.1), multiple gap penalty (open = 10, extension = 0.2). The alignment process was facilitated by the use of FromTEpost to retrieve all qualified hits into a fasta file. In practice, the search for non-LTR families and the identification of full-length elements were performed concurrently. Identified full-length elements were used in TEmask whenever available.
Software Description
Software modules designed for this study include four C programs (TEpost, TEcombine, FromTEpost, and TEmask), all of which are available for download on our Web site (www.biochem.vt.edu/aedes or http://128.173.80.165). Details of the programs can be found in the corresponding readme files on our Web site. TEpost takes a Blast output file (produced by single or multiple queries) as input and produces an output file listing each Blast hit in a row along with several characteristics associated with that hit. The parameters for TEpost are maximum transposon length, maximum transposon gap, and minimal overlap. Maximum transposon length restricts the hits reported to those with a length less than the specified value. Because of the nature of Blast and the presence of indels or other chromosomal rearrangements, a single transposon copy can be reported as multiple Blast hits, which can result in an overestimation of copy number. The gap length parameter was added to reduce this occurrence by grouping fragmented hits associated with one transposon copy as a single match. The formula for gap length is |(Q2 - Q1) - (S2 - S1)|, where Q1 and Q2 are the start and end of the query aligning with the Blast hits, and S1 and S2 are the start and end of the subject in the Blast hits. Please note that the start and end positions are from neighboring hits being considered as potentially one "continuous" match. Situations where query and subject are on opposite strands are treated separately. A maximum cut-off value is used so that hits having gaps exceeding this value are recorded as separate and are assumed to originate from different transposon copies. Additional optional parameters that filter the TEpost output are cut-off limits for E-value, hit length, and percent identity. When multiple queries from the same region of a transposon are used for Blast searches, it is expected that the same hit will be reported for more than one query.
This creates a problem when TEpost files generated from Blast searches using different queries are combined. TEcombine was designed to eliminate these redundant hits by retaining only the hit listing with the most significant E-value. Minimal overlap is a parameter used here in the same way as that described for TEpost. Both TEpost and TEcombine output files can be copied/pasted into Microsoft Excel for sorting and viewing. FromTEpost uses both TEpost and TEcombine files as input to produce fasta sequence files of the recorded hits. Flanking sequences can be included by choice of the user. Sequences on plus and minus strands are reported separately. TEmask uses information from either a TEpost or TEcombine file to mask a database for all of the recorded hits. As with the previously described program RepeatMasker (Smit and Green, http://ftp.genome.washington.edu/RM/RepeatMasker.html), TEmask ensures that discovered families are not hit again in subsequent Blast searches.
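The hit-grouping and redundancy-removal steps described for TEpost and TEcombine can be illustrated with a minimal Python sketch. This is not the authors' C code. The `Hit` record, the function names, and the restriction to same-strand hits are assumptions made for illustration, and the gap is read here as the absolute difference between the query span and the subject span covered by two neighboring hits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One Blast hit (hypothetical record; coordinates are 1-based, plus strand)."""
    query: str
    subject: str
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float


def gap_length(first: Hit, second: Hit) -> int:
    """Assumed reading of the gap rule: |query span - subject span|."""
    q_span = second.q_end - first.q_start
    s_span = second.s_end - first.s_start
    return abs(q_span - s_span)


def group_hits(hits: List[Hit], max_gap: int) -> List[List[Hit]]:
    """Group neighboring hits that plausibly belong to one transposon copy."""
    groups: List[List[Hit]] = []
    for hit in sorted(hits, key=lambda h: (h.subject, h.s_start)):
        last = groups[-1][-1] if groups else None
        if (last is not None
                and last.subject == hit.subject
                and last.query == hit.query
                and gap_length(last, hit) <= max_gap):
            groups[-1].append(hit)   # fragment of the same copy
        else:
            groups.append([hit])     # start of a new copy
    return groups


def overlap(a: Hit, b: Hit) -> int:
    """Number of subject bases shared by two hits (non-positive if disjoint)."""
    return min(a.s_end, b.s_end) - max(a.s_start, b.s_start) + 1


def drop_redundant(hits: List[Hit], min_overlap: int) -> List[Hit]:
    """Keep only the most significant hit among hits overlapping on the subject."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):  # lowest E-value first
        redundant = any(k.subject == hit.subject
                        and overlap(k, hit) >= min_overlap for k in kept)
        if not redundant:
            kept.append(hit)
    return kept
```

With a maximum gap of 3,000, two fragments of one copy separated by a short indel fall into one group, whereas a second copy located far away on the same scaffold starts a new group.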
Copy Number Estimation
Most non-LTR retrotransposon copies are truncated at the 5' end as a result of incomplete reverse transcription during retrotransposition. Copy numbers are therefore expected to differ between estimates based on the 5' and 3' ends. Full-length elements were identified by multiple sequence alignment for use as queries in copy number determination. When full-length copies were not identified, we used the longest obtainable sequence in a family (including the 3' terminus when possible). A multiple query fasta file (one sequence per family) was used in a BlastN search with an E-value cut-off of 1e-10. The Blast output was then processed using TEpost and TEcombine, during which only sequences that showed more than 80% nucleotide identity were included for copy number estimation. The parameters used were minimum hit length, 50; maximum transposon length, 10,000; maximum transposon gap, 3,000; and minimal overlap, 50. To determine the minimum percentage of the genome occupied by non-LTRs, we summed the total number of bases resulting from Blast hits. Manual inspection of the output files used for copy number determination showed that a few copies were reported twice because of hits resulting from different parts of the query that exceeded the allowed gap. These occurrences were rare, however, and the numbers were adjusted when any were identified. Some queries contained other repetitive elements that resulted in increased copy numbers, but these hits were excluded from the reported copy numbers. It should be noted that we have used a working definition of a non-LTR family as a group of sequences having 80% amino acid identity in the conserved RT regions. This definition is more inclusive than the criterion used for copy number estimation, which is 80% nucleotide identity in the conserved RT region. Therefore, our current copy number estimations may be slight underestimates in some cases.
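The filtering thresholds and the base-summing step lend themselves to a short worked example. The Python sketch below is an illustration under stated assumptions rather than the procedure actually implemented in TEpost and TEcombine: the function names are hypothetical, the genome size is left as a caller-supplied parameter, and overlapping hit intervals are merged before summing so that no base is counted twice (an added safeguard that the text does not describe).

```python
from typing import Iterable, Optional, Tuple

# (scaffold, start, end), 1-based inclusive coordinates
Interval = Tuple[str, int, int]


def passes_filters(length: int, identity: float, evalue: float,
                   min_len: int = 50, min_identity: float = 80.0,
                   max_evalue: float = 1e-10) -> bool:
    """Apply the cut-offs quoted in the text: hit length of at least 50,
    more than 80% nucleotide identity, and E-value no worse than 1e-10."""
    return length >= min_len and identity > min_identity and evalue <= max_evalue


def covered_bases(intervals: Iterable[Interval]) -> int:
    """Sum the bases covered by hits, merging overlaps within a scaffold."""
    total = 0
    current: Optional[Interval] = None
    for scaffold, start, end in sorted(intervals):
        if current is not None and current[0] == scaffold and start <= current[2]:
            # Overlaps the running interval: extend it.
            current = (scaffold, current[1], max(current[2], end))
        else:
            if current is not None:
                total += current[2] - current[1] + 1
            current = (scaffold, start, end)
    if current is not None:
        total += current[2] - current[1] + 1
    return total


def genome_fraction(intervals: Iterable[Interval], genome_size: int) -> float:
    """Fraction of the genome covered by the supplied hit intervals."""
    return covered_bases(intervals) / genome_size
```

Because divergent copies fail the identity filter, a fraction computed this way is a lower bound, consistent with the "at least" wording used for the genome percentage.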
Phylogenetic Analysis
Phylogenetic analyses were performed using multiple sequence alignments of approximately 260 amino acid characters in the conserved regions 0-5 of the RT domains. These alignments were obtained using ClustalW version 1.81 for the Linux operating system (Thompson, Higgins, and Gibson 1994) and viewed with ClustalX version 1.81 for Windows (Thompson et al. 1997). The parameters used for these alignments were pairwise gap penalty (open = 30, extension = 0.8) and multiple gap penalty (open = 10, extension = 0.25). Only one representative was used from each family found in the A. gambiae genome. In the case of the alignment used in the overall phylogenetic analysis shown in figure 2, minor adjustments were made in the alignment at conserved region 5 (Malik, Burke, and Eickbush 1999) for two families of the R4 clade, Dong of Bombyx mori and Ag-R4-1. All phylogenetic analyses were performed with PAUP* version 4.0b10 (Swofford 2002). Both Neighbor-Joining and maximum parsimony trees were constructed with bootstrap support of 500 replicates. In cases where it was not possible to obtain amino acid sequence including all regions 0-5, the longest obtainable amino acid sequence was used. Outgroup selections and specific parameters for the phylogenetic analyses are described in the legends of figures 2 and 3, respectively.
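Bootstrap support of the kind reported here rests on resampling alignment columns with replacement and rebuilding the tree from each pseudo-replicate. PAUP* performs this internally; the Python sketch below shows only the resampling step, with a hypothetical function name and a toy alignment, and it makes no attempt at tree construction.

```python
import random
from typing import Dict


def bootstrap_alignment(alignment: Dict[str, str],
                        rng: random.Random) -> Dict[str, str]:
    """Return one bootstrap pseudo-replicate of a multiple sequence alignment.

    Columns are drawn with replacement, and the same column indices are
    applied to every sequence so that homologous positions stay together.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


if __name__ == "__main__":
    # Toy alignment; real input would be the ~260-character RT alignment.
    toy = {"famA": "MKLV-A", "famB": "MRLVTA", "famC": "MKIV-G"}
    rng = random.Random(0)  # fixed seed so the run is repeatable
    replicates = [bootstrap_alignment(toy, rng) for _ in range(500)]
    print(len(replicates), "pseudo-replicates generated")
```

Each pseudo-replicate would then be passed to the tree-building method, and the percentage of replicate trees containing a given group is its bootstrap support.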
Results
Loner and Outcast are New Clades, Each Comprised of Divergent Families
Phylogenetic analysis (fig. 2) revealed two novel deep-branching groups, which we designate as the new clades Loner and Outcast. Both are well supported by bootstrap analyses (100%) using Neighbor-Joining and maximum parsimony methods. Domains found in ORF1 and ORF2 of these new clades are compared in figure 4. ORF1 from elements of both clades contains three repeated cysteine-histidine (Cys/His) motifs (CCHC) characteristic of nucleic acid-binding domains. A single Cys/His motif was also found in the 3' end of ORF2 in families of these clades, similar to what has been described for elements in other clades (Malik, Burke, and Eickbush 1999). ORF2 from elements of these clades contains domains in the order of APE, RT, and RNase H. Whereas ORF2 of Ag-Loner-1 appears to require a frameshift for translation, the ORF2 of Ag-Outcast-6 appears to start from an ATG 95 nucleotides downstream of the ORF1 stop. Three criteria have been proposed to define a clade: shared structural features, ample phylogenetic support for the group, and an origin dating back to the Precambrian era (Malik, Burke, and Eickbush 1999). The first two of the three proposed criteria are met for both the Loner and Outcast clades. The last criterion can be assumed based on the comparison of the branch origins of Loner and Outcast to those of other established clades. Multiple divergent families were also found within each clade. Copy numbers are relatively low for most of the families in these clades, ranging from 4 to 86 (table 1 and appendix A). No representatives were found in other organisms on the basis of tBlastN searches against the NCBI nonredundant nucleotide database using amino acid query sequences including the RT domains of the Loner and Outcast families. Loner and Outcast clades both contain families that appear to have been recently active (table 1). The Ag-Loner-1 family has two copies with 99.3% nucleotide identity and a significant hit to an EST.
Four copies of Ag-Outcast-6 are virtually identical, having nucleotide identities of 99.9%-100% when compared to the consensus sequence of three full-length copies. Two other families, Ag-Outcast-2 and Ag-Outcast-5, also have copies with a high degree of nucleotide identity. Both clades have families with TSDs (table 1).
A. gambiae Has Representatives of the Recently Described L2 Clade
Recently L2 was proposed as a new clade (Lovsin, Gubensek, and Kordiš 2001), having diverse representatives from insect, sea urchin, snake, and fish. This clade was once considered part of the CR1 clade, but later was established as distinct from CR1. In addition to three families in A. gambiae (Ag-L2-1, 2, and 3) and the Takifugu rubripes representative (Lovsin, Gubensek, and Kordiš 2001), two diverse representatives were found in zebrafish in support of this clade. ORF2 of Ag-L2 families contains an APE upstream of the RT, like families in the CR1 clade. Representatives from diverse organisms, phylogenetic support, and the deep branching of this group support its status as a clade distinct from CR1. The Ag-L2-1 family has two 5.2 kb copies with nucleotide identities of 99.9%, but their ORF2 proteins contain frameshifts. Ag-L2-2 and Ag-L2-3 families have multiple copies with identities of 96% or less spanning roughly 300-500 nucleotides.
Sponge, a Unique Gag-Only Non-LTR Retrotransposon
While analyzing sequences of the Ag-CR1-3 family, we found a group of sequences of identical length that were much shorter than Ag-CR1-3. Further analysis revealed these sequences to be a distinct family that contained a 1.9-kb deletion in ORF2 that eliminates the RT domain (fig. 4). This family was named Ag-Sponge for its dependency on another element for retrotransposition. It has at least 13 full-length copies that are approximately 2.4 kb long (not shown). These full-length copies have the same 5' and 3' termini, unique flanking genomic sequences, and a high degree of nucleotide identity, suggesting that they were once an intact unit of retrotransposition. Ag-Sponge and Ag-CR1-3 have 98% nucleotide identity covering 160 bases in their 3' ends, including their tandem repeats. The remainder of Ag-Sponge has approximately 95% identity to Ag-CR1-3 and contains another 400 bp deletion in the APE region of ORF2. Sponge probably relied on the reverse transcriptase of an autonomous non-LTR, likely that of Ag-CR1-3. The discovery of a family of internally deleted nonautonomous non-LTRs is notable because most nonautonomous non-LTRs are 5' truncations that may no longer be able to form a unit for retrotransposition.
3' Sequence Characteristics
The 3' ends of non-LTRs contain important sequence characteristics such as polyadenylation signals, poly (dA) tails or unique tandem repeats, and in some cases conserved secondary structure (Finnegan 1997; Mathews et al. 1997). The first three of these characteristics are listed in table 1 and Appendix A for A. gambiae non-LTRs. The presence of these characteristics is variable for any given family. Considering all clades, the majority of families have polyadenylation signals but do not have poly (dA) tails in the 3' end. For most elements with poly (dA) tails in the genomic sequence, the canonical AATAAA polyadenylation signal can be found upstream. Interestingly, there are some families in the R1 clade that have poly (dA) tails, but no canonical AATAAA polyadenylation signal. Other families have neither polyadenylation signals nor poly (dA) tails. Most A. gambiae elements have a 3' tandem repeat, with exceptions in the L1 and R1 clades. The repeat unit varies in length for different families, from the 2-bp repeat AT for Ag-Jock-13 of the Jockey clade to the variety of 8 bp repeats for families of the CR1 clade. In some cases, the sequence of the tandem repeat units reported in the order 1, 2, 3 may actually be in the order 2, 3, 1 because it was difficult to determine the start and finish of the repeat unit (e.g., Ag-Outcast-2 and Ag-Outcast-5 have GAA and AAG repeats). A notable feature is that the presence of a tandem repeat and the presence of a poly (dA) tail are apparently mutually exclusive. For most families, the total length of the 3' tandem repeat is restricted to approximately 30 bp or less and starts close to or immediately downstream of the AATAAA signal. Another notable feature is that there appears to be a general correlation between the 3' repeat sequence and RT phylogeny, with a few exceptions.
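The ambiguity in the phase of a repeat unit (order 1, 2, 3 versus 2, 3, 1) can be handled by reducing each unit to a canonical rotation before comparing families. The Python sketch below is a hypothetical illustration, not a procedure used in this study; the function names and the choice of the alphabetically smallest rotation as the canonical form are assumptions.

```python
from typing import Optional, Tuple


def canonical_unit(unit: str) -> str:
    """Return the alphabetically smallest rotation of a repeat unit,
    so that units differing only in phase compare as equal."""
    rotations = [unit[i:] + unit[:i] for i in range(len(unit))]
    return min(rotations)


def trailing_repeat(seq: str, unit_len: int,
                    min_copies: int = 3) -> Optional[Tuple[str, int]]:
    """Detect a perfect tandem repeat that ends exactly at the 3' end.

    Returns (unit, number_of_copies), or None if fewer than min_copies
    consecutive copies of the terminal unit are present.
    """
    if unit_len <= 0 or len(seq) < unit_len * min_copies:
        return None
    unit = seq[-unit_len:]
    copies = 0
    pos = len(seq)
    # Walk upstream from the 3' end one unit at a time.
    while pos - unit_len >= 0 and seq[pos - unit_len:pos] == unit:
        copies += 1
        pos -= unit_len
    return (unit, copies) if copies >= min_copies else None
```

Under this convention GAA and AAG reduce to the same canonical unit, so two families reported with these repeats would be scored as sharing a 3' repeat. Real copies often carry imperfect repeats, which this exact-match sketch does not tolerate.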
Non-LTR Families Vary Widely in Copy Number
To determine copy number, the longest obtainable sequence of each family was used as a query for a Blast search. These queries included the 3' terminus when possible. The majority of hits were not full-length, either because of incomplete reverse transcription or because they did not fall within the specified parameters (see Materials and Methods for parameters used). Copy numbers are highly variable and range from just a few to about 2,000 for A. gambiae non-LTRs (table 1 and Appendix A). Most families are of low copy number, with exceptions in the RTE and CR1 clades. Ag-JAMMIN-1 and Ag-JAMMIN-2 families of the RTE clade both have copy numbers dramatically higher than any other A. gambiae non-LTR family. Together they contribute about 25% of the non-LTR portion of the genome. The T1, Ag-CR1-13, and Ag-CR1-3 families of the CR1 clade also have attained significantly higher copy numbers than most families. We have determined that non-LTRs make up at least 3% of the A. gambiae genome. Reported TE copy numbers will often be underestimates because older elements will have mutated beyond recognition. This also results in a conservative estimate of the percentage of a genome made up of non-LTRs.
Discussion
Unprecedented Diversity, Recent Activity in Multiple Lineages, and Evolutionary Implications
Over a hundred families of non-LTRs were found in the A. gambiae genome, and they occupy at least 3% of the genome. These families, defined by at least 20% amino acid sequence divergence in their RT domains, represent 10 different clades including eight of the 15 previously described clades plus two novel clades, Loner and Outcast. To our knowledge, such a level of inter- and intracladal diversity of non-LTRs has not been reported in any genome. Considering that there are interscaffold and intrascaffold gaps (Holt et al. 2002), there still may be unidentified families in the genome. In this regard, A. gambiae is in great contrast to the human genome, where only three families in the L1 and L2 clades were found to occupy over 20% of the genome (Lander et al. 2001). Large-scale surveys of a few smaller eukaryotic genomes showed a similar lack of diversity (table 2). However, the recently reported Fugu rubripes genome (Aparicio et al. 2002) showed a significant level of diversity, containing non-LTRs from five clades. In D. melanogaster, more than 10 families of non-LTRs were reported (Berezikov, Bucheton, and Busseau 2000), which represent six of the 15 clades, four of which are found in A. gambiae (I, Jockey, CR1, and R1; table 2). Because their definition of a family was perhaps more inclusive than the one used in this study and because significant portions of the heterochromatic regions were not analyzed (Berezikov, Bucheton, and Busseau 2000), there may be even greater diversity waiting to be discovered in D. melanogaster. The number of described non-LTR clades in all eukaryotes has increased dramatically from 11 (Malik, Burke, and Eickbush 1999) to 17 over the past few years. The availability of increasing numbers of genomic sequences and the development of computational approaches for database searches will further facilitate our understanding of this divergent group of long-time residents in a broad range of genomes. 
As genomic data accumulate, it will be interesting to determine whether non-LTRs in the Loner and Outcast clades will be found in other species, and whether non-LTRs in the L1 clade will be found in other invertebrates.
At least 21 of the highly diverse non-LTR families show evidence of recent activity. All clades except R4 have one or more families with some or all of the sequence characteristics that suggest recent transposition activity, such as multiple full-length copies with over 99% nucleotide identity, target site duplications (TSDs), intact ORFs, and corresponding ESTs. Therefore it is clear that these divergent clades for the most part are not ancient "fossils." Rather, many families in these clades are or have recently been active components of the A. gambiae genome. It is remarkable that a large number of diverse non-LTR lineages have been maintained in the genome, and many show evidence of recent activity (table 1 and Appendix A). In A. gambiae, both intra- and inter-cladal diversity are biased toward expansion in lineages of the non-site-specific type, with the exception of some non-LTRs in the R1 clade that showed specific target sites. Elements in CRE, NeSL-1, R2, and R4 clades encode restriction enzyme-like endonucleases that confer site-specificity and represent the primordial non-LTRs. Of these clades, only degenerate copies of R4-like elements were found in the A. gambiae genome. In contrast, the CR1 and Jockey clades have flourished, having more families than any other clade (fig. 1). At least six families in the Jockey clade and five families in the CR1 clade appear to have been recently active based on sequence analysis. Some of these families have significant hits to A. gambiae ESTs. These estimates are conservative because in the CR1 clade, frequent 5' truncations and the lack of TSDs made it difficult to identify full-length elements for many families.
High levels of diversity and the maintenance of multiple recently active lineages within different clades in the A. gambiae genome indicate a complex evolutionary scenario. Unlike DNA-mediated TEs, for which horizontal transfer is well documented, analysis by Eickbush and Malik (2002) suggests that there is no reason to believe that non-LTRs have been involved in horizontal transfer. They showed that non-LTRs previously implicated in horizontal transfers on the basis of phylogenetic incongruence actually obey expectations for vertical transmission. The high degree of intracladal diversity described in our current study suggests that great caution should be applied when inferring horizontal transfer on the basis of phylogenetic analysis, because the presence of a large number of paralogous families can easily confound the analysis. If we accept the assumption of vertical transmission, A. gambiae non-LTRs of the CR1 and Jockey clades must have gone through a tremendous diversification (fig. 3). Without extensive CR1 and Jockey sequences from other mosquitoes, it is difficult to determine whether the diversification has occurred prior to the common ancestor of mosquitoes or within the lineage leading to A. gambiae during mosquito evolution. However, most of the diversification events may have occurred after the divergence that led to D. melanogaster and A. gambiae lineages. Given the presence of multiple recently active lineages within the CR1 and Jockey clades, it is tempting to speculate that the observed diversity is driven by positive selection generated by competition among different non-LTR families or by attempts to escape suppressive mechanisms by the host. Nucleotide sequences are too divergent between these families for dS/dN analysis to assess the selection pressure, especially in ORF1 which is under a lower degree of selection pressure than the RT of ORF2. 
However, future comparative analysis of non-LTRs in closely related mosquitoes may provide answers to this question.
3' Sequence Characteristics and Their Relation to Retrotransposition
One of the earliest characteristics defining non-LTRs was the presence of the canonical AATAAA polyadenylation signal and either poly(dA) or A-rich tandem repeats. Except for families in the CR1, Outcast, and RTE clades, tandem repeats are not A-rich. Most of the families found in A. gambiae have the polyadenylation signal, but only a small number of them have the poly (dA) tail associated with it in the genomic sequence. This is curious because it would be expected that RT would copy the poly (A) sequence from the RNA during retrotransposition resulting in poly (dA) tails in the genomic sequence. Such a disconnect between the polyadenylation signal and the presence of poly (dA) has been previously described in CM-gag, a non-LTR retrotransposon from the Culex pipiens mosquito (Bensaadi-Merchermek et al. 1997). Analysis of the transcript of CM-gag showed that it was polyadenylated immediately downstream of its TTGAA tandem repeat. It appears that in this case, reverse transcription is not initiated until the TTGAA repeat during retrotransposition. A recent study of the D. melanogaster I factor provided a similar conclusion (Chambeyron, Bucheton, and Busseau 2002). In contrast, some A. gambiae non-LTRs in the R1 clade do not have a canonical polyadenylation signal but have a poly (dA) tail in the genomic sequence. Several noncanonical poly (A) signal sequences have been documented (Graber et al. 1999; MacDonald and Redondo 2002), which could explain what is observed in these R1 families, although no particular candidate sequence could be found. In addition, the D. melanogaster I factor may have been using its TAA tandem repeat as a poly (A) signal (Chambeyron, Bucheton, and Busseau 2002). Finally, there are other non-LTR families in A. gambiae from the Jockey, L2, Outcast, and RTE clades that have neither poly(A) signals nor poly(dA) tails. 
If these families use cryptic polyadenylation signals, their poly(A) tails must not have been part of the reverse transcription reaction, or at least were not used as templates. Analysis of the 3' regions of the transcripts from A. gambiae non-LTRs may help clarify RNA processing for these different groups, provided that transcripts of functional copies can be obtained. The human L1s are so far the only non-LTR families that have been shown to be responsible for 3' transduction and pseudogene formation because they recognize the poly(A) tail for reverse transcription (Esnault, Maestre, and Heidmann 2000; Wei et al. 2001). It will be interesting to determine whether A. gambiae L1s and other poly (dA) non-LTRs are capable of 3' transduction or creating processed pseudogenes, which may determine whether poly (dA) non-LTRs had a broader genomic impact in species other than mammals.
The 3' tandem repeat may originate from a telomerase-like activity of the retrotransposition machinery (Chaboissier, Finnegan, and Bucheton 2000). Such a mechanism could explain the conservation of the 3' repeat sequence between some closely related non-LTR families (table 1 and Appendix A). However, there is also a high degree of variability of repeat sequences among some closely related families, as well as variability in the number of tandem repeats in different copies of a given family. The apparent exceptions to the conservation suggest that the specific tandem repeat sequences may not be required for retrotransposition, which is consistent with the earlier hypothesis of Chaboissier, Finnegan, and Bucheton (2000). However, the 3' tandem repeats may influence retrotransposition in other ways. For example, Chambeyron, Bucheton, and Busseau (2002) showed that the TAA repeats of D. melanogaster I factor directed the precise initiation of reverse transcription. It was also suggested that the tandem repeats could play a role in target site specificity (Chaboissier, Finnegan, and Bucheton 2000).
Non-LTRs and Nonautonomous Retroelements
Many non-LTR families contain 5' truncated copies, although these 5' truncations most likely will not become a distinct unit of retrotransposition because of the lack of promoter. In this study we have discovered a non-LTR family Ag-Sponge that has a large internal deletion and that presumably is derived from Ag-CR1-3. We have shown that Ag-Sponge has been an intact unit of successful retrotransposition, probably by "borrowing" the protein machinery from Ag-CR1-3. This discovery underscores the potential for retrotransposition machinery to act in trans (Jensen et al. 1994; Wei et al. 2001). Two other nonautonomous non-LTRs, Het-A of D. melanogaster and CM-gag of Culex pipiens, have been reported which encode only the Gag-like protein (Biessmann et al. 1992; Bensaadi-Merchermek et al. 1997).
Short interspersed nuclear elements (SINEs) are another type of nonautonomous retrotransposons, which presumably also "borrow" the retrotransposition machinery from "partner" non-LTRs. However, SINEs are different from Ag-Sponge because they have a composite structure and use Pol III promoters. A highly repetitive SINE family named Ag-SINE200 that has been identified in A. gambiae has an apparent AAG tandem repeat at the 3' end (Holt et al. 2002). Although several non-LTRs sharing the same 3' tandem repeat have been identified (table 1 and appendix A), no significant similarity was found between Ag-SINE200 and the non-LTRs beyond the tandem repeats.
Potential Applications of A. gambiae Non-LTRs
A. gambiae is the most important vector of malaria, a disease that is responsible for more than a million deaths every year. Vector control, a major component of malaria control strategies, is hampered by increasing insecticide resistance and the genetic heterogeneity of the vector complex. One novel approach is actively being pursued which aims to replace vector mosquitoes in wild populations with genetically modified mosquitoes that are incompetent disease vectors. Our analysis of A. gambiae non-LTRs may contribute to the genetic strategy to control mosquito-borne diseases by providing transformation tools, gene-driving mechanisms, and genetic markers for population studies. One major challenge to establishing sophisticated vector control programs and meaningful epidemiological studies has been the genetic complexities in A. gambiae populations (Powell et al. 1999). Several recent studies using a number of genetic markers have made significant progress toward illustrating this complexity while pointing to the need for more extensive research (e.g., Black and Lanzaro 2001; della Torre et al. 2001). The development of new population genomic approaches is needed because conflict exists between results obtained using different types of genetic markers and markers at different genomic locations (Besansky, 1999). Because many non-LTR families are interspersed throughout the genome, and because relatively high levels of insertion polymorphisms are expected on the basis of evidence for recent retrotransposition, non-LTRs are great sources of polymorphic markers across different regions of the A. gambiae genome. Like the human Alu elements, these polymorphic insertion sites provide co-dominant markers when sequences flanking a TE at the specific locus are used as primers for PCR amplification of genomic DNA isolated from an individual sample (e.g., Batzer et al. 1994; Stoneking et al. 1997; Roy-Engel et al. 2001). 
Like Alu, inserted non-LTR copies are not subject to excision by transposition, which makes them ideal for use as markers in population genetics because the ancestral state is known and is non-insertion. Furthermore, non-LTRs can be exploited as gene vectors. For example, the human L1 has been used to make a retrotransposon-adenovirus hybrid vector capable of efficient delivery of stably integrated transgenes (Soifer et al. 2001). Because non-LTRs transpose through a mechanism completely different from that of the DNA-mediated TEs currently pursued as major vectors for the genetic manipulation of mosquitoes, non-LTRs can possibly be developed as alternative vectors with different features. For example, retrotransposed copies are not subject to further excision, which could make them more stable than DNA-mediated TE vectors. As mosquito biology enters the post-genomic era, these potential new tools for population analysis and genetic manipulation will undoubtedly contribute to a better understanding of these important disease vectors and to an informed and sustainable control strategy.
Literature Cited
Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
Aparicio, S., J. Chapman, E. Stupka, N. Putnam, J. M. Chia, P. Dehal, A. Christoffels, S. Rash, S. Hoon, and A. Smit, et al. 2002. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297:1301-1310.
Bao, Z., and S. R. Eddy. 2002. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12:1269-1276.
Batzer, M. A., M. Stoneking, M. Alegria-Hartman, H. Bazan, D. H. Kass, T. H. Shaikh, G. E. Novick, P. A. Ioannou, W. D. Scheer, and R. J. Herrera, et al. 1994. African origin of human-specific polymorphic Alu insertions. Proc. Natl. Acad. Sci. USA 91:12288-12292.
Bensaadi-Merchermek, N., C. Cagnon, I. Desmons, J. C. Salvado, S. Karama, F. D'Amico, and C. Mouches. 1997. CM-gag, a transposable-like element reiterated in the genome of Culex pipiens mosquitoes, contains only a gag gene. Genetica 100:141-148.
Berezikov, E., A. Bucheton, and I. Busseau. 2000. A search for reverse transcriptase-coding sequences reveals new non-LTR retrotransposons in the genome of Drosophila melanogaster. Genome Biol. 1: RESEARCH0012.
Besansky, N. J. 1990. Evolution of the T1 retroposon family in the Anopheles gambiae complex. Mol. Biol. Evol. 7:229-246.
Besansky, N. J. 1999. Complexities in the analysis of cryptic taxa within the genus Anopheles. Parassitologia 41:97-100.
Besansky, N. J., J. A. Bedell, and O. Mukabayire. 1994. Q: a new retrotransposon from the mosquito Anopheles gambiae. Insect Mol. Biol. 3:49-56.
Besansky, N. J., S. M. Paskewitz, D. M. Hamm, and F. H. Collins. 1992. Distinct families of site-specific retrotransposons occupy identical positions in the rRNA genes of Anopheles gambiae. Mol. Cell. Biol. 12:5102-5110.
Biessmann, H., K. Valgeirsdottir, A. Lofsky, C. Chin, B. Ginther, R. W. Levis, and M. L. Pardue. 1992. HeT-A, a transposable element specifically involved in "healing" broken chromosome ends in Drosophila melanogaster. Mol. Cell. Biol. 12:3910-3918.
Black, W. C. T., and G. C. Lanzaro. 2001. Distribution of genetic variation among chromosomal forms of Anopheles gambiae s.s: introgressive hybridization, adaptive inversions, or recent reproductive isolation? Insect Mol. Biol. 10:3-7.
Brosius, J. 1999. Genomes were forged by massive bombardments with retroelements and retrosequences. Genetica 107:209-238.
Chaboissier, M. C., D. Finnegan, and A. Bucheton. 2000. Retrotransposition of the I factor, a non-long terminal repeat retrotransposon of Drosophila, generates tandem repeats at the 3' end. Nucleic Acids Res. 28:2467-2472.
Chambeyron, S., A. Bucheton, and I. Busseau. 2002. Tandem UAA repeats at the 3'-end of the transcript are essential for the precise initiation of reverse transcription of the I factor in Drosophila melanogaster. J. Biol. Chem. 277:17877-17882.
Della Torre, A., C. Fanello, M. Akogbeto, J. Dossou-yovo, G. Favia, V. Petrarca, and M. Coluzzi. 2001. Molecular evidence of incipient speciation within Anopheles gambiae s.s. in West Africa. Insect Mol. Biol. 10:9-18.
Doolittle, W. F., and C. Sapienza. 1980. Selfish genes, the phenotype paradigm and genome evolution. Nature 284:601-603.
Eickbush, T. H., and H. S. Malik. 2002. Origins and evolution of retrotransposons. Pp. 1111-1144 in N. L. Craig, R. Craigie, M. Gellert, and A. M. Lambowitz, eds. Mobile DNA II. ASM Press, Washington, D. C.
Esnault, C., J. Maestre, and T. Heidmann. 2000. Human LINE retrotransposons generate processed pseudogenes. Nat. Genet. 24:363-367.
Finnegan, D. J. 1992. Transposable elements. Curr. Opin. Genet. Dev. 2:861-867.
Finnegan, D. J. 1997. Transposable elements: how non-LTR retrotransposons do it. Curr. Biol. 7:R245-R248.
Goodier, J. L., E. M. Ostertag, and H. H. Kazazian, Jr. 2000. Transduction of 3'-flanking sequences is common in L1 retrotransposition. Hum. Mol. Genet. 9:653-657.
Graber, J. H., C. R. Cantor, S. C. Mohr, and T. F. Smith. 1999. In silico detection of control signals: mRNA 3'-end-processing sequences in diverse species. Proc. Natl. Acad. Sci. USA 96:14055-14060.
Hill, S. R., S. S. Leung, N. L. Quercia, D. Vasiliauskas, J. Yu, I. Pasic, D. Leung, A. Tran, and P. Romans. 2001. Ikirara insertions reveal five new Anopheles gambiae transposable elements in islands of repetitious sequence. J. Mol. Evol. 52:215-231.
Holmes, S. E., M. F. Singer, and G. D. Swergold. 1992. Studies on p40, the leucine zipper motif-containing protein encoded by the first open reading frame of an active human LINE-1 transposable element. J. Biol. Chem. 267:19765-19768.
Holt, R. A., G. M. Subramanian, and A. Halpern, et al. (123 co-authors). 2002. The genome sequence of the malaria mosquito Anopheles gambiae. Science 298:129-149.
Jensen, S., L. Cavarec, O. Dhellin, and T. Heidmann. 1994. Retrotransposition of a marked Drosophila line-like I element in cells in culture. Nucleic Acids Res. 22:1484-1488.
Kidwell, M. G., and D. R. Lisch. 2001. Perspective: transposable elements, parasitic DNA, and genome evolution. Evolution 55:1-24.
Lander, E. S., L. M. Linton, and B. Birren, et al. (252 co-authors). 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.
Lovsin, N., F. Gubensek, and D. Kordis. 2001. Evolutionary dynamics in a novel L2 clade of non-LTR retrotransposons in Deuterostomia. Mol. Biol. Evol. 18:2213-2224.
Luan, D. D., M. H. Korman, J. L. Jakubczak, and T. H. Eickbush. 1993. Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: a mechanism for non-LTR retrotransposition. Cell 72:595-605.
MacDonald, C. C., and J. L. Redondo. 2002. Reexamining the polyadenylation signal: were we wrong about AAUAAA? Mol. Cell. Endocrinol. 190:1-8.
Makalowski, W. 2000. Genomic scrap yard: how genomes utilize all that junk. Gene 259:61-67.
Malik, H. S., W. D. Burke, and T. H. Eickbush. 1999. The age and evolution of non-LTR retrotransposable elements. Mol. Biol. Evol. 16:793-805.
Malik, H. S., and T. H. Eickbush. 2000. NeSL-1, an ancient lineage of site-specific non-LTR retrotransposons from Caenorhabditis elegans. Genetics 154:193-203.
Martin, S. L., and F. D. Bushman. 2001. Nucleic acid chaperone activity of the ORF1 protein from the mouse LINE-1 retrotransposon. Mol. Cell. Biol. 21:467-475.
Mathews, D. H., A. R. Banerjee, D. D. Luan, T. H. Eickbush, and D. H. Turner. 1997. Secondary structure model of the RNA recognized by the reverse transcriptase from the R2 retrotransposable element. RNA 3:1-16.
Moran, J. V., R. J. DeBerardinis, and H. H. Kazazian, Jr. 1999. Exon shuffling by L1 retrotransposition. Science 283:1530-1534.
Mouches, C., N. Bensaadi, and J. C. Salvado. 1992. Characterization of a LINE retroposon dispersed in the genome of three non-sibling Aedes mosquito species. Gene 120:183-190.
Pickeral, O. K., W. Makalowski, M. S. Boguski, and J. D. Boeke. 2000. Frequent human genomic DNA transduction driven by LINE-1 retrotransposition. Genome Res. 10:411-415.
Powell, J. R., V. Petrarca, A. della Torre, A. Caccone, and M. Coluzzi. 1999. Population structure, speciation, and introgression in the Anopheles gambiae complex. Parassitologia 41:101-113.
Roy-Engel, A. M., M. L. Carroll, E. Vogel, R. K. Garber, S. V. Nguyen, A. H. Salem, M. A. Batzer, and P. L. Deininger. 2001. Alu insertion polymorphisms for the study of human genomic diversity. Genetics 159:279-290.
Soifer, H., C. Higo, H. H. Kazazian, Jr., J. V. Moran, K. Mitani, and N. Kasahara. 2001. Stable integration of transgenes delivered by a retrotransposon-adenovirus hybrid vector. Hum. Gene Ther. 12:1417-1428.
Stoneking, M., J. J. Fontius, S. L. Clifford, H. Soodyall, S. S. Arcot, N. Saha, T. Jenkins, M. A. Tahir, P. L. Deininger, and M. A. Batzer. 1997. Alu insertion polymorphisms and human evolution: evidence for a larger population size in Africa. Genome Res. 7:1061-1071.
Swofford, D. L. 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods). Sinauer Associates, Sunderland, Mass.
Thompson, J. D., T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins. 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25:4876-4882.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Tu, Z., and J. J. Hill. 1999. MosquI, a novel family of mosquito retrotransposons distantly related to the Drosophila I factors, may consist of elements of more than one origin. Mol. Biol. Evol. 16:1675-1686.
Tu, Z., J. Isoe, and J. A. Guzova. 1998. Structural, genomic, and phylogenetic analysis of Lian, a novel family of non-LTR retrotransposons in the Yellow Fever mosquito, Aedes aegypti. Mol. Biol. Evol. 15:837-853.
Warren, A. M., M. A. Hughes, and J. M. Crampton. 1997. Zebedee: a novel copia-Ty1 family of transposable elements in the genome of the medically important mosquito Aedes aegypti. Mol. Gen. Genet. 254:505-513.
Wei, W., N. Gilbert, S. L. Ooi, J. F. Lawler, E. M. Ostertag, H. H. Kazazian, J. D. Boeke, and J. V. Moran. 2001. Human L1 retrotransposition: cis preference versus trans complementation. Mol. Cell. Biol. 21:1429-1439.