Departments of Genetics and Statistics, University of Georgia
Correspondence: E-mail: edog22{at}uga.edu.
Abstract
Key Words: C. elegans, LTR retrotransposon, gene evolution, genome evolution
Introduction
Genomic sequence analysis has proved to be a useful tool in efforts to understand the possible adaptive significance of transposable elements (TEs) in gene and genome evolution. One group of TEs, the retrotransposons, has been studied in this regard. Retrotransposons are the most abundant group of TEs in the human genome and have a life cycle analogous to that of infectious retroviruses (Boeke et al. 1985). Retrotransposon sequences are transcribed by host transcription complexes, and these transcripts are reverse transcribed by element-encoded reverse transcriptase (RT). As a consequence, retrotransposons contain many cis-regulatory components typical of eukaryotic genes, including promoter and enhancer sequences as well as termination and polyadenylation signals (fig. 1). The effects of these regulatory sequences are not always limited to the retroelements in which they are contained but may also influence the expression of adjacent genes (e.g., Kapitonov and Jurka 1999; Mager et al. 1999; Baust et al. 2000; Llorens and Marin 2001; Medstrand, Landry, and Mager 2001; Stokstad 2001; Jordan et al. 2003). In addition to regulatory effects, retrotransposons may also contribute to the coding regions of genes. For example, in a preliminary study of the human genome, Nekrutenko and Li (2001) discovered that about 4% of human genes have a retrotransposon component in the coding region. Thus, retrotransposons are a significant source of regulatory and coding region variation and a potentially important factor in gene evolution.
Methods
Information regarding the function, expression, and homologs of each gene was collected from various sources. For most genes, information on function and size was available from NCBI and Wormbase gene reports (Spring 2002 data releases). EST data were obtained through BLAST searches of the NCBI "est" database. Exon boundaries were based on reports in Wormbase and NCBI. Conserved domains were predicted with the NCBI Conserved Domain Database (CDD). C. briggsae homology data were obtained from Wormbase (WABA predictions, Kent and Zahler 2000) and directly from the Washington University C. briggsae BLAST server (http://genome.wustl.edu/projects/cbriggsae, Spring 2002 data). MacVector version 7.0 was used to annotate and collate gene information from all sources and to provide graphical representations of gene/retrotransposon association regions.
Statistical Analysis
The goal of the statistical analysis was to determine whether the distribution of TEs in the genome deviates from the random expectation and, in particular, whether TEs tend to lie near genes. We test two null hypotheses: (1) the location of TEs follows a uniform distribution in the nongenic genome (the term "nongenic genome" refers to the nontranscriptional regions upstream and downstream of genes), and (2) the location of TEs follows a uniform distribution throughout the entire genome.
To test the first hypothesis, we define windows of length 1,000 bp upstream and downstream of each gene. This window is defined to contain only nongenic genome. The window is shortened if the distance to the next gene is less than 1,000 bp. A TE is located in the window if its nearest end to the gene is located within the window. The following discussion will be in terms of a window on the 5' end of the gene, but identical arguments apply for the 3' window.
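The window rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the half-open coordinate convention, and the restriction to a forward-strand gene are hypothetical choices, not details taken from the analysis itself.

```python
WINDOW_BP = 1000  # window length stated in the text


def upstream_window(gene_start, prev_gene_end, window_bp=WINDOW_BP):
    """Return the half-open interval [lo, hi) of nongenic sequence
    upstream of a forward-strand gene (hypothetical convention).

    The window is shortened when the previous gene ends less than
    window_bp before this gene's 5' end.
    """
    lo = max(prev_gene_end, gene_start - window_bp)
    return lo, gene_start


def te_in_window(te_near_end, window):
    """A TE counts as inside the window if the coordinate of its end
    nearest to the gene falls within [lo, hi)."""
    lo, hi = window
    return lo <= te_near_end < hi


# Full-length window: the previous gene is far away.
full = upstream_window(gene_start=5000, prev_gene_end=1000)
# Shortened window: the previous gene ends only 400 bp upstream.
short = upstream_window(gene_start=5000, prev_gene_end=4600)
```

The 3' window would follow by mirroring the coordinates, as the text notes.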
Under the null hypothesis, the probability p that a particular TE is located in an upstream window is simply the length of nongenic genome within 1,000 bp of a 5' end divided by the total length of nongenic genome. Then, the probability that x out of N total TEs is located in upstream windows is given by the binomial distribution.
$$P(X = x) = \binom{N}{x} p^{x} (1 - p)^{N - x} \qquad (1)$$
The quantity p was calculated as
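$$p = \frac{\text{length of nongenic genome within 1,000 bp of a 5' end}}{\text{total length of nongenic genome}} \qquad (2)$$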
We also consider a window consisting of nongenic genome that is between 500 and 1,000 bp from the nearest 5' end. The new quantity p for this case is found by repeating the calculation in equation (2) with 500 bp instead of 1,000 bp and subtracting from the original p. For our calculations to remain conservative, a TE is not counted as being in the window if it is upstream of two genes. The calculations for the second null hypothesis (2) defined above are carried out in a similar manner, with the obvious modifications.
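As a rough illustration of the calculation described in this section, the following Python sketch computes p from a list of window lengths and evaluates binomial probabilities. The function names and the numbers in the example are hypothetical, and the use of a one-sided upper tail as the test statistic is an assumption; the text states only that the count of TEs in windows follows a binomial distribution.

```python
from math import comb


def window_probability(window_lengths_bp, total_nongenic_bp):
    """p: nongenic sequence inside the windows divided by the total
    nongenic sequence."""
    return sum(window_lengths_bp) / total_nongenic_bp


def binomial_pmf(x, n, p):
    """Probability that exactly x of n TEs fall in the windows."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_upper_tail(x, n, p):
    """Probability of x or more TEs in the windows (assumed one-sided test)."""
    return sum(binomial_pmf(k, n, p) for k in range(x, n + 1))


def ring_probability(p_1000, p_500):
    """p for the 500 to 1,000 bp window: the 1,000-bp value minus the
    500-bp value."""
    return p_1000 - p_500


# Hypothetical toy numbers, for illustration only.
p_toy = window_probability([1000, 400, 600], total_nongenic_bp=10_000)
tail_toy = binomial_upper_tail(1, 2, 0.5)
```

Summing exact binomial terms is adequate for N on the order of a hundred elements; a statistics library would be preferable for much larger N.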
Results
The retrotransposon data set was used to create an annotation file readable by the Wormbase genome browser (Stein et al. 2002). This file was used to visualize the location of retrotransposons, genes, and other genomic features within a given chromosomal region. Analysis of genomic sequence from a 5-kb window on either side of each retrotransposon resulted in the identification of 190 gene/retrotransposon associations (tables 1 and 2). Forty retrotransposon sequences were found to be associated with a single gene, and 75 were associated with genes both upstream and downstream of the TE. Only nine retrotransposon sequences were not located within 5 kb of any gene.
Associated Gene Function and Homology
Functional information for each gene associated with a retrotransposon was analyzed to confirm the validity of the genes. Several studies have addressed the quality of gene identification and prediction in C. elegans (e.g., Harrison, Echols, and Gerstein 2001; Reboul et al. 2001; Mounsey, Bauer, and Hope 2002). The consensus conclusion of these studies is that 80% to 90% of predicted C. elegans genes are "real" or functional, whereas the remainder are likely pseudogenes or false predictions. We find that 125 of the 190 genes associated with retrotransposon sequences have one or more identifiable functional domains or are members of established homolog families. In addition, about half (49%) of all associations involve genes having medium to high identity with C. briggsae homologs as defined by Wormbase (93 C. briggsae homologs/190 total associations). Pooling these findings, we conclude that at least 172 of the 190 genes (90.5%) found to be associated with retrotransposon sequences in our study have functional or phylogenetic support.
Some Cer Elements Are Within Genes
We discovered 40 genes containing a Cer retrotransposon component, meaning a retrotransposon was identified within predicted gene boundaries (hereafter an "internal association"). In some cases, a retrotransposon sequence lies within two genes, so 35 (of 124) retrotransposons are responsible for the 40 internal associations. Since genic regions represent approximately 52% of the C. elegans genome, this result is significantly lower than expected (chi-squared test, expect 64.5 Cer TEs, P < 4.288) if insertion sites are assumed to be random.
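The expected count quoted above follows directly from 124 × 0.52. The sketch below reproduces that arithmetic and a two-class chi-squared goodness-of-fit statistic. The exact form of the test is not described in the text, so the two-class setup (inside versus outside genes, 1 degree of freedom) and the function name are assumptions; the P value it returns need not match the one reported.

```python
from math import erfc, sqrt


def chi_square_two_classes(observed_in, total, frac_in):
    """Goodness-of-fit test for TEs inside vs. outside genes.

    Assumes two classes and 1 degree of freedom; with 1 df the
    chi-squared survival function is erfc(sqrt(stat / 2)).
    """
    expected_in = total * frac_in
    expected_out = total - expected_in
    observed_out = total - observed_in
    stat = ((observed_in - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)
    p_value = erfc(sqrt(stat / 2))
    return expected_in, stat, p_value


# 35 of 124 Cer sequences lie within genes; genic fraction is about 52%.
expected_in, stat, p_value = chi_square_two_classes(35, 124, 0.52)
print(f"expected in genes: {expected_in:.1f}")
print(f"chi-squared: {stat:.2f}, P = {p_value:.3g}")
```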
The frequency of solo LTRs (18), fragments (nine), and full-length elements (eight) in genes is consistent with the frequency observed for all associations. As with all gene/element associations in C. elegans, sense (21) and antisense (19) associations are equally abundant. There are more than three times as many Bel-like (27) as gypsy-like (eight) element sequences located within the boundaries of genes. This result contrasts with the approximately twofold excess of Bel-like element sequences in the C. elegans genome as a whole. Cer9 is one Bel-like element that accounts for nearly a quarter of all internal associations (table 4).
Discussion
A total of 124 Cer retrotransposon sequences (full-length elements, fragmented elements, and solo LTRs) account for 0.4% of the C. elegans genome. Searching a 5-kb window both upstream and downstream of each Cer element sequence resulted in the identification of 190 gene/retrotransposon associations. Interestingly, 79 (63%) LTR retrotransposons map within 1 kb of a gene. Within this group, we discovered that retrotransposons are overrepresented upstream of genes, specifically in the intergenic region 500 to 1,000 bp from genes. This is significant because most cis-regulatory sequences are believed to lie within 1 kb of the transcriptional start site of C. elegans genes (McGhee and Krause 1997). An additional 21.1% of all associations involved retrotransposon sequences located within introns, exons, or both.
Reports of TE content in humans indicate that more than 40% of the genome is composed of retroelement sequences (Li et al. 2001), and an estimated 4% of human protein-coding genes have been found to contain retrotransposon sequences (Nekrutenko and Li 2001). Additional studies suggest that the role of retrotransposon sequences in the regulation of human gene expression may also be significant. For example, it was recently estimated that approximately 24% of identified human promoter regions contain retrotransposon sequences (Jordan et al. 2003). Our results indicate that 190 of the 19,000 genes (1.0%) identified in the C. elegans genome (C. elegans Sequencing Consortium 1998; Reboul et al. 2001) are associated with retrotransposon sequences and that 28% (35/124) of all Cer element sequences are located within genes.
In a recent study of the distribution of retrotransposon sequences within the human genome, Medstrand et al. (2002) noted a significant decrease in the density of LTR retrotransposon sequences within 5 kb of genes. Moreover, those retrotransposon sequences located near human genes are relatively recent insertions and most often in an antisense configuration with respect to the adjacent gene. The authors interpret these results to suggest that most retrotransposon insertions proximal to human genes, and especially those in a sense configuration, are nonadaptive and selected against. In contrast to the pattern observed in humans, our results demonstrate that well over half of all retrotransposon sequences in the C. elegans genome (57.9%) are located in or within 1 kb of genes, with no bias against sense associations observed. At least two hypotheses may help account for these differences.
Protection from deletion or recombination may explain why TEs are close to genes in C. elegans. The relatively small size of the C. elegans genome has been attributed, in part, to a significantly higher rate of deletion than in humans and other animals (Kent and Zahler 2000; Robertson 2000). In addition, C. elegans is estimated to have up to a 1440-fold higher rate of genome rearrangement than humans and other mammals (Coghlan and Wolfe 2002). Recombination breakpoints in C. elegans are typically associated with repetitive sequences, including retrotransposon sequences (Coghlan and Wolfe 2002). Deletion or recombination events involving retrotransposon sequences in or near genes may have an adverse effect and thus be selected against. Such a scenario might help explain the clustering of retrotransposon sequences that are not otherwise deleterious in or around genes.
Another possible explanation of the abundance of retrotransposon sequences in or near C. elegans genes is that they are of adaptive benefit. Indeed, there is a growing body of evidence from a number of systems (Makalowski 2000; Medstrand, Landry, and Mager 2001; Nigumann et al. 2002) that retrotransposon sequences have contributed to adaptive changes in gene structure and regulation.
The central regions of C. elegans chromosomes are the general location of "housekeeping" genes and other essential genes displaying homology to genes even in distantly related species (C. elegans Sequencing Consortium 1998). In contrast, many nematode-specific genes are located along the chromosomal arms. Interestingly, C. elegans transposons and other repeats also tend to cluster on the chromosomal arms (Surzycki and Belknap 2000; Ganko, Fielman, and McDonald 2001). The chromosomal arms of C. elegans are regions of high insertional polymorphism, duplications, and intrachromosomal rearrangements (C. elegans Sequencing Consortium 1998). Insertions, duplications, chromosome rearrangements, and TEs may all have a role in the evolution of novel genes (Long 2001; Betrán and Long 2002). For these reasons, regions of the chromosomal arms of C. elegans might be viewed as an "evolutionary laboratory" where new genes are created and tested by natural selection. Low mobility species such as C. elegans may require a diverse group of specialized genes to successfully exploit their environment (Hodgkin 2001), and an ability to rapidly evolve new genes or new regulatory structures may be particularly important to these organisms. The fact that nearly all of the C. elegans genes that we have found to be in close association with retrotransposon sequences are located in the chromosome arms suggests that retrotransposon sequences may play a role in the evolution of new nematode genes. It will be interesting to determine if newly evolved genes in other species, including humans, show a preference for close association with retrotransposon sequences.
Literature Cited
Adams, M. D., S. E. Celniker, and R. A. Holt, et al. (195 co-authors). 2000. The genome sequence of Drosophila melanogaster. Science 287:2185-2195.
The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796-815.
Baust, C., W. Seifarth, H. Germaier, R. Hehlmann, and C. Leib-Mosch. 2000. HERV-K-T47D-Related long terminal repeats mediate polyadenylation of cellular transcripts. Genomics 66:98-103.
Betrán, E., and M. Long. 2002. Expansion of genome coding regions by acquisition of new genes. Genetica 115:65-80.
Boeke, J. D., D. J. Garfinkel, C. A. Styles, and G. R. Fink. 1985. Ty elements transpose through an RNA intermediate. Cell 40:491-500.
Brosius, J. 1999. Genomes were forged by massive bombardments with retroelements and retrosequences. Genetica 107:209-238.
C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282:2012-2018.
Charlesworth, B., P. Sniegowski, and W. Stephan. 1994. The evolutionary dynamics of repetitive DNA in eukaryotes. Nature 371:215-220.
Coghlan, A., and K. H. Wolfe. 2002. Fourfold faster rate of genome rearrangement in nematodes than in Drosophila. Genome Res. 12:857-867.
Flavell, R. B. 1986. Repetitive DNA and chromosome evolution in plants. Philos. Trans. R. Soc. Lond. B Biol. Sci. 312:227-242.
Ganko, E. W., K. T. Fielman, and J. F. McDonald. 2001. Evolutionary history of Cer elements and their impact on the C. elegans genome. Genome Res. 11:2066-2074.
Harrison, P. M., N. Echols, and M. B. Gerstein. 2001. Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome. Nucleic Acids Res. 29:818-830.
Hodgkin, J. 2001. What does a worm want with 20,000 genes? Genome Biol. 2:comment2008.
Hoskins, R. A., C. D. Smith, and J. W. Carlson, et al. (16 co-authors). 2002. Heterochromatic sequences in a Drosophila whole-genome shotgun assembly. Genome Biol. 3:RESEARCH00850085.
Jordan, I. K., I. B. Rogozin, G. V. Glazko, and E. V. Koonin. 2003. Origin of a substantial fraction of human regulatory sequences from transposable elements. Trends Genet. 19:68-72.
Kaminker, J. S., C. M. Bergman, and B. Kronmiller, et al. (12 co-authors). 2002. The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective. Genome Biol. 3:RESEARCH00840084.
Kapitonov, V. V., and J. Jurka. 1999. The long terminal repeat of an endogenous retrovirus induces alternative splicing and encodes an additional carboxy-terminal sequence in the human leptin receptor. J. Mol. Evol. 48:248-251.
Kent, W. J., and A. M. Zahler. 2000. Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment. Genome Res. 10:1115-1125.
Kidwell, M. G. 2002. Transposable elements and the evolution of genome size in eukaryotes. Genetica 115:49-63.
Kidwell, M. G., and D. R. Lisch. 2001. Perspective: transposable elements, parasitic DNA, and genome evolution. Int. J. Org. Evol. 55:1-24.
Kim, J. M., S. Vanguri, J. D. Boeke, A. Gabriel, and D. F. Voytas. 1998. Transposable elements and genome organization: a comprehensive survey of retrotransposons revealed by the complete Saccharomyces cerevisiae genome sequence. Genome Res. 8:464-478.
Li, W. H., Z. Gu, H. Wang, and A. Nekrutenko. 2001. Evolutionary analyses of the human genome. Nature 409:847-849.
Llorens, C., and I. Marin. 2001. A mammalian gene evolved from the integrase domain of an LTR retrotransposon. Mol. Biol. Evol. 18:1597-1600.
Long, M. 2001. Evolution of novel genes. Curr. Opin. Genet. Dev. 11:673-680.
Mager, D. L., D. G. Hunter, M. Schertzer, and J. D. Freeman. 1999. Endogenous retroviruses provide the primary polyadenylation signal for two new human genes (HHLA2 and HHLA3). Genomics 59:255-263.
Makalowski, W. 2000. Genomic scrap yard: how genomes utilize all that junk. Gene 259:61-67.
McDonald, J. F. 1993. Evolution and consequences of transposable elements. Curr. Opin. Genet. Dev. 3:855-864.
McDonald, J. F. 1995. Transposable elements: possible catalysts of organismic evolution. Trends Ecol. Evol. 10:123-126.
McGhee, J. D., and M. W. Krause. 1997. Transcription factors and transcriptional regulation. Pp. 147-184 in D. L. Riddle, T. Blumenthal, B. J. Meyer, and J. R. Priess, eds. C. elegans II. Cold Spring Harbor Laboratory Press, Plainview, N.Y.
Medstrand, P., J. R. Landry, and D. L. Mager. 2001. Long terminal repeats are used as alternative promoters for the endothelin B receptor and apolipoprotein C-I genes in humans. J. Biol. Chem. 276:1896-1903.
Medstrand, P., L. N. van de Lagemaat, and D. L. Mager. 2002. Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Res. 12:1483-1495.
Mounsey, A., P. Bauer, and I. A. Hope. 2002. Evidence suggesting that a fifth of annotated Caenorhabditis elegans genes may be pseudogenes. Genome Res. 12:770-775.
Nekrutenko, A., and W. H. Li. 2001. Transposable elements are found in a large number of human protein-coding genes. Trends Genet. 17:619-621.
Nigumann, P., K. Redik, K. Matlik, and M. Speek. 2002. Many human genes are transcribed from the antisense promoter of L1 retrotransposon. Genomics 79:628-634.
Orgel, L. E., and F. H. C. Crick. 1980. Selfish DNA: the ultimate parasite. Nature 284:604-607.
Pearce, S. R., G. Harrison, D. Li, J. Heslop-Harrison, A. Kumar, and A. J. Flavell. 1996. The Ty1-copia group retrotransposons in Vicia species: copy number, sequence heterogeneity and chromosomal localisation. Mol. Gen. Genet. 250:305-315.
Reboul, J., P. Vaglio, and N. Tzellas, et al. (20 co-authors). 2001. Open-reading-frame sequence tags (OSTs) support the existence of at least 17,300 genes in C. elegans. Nat. Genet. 27:332-336.
Robertson, H. M. 2000. The large srh family of chemoreceptor genes in Caenorhabditis nematodes reveals processes of genome evolution involving large duplications and deletions and intron gains and losses. Genome Res. 10:192-203.
Smit, A. F. 1999. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9:657-663.
Stein, L. D., C. Mungall, and S. Shu, et al. (11 co-authors). 2002. The Generic genome browser: a building block for a model organism system database. Genome Res. 12:1599-1610.
Stein, L., P. Sternberg, R. Durbin, J. Thierry-Mieg, and J. Spieth. 2001. WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 29:82-86.
Stokstad, E. 2001. Entomology: first light on genetic roots of Bt resistance. Science 293:778.
Surzycki, S. A., and W. R. Belknap. 2000. Repetitive-DNA elements are similarly distributed on Caenorhabditis elegans autosomes. Proc. Natl. Acad. Sci. USA 97:245-249.
Waterston, R. H., K. Lindblad-Toh, and E. Birney, et al. (222 co-authors). 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520-562.