The Relative Abundance of Dinucleotides in Transposable Elements in Five Species

Emmanuelle Lerat, Pierre Capy, and Christian Biémont

*Laboratoire Biométrie et Biologie Évolutive, UMR CNRS 5558, Université Lyon 1, 69622 Villeurbanne Cedex, France;
†Laboratoire Populations, Génétique et Évolution, UPR CNRS 9034, 91198 Gif/Yvette Cedex, France

Burge, Campbell, and Karlin (1992) observed that the relative frequencies of di- and trinucleotides characterize a genome, independent of its base composition and of the coding or noncoding capacity of the regions analyzed. Species thus differ with regard to this genomic signature, which is constant within a given genome and similar between related species (Gentles and Karlin 2001). The variation in the relative abundance of dinucleotides is interpreted as reflecting differences between species in the cellular machinery for replication and repair, which may select specific dinucleotides in the sequence (Campbell, Mrázek, and Karlin 1999). A tendency toward the suppression of CG is often observed and is interpreted as resulting from the action of methylation (Bird 1986). The dinucleotide pattern of the mitochondrial genome has also been shown to differ from that of the nuclear genome, the proposed explanation being that nuclear and mitochondrial genomes use independent DNA polymerase machinery and different methods of replication (Campbell, Mrázek, and Karlin 1999). We therefore wanted to find out whether transposable elements (TEs), which have been shown to have a greater AT content than their host genes in various species (Shields and Sharp 1989; Lerat, Capy, and Biémont 2002), have the same dinucleotide pattern as their host.

TEs are repeated sequences that are able to move from one position to another along chromosomes. They were first discovered in maize by Barbara McClintock in the 1950s (McClintock 1984) and seem to exist in all living organisms. They are divided into two main classes, according to the transposition intermediate they use (Capy et al. 1997, pp. 1–197). Class I consists of retrotransposons, which use an RNA intermediate and are subdivided into two subclasses according to whether or not they have long terminal repeats (LTRs) at their extremities: LTR retrotransposons and non-LTR retrotransposons, respectively. Class II consists of transposons, which use a DNA intermediate for transposition and code for a transposase. A third class consists of foldback elements and MITEs, whose transposition mechanism has not yet been elucidated.

The complete genomes of Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster, chromosomes 2 and 4 of Arabidopsis thaliana, and chromosomes 21 and 22 of Homo sapiens were downloaded from the Genomes OnLine Database site (wit.integratedgenomics.com/GOLD/) (Kyrpides 1999). Entire sequences of transposons, LTR retrotransposons, non-LTR retrotransposons, and class-III elements from C. elegans, D. melanogaster, H. sapiens, and A. thaliana were downloaded from GenBank. Other Arabidopsis TEs were obtained from the Arabidopsis transposable element database (soave.biol.mcgill.ca/clonebase/main.html). The positions of TEs in the sequenced genome of Saccharomyces were obtained from the transposable element resources site (www.public.iastate.edu/~voytas/resources/resources.html). The resulting TE data set consisted of 40 sequences from D. melanogaster, 50 from S. cerevisiae, 19 from C. elegans, 25 from H. sapiens, and 31 from A. thaliana. The TE sequences for each species were concatenated. Of the 25 TE sequences from H. sapiens, 10 were retroviruses (HERV-K, HERV-K-T47D, HERV-K101, HERV-KC4, HIV1, HIV2, HTLV1, HTLV2, HSRV, and v-oncogene), which are class-I elements and can be considered to belong to the LTR retrotransposon family.

We used the indices defined by Burge, Campbell, and Karlin (1992). For a dinucleotide XY, the index $\rho_{XY} = f_{XY}/(f_X f_Y)$ was computed for each sequence, where $f_X$ and $f_Y$ are the frequencies of bases X and Y, respectively, and $f_{XY}$ is the frequency of the dinucleotide XY. When the coding sequences of TEs and genes were used, the indices were calculated from single-stranded DNA only. For complete sequences, we took into account the antiparallel and complementary structure of double-stranded DNA (Burge, Campbell, and Karlin 1992). We thus computed $f^*_A = f^*_T = \frac{1}{2}(f_A + f_T)$ for base A and its associated T nucleotide in the double-stranded sequence and $f^*_G = f^*_C = \frac{1}{2}(f_G + f_C)$ for base G and its associated C nucleotide. Dinucleotide frequencies were averaged with those of their reverse complements; for example, the frequency of the GT dinucleotide was computed as $f^*_{GT} = \frac{1}{2}(f_{GT} + f_{AC})$. The indices $\rho^*_{XY} = f^*_{XY}/(f^*_X f^*_Y)$ were then estimated. According to Karlin and Burge (1995), the XY dinucleotide was considered to be underrepresented if $\rho^*_{XY} \le 0.78$ and overrepresented if $\rho^*_{XY} \ge 1.23$.
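The symmetrized index can be illustrated with a minimal Python sketch. It is not the code used for the analysis; the function names `symmetrized_rho` and `classify` are hypothetical, and the sketch assumes sequences containing only A, C, G, and T. Pooling the counts of a sequence with those of its reverse complement yields the starred frequencies directly, since every A on one strand is a T on the other and every GT is an AC.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T-only sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def symmetrized_rho(seq):
    """Return {dinucleotide: rho*} with counts pooled over both strands."""
    seq = seq.upper()
    base_counts = Counter()
    dinuc_counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        base_counts.update(strand)
        dinuc_counts.update(strand[i:i + 2] for i in range(len(strand) - 1))
    n_bases = sum(base_counts.values())
    n_dinucs = sum(dinuc_counts.values())
    rho = {}
    for x in BASES:
        for y in BASES:
            fx = base_counts[x] / n_bases
            fy = base_counts[y] / n_bases
            fxy = dinuc_counts[x + y] / n_dinucs
            # The index is undefined when one of the two bases is absent.
            rho[x + y] = fxy / (fx * fy) if fx and fy else float("nan")
    return rho


def classify(rho_value, low=0.78, high=1.23):
    """Apply the Karlin and Burge (1995) thresholds quoted in the text."""
    if rho_value <= low:
        return "under"
    if rho_value >= high:
        return "over"
    return "normal"
```

By construction, a dinucleotide and its reverse complement receive the same value, so only 10 of the 16 indices are distinct.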

The relative distance between two sequences, f and g, was calculated as the mean absolute difference between their $\rho^*_{ij}$ indices over the 16 dinucleotides ij: $\delta^*(f,g) = \frac{1}{16}\sum_{ij} |\rho^*_{ij}(f) - \rho^*_{ij}(g)|$ (Karlin and Ladunga 1994; Karlin and Mrázek 1997). Relative distances were computed for the genomic sequences and the concatenated TEs for all species, for the fragments of genomic sequences and complete TEs for all species, and for the host genes and coding parts of TEs for each species separately. The distance matrix obtained was analyzed using a principal coordinates analysis, a multivariate method that transforms distance matrices into Euclidean matrices before extracting the principal components (Gower 1966). This analysis makes it possible to visualize which sequences are neighbors in terms of their relative abundance of dinucleotides. These analyses were done using the ADE-4 package (Thioulouse et al. 1997).
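As an illustration of the distance defined above, the following Python sketch computes the distance between two profiles and a pairwise distance matrix. It assumes each profile is a dictionary mapping the 16 dinucleotides to their indices; the names `delta_distance` and `distance_matrix` are hypothetical, and the principal coordinates step performed in ADE-4 is not reproduced here.

```python
DINUCLEOTIDES = [x + y for x in "ACGT" for y in "ACGT"]


def delta_distance(rho_f, rho_g):
    """Mean absolute difference between two 16-value rho profiles."""
    total = sum(abs(rho_f[d] - rho_g[d]) for d in DINUCLEOTIDES)
    return total / len(DINUCLEOTIDES)


def distance_matrix(profiles):
    """Return (names, matrix) of pairwise distances for {name: profile}."""
    names = list(profiles)
    matrix = [
        [delta_distance(profiles[a], profiles[b]) for b in names]
        for a in names
    ]
    return names, matrix


# Toy profiles: one flat, one with CG and TA depleted.
flat = {d: 1.0 for d in DINUCLEOTIDES}
depleted = dict(flat, CG=0.2, TA=0.6)
names, matrix = distance_matrix({"flat": flat, "depleted": depleted})
```

The resulting matrix is symmetric with a zero diagonal, which is the form of input a principal coordinates analysis expects.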

The relative abundances of dinucleotides in TE and genomic sequences were calculated for the five species listed previously (detailed data available upon request). Whatever the species, the dinucleotide TA appeared to be underrepresented in both genomes and TEs, except in the yeast retrotransposons. The dinucleotide CG was underrepresented in both genomes and TEs in A. thaliana and H. sapiens and in the LTR retrotransposons Ty1, Ty4, and Ty5 in Saccharomyces. In the Caenorhabditis and Drosophila genomes, AA/TT was overrepresented. For a given species, the TE and genomic sequences displayed the same global pattern of relative dinucleotide abundance, as revealed by the positive correlation coefficients for the relative abundance of dinucleotides between TEs and host genomes (r = 0.98, P < 0.05 for Arabidopsis; r = 0.93, P < 0.05 for Caenorhabditis; r = 0.94, P < 0.05 for Drosophila; r = 0.87, P < 0.05 for H. sapiens). For Saccharomyces, the coefficient of correlation between the genome and TEs was not significantly different from zero (r = 0.54, P = 0.40).
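A correlation of this kind can be sketched in Python as a Pearson coefficient between two index profiles. This is an illustration only: the text does not state which correlation coefficient or significance test was used, so the choice of Pearson's r, the use of all 16 values, and the names `pearson_r` and `profile_correlation` are assumptions. No P value is computed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def profile_correlation(rho_genome, rho_te):
    """Correlate two {dinucleotide: rho} profiles over their shared keys."""
    keys = sorted(set(rho_genome) & set(rho_te))
    return pearson_r([rho_genome[k] for k in keys],
                     [rho_te[k] for k in keys])
```

Because symmetrized indices are identical for a dinucleotide and its reverse complement, the 16 values are not independent, which matters when assessing significance.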

To check for a codon signature in coding regions, we calculated the relative abundance of dinucleotides according to their position in codons along the single-stranded DNA (data available upon request). The strong positive correlation detected at position 1–2 of codons between genes and TEs for each species (r = 0.93, P < 0.05 for Arabidopsis; r = 0.90, P < 0.05 for Caenorhabditis; r = 0.70, P < 0.05 for Drosophila; r = 0.77, P < 0.05 for human; r = 0.91, P < 0.05 for Saccharomyces) suggests that there were only a few differences between TE and gene sequences in the patterns of relative dinucleotide abundance. The correlation was also positive at position 2–3 for Arabidopsis (r = 0.88, P < 0.05), for Caenorhabditis (r = 0.64, P < 0.05), for human (r = 0.80, P < 0.05), and for Saccharomyces (r = 0.64, P < 0.05) but was not statistically different from zero in D. melanogaster (r = 0.17, P = 0.40). In D. melanogaster and S. cerevisiae, the relative abundance of dinucleotides at position 3–1 showed no significant correlation (r = 0.40, P = 0.40; r = 0.50, P = 0.40 for Drosophila and Saccharomyces, respectively), in contrast to what was found in the other species (r = 0.87, P < 0.05 for Arabidopsis; r = 0.77, P < 0.05 for Caenorhabditis; r = 0.90, P < 0.05 for human). The dinucleotide TA was strongly underrepresented at all positions in both genes and TEs in all the species, except Saccharomyces, where TA was underrepresented only at position 1–2 of the codons. TT and TC were strongly overrepresented, and CG and GT were underrepresented at position 1–2 in all the data sets. The TG and CA dinucleotides were well represented at positions 2–3 and 3–1: $\rho_{TG}$ and $\rho_{CA}$ were often greater than 1 and sometimes reached values indicative of overrepresentation ($\rho > 1.23$).
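The position-specific calculation can be sketched in Python as follows. The text does not specify how the base frequencies in the denominator were obtained, so this sketch assumes site-specific frequencies (the frequency of X at the first site of the pair and of Y at the second); the function name `codon_position_rho` is hypothetical, and the input is assumed to be an in-frame coding sequence read on a single strand.

```python
from collections import Counter


def codon_position_rho(cds):
    """Return {"1-2" | "2-3" | "3-1": {dinucleotide: rho}} for an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    pairs = {
        "1-2": [(c[0], c[1]) for c in codons],
        "2-3": [(c[1], c[2]) for c in codons],
        # Position 3-1 spans the boundary between consecutive codons.
        "3-1": [(a[2], b[0]) for a, b in zip(codons, codons[1:])],
    }
    result = {}
    for label, pair_list in pairs.items():
        n = len(pair_list)
        first = Counter(x for x, _ in pair_list)
        second = Counter(y for _, y in pair_list)
        joint = Counter(pair_list)
        # Dinucleotides whose bases never occur at these sites are skipped.
        result[label] = {
            x + y: (joint[(x, y)] / n) / ((first[x] / n) * (second[y] / n))
            for x in "ACGT" for y in "ACGT"
            if first[x] and second[y]
        }
    return result
```

Computing the three positions separately keeps constraints imposed by codon usage (positions 1–2 and 2–3) apart from those acting across codon boundaries (position 3–1).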

Figure 1 shows the projection of TEs and genomes onto the plane defined by the first two axes of a principal coordinates analysis of the distance matrix between the dinucleotide relative abundance indices of genomic and TE sequences. TE and genomic sequences from the same species were close, except for Saccharomyces, which presented no correlation between TE and genomic sequences for dinucleotide relative abundance. In this analysis, we compared TE sequences with genomic sequences that were themselves likely to include TEs, and we therefore carried out a more detailed principal coordinates analysis on complete TE sequences and on TE-free genomic fragments. To do this, genomic sequences were broken down into fragments of 9,000 bp, which was roughly equivalent to the mean length of the complete TEs. For each species, 100 fragments were randomly selected, and a BLASTN analysis (Altschul et al. 1997) was done to compare the genomic fragments with the TE sequences and to eliminate the genomic fragments that included TEs. In this way, we obtained a total of 459 TE-free genomic fragments and 165 complete TE sequences for the five species. The distances between the indices of relative dinucleotide abundance were then computed. The relative abundances of dinucleotides in the genomic fragments were nearly the same as the values obtained for the overall genomic sequences. With the exception of Saccharomyces, TE sequences and genomic fragments from a given species were found to be clustered (figure available upon request).
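The fragment sampling step can be sketched in Python as below. The text does not say whether the fragments overlapped or how a short final fragment was handled, so this sketch assumes consecutive non-overlapping windows and drops an incomplete tail. The function names and the `hit_indices` argument are hypothetical; the BLASTN comparison itself is run outside the sketch and is represented only by the set of fragment indices it would flag.

```python
import random


def fragment(sequence, size=9000):
    """Split a sequence into consecutive non-overlapping windows of `size`."""
    return [sequence[i:i + size]
            for i in range(0, len(sequence) - size + 1, size)]


def sample_fragments(fragments, k=100, seed=0):
    """Randomly pick up to k fragments; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(fragments, min(k, len(fragments)))


def remove_te_hits(fragments, hit_indices):
    """Drop fragments whose index was flagged as matching a TE."""
    return [f for i, f in enumerate(fragments) if i not in hit_indices]
```

With 100 fragments drawn per species and five species, at most 500 fragments enter the comparison; the 459 retained in the text are those left after removal of fragments matching TEs.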



Fig. 1. Plot of the first two axes of the principal coordinates analysis of dinucleotide distances of genomes and TEs. TE-At = TEs of A. thaliana, TE-Ce = TEs of C. elegans, TE-Sc = TEs of S. cerevisiae, TE-Dm = TEs of D. melanogaster, TE-Hs = TEs of H. sapiens, At = genome of A. thaliana, Ce = genome of C. elegans, Sc = genome of S. cerevisiae, Dm = genome of D. melanogaster, Hs = genome of H. sapiens.

 
Figure 2 shows the plot of the dinucleotide relative abundance distances between genes and coding parts of TEs for each species separately. Coding regions of the TEs and host genes appeared to be located together in Caenorhabditis and Arabidopsis. In H. sapiens, some of the TEs were located with the host genes, whereas the rest, corresponding to retrovirus sequences, formed a distinct group. In Drosophila and Saccharomyces, the TEs were not located with host genes. In Drosophila, the TEs furthest from the host genes corresponded to LTR retrotransposons with an env gene, i.e., retrovirus-like elements (Tirant, 297, ZAM and, to a lesser extent, 17.6, gypsy, idefix, and nomad).



Fig. 2. Plot of the first two axes of five separate principal coordinates analyses of dinucleotide distances of host genes and coding parts of TEs, one for each species analyzed individually. Black triangles correspond to coding parts of TEs and white squares to host genes. Circled black triangles correspond to the coding parts of LTR retrotransposons with an env gene in D. melanogaster and to the coding parts of human retroviruses.

 
In the five species analyzed, A. thaliana, C. elegans, S. cerevisiae, D. melanogaster, and H. sapiens, TEs appear to display a pattern of relative dinucleotide abundance similar to that of their host genome. In all our analyses, we found that the TA dinucleotide was underrepresented in both genomes and TEs. Such underrepresentation of TA, which seems to be a general feature, has been attributed to (1) the avoidance of inappropriate termination codons TAA or TAG in coding sequences, (2) selection for mRNA stability by avoiding UpA, which is susceptible to RNase activity (Beutler et al. 1989), or (3) the avoidance of having too many transcription signals (Burge, Campbell, and Karlin 1992). We also observed CG suppression in both genomes and TEs in Arabidopsis and human. Such global CG suppression is believed to reduce the stacking energies of DNA, thus facilitating replication and transcription (Karlin and Burge 1995). The fact that no CG suppression was observed in C. elegans, S. cerevisiae, and D. melanogaster suggests, however, that this explanation is far from universally applicable. We show here that CG suppression, which has already been reported in small eukaryotic viruses (Karlin, Doerfler, and Cardon 1994), also exists in the elements Ty1, Ty4, and Ty5 of Saccharomyces, in many LTR retrotransposons of Arabidopsis, and in all the LTR retrotransposons of H. sapiens. In Drosophila, however, LTR retrotransposons with an env gene do not exhibit this underrepresentation of CG. Taken together, these findings suggest that CG suppression does not affect all kinds of transposable elements and is not related to the size of the TE sequence.

Multivariate analysis showed that the retroviruses of H. sapiens and the LTR retrotransposons with env genes of Drosophila were very distant from their host genes. This specific grouping of the coding parts of retrovirus-like elements and of retroviruses relative to the host genes was not found when entire sequences were used, suggesting that there are differences in the transcription mechanisms for the coding parts of these elements. The coding parts of HERV (human endogenous retrovirus) were also located with the other retroviruses, although such endogenous retroviruses are not infectious because of deletions or the presence of stop codons in their coding parts (Bock and Stoye 2000; Tristen 2000). It has been shown, however, that the HERV-K element can theoretically be trans-complemented and then become infectious (Bock and Stoye 2000). If the large dinucleotide relative abundance distances observed between host genes and retroviruses and some LTR retrotransposon genes are an indication of their infectivity, then we can expect the Drosophila elements 297, Tirant, 17.6, and idefix to be infectious or to have been infectious in the recent past. Infectious capacity has been clearly demonstrated for gypsy (Kim et al. 1994), but the other five elements are only suspected of being retroviruses (Dessat et al. 1999; Canizares et al. 2000). Experimental evidence is therefore required to test the theoretical expectation of the present analysis.

Footnotes

Wolfgang Stephan, Reviewing Editor

Keywords: transposable elements, retrovirus, dinucleotide abundance

Address for correspondence and reprints: Christian Biémont, Laboratoire de Biométrie et Biologie Évolutive, UMR CNRS 5558, Université Lyon 1, 69622 Villeurbanne Cedex, France. biemont@biomserv.univ-lyon1.fr

References

    Altschul S. F., T. L. Madden, A. A. Schaffer, J. H. Zhang, Z. Zhang, W. Miller, D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

    Beutler E., T. Gelbart, J. H. Han, J. A. Koziol, B. Beutler. 1989. Evolution of the genome and the genetic code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage. Proc. Natl. Acad. Sci. USA 86:192-196.

    Bird A. P. 1986. CpG-rich islands and the function of DNA methylation. Nature 321:209-213.

    Bock M., J. P. Stoye. 2000. Endogenous retroviruses and the human germline. Curr. Opin. Genet. Dev. 10:651-655.

    Burge C., A. M. Campbell, S. Karlin. 1992. Over- and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. USA 89:1358-1362.

    Campbell A., J. Mrázek, S. Karlin. 1999. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc. Natl. Acad. Sci. USA 96:9184-9189.

    Canizares J., M. Grau, N. Paricio, M. D. Molto. 2000. Tirant is a new member of the gypsy family of retrotransposons in Drosophila melanogaster. Genome 43:9-14.

    Capy P., C. Bazin, D. Higuet, T. Langin. 1997. Dynamics and evolution of transposable elements. R. G. Landes Company, Austin, Tex.

    Dessat S., C. Conte, P. Dimitri, V. Calco, B. Dastugue, C. Vaury. 1999. Mobilization of two retroelements, ZAM and Idefix, in a novel instable line of Drosophila melanogaster. Mol. Biol. Evol. 16:54-66.

    Gentles A. J., S. Karlin. 2001. Genome-scale compositional comparisons in eukaryotes. Genome Res. 11:540-546.

    Gower J. C. 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53:325-338.

    Karlin S., C. Burge. 1995. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 11:283-290.

    Karlin S., W. Doerfler, L. R. Cardon. 1994. Why is CpG suppressed in the genomes of virtually all small eukaryotic viruses but not in those of large eukaryotic viruses? J. Virol. 68:2889-2897.

    Karlin S., I. Ladunga. 1994. Comparisons of eukaryotic genomic sequences. Proc. Natl. Acad. Sci. USA 91:12832-12836.

    Karlin S., J. Mrázek. 1997. Compositional differences within and between eukaryotic genomes. Proc. Natl. Acad. Sci. USA 94:10227-10232.

    Kim A., C. Terzian, P. Santamaria, A. Pélisson, N. Prud'homme, A. Bucheton. 1994. Retroviruses in invertebrates: the gypsy retrotransposon is apparently an infectious retrovirus of Drosophila melanogaster. Proc. Natl. Acad. Sci. USA 91:1285-1289.

    Kyrpides N. 1999. Genomes OnLine Database (GOLD): a monitor of complete and ongoing genome projects world wide. Bioinformatics 15:773-774.

    Lerat E., P. Capy, C. Biémont. 2002. Codon usage by transposable elements and host genes in five species. J. Mol. Evol. (in press).

    McClintock B. 1984. The significance of responses of the genome to challenge. Science 226:792-801.

    Shields D. C., P. M. Sharp. 1989. Evidence that mutation patterns vary among Drosophila transposable elements. J. Mol. Biol. 207:843-846.

    Thioulouse J., D. Chessel, S. Dolédec, J. M. Olivier. 1997. ADE-4: a multivariate analysis and graphical display software. Stat. Comput. 7:75-83.

    Tristen M. 2000. Identification and characterization of novel human endogenous retrovirus families by phylogenetic screening of the human genome mapping project database. J. Virol. 74:3715-3730.

Accepted for publication November 21, 2000.