*Laboratoire Biométrie et Biologie Évolutive, UMR CNRS 5558, Université Lyon 1, 69622 Villeurbanne Cedex, France;
Laboratoire Populations, Génétique et Évolution, UPR CNRS 9034, 91198 Gif/Yvette Cedex, France
Burge, Campbell, and Karlin (1992) observed that the relative frequencies of di- and trinucleotides characterize a genome, independent of its base composition and of the coding or noncoding capacity of the regions analyzed. Species thus differ with regard to this genomic signature, which is constant within a given genome and is similar between related species (Gentles and Karlin 2001). The variation in the relative abundance of dinucleotides is interpreted as reflecting differences between species in the cellular machinery for replication and repair, which may select specific dinucleotides in the sequence (Campbell, Mrázek, and Karlin 1999). A tendency toward the suppression of CG is often observed and is interpreted as resulting from methylation (Bird 1986). The dinucleotide pattern of the mitochondrial genome has also been shown to differ from that of the nuclear genome, possibly because nuclear and mitochondrial genomes use independent DNA polymerase machinery and different methods of replication (Campbell, Mrázek, and Karlin 1999). We therefore wanted to find out whether transposable elements (TEs), which have been shown to have a greater AT content than their host genes in various species (Shields and Sharp 1989; Lerat, Capy, and Biémont 2002), have the same dinucleotide pattern as their host.
TEs are repeated sequences that are able to move from one position to another along chromosomes. They were first discovered in maize by Barbara McClintock (1984) in the 1950s and seem to exist in all living organisms. They are divided into two main classes, according to the transposition intermediate they use (Capy et al. 1997, pp. 1197). Class I consists of retrotransposons, which use an RNA intermediate and are subdivided into two subclasses according to whether or not they have long terminal repeats (LTRs) at their extremities: LTR retrotransposons and non-LTR retrotransposons, respectively. Class II consists of transposons, which use a DNA intermediate for transposition and code for a transposase. A third class consists of foldback elements and MITEs, the transposition mechanism of which has not yet been elucidated.
The complete genomes of Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster, chromosomes 2 and 4 of Arabidopsis thaliana, and chromosomes 21 and 22 of Homo sapiens were downloaded from the Genomes OnLine Database site (wit.integratedgenomics.com/GOLD/) (Kyrpides 1999). Entire sequences of transposons, LTR retrotransposons, non-LTR retrotransposons, and class-III elements from C. elegans, D. melanogaster, H. sapiens, and A. thaliana were downloaded from GenBank. Other Arabidopsis TEs were obtained from the Arabidopsis transposable element database (soave.biol.mcgill.ca/clonebase/main.html). The positions of TEs in the sequenced genome of Saccharomyces were obtained from the transposable element resources site (www.public.iastate.edu/~voytas/resources/resources.html). The resulting TE data set consisted of 40 sequences from D. melanogaster, 50 from S. cerevisiae, 19 from C. elegans, 25 from H. sapiens, and 31 from A. thaliana. The TE sequences for each species were concatenated. Of the 25 TE sequences from H. sapiens, 10 were retroviruses (HERV-K, HERV-K-T47D, HERV-K101, HERV-KC4, HIV1, HIV2, HTLV1, HTLV2, HSRV, and v-oncogene), which are class-I elements and can be considered to belong to the LTR retrotransposon family.
We used the indices defined by Burge, Campbell, and Karlin (1992). For a dinucleotide XY, the indices $\rho_{XY} = f_{XY}/(f_X f_Y)$ were computed for each sequence, where $f_X$ and $f_Y$ are the frequencies of bases X and Y, respectively, and $f_{XY}$ is the frequency of the dinucleotide XY. When the coding sequences of TEs and genes were used, the indices were calculated from single-stranded DNA only. For complete sequences, we took into account the antiparallel and complementary structure of double-stranded DNA (Burge, Campbell, and Karlin 1992). We thus computed $f^*_A = f^*_T = \frac{1}{2}(f_A + f_T)$ for base A and its associated T nucleotide in the double-stranded sequence and $f^*_G = f^*_C = \frac{1}{2}(f_G + f_C)$ for base G and its associated C nucleotide. The frequency of the GT dinucleotide was computed as $f^*_{GT} = \frac{1}{2}(f_{GT} + f_{AC})$, and the indices $\rho^*_{XY} = f^*_{XY}/(f^*_X f^*_Y)$ were estimated. According to Karlin and Burge (1995), the XY dinucleotide was considered to be underrepresented if $\rho^*_{XY} \le 0.78$ and overrepresented if $\rho^*_{XY} \ge 1.23$.
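As an illustration of the indices just defined, the following is a minimal Python sketch rather than the software used in the study. The function names (`dinucleotide_rho`, `classify`) are hypothetical, the sketch assumes that every base occurs at least once in the sequence, and dinucleotides containing ambiguous symbols such as N are simply skipped.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word):
    """Return the reverse complement of a short DNA word, e.g. GT -> AC."""
    return "".join(COMPLEMENT[b] for b in reversed(word))


def dinucleotide_rho(seq, symmetrize=True):
    """Relative abundance f_XY / (f_X * f_Y) for the 16 dinucleotides.

    With symmetrize=True the base and dinucleotide frequencies are averaged
    with those of the complementary strand (double-stranded indices);
    with symmetrize=False only the given strand is used.
    """
    seq = seq.upper()
    base_counts = {b: seq.count(b) for b in BASES}
    pair_counts = {x + y: 0 for x, y in product(BASES, repeat=2)}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in pair_counts:  # skips pairs containing N or other symbols
            pair_counts[pair] += 1
    n_bases = sum(base_counts.values())
    n_pairs = sum(pair_counts.values())
    f1 = {b: c / n_bases for b, c in base_counts.items()}
    f2 = {p: c / n_pairs for p, c in pair_counts.items()}
    if symmetrize:
        f1 = {b: 0.5 * (f1[b] + f1[COMPLEMENT[b]]) for b in BASES}
        f2 = {p: 0.5 * (f2[p] + f2[reverse_complement(p)]) for p in f2}
    return {p: f2[p] / (f1[p[0]] * f1[p[1]]) for p in f2}


def classify(rho, low=0.78, high=1.23):
    """Label an index using the 0.78 and 1.23 thresholds quoted in the text."""
    if rho <= low:
        return "under"
    if rho >= high:
        return "over"
    return "normal"
```

By construction, the symmetrized indices of a dinucleotide and of its reverse complement (for example GT and AC) are equal, so only 10 of the 16 values are distinct in the double-stranded case.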
The relative distance between two sequences, f and g, was calculated as the mean absolute difference between their $\rho^*_{ij}$ indices over the 16 dinucleotides ij: $\delta^*(f,g) = \frac{1}{16} \sum_{ij} |\rho^*_{ij}(f) - \rho^*_{ij}(g)|$ (Karlin and Ladunga 1994; Karlin and Mrázek 1997). Relative distances were computed for the genomic sequences and the concatenated TEs for all species, the fragments of genomic sequences and complete TEs for all species, and the host genes and coding parts of TEs for each species separately. The distance matrix obtained was analyzed using a principal coordinates analysis, a multivariate analysis that transforms distance matrices into euclidean matrices before extracting the principal components (Gower 1966). This analysis makes it possible to visualize neighboring sequences in terms of their relative abundance of dinucleotides. These analyses were done using the ADE-4 package (Thioulouse et al. 1997).
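The distance and the ordination described above can be sketched as follows. This is an illustrative Python and NumPy sketch, not the ADE-4 implementation: the function names are hypothetical, the ordination is the classical double-centering form of principal coordinates analysis, and negative eigenvalues are clipped to zero, which may differ from the correction ADE-4 applies to make a distance matrix euclidean.

```python
import numpy as np


def relative_distance(rho_f, rho_g):
    """Mean absolute difference between two dinucleotide index profiles.

    Each profile is a dict mapping dinucleotides to their index; with the
    full set of 16 dinucleotides the divisor is 16, as in the text.
    """
    keys = sorted(rho_f)
    return sum(abs(rho_f[k] - rho_g[k]) for k in keys) / len(keys)


def principal_coordinates(dist, n_axes=2):
    """Classical principal coordinates analysis of a distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the matrix of -1/2 squared distances.
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    # Keep the axes with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_axes]
    # Assumption of this sketch: negative eigenvalues are set to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale
```

Each row of the returned array gives the coordinates of one sequence on the retained axes, so plotting the first two columns gives a projection of the kind described for the genomic and TE sequences.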
The relative abundances of dinucleotides in TE and genomic sequences were calculated for the five species listed previously (detailed data available upon request). In all species, the dinucleotide TA was underrepresented in both genomes and TEs, except in the yeast retrotransposons. The dinucleotide CG was underrepresented in both genomes and TEs in A. thaliana and H. sapiens and in the LTR retrotransposons Ty1, Ty4, and Ty5 in Saccharomyces. In the Caenorhabditis and Drosophila genomes, AA/TT was overrepresented. For a given species, the TE and genomic sequences displayed the same global pattern of relative dinucleotide abundance, as revealed by the positive correlation coefficients for the relative abundance of dinucleotides between TEs and host genomes (r = 0.98, P < 0.05 for Arabidopsis; r = 0.93, P < 0.05 for Caenorhabditis; r = 0.94, P < 0.05 for Drosophila; r = 0.87, P < 0.05 for H. sapiens). For Saccharomyces, the coefficient of correlation between the genome and TEs was not different from zero (r = 0.54, P = 0.40).
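A correlation between a TE profile and a genome profile of this kind can be sketched as below. The text does not state which coefficient was used, so the Pearson coefficient here is an assumption, and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def profile_correlation(rho_te, rho_genome):
    """Correlate two dinucleotide profiles (dicts keyed by dinucleotide)."""
    keys = sorted(rho_te)
    return pearson_r([rho_te[k] for k in keys], [rho_genome[k] for k in keys])
```

Both profiles are assumed to share the same dinucleotide keys; the significance test reported in the text is not reproduced here.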
To check for a codon signature in coding regions, we calculated the relative abundance of dinucleotides according to their position in codons along the single-stranded DNA (data available upon request). The strong positive correlation detected at codon position 1-2 (first and second bases of a codon) between genes and TEs for each species (r = 0.93, P < 0.05 for Arabidopsis; r = 0.90, P < 0.05 for Caenorhabditis; r = 0.70, P < 0.05 for Drosophila; r = 0.77, P < 0.05 for human; r = 0.91, P < 0.05 for Saccharomyces) suggests that there were only a few differences between TE and gene sequences in the patterns of relative dinucleotide abundance. The correlation was also positive at position 2-3 for Arabidopsis (r = 0.88, P < 0.05), for Caenorhabditis (r = 0.64, P < 0.05), for human (r = 0.80, P < 0.05), and for Saccharomyces (r = 0.64, P < 0.05) but was not statistically different from zero in D. melanogaster (r = 0.17, P = 0.40). At position 3-1 (third base of a codon and first base of the next), the relative abundance of dinucleotides was not correlated between genes and TEs in D. melanogaster and S. cerevisiae (r = 0.40, P = 0.40; r = 0.50, P = 0.40 for Drosophila and Saccharomyces, respectively), in contrast to the other species (r = 0.87, P < 0.05 for Arabidopsis; r = 0.77, P < 0.05 for Caenorhabditis; r = 0.90, P < 0.05 for human). The dinucleotide TA was strongly underrepresented at all positions in both genes and TEs in all the species, except Saccharomyces, where TA was underrepresented only at position 1-2 of the codons. TT and TC were strongly overrepresented, and CG and GT were underrepresented at position 1-2 in all the data sets. The TG and CA dinucleotides were well represented at positions 2-3 and 3-1: $\rho_{TG}$ and $\rho_{CA}$ were often greater than 1 and sometimes reached values indicative of overrepresentation ($\rho > 1.23$).
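A position-specific index of this kind can be sketched in Python as follows, for the three pairs of adjacent codon positions (first and second bases, second and third bases, and the third base followed by the first base of the next codon). The text does not give the normalization used, so dividing by the base frequencies observed at the two positions concerned is an assumption of this sketch, as are the function name and the `"1-2"`, `"2-3"`, `"3-1"` labels. The coding sequence is assumed to start in frame.

```python
from collections import Counter

# Offsets of the two bases within a codon; offset 3 is the first base
# of the following codon.
OFFSETS = {"1-2": (0, 1), "2-3": (1, 2), "3-1": (2, 3)}


def codon_position_rho(cds, position):
    """Dinucleotide relative abundance at one pair of codon positions."""
    cds = cds.upper()
    cds = cds[: len(cds) - len(cds) % 3]  # drop an incomplete final codon
    i, j = OFFSETS[position]
    pairs = []
    for start in range(0, len(cds), 3):
        if start + j < len(cds):  # the last codon has no following codon
            x, y = cds[start + i], cds[start + j]
            if x in "ACGT" and y in "ACGT":
                pairs.append((x, y))
    n = len(pairs)
    first = Counter(x for x, _ in pairs)
    second = Counter(y for _, y in pairs)
    joint = Counter(pairs)
    # f_XY / (f_X * f_Y), with all frequencies taken at these positions only.
    return {
        x + y: joint[(x, y)] * n / (first[x] * second[y])
        for x in first
        for y in second
    }
```

Only dinucleotides whose two bases are both observed at the relevant positions appear in the result; for real coding sequences this is normally all 16.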
Figure 1 shows the projection of TEs and genomes onto the plane defined by the first two axes of a principal coordinates analysis of the distance matrix between the dinucleotide relative abundance indices of genomic and TE sequences. TE and genomic sequences from the same species were close, except for Saccharomyces, which showed no correlation between TE and genomic sequences for dinucleotide relative abundance. In this analysis, we compared TE sequences with genomic sequences that are likely to include TEs; we therefore carried out a more detailed principal coordinates analysis on complete TE sequences and on TE-free genomic fragments. To do this, genomic sequences were broken down into fragments of 9,000 bp, which is roughly the mean length of the complete TEs. For each species, 100 fragments were randomly selected, and a BLASTN analysis (Altschul et al. 1997) was done to compare the genomic fragments with the TE sequences, allowing us to eliminate the genomic fragments that included TEs. In this way, we obtained a total of 459 TE-free genomic fragments and 165 complete TE sequences for the five species. The distances between the indices of relative dinucleotide abundance were then computed. The relative abundances of dinucleotides in the genomic fragments were nearly the same as the values obtained for the overall genomic sequences. With the exception of Saccharomyces, TE sequences and genomic fragments from a given species were found to be clustered (figure available upon request).
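The fragment sampling step can be sketched in Python as follows. This is only an outline under stated assumptions: fragments are taken as consecutive non-overlapping windows, and the BLASTN comparison is replaced by a caller-supplied predicate (`contains_te`, a hypothetical stand-in), since the search itself is run with an external program.

```python
import random


def sample_fragments(genome, size=9000, n=100, seed=None):
    """Cut a sequence into non-overlapping windows and draw n at random."""
    fragments = [
        genome[start:start + size]
        for start in range(0, len(genome) - size + 1, size)
    ]
    rng = random.Random(seed)
    return rng.sample(fragments, min(n, len(fragments)))


def te_free(fragments, contains_te):
    """Keep the fragments for which the predicate reports no TE match.

    contains_te is a hypothetical callable standing in for the BLASTN
    comparison of each fragment against the TE sequences.
    """
    return [frag for frag in fragments if not contains_te(frag)]
```

With 100 fragments drawn per species, removing the TE-containing ones leaves fewer than 500 fragments in total, which is consistent with the 459 reported.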
Multivariate analysis showed that the retroviruses of H. sapiens and the LTR retrotransposons with env genes of Drosophila were very distant from their host genes. This specific grouping of the coding parts of retrovirus-like elements and of retroviruses relative to the host genes was not found when entire sequences were used, suggesting that there are differences in the transcription mechanisms for the coding parts of these elements. The coding parts of HERV (human endogenous retrovirus) also grouped with the other retroviruses, although such endogenous retroviruses are not infectious because of deletions or the presence of stop codons in their coding parts (Bock and Stoye 2000; Tristem 2000). It has been shown, however, that the HERV-K element can theoretically be trans-complemented and then become infectious (Bock and Stoye 2000). If the large dinucleotide relative abundance distances observed between host genes and retroviruses and some LTR retrotransposon genes are an indication of their infectivity, then we can expect the Drosophila elements, 297, Tirant, 17.6, and idefix to be infectious or to have been infectious in the recent past. Infectious capacity has been clearly demonstrated for gypsy (Kim et al. 1994), but the other five elements are only suspected of being retroviruses (Desset et al. 1999; Canizares et al. 2000). Experimental evidence is therefore required to test the theoretical expectation of the present analysis.
Footnotes
Wolfgang Stephan, Reviewing Editor
Keywords: transposable elements, retrovirus, dinucleotide abundance
Address for correspondence and reprints: Christian Biémont, Laboratoire de Biométrie et Biologie Évolutive, UMR CNRS 5558, Université Lyon 1, 69622 Villeurbanne Cedex, France. E-mail: biemont@biomserv.univ-lyon1.fr.
References
Altschul S. F., T. L. Madden, A. A. Schaffer, J. H. Zhang, Z. Zhang, W. Miller, D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
Beutler E., T. Gelbart, J. H. Han, J. A. Koziol, B. Beutler. 1989. Evolution of the genome and the genetic code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage. Proc. Natl. Acad. Sci. USA 86:192-196.
Bird A. P. 1986. CpG-rich islands and the function of DNA methylation. Nature 321:209-213.
Bock M., J. P. Stoye. 2000. Endogenous retroviruses and the human germline. Curr. Opin. Genet. Dev. 10:651-655.
Burge C., A. M. Campbell, S. Karlin. 1992. Over- and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. USA 89:1358-1362.
Campbell A., J. Mrázek, S. Karlin. 1999. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc. Natl. Acad. Sci. USA 96:9184-9189.
Canizares J., M. Grau, N. Paricio, M. D. Molto. 2000. Tirant is a new member of the gypsy family of retrotransposons in Drosophila melanogaster. Genome 43:9-14.
Capy P., C. Bazin, D. Higuet, T. Langin. 1997. Dynamics and evolution of transposable elements. R. G. Landes Company, Austin, Tex.
Desset S., C. Conte, P. Dimitri, V. Calco, B. Dastugue, C. Vaury. 1999. Mobilization of two retroelements, ZAM and Idefix, in a novel instable line of Drosophila melanogaster. Mol. Biol. Evol. 16:54-66.
Gentles A. J., S. Karlin. 2001. Genome-scale compositional comparisons in eukaryotes. Genome Res. 11:540-546.
Gower J. C. 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53:325-338.
Karlin S., C. Burge. 1995. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 11:283-290.
Karlin S., W. Doerfler, L. R. Cardon. 1994. Why is CpG suppressed in the genomes of virtually all small eukaryotic viruses but not in those of large eukaryotic viruses? J. Virol. 68:2889-2897.
Karlin S., I. Ladunga. 1994. Comparisons of eukaryotic genomic sequences. Proc. Natl. Acad. Sci. USA 91:12832-12836.
Karlin S., J. Mrázek. 1997. Compositional differences within and between eukaryotic genomes. Proc. Natl. Acad. Sci. USA 94:10227-10232.
Kim A., C. Terzian, P. Santamaria, A. Pélisson, N. Prud'homme, A. Bucheton. 1994. Retroviruses in invertebrates: the gypsy retrotransposon is apparently an infectious retrovirus of Drosophila melanogaster. Proc. Natl. Acad. Sci. USA 91:1285-1289.
Kyrpides N. 1999. Genomes OnLine Database (GOLD): a monitor of complete and ongoing genome projects world wide. Bioinformatics 15:773-774.
Lerat E., P. Capy, C. Biémont. 2002. Codon usage by transposable elements and host genes in five species. J. Mol. Evol. (in press).
McClintock B. 1984. The significance of responses of the genome to challenge. Science 226:792-801.
Shields D. C., P. M. Sharp. 1989. Evidence that mutation patterns vary among Drosophila transposable elements. J. Mol. Biol. 207:843-846.
Thioulouse J., D. Chessel, S. Dolédec, J. M. Olivier. 1997. ADE-4: a multivariate analysis and graphical display software. Stat. Comput. 7:75-83.
Tristem M. 2000. Identification and characterization of novel human endogenous retrovirus families by phylogenetic screening of the human genome mapping project database. J. Virol. 74:3715-3730.