The L1 element (LINE-1, long interspersed repeated DNA) is the
mammalian version of the non-long terminal repeat class of transposable
elements that replicate via an RNA intermediate
(retrotransposons)(1) . Every modern mammalian species studied
to date contains a distinctive L1 family consisting of tens of
thousands of members, which are interspersed throughout the genome.
Despite their distinctiveness, all full-length mammalian L1 elements
share the same organization: a 5`-UTR, which includes a
regulatory sequence; ORF I, which encodes a protein of unknown
function; ORF II, which encodes an RT (2); and a 3`-UTR that
contains a G-rich polypurine:polypyrimidine tract and terminates in an
A-rich sequence (Fig. 1).
Figure 1: The L1 retrotransposable element. Generic mammalian L1 element. reg and G-rich Pu:Py sequence denote regulatory sequence and a guanine-rich polypurine:polypyrimidine sequence, respectively. See text for more details.
Each of the modern L1 families
evolved independently in the various mammalian lineages from a common
ancestral L1 element that dates back to sometime before the mammalian
radiation 100 million years ago (3, 4, 5) . Being capable of prodigious
amplification, the modern L1 elements and their evolutionary
antecedents (see below) now account for at least 30% of the mass of
mammalian DNA. In addition, L1 elements are active in present day
species and are a frequent cause of genetic polymorphisms including a
number of non-inherited genetic defects in
humans(6, 7, 8) . It is also possible that
the L1 RT catalyzed the retrotransposition of elements that do not
encode their own RT such as the mammalian SINE families (e.g. Alu in primates, B1, B2, ID, etc., in
rodents)(5, 9, 10, 11) . Since these
families can reach copy numbers as high as 1 × 10^6 and alone contribute up to 5% of mammalian DNA (e.g. Alu (9)), L1 elements quite likely have had, and continue
to have, a profound effect on the structure, function, and evolution of
mammalian genomes.
In spite of their prominence, most of the biochemical and molecular details of L1 regulation, replication, and transposition remain unknown. To a large extent, what is known has been derived from evolutionary studies, and these have yielded two kinds of information. The first is derived from comparisons between different mammalian L1 families or between L1 elements and their counterparts in other organisms. This comparative biochemical approach identified and assigned possible functional significance to different features of non-long terminal repeat retrotransposons.
The second type of information, generated by the analytical techniques of evolutionary biology, revealed the evolutionary dynamics of L1 families. These studies suggest that L1 evolution is a paradigm for a novel, but as yet incompletely understood, evolutionary process that is taking place within the ``ecosystem'' of the mammalian genome and that L1 evolution is quite dynamic, with novel L1 variants continually emerging over relatively short periods of time. As a consequence, L1 evolution has generated a rather complex family structure, and it has become apparent that this feature of L1 evolution can be exploited to examine the evolutionary (phylogenetic) history of the mammalian hosts that harbor these elements(12, 13, 14, 15, 16) . It is this last aspect of L1 biology that will be the focus of this review. By way of introduction, we will briefly summarize some results derived from the comparative biochemical analysis and the evolutionary studies of L1 families.
Comparative Biochemistry of L1 Elements
Evolutionary comparisons have shown that the L1 RT is
seemingly of very ancient lineage since transposable elements encoding
an homologous protein have been found in bacteria, Group II introns,
plants, fungi, and invertebrates(1) . Elegant biochemical
studies on the L1-like RTs from invertebrates including insects, fungi,
some Group II introns, and bacteria revealed several intriguing
mechanistic properties of this class of RT, which may bear directly on
the biochemical properties of the L1 RT. Although this is the subject
of a recent review(17) , two properties of the RT are worth
mentioning here. First, efficient cDNA synthesis by the RT depends on
recognition of a structural feature near the 3`-end of the transposon
transcript(10, 18, 19, 20, 21) .
Second, the RT of the L1-like R2Bm element of Bombyx mori tends to incorporate non-templated bases (mainly, but not only,
As) at the 3`-end of the transposed cDNA(21) . These properties
could explain two evolutionarily conserved features of the mammalian L1
3`-UTR. The first is a G-rich polypurine stretch, which can form
various unusual folded structures whether present as DNA (22, 23, 24) or as RNA. In the
latter case such structures could possibly act as a recognition site
for the L1 RT. The second is the A-rich terminus of L1 elements. While
originally thought to have originated as the poly(A) tail of the
retrotranscribed L1 transcript(25, 26) , the A-rich
terminus could have been generated during the retrotransposition
process, as has been found for the R2Bm element(21) . Such a
mechanism could account for the fact that even recently transposed L1
elements do not always terminate in a pure poly(A) sequence (e.g. see (27) and (28) ).
One of the more striking findings revealed by the comparisons of different mammalian L1 families is that, in contrast to the rest of the element, the 5`-UTRs of even very closely related L1 families are not homologous (29, 30, 31, 32, 33). This indicates that the evolutionary origin of the 5`-UTR region is independent of the rest of the L1 element and that novel 5`-UTRs have been repeatedly acquired by the various mammalian L1 families. Since the 5`-UTR includes a region that has regulatory properties (34, 35, 36, 37, 38), the repeated acquisition of a novel regulatory sequence could be a means whereby the element bypasses either inactivating mutations in the L1 element (38) or a host-encoded repressive mechanism. Either explanation is consistent with the fact that sense strand-specific L1 transcripts are produced mainly from the most recently evolved L1 elements (39, 40). Although the evolutionary source for the novel L1 regulatory sequences is not known, they share certain sequence features with viral and housekeeping promoters in that they are CpG islands (41, 42) and lack many of the traditional transcription factor binding motifs found in RNA polymerase II promoters (e.g. TATA and CCAAT boxes).
The Evolutionary Dynamics of L1 Families
L1 replication generates two types of progeny: replication-competent copies and, in far greater numbers, defective copies, e.g. 5`-truncated, rearranged, etc.(25, 26, 29) . For the most part, these defective copies were neither excised (4, 5, 11, 12) nor homogenized by postreplicative events such as gene conversion(11, 12, 43, 44, 45, 46) but have diverged from each other due to the accumulation of random mutations over time. Therefore, the extent of divergence between members of any particular family serves as a built-in ``carbon'' dating mechanism whereby the time of amplification can be estimated, i.e. the more divergent the family, the older it is.
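The dating argument above can be sketched numerically. The following Python sketch is illustrative and is not taken from the studies cited here: it assumes that the members of a family diverged independently from a common source sequence at a constant neutral rate, applies the standard Jukes-Cantor correction for multiple substitutions, and uses hypothetical function names, toy sequences, and a placeholder substitution rate.

```python
import math
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites between two aligned sequences, skipping gap columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance in substitutions per site."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def family_age(seqs, rate_per_site_per_year: float) -> float:
    """Estimate the time since amplification from the mean pairwise divergence.

    Two copies that have diverged independently for t years are expected to
    differ by about 2 * rate * t substitutions per site.
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    dists = [jukes_cantor(p_distance(a, b)) for a, b in combinations(seqs, 2)]
    mean_d = sum(dists) / len(dists)
    return mean_d / (2.0 * rate_per_site_per_year)


# Toy aligned relics (hypothetical data, not real L1 sequences).
young = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT"]
old = ["ACGTACGTAC", "ATGTACGAAC", "ACGGACGTAT"]
RATE = 1e-8  # assumed placeholder: substitutions per site per year

print("young family age (years):", family_age(young, RATE))
print("old family age (years):", family_age(old, RATE))
```

Under these assumptions the more divergent family yields the larger age estimate. A real analysis would need a calibrated neutral rate for the lineage in question and many more sequences.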
Among the replication-competent copies, novel variants
were also produced, and these in turn generated both defective and yet
newer versions of non-defective
elements(11, 47, 48, 49) . Variant
elements can rapidly succeed each other (31, 32, 50) and also
co-exist(6, 11, 15, 49, 51) ,
perhaps competing with each other (46). Therefore, a given L1 ``family'' consists of
several closely related L1 subfamilies. Since L1 elements are
transmitted only by inheritance (i.e. vertically)(3, 13, 29, 30, 31, 46) ,
the L1 DNA composition of each species is unique. Thus, taken in
toto, the L1 content of present day mammalian species is very
complex encompassing as it does the entire evolutionary history of the
modern L1 elements since their descent from the common mammalian
ancestral L1 element (4, 5, 12).
Using L1 DNA as a Phylogenetic Character
Establishing a correct phylogeny, i.e. the unique tree that describes the genealogy of the taxa in question, is essential if either studies on evolutionary processes or comparative biochemical studies are to be meaningful. However, determining the correct phylogenetic tree can be extremely difficult (e.g. see (52, 53, 54, 55, 56) ). Taxa are grouped on the basis of shared characters, and sometimes it is impossible to determine whether a shared character has been inherited from a common ancestor or whether it arose independently due to convergence, parallelisms, or reversion to an ancestral state. Non-inherited shared characters are called homoplasies, and they can lead to multiple, equally likely phylogenetic trees or, in extreme cases, a single incorrect tree. A lucid elaboration of the difficulties caused by homoplasy can be found in (55) .
An additional problem encountered in phylogenetic analysis is determining whether a shared character has been recently acquired (derived) or is an ancestral (primitive) one that was retained by the modern taxa. This becomes a problem if different taxa have undergone different rates of evolution. For example, when species that share a common ancestor evolve at different rates, then the slower evolving ones will retain more of the ancestral characters than the faster evolving ones, and the slower and faster evolving species could be grouped separately even though they share a common ancestor.
If we consider the presence or
absence of an amplified L1 clade (i.e. family or
subfamily) as a phylogenetic character, the multicopy state
of the ``L1 character'' renders the issue of homoplasy moot.
Since the relics of a given L1 amplification event share multiple
diagnostic nucleotides, the presence of the same L1 clade in different
taxa could not have occurred by convergent evolution but must be a
shared derived character (referred to as a synapomorphy). Since L1
relics are retained in the genome in high copy number, reversion to the
ancestral state, i.e. the absence of a particular L1 family in
a particular taxon, cannot occur. In addition, the relative
``ages'' (extent of sequence divergence) of L1 clades are
easily determined. Therefore, the problems of both homoplasy and of
whether a character is a retained primitive or a newly acquired one are
circumvented when L1 DNA is used as a phylogenetic character.
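The logic of scoring an amplified clade as present or absent can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, the labels rat_older and rat_younger are placeholders, and the table is a simplified toy encoding that loosely follows the distributions described later in this review. If amplified clades are neither lost nor gained convergently, the sets of taxa carrying each clade must be nested or disjoint, and any other pattern points to a scoring error or a violated assumption.

```python
from itertools import combinations


def clade_groups(presence):
    """Map each L1 clade to the set of taxa in which it was detected."""
    groups = {}
    for taxon, clades in presence.items():
        for clade in clades:
            groups.setdefault(clade, set()).add(taxon)
    return groups


def incompatible_pairs(groups):
    """Pairs of clades whose taxon sets overlap without one containing the other."""
    bad = []
    for (c1, t1), (c2, t2) in combinations(sorted(groups.items()), 2):
        if t1 & t2 and not (t1 <= t2 or t2 <= t1):
            bad.append((c1, c2))
    return bad


def nested_order(groups):
    """Clades listed from most to least inclusive (largest taxon set first)."""
    return sorted(groups, key=lambda c: (-len(groups[c]), c))


# Simplified toy table; clade labels other than Lx are placeholders.
presence = {
    "Mus": {"Lx"},
    "Otomys": {"Lx"},
    "R. rattus": {"Lx", "rat_older"},
    "R. rattus moluccarius": {"Lx", "rat_older", "rat_younger"},
    "R. norvegicus": {"Lx", "rat_older", "rat_younger"},
    "Acomys": set(),
}

groups = clade_groups(presence)
for clade in nested_order(groups):
    print(clade, "->", sorted(groups[clade]))
print("incompatible clade pairs:", incompatible_pairs(groups))
```

Each clade defines a candidate group of taxa that share it, and the more inclusive groups correspond to the older amplification events. A taxon with no scored clade, such as Acomys in this toy table, falls outside every group.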
Examples of Using L1 as a Phylogenetic Character
The use of L1 DNA as a phylogenetic character is relatively simple in both principle and practice and depends on obtaining enough DNA sequence information to prepare clade-specific hybridization probes. Although probes cognate to any region of L1 DNA can be used (e.g. see below and the legend to Fig. 2), those specific to the 3`-UTR are most generally useful, especially for recently evolved clades. This is because the 3`-UTR evolves for the most part more rapidly than most of ORF I and all of ORF II (e.g. Refs. 5, 12, 13) and is not replaced wholesale during evolution as can be the case for the 5`-UTR (see ``Comparative Biochemistry of L1 Elements''). In spite of the relatively rapid evolutionary change in the 3`-UTR, clades that are as old as 12-15 million years can be readily distinguished (see below).
Figure 2: Distribution of L1 clades in various rodents. A, diagrammatic representation of the presence or absence of an ancient murine L1 clade, Lx, and several modern rat L1 clades. The original data were presented in Refs. 12, 15, and 16. B, distribution of two newly evolved descendant clades, rn and mol. These families were distinguished on the basis of differences in a hypervariable region that we recently discovered in ORF I. The darkness of the gray filled circles is related to the amount of the indicated subfamilies in R. norvegicus and R. rattus moluccarius, where the sum of the rn and mol clades in R. norvegicus is about the same as the mol clade in R. rattus moluccarius (E. Cabot, B. Angeletti, B. Hayward, K. Usdin, and A. V. Furano, manuscript in preparation).
For older L1 clades,
we have found probes of 200 base pairs to be both specific and
long enough to hybridize efficiently to the divergent members of a
given clade. For the younger families oligonucleotide probes are
essential. Oligonucleotide probes of 20 bases cognate to regions
of clades that differ by 2 or more diagnostic nucleotides are ideal. In
cases where the multiple diagnostic base differences between clades are
further apart than can be accommodated on a single oligonucleotide, more
than one oligonucleotide should be used to eliminate the possibility
that the shared hybridization signal is due to chance mutation in
precisely the same position in two otherwise different clades (but see
below). We have obtained excellent discrimination using
oligonucleotides to probe for a single base difference as long as the
difference resides in the middle of the oligonucleotide and the
hybridization is carried out in the presence of a large excess of the
appropriate competitor oligonucleotide, i.e. one that has the
same sequence as the probe except for the distinguishing base change.
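The probe-selection rules in this paragraph can be sketched in Python. This is a minimal illustration under stated assumptions: the two clade consensus sequences are already aligned without gaps, strand choice (the probe would in practice be complementary to the strand it detects) is ignored, and all function names and sequences are hypothetical. The first helper finds 20-base windows that span at least two diagnostic positions; the second builds a probe with a single diagnostic base near its middle, together with a competitor that differs from the probe only at that base.

```python
def diagnostic_positions(target: str, other: str):
    """Indices at which two aligned, equal-length consensus sequences differ."""
    if len(target) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(target, other)) if a != b]


def multi_difference_probes(target: str, other: str, length: int = 20, min_diffs: int = 2):
    """Windows of the target clade that span at least min_diffs diagnostic positions."""
    diffs = diagnostic_positions(target, other)
    probes = []
    for start in range(len(target) - length + 1):
        n = sum(start <= d < start + length for d in diffs)
        if n >= min_diffs:
            probes.append((start, target[start:start + length], n))
    return probes


def centered_probe_and_competitor(target: str, other: str, pos: int, length: int = 20):
    """Probe with the diagnostic base near the middle, plus its competitor.

    The competitor has the same sequence as the probe except for the
    distinguishing base, which is taken from the other clade.
    """
    start = pos - length // 2
    if start < 0 or start + length > len(target):
        raise ValueError("diagnostic position too close to the sequence end")
    probe = target[start:start + length]
    offset = pos - start
    competitor = probe[:offset] + other[pos] + probe[offset + 1:]
    return probe, competitor


# Toy consensus sequences (hypothetical) that differ at positions 10 and 15.
clade_a = "A" * 10 + "C" + "A" * 4 + "G" + "A" * 14
clade_b = "A" * 30

print(diagnostic_positions(clade_a, clade_b))
print(len(multi_difference_probes(clade_a, clade_b)), "candidate two-difference probes")
probe, competitor = centered_probe_and_competitor(clade_a, clade_b, 15)
print(probe)
print(competitor)
```

Real probe design would also have to consider melting temperature, self-complementarity, and the divergence among relic copies within each clade, none of which this sketch attempts.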
Hybridizations are most conveniently carried out using dot blots of genomic DNA. However, hybridization to blots of electrophoretically separated fragments of genomic DNA that had been digested with restriction endonucleases, which recognize conserved sites within the 3`-UTR, greatly increases both the specificity and sensitivity of the method. The appearance of novel restriction fragments is indicative of subdivisions within a given clade due to the loss or gain of a particular restriction enzyme site. Therefore, a shared novel restriction fragment detected even by a probe specific for just a single base difference would be highly specific for a given clade. This is because the presence of the novel restriction fragment would have required at least two base changes: the one detected by the oligonucleotide and the one that created or destroyed a given restriction enzyme site. The sensitivity of the method is increased because the presence of subdivisions within a given clade could be evidence of recently evolved (or evolving) L1 clades. In the two sections below we demonstrate the use of L1 as a phylogenetic character to examine an evolutionary event that occurred about 12 Ma and one that began 1-3 Ma.
Phylogenetic Analysis Using an Ancient Murine L1 Clade
Murinae, a rodent subfamily, which includes Old World rats (Rattus) and mice (Mus) and many other genera, first appeared 12-15 Ma. The classification of Murinae is traditionally based on several cranial and dental characters (57) and in a number of cases has been problematic(58) . A few years ago we discovered the relics of an ancient L1 clade (referred to as Lx) in the genomes of mice and rats(11, 12, 15) . Based on the extent of nucleotide divergence between Lx members and the murine neutral nucleotide substitution rate, we estimated that the Lx amplification coincided with the murine radiation(15) . Therefore, we expected that the relic copies of Lx would be present in all modern day murines but absent from non-murine taxa.
We found Lx to be present in 24 unambiguously classified murine species and absent from 13 unambiguously classified non-murine species(11, 15) . Of particular interest was our finding that the Lx amplification was absent from three taxa, Lophuromys, Uranomys, and Acomys, that were traditionally classified as murines (58) . Our data suggested that the classification of these species was incorrect, and indeed their inclusion in Murinae has at times been challenged (e.g. see (59) and references therein). Subsequent re-examination of the morphological data and both single copy DNA hybridization data (59) and 12 S mitochondrial rRNA sequence analysis (60) have now further supported the exclusion of these taxa from Murinae. Therefore, the murine-like dental pattern of the (Lophuromys, Uranomys, Acomys) clade, which in part formed the basis of their classification as murines, is quite likely a homoplasy due to convergence.
The above results indicated that the Lx amplification is an acquired taxon-defining character, or synapomorphy, for the subfamily Murinae. We further tested this supposition by re-examining the classification of Otomys. The animals in this genus, commonly called African vlei rats, were traditionally classified in their own subfamily, Otomyinae, of equal rank to Murinae (58). However, this classification did not accommodate the presence of a transitional fossil form between an ancestral murine species and present day Otomys. This fossil of the now extinct Euryotomys was dated from 6.0 to 4.5 Ma (61), well after the murine radiation, and its existence suggested that the Otomyinae were murines. If true, then the Otomyinae species should contain Lx DNA, and this turned out to be the case (16). Recent single copy DNA hybridization data (62) also support the reclassification of these animals as murines. Therefore, using the absence or presence of Lx DNA as a phylogenetic character helped resolve two problems in rodent phylogeny. The distribution of Lx in murine and non-murine species is summarized in Fig. 2.
Phylogenetic Analysis with Modern L1 Clades
The distribution of recently amplified L1 clades can be used
to resolve the taxonomy of more recently diverged animals. The genus Rattus contains about 50 species considered to be Rattus sensu stricto. Single copy DNA hybridization is unable to
establish a branching pattern for many of these species, and the
systematics of this group remains largely
unresolved (16, 58). We can distinguish at least five
relatively modern L1 clades in Rattus norvegicus. One of the older ones amplified
about 3.5 million years ago when the species comprising Rattus sensu stricto began emerging. As Fig. 2 illustrates, this
clade is present only in animals classified as Rattus sensu stricto (16). Therefore, it
probably arose in the common ancestor of Rattus sensu stricto some time after the divergence of these animals from
the ancestor they shared with Rattus sensu lato.
By contrast, two younger rat L1 clades are present only in R. norvegicus and in
animals identified as Rattus rattus moluccarius, a presumed
subspecies of Rattus rattus (16). Although R. rattus moluccarius specimens contained both of these
clades, they were absent from a number of
other R. rattus specimens (Fig. 2). This result was
quite surprising and suggested that the R. rattus moluccarius specimens were misclassified and represent a sister taxon of R. norvegicus rather than a subspecies of R. rattus (16). Further analysis using mitochondrial DNA
sequences and our finding that R. norvegicus and R. rattus moluccarius share a satellite DNA sequence supported this
conclusion (15). Therefore, these two younger
clades are markers for a new taxon within Rattus sensu stricto; this taxon contains R. norvegicus and R. rattus moluccarius.
One of the modern rat L1 clades has evolved rapidly,
and two descendant clades, designated rn and mol, can be distinguished (Fig. 2B). While R. rattus moluccarius contains only the mol
clade, R. norvegicus contains some members of this
clade but far greater numbers of the rn clade (Fig. 2B). This indicates that the rn
clade either arose in or began amplifying in R. norvegicus soon after it and R. rattus moluccarius diverged from their common ancestor. Furthermore, it is possible
that the rn clade may have expanded at the
expense of the mol clade in the R. norvegicus genome since the mol clade has not amplified to the same
extent as the rn clade in R. norvegicus or as the mol
clade in R. rattus moluccarius. These results suggest that very closely
related L1 clades can exclude each other perhaps by competing for
limiting host factors.
Studies on L1 DNA of Mus have revealed a similar picture of L1 evolution and have demonstrated the usefulness of L1 DNA as a phylogenetic character in this taxon. Species-specific L1 clades distinguish Mus domesticus and Mus spretus (13) and have been used to detect M. spretus genomic sequences present in an inbred strain of Mus musculus (63). Additionally, recent work on modern M. spretus L1 DNA has revealed emerging and apparently competing L1 clades that may be useful in defining subpopulations of this species as well (46, 49). Humans also contain a very complex L1 DNA composition (5) including a number of distinct replication-competent L1 clades (6, 51).
As a consequence of their long replicative history in
mammalian genomes, L1 elements have generated a rich collection of DNA
``fossils'' that can be used to determine the phylogenetic
history of mammals. Here we have shown how the presence (or absence) of
an amplified L1 clade can be used as a novel and robust phylogenetic
character. We should also mention that individual transposition events
can be used for phylogenetic analysis. Batzer et al.(64) showed that the frequency of a SINE insertion at four
different loci in the human genome distinguished human population
groups and used their results to further support the African origin of
modern humans. Comparisons between mammalian globin loci have
shown that different species can be distinguished by the pattern of L1
insertions at this site (4, 65, 66). For
example, an ancient L1 insertion between two of the globin genes
distinguishes eutherians (placental mammals) from metatherians
(marsupials) (4, 66), and two independent L1
insertions flank one of the globin genes in simians but not in
prosimians (66). However, independent insertional events could
be problematic for phylogenetic analysis. First, they are much harder
to identify or characterize initially (though once detected, relatively
easy to screen for) than the presence or absence of an amplified L1
clade. Second, any individual insertion or site that is being scored
for the presence of the insertion could be subject to re-arrangement, e.g. deletion of the inserted element. Therefore, both the
problems of homoplasy and of determining whether the character is an
ancestral or acquired one could theoretically afflict the use of
individual insertion events.
Finally, we would like to close with a comment about the possible effect of L1 transposition on mammalian evolution. Because L1 insertions are random and potentially either beneficial or deleterious, it is easy to visualize how an L1 amplification event introduces genetic diversity into an extant animal population. Depending on a number of extrinsic (e.g. geographical isolation, population size) and intrinsic (e.g. changes in fitness caused by an L1-induced genetic effect) factors, a given animal population could become differentiated into subpopulations as a consequence of the differences between their patterns of L1 insertions. Moreover, depending on the rate at which novel L1 clades emerge and amplify, it would be quite possible that subpopulations could also differ by their content of distinct L1 clades, which, depending on the relative transposition rate of the distinct L1 clades, would further enhance the generation of genetic diversity within the taxon. To the extent that genetic diversity predisposes a given taxon to speciation, one might entertain the notion that L1 amplification events may have a role in mammalian speciation. In this regard, we note the apparent correlation, at least during rodent evolution, between the generation and expansion of novel L1 clades and a number of speciation/extinction events (see (15) and references therein).