Genes of the Pseudoviridae (Ty1/copia Retrotransposons)

Brooke D. Peterson-Burch and Daniel F. Voytas

Department of Zoology & Genetics, Iowa State University


    Abstract
A comprehensive survey of the Pseudoviridae (Ty1/copia) retroelement family was conducted using the GenBank sequence database and completed genome sequences of several model organisms. Plant genomes were the most abundant sources of Pseudoviridae, with the Arabidopsis thaliana genome having 276 distinct elements. A reverse transcriptase amino acid sequence phylogeny indicated that the Pseudoviridae comprises highly divergent members. Coding sequences for a representative subset of elements were analyzed to identify conserved domains and differences that may underlie functional divergence. With the exception of some fungal elements (e.g., Ty1), most Pseudoviridae encode Gag and Pol on a single open reading frame. In addition to the nearly ubiquitous RNA-binding motif of nucleocapsid, three new conserved domains were identified in Gag. The pol-encoded aspartic protease was similar to the retroviral enzyme and could be mapped onto the HIV-1 structure. Pol was highly conserved throughout the family. The greatest divergence among Pol sequences was seen in the C-terminus of integrase (IN). We defined a large motif (GKGY) after the IN catalytic domain that is unique to the Pseudoviridae. Additionally, the extreme C-terminus of IN is rich in simple sequence motifs. A distinct lineage of Pseudoviridae in plants has envlike genes. This lineage has undergone a large expansion of Gag characterized by an α-helix-rich domain containing coiled-coil motifs. In several elements, this domain is flanked on both sides by RNA-binding domains. We propose that this monophyletic lineage defines a new Pseudoviridae genus, herein referred to as the Agrovirus.


    Introduction
Genome sequencing efforts indicate that a large portion of most eukaryotic genomes consists of DNA transposons and retroelements (retroviruses, retroposons, and retrotransposons). Retroelements often make up the majority of interspersed repetitive DNA. For example, more than half of the maize genome is of retroelement origin, and in salamanders, a single retroelement has amplified to such an extent that it contributes more DNA to its host than exists in the entire human genome (Marracci et al. 1996; SanMiguel et al. 1996). Retroelement abundance derives from element accumulation through reverse transcription and integration. Immobile parental elements seed progeny to new sites in the host genome. In contrast, DNA transposons employ a conservative replication strategy, limiting their contribution to host genome expansion. Their life cycle involves a "cut and paste" mechanism in which the parental element excises from its original location before integrating elsewhere.

Retroviruses and long terminal direct repeat (LTR) retrotransposons share numerous similarities in genetic organization and mechanism of replication. Both are flanked by LTRs and encode Gag and Pol polyproteins that are processed by a pol-encoded aspartic protease (PR). Gag is a structural protein that forms virus or viruslike particles. pol also encodes the enzymes reverse transcriptase (RT) and integrase (IN), which synthesize retroelement cDNA and integrate it into the host genome. Retroviruses have a third open reading frame, env, enabling extracellular transmission.

Sequence heterogeneity is high among retroelements even though they encode proteins of the same or similar function. One explanation lies in the comparatively low fidelity of reverse transcription, which results in accelerated retroelement evolution (Gabriel and Mules 1999). Despite sequence heterogeneity, phylogenetic analyses based on RT amino acid sequences divide LTR retroelements into five distinct lineages found in diverse eukaryotes (Xiong and Eickbush 1990; Boeke et al. 2000a, 2000b; Malik, Henikoff, and Eickbush 2000). These include the vertebrate retroviruses (Retroviridae), two predominantly retrotransposon lineages (the Pseudoviridae and the Metaviridae), and the BEL and DIRS clades (Malik, Henikoff, and Eickbush 2000). The Pseudoviridae (also known as the Ty1/copia elements) are characterized by a distinctive pol enzymatic domain order, wherein coding sequences for IN precede those for RT. This domain order is reversed in the other LTR retroelement lineages. The Metaviridae and Pseudoviridae are each divided into two genera (Boeke et al. 2000a, 2000b). The Metaviridae consist of the Metavirus and Errantivirus; the latter has envlike genes similar to those of the Retroviridae. The Hemivirus and Pseudovirus constitute the two genera of Pseudoviridae, which are distinguished by the primer used to initiate reverse transcription. Pseudoviruses prime DNA synthesis with the terminal 3' residue of initiator methionine tRNA, whereas Hemiviruses utilize a half-tRNA generated by cleavage within the anticodon stem-loop.

Genome sequencing projects have enhanced our understanding of diversity and evolutionary trends among retroelements. For example, it has become apparent that infectious retroviruses evolved independently multiple times through the acquisition of env genes by retrotransposons (Malik, Henikoff, and Eickbush 2000). This has been well documented in the Metaviridae, and more recently, examples of elements with envlike genes have been reported in the Pseudoviridae (Laten, Majumdar, and Gaucher 1998; Kapitonov and Jurka 1999; Peterson-Burch et al. 2000). Additional important evolutionary trends will likely be revealed as genome sequences continue to unfold. As a framework for understanding such trends, we characterized the coding regions of Pseudoviridae found within GenBank as well as the complete genome sequences of five eukaryotes representing plants (Arabidopsis thaliana), animals (human, Caenorhabditis elegans, Drosophila melanogaster), and fungi (Saccharomyces cerevisiae). We sought both conserved and divergent features that may offer insight into the mechanisms of replication and how these elements have adapted to their host cell environments. Because the Pseudoviridae are particularly abundant and diverse in higher plants, analyses focused on the Pseudoviridae of A. thaliana.


    Materials and Methods
The Data Set
Entrez searches were used to identify previously described Pseudoviridae for the core data set. DNA sequences were retrieved from GenBank and were required to range between 2 and 20 kb in length. The lower size limit excluded sequences consisting only of element subdomains, whereas the upper limit excluded the majority of putative elements in large clones generated by genome sequencing projects. The length boundaries were used in combination with the keywords Ty1, copia, retrotransposon, and retroelement. Core data set candidates identified by this screen were assessed for structural integrity and completeness. Full-length elements were defined as those for which PR, IN, and RT domains could be identified. Degenerate elements that could not be manually reconstructed were eliminated from consideration. A single element was used to represent retrotransposons sharing greater than 75% amino acid similarity in Pol. These criteria resulted in a core data set of 32 retrotransposons (table 1).
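The length and keyword screen described above can be illustrated with a short sketch. This is not the authors' procedure (the searches were run through Entrez); the `Record` structure, the function name, and the example accessions are hypothetical placeholders, and only the 2 to 20 kb bounds and the four keywords come from the text.

```python
from dataclasses import dataclass

# Bounds and keywords are taken from the text; everything else is illustrative.
MIN_LEN_BP = 2_000
MAX_LEN_BP = 20_000
KEYWORDS = ("ty1", "copia", "retrotransposon", "retroelement")


@dataclass
class Record:
    """Hypothetical stand-in for a GenBank entry."""
    accession: str      # made-up identifiers in the example below
    description: str
    length_bp: int


def is_candidate(rec: Record) -> bool:
    """True if the record is 2-20 kb long and mentions at least one keyword."""
    in_range = MIN_LEN_BP <= rec.length_bp <= MAX_LEN_BP
    text = rec.description.lower()
    return in_range and any(k in text for k in KEYWORDS)


records = [
    Record("X1", "Ty1/copia retrotransposon, complete", 5_300),
    Record("X2", "copia-like RT fragment", 800),           # too short
    Record("X3", "BAC clone with retroelement", 95_000),   # too long
    Record("X4", "histone H3 gene", 3_000),                # no keyword
]
candidates = [r.accession for r in records if is_candidate(r)]
print(candidates)
```

Records passing this screen would still need the manual checks for structural integrity and completeness described in the text.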


Table 1 Pseudoviridae Core Data Set Retroelements

 
Conserved RT sequences from each member of the core data set (as defined by Xiong and Eickbush 1990) were used to query model organism genomes with the tblastn program of BLAST 2.0 (Altschul et al. 1997). tblastn was set to perform an unfiltered gapped search using the BLOSUM62 scoring matrix (Henikoff S and Henikoff JG 1992) and an expect cutoff of 10 × 10⁻⁵. All other options were left at default values. Search results were combined, and unique genome sequence hits were parsed and placed into a FASTA format file. Hits whose lengths varied by more than 20% of the query RT sequence length were excluded from further analyses. It should be noted that in most cases, the same putative elements were identified, regardless of the query sequence used. BLAST 2 Sequences was used to identify LTRs and to delimit coding regions used for multiple alignments (Tatusova and Madden 1999).
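The hit-length criterion above can be sketched in a few lines. The `Hit` structure and its fields are hypothetical simplifications of a tblastn result, and it is an assumption here that hit length is measured as aligned length in amino acids; only the 20% tolerance and the expect cutoff come from the text.

```python
from typing import NamedTuple

EXPECT_CUTOFF = 10e-5     # "10 x 10^-5" as written in the text (equals 1e-4)
LENGTH_TOLERANCE = 0.20   # hits may differ from the query length by at most 20%


class Hit(NamedTuple):
    """Hypothetical, simplified tblastn hit."""
    subject_id: str
    aligned_len: int   # assumed: aligned length in amino acids
    evalue: float


def keep_hit(hit: Hit, query_len: int) -> bool:
    """Keep hits within the length tolerance and at or below the expect cutoff."""
    within_len = abs(hit.aligned_len - query_len) <= LENGTH_TOLERANCE * query_len
    return within_len and hit.evalue <= EXPECT_CUTOFF


query_len = 240  # illustrative RT query length, not a value from the paper
hits = [
    Hit("chr1_a", 235, 1e-30),   # passes both criteria
    Hit("chr2_b", 120, 1e-40),   # too short relative to the query
    Hit("chr3_c", 250, 1e-3),    # expect value above the cutoff
]
kept = [h.subject_id for h in hits if keep_hit(h, query_len)]
print(kept)
```

In practice BLAST applies the expect cutoff itself, so the E-value check is included only to keep the sketch self-contained.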

Sequence Analysis
ClustalX version 1.81 was used to generate most multiple sequence alignments (Thompson et al. 1997). Default settings were used for all parameters other than substitution matrices, where the BLOSUM series was used for both pairwise and multiple sequence alignments. The delay divergent sequences option was set to 45 for the RT alignment. Alignments of IN, PR, and RT are deposited in the EMBL database and are posted on our website (http://www.public.iastate.edu/~voytas/publications/mbe02.html). Additional methods of multiple sequence alignment were employed to identify conserved amino acid sequences in Gag and Pol. These included ungapped blastx alignments and the methods incorporated by the Blocks software (Henikoff et al. 1995). Constrained regions and residues in Gag were revealed when the methods were not forced to align the sequences over their entirety. SeqLogos were generated from ungapped blastx multiple alignments of Gag amino acid sequences (Schneider and Stephens 1990) using the WebLogo server (http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi).

The RT neighbor-joining tree (Poisson correction) was generated using MEGA 2.1 (Kumar et al. 2001) from ClustalX alignments. Other trees were created with PAUP 4.028 (Swofford 1999). The IN tree was generated from aligned sequences spanning from four residues upstream of the first histidine of the N-terminal zinc-binding motif (HHCC) to the end of the conserved region, which ends approximately 120 residues downstream of the glutamate at the catalytic site. The PR tree was generated from the alignment shown in figure 3A. Bootstrap replicates used full character replacement. Default settings were used for all other parameters. Specifics regarding the rooting of trees are given in the figure legends. Trees were visualized with MEGA or TreeExplorer (http://evolgen.biol.metro-u.ac.jp/TE/TE_man.html).
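For readers unfamiliar with the Poisson correction mentioned above, the distance between two aligned amino acid sequences is d = -ln(1 - p), where p is the proportion of sites that differ. The following is a minimal sketch of that formula, not a reproduction of MEGA's implementation; skipping columns that contain a gap in either sequence is an assumption made here, and the function names are illustrative.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Proportion of differing residues over columns with no gap in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns with a gap ('-') in either sequence are ignored.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def poisson_distance(a: str, b: str) -> float:
    """Poisson-corrected distance d = -ln(1 - p)."""
    p = p_distance(a, b)
    if p >= 1.0:
        raise ValueError("distance undefined when every site differs")
    return -math.log(1.0 - p)


# Toy aligned fragments (not real RT sequences): one difference in four sites.
print(round(poisson_distance("ACDE", "ACDF"), 4))
```

A matrix of such pairwise distances is the input a neighbor-joining algorithm would use to build the tree.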



Fig. 3. Pseudoviridae protease. (A) Alignment of PR amino acid sequences. The bottom seven sequences are representative members of Retroviridae (Hunter et al. 2000); all others are Pseudoviridae. Colored columns in the alignment indicate positions with conserved physicochemical properties. The red, orange, and blue boxes mark conserved domains. (B) The Pseudoviridae PR sequences were mapped onto the structure of the HIV-1 PR dimer (PDB ID: 1HXW). The red portion depicts the enzyme's active site. The blue domain is part of the enzyme's structural backbone. The orange domain extends into the flexible substrate-binding loop. (C) Neighbor-joining tree (bootstrapped 1,000 replicates) derived from the alignment in A. The Agrovirus and Retroviridae clades are highlighted in green and red, respectively. The root branch point is indicated by a triangle.

 
Coiled-coil predictions for Gag were generated with MultiCoil (Wolf, Kim, and Berger 1997). The JPred2 server provided secondary structure predictions for Gag, using the ClustalX alignments (Cuff and Barton 2000). Transmembrane domains in envlike sequences were predicted using HMMTop, which is purported to be 90% accurate (Tusnady and Simon 1998). Overall secondary structure predictions for the envlike ORFs were generated by PSIpred v2.0 (Jones 1999). Note that for most comparisons, we used a consensus of the ToRTL1 envlike ORF on the basis of three closely related ToRTL1 elements (U68072, AF220603, AF220602). Locations of PR amino acid sequences aligning with HIV-1 were mapped onto the HIV-1 PR crystal structure (PDB 1HXW) using the Cn3D visualization software of NCBI (Hogue 1997). Amino acid sequences for Gag, IN, and the Envlike proteins were submitted to the MEME 3.0 web server to identify motifs (Bailey and Elkan 1994). Results were also assessed for evidence of rearrangements in motif order within the proteins (McClure, Vasi, and Fitch 1994). Splice site predictions were made with the GeneSeqer program (http://bioinformatics.iastate.edu/cgi-bin/gs.cgi).


    Results
Thirty-two members of the Pseudoviridae were used as the core data set for familywide comparisons (see table 1). These elements were chosen because most have been previously described in publications and many are being actively characterized in research laboratories. To determine whether the core data set encompasses the diversity of Pseudoviridae, searches of the complete human, C. elegans, D. melanogaster, and A. thaliana genome sequences were conducted. Each genome was searched by tblastn, using RT sequences from each element in the core data set. No Pseudoviridae were found in humans or C. elegans. The Pseudoviridae identified in D. melanogaster were 1731 and copia (members of the core data set). We previously characterized retrotransposons in the complete S. cerevisiae genome sequence, wherein the Pseudoviridae are represented by Ty1, Ty2, Ty4, and Ty5 (Kim et al. 1998). Consistent with the emerging picture of transposable element abundance in plants, the A. thaliana genome harbored the most Pseudoviridae. On the five A. thaliana chromosomes, 276 distinct Pseudoviridae RTs were identified.

Seven conserved domains of RT (Xiong and Eickbush 1990) were aligned for all elements and used to construct a consensus neighbor-joining tree (fig. 1). The tree is characterized by long branch lengths, indicating considerable diversity among the Pseudoviridae. Most of this diversity is embodied by the A. thaliana elements; many of the other plant elements in the core data set have A. thaliana homologues. Nonplant elements emanate from the base of the tree. A previous study based on the partial A. thaliana genome sequence identified several of the major clades described here (Terol et al. 2001) (numbered clades in fig. 1). Representatives from these and additional lineages (numbered elements in fig. 1) were subjected to the analyses of coding sequences described in subsequent sections (data not shown). No unique coding sequence features were observed. We concluded that the core data set is representative of the Pseudoviridae.



Fig. 1. Diversity of Pseudoviridae RT. (A) Neighbor-joining tree of RT sequences placing the Pseudoviridae family in context with the Metaviridae and Retroviridae. The tree is rooted with the A. thaliana retroposon Ta11 (branch point marked by triangle). The clade of the proposed Agrovirus genus is labeled. Adapted from Peterson-Burch et al. 2000. (B) Neighbor-joining tree of all Pseudoviridae RTs from the core data set and model organism genomes. This bootstrapped consensus tree is rooted to Ta11. Branches representing elements from the core data set are labeled with the element name. Letters in parentheses after element names indicate those from nonplant hosts: G, green algae; F, fungi; I, insects. Numbered taxa identify A. thaliana elements from divergent clades not represented by the core data set; the first number identifies the chromosome and the second the chromosomal location. Numbered clades represent Pseudoviridae groups previously described (Terol et al. 2001). The numbers on the branches represent bootstrap support for 1,000 replicates.

 
The core data set consists of eight Hemiviruses and 24 Pseudoviruses. With the exception of some fungal elements, all appear to use a full or half initiator methionine tRNA to prime reverse transcription (table 1). No Hemiviruses were found in plants. Some members of a distinct lineage of plant elements have envlike ORFs (see labeled clade in fig. 1), and we propose assigning these elements to a new genus within the Pseudoviridae, the Agrovirus (see Discussion). For simplicity, they are hereafter referred to as Agroviruses.

gag
gag encodes proteins that form virus or viruslike particles and that package retroelement mRNAs (Vogt 1997). For the purposes of this study, Gag was defined as the span of amino acids extending from the first methionine to the three catalytic residues of PR (see below for description of PR). Gag proteins of retroviruses and Ty1 are cleaved 20–40 residues before the PR active site, so this definition likely overestimates this protein's length. Accordingly, observations pertaining to Gag features do not consider the C-termini.

In some retroelements, Gag and Pol are encoded by a single open reading frame, whereas in others they are separated by a shift in reading frame or a stop codon. For these latter elements, Pol is usually expressed as a consequence of translational recoding (e.g., frameshifting or stop codon suppression) (Gesteland and Atkins 1996). Most Pseudoviridae appear to encode Gag and Pol on a single ORF; 27 of the 32 elements in the core data set (table 1) do not have a break in reading frame between gag and pol. Ty1 is known to use ribosomal frameshifting for pol expression, and the sequences that mediate Ty1 frameshifting are conserved in Ty4 (Voytas and Boeke 2002). Two of the maize elements (Opie-2 and PREM-2) were too degenerate to assess their organization, and gag was entirely missing in an element from pea. The reference 1731 element has a frameshift located between gag and pol, suggesting that it undergoes frameshifting; however, a consensus element derived from 1731 insertions in the D. melanogaster genome encodes a single ORF (data not shown).

Gag is typically processed into multiple proteins, including a short nucleocapsid protein that coats the retroelement mRNA within the particle. The RNA-binding motif (RB) (Cx2Cx4Hx4C) is a characteristic feature of the nucleocapsid and is widespread among retroelements (Vogt 1997). This motif is also a common feature of the Pseudoviridae with the exception of Ty1, Ty4, and Tca2. The transpositionally active Ty5 element has a shortened motif variant (Cx2Cx3Hx4C). In most Pseudoviridae, as in other retroelement families, the region flanking the motif exhibits a basic nature and contains low complexity amino acid sequence repeats. Most of these repeats are composed of N(n), K(n), G(n), and variations on RG(n) (data not shown).
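The two RB patterns given above translate directly into regular expressions, which is one simple way such motifs could be located in a Gag sequence. This is an illustrative sketch rather than the method used in the study; the function name and the toy sequences are made up, and only the patterns Cx2Cx4Hx4C and Cx2Cx3Hx4C come from the text.

```python
import re

# Patterns transcribed from the text: "x" is any residue, the digit is a count.
RB_CANONICAL = re.compile(r"C.{2}C.{4}H.{4}C")    # Cx2Cx4Hx4C
RB_TY5_VARIANT = re.compile(r"C.{2}C.{3}H.{4}C")  # Cx2Cx3Hx4C (Ty5)


def find_rb_motifs(protein: str) -> list[tuple[int, str, str]]:
    """Return (start index, pattern label, matched text) for each motif found.

    Note: finditer reports non-overlapping matches only.
    """
    hits = []
    for label, pattern in (("canonical", RB_CANONICAL), ("Ty5-like", RB_TY5_VARIANT)):
        for m in pattern.finditer(protein):
            hits.append((m.start(), label, m.group()))
    return sorted(hits)


# Toy sequences, not taken from any real element.
print(find_rb_motifs("MKRGCYNCGKPGHIARDCPQKK"))
print(find_rb_motifs("AACFICGKEHLARDCAA"))
```

Counting the hits returned for a given Gag sequence would give the number of RB copies, and the start indices give their positions along the protein.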

Striking differences in the organization and number of RBs were observed in Agroviruses (fig. 2). Among Agroviruses with identifiable envlike ORFs, SIRE-1 and ToRTL1 have two RBs, whereas Endovir1-1 has three. In Endovir1-1, two motifs are arranged in tandem near the center of Gag, and one is located at the Gag C-terminus. The only other member of the Pseudoviridae with two tandem RBs is Tpv2-6. In this case, the RBs are located at the Gag C-terminus. It should be noted that in the RT phylogeny, Tpv2 is among the elements most closely related to the Agroviruses (fig. 1). Agroviruses for which envlike ORFs have not been identified include Opie-2 and PREM-2, and these elements have a single, centrally located RB. In general, the central location of this motif is a unique feature of the Agroviruses. Agroviruses also stand out from the rest of the Pseudoviridae on the basis of Gag length. Gag averages 656 amino acids in length for the five Agroviruses (ranging from 541 to 808), whereas the remaining family members average 329 amino acids. An internal domain of Gag accounts for most of the size difference. This helix-forming region is located downstream of the central RB and lacks sequence similarity with other family members.



Fig. 2. Structural organization of Gag. Comparison of Gag between the Agroviruses and other Pseudoviridae. Coding regions for Gag (bars) are drawn to scale. Beta-sheets (arrows) and alpha helices (cylinders) were predicted by JPred2 (Cuff and Barton 2000). Shading between bars denotes regions of extended amino acid sequence homology. Boxes A, B, and C are depicted as sequence logos to show residues conserved in the majority of family members. Unlabeled boxes represent approximate locations of the RB Cx2Cx3GHx4C. Note that the number and location of these motifs in the Agroviruses are variable, ranging from one to three copies.

 
Despite the differences observed for Agrovirus Gag, there are clear constraints on this protein across the family as a whole. The N-terminal half of Gag contains three conserved amino acid sequence domains found in all Pseudoviridae. The first domain is centered around a nearly invariant tryptophan residue (fig. 2, domain A). Two larger conserved regions follow this domain (fig. 2, domains B and C) and are characterized by regularly spaced hydrophobic residues. This suggests that these regions form surfaces for protein-protein interactions, such as leucine zippers; however, predictions with the program MultiCoil failed to indicate that they form the coiled-coils seen in leucine zippers and certain other interaction domains. The only predicted coiled-coil domains are located in the α-helix-rich region that is unique to Agrovirus Gag.

pol
pol encodes the enzymatic functions required for replication. Conserved amino acid sequence motifs for PR, IN, and RT were used as the starting point to delimit boundaries of these enzymes. Ty1 PR cleavage sites, which are known, were also used to set approximate boundaries for each enzyme (Garfinkel et al. 1991; Moore and Garfinkel 1994; Merkulov et al. 1996). A caveat that should be mentioned is that Ty1 is among the most divergent family members. This is evident in phylogenetic analyses that include RTs from outside Pseudoviridae (Eickbush 1994).

Protease
PR is the first enzyme encoded by pol, located at the N-terminus. It is required to release the other enzymes from the Pol precursor and is involved in processing of Gag (Gulnik, Erickson, and Xie 2000). Retroelement aspartic proteases are characterized by a D(S/T)G motif at the catalytic site. We defined the PR N-terminus as the nearly invariant tryptophan, three residues upstream of the catalytic aspartate (fig. 3A) although, as mentioned previously, this likely underestimates the length of the N-terminus by approximately 30 residues. The C-terminus of PR was defined as the last conserved region upstream of the Ty1 PR-IN cleavage site. Family members share no obvious conserved features or physicochemical properties at the location of the Ty1 PR cleavage sites. These criteria result in a PR that has 109–120 amino acid residues. PR sequences do not strongly differentiate relationships among Pseudoviridae (fig. 3C) but do group Agroviruses together.
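The operational definition of the PR N-terminus given above (a tryptophan three residues upstream of the aspartate of the D(S/T)G motif) can be expressed as a short sketch. The function name, the returned dictionary, and the toy sequences are hypothetical, and taking the first D(S/T)G match is a simplifying assumption: in a real polyprotein a chance match could occur upstream of the true active site, so a result would need checking against an alignment.

```python
import re
from typing import Optional

CATALYTIC_SITE = re.compile(r"D[ST]G")  # D(S/T)G motif from the text


def locate_pr_start(protein: str) -> Optional[dict]:
    """Locate the catalytic aspartate and the tryptophan three residues upstream.

    Returns None if no D(S/T)G motif is present. The "n_terminus" entry is None
    when the residue three positions upstream is not tryptophan (the text calls
    this residue "nearly invariant", so it may be absent).
    """
    match = CATALYTIC_SITE.search(protein)  # assumption: first match is the active site
    if match is None:
        return None
    asp = match.start()
    trp = asp - 3
    has_trp = trp >= 0 and protein[trp] == "W"
    return {"catalytic_asp": asp, "n_terminus": trp if has_trp else None}


# Toy sequences, not taken from any real element.
print(locate_pr_start("MAAWLLDSGAT"))
print(locate_pr_start("MAAFLLDTGAT"))
```

Indices are zero-based positions in the input string.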

Because Pseudoviridae PR sequences were divergent, we took advantage of the 3D structural information available for some species of Retroviridae (Gulnik, Erickson, and Xie 2000). Sequences were sufficiently similar to enable conserved motifs to be mapped onto the dimeric form of the HIV-1 enzyme (fig. 3B). The conserved glycine-rich region (shaded orange) is part of a flexible loop that works cooperatively to coordinate the substrate in the catalytic site (shaded red). When substrate is not present, PR undergoes a major conformational shift wherein the locking loops are separated from each other, allowing substrates access to the catalytic core. The remaining homology region (shaded blue) is largely hydrophobic and contributes to the protein core. A central, highly conserved proline residue is located at the point where this segment folds back sharply upon itself.

Integrase
IN binds and inserts the retroelement cDNA into host chromosomal DNA. IN is characterized by three domains: HHCC, a catalytic core (the DD35E motif), and a poorly conserved C-terminal domain (Haren, Ton-Hoang, and Chandler 1999). The N-terminus of Pseudoviridae IN was defined as beginning four amino acid residues upstream of the first histidine in the HHCC motif. This position is several amino acids downstream from the known Ty1 N-terminal cleavage site. The highly conserved zinc-binding motif and central catalytic core regions differ very little between Pseudoviridae and other LTR retroelement INs (Haren, Ton-Hoang, and Chandler 1999). Conservation in the catalytic domain terminates approximately 120 amino acids downstream of the glutamate in the DD35E motif (fig. 4A). The C-terminus is defined as beginning at this point and extending to RT. This likely overestimates the length of this region because the actual boundary between IN and RT is unclear. Phylogenetic analyses were performed using the catalytic core domain (fig. 4D), and element clusters largely correspond to those observed for RT (fig. 1).



Fig. 4. Pseudoviridae integrases. (A) Structural organization of IN. The bars depict INs of the Pseudoviridae and (Meta/Retro)viridae families. The N-terminal HHCC domain and the catalytic DD35E domain are labeled. The C-terminus region is marked by slanted lines. Specific features in the C-terminus are as follows: the GKGY motif; the proline-rich region (shaded box) that surrounds the Agrovirus-specific ILGD motif; the (Meta/Retro)viridae GPF/Y motif; the chromodomain (when present). The aligned sequences begin with the conserved glutamate residue in the catalytic domain and include downstream regions shared by all INs. The top, middle, and bottom sets of sequences are representatives from the Pseudoviridae, (Meta/Retro)viridae with a GPF/Y motif, and (Meta/Retro)viridae lacking a GPF/Y motif, respectively. Numbers indicate gaps introduced into the alignment. Asterisked columns indicate residues with conserved positive charge. An ellipsis followed by a number indicates amino acid residues remaining in IN. Plus symbols at the end of the alignments denote presence of the ILGD motif (in the Pseudoviridae) or chromodomains (in the (Meta/Retro)viridae). (B) Selected motifs in Pseudoviridae C-termini. Boxes indicate the location of the Agrovirus ILGD motif and the three tandem copies of a short motif specific to copia. (C) Alignment of residues comprising the ILGD motif. (D) Neighbor-joining tree (unrooted, bootstrapped 10,000 replicates) based on an IN alignment extending from the HHCC N-terminal domain through the end of the common region depicted in A. The Agrovirus clade is shaded. Letters in parentheses after element names indicate those from nonplant hosts: G, green algae; F, fungi; I, insects.

 
Recent studies of Metaviridae and Retroviridae identified a conserved motif of unknown function in the IN C-terminus designated the GPF/Y motif (Malik and Eickbush 1999). Some Metaviridae INs also terminate in a chromodomain. Neither the GPF/Y motif nor a chromodomain was detected in any member of the Pseudoviridae. In contrast to INs from other families, sequence homology among the Pseudoviridae extends approximately 60 residues downstream from the catalytic domain (fig. 4A). We designated this region as the GKGY motif with reference to four highly conserved residues. The GKGY motif ends at a location corresponding to the beginning of the GPF/Y motif in retroelements from other families. The GKGY motif is apparently unique to the Pseudoviridae, as similarity searches failed to identify related sequences in the databases.

Although the remainder of the IN C-terminus lacks high sequence conservation, there are some shared features. The region is enriched in proline (8.5%), asparagine (6.8%), serine-threonine (12%/7.4%), and aspartate-glutamate (7%/9.6%). Numerous short runs of charged residues, such as lysines and glutamates, are also evident. A tandem duplication is located in the C-terminus of copia, further suggesting that this region can tolerate considerable variability (fig. 4B). The C-termini of most Agroviruses share a short, 14-residue motif with four invariant residues, ILGD (fig. 4B). The functional significance, if any, of this motif is unknown. Several fungal elements (Ty1, Ty4, and Ty5 of S. cerevisiae and Tca2 of Candida albicans) have comparatively large and extended C-termini. Ty1 and Ty5 are known to target their integration to specific locations (Voytas and Boeke 2002). The targeting function has been mapped to the C-terminus of Ty5 IN (Xie et al. 2001), and Ty1 has a nuclear localization signal in this region (Kenna et al. 1998; Moore, Rinckel, and Garfinkel 1998).
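Residue enrichment figures like those quoted above are percentages of each amino acid within a region. The sketch below shows one straightforward way to compute them; the function name and the toy sequence are illustrative, and whether the published values were pooled over all sequences or averaged per element is not stated in the text, so no attempt is made to reproduce them.

```python
from collections import Counter


def percent_composition(protein: str) -> dict[str, float]:
    """Percentage of each residue type in a protein sequence."""
    counts = Counter(protein.upper())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty sequence")
    return {aa: 100.0 * n / total for aa, n in counts.items()}


# Toy 10-residue fragment, not taken from any real integrase.
toy = "PPNSTDEKKK"
comp = percent_composition(toy)
for aa in "PNSTDE":
    print(aa, f"{comp.get(aa, 0.0):.1f}%")
```

Comparing such percentages with the composition of the whole protein, or of a reference protein set, is what would indicate enrichment.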

Reverse Transcriptase
The carboxy-terminal domain of pol encodes RT and its associated RNase H subdomain. These enzyme activities generate a cDNA copy of the retroelement from genomic mRNA (Telesnitsky and Goff 1997). Previous studies have shown that RT is the most conserved retroelement-coding region (Xiong and Eickbush 1988). RT immediately follows the IN C-terminus and continues through RNase H to the end of pol. Analysis of core data set RTs did not reveal any new observations regarding function or evolution beyond those described in previous sections (fig. 1) and other RT studies (Malik and Eickbush 2001).

envlike ORFs
Three Agroviruses have envlike ORFs downstream of pol (Endovir1-1, SIRE-1, ToRTL1) (Laten, Majumdar, and Gaucher 1998; Kapitonov and Jurka 1999; Peterson-Burch et al. 2000). The Endovir1-1 ORF shares significant amino acid similarity with the corresponding ORF of SIRE-1 (fig. 5). Significant similarity is also observed for ToRTL1 but over a much shorter span. Like retroviral env genes, the SIRE-1 and Endovir1-1 ORFs have central transmembrane domains. ToRTL1 and SIRE-1 have a transmembrane domain located at or near the N-terminus. The C-terminal halves of these proteins are generally rich in secondary structural features, particularly alpha helices. BLAST searches of the ORFs did not identify any related proteins in the GenBank databases. It should be noted that envlike ORFs could not be discerned in some members of the Agrovirus (e.g., PREM-2 and Opie-2). This may be attributed to mutations or deletions that obscure this coding region in these elements.



Fig. 5. Agrovirus envlike-coding regions. The envlike ORF is represented by white bars and is drawn to scale. Black lines depict noncoding sequences between pol and the start of the envlike ORF. Regions of amino acid similarity between elements are connected by shading. Percentages represent the total amino acid similarity over the shaded regions. The numbers of amino acids in the envlike ORFs are given for each element. Predicted features are denoted as follows: alpha helices, dark gray boxes; beta sheets, arrows; transmembrane domains, slanted line boxes; splice acceptor site (P > 0.9), triangle.

 
The location of the envlike ORFs after gag-pol raises questions as to how they might be expressed. The ORFs are separated by 27 (SIRE-1) to 1,031 (Endovir1-1) nucleotides from gag-pol, and these spacer regions are rife with stop codons and frameshifts. Most retroviral env genes are expressed from a spliced, subgenomic transcript. Aside from ToRTL1 (fig. 5), there are no strong splice acceptors (P > 0.9) within the region spanning the last 100 nucleotides of RH to the end of the envlike ORF. Much weaker splice acceptors, however, do exist in this region, which may result in less efficient splicing.


    Discussion
A significant fraction of eukaryotic genomes is composed of sequences generated through reverse transcription. We used whole genome sequences and GenBank to undertake a comprehensive analysis of the Pseudoviridae, a widespread family of LTR retroelements. Our core data set consisted of 32 Pseudoviridae from 18 host species, including plants, animals, and fungi. Through a comparative analysis of Pseudoviridae-coding sequences, we hoped to gain insight into the mechanisms by which they replicate and the evolutionary events that have enabled them to successfully colonize eukaryotic genomes.

Familywide Features of Coding Sequences
In the Pseudoviridae, Gag and Pol are typically encoded by a single ORF. Only the S. cerevisiae Ty1 and Ty4 elements clearly have gag and pol separated by a frameshift (+1). Opie-2 and PREM-2 of maize are the only other candidates for separate gag and pol genes, but sequence degeneracy makes it difficult to assign reading frames for these elements. The prevalence of a single gag-pol-coding region differs from most (Meta/Retro)viridae, where these genes are usually separated by a stop codon or a -1 frameshift (Vogt 1997). Translational recoding regulates Gag and Pol stoichiometry for most (Meta/Retro)viridae, guaranteeing the high levels of Gag required for proper virion assembly (Gesteland and Atkins 1996). Of the single-ORF Pseudoviridae for which the stoichiometry of Gag and Pol has been examined, Gag preponderance is ensured by differential splicing (copia) and perhaps by increased Pol turnover (Ty5) (Brierley and Flavell 1990; Yoshioka et al. 1990; Irwin and Voytas 2001). Differential protein stability has been well documented as the mechanism regulating Gag and Pol levels for Tf1, a single-ORF Metavirus (Atwood, Lin, and Levin 1996). It is likely that the Pseudoviridae generally use mechanisms other than translational recoding to regulate gag-pol expression.

Gag
Gag proteins are typically highly divergent, yet Pseudoviridae Gag shows discernible conservation between family members at both the primary and secondary structural levels. We identified three sequence domains featuring several interspersed, highly conserved amino acid residues. In Ty1, mutations in domain C cause transposition phenotypes and disrupt VLP formation and morphology (Braiterman et al. 1994; Monokian, Braiterman, and Boeke 1994; Martin-Rendon et al. 1996). Like the (Meta/Retro)viridae, Pseudoviridae Gag proteins have an RB characteristic of nucleocapsid. The signature of this motif differs from the (Meta/Retro)viridae by an additional conserved glycine residue (Cx2Cx3GHx4C). Exceptions include Ty5, which is missing the glycine, and Tca2 and Ty1, which lack a recognizable RB. In the case of Ty1, however, the corresponding region of Gag has other properties of nucleocapsid proteins (Cristofari, Ficheux, and Darlix 2000). The major homology region found in many (Meta/Retro)viridae Gag proteins was not observed in the Pseudoviridae (Orlinsky et al. 1996), although the region it occupies corresponds approximately to domains B and C. As discussed below, there has also been an expansion of Gag in the Agroviruses.
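The RB signature given above lends itself to a simple pattern search. The following is a minimal sketch, not the procedure used in this study: it scans a protein sequence for the general zinc-knuckle spacing (Cx2Cx4Hx4C) and flags whether each hit also carries the conserved glycine of the Pseudoviridae form (Cx2Cx3GHx4C). The function name and the test peptide are hypothetical.

```python
import re

# Pseudoviridae signature from the text: C-x2-C-x3-G-H-x4-C
PSEUDO_RB = re.compile(r"C.{2}C.{3}GH.{4}C")
# General spacing without the glycine requirement: C-x2-C-x4-H-x4-C
GENERAL_RB = re.compile(r"C.{2}C.{4}H.{4}C")

def find_rb_motifs(protein):
    """Return (start, matched_text, has_glycine) for each RB-like match.

    Matches are non-overlapping and start positions are 0-based.
    """
    hits = []
    for m in GENERAL_RB.finditer(protein.upper()):
        has_glycine = PSEUDO_RB.fullmatch(m.group()) is not None
        hits.append((m.start(), m.group(), has_glycine))
    return hits

# Hypothetical peptide, for illustration only.
print(find_rb_motifs("MKCYNCGKKGHIARDCAA"))
```

Under this scheme an element such as Ty5, described above as missing the glycine, would be expected to match the general pattern with `has_glycine` set to False, whereas elements lacking a recognizable RB would return no hits.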

Pol
Pol showed a high degree of sequence conservation over most of its length. RT and its associated RH have been extensively studied and reviewed elsewhere (Telesnitsky and Goff 1997). We only add that sequence conservation in the Pseudoviridae extends approximately 30 residues upstream of the RT domain 1 (as defined by Xiong and Eickbush 1990). Pseudoviridae PR is the most divergent of the Pol enzymes. Only two amino acids are invariant in the data set, yet many chemically similar residues display regular spacing and are observed as vertical bands of color in the alignment (fig. 3A). Mapping conserved regions onto the structure of the HIV-1 PR identifies a hydrophobic core region that helps maintain and stabilize the enzyme, assuring proper positioning and coordination of interacting regions of the dimer (Gulnik, Erickson, and Xie 2000). The glycine-rich domain of PR forms much of the conformationally mobile, substrate clamping loop. The precisely spaced hydrophobic residues in the glycine-rich region maintain contacts with the hydrophobic backbone, likely placing some limitations on the range of movement of the clamping loops. PR size is very uniform within the family, and the precise spacing of conserved residues suggests that functional constraints limit sequence divergence.

Pseudoviridae IN has an N-terminal metal-binding domain and a central catalytic domain similar to those of the (Meta/Retro)viridae (Haren, Ton-Hoang, and Chandler 1999). These domains are the second most highly conserved regions of pol after RT. The only major structural variation seen in pol occurs in the carboxy-terminus of IN. Using the conserved glutamate of the DD35E motif as a reference point, conservation extends farther downstream in the Pseudoviridae than in other retroelement families. We refer to this conserved C-terminal domain as the GKGY motif after the four most highly conserved residues. In HIV-1 IN, the C-terminal domain is involved in nonspecific DNA binding and also plays a role in oligomerization coordinated by the zinc-binding motif (Brown 1997). Critical C-terminal residues in HIV-1 IN provide a basic charged platform involved in the nonspecific DNA interaction (Eijkelenboom et al. 1999; Chen et al. 2000). It is interesting that the GKGY motif shows six positions where a positive residue is preserved (asterisked columns in fig. 4A), suggesting that they may play a similar role in DNA binding. The location of the HIV-1 basic residues corresponds approximately to the GKGY motif.
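Columns in which a positive residue is preserved can be picked out of an alignment programmatically. The sketch below is illustrative only: it treats lysine and arginine as the positive residues (whether histidine should be included is left as an assumption), and the three aligned fragments are invented rather than drawn from the alignment in figure 4A.

```python
# Residues counted as positively charged; adding "H" is a judgment call.
BASIC = frozenset("KR")

def conserved_basic_columns(alignment, min_fraction=1.0):
    """Return 0-based column indices where at least min_fraction of rows carry K or R."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows must have the same aligned length")
    columns = []
    for i in range(width):
        basic = sum(row[i] in BASIC for row in alignment)
        if basic / len(alignment) >= min_fraction:
            columns.append(i)
    return columns

# Hypothetical aligned fragments, for illustration only.
rows = ["GKGYRLA", "GRGYKLS", "GKGFRMA"]
print(conserved_basic_columns(rows))
```

Lowering `min_fraction` relaxes the criterion from strict preservation to a majority rule, which may be more appropriate for a data set as divergent as the one described here.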

Some (Meta/Retro)viridae have a conserved C-terminal motif (GPF/Y) not found in the Pseudoviridae (Malik and Eickbush 1999). This motif is located just after the GKGY domain and has also been speculated to play a role in DNA binding. The region C-terminal to the GKGY motif varies considerably in size among the Pseudoviridae. This is particularly evident among the fungal elements. Pseudoviridae IN C-termini are often proline rich, characteristic of some protein-protein interaction domains (Kay, Williamson, and Sudol 2000). Some (Meta/Retro)viridae C-termini have chromodomains, and recent work suggests that chromodomains interact with histones (Nielsen et al. 2001). Although we did not observe any chromodomains in the Pseudoviridae, the C-terminus of Ty5 IN interacts with chromatin, and this interaction is responsible for Ty5's target specificity (Xie et al. 2001). Perhaps the IN C-terminus generally interacts with chromatin to help direct integration.

The Agroviruses
All members of the Retroviridae encode Env, and more recently LTR retroelements with envlike genes have been described in the Metaviridae (Wright and Voytas 1998; Lerat and Capy 1999), the BEL group elements (Bowen and McDonald 1999; Frame, Cutfield, and Poulter 2001), and the Pseudoviridae (Laten, Majumdar, and Gaucher 1998; Kapitonov and Jurka 1999; Peterson-Burch et al. 2000). The D. melanogaster gypsy element envlike gene has been shown to mediate infection, indicating that it encodes a true Env protein (Kim et al. 1994; Song et al. 1994). The envlike genes of some insect Metaviridae and BEL group elements are evolutionarily related to viral env genes (Malik, Henikoff, and Eickbush 2000). This suggests that retroviruses have evolved from retrotransposons through transduction of viral env genes. Of course, this process can occur in reverse, and some endogenous retroelements (e.g., IAP and VL30 elements) have likely originated from viruses that lost env (Boeke and Stoye 1997).

Pseudoviridae with envlike genes include SIRE-1 of soybean, Endovir of A. thaliana, and ToRTL1 of tomato. The envlike ORFs of these elements vary in length from 476 to 668 amino acids and are separated from pol by distances ranging from 27 to over 1,000 nucleotides. Retroviridae env is expressed from a spliced mRNA, and this also appears to be the case for Metaviridae envlike genes (Avedisov and Ilyin 1994; Marsano et al. 2000; Wright and Voytas 2002). Alternative splicing is unlikely to play a role in expression of the Pseudoviridae envlike genes because no conserved splice acceptor sites are evident in the vicinity of this ORF. It is possible that the envlike ORF is expressed from an internal promoter; however, searches failed to identify likely promoter elements or transcription factor-binding sites (data not shown). Alternative strategies for expression include internal ribosome entry and translational bypassing (Gesteland and Atkins 1996; Jackson 2000). The mechanism of expression will ultimately need to be determined experimentally.

Like env of the retroviruses, the envlike genes of the Pseudoviridae have predicted transmembrane domains. The organization of these domains is not conserved among the three elements: ToRTL1 and SIRE-1 have N-terminal transmembrane domains, and SIRE-1 and Endovir have one or two transmembrane domains in the C-terminal halves of their proteins. Comparisons among the three elements showed that the N-terminal halves share the greatest sequence identity, whereas the C-terminal halves are enriched in secondary structural features. We recognize that the Agrovirus data set is limited, and conserved features will be revealed by the characterization of additional envlike ORF-containing elements. Although we imply its involvement in extracellular transmission, an alternative possibility is that the Envlike protein facilitates movement between plant cells. Many plant viruses encode movement proteins that carry out this role, including some with transmembrane domains (Brill et al. 2000; Melcher 2000).

A notable Agrovirus feature is that gag has undergone considerable expansion. Like other Pseudoviridae, Agrovirus Gag has the conserved A, B, C core domains followed by an RB (fig. 2). Agroviruses with envlike ORFs have a second RB. Sequences adjacent to these RBs are variable, preventing determination of which RB is orthologous to the single RB found in other Pseudoviridae. Multiple RBs are also observed in retroviruses, where tandem pairs help package the viral genome within the virion. Two members of the Pseudoviridae contain tandem RBs as seen in the Retroviridae. Endovir1-1 has a tandem pair located in the center of Gag as well as a third RB at the C-terminus. Interestingly, Tpv2-6, which also has a tandem pair of RBs, is a member of a Pseudoviridae clade that is closely related to the Agroviruses. This suggests that duplication of the RB may have occurred before the acquisition of the envlike gene. It should be noted that Agroviruses from monocots without envlike ORFs have only a single, centrally located RB.

The expansion of gag occurred in two regions: between domains A and B and after domain C. In the Agroviruses with envlike ORFs, the larger expansion is located between the two RBs in the C-terminus. This expansion is not an obvious tandem duplication of the first half of gag because we could not detect homology or secondary structural similarity with upstream regions. The role of the expansion region is not known, but its association with the lineage of elements with envlike genes suggests a functional relationship between Gag and the Envlike protein. The C-terminal expansion is highly alpha helical in character, and the coiled-coils found in this region are a common feature of proteins involved in oligomerization. If the envlike ORF mediates infectivity, Gag may facilitate docking and budding of the virion, as is observed in retroviruses (Swanstrom and Wills 1997). Another alternative is that this region acts as a movement protein enabling transport of virions between plant cells.

Evolutionary Trends
Although Pseudoviridae are found in diverse eukaryotes, clearly they have been very successful in colonizing plant genomes. In our survey, we identified 276 distinct A. thaliana Pseudoviridae RTs. Because all other plant elements cluster with A. thaliana clades, the diversity of A. thaliana elements reflects the diversity in other plant species. Some clades of A. thaliana elements do not have other plant homologues in this data set. As additional plant genomes are sequenced, it will be of interest to determine the full complement of plant Pseudoviridae lineages. Outside of plants, relatively few (e.g., S. cerevisiae, D. melanogaster) or no Pseudoviridae were identified (e.g., nematodes and humans). It is difficult to provide a parsimonious explanation for the punctate distribution of these retroelements in eukaryotes. It may be that they originated in plants, where they are ubiquitous, and then moved into other organisms by way of horizontal transfer. This hypothesis is supported by the apparent entry of copia into D. melanogaster by horizontal transfer and the relatively youthful appearance of D. melanogaster retrotransposons (Jordan, Matyunina, and McDonald 1999; Bowen and McDonald 2001). An alternative hypothesis is that Pseudoviridae were lost in some eukaryotic lineages. Clearly, species lacking Pseudoviridae have other successful retroelements (e.g., the Cer elements in nematodes and the LINE-1 elements in humans).

Most elements in the Pseudoviridae RT tree have very long branch lengths and are poorly separated near the tree base. Long branch lengths may be the result of accelerated sequence evolution brought about in part by the error-prone nature of RT (Gabriel and Mules 1999). Widely used amino acid substitution models do not consider this and other factors that may influence repetitive sequence evolution. For example, methylation of repetitive DNAs accelerates mutation rates (Colot and Rossignol 1999). Repetitive sequences in plants are highly methylated, perhaps increasing the observed diversity of the plant elements.

Although we invoke horizontal transfer as a possible explanation for the punctate distribution of elements in the eukaryotes, our analysis did not provide any strong evidence of horizontal transfer or domain swapping. We recognize that we excluded elements sharing greater than 75% amino acid identity from the data set; however, this process only eliminated elements from the same host organism. Some elements did not cluster with others from their host (e.g., Tca2 from C. albicans and Ty1 from S. cerevisiae). In all such cases, separation of the elements was poorly supported by bootstrap analyses, making it difficult to invoke horizontal transfer to explain their relationships. Agroviruses, which might have a means for host cell escape by infection, also cluster together and show no apparent signs of horizontal transfer. Trees generated from sequence domains other than RT generally had the same topology, providing little evidence for swapping of coding regions. Fine-scale organizational analysis of specific coding sequences such as IN also did not indicate swapping or small-scale rearrangements (data not shown). Evidence for horizontal transfer and domain swapping may surface with the identification of additional elements.
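The 75% identity cutoff mentioned above can be sketched as a greedy redundancy filter over pre-aligned sequences. This is an illustration under stated assumptions rather than the procedure actually used: identity is computed over columns in which neither sequence has a gap, elements are considered in input order, and the names and sequences are invented.

```python
def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def filter_redundant(aligned, threshold=75.0):
    """Keep an element only if it is at most `threshold` percent identical
    to every element already kept (greedy, in input order)."""
    kept = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[k]) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical aligned RT fragments, for illustration only.
aligned = {"elemA": "MKTAYIAK", "elemB": "MKTAYIAR", "elemC": "MQSW-LPK"}
print(filter_redundant(aligned))
```

Because the filter is greedy, which member of a redundant pair survives depends on input order; recording the host of each discarded element would allow a check that only same-host elements were removed, as stated above.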

Proposal for a New Genus
The current taxonomic structure of the Pseudoviridae is not suggested by RT-coding sequence phylogenies. The Hemivirus and Pseudovirus are not monophyletic in either unrooted or rooted RT trees (fig. 1, and data not shown). These genera were originally classified on the basis of the primer used for reverse transcription: a cleaved half tRNA for the Hemiviruses or the 3' end of a full tRNA for the Pseudoviruses (Boeke et al. 2000a, 2000b). In the Metaviridae, genus distinctions are based on the presence (Errantivirus) or absence (Metavirus) of an env gene. The discovery of Pseudoviridae with envlike genes suggests a new classification scheme for this family. Throughout this article, we have referred to this lineage as the Agroviruses, and we propose that they represent a third genus of the Pseudoviridae. In addition to the envlike gene, it is clear that the Agroviruses are evolving independently from other members of the family: Agroviruses from both monocots and dicots are monophyletic, and all share an expanded gag gene. Ultimately, study of a functional Agrovirus will be required to understand the biological role for the dramatic coding sequence changes that characterize this distinct lineage of the Pseudoviridae.


    Acknowledgements
This work was supported in part by a graduate fellowship to B.P.-B. from the NSF-IGERT program at Iowa State. This is Journal Paper No. J-19591 of the Iowa Agriculture and Home Economics Experiment Station, Ames, Iowa, Project No. 3383 and was supported by Hatch Act and State of Iowa funds.


    Footnotes
 
Thomas Eickbush, Reviewing Editor

Keywords: retrotransposon, Pseudoviridae, gag, pol, env

Address for correspondence and reprints: Daniel F. Voytas, Department of Zoology & Genetics, 2208 Molecular Biology Building, Iowa State University, Ames, Iowa 50011. voytas@iastate.edu


    References

    Altschul S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, 1997 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Nucleic Acids Res 25:3389-3402

    Atwood A., J. H. Lin, H. L. Levin, 1996 The retrotransposon Tf1 assembles virus-like particles that contain excess Gag relative to integrase because of a regulated degradation process Mol. Cell. Biol 16:338-346

    Avedisov S. N., Y. V. Ilyin, 1994 Identification of spliced RNA species of Drosophila melanogaster gypsy retrotransposon. New evidence for retroviral nature of the gypsy element FEBS Lett 350:147-150

    Bailey T. L., C. Elkan, 1994 Fitting a mixture model by expectation maximization to discover motifs in biopolymers Proc. Int. Conf. Intell. Syst. Mol. Biol 2:28-36

    Boeke J. D., T. Eickbush, S. B. Sandmeyer, D. F. Voytas, 2000a. Metaviridae Pp. 359–367 in M. H. V. van Regenmortel, C. M. Fauquet, D. H. L. Bishop, E. B. Carsten, M. K. Estes, S. M. Lemon, J. Maniloff, M. A. Mayo, D. J. McGeoch, C. R. Pringle, and R. B. Wickner, eds. Virus taxonomy: seventh report of the international committee on taxonomy of viruses. Academic Press, New York

    Boeke J. D., T. Eickbush, S. B. Sandmeyer, D. F. Voytas, 2000b. Pseudoviridae Pp. 349–357 in M. H. V. van Regenmortel, C. M. Fauquet, D. H. L. Bishop, E. B. Carsten, M. K. Estes, S. M. Lemon, J. Maniloff, M. A. Mayo, D. J. McGeoch, C. R. Pringle, and R. B. Wickner, eds. Virus taxonomy: seventh report of the international committee on taxonomy of viruses. Academic Press, New York

    Boeke J. D., J. P. Stoye, 1997 Retrotransposons, endogenous retroviruses, and the evolution of retroelements Pp. 343–436 in J. M. Coffin, S. H. Hughes, and H. E. Varmus, eds. Retroviruses. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York

    Bowen N. J., J. F. McDonald, 1999 Genomic analysis of Caenorhabditis elegans reveals ancient families of retroviral-like elements Genome Res 9:924-935

    Bowen N. J., J. F. McDonald, 2001 Drosophila euchromatic LTR retrotransposons are much younger than the host species in which they reside Genome Res 11:1527-1540

    Braiterman L. T., G. M. Monokian, D. J. Eichinger, S. L. Merbs, A. Gabriel, J. D. Boeke, 1994 In-frame linker insertion mutagenesis of yeast transposon Ty1: phenotypic analysis Gene 139:19-26

    Brierley C., A. J. Flavell, 1990 The retrotransposon copia controls the relative levels of its gene products post-transcriptionally by differential expression from its two major mRNAs Nucleic Acids Res 18:2947-2951

    Brill L. M., R. S. Nunn, T. W. Kahn, M. Yeager, R. N. Beachy, 2000 Recombinant tobacco mosaic virus movement protein is an RNA-binding, alpha-helical membrane protein Proc. Natl. Acad. Sci. USA 97:7112-7117

    Brown P. O., 1997 Integration Pp. 161–204 in J. M. Coffin, S. H. Hughes, and H. E. Varmus, eds. Retroviruses. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York

    Chen J. C., J. Krucinski, L. J. Miercke, J. S. Finer-Moore, A. H. Tang, A. D. Leavitt, R. M. Stroud, 2000 Crystal structure of the HIV-1 integrase catalytic core and C-terminal domains: a model for viral DNA binding Proc. Natl. Acad. Sci. USA 97:8233-8238

    Colot V., J. L. Rossignol, 1999 Eukaryotic DNA methylation as an evolutionary device Bioessays 21:402-411

    Cristofari G., D. Ficheux, J. L. Darlix, 2000 The GAG-like protein of the yeast Ty1 retrotransposon contains a nucleic acid chaperone domain analogous to retroviral nucleocapsid proteins J. Biol. Chem 275:19210-19217

    Cuff J. A., G. J. Barton, 2000 Application of multiple sequence alignment profiles to improve protein secondary structure prediction Proteins 40:502-511

    Eickbush T. H., 1994 Origin and evolutionary relationships of retroelements Pp. 121–157 in S. S. Morse, ed. The evolutionary biology of viruses. Raven Press, New York

    Eijkelenboom A. P., R. Sprangers, K. Hard, R. A. Puras Lutzke, R. H. Plasterk, R. Boelens, R. Kaptein, 1999 Refined solution structure of the C-terminal DNA-binding domain of human immunovirus-1 integrase Proteins 36:556-564

    Frame I. G., J. F. Cutfield, R. T. Poulter, 2001 New BEL-like LTR-retrotransposons in Fugu rubripes, Caenorhabditis elegans, and Drosophila melanogaster Gene 263:219-230

    Gabriel A., E. H. Mules, 1999 Fidelity of retrotransposon replication Ann. NY Acad. Sci 870:108-118

    Garfinkel D. J., A. M. Hedge, S. D. Youngren, T. D. Copeland, 1991 Proteolytic processing of pol-TYB proteins from the yeast retrotransposon Ty1 J. Virol 65:4573-4581

    Gesteland R. F., J. F. Atkins, 1996 Recoding: dynamic reprogramming of translation Annu. Rev. Biochem 65:741-768

    Gulnik S., J. W. Erickson, D. Xie, 2000 HIV protease: enzyme function and drug resistance Vitam. Horm 58:213-256

    Haren L., B. Ton-Hoang, M. Chandler, 1999 Integrating DNA: transposases and retroviral integrases Annu. Rev. Microbiol 53:245-281

    Henikoff S., J. G. Henikoff, 1992 Amino acid substitution matrices from protein blocks Proc. Natl. Acad. Sci. USA 89:10915-10919

    Henikoff S., J. G. Henikoff, W. J. Alford, S. Pietrokovski, 1995 Automated construction and graphical presentation of protein blocks from unaligned sequences Gene 163:GC17-GC26

    Hogue C. W., 1997 Cn3D: a new generation of three-dimensional molecular structure viewer Trends Biochem. Sci 22:314-316

    Hunter E., J. Casey, B. Hahn, M. Hayami, B. Korber, R. Kurth, J. Neil, A. Rethwilm, P. Sonigo, J. Stoye, 2000 Retroviridae Pp. 369–387 in M. H. V. van Regenmortel, C. M. Fauquet, D. H. L. Bishop, E. B. Carsten, M. K. Estes, S. M. Lemon, J. Maniloff, M. A. Mayo, D. J. McGeoch, C. R. Pringle, and R. B. Wickner, eds. Virus taxonomy: seventh report of the international committee on taxonomy of viruses. Academic Press, New York

    Irwin P. A., D. F. Voytas, 2001 Expression and processing of proteins encoded by the Saccharomyces retrotransposon Ty5 J. Virol 75:1790-1797

    Jackson R. J., 2000 A comparative view of initiation site selection mechanisms. Pp. 127–183 in N. Sonenberg, J. W. B. Hershey, and M. B. Mathews, eds. Translational control of gene expression. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York

    Jones D. T., 1999 Protein secondary structure prediction based on position-specific scoring matrices J. Mol. Biol 292:195-202

    Jordan I. K., L. V. Matyunina, J. F. McDonald, 1999 Evidence for the recent horizontal transfer of long terminal repeat retrotransposon Proc. Natl. Acad. Sci. USA 96:12621-12625

    Kapitonov V. V., J. Jurka, 1999 Molecular paleontology of transposable elements from Arabidopsis thaliana Genetica 107:27-37

    Kay B. K., M. P. Williamson, M. Sudol, 2000 The importance of being proline: the interaction of proline-rich motifs in signaling proteins with their cognate domains FASEB J 14:231-241

    Kenna M. A., C. B. Brachmann, S. E. Devine, J. D. Boeke, 1998 Invading the yeast nucleus: a nuclear localization signal at the C terminus of Ty1 integrase is required for transposition in vivo Mol. Cell. Biol 18:1115-1124

    Kim A., C. Terzian, P. Santamaria, A. Pelisson, N. Prud'homme, A. Bucheton, 1994 Retroviruses in invertebrates: the gypsy retrotransposon is apparently an infectious retrovirus of Drosophila melanogaster Proc. Natl. Acad. Sci. USA 91:1285-1289

    Kim J. M., S. Vanguri, J. D. Boeke, A. Gabriel, D. F. Voytas, 1998 Transposable elements and genome organization: a comprehensive survey of retrotransposons revealed by the complete Saccharomyces cerevisiae genome sequence Genome Res 8:464-478

    Kumar S., K. Tamura, I. B. Jakobsen, M. Nei, 2001 MEGA2: molecular evolutionary genetics analysis software, Arizona State University, Tempe, Ariz

    Laten H. M., A. Majumdar, E. A. Gaucher, 1998 SIRE-1, a copia/Ty1-like retroelement from soybean, encodes a retroviral envelope-like protein Proc. Natl. Acad. Sci. USA 95:6897-6902

    Lerat E., P. Capy, 1999 Retrotransposons and retroviruses: analysis of the envelope gene Mol. Biol. Evol 16:1198-1207

    Malik H. S., T. H. Eickbush, 1999 Modular evolution of the integrase domain in the Ty3/Gypsy class of LTR retrotransposons J. Virol 73:5186-5190

    Malik H. S., T. H. Eickbush, 2001 Phylogenetic analysis of ribonuclease H domains suggests a late, chimeric origin of LTR retrotransposable elements and retroviruses Genome Res 11:1187-1197

    Malik H., S. Henikoff, T. Eickbush, 2000 Poised for contagion: evolutionary origins of the infectious abilities of invertebrate retroviruses Genome Res 10:1307-1318

    Manninen I., A. H. Schulman, 1993 BARE-1, a copia-like retroelement in barley (Hordeum vulgare L) Plant Mol. Biol 22:829-846

    Marracci S., R. Batistoni, G. Pesole, L. Citti, I. Nardi, 1996 Gypsy/Ty3-like elements in the genome of the terrestrial salamander Hydromantes (Amphibia, Urodela) J. Mol. Evol 43:584-593

    Marsano R. M., R. Moschetti, C. Caggese, C. Lanave, P. Barsanti, R. Caizzi, 2000 The complete Tirant transposable element in Drosophila melanogaster shows a structural relationship with retrovirus-like retrotransposons Gene 247:87-95

    Martin-Rendon E., G. Marfany, S. Wilson, D. J. Ferguson, S. M. Kingsman, A. J. Kingsman, 1996 Structural determinants within the subunit protein of Ty1 virus-like particles Mol. Microbiol 22:667-679

    McClure M. A., T. K. Vasi, W. M. Fitch, 1994 Comparative analysis of multiple protein-sequence alignment methods Mol. Biol. Evol 11:571-592

    Melcher U., 2000 The ‘30K’ superfamily of viral movement proteins J. Gen. Virol. 81 Pt 1:257-266

    Merkulov G. V., K. M. Swiderek, C. B. Brachmann, J. D. Boeke, 1996 A critical proteolytic cleavage site near the C-terminus of the yeast retrotransposon Ty1 Gag protein J. Virol 70:5548-5556

    Monokian G. M., L. T. Braiterman, J. D. Boeke, 1994 In-frame linker insertion mutagenesis of yeast transposon Ty1: mutations, transposition and dominance Gene 139:9-18

    Moore S. P., D. J. Garfinkel, 1994 Expression and partial purification of enzymatically active recombinant Ty1 integrase in Saccharomyces cerevisiae Proc. Natl. Acad. Sci. USA 91:1843-1847

    Moore S. P., L. A. Rinckel, D. J. Garfinkel, 1998 A Ty1 integrase nuclear localization signal required for retrotransposition Mol. Cell. Biol 18:1105-1114

    Nielsen A. L., M. Oulad-Abdelghani, J. A. Ortiz, E. Remboutsika, P. Chambon, R. Losson, 2001 Heterochromatin formation in mammalian cells: interaction between histones and HP1 proteins Mol. Cell 7:729-739

    Orlinsky K. J., J. Gu, M. Hoyt, S. Sandmeyer, T. M. Menees, 1996 Mutations in the Ty3 major homology region affect multiple steps in Ty3 retrotransposition J. Virol 70:3440-3448

    Peterson-Burch B. D., D. A. Wright, H. M. Laten, D. F. Voytas, 2000 Retroviruses in plants? Trends Genet 16:151-152

    SanMiguel P., A. Tikhonov, Y. K. Jin, et al. (11 co-authors) 1996 Nested retrotransposons in the intergenic regions of the maize genome Science 274:765-768

    Schneider T. D., R. M. Stephens, 1990 Sequence logos: a new way to display consensus sequences Nucleic Acids Res 18:6097-6100

    Song S. U., T. Gerasimova, M. Kurkulos, J. D. Boeke, V. G. Corces, 1994 An env-like protein encoded by a Drosophila retroelement: evidence that gypsy is an infectious retrovirus Genes Dev 8:2046-2057

    Swanstrom R., J. W. Wills, 1997 Synthesis, assembly and processing of viral proteins Pp. 263–334 in J. M. Coffin, S. H. Hughes, and H. E. Varmus, eds. Retroviruses. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York

    Swofford D., 1999 PAUP 4.0. 4.0 edition Smithsonian Institution, Washington D.C

    Tatusova T. A., T. L. Madden, 1999 BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences FEMS Microbiol. Lett 174:247-250

    Telesnitsky A., S. P. Goff, 1997 Reverse transcriptase and the generation of retroviral DNA Pp. 121–160 in J. M. Coffin, S. H. Hughes, and H. E. Varmus, eds. Retroviruses. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York

    Terol J., M. C. Castillo, M. Bargues, M. Perez-Alonso, R. de Frutos, 2001 Structural and evolutionary analysis of the copia-like elements in the Arabidopsis thaliana genome Mol. Biol. Evol 18:882-892

    Thompson J. D., T. J. Gibson, F. Plewniak, F. Jeanmougin, D. G. Higgins, 1997 The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools Nucleic Acids Res 25:4876-4882

    Tusnady G. E., I. Simon, 1998 Principles governing amino acid composition of integral membrane proteins: application to topology prediction J. Mol. Biol 283:489-506

    Vogt V. M., 1997 Retroviral virions and genomes Pp. 27–70 in J. M. Coffin, S. H. Hughes, and H. E. Varmus, eds. Retroviruses. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York

    Voytas D. F., J. D. Boeke, 2002 Ty1 and Ty5 of Saccharomyces cerevisiae. Pp. 631–662 in N. L. Craig, R. Craigie, M. Gellert, and A. M. Lambowitz, eds. Mobile DNA II American Society for Microbiology, Washington, D.C.

    Wolf E., P. S. Kim, B. Berger, 1997 MultiCoil: a program for predicting two- and three-stranded coiled coils Protein Sci 6:1179-1189

    Wright D. A., D. F. Voytas, 1998 Potential retroviruses in plants: Tat1 is related to a group of Arabidopsis thaliana Ty3/gypsy retrotransposons that encode envelope-like proteins Genetics 149:703-715

    Wright D. A., D. F. Voytas, 2002 Athila4 of Arabidopsis and Calypso of soybean define a lineage of endogenous plant retroviruses Genome Res 12:122-131

    Xie W., X. Gai, Y. Zhu, D. C. Zappulla, R. Sternglanz, D. F. Voytas, 2001 Targeting of the yeast Ty5 retrotransposon to silent chromatin is mediated by interactions between integrase and Sir4p Mol. Cell. Biol 21:6606-6614

    Xiong Y., T. H. Eickbush, 1988 Similarity of reverse transcriptase-like sequences of viruses, transposable elements, and mitochondrial introns Mol. Biol. Evol 5:675-690

    Xiong Y., T. H. Eickbush, 1990 Origin and evolution of retroelements based upon their reverse transcriptase sequences EMBO J 9:3353-3362

    Yoshioka K., H. Honma, M. Zushi, S. Kondo, S. Togashi, T. Miyake, T. Shiba, 1990 Virus-like particle formation of Drosophila copia through autocatalytic processing EMBO J 9:535-541

Accepted for publication June 4, 2002.