Birth and Death of Orphan Genes in Rickettsia

Haleh Amiri, Wagied Davids and Siv G. E. Andersson

Department of Molecular Evolution, University of Uppsala, Uppsala, Sweden

Correspondence: E-mail: siv.andersson{at}ebc.uu.se.


    Abstract
 
The origin and evolution of the thousands of species-specific genes with unknown functions, the so-called orphan genes, has been a mystery. Here, we have studied the rates and patterns of orphan sequence evolution, using the Rickettsia as our reference system. Of the Rickettsia conorii orphans examined in this study, 80% were found to be short gene fragments or fusions of short segments from neighboring genes. We reconstructed the putative sequences of the full-length genes from which the short orphan fragments are thought to have originated. One of the genes thus reconstructed displays weak similarity to the ankyrin-repeat protein family, an identification that is strongly supported by comparative molecular modeling. Studies of the patterns of gene fragmentation underscore the importance of short repeated sequences as targets for recombination events that result in sequence loss and the formation of short, transient open reading frames. Our analysis demonstrates that gene sequences present in the common ancestor can be inferred even in cases when no full-length open reading frame is present in any of the contemporary species. Such reconstructions support the identification of lost protein functions and hint at important lifestyle changes.

Key Words: intergenic DNA • molecular evolution • phylogeny • Rickettsia


    Introduction
 
Bacterial and archaeal genomes normally contain 6% to 14% noncoding DNA (Rogozin et al. 2002). This fraction of DNA consists of (1) pseudogenes with strong similarity to full-length genes, (2) extensively degraded gene remnants with weak similarity to full-length genes, and (3) spacer sequences for which no homolog can be recognized. The coding fraction consists of (1) genes with assigned functions, and (2) hypothetical genes that are conserved among species ("conserved hypotheticals") but have no identified function. In addition, most genomes contain genus-specific and/or species-specific open reading frames (ORFs) with no sequence similarity to previously identified genes, the so-called orphans (sometimes written as ORFans [Siew and Fischer 2003]). We use the word "orphans" to refer to ORFs without homologs and the term "orphan genes" for the subset of orphans that are functional genes. We also use the term "genus-specific orphans" for cases where a sequence has homologs within a genus but not in other taxa. Since orphans may encode novel metabolic functions that distinguish one species from another, they are prime candidates for further experimental studies. For this reason, it is important prior to the start of experimental work to distinguish as accurately as possible between orphan genes and spurious ORFs.

The source DNA from which the orphans have originated has been a mystery ever since the first genome sequences were made public. As more and more genome sequence data is collected, the fraction of orphans is gradually decreasing. However, despite the completion of nearly 100 genomes, the percentage of annotated orphans is still high, around 15% in most genomes (Siew and Fischer 2003). A recent analysis of 60 microbial genomes identified a total of more than 20,000 orphans (Siew and Fischer 2003), implying that if all correspond to functional genes, a substantial number of new protein functions remain to be discovered. Do these sequences really code for functional proteins, or do they correspond to very rapidly evolving protein families, and if so, why? Since homologs and intermediate sequences are normally not available for comparative analysis, the widespread occurrence of orphans has been an enigma (Siew and Fischer 2003).

Trivial explanations for the large numbers of orphans are that they correspond to incorrectly annotated sequences or that they belong to families of highly divergent paralogs of unknown functions (Skovgaard et al. 2001; Siew and Fischer 2003). Another explanation is that orphans represent short, inactivated gene remnants (Fischer and Eisenberg 1999). Because it is difficult to distinguish full-length orphan genes with truly unique functions from orphan fragments with no gene functions, the percentage of genes with novel, species-specific functions is at present unknown.

Experimental studies of mRNA expression patterns may provide hints about the functional status of unique, small ORFs, also called smORFs (Basrai, Hieter, and Boeke 1997). A complication with this approach is that some small RNAs (sRNAs) may be nontranslatable (Chen et al. 2002). Furthermore, damaged gene sequences may express nonfunctional mRNA. For example, a comprehensive survey of pseudogene-like sequences in the genome of Saccharomyces cerevisiae identified more than 200 gene sequences with frameshift mutations and/or premature termination codons, which corresponds to about 3% of the yeast proteome (Harrison et al. 2002). Although most of these mutated ORFs are substantially damaged, with 90% having two or more disruptive mutations, many are transcribed into RNA (Harrison et al. 2002).

Likewise, it has been shown that some of the so-called split genes in Rickettsia conorii produce mRNA despite being divided into multiple internal ORFs (Ogata et al. 2001). In this case, substitution frequencies are enhanced for both expressed and nonexpressed fragments of the split genes, suggesting that the observed expression is most likely a transient stage in the overall gene deterioration process (Davids, Amiri, and Andersson 2002). Apparently, gene expression per se does not signify the functional status of the orphans.

In total, 552 ORFs were identified in the R. conorii genome that do not correspond to ORFs in the R. prowazekii genome, although highly degraded sequence homologs were identified in 229 cases (Ogata et al. 2001). Approximately 400 of the 552 ORFs are not present in any species outside the genus Rickettsia (Ogata et al. 2001). If all of these are correctly annotated, the fraction of genus-specific orphans in R. conorii approaches 30% of the total gene set (Ogata et al. 2001). A few cases of deteriorating orphan genes have been identified previously in Rickettsia species from a set of short ORFs in one species that partially overlap a different set of short ORFs in another species, none of which have an identified function (Andersson and Andersson 1999a, 1999b, 2001). Thus, it is conceivable that a subset of the annotated orphans in R. conorii (and other species) represent misannotations or pseudogenes, as also suggested by the observation that orphans are on average shorter than genes with orthologs in other species (Mira, Klasson, and Andersson 2002).

To understand the origin and evolution of the orphan sequences, we have examined the evolutionary fate of a selected set of genes annotated as orphans in R. conorii (Ogata et al. 2001). The results show that a majority of the orphans are short remnants of longer genes that were present in an ancestor of the modern Rickettsia species. The conclusion is that the fraction of functional orphans is much lower than previously thought. Instead, there is a larger number of damaged genes that are in the process of being purged from these genomes.


    Materials and Methods
 
Isolation, Amplification, and Sequencing of Genomic DNA in Rickettsia
Genomic DNAs from R. typhi strain Wilmington, R. rickettsii strain 84-21C, and R. montana were isolated and purified as described previously (Pretzman et al. 1987). The eight intergenic regions were amplified in two successive rounds of PCR. First, genes flanking the 3' and 5' ends of each region were partially amplified using degenerate primers. Species-specific primers were then designed based on the resulting sequences. These were used in a large set of long-range PCR reactions (AquTaq) to amplify the segments flanked by the primers.

PCR reactions were carried out under standard conditions. Each 50 µl PCR reaction contained 10 to 20 ng DNA, 1 U Taq polymerase (Sigma), and 50 to 100 pmol of each degenerate primer. The PCR reactions were performed on a DNA thermal cycler. The templates for the PCR reactions were genomic DNA from R. typhi, R. montana, and R. rickettsii. The sizes of the PCR products ranged from 500 bp to 1 kb. The PCR program consisted of a hot start (95°C, 3 min; 80°C, 5 min) followed by 30 cycles of 94°C for 1 min, 50°C to 55°C for 1 min, and 72°C for 2 min, and a final extension step of 10 min at 72°C. The amplified PCR products were purified using the QIA-Quick PCR Purification Kit (Qiagen) and sequenced on both strands using the ABI PRISM BigDye Terminator Cycle Sequencing Kit (PerkinElmer). The DNA sequences were analyzed on an ABI PRISM 377 DNA Sequencer (PerkinElmer).

Sequence Analyses
The generated complete sequences from the eight intergenic regions were assembled using the Phred&Phrap assembly program. Sequence similarity searches within the data set as well as against sequences in the public databases were carried out with the Blast program (Altschul et al. 1997). The identified genes, pseudogenes, and fossil-ORFs were aligned using ClustalW (Thompson, Higgins, and Gibson 1994), and editing of the alignments was performed with the aid of Seaview (Galtier, Gouy, and Gautier 1996). Base composition features were analyzed using CODONW (Lloyd and Sharp 1992). Protein structures were predicted by comparative modeling with the aid of Geno3D (Combet et al. 2002).

Definition of Genes and Pseudogenes
In this study, ORFs were defined as genes if they were conserved in size among closely related species and possessed the characteristic G+C content values for Rickettsia genes (Andersson and Sharp 1996). Putative genes in this study are ampG4, gapD, Rco295, and Rco529 in R. conorii, R. rickettsii, and R. montana. Pseudogenes are identified as sequences that showed similarity to a sequence classified as a gene in another species (E < 1e-20) but in which frameshift and substitution mutations to stop codons have started to accumulate. Putative pseudogenes according to this definition are ampG4, Rco295, and Rco529 in R. prowazekii and R. typhi. Two short ORFs, Rco530 and Rco272, are conserved in size among R. conorii, R. rickettsii, and R. montana but have atypically high G+C content values. We have here tentatively assigned Rco530 and Rco272 as pseudogenes, but the final annotation will have to await further experimental studies. Rco530 and Rco272 in R. prowazekii and R. typhi are shorter, degraded versions of these ORFs.

Reconstruction of Gene Sequences of the Common Ancestor
For pseudogenes showing sequence similarity to a full-length gene in another species, we reconstructed the putative sequences present in the common ancestor by aligning the pseudogene sequences translated in three reading frames with the protein sequences of the orthologous, full-length genes. Deletions and insertions were introduced into the pseudogene sequence in such a way that the aligned protein sequences matched as closely as possible.
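The three-frame translation step described above can be illustrated with a minimal Python sketch. It is not the procedure or software used in this study; the function names are hypothetical, and it assumes the standard genetic code and forward-strand frames only. Each translated frame could then be aligned against the protein sequence of the orthologous full-length gene to locate the frameshifts.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the standard code is appropriate for the sequences being examined).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate a DNA string in the given forward frame (0, 1, or 2).

    Stop codons are written as '*'; codons containing ambiguous
    characters are written as 'X'. Trailing partial codons are ignored.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translations(seq):
    """Return the translations of all three forward reading frames."""
    return [translate(seq, frame) for frame in range(3)]


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the Rickettsia data.
    for frame, protein in enumerate(three_frame_translations("ATGAAATAA")):
        print(frame, protein)
```

In a pseudogene with frameshift mutations, successive parts of the ancestral protein appear in different frames, which is why all three translations are compared with the full-length ortholog.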

For segments containing short overlapping but not identical ORFs in closely related species, we also attempted to reconstruct the putative gene sequences of their common ancestor. However, in these cases, no full-length gene was available that could be used as a template for the reconstructions. Instead, we compared the protein sequences of all three possible reading frames and reconstructed the ancestral sequence by manually searching for the longest ORF, given the smallest number of mutations in each of the aligned sequences. The genes reconstructed from short ORFs with no sequence similarity to genes in other species were defined as fossil-ORFs (forf). In total, we have reconstructed genes from overlapping but not identical ORFs in three of the intergenic regions examined here. These have been designated forfEFG.
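As a rough aid to this kind of reconstruction, the sketch below locates the longest stop-free stretch in the three forward frames of a single sequence. It is a simplified illustration only: the reconstruction described above was manual and also minimized the number of inferred mutations across the aligned species, which this code does not attempt. The function name and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_open_stretch(seq):
    """Return (frame, start, end) of the longest stop-free codon run.

    Only the three forward frames are scanned. 'start' and 'end' are
    0-based nucleotide coordinates (end exclusive). Ties are resolved
    in favor of the first run found (lowest frame, leftmost position).
    """
    seq = seq.upper()
    best = (0, 0, 0)
    for frame in range(3):
        run_start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if i - run_start > best[2] - best[1]:
                    best = (frame, run_start, i)
                run_start = i + 3  # next run begins after the stop codon
        # Close the final run at the last complete codon of this frame.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - run_start > best[2] - best[1]:
            best = (frame, run_start, end)
    return best


if __name__ == "__main__":
    # Hypothetical toy sequence: frame 0 is open throughout, whereas
    # frame 1 contains two stop codons.
    print(longest_open_stretch("ATGAAATTAACCGGG"))
```

Applying this to each species separately shows which frame is open over which interval, after which the overlapping open stretches can be compared by eye in the alignment.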

Deletions and insertions were weighted equally, and their frequencies were estimated by the number of changes required to convert the pseudogene sequences of each species into the full-length gene sequence of another species or into the reconstructed sequences in cases where no full-length gene sequence was available in any of the species. These estimates should be taken as only rough approximations of the true frequencies of insertion and deletion mutations.
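The counting scheme can be sketched as follows, assuming a gapped pairwise alignment in which the reference is either the full-length gene or the reconstructed sequence. This is an illustration under stated assumptions, not the authors' code: the function name and the returned field names are hypothetical, each contiguous gap run is treated as one event, and columns in which both sequences carry a gap are assumed to be absent.

```python
import re


def indel_summary(reference_aln, pseudogene_aln):
    """Summarize indel events from two equal-length gapped strings.

    A run of '-' in the pseudogene row is counted as one deletion event;
    a run of '-' in the reference row is counted as one insertion event.
    """
    if len(reference_aln) != len(pseudogene_aln):
        raise ValueError("aligned sequences must have equal length")

    deletions = [len(m.group()) for m in re.finditer(r"-+", pseudogene_aln)]
    insertions = [len(m.group()) for m in re.finditer(r"-+", reference_aln)]
    reference_length = len(reference_aln.replace("-", ""))

    return {
        "deletion_events": len(deletions),
        "deleted_nt": sum(deletions),
        "insertion_events": len(insertions),
        "inserted_nt": sum(insertions),
        # Fraction of the reference sequence lost through deletions.
        "deleted_fraction": (
            sum(deletions) / reference_length if reference_length else 0.0
        ),
    }


if __name__ == "__main__":
    # Hypothetical toy alignment: one 3-nt deletion, one 2-nt insertion.
    print(indel_summary("ATGAAACCCGGGTTT--AAA", "ATG---CCCGGGTTTCCAAA"))
```

Dividing the deleted or inserted nucleotides by the number of events gives the mean size per event. Because the result depends on where the alignment places the gaps, such counts are approximations.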

Protein Structure Predictions
Protein structures were predicted by comparative modeling with the aid of the Structure Prediction Meta Server (Bujnicki et al. 2001). Alignments were obtained from FFAS03 (Rychlewski et al. 2000). ORFeus (K. Ginalski, J. Pas, L. S. Wyrwicz, M. von Grotthuss, J. M. Bujnicki and L. Rychlewski, personal communication) was used for construction of protein homology models. Rasmol (Sayle and Milner-White 1995) was used for visualization of three-dimensional models.

Nucleotide Sequence Accession Numbers
The nucleotide sequences obtained in this study have been given the following GenBank accession numbers: the sequence flanked by Rp388 and dacF in R. typhi, AJ555874; R. rickettsii, AJ555875; R. montana, AJ55873; the sequence flanked by purC and thrS in R. typhi, AJ556137; R. rickettsii, AJ556135; R. montana, AJ556136; the sequence flanked by proS and ruvA in R. typhi, AJ556152; R. rickettsii, AJ556150; R. montana, AJ556151; the sequence flanked by atm and gyrA in R. typhi, AJ556140; R. rickettsii, AJ556139; R. montana, AJ556138; the sequence flanked by rpsD and capM in R. typhi, AJ556144; R. rickettsii, AJ556146; R. montana, AJ556145; the sequence flanked by queA and abcT3 in R. typhi, AJ556149; R. rickettsii, AJ556147; R. montana, AJ556148; the sequence flanked by trxB1 and lgtD in R. typhi, AJ556155; R. rickettsii, AJ556154; R. montana, AJ556153; the sequence flanked by cspA and ksgA in R. typhi, AJ556141; R. rickettsii, AJ556142; R. montana, AJ556143.


    Results
 
Coding Potential of Intergenic Regions in Rickettsia Species
A length distribution plot of 413 orphans in R. conorii shows that these species-specific ORFs are much shorter than the 785 orthologous genes present in both R. conorii and R. prowazekii (fig. 1). To study the origin and evolution of the orphans, we selected for analysis eight genomic segments covering 11 orphans in R. conorii (Ogata et al. 2001) that are located at positions corresponding to intergenic regions longer than 1 kb in the R. prowazekii genome (Andersson et al. 1998). These segments were amplified by PCR in R. typhi, which like R. prowazekii is a member of the typhus group (TG), as well as in R. montana and R. rickettsii, both of which are members of the spotted fever group Rickettsia (SFG) (Roux and Raoult 1995; Roux et al. 1997).



FIG. 1. The gene size distribution profile of 413 orphans with unassigned functions that are uniquely present in R. conorii compared with the gene size distribution profile of 785 orthologous genes present in both R. conorii and R. prowazekii. Lists of orthologous genes have been taken from Davids, Amiri, and Andersson (2002)

 
A comparison of the aligned sequences of each of the eight segments revealed a considerable heterogeneity in size and coding content for these Rickettsia species (table 1). The total sizes of the segments vary from 11,524 bp in R. typhi to 14,470 bp in R. conorii. No complete genes were identified in either R. prowazekii or R. typhi. The corresponding regions in the SFG species R. conorii, R. rickettsii, and R. montana have a coding fraction of 45% to 48%, which is considerably lower than the estimated coding fraction of 82% for the R. conorii genome (Ogata et al. 2001). The overall G+C content values of these segments were 23.7% in the TG and 30.7% in the SFG (table 1). The variation in G+C content within as well as among species is related to the different coding potentials of these segments. It is also in accordance with previously observed differences in G+C content for genes and intergenic DNA in Rickettsia species (Andersson et al. 1998; Ogata et al. 2001).


Table 1 Sizes, Nucleotide Frequencies, and Coding Fractions of the Amplified Intergenic Regions in Five Rickettsia Species.

 
Orphan Genes in the SFG Correspond to Pseudogenes in the TG
Three segments consist of known genes and orphan genes in R. conorii that are conserved in size among members of the SFG (ampG, Rco295, and Rco529) but are present as pseudogenes with multiple termination codons and frameshift mutations in the TG (fig. 2A–C). We observed strong sequence similarities between these genes in the SFG and the corresponding sequences in the TG Rickettsia (E < 1e-20), but no full-length ORFs could be identified in either R. prowazekii or R. typhi (table 2), suggesting that they evolve as pseudogenes in these species. We reconstructed the putative gene sequences present in the common ancestor of the TG as the longest ORFs, given the smallest number of mutations. The G+C content values at the first codon positions are consistently higher than at third codon positions for Rco295 and Rco529 (table 3), as expected for orphans that represent or descend from functional genes. Taken together, this suggests that ampG4, Rco295, and Rco529 were present in the common ancestor of the Rickettsia but that they have started to decay in R. prowazekii and R. typhi.
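The codon-position comparison can be illustrated with a short Python sketch. The base composition analysis in this study was performed with CODONW; the function below is only a hypothetical stand-in that assumes an in-frame coding sequence supplied as a plain string.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3): the G+C fraction at each codon position.

    The sequence is assumed to start in frame; a trailing partial codon
    is ignored. Fractions are 0.0 for an empty input.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    fractions = []
    for position in range(3):
        bases = cds[position:usable:3]
        gc_count = sum(base in "GC" for base in bases)
        fractions.append(gc_count / len(bases) if bases else 0.0)
    return tuple(fractions)


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the Rickettsia data.
    gc1, gc2, gc3 = gc_by_codon_position("GCAGCTGAA")
    print(f"GC1={gc1:.2f} GC2={gc2:.2f} GC3={gc3:.2f}")
```

In an A+T-rich genome, first codon positions are constrained by the encoded amino acids whereas third positions drift toward the genomic base composition, so a clear excess of GC1 over GC3 is one indication that an ORF is, or recently was, under selection as a coding sequence.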



FIG. 2. Schematic representation of a selected set of segments containing orphans in Rickettsia species. (A–C) The fragments represent regions containing genes in the SFG that are similar to pseudogenes in the TG. (D–E) Pseudogenes in the SFG with significant sequence similarity to extensively degraded gene remnants in the TG. (F–H) Pseudogenes, gene remnants, and noncoding DNA in the SFG showing weak similarity to the corresponding noncoding sequences in the TG. The organisms are, from top to bottom, R. prowazekii (Rp), R. typhi (Rt), R. conorii (Rc), R. rickettsii (Rr), and R. montana (Rm)

 

Table 2 Gene Status and Database Matches for Genes, Pseudogenes, and Gene Remnants in Five Rickettsia Species.

 

Table 3 Nucleotide Frequency Statistics of Genes and Reconstructed Pseudogenes in Rickettsia Species.

 
Orphans As Short, Internal Fragments of Deteriorating Genes
Another set of orphans in R. conorii displays little or no sequence similarity to the corresponding fragments in the TG Rickettsia (fig. 2D–G and table 2). Most of these sequences correspond to short fragments of full-length genes, with different parts of the ancestral gene being retained in the different species. For example, the annotated genes Rco466 and Rco467 in R. conorii represent two fragments of a single orphan gene (here called forfE) that may have been up to 2 kb in size in the common ancestor of the SFG (fig. 3A). R. montana lacks a segment of 250 bp (R1) at the 5'-terminal end of the reconstructed gene. The identified deletion in R. montana is flanked by a short, nearly perfect repetitive element, TAACC(G/C)AAGGGT, in R. conorii (fig. 4A). This suggests that the loss of this segment was mediated by a recombination event at the repeated sites.



FIG. 3. Gene fragmentation is promoted by recombination at repeated sequences. The putative, ancestral genes (A) forfE, (B) forfF, and (C) forfG were reconstructed from alignments of the homologous sequences in R. conorii, R. rickettsii, and R. montana. Fragmentation of these ancestral genes has resulted in different sets of open reading frames in each species, as indicated by boxes. Stars represent potential translation initiation sites with the initiating amino acid indicated by letters M (methionine), L (leucine), and V (valine). Hexamers indicate the position of in-frame termination codons. Arrows show the position of repeated sequences with bows highlighting fragments deleted in one or more species

 


FIG. 4. Alignments of the 5'-terminal fragments of the reconstructed gene sequences (A) forfE and (B) forfF. Boxes show the position of repeated or nearly repeated sites in one species that are flanking deletions in one or both of the other species. The organisms are, from top to bottom, R. conorii (Rc), R. rickettsii (Rr), and R. montana (Rm). The consensus sequences represent the putative ancestral sequences

 
Likewise, Rco284 and Rco285 in R. conorii correspond to two segments located at the 5'-terminal and the 3'-terminal ends of a reconstructed ORF (here called forfF), estimated to have been about 1 kb in size in the common ancestor of the SFG (fig. 3B). R. conorii and R. rickettsii lack a segment of 147 bp (R1) that is flanked by a short repetitive sequence in R. montana, GATCCAAATACA (fig. 4B). Another segment of 253 bp (R2), which is missing from R. montana, is also flanked by a short repeat in R. conorii and R. rickettsii (fig. 3B). Finally, a short fragment at the 3'-terminal end (R3), which is absent from R. conorii, is flanked by repeats in R. rickettsii and R. montana (fig. 3B). Taken together, this suggests that lineage-specific recombination events at repeated sequences inside the ancestral gene have resulted in different short sets of species-specific orphans.
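The signature described here, a segment missing in one species that is bounded by a short direct repeat in a species retaining it, can be checked with a minimal Python sketch. This is an illustration rather than the method used in the study: the function name, the coordinates, and the toy sequence are hypothetical, and only exact repeats are detected, whereas some of the repeats reported above are nearly perfect rather than perfect.

```python
def flanking_direct_repeat(seq, del_start, del_end, max_len=20):
    """Return the exact direct repeat bounding a deleted segment.

    'seq' is the sequence of a species that retains the segment, and
    seq[del_start:del_end] is the segment missing in another species.
    The repeat is the longest prefix of the deleted segment (up to
    max_len) that is repeated immediately after the deletion end.
    An empty string means that no exact repeat was found.
    """
    seq = seq.upper()
    k = 0
    while (
        k < max_len
        and del_start + k < del_end
        and del_end + k < len(seq)
        and seq[del_start + k] == seq[del_end + k]
    ):
        k += 1
    return seq[del_start:del_start + k]


if __name__ == "__main__":
    # Hypothetical toy sequence: two copies of GATCCA separated by 8 nt.
    # Deleting seq[4:18] removes one copy plus the intervening sequence.
    toy = "AAAA" + "GATCCA" + "TTTTTTTT" + "GATCCA" + "CCCC"
    print(flanking_direct_repeat(toy, 4, 18))
```

Recombination between two copies of a direct repeat removes one copy together with the intervening sequence, so the species carrying the deletion retains a single copy at the junction. Because gap placement in an alignment is ambiguous within the repeat, the deletion coordinates may need to be shifted by a few nucleotides before the repeat becomes apparent.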

Orphans As Short, Fused Fragments of Deteriorating Genes
Finally, we have identified an orphan in R. montana that consists of the 5'-terminal and 3'-terminal segments of two genes present in the genome of its ancestor (figs. 2G and 3C). In R. conorii, three orphans (Rco619, Rco620, and Rco622) have been annotated in the region flanked by trxB1 and lgtD (fig. 2G). This region is about 700 bp shorter in R. montana than in the other species of the SFG (fig. 2G). A closer inspection of the aligned sequences shows that this is due to the loss of the 3'-terminal part of Rco619 and the 5'-terminal part of Rco620, as well as of the intervening sequence (fig. 3C). Also in this case, we found that the lost fragment in R. montana is flanked by short repeated sequences in R. conorii and R. rickettsii. As a result, a new orphan has been created in R. montana that contains a short upstream sequence that is located in-frame with the 5'-terminal end of Rco619 and the 3'-terminal end of Rco620 (fig. 3C). This illustrates how a new, species-specific orphan can be created by a single deletion event that results in the fusion of the 5' and 3' flanking segments, which in this case consisted of two neighboring genes.




 
Putative Functions of the Reconstructed Genes
The genes reconstructed from the different pieces of short orphans in the SFG display the characteristic G+C content statistics (table 3), suggesting that they were functional in the common ancestor of the SFG species. If so, what were the functions of these ancestral genes and why are they being purged from the modern Rickettsia genomes? To resolve these questions, we searched the reconstructed protein sequences against GenBank and the Conserved Domain Database (CDD). The search against GenBank with the gene product of forfF revealed a weak sequence similarity (E < 1e-08) to a protein from D. melanogaster that contains multiple ankyrin repeats (Dubreuil and Yu 1994). Likewise, the search against CDD identified the presence of six ankyrin repeats in the reconstructed protein sequence (E < 7e-11). A closer inspection showed that the similarity was derived from a 200-amino acid fragment inside the reconstructed protein sequence, including segment R2 that has been deleted from R. montana and a downstream region of 50 amino acids (fig. 3B).

Comparative molecular modeling confirmed these results by revealing similarity to the crystal structure of the consensus ankyrin repeat domain that consists of a β-turn followed by two antiparallel α-helices and a loop reaching the turn of the next repeat (fig. 5) (Kohl et al. 2003). The modeling approach suggests that fossil-ORF F is a dimer of six ankyrin repeats with the overall protein topology being remarkably maintained (fig. 5), despite the low sequence identity (28%). The reconstructed ankyrin-repeat protein was predicted to contain 23 α-helices (fig. 5).



FIG. 5. Representations of the three-dimensional protein structure of the consensus ankyrin repeat domain (A) and a comparative homology protein model of fossil-ORF F (B). The structure of the consensus ankyrin repeat was taken from previous studies (Kohl et al. 2003), whereas the predicted structure of fossil-ORF F was obtained by comparative modeling (Bujnicki et al. 2001)

 
When the reconstructed protein sequence of forfE was used as a query in database searches, we observed a weak similarity to an ABC-transporter protein of H. influenzae (E = 7e-06). In this case, searches against the Conserved Domain Database showed that the similarity was to the ATP-binding cassette of ABC-transporters (E < 1e-07). This suggests that forfE represents a former ATP-binding protein that was possibly a member of the ABC-transporter protein family in Rickettsia. Taken together, our results suggest that the short orphans Rco284, Rco285, Rco467, and Rco468 identified in R. conorii and their variants in the other Rickettsia species are fragments of two highly divergent members of multigene families.

Rco529 is currently annotated as a hypothetical cell filamentation gene (fic), but we were unable to confirm this similarity in our database searches (E < 0.01). We were also unable to obtain any hints about the putative functions of forfG1, G2 or G3, Rco272, and Rco295 using this approach. It remains possible that forfG1 and forfG2 belong to the same ancestral gene.

Frequencies of Deletions and Insertions in the TG and the SFG
Three genes are of similar sizes and show no signs of deterioration in the SFG (ampG, Rco295, and Rco529), whereas they are present as pseudogenes in the TG. The extent of deterioration in the latter group is high, which complicates attempts to correctly identify the number and positions of the accumulated mutations. Nevertheless, by aligning the ampG gene from members of the SFG with the pseudogene-like sequences from members of the TG, we estimate that more than 300 nucleotides have been eliminated in approximately 30 deletion events from each of R. prowazekii and R. typhi. Also for Rco295, we estimate that more than 20 deletion events in each of the two species have eliminated approximately 25% of the ancestral gene sequences. Similar results were obtained from Rco529 and Rco530; here more than 300 nucleotides have been deleted from the spacer regions, which currently are only about 1300 nucleotides in size in R. prowazekii and R. typhi. Taken together, we estimate that of more than 4,000 nucleotide sites present in these regions in their common ancestor, approximately 950 and 1,400 nucleotides have been eliminated in R. prowazekii and R. typhi, respectively.

Estimating the frequencies of deletions in sequences for which no full-length copy is present in any of the modern species is even more difficult. In this case, the putative ancestral gene sequences were first reconstructed as the longest ORF given the smallest number of mutations in each species of the SFG Rickettsia. These inferred mutations were then counted as deletions or insertions in each individual species as compared with the reconstructed reference sequence. In total, we estimate that fewer than 100 nt per kb (6% to 7%) have been eliminated from members of the SFG, with the exception of R. montana, in which the deletion frequency may have been as high as 34%. This can be compared with more than 300 nt deleted per kb in members of the TG, which corresponds to a deletion frequency of 27% in R. prowazekii and 44% in R. typhi (table 4). A closer inspection of the spectrum of changes shows that R. montana is exceptional in that an average of as many as 100 nt were eliminated per event (table 4), as compared with 8 to 16 nt in the other species. The difference is due to a few very large deletions in R. montana that were mediated by short, repetitive sequences.


Table 4 Frequencies of Insertions and Deletions in the TG and the SFG.

 
For insertion mutations as well, the number of events has been approximately 10-fold higher in the TG, although 2 to 5 nt on average are inserted per mutational event in both groups. In this case, the exception is R. montana, in which no insertion mutations were identified in any of the pseudogenes examined here. However, the results are not directly comparable across the TG and SFG since the numbers have been inferred from different sets of pseudogenes in the two groups. Nevertheless, the general trend is that there is a bias towards deletion mutations in all the Rickettsia species examined here. Furthermore, the data indicate that recombination at short repeated sites has occurred at high frequencies, especially in R. montana. This would explain the shorter sizes of the gene fragments in this species compared with the other members of the SFG.

Long Noncoding Sequences in the TG
One segment consists entirely of noncoding sequences in which no genes, pseudogenes, or gene fragments could be identified in any of the Rickettsia species (fig. 2H). The noncoding segment flanked by cspA and ksgA is the most atypical of all fragments examined here. It is more than 3,000 bp long in the TG but only approximately 1,400 bp long in the SFG. Very short stretches of sequences with similarity between the SFG and the TG were identified at the 5' as well as at the 3' end of this region. However, unlike all other intergenic segments, there is a sequence of more than 2,000 bp in the TG with no similarity to the SFG or to any other sequences in the R. prowazekii and R. conorii genomes. An alignment of the intergenic regions in R. prowazekii and R. typhi shows considerable sequence similarity but also numerous sequence gaps, especially in R. prowazekii (data not shown). The lack of sequences and ORFs that are conserved between the two species and the absence of ORFs above 250 bp suggest that this region may represent species-specific remnants of a larger segment that was present in the common ancestor of the Rickettsia species.


    Discussion
 
The identification of thousands of species-specific orphans without any observable similarity to genes in other genomes (Siew and Fischer 2003) requires an explanation. Here, we have taken a comparative sequence approach to examine the extent of within-genus sequence conservation for orphans, using Rickettsia as our reference system. The study has shown that approximately 80% of the orphans examined here represent short fragments of deteriorating genes. In total, more than 400 orphans have been annotated in the R. conorii genome (Ogata et al. 2001). If our results were representative of the genome as a whole, the coding content of the R. conorii genome would be reduced from the currently predicted 81% (Ogata et al. 2001) to less than 76%, which is in the range of the estimated coding content of the R. prowazekii genome (Andersson et al. 1998).

This provides an explanation for the low average length, 775 bp, of genes in the R. conorii genome (Ogata et al. 2001), as compared with an average gene length of 947 bp in bacterial genomes (Mira, Ochman, and Moran 2001). Indeed, sequences annotated as orphans are atypically short (313 bp on average), as compared with full-length orthologs present in both R. prowazekii and R. conorii (1,030 bp on average) (fig. 1). This observation per se suggests that many of the annotated orphans are fragments of deteriorating genes. Short repeated sequences located in close proximity to each other are known to play an important role in mediating recombination events in Rickettsia and other species (Syvänen et al. 1996; Andersson et al. 1999; Ogata et al. 2000, 2001; Amiri, Alsmark, and Andersson 2002; Frank, Amiri, and Andersson 2002). As we have shown in this study, the outcome of these recombination events may be fusions and partial losses of genic sequences (figs. 3 and 4).
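The effect of such an event on a sequence can be illustrated with a toy model. The sketch below assumes a single recombination event between two direct copies of a short repeat, which removes the intervening sequence and one repeat copy; the function name and the example sequence are hypothetical and are not taken from the Rickettsia data.

```python
def delete_between_repeats(seq, repeat):
    """Collapse the first two direct copies of `repeat` into one copy.

    Everything between the two copies, plus one copy of the repeat,
    is lost. The sequence is returned unchanged if fewer than two
    non-overlapping copies are found.
    """
    first = seq.find(repeat)
    if first == -1:
        return seq
    second = seq.find(repeat, first + len(repeat))
    if second == -1:
        return seq
    # Keep the left flank, then resume at the second repeat copy.
    return seq[:first] + seq[second:]


# Hypothetical sequence: two GGTT repeats flanking a 6-bp segment.
ancestral = "ATGAAA" + "GGTT" + "CCCCCC" + "GGTT" + "TTTTAA"
derived = delete_between_repeats(ancestral, "GGTT")
print(ancestral, "->", derived)
```

In this example 10 bp are lost (the 6-bp segment and one 4-bp repeat copy). Because 10 is not a multiple of three, the downstream reading frame shifts, which is one way in which a full-length gene can be converted into shorter, fused, or truncated ORFs.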

In effect, fragments of genes will temporarily be retained in the form of multiple short ORFs. The nucleotide composition patterns of these will initially be similar to those of the full-length, ancestral genes from which they were derived, although they no longer code for functional proteins. Studies of the rate of sequence evolution at amino acid replacement sites among closely related species will facilitate functional predictions (Davids, Amiri, and Andersson 2002; Ochman 2002), although some structural and regulatory peptides that are not under selection for amino acid conservation may be missed using this approach (Lawrence 2003). Stop codons and frameshift mutations in sequences that correspond to full-length genes, or to overlapping but not identical ORFs, in closely related species are signs of nonfunctionality and degradation. However, we cannot exclude that a few gene duplication, deletion, and fusion events may help to create new protein functions, as illustrated for proteobacterial genomes (Le Bouder-Langevin et al. 2002).
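One of these signs, an in-frame stop codon inside a region that is intact in a related species, is simple to screen for. The following is a minimal sketch rather than the procedure used in this study: it assumes that the sequence has already been aligned to its full-length ortholog so that the reading frame is known, and it uses the three standard stop codons. Detecting frameshifts would additionally require the alignment itself and is not covered here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(seq, frame=0):
    """Return 0-based start positions of in-frame stop codons.

    The final complete codon is skipped, since a stop codon there is
    the expected end of an intact reading frame.
    """
    seq = seq.upper()
    codon_starts = list(range(frame, len(seq) - 2, 3))
    return [i for i in codon_starts[:-1] if seq[i:i + 3] in STOP_CODONS]


# Hypothetical sequences: the second carries an internal TAG codon.
intact = "ATGAAACCCTAA"
disrupted = "ATGAAATAGCCCTAA"
print(premature_stops(intact))
print(premature_stops(disrupted))
```

A nonempty result for a sequence that corresponds to an uninterrupted gene in a close relative would flag it as a candidate pseudogene or gene fragment, to be confirmed by inspecting the alignment.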

Comparative sequence analysis of close relatives and reconstructions of gene sequences present in their common ancestor may help to identify inactivated or lost protein functions, as we have shown in this study. A particularly interesting result from these exercises is that one of the reconstructed proteins in Rickettsia displays similarity to ankyrin repeat–containing proteins that are normally found in eukaryotic cells, where they provide a linkage between the cytoskeleton and the cell membranes, including the membranes of intracellular organelles (Rubtsov and Lopina 2000). The function of the ankyrin domain is to mediate protein-protein interactions, and its success and wide distribution may be attributed to the observation that each repeat can bind to virtually any target protein (Rubtsov and Lopina 2000). Ankyrin repeat–containing proteins of unknown function have also been identified in the obligate intracellular bacteria Wolbachia pipientis (www.tigr.org) and Ehrlichia phagocytophila (Caturegli et al. 2000), which are close relatives of Rickettsia. It has been suggested that these proteins interact with the chromatin of their host cells, thereby influencing the expression of cell cycle regulators (Caturegli et al. 2000).

Surprisingly, only a single gene per genome was previously identified in R. prowazekii and R. conorii to code for proteins with ankyrin repeats (Andersson et al. 1998; Ogata et al. 2001), whereas E. phagocytophila (Caturegli et al. 2000) and W. pipientis (www.tigr.org) contain multiple such genes. The identification of a second, albeit decaying, ankyrin repeat–containing gene in the SFG Rickettsia is intriguing. It suggests that many more such genes were probably present in the common ancestor of the Rickettsia and that this ancestor may have shared important lifestyle features with modern Ehrlichia and/or Wolbachia.

Another interesting observation is that the SFG and the TG seem to be converging towards a similar minimal gene set, albeit at different rates. Thus, orphans in the SFG correspond to pseudogenes in the TG, and pseudogenes in the SFG correspond to extensively degraded gene remnants in the TG. Within the TG, the rate of degradation has been highest in the lineage leading to R. typhi; three of the analyzed fragments have considerably shorter intergenic regions in this species than in R. prowazekii (fig. 2 and table 1), and there is only one example of the converse. Within the SFG, we have found that R. montana has two fragments that are shorter in size than the corresponding fragments in R. rickettsii and R. conorii (fig. 1E and G). Thus, the rate of gene degradation seems to be lowest in R. conorii and R. rickettsii, higher in R. montana, and highest in R. prowazekii and R. typhi.

Gene sequences coding for nonessential protein functions will only have a transient presence in the genome (Andersson and Kurland 1998; Andersson and Andersson 1999a, 1999b, 2001; Berg and Kurland 2002). Thus, the observed convergence in the identity of genes being eliminated is explained if this particular set of genes was inactivated, or rendered nonessential, before the divergence of the TG and the SFG. If so, however, why has the process of sequence elimination occurred more rapidly in the TG than in the SFG Rickettsia? Part of the answer may be that the generation times of the two bacterial groups differ as a result of the different vectors used for transmission among hosts. Indeed, the SFG Rickettsia are normally transmitted by ticks that may have life spans of several years, whereas the TG are transmitted among hosts by lice and rat fleas with generation times of only a few months. Since gene loss is an irreversible process when it is not balanced by a corresponding inflow of genes, genome degradation will proceed fastest in the species with the shortest generation times. Additionally, the rate of genome degradation may be influenced by the population sizes of the vectors and hosts as well as by the number of bacterial cells transferred per transmission event, factors about which little is currently known for these particular species.

The observation that many genus-specific orphans with no similarity to genes in other genera may be degraded remnants of ancestral genes has several important implications. Depending on the order of the deletion events, the decaying DNA sequences may differ even among closely related strains and isolates, explaining the many strain-specific and species-specific orphans. On the assumption that phenotypic differences among species as well as novel metabolic features reside in this category of genes, orphans are often included in high-throughput functional and structural genome analyses. Needless to say, such analyses become meaningless if the ORFs under investigation do not encode functional proteins.

Therefore, we recommend that the rates and patterns of sequence evolution be examined before the implementation of experimental strategies for elucidating the functions of short orphans. Comparative sequence data will make it possible to follow the deterioration of coding into noncoding sequences, and vice versa. This should resolve questions concerning the coding status of any particular sequence at any given time-point. Until such information is available, it may be necessary to apply a stricter length criterion for the assignment of short, species-specific orphan genes than for orthologs present in two or more species.
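A two-tier length criterion of this kind could be applied at the annotation stage as sketched below. The function name and both thresholds are hypothetical placeholders chosen only for illustration; no specific cutoff values are proposed in this study.

```python
def is_candidate_gene(orf_bp, has_ortholog,
                      min_bp_with_ortholog=150, min_bp_orphan=300):
    """Apply a stricter minimum length to ORFs that lack an ortholog.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    threshold = min_bp_with_ortholog if has_ortholog else min_bp_orphan
    return orf_bp >= threshold


# A 200-bp ORF is kept if conserved in a second species, rejected otherwise.
print(is_candidate_gene(200, has_ortholog=True))
print(is_candidate_gene(200, has_ortholog=False))
```

The design choice here is that conservation in a second species counts as independent evidence of coding status, so a shorter ORF can be accepted, whereas a species-specific ORF must clear a higher bar on length alone.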

As we have shown here, ancestral gene sequences may be reconstructed in silico by comparative sequence analysis of closely related strains and species, even if the gene per se is absent from all modern species. Thus, we already have in principle the information required for designing and cloning genes from ancestral genome sequences. In the future, this should allow us to examine the functions of genes no longer present in contemporary bacteria and thereby obtain information about microbial metabolism at an earlier stage of evolution.


    Acknowledgements
 
The authors would like to acknowledge the contribution of L. Rychlewski for help in the comparative protein modeling experiments. We also thank Carolin Frank, Ramy Arnout, Charles Kurland, three anonymous referees, and the editor for helpful comments on the manuscript. This work was financed by the Foundation for Strategic Research (SSF), the Knut and Alice Wallenberg Foundation (KAW) and the National Research Foundations in Sweden and South Africa.


    Footnotes
 
Kenneth Wolfe, Associate Editor


    References
 

    Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

    Amiri, H., C. Alsmark, and S. G. E. Andersson. 2002. Proliferation and deterioration of Rickettsia palindromic elements. Mol. Biol. Evol. 19:1234-1243.

    Andersson, J. O., and S. G. E. Andersson. 1999a. Genome degradation is an ongoing process in Rickettsia. Mol. Biol. Evol. 16:1178-1191.

    Andersson, J. O., and S. G. E. Andersson. 1999b. Insight into the evolutionary process of genome degradation. Curr. Opin. Genet. Dev. 9:664-671.

    Andersson, J. O., and S. G. E. Andersson. 2001. Pseudogenes, junk DNA and the dynamics of Rickettsia genomes. Mol. Biol. Evol. 18:829-839.

    Andersson, S. G. E., and C. G. Kurland. 1998. Reductive evolution of resident genomes. Trends Microbiol. 6:263-268.

    Andersson, S. G. E., and P. M. Sharp. 1996. Codon usage and base composition in Rickettsia prowazekii. J. Mol. Evol. 42:525-536.

    Andersson, S. G. E., D. R. Stothard, P. Fuerst, and C. G. Kurland. 1999. Molecular phylogeny and rearrangement of rRNA genes in Rickettsia species. Mol. Biol. Evol. 16:987-995.

    Andersson, S. G. E., A. Zomorodipour, J. O. Andersson, T. Sicheritz-Pontén, U. C. M. Alsmark, R. F. Podowski, A. K. Näslund, A.-S. Eriksson, H. H. Winkler, and C. G. Kurland. 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396:133-140.

    Basrai, M. A., P. Hieter, and J. D. Boeke. 1997. Small open reading frames: beautiful needles in the haystack. Genome Res. 7:768-771.

    Berg, O. G., and C. G. Kurland. 2002. Evolution of microbial genomes: sequence acquisition and loss. Mol. Biol. Evol. 19:2265-2276.

    Bujnicki, J. M., A. Elofsson, D. Fischer, and L. Rychlewski. 2001. Structure prediction meta server. Bioinformatics 17:750-751.

    Caturegli, P., K. M. Asanovich, J. J. Walls, J. S. Bakken, J. E. Madigan, V. L. Popov, and J. S. Dumler. 2000. ankA: an Ehrlichia phagocytophila group gene encoding cytoplasmic protein antigen with ankyrin repeats. Inf. Immun. 68:5277-5283.

    Chen, S., E. A. Lesnik, T. A. Hall, R. Sampath, R. H. Griffey, D. J. Ecker, and L. B. Blyn. 2002. A bioinformatics based approach to discover small RNA genes in the Escherichia coli genome. Biosystems 65:157-177.

    Combet, C., M. Jambon, G. Deleage, and C. Geourjon. 2002. Geno3D: automatic comparative molecular modeling of protein. Bioinformatics 18:213-214.

    Davids, W., H. Amiri, and S. G. E. Andersson. 2002. Small RNAs in Rickettsia: are they functional? Trends Genet. 18:331-334.

    Dubreuil, R., and J. Yu. 1994. Ankyrins and β-spectrin accumulate independently of α-spectrin in Drosophila. Proc. Natl. Acad. Sci. USA 91:10285-10289.

    Fischer, D., and D. Eisenberg. 1999. Finding families for genomic ORFans. Bioinformatics 15:759-762.

    Frank, C., H. Amiri, and S. G. E. Andersson. 2002. Genome deterioration: loss of repeated sequences and accumulation of junk DNA. Genetica 115:1-12.

    Galtier, N., M. Gouy, and C. Gautier. 1996. SeaView and Phylo_win: two graphic tools for sequence alignment and molecular phylogeny. Comput. Appl. Biosci. 12:543-548.

    Harrison, P., A. Kumar, N. Lan, N. Echols, M. Snyder, and M. Gerstein. 2001. A small reservoir of disabled ORFs in the yeast genome and its implications for the dynamics of proteome evolution. J. Mol. Biol. 316:409-419.

    Kohl, A., H. K. Binz, P. Forrer, M. T. Stumpp, A. Pluckthun, and M. G. Grutter. 2003. Designed to be stable: crystal structure of a consensus ankyrin repeat protein. Proc. Natl. Acad. Sci. USA 100:1700-1705.

    Lawrence, J. 2003. When ELFs are ORFs, but don't act like them. Trends Genet. 19:131-132.

    Le Bouder-Langevin, S., I. Capron-Montaland, R. De Rosa, and B. Labedan. 2002. A strategy to retrieve the whole set of protein modules in microbial proteomes. Genome Res. 12:1961-1973.

    Lloyd, A. T., and P. M. Sharp. 1992. CODONS: a microcomputer program for codon usage analysis. J. Hered. 83:239-240.

    Mira, A., L. Klasson, and S. G. E. Andersson. 2002. Microbial genome evolution: sources of variability. Curr. Opin. Microbiol. 5:506-512.

    Mira, A., H. Ochman, and N. Moran. 2001. Deletional bias and the evolution of bacterial genomes. Trends Genet. 17:589-596.

    Ochman, H. 2002. Distinguishing the ORFs from the ELFs: short bacterial genes and the annotation of genomes. Trends Genet. 18:335-337.

    Ogata, H., S. Audic, V. Barbe, F. Artiguenave, P. E. Fournier, D. Raoult, and J. M. Claverie. 2000. Selfish DNA in protein coding genes. Science 290:347-350.

    Ogata, H., S. Audic, and P. Renesto-Audiffren, et al. (11 co-authors). 2001. Mechanisms of evolution in Rickettsia conorii and R. prowazekii. Science 293:2093-2098.

    Pretzman, C. I., Y. Rikihisa, D. Ralph, J. C. Gordon, and S. Bech Nielsen. 1987. Enzyme-linked immunosorbent assay for Potomac horse fever disease. J. Clin. Microbiol. 25:31-36.

    Rogozin, I. B., K. S. Makarova, D. A. Natale, A. N. Spiridonov, R. L. Tatusov, Y. I. Wolf, J. Yin, and E. V. Koonin. 2002. Congruent evolution of different classes of non-coding DNA in prokaryotic genomes. Nucleic Acids Res. 30:4264-4271.

    Roux, V., and D. Raoult. 1995. Phylogenetic analysis of the genus Rickettsia by 16S rDNA sequencing. Res. Microbiol. 146:385-396.

    Roux, V., E. Rydkina, M. Eremeeva, and D. Raoult. 1997. Citrate synthase gene comparison, a new tool for phylogenetic analysis, and its application for the rickettsiae. Int. J. Syst. Bacteriol. 47:252-261.

    Rubtsov, A. M., and O. D. Lopina. 2000. Ankyrins. FEBS Lett. 482:1-5.

    Rychlewski, L., L. Jaroszewski, W. Li, and A. Godzik. 2000. Comparison of sequence profiles: Strategies for structural predictions using sequence information. Protein Sci. 9:232-241.

    Sayle, R., and E. J. Milner-White. 1995. Rasmol: biomolecular graphics for all. Trends Biochem. Sci. 20:374.

    Siew, N., and D. Fischer. 2003. Twenty thousand ORFan microbial protein families for the biologist? Structure 11:7-9.

    Skovgaard, M., J. L. Jensen, S. Brunak, D. Ussery, and A. Krogh. 2001. On the total number of genes and their length distribution in complete microbial genomes. Trends Genet. 17:425-428.

    Syvänen, A.-C., H. Amiri, A. Jamal, S. G. E. Andersson, and C. G. Kurland. 1996. A chimeric disposition of the elongation factor genes in Rickettsia prowazekii. J. Bacteriol. 178:6192-6199.

    Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.

Accepted for publication May 9, 2003.