Department of Biology, McMaster University, Hamilton, Ontario, Canada
Abstract |
---|
Introduction |
---|
DNA sequence analysis has also provided valuable clues about horizontal transfer events. Genome (A+T) content, dinucleotide frequencies, and synonymous codon usage (Lawrence and Ochman 1997) vary among organisms and are generally characteristic of evolutionary lineages. Several methods have been suggested to use these data to identify horizontally transferred genes. Ochman and Lawrence (1996) and Lawrence and Ochman (1998) used anomalous GC content at first and third codon positions together with synonymous codon usage, positional homology, and BLAST hits to analyze genes of the Escherichia coli strain MG1655 genome, suggesting that a minimum of 17.6% of identified open reading frames (ORFs) had arisen via horizontal transfer since separation of the Escherichia and Salmonella lineages about 100 MYA. ORFs were initially identified as atypical if their GC contents at first and third codon positions were two or more standard errors (SE) higher or lower than the respective means for all genes in the genome. Chi-square (χ²) of codon usage and the codon adaptation index (CAI) were also calculated for each gene. CAI is a measure of similarity of a gene's synonymous codon usage to that of a standard set of highly expressed genes for that organism (Sharp and Li 1987). Those genes with a high χ² and a low CAI were classified as atypical. From this list of atypical genes, known native genes that exhibit atypical base compositions for other reasons, such as the amino acid content of the encoded protein, were eliminated. Lawrence and Ochman (1997) also estimated the time of introgression for each of the horizontally transferred genes found in E. coli. Transferred genes are subject to those mutational processes affecting the recipient genome. Amelioration is the process by which the acquired gene incurs substitutions and evolves to reflect the DNA composition of the new genome (Lawrence and Ochman 1998). Lawrence and Ochman estimated the time of introgression by examining the rate and extent of amelioration of the introgressed genes and determining how long each sequence had been subjected to the directional mutational pressures of the recipient genome. They estimated that many of the genes were introduced within the last 10 Myr, with the average time of introduction being 25.3 MYA. Médigue et al. (1991) used a chi-square distance measure to divide E. coli genes into three classes: genes of high expression, genes of low expression, and a third class containing horizontally transferred genes. Mrázek and Karlin (1999) used a measure assessing the bias of one group of genes against a second group in order to identify alien genes in the genomes of several bacteria.
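The composition screen described above (flagging ORFs whose GC content at the first or third codon position lies two or more deviations from the genome-wide mean) can be sketched roughly as follows. This is an illustration only, not the published implementation: the function names are hypothetical, the spread is taken as the standard deviation of per-gene values, and a gene is flagged when either position is extreme. All three choices are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def gc_fraction(cds: str, codon_pos: int) -> float:
    """Fraction of G or C at one codon position (1, 2 or 3) of an in-frame CDS."""
    bases = cds.upper()[codon_pos - 1::3]
    return sum(b in "GC" for b in bases) / len(bases)


def flag_atypical(cds_by_gene: dict, n_dev: float = 2.0) -> set:
    """Return names of genes whose GC1 or GC3 is >= n_dev deviations from the mean.

    Assumption: 'deviation' is the population standard deviation of the
    per-gene values, and one extreme position is enough to flag a gene.
    """
    flagged = set()
    for pos in (1, 3):
        values = {g: gc_fraction(s, pos) for g, s in cds_by_gene.items()}
        mu = mean(values.values())
        spread = pstdev(values.values())
        for gene, v in values.items():
            if spread > 0 and abs(v - mu) >= n_dev * spread:
                flagged.add(gene)
    return flagged
```

Genes flagged this way would only be candidates; as the text notes, the original procedure combined this screen with χ² of codon usage, CAI, and removal of known native genes.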
It is important to assess the validity of nucleotide composition and synonymous codon usage as measures for detecting horizontally transferred genes. Introgressions may be undetected by these methods, since genes from closely related organisms may not have unusual nucleotide composition or codon bias. Also, genes that have been in the genome for a long period will have undergone amelioration. Therefore, the earliest genes introduced into the genome have probably fully ameliorated and will go undetected when using base composition and codon bias as a means of identification. Perhaps more importantly, the forces that shape normal compositional variation among genes within a genome are not well understood. In Escherichia coli, a high CAI is strongly correlated with high expression (Sharp and Li 1987), and genes with a high CAI use a set of preferred codons that correspond to the tRNA molecules found in rapidly growing cells (Ikemura 1981). There may be other mutational or selectional forces that cause deviations of nucleotide composition from the genome average that might lead to a misclassification of genes as horizontally transferred.
A comparison of genes between closely related organisms can be used to evaluate methods of gene classification, as well as to identify genes that have atypical modes of evolution. The genomes of E. coli MG1655 (Blattner et al. 1997) and S. typhi (Salmonella enterica serovar Typhi) have been completely sequenced. These species diverged about 100 MYA (Doolittle et al. 1996) and have essentially colinear genetic maps (Krawiec and Riley 1990). We used protein similarity and conservation of local gene order to identify a group of 2,728 E. coli genes that have positional orthologs in S. typhi. Phylogenetic analysis of selected examples of 1,144 "novel" E. coli genes suggested that on the order of 10%–15% of the E. coli genome may have been horizontally introduced. We also found that atypical nucleotide composition alone was not a reliable indicator of horizontal transmission.
Materials and Methods |
---|
The fraction of (A+T) nucleotides in the first and third codon positions was calculated from the coding sequence of each E. coli ORF. Bivariate frequency distributions for all E. coli ORFs and those positionally conserved were compared using two-way classification and the odds ratio (fraction of positionally conserved ORFs in a cell divided by the fraction of all ORFs in a cell). The significance of this ratio was assessed with a G-test corrected for continuity and a Bonferroni correction for multiple tests.
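As a rough illustration of the two-way classification just described, the sketch below bins each ORF by its (A+T) fraction at first and third codon positions and computes the per-cell ratio defined in the text. The function names, the choice of ten equal-width bins per axis, and the input layout (a dict of coding sequences plus a set of positionally conserved gene names) are assumptions made for this example; the G-test and Bonferroni correction are not shown.

```python
from collections import Counter


def at_fraction(cds: str, codon_pos: int) -> float:
    """Fraction of A or T at one codon position (1, 2 or 3) of an in-frame CDS."""
    bases = cds.upper()[codon_pos - 1::3]
    return sum(b in "AT" for b in bases) / len(bases)


def cell_of(cds: str, n_bins: int = 10) -> tuple:
    """Map an ORF to its ((A+T)1 bin, (A+T)3 bin) cell; bin width is an assumption."""
    def to_bin(x: float) -> int:
        return min(int(x * n_bins), n_bins - 1)  # keep x == 1.0 in the top bin
    return to_bin(at_fraction(cds, 1)), to_bin(at_fraction(cds, 3))


def odds_ratios(all_orfs: dict, conserved_ids: set, n_bins: int = 10) -> dict:
    """Per-cell ratio: fraction of conserved ORFs in the cell / fraction of all ORFs in it."""
    all_cells = Counter(cell_of(s, n_bins) for s in all_orfs.values())
    cons_cells = Counter(cell_of(all_orfs[g], n_bins) for g in conserved_ids)
    n_all, n_cons = len(all_orfs), len(conserved_ids)
    return {c: (cons_cells[c] / n_cons) / (k / n_all) for c, k in all_cells.items()}
```

A ratio above 1 marks a composition cell enriched for positionally conserved ORFs, and a ratio below 1 marks a cell in which they are underrepresented.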
CAI was calculated for each E. coli ORF according to the method of Sharp and Li (1987), using as a reference set of highly expressed E. coli genes the 27 genes used by Sharp and Li (1986). Since CAI depends on amino acid composition, CAI was also corrected by subtracting the value that would be obtained for a protein of the same composition which used synonymous codons according to the genome cumulative average. These deviations were divided by the genome standard deviation to give a ZCAI score. Since these scores were distributed asymmetrically, extreme scores were assessed empirically.
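The CAI and ZCAI calculations can be sketched as below. This is one possible reading of the description, not the authors' code. CAI follows the general form of Sharp and Li (1987): the geometric mean of relative adaptiveness values w, with Met, Trp, and stop codons excluded. Three points are assumptions: zero codon counts in the reference set are replaced by 0.5, the "expected" CAI averages log w over synonymous codons weighted by genome-wide usage, and the divisor is the standard deviation of the deviations across all genes.

```python
import math
from collections import Counter
from statistics import pstdev

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
SKIP = {"M", "W", "*"}  # single-codon amino acids and stops carry no codon choice


def codons(cds: str) -> list:
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]


def usage_counts(seqs) -> Counter:
    counts = Counter()
    for s in seqs:
        counts.update(c for c in codons(s) if c in CODE)
    return counts


def relative_adaptiveness(reference_seqs) -> dict:
    """w = codon count / count of the most used synonymous codon in the reference set."""
    counts = usage_counts(reference_seqs)
    w = {}
    for codon, aa in CODE.items():
        if aa in SKIP:
            continue
        family = [c for c in CODE if CODE[c] == aa]
        best = max(max(counts[c], 0.5) for c in family)  # 0.5 for zero counts (assumption)
        w[codon] = max(counts[codon], 0.5) / best
    return w


def cai(cds: str, w: dict) -> float:
    logs = [math.log(w[c]) for c in codons(cds) if c in w]
    return math.exp(sum(logs) / len(logs))


def expected_cai(cds: str, w: dict, genome_counts: Counter) -> float:
    """CAI expected for the same protein if codons followed genome-wide usage (assumed form)."""
    logs = []
    for c in codons(cds):
        if c not in w:
            continue
        family = [s for s in w if CODE[s] == CODE[c]]
        total = sum(genome_counts[s] for s in family)
        logs.append(sum(genome_counts[s] / total * math.log(w[s]) for s in family))
    return math.exp(sum(logs) / len(logs))


def zcai(genes: dict, reference_seqs) -> dict:
    """Deviation of CAI from its composition-matched expectation, scaled by the genome SD."""
    w = relative_adaptiveness(reference_seqs)
    genome_counts = usage_counts(genes.values())
    dev = {g: cai(s, w) - expected_cai(s, w, genome_counts) for g, s in genes.items()}
    sd = pstdev(dev.values())
    return {g: d / sd for g, d in dev.items()}
```

Here `genes` is assumed to hold every ORF in the genome, so the genome-wide counts always include the gene being scored, and each sequence is assumed to contain at least one codon other than Met, Trp, or stop.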
An NCBI BLAST search was conducted on the amino acid sequence of each E. coli ORF. Sequences that hit to eight or more different species with BLAST expected values of less than 10⁻¹⁵ were used for phylogenetic analysis. Phylogenies were generated where possible for a number of novel E. coli ORFs, as well as for those genes previously classified as horizontally transferred by Lawrence and Ochman (1998) based on base composition and codon bias. Sequences were aligned using the CLUSTAL W algorithm (Thompson, Higgins, and Gibson 1994). For each gene, 100 bootstrap samples were constructed using SEQBOOT (Felsenstein 1994) and 100 phylogenetic trees were generated using the neighbor-joining algorithm (Saitou and Nei 1987). Using the original sequence alignment, these 100 trees were then evaluated by the maximum-likelihood inference program PROTML from the MOLPHY package, version 2.2 (Adachi and Hasegawa 1992). If the maximum-likelihood tree showed evidence of horizontal transfer, a new tree was constructed. In this tree, E. coli was forced to be located next to Salmonella (or another closely related Gram-negative species). This rearranged tree represented a null hypothesis without horizontal transfer. If the likelihood of the rearranged tree was significantly below the likelihood of the original tree (Kishino and Hasegawa 1989), this hypothesis was rejected, and the gene was classified as "horizontally transferred." If the best likelihood tree grouped E. coli with other closely related Gram-negative bacteria and the gene was a positional ortholog, we classified the gene as "native."
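The bootstrap step in this pipeline amounts to resampling alignment columns with replacement. The sketch below shows the idea in plain Python; it is not SEQBOOT, and the function names, the dict-of-strings alignment layout, and the fixed seed are assumptions for illustration. Tree building and likelihood evaluation are left to the external programs named in the text.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Resample the columns of an aligned set of sequences with replacement."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    # The same column indices are applied to every sequence, so columns stay intact.
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def bootstrap_replicates(alignment: dict, n: int = 100, seed: int = 0) -> list:
    """Build n bootstrap pseudo-alignments, each the same size as the original."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n)]
```

Each replicate would then be passed to a tree-building method, with the resulting topologies compared against the original alignment.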
Results and Discussion |
---|
Positionally conserved E. coli ORFs are generally less diverged than orphans. Over 90% have fewer than 0.3 expected amino acid substitutions per site (fig. 2) (mean = 0.142, SD = 0.166). These protein distances are comparable to the average of 0.04 nonsynonymous nucleotide substitutions per nonsynonymous site found by Sharp (1991) for a set of 67 orthologous E. coli and S. typhi genes. There are a few positionally conserved E. coli genes that are unusually distant (more than one amino acid substitution per site) from their S. typhi homologs. These may represent homologous replacements or a subset of genes that have diverged unusually rapidly. To distinguish between these possibilities, each gene must be examined individually by comparison with orthologs in closely related species. Positionally conserved genes that aligned over a length of more than 90% of their sequence and had a protein distance of less than 0.5 substitutions per site from their S. typhi hit were termed "positional orthologs" (fig. 2 and table 1). D < 0.5 was arbitrarily chosen as the cutoff because there were few positionally conserved genes with D > 0.5, and for the orphan genes, 0.5 fell between two modes of the protein distance distribution (fig. 2). This subset of 2,728 E. coli genes is a conservative estimate of the number of orthologs that have retained chromosomal order since E. coli and S. typhi separated. Of the E. coli genes previously classified as horizontally transferred (Lawrence and Ochman 1998), 18% (135/747) have a positional ortholog in S. typhi, an unexpected number if the average time of introgression is 25.3 MYA (Lawrence and Ochman 1997).
Another group of E. coli orphans were similar to their S. typhi neighbors. Their origin is probably complex. A small number have distances typical of positionally conserved genes and are likely the result of transposition events involving one or only a few genes. A second group had intermediate distances (0.2–0.5) larger than the average of positionally conserved genes. Some of these orphans could result from duplication events that connect two E. coli genes to the same S. typhi ORF. An example may be the two E. coli formate dehydrogenase operons (fdo and fdn), both of which hit the same group of S. typhi sequences. Only one is positionally conserved; the other is an orphan with a distance of 0.26–0.85, depending on the gene. Another cause of orphans at intermediate protein distance is deletion of a positional ortholog from the S. typhi chromosome, causing the E. coli ORF to hit a second S. typhi gene that is very similar to the one deleted. Only a thorough examination of individual genes can distinguish among these various possibilities.
In order to identify genes unique to the E. coli genome and not found in S. typhi, we removed 194 ORFs from the orphan group that were relatively close in protein distance. Since very few of the genes classified as positional orthologs had D > 0.5, we used this as a cutoff to divide orphans into a group of "novel" ORFs versus "unclassified" ORFs (fig. 2). A total of 1,144 E. coli ORFs were placed in the novel class, representing a subset of those genes which had no similar S. typhi neighbor at the expected chromosomal position. Novel orphans are caused by deletion from the S. typhi genome or by introgression into the E. coli genome. An example of the latter is the E. coli LacA protein, 88% of which aligns with an S. typhi ORF at a protein distance of 1.0. The E. coli lactose operon is thought to have been horizontally introduced into the E. coli genome (Buvinger et al. 1984). The same S. typhi ORF is also hit by E. coli YlaD, its positional ortholog, at a distance of 0.23, as well as by a second novel E. coli ORF, WbbJ, at a distance of 0.82. These E. coli genes are part of a group of acetyltransferases with sufficient structural similarity to account for their common S. typhi TBLASTN hit. Sixty-four percent (479/747) of the E. coli genes previously identified as horizontally transferred (Lawrence and Ochman 1998) belong to the group of novel genes.
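The classification rules stated in this section can be summarized in a short sketch. The thresholds (more than 90% aligned, D < 0.5 for positional orthologs, D > 0.5 for novel orphans) come from the text; everything else is an assumption for illustration, including the `Orf` record, the treatment of orphans with no S. typhi hit as novel, and the label given to positionally conserved genes that fail the ortholog criteria, which the text does not name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Orf:
    name: str
    positionally_conserved: bool
    distance: Optional[float]      # protein distance to the S. typhi hit; None if no hit
    aligned_fraction: float = 0.0  # fraction of the E. coli sequence covered by the alignment


def classify(orf: Orf, d_cut: float = 0.5, min_aligned: float = 0.9) -> str:
    if orf.positionally_conserved:
        if (orf.distance is not None and orf.distance < d_cut
                and orf.aligned_fraction > min_aligned):
            return "positional ortholog"
        return "conserved, not ortholog"  # hypothetical label, not from the text
    # Orphans: distant or missing S. typhi hits are treated as novel (assumption for None).
    if orf.distance is None or orf.distance > d_cut:
        return "novel"
    return "unclassified"
```

Applied to the examples in the text, YlaD (positionally conserved, D = 0.23, with an aligned fraction above 0.9 assumed here) would be a positional ortholog, while the orphans LacA (D = 1.0) and WbbJ (D = 0.82) would fall in the novel class.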
Codon Bias of E. coli ORFs
Escherichia coli genes identified as positionally conserved can be used to estimate the nucleotide composition of E. coli/S. typhi orthologs. This group of genes has undergone a conservative mode of evolution compared with orphan genes, as shown by their distinct protein distances (fig. 2). They are expected to have nucleotide compositions characteristic of genes resident in the E. coli genome since E. coli and S. typhi diverged. Novel genes that have been introduced into the E. coli genome should have compositions reflecting their origin. Positionally conserved genes can be used to test the hypothesis that extreme nucleotide composition is a reliable measure of horizontal transfer. If this is true, any group tentatively identified as foreign on the basis of extreme composition, especially at the third codon position, should contain few, if any, positionally conserved ORFs. Positionally conserved genes are less variable in their use of synonymous codons and tend to have fewer genes with high (A+T)3 (fig. 3). Only 14% of ORFs with (A+T)3 > 0.65 are positionally conserved, and this decreases to 9% for (A+T)3 > 0.7 (table 1). The other extreme of (A+T)3, corresponding to high (G+C) content, is less effective as a predictor of positional conservation, as 37% of ORFs with (A+T)3 < 0.3 are positionally conserved. Although extremes of (A+T)3 are deficient in positionally conserved genes, the typical or modal range is not solely composed of this group. For example, out of 2,967 ORFs with (A+T)3 between 0.35 and 0.50, fully 667 (22%) are not positionally conserved. Thus, an extreme base composition bias enhances detection but produces many false positives, while genes of intermediate base composition are mostly positional orthologs but may also include many genes that may have been horizontally transferred into E. coli.
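The percentages quoted above are shares of positionally conserved ORFs within a window of third-position (A+T). A minimal sketch of that tabulation is given below; the function name, the inclusive window bounds, and the input layout (pairs of an (A+T)3 value and a conserved flag) are assumptions made for the example.

```python
def conserved_share(orfs, lo: float, hi: float) -> float:
    """Share of positionally conserved ORFs among those with lo <= (A+T)3 <= hi.

    orfs: iterable of (at3, is_conserved) pairs. Inclusive bounds are an assumption.
    """
    in_window = [conserved for at3, conserved in orfs if lo <= at3 <= hi]
    if not in_window:
        raise ValueError("no ORFs fall in the requested (A+T)3 window")
    return sum(in_window) / len(in_window)
```

With the counts given in the text for the modal range (2,967 ORFs, of which 667 are not positionally conserved), the conserved share works out to 2,300/2,967, or about 78%, leaving the 22% quoted above.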
Phylogenies Can Identify Horizontally Transferred Genes
Horizontal transfer produces chromosomes containing genes with different ancestries and durations in the genome (Lawrence and Ochman 1998). Introgression can alter the topologies of gene trees; therefore, a phylogenetic approach can be employed to detect horizontally transferred genes. If a gene is confined to one taxon or species, it is more likely to have been acquired through horizontal transfer than to have been lost independently from multiple lineages (Smith, Feng, and Doolittle 1992). If it can be shown that many different genes from the same organism have statistically different phylogenies, this would indicate that horizontal gene transfer is playing a major role in the evolution of the species.
A total of 80/102 statistically significant protein trees were generated (see http://life.biology.mcmaster.ca/liisa/appendix.html) and used to explore how accurately our classification can identify potential horizontally transferred genes (table 2). A large number of genes cannot be used to generate phylogenies, as homologs have not yet been identified in a sufficient number of species. Out of 24 protein trees tested for the group of positional orthologs, all generated phylogenies consistent with vertical evolution. We expect all of the genes in this category to be native to E. coli since its divergence from Salmonella. An example of an E. coli gene of unusual nucleotide composition which has evolved normally is gloB, coding for a probable hydroxyacylglutathione hydrolase. It is one of the four positional orthologs among the 100 E. coli genes with the lowest ZCAI. However, its (A+T)1 (0.51) and (A+T)3 (0.35) do not place it among the ORFs that are extreme in this measure. GloB was categorized by Lawrence and Ochman (1998) as horizontally transferred. However, its tree and homologous chromosomal position are consistent with a vertically evolving gene within E. coli and Salmonella (fig. 6). Of the 24 positional orthologs whose trees do not show evidence of horizontal transfer, Lawrence and Ochman (1998) classified 15 as horizontally transferred based on codon bias and base composition. This does not represent an estimate of the frequency of misclassification, however, as we specifically analyzed a subsample of positional orthologs which had been previously classified as horizontally transferred. A lack of phylogenetic evidence for horizontal transfer is also not proof that it did not occur, since introgressions between closely related organisms would go unnoticed by this technique. Our phylogenetic analysis simply fails to find evidence that horizontal transfer has taken place between distantly related species.
Conclusions |
---|
We confirmed that a majority of the E. coli ORFs with extreme nucleotide compositions do not have positional orthologs in the S. typhi genome. (A+T)1 and (A+T)3 together were found to be a better indicator of nonorthologous genes than (A+T)3 alone or CAI, while a combination of CAI and (A+T) at first and third positions was found to be an even better indicator. An unusually large CAI, on the other hand, is a good indicator of orthologous transmission. Over 90% of those E. coli ORFs with CAI > 0.45 were found to have an S. typhi ortholog, indicating that high-expression genes have been retained in both genomes since their divergence.
While extreme nucleotide composition is a good predictor of nonorthologous transmission, it identifies only a small fraction of such genes. Approximately 20% of E. coli ORFs with typical nucleotide composition and codon usage, representing over 600 genes, have no positional ortholog in S. typhi. Most of these are genes that have been either deleted from the S. typhi genome or horizontally introduced into the E. coli genome. Gene trees for those "novel" ORFs for which sufficient data were available (table 2) suggest that approximately 52% of these are horizontal introgressions. Therefore, using nucleotide composition alone to identify horizontal introgressions would miss approximately 300 genes. It is perhaps not surprising that many horizontal introgressions would have nucleotide compositions typical of the E. coli genome, since gene transfer is expected to be more likely between closely related bacteria. The process of amelioration would also account for the typical composition of some horizontally transferred genes.
Even though there have been numerous additions and deletions to the E. coli and S. typhi genomes, our analysis indicates that these two species have retained a common core of approximately 2,700 genes. Many of these are highly expressed E. coli genes, as identified by CAI. These conserved genes account for the colinearity of the Escherichia and Salmonella genetic maps (Riley and Krawiec 1987; Krawiec and Riley 1990). This conserved core must provide for the common properties of Escherichia and Salmonella, while those added by introgression determine distinctions among subgroups and adaptations to novel environments (Ochman, Lawrence, and Groisman 2000).
Phylogenetic analysis has confirmed that a substantial fraction of novel E. coli ORFs originated from horizontal transfer since this species diverged from Salmonella. Even the Escherichia and Salmonella genera themselves may have split because of introgressions that provided distinctive functions (Groisman, Saier, and Ochman 1992; Groisman et al. 1993). It is clear from our analysis of gene trees and chromosomal position that base composition and codon usage patterns should not be used without additional support to identify horizontally transferred genes. Combining measures based on base composition with a phylogenetic approach is required to eliminate vertically evolved genes with atypical composition. Although the analysis of gene trees is time-consuming and requires a number of well-characterized species, it is an important tool in the analysis of horizontal gene transfer and should be employed whenever possible.
Footnotes |
---|
1 Keywords: horizontal gene transfer, Escherichia coli, codon bias
2 Address for correspondence and reprints: G. Brian Golding, Department of Biology, McMaster University, 1280 Main Street West, Hamilton, Ontario, Canada L8S 4K1. E-mail: golding{at}mcmaster.ca
Literature Cited |
---|
Adachi, J., and M. Hasegawa. 1992. Molphy: programs for molecular phylogenetics, I. Protml: maximum likelihood inference of protein phylogeny. Computer Science Monographs.
Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–410.
Blattner, F. R., G. Plunkett III, C. A. Bloch et al. (17 co-authors). 1997. The complete genome sequence of Escherichia coli K-12. Science 277:1453–1474.
Buvinger, W. E., K. A. Lampel, R. J. Bojanowski, and M. Riley. 1984. Location and analysis of nucleotide sequences at one end of a putative lac transposon in the Escherichia coli chromosome. J. Bacteriol. 159:618–622.
Doolittle, R. F., D. F. Feng, S. Tsang, G. Cho, and E. Little. 1996. Determining divergence times of the major kingdoms of living organisms with a protein clock. Science 271:470–477.
Felsenstein, J. 1994. PHYLIP (phylogeny inference package). Version 3.5. Distributed by the author, University of Washington, Seattle.
Groisman, E. A., M. H. Saier Jr., and H. Ochman. 1992. Horizontal transfer of a phosphatase gene as evidence for mosaic structure of the Salmonella genome. EMBO J. 11:1309–1316.
Groisman, E. A., M. A. Sturmoski, F. R. Solomon, R. Lin, and H. Ochman. 1993. Molecular, functional, and evolutionary analysis of sequences specific to Salmonella. Proc. Natl. Acad. Sci. USA 90:1033–1037.
Huynen, M. A., and P. Bork. 1998. Measuring genome evolution. Proc. Natl. Acad. Sci. USA 95:5849–5856.
Ikemura, T. 1981. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 151:389–409.
Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order of the Hominoidea. J. Mol. Evol. 29:170–179.
Krawiec, S., and M. Riley. 1990. Organization of the bacterial chromosome. Microbiol. Rev. 54:502–533.
Lawrence, J. G., and H. Ochman. 1997. Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol. 44:383–397.
Lawrence, J. G., and H. Ochman. 1998. Molecular archaeology of the Escherichia coli genome. Proc. Natl. Acad. Sci. USA 95:9413–9417.
Médigue, C., T. Rouxel, P. Vigier, A. Henaut, and A. Danchin. 1991. Evidence for horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol. 222:851–856.
Mrázek, J., and S. Karlin. 1999. Detecting alien genes in bacterial genomes. Ann. N.Y. Acad. Sci. 870:314–329.
Ochman, H., and J. G. Lawrence. 1996. Escherichia coli and Salmonella typhimurium: cellular and molecular biology. Pp. 2627–2637 in American Society for Microbiology, Washington, D.C.
Ochman, H., J. G. Lawrence, and E. A. Groisman. 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405:299–304.
Riley, M., and S. Krawiec. 1987. Escherichia coli and Salmonella typhimurium: cellular and molecular biology. Pp. 967–981 in American Society for Microbiology, Washington, D.C.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.
Sharp, P. M. 1991. Determinants of DNA sequence divergence between Escherichia coli and Salmonella typhimurium: codon usage, map position, and concerted evolution. J. Mol. Evol. 33:23–33.
Sharp, P. M., and W. H. Li. 1986. Codon usage in regulatory genes in Escherichia coli does not reflect selection for rare codons. Nucleic Acids Res. 14:7737–7749.
Sharp, P. M., and W. H. Li. 1987. The Codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15:1281–1295.
Smith, D. K., T. Kassam, B. Singh, and J. F. Elliott. 1992. Escherichia coli has two homologous glutamate decarboxylase genes that map to distinct loci. J. Bacteriol. 174:5820–5826.
Smith, M. W., D. F. Feng, and R. F. Doolittle. 1992. Evolution by acquisition: the case for horizontal gene transfers. Trends Biochem. Sci. 17:489–493.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.