On the Nature of Gene Innovation: Duplication Patterns in Microbial Genomes

Sean D. Hooper and Otto G. Berg

Department of Molecular Evolution, Uppsala University, Uppsala, Sweden


    Abstract
Gene duplication is considered a major force in gene family expansion and gene innovation. As gene copies assume novel functions, they must avoid periods of neutrality or be deleted from the genome. Current opinions state that copies avoid neutrality through gene dosage effects. These copies are therefore selected from an early stage. This study concentrates on the flow of copies from recent duplication to gene innovation. We have studied 21 microbial genomes using amino acid divergence to describe paralog evolution in the long-term perspective. Five of these were studied in closer detail using nucleotide divergence for a shorter perspective. It was found that rates of duplication and deletion are high, with only a small fraction of duplications retained and apparently selected. This leads to a steady accumulation of paralogs, which seems to be of a similar magnitude in most of the genomes. Furthermore, it is found that genes of high expression level, as measured by their codon bias, are strongly underrepresented among the most recent duplications. Based on these and other observations, it is suggested that gene innovation is driven by amplification of weak, ancillary functions rather than strong, established functions.

Key Words: Paralog evolution • gene duplication • gene family expansion • S. cerevisiae • E. coli


    Introduction
The success of an organism depends to a large extent on its ability to adapt to changing environments and to exploit new niches. This requires new gene functions that can cope with the environmental changes, for instance by metabolizing new substrates or deactivating toxins. The classical model of the creation of these novel genes is by duplication followed by divergence (Ohno 1970). A redundant copy can then freely be modified and may in time assume a new role. By duplication and divergence of genes, successful subdomains can be reused for new or related purposes. As a result, the two genes can be said to belong to a family, being related by sequence similarity, if not by function. Naturally, this relationship will decrease with time until no discernable similarity can be observed in regions of low conservation.

After duplication, the resulting copy can meet one of three fates: selection, silencing, or deletion (Ohno 1970; Li 1999). The duplicate may therefore avoid redundancy by assuming a novel selected function or by splitting ancestral functions. Copies may also be degraded to pseudogenes by mutational inactivation. Finally, copies can be removed from the genome by deletion. Of course, mutational time is a deciding factor, since copies need sufficient modifications to be able to assume roles different from their parents, assuming that they are initially neutral. Thus, the deletion rate is of great importance to gene innovation by giving copies time to diverge. Recently, the classical view of duplication, that one copy is neutral and free to evolve while the other remains selected, has been challenged in work by Kondrashov et al. (2002) and Lynch and Conery (2000), who show that paralogs do not seem to have experienced any extensive period of neutral evolution. Kondrashov et al. (2002) proposed that paralogs avoid neutrality through gene amplification, followed by a period of either relaxed or positive selection. Kondrashov et al. (2002) also observed that paralogs evolve faster than their corresponding orthologs. Again, this could be due to relaxed or positive selection. Another theory of gene innovation by duplication is the duplication-degeneration-complementation (DDC) model (Force et al. 1999). In this model, paralogs become selected and retained by losing separate subfunctions from a multifunctional ancestor gene. Redundant material is discarded through degradation. A large number of observations (Force et al. 1999) support the DDC model, although mostly in diploid or polyploid eukaryotes.

Lynch and Conery (2000) have studied the distributions of paralogs in six eukaryotes as a function of the numbers of substitutions at silent sites and replacement sites, estimating rates of origin of new duplications. In this study, however, we focus primarily on microbial genomes and the numbers of paralogs in different categories of amino acid similarity in order to reach an understanding of how effective duplication is as a mechanism of gene innovation. In some of the cases, we also look at relations with the numbers of changes per synonymous site (Ks) to study the time scales and estimate rates. Furthermore, in genomes with strong codon bias (e.g., E. coli and S. cerevisiae) the relationship between the expression level of a gene and its propensity for duplication can be studied.


    Materials and Methods
The full nucleotide sequences of the microbes in table 1 were downloaded from GenBank (http://www.ncbi.nlm.nih.gov/Genbank/).


Table 1 Organisms Used.

 
Within each organism, Blast (Altschul et al. 1990) scores were calculated by searching a six-frame protein sequence translation of nucleotide sequences against protein sequences in the genome. Sequences scoring hits against other sequences with an E-value of 10^-20 or lower were considered to be paralogs and were therefore linked into a cluster. If further paralogs were found, they were also linked into the cluster. Thus, it is possible for two genes to belong to the same cluster even though they have no direct similarities to each other; they may match separate regions of a third gene. However, these clusters are used only to identify and count duplication events; no claim of homology between all pairs of genes in the cluster is implied.
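The linking rule described above amounts to single-linkage clustering: any hit at or below the E-value threshold joins two sequences, and the clusters are the connected components of the resulting graph. A minimal Python sketch follows, assuming the Blast hits have already been parsed into (gene, gene, E-value) tuples. The function name `cluster_paralogs`, the input format, and the gene names are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-20  # threshold stated in the text


def cluster_paralogs(hits, cutoff=E_VALUE_CUTOFF):
    """Group genes into single-linkage clusters from pairwise hits.

    hits: iterable of (gene_a, gene_b, e_value) tuples (hypothetical format).
    Returns a list of sets, one per cluster with at least two members.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, e_value in hits:
        if a != b and e_value <= cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    return [members for members in groups.values() if len(members) > 1]


# Made-up example: geneA and geneC never hit each other directly,
# but both match geneB, so all three end up in one cluster.
example_hits = [
    ("geneA", "geneB", 1e-35),
    ("geneB", "geneC", 1e-22),
    ("geneD", "geneE", 1e-5),  # too weak, not linked
]
clusters = cluster_paralogs(example_hits)
# A cluster of n genes is counted as n - 1 duplication events.
events = sum(len(c) - 1 for c in clusters)
```

Union-find is used here only because it makes the transitive linking explicit; any connected-components routine would give the same clusters.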

The degree of amino acid sequence similarity (identity) between two paralogs is a measure of how well they align. In this study, identity levels are used as cutoffs for cluster building. If two genes have an identity level equal to or higher than the cutoff, they are clustered. The average size of clusters will increase as identity cutoffs decrease. The set of clusters where the identity cutoff is 95% is named p95, an abbreviated form of "paralogs at 95% or higher identity." The succeeding set of clusters is built at p90, that is, at a cutoff of 90% identity, and so on down to p60. Thus, paralogs in p95 have experienced a lower number of nonsynonymous changes than those that first appear at p90. As mutational changes accumulate, paralogs become less similar, thereby moving down along the identity groups from p95 to p60.

Identity was based on the amino acid level, and not on nucleotide differences, since paralogs are assumed to attain new functions when and if the resulting protein is sufficiently distinguishable from the original. Nucleotide differences are not as informative in this respect.

Difference sets are often used; for example, the difference set of p60 minus p70 consists of those genes scoring more than or equal to 60% but less than 70% identity. This isolates paralogs that have experienced a certain number of changes and are therefore at different stages in a paralog life span (see Introduction).
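The relation between the cumulative identity sets and the difference sets can be sketched in a few lines of Python. This is a simplified illustration that assumes one identity percentage per paralog pair rather than the cluster-based bookkeeping used in the study. The function names, pair labels, and identity values are made up.

```python
def cumulative_set(identities, cutoff):
    """Pairs with identity >= cutoff (cutoff=95 gives the p95 set)."""
    return {pair for pair, ident in identities.items() if ident >= cutoff}


def difference_set(identities, low, high):
    """Pairs in p<low> but not in p<high>, i.e. low <= identity < high."""
    return cumulative_set(identities, low) - cumulative_set(identities, high)


# Hypothetical amino acid identity percentages per paralog pair.
identities = {
    ("a1", "a2"): 98.0,
    ("b1", "b2"): 91.5,
    ("c1", "c2"): 66.0,
    ("d1", "d2"): 60.0,
    ("e1", "e2"): 45.0,  # below p60, falls outside all sets
}

p95 = cumulative_set(identities, 95)
p90_minus_p95 = difference_set(identities, 90, 95)
p60_minus_p70 = difference_set(identities, 60, 70)
```

Because each cumulative set contains all sets built at higher cutoffs, subtracting the next set up leaves exactly the pairs that fall within one identity band.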

Transposon-Elements, Phage-Elements, and IS-Elements
Not all paralogs in a genome can be regarded as genes with protein products that contribute to an expansion of gene families. There are open reading frames (ORFs), which have a higher propagation rate than others purely by virtue of viral activity or repetitive DNA structures, being more promiscuous than ordinary genes. Such ORFs are often referred to as transposons or insertion sequences (IS), which carry no information other than the structures necessary for transposition.

Rather than relying on annotation, which may be cumbersome for a large number of organisms, a simplistic way of assessing promiscuous sequences could be the following: If duplication sites are random and equally likely, cluster sizes in p95 can be approximated by a Poisson distribution, with the expected value (number of genes in p95 divided by the total number of genes in the organism). Under this approximation, the probability of randomly duplicating a particular gene twice within the same identity category is less than 0.01 for S. solfataricus and S. cerevisiae and even lower for the remaining organisms in this study. Clusters of size two and three are therefore included, whereas clusters of four or more genes in p95 are considered to be effects of nonrandom duplication. The terms random and nonrandom are used only in this mathematical sense. We refer to the removed sequences as nonrandom duplications (NRDs).
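The filter can be illustrated with a short Python sketch. It assumes that the quantity of interest is the Poisson tail probability of two or more duplications of the same gene; whether the study used this tail or the point probability is not stated here. The gene counts are hypothetical, and the helper names `poisson_tail` and `split_nonrandom` are introduced only for this example.

```python
import math


def poisson_tail(lam, k_min):
    """P(X >= k_min) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(k_min))
    return 1.0 - below


def split_nonrandom(clusters, max_random_size=3):
    """Separate p95 clusters into retained ones (size <= 3) and NRDs (size >= 4)."""
    kept = [c for c in clusters if len(c) <= max_random_size]
    removed = [c for c in clusters if len(c) > max_random_size]
    return kept, removed


# Hypothetical genome: 300 of 3,000 genes fall in p95.
genes_in_p95 = 300
genes_total = 3000
expected = genes_in_p95 / genes_total  # Poisson mean per gene
p_twice = poisson_tail(expected, 2)    # chance that one gene is hit two or more times

# Hypothetical p95 clusters: the five-member cluster is flagged as an NRD.
kept, removed = split_nonrandom([{"g1", "g2"}, {"g3", "g4", "g5"},
                                 {"t1", "t2", "t3", "t4", "t5"}])
```

With these made-up counts the tail probability is below the 0.01 level quoted in the text, which is the sense in which clusters larger than three are treated as nonrandom.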

For a few prominent organisms, we evaluated how well this simple filtering worked. For E. coli, the NRD filter removed 43 of 54 annotated transposons in p95. In S. cerevisiae, 82 of 84 transposon-related sequences were removed. In S. typhimurium, however, only five paralogs were removed and they were annotated as ABC transporters, while 24 prophage sequences remained. Members of the ABC transporter superfamily share a conserved ATP-binding cassette but can be relatively diverse in what they transport. In V. cholerae, 24 sequences were removed, none of which was annotated as a transposon. Their annotated functions suggest that they are not members of the integron island (Heidelberg et al. 2000). Of the 63 remaining sequences, two were annotated as transposons. The effect of this method on L. lactis was rather ambiguous; whereas 38 annotated transposons were removed, the remaining 18 sequences were annotated as prophage DNA. Clearly, the method will miss promiscuous sequences that happen to be present in only a few copies. Conversely, some partially diverging sequences with different functions could be erroneously removed by this method (e.g., the ABC transporters).

Hence, it is clear that not all such sequences can be removed from all genomes. In fact, the p95 group should always be considered as spurious. In the remaining groups p90 to p60, very few transposons are found. In E. coli, only two new transposon groups are found below p95. It appears as if these promiscuous sequences have a low survival rate in most genomes.

A Mutation and Deletion–Based Model of Paralog Distribution
We consider two classes of duplications: neutral duplications that are subject to deletion (with rate kdel per gene per generation) and selected duplications that remain. After time T, the duplicate and the original will have a synonymous divergence (average number of synonymous changes per synonymous site) of Ks = UT, where U is the synonymous substitution rate (roughly twice the mutation rate per base pair, since the change can occur on either copy). The influx of new duplications per unit time is assumed to be α0U and αsU for neutral and selected duplications, respectively. Thus, α0 and αs are the duplication rates normalized to the mutation rate. Assuming further that duplicated genes are on average L codons in length, the number of neutral duplicated genes with i synonymous changes will vary in time according to


At the stationary state, the time derivatives are zero and


Note that i and Ks count the number of synonymous changes per site that have occurred and therefore include multiple hits. Similarly, the number of selected duplications with i synonymous changes is


The total number of duplicated genes with divergence less than or equal to Ks is the sum of equations 3 and 4 from to . Thus, . The approximation holds if KsL is sufficiently large, that is, so large that neutral duplicates do not survive beyond divergence Ks. Thus, the total number of neutral duplications is


which corresponds to the excess number of duplications that appear with small Ks values (including ). In all genomes we looked at, from equation 4. Then the number of unmutated duplications is and from equations 3 and 5 the deletion rate can be calculated as


and the neutral duplication rate is


In E. coli, from a comparison of the survival of lateral transfer genes in the strains K12 and O157 (Hooper and Berg 2002), we have estimated the ratio for neutral deletions. The same number has been estimated in S. cerevisiae (Lynch and Conery 2000) using a different method. Thus, only a very slight fraction of neutral duplications can remain beyond Ks larger than ca. 0.5. Only the selected duplications remain for larger values of Ks. Using the laboratory mutation rate (Drake 1991) for base pair substitution (uBP) as the rate of synonymous change in E. coli, per base pair per generation, or per Myr if there are ca. 200 generations per year. The mutation rate (per base pair per generation) in S. cerevisiae is roughly half of this (Drake 1991).

For the rate of appearance of new gene duplications in E. coli we estimated (Hooper and Berg 2002) per genome. As we will see below, the present analysis, accounting specifically for the class , gives significantly larger estimates for duplication and deletion rates than those quoted above.
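The verbal description of the model in this section can be written out as follows. This is a sketch, not the published equations: the symbols $n_i^{0}$, $n_i^{s}$, and $n_{neutral}$, and the assumption that a gene of L codons offers roughly L synonymous sites (so that a gene acquires synonymous changes at rate UL), are introduced here for illustration. The numbering follows the in-text references to equations 1 to 7, and the published forms may differ in detail.

```latex
% Neutral duplicates with i synonymous changes: n_i^{0}(t).
% Selected duplicates with i synonymous changes: n_i^{s}.
\begin{align}
\frac{dn_0^{0}}{dt} &= \alpha_0 U - (UL + k_{del})\, n_0^{0} \tag{1}\\
\frac{dn_i^{0}}{dt} &= UL\, n_{i-1}^{0} - (UL + k_{del})\, n_i^{0},
  \qquad i \ge 1 \tag{2}\\
n_i^{0} &= \frac{\alpha_0 U}{UL + k_{del}}
  \left(\frac{UL}{UL + k_{del}}\right)^{i}
  \qquad \text{(stationary state)} \tag{3}\\
n_i^{s} &= \frac{\alpha_s}{L} \tag{4}\\
n_{neutral} &= \sum_{i=0}^{\infty} n_i^{0} = \frac{\alpha_0 U}{k_{del}} \tag{5}\\
\frac{k_{del}}{U} &= \frac{L\, n_0^{0}}{n_{neutral} - n_0^{0}} \tag{6}\\
\alpha_0 &= n_{neutral}\, \frac{k_{del}}{U} \tag{7}
\end{align}
```

Under these assumptions, the cumulative number of duplicates with divergence up to Ks approaches α0U/kdel + αsKs once KsL is large: a constant density αs per unit Ks from the selected class, plus an excess at small Ks from neutral duplicates that have not yet been deleted.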

Substitution Rates
For some organisms, the numbers of synonymous changes at synonymous sites (Ks) and nonsynonymous changes at nonsynonymous sites (Ka) were calculated using the PAML package (Yang 1997). These calculations were based on events rather than clusters (see above). The number of duplication events in a cluster is consequently the number of genes in the cluster minus one. Values of Ka are very strongly negatively correlated with gene identity, as expected.

Another noteworthy difference in this method compared with the amino acid method is the acceptance of only one frame—both paralogs must keep the same coding frame. Only ORFs or genes that are in the same frame can be used for the calculation of Ka and Ks. Thus, we can expect some differences when comparing an event model with the cluster model.


    Results
Number of Paralogs
After the initial clustering of genes by the different E-value thresholds, the total numbers of paralogs (np) correlate well to the genome size (nG) of the organism (fig. 1). The correlation coefficient is 0.96. Exceptions to this may be the obligate intracellular parasites C. trachomatis and R. prowazekii. The linear regression between the number of clustered paralogs (np) and the total number of genes (nG) is



FIG. 1. Correlation of total number of paralogs with genome size. Nonrandom duplications are excluded from both variables. A linear regression is shown

 
The frequencies of nonrandom duplication vary greatly among the microbes in this study (table 2) and do not seem to be correlated with lifestyle, except that all the obligate intracellular parasites lack these sequences. The crenarchaeote S. solfataricus has a high proportion of genes in large clusters in p95, suggesting that a number of genes have been duplicated more than twice within a timeframe of less than or equal to one change per 20 amino acids. It has been estimated that nearly 10% of the S. solfataricus genome consists of transposons (She et al. 2001), which is consistent with this study. However, the other crenarchaeote, P. aerophilum, was found to have few sequences resembling transposons or IS elements at high nucleotide identity levels (Fitz-Gibbon et al. 2002). Bacteria that are otherwise quite similar, such as E. coli and S. typhimurium, can differ greatly in numbers of nonrandom duplications (table 2). This would reflect the fast rate of transposon propagation in a genome. When nucleotide identity scores are calculated for the E. coli p95 set of genes, nonrandom duplications score on average very close to 100%. Therefore, there is a sharp decline in the numbers of potential transposons as changes accumulate. The low numbers of transposons/IS elements in p90 and below imply a rapid deletion of such sequences. In the studied eubacteria, there is a general lack of "older" members of transposon groups, which could imply that transposons invade genomes, propagate, lose the ability to propagate after mutational degradation, and are deleted. Alternatively, the mobile nature of these sequences may imply that they are also deleted more often. Deletion rates could be sufficiently high to limit transposons in some eubacteria and archaea. In S. solfataricus and P. aerophilum, deletion rates may not be large enough, since these structures seem to have more impact in data sets of decreasing identity (see below).


Table 2 Difference Sets.

 
Amino Acid Divergence
The distribution of paralogs by identity scores is shown in table 2. This distribution can be interpreted as a static one, where paralogs have diverged just as far as functional requirements demand and are then conserved at that level. More likely, it represents a snapshot of a flux, where paralogs slowly continue to diverge. Different paralogs would flow at different rates, depending on selection and functional constraints, but the group with the highest amino acid identity would correspond to the most recent duplications. This latter picture is also supported by the analysis of the nucleotide divergence below.

In general terms, the archaea and free-living eubacteria have roughly the same patterns of paralogs in groups p90 to p60, with differences in p95 being attributable to possible remaining IS sequences and other spurious ORFs (see above), to large groups of transporters (e.g., S. typhimurium), or to duplication of large segments of DNA (e.g., N. meningitidis [see below]). B. subtilis is known to have large families, notably a group of 77 ATP-binding transport proteins (Kunst et al. 1997), but these are not very pervasive through groups p95 to p60. The majority of these genes are found in the Rest set.

A large number of the excess genes in p95 could also be very recent neutral duplications that have not yet been deleted. This will be considered in more detail with the nucleotide divergence below. The smaller free-living eubacterial genomes considered have up to approximately 2,300 genes, ranging from T. maritima at 1,842 genes to L. lactis at 2,266 genes. The proportion of p95 paralogs in the genome is generally low, with the notable exception of N. meningitidis, which has a disproportionate number of recent duplications. In this genome, 5.5% of the genes are paralogs of 90% or higher identity. Upon closer inspection, many of these paralogs have been created in few but large multiple sequence duplications. In fact, the difference set of (p90–p95) is more similar in size to other eubacteria, indicating that the large number of recent paralogs may be the result of a sudden "burst" of duplication and perhaps not representative of an ongoing process. The archaea also cover a wide range of duplication patterns, from M. jannaschii with very few recent copies to S. solfataricus and A. fulgidus, which are comparable to the free-living eubacteria. Gene duplication has been proposed to be significant in Archaeoglobales diversity (Klenk et al. 1997). All four archaea are extreme thermophiles, so no correlation with lifestyle is observed. S. solfataricus is heavily affected by the removal of nonrandom duplications and would otherwise have very high proportions of paralogs in p95. S. solfataricus has the largest number of paralogs in sets p90 and below of all organisms used in this study. These could be remnants of old transposons that are gradually being degraded and deleted, which is consistent with the steady decline of paralogs, or they could be copies of functional genes that have been duplicated in conjunction with transposons.

R. prowazekii and C. trachomatis are both considered obligate intracellular parasites (Zomorodipour and Andersson 1999) undergoing a reductive evolution (Andersson and Kurland 1998), and a similar mechanism of gene decay can be observed in M. leprae (Cole et al. 2001). They appear very different from the majority of eubacteria mentioned above, since duplications seem to be very rare events in R. prowazekii and C. trachomatis—both have only one duplication event in p85. R. conorii is similar to R. prowazekii and has a similar lifestyle. M. leprae has eight recent paralogs, suggesting that the reductive evolutionary forces are not yet as severe as in the two other obligate parasites. M. tuberculosis is not an obligate intracellular parasite, which may be reflected in the higher numbers of paralogs.

The sole eukaryote, S. cerevisiae, also clearly distinguishes itself in the data set of table 2, having a large number of paralogs throughout p95 to p60. The large number of clusters in p95 could indicate a duplication rate that is much higher than for any other organism used in this study. Even though a large amount appears to be deleted at p90, there is still a considerable amount of paralogs in sets p90 and below, suggesting a larger flow of diverging duplications. This picture is somewhat modified when we look at the distribution of duplications versus synonymous nucleotide divergence (Ks) below. It is worth noting that of the 818 paralogs in p95, more than half are paralogous to sequences on the noncoding strand of their hit. These duplications are not ORFs in themselves, but have been incorporated into other ORFs as fragments. Although they appear to be bona fide duplications and may alter existing genes, they do not contribute new genes. Furthermore, they all have 100% amino acid identity, albeit not in the expected translation frame, and almost all are of 100% nucleotide identity as well. They therefore seem to be of recent origin and very short lived.

Nucleotide Divergence
While the amino acid identity score is an important measure of gene-product divergence and therefore of the contribution to gene innovation, it does not answer several vital questions, such as whether paralogs can be considered neutral, selected but diverging, or selected and conserved. Paralogs that are selected and conserved would be saturated in Ks but still have high identity (low Ka) scores.

The distributions of paralogs in the identity groups give only a rough idea of the age distribution, as amino acid replacements have varied rates. Age is better represented by the number of synonymous changes (Ks). In table 3, we have listed the number of duplications as distributed over divergence classes (Ks). In this list, ORFs annotated as transposons or phage related have been removed. As shown in table 3, the distribution of duplications over Ks groups is very similar to that over the identity groups. There is an excess density at , and an approximately constant density up to . Ks values above 2 to 3 are influenced by saturation and are not useful as a time measure. The approximately constant density for is interpreted as the contribution from selected duplications (eq. 4). The excess in the groups at is interpreted to be from neutral duplications waiting to be deleted (eq. 5).


Table 3 Distribution of Events by Ks.

 
Ks can be calculated for out-of-frame duplications only when the nucleotide identity is 100%, in which case . In table 3, there is a large fraction (e.g., 98% in S. cerevisiae and 100% in E. coli) of new duplications () in a different frame. These seem to be very short lived, as there are very few out-of-frame duplications that have less than 100% nucleotide identity; that is, most of them disappear before any mutation has occurred, implying . These 100% identical out-of-frame duplications have been included in the calculations below and contribute to the very high estimated turnover rates.

A Ka/Ks ratio that exceeds 0.5 in the range can be considered approximately neutral. In the eubacteria, there are few paralogs that satisfy this criterion. The proportion of events that exceed 0.5 was one of 134 in E. coli, two of 85 in B. subtilis and two of 175 in P. aeruginosa. S. cerevisiae had a higher number, 21 of 246 events, and the archaea P. aerophilum had three of 64 events.

Gene Expression
The codon adaptation index (CAI; Sharp and Li 1987a) is used as a measure of expression level, since CAI compares the codon usage in a gene with that of a reference set of highly expressed and selected genes. The distribution of CAI scores among the paralogs was studied for E. coli and S. cerevisiae (fig. 2). Genes scoring higher than 0.4 were considered to be of high expression. In E. coli, we observed that recent () duplications are skewed towards low CAI; in only two of 32 events do both paralogs score higher than , compared with 18% in the whole genome. Thus, genes of low expression, as measured by CAI, are overrepresented among recent duplications (). This could imply that a duplication event of any of the highly expressed genes is counter selected—perhaps through an all too severe gene dosage effect. Alternatively, duplication events could be more common in regions of low CAI genes. The CAI distribution of events with appears skewed to high CAI values (fig. 2a); this is discussed below.
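The CAI of a gene is the geometric mean of the relative adaptiveness of its codons, where the relative adaptiveness of a codon is its count in the reference set divided by the count of the most frequent synonymous codon for the same amino acid. The Python sketch below illustrates that definition on a toy scale. It covers only four amino acids, uses a pseudocount for unseen codons, and builds its reference set from made-up sequences; none of these choices are taken from the study.

```python
import math
from collections import Counter

HIGH_EXPRESSION_CUTOFF = 0.4  # threshold used in the text

# Toy subset of synonymous codon families (standard genetic code).
# A full implementation would cover every amino acid except Met and Trp.
SYNONYMOUS = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
    "F": ["TTT", "TTC"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
}


def codons_of(seq):
    """Split a coding sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """w = codon count / count of the most used synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in codons_of(s))
    weights = {}
    for codons in SYNONYMOUS.values():
        best = max(counts[c] for c in codons) or pseudocount
        for c in codons:
            weights[c] = max(counts[c], pseudocount) / best
    return weights


def cai(seq, weights):
    """Geometric mean of w over the scorable codons of a gene."""
    logs = [math.log(weights[c]) for c in codons_of(seq) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference set standing in for highly expressed genes.
reference = ["".join(["AAA", "AAA", "AAG", "GAA", "GAA", "GAG",
                      "TTC", "TTC", "TTT", "GGT", "GGT", "GGC"])]
weights = relative_adaptiveness(reference)

preferred_gene = "AAAGAATTCGGT"  # uses only the most frequent codons
rare_gene = "AAGGAGTTTGGC"       # uses only the second most frequent codons
cai_preferred = cai(preferred_gene, weights)
cai_rare = cai(rare_gene, weights)
is_high = cai_preferred > HIGH_EXPRESSION_CUTOFF
```

A gene built entirely from the preferred codons of the reference set scores the maximum value, and each use of a rarer synonymous codon pulls the geometric mean down.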



FIG. 2. (a) E. coli CAI (average of both paralogs) versus Ks. (b) S. cerevisiae CAI (average of both paralogs) versus Ks

 
For S. cerevisiae, the picture is radically different. There is an excess of highly expressed genes in recent duplications with (). Sixty of 120 (50%) events have , compared to 5.1% in the whole genome. This is in agreement with findings by Seoighe and Wolfe (1999), who studied mRNA expression levels. However, the numbers of duplications among highly expressed genes (as measured by CAI) decrease above (fig. 2b). Furthermore, among the 63 most recent duplications (with ) there is only one with . Thus, the ongoing processes seem to avoid duplicating high expression level genes in S. cerevisiae as well as in E. coli, in contrast to recent suggestions (Kondrashov et al. 2002). An explanation for this could be that some genes are selected for high expression through gene dosage. Thereby both paralogs would remain selected and conserved with the same function. Furthermore, genes with high CAI values will have artificially low Ks values, due to selection on synonymous changes (Sharp and Li 1987b). In E. coli, this reduction in Ks at high CAI is strengthened by a decrease in the mutation rate for genes at high expression (Berg and Martelius 1995), probably due to transcription-repair coupling. Thus, the duplications of highly expressed genes in S. cerevisiae may be older than they appear (based on Ks), and most of them could derive from the proposed genome duplication ca. 10^8 years ago (Wolfe and Shields 1997). Finally, high-CAI genes in E. coli are seemingly overrepresented in duplications with in figure 2a. However, as discussed above, due to the codon bias, these duplications could be significantly older than the genes with lower CAI values in the same Ks range. It is also interesting to note that most of the high-CAI paralogs fully overlap their siblings, as opposed to low-CAI paralogs, which are often fragmented. In S. cerevisiae, fragmentation in recent () duplications is biased in favor of low CAI genes. Of the genes with , 50 of 173 are fragmented at less than 90% length. Only one of 61 genes with is similarly fragmented. This supports the suggestion above that the high-CAI duplications are amplified for strong functions and are thus selected and conserved rather than providing new functions.

Duplication events were compared in E. coli and S. typhimurium. The median Ks value between orthologs in these two bacteria is 1.3. Genes with high CAI () have a lower median Ks-value of 0.8, compared with low CAI genes (<0.4) with a Ks-value of 1.5. It was observed that recent duplications () in either organism rarely have orthologs in the other. Only 10 of 34 duplication events in E. coli have orthologs in S. typhimurium, compared with 79% of all genes in E. coli. Thus, the expected number would be 27 if duplications were random over the whole genome (). A similar significant () underrepresentation was seen in S. typhimurium (13 of 33; expected number is 26). This observation would suggest that (1) gene amplification may be deleterious for a core set of essential genes, whether of high expression or not, or that (2) some genes may be subject to higher rates of turnover due to their chromosomal position or due to repetitive DNA. It is therefore possible that lateral-transfer genes tend to be duplicated more often, as suggested previously (Hooper and Berg 2002).
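The expected number quoted for E. coli follows directly from the genome-wide ortholog fraction, and the strength of the underrepresentation can be gauged with a one-sided binomial test. The Python sketch below does both for the E. coli figures given above (10 of 34 events, 79% genome-wide). The choice of a binomial test is an assumption made here for illustration; the test actually used in the study is not stated in this text.

```python
import math


def binomial_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


# Figures for E. coli as given in the text.
n_events = 34            # recent duplication events
with_ortholog = 10       # events that have an ortholog in S. typhimurium
genome_fraction = 0.79   # fraction of all E. coli genes with an ortholog

expected = n_events * genome_fraction
# Probability of seeing 10 or fewer if duplications sampled the genome at random.
p_value = binomial_lower_tail(with_ortholog, n_events, genome_fraction)
```

Rounding `expected` reproduces the expected number of 27 stated in the text, and the tail probability is very small, in line with the underrepresentation described.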

Rates of Duplication and Deletion
From the distribution over Ks-values (table 3) we can estimate the influxes of selected and neutral duplications using equations 5, 6, and 7. The data in table 3, using the region to estimate the density (or flow) of selected duplications, give, for S. cerevisiae, , , and . These calculations were made after removing genes of high CAI, which are expected to have anomalously low Ks-values, as discussed above. The fraction of the total duplications that are selected and kept in the genome is αs/α0; assuming an average target size for synonymous substitution of , this corresponds to a probability that a duplication is selected of ca. 10^-4. From the data for E. coli in table 3, one gets in the same way, , , and . Assuming gives , , and . The rate of fixation (αsU) of new duplicated genes in E. coli is ca. two genes per Myr, based on a mutation rate of per Myr.
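The kind of calculation involved can be sketched in Python under explicit assumptions. In a steady-state model where neutral duplicates enter at rate α0U, acquire synonymous changes at rate UL per gene, and are deleted at rate kdel, the number of unmutated duplicates is n0 = α0U/(UL + kdel) and the total neutral excess is nneutral = α0U/kdel. Solving these two relations gives kdel/U = L·n0/(nneutral − n0) and α0 = nneutral·kdel/U. These closed forms, the function name `estimate_rates`, and all input numbers below are assumptions for illustration and are not values from table 3.

```python
def estimate_rates(n_unmutated, n_neutral, codons, selected_density):
    """Estimate normalized rates from counts of duplicates.

    n_unmutated: duplicates with no synonymous change (Ks = 0)
    n_neutral: total excess of duplicates at small Ks, including Ks = 0
    codons: assumed average gene length L, taken as the number of synonymous sites
    selected_density: duplicates per unit Ks in the flat region (alpha_s)
    Returns (kdel / U, alpha_0, alpha_s / alpha_0).
    """
    if n_neutral <= n_unmutated:
        raise ValueError("n_neutral must exceed n_unmutated")
    kdel_over_u = codons * n_unmutated / (n_neutral - n_unmutated)
    alpha_0 = n_neutral * kdel_over_u
    fraction_selected = selected_density / alpha_0
    return kdel_over_u, alpha_0, fraction_selected


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
kdel_over_u, alpha_0, fraction_selected = estimate_rates(
    n_unmutated=10, n_neutral=40, codons=300, selected_density=20
)

# Consistency check: plugging the estimates back in recovers n_unmutated.
n0_check = alpha_0 / (300 + kdel_over_u)
```

The last line feeds the estimates back into the steady-state relation for n0 and recovers the input count, which confirms the algebra but says nothing about whether the constant-rate assumption holds for a given genome.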

Thus, it seems that rates of selected duplication (αs) and deletion may be of similar order of magnitude in S. cerevisiae and E. coli, whereas there is a much larger influx of neutral duplications (α0) in S. cerevisiae. The other organisms in table 3 are difficult to evaluate with the model, primarily because of small numbers. B. subtilis has very few recent duplications (only one with ) as well as only a small excess of neutral genes in the class . This suggests that the deletion rate is high, and . In P. aeruginosa, there are six genes with but little or no excess in the class . This distribution is similar to that of E. coli. In P. aerophilum, there are no duplicates with but a relatively large excess in both classes with . This suggests that the random duplication rate is low, but there has been a burst of duplications (possibly selected) that now have reached a divergence of ca. 0.4. If these are mostly neutral duplications, their survival to significant divergence suggests that the deletion rate is also low. The rates of fixation (αs), in units of the mutation rate, of new selected genes are of similar order of magnitude in all five genomes of table 3, except that it is lower in P. aerophilum. On the assumption that most duplications are random and therefore neutral, this suggests that duplication rates are high in all the genomes studied, except possibly C. trachomatis and the two species of Rickettsia. The excess of neutral genes (nneutral [eq. 5]) would then be small in genomes where the deletion rates were even higher.

These interpretations are based on an assumption of constant rates. In some cases, this assumption seems untenable and the results fit better for instance with a burst of duplications. But even in the cases (e.g., for S. cerevisiae) where data in table 3 can be interpreted with constant rates, it is of course possible that the excess numbers of duplicates in the classes and are due to recent bursts of duplication. The divergence can also proceed with unequal rates (different values of U) for different duplications. However, this would not much influence the shape of the distribution over the different Ks classes.

The estimated rate differences can explain, at least in part, the large differences between E. coli and S. cerevisiae in the distribution along the amino acid identity groups displayed in table 2. Also, the average Ka/Ks is significantly smaller in S. cerevisiae than in E. coli (table 4). Amino acid divergence would therefore proceed more slowly relative to synonymous divergence in S. cerevisiae. The flow of paralogs through the amino acid identity groups would thereby be retarded, and the density of duplications in these groups would be correspondingly larger for S. cerevisiae.


Table 4 Ka/Ks for E. coli and S. cerevisiae.

 

    Discussion
 
General Patterns of Paralog Evolution
The patterns of duplication and deletion vary greatly among different organisms. Correlation with lifestyle seems limited, but some patterns can be discerned: free-living bacteria have high rates of duplication, but paralogs must be initially selected to stand a chance of surviving. As noted previously (Lynch and Conery 2000; Kondrashov et al. 2002), based on the Ka/Ks ratio, there appear to be very few neutral paralogs. In this study, we find a similar situation for E. coli, P. aeruginosa, and B. subtilis. Persistence of neutral paralogs could also be low in other eubacteria, in part undermining the classical model of paralog evolution (Ohno 1970).

The low number of paralogs in R. conorii, R. prowazekii, and C. trachomatis could be due to the stable and supportive environment provided by their eukaryotic hosts. Deletion and/or duplication rates may have changed in these organisms since they adopted an obligate intracellular lifestyle, although it is not known which of these rates has changed. However, R. prowazekii still retains many pseudogenes (Andersson and Andersson 1999), indicating that deletion rates are not extreme. This suggests that some event has changed the duplication rate in R. conorii, R. prowazekii, and possibly C. trachomatis, since the latter shares many properties with R. prowazekii. One distinguishing characteristic of these organisms is their isolation. If duplication is facilitated in some form by foreign IS, phage, or transposon sequences, then this mechanism would be absent in the above organisms. Alternatively, these reductive genomes may have lost essential components of recombination pathways.

Rates of Duplication
The two genomes (E. coli and S. cerevisiae) for which paralog numbers are sufficiently large to allow rate estimates display a very intense turnover of duplications. Most new paralogs are neutral and disappear very quickly (Lynch and Conery 2000). In fact, these would not be expected to be fixed in the population, instead contributing to the genome diversity of the organism (Berg and Kurland 2002). The deletion rate for neutral duplications in E. coli is much larger than our estimate for neutral lateral-transfer genes (Hooper and Berg 2002). This is in part because the potential imports considered were only complete ORFs of at least 400 nt in length, whereas duplications were also counted when they appeared as parts of other ORFs. The rate difference could also be due to a difference in molecular mechanism. Values of duplication and deletion derived from the model above indicate high rates, which should be seen as minimum values. The model builds on the assumption that a small number of neutral paralogs survive past the first synonymous substitution, which is hard to prove. If no neutral paralogs survive that far, rates of duplication and deletion are impossible to calculate. It is conceivable that these prodigious rates are facilitated by paralogs flanked by repetitive element–like or IS element–like DNA.

Gene Dosage Is Vital to Paralog Evolution
The distribution of CAI values among the most recent duplications in E. coli and S. cerevisiae shows that duplication of high expression level genes is avoided or counterselected. This could be a consequence either of protein burden or of a disruptive effect on a cooperative network of interactions in which the gene product participates. The number of paralogs with high CAI (high expression level) and relatively low Ks in S. cerevisiae should be interpreted with caution, since their age cannot easily be determined. A possible origin is through genome duplication, such as the tetraploidy at approximately 100 MYA described by Wolfe and Shields (1997). A whole-genome duplication would also conserve stoichiometry between high expression level genes and could therefore be less disruptive than the random duplication of one or a few such genes. The observations show that amplification of strong (highly expressed) functions often leads to highly conserved paralogs. Thus, such gene amplifications, even if maintained for long periods of time, do not result in functions discernibly different from the ancestor gene. The observation further suggests that amplification of a system of high-expression genes, for example as a consequence of a whole-genome or chromosome duplication, can be stable. Once such a duplication has occurred, it could again be disruptive to remove single gene copies. Thus, a whole-genome or chromosome duplication would result in different categories of genes. First, we expect a conserved set of paralogs, amplified for a strong function. The high-CAI/low-Ks genes in S. cerevisiae would be within this category. Second, we expect random deletions of neutral or counterselected amplifications. Third, selected amplification for weak or secondary functions would result in either single genes with strong function or distant paralogs with related functions.
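For readers unfamiliar with the index, the following is a minimal sketch of how a codon adaptation index in the sense of Sharp and Li (1987a) can be computed: each codon receives a relative adaptiveness w (its count in a reference set of highly expressed genes divided by the count of the most used synonymous codon), and the CAI of a gene is the geometric mean of w over its codons. The reference set used in this study is not restated here, so the toy reference sequence below is hypothetical, and the pseudocount of 0.5 for codons absent from the reference set is an assumption of this sketch.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codons(seq):
    """Split a coding sequence into complete codons (trailing bases dropped)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """Weight w for every codon of each amino acid with synonymous codons.

    Met, Trp, and stop codons are excluded. The pseudocount for unseen
    codons is an assumption of this sketch.
    """
    counts = Counter(c for gene in reference_genes
                     for c in codons(gene) if c in CODON_TABLE)
    weights = {}
    for aa in set(AMINO) - set("*MW"):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        freq = {c: counts[c] or pseudocount for c in family}
        best = max(freq.values())
        weights.update({c: f / best for c, f in freq.items()})
    return weights

def cai(gene, weights):
    """Geometric mean of the weights over the informative codons of a gene."""
    logs = [math.log(weights[c]) for c in codons(gene) if c in weights]
    if not logs:
        raise ValueError("gene contains no informative codons")
    return math.exp(sum(logs) / len(logs))

# Hypothetical reference set: Leu prefers CTG over CTA, Lys prefers AAA over AAG.
reference = ["".join(["CTG"] * 3 + ["CTA"] + ["AAA"] * 2 + ["AAG"])]
w = relative_adaptiveness(reference)
print(cai("CTGAAA", w))  # uses only the preferred codons
print(cai("CTAAAG", w))  # uses only the less preferred codons
```

A gene built entirely from the preferred codons of the reference set scores 1, and the score falls as less preferred codons are used, which is what allows CAI to serve as a proxy for expression level.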

The pattern of CAI values of recent duplications in both E. coli and S. cerevisiae suggests that gene dosage is advantageous primarily for weakly expressed genes. This notion of amplification of primarily weak function and positive selection is supported by the comparison between E. coli and S. typhimurium, where it was found that most recently duplicated genes do not have orthologs in the other bacterium. These particular genes must be either lateral-transfer genes in one organism or genes that have been lost in the other. In either case, they do not carry important core functions in the genome. Thus, we suggest that new functions do not arise as a result of duplication per se, but that the function is already present, albeit at a weak level, in the ancestor. Genes could have weak or secondary functions, or lateral transfer could bring a foreign function into the genome. A weak paralog is easier to improve by substitution than is a highly optimized gene. The presence of several copies increases the probability of a selected mutation in a paralog, resulting in accelerated adaptive evolution. Thereafter, copy numbers could again shrink as the individual efficiency of paralogs increases.
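The effect of copy number on the chance of an adaptive mutation can be made explicit with a simple worked expression. It assumes, purely for illustration, that each of n copies independently acquires a selected mutation with the same small probability p per generation; the symbols n and p are not taken from the model in the Results.

```latex
P(\text{at least one copy acquires a selected mutation})
  = 1 - (1 - p)^{n} \approx n\,p \qquad (n p \ll 1)
```

Under these assumptions the waiting time for the first selected mutation shortens roughly in proportion to the number of copies, so an amplification to four copies would make an adaptive change about four times as likely per generation as in a single-copy gene.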

The picture that emerges is one of rapid duplication and loss. Only a very small fraction of duplicated genes are retained and allowed to be modified (Lynch and Conery 2000). This small fraction could represent duplications where new functions have evolved. The main problem is the high deletion rates of neutral genes. This leaves little room for free mutation of a redundant copy. To be retained, a duplicated gene must either be immediately selected or very quickly pick up an adaptive mutation that makes it selected for some new function. One way to achieve immediate selection would be through gene dosage (Kondrashov et al. 2002) in response to an increased demand after a change in external conditions. This demand could either be for the primary function or some ancillary function, which would allow a duplicated gene to gain a foothold in the population. If the demand requires an ancillary function to be amplified, the primary function would become overexpressed. This would leave one copy free to pick up adaptive mutations that strengthen the ancillary function and reduce the primary one. In all scenarios, new functions develop from existing ones. There is no room for long-term "tinkering" of a sequence through mutational accumulation that leaves the gene without function. If ancillary functions are amplified, the observed difference in evolutionary rates between paralogs and orthologs (Kondrashov et al. 2002) can be attributed to positive selection rather than relaxed selection.

One of the main differences between this model of weak amplification and the subfunctionalization theory put forth by Force et al. (1999) is that genes are not treated as a compartmentalized sequence of domains that can be split to form novel genes. Instead, a new function can be discovered within extant ones. Therefore, genes do not need multiple domains in order to contribute to gene innovation. Moreover, in the subfunctionalization model, the paralogs are initially neutral if the end result is to split the original gene into separate functions. The majority of these paralogs would not survive the high deletion rates of many microbes long enough to subfunctionalize. Through amplification, on the other hand, paralogs may be retained, and the ancillary function may be optimized. Finally, a majority of the observations of subfunctionalization made by Force et al. (1999) are for eukaryotic systems, where polyploidy and cellular differentiation play major roles. The compact genomes of most prokaryotes may require different considerations.

In conclusion, we propose that the mechanism of paralog evolution does not "invent" novel gene functions through the divergence of neutral genes. Rather, novel functions are "discovered" within extant gene functions, retained by amplification, and developed through paralog evolution.


    Acknowledgements
 
This work was supported by The Swedish Foundation for International Cooperation in Research and Higher Education (STINT) and by the Swedish Research Council. S.D.H. also wishes to thank Andrew T. Lloyd and Ken H. Wolfe for stimulating discussions and criticisms. This work was carried out in part during the stay of S.D.H. at the Department of Genetics, Smurfit Institute, Trinity College, University of Dublin, Republic of Ireland.


    Footnotes
 
E-mail: otto.berg@ebc.uu.se.

Pekka Pamilo, Associate Editor


    Literature Cited
 

    Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.

    Andersson, J. O., and S. G. E. Andersson. 1999. Genome degradation is an ongoing process in Rickettsia. Mol. Biol. Evol. 16:1178-1191.

    Andersson, S. G. E., and C. G. Kurland. 1998. Reductive evolution of resident genomes. Trends Microbiol. 6:263-268.

    Andersson, S. G. E., A. Zomorodipour, J. O. Andersson, T. Sicheritz-Ponten, U. C. Alsmark, R. M. Podowski, A. K. Naslund, A. S. Eriksson, H. H. Winkler, and C. G. Kurland. 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396:133-140.

    Berg, O. G., and C. G. Kurland. 2002. Evolution of microbial genomes: sequence acquisition and loss. Mol. Biol. Evol. 19:2265-2276.

    Berg, O. G., and M. Martelius. 1995. Synonymous substitution rate constants in Escherichia coli and Salmonella typhimurium and their relationship to gene expression and selection pressure. J. Mol. Evol. 41:449-456.

    Blattner, F. R., G. Plunkett, III, and C. A. Bloch, et al. (17 co-authors). 1997. The complete genome sequence of Escherichia coli K-12. Science 277:1453-1474.

    Bolotin, A., P. Wincker, S. Mauger, O. Jaillon, K. Malarme, J. Weissenbach, S. D. Ehrlich, and A. Sorokin. 2001. The complete genome sequence of the lactic acid bacterium Lactococcus lactis ssp. lactis IL1403. Genome Res. 11:731-753.

    Bult, C. J., O. White, and G. J. Olsen, et al. (40 co-authors). 1996. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273:1058-1073.

    Cole, S. T., R. Brosch, and J. Parkhill, et al. (42 co-authors). 1998. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393:537-544.

    Cole, S. T., K. Eiglmeier, and J. Parkhill, et al. (44 co-authors). 2001. Massive gene decay in the leprosy bacillus. Nature 409:1007-1011.

    Deckert, G., P. V. Warren, and T. Gaasterland, et al. (15 co-authors). 1998. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392:353-358.

    Drake, J. W. 1991. A constant rate of spontaneous mutation in DNA-based microbes. Proc. Natl. Acad. Sci. USA 88:7160-7164.

    Fitz-Gibbon, S. T., H. Ladner, U. J. Kim, K. O. Stetter, M. I. Simon, and J. H. Miller. 2002. Genome sequence of the hyperthermophilic crenarchaeon Pyrobaculum aerophilum. Proc. Natl. Acad. Sci. USA 99:984-989.

    Fleischmann, R. D., M. D. Adams, and O. White, et al. (40 co-authors). 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496-512.

    Force, A., M. Lynch, F. B. Pickett, A. Amores, Y. L. Yan, and J. Postlethwait. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531-1545.

    Goffeau, A., B. G. Barrell, and H. Bussey, et al. (16 co-authors). 1996. Life with 6000 genes. Science 274:546-567.

    Heidelberg, J. F., J. A. Eisen, and W. C. Nelson, et al. (32 co-authors). 2000. DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406:477-483.

    Hooper, S. D., and O. G. Berg. 2002. Gene import or deletion: a study of the difference genes in Escherichia coli strains K12 and O157:H7. J. Mol. Evol. 55:734-744.

    Kaneko, T., S. Sato, and H. Kotani, et al. (24 co-authors). 1996. Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res. 3:109-136.

    Klenk, H. P., R. A. Clayton, and J-F. Tomb, et al. (51 co-authors). 1997. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390:364-370.

    Kondrashov, F. A., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Selection in the evolution of gene duplications. Genome Biol. 3: research 0008.1-0008.9.

    Kunst, F., N. Ogasawara, and I. Moszer, et al. (150 co-authors). 1997. The complete genome sequence of the gram-positive bacterium Bacillus subtilis. Nature 390:249-256.

    Li, W-H. 1999. Molecular evolution. Sinauer Associates, Sunderland, Mass.

    Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151-1155.

    McClelland, M., K. E. Sanderson, and J. Spieth, et al. (26 co-authors). 2001. The complete genome sequence of Salmonella enterica serovar Typhimurium LT2: features revealed by comparison to related genomes. Nature 413:852-856.

    Nelson, K. E., R. A. Clayton, and S. R. Gill, et al. (28 co-authors). 1999. Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature 399:323-329.

    Nierman, W. C., T. V. Feldblyum, and I. T. Paulsen, et al. (37 co-authors). 2001. Complete genome sequence of Caulobacter crescentus. Proc. Natl. Acad. Sci. USA 98:4136-4141.

    Ogata, H., S. Audic, and P. Renesto-Audiffren, et al. (11 co-authors). 2001. Mechanisms of evolution in Rickettsia conorii and R. prowazekii. Science 293:2093-2098.

    Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, Heidelberg, Germany.

    Seoighe, C., and K. H. Wolfe. 1999. Yeast genome evolution in the post-genome era. Curr. Opin. Microbiol. 2:548-554.

    Sharp, P. M., and W-H. Li. 1987a. The codon adaptation index: a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15:1281-1295.

    Sharp, P. M., and W-H. Li. 1987b. The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol. Biol. Evol. 4:222-230.

    She, Q., R. K. Singh, and F. Confalonieri, et al. (31 co-authors). 2001. The complete genome of the crenarchaeon Sulfolobus solfataricus P2. Proc. Natl. Acad. Sci. USA 98:7835-7840.

    Stephens, R. S., S. Kalman, and C. J. Lammel, et al. (12 co-authors). 1998. Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis. Science 282:754-759.

    Stover, C. K., X-Q. T. Pham, and A. L. Erwin, et al. (31 co-authors). 2000. Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen. Nature 406:959-964.

    Tettelin, H., K. E. Nelson, and I. T. Paulsen, et al. (39 co-authors). 2001. Complete genome sequence of a virulent isolate of Streptococcus pneumoniae. Science 293:498-506.

    Tettelin, H., N. J. Saunders, and J. Heidelberg, et al. (42 co-authors). 2000. Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science 287:1809-1815.

    Wolfe, K. H., and D. C. Shields. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708-713.

    Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555-556.

    Zomorodipour, A., and S. G. Andersson. 1999. Obligate intracellular parasites: Rickettsia prowazekii and Chlamydia trachomatis. FEBS Lett. 452:11-15.

Accepted for publication January 30, 2003.