* Department of Ecology and Evolution and the Committee on Genetics, University of Chicago
Abstract |
---|
Key Words: Drosophila; gene duplicates; positive selection; pseudogenes; polymorphism; concerted evolution
Introduction |
---|
Several recent studies have highlighted the importance of genomic location in the molecular evolution of duplicate loci. In Drosophila melanogaster, X-linked duplicates have roughly twice the Ka/Ks between copies as autosomal duplicates (Thornton and Long 2002), and there is an excess of autosomal duplicate loci derived by retrotransposition of cDNA from X-linked loci (Betrán et al. 2002), a pattern that is also observed in mammals (Emerson et al. 2004). These observations of X/autosome differences have been interpreted in light of the theoretical result that the dominance of adaptive mutations influences the relative rates of evolution of X-linked and autosomal loci (Charlesworth et al. 1987). Genomic location has also been implicated in affecting rates of synonymous divergence between amylase gene copies in Drosophila (Zhang and Kishino 2004a). Finally, there is a tendency for duplicates in regions of low recombination in the Saccharomyces cerevisiae genome to exhibit higher rates of amino acid evolution than their paralogs in regions of high recombination (Zhang and Kishino, 2004b). These last two studies implicate an effect of weak selection on the divergence between paralogs.
In this study, we examine nucleotide variation in an African sample of D. melanogaster in several families of X-linked duplicates that we have previously identified as having rapidly diverged (Thornton and Long 2002). We are interested in addressing four features of polymorphism and divergence in this class of genes. First, we examine evidence for selective constraint to ask whether patterns of polymorphism are consistent with the duplicates being pseudogenes. Second, we assess the impact of gene conversion as a homogenizing force among these loci. Third, we test the neutral mutation hypothesis by comparing polymorphism and divergence between copies at replacement and synonymous sites (McDonald and Kreitman 1991), in an effort to test the theory that positive selection and gene conversion can have opposite effects on the divergence of paralogs, and that selection may need to be quite strong to overcome the homogenizing force of gene conversion (Innan 2003b). Finally, as the synonymous divergence between the loci studied is low, we use the annotation of the Drosophila pseudoobscura genome, the most closely related genome to D. melanogaster currently available, to ask if there is evidence that some of the duplication events post-dated the divergence of the two species. We find strong evidence of selective constraint in the X-linked gene families, and only weak evidence for conversion between copies. Additionally, there is a significant excess of amino acid fixations between copies relative to polymorphism among the X-linked duplicates in the sample, suggesting that positive natural selection has acted on these loci after duplication, and that selection has been a stronger long-term force than ectopic conversion. The comparison with D. pseudoobscura suggests that many of these loci are of relatively recent origin.
We have also compiled all polymorphism data from duplicate loci in D. melanogaster that we are aware of, and provided two new data sets from autosomal duplicates, and tested the neutral mutation hypothesis. In contrast to the X-linked loci, there is a deficit of fixations among these autosomal loci, consistent with the expectation of concerted evolution and the theory that strong selection is needed to escape ectopic gene conversion (Innan 2003b).
Materials and Methods |
---|
We have also mined all available data on gene duplications that we are aware of. These loci include the members of the hsp family (Bettencourt and Feder 2002) and the attacin family of immunity peptides (Lazzaro and Clark 2001). We excluded data from both the esterase family in D. melanogaster (Balakirev, Balakirev, and Ayala 2002; Balakirev et al. 2003), because silent sites are saturated between copies, and from the amylase duplication (Araki, Inomata, and Yamazaki 2001), because we sequenced amylase in this study. Data from the alcohol dehydrogenase (Adh) region of a Zimbabwe, Africa, sample of D. melanogaster encompassing the regions sequenced by Kreitman and Hudson (1991) were kindly provided by P. Andolfatto. As silent sites are saturated between Adh and Adh-r, we used a sequence from D. teissieri (GenBank: X54118) to restrict the analysis to ingroup-specific substitutions. The primary interest in including these data is the application of McDonald and Kreitman (1991) tests (see below). An analysis of gene conversion among the attacin and hsp loci is found in Innan (2003a).
Drosophila Lines
Ten isofemale lines from Zimbabwe, Africa (5 from Harare, 5 from Sengwa) (Begun and Aquadro 1993), obtained from M.-L. Wu, were sequenced. These lines have been maintained in the lab for over 10 years (greater than 200 generations), and we observed no heterozygous base calls at autosomal loci, allowing direct inference of haplotypes.
PCR, Sequencing, and Assembly
All polymerase chain reaction sequencing was done from genomic DNA extracted from single male flies. Gene-specific primers were designed by aligning the genomic regions and coding sequences (CDS) of gene families. Primer positions were then chosen to cover sites that differed among the different genes in the alignment, and the 3' base of at least one primer was placed over a base that differentiated the genes in the Drosophila genome sequence. A list of all PCR and sequencing primers used is available from the authors.
Polymerase chain reaction conditions were optimized on a gradient thermocycler for each gene pair, and then each line was amplified at the optimum temperature in a 50 µl reaction. The PCR products were cleaned with Qiagen columns (Qiagen, Inc.) and sequenced in both directions using Big Dye 3.0 sequencing kits (PerkinElmer). Sequences were obtained with an ABI 3700 machine. Base calls from the sequencing reactions were done, and contigs were assembled, with the phred/phrap software package (Ewing et al. 1998; Ewing and Green 1998). Contigs were assembled individually for each fly line, and a multiple alignment was assembled using the mace software (W. Gilliland and C. Langley, manuscript in preparation; program available from http://ludwig.ucdavis.edu/mace/), and visualized with consed (Gordon, Abajian, and Green 1998). For assembly using mace, bases with quality scores less than 30 were initially called as missing data, and were only included back into the sequence if the base calls were easily made, with no ambiguity, by eye from chromatographs. After assembly, all polymorphisms were confirmed by visual inspection of chromatographs in consed. Sequences have been deposited in GenBank (accession numbers AY752495–AY752639 and AY754313–AY754331), and alignments are available from K.T. upon request.
Analysis
Levels of Variation
Levels of nucleotide polymorphism were summarized using two statistics: Watterson's (1975) θW, which depends on the number of mutations in the data, and π (Tajima 1983), which depends on the mean number of pairwise differences in the sample. Under a neutral model, both values are unbiased estimators of θ = 4Neµ, the population mutation rate. For these calculations, sites in the alignments with gaps were excluded. Sites with missing data were included by adjusting the sample size at each site and summing the value of the statistic across sites. For θW, sites with more than two alleles segregating were counted as k − 1 segregating sites for a site with k alleles. In other words, we used the inferred number of mutations at that site. There were only two sites in the data with more than two alleles (in genes CG13732 and CG15644 from the Zimbabwe sample). Under the assumption that males and females have the same effective population size (Ne), the Ne of an X-linked locus is 3/4 that of an autosomal locus. Thus, to make parameter estimates comparable between chromosomes, X-linked values have been multiplied by 4/3. We note that the observation of variation in male reproductive success in natural populations (Charlesworth 2001) implies that 4/3 may not be the appropriate correction. However, as we do not directly compare X and autosomal levels of variation in this study, our conclusions remain unaffected.
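The two summaries described above can be illustrated with a minimal Python sketch. It is not the authors' C++ software: the function names, the use of `-` for gaps and `N` for missing data, and the toy alignment in the usage note are all assumptions made for illustration.

```python
from itertools import combinations

def harmonic(n):
    """a_n = sum_{i=1}^{n-1} 1/i, the denominator of Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n))

def diversity(alignment, x_linked=False):
    """Return (theta_W, pi), each summed over sites.

    alignment: list of equal-length sequences (assumed encoding:
    '-' marks a gap, 'N' marks missing data).
    """
    theta_w = 0.0
    pi = 0.0
    for column in zip(*alignment):
        if '-' in column:                 # gapped sites are excluded
            continue
        states = [b for b in column if b != 'N']
        n = len(states)                   # per-site sample size
        if n < 2:
            continue
        k = len(set(states))
        theta_w += (k - 1) / harmonic(n)  # k alleles count as k - 1 mutations
        pairs = list(combinations(states, 2))
        pi += sum(a != b for a, b in pairs) / len(pairs)
    scale = 4.0 / 3.0 if x_linked else 1.0  # 4/3 correction for X-linked loci
    return theta_w * scale, pi * scale
```

For a toy alignment such as `["AAT", "AAT", "ACT", "ACA"]`, `diversity(...)` sums the per-site contributions of the two variable columns; passing `x_linked=True` applies the 4/3 scaling to both statistics.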
Silent and Replacement Polymorphism
The mean numbers of silent and replacement sites in the alignment were estimated using the method of Comeron (1995). Sites with more than two states segregating were excluded, which led to the exclusion of one single nucleotide polymorphism (SNP) in Zimbabwe, out of 259 total polymorphic sites. Partial codons at the ends of the alignments were excluded because one does not know whether the missing sites are variable. This led to the exclusion of one polymorphism at CG2885 in Zimbabwe. θW and π were calculated for replacement sites, synonymous sites, noncoding sites, and noncoding plus synonymous sites. Noncoding sites do not distinguish 5', 3', or intron sequence because we do not always have data from each site class for each gene. All diversity estimates for X-linked loci are multiplied by 4/3 for comparison with the autosomes.
Inference of Ectopic Gene Conversion
When polymorphism data are obtained from duplicate loci, gene conversion between copies (ectopic gene conversion) is visually detectable if polymorphisms are shared between the loci. However, the absence of shared polymorphisms does not rule out an important role for gene conversion, because not all gene conversion events lead to shared polymorphisms (on singleton lineages, for example). We are therefore interested both in identifying conversion events and in estimating the rate of conversion between paralogs.
To detect ectopic gene conversion, we tallied the number of shared polymorphisms between pairs of duplicates. Alignments for these analyses were done using CLUSTALW (Thompson, Higgins, and Gibson 1994) and edited by eye to remove highly divergent regions with questionable alignments when necessary (all such regions were at the 5' or 3' ends of alignments).
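As an illustration of the tally described above, the following Python sketch classifies each aligned site as a shared polymorphism, a polymorphism private to one locus, or a fixed difference. The function name, the dictionary keys, and the decision to skip gapped sites and sites with more than two states are assumptions of this sketch rather than a description of the software used in the study.

```python
def classify_sites(locus1, locus2):
    """Count shared, private, and fixed differences between two aligned loci.

    locus1, locus2: lists of sequences sampled from each duplicate,
    all aligned to the same coordinates.
    """
    counts = {"shared": 0, "private1": 0, "private2": 0, "fixed": 0}
    for col1, col2 in zip(zip(*locus1), zip(*locus2)):
        s1, s2 = set(col1), set(col2)
        # Skip gapped sites and sites with more than two states overall.
        if '-' in s1 | s2 or len(s1 | s2) > 2:
            continue
        poly1, poly2 = len(s1) == 2, len(s2) == 2
        if poly1 and poly2:
            counts["shared"] += 1      # same two states segregate at both loci
        elif poly1:
            counts["private1"] += 1
        elif poly2:
            counts["private2"] += 1
        elif s1 != s2:
            counts["fixed"] += 1       # each locus monomorphic for a different state
    return counts
```

Dividing the `shared` count by the number of sites compared gives a per-site summary of shared polymorphism for a pair of duplicates.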
In addition to counting shared polymorphisms, we also used the GENECONV program (Sawyer 1999), which implements extensions of Sawyer's (1989) method for detecting gene conversion. In our analysis, we ignore results on within-locus fragments, as GENECONV has been shown by simulation to be reasonably powerful in detecting reciprocal exchange events (i.e., classical crossing over with no gene conversion) (Posada and Crandall 2001), meaning it is likely that the analysis would confound within-locus conversion events with crossing over. For all analyses with GENECONV, we used the default settings, and applied the program to nucleotide alignments. GENECONV reports a P value associated with every conversion event detected. For simplicity, we report P values only for the most likely (least significant) fragment. The P values reported are from a permutation procedure which implicitly corrects for multiple comparisons (Sawyer 1989, 1999).
To estimate the population rate of ectopic gene conversion, we used the moment estimators of Innan (2003a), based on a simplified coalescent process for gene families of size 2. Recombination occurs between loci, not within them; ectopic conversion is assumed to be intra-chromosomal; and there is no gene conversion between alleles within loci. A simplifying assumption is made that the tract length of conversion events is small enough that only one site is converted per event. The parameters estimated are θ (= 4Neµ), the population mutation rate; ρ (= 4Ner), the population recombination rate between the loci; and C (= 4Nec), the population rate of ectopic conversion. The expressions for the estimators are:
![]() | (1) |
![]() | (2) |
![]() | (3) |
It can be seen from equation (2) that the estimate of C can be greater than zero even when Dsum = 0 (which occurs when there are no shared polymorphisms). These estimators only return sensible values under certain conditions. Equation (2) will return negative values if the mean within-locus diversity is greater than the between-locus diversity (πw > πb) and Dsum = 0. Also, the recombination rate between the loci is undefined when Dsum = 0 (eq. 3). These problems pose some practical issues both for data analysis and for investigation of the properties of the estimator by simulation.
A program was written to calculate the estimators in equations (1) through (3) from an alignment of polymorphism data from two duplicates. Gapped positions in alignments were not analyzed, and sites with more than two states in the alignment were excluded. In our program, when the sample sizes differ between the two loci, the actual number of comparisons between loci is tracked when calculating πb and is used instead of n(n − 1). Estimates for X-linked loci are scaled by 4/3 to allow comparison with autosomes.
Divergence Between Duplicates
A variant of McDonald-Kreitman-style contingency tables (McDonald and Kreitman 1991) (the "MK" test) was used to test the null hypothesis that divergence between duplicate loci is a neutral process. In our application, the MK test was performed between alignments of polymorphism data from pairs of duplicated loci. For this test, we only considered changes at replacement and synonymous sites (i.e., intron and untranslated region (UTR) sites were not considered).
To perform the MK test, coding regions were extracted from the alignment, and coding regions of duplicate pairs were aligned using CLUSTALW (Thompson, Higgins, and Gibson 1994). To ensure accurate alignment, we made use of alignments of the coding sequences (CDS) from the genome we had previously generated, using peptide alignments as guides (Thornton and Long 2002). The alignments required minimal manual editing, with the exception of the removal of highly divergent regions from the 5' end of some alignments (particularly for the CG15644/CG15645/CG13732/CG18620 family). This removal is conservative with respect to testing for an excess of amino acid replacements in these data, as it reduces the number of amino acid fixations in the alignment with little effect on the number of polymorphisms.
In our application of the MK test, we analyze either the most closely related pair of genes, or the pairs for which we know the ancestral/derived relationship of the duplicates (CG15645/CG13732, Betrán, Thornton, and Long. [2002]). The exception to these conditions is the inclusion of CG11941/CG11942, to ensure that every gene sequenced appears at least once in the analysis. There are two reasons for restricting the number of comparisons in this fashion. First, for families of size greater than two, we do not know the phylogeny of the gene family, which would result in fixed differences being counted multiple times when all possible comparisons are made. Second, when analyzing all but the most closely related pairs, the number of fixed amino acid differences increases dramatically, meaning that removing such pairs is conservative for our purpose.
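The contingency-table logic of the MK test can be sketched in a few lines of Python. This is an illustrative implementation of a two-tailed Fisher's exact test using only the standard library, not the software used for the analyses reported here; the function name and the counts in the usage note are hypothetical.

```python
from math import comb

def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact P value for the 2x2 table [[a, b], [c, d]].

    For an MK table, one possible layout is:
        row 1 = fixed differences      (replacement, synonymous)
        row 2 = polymorphisms          (replacement, synonymous)
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as extreme (no more probable) than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-7))
```

With hypothetical counts of 10 replacement and 2 synonymous fixed differences against 3 replacement and 9 synonymous polymorphisms, `fisher_exact_two_tailed(10, 2, 3, 9)` returns a P value of about 0.012, which would indicate an excess of replacement fixations relative to the neutral expectation.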
Gene Discovery in Drosophila pseudoobscura
The pairwise Ks between all X-linked duplicates in this study falls in the range of 0.01–0.25 (table 1), the upper value of which is slightly larger than the mean value of 0.18 between D. melanogaster and D. yakuba (Takano 1998). We are therefore interested in whether there is evidence that the duplication events leading to these loci are relatively recent. Whole-genome sequence data are now available for D. pseudoobscura. We made use of the D. melanogaster/D. pseudoobscura orthology assignments generated by FlyBase (http://www.flybase.org) (Brian Bettencourt, personal communication) based on the freeze1 assembly available at Baylor College of Medicine (http://www.hgsc.bcm.tmc.edu/projects/drosophila/).
Software Availability
All analyses of polymorphism data were performed using custom software. All SNP analysis programs were implemented in C++, based on a common library (Thornton 2003). The program to extract CDS from GenBank files was written in perl using routines from the bioperl libraries (http://www.bioperl.org). Source code for all programs is available from K.T.
Results |
---|
The null hypothesis that these genes are pseudogenes leads to testable predictions. First, insertion/deletion (indel) polymorphism should be very informative: indels within the coding sequence of functional genes should be in-frame, with lengths that are multiples of 3, whereas pseudogenes (or null alleles) are more likely to bear indels that result in frame-shift mutations. Second, if these genes are pseudogenes, then mutations at silent and replacement sites should occur with equal frequency per site, leading to similar levels of variation observed in the two site classes, i.e., πN ≈ πS, where the subscripts N and S refer to replacement and synonymous sites, respectively. We focus the analysis on levels of diversity (π), rather than on Watterson's θW, because the former is affected by the frequency of segregating variation, which is more informative for our purposes than simply counting the total number of variants. A summary of the total polymorphism data from each locus is presented in table 3.
Levels of variation at replacement and silent sites are presented in table 5. We are particularly interested in levels of silent versus replacement variation in the set of duplicates with high Ka/Ks. Considering only loci from that set for which polymorphisms are observed in the exons, πN is less than πS in 11/13 loci (P = 0.02, sign test). This result indicates that there is purifying selection against amino acid replacement mutations, further suggesting functional constraint on these genes.
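The sign-test P value quoted above can be checked with a short calculation. The sketch below assumes a two-tailed exact binomial test against a 50:50 expectation; the function name is hypothetical.

```python
from math import comb

def sign_test_two_tailed(successes, trials):
    """Two-tailed exact sign test against a 50:50 expectation."""
    k = max(successes, trials - successes)
    # Probability of an outcome at least as extreme in one direction.
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)

p = sign_test_two_tailed(11, 13)
print(round(p, 3))  # 0.022
```

Under these assumptions, 11 of 13 loci in one direction gives 2 × (78 + 13 + 1)/8192 ≈ 0.022, consistent with the reported P = 0.02.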
Detection and Rates of Ectopic Gene Conversion
A common method employed to detect conversion between paralogs is to identify shared polymorphisms between copies. However, an absence of shared polymorphisms does not imply that the rate of ectopic conversion is zero. For example, a mutation on a branch leading to a fixed difference between copies can be converted to the ancestral state on a singleton lineage. Thus, there is a discrepancy between the values of the summary statistic and the value of the parameter of interest. A similar issue arises in inferring recombination events at a single locus: the number of pairs of sites showing all four gametes (Hudson and Kaplan 1985) may be zero, but one can still estimate non-zero values of the population recombination rate by using more of the information present in the data, such as the amount of pairwise linkage disequilibrium in the sample (e.g., Hudson [2001]). We will therefore be interested in both the summary of the data, the number of shared polymorphisms, and whether or not estimates of the rate of ectopic conversion are compatible with 0.
When only SNPs are considered, there are very few shared polymorphisms in the entire data (table 6). The exceptions to this pattern are the autosomal duplicates (CG11466/CG17875 and Amy(p)/Amy(d)) and two X-linked pairs (CG18620/CG15644 and CG2532 (3' exon)/CG2885). The general observation for X/X duplications in these data is that the number of fixed differences between loci is much larger than the number of shared and private polymorphisms. The numbers of shared polymorphisms per site were 0.001 and 0.009, for X-linked and autosomal gene families, respectively. Note that there are shared polymorphisms between CG17875, a putative pseudogene (Tijet, Helvig, and Feyereisen 2001), and CG11466 (table 6), which is believed to be functional.
We have also estimated the parameters of Innan's model (Innan 2003a) (table 7). The numbers of sites compared for this analysis are given in table 6. The parameters estimated are θ = 4Nµ, the population mutation rate; C = 4Nc, the population rate of ectopic conversion; and ρ = 4Nr, the population rate of recombination between the two loci (the model assumes no intra-locus recombination of any sort). The per-locus estimates of C are generally an order of magnitude lower than θ (table 7) in our sample. Estimating ρ usually returned negative values, because Dsum was often zero or negative (table 7, eq. 3).
Divergence Between Duplications
To study the evolutionary forces underlying the degree of divergence between these duplicate loci, we conducted McDonald-Kreitman tests (McDonald and Kreitman 1991) on alignments of polymorphism data from the coding regions of the genes we sequenced. First, however, we explored the effect that gene conversion between paralogs has on rejecting the neutral model using the MK test.
We simulated contingency tables for the MK test under the standard neutral model using a modification of Innan's (2003a) program. The purpose of this model was to examine the distribution of P values for the test under the null model when applied to data sampled from duplicate loci. Genealogies were simulated according to Innan (2003a). Replacement mutations were placed on the tree with rate θR, conditional on the length of the tree, and synonymous changes occur at rate θS. The simulation output was parsed into the cell entries for the MK test, and P values were calculated as two-tailed Fisher's exact tests in R (R Development Core Team 2004). The distributions of P values for four different rates of gene conversion (C = 4Nec = 0.01, 0.1, 0.5, and 1) were obtained in this fashion from 10^4 simulated histories. Figure 1 shows results for the case where n = 10 alleles sampled from each locus, θ = 4Neµ = 5 at synonymous sites, θ = 2.5 at replacement sites, and there is no recombination between loci. Figure 1 reports two summaries for each value of 4Nec. First is the fraction of times the MK test was applicable (i.e., at least one fixation was present in the simulated sample). We report this because, in practice, an MK test would not be applied if no fixations were observed. Second, we report the rejection rate of the test, conditional on the test being applicable. In other words, the first quantity is the probability of observing any fixations at all, given the parameters, and the second is the probability of rejecting the neutral mutation hypothesis at α = 0.05, given the observation of at least one fixation.
Our sample of X-linked gene duplicates shows an excess of amino acid fixations in 6 out of 7 comparisons (table 8). The number of fixations in table 8 differs from that in table 6 because the former only considers coding regions while the latter considers the entire alignment of the duplicated region and does not distinguish amino acid from synonymous/noncoding fixations. One comparison (CG9123/CG12608) is significant at the α = 5% level, but not at α = 0.0033, which represents an experiment-wide Bonferroni correction for applying 15 tests (table 8). When all X-linked comparisons are pooled, there is a significant excess of amino acid fixations after correcting for multiple tests. The pooled analysis remains significant, even after removal of the most significant comparison (CG9123/CG12608), suggesting that the observation is a general feature of these loci rather than a property of one extreme duplicate pair.
Data Mining in D. pseudoobscura
The duplicates that we sequenced fall in a Ks range of 0.01 to 0.25 (table 1), which is much less than the median Ks between D. melanogaster and D. pseudoobscura (1.79, S. Richards, K. Thornton, A. Clark, R. Nielsen, unpublished data), suggesting that several of these duplicates may have arisen along the D. melanogaster lineage. The six loci (out of 17 total) that are found in the assembly of the D. pseudoobscura genome are labeled in table 2. In comparison with our database of gene duplicates in D. melanogaster, the duplicate pairs sequenced in this study appear to be young tips of gene families conserved between the two species.
Discussion |
---|
There are, however, signs that some duplicate genes are less constrained than others. Length variation that disrupts open reading frames is observed in CG6999 and CG17875, and a polymorphic stop codon was found in CG17875. Of these three loci, only CG6999 is a member of the set of highly diverged X/X duplications, and CG17875 is believed to be a pseudogene (Tijet, Helvig, and Feyereisen 2001). If we assume that the pattern of polymorphism at CG17875 is typical of a pseudogene, then the null allele in CG6999 in Zimbabwe may lead us to classify CG6999 as a potential pseudogene. However, many of the alleles of CG17875 are intact, reinforcing the point that one should be cautious in declaring Drosophila genes as pseudogenes, as loci originally believed to be pseudogenes are sometimes later shown to have patterns of molecular evolution inconsistent with the expectation for a pseudogene (Long and Langley 1993; Begun 1997). In addition, Amy(d) is a duplicate gene known to be present in many Drosophila species (Aquadro et al. 1991; Inomata and Yamazaki 2002), and it is less constrained than Amy(p) (table 5). Araki, Inomata, and Yamazaki (2001) observed several deletions leading to null alleles in a Kenyan sample (but not in Japan). Segregating null alleles have also been observed in the D. melanogaster esterase gene family (Balakirev, Balakirev, and Ayala 2002; Balakirev et al. 2003) and at Attacin A (Lazzaro and Clark 2001). The observation that most of the lesions and stop codons observed were found in autosomal duplicates (table 4; Araki, Inomata, and Yamazaki [2001], Balakirev et al. [2002, 2003]) may lead us to conclude that the X/X duplicates surveyed in this study are under more selective constraint than the autosomal duplicates for which data are available, possibly because deleterious alleles will be eliminated more quickly from the X chromosome because of hemizygosity in males.
The Role of Ectopic Gene Conversion
Gene conversion has long been recognized as a potentially powerful force of "concerted evolution," a process with the long-term effect of retarding sequence divergence (Ohta 1981; Nagylaki and Barton 1986; Ohta 1987a; 1987b; Nagylaki 1988). In Drosophila species, the amylase duplication is a well-studied example of such concerted evolution. The coding sequences of Amy(p) and Amy(d) are more similar within than between species, while the 5' regulatory sequences have diverged (Shibata and Yamazaki 1995). In addition, sequence polymorphism studies reveal shared polymorphisms between the coding sequences of the two loci (Araki, Inomata, and Yamazaki 2001). Shared polymorphisms have also been found between Attacin A and B genes (Lazzaro and Clark 2001), which are involved in immune responses, as well as within and between the hsp70 gene clusters (Bettencourt and Feder 2002). As discussed above, samples from duplicate loci that show shared polymorphisms require a gene genealogy involving ectopic gene conversion (assuming that every mutation falls at a previously unmutated site).
In our sample of duplicates with high Ka/Ks between copies, nonparametric methods to detect gene conversion provide little evidence for ectopic conversion between copies. First, very few shared polymorphisms are observed (table 6). Second, few tracts of similar sequence were found using GENECONV. However, not all gene conversion will be detected as shared polymorphisms in a sample, leading us to estimate the ectopic exchange rate using available methods (Innan 2003a). The estimates of C (= 4Nec) are generally very low for this set of duplicates. Table 7 and the results of our simulation suggest that we can rule out a model where C = 0 for 4 of 11 X-linked duplicate pairs at the 2.5% level, before correcting for multiple tests.
The methods employed to make inference on gene conversion make very different assumptions about the gene conversion process. Sawyer's (1989, 1999) method assumes that tracts of conversion are large enough to result in runs of similarity between copies, whereas Innan's (2003a) method assumes that conversion affects only one mutation per event. In D. melanogaster, if average diversity is 0.01 per site, this would correspond to an average tract length of 100 base pairs. It is unclear in D. melanogaster what the tract length is between non-allelic loci, nor is it clear how sensitive Innan's model is to violations of this assumption. Despite these differences between approaches, results from both approaches agree in that they do not suggest high rates of gene conversion between copies in this data set. For the X-linked duplicates that appear to be undergoing ectopic exchange (labeled in table 7), it is unlikely that violations of the infinite-sites assumption are sufficient to explain the observation. First, the Ks between copies are all low (table 1). Second, only one of the loci has a single site with more than two states segregating (CG15644; the inferred number of mutations is greater than the number of segregating sites, table 3).
Divergence Between Duplicates
We have applied the McDonald-Kreitman (1991) test to two different samples of duplicate loci. The first consists of (mostly) X-linked duplicates with Ka/Ks > 1 between copies (table 1). The second set consists of tandem autosomal duplicates taken from our own data and from the literature. The patterns of polymorphism relative to divergence are very different in these two sets of loci, with a significant excess of amino acid replacements in the former, and a deficit of fixations in the latter (table 8).
In the case of the X-linked duplicates, our data suggest that they are currently evolving under purifying selection (table 5), with relatively little statistical support for gene conversion between copies, and a historical excess of amino acid replacements between copies. Two alternative models are compatible with this excess of amino acid fixations. First, a new function could have evolved relatively quickly after duplication, and the substitutions accumulated under positive selection. Second, if the duplicate genes were redundant in function, substitutions could accumulate as a consequence of relaxation of purifying selection. In that case, the duplicates may have been preserved if, for example, an environmental change occurred resulting in a change in selective pressure (e.g., Kimura [1983, pp. 104–113]) leading to current selective constraint. However, fixations between copies are neutral under the latter model, and it seems unlikely that the high Ka/Ks between copies (table 1) would have fortuitously accumulated on X-linked genes in the face of the gene conversion that is expected to occur between young, highly similar, tandem duplications. Rather, we argue that selection was strong enough, relative to gene conversion, to drive the divergence of these loci at the amino acid level.
The autosomal duplicates that have been studied to date show a rather different pattern. The deficit of divergence among these loci (table 8) is consistent with the expectation for duplicate pairs evolving under rather strong gene conversion (i.e., fig. 1). In fact, estimates of the ectopic conversion rate of these autosomal loci are higher than for the genes in our X-linked families (table 7). Additionally, patterns of molecular evolution at amylase in Drosophila suggest strong, long-term, concerted evolution (Shibata and Yamazaki 1995; Inomata, Tachida, and Yamazaki 1997), and tracts of gene conversion are readily visible within and between hsp70 gene clusters (Bettencourt and Feder 2002). In fact, all previous studies of nucleotide variation at duplicate loci in D. melanogaster (Araki, Inomata, and Yamazaki 2001; Lazzaro and Clark 2001; Bettencourt and Feder 2002; Balakirev, Balakirev, and Ayala 2002; Balakirev et al. 2003) and D. pseudoobscura (King 1998) that we are aware of have documented evidence for gene conversion between paralogs. Given that many of the duplicate pairs studied by previous authors are ancient (King 1998; Bettencourt and Feder 2002; Balakirev, Balakirev, and Ayala 2002; Balakirev et al. 2003), the extent to which gene conversion affects the distribution of Ks between duplicates within genomes remains an important question. A recent analysis of duplicates from several yeast genomes has found widespread evidence for gene conversion, suggesting that the distribution of Ks between duplicates does not conform in general to the assumptions of the molecular clock (L. Gao and H. Innan, personal communication).
Conclusions |
---|
Acknowledgements |
---|
Footnotes |
---|
Michael Nachman, Associate Editor
References |
---|
Adams, M. D., S. E. Celniker, R. A. Holt, C. A. Evans, J. D. Gocayne, P. G. Amanatides, S. E. Scherer, P. W. Li, R. A. Hoskins, R. F. Galle (194 co-authors). 2000. The genome sequence of Drosophila melanogaster. Science 287:2185–2195.
Andolfatto, P. 2001. Contrasting patterns of X-linked and autosomal nucleotide variation in Drosophila melanogaster and Drosophila simulans. Mol. Biol. Evol. 18:279–290.
Aquadro, C. F., A. L. Weaver, S. W. Schaeffer, and W. W. Anderson. 1991. Molecular evolution of inversions in Drosophila pseudoobscura: the Amylase gene region. Proc. Natl. Acad. Sci. U.S.A. 88:305–309.
Araki, H., N. Inomata, and T. Yamazaki. 2001. Molecular evolution of duplicated amylase gene regions in Drosophila melanogaster: evidence of positive selection in the coding regions and selective constraints in the cis-regulatory regions. Genetics 157:667–677.
Balakirev, E. S., E. I. Balakirev, and F. J. Ayala. 2002. Molecular evolution of the Est-6 gene in Drosophila melanogaster: contrasting patterns of DNA variability in adjacent functional regions. Gene 288:167–177.
Balakirev, E. S., V. R. Chechetkin, V. V. Lobzin, and F. J. Ayala. 2003. DNA polymorphism in the β-esterase gene cluster of Drosophila melanogaster. Genetics 164:533–544.
Begun, D. J., and C. F. Aquadro. 1993. African and North-American populations of Drosophila melanogaster are very different at the DNA level. Nature 365:548–550.
Begun, D. J. 1997. Origin and evolution of a new gene descended from alcohol dehydrogenase in Drosophila. Genetics 145:375–382.
Betrán, E., K. Thornton, and M. Long. 2002. Retroposed new genes out of the X in Drosophila. Genome Res. 12:1854–1859.
Bettencourt, B. R., and M. E. Feder. 2002. Rapid concerted evolution via gene conversion at the Drosophila hsp70 genes. J. Mol. Evol. 54:569–586.
Celniker, S. E., D. A. Wheeler, B. Kronmiller, J. W. Carlson, A. Halpern, S. Patel, M. Adams, M. Champe, S. P. Dugan, E. Frise (32 co-authors). 2003. Finishing a whole genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol. 3:0079.1–0079.14.
Charlesworth, B. 2001. The effect of life-history and mode of inheritance on neutral genetic variability. Genet. Res. 77:153–166.
Charlesworth, B., J. A. Coyne, and N. H. Barton. 1987. The relative rates of evolution of sex-chromosomes and autosomes. Am. Nat. 130:113–146.
Clark, A. G. 1994. Invasion and maintenance of a gene duplication. Proc. Natl. Acad. Sci. U.S.A. 91:2950–2954.
Comeron, J. M. 1995. A method for estimating the numbers of synonymous and nonsynonymous substitutions per site. J. Mol. Evol. 41:1152–1159.
Emerson, J. J., H. Kaessmann, E. Betrán, and M. Long. 2004. Extensive gene traffic on the mammalian X chromosome. Science 303:537–540.
Ewing, B., and P. Green. 1998. Basecalling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186–194.
Ewing, B., L. Hillier, M. Wendl, and P. Green. 1998. Basecalling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175–185.
Fisher, R. A. 1935. The sheltering of lethals. Am. Nat. 69:446–455.
Force, A., M. Lynch, F. B. Pickett, A. Amores, Y. L. Yan, and J. Postlethwait. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545.
Gordon, D. C., C. Abajian, and P. Green. 1998. Consed: a graphical tool for sequence finishing. Genome Res. 8:195–202.
Haldane, J. B. S. 1933. The part played by recurrent mutation in evolution. Am. Nat. 67:5–19.
Hudson, R. R. 2001. Two-locus sampling distributions and their application. Genetics 159:1805–1817.
Hudson, R. R., and N. L. Kaplan. 1985. Statistical properties of the number of recombination events in the history of a sample of DNA-sequences. Genetics 111:147–164.
Innan, H. 2003a. The coalescent and infinite-site model of a small multigene family. Genetics 163:803–810.
Innan, H. 2003b. A two-locus gene conversion model with selection and its application to the human RHCE and RHD genes. Proc. Natl. Acad. Sci. U.S.A. 100:8793–8798.
Inomata, N., and T. Yamazaki. 2002. Nucleotide variation of the duplicated amylase genes in Drosophila kikkawai. Mol. Biol. Evol. 19:678–688.
Inomata, N., H. Tachida, and T. Yamazaki. 1997. Molecular evolution of the Amy multigenes in the subgenus Sophophora of Drosophila. Mol. Biol. Evol. 14:942–950.
Kimura, M. 1983. The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, U.K.
King, L. M. 1998. The role of gene conversion in determining sequence variation and divergence in the Est-5 gene family in Drosophila pseudoobscura. Genetics 148:305–315.
Kondrashov, F. A., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Selection in the evolution of gene duplications. Genome Biol. 3:0008.1–0008.9.
Kreitman, M., and R. R. Hudson. 1991. Inferring the evolutionary histories of the Adh and Adh-Dup loci in Drosophila melanogaster from patterns of polymorphism and divergence. Genetics 127:565–582.
Lasko, P. 2000. The Drosophila melanogaster genome: translation factors and RNA binding proteins. J. Cell Biol. 150:F51–F56.
Lazzaro, B. P., and A. G. Clark. 2001. Evidence for recurrent paralogous gene conversion and exceptional allelic divergence in the Attacin genes of Drosophila melanogaster. Genetics 159:659–671.
Long, M. Y., and C. H. Langley. 1993. Natural-selection and the origin of jingwei, a chimeric processed functional gene in Drosophila. Science 260:91–95.
Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155.
Lynch, M., and A. Force. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473.
Lynch, M., M. O'Hely, B. Walsh, and A. Force. 2001. The probability of preservation of a newly arisen gene duplicate. Genetics 159:1789–1804.
McDonald, J. H., and M. Kreitman. 1991. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652–654.
Misra, S., M. A. Crosby, C. J. Mungall, B. B. Matthews, K. S. Campbell, P. Hradecky, Y. Huang, J. S. Kaminker, G. H. Millburn, S. E. Prochnik (30 co-authors). 2003. Annotation of the Drosophila melanogaster euchromatic genome: a systematic review. Genome Biol. 3:0083.1–0083.22.
Nagylaki, T. 1988. Gene conversion, linkage, and the evolution of multigene families. Genetics 120:291–301.
Nagylaki, T., and N. Barton. 1986. Intrachromosomal gene conversion, linkage, and the evolution of multigene families. Theor. Popul. Biol. 29:407–437.
Nei, M., and A. Roychoudhury. 1968. Probability of fixation of nonfunctional genes at duplicate loci. Am. Nat. 107:362–372.
Ohta, T. 1981. Genetic-variation in small multigene families. Genet. Res. 37:133–149.
Ohta, T. 1987a. A model of evolution for accumulating genetic information. J. Theor. Biol. 124:199–211.
Ohta, T. 1987b. Simulating evolution by gene duplication. Genetics 115:207–213.
Okuyama, E., H. Shibata, H. Tachida, and T. Yamazaki. 1996. Molecular evolution of the 5'-flanking regions of the duplicated Amy genes in Drosophila melanogaster species subgroup. Mol. Biol. Evol. 13:574–583.
Petrov, D. A., E. R. Lozovskaya, and D. L. Hartl. 1996. High intrinsic rate of DNA loss in Drosophila. Nature 384:346–349.
Petrov, D. A., Y. C. Chao, E. C. Stephenson, and D. L. Hartl. 1998. Pseudogene evolution in Drosophila suggests a high rate of DNA loss. Mol. Biol. Evol. 15:1562–1567.
Posada, D., and K. A. Crandall. 2001. Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc. Natl. Acad. Sci. U.S.A. 98:13757–13762.
R Development Core Team. 2004. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-00-3.
Sawyer, S. 1989. Statistical tests for detecting gene conversion. Mol. Biol. Evol. 6:526–538.
Sawyer, S. 1999. GENECONV: a computer package for the statistical detection of gene conversion. Department of Mathematics, Washington University in St. Louis, http://www.math.wustl.edu/~sawyer.
Shibata, H., and T. Yamazaki. 1995. Molecular evolution of the duplicated amy locus in the Drosophila-melanogaster species subgroup: concerted evolution only in the coding region and an excess of nonsynonymous substitutions in speciation. Genetics 141:223–236.
Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437–460.
Takano, T. S. 1998. Rate variation of DNA sequence evolution in the Drosophila lineages. Genetics 149:959–970.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
Thornton, K. 2003. libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics 19:2325–2327.
Thornton, K., and M. Long. 2002. Rapid divergence of gene duplicates on the Drosophila melanogaster X chromosome. Mol. Biol. Evol. 19:918–925.
Tijet, N., C. Helvig, and R. Feyereisen. 2001. The cytochrome P450 gene superfamily in Drosophila melanogaster: annotation, intron-exon organization and phylogeny. Gene 261:189–198.
Walsh, J. B. 1995. How often do duplicated genes evolve new functions? Genetics 139:421–428.
Watterson, G. A. 1975. Number of segregating sites in genetic models without recombination. Theor. Popul. Biol. 7:256–276.
Zhang, Z., and H. Kishino. 2004a. Genomic background drives the divergence of duplicated amylase genes at synonymous sites in Drosophila. Mol. Biol. Evol. 21:222–227.
Zhang, Z., and H. Kishino. 2004b. Genomic background predicts the fate of duplicated genes: evidence from the yeast genome. Genetics 166:1995–1999.