Differential Selection After Duplication in Mammalian Developmental Genes

Emmanouil T. Dermitzakis and Andrew G. Clark

Institute of Molecular Evolutionary Genetics, Department of Biology, Pennsylvania State University


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 Literature Cited
 
Gene duplication provides the opportunity for subsequent refinement of distinct functions of the duplicated copies. Either through changes in coding sequence or changes in regulatory regions, duplicate copies appear to obtain new or tissue-specific functions. If this divergence were driven by natural selection, we would expect duplicated copies to have differentiated patterns of substitutions. We tested this hypothesis using genes that duplicated before the human/mouse split and whose orthologous relations were clear. The null hypothesis is that the number of amino acid changes between humans and mice was distributed similarly across different paralogs. We used a method modified from Tang and Lewontin to detect heterogeneity in the amino acid substitution pattern between those different paralogs. Our results show that many of the paralogous gene pairs appear to be under differential selection in the human/mouse comparison. The properties that led to diversification appear to have arisen before the split of the human and mouse lineages. Further study of the diverged genes revealed insights regarding the patterns of amino acid substitution that resulted in differences in function and/or expression of these genes. This approach has utility in the study of newly identified members of gene families in genomewide data mining and for contrasting the merits of alternative hypotheses for the evolutionary divergence of function of duplicated genes.


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 Literature Cited
 
Duplication is a nearly ubiquitous property of genes in higher eukaryotes. It is believed that gene duplication provides the material for the generation of new genes and, subsequently, new or more specialized functions. The details of how members of duplicated gene pairs acquire new function are largely unknown. The classical view is that selection is relaxed, because if one gene gains a mutation that alters function, the remaining copy serves as a backup to retain the original function (Ohno 1970Citation ; Ohta 1987Citation ). The divergent copy is then free to acquire random substitutions, and by chance, these substitutions may result in new tissue specificity or other function.

Recently several alternative models have been proposed that consider the dynamics of such processes and determine the relative likelihoods of gain of new function versus loss of function under the pressure of random mutations. Ota and Nei (1994)Citation and Nei, Gu, and Sitnikova (1997)Citation have proposed a birth-and-death process for the evolution of immunity genes, suggesting that gain and loss of genes plays a dominant role in the evolution of multigene families. Clark (1994)Citation showed that the classical model cannot account for the levels of duplicated gene preservation observed and that populations in mutation-selection balance are likely to experience weak positive selection for the initial duplication. Hughes (1994)Citation and Walsh (1995)Citation suggested that the evolution of new functions must be rather frequent. Zhang, Rosenberg, and Nei (1998)Citation demonstrated that positive selection is probably acting on ribonuclease genes, leading to new adaptive functions. Recently, Force et al. (1999)Citation formalized and modeled the idea of subfunctionalization of regulatory regions, whereby the original genes had multiple expression domains, and were then free to lose those subfunctions in a complementary gene-specific manner. Subfunctionalization appears to be consistent with the nature of differences in regulatory regions, and this model can explain the relatively high preservation of duplicate copies compared with the neutral expectation.

In another study, Lynch and Force (2000)Citation modeled the evolution and preservation of duplicated genes under their DDC model (duplication-degeneration-complementation) and concluded that the data provided good support for the idea of subfunctionalization. Furthermore, many proteins have subunits with distinct functions, and the subfunctionalization model can be applied to these subfunctions within the coding region (Henikoff et al. 1997Citation ). If a gene has two domains, each interacting within a specific tissue, subsequent duplication could eventually lead to subfunctionalization of the protein sequence such that each copy could interact with components of one tissue in a complementary manner. This divergence could occur independent of the subfunctionalization of the regulatory sequences, but some interactions are possible. Interest in detection of regions that contribute to functional divergence is increasing, but very few studies have proposed rigorous methods for detecting them (e.g., Gu 1999Citation ). Here, we propose a novel method for identifying heterogeneity in patterns of substitutions along duplicated genes, and we present results from the analysis of 12 gene families present in the human and mouse genomes.


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 Literature Cited
 
Sequences Used
The gene families that we used were taken from the literature and consisted exclusively of genes for which phylogenetic information was available to guarantee orthologous relationships. Orthology was independently investigated by searching OMIM (Online Mendelian Inheritance of Man, www.ncbi.nlm.nih.gov/ omim) and MGI (Mouse Genome Informatics, www.informatics.jax.org/mgihome) for the presence of those genes in syntenic positions (e.g., Nadeau 1989Citation ) and for similar or identical tissue-specific expression patterns. Also, most of the information about the expression patterns of those genes, presented below, was obtained mainly from these two databases and references mentioned therein. In all cases, full-length amino acid sequence was used in the analysis. The gene families and respective gene members that were used include the following (GenBank accession numbers in parentheses for the human and the mouse, respectively): Mef2— Mef2A (Q02078, Q60929), Mef2B (Q02080, JC4221), Mef2C (Q06413, A47561); MyoD—MyoD1 (P15172, P10085), Myf5 (P13349, P24699), Myf6 (P23409, P15375), MyoG (P15173, P12979); Bmp—Bmp2 (P12643, P21274), Bmp6 (P22004, P20722), Bmp7 (P18075, P23359); Notch—Notch1 (P46531, Q01705), Notch4 (AAC32288, P31695); Elav—HuA (AAB41913, AAB17967), HuB (AAA69698, AAC52644), HuC (AAA58677, AAC52999); Hox1—Hoxa1 (U10421, NM_010449), Hoxb1 (NM_002144, NM_008266); Hox3— Hoxa3 (NM_043365, P02831), Hoxb3 (NM_002146, P09026), Hoxd3 (P31249, P09027); Hox4—Hoxa4 (AAB97947, P06798), Hoxb4 (P17483, P10284), Hoxc4 (Q08624, P09017), Hoxd4 (P09016, P10628); Hox5— Hoxa5 (P20719, P09021), Hoxb5 (P09079, P09067); Hox7—Hoxa7 (NP_008827, P02830), Hoxb7 (P09629, P09024); Hox9—Hoxa9 (P31269, P09631), Hoxd9 (P28356, P28357); Hox13—Hoxa13 (P31271, Q62424), Hoxd13 (P35453, P70217).

Description of the Paralog Heterogeneity Test
The paralog heterogeneity test is designed to identify differences between pairs of duplicated genes in the pattern of interspecific sequence divergence. Human and mouse sequences were used because of the rapidly expanding sequence data available for both species. We extracted GenBank amino acid sequence entries that corresponded to gene families from the two species. We aligned all members of each family for both species using CLUSTAL W and deleted all of the gapped sites (complete deletion). Deleting the gapped sites may eliminate some of the regions that functionally differentiate the proteins of the gene family. However, we wanted to have complete correspondence of homologous sites for comparison between different paralogs. Phylogenetic analysis has resolved the relationships of most of the genes used in this study (Zhang and Nei 1996Citation ; Hughes 1999Citation ). However, for each family, we constructed a phylogenetic tree to confirm that our alignments produced the correct orthologous relationships. We then calculated a Qi statistic (from Tang and Lewontin 1999Citation ) for each variable site in each orthologous comparison as


(1)
where j = 1, 2, 3, ... , v is the index of the current, or jth, variable site, v is the total number of variable sites, i is the nucleotide or amino acid position of jth variable site in the sequence, and l is the total length of the sequence in amino acids or nucleotides (fig. 1 ). Qi is positive for runs of sites that are closer together than a uniform distribution, and Qi is negative in regions of sparse variation. We subsequently calculated Qi for nonvariable sites by linear interpolation between the two flanking variable sites. Thus, for a nonvariable site at position i between variable sites at positions x and k on the sequence, the value of Qi is


(2)
For sites before the first variable site and after the last variable site, we consider Qi to be 0 at the two ends of the sequence.



View larger version (22K):
[in this window]
[in a new window]
 
Fig. 1.—A graphical description of the test in a case of nucleotide sequence. Equations below the sequences demonstrate the method for the calculation of Qi for each position i in the sequence. j is the index of a variable site, v is the number of mismatches between the mouse and the human sequence, and l is the total length of the sequence

 
The main idea behind the test is that even when the distribution of substitutions (variable sites) is uneven, the pattern of this heterogeneity will be similar between different paralogs if the functional domains are the same. To test the homogeneity of Qi we calculate the difference between Qi's at the same position for two different human-mouse comparisons (two different paralogs):

(3)
where Q1,i and Q2,i are the Qi values for the ith positions in the human-mouse comparison in paralogous gene copy 1 and paralogous gene copy 2, respectively. The expectation is that if the substitution pattern is similar in the two copies, the D values will be low given the number of variable sites. Finally, we calculate the longest stretch of monotonic increase or decrease of the D values for these two copies and we report the absolute differential R of D values in this run. Significance levels were determined by randomly assigning the same number of observed substitutions in each of the sequences, determining the largest monotonic run of D, and calculating the R value for this run. By doing this 1,000 times, we obtained a null distribution of the R values. Analysis was also performed with the alternative hypothesis that there were two or three regions of differential substitution pattern. In this case, we calculated the sum of differentials (R values) for the top two or three regions. Significance levels again were determined by random permutations identical to the ones for one region, but this time the sum of the largest two or three R values was used to generate the null distribution. The need for this additional analysis is illustrated with simulations in Tang and Lewontin (1999). The rationale is that there may be two or more regions that each exhibit a rather small heterogeneity effect, in which case the effect is distributed across several different regions. The fact that we used random permutation to determine significance levels eliminated any bias in the analysis, because we calculated the same statistic for the permuted sets. The test is directional in the sense that a decline of D means lower variation within the region examined in the first paralog than in the second paralog, while a increase of D means higher variation in the region examined in the first paralog than in the second paralog.

We assume under the null hypothesis that if the different orthologous groups have the same relative substitution properties across the sequence, there should be no heterogeneity in the distribution of substitutions across the sequence, and the test should fail to reject the null hypothesis. On the other hand, if different regions were evolving differently among paralogs, there would be detectable heterogeneity. This method cannot detect substitution rate acceleration if the effect is proportional across the gene for each copy, but it will detect accelerated rates of substitutions if they occur in different regions across different paralogs (fig. 2 ).



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 2.—Hypothetical tree structure for each of two domains when there is (a) an accelerated substitution rate in one orthologous pair across the whole sequence in both species and (b) an accelerated substitution rate differing across domains in the two different orthologous groups. The paralog heterogeneity test will be able to detect heterogeneity in b but not in a.

 
Power Calculations
In order to calculate the power of the test, we simulated the accumulation of substitutions in a protein of 500 residues consisting of two equal-sized domains of 250 residues. Numbers of substitutions in the two domains for the two orthologous pairs were sampled from a Poisson distribution such that the average overall rate of their sequence divergence was preserved. The two domains in each orthologous pair were assumed to diverge at different rates, and we report the ratio of those rates (rate in domain 1/rate in domain 2). In all simulations, the domains accumulated substitutions differently across orthologous pairs such that the domain that had the high rate in one orthologous pair had the low rate in the other pair, and vice versa for the other domain. The ratio of the divergence rates of the two domains of one orthologous pair was precisely the reciprocal of the ratio of rates in the other pair. We simulated 1,000 sets of two pairs of genes (two orthologous groups in two species) with both orthologous groups evolving with three different scenarios:



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 3.—The plots indicate the proportions of tests rejected for the various cases described in Materials and Methods; high indicates the case in which both divergence rates between orthologous genes were 0.45; low indicates the case in which both divergence rates were 0.13; and dif indicates the case in which one orthologous group had a rate of 0.13 and the other had a rate of 0.45. The rate ratio indicated in this figure is the ratio of low to high rate in all cases

 
We did not simulate the divergence of paralogs before speciation because the divergent sites will be shared within the orthologs, so they do not contribute to the pattern. The rates of divergence that were simulated are within the range observed in the analyzed data of the 12 families studied.

Subsequently, we calculated R for each of the 1,000 sets of two orthologous pairs. The significance was determined by comparing the values of R with a null distribution of R values derived from simulations assuming equal rates for the two domains and identical to the simulated sets above for the whole amino acid sequence. The test in this case became more conservative because we did not generate a null distribution by permutation for the actual number of substitutions for each simulated set, so the variance of the null distribution was higher, given that different counts of substitutions were included in this distribution.

Software and Alignments
Computer programs were written in Perl and are available on request. All of the alignments used in this analysis are also available on request.


    Results and Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 Literature Cited
 
We examined 12 gene families that had at least two orthologous pairs in the mouse (Mus musculus) and the human. The pairwise paralog heterogeneity tests revealed that 7 out of 12 families had significant heterogeneity (table 1 ). This heterogeneity appears to have occurred before the human-mouse split, because when we tested for heterogeneity in the amino acid substitutions between paralogs in the two species, this time obtaining the pattern of amino acid substitutions between paralogs within the same species and comparing it with the pattern of amino acid substitution in the same paralogous comparison in the other species, the null hypothesis was never rejected. In other words, the heterogeneity appears as differences among paralogous gene copies rather than as a species-specific effect in the amino acid substitution pattern.


View this table:
[in this window]
[in a new window]
 
Table 1 Data for All the Pairwise Comparisons of the Gene Family Members

 
Of the seven Hox gene families tested, only three showed significant heterogeneity, while the proportion of non-Hox genes that showed significant heterogeneity was much higher. These results are consistent with the widely accepted hypothesis that Hox genes evolve mainly through their regulatory regions rather than their coding region (Holland et al. 1994Citation ). This illustrates once more that gene regulation is an essential component of gene function and evolutionary potential.

Power of the Test
The results of the calculations for the power of the test are summarized in figure 3 . We simulated a rather simple model that just assumed two differentially evolving regions. In reality, the patterns of substitutions can be much more complicated. However, we illustrate that for this case the power of the test is very good if the ratio of rates between the two regions is >0.5, which is realistic if the diversification occurred long enough before the present. This ratio of rates is not uncommon, since we are considering the scenario in which one of the two domains has lost function in one paralog and the other domain has lost function in the other paralog.

Function of Genes that Have Diverged
One of the key predictions of this model is that the regions that will appear to be divergent in substitution pattern are the ones that confer subfunctions. For example, in the case of a transcription factor, the domain that would probably be conserved is the DNA-binding domain, while tissue-specific domains (including transactivation domains) will diverge because of subfunctionalization. In order to find specific examples of such patterns, we examined more closely two of the gene families, MyoD and Mef2, consisting of transcription factors involved in myogenic fate determination in mammals.

The central region of the MyoD genes encodes a helix-loop-helix domain responsible for DNA binding. From the plots of D values across the different pairwise comparisons, it is clear that there is little difference in the amino acid substitution pattern in this central region. On the other hand, the transactivation domain, which is at the C-terminus of the proteins, appears quite diverged (fig. 4 a). In particular, MyoG has accumulated a large number of amino acid differences in both the mouse and the human. This is consistent with the idea that MyoG is a member of the MyoD family with very distinct functions and expression patterns. However, it appears that MyoD and Myf5 are also different in the pattern of amino acid substitutions, as indicated in table 1, with the greatest heterogeneity residing in the transactivation domain (C-terminus). The general pattern of divergence in the transactivation domain is consistent with evolution driven by the interactions of the molecule with the cellular environment.



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 4.—Plots of the D values for comparisons in three gene families. An increase or decline indicates the differential rate of substitution, while plateaus indicate very similar substitution rates across paralogs. Note that the amino acid position refers to the alignment after complete deletion (ungapped). a, Plot of D values for the pairwise comparisons for the MyoD family. Note the divergent domains at the C-termini of the proteins. The gray-filled bar indicates one of the two candidate regions of functional divergence between MyoG and Myf5; the black-filled bar indicates the candidate region for functional divergence in all four significant comparisons. b, Plot of D values for the pairwise comparisons for the Mef2 family. Note the divergent domains at the C-termini of the proteins. The black-filled bars indicate the candidate regions for functional divergence between Mef2 and Mef2c. c, Plot of D values for the pairwise comparisons for the Hox3 gene family. No comparison is significantly heterogenous. Comparisons with statistically significant heterogeneity are indicated with asterisks

 
A similar pattern was observed for the Mef2 (MADS box) gene family. Differences were pronounced in the C-termini (the transactivation domain), while the DNA-binding domain (the MADS and Mef2 domains) was generally conserved (see fig. 4 b). The three members appeared to have different roles in muscle differentiation, and all three had different patterns of substitution across the amino acid sequence. Mef2A is expressed in skeletal and cardiac muscle, while Mef2C is expressed in skeletal muscle and the brain. Mef2A is thought to be involved in the induction of the differentiation process in skeletal muscle, while Mef2C maintains this differentiated state. Mef2B, on the other hand, appears to be more redundant in terms of function and expression, being expressed in skeletal and cardiac muscle and the brain. Consistent with the prediction, Mef2A and Mef2C appeared to have significantly different patterns of amino acid substitution, while Mef2B, which overlaps in function and expression with both, was not significantly different from either of the two others. In addition, we were able to separate candidate regions for the functional divergence in the Mef2A-Mef2C comparison. The maximum differential (R) was observed between sites 86 and 125 (in the ungapped alignment—R = 0.609; increase of D) followed by site 126–183 (R = 0.3769; decline of D) and 255–301 (R = 0.206; decline of D). These become likely candidates for regions that are responsible for interactions with the cardiac tissue (Mef2A-specific) or the brain (Mef2C-specific). In particular, the region comprising sites 86–125 is highly variable in Mef2A but conserved in Mef2C. Therefore, it is a candidate region for interactions with components of the brain. On the other hand, the regions comprising sites 126–183 and 255–301 are conserved in Mef2A and are highly variable in Mef2C, which makes them good candidates for interactions with components of the cardiac muscle tissue.

One possibility is that the functional divergence observed in these two gene families may also be due to the fact that Mef2 genes interact genetically and biochemically with the genes of the MyoD family. Intriguingly, the regions that appear to be different between different homologous groups are the C-termini, which are the transactivation domains for both families. It has been demonstrated that there is direct biochemical interaction between the Mef2 proteins and the MyoD proteins (Molkentin et al. 1995). This interaction in some cases makes the C-terminus (or parts of it) of either the Mef2 or the MyoD gene dispensable. Assuming that the pairwise interactions are specific in terms of which Mef2 protein interacts with which MyoD protein, either due to coexpression in the same tissue or because of biochemical affinity, there would be coevolution of the protein pairs that exhibit the interaction. In addition, the transactivation domains, which interact in the nucleus after DNA binding, may have evolved according to the specific components to which they bind. This would expose the transactivation domains to different selective pressures.

On the other hand, the expectation is that duplicated genes evolving through their regulatory regions will not show heterogeneity in the rate of amino acid substitutions. Interestingly, a recent study (Greer et al. 2000) has shown that it is not the amino acid sequence responsible for the functional differences between Hoxa3 and Hoxd3, but rather quantitative modulation in gene expression. The Hox3 family did not show heterogeneity in our analysis (P > 0.2; fig 4 c), consistent with no subfunctionalization in the coding sequence, as predicted by the experimental data.


    Conclusions
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 Literature Cited
 
Our results demonstrate that coding subfunctionalization is an important component in the evolution of duplicated genes. They also support previous theoretical work (Clark 1994Citation ; Lynch and Force 2000) that claims that the classical model of evolution of duplicated genes does not adequately explain the level of duplicate gene preservation. Duplicated genes undergo many steps of functional divergence, some of which leave their footprints on the amino acid sequences. Finally, we propose a simple and flexible method for the analysis of homologous genes that have been a product of gene duplication. This method not only can detect heterogeneity in different duplicate copies, but also can indicate candidate regions for functional divergence. This is of particular interest in mammalian genomics because of the ubiquity of gene duplication. As we gain access to complete genome sequences of humans and mice, such approaches could be of great value in generating decisive experimentally testable hypotheses.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 Literature Cited
 
We thank Alberto Civetta, Malia Fullerton, Brian Lazzaro, and Kristi Montooth for critically reading earlier versions of the manuscript. We also thank Jeff Townsend for an insightful discussion and two anonymous reviewers for constructive comments.


    Footnotes
 
Yun-Xin Fu, Reviewing Editor

1 Keywords: gene duplication differential selection specialization Back

2 Address for correspondence and reprints: Emmanouil T. Dermitzakis, 208 Mueller Laboratory, Department of Biology, Pennsylvania State University, University Park, Pennsylvania 16802. exd158{at}psu.edu Back


    Literature Cited
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 Literature Cited
 

    Clark, A. G. 1994. Invasion and maintenance of a gene duplication. Proc. Natl. Acad. Sci. USA 91:2950–2954

    Force, A., M. Lynch, F. B. Pickett, A. Amores, Y. L. Yan, and J. Postlethwait. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151: 1531–1545

    Greer, J. M., J. Puetz, K. R. Thomas, and M. R. Capecchi. 2000. Maintenance of functional equivalence during paralogous Hox gene evolution. Nature 403:661–665

    Gu, X. 1999. Statistical methods for testing functional divergence after gene duplication. Mol. Biol. Evol. 16:1664– 1674[Abstract/Free Full Text]

    Henikoff, S., E. A. Greene, S. Pietrokovski, P. Bork, T. K. Attwood, and L. Hood. 1997. Gene families: the taxonomy of protein paralogs and chimeras. Science 278:609– 614

    Holland, P. W., J. Garcia-Fernandez, N. A. Williams, and A. Sidow. 1994. Gene duplications and the origins of vertebrate development. Dev. Suppl. pp. 125–133

    Hughes, A. L. 1994. The evolution of functionally novel proteins after gene duplication. Proc. R. Soc. Lond. B Biol. Sci. 256:119–124[ISI][Medline]

    ———. 1999. Phylogenies of developmentally important proteins do not support the hypothesis of two rounds of genome duplication early in vertebrate history. J. Mol. Evol. 48:565–576[ISI][Medline]

    Lynch, M., and A. Force. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154: 459–473

    Molkentin, J. D., B. L. Black, J. F. Martin, and E. N. Olson. 1995. Cooperative activation of muscle gene expression by Mef2 and myogenic bHLH proteins. Cell 83:1125– 1136

    Nadeau, J. H. 1989. Maps of linkage and synteny homologies between mouse and man. Trends Genet. 5:82–86[ISI][Medline]

    Nei, M., X. Gu, and T. Sitnikova. 1997. Evolution by the birth-and-death process in multigene families of the vertebrate immune system. Proc. Natl. Acad. Sci. USA 94:7799– 7806

    Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, Berlin

    Ohta, T. 1987. Simulating evolution by gene duplication. Genetics 115:207–213

    Ota, T., and M. Nei. 1994. Divergent evolution and evolution by the birth-and-death process in the immunoglobulin VH gene family. Mol. Biol. Evol. 11:469–482[Abstract]

    Tang, H., and R. C. Lewontin. 1999. Locating regions of differential variability in DNA and protein sequences. Genetics 153:485–495

    Walsh, J. B. 1995. How often do duplicated genes evolve new functions? Genetics 139:421–428

    Zhang, J., and M. Nei. 1996. Evolution of Antennapedia-class homeobox genes. Genetics 142:295–303

    Zhang, J., H. F. Rosenberg, and M. Nei. 1998. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA 95:3708–3713

Accepted for publication December 1, 2000.