Department of Biological Sciences, University of South Carolina, Columbia, South Carolina
Abstract
Key Words: block duplication; genome evolution; polyploidization; tandem duplication; vertebrate evolution
Introduction
Hughes (1999) found that these conditions were rarely met in a set of gene families encoding a set of developmentally important proteins. The International Human Genome Sequencing Consortium (2001) applied the same test to 57 families and reported results contrary to the 2R hypothesis in 76% of cases. In the most extensive application to date of this approach, Friedman and Hughes (2001a) examined all four-member gene families in the available portion of the human genome. In 134 families with resolved phylogenies, 71% showed results inconsistent with the 2R hypothesis. In addition, Friedman and Hughes (2001a) compared all homologous gene families in human and Drosophila and found that less than 5% of such families show a 4:1 ratio of the number of family members in human to the number of family members in Drosophila.
Advocates of the 2R hypothesis have frequently cited as evidence in favor of this hypothesis the existence of sets of paralogous genes found on two or more different chromosomes in the genomes of human or other vertebrates (Lundin 1993; Kasahara et al. 1997; Abi-Rached et al. 2002). The implication is that these genes were duplicated simultaneously during polyploidization. However, the hypothesis that a set of linked paralogues were duplicated simultaneously can only be accepted if the phylogenies of the gene families are consistent with their duplication during the same time period. When phylogenetic analyses have been applied to sets of linked paralogues allegedly duplicated simultaneously by polyploidization, the phylogenies have revealed that these genes were in fact duplicated at widely different times over the history of life (Hughes 1998; Hughes, da Silva, and Friedman 2001). Furthermore, such potentially duplicated blocks are usually identified in a subjective manner. In a genome with numerous gene families, however, members of two or more of these families may be found in close linkage merely by chance, without having been duplicated simultaneously. For this reason, it is desirable to employ a statistical test of the hypothesis that genes have been duplicated simultaneously (Friedman and Hughes 2001b).
The major alternative to polyploidization hypotheses for explaining the occurrence of paralogous genes on different chromosomes is a hypothesis of tandem duplication followed by translocation of one or both duplicates to other chromosomes. Note that such a tandem duplication may involve a single genetic locus or a chromosomal block including several loci. Completely sequenced genomes of eukaryotes include examples of recently duplicated intrachromosomal blocks including many loci (Friedman and Hughes 2001b). A variety of mechanisms exist by which linked duplicates can be separated over evolutionary time. These include chromosomal breakage and rearrangement. Comparisons of the genomic maps of various mammals provide evidence that such mechanisms have operated repeatedly over the course of mammalian evolution (O'Brien et al. 1999).
Two recent papers have provided evidence of a peak of gene duplications early in vertebrate history, which the authors claim to be evidence that one or more genome duplications occurred at that time (Gu, Wang, and Gu 2002; McLysaght, Hokamp, and Wolfe 2002). There are numerous problems with these authors' analyses, however, and their conclusions are not well supported. First, in both studies the evidence for a peak of gene duplications relies on divergence time estimates made under the assumption of a molecular clock. It is well known that this assumption is often not met by molecular data (Li 1997, pp. 215-235; Nei and Kumar 2000, pp. 187-206). Furthermore, in both of these studies the evidence of a peak in gene duplications was entirely subjective. No statistical method was applied in either study to detect whether the observed peak might be attributed to random fluctuations. Finally, although McLysaght, Hokamp, and Wolfe (2002) used a statistical approach to identify potentially duplicated blocks in the human genome, they did not compare the duplication times of genes located within these blocks with those of genes outside the blocks.
In the present paper, we assign to families a set of highly conserved human protein-coding genes for which map positions are available. Using these conserved gene families, we apply the method of Friedman and Hughes (2001b) to identify potentially duplicated blocks in the human genome. We then identify homologous genes from other representative vertebrates, from the completely sequenced genomes of the invertebrates Drosophila melanogaster and Caenorhabditis elegans, and from the genomes of yeast and Arabidopsis thaliana. By constructing phylogenies of these gene families, we time gene duplication events relative to major cladogenetic events in vertebrate evolutionary history without relying on the assumption of a molecular clock. Comparing the duplication times of gene pairs within potentially duplicated blocks enables us to test the hypothesis that these blocks arose by polyploidization early in vertebrate history. In addition, comparison of the duplication times of gene pairs located on the same and different chromosomes enables us to estimate the pattern and rate of separation of duplicated gene pairs over the evolutionary history of the vertebrates.
Methods
The location of each protein in human was parsed from the features table. The location was then related to the protein sequence using the locus name as the unique identifier. When two predicted genes overlapped in location, one gene was chosen at random in order to eliminate redundancy; such overlap presumably occurs because of alternatively spliced transcripts from the same gene. After redundancy was removed, 13,802 proteins remained in the human data set. This number represents 47.0% of the total number of known and predicted proteins (both mapped and unmapped), including splice variants.
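The redundancy-removal step can be illustrated with a minimal sketch. It assumes each gene is given as a hypothetical `(locus, chromosome, start, end)` tuple with inclusive coordinates, and that chains of overlapping genes are treated as one cluster from which a single locus is drawn at random; the function name, the tuple layout, and the chaining rule are assumptions made for illustration, not details given in the text.

```python
import random
from collections import defaultdict

def remove_overlapping(genes, seed=0):
    """Keep one randomly chosen locus per cluster of overlapping genes.

    genes: list of (locus, chrom, start, end) tuples (hypothetical layout).
    Genes on the same chromosome whose coordinates overlap are grouped,
    and one locus per group is kept.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    by_chrom = defaultdict(list)
    for locus, chrom, start, end in genes:
        by_chrom[chrom].append((start, end, locus))

    kept = []
    for chrom in sorted(by_chrom):
        cluster, cluster_end = [], None
        # Sweep genes in order of start position along the chromosome.
        for start, end, locus in sorted(by_chrom[chrom]):
            if cluster and start > cluster_end:
                # No overlap with the current cluster: close it out.
                kept.append(rng.choice(cluster))
                cluster, cluster_end = [], None
            cluster.append(locus)
            cluster_end = end if cluster_end is None else max(cluster_end, end)
        if cluster:
            kept.append(rng.choice(cluster))
    return kept

# Illustrative coordinates only; these are not data from the study.
example = [
    ("geneA", "1", 100, 500),
    ("geneA_alt", "1", 300, 700),
    ("geneB", "1", 900, 1200),
    ("geneC", "2", 50, 80),
]
kept = remove_overlapping(example)
```

In this toy input, `geneA` and `geneA_alt` overlap, so only one of them survives, while `geneB` and `geneC` are always kept.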
The text file of the nonredundant proteins was formatted as a database file using the BLAST tools obtained from the National Center for Biotechnology Information (NCBI) ftp site (Altschul et al. 1997). After the nonredundant proteome was determined for each genome, each protein was used to search for homology among the rest of the proteome. This "all against all" BLAST method was performed using the blastall executable which is packaged with the BLAST tools. Similarly, each protein in the nonredundant human proteome was searched against all proteins from other genomes. In searching human proteins against the remainder of the human proteome and in searching human proteins against nonhuman proteins, we used an expect value of $E = 10^{-50}$. The use of a strict search criterion has the advantage that it identifies as homologous only proteins showing evidence of homology throughout the length of the protein rather than in only one domain or a few domains (Friedman and Hughes 2001a, 2001b). In all cases, we used the defaults of a BLOSUM62 substitution matrix and the SEG filter (Wootton and Federhen 1993). The resultant records were filtered using MSPcrunch, a program to filter and convert the BLAST output to a tabular format (Sonnhammer and Durbin 1994).
Given all pairs of homologous proteins, a "single link" method was used to find the protein families. This step groups genes that share homology, whether directly or through an intermediate homologue. For example, if genes A and B are in a family, and B and C are in another family, then A, B, and C are in a family. Further, in this example, if A and D also share homology, then A, B, C, and D are in a family. Using the strict homology search criterion $E = 10^{-50}$ yielded a set of 1,520 highly conserved gene families with two or more members in human, containing a total of 5,475 genes (table 1). These were used in the identification of potentially duplicated genomic blocks and phylogenetic analyses (see below). Within families, amino acid sequences were aligned using ClustalW 1.81 (Thompson, Higgins, and Gibson 1994).
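The single-link grouping in the A, B, C, D example can be sketched with a union-find structure. This is an illustrative sketch only: the input format (a list of hypothetical `(gene_a, gene_b, evalue)` tuples), the function name, and the use of union-find are assumptions, not a description of the software actually used.

```python
def single_link_families(hits, max_evalue=1e-50):
    """Group genes into families by single linkage.

    hits: iterable of (gene_a, gene_b, evalue) tuples (hypothetical format).
    Two genes end up in the same family if they are connected by a chain
    of hits, each with an E value at or below max_evalue.
    """
    parent = {}

    def find(x):
        # Find the representative of x's family, compressing the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in hits:
        if evalue <= max_evalue:  # apply the strict homology criterion
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two families

    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), set()).add(gene)
    # Only families with two or more members are of interest here.
    return [fam for fam in families.values() if len(fam) >= 2]

# Toy hits mirroring the example in the text; E values are made up.
hits = [
    ("A", "B", 1e-80),
    ("B", "C", 1e-60),
    ("A", "D", 1e-55),
    ("E", "F", 1e-10),  # too weak to pass the threshold
]
families = single_link_families(hits)
```

Here A, B, C, and D fall into one family even though C and D share no direct hit, whereas the weak E-F hit is discarded by the threshold.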
Note that McLysaght, Hokamp, and Wolfe (2002) used a method similar to that of Friedman and Hughes (2001b) to identify potentially duplicated blocks in the human genome. However, the method of these authors differed from that of Friedman and Hughes (2001b) in important ways. First, these authors did not use a strict homology search criterion; thus, they may have identified as homologues genes showing strong evidence of homology only in a portion of the sequence. Second, McLysaght, Hokamp, and Wolfe (2002) "collapsed" all tandem arrays of duplicated genes, counting each such array as only a single gene. Thus, their randomization test of the hypothesis that the same families occur in syntenic groups in separate genomic locations to a greater extent than expected by chance is biased and is more likely to reject the null hypothesis than is our test.
Phylogenetic Analyses
Phylogenetic trees were constructed by the quartet maximum-likelihood (ML) method (Strimmer and von Haeseler 1996) as implemented in TREEPUZZLE 5.0, using the JTT (Jones, Taylor, and Thornton 1992) model of amino acid evolution and assuming that rate variation among sites followed a gamma distribution. All trees were treated as unrooted, and no attempt was made to assign an outgroup to root any tree. On the basis of tree topology, we determined the time of each human gene duplication event relative to the following cladogenetic events: the deuterostome-protostome divergence; the amniote-amphibian divergence; and the primate-rodent divergence. This method of timing duplication events does not assume a constant rate of molecular evolution ("molecular clock") and is independent of the rooting of the tree. We concluded that a gene duplicated prior to a cladogenetic event if the internal branch supporting that duplication was significantly supported. We considered a branch to be significantly supported if it was supported in 95% or more of 10,000 puzzling steps; this represents a highly conservative test for significance of an internal branch (Strimmer and von Haeseler 1996).
In counting the proportions of duplications in a set of trees that could be dated prior to a given cladogenetic event, we compared only those families for which a sufficient number of sequences were available for the hypothesis of duplication before that event to be tested. For example, in a gene family including two human sequences, a mouse sequence, and a Drosophila sequence, the only hypothesis that can be tested regarding the duplication of the human genes is that they duplicated prior to the primate-rodent divergence. This hypothesis would be supported if one of the human genes clustered with the mouse gene and one with the Drosophila gene, and if the internal branch separating these two clusters was significantly supported. By contrast, if the two human genes clustered together, the phylogeny would be taken as consistent with the hypothesis that the human genes duplicated after the primate-rodent divergence. In comparing proportions of gene duplications occurring before and after a given cladogenetic event, we compared only (1) those families in which there was significant support for duplication before that cladogenetic event and (2) those families in which it was possible to test the hypothesis of gene duplication prior to that cladogenetic event yet the topology of the tree was consistent with duplication after that event.
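The four-sequence example above amounts to a small decision rule, sketched below. The representation of the unrooted tree as a pair of two-taxon groups, the taxon labels, the function name, and the returned labels are all hypothetical; the 95% threshold is the one stated in the text, and an insufficiently supported branch is labelled "unresolved" here as an assumption about how such cases are set aside.

```python
HUMANS = frozenset({"human1", "human2"})

def classify_duplication(split, support, threshold=0.95):
    """Time a human gene duplication relative to the primate-rodent split.

    split: the two sides of the internal branch of an unrooted
           four-taxon tree, e.g. ({"human1", "mouse"}, {"human2", "fly"}).
    support: fraction of puzzling steps supporting that internal branch.
    """
    side_a, side_b = (frozenset(side) for side in split)
    if HUMANS in (side_a, side_b):
        # The two human genes cluster together: topology is consistent
        # with duplication after the primate-rodent divergence.
        return "after"
    if support >= threshold:
        # Human genes are separated by a significantly supported branch.
        return "before"
    return "unresolved"

# Support values below are illustrative, e.g. 9,700 of 10,000 steps.
case_before = classify_duplication(
    ({"human1", "mouse"}, {"human2", "fly"}), support=9700 / 10000)
case_after = classify_duplication(
    ({"human1", "human2"}, {"mouse", "fly"}), support=0.60)
case_weak = classify_duplication(
    ({"human1", "fly"}, {"human2", "mouse"}), support=0.80)
```

Only the "before" and "after" outcomes would enter the comparison of proportions described above.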
Results
Given that the human genome includes numerous closely linked duplicated genes, it might be argued that sharing of two or more families between windows will be observed relatively infrequently simply because a large proportion of windows will include members of a single family. To test for this possibility, we examined the distribution of the number of members in the largest family observed in each window (fig. 1). With a window size of 10, only 7 of 548 windows (1.3%) included 10 genes all of the same family; and in only 28 of 548 windows (5.1%) was the largest family represented by 6 or more genes. The median value for the size of the largest family per window was 2.0, whereas the mean was 2.34 with a standard error of 0.07. Thus, with a window size of 10, there were some cases in which a single family accounted for over half the genes in a window, but these were very rare.
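The per-window statistic described above can be sketched as follows. The sketch assumes genes are listed in map order as hypothetical family identifiers and that windows are consecutive and non-overlapping; both are assumptions made for illustration, as is the function name.

```python
from collections import Counter

def largest_family_per_window(family_ids, window_size=10):
    """Return, for each window, the count of its most frequent family.

    family_ids: family identifier of each gene, in map order
                (hypothetical input format).
    """
    sizes = []
    for i in range(0, len(family_ids), window_size):
        window = family_ids[i:i + window_size]
        # most_common(1) gives [(family, count)] for the largest family.
        sizes.append(Counter(window).most_common(1)[0][1])
    return sizes

# Toy gene order with a window size of 5; not data from the study.
toy_order = ["f1", "f1", "f2", "f3", "f1",
             "f4", "f4", "f4", "f4", "f5",
             "f6", "f7"]
sizes = largest_family_per_window(toy_order, window_size=5)
```

A distribution of such values, computed over all windows, is what is summarized by the median, mean, and tail counts reported above.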
To examine the maximal number of possibly duplicated regions, we combined the data on windows of size 10 sharing three or more families and windows of size 30 sharing six or more families to identify potentially duplicated genomic regions. The combined data included 422 duplicated gene pairs in 92 putatively duplicated genomic blocks. (A listing of the genes involved is available from the authors upon request.) Figure 2 shows a graph illustrating sharing of one or more duplicated blocks between chromosomes. All chromosomes were represented in the graph except chromosome 21, which shared no duplicated blocks with other chromosomes; and the graph was completely connected (fig. 2). In addition to between-chromosome duplicated blocks, within-chromosome duplicated blocks were observed on chromosomes 1, 3, 6, 8, 11, 15, 16, 17, 19, 22, and X.
We used the proportion of duplicate pairs estimated to have duplicated before a given cladogenetic event to estimate $P_t$. We then used linear regression of the natural logarithms of these estimated $P_t$ values against estimates of the times of the cladogenetic events in order to estimate the slope of the best-fit line through the origin and through these three time points. We used the estimates of 110 MYA for the primate-rodent divergence; 360 MYA for the amniote-amphibian divergence; and 830 MYA for the deuterostome-protostome divergence (Kumar and Hedges 1998; Nei, Xu, and Glazko 2001).
Although only three data points were available, the fit of the regression ($R^2$ = 96.1%; P = 0.02) was quite good (fig. 4). The resulting estimate was $1.7 \times 10^{-9}$ per gene-pair per year. Because of the simplifying assumptions made, this represents a minimum estimate of the probability of separation of tandemly duplicated gene pairs.
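For reference, a least-squares line constrained to pass through the origin has a closed-form slope. Assuming ordinary least squares was used, and writing $t_i$ for the three divergence times and $y_i = \ln P_{t_i}$ for the log-transformed estimates, the slope would be as below; the symbol $b$ and the index $i$ are introduced here for illustration only.

```latex
b = \frac{\sum_{i=1}^{3} t_i \, y_i}{\sum_{i=1}^{3} t_i^{2}},
\qquad y_i = \ln P_{t_i}
```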
Discussion
Gu, Wang, and Gu (2002) and McLysaght, Hokamp, and Wolfe (2002) reported peaks of gene duplication early in vertebrate history on the basis of molecular clock analyses. However, many of these gene duplications may have been incorrectly timed because of defects inherent in molecular clock analyses. It is well known that gene duplication is often followed by a period of accelerated evolution at the amino acid level as daughter genes adapt to distinct functions (Hughes 1994). Such accelerated evolution will disrupt the molecular clock and cause the duplication to be dated earlier than it actually occurred. The existence of numerous such cases might create an artifactual "peak" of gene duplications at earlier dates.
Gu, Wang, and Gu (2002) did not attempt to weed out protein families not evolving in a clock-like manner; thus, their analyses probably included a high proportion of families in which the assumption of a molecular clock is not valid. McLysaght, Hokamp, and Wolfe (2002) did attempt to weed out families not evolving in a clock-like manner. They used the two-cluster test of Takezaki, Rzhetsky, and Nei (1995), which tests for nonuniformity of rate between two groups of sequences in comparison to an outgroup. This method, however, will not be able to detect bursts of rapid evolution after gene duplication if they occur in both duplicates. In addition, the families assumed by McLysaght, Hokamp, and Wolfe (2002) to be evolving in a clock-like manner were in fact merely those families for which they lacked statistical power to detect deviations from a molecular clock. Thus, the data set on which they based their estimates probably included many families of short sequences and sequences with high rates of replacement per site, because in these cases the test will lack statistical power to detect deviations from the molecular clock even when such deviations are present.
Furthermore, it is worth noting that an apparent peak of gene duplications, even if it is not a statistical artifact, is not in itself evidence of polyploidization. Gene duplication occurs continually over the course of evolution, but most duplicate genes are quickly lost (Lynch and Conery 2000). Specialization of a duplicate gene for a new function substantially enhances its probability of being retained (Hughes 1994; Lynch et al. 2001). Thus, if we observe an apparent peak of gene duplication in the past history of a species, what we are really observing is not a peak of gene duplication per se, but rather a peak of retention of duplicate gene copies. And because retention of duplicate genes is likely to be associated with the evolution of new functions (Hughes 1994; Lynch et al. 2001), an apparent peak of gene duplication is likely to be the signature not of polyploidization but of adaptive radiation. Thus, even if the peaks of retention of duplicate gene copies early in vertebrate history reported by Gu, Wang, and Gu (2002) and McLysaght, Hokamp, and Wolfe (2002) are not artifacts, they provide no information one way or another regarding the hypothesis of polyploidization. Rather, they merely reflect the occurrence of adaptive radiation early in vertebrate history, which is unsurprising on the basis of our knowledge of vertebrate paleontology (Carroll 1988).
On the polyploidization hypothesis, duplicated genomic regions are the residue of ancient polyploidization events (Lundin 1993). Here we used a simple method to identify such duplicated regions (Friedman and Hughes 2001b). When this method was previously applied to the genome of yeast, it provided a strong signal of genome duplication, as had previously been proposed for this species by Wolfe and Shields (1997), an event estimated to have occurred 200-300 MYA (Friedman and Hughes 2001b). This method identified a number of potentially duplicated blocks in the human genome. However, when phylogenetic analysis was used to time gene duplications between these blocks, the results provided no strong signal of ancient polyploidization. On the polyploidization hypothesis, we would expect gene pairs in duplicated blocks to show a disproportionate number duplicated after the deuterostome-protostome divergence but before the amniote-amphibian divergence. In fact, the pattern of gene duplication times in duplicated blocks was not different from that outside the blocks (table 3). Likewise, when chromosome pairs previously alleged to show the effects of ancient polyploidization were analyzed, the pattern of duplication times was very similar to that for genes on other chromosome pairs (table 4).
Duplication of certain of the gene pairs in duplicated blocks could be dated with strong statistical support prior to the deuterostome-protostome divergence, whereas others duplicated after the amniote-amphibian divergence or after the primate-rodent divergence (table 3). A similar result was seen in the case of genes on allegedly duplicated chromosome pairs (table 4). Duplication times after the amniote-amphibian divergence or prior to the deuterostome-protostome divergence are not explainable by polyploidization early in vertebrate history. Thus, our results suggest that, if genome duplication did occur early in vertebrate history, it was not responsible for a large fraction of the duplicated genes or for a large fraction of the duplicated genomic blocks found in the genomes of current-day vertebrates.
The simplest alternative model to that of polyploidization to explain the increase in gene number in vertebrates is a model invoking repeated independent events of tandem gene duplication (Hughes, da Silva, and Friedman 2001). These tandem duplications might involve individual genes or they might involve chromosomal blocks such as we detected on chromosomes 1, 3, 6, 8, 11, 15, 16, 17, 19, 22, and X. Indeed, recent evidence from the human genome suggests that duplication of genomic blocks is a recurring feature of vertebrate genome evolution (Bailey et al. 2002; Samonte and Eichler 2002). Once a gene or genomic segment has been duplicated, subsequent events of chromosome breakage and translocation of chromosomal segments can serve to break up tandemly duplicated gene pairs.
Several aspects of our data support this model. First, we found a consistent tendency for duplicated gene pairs mapping to the same chromosome to have duplicated more recently than those mapping to separate chromosomes (fig. 3). This was true of genes in duplicated blocks as well as of other genes (table 3). Furthermore, our results on the timing of gene duplications provided an excellent fit to a simple model assuming only tandem duplication and a constant probability of separation onto different chromosomes (fig. 4).
We estimated the rate of separation of tandemly duplicated gene pairs onto different chromosomes in the human lineage at $1.7 \times 10^{-9}$ per gene-pair per year. This estimate represents a long-term average for the human lineage, and it cannot be expected to apply to vertebrates with numbers of chromosomes either much larger or much smaller than those of humans. Given this rate, it is expected that after 100 million years, about one in six duplicated gene pairs will have separated onto different chromosomes. After 450 million years, the estimated time since the last common ancestor of bony fishes and tetrapods (Kumar and Hedges 1998), about three-quarters of duplicated gene pairs are expected to be separated onto different chromosomes. The anticipated availability of complete genomic sequences from human, mouse, pufferfish, and zebrafish will make it possible to test these predictions. In addition, application of similar methods to a number of complete genomes will make it possible for us to develop more precise quantitative models of vertebrate chromosomal evolution.
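The two fractions quoted above can be checked with simple arithmetic. The sketch below assumes they follow from multiplying the estimated rate by the elapsed time (a linear approximation); that reading is an assumption about how the figures were obtained, and the function name is hypothetical.

```python
RATE = 1.7e-9  # separations per gene-pair per year, as estimated in the text

def expected_fraction_separated(years, rate=RATE):
    """Approximate fraction of duplicate pairs separated after `years`.

    Uses the simple product rate x time; this linear form is an
    assumption, not a formula stated in the text.
    """
    return rate * years

after_100_my = expected_fraction_separated(100e6)  # compare with 1/6
after_450_my = expected_fraction_separated(450e6)  # compare with 3/4
```

Under this assumption the two values come out close to one in six and three-quarters, respectively.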
Acknowledgements
Footnotes
E-mail: austin{at}biol.sc.edu.
Literature Cited
Abi-Rached, L., A. Gilles, T. Shiina, P. Pontarotti, and H. Inoko. 2002. Evidence of en bloc duplication in vertebrate genomes. Nat. Genet 31:100-105.
Altschul, S. F., T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402.
Bailey, J. A., G. Gu, R. A. Clark, K. Reinert, R. V. Samonte, S. Schwartz, M. D. Adams, E. W. Myers, P. W. Li, and E. E. Eichler. 2002. Recent segmental duplications in the human genome. Science 297:1003-1007.
Carroll, R. L. 1988. Vertebrate paleontology and evolution. W. H. Freeman, New York.
Friedman, R., and A. L. Hughes. 2001a. Pattern and timing of gene duplication in animal genomes. Genome Res 11:1842-1847.
Friedman, R., and A. L. Hughes. 2001b. Gene duplication and the structure of eukaryotic genomes. Genome Res 11:373-381.
Gu, X., Y. Wang, and J. Gu. 2002. Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate genomes. Nat. Genet 31:205-209.
Hughes, A. L. 1994. The evolution of functionally novel proteins after gene duplication. Proc. R. Soc. Lond. Ser. B 256:119-125.
Hughes, A. L. 1998. Phylogenetic tests of the hypothesis of block duplication of homologous genes on human chromosomes 6, 9, and 1. Mol. Biol. Evol 15:854-870.
Hughes, A. L. 1999. Phylogenies of developmentally important proteins do not support the hypothesis of two rounds of genome duplication early in vertebrate history. J. Mol. Evol 48:565-576.
Hughes, A. L., J. da Silva, and R. Friedman. 2001. Ancient genome duplications did not structure the human Hox-bearing chromosomes. Genome Res 11:771-780.
International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.
Jékely, G., and P. Friedrich. 1999. The evolution of the calpain family as reflected in paralogous chromosome regions. J. Mol. Evol 49:272-281.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci 8:275-282.
Kasahara, M., Y. Nayaka, Y. Satta, and N. Takahata. 1997. Chromosomal duplication and the emergence of the adaptive immune system. Trends Genet 13:90-92.
Kent, W. J., and D. Haussler. 2001. Assembly of the working draft of the human genome with GigAssembler. Genome Res 11:1461-1462.
Kumar, S., and S. B. Hedges. 1998. A molecular timescale for vertebrate evolution. Nature 392:917-919.
Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Lundin, L. G. 1993. Evolution of the vertebrate genome as reflected in paralogous chromosome regions in man and the house mouse. Genomics 16:1-19.
Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151-1155.
Lynch, M., M. O'Hely, B. Walsh, and A. Force. 2001. The probability of preservation of a newly arisen gene duplicate. Genetics 159:1789-1804.
McLysaght, A., K. Hokamp, and K. H. Wolfe. 2002. Extensive genomic duplication during early chordate evolution. Nat. Genet 31:200-204.
Meyer, A., and M. Schartl. 1999. Gene and genome duplication in vertebrates: the one-to-four (-to-eight in fish) rule and the evolution of novel gene functions. Curr. Opin. Cell Biol 11:699-704.
Nei, M., and S. Kumar. 2000. Molecular evolution and phylogenetics. Oxford University Press, New York.
Nei, M., P. Xu, and G. Glazko. 2001. Estimation of divergence times from multiprotein sequences for a few mammalian species and several distantly related organisms. Proc. Natl. Acad. Sci. USA 98:2497-2502.
O'Brien, S. J., M. Menotti-Raymond, W. J. Murphy, W. G. Nash, J. Wiensburg, R. Stanyon, N. G. Copeland, N. A. Jenkins, J. Womack, and J. A. M. Graves. 1999. The promise of comparative genomics in mammals. Science 286:458-481.
Samonte, R. V., and E. E. Eichler. 2002. Segmental duplications and the evolution of the primate genome. Nat. Rev. Genet 3:65-72.
Sidow, A. 1996. Gen(om)e duplications in the evolution of early vertebrates. Curr. Opin. Genet. Dev 6:715-722.
Sonnhammer, E. L. L., and R. Durbin. 1994. A workbench for large scale sequence homology analysis. Comput. Appl. Biosci 10:301-307.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol 13:964-969.
Takezaki, N., A. Rzhetsky, and M. Nei. 1995. Phylogenetic test of the molecular clock and linearized trees. Mol. Biol. Evol 12:823-833.
Thompson, J. D., D. G. Higgins, and T. Gibson. 1994. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680.
Wolfe, K. H. 2001. Yesterday's polyploids and the mystery of diploidization. Nat. Rev. Genet 2:333-341.
Wolfe, K. H., and D. C. Shields. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708-713.
Wootton, J. C., and S. Federhen. 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comp. Chem 17:149-163.