* Department of Ecology and Evolution, University of Chicago; Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg;
Institute of Statistics, National Chiao-Tung University, Hsinchu, Taiwan; and
Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park
Correspondence: E-mail: whli{at}uchicago.edu.
Abstract |
---|
Key Words: gene density, recombination rate, repetitive elements, positive selection
Introduction |
---|
To date, two groups have done genome-wide analyses of segmental duplications (Bailey et al. 2002; Cheung et al. 2003). Bailey et al. (2002) estimated the proportion of duplicated segments (≥1 kb and ≥90% sequence similarity) in the entire genome to be 5.2%, and Cheung et al. (2003) estimated the proportion of duplicated segments (≥5 kb and ≥90% sequence similarity) to be 3.5%. Apart from different criteria of duplication size, the discrepancy between the two estimates could also be caused by different methods used to identify duplicated regions and different genome assembly versions (Cheung et al. 2003).
A more complete assembly version of the human genome became available in April 2003 but has not yet been analyzed. In this study, we identified the segmental duplications that are ≥1 kb in length and ≥90% in sequence similarity in the hg15 version and found great variation in the extent of segmental duplication within and among chromosomes. To understand the causes of the observed variation, we examined a number of factors, including regional gene density, repeat sequence density, recombination rate, and GC content. Why are there so many segmental duplications in the human genome? To address this question, we contrasted duplications containing genes with duplications containing no genes in terms of duplication frequency, size, and sequence similarity.
Materials and Methods |
---|
After this step, we took out the sequences of each block pair plus the 10-kb sequences from each side of the block. We then used the GS-aligner program (Shih and Li 2003) to align the two sequences of each block pair. The GS-aligner produces HSP (high-scoring segment pairs) and non-HSP regions. HSP regions are highly similar regions without gaps, whereas non-HSP regions have a lower similarity and may contain gaps. Two HSP regions with ≥90% sequence similarity are combined if the non-HSP region between them also has a sequence similarity ≥90%. However, non-HSP regions may have a similarity lower than 90% because of random fluctuations. To be more rigorous, we applied a binomial test to each non-HSP region, and if the sequence similarity was not significantly lower (P < 0.05) than 90%, the two flanking HSP regions and the non-HSP region were combined into one segment. The alignment end was extended by dynamic programming for up to 5 kb outward from both ends of the alignment, while maintaining the requirement of ≥90% sequence similarity during the extension process.
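The merging rule for a non-HSP region can be illustrated with a minimal Python sketch. It assumes a one-sided exact binomial test in which each aligned position is treated as an independent trial that matches with probability 0.90; the function names, the treatment of gaps, and the choice of an exact lower-tail test are assumptions made for illustration and are not taken from the GS-aligner implementation.

```python
from math import exp, lgamma, log


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the Binomial(n, p) probability mass at k (requires 0 < p < 1)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1.0 - p)


def binomial_lower_tail(matches: int, length: int, p: float = 0.90) -> float:
    """P(X <= matches) for X ~ Binomial(length, p), summed in log space."""
    return sum(exp(log_binom_pmf(k, length, p)) for k in range(matches + 1))


def can_merge(matches: int, length: int,
              threshold: float = 0.90, alpha: float = 0.05) -> bool:
    """True if the similarity of a non-HSP region is NOT significantly
    below `threshold` at level `alpha`, so the flanking HSPs may be joined.

    `matches` and `length` are hypothetical inputs: the number of identical
    positions and the number of aligned positions in the non-HSP region.
    """
    if length == 0:
        return True  # nothing between the two HSP regions
    return binomial_lower_tail(matches, length, threshold) >= alpha


# 88% identity over 100 positions is within random fluctuation of 90%,
# whereas 80% identity over 100 positions is significantly lower.
print(can_merge(88, 100), can_merge(80, 100))
```

Under these assumptions, a short non-HSP region with slightly less than 90% identity is still merged, while a region with clearly lower identity splits the alignment into separate segments.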
Noting that some regions showed a higher density of duplication than others, we examined these regions for possible causes of duplication. Specifically, we divided each chromosome into nonoverlapping regions of 500 kb, except for pericentromeric and subtelomeric regions (see the definition used in Bailey et al. [2001]). For each region, we calculated the duplication-enrichment index, which is defined as the ratio of the observed percentage of duplications in the region to the percentage of duplications in the entire genome in terms of sequence length. We also considered the duplication-enrichment index in terms of the number of duplications in a region for all our analyses.
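As a rough illustration of the duplication-enrichment index in terms of sequence length, the following Python sketch computes it for nonoverlapping 500-kb windows of one chromosome. The interval representation (half-open `(start, end)` coordinates), the function names, and the genome-wide duplicated fraction passed in as a parameter are assumptions for the example; the exclusion of pericentromeric and subtelomeric regions is not modeled here.

```python
def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals so that
    duplicated bases are not counted twice."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def window_coverage(intervals, chrom_len, window=500_000):
    """Duplicated base pairs falling in each nonoverlapping window.
    Intervals are assumed to lie within [0, chrom_len)."""
    n_windows = (chrom_len + window - 1) // window
    covered = [0] * n_windows
    for start, end in merge_intervals(intervals):
        w = start // window
        while start < end:
            piece_end = min((w + 1) * window, end)
            covered[w] += piece_end - start
            start = piece_end
            w += 1
    return covered


def enrichment_indexes(intervals, chrom_len, genome_dup_fraction,
                       window=500_000):
    """Observed duplicated fraction of each window divided by the
    duplicated fraction of the entire genome."""
    indexes = []
    for i, bp in enumerate(window_coverage(intervals, chrom_len, window)):
        win_len = min(window, chrom_len - i * window)  # last window may be short
        indexes.append((bp / win_len) / genome_dup_fraction)
    return indexes


# Hypothetical 1-Mb chromosome with two overlapping duplicated segments
# and an assumed genome-wide duplicated fraction of 5%.
dups = [(400_000, 600_000), (550_000, 650_000)]
print(enrichment_indexes(dups, 1_000_000, 0.05))
```

An index above 1 marks a window that is enriched for duplications relative to the genome as a whole, and an index below 1 marks a depleted window. The count-based version of the index would replace covered base pairs with the number of duplications per window.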
We examined several factors that might affect the frequency of duplication. First, we examined the relationship between the gene density and the duplication-enrichment index of a region. We used two gene databases for this analysis: known genes and Ensembl genes. Second, we examined the relationship between density of repetitive elements and extent of duplication in the region. Third, because repetitive elements (such as microsatellites and transposable elements) tend to accumulate in low-recombination-rate regions (Bartolome, Maside, and Charlesworth 2002), and because recent segmental duplications are considered low-copy repeats (Stankiewicz and Lupski 2002), it is of interest to examine how duplicated regions are distributed with respect to local recombination rates. We used the deCODE recombination rates available at http://genome.cse.ucsc.edu/index.html. Fourth, we calculated the GC content of each chromosomal region and examined the correlation between duplication-enrichment index and GC content.
Finally, we divided all duplications into duplications containing complete genes and duplications containing no complete genes and compared their frequency, size, and sequence similarity. We simulated 10,000 samples under a neutral-duplication model to examine whether our observed frequencies and size distributions are expected under the neutral model. The neutral model assumes that duplications can occur anywhere on the chromosome and that the frequency and size distribution for each type of duplication is simply the result of the random distribution of genes on the chromosome. For example, the neutral expectations of the frequencies of the two types of duplications were obtained as follows.
Each simulated duplication is randomly sampled without replacement from the observed duplications. The size of the simulated duplication is the same as the sampled duplication. If the sampled duplication is intrachromosomal, we pick a site from a uniform distribution of all the sites on the chromosome. If the sampled duplication is interchromosomal, we pick one chromosome from the two chromosomes involved with equal probability and randomly pick a site on the chosen chromosome. We then determine the type of the duplication based on known genes and Ensembl genes. This procedure is continued until each observed duplication is simulated, so that a simulated sample is completed. The procedure was used to obtain 10,000 simulated samples. The frequencies of the two types of duplications were then calculated for each sample and compared with the observed frequencies.
In a similar manner, the neutral size distributions for the two types of duplications were obtained, except that duplications were sampled with replacement from the observed duplications and every simulated sample contains the same number of each type of duplications as that in our observed data. Next, the two types of simulated duplications (i.e., duplications containing genes or no genes) were compared to see whether the difference in the observed duplications is expected under the neutral-duplication model.
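The frequency simulation under the neutral-duplication model can be sketched in Python as follows. This is a simplified illustration, not the authors' code: the data structures, the function names, and the rule that a placed duplication must fit entirely within the chromosome are assumptions. The final helper, which reports the proportion of simulated samples at least as extreme as the observed value, is one possible way to make the comparison; the text does not state how its P values were obtained.

```python
import random


def contains_complete_gene(start, size, genes):
    """True if any gene (g_start, g_end) lies entirely inside the
    half-open segment [start, start + size)."""
    end = start + size
    return any(start <= g_start and g_end <= end for g_start, g_end in genes)


def simulate_gene_fraction(duplications, genes_by_chrom, chrom_len, rng):
    """One simulated sample: every observed duplication is re-placed at a
    uniformly chosen site, keeping its observed size.

    `duplications` is a list of (size, chroms), where `chroms` holds one
    chromosome name (intrachromosomal) or two (interchromosomal; one of
    the two is chosen with equal probability).
    """
    with_genes = 0
    for size, chroms in duplications:
        chrom = rng.choice(chroms)
        # Assumption: the segment must fit entirely on the chromosome.
        start = rng.randrange(chrom_len[chrom] - size + 1)
        if contains_complete_gene(start, size, genes_by_chrom.get(chrom, [])):
            with_genes += 1
    return with_genes / len(duplications)


def neutral_fractions(duplications, genes_by_chrom, chrom_len,
                      n_samples=10_000, seed=0):
    """Fraction of gene-containing duplications in each simulated sample."""
    rng = random.Random(seed)
    return [simulate_gene_fraction(duplications, genes_by_chrom, chrom_len, rng)
            for _ in range(n_samples)]


def upper_tail_proportion(simulated, observed):
    """Share of simulated samples whose fraction is >= the observed one."""
    return sum(f >= observed for f in simulated) / len(simulated)


# Toy data (hypothetical): two chromosomes, three genes, three duplications.
genes = {"chr1": [(1_000, 3_000), (50_000, 52_000)], "chr2": [(10_000, 11_000)]}
lengths = {"chr1": 100_000, "chr2": 80_000}
dups = [(5_000, ("chr1",)), (20_000, ("chr1", "chr2")), (2_000, ("chr2",))]
sims = neutral_fractions(dups, genes, lengths, n_samples=1_000, seed=42)
print(upper_tail_proportion(sims, observed=2 / 3))
```

The size-distribution comparison would follow the same placement step, but with duplications drawn with replacement and with the number of each type held fixed, as described in the text.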
Results |
---|
Duplication-Enrichment Indexes
The extent of duplication shows great variation among chromosomal regions. For example, for chromosome 7, there is an approximately 23-fold enrichment at one subtelomeric region, a depletion of duplication at the other subtelomeric region, and an approximately sevenfold enrichment in the pericentromeric region (fig. 2). On average, pericentromeric and subtelomeric regions have a 2.9-fold and a 4.1-fold duplication enrichment, respectively (table 2), as compared with the 3.7-fold and 1.7-fold estimated by Bailey et al. (2001), who used an earlier assembly version (January 2001) with 21 chromosomes only.
The duplication-enrichment index shows a positive correlation with GC content for chromosomes 7 and Y and a negative correlation for chromosome 10 (table 3 [P < 0.00033]).
Several points are worth mentioning. First, using the number of duplications instead of the percent coverage in the region for the duplication-enrichment index does not change the results qualitatively. Second, the above significant correlations are not caused by chromosome Y, because excluding chromosome Y from the entire genome data does not affect any of the correlation analyses qualitatively (table 3). Third, excluding pericentromeric and subtelomeric regions also does not affect the results qualitatively.
Segmental Duplications with and Without Genes
Duplications containing genes and duplications containing no genes were compared for their frequency, size, and sequence similarity. First, the proportion of duplications containing complete genes is 3.4% for known genes and 10.7% for Ensembl genes. Both values are significantly higher than expected under the neutral-duplication model (for known genes: P value = 0.02; for Ensembl genes: P value < 2.2e-16 [see, e.g., fig. 3]).
Discussion |
---|
When one chromosome is examined at a time, often some factors are more important than others for some but not all chromosomes. Combining all chromosomes, we did find that regional duplication frequency is positively correlated with regional gene density, repeat density, recombination rates, and GC content. Nevertheless, the overall pattern emerging from our genome-wide analysis is that none of the above properties has a strong effect on the extent of segmental duplication, because our multiple-regression analysis shows that these factors account for only approximately 4% of the total variation in duplication frequency. Among these factors, gene density seems to be the most important in influencing duplication frequency because it alone accounts for 3.4% of the variation in duplication frequency.
Some repetitive elements, such as small ribonucleoprotein RNAs (srpRNAs), satellite DNAs, long-terminal repeats (LTRs), and, especially, Alu repeats have been found to be enriched in duplication borders (Bailey et al. 2001; Bailey, Liu, and Eichler 2003; Cheung et al. 2003). However, repeat density does not seem to have a strong influence on the duplication frequency of a region (table 3), suggesting that although segmental duplication may be facilitated by repetitive elements, how often a region is involved in duplication does not significantly depend on the density of repetitive elements in the region.
Repetitive DNAs, such as microsatellites and transposable elements, tend to accumulate in low-recombination-rate regions (Bartolome, Maside, and Charlesworth 2002). This has been thought to be caused by the possibility that insertion and expansion of these repetitive elements are slightly deleterious, and selection is not efficient in removing them in low-recombination-rate regions. However, although recent segmental duplications are one type of repetitive DNA, there is only a weak negative correlation between duplication frequency and local recombination rate (table 3). In an earlier study, Zhang and Gaut (2003) found that the frequency of tandemly arrayed genes is positively, rather than negatively, correlated with local recombination rate for three of the five chromosomes and has no significant correlation for the other two chromosomes in the Arabidopsis thaliana genome. Taken together, these results suggest that low-copy repeats may have different dynamics and distributions from high-copy repeats such as microsatellites and transposable elements.
Possible Adaptive Significance of Recent Segmental Duplications
What are the forces that maintain the recent segmental duplications in the human genome? Evidence to date suggests that some segmental duplications are maintained by selection. PMCHL1 and PMCHL2, which arose from a recent segmental duplication on chromosome 5, show different expression patterns (Courseaux and Nahon 2001). The duplicated DGCR6 genes, which arose from a segmental duplication in the past 35 Myr, have been selectively maintained in the genome (Edelmann et al. 2001). The morpheus gene family, produced by recent segmental duplications on chromosome 16, shows molecular signatures of positive selection (Johnson et al. 2001).
In this study, we constructed a neutral-duplication model to examine whether the relative frequencies of duplications containing genes and duplications containing no genes are simply a result of regional variation in gene lengths and gene densities. Based on the model, the fixation of any duplication in the population does not depend on where it occurs on the chromosome; that is, whether the duplication includes genes or not has no fitness effect on the organism. Therefore, if a duplication containing genes and a duplication containing no genes have the same probability of fixation, the simulated duplications should have similar relative frequencies for the two types of duplication as the observed relative frequencies. However, we found that the observed frequency of duplications containing genes is much higher than the simulated values, suggesting that many duplications containing genes were selectively advantageous and, thus, have been maintained by selection after duplication (fig. 3).
It is puzzling that the proportion of duplications containing genes increases, whereas that containing no genes decreases, as sequence similarity increases (fig. 5b). Here, we present several possible explanations. First, the observations suggest that the rate of segmental duplication has not been constant over time. It is possible that duplications containing no genes occurred more frequently in the past than in recent times, whereas the opposite trend is true for duplications containing genes. Second, duplications containing genes have, on average, been subject to stronger purifying selection than duplications containing no genes, so that their sequence similarity has been better maintained. Third, gene conversion might have contributed to some extent to the differences between the two distributions: if the rate of gene conversion increases with sequence similarity, duplications containing genes would have better chances of being homogenized than duplications containing no genes because sequence similarity in coding regions would tend to be better conserved than in noncoding regions. Whether any of these speculations is plausible remains to be examined in the future.
References |
---|
Bailey, J. A., Z. Gu, R. A. Clark, K. Reinert, R. V. Samonte, S. Schwartz, M. D. Adams, E. W. Myers, P. W. Li, and E. E. Eichler. 2002. Recent segmental duplications in the human genome. Science 297:1003–1007.
Bailey, J. A., G. Liu, and E. E. Eichler. 2003. An Alu transposition model for the origin and expansion of human segmental duplications. Am. J. Hum. Genet. 73:823–834.
Bailey, J. A., A. M. Yavor, H. F. Massa, B. J. Trask, and E. E. Eichler. 2001. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11:1005–1017.
Bartolome, C., X. Maside, and B. Charlesworth. 2002. On the abundance and distribution of transposable elements in the genome of Drosophila melanogaster. Mol. Biol. Evol. 19:926–937.
Cheung, J., X. Estivill, R. Khaja, J. R. MacDonald, K. Lau, L. C. Tsui, and S. W. Scherer. 2003. Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol. 4:R25.
Courseaux, A., and J.-L. Nahon. 2001. Birth of two chimeric genes in the hominidae lineage. Science 291:1293–1297.
Edelmann, L., P. Stankiewicz, E. Spiteri, R. K. Pandita, L. Shaffer, J. R. Lupski, and B. E. Morrow. 2001. Two functional copies of the DGCR6 gene are present on human chromosome 22q11 due to a duplication of an ancestral locus. Genome Res. 11:208–217.
Hattori, M., A. Fujiyama, T. D. Taylor et al. (34 co-authors). 2000. The DNA sequence of human chromosome 21. Nature 405:311–319.
Hillier, L. W., R. S. Fulton, L. A. Fulton et al. (107 co-authors). 2003. The DNA sequence of human chromosome 7. Nature 424:157–164.
Johnson, M. E., L. Viggiano, J. A. Bailey, M. Abdul-Rauf, G. Goodwin, M. Rocchi, and E. E. Eichler. 2001. Positive selection of a gene family during the emergence of humans and African apes. Nature 413:514–519.
Lander, E. S., L. M. Linton, B. Birren et al. (255 co-authors). 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.
Lupski, J. R. 1998. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet. 14:417–422.
Samonte, R. V., and E. E. Eichler. 2002. Segmental duplications and the evolution of the primate genome. Nat. Rev. Genet. 3:65–72.
Shih, A. C., and W. H. Li. 2003. GS-Aligner: a novel tool for aligning genomic sequences using bit-level operations. Mol. Biol. Evol. 20:1299–1309.
Stankiewicz, P., and J. R. Lupski. 2002. Genome architecture, rearrangements and genomic disorders. Trends Genet. 18:74–82.
Stankiewicz, P., S. S. Park, K. Inoue, and J. R. Lupski. 2001. The evolutionary chromosome translocation 4;19 in Gorilla gorilla is associated with microduplication of the chromosome fragment syntenic to sequences surrounding the human proximal CMT1A-REP. Genome Res. 11:1205–1210.
Zhang, L., and B. S. Gaut. 2003. Does recombination shape the distribution and evolution of tandemly arrayed genes (TAGs) in the Arabidopsis thaliana genome? Genome Res. 13:2533–2540.