* Department of Genetics and Center for Human Genetics, Case Western Reserve University School of Medicine and University Hospitals of Cleveland
Washington University School of Medicine Genome Sequencing Center, St. Louis
Department of Microbiology and Immunology, University of Oklahoma Health Sciences Center
Children's Hospital Oakland Research Institute, Oakland, California
|| Sezione di Genetica, DAPEG, University of Bari, Bari, Italy
¶ The Institute for Genomic Research, Rockville, Maryland
Correspondence: E-mail: eee{at}po.cwru.edu.
Abstract |
---|
Key Words: pericentromeric DNA; segmental duplications; genome architecture; nonhuman primates; genome evolution
Introduction |
---|
Pericentromeric Organization
In general, resolution of the organization and evolution of these regions has been hampered by unusual constellations of repetitive sequences when compared with other regions of the genome. Sequence analysis of Drosophila melanogaster pericentromeric regions indicated that they are mainly composed of simple satellite sequences, transposons, retroposons, and rRNA genes (Sun, Wahlstrom, and Karpen 1997; Adams et al. 2000). Similarly, Arabidopsis thaliana pericentromeric regions are largely composed of retroelements, transposons, microsatellites, and various classes of middle repetitive DNA (Copenhaver et al. 1999; The Arabidopsis Genome Initiative 2000). In addition to simple satellites and retroposons, directed analyses of human pericentromeric regions on chromosomes 2, 10, and 16 reveal a preponderance of partial gene duplications (Jackson et al. 1999; Loftus et al. 1999; Guy et al. 2000; Horvath et al. 2000; Horvath, Schwartz, and Eichler 2000). Although the occurrence of mobile genetic elements and duplicated sequence within pericentromeric regions is a common property shared by these distant species, the structure of the human genome appears to be unique in the proportion and extent of these blocks of duplications, which may be as large as a few Mb in size (Dunham et al. 1999; Jackson et al. 1999; Hattori et al. 2000; Bailey et al. 2001; IHGSC 2001; Bailey et al. 2002a, 2002b; Crosier et al. 2002).
Human Pericentromeric Duplications
The structure of these large mosaic blocks of duplication is complex. For nearly half of human chromosomes, an estimated zone of duplication extends from the satellite-repeat sequence to the unique euchromatic region (Bailey et al. 2001; IHGSC 2001). These regions are composed of a mosaic of duplicated genomic segments that originate from diverse areas of the genome. A large number of partial-gene and whole-gene duplications have been recently characterized in detail (Jackson et al. 1999; Horvath et al. 2001). These segmental duplications share conserved exon-intron structure and have been termed duplicons (Eichler 2001b). In most cases, the duplicons originate from an ancestral expressed locus, range in copy number from 2 to 15, and show an interchromosomal distribution restricted largely to pericentromeric regions. Comparative analyses of a few regions indicate that these transposed duplicated segments are found only in humans and closely related nonhuman primates (Arnold et al. 1995; Eichler et al. 1996, 1997; Regnier et al. 1997; Zimonjic et al. 1997; Orti et al. 1998; Horvath et al. 2000). With the exception of these few anecdotal studies focused on individual duplicated segments, a global synopsis of this property of genome evolution and chromosome structure has been lacking. The molecular basis for the duplicative transposition bias toward pericentromeric regions is unknown.
Pericentromeric-Specific Repeat Sequences
In addition to duplicated gene segments, a variety of primate-specific degenerate repeat sequences have been identified between the duplicons (Eichler et al. 1997; Eichler, Archidiacono, and Rocchi 1999; Guy et al. 2000; Horvath, Schwartz, and Eichler 2000). The fact that they demarcate the transition between unrelated pericentromeric genic duplication events and that, in at least one case, they existed before the evolutionary transfer of the duplicated segments has been taken as circumstantial evidence that these repeats may play a role in the duplication process (Eichler, Archidiacono, and Rocchi 1999). Unlike the genic duplications described above, these pericentromeric interspersed repeat sequences (PIRs) do not exhibit obvious exon-intron structure. They, therefore, do not appear to be derived from ancestral gene sequences that have been transposed from nonpericentromeric regions of the genome. Several types of pericentromeric repeat sequences have been described, including CAGGG, GGGCAAAAGCCG, and chAB4 repeats (Assum et al. 1991; Eichler et al. 1996; Wohr, Fink, and Assum 1996; Eichler et al. 1997; Eichler, Archidiacono, and Rocchi 1999; Horvath, Schwartz, and Eichler 2000). Unlike satellite sequences, these sequences are not composed of repetitive tandem arrays. In some cases, the underlying sequence structure of the interspersed repeats is reminiscent of degenerate subtelomeric repeat tracts (Flint et al. 1997; Riethman et al. 2001). Indeed, telomere-associated repeats have occasionally been reported in close proximity to these sequence elements (Eichler, Archidiacono, and Rocchi 1999). In addition, the pericentromeric interspersed repeats often exist at multiple locations within the same chromosome, separated by tens to hundreds of kb of intervening duplicated sequence.
Here, we characterize a novel pericentromeric interspersed repeat, termed PIR4, that is specific to the genomes of humans and apes. This element represents one of the most abundant recent segmental duplications within the human genome. In humans, this repeat occurs on more than half of all chromosomes, is found in association with other segmental duplications, and is restricted almost exclusively to pericentromeric regions. The purpose of this study was to take advantage of the multichromosomal and pericentromeric distribution of this interspersed repeat, using it as a marker to (1) recover additional sequence from these intractable regions of the genome, (2) map existing sequences generated as part of the Human Genome Project that were ambiguously placed, and (3) reconstruct the series of evolutionary events that occurred in the distribution of this repeat among primate chromosomes. Our analysis provides a global snapshot of the dynamic evolutionary history of these regions and the series of nonhomologous sequence exchanges that created the architecture of contemporary human chromosomes.
Materials and Methods |
---|
Of the 170 accessions containing PIR4, there were only 37 that were distinct finished copies and could be used for further analyses. These 37 GenBank accessions were analyzed using our previously described algorithm (Bailey et al. 2001), which is designed to capture large genomic alignments despite the presence of retroposon-induced large insertions and deletions. Here, the PIR4 reference (AC073318) was compared with each of the 37 finished clones after the high copy repeats identified by RepeatMasker (version 07/13/2002) were spliced out and then a pairwise comparison using gap Blast was generated (Altschul et al. 1990). For each alignment, repeats were subsequently reinserted, the end-points were heuristically trimmed, and optimal global alignments were generated using the program ALIGN (Myers and Miller 1988). Based on these alignments, we extracted a PIR4 sequence for each sequence accession based on the orientation and extent of the putative ancestral locus, AC073318, where overlapping sequences compared with the reference sequence were removed in favor of the higher scoring alignment within the clone's sequence. These extracted segments served as the basis for constructing an optimal global alignment for all PIR4 pairs of sequence. We limited our analysis to alignments that were at least 10 kb (a total of 25 GenBank accessions). We estimated the number of substitutions/site/year (substitution rate) by correcting the divergence for multiple substitutions using Kimura's two-parameter model (Kimura 1980).
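The correction for multiple substitutions can be illustrated with a minimal sketch of Kimura's two-parameter distance, K = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. The function name `k2p_distance` and the decision to skip gapped or ambiguous columns are assumptions made for illustration; this is not the pipeline used in the study.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance for two pre-aligned sequences.

    Hypothetical helper: columns containing gaps or ambiguous bases are
    skipped (an assumption, not a detail taken from the paper).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For closely related copies such as those analyzed here, K is only slightly larger than the uncorrected proportion of differing sites.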
To study the characteristics of other duplicons flanking the PIR4 sequences, we performed a second all-by-all BLASTN comparison of the 25 accessions that included the entire GenBank accession. We defined flanking alignments as alignments within 7 kb (the full length of L1 element insertion) of PIR4 sequence. Alignment statistics were only calculated for the non-PIR4 alignment portions. For each clone comparison, we selected the largest global alignment (minus PIR4) that was at least 10 kb. To compare the divergence of PIR4 with the largest flanking alignment, we calculated the difference (K_PIR4 - K_flanking). In this case, a positive (K_PIR4 - K_flanking) value indicates that PIR4 is more divergent, whereas a negative (K_PIR4 - K_flanking) value reflects a more divergent flanking sequence. A value at or near zero indicates that both PIR4 and flanking duplicons were equally divergent and likely duplicated at or near the same timepoint in evolution.
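The sign convention for the divergence difference can be sketched as follows. The function name `compare_divergence` and the `tolerance` argument are hypothetical; the default of 0.005 is an assumption borrowed from the threshold quoted in the Results, since "at or near zero" is not defined numerically here.

```python
def compare_divergence(k_pir4: float, k_flanking: float, tolerance: float = 0.005):
    """Return (difference, label) for one PIR4 / flanking-duplicon pair.

    `tolerance` is an illustrative cutoff for "at or near zero".
    """
    diff = k_pir4 - k_flanking
    if abs(diff) <= tolerance:
        label = "similar divergence"
    elif diff > 0:
        label = "PIR4 more divergent"
    else:
        label = "flanking more divergent"
    return diff, label
```

A pair labeled "similar divergence" is the case interpreted above as PIR4 and its flank having duplicated at about the same time.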
Identification of duplicons within accessions (fig. 1a, b, and c) was performed using a repeat masked accession against the EST division of GenBank. All ESTs exhibiting exon-intron structure to the given accession were searched against the Unigene database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene). A representative EST (for each cluster with more than one EST hit to the given accession) and all ESTs without Unigene hits were subsequently queried against the nr and htgs divisions of GenBank. The accession with a 100% match to the query EST was considered the ancestral locus of the duplication, which then was used in comparisons with the original PIR4 accession to determine the extent of overlap and percent identity of this segment as shown in figure 1.
Results |
---|
PIR4 Copy Number Estimates
A variety of methods (Southern analysis, FISH, library depth-of-coverage) were used to estimate copy number in the genome. Initially, a "unique" 300-bp PCR (32N49) amplicon was designed specific to the repeat and screened against a 25.5-fold redundant human BAC library (RPCI-11, segments 1, 2, 4, and 5). We obtained 768 strongly hybridizing positives, suggesting there were at least 30 copies of this element in the human genome (table 1). An independent analysis of depth of coverage using whole-genome shotgun sequence data (Bailey et al. 2002a) showed a 40-fold (1972/47.2) excess of sequence read depth when compared with unique regions of the genome (http://humanparalogy.cwru.edu/). Of all segmental duplications characterized within the human genome, only DNA sequence corresponding to rDNA duplications from acrocentric chromosomes surpassed PIR4 in both depth of coverage and degree of sequence identity. PCR analysis of a monochromosomal hybrid DNA panel confirmed copies of PIR4 on chromosomes 1, 2, 7, 9, 10, 13, 14, 15, 16, 17, 21, 22, and Y (fig. 5a). Subsequent sequence analysis of the amplicons, in many cases, revealed the presence of "heterozygous" sequence signatures (fig. 5b). Since each DNA sample was derived from a monochromosomal somatic cell hybrid source, it is likely that multiple copies of the element are present on many chromosomes. To improve the copy number estimate and to recover genomic clones specific for each chromosome, cosmid libraries from flow-sorted human chromosomes 1, 2, 7, 9, 13, 14, 15, 16, 18, 21, and 22 were hybridized with amplicon 32N49 (table 1). Based on the depth of coverage for each library, the results suggested that most human chromosomes contained multiple copies of the PIR4 repeat sequence (mean = four copies per chromosome for chromosomes with PIR4) with chromosomes 1 and 9 being particularly enriched (estimated seven to nine copies).
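The arithmetic behind the two genome-wide estimates can be reproduced with a short sketch. It assumes that each genomic copy is sampled, on average, as many times as the library redundancy; the function names are hypothetical.

```python
def copies_from_library_hits(positive_clones: int, library_coverage: float) -> float:
    """Estimated copy number = hybridizing clones / fold redundancy of the library."""
    return positive_clones / library_coverage


def fold_excess(repeat_read_depth: float, unique_read_depth: float) -> float:
    """Read-depth excess of a duplicated segment relative to unique sequence."""
    return repeat_read_depth / unique_read_depth


# Values quoted in the text above.
bac_estimate = copies_from_library_hits(768, 25.5)  # about 30 copies
wgs_excess = fold_excess(1972, 47.2)                # roughly 40-fold
```

Both numbers are lower bounds in practice, because weakly hybridizing or divergent copies are not counted.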
Although FISH data confirmed a pericentromeric map location for the vast majority of PIR4-containing BAC clones, it was impossible using this approach to unambiguously assign a specific BAC clone to its chromosome of origin. The variable copy number of the repeat within specific chromosomes, furthermore, made assignment based on signal intensity unreliable (table 3 and fig. 3). As a secondary means of resolving the chromosomal location of PIR4-containing clones, we implemented a sequence-based strategy (termed paralogous sequence tagging) (Horvath, Schwartz, and Eichler 2000) that depends on the identification and characterization of paralogous sequence variants (PSVs) specific to chromosomes with PIR4. Since the BAC libraries were constructed from two chromosomal haplotypes (maternal and paternal), sequence variants between two BACs may be due to either allelism (one maternal variant and one paternal variant at the same locus) or paralogy (two variants at different loci as the result of a duplication event). In contrast to BAC libraries, cosmid libraries constructed early in the Human Genome Sequencing Project at Lawrence Livermore and Los Alamos National Laboratories were constructed from a single flow-sorted chromosome (from a somatic cell line containing a single human chromosome) and represent in theory a single haplotype, thereby excluding allelism as a possible source for the variation (Trask et al. 1991). Thus, sequence variants between two cosmid clones from the same library identify paralogous, not allelic, copies. Sequence identity matches between BAC and cosmid sequence signatures further allow the assignment of large-insert BAC clones to specific pericentromeric regions. We reasoned that clones identified from each of the cosmid libraries could, therefore, be informative as a mapping resource for these intractable, highly duplicated areas of the genome.
To implement this approach, we selected 205 BACs (RPCI-11) and 176 cosmids for sequence analyses. Each clone was PCR amplified using oligonucleotides specific to the PIR4 repeat and all PCR products were directly sequenced to obtain a catalog of sequence signatures that distinguish various contigs of clones. A total of 67 distinct cosmid and BAC-derived sequence signatures were identified. A BAC sequence signature was considered distinct if the number of sequence differences was greater (two differences/252 bp) than that expected for allelic variation. Only high-quality sequence differences present on both forward and reverse sequencing of the PCR amplicon were considered in this analysis. Twenty-one of these cosmid signatures matched a BAC signature, allowing for chromosomal assignment of the BAC and providing an anchor for future sequence assembly. However, 20 BAC signatures were left unassigned to any chromosome, and 27 cosmid signatures had no evidence of BAC sequence support, suggesting extensive levels of allelic variation for these pericentromeric loci. Using the collection of experimentally derived sequence signatures, sequence similarity searches were performed against both the nonredundant (nr) and high throughput genomic sequences (htgs) divisions of GenBank. Forty-nine of the 67 variants matched an accession in the database (zero or one variant in 252 bp), 19 of which could be unambiguously assigned to a chromosome. In contrast, 20 BAC/cosmid signatures were not represented within GenBank, suggesting considerable underrepresentation of this segment within the current genome assembly, and additional sequence tag information (total = 945 bp) was obtained for future sequence comparisons. 
Further, analysis of the most recent assembly of the human genome (NCBI, build 31, November 2002) using these PIR4 sequences revealed that only 20 of the estimated 49 database copies of PIR4 were currently represented within this assembly, confirming considerable underrepresentation of these sequences within human genome assemblies. In total, our analysis allowed the unambiguous chromosomal assignment of 19 distinct PIR4 loci. Sixteen RPCI-11 BAC clones (AC127362, AC127380, AC127381, AC127384, AC127387, AC127389, AC127391, AC127701, AC128674, AC128676, AC128677, AC129338, AC129778, AC129779, AC129782, and AC092854) and three chromosome 22 cosmids (AC093314, AC103582, and AC093091) were placed in the sequence queue. Notwithstanding, half of the clones characterized in this study could not be assigned to a specific chromosome. This may be due to extreme levels of allelic variation, structural heteromorphism, or clone gaps within existing libraries.
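The signature comparison used for paralogous sequence tagging can be sketched as a simple difference count over the 252-bp amplicon. The cutoff below (zero or one difference is treated as a match, two or more as a distinct signature) is our reading of the thresholds stated above and should be taken as an assumption; the function names are hypothetical.

```python
def count_differences(sig_a: str, sig_b: str) -> int:
    """Count mismatched positions between two aligned signatures.

    Positions with gaps or ambiguous bases are ignored (an assumption).
    """
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must be aligned to the same length")
    return sum(
        1
        for a, b in zip(sig_a.upper(), sig_b.upper())
        if a in "ACGT" and b in "ACGT" and a != b
    )


def classify_pair(sig_a: str, sig_b: str, max_allelic_differences: int = 1) -> str:
    """Label a pair of signatures as 'match' or 'distinct'."""
    n = count_differences(sig_a, sig_b)
    return "match" if n <= max_allelic_differences else "distinct"
```

A BAC signature that matches a cosmid signature from a single flow-sorted chromosome library would inherit that chromosome assignment.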
Analysis of PIR4 Flanking Sequences
Since FISH analyses indicated that PIR4 occurred exclusively in pericentromeric regions, we tested more directly its association with satellite DNA (classical centromeric DNA markers). A subset (306) of PIR4-containing RPCI-11 BACs was selected for end-sequence analysis. These sequences were then searched against GenBank, revealing that at least one end sequence placed within centromeric satellite DNA for 83 of these BACs (27.1%), a significantly higher proportion than expected based on random sampling of human BAC end sequences (<1% satellite repeats). This association with satellite DNA was further supported by analysis of existing Human Genome Project data. Twenty of the 37 (54%) distinct BAC clones, for which finished sequence was available, contained at least 1 kb (and most often more than 10 kb) of centromerically associated satellite sequences, including HSATII, CER, ALR and GAATG/CATTC (RepeatMasker designations of centromeric DNA). These data are consistent with PIR4 sequences lying within the euchromatin/heterochromatin transition zone in close proximity to human centromeres.
Similarly, the segmental duplication content within the vicinity of PIR4 loci was assessed by comparing the sequences of the 37 large-insert PIR4 BAC clones that had been completely sequenced and by comparing the flanking genomic sequences with the segmental duplication database of the human genome (Bailey et al. 2002a). With the exception of alpha satellite containing clones, only AC073318 contained PIR4 as its sole duplication element (fig. 1a). The organization of most clones showed complex patterns of segmental duplications (both interchromosomally and intrachromosomally) with the PIR4 sequence most often associated with a larger block of duplicated sequence (fig. 1b and c). This organization of duplications embedded within duplications is consistent with the previously proposed two-step model for the origin of pericentromeric duplications (Eichler et al. 1997; Horvath, Schwartz, and Eichler 2000). Based on this analysis, it is therefore not surprising that these clones had ambiguous chromosomal assignments in the public build 31. Further, the lack of unique sequence in the vicinity of PIR4 and the high degree of sequence identity among the duplicates indicate that most of the available PIR4-containing sequences within GenBank could not be mapped using traditional methods.
PIR4 as a Marker of Pericentromeric Duplications
Based on the multichromosomal distribution and the pericentromeric specificity of PIR4, we reasoned that this interspersed repeat might serve as an informative phylogenetic marker to reconstruct the series of evolutionary events that have restructured these regions of the human genome. Moreover, since most of the PIR4 elements were associated with larger blocks of segmental duplication, the PIR4 elements might also provide insight into these larger secondary duplication events. This assumes that PIR4 sequences have not been preferential targets of gene conversion and therefore represent "neutral" markers of pericentromeric evolution. To test this assumption, the pairwise genetic distance between each finished copy of PIR4 within GenBank was calculated (fig. 6). Here, 10 kb or more of aligned sequence was compared with the 25 copies of PIR4 for a total of 234 comparisons. Next, we examined the largest flanking sequence excluding PIR4 and calculated the genetic distance between these duplicated flanks. We then compared the genetic distance of the PIR4 element to the genetic distance of the flanking duplicated material as the difference of these two estimates (see fig. 2 in Supplementary Material online). A difference of zero (identity) between K values would suggest that both PIR4 and flanking sequences had diverged equally and arose at approximately the same time in evolution. A negative K value would suggest that the PIR4 copies were more similar than the flanking DNA and had therefore undergone conversion events. Because we assessed only flanking (within 7 kb of PIR4) and not nearby duplications (>7 kb away), our sample size was small (18) and we likely excluded some duplications that could have been separated from PIR4 due to secondary rearrangement events. 
However, since nearly half (7/18) of the PIR4 elements showed genetic distances consistent with those of the flanking duplications (a difference of 0.005 changes/bp, or less than 1% difference), many of the PIR4 elements act as a marker of pericentromeric DNA.
Since this molecular evidence points to multiple copies of PIR4 in chimpanzee and orangutan, two sets of comparative FISH experiments were undertaken to determine the copy number and distribution of PIR4 sequences on these primate chromosomes. In the first study, a human chromosome 22 cosmid probe, N20B5, which contained a single copy of the PIR4 sequence, was hybridized to chromosome spreads of chimpanzee and orangutan metaphases (fig. 4a). Multiple pericentromeric signals were observed on chimpanzee chromosomes (I, IIp, VII, X, and XVI with respect to the human phylogenetic group designations). In contrast to human and chimpanzee metaphases, a single robust signal was observed in orangutan metaphases, corresponding to phylogenetic group VII. Since the absence of signal on other orangutan chromosomes might be due to sequence divergence, a reciprocal set of experiments was conducted using orangutan BACs as probes on both human and orangutan chromosomes. A representative orangutan BAC from each of the four sequence classes was assessed. Two of the orangutan BACs (CHORI-253 220O24 and CHORI-253 1J18) yielded identical results, hybridizing to a single locus on chromosome VII in both human and orangutan (BAC 1J18 [fig. 4b]). In contrast, orangutan BAC 346B14 hybridized to a single locus in orangutan (chromosome VII [fig. 4c]) but multiple chromosomes in human (1, 2, 7, 14, 16, 17, 21, and 22), whereas 321D4 hybridized to two discrete but nearby loci on chromosome VII in orangutan and multiple loci in humans (2, 7, 14, and 16; data not shown). BAC-end sequencing and subsequent similarity searches of the orangutan PIR4-containing BAC clones revealed that they mapped to two different positions within the human chromosome 7 reference sequence.
Phylogenetic Analyses of PIR4 Sequences
As the final step in our analysis, a phylogenetic tree was generated using MEGA2 to compare 67 human, two chimpanzee, four orangutan, and four gibbon loci (fig. 2) (Kumar et al. 2001). At least two major clades could be distinguished. One clade (termed A) consists almost entirely of human sequences from many different chromosomes. This clade is further stratified into relatively chromosome-specific subgroups of PIR4 (see chromosomes 1, 2, 7, and 9) as well as an acrocentric chromosome subclade (13, 14, and 21). In contrast, clade B consists of human, chimpanzee, orangutan, and gibbon sequences as well as the putative ancestral human sequence on chromosome 7 (AC073318). With the exception of chromosome 7, very little evidence of chromosome-specific amplification is observed within this clade. It should be noted that chromosomes 2, 7, 13, 16, and 22 have representative sequences in both clade A and clade B. To increase confidence of the two separate clades on the 1-kb tree, we generated a 3-kb tree from a subset of the accessions (fig. 1 in Supplementary Material online). This increased bootstrap support from 80% to 100% for the existence of two clades. In total, these data suggest a rapid dispersal of PIR4 sequences over a narrow window of primate evolution followed by more recent chromosome-specific duplication events. To examine this in more detail, another phylogenetic tree was constructed from a shorter multiple sequence alignment (252 bp), incorporating an additional 18 distinct chimpanzee BAC sequences. These chimpanzee sequences distributed throughout clade A and clade B, showing, in general, closer phylogenetic relationship to other human loci rather than other chimpanzee sequences (data not shown). Thus, it is likely that PIR4 sequences populated the hominoid genome before the divergence of the two lineages.
Discussion |
---|
PIR4 Ancestral Sequence
Several lines of evidence point to chromosome 7 as the ancestral origin of PIR4. First, it is the only chromosome commonly hybridizing to human, chimpanzee, and orangutan metaphase spreads (fig. 4a-c). Second, it is the only locus for which a clear ortholog can be identified within each great ape species examined (fig. 2). This is supported both by phylogeny and by BAC end-sequence analysis. Third, it is one of the only PIR4-containing loci (GenBank AC073318) devoid of other segmental duplications. Characterization of numerous other segmental duplications (Eichler et al. 1996, 1997; Regnier et al. 1997; Zimonjic et al. 1997; Horvath et al. 2001; Crosier et al. 2002) suggests that the progenitor loci most often occur outside of the pericentromeric duplication zone, surrounded by unique sequence. Subsequent duplicative transposition events become associated with other pericentromeric duplications. Finally, size estimates of PIR4 from AC073318 indicate that it represents the largest and most complete copy (49 kb). Other copies of PIR4 have become truncated with respect to this locus, perhaps as a result of deletion of secondary progenitors before subsequent rounds of duplication (fig. 7). For example, AC093787 from chromosome 2 contains 44.2 kb of PIR4, AC025223 from chromosome 2 has 17.3 kb, and AC073210 and AC104057 from chromosome 7 contain 16.2 kb and 26.3 kb, respectively. While the extent of PIR4 rearrangement with respect to the putative ancestral locus (AC073318) is not always a good indicator of degree of nucleotide sequence identity, it is noteworthy that similar deletion patterns share monophyletic origins consistent with the placement within the phylogenetic tree (see AC128674, AC127384, AC0024500, and AC006359 for an example [fig. 7]). These data not only validate the phylogeny but also provide insight into different trajectories of evolutionary duplication, where irreversible deletion/rearrangement events tagged a progenitor copy and its descendants.
Finally, the phylogenetic data are consistent with a strict division of PIR4 sequences into two clades: an ancestral clade (clade B), which contains the putative human donor locus as well as representatives from each great ape species, and a derivative clade (clade A), which contains only chimpanzee and human copies of this element.
Using the average (1.71 x 10^-9 mutations/site/year) of three rate estimates (1.5 x 10^-9, 1.75 x 10^-9, and 1.89 x 10^-9), the most divergent human sequences in figure 2 (AC127701B and AC073318) have a K value of 0.087, suggesting approximately 25 Myr of change between them. Interestingly, the phylogenetic tree in figure 2 has some sequences clustered together (e.g., 7cos43b6 and 7cos33f1), further suggesting recent intrachromosomal duplication or conversion events. This suggests that whereas some PIR4 copies have existed for over 20 Myr of evolution, others have arisen recently and the process of PIR4 duplication may be ongoing. However, the mean genetic distance between all human sequences is 0.047 (Kimura's estimate), suggesting that many duplications occurred 14 MYA (just before the divergence of humans from our great ape ancestors) (fig. 6). Surprisingly, sequences from chromosomes 2, 7, 13, 16, and 22 are found in both clades of the tree, suggesting very different evolutionary histories exist on the same chromosome.
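The dating arithmetic can be checked with a minimal sketch. It assumes the usual relationship T = K / (2r), in which the distance K accumulates along both lineages since the duplication; the factor of two is inferred from the numbers reported above rather than stated explicitly, and the function names are hypothetical.

```python
def mean_rate(rates):
    """Arithmetic mean of several substitution-rate estimates (per site per year)."""
    return sum(rates) / len(rates)


def divergence_time_myr(k: float, rate_per_site_per_year: float) -> float:
    """Time since duplication in Myr, assuming T = K / (2 * r)."""
    return k / (2.0 * rate_per_site_per_year) / 1e6


rate = mean_rate([1.5e-9, 1.75e-9, 1.89e-9])   # about 1.71e-9
oldest = divergence_time_myr(0.087, 1.71e-9)    # about 25 Myr
typical = divergence_time_myr(0.047, 1.71e-9)   # about 14 Myr
```

Under these assumptions, the most divergent pair and the mean pairwise distance correspond to roughly 25 and 14 Myr, respectively, in line with the estimates given in the text.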
Consequences of PIR4 Duplications
Of the pericentromeric regions identified in this study, most harbor multiple copies of PIR4 (table 1). Based on available data within GenBank, it appears that intrachromosomal copies of PIR4 are separated by at least 100 to 150 kb, as BAC clones rarely contain two distinct elements. Based on their high identity and close proximity within pericentromeric regions, PIR4 elements have the potential to undergo gene conversion. This is supported in part by our analysis of cosmid and BAC PSV signatures that we were sometimes unable to match to one another, suggesting that PIR4 elements have rapidly diverged between individuals or have been effectively deleted within the population (see table 3 in Supplementary Material online). In some cases, such as chromosomes 9p/9q12 and 2p/2q21, individual copies of PIR4 may be separated by multiple Mb as evidenced by distinct metaphase FISH signals. This organization is presumably due to recent evolutionary centromeric rearrangements that have occurred within these two specific chromosomes (Baldini et al. 1993). The organization of these intrachromosomal copies of PIR4 is reminiscent of low-copy repeat (LCR) sequences that have been implicated in chromosomal instability associated with more than two dozen genomic disorders. It is possible that intrachromosomal PIR4 sequences separated by hundreds of kb could similarly facilitate nonhomologous recombination events, leading to secondary deletions, duplications, and inversions (Bailey et al. 2002a; Stankiewicz and Lupski 2002). Such dynamic mutational events, if they exist, might account for the considerable heteromorphism observed for these regions of the genome (Buiting et al. 1992; Barber et al. 1998, 1999).
Although the clinical and evolutionary significance of such germline/somatic instability is unknown, it is noteworthy that many of the same pericentromeric regions containing PIR4 duplications (1, 2, 8, 9, 14, 15, 16, 17, 18, 21, and 22) are regions associated with common breakpoints in solid tumor cell lines, suggesting that the presence of PIR4 may be associated with somatic instability (Padilla-Nash et al. 2001). Finally, the unusual architecture of PIR4 repeats on chromosomes 2 and 9 could help explain the high frequency of large-scale inversions. Chromosome 9 inversion events are the most common karyotype variation seen in humans, and chromosome 2 inversion events are the second most commonly diagnosed events (Kaiser 1984). Although PIR4 has not yet been directly implicated in these common rearrangements, its existence in many regions of instability necessitates a more thorough investigation of the genomic architecture.
Based on BAC end-sequencing data as well as large-scale sequencing of PIR4-containing clones, we estimate that approximately 25% of PIR4 copies abut large tracts (>10 kb) of satellite repeat sequences (alpha, HSATII, etc). Such repetitive sequences have been postulated to play a pivotal role in the recent nonhomologous exchanges that have dynamically shaped human pericentromeric regions (Mashkova et al. 1998; Guy et al. 2000; Horvath, Schwartz, and Eichler 2000), as they often demarcate the boundaries of large-scale interchromosomal duplications. The proximity of PIR4 sequences to blocks of satellite may have contributed to their proliferation within the human genome. Of the chromosomes known to contain pericentromeric duplications (Bailey et al. 2001; Cheung et al. 2001), detailed pericentromeric analyses have only been conducted for chromosomes 2, 10, and 16 and the completely sequenced chromosomes 14, 20, 21, and 22 (Dunham et al. 1999; Jackson et al. 1999; Hattori et al. 2000; Horvath et al. 2000; Horvath, Schwartz, and Eichler 2000; Deloukas et al. 2001; Heilig et al. 2003). Our analysis predicts that many duplicon-rich pericentromeric regions, such as chromosomes 1, 5, 7, 9, 11, 12, 13, 15, 17, 18, and Y, still remain uncharacterized with respect to the full extent of their duplicated architecture. Interestingly, even among chromosomes that have been deemed completed (21, 22, and 14), our analysis has identified additional clones that have not yet been sequenced. Presumably, these clones map centromerically to the most proximal sequence within the sequence assembly.
Using PIR4 to Fill Genome Gaps
Using PIR4 as a marker of pericentromeric DNA, we have applied paralogous-sequence tagging to begin mapping approximately 40% of these relatively intractable regions of the genome. In addition, our analysis recovered additional candidate clones for targeted sequencing. As part of a collaboration with the Washington University School of Medicine Genome Sequencing Center, we have submitted an additional 15 RPCI-11 BAC clones whose sequence signature did not match an accession within the NCBI database (at least three variants over 950 bp of sequence analyzed) (AC127362, AC127380, AC127381, AC127384, AC127387, AC127389, AC127391, AC127701, AC128674, AC128676, AC128677, AC129338, AC129778, AC129779, and AC129782). In collaboration with Oklahoma's Advanced Center for Genome Technology, we have sequenced one RPCI-11 BAC clone (AC092854) as well as cosmid clones from chromosome 22 (AC093314, AC103582, and AC093091). These clones have effectively added over 2 Mb of human pericentromeric sequence to GenBank, although their integration into the final human genome assembly is still ongoing. In cases where chromosome-assigned pericentromeric clones have been dropped during the assembly process, we are working with the sequencing community to ensure that such clones are reincorporated into the minimal tiling path of the final human genome sequence. Although it is unlikely that complete closure of these regions will be achieved by the finish target date (2003), these sequences should provide valuable anchor points from which to seed future mapping, sequencing, and assembly.
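The novelty criterion mentioned above (at least three variants over the 950 bp analyzed) can be illustrated with a short sketch. This is not the authors' pipeline: the function names, the assumption that sequences are already aligned to equal length, and the choice to count every differing column (including gap characters) as one variant are all hypothetical.

```python
# Illustrative sketch only; names and conventions here are assumptions,
# not the method used in the study.

def count_variants(query: str, reference: str) -> int:
    """Count differing columns between two pre-aligned, equal-length sequences.

    Assumption: any differing column, including a gap character, counts as
    one variant.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for q, r in zip(query.upper(), reference.upper()) if q != r)


def matches_no_accession(query: str, accessions: dict, min_variants: int = 3) -> bool:
    """Return True if the query differs from every accession by >= min_variants."""
    return all(
        count_variants(query, ref) >= min_variants
        for ref in accessions.values()
    )


if __name__ == "__main__":
    # Toy 20-bp sequences stand in for the 950-bp signature.
    reference = "ACGTACGTACGTACGTACGT"
    clone = "TCGTACCTACGTACGAACGT"  # differs at three positions
    print(count_variants(clone, reference))
    print(matches_no_accession(clone, {"hypothetical_accession": reference}))
```

Under these assumptions, a clone would be flagged as a candidate for sequencing only if it fails to match every known accession at the chosen threshold.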
Is the additional effort within these regions warranted? Although biological and evolutionary arguments may be easily mustered, the primary motivation of the Human Genome Project has been to identify all genes within the context of its genomic sequence (Collins et al. 1998). Many pericentromeric regions have been recalcitrant to closure due to their unusual duplication architecture. Pericentromeric genes embedded within these highly duplicated regions have been difficult to identify because of a lack of available sequence, difficulties in assembly of underlying genomic DNA, and/or ambiguities of paralogous gene annotation. Furthermore, pericentromeric regions have been operationally classified as heterochromatic DNA, since they are located in the vicinity of centromeres. As such, they are considered gene-poor genomic environments. Although heterochromatin is typically devoid of transcription, presumably due to its compact nature (Dillon and Festenstein 2002; Donze and Kamakaka 2002), several recent studies have challenged the notion that DNA sequence in the vicinity of heterochromatic DNA is transcriptionally silent. For example, a mammalian artificial chromosome study indicated that a gene placed in close proximity to and between centromeric and telomeric satellites can still be readily expressed (Bayne et al. 1994). Within Drosophila, essential genes such as the MAP-kinase were recovered embedded within satellite sequences (Adams et al. 2000). Similarly, recent articles by Crosier et al. (2002) and Bailey et al. (2002b) provide strong evidence of human transcripts from pericentromeric regions on chromosomes 2 and 22. Our own analysis of 89 GenBank accessions containing PIR4 (>5 kb) reveals that 31 of these genomic sequences contain at least one transcript (exon-intron structure spanning at least two exons with >99% identity to an EST). Seven of these 31 accessions also contain tracts of satellite sequence (>3 kb).
These transcripts map to 19 different Unigene clusters that have been assigned to chromosomes 2, 7, 9, 10, 16, and 22. Although these data do not prove the existence of pericentromerically located genes associated with PIR4 in humans, they do suggest that these genomic regions have transcriptional potential. This underscores the importance of complete sequencing and assembly of the human genome up to the higher order alpha satellite arrays to provide a comprehensive transcript map and, ultimately, a gene map of the human genome.
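As a rough illustration of the transcript criterion used in this analysis (an exon-intron structure over at least two exons with >99% identity to an EST), the following sketch filters a list of EST alignments and collects the accessions that pass. The `EstHit` record, its field names, and the example identifiers are hypothetical and do not correspond to any real alignment output format or to the data analyzed here.

```python
# Illustrative sketch only; the record layout and example data are invented.
from dataclasses import dataclass


@dataclass(frozen=True)
class EstHit:
    accession: str           # genomic accession carrying the hit
    est_id: str              # identifier of the aligned EST
    exon_count: int          # number of exons in the spliced alignment
    percent_identity: float  # identity of the EST to the genomic sequence


def supports_transcript(hit: EstHit, min_exons: int = 2,
                        min_identity: float = 99.0) -> bool:
    """Apply the criterion: at least min_exons exons and identity > min_identity."""
    return hit.exon_count >= min_exons and hit.percent_identity > min_identity


def accessions_with_transcripts(hits: list) -> set:
    """Return the accessions with at least one EST hit passing the criterion."""
    return {hit.accession for hit in hits if supports_transcript(hit)}


if __name__ == "__main__":
    hits = [
        EstHit("ACC_A", "est1", 3, 99.6),  # passes
        EstHit("ACC_B", "est2", 1, 99.9),  # unspliced, fails exon count
        EstHit("ACC_C", "est3", 2, 98.7),  # fails identity
    ]
    print(sorted(accessions_with_transcripts(hits)))
```

With the counts reported in the text, 31 of 89 accessions corresponds to about 35%, and 7 of 31 to about 23%.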
Literature Cited
Adams, M. D., S. E. Celniker, and R. A. Holt, et al. (194 co-authors). 2000. The genome sequence of Drosophila melanogaster. Science 287:2185-2195.
Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.
Arnold, N., J. Wienberg, K. Emert, and H. Zachau. 1995. Comparative mapping of DNA probes derived from the Vk immunoglobulin gene regions on human and great ape chromosomes by fluorescence in situ hybridization. Genomics 26:147-156.
Assum, G., T. Fink, C. Klett, B. Lengl, M. Schanbacher, S. Uhl, and G. Wohr. 1991. A new multisequence family in human. Genomics 11:34-41.
Bailey, J. A., Z. Gu, R. A. Clark, K. Reinert, R. V. Samonte, S. Schwartz, M. D. Adams, E. W. Myers, P. W. Li, and E. E. Eichler. 2002a. Recent segmental duplications in the human genome. Science 297:1003-1007.
Bailey, J. A., A. M. Yavor, H. F. Massa, B. J. Trask, and E. E. Eichler. 2001. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11:1005-1017.
Bailey, J. A., A. M. Yavor, L. Viggiano, D. Misceo, J. E. Horvath, N. Archidiacono, S. Schwartz, M. Rocchi, and E. E. Eichler. 2002b. Human-specific duplication and mosaic transcripts: the recent paralogous structure of chromosome 22. Am. J. Hum. Genet. 70:83-100.
Baldini, A., T. Ried, V. Shridhar, K. Ogura, L. D'Aiuto, M. Rocchi, and D. C. Ward. 1993. An alphoid DNA sequence conserved in all human and great ape chromosomes: evidence for ancient centromeric sequences at human chromosomal regions 2q21 and 9q13. Hum. Genet. 90:577-583.
Barber, J. C., I. E. Cross, F. Douglas, J. C. Nicholson, K. J. Moore, and C. E. Browne. 1998. Neurofibromatosis pseudogene amplification underlies euchromatic cytogenetic duplications and triplications of proximal 15q. Hum. Genet. 103:600-607.
Barber, J. C., C. J. Reed, S. P. Dahoun, and C. A. Joyce. 1999. Amplification of a pseudogene cassette underlies euchromatic variation of 16p at the cytogenetic level. Hum. Genet. 104:211-218.
Bayne, R. A., D. Broccoli, M. H. Taggart, E. J. Thomson, C. J. Farr, and H. J. Cooke. 1994. Sandwiching of a gene within 12 kb of a functional telomere and alpha satellite does not result in silencing. Hum. Mol. Genet. 3:539-546.
Buiting, K., V. Greger, B. H. Brownstein, R. M. Mohr, I. Voiculescu, A. Winterpacht, B. Zabel, and B. Horsthemke. 1992. A putative gene family in 15q11-13 and 16p11.2: possible implications for Prader-Willi and Angelman syndromes. Proc. Natl. Acad. Sci. USA 89:5457-5461.
Cheung, V. G., N. Nowak, and W. Jang, et al. (65 co-authors). 2001. Integration of cytogenetic landmarks into the draft sequence of the human genome. The BAC Resource Consortium. Nature 409:953-958.
Collins, F. S., A. Patrinos, E. Jordan, A. Chakravarti, R. Gesteland, and L. Walters. 1998. New goals for the U.S. Human Genome Project: 1998-2003. Science 282:682-689.
Copenhaver, G. P., K. Nickel, and T. Kuromori, et al. (14 co-authors). 1999. Genetic definition and sequence analysis of Arabidopsis centromeres. Science 286:2468-2474.
Crosier, M., L. Viggiano, and J. Guy, et al. (11 co-authors). 2002. Human paralogs of KIAA0187 were created through independent pericentromeric-directed and chromosome-specific duplication mechanisms. Genome Res. 12:67-80.
Deloukas, P., L. H. Matthews, and J. Ashurst, et al. (127 co-authors). 2001. The DNA sequence and comparative analysis of human chromosome 20. Nature 414:865-871.
Dillon, N., and R. Festenstein. 2002. Unravelling heterochromatin: competition between positive and negative factors regulates accessibility. Trends Genet. 18:252-258.
Donze, D., and R. T. Kamakaka. 2002. Braking the silence: how heterochromatic gene repression is stopped in its tracks. Bioessays 24:344-349.
Dunham, I., N. Shimizu, and B. A. Roe, et al. (25 co-authors). 1999. The DNA sequence of human chromosome 22. Nature 402:489-495.
Eichler, E., N. Archidiacono, and M. Rocchi. 1999. CAGGG repeats and the pericentromeric duplication of the hominoid genome. Genome Res. 9:1048-1058.
Eichler, E. E. 2001a. Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet. 17:661-669.
Eichler, E. E. 2001b. Segmental duplications: What's missing, misassigned, and misassembled--and should we care? Genome Res. 11:653-656.
Eichler, E. E., M. L. Budarf, M. Rocchi, L. L. Deaven, N. A. Doggett, A. Baldini, D. L. Nelson, and H. W. Mohrenweiser. 1997. Interchromosomal duplications of the adrenoleukodystrophy locus: a phenomenon of pericentromeric plasticity. Hum. Mol. Genet. 6:991-1002.
Eichler, E. E., F. Lu, Y. Shen, R. Antonacci, V. Jurecic, N. A. Doggett, R. K. Moyzis, A. Baldini, R. A. Gibbs, and D. L. Nelson. 1996. Duplication of a gene-rich cluster between 16p11.1 and Xq28: a novel pericentromeric-directed mechanism for paralogous genome evolution. Hum. Mol. Genet. 5:899-912.
Flint, J., K. Thomas, G. Micklem, H. Raynham, K. Clark, N. Doggett, A. King, and D. Higgs. 1997. The relationship between chromosome structure and function at a human telomeric region. Nature Genet. 15:252-257.
Goodman, M. 1999. The genomic record of humankind's evolutionary roots. Am. J. Hum. Genet. 64:1-39.
Guy, J., C. Spalluto, and A. McMurray, et al. (15 co-authors). 2000. Genomic sequence and transcriptional profile of the boundary between pericentromeric satellites and genes on human chromosome arm 10q. Hum. Mol. Genet. 9:2029-2042.
Hattori, M., A. Fujiyama, and T. D. Taylor, et al. (26 co-authors). 2000. The DNA sequence of human chromosome 21. The chromosome 21 mapping and sequencing consortium. Nature 405:311-319.
Heilig, R., R. Eckenberg, and J. L. Petit, et al. (98 co-authors). 2003. The DNA sequence and analysis of human chromosome 14. Nature 421:601-607.
Higgins, D. G., J. D. Thompson, and T. J. Gibson. 1996. Using Clustal for multiple sequence alignments. Methods Enzymol. 266:383-402.
Horvath, J., S. Schwartz, and E. Eichler. 2000. The mosaic structure of a 2p11 pericentromeric segment: a strategy for characterizing complex regions of the human genome. Genome Res. 10:839-852.
Horvath, J., L. Viggiano, B. Loftus, M. Adams, M. Rocchi, and E. Eichler. 2000. Molecular structure and evolution of an alpha/non-alpha satellite junction at 16p11. Hum. Mol. Genet. 9:113-123.
Horvath, J. E., J. A. Bailey, D. P. Locke, and E. E. Eichler. 2001. Lessons from the human genome: transitions between euchromatin and heterochromatin. Hum. Mol. Genet. 10:2215-2223.
IHGSC (International Human Genome Sequencing Consortium). 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.
ISCN (International System of Chromosome Nomenclature). 1985. Report of the standing committee on human cytogenetic nomenclature. Birth Defects 21:1-117.
Jackson, M. S., M. Rocchi, and G. Thompson, et al. (13 co-authors). 1999. Sequences flanking the centromere of human chromosome 10 are a complex patchwork of arm-specific sequences, stable duplications and unstable sequences with homologies to telomeric and other centromeric locations. Hum. Mol. Genet. 8:205-215.
Kaiser, P. 1984. Pericentric inversions. Problems and significance for clinical genetics. Hum. Genet. 68:1-47.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.
Kumar, S., K. Tamura, I. B. Jakobsen, and M. Nei. 2001. MEGA2: molecular evolutionary genetics analysis software. Bioinformatics 17:1244-1245.
Li, W. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Li, W. H., and M. Tanimura. 1987. The molecular clock runs more slowly in man than in apes and monkeys. Nature 326:93-96.
Liu, G., S. Zhao, J. A. Bailey, S. C. Sahinalp, C. Alkan, E. Tuzun, E. D. Green, and E. E. Eichler. 2003. Analysis of primate genomic variation reveals a repeat-driven expansion of the human genome. Genome Res. 13:358-368.
Loftus, B., U. Kim, and V. Sneddon, et al. (19 co-authors). 1999. Genome duplications and other features in 12 Mbp of DNA sequence from human chromosome 16p and 16q. Genomics 60:295-308.
Mahtani, M. M., and H. F. Willard. 1998. Physical and genetic mapping of the human X chromosome centromere: repression of recombination. Genome Res. 8:100-110.
Mashkova, T., N. Oparina, I. Alexandrov, O. Zinovieva, A. Marusina, Y. Yurov, M. H. Lacroix, and L. Kisselev. 1998. Unequal cross-over is involved in human alpha satellite DNA rearrangements on a border of the satellite domain. FEBS Lett. 441:451-457.
Myers, E. W., and W. Miller. 1988. Optimal alignments in linear space. Comput. Appl. Biosci. 4:11-17.
Orti, R., M. C. Potier, C. Maunoury, M. Prieur, N. Creau, and J. M. Delabar. 1998. Conservation of pericentromeric duplications of a 200-kb part of the human 21q22.1 region in primates. Cytogenet. Cell Genet. 83:262-265.
Padilla-Nash, H. M., K. Heselmeyer-Haddad, D. Wangsa, H. Zhang, B. M. Ghadimi, M. Macville, M. Augustus, E. Schrock, E. Hilgenfeld, and T. Ried. 2001. Jumping translocations are common in solid tumor cell lines and result in recurrent fusions of whole chromosome arms. Genes Chromosomes Cancer 30:349-363.
Regnier, V., M. Meddeb, G. Lecointre, F. Richard, A. Duverger, V. C. Nguyen, B. Dutrillaux, A. Bernheim, and G. Danglot. 1997. Emergence and scattering of multiple neurofibromatosis (NF1)-related sequences during hominoid evolution suggest a process of pericentromeric interchromosomal transposition. Hum. Mol. Genet. 6:9-16.
Riethman, H. C., Z. Xiang, S. Paul, E. Morse, X. L. Hu, J. Flint, H. C. Chi, D. L. Grady, and R. K. Moyzis. 2001. Integration of telomere sequences with the draft human genome sequence. Nature 409:948-951.
Stankiewicz, P., and J. R. Lupski. 2002. Genome architecture, rearrangements and genomic disorders. Trends Genet. 18:74-82.
Sun, X., J. Wahlstrom, and G. Karpen. 1997. Molecular structure of a functional Drosophila centromere. Cell 91:1007-1019.
The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796-815.
Trask, B. J., G. van den Engh, M. Christensen, H. F. Massa, J. W. Gray, and M. Van Dilla. 1991. Characterization of somatic cell hybrids by bivariate flow karyotyping and fluorescence in situ hybridization. Somat. Cell Mol. Genet. 17:117-136.
Wohr, G., T. Fink, and G. Assum. 1996. A palindromic structure in the pericentromeric region of various human chromosomes. Genome Res. 6:267-279.
Yan, C. M., K. W. Dobie, H. D. Le, A. Y. Konev, and G. H. Karpen. 2002. Efficient recovery of centric heterochromatin P-element insertions in Drosophila melanogaster. Genetics 161:217-229.
Zimonjic, D., M. Kelley, J. Rubin, S. Aaronson, and N. Popescu. 1997. Fluorescence in situ hybridization analysis of keratinocyte growth factor gene amplification and dispersion in evolution of great apes and humans. Proc. Natl. Acad. Sci. USA 94:11461-11465.