Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
Abstract |
---|
Key Words: isochore, base composition, human, chimpanzee, selection, SNP
Introduction |
---|
A number of different approaches have been used to distinguish between biases in the processes of mutation and fixation as explanations for the existence of isochores. The preferential fixation of certain mutations in regions of similar GC content would cause patterns of polymorphism and substitution between closely related species to be discordant. Comparison of patterns of polymorphism and substitution is the basis of the McDonald-Kreitman test of neutrality (McDonald and Kreitman 1991), which utilizes silent and amino acid replacement changes, and the same principle has also been applied to AT→GC and GC→AT mutations (Akashi 1995; Eyre-Walker 1997, 1999). Thus, if substitution patterns are due solely to variation in the mutation process, patterns of substitution and polymorphism caused by AT→GC and GC→AT mutations should be similar, assuming there have been no changes in mutation bias since the time of divergence of the two species (4.6–6.2 Myr for human-chimpanzee; Chen and Li 2001). However, previous studies using this prediction have relied on the genome composition being at equilibrium (Eyre-Walker 1999; Smith and Eyre-Walker 2001), which is not supported by recent analyses (Duret et al. 2002; Smith, Webster, and Ellegren 2002).
Under a model of fixation bias, differences are predicted between the frequency distributions of noncoding polymorphisms caused by GC→AT and AT→GC mutations (Akashi and Schaeffer 1997; Eyre-Walker 1999). If there is a bias toward fixation of GC nucleotides in GC-rich regions, then AT→GC mutations will segregate at higher average frequencies than GC→AT changes. The polymorphism frequency test is potentially more powerful than a test based on numbers of polymorphisms, as it is only sensitive to changes in mutation bias since the time of origin of extant neutral polymorphism (<1.5 Myr in human noncoding DNA; see, e.g., Zhao et al. 2000; Yu et al. 2001).
Sequences that have been inserted into new positions in the genome, such as processed pseudogenes and interspersed repeats, have been employed to study the process of compositional evolution, and it has been previously demonstrated that such sequences evolve so that their GC content approaches that of surrounding DNA (Filipski, Salinas, and Rodier 1989; Casane et al. 1997; Lander et al. 2001). Although these observations are consistent with both the fixation and mutation bias hypotheses (Eyre-Walker and Hurst 2001), they suggest the presence of "regional effects" on the nucleotide substitution process, which could possibly extend across whole isochores. Repetitive elements are believed to comprise >50% of the human genome (Lander et al. 2001) and are thus of great utility for understanding the determinants of molecular evolution in noncoding DNA.
We have performed a large-scale comparison of inter- and intraspecific mutational changes observed in noncoding regions of the human genome by inferring lineage-specific substitutions in 1.8 Mb of human-chimpanzee-baboon DNA alignments and determining the roots of 6542 single-nucleotide polymorphisms (SNPs) within noncoding regions using >25 Mb of human-chimpanzee DNA alignments. The frequency distributions of the SNPs for which allele frequencies were available were also analyzed. The contribution of regional effects to the process of compositional evolution was studied by analysis of patterns of substitution in Alu sequences within the human-chimpanzee-baboon alignments. The results shed light on the forces that govern evolution of "junk" DNA in the human and chimpanzee genomes.
Materials and Methods |
---|
The alignments were divided into 2 kb windows based only on human and chimpanzee length (i.e., positions containing a gap in both the human and the chimpanzee sequence were ignored). Only windows with at least 100 bp of noncoding and nonrepetitive DNA were considered. The average GC content of each class of regions defined by GC content was weighted by number of substitutions to enable comparison with polymorphism data where the GC content around each observed SNP contributed to the average value.
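The windowing and weighting described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the `-` gap character, and the toy inputs are assumptions, and the filter for at least 100 bp of noncoding, nonrepetitive DNA is left out.

```python
GAP = "-"  # assumed gap character in the alignment


def split_windows(human, chimp, baboon, size=2000):
    """Split an aligned human/chimp/baboon triple into windows of `size` columns.

    Columns that are gaps in BOTH human and chimpanzee are dropped first, so
    window length is measured on human and chimpanzee sequence only.
    """
    columns = [
        (h, c, b)
        for h, c, b in zip(human, chimp, baboon)
        if not (h == GAP and c == GAP)
    ]
    return [columns[i:i + size] for i in range(0, len(columns), size)]


def gc_content(seq):
    """GC fraction over unambiguous bases (A, C, G, T); None if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else None


def weighted_mean_gc(gc_values, n_substitutions):
    """Mean GC content of a class of windows, weighted by substitutions per window."""
    total = sum(n_substitutions)
    return sum(g * n for g, n in zip(gc_values, n_substitutions)) / total
```

Weighting by substitution count means that a window contributing three substitutions counts three times toward the class average, which mirrors how each SNP contributes its own local GC content in the polymorphism data.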
Alignments were analyzed for nucleotide substitutions occurring along the lineages leading to human and chimpanzee, with baboon as the outgroup. Only changes at sites where the human and chimpanzee differed and the baboon was equivalent to either human or chimpanzee were considered, allowing inference of the direction of the change by parsimony. Substitutions putatively due to CpG hypermutability (CpG to TpG or CpA) were identified. We performed a binomial test to determine whether the patterns of substitution on the human and chimpanzee lineages were significantly different. The probability of finding a bias in the numbers of AT→GC and GC→AT changes on the two lineages equal to or greater than the observed values, under the null assumption that each type of change occurs at the same rate on both lineages, was calculated using the binomial probability formula. A binomial test was also used to test the departure of AT→GC and GC→AT changes from a 1:1 ratio in regions of similar GC content in both substitution and polymorphism data.
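A minimal sketch of the parsimony step and of an exact binomial test against a 1:1 ratio is given here for illustration. The function names and string labels are hypothetical, CpG-associated changes are not flagged, and the two-sided definition used (summing outcomes no more probable than the observed one) is one common convention that may differ from the calculation used in the study.

```python
from math import exp, lgamma, log

BASES = set("ACGT")
STRONG, WEAK = set("GC"), set("AT")


def polarise(human, chimp, baboon):
    """Infer (lineage, ancestral, derived) by parsimony, or None if uninformative.

    A site is used only when human and chimpanzee differ and the baboon
    outgroup matches one of them.
    """
    if not {human, chimp, baboon} <= BASES or human == chimp:
        return None
    if baboon == chimp:
        return ("human", chimp, human)
    if baboon == human:
        return ("chimpanzee", human, chimp)
    return None  # all three differ: direction cannot be inferred


def classify(ancestral, derived):
    """Label a change by its effect on GC content."""
    if ancestral in WEAK and derived in STRONG:
        return "AT->GC"
    if ancestral in STRONG and derived in WEAK:
        return "GC->AT"
    return "other"


def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided binomial P value for k successes in n trials."""
    def prob(i):
        log_coeff = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        return exp(log_coeff + i * log(p) + (n - i) * log(1 - p))

    probs = [prob(i) for i in range(n + 1)]
    cutoff = probs[k] * (1 + 1e-7)  # tolerance for floating-point ties
    return min(1.0, sum(q for q in probs if q <= cutoff))
```

For example, `polarise("A", "G", "G")` assigns a G to A change to the human lineage, and `binomial_two_sided(k, n)` with `k` AT→GC changes out of `n` GC-changing substitutions tests the departure from 1:1. A library routine such as SciPy's `binomtest` could be substituted for the hand-written test.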
The location of Alu repeats in the alignments was determined using RepeatMasker and the substitution patterns along the human and chimpanzee lineages within these regions were analyzed, inferring ancestral states by parsimony. The GC content of 1 kb of noncoding nonrepetitive DNA surrounding each Alu repeat was calculated.
Maximum Likelihood Inference of Substitution Patterns
Inferring ancestral states using parsimony can be unreliable, particularly when base composition is biased (Eyre-Walker 1998) and when base composition is not at equilibrium. However, parsimony is usually reasonably accurate when sequence divergence is low, as is the case in our primate comparisons. We checked the accuracy of parsimony in determining the numbers of AT→GC and GC→AT changes along the human and chimpanzee lineages by using a maximum likelihood implementation of a model which allows GC content to evolve nonhomogeneously (Galtier and Gouy 1998). The 2-kb windows were pooled according to GC content, and the NHML program of Galtier and Gouy (1998) was applied to each GC category separately.
The NHML program allows the maximum likelihood testing of heterogeneous base composition evolution. More specifically, the maximum likelihood for the same human-chimp-baboon data set is found under two models: (1) with the same equilibrium base composition (θ in Galtier and Gouy's terminology) applying to all four branches in the rooted human-chimp-baboon tree, and (2) with four branch-specific θ values. Twice the increase in log likelihood from model one to model two (2ΔlogL) is expected to follow a chi-squared distribution with three degrees of freedom (since there are three additional parameters in model two relative to model one). The other fitted parameters were the same for both models: ancestral base composition, root location, branch lengths, transition/transversion ratio, and base composition at nodes (for details, see Galtier and Gouy 1998). We did not model rate variation among sites; in other words, we assume that all sites evolve at the same rate.
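The likelihood ratio test described above can be illustrated with a short sketch. The log-likelihood values in the usage note are invented for illustration and are not results from the study. The function names are hypothetical, and the closed form for the chi-squared tail is specific to three degrees of freedom.

```python
from math import erfc, exp, pi, sqrt


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-squared distribution with 3 df.

    Uses the closed form available for 3 degrees of freedom:
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    if x <= 0:
        return 1.0
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


def likelihood_ratio_test(logl_model1, logl_model2):
    """Return (2 * delta logL, P value) for nested models differing by 3 parameters."""
    statistic = 2 * (logl_model2 - logl_model1)
    return statistic, chi2_sf_3df(statistic)
```

With hypothetical values `likelihood_ratio_test(-1000.0, -995.0)`, the statistic is 10.0, which exceeds the 5% critical value for three degrees of freedom (about 7.81), so the branch-specific model would be preferred in that invented case.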
Under the Galtier and Gouy model it is straightforward to employ the maximum likelihood parameter estimates from model two to assess the accuracy of parsimony inference with the empirical Bayesian approach. For example, consider when the site pattern in human-chimpanzee-baboon is AGG (HCB = AGG). Parsimony yields an unambiguous inference of a G to A substitution in the human lineage, equivalent to the state of the human-chimpanzee ancestor being G. But we know that evolution does not always proceed according to parsimony, and it is this uncertainty which we seek to quantify. The substitution probabilities of the Galtier and Gouy model, given in the form pIJ which is the probability of state J at the bottom of the branch given state I at the top of the branch, can be used to determine the probability that the human-chimpanzee ancestor is a G (HCA = G) given that HCB = AGG, Pr(HCA = G | HCB = AGG), using Bayes theorem, given here for the general case of HCA = W and HCB = XYZ.
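For orientation, the general form of Bayes' theorem for this case can be written as below. This is a generic statement rather than a recovery of the authors' own expression. The second line assumes the standard likelihood factorization on a rooted three-taxon tree; the root state R, the ancestral base frequencies \pi_R, and the branch superscripts (H, C, A for the branch from the root to the human-chimpanzee ancestor, B for the baboon branch) are notation introduced here for illustration.

```latex
\Pr(\mathrm{HCA}=W \mid \mathrm{HCB}=XYZ)
  = \frac{\Pr(\mathrm{HCB}=XYZ,\ \mathrm{HCA}=W)}
         {\sum_{V \in \{A,C,G,T\}} \Pr(\mathrm{HCB}=XYZ,\ \mathrm{HCA}=V)}

\Pr(\mathrm{HCB}=XYZ,\ \mathrm{HCA}=W)
  = p^{(H)}_{WX}\, p^{(C)}_{WY} \sum_{R \in \{A,C,G,T\}} \pi_R\, p^{(A)}_{RW}\, p^{(B)}_{RZ}
```

For the site pattern HCB = AGG, the quantity of interest is the value obtained with W = G: the closer it is to 1, the more reliable the parsimony inference of a G to A change on the human lineage.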
Determination of Human SNP Roots Using Human-Chimpanzee Alignments
A total of 37,580 human-chimpanzee alignments (20.1 Mb) were obtained using the draft quality chimpanzee BAC end sequences published as part of the RIKEN chimpanzee genome project (Fujiyama et al. 2002). The chimpanzee sequences were extracted from NCBI Entrez (http://www.ncbi.nlm.nih.gov/Entrez/) along with their corresponding orthologous human sequences, as given at http://hgp.gsc.riken.go.jp/pub/chimp/clone.html. Alignments were generated using the default settings of ClustalW, and those containing coding sequence were identified by BLAST searches against the RefSeq database (Pruitt and Maglott 2001) and omitted from the analysis. A further set of 80 human-chimpanzee genomic alignments, masked for genes and with a total length of 5.1 Mb, were also included in the analysis. These were constructed from full or partial length chimpanzee BAC clones and their orthologous sequence in human contigs as described in Webster, Smith, and Ellegren (2002). The human-chimpanzee-baboon alignments described earlier comprise a subset of these 80 human-chimpanzee alignments for which baboon sequence was available (further details of all aligned sequences are available on request).
The tenth release of The SNP Consortium (TSC) database, consisting of 1,250,611 mapped SNPs and their 3' and 5' flanking sequences (average of 690 bp flanking sequence per SNP) was obtained from http://snp.cshl.org/index.html. To identify SNPs in regions where the chimpanzee root is available, BLAST searches were performed between all of the human-chimpanzee alignments on each human chromosome and databases consisting of the flanking sequences of all the available SNPs on that chromosome. SNPs whose flanking sequences perfectly matched the human sequence in an alignment were then compared to the aligned chimpanzee sequence in order to determine the root of the mutation. Only biallelic SNPs where the chimpanzee root was the same as one of the two human alleles were included in the analysis. SNPs within repetitive DNA were removed from the analysis as they could potentially show mutation patterns unrepresentative of the remainder of noncoding DNA. This was done by masking repeats in the sequence surrounding SNPs with RepeatMasker. Nonrepetitive DNA within the masked sequence flanking SNPs was used to calculate local GC content.
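The rooting rule in this paragraph can be sketched as a small function. This is an illustrative assumption about how such a step might be coded; the function names and return format are hypothetical, and the BLAST matching and repeat masking steps are not shown.

```python
BASES = set("ACGT")


def root_snp(allele1, allele2, chimp_base):
    """Return (ancestral, derived) for a biallelic SNP, or None if it cannot be rooted.

    The chimpanzee base is taken as the ancestral state, and the SNP is kept
    only when that base matches one of the two human alleles.
    """
    alleles = {allele1.upper(), allele2.upper()}
    chimp = chimp_base.upper()
    if len(alleles) != 2 or not alleles <= BASES or chimp not in alleles:
        return None
    derived = (alleles - {chimp}).pop()
    return chimp, derived


def mutation_class(ancestral, derived):
    """Label a rooted SNP by its effect on GC content."""
    if ancestral in "AT" and derived in "GC":
        return "AT->GC"
    if ancestral in "GC" and derived in "AT":
        return "GC->AT"
    return "other"
```

For a human A/G SNP aligned to a chimpanzee G, `root_snp("A", "G", "G")` gives G as ancestral and A as derived, which `mutation_class` labels as a GC→AT mutation. A chimpanzee base matching neither allele leaves the SNP unrooted, and it is excluded.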
Files containing SNP allele frequencies were retrieved from TSC (ftp://snp.cshl.org/pub/SNP/frequency/). To determine the frequency of the ancestral and variant alleles of biallelic SNPs, these were searched for SNPs where the chimpanzee root had been previously determined. Genotyping results from African American, Asian, and Caucasian samples were pooled, although there were no qualitative differences in the results when individual populations were considered (data not shown).
Bootstrap Analysis of Equilibrium GC Content
The predicted equilibrium GC content, f*, of a genomic region can be calculated by considering the present GC content, f, and the per base pair rate of GC→AT (u) and AT→GC (v) mutations (Sueoka 1993; Eyre-Walker 1997) using the following formula:
$$f^{*} = \frac{v}{u + v}$$
Values of f* were estimated for all GC categories using the parsimony-inferred patterns of substitution and polymorphism. Confidence intervals for f* were calculated by re-sampling all of the inferred mutational changes with replacement using 10,000 independent replicates, assuming that all mutations were independent events. We also tested the significance of differences between the predicted values of f* derived from polymorphism and divergence by comparison of each replicate from the different data sets. This procedure was also used to calculate the significance of differences in values of f* when sequences are divided into regions of high (≥0.4) and low (<0.4) GC content.
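The estimate and its bootstrap interval can be sketched as follows, assuming the equilibrium relation f* = v/(u + v) from Sueoka (1993), with u taken as the number of GC→AT changes per GC site and v as the number of AT→GC changes per AT site. The function names, the percentile method for the interval, and the counts in the usage note are illustrative assumptions rather than details from the study.

```python
import random


def equilibrium_gc(n_gc_to_at, n_at_to_gc, f):
    """Predicted equilibrium GC content f* = v / (u + v).

    u and v are per-site rates: changes divided by the fraction of GC (f)
    or AT (1 - f) sites. The region length cancels, so it is omitted.
    """
    u = n_gc_to_at / f
    v = n_at_to_gc / (1 - f)
    return v / (u + v) if u + v else None


def bootstrap_f_star(n_gc_to_at, n_at_to_gc, f, replicates=10_000, seed=0):
    """95% percentile interval for f*, resampling changes with replacement."""
    rng = random.Random(seed)
    changes = ["GC->AT"] * n_gc_to_at + ["AT->GC"] * n_at_to_gc
    estimates = []
    for _ in range(replicates):
        sample = rng.choices(changes, k=len(changes))
        k = sample.count("GC->AT")
        estimate = equilibrium_gc(k, len(sample) - k, f)
        if estimate is not None:
            estimates.append(estimate)
    estimates.sort()
    lower = estimates[int(0.025 * len(estimates))]
    upper = estimates[int(0.975 * len(estimates)) - 1]
    return lower, upper
```

As a check on the formula, equal numbers of GC→AT and AT→GC changes return f* equal to the current GC content (for instance `equilibrium_gc(50, 50, 0.4)` gives 0.4), which is the 1:1 expectation at compositional equilibrium. Two data sets can be compared by counting the fraction of paired replicates in which one estimate exceeds the other.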
Results |
---|
There is a significant excess of GC→AT substitutions in regions of high GC content and an excess of AT→GC substitutions in lower GC regions compared with the predicted ratio of 1:1 when the GC content is at equilibrium (table 1, binomial test). Furthermore, the predicted equilibrium GC contents (f*) estimated from regions of different GC content within the human-chimpanzee-baboon alignments are similar, and suggest a homogenization of GC content toward the average GC content of all regions studied here (comparable to the genomic average of ~41%; Lander et al. 2001). However, when the observed changes are divided into regions of high (≥0.4) and low (<0.4) GC content, f* is significantly higher in regions of higher GC content (0.43 in high GC, 0.40 in low GC; bootstrap P = 0.003). This suggests that there is a slight bias toward AT→GC substitutions in regions of higher GC content.
We also compared the values of f* predicted by polymorphism and substitution data by comparing independent bootstrap replicates. When regions of all GC contents are analyzed together, f* predicted from polymorphisms is significantly lower than from substitutions (P < 0.001; table 3), indicating an overall bias toward fixation of AT→GC mutations. It is unclear from these data, however, whether this fixation bias is more prevalent in regions of high or low GC content, as significant differences between patterns of divergence and polymorphism are only observed in regions of intermediate GC content, where the amount of data is larger.
Analysis of the frequency distributions of GC→AT and AT→GC SNPs in regions of high (≥0.5), medium (40–50%), and low (<0.4) GC content separately (table 4) reveals no significant differences using Mann-Whitney U tests. The mean frequency of the two types of SNP is similar in all regions. However, at higher (≥0.5) GC contents the average frequency of AT→GC mutations is slightly higher than GC→AT, reflected in the frequency distributions (fig. 1). This discrepancy is consistent with the presence of a weak fixation bias favoring mutations that increase GC content in regions where GC content is already high.
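A comparison of two sets of derived-allele frequencies with a Mann-Whitney U test might look like the sketch below. It is a simplified illustration: ties receive midranks, but the P value comes from a normal approximation without tie or continuity corrections, so it is only approximate for small or heavily tied samples. The function name is hypothetical and any frequencies passed to it here are toy values, not data from the study.

```python
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Return (U for sample x, approximate two-sided P value).

    Ranks are assigned on the pooled sample, with tied values sharing the
    average (mid) rank. The P value uses the normal approximation.
    """
    pooled = sorted((value, group) for group, values in enumerate((x, y))
                    for value in values)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        n_x_in_tie = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += midrank * n_x_in_tie
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return u, erfc(abs(z) / sqrt(2))
```

Here `x` would hold the frequencies of AT→GC SNPs and `y` those of GC→AT SNPs within one GC class. A U close to `len(x) * len(y) / 2` indicates that neither class tends to segregate at higher frequency.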
Discussion |
---|
It is unclear from the comparisons of polymorphism and divergence whether the bias toward fixation of GC nucleotides is more powerful in regions of any particular GC content (table 3). However, the equilibrium GC content predicted by substitutions in regions of higher (≥0.4) GC contents is significantly higher than in regions of lower GC (<0.4) contents, whereas no such pattern is exhibited by polymorphisms. In addition, there is a trend toward AT→GC mutations segregating at relatively higher mean frequencies in regions of high GC content (table 4; fig. 1). It is therefore possible that the bias toward fixation of AT→GC mutations operates most strongly in regions of high GC content. Nevertheless, as the polymorphism and divergence data sets both reveal that the GC content of all regions of the genome is heading toward an equilibrium of close to 40%, the observed patterns of substitution can be considered to be mainly caused by the pattern of mutation, with only a weak effect of fixation bias.
It is possible that a genome-wide change in mutation bias occurring after the divergence of humans and chimpanzees could result in the observed discrepancies between patterns of intra- and inter-specific mutational changes. However, there are no significant differences between the patterns of inferred changes on the human and chimpanzee lineages, and it is unlikely that similar changes in mutation bias have occurred in the two lineages.
Another potential complication is that more rapidly evolving sequences are under-represented in the human-chimpanzee-baboon alignments (used to infer substitutions) compared with human-chimpanzee alignments (used to infer SNP roots) because of difficulties in assigning orthology with baboon sequences due to greater divergence. If rapidly evolving sequences have different substitution patterns, then this could cause a bias in the results. However, average divergence in noncoding, nonrepetitive regions of the human-chimpanzee alignments (excluding those of draft quality) is 0.0104, whereas in the human-chimpanzee-baboon alignments it is 0.0105, which suggests that the human-chimpanzee-baboon alignments do not represent particularly slowly evolving sequences.
It is important to exclude the possibility that errors in the data set could result in any of the observed results. In the case of the patterns of substitution observed in human-chimpanzee-baboon alignments this would happen only if sequencing were highly biased toward errors that change GC content of the sequence. This is unlikely, and furthermore would be expected to lead to differences in the patterns of substitutions inferred on human and chimpanzee lineages, which are not observed. A recent study found no significant differences between the substitutions observed by comparing human and chimpanzee sequences and all human SNPs (Ebersberger et al. 2002), which means that differences reported here between substitution and polymorphism are only apparent when the roots of the polymorphisms are considered. Even though the chimpanzee genome project sequences used here are of draft quality, there would have to be a very large bias toward errors that change GC to result in incorrect rooting of a fraction of the human SNPs, which is unlikely because the level of divergence between chimpanzee draft sequences and human sequences is low (0.026 per nucleotide).
Determinants of Mutation Rate Variation
Under a model where the equilibrium GC frequency predicted by the combined effects of mutation and fixation processes is lower than 0.5, a positive correlation between divergence and GC content is expected (Piganeau et al. 2002), as exhibited by the data presented here. However, both the GC→AT and AT→GC substitution rates per nucleotide also increase with GC content, which suggests that, although compositional non-equilibrium is likely to be a major determinant of higher substitution rate in regions of high GC, other factors also act to increase the substitution rate in these regions. One intriguing possibility is the potentially mutagenic effect of recombination (Lercher and Hurst 2002), which correlates with GC content (Fullerton, Bernardo Carvalho, and Clark 2001).
Further insight into the mechanisms determining substitution rate variation can be gained by examining the patterns of substitution in DNA elements inserted into backgrounds with different GC contents. Casane et al. (1997) examined substitutions in three processed pseudogenes of high GC content inserted into regions of intermediate GC contents and showed that the pseudogenes were approaching the GC contents of the surrounding DNA, consistent with a regional effect on the nucleotide substitution process leading to an attenuation of GC content of inserted sequences. However, these findings are also consistent with a hypothesis of a constant mutation bias across the entire genome where u > v, which leads to an excess of GC→AT substitutions in sequences of high GC content, regardless of their genomic location.
An analysis of substitution patterns observed in a genome-wide sample of five different DNA transposons compared with their consensus sequences revealed that elements inserted into high GC regions accumulated a greater relative proportion of AT→GC substitutions than repeats in low GC regions (Lander et al. 2001). Filipski, Salinas, and Rodier (1989) reported an effect of surrounding GC content on the pattern of substitution in Alu repeats in the human α- (high GC content) and β-globin (intermediate GC content) gene clusters since the common ancestor of human and chimpanzee. However, a similar effect is not evident in our larger sample of Alu repeats, in which substitution patterns since the human-chimpanzee common ancestor are not significantly affected by surrounding GC content. It is possible, however, that such an effect would become apparent over a longer timescale. In addition, it is plausible that there were forces acting to maintain GC content earlier in vertebrate evolution, which left traces in the divergence of anciently inserted sequences, but that these factors have not been as effective since the human-chimpanzee common ancestor.
Conclusions |
---|
Acknowledgements |
---|
Footnotes |
---|
Fumio Tajima, Associate Editor
Literature Cited |
---|
Akashi, H. 1995. Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA. Genetics 139:1067-1076.
Akashi, H., and S. W. Schaeffer. 1997. Natural selection and the frequency distributions of "silent" DNA polymorphism in Drosophila. Genetics 146:295-307.
Altschul, S. F., T. L. Madden, A. A. Schaffer, J. H. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
Bernardi, G., B. Olofsson, J. Filipski, M. Zerial, J. Salinas, G. Cuny, M. Meunier-Rotival, and F. Rodier. 1985. The mosaic genome of warm-blooded vertebrates. Science 228:953-958.
Casane, D., S. Boissinot, B. H. Chang, L. C. Shimmin, and W.-H. Li. 1997. Mutation pattern variation among regions of the primate genome. J. Mol. Evol. 45:216-226.
Chen, F. C., and W.-H. Li. 2001. Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am. J. Hum. Genet. 68:444-456.
Duret, L., M. Semon, G. Piganeau, D. Mouchiroud, and N. Galtier. 2002. Vanishing GC-rich isochores in mammalian genomes. Genetics (in press).
Ebersberger, I., D. Metzler, C. Schwarz, and S. Paabo. 2002. Genomewide comparison of DNA sequences between humans and chimpanzees. Am. J. Hum. Genet. 70:1490-1497.
Eyre-Walker, A. 1993. Recombination and mammalian genome evolution. Proc. R. Soc. Lond. Ser. B Biol. Sci. 252:237-243.
Eyre-Walker, A. 1997. Differentiating between selection and mutation bias. Genetics 147:1983-1987.
Eyre-Walker, A. 1998. Problems with parsimony in sequences of biased base composition. J. Mol. Evol. 47:686-690.
Eyre-Walker, A. 1999. Evidence of selection on silent site base composition in mammals: potential implications for the evolution of isochores and junk DNA. Genetics 152:675-683.
Eyre-Walker, A., and L. D. Hurst. 2001. The evolution of isochores. Nat. Rev. Genet. 2:549-555.
Filipski, J., J. Salinas, and F. Rodier. 1989. Chromosome localization-dependent compositional bias of point mutations in Alu repetitive sequences. J. Mol. Biol. 206:563-566.
Fujiyama, A., H. Watanabe, A. Toyoda, et al. (17 co-authors). 2002. Construction and analysis of a human-chimpanzee comparative clone map. Science 295:131-134.
Fullerton, S. M., A. Bernardo Carvalho, and A. G. Clark. 2001. Local rates of recombination are positively correlated with GC content in the human genome. Mol. Biol. Evol. 18:1139-1142.
Galtier, N., and M. Gouy. 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871-879.
Lander, E. S., L. M. Linton, B. Birren, et al. (248 co-authors). 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.
Lercher, M. J., and L. D. Hurst. 2002. Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet. 18:337-340.
McDonald, J. H., and M. Kreitman. 1991. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652-654.
Nekrutenko, A., and W.-H. Li. 2000. Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Res. 10:1986-1995.
Piganeau, G., D. Mouchiroud, L. Duret, and C. Gautier. 2002. Expected relationship between the silent substitution rate and the GC content: implications for the evolution of isochores. J. Mol. Evol. 54:129-133.
Pruitt, K. D., and D. R. Maglott. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29:137-140.
Smith, N. G. C., and A. Eyre-Walker. 2001. Synonymous codon bias is not caused by mutation bias in G+C-rich genes in humans. Mol. Biol. Evol. 18:982-986.
Smith, N. G. C., M. T. Webster, and H. Ellegren. 2002. Deterministic mutation rate variation in the human genome. Genome Res. 12:1350-1356.
Sueoka, N. 1988. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. USA 85:2653-2657.
Sueoka, N. 1993. Directional mutation pressure, mutator mutations, and dynamics of molecular evolution. J. Mol. Evol. 37:137-153.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Webster, M. T., N. G. C. Smith, and H. Ellegren. 2002. Microsatellite evolution inferred from human-chimpanzee genomic sequence alignments. Proc. Natl. Acad. Sci. USA 99:8748-8753.
Yu, N., Y. X. Fu, N. Sambuughin, M. Ramsay, T. Jenkins, E. Leskinen, L. Patthy, L. B. Jorde, T. Kuromori, and W.-H. Li. 2001. Global patterns of human DNA sequence variation in a 10-kb region on chromosome 1. Mol. Biol. Evol. 18:214-222.
Zhao, Z., L. Jin, Y. X. Fu, et al. (13 co-authors). 2000. Worldwide DNA sequence variation in a 10-kilobase non-coding region on human chromosome 22. Proc. Natl. Acad. Sci. USA 97:11354-11358.