*Department of Ecology and Evolution, University of Chicago;
Human Genetics Center, University of Texas at Houston;
Neurology Research, Phoenix, Arizona;
§Department of Human Genetics, South African Institute for Medical Research, Johannesburg, South Africa;
||Department of Biology, University of Oulu, Finland;
¶Institute of Enzymology, Hungarian Academy of Sciences, Budapest, Hungary; and
**Department of Human Genetics, University of Utah
Abstract |
---|
Introduction |
---|
We have been pursuing human DNA variation studies in noncoding regions for two purposes. First, we wish to establish a genomewide and worldwide neutrality standard of nucleotide diversity. By "neutrality standard," we mean the level of nucleotide diversity expected in a region in which all mutations are neutral and not directly subject to natural selection. This standard will be a useful reference, especially for comparison with the levels of nucleotide diversity in coding regions. Such a standard requires data from many genomic regions, because the level of nucleotide diversity in a region is subject to strong stochastic effects. Second, we wish to study the origin and evolution of modern humans. DNA sequence data from noncoding regions may reflect human history more accurately than data from coding regions, because noncoding regions are not directly subject to natural selection. The majority of past studies on human DNA variation, which are mainly of mitochondrial (mt) DNA, microsatellite DNA, and the Y chromosome, have largely given the impression of a relatively shallow genetic history of humans. These observations have been taken as evidence for the Out of Africa model for the origin of modern humans, which postulates that a founder group of modern humans emigrated from Africa about 100,000 years ago to Europe and Asia and completely replaced all the indigenous populations outside of Africa (Cann, Stoneking, and Wilson 1987; Stringer and Andrews 1988). However, recent studies of the β-globin and the PDHA1 gene regions (Harding et al. 1997; Harris and Hey 1999) and a 10-kb noncoding region on chromosome 22 (Zhao et al. 2000) have revealed an ancient genetic history of humans and suggested that human evolution has been more complex than depicted by the simple Out of Africa model. To attain a better understanding of this issue, it is necessary to obtain sequence data from other noncoding regions.
For the above purposes, we selected a 10-kb region on chromosome 1 and obtained sequence data from worldwide populations. This region contains mostly introns, although it also includes four short exons. The new data were compared with the data from Xq13.3 and 22q11.2 and other regions to study the features of sequence variation within and between populations and were used to infer the genetic history of human evolution.
Materials and Methods |
---|
Sixty-one individuals were sampled worldwide from 14 human populations in three major geographic areas: 20 Africans (5 South African Bantu speakers, 1 !Kung, 2 Mbuti Pygmies, 2 Biaka Pygmies, 5 Nigerians, 5 Kenyans), 20 Asians (8 Chinese, 3 Japanese, 6 Indians, 3 Yakuts), and 21 Europeans (6 Swedes, 2 Finns, 5 French, 5 Hungarians, 3 Italians). One chimpanzee, one gorilla, and one orangutan were used as outgroups.
PCR Amplification and DNA Sequencing
Five primer pairs were designed to amplify three overlapping fragments covering positions 9–6141 and two overlapping fragments covering positions 6880–11022 in the 12-kb region. Touchdown PCR (Don et al. 1991) was used, and the reactions were carried out under the conditions described in Zhao et al. (2000). The PCR products were purified with the Wizard PCR Preps DNA Purification Resin Kit (Promega). Sequencing reactions were performed according to the protocol of the ABI Prism BigDye Terminator Sequencing Kits (Perkin Elmer), modified to quarter reactions. The extension products were purified by Sephadex G-50 (DNA grade, Pharmacia) and run on an ABI 377XL DNA sequencer using 4.25% gels (Sooner Scientific).
ABI DNA Sequence Analysis 3.0 was used for lane tracking and base calling. The data were then proofread; the fluorescence traces were reread manually and heterozygous sites were detected as double peaks. The segment sequences were assembled automatically using SeqMan in DNASTAR. The assembled files were carefully checked manually using the same program, and variant sites were identified in the aligned sequences in MegAlign in DNASTAR. All of the nucleotides in each segment were sequenced at least once in both directions. Furthermore, all singletons and doubletons, which are defined as variants that appear, respectively, only once and twice in the total sample, were verified by reamplifying the region containing the variant site and resequencing the region in both directions using new internal primers that were close to the site.
Data Analysis
The sequences were aligned by MegAlign in the DNASTAR software package. The human consensus sequence was obtained from the alignment using DNASTAR. The human ancestral sequence was inferred by comparing the human sequences with the outgroup sequences using the maximum-parsimony principle.
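The outgroup comparison described above can be sketched for a single variant site as follows. This is a deliberately conservative simplification of the parsimony principle, not the procedure actually used in the study: the function name and the rule (accept an allele as ancestral only when exactly one human allele is found among the outgroups) are assumptions made for illustration.

```python
def infer_ancestral_allele(human_alleles, outgroup_alleles):
    """Return the human allele supported by the outgroups, or None if unresolved.

    human_alleles: set of alleles segregating in humans at one site
    outgroup_alleles: alleles seen in chimpanzee, gorilla, and orangutan
    """
    supported = {a for a in human_alleles if a in outgroup_alleles}
    if len(supported) == 1:
        return next(iter(supported))
    # Either no outgroup carries a human allele, or the outgroups
    # support more than one human allele; leave the direction unresolved.
    return None


# All three outgroups carry A, so A is taken as ancestral and G as derived.
print(infer_ancestral_allele({"A", "G"}, ["A", "A", "A"]))
# The outgroups disagree, so the direction is left unresolved.
print(infer_ancestral_allele({"A", "G"}, ["A", "G", "A"]))
```

Sites for which the function returns `None` correspond to mutations whose direction cannot be inferred under this simplified rule; a full parsimony reconstruction on the species tree would resolve some of them.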
For a DNA sequence subject to no natural selection, the mutation rate per sequence per generation (µ) is estimated by
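$$\mu = v\,g\,L \qquad (1)$$

where v is the mutation rate per nucleotide site per year, g is the generation time in years, and L is the length of the sequence in nucleotides.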
Tajima's (1989) test and Fu and Li's (1993) tests were used to test the selective neutrality of the region studied; a program is available at http://hgc.sph.uth.tmc.edu/fu. The critical values for the neutrality tests were obtained from 5,000 simulated samples. Fu's (1996) and Fu and Li's (1997) methods were used to estimate the age of the most recent common ancestor (MRCA) of the DNA sequences in a sample. We computed both the mode and the mean of the age (T) of the MRCA in years.
Note that all of the above computations require only segregating site data but do not require haplotype data.
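As an illustration of a test that needs only summary counts, the following is a minimal sketch of Tajima's D statistic using the standard published formulas. It is not the program referenced above; the function name and the example inputs are assumptions chosen for illustration and are not data from this study.

```python
import math


def tajimas_d(n, num_segregating, mean_pairwise_diff):
    """Tajima's D from sample size n, number of segregating sites S,
    and the mean number of pairwise differences k."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = num_segregating
    # Estimated variance of the difference between the two theta estimators.
    variance = e1 * s + e2 * s * (s - 1)
    return (mean_pairwise_diff - s / a1) / math.sqrt(variance)


# Illustrative inputs: 10 sequences, 16 segregating sites, k = 3.888889.
d_value = tajimas_d(10, 16, 3.888889)
print(round(d_value, 3))
```

A negative D indicates an excess of low-frequency variants relative to the neutral equilibrium expectation. Significance would still have to be judged against critical values obtained by simulation, as described above.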
Results and Discussion |
---|
Three strong potential coding regions (exons) in the selected 12-kb segment were predicted by GenScan (each with a probability >95.5%) and GRAIL-EXP; these potential exons were at sites 23522–23632, 26131–26285, and 27542–27679 in locus HS125H23 (Z94054), respectively. A BLAST search showed that the amino acid sequence translated from these three exons was 86% similar to a segment of a human membrane protein CH1 (GenBank accession number AF097535). A later BLAST search of GenBank with the nucleotide sequence of the 10-kb region indicated that four exons are similar to the human membrane protein CH1 gene (similarity >99%) and that the amino acid sequence translated from these four exons is identical to that translated from this gene. These exons are at sites 21773–21888, 26128–26285, 27542–27679, and 27902–28056 in locus HS125H23 (Z94054), respectively.
Pattern of Sequence Variation
A total of 48 variant sites were found in the alignment of human sequences; 19 of them were observed only once (i.e., singletons), 7 were observed twice (i.e., doubletons), and 22 were observed more than twice (i.e., others) (table 1). Two variant sites (one synonymous and one nonsynonymous singleton) were observed in the second exon, while all of the remaining 46 variant sites were found in introns. All singletons and doubletons were verified as explained in Materials and Methods, and no error was found. In addition to the 48 single-nucleotide variants, we found 4 insertions/deletions (indels) among the 122 human sequences. On average, 5 variant sites per 1,000 bp were found in the region studied.
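For orientation, the counts above can be turned into a per-sequence Watterson estimate of θ. This is a sketch under stated assumptions: it uses all 48 variant sites and all 122 sequences as quoted in the text, whereas the estimates reported in the paper's tables may rest on a different site set (for example, with exons excluded). The function name is hypothetical.

```python
def wattersons_theta(num_segregating, n):
    """Watterson's estimator: S divided by the harmonic number a1 = sum_{i=1}^{n-1} 1/i."""
    a1 = sum(1.0 / i for i in range(1, n))
    return num_segregating / a1


# 48 variant sites among 122 human sequences, as quoted in the text.
theta_w = wattersons_theta(48, 122)
print(round(theta_w, 2))
```

Because the estimator depends only on the number of segregating sites, it is sensitive to the many singletons in the sample; estimators based on pairwise differences weight such rare variants less heavily.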
Table 1 also includes the patterns of sequence variation in the 10-kb noncoding regions on chromosomes X and 22 (Kaessmann et al. 1999; Zhao et al. 2000). Note that among the African sequences, the number of low-frequency variants (i.e., singletons and doubletons) was smaller than that of high-frequency variants (i.e., others) in the present region, whereas the opposite was true for the two other regions. On the other hand, among non-African sequences, the number of low-frequency variants was larger than that of the high-frequency variants in the present region, whereas the opposite was true for the other two regions. Thus, although a stronger excess of low-frequency variants in Africans than in non-Africans was observed in the previous two regions, the opposite was found in the present region. This difference in pattern notwithstanding, two features common to the three regions were noted. First, there were more variants in the African sample than in the non-African sample, despite the number of Africans studied being less than half that of non-Africans studied. Thus, Africans were considerably more polymorphic than non-Africans, in agreement with previous observations (Cann, Stoneking, and Wilson 1987; Kaessmann et al. 1999; Zhao et al. 2000). Second, the number of low-frequency variants (singletons and doubletons) in the total sample was larger than the number of high-frequency variants, e.g., 26 versus 22 in the present region. This excess of low-frequency variants in all three regions is in sharp contrast to the situations for the dystrophin and PDHA1 genes (Zietkiewicz et al. 1998; Harris and Hey 1999) and suggests a relatively recent population expansion, because such an excess is not expected from an equilibrium Wright-Fisher population.
The present region was considerably less polymorphic than the region on chromosome 22: it had fewer high-frequency variants, especially among non-Africans, and a much smaller number of doubletons. The relatively low polymorphism may be due to a lower mutation rate (see below) and to selection in the gene containing these introns or in genes linked to this region. The X-linked region was even less variable (table 1). This may be because an X-linked region has a smaller effective population size than an autosomal region (3Ne/4 vs. Ne) and because the X region has a low recombination rate (Kaessmann et al. 1999), so that compared to the other two regions, it is subject to stronger background selection (i.e., effects of deleterious mutations in genes linked to the region) or selective sweeps (effects of positive selection in genes linked to the region) (Begun and Aquadro 1992; Charlesworth 1994).
A comparison of all sequences, including the chimpanzee, gorilla, and orangutan sequences, revealed 382 variant sites, 44 of which were indels. The 382 variant sites were evenly distributed in this region (χ² = 14.4, df = 9, P = 0.11) (table 2). The number of variant sites in human populations was also evenly distributed (χ² = 9.9, df = 9, P = 0.36).
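The two P values can be checked from the reported statistics alone. The sketch below uses the closed-form upper-tail probability of the chi-square distribution for odd degrees of freedom; the function name is hypothetical, and the per-window counts of table 2 are not needed because only the reported statistic and df are used.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate with odd df."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form only covers odd degrees of freedom")
    root = math.sqrt(x)
    # Two-sided standard normal tail beyond sqrt(x).
    tail = math.erfc(root / math.sqrt(2.0))
    # Standard normal density evaluated at sqrt(x).
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    total = 0.0
    term = root  # r = 1 term: x**(1/2) / 1
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)  # next term: x**((2r+1)/2) / (1*3*...*(2r+1))
    return tail + 2.0 * density * total


p_all_species = chi2_sf_odd_df(14.4, 9)
p_human_only = chi2_sf_odd_df(9.9, 9)
print(round(p_all_species, 2), round(p_human_only, 2))
```

Both probabilities exceed 0.05, so the hypothesis of an even distribution of variant sites along the region is not rejected in either comparison.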
Mutation Pattern
Comparing the human, chimpanzee, gorilla, and orangutan sequences, we were able to infer the direction of 172 mutations (table 3); the proportion of transitional changes was 66%. For the 169 mutations for which the direction could not be inferred, the proportion of transitions was 65% (table 3). This proportion was between the values (59% and 70%) observed in pseudogenes (Li, Wu, and Luo 1984) and for the 10-kb region in 22q11.2 (Zhao et al. 2000). For those mutations whose direction could be inferred, the number of G/C-to-A/T mutations was 57, while that of A/T-to-G/C mutations was 90. According to the GC and AT contents of 31.5% and 68.5%, the expected numbers of G/C-to-A/T mutations and A/T-to-G/C mutations are 46.3 and 100.7, respectively, and a comparison with the observed numbers gives χ² = 3.61 and P = 0.057, which is close to significant. This result suggests that G/C-to-A/T mutations might occur more frequently than A/T-to-G/C mutations, similar to the situation for mammalian pseudogenes (G/C-to-A/T, 64.5%; A/T-to-G/C, 35.5%) (Li 1997).
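The goodness-of-fit statistic quoted above can be reproduced with a few lines of arithmetic. In this sketch the observed counts 57 (G/C-to-A/T) and 90 (A/T-to-G/C) are paired with expected counts of 46.3 and 100.7, respectively; that pairing is an assumption of the sketch, chosen because it is the one that recovers the reported χ² and P.

```python
import math

observed = (57, 90)        # G/C-to-A/T, A/T-to-G/C
expected = (46.3, 100.7)   # assumed pairing with the observed counts

# Pearson chi-square statistic with two classes (1 degree of freedom).
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For 1 df the upper-tail probability reduces to erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2.0))

print(round(chi2, 2), round(p_value, 3))
```

With the opposite pairing the statistic would be far larger than the reported value, so the test result is sensitive to which base composition is assigned to which mutation class.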
A large variation in π is seen among regions (table 6). In particular, the 5' and 3' flanking regions of the β-globin gene and the β-globin replication origin initiation region (IR) have high π values (Harding et al. 1997; Fullerton et al. 2000). The high π value in the IR has been speculated to be due to a high mutation rate because of the peculiar feature of the DNA unwinding element in the IR (Fullerton et al. 2000). On average, the autosomal regions have the highest π value (0.091%), the X-linked regions have a somewhat lower π value (0.079%), and the Y-linked regions have a very low π value (0.008%). These differences may be partly due to the fact that the relative effective population sizes are Ne, 3Ne/4, and Ne/4 for an autosomal, an X-linked, and a Y-linked sequence, respectively. However, the extremely low π value for Y-linked sequences may be mainly due to background selection and selective sweeps, because there is no recombination in the Y chromosome except for the pseudoautosomal region. Background selection and selective sweeps should have, on average, stronger effects on an X-linked region than on an autosomal region because of a lower average recombination rate in the X chromosome than in an autosome, partly accounting for the lower π value for X-linked regions. The number of Alu sequences studied is small, but the data suggest a higher average π value for Alus than for other regions. This is not surprising, because Alus have higher mutation rates due to the presence of a high frequency of CpG dinucleotides.
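The comparison between observed diversity and the expectation from effective population size alone can be made explicit. The sketch below simply divides the average diversity values quoted above (0.091%, 0.079%, and 0.008%) by the autosomal value and sets them beside the ratios 1 : 3/4 : 1/4; the variable names are illustrative.

```python
# Average diversity (%) for each chromosome class, as quoted in the text.
diversity = {"autosomal": 0.091, "X-linked": 0.079, "Y-linked": 0.008}

# Relative effective population sizes: Ne, 3Ne/4, and Ne/4.
expected_ratio = {"autosomal": 1.0, "X-linked": 0.75, "Y-linked": 0.25}

observed_ratio = {
    region: value / diversity["autosomal"] for region, value in diversity.items()
}

for region in diversity:
    print(region, round(observed_ratio[region], 3), expected_ratio[region])
```

The Y-linked ratio falls well below one quarter, which is the arithmetic behind the statement that effective population size alone does not account for the low Y-linked value.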
Mutation Rate, θ, and Ne
The average numbers of nucleotide substitutions per site were 0.62% between human and chimpanzee sequences, 1.07% between human and gorilla sequences, and 2.44% between human and orangutan sequences; here, we exclude the four exons. The mutation rates (v) were estimated to be 0.52 × 10⁻⁹, 0.67 × 10⁻⁹, and 1.02 × 10⁻⁹ per nucleotide site per year based on divergence times of 6 Myr between humans and chimpanzees, 8 Myr between humans and gorillas, and 12 Myr between humans and orangutans, respectively; other divergence dates are also considered in table 7. The first two values are considerably smaller than the third, but the differences may be largely due to stochastic fluctuations. The average for the three estimates is 0.74 × 10⁻⁹. This value is considerably lower than the estimate (1.15 × 10⁻⁹) obtained from the 10-kb region on chromosome 22, but it is consistent with the lower nucleotide diversity in this region than in the 10-kb region on chromosome 22. It is possible that this region has a lower (neutral) mutation rate.
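The three rates follow from dividing each divergence by twice the divergence time, since substitutions accumulate along both lineages. The sketch below assumes that relation (it reproduces the quoted values) and uses only the divergences and dates given in the text; the variable names are illustrative.

```python
# (substitutions per site, divergence time in years) for each species pair.
divergence = {
    "chimpanzee": (0.0062, 6e6),
    "gorilla": (0.0107, 8e6),
    "orangutan": (0.0244, 12e6),
}

# Rate per site per year: d / (2T), because both lineages accumulate changes.
rates = {species: d / (2.0 * t) for species, (d, t) in divergence.items()}

# Express in units of 1e-9 and round to two decimals, as in the text.
rounded = {species: round(v * 1e9, 2) for species, v in rates.items()}
mean_of_rounded = round(sum(rounded.values()) / len(rounded), 2)

print(rounded, mean_of_rounded)
```

Averaging the two-decimal values gives the quoted mean; averaging the unrounded rates gives a value about 0.01 × 10⁻⁹ lower, a difference that is negligible next to the uncertainty in the divergence dates.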
If we know the mutation rate, we can estimate the effective population size Ne from θ. As the estimate of the mutation rate varies with the assumption of the divergence dates, the estimate of Ne also varies (table 7). Moreover, it also depends on the estimation methods used (table 7). If we use the average mutation rate obtained above and the average (6.70) of the θ values estimated by Watterson's, Tajima's, and the BLUE methods, we obtain Ne = 12,600, which is not far from the commonly used value of 10,000 (Takahata 1993).
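The estimate of Ne can be reproduced in one step, assuming the standard relation for an autosomal locus in a diploid population, namely that the per-sequence diversity parameter equals 4 × Ne × (mutation rate per sequence per generation). The inputs are the average of 6.70 quoted above and the per-sequence rate of 1.33 × 10⁻⁴ given in the next section; the variable names are illustrative.

```python
theta_per_sequence = 6.70      # average of the three estimates quoted in the text
mu_per_sequence = 1.33e-4      # mutations per sequence per generation

# Diploid autosomal locus: theta = 4 * Ne * mu, so Ne = theta / (4 * mu).
effective_size = theta_per_sequence / (4.0 * mu_per_sequence)

print(round(effective_size, -2))
```

Because Ne is inversely proportional to the mutation rate, any change in the assumed divergence dates shifts the estimate by the same proportion.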
Age of the MRCA
To estimate the age (T) of the MRCA of the sequences in a sample, the values of both Ne and the mutation rate per sequence per generation (µ) are required. As mentioned above, the estimate of the mutation rate depends on the species pair and the divergence dates used. For simplicity, we shall use the average mutation rate obtained above, i.e., v = 0.74 × 10⁻⁹ changes per site per year and µ = 1.33 × 10⁻⁴ changes per sequence per generation in humans; the estimate of T increases with decreasing µ. Table 8 presents the estimates of T for several values of effective population size for the entire sample, the subsample of sequences from Africa only, and the subsample of non-African sequences only. If the commonly used Ne = 10,000 was assumed, the mode estimate (Tmode) and mean estimate (Tmean) were, respectively, 1,376,000 and 1,559,000 years for the entire sample. These estimates were comparable to our previous estimates based on the polymorphism data from a 10-kb region on chromosome 22 (Zhao et al. 2000). The estimates based on the African sample were only somewhat smaller than those based on the entire sample, while those based on the non-African sample were the smallest. This pattern was also consistent with our previous study (Zhao et al. 2000).
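The two mutation rates used here are linked by the generation time and the number of sites analyzed. The sketch below back-calculates the product of those two quantities from the quoted rates; the 20-year generation time is an assumption introduced for illustration, since neither it nor the analyzed length is stated in this section.

```python
rate_per_site_per_year = 0.74e-9        # v, as quoted in the text
rate_per_sequence_per_generation = 1.33e-4

# Ratio of the two rates = generation time (years) x sequence length (sites).
site_years = rate_per_sequence_per_generation / rate_per_site_per_year

assumed_generation_time = 20.0          # years; an assumption, not from the text
implied_length = site_years / assumed_generation_time

print(round(site_years), round(implied_length))
```

Under that assumed generation time the implied length is a little under 9 kb, which is plausible for a region of about 10 kb with exons and unalignable sites excluded.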
In summary, like the data from the PDHA1 locus (Harris and Hey 1999) and the 10-kb region on chromosome 22 (Zhao et al. 2000), the present data also provide evidence for a genetic history of humans that is much more ancient than the emergence of modern humans. The observation that both the region on chromosome 22 and the present region show an ancient genetic history outside of Africa argues against a complete replacement of all indigenous populations in Europe and Asia by an African stock. Moreover, the ancient genetic history of humans indicates no severe bottleneck during the evolution of humans in the last half million years, because much of the ancient genetic history would have been lost during a severe bottleneck. On the other hand, the fact that most available nuclear DNA variation data, as well as mitochondrial DNA data, show a considerably shallower genetic history in Asia and Europe than in Africa suggests that human evolution has not occurred in parallel in different parts of the Old World, as depicted by the multiregional model. Thus, both the Out of Africa and the multiregional models appear to be too simple to explain the evolution of modern humans.
Acknowledgements |
---|
Footnotes |
---|
1 Keywords: nucleotide diversity, DNA variation, human evolution, unique variants
2 Address for correspondence and reprints: Wen-Hsiung Li, Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, Illinois 60637. E-mail: whli{at}uchicago.edu
Literature Cited |
---|
Begun, D. J., and C. F. Aquadro. 1992. Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 356:519–520.
Cann, R. L., M. Stoneking, and A. C. Wilson. 1987. Mitochondrial DNA and human evolution. Nature 325:31–36.
Charlesworth, B. 1994. The effect of background selection against deleterious alleles on weakly selected, linked variants. Genet. Res. 63:213–228.
Don, R. H., P. T. Cox, B. J. Wainwright, K. Baker, and J. S. Mattick. 1991. Touchdown PCR to circumvent spurious priming during gene amplification. Nucleic Acids Res. 19:4008.
Fu, Y. X. 1994. Estimating effective population size or mutation rate using the frequencies of mutations of various classes in a sample of DNA sequences. Genetics 138:1375–1386.
Fu, Y. X. 1996. Estimating the age of the common ancestor of a DNA sample using the number of segregating sites. Genetics 144:829–838.
Fu, Y. X., and W. H. Li. 1993. Statistical tests of neutrality of mutations. Genetics 133:693–709.
Fu, Y. X., and W. H. Li. 1997. Estimating the age of the common ancestor of a sample of DNA sequences. Mol. Biol. Evol. 14:195–199.
Fullerton, S. M., J. Bond, J. A. Schneider, B. Hamilton, R. M. Harding, A. J. Boyce, and J. B. Clegg. 2000. Polymorphism and divergence in the β-globin replication origin initiation region. Mol. Biol. Evol. 17:179–188.
Griffiths, R. C., and S. Tavaré. 1994. Ancestral inference in population genetics. Stat. Sci. 9:307–319.
Harding, R. M., S. M. Fullerton, R. C. Griffiths, J. Bond, M. J. Cox, J. A. Schneider, D. S. Moulin, and J. B. Clegg. 1997. Archaic African and Asian lineages in the genetic ancestry of modern humans. Am. J. Hum. Genet. 60:772–789.
Harris, E. E., and J. Hey. 1999. X chromosome evidence for ancient human histories. Proc. Natl. Acad. Sci. USA 96:3320–3324.
Jaruzelska, J., E. Zietkiewicz, M. Batzer, D. E. C. Cole, J.-P. Moisan, R. Scozzari, S. Tavaré, and D. Labuda. 1999. Spatial and temporal distribution of the neutral polymorphisms in the last ZFX intron: analysis of the haplotype structure and genealogy. Genetics 152:1091–1101.
Jaruzelska, J., E. Zietkiewicz, and D. Labuda. 1999. Is selection responsible for the low level of variation in the last intron of the ZFY locus? Mol. Biol. Evol. 16:1633–1640.
Kaessmann, H., F. Heißig, A. von Haeseler, and S. Pääbo. 1999. DNA sequence variation in a non-coding region of low recombination on the human X chromosome. Nat. Genet. 22:78–81.
Li, W. H. 1997. Molecular evolution. Sinauer, Sunderland, Mass.
Li, W. H., and L. Sadler. 1991. Low nucleotide diversity in man. Genetics 129:513–523.
Li, W. H., C.-I. Wu, and C.-C. Luo. 1984. Nonrandomness of point mutation as reflected in nucleotide substitutions in pseudogenes and its evolutionary implication. J. Mol. Evol. 21:58–71.
Nickerson, D. A., S. L. Taylor, K. M. Weiss, A. G. Clark, R. G. Hutchinson, J. Stengard, V. Salomaa, E. Vartiainen, E. Boerwinkle, and C. F. Sing. 1998. DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nat. Genet. 19:233–240.
Rieder, M. J., S. L. Taylor, A. G. Clark, and D. A. Nickerson. 1999. Sequence variation in the human angiotensin converting enzyme. Nat. Genet. 22:59–62.
Rozas, J., and R. Rozas. 1999. DnaSP version 3: an integrated program for molecular population genetics and molecular evolution analysis. Bioinformatics 15:174–175.
Stringer, C. B., and P. Andrews. 1988. Genetic and fossil evidence for the origin of modern humans. Science 239:1263–1268.
Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437–460.
Tajima, F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.
Takahata, N. 1993. Allelic genealogy and human evolution. Mol. Biol. Evol. 10:2–22.
Watterson, G. A. 1975. On the number of segregating sites. Theor. Popul. Biol. 7:256–276.
Whitfield, L. S., J. E. Sulston, and P. N. Goodfellow. 1995. Sequence variation of the human Y chromosome. Nature 378:379–380.
Zhao, Z., J. Li, Y.-X. Fu et al. (13 co-authors). 2000. Worldwide DNA sequence variation in a 10 kb noncoding region on human chromosome 22. Proc. Natl. Acad. Sci. USA 97:11354–11358.
Zietkiewicz, E., V. Yotova, M. Jarnik et al. (11 co-authors). 1998. Genetic structure of the ancestral population of modern humans. J. Mol. Evol. 47:146–155.