Genetic variation in mRNA coding sequences of highly conserved genes
ANNELOOR L. M. A. TEN ASBROEK1,
JEFFREY OLSEN2,
DAVID HOUSMAN3,
FRANK BAAS1 and
VINCE STANTON, JR.2
1 Neurozintuigen Laboratory, Academic Medical Center, 1105 AZ, Amsterdam, The Netherlands
2 Variagenics, Cambridge 02139-1562
3 Center for Cancer Research, Massachusetts Institute of Technology, Cambridge, Massachusetts 02140
ABSTRACT
The frequency and distribution of genetic polymorphism in the human genome are questions of major importance. We have studied them in highly conserved genes, which encode crucial functions such as DNA replication, mRNA transcription, and translation. Evolutionary comparisons suggest that these genes are under particularly strong selective pressure, and their frequency of nucleotide sequence polymorphism would be expected to represent a minimum estimate for sequence variation throughout the genome. We have analyzed the complete coding sequence and the 3'-untranslated region (3'-UTR) of 22 human genes, most of which have homologs in all cellular organisms and all of which are at least 25% amino acid identical to homologs in yeast. Comparisons with similar studies of less conserved human disease genes indicate that 1) evolutionarily conserved genes are, on average, less polymorphic than disease-related genes; 2) the difference in polymorphism levels is attributable almost entirely to reduced levels of variation in protein coding sequences, whereas noncoding sequences have similar levels of polymorphism; and 3) the character of polymorphism, in terms of the spectrum and frequency of mutational changes, is similar.
polymorphism; diversity; single nucleotide polymorphisms; human
INTRODUCTION
GENETIC DIVERSITY IN A SPECIES depends on the mutation rate, the size and demographic history of the species, the length of time over which diversity occurs, and biological factors such as selection. The human genome comprises a diverse collection of genes, with some of recent origin and others, like the tRNA synthetases, likely dating back to the origins of cellular life. Recently, much emphasis has been placed on the analysis of all sequence variants in the human genome, especially single nucleotide polymorphisms (SNPs). SNPs are often biallelic and consequently less informative than microsatellite markers; however, they occur more frequently and are more mutationally stable. Therefore, they have been suggested to be excellent markers for genetic linkage analysis or genetic association studies used in the search for variants that contribute to risk factors for genetic disease. It has been hypothesized that a great proportion of the risk factors or susceptibility alleles for common diseases are in fact alleles that occur at a considerable frequency in the human population (3, 12, 18). This common disease-common variant (CD-CV) hypothesis assumes that there is considerable variation in DNA sequences in the human genome. The use of automated sequencers, DNA probe arrays, and semi-automated denaturing high-pressure liquid chromatography (DHPLC) has allowed systematic analysis of large stretches of DNA in many different individuals (1, 2, 8, 9, 12, 15, 16). These studies, even though they used different methodologies for mutation detection and various sample sizes, yielded estimates of nucleotide diversity in coding and noncoding regions of the human genome. However, most studies analyzed genes containing regions that are not expected to be under stringent selective pressure, e.g., genes with redundant functions (1, 2, 8, 16, 20) or segments of noncoding genomic DNA (9, 12, 15).
Other studies focused on single ethnic groups, which might lead to an underestimation of the sequence variation in the genome (1, 16, 20).
The present study represents an effort to obtain a more rounded picture of human genetic diversity by examining variation in the transcribed sequences of highly conserved genes in a panel of 36 individuals of various ethnic origins. We report the characterization of SNPs in genes involved in essential pathways, for which homozygous null alleles are expected to be lethal to the cell. Of these genes, 19 of 22 are in the apparent minimal gene set required for cellular life (7, 14). Since the genes analyzed are presumed to be under strong selective pressure, we expect that this set of genes will give us a lower bound on nucleotide variation. This should allow us to define the lowest level of sequence variation in the human genome.
METHODS
Cell lines and RNA used for polymorphism analysis.
Cell lines were obtained from Coriell Cell Repository or were gifts from colleagues. RNA was isolated using Trizol reagent (GIBCO-BRL).
Identification and confirmation of SNPs.
cDNA synthesis and single-strand conformation (SSC) analysis were performed according to standard procedures (11, 17). Two approaches were used. For CTPS, POLR2A, POLR2C, and RPS6, internally labeled PCR fragments of about 220 nt, overlapping at least 50 nt, were analyzed under at least two conditions (SSC gels with and without 10% glycerol). For all the other genes analyzed, overlapping 0.8- to 1.5-kb segments of the genes were PCR amplified from cDNA in the presence of [α-32P]dCTP to obtain internally labeled DNA fragments. These fragments were then digested with endonucleases to obtain fragments ranging in size from 90 to 400 nt, suitable for single-strand conformation polymorphism (SSCP) analysis. Each 0.8- to 1.5-kb fragment was restricted in three separate reactions using different restriction endonucleases to obtain threefold redundant coverage of each gene. All restriction digests were then run on two SSCP gels that differed in gel composition (5.5% acrylamide/1x TBE buffer vs. 8% acrylamide/10% glycerol/1x TBE buffer, where 1x TBE is 89 mM Tris base, 89 mM boric acid, and 2 mM EDTA) and running conditions (room temperature vs. 4°C). Thus each nucleotide pair was examined on six SSCP gels (3 digests × 2 running conditions), for a total of 12 analyses (both strands visualized on all gels). Samples with SSC bands with altered mobility were sequenced using Big-Dye terminator chemistry (Perkin-Elmer, Foster City, CA) and analyzed on an ABI-377 sequencer. To determine the sensitivity of the SSC analysis, we sequenced the complete cDNA for EPRS and TYMS in the 36 samples used in the study. cDNA was PCR amplified and subjected to dye terminator chemistry sequence analysis. PolyPhred analysis of sequences generated on an ABI-3700 sequencer did not yield polymorphisms that were not identified by SSC analysis (5, 6). In addition, we compared the SSC analysis with mutation detection by DHPLC. In that comparison, none of the 50 samples that were previously screened for mutations by SSC yielded a novel SNP. This analysis was done on a separate set of samples and for other genes (the myelin genes PMP22 and MPZ). Because this analysis is not extensive, we estimate a sensitivity of mutation detection of >95% by SSC. For heterozygosity analysis and SNP detection, samples with the minor allele as well as samples with the major allele on SSC were sequenced.
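The overlapping-fragment layout used for the first approach (fragments of about 220 nt overlapping by at least 50 nt) can be illustrated with a minimal sketch. This is not the primer design used in the study: the function name `tile_fragments`, the half-open coordinate convention, and the choice to shift the final fragment back so that it keeps the full size are all assumptions made for illustration.

```python
def tile_fragments(length, size=220, min_overlap=50):
    """Tile a sequence of `length` nt with fragments of `size` nt.

    Returns half-open (start, end) intervals. Consecutive fragments
    overlap by at least `min_overlap` nt. Hypothetical helper, not the
    authors' procedure.
    """
    if size <= min_overlap:
        raise ValueError("size must exceed min_overlap")
    step = size - min_overlap
    tiles = []
    start = 0
    # Emit full-size fragments while another fragment is still needed.
    while start + size < length:
        tiles.append((start, start + size))
        start += step
    # Final fragment is anchored at the 3' end, so it may overlap its
    # neighbour by more than min_overlap.
    tiles.append((max(0, length - size), length))
    return tiles


if __name__ == "__main__":
    for start, end in tile_fragments(1000):
        print(start, end)
```

Anchoring the last fragment at the end of the sequence keeps every fragment within the size range suited to SSC gels, at the cost of a larger final overlap.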
Nucleotide diversity estimation.
Nucleotide diversity and its standard deviation were calculated under the assumption of an infinite-sites neutral allele model as follows. The normalized number of variant sites is $\theta = K/(aL)$, with standard deviation $\mathrm{SD}(\theta) = \sqrt{aL\theta + bL^{2}\theta^{2}}\,/\,(aL)$, where $a = \sum_{i=1}^{n-1} 1/i$, $b = \sum_{i=1}^{n-1} 1/i^{2}$, K is the number of SNPs, L is the genome length analyzed (in bp), and n is the number of alleles analyzed (n = 72). Statistical significance was determined by the paired Student's t-test.
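As a rough illustration of this calculation, the sketch below computes the normalized number of variant sites and a standard deviation from K, L, and n. It assumes the standard Watterson form under the infinite-sites neutral model without recombination, in which a is the sum of 1/i and b is the sum of 1/i² for i from 1 to n − 1; this form and the function name `watterson_theta` are assumptions and are not taken from the text. The example call pools all 61 SNPs over the 51,154 bp screened, which is for illustration only and does not reproduce the per-class values of Table 3.

```python
import math


def watterson_theta(k, length_bp, n_alleles):
    """Return (theta, sd) for k variant sites in length_bp bases.

    Assumed model: infinite sites, neutral, no recombination, so that
    Var(K) = a*theta*L + b*(theta*L)**2.
    """
    a = sum(1.0 / i for i in range(1, n_alleles))
    b = sum(1.0 / i ** 2 for i in range(1, n_alleles))
    theta = k / (a * length_bp)
    # Variance of theta is Var(K) divided by (a*L)**2.
    variance = theta / (a * length_bp) + b * theta ** 2 / a ** 2
    return theta, math.sqrt(variance)


if __name__ == "__main__":
    # Pooled illustration: 61 SNPs, 51,154 bp, 72 alleles (36 individuals).
    theta, sd = watterson_theta(61, 51_154, 72)
    print(f"theta = {theta:.2e}, sd = {sd:.2e}")
```

Because a depends on n, the same number of SNPs per base yields a smaller normalized value in a larger sample, which is what makes estimates from studies of different sizes comparable.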
RESULTS
We selected 22 highly conserved genes for polymorphism screening, all of which have homologs in Saccharomyces cerevisiae (Table 1). Of these genes, 19 of 22 are in the apparent minimal gene set required for cellular life (7, 14). This is a collection of 250–300 genes derived from a comparison of the genomes of the distantly related prokaryotes Mycoplasma genitalium (which has 468 predicted genes, the smallest known gene complement in a cellular life form) and Haemophilus influenzae (which has 1,703 predicted genes). The three other genes selected are eukaryotic initiation factor 5A, CTP synthetase, and the 30-kDa TATA-associated factor (TAFII30). The former two have clear functional counterparts in prokaryotic genomes. The extent of amino acid identity between human and yeast homologs is at least 27% and ranges up to 82%, whereas the degree of cross-species amino acid similarity ranges from 44% up to 88% (Table 1). In view of the interspecies conservation of these genes, it is likely that they have been under continuous selective pressure for millions of years. Therefore it might be expected that they exhibit lower divergence, or a different distribution of polymorphic changes, than other, less conserved, genes.
Table 1. Conserved human genes analyzed for nucleotide polymorphisms, and protein similarity and identity to yeast orthologs
Variation in the transcribed regions of these genes was determined by systematically screening cDNA from a panel of 36 human cell lines established from individuals of diverse ethnic, racial, and geographic origin. The panel includes 4 Chinese, 4 Japanese, 2 other Asians, 1 cell line from India, 1 cell line from Saudi Arabia, 4 African Americans, 4 Hispanic Americans, 5 whites of Southern European origin, and 11 whites of Central and Northern European origin. Polymorphism screening was by the SSCP method. To maximize the likelihood of detecting DNA sequence variants, SSC analysis was performed under at least two conditions with fragments overlapping at least 50 bp. Samples exhibiting altered mobility on an SSCP gel were then sequenced to determine the underlying nucleotide sequence change.
In total, the cDNA sequence screened amounts to 51,154 bp, covering 42,270 bp of coding sequence and 8,884 bp of 3'-untranslated region (3'-UTR). Sixty-five genetic variants were detected, of which 61 are SNPs and 4 are insertions (3 single-nucleotide insertions, 1 complex repeat alteration) (Table 2). The base insertions (RPA1 2296, RRM1 2724, and POLR2C 3'-UTR) we detected all occur in a repetitive sequence environment and are compatible with slippage during replication.
Of the 61 SNPs, 51 (84%) were transitions, and 27 of the SNPs are in CpG dinucleotides, 25 of which are consistent with the proposed CpG methylation-mediated deamination mechanism. These numbers are in the same range as found by other studies reporting 68% of SNPs to be transitions (4) and 31–35% of SNPs in CpG (4, 8). The SNPs do not show a preference of mutation to the flanking nucleotide (data not shown). Eleven amino acid substitutions were detected in 14,090 codons analyzed, resulting in an amino acid substitution rate of 1:1,281. These variants were seen with allele frequencies in the range of 1–13%. Some variations are nonconservative, e.g., a proline for histidine substitution in EPRS and an arginine for cysteine substitution in POLR2A (Table 2).
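The summary figures in this paragraph follow directly from the counts reported above; the short sketch below reproduces the arithmetic. The function name `summarize_coding_variation` and the dictionary keys are hypothetical, chosen only for this illustration.

```python
def summarize_coding_variation(coding_bp, aa_substitutions, snps, transitions):
    """Derive the summary ratios from the raw counts reported in the text."""
    codons = coding_bp // 3  # coding sequence length in codons
    return {
        "codons": codons,
        # One amino acid substitution per this many codons.
        "codons_per_substitution": round(codons / aa_substitutions),
        # Share of SNPs that are transitions, in percent.
        "transition_percent": round(100 * transitions / snps),
    }


if __name__ == "__main__":
    # 42,270 bp coding sequence, 11 substitutions, 61 SNPs, 51 transitions.
    print(summarize_coding_variation(42_270, 11, 61, 51))
```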
DISCUSSION
To get a more rounded picture of sequence diversity in the human genome, we analyzed the variation of highly conserved genes in transcribed sequences.
Our analysis results in an overall polymorphism frequency of 65 variants in 51,154 bp. The SNP frequency we identified in coding regions was considerably lower than recently reported in other large studies (2, 8), whereas the frequency for noncoding regions was similar. Since the number of polymorphisms (SNPs) will depend on the number of individuals and length of DNA sequences assayed, a normalized term for nucleotide variation (θ) should be used. The nucleotide diversity is shown in Table 3. SNPs were found at a similar overall frequency in the noncoding regions analyzed (5.7 × 10⁻⁴ and 5.3 × 10⁻⁴). This underscores the fact that our screening method (SSC analysis) has a sensitivity similar to that of DNA chips and DHPLC used in other studies. In addition, sequence analysis of the entire EPRS and TYMS cDNA in all 36 samples we analyzed did not yield polymorphisms that had not been identified by SSC analysis. SNPs in coding regions, however, differ significantly from the 3'-UTR SNPs. Both synonymous and nonsynonymous SNPs were less frequent in our study than in the data of Cargill et al. (2), which are closer to our data than those of Halushka et al. (8). In particular, the frequency of nonsynonymous SNPs, which affect the amino acid sequence of a gene product, was significantly reduced in our set of genes (Table 3, P < 0.001).
Several explanations could account for the different nucleotide substitution rates we observed. A difference in detection of nucleotide substitutions seems highly unlikely, since we find an almost identical frequency of variants in the noncoding regions when we compare our study with these two large studies (see Table 3). Therefore, we must conclude that the difference observed in coding regions is due to the nonrandom set of genes analyzed. The genes in our analysis seem to be under strong selective pressure, and it is likely that this selection is due to the nature of the genes analyzed. Most of them are within the minimal set required for cellular life (7, 14), and the remaining three are involved in essential processes like nucleotide metabolism and transcription. Another recent study (19) was aimed at identifying a lower-limit estimate for nonsynonymous SNPs that might have phenotypic effects. This was done by comparative analysis of database sets of disease-causing mutations, interspecies conservation of substitutions, and nonsynonymous SNPs from public databases (thought to be neutral) with respect to protein function and structure. Those data demonstrate that variants in structurally important sites are not selectively neutral. This would support evolutionary constraint as an explanation for the lower frequency of SNPs we detected. However, in the large data set of Cargill et al. (2) the nonsynonymous variants are detected at 38% of the rate of synonymous changes. We observe the same proportionality (37%). Therefore, the lower diversity that we have found for both synonymous and nonsynonymous changes might be explained by a lower intrinsic mutation rate of the genes affected, rather than a greater selection pressure on these genes.
In conclusion, nucleotide polymorphism rates vary greatly within the human genome. We report the lowest variation determined thus far. However, even these low nucleotide substitution rates result in significant variation in coding sequences. Of the polymorphisms detected, 49% (32 of 65) have a heterozygote frequency of >20% and thus contribute significantly to the variants present in an unselected population. Since our study was aimed at identifying the polymorphism frequency in presumed essential genes, our data might reflect the lowest level of DNA sequence variation in the human population.
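To give a sense of what a heterozygote frequency of >20% implies for allele frequency, the sketch below relates the two under Hardy-Weinberg proportions for a biallelic site. This is an assumption made for illustration: the text does not state that heterozygote frequencies were derived this way, and the function names are hypothetical.

```python
import math


def expected_heterozygosity(p):
    """Expected heterozygote frequency 2p(1 - p) for allele frequency p,
    assuming Hardy-Weinberg proportions at a biallelic site."""
    return 2.0 * p * (1.0 - p)


def min_minor_allele_frequency(h):
    """Smallest minor allele frequency whose expected heterozygosity is h.

    Solves 2p(1 - p) = h for the root with p <= 0.5.
    """
    if not 0.0 <= h <= 0.5:
        raise ValueError("h must lie between 0 and 0.5")
    return (1.0 - math.sqrt(1.0 - 2.0 * h)) / 2.0


if __name__ == "__main__":
    p = min_minor_allele_frequency(0.20)
    print(f"minor allele frequency at 20% heterozygosity: {p:.3f}")
```

Under this assumption, a heterozygote frequency above 20% corresponds to a minor allele frequency above roughly 11%.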
ACKNOWLEDGMENTS
We thank D. Joyce for expert technical assistance, Dr. H. Tabak for comments on the manuscript, and Dr. J. Ruijter for help with the statistical analysis.
This work was supported by a grant from Variagenics to F. Baas.
FOOTNOTES
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).
Address for reprint requests and other correspondence: F. Baas, Neurozintuigen Laboratory, AMC, Meibergdreef 9, 1105 AZ, Amsterdam, The Netherlands (E-mail: f.baas{at}amc.uva.nl).
REFERENCES
1. Cambien F, Poirier O, Nicaud V, Herrmann SM, Mallet C, Ricard S, Behague I, Hallet V, Blanc H, Loukaci V, Thillet J, Evans A, Ruidavets JB, Arveiler D, Luc G, and Tiret L. Sequence diversity in 36 candidate genes for cardiovascular disorders. Am J Hum Genet 65: 183–191, 1999.
2. Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, Patil N, Lane CR, Lim EP, Kalayanaraman N, Nemesh J, Ziaugra L, Friedland L, Rolfe A, Warrington J, Lipshutz R, Daley GQ, and Lander ES. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet 22: 231–238, 1999.
3. Collins FS, Guyer MS, and Chakravarti A. Variations on a theme: cataloging human DNA sequence variation. Science 278: 1580–1581, 1997.
4. Cooper D, Krawczak M, and Antonarakis SE. The nature and mechanism of human gene mutation. In: The Metabolic Basis of Inherited Disease (7th ed.), edited by Scriver CR, Beaudet AL, Sly WS, and Valle D. 1995, p. 259–292.
5. Ewing B, Hillier L, Wendl MC, and Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8: 175–185, 1998.
6. Ewing B and Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8: 186–194, 1998.
7. Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM, Fritchman JL, Weidman JF, Small KV, Sandusky M, Fuhrmann J, Nguyen D, Utterback TR, Saudek DM, Phillips CA, Merrick JM, Tomb JF, Dougherty BA, Bott KF, Hu PC, and Lucier TS. The minimal gene complement of Mycoplasma genitalium. Science 270: 397–403, 1995.
8. Halushka MK, Fan JB, Bentley K, Hsie L, Shen N, Weder A, Cooper R, Lipshutz R, and Chakravarti A. Patterns of single-nucleotide polymorphisms in candidate genes regulating blood-pressure homeostasis. Nat Genet 22: 239–247, 1999.
9. Harding RM, Fullerton SM, Griffiths RC, Bond J, Cox MJ, Schneider JA, Moulin DS, and Clegg JB. Archaic African and Asian lineages in the genetic ancestry of modern humans. Am J Hum Genet 60: 772–789, 1997.
10. Jeffreys AJ. DNA sequence variants in the G gamma-, A gamma-, delta- and beta-globin genes of man. Cell 18: 1–10, 1979.
11. Kulkens T, Bolhuis PA, Wolterman RA, Kemp S, te Nijenhuis S, Valentijn LJ, Hensels GW, Jennekens FG, de Visser M, Hoogendijk JE, and Baas F. Deletion of the serine 34 codon from the major peripheral myelin protein P0 gene in Charcot-Marie-Tooth disease type 1B. Nat Genet 5: 35–39, 1993.
12. Lai E, Riley J, Purvis I, and Roses A. A 4-Mb high-density single nucleotide polymorphism-based map around human APOE. Genomics 54: 31–38, 1998.
13. Lander ES. The new genomics: global views of biology. Science 274: 536–539, 1996.
14. Mushegian AR and Koonin EV. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci USA 93: 10268–10273, 1996.
15. Nickerson DA, Taylor SL, Weiss KM, Clark AG, Hutchinson RG, Stengard J, Salomaa V, Vartiainen E, Boerwinkle E, and Sing CF. DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nat Genet 19: 233–240, 1998.
16. Ohnishi Y, Tanaka T, Yamada R, Suematsu K, Minami M, Fuji K, Hoki N, Kodama K, Nagata S, Hayashi T, Kinoshita N, Sato H, Sato H, Kuzuya T, Takeda H, Hori M, and Nakamura Y. Identification of 187 single nucleotide polymorphisms (SNPs) among 41 candidate genes for ischemic heart disease in the Japanese population. Hum Genet 106: 288–292, 2000.
17. Orita M, Sekiya T, and Hayashi K. DNA sequence polymorphisms in Alu repeats. Genomics 8: 271–278, 1990.
18. Risch N and Merikangas K. The future of genetic studies of complex human diseases. Science 273: 1516–1517, 1996.
19. Sunyaev S, Ramensky V, and Bork P. Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet 16: 198–200, 2000.
20. Unoki M, Furuta S, Onouchi Y, Watanabe O, Doi S, Fujiwara H, Miyatake A, Fujita K, Tamari M, and Nakamura Y. Association studies of 33 single nucleotide polymorphisms (SNPs) in 29 candidate genes for bronchial asthma: positive association with a T924C polymorphism in the thromboxane A2 receptor gene. Hum Genet 106: 440–446, 2000.