Evolutionary Dynamics of Oncogenes and Tumor Suppressor Genes: Higher Intensities of Purifying Selection than Other Genes

Michael A. Thomas*,1, Benjamin Weston*, Moltu Joseph*, Wenhua Wu*, Anton Nekrutenko†,2 and Peter J. Tonellato*

* Bioinformatics Research Center, Medical College of Wisconsin, Milwaukee
† Department of Ecology and Evolution, University of Chicago


    Abstract
 
Oncogenes and tumor suppressor genes (hereafter referred to as "cancer genes") result in cancer when they experience substitutions that prevent or distort their normal function. We examined evolutionary pressures acting on cancer genes and other classes of disease-related genes and compared our results to analyses of genes without known association to disease. We compared synonymous and nonsynonymous substitution rates in 3,035 human genes (approximately 10% of the genome), measuring the intensity of purifying selection on 311 human disease genes, including 121 cancer-related genes. Although the genes examined are similar to nondisease genes in product, expression, function, and pathway affiliation, we found intriguing differences in the selective pressures experienced by cancer genes relative to other (noncancer) disease-related and non–disease-related genes. We found a statistically significant increase in the intensity of purifying selection exerted on cancer genes (the average ratio of nonsynonymous to synonymous substitutions, ω, was 0.079) relative to all other disease-related gene groups () and non–disease-related genes (). This difference indicates a striking increase in selection against nonsynonymous substitutions in oncogenes and tumor suppressor genes. This finding provides insight into the etiology of cancer and the differences between genes involved in cancer and those implicated in other human diseases. Specifically, we found a significant overlap between human oncogenes and tumor suppressor genes and "essential genes," human homologs of mouse lethal genes identified by knockout experiments. This insight may improve our ability to identify cancer-related genes and enhance our understanding of the nature of these genes.

Key Words: Darwinian selection • purifying selection • disease gene • oncogene • synonymous and nonsynonymous substitutions


    Introduction
 
"Disease genes" have a verified, heritable increased propensity for particular human disorders (Fearon 1997; Mushegian et al. 1997; Fortini et al. 2000; Rubin et al. 2000; Reiter et al. 2001), whereas "nondisease genes" have no known association with any disorder. Apart from association with disease, disease-related genes play typical roles in the cell, acting as transmembrane receptors, cytoplasmic regulatory proteins, transcription factors, cell cycle factors, or other factors (Fearon 1997) and become disease genes when they experience specific nonsynonymous substitutions (Nesse 2001; Stearns and Ebert 2001). Why these genes cause disease when other genes that experience similar rates of nonsynonymous substitutions do not is unknown.

Nucleotide substitutions in protein-coding regions occur in two types: synonymous and nonsynonymous. Synonymous (or silent) nucleotide substitutions do not change the amino acid encoded by the mutated codon; nonsynonymous substitutions result in a translation change, often with profound implications for the resulting protein (Nei and Kumar 2000). Typically, nucleotide substitutions in the third codon position are silent, whereas substitutions in the first and second codon positions result in an amino acid change; highly parameterized maximum-likelihood and other methods take this and related factors (such as transition/transversion bias) into account when calculating substitution rates (Nei and Kumar 2000; Yang and Nielsen 2000). The rate of nonsynonymous substitutions per nonsynonymous substitution site (KA) varies greatly from gene to gene due to varying intensities of purifying selection (and other factors). The rate of synonymous substitutions per synonymous substitution site (KS) is related to µ, the average mutation rate for the genome as a whole, and should be similar among genes (Kumar and Subramanian 2002). The ratio of rates of nonsynonymous to synonymous substitutions (KA/KS, or ω) is a measure of accepted substitutions normalized for opportunity (Liberles 2001) and can indicate whether or not selection is occurring and the degree and type of selection (Nei and Kumar 2000; Yang and Nielsen 2000). Under neutral evolution, KA and KS are expected to be equal (ω = 1); deviation of KA from KS may be due to positive Darwinian selection when ω > 1 or purifying (stabilizing) selection when ω < 1. For some genes experiencing positive Darwinian selection, the mean ω for the entire gene may be <1.0 while portions of the gene (areas not under protein-coding selective constraints, for example) may have values of ω > 1.0 (Liberles 2001).
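The distinction between the two substitution types can be illustrated with a minimal Python sketch. It only tallies codons that differ between two aligned, in-frame, gap-free coding sequences and labels each difference as synonymous or nonsynonymous using the standard genetic code. It is not the maximum-likelihood method used in this study: the raw counts are not normalized per site or corrected for multiple hits, so they are not KA or KS. The function names and the example sequences are hypothetical and chosen for illustration only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Both sequences must be aligned, in frame, gap-free, and equal in length.
    These are raw counts, not per-site rates such as KA or KS.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1  # same amino acid: silent difference
        else:
            nonsynonymous += 1  # amino acid changed
    return synonymous, nonsynonymous


def classify_omega(omega):
    """Interpret a KA/KS ratio using the thresholds described in the text."""
    if omega > 1:
        return "positive"
    if omega < 1:
        return "purifying"
    return "neutral"


# Hypothetical 3-codon example: GCT -> GCC is silent (Ala), AAA -> AGA is Lys -> Arg.
print(count_codon_differences("ATGGCTAAA", "ATGGCCAGA"))
print(classify_omega(0.079))
```

Keeping the interpretation step separate from the counting step mirrors the text: the ratio is meaningful only after the rates themselves have been estimated by a proper method.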

Previous researchers used analyses of relative substitution rates to measure the strength of purifying selection (selection against amino acid changes) on individual disease genes (Hurst and Pal 2001), but no study has demonstrated that this is a general characteristic of disease genes or any subsets of disease genes. This information would augment other research on the functional classification of disease genes (Jimenez-Sanchez, Childs, and Valle 2001) and contribute to our understanding of disease.

We identified 311 human genes implicated in disease and divided these into six functional classes: oncogenes (including tumor suppressor genes), immune system genes, metabolic genes, muscle and bone genes, nervous system genes, and transport genes. We then measured the relative substitution rates experienced by these genes (relative to rodent homologs) and compared these rates to an appropriate set of non–disease-related genes. We hypothesized that these classes of disease-related genes will have experienced different selective pressures than genes that are not typically involved in disease due to the different implications for fitness.


    Materials and Methods
 
A set of 3,035 genes was obtained using methods similar to previous research (Nekrutenko, Makova, and Li 2002). These genes are a subset of human genes with rodent homologs obtained from HomoloGene, from which uncharacterized and incomplete genes were removed. We used mouse orthologs (2,879 genes) for most of our calculations, but used rat orthologs (158 genes) when mouse sequence was unavailable. (Makalowski and Boguski [1998] found a strong correlation between substitution rates of mouse and rat.) For the 3,035 genes, we calculated the KS and KA substitution rates and the value of ω using Yang's maximum-likelihood method (Yang and Nielsen 2000) as implemented by Nekrutenko, Makova, and Li (2002).

Sequence comparisons between rodents and humans make the assumption that these species have experienced similar selection and fitness effects, despite having diverged 90 to 109 MYA (Kumar and Hedges 1998; Nei, Xu, and Glazko 2001; Kumar and Subramanian 2002). Of course, these effects are difficult to measure. A third mammalian group, such as an ungulate, will greatly increase the robustness of this analysis when data become available for such a three-way comparison.

For comparison purposes, we used the data set of Nekrutenko, Makova, and Li (2002) (153 human genes with known exonic structure and mouse homologs [table 1]). These genes are fully curated and well understood, making them an appropriate check that our 3,035 genes were not unusual. Another data set (Makalowski and Boguski 1998), while containing many more homologous gene pairs than the Nekrutenko data set (1,880 versus 153), reports rates estimated using a method (Ina 1995) that is not as accurate, in certain circumstances, as the maximum-likelihood method (Bielawski, Dunn, and Yang 2000; Yang and Bielawski 2000; Yang and Nielsen 2000) used in this study.


 
Table 1 Comparison of Results with Previous Studies.

 
From our 3,035-gene set, we extracted 311 genes implicated in disease (including 121 cancer genes). This was supported by generally accepted evidence from three databases: NCBI's LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink/), the Gene Ontology Consortium (GO) (http://geneontology.org/), and the Cancer Genome Anatomy Project (CGAP) (http://cgap.nci.nih.gov/Genes/CuratedGeneLists/). Genes were categorized into six classes (see table 2): cancer, immune system, metabolism, muscle and bone, nervous system, and transporters. (The remaining genes were categorized as "unclassified" because they did not belong in the six classes and were used as a comparison set.) Disease-related genes from LocusLink and GO have curated disease associations and identification numbers from the Online Mendelian Inheritance in Man (OMIM) database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM). For each class, we divided the list of genes into disease-related and nondisease genes (the numbers of genes in each class are reported in table 2).


 
Table 2 Results from Our Analysis of 3,035 Genes.

 
To test for an overall effect, we conducted ANOVA tests on normalized (log-transformed) ω values. Genes with ω of 0 were removed from the ANOVA data because a log could not be calculated. The hypothesis of equal rates was rejected when the data were organized into six disease classes + nondisease genes () or when they were organized into six functional groups () and the 3,035 genes (). We then explored the differences indicated by the ANOVA test by comparing the KA, KS, and ω values (1) between disease genes (as a group) and nondisease genes (minus all disease genes, for a total of 2,724 genes); (2) between each disease gene class and the nondisease gene set (minus all disease); and (3) between each disease gene group and the nondisease gene group corresponding to that class (e.g., metabolic disease genes versus metabolic nondisease genes). We used the Mann-Whitney test (Mann and Whitney 1947; Zar 1999), as implemented by the R statistical package (Gentleman and Ihaka 1996), for pairwise comparison of the substitution rates to determine whether rates of evolution acting on the gene groups differ between humans and rodents (Zar 1999). We also used a t-test to analyze the normalized (log-transformed) ω values. The Mann-Whitney test is superior to the parametric t-test for these comparisons because the former makes no assumptions of normality of distribution. We corrected for multiple comparisons with Bonferroni's adjustment, setting for most pairwise comparisons.
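The pairwise testing procedure described above can be sketched in a few lines of Python using only the standard library. This is an illustrative approximation, not the R implementation used in the study: the Mann-Whitney P value comes from the large-sample normal approximation without tie or continuity correction, and the function names, the example ω values, and the number of comparisons passed to the Bonferroni helper are all hypothetical.

```python
import math


def log_transform(omegas):
    """Natural-log transform, dropping zeros because their log is undefined."""
    return [math.log(w) for w in omegas if w > 0]


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney test via the normal approximation.

    Returns (U statistic for sample_a, two-sided P value). No tie or
    continuity correction is applied, so the P value is approximate.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    return u_a, math.erfc(abs(z) / math.sqrt(2))


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni's adjustment."""
    return alpha / n_comparisons


# Hypothetical omega values for two small gene groups (not data from this study).
group_a = [0.01, 0.02, 0.03, 0.04, 0.05]
group_b = [0.10, 0.20, 0.30, 0.40, 0.50]
u_stat, p_value = mann_whitney_u(group_a, group_b)
threshold = bonferroni_alpha(0.05, 6)  # 6 comparisons is an assumed count
print(u_stat, p_value, p_value < threshold)
```

Because the test works on ranks, it gives the same result on raw or log-transformed ω values; the log transform matters only for the parametric ANOVA and t-test.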


    Results
 
The 3,035 genes were distributed across the genome in a pattern similar to other, larger data sets (Thomas, unpublished data), indicating that these genes are representative of the human genome in this respect. The mean value of KS for all of the genes examined was 0.66, and no group of genes was significantly different from this value, as one would expect given previous research (Kumar and Subramanian 2002). The only exception to this was the immune disease–related genes (), but the sample size for this group was too small to be conclusive.

The average values within these sets were consistent with the previously supported hypothesis that mammalian genes are under strong purifying selection ( [see table 1]) (Makalowski and Boguski 1998; Nekrutenko, Makova, and Li 2002). The value of ω for disease genes as a group (311 genes) was not significantly different from the set of 2,724 nondisease genes (table 2), according to a Mann-Whitney test ( [see below]). This remained the case when the class of fast-evolving immune system genes was removed from the analysis. Immune genes typically experience less intense purifying selection and may bias the results (Hurst and Smith 1999).

When cancer-related disease genes were compared with genes outside of that class (3,035 genes minus all cancer-related genes), they had significantly lower ω (), whereas other disease-related genes had marginally higher ω in three cases (immune, metabolic, and transport disease genes; , 0.050, and 0.056, respectively) and no difference in two cases (muscle/bone and nervous system disease genes; and 0.988, respectively). When each disease gene class was compared with a more general set of genes (all nondisease genes), we found the same pattern.

Comparisons between disease and nondisease genes within the five (noncancer) functional classes (immune, metabolism, muscle/bone, nervous, and transport) revealed significant differences only for transport genes (). For that class, the value of ω for transport disease genes was marginally, but nonsignificantly, higher than that of nondisease genes (of all classes [see above]), whereas nondisease transport genes had marginally lower ω () than nondisease genes of all classes. These results indicate that, with respect to the intensity of purifying selection, noncancer disease genes (with the exception of transport genes) differ little from their nondisease counterparts and from human genes in general.

Cancer genes, however, seem to be different from other disease-related genes and human genes in general. Only 39 of 121 cancer-related genes had ω values higher than the average value (0.1) for all other genes (compared with 1,188 of the 3,035 genes). The mean ω is significantly lower for cancer-related genes () relative to 2,914 noncancer genes, and the mean KA is significantly lower ( for cancer genes and 0.063 for noncancer genes, [see table 2]), whereas the mean KS is not statistically different. The decrease in KA for cancer genes, greater than changes (in KA) in other classes of genes we examined, is the source of the decrease in the value of ω overall and represents a significant, marked decrease in the rate of nonsynonymous substitutions experienced by these genes as a whole. Previous efforts (Gojobori and Yokoyama 1987) demonstrated that the substitution rates of cancer genes differentiate them from other genes, but it was difficult to conclude from that study that more intense purifying selection might be a general characteristic of cancer genes, due to a small sample size (six cellular oncogenes) and a less precise method of estimating substitution rates (Nei and Gojobori 1986).


    Discussion
 
The 3,035-gene set is representative of the genome, as it consists of approximately 10% of all human genes evenly distributed among chromosomes (with the exception of chromosome Y [unpublished data]). It is conceivable that the set contains some (undiscovered) disease genes classified by this study as nondisease genes. Theoretically, any gene has the potential to contribute to a disorder, should it experience the appropriate combination of substitutions. However, we were interested in looking for generalities between genes involved with relatively common, well-described disorders. Genes without such an involvement (be they nondisease genes or simply unrecognized disease genes) constitute a reasonable reference set. Due to the potential presence of as yet uncharacterized disease genes, however, our comparisons are probably conservative. As our noncancer functional classes are broad, it is possible that a narrowed disease definition would uncover other interesting patterns. However, since this study was limited to 10% of human genes, a narrower focus was not compatible with retaining sample sizes sufficient for robust statistical analysis.

The results of this study suggest that cancer-related genes experience significantly stronger selective pressures than other disease genes and nondisease genes. This difference may be important for understanding the etiology of cancer-related genes. Why these genes are experiencing stronger purifying selection is unknown. One can imagine a scenario in which increased purifying selection may prevent the multigenic interactions implicated in certain cancers. However, this scenario does not address why these genes are under different selective pressures than other disease-related genes—one might logically predict, for example, that other genetic (noncancer) diseases would also be under much more intense purifying selection for the same reason.

An intriguing possibility is hinted at by the nature of cancer genes: we found these genes were overrepresented in a collection of "essential" genes, classified as lethal by mouse knockout experiments. In our 3,035-gene set, there were 104 genes homologous to mouse essential genes; of these, 11 were cancer genes, significantly greater ( in a χ² test) than would be expected by chance alone. The significant overlap between cancer genes and homologs of mouse essential genes provides a potentially important clue about the kind of genes that lend themselves to cancer-causing substitutions. It is well established that many of these oncogenes and tumor suppressor genes act as transcription factors and have other conserved biological functions that also characterize essential genes (Hirsh and Fraser 2001), so this overlap is not necessarily surprising. For cancer-related essential genes, non–cancer-related essential genes, and all essential genes, the mean ω is 0.055, 0.075, and 0.066, respectively.
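The overlap reported above can be checked approximately from the counts given in the text (3,035 genes, 121 cancer genes, 104 essential-gene homologs, 11 in both groups). The Python sketch below assumes a Pearson chi-square test on the 2x2 table of cancer status by essential status, with 1 degree of freedom and no continuity correction. The exact form of the test used in the study is not stated, so the statistic computed here is an illustration and need not match the published value; the function and variable names are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, P value).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


TOTAL_GENES = 3035
CANCER_GENES = 121
ESSENTIAL_GENES = 104
ESSENTIAL_CANCER = 11

# Number of essential cancer genes expected if the two labels were independent.
expected_overlap = ESSENTIAL_GENES * CANCER_GENES / TOTAL_GENES

stat, p_value = chi_square_2x2(
    ESSENTIAL_CANCER,                                   # essential, cancer
    ESSENTIAL_GENES - ESSENTIAL_CANCER,                 # essential, noncancer
    CANCER_GENES - ESSENTIAL_CANCER,                    # nonessential, cancer
    TOTAL_GENES - ESSENTIAL_GENES - CANCER_GENES + ESSENTIAL_CANCER,
)
print(round(expected_overlap, 2), stat, p_value)
```

Under independence roughly 4 of the 104 essential-gene homologs would be expected to be cancer genes, so the observed 11 is well above the chance expectation.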

Genes with a smaller overall effect on the fitness of an organism will not experience as severe an intensity of purifying selection as genes with a greater overall effect on fitness. In light of our finding, this would lead one to believe that substitutions in cancer genes are more detrimental to fitness than substitutions in other disease or nondisease genes. The implications of this finding will not be known and appreciated until the results are confirmed through the analysis of much larger disease gene sets (experiments that become possible as the genome projects mature) and experiments are conducted to explore causal mechanisms. The immediate utility of this finding may be to predict the identity of other cancer genes when a list of candidate genes (in a quantitative trait locus, for example) is investigated.


    Acknowledgements
 
Howard J. Jacob, Alissa K. Salmore, Michael I. Jensen-Seaman, Milton Datta, Wojciech Makalowski, and two anonymous reviewers offered helpful suggestions and comments during the development of this article. Michael I. Jensen-Seaman and Chin-Fu Chen assisted in developing the ideas on which this research was based. Wojciech Makalowski and Wen-Hsiung Li kindly provided data for the comparison data sets. Roumyana Kirova assisted our analysis with critical advice on statistical issues.


    Footnotes
 
Brian Golding, Associate Editor

1 Present address: Department of Biological Sciences, Idaho State University.

2 Present address: Department of Biochemistry and Molecular Biology, The Pennsylvania State University.

E-mail: mthomas@mcw.edu.


    Literature Cited
 

    Bielawski, J. P., K. A. Dunn, and Z. Yang. 2000. Rates of nucleotide substitution and mammalian nuclear gene evolution: approximate and maximum-likelihood methods lead to different conclusions. Genetics 156:1299-1308.

    Fearon, E. R. 1997. Human cancer syndromes: clues to the origin and nature of cancer. Science 278:1043-1050.

    Fortini, M. E., M. P. Skupski, M. S. Boguski, and I. K. Hariharan. 2000. A survey of human disease gene counterparts in the Drosophila genome. J. Cell. Biol. 150:F23-30.

    Gentleman, R., and R. Ihaka. 1996. R: a language for data analysis and graphics. J. Comp. Graph. Stat. 5:299-314.

    Gojobori, T., and S. Yokoyama. 1987. Molecular evolutionary rates of oncogenes. J. Mol. Evol. 26:148-156.

    Hirsh, A. E., and H. B. Fraser. 2001. Protein dispensability and rate of evolution. Nature 411:1046-1049.

    Hurst, L. D., and C. Pal. 2001. Evidence for purifying selection acting on silent sites in BRCA1. Trends Genet. 17:62-65.

    Hurst, L. D., and N. G. Smith. 1999. Do essential genes evolve slowly? Curr. Biol. 9:747-750.

    Ina, Y. 1995. New methods for estimating the numbers of synonymous and nonsynonymous substitutions. J. Mol. Evol. 40:190-226.

    Jimenez-Sanchez, G., B. Childs, and D. Valle. 2001. Human disease genes. Nature 409:853-855.

    Kumar, S., and S. B. Hedges. 1998. A molecular timescale for vertebrate evolution. Nature 392:917-920.

    Kumar, S., and S. Subramanian. 2002. Mutation rates in mammalian genomes. Proc. Natl. Acad. Sci. USA 99:803-808.

    Liberles, D. A. 2001. Evaluation of methods for determination of a reconstructed history of gene sequence evolution. Mol. Biol. Evol. 18:2040-2047.

    Makalowski, W., and M. S. Boguski. 1998. Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences. Proc. Natl. Acad. Sci. USA 95:9407-9412.

    Mann, H., and D. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Statist. 18:50-60.

    Mushegian, A. R., D. E. Bassett, Jr., M. S. Boguski, P. Bork, and E. V. Koonin. 1997. Positionally cloned human disease genes: patterns of evolutionary conservation and functional motifs. Proc. Natl. Acad. Sci. USA 94:5831-5836.

    Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418-426.

    Nei, M., and S. Kumar. 2000. Molecular evolution and phylogenetics. Oxford University Press, New York.

    Nei, M., P. Xu, and G. Glazko. 2001. Estimation of divergence times from multiprotein sequences for a few mammalian species and several distantly related organisms. Proc. Natl. Acad. Sci. USA 98:2497-2502.

    Nekrutenko, A., K. D. Makova, and W. H. Li. 2002. The KA/KS ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Genome Res. 12:198-202.

    Nesse, R. M. 2001. How is Darwinian medicine useful? West. J. Med. 174:358-360.

    Reiter, L. T., L. Potocki, S. Chien, M. Gribskov, and E. Bier. 2001. A systematic analysis of human disease-associated gene sequences in Drosophila melanogaster. Genome Res. 11:1114-1125.

    Rubin, G. M., M. D. Yandell, and J. R. Wortman, et al. (50 co-authors). 2000. Comparative genomics of the eukaryotes. Science 287:2204-2215.

    Stearns, S. C., and D. Ebert. 2001. Evolution in health and disease: work in progress. Q. Rev. Biol. 76:417-432.

    Yang, Z., and J. P. Bielawski. 2000. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15:496-503.

    Yang, Z., and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17:32-43.

    Zar, J. H. 1999. Biostatistical analysis. Prentice Hall, Upper Saddle River, N.J.

Accepted for publication February 19, 2003.