Department of Biology, Institute of Molecular Evolutionary Genetics, Pennsylvania State University
Introduction
In the present study, we analyzed the evolution of human functional binding site sequence in 51 regulatory regions by contrasting the sequences with those of non-human primates and rodents. The sequence analysis is rooted by the direct experimental confirmation that the sites under study are functional sequences in the human promoters. For a subset of 20 of the regulatory regions, we obtained comparative functional data from the primary literature for both human and rodents. By comparing regulatory regions from a series of species across a range of divergence times from humans, we capture binding sites at varying degrees of sequence divergence. On the basis of the functional information, this analysis suggests attributes of the manner in which regulatory regions undergo evolutionary turnover.
Materials and Methods
Human Functional Transcription Factor Binding Sites
The transcription factor binding sites used in the analysis were selected on the basis of direct experimental confirmation of binding ability (footprinting, gel shift assays) and function (promoter deletion experiments, directed mutagenesis, expression of reporter genes) in previous studies. We identified the location of these binding sites in the human sequence by searching the primary literature and the TRANSFAC database (Wingender et al. 2000) (see Supplementary Data for references used for the identification of the binding sites). For all human-rodent analyses, divergence of binding site sequences was computed with alignment gaps included, because we are interested in how different the sequences are between the species compared, not in how the substitutions occurred.
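As an illustration of a gap-inclusive divergence measure, the following minimal sketch computes a p-distance in which a gap opposite a nucleotide counts as a difference. The function name, the choice to skip columns that are gaps in both sequences, and the example sequences are assumptions made for illustration; they are not taken from the original analysis.

```python
def p_distance_with_gaps(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of aligned columns at which two sequences differ.

    A column with a gap in exactly one sequence counts as a difference.
    Columns that are gaps in both sequences are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    different = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap and b == gap:
            continue  # column carries no information for this pair
        compared += 1
        if a != b:
            different += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return different / compared


# Hypothetical 10-bp site aligned to a rodent sequence (not real data):
# one gap column and one substitution give 2 differences in 10 columns.
human_site = "TGACGTCAGC"
rodent_site = "TGAC-TCTGC"
print(p_distance_with_gaps(human_site, rodent_site))
```

Counting gaps as differences measures how dissimilar the two sequences are at the site, without modeling the events that produced the difference.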
Comparative Functional Analysis for Human and Rodents
Data were collected from the primary literature. We restricted the analysis to studies that tested the function and binding ability of binding sites with the same criteria and methods. The criteria for the validity of the function of transcription factor binding sites were as strict as those for the human collection of binding sites. From 20 genes we collected data on 64 binding sites that align between human and rodent: 33 share function between human and rodents, 14 are functional in humans only (human specific), and 17 are rodent specific (see Supplementary Data for references and GenBank accession numbers of the regulatory region sequences).
Results
Average divergence of sequence in the human-rodent comparison within binding sites (p-distance: d = 0.229, SD = 0.177; Kimura 2-parameter: d = 0.273, SD = 0.182) is lower than that of the average synonymous human-mouse divergence (Kimura 2-parameter: d = 0.468, SD = 0.169; Makalowski and Boguski 1998) but much higher than that of the nonsynonymous human-mouse divergence (Kimura 2-parameter: d = 0.090, SD = 0.102; Makalowski and Boguski 1998), and the divergence of the background sequence (p-distance: d = 0.310, SD = 0.175; Kimura 2-parameter: d = 0.399, SD = 0.178) is very similar to the synonymous divergence. It is possible that other binding sites reside in the aligned regions and are not yet identified as functional. However, the fact that the Kimura 2-parameter estimate of divergence is not very different from the synonymous rate of substitution implies that the density of such potentially unidentified binding sites is low. Additionally, there is no correlation between amino acid sequence divergence of the genes and binding site sequence divergence (P = 0.680), and the amino acid divergence in the genes compared is generally low, averaging d = 0.269 (SD = 0.139). Therefore, the relatively high binding site divergence we observe cannot be explained by rapid overall gene divergence. In addition, there is no correlation between divergence in individual binding sites in human-rodent and human-macaque comparisons (r = 0.001, P = 0.909), suggesting that constraints for each site are generally independent in the two different lineages and not a property of the importance of the site for the expression of the gene. Manual inspection of expression profiles from public databases (Unigene, LocusLink, MGI, NCBI) does not suggest any major differences in expression pattern of the genes between human and rodents, but we cannot exclude the possibility that such changes have occurred. Unfortunately, data on tissue- and temporal-specific expression patterns are not unified sufficiently to allow a formal comparison of human versus rodent expression patterns.
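For readers unfamiliar with the Kimura 2-parameter correction quoted above, here is a minimal sketch of the standard formula, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences. Skipping columns that contain gaps or ambiguous characters is an assumption of this sketch; the text does not state how such columns were handled for this estimate. The function name and example sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_2p(seq_a: str, seq_b: str) -> float:
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: ignore gaps and ambiguity codes
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    w1 = 1.0 - 2.0 * p - q
    w2 = 1.0 - 2.0 * q
    if w1 <= 0.0 or w2 <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(w1) - 0.25 * math.log(w2)


# Hypothetical 20-bp example: two transitions and one transversion.
seq_1 = "AAAAAAAAAAAAAAAAAAAA"
seq_2 = "GGCAAAAAAAAAAAAAAAAA"
print(round(kimura_2p(seq_1, seq_2), 4))
```

Because the correction accounts for multiple substitutions at the same site, the corrected distance is always at least as large as the uncorrected proportion of differences, which matches the ordering of the p-distance and Kimura 2-parameter values reported above.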
Proportion of Species-Specific Transcription Factor Binding Sites
In order to estimate how many binding sites exhibit species-specificity in function, we need experimental data for both species. Such data were available for 20 of the 43 alignable regulatory regions compared between human and rodents. A total of 64 alignable binding sites have been identified in these 20 regions, of which 33 have shared function between human and rodents (mouse or rat), 14 are human specific, and 17 are rodent specific. First, we tested whether the subset of the data for which there is functional information for both species is representative of the original sample of 43 genes (fig. 3). The nonparametric Mann-Whitney U-test (Sokal and Rohlf 1997, pp. 440–447) shows that there is no significant difference between the divergence values obtained from the sample of 20 genes and the divergence values from the remainder of the data (W = 7,746, P = 0.1948). In addition, there is no difference between the divergence values of the human-specific versus rodent-specific binding sites (Mann-Whitney: W = 151, P = 0.9173), so they can be pooled in one class of species-specific binding sites. There was a highly significant difference, as expected, in the divergence values in binding sites with shared function versus the species-specific binding sites (Mann-Whitney: W = 628, P < 0.001). Finally, there was no difference between the divergence values in binding sites compared between human-mouse versus the values in binding sites compared between human-rat (Mann-Whitney: W = 468, P = 0.930).
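The comparisons above rest on the Mann-Whitney rank test. The sketch below shows one way such a test can be computed in plain Python, using midranks for ties and a normal approximation with continuity correction for the two-sided P value. It is not the software used in the study, and the input values are invented. The W values quoted in the text may correspond to the rank sum of the first sample, a convention used by some statistics packages; that reading is an assumption.

```python
import math


def mann_whitney(x, y):
    """Return (rank sum of x, U statistic of x, two-sided P value).

    The P value uses the normal approximation with tie and continuity
    corrections, which is reasonable only for moderately large samples.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0          # mean of ranks i+1 .. j
        t = j - i                            # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0.0:
        return rank_sum_x, u, 1.0
    z = max((abs(u - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p_value = math.erfc(z / math.sqrt(2.0))  # two-sided tail probability
    return rank_sum_x, u, p_value


# Invented divergence values for two classes of binding sites.
shared_function = [0.00, 0.10, 0.10, 0.20]
species_specific = [0.50, 0.60, 0.70, 0.80]
w, u, p = mann_whitney(shared_function, species_specific)
print(w, u, round(p, 3))
```

A rank-based test suits these data because divergence values of short binding sites are discrete and far from normally distributed.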
In order to bypass this bias, we used another method to estimate the proportion of species-specific binding sites, this time taking into account the distribution of divergence of each of the two functional classes of the 64 binding sites (shared function vs. species-specific function). We used these distributions to define the probability of shared function of a binding site between species, given a value of divergence of the functional sequence from the other species sequence. For each functional class we counted the number of occurrences in each divergence interval of width 0.1 (e.g., 0.00–0.10, 0.11–0.20, 0.21–0.30, etc.) and calculated the proportion of values that fall within this interval for each class. We then estimated the probability that a site does not share function in the two species compared by dividing, for each interval, the proportion of the species-specific values in this interval by the sum of the proportions of species-specific and shared values for the same interval. We then took the other subset of the data, for which there was functional information only for the human binding sites, and computed the predicted number of sites with species-specific function by multiplying the probability defined above by the number of binding sites observed within the same interval of divergence. A total of 38 out of 96 binding sites were estimated to be human specific (40%), similar to the experimental estimate.
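The interval-based procedure just described can be sketched as follows. All function names and the example values are hypothetical. Two details are assumptions of this sketch rather than statements from the text: intervals are treated as closed on the right, so that a divergence of exactly 0.10 falls in the first interval, and intervals that were never observed in the calibration set contribute nothing to the prediction.

```python
import math
from collections import Counter


def interval_index(d: float, width: float = 0.1) -> int:
    """Index of the interval (lower, upper] containing divergence d."""
    # Rounding guards against floating-point noise such as 0.3 / 0.1.
    return max(math.ceil(round(d / width, 9)) - 1, 0)


def interval_proportions(values):
    """Proportion of a class's divergence values falling in each interval."""
    counts = Counter(interval_index(v) for v in values)
    return {b: c / len(values) for b, c in counts.items()}


def prob_species_specific(shared, specific):
    """Per-interval probability that a site does not share function."""
    p_shared = interval_proportions(shared)
    p_specific = interval_proportions(specific)
    probs = {}
    for b in set(p_shared) | set(p_specific):
        s = p_specific.get(b, 0.0)
        h = p_shared.get(b, 0.0)
        probs[b] = s / (s + h)
    return probs


def predicted_specific_count(probs, unlabeled):
    """Expected number of species-specific sites among unlabeled sites."""
    counts = Counter(interval_index(v) for v in unlabeled)
    # Assumption: intervals absent from the calibration data are skipped.
    return sum(probs[b] * c for b, c in counts.items() if b in probs)


# Invented calibration data (sites with functional data in both species).
shared = [0.05, 0.05, 0.15, 0.25]
specific = [0.15, 0.25, 0.25, 0.35]
# Invented sites with functional data in human only.
human_only = [0.05, 0.15, 0.15, 0.35]

probs = prob_species_specific(shared, specific)
print(predicted_specific_count(probs, human_only))
```

Normalizing each class to proportions before taking the ratio keeps the estimate from being driven by the unequal sizes of the shared and species-specific classes.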
Discussion
This pattern of evolution has important implications for the use of phylogenetic methods to identify functional regulatory elements for basic and medical research. Distant interspecific comparisons will reveal mainly highly conserved binding sites, and focusing only on those imposes an unfortunate bias in our understanding of regulatory variation. The highly conserved binding sites are those likely to have a radical effect on the expression of the gene, and nucleotide variation in these sites is likely to be associated with rare monogenic disorders. Complex disorders are likely to be mediated by common variants in less constrained binding sites (Risch and Merikangas 1996), precisely those sites that are missed in distant comparisons. On the other hand, comparisons of more closely related species are confounded by the low divergence even in nonfunctional sequences, which will produce many false positives. The positive aspect of our results is that 60%–68% of the transcription factor binding sites are functionally conserved between human and rodents. Therefore, their nucleotide sequence is functionally constrained, and by using the appropriate parameters for calibration, which our data and analysis provide, several methods will be able to identify them within human-rodent alignments of regulatory regions.
The small size of transcription factor binding sites and the degeneracy of binding requirements allow not only for the accumulation of conservative substitutions within binding sites but also for the independent emergence of new binding sites, because many different nucleotide combinations will satisfy the binding requirements of a DNA-binding protein (Berg and von Hippel 1987). These new sites may relax the evolutionary constraint in previously essential sites and lead to loss of some of them without serious phenotypic consequences (Ludwig et al. 2000). This pattern of evolution will make it difficult to identify regulatory elements that have undergone turnover. Thus, a tight combination of probabilistic methods for binding site prediction, such as Hidden Markov Models (Durbin et al. 1998, pp. 46–132; Eddy 1998), study of polymorphism in promoter sequences, and extensive functional (Ren et al. 2000) and computational studies (Bussemaker, Li, and Siggia 2001) will be able to detect nonconserved binding sites. Detailed studies of regulatory sequence function combined with more sophisticated comparative genomics (Dubchak et al. 2000; Sumiyama, Kim, and Ruddle 2001), including comparison across multiple species of varying degrees of divergence (such as dog and rabbit) and polymorphism analysis, will be informative in capturing the fluid regulatory landscape of mammalian genomes. Finally, these results may lay the foundation for studying how species are different from each other, enabling the identification of genomic segments that are responsible for these differences.
Footnotes
Address for correspondence and reprints: Emmanouil T. Dermitzakis, 1 Rue Michel-Servet, Division of Medical Genetics, Medical School, University of Geneva, 1211 Switzerland. Emmanouil.Dermitzakis{at}medecine.unige.ch
Keywords: regulatory evolution, binding site turnover, mammals
References
Berg O. G., P. H. von Hippel, 1987 Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters J. Mol. Biol 193:723-750
Bussemaker H. J., H. Li, E. D. Siggia, 2001 Regulatory element detection using correlation with expression Nature Genet 27:167-174
Collins F. S., M. S. Guyer, A. Chakravarti, 1997 Variations on a theme: cataloging human DNA sequence variation Science 278:1580-1581
Dubchak I., M. Brudno, G. G. Loots, L. Pachter, C. Mayor, E. M. Rubin, K. A. Frazer, 2000 Active conservation of noncoding sequences revealed by three-way species comparisons Genome Res 10:1304-1306
Durbin R., S. Eddy, A. Krogh, G. Mitchison, 1998 Biological sequence analysis Cambridge University Press, Cambridge
Eddy S., 1998 Profile hidden Markov models Bioinformatics 14:755-763
Florea L., M. Li, C. Riemer, B. Giardine, W. Miller, et al 2000 Validating computer programs for functional genomics in gene regulatory regions Curr. Genomics 1:11-27
Hardison R. C., J. Oeltjen, W. Miller, 1997 Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome Genome Res 7:959-966
Jareborg N., E. Birney, R. Durbin, 1999 Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs Genome Res 9:815-824
Leung J. Y., F. E. McKenzie, A. M. Uglialoro, P. O. Flores-Villanueva, B. C. Sorkin, 2000 Identification of phylogenetic footprints in primate tumor necrosis factor-alpha promoters Proc. Natl. Acad. Sci. USA 97:6614-6618
Ludwig M., C. Bergman, N. H. Patel, M. Kreitman, 2000 Evidence for stabilizing selection in a eukaryotic enhancer element Nature 403:564-567
Makalowski W., M. Boguski, 1998 Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences Proc. Natl. Acad. Sci. USA 95:9407-9412
Mateu M. G., A. R. Fersht, 1999 Mutually compensatory mutations during evolution of the tetramerization domain of tumor suppressor p53 lead to impaired hetero-oligomerization Proc. Natl. Acad. Sci. USA 96:3595-3599
McDermott D. H., P. A. Zimmerman, F. Guignard, C. A. Kleeberger, S. F. Leitman, P. M. Murphy, 1998 CCR5 promoter polymorphism and HIV-1 disease progression Multicenter AIDS Cohort Study (MACS). Lancet 352:866-870
Picketts D. J., C. R. Mueller, D. Lillicrap, 1994 Transcriptional control of the factor IX gene: analysis of five cis-acting elements and the deleterious effects of naturally occurring hemophilia B Leyden mutations Blood 84:2992-3000
Ren B., F. Robert, J. J. Wyrick, et al. (11 co-authors). 2000 Genome-wide location and function of DNA binding proteins Science 290:2306-2309
Risch N., K. Merikangas, 1996 The future of genetic studies of complex human diseases Science 273:1516-1517
Schwartz S., Z. Zhang, K. A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, W. Miller, 2000 PipMaker: a web server for aligning two genomic DNA sequences Genome Res 10:577-586
Shashikant C. S., C. B. Kim, M. A. Borbely, W. C. H. Wang, F. H. Ruddle, 1998 Comparative studies on mammalian Hoxc8 early enhancer sequence reveal a baleen whale-specific deletion of a cis-acting element Proc. Natl. Acad. Sci. USA 95:15446-15451
Sokal R. R., F. J. Rohlf, 1997 Biometry 3rd edition, W. H. Freeman and Co
Sumiyama K., C. B. Kim, F. H. Ruddle, 2001 An efficient cis-element discovery method using multiple sequence comparisons based on evolutionary relationships Genomics 71:260-266
Wasserman W., M. Palumbo, W. Thompson, J. W. Fickett, C. E. Lawrence, 2000 Human-mouse genome comparisons to locate regulatory sites Nat. Genet 26:225-228
Wei J., G. P. Hemmings, 2000 The NOTCH4 locus is associated with susceptibility to schizophrenia Nat. Genet 25:376-377
Werth V. P., W. Zhang, K. Dortzbach, K. Sullivan, 2000 Association of a promoter polymorphism of tumor necrosis factor-alpha with subacute cutaneous lupus erythematosus and distinct photoregulation of transcription J. Investig. Dermatol 115:726-730
Wingender E., X. Chen, R. Hehl, H. Karas, I. Liebich, V. Matys, T. Meinhardt, M. Pruss, I. Reuter, F. Schacherer, 2000 TRANSFAC: an integrated system for gene expression regulation Nucleic Acids Res 28:316-319
Zhu J., J. S. Liu, C. E. Lawrence, 1998 Bayesian adaptive alignment and inference Bioinformatics 14:25-39