Center for Applied Mathematics and Department of Mathematics, Cornell University, Ithaca, New York
Correspondence: E-mail: rtd1{at}cornell.edu.
Abstract
Key Words: tandem gene duplication, adaptive evolution, zinc-finger genes
Introduction
In addition to a general increase in the number of zinc-finger genes, some regions of the human genome contain many such genes with no homologs in rodents. Bellefroid et al. (1995) studied the ZNF91 gene family on human chromosome 19p12-p13.1. They found ZNF91 family members in a number of primate species but could find no murine gene with sequence similarity to ZNF91. They concluded that this cluster resulted from duplication events some 55 MYA.
The structure and binding properties of zinc-finger genes have been extensively studied (see Wolfe, Nekludova, and Pabo [1999] for a review). A C2H2 zinc finger consists of an α-helix that begins between the first two asterisks in figure 1 and continues to the first histidine. The remainder of the finger consists of two antiparallel β sheets. The amino acids at positions -1, 3, and 6 with respect to the α-helix make contacts to bases 3, 2, and 1 in the primary DNA strand, whereas the amino acid at α-helix position 2 makes contact to the complement of base 4. The recognition code for zinc-finger binding has been widely studied (Choo and Klug 1997). However, recent research (Benos, Lapedes, and Stormo 2002) suggests that no simple one-to-one relationship exists but that different amino acid sequences bind to target nucleotide sequences with different efficiencies.
The H/C link TGEKPY/F separating adjacent fingers (dark gray in figure 1), the two C and two H positions bound to the zinc atom to make the finger, and the hydrophobic phenylalanine (F) and leucine (L) are highly conserved. However, the four sites involved in binding the protein to DNA, indicated by asterisks in figure 1, are highly variable.
These observations and the fact that even closely related genes display distinct patterns of tissue-specific expression (Shannon et al. 2003) suggest that gene duplication has aided in the diversification of zinc-finger binding motifs. Shannon et al. (2003) used pairwise dN/dS comparisons to examine selective pressures in what we will call clusters I and II below. The goal of this paper is to use the methods of Yang et al. (2000), Yang and Swanson (2002), and Suzuki and Gojobori (1999) to look for signs of positive selection in these clusters and others on human chromosome 19.
Materials and Methods
To examine the relationship between zinc-finger genes, we aligned the KRAB domains and spacer sequences of our genes using ClustalW. We did not use the zinc fingers in the alignment because the number varied considerably between genes, and the repetitive zinc-finger structure resulted in the alignment of highly dissimilar fingers. Alignments were done using the European Bioinformatics Institute's server (http://www.ebi.ac.uk/clustalw/) with default parameters. As described in Thompson, Higgins, and Gibson (1994), ClustalW (1) performs a pairwise alignment of all sequences, (2) computes a distance matrix based on the percentage of identities between the two aligned sequences, (3) produces a tree by the neighbor-joining algorithm, and then (4) uses the tree to guide the multiple alignment.
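Step (2) of the procedure above can be illustrated with a minimal Python sketch. It is not ClustalW's implementation: the function names are hypothetical, the input is assumed to be a set of already aligned, equal-length sequences, and skipping columns that contain a gap in either sequence is an assumption made here for simplicity.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percentage of identical residues between two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped (an assumption
    of this sketch, not a documented ClustalW rule).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def distance_matrix(aligned):
    """Pairwise distances, defined here as 1 - identity/100, as a nested dict."""
    names = list(aligned)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = 1.0 - percent_identity(aligned[n], aligned[m]) / 100.0
        dist[n][m] = dist[m][n] = d
    return dist


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy = {"geneA": "ACGTACGT", "geneB": "ACGTACGA", "geneC": "ACG-ACTA"}
    for name, row in distance_matrix(toy).items():
        print(name, {k: round(v, 3) for k, v in row.items()})
```

A matrix of this kind is what a neighbor-joining routine would take as input to build the guide tree.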
Using the clustering of genes on the tree and a comparison of their α-helix sequences (see Results for more details), we identified four sets of genes for study. For each gene cluster, we obtained the mRNA sequences from the NCBI Web site and located the fingers that were common to all of the genes to make our comparison data set. In each case, alignment of the selected fingers using ClustalW resulted in an alignment with no gaps in any sequence and trees that agreed with those that had been constructed from the alignment of KRAB domains and spacer sequences. To further confirm the phylogenies, we built trees using parsimony and neighbor-joining methods implemented in PHYLIP using the Web server at http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html. In clusters II to IV, the trees from all methods were identical. In cluster I, we found two tree topologies that differed in the positions of ZNF 224 and 225, which are almost equidistant from the pair ZNF 155 and 221, so we analyzed this cluster under both trees. Results of subsequent tests were very similar for the two trees. To look for signs of positive selection in our four clusters, we used the following three approaches.
Site-Specific Models
Nielsen and Yang (1998) and Yang et al. (2000) introduced various models to study how the distribution of ω = dN/dS varies along sequences. Model M7 has an ω for each site drawn from a beta distribution with parameters p and q. Model M8 uses the M7 recipe for a fraction p0 of the sites and assigns another ω to the remaining fraction. M7 and M8 are nested models, so they can be compared using a likelihood ratio test (LRT). Twice the difference in log-likelihood between models is compared with the value obtained under a χ² distribution with degrees of freedom equal to the difference in number of parameters between models (in this case 2). When M8 fits the data significantly better than M7 and the ω ratio estimated under model M8 is greater than 1, we need to ask whether it is significantly greater than 1. To do this, we recalculate the log-likelihood value in M8 while fixing ω to be 1 (model M8A from Swanson, Nielsen, and Yang [2003]) and compare the change in likelihood with a χ² distribution with 1 degree of freedom.
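The likelihood ratio test described above reduces to a short calculation once the two log-likelihoods are known. The following Python sketch uses only the standard library and the closed-form upper tails of the chi-square distribution for 1 and 2 degrees of freedom, which are the two cases needed here. The function names and the log-likelihood values in the example are hypothetical; they are not results from this study.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Only df = 1 and df = 2 are supported, using their closed forms.
    """
    if x < 0:
        raise ValueError("test statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch supports only df = 1 or df = 2")


def lrt(lnl_null, lnl_alt, df):
    """Return (statistic, P value) for twice the log-likelihood difference."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, chi2_sf(stat, df)


if __name__ == "__main__":
    # Hypothetical log-likelihoods, for illustration only.
    stat, p = lrt(lnl_null=-1000.0, lnl_alt=-997.0, df=2)  # M7 versus M8
    print(f"M7 vs M8:  2*delta = {stat:.2f}, P = {p:.4f}")
    stat, p = lrt(lnl_null=-998.1, lnl_alt=-997.0, df=1)   # M8A versus M8
    print(f"M8A vs M8: 2*delta = {stat:.2f}, P = {p:.4f}")
```

For other degrees of freedom a general chi-square routine from a statistics library would be needed.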
Fixed-Sites Models
The approach in the last paragraph does not take into account the fact that zinc fingers are periodic, so we will also use a method developed by Yang and Swanson (2002) that allows us to take advantage of a priori knowledge. We divide the sites into three classes: constrained sites (finger positions 1, 2, 4, 7, 11, 17, 20, 24, 25, 26, 27, and 28), the binding sites (13, 15, 16, and 19), and the remaining "unconstrained" sites. We have used quotation marks because it will turn out that these sites have ω values significantly smaller than 1.
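The three site classes can be written down as a simple lookup by finger position. The Python sketch below assumes that each finger occupies 28 codons and that the alignment is a concatenation of whole fingers starting at finger position 1; the function names are hypothetical.

```python
FINGER_LENGTH = 28  # codons per finger, positions numbered 1..28
CONSTRAINED = frozenset({1, 2, 4, 7, 11, 17, 20, 24, 25, 26, 27, 28})
BINDING = frozenset({13, 15, 16, 19})


def site_class(finger_position):
    """Return the class of a finger position (1-based)."""
    if not 1 <= finger_position <= FINGER_LENGTH:
        raise ValueError("finger position must be between 1 and 28")
    if finger_position in CONSTRAINED:
        return "constrained"
    if finger_position in BINDING:
        return "binding"
    return "unconstrained"


def classify_codons(n_codons):
    """Class label for each codon of an alignment of concatenated fingers.

    Assumes codon 0 of the alignment is finger position 1.
    """
    return [site_class(i % FINGER_LENGTH + 1) for i in range(n_codons)]


if __name__ == "__main__":
    labels = classify_codons(9 * FINGER_LENGTH)  # e.g., nine fingers
    for name in ("constrained", "binding", "unconstrained"):
        print(name, labels.count(name))
```

Per finger this gives 12 constrained, 4 binding, and 12 unconstrained codons.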
Let κ be the transition/transversion ratio, πi the frequency of amino acid i, and let rj denote the ratio of substitution rates for the jth site class to that of the first, with r1 = 1. Yang and Swanson (2002) introduced the following models. In model A, there is only one rate class, and all sites use the same κ, ω, and π values. In model B, the r values are different, but all sites use the same κ, ω, and π values. In model C, the r and π values are different, but all sites use the same κ and ω values. In model D, the r, κ, and ω values are different, but all sites use the same π values. In model E, each class has a different set of parameters. In model F, the sites are divided into three groups and analyzed separately. Tests were carried out using version 3.14 of PAML software introduced by Yang (1997).
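To see which of these models can be compared with a likelihood ratio test, it helps to record which parameter types are allowed to differ among site classes in each model. The Python sketch below does this for models A to E as described by Yang and Swanson (2002); the labels "r", "kappa", "omega", and "pi" are names chosen here, and model F is left out because it analyzes the partitions separately rather than sharing any parameters.

```python
# Parameter types that differ among site classes in each model.
VARYING = {
    "A": frozenset(),
    "B": frozenset({"r"}),
    "C": frozenset({"r", "pi"}),
    "D": frozenset({"r", "kappa", "omega"}),
    "E": frozenset({"r", "kappa", "omega", "pi"}),
}


def is_nested(simple, general):
    """True if `simple` is a strict special case of `general`."""
    return VARYING[simple] < VARYING[general]


def testable_pairs():
    """All (simple, general) pairs that can be compared with an LRT."""
    return sorted(
        (s, g) for s in VARYING for g in VARYING if is_nested(s, g)
    )


if __name__ == "__main__":
    for simple, general in testable_pairs():
        print(f"{simple} is nested in {general}")
```

Under this encoding models C and D are not nested in either direction, so they would not be compared directly with an LRT.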
Parsimony Analysis
At the request of two referees, we used Suzuki and Gojobori's (1999) method as implemented in ADAPTSITE.p version 1.3 (http://mep.bio.psu.edu/adaptivevol.html) to look for positive selection in our four clusters. The test is based on comparing the observed total number of synonymous (sc) and nonsynonymous (nc) substitutions for a codon to the binomial distribution with tc trials and success probability p, where tc is the total number of changes and p is the fraction of synonymous changes expected in the tree. There are several reasons not to use this test. The first reason is that the distribution of sc conditioned on the observed values of tc and p is not binomial (R. Durrett, unpublished data). The second reason is that the test has very low power unless the number of sequences compared is large (see Wong et al. [2004]). Suzuki and Gojobori (1999) say that a tree length of at least 2.5 nucleotide changes per codon site is needed to detect positive selection. Adding the branch lengths of the maximum-parsimony trees shows that our clusters range from 0.45 to 0.6 changes per site. However, we can remedy this problem by taking advantage of the periodic structure of zinc-finger genes and grouping codons together by position in the 9 to 10 fingers being compared. This is similar to our second PAML analysis, but now our groups are the 28 finger positions rather than the three classes of sites. Because of our a priori beliefs, we performed one-tailed tests of positive selection at the four binding sites and of negative selection at the other sites.
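The binomial comparison and the pooling by finger position can be sketched in a few lines of Python. This is an illustration of the nominal test as described above, not the ADAPTSITE.p code: the function names are hypothetical, the counts in the example are invented, and the value of p would in practice come from the tree rather than being supplied by hand.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n binomial trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def p_positive(sc, tc, p):
    """One-tailed P value for positive selection.

    Probability of sc or fewer synonymous changes among tc total changes.
    """
    return sum(binom_pmf(k, tc, p) for k in range(0, sc + 1))


def p_negative(sc, tc, p):
    """One-tailed P value for negative selection.

    Probability of sc or more synonymous changes among tc total changes.
    """
    return sum(binom_pmf(k, tc, p) for k in range(sc, tc + 1))


def pool(counts):
    """Sum (sc, nc) pairs over the fingers sharing one finger position."""
    sc = sum(s for s, _ in counts)
    nc = sum(n for _, n in counts)
    return sc, nc


if __name__ == "__main__":
    # Invented (sc, nc) counts for one finger position in nine fingers.
    per_finger = [(0, 2), (1, 3), (0, 1), (0, 2), (1, 2),
                  (0, 3), (0, 1), (0, 2), (0, 2)]
    sc, nc = pool(per_finger)
    tc = sc + nc
    print(sc, nc, p_positive(sc, tc, p=0.3))
```

Pooling increases tc for each test, which is what gives the grouped analysis more power than the codon-by-codon version.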
Results
We examined α-helix sequences for all of our zinc-finger genes to identify other groups. Here and in what follows, the numbers in parentheses indicate the start of the gene in megabases. As shown in table 4, the α-helix sequences of ZNF440(11.78) and ZNF439(11.83) show strong signs of tandem duplication, as do ZNF44(12.22), LOC147837(12.28), ZNF442(12.36), LOC90576(12.36), and ZNF443(12.40). From the five intervening genes we chose ZNF20(12.10) to complete cluster III. Notice that these genes appear together in the tree in figure 3. Our final cluster, cluster IV, consists of ZNF90(20.07), LOC163233(20.51), ZNF85(20.89), ZNF430(20.99), LOC148206(21.04), ZNF431(21.11), and LOC163227(21.69). These genes appear in two groups in the tree (fig. 3), but their α-helix sequences given in table 5 are very similar to the others in the group.
Model F is a separate analysis of the three partitions (i.e., it runs model A for each partition separately). As expected, the estimated ω ratios at the constrained sites are small in all four clusters (0.20, 0.02, 0.16, and 0.48), and those at the unconstrained sites are larger (0.66, 0.23, 0.55, and 0.48). For the class of binding sites, we get ω values larger than 1 in clusters I, III, and IV: 1.14, 2.22, and 2.10. However, in cluster II, our ω estimate is 0.34. To test whether the ω values observed at the binding sites are significantly different from 1, we recalculate the log-likelihood values in model F by fixing ω1 to be 1 and perform the LRT as described above. Cluster I is not significant, but clusters III, IV, and II are significant with P values 0.05, 0.01, and 0.001, respectively.
In the last analysis of our four clusters, we applied the parsimony-based program ADAPTSITE.p (Suzuki and Gojobori 1999) to look for selection at individual codon sites. No positively selected sites were identified in any cluster, but several nonbinding sites turned out to be under negative selection at the 5% significance level in clusters I (15 sites), II (24 sites), III (17 sites), and IV (5 sites).
Results of our analysis using ADAPTSITE.p with data pooled by finger position are given in table 9. There are three binding sites with significant positive selection, finger position 13 in cluster III (P < 0.0004) and positions 16 and 19 in cluster IV (P < 0.0046 and P < 0.0458, respectively), but only the first two are smaller than the threshold of 0.0178 demanded by the Bonferroni correction for our 28 tests. Again, there are a large number of nonbinding sites that show negative selection at this level. In cluster II, this occurs for 21 of the 24 nonbinding sites, with four of the P values smaller than 10⁻⁶. Indeed, two of the binding sites, positions 15 and 16, show negative selection with P values less than 0.0013 and less than 0.00001, respectively, which is consistent with the previous PAML analysis.
Discussion
The results for cluster II are consistent with those of Shannon et al. (2003), who examined dN/dS ratios at three of the binding sites (our finger positions 13, 16, and 19) and found no evidence of positive selection in cluster II genes but significant evidence of purifying selection in pairwise comparisons of ZNF235 with Zfp235, Zfp93, and Zfp109 (see their table 2). In ZNF genes near our cluster I, they find significant evidence of positive selection in comparisons of 226 with 230, 223, 284, and 222; 234 with 221; and 284 with 230. In no case are both of their compared genes within our cluster I (which consists of 155, 221 to 225, and 230). Some of the comparisons that Shannon et al. (2003) find significant are quite curious in view of the data presented in table 2. ZNF223 has nine zinc fingers versus 17 in ZNF226, and the overlapping fingers do not align well. ZNF284 and ZNF230 are more similar in length (11 versus nine fingers), but comparison of the α-helix sequences reveals very little overall similarity.
Tandemly duplicated genes are subject to gene conversion events. Given the ability of gene conversion to homogenize gene families (see e.g., Chapter 11 of Li [1997]), it is natural to ask whether concerted evolution can introduce correlated changes in different lineages and hence invalidate the use of Yang's and Suzuki and Gojobori's methods, which assume independent substitutions. We cannot rule out the possibility that gene conversion acted soon after duplication to protect the duplicated copies from becoming pseudogenes (see Walsh [1987]), an effect that can cause the underestimation of divergence times (see Teshima and Innan [2004]). However, there are two reasons to doubt that this force has acted in the recent past.
First, we observe that gene conversion acts to homogenize genes that perform the same function. Yet Shannon et al.'s (2003) study of cluster I shows that these genes have different tissue-specific expression patterns. The second obvious point is that if gene conversion is still acting, it is not doing a very good job. At a gross level, the numbers of zinc fingers of the genes in cluster I are 9, 9, 9, 15, 11, 19, and 17, respectively (the first three appear to be recent duplicates). Within clusters, there is considerable divergence between sequences. For example, in cluster IV, 23 synonymous and 36 nonsynonymous differences separate the 840 nucleotides in the most closely related pair (ZNF431 and LOC148206), and there are more than 100 differences between a typical pair of genes.
Several studies have presented evidence of gene conversion by examining patterns in the differences between genes and pointing out regions of unusually high similarity (see figure 5 in Sharon et al. [1999], figure 6 in Lazzaro and Clark [2001], and figure 6 in Bettencourt and Feder [2001]). To look for similar signals in our data, we conducted an analysis (fig. 6) in which we calculated the number of nucleotide differences in a 168-nucleotide window (the length of two fingers) between adjacent genes in each cluster, advancing the window by 7 nucleotides until the end of the sequence was reached. Successive differences in each cluster are indicated by hollow squares, diamonds, and triangles, followed by filled versions of the symbols and an X for the seventh comparison. We find a lot of variability in divergence, but with the exception of one gene pair at the end of cluster III, no regions dip below 5 nucleotide differences, and most are above 10, which represents 6% divergence in the window. Assuming a mutation rate of 2 × 10⁻⁸ per nucleotide per generation, this suggests that gene conversion has not acted on these clusters in the past 3 million generations.
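The window scan described above is straightforward to reproduce. The Python sketch below counts nucleotide differences between two aligned, gap-free sequences in a 168-nucleotide window advanced 7 nucleotides at a time, and then repeats the back-of-envelope conversion from divergence to generations. The function names and the toy sequences are hypothetical, and the conversion simply divides the fractional divergence by the per-generation mutation rate, which reproduces the figure quoted in the text.

```python
def sliding_window_differences(a, b, window=168, step=7):
    """Number of mismatches between a and b in each window position."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    counts = []
    for start in range(0, len(a) - window + 1, step):
        segment_a = a[start:start + window]
        segment_b = b[start:start + window]
        counts.append(sum(x != y for x, y in zip(segment_a, segment_b)))
    return counts


def generations_from_divergence(fraction, mutation_rate):
    """Generations implied by a fractional divergence at a given rate."""
    return fraction / mutation_rate


if __name__ == "__main__":
    # Toy sequences of 182 nucleotides differing at every 20th site.
    seq_a = "A" * 182
    seq_b = "".join("C" if i % 20 == 0 else "A" for i in range(182))
    print(sliding_window_differences(seq_a, seq_b))

    print(f"10 differences in 168 nt = {10 / 168:.1%} divergence")
    print(f"{generations_from_divergence(0.06, 2e-8):.2e} generations")
```

A low count in one window relative to its neighbors is the kind of signal that the cited studies interpret as a candidate conversion tract.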
References
Bellefroid, E. J., D. A. Poncelet, P. J. Lecocq, O. Revelant, and J. A. Martial. 1991. The evolutionarily conserved Krüppel-associated box domain defines a subfamily of eukaryotic multifingered proteins. Proc. Natl. Acad. Sci. USA 88:3608–3612.
Bellefroid, E. J., J. C. Marine, A. G. Matera, C. Bourguignon, T. Desai, K. C. Healy, P. Bray-Ward, J. A. Martial, J. N. Ihle, and D. C. Ward. 1995. Emergence of the ZNF91 Krüppel-associated box-containing zinc finger gene family in the last common ancestor of the Anthropoidea. Proc. Natl. Acad. Sci. USA 92:10757–10761.
Benos, P. V., A. S. Lapedes, and G. D. Stormo. 2002. Probabilistic code for DNA recognition by proteins of the EGR family. J. Mol. Biol. 323:701–727.
Bettencourt, B. R., and M. E. Feder. 2001. Hsp70 duplication in the Drosophila melanogaster species group: How and when did two become five? Mol. Biol. Evol. 18:1272–1282.
Choo, Y., and A. Klug. 1997. Physical basis of a protein-DNA recognition code. Curr. Opin. Struct. Biol. 7:117–125.
Dehal, P., P. Predki, A. S. Olsen et al. (21 co-authors). 2001. Human chromosome 19 and related regions in mouse: conservative and lineage specific evolution. Science 293:104–111.
Gell, D., M. Crossley, and J. Mackay. 2003. Zinc-finger genes. Pp. 823–828 in D. N. Cooper, ed. The nature encyclopedia of the human genome, Vol. 5. MacMillan Publishers, London.
Lander, E. S., L. M. Linton, B. Birren et al. (256 co-authors). 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.
Lazzaro, B. P., and A. G. Clark. 2001. Evidence for recurrent paralogous gene conversion and exceptional allelic divergence in the Attacin genes of Drosophila melanogaster. Genetics 159:659–671.
Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Looman, C. 2003. The ABC of KRAB zinc finger proteins. Comprehensive summaries of Uppsala dissertations, Acta Universitatis Upsaliensis.
Margolin, J. F., J. R. Friedman, W. K. Meyer, H. Vissing, H. J. Thiesen, and F. J. Rauscher III. 1994. Krüppel-associated boxes are potent transcriptional repressor domains. Proc. Natl. Acad. Sci. USA 91:4509–4513.
Miller, J., A. McLachlan, and A. Klug. 1985. Repetitive zinc-binding domains in the protein transcription factor IIIA from Xenopus oocytes. EMBO J. 4:1609–1614.
Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929–936.
Schuh, R., W. Aichler, U. Gaul et al. (11 co-authors). 1986. A conserved family of nuclear proteins containing structural elements of Krüppel, a Drosophila segmentation gene. Cell 47:1025–1032.
Shannon, M., A. T. Hamilton, L. Gordon, E. Branscomb, and L. Stubbs. 2003. Differential expansion of zinc-finger transcription factor loci in homologous human and mouse gene clusters. Genome Res. 13:1097–1110.
Sharon, D., G. Glusman, Y. Pilpel, M. Khen, F. Gruetzner, T. Haaf, and D. Lancet. 1999. Primate evolution of an olfactory receptor cluster: diversification by gene conversion and recent emergence of pseudogenes. Genomics 61:24–36.
Suzuki, Y., and T. Gojobori. 1999. A method for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 16:1315–1328.
Swanson, W. J., R. Nielsen, and Z. Yang. 2003. Pervasive adaptive evolution in mammalian fertilization proteins. Mol. Biol. Evol. 20:18–20.
Tang, M., M. Waterman, and S. Yooseph. 2002. Zinc finger clusters and tandem gene duplication. J. Comp. Biol. 9:429–446.
Teshima, K. M., and H. Innan. 2004. The effect of gene conversion on the divergence between duplicated genes. Genetics 166:1553–1560.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
Venter, J. C., M. D. Adams, E. W. Myers et al. (274 co-authors). 2001. The sequence of the human genome. Science 291:1304–1351.
Walsh, J. B. 1987. Sequence-dependent gene conversion: Can duplicated genes diverge fast enough to escape conversion? Genetics 117:543–557.
Wolfe, S. A., L. Nekludova, and C. O. Pabo. 1999. DNA recognition by Cys2His2 zinc finger proteins. Annu. Rev. Biophys. Biomol. Struct. 3:183–212.
Wong, W. S. W., Z. Yang, N. Goldman, and R. Nielsen. 2004. Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. (in press).
Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555–556.
Yang, Z., R. Nielsen, N. Goldman, and A. M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449.
Yang, Z., and W. J. Swanson. 2002. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol. Biol. Evol. 19:49–57.