*Department of Ecology and Evolution
Department of Statistics, University of Chicago
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The rate of gene duplication in a genome is also of great interest. This type of study is possible only when the whole genome data is available. Lynch and Conery (2000)
estimated the gene duplication rates in the yeast, Drosophila, and C. elegans genomes using the synonymous site changes (KS) as the time scale. However, it is well known that codon usage is highly biased in some genes in these organisms (Ikemura 1982
; Akashi, Kliman, and Eyre-Walker 1998
). A negative correlation between synonymous rate (KS) and strength of codon usage bias in Drosophila suggests that in some genes synonymous changes are not neutral (Sharp and Li 1989
; Moriyama and Hartl 1993
), though Dunn, Bielawski, and Yang (2001)
argued against the existence of this correlation. Therefore, the KS value might not reflect the real age of a gene duplication. A combination of KS and the genetic distances in intron and flanking regions might be more informative.
The relatively good quality of genomic sequences and concomitant annotation for yeast, Drosophila, and C. elegans make it possible for us to investigate the above questions in these genomes. However, the presence of same genes with different names and the existence of alternative splicing forms in the database make it difficult to study the extent of gene duplication in a genome. Moreover, retrotranscriptase (RT) and protein parts derived from repetitive elements (REs) might mislead the identification of homologous proteins. For these reasons, it is important to clean the database. In this paper, after carrying out a detailed cleaning procedure for the protein databases of these three genomes, we asked two questions: How many gene families are there in each genome? How often has gene duplication occurred in the recent past in each organism? We defined two simple homology criteria by improving the criterion adopted by Rost (1999)
. Using the new criteria for identifying homologous genes and the single-linkage algorithm for clustering, we estimated the number of gene families in each of the three genomes. The frequency of recent gene duplication was investigated by using gene pairs with a small KS. We excluded the gene pairs with possible gene conversion and codon usage bias by comparing KS with the genetic distance in intron and flanking regions. The effect of codon usage on KS in yeast was further studied.
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Caenorhabditis elegans: http://www.sanger.ac.uk/Projects/C_elegans/wormpep/ Wormpep release 40 was used. There were 19,730 protein sequences in the database, of which 48 did not have genomic position information and 22 did not have corresponding coding sequences (cds). We used the rest of the 19,660 protein sequences in our analysis.
Yeast: ftp://ncbi.nlm.nih.gov/genbank/genomes/S_cerevisiae/ We used the NCBI October 2000 version, which was part of the Reference Sequence (RefSeq) project. The annotation for this version was based on the Saccharomyces Genome Database in the Stanford genomic resources (SGD, http://genome-www.stanford.edu/Saccharomyces/). A total of 6,297 protein sequences were in the database and used in our analysis. Information for block duplications and gene pairs within the blocks was downloaded from the website http://www.gen.tcd.ie/khwolfe/ (Wolfe and Shields 1997
).
Drosophila: ftp://ncbi.nlm.nih.gov/genbank/genomes/D_melanogaster/ Release 2, October 2000 from NCBI was used. A total of 14,335 protein sequences were in the database.
The corresponding cds and genomic sequences for all three genomes were also downloaded from the above websites.
The data for each genome was processed as described below.
First Round Grouping
In each of the three genomes studied, every protein was used as the query to search against all other proteins in the database using FASTA (E = 10). Our criteria for two proteins to form a link (i.e., to be in the same family) are (1) the FASTA-alignable region between the two proteins should be longer than 80% of the longer protein, and (2) the identity between the two proteins (I) should be I 30% if the alignable region is longer than 150 a.a. and I
0.01n + 4.8L-0.32(1 + exp(-L/1000)) (Rost 1999
) if otherwise, where L is the alignable length between the two proteins. Rost's formula was derived from an empirical study, which suggests that a higher I value was needed for shorter proteins. We use n = 6, which makes the formula continuous at L = 150. We call it a hit if two proteins form a link. The single-linkage algorithm was used to group proteins into clusters, i.e., if protein A hits protein B and protein B hits protein C, we group proteins A, B, and C together, regardless of whether protein A hits protein C or not.
Cleaning of Same Genes with Different Names
Occasionally, more than one name is assigned to the same gene and these names are presented as different genes in the database. Such a situation can be detected by comparing their sequence coordinates. In each of such cases, only one copy was kept in the analysis.
Isoform Cleaning
Based on gene and exon annotation, we define that two genes are isoforms if their shared coding region is longer than 20% of the entire coding region of the shorter gene. We delete one of the two isoforms from the database as follows: Delete the shorter one if both are singletons; delete the one that is a singleton if the other belongs to a multigene family; delete the shorter one if the two isoforms form a two-member cluster; delete the shorter one if both are from the same gene family with more than two members and they have the same hits; and delete the one with fewer hits if they belong to the same cluster but their hits are not all the same. We keep both proteins if they belong to different multigene families.
RE Cleaning
Each protein was used as the query to search against the RE database for the same organism using FASTAX (E = 10-5). We delete the whole protein sequence from the database if the part of the protein hit by an RE is longer than 80% of the protein itself. We delete only the part that was hit by an RE if it is shorter than 80% of the protein.
Second Round Grouping
We repeat the steps in the first round grouping with the cleaned database. The new clusters are regarded as gene families.
Synonymous-Nonsynonymous Substitution Calculation
If two proteins are linked to each other, the FASTA-alignable regions of the two proteins are realigned using clustalW, and the corresponding coding regions of the genes are aligned based on the protein alignment. The number of substitutions per synonymous site (KS) and nonsynonymous site (KA) are calculated using PAML (Yang and Nielsen 2000
; the default parameters are used in the calculation).
Genetic Distance in Intron and Flanking Regions
Sequences of entire introns and flanking 150 bp of cds (both upstream and downstream) are extracted using gene annotation data. ClustalW is used to do the alignment, and genetic distances are calculated using Kimura's (1980)
two-parameter method when the divergence is less than 5%, but Tamura and Nei's (1993)
six-parameter method is used for more divergent sequences.
Codon Usage Bias Calculation
The effective number of codons (ENC) is calculated for each gene using the CodonW package (ftp://molbiol.ox.ac.uk/cu/codonW.tar.Z) and is used as a criterion for the strength of codon usage bias: the smaller the ENC, the stronger the codon usage bias.
The results of the above analysis were stored in a MySQL database.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
|
|
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Importance of Database Cleaning
We conducted an extensive database cleaning before we did protein family grouping. The presence of same genes with different names and the existence of isoforms (mainly from alternative splicing) in the database inflate the counts of protein families, whereas protein parts derived from REs may cause nonhomologous genes to be clustered into the same family. The database cleaning was found to be very necessary. For example, the number of protein families in the original Drosophila database was 1,094, but it was reduced to 674 after database cleaning. The selection of the database for study is also of great importance. The protein and related cds database in NCBI for C. elegans was used at the beginning of our study. However, it was found that the database was full of redundant protein sequences, many gene annotations were not consistent with cds sequences, and the translation of many cds were not the same with the corresponding protein sequences. We switched to Wormpep for its better quality.
Paucity of Young Duplicate Genes in Drosophila
It is interesting that the gene duplication rate is very low in Drosophila. This conclusion comes from two lines of evidence: (1) the number of gene pairs with a small KS (KS < 0.01) in Drosophila is much smaller than those in yeast and C. elegans (table 5
), and (2) the number of gene families in Drosophila is only somewhat larger than that in yeast but much smaller than that in C. elegans (table 3
), despite the fact that Drosophila has a genome size and protein number much larger than those in yeast and similar to those in C. elegans. However, this conclusion should be taken with caution because the sequencing strategy for Drosophila was different from the other two organisms. It is possible that the shotgun strategy used to sequence the Drosophila genome could not distinguish between very recently duplicated genes, which have KS and KA equal or close to 0. The fact that we still see some gene pairs with KS = KA = 0 in Drosophila genome and that the number of gene pairs with an intermediate KS is still much smaller in Drosophila than in the other two organisms (table 2
) suggests that the sequencing strategy might not be a major factor. In other words, Drosophila might indeed have had a much lower duplication rate than yeast and C. elegans, at least in the recent past.
It has been shown that block duplication occurs more frequently in yeast and C. elegans than in Drosophila (Friedman and Hughes 2001
). This might largely account for the higher duplication rates in yeast and C. elegans. For example, we investigated the most recent gene duplications in yeast and found that among gene pairs with KS < 0.01, there are at least eight duplication events involving more than one gene pair with KS < 0.01. The largest block has seven embedded gene pairs with KS < 0.01. There are other possible explanations. For example, a higher evolutionary rate for synonymous site, a higher deletion rate for duplicate genes, or a lower gene conversion rate in Drosophila than the other two organisms might lead to the observed results too. Further investigation is necessary for distinguishing among these possibilities.
Our estimates of gene pairs with KS < 0.01 in the three genomes are different from those of Lynch and Conery (2000)
(table 5 ). Lynch and Conery (2000)
found 10 pairs with KS < 0.01 in Drosophila but only six pairs are indicated in their current website. Two of these six pairs are the same as ours. In each of the remaining four pairs in our result, either one or both genes do not exist in Release 1 of the fly database from NCBI, which was used by Lynch and Conery (2000)
. Among the remaining four pairs in their result, there are two pairs for each of which either one or both genes do not exist in Release 2 of the fly database from NCBI, and for each of the two other pairs, the two proteins have very different lengths: 69 a.a. and 220 a.a. for one pair and 210 a.a. and 580 a.a. for the other. For yeast, if we count only the gene pairs in small gene families (size <6), as was done in Lynch and Conery (2000)
, there are 29 pairs with KS < 0.01 in our result (32 pairs before excluding gene pairs with both genetic distances in intron and flanking regions >0.02). Lynch and Conery found 32 pairs with KS < 0.01. For C. elegans we used WormPep40 from the Sanger Centre instead of the database from NCBI. There are 76 and 77 pairs with KS < 0.01 in small (size <6) and large (size >5) gene families, respectively. After excluding gene pairs with both genetic distances in intron and flanking regions >0.02, these numbers became 73 and 74, respectively. In comparison, there were 164 pairs with KS < 0.01 in small gene families (size <6) in Lynch and Conery (2000)
. Although we used gene pairs with KS < 0.01 in all gene families to estimate the gene duplication rates, our estimates are lower than their results for Drosophila and C. elegans and similar for yeast if we exclude the putative helicase protein family.
Common Protein Families in the Three Organisms
It has been estimated that about 550 functional chemoreceptors (olfactory receptors) are scattered in the C. elegans genome (Robertson 1998
). These proteins help nematodes detect different kinds of chemicals. Chemoreceptors can be divided into different protein families, among which srh, str, stl, and srd are large ones. The srh gene family was estimated to have 214 members (Robertson 2000
), most of which fall into our second largest family in C. elegans, which has 181 members. Two closely related protein families, the str and stl (str-like protein) protein families, constitute our largest gene family which has 242 members (the sum of proteins in these two families was estimated to be 240, Robertson 1998
). The three other most common gene families in nematode (table 4
) are all hypothetical protein families. Their function and phylogenetic relationship need to be investigated in the future. It is interesting to note that among the five largest gene families in yeast there are two (seripauperins and putative helicases) located at the end of the chromosomes. Seripauperin genes encode serine-poor relatives of serine-rich proteins. Members in the putative helicase gene family are very similar to each other. Transcriptions of these genes are not detected under normal culture conditions (Viswanathan et al. 1994
; Yamada et al. 1998
). However, expression of putative helicases can be detected under some stress conditions (Yamada et al. 1998
). Another common gene family in yeast is also a stress response gene family: heat shock proteins. Most of the proteins within this family are cell stress chaperones. The remaining two of the five largest protein families are both membrane proteins. Although yeast has a gene duplication rate as high as that in C. elegans and much higher than that in Drosophila, there are no protein families in yeast that are as large as those in the other two organisms. Like olfactory receptors in C. elegans, cuticle proteins represent two out of the five largest protein families in Drosophila. The P450 proteins have been divided into two protein families (Tijet, Helvig, and Feyereisen 2001
), one of which is among the five largest protein families in Drosophila (table 4
), and the other is among the 10 most common protein families in flies (data not shown). The other two largest protein families in Drosophila are trypsin and GTP-binding proteins, which are involved in metabolic and regulatory pathways.
Nonneutral Evolution at Synonymous Sites in Yeast
As codon usage is strongly biased in many yeast genes (Ikemura 1982
), we now consider whether codon usage bias can affect the estimation of KS in yeast duplicate genes. We used gene pairs within the same duplicate block to investigate the relationship between KS and codon usage bias, which is negatively correlated with the ENC in a gene. In principle, gene pairs within a block should have the same age, but we noted that those pairs with a small ENC value (implying strong codon usage bias) tended to have a small KS value (data not shown). A similar pattern was found by Friedman and Hughes (2001)
. These results suggest that synonymous sites in genes with strong codon usage bias are under selective constraints. We now compare the KS values with the numbers of substitutions per site (K) in both the upstream- and downstream-flanking 150-bp regions. We consider only gene pairs with KS < 0.25 in small gene families (size <6) because it is easier to obtain reliable KS and K when they are not large and because if the family size is larger than six, there may be too many nonindependent pairs. The result is shown in figure 1
. For all gene pairs with average ENC
32, the KS values are much smaller than the average K values in the flanking regions, whereas for the majority of gene pairs with average ENC
32, the KS values are similar to the average K values in the flanking regions. This suggests that for a duplicate gene pair with strong codon usage bias, the KS value does not increase linearly with the age of the gene duplication. In the above study of recent gene duplication rates, we considered only gene pairs with KS < 0.01 and excluded those gene pairs whose genetic distances in intron and flanking regions were >0.02. This procedure reduced the effect of codon usage bias on KS (table 5
) and should give a more accurate estimate of the rate of gene duplication.
|
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Keywords: gene families
gene duplication rate
database cleaning
codon usage bias
Address for correspondence and reprints: Wen-Hsiung Li, Department of Ecology and Evolution, University of Chicago, 1101 East 57th street, Chicago, Illinois 60637. whli{at}uchicago.edu
.
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Akashi H., R. M. Kliman, A. Eyre-Walker, 1998 Mutation pressure, natural selection, and the evolution of base composition in Drosophila Genetica 102/103:49-60
The Arabidopsis Genome Initiative, 2000 Analysis of the genome sequence of the flowering plant Arabidopsis thaliana Nature 408:796-815.[ISI][Medline]
Brosius J., 1999 Genome were forged by massive bombardments with retroelements and retrosequences Genetica 107:209-238[ISI][Medline]
Doolittle R. F., 1986 Of URFs and ORFs: a primer on how to analyze derived amino acid sequences University Science Book, Mill Valley, Calif.
. 1995 The multiplicity of domains in protein Annu. Rev. Biochem 64:287-314[ISI][Medline]
Dunn K. A., J. P. Bielawski, Z. Yang, 2001 Substitution rates in Drosophila nuclear genes: implications for translational selection Genetics 157:295-305
Friedman R., A. L. Hughes, 2001 Gene duplication and the structure of eukaryotic genome Genome Res 11:373-381
Ikemura T., 1982 Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes. Differences in synonymous codon choice patterns of yeast and Escherichia coli with reference to the abundance of isoaccepting transfer RNAs J. Mol. Biol 158:573-597[ISI][Medline]
Kimura M., 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences J. Mol. Evol 16:111-120[ISI][Medline]
Krogh A., M. Brown, I. S. Mianm, K. Sjolander, D. Haussler, 1994 Hidden markov models in computational biology: applications to protein modeling J. Mol. Biol 235:1501-1531[ISI][Medline]
Li W.-H., 1997 Molecular evolution Sinauer, Sunderland, Mass
Lynch M., J. S. Conery, 2000 The evolutionary fate and consequences of duplicate genes Science 290:1151-1155
Makalowski W., 2000 Genome scrap yard: how genomes utilize all that junk Gene 259:61-67[ISI][Medline]
Moriyama E. N., D. L. Hartl, 1993 Codon usage bias and base composition of nuclear genes in Drosophila Genetics 134:847-858
Nekrutenko A., W.-H. Li, 2001 Transposable elements are found in a large number of human protein coding regions Trends Genet 17:619-621[ISI][Medline]
Ohno S., 1970 Evolution by gene duplication Springer-Verlag, Berlin
Robertson H. M., 1998 Two large families of chemoreceptor genes in the nematodes Caenorhabditis elegans and Caenorhabditis briggsae reveal extensive gene duplication, diversification, movement, and intron loss Genome Res 8:449-463
. 2000 The large srh family of chemoreceptor genes in Caenorhabditis nematodes reveals processes of genome evolution involving large duplications and deletions and intron gains and losses Genome Res 10:192-203
Rost B., 1999 Twilight zone for protein sequences alignments Protein Eng 12:85-94
Rubin G. M., M. D. Yandell, J. R. Wortman, (54 co-authors) 2000 Comparative genomics of the eukaryotes Science 287:2204-2215
Seoighe C., K. H. Wolfe, 1999 Updated map of duplicated regions in the yeast genome Gene 238:253-261[ISI][Medline]
Sharp P. M., W.-H. Li, 1989 On the rate of DNA sequence evolution in Drosophila J. Mol. Evol 28:398-402[ISI][Medline]
Sonnhammer E. L. L., S. R. Eddy, R. Durbin, 1997 Pfam: a comprehensive database of protein domain families based on seed alignments Proteins 28:405-420[ISI][Medline]
Tamura K., M. Nei, 1993 Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees Mol. Biol. Evol 10:512-526[Abstract]
Tijet N., C. Helvig, R. Feyereisen, 2001 The cytochrome P450 gene superfamily in Drosophila melanogaster: annotation, intron-exon organization and phylogeny Gene 262:189-198[ISI][Medline]
Viswanathan M., G. Muthukumar, Y. S. Cong, J. Lenard, 1994 Seripauperins of Saccharomyces cerevisiae: a new multigene family encoding serine-poor relatives of serine-rich proteins Gene 148:149-153[ISI][Medline]
Wolfe K. H., D. C. Shields, 1997 Molecular evidence for an ancient duplication of the entire yeast genome Nature 387:708-713[ISI][Medline]
Yamada M., N. Hayatsu, A. Matsuura, F. Ishikawa, 1998 Y'-Help1, a DNA helicase encoded by the yeast subtelomeric Y' element, is induced in survivors defective for telomerase J. Biol. Chem 255:335-345
Yang Z., R. Nielsen, 2000 Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models Mol. Biol. Evol 17:32-43