* Computational and Evolutionary Genomics, Center for Genomics Research, Academia Sinica, Taipei, Taiwan; and Department of Ecology and Evolution, University of Chicago
Correspondence: E-mail: whli{at}uchicago.edu.
Abstract |
---|
Key Words: protein evolution; protein length; orthologous proteins; paralogous proteins; eukaryotes
Introduction |
---|
On the other hand, an increase in peptide length may increase the energy cost of biosynthesis. A recent study of yeast genes suggested that natural selection favors shorter protein length for efficient synthesis (Akashi 2003). Studies using gene expression data showed a strong correlation between codon usage bias and transcript abundance, suggesting that natural selection tends to increase the speed of protein synthesis (Moriyama and Powell 1998; Pal, Papp, and Hurst 2001; Akashi 2003). Also, it has been reported that proteins in the parasitic eukaryote Encephalitozoon cuniculi tend to be shorter than their eukaryotic orthologs (Katinka et al. 2001). In addition, spontaneous deletion tends to occur more often than spontaneous insertion in DNA sequences (de Jong and Ryden 1981). Thus, it is unclear whether proteins in general tend to increase in length.
Because there seems to be no study of protein sequence-length evolution in eukaryotes, we conducted a genome-wide comparison of protein lengths across eukaryotic kingdoms, taking advantage of the increasing abundance of genomic data in eukaryotes.
Materials and Methods |
---|
The commonly preserved (shared) proteins were identified by using yeast protein sequences as queries in separate BlastP searches against the protein databases of nematode, Drosophila, human, and Arabidopsis. Orthologous proteins among the different species were defined by two criteria: a BlastP E value cutoff of e-10, and an alignable region between the two proteins greater than 50% of the longer protein. If more than one protein met the criteria, the protein with the highest hit score was chosen as the potential ortholog. Next, the bidirectional (reciprocal) best-hit method was applied to each of the other four species, and the proteins that failed to satisfy this criterion were removed from the data set of commonly preserved proteins.
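The filtering and reciprocal best-hit steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Hit` record, the function names, and the input layout are hypothetical, and the E value cutoff of 1e-10 is an assumed reading of the threshold stated in the text.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    """One BlastP hit (hypothetical record layout, for illustration only)."""
    query: str
    subject: str
    evalue: float
    score: float
    aligned_len: int   # length of the alignable region, in residues
    query_len: int
    subject_len: int

E_CUTOFF = 1e-10       # assumption: reading of the E value threshold in the text
MIN_COVERAGE = 0.5     # alignable region must exceed 50% of the longer protein

def passes(hit):
    """Apply the two ortholog criteria to a single hit."""
    longer = max(hit.query_len, hit.subject_len)
    return hit.evalue <= E_CUTOFF and hit.aligned_len > MIN_COVERAGE * longer

def best_hits(hits):
    """For each query, keep the passing hit with the highest score."""
    best = {}
    for h in hits:
        if passes(h) and (h.query not in best or h.score > best[h.query].score):
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}

def reciprocal_best_hits(forward, backward):
    """Keep pairs that are each other's best hit in both search directions."""
    fwd = best_hits(forward)
    bwd = best_hits(backward)
    return {q: s for q, s in fwd.items() if bwd.get(s) == q}
```

Running `reciprocal_best_hits` once per target species and intersecting the surviving yeast queries would give the set of proteins shared by all five genomes.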
The ancestral states of the proteins were reconstructed by using the proteins commonly shared by yeast, nematode, Drosophila, human, and Arabidopsis. For each set of orthologous protein sequences, a multiple sequence alignment was obtained using ClustalW with default parameters (Thompson, Higgins, and Gibson 1994). The length of the ancestral sequence was obtained from the multiple sequence alignment by the parsimony principle under the commonly accepted phylogeny of the five species, in which the plant lineage branched off before the divergence between the yeast and animal lineages, and the insects (arthropods) are closer to humans than to nematode worms (the coelomata hypothesis) (see Blair et al. [2002], Hedges [2002], and Hughes and Friedman [2004]). An amino acid position was assumed to be in the ancestral sequence if it was present in at least three species or if it was present in Arabidopsis and in at least one of the four other species. In this procedure, the Arabidopsis lineage has been given more weight because, in the phylogeny used, it is the most divergent among the five eukaryotes studied. If this assumption is wrong and the yeast lineage is actually closer to the plant lineage than to the animal lineage, it should have only a small effect on our inference because the three lineages are likely to be close to a trichotomy. Note that our inference makes no distinction between the coelomata hypothesis and the ecdysozoa hypothesis, which assumes that nematodes and insects form a clade and are equally distant from humans. Therefore, our inference holds under either hypothesis.
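The column rule for the ancestral sequence can be illustrated with a short sketch. The function name, the dictionary layout of the alignment, and the use of `-` as the gap character are assumptions made for this example; only the counting rule itself comes from the text.

```python
# Species order is fixed so that Arabidopsis is the last row of each column.
SPECIES = ("yeast", "nematode", "Drosophila", "human", "Arabidopsis")

def ancestral_length(alignment, gap="-"):
    """Count alignment columns inferred to be present in the ancestor.

    `alignment` maps each species name to its aligned sequence; all rows
    are assumed to have the same length, as in a ClustalW alignment.
    """
    length = 0
    for column in zip(*(alignment[s] for s in SPECIES)):
        present = [residue != gap for residue in column]
        n_present = sum(present)
        in_arabidopsis = present[-1]
        # Ancestral if present in at least three species, or in Arabidopsis
        # plus at least one of the four other species.
        if n_present >= 3 or (in_arabidopsis and n_present >= 2):
            length += 1
    return length
```

Comparing this count with the ungapped length of each present-day protein would then indicate whether a lineage has gained or lost residues relative to the inferred ancestor.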
The KOGs database includes orthologous and paralogous genes of eukaryotic species. Each group is associated with a conserved and specific function (Tatusov et al. 2003). To compare the sequence lengths of orthologous and paralogous proteins, we used the 1,252 proteins commonly preserved among yeast, nematode, Drosophila, human, and Arabidopsis as queries to search the corresponding KOG databases (a total of 4,852 KOGs) and identified the paralogs of each set of orthologous proteins. Take yeast as an example. We found 1,173 yeast proteins shared by the 1,252 proteins that have been commonly preserved among the five genomes and the KOGs database by identifying their accession numbers. These 1,173 proteins formed the set of the "orthologous" proteins that could be found in the KOGs database for the five genomes. Then, for each of the 1,173 proteins, we searched the KOGs database to find all other homologous yeast proteins and put them in the set of "paralogous proteins." If there were n yeast paralogous proteins in the set of 1,173 proteins and there were other yeast paralogous proteins in the KOGs database, then all these proteins were clustered into n groups according to the rule of best hits in the BlastP search. If a group had more than one protein, then their average length was used in later analyses.
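The grouping and averaging step might look like the sketch below. The text does not say whether a group's average includes the orthologous protein that defines the group; this example assumes it does not and averages only the additional paralogs. The function name and both input mappings are hypothetical.

```python
from collections import defaultdict
from statistics import fmean

def mean_paralog_length(best_seed, lengths):
    """Average paralog length per group.

    `best_seed` maps each additional paralog to the protein from the
    orthologous set that is its best BlastP hit (assumed to be precomputed);
    `lengths` maps protein identifiers to polypeptide lengths.
    """
    groups = defaultdict(list)
    for paralog, seed in best_seed.items():
        groups[seed].append(lengths[paralog])
    # A group with a single member simply keeps that member's length.
    return {seed: fmean(members) for seed, members in groups.items()}
```

Each value in the returned mapping can then be paired with the length of the corresponding orthologous protein for the ortholog versus paralog comparison.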
We also compared the lengths of the proteins commonly shared among yeast, Drosophila, and human with the proteins that were shared only by human and Drosophila but not yeast, which for simplicity are called derived proteins. The criteria for the BlastP search for commonly shared proteins were the same as described above.
Protein-length comparison was based on the lengths of polypeptides. Normality of the protein-length distribution was examined by the Kolmogorov-Smirnov (K-S) test; normality was rejected when the P value was smaller than 0.05. In addition, in view of the large variance of protein length, log10 transformation was applied to compress the range and to stabilize the variance, and then the K-S test was applied to test the normality. If the normality of either the original data or the log10-transformed data was accepted, the pairwise t-test was applied; otherwise, the pairwise Wilcoxon rank sum test was used.
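The decision procedure above can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than the authors' software: `ks_normality` and `pick_test` are hypothetical names, the normal distribution is fitted to the sample itself, and the P value uses the asymptotic Kolmogorov series, which is only approximate (and conservative when the mean and standard deviation are estimated from the same data).

```python
import math
from statistics import NormalDist, fmean, stdev

def ks_normality(values):
    """One-sample K-S statistic D and approximate P value against a fitted normal."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Largest gap between the fitted CDF and the empirical step function.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return d, 1.0  # the series converges poorly here; P is essentially 1
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

def pick_test(lengths, alpha=0.05):
    """Choose the paired test following the rule described in the text."""
    _, p_raw = ks_normality(lengths)
    _, p_log = ks_normality([math.log10(x) for x in lengths])
    if p_raw >= alpha or p_log >= alpha:
        return "paired t-test"
    return "Wilcoxon test"
```

Protein lengths are often closer to log-normal than normal, which is why the log10-transformed data get a second chance before the nonparametric test is used.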
Results |
---|
Discussion |
---|
Previous studies have suggested that present-day proteins have gone through several stages of evolution and that an evolving polypeptide chain may grow in length by insertion of residues into the chain or to its tail (Lupas, Ponting, and Russell 2001; Aravind et al. 2002; Trifonov and Berezovsky 2003). Empirical studies also suggested that insertion of a peptide segment can occasionally increase protein stability or even improve function (Matsuura et al. 1999; Chow et al. 2003; Claverie and Ogata 2003). Consistently, our results show that the lengths of proteins in eukaryotes are, on average, longer than those of E. coli proteins, extending the result by Zhang (2000). Surprisingly, we found conservation of protein sequence length since the common ancestor of yeast, nematode, Drosophila, human, and Arabidopsis. A simple explanation for the above observations is that although in general protein length had increased during the evolution from prokaryotes to eukaryotes, the length seems to have been largely optimized in the common ancestor of fungi, animals, and plants.
It has been proposed that the first protein domains evolved by recombination from a limited number of polypeptides (Soding and Lupas 2003), followed by genetic events such as point mutations, insertions, and deletions. Insertion (including internal duplication) expands the protein sequence, providing opportunities for additional function or functional improvement (Matsuura et al. 1999; Trifonov and Berezovsky 2003). A survey of genes in eukaryotes showed that internal duplications have occurred frequently in evolution, sometimes increasing the number of active sites and, thus, enhancing the protein function (see Li [1997]). However, our study showed no general tendency for eukaryotic proteins to increase in length; indeed, the chance of a decrease in length is as high as that of an increase. Because only 402 of the 1,252 commonly preserved proteins are encoded by essential (deletion lethal) genes in yeast, functional constraint may not be the primary factor for the conservation of sequence length. Rather, the conservation in sequence length is probably mainly caused by structural constraint.
Some studies have indeed pointed to the importance of structural constraint in protein evolution. For example, thermodynamic stability and folding kinetics were shown to exert pressure on the course of protein evolution (Dokholyan and Shakhnovich 2001). The limited diversity of protein domains was also suggested to be caused by structural constraint (Jones et al. 1998; Hou et al. 2003). Furthermore, the study of Yang, Gu, and Li (2003) on the relationship between protein dispensability and the rate of evolution suggested that structural constraints are more important than functional requirements in determining the rate of amino acid substitution in proteins. Finally, as is well known, studies of protein three-dimensional structure have shown that protein structure is much better conserved than sequence (Ponting and Russell 2002; Soding and Lupas 2003). Our observation of sequence-length conservation during the evolution of eukaryotes may be explained by protein-structure conservation because changes in sequence length may often cause changes in structure.
We also note that although paralogous proteins often undergo deletions and insertions probably because of relaxation in functional constraint, they show no general tendency to increase sequence length. Also, although there seems to be a tendency for paralogous proteins to decrease in sequence length, the tendency is very weak because it is significant only in one (human) of the five species studied (fig. 4). This is somewhat surprising in view of the facts that spontaneous deletion occurs more often than spontaneous insertion and that a shorter protein requires a lower cost of biosynthesis.
Interestingly, "newly evolved" or "derived" proteins are, on average, substantially longer than "old" proteins (fig. 5). It is not clear how this has happened, but we speculate on two possibilities. First, some of these proteins might have already been present in the common ancestor of yeast, Drosophila, and human but gained insertions and underwent a period of rapid amino acid change. Second, some of these proteins could have been derived from duplicate genes that had undergone gene elongation and rapid sequence changes. In both cases, the proteins have undergone so much sequence change that they can no longer be detected by the BlastP search using yeast proteins as queries. In any case, many of the new proteins might have gained new function, partly through sequence elongation. Whether these speculations have any merit requires further investigation.
Acknowledgements |
---|
Footnotes |
---|
References |
---|
Akashi, H. 2003. Translational selection and yeast proteome evolution. Genetics 164:1291–1303.
Aravind, L., R. Mazumder, S. Vasudevan, and E. V. Koonin. 2002. Trends in protein evolution inferred from sequence and structure analysis. Curr. Opin. Struct. Biol. 12:392–399.
Blair, J. E., K. Ikeo, T. Gojobori, and S. B. Hedges. 2002. The evolutionary position of nematodes. BMC Evol. Biol. 2:7.
Chow, C. C., C. Chow, V. Raghunathan, T. J. Huppert, E. B. Kimball, and S. Cavagnero. 2003. Chain length dependence of apomyoglobin folding: structure evolution from misfolded sheets to native helices. Biochemistry 42:7090–7099.
Claverie, J., and H. Ogata. 2003. The insertion of palindromic repeats in the evolution of proteins. Trends Biochem. Sci. 28:75–80.
de Jong, W. W., and L. Ryden. 1981. Causes of more frequent deletions than insertions in mutations and protein evolution. Nature 290:157–159.
Dokholyan, N. V., and E. I. Shakhnovich. 2001. Understanding hierarchical protein evolution from first principles. J. Mol. Biol. 312:289–307.
Hedges, S. B. 2002. The origin and evolution of model organisms. Nat. Rev. Genet. 3:838–849.
Hou, J., G. E. Sims, C. Zhang, and S. H. Kim. 2003. A global representation of the protein fold space. Proc. Natl. Acad. Sci. USA 100:2386–2390.
Hughes, A. L., and R. Friedman. 2004. Differential loss of ancestral gene families as a source of genomic divergence in animals. Proc. R. Soc. Lond. B Biol. Sci. 271(suppl. 3):S107–109.
Hughes, J. F., and J. M. Coffin. 2002. A novel endogenous retrovirus-related element in the human genome resembles a DNA transposon: evidence for an evolutionary link? Genomics 80:453–455.
Jones, S., M. Stewart, A. Michie, M. B. Swindells, C. Orengo, and J. M. Thornton. 1998. Domain assignment for protein structures using a consensus approach: characterization and analysis. Protein Sci. 7:233–242.
Kaessmann, H., S. Zollner, A. Nekrutenko, and W. H. Li. 2002. Signatures of domain shuffling in the human genome. Genome Res. 12:1642–1650.
Katinka, M. D., S. Duprat, E. Cornillot et al. (17 co-authors). 2001. Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature 414:450–453.
Kinch, L. N., and N. V. Grishin. 2002. Evolution of protein structures and functions. Curr. Opin. Struct. Biol. 12:400–408.
Krylov, D. M., Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.
Lander, E. S., L. M. Linton, B. Birren et al. (255 co-authors). 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.
Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Li, W. H., Z. Gu, H. Wang, and A. Nekrutenko. 2001. Evolutionary analyses of the human genome. Nature 409:847–849.
Lupas, A. N., C. P. Ponting, and R. R. Russell. 2001. On the evolution of protein folds: Are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J. Struct. Biol. 134:191–203.
Makalowski, W. 2000. Genomic scrap yard: how genomes utilize all that junk. Gene 259:61–67.
Matsuura, T., K. Miyai, S. Trakulnaleamsai, T. Yomo, Y. Shima, S. Miki, K. Yamamoto, and I. Urabe. 1999. Evolutionary molecular engineering by random elongation mutagenesis. Nat. Biotechnol. 17:58–61.
Moriyama, E. N., and J. R. Powell. 1998. Gene length and codon usage bias in Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli. Nucleic Acids Res. 26:3188–3193.
Pal, C., B. Papp, and L. D. Hurst. 2001. Does the recombination rate affect the efficiency of purifying selection? The yeast genome provides a partial answer. Mol. Biol. Evol. 18:2323–2326.
Ponting, C. P., and R. R. Russell. 2002. The natural history of protein domains. Annu. Rev. Biophys. Struct. 31:45–71.
Soding, J., and A. N. Lupas. 2003. More than the sum of their parts: on the evolution of proteins from peptides. BioEssays 25:837–846.
Tatusov, R. L., N. D. Fedorova, J. D. Jackson et al. (17 co-authors). 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41–54.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
Trifonov, E. N., and L. N. Berezovsky. 2003. Evolutionary aspects of protein structure and folding. Curr. Opin. Struct. Biol. 13:110–114.
Venter, J. C., M. D. Adams, E. W. Myers et al. (274 co-authors). 2001. The sequence of the human genome. Science 291:1304–1351.
Yang, J., Z. Gu, and W. H. Li. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20:772–774.
Zhang, J. 2000. Protein-length distributions for the three domains of life. Trends Genet. 16:107–109.