Department of Biology, University of Ottawa, Ottawa, Ontario, Canada
Correspondence: E-mail: dhickey{at}uottawa.ca.
Abstract
Key Words: comparative genomics, angiosperm, nucleotide, amino acid
Introduction
In this study, we compared homologous gene pairs from two species of flowering plants, Oryza sativa (rice) and Arabidopsis thaliana. Because these two species diverged less than 200 MYA, many homologous sequences from the two genomes are unambiguously alignable. Moreover, the level of amino acid sequence divergence between homologous proteins is relatively low, allowing us to gauge the patterns of amino acid substitution. Finally, there is a wide variation in the nucleotide contents of the rice genes: some closely resemble their Arabidopsis homologs in G+C content, whereas others have significantly elevated levels of G+C relative to their homologs (Carels and Bernardi 2000). Because all of the genes diverged from their common ancestral sequences at the same point in evolutionary time, this provides us with a "controlled" evolutionary experiment, enabling us to do a comparative study of two sets of rice genes that are evolving under contrasting evolutionary constraints.
Materials and Methods
A total of 26,178 protein-coding sequences from A. thaliana (from five chromosomes) were downloaded from the National Center for Biotechnology Information (NCBI) FTP server (ftp://ftp.ncbi.nih.gov/genbank/genomes/A_thaliana/). After the sequences had been passed through CodonW for a codon integrity check and genes shorter than 75 codons had been removed, a total of 25,625 Arabidopsis coding sequences remained for analysis. Protein sequences of Arabidopsis were also obtained by translating the coding sequences with the EMBOSS transeq program.
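The filtering step can be illustrated with a minimal Python sketch. The source does not describe exactly which checks CodonW applies, so the integrity rules below (length divisible by three, ATG start, terminal stop codon, no internal stops, unambiguous bases) are assumptions, as is the choice to count the 75-codon minimum without the stop codon. The function name `passes_filter` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def passes_filter(cds, min_codons=75):
    """Return True if a coding sequence looks intact and is long enough.

    Assumed integrity rules (not taken from the paper): length is a
    multiple of three, only A/C/G/T, starts with ATG, ends with a stop
    codon, and contains no internal stop codon.
    """
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        return False
    if set(cds) - set("ACGT"):
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return False
    # Assumption: the length threshold excludes the terminal stop codon.
    return len(codons) - 1 >= min_codons
```

Applying `passes_filter` to each downloaded sequence and keeping those that return True would mimic the reduction from 26,178 to 25,625 sequences in spirit, although the exact counts depend on the real CodonW criteria.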
Identification and Comparison of Homologous Sequences
Homologous protein pairs between O. sativa and A. thaliana were identified by performing BlastP searches (Altschul et al. 1990) of the rice protein sequences against Arabidopsis sequences with a cutoff expect score of 1e-20. When a rice protein had more than one Arabidopsis protein hit, the pair having the most significant expect score was retained. In all, 4,447 homologous pairs were identified.
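As an illustration of the best-hit rule, the sketch below keeps, for each rice query, the Arabidopsis subject with the smallest expect value at or below the cutoff. It assumes the BLAST results are available as tab-separated lines in the standard 12-column tabular layout (query id in column 1, subject id in column 2, E-value in column 11); the paper does not state the output format used, and the function name `best_hits` is hypothetical.

```python
def best_hits(lines, max_evalue=1e-20):
    """Map each query to its most significant subject hit.

    `lines` is any iterable of tab-separated BLAST tabular rows
    (assumed 12-column layout, E-value in the 11th column).
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > max_evalue:
            continue  # weaker than the cutoff expect score
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best
```

Ties in E-value are resolved in favor of the first row encountered, which is a design choice of this sketch rather than something the source specifies.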
After the homologous protein sequences had been identified, the corresponding nucleotide sequences were scored for nucleotide content. In this study, we ranked the rice homologs by their G+C content. We then compared the group of 1,000 rice genes with the highest G+C content (the "high G+C" class) to their homologs in the Arabidopsis genome. We also performed a parallel comparison between the group of 1,000 rice genes having the lowest G+C content (the "low G+C" class) and their homologs.
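The ranking step might look like the following sketch, which computes the G+C fraction of each coding sequence and returns the identifiers at the two extremes. The function names and the dictionary input (gene identifier mapped to nucleotide sequence) are assumptions made for illustration.

```python
def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_classes(genes, n=1000):
    """Return (high_gc_ids, low_gc_ids), each holding n gene identifiers.

    `genes` maps a gene identifier to its coding sequence. Genes are
    sorted by ascending G+C content; the last n form the high class
    and the first n form the low class.
    """
    ranked = sorted(genes, key=lambda gene_id: gc_content(genes[gene_id]))
    return ranked[-n:], ranked[:n]
```

With 4,447 homologous pairs and n = 1,000 the two classes cannot overlap; for much smaller inputs the caller would need to check that 2n does not exceed the number of genes.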
Identifying Amino Acids for GC-rich and AT-rich Codons
In the manner introduced by Foster, Jermiin, and Hickey (1997), we partitioned the codon table into three groups: codons that were GC-rich at the first two codon positions, codons that were AT-rich at the first two codon positions, and unbiased codons. The GC-rich codons encode glycine, alanine, arginine, and proline (GARP). The AT-rich codons encode phenylalanine, tyrosine, methionine, isoleucine, asparagine, and lysine (FYMINK). The unbiased codons fill two quadrants of the rearranged codon table, and they encode serine (S), threonine (T), cysteine (C), and tryptophan (W), as well as valine (V), leucine (L), glutamic acid (E), aspartic acid (D), histidine (H), and glutamine (Q).
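The three-way partition can be expressed compactly in code. The sketch below assigns each of the 20 standard amino acids to the GARP, FYMINK, or intermediate group, following the grouping described above, and reports the fraction of a protein in each group. The function name `composition` and the decision to ignore non-standard symbols are assumptions of this sketch.

```python
GARP = set("GARP")                 # encoded by GC-rich codons
FYMINK = set("FYMINK")             # encoded by AT-rich codons
INTERMEDIATE = set("STCWVLEDHQ")   # encoded by unbiased codons

def composition(protein):
    """Fraction of residues in each codon-bias group.

    Symbols outside the 20 standard amino acids (for example X or *)
    are ignored, which is an assumption of this sketch.
    """
    counts = {"GARP": 0, "FYMINK": 0, "intermediate": 0}
    for aa in protein.upper():
        if aa in GARP:
            counts["GARP"] += 1
        elif aa in FYMINK:
            counts["FYMINK"] += 1
        elif aa in INTERMEDIATE:
            counts["intermediate"] += 1
    total = sum(counts.values())
    if total == 0:
        return {group: 0.0 for group in counts}
    return {group: count / total for group, count in counts.items()}
```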
Results
In addition to showing simple differences in amino acid compositions between the homologous protein sequences, we wanted to investigate the patterns of amino acid substitution during the course of their evolutionary divergence. To do this, we aligned the homologous sequence pairs, and we then concatenated these alignments. The aligned sites can be classified as invariant (where the same amino acid appears in the rice and Arabidopsis sequences) or variant (where there is a difference between the two sequences). Because it is only these latter sites that contain information about sequence divergence, we recalculated the amino acid frequencies for these sites only. The results for the high G+C rice genes and their Arabidopsis homologs are shown in figure 2A. In this case, there is a twofold increase in the proportion of GARP amino acids in the rice sequences and an even greater proportional decrease in FYMINK amino acids. Not only does a large average difference exist between the two concatenated sequences, but a consistent trend is also seen among individual homologous gene pairs. For instance, 971 out of the 1,000 high G+C rice genes have higher GARP levels than their Arabidopsis homologs, and this trend is highly significant (P < 0.00001 in a one-tailed, paired-sample t-test). There are also consistent differences for all of the individual amino acids, within both the GARP and the FYMINK groups of amino acids (fig. 2B). Some of these frequency changes for individual amino acids are quite dramatic. For instance, the rice genes have a threefold increase in the proportion of alanine (A) at the variant sites and a twofold increase in arginine (R). They show correspondingly large (more than twofold) decreases in isoleucine (I), asparagine (N), and lysine (K). The differences in amino acid composition are highly significant (P < 0.001) for all but three of the 20 pairwise comparisons. The exceptions are cysteine (C), tryptophan (W), and leucine (L).
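The restriction to variant sites can be sketched as follows. Given two aligned protein strings of equal length, the function counts the residues of each species only at columns where the two residues differ. Skipping columns that contain a gap is an assumption, since the source does not say how gaps were treated, and the name `variant_site_counts` is hypothetical.

```python
from collections import Counter

def variant_site_counts(aln_rice, aln_arab, gap="-"):
    """Count amino acids at variant sites of a pairwise alignment.

    Returns (rice_counts, arabidopsis_counts). Invariant columns and
    columns containing a gap in either sequence are skipped.
    """
    if len(aln_rice) != len(aln_arab):
        raise ValueError("aligned sequences must have equal length")
    rice_counts, arab_counts = Counter(), Counter()
    for rice_aa, arab_aa in zip(aln_rice, aln_arab):
        if gap in (rice_aa, arab_aa) or rice_aa == arab_aa:
            continue
        rice_counts[rice_aa] += 1
        arab_counts[arab_aa] += 1
    return rice_counts, arab_counts
```

Because `Counter` objects can be added, the per-pair counts can be summed over all gene pairs, which gives the same totals as counting over the concatenated alignment.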
It should be noted that it takes two nucleotide substitutions (at both the first and second codon positions) to achieve an amino acid exchange between the GARP and FYMINK amino acids. This means that, during a period of increasing G+C content in the rice sequences, many of the substitutions involve "intermediate" amino acids; that is, amino acids that are encoded by codons with intermediate nucleotide content. For instance, as the rice genes become more G+C rich, they are expected to gain GARP amino acids by single nucleotide substitutions from the pool of codons with intermediate nucleotide content. At the same time, they will lose FYMINK amino acids because of mutations that change the latter codons into codons of intermediate nucleotide content. Thus, these intermediate codons act as the "flow-through" from one extreme to the other. This effect is also illustrated in figure 3. For instance, we see that the huge increase in alanine (A) in the rice sequences is not primarily the result of direct substitution from the FYMINK group but of substitution from other amino acids such as serine (S) (shown in blue). Likewise, the greatest single loss of isoleucine (I) in the rice sequences is to the G+C-intermediate valine (V, shown in red). More generally, we can see that exchanges between alanine in the rice sequences and the intermediate group of V, L, E, D, H, and Q (shown in orange in fig. 3) result in a net increase of 163 alanines (239 − 76 = 163). We used Fisher's exact test to calculate the significance of these asymmetries. All of the differences mentioned above are highly statistically significant (P < 0.001) and, in fact, the gain of alanine from all other amino acids is significant (P < 0.05) except for tyrosine (Y), cysteine (C), and tryptophan (W). In other words, alanine becomes a "sink" in the high G+C rice genes.
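The net-exchange arithmetic can be made concrete with a small sketch. It tabulates an exchange matrix from an aligned pair, treating each variant column as an Arabidopsis residue paired with a rice residue, and then computes the gain, loss, and net change for one target amino acid against a chosen set of partners. A pairwise comparison has no ancestral state, so "gain" here simply means that the target appears in rice where a partner appears in Arabidopsis; this reading, the gap handling, and the function names are assumptions. The significance test itself (Fisher's exact test in the paper) is not reproduced.

```python
from collections import Counter

def exchange_matrix(aln_arab, aln_rice, gap="-"):
    """Count (arabidopsis_aa, rice_aa) pairs at variant, ungapped columns."""
    matrix = Counter()
    for arab_aa, rice_aa in zip(aln_arab, aln_rice):
        if arab_aa != rice_aa and gap not in (arab_aa, rice_aa):
            matrix[(arab_aa, rice_aa)] += 1
    return matrix

def net_gain(matrix, target, partners):
    """Return (gained, lost, net) for `target` in the rice sequences.

    gained: columns with a partner in Arabidopsis and target in rice.
    lost:   columns with target in Arabidopsis and a partner in rice.
    """
    gained = sum(matrix[(p, target)] for p in partners)
    lost = sum(matrix[(target, p)] for p in partners)
    return gained, lost, gained - lost
```

Under this convention, the figure quoted above corresponds to `net_gain(matrix, "A", "VLEDHQ")` returning 239 gained, 76 lost, and a net of 163 for the high G+C class.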
In contrast to these findings, when we constructed a parallel exchange matrix (not shown) for the low G+C rice genes and their Arabidopsis homologs, we found no evidence of asymmetry in the patterns of amino acid substitution. Thus the asymmetric pattern of protein evolution is correlated with the changes in nucleotide content among the rice genes.
Possible Sources of Compositional Bias in Rice Genes and Their Encoded Proteins
Although our primary purpose was to explore the effects of mutational bias on the patterns of protein evolution, we also wished to infer the causes of this variation in nucleotide content between rice genes. First, we wished to reconcile the reports of Carels and Bernardi (2000), who state that there are two classes of genes in plants (one class being G+C-rich), and of Wong et al. (2002), who find that there is a gradient of G+C content along individual rice genes. For instance, it might be possible that the high G+C genes had increased levels of these nucleotides at their 5' ends only, resulting in an amino acid bias that is concentrated at the amino terminus of the encoded proteins. We show the results for the patterns of amino acid composition in the high G+C and low G+C rice genes (fig. 4A). These results at the protein level reflect the underlying patterns of nucleotide composition. They illustrate that all rice genes tend to have especially elevated levels of G+C-rich codons (encoding the G, A, R, and P amino acids) at their 5' ends but that the high G+C class is characterized by a tendency to have this elevated level extend over the entire coding sequence. In summary, we found that the differences in nucleotide composition between rice genes are caused by a combination of a gradient along the gene length (as noted by Wong et al. [2002]) and an overall average difference between the genes (as noted by Carels and Bernardi [2000]). Neither the compositional gradient along the coding sequence length nor the bimodal distribution of nucleotide composition among genes is seen in the Arabidopsis genome.
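One way to visualize such a gradient is to split each protein into equal-width bins from the amino terminus to the carboxyl terminus and compute the GARP fraction in each bin. The source does not describe how the positional profile in figure 4A was computed, so the binning scheme, the default of ten bins, and the function name `garp_profile` are all assumptions of this sketch.

```python
def garp_profile(protein, n_bins=10):
    """GARP fraction in each of n_bins equal-width segments of a protein.

    Bin 0 is the amino-terminal end. Proteins shorter than n_bins
    residues are rejected because some bins would be empty.
    """
    protein = protein.upper()
    if len(protein) < n_bins:
        raise ValueError("protein is shorter than the number of bins")
    profile = []
    for i in range(n_bins):
        start = i * len(protein) // n_bins
        end = (i + 1) * len(protein) // n_bins
        window = protein[start:end]
        profile.append(sum(aa in "GARP" for aa in window) / len(window))
    return profile
```

Averaging these profiles over all genes in a class would show whether the elevated GARP level is confined to the first bins or extends along the whole coding sequence.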
Discussion
Our results show a clear correlation between the variations in nucleotide composition of different rice genes and the evolutionary changes in the amino acid composition of their encoded proteins. Such a correlation could reflect either a primary effect at the level of nucleotide bias that produces a secondary effect at the protein level or, alternatively, selection for amino acid content at the protein level. The first indication that mutational bias at the nucleotide level is, indeed, the primary cause comes from the observation that the differences in G+C content are greatest at the third codon position (table 1). If the changes in average nucleotide content were primarily an effect of selection at the amino acid level, we would expect the greatest change to be at the first and second codon positions. A related method for distinguishing between nucleotide-level and protein-level effects is to compare the calculated rates of synonymous and nonsynonymous nucleotide substitutions. We used the method of Yang and Nielsen (2000) to calculate these rates for the two groups of rice genes (high G+C and low G+C) compared with their Arabidopsis homologs. If the nucleotide composition of the high G+C rice genes is affected primarily by selection at the protein level, we should see elevated rates of nonsynonymous changes. If, on the other hand, the primary effect is at the nucleotide level, we should see an elevation in the synonymous substitution rate. The results show very clearly that the increase in substitution rate happens at the synonymous sites, where there is a twofold increase relative to the rate for the low G+C genes (the average values of dS are 7.7 ± 0.6 and 3.6 ± 0.1 for the high G+C and low G+C genes, respectively). The nonsynonymous substitution rate remains relatively constant between the two sets of genes (average value of dN = 0.5 for both groups). This points to mutational bias at the nucleotide level, rather than functional selection at the protein level.
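The codon-position comparison referred to above rests on a simple calculation: the G+C fraction at the first, second, and third positions of each codon. A minimal sketch is given below; the function name is hypothetical, and trailing bases that do not complete a codon are dropped as an assumption. The estimation of dS and dN by the method of Yang and Nielsen (2000) is a separate, model-based procedure and is not reproduced here.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3): the G+C fraction at each codon position.

    Any trailing bases that do not form a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    fractions = []
    for position in range(3):
        bases = cds[position:usable:3]  # every third base from this offset
        if bases:
            fractions.append(sum(base in "GC" for base in bases) / len(bases))
        else:
            fractions.append(0.0)
    return tuple(fractions)
```

Comparing the three values between a rice gene and its Arabidopsis homolog shows at which codon position the compositional difference is concentrated.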
Our finding that shorter coding sequences have a greater tendency to increase in G+C content confirms the findings of Carels and Bernardi (2000), and it is reminiscent of the finding of Duret et al. (1995) who showed that the G+C content of many vertebrate genes was negatively correlated with coding sequence length. More recently, a similar trend has been noted in a survey of single-exon coding sequences (Xia et al. 2003). It is intriguing to observe the same length correlations in both vertebrates and plants. One possible explanation for this trend is that the increased G+C content is linked to the absence of introns, if one assumes that longer genes are more likely to have multiple exons. However, we tested this hypothesis by confining our analysis to single-exon genes only, and we found that the presence of introns was not the primary determining factor of nucleotide content. For instance, with this more restricted data set (single exon genes only), the difference in gene length was just as great as that shown in figure 4B for all genes. Thus, the length difference is maintained even in the absence of introns. Interestingly, however, we found that the high G+C rice genes included relatively few multiple-exon genes, especially genes with three or more exons (table 2). This suggests that the presence of multiple introns may prevent even short genes from becoming G+C rich. This is supported by the observation that among the low G+C rice genes, the average length of three-exon and four-exon genes is only 60% of the length of one-exon and two-exon genes. In other words, even though the coding sequences of these genes are relatively short, the presence of multiple introns may prevent them from becoming G+C rich. This implies that there are selective constraints related to RNA splicing that counter the effects of mutational bias in these genes. 
Such a constraint would not, however, explain the fact that long, single-exon coding sequences also remain relatively immune to mutational bias. The answer may lie in the fact that RNA splicing is only one form of RNA processing. In general, it may be that longer genes encode more complex transcripts and proteins that have a greater chance of being functionally disrupted by biased mutational changes. Shorter genes are also at risk, but they provide a smaller target for these mutations and, consequently, they are subject to lesser selective constraint.
In summary, we have shown that mutational bias can have profound effects on the patterns of evolutionary divergence between homologous plant protein sequences. This indicates that mutational bias can be a major determinant of the patterns of protein evolution in eukaryotes. The rice genome does not, however, have a uniformly elevated G+C content among its coding sequences. This heterogeneity in nucleotide content among the coding sequences is reflected in the very different amino acid compositions of the encoded proteins.
Literature Cited
Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.
Bairoch, A., and R. Apweiler. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28:45-48.
Carels, N., and G. Bernardi. 2000. Two classes of genes in plants. Genetics 154:1819-1825.
Carels, N., P. Hatey, P. Jabbari, and G. Bernardi. 1998. Compositional properties of homologous coding sequences from plants. J. Mol. Evol. 46:45-53.
Collins, D. W., and T. H. Jukes. 1993. Relationship between G + C in silent sites of codons and amino acid composition of human proteins. J. Mol. Evol. 36:201-213.
Duret, L., D. Mouchiroud, and C. Gautier. 1995. Statistical analysis of vertebrate sequences reveals that long genes are scarce in GC-rich isochores. J. Mol. Evol. 40:308-317.
Foster, P. G., L. S. Jermiin, and D. A. Hickey. 1997. Nucleotide composition bias affects amino acid content in proteins coded by animal mitochondria. J. Mol. Evol. 44:282-288.
Galtier, N. 2003. Gene conversion drives GC content evolution in mammalian histones. Trends Genet. 19:65-68.
Gautier, C. 2000. Compositional bias in DNA. Curr. Opin. Genet. Dev. 10:656-661.
Hickey, D. A., L. Bally-Cuif, S. Abukashawa, V. Payant, and B. F. Benkel. 1991. Concerted evolution of duplicated protein-coding genes in Drosophila. Proc. Natl. Acad. Sci. USA 88:1611-1615.
Karlin, S., A. M. Campbell, and J. Mrazek. 1998. Comparative DNA analysis across diverse genomes. Annu. Rev. Genet. 32:185-225.
Kreil, D. P., and C. A. Ouzounis. 2001. Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res. 29:1608-1615.
Li, W. H. 1997. Molecular Evolution. Sinauer, Sunderland, Mass.
Lobry, J. R. 1996. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 13:660-665.
Lobry, J. R. 1997. Influence of genomic G+C content on average amino-acid composition of proteins from 59 bacterial species. Gene 205:309-316.
Morton, B. R. 1999. Strand asymmetry and codon usage bias in the chloroplast genome of Euglena gracilis. Proc. Natl. Acad. Sci. USA 96:5123-5128.
Rice, P., I. Longden, and A. Bleasby. 2000. EMBOSS: The European molecular biology open software suite. Trends Genet. 16:276-277.
Sasaki, T., T. Matsumoto, and K. Yamamoto, et al. (80 co-authors). 2002. The genome sequence and structure of rice chromosome 1. Nature 420:312-316.
Singer, G. A. C., and D. A. Hickey. 2000. Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol. 17:1581-1588.
Stoesser, G., W. Baker, and A. van den Broek, et al. (16 co-authors). 2002. The EMBL nucleotide sequence database. Nucleic Acids Res. 30:21-26.
Tillier, E. R., and R. A. Collins. 2000. The contributions of replication orientation, gene direction, and signal sequences to base-composition asymmetries in bacterial genomes. J. Mol. Evol. 50:249-257.
Ware, D., P. Jaiswal, and J. Ni, et al. (12 co-authors). 2002. Gramene: a resource for comparative grass genomics. Nucleic Acids Res. 30:103-105.
Wilquet, V., and M. Van de Casteele. 1999. The role of the codon first letter in the relationship between genomic GC content and protein amino acid composition. Res. Microbiol. 150:21-32.
Wong, G. K., J. Wang, L. Tao, J. Tan, J. Zhang, D. A. Passey, and J. Yu. 2002. Compositional gradients in Gramineae genes. Genome Res. 12:851-856.
Xia, X., Z. Xie, and W. H. Li. 2003. Effects of GC content and mutational pressure on the lengths of exons and coding sequences. J. Mol. Evol. 56:362-370.
Yang, Z., and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17:32-43.
Yu, J., S. Hu, and J. Wang, et al. (100 co-authors). 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296:79-92.