* Department of Ecology and Evolution, University of Chicago
Division of Molecular Biology and Biochemistry, University of Missouri-Kansas City
Correspondence: E-mail: ciwu{at}uchicago.edu.
Abstract |
---|
Key Words: amino acid substitution, genetic code, protein evolution, amino acid change
Introduction |
---|
Of the 190 possible interchanges among the 20 amino acids, only 75 can be achieved by single-base substitutions, which are referred to as elementary amino acid changes. The remaining 115 require two-base or three-base substitutions. They are composites of two or three elementary changes and should be treated separately. By considering only the elementary changes, we can formulate the neutral evolutionary expectation. The evolutionary index (EI) for each elementary amino acid change is the likelihood of its fixation relative to that of the synonymous changes. EI is, therefore, the equivalent of the Ka/Ks ratio for each elementary change, or the equivalent of the ωij of Yang, Nielsen, and Hasegawa (1998). We should note that, when the amino acid changes are classified into so many bins, the analysis necessarily has to be based on a large collection of genes to have sufficient resolution.
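The counts of 190 and 75 can be checked directly from the genetic code. The following is a minimal sketch, assuming the standard nuclear code written in TCAG order; the names `CODE`, `one_step_neighbors`, and `elementary_pairs` are illustrative and are not taken from the study.

```python
from itertools import combinations, product

BASES = "TCAG"
# Standard nuclear genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def one_step_neighbors(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


def elementary_pairs():
    """Return the unordered amino acid pairs connected by a single-base change."""
    pairs = set()
    for codon, aa in CODE.items():
        if aa == "*":
            continue  # stop codons are excluded
        for neighbor in one_step_neighbors(codon):
            new_aa = CODE[neighbor]
            if new_aa != "*" and new_aa != aa:
                pairs.add(frozenset((aa, new_aa)))
    return pairs


amino_acids = sorted(set(AMINO) - {"*"})
all_pairs = list(combinations(amino_acids, 2))
elementary = elementary_pairs()
print(len(all_pairs), len(elementary), len(all_pairs) - len(elementary))
```

The three printed numbers are the total number of interchanges, the number of elementary changes, and the number of composite changes.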
We measure EIs among four sets of genes from four pairs of species (human versus the macaque monkey, mouse versus rat, S. cerevisiae versus S. paradoxus, and D. melanogaster versus D. simulans).
Previous empirical approaches have also sought to measure amino acid exchangeability during evolution for each type of amino acid change (e.g., PAM, BLOSUM) (Dayhoff, Schwartz, and Orcutt 1978; Henikoff and Henikoff 1993). Such matrices were used to identify distant homologs (Altschul and Lipman 1990) or to detect selection on amino acid substitutions (Li, Wu, and Luo 1985; Yang, Nielsen, and Hasegawa 1998; Wyckoff, Wang, and Wu 2000). These approaches are based on amino acids, not codons. In the PAM matrix, each entry is related to the observed exchanges between amino acid i and amino acid j divided by the expected exchanges, which is the product of their respective frequencies in the data set. The EI proposed here formulates the expectation according to the genetic code (see Materials and Methods) and is hence a very different measure from PAM (see also Results).
The analysis shows that the EI values differ by at least 10-fold between conservative and radical changes, but their relative ranking appears stable across gene sets among eukaryotic organisms. As a result, it is possible to predict amino acid exchangeabilities during evolution accurately. EIij is also an important parameter in most codon substitution models (Goldman and Yang 1994; Muse and Gaut 1994). These models assume either uniform EIij values or a fixed scale among them (such as the one defined by Grantham's [1974] matrix). Our results will have some relevance to the construction of such models, but their applications to modeling are beyond the scope of this study.
Materials and Methods |
---|
Expected Amino Acid Changes
All possible one-step changes were determined for the 61 sense codons of the mammalian nuclear genetic code (i.e., all 64 codons except the three stop codons). In one step, a codon can change in nine different ways and generally can have from zero to three synonymous changes and from six to nine nonsynonymous changes, except in cases of "sixfold" degeneracy such as leucine or arginine. The set of all possible one-step changes from a given codon is the basis of the expectation calculation. We shall use TTT (F) as an example, where (F) is the one-letter amino acid code.
The codon change pattern is as follows:
We then sum over the codon changes and convert them into the amino acid change patterns below:
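The TTT example can be enumerated with a short sketch. It assumes the standard nuclear code and gives every one-step change equal weight; any weighting by mutation type is not modeled, and the helper name `one_step_changes` is illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard nuclear genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def one_step_changes(codon):
    """Return the nine codons reachable from `codon` by one base substitution."""
    return [
        codon[:pos] + base + codon[pos + 1:]
        for pos in range(3)
        for base in BASES
        if base != codon[pos]
    ]


codon = "TTT"
changes = one_step_changes(codon)

# Codon change pattern: each line is one possible single-base change.
for new in changes:
    print(f"{codon} ({CODE[codon]}) -> {new} ({CODE[new]})")

# Amino acid change pattern: sum the codon changes by resulting amino acid.
pattern = Counter(CODE[new] for new in changes)
synonymous = pattern[CODE[codon]]
nonsynonymous = len(changes) - synonymous
print(dict(pattern), synonymous, nonsynonymous)
```

For TTT, one of the nine changes is synonymous (TTC) and eight are nonsynonymous, three of which lead to leucine.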
The description above is the expected synonymous and amino acid changes based on a single codon, TTT. A similar description can be made for all 61 nontermination codons. For the set of genes, the calculation of the expectation is then weighted by the number of codons that appear in the sequences of both species. These weighted changes are put into a 64 × 64 codon change matrix (effectively 61 × 61, because stop codons are excluded). The codon change pattern for TTT has been given above. Our approach, therefore, takes into account the amino acid composition and synonymous codon usage. The total number of synonymous changes across all codons in the expectation matrix is set equal to the number of observed synonymous changes, and the number of each type of nonsynonymous change is scaled accordingly. The 64 × 64 codon matrix was then converted into a 20 × 20 amino acid matrix by summing over all codons that code for the same amino acid. Again, this conversion has been shown above for TTT.
The expected amino acid matrices are "folded" so that they are symmetric because, without ancestry data, there is no directionality among the amino acid changes. The average of the expected number for each elementary change is 166.6, 2342.6, 8114.9, and 184.1, respectively, for primates, rodents, yeast, and Drosophila. Thus, even with 75 classes, there is sufficient resolution for each class.
Observed Amino Acid Changes
The numbers of substitutions, synonymous or nonsynonymous, are put into a 64 × 64 observed matrix. Two-step changes are counted as two one-step changes. Because there are two such pathways (e.g., TTT → TCT → TCC versus TTT → TTC → TCC), they are weighted according to the observed patterns in the 1-bp changes as done before (Li 1993). Three-step changes are ignored, but there are few of them in the data set. The 64 × 64 codon matrix is again converted into a 20 × 20 amino acid observation matrix.
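The two pathways between codons that differ at two positions can be enumerated as in the sketch below. The weighting function is only a placeholder: it assumes that a pathway's weight is proportional to the product of its two step weights, which is one simple choice and not necessarily the scheme of Li (1993). The names `two_step_paths`, `weight_paths`, and `step_weight` are hypothetical.

```python
def two_step_paths(start, end):
    """Return the two one-step pathways between codons differing at two positions."""
    diff = [i for i in range(3) if start[i] != end[i]]
    if len(diff) != 2:
        raise ValueError("codons must differ at exactly two positions")
    paths = []
    for first in diff:
        # Change one of the two differing positions first to get the intermediate.
        middle = start[:first] + end[first] + start[first + 1:]
        paths.append((start, middle, end))
    return paths


def weight_paths(paths, step_weight):
    """Normalize pathway weights; step_weight(a, b) scores a single one-step change."""
    raw = [step_weight(a, b) * step_weight(b, c) for a, b, c in paths]
    total = sum(raw)
    return [w / total for w in raw]


paths = two_step_paths("TTT", "TCC")
for path in paths:
    print(" -> ".join(path))

# With equal step weights, each pathway receives half of the change.
print(weight_paths(paths, lambda a, b: 1.0))
```

In practice the step weights would come from the observed frequencies of the corresponding 1-bp changes in the same data set.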
Calculation of EIs
If we ignore multiple substitutions at the same nucleotide sites, which are less of a problem for closely related species, the uncorrected EI (designated with an asterisk [*]) is
|
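A minimal sketch of the uncorrected index follows, assuming that EI* for each elementary change is the observed count divided by the expected count once the expectation has been scaled to the observed synonymous changes. This reading is an assumption based on the description of the expectation matrix; the function name and the toy numbers are hypothetical.

```python
def uncorrected_ei(observed, expected):
    """Return EI* = observed / expected for every elementary change with expected > 0.

    observed, expected: mappings from an amino acid pair to a count.
    Pairs absent from `observed` are treated as zero observed changes.
    """
    return {
        pair: observed.get(pair, 0) / exp
        for pair, exp in expected.items()
        if exp > 0
    }


# Hypothetical counts for two elementary changes.
observed = {("F", "L"): 30}
expected = {("F", "L"): 60.0, ("F", "I"): 20.0}
ei_star = uncorrected_ei(observed, expected)
print(ei_star)
```

Because the expectation is already scaled to the synonymous changes, a value of 1 under this reading would mean that the change is fixed as readily as a synonymous change.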
To make the correction for multiple substitutions, we shall follow the method of Li (1993). Let A' be the actual observed number of transitions per base pair, and let B' be the actual observed number of transversions per base pair. EI* corresponds to the Ka(uncorrected)/Ks(uncorrected) where
|
|
|
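Corrections of this kind are commonly built on Kimura's two-parameter model, which converts observed proportions of transitional and transversional differences into estimated numbers of substitutions per site. The sketch below shows that standard correction only; the argument names `p` and `q` and the function name are assumptions, and the exact expressions used in this study may differ.

```python
import math


def k2p_components(p, q):
    """Kimura two-parameter correction.

    p: observed proportion of transitional differences per site.
    q: observed proportion of transversional differences per site.
    Returns (A, B): estimated transitional and transversional substitutions per site.
    """
    ts_term = 1.0 - 2.0 * p - q
    tv_term = 1.0 - 2.0 * q
    if ts_term <= 0 or tv_term <= 0:
        raise ValueError("divergence too high for the two-parameter correction")
    a = 1.0 / ts_term
    b = 1.0 / tv_term
    A = 0.5 * math.log(a) - 0.25 * math.log(b)
    B = 0.5 * math.log(b)
    return A, B


A, B = k2p_components(0.10, 0.05)
print(A, B, A + B)
```

For closely related species the corrected values stay close to the observed proportions, which is why the uncorrected index is often an adequate approximation.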
Calculation of the Universal Index, U
Given the high correlations of EIs from different sets of genes of different taxa, we now propose a universal index U. For any data set (or any gene), the predicted EI will be U*R, where R is the weighted average Ka/Ks for that data set. U is scaled such that its weighted average is 1. We used rodent and yeast EIs to obtain U, as they are based on the largest numbers of nonsynonymous substitutions.
In the plot of the observed EIs for yeast versus rodent, the fitted values (ÊIi(r), ÊIi(y)) for a particular kind (i) of elementary amino acid change should be (Ui*R(r), Ui*R(y)), where R(r) and R(y) are the weighted average Ka/Ks for rodent and yeast, respectively. All 75 fitted points should lie on the fitted line, which passes through the origin with a slope equal to R(y)/R(r). The point for the observed values is (EIi(r), EIi(y)). For each kind of elementary amino acid change, we make the line connecting the observed and fitted values perpendicular to the fitted line. By doing that, we minimize the total residual sum of squares for yeast and rodent EIs, that is, we minimize

$$\sum_{i=1}^{75} \left( \hat{e}_i(r)^2 + \hat{e}_i(y)^2 \right),$$

where êi(r) = EIi(r) − ÊIi(r) and êi(y) = EIi(y) − ÊIi(y). Now, because each fitted point is the perpendicular projection of the observed point onto the fitted line,

$$U_i = \frac{EI_i(r)\,R(r) + EI_i(y)\,R(y)}{R(r)^2 + R(y)^2}.$$
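The geometric construction described here amounts to projecting each observed point onto a line through the origin. A minimal sketch, assuming plain Python lists of EI values and illustrative names (`universal_index`, `R_r`, `R_y`); it does not apply the additional scaling that makes the weighted average of U equal to 1.

```python
def universal_index(ei_r, ei_y, R_r, R_y):
    """Project each observed point (EI_i(r), EI_i(y)) onto the line through the
    origin with direction (R_r, R_y); the fitted point is (U_i * R_r, U_i * R_y)."""
    denom = R_r ** 2 + R_y ** 2
    return [(x * R_r + y * R_y) / denom for x, y in zip(ei_r, ei_y)]


# Hypothetical values: one point off the fitted line, one point on it.
R_r, R_y = 0.1, 0.2
ei_r = [0.3, 0.2]
ei_y = [0.1, 0.4]
U = universal_index(ei_r, ei_y, R_r, R_y)

for x, y, u in zip(ei_r, ei_y, U):
    fitted = (u * R_r, u * R_y)
    residual = (x - fitted[0], y - fitted[1])
    # The residual is perpendicular to the fitted line, so this dot product is ~0.
    dot = residual[0] * R_r + residual[1] * R_y
    print(u, fitted, dot)
```

Perpendicular projection treats the errors in the two data sets symmetrically, unlike an ordinary regression of one data set on the other.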
Calculation of PAM-4
The identity percentage for the primate sequences used in this study is 95.8% at the amino acid level. We, therefore, used the PAM-4 substitution matrix with each entry Sij. PAM-4 is derived from PAM-1 by assuming the Markovian transition model (Dayhoff, Schwartz, and Orcutt 1978). Each Sij in PAM-4 is 10 times the log odds ratio of two probabilities: Pr(observed i → j mutation rate) and Pr(mutation rate expected from amino acid frequencies). We denote the ratio of the two probabilities as P(ij), which can be an empirical measure of the relative exchangeability for that type of amino acid change.
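Recovering P(ij) from a matrix entry is a one-line inversion of the log odds score. The sketch assumes a base-10 logarithm, as in the Dayhoff convention, and unrounded scores; published matrices are usually rounded to integers, so the recovered ratios are approximate. The function name is illustrative.

```python
def odds_ratio_from_score(s_ij):
    """Invert S_ij = 10 * log10(P_ij) to recover the odds ratio P_ij."""
    return 10 ** (s_ij / 10)


# Hypothetical scores: positive means exchanged more often than expected.
for score in (-10, 0, 3, 10):
    print(score, odds_ratio_from_score(score))
```

A score of 0 corresponds to an exchange that occurs exactly as often as expected from the amino acid frequencies.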
Results |
---|
For each data set, our estimated EIs differ by at least 10-fold between the most radical and the most conservative changes, as shown in table 2. The separate estimation of nonsynonymous and synonymous substitutions (Li, Wu, and Luo 1985; Nei and Gojobori 1986) has been one of the most widely used practices in molecular evolutionary studies. It has led to conclusions of positive selection in many recent studies (Wyckoff, Wang, and Wu 2000; Yang, Nielsen, and Hasegawa 1998; Zhang 2000). Table 2 suggests that amino acid changes are highly disparate in their evolutionary dynamics. Therefore, when all nonsynonymous changes are lumped into one class, we might in fact lose much evolutionary information.
The Correlations Among EIs from Different Taxa
Attempts to classify amino acid changes according to their evolutionary exchangeability have been briefly noted (Li, Wu, and Luo 1985; Nei and Gojobori 1986; Henikoff and Henikoff 1993). All these classifications implicitly assume that there exist universal rules governing amino acid exchangeability. If amino acid 1 and amino acid 2 are more exchangeable than amino acid 3 and amino acid 4 in some genes, is the former pair also more exchangeable than the latter in most other genes? Unless such consistency can be demonstrated across genes and across taxa, it would be futile to attempt to formulate such rules. By the pairwise comparisons of EI values (table 3), we found that the correlations are generally high, around 0.8. The degree of correlation depends strongly on the size of the data set. The highest correlation (r = 0.91) exists between the rodent and yeast data sets, which have the largest numbers of changes (table 1). The lowest (r = 0.788) lies between the primate and Drosophila data sets with relatively small numbers.
The Universal EI (U)
The high correlations suggest that the relative ranking of EIs is consistent across genes and taxa, as long as the total number of nonsynonymous substitutions in the data set is large (> 4,000). Therefore, a universal ranking, Ui, may exist, which is described by a set of 75 measures linearly correlated with the EIs of table 2. Following the perpendicular fit described in Materials and Methods, in which the fitted point for each elementary change is (Ui*R(r), Ui*R(y)), we estimate Ui values as follows:

$$U_i = \frac{EI_i(r)\,R(r) + EI_i(y)\,R(y)}{R(r)^2 + R(y)^2}.$$
In figure 2, we compare the observed EIs with the predicted EIs (= U*R) for each of the data sets from rodents, yeast, primate, and Drosophila, respectively. The correlations are greater than 0.84 and, when the regression lines are fitted through the origin, the slopes are approximately 1.
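The two summary statistics used in this comparison, the correlation coefficient and the slope of a regression forced through the origin, can be computed as in the sketch below. The choice of predicted values on the x axis and observed values on the y axis is an assumption, and the numbers are hypothetical.

```python
import math


def slope_through_origin(x, y):
    """Least-squares slope of y = b * x with no intercept: b = sum(xy) / sum(x^2)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical predicted (U * R) and observed EI values for four elementary changes.
predicted = [0.05, 0.10, 0.20, 0.40]
observed = [0.06, 0.09, 0.22, 0.38]
print(slope_through_origin(predicted, observed), pearson_r(predicted, observed))
```

A slope near 1 together with a high correlation indicates that U*R reproduces both the ranking and the scale of the observed EIs.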
Discussion |
---|
An application of the EI results may be in the detection of selection. There are several approaches to detect the influence of positive selection (McDonald and Kreitman 1991; Wyckoff, Wang, and Wu 2000; Fay, Wyckoff, and Wu 2001, 2002; Yang and Swanson 2002; Bamshad and Wooding 2003). The most straightforward one is to identify genes with Ka/Ks > 1 (e.g., Hughes and Nei 1988), but the method is overly restrictive, given that most amino acid sites are functionally conserved and only some are critical for molecular adaptation (Golding and Dean 1998). A more powerful approach is to examine multiple orthologous sequences and estimate the effect of positive selection by identifying those sites with Ka/Ks > 1 (Fitch et al. 1997; Nielsen and Yang 1998; Suzuki and Gojobori 1999; Yang et al. 2000).
On the other hand, there will be an influx of sequences from two closely related species. In that case, it would be most effective to partition the changes by their amino acid properties. It is sometimes, but not always, possible to detect selection in a subset of classes, especially the most conservative class (Wyckoff, Wang, and Wu 2000). Between the amino acids with high EIs, the Ka/Ks values may exceed those in the bottom of table 2 by more than 10-fold. Previous attempts at grouping amino acids into classes often failed to resolve the Ka/Ks values into several distinct estimates. This failure may be the result of the poor correlation between the amino acid classification and their evolutionary exchangeability. In this respect, the U index should provide a better evolutionary basis for classifying amino acids and may lead to a better way of estimating the levels of coding-region divergence and, hence, the influence of positive selection. EI can also be developed further along with other measures currently used to assess the functional effect of amino acid changes in proteins (Yang and Nielsen 1998; Bustamante, Townsend, and Hartl 2000; Wyckoff, Wang, and Wu 2000; Zhang 2000; Chasman and Adams 2001). Notably, our analysis does not take into account the contextual information such as the protein structure, the expression level, and so on. Contextual information from other studies may eventually be used in conjunction with noncontextual information, such as EI, to gain useful insight into changes in protein function during evolution.
Finally, the approach of focusing on the elementary amino acid changes can be applied to polymorphism and disease patterns. By contrasting the EIs of divergence and of polymorphism (Yang and Nielsen 1998; Bustamante, Townsend, and Hartl 2000; Wyckoff, Wang, and Wu 2000; Zhang 2000; Chasman and Adams 2001), we may have a glimpse of the operation of natural selection at the amino acid level. Similarly, by contrasting EI with disease causation by amino acid changes, we may better understand the effect of disease on Darwinian fitness.
Acknowledgements |
---|
Footnotes |
---|
Literature Cited |
---|
Altschul, S. F., and D. J. Lipman. 1990. Protein database searches for multiple alignments. Proc. Natl. Acad. Sci. USA 87:5509-5513.
Bamshad, M., and S. P. Wooding. 2003. Signatures of natural selection in the human genome. Nat. Rev. Genet. 4:99-111.
Bustamante, C. D., J. P. Townsend, and D. L. Hartl. 2000. Solvent accessibility and purifying selection within proteins of Escherichia coli and Salmonella enterica. Mol. Biol. Evol. 17:301-308.
Chasman, D., and R. M. Adams. 2001. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J. Mol. Biol. 307:683-706.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. Atlas of protein sequence and structure. National Biomedical Research Foundation, Washington, D.C.
Doolittle, R. F. 1979. Protein evolution. Pp.1118 in H. Neurath and R. L. Hill, eds. The proteins. Academic Press, New York.
Fay, J. C., G. J. Wyckoff, and C. I. Wu. 2001. Positive and negative selection on the human genome. Genetics 158:1227-1234.
Fay, J. C., G. J. Wyckoff, and C. I. Wu. 2002. Testing the neutral theory of molecular evolution with genomic data from Drosophila. Nature 415:1024-1026.
Fitch, W. M., R. M. Bush, C. A. Bender, and N. G. Cox. 1997. Long term trends in the evolution of H(3) HA1 human influenza type A. Proc. Natl. Acad. Sci. USA 94:7712-7718.
Genetics Computer Group. 1999. The Wisconsin Package. Version 10.0. Madison, Wis.
Golding, G. B., and A. M. Dean. 1998. The structural basis of molecular adaptation. Mol. Biol. Evol. 15:355-369.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736.
Grantham, R. 1974. Amino acid difference formula to help explain protein evolution. Science 185:862-864.
Graur, D. 1985. Amino acid composition and the evolutionary rates of protein-coding genes. J. Mol. Evol. 22:53-62.
Hellmann, I., S. Zollner, W. Enard, I. Ebersberg, B. Nickel, and S. Paabo. 2003. Selection on human genes as revealed by comparisons to chimpanzee cDNA. Genome Res. 13:831-837.
Henikoff, S., and J. G. Henikoff. 1993. Performance evaluation of amino acid substitution matrices. Proteins 17:49-61.
Hughes, A. L., and M. Nei. 1988. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335:167-170.
Kellis, M., N. Patterson, M. Endrizzi, B. Birren, and E. S. Lander. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:241-254.
Li, W. H. 1993. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol. 36:96-99.
Li, W. H., C. I. Wu, and C. C. Luo. 1985. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2:150-174.
Makalowski, W., and M. S. Boguski. 1998. Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences. Proc. Natl. Acad. Sci. USA 95:9407-9412.
McDonald, J. H., and M. Kreitman. 1991. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652-654.
Miyata, T., S. Miyazawa, and T. Yasunaga. 1979. Two types of amino acid substitutions in protein evolution. J. Mol. Evol. 12:219-236.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715-724.
Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418-426.
Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929-936.
Suzuki, Y., and T. Gojobori. 1999. A method for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 16:1315-1328.
Wyckoff, G. J., W. Wang, and C. I. Wu. 2000. Rapid evolution of male reproductive genes in the descent of man. Nature 403:304-309.
Yang, Z., and R. Nielsen. 1998. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J. Mol. Evol. 46:409-418.
Yang, Z., R. Nielsen, N. Goldman, and A. M. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.
Yang, Z., R. Nielsen, and M. Hasegawa. 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15:1600-1611.
Yang, Z., and W. J. Swanson. 2002. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol. Biol. Evol. 19:49-57.
Zhang, J. 2000. Rates of conservative and radical nonsynonymous nucleotide substitutions in mammalian nuclear genes. J. Mol. Evol. 50:56-68.