Department of Ecology and Evolution, University of Chicago
Abstract
What are the major forces governing protein evolution? A common view is that proteins with strong structural and functional requirements evolve more slowly than proteins with weak constraints, because a stringent negative selection pressure limits the number of substitutions. In contrast, Graur claimed that the substitution rate of a protein is mainly determined by its amino acid composition and the changeabilities of amino acids. In this paper, however, we found that the relative changeabilities of amino acids in mammalian proteins are different for transmembranal and nontransmembranal segments, which have very distinct structural requirements. This indicates that the changeability of a given residue is influenced by the structural and functional context. We also reexamined the relationship between substitution rate and amino acid composition. Indeed, the two kinds of segments exhibit contrasting amino acid compositions: transmembranal regions are made up mainly of hydrophobic residues (a total frequency of ~60%) and are very poor in polar amino acids (<5%), whereas nontransmembranal segments have frequencies of 30% and 22%, respectively. Interestingly, we found that within a given integral membrane protein, nontransmembranal segments accumulate, on average, twice as many substitutions as transmembranal regions. However, regression analyses showed that the variability in amino acid frequencies among proteins cannot explain more than 30% of the variability in substitution rate for the transmembranal and nontransmembranal data sets. Furthermore, transmembranal and nontransmembranal segments evolving at the same rate in different proteins have different compositions, and the compositions of slowly evolving and rapidly evolving segments of the same type are similar. 
From these observations, we conclude that the rate of protein evolution is only weakly affected by amino acid composition but is mostly determined by the strength of functional requirements or selective constraints.
Introduction
Since the introduction of gene and protein sequences in evolutionary studies (e.g., Zuckerkandl and Pauling 1965), molecular evolutionists have tried to identify the major forces that govern molecular evolution. A widely accepted principle is that the rate of amino acid substitution is determined by the stringency of structural and functional constraints (e.g., Kimura 1983; Nei 1987; Li 1997). That is, if a protein (or part of a protein) has very stringent structural and/or functional requirements, then the encoding gene is subject to a strong negative selection pressure limiting the number of changes in the gene product. Consequently, such a protein evolves more slowly than a protein with weaker constraints. This argument also explains why, within a given protein, regions or individual sites critical for the function, such as catalytic sites or binding domains, are generally better conserved than the rest of the molecule. Many examples of this constraint-rate relationship have been described (see, e.g., Li 1997).
In contrast, Graur (1985), through an analysis of 60 mammalian protein-coding genes, found a highly significant correlation between the substitution rate and the amino acid frequencies in the protein. He thus proposed that composition is the main factor determining the rate of evolution of proteins, whereas functional constraints have only a minor effect. He showed that, in fact, the substitution rate of a protein can be predicted directly from its composition and the relative changeabilities of the amino acids. That is, slowly evolving proteins contain higher frequencies of stable residues and/or lower frequencies of highly changeable residues than do fast evolving peptides. However, this relation may hold only if the changeability of a particular amino acid is similar among proteins. If it is not, then the frequency of a given amino acid tells us little about the substitution rate. The demonstration by Jones, Taylor, and Thornton (1994) that the changeabilities of amino acids in transmembrane segments of membrane-associated proteins are different from those in globular proteins casts some doubt about the use of composition to infer evolutionary rates.
Using a large amount of sequence data, we reanalyzed the relationships among substitution rate, amino acid composition, amino acid changeability, and functional constraints in mammalian proteins. In particular, we compared the evolutionary patterns of transmembranal and nontransmembranal segments, for which the structural and functional requirements are very different. The conclusion is that the rate of protein evolution is only weakly affected by amino acid composition but is mostly determined by the strength of functional requirements or selective constraints.
Materials and Methods
Computation of Relative Changeabilities of Amino Acids
To compute amino acid changeabilities, a data set consisting of proteins of several mammalian groups was constructed. Nuclear protein-coding genes for which there were orthologous sequences available for at least six species belonging to at least three of the following seven mammalian orders were retrieved from HOVERGEN data bank release 34 (February 1999) (see Duret, Mouchiroud, and Gouy [1994] for a description of the database): primates, rodents, lagomorphs, artiodactyls, carnivores, perissodactyls, and cetaceans. Orthology of sequences was assessed by examination of the phylogenetic tree of each protein provided in HOVERGEN. When multiple sequences were present for a given species, the sequence located on the shorter branch was retained. To minimize the possibility of multiple substitutions at the same site, which may bias the calculation of amino acid changeabilities (see below), only proteins for which the observed amino acid divergence between any pair of sequences was lower than 20% were selected. In addition, only proteins longer than 80 residues were kept in order to limit the effect of sampling errors. The final data set consisted of 101 proteins (757 sequences). For each protein, the selected sequences were aligned by CLUSTAL W (Thompson, Higgins, and Gibson 1994); the alignment was then adjusted manually, and ambiguous regions and gaps were removed from analysis.
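The two selection criteria just described (pairwise amino acid divergence below 20% and length above 80 residues) can be illustrated with a minimal sketch. The function names, the dictionary-based alignment format, and the use of the shortest ungapped sequence as the protein length are assumptions made for illustration only; they are not taken from the original analysis pipeline.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing residues over columns ungapped in both sequences."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def passes_filters(alignment, max_divergence=0.20, min_length=80):
    """Return True if a protein alignment satisfies both selection criteria.

    alignment: dict mapping a species name to its aligned sequence
    (hypothetical input format; all sequences have equal length).
    """
    seqs = list(alignment.values())
    # Length criterion: strictly longer than min_length residues.
    shortest = min(len(s.replace("-", "")) for s in seqs)
    if shortest <= min_length:
        return False
    # Divergence criterion: every pair must differ at fewer than 20% of sites.
    return all(p_distance(a, b) < max_divergence
               for a, b in combinations(seqs, 2))
```

Observed (uncorrected) divergence is used here because the text refers to the observed amino acid divergence between pairs of sequences.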
The species sampled and the number of species varied among genes (although human, mouse, and rat sequences were available for most genes). To reduce the possible bias in the computation of substitutions due to incorrect phylogenetic inference, we used branching orders that were known or that had received strong support from different analyses conducted by various authors. Figure 1 gives the overall rooted organismal phylogeny assumed, with the topology for a particular gene constituting only a subtree of this global phylogeny. The relationships between the mammalian orders considered here (or some of them) were obtained using mainly mitochondrial, but also nuclear, proteins (Li et al. 1990; Cao, Adachi, and Hasegawa 1994; Kuma and Miyata 1994; Graur, Duret, and Gouy 1996; Janke, Xu, and Arnason 1997 and references therein). The relative position of Cetacea within Cetartiodactyla (Cetacea + Artiodactyla) has been confirmed by recent analyses of mitochondrial sequence data (Gatesy 1997; Ursing and Arnason 1998) and transposable elements (Shimamura et al. 1997). The primate subtree in figure 1 represents the classification of primates suggested by fairly congruent fossil and molecular evidence (reviewed by Shoshani et al. [1996] and Goodman et al. [1998]). The branching order for the rodent groups in our data set was deduced from phylogenetic analyses of mitochondrial and nuclear genes (see Robinson et al. 1997 and references therein).
In addition, an analysis was done using only the transmembranal segments of the integral membrane proteins present in the data set. Information about the transmembranal nature of the proteins and the limits of the different domains was retrieved from the SwissProt database, release 37 (Bairoch and Apweiler 1999). Since transmembranal regions were very short (20-30 amino acids in all proteins examined), all transmembranal segments of a given protein were combined together into a single sequence. Only the 33 proteins (out of 37 available) for which the length of this sequence exceeded 80 amino acids were kept for analysis (225 sequences in total). For this subset, ancestral sequences were inferred by maximum likelihood using the substitution matrix derived by Jones, Taylor, and Thornton (1994) from a large collection of transmembranal segments combined with the average amino acid frequencies of the present-day sequences and a gamma distribution for rates among sites. For comparison, changeabilities were also estimated from the nontransmembranal regions (totaling more than 80 residues) of those proteins. Actually, only 29 out of the 33 genes were used (192 sequences). The four remaining genes were excluded because the divergence between the nontransmembranal sequences of several species was well above 20%. Here, ancestral states at interior nodes were reconstructed by maximum likelihood using the general model of Jones, Taylor, and Thornton (1992), combined with observed amino acid frequencies and the gamma distribution.
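The pooling of transmembranal segments into a single sequence, and of the remaining residues into a nontransmembranal sequence, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name is hypothetical, and segment limits are assumed to be given as 1-based inclusive (start, end) pairs that do not overlap.

```python
def split_by_topology(sequence, tm_ranges):
    """Concatenate transmembranal and nontransmembranal parts of one protein.

    sequence:  amino acid sequence as a string
    tm_ranges: list of (start, end) pairs, assumed 1-based, inclusive,
               and non-overlapping
    Returns (pooled transmembranal sequence, pooled nontransmembranal sequence).
    """
    tm_parts, non_tm_parts = [], []
    cursor = 0  # 0-based index of the first residue not yet assigned
    for start, end in sorted(tm_ranges):
        non_tm_parts.append(sequence[cursor:start - 1])
        tm_parts.append(sequence[start - 1:end])
        cursor = end
    non_tm_parts.append(sequence[cursor:])
    return "".join(tm_parts), "".join(non_tm_parts)


def long_enough(pooled_sequence, min_length=80):
    """Length criterion applied to a pooled sequence (more than 80 residues)."""
    return len(pooled_sequence) > min_length
```

Pooling is what makes the length criterion workable here, since a single segment of 20-30 residues would be far too short to yield stable estimates.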
Correlation Between Substitution Rate and Amino Acid Composition
To study the correlation between substitution rate and amino acid composition, an independent data set was created. It consisted of 230 additional nuclear protein-coding genes that were orthologous between humans and mice. All 99 complete integral membrane proteins for which information about the transmembranal segments was available were retrieved. A random sample of 131 globular proteins was added to construct a relatively balanced data set containing similar numbers of globular and transmembranal proteins. All sequences, which were more than 80 codons long, were extracted from HOVERGEN. For the integral membrane proteins, two subsets were defined. They contained, respectively, the transmembranal and the nontransmembranal regions of 93 proteins for which the total length of transmembranal segments and the total length of nontransmembranal segments were both larger than 80 codons. As above, orthology was checked using the phylogenetic tree of each gene, and the amino acid sequences were aligned by CLUSTAL W. Protein alignments were then manually adjusted and served as templates to construct the corresponding nucleotide alignments.
For each of the three human-mouse data sets (general, transmembranal, and nontransmembranal), 20 simple linear regression equations between the nonsynonymous distance and the mean frequency of each amino acid in a given gene were computed, as in Graur (1985). The nonsynonymous distance, defined here as the number of nonsynonymous (i.e., amino acid-changing) substitutions per nonsynonymous site in nucleotide sequences, was computed using Li's (1993) method.
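A minimal sketch of one such simple regression is given below. It assumes that the nonsynonymous distances have already been computed by other means (Li's [1993] method is not implemented here), and that the mean frequency of a residue in a gene is the average of its frequencies in the human and mouse proteins; the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(protein):
    """Relative frequency of each of the 20 amino acids in one sequence."""
    counts = {aa: protein.count(aa) for aa in AMINO_ACIDS}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: n / total for aa, n in counts.items()}


def mean_frequencies(human, mouse):
    """Average the human and mouse frequencies for one gene (assumed definition)."""
    f_h, f_m = aa_frequencies(human), aa_frequencies(mouse)
    return {aa: (f_h[aa] + f_m[aa]) / 2 for aa in AMINO_ACIDS}


def simple_regression(x, y):
    """Least-squares line y = intercept + slope * x, plus Pearson's r.

    x: frequency of one amino acid in each gene
    y: nonsynonymous distance of each gene (precomputed elsewhere)
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r
```

Running `simple_regression` once per amino acid, with `x` taken from `mean_frequencies` across genes, would give the 20 equations per data set described above.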
Following Graur (1985), we also fitted multiple linear regression equations between the distance and the frequencies of m amino acids (1 ≤ m ≤ 19). Starting with the residue for which the correlation between its frequency and the nonsynonymous distance was the highest, amino acids were successively introduced in the multiple-regression function (stepwise addition). At each step, the amino acid chosen was the one that gave the largest increase in the proportion of variance (in distance) explained (see Nie et al. 1975). Graur (1985) called these equations "empirical indices of mutability" (denoted Im) and suggested that they could be used to predict the substitution rates of proteins.
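The stepwise addition procedure can be sketched in plain Python as follows. This is an illustrative reimplementation under our own assumptions, not the SPSS routine cited above: the function names are hypothetical, the fit is ordinary least squares solved through the normal equations, and no significance threshold for entry is applied.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(columns, y):
    """Proportion of variance in y explained by the columns plus an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forward_stepwise(frequencies, distance, max_terms=19):
    """Add one amino acid at a time, always the one that raises R^2 the most.

    frequencies: dict mapping an amino acid to its per-gene frequencies
    distance:    per-gene nonsynonymous distances
    Returns a list of (amino acid, cumulative R^2) in order of entry.
    """
    chosen, path = [], []
    remaining = sorted(frequencies)
    while remaining and len(chosen) < max_terms:
        best = max(remaining, key=lambda aa: r_squared(
            [frequencies[n] for n in chosen + [aa]], distance))
        chosen.append(best)
        remaining.remove(best)
        path.append((best, r_squared([frequencies[n] for n in chosen],
                                     distance)))
    return path
```

The first residue entered is the one with the largest squared correlation with distance, which matches the starting rule in the text. The upper limit of 19 terms reflects the fact that the 20 frequencies sum to 1, so a model with all 20 and an intercept would be collinear.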
Results
Relative Changeabilities and Occurrences of Amino Acids
Table 1 gives the relative changeabilities and frequencies of occurrence of the 20 amino acids computed from the general set of 101 mammalian proteins. The results are compared with those obtained by Jones, Taylor, and Thornton (1992), who used sequences from a larger diversity of eukaryotic and prokaryotic organisms. Overall, the amino acid frequencies and the rankings of changeabilities are similar between the two studies. Isoleucine and serine are slightly less changeable in our data set than in Jones, Taylor, and Thornton's (1992), while methionine, histidine, and alanine are more changeable. Changeabilities and frequencies calculated for the transmembranal (33 proteins) and nontransmembranal (29 proteins) subsets are shown in table 2. Even though our data set of transmembranal segments is much more limited than that of Jones, Taylor, and Thornton (1994) (740 and 4,845 accepted point mutations, respectively), we can see that the estimated amino acid frequencies are similar and that the changeability rankings are correlated. The notable differences concern glutamine, cysteine, and leucine. The changeability of glutamine and cysteine deduced from mammalian transmembranal segments is very low (rank 3 and 4, respectively) as opposed to that estimated using a more general set of species (rank 12 and 13) in Jones, Taylor, and Thornton (1994). On the other hand, leucine switches from highly changeable (rank 15) to intermediate (rank 8). Finally, we notice that the amino acid frequencies and the pattern of changeability in nontransmembranal regions (table 2) are similar to those seen in the general data set, which is biased toward nontransmembranal globular proteins (table 1).
Correlation Between Substitution Rate and Amino Acid Composition
Using an independent data set of orthologous protein-coding human and mouse genes, we first studied the correlation between the substitution rate of a gene and the frequency of each amino acid. A protein containing high frequencies of stable residues is supposed to evolve slowly, whereas a protein rich in highly changeable residues is supposed to evolve more rapidly. Therefore, as shown by Graur (1985), we expect a high negative correlation coefficient for the most stable amino acids, a high positive coefficient for the most changeable ones, and nearly no correlation for residues with intermediate changeabilities. The first five amino acids for which the correlation between the frequency and the nonsynonymous distance between humans and mice is the strongest in the general data set (230 genes) and the transmembranal and nontransmembranal subsets (93 genes) are given in table 3. Although significant correlations were found for many amino acids, the absolute values of the correlation coefficients (r) obtained were smaller than 0.3 in all but two cases. This means that the fraction (r²) of variance in distance explained by the variability in amino acid frequency rarely exceeded 10% in all three data sets. Furthermore, the sign of the relationship was sometimes the opposite of that expected. For example, a positive correlation was found for cysteine, a very stable residue in the general and nontransmembranal sets (r = +0.281 and +0.336, respectively), and a negative correlation was obtained for valine, a highly changeable amino acid in transmembrane domains (r = -0.222).
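As a quick check on the scale of these effects, squaring the three coefficients quoted above gives the fraction of variance in distance associated with each residue (values rounded here to three decimals):

```latex
\[
r^2 = (0.281)^2 \approx 0.079, \qquad
r^2 = (0.336)^2 \approx 0.113, \qquad
r^2 = (-0.222)^2 \approx 0.049 .
\]
```

Even the largest of these corresponds to roughly 11% of the variance, and a coefficient of magnitude 0.3 corresponds to 9%.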
Discussion
In this paper, we analyzed the pattern of amino acid substitution in mammalian proteins and examined the relationship between evolutionary rate and amino acid composition. First, this study, like that of Jones, Taylor, and Thornton (1994), which was not limited to mammals, revealed that the relative amino acid changeabilities and the amino acid composition in transmembranal regions of integral membrane proteins are distinct from those in nontransmembranal parts of these proteins (table 2). In the latter regions, the pattern is similar to that observed for globular proteins. As explained by Jones, Taylor, and Thornton (1994), most of these observations can be interpreted in terms of the specific structural requirements imposed on membrane-crossing domains. The effect of constraints on amino acid composition is obvious: being inserted into a lipid environment, these segments contain very high frequencies of hydrophobic residues and very low frequencies of polar (hydrophilic) ones. As far as changeability is concerned, the low changeability of proline in transmembrane domains, compared with other regions, may be explained by the fact that this amino acid plays a major role in kinking transmembranal helices. Similarly, asparagine is more conserved in these segments than in nontransmembranal regions because it is able to create hydrogen bonds that help stabilize the helices (see Jones, Taylor, and Thornton 1994). The effects of functional constraints on composition and changeability can also be viewed in nontransmembranal parts. For example, Graur (1985) compared the frequencies of the 20 amino acids in different regions of proteins using a general set close to that of Dayhoff, Schwartz, and Orcutt (1978). He found that seven amino acids (serine, lysine, glutamic acid, aspartic acid, arginine, cysteine, and histidine) constituted 81% of active sites. In contrast, these amino acids accounted for only 35% in other regions (see table 4 in Graur 1985). These residues, except cysteine, have intermediate to high changeabilities in the general data set (rank >7; see tables 1 and 4). However, the fact that active sites evolve much more slowly than the rest of proteins implies that the relative changeabilities of these amino acids are lower in active sites than in outer regions. In conclusion, the fact that amino acid changeabilities are influenced by structural and functional context suggests that the overall changeability (i.e., the substitution rate) of a protein cannot be deduced from the amino acid content. We cannot conclude, for example, that a proline-rich protein will evolve more slowly than a proline-poor one if the changeability of proline is different in the two molecules.
A weak relationship between rate of protein evolution and amino acid composition is indeed the second point emphasized in our study. Among human and mouse integral membrane proteins, segments (either transmembranal or not) evolving at different rates maintain similar compositions. On the other hand, transmembranal and nontransmembranal regions of different proteins can manifest similar rates of evolution even if their amino acid contents are distinct (table 5). According to the predictions based on amino acid changeability, fast-evolving proteins are supposed to contain large frequencies of highly changeable amino acids and low frequencies of highly stable residues (Graur 1985). Some of the results of regression analyses based on orthologous pairs of human and mouse genes presented in tables 3 and 4 contradict those predictions. Low but significant positive correlations between substitution rate and frequencies of stable residues were found, and negative correlations between the rate and frequencies of highly changeable amino acids were also obtained. Significant correlations for residues with intermediate changeabilities were found, although no relationship was expected. Some of these discrepancies may be reconciled with the predictions based on the amino acid changeabilities computed using the index employed by Graur (1985). For example, in Graur (1985), arginine and glutamine were classified as highly stable (rank <6) and highly changeable (rank >16) amino acids, respectively (see table 2 in Graur 1985). Hence, the negative coefficient r = -0.231 obtained for the former residue and the positive coefficient r = 0.250 obtained for the latter in the general data set (table 3) would be in agreement with the expectations. However, the stability index used by Graur was based on the physicochemical distance between two amino acids that are separated by a single nucleotide substitution between codons. This is a theoretical measure that does not depend on sequence data. In particular, a single index would be inappropriate for describing the changeability pattern in both transmembranal and nontransmembranal regions for which the substitution processes clearly differ because of functional constraints. Furthermore, Dayhoff, Schwartz, and Orcutt (1978) found that about 20% of the changes observed in their data set involved amino acids whose codons differed by more than one nucleotide. In contrast, the changeabilities we used were derived from amino acid substitutions observed in real data and therefore give a better representation of the actual evolutionary pattern in protein sequences. Interestingly, most of the correlations between amino acid composition and substitution rate were low. The variability in amino acid frequencies among proteins could not explain more than 30% of the variability in nonsynonymous distance even when several amino acids were considered (fig. 2 and table 4). The fraction of variance explained by any individual residue was never higher than 14% and was generally less than 10% (table 3). These findings, obtained for a general data set of 230 protein-coding genes from humans and mice, as well as for the transmembranal and nontransmembranal subsets of 93 genes, are in sharp contrast to the results of Graur (1985). Using a set of 60 mammalian proteins, he found that the frequency of glycine alone could account for 38% (r = -0.619) of the variability in nonsynonymous substitution rate and that almost all variation (97%) was explained with all 20 amino acids. Evidently, these correlations do not hold when additional proteins are included in the analysis. We think that the multiple-regression equations Im cannot be reliably used as mutability or compositional indices to predict the substitution rate of a protein or a particular region within a protein (fig. 2).
For many years, it has been observed that functionally important regions within a protein, like catalytic sites or binding domains, accumulate few changes because of intense selective pressure, whereas less important parts evolve faster because of more relaxed constraints. In a particular membrane-associated protein, the rate of substitution in transmembrane domains is usually slower than that in other segments (fig. 3). The above results indicate clearly that this difference in rate is not a consequence of the difference in amino acid content. Rather, it is because transmembrane segments are subject to more stringent structural constraints. Jones, Taylor, and Thornton (1994) also noted that multiple-spanning transmembrane segments, which require additional structural features allowing them to cross the membrane several times, evolve more slowly than single-spanning domains. We conclude that the selection associated with structural and functional requirements is really the main factor that determines the rate of protein evolution.
Acknowledgements
This study was supported by NIH grants. We thank two anonymous reviewers for their very helpful comments on an earlier version of this manuscript.
Footnotes
Wolfgang Stephan, Reviewing Editor
1 Keywords: amino acid composition, substitution rate, functional and structural requirements, selective constraints, protein evolution
2 Address for correspondence and reprints: Wen-Hsiung Li, Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, Illinois 60637. E-mail: whli{at}uchicago.edu
Literature Cited
Bairoch, A., and R. Apweiler. 1999. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res. 27:49-54.
Cao, Y., J. Adachi, and M. Hasegawa. 1994. Eutherian phylogeny as inferred from mitochondrial DNA sequence data. Jpn. J. Genet. 69:455-472.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. Pp. 345-352 in Atlas of Protein Sequence and Structure. Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C.
Duret, L., D. Mouchiroud, and M. Gouy. 1994. HOVERGEN: a database of homologous vertebrate genes. Nucleic Acids Res. 22:2360-2365.
Gatesy, J. 1997. More DNA support for a Cetacea/Hippopotamidae clade: the blood-clotting protein gene gamma-fibrinogen. Mol. Biol. Evol. 14:537-543.
Goodman, M., C. A. Porter, J. Czelusniak, S. L. Page, H. Schneider, J. Shoshani, G. F. Gunnell, and C. P. Groves. 1998. Toward a phylogenetic classification of Primates based on DNA evidence complemented by fossil evidence. Mol. Phylogenet. Evol. 9:585-598.
Graur, D. 1985. Amino acid composition and the evolutionary rates of protein-coding genes. J. Mol. Evol. 22:53-62.
Graur, D., L. Duret, and M. Gouy. 1996. Phylogenetic position of the order Lagomorpha. Nature 379:333-335.
Janke, A., X. Xu, and U. Arnason. 1997. The complete mitochondrial genome of the wallaroo (Macropus robustus) and the phylogenetic relationship among Monotremata, Marsupialia, and Eutheria. Proc. Natl. Acad. Sci. USA 94:1276-1281.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275-282.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1994. A mutation data matrix for transmembrane proteins. FEBS Lett. 339:269-275.
Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, England.
Kuma, K., and T. Miyata. 1994. Mammalian phylogeny inferred from multiple protein data. Jpn. J. Genet. 69:555-566.
Li, W.-H. 1993. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol. 36:96-99.
Li, W.-H. 1997. Molecular evolution. Sinauer, Sunderland, Mass.
Li, W.-H., M. Gouy, P. M. Sharp, C. O'hUigin, and Y.-W. Yang. 1990. Molecular phylogeny of Rodentia, Lagomorpha, Primates, Artiodactyla, and Carnivora and molecular clocks. Proc. Natl. Acad. Sci. USA 87:6703-6707.
Nei, M. 1987. Molecular evolutionary genetics. Columbia University Press, New York.
Nie, N. H., C. H. Hull, J. G. Jenkins, K. Steinbrenner, and D. H. Bent. 1975. SPSS: statistical package for the social sciences. McGraw-Hill, New York.
Robinson, M., F. Catzeflis, J. Briolay, and D. Mouchiroud. 1997. Molecular phylogeny of rodents, with special emphasis on murids: evidence from nuclear gene LCAT. Mol. Phylogenet. Evol. 8:423-434.
Shimamura, M., H. Yasue, K. Ohshima, H. Abe, H. Kato, T. Kishiro, M. Goto, I. Munechika, and N. Okada. 1997. Molecular evidence from retroposons that whales form a clade within even-toed ungulates. Nature 388:666-670.
Shoshani, J., C. P. Groves, E. L. Simons, and G. F. Gunnell. 1996. Primate phylogeny: morphological vs. molecular results. Mol. Phylogenet. Evol. 5:102-154.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Ursing, B. M., and U. Arnason. 1998. Analyses of mitochondrial genomes strongly support a hippopotamus-whale clade. Proc. R. Soc. Lond. B Biol. Sci. 265:2251-2255.
Yang, Z. 1999. Phylogenetic analysis by maximum likelihood (PAML). Version 2.0. University College London, London.
Yang, Z., S. Kumar, and M. Nei. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:1641-1650.
Zhang, J., and M. Nei. 1997. Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods. J. Mol. Evol. 44(Suppl. 1):S139-S146.
Zuckerkandl, E., and L. Pauling. 1965. Molecules as documents of evolutionary history. J. Theor. Biol. 8:357-366.