From the Department of Molecular Biology, Princeton University, Princeton, New Jersey 08544
ABSTRACT |
---|
The mutability of each amino acid has been determined empirically through pair-wise comparison of aligned homologous protein sequences; mutability is defined as the number of times an amino acid differs at analogous sites of two aligned sequences divided by the total occurrence of that amino acid within the pair of sequences (1). Thus, an amino acid that has mutated relatively frequently over the course of evolution is assigned a high mutability, whereas an amino acid that has mutated relatively infrequently is assigned a low mutability. Amino acids differ in mutability according to the ease with which each particular amino acid may be structurally or functionally replaced by any other within proteins. This depends on the size, shape, hydrophobicity, and charge of each amino acid side chain and its ability to form various types of weak bonds, as well as the structure of the genetic code.
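The definition above can be illustrated with a minimal sketch for a single pair of aligned sequences. The function name, the one-letter residue encoding, and the decision to skip gapped columns are assumptions made for illustration; the published mutability values (1) were derived from many sequence pairs, not from one.

```python
from collections import Counter


def pairwise_mutability(seq_a: str, seq_b: str) -> dict[str, float]:
    """Mutability of each amino acid for one pair of aligned sequences.

    For every residue type: (number of aligned sites at which it differs
    from the residue in the other sequence) / (its total occurrence in
    both sequences).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    differs = Counter()  # times a residue sits at a mismatched site
    occurs = Counter()   # total occurrences across both sequences

    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # assumption: columns containing a gap are ignored
        occurs[a] += 1
        occurs[b] += 1
        if a != b:
            differs[a] += 1
            differs[b] += 1

    return {aa: differs[aa] / occurs[aa] for aa in occurs}


if __name__ == "__main__":
    # Toy alignment (hypothetical data): one Cys/Gly mismatch at site 3.
    for aa, m in sorted(pairwise_mutability("ACCA", "ACGA").items()):
        print(aa, round(m, 3))
```

In this toy pair, alanine never differs and so receives a mutability of zero, whereas glycine differs at its only occurrence and receives the maximum value of one.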
Our approach is based upon the following premise. An amino acid with relatively low mutability is by definition less likely to change over the course of sequence evolution than other amino acids. Therefore, as an original set of ancestral sequences gives rise to successive generations of descendants, the frequency of such an amino acid within conserved positions of those descendants (i.e. residues that are unchanged between ancestral and descendant sequences) will increase relative to its frequency within the entire ancestral sequence set. Consequently, the frequency of an amino acid with low mutability within conserved sequence positions of descendant sequences provides an upper limit on its frequency within the ancestral sequences, i.e. it must have occurred with a lower frequency within the ancestral sequences as a whole than within the conserved positions of descendant sequences. On the other hand, the frequency of an amino acid with relatively high mutability will decrease over evolution within conserved positions of descendant sequences relative to the entire ancestral sequence set; thus, its frequency within conserved positions provides a lower limit on its frequency within the ancestral sequences. It is important to recognize that these inferences regarding the upper and lower limits of amino acid frequencies within ancestral sequences are completely independent of substitution events occurring within non-conserved sequence positions.
As a consequence of the limits specified above, two general types of observations (Table I) would suggest that a change in frequency of an amino acid over evolution within a set of proteins had occurred. First, if an amino acid with low mutability occurs less frequently within conserved than within non-conserved residues of the extant protein set, its frequency must have increased over evolution, because its frequency within ancestral sequences can be inferred to have been lower than that within conserved residues, and its overall frequency in the extant set, being a weighted average of its conserved and non-conserved frequencies, is higher than that within conserved residues. Conversely, if an amino acid with high mutability occurs with greater frequency within conserved than non-conserved residues, its frequency can be inferred to have decreased over evolution, because its frequency within ancestral sequences can be inferred to have been higher than that within conserved residues. It is worth remarking that, based on this approach, no inferences regarding changing amino acid frequencies may be made in cases in which an amino acid with low mutability occurs more frequently, or an amino acid with high mutability occurs less frequently, within conserved than non-conserved residues. Nonetheless, this approach may identify some amino acids that have changed in frequency over deep evolutionary time and thereby provide novel insights regarding early proteins. Guided by this rationale, we determined the frequency of each amino acid in conserved and non-conserved sequence elements of a set of extant proteins dating to the LUA in 26 species spanning the three primary lineages.
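The decision rule described in this paragraph can be written as a small function. This is only a sketch of the logic: the function name, the "low"/"high" labels, and the example frequencies are hypothetical, and the statistical significance of the difference between the two frequencies is not tested here.

```python
def infer_trend(mutability: str, freq_conserved: float, freq_nonconserved: float) -> str:
    """Apply the two inference rules to one amino acid.

    mutability: "low" or "high" (a relative classification, assumed given).
    Returns "increased", "decreased", or "no inference".
    """
    if mutability not in ("low", "high"):
        raise ValueError("mutability must be 'low' or 'high'")

    # Low mutability: the conserved frequency is an upper limit on the
    # ancestral frequency. If non-conserved sites are richer still, the
    # extant overall frequency exceeds the ancestral one.
    if mutability == "low" and freq_conserved < freq_nonconserved:
        return "increased"

    # High mutability: the conserved frequency is a lower limit on the
    # ancestral frequency. If non-conserved sites are poorer still, the
    # extant overall frequency is below the ancestral one.
    if mutability == "high" and freq_conserved > freq_nonconserved:
        return "decreased"

    # The remaining two combinations do not constrain the direction.
    return "no inference"


if __name__ == "__main__":
    # Hypothetical frequencies, for illustration only.
    print(infer_trend("low", 0.004, 0.010))
    print(infer_trend("high", 0.060, 0.050))
    print(infer_trend("low", 0.080, 0.070))
```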
EXPERIMENTAL PROCEDURES |
---|
Our first requirement was that a member of a protein family be present in at least one species of each of the three primary lineages, because this criterion is used to infer that an ancestor of that family was present in the LUA (4). In fact, we required that for any protein family to be included in the study, at least one member had to be present in all 26 species selected from the COG database (for the list of species, see the legend to Table V). This made it possible to assemble a set containing members from the same protein families for each of these species. Although only one eukaryote, Saccharomyces cerevisiae, was included in the analysis, this did not in any way limit the ability to identify conserved sequence positions within the protein set or to draw conclusions based on the data obtained. In fact, the very wide phylogenetic representation of both eubacteria and archaea was more than sufficient to identify conserved residues, allowing inferences to be drawn regarding the frequency of certain amino acids within ancestral sequences in the LUA.
After these requirements were fulfilled, our protein set consisted of 59 COG families (Table II). Forty-five of these proteins play some role in translation (many are ribosomal proteins), and another seven play a role in transcription, replication, or DNA repair. These all are classified as informational proteins (7), because they function in replication, transcription, or translation. The remaining seven proteins are classified as operational proteins (7), which perform metabolic and other housekeeping roles within the cell. Informational proteins have been found to be less likely to be laterally transferred than operational proteins (7), and because one of the goals in choosing the set was to avoid laterally transferred proteins, the high proportion of informational proteins in the set was both expected and reassuring.
To identify conserved residues more accurately, maximum parsimony (9) was used to partially reconstruct the ancestral protein sequences in the LUA that gave rise to each family of aligned descendants. The protein parsimony software "protpars" included in the PHYLIP phylogenetic package (10) was used to partially reconstruct ancestral sequences, assuming the phylogenetic tree indicated by small subunit rRNA data (5). Using the inferred ancestral sequence, conserved and non-conserved sites within the descendant sequence of each species were identified. Because these ancient sequences have diverged to a great extent, only slightly more than a third (37%) of the sites within the ancestral sequence could be reconstructed. At sequence positions for which no ancestral residue could be assigned, it was assumed that residues within none of the descendant sequences were conserved. The frequency of each amino acid within conserved and non-conserved residues of the sequence set in each species could then be determined.
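Once an ancestral sequence has been partially reconstructed, the classification of descendant residues described above might be sketched as follows. This is not the protpars procedure itself. It assumes the reconstruction is already available as a string aligned to the descendant, with a hypothetical "?" marking positions for which no ancestral residue could be assigned.

```python
from collections import Counter


def split_frequencies(ancestor: str, descendant: str, unresolved: str = "?"):
    """Amino acid frequencies within conserved and non-conserved residues.

    A descendant residue counts as conserved only when the ancestral
    residue at that position was reconstructed and is identical to it.
    Unresolved ancestral positions are treated as non-conserved.
    """
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be aligned to equal length")

    conserved, nonconserved = Counter(), Counter()
    for anc, res in zip(ancestor, descendant):
        if res == "-":
            continue  # assumption: gaps in the descendant are skipped
        if anc != unresolved and anc == res:
            conserved[res] += 1
        else:
            nonconserved[res] += 1

    def normalize(counts: Counter) -> dict[str, float]:
        total = sum(counts.values())
        return {aa: n / total for aa, n in counts.items()} if total else {}

    return normalize(conserved), normalize(nonconserved)


if __name__ == "__main__":
    # Toy sequences (hypothetical): position 3 is unresolved, position 5 differs.
    cons, noncons = split_frequencies("MK?LC", "MKALV")
    print("conserved:", cons)
    print("non-conserved:", noncons)
```

Pooling the two counters over all families within a species, before normalizing, would give per-species frequencies of the kind compared in the study.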
RESULTS |
---|
The frequencies of cysteine, tyrosine, and phenylalanine within conserved residues are 0.0039, 0.0231, and 0.0331, respectively (Table III). Because of their low mutability, the frequencies of these amino acids within conserved residues provide an upper limit on their frequencies within this protein set in the LUA. By comparison, the frequencies of cysteine, tyrosine, and phenylalanine within the protein set as a whole are 0.0074, 0.0297, and 0.0374, respectively. It can therefore be inferred that the frequency of cysteine has nearly doubled (an increase of at least 1.9-fold) within this protein set between the LUA and today, whereas that of tyrosine has increased at least 29% and phenylalanine at least 13%.
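The arithmetic behind these lower bounds can be checked directly from the frequencies quoted above. In the sketch below, the function name and the three-letter keys are illustrative choices; the numerical values are those given in the text.

```python
# Frequencies quoted in the text (Table III).
CONSERVED = {"Cys": 0.0039, "Tyr": 0.0231, "Phe": 0.0331}
WHOLE_SET = {"Cys": 0.0074, "Tyr": 0.0297, "Phe": 0.0374}


def minimum_fold_increase(conserved: dict, whole: dict) -> dict:
    """Lower bound on the fold increase since the ancestral sequences.

    The conserved frequency is an upper limit on the ancestral frequency,
    so whole / conserved is a lower limit on the true fold change.
    """
    return {aa: whole[aa] / conserved[aa] for aa in conserved}


if __name__ == "__main__":
    for aa, fold in minimum_fold_increase(CONSERVED, WHOLE_SET).items():
        print(f"{aa}: at least {fold:.2f}-fold ({(fold - 1) * 100:.0f}% increase)")
    # Cys: at least 1.90-fold (90% increase)
    # Tyr: at least 1.29-fold (29% increase)
    # Phe: at least 1.13-fold (13% increase)
```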
Given these findings, we sought to determine whether the frequency of these three amino acids increased to an even greater extent within the modern whole-genome protein sets (i.e. proteomes) than within the ancient protein set. To this end, the mean frequency of each amino acid within the ancient protein set and within the proteomes was compared. Data on the proteomic frequency of these amino acids were taken from the Proteome Analysis Database (13). The mean frequency of cysteine within the ancient protein set is 0.0074 compared with 0.0099 in the proteomes, the frequency of tyrosine is 0.0297 versus 0.0335, and the frequency of phenylalanine is 0.0375 versus 0.0437. It is apparent, therefore, that the frequency of these three amino acids within modern proteomes has increased even more than within the set of ancient proteins itself.
To gain insight on whether cysteine, tyrosine, and phenylalanine might still be increasing in frequency today, we determined whether they are present in modern proteomes at frequencies predicted by neutral evolution. The neutral theory of molecular evolution predicts that an amino acid within a proteome should eventually reach an equilibrium frequency determined primarily by the number of codons assigned to that amino acid, adjusted for the nucleotide composition of its codons and the nucleotide composition of the genomic coding sequences (14). The probability of observing amino acid j in a specific genome is given by $p_j = \alpha \sum_i x_i y_i z_i$, where i represents each codon assigned to amino acid j; $x_i$, $y_i$, and $z_i$ represent the frequency of occurrence of the first, second, and third nucleotides, respectively, of codon i within coding sequences of that genome; and $\alpha$ is a constant such that the sum of $p_j$ over all amino acids is equal to one. The normalization constant $\alpha$ compensates for probabilities assigned to stop codons.
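The neutral expectation described above can be sketched as follows, assuming the standard genetic code and position-specific nucleotide frequencies supplied as dictionaries. The function and variable names are illustrative. The normalization constant is obtained here by dividing by the total probability of all sense codons, which is equivalent to requiring that the amino acid probabilities sum to one.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def neutral_frequencies(pos1: dict, pos2: dict, pos3: dict) -> dict:
    """Expected amino acid frequencies from nucleotide composition alone.

    pos1, pos2, pos3 map each base (T, C, A, G) to its frequency at the
    first, second, and third codon position of the coding sequences.
    """
    raw = defaultdict(float)
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # stop codons carry probability but encode no residue
        raw[aa] += pos1[codon[0]] * pos2[codon[1]] * pos3[codon[2]]

    total = sum(raw.values())  # 1 minus the stop-codon probability
    return {aa: value / total for aa, value in raw.items()}


if __name__ == "__main__":
    # Uniform base composition (hypothetical): every sense codon is equally
    # likely, so each frequency is (number of codons) / 61.
    uniform = {base: 0.25 for base in BASES}
    expected = neutral_frequencies(uniform, uniform, uniform)
    for aa in "CYF":
        print(aa, round(expected[aa], 4))
```

With real data, the three dictionaries would be filled from genome-specific codon usage tables, and the resulting values for C, Y, and F compared with the observed proteomic frequencies.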
Using genomic coding sequence nucleotide frequency data derived from the Codon Usage Database (15), the frequencies of cysteine, tyrosine, and phenylalanine in the proteome of each species predicted by neutral evolution were determined (Table V). The observed frequency of cysteine is significantly less than that predicted in all 26 species (p < 0.01), the mean over all species being one-third of that predicted. In contrast, the observed frequency of tyrosine is less than predicted in only 15 of the species (p = 0.28, which is not statistically significant), and the mean observed frequency of tyrosine, 0.0335, is close to that predicted, 0.0358. For phenylalanine, the observed frequency is higher than predicted in 25 species (p < 0.01), the mean observed frequency, 0.0437, being 40% higher than predicted. Therefore, the observed frequency of cysteine is less than, and of phenylalanine is greater than, that predicted by neutral evolution, whereas that of tyrosine agrees with the prediction of neutral evolution.
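The text does not name the statistical test behind these p values. As an assumption, a one-sided sign test on the number of species deviating in a given direction is sketched below; for 15 of 26 species it gives a value of about 0.28, in line with the figure reported for tyrosine. The function name is illustrative.

```python
from math import comb


def sign_test_one_sided(k: int, n: int) -> float:
    """P(X >= k) for X ~ Binomial(n, 0.5).

    Under the null hypothesis, each species is equally likely to fall
    above or below the neutral prediction.
    """
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


if __name__ == "__main__":
    n_species = 26
    for label, k in [("cysteine", 26), ("tyrosine", 15), ("phenylalanine", 25)]:
        print(f"{label}: {k}/{n_species} species, p = {sign_test_one_sided(k, n_species):.2g}")
    # cysteine: 26/26 species, p = 1.5e-08
    # tyrosine: 15/26 species, p = 0.28
    # phenylalanine: 25/26 species, p = 4e-07
```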
DISCUSSION |
---|
Consequently, we propose that upon their introduction into the code, these three amino acids would have gone from being non-existent to being rare within early coded proteins. Furthermore, because of the distinct physicochemical properties of these amino acids, the majority of subsequent coding sequence mutations introducing them into proteins presumably would have been deleterious, causing their increase in frequency to be gradual (that of cysteine especially so). Because our data indicate that these three amino acids increased in frequency between the LUA and today, they must not have reached their equilibrium frequencies by the time of the LUA. According to this scenario, the under-representation of these amino acids in the LUA relative to today is consistent with their late addition to the genetic code.
It has conventionally been assumed that the time between the origin of proteins and today has been sufficient for all amino acids to reach their equilibrium frequencies and therefore, that an observed frequency of an amino acid distinct from that predicted by neutral evolution is evidence of some strict requirement of protein structure or function that places unusual selection on that amino acid (14). However, because our findings suggest that at the time of the LUA, cysteine, tyrosine, and phenylalanine had yet to reach equilibrium frequencies, change of amino acid composition toward that predicted by neutral evolution may be a process requiring very long time periods. Indeed, the observation that the frequency of cysteine is so much lower than that predicted by neutral evolution in modern proteomes may be evidence that the increase in usage of this particular amino acid has been especially gradual over evolution. Consequently, the possibility that even today cysteine continues to move toward its equilibrium frequency through neutral evolution, as the vast range of all possible sequence space is gradually searched, cannot be ruled out. On the other hand, over time phenylalanine has become more frequent in proteins than predicted by neutral evolution. In fact, it is possible that the frequency of phenylalanine, too, will increase further with evolution. In any case, positive selection for phenylalanine has caused any initial rarity of this amino acid in the earliest proteins to be overcome. The same may be argued for tyrosine, the observed frequency of which does not differ significantly from that predicted by neutral evolution.
Although our approach did not produce evidence for a change in frequency of any of the other 17 amino acids over the course of evolution, this does not imply that no other amino acids have changed in frequency. Using our rationale, it is not possible to reach a definite conclusion regarding the change in frequency (or lack thereof) of those amino acids of high mutability that are less frequent in conserved than non-conserved positions and those of low mutability that are more frequent in conserved than non-conserved positions. Moreover, our ability to make inferences was limited by the lack of consensus on the relative mutability of six amino acids (see Table III and Table IV). It is therefore possible that amino acids other than cysteine, tyrosine, and phenylalanine have increased in frequency since the LUA. With the increase in frequency of these three (and perhaps other) amino acids, there must have been a concomitant decrease in frequency of at least one other amino acid. Because valine is of low mutability and is present at greater frequency in conserved than non-conserved sequence elements (although not to a statistically significant extent), it may indeed have decreased in frequency over time. An alternative approach will be required to determine with certainty which amino acids other than cysteine, tyrosine, and phenylalanine have in fact changed in frequency over evolution.
It is not immediately evident how amino acid composition and structure have co-evolved in the ancient protein set investigated. Studies of protein evolution suggest that structure and function can be well conserved even as protein sequence diverges extensively (see Ref. 19, but see Ref. 20 for a contrary view). However, evolution of amino acid composition may have impacted structure in newly arising proteins of the proteome. Each amino acid has a specific predisposition to occur in different secondary structures, i.e. in α-helices, β-sheets, or random coils (21, 22), and negative selection preserving structure would have been relatively relaxed in this later protein set. Further investigation will be required to elucidate structural consequences of changes in proteomic amino acid composition.
FOOTNOTES |
---|
1 The abbreviations used are: LUA, last universal ancestor; COG, clusters of orthologous groups.
* The computational facility utilized for this work was obtained with funds provided by the Department of Defense through MEDCOM at Fort Detrick, MD (to J. R. F.). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Supported in part by predoctoral traineeships from National Institutes of Health Grant 2T32GM07388-22 and from National Science Foundation Grant DGE 9972930.
Published, MCP Papers in Press, November 13, 2001, DOI
To whom correspondence should be addressed. Tel.: 609-258-3927; Fax: 609-258-2759; E-mail: jrfresco{at}princeton.edu.
REFERENCES |
---|