*Department of Molecular Biology, Princeton University
Department of Haematology, Cambridge Institute for Medical Research, United Kingdom;
Department of Computer Science and the Lewis-Sigler Institute for Integrative Genomics, Princeton University
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The recent increase in the number of whole genome sequences has made the analysis of the corresponding inferred proteomes possible. These analyses have included a statistical description of amino acid composition and sequence length (Gerstein 1998a
), and surveys of both structurally determined (Gerstein 1998b;
Wolf, Grishin, and Koonin 2000
) and predicted folds (Gerstein 1997
). Interspecies proteomic comparisons have revealed the phylogenetic distribution of protein families (Tatusov, Koonin, and Lipman 1997
); in turn, this knowledge has been used to infer the protein complement of the Last Universal Ancestor (LUA) (Kyrpides, Overbeek, and Ouzounis 1999
), the single-celled ancestor that gave rise to the three primary lineages, the eubacteria, archaea, and eukaryotes.
Here, we extend these earlier proteomic characterizations by estimating the amino acid composition of proteins in an ancestral genome. The existing methods used for the reconstruction of ancestral protein sequences, maximum parsimony (MP) and maximum likelihood (ML), are by themselves insufficient to provide accurate estimates of ancestral sequence composition. Where sequences have diverged significantly, MP can only provide partial sequence reconstructions; the identity of residues at many sites in the ancestral sequence will remain ambiguous. Moreover, those sites with assigned residues will be biased toward amino acids that tend to be conserved. On the other hand, ML requires an assumption of the amino acid composition of the sequences being reconstructed, and it is usually assumed that the composition of the ancestors is the same as that of the extant descendants (Yang, Kumar, and Nei 1995
). Although Galtier, Tourasse, and Gouy (1999)
estimated the GC content of ancestral genomic DNA sequences using an ML approach (not on the basis of sequence reconstruction), a similar approach may not be tractable for protein sequences, given the significantly larger number of parameters to be estimated.
The method we introduce for inferring ancestral amino acid composition is based on the insight that the amino acid composition of conserved residues in present-day proteins, i.e., those residues which are unchanged between an ancestral sequence and any given descendant sequence, is determined by two factors: (1) the amino acid composition within ancestral sequences, and (2) the relative probability of conservation of each amino acid between an ancestral and an extant descendant sequence. Reversing this logic, given the amino acid composition of conserved residues and the relative probability of conservation of each amino acid, the amino acid composition within ancestral sequences can be inferred. Our approaches for identifying and estimating the composition of conserved residues, and estimating the relative probability of conservation of different amino acids are discussed subsequently (see Methods).
Simulations of sequence evolution have been employed previously to evaluate alternative methods for reconstructing events in the evolutionary past. For example, they have been used to assess the accuracy of methods for building phylogenetic trees (Saitou and Nei 1987) and reconstructing ancestral protein sequences (Zhang and Nei 1997
). We introduce simulations here to model errors in the estimate of amino acid composition of ancestral sequences. Using these models, the bias in the estimated ancestral amino acid frequencies might be compensated for, allowing the estimates to be placed within a confidence interval.
Using our approach, we have estimated the amino acid composition of a set of 65 proteins within the LUA. We infer that within this set of proteins many of the amino acids believed to have been most abundant in the prebiotic environment (Miller 1953
, 1987
; Kvenvolden et al. 1970
), including glycine, alanine, aspartic acid, and valine, were used more frequently within proteins of the LUA than within those of modern species. On the other hand, amino acids believed to have been rare or unavailable prebiotically, including cysteine, tryptophan, tyrosine, and phenylalanine, were generally used much less frequently within proteins of the LUA. This may reflect the order of addition of these two groups of amino acids to the genetic code.
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
|
Estimating Probabilites
Identifying Conserved Residues Using MP Sequence Reconstruction
Let P'(i), P'(conserved), P'(conserved|i), and P'(i|conserved) denote the estimated values of P(i), P(conserved), P(conserved|i), and P(i|conserved), respectively. To estimate P'(conserved) and P'(i|conserved), conserved residues must first be identified. Although conserved residues are frequently defined as residues which are identical in all descendant sequences, this definition is too restrictive for our purposes. Instead we identify, as well as possible, all cases in which a residue remains unchanged between the ancestor and any of its descendant sequences; this is consistent with the definition of P(conserved) in equations (1a) and (1b)
. To facilitate this process, ancestral sequences are partially reconstructed through MP (Eck and Dayhoff 1966
), and these are compared with modern sequences to identify residues conserved between the ancestor and descendant (fig. 1B
). At each site s along the sequence, the number of identities between the inferred ancestor and each of the descendant sequences is tabulated to determine P'(conserved) at s. For example, if three of the four descendant sequences retain the original residue at site s, P'(conserved) at s is 0.75. P'(conserved) is then averaged over all sites. P'(i|conserved) simply represents the composition of sites identified as conserved between the ancestor and any of the descendants (fig. 1B
).
As illustrated in figure 1 , this method does not identify all instances in which a residue has been conserved between the ancestor and descendant, resulting in errors in P'(conserved) and P'(i|conserved). However, this method is preferable to counting as conserved only those sites identical in all descendants because on the basis of that criterion the actual conserved residues would be much more undercounted and thus P'(conserved) and P'(i|conserved) would be even more biased. Importantly, error in P'(i) arising through error in P'(conserved) and P'(i|conserved) may be modeled through simulation and taken into account in the derivation of the final estimates of P(i) (see Modeling Error in P'(i) Through Simulation below).
Estimating P(conserved|i) Using Subsitutition Probability Matrices
In theory, in addition to P'(conserved) and P'(i|conserved), P'(conserved|i) could also be obtained through a comparison of descendant sequences with partially reconstructed ancestral sequences. However, obtaining estimates of all three probabilities in such a manner would lead to values of P'(i) equivalent to the composition of the partially reconstructed sequences (once the values are normalized to sum to one); these P'(i) values are expected to be biased toward those amino acids which have a high relative probability of conservation (this expectation was confirmed through simulations; data not shown). To overcome this difficulty, substitution probability matrices based on empirical protein data are introduced as an independent means for obtaining P'(conserved|i).
A PAM 1 matrix represents the global average substitution probabilities that will result in approximately one substitution per 100 residues when applied to a sequence with the same average composition as the sequence set used to derive the matrices. Substitution probabilities for greater evolutionary distances can be computed as powers of the PAM 1 matrix (Dayhoff, Schwartz, and Orcutt 1978
, pp. 345352); this requires the assumption that substitution probabilities are independent of time and position within the sequence. For modeling the composite evolution of a large sequence set in which only average substitution probabilities are required, this assumption is a reasonable first approximation, one which is frequently used in modeling protein evolution (Zhang and Nei 1997
). Values for P(conserved|i) are then represented by the diagonal elements of the PAM probability matrix corresponding to a given evolutionary distance (Dayhoff, Schwartz, and Orcutt 1978
). That is, P(conserved|i) is the probability that amino acid i is replaced by itself over the specified evolutionary distance.
Using this approach to estimate P(conserved|i) thus requires that a model of evolution based on a set of empirical matrices be assumed, which is common for the evolutionary analyses of proteins. Because a molecular clock is assumed, regardless of the evolutionary relationships between the descendant sequences, for any amino acid in the ancestor the expected fraction of descendant sites conserved is equal to the probability of conservation for that amino acid at a given evolutionary distance. As noted previously, error in the estimates of P(i) arising through deviation from the assumed molecular clock may be modeled and compensated for through simulation.
The question remains as to how to choose the correct PAM distance for estimating P(conserved|i). Given the diagonal elements P(conserved|i)k of a PAM matrix of distance k and P(i) of the ancestral sequences, P(conserved)k, the probability that any residue is conserved at evolutionary distance k, is:
|
Conversely, given P(i) and P(conserved) of a set of sequences, the PAM distance k separating these sequences from their ancestor may be determined by finding the distance k for which:
|
|
In our approach, although P(conserved) and P(i|conserved) may be estimated as described previously, P(i) and the distance k from which P(conserved|i) may be estimated are both unknown. We therefore developed the following method to identify the matrix distance separating ancestral and offspring sequences, and thus P(conserved|i) and P(i), given P(conserved) and P(i|conserved). PAM matrices of successively increasing distance k are used to supply P(conserved|i)k. The method empirically determines the distance k which best fits our estimates of P(conserved) and P(i|conserved) as follows:
(1) For each k (a) estimate P(i) by P(i)k = P(i|conserved) x P(conserved)/P(conserved|i)k and (b) because these probabilities should sum to 1.0, calculate the normalization factor k such that
k
P(i)k = 1.0.
(2) Choose the matrix for which |log k| is minimized, i.e., the matrix k for which the probability estimates of P(i) require the least amount of normalization.
The corresponding normalized values of P(i)k indicate P(i). It may be shown that our method finds the integer k whose estimates of P(conserved|i) and P(i) minimize the term given in equation (2c) (see Appendix). The same method may be used to derive P'(conserved|i) and P'(i), given P'(conserved) and P'(i|conserved).
Assuming that a representative subset of proteins within a genome is being analyzed, substitution matrices based on good global alignments of generic protein families are most likely to accurately reflect probabilities relating to the evolution of the set. Such matrices include the original PAM matrix of Dayhoff (Dayhoff, Schwartz, and Orcutt 1978
) and its updated version by Jones, Taylor, and Thornton (1992)
. Because the matrices of Gonnet, Cohen, and Benner (1992)
are based upon alignments of highly divergent homologs, which are inherently difficult to align, for our purposes the matrices of Jones, Taylor, and Thornton (1992)
are preferable. (These matrices are based on alignments of relatively closely related proteins [>85% identity].) The matrices of Jones, Taylor, and Thornton (1992)
, based on release 22 of the SWISS-PROT databank, the current version of which is described in Bairoch and Apweiler (2000)
, were therefore used for all the analyses described here. This series is frequently used to represent substitution probabilities for evolutionary modeling (see, for example, Yang, Kumar, and Nei 1995
; Zhang and Nei 1997
). It should be noted that neither matrices based on local alignments, such as BLOSUM, which are biased toward protein interiors (Henikoff and Henikoff 1992
), nor matrices derived from specific protein subclasses, such as transmembrane proteins (e.g., Jones, Taylor, and Thornton 1994
), are suitable to provide estimates of P(conserved|i).
Lineage-specific versus pooled estimates
Rather than pooling data from multiple species to arrive at P'(conserved), P'(i|conserved), P'(conserved|i), and ultimately P'(i), independent estimates of P(i) for each lineage can be made and then averaged. This allows the elimination of the assumption of a molecular clock. However, the differences in P'(i) arrived at using either pooled or lineage-specific data were insignificant (data not shown). Because pooling data from all lineages facilitates modeling and correction for error in the estimates, estimates based on pooled data were used.
Estimating Amino Acid Composition in the LUA
Choice of Protein Set
Although there has been much recent discussion of the nature of the LUA (see for example Woese 1998
), it is sufficient for our purposes to think of the LUA as either a heterogeneous or a homogeneous population of organisms which eventually diverged into the three primary lineages. On the basis of their presence in eubacteria, archaea, and eukaryotes, approximately 325 proteins are proposed to have been present in the LUA (Kyrpides, Overbeek, and Ouzounis 1999
).
Orthologous proteins (homologs which arose through speciation) that are present in a group of modern day species may be inferred to have been present in their last common ancestor. To assemble a set of proteins that were likely present in the LUA, we used the Clusters of Orthologous Groups (COG) database (Tatusov, Koonin, and Lipman 1997
; Tatusov et al. 2001
) to select all orthologous proteins common to a group of eight species which included members from each of the primary lineages: two archaea, Archaeoglobus fulgidus and Methanobacterium thermoautotrophicum; one eukaryote, Saccharomyces cerevisiae; and five eubacteria, Aquifex aeolicus, Thermotoga maritima, Synechocystis PCC6803, Escherichia coli, and Bacillus subtilis. These species were chosen because their phylogenetic diversity was expected to maximize the extent and accuracy of ancestral sequence reconstruction. Two hundred fifteen protein families in the COG database are present in all eight species.
Inclusion in the reconstruction of paralogs (homologs arising through gene duplication) which arose before the speciation event separating the species would invalidate the assumed species phylogenetic tree. In the COG database, both orthologs and closely related paralogs are grouped together into protein families. To minimize the possibility of including incorrect paralogs in sequence reconstructions, COG protein families were included either if they had only one protein member per species for all species other than yeast or if the existing paralogs could clearly be determined to have arisen after species divergence. (The tree provided with each COG protein family was used to resolve all issues regarding the protein phylogenetic tree.) For many protein families, yeast has a paralog of mitochondrial origin in addition to the true ortholog. The yeast homolog which clusters with the archaeal species was assumed to represent the true ortholog and was chosen for the analysis. This restriction reduced the protein set to 105 families.
Because lateral transfer also invalidates the assumed species tree, it is important to minimize the number of laterally transferred proteins included in the set. Whole genome analyses have suggested that species phylogenetic trees are broadly consistent with the small subunit rRNA (SSU rRNA) tree (and the trees of the majority of informational proteins, i.e., proteins playing a role in replication, transcription, or translation [Fitz-Gibbon and House 1999
; Teichmann and Mitchison 1999
]). Therefore, we eliminated from our set all proteins whose trees were not consistent with the SSU rRNA tree (Olsen, Woese, and Overbeek 1994
). For this purpose, we applied the criterion that the yeast and archaeal proteins should cluster on the protein tree. After this restriction was applied, 65 proteins remained in the set. These proteins are primarily classified as informational, most playing a role in translation. The COG protein families included in the analysis are listed in table 1
.
|
|
The simulations were performed as follows: a set of n ancestral sequences of determined length was randomly generated according to a given P(i). The resulting ancestral amino acid composition is referred to as P(i)S. The ancestral sequences were evolved to produce descendant sequences based on a phylogenetic tree with unequal branch lengths reflecting average evolutionary distances within the protein set and a substitution probability matrix representing PAM 1 (Jones, Taylor, and Thornton 1992
). P'(i)S of the ancestral sequence set was derived using our method, and P(i)S - P'(i)S was determined. This difference is the error in P'(i)S.
The modeled sources of error were systematic and predictable. First, P'(conserved) is an underestimate of P(conserved) (values of 0.3425 and 0.4584, respectively). This underestimate can be explained by the fact that any sequence position in which the residue in at least one of the descendants has been conserved, but for which an ancestral residue was not assigned through MP, will lead to an undercount of P(conserved) (see fig. 1B ). The underestimate of P(conserved) causes P'(conserved|i)k to be chosen from a greater PAM distance k than would the true value of P(conserved) (the lower the conservation between sequences, the greater the evolutionary distance between them; see eq. (2a) ).
To understand how bias in P'(conserved) and P'(conserved|i) interact to result in bias in P'(i), it is useful to consider the ratio P'(conserved):P'(conserved|i) (see eq. (1b) ). With increasing PAM distance, the ratio P(conserved):P(conserved|i) for any amino acid i diverges from unity (fig. 2 ). This trend holds for PAM = 1 to 150, the range of distances relevant to our data set. For those amino acids for which P(conserved):P(conserved|i) < 1.0, such as tryptophan, the ratio decreases with increasing PAM, whereas for those amino acids for which P(conserved):P(conserved|i) > 1.0, such as alanine, the ratio increases (fig. 2 ). Therefore, as a consequence of the underestimate of P(conserved), it is observed that for those amino acids for which P(conserved):P(conserved|i) < 1.0, P'(conserved):P'(conserved|i) is an underestimate (P'(conserved):P'(conserved|i)/P(conserved):P(conserved|i) < 1.0; table 3 , columns 2 and 3; fig. 2 ). In contrast, for those amino acids for which P(conserved):P(conserved|i) > 1.0, this ratio is an overestimate (table 3 , columns 2 and 3; fig. 2 ).
|
|
Qualitatively, bias in P'(conserved):P'(conserved|i) and P'(i|conserved) counteract in equation (1b) . For those amino acids with a relatively low P(conserved|i), P'(i|conserved) is an underestimate, whereas P'(conserved):P'(conserved|i) is an overestimate (table 3 , columns 24). The reverse is true for those amino acids with a relatively high P(conserved|i). Quantitatively, on the other hand, bias in P'(i|conserved) and P'(conserved):P'(conserved|i) do not precisely offset, so there is net error in P'(i).
The predictable nature of the bias revealed in this way provided confidence that simulations could be used to model and correct for error in estimates based on real sequence data.
Placing a Confidence Interval on P(i) in the LUA
Simulations were performed with parameters chosen to model the LUA data set. To represent the ancestral sequence set, 65 proteins of 320 residues each (the average length of the analyzed sequences) were randomly generated using P'(i)LUA. To generate a phylogenetic tree representing the sequence set, a distance matrix containing average pairwise distances between the species for the entire LUA protein set was first determined using the program protdist of the PHYLIP package (Felsenstein 1993
). The program Fitch of the PHYLIP package was used to convert these distances into an unrooted phylogenetic tree with branch lengths. This tree was converted into a tree with its root between the eubacterial lineage and the lineage giving rise to the archaea and eukaryotes. After simulated evolution to create eight descendant sequences, 166,400 residues (a number similar to the 164,730 residues of the real data set) were available to derive P'(i|conserved)S, P'(conserved)S, P'(conserved|i)S, and thereby P'(i)S. P(i)S - P'(i)S, or the error in P'(i)S, was then determined for each simulation. One hundred simulations were performed to derive statistics (mean and standard deviation) of the error in P'(i)S. This number of simulations was sufficient for the values of these statistics to stabilize.
The mean and standard deviation of error in P'(i)S over 100 simulations (table 4 , columns 4 and 5) were used to place P(i)LUA within 95% confidence intervals. First, biases in P'(i)LUA were corrected by adding to them the mean simulated error (table 4 , column 6). Lower and upper bounds of 95% confidence intervals for P(i)LUA were determined by subtracting or adding, respectively, twice the standard deviation of error in P'(i)S to these corrected values (table 4 , columns 7 and 8). These confidence intervals are based on the assumption that the simulations accurately model error in P'(i)LUA.
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The nine amino acids which have decreased in frequency between the LUA and today (fig. 3A
) may thus be inferred to have been introduced into the code early. Most of these amino acids are among those presumed to have been most abundant in the prebiotic environment, as inferred through spark tube simulations (Miller 1953
, 1987
) and analysis of the Murchison meteorite (Kvenvolden et al. 1970
). In contrast, the eight amino acids which have increased in frequency between the LUA and today (fig. 3B
), and which are thus inferred to have been late additions to the code, include several of the most biosynthetically complex amino acids (for example, all three aromatic amino acids, which share a common, complex metabolic intermediate, are inferred to have been late additions); most of these are presumed either to have been nonexistent or of very low abundance in the prebiotic environment. Two of these, cysteine and tryptophan, are both conservatively estimated to have been less than half as frequent within this protein set in the LUA than today.
We emphasize that the validity of the inferences drawn in this study depend upon the reliability of the Jones, Taylor, and Thornton (1992)
substitution probabilities for modeling evolution over very long time periods and along all lineages. With the development of lineage-specific models of evolution, estimates of ancestral amino acid composition can be expected to improve. In the meantime, although there are undoubtedly limitations to using the matrices of Jones, Taylor, and Thornton (1992)
to model evolution since the LUA, we feel they provide the best available estimates of these substitution probabilities. It is noteworthy that previously, on the basis of an independent approach, cysteine, tyrosine, and phenylalanine were inferred to have been used less frequently in proteins of the LUA than today (Brooks and Fresco 2002
).
The inferences drawn here regarding the relatively early or late introduction of amino acids into the genetic code are generally consistent with earlier proposals which were based on the presumed presence or absence of various amino acids in the primordial environment (see for example Wong 1975
). However, for a few amino acids our assignment as early or late is not in keeping with earlier ideas. For example, histidine and asparagine, believed to have been absent in the prebiotic environment, are both inferred through the present work to have been added to the code relatively early, whereas glutamate, believed to have been present in the prebiotic environment, is inferred to have been added late (figs. 3 and 4
). Interestingly, each of these three amino acids share a block of four codons with a second amino acid: histidine with glutamine, asparagine with lysine, and glutamate with aspartate (fig. 4
). Codon capture, in which one amino acid loses some of its codons to another, is commonly proposed as a mechanism for introducing amino acids, especially later arriving ones, into the code (Crick 1968
; Wong 1975
). Consistent with codon capture, it is plausible that one amino acid was added to the four-codon block first and that this amino acid later gave up two of its codons to the second amino acid which now shares the block.
|
Existing ideas regarding the origin and evolution of the genetic code have been based largely on theoretical investigations and experiments involving oligonucleotide aptamer binding of amino acids (reviewed in Knight, Freeland, and Landweber 1999
). The present findings suggest that notwithstanding the impact of mutation over the long course of molecular evolution, with the aid of the appropriate analytical tools and insights, the sequences of contemporary proteins also provide an important avenue for exploring these early evolutionary events.
![]() |
Appendix |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
However, if P(i) is unknown, but P(conserved) and P(i|conserved) are known, one can substitute for P(i) as follows:
|
|
|
|
Thus, by choosing the value k which minimizes |log k|, we are finding the PAM distance k whose corresponding estimates of P(i) and P(conserved|i) minimize (2c).
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Abbreviations: MP, maximum parsimony; ML, maximum likelihood; LUA, Last Universal Ancestor; COG, Clusters of Orthologous Groups.
Keywords: amino acid composition
protein evolution
Last Universal Ancestor
substitution probability matrices
genetic code
Address for correspondence and reprints: Mona Singh, Department of Computer Science, Princeton University, Princeton, New Jersey 08544. mona{at}cs.princeton.edu
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Bairoch A., R. Apweiler, 2000 The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 Nucleic Acids Res 28:45-48
Brooks D. J., J. R. Fresco, 2002 Increased frequency of cysteine, tyrosine and phenylalanine residues since the Last Universal Ancestor Mol. Cell. Proteomics 1:125-131.
Crick F. H. C., 1968 The origin of the genetic code J. Mol. Biol 38:367-379[ISI][Medline]
Dayhoff M. O., R. M. Schwartz, B. C. Orcutt, 1978 Atlas of protein sequence and structure. Vol. 5 [Suppl. 3] National Biomedical Research Foundation, Washington, D.C
Eck R. V., M. O. Dayhoff, 1966 Atlas of protein sequence and structure National Biomedical Research Foundation, Silver Spring, Md
Felsenstein J., 1993 PHYLIP(Phylogeny Inference Package). Version 35c Distributed by the author, Department of Genetics, University of Washington, Seattle
Fitz-Gibbon S. T., C. H. House, 1999 Whole genome-based phylogenetic analysis of free-living microorganisms Nucleic Acids Res 27:4218-4222
Galtier N., N. Tourasse, M. Gouy, 1999 A nonhyperthermophilic common ancestor to extant life forms Science 283:220-221
Gerstein M., 1997 A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure J. Mol. Biol 274:562-576[ISI][Medline]
. 1998a. How representative are the known structures of the proteins in a complete genome? A comprehensive structural census Fold. Des 3:497-512[ISI][Medline]
. 1998b. Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census Proteins 33:518-534[ISI][Medline]
Gonnet G. H., M. A. Cohen, S. A. Benner, 1992 Exhaustive matching of the entire protein sequence database Science 256:1443-1445[ISI][Medline]
Gonnet G. H., M. Hallett, 2000 The Darwin manual ETH Zurich
Henikoff S., J. Henikoff, 1992 Amino acid substitution matrices from protein blocks Proc. Natl. Acad. Sci. USA 89:10915-10919[Abstract]
Jones D. T., W. M. Taylor, J. M. Thornton, 1992 The rapid generation of mutation data matrices from protein sequences Comput. Appl. Biosci 8:275-282[Abstract]
. 1994 A mutation data matrix for transmembrane proteins FEBS Lett 339:269-275[ISI][Medline]
Knight R. D., S. J. Freeland, L. F. Landweber, 1999 Selection, history and chemistry: the three faces of the genetic code Trends Biochem. Sci 24:421-427
Kvenvolden K., J. Lawless, E. Pering, E. Peterson, J. Flores, C. Ponnamperuma, I. R. Kaplan, C. Moore, 1970 Evidence for extraterrestrial amino-acids and hydrocarbons in the Murchison meteorite Nature 228:923.[ISI][Medline]
Kyrpides N., R. Overbeek, C. Ouzounis, 1999 Universal protein families and the functional content of the last universal common ancestor J. Mol. Evol 49:413-423[ISI][Medline]
Miller S. L., 1953 Production of amino acids under possible primitive earth conditions Science 117:528-529[ISI][Medline]
. 1987 Which organic compounds could have occurred on the prebiotic earth? Cold Spring Harbor Symp. Quant. Biol 52:17-27[ISI][Medline]
Olsen G. J., C. R. Woese, R. Overbeek, 1994 The winds of (evolutionary) change: breathing new life into microbiology J. Bacteriol 176:1-6[ISI][Medline]
Osawa S., T. H. Jukes, K. Watanabe, A. Muto, 1992 Recent evidence for evolution of the genetic code Microbiol. Rev 56:229-264[Abstract]
Saitou N., M. Nei, 1987 The neighbor-joining method: a new method for reconstructing phylogenetic trees Mol. Biol. Evol 4:406-425[Abstract]
Tatusov R. L., E. V. Koonin, D. J. Lipman, 1997 A genomic perspective on protein families Science 278:631-637
Tatusov R. L., D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, E. V. Koonin, 2001 The COG database: new developments in phylogenetic classification of proteins from complete genomes Nucleic Acids Res 29:22-28
Teichmann S. A., G. Mitchison, 1999 Is there a phylogenetic signal in prokaryote proteins? J. Mol. Evol 49:98-107[ISI][Medline]
Thompson J. D., D. G. Higgins, T. J. Gibson, 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res 22:4673-4680[Abstract]
Woese C. R., 1998 The universal ancestor Proc. Natl. Acad. Sci. USA 95:6854-6859
. 2000 Aminoacyl-tRNA synthetases, the genetic code, and the evolutionary process Microbiol. Mol. Biol. Rev 64:202-236
Wolf Y. I., N. V. Grishin, E. V. Koonin, 2000 Estimating the number of protein folds and families from complete genome data J. Mol. Biol 16:897-905
Wong J. T., 1975 A co-evolution theory of the genetic code Proc. Natl. Acad. Sci. USA 72:1909-1912[Abstract]
Yang Z., S. Kumar, M. Nei, 1995 A new method of inference of ancestral nucleotide and amino acid sequences Genetics 141:1641-1650.
Zhang Y., M. Nei, 1997 Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods J. Mol. Evol 44:S139-S146[ISI][Medline]