Department of Zoology, University of Cambridge, Cambridge, England
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: amino acid replacement general reversible model maximum likelihood protein evolution
Key Words: amino acid replacement general reversible model maximum likelihood protein evolution
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Dayhoff and colleagues (Dayhoff, Eck, and Park 1972
; Dayhoff, Schwartz, and Orcutt 1978
) used a parsimony-based counting method to generate accepted point mutation (PAM) matrices from the limited amount of protein sequence data available at the time. To achieve this, phylogenetic trees were estimated for multiple protein families, along with the ancestral sequences within those trees, using maximum parsimony (MP). This information was then used to estimate the relative rates of all amino acid replacements by simply counting both the inferred numbers of different amino acid replacements that occurred on all of the lineages of the trees and the numbers of occasions on which no change in amino acids was inferred.
Jones, Taylor, and Thornton (1992)
applied a faster, automated procedure based on Dayhoff and colleagues' (Dayhoff, Eck, and Park 1972
; Dayhoff, Schwartz, and Orcutt 1978
) approach and used it to produce a replacement matrix from a much larger database. After estimating the phylogenetic tree for each protein family in the database, their method selected a pair of sequences from a phylogeny that were nearest-neighbors and were >85% identical and counted the differences between them. The pair of sequences was then discarded to avoid recounting changes on any given branch of a phylogeny. This process was repeated for all such pairs of sequences in all protein families from their database. The 85% identity rule was used to reduce the number of multiple changes recorded as single replacements. Both of these approaches, which we refer to as counting methods, produce matrices of counts that may be used to estimate Markov process models of amino acid replacement (Swofford et al. 1996
; Liò and Goldman 1998
). The two models described above, the most widely used for the phylogenetic analysis of amino acid sequences of globular proteins, are both estimated using these counting methods and are known as the Dayhoff model (Dayhoff, Schwartz, and Orcutt 1978
) and the JTT model (Jones, Taylor, and Thornton 1992
).
The counting methods effectively employ MP to estimate amino acid replacement matrices and are therefore susceptible to its inherent problems. In particular, MP intrinsically assumes that for any given site in an alignment, only one change takes place along any single branch in a tree. This can lead to a serious underestimation of the true number of replacements that have occurred in branches where multiple changes have occurred and may consequently lead to systematic error in any model estimated using counts of replacements. In addition, MP inferences of ancestral sequences may introduce still further inaccuracies (Yang, Kumar, and Nei 1995
). The Dayhoff model may be affected by both of these problems. The 85% rule of the JTT method attempts to reduce the impact of these problems by reducing the expected number of multiple hits that are neglected. Without completely solving the problem, this also renders the method very wasteful because it discards all of the information available in the sequences that are <85% identical. The JTT method avoids making inferences of ancestral sequences, but at the further cost of using an inefficient method for avoiding the repeated counting on branches of phylogenetic trees, discarding many sequences which have >85% identity only with previously used sequences.
Adachi and Hasegawa (1996)
, Yang, Nielsen, and Hasegawa (1998)
, and Adachi et al. (2000)
used maximum-likelihood (ML) methods to estimate models of amino acid replacement for vertebrate mitochondrial, mammalian mitochondrial, and chloroplast sequences, respectively. For an alignment of sequences related by a single phylogenetic tree, the amino acid replacement matrix that gave the highest likelihood was found simultaneously with the optimal phylogeny and branch lengths. This ML approach avoids the problems associated with the counting methods by using all of the information available in the sequences across all levels of homology and by having a model that explicitly allows multiple changes to occur on a single branch at any site in an alignment. Unfortunately ML, while providing a more reliable estimate of a model of replacement than the counting methods, has a much greater computational burden associated with it. The time each individual likelihood calculation takes and the number of calculations required to numerically maximize the likelihood increase significantly with each sequence added to an analysis. This has meant that relatively few sequences, each consisting of a number of concatenated genes available for all of the organisms studied, have been included in previous analyses: Adachi and Hasegawa (1996)
analyzed 20 sequences, each of 3,357 residues; Yang, Nielsen, and Hasegawa (1998)
used 23 sequences of similar lengths; and Adachi et al. (2000)
used just 10 sequences, each of 9,957 residues. This may restrict the accuracy of the resulting models or the variety of proteins for which the models are subsequently found to be useful (see also P. Liò and N. Goldman, unpublished data
).
Here, we combine the best attributes of the likelihood and counting methods to estimate a model of amino acid replacement from a large database of different globular protein families using an approximation to ML. This model should provide a better estimate of the evolutionary process than existing models estimated using counting methods and be applicable to phylogenetic studies of a much broader range of protein sequences than existing models estimated using the likelihood approach.
![]() |
Models of Amino Acid Replacement |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
| (1) |
The i values represent the equilibrium or stationary frequencies of the 20 amino acids. These frequencies may all be set to 1/20 or may be set to the values estimated from the original data used to estimate the sij values. These applications are now relatively rare in phylogenetics (and are not used in this paper), and the
i are more typically estimated as being equal to the proportions of the amino acids as observed in a data set under phylogenetic analysis (Cao et al. 1994
). When the frequencies are estimated from the data in this way, model names are generally given the suffix "+F"; e.g., JTT+F would use sij as estimated by Jones, Taylor, and Thornton (1992)
and the
i observed in the data set under analysis. The 20 amino acid frequencies can be described by 19 free parameters because of the constraint
i
i = 1 and, in effect, weight sij according to sequences' amino acid compositions.
All of the standard models of evolution used in this paper have previously been well documented (e.g., Swofford et al. 1996
; Liò and Goldman 1998
). The simplest model of amino acid evolution is the equiprobable (EQU) model, which assumes that all the exchangeability parameters sij are equal and sets all of the stationary frequencies to 1/20. The EQU+F form of this model allows the stationary frequencies to equal the proportions of the different amino acids observed in the data. The Dayhoff (Dayhoff, Schwartz, and Orcutt 1978
) and JTT (Jones, Taylor, and Thornton 1992
) models, which in their Dayhoff+F and JTT+F forms will be compared with our new models, have values of sij which have been estimated from large databases using counting methods. The mitochondrial- and chloroplast-specific models of Adachi and Hasegawa (1996)
, Yang, Nielsen, and Hasegawa (1998)
, and Adachi et al. (2000)
, which are not appropriate for direct comparison with our new model because of their sequence-specificity, have each had their 189 free sij parameters estimated by direct likelihood maximization for relatively few protein sequences and using a single evolutionary tree.
Assumptions Needed to Estimate a Model Using Multiple Phylogenies
The simultaneous use of many different protein families implies that an estimated model may be applicable to a wide range of proteins (as are the Dayhoff and JTT models), and the use of the likelihood approach to perform this estimation suggests that it may more accurately reflect the evolutionary process by avoiding systematic error and utilizing more of the available data. In order to estimate an empirical model of amino acid replacement simultaneously from many families of sequences, we have developed a new approximation to the likelihood approach, exploiting two observations about the ML estimation of parameters on phylogenetic trees. First, it has been shown that parameters describing the evolutionary process remain relatively constant across near-optimal tree topologies (e.g., Yang, Goldman, and Friday 1994, 1995
; Sullivan, Holsinger, and Simon 1996
; Yang, Nielsen, and Hasegawa 1998
; Adachi et al. 2000
). We exploit this by assuming it to be the case for the parameters used to describe amino acid replacement, in particular, assuming that the relative values of the amino acid exchangeability parameters sij stay approximately constant over near-optimal branch lengths and tree topologies. The implication of this assumption is that so long as branch lengths are close enough to optimal when estimating the new model, any changes in the branch lengths observed when they are reestimated under the new model would not influence the model estimated to any great extent.
The second observation relates to changes in individual branch lengths that occur when performing ML estimation under two different models of evolution for a single-tree topology. When the two models are quite different, the ML branch lengths they give can be quite different (demonstrated by comparing the statistics shown in table 3
for either the EQU or the EQU+F model of evolution with those for either the Dayhoff+F or the JTT+F model). When two models are alike in their abilities to describe the evolutionary process, however, there is often much less difference in the branch lengths (e.g., Yang, Nielsen, and Hasegawa 1998
; also illustrated by comparing the statistics shown in table 3
for the EQU and EQU+F models or those for the Dayhoff+F and JTT+F models). We exploit the relatively small changes in branch lengths under "similarly good" models of evolution by assuming that the JTT+F model is capable of providing near-optimal branch lengths for the best general model of evolution (yet to be estimated).
|
Rather than completely fixing these estimates of the branch lengths during the estimation of the amino acid replacement model, only the ratios of branch lengths were fixed, and a scaling factor was introduced which allowed all branch lengths to increase or decrease linearly. This parameter makes some allowance for any unforeseen changes in branch lengths between the JTT+F model and the new model being estimated, which could occur if the assumptions discussed above regarding branch lengths were invalid. As we assume that the families' topologies and relative branch lengths are now fixed at near-optimal values, the overall log-likelihood for the database can be calculated as
| (2) |
Application to Real Data
This method was used to estimate a general model of amino acid replacement from the BRKALN database of aligned protein sequence families (D. Jones, unpublished data). This database has previously been used to estimate amino acid replacement models specific to different protein secondary structures (e.g., Goldman, Thorne, and Jones 1996
), and we used 3,905 sequences split into 182 protein families, each containing no more than 100 sequences. The amino acid frequencies for the entire database are shown in table 1
.
|
Both the WAG and the WAG* models required only 190 parameters to be numerically optimized, compared with >7,500 (before even considering optimization over topologies) under the traditional likelihood approach. This optimization was still computationally slow: estimation of the WAG model took approximately 18 h on a Digital 600au Personal Workstation. To avoid local maxima, estimations of the WAG and WAG* models were each performed using two sets of starting values, those of the EQU and the JTT models; the same estimates were recovered in each case. We found it computationally impractical to estimate the stationary frequency parameters i by ML (Yang and Roberts 1995
) simultaneously with the estimation of the sij. Given the large size of the BRKALN database, we expect that any differences in the estimated
i would be small and that any consequent differences in the estimated sij would be insignificant.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
We also note that, even after making allowance for the estimation of 190 additional parameters, the improvement in likelihood of the WAG* or WAG model over the JTT model is greater than the improvement of the JTT model over the Dayhoff model. This suggests that the improvement achieved by estimating a model of evolution using our new method may be at least as great as the improvement obtained by using a larger database from which to estimate a model of evolution, which is the main detail in which the Dayhoff and JTT models differ.
When the WAG* and the WAG models are compared, neither appears clearly better than the other in examining the whole database of families. As expected, each performs best for the analysis conditions it was optimized for, with WAG giving a better likelihood when using a single set of stationary frequencies (+F option) and WAG* performing better for multiple sets of stationary frequencies (+mF). When the two models are compared using equivalent methods (both +F or both +mF) for calculating the stationary frequencies, their log-likelihood values are very similar (changing by approximately 0.01%), suggesting that there is little difference between the two models.
The branch length scaling factor was estimated as 1.027 during the generation of the WAG model. This suggests that the likelihood maximization procedure was not trying to change the branch lengths dramatically and that the approximations used were valid. While we note that this scaling factor may not detect nonlinear changes in branch lengths (i.e., changes not proportional to the original lengths), results obtained by the reestimation of the branch lengths and a subsequently reestimated model give no indication of this occurring (see below).
Performance of New Models on Specific Phylogenies
The new models' performance when estimating individual phylogenies is of more practical relevance than their performance when estimating a likelihood for an entire database. To give an example of the improvement in fit to the data that might be achieved with our new models, an alignment of 18 Lepidopteran sequences of the phosphoenolpyruvate carboxykinase protein (Friedlander et al. 1996
; Goldman, Thorne, and Jones 1998
) was chosen as a typical example of data used to perform a phylogenetic analysis. Some statistics of the ML trees under different replacement models are shown in table 3
. From these statistics, it is clear that the use of the WAG+F or WAG*+F model results in a considerably higher likelihood value than any of the other models. The difference between the two new models is very small. There is some change in the branch length statistics between JTT and the new models, although it does not appear large enough to invalidate the assumptions used to estimate the WAG and WAG* models. Note that the phosphoenolpyruvate carboxykinase family is not represented in the BRKALN database, and so there is no possibility of the WAG and WAG* models having any unfair advantage. In this example, the WAG and WAG* models of sequence evolution are superior, and, in general, we expect the use of the models giving the best fit to the observed data to lead to more accurate phylogenetic estimation. Even small changes in branch lengths may lead to changes in optimal ML tree topology, resulting in an estimated tree being closer to the true evolutionary tree.
In order to demonstrate this improvement in performance for the whole BRKALN database, ML values were calculated for each protein family under the JTT+F, WAG+F, and WAG*+F models, this time fixing the evolutionary models and tree topologies but reestimating the branch lengths for each family's phylogeny. Figure 1A and B
shows that the majority of protein families (146 out of 182, or 80%) had higher likelihoods when analyzed under either the WAG+F or the WAG*+F model than under the JTT+F model. The protein families whose likelihoods were higher under the JTT+F model were examined in more detail and compared with the families whose likelihoods were higher under the WAG+F and WAG*+F models to see if there was any common feature defining which model was preferred. It was found that families for which the JTT+F model gave a higher likelihood had relatively shorter average branch lengths than those families for which the new models gave higher likelihoods (data not shown). This may be the result of Jones, Taylor, and Thornton's (1992)
counting method of replacement matrix estimation using only sequences that are >85% identical to estimate the JTT model and perhaps consequently overfitting the model to relatively short evolutionary distances.
|
Comparison of the Structure of the New Models with Those of Other Commonly Used Models
A comparison of the differences in the patterns of amino acid replacement between the empirically derived Dayhoff, JTT, WAG, and WAG* models of evolution is shown in figure 2
. From these graphs, it is clear that the overall structures of the four models are similar, which suggests that they are all modeling the same process. Closer examination shows no discernible pattern to the differences in the values of the parameters sij of the amino acid replacement matrices of the JTT and WAG models; this is illustrated in figure 3
. There is almost no difference between the values in the replacement matrices of the WAG and WAG* models. The exchangeability parameters sij defining the WAG and WAG* models are available via http://www.zoo.cam.ac.uk/zoostaff/goldman/WAG.
|
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The newly estimated WAG and WAG* models both gave significantly higher likelihoods than any other commonly used models when used to assess phylogenies for all 182 protein families of the BRKALN database simultaneously and for a large majority of the individual phylogenies examined. It was unclear which of the two new models performed better, but we would tentatively suggest that in this case the methodology used to estimate the WAG model was preferable because it involved fewer parameters being estimated from the amino acid sequence database. We note some change in the estimated branch lengths under the new models of evolution. The better statistical fit of our new models to the data suggests that in many cases they may provide more accurate estimates of phylogenetic trees than existing models, although differences in branch length estimates do not appear so great as to invalidate the assumptions used in our method for estimating models.
We hope that the WAG and WAG* models of amino acid replacement will be of value in phylogenetic analyses of amino acid sequences as potentially superior alternatives to the Dayhoff and JTT models. Both our new methodology and models produced using it may have further applications outside of phylogenetics, in fields that rely on accurate descriptions of amino acid replacement, such as protein structure prediction, the detection of sequence homology (including database searching), and sequence alignment.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
2 Address for correspondence and reprints: Simon Whelan, Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, United Kingdom. s.whelan{at}zoo.cam.ac.uk
![]() |
literature cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Adachi, J., and M. Hasegawa. 1996. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42:459468.[ISI][Medline]
Adachi, J., P. J. Waddell, W. Martin, and M. Hasegawa. 2000. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J. Mol. Evol. 50:348358.[ISI][Medline]
Cao, Y., J. Adachi, A. Janke, S. Pääbo, and M. Hasegawa. 1994. Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene. J. Mol. Evol. 39:519527.[ISI][Medline]
Dayhoff, M. O., R. V. Eck, and C. M. Park. 1972. A model of evolutionary change in proteins. Pp. 8999 in M. O. Dayhoff, ed. Atlas of protein sequence and structure. Vol. 5. National Biomedical Research Foundation, Washington, D.C.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. Pp. 345352 in M. O. Dayhoff, ed. Atlas of protein sequence and structure. Vol. 5, suppl. 3. National Biomedical Research Foundation, Washington, D.C.
Felsenstein, J. 1995. PHYLIP (phylogenetic inference package). Version 3.57. Distributed by the author, Department of Genetics, University of Washington, Seattle.
Friedlander, T. P., J. C. Regier, C. Mitter, and D. L. Wagner. 1996. A nuclear gene for higher-level phylogeneticsphosphoenolpyruvate carboxykinase tracks Mesozoic-age divergences within Lepidoptera (Insecta). Mol. Biol. Evol. 13:594604.[Abstract]
Goldman, N., J. L. Thorne, and D. T. Jones. 1996. Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. J. Mol. Biol. 263:196208.[ISI][Medline]
. 1998. Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 149:445458.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. CABIOS 8:275282.
Lindgren, B. W. 1976. Statistical theory. 3rd edition. Macmillan, New York.
Liò, P., and N. Goldman. 1998. Models of molecular evolution and phylogeny. Genome Res. 8:12331244.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406425.[Abstract]
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964969.
Sullivan, J., K. E. Holsinger, and C. Simon. 1996. The effect of topology on estimation of among site rate variation. J. Mol. Evol. 42:308312.[ISI][Medline]
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass.
Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555556.
Yang, Z., N. Goldman, and A. Friday. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316324.[Abstract]
. 1995. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Syst. Biol. 44:384399.[ISI]
Yang, Z., S. Kumar, and M. Nei. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:16411650.
Yang, Z., R. Nielsen, and M. Hasegawa. 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15:16001611.
Yang, Z., and D. Roberts. 1995. On the use of nucleic acid sequences to infer early branchings in the tree of life. Mol. Biol. Evol. 12:451458.[Abstract]