Department of Zoology, University of Cambridge, U.K
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
We may also expect more realistic models to give more reliable inference of biological factors they share with simpler models, for example phylogenetic topologies and branch lengths. This improvement could arise from their improved ability to account for factors that simpler models neglect and whose influence on observed data might otherwise be misinterpreted. This expectation has generally been confirmed by both theoretical results from simulation studies (e.g., Kuhner and Felsenstein 1994
; Zhang 1999
) and empirical results (e.g., Cunningham, Zhu, and Hillis 1998
; Takezaki and Gojobori 1999
; Philippe and Germot 2000
; Yoder and Yang 2000
), although Posada and Crandall (2001)
describe an interesting exception.
Much progress has recently been made with improving models of sequence evolution (reviewed, e.g., by Liò and Goldman 1998
; Thorne 2000
; Whelan, Liò, and Goldman 2001
). This work relies on statistical tests that can compare competing models and diagnose their shortcomings (Felsenstein 1981
; Goldman 1993a,
1993b;
Yang, Goldman, and Friday 1994
; Whelan and Goldman 1999
; Goldman and Whelan 2000
; Ota et al. 2000
). Amino acid replacement models have been improved by the use of maximum likelihood (ML) methods to infer replacement rate parameters (Adachi and Hasegawa 1996
; Yang, Nielsen, and Hasegawa 1998
; Adachi et al. 2000
; Whelan and Goldman 2001
). Both amino acid replacement and DNA substitution models have been improved by the incorporation of amongst-site heterogeneity of evolutionary rate (Yang 1994a,
1996
). Codon-level models can now test for effects of selection on sequence evolution and permit detection of individual sites where selection acts (Nielsen and Yang 1998
; Yang et al. 2000
; Anisimova, Bielawski, and Yang 2001
).
The expected frequencies of nucleotides, amino acids, or codons may be matched to those observed in the data under analysis. This generally leads to significantly improved models and has become widespread. For DNA models this has often led to the use of the FEL (Felsenstein 1981
) and HKY (Hasegawa, Kishino, and Yano 1985
) models in preference to the JC (Jukes and Cantor 1969
) and K2P (Kimura 1980
) models. For amino acid models, the same has been achieved by the "+F" method of Cao et al. (1994)
. For codon models, analogous methods were first described by Muse and Gaut (1994)
and Goldman and Yang (1994)
. All of these models achieve equality of the frequencies expected by the model and observed in the data by setting the instantaneous rates of evolutionary changes proportional to the observed frequency of the nucleotide, amino acid, or codon being changed to.
In this article, we describe a generalization of these approaches that allows evolutionary rates to be dependent on the frequency of the nucleotide, amino acid, or codon being replaced as well as on the frequency of the one that forms the replacement. This introduces just one additional parameter (or degree of freedom) to the model. We have applied the new method, denoted "+gwF," to models of DNA substitution and amino acid replacement and demonstrate that the new method gives a statistically significant improvement to the fit of models to data in a large proportion of the amino acid data sets in a database of protein family alignments. Our method is equally applicable to codon models, although we have not yet investigated its utility in this context.
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Typical applications of these models to phylogenetic inference from amino acid sequences combine the database estimates with information from the observed sequences using the +F method first described by Cao et al. (1994)
. This assumes that in the parameterization
ij =
j
ij the exchangeabilities
ij are fundamental properties of the amino acids (i, j). In contrast, the frequencies
j represent properties that depend on factors such as mutation bias, selective pressure (the relative advantage/disadvantage of different amino acids) and codon usage, which are assumed to differ from one protein sequence data set to another. As a consequence, the +F method proposes that during the analysis of a specific data set the fundamental properties described by the
ij derived from a database analysis remain unaltered, but that the
j, describing evolutionary pressures acting on the sequences, be replaced with a set of frequencies more appropriate to the data being analyzed. Denoting these frequencies
j, this gives a rate matrix defined by the off-diagonal elements
![]() |
A New Generalization of Amino Acid Replacement Models
The parameterization in equation (1)
and the +F method assume that the rate of change of amino acid i to amino acid j is proportional to j, the equilibrium frequency of j in the sequences under study, and to the exchangeability
ij. There seems no reason why this parameterization, and its implicit assumption of the fundamental nature of the exchangeabilities
ij, should be the only one considered. Other formulations exist that combine information derived from database analyses (embodied in estimates
) with observed amino acid frequencies in a manner that equates the expected and observed frequencies: for example, making the rate of change of i to j inversely proportional to
i. In the former case, a low frequency of amino acid l (small
l) is explained by there being only low rates of change to (or fixation of) l (qil
l), perhaps reflecting a selective disadvantage of l. In the latter case it is explained by there being rapid mutation away from l (qlj
1/
l), perhaps representing a high mutability of l. Both these factors seem biologically plausible.
We generalize the formulations given by Cao et al. (1994)
and above by proposing the parameterization
|
|
|
We propose the suffix "+gwF" (generalized weighted Frequencies) to indicate use of the model of equation (3)
. The +gwF formulation requires the estimation of amino acid frequencies for the observed data (j), as do +F models, and it includes one extra free parameter (f) that may generally take any value (but see below) and may be estimated by ML.
Occasionally, sequence alignments may be observed in which one or more amino acids do not appear; that is, z = 0 for some amino acid z. (We assume that this is never the case for the
j because any reasonable database is expected to contain each amino acid at least once.) This causes no problem when f < 0, but equation (3)
generates infinite rates of replacement of z (qzj) when f
0. Note, however, that the proportion of all replacements occurring that are from i to z is given by
iqiz = (
i
z)1-f
fi
f-1z
iz. This equals 0 when 0
f
1: amino acid z never arises, and so the infinite values of qzj suggested by equation (3) can be replaced with arbitrary finite values without altering the probabilities P(t) necessary for likelihood calculations. If f > 1, amino acid z both arises and is replaced at an infinite rate: the new model is then meaningless in the context of sequence evolution. In effect, f must be
1 when some observed amino acid frequencies equal zero.
When f = 0, equation (4)
reduces to equation (1)
and the +F method results. From equation (3)
above, the dependency of the model on the observed data is then of the form qij
j/
j and data sets for which the estimate of f (denoted
) is near 0 may be considered to have replacements rates largely determined by factors dependent on the resulting residue j (e.g., mutation bias or selection acting on j). If f = 1 we have qij
i/
i, and data sets with
near 1 may have replacement rates largely determined by factors dependent on the replaced residue i (e.g., its mutability). Other values of
indicate the relative strength of these two types of factors. Intermediate values indicate a more even combination of both of these effects (e.g., f = 1/2 leads to qij
(
j/
j)1/2(
i/
i)1/2). More extreme values indicate stronger tendencies for factors dependent on the resulting (
< 0) or replaced (
> 1) residue to determine replacements. The dependence of equation (3)
on the terms (
j/
j)1-f and (
i/
i)f means that the most extreme values of f lead to replacements being dominated by interchanges between the amino acids that are commonest (when f << 0) or rarest (when f >> 1) in the data set under analysis relative to the database values. But we have observed no data sets where such very extreme estimates
were found (see Results and Discussion, below).
If the sets of observed and database amino acid frequencies are equal (i.e., j =
j for all j) then the +gwF model has no dependency on the parameter f (see eq. 3
). Although this is improbable, there will be an effect of decreased dependency on f when the
j are close to the
j. We discuss some consequences of this effect below.
Application of Generalized Amino Acid Replacement Models
Application of the new models of amino acid replacement is via standard ML methods, as described by Swofford et al. (1996)
. No closed form solution of P(t) = exp(tQ) is generally available for equation (3)
, and so spectral decomposition must be used (Liò and Goldman 1998
). The amino acid frequencies
j and the
or exchangeability parameters
ij for the standard Dayhoff, JTT, and WAG models are widely available. We estimate the amino acid frequencies
j for each data set analyzed by counting the residues of the sequences. The parameter f of +gwF models is estimated by ML simultaneously with any other free parameters (branch lengths, etc.).
Our main aim here is to assess the utility of +gwF models, measured by any improvement to the statistical fit of models to observed data. We compare the +F and +gwF versions of the existing JTT and WAG models, each considered both without and with the addition of a discrete (four category) distribution (Yang 1994a,
1996
; denoted "+
") to model heterogeneity of evolutionary rates amongst amino acid sites. This leads to a total of eight analyses for each aligned sequence family studied, with the most important comparisons being the following four:
Likelihood ratio tests (LRTs) can be performed to discover whether the additional parameter makes a statistically significant improvement to the fit of the model to observed data for these comparisons (Yang, Goldman, and Friday 1994
; Huelsenbeck and Rannala 1997
). As noted above, equation (1)
is recovered from equation (4)
when f = 0. This indicates that +F models are nested within the new +gwF models and differ by one degree of freedom. The test statistic of the difference in maximized log-likelihoods is compared with a
21/2 distribution (Yang, Goldman, and Friday 1994
; Whelan and Goldman 1999
) and a value in excess of the 95%-point of this distribution (
21,95%/2 = 1.92) indicates the rejection of the simpler (+F) model in favor of the more complex (+gwF) model.
An asymptotically equivalent test of the significance of the +gwF model is to estimate a confidence interval (CI) for the parameter f and see if that CI excludes 0, the value of f at which the +gwF model reduces to the +F form. The required CI can be estimated by the curvature method (Kendall and Stuart 1979
, pp. 4549; Yang, Goldman, and Friday 1995
).
A New Generalization of Nucleotide Substitution Models
For models of nucleotide substitution, standard parameters ij or
j and
ij have not been derived from large databases. Instead,
ij have often been taken to equal 1 for all substitutions, or else taken to equal
(an unknown parameter, often to be estimated from observed data) when the substitution i to j is a transition (A
G, C
T) and 1 otherwise (transversions A, G
C, T). These exchangeability parameters have been used either without consideration of observed
j (effectively setting
j = 1/4 for all j), as in the models JC (Jukes and Cantor 1969
) and K2P (Kimura 1980
), or with the
j estimated from the observed sequences, as in the models FEL (Felsenstein 1981
) and HKY (Hasegawa, Kishino, and Yano 1985
). Following the notation used for amino acid models, we note that the nucleotide models FEL and HKY might equally be denoted JC+F and K2P+F, respectively.
Development of +gwF models for nucleotide substitution follows the same arguments as for amino acid replacement models. In the absence of nucleotide frequency values from databases, we set j = 1/4 for all j. Because this is a constant and all rates qij are ultimately scaled so that the mean rate of evolution is 1, for clarity we can omit the
j and the scaling factor from our equations. Using
ij = 1 for all nucleotides i and j we get a generalized version of the JC model, denoted JC+gwF, defined by
|
|
Utility of the new models in terms of any improved fit to observed data may be assessed via the following comparisons:
Data Used for Investigating New Models
To investigate the performance of the new +gwF models of amino acid replacement, we analyzed multiple sequence alignments from the BRKALN database (D. Jones, personal communication). Our working version of this database contains 182 aligned protein families, totaling 3,905 sequences. These data have previously been used to estimate the WAG model of amino acid replacement (Whelan and Goldman 2001
) and amino acid replacement models specific to different protein secondary structures (Thorne, Goldman, and Jones 1996
; Goldman, Thorne, and Jones 1998
). For each alignment in our database only one, previously established, topology was considered; see Whelan and Goldman (2001)
for details of how these topologies were derived. Although there are likely to be some suboptimal topologies amongst those used, we expect that on the whole the topologies are optimal or near optimal and no systematic errors or biassing of results will occur (see also Whelan and Goldman 2001
).
Investigation of the performance of +gwF models of nucleotide substitution used two data sets: the well-known 895-bp alignment of mtDNA sequences of five primates (Brown et al. 1982
), and an alignment of 2000 bp from the gag and pol genes of six isolates of HIV-1 (Goldman, Anderson, and Rodrigo 2000
).
![]() |
Results and Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
|
We also note that in comparison G, 142 out of the 182 alignments (78.0%) lead to a model based on the WAG standard model being preferred, reinforcing evidence suggesting that WAG is a useful alternative to JTT (Whelan and Goldman 2001
).
Figure 2
shows the relationship between the improvements in log-likelihood (log L) and
for comparison B of table 1
, and the distribution of
over 182 sequence alignments under the JTT+gwF+
model. This plot is typical of those obtained for any of the comparisons and models described in this article. Recall that when
= 0, +gwF reduces to +F and
log L = 0 necessarily. In figure 2 the distinction is made between cases when the +gwF model is statistically preferred to +F according to the LRT, and cases where it is not (i.e., whether or not
log L > 1.92). Over the 182 alignments studied, and for whichever +gwF model is used, we find that the statistically significant inferred values of
are in the main distributed between 0.25 and 1.5, with a peak around 0.6 (e.g., under the best +gwF model as selected by scenario G above, the distribution of values of
significantly different from 0 has mean = 0.69, median = 0.64, standard deviation (SD) 0.42, interquartile range = 0.26).
|
As noted above, if the observed j values are close to the database values
j then the dependency of +gwF models on f is reduced. Measuring the similarity of the
j and
j by the G statistic of the LRT test of the goodness-of-fit of the
j to the
j (Sokal and Rohlf 1994
, p. 690; G = 0 when
j =
j for all j, and increasing values of G indicate decreasing similarity), we find no correlations between G and estimates
but negative correlations between G and the widths of CIs for f (results not shown). This demonstrates that similarity of the
j and
j has no biassing effect on estimates of f, but reduces the information available about f (e.g., making it harder to reject the null hypothesis f = 0).
Behavior of +gwF Models
The statistical tests described above indicate that +gwF models of amino acid sequence evolution give significant improvements in many cases. It is also interesting to see if the parameter f is "well-behaved." If so, this is a further indication that the new model is accurately reflecting a biological feature of sequence evolution and not simply improving the fit of model to data by the addition of a meaningless parameter to a poor model. We performed various analyses to investigate this.
Figure 3A
shows the correlation of estimates derived from the 182 alignments in our database under the WAG+gwF and WAG+gwF+
models and distinguishes the cases where neither, one, or both of these models give a significant improvement over their respective +F versions. The estimates
are generally approximately the same under the two models, and the correlation is high (for the 121 alignments for which WAG+gwF and WAG+gwF+
are significantly better than WAG+F and WAG+F+
, respectively, Spearman's rank correlation coefficient
= 0.87, P < 0.001). Figure 3B
illustrates the same effect in the comparison of the JTT+gwF and WAG+gwF models. Here, the estimates
are less closely related (
= 0.67, P < 0.001, for the 119 cases in which JTT+gwF and WAG+gwF are significantly better than JTT+F and WAG+F, respectively) and it appears that choice of base model has a greater effect on inferred values
than does inclusion of +
. Nevertheless, estimates of f are clearly not strongly affected by either the base model or other components of the model used.
|
|
Choice of base model has a greater effect on the improvement in likelihood afforded by the +gwF method. This is illustrated by comparing log L(JTT±gwF) and
log L(WAG±gwF), (fig. 4C
) and
log L(JTT±gwF +
) and
log L(WAG±gwF+
) (fig. 4D
). Note that the correlations, though lower, are still highly significant (both have
= 0.93, P < 0.001). These plots also show a tendency for points to fall below the 45° line, indicating that greater improvements in likelihoods tend to be obtained with the JTT base model than with WAG. This may be because of the superiority of WAG over JTT for the majority of alignments in our database (see above and Whelan and Goldman 2000), leading to a greater scope for +gwF to deliver improvements in likelihoods under JTT than under WAG.
We also investigated the effect the use of +gwF has on estimates of the parameter of the
distribution that describes heterogeneity of evolutionary rates amongst sequence sites. Figure 5A
illustrates this with a comparison of values
estimated under the WAG+F+
and WAG+gwF+
models. These are shown for the 133 alignments in our database for which the WAG+gwF+
model is significantly better than the WAG+gwF modelin other words, alignments for which there is reasonable evidence that +
is contributing to an improved fit of evolutionary model to observed data. We see a very high correlation of estimates
(
= 0.97, P < 0.001), with the estimates being almost equal under the two models for most data sets. This also confirms that the +gwF and +
components of evolutionary models are broadly independent, with no awkward interactions between them suggesting any weaknesses in their joint use.
|
Effects on Branch Length Estimates
Our primary interest so far has been to see if the new +gwF method is able to improve the fit of evolutionary model to observed data. To investigate the potential of +gwF for improving phylogenetic inferences, we studied its effect on inferred branch lengths. For almost all families in our database, branch length estimates were only very slightly affected by the addition of the +gwF method to the JTT or WAG base model (results not shown). The implications of this finding are discussed below.
Generalized Nucleotide Replacement Models
Table 2
shows the results of various LRTs involving the +gwF method for the mtDNA and HIV-1 data sets described above. Qualitatively equivalent results are obtained whether or not the models tested include the + option (i.e., top/bottom half of table 2
). For the mtDNA sequences, the +gwF models do not give a significant improvement over their corresponding +F versions. For the HIV-1 sequences, the K2P+gwF and K2P+gwF+
models do lead to significant improvements over the HKY and HKY+
substitution models, respectively. In both cases, however, the corresponding REV models are significantly better still.
Consequently, neither the mtDNA or HIV-1 data set gives an example of a case in which a +gwF model would be statistically the best description of the sequences' evolution. For this case to arise we need, for example, a data set where K2P+gwF+ is preferred to HKY+
and where REV+
is not significantly better than K2P+gwF+
.
Our use of j = 1/4 for all nucleotides j may be one of the reasons that the +gwF nucleotide substitution models do not perform well. Whereas for the amino acid models the effect of the database values
j is to incorporate information on an underlying distribution of amino acids common to all protein sequences, we have not yet included equivalent information into the nucleotide models. Perhaps in future the estimation of
j from large nucleotide databases, and indeed the (empirical) estimation of
to derive replacements for the commonly used (parametrically inspired) values of 1 or
for sij, will lead to greater success. Because the +gwF models of nucleotide substitution are subsumed by the REV model, which it is computationally feasible to optimize for most data sets, it is not clear that the +gwF models will ever offer any important advantage.
![]() |
Conclusions |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
All the analyses performed indicate that inference under +gwF amino acid models is well-behaved, with no undue interactions between the new parameter f and other parameters and having stability with respect to other variations in models. This suggests that the +gwF models are reacting to evolutionary information in aligned sequence data sets that is not detected by existing models. Various analyses of CIs for and likelihood scores show that f may be reliably estimated for a range of data sets representative of those used in typical phylogenetic studies.
In the expectation that improved evolutionary models may lead to improved understanding of the process of evolution, it seems justified to recommend that +gwF versions of the Dayhoff (Dayhoff, Schwartz, and Orcutt 1978
), JTT (Jones, Taylor, and Thornton 1992
) and WAG (Whelan and Goldman 2001
) models be considered when analyzing amino acid sequences. The same recommendation can also be extended to replacement models derived specifically for mitochondrial (Adachi and Hasegawa 1996
; Yang, Nielsen, and Hasegawa 1998
) and chloroplast (Adachi et al. 2000
) amino acid sequences. Given an inferred value of
for a particular data set we may now make an inference about the relative importance of selective or mutational forces acting on that protein, as described in the Methods section above.
We have not yet implemented the +gwF method within any codon models of sequence evolution (e.g., Goldman and Yang 1994
; Muse and Gaut 1994
; Nielsen and Yang 1998
; Yang et al. 2000
). These are effectively +F models, with instantaneous rates of codon substitutions typically being proportional to frequencies of replacing codons, estimated either (1) directly from the observed frequencies of each of the sense codons or (2) from the products of the observed frequencies of each nucleotide at each codon position. We also note that these models disallow instantaneous changes of more than one nucleotide within a codon, meaning that observed changes of two or more nucleotides within one codon have to be assumed to take place only via one or more intermediate codons. If these intermediates are rarely observed (in case (1) above), or consist of a combination of nucleotides rarely observed at those codon positions (in case (2) above), the observed changes are deemed virtually impossible by the model despite their perhaps common occurrence in a data set. This could lead to a poor fit of evolutionary model to observed data, misinterpretation of the evolutionary signal in a data set, and inaccurate phylogenetic inferences. These effects could be more pronounced in cases of unusual or extreme codon usage. In contrast, +gwF models could allow codon substitution rates to be inversely related to observed codon frequencies (cf. eq. 3 above with f > 0) and so could explain observed changes of two or three nucleotides within a codon by permitting the appearance and then rapid loss of the necessary but rarely observed intermediates. Future work will investigate the utility of codon models incorporating the +gwF method.
Because of the large scale of the analyses we have performed, only one topology could be considered for each amino acid sequence alignment in our database. Consequently, it has not been possible to study the effect the new models have on inferred topologies. The analysis we have reported of branch length estimates under different models does not lead us to expect systematic changes in branch length estimates or different inferred topologies. Until such indications are found, the new +gwF models seem to be contributing mostly to improved fit of models to data and thus to our understanding of the processes and patterns of amino acid sequence evolution.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Keywords: amino acid frequencies
Markov process models
model comparison
molecular phylogenetics
nucleotide frequencies
Address for correspondence and reprints: Nick Goldman, EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K. goldman{at}ebi.ac.uk
.
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Adachi J., M. Hasegawa, 1996 Model of amino acid substitution in proteins encoded by mitochondrial DNA J. Mol. Evol 42:459-468[ISI][Medline]
Adachi J., P. J. Waddell, W. Martin, M. Hasegawa, 2000 Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA J. Mol. Evol 50:348-358[ISI][Medline]
Anisimova M., J. P. Bielawski, Z. Yang, 2001 Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution Mol. Biol. Evol 18:1585-1592
Brown W. M., E. M. Prager, A. Wang, A. C. Wilson, 1982 Mitochondrial DNA sequences of primates: tempo and mode of evolution J. Mol. Evol 18:225-239.[ISI][Medline]
Cao Y., J. Adachi, A. Janke, S. Pääbo, M. Hasegawa, 1994 Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene J. Mol. Evol 39:519-527[ISI][Medline]
Cunningham C. W., H. Zhu, D. M. Hillis, 1998 Best-fit maximum-likelihood models for phylogenetic inference: empirical tests with known phylogenies Evolution 52:978-987[ISI]
Dayhoff M. O., R. M. Schwartz, B. C. Orcutt, 1978 A model of evolutionary change in proteins Pp. 345352 in M. O. Dayhoff, ed. Atlas of protein sequence and structure, Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, DC
Felsenstein J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach J. Mol. Evol 17:368-376[ISI][Medline]
Goldman N., 1993a. Simple diagnostic statistical tests of models for DNA substitution J. Mol. Evol 37:650-661[ISI][Medline]
. 1993b. Statistical tests of models of DNA substitution J. Mol. Evol 36:182-198[ISI][Medline]
Goldman N., J. P. Anderson, A. G. Rodrigo, 2000 Likelihood-based tests of topologies in phylogenetics Syst. Biol 49:652-670[ISI][Medline]
Goldman N., J. L. Thorne, D. T. Jones, 1998 Assessing the impact of secondary structure and solvent accessibility on protein evolution Genetics 149:445-458
Goldman N., S. Whelan, 2000 Statistical tests of gamma-distributed rate heterogeneity in models of sequence evolution in phylogenetics Mol. Biol. Evol 17:975-978
Goldman N., Z. Yang, 1994 A codon-based model of nucleotide substitution for protein-coding DNA sequences Mol. Biol. Evol 11:725-736
Hasegawa M., H. Kishino, T. Yano, 1985 Dating of the human-ape splitting by a molecular clock of mitochondrial DNA J. Mol. Evol 22:160-174[ISI][Medline]
Huelsenbeck J. P., B. Rannala, 1997 Phylogenetic methods come of age: testing hypotheses in an evolutionary context Science 276:227-232
Jones D. T., W. R. Taylor, J. M. Thornton, 1992 The rapid generation of mutation data matrices from protein sequences CABIOS 8:275-282[Abstract]
Jukes T. H., C. R. Cantor, 1969 Evolution of protein molecules Pp. 21132 in H. N. Munro, ed. Mammalian protein metabolism, Vol. 3. Academic Press, New York
Kendall M., A. Stuart, 1979 The advanced theory of statistics. 4th edition. Vol. 2 Charles Griffin, London
Kimura M., 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences J. Mol. Evol 16:111-120[ISI][Medline]
Kuhner M. K., J. Felsenstein, 1994 A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates Mol. Biol. Evol 11:459-468 (see also: Erratum. Mol. Biol. Evol. 12:525 [1995]) [Abstract]
Liò P., N. Goldman, 1998 Models of molecular evolution and phylogeny Genome Res 8:1233-1244
Muse S. V., B. S. Gaut, 1994 A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome Mol. Biol. Evol 11:715-724
Nielsen R., Z. Yang, 1998 Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene Genetics 148:929-936
Ota R., P. J. Waddell, M. Hasegawa, H. Shimodaira, H. Kishino, 2000 Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters Mol. Biol. Evol 17:798-804
Philippe H., A. Germot, 2000 Phylogeny of eukaryotes based on ribosomal RNA: long-branch attraction and models of sequence evolution Mol. Biol. Evol 17:830-834
Posada D., K. A. Crandall, 2001 Simple (wrong) models for complex trees: a case from the Retroviridae Mol. Biol. Evol 18:271-275
Sokal R. R., F. J. Rohlf, 1994 Biometry. 3rd edition W. H. Freeman and Co., New York
Swofford D. L., G. J. Olsen, P. J. Waddell, D. M. Hillis, 1996 Phylogenetic inference Pp. 407514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass
Takezaki N., T. Gojobori, 1999 Correct and incorrect vertebrate phylogenies obtained by the entire mitocohndrial DNA sequences Mol. Biol. Evol 16:590-601[Abstract]
Thorne J. L., 2000 Models of protein sequence evolution and their applications Curr. Opin. Genet. Dev 10:602-605[ISI][Medline]
Thorne J. L., N. Goldman, D. T. Jones, 1996 Combining protein evolution and secondary structure Mol. Biol. Evol 13:666-673[Abstract]
Whelan S., N. Goldman, 1999 Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics Mol. Biol. Evol 16:1292-1299
. 2001 A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach Mol. Biol. Evol 18:691-699
Whelan S., P. Liò, N. Goldman, 2001 Molecular phylogenetics: state of the art methods for looking into the past Trends Genet 17:262-272[ISI][Medline]
Yang Z., 1994a. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods J. Mol. Evol 39:306-314[ISI][Medline]
. 1994b. Estimating the pattern of nucleotide substitution J. Mol. Evol 39:105-111[ISI][Medline]
. 1996 Among-site rate variation and its impact on phylogenetic analysis TREE 11:367-372
Yang Z., J. P. Bielawski, 2000 Statistical methods for detecting molecular adaptation TREE 15:496-503[Medline]
Yang Z., N. Goldman, A. Friday, 1994 Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation Mol. Biol. Evol 11:316-324[Abstract]
. 1995 Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem Syst. Biol 44:384-399[ISI]
Yang Z., R. Nielsen, N. Goldman, A.-M. K. Pedersen, 2000 Codon-substitution models for heterogeneous selection pressure at amino acid sites Genetics 155:431-449
Yang Z., R. Nielsen, M. Hasegawa, 1998 Models of amino acid substitution and applications to mitochondrial protein evolution Mol. Biol. Evol 15:1600-1611
Yoder A. D., Z. Yang, 2000 Estimation of primate speciation dates using local molecular clocks Mol. Biol. Evol 17:1081-1090
Zhang J., 1999 Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models Mol. Biol. Evol 16:868-875[Abstract]