Department of Zoology, Brigham Young University
Abstract |
---|
Introduction |
---|
Models of evolution are used in phylogenetic analyses to describe changes in character state, i.e., the rate of change from one nucleotide to another. The first model developed for molecular evolution was that of Jukes and Cantor (1969) (JC), who considered all possible changes among nucleotides to occur at equal rates. Other authors have suggested incorporating more realistic assumptions into these models (for a review of models, see Swofford et al. 1996; Liò and Goldman 1998). For example, base frequencies often differ among nucleotides and therefore may affect the rate of change from one nucleotide to another. Likewise, many genes show a bias toward transitions over transversions, again affecting the rate of change from one nucleotide to another. We can accommodate these differences in rates of change by adding distinct rate parameters. Ultimately, for a symmetrical change model without consideration of codon position, we can have 10 parameters: 6 rate parameters and 4 nucleotide frequency parameters (fig. 1). Of these 10 parameters, 8 can vary, since the nucleotide frequencies must add up to 1 and the rates are relative to a single change occurring with rate 1. Given a large number of parameters to choose from, we wish to optimize a model for our particular data set.
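As an illustration of the parameter count just described, the following minimal Python sketch assembles a general time-reversible rate matrix from 6 relative rates and 4 nucleotide frequencies. The function name `build_rate_matrix` and all numerical values are hypothetical and chosen only for illustration; they are not estimates from any HIV-1 data set.

```python
from itertools import combinations

BASES = "ACGT"

def build_rate_matrix(rates, freqs):
    """Return the 4x4 instantaneous rate matrix Q as nested lists.

    rates: dict mapping unordered base pairs such as "AC" to a relative rate
    freqs: dict mapping each base to its equilibrium frequency (must sum to 1)
    """
    if abs(sum(freqs.values()) - 1.0) > 1e-9:
        raise ValueError("nucleotide frequencies must sum to 1")
    q = [[0.0] * 4 for _ in range(4)]
    for i, j in combinations(range(4), 2):
        r = rates[BASES[i] + BASES[j]]
        # Symmetrical change: the same relative rate in both directions,
        # weighted by the frequency of the target nucleotide.
        q[i][j] = r * freqs[BASES[j]]
        q[j][i] = r * freqs[BASES[i]]
    for i in range(4):
        # The diagonal is still 0.0 here, so this is minus the off-diagonal sum.
        q[i][i] = -sum(q[i])
    return q

# Illustrative (made-up) values: G<->T is the reference change with rate 1.
example_rates = {"AC": 1.2, "AG": 4.0, "AT": 0.8, "CG": 1.1, "CT": 4.5, "GT": 1.0}
example_freqs = {"A": 0.35, "C": 0.18, "G": 0.23, "T": 0.24}

# 6 rates minus 1 reference rate, plus 4 frequencies minus 1 sum constraint.
free_parameters = (len(example_rates) - 1) + (len(example_freqs) - 1)

if __name__ == "__main__":
    for row in build_rate_matrix(example_rates, example_freqs):
        print(["%.3f" % value for value in row])
    print("free parameters:", free_parameters)
```

Setting all six rates to 1 and all four frequencies to 0.25 recovers the equal-rates JC case, which has no free substitution parameters.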
Nevertheless, models matter for more than phylogenetic reconstruction. Accurate estimation of genetic parameters from a DNA alignment may depend on the model of nucleotide substitution assumed. For example, when a simple model of evolution is used, sequence divergence, transition/transversion ratios, and branch lengths may be underestimated (Tamura 1992; Yang et al. 1994; Adachi and Hasegawa 1995; Yang, Goldman, and Friday 1995). Moreover, the use of correct models is also relevant for evolutionary hypothesis testing (e.g., molecular-clock likelihood ratio tests) (Zhang 1999).
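The underestimation of sequence divergence under a simple model can be illustrated with the standard pairwise distance formulas of Jukes and Cantor (1969) and Kimura (1980). The sketch below is only an illustration under assumed inputs: the proportions of transitional and transversional differences are made up, and the function names are hypothetical.

```python
import math

def jc69_distance(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def k80_distance(P, Q):
    """Kimura two-parameter distance from the proportions of
    transitional (P) and transversional (Q) differences."""
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))

# Made-up proportions with a strong transition bias.
P, Q = 0.20, 0.05
d_jc = jc69_distance(P + Q)   # ignores the transition bias
d_k80 = k80_distance(P, Q)    # accounts for it

if __name__ == "__main__":
    print("JC69 distance: %.4f" % d_jc)
    print("K80 distance:  %.4f" % d_k80)
```

With a transition bias in the data, the JC distance is smaller than the K80 distance. When transitions make up exactly one third of the differences (P = Q/2), the two formulas agree.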
The molecular-clock hypothesis, which states that the rate of evolution of a gene is approximately constant among different lineages (Zuckerkandl and Pauling 1965), can also be incorporated in a model of evolution. The assumption that HIV-1 follows a molecular clock is controversial. While some authors dispute the existence of a molecular clock (Coffin 1995; Holmes, Pybus, and Harvey 1999), other authors claim that the evolution of HIV-1 is clocklike (Gojobori, Moriyama, and Kimura 1990; Leitner and Albert 1999; Shankarappa et al. 1999). Although a molecular clock is not necessary for phylogeny estimation using neighbor joining or maximum likelihood, it becomes a relevant parameter for the study of the origin of HIV-1 (Korber, Theiler, and Wolinsky 1998; Korber et al. 2000).
It does not seem likely that there is a single best-fit model of evolution appropriate for any HIV-1 data set (Muse 1999). Different lineages, genes, or regions within HIV-1 may evolve at distinct rates. Different degrees of variability are observed for the same region depending on the hierarchical level of the comparisons, i.e., within or among individuals, or within or among subtypes. Consequently, model selection should be a common practice when estimating HIV-1 phylogenies. We suggest two different statistical approaches for model selection, hierarchical likelihood ratio tests (LRTs) and the Akaike information criterion, but other strategies can be used (Rzhetsky and Nei 1995). Computer simulation studies show that these methods for selecting the model of nucleotide substitution perform well and that they are not affected by the starting topology used to estimate the likelihood of the different models evaluated (Posada and Crandall 2001a). Moreover, the specific LRT hierarchy used in this study seems to perform slightly better than other possible orders of LRTs.
The aim of this study was to use statistical testing in order to establish the best-fit model of evolution for an array of different data sets representing different genes and taxonomic levels in HIV-1. By doing this, the fit of a molecular clock to HIV-1 data was also evaluated. We show how different HIV-1 data sets are better explained by different models of evolution (different from K80) and how the molecular clock is rejected for most HIV-1 data sets.
Materials and Methods |
---|
When the models compared are nested (the simple model is a special case of the complex model) and the simple model corresponds to fixing some parameters in the complex model to values inside the parameter space, the LRT statistic (twice the difference in maximized log likelihood between the complex and the simple model) is asymptotically distributed as $\chi^2$ with q degrees of freedom, where q is the difference in number of free parameters between the two models (Kendall and Stuart 1979). When the simple model corresponds to fixing one parameter at the boundary of its range in the complex model, a mixed $\chi^2$ (or $\bar{\chi}^2$) distribution, consisting of 50% $\chi^2_0$ and 50% $\chi^2_1$, should be used (Self and Liang 1987; Goldman and Whelan 2000; Ota et al. 2000). Once a model was chosen, an LRT for the molecular-clock hypothesis (Felsenstein 1981) was also performed, comparing the best-fit model with and without the molecular-clock restriction. The number of degrees of freedom for the molecular-clock LRT was n - 2, with n being the number of taxa.
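The tests described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the function names are hypothetical, the chi-square tail probability is computed with a standard recurrence for integer degrees of freedom so that only the standard library is needed, and the log likelihoods in the usage lines are made up.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable
    with integer degrees of freedom df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # ... and step up with Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def lrt_statistic(lnl_simple, lnl_complex):
    """Twice the difference in maximized log likelihood."""
    return 2.0 * (lnl_complex - lnl_simple)

def lrt_pvalue(stat, q, boundary=False):
    """P value for a nested comparison with q extra free parameters.

    With boundary=True, one parameter is fixed at the boundary of its range
    and a 50:50 mixture of chi-square(0) and chi-square(1) is used.
    """
    if boundary:
        if q != 1:
            raise ValueError("the mixture here covers a single boundary parameter")
        # chi-square(0) is a point mass at zero, so it adds nothing for stat > 0.
        return 1.0 if stat <= 0 else 0.5 * chi2_sf(stat, 1)
    return chi2_sf(stat, q)

def clock_lrt_df(n_taxa):
    """Degrees of freedom for the molecular-clock LRT: n - 2."""
    return n_taxa - 2

if __name__ == "__main__":
    # Made-up log likelihoods for a simple and a complex model.
    stat = lrt_statistic(-3440.9, -3436.2)
    print("interior parameter:", lrt_pvalue(stat, 1))
    print("boundary parameter:", lrt_pvalue(stat, 1, boundary=True))
    print("clock LRT df for 20 taxa:", clock_lrt_df(20))
```

For a positive test statistic, the boundary P value is half the standard one-degree-of-freedom P value, so using the plain chi-square distribution in that situation makes the test conservative.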
We also explored another approach to compare different models without the nesting requirement or the assumption of a $\chi^2$ distribution for statistical comparison, the Akaike (1974) information criterion (AIC). The AIC is a useful measure that rewards models for good fit (smaller values of AIC indicate better models) but imposes a penalty for unnecessary parameters (Hasegawa 1990a, 1990b; Hasegawa, Kishino, and Saitou 1991; Muse 1999). If L is the maximum value of the likelihood function for a specific model using p independently adjusted parameters within the model, then

$$\mathrm{AIC} = -2 \ln L + 2p.$$
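A short Python sketch of AIC-based selection follows, using the standard definition AIC = -2 ln L + 2p. The candidate log likelihoods are made-up numbers used only to show the arithmetic, and the parameter counts are the free substitution parameters of each model (branch lengths are assumed to be shared across models and are left out).

```python
def aic(lnl, p):
    """Akaike information criterion for a model with maximized
    log likelihood lnl and p independently adjusted parameters."""
    return -2.0 * lnl + 2.0 * p

# Hypothetical maximized log likelihoods and free-parameter counts.
candidates = {
    "JC": (-3512.4, 0),
    "K80": (-3440.9, 1),
    "HKY": (-3398.2, 4),
    "GTR": (-3395.0, 8),
}

scores = {name: aic(lnl, p) for name, (lnl, p) in candidates.items()}
best = min(scores, key=scores.get)

if __name__ == "__main__":
    for name in sorted(scores, key=scores.get):
        print("%-4s AIC = %.1f" % (name, scores[name]))
    print("best-fit model by AIC:", best)
```

In this made-up example the most parameter-rich model has the highest likelihood, but its small gain over the next model does not offset the penalty for four extra parameters, so the simpler of the two is preferred.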
Results |
---|
Discussion |
---|
HIV-1 Molecular Clock
The rejection of the molecular clock in most data sets is most easily explained by the absence of a molecular clock. Such rate variation among lineages seems very plausible in light of the different selective pressures exerted by the immune system and the repeated reduction in effective population sizes during infection that HIV lineages experience. In other cases, the molecular-clock hypothesis can be rejected even when the sequences are evolving in a clocklike fashion, because of the presence of recombination (Schierup and Hein 2000b), which is a frequent phenomenon in HIV (Robertson et al. 1995). This rejection of the clock is not a failure of the LRT, but rather the consequence of recombination violating the actual null hypothesis that the LRT of the molecular clock is testing: that the sequences are evolving in a clocklike fashion on one tree. In either case, the application of molecular-clock techniques in HIV-1 seems to be inappropriate because of the absence of a clock, the presence of more than one true tree due to recombination, or both.
The main study supporting a molecular clock in HIV-1 was by Leitner and Albert (1999), who suggested that the molecular clock explained the genetic variation in the p17 and V3 regions in a known transmission HIV-1 cluster. In arriving at this conclusion, Leitner and Albert used a regression analysis of genetic distance and time. However, molecular clocks should ideally be calibrated using independent lineages in the phylogeny. Calibration using pairwise differences among taxa within a group inflates the correlation between divergence and time because pairwise differences are based on shared proportions of the phylogeny and therefore are not independent (Hillis, Mable, and Moritz 1996). This lack of independence makes the regression analysis of genetic distance on time inadequate. Moreover, estimates of HIV-1 divergence rates vary depending on the region of the genome under study, alignment, amount of recombination, different selection pressures among individuals, and phylogenetic accuracy (Korber, Theiler, and Wolinsky 1998; Korber et al. 2000). We performed an LRT of the molecular-clock hypothesis on these data sets and strongly rejected the molecular clock for both (p17, P < 0.0001; V3, P < 0.0001) using the best-fit model for each data set. The P values for these tests even decreased when the true topology and the GTR+dG4 model were used. Furthermore, we also rejected the molecular clock for these same data when the different sampling times were taken into account (Rambaut 2000) (P < 0.0001) (see also Rambaut 1997). Computer simulation studies have shown that the LRT of the molecular clock performs quite well under a "reasonable" model of evolution and that it becomes a conservative test when the assumptions of the substitution model are not met (Yang, Goldman, and Friday 1995; Zhang 1999). In addition, the confounding factor of recombination is absent due to the known history of the sequences among individuals. Interestingly, the results from this one transmission case with a known history have been extrapolated across all of HIV diversity, with its obviously complex and recombinogenic history, to justify the assumption of a molecular clock (e.g., Korber et al. 2000).
Conclusions |
---|
The large number of DNA sequences available at the HIV-1 sequence database allows for an alternative approach to model selection. The parameters of a complex model could be estimated for each region or gene of the HIV-1 virus using all (or part) of the data available. Once these parameters were estimated, they could be used for subsequent analyses without the necessity of estimating them again (Hillis 1999). However, for this approach, we must be willing to assume that HIV-1 evolves under the same underlying model of DNA substitution at any hierarchical level. In addition, it is not clear if this general-model procedure would raise the accuracy of phylogenetic estimation for a smaller data set. We are currently investigating these questions.
Acknowledgements |
---|
Footnotes |
---|
1 Keywords: model selection, likelihood ratio test, Akaike information criterion, molecular clock, HIV-1
2 Address for correspondence and reprints: David Posada, 574 WIDB, Department of Zoology, Brigham Young University, Provo, Utah 84602-5255. dp47{at}email.byu.edu.
Literature Cited |
---|
Adachi, J., and M. Hasegawa. 1995. Improved dating of the human/chimpanzee separation in the mitochondrial DNA tree: heterogeneity among amino acid sites. J. Mol. Evol. 40:622-628.
Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19:716-723.
Bruno, W. J., and A. L. Halpern. 1999. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16:564-566.
Coffin, J. M. 1995. HIV population dynamics in vivo: implications for genetic variation, pathogenesis, and therapy. Science 267:483-489.
Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401-410.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.
Fukami-Kobayashi, K., and Y. Tateno. 1991. Robustness of maximum likelihood tree estimation against different patterns of base substitutions. J. Mol. Evol. 32:79-91.
Gojobori, T., E. N. Moriyama, and M. Kimura. 1990. Molecular clock of viral evolution, and the neutral theory. Proc. Natl. Acad. Sci. USA 87:10015-10018.
Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182-198.
Goldman, N., and S. Whelan. 2000. Statistical tests of gamma-distributed rate heterogeneity in models of sequence evolution in phylogenetics. Mol. Biol. Evol. 17:975-978.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736.
Hasegawa, M. 1990a. Mitochondrial DNA evolution in primates: transition rate has been extremely low in the lemur. J. Mol. Evol. 31:113-121.
Hasegawa, M. 1990b. Phylogeny and molecular evolution in primates. Jpn. J. Genet. 65:243-266.
Hasegawa, M., H. Kishino, and N. Saitou. 1991. On the maximum likelihood method in molecular phylogenetics. J. Mol. Evol. 32:443-445.
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.
Hillis, D. M. 1999. Phylogenetics and the study of HIV. Pp. 105-121 in K. A. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md.
Hillis, D. M., B. K. Mable, and C. Moritz. 1996. Applications of molecular systematics: the state of the field and a look to the future. Pp. 515-543 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass.
Hochberg, Y. 1988. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75:800-802.
Holmes, E. C., O. G. Pybus, and P. H. Harvey. 1999. The molecular population dynamics of HIV-1. Pp. 177-207 in K. A. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md.
Huelsenbeck, J. P., and K. A. Crandall. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28:437-466.
Huelsenbeck, J. P., and D. M. Hillis. 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247-264.
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21-132 in H. M. Munro, ed. Mammalian protein metabolism. Academic Press, N.Y.
Kelsey, C. R., K. A. Crandall, and A. F. Voevodin. 1999. Different models, different trees: the geographic origin of PTLV-I. Mol. Phylogenet. Evol. 13:336-347.
Kendall, M., and A. Stuart. 1979. The advanced theory of statistics. Charles Griffin, London.
Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.
Kimura, M. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78:454-458.
Korber, B., M. Muldoon, J. Theiler, F. Gao, R. Gupta, A. Lapedes, B. H. Hahn, S. Wolinsky, and T. Bhattacharya. 2000. Timing the ancestor of the HIV-1 pandemic strains. Science 288:1789-1796.
Korber, B., J. Theiler, and S. Wolinsky. 1998. Limitations of a molecular clock applied to the considerations of the origin of HIV-1. Science 280:1868-1871.
Leitner, T., and J. Albert. 1999. The molecular clock of HIV-1 unveiled through analysis of a known transmission history. Proc. Natl. Acad. Sci. USA 96:10752-10757.
Leitner, T., D. Escanilla, C. Franzen, M. Uhlen, and J. Albert. 1996. Accurate reconstruction of known HIV-1 transmission history by phylogenetic tree analysis. Proc. Natl. Acad. Sci. USA 93:10864-10869.
Leitner, T., and W. M. Fitch. 1999. The phylogenetics of known transmission histories. Pp. 315-345 in K. A. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md.
Leitner, T., S. Kumar, and J. Albert. 1997. Tempo and mode of nucleotide substitutions in gag and env gene fragments in human immunodeficiency virus type 1 populations with a known transmission history. J. Virol. 71:4761-4770.
Liò, P., and N. Goldman. 1998. Models of molecular evolution and phylogeny. Genome Res. 8:1233-1244.
Moriyama, E. N., Y. Ina, K. Ikeo, M. Shimizu, and T. Gojobori. 1991. Mutation pattern of human immunodeficiency virus gene. J. Mol. Evol. 32:360-363.
Muse, S. 1999. Modeling the molecular evolution of HIV sequences. Pp. 122-152 in K. A. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715-724.
Nei, M. 1987. Molecular evolutionary genetics. Columbia University Press, N.Y.
Ota, R., P. J. Waddell, M. Hasegawa, H. Shimodaira, and H. Kishino. 2000. Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol. Biol. Evol. 17:798-803.
Ou, C.-Y., C. A. Ciesielski, G. Myers et al. (18 co-authors). 1992. Molecular epidemiology of HIV transmission in a dental practice. Science 256:1165-1171.
Pedersen, A.-M. K., C. Wiuf, and F. B. Christiansen. 1998. A codon-based model designed to describe lentiviral evolution. Mol. Biol. Evol. 15:1069-1081.
Penny, D., P. J. Lockhart, M. A. Steel, and M. D. Hendy. 1994. The role of models in reconstructing evolutionary trees. Pp. 211-230 in R. W. Scotland, D. J. Siebert, and D. M. Williams, eds. Models in phylogenetic reconstruction. Clarendon Press, Oxford, England.
Posada, D., and K. A. Crandall. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14:817-818.
Posada, D., and K. A. Crandall. 2001a. Selecting the best-fit model of nucleotide substitution. Syst. Biol. (in press).
Posada, D., and K. A. Crandall. 2001b. Simple (wrong) models for complex trees: empirical bias. Mol. Biol. Evol. 18:271-275.
Posada, D., K. A. Crandall, and D. M. Hillis. 2000. Phylogenetics of HIV. Pp. 121-160 in A. G. Rodrigo and G. H. J. Learn, eds. Computational and evolutionary analysis of HIV molecular sequences. Kluwer, Norwell, Mass.
Rambaut, A. 1997. The inference of evolutionary and population dynamic processes from molecular phylogenies. University of Oxford, Oxford, England.
Rambaut, A. 2000. Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics 16:395-399.
Robertson, D. L., P. M. Sharp, F. E. McCutchan, and B. H. Hahn. 1995. Recombination in HIV-1. Nature 374:124-126.
Rodríguez, F., J. F. Oliver, A. Marín, and J. R. Medina. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol. 142:485-501.
Rzhetsky, A., and M. Nei. 1995. Tests of applicability of several substitution models for DNA sequence data. Mol. Biol. Evol. 12:131-151.
Rzhetsky, A., and T. Sitnikova. 1996. When is it safe to use an oversimplified substitution model in tree-making? Mol. Biol. Evol. 13:1255-1265.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.
Schierup, M. H., and J. Hein. 2000a. Consequences of recombination on traditional phylogenetic analysis. Genetics 156:879-891.
Schierup, M. H., and J. Hein. 2000b. Recombination and the molecular clock. Mol. Biol. Evol. 17:1578-1579.
Self, S. G., and K.-Y. Liang. 1987. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. 82:605-610.
Shankarappa, R., J. B. Margolick, S. J. Gange et al. (12 co-authors). 1999. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J. Virol. 73:10489-10502.
Sullivan, J., and D. L. Swofford. 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenies. J. Mamm. Evol. 4:77-86.
Swofford, D. L. 1998. PAUP*: phylogenetic analysis using parsimony (* and other methods). Version 4.0 beta. Sinauer, Sunderland, Mass.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407-514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass.
Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Mol. Biol. Evol. 9:678-687.
Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512-526.
Thompson, J. D., T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins. 1997. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25:4876-4882.
Van de Peer, Y., W. Janssens, L. Heyndrickx, K. Fransen, G. van der Groen, and R. De Wachter. 1996. Phylogenetic analysis of the env gene of HIV-1 isolates taking into account individual nucleotide substitution rates. AIDS 10:1485-1494.
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314.
Yang, Z. 1996. Among-site rate variation and its impact on phylogenetic analysis. Trends Ecol. Evol. 11:367-372.
Yang, Z. 1997. How often do wrong models produce better phylogenies? Mol. Biol. Evol. 14:105-108.
Yang, Z., N. Goldman, and A. Friday. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316-324.
Yang, Z., N. Goldman, and A. Friday. 1995. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Syst. Biol. 44:384-399.
Yang, Z., R. Nielsen, N. Goldman, and A.-M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.
Zhang, J. 1999. Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models. Mol. Biol. Evol. 16:868-875.
Zharkikh, A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39:315-329.
Zuckerkandl, E., and L. Pauling. 1965. Evolutionary divergence and convergence in proteins. Pp. 97-166 in V. Bryson and H. J. Vogel, eds. Evolving genes and proteins. Academic Press, N.Y.