Selecting Models of Nucleotide Substitution: An Application to Human Immunodeficiency Virus 1 (HIV-1)

David Posada and Keith A. Crandall

Department of Zoology, Brigham Young University


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusions
 Acknowledgements
 literature cited
 
The blind use of models of nucleotide substitution in evolutionary analyses is a common practice in the viral community. Typically, a simple model of evolution like the Kimura two-parameter model is used for estimating genetic distances and phylogenies, either because other authors have used it or because it is the default in various phylogenetic packages. Using two statistical approaches to model fitting, hierarchical likelihood ratio tests and the Akaike information criterion, we show that different viral data sets are better explained by different models of evolution. We demonstrate our results with the analysis of HIV-1 sequences from a hierarchy of samples; sequences within individuals, individuals within subtypes, and subtypes within groups. We also examine results for three different gene regions: gag, pol, and env. The Kimura two-parameter model was not selected as the best-fit model for any of these data sets, despite its widespread use in phylogenetic analyses of HIV-1 sequences. Furthermore, the model complexity increased with increasing sequence divergence. Finally, the molecular-clock hypothesis was rejected in most of the data sets analyzed, throwing into question clock-based estimates of divergence times for HIV-1. The importance of models in evolutionary analyses and their repercussions on the derived conclusions are discussed.


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusions
 Acknowledgements
 literature cited
 
The use of phylogenetics in viral studies has increased dramatically in the last years. When estimating phylogenetic relationships among DNA sequences, the use of a model of nucleotide substitution—a model of evolution—is necessary. While maximum parsimony assumes a model of evolution in an implicit manner, distance methods and maximum likelihood explicitly estimate parameters according to the model of evolution specified (distance methods estimate only the substitution rate, while maximum likelihood estimates all the parameters of the model). The use of particular models of evolution without obvious justification is, unfortunately, an extended practice in the viral community. Even worse, ignorance about the model of evolution used in the analysis, or failing to report it, is also common in the viral literature (Leitner and Fitch 1999Citation ). The Kimura (1980)Citation two-parameter model (K80) has been extensively used for estimating viral phylogenies without justification. Many viral evolutionary studies have focused on the HIV-1 virus, and we used it here for a case study.

Models of evolution are used in phylogenetic analyses to describe changes in character state, i.e., the rate of change from one nucleotide to another. The first model developed for molecular evolution was that of Jukes and Cantor (1969)Citation (JC), who considered all possible changes among nucleotides to occur with equal rates. Other authors have suggested the incorporation of more realistic assumptions into these models (for a review of models, see Swofford et al. 1996Citation ; Liò and Goldman 1998Citation ). For example, base frequencies often differ among nucleotides and therefore may affect the rate of change from one nucleotide to another. Likewise, many genes show a bias in transitions over transversions, again affecting the rate of change from one nucleotide to another. We can incorporate these differences in rates of change by incorporating different rate parameters. Ultimately, for a symmetrical change model without consideration of codon position, we can have 10 parameters: 6 rate parameters and 4 nucleotide frequency parameters (fig. 1 ). Of these 10 parameters, 8 can vary, since the nucleotide frequencies must add up to 1 and the rates are relative to a single change occurring with rate 1. Given a large number of parameters to choose from, we wish to optimize a model for our particular data set.



View larger version (49K):
[in this window]
[in a new window]
 
Fig. 1.—Different rates of nucleotide substitution. All of the models compared here are symmetrical, so the rate of change from nucleotide i to nucleotide j is the same as the rate of change from nucleotide j to nucleotide i.

 
It seems intuitive that a simple model like K80 may not adequately represent the complexity of the nucleotide substitution process in human immunodeficiency virus 1 (HIV-1) (Moriyama et al. 1991Citation ; Leitner, Kumar, and Albert 1997Citation ; Muse 1999Citation ). One possible solution to model selection for constructing HIV-1 phylogenies could be the arbitrary use of complex (parameter-rich) models (e.g., Korber et al. 2000Citation ). However, this approach has several disadvantages. First, a large number of parameters need to be estimated, so the analyses become computationally difficult and require larger amounts of time. Second, the use of complex models increases the error with which each parameter is estimated. Ideally, we would like to incorporate as much complexity as needed in the estimation procedure. Indeed, this best-fit model of evolution can be chosen through rigorous statistical testing (Goldman 1993Citation ; Rzhetsky and Nei 1995Citation ; Huelsenbeck and Crandall 1997Citation ; Posada and Crandall 1998Citation ). The relevance of model selection becomes apparent when the use of one model of evolution or another changes the results of the analysis (Sullivan and Swofford 1997Citation ; Kelsey, Crandall, and Voevodin 1999Citation ). Phylogenetic methods may be less accurate (recover an incorrect tree more often) or may be inconsistent (converge to an incorrect tree with increased amounts of data) when the model of evolution assumed is incorrect (Felsenstein 1978Citation ; Huelsenbeck and Hillis 1993Citation ; Penny et al. 1994Citation ; Bruno and Halpern 1999Citation ; but see Rzhetsky and Sitnikova 1996Citation ; Yang 1997Citation ; Posada and Crandall 2001bCitation ). It has been shown that the use of adequate models of evolution improves the accuracy of HIV-1 phylogenetic inference (Leitner et al. 1996Citation ; Leitner, Kumar, and Albert 1997Citation ; Posada, Crandall, and Hillis 2000Citation ).

Nevertheless, the use of models is not important only for phylogenetic reconstruction. Accurate estimation of genetic parameters from a DNA alignment may depend on the model of nucleotide substitution assumed. For example, when a simple model of evolution is used, sequence divergence, transition/transversion ratios, and branch lengths may be underestimated (Tamura 1992Citation ; Yang et al. 1994Citation ; Adachi and Hasegawa 1995Citation ; Yang, Goldman, and Friday 1995Citation ). Moreover, the use of correct models is also relevant for evolutionary hypothesis testing (e.g., molecular-clock likelihood ratio tests) (Zhang 1999Citation ).

The molecular-clock hypothesis, which states that the rate of evolution of a gene is approximately constant among different lineages (Zuckerkandl and Pauling 1Citation 965), can also be incorporated in a model of evolution. The assumption that HIV-1 follows a molecular clock is controversial. While some authors dispute the existence of a molecular clock (Coffin 1995Citation ; Holmes, Pybus, and Harvey 1999Citation ), other authors claim that the evolution of HIV-1 is clocklike (Gojobori, Moriyama, and Kimura 1990Citation ; Leitner and Albert 1999Citation ; Shankarappa et al. 1999Citation ). Although a molecular clock is not necessary for phylogeny estimation using neighbor joining or maximum likelihood, it becomes a relevant parameter for the study of the origin of HIV-1 (Korber, Theiler, and Wolinsky 1998Citation ; Korber et al. 2000Citation )

It does not seem likely that there is a single best-fit model of evolution appropriate for any HIV-1 data set (Muse 1999Citation ). Different lineages, genes, or regions within HIV-1 may evolve at distinct rates. Different degrees of variability are observed for the same region depending on the hierarchical level of the comparisons, i.e., within or among individuals, or within or among subtypes. Consequently, model selection should be a common practice when estimating HIV-1 phylogenies. We suggest two different statistical approaches for model selection—hierarchical likelihood ratio tests (LRTs) and the Akaike information criterion—but other strategies can be used (Rzhetsky and Nei 1995Citation ). Computer simulation studies show that these methods for selecting the model of nucleotide substitution perform well and that they are not affected by the starting topology used to estimate the likelihood of the different models evaluated (Posada and Crandall 2001aCitation ). Moreover, the specific LRT hierarchy used in this study seems to perform slightly better than other possible orders of LRTs.

The aim of this study was to use statistical testing in order to establish the best-fit model of evolution for an array of different data sets representing different genes and taxonomic levels in HIV-1. By doing this, the fit of a molecular clock to HIV-1 data was also evaluated. We show how different HIV-1 data sets are better explained by different models of evolution (different from K80) and how the molecular clock is rejected for most HIV-1 data sets.


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusions
 Acknowledgements
 literature cited
 
HIV-1 Sequences and Alignment
Thirty-three DNA data sets were gathered from the HIV Sequence Database (http://hiv-web.lanl.gov/) to represent a reasonable range of nucleotide diversity, as well as different regions of the HIV-1 genome (table 1 ). Four different taxonomic levels of sequence variation were studied: within individuals, within subtypes, within groups, and within HIV-1. Three different genes, env, pol, and gag, were analyzed at each one of these levels. DNA sequences were obtained already aligned from the HIV Sequence Database, or they were aligned using ClustalX (Thompson et al. 1997Citation ). In either case, alignments were inspected and confirmed by eye. Regions of the alignment with ambiguous homology were excluded from the analysis. Final alignments are available in NEXUS format on request from the authors.


View this table:
[in this window]
[in a new window]
 
Table 1 Data Sets Analyzed in this Study

 
Model Selection
A neighbor-joining (NJ) tree (Saitou and Nei 1987Citation ) was estimated for each data set under the JC model. Parameter estimation is believed to be insensitive to tree topology (Yang, Goldman, and Friday 1995Citation ), and an NJ-JC tree is a good estimate of topology for estimating the parameters of the corresponding model (Posada and Crandall 2001aCitation ). Likelihood scores for 56 different models of evolution were calculated for the NJ-JC tree in PAUP* (Swofford 1998Citation ). These likelihood scores were then compared using the hierarchical likelihood ratio test approach ({eta}LRT) (Huelsenbeck and Crandall 1997Citation ) implemented in the program Modeltest, version 3.0 (Posada and Crandall 1998Citation ) (available at http://bioag.byu.edu/zoology/crandall_lab/modeltest.htm). LRTs are widely used to compare the relative fits of two different models to the data. The LRT statistic is

where L0 is the likelihood maximized under the null hypothesis (simple model) and L1 is the likelihood maximized under the alternative hypothesis (complex model).

When the models compared are nested (the simple model is a special case of the complex model) and the simple model corresponds to fixing some parameters in the complex model to values inside the parameter space, {delta} is asymptotically distributed as {chi}2 with q degrees of freedom, where q is the difference in number of free parameters between the two models (Kendall and Stuart 1979Citation ). When the simple model corresponds to fixing one parameter at the boundary of its range in the complex model, a mixed {chi}2 (or 2) distribution, consisting of 50% {chi}20 and 50% {chi}21, should be used (Self and Liang 1987Citation ; Goldman and Whelan 2000Citation ; Ota et al. 2000Citation ). Once a model was chosen, an LRT for the molecular-clock hypothesis (Felsenstein 1981Citation ) was also performed among the best-fit model with and without the molecular-clock restriction. The number of degrees of freedom for the molecular-clock LRT was n - 2, with n being the number of taxa.

We also explored another approach to compare different models without the nesting requirement or the assumption of a {chi}2 distribution for statistical comparison, the Akaike (1974)Citation information criterion (AIC). The AIC is a useful measure that rewards models for good fit (smaller values of AIC indicate better models) but imposes a penalty for unnecessary parameters (Hasegawa 1990a, 1990bCitation ; Hasegawa, Kishino, and Saitou 1991Citation ; Muse 1999Citation ). If L is the maximum value of the likelihood function for a specific model using p independently adjusted parameters within the model, then

The models of evolution compared in this study include parameters that describe (1) the nucleotide base frequencies, (2) the substitution rates among the four bases, (3) the rate distribution among sites (homogeneous or heterogeneous), and (4) the rate distribution among lineages (homogeneous [i.e., molecular clock] or heterogeneous). Different models assume that base frequencies are equal (fA = fC = fG = fT = 0.25) or allow them to vary freely with the only constraint being that they have to add up to 1. When reversibility of change is assumed, i.e., the probability of changing from nucleotide i to nucleotide j is the same as the probability of changing from nucleotide j to nucleotide i, there are six possible substitution rates among nucleotides (a, or rAC; b, or rAG; c, or rAT; d, or rCG; e, or rCT; f, or rGT) (fig. 1 ). Complex models allow these six rates to vary freely, while less complex models assume that some of them are equal to others, e.g., transition (rAG = rCT)/transversion (rAC = rAT = rCG = rCT) ratio, and simple models assume that all rates are equal. It is this substitution scheme that usually gives names to the models (JC, K80, F81, HKY, etc.; see fig. 1 ). On the other hand, the rates of evolution among different sites can be similar or very different. This heterogeneity of rate variation among sites can be incorporated by simply assuming that there is a proportion of invariable sites (p-inv), while the rest of the sites evolve at the same rate. A more detailed rate variation model assigns to each site a certain probability of belonging to a specific rate class or category. This set of probabilities is conveniently described by a discrete gamma distribution with four categories (dG4) (Yang 1994, 1996Citation ). When the shape parameter of the gamma distribution ({alpha}) is small, most of the sites evolve very slowly, but a few sites have moderate-to-fast rates. When {alpha} increases, most of the sites evolve at medium rates, and a few evolve at slow and fast rates. When {alpha} is infinity, all of the sites evolve at the same rate. Rates of evolution can also be different in different parts of the tree. These rates can be assumed equal by enforcing a molecular clock, or each branch can be allowed to have its own rate of evolution. Base frequencies, substitution rates, proportion of invariable sites, and {alpha} were estimated in PAUP* under each model on the initial NJ-JC tree. After including the molecular-clock hypothesis, there were 112 models of evolution compared in a hierarchical fashion for each data set (fig. 2 ). What we call the best-fit model of evolution is nothing more or less than the model selected among these possible alternative models.



View larger version (38K):
[in this window]
[in a new window]
 
Fig. 2.—Hierarchical likelihood ratio tests ({eta}LTRs). Hypotheses tested are indicated on the left. The acceptance or rejection of each LRT (by default when P < 0.01) determines the path. For clarity purposes, only the full paths corresponding to the K81 and F81 model families are indicated. Mean features of the models of evolution compared are summarized at the bottom. JC (Jukes and Cantor 1969Citation ); K80 (Kimura 1980Citation ); K81 (Kimura 1981Citation ); TrNef (TrN model with equal base frequencies); TIMef (TIM model with equal base frequencies); TVMef (TVM model with equal base frequencies); SYM (Zharkikh 1994Citation ); F81 (Felsenstein 1981Citation ); HKY (Hasegawa, Kishino, and Yano 1985Citation ); K81uf (K81 model with unequal base frequencies); TrN (Tamura and Nei 1993Citation ); TIM (transitional model: rAC = rGT != rAT = rCG != rAG != rGT); TVM (transversional model: rAC = rCT != rAG != rAT != rCG != rGT); GTR (Rodríguez et al. 1990Citation ). I = invariable sites; G = gamma distribution; c = molecular clock enforced

 

    Results
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusions
 Acknowledgements
 literature cited
 
Different data sets resulted in different best-fit models of nucleotide substitution. Furthermore, the commonly used K80 model of evolution was never the optimal model. The relative fits of different models changed for particular genes and regions and for different hierarchical levels. The models of nucleotide substitution selected by the {eta}LRT and AIC approaches increased in complexity—becoming more parameter-rich—in accordance with the increasing hierarchical level of nucleotide diversity (table 2 ). Levels of model complexity were similar across genes and regions, and most models incorporated rate heterogeneity among sites. There was a slight tendency for the AIC procedure to favor more complex models than the LRT strategy, but there was a general agreement on the models selected. Maximum-likelihood estimates of base frequencies were similar for different taxonomic levels and for different regions and genes (table 3 ). Adenine was the most common nucleotide, while guanine was the second most common base (except for the pol gene, where the second most common base was thymine). Maximum-likelihood estimates of substitution rates revealed a similar pattern among genes, where the most common change (after multiplying by the corresponding base frequency) was the A-to-G transition (table 4 ). The rest of the transitions were still more common than any transversion. Gamma shape estimates ({alpha}) were almost invariably <1, indicating that most of the sites evolve relatively very slowly but a few have faster rates (table 5 ). Rate heterogeneity among sites increased and the proportion of invariable sites (p-inv) decreased with increasing taxonomic level. Those patterns were generally similar for the three genes studied. After correcting the statistical significance for multiple tests, all of the LRTs of the molecular-clock hypothesis were significant at the 0.05 family level except for the envelope V3 region. Three additional tests were not significant at the 0.01 family level (table 6 ).


View this table:
[in this window]
[in a new window]
 
Table 2 Models of Evolution Selected by the Hierarchical Likelihood Ratio Test (HLRT) and Akaike Information Criterion (AIC) Strategies

 

View this table:
[in this window]
[in a new window]
 
Table 3 Maximum-Likelihood Estimates of Base Frequencies Under the Best-Fit Models Selected by the Hierarchical Likelihood Ratio Test Procedure

 

View this table:
[in this window]
[in a new window]
 
Table 4 Maximum-Likelihood Estimates of Nucleotide Substitution Rates Under the Best-Fit Models Selected by the Hierarchical Likelihood Ratio Test Strategy

 

View this table:
[in this window]
[in a new window]
 
Table 5 Maximum-Likelihood Estimates of the Gamma Shape Parameter ({{alpha}}) and the Proportion of Invariable Sites (p-inv) Under the Best-Fit Models Selected Using the Hierarchical Likelihood Ratio Test

 

View this table:
[in this window]
[in a new window]
 
Table 6 Likelihood Ratio Test (LRT) of the Molecular-Clock Hypothesis

 

    Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusions
 Acknowledgements
 literature cited
 
Models of HIV-1 Sequence Evolution
The fact that different HIV-1 data sets are better explained by different models of evolution suggests that blind model selection may confound inferences based on phylogenetic analyses of HIV-1. However, this does not necessarily mean that different data sets for the same genes evolved under different models of evolution. The different substitution patterns may simply reflect different evolutionary times. In fact, the complexity of the models correlates well with nucleotide diversity. The extensive rate heterogeneity observed among sites is easily explained by different selective regimes along the genome. Moreover, recombination in HIV-1 can inflate the amount of rate variation (Schierup and Hein 2000aCitation ). More complex models exist that could increase the fit to HIV-1 sequences, especially codon models including selection (Pedersen, Wiuf, and Christiansen 1998Citation ; Yang et al. 2000Citation ). In this paper, we have restricted ourselves to nucleotide substitution models currently implemented for phylogenetic estimation.

HIV-1 Molecular Clock
The rejection of the molecular clock in most data sets is most easily explained by the absence of a molecular clock. Such rate variation among lineages seems very plausible in light of the different selective pressures exerted by the immune system and the repeated reduction in effective population sizes during infection that HIV lineages experience. In other cases, the molecular-clock hypothesis can be rejected when the sequences are evolving in a clocklike fashion because of the presence of recombination (Schierup and Hein 2000bCitation ), which is a frequent phenomenon in HIV (Robertson et al. 1995Citation ). This rejection of the clock is not a failure of the LRT, but rather the consequence of recombination violating the actual null hypothesis that the LRT of the molecular clock is testing: that the sequences are evolving in a clocklike fashion on one tree. In either case, the application of molecular-clock techniques in HIV-1 seems to be inappropriate due to either the absence of a clock and/or the presence of more than one true tree because of recombination.

The main study supporting a molecular clock in HIV-1 was by Leitner and Albert (1999)Citation , who suggested that the molecular clock explained the genetic variation in the p17 and V3 regions in a known transmission HIV-1 cluster. In arriving at this conclusion, Leitner and Albert used a regression analysis of genetic distance and time. However, molecular clocks should ideally be calibrated using independent lineages in the phylogeny. Calibration using pairwise differences among taxa within a group inflates the correlation between divergence and time because pairwise differences are based on shared proportions of the phylogeny and therefore are not independent (Hillis, Mable, and Moritz 1996Citation ). This lack of independence makes the regression analysis of genetic distance on time inadequate. Moreover, estimates of HIV-1 divergence rates vary depending on the region of the genome under study, alignment, amount of recombination, different selection pressures among individuals, and phylogenetic accuracy (Korber, Theiler, and Wolinsky 1998Citation ; Korber et al. 2000Citation ). We performed an LRT of the molecular-clock hypothesis on these data sets and were able to strongly reject the molecular clock for both data sets (p17, P < 0.0001; V3, P < 0.0001) using the best-fit models for each data set. The P values for these tests even decreased when the true topology and the GTR+dG4 model were used. Furthermore, we also rejected the molecular clock for these same data when the different sampling times were taken into account (Rambaut 2000Citation ) (P < 0.0001) (see also Rambaut 1997Citation ). Computer simulation studies have shown that the LRT of the molecular clock performs quite well under a "reasonable" model of evolution and that it becomes a conservative test when the assumptions of the substitution model are not met (Yang, Goldman, and Friday 1995Citation ; Zhang 1999Citation ). In addition, the confounding factor of recombination is absent due to the known history of the sequences among individuals. Interestingly, the results from this one transmission case with a known history have been extrapolated across all of HIV diversity with an obviously complex and recombinogenic history to justify the assumption of a molecular clock (e.g., Korber et al. 2000Citation ).


    Conclusions
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusions
 Acknowledgements
 literature cited
 
Researchers of HIV-1 evolution should justify their choice of models of nucleotide substitution. Simple methods have been implemented in computer programs to facilitate the statistical justification of a particular model of substitution (Posada and Crandall 1998Citation ). Different models can change not only topologies, but also branch lengths and distance calculations. Indeed, nucleotide substitution parameters are more accurately estimated under the correct model (Fukami-Kobayashi and Tateno 1991Citation ; Yang, Goldman, and Friday 1994Citation ; Van de Peer et al. 1996Citation ; Leitner, Kumar, and Albert 1997Citation ). Bootstrap values may be dependent on the adequacy of the model for the data. Common approaches for detecting recombination in HIV-1 data sets will also be affected by model choice, since one's ability to partition effects of rate heterogeneity and recombination is critical. Both topology and branch length estimates are becoming increasingly important in HIV-1 transmission cases (Ou et al. 1992Citation ) and therefore must be estimated with justified models for the greatest accuracy. Although we have made a case with HIV-1 DNA sequences, results from this study apply to the use of models of DNA substitution in any evolutionary study. Current models of nucleotide substitution are not perfect (and they never will be), and parameters that are more realistic will be incorporated to describe the complexity of evolution. As models that are more complex are proposed (Goldman and Yang 1994Citation ; Muse and Gaut 1994Citation ; Pedersen, Wiuf, and Christiansen 1998Citation ), there is further reason to justify one's choice of models.

The large number of DNA sequences available at the HIV-1 sequence database allows for an alternative approach to model selection. The parameters of a complex model could be estimated for each region or gene of the HIV-1 virus using all (or part) of the data available. Once these parameters were estimated, they could be used for subsequent analyses without the necessity of estimating them again (Hillis 1999Citation ). However, for this approach, we must be willing to assume that HIV-1 evolves under the same underlying model of DNA substitution at any hierarchical level. In addition, it is not clear if this general-model procedure would raise the accuracy of phylogenetic estimation for a smaller data set. We are currently investigating these questions.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusions
 Acknowledgements
 literature cited
 
Thomas Leitner kindly provided several data sets. Eddie Holmes and Marcos Losada made helpful suggestions on the manuscript. We thank Rayna Clay for assistance in preliminary analyses of the HIV data. This work was supported by a BYU Graduate Studies Award (D.P.) and by NIH grant number RO1-HD34350 and the Alfred P. Sloan Foundation (K.A.C.).


    Footnotes
 
Stephen Palumbi, Reviewing Editor

1 Keywords: model selection likelihood ratio test Akaike information criterion molecular clock HIV-1 Back

2 Address for correspondence and reprints: David Posada, 574 WIDB, Department of Zoology, Brigham Young University, Provo, Utah 84602-5255. dp47{at}email.byu.edu . Back


    literature cited
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Conclusions
 Acknowledgements
 literature cited
 

    Adachi, J., and M. Hasegawa. 1995. Improved dating of the human/chimpanzee separation in the mitochondrial DNA tree: heterogeneity among amino acid sites. J. Mol. Evol. 40:622–628[ISI][Medline]

    Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19:716–723

    Bruno, W. J., and A. L. Halpern. 1999. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16:564–566[Free Full Text]

    Coffin, J. M. 1995. HIV population dynamics in vivo: implications for genetic variation, pathogenesis, and therapy. Science 267:483–489

    Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401–410[ISI]

    ———. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376[ISI][Medline]

    Fukami-Kobayashi, K., and Y. Tateno. 1991. Robustness of maximum likelihood tree estimation against different patterns of base substitutions. J. Mol. Evol. 32:79–91[ISI][Medline]

    Gojobori, T., E. N. Moriyama, and M. Kimura. 1990. Molecular clock of viral evolution, and the neutral theory. Proc. Natl. Acad. Sci. USA 87:10015–10018

    Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182–198[ISI][Medline]

    Goldman, N., and S. Whelan. 2000. Statistical tests of gamma-distributed rate heterogeneity in models of sequence evolution in phylogenetics. Mol. Biol. Evol. 17:975–978[Free Full Text]

    Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736[Abstract/Free Full Text]

    Hasegawa, M. 1990a. Mitochondrial DNA evolution in primates: transition rate has been extremely low in the lemur. J. Mol. Evol. 31:113–121

    ———. 1990b. Phylogeny and molecular evolution in primates. Jpn. J. Genet. 65:243–266

    Hasegawa, M., H. Kishino, and N. Saitou. 1991. On the maximum likelihood method in molecular phylogenetics. J. Mol. Evol. 32:443–445[ISI][Medline]

    Hasegawa, M., K. Kishino, and T. Yano. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174[ISI][Medline]

    Hillis, D. M. 1999. Phylogenetics and the study of HIV. Pp. 105–121 in K. A. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md

    Hillis, D. M., B. K. Mable, and C. Moritz. 1996. Applications of molecular systematics: the state of the field and a look to the future. Pp. 515–543 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass

    Hochberg, Y. 1988. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75:800–802

    Holmes, E. C., O. G. Pybus, and P. H. Harvey. 1999. The molecular population dynamics of HIV-1. Pp. 177–207 in K. A. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md

    Huelsenbeck, J. P., and K. A. Crandall. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28:437–466[ISI]

    Huelsenbeck, J. P. and D. M. Hillis. 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247–264[ISI]

    Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. M. Munro, ed. Mammalian protein metabolism. Academic Press, N.Y

    Kelsey, C. R., K. A. Crandall, and A. F. Voevodin. 1999. Different models, different trees: the geographic origin of PTLV-I. Mol. Phylogenet. Evol. 13:336–347[ISI][Medline]

    Kendall, M., and A. Stuart. 1979. The advanced theory of statistics. Charles Griffin, London

    Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120[ISI][Medline]

    ———. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78:454–458

    Korber, B., M. Muldoon, J. Theiler, F. Gao, R. Gupta, A. Lapedes, B. H. Hahn, S. Wolinsky, and T. Bhattacharya. 2000. Timing the ancestor of the HIV-1 pandemic strains. Science 288:1789–1796

    Korber, B., J. Theiler, and S. Wolinsky. 1998. Limitations of a molecular clock applied to the considerations of the origin of HIV-1. Science 280:1868–1871

    Leitner, T., and J. Albert. 1999. The molecular clock of HIV-1 unveiled through analysis of a known transmission history. Proc. Natl. Acad. Sci. USA 96:10752–10757

    Leitner, T., D. Escanilla, C. Franzen, M. Uhlen, and J. Albert. 1996. Accurate reconstruction of known HIV-1 transmission history by phylogenetic tree analysis. Proc. Natl. Acad. Sci. USA 93:10864–10869

    Leitner, T., and W. M. Fitch. 1999. The phylogenetics of known transmission histories. Pp. 315–345 in K. A. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md

    Leitner, T., S. Kumar, and J. Albert. 1997. Tempo and mode of nucleotide substitutions in gag and env gene fragments in human immunodeficiency virus type 1 populations with a known transmission history. J. Virol. 71:4761–4770[Abstract]

    Liò, P., and N. Goldman. 1998. Models of molecular evolution and phylogeny. Genome Res. 8:1233–1244[Abstract/Free Full Text]

    Moriyama, E. N., Y. Ina, K. Ikeo, M. Shimizu, and T. Gojobori. 1991. Mutation pattern of human immunodeficiency virus gene. J. Mol. Evol. 32:360–363[ISI][Medline]

    Muse, S. 1999. Modeling the molecular evolution of HIV sequences. Pp. 122–152 in K. A. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md

    Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715–724[Abstract/Free Full Text]

    Nei, M. 1987. Molecular evolutionary genetics. Columbia University Press, N.Y

    Ota, R., P. J. Waddell, M. Hasegawa, H. Shimodaira, and H. Kishino. 2000. Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol. Biol. Evol. 17:798–803[Abstract/Free Full Text]

    Ou, C.-Y., C. A. Ciesielski, G. Myers et al. (18 co-authors). 1992. Molecular epidemiology of HIV transmission in a dental practice. Science 256:1165–1171

    Pedersen, A.-M. K., C. Wiuf, and F. B. Christiansen. 1998. A codon-based model designed to describe lentiviral evolution. Mol. Biol. Evol. 15:1069–1081[Abstract]

    Penny, D., P. J. Lockhart, M. A. Steel, and M. D. Hendy. 1994. The role of models in reconstructing evolutionary trees. Pp. 211–230 in R. W. Scotland, D. J. Siebert, and D. M. Williams, eds. Models in phylogenetic reconstruction. Clarendon Press, Oxford, England

    Posada, D., and K. A. Crandall. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14:817–818

    ———. 2001a. Selecting the best-fit model of nucleotide substitution. Syst. Biol. (in press)

    ———. 2001b. Simple (wrong) models for complex trees: empirical bias. Mol. Biol. Evol. 18:271–275

    Posada, D., K. A. Crandall, and D. M. Hillis. 2000. Phylogenetics of HIV. Pp. 121–160 in A. G. Rodrigo and G. H. J. Learn, eds. Computational and evolutionary analysis of HIV molecular sequences. Kluwer, Norwell, Mass

    Rambaut, A. 1997. The inference of evolutionary and population dynamic processes from molecular phylogenies. University of Oxford, Oxford, England

    ———. 2000. Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics 16:395–399

    Robertson, D. L., P. M. Sharp, F. E. McCutchan, and B. H. Hahn. 1995. Recombination in HIV-1. Nature 374:124–126

    RodrÍguez, F., J. F. Oliver, A. MarÍn, and J. R. Medina. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol. 142:485–501[ISI][Medline]

    Rzhetsky, A., and M. Nei. 1995. Tests of applicability of several substitution models for DNA sequence data. Mol. Biol. Evol. 12:131–151[Abstract]

    Rzhetsky, A., and T. Sitnikova. 1996. When is it safe to use an oversimplified substitution model in tree-making? Mol. Biol. Evol. 13:1255–1265[Abstract]

    Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425[Abstract]

    Schierup, M. H., and J. Hein. 2000a. Consequences of recombination on traditional phylogenetic analysis. Genetics 156:879–891

    ———. 2000b. Recombination and the molecular clock. Mol. Biol. Evol. 17:1578–1579

    Self, S. G., and K.-L. Liang. 1987. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. 82:605–610[ISI]

    Shankarappa, R., J. B. Margolick, S. J. Gange et al. (12 co-authors). 1999. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J. Virol. 73:10489–10502[Abstract/Free Full Text]

    Sullivan, J., and D. L. Swofford. 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenies. J. Mamm. Evol. 4:77–86

    Swofford, D. L. 1998. PAUP*: phylogenetic analysis using parsimony (* and other methods). Version 4.0 beta. Sinauer, Sunderland, Mass

    Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407–514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass

    Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Mol. Biol. Evol. 9:678–687[Abstract]

    Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512–526[Abstract]

    Thompson, J. D., T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins. 1997. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 24:4876–4882

    Van de Peer, Y., W. Janssens, L. Heyndrickx, K. Fransen, G. van der Groen, and R. De Wachter. 1996. Phylogenetic analysis of the env gene of HIV-1 isolates taking into account individual nucleotide substitution rates. AIDS 10:1485–1494

    Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306–314[ISI][Medline]

    ———. 1996. Among-site rate variation and its impact on phylogenetic analysis. Trends Ecol. Evol. 11:367–372[ISI]

    ———. 1997. How often do wrong models produce better phylogenies? Mol. Biol. Evol. 14:105–108[Free Full Text]

    Yang, Z., N. Goldman, and A. Friday. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316–324[Abstract]

    ———. 1995. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Syst. Biol. 44:384–399[ISI]

    Yang, Z., R. Nielsen, N. Goldman, and A.-M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449

    Zhang, J. 1999. Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models. Mol. Biol. Evol. 16:868–875[Abstract]

    Zharkikh, A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39:315–329[ISI][Medline]

    Zuckerkandl, E., and L. Pauling. 1965. Evolutionary divergence and convergence in proteins. Pp. 97–166 in V. Bryson and H. J. Vogel, eds. Evolving genes and proteins. Academic Press, N.Y

Accepted for publication February 2, 2001.