School of Biological Sciences, University of Auckland, Auckland, New Zealand
Department of Ecology and Genetics, University of Århus, Århus, Denmark
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Population genetics studies that utilize molecular sequences typically rely on samples of sequences that have been obtained contemporaneously (Felsenstein 1992
; Fu 1994
; Nee et al. 1995
; Pybus, Rambaut, and Harvey 2000
). However, there has recently been increased interest in the analysis of samples that are gathered serially, each at a different time. This includes samples from rapidly evolving viral populations such as HIV (Leitner and Albert 1999
; Rodrigo et al. 1999
) and samples of ancient DNA from fossilized remains (Leonard, Wayne, and Cooper 2000)
. It is our aim to derive estimates of substitutional parameters from this type of data using biologically relevant models.
Recently, two papers have independently described methods to estimate substitution rate, µ, from serial samples under the assumption of a molecular clock. Rambaut (2000)
shows how a phylogeny-based maximum-likelihood estimate (MLE) of the constant substitution rate, µ, expressing the divergence between dated sequences as a product of µ and the time interval, can be obtained (fig. 1a
). Drummond and Rodrigo (2000)
, using a distance-matrix least-squares (LS) approach, parameterize intersample divergence in two ways. First, analogous to Rambaut's single rate with dated tips (SRDT) model,
-parameterization estimates only a single substitution rate,
, using
ti as the intersample divergence for the ith interval with elapsed time ti (
is the number of substitutions per unit time, but since time may be measured in chronological units rather than in generations, Drummond and Rodrigo use
instead of µ). Second, with
-parameterization, each intersample interval is allowed to have its own substitution rate,
i; i.e., for the ith interval with elapsed time ti,
iti =
i (fig. 1b
). In keeping with Rambaut's terminology, we will refer to this as the "multiple rates with dated tips" (MRDT) model. Drummond and Rodrigo (2000)
go on to use these substitution rate estimates in a phylogenetic reconstruction procedure called serial-sample unweighted pair grouping method with arithmetic means (sUPGMA) which recovers a tree with lineages that terminate in the order of sampling.
|
![]() |
Likelihood Model |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The parameters of the tree are thus the substitution rates and the vector of times corresponding to the dated tips and the (n - 1 for a bifurcating tree) internal node heights (h), measured in units of substitutions (following Rambaut 2000; note that the tip times may be measured either in generations or in some calendar unit, and a simple rescaling allows one to move between the two). Our framework estimates a series of substitution rates only within the interval bounded by the first and last samples. Specifically, no assumptions are made with regard to the rate between the earliest sampling time and the root of the tree. Over this interval, there is no chronological information, and the branch lengths are free to be optimized in the standard manner as composite parameters of time and substitution rate. This rate may, of course, be of interest, for example, in dating the most recent common ancestor (MRCA). In this case, additional assumptions must be made: a natural assumption in the case of stepwise changes is that the earliest estimated rate remains constant when extrapolated back in time to the root.
For a given tree T, the likelihood of is the conditional probability of obtaining the sequence data S given
, T, and
, the vector of times, as well as the instantaneous substitution rate matrix M (also assumed to be known):
![]() |
Confidence interval estimation in the case of multiple 's is less straightforward. There are at least two ways of computing confidence intervals for multiple rates. First, multivariate upper and lower (1 -
)% confidence limits may be obtained by locating rates that correspond to log likelihood values differing from the maximum log likelihood value by
2k,
/2, where k is the number of rates estimated. If unbiased, these confidence intervals have a probability of (1 -
) of enclosing the true
. Second, a profile confidence likelihood interval may be obtained for each
as follows. Over a range of
i, locate the upper and lower values of
i such that
|
In the case where all elements of are equal, the MRDT model collapses to the SRDT model of a uniform molecular clock. If all
parameters are set to 0, the MRDT model reduces to the standard contemporaneous tips clock model (the single-rate [SR] model; Goldman 1993
; Rambaut 2000
). In fact, under the likelihood framework, one is able to test whether the MRDT model is a significantly better model for the data than the SRDT model. Since the SRDT model is simply a constrained MRDT model, the standard asymptotic likelihood ratio test can be applied. In this case, the test statistic,
|
When testing the SRDT model against the SR model, the null and alternative hypotheses are of the form
|
![]() |
![]() |
Least-Squares Model |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
![]() |
The standard error of the LS estimates of cannot be calculated easily because of the nonindependence of the pairwise distances. Drummond and Rodrigo (2000)
advocate the use of the parametric bootstrap (Efron and Tibshirani 1993
; Goldman 1993
) to generate confidence intervals of the estimates. Parametric bootstrapping requires specification of a model and subsequent simulation of pseudoreplicate data sets with the same number of sequences and sites as the original data, assuming that the estimates recovered using the observed data are the "true" values of the parameters. With the SRDT model and an assumption of a constant
over time, it is easy to generate pseudoreplicate data sets under a coalescent model in which population size is held constant (Drummond and Rodrigo 2000)
. However, under the MRDT model, parametric bootstrapping is not simple, since any resampling procedure must accommodate changing substitution rates and
's. This is a drawback of the distance-based LS methodprocedures for variance estimation are often elusive.
![]() |
Example |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Before the advent of potent combination therapy against HIV, drugs were less effective in lowering viral load and hindering progression toward AIDS. To investigate the effect of a one-drug therapy regime on the evolutionary progression of HIV, we analyzed previously published data consisting of serially sampled partial HIV-1 envelope (env) sequences from an infected individual who began Zidovudine treatment partway through the sampling period (Rodrigo et al. 1999
). Complete details of the data set are given in Rodrigo et al. (1999); briefly, the data set contains an initial sample followed by additional samples after 7 (day 214), 22 (day 671), 23 (day 699), and 34 months (day 1005). Monotherapy with Zidovudine was initiated after 13 months (day 409; J. Mullins, University of Washington, personal communication) and continued during the remaining time of study. Therefore, the data set contains two samples from before and three samples from after treatment began.
It has been suggested that highly active combination antiretroviral therapy leads to a cessation of viral replication (Finzi et al. 1997
; Wong et al. 1997
). A natural question is whether monotherapy with Zidovudine had the effect of slowing or halting viral replication in the particular individual from whom samples were available. If viral replication does, in fact, cease (or slow down), this will be reflected in the rate at which substitutions accumulate, since it is during the process of viral replication that this occurs. This corresponds to testing whether a, MRDT model, which allows for a change in substitution rate after the onset of therapy, provides a better fit to the data than an SRDT model and, if so, whether the estimated substitution rate after drug therapy is significantly different from zero.
The data set consisted of 60 sequences from five time points. The length of the alignment was 660 nt. Gapped columns were included in the analysis. To begin with, the data set was first split into two subsets, one containing all sequences before therapy (28 sequences), and the other containing all sequences after therapy commenced (32 sequences). For each of these data sets, a neighbor-joining tree was built and an ML general time-reversible (REV) model was estimated using PAUP* 4.0b4 (Swofford 1999
).
Each tree was used to estimate a uniform substitution rate using the SRDT likelihood model as implemented in the computer program TIPDATE (Rambaut 2000)
. TIPDATE was also used to find the ML roots for the two trees. This was done by rooting the tree at every branch on the unrooted topology and optimizing the branch lengths in accordance with the dated tips. The rooted topology that maximizes the likelihood was used to estimate the substitution rate. All estimated rates are reported in table 1
. A rate (
before) of 5.034 x 10-5 substitutions per site per day (1.84% per year, 95% confidence limit = [1.02%, 2.73%]) was obtained for the sequences before therapy, and a rate (
after) of 5.8 x 10-7 substitutions per site per day (0.021% per year, 95% confidence limit = [0.0%, 0.77%]) for the sequences after therapy. As
after has a confidence interval that encloses 0, we cannot show that significant substitutions have occurred since therapy commenced.
|
|
|
Similar analyses were performed for before- and after-therapy samples, except that in these instances, the only comparison made was between the SRDT and SR models. For the before-therapy samples, the SRDT model has a statistically better fit to the data than the SR model (P < 0.01). However, for the after-therapy sequence subset, the SRDT model cannot be distinguished statistically from the SR model. Taken on its own, this suggests that there is little or no accumulation of substitutions over this period. (Caution must be taken with this interpretation: as we discuss in the next section, the MRDT model is significantly worse than a model that assumes no consistent clocklike pattern of evolution among the sequences).
Equivalent estimates were also derived with the LS method. Table 1 summarizes the results. Both the likelihood and the least-squares procedures consistently estimated a higher rate of substitution before therapy, about an order of magnitude greater than the estimated rate after therapy.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
First, for any analysis that involves the inference of some kind of clocklike behavior, whether it be a constant clock or a changing clock, a first step should be a test of whether such a model is significantly worse than an unconstrained nonclock model (also called a different-rate [DR] model). The DR model is the standard used in phylogenetic tree reconstruction and effectively allows every branch to have its own substitution rate. Thus, the length of the ith branch is an estimate of the composite parameter iti. If an SRDT model or an MRDT model is significantly worse than the DR model, it means that at least some lineages are not evolving in a clocklike manner. In fact, where appropriate, we recommend a hierarchy of nested likelihood ratio tests: DR versus MRDT, MRDT versus SRDT, and SRDT versus SR. For our example data set, the DR model was always significantly better than the SRDT and MRDT models (data not shown). Our primary intention with the use of the data set was simply to illustrate the methods described in this paper, rather than to make substantive statements about the effects of monotherapy on substitution rates. It is important, however, to note this here for completeness.
Also, the likelihood estimation procedures presented here (and by Rambaut [2000]) assumes that the evolutionary history of the sequences, i.e., the topology of the genealogy, is known or can be reconstructed correctly. The bias introduced into parameter estimation and hypothesis-testing procedures by using incorrect genealogies is largely unknown. On the other hand, the least-squares estimation procedure is not based on a reconstructed topology and therefore may not suffer from this possible source of bias. For example, for a single-rate model, the least-squares estimator has been shown to be an unbiased estimator (Drummond and Rodrigo 2000)
. However, distance-based LS methods do not take into account the correlations induced by shared history, thus making variance estimation difficult.
Ultimately, the best approach would be to incorporate the uncertainty of the genealogy explicitly into a probabilistic framework. One way of taking the uncertainty of the topology into consideration in the likelihood model is to integrate over a number of topologies. A natural way to do this is to use a Markov chain Monte Carlo (MCMC) sampling procedure to sample tree space in proportion to the likelihood of the data (Kuhner, Yamato, and Felsenstein 1995
). This approach has been used, for example, to incorporate the uncertainty in the tree topology into estimates of population size and growth rates (Kuhner, Yamato, and Felsenstein 1995, 1998
). This method has a natural extension to the estimation of substitution rates and can also be used to find confidence intervals in topology space under the SRDT and MRDT models of evolution.
One of the interesting observations of this study is that different models (SR, SRDT, MRDT) can have different ML tree topologies. This may turn out to be a common occurrence. For example, for a 45-sequence subset of the data, 729 strictly bifurcating ML tree topologies were found. Although these trees had identical likelihoods under an unconstrained (nonclock) model, they had a range of likelihoods under the SR, SRDT, and MRDT models. Furthermore, no single strictly bifurcating topology represented the ML topology under all three models. If one chooses to use different topologies for each model, then the asymptotic approximation to the likelihood ratio test cannot be used. Instead, some alternative procedure (say, a parametric simulation procedure [Goldman and Whelan 2000]) should be used. A sampling method such as MCMC would also be useful in this case, as the sampling procedure integrates over tree space in proportion to the likelihood of the data. Thus, for two competing models, a null and alternative distribution can be compared.
In the previous section, we also alluded to the fact that different rootings of an unrooted tree can have different likelihoods under a given model of substitution. By extension, this also means that different models may require the tree to be rooted differently. This does not change the mechanics of any likelihood ratio test, since no new free parameters are added to the model. However, if the root of the tree is not known, an extra step needs to be added to any analysis to find the appropriate root.
Serial molecular samples add a new dimension to population genetics studies. Since it is possible to estimate substitution (or mutation) rate independently of other parameters, it is also possible to decouple composite population parameters like = 2Neµ (where Ne is the effective population size) into their component parts. The models we introduce here go one step further and allow these parameters to be expressed as functions of time. Although we have only spoken of stepwise changes in substitution rates, these models can be generalized to allow substitution rate to vary as any predefined function of time. With viral populations such as HIV, this becomes especially interesting, since it allows us to study changes in average generation time and substitution rate during disease progression or under different therapeutic regimes. In conjunction with the estimation of demographic functions of time (Pybus
, Rambaut, and Harvey 2000)
, it also means that we can decompose
(t) = 2Ne(t)µ(t) into the component functions of Ne(t) and µ(t), where µ(t) is a stepwise function of time.
In this paper, we have assumed that the times corresponding to changes in substitution rates are fixed either to the sampling times or to some time point known a priori. Similarly, we have also assumed that the phylogeny is known. However, since these times and the phylogeny are parameters embedded in the model, they can also be jointly estimated within the likelihood framework.
The models we have described in this paper apply to any set of molecular sequences of sufficient length, or obtained sufficiently far apart in time, that an appreciable amount of substitutions has accumulated. These include ancient DNA sequences, as well as rapidly evolving viral sequences. In conjunction with efforts to model lineage-specific rates (Thorne, Kishino, and Painter 1998
; Huelsenbeck, Larget, and Swofford 2000
) and other time- or lineage- dependent processes, the models presented here go some way toward a more realistic description of the evolution of molecular sequences.
MLE and LS estimates under the SR, SRDT, and MRDT models can be obtained using the computer program PEBBLE, available from the website http://www.cebl.auckland.ac.nz or from the authors.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
1 Keywords: serial samples
stepwise rate changes
maximum likelihood
least squares
substitution rate
2 Address for correspondence and reprints: Allen G. Rodrigo, School of Biological Sciences, University of Auckland, Private Bag 92019, Auckland, New Zealand. E-mail: a.rodrigo{at}auckland.ac.nz
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Chun T. W., L. Stuyver, S. B. Mizell, L. A. Ehler, J. A. Mican, M. Baseler, A. L. Lloyd, M. A. Nowak, A. S. Fauci, 1997 Presence of an inducible HIV-1 latent reservoir during highly active antiretroviral therapy Proc. Natl. Acad. Sci. USA 94:13193-13197
Drummond A., A. G. Rodrigo, 2000 Reconstructing genealogies of serial samples under the assumption of a molecular clock using serial-sample UPGMA (sUPGMA) Mol. Biol. Evol 17:1807-1815
Efron B., R. Tibshirani, 1993 An introduction to the bootstrap Chapman and Hall, London
Felsenstein J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach J. Mol. Evol 17:368-376[ISI][Medline]
. 1992 Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates Genet. Res 59:139-147[ISI][Medline]
Finzi D., M. Hermankova, T. Pierson, et al. (15 co-authors). 1997 Identification of a reservoir for HIV-1 in patients on highly active antiretroviral therapy Science 278:1295-1300
Fu Y. X., 1994 A phylogenetic estimator of effective population size or mutation rate Genetics 136:685-692
Goldman N., 1990 Maximum likelihood inferences of phylogenetic trees, with special reference to the Poisson process model of DNA substitutions and to parsimony analysis Syst. Zool 39:345-361[ISI]
. 1993 Statistical tests of models of DNA substitution J. Mol. Evol 36:182-198[ISI][Medline]
Goldman N., S. Whelan, 2000 Statistical tests of gamma-distributed rate heterogeneity in models of sequence evolution in phylogenetics Mol. Biol. Evol 17:975-978
Huelsenbeck J. P., B. Larget, D. Swofford, 2000 A compound Poisson process for relaxing the molecular clock Genetics 154:1879-1892
Kuhner M. K., J. Yamato, J. Felsenstein, 1995 Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling Genetics 140:1421-1430
. 1998 Maximum likelihood estimation of population growth rates based on the coalescent Genetics 149:429-434
Leitner T., J. Albert, 1999 The molecular clock of HIV-1 unveiled through analysis of a known transmission history Proc. Natl. Acad. Sci. USA 96:10752-10757
Leonard J. A., R. K. Wayne, A. Cooper, 2000 Population genetics of Ice Age brown bears Proc. Natl. Acad. Sci. USA 97:1651-1654
Nee S., E. C. Holmes, A. Rambaut, P. H. Harvey, 1995 Inferring population history from molecular phylogenies Philos. Trans. R. Soc. Lond. B Biol. Sci 349:25-31[ISI][Medline]
Ota R., P. J. Waddell, M. Hasegawa, H. Shimodaira, H. Kishino, 2000 Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters Mol. Biol. Evol 17:798-803
Pybus O. G., A. Rambaut, P. H. Harvey, 2000 An integrated framework for the inference of viral population history from reconstructed genealogies Genetics 155:1429-1437
Rambaut A., 2000 Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies Bioinformatics 16:395-399[Abstract]
Rodrigo A. G., E. G. Shpaer, E. L. Delwart, A. K. N. Iversen, M. V. Gallo, J. Brojatsch, M. S. Hirsch, B. D. Walker, J. I. Mullins, 1999 Coalescent estimates of HIV-1 generation time in vivo Proc. Natl. Acad. Sci. USA 96:2187-2191
Rodriguez F., J. F. Oliver, A. Marin, J. R. Medina, 1990 The general stochastic model of nucleotide substitution J. Theor. Biol 142:485-501[ISI][Medline]
Shankarappa R., J. B. Margolick, S. J. Gange, et al. (12 co-authors). 1999 Consistent viral evolutionary dynamics associated with the progression of HIV-1 infection J. Virol 73:10489-10502
Swofford D. L., 1999 PAUP* Phylogenetic analysis using parsimony (*and other methods). Version 4.0b4. Sinauer, Sunderland, Mass
Thorne J. L., H. Kishino, I. S. Painter, 1998 Estimating the rate of evolution of the rate of molecular evolution Mol. Biol. Evol 15:1647-1657
Wong J. K., M. Hezareh, H. F. Günthard, D. V. Havlir, C. C. Ignacio, C. A. Spina, D. D. Richman, 1997 Recovery of replication-competent HIV despite prolonged suppression of plasma viremia Science 278:1291-1300