* Computational and Evolutionary Biology Laboratory, School of Biological Sciences, and the
Allan Wilson Centre for Molecular Ecology and Evolution, University of Auckland, Auckland, New Zealand
Bioinformatics Research Center, Department of Ecology and Genetics, University of
rhus,
rhus, Denmark
Correspondence: E-mail: a.rodrigo{at}auckland.ac.nz.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: serial samples substitution rate subtree likelihood whole-tree likelihood maximum likelihood
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The estimation methods developed to date use only sequences sampled serially from a single population. However, certainly with viruses, it is quite common to sample viral sequences from several different hosts, and within each host, at different timepoints (e.g., Gunthard et al. 1999; Holmes et al. 1992; Rodrigo et al. 1999; Shankarappa et al. 1999). If we assume, as is frequently done for viruses such as HIV-1, that there is little likelihood of multiple transmission events, then viruses in each host are part of an isolated and unique population, evolving independently from a single founding variant.
In this paper, we describe two likelihood-based methods for jointly estimating a substitution rate using serially sampled sequences, when these are obtained from different populations. These methods are analogous to those developed by Gu (2001) for the analysis of functional divergence in protein families, and we use Gu's terminology in this paper. In the first of these procedures, the subtree of sequences from each population is treated as an unrelated phylogeny. This "subtree likelihood" (STL) approach uses the likelihoods of the subtrees as independent contributors to the total likelihood of all samples. In an alternative approach, a phylogeny of all sequences is constructed and the "whole-tree likelihood" (WTL) is then used as a basis for estimation.
The joint estimation of substitution rate is a reasonably simple extension to work previously done (Rambaut 2000) under both approaches. There is, however, an interesting problem that provides a more challenging application of the STL or WTL procedure. Gunthard et al. (1999) described a study in which HIV-1 partial envelope (env) gene sequences were obtained from individuals just before and 2 years after the commencement of combination antiretroviral therapy. The aim of the study was to determine if antiretroviral therapy effectively controlled viral replication, as had previously been suggested by several workers (Finzi et al. 1997; Wong et al. 1997). In the event that the patient responds to therapy and viral replication is halted, there would be no measurable (or statistically significant) accumulation of substitutions in env sequences sampled before and after therapy. In a study such as this, the aim is to quantify and test whether the virus population continues to evolve within a host over the period of the study. Gunthard et al. (1999) analyzed each patient separately, but such an analysis runs the risk of inflating the probability of a type I error. Here, we apply the STL and WTL procedures to provide joint estimators of both the proportion of individuals who do not respond to therapy (i.e., whose viral population continues to replicate and accumulate substitutions) and the rate of ongoing viral substitution in these patients. Finally, we show how each patient may be assigned to the class of nonresponders or responders using empirical Bayes' classifiers.
![]() |
Methodology |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
Using Subtree Likelihoods (STLs)
We wish to extend this model to the case where there are n serially sampled data sets, S1,...,Sn, each from a different population. Associated with each is a fixed tree, Ti, possibly a different model of evolution, Mi, and a different set of sampling times, i. When the aim is to estimate a common substitution rate, the likelihood function can be written as
|
Here, we assume that S1, ... , Sn are drawn from independent populations, so that for a sample of aligned sequences, Si, from the ith population,
|
Given equation 3, it follows that
|
|
|
How do we modify equation 5 to allow groups of subpopulations to have different values of ? As discussed above, we want to extend equation 4 to allow for the possibility that there is no measurable accumulation of substitutions in some populations so that for these,
= 0. The following description applies specifically to this case, but it is general enough to be applied to other values of
as well. More importantly, whereas we focus exclusively on two rate categories (i.e.,
> 0 and
= 0), these methods can also be generalized to data with more than two rate categories.
We define a Bernoulli random variable, R, where R = 0 classifies a population for which = 0, and R = 1, a population where
> 0. Let R = (R1, ... ,Rn) represent the vector of population states. We can define the joint likelihood of R and a common positive-valued
(for those populations for which R = 1) as
|
|
|
The value of this approach is that it identifies the populations that show an accumulation of substitutions and those that do not. However, frequently what is required is an estimate of the proportion of populations that are classified as either R = 0 or R = 1. Of course, this can be estimated simply from R after maximizing equation 7. Ideally, however, if the intention is to obtain an MLE of this proportion, the likelihood function needs to be recast.
Let the probabilities associated with R = 0 and R = 1 be (1 p) and p respectively. MLEs of p and a positive-valued can be obtained jointly by maximizing the following likelihood function:
|
Asymptotic bivariate (1 )%-profile confidence envelopes may be obtained by locating pairs of (
*, p*) such that ln L(
,
,
i, ... ,
n) ln L(
*, p*,
, ... ,
) =
/2; here,
, ... ,
are the node heights that give the highest likelihood for
* and p*. Alternatively, a profile confidence likelihood interval may be obtained for each parameter (either
or p). For p, for instance, locate upper and lower values, p*, such that
|
Joint estimation of p and does not specifically identify which populations are classified as R = 1 or R = 0. It is however possible to use an empirical Bayes' procedure to classify the ith population according to their relative posterior probabilities after fixing
=
, p =
, and Hi to
i0 or
i1 depending on whether Ri = 0 or R = 1, respectively. To implement this, the ratio
|
It is worthwhile noting that identifying the precise configuration of R = (R1,...Rn), as we do in equation 8, and deriving by calculating a posteriori the proportion of subpopulations classified as Ri = 1 (or 0) may lead to an inconsistent estimate of p, (i.e., there exists a small
such that Prob(|
p| <
)
0, as n
). This is because as n increases, the probability of incorrectly classifying subpopulations increases; this affects our estimate of p when it is calculated a posteriori. For this reason, it may be more defensible theoretically to estimate p directly.
For k (k > 2) categories of rates, there will be a vector p = {p1, ... , pk} where pi corresponds to the proportion of subpopulations in rate category i. Equation 11 will be inapplicable in such a case. Nonetheless, for each subpopulation, it is easy enough to calculate the posterior probability of each rate category (for two categories, these correspond to the numerator and denominator of equation 11). These posterior probabilities can be thought of as "classification probabilities;" a subpopulation is assigned to the category with the highest classification probability.
Using Whole-Tree Likelihoods (WTL)
An alternative to the STL methods described above is to build and use a tree that represents the joint phylogeny of all sequences sampled from all populations. If the real sampling times from different populations are known, it is possible to build a serial phylogenetic tree of the entire set of sequences. In this circumstance, the complete phylogeny can be used to estimate a single mutation rate under the SRDT model. This would be analogous to what Gu (2001) refers to as the "whole-tree likelihood" approach. However, even if these times are available, it is still not obvious that a single tree and the SRDT analysis would be the appropriate approach, because it assumes that the rates of substitution of the virus between individuals are the same as those within individuals. Of course, this may not be true, since the accumulation of substitutions between individuals is subject to the evolutionary dynamics operating as a result of transmission from host to host.
An alternative is to construct a partially-constrained serial phylogenetic tree that allows the sequences within each population to evolve according to the SRDT model but also allows the lengths of branches connecting the subtrees of sequences from the different populations to vary freely (fig. 1B). In this case, the likelihood of the partially-constrained serial tree, T, with a single model of evolution, M, and node heights, HT, some of which are free to vary, is
|
If we are interested in estimating the numbers of individuals with rates > 0 or
= 0, it is possible to choose a partially constrained tree with the particular assignment of samples to each rate group that has the highest likelihood. So, by cycling through all 2n possible combinations of rate assignments, we are able to identify the ML combination.
The approach above is equivalent to that applied using the subtree likelihood method, in which a particular combination of population states R1...Rn (and the value of associated with those populations for which R = 1) that maximizes L(R,
) is found. As with that approach, the disadvantage is that we do not estimate a proportion, p, of the number of populations that have state R = 1.
It is possible, albeit tedious, to estimate both p and using the WTL approach. Let Ri = (Ri1, ... Rin) represent the ith combination of states assigned to the n samples, i = 1, ... , 2n. Let ki be the number of samples assigned state R = 1 and (n ki) the number assigned state R = 0 in Ri. The joint likelihood for
and p is
|
Finally, we are interested in assigning subpopulations to different rate categories. As with the STL approach, we do this using an empirical Bayes' classifier. For the jth subpopulation, j = 1, ... , n, with rate assignment Rij Ri in the ith combination of rate assignments,
|
|
|
|
Example: Estimating the Proportion of Individuals Responding to Antiretroviral Therapy
Gunthard et al. (1999) studied the evolution of partial (regions C2C3) HIV-1 env sequences over 2 years of combination antiretroviral therapy in six individuals. Viral RNA sequences were obtained just before therapy began ("early" sequences) and cell-associated viral DNA sequences were obtained 2 years later ("late" sequences). As mentioned previously, if therapy is successful at halting viral replication, there is no opportunity for the virus to accumulate mutations, since this only happens when viral RNA is reverse transcribed to cDNA after infection of host cells. Therefore, one expects that successful therapy will leave behind a population of viral "fossils" embedded in the genomes of host cells infected before therapy began. This means that when therapy begins, the mutation rate, , becomes zero. Note that setting
= 0 only makes sense if serial sequence samples are available. In the absence of serially sampled data,
= 0 implies that there can be no differences between sequences, but with serial samples,
= 0 only implies that over some period between sampling events, there was no accumulation of substitutions.
Gunthard et al. (1999) reconstructed the phylogenies of each set of sequences from each subject. They then measured the evolutionary distance of each sequence to the root of each tree and compared the distances of early and late sequences using a nonparametric test. There are obvious problems with this approach, principally the genealogical dependence of evolutionary distances. In effect, this analysis assumes that each sequence terminates a lineage that evolved independently from the most recent common ancestor. Here, we reanalyze the data obtained by Gunthard et al. (1999). We used PAUP* (Swofford 1999) to construct individual maximum-likelihood phylogenies for sequences from each subject with a common GTR model of evolution that we had previously estimated for all subjects simultaneously. We applied the analyses described above for both the STL-based and WTL-based approaches. For the WTL analyses, an unrooted tree of the entire data set was used. As expected, sequences from each subject clustered together. For the STL analyses, the phylogeny of each set of sequences was rooted using sequences from other subjects as outgroups. Once the trees were rooted, the outgroup sequences were pruned from the trees so that the rooted topology contained only the sequences for that subject.
STL Analyses
First, for each subject, we derived a nonnegative MLE of by setting the times between early and late sequences at 2 years, using the estimated phylogeny and common GTR model of evolution. Interestingly, the MLEs of
for four of the six subjects were, in fact, 0. The MLEs of
for sequences obtained from Patient M was 1.8% per year, and that of Patient C was 0.3% per year. Using the asymptotic likelihood ratio test (LRT) described by Drummond, Forsberg, and Rodrigo (2001), we found that
was statistically different from 0 only for Patient M (P < 0.01 [table 1]). This result is interesting because we were only able to find evidence for the continued accumulation of substitutions in one subject. In contrast, using the nonparametric approach and treating the distance-to-root of each sequence as an independent measurement, Gunthard et al. (1999) found that the viral populations in three subjects continued to evolve. This is not surprising, because the assumption that each sequence in a given sample sits at the tip of an independently evolving lineage falsely inflates both the degrees of freedom of a test (broadly defined, the apparent number of replicates) and our confidence that any estimated difference is statistically significant.
|
At this point, it is worth noting that the log-likelihood of one other configuration was only very slightly different from that of the ML configuration. The configuration in which Patients K and M have states R = 1, and a common value of = 0.017 (1.7% per year) has a log-likelihood of 4,391.37. At first glance, this is a curious resultan examination of table 1 indicates that for Patient K, the MLE of
= 0. Why then should Patient K be assigned a state that signals a detectable accumulation of substitutions? The reason becomes obvious when we examine the topology of the sequences for Patient K (fig. 2). The tree is reciprocally monophyletic, with early sequences (labeled with the prefix "KV") and late sequences ("KP") clustering on different clades. This means that simply by moving the position of the root on the branch connecting the two clades, it is possible to get estimates of
that range from 0 to some positive value without changing the log-likelihood.
|
|
|
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The approaches we apply here based on subtree likelihoods and whole-tree likelihoods are equivalent to those used by Gu (2001). In our example, there appears to be very little difference in the results. It is worth reiterating the points that Gu (2001) makes in his comparison of the subtree likelihood versus whole-tree likelihood approaches. In essence, if there are very long or very short internal branch lengths between clades, the whole-tree method offers little improvement over the subtree method. This is because the value of the whole-tree approach depends on the extent to which the joint phylogeny influences the prior probabilities of nucleotide states at the roots of each of the individual subtrees and hence the likelihood of the tree and its attendant parameter estimates. If these prior probabilities and the resultant likelihood are unaffected or only marginally affected by combining the subtrees into a single joint phylogeny, then there is little added value in the whole-tree approach, particularly when we consider the computational overheads of the method. These can be quite substantial; for instance, the grid search to generate the likelihood contour plot using the STL method, programmed with JAVA version 1.4, took an average of 71 s on a PC with an AMD Athlon 1400+ processor and 512 Mb RAM. In contrast, the WTL grid search, on the same machine, took 8,195 s, or just over 2 h.
The STL methods we have developed in this paper can also be applied to serial sequence samples drawn from different, unlinked loci. In this case, we may be interested in testing whether the different loci are evolving at the same rate. Alternatively, the sampled loci may be partitioned into groups each evolving with its own rate. The methods we have described here work as well with such data.
Obviously, these methods should only be applied if each subpopulation is evolving in a clocklike manner. It should be routine to validate this assumption first using the tests described by Rambaut (2000) and Drummond, Forsberg, and Rodrigo (2001).
Our analyses, as we have described them, rely on the assumption of a given topology. To relax this assumption, we need to allow for uncertainty in the evolutionary relationships of the sampled sequences. Two of us (A.D. and A.G.R.) have been involved in recent work on the use of Bayesian analysis of serially sampled sequences using Markov chain Monte Carlo (MCMC) methods (Drummond et al. 2002). This approach allows different topologies to be sampled in proportion to their contribution to the joint posterior probability of all unknown quantities. The plan for the immediate future is to incorporate the analyses, and different options, described here into the MCMC framework already available.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Diethard Tautz, Associate Editor
![]() |
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Barnes, I., P. Matheus, B. Shapiro, D. Jensen, and A. Cooper. 2002. Dynamics of Pleistocene population extinctions in Beringian brown bears. Science 295:2267-2270.
Drummond, A., R. Forsberg, and A. G. Rodrigo. 2001. Estimating stepwise changes in substitution rates using serial samples. Mol. Biol. Evol. 18:1365-1371.
Drummond, A., G. K. Nicholls, A. G. Rodrigo, and W. Solomon. 2002. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics 161:1307-1320.
Drummond, A., and A. G. Rodrigo. 2000. Reconstructing genealogies of serial samples under the assumption of a molecular clock using serial-sample UPGMA (sUPGMA). Mol. Biol. Evol. 17:1807-1815.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[ISI][Medline]
Finzi, D., M. Hermankova, and T. Pierson, et al. (15 co-authors). 1997. Identification of a reservoir for HIV-1 in patients on highly active antiretroviral therapy. Science 278:1295-1300.
Forsberg R., M. B. Oleksiewicz, A. M. K. Petersen, J. Hein, A. Bøtner, and T. Storgaard. 2001. A molecular clock dates the common ancestor of European-type porcine reproductive and respiratory syndrome virus at more than 10 years before the emergence of disease. Virology 289:174-179.[CrossRef][ISI][Medline]
Fu, Y. X. 2001. Estimating mutation rate and generation time from longitudinal samples of DNA sequences. Mol. Biol. Evol. 18:620-626.
Goldman, N. 1990. Maximum likelihood inferences of phylogenetic trees, with special reference to the Poisson process model of DNA substitutions and to parsimony analysis. Syst. Zool. 39:345-361.[ISI]
Gu, X. 2001. Maximum-likelihood approach for gene family evolution under functional divergence. Mol. Biol. Evol. 18:453-464.
Gunthard, H. F., S. D. Frost, and A. J. Leigh Brown, et al. (12 co-authors). 1999. Evolution of envelope sequences of human immunodeficiency virus type 1 in cellular reservoirs in the setting of potent antiviral therapy. J. Virol. 73:9404-9412.
Holmes, E. C., L. Q. Zhang, P. Simmonds, C. A. Ludlam, and A. J. Leigh Brown. 1992. Convergent and divergent sequence evolution in the surface envelope glycoprotein of HIV-1 within a single infected patient. Proc. Natl. Acad. Sci. USA 89:4835-4839.[Abstract]
Lambert, D. M., P. A. Ritchie, C. D. Millar, B. Holland, A. J. Drummond, and C. Baroni. 2002. Rates of evolution in ancient DNA from Adelie penguins. Science 295:2270-2273.
Leitner, T., and J. Albert. 1999. The molecular clock of HIV-1 unveiled through analysis of a known transmission history. Proc. Natl. Acad. Sci. USA 96:10752-10757.
Leonard, J.A., R. K. Wayne, and A. Cooper. 2000. Population genetics of Ice Age brown bears. Proc. Natl. Acad. Sci. USA. 97:1651-1654.
Ota, R., P. J. Waddell, M. Hasegawa, H. Shimodaira, and H. Kishino. 2000. Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol. Biol. Evol. 17:798-803.
Rambaut, A. 2000. Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics 16:395-399.[Abstract]
Rodrigo, A. G., E. G. Shpaer, E. L. Delwart, A. K. N. Iversen, M. V. Gallo, J. Brojatsch, M. S. Hirsch, B. D. Walker, and J. I. Mullins. 1999. Coalescent estimates of HIV-1 generation time in vivo. Proc. Natl. Acad. Sci. USA 96:2187-2191.
Rodriguez, F., J. F. Oliver, A. Marin, and J. R. Medina. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol. 142:485-501.[ISI][Medline]
Seo, T. K., J. L. Thorne, M. Hasegawa, and H. Kishino. 2002a. A viral sampling design for testing the molecular clock and for estimating evolutionary rates and divergence times. Bioinformatics 18:115-123.
Seo, T. K., J. L. Thorne, M. Hasegawa, and H. Kishino. 2002b. Estimation of effective population size of HIV-1 within a host. A pseudomaximum-likelihood approach. Genetics 160:1283-1293.
Shankarappa, R., J. B. Margolick, and S. J. Gange, et al. (12 co-authors). 1999. Consistent viral evolutionary dynamics associated with the progression of HIV-1 infection. J. Virol. 73:10489-10502.
Swofford, D. L. 1999. PAUP*: phylogenetic analysis using parsimony (*and other methods). Sinauer Associates, Sunderland, Mass.
Wong, J. K., M. Hezareh, H. F. Gunthard, D.V. Havlir, C. C. Ignacio, C. A. Spina, and D. D. Richman. 1997. Recovery of replication-competent HIV despite prolonged suppression of plasma viremia. Science 278:1291-1295.
|