* Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York; Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence
Correspondence: E-mail: sw292{at}cornell.edu.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: HIV divergence rate longitudinal study
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The most detailed information regarding viral evolution in vivo comes from longitudinal studies, in which samples of viral DNA or RNA are taken from the same infected patient over time (e.g., Balfe et al. 1990; Wolfs et al. 1991; Holmes et al. 1992; Zhang et al. 1993). We use "longitudinal data set" to refer to the set of all sequences sampled from a single patient over time. The temporal structure of longitudinal data sets allows researchers to determine the immediate consequences of specific events in viral evolution. Examples include the evolution of drug resistance (e.g., Larder, Darby, and Richman 1989), the evolution of viral phenotypes that escape the current immune response (McMichael and Phillips 1997; Goulder et al. 2001), and disease progression (Viscidi 1999; Ross and Rodrigo 2002; Williamson 2003).
Shankarappa et al. (1999) published an extensive collection of longitudinal data sets, including data from nine infected individuals. Sequences of the C2-V5 region of the envelope gene (env) were sampled over the entire course of infection in each patient. Samples were taken at approximately 10-month intervals with 10 to 20 sequences determined per sample. The authors assert that the most important pattern evident in their data is "consistent stages" of viral evolution (figure 6 in Shankarappa et al. [1999]). In the first stage, both env nucleotide diversity and divergence from the founding population increase linearly. During the next stage, which lasts, on average, approximately 2 years, nucleotide diversity either stabilizes or declines, but divergence continues to accumulate. In the final stage, divergence also stabilizesthat is, env sequence evolution nearly comes to a halt. This final stage usually begins less than 1 year before the onset of clinical AIDS (CD4+ T cell counts <300 cells/µl). Further, Shankarappa et al. (1999) found close associations with these patterns and the emergence of the X4 viral genotype, an important indicator of disease progression (Schuitemaker et al. 1992). They noted that the time of peak diversity coincided with the initial emergence of X4 viruses, and divergence stabilization coincided with the X4 viruses reaching their highest frequency. In this paper, we investigate the causal factors leading to these consistent patterns of env evolution. In particular, we address what causes the evolutionary rate to slow so dramatically toward the end of infection.
We consider two alternative hypotheses to explain divergence stabilization. The first hypothesis, which we refer to as cellular exhaustion, proposes that the viral generation time increases as the primary target cell populations (CD4+ T cells) become depleted. As the generation time increases, the evolutionary rate slows and divergence stabilizes. Several factors could contribute to the increase in generation time coincident with target cell depletion. First, as CD4+ T cells are depleted late in infection, a greater proportion of replication cycles might go through other infectable cell types, such as macrophages and CD4+ memory T cells. The intracellular stage of the viral life cycle occurs much more slowly in these alternative cell types (Finzi et al. 1997, 1999), and a shift to increasing usage of these cell types can have a major impact on the viral replication rate (Kelly 1994, 1996; Kelly et al. 2003). Also, the average time of the virion stage of the viral life cycle could increase simply because target cells are more rare and encountered less frequently by virions. This would also increase the overall generation time.
The second hypothesis, which we refer to as immune relaxation, is that the evolutionary rate slows because infection disrupts immune function. This, in turn, diminishes positive selection pressure to escape immune response (Bonhoeffer, Holmes, and Nowak 1995). Immune-mediated positive selection is thought to be an important force driving evolution in the HIV genome, especially in the env gene (e.g., Wolfs et al. 1990; Holmes et al. 1992; Bonhoeffer, Holmes, and Nowak 1995; Wolinsky et al. 1996; Richman et al. 2003). As the immune system is disrupted, the evolutionary rate could fall back closer to the expected neutral rate. This hypothesis, if correct, would have important implications for HIV evolution and dynamics. First, it would suggest that the primary selective force acting on env is the immune system itself. This result would not be surprising as it is generally assumed that the immune system is the primary source of positive selection pressure. However, the importance of immune-mediated selection relative to other selective pressures has been difficult to establish. Second, and perhaps more importantly, the immune-relaxation hypothesis suggests that the parts of the immune system disrupted by infection (most notably, CD4+ T cells) play an indispensable role in responding to epitopes coded for by env and in driving env sequence evolution. Thus, the immune-relaxation hypothesis implies a feedback loop, with the virus directly impairing HIV-specific responses. Immunological evidence points to such an interaction between the virus and HIV-specific CD4+ memory T cells (Douek et al. 2002). However, the evolutionary significance of this interaction has not yet been established.
We can differentiate between cellular exhaustion and immune relaxation by contrasting patterns of divergence at nonsynonymous and synonymous sites within env. The cellular-exhaustion hypothesis, because it is based on an increase in the viral generation time, predicts that both nonsynonymous and synonymous divergence will slow down at the time of divergence stabilization. In contrast, the immune-relaxation hypothesis predicts a slow-down in only the nonsynonymous sites because the synonymous sites are not subject to immune-mediated selection. In this paper, we measure the initial divergence rate and the time of divergence stabilization for both nonsynonymous and synonymous sites in the Shankarappa et al. (1999) data sets in hopes of distinguishing between the two alternative hypotheses of divergence stabilization.
When Shankarappa et al. (1999) initially identified the "consistent patterns," the times of peak diversity and divergence stabilization were estimated by eye. Given the stochastic nature of viral evolutionary dynamics, this subjective procedure can be problematic. Mayer-Hamblett and Self (2001) addressed this problem for changes in viral diversity by using regression models to estimate the time of peak diversity and to test hypotheses regarding changes in the level of diversity. Here, we develop a statistical procedure for characterizing changes in divergence rates based on a simple probability model. This method not only allows us to test the cellular-exhaustion and immune-relaxation hypotheses but also allows more accurate association between divergence stabilization and other aspects of infection (e.g., disease progression) and more accurate measures of the divergence rate.
In the next section, we develop a statistical framework for the analysis of divergence in longitudinal data sets and apply it to the longitudinal data from eight of the HIV-infected patients studied by Shankarappa et al. (1999). We derive a diffusion model to predict the likelihood of the various evolutionary trajectories observed in the data, either with or without divergence stabilization. This model serves as the underpinning for a maximum-likelihood estimation routine, which allows estimation of evolutionary parameters such as the mean and variance of divergence rate. It also provides a rigorous statistical test (a likelihood ratio) for the existence of divergence stabilization within patients. We use a parametric bootstrapping routine to evaluate the significance of the likelihood ratio tests. Finally, we test the assumptions of our approach using forward population-genetic simulations.
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
We take an alternative approach for measuring divergence. The temporal structure of longitudinal data allows us to infer multiple substitutions at individual sites by simple inspection of the time series. This approach is sound as long as sampling times are frequent enough so that multiple substitutions do not typically occur in between timepoints. We have developed a simple algorithm for determining when multiple substitutions have occurred between a given sequence and the progenitor sequence. Also, we have developed an algorithm for using the temporal information in the data to more accurately delineate between nonsynonymous and synonymous substitutions when more than one change has occurred in the same codon. These methods are described in detail in the Supplementary Material.
Data and Analysis Scheme for Each Patient
Nucleotide sequences in longitudinal data sets were obtained for all patients from the Shankarappa et al. (1999) study, except for patient 11, which was excluded because of smaller sample sizes and infrequent sampling. First, the progenitor sequence in each patient was approximated as the consensus sequence of the earliest sample. Next, the progenitor sequence and all of the sequences in the remaining timepoints were aligned using the default parameters of ClustalX (Thompson et al. 1997) and then hand corrected. In data sets from most patients, there were regions that could not be aligned for the whole data set; these regions were excluded from the analysis. In terms of the nucleotide position in the progenitor sequence (excluding gaps), these regions are p1 (574 to 588), p2 (382 to 411 and 577 to 597), p5 (573 to 579), p8 (382 to 387 and 565 to 585), and p9 (392 to 414 and 558 to 594). After alignment, nonsynonymous, synonymous, and total nucleotide divergence values were tabulated for each sequence using the methods outlined in the Supplementary Material. Our estimates of mean divergence (synonymous and nonsynonymous) within each patient are given as a function of time in figure 1. The individual sequence values were used in the series of statistical procedures described below. Nucleotide sequence alignments and the source code for the programs that implement the divergence tabulation are available from the corresponding author.
|
Our constant-rate model assumes that the divergence rate () is constant throughout infection. This serves as a null model for testing divergence stabilization. Let u(y, t) be the probability density function of the population divergence, y, at time t. u is governed by the following partial differential equation:
![]() | (1) |
If we ignore the boundary condition, the solution of equation (1) is a normal density function. For the constant-rate model, the mean is t and the variance is
2t. The same variance is obtained for the divergence-stabilization model, but the mean is
t for t <
and
for t >
. We use this approximation to develop the maximum-likelihood framework described below. However, the boundary is explicitly included in the parametric bootstrapping simulations that we use to evaluate our likelihood-ratio tests.
Two sorts of stochasticity must be considered: evolutionary fluctuations and sampling variation. Let i = 0, 1, 2, ... k denote the samples taken at different timepoints, in chronological order. In each patient, we use sample 0 to infer the progenitor sequence of all subsequent samples. Early in infection, the viral population is genetically homogeneous within the env gene (Zhang et al. 1993). Therefore, we approximate the progenitor sequence as the consensus sequence of sample 0. For samples 1 through k, let ti denote the time (since seroconversion) that the sample was taken. Let yi denote the (unobserved) population divergence at time ti, measured as the number of differences per site from the progenitor. Let ni be the sample size of each timepoint. Let xij denote the divergence of sequence j in sample i, in terms of the number of differences per site from the progenitor. Conditional on each yi, we assume that the xij are also normally distributed:
![]() | (2) |
![]() | (3) |
![]() | (4) |
![]() | (5) |
There is a well-developed theory for estimation of parameters from multinormal data (Searle, Casella, and McCulloch 1992). Let x denote the vector of the xij's, with the observations from different timepoints concatenated sequentially. The constant-rate model contains k+2 parameters: k population variances () and the drift (
) and diffusion (
2) parameters of equation 1. Under this model, the log-likelihood is
![]() | (6) |
The divergence-stabilization model contains k+3 parameters: k population variances, ,
2, and
. The only effect this new parameter has on the likelihood function is within the vector of expected values. Call this new vector, p:
![]() | (7) |
![]() | (8) |
To fit the divergence-stabilization model for a single data set, we conducted a series of optimizations over the range of possible values for . We started with the largest possible value for
, which is tk, then maximized the log-likelihood function with a fixed value of
; that is, the profile log-likelihood. From this optimization, we recorded the maximum-likelihood estimates for
and
2 and the maximum log-likelihood value. We then reduced the value of
by 1 month and maximized the profile log-likelihood again by finding optimal values for
and
2. This process was repeated with each successive iteration considering a lower value of
. An example of this procedure, the actual analysis of nonsynonymous evolution in patient 1, is given in figure 2. Thus, for both synonymous and nonsynonymous evolution within each patient, the model-fitting procedure yields a set of log-likelihood values defined for each possible value of
.
|
Hypothesis Testing
It is possible to construct a test within each patient by comparing the maximum likelihoods associated with each model. An evolutionary slowdown is indicated if the maximum likelihood for the divergence-stabilization model is sufficiently greater than the maximum likelihood for the constant-rate model. Because the constant-rate model is nested within the divergence-stabilization model (the divergence-stabilization model reduces to the constant-rate model for = tk), a likelihood-ratio test (LRT) is appropriate. The LRT statistic is
, where
and
are the maximum-likelihood estimates (MLEs) under the divergence-stabilization models, and
and
are the MLEs under the constant-rate model. We calculated this statistic for both nonsynonymous and synonymous evolution within each patient.
Both evolutionary variability and small sample sizes limit the power of tests based on individual patients. Thus, to distinguish immune relaxation from cellular exhaustion, we also performed two analyses that combine data across patients. These combined analyses are possible because evolution in different patients is independent. As a consequence, we can sum the (maximum) log-likelihoods across patients for the different evolutionary models. This allows us to compare the constant-rate and divergence-stabilization models with combined LRTs using all the nonsynonymous or synonymous data simultaneously. If divergence stabilization occurs in all patients, these combined LRTs will have more power to detect it. In the combined analyses, the log-likelihood value for the constant-rate model is simply the sum of the maximum log-likelihood values for this model within each patient.
We consider two different versions of divergence stabilization as alternatives to the constant-rate model. The first version posits that divergence stabilization is directly related to disease progression. In each patient, divergence stabilization occurs at time Tp + , where Tp is the time of disease progression, and
is constant across all patients. We define Tp as the first time at which the patient's CD4+ T cell count dropped below 300 cells/ml. Hereafter, we will refer to this model as the single-
divergence-stabilization model. The maximum log-likelihood for the model is obtained by finding the value for
that maximizes the sum of patient-specific log-likelihoods with
= Tp +
within each patient. The second variant of the divergence-stabilization model, which we refer to as the free-
model, allows the time of divergence stabilization (
) to vary among patients without constraint.
If the constant-rate model were true, we might expect the likelihood-ratio test statistic to follow a chi-square distribution, with the number of degrees of freedom equal to the difference in the number of parameters between the models that are being compared. In the present context, the comparison between the single- divergence-stabilization model and the constant-rate model would involve a single degree of freedom because the more complicated model adds only one parameter (
). The degrees of freedom for the free-
model would be equal to the number of patients, because
is estimated separately in each patient. Unfortunately, the asymptotic theory that justifies use of the chi-square distribution for evaluating LRT statistics is not applicable in this case. The parameters
2 and
are close to the edges of their respective ranges, and the sample sizes in each timepoint are relatively small (
10 to 20 sequences). Therefore, we use parametric bootstrapping to approximate the null distributions of our LRT statistics (Davison and Hinkley 1997).
It is straightforward to simulate divergence under either the constant-rate or the divergence-stabilization model for any particular set of parameter values. First, we simulated the population divergence recursively over small time steps (t = 0.1 months): starting at y = 0 at time 0, the population divergence at the next time step, y(t +
t), is drawn from a normal distribution with mean y(t) +
*
t and variance
2*
t. To incorporate the boundary, negative values were reset to 0. Next, for the sampling times (ti) in each particular patient, the divergence values for individual sequences were simulated following equation 2 using the simulated population divergences, where the sampling variance was estimated from the data. For the analyses that consider data from each patient separately, we used our maximum-likelihood estimates for
and
2 from the fitted constant-rate model to produce a set of 500 simulated data sets for each patient in the Shankarappa et al. (1999) study. We then fitted both models to each simulated data set, recorded the maximum log-likelihoods for each model, and calculated the LRT (see Kelly [2003] for an application of the parametric bootstrap to multinormal data). The null distributions for our single-
and free-
LRTs were obtained by combining simulation results from the individual patients. We repeatedly reconstituted whole data sets (involving eight patients) by randomly selecting simulated data sets from each patient. The combined LRTs were calculated for these reconstituted data sets and the values recorded. This procedure was repeated 10,000 times to estimate the distributions.
Forward Simulations
In developing our constant-rate and divergence-stabilization models, we have made two main assumptions: (1) the mean divergence of the entire population is accurately described by a diffusion process, and (2) the boundary condition at y = 0 does not strongly affect our method of inference. To assess the validity of these assumptions under different evolutionary scenarios, we performed forward population-genetic simulations of sequence evolution. These simulations incorporated natural selection, recombination, and mutation.
Each iteration proceeded as follows: we started with a genetically homogeneous population of 2N sequences, each of length L. This formed the ancestral population. Reproduction proceeded according to a Wright-Fisher model for diploidsthat is, with nonoverlapping generations, random mating, and a constant population size. Mutation occurred at rate µ per sequence (not per base pair) per generation, and the mutation rate was uniform over the whole sequence. When a mutation occurred at a previously mutated site, it reverted back to the ancestral state. Recombination occurred at rate r per sequence per generation, and recombination was modeled as a crossover process occurring only at homologous sites within diploid individuals. We did not allow insertions or deletions. For the simulations with natural selection, a certain proportion of new mutations, ps, were subject to selection, and the remainder were neutral. The selection coefficient s describes the relative advantage of the new mutation over the ancestral allele at that site. The selective effects of multiple mutations within a sequence were combined multiplicatively. At 10-month intervals (the approximate interval between sampling in the Shankarappa et al. [1999] study), the mean divergence of the entire population was tabulated relative to the ancestral population, not correcting for back mutation. This was repeated up to 100 months. For each parameter combination, the diploid population size and sequence length were held constant (N = 1000, L = 650), and we assumed that one generation occurred every 2 days (Rodrigo et al. 1999, Fu 2001, Drummond et al. 2002). The mutation rate, µ, was selected such that the sequences diverged by approximately 1% per year, which is typical of the Shankarappa et al. (1999) data. The simulations were iterated 500 times for each parameter combination.
Figure 3 shows the results of the forward simulations for several different combinations of recombination and selection parameters. The (unbounded) diffusion models presented above predict that, among replicate populations (1) both the mean and variance of population divergence should increase linearly with time, (2) at any timepoint, the population divergence should be normally distributed, and (3) because of the memoryless property of the diffusion, the covariance in population divergence among two different timepoints should equal the variance of the earlier timepoint. Figures 3c and d indicate that prediction (1) holds for different recombination rates and selection regimes: both the mean and variance increase linearly. Figure 3a shows the distribution of population divergence at different timepoints for the case of neutral evolution with a high recombination rate, along with a fitted normal distribution. The normality prediction seems to hold very well. The same pattern is observed for the other parameter combinations, including those with selection. Finally, figure 3b plots the covariance between timepoints as a function of the variance of the earlier timepoint. The agreement with prediction (3) is quite good. Taken together, our forward simulations indicate that, for a range of evolutionary models, mean population divergence is very well approximated by a diffusion, and our assumption regarding the boundary does not have a strong affect on our probability model.
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
|
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The support for the immune-relaxation hypothesis in the present set of analyses corroborates several common ideas about HIV evolution. First, our results are fully consistent with the notion that the immune system is the primary agent of positive selection in the C2-V5 region of env (Bonhoeffer, Holmes, and Nowak 1995; Wolinsky et al.1996; Ross and Rodrigo 2002). Second, our results suggest that HIV disrupts those agents of the immune system that play a critical role in controlling infection. Immunological studies support the existence of such a feedback loop. Douek et al. (2002) found that the virus preferentially infects HIV-specific CD4+ T cells. Because HIV-specific CD4+ T cells are thought to play an important role in controlling viremia (e.g., Rosenberg et al. 1997), Douek et al. (2002) proposed that this viral specificity contributes to the uncontrolled viral replication that coincides with progression to AIDS. Our analyses further suggest that the interaction between the virus and HIV-specific immune effectors, such as HIV-specific CD4+ T cells, alters the course of viral evolution.
In a previous analysis, Bonhoeffer, Holmes, and Nowak (1995) observed a decrease in the dN/dS ratio of the env gene V3 loop over time within a single infected patient. They attributed this to a reduction in selection intensity over the course of infection. However, Nielsen (1999) pointed out that such a change could also be caused by the nonequilibrium dynamics of an evolving virus population. Because the population is initially genetically homogeneous, it takes some time for new mutations to spread to detectable frequencies, where they could be counted as substitutions. This transition time will be longer for neutrally evolving synonymous mutations than for positively-selected nonsynonymous mutations, so the dN/dS ratio would decrease over time to some equilibrium, without a change in selection intensity. In contrast, we do not expect nonequilibrium dynamics to have a large effect on our analyses. We follow divergence at each type of site separately. The founder effect of initial infection will cause the divergence rate at each type of site to increase monotonically until it reaches some equilibrium. It will not cause the divergence rate to drop back to 0, as is the case in our divergence-stabilization model. Therefore, our test for a decrease in the divergence rate is conservative with respect to the founder effect.
Relation to Other Methods
Our diffusion-based analysis is complementary to three other approaches that have been taken in analyzing longitudinal data sets: phylogenetic methods, coalescence theory, and regression models. Phylogenetic methods allow powerful inferences regarding natural selection and other population processes. Indeed, a phylogenetic tree has become the de facto method for summarizing the information in a longitudinal data set (e.g., Wolinsky et al. 1996; Ganeshan et al. 1997; Markham et al. 1998; Shankarappa et al. 1999; Yamaguchi-Kabata and Gojobori 2000), and phylogenetic methods based on codon-substitution models (Nielsen and Yang 1998) have been used to identify widespread positive selection in longitudinal data sets (e.g., Zanotto et al. 1999, Ross and Rodrigo 2002). A potential difficulty in applying phylogenetic methods to longitudinal data sets is that recent studies suggest that the recombination rate of HIV populations in vivo is high (Jung et al. 2002, Wain-Hobson et al. 2003). Phylogenetic methods generally assume no recombination. Although the effects of violating the no-recombination assumption are not fully understood, simulations have shown that even small amounts of recombination have a large effect on tree statistics and can exaggerate of the degree of substitution-rate heterogeneity (Schierup and Hein 2000). Also, unrecognized recombination can cause codon-based phylogenetic methods to falsely identify positive selection (Shriner et al. 2003; Anisimova, Yang, and Nielsen 2003). In general, results from phylogenetic analyses of longitudinal data sets can be difficult to interpret in the presence of recombination.
A second approach taken in analyzing longitudinal data sets is coalescence theory (Kingman 1982; Hudson 1983), which has been adapted specifically for longitudinal data sets ("the serial coalescent" of Rodrigo and Felsenstein [1999]). Because the serial coalescent is based on a population process, it can be used to estimate important population parameters, such as the within-patient effective population size (Leigh-Brown 1997, Seo et al. 2002, Drummond et al. 2002) and the generation time (Rodrigo et al. 1999, Fu 2001, Drummond et al. 2002). The coalescent approach makes two assumptions that may be violated in the env gene. First, current implementations of the serial coalescent assume no recombination. Second, the serial coalescent assumes neutral evolution, whereas natural selection is thought to play an important role in env. Modifications to the coalescent that allow for selection (e.g., Neuhauser and Krone 1997, Barton and Etheridge 2004) are not applicable to the type of selection common in HIV: strong positive and diversifying selection at many linked sites.
Regression modeling is a third approach that has been taken in analyzing longitudinal data from HIV-infected patients. Mayer-Hamblett and Self (2001) describe a set of regression models for characterizing genetic diversity within and among samples of sequences. Applying these models to the Shankarappa et al. (1999) data set, they conclude that viral diversity (variation among sequences within a single sample) increases over most of the course of infection but declines at the end. However, in contrast to the analysis presented here, they failed to confirm a stabilization of divergence towards the end of infection. The differing conclusions of our study and that of Mayer-Hamblett and Self (2001) probably stem, in part, from differences in our respective modeling approaches. However, the most important difference is likely that we distinguish synonymous and nonsynonymous changes. Only the latter seems to provide compelling evidence of divergence stabilization.
By following the mean population divergence from the progenitor sequence rather than the genealogy of all sequences, our analysis is fairly permissive in terms of population processes such as recombination and natural selection. Our null model requires only that the underlying rate of substitution is approximately constant through time. For example, consider the case of two linked loci subject to neutral evolution, sampled in a single individual. In this situation, past recombination events will not affect the number of substitutions at either locus (or the total number of substitutions), simply because each locus is expected to experience the same number of generations whether or not recombination has occurred. Also, consider the case of frequent selective sweepsthat is, rapid fixation events of advantageous mutations. If the rate of advantageous mutation is constant, then the rate of substitution will also be constant. Population-genetic simulations support these arguments: the diffusion model seems to provide an adequate description of mean population divergence through time even when both recombination and natural selection are operating.
The sufficiency of the diffusion models stems largely from the fact that we are considering only one aspect of the complex data structure from longitudinal studies. The models make no prediction about the pattern of variation among sequences within a single sample, either in the extent to which they differ from the progenitor or from each other. Such measures of viral diversity exhibit large fluctuations over the course of infection, even within a single patient (Shankarappa et al. 1999; Mayer-Hamblett and Self 2001). Thus, one advantage of our approach is that it is relatively insensitive to these large-scale fluctuations. One disadvantage, however, is that such fluctuations in viral diversity may play an important role in disease progression (Nowak et al. 1991), and our models cannot address this problem. A major challenge in the analysis of longitudinal data sets lies in developing methods that characterize changes in divergence, diversity, and the relationship between the two.
![]() |
Supplementary Material |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
To summarize, when the second derived nucleotide is first observed, we infer that it has arisen from whatever character state is found in the closest (in terms of pairwise differences) previous sequence. This routine does not account for back mutation.
In addition to correcting for multiple substitutions, we can also use the temporal information in longitudinal data sets to more accurately delineate between nonsynonymous and synonymous substitutions. A common problem in DNA sequence analysis lies in counting the numbers of nonsynonymous and synonymous substitutions when there have been multiple changes in a single codon. Consider the following example for the changes in a particular codon: AGT (Ser) GGA (Gly). Assuming that only two changes have occurred, the potential intermediaries are:
|
If AGA is the true intermediate, then there have been two nonsynonymous changes, but if GGT is the true intermediate, then there have been one synonymous and one nonsynonymous change. There are several possible ways to weight the likelihood of the different paths, and, consequently, different sequence analysis programs can produce different counts for the numbers of nonsynonymous and synonymous sites. However, with longitudinal data sets, when multiple changes have occurred in the same codon, we can simply look back at previous timepoints and infer which path was taken. We propose the following algorithm for counting nonsynonymous and synonymous changes when there have been two changes in the same codon and when the different intermediaries produce different counts:
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Anderson, J. P., A. G. Rodrigo, G. H. Learn, Y. Wang, H. Weinstock, M. L. Kalish, K. E. Robbins, L. Hood, and J. I. Mullins. 2001. Substitution model of sequence evolution for the human immunodeficiency virus type 1 subtype B gp120 gene over the C2-V5 region. J. Mol. Evol. 53:5562.[ISI][Medline]
Anisimova, M., R. Nielsen, and Z. Yang. 2003. Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics 164:12291236.
Balfe, P., P. Simmonds, C. A. Ludlam, J. O. Bishop, and A. J. Brown. 1990. Concurrent evolution of human immunodeficiency virus type 1 in patients infected from the same source: rate of sequence change and low frequency of inactivating mutations. J. Virol. 64:62216233.[ISI][Medline]
Barton, N. H., and A. M. Etheridge. 2004. The effect of selection on genealogies. Genetics 166:11151131.
Bonhoeffer, S., E. C. Holmes, and M. A. Nowak. 1995. Causes of HIV diversity. Nature 376:125.[CrossRef][ISI][Medline]
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap methods and their application. Cambridge University Press, Cambridge, UK.
Douek, D. C., J. M. Brenchley, M. R. Betts et al. (15 co-authors). 2002. HIV preferentially infects HIV-specific CD4+ T cells. Nature 417:9598.
Drummond, A. J., G. K. Nicholls, A. G. Rodrigo, and W. Solomon. 2002. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics 161:13071320.
Durbin, J., and S. J. Koopman. 2001. Time series analysis by state space methods. Oxford University Press, New York.
Finzi, D., J. Blankson, J. D. Silicano et al. (16 co-authors). 1999. Latent infection of CD4 T cells provides a mechanism for lifelong persistence of HIV-1, even in patients on effective combination therapy. Nat. Med. 5:512517.[CrossRef][ISI][Medline]
Finzi, D., M. Hermankova, T. Pierson et al. (15 co-authors). 1997. Identification of a reservoir for HIV-1 in patients on highly active antiretroviral therapy. Science 278:12951300.
Fu, Y.-X. 2001. Estimating mutation rate and generation time from longitudinal samples of DNA sequences. Mol. Biol. Evol. 18:620626.
Ganeshan, S., R. E. Dickover, B. T. M. Korber, Y. J. Bryson, and S. M. Wolinsky. 1997. Human immunodeficiency virus type 1 genetic evolution in children with different rates of development of disease. J. Virol. 71:663677.[Abstract]
Goulder, P. J. R., C. Brander, Y. Tang et al. (19 co-authors). 2001. Evolution and transmission of stable CTL escape mutations in HIV infection. Nature 412:334338.[CrossRef][ISI][Medline]
Graur, D., and W.-H. Li. 2000. Fundamentals of molecular evolution. Sinauer Associates, Sunderland, Mass.
Harvey, A. C. 1989. Forecasting, structural time series models and the Kalman filter. Cambridge University Press, Cambridge, UK.
Hillis, D. M., J. P. Huelsenbeck, and C. W. Cunningham. 1994. Application and accuracy of molecular phylogenies. Science 264:671677.[ISI][Medline]
Holmes, E. C., L. Q. Zhang, P. Simmonds, C. A. Ludlam, and A. J. Brown. 1992. Convergent and divergent sequence evolution in the surface envelope glycoprotein of human immunodeficiency virus type 1 within a single infected patient. Proc. Natl. Acad. Sci. USA 89:48354839.
Hudson, R. R. 1983. Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23:183201.[ISI][Medline]
Jung A, R. Maier, J. P. Vartanian, G. Bocharov, V. Jung, U. Fischer, E. Meese, S. Wain-Hobson, and A. Meyerhans. 2002. Multiply infected spleen cells in HIV patients. Nature 418:144.[CrossRef][ISI][Medline]
Kelly, J. K. 1994. An application of population genetic theory to synonymous gene sequence evolution in the human immunodeficiency virus (HIV). Genet. Res. 64:19.[ISI][Medline]
Kelly, J. K. 1996. Replication rate and evolution in the human immunodeficiency virus. J. Theor. Biol. 180:359364[CrossRef][ISI][Medline]
Kelly, J. K. 2003. Deleterious mutations and the genetic variance of male fitness components in Mimulus guttatus. Genetics 164:10711085.
Kelly, J. K., and H. S. Arathi. 2003. Inbreeding and the genetic variance in floral traits of Mimulus guttatus. Heredity 90:7783.[CrossRef][ISI][Medline]
Kelly, J. K., S. Williamson, M. E. Orive, M. S. Smith, and R. D. Holt. 2003. Linking dynamical and population genetic models of persistent viral infection. Am. Nat. 162:1428.[CrossRef][ISI][Medline]
Kingman, J. F. C. 1982. The coalescent. Stoch. Proc. Appl. 13:235248.[CrossRef]
Larder, B. A., G. Darby, and D. D. Richman. 1989. HIV with reduced sensitivity to zidovudine (AZT) isolated during prolonged therapy. Science 243:17311734.[ISI][Medline]
Leigh-Brown, A. J. 1997. Analysis of HIV-1 env gene sequences reveals evidence for a low effective number in the viral population. Proc. Natl. Acad. Sci. USA 94:18621865.
Markham, R. B., W.-C. Wang, A. E. Weisstein et al. (11 co-authors) 1998. Patterns of HIV-1 evolution in individuals with differing rates of CD4 T cell decline. Proc. Natl. Acad. Sci. USA 95:1256812573.
Mayer-Hamblett, N, and S. Self. 2001. A regression modelling approach for describing patterns of HIV genetic variation. Biometrics 57:449460.[CrossRef][ISI][Medline]
McMichael, A. J., and R. E. Phillips. 1997. Escape of human immunodeficiency virus from immune control. Annu. Rev. Immunol. 15:271296.[CrossRef][ISI][Medline]
Moriyama, E. N., Y. Ina, K. Ikeo, N. Shimizu, and T. Gojobori. 1991. Mutation pattern of human immunodeficiency virus gene. J. Mol. Evol. 32:360363.[ISI][Medline]
Neuhauser, C., and S. M. Krone. 1997. The genealogy of samples in models with selection. Genetics 145:519534.
Nielsen, R. 1999. Changes in ds/dn in the HIV-1 env gene. Mol. Biol. Evol. 16:711714.
Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929936.
Nowak, M. A., R. M. Anderson, A. R. McLean, T. F. Wolfs, J. Goudsmit, and R. M. May. 1991. Antigenic diversity thresholds and the development of AIDS. Science 254:963969.[ISI][Medline]
Richman, D. D., T. Wrin, S. J. Little, and C. J. Petropoulos. 2003. Rapid evolution of the neutralizing antibody response to HIV type 1 infection. Proc. Natl. Acad. Sci. USA 100:41444149.
Rodrigo, A. G., and J. Felsenstein. 1999. Coalescent approaches to HIV population genetics. Pp. 233272 in K. A. Crandall, ed. The evolution of HIV. John Hopkins University Press, Baltimore.
Rodrigo, A. G., E. G. Shaper, E. L. Delwart, A. K. Iversen, M. V. Gallo, J. Brojatsch, M. S. Hirsch, B. D. Walker, and J. I. Mullins. 1999. Coalescent estimates of HIV-1 generation time in vivo. Proc. Natl. Acad. Sci. USA 96:21872191.
Rosenberg, E. S., J. M. Billingsley, A. M. Caliendo, S. L. Boswell, P. E. Sax, S. A. Kalams, and B. D. Walker. 1997. Vigorous HIV-1-specific CD4+ T cell responses associated with control of viremia. Science 278:14471450.
Ross, H. A., and A. G. Rodrigo. 2002. Immune-mediated positive selection drives human immunodeficiency virus type 1 molecular variation and predicts disease duration. J. Virol. 76:1171511720.
Schierup, M. H., and J. Hein. 2000. Consequences of recombination on traditional phylogenetic analysis. Genetics 156:879891.
Schuitemaker, H., M. Koot, N. A. Kootstra, M. W. Dercksen, R. E. de Goede, R. P. van Steenwijk, J. M. Lange, J. K. Schattenkerk, F. Miedema, and M. Tersmette. 1992. Biological phenotype of human immunodeficiency virus type 1 clones at different stages of infection: progression of disease is associated with a shift from monocytotropic to T-cell-tropic virus population. J. Virol. 66:13541360.[Abstract]
Searle, S. R., G. Casella, and C. E. McCulloch. 1992. Variance components. John Wiley and Sons, New York.
Seo, T.-K, J. L. Thorne, M. Hasegawa, and H. Kishino. 2002. Estimation of effective population size of HIV-1 within a host: a pseudomaximum-likelihood approach. Genetics 160:12831293.
Shankarappa, R. 1999. Evolution of HIV-1 resistance to antiviral agents. Pp. 469490 in K. A. Crandall, ed. The evolution of HIV. John Hopkins University Press, Baltimore.
Shankarappa, R., J. B. Margolick, S. J. Gange et al. (12 co-authors). 1999. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J. Virol. 73:1048910502.
Shriner, D., D. C. Nickle, M. A. Jensen, and J. I. Mullins. 2003. Potential impact of recombination on sitewise approaches for detecting positive natural selection. Genet. Res. 81:115121.[CrossRef][ISI][Medline]
Sokal, R. R., and F. J. Rohlf. 1995. Biometry, 3rd ed. W. H. Freeman and Company, New York.
Tersmette, M., R. A. Gruters, F. de Wolf, R. E. de Goede, J. M. Lange, P. T. Schellekens, J. Goudsmit, H. G. Huisman, and F. Miedema. 1989. Evidence for a role of virulent human immunodeficiency virus (HIV) variants in the pathogenesis of acquired immunodeficiency syndrome: studies on sequential HIV isolates. J. Virol. 63:21182125.[ISI][Medline]
Thompson, J. D., T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins. 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 24:48764882.[CrossRef]
Viscidi, R. P. 1999. HIV evolution and disease progression via longitudinal studies. Pp. 346389 in K. A. Crandall, ed. The evolution of HIV. John Hopkins University Press, Baltimore.
Wain-Hobson, S., C. Renoux-Elbé, J.-P. Vartanian, and A. Meyerhans. 2003. Network analysis of human and simian immunodeficiency virus sequence sets reveals massive recombination resulting in shorter pathways. J. Gen. Virol. 84:885895.
Williamson, S. 2003. Adaptation in the env gene of HIV-1 and evolutionary theories of disease progression. Mol. Biol. Evol. 20:13181325.
Wodarz, D., P. Klenerman, and M. A. Nowak. 1998. Dynamics of cytotoxic T-lymphocyte exhaustion. Proc. R. Soc. Lond. B Biol. Sci. 265:191203.[CrossRef][ISI][Medline]
Wolfs, T. F., J. J. de Jong, H. Van den Berg, J. M. Tijnagel, W. J. Krone, and J. Goudsmit. 1990. Evolution of sequences encoding the principal neutralization epitope of human immunodeficiency virus 1 is host-dependent, rapid, and continuous. Proc. Natl. Acad. Sci. USA 87:99389942.
Wolinsky, S. M., B. T. M. Korber, A. U. Neumann et al. (11 co-authors). 1996. Adaptive evolution of human immunodeficiency virus-type 1 during the natural course of infection. Science 272:537542.[Abstract]
Wolinsky, S. M., and G. H. Learn. 1999. Levels of diversity within and among host individuals. Pp. 275314 in K. A. Crandall, ed. The evolution of HIV. John Hopkins University Press, Baltimore.
Wong, J. K., M. Hezareh, H. F. Günthard, D. V. Havlir, C. C. Ignacio, C. A. Spina, and D. D. Richman. 1997. Recovery of replication-competent HIV despite prolonged suppression of plasma viremia. Science 278:13911300.
Yamaguchi-Kabata, Y., and T. Gojobori 2000. Reevaluation of amino acid variability of the human immunodeficiency virus type 1 gp120 envelope glycoprotein and prediction of new discontinuous epitopes. J. Virol. 74:43354350.
Zanotto, P. M. A., E. G. Kallas, R. F. de Souza, and E. C. Holmes. 1999. Genealogical evidence for positive selection in the nef gene of HIV-1. Genetics 153:10771089.
Zhang, L. Q., P. MacKenzie, A. Cleland, E. C. Holmes, A. J. Brown, and P. Simmonds. 1993. Selection for specific sequences in the external envelope protein of human immunodeficiency virus type 1 upon primary infection. J. Virol. 67:33453356.[Abstract]
|