Human Genetics Center, University of Texas at Houston
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The generation time, or the length of a generation, of an organism is the average length of time between two identical and successive stages in the life cycle of the organism. For example, the generation time of animals of large body size can be defined as the average length of time for an adult to produce another adult; for a virus such as human immunodeficiency virus type 1, the generation time can be defined as the average length of time from the release of the virion until it infects another cell and causes the release of another virion. Generation time not only is part of the biological properties of an organism, but also plays an essential role in analyzing polymorphism data from a population, because population genetic models are usually developed with units of time corresponding to a certain number of generations, rather than days or years.
The life cycle of large animals can be observed easily, and it is usually not a problem to derive a generation time. For example, 20 years is widely used for one human generation. It is difficult to observe the life cycle of small organisms, such as viruses, in vivo, so there is a need to estimate the generation time. DNA polymorphisms in longitudinal sample provide an opportunity to do so when an independent estimate of the mutation rate per generation is available. The purpose of this paper is to present a simple method for estimating mutation rate per site per year which also yields an estimate of generation time when the mutation rate per generation is known. We apply the method to a longitudinal sample from an HIV patient both to illustrate the method and to obtain an estimate of mutation rate and an estimate of generation time for HIV-1. The differences between the new method and the method of Rodrigo and Felsenstein (1999)
will be discussed.
![]() |
The Theory |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Let dkl be the number of nucleotide substitutions per site between sequences k and l (k l).
![]() | (1) |
Suppose that a second sample of m sequences is taken T days later from the same population. For sequences k and l (k l) from the second sample, we have, similar to equation (1)
, that
![]() | (2) |
![]() | (3) |
![]() | (4) |
| (5) |
| (8) |
| (9) |
|
| (10) |
| (11) |
Suppose that samples from r different time points were taken. For each pair of samples, one can obtain an estimate of G. Let Gij be the estimate from samples i and j. Then, there are r(r - 1)/2 estimates of G. How to combine these estimates to obtain an overall estimate of G is not only an interesting theoretical issue, but also of great practical importance. Since each Gij is an unbiased estimate of G, for any weights ij
0 and
ij
ij = 1, the quantity
| (12) |
The simplest method is to take an unweighted average. That is, let ij = 2/[r(r - 1)], resulting in
| (13) |
![]() | (14) |
| (15) |
Variance
The variance of any of the G estimators is extremely complex even under the simple situation of constant population size. Not only is the formula for individual Var(Gij) intractable, but the differences between Gij values are correlated to each other. We therefore propose the use of bootstrap samples for estimating the variance of LT (or GT).
Let ni be the number of sequences in sample i (i = 1, ... , r). A bootstrap sample will consist of r subsamples, with the ith subsample obtained by selecting ni sequences with replacement from the ni sequences. This stratified bootstrap is reasonable in the absence of detailed knowledge of the dynamics of the population being studied. The bootstrap estimate of the variance of LT is obtained by the following steps: (1) carry out bootstrap sampling for each of the r samples, (2) compute the value of LT from the bootstrap samples, and (3) repeat steps 1 and 2 many times; the resulting sampling variance of LT is then an estimate of the variance of LT.
It should be noted that there are two levels of variance associated with LT. One is due to the stochasticity of evolution, which leads to differences among replicates of the same evolutionary process. The other is due to the effect of sampling. The bootstrap estimate of variance described above only accounts for the variance due to sampling. Although one needs to be cautious in interpreting this component of variance, it nevertheless gives a lower bound of the total variance. When multiple populations are available which can be considered replicates of the same evolutionary process (e.g., the HIV populations in different human hosts), a bootstrap resampling procedure consisting of sampling from both within a replicate and among replicates will give an estimate of the total variance.
![]() |
Application to HIV |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The 0.65-kb region of the env gene spanning the third to the fifth variable regions was sequenced for each virus in the sample. We will utilize the same sequence alignment as that in Rodrigo et al. (1999)
. For simplicity, we will consider only nucleotide substitutions. Three methods for computing the number of nucleotide substitutions between two sequences are considered. One is the number of nucleotide differences, the second is the distance using Jukes-Cantor correction, and the third is the distance corrected using Kimura's two-parameter model. The sample size and time lengths between samples, as well as the
values by the first and third methods, are given in table 1
.
|
As we pointed out earlier, one can obtain an estimate of the generation time when the mutation rate per site per generation is known. Mansky (1996)
estimated that the overall mutation rate per site per generation was 4 x 10-5, which includes base substitution, frameshift, deletion, and insertion. Since we only consider base substitutions in our analysis, it is necessary to use the mutation rate for base substitution only. Using Mansky's data (table 4 in Mansky 1996
) that there are 15 base substitutions in 5,272 shuttle vector proviruses with a target segment of 114 nt, we obtain a nucleotide substitution rate of 15/(5,272 x 114) = 2.5 x 10-5 per site per generation. The pairwise estimates of the number G of generations per day are given in table 2
.
|
|
|
![]() |
Alternative Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Consider a sample of m sequences from a haploid population with a constant effective population size N. If one examines the history of these m sequences by tracing backward in time, one will find that there is a period in which there are m ancestral sequences, a period in which there are m - 1 ancestral sequences, and so on. The time length ti of the period in which there are i ancestral sequences has an exponential distribution with a mean equal to 2N/i(i - 1) (Kingman 1982a). The number of generations from time t + T back to the first sampling time t at which there are l ancestral sequences is Gl = tm + ... + tl+1, whose expectation is equal to
| (17) |
The effective population size N can be estimated from an estimate of = 2Nµ, where µ is the mutation rate per generation per site. There are a number of methods available for estimating
(see review by Li and Fu 1999
). For example, since
2 has a mean equal to
under the assumption of a constant effective population size, one can estimate N as
=
2/(2µ). Rodrigo et al. (1999)
used a more complex method and obtained estimates of N similar to those by Brown and Richman (1997)
. The value of m - l was estimated from a rooted phylogeny of the sequences from both samples as the number of coalescent events among the sequences in the second samples. Rodrigo et al. (1999)
used the neighbor- joining method (Saitou and Nei 1987
) for phylogeny reconstruction and used an outgroup sequence to root the tree. Note that another tree-based approach for estimating mutation rate is proposed by Rambaut (2000)
.
Using the approach described above, Rodrigo et al. (1999)
obtained an estimate of the generation time for HIV-1 of 1.2 days. At first glance, their estimate appears to be comparable with our estimate of 1.78 days, but the two estimates cannot be directly compared for two reasons. First, Rodrigo et al. (1999)
used a different approach than ours to combine pairwise estimates. Second, we use here a different mutation rate. Although only nucleotide substitutions were considered, Rodrigo et al. (1999)
nevertheless used the mutation rate 4 x 10-5 compiled by Mansky (1996), which includes both insertions and deletions. When only nucleotide substitutions are counted, the mutation rate from Mansky's data becomes 2.5 x 10-5 per site per generation. With this mutation rate, pairwise estimates of G computed from table 2 of Rodrigo et al. (1999)
become those in table 4
. Comparison of table 4
with table 2
reveals that Rodrigo et al.'s (1999)
estimate of G is considerably larger than ours in every case, and is on average more than twice as large as ours. Although some of the differences must be due to the variances in both estimates, it is highly unlikely that random errors alone can result in such systematic differences. Some possible causes for this discrepancy will be discussed later.
A very different technique for estimating generation time using within-host longitudinal viral load data has been developed (Coffin 1995; Wei et al. 1995
; Perelson et al. 1996
). This approach is based on the principle that when a potent drugsuch as Ritonavir which is a protease inhibitor, is administered to a patient, the rate of loss of virions in plasma can be modeled by a set of differential equations with a few parameters, which can be estimated from the longitudinal viral load data. The values of these parameters can then be used to estimate the generation time. The estimates of generation time from this technique vary from about 4 days by Wei et al. (1995)
to 2.6 days by Perelson et al. (1996)
. The latter group have recently revised their estimate to 1.8 days (Rodrigo et al. 1999
), which agrees well with our estimate of 1.78 days.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In the case of population growth or, in general, varying effective population size, it is easy to see that is an unbiased estimator of v because equations (5) and (7)
hold regardless of the value of effective population size. This property of our estimator is particularly important for its application to fast-changing viral populations such as HIV-1, because a within-host population can change dramatically in size over a short period of time.
The estimator is also unbiased in the presence of population structure, because regardless of population structure, every sequence in the second sample experiences the same amount of time since the time at which the first sample was taken. As long as a consistent sampling strategy is used from different samples, equations (5) and (7)
hold regardless of population structure.
It is also obvious that recombination does not introduce bias in our estimate of either, because equations (5) and (7)
hold in the presence of recombinations. Of course, this is not to say that nonconstant effective population size, population structure, and recombination have no effect on our estimate, because they do affect the variance of the estimator.
Natural selection is an important factor to consider when analyzing samples from viral populations such as HIV-1. When the DNA region under study is not directly involved in natural selection, our estimator should remain nearly unbiased. This includes the situation in which the region under study is tightly linked to a locus that is under strong natural selection. For example, if natural selection has led to the fixation of a favorable mutation before sampling starts, then its effect is very similar to that of a growing population and thus will not lead to bias in our estimator. When many mutations in the samples are not selectively neutral, the accumulation of nucleotide changes in a sequence in a given period may deviate from Poisson distribution, and the substitution rate can be elevated or reduced depending on the type of natural selection. In the case of deleterious mutations, the mutation rate per year estimated from equation (13) is likely to be smaller than that extrapolated by mutation rate per site per generation and generation time. This will result in an overestimate of the generation time. On the other hand, if positive selection is involved, the substitution rate per site per year will be elevated, which will result in an underestimate of the generation time. One way to minimize the effect of natural selection is to conduct analyses on synonymous substitutions only. With more and more data available, such analyses should be very informative. Since our analysis in this paper is mainly for the purpose of illustration, and also because of the relative small samples, we do not pursue the more detailed analysis.
Since the number of mutations that substantially enhance viral survival should be small compared with the total number of mutations, the bias in our estimate of v due to positive selection is unlikely to be substantial. Nevertheless, since positive selection is likely operating on the env gene of HIV-1 (e.g., Bonhoeffer, Holmes, and Nowak 1995
; Yamaguchi and Gojobori 1997
; Zhang et al. 1997
), our estimate of the generation time may be slightly affected. Another potential source of error in the estimate of generation time is the mutation rate per site per generation. For example, if the mutation rate is underestimated, then it will result in an underestimate of the generation time. With these caveats, it is encouraging that our estimate of generation time agrees well with the recent estimate of 1.8 days from viral load data (see Rodrigo et al. 1999
). It will be interesting to see if the agreement continues to hold with increasing data.
What causes Rodrigo et al.'s (1999)
estimates of G to be consistently larger than ours? Although the variances in both estimators may lead to fluctuation, the discrepancy is likely due to some fundamental differences between the two estimators. One similarity between the two estimators of generation time is that both rely on estimates of the numbers of generations per day. However, our method is more direct and is unbiased for estimating the number of generations per day, while Rodrigo and Felsenstein's (1999)
method is indirect, relying on estimates of both the effective population size and the number of coalescent events among the sequences in sample 2 in the period that separates the two samples. Counting the coalescent events directly from an estimated phylogeny will likely overestimate this number even if the phylogeny is perfectly reconstructed. The simple analysis below will reveal why this is so.
Consider two random samples with sizes n and m, respectively, taken at the same time. Their coalescences will be like that for a single sample of n + m sequences. A coalescent event will be counted as a coalescence between sequences from sample 2, or simply a coalescence within sample 2, if and only if neither of the two coalescing sequences has a descendant in sample 1. Let p(n, m) be the probability that there is no coalescence within sample 2. Each time a coalescence occurs, each pair of sequences has the same probability to be chosen. There are n(n - 1)/2 ways to coalesce two sequences from sample 1, and there are nm ways to coalesce one sequence from sample 1 and one sequence from sample 2. Therefore, we have the following recurrence equation:
| (18) |
| (19) |
The probability that there is at least one coalescence within sample 2 is 1 - p(n, m), and its numerical values for a number of sample size combinations are given in table 5
. It is clear from table 5
that for those sample sizes in our longitudinal samples, this probability is quite high. For example, when n = 15 and m = 13, the probability is 0.94. This analysis suggests that even when there is no time (T = 0) separating the two samples, it is very likely to observe coalescent events within the second sample, so that counting these events as the estimate of m - l and then using this estimate as the basis for estimating G will result in an overestimation of G. This is likely one of the reasons why Rodrigo et al.'s (1999)
estimates of G (table 4 ) are consistently larger than our estimates (table 2
). Such a discrepancy will also be observed if the effective population size N is underestimated in Rodrigo et al. (1999)
.
|
It is worth emphasizing that the longitudinal samples required for estimating mutation rate and generation time do not have to come from within single host. The estimator is applicable to samples in which each sequence comes from a different host. This feature is valuable for studying the evolution of a pathogen that does not stay in a single host for a long period. On the other hand, if longitudinal samples are available from multiple populations which can be considered replicates of the same evolutionary process, accuracy in the estimation of generation time can be substantially improved because samples from different hosts should be more or less independent. Furthermore, the total variance of the estimator can be obtained by bootstrapping both within- population and among-populations samples. Longitudinal samples from within-host HIV populations are being accumulated, and it will be interesting to see how variable the generation time can be among different hosts.
We have so far considered two ways of utilizing the pairwise number of nucleotide substitutions. Another possible use of equations (5) and (7)
is to estimate the time length T separating two samples when both the generation time and the mutation rate per generation are known. An estimate of T is
| (20) |
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
1 Keywords: mutation rate
generation time
longitudinal sample
HIV
coalescent process
2 Address for correspondence and reprints: Yun-Xin Fu, Human Genetics Center, University of Texas at Houston, 6901 Bertner Avenue S222, Houston, Texas 77030. fu{at}hgc.sph.uth.tmc.edu
![]() |
literature cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Balfe, P., P. Simmonds, C. A. Ludlam, J. O. Bishop, and A. J. Brown. 1990. Concurrent evolution of human immunodeficiency virus type 1 in patients infected from the same source: rate of sequence change and low frequency of inactivating mutations. J. Virol. 64:62216233.[ISI][Medline]
Bonhoeffer, S., E. C. Holmes, and M. A. Nowak. 1995. Causes of HIV diversity. Nature 376:125
Brown, A. J. L., and D. D. Richman. 1997. HIV-1: gambling on the evolution of drug resistance? Nat. Med. 3:268
Coffin, J. M. 1995. HIV population dynamics in vivo: implications for genetic variation, pathogenesis, and therapy. Science 267:483489
Holmes, E. C., L. Q. Zhang, P. Simmonds, C. A. Ludlam, and A. J. Brown. 1992. Convergent and divergent sequence evolution in the surface envelope glycoprotein of human immunodeficiency virus type 1 within a single infected patient. Proc. Natl. Acad. Sci. USA 89:48354839
Kingman, J. F. C. 1982a. On the genealogy of large populations. J. Appl. Prob. 19A:2743
. 1982b. The coalescent. Stochast. Processes Applications 13:235248
Li, W. H., and Y. X. Fu. 1999. Coalescent theory and its applications in population genetics. Pp. 4579 in E. Halloran and S. Geisser, eds. Statistics in genetics. Springer, New York.
Mansky, L. M. 1996. Forward mutation rate of human immunodeficiency virus type 1 in a T lymphoid cell line. AIDS Res. Hum. Retroviruses 12:307314
Perelson, A. S., A. U. Neumann, M. Markowitz, J. M. Leonard, and D. D. Ho. 1996. HIV-1 dynamics in vivo: virion clearance rate, infected cell life-span, and viral generation time. Science 271:15821586
Rambaut, A. 2000. Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenetics. Bioinformatics 16:395399
Rodrigo, A. G., and J. Felsenstein. 1999. Coalescent approaches to HIV population genetics. In K. A. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md.
Rodrigo, A. G., E. G. Shpaer, E. L. Delwart, A. K. N. Iversen, M. V. Gallo, J. Brojatsch, M. S. Hirsch, B. D. Walker, and J. I. Mullins. 1999. Coalescent estimates of HIV-1 generation time in vivo. Proc. Natl. Acad. Sci. USA 96:21872191
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406425[Abstract]
Simmonds, P., L. Q. Zhang, F. McOmish, P. Balfe, C. A. Ludlam, and A. J. Brown. 1991. Discontinuous sequence change of human immunodeficiency virus (HIV) type 1 env sequences in plasma viral and lymphocyte-associated proviral populations in vivo: implications for models of {HIV} pathogenesis. J. Virol. 65:626676[ISI][Medline]
Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437460
Wei, X., S. K. Ghosh, M. E. Taylor et al. (12 co-authors). 1995. Viral dynamics in human immunodeficiency virus type 1 infection. Nature 373:117122
Wolfs, T. F., G. Zwart, M. Bakker, M. Valk, C. L. Kuiken, and J. Goudsmit. 1991. Naturally occurring mutations within HIV-1 V3 genomic RNA lead to antigenic variation dependent on a single amino acid substitution. Virology 185:195205
Yamaguchi, Y., and T. Gojobori. 1997. Evolutionary mechanisms and population dynamics of the third variable envelope region of HIV within single hosts. Proc. Natl. Acad. Sci. USA 94:12641269
Zhang, L., R. S. Diaz, D. D. Ho, J. W. Mosley, M. P. Busch, and A. Mayer. 1997. Host-specific driving force in human immunodeficiency virus type 1 evolution in vivo. J. Virol. 71:25552561[Abstract]