Estimating Mutation Rate and Generation Time from Longitudinal Samples of DNA Sequences

Yun-Xin Fu2,

Human Genetics Center, University of Texas at Houston


    Abstract
We present in this paper a simple method for estimating the mutation rate per site per year, which also yields an estimate of the length of a generation when the mutation rate per site per generation is known. The estimator, which takes advantage of DNA polymorphisms in longitudinal samples, is unbiased under a number of population models, including population structure and variable population size over time. We apply the new method to a longitudinal sample of DNA sequences of the env gene of human immunodeficiency virus type 1 (HIV-1) from a single patient and obtain $1.62 \times 10^{-2}$ as the mutation rate per site per year for HIV-1. Using an independent data set to estimate the mutation rate per generation, we obtain 1.8 days as the length of a generation of HIV-1, which agrees well with recent estimates based on viral load data. Our estimate of generation time differs considerably from a recent estimate by Rodrigo et al. when the same mutation rate per site per generation is used. Some factors that may contribute to the difference among different estimators are discussed.


    Introduction
Mutation rate per nucleotide per year is a fundamental quantity for studying the molecular evolution of an organism. Mutation rates are usually estimated by one of two approaches. The first approach is to use homologous DNA sequences from two species with divergence time calibrated by an independent source, usually paleontological data. The second approach is to directly examine the number of mutations over one or a few generations. The first approach is simpler and more economical but is applicable only when a reliable estimate of the divergence time is available. In comparison, the second approach is more widely applicable in principle, but it usually requires examination of either a large number of individuals or a large DNA segment to obtain a reasonable number of changes. The genetic state of the progenitor also needs to be known. For large animals, it is costly to use the second approach because of the lengthy generation time. For rapidly evolving organisms, DNA polymorphisms in longitudinal samples, that is, samples taken at a series of time points, provide another way to estimate mutation rates.

The generation time, or the length of a generation, of an organism is the average length of time between two identical and successive stages in the life cycle of the organism. For example, the generation time of animals of large body size can be defined as the average length of time for an adult to produce another adult; for a virus such as human immunodeficiency virus type 1, the generation time can be defined as the average length of time from the release of the virion until it infects another cell and causes the release of another virion. Generation time not only is part of the biological properties of an organism, but also plays an essential role in analyzing polymorphism data from a population, because population genetic models are usually developed with units of time corresponding to a certain number of generations, rather than days or years.

The life cycle of large animals can be observed easily, and it is usually not a problem to derive a generation time. For example, 20 years is widely used for one human generation. It is difficult to observe the life cycle of small organisms, such as viruses, in vivo, so there is a need to estimate the generation time. DNA polymorphisms in longitudinal samples provide an opportunity to do so when an independent estimate of the mutation rate per generation is available. The purpose of this paper is to present a simple method for estimating the mutation rate per site per year, which also yields an estimate of generation time when the mutation rate per generation is known. We apply the method to a longitudinal sample from an HIV patient both to illustrate the method and to obtain an estimate of mutation rate and an estimate of generation time for HIV-1. The differences between the new method and the method of Rodrigo and Felsenstein (1999) will be discussed.


    The Theory
Suppose a sample of n sequences is taken at time t from a population of a haploid organism. The choice to consider a haploid population is entirely for the convenience of later discussion, and the theory is almost identical for a diploid population.

Let $d_{kl}$ be the number of nucleotide substitutions per site between sequences $k$ and $l$ ($k \neq l$). Then

$$E(d_{kl}) = \theta_t, \tag{1}$$

where $E$ is expectation and $\theta_t$ is a quantity whose value is determined by the dynamics of the population. For example, it is well known from the coalescent theory (Kingman 1982a, 1982b; see, e.g., a recent review by Li and Fu [1999]) that under the neutral Wright-Fisher model with a constant effective population size $N$, the value of $\theta_t$ is $2N\mu$ (e.g., Tajima 1983), where $\mu$ is the mutation rate per site per generation. In general, $\theta_t$ is dependent on the time $t$ at which the sample is taken.

Suppose that a second sample of $m$ sequences is taken $T$ days later from the same population. For sequences $k$ and $l$ ($k \neq l$) from the second sample, we have, similar to equation (1), that

$$E(d_{kl}) = \theta_{t+T}. \tag{2}$$

Again under the neutral Wright-Fisher model with a constant effective population size, $\theta_{t+T} = 2N\mu$. In addition to the pairwise number of nucleotide substitutions within each sample, we can examine nucleotide substitutions between two sequences, one from each sample. Let $d'_{kl}$ be the number of nucleotide substitutions per site between sequence $k$ from sample 1 and sequence $l$ from sample 2. Then,

$$E(d'_{kl}) = \theta_t + \mu G T, \tag{3}$$

where $\mu$ is the mutation rate per site per generation and $G$ is the number of generations per day. This relationship can be seen clearly from figure 1. The first term on the right-hand side arises because, at time $t$, the ancestral sequence of sequence $l$ is just another random sequence from the population at time $t$; the expected number of nucleotide substitutions between this ancestral sequence and sequence $k$ from sample 1 is thus $\theta_t$. The second term is the expected number of mutations per site accumulated in $T$ days along the lineage leading from that ancestral sequence to sequence $l$, assuming that the number of mutations occurring in a sequence in a given number $GT$ of generations is a Poisson variable with mean equal to $\mu GT$. It should be noted that although not every sequence in sample 2 has experienced the same number of generations in the past $T$ days, equation (3) still holds because the mean number of generations for each sequence is $G \times T$. Let $v$ be the mutation rate per site per day. Then, $v = \mu G$. Therefore, equation (3) can be written as

$$E(d'_{kl}) = \theta_t + vT. \tag{4}$$
Let $\Pi_i$ be the mean number of nucleotide substitutions per site between two sequences from sample $i$, and let $\Pi_{12}$ be the mean number of nucleotide substitutions per site between two sequences, one from sample 1 and one from sample 2. Then, we have

$$E(\Pi_1) = \theta_t, \tag{5}$$

$$E(\Pi_2) = \theta_{t+T}, \tag{6}$$

$$E(\Pi_{12}) = \theta_t + vT. \tag{7}$$
From these expectations, it is easy to see that an unbiased estimator of $v$ is

$$\hat{v} = \frac{\Pi_{12} - \Pi_1}{T}. \tag{8}$$
With the further assumption that $\theta_t$ is the same for different values of $t$, another unbiased estimator of $v$ is

$$\hat{v}' = \frac{\Pi_{12} - (\Pi_1 + \Pi_2)/2}{T}. \tag{9}$$

Although the estimator $\hat{v}'$ should have a smaller variance than $\hat{v}$, the assumption needed to ensure its unbiasedness is questionable in many situations. Therefore, $\hat{v}$ is generally preferable to $\hat{v}'$.
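The estimator of the per-day rate can be sketched in a few lines of Python: take the mean between-sample distance, subtract the mean within-sample distance of the earlier sample, and divide by the number of days separating the samples. The function names, the toy sequences, and the use of uncorrected proportions of differing sites are illustrative assumptions, not part of the original analysis.

```python
from itertools import combinations

def hamming_per_site(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def mean_within(sample, dist=hamming_per_site):
    """Mean pairwise distance among sequences of one sample (Pi_i)."""
    pairs = list(combinations(sample, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def mean_between(sample1, sample2, dist=hamming_per_site):
    """Mean distance over all pairs with one sequence from each sample (Pi_12)."""
    total = sum(dist(a, b) for a in sample1 for b in sample2)
    return total / (len(sample1) * len(sample2))

def estimate_v(sample1, sample2, days_apart, dist=hamming_per_site):
    """Per-site, per-day rate: (Pi_12 - Pi_1) / T."""
    return (mean_between(sample1, sample2, dist) - mean_within(sample1, dist)) / days_apart

# Hypothetical toy alignment: sample 1 at time t, sample 2 taken 100 days later.
sample1 = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT"]
sample2 = ["AAAAAAACTT", "AAAAAACCTT"]
v_hat = estimate_v(sample1, sample2, days_apart=100)
```

Passing a different `dist` function (for example a corrected distance) changes only how each pairwise distance is computed; the estimator itself is unchanged.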



Fig. 1.—Schematic relationship between two sequences taken T days apart

 
If the mutation rate $\mu$ per site per generation is known or has been estimated independently, an unbiased estimator of $G$ is

$$\hat{G} = \frac{\hat{v}}{\mu}. \tag{10}$$

The length $L$ of a generation, i.e., the generation time, can be obtained as

$$\hat{L} = \frac{1}{\hat{G}}. \tag{11}$$

Combining Estimates

Suppose that samples from $r$ different time points were taken. For each pair of samples, one can obtain an estimate of $G$. Let $\hat{G}_{ij}$ be the estimate from samples $i$ and $j$. Then, there are $r(r-1)/2$ estimates of $G$. How to combine these estimates to obtain an overall estimate of $G$ is not only an interesting theoretical issue, but also of great practical importance. Since each $\hat{G}_{ij}$ is an unbiased estimate of $G$, for any weights $\alpha_{ij} \geq 0$ with $\sum_{i<j} \alpha_{ij} = 1$, the quantity

$$\hat{G}_\alpha = \sum_{i<j} \alpha_{ij} \hat{G}_{ij} \tag{12}$$

is also an unbiased estimate of $G$. Therefore, there are many ways to combine pairwise estimates.

The simplest method is to take an unweighted average. That is, let $\alpha_{ij} = 2/[r(r-1)]$, resulting in

$$\hat{G}_s = \frac{2}{r(r-1)} \sum_{i<j} \hat{G}_{ij}. \tag{13}$$
A problem with this estimator is that terms with large variances may dominate the final estimate. Therefore, a better way to combine estimates is to take the variance of each estimate into consideration. Let $T_{ij}$ be the time length between samples $i$ and $j$. Then, $\Pi_{ij} - \Pi_i$ reflects on average the number of mutations in a sequence during a period of length $T_{ij}$, which is approximately Poisson distributed. Therefore, the variance of $\Pi_{ij} - \Pi_i$ should be proportional to $T_{ij}$. In fact, it can be shown that

(14)
where $c_{ij}$ is a complex quantity which is a function of the sample sizes, $T_{ij}$, and parameters that affect population dynamics, such as population size, growth rate, and generation time. The reason why $c_{ij}$ depends even on generation time is that it depends on the number of ancestral sequences the $j$th sample has at the time the $i$th sample was taken, and this number is dependent on generation time. Since generation time is what we intend to estimate, an overall estimator $\hat{G}_\alpha$ whose weights depend on generation time is undesirable (although the variance of any $\hat{G}_\alpha$ is dependent on generation time). Because of the complexity of $c_{ij}$, a reasonable strategy for guiding our choice of weights is to ignore both $c_{ij}$ and the correlations among estimates. This simple strategy results in optimal weights being $\alpha_{ij} = T_{ij}/\sum_{i<j} T_{ij}$ and the corresponding estimators of $v$ and $G$ being

$$\hat{v}_T = \frac{\sum_{i<j} (\Pi_{ij} - \Pi_i)}{\sum_{i<j} T_{ij}}, \tag{15}$$

$$\hat{G}_T = \frac{\hat{v}_T}{\mu}, \tag{16}$$

where $\mu$ is the mutation rate per site per generation. The length $L$ of a generation, i.e., the generation time, can thus be estimated as $\hat{L}_T = \hat{G}_T^{-1}$.
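To see how the two weighting schemes behave, the following sketch computes the unweighted average and the time-weighted combination from pairwise quantities. All numbers, the dictionary layout, and the function name `combine_estimates` are hypothetical and chosen only to show how a short interval can dominate the unweighted average.

```python
def combine_estimates(pi_diff, days, mu):
    """Combine pairwise estimates of G (generations per day).

    pi_diff[(i, j)] holds Pi_ij - Pi_i and days[(i, j)] holds T_ij for i < j.
    Returns (unweighted average, time-weighted combination).
    """
    pairs = sorted(pi_diff)
    g_pair = {p: pi_diff[p] / (mu * days[p]) for p in pairs}
    g_unweighted = sum(g_pair.values()) / len(g_pair)
    total_days = sum(days[p] for p in pairs)
    # Weights alpha_ij = T_ij / sum(T_ij); algebraically this equals
    # sum(Pi_ij - Pi_i) / (mu * sum(T_ij)).
    g_time_weighted = sum(days[p] / total_days * g_pair[p] for p in pairs)
    return g_unweighted, g_time_weighted

# Hypothetical inputs: three sampling times, one very short interval.
mu = 1e-5
pi_diff = {(1, 2): 0.0010, (1, 3): 0.0020, (2, 3): 0.0004}
days = {(1, 2): 100, (1, 3): 200, (2, 3): 10}

g_s, g_t = combine_estimates(pi_diff, days, mu)
```

In this toy input the pair separated by only 10 days yields a pairwise estimate four times larger than the others; it pulls the unweighted average to about 2.0, while the time-weighted combination stays near 1.1.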

Variance
The variance of any of the $\hat{G}_\alpha$ estimators is extremely complex even under the simple situation of constant population size. Not only is the formula for an individual $\mathrm{Var}(\hat{G}_{ij})$ intractable, but the different $\hat{G}_{ij}$ values are also correlated with each other. We therefore propose the use of bootstrap samples for estimating the variance of $\hat{L}_T$ (or $\hat{G}_T$).

Let $n_i$ be the number of sequences in sample $i$ ($i = 1, \ldots, r$). A bootstrap sample will consist of $r$ subsamples, with the $i$th subsample obtained by selecting $n_i$ sequences with replacement from the $n_i$ sequences. This stratified bootstrap is reasonable in the absence of detailed knowledge of the dynamics of the population being studied. The bootstrap estimate of the variance of $\hat{L}_T$ is obtained by the following steps: (1) carry out bootstrap sampling for each of the $r$ samples, (2) compute the value of $\hat{L}_T$ from the bootstrap samples, and (3) repeat steps 1 and 2 many times; the resulting sampling variance of $\hat{L}_T$ is then an estimate of the variance of $\hat{L}_T$.
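The three bootstrap steps can be sketched as follows. The function `stratified_bootstrap_variance`, the toy statistic, and the numeric toy samples are assumptions made for illustration; in an actual analysis the statistic would compute the generation-time estimate from the resampled sequences.

```python
import random
import statistics

def stratified_bootstrap_variance(samples, statistic, n_boot=1000, seed=0):
    """Bootstrap variance of statistic(samples), resampling within each sample.

    samples is a list of r lists, one per sampling time; each replicate
    redraws n_i items with replacement from the i-th sample.
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        # Step 1: resample each time point separately, keeping its size n_i.
        resampled = [rng.choices(sample, k=len(sample)) for sample in samples]
        # Step 2: recompute the statistic on the bootstrap samples.
        replicates.append(statistic(resampled))
    # Step 3: the variance across replicates estimates the sampling variance.
    return statistics.variance(replicates)

def toy_statistic(samples):
    """Placeholder statistic: difference in means between last and first sample."""
    first, last = samples[0], samples[-1]
    return sum(last) / len(last) - sum(first) / len(first)

toy_samples = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 5.0, 6.0, 7.0]]
var_hat = stratified_bootstrap_variance(toy_samples, toy_statistic, n_boot=500, seed=1)
```

Fixing the seed makes the replicate set reproducible, which is convenient when comparing correction methods on the same resamples.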

It should be noted that there are two levels of variance associated with $\hat{L}_T$. One is due to the stochasticity of evolution, which leads to differences among replicates of the same evolutionary process. The other is due to the effect of sampling. The bootstrap estimate of variance described above only accounts for the variance due to sampling. Although one needs to be cautious in interpreting this component of variance, it nevertheless gives a lower bound of the total variance. When multiple populations are available which can be considered replicates of the same evolutionary process (e.g., the HIV populations in different human hosts), a bootstrap resampling procedure consisting of sampling from both within a replicate and among replicates will give an estimate of the total variance.


    Application to HIV
A number of studies involving within-host longitudinal samples of DNA sequences of human immunodeficiency virus type 1 (HIV-1) have been reported (e.g., Balfe et al. 1990; Simmonds et al. 1991; Wolfs et al. 1991; Holmes et al. 1992; Zhang et al. 1997; Rodrigo et al. 1999). We will use the case analyzed by Rodrigo et al. (1999) for illustration of our method and for the purpose of comparison of estimates. This longitudinal sample was obtained from a homosexual Caucasian male who was diagnosed as HIV-1 seropositive following an episode of aseptic meningitis in 1985 (see Rodrigo et al. [1999] and references therein). The first sample was taken in April 1989, and subsequent samples were taken 7, 22, 23, and 34 months later. The patient started treatment with zidovudine at month 13 after the first sample and continued until the end of the study.

The 0.65-kb region of the env gene spanning the third to the fifth variable regions was sequenced for each virus in the sample. We will utilize the same sequence alignment as that in Rodrigo et al. (1999). For simplicity, we will consider only nucleotide substitutions. Three methods for computing the number of nucleotide substitutions between two sequences are considered. One is the number of nucleotide differences, the second is the distance using Jukes-Cantor correction, and the third is the distance corrected using Kimura's two-parameter model. The sample sizes and time lengths between samples, as well as the $\Pi$ values by the first and third methods, are given in table 1.
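The three distance measures named above can be sketched with their standard textbook formulas. The sketch assumes aligned sequences containing only A, C, G, and T with no gaps or ambiguity codes; how the original analysis handled such sites is not stated here, and the function names are illustrative.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def site_proportions(a, b):
    """Return (P, Q): proportions of transitional and transversional differences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    return transitions / len(a), transversions / len(a)

def p_distance(a, b):
    """Uncorrected proportion of differing sites."""
    P, Q = site_proportions(a, b)
    return P + Q

def jukes_cantor(a, b):
    """Jukes-Cantor distance: -(3/4) ln(1 - 4p/3)."""
    p = p_distance(a, b)
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_2p(a, b):
    """Kimura two-parameter distance: -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q)."""
    P, Q = site_proportions(a, b)
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

# Hypothetical pair with one transition (A/G) and one transversion (A/C).
seq_a = "AAAAAAAAAA"
seq_b = "GAAAAAAAAC"
d_raw, d_jc, d_k2p = p_distance(seq_a, seq_b), jukes_cantor(seq_a, seq_b), kimura_2p(seq_a, seq_b)
```

For closely related sequences the corrections are small, which is consistent with the three methods giving similar estimates.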


Table 1 The Mean Numbers ($\Pi_{ij}$) ($\times 10^{2}$) of Nucleotide Substitutions

 
From table 1, the mutation rate per site per day is estimated by $\hat{v}_T$ to be $4.71 \times 10^{-5}$ without correction and $4.45 \times 10^{-5}$ with correction using Kimura's two-parameter model. These values correspond, respectively, to $1.71 \times 10^{-2}$ and $1.62 \times 10^{-2}$ per site per year. The estimate using Jukes-Cantor correction lies between these two values.

As we pointed out earlier, one can obtain an estimate of the generation time when the mutation rate per site per generation is known. Mansky (1996) estimated that the overall mutation rate per site per generation was $4 \times 10^{-5}$, which includes base substitution, frameshift, deletion, and insertion. Since we only consider base substitutions in our analysis, it is necessary to use the mutation rate for base substitution only. Using Mansky's data (table 4 in Mansky 1996) that there are 15 base substitutions in 5,272 shuttle vector proviruses with a target segment of 114 nt, we obtain a nucleotide substitution rate of $15/(5{,}272 \times 114) = 2.5 \times 10^{-5}$ per site per generation. The pairwise estimates of the number $G$ of generations per day are given in table 2.
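The two pieces of arithmetic quoted in this section can be checked directly. The sketch below recomputes the substitution-only rate from the counts attributed to Mansky (1996) and converts the Kimura-corrected per-day rate to a per-year rate; the 365-day year is an assumption, and the variable names are illustrative.

```python
# Counts quoted in the text from table 4 of Mansky (1996).
base_substitutions = 15
proviruses = 5272
target_length_nt = 114

mu_per_site_per_generation = base_substitutions / (proviruses * target_length_nt)

# Per-day estimate quoted in the text (Kimura two-parameter correction).
v_per_site_per_day = 4.45e-5
DAYS_PER_YEAR = 365  # assumption: leap days ignored
v_per_site_per_year = v_per_site_per_day * DAYS_PER_YEAR

print(f"mu = {mu_per_site_per_generation:.1e} per site per generation")
print(f"v  = {v_per_site_per_year:.2e} per site per year")
```

Both values round to the figures given in the text, $2.5 \times 10^{-5}$ and $1.62 \times 10^{-2}$.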


Table 2 Pairwise Estimates of $G$ Using Kimura's Two-Parameter Model with $\mu = 2.5 \times 10^{-5}$

 
The estimates of generation time $L$ using the original distance, the distance with Jukes-Cantor correction, and the distance with correction by the Kimura two-parameter model are given in table 3, together with bootstrap estimates of standard errors. For comparison, estimates using Rodrigo and Felsenstein (1999) (see discussion in the next section) are also included.


Table 3 Estimation of Generation Time L

 
It is clear from table 3 that differences in the estimates of $L$ among correction methods are rather minor, but there are considerable differences between estimates based on $\hat{G}_s$ and $\hat{G}_T$. The reason why $\hat{G}_s$ is nearly twice as large as $\hat{G}_T$ (and thus $\hat{G}_s^{-1}$ is nearly half of $\hat{G}_T^{-1}$) is that $\hat{G}_s$ is dominated by a single large estimate of $G$ from samples 3 and 4. Because only 28 days separated these two samples, the resulting estimate of $G$ has a large variance. In comparison, $\hat{G}_T$ gives a smaller weight to the estimate from these two samples, which results in a smaller value and, consequently, a larger estimate of generation time. Table 3 also shows that the bootstrap standard error of $\hat{G}_s^{-1}$ is slightly larger than that of $\hat{G}_T^{-1}$, which further supports our view that $\hat{G}_T^{-1}$ is a better estimator than $\hat{G}_s^{-1}$. We thus conclude that the generation time for the HIV population in this patient is about 1.78 ± 0.25 days. In comparison, combined estimates from pairwise estimates of $G$ (table 4) by the Rodrigo and Felsenstein (1999) method are less than half of our estimates.


Table 4 Pairwise Estimates of $G$ by Rodrigo et al. (1999) with $\mu = 2.5 \times 10^{-5}$

 

    Alternative Methods
There are several existing methods for estimating the generation time. Here, we will focus on a method proposed by Rodrigo and Felsenstein (1999), but another entirely different approach will be discussed as well. A brief description of Rodrigo and Felsenstein's (1999) method is as follows.

Consider a sample of $m$ sequences from a haploid population with a constant effective population size $N$. If one examines the history of these $m$ sequences by tracing backward in time, one will find that there is a period in which there are $m$ ancestral sequences, a period in which there are $m - 1$ ancestral sequences, and so on. The time length $t_i$ of the period in which there are $i$ ancestral sequences has an exponential distribution with a mean equal to $2N/[i(i-1)]$ (Kingman 1982a). The number of generations from time $t + T$ back to the first sampling time $t$, at which there are $l$ ancestral sequences, is $G_l = t_m + \cdots + t_{l+1}$, whose expectation is equal to

$$E(G_l) = \sum_{i=l+1}^{m} \frac{2N}{i(i-1)} = 2N\left(\frac{1}{l} - \frac{1}{m}\right), \tag{17}$$

where $m - l$ is the number of coalescent events among the $m$ sequences during the time period $T$. For a pair of samples taken $T$ days apart, one can estimate the number of generations between the two sampling times using the above equation if both $N$ and $l$ are known. Similar to our estimator, one can convert $G_l$ to an estimate of the generation time $L$ as $T/G_l$.
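Because each period with $i$ ancestral sequences has mean length $2N/[i(i-1)]$, the expected total from $m$ down to $l$ ancestral sequences is a telescoping sum equal to $2N(1/l - 1/m)$. The sketch below checks that identity with exact rational arithmetic; the function names and the example values of $N$, $m$, and $l$ are hypothetical.

```python
from fractions import Fraction

def expected_generations_sum(N, m, l):
    """Sum of the mean waiting times 2N / (i (i - 1)) for i = l+1, ..., m."""
    return sum(Fraction(2 * N, i * (i - 1)) for i in range(l + 1, m + 1))

def expected_generations_closed(N, m, l):
    """Closed form of the same sum: 2N (1/l - 1/m)."""
    return 2 * N * (Fraction(1, l) - Fraction(1, m))

# Hypothetical values: N = 1000, m = 13 sequences, l = 5 ancestors at time t.
by_sum = expected_generations_sum(1000, 13, 5)
by_formula = expected_generations_closed(1000, 13, 5)
```

The closed form makes the sensitivity to $l$ visible: miscounting coalescent events changes $1/l$ directly, and the effect is largest when $l$ is small.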

The effective population size $N$ can be estimated from an estimate of $\theta = 2N\mu$, where $\mu$ is the mutation rate per generation per site. There are a number of methods available for estimating $\theta$ (see review by Li and Fu 1999). For example, since $\Pi_2$ has a mean equal to $\theta$ under the assumption of a constant effective population size, one can estimate $N$ as $\hat{N} = \Pi_2/(2\mu)$. Rodrigo et al. (1999) used a more complex method and obtained estimates of $N$ similar to those by Brown and Richman (1997). The value of $m - l$ was estimated from a rooted phylogeny of the sequences from both samples as the number of coalescent events among the sequences in the second sample. Rodrigo et al. (1999) used the neighbor-joining method (Saitou and Nei 1987) for phylogeny reconstruction and used an outgroup sequence to root the tree. Note that another tree-based approach for estimating mutation rate is proposed by Rambaut (2000).

Using the approach described above, Rodrigo et al. (1999) obtained an estimate of the generation time for HIV-1 of 1.2 days. At first glance, their estimate appears to be comparable with our estimate of 1.78 days, but the two estimates cannot be directly compared for two reasons. First, Rodrigo et al. (1999) used a different approach than ours to combine pairwise estimates. Second, we use here a different mutation rate. Although only nucleotide substitutions were considered, Rodrigo et al. (1999) nevertheless used the mutation rate $4 \times 10^{-5}$ compiled by Mansky (1996), which also includes insertions and deletions. When only nucleotide substitutions are counted, the mutation rate from Mansky's data becomes $2.5 \times 10^{-5}$ per site per generation. With this mutation rate, pairwise estimates of $G$ computed from table 2 of Rodrigo et al. (1999) become those in table 4. Comparison of table 4 with table 2 reveals that Rodrigo et al.'s (1999) estimate of $G$ is considerably larger than ours in every case, and is on average more than twice as large as ours. Although some of the differences must be due to the variances in both estimates, it is highly unlikely that random errors alone can result in such systematic differences. Some possible causes for this discrepancy will be discussed later.

A very different technique for estimating generation time using within-host longitudinal viral load data has been developed (Coffin 1995; Wei et al. 1995; Perelson et al. 1996). This approach is based on the principle that when a potent drug such as ritonavir, a protease inhibitor, is administered to a patient, the rate of loss of virions in plasma can be modeled by a set of differential equations with a few parameters, which can be estimated from the longitudinal viral load data. The values of these parameters can then be used to estimate the generation time. The estimates of generation time from this technique vary from about 4 days by Wei et al. (1995) to 2.6 days by Perelson et al. (1996). The latter group have recently revised their estimate to 1.8 days (Rodrigo et al. 1999), which agrees well with our estimate of 1.78 days.


    Discussion
Very often, a statistical method for analyzing a population sample is developed under a specific model, such as the constant effective population size assumed by Rodrigo et al. (1999). When the population in question evolves in a manner that is significantly different from the model, the statistical analysis and the resulting conclusions can be misleading. Therefore, it is important to understand how an estimator behaves under various situations. The estimator of mutation rate proposed in this paper has the distinct feature of being unbiased in a variety of situations, which deserves further discussion.

In the case of population growth or, in general, varying effective population size, it is easy to see that $\hat{v}$ is an unbiased estimator of $v$ because equations (5) and (7) hold regardless of the value of effective population size. This property of our estimator is particularly important for its application to fast-changing viral populations such as HIV-1, because a within-host population can change dramatically in size over a short period of time.

The estimator $\hat{v}$ is also unbiased in the presence of population structure, because regardless of population structure, every sequence in the second sample has experienced the same amount of time since the first sample was taken. As long as a consistent sampling strategy is used for the different samples, equations (5) and (7) hold regardless of population structure.

It is also obvious that recombination does not introduce bias in our estimate of $v$, because equations (5) and (7) hold in the presence of recombination. Of course, this is not to say that nonconstant effective population size, population structure, and recombination have no effect on our estimate, because they do affect the variance of the estimator.

Natural selection is an important factor to consider when analyzing samples from viral populations such as HIV-1. When the DNA region under study is not directly involved in natural selection, our estimator should remain nearly unbiased. This includes the situation in which the region under study is tightly linked to a locus that is under strong natural selection. For example, if natural selection has led to the fixation of a favorable mutation before sampling starts, then its effect is very similar to that of a growing population and thus will not lead to bias in our estimator. When many mutations in the samples are not selectively neutral, the accumulation of nucleotide changes in a sequence in a given period may deviate from the Poisson distribution, and the substitution rate can be elevated or reduced depending on the type of natural selection. In the case of deleterious mutations, the mutation rate per year estimated from equation (13) is likely to be smaller than that extrapolated from the mutation rate per site per generation and the generation time. This will result in an overestimate of the generation time. On the other hand, if positive selection is involved, the substitution rate per site per year will be elevated, which will result in an underestimate of the generation time. One way to minimize the effect of natural selection is to conduct analyses on synonymous substitutions only. With more and more data available, such analyses should be very informative. Since our analysis in this paper is mainly for the purpose of illustration, and also because of the relatively small samples, we do not pursue the more detailed analysis.

Since the number of mutations that substantially enhance viral survival should be small compared with the total number of mutations, the bias in our estimate of $v$ due to positive selection is unlikely to be substantial. Nevertheless, since positive selection is likely operating on the env gene of HIV-1 (e.g., Bonhoeffer, Holmes, and Nowak 1995; Yamaguchi and Gojobori 1997; Zhang et al. 1997), our estimate of the generation time may be slightly affected. Another potential source of error in the estimate of generation time is the mutation rate per site per generation. For example, if the mutation rate is underestimated, then it will result in an underestimate of the generation time. With these caveats, it is encouraging that our estimate of generation time agrees well with the recent estimate of 1.8 days from viral load data (see Rodrigo et al. 1999). It will be interesting to see if the agreement continues to hold with increasing data.

What causes Rodrigo et al.'s (1999) estimates of $G$ to be consistently larger than ours? Although the variances in both estimators may lead to fluctuation, the discrepancy is likely due to some fundamental differences between the two estimators. One similarity between the two estimators of generation time is that both rely on estimates of the numbers of generations per day. However, our method is more direct and is unbiased for estimating the number of generations per day, while Rodrigo and Felsenstein's (1999) method is indirect, relying on estimates of both the effective population size and the number of coalescent events among the sequences in sample 2 in the period that separates the two samples. Counting the coalescent events directly from an estimated phylogeny will likely overestimate this number even if the phylogeny is perfectly reconstructed. The simple analysis below will reveal why this is so.

Consider two random samples with sizes $n$ and $m$, respectively, taken at the same time. Their coalescences will be like that for a single sample of $n + m$ sequences. A coalescent event will be counted as a coalescence between sequences from sample 2, or simply a coalescence within sample 2, if and only if neither of the two coalescing sequences has a descendant in sample 1. Let $p(n, m)$ be the probability that there is no coalescence within sample 2. Each time a coalescence occurs, each pair of sequences has the same probability to be chosen. There are $n(n-1)/2$ ways to coalesce two sequences from sample 1, and there are $nm$ ways to coalesce one sequence from sample 1 and one sequence from sample 2. Therefore, we have the following recurrence equation:

$$p(n, m) = \frac{n(n-1)\,p(n-1, m) + 2nm\,p(n, m-1)}{(n+m)(n+m-1)}. \tag{18}$$

When there is only one sequence in the second sample, there is no chance of having coalescence within sample 2. Therefore, the initial condition for solving the above recurrence equation is

$$p(n, 1) = 1. \tag{19}$$

The probability that there is at least one coalescence within sample 2 is $1 - p(n, m)$, and its numerical values for a number of sample size combinations are given in table 5. It is clear from table 5 that for those sample sizes in our longitudinal samples, this probability is quite high. For example, when $n = 15$ and $m = 13$, the probability is 0.94. This analysis suggests that even when there is no time ($T = 0$) separating the two samples, one is very likely to observe coalescent events within the second sample, so that counting these events as the estimate of $m - l$ and then using this estimate as the basis for estimating $G$ will result in an overestimation of $G$. This is likely one of the reasons why Rodrigo et al.'s (1999) estimates of $G$ (table 4) are consistently larger than our estimates (table 2). Such a discrepancy will also be observed if the effective population size $N$ is underestimated in Rodrigo et al. (1999).
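The recurrence can be evaluated numerically with a short memoized function. This sketch follows the counting argument in the text: of the $(n+m)(n+m-1)/2$ equally likely pairs, $n(n-1)/2$ join two lineages that have descendants in sample 1, $nm$ join one lineage of each kind, and the remaining pairs are coalescences within sample 2. The function name and the extra boundary case for $n = 0$ are assumptions added so the code is self-contained.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def p_no_within(n, m):
    """Probability of no coalescence within sample 2.

    n: lineages with at least one descendant in sample 1
    m: lineages with descendants in sample 2 only
    """
    if m == 1:
        return 1.0  # a single sample-2 lineage cannot coalesce within sample 2
    if n == 0:
        return 0.0  # assumed boundary: every coalescence is within sample 2
    total_pairs_x2 = (n + m) * (n + m - 1)
    return (n * (n - 1) * p_no_within(n - 1, m)
            + 2 * n * m * p_no_within(n, m - 1)) / total_pairs_x2

# Probability of at least one within-sample-2 coalescence when T = 0.
prob_small = 1 - p_no_within(1, 2)
prob_text_example = 1 - p_no_within(15, 13)
```

For $n = 1$ and $m = 2$ the result is 1/3, which matches the direct argument that one of the three equally likely first pairings joins the two sample-2 sequences. The value for $n = 15$ and $m = 13$ can be compared with the 0.94 quoted in the text.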


Table 5 The Probability (1 - p(n, m)) of Having at Least One Coalescence Among the Sequences in the Second Sample when T = 0

 
Table 5 also shows that when $n$ is large, $1 - p(n, m)$ becomes small, which suggests that it is possible to reduce bias in Rodrigo and Felsenstein's (1999) method by using a much larger sample size for the first sample. However, when more than two samples are taken, such as the samples analyzed in this paper, it is not easy to persuade an experimenter to, say, halve the previous sample size whenever a new sample is taken, because samples are usually not collected entirely for a single purpose. In general, increasing all of the sample sizes will improve the accuracy of the final estimate, which is also true for Rodrigo and Felsenstein's (1999) method because coalescent time is smaller when there are many sequences; thus, the error due to an incorrect estimate of $l$ is not as severe as that for a small sample.

It is worth emphasizing that the longitudinal samples required for estimating mutation rate and generation time do not have to come from within a single host. The estimator is applicable to samples in which each sequence comes from a different host. This feature is valuable for studying the evolution of a pathogen that does not stay in a single host for a long period. On the other hand, if longitudinal samples are available from multiple populations which can be considered replicates of the same evolutionary process, accuracy in the estimation of generation time can be substantially improved because samples from different hosts should be more or less independent. Furthermore, the total variance of the estimator can be obtained by bootstrapping both within-population and among-population samples. Longitudinal samples from within-host HIV populations are being accumulated, and it will be interesting to see how variable the generation time can be among different hosts.

We have so far considered two ways of utilizing the pairwise number of nucleotide substitutions. Another possible use of equations (5) and (7) is to estimate the time length $T$ separating two samples when both the generation time and the mutation rate per generation are known. An estimate of $T$ is

$$\hat{T} = \frac{\Pi_{12} - \Pi_1}{\mu G}. \tag{20}$$
A potential use of such an estimator is to date the ancestral population from which an ancient DNA sample is obtained. This, of course, requires that the modern sample is taken from a population that shared the same ancestral population from which the ancient DNA sample was derived.
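As a rough sketch of this dating use, the separation time follows from dividing the excess of the between-sample distance over the within-sample distance of the older sample by the per-day rate $v = \mu G$. The function name and all input values below are hypothetical.

```python
def estimate_separation_days(pi_between, pi_older, mu, generations_per_day):
    """Estimate T in days as (Pi_12 - Pi_1) / (mu * G)."""
    rate_per_day = mu * generations_per_day  # v = mu * G
    if rate_per_day <= 0:
        raise ValueError("mutation rate and generations per day must be positive")
    return (pi_between - pi_older) / rate_per_day

# Hypothetical inputs: distances per site, mu per site per generation,
# and half a generation per day.
t_hat = estimate_separation_days(pi_between=0.03, pi_older=0.02,
                                 mu=2e-5, generations_per_day=0.5)
```

With these made-up inputs the estimate is about 1,000 days; the result scales inversely with both the mutation rate and the number of generations per day, so errors in either propagate directly.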


    Acknowledgements
I thank Dr. Stanley Sawyer and reviewers for their comments, and Dr. Allen G. Rodrigo for kindly providing me his sequence alignment. This work was supported by NIH grants R29 GM50428 and R01 HG01708 and a fellowship from the Japan Society for the Promotion of Science. Special thanks go to Dr. Naruya Saitou for hosting me in Mishima.


    Footnotes
 
Naruya Saitou, Reviewing Editor

1 Keywords: mutation rate, generation time, longitudinal sample, HIV, coalescent process

2 Address for correspondence and reprints: Yun-Xin Fu, Human Genetics Center, University of Texas at Houston, 6901 Bertner Avenue S222, Houston, Texas 77030. fu@hgc.sph.uth.tmc.edu


    Literature Cited
 

    Balfe, P., P. Simmonds, C. A. Ludlam, J. O. Bishop, and A. J. Brown. 1990. Concurrent evolution of human immunodeficiency virus type 1 in patients infected from the same source: rate of sequence change and low frequency of inactivating mutations. J. Virol. 64:6221–6233

    Bonhoeffer, S., E. C. Holmes, and M. A. Nowak. 1995. Causes of HIV diversity. Nature 376:125

    Brown, A. J. L., and D. D. Richman. 1997. HIV-1: gambling on the evolution of drug resistance? Nat. Med. 3:268

    Coffin, J. M. 1995. HIV population dynamics in vivo: implications for genetic variation, pathogenesis, and therapy. Science 267:483–489

    Holmes, E. C., L. Q. Zhang, P. Simmonds, C. A. Ludlam, and A. J. Brown. 1992. Convergent and divergent sequence evolution in the surface envelope glycoprotein of human immunodeficiency virus type 1 within a single infected patient. Proc. Natl. Acad. Sci. USA 89:4835–4839

    Kingman, J. F. C. 1982a. On the genealogy of large populations. J. Appl. Prob. 19A:27–43

    ———. 1982b. The coalescent. Stochast. Processes Applications 13:235–248

    Li, W. H., and Y. X. Fu. 1999. Coalescent theory and its applications in population genetics. Pp. 45–79 in E. Halloran and S. Geisser, eds. Statistics in genetics. Springer, New York.

    Mansky, L. M. 1996. Forward mutation rate of human immunodeficiency virus type 1 in a T lymphoid cell line. AIDS Res. Hum. Retroviruses 12:307–314

    Perelson, A. S., A. U. Neumann, M. Markowitz, J. M. Leonard, and D. D. Ho. 1996. HIV-1 dynamics in vivo: virion clearance rate, infected cell life-span, and viral generation time. Science 271:1582–1586

    Rambaut, A. 2000. Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenetics. Bioinformatics 16:395–399

    Rodrigo, A. G., and J. Felsenstein. 1999. Coalescent approaches to HIV population genetics. In K. A. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md.

    Rodrigo, A. G., E. G. Shpaer, E. L. Delwart, A. K. N. Iversen, M. V. Gallo, J. Brojatsch, M. S. Hirsch, B. D. Walker, and J. I. Mullins. 1999. Coalescent estimates of HIV-1 generation time in vivo. Proc. Natl. Acad. Sci. USA 96:2187–2191

    Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425

    Simmonds, P., L. Q. Zhang, F. McOmish, P. Balfe, C. A. Ludlam, and A. J. Brown. 1991. Discontinuous sequence change of human immunodeficiency virus (HIV) type 1 env sequences in plasma viral and lymphocyte-associated proviral populations in vivo: implications for models of HIV pathogenesis. J. Virol. 65:6266–6276

    Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437–460

    Wei, X., S. K. Ghosh, M. E. Taylor et al. (12 co-authors). 1995. Viral dynamics in human immunodeficiency virus type 1 infection. Nature 373:117–122

    Wolfs, T. F., G. Zwart, M. Bakker, M. Valk, C. L. Kuiken, and J. Goudsmit. 1991. Naturally occurring mutations within HIV-1 V3 genomic RNA lead to antigenic variation dependent on a single amino acid substitution. Virology 185:195–205

    Yamaguchi, Y., and T. Gojobori. 1997. Evolutionary mechanisms and population dynamics of the third variable envelope region of HIV within single hosts. Proc. Natl. Acad. Sci. USA 94:1264–1269

    Zhang, L., R. S. Diaz, D. D. Ho, J. W. Mosley, M. P. Busch, and A. Mayer. 1997. Host-specific driving force in human immunodeficiency virus type 1 evolution in vivo. J. Virol. 71:2555–2561

Accepted for publication December 11, 2000.