Institut Cavanilles de Biodiversitat i Biologia Evolutiva and Departament de Genètica, Universitat de València, Apartado Oficial 2085, 46071 València, Spain
Correspondence
Fernando González-Candelas
fernando.gonzalez{at}uv.es
ABSTRACT
The nucleotide sequence data from the variants reported in this paper will appear in the EMBL and GenBank databases under accession nos AJ560333–AJ560620.
INTRODUCTION
To analyse and characterize such extremely variable populations, there are a number of techniques that allow the estimation of genetic variability, such as heteroduplex mobility assays (Woodward et al., 1994), single-strand conformation polymorphism assays (Spinardi et al., 1991), multiple-site-specific tracking assays (Resch et al., 2001), mutant analysis by PCR and restriction enzyme cleavage (Chumakov et al., 1991) or denaturing gradient gel electrophoresis (Fodde & Losekoot, 1994; Woodward et al., 1994). However, to analyse many of the properties of such populations, it is necessary to know the nucleotide sequence of the constituting genomes. There are three main methodologies for this. The first method proceeds through reverse transcription (RT), amplification and direct sequencing of the resulting cDNA (Leitner et al., 1993), hence rendering a single, consensus sequence on which variability is usually estimated by the analysis of variable positions in the electrophoregrams. The second method also amplifies cDNA by PCR but the resulting products are cloned into an appropriate vector. Clones derived from a single DNA molecule are sequenced, thus providing individual sequences representative of the initial, variable population. The third method, denoted PCR-based limited dilution assay, is also aimed at providing individual sequences but avoiding the cloning steps. This is achieved through limiting dilutions prior to PCR amplification (Rodrigo et al., 1997; Taswell, 1981), thus assuring that only a single molecule acts as template for the reaction. Later, these PCR products are sequenced directly.
All of these methods have their advantages and limitations and none is universally best for all applications and in all circumstances. Since hepatitis C virus (HCV)-infected individuals usually harbour 10¹⁰–10¹² virus particles (Neumann et al., 1998), it could seem evident that the low numbers of sequences or clones obtained in these studies, usually in the tens at most, would hardly be a truly representative sample of the whole population and that increasing the number of sequences would provide a much better evaluation of the underlying diversity. The question can then be restated as: do the conclusions obtained with a relatively small number of sequences (about 10) still hold when compared with those obtained using a larger number of sequences (say 100) from the same serum sample?
A second interesting question is the repeatability of results obtained with an experimental protocol that involves one RT and several independent PCR amplifications. One process that might introduce a bias in the PCR products is PCR drift (Wagner et al., 1994). This kind of bias could be due to stochastic variation in the early cycles of amplification and could result in poor repeatability in replicate PCR amplifications. Consequently, for many applications, such as molecular epidemiology or forensic studies, it is important to ascertain what levels of repeatability can be obtained with these sample sizes and techniques. Once again, our interest is not simply in the reproduction of the same raw sequences, since using such small sample sets makes it very unlikely to obtain exactly the same ones, but in the conclusions that can be derived from their analysis.
Lastly, and also as a consequence of the previous considerations, we are interested in which of two alternative strategies is better for obtaining a large number of individual sequences: cloning and sequencing a large number (namely 100) of amplified DNA products from a single PCR, or dividing the total number of sequences among several PCRs and cloning and sequencing a smaller number of products from each.
We have used a factorial design to analyse these questions. Since the initial level of genetic variability in each sample is a likely factor affecting diversity analyses, we decided to use four HCV-infected patients whose viruses covered a wide range of genetic variability. Our results indicate that essentially the same conclusions can be obtained from a moderately small sample set as from a large sample set, although, as expected, the larger the sample set the more detailed the description of the virus population will be.
METHODS
Amplifications were performed in a 100 µl volume containing 4 µl of the RT product, 10 µl 10× PCR buffer, 200 µM of each dNTP, 400 nM of each primer (sense, 5'-CGCCAYTGGACRACGCAA-3', positions 1230–1247 in the reference sequence accession no. M62321; antisense, 5'-RCAMCCRAACCAATTGCC-3', positions 1997–1980) and 2·5 U Pfu DNA polymerase (Stratagene). PCR was performed in a Perkin Elmer 2400 thermal cycler with the following thermal profile: 94 °C for 3 min, then 5 cycles at 94 °C for 30 s, 55 °C for 30 s and 72 °C for 3 min, followed by 35 cycles at 94 °C for 30 s, 52 °C for 30 s and 72 °C for 3 min. A final extension at 72 °C for 10 min was also carried out.
Amplification products were cloned directly into the EcoRV-digested pBluescript II SK (+) phagemid (Stratagene). Recombinant clones with our insert were selected by PCR-colony isolation and were purified by manual precipitation. Clones were sequenced using primers 5'-RGCCATCTTGGAYATGATYGC-3' (sense, positions 1367–1387) and 5'-YTTGGRGGGTAGTGCCARCARTA-3' (antisense, positions 1816–1794) and the ABI PRISM BigDye Terminator Cycle Sequencing Ready Reaction kit (Applied Biosystems) in an ABI 3700 automated sequencer (Applied Biosystems). Sequences were verified and both strands assembled using the Staden package (Staden et al., 1999). Sequences from the previous RT reaction were obtained in a similar manner, except that the same primers were used for amplification and sequencing, thus rendering 406-nt-long sequences from the same genome region.
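As a quick consistency check on the primer information above, the following Python sketch expands the degenerate sequencing primers with the standard IUPAC ambiguity codes and computes the number of positions lying between their 3' ends. It assumes that the primer coordinates are ranges on the reference sequence (1367–1387 and 1816–1794); the function names `expand` and `inner_length` are illustrative and are not part of any software used in the study.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand(primer):
    """Return every unambiguous sequence encoded by a degenerate primer."""
    return ["".join(p) for p in product(*(IUPAC[base] for base in primer))]

def inner_length(sense_3prime, antisense_3prime):
    """Number of positions strictly between the 3' ends of the two primers."""
    return antisense_3prime - sense_3prime - 1

# Sequencing primers as given in the text (5' to 3').
SEQ_SENSE = "RGCCATCTTGGAYATGATYGC"        # assumed to span positions 1367-1387
SEQ_ANTISENSE = "YTTGGRGGGTAGTGCCARCARTA"  # assumed to span positions 1816-1794

print(len(expand(SEQ_SENSE)), "variants of the sense primer")
print(len(expand(SEQ_ANTISENSE)), "variants of the antisense primer")
print(inner_length(1387, 1794), "nt between the primers")
```

Under these assumptions the interval between the two sequencing primers is 406 positions long, which agrees with the sequence length stated in the text.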
Statistical analysis.
For all analyses, 13 data sets for each patient were used. Of these, 10 corresponded to the 10 independent amplifications, nine with a sample size of around 10 sequences and one with a sample size of about 100 sequences. A further set was obtained by combining the nine sets of 10 sequences into a single set. Another set was composed of all of the sequences from the 10 independent amplifications. The last set corresponded to a previous study (unpublished data), in which we obtained a similar number of sequences (n=10) by the same procedure and from the same region for each sample, although from a different RT reaction. Hence, a test on the effect of the RT reaction was possible by comparing results from two different RT reactions. Also, we obtained information on the repeatability of the results, by comparing the nine samples of 10 sequences among themselves, on the effect of sample size, by comparing each of these with the samples of 100 sequences, and on the effect of sampling a similar number of sequences from a single amplification (of 100 sequences) or from different amplifications (nine amplifications of 10 sequences).
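The composition of the 13 data sets per patient can be sketched in Python as follows. This is only an illustration of the design described above: the function `build_data_sets`, the dictionary keys and the placeholder sequence labels are hypothetical, and the real sets contained "around" 10 and "about" 100 sequences rather than exact numbers.

```python
def build_data_sets(small_sets, large_set, previous_set):
    """Assemble the 13 data sets analysed for one patient."""
    if len(small_sets) != 9:
        raise ValueError("expected nine small amplification sets")
    # Nine independent amplifications of about 10 sequences each.
    data = {f"small_{i + 1}": list(s) for i, s in enumerate(small_sets)}
    # One amplification of about 100 sequences.
    data["large_1x100"] = list(large_set)
    # The nine small sets combined into a single set.
    data["pooled_9x10"] = [seq for s in small_sets for seq in s]
    # All sequences from the 10 independent amplifications.
    data["all_amplifications"] = data["pooled_9x10"] + data["large_1x100"]
    # Sequences from the earlier, independent RT reaction.
    data["previous_rt"] = list(previous_set)
    return data

# Placeholder labels standing in for real sequences.
small = [[f"amp{i}_clone{j}" for j in range(10)] for i in range(1, 10)]
large = [f"amp10_clone{j}" for j in range(100)]
previous = [f"prev_clone{j}" for j in range(10)]

data_sets = build_data_sets(small, large, previous)
print(len(data_sets), "data sets;",
      len(data_sets["all_amplifications"]), "sequences in the combined set")
```

With exactly 10 and 100 sequences per amplification, the combined set holds 190 sequences, in line with the "almost 200 sequences from each patient" mentioned in the Results.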
Genetic variation for each data set was evaluated using DNAsp, version 3.51 (Rozas & Rozas, 1999). Pairwise comparisons between data sets from the same patient were obtained with Arlequin, version 2000 (Schneider et al., 2000), as estimates of the population subdivision statistic Fst. The statistical significance for this statistic was evaluated by 1000 random permutations in each case. Phylogenetic trees were constructed using the neighbour-joining algorithm (Saitou & Nei, 1987) based on the general time reversible evolutionary model for nucleotide substitution (Posada & Crandall, 2001). These analyses were done with PAUP*, version 4.0b10 (Swofford, 1998). Estimates of synonymous and non-synonymous substitutions among sequences from each data set were obtained using the Nei–Gojobori method (Nei & Gojobori, 1986), as implemented in the program MEGA (Kumar et al., 2000).
Exact, unbiased estimates of P values in contingency tables were obtained using the Metropolis algorithm implemented in the program RxC (Miller, 1997).
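To make the logic of these analyses concrete, the following Python sketch computes nucleotide diversity as the mean pairwise proportion of differing sites and tests a simple Fst-like statistic between two data sets by random permutation of sequences. It is an illustration only: the estimator used here (one minus the ratio of within-set to between-set diversity) is a simplification chosen for brevity and is not claimed to be the one implemented in Arlequin, and all function names are hypothetical.

```python
import random
from itertools import combinations

def p_distance(a, b):
    """Proportion of aligned sites that differ between two sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def nucleotide_diversity(seqs):
    """Mean pairwise p-distance within one data set."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)

def fst(set1, set2):
    """Simplified Fst: 1 - (mean within-set diversity / between-set diversity)."""
    within = (nucleotide_diversity(set1) + nucleotide_diversity(set2)) / 2
    between = sum(p_distance(a, b) for a in set1 for b in set2)
    between /= len(set1) * len(set2)
    return 0.0 if between == 0 else 1 - within / between

def permutation_test(set1, set2, n_perm=1000, seed=0):
    """Return the observed Fst and a permutation P value."""
    rng = random.Random(seed)
    observed = fst(set1, set2)
    pooled = list(set1) + list(set2)
    hits = 0
    for _ in range(n_perm):
        # Reassign sequences to the two sets at random, keeping set sizes.
        rng.shuffle(pooled)
        if fst(pooled[:len(set1)], pooled[len(set1):]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy aligned sequences, not data from the study.
toy_a = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "ACGAACGT"]
toy_b = ["ACGTACGT", "ACGTTCGT", "ACGTACGA", "ACGTACGT"]
observed, p_value = permutation_test(toy_a, toy_b)
print(observed, p_value)
```

The permutation step mirrors the idea behind the 1000 random permutations mentioned above: if the two sets are samples of the same population, shuffling sequences between them should often produce a statistic at least as large as the observed one.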
RESULTS
For the four patients, nucleotide diversity was very similar for the nine data sets of 10 sequences, as well as for the one with 100 sequences and the one obtained by pooling the previous sets (9×10). Only for patient 16 was there a certain difference both in the number of different haplotypes and in nucleotide diversity, which were larger for the pooled set than for the large set (1×100). In all cases, estimates of nucleotide diversity for the large and pooled sets were intermediate among those obtained for the nine small data sets. A similar result was obtained when the estimates from the previous experiment were compared with this one, with the exception of patient 13, as already noted. It is also notable that although different haplotypes were sequenced, largely similar values of genetic variability were obtained. For instance, of the 113 different haplotypes sequenced from patient 21 when the large and the pooled sets were considered (56+57), only eight were coincident between both groups; the remaining 105 were different. The same pattern was obtained in the other patients.
A summary of the results from genetic differentiation analyses is shown in Table 2. Pairwise genetic differentiation analyses of the large sample set (1×100) with respect to each small size data set and the pooled set (9×10) from each patient were obtained. After correction for multiple, non-independent comparisons using Bonferroni's method (Miller, 1966), there were only two statistically significant Fst values and both corresponded to patient 16. One of them was from one of the small samples (series 1603) and the other corresponded to the pooled sample (9×10). This result was largely due to differences arising in three different data sets (series 1603, 1605 and 1607). In these cases, the intergroup component of variation was close to or even larger than 10 %, whereas in none of the other sets for this patient was it larger than 5 %.
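The Bonferroni correction used here divides the nominal significance level by the number of comparisons. The short Python sketch below illustrates this arithmetic. It assumes a nominal level of 0.05 and ten comparisons per patient (the large set against each of the nine small sets and against the pooled set); neither figure is stated explicitly in the text, and the P values are invented for illustration and are not those of Table 2.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test threshold and a significance flag for each P value."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]

# Hypothetical P values for ten comparisons against the large set.
example_p = [0.001, 0.020, 0.300, 0.650, 0.004,
             0.110, 0.480, 0.090, 0.720, 0.040]

threshold, flags = bonferroni(example_p)
print("per-test threshold:", threshold)
print("significant comparisons:", [i for i, f in enumerate(flags, 1) if f])
```

In this made-up example, three values would pass an uncorrected 0.05 cut-off that no longer do so after correction (0.020 and 0.040 among them), which shows how the correction guards against false positives when many comparisons are made.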
Our final test for the homogeneity of sequences obtained from different transformation experiments came from their phylogenetic analysis. If there were substantial differences among the data sets obtained from different amplification experiments, then we would expect to obtain highly structured phylogenetic trees, with most sequences derived from each set grouped into separate clusters. The phylogenetic trees obtained for the different haplotypes from the almost 200 sequences from each patient are shown in Fig. 1 and the frequency distribution of sequences from each set into haplotypes is shown in Table 3. As expected, each phylogenetic tree reflected the genetic variability levels described previously, with a higher degree of branching in the tree for patient 21, the one with the largest variability. Nevertheless, in all cases, it was evident that sequences derived from any data set did not group into separate clusters and, instead, they mixed in quite a random manner. This was also true for the sequences obtained in the previous experiment, derived from an independent RT reaction followed by PCR amplification, as in the four phylogenetic trees they grouped similarly to the other small sample size data sets from the corresponding patient.
Rates of synonymous and non-synonymous substitutions were not significantly different among data sets from each patient (data not shown, available upon request), although there were significant differences among patients. These differences correlate with the levels of variability described previously, especially for non-synonymous substitutions, ranging from an average of 0·0005 substitutions per site (s s⁻¹) for patient 13 to 0·0487 s s⁻¹ for patient 21 (0·0015 for patient 45 and 0·0185 for patient 16). Interestingly, synonymous substitutions were less variable among patients, with average values of 0·0043 s s⁻¹ for patient 13, 0·0051 s s⁻¹ for patient 45, 0·0053 for patient 16 and 0·0295 for patient 21.
DISCUSSION
Our genetic differentiation and phylogenetic analyses reflect the existence of a substantial homogeneity among the data derived from the different series obtained in each of the four patients included in our study. No significant differences were observed when the genetic variability parameters obtained from small sample data sets were compared to those obtained from the corresponding large ones. In all cases, we found that large data set values were intermediate among those obtained from the small sets, thus indicating a lower accuracy for the estimates derived from the latter. Although, as expected, the larger the size of a data set the more precise the derived estimates are, according to our results, the variation found among small sample sets and between these and the large ones was not significant. Hence, our study shows that it is adequate to use relatively small sample sizes to evaluate genetic variability in virus populations by means of RT, PCR amplification, cloning and sequencing of recombinant plasmids.
Comparisons among the previous conclusions from the four patients with different levels of variability included in our study show that this variability has no influence on what we have just considered. There is neither a patient effect for the different sample sizes nor an interaction with respect to the level of variability of the samples analysed, at least for the range of variability we have worked with.
Since we have found a considerable similarity between the data derived from the large data set obtained from a single amplification and those obtained with the pooled series from different amplifications, our results also indicate that both strategies employed to obtain large samples are equally valid. Therefore, the choice between methods can be based upon other considerations.
The comparison to an additional data set of the same patients from a different, previous experiment allowed us to test the role that the RT reaction could play as a biasing factor with regard to the reproducibility of the data. We found consistency between the conclusions extracted from data sets from two different RT reactions, as the previous set was indistinguishable from the small size sets derived from the same RT reaction in this experiment. The only difference was observed for patient 13, in which the previous experiment sample showed no variation, whereas all samples derived in the new experiment harboured at least two variants. However, there is no statistical significance in the differences, again due to the small sample sizes used. This result is relevant for those cases in which an independent validation of the results obtained in a laboratory has to be performed in a different one. In these cases, it is not the absolute identity of the sequences obtained from both settings that should be expected. Rather, it is the concordance in the genetic variability parameters and phylogenetic relationships that should be compared. We must emphasize that these conclusions hold only for general evaluations of variability. In any case, the search for specific variants in the virus population in different experiments should provide identical results.
Furthermore, our experimental design allowed us to address another important issue in the estimation of genetic variability in virus populations, i.e. the error introduced by random preferential amplification of some variants by DNA polymerases used in PCR. Even a relatively low preferential amplification during the first rounds would lead to increased frequency estimates for some variants as a result of the exponential growth in subsequent replication rounds. Our data do not provide support for this, since there is homogeneity in the distribution of variants among different data sets for the same patient, even including a set from a separate RT reaction. Consequently, for the estimation of genetic variability of HCV in these patients, it would be legitimate to pool the data from all the sets, thus obtaining a more accurate estimate of the true value in each case.
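The effect of preferential amplification in the first rounds can be illustrated with a deliberately simple, deterministic Python sketch. It assumes two variants, A and B, whose per-cycle amplification efficiencies differ only during the first few cycles and are equal afterwards, so that the ratio reached early on is carried unchanged through the remaining cycles. The function name and the efficiency values are hypothetical, and the stochastic nature of PCR drift is not modelled.

```python
def frequency_after_bias(f0, eff_a, eff_b, biased_cycles):
    """Frequency of variant A after `biased_cycles` cycles of unequal efficiency.

    f0            -- initial frequency of variant A (0 < f0 < 1)
    eff_a, eff_b  -- per-cycle efficiencies (0..1) of variants A and B
    Later cycles with equal efficiency multiply both variants by the same
    factor and therefore leave this frequency unchanged.
    """
    ratio = f0 / (1 - f0) * ((1 + eff_a) / (1 + eff_b)) ** biased_cycles
    return ratio / (1 + ratio)

# Illustrative values only: equal starting frequencies, with a small
# efficiency advantage for variant A during the first five cycles.
unbiased = frequency_after_bias(0.5, 0.90, 0.90, 5)
biased = frequency_after_bias(0.5, 0.95, 0.85, 5)
print(unbiased, biased)
```

Under these assumptions, any early advantage shifts the final frequency estimate of the favoured variant, and different replicates would favour different variants if the bias were random. The homogeneity observed among replicate amplifications argues against a strong effect of this kind in the present data.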
In summary, our main conclusion is that although the raw data in the different sets were all distinct, we have found a great consistency between the conclusions derived from them, not only in genetic variability but also in phylogenetic relationship estimates. This consistency is maintained regardless of sample size or the amplification and cloning strategy used.
ACKNOWLEDGEMENTS
REFERENCES
Domingo, E. (2002). Quasispecies theory in virology. J Virol 76, 463–465.
Domingo, E. & Holland, J. J. (1997). RNA virus mutations and fitness for survival. Annu Rev Microbiol 51, 151–178.
Drake, J. W. & Holland, J. J. (1999). Mutation rates among RNA viruses. Proc Natl Acad Sci U S A 96, 13910–13913.
Fodde, R. & Losekoot, M. (1994). Mutation detection by denaturing gradient gel electrophoresis (DGGE). Hum Mutat 3, 83–94.
Holmes, E. C. & Moya, A. (2002). Is the quasispecies concept relevant to RNA viruses? J Virol 76, 460–465.
Kumar, S., Tamura, K., Jakobsen, I. B. & Nei, M. (2000). MEGA: Molecular Evolutionary Genetics analysis, version 2.0b3. Institute of Molecular Evolutionary Genetics, Arizona State University, Arizona, USA.
Leitner, T., Halapi, E., Scarlatti, G., Rossi, P., Albert, J., Fenyo, E. M. & Uhlén, M. (1993). Analysis of heterogeneous viral populations by direct DNA sequencing. Biotechniques 15, 120–127.
Miller, R. G. (1966). Simultaneous Statistical Inference. New York: McGraw-Hill.
Miller, M. P. (1997). RxC, a program for the analysis of contingency tables. Northern Arizona University, Flagstaff, USA.
Moya, A., Elena, S. F., Bracho, A., Miralles, R. & Barrio, E. (2000). The evolution of RNA viruses: a population genetics view. Proc Natl Acad Sci U S A 97, 6967–6973.
Nei, M. & Gojobori, T. (1986). Simple methods for estimating the number of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 3, 418–426.
Neumann, A. U., Lam, N. P., Dahari, H., Gretch, D. R., Wiley, T. E., Layden, T. J. & Perelson, A. S. (1998). Hepatitis C viral dynamics in vivo and the antiviral efficacy of interferon-α therapy. Science 282, 103–107.
Posada, D. & Crandall, K. A. (2001). Selecting the best-fit model of nucleotide substitution. Syst Biol 50, 580–601.
Resch, W., Parkin, N., Stuelke, E. L., Watkins, T. & Swanstrom, R. (2001). A multiple-site-specific heteroduplex tracking assay as a tool for the study of viral population dynamics. Proc Natl Acad Sci U S A 98, 176–181.
Rodrigo, A. G., Goracke, P. C., Rowhanian, K. & Mullins, J. I. (1997). Quantitation of target molecules from polymerase chain reaction-based limiting dilution assays. AIDS Res Hum Retroviruses 13, 737–742.
Rozas, J. & Rozas, R. (1999). DNAsp, version 3: an integrated program for molecular population genetics and molecular evolution analysis. Bioinformatics 15, 174–175.
Saitou, N. & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4, 406–425.
Schneider, S., Roessli, D. & Excoffier, L. (2000). Arlequin, version 2000: a software for population genetics data analysis. Genetics and Biometry Laboratory, University of Geneva, Switzerland.
Spinardi, L., Mazars, R. & Theillet, C. (1991). Protocols for an improved detection of point mutations by SSCP. Nucleic Acids Res 19, 4009.
Staden, R., Beal, K. & Bonfield, J. (1999). The Staden package, 1998. In Computer Methods in Molecular Biology, pp. 115–130. Edited by S. Misener & S. Krawetz. Totowa: Humana Press.
Swofford, D. L. (1998). PAUP*. Phylogenetic Inference Using Parsimony (* and other methods), version 4.0b10. Sunderland, MA: Sinauer Associates.
Taswell, C. (1981). Limiting dilution assays for the determination of immunocompetent cell frequencies. J Immunol 126, 1614–1619.
Wagner, A., Blackstone, N., Cartwright, P. & 7 other authors (1994). Surveys of gene families using polymerase chain reactions: PCR sequences and PCR drift. Syst Biol 43, 250–261.
Woodward, T. M., Carlson, J., McClelland, C. & DeMartini, J. C. (1994). Analysis of lentiviral genomic variation by denaturing gradient gel electrophoresis. Biotechniques 17, 366–371.
Received 4 April 2003;
accepted 9 May 2003.