Network analysis of human and simian immunodeficiency virus sequence sets reveals massive recombination resulting in shorter pathways

Simon Wain-Hobson¹, Céline Renoux-Elbé¹, Jean-Pierre Vartanian¹ and Andreas Meyerhans²

¹ Unité de Rétrovirologie Moléculaire, Institut Pasteur, F-75724 Paris cedex 15, France
² Department of Virology, University of the Saarland, D-66421 Homburg, Germany

Correspondence
Simon Wain-Hobson
simon{at}pasteur.fr


   ABSTRACT
The intrinsic recombination rate of human immunodeficiency virus (HIV) exceeds the point mutation rate by a factor of 10. As the majority of infected cells in vivo harbour multiple proviruses, the stage is set for rampant recombination. Therefore, it may be presumed that phylogenic relationships and mutation frequencies will probably be affected by recombination. However, the proportion of homoplasies arising from recombination and mutation is not known. By studying the evolution of the hypervariable regions of the simian immunodeficiency virus envelope gene among four macaques, it is shown that homoplasies arise more from recombination than from point mutation. When recombination is accounted for, the minimum number of substitutions in a sequence set may be reduced by as much as 45 %. In fact, the true number of point mutations in a set of HIV sequences tends to the number of discrete substitutions. Hence, lineages are younger than anticipated previously, although not in proportion to the ratio of the intrinsic recombination/point mutation rate. Recombination also inflates codon polymorphisms.


   INTRODUCTION
Recombination is a common feature among viruses. Only for the negative-stranded RNA viruses is the phenomenon infrequent (Plyusnin et al., 2002; Sibold et al., 1999). For retroviruses, the initial description of recombination goes back more than 30 years (Kawai & Hanafusa, 1972; Vogt, 1971; Weiss et al., 1973). They are inherently recombinogenic. This follows from there being two copies of genomic RNA in a virion, while reverse transcriptase (RT) is of low processivity. As a consequence, the nascent strand may switch RNA templates during reverse transcription, giving rise to a mosaic provirus (Coffin, 1979).

The first estimation of a retrovirus recombination rate, i.e. the number of RNA crossovers in a single round of replication, was made for spleen necrosis virus (SNV), an oncogenic avian retrovirus. The rate was ~0·2 crossovers per genome per round (Hu & Temin, 1990; Zhang & Temin, 1993). In contrast, the recombination rate for the human immunodeficiency virus type 1 (HIV-1) lentivirus is about three crossovers per genome per round (Jetzt et al., 2000; Yu et al., 1998), more than 10-fold the rate for SNV. Importantly, for both viruses these recombination rates are 4- to 10-fold greater than the overall point mutation rates, which are ~0·05 per genome per round for SNV (Pathak & Temin, 1990) and ~0·25 per genome per round for HIV-1 (Mansky & Temin, 1995). In other words, by the time a mutation is made, template switching has occurred between 4 and 10 times.
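The ratios quoted above can be checked from the cited per-genome rates. The following is a minimal sketch that uses only the figures given in the text; the dictionary and variable names are illustrative.

```python
# Per-genome, per-round rates quoted in the text.
rates = {
    "SNV": {"recombination": 0.2, "mutation": 0.05},
    "HIV-1": {"recombination": 3.0, "mutation": 0.25},
}

# Number of crossovers expected per point mutation, for each virus.
ratios = {virus: r["recombination"] / r["mutation"] for virus, r in rates.items()}

# HIV-1 recombination rate relative to that of SNV.
hiv_vs_snv = rates["HIV-1"]["recombination"] / rates["SNV"]["recombination"]

for virus, ratio in ratios.items():
    print(f"{virus}: ~{ratio:.0f} crossovers per point mutation")
print(f"HIV-1 versus SNV recombination rate: ~{hiv_vs_snv:.0f}-fold")
```

With these inputs the ratio is 4 for SNV and 12 for HIV-1, in line with the rounded 4- to 10-fold range used in the text.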

Recombination is found at all levels of HIV genetics. Some HIV-1 M strains in global circulation are clearly composites of viruses from two to three clades (Carr et al., 1998; Hoelscher et al., 2001; McCutchan, 2000), while a few recombinants between HIV-1 M and O have been described (Peeters et al., 1999; Takehisa et al., 1999). Viral segments amplified from isolates or patient material have revealed numerous examples of recombination (Cheynier et al., 2001; Gratton et al., 2000; Vartanian et al., 1991). In an experimental setting, wild-type simian immunodeficiency virus (SIVmac) could be recovered from peripheral blood of macaques co-infected 15 days earlier by two viruses carrying deletions in either the vif or the nef genes (Wooley et al., 1997).

A recent report showed that the majority of HIV-1-infected splenocytes from two individuals harboured between one and eight proviruses, the mean being three to four per cell (Jung et al., 2002). Depending on the sequence divergence among proviruses within a single cell, the impact of recombination may be undetectable or easily identifiable. For example, if the two genomic RNAs packaged in a retrovirus particle were identical, it would be impossible to identify a recombinant. However, if the two genomes were substantially different, then recombination could be readily discerned. In the above study on single splenocytes, up to 20–30 % amino acid sequence variation was noted within the first two hypervariable regions of Env (Jung et al., 2002), including numerous recombinants. Multiple genetically divergent proviruses per cell means that the budding virions may contain RNA genomes derived from two different proviruses. Given m provirus copies per cell, there are m(m+1)/2 distinct ways to randomly assort m different genomic RNAs, among which will be m(m-1)/2 heterokaryons. For example, when m=4, there are 10 ways to reassort the genomes, of which six will be heterokaryons. Given a recombination rate of three crossovers per genome per cycle, virtually all heterokaryons will give rise to a unique mosaic structure.
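The counting argument for m proviruses per cell can be verified by enumeration. The sketch below assumes, as the text does, that two genomic RNAs are packaged per virion and that the order of the two RNAs does not matter; the function name is illustrative.

```python
from itertools import combinations_with_replacement

def packaging_pairs(m):
    """Enumerate unordered pairs of genomic RNAs drawn from m distinct proviruses."""
    proviruses = range(1, m + 1)
    pairs = list(combinations_with_replacement(proviruses, 2))
    # Pairs made of two different genomes (called heterokaryons in the text).
    mixed = [pair for pair in pairs if pair[0] != pair[1]]
    return pairs, mixed

pairs, heterokaryons = packaging_pairs(4)
print(len(pairs), len(heterokaryons))  # 10 6
```

For any m the two counts agree with m(m+1)/2 and m(m-1)/2, the difference being the m virions that carry two copies of the same genome.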

Recombination can generate homoplasies, i.e. the same character state arising in different genomes. However, homoplasies can also arise by independent RT misincorporation, which will be referred to as point mutation. Given that the HIV-1 recombination rate is 10-fold greater than the rate of point mutation, when in doubt it may be reasonable to rule in favour of recombination underlying any homoplasy as opposed to point mutation. Yet the overall rate of retrovirus point mutation is of course an average; the Km and Vmax of some substitutions, particularly transitions, will be greater than others and context dependent (Ricchetti & Buc, 1990). Thus, it might be argued that a few sites, constituting hot or warm spots, might rival recombination as an explanation for the existence of homoplasies. The question requiring an empirical answer is, do homoplasies in lentiviral genomes arise mainly from recombination or mutation?

If rampant recombination generates large numbers of homoplasies then what are the effects on branch lengths? Certainly, standard phylogenic methods ignore recombination. Computer simulations in which sequences were recombined show that branch lengths were overestimated (Schierup & Hein, 2000). Network analyses such as the SplitsTree program are more appropriate to describing homoplasies, recombination and sequence space (Bandelt & Dress, 1992; Eigen & Nieselt-Struwe, 1990; Huson, 1998). Yet sequence space is enormous. For example, 100 variable sites – typical of different molecular clones from the same isolate – means connecting 4^100 (~10^60) points in sequence space. Inevitably, constraints have been introduced into the SplitsTree program. For example, no hypercube presentation of the sequence data is possible. Nonetheless, useful information can still be recovered (Cheynier et al., 2001; Kils-Hütten et al., 2001; Plikat et al., 1997). Here, it is shown that for HIV and SIV, homoplasies are mainly the result of recombination. The implications for the analysis of virus evolution are discussed.
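The size of sequence space for 100 variable sites follows from four possible bases at each site. A short check of the order of magnitude (variable names are illustrative):

```python
import math

variable_sites = 100  # figure quoted in the text
bases = 4             # A, C, G and T

points = bases ** variable_sites               # exact integer
exponent = variable_sites * math.log10(bases)  # log10 of the number of points

print(f"{len(str(points))} digits, i.e. ~10^{exponent:.0f}")
```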


   METHODS
The HIV-1 and SIV sequence data sets have been published previously (Jung et al., 2002; Pelletier et al., 1995; Plikat et al., 1997; Vartanian et al., 1997). Nucleic acid sequences were aligned using the multiple sequence alignment algorithm, as implemented in CLUSTAL W (Thompson et al., 1994). Gap penalty parameters were set to 3·0 for opening a new gap and 0·05 for extension of an existing gap; the output format was set to multisequence format. Sequence alignments were translated to NEXUS format using a modified version of READSEQ and used as input for SplitsTree, version 2.4 (Bandelt & Dress, 1992; Huson, 1998). When indels were present, sequences were compiled and coded using INDELCODE (Cheynier et al., 2001). For each data set, the most parsimonious path was mapped out on the phylogram.

Consider a set of four closely related sequences (Fig. 1A). Sequences I and N differ by single substitutions (A2G and T6C) from the parental sequence H, while D encodes the same two substitutions. These sequences can be placed at the vertices of a square of unit length of one mutation (Fig. 1B). D can be derived from either I or N by an independent transition, either A2G or T6C (Fig. 1C). In this representation, the same substitution at the same site, yet in a different genome, i.e. homoplasy, shows up as parallel lines. The alternative solution is that sequence D arose by recombination of sequences I and N, somewhere between bases 2 and 6 (Fig. 1C). It is clear that the results of the two different molecular events, recombination and point mutation, are indistinguishable by a posteriori sequence analyses.
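The four-sequence example can be reproduced with a few lines of code. This is a sketch only: the 7-base sequences are hypothetical (only the A2G and T6C changes follow the text), and a minimum spanning tree over Hamming distances is used here as a simple stand-in for the minimum path length without recombination. It is not the SplitsTree procedure.

```python
# Hypothetical sequences; only sites 2 and 6 (1-based) follow the text.
seqs = {
    "H": "CATGGTC",  # parental: A at site 2, T at site 6
    "I": "CGTGGTC",  # A2G
    "N": "CATGGCC",  # T6C
    "D": "CGTGGCC",  # A2G and T6C
}

def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def mst_length(seqs):
    """Total length of a minimum spanning tree over Hamming distances (Prim)."""
    names = list(seqs)
    in_tree = {names[0]}
    total = 0
    while len(in_tree) < len(names):
        dist, nxt = min(
            (hamming(seqs[a], seqs[b]), b)
            for a in in_tree
            for b in names
            if b not in in_tree
        )
        in_tree.add(nxt)
        total += dist
    return total

def crossover_points(p1, p2, child):
    """Breakpoints k (0-based) for which child == p1[:k] + p2[k:]."""
    return [k for k in range(1, len(child)) if p1[:k] + p2[k:] == child]

print(mst_length(seqs))                                   # 3
print(crossover_points(seqs["I"], seqs["N"], seqs["D"]))  # [2, 3, 4, 5]
```

Three substitutions are needed to connect the set by point mutation alone, whereas a single crossover anywhere between sites 2 and 6 of I and N yields D with only the two substitutions already present.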



Fig. 1. Homoplasies arising from recombination and point mutation are indistinguishable. (A) A set of four sequences (H, I, N and D) involving transitions at sites 2 and 6, where H is arbitrarily taken as origin. (B) SplitsTree representation of four sequences, homoplasies being shown as parallel edges. (C) Most parsimonious path lengths are drawn over the network in boldface arrows. A minimum of three transitions are necessary. However, if sequences I and N are allowed to recombine between bases 2 and 6 they can generate sequence D. This possibility is shown by broken lines. Importantly, it requires only two substitutions and a single crossover. For HIV, the recombination rate is ~10 times the point mutation rate (Jetzt et al., 2000; Mansky & Temin, 1995; Yu et al., 1998). Were these sequences derived from HIV, then inferring a pathway involving recombination would be the more parsimonious solution.

 
Importantly, for a set of sequences, SplitsTrees discard substitutions at sites that violate the constraints of the program rather than discard complete sequences (Bandelt & Dress, 1992; Huson, 1998). The proportion of sites removed is indicated in what is called the fit. When all substitutions at all variable sites are retained, the fit is 100. When sites are stripped, the fit falls approximately in proportion to the number of sites stripped. The topology of the sequences is correct for the subset of variable sites retained but information has been removed. This is illustrated in Fig. 2. The 16 taxa set describes 11 site polymorphisms in six codons of the HIV-1 nef gene (Plikat et al., 1997) that highlight the problem (Fig. 2A). The taxa have N465 as their origin, although this information is not necessary for generating a SplitsTree. When all 16 taxa were included, a dendrogram with eight mutations was generated (Fig. 2B). Yet simple inspection of the sequences suffices to show that sequences 465 and 482 are not identical. Indeed no two sequences are the same. The fit of ~59 indicates that some sites have been stripped, notably sites 43, 404 and 563. As a general rule, the lower the fit the more a collection of sequences resembles a star phylogeny, Fig. 2(B) being typical.



Fig. 2. Effects of excluding polymorphic sites and sequences in SplitsTree analyses. A collection of 16 nef sequences (Plikat et al., 1997) have been pared down to six polymorphic codons selected to show the effects of site or sequence stripping. (A) Sequence N465 is the origin while the numbering of polymorphic nucleotides in the codons is with respect to the first base of the nef ATG codon. Only differences with respect to N465 are shown. (B) SplitsTree with all 16 sequences or taxa. The network is star-like as the fit is <<100 %. Only eight substitutions were incorporated, sites 43, 404 and 563 having been stripped. Because of this, a number of sequences are apparently identical when in fact no two sequences are, see (A). (C) As an alternative, removal of just one sequence, N504, gave rise to a complex network including all 11 polymorphic sites. (D) N504 differs from N500 by a single A563G transition, which was already present in the network and is highlighted in bold (see Fig. 3C). This substitution can be added back manually, so generating three sides of a square with a common vertex, something the SplitsTree program cannot do. In the present example, excluding a single sequence generates a fuller tree in which all informative sites were represented.

 
An alternative to removing sites is to remove the minimum number of sequences while retaining a fit of 100. In the present example, removing just sequence N504 resulted in a 15 taxa network with a fit of 100 (Fig. 2C). Furthermore, all 11 polymorphic sites were incorporated. Inspection of N504 shows that it differs from N500 by a single A563G transition (Fig. 2A). Yet this transition was already extensively represented in the network (Fig. 2C, highlighted in bold). The A563G transition coupling N504 to N500 can be added back manually to the figure, so generating a structure with three faces of a cube with a common vertex (Fig. 2D), something that the SplitsTree program cannot construct. Not surprisingly, therefore, a cube cannot be constructed either, even though it merely represents all transitions from, for example, AAA to GGG. However, if two sequences at adjacent or opposing vertices of the cube are removed, SplitsTrees can describe six of the eight sequences making up a cube as adjacent squares or a regular hexagon, respectively.

In contrast to site stripping, which may lead to erroneous topologies (Alber et al., 2001; Barbrook et al., 1998; Holmes et al., 1999a, b; Smith et al., 1999) (Fig. 2B), eliminating the minimum number of sequences has the advantage of generating a correct topology for those sequences retained. This trial and error solution has been used in a few studies to analyse mutation rates and matrices among sets of closely related HIV or SIV sequences (Cheynier et al., 2001; Kils-Hütten et al., 2001; Plikat et al., 1997).


   RESULTS
From a posteriori sequence analyses, it is not possible per se to distinguish between a homoplasy arising from recombination or point mutation. Experimental data are necessary. One way of distinguishing between the two is to compare the evolution of the same sequence in different macaques. If there are homoplasies due to point mutation, perhaps resulting from hot or warm spots intrinsic to a particular molecular clone or virus strain, then they should show up in different animals. However, genomes from different macaques kept in isolation simply cannot recombine with each other. Hence, the comparison of sequences from different animals should reveal homoplasies resulting uniquely from point mutations.

In a previous study of the evolution of the first two hypervariable regions (V1V2) of the SIV envelope protein, data from four animals at two time-points were reported (Pelletier et al., 1995). The sequences were under very weak selection pressure, a finding reiterated for an analysis of the equivalent regions in the HIV-1 Env protein (Kils-Hütten et al., 2001). The interesting feature of the SIV data set was the fact that the infections were initiated by inoculation with DNA of an infectious molecular clone. Accordingly, all the sequences were derived from the same founder sequence and not a collection of variants derived during preparation of stock virus that would otherwise be used for inoculation.

For three animals, 20-188, 20-526 and 20-402, virtually all sequences could be incorporated individually in SplitsTrees with a fit of 100 (Fig. 3A–C). There was evidence of some parallelograms but they were generally few. In contrast, for animal i-963, there was far greater evidence of network formation (Fig. 3D). Homoplasies, i.e. the same substitution at the same site, among sequences from the four animals are highlighted in colour. There were 15 homoplasies among a total of 125 discrete substitutions, 14 shared between two animals and one (A68G) between three animals. However, this A68G transition was present only in sequences taken at 19 months post-infection, indicating that it is hardly a hot spot of mutation. Equally, a substitution connecting a taxon to the parental SIVmac239 in one tree was rarely connected to SIVmac239 in another tree, again indicating that they cannot be described as hot spots. Overall, the fraction of true homoplasies due to point mutations between animals represents a minority of the total number of substitutions (excess substitutions = 14×1 + 1×2 = 16; excess/total discrete substitutions = 16/125, ~13 %), indicating that this is probably also true of substitutions within an animal. Homoplasies between animals could be broken down into 10 non-synonymous substitutions, four synonymous substitutions and one transversion generating an in-phase stop codon. Such a distribution is close to that expected from weak purifying selection, as noted before (Pelletier et al., 1995).
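The excess-substitution arithmetic can be laid out explicitly. In this sketch a substitution observed in k animals is counted as having arisen independently k - 1 extra times; the variable names are illustrative and the counts are those given in the text.

```python
# Homoplasies between animals, keyed by the number of animals sharing them.
shared_by = {2: 14, 3: 1}
total_discrete_substitutions = 125

# Each substitution seen in k animals implies k - 1 additional independent events.
excess = sum(count * (k - 1) for k, count in shared_by.items())
fraction = excess / total_discrete_substitutions

print(excess, f"{fraction:.1%}")  # 16 12.8%
```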



Fig. 3. SplitsTrees of SIVmac239 env V1V2 region nucleotide sequences in four rhesus macaques (A–D). After exclusion of highly defective G->A hypermutated sequences, all but one taxon were incorporated for three monkeys with little evidence of networks. For the fourth monkey, i-963, the tree using all 41 taxa had a fit of only 62·9 % and was star-like (data not shown). For the network shown in (D), sequences 108, 112, 113, 116, 201, 204, 205, 221 and 224 were excluded (Pelletier et al., 1995). The SIVmac239 reference sequence was included in the tree, even though it was not found among the 41 sequences obtained. Sequences were distinguished by the prefix 1 and 2 according to whether they were derived at 14 or 19 months (9 and 14 months for i-963) post-infection. The curve links a group of six taxa connected to sequence 114, which were detached to facilitate visualization of the network. The 15 homoplasies between the four animals are identified in colour. These were decomposed into 11 non-synonymous (including one nonsense mutation) and four synonymous substitutions, a distribution typical of that expected from very weak purifying selection. The synonymous A238G transition is part of the i-963 network.

 
Fig. 4(A) describes the minimum path length connecting the 33 taxa in the i-963 network in the absence of recombination. A total of 64 substitutions was necessary, including 11 homoplasies highlighted in red. As only three of them were noted in the three other monkeys, the majority of homoplasies in this network were unique to the animal. As there is little evidence for hot spots of substitution, the extensive networking for monkey i-963 must reflect mainly the effects of recombination.



Fig. 4. Minimum path lengths with and without recombination traced over the i-963 network. A cluster of six sequences has been detached from the network to enhance clarity. They are linked to sequence 114 by a curve. (A) Path length without recombination; the 11 homoplasies are highlighted in red. (B) Putative recombination pathways are indicated by orthogonal converging arrows. The minimum number of crossovers necessary was 12, although the actual number could be greater. For this network, ignoring recombination inflated the minimum path length by 11/53 or 21 %.

 
As diversity is needed to identify recombination, it is not surprising that monkey i-963, which showed the greatest degree of variation, also showed the greatest extent of networking. These observations concord well with the 10-fold higher rate of HIV-1 recombination over mutation. As soon as a mutation is made during provirus formation, it will be spread to other genomes via reassortment and crossing over in the next round. When counting the number of mutations in a data set, recombination will inflate the number of discrete substitutions. A trivial and hypothetical example was illustrated in Fig. 1. Without recombination, three substitutions were necessary to connect all sequences in the set. With recombination, only two substitutions were necessary. What is the situation for a real data set? Taking the i-963 network as an example, the minimum path length connecting the 33 SIV V1V2 env sequences in the network was 64 substitutions long (Fig. 4A), of which 11 were homoplasies (shown in red). If recombination is allowed, then a number of sequences are clearly recombinants, for example, 110, 119, 202 and 208 to cite a few (Fig. 4B). In this representation, putative recombination pathways are shown by converging orthogonal arrows (shown in red). The minimum path length over the same network is reduced to 53 substitutions. In other words, ignoring recombination inflated the path length by ~21 %.

Previous work has shown that the analogous region of the HIV-1 env gene is also diversifying under weak selection and shows extensive networks by SplitsTree analyses (Cheynier et al., 2001; Kils-Hütten et al., 2001). Furthermore, the same region was studied for HIV-1 DNA from laser microdissected single cells (Jung et al., 2002). When analysed by SplitsTree, the sequence sets from individual cells showed extensive networking (Fig. 5A, B). Given that homoplasies in the SIV V1V2 locus arose mainly by recombination, it is reasonable to assume that these networks too arose mainly by recombination. As before, the minimum path lengths were shorter when recombination was factored in, by 16 and 9 % for the two examples in Fig. 5(A, B), respectively.



Fig. 5. Minimum path lengths for HIV-1 env V1V2 sequences derived from laser microdissected single nuclei. (A) SplitsTree network (fit=100) for nine sequences from cell B9. Bold type identifies the minimum path length without and with recombination. (B) A comparable network for nine sequences from cell B7 (fit=100). Ignoring recombination inflated the path lengths by 5/31 (16 %) and 3/31 (9 %) for these two cases, respectively. Two lone branches leading to sequences B9-4* and B7-4 are encircled as they are, in fact, part of a fuller network (see Fig. 6).

 
By analysing together sequences from three infected cells from the same patient (B7, B9 and B10), it was possible to construct a broader network with a fit of 100 (Fig. 6). The minimum path length involved 45 substitutions, which were reduced in number to 31 when recombination was factored in; this can be viewed as an inflation of 45 % when ignoring recombination. The most striking examples of recombination are exemplified by B10-4 and B7-4. Both could be explained by a single recombination between sequences en route for B9-4* and B7-2*. Indeed B7-4 is a ‘pure’ recombinant without further additional point mutations with respect to putative ancestors, B9-4* and S (Fig. 6). B10-4 is a recombinant that could have had missing intermediates Q and R as ancestors (Fig. 6).
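The inflation figures are the number of substitutions saved by allowing recombination, expressed relative to the shorter path. A minimal sketch using the path lengths quoted for the i-963 network and for the combined three-cell network (function and dictionary names are illustrative):

```python
def inflation(without_recombination, with_recombination):
    """Fractional overestimate of the path length when recombination is ignored."""
    return (without_recombination - with_recombination) / with_recombination

# (minimum path without recombination, minimum path with recombination)
networks = {
    "SIV i-963 (Fig. 4)": (64, 53),
    "HIV-1 cells B7, B9 and B10 (Fig. 6)": (45, 31),
}

for name, (without, with_rec) in networks.items():
    saved = without - with_rec
    print(f"{name}: {saved}/{with_rec} = {inflation(without, with_rec):.0%}")
```

This gives 11/53 (21 %) and 14/31 (45 %), the two values reported for these networks.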



Fig. 6. Increasing the number of taxa in a network reveals further recombination. By combining sequences from different cells from the same patient, 13 taxa encoding 45 mutations could be built into a broad network (fit=100). Without recombination, the network included 14 homoplasies (top). Allowing as few as five crossovers, the minimum path length was 31 mutations (bottom). Some hypothetical intermediates that could have recombined are noted as Q, R and S. For this network, ignoring recombination inflated the minimum path length by 14/31 or 45 %. Sequences B9-4* and B7-4 are encircled (see Fig. 5).

 
For branches not in a network, it may not be concluded that they represent a series of consecutive point mutations. Firstly, the recombination rate is 10-fold greater than the point mutation rate, meaning that in all probability, recombination must be suspected. Secondly, some intermediates may either have escaped sampling or have become extinct. Referring back to Fig. 5(A, B), the unique branch lengths connecting sequences B9-4* (Fig. 5A, encircled) and B7-4 (Fig. 5B, encircled), respectively, to their networks cannot be prima-facie evidence of a series of point mutations but artificial lineages arising from insufficient sampling. In contrast, the branch of five substitutions leading to B7-2* (Fig. 5B) remained a branch in Fig. 6. It may be supposed that with greater sampling this too might prove to be part of a network. Given that only a trivial fraction of HIV genomes in an individual are sampled, lone branches would seem inevitable.

When may mutation generate extensive networking? Monotonous G->A hypermutation of retroviral genomes occurs during negative-strand DNA synthesis under conditions of an unbalanced dTTP and dCTP concentration (Martinez et al., 1994; Vartanian et al., 1997). Overall, G->A substitution frequencies of up to 0·3 per G residue are possible (Vartanian et al., 2002). As an example, a small segment of a set of hypermutated HIV sequences generated in vitro is shown in Fig. 7(A). SplitsTree analyses of the sequences generated extensive networks (Fig. 7B). Thus, when the point mutation rate is very high, ~10^4-fold greater than the normal rate of ~0·25×10^-5 per base per cycle, as is the case here, extensive networking due to point mutation is possible. In turn, this indicates that at normal mutation rates, recombination is the major mechanism underlying network formation.



Fig. 7. Extensive network formation when the mutation rate is very high. (A) A set of G->A hypermutated HIV-1 U5 sequences generated in an in vitro reaction with a highly biased [dTTP] : [dCTP] ratio (Vartanian et al., 1997). The overall frequency of mutation was ~0·1 per G residue, which is of the order of 10^4-fold greater than the overall HIV-1 mutation rate (Mansky & Temin, 1995). As G->A transitions are produced co-incident with DNA synthesis, even if recombination occurred, it would not produce networks for the RNA templates that were error free. (B) Network formation can be attributed solely to multiple point mutations. Even in this situation, sequences A07, A09, A12, A14, A15 and A18 were removed to obtain a fit of 100. For the full set of 18 taxa the fit was ~9.

 

   DISCUSSION
Rampant recombination among HIV genomes provides a challenge to genetic analysis if only by its tempo and magnitude in vivo. Once a little diversity has been generated, virtually every provirus will be genetically distinct. Because the m.o.i. is generally high in culture, as it is in vivo, recombination is also an issue when culturing HIV ex vivo. Only with a short-term culture derived from an infectious molecular clone will the effects of recombination be minimized. Purifying selection must be weeding out the effete recombinant genomes, just as it does deleterious point mutations. However, even this situation is not simple. Firstly, a cell harbouring two defective, yet transcriptionally active, proviruses can produce heterokaryons, which, in the next round, can recombine to yield viable progeny. Secondly, given that the average provirus copy number per cell is three to four in vivo (Jung et al., 2002), even if a recombinant is functionally viable, in the next round it may recombine with a different genome to yield yet another recombinant of unpredictable fitness.

As can be seen when scoring the number of mutations, ignoring recombination inflates the minimum path length connecting sequences in any data set (Figs 1, 4–6). Among the examples shown there was an overestimation of between 10 and 45 %. This conclusion is derived empirically. It is intriguing that the upper value is close to the 50 % predicted from the simplest conceptual example shown in Fig. 1(C). Hence, from the temporal standpoint, intrapatient HIV lineages may be generally younger than might be otherwise anticipated.

When factoring in recombination, the minimum number of point mutations is simply the number of discrete substitutions across all variable sites in the data set. This obviously reduces statistical power when analysing mutations in terms of non-synonymous and synonymous substitutions (Zanotto et al., 1999). Furthermore, by ignoring both phylogeny and recombination, two-by-two sequence comparisons used to compute ratios of non-synonymous to synonymous substitutions and thence infer positive selection are equally erroneous (Evans et al., 1999; Price et al., 1997). The statistical analysis of codon polymorphisms has received considerable attention of late (Nielsen & Yang, 1998; Yamaguchi-Kabata & Gojobori, 2000; Yang et al., 2000). In fact, sequence data in Fig. 2 corresponded to nef codons that were identified in a PAML analysis as being under statistically significant positive selection (Nielsen & Yang, 1998; Yang et al., 2000; data not shown). The minimum path length connecting all 16 taxa in Fig. 2(D) requires 17 substitutions, among which there are seven homoplasies (Fig. 8A). Allowing for recombination, the minimum number of substitutions necessary is 11 (Fig. 8B), the number of discrete substitutions in the 16 taxa data set (Fig. 2A). In short, each substitution could have occurred just once and spread by subsequent recombination. It would seem that inferring positive selection from the statistical analysis of polymorphic codons may be invalid if recombination is taken into account.
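Counting discrete substitutions amounts to counting each distinct (site, base) change relative to the reference once, however many sequences carry it. The sketch below illustrates the rule on a hypothetical toy alignment; it does not use the Fig. 2 data, and the function name is illustrative.

```python
def discrete_substitutions(alignment, reference):
    """Count distinct (site, base) changes relative to the reference sequence."""
    changes = set()
    for seq in alignment:
        for site, (ref_base, base) in enumerate(zip(reference, seq), start=1):
            if base != ref_base:
                changes.add((site, base))
    return len(changes)

# Hypothetical aligned sequences of equal length.
reference = "AAATTT"
alignment = ["AGATTT", "AAATCT", "AGATCT", "AGGTCT"]

print(discrete_substitutions(alignment, reference))  # 3
```

Here A2G and T5C each appear in several sequences but are counted once, which corresponds to the assumption that each substitution arose a single time and was then spread by recombination.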



Fig. 8. Codon polymorphisms can be inflated by recombination. Data are those presented in Fig. 2. The six codons were identified as being under statistically significant positive selection in numerous PAML analyses. (A) The minimum path length in the absence of recombination involved 17 point mutations, among which seven were homoplasies. (B) When recombination is factored in, as indicated by pairs of orthogonal dashed lines with converging arrows, only 11 point mutations were necessary. This is the number of discrete substitutions in the 16 taxa data set (see Fig. 2A). Such a representation can only estimate the minimum number of crossovers; the true number is probably more given that the intrinsic recombination rate is ~10-fold greater than the point mutation rate.

 
With so much recombination going on, why doesn't multidrug-resistant HIV come up earlier? There are three restrictions limiting the emergence of multidrug resistance. The first is a truism: less replication means less recombination and less mutation. Secondly, HIV replication is highly localized (Blancou et al., 2001; Cheynier et al., 1994, 1998), occurring in discrete sites throughout the body. The massive destruction of infected cells (Pelletier et al., 1995; Wain-Hobson, 1993) and virus (Ho et al., 1995; Wei et al., 1995) means that mutants do not mix freely. Genomes encoding drug resistance mutations can exist in separate sites but, until they find themselves in the same cell, recombination cannot accelerate the process. In this context, it would be interesting to determine the mean provirus copy number per cell in individuals under HAART (highly active anti-retrovirus therapy). The final factor may well be the least important. In principle, it should be easier to generate a gp41 fusion inhibitor/protease inhibitor double mutant where the crucial bases conferring resistance are ~5 kb apart, rather than a protease/RT double mutant where the sites are separated by <1 kb. However, as the recombination rate is so high, this is unlikely to be a major restriction.
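The point about site spacing can be illustrated with a simple model, which is an assumption made here for illustration and is not taken from the study: crossovers are Poisson-distributed and uniform along a genome of about 9·7 kb, both resistance mutations are co-packaged on different RNAs, and a double mutant requires an odd number of crossovers between the two sites.

```python
import math

CROSSOVERS_PER_GENOME = 3.0  # rate quoted in the text
GENOME_LENGTH_KB = 9.7       # assumption: approximate HIV-1 genome length

def p_recombinant(distance_kb):
    """Probability of an odd number of crossovers between two sites.

    Assumes Poisson-distributed crossovers spread uniformly along the genome.
    """
    expected = CROSSOVERS_PER_GENOME * distance_kb / GENOME_LENGTH_KB
    return 0.5 * (1.0 - math.exp(-2.0 * expected))

for label, distance in [("sites 1 kb apart", 1.0), ("sites 5 kb apart", 5.0)]:
    print(f"{label}: {p_recombinant(distance):.2f}")
```

Under these assumptions the two probabilities differ by only about two-fold (roughly 0·23 versus 0·48), which is in keeping with the view that distance between sites is unlikely to be a major restriction.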

This situation where the recombination rate exceeds the mutation rate is not without precedent. For numerous bacteria, including Neisseria species, Streptococcus pneumoniae and Staphylococcus aureus, evolutionary change at neutral loci is more likely to occur by recombination than by point mutation (Feil et al., 1999, 2000, 2001). Splits decomposition analyses have revealed extensive networks (Alber et al., 2001; Holmes et al., 1999a; Smith et al., 1999). Estimates of the recombination to substitution rates may be as high as 100 : 1 (Feil & Spratt, 2001), although precise recombination rates are lacking. Yet, as the point mutation rates for RNA viruses and retroviruses are far higher than those of bacteria, the sheer profusion and tempo by which retrovirus recombinants arise distinguish them from their bacterial counterparts.

In conclusion, HIV and SIV sequence polymorphisms in a data set are strongly influenced by recombination. When recombination is accounted for using the SplitsTree program, the minimum number of substitutions in a data set necessary to explain sequence complexity is the number of unique mutations. Accordingly, a sequence set is younger than would have been thought previously.
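
As a minimal illustration of what "the number of unique mutations" means, the sketch below counts, for each alignment column, the number of distinct bases minus one and sums over columns. It is not the SplitsTree algorithm, only the lower bound that the network analysis is said to approach; the function name and the toy alignment are hypothetical.

```python
def discrete_substitutions(alignment):
    """Lower bound on substitutions: sum over columns of (distinct bases - 1).

    Gaps ('-') are ignored. This is a sketch of the bound only, not of
    split decomposition or of any tree-building method.
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("sequences must be aligned to equal length")
    total = 0
    for column in zip(*alignment):
        states = set(column) - {"-"}
        total += max(len(states) - 1, 0)
    return total


if __name__ == "__main__":
    # Toy set in which all four combinations of two variable sites occur.
    toy = ["AAAA", "GAAA", "AAAC", "GAAC"]
    print(discrete_substitutions(toy))
```

In the toy set, sites 1 and 4 each carry one unique mutation, so the bound is two. Any strictly bifurcating tree relating the four sequences needs at least three changes, because one of the two mutations must arise twice; allowing a single recombination event removes that homoplasy.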


   ACKNOWLEDGEMENTS
 
We wish to thank Drs D. Haydon for PAML analyses, N. Müller-Lantzsch for continued support and J. Diez for helpful discussions. This work was supported by grants from the Institut Pasteur, the ANRS and the Deutsche Forschungsgemeinschaft.


   REFERENCES
 
Alber, D., Oberkötter, M., Suerbaum, S., Claus, H., Frosch, M. & Vogel, U. (2001). Genetic diversity of Neisseria lactamica strains from epidemiologically defined carriers. J Clin Microbiol 39, 1710–1715.

Bandelt, H. J. & Dress, A. W. (1992). Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Mol Phylogenet Evol 1, 242–252.

Barbrook, A. C., Howe, C. J., Blake, N. & Robinson, P. (1998). The phylogeny of The Canterbury Tales. Nature 394, 839.

Blancou, P., Chenciner, N., Cumont, M. C., Wain-Hobson, S., Hurtrel, B. & Cheynier, R. (2001). The infiltration kinetics of simian immunodeficiency virus-specific T cells drawn to sites of high antigenic stimulation determines local in vivo viral escape. Proc Natl Acad Sci U S A 98, 13237–13242.

Carr, J. K., Salminen, M. O., Albert, J., Sanders-Buell, E., Gotte, D., Birx, D. L. & McCutchan, F. E. (1998). Full genome sequences of human immunodeficiency virus type 1 subtypes G and A/G intersubtype recombinants. Virology 247, 22–31.

Cheynier, R., Henrichwark, S., Hadida, F., Pelletier, E., Oksenhendler, E., Autran, B. & Wain-Hobson, S. (1994). HIV and T cell expansion in splenic white pulps is accompanied by infiltration of HIV-specific cytotoxic T lymphocytes. Cell 78, 373–387.

Cheynier, R., Gratton, S., Halloran, M., Stahmer, I., Letvin, N. L. & Wain-Hobson, S. (1998). Antigenic stimulation by BCG as an in vivo driving force for SIV replication and dissemination. Nat Med 4, 421–427.

Cheynier, R., Kils-Hütten, L., Meyerhans, A. & Wain-Hobson, S. (2001). Insertion/deletion frequencies match those of point mutations in the hypervariable regions of the simian immunodeficiency virus surface envelope gene. J Gen Virol 82, 1613–1619.

Coffin, J. M. (1979). Structure, replication, and recombination of retrovirus genomes: some unifying hypotheses. J Gen Virol 42, 1–26.

Eigen, M. & Nieselt-Struwe, K. (1990). How old is the immunodeficiency virus? AIDS 4 (Suppl. 1), S85–S93.

Evans, D. T., O'Connor, D. H., Jing, P. & 14 other authors (1999). Virus-specific cytotoxic T-lymphocyte responses select for amino-acid variation in simian immunodeficiency virus Env and Nef. Nat Med 5, 1270–1276.

Feil, E. J. & Spratt, B. G. (2001). Recombination and the population structures of bacterial pathogens. Annu Rev Microbiol 55, 561–590.

Feil, E. J., Maiden, M. C. J., Achtman, M. & Spratt, B. G. (1999). The relative contributions of recombination and mutation to the divergence of clones of Neisseria meningitidis. Mol Biol Evol 16, 1496–1502.

Feil, E. J., Smith, J. M., Enright, M. C. & Spratt, B. G. (2000). Estimating recombinational parameters in Streptococcus pneumoniae from multilocus sequence typing data. Genetics 154, 1439–1450.

Feil, E. J., Holmes, E. C., Bessen, D. E. & 9 other authors (2001). Recombination within natural populations of pathogenic bacteria: short-term empirical estimates and long-term phylogenetic consequences. Proc Natl Acad Sci U S A 98, 182–187.

Gratton, S., Cheynier, R., Dumaurier, M. J., Oksenhendler, E. & Wain-Hobson, S. (2000). Highly restricted spread of HIV-1 and multiply infected cells within splenic germinal centers. Proc Natl Acad Sci U S A 97, 14566–14571.

Ho, D. D., Neumann, A. U., Perelson, A. S., Chen, W., Leonard, J. M. & Markowitz, M. (1995). Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection. Nature 373, 123–126.

Hoelscher, M., Kim, B., Maboko, L., Mhalu, F., von Sonnenburg, F., Birx, D. L. & McCutchan, F. E. (2001). High proportion of unrelated HIV-1 intersubtype recombinants in the Mbeya region of southwest Tanzania. AIDS 15, 1461–1470.

Holmes, E. C., Urwin, R. & Maiden, M. C. J. (1999a). The influence of recombination on the population structure and evolution of the human pathogen Neisseria meningitidis. Mol Biol Evol 16, 741–749.

Holmes, E. C., Worobey, M. & Rambaut, A. (1999b). Phylogenetic evidence for recombination in dengue virus. Mol Biol Evol 16, 405–409.

Hu, W. S. & Temin, H. M. (1990). Genetic consequences of packaging two RNA genomes in one retroviral particle: pseudodiploidy and high rate of genetic recombination. Proc Natl Acad Sci U S A 87, 1556–1560.

Huson, D. H. (1998). SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14, 68–73.

Jetzt, A. E., Yu, H., Klarmann, G. J., Ron, Y., Preston, B. D. & Dougherty, J. P. (2000). High rate of recombination throughout the human immunodeficiency virus type 1 genome. J Virol 74, 1234–1240.

Jung, A., Maier, R., Vartanian, J. P., Bocharov, G., Jung, V., Fischer, U., Meese, E., Wain-Hobson, S. & Meyerhans, A. (2002). Multiply infected spleen cells in HIV patients. Nature 418, 144.

Kawai, S. & Hanafusa, H. (1972). Genetic recombination with avian tumor virus. Virology 49, 37–44.

Kils-Hütten, L., Cheynier, R., Wain-Hobson, S. & Meyerhans, A. (2001). Phylogenetic reconstruction of intrapatient evolution of human immunodeficiency virus type 1: predominance of drift and purifying selection. J Gen Virol 82, 1621–1627.

Mansky, L. M. & Temin, H. M. (1995). Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase. J Virol 69, 5087–5094.

Martinez, M. A., Vartanian, J. P. & Wain-Hobson, S. (1994). Hypermutagenesis of RNA using human immunodeficiency virus type 1 reverse transcriptase and biased dNTP concentrations. Proc Natl Acad Sci U S A 91, 11787–11791.

McCutchan, F. E. (2000). Understanding the genetic diversity of HIV-1. AIDS 14 (Suppl. 3), S31–S44.

Nielsen, R. & Yang, Z. (1998). Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148, 929–936.

Pathak, V. K. & Temin, H. M. (1990). Broad spectrum of in vitro forward mutations, hypermutations, and mutational hotspots in a retroviral shuttle vector after a single replication cycle: substitutions, frameshifts, and hypermutations. Proc Natl Acad Sci U S A 87, 6019–6023.

Peeters, M., Liegeois, F., Torimiro, N., Bourgeois, A., Mpoudi, E., Vergne, L., Saman, E., Delaporte, E. & Saragosti, S. (1999). Characterization of a highly replicative intergroup M/O human immunodeficiency virus type 1 recombinant isolated from a Cameroonian patient. J Virol 73, 7368–7375.

Pelletier, E., Saurin, W., Cheynier, R., Letvin, N. L. & Wain-Hobson, S. (1995). The tempo and mode of SIV quasispecies development in vivo calls for massive viral replication and clearance. Virology 208, 644–652.

Plikat, U., Nieselt-Struwe, K. & Meyerhans, A. (1997). Genetic drift can dominate short-term human immunodeficiency virus type 1 nef quasispecies evolution in vivo. J Virol 71, 4233–4240.

Plyusnin, A., Kukkonen, S. K. J., Plyusnina, A., Vapalahti, O. & Vaheri, A. (2002). Transfection-mediated generation of functionally competent Tula hantavirus with recombinant S RNA segment. EMBO J 21, 1497–1503.

Price, D. A., Goulder, P. J. R., Klenerman, P., Sewell, A. K., Easterbrook, P. J., Troop, M., Bangham, C. R. M. & Phillips, R. E. (1997). Positive selection of HIV-1 cytotoxic T lymphocyte escape variants during primary infection. Proc Natl Acad Sci U S A 94, 1890–1895.

Ricchetti, M. & Buc, H. (1990). Reverse transcriptases and genomic variability: the accuracy of DNA replication is enzyme specific and sequence dependent. EMBO J 9, 1583–1593.

Schierup, M. H. & Hein, J. (2000). Consequences of recombination on traditional phylogenetic analysis. Genetics 156, 879–891.

Sibold, C., Meisel, H., Krüger, D. H., Labuda, M., Lysy, J., Kozuch, O., Pejcoch, M., Vaheri, A. & Plyusnin, A. (1999). Recombination in Tula hantavirus evolution: analysis of genetic lineages from Slovakia. J Virol 73, 667–675.

Smith, N. H., Holmes, E. C., Donovan, G. M., Carpenter, G. A. & Spratt, B. G. (1999). Networks and groups within the genus Neisseria: analysis of argF, recA, rho and 16S rRNA sequences from human Neisseria species. Mol Biol Evol 16, 773–783.

Takehisa, J., Zekeng, L., Ido, E., Yamaguchi-Kabata, Y., Mboudjeka, I., Harada, Y., Miura, T., Kaptu, L. & Hayami, M. (1999). Human immunodeficiency virus type 1 intergroup (M/O) recombination in Cameroon. J Virol 73, 6810–6820.

Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680.

Vartanian, J. P., Meyerhans, A., Asjo, B. & Wain-Hobson, S. (1991). Selection, recombination, and G->A hypermutation of human immunodeficiency virus type 1 genomes. J Virol 65, 1779–1788.

Vartanian, J. P., Plikat, U., Mahieux, R., Guillemot, L., Meyerhans, A. & Wain-Hobson, S. (1997). HIV genetic variability is directed and restricted by DNA precursor availability. J Mol Biol 270, 139–151.

Vartanian, J. P., Henry, M. & Wain-Hobson, S. (2002). Sustained G->A hypermutation during reverse transcription of an entire human immunodeficiency type 1 strain Vau group O genome. J Gen Virol 83, 801–805.

Vogt, P. K. (1971). Genetically stable reassortment of markers during mixed infection with avian tumor viruses. Virology 46, 947–952.

Wain-Hobson, S. (1993). Viral burden in AIDS. Nature 366, 22.

Wei, X., Ghosh, S. K., Taylor, M. E. & 9 other authors (1995). Viral dynamics in human immunodeficiency virus type 1 infection. Nature 373, 117–122.

Weiss, R. A., Mason, W. S. & Vogt, P. K. (1973). Genetic recombinants and heterozygotes derived from endogenous and exogenous avian RNA tumor viruses. Virology 52, 535–552.

Wooley, D. P., Smith, R. A., Czajak, S. & Desrosiers, R. C. (1997). Direct demonstration of retroviral recombination in a rhesus monkey. J Virol 71, 9650–9653.

Yamaguchi-Kabata, Y. & Gojobori, T. (2000). Reevaluation of amino acid variability of the human immunodeficiency virus type 1 gp120 envelope glycoprotein and prediction of new discontinuous epitopes. J Virol 74, 4335–4350.

Yang, Z., Nielsen, R., Goldman, N. & Pedersen, A. M. (2000). Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155, 431–449.

Yu, H., Jetzt, A. E., Ron, Y., Preston, B. D. & Dougherty, J. P. (1998). The nature of human immunodeficiency virus type 1 strand transfers. J Biol Chem 273, 28384–28391.

Zanotto, P. M., Kallas, E. G., de Souza, R. F. & Holmes, E. C. (1999). Genealogical evidence for positive selection in the nef gene of HIV-1. Genetics 153, 1077–1089.

Zhang, J. & Temin, H. M. (1993). Rate and mechanism of nonhomologous recombination during a single cycle of retroviral replication. Science 259, 234–238.

Received 10 October 2002; accepted 13 December 2002.