Insertion/deletion frequencies match those of point mutations in the hypervariable regions of the simian immunodeficiency virus surface envelope gene

Rémi Cheynier¹, Laurens Kils-Hütten², Andreas Meyerhans² and Simon Wain-Hobson¹

¹ Unité de Rétrovirologie Moléculaire, Institut Pasteur, 28 rue de Dr Roux, F-75724 Paris cedex 15, France
² Abteilung Virologie, Universität des Saarlandes, Institut für Medizinische Mikrobiologie und Hygiene, D-66421 Homburg, Germany

Author for correspondence: Simon Wain-Hobson. Fax +33 1 4568 8874. e-mail simon@pasteur.fr


   Abstract
 
A method for encoding insertions and deletions (indels) has been developed and adapted to the SplitsTree program. Following phylogenetic reconstruction, the relative frequencies of indels were estimated for a large number of in vivo sequence sets corresponding to the env V1 hypervariable region of the simian immunodeficiency virus SIVmac251. The method allowed recovery of many point mutations hitherto lost due to gap stripping. Deletions were as frequent as transversions and were 4- to 8-fold more frequent than insertions, invariably duplications. The high proportion of deletions among mutation events suggests that lentivirus vectors may readily delete parts of their cargo.


   Introduction
 
There is plenty of evidence for recombination among the human immunodeficiency viruses (HIV). Recombinants are visible at an epidemic level, with the occasional chimera reported between group M and group O viruses (McCutchan et al., 1996 ; Peeters et al., 1999 ; Robertson et al., 1995 ; Takehisa et al., 1999 ). Perhaps the most stunning finding is a mean recombination rate of three cross-overs per round of replication (Jetzt et al., 2000 ). A strong reminder of the importance of recombination for disease came from a study in which monkeys were co-infected by two variants of the macaque simian immunodeficiency virus SIVmac239 (Wooley et al., 1997 ). One variant was vpr-, the other nef-. The replication of both variants in vivo was known to be greatly impaired with respect to the parent virus. Within 15 days of co-infection, wild-type (vpr+ nef+) virus was isolated, and the animals went on to develop disease.

Recombination can also result in the formation of insertions and deletions (indels). These are nowhere more in evidence than in the segments corresponding to the hypervariable regions of the envelope protein. As opposed to point mutations, the rate of fixation of indels is not known. Certainly, indels are observed within a matter of months of infection (Burns & Desrosiers, 1991 ). However, there is no more precise estimation than this and there is certainly no study ranking the fixation rate of indels with respect to transitions and transversions. Of course, indels are invariably stripped from multiple alignments. In the case of the hypervariable regions of the primate immunodeficiency viruses, where indels are common, gap-stripping restricts the understanding of their evolution.

Phylogenetic reconstruction is needed in the counting of mutations; otherwise, there is a tendency to inflate their number (Pelletier et al., 1995 ; Plikat et al., 1997 ; Zanotto et al., 1999 ). The SplitsTree program creates networks of sequences representing parallel mutations and allows sequences to be placed at nodes as well as tips of a phylogram (Huson, 1998 ). This is particularly appropriate when analysing early quasispeciation by an RNA virus or retrovirus such as HIV or SIV (Dopazo et al., 1993 ; Pelletier et al., 1995 ; Plikat et al., 1997 ).

In principle, indels can be coded and treated as single events as long as the degree of sequence diversification is not too great, a condition met in the first 2–3 years of an HIV or SIV infection. Although a few reports have attempted to code indels (Barriel, 1994 ), none has applied the technique to RNA viruses, retroviruses or HIV/SIV. Here, a short program has been written that allows indel coding within sets of sequences and which is adapted to the SplitsTree program. When applied to the first hypervariable region of the SIVmac251 env gene, it was found that the fixation rate of deletions rivalled that of transitions.


   Methods
 

Indels were coded by using Indelstack, developed under the Hypercard 2.3.1 software (available through anonymous login at ftp://ftp.pasteur.fr/pub/retromol/Hypercard-stacks/IndelCode.sit; the sequences used are those in the directory ftp.pasteur.fr/pub/retromol/Cheynier-et-al-98/). Sequences were aligned to an internal reference for each dataset. For ambiguities arising from repeat sequences, the bases were shifted 3' as far as possible. Indelstack uses fully aligned sequences. The first step consisted of coding the insertions. For gaps introduced into the reference to maximize the alignment, a number was added to the 3' end of each sequence: ‘2’ signified a particular insertion, ‘0’ signified the absence of that insertion. In all sequences that did not contain the insertion, the gaps were replaced by the predominant base encountered at the particular position. Hence, an insertion of n bases was reduced to a single mutational event, ‘2’, placed 3' of the nucleic acid sequence. This procedure was applied for every insertion in the sequence set, starting from the 5' end.
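The insertion-coding step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Indelstack Hypercard code: the function names are hypothetical, a sequence is taken to carry an insertion if it holds any base within a run of reference gaps, and ties for the predominant base are broken in the order A>G>T>C.

```python
from collections import Counter

def reference_gap_runs(reference):
    """Return (start, end) column ranges where the aligned reference holds gaps."""
    runs, start = [], None
    for i, ch in enumerate(reference):
        if ch == "-" and start is None:
            start = i
        elif ch != "-" and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(reference)))
    return runs

def predominant_base(column):
    """Most frequent base in a column; ties fall to A>G>T>C (assumption)."""
    counts = Counter(b for b in column if b != "-")
    return max("AGTC", key=lambda b: counts[b])

def code_insertions(reference, sequences):
    """Code each run of reference gaps as one event.

    Returns {name: (gap-filled sequence, suffix)}, where the suffix holds
    '2' for a sequence carrying the insertion and '0' otherwise.
    """
    rows = {"ref": reference, **sequences}
    filled = {name: list(seq) for name, seq in rows.items()}
    suffix = {name: "" for name in rows}
    for start, end in reference_gap_runs(reference):   # 5' to 3'
        for name, seq in rows.items():
            has_insertion = any(ch != "-" for ch in seq[start:end])
            suffix[name] += "2" if has_insertion else "0"
            if not has_insertion:
                for col in range(start, end):
                    column = [s[col] for s in rows.values()]
                    filled[name][col] = predominant_base(column)
    return {name: ("".join(filled[name]), suffix[name]) for name in rows}
```

For the hypothetical reference `ACG---TTA` aligned with `ACGGCATTA` and `ACG---TTA`, the first sequence receives the suffix ‘2’, while the reference and the second sequence receive ‘0’ and have their gaps filled with the inserted bases. The sketch returns the suffix separately; in the procedure described above it is appended to the 3' end of the sequence.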

Deletions were similarly coded by ‘0’ or ‘1’ added 3' of the sequence after the coding of insertions. ‘0’ means the absence of the deletion while ‘1’ scores the deletion. Obviously, gaps in sequences were considered to be the same event when the borders were identical. Two sequences bearing deletions of say 6 and 9 bp, with only one common border, were treated as two independent events rather than a common 6 bp deletion and an extra 3 bp deletion juxtaposed in one sequence. After the coding process was performed, each gap was replaced in each sequence by the consensus bases derived from the remaining sequences without gaps.
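The deletion-coding rule, including the treatment of gaps with only one common border as independent events, might be sketched as below. This is an illustrative sketch rather than the Indelstack implementation: the function names are hypothetical, the input is assumed to have passed through insertion coding already (so the reference carries no gaps), and ties in the consensus are broken in the order A>G>T>C.

```python
from collections import Counter

def gap_runs(seq):
    """Return the (start, end) borders of every gap run in one aligned sequence."""
    runs, start = [], None
    for i, ch in enumerate(seq + "N"):      # sentinel closes a trailing run
        if ch == "-" and start is None:
            start = i
        elif ch != "-" and start is not None:
            runs.append((start, i))
            start = None
    return runs

def code_deletions(sequences):
    """Append one digit per distinct deletion: '1' if present, '0' if absent.

    Two gaps count as the same event only when both borders are identical.
    Gaps are then filled with the consensus of the sequences without a gap
    at that column.
    """
    runs = {name: gap_runs(seq) for name, seq in sequences.items()}
    events = sorted({run for r in runs.values() for run in r})   # 5' to 3'
    coded = {}
    for name, seq in sequences.items():
        suffix = "".join("1" if ev in runs[name] else "0" for ev in events)
        filled = list(seq)
        for col, ch in enumerate(seq):
            if ch == "-":
                counts = Counter(s[col] for s in sequences.values() if s[col] != "-")
                filled[col] = max("AGTC", key=lambda b: counts[b])
        coded[name] = "".join(filled) + suffix
    return events, coded
```

With the hypothetical rows `ACGTACGTAC` (reference), `ACG--CGTAC` and `AC---CGTAC`, the two gaps share only their 3' border, so two events are recorded and the sequences receive the suffixes ‘00’, ‘01’ and ‘10’, respectively.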

In cases where there was no consensus among the remaining sequences, bases were chosen according to the following nucleotide frequency rules. A was chosen for A=T>G>C, A=G>T>C, A=C>G>T, A=T=C>G, A=T=G>C, A=C=G>T or A=T=C=G. G was chosen for T=G>A>C, G=C>A>T or T=C=G>A. T was chosen for T=C>A>G.

These rules were adapted to work with HIV and SIVs and reflect the general base composition of the primate lentiviruses (A>G>T>C) (Wain-Hobson et al., 1985 ). They can be modified. The set of 698 sequences analysed corresponds to the region of SIVmac251 encoding the hypervariable V1 loop of the envelope protein (Cheynier et al., 1998 ).
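All of the listed cases are reproduced by a single rule: among the bases tied for the highest frequency, take the first in the order A>G>T>C. A minimal Python sketch of that reading follows; the function name and the `priority` parameter are hypothetical, and treating the listed cases as instances of one general priority rule is an assumption.

```python
from collections import Counter

PRIORITY = "AGTC"   # base composition of the primate lentiviruses, A>G>T>C

def fill_base(column, priority=PRIORITY):
    """Choose the base used to fill a gap from the bases seen in a column.

    The most frequent base wins; ties are resolved by the priority order.
    """
    counts = Counter(b for b in column if b in priority)
    if not counts:
        raise ValueError("column holds no bases")
    top = max(counts.values())
    return next(b for b in priority if counts[b] == top)
```

Passing a different `priority` string adapts the rule to genomes with another base composition.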


   Results
 
A typical example of indel coding is given in Fig. 1. The sequences correspond to the first hypervariable region of SIVmac251 env. They were selected as they provide a good illustration of indel coding. The aligned sequence sets, without and with indel coding, are shown in Fig. 1(A, B). The first indel coded corresponded to an 18 bp insertion starting at position 43 in sequence S9. Accordingly, S9 has the suffix ‘2’ added at the 3' end while ‘0’ was added to all remaining sequences. Deletions were coded next. The 5'-most deletion was 3 bp at positions 12–14. Hence ‘1’ was added at the 3' ends of S16, S20, S5 and S9 and ‘0’ to all others. The second and third deletions of 9 and 6 bp have a common 3' border. As mentioned in Methods, they were treated as two simple deletions. Note that the 6 bp deletion in S1 and S10 (bases 38–42 and 61) spans the 18 bp insertion (bases 43–60). After coding the insertion and six deletions, each sequence has a suffix of seven numbers tagged on to the 3' end.



 
Fig. 1. SIVmac251 V1 sequences and the recovery of mutations. (A) A set of V1 sequences aligned to that of the input sequence (ref). Gaps (–) were introduced to maximize the alignment. (B) The same set of sequences with insertions and deletions coded as ‘2’ and ‘1’, respectively, placed 3' to the sequence. The gaps within the sequence are replaced by the consensus base, as described in Methods. (C)–(D) SplitsTree phylograms using the data from (A) and (B), respectively. In (C), five mutations are captured; in (D), 16 are captured. In terms of nomenclature, 20Δ9 means a 9 bp deletion starting at position 20 and 4318 represents an 18 bp insertion starting at position 43.

 
The SplitsTree program was modified to accept numbers. Of course, indels could have been coded using As and Gs or Ts and Cs. However, numbers have the advantage that they stand out clearly at the end of the alignments. SplitsTree phylogenetic representations of a typical set of sequences without and with indel coding are shown in Fig. 1(C, D). As branch lengths are proportional to the number of events (point mutations, insertions or deletions all have the same length), mutations are easily superimposed over the branches. It is clear that gap stripping reduced the number of informative sites in the set to five point mutations (Fig. 1C), whereas indel coding recovered twice as many point mutations (Fig. 1D).
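The loss of information caused by gap stripping can be illustrated with a toy alignment. The sketch below is not part of the published method: the function names and the three sequences are hypothetical, and it counts variable columns, which is a simpler quantity than the informative sites scored over a SplitsTree network.

```python
def variable_columns(sequences):
    """Count alignment columns in which more than one character occurs."""
    return sum(1 for column in zip(*sequences) if len(set(column)) > 1)

def gap_strip(sequences):
    """Drop every column that contains a gap in at least one sequence."""
    kept = [col for col in zip(*sequences) if "-" not in col]
    return ["".join(row) for row in zip(*kept)]

# Toy alignment: the second sequence carries a 2 bp deletion.
aligned = ["ACGTAC", "AC--AT", "ACGAAC"]

# The same rows after indel coding: gaps filled with the consensus
# (the T/A tie falls to A) and a 0/1 deletion digit appended at the 3' end.
coded = ["ACGTAC0", "ACGAAT1", "ACGAAC0"]

stripped_count = variable_columns(gap_strip(aligned))
coded_count = variable_columns(coded)
```

In this toy case gap stripping leaves a single variable column, whereas the coded alignment retains that column, recovers the T/A substitution that lay within the gapped region and adds the deletion itself as one event.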

The combination of indel coding and scoring mutations over SplitsTrees was applied to an extensive set of 698 sequences derived from a SIVmac251-infected macaque (Cheynier et al., 1998 ). Sequences from 15–21 weeks post-infection were analysed together (Fig. 2); a total of 40 could be introduced into a network with a fit of 100%, i.e. without removal of any mutations. There was some evidence of networking, which may reflect homoplasies and/or recombination. Choosing the minimum path length connecting all sequences, 56 point substitutions and five deletions were mapped among 40 sequences. By 61–64 weeks post-infection, there was extensive networking; the most parsimonious path length scored 35 substitutions and seven deletions among the 30 sequences (Fig. 3). In both instances, the numbers of deletions were comparable to the number of transversions.



 
Fig. 2. A large collection of 51 taxa taken at 15–21 weeks post-infection. The fit was 100%. Branch lengths are proportional to the number of mutations that separate sequences. Insertions, deletions and point mutations all have equal weight. The minimal path length is shown. Dashed lines show alternative paths. See legend to Fig. 1 for explanation of nomenclature.

 


 
Fig. 3. A collection of 31 taxa taken at 61–64 weeks post-infection. The fit was 100%. Networks are very much in evidence, with the input SIVmac251 sequence no longer detectable. This is in sharp contrast to Fig. 2, where the majority of taxa were derived by a short lineage from the input sequence. The transition:transversion ratio is lower than for the earlier set of data, already suggesting some degree of saturation.

 
With SplitsTree networks, as more and more sequences are added, an increasing number of sequences have to be discarded if a 100% fit to the data is desired (i.e. no informative site in a sequence is excluded). To get around this problem, the datasets, usually 18–20 sequences, were analysed separately. The most parsimonious pathway connecting all sequences was established and the number of mutations was scored. The numbers of transitions, transversions, insertions and deletions were then normalized for the length of the sequence and the time after infection. The results are given in Table 1. Transitions outnumbered transversions by a factor of 6–14. Deletions were 4- to 8-fold more frequent than insertions, and even proved to be more frequent than transversions. The concordance of data from samples R-3 to R+27, as well as all the spleen and lymph node samples taken at R+27, indicates that these findings are robust.
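The scoring and normalization step could look like the sketch below. The exact formula is not stated in the text, so dividing each count by the product of sequence length and time after infection is an assumption, as are the function names and the choice of weeks as the unit of time.

```python
PURINES = {"A", "G"}

def substitution_class(old, new):
    """Classify a point substitution as a transition or a transversion."""
    if old == new:
        raise ValueError("not a substitution")
    same_ring_type = (old in PURINES) == (new in PURINES)
    return "transition" if same_ring_type else "transversion"

def normalized_frequencies(counts, length_bp, weeks):
    """Scale raw event counts to events per site per week (assumed units)."""
    scale = length_bp * weeks
    return {event: n / scale for event, n in counts.items()}
```

For instance, `normalized_frequencies({"deletion": 6}, length_bp=100, weeks=3)` yields 0.02 deletions per site per week for these made-up numbers; `substitution_class("C", "T")` returns `"transition"` and `substitution_class("A", "C")` returns `"transversion"`.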


 
Table 1. Normalized indel and substitution frequencies for numerous SIVmac251 env V1 datasets

 
Inspection of some of the SplitsTree phylograms revealed some intriguing details, one example being the sequence set from splenic white pulp 19 (WP19). There was much evidence of networks involving one insertion and four deletions (Fig. 4A). Connecting intermediate V to sequence WP19.5 via W involves two deletions (48Δ3 and 56Δ9), one insertion (7918) and one point substitution (A70C) (Fig. 4B). However, precisely the same mutations separated V and WP19.5 via the intermediate X. By the same token, the mutations separating V from Y and Z (two deletions, 48Δ3 again and 62Δ9, and two transitions, C98T and A109G) are the same as those connecting Y and Z with WP19.14. In other words, there are a number of homoplasies involving indels. When sequence divergence is low, as in the datasets used here, it may be argued that, given their lower frequencies, transversions and indels revert less frequently than transitions.



 
Fig. 4. SplitsTree network for sequences derived from white pulp 19 (WP19). (A) The network with individual substitutions, insertions and deletions laid over the phylogeny. (B) Making the assumption that all homoplasies reflect recombination, the pathways indicating possible recombinants between hypothetical intermediates are shown as bold lines.

 
An alternative to homoplasy is recombination. The recombination rate of HIV-1 is remarkably high, at approximately three cross-overs per genome per cycle (Jetzt et al., 2000 ). Presumably, the recombination rate of SIV is comparable (Wooley et al., 1997 ). If intermediates W and X were to recombine, they could generate WP19.5. The same is true for intermediates Y and Z, which could give rise to WP19.14. Hence, recombination would reduce the total number of substitutions necessary to connect all sequences in a network.


   Discussion
 
Phylogenetic reconstruction of SIV V1 sequences using indel coding and SplitsTrees showed that the frequency of deletions is greater than that of transversions and only 2- to 6-fold lower than that of transitions. Hence, deletions are a major phenomenon in the generation of sequence diversity in the hypervariable regions of the immunodeficiency viruses. That the frequency of deletions proved to be 4- to 8-fold greater than that of insertions is in keeping with ex vivo observations (Mansky & Temin, 1995 ). It helps to explain why reporter genes such as CAT or GFP cloned into HIV genomes are deleted so rapidly, as are remnants of the nef gene, once a segment is deleted (Daniel et al., 1992 ; Kirchhoff et al., 1994 ). Presumably, negative selection and packaging constraints keep the hypervariable regions from collapsing and maintain the size of the genome in the range 9·2–9·3 kb. However, a few duplications have obviously contributed to the evolution of the HIV/SIV genomes, as testified by the sequence similarity between Vpr and Vpx (Tristem et al., 1992 ). A 2- to 3-fold excess of deletions over insertions has been reported for pseudogenes with respect to their orthologues (Ophir & Graur, 1997 ), indicating that this may be a more general feature of DNA replication not confined to lentiviruses.

Overlaying nucleotide substitutions onto the SplitsTrees and comparing observed and expected non-synonymous and synonymous substitution showed scant evidence of selection within the hypervariable V1 region (data not shown). The accompanying comprehensive analysis using the SplitsTree approach of a large number of different loci and published datasets tends to show that the majority of sites are not under positive selection (Kils-Hütten et al., 2001 ). This is not to say that positive selection is not operative, just that, using the ratio of non-synonymous to synonymous substitutions in segments of several hundred bases, the signal is generally too weak.

The possibility of recombination contributing to sequence complexity is apparent (Fig. 4). Distinguishing between homoplasies and recombination is not easy. However, hot spots were not much in evidence in sequences corresponding to the hypervariable regions of SIV and HIV Env, suggesting that some of the sequences were indeed recombinants.

It appears that indels contribute considerably to the evolution of the hypervariable regions and are more frequent than transversions. Insertions in these regions are invariably duplications and involve N-linked glycosylation sites, almost all of which are occupied. The carbohydrate moieties are thought to mask the hypervariable regions from recognition by immunoglobulins (Kwong et al., 1998 ; Wyatt et al., 1998 ). Other regions of the HIV genome generally do not support so many indels (Alizon et al., 1986 ), so the indel fixation rates revealed here represent upper values. Whether the unusually high A content of this region, with its capacity for propeller-twisting hydrogen bonds in oligo(A):oligo(T) tracts (Burkhoff & Tullius, 1987 ), is conducive to indel formation remains to be established.

In conclusion, indel coding allows the estimation of indel frequencies with respect to point mutations and, in so doing, recovers more information from a set of sequences. The method could be useful in other contexts, such as estimating the frequencies of bacterial variable tandem repeats, and could be adapted to encompass the appearance of large additions and deletions of blocks of DNA when comparing bacterial genomes (Ochman et al., 2000 ).


   Acknowledgments
 
This work was supported by the Deutsche Forschungsgemeinschaft and grants from the Institut Pasteur and Agence Nationale de Recherche sur le SIDA (ANRS).


   References
 
Alizon, M., Wain-Hobson, S., Montagnier, L. & Sonigo, P. (1986). Genetic variability of the AIDS virus: nucleotide sequence analysis of two isolates from African patients. Cell 46, 63–74.

Barriel, V. (1994). Molecular phylogenies and nucleotide insertion–deletion. Comptes Rendus de l’Academie des Sciences 317, 693–701 (in French).

Burkhoff, A. M. & Tullius, T. D. (1987). The unusual conformation adopted by the adenine tracts in kinetoplast DNA. Cell 48, 935–943.

Burns, D. P. & Desrosiers, R. C. (1991). Selection of genetic variants of simian immunodeficiency virus in persistently infected rhesus monkeys. Journal of Virology 65, 1843–1854.

Cheynier, R., Gratton, S., Halloran, M., Stahmer, I., Letvin, N. L. & Wain-Hobson, S. (1998). Antigenic stimulation by BCG vaccine as an in vivo driving force for SIV replication and dissemination. Nature Medicine 4, 421–427.

Daniel, M. D., Kirchhoff, F., Czajak, S. C., Sehgal, P. K. & Desrosiers, R. C. (1992). Protective effects of a live attenuated SIV vaccine with a deletion in the nef gene. Science 258, 1938–1941.

Dopazo, J., Dress, A. W. M. & von Haeseler, A. (1993). Split decomposition: a technique to analyze viral evolution. Proceedings of the National Academy of Sciences, USA 90, 10320–10324.

Huson, D. H. (1998). SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14, 68–73.

Jetzt, A. E., Yu, H., Klarmann, G. J., Ron, Y., Preston, B. D. & Dougherty, J. P. (2000). High rate of recombination throughout the human immunodeficiency virus type 1 genome. Journal of Virology 74, 1234–1240.

Kils-Hütten, L., Cheynier, R., Wain-Hobson, S. & Meyerhans, A. (2001). Phylogenetic reconstruction of intrapatient evolution of human immunodeficiency virus type 1: predominance of drift and purifying selection. Journal of General Virology 82, 1621–1627.

Kirchhoff, F., Kestler, H. W., III & Desrosiers, R. C. (1994). Upstream U3 sequences in simian immunodeficiency virus are selectively deleted in vivo in the absence of an intact nef gene. Journal of Virology 68, 2031–2037.

Kwong, P. D., Wyatt, R., Robinson, J., Sweet, R. W., Sodroski, J. & Hendrickson, W. A. (1998). Structure of an HIV gp120 envelope glycoprotein in complex with the CD4 receptor and a neutralizing human antibody. Nature 393, 648–659.

McCutchan, F. E., Salminen, M. O., Carr, J. K. & Burke, D. S. (1996). HIV-1 genetic diversity. AIDS 10 (Suppl. 3), S13–S20.

Mansky, L. M. & Temin, H. M. (1995). Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase. Journal of Virology 69, 5087–5094.

Ochman, H., Lawrence, J. G. & Groisman, E. A. (2000). Lateral gene transfer and the nature of bacterial innovation. Nature 405, 299–304.

Ophir, R. & Graur, D. (1997). Patterns and rates of indel evolution in processed pseudogenes from humans and murids. Gene 205, 191–202.

Peeters, M., Liegeois, F., Torimiro, N., Bourgeois, A., Mpoudi, E., Vergne, L., Saman, E., Delaporte, E. & Saragosti, S. (1999). Characterization of a highly replicative intergroup M/O human immunodeficiency virus type 1 recombinant isolated from a Cameroonian patient. Journal of Virology 73, 7368–7375.

Pelletier, E., Saurin, W., Cheynier, R., Letvin, N. L. & Wain-Hobson, S. (1995). The tempo and mode of SIV quasispecies development in vivo calls for massive viral replication and clearance. Virology 208, 644–652.

Plikat, U., Nieselt-Struwe, K. & Meyerhans, A. (1997). Genetic drift can dominate short-term human immunodeficiency virus type 1 nef quasispecies evolution in vivo. Journal of Virology 71, 4233–4240.

Robertson, D. L., Hahn, B. H. & Sharp, P. M. (1995). Recombination in AIDS viruses. Journal of Molecular Evolution 40, 249–259.

Takehisa, J., Zekeng, L., Ido, E., Yamaguchi-Kabata, Y., Mboudjeka, I., Harada, Y., Miura, T., Kaptu, L. & Hayami, M. (1999). Human immunodeficiency virus type 1 intergroup (M/O) recombination in Cameroon. Journal of Virology 73, 6810–6820.

Tristem, M., Marshall, C., Karpas, A. & Hill, F. (1992). Evolution of the primate lentiviruses: evidence from vpx and vpr. EMBO Journal 11, 3405–3412.

Wain-Hobson, S., Sonigo, P., Danos, O., Cole, S. & Alizon, M. (1985). Nucleotide sequence of the AIDS virus, LAV. Cell 40, 9–17.

Wooley, D. P., Smith, R. A., Czajak, S. & Desrosiers, R. C. (1997). Direct demonstration of retroviral recombination in a rhesus monkey. Journal of Virology 71, 9650–9653.

Wyatt, R., Kwong, P. D., Desjardins, E., Sweet, R. W., Robinson, J., Hendrickson, W. A. & Sodroski, J. G. (1998). The antigenic structure of the HIV gp120 envelope glycoprotein. Nature 393, 705–711.

Zanotto, P. M., Kallas, E. G., de Souza, R. F. & Holmes, E. C. (1999). Genealogical evidence for positive selection in the nef gene of HIV-1. Genetics 153, 1077–1089.

Received 20 October 2000; accepted 8 March 2001.