Low linkage disequilibrium indicative of recombination in foot-and-mouth disease virus gene sequence alignments

Daniel T. Haydon1,†, Armanda D. S. Bastos2 and Philip Awadalla3

1 Department of Zoology, University of Guelph, Guelph, Ontario, Canada N1G 2W1
2 Mammal Research Institute, Department of Zoology and Entomology, University of Pretoria, Pretoria 0002, South Africa
3 Department of Genetics, North Carolina State University, Raleigh, NC 27695-7614, USA

Correspondence
Daniel T. Haydon
D.Haydon@bio.gla.ac.uk


   ABSTRACT
We have applied tests for detecting recombination to genes of foot-and-mouth disease virus (FMDV). Our approach estimated summary statistics of linkage disequilibrium (LD), which are sensitive to recombination. Using the genealogical relationships, rate heterogeneity and mutation parameters estimated from individual sets of aligned gene sequences, we simulated matching RNA sequence datasets without recombination. These simulated datasets allowed for recurrent mutations at any site to mimic homoplasy in virus sequence data and allow construction of null distributions for LD parameters expected in the absence of recombination. We tested for recombination in two ways: by comparing LD in observed data with corresponding null distributions obtained from simulated data; and by testing for a negative relationship between observed LD between pairs of polymorphic nucleotide sites and inter-site distance. We applied these tests to six FMDV datasets from four serotypes and found some evidence for recombination in all of them.

†Present address: Division of Environmental and Evolutionary Biology, University of Glasgow, Glasgow G12 8QQ, UK.


   MAIN TEXT
Examining sequence data for recombination is important because only when it is absent are direct inferences from phylogenies, or those derived from phylogenetically based methods of analysis, likely to be reliable (e.g. Anisimova et al., 2003). A number of methods have been developed that detect recombination from nucleotide data (compared by Wall, 2000; Posada & Crandall, 2001; reviewed by Awadalla, 2003). Some examine sequence alignments for a few discrete recombination events between distantly related genotypes (e.g. Maynard Smith, 1992; Grassly & Holmes, 1997; Holmes et al., 1999), but other approaches are required if recombination occurs frequently between closely related lineages. Coalescent-based population genetic methods have been developed for when mutation is rare (e.g. Hudson, 1987; Hey & Wakeley, 1997), but these methods are likely to overestimate recombination rate when applied to virus sequences in which mutation rates are high. Recently, new methods have been proposed to tease these two confounding processes apart (Worobey, 2001; McVean et al., 2002) that are more suitable for the analysis of RNA virus sequence data in which the frequency of multiple mutations at the same site is high.

One frequently applied test for the presence of recombination is to examine the relationship between the level of linkage disequilibrium (LD) between tightly linked sites and those spaced further apart within the sequence (Schaeffer & Miller, 1993; Conway et al., 1999; Awadalla & Charlesworth, 1999; Hudson, 2001; McVean et al., 2002). LD is a measure of the correlation between the occurrence of genetic markers (e.g. nucleotides, restriction sites or alleles) at different sites in the genome measured across multiple genomes. Recombination occurring between two sites will usually reduce the LD between them. Since the recombination rate is likely to be higher between more physically distant pairs of sites, the result will be a negative relationship between estimated LD associated with pairs of bi-polymorphic sites (those at which just two nucleotides are present) and the number of nucleotide sites separating them (Fig. 1).



Fig. 1. An example of how mutation and recombination can produce different patterns of linkage disequilibrium (LD). Sites shown are all bi-polymorphic (just two nucleotides occur at each site), but site 1 is a singleton – the bi-polymorphism is represented in just one sequence (sequence i). Sites 2 and 3 are in full LD with each other (r2=1, D'=1; T at site 2 is never found with C at site 3, nor A at site 2 with G at site 3). However, a recombination event occurring between genomes (ii) and (iii) at a position between sites 3 and 300 would cause the complete elimination of LD (r2=0, D'=0) between sites 2 and 300 and between sites 3 and 300 (since all four combinations of nucleotides are consequently observed at each pair of sites). Thus recombination results in a negative relationship between LD and inter-site distance. Note that multiple mutation at the same site will also weaken LD, but the extent to which it does so can be predicted from the overall incidence of nucleotide substitution.

 
However, mean levels of LD may also be informative of recombination rate. Using realistic models of nucleotide substitution, simulation can be used to predict expected distributions of LD in the presence of high mutation rates. There are two possible approaches. The first is to simulate the full genealogical process underlying the observed data using a coalescent population genetic model (e.g. McVean et al., 2002). Coalescent models use parameters estimated from sequence alignments to simulate the fusion (or coalescence) of different genetic lineages – in a sense, running the phylogenetic branching process in reverse, until only the single common ancestor to the whole phylogeny remains. The advantage of this approach is that quantitative estimates of the recombination rate can be obtained by maximizing the goodness-of-fit of the model to the sequence data. The disadvantage is that coalescent models also require that specific assumptions be made about demographic processes – which for viruses are likely to be very complicated and about which we know rather little. Biases in the estimate of recombination rate might arise if assumptions made about the underlying genealogical process are not met. The second approach (adopted here) estimates levels of LD arising among sets of sequences in the absence of recombination and tests whether this level is significantly higher than that observed (e.g. Worobey, 2001). The parameters of a mutation model can be estimated directly from the data and simulated datasets can be created assuming genealogical relationships estimated from the observed phylogeny. No assumptions need be made about demography or the underlying genealogical processes.

Foot-and-mouth disease virus (FMDV), in the family Picornaviridae, causes a widely distributed disease of cloven-hoofed animals and occurs as seven serotypes: A, O, C, Asia 1 and SAT (South African Territories) types 1, 2 and 3. Acute FMDV infections of domestic animals are usually of only 2–3 weeks' duration, which limits the opportunity for both accumulation of de novo genetic variation and multiple infections by different genotypes. However, more persistent subclinical infections may become established in cattle and, particularly for SAT type viruses, in African buffalo (Syncerus caffer), from which virus may be recovered for 2–5 years after first infection.

Evidence from several genera within the family Picornaviridae suggests that occasional recombination between distantly related genomes has been important in the genetic history of this group (e.g. Brown et al., 2003; Liu et al., 2003; Yang et al., 2003). In FMDV, strong evidence exists for historical between-serotype recombination (Krebs & Marquardt, 1992; van Rensburg et al., 2002) and a within-serotype recombinant has been identified by Tosh et al. (2002). However, it is not clear from these observations whether recombination is persistently high but only occasionally detectable, or whether it is actually a rare event. High rates of intragenic recombination will lower the resolution of phylogenetic inference and serve to generate antigenic novelty in areas where multiple strains co-circulate. Laboratory studies of FMDV and other picornaviruses suggest that recombination could be very common during infection (King et al., 1985; King, 1988). If it is, the epidemiology of FMDV dictates that most recombination is likely to be between virus genes of high sequence identity and hence only detectable at the virus ‘population level’ through a general lowering of LD.

For each dataset considered, we calculated the correlation between the occurrence of different nucleotides at different sites using the Hill & Robertson (1968) measure r2, calculated for all pairs of sites segregating for two nucleotides (where $r^2 = D^2 / [p_A(1-p_A)\,p_B(1-p_B)]$ and $D = p_{AB} - p_A p_B$; the notation is conventional: $p_{AB}$ represents the frequency of alleles with nucleotide A present at the first site and B present at the second, $p_A$ represents the frequency of nucleotide A at the first site, etc.). We also calculated D', a measure of degree of association between nucleotide variants of different polymorphic sites, where $D' = D / D_{max}$ and $D_{max}$ is the largest possible value of D given the nucleotide frequencies (Brown, 1975; Lewontin, 1988). The correlation between both pairwise measures of LD and distance, $d_{ij}$, for pairs of polymorphic sites was evaluated using the standard Pearson correlation coefficient and significance was determined using a Mantel test (randomizing the position of sites; Manly, 1986). The value from the actual sequence data was compared with the distribution of coefficients from randomized sets of data and was considered significant at a given level if its absolute magnitude exceeded the 95th percentile. Mean values of r2 and D' were computed over all bi-polymorphic pairs of sites within each dataset and denoted $\overline{r^2}$ and $\overline{D'}$, respectively.
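
The pairwise statistics and the randomization test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names (`ld_pair`, `pearson`, `ld_distance_test`), the use of |D|/Dmax for D', the number of permutations and the random seed are all choices made here for illustration. Each alignment column is assumed to be a string containing exactly two distinct nucleotides, with no gaps or ambiguity codes.

```python
import itertools
import math
import random


def ld_pair(col_a, col_b):
    """Return (r2, D') for two bi-polymorphic alignment columns."""
    n = len(col_a)
    a, b = col_a[0], col_b[0]  # arbitrary reference nucleotide at each site
    p_a = sum(x == a for x in col_a) / n
    p_b = sum(y == b for y in col_b) / n
    p_ab = sum(x == a and y == b for x, y in zip(col_a, col_b)) / n
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    # Dmax depends on the sign of D (assumption: D' reported as |D| / Dmax).
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return r2, abs(d) / d_max


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ld_distance_test(columns, positions, n_perm=999, seed=0):
    """Correlate r2 with inter-site distance; permute site positions for a p-value."""
    pairs = list(itertools.combinations(range(len(columns)), 2))
    r2 = [ld_pair(columns[i], columns[j])[0] for i, j in pairs]

    def corr(pos):
        return pearson(r2, [abs(pos[i] - pos[j]) for i, j in pairs])

    observed = corr(positions)
    rng = random.Random(seed)
    shuffled = list(positions)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomize which position each site occupies
        if abs(corr(shuffled)) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)
```

Calling `ld_pair("TTAA", "GGCC")` gives full LD, whereas `ld_pair("TTAA", "GCGC")`, in which all four nucleotide combinations occur equally often, gives zero for both statistics. The same test can be run on D' by taking the second element returned by `ld_pair`.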

The methodology that follows is conceptually similar to the ‘informative sites’ test of Worobey (2001) except that we used LD rather than numbers of informative sites as a test statistic. Phylogenetic tree topology was estimated for each dataset using DNADIST and FITCH in the PHYLIP package (Felsenstein, 1993). This topology was then used to make maximum-likelihood estimates of branch lengths, rate heterogeneity (α) and transition–transversion ratio (κ/2) using the HKY85 model of base substitution (Hasegawa et al., 1985) as implemented in BASEML in the PAML package (Yang, 1997). The analysis was restricted to 3rd codon base positions to remove as far as possible the influence of selection. Having arrived at final estimates of κ, α and branch lengths for each dataset, we used these parameters to simulate 500 equivalent datasets using the EVOLVER program (again using the HKY85 model) from the PAML package. We compared observed values of LD statistics from each real dataset with corresponding distributions of LD statistics obtained from each set of 500 simulated datasets and thereby inferred which observed values of LD differed significantly from expectation under the hypothesis of no recombination.
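
The final comparison step, in which an observed mean LD value is set against the values from recombination-free simulations, can be illustrated as follows. This is a sketch only: the function name `compare_to_null` is hypothetical, the numbers in the usage note are made up, and the p-value is taken here as the simple proportion of simulated values that are no larger than the observed one, which may differ in detail from the procedure actually used.

```python
def compare_to_null(observed, simulated):
    """Compare an observed mean LD value with a simulated null distribution.

    Returns (ratio, p) where ratio is observed / mean(simulated) and p is the
    proportion of simulated values less than or equal to the observed value
    (a one-sided test for reduced LD; an assumption of this sketch).
    """
    mean_sim = sum(simulated) / len(simulated)
    n_low = sum(s <= observed for s in simulated)
    return observed / mean_sim, n_low / len(simulated)
```

For example, with a hypothetical observed value of 0.10 and simulated values of 0.12, 0.15, 0.09 and 0.20, the function reports that the observed value is about 71 % of the simulated mean and that one in four simulated values is as low or lower. In practice the list of simulated values would hold one entry per simulated dataset (500 in this study).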

We examined six sets of sequences of FMDV VP1 genes from four different serotypes, SAT-1, -2 and -3 (where most of the isolates are from, or closely related to, isolates from African buffalo) and serotype O (all recovered from infections of domestic livestock). The data and their origins are described in Table 1. Prior analysis (using PLATO; Grassly & Holmes, 1997) revealed that there was no large-scale heterogeneity in these alignments and thus no obvious evidence for genetically distinctive recombinants. Because population structure tends to increase LD, the largest dataset was broken down into those arising from smaller geographic regions. As an indication of the effectiveness of these methods for detecting recombination, we subjected four additional datasets to identical analyses (Table 1). We analysed two human immunodeficiency virus (HIV) datasets (dataset G, HIV env gene sequences; Kuiken et al., 2000; and dataset H, HIV nef sequences isolated from a single patient 41 weeks post-infection; Plikat et al., 1997), dataset I, a mitochondrial DNA (mtDNA) dataset for the COII gene from Pan troglodytes verus (Wise et al., 1998) and dataset J, rabies virus N gene sequences isolated from bats in the USA (Smith, 2002). The HIV datasets were purportedly recombining, whereas the mtDNA and the negative-strand rabies virus were considered less likely to be recombining.


Table 1. Description of the data

 
Parameters estimated and used in the simulation of the data are reported in Table 1. If recombination was present in the original data, it is possible that rate heterogeneity (estimated on the inferred phylogeny) has been overestimated. However, adoption of too much rate heterogeneity in our simulations would tend to reduce predicted LD in the absence of recombination and render our conclusions conservative.

Analyses were performed using all bi-polymorphic sites and then with low frequency variants (singletons, where the polymorphism is maintained in just one sequence; doubletons, maintained in two sequences; and tripletons, maintained in three sequences) progressively removed. Table 2 shows LD statistics for all datasets. Two-thirds (4/6) of the FMDV datasets indicated at least one significantly negative correlation (at the 5 % level) between the Hill & Robertson measure of LD (r2) and inter-site distance, and one half (3/6) of the datasets indicated significantly negative relationships (at the 5 % level) between D' (a differently scaled measure of LD) and inter-site distance, both suggestive of recombination.
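
The progressive removal of low frequency variants can be sketched as a simple filter on alignment columns. The function name `bipolymorphic_sites` and the threshold argument `x` are illustrative choices made here (x=1 drops singletons, x=2 also drops doubletons, x=3 also drops tripletons); the sketch assumes equal-length sequences and does not treat gaps or ambiguity codes specially.

```python
from collections import Counter


def bipolymorphic_sites(alignment, x=0):
    """Return column indices with exactly two nucleotides and minor-allele count > x."""
    keep = []
    for pos in range(len(alignment[0])):
        counts = Counter(seq[pos] for seq in alignment)
        if len(counts) != 2:
            continue  # monomorphic, or more than two nucleotides at this site
        if min(counts.values()) > x:
            keep.append(pos)
    return keep
```

With the toy alignment `["AAGT", "AAGA", "ATCA", "GTCA"]`, all four columns are bi-polymorphic, but only the two middle columns survive once singletons are removed (x=1).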


Table 2. Linkage disequilibrium statistics for the datasets analysed

 
These findings were supported by the low levels of mean LD observed in the FMDV data compared with datasets simulated in the absence of recombination (e.g. Fig. 2). The differences increase in magnitude as low frequency polymorphisms are filtered out. When all bi-polymorphisms are considered, mean LD values are ~90 % of simulated recombination-free values; however, they become proportionately smaller than simulated values as singletons and doubletons are omitted (~85 % of simulated values) and then tripletons are removed (~80 % of simulated values). All FMDV sets indicated reduced LD values (Table 2) compared with their simulated counterparts and all but one of the datasets (SAT-1, Set i) indicated significant reductions for either r2 or D' at some level of polymorphic screening.



Fig. 2. The distribution of mean LD estimates in 500 simulated datasets parameterized to match as closely as possible the VP1 gene from isolates of SAT-1 FMDV from southern Africa (dataset B), but assuming no recombination. Low frequency singleton and doubleton bi-polymorphisms have been screened out (x=2, Table 2). The arrow indicates the values observed in the real data. Significance is indicated in Table 2.

 
Both HIV datasets, but particularly dataset G, exhibited significant negative relationships between r2 and inter-site distance (Table 2) and both were characterized by reduced LD in comparison with the matched simulations, although the statistical significance of these reductions was weak (Table 2). In contrast, neither the P. troglodytes verus mtDNA COII dataset nor the rabies virus N gene alignment revealed any evidence of recombination, with no negative relationship between LD and inter-site distance and no significant deviations of mean LD estimates from the simulated expectation that were suggestive of recombination (Table 2), even though the sequences from these datasets are approximately twice as long as the FMDV sequences.

There are a number of reasons to suppose this form of analysis may be robust. While it requires parameter estimates of the mutation model, Worobey (2001) concluded that his simulations, conducted in an almost identical way, were robust to probable levels of uncertainty in parameter estimation. Patterns of virus demography may affect tree shape (Schierup & Hein, 2000) but direct use of recovered phylogenies insulates our conclusions from effects of population demographics on the genealogical process. Overestimating rate heterogeneity (because of recombination events unaccounted for in the estimation of phylogeny) will result in less, not more, LD in simulated data, rendering our inference process conservative. Finally, it is not easy to envisage ways in which positive or purifying selection might result in a reduction in LD at 3rd base positions.

Our proposed methodology falls short of quantifying the extent of recombination in FMDV responsible for the identified linkage deficit. However, while quantitative estimates of recombination rates would be extremely valuable, currently the only way to estimate them from nucleotide data requires specifying a coalescent model. For example, the method described by McVean et al. (2002) assumes a Fisher–Wright population genetic model (constant population size, no selection, no migration, non-overlapping generations) and as a result two sources of uncertainty are incurred: (i) a known additional variance in estimates of recombination rate arising from genealogical variability introduced by this model; and (ii) a largely unknown sensitivity to the inevitable violations of the assumptions made by this particular model when applied to FMDV.

Sequences of SAT serotypes, particularly those from or closely related to isolates from African buffalo – which may remain infected for years – may present the virus with greater opportunities for observable recombination than isolates of other serotypes, which are usually associated with shorter more acute infections. Results from these analyses suggest that frequent recombination between genetically closely related genotypes may be a plausible explanation for the low levels of LD characteristic of the FMDV alignments examined here.


   ACKNOWLEDGEMENTS
 
D. T. H. and P. A. gratefully acknowledge the financial support of the Wellcome Trust. We thank Livio Heath, Eddie Holmes and Gareth Hughes for discussions of this problem.


   REFERENCES
Anisimova, M., Nielsen, R. & Yang, Z. (2003). Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics 164, 1229–1236.

Awadalla, P. (2003). The evolutionary genomics of pathogen recombination. Nat Rev Genet 4, 50–60.

Awadalla, P. & Charlesworth, D. (1999). Recombination and selection at Brassica self-incompatibility loci. Genetics 152, 413–425.

Bastos, A. D. S. (2001). Molecular epidemiology and diagnosis of SAT-type foot-and-mouth disease in southern Africa. PhD thesis, University of Pretoria.

Bastos, A. D. S., Bertschinger, H. J., Cordel, C., van Vuuren, C. D., Keet, D., Bengis, R. G., Grobler, D. G. & Thomson, G. R. (1999). Possibility of sexual transmission of foot-and-mouth disease from African buffalo to cattle. Vet Rec 145, 77–79.

Bastos, A. D. S., Haydon, D. T., Forsberg, R., Knowles, N. J., Anderson, E. C., Bengis, R. G., Nel, L. H. & Thomson, G. R. (2001). Genetic heterogeneity of SAT-1 type foot-and-mouth disease viruses in southern Africa. Arch Virol 146, 1537–1551.

Brown, A. H. (1975). Sample sizes required to detect linkage disequilibrium between two or three loci. Theor Popul Biol 8, 184–201.

Brown, B., Oberste, M. S., Maher, K. & Pallansch, M. A. (2003). Complete genomic sequencing shows that polioviruses and members of human enterovirus species C are closely related in the noncapsid coding region. J Virol 77, 8973–8984.

Conway, D. J., Roper, C., Oduola, A. M., Arnot, D. E., Kremsner, P. G., Grobusch, M. P., Curtis, C. F. & Greenwood, B. M. (1999). High recombination rate in natural populations of Plasmodium falciparum. Proc Natl Acad Sci U S A 96, 4506–4511.

Felsenstein, J. (1993). PHYLIP: Phylogeny Inference Package, version 3.5c. Department of Genetics, University of Washington, Seattle, WA, USA.

Grassly, N. C. & Holmes, E. C. (1997). A likelihood method for the detection of selection and recombination using sequence data. Mol Biol Evol 14, 239–247.

Hasegawa, M., Kishino, H. & Yano, T. (1985). Dating of the human–ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22, 160–174.

Hey, J. & Wakeley, J. (1997). A coalescent estimator of the population recombination rate. Genetics 145, 833–846.

Hill, W. G. & Robertson, A. (1968). Linkage disequilibrium in finite populations. Theor Appl Genet 38, 226–231.

Holmes, E. C., Worobey, M. & Rambaut, A. (1999). Phylogenetic evidence for recombination in dengue virus. Mol Biol Evol 16, 405–409.

Hudson, R. R. (1987). Estimating the recombination parameter of a finite population model without selection. Genet Res 50, 245–250.

Hudson, R. R. (2001). Two-locus sampling distributions and their application. Genetics 159, 1805–1817.

King, A. M. Q. (1988). Preferred sites of recombination in poliovirus RNA: analysis of 40 intertypic cross-over sequences. Nucleic Acids Res 16, 11705–11723.

King, A. M. Q., McCahon, D., Saunders, K., Newman, J. W. & Slade, W. R. (1985). Multiple sites of recombination within the RNA genome of foot-and-mouth disease virus. Virus Res 3, 373–384.

Krebs, O. & Marquardt, O. (1992). Identification and characterization of foot-and-mouth disease virus O1 Burgwedel/1987 as an intertypic recombinant. J Gen Virol 73, 613–619.

Kuiken, C., Thakallapalli, R., Eskild, A. & de Ronde, A. (2000). Genetic analysis reveals epidemiologic patterns in the spread of human immunodeficiency virus. Am J Epidemiol 152, 814–822.

Lewontin, R. C. (1988). On measures of gametic disequilibrium. Genetics 120, 841–847.

Liu, H. M., Zheng, D. P., Zhang, L. B., Oberste, M. S., Kew, O. M. & Pallansch, M. A. (2003). Serial recombination during circulation of type 1 wild-vaccine recombinant polioviruses in China. J Virol 77, 10994–11005.

Manly, B. F. J. (1986). Randomization and regression methods for testing for associations with geographical, environmental and biological distances between populations. Res Popul Ecol 28, 201–218.

Maynard Smith, J. (1992). Analysing the mosaic structure of genes. J Mol Evol 34, 126–129.

McVean, G. A. T., Awadalla, P. & Fearnhead, P. (2002). A coalescent based method for detecting and estimating recombination from gene sequences. Genetics 160, 1231–1241.

Plikat, U., Nieselt-Struwe, K. & Meyerhans, A. (1997). Genetic drift can dominate short-term human immunodeficiency virus type 1 nef quasispecies evolution in vivo. J Virol 71, 4233–4240.

Posada, D. & Crandall, K. A. (2001). Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc Natl Acad Sci U S A 98, 13757–13762.

Samuel, A. R. & Knowles, N. J. (2001). Foot-and-mouth disease type O viruses exhibit genetically and geographically distinct lineages (topotypes). J Gen Virol 82, 609–621.

Schaeffer, S. W. & Miller, E. L. (1993). Estimates of linkage disequilibrium and the recombination parameter determined from segregating nucleotide sites in the alcohol dehydrogenase region of Drosophila pseudoobscura. Genetics 135, 541–552.

Schierup, M. H. & Hein, J. (2000). Consequences of recombination on traditional phylogenetic analysis. Genetics 156, 879–891.

Smith, J. S. (2002). Molecular epidemiology. In Rabies, pp. 79–111. Edited by A. C. Jackson & W. H. Wunner. New York: Academic Press.

Tosh, C., Hemadri, D. & Sanyal, A. (2002). Evidence of recombination in the capsid-coding region of type A foot-and-mouth disease virus. J Gen Virol 83, 2455–2460.

van Rensburg, H., Haydon, D., Fourie Joubert, F., Bastos, A. D. S., Heath, L. & Nel, L. (2002). Genetic heterogeneity in the foot-and-mouth disease virus leader and 3C proteinase genes. Gene 289, 19–29.

Wall, J. D. (2000). A comparison of estimators of the population recombination rate. Mol Biol Evol 17, 156–163.

Wise, C. A., Sraml, M. & Easteal, S. (1998). Departure from neutrality at the mitochondrial NADH dehydrogenase subunit 2 gene in humans, but not in chimpanzees. Genetics 148, 409–421.

Worobey, M. (2001). A novel approach to detecting and measuring recombination: new insights into evolution in viruses, bacteria, and mitochondria. Mol Biol Evol 18, 1425–1434.

Yang, C. F., Naguib, T., Yang, S. J. & 10 other authors (2003). Circulation of endemic type 2 vaccine-derived poliovirus in Egypt from 1983 to 1993. J Virol 77, 8366–8377.

Yang, Z. (1997). PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13, 555–556.

Received 19 August 2003; accepted 7 January 2004.