Department of Statistics, University of Oxford, United Kingdom;
Department of Computer Science
Institute of Biological Sciences, University of Aarhus, Denmark
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Over the last two decades, many methods have been proposed to detect mosaic structures caused by recombination (see Crandall and Templeton [1999
] for a review). The problem has been attacked at several distinct levels. At a first level, one could be interested in reporting whether an observed sample of sequences has experienced recombination in its history. This is a simple yes-or-no questionhas recombination occurred? Several methods attack the problem at this level only (e.g., Sawyer 1989
; Maynard Smith and Smith 1998
). Confirming that recombination events have occurred in the sample's history, one can then go on to ask where the break points are located along the sequences. This problem is considerably harder and has also been addressed, e.g., Hein (1993)
and Weiller (1998)
. Between these two levels of approach are methods that consist of manual inspection and division of the sequences into smaller regions (Stephens [1985
], Maynard Smith [1992
], and Jakobsen and Easteal [1996
], among others), which are then tested for the presence of recombination. At a final level, one can attempt to reconstruct the entire history of the sample. In the parsimony approach taken by Hein (1993)
, the most parsimonious history is constructed assuming that the cost of a substitution compared with that of a recombination event is known. Recently, McGuire, Wright, and Prentice (2000)
developed a similar method based on hidden Markov models in a Bayesian framework.
Each recombination event breaks the sequence alignment into two parts such that the left part of the alignment has a phylogenetic history potentially different from that of the right part (fig. 1 ). A recombination event can only be detected if the histories of the left and right parts are different. If the two recombining lineages coalesce before merging with any other lineage in the sample's history, no trace of the recombination event is left. On the other hand, if one (or both) of the recombining lineages merges with a nonrecombining lineage before the two recombining lineages coalesce, then the phylogenies on each side of the break point differ. As a consequence, the probabilities of a polymorphic site are not the same in the two phylogenies, and the recombination event is, in principle, detectable from sequence data.
|
Other methods are within the framework of compatibility (Le Quesne 1969
; Sneath, Sackin, and Ambler 1975
). A site is compatible with a tree if the observed characters can be explained by c - 1 substitutional events, where c is the observed number of different characters in the given site. If more substitutional events are required, the site is said to be incompatible with the tree (fig. 2A
). It is easy to check if there exists a tree such that two given sites are both compatible with the tree. If two given sites are not both compatible with the same tree, the incompatibility can be caused either by recurrent substitutions in one (or both) of the sites or by recombination (fig. 2B
). The pair of sites is then said to be incompatible. The methods by Stephens (1985)
, Hein (1993)
, Fitch and Goodman (1991)
, Jakobsen and Easteal (1996)
, and Jakobsen, Wilson, and Easteal (1997)
, among others, are all based on the concept of compatibility. The approaches to the problem are, however, very different. For example, Hein (1993)
and Fitch and Goodman (1991)
develop parsimony procedures, whereas Jakobsen and Easteal (1996)
develop a method based on comparisons of all adjacent columns in the sequence alignment.
|
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Sawyer (1989)
developed the (inner) SSCF method to detect recombinations or gene conversions without reference to the history of the sample or where on the sequences the recombinations occurred. This method detects the presence of recombination by the appearance of long tracts of identities among pairs of sequences.
Maynard Smith (1992)
proposed the Max Chi-Squared method, which searches for recombination break points by comparing the number of segregating sites on both sides of a putative recombination break point in a pair of sequences with the number of segregating sites in the rest of the sequences in the sample. The window size used in Max Chi-Squared was 300 (2 x 150). This gave the best power overall.
Jakobsen and Easteal's (1996)
neighbor similarity score (NSS) method uses pairs of informative sites. It detects recombination by the tendency of neighboring positions to be more compatible than sites that are farther apart.
Hein (1993)
developed RecPars, a parsimony algorithm to detect shifts in evolutionary history along the sequences. RecPars minimizes a combined cost of recombinations and substitutions necessary to explain a data set. The cost of a recombination (d) was set to 1.5, and that of a substitution (s) was set to 1 (based on initial simulations of sequences under different scenarios).
Evaluating Power
A permutation test to evaluate the power of the methods was used. Assume that M() is a stochastic model of sequence evolution, and let
denote the recombination rate, where
= 0 implies no recombination (i.e., all sites have the same history), and
=
implies that all sites have independent histories. The power, defined as the probability of rejecting the hypothesis H0 of no recombination given the true model M(
), 0
, was assessed by comparing the output of a method under M(
) with output from permutations of the alignment.
Unless = 0 or
=
, the distribution of the permuted alignment will be different from the distribution of the original alignment. If there is no recombination (
= 0), all sites have the same history, and if all sites are unlinked (
=
), all sites have independent histories. In both cases, the distribution of the alignment is invariant under permutations. Thus, comparison of the output of a method under M(
) with output from permutations of the alignment provides a test of H0 jointly with the hypothesis that all sites are unlinked, H
, given the true model M(
). The power is expected to increase from
%, the chosen significance level, for
= 0 until a certain point, and then to decrease to
% again for
=
.
In comparing the output of a method under M() with output from permutations of the alignment, the essential assumption is an overall rate homogeneity across sites, that is, that no regions evolve faster than other regions. No specific model of sequence evolution (e.g., the Jukes-Cantor model) or a specific model of the phylogenetic process (including recombination) is assumed. This makes it attractive compared with the traditional method of comparing the output under M(
) with output under M(0), where a specification of M(0) is required.
Simulation Algorithms and Theory
Two different setups were used to simulate sample histories. In the first setup, we used the coalescent process with recombination and exponential growth (Hudson 1983
; Slatkin and Hudson 1991
). The coalescent process emerges in a variety of contexts (Kingman 1982b
), has been used in studies of viral data (e.g., Pybus, Holmes, and Harvey 1999
; Rodrigo and Felsenstein 1999
), and provides a natural foundation for simulation. However, the main objective of this simulation approach is to provide a stochastic tool to generate sample histories with shapes that depend on the choice of parameter values. There are two parameters: the recombination rate
and the growth rate ß. In the population genetic context,
= 2Nr and ß = Nb, where N is the effective population size, r is the probability of a recombination per sequence per generation, b is the growth rate of the population per generation, and time is measured in units of N generations (Hudson 1983
; Slatkin and Hudson 1991
). In particular, if b = 0 (ß = 0), the population is of constant size. Details of the simulation algorithm can be found in Hudson (1983)
and Griffiths and Tavaré (1994)
.
Consider a single site. Let Wk, k = n, n - 1, ... , 2, be the times between successive coalescent events; that is, Wk is the time while there are k lineages in the history of a particular site. The distribution of Wk depends on ß: (1) If ß = 0, Wk is exponentially distributed, Exp(k(k - 1)/2); and (2) if ß is large (e.g., ß = 5,000), Wk /Wn 0, k = 2, 3, ... , n - 1 (Griffiths and Tavaré 1998
). This imposes fundamental differences on the shape of a tree (see also table 1
):
|
We let recombination happen uniformly along the sequences at rate /2 per sequence. The number of recombination events, R(n), in the history of a sample of size n has expectation
|
|
In the second setup, a different technique was applied to generate samples with one recombination event only. Let k be the joint rate of coalescence and recombination events. The time from when there are k ancestral lineages until there are k + 1 lineages (recombination) or k - 1 lineages (coalescence) is thus exponentially distributed with parameter
k (fig. 3
), and times between events are independent. The probability that the recombination occurs while there are k lineages is
|
|
The first setup reflects the situation in which a real data set is under scrutiny. In the second setup, the phylogenetic signal from a single recombination event is explored.
Details of Setups
In all simulations, sequence length L was fixed at 1,000. Sample size n, mutation rate , recombination rate
, and growth rate ß or the rates
k were varied. Between 1,500 and 2,800 samples of sequences and sequence histories were simulated for each choice of parameters, and the four methods were applied to all of the simulated samples. For each of these simulated samples, 200 permutations of the columns in the sequence alignment were performed, and the methods were run on the permuted data sets. For RecPars, 200 histories were simulated and 200 permutations performed due to a very time-consuming algorithm in RecPars. It was reported how many times the outcome from simulated samples deviated significantly at a 5% level from the outcome of permuted data sets.
Setup 1: Histories with a Random Number of Recombination Events
In one set of simulations, n was fixed at 10, and was varied such that the expected sequence divergence p between two sequences was p = 1%, 2.5%, 5%, 10%, and 20%. In a second set of simulations,
was chosen such that p = 10% and n was varied over 5, 10, 15, 25, and 50. Thus, two sequences differed on average in p sites, but due to recombination, the distribution of polymorphic sites was not necessarily uniform along the sequences. Two choices of ß were considered: ß = 0 and ß = 5,000. The relationship between p and
for ß = 0 is given by
|
In the first set of simulations, the rate of recombination was varied over = 2, 4, 8, and 16 for ß = 0 (n = 10). For ß = 5,000,
was determined such that Eß[R(n)] = E0[R(n)], and the same amount of recombination was expected in samples simulated under ß = 0 and ß = 5,000. This gave
= 800, 1,600, 3,200, and 6,400, approximately, for n = 10. In the second set of simulations, for ß = 0,
= 4, and for ß = 5,000,
= 2,200 (n = 5),
= 1,600 (n = 10),
= 1,300 (n = 15),
= 1,000 (n = 25), and
= 700 (n = 50), approximately.
Setup 2: Histories with One Recombination Event Only
In the second setup, samples and histories were simulated according to equation (3)
with one recombination event of type 2 or 3 only or with one recombination event of type 3 only. The sample size was 10 in all simulations.
Three different forms of the rates k were chosen: (A)
k = k(k - 1)/2, (B)
k = 1, and (C)
k = n - k + 2. (Note that
k = n - k + 2 is only meaningful because the number of recombinations is restricted to 1; thus, k
n + 1 and
k > 0 for all k.) Form A gives phylogenies with short external branches and long internal branches, and form C gives phylogenies with long external branches and short internal branches. Form B is between forms A and C. The form of
k determines when the recombination event happens: for form A, most recombinations happen while there are few lineages, and for forms B and C, most recombinations happen while there are many lineages (see eq. 3
). The recombination break point occurs between nucleotides 500 and 501. The parameter
was adjusted such that the pairwise sequence divergence varied over p = 1%, 2.5%, 5%, 10%, and 20%. This was accomplished by simulation.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
For sample size 10, an increase in power was observed with increasing recombination rate (figs. 4 and 5
) with both ß = 0 and ß = 5,000. For ß = 0, the expected number of events of types 2 and 3 goes from about 4.8 ( = 2) to about 35 (
= 16), with the expected total number of events going from 5.7 (
= 2) to about 45 (
= 16). The expected number of type 2 and 3 events is larger for ß = 5,000 than for ß = 0. For
4, the chance that there is at least one event of type 2 or 3 in a random sample is almost 1, essentially ruling out the possibility of reporting false positives, that is, reporting recombinations in sample histories without recombination. Generally, an overall increase in power was observed with increasing sequence divergence p; this was expected, because the number of polymorphic sites increases with p. An exception to this was SSCF (with ß = 0), for which the power stayed roughly constant for fixed recombination rate. Note that for
= 0, recombinations were detected in about 5% of the simulations. This was expected, as the level of significance was set to 5%.
|
Table 2
shows the results for the second set of simulations with p = 10% and n varying. Consider ß = 0: All four methods showed increasing power with increasing sample size. Next, consider ß = 5,000. Here, SSCF and Max Chi-Squared showed drastically reduced power. The power of NSS was increased from n = 5 to n = 10, but decreased for n > 10 from 66% to 15%. In contrast, RecPars retained the power obtained under ß = 0. The reduced power of SSCF and Max Chi-Squared can be explained by the fact that under population growth, even for n = 50, the branches of the tips compose a large part of the total branch length (table 1
), and as consequence, most mutations happen near the tips. The fact that the power of NSS is increased from n = 5 to n = 10 is likely due to the higher percentage of type 3 events in the latter case. The decrease in power for larger sample sizes for this method was expected: The number of incompatible pairs of sites increases with sample size (for fixed and p; results not shown). RecPars was not similarly affected, as it is not based on pairwise comparisons of sites. RecPars suggests a change in topology if a stretch of sites is consistently more economically explained by a new topology. Within such a stretch, there might be sites incompatible with the new topology (e.g., because of recombination or mutations); these sites will then be explained by mutations. This might account for the observed increase in power with n for this method.
|
|
|
Conditioned on exactly one event of type 2 or 3, the ratio of type 3 events to type 2 events changes with the form of k: If
k = k(k - 1)/2, r = p3/(p2 + p3) = 38%; if
k = 1, r = 75%; and if
= 10 - k + 2 (n = 10), r = 82%. This should be reflected in NSS and RecPars, and to lesser extent in Max Chi-Squared and SSCF, because these methods are not based on the distribution of incompatible sites in the sample. For example, in the first case, NSS and RecPars should recover less than 38% of the recombination events unless some events of type 2 also are detected. In contrast, Max Chi-Squared and SSCF could potentially recover 100% because they are designed to detect either of the two types. Conditional on type 3 events, only the ratio is, of course, 100% for all methods.
For simplicity, let us denote by T3 the case in which the conditioning is on type 3 events only and let us denote by T23 the case in which the conditioning is on type 2 or 3 events. In case T23, the power was highest under (A) k = k(k - 1)/2 and lowest under (C)
k = 10 - k + 2, except for RecPars, where form B had the most power (fig. 6
; standard errors ranged from 1 to 3 for RecPars and were less than 1.2 for the other methods; results not shown). Two factors are in play. First, more recombination events happen under C than under A while there are many lineages. This implies that the chance of a type 3 event is higher under C than under A, because a type 3 event requires at least four sequences. As a consequence, the chance of detecting a recombination should increase. (This argument also accounts for the observed higher power under T3 than under T23.) Second, under C, the phylogenies tend to be more starlike, and less information about topology is available from the sequences. The latter effect seemed to be of the most importance. It is interesting that the power of RecPars and NSS did not go up as the upper bound for detecting a recombination went up from 38% under A, through 75% under B, to 82% under C. This must be ascribed, again, to the fact that recombination is harder to detect in starlike trees.
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Both RecPars and Max Chi-Squared would perform better if the recombination cost or the window size were set individually for each simulated sample. The values chosen here are those that overall gave the highest power. In general, it would be difficult, if not impossible, to assign a value to the recombination cost in RecPars from information in data. The value chosen here (1.5 times the cost of a mutation) seems to work fine despite the wide range of shapes of genealogies and the amount of recombination, and it might be taken as a sensible starting value in analysis of real data. In contrast, the window size in Max Chi-Squared can easily be varied, and one can choose the size that gives the most significant output. However, this will not necessarily guarantee the highest power.
The mutation scheme applied in the simulations is very simple and not realistic for most real data. Rates of mutation tend to vary and to be higher in some regions of the sequences than in other regions. This will inevitably give more variable sequence patterns and affect the power negatively.
The sequence length can be varied. In setup 1, the effect of varying the sequence length L is equivalent to the effect of varying the recombination rate ; all pairs (L,
) of constant ratio C =
/L produce similar outcomes. In setup 2, there is only one recombination event, and an increased sequence length increases the chance of detecting the recombination event: more information is available. Max Chi-Squared is unaffected, as a sliding-window approach is adopted. In real sequences there will be an upper limit to the size of a segment without recombination. For example, most reported recombinant segments in HIV are less than 1,000 nt long (see, e.g., http://hiv-web.lanl.gov), the sequence length chosen in our studies.
In general, it is difficult to set up guidelines concerning which method to choose in an analysis of real sequences. If long branches are expected at the tips, compatibility methods might be preferred. An indication of long branches could be a large number of singletons in the sample. It is advisable to use a set of methods based on different principles.
|
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
1 Keywords: DNA sequences
incompatibility
recombination
simulations
2 Address for correspondence and reprints: Carsten Wiuf, Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, United Kingdom. wiuf{at}stats.ox.ac.uk
.
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Crandall K. A., A. R. Templeton, 1999 Statistical methods for detecting recombination Pp. 153176 in K. A. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md
Fitch D. H. A., M. Goodman, 1991 Phylogenetic scanning: a computer-assisted algorithm for mapping gene conversion and other recombinational events Comput. Appl. Biosci 7:207-215[Abstract]
Grassly N. C., E. C. Holmes, 1997 A likelihood method for the detection of selection and recombination using nucleotide sequences Mol. Biol. Evol 14:239-247[Abstract]
Griffiths R. C., S. Tavar, 1994 Sampling theory for neutral alleles in a varying environment Philos. Trans. R. Soc. Lond. B Biol. Sci 344:403-410[ISI][Medline]
. 1998 The age of a mutant in a general coalescent tree Stochastic Models 14:273-295
Hein J., 1993 A heuristic method to reconstruct the history of sequences subject to recombination J. Mol. Evol 36:396-405[ISI]
Hudson R. R., 1983 Properties of the neutral allele model with intergenic recombination Theor. Popul. Biol 23:183-201[ISI][Medline]
Hudson R. R., N. Kaplan, 1985 Statistical properties of the number of recombination events in the history of DNA sequences Genetics 111:147-164
Jakobsen I., S. Easteal, 1996 A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences CABIOS 12:291-295[Abstract]
Jakobsen I., S. R. Wilson, S. Easteal, 1997 The partition matrix: exploring variable phylogenetic signals along nucleotide sequence alignments Mol. Biol. Evol 14:474-484[Abstract]
Jukes T. H., C. Cantor, 1969 Evolution of protein molecules Pp 21123 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York
Kimura M., 1980 A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences J. Mol. Evol 16:111-120[ISI][Medline]
Kingman J. F. C., 1982a. The coalescent Stochastic Processes Appl 13:235-248
. 1982b. Exchangeability and the evolution of large populations Pp. 97112 in G. Koch and F. Spizzi-chino, eds. Exchangeability in probability and statistics. Amsterdam, North-Holland.
Le Quesne W. J., 1969 A method of selection of characters in numerical taxonomy Syst. Zool 18:202-205
McGuire G., F. Wright, M. J. Prentice, 2000 A Bayesian model for detecting past recombination events in DNA multiple alignments J. Comp. Biol 7:159-170[ISI]
Maynard Smith J., 1992 Analyzing the mosaic structure of genes J. Mol. Evol 34:126-129[ISI][Medline]
Maynard Smith J., N. H. Smith, 1998 Detecting recombination from gene trees Mol. Biol. Evol 15:590-599[Abstract]
Pybus O. G., E. C. Holmes, P. H. Harvey, 1999 The mid-depth method and HIV-1: a practical approach for testing hypotheses of viral epidemic history Mol. Biol. Evol 16:953-959[Abstract]
Rodrigo A. G., J. Felsenstein, 1999 Coalescent approaches to HIV population genetics Pp 233272 in K. A. Crandall, ed. The evolution of HIV. Johns Hopkins University Press, Baltimore, Md
Sawyer S., 1989 Statistical tests for detecting gene conversion Mol. Biol. Evol 6:526-536[Abstract]
Slatkin M., R. R. Hudson, 1991 Pairwise comparisons of mitochondria DNA sequences in stable and exponentially growing populations Genetics 129:555-562
Sneath P. H. A., M. J. Sackin, R. P. Ambler, 1975 Detecting evolutionary incompatibilities from protein sequences Syst. Zool 24:311-332[ISI]
Stephens J. C., 1985 Statistical methods of DNA sequence analysis: detection of intragenic recombination or gene conversion Mol. Biol. Evol 2:539-556[Abstract]
Weiller G. F., 1998 Phylogenetic profiles: a graphical method for detecting genetic recombinations in homologous sequences Mol. Biol. Evol 15:326-335[Abstract]