* Biomathematics and Statistics Scotland (BioSS), JCMB, King's Buildings, Edinburgh, United Kingdom
Department of Applied Statistics, University of Reading, Reading, United Kingdom
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: phylogeny DNA sequence alignment recombination hidden Markov models Markov chain Monte Carlo
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In the last few years, a plethora of methods for detecting recombination have been developedfollowing up on the seminal paper by Maynard Smith (1992)and it is beyond the scope of this article to present a comprehensive overview. Many detection methods for identifying the nature and the breakpoints of the resulting mosaic structure are based on moving a window along the sequence alignment and computing a phylogenetic divergence score for each window position. Two well-established methods following this approach are PLATO and TOPAL.
PLATO (Grassly and Holmes, 1997) estimates a phylogenetic tree from the whole DNA sequence alignment, and then systematically looks for subsets with a low likelihood under this model by computing the statistic
|
|
|
The article is organized as follows: the next section, Method: Background and Earlier Approaches, introduces the mathematical method and discusses the shortcomings of existing parameter estimation techniques. Then, under Method: A Bayesian Approach, we discuss how earlier approaches can be improved with a Bayesian approach using Markov chain Monte Carlo. In the section titled Data we describe various synthetic and real-world DNA sequence alignments on which the proposed scheme was tested. We next present the simulation study itself and discuss the results. The article ends with a conclusion and recommendations for future work.
![]() |
Method: Background and Earlier Approaches |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
Obviously, this optimization problem is, in general, intractable. First, the number of possible topologies at a given site, K, increases super-exponentially with the number of sequences m. Second, there are KN different state sequences, which prevents an exhaustive search even for small values of K. Consequently, the introduction of approximations and restrictions is inevitable.
To deal with the second source of computational complexity, interactions between sites are limited to nearest-neighbor interactions. This allows the application of a dynamic programing scheme which reduces the computational complexity to (K2N), that is, to an expression linear in N. To deal with the first source of complexity, the scheme has to be restricted to alignments with small numbers of sequences. In the current work, we restrict our approach to alignments with only m = 4 taxa. In the Discussion we describe how this restriction can be relaxed.
RECPARS
Hein (1993) defined optimality in a parsimony sense. His algorithm, RECPARS, searches for the most parsimonious state sequence S, that is, the one that minimizes a given parsimony cost function E(S). Interactions between sites are restricted to nearest-neighbor interactions, as discussed above, and the search is carried out with dynamic programing. Although RECPARS is faster than the methods to be discussed below, it suffers from the shortcomings inherent in parsimony, as discussed by Felsenstein (1988). Moreover, E(S) depends only on the topology-defining sites; thus the algorithm discards a substantial proportion of sites in the alignment. The most serious disadvantage is that the cost function E(S) depends on certain parametersthe mutation cost Cmut, and the recombination cost Crecwhich can not be optimized within the framework of this method. Consequently, these parameters have to be chosen by the user in advance, and the predictions depend on this rather arbitrary prior selection.
Detecting Recombination with Hidden Markov Models
Adopting a statistical approach to phylogenetics, illustrated in figure 3, the probabilistic equivalent to RECPARS is a hidden Markov model (HMM), whose application to the detection of recombination was first suggested by McGuire, Wright, and Prentice (2000). Figure 4, left, shows the corresponding probabilistic graphical model. White nodes represent hidden states, St, which have direct interactions only with the states at adjacent sites, St-1 and St+1. Black nodes represent columns in the DNA sequence alignment, yt. The joint probability of the DNA sequence alignment, , and the sequences of hidden states, S, factorizes:
|
|
|
|
While, in general, this problem would be intractable because of the exponential increase in the number of state sequences (see above), the reduction to nearest-neighbor interactions between hidden states and the resulting factorization (3) allows the application of a dynamic programing technique, the so-called Viterbi algorithm (Rabiner 1989), to find the mode with computational complexity
(N). The factorization (3) contains three terms: P(yt|St), P(St|St-1), and P(S1). The transition probabilities P(St|St-1) correspond to recombination events (if St
St-1). Let
denote the probability that the tree topology remains unchanged as we move from a given site in the alignment, t, to an adjacent site, t + 1 or t - 1. We then obtain for the state transition probabilities:
|
|
|
Heuristic Parameter Estimation: HMM-Heuristic
McGuire, Wright, and Prentice (2000) estimated the branch lengths w for each tree topology separately with maximum likelihood. This approach is suboptimal. For a proper estimation of the branch lengths of a recombinant tree, one would have to restrict the parameter estimation to the recombinant region. The location of this region, however, is not known in advance. Estimating the branch lengths from the whole DNA sequence alignment leads to seriously distorted values, as demonstrated by Husmeier and Wright (2001), because the estimation includes regions of the alignment for which the tree topology is incorrect. A heuristic way to address this problem, suggested by McGuire, Wright, and Prentice (2000), is to estimate the branch lengths from a subregion of the alignment. The length of this region should be matched to the length of the recombinant region, which, however, is not known in advance. Also, this approach does not offer a way to estimate the recombination parameter .
Parameter Estimation with Maximum Likelihood: HMM-ML
A solution to this problem, proposed by Husmeier and Wright (2001), is a proper maximum likelihood estimation of the parameters so as to maximize
|
To rephrase this problem, note that hidden Markov models and phylogenetic trees have many similarities with neural networks; in fact, all three models are instances of the more general class of graphical models (Heckermann 1999). Studies on neural networks and graphical models have shown that, for sparse data, maximum likelihood is susceptible to over-fitting, and that the generalization performance is significantly improved with the Bayesian approach. A detailed investigation of this approach can be found in Neal (1996). In a nutshell, maximum likelihood gives only a point estimate of the parameters, which ignores the more detailed information contained in the curvature and (possibly) multimodality of the likelihood landscape. By sampling rather than optimizing parameters, the Bayesian approach captures more information about this landscape, and consequently gives improved and more reliable predictions.
![]() |
Method: A Bayesian Approach |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
In principle this avoids the over-fitting scenario mentioned above and removes the need for a separate hypothesis test. The difficulty, however, is that the integral in (eq. 8) is analytically intractable, which calls for the application of a numerical approximation, using Markov chain Monte Carlo (MCMC). The practical viability of the Bayesian framework thus hinges on the performance of this scheme. In the subsections below, we will discuss the following issues: (1) the choice of prior probabilities; (2) the chosen Markov chain Monte Carlo method, which has the form of a Metropolis-Hastings and Gibbs-within-Gibbs sampling scheme; (3) methods for accelerating the convergence of the Markov chain; (4) the prediction resulting from this scheme; and (5) a software implementation. We will then test this approach on various DNA sequence alignments.
Prior Probabilities
Inherent in the Bayesian framework is the choice of prior probabilities for all model parameters, as illustrated in figure 4, right. We make the usual assumption of parameter independence, P(, w,
) = P(
)P(w)P(
), and choose rather vague priors to reflect the absence of true prior knowledge. The prior probabilities will either be conjugate, where possible, or uniform, but proper (that is, restricted to a finite interval).
The recombination parameter is a binomial random variable, for which the conjugate prior is a beta distribution,
|
|
The prior on depends on the model of nucleotide substitution. In the present study, the Felsenstein 84 model (Felsenstein and Churchill, 1996) is used, which has four free parameters: the nucleotide frequencies,
A,
C,
G, and
T, and the transition bias
. (Note that because of the normalization constraint
A +
C +
G +
T = 1, there are 4 rather than 5 free parameters.) In our approach, each tree is allowed to have a different value of
, whereas the nucleotide frequencies are assumed to be the same for all trees. This is for algorithmic efficiency: Allowing each tree to have a different set of frequencies means that some frequencies might be inferred from a small amount of data, leading to vague posterior distributions that slow down the convergence of the Markov chain. The total vector of nucleotide substitution parameters is thus of the form
|
A natural prior for the nucleotide frequencies i is a Dirichlet distribution, which, as a multivariate generalization of the beta distribution (eq. 9), satisfies the normalization constraint. We here choose a Dirichlet (1,1,1,1) distribution, which is a uniform distribution subject to the normalization constraint and thus maximally non-informative. For the transition biases
k (k = 1,..., K), we choose a uniform prior over the interval [0, 2]. Again, an upper bound is needed to prevent the prior from becoming improper. Allowing
k to be as large as 2 will account for extreme cases of transition bias, which should not impose any serious restrictions in practice. Finally, we assume P(S1) =
S1
{1,..., K}, that is, a uniform prior on the tree topologies.
The joint distribution of the DNA sequence alignment, the state sequences, and the model parameters is given by
|
Markov Chain Monte Carlo (MCMC) Sampling
Ultimately, we are interested in the marginal posterior probability of the state sequences, P(S|), which requires a marginalization over the model parameters according to (eq. 8). The numerical approximation is to sample from the joint posterior distribution
|
|
Define . From (eq. 5) and (eq. 9) it is seen that writing the joint probability (eq. 11) as a function of
gives:
|
On normalization this gives
|
For sampling the state sequences S, we adopt the approach suggested by Robert, Celeux, and Diebolt (1993) and sample each state St separately, conditional on the othersthat is, with a Gibbs-within-Gibbs scheme:
|
|
For sampling the remaining parameters, w and , we apply the Metropolis-Hastings algorithm (see Hastings [1970] and Chib and Greenberg [1995]). Let z(i) denote the parameter configuration in the ith sampling step. A new parameter configuration
is sampled from a proposal distribution Q(
|z(i)), and then accepted with probability
|
Improving the Convergence of the Markov Chain
In theory the algorithm converges to the posterior distribution (eq. 12) irrespective of the choice of the proposal distribution (assuming ergodicity). In practice, a "good" choice of Q(.|.) is crucial to achieve convergence within a reasonable amount of time, and will be discussed next.
For the components wl of the vector of branch lengths w and for the transition biases k, a new value is selected from a uniform interval centred around the existing value. This is a symmetric proposal distribution, so the terms Q(.|.) cancel out in equation (18). For the nucleotide frequencies
A,
C,
G,
T, new values are sampled from a Dirichlet distribution. This ensures that the normalization constraint
A+
C +
G +
T = 1 is satisfied. The parameters of the Dirichlet distribution are chosen proportional to the current values of the nucleotide frequencies, thereby proposing new values close to the current ones, which in turn makes it more likely that the proposed values will be accepted. This proposal distribution is not symmetric, so the Q(.|.) terms must be calculated in equation (18).
If too few proposed values are accepted, the corresponding proposal distributions Q(.|.) may be tuned to make acceptance more likely and thereby to accelerate convergence. For the branch lengths wl and the transition biases k, this is done by decreasing the width of the uniform interval from which the new value is sampled. For the nucleotide frequencies
A,
C,
G,
T, the constant of proportionality in the Dirichlet distribution is increased so that the proposed frequencies are more likely to be closer to the existing values.
The algorithm is started by first initializing the chain. The sequence of topologies S is chosen randomly or from some initial estimation, e.g., using RECPARS. The branch lengths are set to some plausible value, e.g., the average branch length of the global maximum likelihood tree obtained with DNAML of the PHYLIP package (available from http://evolution.genetics.washington.edu/phylip.html). Initial values for the transition biases k and the nucleotide frequencies
A,
C,
G,
T can be estimated from the data, as described later under Simulations. The parameter groups are then updated in order according to (eq. 13) and the details described above. An initial equilibration or burn-in period must be run to allow the Markov chain to reach stationarity. In this part of the simulation, the parameters of the proposal distributions Q(.|.) are tuned as described above. This burn-in is followed by the sampling phase of the simulation, in which the state sequences S (and, if of interest, the model parameters) are saved for further analysis. Note that during the sampling phase, the parameters of the proposal distributions must not be tuned, as this might lead to biased samples that do not represent the correct posterior probability (eq. 12).
In principle, the initialization of the hidden states is unimportant because the Markov chain will forget its initial configuration and converge toward the equilibrium distribution irrespective of its starting point. In practice, however, extreme starting values can slow down the mixing of the chain and result in a very long burn-in, in which case the MCMC sampler may fail to converge toward the main support of the posterior distribution in the available simulation time. To address this problem, we combined simulations from different initializations and explored a method akin to simulated annealing (Kirkpatrick, Gelatt, and Vecchi 1983). Note that the bottleneck of the presented Markov chain Monte Carlo scheme is the sampling of the topology sequences S. Because recombination events are quite rare, the number of topology changes along the DNA sequence alignment is usually small and, consequently, the posterior distribution of concentrated on values close to 1. This discourages state transitions and may slow down the mixing of the Markov chain. To increase the transition rate during equilibration, we therefore modified the transition probabilities as follows: Let T denote the total length of the equilibration phase, and let i
{1,...,T} denote the ith sample of the Markov chain during equilibration. Then, during equilibration, equation (15) is replaced by
|
Prediction
Recall that the proposed Bayesian method samples topology sequences from the joint posterior probability P(S|) = P(S1,..., SN|
), where St (t = 1,..., N) represents the topology at site t. To display the results graphically, we marginalize, for each site in turn, over all the remaining sites so as to return the marginal posterior probabilities P(St|
). These are then plotted, for each topology P(St = 1|
), P(St = 2|
), and P(St = 3|D), along the sequence alignment. Assigning each site t to the mode of the posterior probability P(St|
) gives a list of putative recombinant regions, identical to the output of RECPARS. This is a useful reduction of information that allows the comparison of classification scores, as shown later (see figure 12). Note, however, that the posterior probabilities P(St|
) contain further, additional information, as they also indicate the uncertainty of the prediction.
|
![]() |
Data |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Simulated Recombination
DNA sequences, 1000 bases long, were evolved along a 4-species tree, using the Kimura model of nucleotide substitution (Kimura 1980) with a transition-transversion ratio of 2. Two recombination events were simulated, as shown in figure 7. Topology 1 is the "true" topology, which applies to those parts of the alignment that are not affected by recombination. The four sequences are evolved along the interior branch and the first quarter of the exterior branches of a phylogenetic tree (top left). At this point, the subsequence between sites 201 and 400 in Strain 3 is replaced by the corresponding subsequence in Strain 1 (top right). The sequences then continue to evolve along the exterior branches until the branch length is 0.75 times the final exterior branch length (middle, left). This is followed by a second recombination event, where the subsequence between sites 601 and 800 in Strain 2 replaces the corresponding subsequence in Strain 3 (middle right). The sequences then continue to evolve along the exterior branches for the remaining length (bottom left). The resulting mosaic structure of the alignment is shown in figure 7, bottom right. In the main part of the alignment, Strain 3 is most closely related to Strain 4. However, in the region between sites 201 and 400, it is most closely related to Strain 1, and in the region between 601 and 800, it is most closely related to Strain 2 (bottom right). Thus, the first, more ancient, recombination event corresponds to a transition from Topology 1 into Topology 2. The second, more recent, recombination event corresponds to a transition from Topology 1 into Topology 3. This model simulates a realistic scenario where an ancestor of Strain 3 incorporates genetic material from ancestors of other extant strains, which in each case is followed by subsequent evolution. The simulations were repeated for a different mosaic structure, where the first recombinant region was extended by 100 nucleotides (region 201500), and the second region was shortened by 100 nucleotides (region 701800). The mosaic structures are shown in the top-left subfigure of figures 811. For each mosaic structure, we repeated the simulation with three different tree heights, where the tree height is defined as half the sum of all the branch lengths between the two strains that are farthest apart. Note that as the tree height becomes smaller, the numbers of polymorphic and topology-defining sites decrease. This reduces the information content in the alignment and makes the detection of recombinant regions more difficult.
|
|
|
|
|
Maize
Gene conversion is a process equivalent to recombination, which occurs in multigene families, where a DNA subsequence of one gene can be replaced by the DNA subsequence from another. Indication of gene conversion between a pair of maize actin genes has been reported by Moniz de Sa and Drouin (1996), who showed that the Maz56 and Maz63 genes had a gene conversion covering the first 875 nucleotides of their coding regions. We applied our algorithm to a multiple alignment of the following four maize sequences (1008 nucleotides long): Maz56 (GenBank/EMBL accession number U60514), Maz63 (U60513), Maz89 (U60508), and Maz95 (U60507). The sequences were aligned with ClustalW (Thompson, Higgins, and Gibson 1994), using the default parameter settings. We define the states of the HMM as follows: Topology 1: [(Maz56,Maz63),(Maz89,Maz95)]; topology 2: [(Maz56,Maz89),(Maz63,Maz95)]; topology 3: [(Maz56,Maz95),(Maz63,Maz89)]. (See figure 15, top, for an illustration.)
|
Neisseria
One of the first indications for sporadic recombination was found in the bacterial genus Neisseria (Maynard Smith 1992). We chose a subset of the 787-nucleotide Neisseria argF DNA multiple alignment studied by Zhou and Spratt (1992), where we selected the four strains (1) N. gonorrhoeae [X64860], (2) N. meningitidis [X64866], (3) N. cinera [X64869], and (4) N. mucosa [X64873] (GenBank/EMBL accession numbers are in brackets). Zhou and Spratt (1992) found two anomalous, or more diverged regions in the DNA alignment, which occur at positions t = 1202 and t = 507538 (Note that Zhou and Spratt (1992) used a different labeling scheme, with the first nucleotide at t = 296, and the last one at t = 1082.) In the rest of the alignment, N. meningitidis clusters with N. gonorrhoeae (defined as topology St = 1 in our HMM), whereas between t = 1 and t = 202, they found that N. meningitidis is grouped with N. cinera (defined as state St = 3). Zhou and Spratt (1992) suggested that the region t = 507538 might be the result of rate variation. The situation is illustrated in figure 16.
|
![]() |
Simulations |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
TOPAL was applied with two different window lengths: W = 100 and W = 200. We used Version 2 of the program (McGuire and Wright 2000) with the default options except for the nucleotide substitution model, where we replaced the (default) Jukes-Cantor model with the Kimura model. The transition-transversion ratio was estimated with Puzzle (Strimmer and von Haeseler 1996).
For PLATO, we used Version 2.11 of the program developed by Grassly and Holmes (1997), available from http://evolve.zoo.ox.ac.uk/software/Plato/Plato2.html, and we varied the window length between five bases and half the sequence length. The reference tree was obtained with maximum likelihood from the whole DNA sequence alignment, using DNAML of the PHYLIP package (available from http://evolution.genetics.washington.edu/phylip.html) or Puzzle. On the synthetic data, we used a uniform substitution rate, and set the transition-transversion ratio to the known true value. On the real data, we used two models of rate heterogeneity: (1) a uniform rate and (2) gamma distributed rates with five rate categories. The respective PLATO commands are these: (1) plato -mHKY -tTAU and (2) plato -g5 -aALPHA -mHKY -tTAU, where TAU and ALPHA are the transition-transversion ratio and the alpha parameter of the discrete gamma distribution, respectively, both estimated with Puzzle.
The application of HMM-heuristic was similar to the study by McGuire, Wright, and Prentice (2000). We chose the Felsenstein 84 model of nucleotide substitution, estimating the transition-transversion ratio with maximum likelihood (using Puzzle), and estimating the nucleotide frequencies A,
C,
G,
T from the data according to
X = NX/
X'NX', where NX is the number of occurrences of nucleotide X
{A, C, G, T}. For each topology in turn, we optimized the branch lengths of the corresponding phylogenetic tree with maximum likelihood on the whole alignment, using the program DNAML of the PHYLIP package. As opposed to McGuire Wright, and Prentice (2000), we did not restrict the optimization to subsets of the alignments, since the subset size is a parameter that cannot be properly optimized within the framework of this approach.
For training the HMM-ML, we followed Husmeier and Wright (2001) and optimized the recombination parameter and all the branch lengths simultaneously in a maximum likelihood sense with the EM algorithm, using the MATLAB programs written by the authors (available from http://www.bioss.sari.ac.uk/
dirk/My_software.html).
Finally, the proposed Bayesian MCMC scheme, HMM-Bayes, was applied as follows. We used the Felsenstein 84 model of nucleotide substitution, with a prior on the parameters as described earlier under Method: A Bayesian Approach. For the prior on
[0,1], we chose the three distributions shown at the bottom of figure 6, which incorporate our knowledge that P(
) must be skewed toward the right of the interval [0,1]. This is so because
= 0.5 means that, on average, every second site is subject to recombination, and we know that recombination events are much rarer, that is, that
>> 0.5. We found that for the chosen priors, the simulations gave very similar results. Recall that the posterior probability is the product of the prior and the likelihood. Whereas the first term is constant, the second term scales like N, the number of sites in the DNA sequence alignment. Consequently, for a sufficiently long alignment and a reasonable prior, the weight of the likelihood is considerably higher than that of the prior, and variations of the latter have therefore only a marginal influence on the prediction. This was borne out in the simulations, as shown in figure 21 and discussed in the Appendix.
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
When the tree height is decreased to 0.1, neither RECPARS nor HMM-Bayes predicts the mosaic structure of the alignment correctly. RECPARS finds only one recombinant region, which for the first alignment is even badly misplaced (fig. 8, bottom right). HMM-Bayes detects both recombinant regions and even locates them rather accurately, but it misclassifies the topology change for one of these regions (fig. 9, bottom right; fig. 11, bottom right). This is most likely a consequence of the fact that for small tree heights, the number of mutations and, consequently, the number of polymorphic sites is small. Thus, there is less information in the data, and any inference is inevitably less accurate.
For a more quantitative comparison between RECPARS and HMM-Bayes, recall that the detection of recombination is basically a classification problem: Each site in the sequence alignment is assigned to one of the three possible tree topologies. For RECPARS, this is done directly. For HMM-Bayes, it is done by assigning each site to the mode of the posterior probability. We use two criteria to rate the performance of the methods: The sensitivity, which is the percentage of correctly classified recombinant sites, and the specificity, which measures the percentage of correctly classified non-recombinant sites. Comparing the performance of RECPARS and HMM-Bayes across all simulations, shown in figure 12, we found that HMM-Bayes gives a consistent and significant improvement on RECPARS in the accuracy of locating and classifying the recombinant regions, as indicated by a systematically increased sensitivity score.
Figure 13 compares the predictions of RECPARS and HMM-Bayes on the real-world DNA sequence alignments. First, it can be seen that the predictions with RECPARS depend sensitively on the recombination-mutation cost ratio Crec/Cmut. The best predictions show a qualitative agreement with the predictions from the literature, but note that selecting the best value of Crec/Cmut in this way, with the benefit of hindsight, is not possible in real applications where the location of the recombinant regions is not known beforehand. Also note that setting Crec/Cmut = 1.5, as suggested by Wiuf Christensen, and Hein (2001), is not guaranteed to give reliable results, as can be seen for the hepatitis B sequence alignment (fig. 13, middle left), where this parameter setting leads to an over-tessellated mosaic structure with false positive predictions of spurious recombinant regions.
|
The subfigures on the right of figure 13 show the predictions with HMM-Bayes. Note that these predictions do not depend on any heuristic tuning of parameters, since all the parameters are properly inferred from the data by sampling them from the posterior distribution (12) with MCMC.
The mosaic structures and the breakpoints predicted with HMM-Bayes for the maize and the hepatitisB sequence alignments are in agreement with the results from the literature (fig. 13, top and middle). Also, the location of the predicted recombinant region in Neisseria accords with the prediction by Zhou and Spratt (1992). Their second anomalous region, shown by the gap in figure 13, bottom left, is modeled by a distributed representation with HMM-Bayes; this reflects the uncertainty about the nature of this region. The only difference between the literature and the prediction with HMM-Bayes is the presence of a further peak of P(St = 3|) at the end of the alignment (see fig. 13, bottom right). Because this region is very short (less than 20 bases long) and therefore difficult to detect with other methods, we assume that we have found a true recombinant region that has not been discovered before.
Comparison with the Heuristic HMM
Figure 14 shows the results obtained on one of the synthetic DNA sequence alignments (mosaic structure A, tree height 0.2). The subfigure on the left shows the prediction of P(St|) with HMM-heuristic. For this method, the recombination parameter
has to be specified in advance, and we have set it to the mean of the prior distribution:
= 0.8. It can be seen that the overall pattern of the posterior probabilities is correct, showing an increase for topology St = 2 in the region 200 < t < 400, and an increase for topology St = 3 in the region 600 < t < 800. However, the signals are very noisy, and an automatic classification based on the mode of the posterior probability would incur a high proportion of erroneously predicted topology changes. The Bayesian scheme, HMM-Bayes, shown on the right of figure 14, overcomes this shortcoming. The predicted state transitions coincide with the true breakpoints. The posterior probabilities for the topologies, P(St|
), are close to zero or one, which significantly reduces the noise. By assigning each site in the alignment to the mode of P(St|
), the tree topologies and the mosaic structure of the alignment are predicted correctly. The mean and the standard deviation of the posterior distribution P(
|
) are
posterior = 0.992 and
posterior = 0.004. With four breakpoints in an alignment of length 1000 bases, the true value for the recombination parameter is
= 0.996, which deviates from the prediction by only 0.4%.
|
Comparison with Maximum Likelihood
On the maize and hepatitis B sequence alignments, the predictions with HMM-ML and HMM-Bayes were practically indistinguishable (graphs not included in this article). The difference between the two approaches is in the confidence that we have in the prediction. The prediction with HMM-ML, P(St|, w,
,
), is dependent on the model parameters w,
,
, which were fitted with maximum likelihood and might therefore be subject to over-fitting. This calls for an independent statistical significance test, using, e.g., parametric bootstrapping. This approach is extremely computationally expensive, as discussed by Larget and Simon (1999). The prediction with HMM-Bayes, on the other hand, is only dependent on the data, P(St|
), because the model parameters have been integrated out. This means that the prediction is consistent within the Bayesian framework and does not require an independent significance test (assuming sufficient convergence of the MCMC simulation).
Figure 16 shows the prediction of P(St|) on the Neisseria sequence alignment, where the subfigure in the bottom left was obtained with HMM-ML, and the subfigure in the bottom right with HMM-Bayes. Both methods agree in predicting a sharp transition from topology St = 3 to St = 1 at breakpoint t = 202, which is in agreement with the findings by Zhou and Spratt (1992). Both methods also agree in predicting a short recombinant region of the same topology change at the end of the alignment. However, while the prediction with HMM-ML could have been the result of over-fitting, HMM-Bayes, by integrating out the model parameters, is not susceptible to this fallacy. This corroborates the prediction with HMM-ML, in the same way as a frequentist hypothesis test, and thus suggests that we have discovered a new recombinant region undetected by Zhou and Spratt (1992).
Differences between the predictions of HMM-ML and HMM-Bayes are found in the middle of the alignment, where two further breakpoints occur at sites t = 506 and t = 537. This is in agreement with Zhou and Spratt (1992). However, while Zhou and Spratt (1992) suggested that the region between t = 506 and t = 537 might be the result of rate heterogeneity, HMM-ML predicts a recombination event with a clear transition from topology St = 1 into St = 2. This seems to be the result of over-fitting: Because the distribution of the nucleotide column vectors yt in the indicated region is significantly different from the rest of the alignment, modeling this region with a different hidden state can increase the likelihood, although the hidden state itself (topology St = 2) might be ill-matched to the data. This deficiency is redeemed with HMM-Bayes, whose prediction is shown in figure 16, bottom right. The critical region between sites t = 506 and t = 537 is again identified, indicated by a strong drop in the posterior probability for the dominant topology, P(St = 1|). However, the uncertainty in the nature of this region is indicated by a distributed representation, where both alternative hidden states, St = 2 and St = 3, are assigned a significant probability mass. With the prediction of this uncertainty, HMM-Bayes also indicates a certain model misspecification inherent in the current schemethe absence of hidden states for representing different evolutionary ratesand thus avoids the over-fitting incurred when applying HMM-ML.
To test the conjecture that the Bayesian approach is more robust to over-fitting than maximum likelihood, we carried out two further simulation studies. In the first study, we simulated the effect of rate heterogeneity (fig. 17, top). We simulated the evolution process along the branches of a four-species phylogenetic tree with the Kimura model, as described earlier under Data, but reduced the rate of nucleotide substitution by a factor of 1/5 in the center region, between sites t = 301 and t = 600. The results are shown in the bottom of figure 17. The maximum likelihood approach clearly over-fits and erroneously predicts a recombinant region. The Bayesian MCMC approach is only slightly affected in its prediction of the posterior probabilities: the graph of P(St = 1|) shows a small dent. This does not lead to classifying any site in the alignment as a topology different from St = 1, though, hence no recombination is predicted, and over-fitting is avoided.
|
To this end, we first simulated a recombination process according to the method described in the Data section, and then simulated observational noise, corresponding to wrong base calls or typing errors in DNA sequencing, by randomly replacing 20% of the columns in the alignment by those of a second alignment that was unaffected by recombination. The process is illustrated in the top of figure 18. The consequence is that the uncertainty in the prediction of the recombinant region should increase, and the recombination event should be predicted with a probability less than 1. Figure 18 (bottom left) shows the prediction with maximum likelihood. The location of the breakpoint (in the middle of the alignment at site 200) is fairly accurate, and the nature of the mosaic structure has been identified correctly in that a recombination event corresponding to a change from topology 1 (left) into topology 3 (right) is predicted. However, this prediction is overconfident in that the topology change is predicted with probability 1, which ignores the effect of typing errors. When applying the Bayesian approach, we found that the MCMC trajectories converged to two different semiconverged states, one corresponding to a clear recombination event and the other to the absence of any recombination. While this bimodality indicates insufficient mixing of the Markov chain, it also points to the intrinsic uncertainty in the prediction problem, caused by the introduction of typing errors (we did not observe any bimodality in the absence of typing errors). When combining several MCMC trajectories started from different initializations, we obtained the prediction shown in figure 18, bottom right, which, as opposed to the maximum likelihood approach, indicates the intrinsic uncertainty by predicting a recombination event with probability less than one.
|
|
On comparing these methods with HMM-Bayes (figs. 13 and 14), we find that the latter gives more accurate predictions than PLATO and more precise locations of the breakpoints than TOPAL, with the further advantage that all parameters are estimated from the data, and no arbitrary window parameter has to be chosen in advance.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Note that our approach focuses on topology changes rather than recombination in general. If a recombination event only changes the branch lengths of a tree, without affecting its topology, it will not be detected. However, the main motivation for our method is a prescreening of an alignment for topology changes as a crucial prerequisite for a consistent phylogenetic analysis. Most standard phylogenetic methods are based on the implicit assumption that the given alignment results from a single phylogenetic tree. In the presence of recombination they would therefore infer some "average" tree. If the recombination event leads to different trees with the same topology but different branch lengths, then the wrong assumption of having only one tree is not too dramatic: The topology will still be correct (as it has not been changed by the recombination event), and the branch lengths will show some average value. This is still reasonable, because branch lengths are continuous numbers, and the mean of continuous numbers is well defined. If the recombination event, however, leads to a change of the topology, inferring an average tree is no longer reasonable: Tree topologies are cardinal entities, for which an average value is not defined. In fact, it is well known that in this case the resulting average tree will be in a distorted "limbo" state between the dominant and recombinant trees, which renders the whole inference scheme unreliable. Thus, by focusing on recombination-induced topology changes we focus on those events that cause the main problems with standard phylogenetic analysis methods.
We should further point out that our approach is not about estimating recombination rates, but rather about inferring the mosaic structure of a particular sequence alignment. The proposed method has a parameter that might be confused with a recombination rate: the parameter , which should actually be referred to as a recombination probability (the probability that on moving from the nth site to the (n + 1)th site in the sequence alignment no topology change occurs). This parameter corresponds to the recombination cost in RECPARS; it is not a recombination rate in population genetics terms. By estimating this parameter from the datathat is, the sequence alignmentthe prediction performance of the algorithm improves considerably, thereby overcoming the main shortcomings of RECPARS and HMM-heuristic, where this parameter has to be chosen arbitrarily in advance. We do not claim, however, that
has any meaning in itself: it is a parameter whose proper estimation from the sequence alignment improves the detection of phylogenetic topology changes, and this is what we are interested in.
A limitation of the proposed method is that the states of the HMM represent only different tree topologies but do not allow for different rates of evolution. A way to redeem this deficiency is to employ a factorial hidden Markov model (FHMM), as discussed by Ghahramani and Jordan (1997), and to introduce two different types of hidden states: one representing different topologies, the other representing different evolutionary rates. This effectively combines the method of the present paper with the approach of Felsenstein and Churchill (1996). While parameter estimation with maximum likelihood would lead to a considerable increase of the computational complexity (Ghahramani and Jordan, 1997), it seems that this increase will be less dramatic for the MCMC method, because the Gibbs sampling scheme of (eq. 17) can be applied to both types of hidden states separately. A detailed investigation of this approach is the subject of future research.
As mentioned earlier under Method: Background and Earlier Approches, the method presented here is restricted to DNA sequence alignments with small numbers of taxa, and our current software implementation can only deal with alignments of four sequences. This is because each possible tree topology constitutes a separate state of the HMM. In practical applications, our method has therefore to be combined with a fast low-resolution preprocessing method, like RECPARS or TOPAL: In a first, preliminary analysis, apply RECPARS or TOPAL to identify putative recombinant sequences and their approximate mosaic structures. In a second, subsequent step, apply HMM-Bayes to the tentative sets of four sequences that result from the previous step. This will allow a more accurate analysis of the mosaic structure than can be obtained from a window method like TOPAL with its inherently low resolution, and it will resolve contradictions that are likely to arise from different (arbitrary) settings of the parsimony cost parameters of RECPARS. In general, after identifying a small set of putative recombinant sequences with any fast low-resolution method, the exact nature of the recombination processes and the location of the breakpoints can be further investigated with the high-resolution method proposed in this article.
Note that, in principle, our method can deal with more than four sequences. All that is required is that the set of different candidate topologies, which constitute the states of the HMM, is sufficiently small (fewer than 10, say). Such a sparse set of candidate topologies can be obtained from the preliminary analysis with one of the lower-resolution methods. The proposed Bayesian HMM method can then be applied in a subsequent step, with the hidden states set to the topologies obtained from the preliminary analysis. This will, obviously, not be able to detect any topology changes not detected before, but it would still be likely to give an improvement on the low-resolution method of the previous step in that recombination breakpoints and the nature of the mosaic structure would be predicted more accurately.
A more ambitious goal would be to improve the methodology itself by introducing a more informative form of the transition probabilities between the topologies. In the current version, all changes into other topologies are equally likely, as expressed by (eq. 5). For four sequences with only three possible tree topologies, this seems to be a valid assumption. However, on increasing the number of sequences, this uniform prior on the topologies leads to a superexponential explosion of the parameter space. Choosing a more informative prior that, given a topology, is only nonzero for closely related topologiesas suggested by Hein (1993) and implemented in RECPARSmight offer a way to apply the proposed method to alignments of more than four sequences. This approach, however, has not yet been explored and will certainly offer an interesting topic for future research.
![]() |
Appendix |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Bollyky, P. L., A. Rambaut, P. H. Harvey, and E. C. Holmes. 1996. Recombination between sequences of hepatitis B virus from different genotypes,. J. Mol. Evol. 42:97-102.[ISI][Medline]
Casella, G., and E. I. George. 1992. Explaining the Gibbs sampler. Am. Statist. 46:167-174.[ISI]
Chib, S., and E. Greenberg. 1995. Understanding the Metropolis-Hastings Algorithm. Am. Statist. 49:327-335.[ISI]
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B39:1-38.
Felsenstein, J. 1981. Evolution trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[ISI][Medline]
Felsenstein, J.. 1988. Phylogenies from molecular sequences: inference and reliability,. Annu. Rev. Genet. 22:521-565.[CrossRef][ISI][Medline]
Felsenstein, J., and G. A. Churchill. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93-104.[Abstract]
Ghahramani, Z., and M. I. Jordan. 1997. Factorial hidden Markov models,. Machine Learn. 29:245-273.[CrossRef][ISI]
Grassly, N. C., and E. C. Holmes. 1997. A likelihood method for the detection of selection and recombination using nucleotide sequences. Mol. Biol. Evol. 14:239-247.[Abstract]
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.[ISI]
Heckermann, D. 1999. A tutorial on learning with Bayesian networks. Pp. 301354 in M. I. Jordan, ed. Learning in Graphical Models, Adaptive computation and machine learning. MIT Press, Cambridge, Massas.
Hein, J. 1993. A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol. 36:396-405.[ISI]
Husmeier, D., and F. Wright. 2001. Detection of recombination in DNA multiple alignments with hidden Markov models. J. Comput. Biol. 8:401-427.[CrossRef][ISI][Medline]
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.[ISI][Medline]
Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi. 1983. Optimization by simulated annealing. Science 220:671-680.[ISI]
Larget, B., and D. L. Simon. 1999. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16:750-759.
Mau, B., M. A. Newton, and B. Larget. 1999. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55:1-12.[ISI][Medline]
Maynard Smith, J. 1992. Analyzing the mosaic structure of genes. J. Mol. Evol. 34:126-129.[ISI][Medline]
McGuire, G., and F. Wright. 2000. TOPAL 2.0: improved detection of mosaic sequences within multiple alignments. Bioinformatics 16:130-134.[Abstract]
McGuire, G., F. Wright, and M. J. Prentice. 1997. A graphical method for detecting recombination in phylogenetic data sets. Mol. Biol. Evol. 14:1125-1131.[Abstract]
McGuire, G., F. Wright, and M. J. Prentice.. 2000. A Bayesian method for detecting recombination in DNA multiple alignments. J. Comput. Biol. 7:159-170.[CrossRef][ISI][Medline]
Moniz de Sa, M., and G. Drouin. 1996. Phylogeny and substitution rates of angiosperm actin genes. Mol. Biol. Evol. 13:1198-1212.[Abstract]
Neal, R. M. 1996. Bayesian learning for neural networks, vol. 118, Lecture notes in statistics. Springer-Verlag, New York.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77:257-286.[CrossRef][ISI]
Robert, C. P., G. Celeux, and J. Diebolt. 1993. Bayesian estimation of hidden Markov chains: A stochastic implementation. Statist. Prob. Lett. 16:77-83.[CrossRef][ISI]
Rubinstein, R. Y. 1981. Simulation and the Monte Carlo method. John Wiley & Sons, New York.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964-969.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.[Abstract]
Wiuf, C., T. Christensen, and J. J. Hein. 2001. A simulation study of the reliability of recombination detection methods. Mol. Biol. Evol. 18:1929-1939.
Yang, Z., and B. Rannala. 1997. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method,. Mol. Biol. Evol. 14:717-724.[Abstract]
Zhou, J., and B. G. Spratt. 1992. Sequence diversity within the argF, fbp and recA genes of natural isolates of Neisseria meningitidis: interspecies recombination within the argF gene,. Mol. Microbiol. 6:2135-2146.[ISI][Medline]