Department of Zoology and Department of Statistics, University of Oxford, Oxford, England; and FMI, Physics and Mathematics Department, Mid Sweden University, Sundsvall, Sweden
In Strimmer and Moulton (2000), we described a method for computing the likelihood of a set of sequences assuming a phylogenetic network as an evolutionary hypothesis. That approach relied on converting a given graph into a directed graphical model or stochastic network from which all desired probability distributions could be derived. In particular, we investigated how to compute likelihoods using split-graphs (Huson 1998). However, in the presence of recombination, split-graphs may not provide an appropriate choice of the underlying graph. In this letter, we propose basing the stochastic network on an ancestral recombination graph (ARG) (Hudson 1983; Griffiths and Marjoram 1996, 1997). We show that our approach using directed graphical models extends in a straightforward fashion to ARGs, and we outline the computation of their likelihoods. In particular, we provide an example of an ARG whose likelihood is greater than that of a competing nonnested tree, even though the ARG has a smaller number of free parameters.
Statistical phylogenetic analysis requires a model for the evolutionary relationships between the sequences in a given data set. For this purpose, directed acyclic graphs (DAGs) are suitable for describing the dependencies between the sequences. Several subclasses of DAGs can be easily derived from frequently used graphs such as trees, phylogenetic networks (Bandelt 1994), and split-graphs (Huson 1998). The latter two classes exhibit network-like structures for which the tree is a special case. Net-like models for sequence evolution can be particularly attractive when modeling statistical dependencies among sequences in the presence of recombination, where evolution is clearly nontree-like. In order for a DAG-based phylogeny to provide a realistic model of recombination, we propose that the DAG should have at least the following properties:
Under these simple premises, it appears that split-graphs, even though they may give a good indication of when recombination is occurring, may not provide a suitable underlying graph for a stochastic network. For example, the number of recombination nodes in a DAG-based split-graph cannot be freely selected, and split-graphs are always generated as subgraphs of hypercubes, which can lead to somewhat restrictive constraints on any resulting DAG-based probabilistic model. However, split-graphs were not designed specifically with recombination in mind; they graphically portray incompatibilities in the data which may (or may not) be a consequence of recombination.
Taking the above considerations into account, another variant of net-like graphs, ARGs, may provide a more appropriate DAG-based phylogeny. These rooted graphs provide a way to represent linked collections of clock-like trees by a single network, and were originally developed in population genetics to describe stochastic processes generating hypothetical genealogies for a set of sequences subject to recombination (Hudson 1983; Griffiths and Marjoram 1996, 1997; Wiuf and Hein 1999). In addition to their use in coalescent simulations, they can also be employed as stand-alone models for sequence phylogeny. We suggest that ARGs can offer a useful basis for the statistical analysis of sequences whose evolution is net-like, and we demonstrate this by reanalyzing the HTLV data set we considered in Strimmer and Moulton (2000).
In figure 1, a hypothetical ARG on four taxa is presented that has five tree nodes and two recombination nodes. If a genealogy contains no recombination nodes, then the ARG will degenerate to a rooted clock-like tree. Note that the ARG in figure 1 contains four embedded subtrees containing both the root and the tip nodes, which are pictured on the right of the figure. In general, if an ARG has r recombination nodes, it will contain 2^r such subtrees, and we will call these the canonical subtrees contained in the ARG. ARGs are parameterized by the heights of the tree nodes and the root node and by a breakpoint at each recombination node that specifies which part of the recombinant sequence represented by the node is derived from each of the parent sequences. Therefore, the number of free parameters for an ARG is precisely the number of internal nodes, so the ARG shown in figure 1 has seven parameters.
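The counting argument above can be illustrated with a short Python sketch. It enumerates the canonical subtrees of an ARG by choosing one parent at every recombination node, and counts the free parameters as the number of internal nodes. The ARG used here is a hypothetical four-taxon example with five tree nodes (including the root) and two recombination nodes; it is not claimed to be the graph of figure 1, and the node labels and function names are assumptions made for illustration.

```python
from itertools import product

# Hypothetical ARG: each node maps to its list of parents.
# Tree nodes and tips have one parent, recombination nodes have two,
# and the root has none.
ARG = {
    "A": ["t1"], "B": ["r1"], "C": ["r2"], "D": ["t3"],
    "r1": ["t1", "t2"],   # recombination node above tip B
    "r2": ["t2", "t3"],   # recombination node above tip C
    "t1": ["t4"], "t2": ["t4"], "t3": ["root"], "t4": ["root"],
    "root": [],
}
LEAVES = {"A", "B", "C", "D"}

def canonical_subtrees(parents):
    """Yield one child -> parent map per choice of a single parent
    at every recombination node (2^r maps for r recombination nodes)."""
    recomb = sorted(n for n, ps in parents.items() if len(ps) == 2)
    for choice in product(*(parents[n] for n in recomb)):
        chosen = dict(zip(recomb, choice))
        yield {n: chosen.get(n, ps[0]) for n, ps in parents.items() if ps}

def topology(parent_of, root, leaves):
    """Nested-tuple topology of an embedded tree, dropping branches without
    sampled descendants and suppressing nodes left with a single child."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    def build(node):
        if node in leaves:
            return node
        subs = [build(c) for c in sorted(children.get(node, []))]
        subs = [s for s in subs if s is not None]
        if not subs:
            return None
        return subs[0] if len(subs) == 1 else tuple(subs)

    return build(root)

def n_parameters(parents, leaves):
    """Free parameters = number of internal nodes (node heights plus
    one breakpoint per recombination node)."""
    return sum(1 for n in parents if n not in leaves)

if __name__ == "__main__":
    for tree in canonical_subtrees(ARG):
        print(topology(tree, "root", LEAVES))
    print("free parameters:", n_parameters(ARG, LEAVES))
```

Two of the embedded trees in this hypothetical example share a topology but differ in which internal nodes, and hence which node heights, they use, so the enumeration is over parent choices rather than over distinct topologies.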
We now reanalyze the HTLV data set presented in Strimmer and Moulton (2000). In figure 2, an ARG representing a possible history for this data set is shown, where sequence L76054 is a putative recombinant. This ARG was obtained by "gluing together" two tree topologies obtained from a standard breakpoint analysis using the diversity plot (Robertson, Beaudoing, and Claverie 1999). The diversity plot also gave the estimate for the breakpoint given in figure 2. Optimizing the five node heights and employing the same substitution model as in Strimmer and Moulton (2000), a log likelihood (log L) of -1,496.46 was obtained using the breakpoint model. Intriguingly, this likelihood is greater than that of the maximum-likelihood tree (log L = -1,505.88), which has seven free parameters (branch lengths), even though the ARG has only six parameters (five node heights and one breakpoint). Note that the unconstrained tree is not nested within the ARG. Also note that for each canonical subtree in the ARG, a statistical test could not reject a molecular clock for the respective parts of the sequences, which supports the use of an ARG-based phylogeny in this example. It also confirms that incorporating recombination in a sequence analysis can reveal clock-like evolution that would otherwise have remained hidden (Schierup and Hein 2000).
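As an illustration of how a breakpoint-model likelihood of this kind can be assembled, the following Python sketch sums per-site log likelihoods, using one canonical subtree for sites to the left of the breakpoint and the other for sites to its right. It is a sketch under stated assumptions, not the implementation used for the HTLV analysis: the Jukes-Cantor substitution model, the three-taxon alignment, the node heights, and all function names are hypothetical choices made for brevity.

```python
import math

BASES = "ACGT"

def jc69(t):
    """Jukes-Cantor probabilities (no change, specific change) along a
    branch of length t, measured in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e

def partial(node, site):
    """Felsenstein pruning: likelihood of the data below `node` for each
    base at `node`. A tip is a taxon name (str); an internal node is a
    tuple of (child, branch_length) pairs."""
    if isinstance(node, str):
        return [1.0 if site[node] == b else 0.0 for b in BASES]
    out = [1.0] * 4
    for child, t in node:
        same, diff = jc69(t)
        below = partial(child, site)
        total = sum(below)
        for i in range(4):
            out[i] *= same * below[i] + diff * (total - below[i])
    return out

def site_loglik(tree, site):
    """Log likelihood of one alignment column, uniform base frequencies."""
    return math.log(sum(0.25 * x for x in partial(tree, site)))

def breakpoint_loglik(tree_left, tree_right, alignment, breakpoint):
    """Sites with index < breakpoint evolve on tree_left, the rest on
    tree_right; sites are treated as independent given the trees."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for k in range(n_sites):
        site = {name: seq[k] for name, seq in alignment.items()}
        tree = tree_left if k < breakpoint else tree_right
        total += site_loglik(tree, site)
    return total

# Hypothetical clock-like canonical subtrees in which B is the recombinant:
# B groups with A left of the breakpoint and with C right of it.
tree_left = (((("A", 0.1), ("B", 0.1)), 0.1), ("C", 0.2))
tree_right = (("A", 0.2), ((("B", 0.05), ("C", 0.05)), 0.15))
alignment = {"A": "ACGTACGT", "B": "ACGTACTT", "C": "ACGAACTT"}

if __name__ == "__main__":
    print(breakpoint_loglik(tree_left, tree_right, alignment, breakpoint=4))
```

In a full analysis the node heights would be optimized numerically, and the branch lengths of both canonical subtrees would be derived from the shared node heights of the ARG rather than specified independently as they are here.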
Reconstruction of the ancestral history of a set of sequences subject to recombination is difficult (see, e.g., Hein [1993], where a parsimony approach is taken). Thus, it is expected that inferring an ARG for a given data set will be just as difficult, although heuristic approaches, such as gluing together trees as we did in the example above, may deserve some attention. However, ARGs also impose some implicit constraints on the sequences (such as clock-likeness) that may not be valid for all data sets. Therefore, it seems that the "best" net-like model for sequence evolution under recombination will probably be some relaxed variant of the ARG.
Acknowledgements
We thank Arndt von Haeseler and Dirk Metzler for stimulating questions concerning Strimmer and Moulton (2000), and the referees and Michael Hendy for valuable comments. K.S. also thanks Mid Sweden University for its hospitality during the visit in which this letter was completed. This work was supported by an Emmy-Noether-Fellowship of the DFG to K.S., by BBSRC grant 43/MMI09788 and the Carlsberg Foundation, Denmark (C.W.), and by a grant from the Swedish Natural Science Research Council to V.M.
Footnotes
1 Keywords: ancestral recombination graph, likelihood-based sequence analysis, Bayesian network, phylogeny, split-graph
2 Address for correspondence and reprints: Vincent Moulton, FMI, Physics and Mathematics Department, Mid Sweden University, S 851-70 Sundsvall, Sweden. E-mail: vince@dirac.fmi.mh.se
Literature Cited
Bandelt, H.-J. 1994. Phylogenetic networks. Verh. Naturwiss. Vereins Hamburg 34:51–71.
Griffiths, R. C., and P. Marjoram. 1996. Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3:479–502.
Griffiths, R. C., and P. Marjoram. 1997. An ancestral recombination graph. Pp. 257–270 in P. Donnelly and S. Tavaré, eds. IMA volumes in mathematics and its applications, Vol. 87. Progress in population genetics and human evolution. Springer Verlag, Berlin.
Hein, J. 1993. A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol. 36:396–406.
Hudson, R. R. 1983. Properties of the neutral allele model with intergenic recombination. Theor. Popul. Biol. 23:183–201.
Huson, D. H. 1998. SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14:68–73.
Kuhner, M. K., J. Yamato, and J. Felsenstein. 2000. Maximum likelihood estimation of recombination rates from population data. Genetics 156:1393–1401.
Robertson, D. L., E. Beaudoing, and J. M. Claverie. 1999. HIV/SIV phylogenetic analysis page (http://igs-server.cnrs-mrs.fr/anrs/phylogenetics). Marseille, France.
Robertson, D. L., P. M. Sharp, F. E. McCutchan, and B. H. Hahn. 1995. Recombination in HIV-1. Nature 374:124–126.
Schierup, M. H., and J. Hein. 2000. Recombination and the molecular clock. Mol. Biol. Evol. 17:1578–1579.
Strimmer, K., and V. Moulton. 2000. Likelihood analysis of phylogenetic networks using directed graphical models. Mol. Biol. Evol. 17:875–881.
Wiuf, C., and J. Hein. 1999. The ancestry of a sample of sequences subject to recombination. Genetics 151:1217–1228.