*GSF-Forschungszentrum für Umwelt und Gesundheit, MIPS, am Max-Planck-Institut für Biochemie, Martinsried, Germany;
and
FMI, Physics and Mathematics Department, Mid Sweden University, Sundsvall, Sweden
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In recent years, it has become evident that many data sets from fields as diverse as epidemiology (Holmes, Worobey, and Rambaut 1999
), population genetics (Oota et al. 1999
), genomics (Lake, Jain, and Rivera 1999
), early evolution (Doolittle 1999
), etc., exist for which the evolution of the sequences cannot properly be represented by a tree. Due to recombination and horizontal gene transfer, multiple sources may contribute to a single gene. In these instances, the underlying structure of the data can be represented by phylogenetic networks (Bandelt 1994
; Dress, Huson, and Moulton 1996
). These networks extend the concept of trees and can be viewed in part as a combination of different treelike histories. Algorithms for constructing phylogenetic networks have been developed on the basis of both pairwise distances and parsimony criteria (Bandelt and Dress 1992, 1993
; Bandelt et al. 1995
). von Haeseler and Churchill (1993)
provided a framework for evaluating likelihoods of binary sequences related by networks. However, a general procedure for statistically assessing a phylogenetic network is still lacking.
Here we present a likelihood approach for arbitrary phylogenetic networks based on directed graphical models. Our approach generalizes the approach of Felsenstein (1981)
, including it as a special case when the network in question is a tree. This enables the direct comparison of both trees and networks on a statistically sound basis. Furthermore, given suitable network search procedures, it opens the way to inferring maximum-likelihood networks.
The rest of the paper is organized as follows. The next section gives an overview of directed graphical models, a powerful framework for describing complex systems of statistically correlated random variables. Next, we show how the approach of Felsenstein (1981)
can be conveniently restated in terms of graphical models of sequence evolution and how the likelihood of a tree can be obtained within this framework using a variant of Markov chain Monte Carlo sampling. Subsequently, the model is generalized to arbitrary phylogenetic networks. Finally, using this method, we investigate a viral data set.
![]() |
Directed Graphical Models |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
We now consider a simple (and often used) example of a Bayesian network (e.g., figure 2.1.2 in Jensen 1996
) to illustrate some of the relevant features of these objects. In figure 1
, four nodes representing the variables Cloudy (C), Sprinkler (S), Rain (R), and Wet Grass (W) are connected to form a so-called directed acyclic graph (DAG). Note that directed cycles are disallowed in a Bayesian network so as to prohibit the possibility of a node influencing itself through a chain of intermediate nodes. The variables C, R, S, and W take on one of two states (true, false). Also, each node is assigned a probability of observing a state at the node given the states of the parent nodes, which we give in a table next to the node. These probabilities form the basis for computing the joint probability
|
i.e., the probability of observing a combination of states assuming the specified probabilistic dependencies among the nodes. Thus, in this example, the probability of seeing wet grass for a cloudy sky with rain and no sprinkler on is 0.5 x 0.9 x 0.8 x 0.9 = 0.324.
Not all states in the network may be known or observable at any given time. In this case, marginal probabilities for subsets of known states are obtained from the joint probability by summing over all possible combinations of states for the unknown variables. For example, if we could not observe R or S, we would obtain the marginal probability
i.e., the likelihood of the observed states, taking into account dependences among hidden variables. In this way, the unknown variables R and S are eliminated from the distribution. Thus, in our example, we see that the likelihood of seeing wet grass when it is cloudy is (0.5 x 0.1 x 0.8 x 0.0) + (0.5 x 0.1 x 0.2 x 0.9) + (0.5 x 0.9 x 0.8 x 0.9) + (0.5 x 0.9 x 0.2 x 0.99) = 0.4221.
Computing the marginal probabilities generally takes exponential time in the number of unknown nodes, as the number of terms in equation (2) increases exponentially in the unknown variables. However, for special kinds of networks such as trees, there exist economies for computing these probabilities based on variable elimination (Horner's rule) that rearrange the individual terms in equation (2) for more efficient computation. Unfortunately, in the general case, no simple savings of this kind apply. Instead, it is usually necessary to apply approximations, for example, Monte Carlo simulations (Chickering and Heckerman 1997
; Jordan 1999
). Alternatively, it is also possible to convert the belief network into a secondary treelike structure (junction tree) and then compute probabilities working with this structure (Jensen 1996
). However, the difficulty of finding an optimal junction tree remains.
![]() |
Directed Graphical Models and Sequence Evolution on Trees |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The underlying DAG is provided by rooting the tree at a prescribed node and directing all edges in the tree away from this node (see fig. 2a ). The nodes of this DAG are considered as variables with four nucleotide states, and directed edges introduce dependences among these variables. In particular, each node of the network is assigned a nucleotide state x, y, z, ... , where states of internal nodes (which usually correspond to unknown sequences) are ancestral to the states observed at the external nodes (which correspond to observed sequences).
|
|
Luckily, it turns out that only a minor fraction of the terms in the likelihood summation (eq. 2) contribute significantly to the overall likelihood, thus making it possible to approximate the likelihood by summing only these terms. To detect the combination of states at the unknown nodes that contribute most to the likelihood, a search procedure that visits the most important combinations is necessary. Stochastic procedures like Gibbs sampling are suited for this purpose. Basically, we start from a random assignment of nucleotide states at the unknown nodes, computing its probability using the joint probability distribution (eq. 1). Subsequently, one of the internal nodes is selected at random, and its current state is replaced by drawing a new state from its distribution conditional on the current states of all the other nodes, again using equation (1). This replacement procedure is repeated as often as desired. It is possible to show (Gelfand and Smith 1990
) that in this way a series of combinations of assignments of states to the internal nodes is generated, such that their frequencies in the chain are proportional to their importances in the sum in equation (2). Thus, all important combinations of states are eventually visited. They are stored for computation of the overall likelihood.
Before going on, we should make it clear that although the above discussion on directed graphical models was based on DNA sequences, it can also be applied to protein sequences, making the appropriate changes where necessary.
![]() |
Likelihoods for Phylogenetic Networks |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Given a set of taxa, it is natural to consider bipartitions or splits of the taxa. For example, if the taxa A, B, C, D, E are analyzed, it may turn out that there is clear evidence (coming from, for example, distance or parsimony considerations) for separating the groups A, B, E and C, D. Phylogenetic networks are graphs that can be used to represent collections of splits derived from the sequences in question (Bandelt 1994
). If a set of splits derived from the taxa is compatible, the relationship among the taxa can be represented by a tree (Buneman 1971
), and so trees are in particular examples of phylogenetic networks. However, if this is not the case, then phylogenetic networks represent sets of contradicting splits by hypercubes. Note that this is in contrast to consensus procedures in which contradictory or incompatible splits are dropped in order to arrive at a tree structure (Margush and McMorris 1981
). As a result, phylogenetic networks can be rather complex objects falling somewhere between the extremes of being a tree or a hypercube.
In table 1
and figure 4
, we present some simple phylogenetic networks. Note that a collection of parallel edges in a given network corresponds to a split represented by the network. Moreover, parallel edges are all assumed to be assigned the same length. In case the set of splits derived from the data is circularas is quite often the case for molecular data when using the program SplitsTree (Huson 1998
)the relationship between the taxa can be visualized by a planar phylogenetic network, called an outer-planar graph (A. W. M. Dress and D. H. Huson, personal communication). Please refer to appendix b for a brief review of these terms.
|
|
We begin by turning our given outer-planar graph into a DAG. To do this, any vertex is selected as a root node, and all edges are directed away from this node (see fig. 2b ). It can be shown that in this way we produce a DAG in which (1) the root node is the only node that is a source, (2) the terminal nodes are the only nodes that are sinks, and (3) parallel edges are assigned the same direction. Note that from the biological point of view, it not only makes sense to have a unique root node, but assigning the same direction to parallel edges havingper definitionthe same length is also plausible. In particular, the direction of an edge indicates the flow of information from one sequence to another, and for a given split, information flows uniformly from one group of related sequences to the other.
We now explain how to assign local node probabilities to the DAG we have obtained. In a network, a node can have several direct parents. Therefore, we need to extend the assignment of local node probabilities given for a tree to the case of two or more parents (fig. 3c ). We present the case with two parents, the generalization to three or more being clear. We define the probability Pr(x | y, z) of observing state x given states y and z to be pyPyx(ky) + pzPzx(kz), where py, pz denote the prior probabilities that the node in state x is influenced by the direct parent in state y, z, respectively. This choice of assignment of probabilities has the advantage of generalizing easily to an arbitrary number of parents and of having a clear Bayesian interpretation.
The prior probabilities are also easily interpreted as recombination parameters. For a sequence of a given length, they denote the proportion of sites stemming from one of the parents. We note that these parameters are not derived from the underlying phylogenetic graph. In contrast, they are to be estimated from the sequence data. As a consequence, the maximum-likelihood value for a given data set related by a network is always at least as high as the largest likelihood value of a tree embedded in the network. This can be seen by appropriately setting the prior probabilities to 0 or 1. If estimation of priors is not feasible and no additional information on the preference of parents is available, we suggest using uniform priors (py = pz = ). This is reasonable not only because the strength of the correlation is already described by the branch lengths ky and kz, but also because the two parents share a common ancestor themselves, and the distances of the node in state x to that common ancestor through any path involving one of the parents are identical, as parallel edges in the network are of equal length. Thus, no path is preferred to arrive at the common ancestor.
The likelihood of a network can then be computed by marginalization of the joint probability (eq. 2), using, e.g., MCMC sampling to approximate the sum. Note that in contrast to the special case of an evolutionary tree (Felsenstein 1981
), even if a reversible model of substitution is used so that
xPxy(k) =
yPyx(k) applies, the likelihood can vary depending on the location of the root (a fact that follows from the construction of the local node probabilities and the sum used for computing the likelihood).
Before we close this section, we end with a word of caution. Even though it is stated above that we could possibly use this approach for computing the likelihood of arbitrary networks, this could, in general, lead to several problems. For example, even if we drop the constraint that parallel edges should have equal lengths for a phylogenetic network, problems with multiple maxima of the likelihood function can occur (data not shown). Moreover, in the general case, it is not at all clear how networks should be unambiguously rooted or how prior probabilities should be assigned to parent nodes.
![]() |
HTLV Data |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The list of the four best trees, along with their corresponding likelihood values, is shown in table 2 . The outer-planar network is displayed in figure 5 . Note that each of the four best trees is contained (as a subnetwork) in the network. As a basis for computing the likelihood for a network of the sequences and for inference of corresponding maximum-likelihood branch lengths and recombination parameters, the network shown in figure 6 was selected. This simplified network captures the essential nontreelikeness of the data while focusing on the strongest phylogenetic signal in the data. Its simplicity also helped us to perform the necessary computation and optimization in short time on a microcomputer.
|
|
|
As the likelihood of a phylogenetic network depends on the choice of root, for this example nodes other than R1 were also tested. For example, when the node R2 was chosen, the log likelihood of the network was -1,501.12. Thus, it makes sense to choose as a root some vertex that maximizes the likelihood. In fact, from an embedding of the investigated sequences in a larger network produced with SplitsTree, it was seen that node R1 was biologically preferable (data not shown).
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In addition, parameters for the model of substitution and rate variation can be optimized on the basis of a network, and ancestral sequences at internal nodes can be inferred (Yang, Kumar, and Nei 1995
). A series of statistical tests using the likelihood ratio (Huelsenbeck and Rannala 1997
) or parametric bootstrapping (Goldman 1993
) are applicable. The method could also be used to detect recombination events (Grassly and Holmes 1997
). Finally, the structure of the network can be directly inferred from the data using, for example, procedures similar to that of learning general Bayesian networks (Buntine 1996
; Krause 1998
; Jordan 1999
). Essentially, this amounts to devising search strategies similar to those employed for computing maximum-likelihood trees (Swofford et al. 1996
; Larget and Simon 1999
).
The current maximum-likelihood framework provides a unification of the theory of evolutionary trees with that of phylogenetic networks. Thus, networks no longer stand apart when a probabilistic treatment is necessary. Interpretation and understanding of phylogenetic networks is therefore greatly improved. However, to render analysis of large data sets possible, more work has to be done to improve the marginalization procedure and to explore suitable network search methods.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
1 Present address: Department of Zoology, University of Oxford, Oxford, England.
2 Keywords: maximum likelihood
phylogenetic network
graphical model
Bayesian network
evolutionary tree
Markov chain Monte Carlo sampling
3 Address for correspondence and reprints: Vincent Moulton, FMI, Physics and Mathematics Department, Mid Sweden University, S 851-70, Sundsvall, Sweden. E-mail: vince{at}dirac.nts.mh.se
![]() |
literature cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Bandelt, H.-J. 1994. Phylogenetic networks. Verh. Naturwiss. Ver. Hambg. 34:5171.
Bandelt, H.-J., and A. W. M. Dress. 1992. A canonical decomposition theory for metrics on a finite set. Adv. Math. 92:47105.[ISI]
. 1993. A relational approach to split decomposition. Pp. 123131 in O. Opitz, B. Lausen, and R. Klar, eds. Information and classification, Springer, Berlin.
Bandelt, H.-J., P. Forster, B. C. Sykes, and M. B. Richards. 1995. Mitochondrial portraits of human populations using median networks. Genetics 141:743753.
Buneman, P. 1971. The recovery of trees from measures of dissimilarity. Pp. 387395 in F. R. Hodson, D. G. Kendall, and P. Tautu, eds. Mathematics in the archeological and historical sciences. Edinburgh University Press, Edinburgh, Scotland.
Buntine, W. 1996. A guide to the literature on learning probabilistic networks from data. IEEE Trans. Knowl. Data Eng. 8:195210.[ISI]
Chickering, D. M., and D. Heckerman. 1997. Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning 29:181212.
Doolittle, W. F. 1999. Phylogenetic classification and the universal tree. Science 284:21242128.
Dress, A. W. M., D. H. Huson, and V. Moulton. 1996. Analyzing and visualizing sequence and distance data using SplitsTree. Discrete Appl. Math. 71:95109.[ISI]
Durbin, R., S. Eddy, A. Krogh, and G. Mitchison. 1998. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, England.
Edwards, A. W. F., and L. L. Cavalli-Sforza. 1964. Reconstruction of evolutionary trees. Pp. 6776 in V. H. Heywood and J. McNeill, eds. Phenetic and phylogenetic classification. Systematic Association, London.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum-likelihood approach. J. Mol. Evol. 17:368376.[ISI][Medline]
Felsenstein, J., and G. A. Churchill. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93104.[Abstract]
Gelfand, A. E., and F. M. Smith. 1990. Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 85:398409.[ISI]
Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182198.[ISI][Medline]
Grassly, N. C., and E. C. Holmes. 1997. A likelihood method for the detection of selection and recombination using nucleotide sequences. Mol. Biol. Evol. 14:239247.[Abstract]
Holmes, E. C., M. Worobey, and A. Rambaut. 1999. Phylogenetic evidence for recombination in dengue virus. Mol. Biol. Evol. 16:405409.[Abstract]
Huelsenbeck, J. P., and B. Rannala. 1997. Phylogenetic methods come of age: testing hypotheses in a evolutionary context. Science 276:227232.
Huson, D. H. 1998. SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14:6873.
Jensen, F. V. 1996. Introduction to Bayesian networks. UCL Press, London.
Jordan, M. I. ed. 1999. Learning in graphical models. MIT Press, Cambridge, Mass.
Kashab, R. L., and S. Subas. 1974. Statistical estimation of parameters in a phylogenetic tree using a dynamical model of the substitutional process. J. Theor. Biol. 47:75101.[ISI][Medline]
Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum-likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170179.[ISI][Medline]
Krause, P. J. 1998. Learning probabilistic networks. Knowl. Eng. Rev. 13:321351.[ISI]
Lake, J. A., R. Jain, and M. C. Rivera. 1999. Genomicsmix and match in the tree of life. Science 283:20272028.
Larget, B., and D. L. Simon. 1999. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16:750759.
Lauritzen, S. 1996. Graphical models. Oxford University Press, Oxford, England.
Liò, P., and N. Goldman. 1998. Models of molecular evolution and phylogeny. Genome Res. 8:1233-1244.
Margush, T., and F. R. McMorris. 1981. Consensus n-trees. Bull. Math. Biol. 43:239244.[ISI]
Neyman, J. 1971. Molecular studies of evolution: a source of novel statistical problems. Pp. 127 in S. S. Gupta and J. Yackel, eds. Statistical decision theory and related topics. Academic Press, New York.
Oota, H., N. Saitou, T. Matsushita, and S. Ueda. 1999. Molecular genetic analysis of remains of a 2,000-year-old human population in Chinaand its relevance for the origin of the modern Japanese population. Am. J. Hum. Genet. 64:250258.[ISI][Medline]
Pearl, J. 1988. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, San Mateo, California.
Russell, S., and P. Norvig. 1995. Artificial intelligence: a modern approach. Prentice Hall, Upper Saddle River, N.J.
Strimmer, K., and A. von Haeseler. 1996. Quartet-puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964969.
Swofford, D. L., G. J. Olsen, P. J. Wadell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Systematic biology. Sinauer, Sunderland, Mass.
Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512526.[Abstract]
Templeton, A. R. 1983. Phylogenetic inference from restriction endonucleoase cleavage site maps with particular reference to the evolution of humans and the apes. Evolution 37:221244.
von Haeseler, A., and G. A. Churchill. 1993. Network models for sequence evolution. J. Mol. Evol. 37:7785.[ISI][Medline]
Yang, Z., S. Kumar, and M. Nei. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:16411650.