*Department of Zoology, George S. Wise Faculty of Life Sciences,
and
Department of Computer Science, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Ramat Aviv, Israel
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Maximum likelihood (ML) is a general estimation paradigm which has been widely utilized in evolutionary studies, overcoming many shortcomings of parsimony (Felsenstein 1981
; Kishino, Miyata, and Hasegawa 1990
). ML-based methods for inference of ancestral sequences were devised by Yang, Kumar, and Nei (1995)
and Koshi and Goldstein (1996)
. A widely used variant of ML (the Bayesian approach) finds the most probable parameter set given the data. Applying this ML variant to ancestral-sequence reconstruction, one maximizes P(ancient amino acid sequences | contemporary sequences). Indeed, the method developed by Yang, Kumar, and Nei (1995)
is Bayesian. By using the most likely set of sequences at all the internal nodes to evaluate the number of synonymous versus nonsynonymous substitutions along branches, Zhang, Rosenberg, and Nei (1998)
inferred positive Darwinian selection after gene duplication in primate ribonuclease genes.
Yang (1995)
distinguished between two variants of ancestral ML reconstruction, termed "joint" and "marginal." To illustrate the difference between the two methods, let us consider the multifurcated tree in figure 1
. Suppose character state A can change to either B or C, and then to D ... I, according to the probabilities listed in figure 1
. If we are interested in the most likely pathway, then clearly the answer is set{A, C, I}, i.e., A
C
I, which has a probability of 0.45 x 0.9 = 0.405. If, on the other hand, we are interested in the most likely character state after one step, then B is the winner (although B does not even feature in the most likely set). As far as ancestral-sequence inference is concerned, we have an analogous situation. We may be interested either in a the set of all the hypothetical taxonomic unit (HTU) sequences (joint reconstruction) or in a specific HTU whose sequence we would like to estimate (marginal reconstruction). As our examples demonstrate, the results are not necessarily the same under the two methods of ML reconstruction.
|
Koshi and Goldstein (1996)
have developed a fast dynamic programming algorithm for marginal reconstruction, whose variants are implemented in existing software (Yang 1995
; Zhang and Nei 1997
). To date, no fast algorithm exists for joint reconstruction.
Here, we provide a new efficient algorithm for joint ML ancestral reconstruction. The running time of our algorithm scales linearly with the number of sequences and thus can be applied to a practically unlimited number of sequences. The new algorithm is based on the dynamic programming scheme (for a general description of this scheme, see, e.g., Cormen, Leiserson, and Rivest 1990
). Finally, we compare the performance of our method to those of extant algorithms by applying it to a data set of cytochrome b sequences.
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Consider, for example, an unrooted tree with five operational taxonomic units (OTUs), as in figure 2
. At a given sequence position, there are 203 possible reconstructions of the amino acids at the three internal nodes (A in node 6, A in node 7, A in node 8; or A in node 6, A in node 7, C in node 8; ... or Y in node 6, Y in node 7, Y in node 8). The aim of joint ML is to identify the triplet , maximizing P(
| data). That is, we want to find, from among all possible triplets, the one that maximizes
|
Since P(data) is the same for all triplets, it suffices to maximize P(data | ) x P(
). For the tree in figure 2
, we solve
The choice of node 8 as the root is arbitrary, because the model assumed is time-reversible (Felsenstein 1981
; Yang, Kumar, and Nei 1995
).
Complexity
The complexity issue is central to ancestral reconstruction. The solution sought in equation (2) is the maximum over all possible triplets. However, for larger trees, say, with h HTUs, one needs to maximize over the set of all possible combinations of h ancestral character states. As explained in the introduction, this set is very large, including 20h such combinations. Even if one considers only the c character states observed in the site, there are ch combinations of such states that are likely to appear (Zhang and Nei 1997
). Naive implementation, examining each such combination separately, is impractical for all but very modest values of h.
In the following, we present a fast new algorithm for ancestral reconstruction. This dynamic programming algorithm guarantees finding the set of sequences, one sequence per node, most likely to have been the progenitors of the extant sequences. The complexity of the algorithm is linear. That is, its running time per site is proportional to the number of internal nodes, and its efficiency enables its employment for any number of OTUs. Note that our algorithm maximizes the likelihood over all 20h possible combinations. For very small numbers of OTUs, this implies longer running times than those for existing programs, which check only ch combinations. However, since our algorithm is efficient, it does not require more than a few seconds (see Results).
Terminology
We root the tree at an arbitrary internal node. If node x is the direct descendant of node y, we say that y is the father of node x, and x is the son of node y. Thus, each nonroot node has a father. Each internal node has two sons, except for the root, which has three sons. OTUs have no sons. Also, each tree branch supports a subtree, which includes the father and son that it connects, together with all of the descendants of the son. For a demonstration of the terminology, consider the tree in figure 2
. Node 8 is the arbitrary root. Node 7 is the father of nodes 3 and 6. The branch connecting nodes 7 and 6 supports the subtree in the shaded region, including nodes 7, 6, 5, and 4.
A Linear-Time Algorithm for Joint Ancestral Reconstruction
Only alanine (A) and valine (V) are observed at OTUs of the tree in figure 2
. Thus, the chances of any other amino acid occurring at internal nodes are quite negligible. For the sake of clarity and compactness, in this section we shall only deal with the values of Pij(t) for i and j which are either alanine or valine. For simplicity, we further assume that all the branches of the phylogenetic tree in figure 2
are of the same length t and that the replacement probabilities Pij(t) for this value of t are given in table 1
. Note, however, that this table is merely a toy model for the sole purpose of demonstrating the algorithm. In practice, we use Pij(t) values from matrices published in standard literature (Dayhoff, Schwartz, and Orcutt 1978
; Jones, Taylor, and Thornton 1992
; Adachi 1995
) and use trees whose branch lengths have been estimated.
|
These concepts are best understood with an example. Consider node x = 6 in the tree in figure 2 . If one assumes its father, i.e., node 7, is assigned A, then the best reconstruction of the shaded subtree is given in figure 2b. In this case, L6(A) = 0.7 x 0.7 x 0.3 = 0.147, and C6(A) = A. By a similar argument, L6(V) = 0.55 x 0.55 x 0.45 = 0.136, and C6(V) = V, as demonstrated in the legend of figure 2 .
We now give a full description of the algorithm.
|
Empirical Example
Aligned cytochrome b amino acid sequences from a sample of mammals were taken from the data of Takezaki and Gojobori (1999)
. Phylogenetic trees with 4 to 24 sequences (from 21 taxa) were prepared by taking a subset from these sequences and using the neighbor-joining algorithm (Saitou and Nei 1987
) to reconstruct the tree. For each such tree, we compared the running time required to estimate the ancestral sequences by the three programs: FastML, PAML, and ANCESTOR.
We also compared joint versus marginal reconstruction for cytochrome b sequences of the 21 taxa in the data set. The phylogenetic tree obtained for these sequences (fig. 5
) was in agreement with the tree obtained by Takezaki and Gojobori (1999)
. Calculations of the replacement probabilities were done with the REV model (Adachi 1995
).
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
As shown in table 2 , in all but five positions, there was no difference between joint and marginal reconstruction. However, depending on the intention of the study, the differences may be important. For example, consider node 33 (fig. 5 ). If one seeks the most likely amino acid in this HTU, then it is leucine. However, when jointly reconstructing the whole tree, the most likely member of the set containing all the internal-node amino acid assignments is methionine. Deciding which is "more correct" depends on the question asked. For instance, if one wishes to count the number of threonine-to-methionine replacements over the entire tree, then the joint reconstruction should be used to obtain this number (2, in our case, on the branch connecting node 24 to node 3 and the branch connecting node 32 to node 33). However, if one wishes to synthesize the hypothetical cytochrome b sequence of the ancestor of Cetartiodactyla, then one should use the marginal reconstruction approach. We emphasize that both methods compute optimal reconstructions by using all of the available data. Discrepancies originate not from misuse of information, but from the difference in the nature of the probabilistic questions asked.
The rate of amino acid replacement is usually not constant among sites. Our algorithm finds the most likely ancestral sequence only for the case of constant rate only. There are as yet no programs for joint reconstruction that take rate variation among sites into account. In PAML, a marginal reconstruction of ancestral sequence that assumes gamma distribution among sites is available. We were unable to apply our linear algorithm to the gamma model because of the different expressions that have to be maximized.
The algorithm described above is presented in terms of amino acids and bifurcating trees. However, the algorithm can be easily adapted for nucleotide sequences and multifurcating trees.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
1 Keywords: ancestral sequences
fast algorithm
joint reconstruction
maximum likelihood
dynamic programming
molecular evolution
2 Address for correspondence and reprints: Dan Graur, Department of Zoology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Israel. E-mail: graur{at}post.tau.ac.il
![]() |
literature cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Adachi, J. 1995. Modeling of molecular evolution and maximum likelihood inference of molecular phylogeny. Ph.D. dissertation, Graduate University for Advanced Studies, Hayama, Japan.
Cormen, T. H., C. E. Leiserson, and R. L. Rivest. 1990. Introduction to algorithms. MIT Press, Cambridge, Mass.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. Pp. 345352 in M. O. Dayhoff, ed. Atlas of protein sequences and structure. Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368376.[ISI][Medline]
Fitch, W. M. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20:406416.[ISI]
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275282.[Abstract]
Kishino, H., T. Miyata, and M. Hasegawa. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 31:151180.[ISI]
Koshi, J. M., and R. A. Goldstein. 1996. Probabilistic reconstruction of ancestral protein sequences. J. Mol. Evol. 42:313320.[ISI][Medline]
Malcolm, B. A., K. P. Wilson, B. W. Matthews, J. F. Kirsch, and A. C. Wilson. 1990. Ancestral lysozymes reconstructed, neutrality tested, and thermostability linked to hydrocarbon packing. Nature 345:8689.
Saitou, N., and M. Nei. 1987. The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406425.[Abstract]
Stewart, C.-B. 1995. Active ancestral molecules. Nature 374:1213.
Swofford, D. L. 1993. PAUP: phylogenetic analysis using parsimony. Version 3.1. Illinois Natural History Survey, Champaign.
Takezaki, N., and T. Gojobori. 1999. Correct and incorrect phylogenies obtained by the entire mitochondrial DNA sequences. Mol. Biol. Evol. 16:590601.[Abstract]
Walker, A. E. 1998. Problems with parsimony in sequences of biased base composition. J. Mol. Evol. 47:686690.[ISI][Medline]
Yang, Z. 1995. PAML: a phylogenetic analysis by maximum likelihood. Version 2.0e. Pennsylvania State University, University Park.
Yang, Z., S. Kumar, and M. Nei. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:16411650.
Zhang, J., and M. Nei. 1997. Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods. J. Mol. Evol. 44(Suppl. 1):s139s146.
Zhang, J., H. F. Rosenberg, and M. Nei. 1998. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA 95:37083713.