Department of Mathematics, National University of Singapore, Singapore
Correspondence: E-mail: matzlx{at}nus.edu.sg.
Key Words: molecular phylogeny tandem duplication history duplication tree model
Introduction
Large genomes are full of repeated DNA sequences. It was estimated that over half of the human DNA consists of repeated sequences (Baltimore 2001; Eichler 2001; Leem et al. 2002). Tandem duplication is one of the important evolutionary mechanisms for producing repeated DNA sequences, in which the copies that may or may not contain genes are adjacent along the genome. Fitch (1977) first observed that tandem duplication histories are much more constrained than speciation histories and proposed to model them assuming that unequal crossover is the biological mechanism from which they originate. The corresponding trees are now called tandem duplication trees, the term tandem sometimes being omitted for the sake of conciseness. With more and more genomic sequences becoming known, inferring tandem duplication history has again redrawn researchers' attention (Benson and Dong 1999; Tang, Waterman, and Yooseph 2002; Elemento, Gascuel, and Lefranc 2002; Zhang et al. 2003).
The aim of this article is, first, to present a simple recurrence formula for the number of rooted duplication trees based on the recurrence formula proved in Gascuel et al. (2003). We also give a simple non-counting proof of the fact that the number of rooted duplication trees for n segments is exactly twice the number of unrooted duplication trees for n segments. Notice that this fact was proved based on a counting argument in Gascuel et al. (2003).
Duplication Tree Model
Assume n sequence segments {1, 2, ... , n} were formed from a locus by tandem duplication. Then, assume that the locus had grown from a single copy through a series of tandem duplications. Each duplication replaced a stretch of DNA sequences containing several repeats with two identical and adjacent copies of itself. If the stretch contains k repeats, the duplication is called a k-duplication.
A rooted duplication tree for tandemly repeated segments {1, 2, ... , n} is a rooted binary tree that contains blocks as shown in figure 1. A node in
represents a repeat. Obviously, the root represents the original copy at the locus and leaves the given segments.
|
|
The leaves representing given segments are placed from left to right in the order of the segments appearing on the chromosome. Here we assume that such an order is the increasing order.
Counting Rooted Duplication Trees
An r-duplication event in a rooted duplication tree is visible if none of the 2r copied segments produced by the event have been duplicated subsequently. It is easy to see that the children of the nodes contained in a visible duplication block are leaves. For example, there are five visible duplications in the rooted duplication tree in figure 1: [q], [n], [j, k], [o], [p]. For a visible r-duplication event, if there are i given segments remaining to the right of the 2r copies produced by the event, we refer it as a (i, r)-duplication event. In figure 1, the visible 2-duplication [j, k] is a (7, 2)-duplication.
We use rn to denote the number of all the rooted duplication trees for n segments. For n 2 and 0
i
n 2, let p(n, i) denote the number of all the rooted duplication trees for n segments in which the leftmost visible duplication is an (i, r)-duplication for some r, 1
r
(n i)/2. Then,
Lemma 1 (Gascuel et al. 2003) For any n 2 and 1
i
n 4,
|
THEOREM 1 For any n 2,
|
|
|
Now, we show equation (4) by induction on k. When k = 2, 3, equation (4) becomes equation (3) and hence is true. Assume it is true for k j. For k = j + 1 > 3, by equation (2),
|
Now, by equation (1),
|
The recurrence formula in Theorem 1 allows us to find a closed formula for computing rn. Let X = (rn, rn1, ... , r3, r2)T. Then, by the recurrence formula,
|
|
|
Unrooted Duplication Trees
An unrooted duplication tree is a tree derived from a rooted duplication tree by removal of the root. Let
be an unrooted duplication tree for n segments {1, 2, ... , n}. By definition, at least one rooted duplication tree can be obtained from
by rooting it at some edge in the path from 1 to n. We say that
can be rooted at an edge e if a rooted duplication tree can be formed by rooting it at e. Let Si denote the set of unrooted duplication trees that can be rooted at exactly i edges, i
1. Then, the following facts are true:
|
Assume
Sk, k
3 and i and i' are as in (2). Let Tj denote the subtree of
rooted at a child of uj that is off the path from 1 to n. We also let
j(j+1) denote the unrooted duplication tree obtained from
by interchanging Tj and Tj+1 as illustrated in figure 2. Then, we attain i' i 2 unrooted trees
(i+1)(i+2),
(i+2)(i+3), ...,
(i'2)(i'1) from
. It is easy to see that, for each j from i + 1 to i' 2,
j(j+1) can be rooted uniquely at the edge ej+1.
|
Therefore, the mapping from to
= {
j(j+1) | i + 1
j
i' 2} is one-to-one from unrooted duplication trees
Sk (k
3) to the (k 2)-subsets of S1. Recall that rn denotes the number of rooted duplication trees for n segments. Letting un be the number of unrooted duplication trees, we obtain
|
In conclusion, the following two facts hold for duplication trees:
|
Acknowledgements
L. Zhang thanks Mike Steel for drawing attention to the counting problem studied here, and for further discussions. Thanks, too, to Olivier Gascuel for useful comments on the first version of this paper. The work is partially supported by Singapore BioMedical Research Council Research grant BMRC01/1/21/19/140.
Footnotes
Peter Lockhart, Associate Editor
Literature Cited
Baltimore, D. 2001. Our genome unveiled. Nature 409:814-816.[CrossRef][ISI][Medline]
Eichler, E. E. 2001. Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet. 17:661-669.[CrossRef][ISI][Medline]
Elemento, O., and O. Gascuel. 2003. An exact and polynomial distance-based algorithm to reconstruct single copy tandem duplication trees. Pp. 96108 in Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching (CPM'03), Morelia, Mexico. Lecture Notes in Computer Science, vol. 2676, Springer-Verlag.
Elemento, O., O. Gascuel, and M.-P. Lefranc. 2002. Reconstructing the duplication history of tandemly repeated genes. Mol. Biol. Evol. 19:278-288.
Fitch, W. 1977. Phylogenies constrained by cross-over process as illustrated by human hemoglobins in a thirteen cycle, eleven amino-acid repeat in human apolipoprotein A-I. Genetics 86:623-644.
Gascuel, O., M. D. Hendy, A. Jean-Marie, and R. McLachlan. 2003. The combinatorics of tandem duplication trees. Syst. Biol. 52:110-118.[ISI][Medline]
Leem, S.-H., J. A. Londoño-Vallejo, and J.-H. Kim, et al. 2002. The human telomerase gene: complete genomic sequence and analysis of tandem repeat polynomisms in intronic regions. Oncogene 21:769-777.[CrossRef][ISI][Medline]
Tang, M., M. Waterman, and S. Yooseph. 2002. Zinc finger gene clusters and tandem gene duplication. J. Comput. Biol. 9:429-446.[CrossRef][ISI][Medline]
Zhang, L., B. Ma, L. Wang, and Y. Xu. 2003. Greedy method for inferring tandem duplication history. Bioinformatics 19:1497-1504.
|