L. H. Bailey Hortorium, Cornell University
Abstract
In this paper, we propose a new method (uninode coding) for coding duplicate (paralogous) genes to infer species trees. Uninode coding incorporates data from duplicated and unduplicated gene copies in phylogenetic analyses of taxa. Uninode coding utilizes global parsimony through the inclusion of both duplicated and unduplicated gene copies, allows one to code all data sources from a taxon into a single terminal, and overcomes problems of character dependence among duplicated and unduplicated gene copies. We present an example of uninode coding using the phytochrome A and phytochrome C data from a study by Donoghue and Mathews.
Introduction
With increased use of DNA sequence data matrices in phylogenetic analyses, new methodologies are necessary to address problems that have not been encountered in data matrices based on morphology, anatomy, secondary compounds, etc. One of these problems is gene duplication. Coding duplicate gene copies in phylogenetic analyses is problematic when some taxa included in the data matrix contain duplicated paralogous sequences and other taxa contain ancestral unduplicated sequences. Generally, unduplicated and duplicated sequences are treated as homologous and used to reconstruct gene trees. These trees are then used (e.g., by taxonomic congruence; see Miyamoto and Fitch 1995) to make conclusions regarding species phylogeny (e.g., Gottlieb and Ford 1996; Sang, Donoghue, and Zhang 1997). Taxonomic congruence analyses do not allow for simultaneous inclusion of other data sources. In such analyses, neither paralogous genes nor other data sources can be combined into a single terminal that is appropriate for a total-evidence analysis (Kluge 1989) of the species tree (phylogeny of the taxa from which the loci have been sampled).
The explicit use of gene duplication events has also been proposed as a method to root gene trees from which the species tree is then inferred by taxonomic congruence. This "duplicate gene rooting" (as termed by Donoghue and Mathews [1998]) was originally proposed to root the tree of life, for which outgroups are not available for DNA sequence data (Gogarten et al. 1989; Iwabe et al. 1989). In the absence of outgroups, which usually are employed in the rooting of a phylogenetic analysis, putatively paralogous sequences were analyzed simultaneously and rooted along the branch at which the gene duplication event had been inferred. Donoghue and Mathews (1998) recently proposed that duplicate gene rooting should be employed to root ingroup trees when outgroups are available, but are so divergent as to obscure homology and thus to have the potential for long-branch attraction. The methodology proposed by Donoghue and Mathews (1998) (and used in Mathews and Donoghue 1999) has theoretical and methodological problems (unpublished data).
Although gene duplications may provide interesting data relevant to genomic history, we do not advocate searching for evidence of such events to reduce branch lengths and root trees as suggested by Donoghue and Mathews (1998). However, duplicate gene copies are occasionally found in data matrices, and a method is needed for effectively coding duplicate gene copies in phylogenetic analyses of taxa. We propose a new method, called "uninode coding," that allows one to apply global parsimony and incorporate multiple data sources in total-evidence analyses.
Materials and Methods
We present an example of uninode coding using the phytochrome A (PHYA) and phytochrome C (PHYC) data from Donoghue and Mathews (1998, their fig. 4) in figure 1. As with duplicate gene rooting, uninode coding may be used for gene families in which the duplicate loci sampled do not undergo concerted evolution. Uninode coding is implemented in a two-step analysis. The first step is conducted to determine where putative gene duplications have occurred on the gene tree. A gene tree matrix is constructed with all sequences aligned to each other and treated as homologous relative to all other sequences (fig. 1a). This first step is a standard gene tree analysis of a multigene family. In examining the gene tree in this example (fig. 1b), the investigator will infer a single gene duplication (represented by the node with the asterisk). All terminals external ("basal") to this node are inferred to have an unduplicated gene, and all terminals internal ("derived") to this node are inferred to have duplicate genes (regardless of potential secondary loss of one copy). In the second step of the uninode analysis, the inferred gene duplication is used to construct a species tree data matrix in which the species are terminals (fig. 1c). Using the species tree data matrix, the investigator attempts to reconstruct the phylogeny of the taxa from which the gene(s) has been sampled, not to reconstruct the gene tree (e.g., fig. 1a and b).
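The partition of terminals made in the first step can be sketched in code. The following is a minimal illustration, not part of the original analysis: the tree encoding (a dictionary of children), the function names, the toy topology, and the placeholder taxa "TaxonX" and "TaxonY" are all assumptions made for the example.

```python
def leaves_below(children, node):
    """Collect the terminal labels descended from `node` in a rooted tree."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    found = set()
    for kid in kids:
        found |= leaves_below(children, kid)
    return found


def split_by_duplication(children, root, dup_node, taxon_of):
    """Return (duplicated taxa, unduplicated taxa) for one duplication node.

    Terminals descended from `dup_node` are treated as duplicated copies;
    all other terminals are treated as unduplicated.
    """
    inside = leaves_below(children, dup_node)
    outside = leaves_below(children, root) - inside
    duplicated = {taxon_of[t] for t in inside}
    unduplicated = {taxon_of[t] for t in outside}
    return duplicated, unduplicated


# Toy rooted gene tree (hypothetical topology and angiosperm names).
children = {
    "root": ["Physcomitrella", "n1"],
    "n1": ["Ginkgo", "dup"],
    "dup": ["cladeA", "cladeC"],  # the inferred duplication node
    "cladeA": ["TaxonX_PHYA", "TaxonY_PHYA"],
    "cladeC": ["TaxonX_PHYC", "TaxonY_PHYC"],
}
taxon_of = {
    "Physcomitrella": "Physcomitrella",
    "Ginkgo": "Ginkgo",
    "TaxonX_PHYA": "TaxonX",
    "TaxonY_PHYA": "TaxonY",
    "TaxonX_PHYC": "TaxonX",
    "TaxonY_PHYC": "TaxonY",
}
duplicated, unduplicated = split_by_duplication(children, "root", "dup", taxon_of)
```

A taxon that has lost one copy secondarily still falls in the duplicated set here, because any one of its sequences descends from the duplication node.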
Next, the unambiguously optimized character states (i.e., all possible character states that can be optimized at the node) are determined for the internal node of the gene tree that represents the gene duplication event (as can be done in WinClada [Nixon 1999] or MacClade [Maddison and Maddison 1992]). In WinClada, the unambiguously optimized character states for the internal node may be listed using the "Output" menu option. In MacClade, the unambiguously optimized character states for the internal node are determined by selecting "Trace All States," while in the "Tree Window," and then saving the "Node List" as a text file. The "Node List" records all possible states reconstructed at the node above each branch (Maddison and Maddison 1992).
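For readers without access to these programs, the set of states that can be optimized at a node may be sketched as follows. This is an illustrative stand-in, not the WinClada or MacClade procedure: it uses a Sankoff-style dynamic program with unit costs (unordered parsimony), treats the tree as unrooted, and evaluates it from the node of interest. The function names, the adjacency-list encoding, the toy tree, and the choice to write ambiguous sites as "?" are all assumptions of the example.

```python
INF = float("inf")


def node_state_set(adjacency, leaf_state, node, alphabet="ACGT"):
    """States at `node` that occur in at least one most-parsimonious
    reconstruction of a single character (unit cost per change).

    `leaf_state` maps terminal labels to one state in `alphabet` or "?".
    """
    def subtree_cost(v, parent):
        if v in leaf_state:
            s = leaf_state[v]
            # A terminal costs nothing for its own state; "?" fits any state.
            return {a: 0 if s in (a, "?") else INF for a in alphabet}
        total = dict.fromkeys(alphabet, 0)
        for nb in adjacency[v]:
            if nb == parent:
                continue
            c = subtree_cost(nb, v)
            cheapest = min(c.values())
            for a in alphabet:
                # Either the neighbor keeps state a, or it changes (one step).
                total[a] += min(c[a], cheapest + 1)
        return total

    cost = subtree_cost(node, None)
    best = min(cost.values())
    return {a for a in alphabet if cost[a] == best}


def ancestor_sequence(adjacency, leaf_seqs, node, alphabet="ACGT"):
    """Code each site at `node`: the single optimal state, or "?" if ambiguous."""
    n_sites = len(next(iter(leaf_seqs.values())))
    coded = []
    for i in range(n_sites):
        column = {t: s[i] for t, s in leaf_seqs.items()}
        states = node_state_set(adjacency, column, node, alphabet)
        coded.append(next(iter(states)) if len(states) == 1 else "?")
    return "".join(coded)


# Toy tree: the duplication node joins an outgroup side and two paralog clades.
adjacency = {
    "dup": ["out", "a", "c"],
    "out": ["dup", "o1", "o2"],
    "a": ["dup", "a1", "a2"],
    "c": ["dup", "c1", "c2"],
}
leaf_seqs = {"o1": "A?", "o2": "A?", "a1": "AA", "a2": "AA", "c1": "GG", "c2": "GG"}
hypothetical_ancestor = ancestor_sequence(adjacency, leaf_seqs, "dup")
```

In the toy data the second site is missing in both outgroup terminals, so the node can be optimized as either A or G and the site is written as "?".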
The node that represents the gene duplication event is the most recent common hypothetical ancestor of PHYA and PHYC. In the data matrix, the taxa with unduplicated gene sequences (Physcomitrella through Ginkgo) are all coded as having the character states of this hypothetical ancestor (H.A. in fig. 1c) when compared with each of the duplicated paralogs (PHYA and PHYC). All unduplicated gene sequences are also compared with this hypothetical ancestor. In effect, the species tree data matrix contains three times as many characters, but the same or slightly fewer (due to ambiguous optimization of character states at the hypothetical-ancestor node) minimal number of steps as the gene tree data matrix. Finally, each inferred gene duplication is added as a single binary character that is given the same weight as all other characters in the data matrix. In this example, all taxa with sequences in the gene tree derived from the node that represents the gene duplication event are scored as having presence of the gene duplication, and all taxa with sequences external to this node are scored as lacking the gene duplication (fig. 1c). This coding scheme can be extended to address multiple gene duplication events (e.g., the three inferred in fig. 3 of Donoghue and Mathews [1998]).
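The layout of the resulting species tree data matrix can be sketched for a single duplication as three sequence blocks followed by the binary duplication character. The sketch below is a hedged reading of the description above rather than the authors' implementation: the function name, the block order, the toy sequences, the placeholder taxa "TaxonX" and "TaxonY", and the scoring of an unsampled or lost paralog as missing data ("?") are assumptions.

```python
def uninode_rows(unduplicated, phya, phyc, ancestor, missing="?"):
    """Build species-tree rows laid out as
    [unduplicated block | PHYA block | PHYC block | duplication character].

    `unduplicated`, `phya`, and `phyc` map taxon names to aligned sequences;
    `ancestor` is the coded sequence of the hypothetical ancestor (H.A.).
    """
    n = len(ancestor)
    for seqs in (unduplicated, phya, phyc):
        if any(len(s) != n for s in seqs.values()):
            raise ValueError("all sequences must match the ancestor length")

    blank = missing * n
    rows = {}
    # Taxa with the unduplicated gene: own sequence, then H.A. in both
    # paralog blocks, and absence (0) of the duplication.
    for taxon, seq in unduplicated.items():
        rows[taxon] = seq + ancestor + ancestor + "0"
    # Taxa with duplicated genes: H.A. in the unduplicated block, their own
    # paralogs where sampled, and presence (1) of the duplication.
    for taxon in sorted(set(phya) | set(phyc)):
        rows[taxon] = (
            ancestor + phya.get(taxon, blank) + phyc.get(taxon, blank) + "1"
        )
    return rows


# Toy data; only the two outgroup genus names come from the text.
ancestor = "AC?T"
unduplicated = {"Physcomitrella": "ACGT", "Ginkgo": "ACGA"}
phya = {"TaxonX": "TCGT", "TaxonY": "TCGA"}
phyc = {"TaxonX": "ACCT"}  # TaxonY has no PHYC sequence in this toy set
rows = uninode_rows(unduplicated, phya, phyc, ancestor)
```

Each row is three times the alignment length plus one character, which matches the tripling of characters described above. Columns from other data sources could be appended to every row in the same way.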
With uninode coding, character changes along all branches of the gene tree are potentially included in the species tree, and independence of characters is maintained. This revised data matrix is then used to infer the species tree (fig. 1d). Characters from other sources can simply be added to the species tree data matrix. In this example, rbcL and 18S nrDNA sequences available in GenBank for genera from figure 1c were included in the matrix to infer the species tree (matrix in fig. 1e, inferred species tree in fig. 1f).
Results and Discussion
In implementing uninode coding, treatment of taxa with duplicate sequences is straightforward. However, treatment of taxa without duplicated sequences (the non-angiosperm outgroups from Donoghue and Mathews [1998]) is less clear. If the unduplicated outgroup sequences are compared with every duplicated sequence (e.g., if the Psilotum phytochrome is compared independently with both PHYA and PHYC), there is the problem of duplication of presumably independent characters. All character state changes along the bolded branches in figure 1b would be duplicated. This violates the assumption of independence among characters. Alternatively, if the unduplicated outgroup sequences are arbitrarily compared with one of the paralogs, the duplication of characters is eliminated. However, the character state changes along one of the gene tree branches would be lost (the dashed line in fig. 1b if the unduplicated outgroup sequences were compared with PHYA), resulting in loss of branch support in the species tree. The problems of duplication of characters and loss of branch support are avoided using uninode coding, in which the topology of the gene tree is recoded into a data matrix to infer the species tree (fig. 1c).
In the uninode species tree data matrix, unduplicated sequences are no longer directly compared with the duplicated sequences. Secondary signals (character state groupings that contradict the inferred tree topology; equivalent to homoplasy [Farris 1979]) may be lost in the form of characters that contradict monophyly of the taxa with duplicated genes. This procedure is no different from making homology decisions during alignment of sequences before a phylogenetic analysis. These secondary signals have already been determined to be nonhomologous in the original gene tree and are irrelevant unless other data are added (e.g., 18S nrDNA, rbcL), as is only possible in a simultaneous analysis (= total evidence; Kluge 1989; Eernisse and Kluge 1993; Kluge and Wolf 1993; Nixon and Carpenter 1996). In a simultaneous analysis of data sets that support alternative groupings but have a common secondary signal with other data sets in the analysis, the common secondary signal may result in groupings supported by no one data set alone (Barrett, Donoghue, and Sober 1991). All secondary signals are lost, however, when congruence is used to infer the species tree (Nixon and Carpenter 1996), as is done in duplicate gene rooting as presented by Donoghue and Mathews (1998).
Uninode coding is a form of simultaneous analysis. Instead of two or more terminals representing a taxon in the species tree (if the gene tree is interpreted as a reconstruction of the species tree) as in figure 1b, a single terminal for each taxon is made using uninode coding, as in figure 1d. With simultaneous analyses, one can expect greater branch support values (e.g., Sullivan 1996; Soltis et al. 1998), unless the phylogenetic signals from the two loci are extremely incongruent. Increased branch support values (as measured by bootstrap support [Felsenstein 1985]) are obtained in the phytochrome example. On the inferred gene tree in figure 1b, 12 of 31 internal branches have less than 50% support values, and only 2 internal branches have 100% support values with the strict-consensus bootstrap. In contrast, on the inferred species tree in figure 1d, only 2 of 19 internal branches have less than 50% support values, and 6 internal branches have 100% support values with the strict-consensus bootstrap.
One drawback of uninode coding is the potential loss of ambiguously optimized character states. Using this method, in which unambiguously optimized character states are determined for the internal node, this loss occurs whenever the internal node that represents the gene duplication event has alternative optimizations for any given character. Using parsimony, only homoplasious characters and characters that are scored as missing for one or more taxa can have alternative optimizations. When there are alternative optimizations, these characters may not be as useful as unambiguously optimized characters for selecting among possible trees, and they do not contribute to branch support (when branches that are not supported using every parsimonious reconstruction are collapsed [Farris et al. 1996]). These character state changes could be preserved if the internal node were scored under fast (ACCTRAN) or slow (DELTRAN) optimization. However, there is no justification for choosing one over the other, and how the internal node is scored can affect the species tree topology. Therefore, it is conservative to use the unambiguously optimized character states for the internal node. Different most-parsimonious trees may be obtained under unambiguous, fast, and slow optimizations for the internal node. Because ambiguously optimized character states are caused by missing data and homoplasious characters, different trees are most likely to be found using very homoplasious data matrices and/or those with significant missing data in terminals near the internal node that represents the inferred gene duplication event. Different trees are obtained with the example used here; the phytochrome matrix is both very homoplasious (in the gene tree from fig. 1b, ensemble consistency index = 0.38 and ensemble retention index = 0.42) and has significant missing data in Ephedra and Psilotum.
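The two homoplasy measures quoted above follow standard definitions: the ensemble consistency index is M/S and the ensemble retention index is (G - S)/(G - M), where M, S, and G are the minimum possible, observed, and maximum possible numbers of steps summed over all characters. A minimal sketch of the arithmetic is given below; the function names and the toy step counts are illustrative assumptions and are not taken from the phytochrome matrix.

```python
def ensemble_ci(min_steps, obs_steps):
    """Ensemble consistency index: summed minimum steps / summed observed steps."""
    return sum(min_steps) / sum(obs_steps)


def ensemble_ri(min_steps, obs_steps, max_steps):
    """Ensemble retention index: (G - S) / (G - M) on summed step counts."""
    m, s, g = sum(min_steps), sum(obs_steps), sum(max_steps)
    if g == m:
        raise ValueError("retention index is undefined when G equals M")
    return (g - s) / (g - m)


# Toy per-character step counts for three characters on one tree.
min_steps = [1, 1, 2]
obs_steps = [2, 3, 5]
max_steps = [3, 4, 6]
ci = ensemble_ci(min_steps, obs_steps)
ri = ensemble_ri(min_steps, obs_steps, max_steps)
```

Both indices equal 1 when there is no homoplasy and decrease as homoplasy increases, so the low values reported for the gene tree indicate a very homoplasious matrix.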
The coding of duplicate genes presents unusual problems for phylogenetic analyses. Uninode coding is a new method that allows data from duplicated and unduplicated gene copies to be incorporated in phylogenetic analyses of taxa. Uninode coding utilizes global parsimony through the inclusion of both duplicated and unduplicated gene copies, allows one to code all data sources from a taxon into a single terminal, and overcomes problems of character dependence among duplicated and unduplicated gene copies.
Acknowledgements
We thank Jerry Davis, Jeff Doyle, Helga Ochoterena, and two anonymous reviewers for reviewing the paper. We also thank the L. H. Bailey Hortorium Cladistics Discussion Group and the Doyle Lab Group for helpful discussions.
Footnotes
Pamela Soltis, Reviewing Editor
1 Keywords: angiosperm phylogeny, duplicate gene rooting, gene tree/species tree, orthology/paralogy, phylogeny reconstruction, phytochrome gene family
2 Address for correspondence and reprints: Mark P. Simmons, L. H. Bailey Hortorium, 462 Mann Library, Cornell University, Ithaca, New York 14853. E-mail: mps14{at}cornell.edu
Literature Cited
Barrett, M., M. J. Donoghue, and E. Sober. 1991. Against consensus. Syst. Zool. 40:486–493.
Donoghue, M. J., and S. Mathews. 1998. Duplicate genes and the root of angiosperms, with an example using phytochrome sequences. Mol. Phylogenet. Evol. 9:489–500.
Donoghue, M., M. Sanderson, and W. Piel. 1996. TreeBASE: a database of phylogenetic knowledge. Retrieved December 14, 1998 from the World Wide Web: http://www.herbaria.harvard.edu/treebase/.
Eernisse, D. J., and A. G. Kluge. 1993. Taxonomic congruence versus total evidence, and Amniote phylogeny inferred from fossils, molecules, and morphology. Mol. Biol. Evol. 10:1170–1195.
Farris, J. S. 1979. On the naturalness of phylogenetic classification. Syst. Zool. 28:200–214.
Farris, J. S., V. A. Albert, M. Källersjö, D. Lipscomb, and A. G. Kluge. 1996. Parsimony jackknifing outperforms neighbor-joining. Cladistics 12:99–124.
Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.
Gogarten, J. P., H. Kibak, P. Dittrich et al. (13 co-authors). 1989. Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes. Proc. Natl. Acad. Sci. USA 86:6661–6665.
Goloboff, P. 1993. Nona. version 1.6 (computer software and manual). Distributed by the author, Tucuman, Argentina.
Gottlieb, L. D., and V. S. Ford. 1996. Phylogenetic relationships among the sections of Clarkia (Onagraceae) inferred from the nucleotide sequences of PgiC. Syst. Bot. 21:45–62.
Iwabe, N., K.-I. Kuma, M. Hasegawa, S. Osawa, and T. Miyata. 1989. Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes. Proc. Natl. Acad. Sci. USA 86:9355–9359.
Kluge, A. G. 1989. A concern for evidence and a phylogenetic hypothesis for relationships among Epicrates (Boidae, Serpentes). Syst. Zool. 38:1–25.
Kluge, A. G., and A. J. Wolf. 1993. Cladistics: what's in a word? Cladistics 9:183–199.
Maddison, W. P., and D. R. Maddison. 1992. MacClade: analysis of phylogeny and character evolution. Sinauer, Sunderland, Mass.
Mathews, S., and M. J. Donoghue. 1999. The root of angiosperm phylogeny inferred from duplicate phytochrome genes. Science 286:947–950.
Miyamoto, M. M., and W. M. Fitch. 1995. Testing species phylogenies and phylogenetic methods with congruence. Syst. Biol. 44:64–76.
Nixon, K. C. 1999. WinClada. Version 1.0 (computer software and manual). Distributed by the author, Cornell University, Ithaca, N.Y.
Nixon, K. C., and J. M. Carpenter. 1996. On simultaneous analysis. Cladistics 12:221–242.
Sang, T., M. J. Donoghue, and D. Zhang. 1997. Evolution of alcohol dehydrogenase genes in peonies (Paeonia): phylogenetic relationships of putative nonhybrid species. Mol. Biol. Evol. 14:994–1007.
Soltis, D. E., P. S. Soltis, M. E. Mort, M. W. Chase, V. Savolainen, S. B. Hoot, and C. M. Morton. 1998. Inferring complex phylogenies using parsimony: an empirical approach using three large DNA data sets for angiosperms. Syst. Biol. 47:32–42.
Sullivan, J. 1996. Combining data with different distributions of among-site variation. Syst. Biol. 45:375–380.