* Section of Evolution and Ecology, University of California, Davis
Botanical Garden and Centre for Plant Research, University of British Columbia
Department of Computer Science, Iowa State University
Correspondence: E-mail: mjsanderson{at}ucdavis.edu.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: biclique NP-complete sequence concatenation phylogeny optimization
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
However, phylogenetic methods require input data in the form of rectangular matrices of taxa by aligned sequences. Ideally, such data sets are completemeaning that every species has been sequenced for every gene in the data matrix. The sample of genes among taxa found in sequence databases or available from other sources rarely allows large complete data sets to be constructed, however, and therefore all large concatenated phylogenetic data sets published recently have missing entries (Qiu et al. 1999; Soltis, Soltis, and Chase 1999; Murphy et al. 2001). Missing data in incomplete data sets should eventually degrade phylogenetic inference by increasing both the number of optimal solutions found and the uncertainty in the placement of some taxa relative to others (Wiens 1998), although how much missing data is tolerable remains an open question (Kearney 2002). This motivates the present study. How can we optimally construct complete phylogenetic data sets from large sequence databases? The problem can be posed more formally as follows: Given a large collection of sequences that have been partitioned into sets of homologous genes, is it possible to construct a complete data matrix in which m taxa have sequences for the same k genes? Moreover, is it possible to find all complete data sets of this size or larger? Finding complete data sets in a sequence database is a nontrivial problem. In fact, determining whether a complete data set of a given size exists is an NP-complete problem, meaning that efficient (polynomial time) algorithmic solutions are unlikely to be discovered (Garey and Johnson 1979). However, we describe an exact, exponential time algorithm which effectively solves many large problems, and we illustrate its use by constructing maximal concatenated data sets from a large set of orthologous protein sequences available from green plants. Alexe et al. (2002) have described a different algorithm which, though polynomial in the output size, also has worst-case exponential running time, because the output size can grow exponentially with input size.
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
Given: Bipartite graph G(C) for cluster set C, natural numbers k and m.
Find: All maximal bicliques, Kk',m', for G(C), in which k' ≥ k, m' ≥ m.
The decision problem asking whether a biclique, Kk,m, exists in G(C) is NP-complete. This follows because the decision problem version of our problem for the instance is called GT24 (balanced complete bipartite subgraph problem) in Garey and Johnson (1979; see also Peeters 2000) and is known to be NP-complete. Therefore the decision variant of our problem (without the restriction
) must therefore be NP-complete. This implies that the optimization problem stated above is also unlikely to have an efficient solution for arbitrary inputs (Cormen et al. 2001).
Algorithm
We have developed an exact algorithm that solves this problem quickly (in minutes to a few hours on a Linux workstation) for our data, by exhaustively building progressively larger bicliques. Briefly, suppose we seek all bicliques at least as large as Kk,m. The exact algorithm examines every pair of clusters and calculates the intersection of their taxon sets. Let the size of that intersection (the number of taxa in both clusters) be m'. This step finds all bicliques, K2,m'. Discard any bicliques in which , andassuming any are lefttry adding every remaining cluster to each surviving candidate biclique. This step finds all bicliques, K3,m'. If this iterative procedure terminates before finding bicliques, Kk,m, then none exists satisfying the constraints. On the other hand if the iteration gets to the point of examining k clusters, then every biclique found from then on will be of the required size or larger, and the algorithm will continue to enumerate these until it has found all of them. Maximal bicliques are obtained from these by discarding any that are contained in another biclique. This algorithm was implemented in C++, using the LEDA library (Algorithmic Solutions Software GmbH) for data structures. Source code and a Linux executable are available at (http://ginger.ucdavis.edu/sandlab/SOFTWARE). See the Appendix for a detailed description of the algorithm.
Sequence Data, Extraction of Cluster Set, and Identification of Ortholog Clusters
To illustrate the effectiveness of this algorithm, we applied it to a set of 100,813 proteins sampled from 11,587 taxa of green plants extracted from the five GenBank GBPLN flatfiles in release 127.0 of GenBank (December 2001), representing all green plant protein coding sequences in the database that have been translated. The cluster set was constructed using all-against-all BLAST searches (Altschul et al. 1990) combined with single linkage clustering as implemented in the National Center for Biotechnology Information (NCBI) blastclust tool (Dondoshansky 2002), at a stringency of 60% at the amino acid level. Other clustering strategies have been described which are considerably more sophisticated than ours (Krause, Stoye, and Vingron 2000; Kriventseva et al. 2001; Tatusov et al. 2001), but our algorithm can be applied to any cluster set, regardless of how it is constructed. All clusters are available in our online mySQL database (host: ginger.ucdavis.edu; user name "guest").
Sequence concatenation is most appropriate for orthologous sequences. Automated identification of orthologs in sequence database is a challenging task both for computational reasons (Remm et al. 2001; Lee et al. 2002) and because sparse sampling of gene families in databases imposes sharp limits on the strength of inferences. We used a phylogenetic procedure, rather than relying on more standard reciprocal Blast searching strategies (e.g., Lee et al. 2002). By default, the sequences in any cluster in which only a single sequence was present per taxon were treated as orthologous. For clusters containing multiple sequences in at least one taxon, unconstrained and constrained gene trees were constructed using maximum parsimony with a "protein parsimony" step matrix (Swofford et al. 1996; gaps treated as missing data) in the program PAUP* (Swofford 2002). The constrained tree forced all sequences from the same species to form a clade. If the constrained tree was significantly less parsimonious than the unconstrained tree (using a signed-rank test; Swofford et al. 1996), the cluster was considered to contain paralogs and was excluded from concatenation analysis. Note that a cluster might well contain only the orthologs of one paralog in a gene family. Also, some orthologous genes with ancestral polymorphisms will be excluded by the test, and if sampling of a gene family is extremely poor, such that only one sequence per taxon is found in the database, then we are left with no choice but to mistakenly infer orthology. This will be especially likely for clusters with only a few sequences and taxa.
Phylogenetic Analysis of Concatenated Data Sets
Phylogenies were constructed using standard methods for two representative maximal bicliques selected from the set of maximal bicliques. Clusters containing multiple accessions from the same species were pruned such that only one sequence was included per species. This pruning is justified because all sequences from the same species form a clade for those clusters passing the phylogenetic test for orthology. Presumably these represent multiple accessions of the same gene or multiple alleles from the same locus. Amino acid sequences were aligned with default options in ClustalW (Thompson et al. 1994). Protein parsimony was used to reconstruct trees (see above). Bootstrap analysis (100 replicates) was used to assess phylogenetic support for clades. The single-celled streptophyte Mesostigma was used as the outgroup to land plants for the K39,10 biclique; three chlorophytes were used as outgroups to the streptophytes (including land plants) in the K15,15 biclique. Aligned data sets are available at http://ginger.ucdavis.edu/sandlab/WWW_DATA.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Although the running time of the exact algorithm precluded identification of all maximal bicliques, careful setting of different combinations of lower bounds, k, and m, allowed us to quickly identify the largest ones (fig. 2). Running times increase rapidly as the input constraints, k and m, are decreased, which is expected given the computational complexity of the problem. To find the largest maximal bicliques (those forming the boundary in fig. 2), it suffices to choose values of k and m that are just small enough to find some bicliques but not so small that run times become a problem. Each such successful run identifies one or more bicliques guaranteed to have the property that no other bicliques exist above and to the right of it (i.e., with larger values of either k or m or both). Ten runs of the algorithm were sufficient to identify all largest maximal bicliques with more than 3 clusters and 6 taxa.
|
The phylogenetic tree (fig. 3) based on the K39,10 biclique (39 genes; 8,523 amino acids; 10 species) provides strong support for major clades within land plants, including basal relationships within eudicot angiosperms that are not well resolved in recent analyses (Qiu et al. 1999; Soltis et al. 1999; Savolainen et al. 2000). For example, spinach (Chenopodiaceae) and tobacco (Solanaceae) share a more recent common ancestor than either does with the other angiosperms sampled. The K15,15 biclique (15 genes; 3,919 amino acids; 15 species) samples many fewer genes but delves more deeply into green plant phylogeny, including several green algal relatives of land plants, another angiosperm, and another seed plant, Pinus. Bootstrap support was slightly lower within eudicots because of fewer characters, which is apparently the trade-off for increased taxonomic coverage. These examples are characteristic of the high degree of overlap between many maximal bicliques.
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Assembling the Tree of Life: The Role of Sequence Concatenation
These results have implications for recent efforts aimed at assembling large parts of the tree of life (http://research.amnh.org/biodiversity/features/feat.html; or more correctly, that part of the history of life that is tree-like, see Wolf et al. 2002). Exploiting the size and diversity of sequence databases for building comprehensive species phylogenies poses many computational challenges. Two competing strategies have been discussed: (1) concatenation of sequences for increasingly large sets of taxa, as discussed here; and (2) combination of trees (rather than data) constructed from separate but overlapping data sets, using "supertree" methods (Sanderson, Purvis, and Henze 1998; Daubin, Gouy, and Perriere 2001; Liu et al. 2001). Given that some 11,000 species of green plants have protein sequences in GenBank, and that these fall into 40,000 protein clusters, the discovery that the corresponding maximal bicliques have sizes on the order of only 15 by 15 is striking. Moreover, the fact that these complete data sets each contain no more than about 2% of the available sequences implies some limitations on the utility of concatenated data sets. Even though the algorithm described here will permit more intensive exploitation of the sequence databases, concatenation alone will probably not provide a comprehensive solution to building the tree of life any time soon, even with the rapid accumulation of new complete genome sequences. There is little reason to expect that the rich diversity of species from within the broad clades represented in maximal bicliques will soon have many sequences in GenBank. Inclusion of the innumerable "minor" species in the tree of life will therefore require other strategies, such as supertree methods. Nonetheless, bicliques will ultimately provide well-supported phylogenetic backbones that can form the basis for more comprehensive supertree-style studies.
An important lingering question concerns the treatment of overlapping bicliques. Although some bicliques found by running this algorithm are mutually exclusive, many are not. The two "optimal" bicliques shown in figure 2, for example, contain many of the same sequences, and these overlap as well with some of the other maximal bicliques that satisfy the input constraints. If it were desirable to construct many phylogenetic trees from these data sets, it would be necessary to develop strategies either to avoid redundantly using the same sequence data or to account for that redundant use in later applications. One way to formalize this problem is to (edge-)partition the bipartite graph with the fewest bicliques possible. Deciding whether it is possible to partition a bipartite graph into k bicliques is NP-complete (Amilhastre 1999), but minimizing k would guarantee at least that the average number of sequences (edges) per data set (biclique) was maximized. More specific optimality criteria relating to the entire collection of bicliques might be necessary to build robust large phylogenies, however, such as requiring bicliques to be of a minimum size.
Caveats and Limitations
Concatenation of sequences from different genes may not always be a good idea. If different clusters contain different phylogenetic signals, either because of real differences in their evolutionary history, or because of different statistical biases, concatenation may obscure the underlying species tree (Bull et al. 1993). An extensive literature has considered the problem of combinability of data, and statistical tests are widely available (e.g., in PAUP* and other software). However, until very recently these tests have generally dealt with two or three data sets at a time. Scaling up tests to tens or ultimately perhaps even hundreds of genes will present important new challenges. Judging by recent phylogenetic analyses using concatenated genes, the tendency will be to combine data by default, in the hopes that weight of evidence will resolve any conflicts. As genes are sampled from multiple linkage groups and multiple genomes, however, the chances for conflicts between gene trees will rise (Baker and Desalle 1997; Krzywinski ,Wilkerson, and Besansky 2001).
Extensions
The problems described in this paper can be easily modified to include weights on either the taxa or the clusters. A natural weight on clusters is the length of the sequences, which should be crudely correlated with robustness (all else being equal). For phylogenetic reconstruction it might be worthwhile to have one gene of 2,500 nucleotides rather than three genes that together only comprise 1,500 nucleotides. At present we treat all clusters as equal. Equal weighting may have its uses if each cluster is considered a potentially independent source of evidence on the species tree. However, in the case of the green plant data, the bicliques were made almost exclusively of genes in a single linkage group, the chloroplast genome. In that case, weighting by number of sites might be instructive. Better still might be weighting by an a posteriori analysis of the phylogenetic signal in the cluster, such as by an average bootstrap score or posterior probability from Bayesian analysis. However, implementing this sort of analysis would be difficult, because the taxa from the cluster that are eventually represented in the biclique might be in a part of the tree that is poorly supported, even if the remainder of the tree is strongly supported.
Another straightforward extension is to constrain bicliques to contain specific taxa, clusters, or both. Phylogeneticists may wish to obtain large bicliques that include specific taxa of interest; molecular evolutionists may wish to include genes sampled from specific classes of molecules. The algorithm described in this paper can be easily modified to start from a given set of constraints.
It may be important to extend these methods to "quasi-bicliques"bicliques that are allowed to have some fixed number or fraction of empty elements (an element being an entire sequence missing for some taxon). It should be possible to construct quasi-bicliques larger than the bicliques found here. Recent large multigene studies (Qiu et al. 1999; Soltis, Soltis, and Chase 1999; Murphy et al. 2001) all contain "holes" in their data matrices, and are thus quasi-bicliques. The problem is to assess the trade-off between better sampling in a quasi-biclique and additional noise owing to the missing data (Kearney 2002). Quasi-bicliques might also be used to identify which new sequences should be obtained in the laboratory to permit true blicliques to be constructed. In this way, algorithms can guide phylogenetic experimental design.
![]() |
Appendix |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
SET Max-Biclique(C, k, m)
{
/* Initialization */
Delete from C any cluster that is found in fewer than m taxa;
Delete from C any taxon that is found in fewer than k clusters;
Set ; Set
FOREACH cluster, c, in C
{
Make a biclique, b, such that Cluster(, and Intersect_Set(
; Add b to current_biclique_set;
}
/* Main loop */
WHILE (current_biclique_set Ø)
{
FOREACH biclique, b, in current_biclique_set
{
IF |Intersect_Set( THEN
Delete b from current_biclique_set ;
ELSE
IF k' ≥ k, THEN
Add b to the solutions_biclique_set;
}
Set ;
FOREACH biclique, b, in current_biclique_set
{
FOREACH cluster, c in (c Cluster (b))
{
Let b' be a biclique such that Cluster()
{c}
Add b' to current_biclique_set;
}
}
FOREACH biclique, b, in current_biclique_set
{
IF |Cluster(
Remove b, from current_biclique_set ;
}
}
If , THEN
RETURN Ø; /* no bicliques exist for inputs k and m. */
ELSE
RETURN solutions_biclique_set
}
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Alexe, G., S. Alexe, Y. Crama, S. Foldes, P. L. Hammer, and B. Simeone. 2002. Consensus algorithms for the generation of all maximal bicliques. DIMACS Technical Report No. 200252.
Altschul, S., W. Gish, W. Miller, E. W. Myers, and D. Lipman. 1990. A basic local alignment search tool. J. Mol. Biol. 215:403-410.[CrossRef][ISI][Medline]
Amilhastre, J. 1999. Complexity of minimum biclique decomposition of bipartite graphs. Université Montpellier II/CNRS. http://citeseer.nj.nec.com/148735.html.
Baker, R. H., and R. Desalle. 1997. Multiple sources of character information and the phylogeny of Hawaiian drosophilids. Syst. Biol. 46:654-673.[ISI][Medline]
Bapteste, E., H. Brinkmann, J. A. Lee, D. V. Moore, C. W. Sensen, P. Gordon, L. Durufle, T. Gaasterland, P. Lopez, M. Muller, and H. Philippe. 2002. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc. Natl. Acad. Sci. USA 99:1414-1419.
Bininda-Emonds, O. R. P., S. G. Brady, J. Kim, and M. J. Sanderson. 2001. Scaling of accuracy in extremely large phylogenetic trees. Pacific Symposium on Biocomputing. 6:547-558.
Brown, J. R., C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope. 2001. Universal trees based on large combined protein sequence data sets. Nat. Genet. 28:281-285.[CrossRef][ISI][Medline]
Bull, J. J., J. P. Huelsenbeck, C. W. Cunningham, D. L. Swofford, and P. J. Waddell. 1993. Partitioning and combining data in phylogenetic analysis. Syst. Biol. 42:384-397.[ISI]
Chase, M. W., D. E. Soltis, R. G. Olmstead, D. Morgan, D. H. Les, B. D. Mishler, M. R. Duvall, R. A. Price, H. G. Hills, and Y.-L. Qiu. 1993. Phylogenetics of seed plants: an analysis of nucleotide sequences from the plastid gene rbcL. Ann. Missouri Bot. Gard. 80:528-580.[ISI]
Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein. 2001. Introduction to algorithms, 2nd edition. MIT Press, Cambridge, Mass.
Daubin, V., M. Gouy, and G. Perriere. 2001. Bacterial molecular phylogeny using supertree approach. Genome Informatics 12:155-164.
Dawande, M., P. Keskinocak, J. Swaminathan, and S. Tayur. 2001. On bipartite and multipartite clique problems. J. Algorithms 41:388-403.[CrossRef][ISI]
Dondoshansky, I. 2002. Blastclust (NCBI Software Development Toolkit), 6.1 edition. NCBI, Bethesda, MD.
Erdos, P. L., M. A. Steel, L. A. Szekely, and T. J. Warnow. 1999. A few logs suffice to build (almost) all trees (I). Random Struct. Algorithms 14:153-184.[CrossRef][ISI]
Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401-410.[ISI]
Garey, M. R., and D. S. Johnson. 1979. Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman, San Francisco.
Graham, S. W., and R. G. Olmstead. 2000. Utility of 17 chloroplast genes for inferring the phylogeny of the basal angiosperms. Am. J. Bot. 87:1712-1730.
Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130-131.[CrossRef][ISI][Medline]
Hochbaum, D. S. 1998. Approximating clique and biclique problems. J. Algorithms 29:174-200.[CrossRef][ISI]
Kearney, M. 2002. Fragmentary taxa, missing data, and ambiguity: mistaken assumptions and conclusions. Syst. Biol. 51:369-381.[CrossRef][ISI][Medline]
Krause, A., J. Stoye, and M. Vingron. 2000. The SYSTERS protein sequence cluster set. Nucleic Acids Res. 28:270-272.
Kriventseva, E. V., W. Fleischmann, E. M. Zdobnov, and R. Apweiler. 2001. CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res. 29:33-36.
Krzywinski, J., R. C. Wilkerson, and N. J. Besansky. 2001. Toward understanding Anophelinae (Diptera, Culicidae) phylogeny: insights from nuclear single-copy genes and the weight of evidence. Syst. Biol. 50:540-556.[CrossRef][ISI][Medline]
Lee, Y., R. Sultana, G. Pertea, J. Cho, S. Karamycheva, J. Tsai, B. Parvizi, F. Cheung, V. Antonescu, J. White, I. Holt, F. Liang, and J. Quackenbush. 2002. Cross-referencing eukaryotic genomes: TIGR orthologous gene alignments (TOGA). Genome Res. 12:493-502.
Liu, F.-G. R., M. M. Miyamoto, N. P. Freire, P. Q. Ong, M. R. Tennant, T. S. Young, and K. F. Gugel. 2001. Molecular and morphological supertrees for eutherian (placental) mammals. Science 291:1786-1789.
Madsen, O., M. Scally, C. J. Douady, D. J. Kao, R. W. DeBry, R. Adkins, H. M. Amrine, M. J. Stanhope, W. W. de Jong, and M. S. Springer. 2001. Parallel adaptive radiations in two major clades of placental mammals. Nature 409:610-614.[CrossRef][ISI][Medline]
Miya, M., and M. Nishida. 2000. Use of mitogenomic information in teleostean molecular phylogenetics: a tree-based exploration under the maximum-parsimony optimality criterion. Mol. Phylog. Evol. 17:437-455.[CrossRef][ISI][Medline]
Murphy, W. J., E. Eizirik, W. E. Johnson, Y. P. Zhang, O. A. Ryder, and S. J. O'Brien. 2001. Molecular phylogenetics and the origins of placental mammals. Nature 409:614-618.[CrossRef][ISI][Medline]
Peeters, R. 2000. The maximum edge biclique problem is NP-complete. Preprint.
Qiu, Y.-L., J. Lee, F. Bernasconi-Quadroni, D. E. Soltis, P. S. Soltis, M. Zanis, E. A. Zimmer, Z. Chen, V. Savolainen, and M. W. Chase. 1999. The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature 402:404-407.[CrossRef][ISI][Medline]
Remm, M., C. E. V. Storm, and E. L. L. Sonnhammer. 2001. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314:1041-1052.[CrossRef][ISI][Medline]
Sanderson, M. J., A. Purvis, and C. Henze. 1998. Phylogenetic supertrees: assembling the trees of life. Trends Ecol. Evol. 13:105-109.[CrossRef][ISI]
Savolainen, V., M. W. Chase, S. B. Hoot, C. M. Morton, D. E. Soltis, C. Bayer, M. F. Fay, A. Y. de Bruijn, S. Sullivan, and Y.-L. Qiu. 2000. Phylogenetics of flowering plants based on combined analysis of plastid atpB and rbcL gene sequences. Syst. Biol. 49:306-362.[CrossRef][ISI][Medline]
Soltis, D. E., P. S. Soltis, M. E. Mort, M. W. Chase, V. Savolainen, S. B. Hoot, and C. M. Morton. 1998. Inferring complex phylogenies using parsimony: an empirical approach using three large DNA data sets for angiosperms. Syst. Biol. 47:32-42.[CrossRef][ISI][Medline]
Soltis, P. S., D. E. Soltis, and M. W. Chase. 1999. Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 402:402-404.[CrossRef][ISI][Medline]
Swofford, D. L. 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp 407514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer Associates, Sunderland, Mass.
Tatusov, R. L., D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, and E. V. Koonin. 2001. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 29:22-28.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.[Abstract]
West, D. B. 2001. Introduction to graph theory, 2nd edition. Prentice-Hall, Upper Saddle River, N.J.
Wiens, J. J. 1998. Does adding characters with missing data increase or decrease phylogenetic accuracy? Syst. Biol. 47:625-640.[CrossRef][ISI][Medline]
Wolf, Y. I., I. B. Rogozin, N. V. Grishin, and E. V. Koonin. 2002. Genome trees and the tree of life. Trends Genet. 18:472-479.[CrossRef][ISI][Medline]