*Department of Life Sciences, Indiana State University;
Department of Mathematics, Rose-Hulman Institute of Technology
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
An alternative approach makes use of data selection rather than data compression. In this case, attempts are made to select the most informative character comparisons for analysis (e.g., Grundy and Naylor 1999
). Whereas these methods are likely to be more rigorous than data compression, they are also more demanding in terms of computational time and resources. However, rapid increases in computational power and whole genome information constantly force a reevaluation of these methods for comparing biomolecular sequences (Galperin and Koonin 2000
).
In this report, we describe a method for generating phylogenetic trees from whole genome data. After the conversion of a large multigenome data set into a matrix of short character string frequencies, the matrix analysis method referred to as singular value decomposition (SVD) is applied. The output of the SVD can be used to (1) select the most informative biomolecular sequence characteristics for comparison, and (2) estimate evolutionary distances between sequences using these selected characteristics. This application of SVD to describe biomolecular sequences is similar to its application to large compilations of text documents (Landauer, Foltz, and Laham 1998
; Berry, Drmac, and Jessup 1999
). The method, as currently designed, operates only on protein sequences and can be applied to all genome sequences that are accompanied by nearly complete sets of predicted coding regions. The optimized pairwise protein distances obtained after SVD can be used directly to generate a multiorganism gene tree or combined by species to generate a derived species tree. A distinct advantage of the method is that sequence alignments are not required (Stuart et al. 2002
).
In this study, we generate integrated gene and species trees for 832 mitochondrial proteins obtained from 64 whole vertebrate genomes. We also explain the technique in terms of abstract multidimensional vector spaces and interpret the SVD-derived independent characteristics as representing novel forms of correlated sequence motifs. This application sets the stage for future analyses involving a variety of more complex whole genome data sets.
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
SVD-Based Protein Sequence Representation
Protein sequences were recoded as tetrapeptide frequency values using all possible overlapping tetrapeptides. SVD was performed on the resulting tetrapeptide versus protein data matrix applying the program bls-1 from SVDPACK (Berry 1992
). Dimension reduction was accomplished by setting the smaller singular values resulting from SVD equal to zero. The optimal dimension reduction was estimated by evaluating the gene tree output for a given dimension setting as described subsequently. Vector definitions for all 832 mitochondrial proteins were retrieved from the SVD output and used to calculate pairwise cosine values, which were subsequently converted into a matrix of pairwise evolutionary distances using the formula dij = -ln [(1 + cos
)/2]. This formula is derived from the standard distance formula d = -lnS and converts a similarity measure ranging from 0 to 1 into a distance measure potentially ranging from 0 to infinity. The corresponding species distances were estimated by summing the 13 vector definitions for each organism and then using the pairwise cosine values between the derived species vectors to estimate the evolutionary distance (Stuart et al. 2002
).
Tree Generation and Analysis
Distance matrices derived via SVD were converted into phylogenetic trees using the NEIGHBOR program from PHYLIP (Felsenstein, University of Washington). The faster UPGMA option of Neighbor was used to estimate the computationally intensive gene trees (832 OTUs). Corresponding species trees (64 OTUs) were estimated using the NJ option of NEIGHBOR. Gene trees were evaluated by determining how well individual members of the 13 families of mitochondrial genes were sorted into uninterrupted family groupings. Adjacent members of the same family were counted as a group.
![]() |
Results and Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Tetrapeptides representing a protein include all possible overlapping tetrapeptides. This has two advantages: (1) no important functional peptide present in a sequence will be missed because it is out of phase with the scanning window, and (2) overlapping peptides tend to define a unique order for peptides within the sequence. In contrast, nonoverlapping windows would miss certain peptide patterns that are actually present and would provide no information about the order of the peptides in the sequence. Overlapping, partially redundant information is required to provide unique or nearly unique reconstructions for most sequences. Hence, the overlapping tetrapeptide representation of each protein is in essence a detailed fingerprint that serves to uniquely identify and accurately describe the protein (fig. 1 ).
|
Vector representations in high-dimensional space provide easily accessible relative measures of relatedness (fig. 2A ). For example, any pair of proteins that differ by a single amino acid must also differ by one to four tetrapeptides and would therefore be represented in high-dimensional space by slightly different vectors. Thus, proteins can be recognized as close relatives because their vectors are oriented in space in nearly the same direction. As proteins evolve and diverge, their vector representations in space begin to separate, increasing the measured angle between them and reducing the calculated cosine of that angle further from its maximum value of unity. Hence, documenting the relatedness of all the sequences in a data set can be accomplished by arranging all the cosine values into a pairwise similarity matrix or following a simple conversion, a pairwise distance matrix. The latter is frequently used as an input to one of several standard programs used to generate evolutionary trees (e.g., NEIGHBOR).
|
The motifs described previously are composed of two classes of correlated peptides: those peptides that appear simultaneously, perhaps as adjacent peptides within a given motif, and those that are interchangeable with other peptides and therefore appear exclusively within a given motif. A comprehensive set of interchangeable peptides accompanied by quantitative estimates of their interchangeability can serve to define a model of sequence evolution. Similar models are frequently embodied in empirically derived PAM tables. Alternative models could be based on the relative interchangeability of di-, tri-, or tetrapeptides (or peptides of any size). Whereas PAM tables have only 202 elements, a tetrapeptide version would have 160,0002 elements. The initial tetrapeptide frequency matrix described previously can be used to derive a similar table representing a model of sequence evolution. Each row of protein frequency values serves to define each tetrapeptide as a vector in peptide definition space (the row space of the tetrapeptide frequency matrix). Hence, pairwise peptide similarity can be calculated using pairwise cosine values. Peptide similarities derived in this way are somewhat different than those embodied in PAM tables because they encompass both kinds of similarity described above: simultaneous co-occurrence and interchangeability.
The peptide definition space is different from the one used earlier to define proteins, although it is constructed using the same frequency data. If there are 832 protein sequences in the data set, then the vectors defining each tetrapeptide exist in an 832 dimensional space, which is distinct from the 160,000 dimensional space used for proteins (fig. 2 ). As in the case of the protein space, the peptide space described by the initial frequency matrix provides a relatively poor definition of peptides. This arises because the proteins that define the orthogonal axes in the space are not all independent; the presence of a given peptide in one member of a protein family generally predicts its presence in all members of that family.
Proteins and peptides would be more accurately defined in high-dimensional space by orthogonal axes representing truly independent characters. We suggest that motifs composed of correlated sets of peptides can provide close approximations to these independent characters (fig. 2C
). Motifs defined in this way have several useful properties. As long as the correlated peptides that define the motifs remain unchanged, motifs can be of any size and may contain any number of gaps resulting from sequence insertions or deletions. We refer to these motifs as correlated peptide motifs, or simply, copep motifs. These relatively flexible motifs can be contrasted with those defined by the local sequence alignments typically utilized in biomolecular phylogenies. Local alignments define motifs using models of sequence evolution that fail to account for insertions and deletions of more than a few characters (Thorne 2000
).
The SVD: Correlated Peptide Motifs Emerge as New Orthogonal Axes
Improved vector definitions for both proteins and peptides can be obtained by decomposing the initial frequency matrix A into three separate matrices U, , and V using SVD (fig. 3
). The three resulting matrices can be used to reconstruct the original matrix using the relation A = U
VT. From a graphical perspective, the effect of the SVD is to remodel both the peptide space and the protein space such that peptide vectors and protein vectors are represented using a newly derived set of orthonormal axis vectors. Hence, the V matrix is a protein matrix that defines proteins as vectors using orthogonal axes that represent derived independent characteristics. The U matrix is a peptide matrix that defines peptides as vectors using the same set of independent characteristics. The singular values from the diagonal
matrix, together with the corresponding rows or columns of the protein and peptide matrices comprise the singular triplets presented in rank order. The triplets with the largest singular values identify the most important independent characteristics serving to define both proteins and peptides in the data set. As suggested earlier, our interpretation of these independent characteristics is that they represent sequence motifs comprised of correlated sets of peptides or copep motifs. A preliminary inspection of the first few ranked singular triplets for the mitochondrial data set analyzed here shows that this is indeed the case. For example, the 11th singular vector describes an ND3 protein motif by defining a particular linear combination of peptides that appears primarily, but not exclusively, as a contiguous sequence only within this highly conserved family (not shown).
|
Reduced dimensionality has the effect of making the problem of vector comparisons computationally approachable. In addition, and perhaps more importantly, it tends to improve the relative accuracy of protein vector definitions by discarding a substantial fraction of the noise (including homoplasy) in the data. If 1,...,
r are the positive singular values of the tetrapeptide frequency matrix, then the singular vectors associated with any particular singular value (e.g.,
j) accounts for the fraction
|
|
|
Determining the number (k) of ranked singular values that best serve to separate signal from noise within a data set is difficult. With the mitochondrial data set considered here, it is possible to estimate the optimal number of ranked singular values by measuring how well the 832 sequences are grouped into the 13 known mitochondrial gene families (fig. 4 ). In our example, as the number of included dimensions increases from 10 to about 60, the UPGMA tree calculated from the vector-based pairwise distance matrix steadily improves. At 66, 73, and 74 dimensions, an optimum of 16 groupings is observed in the tree. At higher dimensions, the tree begins to deteriorate again. Our conclusion is that this 832 protein data set is best described by 6480 copep motifs distinguishable within the 13 mitochondrial protein families (roughly 56 per protein); that is, we should retain the motifs defined by the singular vectors associated with the largest 6480 singular values as useful signal and discard the remainder (out of 832) as distracting noise which has degraded the integrity of the data.
|
Integrated Gene and Species Trees for 64 Vertebrates Based on Whole Mitochondrial Genome Sequence
The 832 member gene tree at 74 dimensions is summarized in figure 5 . This tree is one of the three in which a minimal number of groupings (16) was observed. In this tree, all sequences from a given gene family that appear within monophyletic groups have been compressed into a single branch for efficient presentation. Only CYTB, ND3, and ND5 fail to form single monophyletic branches containing all 64 sequences. In addition, both CYTB and ND5 have only one sequence each that fails to group with their remaining family members, whereas ND3 is separated into one mammalian branch (31 genes) and one nonmammalian branch (33 genes). Although the deeper branches within this gene tree may reflect possible evolutionary relationships between certain mitochondrial gene families, we place little significance on these distant relationships, as all these sequences, by default, appear in the tree independent of their degree of relatedness. The relationships of the more similar sequences within each family are more useful.
|
The species tree generated at 74 dimensions was nearly identical to the consensus tree derived from the 17 individual species trees obtained using dimension settings ranging from 64 to 80 (fig. 6
). These trees were similar in size and structure to a recently generated 69 member species tree (Pollack et al. 2000
), placing the vast majority of species into well-accepted groupings. Within the mammals, perissodactyls, carnivores, and cetartiodactyls are grouped together as expected (Xu, Janke, and Arnason 1996
; Arnason et al. 2000
; Murphy et al. 2001
). In addition, these three groups, which form the ferungulates (+cetacea), cluster with the mole (Teur) and the bat (Ajam), as observed in recent independent analyses (Mouchaty et al. 2000
; Nikaido et al. 2000
). These groups and the remaining eutherian mammals are rooted in this portion of the tree by the noneutherian mammals, within which the monotremes are seen to be more closely aligned with marsupials (Janke, Xu, and Arnason 1997
). The overall structure of the mammalian portion of the tree is very similar to that derived in an analysis focused on mammalian mitochondrial genomes by the same method (Stuart et al. 2002
). There is a somewhat surprising tendency of elephants to group with primates. Because few organisms likely to be close relatives of the elephant (e.g., Dnov) are represented in our data set and also because elephants and primates both branch near the root of recently published mammalian trees (Murphy et al. 2001
), the association of elephants with primates seen in our tree is perhaps not too unreasonable.
|
The 17-fold consensus tree (fig. 6
) was obtained using dimension settings within the optimization well ranging from 64 to 80 (fig. 4
). We interpret the included dimensions to the left of this well to be associated with signal and included dimensions to the right to be associated with noise. Within the well, however, it is difficult to separate useful information from misinformation. Consequently, dimension settings within the range 6480 may be equally valid. This interpretation is reinforced by the fact that the consensus tree over this range is nearly identical to the species tree obtained at 74 dimensions. This observation also suggests that both trees are relatively stable approximations of the true tree. Only a few branches in the consensus tree appear in less than 50% of the trees; one of these calls into question the monophylogeny of rodents, which is currently an unresolved issue (see Kramerov, Vassetzky, and Serdobova 1999
; Reyes et al. 2000
). Other poorly supported branches question the monophylogeny of cetartiodactyls, the specific grouping of bats with cetartiodactyls (although the grouping with ferungulates is stable), the specific grouping of rabbits and rodents with ferungulates, and the sister group relationship between birds and mammals. The latter instability may be because of a tendency for reptiles and birds to group together as sauropods. In fact, this grouping was specifically observed in the tree generated at 74 dimensions (not shown).
![]() |
Conclusions |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Optimal dimensions were estimated by reference to prior categorical information concerning family memberships; no prior alignment information of any kind was used as input. Although many data sets may lack prior categorical information, such information is commonly available for even very large data sets (e.g., the COG database: http://www.ncbi.nlm.nih.gov/COG/). Nevertheless, the development of a procedure whereby optimal dimensions can be approximated without reference to prior information would represent an important advancement (e.g., Ding 1999
).
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Keywords: genomics
mitochondrial DNA
molecular phylogenetics
molecular systematics
sequence analysis
singular value decomposition
Address for correspondence and reprints: Gary W. Stuart, Department of Life Sciences, Indiana State University, Terre Haute, Indiana 47809. G-Stuart{at}indstate.edu
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Arnason U., A. Gullberg, S. Gretarsdottir, B. Ursing, A. Janke, 2000 The mitochondrial genome of the sperm whale and a new molecular reference for estimating eutherian divergence dates J. Mol. Evol 50:569-578[ISI][Medline]
Berry M. W., 1992 Large scale singular value computations Int. J. Supercomput. Appl 6:13-49[ISI]
Berry M. W., Z. Drmac, E. R. Jessup, 1999 Matrices, vector spaces, and information retrieval SIAM Rev 41:335-362[ISI]
Boore J. L., W. M. Brown, 1998 Big trees from little genomes: mitochondrial gene order as a phylogenetic tool Curr. Opin. Genet. Dev 8:668-674[ISI][Medline]
Curole J. P., T. D. Kocher, 1999 Mitogenomics: digging deeper with complete mitochondrial genomes Trends Ecol. Evol 14:394-398[ISI][Medline]
Ding C. H. Q., 1999 A similarity-based probability model for latent semantic indexing Pp. 5965. Proc. 22nd ACM SIGIR Conf. (SIGIR'99)
Fitz-Gibbon S. T., C. H. House, 1999 Whole genomebased phylogenetic analysis of free-living microorganisms Nucleic Acids Res 271:4218-4222
Galperin M. Y., E. V. Koonin, 2000 Who's your neighbor? New computational approaches for functional genomics Nat. Biotechnol 186:609-613
Grundy W. N., G. J. Naylor, 1999 Phylogenetic inference from conserved sites alignments J. Exp. Zool 285:128-139[ISI][Medline]
Harlid A., A. Janke, U. Arnason, 1998 The complete mitochondrial genome of Rhea americana and early avian divergences J. Mol. Evol 46:669-679[ISI][Medline]
Janke A., X. Xu, U. Arnason, 1997 The complete mitochondrial genome of the wallaroo Macropus robustus and the phylogenetic relationship among Monotremata, Marsupialia, and Eutheria Proc. Natl. Acad. Sci. USA 94:1276-1281
Kramerov D., N. Vassetzky, I. Serdobova, 1999 The evolutionary position of dormice Gliridae in Rodentia determined by a novel short retroposon Mol. Biol. Evol 165:715-717
Landauer T. K., P. W. Foltz, D. Laham, 1998 Introduction to latent semantic analysis Discourse Proc 25:259-284
Li M., J. H. Badger, X. Chen, S. Kwong, P. Kearney, H. Zhang, 2001 An information-based sequence distance and its application to whole mitochondrial genome phylogeny Bioinformatics 17:149-154[Abstract]
Lin J., M. Gerstein, 2000 Whole-genome trees based on the occurrence of folds and orthologs, implications for comparing genomes on different levels Genome Res 10:808-818
Mouchaty S. K., A. Gullberg, A. Janke, U. Arnason, 2000 The phylogenetic position of the Talpidae within eutheria based on analysis of complete mitochondrial sequences Mol. Biol. Evol 17:60-67
Murphy W. J., E. Eizirik, W. E. Johnson, Y. P. Zhang, O. A. Ryder, S. J. O'Brien, 2001 Molecular phylogenetics and the origins of placental mammals Nature 409:614-618[ISI][Medline]
Nakashima H., M. Ota, K. Nishikawa, T. Ooi, 1998 Genes from nine genomes are separated into their organisms in the dinucleotide composition space DNA Res 5:251-259[Medline]
Nikaido M., M. M. Harad, Y. Cao, M. Hasegawa, N. Okada, 2000 Monophyletic origin of the order chiroptera and its phylogenetic position among mammalia, as inferred from the complete sequence of the mitochondrial DNA of a japanese megabat, the ryukyu flying fox Pteropus dasymallus J. Mol. Evol 51:318-328[ISI][Medline]
Pollack D. D., J. A. Eisen, N. A. Doggett, M. P. Cummings, 2000 A case for evolutionary genomics and the comprehensive examination of sequence biodiversity Mol. Biol. Evol 17:1776-1788
Rasmussen A. S., U. Arnason, 1999 Molecular studies suggest that cartilaginous fishes have a terminal position in the piscine tree Proc. Natl. Acad. Sci. USA 965:2177-2182
Reyes A. C., C. Gissi, G. Pesole, F. M. Catzeflis, C. Saccone, 2000 Where do rodents fit? Evidence from the complete mitochondrial genome of Sciurus vulgaris Mol. Biol. Evol 17:979-983
Snel B., P. Bork, M. A. Huynen, 1999 Genome phylogeny based on gene content Nat. Genet 21:108-110[ISI][Medline]
Stuart G. W., K. Moffet, S. Baker, 2002 Integrated gene and species phylogenies from unaligned whole genome sequence Bioinformatics 18:100-108
Tekaia F., A. Lazcano, B. Dujon, 1999 The genomic tree as revealed from whole proteome comparisons Genome Res 9:550-557
Thorne J. L., 2000 Models of protein sequence evolution and their applications Curr. Opin. Genet. Dev 10:602-605[ISI][Medline]
Xu X., A. Janke, U. Arnason, 1996 The complete mitochondrial DNA sequence of the greater indian rhinoceros, Rhinoceros unicornis, and the phylogenetic relationship among Carnivora, Perissodactyla, and Artiodactyla Mol. Biol. Evol 13:1167-1173[Abstract]