* Department of Biology, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
Institute of Theoretical Physics, The Chinese Academy of Sciences, Beijing, China
Programs in Statistics and Operations Research, Queensland University of Technology, Brisbane, Australia
Department of Mathematics, Xiangtan University, Hunan, China
Correspondence: E-mail: kahouchu{at}cuhk.edu.hk.
Abstract
Key Words: chloroplast genome plant phylogeny
Introduction
Materials and Methods
Composition Vectors and Distance Matrix
We base our analysis on all protein sequences, including hypothetical reading frames from each genome, regarding sequences of the 20 amino acids as symbolic sequences. In such a sequence of length $L$, there are a total of $N = 20^K$ possible types of strings of length $K$. We use a window of length $K$ and slide it through the sequences by shifting one position at a time to determine the frequencies of each of the $N$ kinds of strings in each genome. A protein sequence is excluded if its length is shorter than $K$. The observed frequency $p(\alpha_1\alpha_2\cdots\alpha_K)$ of a $K$-string $\alpha_1\alpha_2\cdots\alpha_K$ is defined as

$$p(\alpha_1\alpha_2\cdots\alpha_K) = \frac{n(\alpha_1\alpha_2\cdots\alpha_K)}{L - K + 1},$$

where $n(\alpha_1\alpha_2\cdots\alpha_K)$ is the number of times that $\alpha_1\alpha_2\cdots\alpha_K$ appears in this sequence. For example, in the protein sequence "MKRTFQPSILKRNRSHGFRIRMATKNGRYILSRRRAKLRTRLTVSSK," p(R) = 11/47, p(MR) = 0, p(RR) = 2/(47 − 2 + 1) = 1/23, and p(RRR) = 1/(47 − 3 + 1) = 1/45. Denoting by $m$ the number of protein sequences from each complete genome, the observed frequency of a $K$-string $\alpha_1\alpha_2\cdots\alpha_K$ is defined as

$$p(\alpha_1\alpha_2\cdots\alpha_K) = \frac{\sum_{j=1}^{m} n_j(\alpha_1\alpha_2\cdots\alpha_K)}{\sum_{j=1}^{m} (L_j - K + 1)};$$

here $n_j(\alpha_1\alpha_2\cdots\alpha_K)$ means the number of times that $\alpha_1\alpha_2\cdots\alpha_K$ appears in the $j$th protein sequence, and $L_j$ is the length of the $j$th protein sequence in this complete genome.
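The sliding-window count described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' program: the function names `kstring_counts` and `kstring_freqs` are hypothetical, and exact fractions are used only so that the worked example from the text can be reproduced without rounding.

```python
from collections import Counter
from fractions import Fraction

def kstring_counts(seq, k):
    """Count overlapping K-strings using a window that shifts one position at a time."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kstring_freqs(proteins, k):
    """Observed K-string frequencies pooled over all proteins of one genome.

    Proteins shorter than k are excluded, as stated in the text.
    Strings that never occur are simply absent from the result.
    """
    counts = Counter()
    windows = 0  # total number of window positions, sum of (L_j - k + 1)
    for seq in proteins:
        if len(seq) < k:
            continue
        counts.update(kstring_counts(seq, k))
        windows += len(seq) - k + 1
    return {s: Fraction(n, windows) for s, n in counts.items()}

seq = "MKRTFQPSILKRNRSHGFRIRMATKNGRYILSRRRAKLRTRLTVSSK"
print(kstring_freqs([seq], 1)["R"])    # 11/47
print(kstring_freqs([seq], 2)["RR"])   # 1/23
print(kstring_freqs([seq], 3)["RRR"])  # 1/45
```

The three printed values match the example sequence in the text; "MR" does not occur, so it has no entry in the dictionary, which corresponds to p(MR) = 0.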
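Mutations occur in a random fashion at the molecular level, while selection shapes the direction of evolution. There is always some randomness in the composition of protein sequences, revealed by statistical properties of protein sequences at the single amino acid or oligopeptide level (see Weiss, Jimenez, and Herzel [2000] for a recent discussion on this point). To highlight the selective diversification of sequence composition, we subtract the random background from the simple counting results. If we perform direct counting for all strings of length (K − 1) and (K − 2), we can predict the expected frequency $q$ of appearance of a $K$-string $\alpha_1\alpha_2\cdots\alpha_K$ by using a Markov model (Brendel, Beckmann, and Trifonov 1986):

$$q(\alpha_1\alpha_2\cdots\alpha_K) = \frac{p(\alpha_1\alpha_2\cdots\alpha_{K-1})\,p(\alpha_2\alpha_3\cdots\alpha_K)}{p(\alpha_2\alpha_3\cdots\alpha_{K-1})}.$$

The random background is then subtracted by taking

$$X(\alpha_1\alpha_2\cdots\alpha_K) = \frac{p(\alpha_1\alpha_2\cdots\alpha_K) - q(\alpha_1\alpha_2\cdots\alpha_K)}{q(\alpha_1\alpha_2\cdots\alpha_K)}$$

when $q \neq 0$, and $X = 0$ when $q = 0$. The values of $X$ for all $N$ possible $K$-strings, arranged in a fixed order, form the composition vector $X$ of a genome; $Y$ denotes the corresponding vector of a second genome.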
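A minimal Python sketch of the Markov prediction and background subtraction is given here. It assumes that the predicted frequency is the product of the frequencies of the two (K − 1)-substrings divided by the frequency of the shared middle (K − 2)-string, which is the form used in the E. coli worked example later in this section, and that the subtracted component is (p − q)/q, taken as 0 when q = 0. The function name `composition_vector` and the dictionary layout are hypothetical. For brevity the sketch visits only strings observed at length K; strings with p = 0 but q > 0 would have the value −1 and are not enumerated.

```python
def composition_vector(p_k, p_k1, p_k2):
    """Subtract the Markov-model background from observed K-string frequencies.

    p_k, p_k1, p_k2 map strings of length K, K-1 and K-2 to observed frequencies.
    Returns a dict mapping each observed K-string to (p - q) / q, or 0.0 if q == 0.
    """
    x = {}
    for s, p in p_k.items():
        middle = p_k2.get(s[1:-1], 0)
        # q = p(a1..aK-1) * p(a2..aK) / p(a2..aK-1); treated as 0 if the middle is unseen
        q = p_k1.get(s[:-1], 0) * p_k1.get(s[1:], 0) / middle if middle else 0
        x[s] = (p - q) / q if q else 0.0
    return x

# Toy frequencies (illustrative values only, not data from the paper).
x = composition_vector({"ABC": 0.5}, {"AB": 0.5, "BC": 0.25}, {"B": 0.5})
print(x["ABC"])
```

In the toy call the predicted frequency is 0.5 × 0.25 / 0.5 = 0.25, so the observed frequency of 0.5 is twice the background expectation.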
If we view the $N$ components in vectors $X$ and $Y$ as the samples of two zero-mean random variables, respectively, the correlation $C(X,Y)$ between any two genomes $X$ and $Y$ is defined in the usual way in probability theory as

$$C(X,Y) = \frac{\sum_{i=1}^{N} X_i \times Y_i}{\left(\sum_{i=1}^{N} X_i^2 \times \sum_{i=1}^{N} Y_i^2\right)^{1/2}}.$$

The distance $D(X,Y)$ between the two genomes is then defined as

$$D(X,Y) = \frac{1 - C(X,Y)}{2}.$$

A distance matrix for all the genomes under study is then generated for construction of phylogenetic trees.
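The correlation and distance can be sketched in Python as follows. This is a hedged illustration: composition vectors are assumed to be stored as sparse dictionaries keyed by K-string (absent keys count as zero), both vectors are assumed to be nonzero, and the names `correlation`, `distance` and `distance_matrix` are hypothetical.

```python
import math

def correlation(x, y):
    """Cosine-type correlation C(X, Y) between two sparse composition vectors."""
    keys = set(x) | set(y)
    dot = sum(x.get(k, 0.0) * y.get(k, 0.0) for k in keys)
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y)  # assumes neither vector is all zeros

def distance(x, y):
    """D(X, Y) = (1 - C(X, Y)) / 2, which lies between 0 and 1."""
    return (1.0 - correlation(x, y)) / 2.0

def distance_matrix(vectors):
    """Pairwise distances for a dict {genome name: composition vector}."""
    names = sorted(vectors)
    matrix = [[distance(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix
```

Because C(X,Y) ranges from −1 to 1, the distance is 0 for perfectly correlated vectors, 0.5 for uncorrelated ones, and 1 for perfectly anticorrelated ones.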
The vector p that we described is identical to the peptide frequency vector used by Stuart, Moffet, and Baker (2002) and Stuart, Moffet, and Leader (2002). However, their method of structure removal is entirely different from our method. Starting from the vector p, these authors used singular value decomposition (SVD) and then dimension reduction on their constructed matrix. The correlation distance is then used to construct the tree. In our method, we subtract random background via a Markov model for q and X. The SVD step is much more complicated than our method in both theoretical and practical considerations.
Tree Construction and Statistical Test of the Trees
Different distance methods, including Fitch-Margoliash (Fitch and Margoliash 1967), neighbor-joining (Saitou and Nei 1987), and minimum evolution (Saitou and Imanishi 1989), are used to construct the phylogenetic trees. A previous study on prokaryotes shows that the topology of the trees stabilized for K ≥ 5 (Qi, Wang, and Hao 2004). In the present study, we used K = 4 or 5 in our analysis, and the topologies of the resulting trees are similar. Here we present the results based on K = 5. We conducted the analysis on all the 34 genomes, as well as on the 21 chloroplast genomes alone using Synechocystis as the outgroup. The former analysis aims to explore the origin of the chloroplast genome, whereas the latter analysis is for comparison with previous phylogenetic analyses (Martin et al. 1998, 2002; Turmel, Otis, and Lemieux 1999; De Las Rivas, Lozano, and Ortiz 2002) that include most of the chloroplast genomes in our analysis and use the same outgroup taxon. The distance matrix generated from this analysis is available at http://www.itp.ac.cn/~qiji.
Bootstrapping is performed to give statistical support to the phylogenetic trees. Sequences of proteins are drawn randomly from a complete genome until the total number of proteins selected in each bootstrap is equal to the number of protein-coding genes of that particular genome. That is, in each bootstrap, some proteins may be selected more than once, whereas others may not be included at all. We generate a total of 100 bootstrap matrices and the bootstrap values are expressed as percentage of support for each branch.
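The protein-level resampling described above can be sketched in Python as follows. It is a minimal illustration under stated assumptions: proteins are drawn uniformly with replacement, the function names `bootstrap_proteome` and `bootstrap_replicates` are hypothetical, and the fixed seed is only there to make the sketch reproducible. Each resampled proteome would then be turned into a composition vector and a distance matrix in the same way as the original data.

```python
import random

def bootstrap_proteome(proteins, rng):
    """Draw len(proteins) sequences with replacement from one genome's proteins."""
    return rng.choices(proteins, k=len(proteins))

def bootstrap_replicates(proteins, n_replicates=100, seed=0):
    """Return n_replicates resampled proteomes for one genome."""
    rng = random.Random(seed)
    return [bootstrap_proteome(proteins, rng) for _ in range(n_replicates)]

# Toy proteome (illustrative sequences only).
replicates = bootstrap_replicates(["MKRT", "FQPS", "ILKR", "NRSH"])
print(len(replicates), len(replicates[0]))
```

Sampling with replacement is what allows some proteins to appear more than once in a replicate while others are left out.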
An IBM cluster of 64 CPUs with 3-GB memory is used for the computation of this study. All the calculations take more than 100 h.
Analysis of the Subtraction Procedure
To elucidate the biological meaning of the subtraction procedure, we have performed a concrete analysis on the example of Escherichia coli at string length K = 5. There are 1,343,887 nonzero five-strings belonging to 841,832 different string types. Among all the counts, the maximal one is 58 for the string "GKSTL." The frequency of the substrings "GKST" and "KSTL" is 113 and 77, respectively. The frequency of the middle string "KST" is 247. Thus, the predicted value is (113 × 77)/247 = 35.2267 compared with the real count 58 (neglecting the normalization factor when L >> K). The corresponding component in the composition vector after subtraction is (58 − 35.2267)/35.2267 = 0.646478.
On the contrary, the string "HAMSC" only appears once in E. coli. Its substrings "HAMS" and "AMSC" also appear only once; the frequency of the middle three-string "AMS" is 198. Its predicted value is (1 × 1)/198 = 0.00505051. The corresponding component after subtraction becomes (1 − 0.00505051)/0.00505051 = 197, making "HAMSC" the largest component in the vector.
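The arithmetic of these two examples can be checked with a short Python sketch. As in the text, raw counts stand in for frequencies, which neglects the normalization factor when L >> K; the function name `subtracted_component` is hypothetical.

```python
def subtracted_component(n_k, n_left, n_right, n_mid):
    """Return (predicted count, component after subtraction) from raw counts.

    n_k     count of the K-string
    n_left  count of its first (K-1)-substring
    n_right count of its last (K-1)-substring
    n_mid   count of the shared middle (K-2)-string
    """
    predicted = n_left * n_right / n_mid
    return predicted, (n_k - predicted) / predicted

print(subtracted_component(58, 113, 77, 247))  # "GKSTL": roughly 35.23 and 0.646
print(subtracted_component(1, 1, 1, 198))      # "HAMSC": roughly 0.00505 and 197
```

The comparison shows how a string seen only once can outweigh the most frequent string once the count is measured against its expectation.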
To reveal the biological difference between the two strings "GKSTL" and "HAMSC," we searched for exact matches of these two pentapeptides in the Protein Information Resource (PIR) database, which at present contains more than 1.2 million protein sequences. The string "HAMSC" has 15 matches, among which one comes from a eukaryotic species, four (essentially the same protein) come from a virus, and 10 come from prokaryotes. Among those from prokaryotes, four are from E. coli and Shigella and two are from Salmonella, while the other prokaryotes with the string are closely related to Enterobacteria. In sharp contrast to "HAMSC," the string "GKSTL" has 6,121 matches with proteins from organisms of a wide taxonomic assortment, ranging from viruses to humans. As a commonly occurring pentapeptide, the string "GKSTL" in the E. coli genome does not carry much phylogenetic information, although it appears most frequently. On the contrary, the pentapeptide "HAMSC" is more characteristic of prokaryotes, especially Enterobacteria.
It can be argued that frequently occurring strings per se may not be significant for inferring phylogenetic relationships. In the parlance of classic cladistics, they contribute to plesiomorphic characters and should be eliminated under strict treatment. On the other hand, some strings with small counts, which are of apomorphic characters, may be more significant, if their counts are largely different from what is predicted by a reasonable statistical model. The subtraction procedure helps to highlight these significant strings, although it is not always possible to evaluate the effect in a clear-cut way as we did above in the extreme cases.
After the subtraction procedure, the frequency of some peptides is reduced to zero, although the number of such strings is not large. By counting the number of strings whose values after subtraction fall in the range −0.1 to 0.1, we find that they make up only a small proportion: 6% in Cyanophora and 7% in E. coli. We cannot say that these zero-strings are not important. Actually, they provide necessary information on the degree of dissimilarity among the species that eventually contributes to systematics.
From a mathematical point of view, the subtraction procedure can be considered as removing a multifractal structure before performing a cross-correlation analysis (similar to removing a time-varying mean in time series before computing the cross-correlation of two time series). The multifractal method has been discussed in Anh, Lau, and Yu (2001) and is not elaborated here.
We consider the subtraction of random background an essential step in our analysis. The phylogenetic trees generated without using this procedure are quite different. In fact, without this procedure, the topology is inconsistent with the phylogenetic relationships elucidated by traditional approaches. In the study by Qi, Wang, and Hao (2004), a tree of 109 species was generated without the subtraction procedure. In this tree, species of archaea, bacteria, and eukaryotes intermingle with one another and do not clearly cluster into three groups as in the tree presented in Qi, Wang, and Hao (2004). In the tree without subtraction, the groupings in lower systematic levels are in most cases not in agreement with those based on traditional methods. We also generated the chloroplast tree without subtraction of random background. The tree shows that, although all the chloroplasts cluster together, species of archaea and bacteria do not cluster into separate groups. From this comparison, it is apparent that subtraction of random background is necessary and crucial in our correlation analysis.
Results and Discussion
The chlorophyte-like chloroplast of euglenophytes is generally believed to have arisen from secondary symbiosis by capture of a green alga in the kinetoplastid lineage (Palmer and Delwiche 1998; Cavalier-Smith 2000). The euglenophyte Euglena branches basal to chlorophytes s.l. in our tree, consistent with recent analyses of complete chloroplast genomes (De Las Rivas, Lozano, and Ortiz 2002; Martin et al. 2002), although other analyses have placed Euglena within the green algae (Van de Peer et al. 1996; Köhler et al. 1997; Turmel, Otis, and Lemieux 1999). The chloroplasts of green algae, including Chlorella, Nephroselmis, and Mesostigma, are more closely related to land plants than to other algae (Wakasugi et al. 1997; Martin et al. 2002). Our analysis, however, suggests that this assemblage is paraphyletic, although the branching order among the three species receives little bootstrap support. ME and NJ trees grouping Mesostigma with Nephroselmis as prasinophytes are consistent with results from another correlation analysis of complete chloroplast genomes (De Las Rivas, Lozano, and Ortiz 2002). Yet an alternate topology (T1) from the FM tree indicates that Mesostigma is closely related to the streptophytes (including the charophyte Chaetosphaeridium and land plants). Previous molecular phylogenetic studies have also produced conflicting results on the placement of Mesostigma. The first complete chloroplast genome analysis of this species showed that it is an ancestral branch of green plant evolution, representing a lineage that emerged before the divergence of green algae and streptophytes (Lemieux, Otis, and Turmel 2000). Yet a recent analysis of chloroplast genome sequences showed that it is basal to land plants above the green algae (Martin et al. 2002), in accordance with a multigene analysis of a wide variety of charophytes assigning Mesostigma to a basal group of charophytes (Karol et al. 2001).
The difficulty in resolving the phylogeny of Mesostigma in relation to other members of chlorophytes s.l. in our analysis is possibly because of the limited taxon sampling of the chloroplasts in green algae and charophytes.
The charophyte Chaetosphaeridium globosum represents a basal branch of the streptophyte clade in all analyses. This is consistent with the chloroplast genome analysis of this species (Turmel, Otis, and Lemieux 2002), suggesting that charophytes were the immediate ancestors of land plants, or embryophytes (Graham, Cook, and Busse 2000). Whereas the support for the angiosperm (flowering plants) clade is strong, its relationships with other land plants are not well resolved in our analysis. An alternative topology (T2) of both the NJ and FM trees suggests that the angiosperms are more closely related to the liverwort Marchantia and the psilophyte Psilotum than to the conifer Pinus. Interestingly, a recent correlation analysis of the complete chloroplast genomes also indicates the same topology (De Las Rivas, Lozano, and Ortiz 2002). Whether this anomaly is caused by the almost complete loss of a large inverted repeat in Pinus (Wakasugi et al. 1994) as compared with other photosynthetic eukaryotes remains to be investigated. Our analysis clearly separates the angiosperms into two clades corresponding to the monocotyledons and eudicots, the two large clades in the current understanding of angiosperm phylogeny (Crane, Friis, and Pedersen 1995), although it should be noted that all the monocots included in the tree are members of a single family (Poaceae). The branching order within each clade is not well supported by bootstrapping. A different topology (T3) among three of the eudicots (Spinacia, Nicotiana, and Arabidopsis) is suggested by both the NJ and the FM trees as compared with the ME tree.
Our simple correlation analysis of the complete chloroplast genomes has yielded a tree that is in good agreement with our current knowledge of the origin of the chloroplasts and the phylogenetic relationships of different groups of photosynthetic eukaryotes as elucidated previously by traditional analyses of the chloroplast genomes and other molecular/ultrastructural approaches (e.g., Martin et al. 2002; De Las Rivas, Lozano, and Ortiz 2002; see also Palmer and Delwiche [1998] and McFadden [2001a, 2001b] for reviews). Our approach circumvents the ambiguity in the selection of genes from complete genomes for phylogenetic reconstruction, and is also faster than the traditional approaches of phylogenetic analysis, particularly when dealing with a large number of genomes. Moreover, because multiple sequence alignment is not necessary, the intrinsic problems associated with this complex procedure can be avoided. In contrast to a recent similar analysis of mitochondrial genomes based on compositional vectors (Stuart, Moffet, and Baker 2002; Stuart, Moffet, and Leader 2002), our approach does not require prior information on gene families in the genome and is also simpler in the method used for subtraction of random background from the data set (see Materials and Methods). We have also shown that this approach is applicable to analyzing the much larger genomes of chloroplasts, as well as those of prokaryotes (Qi, Wang, and Hao 2004). We believe that the present approach is an important step towards the analysis of the wealth of information provided by genome projects. In view of the lower resolving power (i.e., relatively low bootstrap support in most of the branches) as compared with the conventional analysis of chloroplast genomes (e.g., Martin et al. 2002), further refinements of the method are being explored in our laboratories, along with the question of the nature of the phylogenetic signals revealed by our method.
It is hoped that efforts in this line of research will provide us with fast and useful tools in comparative genome analysis as well as insights on genome structure and evolution.
Acknowledgements
Footnotes
Literature Cited
Adachi, J., P. J. Waddell, W. Martin, and M. Hasegawa. 2000. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J. Mol. Evol. 50:348-358.
Anh, V. V., K. S. Lau, and Z. G. Yu. 2001. Multifractal characterization of complete genomes. J. Phys. A: Math. Gen. 34:7127-7139.
Brendel, V., J. S. Beckmann, and E. N. Trifonov. 1986. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J. Biomol. Struct. Dyn. 4:11-21.
Cavalier-Smith, T. 2000. Membrane heredity and early chloroplast evolution. Trends Plant Sci. 5:174-182.
Crane, P. R., E. M. Friis, and K. R. Pedersen. 1995. The origin and early diversification of angiosperms. Nature 374:27-33.
De Las Rivas, J., J. J. Lozano, and A. R. Ortiz. 2002. Comparative analysis of chloroplast genomes: functional annotation, genome-based phylogeny, and deduced evolutionary patterns. Genome Res. 12:567-583.
Delwiche, C. F. 1999. Tracing the thread of plastid diversity through the tapestry of life. Am. Nat. 154:S164-S177.
Douglas, S. E., and S. L. Penny. 1999. The plastid genome of the cryptophyte alga, Guillardia theta: complete sequence and conserved synteny groups confirm its common ancestry with red algae. J. Mol. Evol. 48:236-244.
Edwards, S. V., B. Fertil, A. Giron, and P. J. Deschavanne. 2002. A genomic schism in birds revealed by phylogenetic analysis of DNA strings. Syst. Biol. 51:599-613.
Fitch, W. M., and E. Margoliash. 1967. Construction of phylogenetic trees. Science 155:279-284.
Fitz-Gibbon, S. T., and C. H. House. 1999. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 27:4218-4222.
Graham, L. E., M. E. Cook, and J. E. Busse. 2000. The origin of plants: body plan changes contributing to a major evolutionary radiation. Proc. Natl. Acad. Sci. USA 97:4535-4540.
Gray, M. W. 1992. The endosymbiont hypothesis revisited. Int. Rev. Cytol. 141:233-357.
Gray, M. W. 1999. Evolution of organellar genomes. Curr. Opin. Genet. Dev. 9:678-687.
Karol, K. G., R. M. McCourt, M. T. Cimino, and C. F. Delwiche. 2001. The closest living relatives of land plants. Science 294:2351-2353.
Köhler, S., C. F. Delwiche, P. W. Denny, L. G. Tilney, P. Webster, R. J. M. Wilson, J. D. Palmer, and D. S. Roos. 1997. A plastid of probable green algal origin in apicomplexan parasites. Science 275:1485-1489.
Lemieux, C., C. Otis, and M. Turmel. 2000. Ancestral chloroplast genome in Mesostigma viride reveals an early branch of green plant evolution. Nature 403:649-652.
Li, M., J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. 2001. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17:149-154.
Lin, J., and M. Gerstein. 2000. Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes at different levels. Genome Res. 10:808-818.
McFadden, G. I. 2001a. Primary and secondary endosymbiosis and the origin of plastids. J. Phycol. 37:951-959.
McFadden, G. I. 2001b. Chloroplast origin and integration. Plant Physiol. 125:50-53.
Martin, W., and R. G. Herrmann. 1998. Gene transfer from organelles to the nucleus: how much, what happens, and why? Plant Physiol. 118:9-17.
Martin, W., B. Stoebe, V. Goremykin, S. Hansmann, M. Hasegawa, and K. V. Kowallik. 1998. Gene transfer to the nucleus and the evolution of chloroplasts. Nature 393:162-165.
Martin, W., T. Rujan, E. Richly, A. Hansen, S. Cornelsen, T. Lins, D. Leister, B. Stoebe, M. Hasegawa, and D. Penny. 2002. Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus. Proc. Natl. Acad. Sci. USA 99:12246-12251.
Moreira, D., H. Le Guyader, and H. Philippe. 2000. The origin of red algae and the evolution of chloroplasts. Nature 405:69-72.
Oliveira, M. C., and D. Bhattacharya. 2000. Phylogeny of the Bangiophycidae (Rhodophyta) and the secondary endosymbiotic origin of algal plastids. Am. J. Bot. 87:482-492.
Palmer, J. D., and C. F. Delwiche. 1998. The origin and evolution of plastids and their genomes. Pp. 345-409 in D. E. Soltis, P. S. Soltis, and J. J. Doyle, eds. Molecular systematics of plants II: DNA sequencing. Kluwer, London.
Percus, J. K. 2002. Mathematics of genome analysis. Cambridge University Press, New York.
Qi, J., B. Wang, and B. Hao. 2004. Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach. J. Mol. Evol. 58:1-11.
Saitou, N., and T. Imanishi. 1989. Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol. Biol. Evol. 6:514-525.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.
Sankoff, D., G. Leduc, N. Antoine, B. Paquin, B. F. Lang, and R. Cedergren. 1992. Gene order comparisons for phylogenetic inference: evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. USA 89:6575-6579.
Stirewalt, V. L., C. B. Michalowski, W. Loffelhardt, H. J. Bohnert, and D. A. Bryant. 1995. Nucleotide sequence of the cyanelle genome from Cyanophora paradoxa. Plant Mol. Biol. Rep. 13:327-332.
Stoebe, B., and K. V. Kowallik. 1999. Gene-cluster analysis in chloroplast genomics. Genome Analysis Outlook 15:344-347.
Stuart, G. W., K. Moffet, and S. Baker. 2002. Integrated gene species phylogenies from unaligned whole genome protein sequences. Bioinformatics 18:100-108.
Stuart, G. W., K. Moffet, and J. J. Leader. 2002. A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes. Mol. Biol. Evol. 19:554-562.
Tekaia, F., A. Lazcano, and B. Dujon. 1999. The genomic tree as revealed from whole proteome comparisons. Genome Res. 9:550-557.
Turmel, M., C. Otis, and C. Lemieux. 1999. The complete chloroplast DNA sequence of the green alga Nephroselmis olivacea: insights into the architecture of ancestral chloroplast genomes. Proc. Natl. Acad. Sci. USA 96:10248-10253.
Turmel, M., C. Otis, and C. Lemieux. 2002. The chloroplast and mitochondrial genome sequences of the charophyte Chaetosphaeridium globosum: insights into the timing of the events that restructured organelle DNAs within the green algal lineage that led to land plants. Proc. Natl. Acad. Sci. USA 99:11275-11280.
Van de Peer, Y., S. A. Rensing, U. G. Maier, and R. De Wachter. 1996. Substitution rate calibration of small subunit ribosomal RNA identifies chlorarachniophyte endosymbionts as remnants of green algae. Proc. Natl. Acad. Sci. USA 93:7732-7736.
Wakasugi, T., T. Nagai, and M. Kapoor, et al. (15 co-authors). 1997. Complete nucleotide sequence of the chloroplast genome from the green alga Chlorella vulgaris: the existence of genes possibly involved in chloroplast division. Proc. Natl. Acad. Sci. USA 94:5967-5972.
Wakasugi, T., J. Tsudzuki, S. Ito, K. Nakashima, T. Tsudzuki, and M. Sugiura. 1994. Loss of all ndh genes as determined by sequencing the entire chloroplast genome of the black pine Pinus thunbergii. Proc. Natl. Acad. Sci. USA 91:9794-9798.
Weiss, O., M. A. Jimenez, and H. Herzel. 2000. Information content of protein sequences. J. Theor. Biol. 206:379-386.
Wolfe, K. H., C. W. Morden, and J. D. Palmer. 1992. Function and evolution of a minimal plastid genome from a nonphotosynthetic parasitic plant. Proc. Natl. Acad. Sci. USA 89:10648-10652.
Yoon, H. S., J. D. Hackett, G. Pinto, and D. Bhattacharya. 2002. The single, ancient origin of chromist plastids. Proc. Natl. Acad. Sci. USA 99:15507-15512.
Yu, Z.-G., and P. Jiang. 2001. Distance, correlation and mutual information among portraits of organisms based on complete genomes. Phys. Lett. A 286:34-46.