Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo, Japan
Abstract |
---|
Introduction |
---|
In On the Origin of Species, Darwin (1859) wrote, "All the foregoing rules and aids and difficulties in classification may be explained, if I do not greatly deceive myself, on the view that the Natural System is founded on descent with modification. ..." Many taxonomists agree with this idea, although some points of contention remain (see Mayr 1981; Wiley 1981).
A large number of methods have been developed for classifying genes. Dayhoff (1978) constructed gene families on the basis of sequence similarity. Some authors typed virus variants by the significance levels of phylogenetic trees (Louwagie et al. 1993; Ohba et al. 1995; Seibert et al. 1995), and others typed them by PCR with type-specific primers (Okamoto et al. 1992). Torroni et al. (1993) used haplogroups based on certain restriction sites. This variety of methods, however, causes confusion among classifications. A widely applicable method for classifying genes is needed.
Klastorin (1982) proposed a method for classifying hospitals based on cluster analysis (Sneath and Sokal 1973). In this paper, we apply Klastorin's (1982) method to the classification of genes by using the phylogenetic tree of the genes. The method is applicable whenever a phylogenetic tree can be obtained. We also present a test of classifications based on bootstrap resampling (Efron and Tibshirani 1991).
Theory |
---|
Criterion for Classifying Genes
Klastorin (1982) proposed a method for classifying hospitals by using a dendrogram. A dendrogram is a treelike graph that shows the similarities between operational taxonomic units (OTUs) (Sneath and Sokal 1973). Figure 1 shows a dendrogram. In Klastorin's (1982) method, a group must be defined by a branch, so that the length of the branch represents the group's distinctiveness. In figure 1, there are four groups, a, b, c, and d.
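The expected distinctiveness of a classification X is the sum, over the groups g in X, of the number of members n(g) multiplied by the branch length l(g):

$$E(X) = \sum_{g \in X} n(g)\,l(g) \qquad (1)$$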
Let us examine the example in figure 1. There are two possible classifications: P = {a, d} and Q = {b, c, d}. From equation (1), we obtain E(P) = n(a)l(a) + n(d)l(d) = 1.0 and E(Q) = n(b)l(b) + n(c)l(c) + n(d)l(d) = 0.8. Because E(P) > E(Q), P is the classification to be chosen.
We can classify genes by using Klastorin's (1982) method with the phylogenetic tree of genes instead of the dendrogram, because the phylogenetic tree expresses the evolutionary history of genes. The distinctiveness of a group corresponds to the number of evolutionary changes occurring on the ancestral lineage.
Algorithm for Classifying Genes
We denote the classification with the largest expected distinctiveness by G(α), where α is the phylogenetic tree of the genes to be classified. Klastorin (1982) developed an algorithm to obtain G(α). We developed a new algorithm to obtain G(α) that is simpler than Klastorin's (1982) algorithm.
Before we present the algorithm, we must note that E has a favorable feature. Namely, when X, Y, and Z are classifications and Z = X ∪ Y, we have

$$E(Z) = E(X) + E(Y) \qquad (2)$$

Because of equation (2), there are only two candidates for G(α) when G(β) and G(γ) are given, where β and γ are the subtrees of α. One candidate is C(α), the classification in which all genes are classified into one group. The other candidate is G(β) ∪ G(γ). Thus, we can find G(α) by using C(α), G(β), and G(γ) as follows:

$$G(\alpha) = \begin{cases} C(\alpha) & \text{if } N(\alpha)\,l(\alpha) > E(G(\beta)) + E(G(\gamma)) \\ G(\beta) \cup G(\gamma) & \text{otherwise} \end{cases}$$

where N(α) is the number of genes in tree α and l(α) is the length of the branch leading to α. Klastorin's (1982) algorithm consists of two types of search procedures, whereas ours consists of a single recursive procedure; the two are equivalent and yield the same classification. It is important to note that this algorithm is fast. Each branch length and each number of members is evaluated only once. The number of comparisons is proportional to the number of branches, and the number of branches is in turn proportional to the number of genes. Thus, the time this algorithm takes is proportional to the number of genes.
We implemented this algorithm as a Java program that reconstructs the phylogenetic tree of genes from sequence data by using the neighbor-joining method (Saitou and Nei 1987) and then obtains the classification of the genes. This program is available from the authors on request.
Let us see how this algorithm works by examining figure 2, which shows hypothetical examples of phylogenetic trees. α in figure 2a is a phylogenetic tree. β and γ in figure 2b are the subtrees of α. δ and ε in figure 2c are the subtrees of β. Figure 2 also shows five possible groups, a, b, c, d, and e. The branch lengths of these groups are also shown in figure 2, as l(a) = 1, l(b) = 4, l(c) = 7, l(d) = 3, and l(e) = 2. Figure 2d shows the possible classifications. α has three possible classifications: P = {a}, Q = {b, c}, and R = {d, e, c}. β has two possible classifications: S = {b} and T = {d, e}. γ, δ, and ε each have one possible classification: U = {c}, V = {d}, and W = {e}, respectively.
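The comparison for this example can be worked through by hand. The sketch below assumes that E of a classification is the sum of n(g)l(g) over its groups, as in the figure 1 example, and that the three subtrees with only one possible classification are single genes, so that n(c) = n(d) = n(e) = 1, n(b) = 2, and n(a) = 3. These member counts are inferred from the text, since the figure itself is not reproduced here.

```latex
\begin{align*}
E(S) &= n(b)\,l(b) = 2 \cdot 4 = 8 \\
E(T) &= n(d)\,l(d) + n(e)\,l(e) = 3 + 2 = 5 \\
E(P) &= n(a)\,l(a) = 3 \cdot 1 = 3 \\
E(Q) &= E(S) + E(U) = 8 + 7 = 15 \\
E(R) &= E(T) + E(U) = 5 + 7 = 12
\end{align*}
```

Under these assumptions S beats T within the subtree, so R never needs to be evaluated at the top level: only P and Q compete there, and Q = {b, c} has the largest E.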
Bootstrap Test
On the basis of bootstrap resampling (Efron and Tibshirani 1991), we present a test of classifications. The test consists of two procedures: (1) reconstructing the phylogenetic tree by using the resampled sequences, and (2) classifying the genes into groups. In order to calculate the bootstrap probability of a certain group (Pg), we repeat these procedures many times (say, 1,000 times) and count the number of replicates in which the group is selected in the classification with the largest E. Pg is this count divided by the number of replicates.
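The two procedures can be wired together as in the following sketch. It only illustrates the resampling and counting steps: `classify_alignment` is a hypothetical callable, supplied by the caller, that is assumed to rebuild the tree from an alignment and return the groups of the classification with the largest E. The names `resample_columns` and `group_support`, the dictionary representation of an alignment, and the fixed random seed are likewise assumptions of the example.

```python
import random
from collections import Counter
from typing import Callable, Dict, FrozenSet, Iterable

Alignment = Dict[str, str]  # gene name -> aligned sequence, all the same length


def resample_columns(alignment: Alignment, rng: random.Random) -> Alignment:
    """Draw alignment columns with replacement, keeping the alignment length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def group_support(
    alignment: Alignment,
    classify_alignment: Callable[[Alignment], Iterable[Iterable[str]]],
    replicates: int = 1000,
    seed: int = 0,
) -> Dict[FrozenSet[str], float]:
    """Estimate Pg for every group that appears in at least one replicate."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(replicates):
        groups = classify_alignment(resample_columns(alignment, rng))
        # Count each group at most once per replicate.
        counts.update({frozenset(group) for group in groups})
    return {group: n / replicates for group, n in counts.items()}
```

The same column indices are applied to every sequence in a replicate, so homologous sites stay aligned. Groups are stored as frozensets of gene names so that the same group is recognized regardless of the order in which a replicate lists its members.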
Example |
---|
The evolutionary distances were estimated from amino acid sequence comparisons using the Poisson model (Zuckerkandl and Pauling 1965). The phylogenetic tree was reconstructed using the neighbor-joining method (Saitou and Nei 1987), and the classification was obtained using our method. Bootstrap tests were conducted for phylogenetic relationships (P) and for groups (Pg). Bootstrap resampling was repeated 1,000 times for each test.
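For reference, the Poisson-corrected distance between two aligned amino acid sequences is d = -ln(1 - p), where p is the proportion of sites at which the sequences differ. The sketch below computes it for one pair. The function names are hypothetical, and skipping sites where either sequence has a gap is an assumption of the example, not something stated in the text.

```python
import math


def p_distance(seq1: str, seq2: str, gap: str = "-") -> float:
    """Proportion of differing sites, ignoring sites with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def poisson_distance(seq1: str, seq2: str) -> float:
    """Poisson-corrected distance d = -ln(1 - p)."""
    p = p_distance(seq1, seq2)
    if p >= 1.0:
        raise ValueError("distance is undefined when every site differs")
    return -math.log1p(-p)  # log1p keeps precision when p is small
```

A matrix of such pairwise distances is the input that a distance-based method such as neighbor joining expects.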
Figure 3 shows the phylogenetic tree of opsin genes and their classification. The asterisks indicate the branches that are significantly supported (P > 95%). The thick lines correspond to the ancestral branches of the groups. The numbers above the branches are the bootstrap probabilities (Pg) of the groups defined by those branches; only values above 10% are shown.
Discussion |
---|
Our method classifies genes by using the branch lengths of phylogenetic trees. On the other hand, Dayhoff (1978) classified genes based on the similarity between two genes. Note that the similarity between two genes is the proportion of shared characters, which are categorized into two types: (1) those that were inherited from a common ancestor and (2) those that arose through parallel substitutions. Since parallel substitutions would make the classification less informative (Farris 1979), our method would be more suitable for gene classification than would Dayhoff's (1978) method.
Although our method classifies genes by relative branch lengths of their phylogenetic tree, there might be some methods which classify genes using absolute branch lengths. Such methods, however, would depend on the genes to be classified, since the branch lengths depend on the evolutionary rates of genes. For example, fast-evolving genes would be classified into a large number of small groups, and slow-evolving genes would be classified into a small number of large groups. Therefore, the classification methods based on relative branch lengths might be more appropriate than those based on absolute branch lengths.
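One way to make the word "relative" concrete, offered as an interpretation and not as the authors' own derivation, is to note what happens when every branch of the tree is lengthened by the same factor c > 0, as a uniformly faster evolutionary rate would do. Assuming E of a classification X is the sum of n(g)l(g) over its groups g:

```latex
E_c(X) = \sum_{g \in X} n(g)\,\bigl(c\,l(g)\bigr) = c\,E(X), \qquad c > 0
```

Every candidate classification is rescaled by the same factor, so their ranking, and hence the classification with the largest E, is unchanged. A fixed cutoff on absolute branch length has no such invariance.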
We presented a simple algorithm for obtaining the classification. It is worth noting that the number of possible classifications can be very large. For example, the number of possible classifications for 10 genes is 115,975, and that for 20 genes is larger than 5 × 10^13. If hundreds of genes must be classified, an exhaustive search is almost impossible. On the other hand, the time our algorithm takes is proportional to the number of OTUs. Thus, our algorithm might have a tremendous advantage in the classification of a large number of genes.
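The counts quoted above coincide with the Bell numbers, which count the ways of partitioning n labeled items into non-empty groups. The sketch below reproduces them with the Bell triangle; the function name `bell_number` is chosen for the example, and reading "possible classifications" as unrestricted partitions is an inference from the matching value for 10 genes.

```python
def bell_number(n: int) -> int:
    """Number of partitions of n labeled items, computed with the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left to its left neighbor.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


print(bell_number(10))  # 115975
print(bell_number(20))  # 51724158235372
```

The value for 20 genes is about 5.2 × 10^13, consistent with the bound given in the text.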
The opsin classification obtained here was based on the phylogenetic tree reconstructed through sequence comparison. Although gene functions were not used, the classification obtained here is the same as the previous classification based on light absorption wavelengths. This agreement comes from the fact that changes in absorption wavelength are caused by amino acid changes (see Yokoyama 1997 and references therein). This result suggests that classifications can identify the functions of genes in some cases.
We developed a test for classification on the basis of the bootstrap method, because it requires fewer statistical assumptions (Efron and Tibshirani 1991). The bootstrap method is widely used in statistical analyses. Felsenstein (1985) developed a bootstrap test for topology estimation. Dopazo (1994) obtained branch length errors by using bootstrap resampling. Our test combines these two approaches and is suitable for testing the reliability of classifications, because classification is affected by both tree topologies and branch lengths.
It is worth noting that the reliability of a classification depends on the sampling. From figure 3, we can see that group M1 is not supported with a high bootstrap value. This is probably because the branch leading to M92037 is long. In group M1, only M92037 is sampled from tetrapods. If orthologous genes of M92037 are found in other tetrapods, this branch will be divided into shorter branches, and group M1 will be supported more strongly. When classifying genes with our method, it might therefore be better to sample genes from a variety of organisms.
Acknowledgements |
---|
Footnotes |
---|
1 Keywords: molecular classification, molecular phylogeny, bootstrap test, opsin classification
2 Address for correspondence and reprints: Fumio Tajima, Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Hongo, Bunkyo-Ku, Tokyo 113-0033, Japan. E-mail: ftajima{at}biol.s.u-tokyo.ac.jp
Literature Cited |
---|
al-Ubaidi, M. R., S. J. Pittler, M. S. Champagne, J. T. Triantafyllos, J. F. McGinnis, and W. Baehr. 1990. Mouse opsin. Gene structure and molecular basis of multiple transcripts. J. Biol. Chem. 265:20563–20569
Batni, S., L. Scalzetti, S. A. Moody, and B. E. Knox. 1996. Characterization of the Xenopus rhodopsin gene. J. Biol. Chem. 271:3179–3186
Bukh, J., R. H. Purcell, and R. H. Miller. 1993. At least 12 genotypes of hepatitis C virus predicted by sequence analysis of the putative E1 gene of isolates collected worldwide. Proc. Natl. Acad. Sci. USA 90:8234–8238
Chan, S.-W., F. Mcomich, E. C. Holmes, B. Dow, J. F. Peutherer, E. Follet, and P. Yap. 1992. Analysis of a new hepatitis C virus type and its phylogenetic relationship to existing variants. J. Gen. Virol. 73:1131–1141
Darwin, C. 1859. On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. John Murray, London
Dayhoff, M. O. 1978. Atlas of sequence and structure. Vol. 5, Suppl. 3. National Biomedical Research Foundation, Silver Spring, Md
Dopazo, J. 1994. Estimating errors and confidence intervals for branch length in phylogenetic trees by a bootstrap approach. J. Mol. Evol. 38:300–304
Efron, B., and R. Tibshirani. 1991. Statistical data analysis in the computer age. Science 253:390–395
Farris, J. S. 1979. The information content of the phylogenetic system. Syst. Zool. 28:483–519
Felsenstein, J. 1985. Confidence limits on phylogenies; an approach using the bootstrap. Evolution 39:783–791
Hisatomi, O., T. Iwasa, F. Tokunaga, and A. Yasui. 1991. Isolation and characterization of lamprey rhodopsin cDNA. Biochem. Biophys. Res. Commun. 174:1125–1132
Horai, S., R. Kondo, Y. Nakagawa-Hattori, S. Hayashi, S. Sonoda, and K. Tajima. 1993. Peopling of the Americas, founded by four major lineages of mitochondrial DNA. Mol. Biol. Evol. 10:23–47
Hunt, D. M., A. J. Williams, J. K. Bowmaker, and J. D. Mollon. 1993. Structure and evolution of the polymorphic photopigment gene of the marmoset. Vision Res. 33:147–154
Johnson, R. L., K. B. Grant, T. C. Zankel, M. F. Boehm, S. L. Merbs, J. Nathans, and K. Nakanishi. 1993. Cloning and expression of goldfish opsin sequences. Biochemistry 32:208–214
Kawamura, S., and S. Yokoyama. 1993. Molecular characterization of the red visual pigment gene of the American chameleon Anolis carolinensis. FEBS Lett. 323:247–251
Klastorin, T. D. 1982. An alternative method for hospital partition determination using hierarchical cluster analysis. Oper. Res. 30:1134–1147
Kojima, D., T. Okano, Y. Fukada, Y. Shichida, T. Yoshizawa, and T. G. Ebrey. 1992. Cone visual pigments are present in gecko rod cells. Proc. Natl. Acad. Sci. USA 89:6841–6845
Louwagie, J., F. E. McCutchan, M. Peeters et al. (11 co-authors). 1993. Phylogenetic analysis of gag genes from 70 international HIV-1 isolates provides evidence for multiple genotypes. AIDS 7:769–780
Mayr, E. 1981. Biological classification: toward a synthesis of opposing methodologies. Science 214:510–516
Nathans, J., and D. S. Hogness. 1983. Isolation, sequence analysis, and intron-exon arrangement of the gene encoding bovine rhodopsin. Cell 34:807–814
Nathans, J., and D. S. Hogness. 1984. Isolation and nucleotide sequence of the gene encoding human rhodopsin. Proc. Natl. Acad. Sci. USA 81:4851–4855
Nathans, J., D. Thomas, and D. S. Hogness. 1986. Molecular genetics of human color vision: the genes encoding blue, green, and red pigments. Science 232:193–202
Ohba, K., M. Mizokami, T. Ohno, K. Suzuki, E. Orito, Y. Ina, J. Y. N. Lau, and T. Gojobori. 1995. Classification of hepatitis C virus into major types and subtypes based on molecular evolutionary analysis. Virus Res. 36:201–214
Okamoto, H., K. Kurai, S.-I. Okada, K. Yamamoto, H. Iizuka, T. Tanaka, S. Fukuda, F. Tsuda, and S. Mishiro. 1992. Full-length sequence of hepatitis C virus genome having poor homology to reported isolates: Comparative study of four distinct genotypes. J. Gen. Virol. 188:331–341
Okano, T., D. Kojima, Y. Fukada, Y. Shichida, and T. Yoshizawa. 1992. Primary structures of chicken cone visual pigments: Vertebrate rhodopsins have evolved out of cone visual pigments. Proc. Natl. Acad. Sci. USA 89:5932–5936
O'Tousa, J. E., W. B. Baehr, R. L. Martin, J. Hirsh, W. L. Pak, and M. L. Applebury. 1985. The Drosophila ninaE gene encodes an opsin. Cell 40:839–850
Pappin, D. J. C., E. Elipoulos, M. Brett, and J. B. C. Findlay. 1984. A structural model for ovine rhodopsin. Int. J. Biol. Macromol. 6:73–76
Petersen-Jones, S. M., A. K. Sohal, and D. R. Sargan. 1994. The nucleotide sequence of the canine rod opsin gene. Gene 143:281–284
Pittler, S. J., S. J. Fliesler, and W. Baehr. 1992. Primary structure of frog rhodopsin. FEBS Lett. 313:103–108
Robinson, J., E. A. Schmitt, F. I. Harosi, R. J. Reece, and J. E. Dowling. 1993. Zebrafish ultraviolet visual pigment: absorption spectrum, sequence, and localization. Proc. Natl. Acad. Sci. USA 90:6009–6012
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425
Seibert, A. A., C. Y. Howell, M. K. Hughes, and A. L. Hughes. 1995. Natural selection on the gag, pol, and env genes of human immunodeficiency virus 1 (HIV-1). Mol. Biol. Evol. 12:803–813
Sneath, P. H. A., and R. R. Sokal. 1973. Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, San Francisco
Takao, M., A. Yasui, and F. Tokunaga. 1988. Isolation and sequence determination of the chicken rhodopsin gene. Vision Res. 28:471–480
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of practical multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680
Tokunaga, F., T. Iwasa, M. Miyagishi, and S. Kayada. 1990. Cloning of cDNA and amino acid sequence of one of chicken cone visual pigments. Biochem. Biophys. Res. Commun. 173:1212–1217
Torroni, A., T. G. Schurr, M. F. Cabell, M. D. Brown, J. V. Neel, M. Larsen, D. G. Smith, C. M. Vullo, and D. C. Wallace. 1993. Asian affinities and continental radiation of the four founding Native American mtDNAs. Am. J. Hum. Genet. 53:563–590
Wiley, E. O. 1981. Phylogenetics: the theory and practice of phylogenetic systematics. John Wiley and Sons, New York
Yokoyama, R., and S. Yokoyama. 1990. Convergent evolution of the red- and green-like visual pigment genes in fish, Astyanax fasciatus and human. Proc. Natl. Acad. Sci. USA 87:9315–9318
Yokoyama, R., and S. Yokoyama. 1993. Isolation, DNA sequence and evolution of a color visual pigment gene of the blind cave fish Astyanax fasciatus. Vision Res. 30:807–816
Yokoyama, R., B. E. Knox, and S. Yokoyama. 1995. Rhodopsin from the fish, Astyanax: role of tyrosine 261 in the red shift. Invest. Ophthalmol. Vis. Sci. 36:939–945
Yokoyama, S. 1993. Molecular characterization of a blue visual pigment gene in the fish Astyanax fasciatus. FEBS Lett. 334:27–31
Yokoyama, S. 1997. Molecular genetic basis of adaptive selection: examples from color vision in vertebrates. Annu. Rev. Genet. 31:311–332
Yokoyama, S., W. T. Starmer, and R. Yokoyama. 1993. Paralogous origin of the red- and green-sensitive visual pigment genes in vertebrates. Mol. Biol. Evol. 10:527–538
Zuckerkandl, E., and L. Pauling. 1965. Evolutionary divergence and convergence in proteins. Pp. 97–166 in V. Bryson and H. J. Vogel, eds. Evolving genes and proteins. Academic Press, New York