A New Theory of Phylogeny Inference Through Construction of Multidimensional Vector Space

Yasuhiro Kitazoe, Yukio Kurihara, Yuichi Narita, Yoshiyasu Okuhara, Akira Tominaga and Tomohiko Suzuki

*Center of Medical Information Science,
{dagger}Department of Information Science,
{ddagger}Department of Medical Biology,
§Medical Research Center, Kochi Medical School, Nankoku, Kochi, Japan;
||Department of Biology, Faculty of Science, Kochi University, Akebono, Kochi, Japan


    Abstract
 TOP
 Abstract
 Introduction
 Theory
 Applications to Vertebrates in...
 Discussion
 Conclusions
 Supplementary Material
 Acknowledgements
 literature cited
 
Here, a new theory of molecular phylogeny is developed in a multidimensional vector space (MVS). The molecular evolution is represented as a successive splitting of branch vectors in the MVS. The end points of these vectors are the extant species and indicate the specific directions reflected by their individual histories of evolution in the past. This representation makes it possible to infer the phylogeny (evolutionary histories) from the spatial positions of the end points. Search vectors are introduced to draw out the groups of species distributed around them. These groups are classified according to the nearby order of branches with them. A law of physics is applied to determine the species positions in the MVS. The species are regarded as the particles moving in time according to the equation of motion, finally falling into the lowest-energy state in spite of their randomly distributed initial condition. This falling into the ground state results in the construction of an MVS in which the relative distances between two particles are equal to the substitution distances. The species positions are obtained prior to the phylogeny inference. Therefore, as the number of species increases, the species vectors can be more specific in an MVS of a larger size, such that the vector analysis gives a more stable and reliable topology. The efficacy of the present method was examined by using computer simulations of molecular evolution in which all the branch- and end-point sequences of the trees are known in advance. In the phylogeny inference from the end points with 100 multiple data sets, the present method consistently reconstructed the correct topologies, in contrast to standard methods. In applications to 185 vertebrates in the {alpha}-hemoglobin, the vector analysis drew out the two lineage groups of birds and mammals. A core member of the mammalian radiation appeared at the base of the mammalian lineage. Squamates were isolated from the bird lineage to compose the outgroup, while the other living reptilians were directly coupled with birds without forming any sister groups. This result is in contrast to the morphological phylogeny and is also different from those of recent molecular analyses.


    Introduction
 TOP
 Abstract
 Introduction
 Theory
 Applications to Vertebrates in...
 Discussion
 Conclusions
 Supplementary Material
 Acknowledgements
 literature cited
 
Molecular studies have provided new findings on evolutionary phylogeny (Goodman 1963Citation ; Sarich and Wilson 1967Citation ; Gemmell and Westerman 1994Citation ; Janke et al. 1996Citation ; Springer et al. 1997Citation ; Kumar and Hedges 1998Citation ; Hedges and Poling 1999Citation ). Phylogeny inference has been done by methods directly using the substitution distances between the species (Sokal and Michener 1958Citation ; Eck and Dayhoff 1966Citation ; Fitch and Margoliash 1967Citation ; Sokal and Rohlf 1981Citation ; Saitou and Nei 1987Citation ) and by those taking explicit account of nucleotide or amino acid substitutions (Felsenstein 1993Citation ; Adachi and Hasegawa 1996Citation ; Strimmer and von Haeseler 1996Citation ; Swofford 1998Citation ). The topologies obtained by these methods are often changed on the basis of the numbers and kinds of selected species, including the outgroup. There appear to be strange topologies in spite of high branch resolutions (Cao et al. 1998a, 1998bCitation ; also shown in this paper). Furthermore, the branch resolutions tend to be lower with an increasing number of species, since the number of topological variations increases rapidly. Hence, the whole structure of the vertebrate phylogeny remains unclear (Novacek 1992Citation ; Hedges 1994Citation ; Cao et al. 1998aCitation ). In this respect, the development of a new methodology is expected to provide an understanding of molecular evolution.

There exists an original space of vectors whose components are the amino acid sequences of the respective species. However, this space is too abstract for a quantitative analysis of the tree structure, since these components are not digital. If this space can be transformed into the multidimensional vector space (MVS) of Euclidean space without decreasing the degree of freedom of the variables, the tree structure can be directly analyzed by using geometrical quantities such as the distances and angles between the vectors.

In this paper, we first show how the evolutionary process is represented in the MVS. To make lucid our basic idea for this representation, we consider the simplest model of substitutions (hereinafter called the single-substitution model [SSM]), in which the amino acids of any sites can be changed at most once over all the evolutionary processes, and the substitution distances between two species are approximated by the number of different amino acid sites between two sequences. Then, the processes are represented as the developments of branch vectors along the orthogonal coordinate axes in the MVS, and the end points of these vectors correspond to the extant species. The species vectors indicate the specific directions in the MVS, respectively.

Next, to infer the tree structure from the species vectors of the end points, we introduce search vectors to draw out the groups of species, which are distributed around these vectors. A search vector may be a one-species vector or a center-of-gravity vector (CGV) of many species. The drawn-out species stand in line in the nearby order of the topological relationship with the search vectors if the branch vectors are orthogonal. The branching patterns of the tree are analyzed by a scatter diagram which expresses the relationship among the groups of species drawn out by two search vectors. It is shown that the diagram consists of the three parts: the left branch, the right branch, and the central line. The first two include the drawn-out species which are distributed parallel to the x- and y-axes, respectively, while the third represents the other species (outgroup) of the outside branches, which are distributed around the origin of coordinates or along the central line of y = x.

The present theory is meaningless if the spatial positions of the respective species are not obtained. In this paper, their positions in the MVS are determined so that their relative distances may reproduce the substitution distances between species. This can be easily done in simple cases of three and four species. As the number I of species and the dimensional size N increase, however, it is difficult to determine the positions by statistical methods of minimizing or maximizing a quantity, since the number of variables is huge (if I = 150 and N = 100, the number is I x N = 15,000). To remove this difficulty, we introduce the dynamic method of time-dependently solving the equation of motion for the many-body system in physics (David and James 1988Citation ). Here, the species are regarded as particles moving under their mutual interactions. By utilizing the law of physics that the system finally falls into the ground state, irrespective of the initial random distributions of the particles, we can precisely determine the positions of the species in the MVS. Here, the interactions are defined as the sums of two-body potentials, which are expressed so as to be minimum at the points equal to the substitution distances between the two related species. The solution of the equation of motion always converges on the ground state by the (I - 1)-dimensional size. However, when I is larger than the number L of variable sites in amino acid sequences of these species, it is found that the equation of motion can be solved with good accuracy by the L-dimensional size in place of (I - 1). Then, the number of variables in the MVS becomes substantially equal to that in the original abstract space. This means a substantial transformation of the original space into the MVS.

The validity of the present method was examined by using computer simulations of molecular evolution in which any branch patterns could be created by using the Monte Carlo method (Hammersley and Handscomb 1964Citation ) with the weight of the transition probability matrix. Then, it was possible to directly compare the present method with standard ones such as neighbor joining (NJ), maximum likelihood (ML), and parsimony (PA), since all the branch- and end-point sequences were known in advance. The predominance of the present method was demonstrated by using two examples of typical branch patterns in which 100 multiple data sets with a variety of branch lengths were prepared by repeating the simulations.

The present method was applied to the vector analysis of 185 vertebrate species in the {alpha}-hemoglobin, and the results were compared with those of the standard methods. The standard methods gave erroneous topologies in the relationship among Artiodactyla, Perissodactyla, and Chiroptera in spite of very high branch resolutions which became lower as the number of species increased. On the other hand, the present approach gave the correct topology, and a larger number of species brought a more stable solution. Encouraged by this success, we challenged recent analyses of topics regarding the reptilian phylogeny. Our results showed that squamates were clearly isolated from the lineage of birds, while the other extant reptilians belonged to the lineage of birds without forming any sister groups. This is in contrast to the traditional phylogeny (Romer 1966Citation ; Benton 1997Citation ) and is also different from recent reports (Gorr, Mable, and Kleinschmidt 1998Citation ; Hedges and Poling 1999Citation ).


    Theory
 TOP
 Abstract
 Introduction
 Theory
 Applications to Vertebrates in...
 Discussion
 Conclusions
 Supplementary Material
 Acknowledgements
 literature cited
 
Representation of the Evolutionary Process in the MVS
The minimum unit of evolution is the elementary process of figure 1a, in which point 1 is split into points 3 and 4 via point 2. Here, we define di,j as the substitution distance between the ith and jth points. The SSM gives the relations d1,3 = d1,2 + d2,3, d1,4 = d1,2 + d2,4, and d3,4 = d2,3 + d2,4. In this paper, we also define the spatial distance Ri,j = |Ri - Rj| in the MVS by the square root of di,j. Then, these relations are expressed as follows: R1,32 = R1,22 + R2,32, R1,42 = R1,22 + R2,42, and R3,42 = R2,32 + R2,42, which represent Euclidean distances. Thus, the tree structure of figure 1a is represented by three vectors, R1,2, R2,3, and R2,4, along the x-, y-, and z-axes in the three-dimensional space (fig. 1b ). When point 3 is further split into two branches, the tree is composed of five vectors along the orthogonal coordinate axes in the five-dimensional space. In this way, the evolution in the SSM is represented as the development of branch vectors along the orthogonal coordinate axes of the MVS, and the extant species correspond to the end points of these developments.



View larger version (13K):
[in this window]
[in a new window]
 
Fig. 1.—Three-dimensional description of the elementary process of evolution. An elementary process (a) of splitting a branch into two is expressed in the single-substitution model (SSM) by the three branch vectors (b) along the x-, y-, and z-axes of the three-dimensional space. A succession of such elementary processes forms an evolutionary tree, which can be represented in the multidimensional vector space (MVS)

 
The orthogonal relation between the branch vectors is very useful in the structural analysis of the tree. Without loss of generality, we consider the topology of figure 2 , which consists of the branch points (points 1–15) and the endpoints (points 16–31). Here, point 1 is the root of the tree. The SSM gives a number of orthogonal relations between the branch vectors Ri,j = Ri - Rj. For example, we have Ri,1 {bot} Rj,1 for the two groups of points (i = 2, 4, 5, 8–11, 16–23) and (j = 3, 6, 7, 12–15, 24–31), and also Ri,2 {bot} Rj,2 for the two groups of points (i = 4, 8, 9, 16–19) and (j = 5, 10, 11, 20–23). However, for example, the vectors Ri,1 (i = 6, 12, 13, 24–27) are not orthogonal to the vectors Rj,1 (j = 7, 14, 15, 28–31), since they include the common vector R1,3.



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 2.—Evolutionary tree for the presentation of the present theory. The present theory is presented using this figure without loss of generality. The figure consists of the upper and lower wings, which have the seven branch points and eight end points, respectively

 
When we specify a point s as one of the search vectors, the angle {theta}i,s between the vectors Rs and Ri is given by the relation Ui,s = Ri·Rs = {Sigma}Nn=1 Xi,nXs,n = RiRs cos {theta}i,s. Here, the vectors Ri are measured from the position of the root (point 1), and Xi,n and Xs,n denote the nth dimensional components of Ri and Rs, respectively. Then, we have the quantity

(1)
which represents the square of the projection component of the ith point vector Ri to Rs. By specifying another point, t, we obtain the scatter diagram for Wi,s and Wi,t, which give the x and y values, respectively.

A variety of pairs may be selected for the search points (s, t). For the first pair, s = 31 and t = 16, the SSM gives the following orthogonal relations: R15,1 {bot} R15,s, R15,1 {bot} R15,30, and R15,s {bot} R15,30, since points 1, 15, 30, and 31 correspond to points 1, 2, 3, and 4, respectively, in figure 1b. We have Zs,s = Rs,12 and Z15,s = Z30,s. Hence, we obtain Zi,s = C1 for the point (i = 31), Zi,s = C2 for the points (i = 15, 30), Zi,s = C3 for the points (i = 7, 14, 28, 29), Zi,s = C4 for the points (i = 3, 6, 12, 13, 24–27), and Zi,t = 0 for the points (i = 3, 6, 7, 12–15, 24–31), where Cj (C1 > C2 > C3 > C4) are constants. In this way, the four groups of points stand in line on the x-axis according to the nearby order of branches from the search point (s = 31). Similarly, in the upper branches of figure 2 , the point (i = 16) gives the largest value of Zi,t on the y-axis, the points (i = 8, 17) form the second group, the points (i = 4, 9, 18, 19) form the third group, and the points (i = 2, 5, 10, 11, 20–23) form the fourth group.

For the second example, the pair (s = 31, t = 24), we have the orthogonal relations Zi,s = Zi,t = 0 for the points (i = 2, 4, 5, 8–11, 16–23), Zi,t = C5 for the point (i = 24), Zi,t = C6 for the points (i = 12, 25), Zi,t = C7 for the points (i = 6, 13, 26, 27), and Zi,s = C8 for the points (i = 3, 7, 14, 15, 28–31), since Ri,3 {bot} Rj,3 for the two groups of points (i = 6, 12, 13, 24–27) and (j = 7, 14, 15, 28–31). Here, C5, C6, C7, and C8 are constants (C5 > C6 > C7 > C8). Here, in the x-direction, the point (i = 31) gives the largest value of Zi,s, the points (i = 15, 30) form the second group, the points (i = 7, 14, 28, 29) form the third group, and the points (i = 3, 6, 12, 13, 24–27) form the fourth group. In the y-direction, point 24 gives the largest value of Zi,t, the points (i = 12, 25) form the second group, the points (i = 6, 13, 26, 27) form the third group, and the points (i = 3, 7, 14, 15, 28–31) form the fourth group. In this way, the scatter diagram for Zi,s and Zi,t gives the groupings of the points which correspond to the nearby orders of branches viewed from the search points (s and t). Also, the regression lines along these plots are orthogonal to each other. However, actual evolutionary processes may include more or less substitutions of higher orders than those in the SSM. Then, the two lines cannot be completely parallel to the x- and y-axes, respectively.

A computer simulation of evolution is a useful tool to check the validity of the vector analysis in terms of Zi,s and Zi,t, since all the sequences of the 31 points of figure 2 are known in advance. The simulation created these sequences on the basis of the Markov process without applying the SSM. An amino acid sequence (141 sites) of the first point (the root) was given by generating white random numbers for the 20 amino acids. We tried to substitute an amino acid at each site of the root sequence by generating a random number with the weight of the transition probability matrix between amino acids. Then, most of the sites remained unchanged, since the diagonal parts of the matrix were much larger than the nondiagonal ones. The sequences of the next two points (2, 3) were fixed after amino acid substitutions of 10 sites in the root sequence, respectively. This procedure was continued to the end points (points 16–31). In this simulation, the Dayhoff model was used for both the transition probability matrix and the substitution distances between the sequences of the respective points (Dayhoff, Schwartz, and Orcutt 1978Citation ). The substitution distances between points 1–31 were calculated in a way equivalent to that of the software package Protdist (Felsenstein 1993Citation ), and used as the input data for the equation of motion.

Positions Ri of the 31 points of figure 2 were precisely determined by solving the equation of motion in the 30 dimensions and used for the vector analysis. Figure 3a shows the scatter diagram for Zi,s and Zi,t (s = 31, t = 16). The scatter plots were clearly classified into the eight groups according to the branching orders from the specific points (s = 31, t = 16), and the regression lines along the plots were almost orthogonal to each other on the x- and y-axes, although the SSM was not applied and the end points were rather far from the root (point 1) (d {approx} 35). Figure 3b shows the scatter diagram for Zi,s and Zi,t (s = 31, t = 24). Here, the points of the upper branches of figure 2 were distributed around the origin of coordinates to compose the outgroup, since their vectors were almost orthogonal to the vectors Rs and Rt. Those of the lower branches were distributed in a pattern quite similar to that shown in figure 3a and were classified into the seven groups. In this way, a successive change of the search vectors s and t completely resolved the branch pattern of figure 2 .



View larger version (27K):
[in this window]
[in a new window]
 
Fig. 3.—Vector analysis of the evolutionary tree of figure 2 . The positions Ri of the 31 points of figure 2 were precisely determined in the MVS of 30 dimensions. The vectors R31, R16, and R24 of the points (i = 31, 16, 24) were first chosen as the search vectors to analyze the tree structure. a and b, scatter diagrams for Zi,31 and Zi,16 and for Zi,31 and Zi,24 (defined by eq. 1 ), respectively. In a, with the search points (i = 31, 16), the points of the upper and lower wings of figure 2 were almost distributed on the y- and x-axes, respectively. The groups of points were clearly classified in the nearby orders of branches from the search points (i = 31, 16). In b, with the search points (i = 31, 24), the branch points (points 3, 6, 12) were further resolved, and the regression lines along the plots were also orthogonal to each other. Here, the points of the upper wing were distributed around the origin of coordinates as the outgroup, since their vectors were almost orthogonal to the search vectors R24 and R31. The successive changes of the search vectors could resolve the branches of the tree. Next, to infer the phylogeny, we determined the vectors Ri of the end points (points 16–31) by solving the equation of motion of only these points. We also determined position Rc of the spherical center by using equation (2) . The obtained grouping patterns of the end points in the vector analysis were consistent with those of this figure

 
Vector Analysis for Phylogeny Inference
Unfortunately, the sequences of branch points 1–15 in figure 2 are not known in actual evolution. To confirm the validity of the vector analysis, we have to reproduce the topology of this figure by blinding the branch points. Using the distance matrix for the end points (points 16–31) as the input data, we determine their positions Ri by solving the equation of motion in 15 dimensions.

Prior to the inference, we determine the origin of coordinates using the positions of the end points. The molecular clock is assumed as the first approximation for this position. We look for the position Rc of the spherical center which is equidistant from all the end points. Let us consider the quantity S defined by the equation


(2)
The central position is determined by minimizing the S value on the basis of the least-squares method; {partial}S/{partial}Xc,n = 0 (n = 1, 2, ... , N) with linear coupled equations for Xc,n. Here, Xc,n denotes the nth dimensional component of Rc. The values Xc,n can be obtained by solving the matrix equation. As a result, we can check how the end points are spherically distributed around the center, since the distance between the center and the ith point is given by Ri,c = {{Sigma}Nn=1 (Xi,n - Xc,n)2}1/2.

Once the root position is determined, the tree structure can be analyzed in terms of the scattering diagram for Zi,s and Zi,t by using only the end points. By adopting R16 and R31 as the search vectors, we obtain figure 3a without points 2–15. Then, on the basis of the orthogonal theory presented in the previous section, we know that points 16–23 and points 24–31 groups that are independent from each other. In the group of points 16–23, points 20–23 are first separated from the evolution process toward the point 16, points 18 and 19 are separated next, and point 17 is finally separated. On the other hand, in the group of points 24–31, points 24–27 are first separated from the evolution process toward point 31, points 28 and 29 are separated next, and the point 30 is finally separated. As a result, we obtain the tree inserted in figure 3a. Next, we further resolve points 24–27 by taking R24 as the search vector instead of R16. We get figure 3b without points 2–15. Then, we know that the points 26 and 27 are first separated from the evolution process toward point 24, and point 25 is separated next. As a result, we get a more detailed tree, presented in the insert in figure 3b. By continuing such a procedure, we can obtain the full tree of 16 points.

The vector analysis can be performed without any mathematical approximations. However, the branch lengths of the tree are obliged to be evaluated model-dependently. This can be easily done by using the SSM as follows: In figure 2 , for example, we have the orthogonal relations Ri,4 {bot} Rj,4 for the points [i = (16, 17), j = (18, 19)], [i = (1), j = (16, 17)], and [i = (1), j = (18, 19)]. We have R16,42 + R18,42 = R16,182, R1,42 + R16,42 = R16,12, and R1,42 + R18,42 = R18,12. As a result, the branch lengths are given by R1,42 = (R16,12 + R18,12 - R16,182)/2, R16,42 = (R16,12 + R16,182 - R18,12)/2, and R18,42 = (R18,12 + R16,182 - R16,12)/2. We finally note that the scalar product Ui,s = Ri·Rs = {Sigma}Nn=1 Xi,nXs,n = RiRscos {theta}i,s of the search vector Rs and vector Ri gives the distance from the root to the branch point between the two points (s and i). The scatter diagram may be expressed in terms of Ui,s and Ui,t in place of Zi,s and Zi,t.

Determination of Species Positions Ri in the MVS
To obtain species positions Ri in the MVS, we have to determine a large number of parameters of Xi,n (i = 1, 2, ... , I; n = 1, 2, ... , N), which are the vector components of Ri. We introduce a new method of time-dependently solving the equation of motion for the many-body system. For this purpose, the species are regarded as the particles moving according to the equation of motion. They are initially deposited at random within a volume of the MVS. With time, they continue to move toward lower energy states under the influences of both a viscosity effect and their mutual interactions. They gradually lose their kinetic energies due to the viscosity effect and finally stop falling into the ground state of the system, irrespective of the initial conditions given using random numbers. This falling is one of the most fundamental laws of physics. Therefore, it is quite easy to confirm whether the equation of motion has been correctly solved or not, since the ground state is explicitly known in advance. The interactions are defined as the sum of two-body attractive potentials, which are minimized at the points equal to the square roots of the substitution distances between the two species. The ground state corresponds to the situation in which any pairs of the particles are positioned at the points equal to the square roots of the substitution distances.

The initial positions Ri and velocities Vi of the particles are randomly distributed within the volumes of R0 and V0 by the equations


(3)
for i = (1, 2, ... , I) and n = (1, 2, ... , N). Here, V0 and R0 are constants, F is a random number between 0 and 1, and Vi,n and Xi,n are the nth dimensional components of Vi and Ri, respectively. The particles are allowed to move in time according to the kinematics of damped equations of motion (David and James 1988Citation ):


(4)
which are the differential equations about time T with the N dimensions (n = 1, 2, ... , N) and the I species (i = 1, 2, ... , I). Here, µ is the viscosity to damp the motion of the particles. The total energy E of the system is given by


(7)
which consists of the sum of the kinetic energy of the first term and the potential energy of the second term. We introduce the two-body potential between the ith and jth particles, defined by

(8)
which minimizes at the point of Ri,j = Di,j. Here, P0 and A are constants, Rij = {{Sigma}Nn=1 (Xi,n - Xj,n)2}1/2, and the quantity Di,j is defined as the square root of the substitution distance di,j between the two sequences of the ith and jth species.

Equations (4)–(8) are expressed in no dimensions without physical units such as meters, grams, or seconds. If we set µ = 0, these equations are reduced to the well-known "canonical equation," where the total energy E of the system is conserved time-independently. The Runge-Kutta method is applied for the numerical solution, where we set the constant values as follows: V0 = 0.3, M = 900, P0 = -0.5, µ = 8, and A = 20. R0 denotes the range of the initial distribution of the particles and is defined as half of the maximum value of Di,j. P0 is the depth of the potential which confines the particles within the interaction range. The value of µ may be adjusted to be small such that the particles will gradually lose their kinetic energies in time, and µ = 8 was found to be suitable. A time mesh to integrate the differential equations about time is taken as {Delta}T = 0.3, which is sufficiently small to keep the calculations accurate. The size of dimensions, N, is set equal to (I - 1) if I is smaller than the number of variable sites within the given species, L. If I > L, we may put N = L, since the average error of Ri,j versus Di,j is then negligible. The numerical calculations start with a high-energy state given by a set of the initial distribution of the particles, and they stop at the ground state, whose appearance verifies the correct solution of equations (4)–(6) . This can be easily confirmed in practice, since the ground state is given by E0 = P0I(I - 1)/2, as known from equations (7) and (8) . In numerical calculations using the interaction potential of equation (8), we do not have any local minimum of the total energy. The calculations were performed by a Fortran program with double precision.

Comparison with Other Methods by Computer Simulations of Molecular Evolution
The efficacy of the present theory was demonstrated by using computer simulations of molecular evolutions in which all of the branch- and end-point sequences were known in advance. Once an initial sequence of amino acids is given, a series of sequences of the following evolutionary process are determined by using Monte Carlo method (Hammersley and Handscomb 1964Citation ) with the weight of the transition probability matrix t(p, q) for the Dayhoff model. The amino acid transition at each site is determined by generating a random number F between 0 and 1: if a given F satisfies the condition T(p, q - 1) < F <= T(p, q), the transition from the pth amino acid to the qth one occurs. Here, T(p, q) = {Sigma}qr=1 T(p, r)/{Sigma}20r=1 T(p, r), T(p, 0) = 0, and T(p, p) means no transition. An operation of the transitions is performed from the first site to the last, so that the first sequence is created. The second sequence is created by the next operation. The (i + 1)-th sequence is given by using the ith sequence. Two sequences are created at each branch point. In this way, we can simulate any branch patterns of molecular evolution based on the stochastic (Markov) process. The number of operations is proportional to evolutionary time.

To compare the present method with the standard ones, we formed the branch patterns of figure 4a and b. Here, the end points (points 5–9) correspond to the 100 operations. Branch points 2, 3, and 4 in figure 4a are the positions of the 50, 70, and 70 operations, respectively. Those in figure 4b are the positions of the 40, 60, and 80 operations, respectively. To make the simulation realistic, we assumed that the initial point (point 1) was the consensus sequence of eutherians. We also assumed that the 64 sites could be substituted, since the other 77 sites were known to almost freeze in eutherians.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 4.—Computer simulations of molecular evolution. The present method was compared with the standard ones using the two branch patterns of a and b. The initial point (point 1) is given by the consensus sequence of eutherians. A series of sequences of the following evolutionary process are determined by using the Monte Carlo method with the weight of the transition probability matrix. The branch points 2, 3, and 4 in a are the positions of the 50, 70, and 70 operations for the amino acid transitions, respectively. Those in b are the positions of the 40, 60, and 80 operations, respectively. The end points (points 5–9) correspond to the 100 operations. By repeating the simulations, 100 multiple data sets with a variety of branch lengths were prepared for a statistical comparison with the other methods, and the sequences of the end points were used for the phylogeny inference

 
By repeating the simulation, we produced the 100 multiple data sets (100 evolutionary events) in both topologies of figure 4a and b and tried to reconstruct the original branch patterns by using only the sequences of the end points, where point 5 was taken as the outgroup. Table 1 shows the percentages of the correct solutions for ML, NJ, and the present method. The ML and NJ analyses were performed using the software packages PUZZLE (Strimmer and von Haeseler 1996Citation ) and PHYLIP (Felsenstein 1993Citation ), respectively. The NJ method gave better results than the ML method. A clear correlation of branch resolution between the two methods did not exist due to their different algorithms for phylogeny inference. Figures 5 and 6 show typical examples in which the two methods gave erroneous topologies in same data sets, corresponding to figure 4a and b, respectively. The ML method could not reconstruct the original topology in spite of the 100% confidence (fig. 5a ). This suggests that the ML value cannot necessarily be the criterion for the correct topology. The results of table 1 depend on the numbers of operations in figure 4 . For example, if the numbers 20 and 30 in figure 4a are replaced by 30 and 20, respectively, the results of the standard methods are improved.


View this table:
[in this window]
[in a new window]
 
Table 1 Percentages of Correct Solutions

 


View larger version (16K):
[in this window]
[in a new window]
 
Fig. 5.—Results of the standard methods for the branch pattern of figure 4a. The figure gives a typical example (table 2 ) of data sets in which the maximum-likelihood (a) and neighbor-joining (b) methods could not reconstruct the original topology of figure 4a in spite of high branch resolutions. Here, point 5 was taken as the outgroup

 
Next, we follow the procedure of the present method. First, we obtained the MVS of the four dimensions by using the distance matrix for the end points (points 5–9). Then, we obtained the spherical center by using equation (2). However, the obtained radius was too short due to the small number of end points. According to the SSM, the minimum value of this radius is half of the average distance between the outgroup (5) and the other end points. This value was basically used as the spherical center (origin of coordinates) but multiplied by the factor 1.3 for a deviation from the SSM. We again solved the equation of motion of the five dimensions which took into account this center and the end points (points 5–9) and constructed the MVSs for all of the data sets. As seen in table 1 , we obtained an excellent result of the vector analysis over all the data sets. Here, we give full details of the two examples in which the standard methods gave erroneous topologies (figs. 5 and 6 ). By examining every pair of end points (points 6–9) as the search vectors, we selected the best pair of them which fulfilled the criterion of orthogonal relations such as those in figure 3b. Figure 7a gives the vector analysis using the data set in the case of figure 5 . Here, the search vectors of points 6 and 8 revealed the existence of the two sister groups of (6, 7) and (8, 9). These groupings were confirmed by using points 8 and 9 as search vectors (see fig. 7b ), since the other points (6 and 7) were positioned outside of the orthogonal relation (solid lines in Fig. 7b ) formed by points 8 and 9. Next, figure 8 gives the vector analysis using the data set in the case of figure 6 . Using points 7 and 8 as search vectors, we found that point 9 was coupled to point 8 while point 6 was positioned outside of the orthogonal relation (solid lines) formed by points 7–9 (fig. 8a ). This branch pattern was confirmed using points 8 and 9 as search vectors, since then points 6 and 7 were positioned outside of the orthogonal relation formed by points 8 and 9, and point 7 had larger vector components to points 8 and 9 than point 6 (fig. 8b ).



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 7.—Scatter diagram for the branch pattern of figure 4a. The figure gives the vector analysis of the data set used in figure 5 . The analysis was done using the MVS of the five dimensions which include both the end points (points 5–9) and a spherical center defined as the origin of coordinates. The search vectors (points 6 and 8) drew out points 7 and 9, respectively (a). By changing the search vectors to the points 8 and 9, the points 6 and 7 were positioned outside of the orthogonal relation (solid lines of b) formed by these vectors. In this way, the original topology of figure 4a was confirmed

 


View larger version (11K):
[in this window]
[in a new window]
 
Fig. 8.—Scatter diagram for the branch pattern of figure 4b. The figure gives the vector analysis of the data set used in figure 6 . The analysis was done using the MVS of the five dimensions which include both the end points (points 5–9) and a spherical center defined as the origin of coordinates. The search vectors (points 7 and 8) drew out point 9 (a). By changing the search vectors to points 8 and 9, the points 6 and 7 were positioned outside of the orthogonal relation (solid lines of b) formed by these vectors, and point 7 had larger vector components to points 8 and 9 than point 6. In this way, the original topology of figure 4b was confirmed

 


View larger version (14K):
[in this window]
[in a new window]
 
Fig. 6.—Results of the standard methods for the branch pattern of figure 4b. The figure gives a typical example of data sets in which the maximum-likelihood (a) and neighbor-joining (b) methods could not reconstruct the original topology of figure 4b. Here, point 5 was taken as the outgroup

 
Finally, we evaluated the branch lengths by using the scalar product Ui,j of the two vectors of the ith and jth end points. Here, we note the relation Ui,j = (R1,i2 + R1,j2 - Ri,j2)/2. In the topology of figure 4a, length (1-2) was given by the average value of Ui,j (i = 6, 7; j = 8, 9). Lengths (2-3) and (2-4) were given by subtracting length (1-2) from lengths (1-3) and (1-4), respectively. Lengths (3-i) (i = 6, 7) were given by subtracting length (1-3) from lengths (1-i), respectively. Lengths (4-i) (i = 8, 9) were given in the same way. Figure 9a expresses the branch lengths in the case of figure 7 , which are compared with the true ones (fig. 9b ) given by using the branch- and end-point sequences. Figure 9c gives the result of the ML method, and a similar result was also obtained with the NJ method. In figure 9c, the branch point between points 5 and 6 was taken as the origin of coordinates, since the position of the root was unknown. The obtained branch lengths were very different from both the present ones and the true ones. However, such a comparison for the different topologies may be meaningless. The values along the branches in figure 9 denote the point-to-point distances. In this paper, the distances between point 1 and the end points were assumed to be constant for simplicity. The branch lengths in figure 9a can be revised by readjusting these distances so as to be proportional to those between the outgroup (point 5) and the other points (points 6–9). Table 2 gives the original sequences of point 1 and the end points (points 6–9).



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 9.—Branch lengths for the branch pattern of figure 4a. The branch lengths were evaluated using the data set analyzed in figures 5 and 7 . A distance between two branch (or end) points was assumed to be the sum of the branch lengths connecting these two points. The result (a) is compared with the true tree (b) given by using the branch- and end-point sequences. The values along the branches denote the point-to-point distances. For the maximum-likelihood method (c), the branch point between points 5 and 6 was taken as the origin of coordinates, since the position of the root was unknown. The obtained branch lengths were very different from both the present ones and the true ones due to the erroneous topology (fig. 5a )

 

View this table:
[in this window]
[in a new window]
 
Table 2 The Amino Acid Sequences Used in the Analyses of Figures 5, 7, and 9

 

    Applications to Vertebrates in the {alpha}-Hemoglobin
 TOP
 Abstract
 Introduction
 Theory
 Applications to Vertebrates in...
 Discussion
 Conclusions
 Supplementary Material
 Acknowledgements
 literature cited
 
The present theory was applied to a few examples of vertebrates in the {alpha}-hemoglobin, in which the largest number of species have been registered in the database. The sequences of 255 vertebrate species were obtained from the Protein Identification Resources (PIR) database. First, we selected only species with 141 and 142 sites, whose sequences were readjusted so as to be 141 for the alignment analysis. Second, we excluded species with the same sequences. As a result, we analyzed 185 species from fish to primates. The substitution distances were calculated in a way equivalent to that of the package Protdist (Felsenstein 1991Citation ) with the Dayhoff model, since calculation errors were found in this package. The present results were insensitive to the use of other models, such as the Dayhoff-F, the JTT (Jones, Taylor, and Thornton 1992), and the JTT-F models (Cao et al. 1994).

For the search vectors in the vector analysis, we introduced the CGV Rs to express the representative direction of a group (s) of species in the MVS. It was defined by Rs = {Sigma}'i Ri/Is, where the prime sign of {Sigma}'i denotes the summation about the species of the group s and Is is their total number. Consequently, the scatter diagram for the two groups s and t was expressed in terms of the quantities Zi,s and Zi,t, which are the projection components of the species vectors Ri to the two CGVs of Rs and Rt, respectively.

Relationship Among Artiodactyla, Perissodactyla, and Chiroptera
First, the present theory was applied to a typical problem in which it is difficult to get the correct branch patterns by using standard methods such as the ML, NJ, and PA methods. We picked up eight species of Artiodactyla and Perissodactyla and used a rat as the outgroup. It is well known that the consensus topology is [rat, {(Tylopoda, Ruminantia), (Ceratomorpha, Hippomorpha)}] (Novacek 1992Citation ; Benton 1997Citation ). The standard methods gave an erroneous pattern, [Tylopoda, {Ruminantia, (Ceratomorpha, Hippomorpha)}], with high branch resolutions (see fig. 10 ). Especially, the ML method gave this strange pattern with 100% confidence (fig. 10a ), although this method gave the correct pattern with 100% confidence in the case of the ß-hemoglobin (data not shown). Additional inclusions of other orders resulted in another problem with the topology. As shown in figure 11 , by adding Chiroptera, Tylopoda still made the first clade of the tree in a way similar to that in figure 10 , and furthermore, Chiroptera were included inside of the ungulate branches. Here, the branch resolutions were rather high for the NJ method (fig. 11b ), while they were very low for the ML method (fig. 11a ). It is well known that Chiroptera are located outside the ungulate branches, since it is rather close to Primates (Novacek 1992Citation ; Benton 1997Citation ).



View larger version (22K):
[in this window]
[in a new window]
 
Fig. 10.—Relationship among the suborders of Artiodactyla and Perissodactyla by standard methods. The standard methods of maximum likelihood (a) and neighbor joining (b) gave the strange pattern [Tylopoda, {Ruminantia, (Ceratomorpha, Hippomorpha)}] in spite of high branch resolutions, especially with the 100% confidence in a. Here, eight species of Artiodactyla and Perrisodactyla in the {alpha}-hemoglobin sequences were selected, and a rat was used as the outgroup

 


View larger version (33K):
[in this window]
[in a new window]
 
Fig. 11.—Relationship among Artiodactyla, Perissodactyla, and Chiroptera by standard methods. The maximum-likelihood and neighbor-joining methods were applied to the analysis of the 19 species of Ungulates and the 7 species of Chiroptera, which were full inclusions from the PIR database. The relationship among the suborders of Artiodactyla and Perissodactyla remained erroneous in a pattern similar to that in figure 10 . Furthermore, Chiroptera were included inside of the ungulate branches. When the number of species was increased, the branch resolutions became lower

 
In the present approach, the positions Ri of 27 species (their accession numbers are listed in the Supplementary Material section) of figure 11 were precisely determined by solving the equation of motion in the 26 dimensions. The position Rc of the spherical center was also determined by minimizing the S value of equation (2) and defined as the origin of coordinates. We looked at the CGVs of Artiodactyla and Perrisodactyla as the search vectors and investigated how the species were distributed around these two vectors. As a result, the scatter diagram for Zi,a and Zi,p showed a clear pattern of orthogonal relation between the two groups of Artiodactyla and Perissodactyla (fig. 12a ). This is analogous to the simulation of figure 3b and results in the branch pattern [Chiroptera, {(Tylopoda, Ruminantia), (Ceratomorpha, Hippomorpha)}]. Here, the rat and Chiroptera were distributed as the outgroup with small values of Zi,a and Zi,p, since their vectors were directed at large angles from the CGVs of Artiodactyla and Perissodactyla. Figure 12b gives the result of a full inclusion of 113 eutherians stored in the PIR database. This figure shows that the additional inclusion of other species produces a more stable pattern of the scatter diagram.



View larger version (24K):
[in this window]
[in a new window]
 
Fig. 12.—Relationship among Artiodactyla, Perissodactyla, and Chiroptera by the present method. a shows the result of the vector analysis using the same species as in figure 11 . Looking at the two directions of the center-of-gravity vectors (CGVs) of Artiodactyla and Perisodactyla as the search vectors, we obtained the scatter diagram for Zi,a and Zi,p. The plots represented a clear orthogonal relation between these two orders: they were similar to those of the simulation of figure 3b and resulted in the topology [Chiroptera, {(Tylopoda, Ruminantia), (Ceratomorpha, Hippomorpha)}], since other groupings of the search vectors did not give the orthogonal relation (c). Here, the rat and Chiroptera gave small x and y values, since their vectors were directed at large angles from the CGVs of Artiodactyla and Perissodactyla. It is here noted that the rat and Chiroptera are also directed at both large angles and long distances from each other, although they were located close together in this figure. The stability of the topology was examined by a full inclusion of 113 eutherians stored in the PIR database. As shown in b, the branch pattern of the suborders of Artiodactyla and Perissodactyla was consistent with that of a

 
It is here noted that the orthogonal relation of figure 12a and b is unique for the branch pattern [Chiroptera, {(Tylopoda, Ruminantia), (Ceratomorpha, Hippomorpha)}]. The results (fig. 10 ) of the standard methods supported a grouping of Ceratomorpha, Hippomorpha, and Ruminantia. However, if the topology of figure 10 were true, the angles between Tylopada and Perissodactyla would become larger than those between Ruminantia and Perissodactyla. As a result, an orthogonal relation such as that in figure 12a and b could not be satisfied. The uniqueness was also confirmed by showing the nonorthogonal relation in the other groupings (Ceratomorpha + Ruminantia and Hippomorpha + Tylopoda) for the CGVs (fig. 12c ).

Drawing Out of Lineage of Mammals and Birds
The positions Ri of 185 vertebrates were determined in the 110-dimensional size in which the average error of Ri,j versus Di,j was sufficiently small, i.e., 0.08%. The spherical center Rc of vertebrates was obtained by using equation (2) and defined as the origin of coordinates, from which the species vectors Ri were measured. The CGVs Rm and Rb of mammals and birds were calculated and regarded as their representative directions in the MVS, respectively. Figure 13 shows the angular distribution of the vectors Ri around the CGV Rm. Here, the x-axis denotes the projection components Wi,m = (Zi,m)1/2 of the species vectors Ri to the vector Rm. The angles show roughly a chronological order of evolution of the species, since the branching events of the tree at earlier stages of evolution yield larger angles. The distribution was well fitted to the dotted curve (given by W = R cos({theta}) with a constant radius R = 8.5, which is the average value of |Ri|.



View larger version (27K):
[in this window]
[in a new window]
 
Fig. 13.—Angular distribution of 185 vertebrates around the center-of-gravity vector (CGV) of mammals. The positions Ri of 185 vertebrate species were determined by solving the equation of motion of 110 dimensions. The x-axis denotes the projection components Wi,m = (Zi,m)1/2 of the species vectors Ri to the CGV Rm of mammals. The y-axis denotes the angles ({theta}) between Ri and Rm. The angles give roughly a chronological order of evolution of the species, since evolutionary events at earlier stages give larger angles in the MVS. The scatter plots were well fitted to the dotted curve given by W = R cos({theta}) with the constant radius R = 7.5, which is the average value of |Ri|

 
To obtain the scatter diagram for Zi,m and Zi,b, we shifted the center position toward the directions of the CGVs of mammals and birds by using the position Qc given by a proportional relation,


(9)
with a constant C. Here, the species vectors Ri were redefined as measured from the new position Qc. Here, the C value may be adjusted for the first few species of fishes so as to be placed near the new origin of coordinates, since the vectors of these fishes become orthogonal to the CGVs of Rm and Rb. However, the scatter diagram patterns of the drawn-out species were insensitive to the C values (C = 0 - 0.35).

Figure 14 gives the scatter diagram (with C = 0.2) consisting of the three parts: the central line, the left branch, and the right branch. This figure showed a clear splitting into three groups of the central line and the left and right branches. The central line included fishes, amphibians, and squamates to form the outgroup. Fishes began with a coelacanth, a carp, and a dragonfish and ended with a goldfish. The central line included fishes, newts, frogs, and squamates. The left branch represented the lineage of birds. Crocodilians, turtles, and tuataras belonged to this lineage. In this way, the extant reptilians were clearly classified into the two groups of squamates and others. On the other hand, the right branch represented the mammalian lineage, which began with Monotremata (a platypus and an echidna) and Marsupialia (an opossum) at the base of the branch. They were followed by a guinea pig, a hedgehog, a manatee, and a whale, which are considered to be the core member species of the mammalian radiation. Primates were located at the uppermost part of the right branch.



View larger version (34K):
[in this window]
[in a new window]
 
Fig. 14.—Scatter diagram for mammals and birds in 185 vertebrates. The x and y values denote the projection components Zi,m and Zi,b of the species vectors Ri to the center-of-gravity vectors (CGVs) of mammals and birds, respectively. The scatter plots were clearly decomposed into the three parts, i.e., the left branch of birds, the right branch of mammals, and the central part of the others. The extant reptilians were split into the two branches: squamates belonged to the central part, while the others belonged to the lineage of birds. The central part included fishes, amphibians, and squamates. On the other hand, the right branch began with Monotremata (a platypus and an echidna) and Marsupialia (an opossum) at the base of the mammalian lineage. They were followed by a guinea pig, a hedgehog, a manatee, and a whale, which are considered to be the core member species of the mammalian radiation. Primates were found on the uppermost part of the right branch. There appeared a vacuum at the base of the two branches of birds and mammals. This is regarded as the result of the extinction of reptilians. The species of the central part took small values of Zi,m and Zi,b, since their vectors were directed at large angles from the CGVs of birds and mammals. The species vectors of different classes such as fishes and amphibians were directed at large angles from each other, although these species were closely located in the plots of this figure, since we looked only at the two directions of mammals and birds

 
From a morphological viewpoint, the origin of Monotremata has been regarded as much older than that of the opossum. In the present analysis, however, they had similar values of Zi,m in spite of their mutual long distances. This result is consistent with the proposal of Gemmell and Westerman (1994)Citation that they diversified simultaneously (Cao et al. 1994; Janke et al. 1994). It is interesting to see that Monotremata shifted to the direction (upward) of birds compared with the opossum.

There appeared to be a vacuum at the base of the two branches of birds and mammals. This is regarded as the result of the extinction of reptilians. Finally, it is noted that the line along the plots of the bird lineage is not orthogonal to that of the mammalian lineage (see the solid lines of fig. 14 ). This suggests that substitutions of higher orders than the SSM may exist in the sequences between birds and mammals. The effect of these substitutions may be related to the well known underestimation of the molecular clock between mammals and birds (Zuckerkandl and Pauling 1965Citation ). This problem will be discussed in detail elsewhere.

Relationship Among Birds, Squamates, and Other Extant Reptilians
The phylogeny of reptilians has been one of the most important problems in vertebrate evolution. From a morphological point of view, the number of temporal openings in the skull has been used as an important clue to the classification of species (Romer 1966Citation ; Benton 1997Citation ): mammals have a single opening, the turtle lacks a temporal opening, and the other extant reptilians have two openings. In this way, birds and crocodilians formed one group, tuataras and squamates formed another group, and the turtle was placed at the base of the tree (fig. 15a ). On the other hand, the concept of a sister group of birds and mammals was proposed (Gardiner 1982Citation ; Løvtrup 1985Citation ) and was supported by molecular analyses using the sequences of the myoglobin and ß-hemoglobin (Goodman, Miyamoto, and Czelusniak 1987Citation ; Bishop and Friday 1988Citation ). In recent molecular analyses (Gorr, Mable, and Kleinschmidt 1998Citation ; Hedges and Poling 1999Citation ), however, the turtle was joined with crocodilians, and squamates were placed at the base of the reptilian tree. If these results are accurate, the traditional classification based on the number of temporal openings becomes meaningless for phylogeny, and the morphological and paleontological evidence then remains unclear.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 15.—Relationship among birds, squamates, and other extant reptilians. a, The morphological and paleontological view. b and c, Recent results (Gorr, Mable, and Kleinschmidt 1998Citation ; Hedges and Poling 1999Citation ) with use of the maximum-likelihood method, respectively. b and c show a Turtle-crocodilian sister group, different from a. The present result (d) is basically similar to them, but did not form the subgroup within reptilians, since all of tuataras, turtles, and crocodilians were strongly coupled with birds

 
In the present approach, squamates were clearly isolated from the other extant reptilians, whose vectors were directed at small angles from the CGV of birds. On the other hand, the vectors of squamates, amphibians, and fishes were directed at large angles from the CGVs of birds and mammals so that they may have had small values of Zi,m and Zi,b, although they were also directed at large angles from each other. Figure 14 shows that the first coupling to birds is the tuatara, the second is the turtle, and the third is crocodilians (their accession numbers are listed in the Supplementary Material section). Then, we have the topology of figure 15d, in contrast to the morphological view (fig. 15a ). This figure was also compared with the recent results of figure 15b (Gorr, Mable, and Kleinschmidt 1998Citation ) and c (Hedges and Poling 1999Citation ) by the ML method, which shows a sister group between crocodilians and turtles. The present analysis did not give this subgroup (fig. 15d ), since birds were always and strongly drawn out by taking tuataras, turtles, or crocodilians in place of birds as the search vector. This situation was made clearer by the scatter diagram for the CGVs of crocodilians and birds (fig. 16 ): the turtle was coupled more strongly with birds than crocodilians, in a way quite similar to that shown in figure 3b. Here, a large value (C = 0.35) was used in order to enlarge the relationship among crocodilians, turtles, tuataras, and birds.



View larger version (24K):
[in this window]
[in a new window]
 
Fig. 16.— Relationship among birds, tuataras, turtles, and crocodilians. To confirm the present topology of figure 15d, we examined the scatter diagram for the CGVs of crocodilians and birds. The turtle was more strongly coupled with birds than crocodilians, so turtles and crocodilians did not form a subgroup, in contrast to recent results (Gorr, Mable, and Kleinschmidt 1998Citation ; Hedges and Poling 1999Citation )

 

    Discussion
 TOP
 Abstract
 Introduction
 Theory
 Applications to Vertebrates in...
 Discussion
 Conclusions
 Supplementary Material
 Acknowledgements
 literature cited
 
Molecular evolution has traditionally been studied with cluster models (Sokal and Michener 1958Citation ; Eck and Dayhoff 1966Citation ; Fitch and Margoliash 1967Citation ; Sokal and Rohlf 1981Citation ; Saitou and Nei 1987Citation ). These models directly infer phylogeny by using only the substitution distances of the scalar. In the present approach, however, the phylogeny is inferred by constructing the MVS, which includes the information of not only the scalar, but also the vector. The ML method is very interesting in that amino acid substitutions are explicitly taken into account in the algorithm of phylogeny inference. It is expected that the ML method can give better results of the inference than the NJ method, since the former includes a much higher degree of freedom of variables than the latter. In practical applications with the use of computer simulations, however, the latter was better than the former. In the ML method, all of the possible alignments at the respective branches are summed up in calculations of the ML value. The number of this summation increases rapidly at branches toward the past (root), although each of these branches occupied one sequence in the actual process of the evolutionary history. In the ML method, it is implicitly assumed that the amino acid sequences of the extant species at the present time are a realization of the most probable (likely) course among the evolutionary processes based on the stochastic process. Events of deviations from this course may exist in practice. It is not necessarily self-evident whether they correspond to only the parallel and convergent evolutions pointed out by Cao et al. (1998a)Citation . On the other hand, in the present approach, the ML method is applied only to get the substitution distances, which are used as the input data to construct the MVS.

The MVS was shown to be useful for phylogeny inference by the vector analysis based on the orthogonal relation, in contrast to the multidimensional space (Grantham et al. 1981Citation ) which is directly expressed in terms of the amino acid or codon frequencies. The solution of the kinetic equation for the MVS always converges on the ground state by the N = (I - 1)-dimensional size. However, when I was larger than L, we found that the equation of motion could be solved with good accuracy by the L-dimensional size in place of (I - 1). In fact, the L values for 113 eutherians and 185 vertebrates were 80 and 110, respectively, and the former values were much smaller than the latter. In this way, the degree of freedom of variables in the MVS became equal to that in the original space of vectors having the amino acid sequences as their components. This means that the MVS is a transformation of the original space, although the components of the vectors are quite different from each other. In this way, we have to analyze a large number of species compared with that of variable sites in order to get a reliable topology. In fact, in the case of 27 species of ungulates, Chiroptera, and rat of figure 12a, a part of the outside species (Chiroptera) slightly fluctuated upward on the y-axis (toward the CGV of Perissodactyla). This fluctuation may be connected with reason that the size of dimensions (26) was smaller than the number of variable sites (35), since the results (figs. 12b, 14, and 16 ) with the larger numbers of species did not bring such a fluctuation.

The standard methods gave a strange topology in the relationship among the four suborders of Perissodactyla and Artiodactyla in spite of high branch resolutions (fig. 10 ). Table 3 gives the distance matrix for a tapir, a horse, a bovine, an alpaca, and a rat. In this matrix, it is natural that the distance between the tapir and the horse (10.1) is the shortest. However, the distance between the bovine and the alpaca of the same order (17.3) is larger than that between the bovine and horse of different orders (14.1). Such a complicated relationship among these four species seems phenomenologically to be a main reason for the strange topology of figure 10 . As another possible reason, the number of eight species sampled in figure 10 might be too small to obtain the correct topology. To explain this reason, let us consider a simple case of three species with the grouping {1, (2, 3)}. Then, we have a possibility of getting the opposite coupling, {(1, 2), 3}, by additionally including another species (4) which is located intermediate between points 1 and 2. It was therefore expected that the correct topology might be obtained by increasing the number of species. However, the situation became worse, since Chiroptera were furthermore included in the inside of the ungulate branches and the branch resolutions became very low (fig. 11 ). Another example of the strange topology was reported in the ML analysis of the relationship among primates, ferungulates, and rodents with use of the ND1 mitochondrial protein by Cao et al. (1998a)Citation , who suggested the possibility of convergent or parallel evolution. We investigated this relationship in the {alpha}-hemoglobin by using the ML and NJ methods, which gave the same strange pattern, {ungulates, (rodents, primates)} with the ND1, although the branch resolutions were very poor. Therefore, phenomena similar to the ND1 may exist in the {alpha}-hemoglobin.


View this table:
[in this window]
[in a new window]
 
Table 3 The Distance Matrix for Suborders of Ungulates

 
The present approach clearly produced the normal pattern for Artiodactyla and Perissodactyla (fig. 12 ). The scatter diagrams did not change their pattern with the full inclusion of 113 eutherians stored in the PIR database. This absence of change is due to a feature of the present approach in which the positions of species can be precisely determined prior to and separate from the vector analysis for the phylogeny inference. With an increasing number of species, the vectors occupy more specific points of the MVS for the sake of a high-density relationship among the relative distances, and then the vector analysis can be also done more precisely. Finally, the strange topology given by the standard methods did not appear in the present vector analysis for rodents, ungulates, and primates, since rodents were found to form the first clade in the eutherian phylogeny, as seen in figure 14 .

In this paper, we tried to obtain statistical confidence by increasing the number of species. Another approach is to fluctuate the distances themselves. In the present examples of applications, however, the groupings of species in the scatter diagrams were very clear. When a grouping pattern is complicated, a more quantitative analysis will be necessary for phylogeny inference. This problem will be discussed elsewhere.


    Conclusions
 TOP
 Abstract
 Introduction
 Theory
 Applications to Vertebrates in...
 Discussion
 Conclusions
 Supplementary Material
 Acknowledgements
 literature cited
 
The evolutionary process was represented in the MVS as the developments of branch vectors whose end points corresponded to the extant species. A new algorithm was proposed to precisely determine the positions of species by solving the equation of motion. A vector analysis was presented to infer the phylogeny from these positions. A validity of the analysis was demonstrated using computer simulations of molecular evolution. The present method consistently reconstructed the correct topologies, in contrast to standard methods. The efficacy of the method was shown through applications to typical examples of vertebrates in which the standard methods give strange topologies or low values of the branch resolution. Furthermore, the application to recent topics gave a new branch pattern for the reptilian phylogeny. A feature of the present approach is that the MVS is constructed prior to and separate from the vector analysis for the phylogeny inference. As the number of species increases, the vectors Ri occupy more specific positions in the MVS individually. Consequently, the vector analysis can be done more precisely, since it is then possible to substantially transform the original abstract space into the MVS without decreasing the degree of freedom of variables. The efficacy of the present theory will be clarified by applications to many fields in molecular evolution where it remains difficult to infer phylogeny.


    Supplementary Material
 TOP
 Abstract
 Introduction
 Theory
 Applications to Vertebrates in...
 Discussion
 Conclusions
 Supplementary Material
 Acknowledgements
 literature cited
 
The species and accession numbers used in figures 11, 12, and 16 are as follows: myotis bat (A25357), pallid bat (A29702), tomb bat (S28934), Australian bat (S20277), California bat (A26640), Brazilian bat (A29391), Indian bat (A29392), Indian rhinoceros (A26543), white rhinoceros (HARNW), Brazilian tapir (HATPI), mountain zebra (HAHOZ), common zebra (HAHOZC), donkey (HAHOD), kulan (HAHOK), horse (HAHO), moose (HAEKN), reindeer (S13481), sheep (S31853), goat (HAGT), bovine (HABO), yak (HAYA1), gayal (HABOG), alpaca (C25478), llama (HALL), vicuna (A25478), camel (HACMA), rat (HART1), crocodile (HAAK), alligator (HAAQ), caiman (HACQ), tuatara (HATJA), turtle (HATTAP).


    Acknowledgements
 TOP
 Abstract
 Introduction
 Theory
 Applications to Vertebrates in...
 Discussion
 Conclusions
 Supplementary Material
 Acknowledgements
 literature cited
 
We are grateful to Prof. Hiroshi Furutani (Kyoto University of Education), Prof. Takeshi Agatsuma and Dr. Teruaki Watabe (Kochi Medical School), and Naruya Saitou (National Institute of Genetics, Japan) for useful discussions.


    Footnotes
 
Fumio Tajima, Reviewing Editor

1 Keywords: phylogeny inference multidimensional space equation of motion vector analysis hemoglobin vertebrates Back

2 Address for correspondence and reprints: Yasuhiro Kitazoe, Center of Medical Information Science, Kochi Medical School, Nankoku, Kochi 783-8505, Japan. kitazoey{at}med.kochi-ms.ac.jp Back


    literature cited
 TOP
 Abstract
 Introduction
 Theory
 Applications to Vertebrates in...
 Discussion
 Conclusions
 Supplementary Material
 Acknowledgements
 literature cited
 

    Adachi, J., and M. Hasegawa. 1996. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. Comput. Sci. Monogr. 28:1–150.

    Benton, M. J. 1997. Vertebrate paleontology. 2nd edition. Chapman and Hall, New York.

    Bishop, M. J., and A. E. Friday. 1988. Estimating the interrelationships of tetrapod groups on the basis of molecular sequence data. Pp. 33–58 in M. J. Benton, ed. The phylogeny and classification of the tetrapods. Clarendon Press, Oxford.

    Cao, Y., J. Adachi, A. Janke, S. Pääbo, and M. Hasegawa. 1994. Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene. J. Mol. Evol. 39:519–527.[ISI][Medline]

    Cao, Y., A. Janke, P. J. Wadell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Pääbo, and M. Hasegawa. 1998a. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. J. Mol. Evol. 47:307–314.

    Cao, Y., P. J. Waddel, N. Okada, and M. Hasegawa. 1998b. The complete mitochondrial DNA sequence of the shark Mustelus manazo: evaluating rooting contradictions to living bony vertebrates. Mol. Biol. Evol. 15:1637–1646.

    David, H. B., and N. G. James. 1988. Quasiparticle model for nuclear dynamics studies: ground-state properties. Phys. Rev. C 38:1870–1878.

    Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. Pp. 345–352 in M. O. Dayhoff, ed. Atlas of protein sequence and structure, Vol. 5. National Biomedical Research Foundation, Washington, DC.

    Eck, R. V., and M. O. Dayhoff. 1966. Atlas of protein sequence and structure. National Biomedical Research Foundation, Silver Spring, Md.

    Felsenstein, J. 1993. PHYLIP (phylogeny inference package). Version 3.5c. Distributed by the author, Department of Genetics, University of Washington, Seattle.

    Fitch, W. M., and E. Margoliash. 1967. Construction of phylogenetic trees. Science 155:279–284.

    Gardiner, B. G. 1982. Tetrapod classification. Zool. Linn. Soc. 74:207–232.[ISI]

    Gemmell, N. J., and M. Westerman. 1994. Phylogenetic relationships within the class Mammalia: a study using mitochondrial 12S RNA sequences. J. Mamm. Evol. 2:3–23.

    Goodman, M. 1963. Serological analysis of the systematics of recent hominoids. Hum. Biol. 35:377–436.[ISI][Medline]

    Goodman, M., M. M. Miyamoto, and J. Czelusniak. 1987. Pattern and process in vertebrate phylogeny revealed by coevolution of molecules and morphologies. Pp. 141–176 in C. Patterson, ed. Molecules and morphology in evolution: conflict or compromise? Cambridge University Press, Cambridge, England.

    Gorr, T. A., B. K. Mable, and T. Kleinschmidt. 1998. Phylogenetic analysis of reptilians: trees, rates, and divergences. J. Mol. Evol. 47:471–485.[ISI][Medline]

    Grantham, R., C. Gautier, M. Jacobzone, and R. Mercier. 1981. Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Res. 9:43r–74r.

    Hammersley, J. M., and D. C. Handscomb. 1964. Monte Carlo methods. Methuen, London.

    Hedges, S. B. 1994. Molecular evidence for the origin of birds. Proc. Natl. Acad. Sci. USA 91:2621–2614.

    Hedges, S. B., and L. L. Poling. 1999. A molecular phylogeny of reptiles. Science 283:998–1001.

    Janke, A., G. Feldmaier-Fuchs, W. K. Thomas, A. von Haeseler, and S. Pääbo. 1994. The marsupial mitochondrial genome and the evolution of placental mammals. Genetics 137:243–256.

    Janke, A., N. J. Gemmell, G. Feldmaier-Fuchs, A. von Haeseler, and S. Pääbo. 1996. The complete mitochondrial genome of a monotreme, the platypus (Ornithorhynchus anatinus). J. Mol. Evol. 42:153–159.[ISI][Medline]

    Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8: 275–282.

    Kumar, S., and S. B. Hedges. 1998. A molecular timescale for vertebrate evolution. Nature 392:917–920.

    Løvtrup, S. 1985. On the classification of the taxon Tetrapoda. Syst. Zool. 4:463–470.

    Novacek, M. J. 1992. Mammalian phylogeny: shaking the tree. Nature 356:121–125.

    Romer, A. S. 1966. Vertebrate paleontology. University of Chicago Press, Chicago.

    Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.[Abstract]

    Sarich, V. M., and A. C. Wilson. 1967. Immunological time scale for hominid evolution. Science 158:1200–1203.

    Sokal, R. R., and C. D. Michener. 1958. A statistical method for evaluating systematic relationship. Univ. Kans. Sci. Bull. 28:1409–1438.

    Sokal, R. R., and F. J. Rohlf. 1981. Biometry. 2nd edition. Freeman, San Francisco.

    Springer, M. S., G. C. Gregory, O. Madsen, W. W. DeJong, V. G. Waddell, H. M. Amrine, and M. J. Stanhope. 1997. Endemic African mammals shake the phylogenetic tree. Nature 388:61–64.

    Strimmer, K., and A. von Haeseler. 1996. A quartet maximum-likelihood method for reconstructing tree structure. Mol. Biol. Evol. 13:964–969.[Free Full Text]

    Swofford, D. L. 1998. PAUP. Version 4. Sinauer, Sunderland, Mass.

    Zukerkandl, E., and L. Pauling. 1965. Evolutionary divergence and convergence in proteins. Pp. 97–166 in V. Bryson and H. J. Vogel, eds. Evolving genes and proteins. Academic Press, New York.

Accepted for publication January 16, 2001.