*Department of Biological Sciences, Tokyo Metropolitan University, Tokyo;
Department of Biology, Arizona State University, Tempe
Abstract |
---|
Introduction |
---|
For these reasons, the use of LogDet-based distance methods (e.g., Lockhart et al. 1994; Gu and Li 1996; Yang and Kumar 1996) is often advocated for phylogenetic analyses over methods such as the Tamura and Nei (1993) method. The formula for estimating the Tamura-Nei distance is indeed derived under the homogeneity assumption and assumes a complex, but specific, model of nucleotide substitution. The LogDet-based methods are considered to be superior because they do not require these assumptions. However, the LogDet distances are paralinear, i.e., they are expected to increase linearly with time but are not designed to measure the actual number of substitutions (Lockhart et al. 1994). For instance, it is known that the LogDet method will overestimate evolutionary distances if the four bases do not occur with equal frequency in the nucleotide sequences compared, even when the evolutionary process is homogeneous (Swofford et al. 1996). In contrast, the Tamura-Nei method measures the actual number of substitutions irrespective of the base frequency bias when the evolutionary process is homogeneous. Its underlying model is more general than the Hasegawa, Kishino, and Yano (1985; HKY) model and is known to adequately describe patterns of DNA sequence evolution for many genes (e.g., Tamura 1994; Kumar 1996; Suchard, Weiss, and Sinsheimer 2001). Thus, both the LogDet and Tamura-Nei methods have certain desirable and certain undesirable properties, and it is not clear which of these has the more adverse impact on distance estimation in actual data analyses.
We have therefore conducted computer simulations and empirical data analyses to compare the performance of the LogDet-based methods and the Tamura-Nei method for estimating evolutionary distances under a variety of conditions. In the following paragraphs, we begin by describing a simple ad hoc modification of the Tamura-Nei method that relaxes the assumption of a homogeneous substitution pattern between lineages.
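As a point of reference for the modification described next, the original Tamura-Nei (1993) estimator can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function name `tamura_nei_distance`, the A, T, C, G index order, and the choice of averaging the base frequencies over the two sequences are assumptions made for the example. The input is a 4 × 4 matrix whose element (i, j) is the proportion of sites with nucleotide i in one sequence and nucleotide j in the other.

```python
import math

BASES = "ATCG"  # assumed index order for the 4x4 matrix


def tamura_nei_distance(F):
    """Tamura-Nei (1993) distance from a 4x4 matrix of site-pattern proportions.

    F[i][j] is the proportion of sites with base i in sequence 1 and base j
    in sequence 2; the entries of F are assumed to sum to 1.
    """
    A, T, C, G = (BASES.index(b) for b in "ATCG")
    # Base frequencies averaged over the two sequences (row and column sums).
    g = [(sum(F[i]) + sum(row[i] for row in F)) / 2.0 for i in range(4)]
    gR = g[A] + g[G]  # purine frequency
    gY = g[T] + g[C]  # pyrimidine frequency
    P1 = F[A][G] + F[G][A]  # transitional differences between purines
    P2 = F[T][C] + F[C][T]  # transitional differences between pyrimidines
    Q = 1.0 - sum(F[i][i] for i in range(4)) - P1 - P2  # transversions
    kR = 2.0 * g[A] * g[G] / gR
    kY = 2.0 * g[T] * g[C] / gY
    kV = 2.0 * (gR * gY - g[A] * g[G] * gY / gR - g[T] * g[C] * gR / gY)
    return (-kR * math.log(1.0 - P1 / kR - Q / (2.0 * gR))
            - kY * math.log(1.0 - P2 / kY - Q / (2.0 * gY))
            - kV * math.log(1.0 - Q / (2.0 * gR * gY)))
```

When all four base frequencies are 1/4 and the two transition classes are equally frequent, this expression reduces to the Kimura two-parameter distance, which gives a convenient sanity check on the implementation.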
Modified Tamura-Nei Distance for Heterogeneous Substitution Pattern |
---|
|
We applied our modification to the gamma version of the Tamura-Nei distance as well. In this case,
|
Similarly, this modification can be applied to essentially any distance method that accounts for the base frequency bias. For example, the Tamura (1992) formula becomes
|
Materials and Methods |
---|
Using P1 and P2 for descendant sequences 1 and 2, respectively, we obtain the matrix F, in which an element Fij contains the proportion of sites showing nucleotide i in sequence 1 and nucleotide j in sequence 2. F is given by
![]() |
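The matrix F can be tabulated directly from a pair of aligned sequences, and an expected F can be computed from transition probability matrices. The sketch below is illustrative only. The function names, the A, T, C, G index order, and in particular the form F[i][j] = Σ_k π_k · P1[k][i] · P2[k][j] (with π the ancestral base frequencies and the rows of each P indexed by the ancestral state) are assumptions, because the original equation is not reproduced in this text.

```python
BASES = "ATCG"  # assumed index order


def divergence_matrix(seq1, seq2):
    """Observed F: proportion of sites with base i in seq1 and base j in seq2."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    counts = [[0] * 4 for _ in range(4)]
    n = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in BASES and b in BASES:  # skip gaps and ambiguity codes
            counts[BASES.index(a)][BASES.index(b)] += 1
            n += 1
    if n == 0:
        raise ValueError("no comparable sites")
    return [[c / n for c in row] for row in counts]


def expected_divergence_matrix(pi_anc, P1, P2):
    """Expected F for two lineages descending from one ancestor (assumed form).

    pi_anc: ancestral base frequencies; P1, P2: 4x4 transition probability
    matrices from the ancestor to descendants 1 and 2 (rows = ancestral state).
    """
    return [[sum(pi_anc[k] * P1[k][i] * P2[k][j] for k in range(4))
             for j in range(4)]
            for i in range(4)]


def p_distance(F):
    """Proportion of differing sites: one minus the trace of F."""
    return 1.0 - sum(F[i][i] for i in range(4))
```

Working with the expected F rather than a sampled one corresponds to sequences of infinite length, which separates the bias of an estimator from its sampling error.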
Using the matrix F, we then estimated the number of substitutions per site by the original Tamura-Nei method, its modified version (eq. 1), and the LogDet methods. Among the variations of the LogDet method, we selected the following two formulas for this study. The first formula is the original LogDet method suggested by Lockhart et al. (1994) (see also Gu and Li 1996), which is also available in PAUP* (Swofford 2001).
|
|
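Because the two LogDet formulas are not reproduced in this text, the following Python sketch uses one commonly cited normalized form, d = -(1/4)[ln det F - (1/2) ln(∏ f1 · ∏ f2)], where f1 and f2 are the base frequencies of the two sequences. It should be read as an illustration of the general LogDet idea and not as equation (10) or (11) of this study; the function names and the small determinant helper are assumptions made for the example.

```python
import math


def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [list(map(float, row)) for row in M]
    n, d = len(A), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[p][c] == 0.0:
            return 0.0
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d  # a row swap flips the sign of the determinant
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d


def logdet_distance(F):
    """Normalized LogDet distance from a 4x4 matrix of site-pattern proportions."""
    f1 = [sum(row) for row in F]                       # frequencies in sequence 1
    f2 = [sum(row[j] for row in F) for j in range(4)]  # frequencies in sequence 2
    dF = det(F)
    if dF <= 0.0 or min(f1 + f2) <= 0.0:
        raise ValueError("LogDet distance is undefined for this matrix")
    correction = 0.5 * (sum(math.log(x) for x in f1)
                        + sum(math.log(x) for x in f2))
    return -0.25 * (math.log(dF) - correction)
```

With this normalization the distance is zero for identical sequences, and when all base frequencies are 1/4 under a Jukes-Cantor process it equals the Jukes-Cantor distance, consistent with the statement that the LogDet approach measures the number of substitutions only when base frequencies are equal.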
Empirical Data Analyses for Human and Mouse Genes
We also examined the performance of the Tamura-Nei method, equation (1), and the LogDet methods in estimating the evolutionary distance for 3,789 human and mouse nuclear cDNA sequences. For this purpose, we used fourfold-degenerate sites, which are known to have evolved with a heterogeneous pattern of change in more than 40% of genes (Kumar and Gadagkar 2001; Kumar and Subramanian 2002). A site was considered fourfold degenerate only if it was fourfold degenerate in both the human and the mouse gene. This data set provides us with an opportunity to examine whether the results obtained in the computer simulations are representative of those in real data analyses, where the number of sites is finite.
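The site-selection rule can be sketched as follows. This is a minimal example that assumes the standard nuclear genetic code and a pair of codon-aligned, in-frame, ungapped coding sequences; the function name `shared_fourfold_sites` is hypothetical. In the standard code, the third position is fourfold degenerate exactly when the first two bases of the codon are one of the eight prefixes listed.

```python
# Codon prefixes whose third position is fourfold degenerate in the standard
# genetic code: Leu (CTN), Val (GTN), Ser (TCN), Pro (CCN), Thr (ACN),
# Ala (GCN), Arg (CGN), Gly (GGN).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def shared_fourfold_sites(cds1, cds2):
    """Return 0-based positions that are fourfold degenerate in both sequences."""
    if len(cds1) != len(cds2):
        raise ValueError("coding sequences must be aligned to equal length")
    s1, s2 = cds1.upper(), cds2.upper()
    sites = []
    for start in range(0, len(s1) - 2, 3):  # an incomplete final codon is ignored
        both_fourfold = (s1[start:start + 2] in FOURFOLD_PREFIXES
                         and s2[start:start + 2] in FOURFOLD_PREFIXES)
        third_ok = s1[start + 2] in "ACGT" and s2[start + 2] in "ACGT"
        if both_fourfold and third_ok:
            sites.append(start + 2)
    return sites
```

Requiring the prefix test to pass in both sequences is what excludes codons such as CTG versus TTG, where the site is fourfold degenerate in only one of the two genes.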
Results |
---|
In the case of homogeneous and nonstationary patterns in the two lineages, the P(dt)s are equal between the two lineages, but the nucleotide frequencies change through time from the ancestral to the descendant sequences (fig. 2E and G). In this case, none of the methods gives a perfect estimate. The efficiency of the LogDet methods depends strongly on the ancestral base frequencies: it is good only if the ancestral frequencies are equal (fig. 2F). The Tamura-Nei method is much less sensitive to the direction of the base-frequency change (fig. 2F and H).
Heterogeneous Substitution Patterns Under the HKY Model
We simulated heterogeneous evolutionary patterns under the HKY model using different P(dt)s in the two lineages. We assumed $\pi_A = \pi_T = \pi_C = \pi_G = 0.25$ for the first lineage and $\pi_A = \pi_T = 0.05$ and $\pi_C = \pi_G = 0.45$ for the other lineage, with the transition/transversion rate ratio $\kappa = 4$ for both lineages (fig. 3A). This simulates the evolution of a nuclear gene in which the substitution pattern in the new lineage has shifted toward a G+C-rich base composition. For the ancestral sequence, we assume that the four nucleotides occur with equal frequency (the same as in the first sequence), which means that the second sequence is evolving with a different substitution pattern. The results show that the bias of the estimated number of substitutions per site (d) is rather small for all the methods (fig. 3B and C). However, the bias of dTN grows as d increases beyond 0.5. It is clear that our modification in equation (1) corrects this bias very well and gives estimates (dMTN) better than the dLD obtained by the LogDet method for any value of d. Nevertheless, the dMLD obtained by equation (11) is even better than dMTN.
In the above scenario, we assumed that the ancestral base frequencies were the same as those in the first lineage and were equal to 1/4 for every nucleotide. We now examine the other possibility, i.e., that the starting base frequencies are the same as those of the second lineage and are not equal (fig. 3G). These simulation conditions produce marked differences among the methods. Figure 3H shows that the Tamura-Nei method and equation (1) work well and give results similar to those under the previous conditions (fig. 3B). In contrast, the LogDet method and equation (11) substantially overestimate d over its entire range (fig. 3I). As in the case of homogeneous and stationary substitution patterns, the bias of dMLD from the true d value is much smaller than that of dLD, but the linearity with the true d is no longer maintained.
Homogeneous and Heterogeneous Substitution Patterns Under the Unrestricted Model
The above simulations show that equation (1) can efficiently correct the estimation bias caused by a heterogeneous substitution pattern, whereas the efficiency of the LogDet method and equation (11) depends strongly on the base frequencies of the ancestral sequence. However, the pattern of nucleotide substitution was assumed to follow the HKY model with specific sets of parameters. To obtain more general results, we next examined the performance of these methods in computer simulations with the unrestricted model of nucleotide substitution, for which the rate parameters were determined randomly. This model does not even assume the reversibility of the evolutionary process and has the maximum possible number of parameters.
In a given lineage, each element of P(dt) was randomly chosen from the range 1–10. The resultant P(dt) was then normalized so that it represented an average rate of $10^{-6}$ substitutions per site. We selected 1,000 different P(dt)s to examine the expected distance estimates for a true distance equal to 1. Because the number of possible matrices is $10^{12}$, the probability that a given matrix was identical to another chosen randomly was virtually zero. For the cases of heterogeneous substitution pattern, P(dt) was selected randomly for the evolution of each sequence separately. For a given P(dt), the equilibrium base frequencies were obtained by multiplying P(dt)s until the long-run distribution of the Markov chain was obtained, i.e., until all the elements within a column of $P^{\infty} = P(dt)^{\infty}$ became equal at the level of computational precision.
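The procedure in this paragraph can be sketched in Python as below. Several details are assumptions made for illustration and are not stated in the text: the twelve off-diagonal weights are drawn as integers from 1 to 10, the matrix is scaled so that the unweighted mean per-row probability of change equals the target rate, and the long-run distribution is reached by repeated squaring with row renormalization to limit rounding drift. The function names are hypothetical.

```python
import random


def random_transition_matrix(rng, rate=1e-6):
    """Random unrestricted 4x4 transition probability matrix P(dt)."""
    # Twelve off-diagonal weights, each an integer in 1..10 (assumption).
    W = [[0.0 if i == j else float(rng.randint(1, 10)) for j in range(4)]
         for i in range(4)]
    mean_out = sum(sum(row) for row in W) / 4.0
    scale = rate / mean_out  # mean per-row probability of change becomes `rate`
    P = [[W[i][j] * scale for j in range(4)] for i in range(4)]
    for i in range(4):
        P[i][i] = 1.0 - sum(P[i])  # the diagonal is still 0 here
    return P


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def normalize_rows(M):
    return [[x / s for x in row] for row, s in ((r, sum(r)) for r in M)]


def equilibrium_frequencies(P, tol=1e-12, max_squarings=200):
    """Square P repeatedly until every column is constant; return one row."""
    M = [row[:] for row in P]
    for _ in range(max_squarings):
        spread = max(max(M[i][j] for i in range(4)) - min(M[i][j] for i in range(4))
                     for j in range(4))
        if spread < tol:
            break
        M = normalize_rows(matmul(M, M))
    total = sum(M[0])
    return [x / total for x in M[0]]
```

For example, `equilibrium_frequencies(random_transition_matrix(random.Random(1)))` returns four positive frequencies that sum to 1. Because every off-diagonal weight is at least 1, the chain is irreducible and the iteration converges; with a rate of $10^{-6}$ this takes on the order of a few dozen squarings.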
We first examined the case of homogeneous and stationary substitution patterns. In this case, the ancestral base frequencies were equal to the equilibrium frequencies for the descendant P(dt) (fig. 4A). Figure 4B shows the distribution of expected distance estimates obtained by the Tamura-Nei method. The Tamura-Nei method underestimates the evolutionary distance by about 3% on average, but these estimates are quite close to the true value. Note that equation (1) always gives exactly the same results as the Tamura-Nei method when the substitution pattern is homogeneous (fig. 4C). On the other hand, the overestimation of d by the LogDet method is quite serious. The estimated d values are sometimes more than 50% higher than the true value, with an average bias exceeding 15% for dLD (fig. 4D). This overestimation is considerably reduced by using equation (11); the average bias becomes 4% for dMLD (fig. 4E). At any rate, neither the LogDet method nor equation (11) is suitable for estimating the number of substitutions if a base frequency bias exists.
When the ancestral base frequencies are equal, all the methods seem to give reasonably good estimates of d (fig. 4G–J), suggesting that the estimation of d is not very sensitive to violation of the assumption of a homogeneous substitution pattern as long as the ancestral base frequencies are all 1/4 and the difference in substitution pattern is not extreme. The bias of the estimates is practically negligible for all the methods because the sampling error is much larger unless the number of sites examined is very large. Note that the substitution process is not stationary in this case because the initial base frequencies are almost always different from the equilibrium frequencies of the descendant lineages. The performance of the Tamura-Nei method is clearly insensitive to violations of the underlying assumption of either homogeneity or stationarity of nucleotide substitution. The LogDet method and equation (11) produce biased but still better estimates than the Tamura-Nei method and equation (1).
However, more than 80% of the human and mouse genes show significantly unequal base frequencies, so the above scenario of equal base frequencies in the ancestral lineage is much less common. In such cases, the LogDet method frequently overestimates d substantially (fig. 4N). This is understandable because the base frequency bias has already been shown to be a problem for the LogDet method (see figs. 2 and 3). The overestimation of d by equation (11) is much smaller than that by the LogDet method (fig. 4O). On the other hand, the performance of the Tamura-Nei method and equation (1) is influenced only slightly by the ancestral base frequencies even when the substitution pattern is heterogeneous (fig. 4L–M).
Estimation of Evolutionary Distance Between Human and Mouse Genes
The performance of the Tamura-Nei and LogDet methods and equations (1) and (11) was examined in estimating evolutionary divergences at fourfold-degenerate sites in 3,789 human and mouse nuclear genes. This analysis provides us with an opportunity to examine the usefulness of the computer simulation results in the context of real data analysis, in which the number of sites is finite. In figure 5A–D, dTN, dMTN, dLD, and dMLD are plotted against the respective p-distances. Just as in the computer simulations, dTN and dMTN are almost identical, except that the extremely biased dTN values (indicated by arrowheads in fig. 5A) are efficiently corrected by equation (1) (fig. 5B). Genes showing these highly biased distance estimates are small (usually <100 bp), which indicates that equation (1) works well even for short sequences and thus should be preferred over the original Tamura-Nei method. Furthermore, the spreads of the dLD and dMLD values are much wider than those for dTN and dMTN, with dMTN showing the least variation among genes for the same p-distance (fig. 5A–D).
Discussion |
---|
On the other hand, we have found that the Tamura-Nei method gives reasonably good estimates of d in most cases, irrespective of the substitution pattern and its homogeneity among lineages. Especially when d is not large, say d < 0.5, heterogeneity of the substitution pattern is of little concern. In fact, standard methods for estimating d are generally robust against violation of the underlying model; even the simplest Jukes and Cantor (1969) method works well in many cases (Nei and Kumar 2000, pp. 33–45) when d is not large. Furthermore, the Tamura-Nei method and its modified version presented here have advantages that are not available in the LogDet methods. First, these methods can be used to estimate the numbers of transitions and transversions separately, facilitating the estimation of the transition-transversion ratio. The transition-transversion ratio is not only a fundamental parameter for the evolution of DNA sequences but also a useful parameter for evaluating the reliability of the estimation of d (Tamura 2000). Second, a gamma version is available for these methods to take rate variation among sites into account. Because the assumption of a constant rate among sites rarely holds, it is very important to account for site-to-site rate variation (Nei and Kumar 2000, p. 43). However, the estimation bias of the original Tamura-Nei method can be very large in some extreme, but biologically realistic, cases, as often observed in animal mitochondrial DNA (fig. 3E), in short sequences (fig. 5), or both. For such cases, we found that the modification introduced here can effectively correct the bias.
It should be emphasized that although correct estimation of the number of substitutions that actually occurred is particularly important for inferring phylogenetic trees, genes evolving with a heterogeneous pattern should not be used to estimate the rate of point mutation, which is defined biologically as the overall rate of replication errors, DNA damage, etc. (see Kumar and Subramanian 2002) and mathematically as the instantaneous substitution rate matrix [P(dt)]. This is because the substitution rate at neutral sites can no longer be equated to the rate of point mutation when the substitution patterns in the two lineages are not the same: an excess or deficit of certain types of substitutions, occurring as soon as one of the lineages starts evolving with a different substitution pattern, often results in a higher rate of substitution than in the case where the substitution pattern remains the same. For example, when a given gene from a genomic segment with an A+T-rich base composition in the ancestor is moved to a chromosomal region with a high G+C content, a large number of A+T to G+C substitutions will occur until it becomes G+C-rich. This seems to be the case observed in the real data analysis for human and mouse genes. The average sequence divergence for the genes showing a heterogeneous substitution pattern is larger than that for the genes showing a homogeneous substitution pattern (fig. 7). This was also confirmed in the computer simulations, when a constant P(dt) was used throughout the entire course of sequence evolution in the case of a heterogeneous substitution pattern (fig. 8). Therefore, a larger extent of observed sequence divergence is not necessarily a reflection of an increased rate of point mutation when the pattern of substitution is not homogeneous between lineages. To distinguish the estimation bias caused by the violation of the underlying assumption in the methods from this de novo effect, we artificially forced a constant number of substitutions rather than a constant P(dt) in the computer simulations presented earlier. Consequently, we found that the estimation bias could be corrected, and the number of substitutions that actually occurred could be estimated efficiently by the new methods introduced in this study. These methods will be made available in the computer software MEGA2 (Kumar et al. 2001), available from http://www.megasoftware.net.
Acknowledgements |
---|
Footnotes |
---|
Keywords: substitution rate, mutation rate, base composition, LogDet, computer simulation
Address for correspondence and reprints: Koichiro Tamura, Department of Biological Sciences, Tokyo Metropolitan University, 1-1 Minami-ohsawa, Hachioji-shi, Tokyo 192-0397, Japan. ktamura{at}evolgen.biol.metro-u.ac.jp
References |
---|
Bulmer M., 1991 Use of the method of generalized least squares in reconstructing phylogenies from sequence data Mol. Biol. Evol 8:868-883
Foster P. G., D. A. Hickey, 1999 Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions J. Mol. Evol 48:284-290
Galtier N., M. Gouy, 1995 Inferring phylogenies from DNA sequences of unequal base compositions Proc. Natl. Acad. Sci. USA 92:11317-11321
Gu X., W.-H. Li, 1996 Bias-corrected paralinear and LogDet distances and tests of molecular clocks and phylogenies under nonstationary nucleotide frequencies Mol. Biol. Evol 13:1375-1383
Hasegawa M., T. Hashimoto, 1993 Ribosomal RNA trees misleading? Nature 361:23
Hasegawa M., H. Kishino, T. Yano, 1985 Dating of the human-ape splitting by a molecular clock of mitochondrial DNA J. Mol. Evol 22:160-174
Jukes T. H., C. R. Cantor, 1969 Evolution of protein molecules Pp. 21-132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York
Kumar S., 1996 Patterns of nucleotide substitution in mitochondrial protein coding genes of vertebrates Genetics 143:537-548
Kumar S., S. R. Gadagkar, 2001 Disparity index: a simple statistic to measure and test the homogeneity of substitution patterns between molecular sequences Genetics 158:1321-1327
Kumar S., S. Subramanian, 2002 Mutation rates in mammalian genomes Proc. Natl. Acad. Sci. USA 99:803-808
Kumar S., K. Tamura, I. B. Jakobsen, M. Nei, 2001 MEGA2: molecular evolutionary genetics analysis software Bioinformatics 17:1244-1245
Lockhart P. J., M. A. Steel, M. D. Hendy, D. Penny, 1994 Recovering evolutionary trees under a more realistic model of sequence evolution Mol. Biol. Evol 11:605-612
Loomis W. F., D. W. Smith, 1990 Molecular phylogeny of Dictyostelium discoideum by protein sequence comparison Proc. Natl. Acad. Sci. USA 87:9093-9097
Nei M., S. Kumar, 2000 Molecular evolution and phylogenetics Oxford University Press, New York
Saccone C., G. Pesole, G. Preparata, 1989 DNA microenvironments and the molecular clock J. Mol. Evol 29:407-411
Suchard M. A., R. E. Weiss, J. S. Sinsheimer, 2001 Bayesian selection of continuous-time Markov chain evolutionary models Mol. Biol. Evol 18:1001-1013
Swofford D., 2001 PAUP*: phylogenetic analysis using parsimony* (and other methods) Version 4.0b7 beta. Sinauer Associates, Sunderland, Mass
Swofford D. L., G. J. Olsen, P. J. Waddell, D. M. Hillis, 1996 Phylogenetic inference Pp. 407-514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. 2nd edition. Sinauer Associates, Sunderland, Mass.
Tajima F., M. Nei, 1984 Estimation of evolutionary distance between nucleotide sequences Mol. Biol. Evol 1:269-285
Tamura K., 1992 Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G + C-content biases Mol. Biol. Evol 9:678-687
Tamura K., 1994 Model selection in the estimation of the number of nucleotide substitutions Mol. Biol. Evol 11:154-157
Tamura K., 2000 On the estimation of the rate of nucleotide substitution for the control region of human mitochondrial DNA Gene 259:189-197
Tamura K., M. Nei, 1993 Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees Mol. Biol. Evol 10:512-526
Tarrío R., F. Rodríguez-Trelles, F. J. Ayala, 2001 Shared nucleotide composition biases among species and their impact on phylogenetic reconstructions of the Drosophilidae Mol. Biol. Evol 18:1464-1473
Tourasse N. J., W.-H. Li, 1999 Performance of the relative-rate test under nonstationary models of nucleotide substitution Mol. Biol. Evol 16:1068-1078
Yang Z., 1994 Estimating the pattern of nucleotide substitution J. Mol. Evol 39:105-111
Yang Z., S. Kumar, 1996 Approximate methods for estimating the pattern of nucleotide substitution and the variation of substitution rates among sites Mol. Biol. Evol 13:650-659