* Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Japan
Institute of Molecular Evolutionary Genetics and Department of Biology, Pennsylvania State University
Correspondence: E-mail: yossuzuk{at}lab.nig.ac.jp.
Abstract |
---|
Key Words: positive selection, parsimony, likelihood, Thalassiosira weissflogii, sexually induced gene 1, human T-cell lymphotropic virus type I, tax.
Introduction |
---|
It is, however, possible that positive selection operates only for a subset of codon sites. Sorhannus (2003) analyzed Sig1 sequences of Thalassiosira weissflogii using the parsimony-based method (Suzuki and Gojobori 1999) and the ML-based Bayesian method (Yang et al. 2000) for detecting positive selection at individual codon sites and concluded that the latter method detected positively selected sites, whereas the former method did not. However, the ML-based method is known to produce many false-positive results for positive selection, whereas the parsimony-based method rarely does so (Suzuki and Nei 2002). It is, therefore, possible that the ML results obtained by Sorhannus (2003) are, in fact, false positives.
Sorhannus (2003) commented that "the results obtained by likelihood analysis of HLA data by Suzuki and Nei (2001) appear to be problematical as simpler models had much higher likelihood values than more general models and multiple runs led to many different sets of parameter estimates." After publication of our paper, N. Goldman, R. Nielsen, and Z. Yang (personal communication) informed us that version 3.0a of the computer program PAML, which we used, contained an inaccurate computing algorithm. However, even when we used the newer version, PAML 3.12, the disconcerting results mentioned by Sorhannus (2003) did not completely disappear (Suzuki and Nei, unpublished data).
The purpose of this paper is to show that the results obtained by Sorhannus (2003) are apparently false positives caused by using an unreliable phylogenetic tree and that there is no compelling evidence of positive selection in Sig1. In addition, false-positive selection observed in the tax gene in human T-cell lymphotropic virus type I (HTLV-I) will be presented as an extreme example. The reasons that ML-based methods produce many false positives are discussed.
Materials and Methods |
---|
A multiple sequence alignment was made for each data set using ClustalW version 1.81 (Thompson, Higgins, and Gibson 1994). In each data set, sequences were quite homogeneous, and there were no alignment gaps. Positive selection was inferred at each codon site by the parsimony-based method with ADAPTSITE version 1.3 (Suzuki, Gojobori, and Nei 2001) and the ML-based method as implemented in PAML version 3.13 (Yang 1997). The detailed procedures of the statistical methods are explained by Suzuki and Gojobori (1999) and Yang et al. (2000). In both methods, the phylogenetic tree of sequences used is assumed to be known and has to be preassigned. Sorhannus (2003) made a composite tree of the bootstrap consensus trees of Sig1 and β-tubulin, which were originally constructed by Armbrust and Galindo (2001). However, the tree was highly multifurcative and appeared to be unreliable (see Supplementary Material online). We, therefore, constructed new trees using the neighbor-joining (NJ) method (Saitou and Nei 1987). To examine the effect of topological difference on the inference of positive selection, we constructed two trees for each data set using two different evolutionary distances; that is, p-distance (proportion of different nucleotide sites) and dS-distance (estimated number of synonymous nucleotide substitutions by Nei and Gojobori's [1986] method) (see Supplementary Material online). It is known that the ML-based method sometimes produces different estimates of ω (the nonsynonymous/synonymous rate ratio) at a given codon site, depending on the input ω value, because of multiple local maxima on the likelihood surface. For this reason, we used 0.4, 3.14, and 4 as the input ω values. (Only the results with the highest log-likelihood [lnL] values are presented in tables 2 and 3.) The significance (confidence) level (cutoff point) for inferring positive selection was 0.05 (0.95) for both parsimony-based and ML-based methods.
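The p-distance used for one of the NJ trees is simply the proportion of aligned nucleotide sites that differ between two sequences. A minimal Python sketch of that calculation is given below; the function names and the toy sequences are illustrative assumptions and are not taken from the Sig1 or tax data sets or from any program used in this study.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned nucleotide sites at which two sequences differ.

    Assumes a gap-free alignment, as was the case for the data sets here.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)
    return differences / len(seq_a)


def p_distance_matrix(seqs: dict) -> dict:
    """Pairwise p-distances for a dict mapping sequence name to sequence."""
    names = list(seqs)
    return {(x, y): p_distance(seqs[x], seqs[y]) for x in names for y in names}


# Toy (hypothetical) sequences: they differ at 2 of 9 sites.
toy = {"seq1": "ATGGCCAAA", "seq2": "ATGGCTAAG"}
matrix = p_distance_matrix(toy)
```

A matrix of this kind is the input that a neighbor-joining program would take; the tree-building step itself is not sketched here.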
Results |
---|
This situation changed dramatically when we used the dS-distance tree. In each data set, existence of a group of positively selected sites was identified, and this group included many codon sites (table 3). In addition, both M3 and M8 were judged to fit the data better than M0 and M7, respectively. For the large data set, the initial ω values of 0.4, 3.14, and 4 all indicated that five sites (positions 4, 37, 42, 52, and 149) and four sites (positions 37, 42, 52, and 149) were positively selected with M3 and M8, respectively. For the small data set, 23 sites (positions 4, 9, 10, 37, 42, 43, 52, 60, 63, 83, 84, 95, 98, 100, 119, 126, 137, 139, 144, 149, 159, 178, and 182) were identified as positively selected for both M3 and M8 when the initial ω values of 0.4 and 3.14 were used. However, when ω = 4 was used as the initial value, the results for M8 changed drastically, and the selected sites were now 37, 42, 52, and 149 only. (The same four sites were obtained when we tried the initial ω values of 5, 6, and 7.) Therefore, the detection of positively selected sites depends on the initial ω value, as was indicated by Suzuki and Nei (2001), and the results obtained can be very different for different initial values. In the present case, initial ω = 4 gave a higher lnL value (−1111.000) than that for initial ω = 0.4 or 3.14 (−1115.173).
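The rule of reporting only the run with the highest lnL among several initial ω values can be sketched as follows. The record type and function name are hypothetical; the site lists and lnL values are those quoted above for M8 with the small data set and the dS-distance tree, entered as negative numbers because log-likelihoods of sequence data are negative.

```python
from typing import NamedTuple


class Run(NamedTuple):
    initial_omega: float    # starting value of omega given to the optimizer
    lnL: float              # maximized log-likelihood reported for that run
    selected_sites: tuple   # codon positions inferred as positively selected


def best_run(runs):
    """Return the run with the highest log-likelihood."""
    return max(runs, key=lambda r: r.lnL)


SITES_23 = (4, 9, 10, 37, 42, 43, 52, 60, 63, 83, 84, 95, 98, 100,
            119, 126, 137, 139, 144, 149, 159, 178, 182)

runs = [
    Run(0.4, -1115.173, SITES_23),
    Run(3.14, -1115.173, SITES_23),
    Run(4.0, -1111.000, (37, 42, 52, 149)),
]

chosen = best_run(runs)
print(chosen.initial_omega, chosen.selected_sites)
```

Choosing the best of a few starting points reduces, but does not remove, the risk of reporting a local maximum, since none of the tried values is guaranteed to reach the global one.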
At any rate, we have obtained three different results from the same data sets, one each for the p-distance tree, the dS-distance tree, and the Sorhannus tree. These different results must be caused by differences in the trees used, because there is no other difference. Previously, Yang et al. (2000) stated that inference of positively selected sites does not seem to be sensitive to the assumed topology. In the present data sets, however, this is not the case.

To examine which tree and which computational results are most reliable, we computed the lnL values. The lnL value for the large data set with M3 was −1257.99 for the p-distance tree, −1337.46 for the dS-distance tree, and −1308.00 for the Sorhannus tree. Similarly, the lnL value was highest for the p-distance tree and lowest for the dS-distance tree for all models in both large and small data sets. We also computed the total tree length (TL), consistency index (CI), retention index (RI), and rescaled consistency index (RC) for the three trees. All these indices indicated that the p-distance tree was best and the dS-distance tree was worst (table 4). In addition, the p-distance tree was judged to fit the data better than the other two trees by the tests of Templeton (1983) and Kishino and Hasegawa (1989) (P < 0.05), whereas the latter two trees were not significantly different from each other. It is also known that p-distance is generally most reliable for constructing NJ trees of closely related sequences (Takahashi and Nei 2000).

To examine whether there was a more reliable tree than the p-distance tree, we constructed phylogenetic trees using the MP method with PAUP* version 4.0b10 (Swofford 1998) for both the large and small data sets. In addition, we examined the best-fit model of nucleotide substitution among all possible models currently available with MODELTEST version 3.06 (Posada and Crandall 1998) and constructed trees using the NJ and ML methods assuming that model. It was found that the best-fit model was the Kimura (1980) model for both large and small data sets, and the topologies of the MP, NJ, and ML trees obtained were all identical to that of the p-distance tree. These results strongly suggest that the p-distance tree is most reliable for both large and small data sets. Therefore, the result for the p-distance tree, in which no positively selected site was inferred, appears to be most reliable, and the positively selected sites identified by Sorhannus (2003) appear to be false positives caused by the unreliable tree used.
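For readers unfamiliar with the parsimony indices used above to compare trees, the standard definitions can be sketched in a few lines. Here the observed number of steps corresponds to the tree length (TL), and the minimum and maximum are the smallest and largest numbers of steps the same characters could require on any tree. The function name and the input numbers are hypothetical and are not the values in table 4.

```python
def tree_fit_indices(min_steps: int, observed_steps: int, max_steps: int):
    """Return (CI, RI, RC) from summed character step counts.

    CI = min / observed
    RI = (max - observed) / (max - min)
    RC = CI * RI
    Higher values indicate less homoplasy, i.e., a better fit of tree to data.
    """
    if not (0 < min_steps <= observed_steps <= max_steps):
        raise ValueError("expected 0 < min_steps <= observed_steps <= max_steps")
    ci = min_steps / observed_steps
    if max_steps == min_steps:
        ri = 1.0  # no homoplasy is possible for these characters
    else:
        ri = (max_steps - observed_steps) / (max_steps - min_steps)
    return ci, ri, ci * ri


# Hypothetical example: a shorter tree scores higher on all three indices.
shorter_tree = tree_fit_indices(min_steps=40, observed_steps=50, max_steps=90)
longer_tree = tree_fit_indices(min_steps=40, observed_steps=60, max_steps=90)
```

Because the minimum and maximum depend only on the data, the three indices for competing trees of one data set differ only through the tree length, which is why they ranked the three trees consistently.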
In the ML-based method, codon sites are grouped into two or more categories with different ω values, and the null hypothesis of ω = 1 is tested indirectly for the group with ω > 1. Because the sites with high ω values are grouped into the ω > 1 category, this method is more efficient than the parsimony method in detecting selection if the high ω values are caused by selection. In practice, however, ω is affected by stochastic errors, and it is possible that most high ω values are caused by random errors. For example, if cS = 0 by chance at a given codon site with cN > 0, ω becomes theoretically infinite (table 5). Similarly, ω can easily be inflated if cN becomes large or cS becomes small by chance. If this happens, the ω value for the ω > 1 group may again become significantly higher than 1, but this does not mean that the codon sites involved have been subjected to positive selection. This is the main reason why the ML-based method can produce many false positives.
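How easily cS = 0 can arise by chance may be illustrated with a simple binomial calculation of the kind that underlies a site-by-site test. The sketch below assumes that, under neutrality, each change at a codon site is nonsynonymous with a fixed probability; the value 0.75 is an illustrative assumption, not an estimate from the Sig1 or tax data, and the function names are hypothetical.

```python
from math import comb, inf


def crude_ratio(c_n: int, c_s: int) -> float:
    """Crude cN/cS ratio (not normalized by numbers of sites)."""
    if c_s == 0:
        return inf if c_n > 0 else float("nan")
    return c_n / c_s


def prob_at_least_cn(c_n: int, c_s: int, p_nonsyn: float) -> float:
    """One-sided binomial probability of observing c_n or more nonsynonymous
    changes among c_n + c_s changes when each change is nonsynonymous with
    probability p_nonsyn (the neutral expectation)."""
    n = c_n + c_s
    return sum(
        comb(n, k) * p_nonsyn**k * (1.0 - p_nonsyn) ** (n - k)
        for k in range(c_n, n + 1)
    )


P_NONSYN = 0.75  # assumed neutral proportion of nonsynonymous changes

# cN/cS = 3/0: the ratio is infinite, yet the pattern is common under neutrality.
print(crude_ratio(3, 0), prob_at_least_cn(3, 0, P_NONSYN))  # inf 0.421875
# A single nonsynonymous singleton (1/0) is even less informative.
print(crude_ratio(1, 0), prob_at_least_cn(1, 0, P_NONSYN))  # inf 0.75
```

Under this assumption, a site with cN/cS = 3/0 would be observed more than 40% of the time with no selection at all, so an infinite ratio at a site with few changes carries little evidence by itself.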
In table 5, we have seen that the dS-distance and Sorhannus trees generated many codon sites with high cN values and produced many positively selected sites in comparison with the p-distance tree. Because the actual process of identification of positively selected sites is quite complicated, it is difficult to see the exact relationships between cN values and the number of positively selected sites. However, the reason the number of positively selected sites is larger in the dS-distance and Sorhannus trees than in the p-distance tree seems to be that when there are many codon sites with large cN and small cS values, the average ω value for the group of so-called "selected sites" can be reasonably high even if more sites with lower cN values are included. If this argument is right, one would expect that a poor tree that generates many sites with high cN values would give more "selected sites" than a better tree. This is indeed what we observed in tables 2, 3, and 5. For example, the cN/cS value at position 4 for the large data set was 3/0, regardless of the tree assumed (table 5). However, this site was inferred as positively selected for the dS-distance and Sorhannus trees but not for the p-distance tree, probably because the cN values at other sites (e.g., positions 37, 42, and 149) were inflated for the former trees.
However, if positive selection at individual sites can be falsely identified because of stochastic errors, it would happen irrespective of the topology used. In the following, we show a striking example in which positive selection was inferred even at invariable codon sites under the assumption of a reliable tree.
Striking Example of Inferred Selected Sites: the tax Gene of HTLV-I
Twenty nucleotide sequences of the tax gene of HTLV-I were extracted from the international nucleotide sequence database. The accession numbers of these sequences were AB045401, AB045410, AB045425, AB045442, AB045481, AB045482, AB045486, AB045490, AB045514, AB045519, AB045520, AB045528, AB045541, AB045546 to AB045549, AB045558, AB045559, and AB045639 (Furukawa et al. 2001). Each sequence consisted of 181 codon sites that did not overlap with the open reading frame of the rex gene. These sequences were highly homogeneous, and there were no alignment gaps. The total branch length per codon site for the entire tree (SA; Anisimova, Bielawski, and Yang's S) was 0.128. Parsimony-based and ML-based methods were used for inferring positively selected sites. The phylogenetic tree was constructed by the NJ method with p-distance. The tree obtained was a star phylogeny, which was obviously the most reliable tree for these sequences because all mutations were singletons.
Parsimony analysis did not detect any positively selected sites, because cS and cN values were all small and either 0 or 1. Surprisingly, however, the ML-based method indicated that all codon sites were positively selected (table 6). That is, M3 and M8 both indicated that a group of positively selected codon sites existed with probability 1, and all codon sites were inferred to belong to this group with a Bayesian posterior probability of 1. In addition, M8 was judged to fit the data better than M7 by the LRT. Interestingly, the lnL value for M3 was very close to that for M0, but it was shown that all codon sites were positively selected in both models. Surprisingly, 158 out of a total of 181 codon sites analyzed were invariable among all the aligned sequences.
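The likelihood ratio test (LRT) used to compare nested models such as M7 and M8 can be sketched as follows. Twice the difference in lnL is compared with a chi-square distribution; 2 degrees of freedom for M7 versus M8 and 4 for M0 versus M3 with three site classes are the values commonly used for these comparisons. The lnL values in the example are hypothetical and are not those of table 6. The closed-form chi-square tail used here holds only for even degrees of freedom, which is an implementation shortcut rather than part of the method.

```python
from math import exp


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2m: P(X > x) = exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return exp(-half) * total


def likelihood_ratio_test(lnL_null: float, lnL_alt: float, df: int):
    """Return (test statistic, P value) for nested models."""
    stat = max(0.0, 2.0 * (lnL_alt - lnL_null))
    return stat, chi2_sf_even_df(stat, df)


# Hypothetical lnL values for a null model (e.g., M7) and an alternative (e.g., M8).
stat, p_value = likelihood_ratio_test(lnL_null=-1120.0, lnL_alt=-1115.0, df=2)
```

A significant LRT only indicates that the more general model fits better; as the tax example shows, it does not by itself establish that the sites assigned to the ω > 1 class were subject to positive selection.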
Incidentally, the possibility of occurrence of ω > 1 for all codon sites when closely related sequences are analyzed was previously indicated by Anisimova, Bielawski, and Yang (2002) in a computer simulation. From this simulation, they suggested that the ML-based method should be used only when the number of sequences used (T) is greater than 6 and the total number of nucleotide substitutions per codon for the entire tree (SA) is greater than 0.11. In the present case T = 20 and SA = 0.128, so this condition is satisfied. Yet, we observed a case of ω > 1 for all sites. However, the real problem is that false positives can occur even when both T and SA are large (Suzuki and Nei 2002). Anisimova, Bielawski, and Yang's simulation was not intended to study false positives, but the example data set of T = 17 and SA = 0.38 in their figure 1B indicates that false positives occurred with this data set, because the accuracy of predicting positively selected sites was lower than the cutoff point P when P > 0.85.
Some Anomalous Observations
Finally, it should be pointed out that essentially the same lnL values and same parameter estimates were obtained for different models M3 and M8 in tables 2 and 6. Similar results were also obtained for models M3 and M8 in table 1 of Sorhannus (2003). Furthermore, even M0 gave the same results as those of M3 and M8 in table 6. These unexpected results were apparently obtained because the different distributions of ω assumed among different sites converged to the same one in different models. In addition, in PAML, the beta distribution of ω in M8 is approximated by 10 discrete categories of ω, and these categories apparently converged to the three categories of M3 in table 2 and to one category of M0 in table 6.
Discussion |
---|
By contrast, the ML-based method is intended to identify a group of codon sites for which positive selection might operate simultaneously. If there is indeed such a group as in the case of major histocompatibility complex (MHC) genes, use of this method is justified. However, if high ω values are generated by random errors, this method is expected to give false positives. Furthermore, the general applicability of Goldman and Yang's (1994) codon substitution model, which is the basis of the ML-based model, has been questioned (Nei and Kumar 2000). As far as we know, there are no empirical data to support this model. Because it is very difficult to distinguish between true positives and false positives only from the sequence analysis, it is important to exercise caution in the interpretation of the results obtained by this method. Obviously, the final proof of positive selection rests on experimental verification, as was done with some genes (e.g., Jermann et al. 1995; Zhang, Zhang, and Rosenberg 2002; Shi and Yokoyama 2003). As long as the conclusion is drawn only from the sequence analysis without conducting experiments, a more conservative parsimony-based method is safer for detecting positive selection at single amino acid sites.
Detection of a significant excess of nonsynonymous substitutions is often taken as evidence that the protein under consideration undergoes adaptive evolution. In host-defense genes or immunogenic genes, excess nonsynonymous substitutions caused by positive selection seem to occur continuously to avoid parasitic attack or immune surveillance, as in the cases of MHC genes (Hughes and Nei 1988, 1989) and immunogenic genes in human immunodeficiency and influenza viruses (Bush et al. 1999; Suzuki and Gojobori 1999). For these genes, the statistical methods considered here would be useful if a large number of sequences are used. In many genes, however, both advantageous and deleterious mutations caused by nonsynonymous substitutions may occur at different times or nearly at the same time, and the overall function of a gene may remain more or less the same in long-term evolution. In this case, positive or negative selection at individual codon sites may not have significant effects on the evolution of a gene. For this reason, it is important to distinguish between excess nonsynonymous nucleotide substitution and positive selection for enhancing gene function. In other words, detection of excess nonsynonymous substitutions at some codon sites does not necessarily mean the detection of positive selection for gene function. By contrast, only one or two amino acid changes at a few codon sites may affect the function of a gene drastically, as in the case of alligator hemoglobin (Perutz 1983) or vertebrate color vision genes (Yokoyama and Radlwimmer 2001). In this case, it would be difficult to detect positive selection by the statistical methods considered in this paper.
Literature Cited |
---|
Anisimova, M., J. P. Bielawski, and Z. Yang. 2002. Accuracy and power of Bayes prediction of amino acid sites under positive selection. Mol. Biol. Evol. 19:950-958.
Armbrust, E. V. 1999. Identification of a new gene family expressed during the onset of sexual reproduction in the centric diatom Thalassiosira weissflogii. Appl. Environ. Microbiol. 65:3121-3128.
Armbrust, E. V., and H. M. Galindo. 2001. Rapid evolution of a sexual reproduction gene in centric diatoms of the genus Thalassiosira. Appl. Environ. Microbiol. 67:3501-3513.
Bailly, X., R. Leroy, S. Carney, O. Collin, F. Zal, A. Toulmond, and D. Jollivet. 2003. The loss of the hemoglobin H2S-binding function in annelids from sulfide-free habitats reveals molecular adaptation driven by Darwinian positive selection. Proc. Natl. Acad. Sci. USA 100:5885-5890.
Bush, R. M., C. A. Bender, K. Subbarao, N. J. Cox, and W. M. Fitch. 1999. Predicting the evolution of human influenza A. Science 286:1921-1925.
Furukawa, Y., R. Kubota, M. Tara, S. Izumo, and M. Osame. 2001. Existence of escape mutant in HTLV-1 tax during the development of adult T-cell leukemia. Blood 97:987-993.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736.
Hughes, A. L., and M. Nei. 1988. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335:167-170.
Hughes, A. L., and M. Nei. 1989. Nucleotide substitution at major histocompatibility complex II loci: evidence for overdominant selection. Proc. Natl. Acad. Sci. USA 86:958-962.
Jermann, T. M., J. G. Opitz, J. Stackhouse, and S. A. Benner. 1995. Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature 374:57-59.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.
Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topology from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170-179.
Miller, S. R. 2003. Evidence for the adaptive evolution of the carbon fixation gene rbcL during diversification in temperature tolerance of a clade of hot spring cyanobacteria. Mol. Ecol. 12:1237-1246.
Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418-426.
Nei, M., and S. Kumar. 2000. Molecular evolution and phylogenetics. Oxford University Press, New York.
Perutz, M. F. 1983. Species adaptation in a protein molecule. Mol. Biol. Evol. 1:1-28.
Posada, D., and K. A. Crandall. 1998. MODELTEST: testing the model of DNA substitution. Bioinformatics 14:817-818.
Saitou, N. 1989. A theoretical study of the underestimation of branch lengths by the maximum parsimony principle. Sys. Zool. 38:1-6.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.
Shi, Y., and S. Yokoyama. 2003. Molecular analysis of the evolutionary significance of ultraviolet vision in vertebrates. Proc. Natl. Acad. Sci. USA 100:8308-8313.
Sorhannus, U. 2003. The effect of positive selection on a sexual reproduction gene in Thalassiosira weissflogii (Bacillariophyta): results obtained from maximum likelihood and parsimony-based methods. Mol. Biol. Evol. 20:1326-1328.
Suzuki, Y., and T. Gojobori. 1999. A method for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 16:1315-1328.
Suzuki, Y., T. Gojobori, and M. Nei. 2001. ADAPTSITE: detecting natural selection at single amino acid sites. Bioinformatics 17:660-661.
Suzuki, Y., and M. Nei. 2001. Reliabilities of parsimony-based and likelihood-based methods for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 18:2179-2185.
Suzuki, Y., and M. Nei. 2002. Simulation study of the reliability and robustness of the statistical methods for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 19:1865-1869.
Swanson, W. J., and V. D. Vacquier. 2002. The rapid evolution of reproductive proteins. Nat. Rev. Genet. 3:137-144.
Swofford, D. L. 1998. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.
Takahashi, K., and M. Nei. 2000. Efficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are used. Mol. Biol. Evol. 17:1251-1258.
Templeton, A. R. 1983. Phylogenetic inference from restriction cleavage site maps with particular reference to the evolution of humans and the apes. Evolution 37:221-244.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555-556.
Yang, Z., R. Nielsen, N. Goldman, and A. M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.
Yokoyama, S., and F. B. Radlwimmer. 2001. The molecular genetics and evolution of red and green color vision in vertebrates. Genetics 158:1697-1710.
Zhang, J., Y. P. Zhang, and H. F. Rosenberg. 2002. Adaptive evolution of a duplicated pancreatic ribonuclease gene in a leaf-eating monkey. Nat. Genet. 30:411-415.