*Department of Biology, Galton Laboratory
Center for Mathematics and Physics in the Life Sciences and Experimental Biology (CoMPLEX), University College London
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
A number of authors proposed alternative methods that account for variable selective pressures across sites. Fitch et al. (1997)
and Suzuki and Gojobori (1999)
used the parsimony method to reconstruct sequences in the extinct ancestors and counted the changes along the tree at each site to identify sites at which there was an excess of nonsynonymous substitutions. Nielsen and Yang (1998)
and Yang et al. (2000)
implemented codon substitution models of heterogeneous
ratios among sites in the maximum likelihood (ML) framework and used the Bayes method to predict sites under positive selection. ML and Bayes methods are major statistical estimation methods and are widely used in molecular phylogenetics (Huelsenbeck and Rannala 1997
; Lewis 2001
; Whelan, Lio, and Goldman 2001
). Although ML models are computationally intense for large numbers of lineages, they do not rely on ancestral reconstruction and can easily accommodate known features of sequence evolution, such as the transition-transversion rate bias and the codon usage bias. The ML approach uses the likelihood ratio test (LRT) to compare two nested models: a null model, which does not account for sites with
> 1, and an alternative model that does. A gene is considered to be under positive selection if (1) the LRT is significant, and (2) at least one of the ML estimates of
>1. When the ML parameter estimates indicate the presence of sites under positive selection, the empirical Bayes approach can be used to predict them (Nielsen and Yang 1998
; Yang et al. 2000
). One computes the posterior probability that a site belongs to each
class of the model, given the data at that site. Sites with high posterior probabilities of belonging to a class with
> 1 are likely to be evolving by positive selection.
The codon models were successfully applied to detect positive selection in a number of genes. For example, Bishop, Dean, and Mitchell-Olds (2000)
detected strong adaptive pressure in the cell wallattacking enzyme chitinase and mapped positively selected residues on a three-dimensional structure of the class I chitinase. An excess of amino acid replacements in the active cleft of the enzyme indicated that class I chitinase evolves in response to pathogenic variation. This supported a hypothesis of coevolutionary "arms race" between plants and their pathogens. The same approach demonstrated that evasion of the immune system by viruses occurs by diversifying selection, e.g., across the HIV-1 nef gene (Zanotto et al. 1999
) and the capsid gene of the foot-and-mouth disease virus (Fares et al. 2001
; Haydon et al. 2001
). Remarkably, there was a correlation between the known cytotoxic T-lymphocytes (CTL) epitopes responsible for antigenic determination and the predicted positive selection sites.
A better understanding of adaptive evolution, however, also requires an understanding of how the adaptive changes affect phenotypes. The Bayes method can identify the few sites responsible for adaptive change, providing the initial information required to understand the changes in the form and function of proteins over evolutionary time. Specific structural and functional hypotheses can be formulated if we know which sites in an ancestral protein evolve by positive selection (e.g., Adey et al. 1994
; Chandrasekharan et al. 1996
; Dean and Golding 1997
; Yang, Swanson, and Vacquier 2000
). Site-directed mutagenesis could then be used to conduct biochemical testing of such hypotheses (Chang, Kazmi, and Sakmar 2001
).
Although the ML method appeared successful in real data analysis, its performance is not well understood. Anisimova, Bielawski, and Yang (2001)
used computer simulation to examine the performance of the LRT in detecting adaptive evolution; they found that the use of the
2 distribution made the LRT conservative; however, the test was powerful when enough lineages were analyzed. In this study we examine the performance of Bayes prediction of positive selection sites. We simulate data under heterogeneous
models and evaluate the performance of the method under different sequence lengths, sequence divergences, and numbers of taxa. We explore conditions under which Bayes posterior probabilities give a reliable measure of uncertainty.
![]() |
Theory and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Positive selection, that is, the presence of sites with > 1, can be tested by LRTs by comparing models M0 (one-ratio) with M3 (discrete), M1 (neutral) with M2 (selection), and M7 (beta) and M8 (beta&
). The performance of two LRTs, which compare M0 with M3 and M7 with M8, was examined by Anisimova, Bielawski, and Yang (2001)
.
Bayes Prediction of Sites Under Positive Selection
We estimate the parameters of the codon model by ML and use the Bayes theorem to infer to which class each site is most likely to belong (Nielsen and Yang 1998
; Yang et al. 2000
). Suppose a model of heterogeneous
ratios assumes K classes, with the proportions and
ratios given as
![]() |
Proportions pi are the prior probabilities for site classes, i = 0, 1, ..., K - 1. The posterior probability that site h with data xh belongs to class i is
|
Here, P(xh|i) is the probability of the data at site h, given that it belongs to site class i. In the implementation of models by Nielsen and Yang (1998)
and Yang et al. (2000)
, parameters such as
i and pi in equation (1)
are estimated by ML, and the ML estimates are used to calculate the posterior probabilities in equation (2)
. This is known as the empirical Bayes approach. If a site has a high posterior probability of coming from a class with
estimated to be >1, the site is likely to be under positive selection. A conservative approach can be taken to predict sites under positive selection by requiring this posterior probability to exceed a cut-off value such as P = 0.95 or 0.99. Thus, a site with a high probability (say 0.9) of being under positive selection is not considered to be under positive selection if a stringent cutoff is applied (say P = 0.95).
Simulations
All results were based on 100 simulated replicates. All data sets were simulated under models M3 (discrete) and M8 (beta&) and included a small fraction of sites evolving under positive selection (table 1
). ML estimates of parameters from 17 vertebrate ß-globin genes (Yang et al. 2000
) formed the basis of most simulations (table 1
). Simulated data sets varied in the number of taxa T, sequence length in the number of codons Lc, sequence divergence S, and selective pressure, here
2 for M3 and
for M8 (table 1
). S was measured by tree length, i.e., the expected number of nucleotide substitutions per codon along the tree. We used three values for S (table 1
) and referred to them as low, medium, and high sequence divergences, respectively. We also simulated data sets using the ML parameter estimates obtained from 98 hemagglutinin gene sequences of the human influenza type A virus (GenBank accession numbers AF180564AF180666; except for duplicates AF180572, AF180577, AF180596, AF180636, and a highly divergent outlier AF180666; table 1
). These sequences constitute a subset of the data analyzed by Bush et al. (1999)
and Yang (2000)
.
|
|
Accuracy of Bayes prediction was measured by the probability that a site predicted to be under positive selection was truly under positive selection. Let N·+ denote the number of sites predicted to be under positive selection (table 2
). Here, the first subscript refers to classifications by evolver, and the second subscript refers to prediction by codeml, with "·" meaning all sites, "+" meaning sites with > 1, and "-" meaning sites with
< 1. For example, N++ denotes the number of correctly predicted positive selection sites. Accuracy was estimated by averaging the proportion N++/N·+ over the replicates. Replicates in which positive selection was not detected (N·+ = 0) were ignored in the calculation. Note that prediction of sites under positive selection by codeml depends on a cut-off probability P.
|
Although it is convenient to think of false positives and false negatives for the prediction of positive selection sites, the Bayes method is not based on formal hypothesis testing. Our use of the terms accuracy and power in this paper is informal, and they are not equal to (1 - type I error rate) and (1 - type II error rate) in hypothesis testing. One could measure accuracy by N- -/N-·, so that both accuracy and power (N++/N+·) are conditional on the truth (classification of sites by evolver). However, prediction of conserved sites or sites not under positive selection is easy and biologically not so interesting.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Accuracy was always much higher for sequences of medium and high sequence divergences than for those at low sequence divergence (fig. 1 ). The worst case was observed for data sets of six sequences at low divergence (S = 0.11), where accuracy was much lower than the cut-off values (fig. 1A ). Note that tree length S = 0.11 means highly similar sequences, with 0.024 nucleotide substitutions per nonsynonymous site and 0.078 nucleotide substitutions per synonymous site along the tree. In such data sets, only about 10% of the sites (codons) showed any variability, and even fewer sites had two or more changes. Even at medium and high sequence divergences (S = 1.1, S = 11), accuracy was lower than the cut-off P for P > 0.8 (Lc = 100) or P > 0.9 (Lc = 500; fig. 1A ). Also note that high divergence (S = 11) means very divergent sequences, with 2.4 nonsynonymous substitutions per nonsynonymous site and 7.8 synonymous substitutions per synonymous site on the small tree consisting of six taxa. The results of figure 1 suggest that Bayes prediction is tolerant of multiple substitutions at the same site.
Accuracy was improved by increasing the number of taxa to 17 (compare fig. 1A with B
). However, for similar sequences (S = 0.38) accuracy remained well below the cut-off P for P > 0.7 (N = 100; results not shown) or P > 0.8 (N = 500; fig. 1B
). For medium and high sequence divergences (S = 2.11, S = 16.88), accuracy was always higher than the cut-off P (fig. 1B
). We note that the best accuracy was achieved at medium divergence in the small tree (fig. 1A
) and at high divergence in the large tree (fig. 1B
). Although sequence divergences in trees of different sizes are not directly comparable, we expect large trees to be more tolerant of multiple substitutions. We also analyzed data in which the strength of positive selection was higher (2 = 4.74) and observed a substantial improvement in accuracy at all levels of sequence divergence (compare fig. 1C with A
). Although in reality one has no control over the level of positive selection pressure on a gene, Bayes prediction is likely to be more accurate in the presence of strong positive selection.
We simulated data sets of 98 lineages using the ML parameter estimates from the influenza A hemagglutinin gene under model M3 (discrete) (table 1
). The results are presented in figure 2 . The tree length S = 1.3 means an average branch length of S/(2T - 3) 0.007 nucleotide substitutions per codon, so that the sequence divergence is even lower than the low divergences in the 6-taxon (S = 0.11) and 17-taxon (S = 0.38) trees in previous experiments. However, accuracy was much higher in the 98-taxon data sets (compare fig. 2A
with fig. 1A, B,
and low S). Accuracy was higher than the cut-off values when P < 0.90 and only slightly lower than the cut-off values when P > 0.90 (fig. 2A
). For example, accuracy was
0.93 at P = 0.95 and
0.97 at P = 0.99 (fig. 2A
).
|
|
We observed much higher power for data sets simulated with increased positive selection pressure (2 = 4.74; compare fig. 3C with A
). Thus, one is more likely to predict positive selection sites if strong positive selection acts on the gene.
The effect of the tree shape reflected in relative branch lengths was not examined in this paper but is expected to be similar to that of the sequence divergence. A well-balanced tree, with an even distribution of changes over its branches, will achieve optimal information content. Highly biased trees might harbor too many changes along the long branches and too few changes along the short branches, leading to lack of information and low power in the detection method.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The alternative in this case is the hierarchical Bayes approach (or full Bayes approach), which accounts for uncertainties in the parameters in the prior distribution by assigning and integrating over a hyper-prior distribution for parameters in the prior. This does not increase the power of the analysis but has the effect of adding noise into the model so that the posterior distribution will become more spread out, reducing the confidence in the prediction. The computation can be achieved by the Markov chain Monte Carlo algorithms but is expected to be much more expensive than for the ML procedure implemented in PAML. For small data sets, where the use of the full Bayes approach is most rewarding, the computation seems feasible and well worth pursuing.
In the mean time, we advise caution on Bayes prediction of sites under selection when the data contain only a few highly similar sequences. Collecting more sequences seems to be the most effective strategy in improving the accuracy and power of the analysis.
A further difficulty is caused by the fact that is a continuous variable and is best described by a continuous distribution, but we had to use a discrete distribution for computational reasons (Yang et al. 2000
). As an example, if the three
ratios under M3 are 0.9, 1.0, and 1.1, it will be extremely difficult to assign sites into these classes using the Bayes or any other approach. Consequently, one should be very cautious when drawing a conclusion about positive selection when the estimated
is only slightly >1.
Robustness Analysis and Testing Selection with Multiple Models
To evaluate the robustness of Bayes prediction of sites under positive selection and to explore the performance of different models, we simulated data sets under models M3 and M8 and analyzed them using three different models: M2 (selection), M3 (discrete), and M8 (beta&). Both 6-taxon and 17-taxon trees were used in the simulation, and we present results obtained from the 17-taxon data sets in figure 4
. Note that the parameters for M3 and M8 were obtained from the same real data set (table 1
). When model M3 was used in both simulation and analysis (diamonds in fig. 4B and D
), the accuracy and the power of Bayes site prediction were similar to those when M8 was used for both simulation and analysis (circles in fig. 4A and C
). Regardless of the simulation model, a lower accuracy and a higher power were achieved when data were analyzed under M3 rather than under M8. The difference is because of the fact that M3 predicted almost 30% more sites than M8 did. The accuracy and the power of site prediction under M2 (selection) were essentially the same as under M8 (beta&
) (fig. 4
). It should be noted, however, that M2 was very conservative for six-taxon data sets, i.e., accuracy was very high, but power was very poor (results not shown).
|
Model M3 is not as precise in distinguishing positive selection sites as M8 or M2. When the strength of positive selection is low, and there is a large fraction of neutral sites, M3 can yield a large number of incorrectly predicted sites. Anisimova, Bielawski, and Yang (2001)
also showed that violations of model assumptions could lead to a high rate of false positives for the LRT in detecting positive selection. However, M3 has some advantages. First, M3, with only three site classes, appears to fit any data set as well as any other model (e.g., Yang et al. 2000
). Second, it does not appear to have multiple local optima and is computationally fast.
Because each model has advantages and disadvantages, we suggest the use of multiple models in real data analysis. Yang et al. (2000)
recommended the following models for real data analysis: M0 (one-ratio), M1 (neutral), M2 (selection), M3 (discrete), M7 (beta), and M8 (beta&
).
Overall Recommendations Concerning the Likelihood Method to Detect Positive Selection
The likelihood models implemented by Nielsen and Yang (1998)
and Yang et al. (2000)
are designed to perform the following tasks, in order of increasing difficulty: (1) test for the presence of sites under positive selection (
> 1) by LRT, comparing two nested models, (2) estimation of the proportion of positive selection sites and the strength of positive selection by ML estimation of parameters in the
distribution, and (3) identification of amino acid sites under positive selection by Bayes prediction.
The LRT seems quite reliable even in small data sets with only a few similar sequences, according to the simulation study of Anisimova, Bielawski, and Yang (2001)
. Although use of the
2 approximation makes the test conservative, the test appeared quite powerful. We expect some data sets to contain insufficient information for reliable identification of sites under positive selection, but it is still interesting to know whether such sites exist at all. In such data sets, the LRT, which uses combined evidence from all sites in the sequence, should be more powerful than methods that test positive selection at each site by counting substitutions (e.g., Suzuki and Gojobori 1999
).
Estimation of parameters in the distribution is a much more difficult matter, mainly because of the strong correlation among parameters in the
distribution. The different models of variable
s among sites and different parameters of the same model correspond to the different ways of lumping sites into classes, and for a given data set, there can be many almost equally good ways of lumping sites.
Lastly, Bayes prediction of sites under selection is most difficult, and many sequences are required to accumulate synonymous and nonsynonymous changes at individual sites, which are the source of information for inferring the underlying ratio at each site. Furthermore, random and systematic errors in parameter estimates will affect the accuracy of Bayes prediction. Also, note that the accuracy and power measures used in this paper apply to one site and not to the entire sequence, and that it is almost certain that some sites in the sequence will be incorrectly identified, because so many inferences are made in one analysis.
Given these considerations, it is remarkable that the application of these methods to real data has generated biologically highly sensible results (e.g., Zanotto et al. 1999
; Bishop, Dean, and Mitchell-Olds 2000
; Swanson et al. 2001
). We note that our simulations examined conditions where Bayes site prediction will be most difficult, i.e., small data sets with a few highly similar sequences. To date, most ML-based studies of positive selection in nonviral organisms sampled more than 17 lineages, which represented relatively divergent sequences (e.g., Bishop, Dean, and Mitchell-Olds 2000
; Yang, Swanson, and Vacquier 2000
; Peek et al. 2001
). Bayes prediction in such cases should be more reliable.
Based on our simulations, we make the following generalizations. (1) Prediction of positively selected sites is unreliable when sequences are very similar, and the number of lineages is small (e.g., S 0.11 or T
6). (2) Increasing the number of lineages is the most effective way to improve accuracy and power. Accurate prediction is possible for data sets comprising very similar sequences if a very large number of lineages have been sequenced. (3) Multiple models should be used in real data analysis to ensure the robustness of the results.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Keywords: Bayes inference
likelihood
nonsynonymous-synonymous rate ratio
positive selection
posterior probability
Address for correspondence and reprints: Ziheng Yang, Department of Biology, University College London, Darwin Building, Gower Street, London WC1E 6BT, United Kingdom. z.yang{at}ucl.ac.uk
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Adey N. B., T. O. Tollefsbol, A. B. Sparks, M. H. Edgell, C. A. Hutchison III., 1994 Molecular resurrection of an extinct ancestral promoter for mouse L1 Proc. Natl. Acad. Sci. USA 91:1569-1573[Abstract]
Anisimova M., J. P. Bielawski, Z. Yang, 2001 Accuracy and power of the likelihood ratio test to detect adaptive molecular evolution Mol. Biol. Evol 18:1585-1592
Bishop J. G., A. M. Dean, T. Mitchell-Olds, 2000 Rapid evolution in plant chitinases: molecular targets of selection in plant-pathogen coevolution Proc. Natl. Acad. Sci. USA 97:5322-5327
Bush R. M., W. M. Fitch, C. A. Bender, N. J. Cox, 1999 Positive selection on the H3 hemagglutinin gene of human influenza virus A Mol. Biol. Evol 16:1457-1465[Abstract]
Chandrasekharan U. M., S. Sanker, M. J. Glynias, S. S. Karnik, A. Husain, 1996 Angiotensin II-forming activity in a reconstructed ancestral chymase Science 271:502-505[Abstract]
Chang B. S. W., M. A. Kazmi, T. P. Sakmar, 2001 Synthetic gene technology: applications to ancestral gene reconstruction and structure-function studies of receptors Methods Enzymol 343:274-294[ISI]
Dean A. M., G. B. Golding, 1997 Protein engineering reveals ancient adaptive replacements in isocitrate dehydrogenase Proc. Natl. Acad. Sci. USA 94:3104-3109
Endo T., K. Ikeo, T. Gojobori, 1996 Large-scale search for genes on which positive selection may operate Mol. Biol. Evol 13:685-690[Abstract]
Fares M. A., A. Moya, C. Escarmis, E. Baranowski, E. Domingo, E. Barrio, 2001 Evidence for positive selection in the capsid protein-coding region of the foot-and-mouth disease virus (FMDV) subjected to experimental passage regimens Mol. Biol. Evol 18:10-21
Fitch W. M., R. M. Bush, C. A. Bender, N. J. Cox, 1997 Long term trends in the evolution of H(3) HA1 human influenza type A Proc. Natl. Acad. Sci. USA 94:7712-7718
Golding G. B., A. M. Dean, 1998 The structural basis of molecular adaptation Mol. Biol. Evol 15:355-369[Abstract]
Goldman N., Z. Yang, 1994 A codon-based model of nucleotide substitution for protein-coding DNA sequences Mol. Biol. Evol 11:725-736
Haydon D. T., A. D. Bastos, N. J. Knowles, A. R. Samuel, 2001 Evidence for positive selection in foot-and-mouth disease virus capsid genes from field isolates Genetics 157:7-15
Huelsenbeck J. P., B. Rannala, 1997 Phylogenetic methods come of age: testing hypotheses in an evolutionary context Science 276:227-232
Johnson N. L., S. Kotz, A. W. Kemp, 1993 Univariate discrete distributions, Vol. 1 Wiley, New York
Lewis P. O., 2001 Phylogenetic systematics turns over a new leaf TREE 16:30-37[Medline]
Muse S. V., B. S. Gaut, 1994 A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome Mol. Biol. Evol 11:715-724
Nielsen R., Z. Yang, 1998 Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene Genetics 148:929-936
Peek A. S., V. Souza, L. E. Eguiarte, B. S. Gaut, 2001 The interaction of protein structure, selection, and recombination on the evolution of the type-1 fimbrial major subunit (fimA) from Escherichia coli J. Mol. Evol 52:193-204[ISI][Medline]
Suzuki Y., T. Gojobori, 1999 A method for detecting positive selection at single amino acid sites Mol. Biol. Evol 16:1315-1328[Abstract]
Swanson W. J., Z. Yang, M. F. Wolfner, C. F. Aquadro, 2001 Positive Darwinian selection in the evolution of mammalian female reproductive proteins Proc. Natl. Acad. Sci. USA 98:2509-2514
Whelan S., P. Lio, N. Goldman, 2001 Molecular phylogenetics: state-of-the-art methods for looking into the past Trends Genet 17:262-272[ISI][Medline]
Yang Z., 1997 PAML: a program package for phylogenetic analysis by maximum likelihood Comput. Appl. Biosci 13:555-556 (http://abacus.gene.ucl.ac.uk/software/paml.html) [Medline]
. 2000 Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A J. Mol. Evol 51:423-432[ISI][Medline]
Yang Z., J. P. Bielawski, 2000 Statistical methods for detecting molecular adaptation TREE 15:496-503[Medline]
Yang Z., R. Nielsen, N. Goldman, A. M. Pedersen, 2000 Codon-substitution models for heterogeneous selection pressure at amino acid sites Genetics 155:431-449
Yang Z., W. J. Swanson, V. D. Vacquier, 2000 Maximum-likelihood analysis of molecular adaptation in abalone sperm lysin reveals variable selective pressures among lineages and sites Mol. Biol. Evol 17:1446-1455
Zanotto P. M., E. G. Kallas, R. F. de Souza, E. C. Holmes, 1999 Genealogical evidence for positive selection in the nef gene of HIV-1 Genetics 153:1077-1089