* Department of Biology, University College London, London, United Kingdom; Department of Biological Statistics and Computational Biology, Cornell University; and
Center for Bioinformatics, University of Copenhagen, Copenhagen, Denmark
Correspondence: E-mail: rasmus{at}binf.ku.dk.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: positive selection codon-substitution models Bayes empirical Bayes
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Analysis of both real and simulated data has provided insights into the statistical properties of the models and highlighted the strengths and weaknesses of such codon-based analysis. The site models of Nielsen and Yang (1998) and Yang et al. (2000) use a statistical distribution to describe the random variation in among sites. A likelihood ratio test (LRT) is conducted to compare a null model that does not allow
> 1 in the distribution with an alternative model that does. Several LRTs were implemented and two appeared to have good power and low false-positive rate. The first involves the null model M1a (NearlyNeutral), which assumes two site classes in proportions p0 and p1 = 1 p0 with 0 <
0 < 1 and
1 = 1, and the alternative model M2a (PositiveSelection), which adds a proportion p2 of sites with
2 > 1 estimated from the data. Those are slight modifications of models M1 (neutral) and M2 (selection) implemented in Nielsen and Yang (1998), which had
0 = 0 fixed. The old M1 and M2 were found to be unrealistic for many data sets as they failed to account for sites under weak purifying selection with 0 <
< 1 (e.g., Yang et al. 2000). The second LRT compares the null model M7 (beta), which assumes a beta distribution for
(in the interval 0 <
< 1), and the alternative model M8 (beta&
), which adds an extra class of sites with positive selection (
s > 1). If the LRT is significant, positive selection is inferred. An empirical Bayes (EB) approach is then used to calculate the posterior probability that each site is from a particular site class, and sites with high posterior probabilities coming from the class with
> 1 (say, with P > 95%) are inferred to be under positive selection. This approach makes it possible to detect positive selection and identify sites under positive selection even if the average
ratio over all sites is much less than 1.
The EB approach we implemented, known as the naive EB (NEB), uses maximum likelihood estimates (MLEs) of parameters, such as the proportions and ratios for the site classes, without accounting for their sampling errors. While this is not a problem in large data sets, where parameters are reliably estimated, in small data sets the MLEs may have large sampling errors, and the NEB calculation of posterior probabilities may be unreliable (Anisimova, Bielawski, and Yang 2002). For example, if the MLEs under M2a are
and
use of such estimates to calculate posterior probabilities will lead to the conclusion that every site in the sequence is under positive selection with P = 1. Such extreme estimates can occur, for example, when the data contain a few almost identical sequences.
One solution to this problem was provided by Huelsenbeck and Dyer (2004). They implemented a full Bayesian method for calculating posterior probabilities using Markov Chain Monte Carlo. By assigning prior probabilities to the nuisance parameters, the method is able to take uncertainty in these parameters into account. While this method may have desirable statistical properties, it is computationally slow and may not be practical for large data sets or for evaluation by simulation. Also, the method was implemented only under models M2 and M3 (discrete) (Yang et al. 2000) instead of the more useful models M2a or M8.
In this paper, we develop a method to accommodate uncertainties in the MLEs of parameters in the distribution using numerical integration. We assign a prior for those parameters and average over this prior, a procedure known as Bayes empirical Bayes (BEB) (Deely and Lindley 1981). We expect that the effects of this correction should be negligible in large data sets but may be important in small data sets. Thus, we test the new method using three data sets analyzed previously: a large informative data set of 192 human class I major histocompatibility complex (MHC) alleles, analyzed by Yang and Swanson (2002); a data set of HIV-1 env gene V3 region from 13 HIV-1 isolates with a known transmission history, analyzed by Yang et al. (2000) (data set D10); and a data set of 20 HTLV-I tax gene sequences, analyzed by Suzuki and Nei (2004). We also conduct computer simulation to examine the performance of the new BEB method in comparison with the old NEB method.
We also implement similar BEB corrections for branch-site model A of Yang and Nielsen (2002) and the clade model C of Bielawski and Yang (2004) (see also Forsberg and Christiansen 2003). Our implementations are described in Methods. Simulation studies evaluating the performance of those models will be published elsewhere.
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() | (1) |
Several procedures have been suggested in the statistics literature to correct for the bias in the NEB approach to achieve approximately correct frequentist coverage probabilities (e.g., Morris 1983; Laird and Louis 1987; Carlin and Gelfand 1990). Most of them work only for simplistic examples or otherwise involve complicated approximations. Laird and Louis (1987) proposed a general approach to the problem, using what they called the type III parametric bootstrap. However, the approach used for the present problem would involve extensive computation. Here, we take a hierarchical Bayes approach, also known as BEB (Deely and Lindley 1981). We use a prior f() for parameters
and integrate over the prior.
![]() | (2) |
![]() | (3) |
We approximate the integral over by a sum over a 4-D grid.
![]() | (4) |
![]() | (5) |
![]() | (6) |
M2a involves four parameters: p0, p1, 0, and
s. We use the priors
0
U(0, 1) and
s
U(1, 11), and for each parameter, we use the midpoint of each interval to represent the density in that interval. Thus, the U(0, 1) density for
0 is approximated by 10 values 0.05, 0.15, ..., 0.95, each with probability 0.1, while the U(1, 11) density for
s is approximated using 10 values 1.5, 2.5, ..., 10.5, each with probability 0.1. Parameters p0, p1, and p2 = 1 p0 p1 are assumed to have a Dirichlet prior D(
0,
1,
2), as in Huelsenbeck and Dyer (2004), with density
![]() | (7) |
The p0-p1-p2 space is represented by a triangle shown in figure 1. We partition it into d2 = 100 equal-sized triangles and use the center of each to represent the density mass on that triangle (fig. 1). Let the d2 triangles be labeled 0, 1, ..., d2 1, starting from the one on the top row, then three on the second row, five on the third row, and finally 2d 1 on the last row, row d 1. The mth triangle is on the ith row and jth column (i = 0, 1, ..., d 1; j = 0, 1, ..., 2i), with
![]() | (8) |
![]() | (9) |
|
Under M8, we use d = 10 categories for each of the four parameters p0, p, q, s, and use the midpoint of each interval to represent the density in that interval. We assume the following priors: p0
U(0, 1), p
U(0, 2), q
U(0, 2), and
s
U(1,11). Thus, p0 takes any of the 10 values 0.05, 0.15, ..., 0.95 with a prior probability of 0.1, each of p and q takes any of the 10 values 0.1, 0.3, ..., 1.9 with a prior probability 0.1, while
s takes any of the 10 values 1.5, 2.5, ..., 10.5 with a prior probability 0.1. To save computation, the beta distribution (for given values of p and q) is discretized using d = 10 equally spaced categories, unlike Yang et al. (2000), who used 10 equal-probability categories; that is, beta(p, q) is approximated using 10 categories represented by
= 0.05, 0.15, ..., 0.95, with the proportion for each category equal to the probability mass within that category. Thus, different beta distributions specified by different values of p and q on the grid are represented by the same set of
values, and f(xh|
) is calculated for the same set of
values for all sites. This strategy makes the computation feasible (see below), although it may not be as good as the equal-probability scheme for approximating a skewed beta density.
The posterior distribution of parameters (that is, p0, p1,
0,
2 under M2a and p0, p, q,
s under M8) is calculated as posterior probabilities for the 104 points on the 4-D grid. We summarize the distribution by the marginal densities of the parameters. For example, the posterior probability for the proportion parameter p0 under M8 is
![]() | (10) |
Computational Issues. The computation required by the old NEB method is equivalent to one calculation of the likelihood function, which is about K times as expensive as under the one-ratio model M0, where K is the number of site classes (Nielsen and Yang 1998). The conditional probability of data at each site h given the site class or the ratio, f(xh|
(h) =
k), has to be calculated separately for the K site classes. Similarly, in our implementation of the BEB procedure, we would like to calculate the conditional probability for as few
values as possible. Thus, we fix the branch lengths at the synonymous sites (i.e., the expected number of synonymous substitutions per codon) at their MLEs. Then we calculate f(xh|
(h) =
k) for 2K + 1 = 21 different
values under M2a: 10 values for
0, 1 for
1 = 1, and 10 values for
2; and 2K = 20 different
values under M8: 10 values for the
from the beta and 10 values for
s. While the computation is several times more expensive than for the NEB procedure, it is much faster than the ML iteration, which requires many calculations of the likelihood function.
BEB Calculations Under Branch-Site and Clade Models
Yang and Nielsen (2002) implemented two branch-site models, A and B, which allow the ratio to vary both among sites and among branches. Positive selection is potentially operating on only some branches, called the foreground branches, while the other (background) branches are under purifying selection. The models assume four site classes. In site class 0, all lineages are under purifying selection with a small dN/dS ratio
0. In site class 1, all lineages are undergoing weak purifying selection or neutral evolution with
1 close to 1. In site classes 2a and 2b, a proportion of class-0 and class-1 sites become under positive selection with
2 > 1 on the foreground lineages. Model A fixes
0 = 0 and
1 = 1, while model B estimates those two parameters from the data. Real data analysis suggests that model A is very unrealistic as it fails to account for conserved sites with 0 <
< 1. Thus, we modify model A so that 0 <
0 < 1 is estimated. We still fix
1 = 1 to avoid misclassifying sites under weak purifying selection (with
close to but less than 1) as positive selection sites. The modified model is still referred to as model A (table 1) and involves four parameters:
= (p0, p1,
0,
2). Model A is the alternative model and can be used to construct two LRTs. The null model in test 1 is M1a, which assumes two site classes in proportions p0 and p1 = 1 p0 with ratios
0 and
1 = 1. The null model in test 2 is the same as model A (table 1) except that
2 = 1 is fixed. Test 1 may mistake relaxed purifying selection on the foreground branches as positive selection, while test 2 appears to be a direct test of positive selection. A simulation study evaluating the two tests will be reported elsewhere.
|
Bielawski and Yang (2004) (see also Forsberg and Christiansen 2003) implemented two clade models, called C and D, to detect divergent selective pressures between clades. Branches in the phylogeny are assumed to fall into two clades. Three site classes are assumed in the models. In site class 0, all lineages are under purifying selection with a small ratio 0. In site class 1, all lineages are evolving neutrally or under weak purifying selection with
1 close to 1. In site class 2, branches in the two clades are evolving with
2 and
3, respectively. No positive selection is assumed for either clade; instead the clades may be evolving under divergent selective pressures at some sites. In model C,
0 = 0 and
1 = 1 are fixed while
2 and
3 are estimated together with the proportions p0 and p1. In model D all parameters are estimated from the data including
0 and
1. We modify model C so that 0 <
0 < 1 is estimated while
1 = 1 is fixed (table 2) and will refer to the modified model still as model C. The model has five parameters:
= (p0, p1,
0,
2,
3). The model can be compared with the site model M1a to construct a LRT. Here, we briefly describe our implementation of the BEB method for calculating posterior probabilities for site classes under model C.
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Human Class I MHC Alleles
A data set consisting of 192 alleles of the human class I MHC alleles from the A, B, and C loci are analyzed. This data set was compiled and analyzed using the old NEB approach by Yang and Swanson (2002). The sequence length is 270 codons. The tree topology estimated by Yang and Swanson (2002) is used here. The F3x4 model of codon frequencies is used. To save computation, we estimate the branch lengths on the tree under model M0 (one-ratio) and use them as fixed when fitting site models M2a and M8.
Table 3 lists the log-likelihood values and the MLEs of parameters under models M2a and M8. Both models have much higher log-likelihood values than their corresponding null models M1a and M7, providing strong evidence for presence of sites under positive selection (results not shown; see table 2 of Yang and Swanson [2002]). Sites inferred to be under positive selection by the NEB and BEB approaches under the two models are listed as well, with the cutoff posterior probability set at Pb = 95%.
|
The posterior probabilities for the three site classes under M2a are almost identical between NEB and BEB. Those under BEB are often (but not always) less extreme (away from 0 or 1) than those under NEB. Note that the posterior probabilities sum to 1 over the site classes in each method, and less extreme probabilities mean less confidence in inference. The lists of sites under positive selection at the cutoffs Pb = 95% and 99% are exactly the same between the two approaches (table 3).
Under M8, the results of table 3 are slightly different from those of Yang and Swanson (2002) due to minor differences in the branch lengths used. The MLEs of parameters and their approximate SEs are In the BEB analysis, the posterior distribution of p0 is concentrated on 0.85 and 0.95, with probabilities 0.876 and 0.124. The beta parameter p is concentrated on 0.1, with probability
1 while q is concentrated on points 0.5 and 0.7, with probabilities 0.810 and 0.179, respectively. Parameter
s is concentrated on 4.5 and 5.5, with probabilities 0.135 and 0.865. Again, these approximate posterior estimates agree well with the MLEs. The estimates of beta parameters p and q are slightly more different between the two methods because of different discretization schemes used.
M8 assumes 11 site classes: 10 classes for the beta distribution and 1 class for positively selected sites. Because the old NEB used 10 equal-probability categories to approximate the beta distribution and the new BEB used 10 equally spaced categories, the posterior probabilities for the first 10 site classes are not directly comparable between the two approaches. The posterior probabilities for the positive selection class are very similar. The lists of positive selection sites at Pb = 95% and 99% are almost identical; the only differences are at sites 82R and 94T, for which P = 0.992 and 0.992 by NEB while P = 0.985 and 0.987 by BEB.
We also conducted a robustness analysis under M8. First, we used d = 20 categories in the BEB calculation, with 160,000 instead of 10,000 points on the 4-D grid, to examine the effect of the number of categories d on calculation of the posterior probabilities. The beta distribution is discretized using d = 20 equally spaced categories as well. The posterior probabilities are very similar to those obtained using d = 10 categories, when two consecutive categories for d = 20 are merged into one category for d = 10. Exactly the same sites are inferred to be under positive selection at the 95% and 99% cutoffs for the two values of d. Also, the correlation coefficients in the posterior mean are all greater than 0.999 among the three analyses: the old NEB with 10 categories, and the new BEB with 10 or 20 categories. Ten categories appear to be sufficient for discretizing the integral over parameters
.
Next we examine the effect of the prior for . We applied a triangle prior for p0 under M8 with density f(p0) = 2p0, 0 < p0 < 1. This prior places more density mass on p0 close to 1; the prior probabilities for the 10 values 0.05, 0.15, ..., 0.95 are 0.01, 0.03, ..., 0.19. The lists of sites inferred to be under positive selection at Pb = 95% and 99% are essentially identical to those obtained under the uniform prior (table 3); the only difference is that site 94T had P = 0.992 for the uniform prior and 0.987 for the triangle prior. We also used U(1, 21) instead of U(1, 11) as the prior for
s. This change had a slightly greater effect. For example, the posterior probabilities for sites 82R, 94T, and 113Y changed from 0.985, 0.987, and 0.993 under the old prior to 0.933, 0.947, and 0.979 under the new prior. Overall, the priors for parameters
had minimal effects on the calculation of the posterior probabilities in this data set.
HIV env Gene
The second data set consists of the HIV-1 env gene V3 region from 13 HIV-1 isolates, previously analyzed by Yang et al. (2000). The sequence has 91 codons. The F3x4 model of codon frequencies is used. This was intended to be a small data set, suitable for demonstrating differences between the NEB and BEB approaches, but it failed to do so (see below). To see the effects of sequence sampling, we also analyzed a smaller data set of only the first four sequences (accession numbers U68496U68499).
The MLEs of parameters under models M2a and M8 are listed in table 4, together with the sites inferred to be under positive selection by the NEB and BEB approaches at Pb = 95%. Both M2a and M8 have much higher likelihood values than their corresponding null models M1a and M7, so the LRTs suggest presence of sites under positive selection. Both models identified three sites under positive selection by the old NEB approach.
|
Under M8, the MLEs of parameters and their approximate SEs are and
The large sampling errors, especially for the beta parameters p and q, indicate considerable uncertainty in the MLEs. The marginal posterior densities of p0, p, q, and
s peak at 0.75, 0.70, 1.9, and 3.5, with probabilities at the peaks to be 0.494, 0.181, 0.121, and 0.510, respectively. The approximate Bayesian estimates of p0 and
s agree well with the MLEs, but the estimates of p and q do not because of the different schemes used to discretize the beta distribution. Both analyses suggest that p and q are more poorly estimated than p0 and
s. The posterior probabilities for the positive-selection class are found to be quite similar between NEB and BEB. For example, the probabilities for site 9S are 0.759 and 0.859, with posterior mean
to be 2.858 and 2.819, for NEB and BEB, respectively. At Pb = 95%, NEB identified 28T, 66E, and 87V to be under positive selection, while at Pb = 90%, sites 26N and 51I are identified as well. All of these five sites reached the 95% cutoff by the BEB approach, which identified two additional sites at Pb = 90%: 69N and 83V. It is interesting that BEB inferred more sites under positive selection than NEB in this data set (table 4).
We also applied the triangle prior for p0 under M8 with density f(p0) = 2p0, 0 < p0 < 1. This had very minor effect on the posterior distributions of the parameters or on the posterior probabilities of sites under positive selection. The lists of sites under positive selection at Pb = 0.95 and 0.99 are identical between the two priors, with almost identical probabilities.
Overall the NEB and BEB approaches produced similar inferences of sites under positive selection for this data set, despite the considerable uncertainties in the MLEs of model parameters. To explore further the differences between the two approaches and to examine the effects of sequence sampling, we analyzed a smaller data set consisting of only the first four sequences (accession numbers U68496U68499).
In this small data set, the LRT statistic is 2 = 4.1 for both the M1a-M2a and M7-M8 comparisons, and the null hypotheses are not rejected. Model M2a produced the following MLEs:
with very large sampling errors. At Pb = 0.95, both NEB and BEB identified one site (28T) to be under positive selection, with P = 0.96 for NEB and 0.95 for BEB. Model M8 gave parameter estimates
and
again with large sampling errors. At Pb = 0.95, both NEB and BEB identified site 28T to be under positive selection, with P = 0.96 and 0.98 for NEB and BEB, respectively. Overall, NEB and BEB are similar and models M2a and M8 are consistent in this small data set. Note that the single site (28T) identified in this small data set was also identified in the larger 13-sequence data set. The larger data set provided stronger evidence for positive selection in identifying more sites with higher posterior probabilities. A similar pattern was reported for two MHC data sets, with 6 and 192 sequences, respectively, by Swanson et al. (2001) and Yang and Swanson (2002).
HTLV-I tax Gene
Twenty sequences of the tax gene from the HTLV-I are retrieved from GenBank and analyzed on a star phylogeny, following Suzuki and Nei (2004). The sequences, 181 codons long, are very similar and all differences are singletons. Ancestral sequence reconstruction suggests a total of 23 single-nucleotide mutations: 2 synonymous transitions (at sites 33L and 38E), 19 nonsynonymous transitions (at sites 4P, 39D, 43I, 53V, 60S, 62L, 81G, 85I, 92D, 101S, 108K, 115H, 146S, 152K, 154A, 157N, 161P, 166G, 181V), and 2 nonsynonymous transversions (2C, 69L). Site numbering here refers to sequence AB045401. We use the F3x4 model to accommodate biased codon usage. Application of model M0 (one-ratio) leads to the estimates and
with the log-likelihood
= 892.02. M0 can be compared with the null model that fixes
= 1. This LRT rejects the null model, with P = 0.008. Thus, the average
across the whole sequence and across all branches on the tree is significantly greater than 1, and there seems to be no doubt that positive selection drives the evolution of the tax gene.
We fix the branch lengths at the estimates obtained under M0 when applying the site models. The MLEs under both M2a and M8 are reduced to those under M0, with all sites having (
and
under M2a, and
and
under M8). The MLEs under the null models M1a and M7 are reduced to
= 1, with
= 895.50. The test statistic for the two LRTs comparing M2a with M1a and comparing M8 with M7 is 2
= 6.96, and the null models are rejected with a marginal P value
0.03. NEB calculation of posterior probabilities for site classes using such MLEs led to the conclusion that all sites, including the 158 invariant sites, are under positive selection with P = 1, as reported by Suzuki and Nei (2004).
The BEB approach is then applied to the same data. Under M2a, the posterior density for p0 and p1 is the highest at three points (0.033, 0.033), (0.067, 0.067), and (0.033, 0.133), each receiving probability 0.035, compared with the prior probability 0.01. The posterior distribution of 0 peaks at 0.95, with probability 0.113, while that of
2 peaks at 6.5 and 7.5, each with probability 0.184. Those probabilities are not very different from the prior probability 0.1, indicating lack of information in the data. The 21 sites with a nonsynonymous mutation are inferred to be under positive selection with 0.91 < P < 0.93, while all other sites are under positive selection with 0.55 < P < 0.61.
Under M8, the posterior densities for p0, p, q, and s peak at 0.05, 1.9, 0.1, and 5.5, with probabilities 0.206, 0.106, 0.111, and 0.211, compared with the prior probability 0.1 each. The 21 sites with a nonsynonymous mutation are inferred to be under positive selection with 0.96 < P < 0.97, while all other sites are inferred to be under positive selection with 0.69 < P < 0.73. We also applied the triangle prior for p0 under M8. Under this prior, the posterior densities for p0, p, q, and
s peak at 0.05, 1.9, 0.1, and 5.5, respectively, as under the uniform prior, and the probabilities at the peaks are 0.197, 0.106, 0.111, and 0.207, similar to those under the uniform prior. The posterior probabilities for site classes are identical between the two priors at the level of accuracy used here.
Considering the LRTs and the BEB calculations of posterior probabilities, we suggest that positive selection has affected the evolution of the tax gene. The nonsynonymous substitutions seen in the data are likely due to positive selection, although the evidence is marginal.
Computer Simulation
Before describing our simulation experiment, we illustrate the concept of posterior probabilities as well as three performance measures of methods for detecting positive selection sites. Figure 2 shows the results obtained by simulating the data under the prior and analyzing them under the correct prior and model. The tree used is (((A:0.1, B:0.2):0.12, C:0.3):0.123, D:0.4, E:0.5), where the branch length is measured by the expected number of nucleotide substitutions per codon. It is assumed that there is no transition-transversion rate difference so that = 1 and each codon has the equilibrium frequency 1/61. The sequence length is 1,000 codons. The data are generated under M2a with the priors (p0, p1, p2)
D(1, 1, 1),
0
U(0, 1) and
s
U(1, 11). Each replicate data set is generated by drawing parameters p0, p1,
0, and
s from those priors and then evolving sequences on the tree.
|
A third measure is the false-positive rate (fig. 2d), the proportion of sites not under positive selection that are inferred falsely to be under positive selection. This is a frequentist measure, formulating the problem of identifying positive selection sites as one of testing problem, in which the null hypothesis assumes neutral evolution ( = 1) while the alternative hypothesis assumes positive selection (
> 1) (Suzuki and Gojobori 1999). In this formulation, the false-positive rate is also the type I error. This measure was used by Suzuki and Gojobori (1999) and Wong et al. (2004). One minus the false-positive rate is also known as "specificity." Note that the Bayesian posterior probability calculation gives the correct accuracy, but not the frequentist false-positive rate. However, many Bayesian methods are known to have good frequentist properties (see, e.g., pp. 92108; Carlin and Louis 2000). In figure 1d, the false-positive rate does not match and is much lower than 1 Pb. For example, at the cutoff Pb = 0.7, the false-positive rate is only 0.03, much lower than 1 0.7 = 0.3 (fig. 2d). Note that Suzuki and Nei (2002) confused the Bayesian posterior probabilities with the frequentist type I error rate when they claimed that 1 Pb should equal the nominal P value.
In the following simulation, we examine the frequentist false-positive rate and the power (proportion of true positives) of the new BEB approach in comparison with the old NEB approach for detecting positive selection sites. Data are simulated using fixed values of parameters for the distribution. We address two major questions: (1) does BEB overcome the problem of high false positives of NEB in small data sets? (2) does the BEB correction cause a loss of power in large data sets in which NEB was working well? We used two simulation schemes (4 and 6) of Wong et al. (2004) plus a new scheme. Scheme 4 assumes two site classes in proportions 1:1 with
= 1 and 1.5. The old NEB produced many false positives under this scheme (Wong et al. 2004), and it is interesting to know whether the BEB is an improvement. Scheme 6 assumes three site classes in proportions 45%, 45%, and 10% with
ratios 0, 1, and 5, respectively. Under this scheme, NEB performed very well (Wong et al. 2004), and it is interesting to know whether the BEB correction causes any loss of power. The third scheme (scheme 7) is new and assumes 12 site classes in proportions 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05 with
ratios 0, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 1, 2, 3, 4, and 5, respectively. This scheme is used to evaluate the robustness of the analysis and to address the concern that both M2a and M8 assume one
value for all sites under positive selection while those sites may be expected to be under different strengths of selection.
Simulation conditions follow Wong et al. (2004). Two trees with either 5 or 30 taxa are used with fixed branch lengths. The tree length, that is, the expected number of nucleotide substitutions per codon along all branches in the tree, is 3. Again the model assumes no transition-transversion bias ( = 1) or codon usage bias (
j = 1/61). The sequence length is 500 codons. Data sets were simulated using the evolver program in the PAML package (Yang 1997) and analyzed using the codeml program, which implements both the NEB and BEB approaches. The correct tree topology is used, but the branch lengths are estimated by ML.
The results are summarized in table 5. Under scheme 4, the old NEB has high false-positive rates caused by inaccurate MLEs of parameters in the distribution. The BEB procedure corrects for the problem and reduces the false-positive rate considerably. For example, under M2a the false-positive rate is 42% for NEB but only 1% for BEB at the cutoff Pb = 95%. The false-positive rate for BEB under M8 is higher than under M2, at 5% in the large tree and 6% in the small tree. Under this scheme, BEB has very low true-positive rate, never identifying more than 9% of all positive selection sites, even in the large tree. It appears to be very difficult to identify positively selected sites with
as low as 1.5. The apparent high power of NEB in this scheme is clearly unreliable.
|
To examine the false-positive rate of the BEB procedure when the data contain no positive selection sites, we also simulated data sets under schemes 1 and 2a of Wong et al. (2004). Scheme 1 assumes that all sites have = 1 (corresponding to a pseudogene), while scheme 2a assumes that 50% of sites have
= 0 and 50% have
= 1. Schemes 1 and 2a are similar to schemes 4 and 6 except for the absence of sites under positive selection. It is not possible to calculate true-positive rates as there are no true-positive sites. The false-positive rate for BEB at the cutoff Pb = 0.95 is found to be 0 for both schemes 1 and 2a, for both trees, and both before and after the LRT. The error rate is 0 even if a less stringent criterion Pb = 0.5 is applied. Thus, the false-positive rate for BEB is lower in schemes 1 and 2a, where no positive selection sites are present, than in corresponding schemes 4 and 6, where some sites are under positive selection.
We also note that BEB maintains a low false-positive rate even when the LRT has not been performed first. However, we suggest that to answer the question whether there are any sites in the sequence under positive selection, the LRT should be used, while the BEB should be used to identify positive selection sites when the LRT indicates that such sites exist. Overall, the BEB correction appears to avoid the high false-positive rates of the NEB approach in small noninformative data sets, while it has not caused any loss of power in large informative data sets. It also appears that the BEB procedure tends to be conservative if considered a frequentist test; the false-positive rate is often much lower than 1 Pb when sites are identified at the cutoff posterior probability Pb.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
We used three real data sets to evaluate the differences between the NEB and BEB approaches. The two methods are different when the MLEs are extreme as in the HTLV-I rax gene. What is striking is perhaps the similarity between the two methods in very small data sets, such as the HIV env genes. The real data analysis also suggests that models 2A and 8 usually gave similar conclusions, as found in early studies (e.g., Yang et al. 2000; Swanson et al. 2001). This pattern appears to suggest that previous studies using the NEB approach should be fine as long as the data set is not too small and the estimates are not extreme (say, with estimates of proportions to be 0 or 1). However, if the data consist of few short sequences, or if estimates of s are only slightly larger than 1, it may be worthwhile to use the new BEB method to confirm results. Sequence sampling seems to have greater effects than either the prior for parameters or the different methods (NEB vs. BEB).
The simulation study suggests that the BEB method in general appears to have good statistical properties. In small data sets, the BEB does not have the high false-positive rate of the NEB approach, while in large data sets, the BEB seems at least as powerful as NEB. The BEB appears often to be conservative under the frequentist criterion, with the false-positive rate to be lower than 5% if a cutoff posterior probability of Pb = 95% is applied.
The extensive simulation studies performed by Anisimova, Bielawski and Yang (2001) and Wong et al. (2004) demonstrate that the LRTs for detecting positive selection, suggested by Nielsen and Yang (1998) and Yang et al. (2000), have good statistical properties over a wide range of conditions. Analyses of both real and simulated data sets in this study suggest that the new BEB method is reliable in both small and large data sets and also has good power for identifying individual positively selected sites, especially in large data sets or with strong selective pressure. Together, those methods provide a robust and trustworthy framework for inference of positive selection affecting protein-coding genes. However, it is important to be aware of the inherent limitations of these methods. First, they have appreciable power to detect positive selection only if multiple substitutions have occurred at the same codon site throughout the phylogeny. If positive selection does not involve recurrent fixations of nonsynonymous mutations at the same sites, those methods may fail. For example, Guindon et al. (2004) demonstrated that in some HIV-1 genes, the selective pressure varies not only among sites but also among lineages. Second, a number of assumptions are made in the tests for positive selection, which may be violated in real data. For example, simulations demonstrated that the LRTs are not robust to frequent intragenic recombinations (Anisimova, Nielsen, and Yang 2003). Likewise, the methods accommodate variable nonsynonymous rates among sites but assume the same synonymous rate, and it is possible that varying mutation rates (or other mutational parameters) among sites may mimic the effect of positive selection (Kosakovsky Pond, Frost, and Muse 2004). We encourage more work identifying cases where the likelihood methods for detecting positive selection might fail. Only by identifying such cases is it possible to further improve the current framework and construct even better statistical methods.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Anisimova, M., J. P. Bielawski, and Z. Yang. 2001. The accuracy and power of likelihood ratio tests to detect positive selection at amino acid sites. Mol. Biol. Evol. 18:15851592.
. 2002. Accuracy and power of Bayes prediction of amino acid sites under positive selection. Mol. Biol. Evol. 19:950958.
Anisimova, M., R. Nielsen, and Z. Yang. 2003. Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics 164:12291236.
Bielawski, J. P., and Z. Yang. 2001. Positive and negative selection in the DAZ gene family. Mol. Biol. Evol. 18:523529.
. 2004. A maximum likelihood method for detecting functional divergence at individual codon sites, with application to gene family evolution. J. Mol. Evol. 59:121132.[ISI][Medline]
Bishop, J. G., A. M. Dean, and T. Mitchell-Olds. 2000. Rapid evolution in plant chitinases: molecular targets of selection in plant-pathogen coevolution. Proc. Natl. Acad. Sci. USA 97:53225327.
Carlin, B. P. and T. A. Louis. 2000. Bayes and empirical Bayes methods for data analysis. London, Chapman and Hall.
Carlin, B. P., and A. E. Gelfand. 1990. Approaches for empirical Bayes confidence intervals. J. Am. Stat. Assoc. 85:105114.[ISI]
Deely, J. J., and D. V. Lindley. 1981. Bayes empirical Bayes. J. Am. Stat. Assoc. 76:833841.[ISI]
Filip, L. C., and N. I. Mundy. 2004. Rapid evolution by positive Darwinian selection in the extracellular domain of the abundant lymphocyte protein CD45 in primates. Mol. Biol. Evol. 21:15041511.
Ford, M. J. 2001. Molecular evolution of transferrin: evidence for positive selection in salmonids. Mol. Biol. Evol. 18:639647.
Forsberg, R., and F. B. Christiansen. 2003. A codon-based model of host-specific selection in parasites, with an application to the influenza A virus. Mol. Biol. Evol. 20:12521259.
Goldman, N. and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725736.
Guindon, S., A. G. Rodrigo, K. A. Dyer, and J. P. Huelsenbeck. 2004. Modeling the site-specific variation of selection patterns along lineages. Proc. Natl. Acad. Sci. USA 101:1295712962.
Haydon, D. T., A. D. Bastos, N. J. Knowles, and A. R. Samuel. 2001. Evidence for positive selection in foot-and-mouth-disease virus capsid genes from field isolates. Genetics 157:715.
Huelsenbeck, J. P., and K. A. Dyer. 2004. Bayesian estimation of positively selected sites. J. Mol. Evol. 58:661672.[ISI]
Kosakovsky Pond, S. L., S. D. W. Frost, and S. V. Muse. 2004. HyPhy: hypothesis testing using phylogenies. Bioinformatics (in press).
Laird, N. M., and T. A. Louis. 1987. Empirical Bayes confidence intervals based on bootstrap samples. J. Amer. Stat. Assoc. 82:739750.[ISI]
Lane, R. P., J. Young, T. Newman, and B. J. Trask. 2004. Species specificity in rodent pheromone receptor repertoires. Genome Res. 14:603608.
Li, W.-H., C.-I. Wu, and C.-C. Luo. 1985. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitutions considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2:150174.[Abstract]
Miyata, T., S. Miyazawa, and T. Yasunaga. 1979. Two types of amino acid substitutions in protein evolution. J. Mol. Evol. 12:219236.[ISI][Medline]
Mondragon-Palomino, M., B. C. Meyers, R. W. Michelmore, and B. S. Gaut. 2002. Patterns of positive selection in the complete NBS-LRR gene family of Arabidopsis thaliana. Genome Res. 12:13051315.
Morris, C. 1983. Parametric empirical Bayes inference: theory and applications. J. Am. Stat. Assoc. 78:4765.[ISI]
Moury, B. 2004. Differential selection of genes of cucumber mosaic virus subgroups. Mol. Biol. Evol. 21:16021611.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715724.
Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929936.
Suzuki, Y., and T. Gojobori. 1999. A method for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 16:13151328.[Abstract]
Suzuki, Y., and M. Nei. 2002. Simulation study of the reliability and robustness of the statistical methods for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 19:18651869.
. 2004. False-positive selection identified by ML-based methods: examples from the Sig1 gene of the diatom Thalassiosira weissflogii and the tax gene of a human T-cell lymphotropic virus. Mol. Biol. Evol. 21:914921.
Swanson, W. J., Z. Yang, M. F. Wolfner, and C. F. Aquadro. 2001. Positive Darwinian selection in the evolution of mammalian female reproductive proteins. Proc. Natl. Acad. Sci. USA 98:25092514.
Takebayashi, N., P. B. Brewer, E. Newbigin, and M. K. Uyenoyama. 2003. Patterns of variation within self-incompatibility loci. Mol. Biol. Evol. 20:17781794.
Twiddy, S. S., C. H. Woelk, and E. C. Holmes. 2002. Phylogenetic evidence for adaptive evolution of dengue viruses in nature. J. Gen. Virol. 83:16791689.
Wong, W. S. W., Z. Yang, N. Goldman, and R. Nielsen. 2004. Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics 168:10411051.
Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555556. (http://abacus.gene.ucl.ac.uk/software/paml.html).[Medline]
. 2000. Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A. J. Mol. Evol. 51:423432.[ISI][Medline]
Yang, Z., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908917.
Yang, Z., R. Nielsen, N. Goldman, and A.-M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431449.
Yang, Z., and W. J. Swanson. 2002. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol. Biol. Evol. 19:4957.
Zanotto, P. M., E. G. Kallas, R. F. Souza, and E. C. Holmes. 1999. Genealogical evidence for positive selection in the nef gene of HIV-1. Genetics 153:10771089.