Institute of Molecular Evolutionary Genetics, Department of Biology, The Pennsylvania State University
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Inferring positive selection at single amino acid sites is of biological and medical importance. By pinpointing the positively selected amino acid sites in a protein, one may infer the mechanism of positive selection, given that the biological function of these sites is well understood (e.g., antigen recognition sites in human leukocyte antigen [HLA] [Suzuki and Gojobori 1999
]). In addition, neutralizing epitopes in viral proteins may be predicted as positively selected sites because such regions are expected to be positively selected (e.g., hepatitis C virus [Suzuki and Gojobori 2001
]).
For inferring positive selection at single amino acid sites, Suzuki and Gojobori (1999)
developed a parsimony-based (SG) method, whereas Yang et al. (2000)
developed a likelihood-based (Yang) method. But the reliabilities of these methods are not well understood. The purpose of this article is to study the reliability and robustness of these methods by conducting computer simulation.
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In the Yang method, is assumed to follow a certain probability distribution among codon sites in the sequence. Fourteen different distributions (M0M13) were initially proposed to be used, but M3 and M8 have been recommended for the real data analysis and were actually used more often than others (Yang 2000
; Yang et al. 2000
). Model M3 is used in combination with M0. In M0,
is assumed to be the same for all codon sites. In M3, codon sites are classified into three categories, 0, 1, and 2, for which
is assumed to take values
0,
1, and
2, with proportions p0, p1, and p2, respectively. In each model, free parameters are estimated by the maximum-likelihood (ML) method. If
> 1 for any category in M3, we test whether M3 fits the data better than does M0 by the likelihood ratio test (LRT). Once the test is significant, we conclude that positively selected sites exist in the sequence. We then compute the posterior probability that a given codon site belongs to a category with
> 1 in M3. If the probability is higher than a given confidence level, positive selection is inferred.
Model M8 is used in combination with M7. In M7, is assumed to follow a beta distribution with 0
1. In M8, codon sites are classified into two categories, 0 and 1, which exist with proportions p0 and p1, respectively. In category 0,
is assumed to follow a beta distribution with 0
1, and in category 1,
takes a given value of
1. Positively selected amino acid sites are inferred in the same way as above, with M0 and M3 replaced by M7 and M8, respectively.
Computer Simulation
A random nucleotide sequence with 300 codon sites was allowed to evolve following a symmetrical phylogenetic tree with 8 (fig. 1A
), 16 (fig. 1B
), or 32 (fig. 1C
) exterior nodes. All branch lengths (b) were set such that the expected number of synonymous substitutions per synonymous site was 0.1. We assumed that = 0 at half the codon sites and that
= 1 at the other half. The simulation scheme has been described in detail by Suzuki and Gojobori (1999)
.
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
In the above analysis an increase in the number of sequences was accompanied by an increase in the number of nucleotide substitutions at each codon site in the entire phylogenetic tree. To examine which of these two factors, i.e., the number of sequences or the number of nucleotide substitutions, was responsible for the observed increase in the false-positive rate in the Yang method, we analyzed 16 sequences (fig. 1B ) generated by the same scheme as above, but with b = 0.05 and eight sequences (fig. 1A ) with b = 0.2. Note that the number of nucleotide substitutions for the former data set is similar to that for eight sequences with b = 0.1, whereas the number for the latter data set is similar to that for 16 sequences with b = 0.1. As shown in table 1 , only the case of eight sequences with b = 0.2 produced as many false positives as the case of 16 sequences with b = 0.1, indicating that the increase in the number of nucleotide substitutions was mainly responsible for the increase in the false-positive rate.
To examine whether the high false-positive rate observed above is specific to the particular fractions of codon sites with = 0 (50%) and
= 1 (50%), we considered the case where
= 0 at 90% of codon sites and
= 1 at the remainder with 32 sequences (fig. 1C
). In the Yang method the false-positive rate (11%) was higher than expected, whereas in the SG method the rate (0.5%) was lower (table 1
).
We also considered the case where = 0.5 or 1.5 for 50% of codon sites and
= 1 for the remaining 50% to examine whether the high false-positive rate observed at the codon sites with
= 1 is specific to the particular
value (0) at other sites. Again, the false-positive rate was high (12% for
= 0.5 and 7% for
= 1.5) in the Yang method but was low (1% for both) in the SG method (table 1
).
Finally, to examine whether the high false-positive rate is specific to the symmetrical phylogenetic tree, we generated 32 sequences following an asymmetrical phylogenetic tree (fig. 1D
), with = 0 at half the codon sites and
= 1 at the other half. The false-positive rate was still high (10%) in the Yang method and was low (1%) in the SG method (table 1
).
Note that in all the above simulations the SG method never inferred positive selection at the codon sites with = 0, whereas the Yang method occasionally did (table 1
).
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The Yang method, in contrast, appears to be liberal under the same conditions. When we did not conduct the LRT, the false-positive rate at the codon sites with = 1 was quite high in all cases examined. This probably occurred because the posterior probability is computed under the assumption that the estimate (
) of
is correct, although it is subject to the sampling error. The LRT appears to help reduce the cases where
happens to be greater than 1 by chance, but this test is known to be sensitive to violation of the assumptions (Zhang 1999
), as shown above. Anisimova, Bielawski, and Yang (2001)
conducted a simulation using six sequences generated with
= 0 at half the codon sites and
= 1 at the other half and found that model M3 produced many false positives even after conducting the LRT but M8 did not do so (see also Anisimova, Bielawski, and Yang 2002
). According to these results, they concluded that the Yang method is robust as long as multiple models are used. These observations are similar to our simulation results using eight sequences with b = 0.1. In our study with a larger number of sequences, however, both M3 and M8 produced many false positives regardless of the fractions of codon sites with
= 0 and
= 1, the
values at the codon sites where
1, and the phylogenetic trees. In addition, an increase in the number of nucleotide substitutions appears to be mainly responsible for the increase in the false-positive rate. Because the number of nucleotide substitutions is equivalent to the sample size in the statistical test of positive selection, the Yang method appears to produce many false positives consistently under the above conditions. We have also shown that even amino acid sites with
= 0 may be identified as positively selected sites.
The Yang method appears to have another problem when the number of sequences used is large. When Suzuki and Nei (2001)
analyzed 218 sequences of the HLA gene by the Yang method, they often failed to find the ML estimates of free parameters because of the existence of multiple local optima on the likelihood surface. Su, Nguyen, and Nei (2002)
also obtained different results depending on the initial
values when they analyzed 70 dimeric immunoglobulin (Ig) variable region (VHH) genes and 96 conventional Ig heavy chain (VH) genes. Note that even when putative ML estimates were found, in both studies, the overall reliability of the Yang method appeared to be similar to or lower than that of the SG method. In the present study we also compared the overall efficiency of detecting positively selected sites between the SG and Yang methods, using the case with
= 1.5 at half the codon sites and
= 1 at the other half. When we computed the odds ratio which is defined by n1.5(30,000 - n1)/(30,000 - n1.5)n1, where n1.5 is the number of positively selected sites inferred at the codon sites with
= 1.5, and n1 is the number at the codon sites with
= 1; the ratios were 3.88 for the SG method and 3.56 for the Yang method, suggesting that the former method had a slightly higher efficiency than did the latter method. In conclusion, both our simulation studies and real data analyses suggest that the positively selected sites inferred by the parsimony-based method are more reliable than those inferred by the likelihood-based method.
Several other methods similar to the SG method have been proposed for detecting positive selection at single amino acid sites. Fitch et al. (1997)
proposed to use CS/(CS + CN) and CN/(CS + CN) as the probabilities of occurrence of synonymous and nonsynonymous substitutions at every codon site, respectively, for computing the binomial probability, where CS and CN are the sums of cS and cN over all codon sites in the sequence, respectively. This method effectively assumes that the average of
over all codon sites represents the neutrality level. It is, therefore, possible that the results include the false positives or negatives according to the sequence analyzed. For example, if
= 0.1 at a certain fraction of codon sites and
= 0.2 at others in the sequence, the latter sites may be inferred as positively selected, although they are actually negatively selected, because the neutrality level is assumed to be somewhere between
= 0.1 and
= 0.2.
Yamaguchi-Kabata and Gojobori (2000)
proposed to use Fisher's exact test instead of the binomial probability for testing the selective neutrality in the SG method. They computed the expected numbers of synonymous (E(cS)) and nonsynonymous (E(cN)) substitutions at a given codon site approximately by E(cS) = (cS + cN)sS/(sS + sN) and E(cN) = (cS + cN)sN/(sS + sN), respectively. They then conducted Fisher's exact test for the 2 x 2 contingency table, which consisted of cS and cN in the first and second columns in the first row, respectively, and E(cS) and E(cN) in the first and second columns in the second row, respectively. Strictly speaking, however, Fisher's exact test was originally designed for the cases where the marginal totals (in the present example, cS + cN, E(cS) + E(cN), cS + E(cS), and cN + E(cN)) in the table were fixed, which is not applicable to the present case (Sokal and Rohlf 1995, pp. 724743
). To examine the reliability of this method, we analyzed 64 or 128 sequences which were generated following symmetrical phylogenetic trees under the assumption that all codon sites in a sequence (300 codon sites) had the same
value, i.e.,
= 0.2, 0.5, 1, 2, or 5, and b = 0.01, 0.02, or 0.03 (Suzuki and Gojobori 1999
). (Note that it is not very meaningful to include the Yang method in this analysis because the efficiency of detecting positively selected sites in the Yang method depends on the distribution of
values over all codon sites [Yang et al. 2000
], whereas positive selection is detected at each codon site independently in the SG method.) Similar to the case where binomial probability was used, the false-positive rate of the SG method using Fisher's exact test was low under various conditions (table 2
). But the probability of correctly identifying the positively and negatively selected sites (true-positive rate) was significantly higher in all cases examined when the binomial probability was used than when Fisher's exact test was used. It, therefore, appears to be better to use the binomial probability than Fisher's exact test in the SG method.
|
![]() |
Note Added in Proof |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Keywords: positive selection
parsimony
binomial probability
likelihood
posterior probability
Address for correspondence and reprints: Yoshiyuki Suzuki, Institute of Molecular Evolutionary Genetics, Department of Biology, The Pennsylvania State University, 328 Mueller Laboratory, University Park, Pennsylvania 16802. yis1{at}psu.edu
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Anisimova M., J. P. Bielawski, Z. Yang, 2001 Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution Mol. Biol. Evol 18:1585-1592
. 2002 Accuracy and power of Bayes prediction of amino acid sites under positive selection Mol. Biol. Evol 19:950-958
Fitch W. M., 1971 Toward defining the course of evolution: minimum change for a specific tree topology Syst. Zool 20:406-416[ISI]
Fitch W. M., R. M. Bush, C. A. Bender, N. J. Cox, 1997 Long-term trends in the evolution of H(3) HA1 human influenza type A Proc. Natl. Acad. Sci. USA 94:7712-7718
Hartigan J. A., 1973 Minimum mutation fits to a given tree Biometrics 29:53-65[ISI]
Hughes A. L., M. Nei, 1988 Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection Nature 335:167-170[ISI][Medline]
. 1989 Nucleotide substitution at major histocompatibility complex class II loci: evidence for overdominant selection Proc. Natl. Acad. Sci. USA 86:958-962[Abstract]
Nei M., S. Kumar, 2000 Molecular evolution and phylogenetics Oxford University Press, Oxford, N.Y
Sokal R. R., F. J. Rohlf, 1995 Biometry. 3rd edition Freeman, New York
Su C., V. K. Nguyen, M. Nei, 2002 Adaptive evolution of variable region genes encoding an unusual type of immunoglobulins in camelids Mol. Biol. Evol 19:205-215
Suzuki Y., T. Gojobori, 1999 A method for detecting positive selection at single amino acid sites Mol. Biol. Evol 16:1315-1328[Abstract]
. 2001 Positively selected amino acid sites in the entire coding region of hepatitis C virus subtype 1b Gene 276:83-87[ISI][Medline]
Suzuki Y., T. Gojobori, M. Nei, 2001 ADAPTSITE: detecting natural selection at single amino acid sites Bioinformatics 17:660-661
Suzuki Y., M. Nei, 2001 Reliabilities of parsimony-based and likelihood-based methods for detecting positive selection at single amino acid sites Mol. Biol. Evol 18:2179-2185
Yamaguchi-Kabata Y., T. Gojobori, 2000 Reevaluation of amino acid variability of the human immunodeficiency virus type 1 gp120 envelope glycoprotein and prediction of new discontinuous epitopes J. Virol 74:4335-4350
Yang Z., 2000 PAML (phylogenetic analysis by maximum likelihood). Version 3.0 University College London, London
Yang Z., R. Nielsen, N. Goldman, A.-M. K. Pedersen, 2000 Codon-substitution models for heterogeneous selection pressure at amino acid sites Genetics 155:431-449
Zhang J., 1999 Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models Mol. Biol. Evol 16:868-875[Abstract]