* Department of Biometrics, Cornell University
Department of Biology, University College London, London, England
Correspondence: E-mail: rn28{at}cornell.edu.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: dN/dS maximum likelihood selection coefficients neutral theory mitochondrial DNA positive selection
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The first important modification to Kimura's model was proposed by Ohta (1973). In her slightly deleterious mutation theory, new mutations have exponentially distributed negative selection coefficients. This model allows some mutations to be slightly deleterious, while no positive selection is allowed. A generalization of this model was provided by Kimura (1979, 1983), who suggested that the negative selection coefficients follow a gamma distribution. Kimura named this model the model of effectively neutral mutations.
There have also been many suggestions of models that involve positive selection coefficients. For example, in the classical Fisherian model (Fisher 1930a), the fitness effect of a new mutation is inversely related to the difference of the new allele from the ancestral allele, while both positive and negative selection coefficients are allowed in the model. Ohta (1992) proposed a modification of her original model, to allow a proportion of new mutations to have positive selection coefficients. This model is known as the nearly neutral model and is similar to some of the selection models discussed in Gillespie (1991). In fact, the current controversy regarding the distribution of selection coefficients is often reduced to a discussion of the relative importance of positive selection. Nonetheless, because of mathematical convenience, the strictly neutral model remains the most commonly assumed model in population genetic studies of demography or ancestral history.
Codon-Based Likelihood Models
For protein-coding genes, a measure of selective pressure on amino acid replacement mutations is the nonsynonymous/synonymous substitution rate ratio ( = dN/dS). Recent computational advances have made inferences regarding the distribution of
among sites possible (Nielsen and Yang 1998; Yang et al. 2000). To estimate
, we describe the evolutionary process in nucleotide sequences at the codon level as a continuous time Markov chain, with state space on the 61 possible sense codons in the universal genetic code (or the 60 sense codons in the vertebrate mitochondrial code). The infinitesimal rate of change from codon i to codon j in these models is (Goldman and Yang 1994; Muse and Gaut 1994):
|
Nielsen and Yang (1998) and Yang et al. (2000) extended this model to the case in which varies among sites. In one model,
was a random variable that takes the value 1 with probability p and 0 with probability 1 - p. This model was interpreted as a strictly neutral model of evolution. If the selection coefficient is the same for all nonsynonymous mutations in a particular codon site, this distribution of
among sites is predicted from Kimura's (1968) strictly neutral model of evolution. This model was extended by adding a proportion of sites with
> 1. Comparison between the two models using a likelihood ratio test constitutes a test of the neutral model against a positive selection alternative.
In Yang et al. (2000), several new models for variation in were introduced, varying in complexity from a simple gamma distribution of
among sites to a mixture of three normal distributions. Some limited power was found in distinguishing the various models (Yang et al. 2000). For example, in all the 10 data sets analyzed, the strictly neutral model gave a significantly worse fit to the data than a model with an additional category of sites in which
was a free parameter. However, very little power was detected to distinguish between the more parameter-rich models. In eight out of the 10 cases, the likelihood values obtained under a simple gamma distribution and under a parameter-rich mixture of three normal distributions were within 1 log-likelihood unit of each other.
As the nonsynonymous/synonymous substitution rate ratio is a measure of selective pressure acting on the protein, it is possible to use codon-based likelihood models to make inferences regarding the distribution of selection coefficients of nonsynonymous mutations. In this paper, we will illustrate how this can be done. We will develop some new codon-based likelihood models derived from population genetics models, and we will estimate parameters of these population genetics models and compare the fit of the models using likelihood ratio tests. Our analyses are based on relatively strong assumptions about the mutation process. In particular, we will assume that all nonsynonymous mutations in the same site have the same selection coefficient. This assumption may not be met in many cases. However, without a function mapping selection coefficients to values of
, little or no progress can be made on estimating the distribution of selection coefficients.
Several previous methods have been suggested for estimating the distribution of selection coefficients, often based on interspecific data using allelic distributions, or frequency spectrums (e.g., Bustamante et al. 2002; Fay, Wyckoff, and Wu 2001). The new method differs from such methods in that it only considers interspecific data and does not use information regarding the allelic distribution within species.
![]() |
Material and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Assume that there is no interference in the fixation process of multiple mutations at different sites. This will be true for interspecific data if there are not many strongly or moderately selected mutations segregating at the same time or if the level of recombination between sites is sufficiently high. If this assumption is not met, our method will tend to underestimate the selection coefficients. Additionally, we will assume that there are never more than two alleles segregating at the same nucleotide site. This is a reasonable assumption when the expected time to fixation or extinction measured in generations is short compared with the inverse of the mutation rate. This assumption should be valid for most organisms.
Under these assumptions, the fixation rate of new mutations with selection coefficients s is the product of the mutation rate (µ) per site, the chromosomal population size (N) in a haploid organism, and the probability of fixation (Kimura 1962),
|
|
Yang et al. (2000) have previously developed models of varying among codon sites. For example, in one of the models it is assumed that
is gamma distributed among codon sites with parameters
and ß (model M5 in Yang et al. 2000). The probability density function for S (using equation 3) is thus given by
|
Models of Varying N Among Lineages
Our goal here is to develop models that will allows us to estimate the effective population sizes in different lineages of a phylogeny jointly with the distribution of selection coefficients among sites. One of the major aims is to investigate to which degree it is possible to distinguish between different distributions of selection coefficients using phylogenetic data. The methods we develop will be applied to a previously published data set of complete mitochondrial genomes from eight primate species (Yoder and Yang 2000).
The codon-based substitution models are parameterized in terms of the population genetics parameters s and N. We make the simplifying assumption that the selection coefficient acting on new mutations at a site is constant in a particular lineage; that is, all mutations occurring at the same site have the same selection coefficient. We will also assume that the population size along each lineage of the phylogeny is a free parameter; that is, we allow variation of population size among lineages in the phylogeny. The value of at site i in lineage j is then
|
Ten different models are implemented, allowing different assumptions regarding Nj and si. To keep the models identifiable, we may only estimate the relative population sizes. We can, for example, fix the size of the human population and estimate the population sizes in all the other lineages relative to the human population size. The parameters of the models are then the relative population sizes and those in the distribution of S = 2Ns. The models differ from previous approaches in that they are parameterized directly in terms of selection coefficients. Estimates of S cannot be obtained simply by transforming estimates of in most of these models because N is allowed to vary among lineages.
Analytical calculation of the likelihood function under the continuous distributions is not computationally possible. Instead we approximate the distributions using 10 discrete categories as in Yang et al. (2000). A summary of the models is provided in table 1.
|
In the second model (model 2), we assume that the effective population size varies among lineages but that si = s for all i. Model 2 has 13 parameters: 12 values of N and one value of s. There are eight species, resulting in 13 different lineages in an unrooted tree. Since we can only estimate the relative population sizes, there are 12 values of N to estimate.
Model 3 assumes that s follows a normal distribution among sites with mean µ and variance 2. This model has 14 parameters: 12 values of N and µ and
2.
Model 4 is identical to model 3, except that the normal distribution has been truncated such that values of s > 0 are not allowed; that is, no positive selection is allowed. This model also has 14 parameters.
In model 5 it is assumed that s follows a gamma distribution reflected around the s = 0 axis; that is,
|
In model 6 it is assumed that s follows a reflected exponential distribution reflected around the s = 0 axis; that is,
|
Models 7, 8, and 9 are identical to models 2, 3, and 5, respectively, except that an extra category of sites (with proportion p0) in which s = - (
= 0) is added to each model. The proportion of variable sites is then p1 = 1 - p0. Therefore, the number of parameters in models 7, 8, and 9 are 15, 15, and 13, respectively. The reason for implementing these models is that we are mostly interested in the distribution of selection coefficients in sites that may vary. If many sites are completely functionally constrained, such that all new mutations are immediately lost from the population, our parameter estimates may be heavily influenced by the presence of such sites.
In the last model (model 10), it is assumed that all new mutations at a site are either neutral (s = 0) or strongly deleterious (s = -). In such a model, it is not possible to estimate population sizes for different lineages, since N does not influence the proportion of neutral sites. There is, therefore, only one parameter: p0, the proportion of invariable sites. This model is equivalent to the neutral model of Nielsen and Yang (1998).
Note that models 3 to 9 differ from all models implemented previously, not only in the distributional assumptions but also in allowing the effect of selection to vary simultaneously among lineages and among sites. Such models are necessary to allow variation in N among lineages while inferring the distribution of selection coefficients. At the level of computation, these models are much more demanding because they require the recalculation of the transition matrices for each branch and for each site class. A computationally more efficient alternative for allowing variation among lineages and sites simultaneously was described in Yang and Nielsen (2002). However, the method considered in Yang and Nielsen (2002) allows only a few categories of sites and cannot easily be used to estimate the distribution of selection coefficients.
Viral Data
We first use the model of fixed N among lineages and gamma distributed to estimate the distribution of selection coefficients among sites in two large viral data sets. The first data set consists of 421 codons from each of 186 sequences from the HIV-1 env gene. This data set was previously analyzed by Yamaguchi-Kabata and Gojobori (2000) and Yang (2001). The second data set consists of 329 codons from each of 349 sequences from the human influenza virus H3 hemagglutinin gene. It was previously analyzed by Bush et al. (1999) and Yang (2000). A more detailed description of the two data sets can be found in these references. The data sets are analyzed assuming
is gamma distributed among sites and that this distribution is constant among lineages. For computational and statistical details, see Yang et al. (2000). After the parameters of the gamma distribution have been estimated, the distribution of
is transformed into a distribution of S using equation 4.
Both of the viral genes analyzed are known to undergo strong positive selection; that is, there are multiple sites for which S > 0 (or > 1). They code for the coat proteins of the two viruses, which are primary targets for the host immune system. It is generally believed that the positive selection is driven by a selective pressure to avoid host immune recognition. For those data sets, it is entirely possible that more than two nucleotides may be segregating at the same time in a site. For positively selected sites, the effect will presumably be to underestimate the magnitude of the selection coefficient.
mtDNA Data
This data set consists of eight whole mitochondrial genomes from human (Homo sapiens [GenBank accession number X93334]), common chimpanzee (Pan troglodytes [GenBank accession number X93335]), bonobo (Pan paniscus [GenBank accession number D38116]), gorilla (Gorilla gorilla [GenBank accession number X93347]), Bornean orangutan (Pongo pygmaeus p [GenBank accession number D38115]), Sumatran orangutan (Pongo pygmaeus abelii [GenBank accession number X97707]), gibbon (Hylobates lar [GenBank accession number X99256]), and hamadryas baboon (Papio hamadryas [GenBank accession number Y18001]). The data set was previously analyzed by Yoder and Yang (2000) for molecular clock dating. Only the 12 protein-coding genes on the same strand of the genome are used; after alignment and removal of gaps, each sequence consists of 3,593 codons.
For each of the models discussed above, the likelihood function can be calculated as in Yang et al. (2000), which the reader can refer to for computational details. For the continuous distributions of selection coefficients, calculations are performed by discretizing the density function using 10 categories (see Yang et al. 2000). The topology of the phylogeny is assumed to be known, and the branch lengths of the phylogeny are estimated by maximum likelihood, together with parameters in the distribution of S, as summarized in table 1. The likelihood function is calculated efficiently by storing the calculated transition probability matrices for each rate category in each lineage in the computer memory. Optimization of the likelihood function takes between a few minutes to about 12 h on a PC, depending on the model.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
We notice that the assumption of no interference among mutations may not hold for the influenza data set. The result is that we probably tend to underestimate the magnitude of the selection coefficients. The real proportion of fixed mutations that are positively selected may, therefore, be even higher than 0.851.
Analysis of Mitochondrial Protein-Coding Genes
Table 1 summarizes the results obtained from analysis of the mtDNA data. The phylogenetic tree is shown in figure 2. The maximum-likelihood value for a model with no variation in N among lineages and no variation in S among sites is -38,884.38 (model 1). If we add variation in N among lineages, the maximum-likelihood value is -38,859.04 (model 2). This difference is significant (2 = 50.68, P
10-6, df = 12), indicating a lineage specific variation in the dN/dS ratio. Nevertheless, given the amount of data, the likelihood difference is not very large, although statistically significant.
|
Model 4 differs from model 3 in that no sites with s > 0 are allowed, and the model allows only neutral or negatively selected new mutations. The likelihood under this model is almost identical to that under model 3.
Allowing the selection coefficients to follow a reflected gamma distribution among sites (model 5) gives a maximum-likelihood value very similar to the value obtained under the normal distribution (model 4). It is not surprising that a normal and a reflected gamma distribution may give similar fit, since the gamma distribution tends to a normal distribution as becomes large. By inspecting figure 3, we also notice that the estimated shapes of the distribution are quite similar under the two models.
|
Models 7, 8, and 9 are identical to models 3, 5, and 6 except that an extra category of invariable sites have been added. Adding a category of invariable sites led to only marginal increases in likelihood for all of the models, although the parameter estimates changed. The estimate of the proportion of invariable sites vary from 0.0 to 0.124. The explanation for the small increase in likelihood is probably not that invariable sites are rare, but more likely, that we do not have much power to distinguish between parameter-rich models. The same observation was made in Yang et al. (2000). The model providing the best fit, of all the models analyzed here, is the one assuming a normal distribution and an additional category of invariable sites. This model places a lot of probability mass around -2.5 < S < -0.5, but also allows for some invariable sites.
Model 10 assumes that sites come in two flavors, sites that are completely invariable and sites in which all new mutations are neutral. This corresponds to the strictly neutral model in Nielsen and Yang (1998). This model performs significantly worse than all other models. It has the same number of parameters as model 1 but is nearly 1,000 likelihood units worse. Clearly, sites at which - < s < 0 are dominant in the evolution of mtDNA.
The branch lengths in figure 2 are proportional to the expected number of synonymous mutations for the reflected gamma + invariable model of selection coefficients. In addition, the estimated population sizes, relative to the human population size, are also shown for each branch. Notice that all estimates are within the same order of magnitude. Although this is not theoretically impossible, it does appear to be somewhat unlikely. Further research is warranted to explain this observation.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Very little is known from experimental data regarding the distribution of fitness effects of spontaneous mutations. Most experimental evidence regarding the fitness effects of new mutations comes from mutation accumulation experiments from Drosophila, such as the early experiments conducted by Mukai and colleagues (e.g., Mukai 1964; Mukai et al. 1972). In general, these experiments suggest that most mutations are deleterious and of small effect. Keightley (1994) reanalyzed data from Mukai et al. (1972) and Ohnishi (1974). Assuming a genome-wide mutation rate of 0.4, he obtained maximum-likelihood estimates of gamma distributions with very high variance of mutational effects ( < 1); that is, with a majority of mutations having very little effect on fitness. This contrasts with our results, which suggest a distributions with very little probability mass centered around S = 0 and (
> 1) for the gamma distribution. The data analyzed, the scaling of the parameters, and the underlying assumptions differ much between the two studies, so it is not surprising that the results differ. It is possible that our assumption of constant fitness effects of mutations occurring in the same site may increase the discrepancy between the two results.
In Yang et al. (2000), statistically significant (5% level) evidence for the existence of positively selected sites was detected in a different data set containing seven of the eight sequences analyzed here. A model in which is assumed to follow a beta distribution was compared with a model in which
was assumed to follow a mixture distribution of a point mass located at
> 1 and a beta distribution. The two models were compared using a likelihood ratio test, which rejected the beta distribution model. However, in this study, it was found that a normal distribution of values of S fits the data worse than a normal distribution truncated at S < 0. One possible explanation for the discrepancy is the difference in model assumptions between the two studies. The test based on the beta distribution of
may be more powerful. Also, the data differs slightly between the two studies by the inclusion of the baboon sequence in the present study.
Our major objectives in this paper were (1) to investigate whether we can use phylogenetic data to distinguish between different distributions of selection coefficients and (2) to determine which of the most common population genetics models of molecular evolution best fits the mtDNA data. It seems that we can in fact distinguish different distributions of selection coefficients. In particular, Kimura's (1979) effectively neutral model seems to fit the data much better than the exponential model of Ohta (1973), and a strictly neutral model (Kimura 1968) can be easily rejected. However, it appears that we do not have the power to distinguish between more parameter-rich distributions or between distributions with similar shapes, such as the reflected gamma and the normal distributions.
To make these inferences, we have to make some assumptions. Probably the most problematic assumption is that all new mutations at a site have the same selection coefficients. In the following, we discuss an alternative model that assumes that each of the four possible nucleotides in a site has a fixed fitness. The substitution process in a nucleotide site can then be described as a continuous time ergodic Markov chain with four states (1, 2, 3, 4) in which a transition between any of the states corresponds to a fixation event. The stationary frequencies of the Markov chain (p1, p2, p3, p4) obey the equations
|
|
|
|
Equations 3 and 11 are plotted in figure 4. Notice that the pattern observed in the two models is quite different. In the first model, is a strictly increasing function of the selection coefficient. In the second model in contrast, the highest value of
is obtained at S = 0. This model is effectively a model of constrained evolution in which a particular nucleotide represent the optimum. The difference between the two models demonstrates that any inferences will be strongly dependent on the specifics of the model.
|
We have here explored the problem of how to estimate the distribution of selection coefficients from comparative data. We have discussed some of the necessary assumptions and some of the problems related to these assumptions. We hope this study will encourage population geneticists to seek further use of the very large amounts of available comparative data for addressing the basic questions in population genetics regarding the fitness effects of new mutations.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Bush, R. M., W. M. Fitch, C. A. Bender, and N. J. Cox. 1999. Positive selection on the H3 hemagglutinin gene of human influenza virus A. Mol. Biol. Evol. 16:1457-1465.[Abstract]
Bustamante, C. D., R. Nielsen, S. A. Sawyer, K. M. Olsen, M. D. Purugganan, and D. L. Hartl. 2002. The cost of inbreeding in Arabidopsis. Nature 416:531-534.[CrossRef][ISI][Medline]
Halpern, A. L., and W. J. Bruno. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol. Biol. Evol. 15:910-917.[Abstract]
Fay, J. C., G. J. Wyckoff, and C.-I. Wu. 2001. Positive and negative selection on the human genome. Genetics 158:1227-1234.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[ISI][Medline]
Fisher, R. A. 1930a. The distribution of gene ratios for rare mutations. Proc. R. Soc. Edinb. Sect. B 50:204-219.
Fisher, R. A. 1930b. The genetical theory of natural selection. 2nd edition. Dover Press, New York.
Gillespie, J. H. 1991. The causes of molecular evolution. Oxford University Press, Oxford.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736.
Hudson, R. R. 1990. Gene genealogies and the coalescent process. Pp. 144 in P. H. Harvey and L. Partridge, eds. Oxford surveys in evolutionary biology. Vol. 7. Oxford University Press, New York.
Hudson, R. R., M. Kreitman, and M. Aguadé. 1987. A test of neutral molecular evolution based on nucleotide data. Genetics 116:153-159.
Keightley, P. D. 1994. The distribution of mutation effects of viability in Drosophila melanogaster. Genetics 138:1315-1322.
Kimura, M. 1962. On the probability of fixation of mutant genes in a population. Genetics 47:713-719.
Kimura, M. 1968. Evolutionary rate at the molecular level. Nature 217:624-626.[ISI][Medline]
Kimura, M. 1979. Model of effectively neutral mutations in which selective constraint is incorporated. Proc. Natl. Acad. Sci. USA 75:1934-1937.[ISI]
Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge.
Kingman, J. F. C. 1982. The coalescent. Stochast. Proc. Appl. 13:235-248.[CrossRef]
Mukai, T. 1964. The genetic structure of natural populations of Drosophila melanogaster. I. Spontaneous mutation rate of polygenes controlling viability. Genetics 50:1-19.
Mukai, T., S. I. Chigusa, L. E. Metler, and J. F. Crow. 1972. Mutation rate and dominance of genes affecting viability in Drosophila melanogaster. Genetics 72:333-355.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to chloroplast genome. Mol. Biol. Evol. 11:715-724.
Moran, P. A. P. 1962. The statistical processes of evolutionary theory. Clarendon Press, Oxford.
Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929-936.
Ohnishi, O. 1974. Spontaneous and ethyl methanesulfonate-induced mutations controlling viability in Drosophila melanogaster. Doctoral dissertation, University of Wisconsin, Madison.
Ohta, T. 1973. Slightly deleterious mutant substitutions in evolution. Nature 246:96-98.[ISI][Medline]
Ohta, T. 1992. The nearly neutral theory of molecular evolution. Annu. Rev. Ecol. Syst. 23:263-286.[CrossRef][ISI]
Sawyer, S., and D. Hartl. 1992. Population genetics of polymorphism and divergence. Genetics 132:1161-1176.
Tajima, F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595.
Yamaguchi-Kabata, Y., and T. Gojobori. 2000. Reevaluation of amino acid variability of the human immunodeficiency virus type 1 gp120 envelope glycoprotein and prediction of new discontinuous epitopes. J. Virol. 74:4335-4350.
Yang, Z. 2000. Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A. J. Mol. Evol. 51:423-432.[ISI][Medline]
Yang, Z. 2001. Maximum likelihood analysis of adaptive evolution in HIV-1 gp120 env gene. Pacific Symp. Biocomp. 2001:226-237.
Yang, Z., and R. Nielsen. 2002. Codon substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908-917.
Yang, Z., R. Nielsen, N. Goldman, and A.-M. K. Pedersen. 2000. Codon-substitution models for variable selection pressure at amino acid sites. Genetics 155:431-449.
Yoder, A. D., and Z. Yang. 2000. Estimation of primate speciation dates using local molecular clocks. Mol. Biol. Evol. 17:1081-1090.