* Department of Zoology, George S. Wise Faculty of Life Science, Tel Aviv University, Ramat Aviv, Israel; and Department of Biology and Biochemistry, University of Houston, Houston, Texas
Correspondence: E-mail: dgraur{at}uh.edu.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: Codon volatility comparative method positive selection synonymous and nonsynonymous substitutions Darwinian purifying selection
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
PDF2004 is based on a measure called "codon volatility." Several variants of codon volatility have been proposed by Plotkin and Dushoff (2003), but PDF2004 uses the simplest variant, in which the volatility of a codon, v(C), is defined as
![]() | (1) |
The PDF2004 method for detecting positive selection relies on the premise that high gene volatility is indicative of an excess of amino acid replacements and, hence, of positive selection (at least in the recent evolutionary past). Under this paradigm, a genome can be scanned, and genes with observed volatility values that are significantly higher than their expected volatility are deemed to have been subject to positive Darwinian selection. Conversely, genes exhibiting exceptionally low volatilities are deemed to have been subjected to strong purifying selection.
The PDF2004 method was applied to two completely sequenced genomes: a bacterium (Mycobacterium tuberculosis) and a unicellular eukaryote (Plasmodium falciparum). Based on these analyses, 9% to 10% of all protein-coding genes were found to be significantly more volatile than the genome as a whole (P < 0.05); that is, they may be said to have evolved under positive selection. These values are much larger than estimates obtained through traditional comparative analyses (e.g., Endo, Ikeo, and Gojobori 1996). The volatility results indicative of purifying selection are even more peculiar. Only 7.5% of the protein-coding genes in M. tuberculosis and 13.9% of the protein-coding genes in P. falciparum exhibit any telltale signs of purifying selection, whereas the current knowledge in evolutionary biology is that virtually all protein-coding genes are subject to purifying selection.
In this note, we examine the PDF2004 method in some detail to determine whether it can be used to detect positive Darwinian selection.
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Codon Usage and Genome Volatility
The frequency of each codon (excluding stop codons) was counted for all the protein-coding genes in 171 completely sequenced genomes of bacteria from the July 2004 version of the NCBI database (ftp://ftp.ncbi.nlm.nih.gov/). For each genome, preferred codon usage for each amino acid was defined as the codon with the highest frequency from among the different codons for the same amino acid. The universal genetic code was used to calculate codon volatility. For each genome, mean genome volatility was calculated as the mean volatility of all codons weighted by their frequency within the genome. Mean genome volatility of preferred codons was calculated as the mean volatility of the preferred codons weighted by the frequency of their respective amino acids within the proteome.
Simulated Evolution
To test the effect of different selection regimes on the predictions made by the volatility method, we simulated evolutionary processes for all the genes within three genomes that were chosen because they exhibit low, intermediate, and high mean volatilities. The mean volatilities of Mycobacterium tuberculosis, Geobacter sulfurreducens, and Clostridium acetobutylicum are 0.733, 0.746, 0.783, respectively.
The parameters for the simulation included (1) numbers of point mutations, (2) transition/transversion ratios, and (3) fixation probabilities for two types of nonsynonymous mutation and three types of synonymous mutation (see below). In each step of simulation, one of the nucleotides within a gene was mutated. This mutation was then eliminated or fixed according to the specified selectional regime. That is, for each gene, a site was chosen randomly from a uniform distribution ranging between 1 and the length of the gene. The nucleotide at the site was then randomly mutated into another according to the predetermined transition/transversion ratio. The mutation was then classified into synonymous, nonsynonymous, or nonsense. Synonymous mutations were further classified into (1) preferred codon to nonpreferred codon, (2) nonpreferred codon to preferred codon, or (3) nonpreferred codon to another nonpreferred codon. Amino acid replacements resulting from nonsynonymous mutations were further classified into radical or conservative replacements, depending on whether the Grantham (1974) physicochemical distance between the exchanged amino acids was larger or smaller than 100, respectively.
Four selective regimes were simulated (table 1). In the (completely) neutral regime, all sense and missense mutations were fixed. In the purifying selection regime, nonsynonymous mutations were selected against. The probability of fixation for a radical amino acid replacement was smaller than that for a conservative replacement. Synonymous mutations to nonpreferred codons were also selected against, but to a lesser extent than nonsynonymous mutations. In the so-called synonymous regime, nonsynonymous mutations were invariably eliminated, and only synonymous mutations were allowed to fix. Synonymous mutations to preferred codons were assigned fixation probabilities of 1. Synonymous mutations to nonpreferred codons were assigned fixation probabilities of 0.5. Under the so-called nonsynonymous regime, all of the nonsynonymous mutations and about 50% of the synonymous mutations were fixed. This regime is supposed to simulate an excess of nonsynonymous over synonymous substitutions as expected under positive selection. A mutation was fixed if the value drawn from a uniform distribution between 0 and 1 was smaller than the fixation parameter for the type of mutation in the selectional regime in question.
|
Gene volatilities and corresponding P values were calculated using the PDF2004 software at http://www.cgr.harvard.edu/volatility.
Comparison of the PDF2004 Method with the Comparative Method
The results of the PDF2004 method were compared with those obtained from a comparison between the genomes of M. tuberculosis (strain CDC1551) and M. bovis. We identified putative orthologs by the method of reciprocal best Blast hits with a cutoff of e < 0.001. Of the 4,099 M. tuberculosis genes, 3,723 orthologs were identified in the genome of M. bovis. The sequences of the orthologous genes were aligned to the corresponding protein sequence of M. tuberculosis using Wise2 (Birney, Clamp, and Durbin 2004). This alignment respects codon positions as needed for the calculation of synonymous and nonsynonymous rates of substitution. Rates of synonymous (dS) and nonsynonymous (dN) substitutions were estimated using PAML (Yang 1997). We tested for correlation between the results of PDF2004 and the comparative method (dN/dS) by using Spearman rank nonparametric correlation (Zar 1999).
![]() |
Results and Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
The expected volatility and variance is computed for each amino acid as
![]() | (2) |
![]() | (3) |
We note that for the 16 amino acids with invariant codon volatilities (i.e., any amino acid that is not Gly, Leu, Arg, or Ser) the expected amino acid volatility is independent of either codon or amino acid frequencies:
![]() | (4) |
What is, however, more important is that for these 16 amino acids, the variance of the expected volatility is zero. The variances for the remaining amino acids are close to zero: 0.0061, 0.00015, 0.0053, and 0.014 for the Gly, Leu, Arg, and Ser, respectively.
The expected volatility of a gene (G) and its variance are calculated as
![]() | (5) |
![]() | (6) |
As stated previously, the volatility P value of a gene is calculated by comparing its observed volatility to the expected volatility. The expected volatility typically has a tiny variance, hence, it is easy to overestimate the statistical significance of the observed values. Moreover, the calculation of P values in the PDF2004 method involves thousands of dependent comparisons, and, therefore, a Bonferroni correction for multiple comparisons should be used in assessing statistical significance. Thus, in the case of the 4,099 genes from M. tuberculosis, P values should be smaller than 0.05/4,099 = 1.21 x 105to be significant. Without Bonferroni correction, 376 M. tuberculosis genes have P values smaller than 0.05. With the Bonferroni correction, the number of genes associated with statistically significant P values is reduced to 14. The corresponding numbers in P. falciparum are 534 and 52, respectively.
As we have shown previously, gene volatility is determined solely by 22 codons encoding four amino acids. Given the very low variation in codon volatility, it is reasonable to assume that even a small excess of high-volatility codons in comparison with their frequency in the genome may result in statistically significant volatility P values. To address this issue, we tested the correlation between the frequency of the 22 codons specifying Gly, Leu, Arg, and Ser in M. tuberculosis and the volatility P values. We also used multiple correlation to assess how much of the variation in P values can be explained by combinations of codons (table 3). Surprisingly, we found that with only two codons (CTG and AGC), we can explain about 42% of the variation. Using all 22 codons for Gly, Leu, Arg, and Se, we can explain up to 70% of the variation in volatility P value.
|
The results of the simulations (table 4, and see figures F1F3 in Supplementary Material online) indicate that regardless of initial codon composition and selection regime, mean gene volatility tends to approach asymptotically an equilibrium value. If the initial volatility is above the equilibrium value, it will decrease during evolution; if it is below the equilibrium value, it will increase. The exact value at equilibrium will obviously be determined by the amino acid frequencies and codon usage. In particular, gene volatility may be affected by whether the codon usage is biased towards low-volatility or high-volatility codons. In the simplest case, in which all 61 codons are equally frequent, the mean volatility will be approximately 0.747.
|
We compared the results of the PDF2004 method with the dN/dS ratios obtained by comparing orthologous genes from M. tuberculosis and M. bovis. There were no significant correlation coefficients between either expected or observed volatility in M. tuberculosis on the one hand and dN, dS, and dN/dS on the other. As far as the volatility P values are concerned, we obtained no significant correlations with either dN or dN/dS. A tiny negative correlation was seen between volatility P values and dS (r2= 0.0013). To illustrate this point, we note that in several cases, orthologous genes may be identical in sequence between M. tuberculosis and M. bovis, and yet they may show volatility P values indicative of positive selection. One such case is gene MT0441 (a hypothetical membrane protein in M. tuberculosis), which is identical in sequence to its ortholog in M. bovis but has volatility P value of 8.97 x 106. Thus, the comparative method indicates very strong purifying selection, whereas the PDF2004 method "reveals" an instance of strong positive Darwinian selection that paradoxically results in no evolutionary changes in either gene. We must, therefore, conclude that the PDF2004 method cannot detect selective pressures.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Birney E., M. Clamp, and R. Durbin. 2004. Genewise and genomewise. Genome Res. 14:988995.
Endo, T., K. Ikeo, and T. Gojobori. 1996. Large-scale search for genes on which positive selection may operate. Mol. Biol. Evol. 13:685690.[Abstract]
Grantham, R. 1974. Amino acid difference formula to help explain protein evolution. Science 85:862864.
Kinsella R. J., D. A. Fitzpatrick, C. J. Creevey, and J. O. Mclnerney. 2003. Fatty acid biosythesis in Mycobacterium tubeculosis: lateral gene transfer, adaptive evolution, and gene duplication. Proc. Natl. Acad. Sci. USA 100:1032010325.
Plotkin, J. B., and J. Dushoff. 2003. Codon bias and frequency-dependent selection on the hemagglutinin epitopes of influenza A virus. Proc. Natl. Acad. Sci. USA 100:71527157.
Plotkin, J. B., J. Dushoff, and H. B. Fraser. 2004. Detecting selection using a M. tuberculosis and P. falciparum. Nature 428:942945.[CrossRef][ISI][Medline]
Tekaia F., S. V. Gordon, T. Garnier, R. Brosch, B. G. Barrell, and S. T. Cole 1999. Analysis of the proteome of Mycobacterium tuberculosis in silico. Tuber. Lung Dis. 79:329342.[CrossRef][Medline]
Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555556.[Medline]
Zar, J. H. 1999. Biostatistical analysis. Prentice Hall, Upper Saddle River, NJ.