Department of Biological Sciences, University of South Carolina, Columbia
Correspondence: E-mail: austin{at}biol.sc.edu.
Abstract |
---|
Key Words: codon volatility, nucleotide content, positive selection, synonymous site, nonsynonymous site
Introduction |
---|
Widely used methods of testing for evidence of positive selection involve comparison of the number of synonymous nucleotide substitutions per synonymous site (dS) and the number of nonsynonymous substitutions per nonsynonymous site (dN) (Hughes and Nei 1988; Goldman and Yang 1994; Hughes 1999). The neutral theory predicts that dS will exceed dN in most genes because most nonsynonymous mutations are harmful to protein structure and are, therefore, eliminated (Kimura 1977). The opposite pattern is evidence that natural selection has acted to favor changes at the amino acid level (Hughes and Nei 1988). Estimation of dS and dN involves comparison of at least two related sequences, and this method may be unable to detect positive selection if the two sequences are distantly related (Hughes 1999). Because a closely related sequence may be unavailable for comparison, particularly in the case of model organisms for which complete genome sequences are available, a method to detect a signature of positive selection in a single genome has considerable appeal.
Furthermore, in most well-studied cases, positive selection affects only a small proportion of codons in the gene (Hughes 1999). In some cases, biological knowledge makes it possible to predict which codons are likely to be subject to positive selection (Hughes and Nei 1988), but often, such information is lacking. In these cases, additional statistical methods to identify a signal of past selection on individual codons may be desirable.
Codon volatility bears an obvious relationship to the estimation of dS and dN that was not noted by Plotkin, Dushoff, and Fraser (2004). Simple methods of estimating dS and dN, such as that of Nei and Gojobori (1986), involve counting the numbers of synonymous and nonsynonymous sites in coding sequences, and this process, like the computation of codon volatility, involves taking into account the proportion of point mutations that will give rise to an amino acid change. In Nei and Gojobori's (1986) method, for each of two sequences compared, one counts S (the number of synonymous sites) and N (the number of nonsynonymous sites). Next, one computes pS (the proportion of synonymous differences per synonymous site) and pN (the proportion of nonsynonymous differences per nonsynonymous site), and these quantities are corrected for multiple hits (Nei and Gojobori 1986). It is obvious that codon volatility bears a close relationship to N, the number of nonsynonymous sites; in fact, codon volatility should be essentially equivalent to N/(N+S). However, if positive selection is expected to increase dN, it is not intuitive at all that an increase in codon volatility, which in essence represents the denominator of dN, should be a hallmark of positive selection.
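The relationship between codon volatility and N/(N+S) can be illustrated with a minimal Python sketch. It is not the software of Plotkin, Dushoff, and Fraser (2004) or of Nei and Gojobori (1986); the function names are hypothetical, volatility is taken here as the fraction of a codon's single-nucleotide neighbors that encode a different amino acid, and, as a simplifying assumption, mutations to stop codons are excluded from both calculations. The multiple-hit correction is assumed to be the Jukes-Cantor formula.

```python
from itertools import product
from math import log

# Universal genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def neighbors(codon):
    """Yield (position, mutant codon) for the nine single-nucleotide changes."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield pos, codon[:pos] + base + codon[pos + 1:]


def volatility(codon):
    """Fraction of non-stop neighbors encoding a different amino acid (assumed definition)."""
    muts = [CODE[m] for _, m in neighbors(codon) if CODE[m] != "*"]
    return sum(aa != CODE[codon] for aa in muts) / len(muts)


def site_counts(codon):
    """Return (S, N) for one sense codon: S sums, over the three positions,
    the fraction of non-stop changes that are synonymous; N = 3 - S."""
    s = 0.0
    for pos in range(3):
        muts = [CODE[m] for p, m in neighbors(codon) if p == pos and CODE[m] != "*"]
        if muts:
            s += sum(aa == CODE[codon] for aa in muts) / len(muts)
    return s, 3.0 - s


def jukes_cantor(p):
    """Correct a proportion of differences (pS or pN) for multiple hits."""
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for codon in ("ATG", "TGG", "TTA", "GGG"):
        s, n = site_counts(codon)
        print(codon, round(volatility(codon), 3), round(n / (n + s), 3))
```

Under these assumptions the two quantities are close but not always identical: for GGG both equal 2/3, whereas for TTA volatility is 5/7 and N/(N+S) is 7/9.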
Plotkin, Dushoff, and Fraser (2004) reported weak but significant negative correlations between the P value of the test for significance of codon volatility and dN in comparisons between different genomes of Mycobacterium tuberculosis and between M. tuberculosis and other Mycobacterium species. In addition, they noted that surface antigens such as the EMP1 family of Plasmodium falciparum and PE/PPE of M. tuberculosis showed significantly high volatility (Plotkin, Dushoff, and Fraser 2004). However, because this method was applied to only a few cases and because of the problematic relationship between codon volatility and N/(N+S), it is unclear whether codon volatility is a good indicator of positive selection in general. Therefore, we decided to test this method further by comparing codon volatility with estimates of dS and dN between orthologous loci from three pairs of closely related eukaryotic genomes: (1) the protists Plasmodium falciparum and P. yoelii, (2) the fungi Saccharomyces cerevisiae and S. paradoxus, and (3) the mammals mouse (Mus musculus) and rat (Rattus norvegicus).
Methods |
---|
Homologous sequences were aligned at the amino acid level using the ClustalW program (Thompson, Higgins, and Gibson 1994), and this alignment was imposed on the DNA sequences. The number of synonymous nucleotide substitutions per synonymous site (dS) and the number of nonsynonymous nucleotide substitutions per nonsynonymous site (dN) were estimated by a maximum-likelihood method (Yang and Nielsen 2000) using the PAML software package (Yang 1997). The number of synonymous sites (S) and the number of nonsynonymous sites (N) were counted by the Nei-Gojobori (1986) method. Codon volatility and the test for significance of codon volatility were computed according to the method in Plotkin, Dushoff, and Fraser (2004), using the software provided by those authors.
Because the distribution of most of the variables analyzed deviated significantly from the normal distribution (Kolmogorov-Smirnov test), we used nonparametric methods for all statistical testing. G + C content at the three codon positions and codon volatility were highly correlated between the two species compared for all three sets of comparisons (Spearman's rank correlation coefficient rS > 0.90; P < 0.001 in every case). Therefore, in comparisons among genes, we used the means of these quantities for the two genomes compared.
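As an illustration of the quantities averaged here, the following Python sketch computes G + C content at each of the three codon positions (GC1, GC2, GC3) for an in-frame coding sequence and then takes the mean for a pair of orthologs. The function names and the toy sequences are hypothetical and are not taken from the analysis itself.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3) for an in-frame coding sequence.

    Trailing bases that do not complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    result = []
    for pos in range(3):
        bases = cds[pos:usable:3]  # every third base, starting at this codon position
        result.append(sum(b in "GC" for b in bases) / len(bases))
    return tuple(result)


def pair_mean(values_a, values_b):
    """Element-wise mean of the values obtained for two orthologous genes."""
    return tuple((a + b) / 2 for a, b in zip(values_a, values_b))


if __name__ == "__main__":
    gene_a = "ATGGCC"  # toy sequences, for illustration only
    gene_b = "ATGAAA"
    print(pair_mean(gc_by_codon_position(gene_a), gc_by_codon_position(gene_b)))
```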
Results |
---|
The variables examined in table 3 were intercorrelated in complex ways (data not shown). We, therefore, used partial correlation analysis (applied to rank correlation coefficients) to reveal the effect of each of a set of independent variables on a dependent variable, while simultaneously controlling for the effect of the remaining independent variables. We examined two dependent variables in these analyses: (1) codon volatility and (2) the minimum significance level of the test (Plotkin, Dushoff, and Fraser 2004) of codon volatility. We computed fifth-order partial rank correlation coefficients between each of a set of six independent variables and codon volatility, simultaneously controlling for the other five independent variables (table 4). Likewise, we computed fifth-order partial rank correlation coefficients between the same variables and the minimum significance level; that is, the lower of the two P values obtained when the test of significance of codon volatility was applied to each of the two genomes (table 4).
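One way to compute a higher-order partial rank correlation is sketched below in Python. This is an assumption about the general technique, not a record of the software used for table 4: values are converted to average ranks, Spearman's coefficient is obtained as the Pearson correlation of the ranks, and control variables are removed one at a time with the standard recursion for partial correlations. All function names are hypothetical.

```python
from math import sqrt


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation coefficient."""
    return pearson(ranks(x), ranks(y))


def partial_spearman(x, y, controls):
    """Partial rank correlation of x and y, controlling for each list in `controls`.

    Five control variables give a fifth-order coefficient.
    """
    if not controls:
        return spearman(x, y)
    z, rest = controls[0], controls[1:]
    rxy = partial_spearman(x, y, rest)
    rxz = partial_spearman(x, z, rest)
    ryz = partial_spearman(y, z, rest)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```

The recursion evaluates 3^k zero-order coefficients for k control variables, which is small for k = 5; inverting the rank correlation matrix would be an equivalent alternative for larger k.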
Thus, the results indicated that nucleotide content was in general a more powerful predictor of codon volatility and of the significance level of the test of codon volatility than were dN or dN/dS. One surprising finding, however, was that in Plasmodium, unlike the other comparisons, the correlations with GC2 and GC3 were opposite in direction (table 4). One possible factor in this difference was that Plasmodium showed a unique pattern of correlation between G + C content values at the three codon positions. In the other species compared, GC1, GC2, and GC3 were all significantly positively correlated with each other; and the same was true when data from all three species comparisons were pooled (table 5). However, in Plasmodium, GC3 was not significantly correlated with GC1 or GC2 (table 5).
Discussion |
---|
Rather, codon volatility seemed to reflect largely nucleotide composition, particularly at the second codon position. Why this should be so is easily seen from a consideration of the genetic code. In the universal code, the two most volatile codons are ATG (Met) and TGG (Trp). The next most volatile codons are the group of two-fold degenerate codons, at which two of the possible third-position mutations are nonsynonymous. An examination of the 18 two-fold degenerate codons shows that 14 have A in the second position, two have T in the second position, and the remaining two (encoding Cys) have G in the second position. All sense codons with A at the second position are two-fold degenerate. Thus, rather than being a measure of positive selection, the codon volatility of a gene is a measure largely of the percentage of A at second positions in the gene.
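The codon counts in this argument can be checked directly against the universal genetic code. The short Python sketch below is illustrative only; it assumes that "two-fold degenerate codon" means a sense codon whose amino acid is encoded by exactly two codons, which is the definition that yields 18 codons.

```python
from collections import Counter
from itertools import product

# Universal genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of codons encoding each amino acid (stop codons excluded).
family_size = Counter(aa for aa in CODE.values() if aa != "*")

# Sense codons whose amino acid is encoded by exactly two codons.
two_fold = sorted(c for c, aa in CODE.items() if aa != "*" and family_size[aa] == 2)

# Distribution of the second-position nucleotide among those codons.
second_pos = Counter(c[1] for c in two_fold)

# All sense codons with A at the second position.
a_second_sense = [c for c, aa in CODE.items() if c[1] == "A" and aa != "*"]

if __name__ == "__main__":
    print(len(two_fold), dict(second_pos))
    print(set(a_second_sense) <= set(two_fold))
```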
Most two-fold degenerate codons with A in the second position encode highly hydrophilic amino acids, including the charged residues His, Lys, Asp, and Glu. Such amino acids are often found on the surface of globular proteins, where they are often subject to a relatively low level of functional constraint (Kimura and Ohta 1973). Thus, in globular proteins, high codon volatility may be associated with elevated dN, not as a result of positive selection but simply as a consequence of the relaxation of purifying selection. On the other hand, charged residues are often involved in recognition molecules, a category that includes many of the best-documented cases of positive selection at the molecular level (Hughes 1999).
In the case of Plasmodium, a number of surface antigens are characterized by amino acid repeat arrays with a high frequency of hydrophilic residues (Verra and Hughes 1999; Hughes 2004). This may be an adaptation to attract an ineffective T cell-independent immune response on the part of the vertebrate host (Kemp, Coppel, and Anders 1987). In some cases, nonrepeat regions of the same proteins have been shown to be subject to positive selection, probably driven by host T cell recognition (Hughes 1991, 1992; Hughes and Hughes 1995). Similarly, high frequencies of hydrophilic residues are found in repeat arrays of the PE and PPE families in Mycobacterium tuberculosis, particularly in regions shown to be recognized by host immunoglobulins (Okkels et al. 2003). The fact that Plotkin, Dushoff, and Fraser (2004) reported high codon volatility at antigen loci of Plasmodium and Mycobacterium, thus, apparently reflects hydrophilic amino acid composition, which in these cases is fortuitously associated with positive selection.
Acknowledgements |
---|
Footnotes |
---|
References |
---|
Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
Goldman, N., and Z. Yang. 1994. Codon-based models of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736.
Hughes, A. L. 1991. Circumsporozoite protein genes of malaria parasites (Plasmodium spp.): evidence for positive selection on immunogenic regions. Genetics 127:345-353.
Hughes, A. L. 1992. Positive selection and interallelic recombination at the merozoite surface antigen-1 (MSA-1) locus of Plasmodium falciparum. Mol. Biol. Evol. 9:381-393.
Hughes, A. L. 1999. Adaptive evolution of genes and genomes. Oxford University Press, New York.
Hughes, A. L. 2004. The evolution of amino acid repeat arrays in Plasmodium and other organisms. J. Mol. Evol. 59:528-535.
Hughes, A. L., and R. Friedman. 2003. Genome-wide survey for genes horizontally transferred from cellular organisms to baculoviruses. Mol. Biol. Evol. 20:979-987.
Hughes, A. L., and M. Nei. 1988. Pattern of nucleotide substitution at MHC class I loci reveals overdominant selection. Nature 335:167-170.
Hughes, M. K., and A. L. Hughes. 1995. Natural selection on Plasmodium surface proteins. Mol. Biochem. Parasitol. 71:99-113.
Kemp, D. J., R. L. Coppel, and R. F. Anders. 1987. Repetitive genes and proteins of malaria. Ann. Rev. Microbiol. 41:181-208.
Kimura, M. 1977. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267:275-276.
Kimura, M., and T. Ohta. 1973. Mutation and evolution at the molecular level. Genetics 73(Suppl.):19-35.
Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418-426.
Okkels, L. M., I. Brock, F. Follmann, E. A. Agger, S. M. Arend, T. H. M. Ottenhoff, F. Oftung, I. Rosenkrands, and P. Andersen. 2003. PPE protein (Rv3873) from DNA segment of RD1 of Mycobacterium tuberculosis: strong recognition of both specific T-cell epitopes and epitopes conserved within the PPE family. Infect. Immun. 71:6116-6123.
Plotkin, J. B., J. Dushoff, and H. B. Fraser. 2004. Detecting selection using a single genome sequence of M. tuberculosis and P. falciparum. Nature 428:942-945.
Thompson, J. D., D. G. Higgins, and T. Gibson. 1994. CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Verra, F., and A. L. Hughes. 1999. Biased amino acid composition in repeat regions of Plasmodium antigens. Mol. Biol. Evol. 16:627-633.
Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555-556.
Yang, Z., and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17:32-43.