Bioinformatics Research Center (BiRC), Department of Genetics and Ecology, The Institute of Biological Sciences, University of Aarhus,
Aarhus, Denmark
Correspondence: E-mail: forsberg{at}stats.ox.ac.uk.
Abstract |
---|
Key Words: host-specific adaptation, influenza, evolution, codon model
Introduction |
---|
When a host radiation event occurs, it is therefore expected to be accompanied by adaptive evolutionary changes that establish a new phenotype and allow the parasite to complete its life cycle in the new host. Thus, identification of the positions in the parasite genome underlying the phenotypic difference between host-specific strains may provide information about the molecular basis for species-specific adaptation.
Previous studies aimed at this adaptation have focused on identifying fixed species-specific changes in the genomes of parasites from different host species (Hughes et al. 2000; Chang, Sgro, and Parrish 1992). These are natural candidates readily identified from alignment data. However, adaptive alteration of protein functionality may not occur only through fixation of different amino acids in different hosts. It may present itself as a more subtle change in the suite of amino acids that are allowed to occur, resulting not in fixation, but in a different substitution process in the two species. Additionally, host-specific adaptation can occur continuously as a result of host-specific immune selection that continuously selects for protein variants with new antigenic configurations. Fixation of genetic variants is therefore too rigid a criterion for the identification of positions involved in species-specific adaptation. Instead, a comparative methodology which focuses on changes in the evolutionary process in the different host environments may provide a fruitful approach.
To this end, we elaborate a simple codon-based model of nucleotide substitution which describes changes in the selective regime at the protein level that may affect the genes of a parasite confronted with a new host environment. The purpose of the model is twofold: First, we aim to gain a better understanding of the evolutionary processes in parasites that undergo species change. Second, we wish to use the model to identify positions in parasite genomes that may be involved in host-specific adaptations. These may then act as candidates for further experimental work on the molecular biology of species-specific adaptation.
The fundamental idea of the model is to infer how selection acts on variants of parasite genes in different hosts. To do so, we use a site-specific comparison of the substitution process of synonymous (silent) and nonsynonymous (amino acid altering) nucleotide substitutions in the parasite populations of different hosts. This type of comparison has previously been shown to be a useful analytical tool in the study of molecular evolution (Kimura 1983; Yang et al. 2000). The model is based on evolutionary comparisons on a genealogy and thus exploits the evolutionary information contained in the data while incorporating the correlation by common ancestry. The method is illustrated with an application to data from the influenza A virus.
Theory and Model |
---|
Upon introduction to the new host, changes in the position-specific selective regime that acts on parasite protein variants may occur through several mechanisms: In the adaptation to the new host, parts of the protein may acquire new functions, or different chemical configurations, in order to interact properly with the new and biochemically different host-cell environment. Hence, the functional constraints on the amino acid positions involved in such adaptation may differ in the two hosts, resulting in a change in the site-specific substitution process of amino acids. Alternatively, positions that are important for protein function in the original host may lose their function in the new host and therefore experience relaxed functional constraints after host radiation. Finally, the new host environment may alter the pattern of immune-mediated selection. The location of antigenic regions may change with ensuing changes in the substitution pattern along the sequence. Also, the epidemiological dynamics may change in the new host, thereby altering the strength of immune-mediated selection. Suppose the dynamics of the parasite in the original host is that of a childhood disease where the parasite most commonly infects immunologically naive individuals. Immune-mediated selection may therefore be limited or absent. Properties of the new host may, however, dictate a different epidemiological dynamic, where the parasite is dependent on continuous reinfection of previously infected individuals and therefore subjected to a strong immune-mediated selection that favors changes in its antigenic properties.
These host-mediated changes in the selective regime will potentially result in an observable host-mediated difference in the position-specific ratio of the two kinds of nucleotide substitutions, and hence will allow for the identification of codon positions involved in host-specific adaptation.
Data
We consider a sample of coding sequences from homologous genes taken from a virus that infects two different species (fig. 1). At some point in its evolutionary history the parasite acquired a new host species. Assuming no recombination, the relationship between the sequences can be described by a tree in which one clade represents sequences evolved in the new host N and the rest of the tree represents parasites evolved in the original host O. In the tree shown in figure 1 and in the following formulae the root of the tree represents the point of host radiation, but this could equally well be represented by an internal node in the tree. The differentiation between the original host and the new host is used throughout this article for the sake of clarity. It is not necessary, however, to know which species is the original host as long as the point of divergence between the two groups of sequences can be identified in the tree.
The nucleotide-substitution process is described by a codon-based model of substitution, which is a modification of the model proposed by Goldman and Yang (1994). This is given by a 61 x 61 matrix (stop codons not allowed) of the relative instantaneous substitution rates, where the rate of transition from codon i to codon j at site h is
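The rate equation itself did not survive in this copy of the text. As a rough sketch only, the following Python builds a matrix of the kind described, assuming the widely used simplification of the Goldman and Yang (1994) model: a single-nucleotide change from codon i to codon j has rate π_j, multiplied by κ if it is a transition and by ω if it is nonsynonymous, and changes involving more than one nucleotide have rate 0. The names `rate`, `rate_matrix`, `kappa`, `omega`, and `pi` are illustrative assumptions, as is the use of the standard genetic code; the authors' exact modification may differ.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons (an assumption here).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]  # the 61 sense codons
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def rate(ci, cj, kappa, omega, pi_j):
    """Relative instantaneous rate from codon ci to codon cj (ci != cj)."""
    diffs = [(a, b) for a, b in zip(ci, cj) if a != b]
    if len(diffs) != 1:
        return 0.0  # identical codons or multi-nucleotide changes
    r = pi_j
    if frozenset(diffs[0]) in TRANSITIONS:
        r *= kappa  # transition/transversion bias
    if CODE[ci] != CODE[cj]:
        r *= omega  # nonsynonymous change
    return r


def rate_matrix(kappa, omega, pi=None):
    """61 x 61 rate matrix; rows sum to zero. Uniform pi if none is given."""
    n = len(SENSE)
    if pi is None:
        pi = [1.0 / n] * n
    q = [[rate(ci, cj, kappa, omega, pi[j]) for j, cj in enumerate(SENSE)]
         for ci in SENSE]
    for i in range(n):
        q[i][i] = -sum(q[i])  # diagonal entry is 0.0 before this assignment
    return q
```

In a site-specific setting, one such matrix would be built for each value of ω considered at site h.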
Allowing for Heterogeneous Selection Pressure
The selective regime may differ at different positions in the gene, and potentially in different hosts. To reflect this difference, we allow a statistical distribution p(ω) of ω ratios (the ratio of the rate of nonsynonymous to synonymous substitutions) among sites (Yang et al. 2000). The probability of observing the data in a site is then obtained by integration,
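The integral referred to here is missing from this copy of the text. Writing x_h for the data at site h (a symbol introduced here purely for illustration), it would presumably take the usual form for site-heterogeneous codon models:

```latex
P(x_h) = \int_{0}^{\infty} p(\omega)\, P(x_h \mid \omega)\, d\omega
```

Here P(x_h | ω) is the likelihood of the site on the tree for a fixed ω, which can be computed, for example, with the pruning algorithm of Felsenstein (1981).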
Constant Selection Pressure After Host Radiation
Consider a codon in the alignment. We assume that the selective regime is constant in the clade that represents viruses from the original host. The simplest event that may occur at the time of host radiation is the event c, that the selective regime in this codon position (and therefore ω) remains the same. Conditional on event c, the total probability of the data in site h is given by
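The corresponding equation is missing from this copy. Under the assumption that oh and nh denote the data at site h from the original-host and new-host parts of the tree, written o_h and n_h below, a plausible form is a single integration in which the same ω applies to the whole tree:

```latex
P(o_h, n_h \mid c) = \int_{0}^{\infty} p(\omega)\, P(o_h, n_h \mid \omega)\, d\omega
```

This is a reconstruction from the surrounding description, not the authors' typeset formula.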
Describing Host-Mediated Changes in the Selective Regime
To allow for the selective regime to change after the introduction to the new host, we include in our model the event d, which specifies a divergence in the selective regime affecting a site after host radiation. This means that if d occurs, then the ratio of the rate of nonsynonymous to synonymous substitutions (ω) on the branches of the tree that represent evolution in the new host is permitted to differ from that in the original host. Lacking any prior assumptions concerning the direction of the change, we simply say that if d occurs, then the choices of the ω parameters in the two hosts become independent of each other. Conditional on d, the probability of observing oh and nh is therefore obtained by independently integrating over the ω distribution in the two parts of the tree and multiplying
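As an illustration of how the events c and d might be combined, the sketch below computes the site likelihood under c, under d, and under their mixture with weight pd, for a discrete ω distribution. It assumes, as in figure 1, that the root marks the host radiation, and that the conditional likelihoods of the two parts of the tree, given the root codon and an ω class, have already been computed (for instance by Felsenstein's pruning algorithm). All function and argument names are hypothetical, and the article's own equations are not reproduced here because they did not survive extraction.

```python
def lik_constant(probs, pi, lik_o, lik_n):
    """P(o_h, n_h | c): one omega class shared by both parts of the tree.

    probs[k]    : proportion of omega class k
    pi[r]       : equilibrium frequency of root codon r
    lik_o[k][r] : P(o_h | root codon r, omega class k), original host
    lik_n[k][r] : P(n_h | root codon r, omega class k), new host
    """
    return sum(
        p_k * sum(f * a * b for f, a, b in zip(pi, lik_o[k], lik_n[k]))
        for k, p_k in enumerate(probs)
    )


def lik_divergent(probs, pi, lik_o, lik_n):
    """P(o_h, n_h | d): omega chosen independently in the two hosts."""
    states = range(len(pi))
    # Average over omega separately in each part, conditional on the root codon.
    mix_o = [sum(p * lik_o[k][r] for k, p in enumerate(probs)) for r in states]
    mix_n = [sum(p * lik_n[k][r] for k, p in enumerate(probs)) for r in states]
    return sum(f * a * b for f, a, b in zip(pi, mix_o, mix_n))


def lik_site(p_d, probs, pi, lik_o, lik_n):
    """Mixture of the two events, with p_d the probability of divergence."""
    return ((1.0 - p_d) * lik_constant(probs, pi, lik_o, lik_n)
            + p_d * lik_divergent(probs, pi, lik_o, lik_n))
```

With a single ω class the two conditional likelihoods coincide, which gives a quick sanity check on any implementation along these lines.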
Comparing Models by Likelihood Ratio Tests
The two hypotheses are nested in that H1 ⊃ H0. Comparison of the models, however, corresponds to fixing the parameter pd at the boundary of its parameter space. In this situation, evaluation of the test statistic using the χ² distribution will provide a conservative test of significance. To avoid this effect, the simple χ² distribution can be replaced with a mixed distribution when comparing test statistics that are close to the critical value (Ota et al. 2000).
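A minimal sketch of such a test is given below. It assumes that the two models differ by the single parameter pd, and that the mixed distribution is the 50:50 mixture of a point mass at zero and a χ² distribution with one degree of freedom, which is the usual choice when one parameter lies on the boundary; the text does not state the mixture explicitly, so this is an assumption. The function names and the log-likelihood arguments are hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) if x > 0 else 1.0


def lrt_p_values(lnl_null, lnl_alt):
    """Return the test statistic with its plain and mixed-distribution p-values."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    p_chi2 = chi2_sf_1df(stat)
    # Half of the null probability mass sits at zero, so the tail is halved.
    p_mixed = 0.5 * p_chi2 if stat > 0 else 1.0
    return stat, p_chi2, p_mixed
```

The mixed p-value is half the plain χ² p-value for any positive statistic, which is why the unadjusted test is conservative.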
Identification of Candidate Sites
To identify sites where the selective regime has changed after host radiation, we use an empirical Bayes approach (Nielsen and Yang 1998; Yang et al. 2000). Following maximum likelihood estimation of the parameters in the model (including pd), we can calculate the probability of event d in a given site h as
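The posterior formula is missing from this copy, but an empirical Bayes calculation of this kind amounts to applying Bayes' rule to the two events with the estimated pd as prior weight. The sketch below shows that step under this assumption; the function name and arguments are hypothetical, and the two conditional site likelihoods are taken as given.

```python
def posterior_divergence(p_d, lik_c, lik_d):
    """Posterior probability of event d at one site.

    p_d   : estimated prior probability of divergence
    lik_c : likelihood of the site data conditional on event c
    lik_d : likelihood of the site data conditional on event d
    """
    numerator = p_d * lik_d
    denominator = (1.0 - p_d) * lik_c + numerator
    return numerator / denominator
```

Sites whose posterior probability exceeds a chosen threshold would then be reported as candidates.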
Implementation and Results |
---|
To allow for heterogeneous selection regimes, we first implemented a discrete version of a continuous Gamma distribution following Yang et al. (2000). However, as have others, we encountered several problems with this approach, and it was therefore abandoned in favor of a discrete distribution (Yang et al. 2000). Say we have K classes of sites in the alignment with the ω ratios and proportions given as
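The display that followed is missing from this copy. In the general discrete model of Yang et al. (2000), such a distribution is typically written as below; the symbols ω_k, p_k, and x_h (the data at site h) are notation assumed here for illustration:

```latex
\omega \in \{\omega_1, \omega_2, \ldots, \omega_K\}, \qquad
P(\omega = \omega_k) = p_k, \qquad
\sum_{k=1}^{K} p_k = 1
```

Under this assumption, integration over the ω distribution reduces to a finite sum, for example P(x_h) = \sum_k p_k P(x_h | ω_k).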
To demonstrate the utility of our model, we analyzed a data set of nucleoprotein (NP) sequences from the influenza A virus taken from avian and human hosts. The NP gene encodes a protein with 498 amino acids whose primary function is the binding of viral RNA segments to form the nucleocapsid of the virus particle (Lamb and Krug 1996). The NP gene has been assigned a putative role as a determinant of host range (Murphy and Webster 1996), but the identification of residues important for species-specific adaptation has been hindered by the lack of a resolved three-dimensional protein structure (Lamb and Krug 1996). Molecular biological analysis, however, has revealed regions of the NP that are of functional importance and several epitopes that are supposedly recognized by cytotoxic T lymphocytes (CTL) of the cellular immune response (Neumann, Castrucci, and Kawaoka 1997; DiBrino et al. 1993; Davey, Dimmock, and Colman 1985; Voeten et al. 2000; Boon et al. 2002). The present-day form of the human influenza A nucleoprotein is believed to have entered the human population as an avian virus was transferred to humans immediately before the onset of the Spanish Influenza pandemic of 1918-1920 (Gorman et al. 1991). Since then, the nucleoprotein gene has evolved independently in human and avian hosts. In contrast, other genetic elements were exchanged between avian and human influenza viruses prior to each of the two pandemics in the last half of the twentieth century (Murphy and Webster 1996).
Sequences were downloaded from the Influenza Sequence Database (Macken et al. 2001) and include all available avian NP sequences and an equal number of human NP sequences that were chosen to cover the largest possible time span. Figure 2 shows the genealogy of the chosen isolates. This was inferred under the maximum likelihood (ML) criterion by the fastDNAmL algorithm (Olsen et al. 1994) provided at http://bioweb.pasteur.fr/seqanal/interfaces/fastdnaml-simple.htm. A very similar genealogy was inferred with the Neighbor-Joining (NJ) algorithm as implemented in the PAUP software (Swofford 2002). Furthermore, a Bayesian estimate of the posterior genealogy distribution was obtained using the MrBayes program (Huelsenbeck and Ronquist 2001) with a general time reversible (GTR) model of substitution and a gamma distribution on rate heterogeneity. The estimation involved 3 million Markov Chain Monte Carlo samples and showed that, apart from the ordering of a few short external branches, there was high confidence in the structure of the ML tree (results not shown). The position of the root was determined by the inclusion of an equine influenza virus (Gorman et al. 1991), and because this isolates the human influenza lineages as a monophyletic group, the root was used as the time of the transmission event. Branch lengths were estimated using the CODEML program of the PAML package under a model that allows three discrete classes of rates (Yang 2000).
Discussion |
---|
This approach focuses on identifying sites where the selective regime has changed, either as a result of host-specific alterations in immune-mediated selection or as a result of altered functional constraints after an adaptive move in sequence space. In the latter case, we are not inferring the adaptive evolutionary change itself, but rather its signature in the ensuing evolutionary process. We would, however, expect these adaptive changes to create an excess of nonsynonymous substitutions in the time immediately following the host radiation event. In an attempt to include this possibility, we also implemented a model that allows for a burst of nonsynonymous substitutions on the uppermost branch of the subtree relating the sequences from the new (human) host. This was done by multiplying all ω values in the distribution by a positive number (fb) when calculating the substitution probabilities on this branch, thus increasing their value. The results of this series of calculations were, however, not easily interpretable, as the model seemed to favor lower ω values on this branch (fb < 1), contrary to our expectation.
The inference of parameters like fb that have their effect deep within an evolutionary tree is an intrinsically difficult problem because the uncertainty of the states at the inner nodes of a tree increases as one progresses upward from the leaves, making it difficult to determine whether the above-mentioned result is an actual phenomenon or an artifact of the methodology. A potential explanation is that the chosen node of divergence is a bad estimate of the true time of divergence. A branch in a tree is an arbitrary evolutionary unit affected by sampling effects. Hence, the inclusion of further avian lineages might break up the branch leading to the clade of human influenzas, signifying that part of the evolution on this branch occurred in the original avian hosts rather than in the new human host. In this case, a generally higher rate of nonsynonymous substitution on the human part of the branch caused by, e.g., immune-mediated selection would give a poor description of the evolutionary regime predominant in the true avian host. A way to counter this effect would be to lower the fb parameter and thus the number of nonsynonymous substitutions.
There are two interesting implications of this that we did not pursue. It may be possible to infer the maximum likelihood position of the species transmission event along this branch, which in conjunction with a molecular clock model could provide an estimated date of species transmission. Furthermore, it may be possible to infer which host species is most likely to be new to the parasite if the history of the parasite is unknown.
The motivation for developing the model presented here was to study the evolution of parasites undergoing host radiation, but the model addresses the more general problem of describing the substitution process in two groups of related organisms and determining whether and where the selective regime may differ. In our formulation, the evolutionary event that is believed to have changed the way natural selection acts is the introduction to a new host environment. A closely related problem, and one that has received considerable attention, is that of functional divergence in paralogous genes, where the decisive evolutionary event is a duplication of the genetic element.
It is therefore not surprising that two recent models of functional divergence in paralogous genes are conceptually similar to our model in using models of amino acid substitution that allow the position-specific rate of amino acid substitution to change after gene duplication (Gu 2001; Knudsen and Miyamoto 2001). However, these models infer changes in the selective regime through a change in the absolute rate of substitution, whereas the codon-based approach presented here uses a comparison between the two different types of nucleotide substitution, thus enabling use of the full information in the nucleotide sequence and inclusion of the known biological phenomenon of transition-transversion bias. A potential advantage of the amino acid-based models is that they use empirically based transition matrices, which may include the effect that the physicochemical properties of amino acids have on the substitution process.
To simplify our model, we have not included any such empirical priors on substitution rates, but such factors could easily be included in the codon-based substitution model (see Goldman and Yang 1994). In relation to this aspect of the model, the ω parameter used is a rather crude indicator of the selective regime, and it would be interesting to consider a model which allowed additional indicators such as, e.g., shifts between biochemically different groups of amino acids.
There are additional ways in which the present approach could be improved. An apparent and readily made generalization is to extend the model to allow for several independent host change events, i.e., a tree in which several clades of the new host are embedded in a larger tree of isolates from the original host. In the construction of the model, we have made the implicit assumption that the equilibrium codon frequencies do not change after host change. The validity of this assumption is not clear, but changes in the host cell environment could potentially change the direction of any codon bias. A simple way to relax this assumption could be to estimate the equilibrium codon frequencies from the two groups separately and then apply separate transition matrices to the different parts of the tree.
Another assumption is that the distribution of ω is the same in both species, i.e., that the change in ω is symmetric so that an upward change in one site is balanced by a downward change in another site. This assumption is justified by the recognition that there is limited information in the data to estimate two independent distributions and that any bias it induces on the identification procedure will be toward more conservative estimates. A more grave assumption is that of independence between codon positions. This is a commonly made assumption in molecular evolutionary analysis of proteins, but it is inherently wrong because of the known three-dimensional nature of biological molecules. Some efforts have been made to address this issue (Pedersen and Jensen 2001; Pollock, Taylor, and Goldman 1999), but both approaches are computationally very demanding and were therefore not pursued here.
Rates of mutation may change after a viral host change because of alterations in, e.g., the rate of replication. As stated previously, we have assumed that the rates of synonymous and nonsynonymous substitution scale identically with such changes in the mutation rate. The validity of this assumption depends on the unknown population genetic mechanisms underlying the process of fixation. For sites where all amino acid substitutions are either neutral or strongly deleterious, we expect this scaling to hold. This may not be the case, however, for sites undergoing positive selection if the process of fixation is limited by factors other than the availability of mutations. Should this assumption be violated, a change in the mutation rate could cause a change in the ratio of the rates of synonymous and nonsynonymous substitution without any change in the underlying selective regime acting at the protein level, and could therefore bias analyses like the one presented here toward the identification of false positives. In our future work, we intend to explore and potentially eliminate this bias through the construction of a model which allows for simultaneous changes in both the selective regime acting at the protein level and the underlying rate of mutation as represented by the rate of synonymous substitution.
All this having been said, we still believe that the results from the analysis of the NP gene data show promise for our comparative approach and that this type of analysis, correlated with structural, immunological, and other information, may provide a useful way to identify candidate sites for the molecular biological investigation of species-specific adaptation in parasites.
Acknowledgements |
---|
Footnotes |
---|
Keith Gandall, Associate Editor
Literature Cited |
---|
Boon, A. C., G. de Mutsert, Y. M. Graus, R. A. Fouchier, K. Sintnicolaas, A. D. Osterhaus, and G. F. Rimmelzwaan. 2002. Sequence variation in a newly identified HLA-B35-restricted epitope in the influenza A virus nucleoprotein associated with escape from cytotoxic T lymphocytes. J. Virol. 76:2567-2572.
Chang, S. F., J. Y. Sgro, and C. R. Parrish. 1992. Multiple amino acids in the capsid structure of canine parvovirus coordinately determine the canine host range and specific antigenic and hemagglutination properties. J. Virol. 66:6858-6867.
Chua, K. B., W. J. Bellini, and P. A. Rota, et al. (22 co-authors). 2000. Nipah virus: a recently emergent deadly paramyxovirus. Science 288:1432-1435.
Crill, W. D., H. A. Wichman, and J. J. Bull. 2000. Evolutionary reversals during viral adaptation to alternating hosts. Genetics 154:27-37.
Davey, J., N. J. Dimmock, and A. Colman. 1985. Identification of the sequence responsible for the nuclear accumulation of the influenza virus nucleoprotein in Xenopus oocytes. Cell 40:667-675.
DiBrino, M., T. Tsuchida, R. V. Turner, K. C. Parker, J. E. Coligan, and W. E. Biddison. 1993. HLA-A1 and HLA-A3 T cell epitopes derived from influenza virus proteins predicted from peptide binding motifs. J. Immunol. 151:5930-5935.
Drummond, A., and K. Strimmer. 2001. PAL: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics 17:662-663.
Ebert, D. 1998. Experimental evolution of parasites. Science 282:1432-1435.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.
Gao, F., E. Bailes, and D. Robertson, et al. (12 co-authors). 1999. Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature 397:436-441.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736.
Gorman, O. T., W. J. Bean, Y. Kawaoka, I. Donatelli, Y. J. Guo, and R. G. Webster. 1991. Evolution of influenza A virus nucleoprotein genes: implications for the origins of H1N1 human and classical swine viruses. J. Virol. 65:3704-3714.
Gu, X. 2001. A site-specific measure for rate difference after gene duplication or speciation. Mol. Biol. Evol. 18:2327-2330.
Huelsenbeck, J. P., and F. Ronquist. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754-755.
Hughes, M. T., M. Matrosovich, M. E. Rodgers, M. McGregor, and Y. Kawaoka. 2000. Influenza A viruses lacking sialidase activity can undergo multiple cycles of replication in cell culture, eggs, or mice. J. Virol. 74:5206-5212.
Kimura, M. 1983. The Neutral Theory of Molecular Evolution, 1st edition. Cambridge University Press, Cambridge.
Knudsen, B., and M. M. Miyamoto. 2001. A likelihood ratio test for evolutionary rate shifts and functional divergence among proteins. Proc. Natl. Acad. Sci. USA 98:14512-14517.
Lamb, R. A., and R. M. Krug. 1996. Orthomyxoviridae: the viruses and their replication. Pp. 1353-1395 in B. N. Fields, ed. Fields Virology, 3rd Edition. Raven Press, New York.
Macken, C., H. Lu, J. Goodman, and L. Boykin. 2001. The value of a database in surveillance and vaccine selection. Pp. 103-106 in A. Osterhaus, N. Cox, and A. Hampson, eds. Options for the Control of Influenza IV. Elsevier Science, New York.
Murphy, B. R., and R. G. Webster. 1996. Orthomyxoviruses. Pp. 1397-1445 in B. N. Fields, ed. Fields Virology, 3rd Edition. Raven Publishers, New York.
Neumann, G., M. R. Castrucci, and Y. Kawaoka. 1997. Nuclear import and export of influenza virus nucleoprotein. J. Virol. 71:9690-9700.
Nielsen, H. S., M. B. Oleksiewicz, R. Forsberg, T. Stadejek, A. Botner, and T. Storgaard. 2001. Reversion of a live porcine reproductive and respiratory syndrome virus vaccine investigated by parallel mutations. J. Gen. Virol. 82:1263-1272.
Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929-936.
Olsen, G. J., H. Matsuda, R. Hagstrom, and R. Overbeek. 1994. fastDNAmL: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. 10:41-48.
Ota, R., P. J. Waddell, M. Hasegawa, H. Shimodaira, and H. Kishino. 2000. Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol. Biol. Evol. 17:798-803.
Pedersen, A. M., and J. L. Jensen. 2001. A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames. Mol. Biol. Evol. 18:763-776.
Pollock, D. D., W. R. Taylor, and N. Goldman. 1999. Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. 287:187-198.
Swofford, D. L. 2002. PAUP*, phylogenetic analysis using parsimony (*and other methods). version 4. Sinauer Associates, Sunderland, Mass.
Turner, P. E., and S. F. Elena. 2000. Cost of host radiation in an RNA virus. Genetics 156:1465-1470.
Voeten, J. T., T. M. Bestebroer, N. J. Nieuwkoop, R. A. Fouchier, A. D. Osterhaus, and G. F. Rimmelzwaan. 2000. Antigenic drift in the influenza A virus (H3N2) nucleoprotein and escape from recognition by cytotoxic T lymphocytes. J. Virol. 74:6800-6807.
Yang, Z. 2000. Phylogenetic Analysis by Maximum Likelihood (PAML), 3rd edition. University College London.
Yang, Z., R. Nielsen, N. Goldman, and A. M. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.