Antiviral Research Center, University of California San Diego
Correspondence: E-mail: spond{at}ucsd.edu.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: lineage-specific selection model selection genetic algorithms branch site model codon substitution model
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
One disadvantage of these lineage-specific models of substitution is that particular lineages evolving under different models must be specified a priori. Given the large number of possible lineage-to-model assignments, there may exist models that fit the data much better than the a priori hypothesis, and that may even contradict the biological conclusions drawn from the results based on the a priori hypothesis. Hence, it is important to place a pre-specified model in context. Even when there is no underlying biological hypothesis, it is likely that the assumption of a constant selection pressure over the entire phylogenetic tree is unrealistic. However, the large number of possible lineage-specific models make it infeasible to test the fit of every model. In this study, we propose the use of a genetic algorithm approach to assign different classes of to lineages. Genetic algorithms have proven very effective at solving very complex and poorly understood optimization problems by rapid adaptive exploration of the parameter space (Whitley 2001).
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() | (1) |
![]() | (2) |
This model differs slightly from the GY94 (Goldman and Yang 1994) model used in Yang (1998), in that GY94 uses y in place of
in the rate matrix. In our experience, the MG94 x HKY85 yields better likelihood scores than GY94 with the same number of parameters, especially when base composition of the sequences is biased. For comparative purposes, we also included fits for most of the models using the GY94 rate matrix parameterization.
Model Selection Using a Genetic Algorithm
Given n sequences, an unrooted phylogeny with 2n 3 branches, and the number of desired lineage classes B, we search the state space of lineage-to-class assignments where bi
0 ... B 1. The ith entry of the vector represents the class assignment for the ith branch in the post-order traversal of the tree. In order not to include equivalent branch partitions more than once (e.g., (0,1,2,1,2) and (0,2,1,2,1) for a 5-branch tree), we require that the first occurrence of each rate assignment in the vector
are in increasing order. There are
S(2n 3, k) possible assignments to consider, and if a model with fewer than B classes has a better fitness, it will be included in the search space. Here
![]() |
We employ the CHCCross generational elitist selection, Heterogeneous recombination, and Cataclysmic mutation (Eshelman 1991)an aggressive population-based hill-climber. This algorithm always retains the most fit individual. It uses a radical recombination operator to rapidly explore the parameter space without collapsing it. When the diversity of the population drops below a certain threshold, the algorithm invokes a radical mutation step: a large proportion (15%35%) of randomly chosen positions in all individuals, except for the most fit individual, are mutated. The mutation step helps to prevent entrapment in local maxima.
The recombination operator of the CHC algorithm generates an offspring from two parent states by randomly choosing one of the two parental values in every position of the state vector which differs between the parents.
The parameter space for our optimization problem is composed of two components: discrete partitioning of tree branches into different classes and a vector of real valued parameters, such as branch lengths, b and R. We used the CHC algorithm to search through the discrete component of the parameter space and conventional numerical optimization techniques to find maximum likelihood estimates of all other model parameters, given a partitioning of branches into rate classes. The fitness of every model was measured by its small sample Akaike's Information Criterion score (AICc; Sugiura 1978; Akaike 1974). The AICc of a model with p parameters fitted to a sample of size s is defined as
![]() |
For any fixed p, and thus AICc and the standard Akaike Information Criterion coincide for large sample sizes. Clearly, AICc can only be used when s > p + 1. In our case we treat each alignment column as an independent sample, and the method requires at least 2n 1 + B alignment columns to be applicable. AICc is a second order correction to standard AIC, and its use can be advocated unless the number of samples is much larger (40x or more) than the number of parameters (Burnham and Anderson [2003], p. 66).
The mutation phase is triggered if the difference between the AICc of the best fitting model and the worst fitting model in a generation decreases below a threshold > 0; on average, each 5th position in all model vectors of the current generation (excluding the best fitting one) is randomly replaced with an integer in 0 ... B 1 (excluding the present state).
Our parallel implementation of the CHC algorithm was run on a 16-node Linux MPI cluster, with each generation consisting of 28 individuals. We used the convergence criterion that requires the AICc of the best fitting individual over all generations to remain constant for 30 generations. We also verified convergence by repeating the runs several times, with the initial population generated randomly every time.
Once we have selected a model using the GA, it is important to assess the confidence we have in the model, both in terms of how much better it fits compared to other candidate models, and in terms of whether we would select the same model again if other independent data sets were available. The goodness-of-fit between the best fitting model identified using the GA, the global (single-ratio) model, and the local (free-ratio) model, can be determined using a likelihood ratio test, as these models are nested. To determine the significance of the difference in fit between models specified a priori, and the best fitting model identified using a GA (which may not necessarily be nested), we employ a Shimodaira-Hasegawa test (Shimodaira and Hasegawa 1999), using the difference in AICc, rather than the difference in log-likelihoods, to compare models with different numbers of parameters.
Rather than just base inference on the single best fitting model, we also consider averaging over a set of models. We calculate the Akaike weights (Akaike 1983) for each model examined during a run of the GA. If we define then the Akaike weight for model mi, where M is the total number of models, is given by the following.
![]() | (3) |
Effect of Site-to-Site Rate Variation
Our GA algorithms can be easily adapted to utilize models with site-to-site rate variation while searching for the best branch class assignments. However, using such models will significantly increase run times of individual model fitting and the GA overall. We fitted models with two types of site-to-site rate variation using branch partitioning obtained with the homogeneous site rates model to investigate relative improvements of fit.
First, we consider the scenario where mutational loads vary from site to site. Formally, this scenario can be implemented by using the following rate matrix
![]() | (4) |
![]() |
Second, we test the hypothesis that selective pressure varies across alignment sites. Formally,
![]() | (5) |
Note that the MG94 x HKY85 model is nested in both µMG94 x HKY85 and MG94 x HKY85. These models are somewhat similar to the branch-site models (Yang and Nielsen 2002). However, whereas Nielsen and Yang were interested in gaining more power to detect individual sites under selection, we are primarily interested in whether incorporating site-to-site rate variation can significantly alter the improvements in fit obtained by using branch-specific models. Essentially, by asking the question of when selection occurred we cannot ignore the effect that where the selection might have occurred could have on our conclusions, because the two effects may be confounded. Additionally, our branch-specific
b serve as rate multipliers at every site in the alignment, where as the branch-site models allow for variable
b along pre-specified branches and among sites.
Sequence Data and an A Priori Model of Sequence Evolution
We investigate the primate lysozyme data originally analyzed by Messier and Stewart (1997), and subsequently by Yang (1998) and Yang and Nielsen (2002). Following Yang (1998), we consider a large data set consisting of 19 distinct sequences and a small 7-sequence data set, representing the four major groups (Colobines, Cercopithecines, Hominoids, and New World monkeys) in the phylogeny. As an example of an a priori biological model, we used model E of Yang (1998), which assumed that the branches leading to hominoids and colobines each had an independently estimated value of (
c and
h), whereas all other branches shared a third value
=
0, which was also estimated. Apart from the free-ratio model, model E was the most general of those considered in Yang (1998). We denote this model as Y98.
Implementation
Recent advances in computational power make the use of complex search algorithms eminently feasible. All sequence analyses and model fitting were performed using the HyPhy (Kosakovsky Pond, Frost, and Muse 2000) software package on an 16-node Linux MPI cluster; 14 slave nodes were used to fit various models, and a single master node dispatched the jobs and assembled the results. A single pass of the GA algorithm required from an hour for the small data set to a day for the large data set. Lysozyme alignments used in this article and all HyPhy batch files needed to perform the analyses and result processing can be obtained from http://www.hyphy.org/gabranch/.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
|
|
|
There were a large number of models in the 95% confidence set for both the small (211) and large data sets (333). The evidence ratio for the best fitting model and the second best fitting model was 1.07 for the small data set. The two top models differ in class allocation for a single branchthe ancestor of colobines and cercopithecines (labeled as N3 in figure 2). The best fitting model allocated this branch to a class with ßsb = 0.61, while the second best model assigned it to the class with ßsb = 4.39. The third best model had the evidence ratio of 1.92. We can also compute the largest AICc needed for a model to be included in the 95% confidence set. For the smaller data set, this value was 1832.76, and hence the Y98 model is included in the credible set.
For the large data set, there were 19 models with essentially equal support (evidence ratio 1). This is likely due to the relatively few (
4) sites per estimated model parameter and concomitant unreliable inference, as well as the the presence of short branches in the phylogeny. The threshold AICc value was found to be 2152.28, and thus the a priori model with AICc of 2167.54 is not included in the confidence set of models.
To assess the robustness of the conclusions drawn from the best fitting model, we calculated the support for an estimate of dN/dS = ßsb > 1 for each branch (fig. 2), averaging over models using their Akaike weights. The support for ßsb greater than 1 in the lineage leading to hominoids and colobines was 94.8% and 88.4%, respectively, for the small data set and 99.9% and 99.6%, respectively, for the large data set.
We next fitted site-to-site rate variation models using the branch assignments derived from running the genetic algorithm using a homogeneous rates model. Adding branch variation to a site-to-site model and site-to-site variation to a branch model yielded substantial AICc improvements for both data sets. Also, AICc scores for µMG94 x HKY85 and MG94 x HKY85 models indicate that variation in site-to-site mutational load fits lysozyme data better than variation in selection pressure over sites. Interestingly, the improvement in goodness-of-fit (in terms of the log likelihood) obtained by allowing both site-to-site variation in mutational load and branch variation was close to the improvement in fit when considering each effect separately. For the large data set, the difference in log-likelihood,
logL, between the one ratio model and the GA 3-ratio model with variation in mutational load was 32.15, which is close to the sum of
logL when considering branch variation and site-to-site variation separately (17.45 + 14.6 = 32.05). In addition, the estimates for ßsb were similar between models with and without site-to-site variation in mutational load. Thus, we can conclude that branch-specific variation in selection pressure and site-specific variation in mutational load play important (and apparently orthogonal) roles in shaping genetic variation in lysozyme. We also ran the GA using the µMG94 x HKY85 model on the small data set and found that the best fitting branch classes were the same as the ones found by the GA under the MG94 x HKY85 model, consistent with the apparent orthogonality between the models of site-to-site variation in mutational load and branch-specific variation in ßsb.
The model-averaging approach enables us to address issues which cannot be easily resolved using simple hypothesis testing. For example, we could compute probabilities of any two branches evolving under the same selective pressure (table 4), without making a priori assumptions about selective pressure acting on all other branches.
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Using Akaike's Information Criterion as a measure of goodness-of-fit has been criticized for being too liberal, and may select models that are overly complex; instead, we have used a small sample AIC, AICc . In principle, information criteria other than AICc could be used, for example Schwarz's Bayesian Information Criterion (BIC) (Schwarz 1978), which is consistent (the probability that it will recover a true low-dimensional model approaches 1 as the sample size tends to infinity). However, the BIC assumes that the true model is contained within the candidate set of models, whereas the AIC does not assume that any of the candidate models is necessarily true; rather, it quantifies the discrepancy between the probability density generated by the model and the data as measured by Kullback-Liebler information (Kullback and Liebler 1951).
Given the large number of possible models and limited data, it is likely that a substantial numbertens or hundredsof models will be consistent with the data. With the use of Akaike weights, which can be directly interpreted as conditional probabilities of the models, the robustness of conclusions to the choice of model can be assessed by averaging over models. Furthermore, an a priori biological model can be placed in the context of other credible models. If, for example, such a model falls outside the inferred 95% confidence set, it may not be well supported, as there are many more models which fit the data significantly better, even when the a priori model may be significantly better than the single-ratio model, as assessed by a nested comparison using a likelihood ratio test. Additionally we can test whether an a priori biological model is significantly worse than the best fitting model found by the GA using a Shimodaira-Hasegawa test. It is important to consider whether the a priori model is simply a submodel of the best fitting model, and whether the biological conclusions are altered under the best fitting model. Based on the small lysozyme data set, we find that the hypothesis that the lineages leading to Colobines and Hominoids have elevated nonsynonymous to synonymous substitution rates is consistent with our GA-based analysis. However, our GA results based on the large data set suggest that positive selection is ongoing following the radiation of these clades in contrast to the hypothesis of Messier and Stewart (1997).
When an a priori model compares favorably with the set of "good" models, we may be more confident in biological conclusions derived from studying such an a priori model. If, however, there are many alternative models which fit the data much better than our preconceived hypothesis, further exploration of the data, and perhaps a reassessment of our assumptions, is in order. In many practical cases the model itself is a nuisance component of the analysis, for example when one is interested in evolution along a specific lineage, and the rest of the phylogeny forms the "background." By weighing over credible models, quantities of biological interest (such as the probability that along a particular branch dN/dS > 1) may be derived without relying on a specific a priori model and concomitant assumptions, such as uniformity of selective pressure along background branches.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Abramowitz, M., I. Stegun, eds. 1972. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing. Dover, New York.
Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Automatic Control 119:716723.[CrossRef]
Akaike, H. 1983. Information measures and model selection. Int. Stat. Inst. 44:139149.
Brauer, M. J., M. T. Holder, L. A. Dries, D. J. Zwickl, P. O. Lewis, and D. M. Hillis. 2002. Genetic algorithms and parallel processing in maximum-likelihood phylogeny inference. Mol. Biol. Evol. 19:17171726.
Burnham, K. P., and D. R. Anderson. 2003. Model Selection and Multimodel Inference, 2nd ed. Springer, New York.
Eshelman, L. 1991. Foga-1. Morgan Kaufmann, Los Atlos, Calif.
Goldman, N., and Z. H. Yang. 1994. Codon-based model of nucleotide substitution for protein-coding DNA-sequences. Mol. Biol. Evol. 11:725736.
Katoh, K., K. Kuma, and T. Miyata. 2001. Genetic algorithm-based maximum-likelihood analysis for molecular phylogeny. J. Mol. Evol. 53:477484.[CrossRef][ISI][Medline]
Kim, Y. H., S. K. Lee, and B. R. Moon. 2003. Optimizing the order of taxon addition in phylogenetic tree construction using genetic algorithm. Genetic and Evolutionary ComputationGecco 2003, Pt Ii, Proceedings 2724:21682178.
Kosakovsky Pond, S., S. D. W. Frost, and S. V. Muse. 2000. HyPhy: hypothesis testing using phylogenies. Bioinformatics Advance Access published on October 27, 2004, doi:10.1093/bioinformatics/bti079.
Kullback, S., and R. Liebler. 1951. On information and sufficiency. Ann. Math. Stat. 22:7986.[ISI]
Lemmon, A. R., and M. C. Milinkovitch. 2002. The metapopulation genetic algorithm: an efficient solution for the problem of phylogeny estimation. Proc. Natl. Acad. Sci. USA 99:1051610521.
Lewis, P. O. 1998. A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data. Mol. Biol. Evol. 15:277283.[Abstract]
Messier, W., and C. B. Stewart. 1997. Episodic adaptive evolution of primate lysozymes. Nature 385:151154.[CrossRef][ISI][Medline]
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715724.
Schwarz, G. 1978. Estimating the dimension of a model. Ann. Stat. 6:461464.[ISI]
Shen, J., and R. B. Heckendorn. 2004. Discrete branch length representations for genetic algorithms in phylogenetic search. Appl. Evol. Comput. 3005:94103.[ISI]
Shimodaira, H., and M. Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 16:11141116.
Sugiura, N. 1978. Further analysis of the data by Akaike's information criterion and the finite corrections. Commun. Stat. Theory Methods A7:1326.
Whitley, D. 2001. An overview of evolutionary algorithms: practical issues and common pitfalls. J. Inf. Software Technol. 43:817831.[CrossRef]
Yang, Z. H. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568573.[Abstract]
Yang, Z. H., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908917.
Yang, Z. H., R. Nielsen, N. Goldman, and A. M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431449.