* Genome Atlantic, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada; Canadian Institute for Advanced Research, Program in Evolutionary Biology, Toronto, Ontario, Canada;
Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
Correspondence: E-mail: cblouin{at}cs.dal.ca
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: protein evolutionary rate functional divergence maximum likelihood simulation
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Rate heterogeneity is an important concept in molecular biology because sites that are the most tolerant to substitution are likely to appear more variable. To address the issue of non-uniform rates in phylogeny, an improvement over original models of substitution was made by the introduction of rates-across-sites models (RAS), whereby the rates of evolution at sites were modeled by a discretized -distribution (Yang 1993, 1994). As expected, the introduction of a few extra parameters improved the fit of phylogenetic models to sequence alignments, which, in this case, translates into enhanced discrimination between phylogenetic tree topologies (Yang 1996). However, the RAS model should not be regarded as a silver bullet against all types of phylogenetic artifacts (Reyes, Pesole, and Saccone 2000).
Modeling by means of a discrete rate distribution enables each site to be classified by determining the rate category with the largest conditional posterior probability (Yang 1995; Felsenstein and Churchill 1996). This method of estimating site rates essentially uses the empirical Bayes framework. A discussion of empirical Bayes versus maximum likelihood estimation was recently published (Mayrose et al. 2004). This study, interestingly, also demonstrates that the rate estimates are generally robust to the length of alignments as well as to the extent of the rate heterogeneity (measured using the -distribution shape parameter:
). Regardless of the method used to generate these site rates, these estimates can be regarded as a measure of the level of evolutionary constraint imposed by functional requirements (Pupko et al. 2002; Blouin, Boucher, and Roger 2003). Consequently, these rates have potential applications in functional prediction, the study of protein evolution, and as a guide for experimental design using genomic information (Dean et al. 2002; Inagaki et al. 2002, 2003).
Probabilistic models of character substitution can also be used to simulate sequences from an input tree. These simulations are commonly used for hypothesis testing in phylogeny (Swofford et al. 1996). A Monte Carlo simulation was implemented for DNA evolution in Seq-Gen (Grassly, Adachi, and Rambaut 1997) and for protein sequences in Pseq-Gen (Rambaut and Grassly 1997). Another implementation was devised to enhance the flexibility of the Pseq-Gen engine (Tuffery 2002); this included the definition of the root sequence and the constraints imposed by correlated evolution of clusters of sites within the sequence. A different approach is used in Rose (Stoye, Evers, and Meyer 1998), in which the evolutionary model includes an underlying insertion and deletion mechanism.
In this study, we have implemented, from the original PSeq-Gen code, a simulation engine that allows flexibility in the model of evolution to simulate functional divergence via changes in the rate of evolution at sites across the tree. Specifically, we generated sequence alignments for which all rates and rate shift information is known. We applied this method to two test cases to examine the robustness of rate estimation methods that are designed as tools for functional studies. The first test case examined sampling error in the estimation of the most probable site-rate category. This measurement has gained attention as an indicator of evolutionary constraints (Pollock, Taylor, and Goldman 1999; Gu 2001; Dean et al. 2002; Pupko et al. 2002; Blouin, Boucher, and Roger 2003). The second test case evaluated the performance of four methods to detect changes of site rates between two subtrees (herein referred to as site-rate shifts). Finally, we suggest guidelines to assemble data sets which can be used for robust estimation of evolutionary constraints.
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Sampling Error Data Set
Two 1,000 site data sets were generated using a -distributed rates-across-sites model (
= 0.65). The first alignment, generated from a tree herein called UNIFORM, contained 968 sequences with pairwise distances distributed between 0.1 and 1.1 expected substitutions per site. The UNIFORM tree approximates an infinitely diverse set of sequences from which experimental subsets are collected (fig. 1). The second alignment, generated from a tree herein called OUTGROUP, is similar to UNIFORM except that each jackknifed subsets included a sequence that corresponded to a deep branching outgroup sequence. The distance from the root to this terminal node was simulated at 1.5 substitutions per site. In both cases, six series of 100 alignments with sequence sample sizes of 5, 10, 15, 20, 30, and 40 taxa were uniformly drawn by jackknife re-samplings. The maximum likelihood phylogeny (model: JTT, 8
-distributed rate categories) was inferred using tree-puzzle 5.0 (Strimmer and von Haeseler 1996). The estimate of the
-distribution parameter and the highest conditional probability rate category (i.e., an empirical Bayes rate estimate) for each site was extracted for analysis. The variance of each site rate i was calculated as (S(i))2, where S is the standard deviation of the rate estimate across all jackknife replicates for a given sequence sample size. The accuracy of the rate estimate was calculated with the frequency at which the estimated rate category was matching the rate category of the reference rates. The boundaries of the reference categories were determined from a discrete
-distribution with shape parameter = 0.65. To avoid artifacts arising from the discretized nature of the rate distribution, the estimate (Rc(i)) was considered accurate if
This is a conservative definition of accuracy and therefore is expected to yield accuracies slightly larger than the actual values.
|
Software Availability
The source code and user manual for covTREE, our Monte Carlo sequence simulator used in all of our analyses, can be found at http://www.cs.dal.ca/cblouin/Blouin. The sequence data and trees are also available as Supplementary Material online.
![]() |
Results and Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In actual systems, the availability of an adequate number of sequences representing an appropriate phylogenetic depth can be an issue. It was reported that, for a data set of NADH-2 of requiem sharks, a number of sequences roughly over 25 was necessary to eliminate the compensatory effect on the maximum likelihood estimation of Pinvariable and the rate distribution shape parameter (Sullivan, Swofford, and Naylor 1999). Similar results were obtained with insect data (Misof et al. 2002). Is 25 sequences a reasonable sample size for robust rate estimation?
Two large data sets, UNIFORM (with pairwise distances between 0.1 and 1.1) and OUTGROUP (the same as before except with a single long branch of 1.5 substitutions/site from the root) of 968 sequences total were generated. Subsamples of various sequence numbers were performed from these data sets by jackknifing (see Materials and Methods). The estimated rates were compared to the simulated rates, and the variance of the estimates was computed. In both cases, the variance on the estimation of the conditional mode site rate became small for a data set of approximately 30 sequences (fig. 2). This is in agreement with the figure of 2030 sequences reported by Mayrose et al. (2004).
|
|
Robustness of Rate Estimates versus Outgroup Sequences
The effect of estimating rates while including an outlier sequence was tested. The previous experiment (UNIFORM) was modified such that each jackknife-sampled data set contained one distantly related sequence. As shown in figure 3B, the frequency of correct assignment of the rate category has not yet converged to 95%, even for a relatively large sample alignment of 40 sequences. It is unclear whether the estimation of the rate category will eventually converge on a 100% correct assignment because the variance on the estimated rates for large data sets is becoming small (fig. 2). This suggests that the rate estimate is affected by a systematic source of error rather than by sampling issues. An outgroup with a distance from the root of 1.5 expected substitutions per site simulates the presence, in an alignment, of a very distantly related sequence, which may be poorly aligned. This experiment tested the capacity of ML estimation to cope with the noise introduced by a distant or misaligned sequence in a data set.
Because of the difficulty in estimating intermediate rates, and because of the questionable biological significance of high resolution in relative rate categories, the use of a large number of rate categories appears to be unjustified. These results are based on the classification performance of sites into the correct rate categories. Using a different performance metric: the overall square of differences (R)2 between estimates and real rates, the rate estimation was demonstrated to reach an accuracy plateau with 410 rate categories (Mayrose et al. 2004). In our experiments, the probability density around the "true" category is quite uniform. An increased number of categories can thus minimize (
R)2 without necessarily improving the classification performance.
Assigning Rates to Heterotachous Sites
If a multiple sequence alignment contains sites where rates of evolution changed throughout evolution, the resulting rate will be an average of rates. This average may poorly model the evolutionary process for this site in any given lineage. Such sites are known as heterotachous, or "covarion." There is a growing body of evidence suggesting that rate heterogeneity is a common process in protein evolution (Lockhart et al. 1998; Steel, Huson, and Lockhart 2000; Philippe et al. 2003).
The biological basis for shifts in evolutionary rates is functional divergence. If a site must have a specific identity in one lineage, its rate of evolution should be low. However, if this site is not used in a different lineage, the observed rate of evolution will be larger. Not all functional divergence reflects change in biological function; the clusters of residues that are important for the folding of a protein can also change over time, resulting in site-rate shifts even within orthologs (Lockhart et al. 2000; Lopez, Casane, and Philippe 2002).
Detecting the sites that have undergone site-rate shifts can be used as a comparative tool to understand/predict biological function. This is especially true if there is protein structure available to visualize these rate-shift patterns on a protein (Pupko et al. 2002; Blouin, Boucher, and Roger 2003).
Detecting Site-Rate Shifts
To test methods that detect site-rate changes, data sets were simulated by generating two lineages of homologous sequences with 20% of the sites evolving at uncorrelated rates (fig. 4). Because both uncorrelated rates are randomly generated from the same distribution, only a fraction of the rate shifts represent significantly different rates. As there is no reason to expect that all sites with divergent rates will be identified, the performance metric of most interest is thus the detection accuracy (sites correctly assigned as rate-variable) and the occurrence of false positives.
|
|
|
The bivariate method (Susko et al. 2002) lies at the other end of the spectrum. This method tends to fail to identify small rate shifts. This is because the uncertainty on each of the jointly estimated site rates compounds into a larger confidence interval for site-rate differences as a whole. The method of Gu99 (Gu 1999), which has been widely used (Gu 1999; Wang and Gu 2001; Gaucher et al. 2002) offers a non-significant occurrence of false positives (P(FP) 0.01, threshold = 0.80). This method uses an empirical Bayes framework to evaluate at each site the posterior probability of functional divergence given the site's amino acid pattern. As an added advantage for the Gu99 method, the final classification appears to be robust to the exact posterior-probability cutoff. This robustness is a desirable property because it is impossible to calibrate the cutoff during the analysis of "real" sequence data. Nevertheless, the performance (precision and power) of the bivariate method and of Gu99 are essentially identical.
Caveats of Modeling Rate Distributions Using a Discretized Distribution
A few caveats have to be considered because of the properties of the assumed discretized -distribution of site rates: (1) For each of these experiments, the rate distribution shape parameter
was set to 0.65. For data sets with an extremely large
parameter (more uniform rates across sites), the difference in modes of each category will be smaller. In this situation, more sequences will be required to accurately determine which rate category is the most appropriate. (2) If the rates are estimated with a data set containing functionally diverse sequences, the presence of site-rate shifts within the data set will cause an averaging of the relative rates and an overestimation of the shape parameter
, which determines the discrete rate categories available. This is likely to adversely affect the accuracy of rate assignment at sites even if they have not undergone rate shifts.
However, it should be noted that the distribution of rates obtained without assuming a distribution in a number of protein data sets was shown to match the
distribution very well (Susko et al. 2003). This suggests that the assumption that site rates are
distributed is sound for the modeling of rate heterogeneity.
![]() |
Conclusions |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The choice of a number of discrete rate categories has an impact on the results obtained for the estimation of site rates. In small-sized and/or taxonomically broad phylogenetic samples, the apparent resolution achieved by using a large number of rate categories may be misleading. This is so because of the difficulty in correctly estimating rates in the intermediate range.
In practice, the exact intermediate rate of evolution of a site has limited importance for the investigator interested in structure-function relationships; a smaller number of rate categories can be used and still provide adequate resolution. In extreme cases of poor sampling, the rate estimates will be no less informative than the visual annotation of multiple sequence alignments using the simple schemes: constant, conserved, or variable. However, providing that a large enough number of sequences are available, and assuming that all positions in the sequences share homologous functions, maximum likelihoodbased rate estimation is a better metric of evolutionary constraints because it explicitly considers phylogenetic relationships.
Conversely, it is inadvisable to include distantly related outgroup sequences and/or sequences with suspected different functions, in an attempt to mitigate sampling error. To minimize the effect of distantly related sequences, it is preferable to separately estimate the rates at sites of robust monophyletic groups each of which is made up of a large number of sequences that are more functionally similar.
Heterogeneity in the rate of substitution within a site can be an informative property: When an alignment contains functionally divergent sequences, rate variation among particular lineages or clades provides critical information regarding the presence of fluctuating evolutionary constraints (Inagaki et al. 2003). Four methods were tested for their ability to identify these sites. The choice of the most appropriate method depends on the type of information that is of interest. The method of Knudsen and Miyamoto (Knudsen and Miyamoto 2001) is excellent at detecting a large number of sites containing rate shifts. On the other hand, it identifies a significant number of false positives.
If the objective of the experiment is to target individual positions (a mutational study for example), the methods of Gu (Gu and Vander Velden 2002) and Susko (Susko et al. 2002) are reliable over a range of confidence levels. However, the confidence level should be chosen prior to running the experiment as to avoid problems of multiple testing. The results presented in table 1 can be used as a guide for making this choice.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Blouin, C., Y. Boucher, and A. J. Roger. 2003. Inferring functional constraints and divergence in protein families using 3D mapping of phylogenetic information. Nucleic Acids Res. 31:790797.
Bruno, W. J. 1996. Modeling residue usage in aligned protein sequences via maximum likelihood. Mol. Biol. Evol. 13:13681374.
Dean, A. M., C. Neuhauser, E. Grenier, and G. B. Golding. 2002. The pattern of amino acid replacements in alpha/beta-barrels. Mol. Biol. Evol. 19:18461864.
Felsenstein, J., and G. A. Churchill. 1996. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93104.[Abstract]
Gaucher, E. A., X. Gu, M. M. Miyamoto, and S. A. Benner. 2002. Predicting functional divergence in protein evolution by site-specific rate shifts. Trends Biochem. Sci. 27:315321.[CrossRef][ISI][Medline]
Grassly, N. C., J. Adachi, and A. Rambaut. 1997. PSeq-Gen: an application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:559560.[Medline]
Gu, X. 1999. Statistical methods for testing functional divergence after gene duplication. Mol. Biol. Evol. 16:16641674.
. 2001. Maximum-likelihood approach for gene family evolution under functional divergence. Mol. Biol. Evol. 18:453464.
Gu, X., and K. Vander Velden. 2002. DIVERGE: phylogeny-based analysis for functional-structural divergence of a protein family. Bioinformatics 18:5001.
Halpern, A. L., and W. J. Bruno. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol. Biol. Evol. 15:910917.[Abstract]
Inagaki, Y., C. Blouin, W. F. Doolittle, and A. J. Roger. 2002. Convergence and constraint in eukaryotic release factor 1 (eRF1) domain 1: the evolution of stop codon specificity. Nucleic Acids Res. 30:532544.
Inagaki, Y., C. Blouin, E. Susko, and A. J. Roger. 2003. Assessing functional divergence in EF-1alpha and its paralogues in eukaryotes and archaebacteria. Nucleic Acids Res. 31:42274237.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 8:275282.[Abstract]
Knudsen, B., and M. M. Miyamoto. 2001. A likelihood ratio test for evolutionary rate shifts and functional divergence among proteins. Proc. Natl. Acad. Sci. USA 98:1451214517.
Lartillot, N., and H. Philippe. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. 21:10951109.
Lockhart, P. J., D. Huson, U. Maier, M. J. Fraunholz, Y. Van De Peer, A. C. Barbrook, C. J. Howe, and M. A. Steel. 2000. How molecules evolve in eubacteria. Mol. Biol. Evol. 17:835838.
Lockhart, P. J., M. A. Steel, A. C. Barbrook, D. H. Huson, M. A. Charleston, and C. J. Howe. 1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol. 15:11831188.[Abstract]
Lopez, P., D. Casane, and H. Philippe. 2002. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19:17.
Mayrose, I., D. Graur, N. BenTal, and T. Pupko. 2004. Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior. Mol. Biol. Evol. 21:17811791.
Misof, B., C. L. Anderson, T. R. Buckley, D. Erpenbeck, A. Rickert, and K. Misof. 2002. An empirical analysis of mt 16S rRNA covarion-like evolution in insects: site-specific rate variation is clustered and frequently detected. J. Mol. Evol. 55:460469.[CrossRef][ISI][Medline]
Philippe, H., D. Casane, S. Gribaldo, P. Lopez, and J. Meunier. 2003. Heterotachy and functional shift in protein evolution. IUBMB Life 55:257265.[ISI][Medline]
Pollock, D. D., W. R. Taylor, and N. Goldman. 1999. Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. 287:187198.[CrossRef][ISI][Medline]
Pupko, T., R. E. Bell, I. Mayrose, F. Glaser, and N. Ben-Tal. 2002. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. 18(Suppl 1):S7177.
Pupko, T., and N. Galtier. 2002. A covarion-based method for detecting molecular adaptation: application to the evolution of primate mitochondrial genomes. Proc. R. Soc. Lond. Ser. B Biol. Sci. 269:13131316.[CrossRef][ISI][Medline]
Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235238.[Abstract]
Reyes, A., G. Pesole, and C. Saccone. 2000. Long-branch attraction phenomenon and the impact of among-site rate variation on rodent phylogeny. Gene 259:177187.[CrossRef][ISI][Medline]
Steel, M., D. Huson, and P. J. Lockhart. 2000. Invariable sites models and their use in phylogeny reconstruction. Syst. Biol. 49:225232.[CrossRef][ISI][Medline]
Stoye, J., D. Evers, and F. Meyer. 1998. Rose: generating sequence families. Bioinformatics 14:157163.[Abstract]
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. J. Mol. Biol. 13:964969.
Sullivan, J., D. L. Swofford, and G. J. P. Naylor. 1999. The effect of taxon sampling on estimating rate heterogeneity parameters of maximum-likelihood models. Mol. Biol. Evol. 16:13471356.
Susko, E., C. Field, C. Blouin, and A. J. Roger. 2003. Estimation of rates-across-sites distributions in phylogenetic substitution models. Syst. Biol. 52:594603.[CrossRef][ISI][Medline]
Susko, E., Y. Inagaki, C. Field, M. E. Holder, and A. J. Roger. 2002. Testing for differences in rates-across-sites distributions in phylogenetic subtrees. Mol. Biol. Evol. 19:15141523.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis (1996). Phylogenetic inference. Pp. 0000 in D. M. Hillis, C. Moritz, and B. Mable, eds. Molecular systematics. Sinauer Associates, Sunderland, Mass.
Tuffery, P. 2002. CS-PSeq-Gen: simulating the evolution of protein sequence under constraints. Bioinformatics 18:10151016.
Wang, Y., and X. Gu. 2001. Functional divergence in the caspase gene family and altered functional constraints: statistical analysis and prediction. Genetics 158:13111320.
Whelan, S., and N. Goldman. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18:691699.
Wichmann, B. A., and I. D. Hill. 1982. An efficient and portable pseudo-random number generator. Appl. Stat. 31:188190.[ISI]
Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:13961401.[Abstract]
. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306314.[ISI][Medline]
. 1995. A space-time process model for the evolution of DNA sequences. Genetics 139:9931005.
. 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367372.[CrossRef][ISI]
|