Impact of Taxon Sampling on the Estimation of Rates of Evolution at Sites

C. Blouin*,{ddagger}, D. Butt{ddagger} and A. J. Roger*,{dagger}

* Genome Atlantic, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada; {dagger} Canadian Institute for Advanced Research, Program in Evolutionary Biology, Toronto, Ontario, Canada; {ddagger} Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada

Correspondence: E-mail: cblouin{at}cs.dal.ca


    Abstract
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 References
 
The function of individual sites within a protein influences their rate of accepted point mutation. During the computation of phylogenetic likelihoods, rate heterogeneity can be modeled on a site-per-site basis with relative rates drawn from a discretized {Gamma}-distribution. Site-rate estimates (e.g., the rate of highest posterior probability given the data at a site) can then be used as a measure of evolutionary constraints imposed by function. However, if the sequence availability is limited, the estimation of rates is subject to sampling error. This article presents a simulation study that evaluates the robustness of evolutionary site-rate estimates for both small and phylogenetically unbalanced samples. The sampling error on rate estimates was first evaluated for alignments that included 5–45 sequences, sampled by jackknifing, from a master alignment containing 968 sequences. We observed that the potentially enhanced resolution among site rates due to the inclusion of a larger number of rate categories is negated by the difficulty in correctly estimating intermediate rates. This effect is marked for data sets with less than 30 sequences. Although the computation of likelihood theoretically accounts for phylogenetic distances through branch lengths, the introduction of a single long-branch outlier sequence had a significant negative effect on site-rate estimates. Finally, the presence of a shift in rates of evolution between related lineages can be diagnostic of a gain/loss of function within a protein family. Our analyses indicate that detecting these rate shifts is a harder problem than estimating rates. This is so, partially, because the difference in rates depends on two rate estimates, each with an intrinsic uncertainty. The performances of four methods to detect these site-rate shifts are evaluated and compared. Guidelines are suggested for preparing data sets minimally influenced by error introduced by sequence sampling.

Key Words: protein • evolutionary rate • functional divergence • maximum likelihood • simulation


    Introduction
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 References
 
The evolution of biological sequences over time can be represented by phylogenetic trees. Assuming a probabilistic model of amino acid substitution, tree topologies can be estimated from a sequence alignment by using the maximum likelihood criterion. For proteins, typical probabilistic models assume that (1) all sites are evolving in a state-specific, generic set of constraints, and (2) character substitution is occurring at a uniform rate across the sequence alignment. The examination of the patterns of evolutionary changes at different sites calls into question the validity of these simplifying assumptions. The fact that there is no generic substitution model suitable for all amino acid positions has recently garnered some attention (Bruno 1996; Halpern and Bruno 1998; Lartillot and Philippe 2004). In this study, we focus on heterogeneity in the rate of evolution across different sites.

Rate heterogeneity is an important concept in molecular biology because sites that are the most tolerant to substitution are likely to appear more variable. To address the issue of non-uniform rates in phylogeny, an improvement over original models of substitution was made by the introduction of rates-across-sites models (RAS), whereby the rates of evolution at sites were modeled by a discretized {Gamma}-distribution (Yang 1993, 1994). As expected, the introduction of a few extra parameters improved the fit of phylogenetic models to sequence alignments, which, in this case, translates into enhanced discrimination between phylogenetic tree topologies (Yang 1996). However, the RAS model should not be regarded as a silver bullet against all types of phylogenetic artifacts (Reyes, Pesole, and Saccone 2000).

Modeling by means of a discrete rate distribution enables each site to be classified by determining the rate category with the largest conditional posterior probability (Yang 1995; Felsenstein and Churchill 1996). This method of estimating site rates essentially uses the empirical Bayes framework. A discussion of empirical Bayes versus maximum likelihood estimation was recently published (Mayrose et al. 2004). This study, interestingly, also demonstrates that the rate estimates are generally robust to the length of alignments as well as to the extent of the rate heterogeneity (measured using the {Gamma}-distribution shape parameter: {alpha}). Regardless of the method used to generate these site rates, these estimates can be regarded as a measure of the level of evolutionary constraint imposed by functional requirements (Pupko et al. 2002; Blouin, Boucher, and Roger 2003). Consequently, these rates have potential applications in functional prediction, the study of protein evolution, and as a guide for experimental design using genomic information (Dean et al. 2002; Inagaki et al. 2002, 2003).

Probabilistic models of character substitution can also be used to simulate sequences from an input tree. These simulations are commonly used for hypothesis testing in phylogeny (Swofford et al. 1996). A Monte Carlo simulation was implemented for DNA evolution in Seq-Gen (Grassly, Adachi, and Rambaut 1997) and for protein sequences in Pseq-Gen (Rambaut and Grassly 1997). Another implementation was devised to enhance the flexibility of the Pseq-Gen engine (Tuffery 2002); this included the definition of the root sequence and the constraints imposed by correlated evolution of clusters of sites within the sequence. A different approach is used in Rose (Stoye, Evers, and Meyer 1998), in which the evolutionary model includes an underlying insertion and deletion mechanism.

In this study, we have implemented, from the original PSeq-Gen code, a simulation engine that allows flexibility in the model of evolution to simulate functional divergence via changes in the rate of evolution at sites across the tree. Specifically, we generated sequence alignments for which all rates and rate shift information is known. We applied this method to two test cases to examine the robustness of rate estimation methods that are designed as tools for functional studies. The first test case examined sampling error in the estimation of the most probable site-rate category. This measurement has gained attention as an indicator of evolutionary constraints (Pollock, Taylor, and Goldman 1999; Gu 2001; Dean et al. 2002; Pupko et al. 2002; Blouin, Boucher, and Roger 2003). The second test case evaluated the performance of four methods to detect changes of site rates between two subtrees (herein referred to as site-rate shifts). Finally, we suggest guidelines to assemble data sets which can be used for robust estimation of evolutionary constraints.


    Methods
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 References
 
Simulating Protein Evolution with Functional Divergence
The Monte Carlo simulation procedure is similar to that implemented in Pseq-Gen (Rambaut and Grassly 1997). The simulation requires an input tree, and any of the optional items: a user-defined root sequence, a {Gamma}-distributed random or user-defined sequence of site rates, and a user-defined amino acid frequency vector. The rate and frequency parameter can be applied from any nodes in the input tree during the simulation. The JTT (Jones, Taylor, and Thornton 1992), WAG (Whelan and Goldman 2001), or PAM models of character substitution can be used, although the JTT model was used in this series of simulations. Pre-determined shifts in site rates were applied from a specified node to all of its descendents. The sequences were generated recursively from an arbitrary root node toward the terminal nodes using a robust pseudo-random number generator (Wichmann and Hill 1982).

Sampling Error Data Set
Two 1,000 site data sets were generated using a {Gamma}-distributed rates-across-sites model ({alpha} = 0.65). The first alignment, generated from a tree herein called UNIFORM, contained 968 sequences with pairwise distances distributed between 0.1 and 1.1 expected substitutions per site. The UNIFORM tree approximates an infinitely diverse set of sequences from which experimental subsets are collected (fig. 1). The second alignment, generated from a tree herein called OUTGROUP, is similar to UNIFORM except that each jackknifed subsets included a sequence that corresponded to a deep branching outgroup sequence. The distance from the root to this terminal node was simulated at 1.5 substitutions per site. In both cases, six series of 100 alignments with sequence sample sizes of 5, 10, 15, 20, 30, and 40 taxa were uniformly drawn by jackknife re-samplings. The maximum likelihood phylogeny (model: JTT, 8 {Gamma}-distributed rate categories) was inferred using tree-puzzle 5.0 (Strimmer and von Haeseler 1996). The estimate of the {Gamma}-distribution parameter and the highest conditional probability rate category (i.e., an empirical Bayes rate estimate) for each site was extracted for analysis. The variance of each site rate i was calculated as (S(i))2, where S is the standard deviation of the rate estimate across all jackknife replicates for a given sequence sample size. The accuracy of the rate estimate was calculated with the frequency at which the estimated rate category was matching the rate category of the reference rates. The boundaries of the reference categories were determined from a discrete {Gamma}-distribution with shape parameter = 0.65. To avoid artifacts arising from the discretized nature of the rate distribution, the estimate (Rc(i)) was considered accurate if This is a conservative definition of accuracy and therefore is expected to yield accuracies slightly larger than the actual values.



View larger version (22K):
[in this window]
[in a new window]
 
FIG. 1.— Topology of the UNIFORM tree. This tree was generated by iteratively expanding each terminal node T to one of two possible internal nodes: (T:0.05,T:0.05) or ((T:0.05,T:0.05):0.05,T:0.1). For the analyses described in the text, random subsets of N sequences were drawn from this tree, each of which yielded its own subtree. This is analogous to analyzing a multiple sequence alignment for which it is assumed that only a fraction of the biodiversity is sampled.

 
Simulating Functional Divergence
An alignment of 44 sequences with 1,000 sites was generated to simulate a discrete site-rate shift event. At the internal node separating the tree into two halves of 22 sequences, 20% of the rates were replaced by randomly drawn rates from the original {Gamma}-distribution ({alpha} = 0.65). Detection of site-rate changes over this data set were then performed according the following methods: DIVERGE (Gu and Vander Velden 2002), Knudsen and Miyamoto (Knudsen and Miyamoto 2001), bivariate estimates (Susko et al. 2002), and the weighted difference measure (Blouin, Boucher, and Roger 2003). The performance and accuracy of these detection methods was then evaluated against the actual site-rate differences.

Software Availability
The source code and user manual for covTREE, our Monte Carlo sequence simulator used in all of our analyses, can be found at http://www.cs.dal.ca/~cblouin/Blouin. The sequence data and trees are also available as Supplementary Material online.


    Results and Discussion
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 References
 
Sampling Error on Empirical Bayes Site Rates
There are two major sampling issues that can affect site-rate estimation: (1) the number of sequences analyzed and (2) the phylogenetic depth of the sampled sequences. First, rate estimates that derived from small data sets are prone to sampling error. This is especially true if the sampling among lineages is not homogeneous. Second, the phylogenetic sampling can affect rate estimates in two ways: (1) saturation arising from evolutionarily divergent sequences and (2) the possibility that some individual site rates were not constant throughout the period of evolution under examination. The experiments described in the sections that follow test these elements.

In actual systems, the availability of an adequate number of sequences representing an appropriate phylogenetic depth can be an issue. It was reported that, for a data set of NADH-2 of requiem sharks, a number of sequences roughly over 25 was necessary to eliminate the compensatory effect on the maximum likelihood estimation of Pinvariable and the rate distribution shape parameter (Sullivan, Swofford, and Naylor 1999). Similar results were obtained with insect data (Misof et al. 2002). Is 25 sequences a reasonable sample size for robust rate estimation?

Two large data sets, UNIFORM (with pairwise distances between 0.1 and 1.1) and OUTGROUP (the same as before except with a single long branch of 1.5 substitutions/site from the root) of 968 sequences total were generated. Subsamples of various sequence numbers were performed from these data sets by jackknifing (see Materials and Methods). The estimated rates were compared to the simulated rates, and the variance of the estimates was computed. In both cases, the variance on the estimation of the conditional mode site rate became small for a data set of approximately ≥30 sequences (fig. 2). This is in agreement with the figure of 20–30 sequences reported by Mayrose et al. (2004).



View larger version (23K):
[in this window]
[in a new window]
 
FIG. 2.— Variance on the estimation of the most probable site-rate category with respect to the size of the input data set. Boxplot of the variance (Y-axis) on a per-site estimation of the conditional mode rate. The full data set contained 968 sequences. Subsets of a given number of taxa (X-axis) were generated by jackknife re-sampling of simulated protein sequences (100 times). For each sequence sampling size, the data sets were derived from simulations from the UNIFORM tree ("Un," on the left) and the OUTGROUP tree ("Ot," on the right).

 
Accuracy of Rate Estimate versus Sample Size
In a sampling of 30 sequences or more from the UNIFORM data set, the rates of ≥95% sites were correctly classified (fig. 3A). This classification accuracy drops to ≥90% in a sampling of 20 sequences. Below this number of sequences, a very clear pattern of rate estimation inaccuracies as a function of rates becomes apparent. Specifically, the accuracy of site-rate estimation remains relatively high for the rates at the extremes (e.g., very high or very low rates) but drops precipitously for intermediate rates.



View larger version (44K):
[in this window]
[in a new window]
 
FIG. 3.— Frequency of correct assignment of site rate categories. The most probable site rate category (r*) estimated 100 times for each of 1,000 sites was compared to the category of its reference rate (r). (A) Using the UNIFORM data set; (B) when all alignments included a distance outgroup. The frequency at which site rates are correctly assigned is plotted with respect to both the number of sequences in the alignment and the rate category of the reference rate. Note that a site rate is considered correctly assigned if it is assigned to the correct category or its immediate neighbor.

 
The explanation for this phenomenon is likely that in an alignment with a limited number of sequences, slowly evolving sites are likely to appear constant (thus slowly evolving) and fast evolving sites are more likely to display a scattered selection of characters, regardless of the sample size. Estimation of the very low rates and the very high rates is therefore inherently less error prone because there is only one boundary for the incorrect versus correct classification. Sampling error affecting sites evolving at intermediate rates can lead to misclassification in either direction (either too fast or too slow); consequently, a larger number of observations (sequences) must be collected to achieve an accurate estimate.

Robustness of Rate Estimates versus Outgroup Sequences
The effect of estimating rates while including an outlier sequence was tested. The previous experiment (UNIFORM) was modified such that each jackknife-sampled data set contained one distantly related sequence. As shown in figure 3B, the frequency of correct assignment of the rate category has not yet converged to ≥95%, even for a relatively large sample alignment of 40 sequences. It is unclear whether the estimation of the rate category will eventually converge on a 100% correct assignment because the variance on the estimated rates for large data sets is becoming small (fig. 2). This suggests that the rate estimate is affected by a systematic source of error rather than by sampling issues. An outgroup with a distance from the root of 1.5 expected substitutions per site simulates the presence, in an alignment, of a very distantly related sequence, which may be poorly aligned. This experiment tested the capacity of ML estimation to cope with the noise introduced by a distant or misaligned sequence in a data set.

Because of the difficulty in estimating intermediate rates, and because of the questionable biological significance of high resolution in relative rate categories, the use of a large number of rate categories appears to be unjustified. These results are based on the classification performance of sites into the correct rate categories. Using a different performance metric: the overall square of differences ({Delta}R)2 between estimates and real rates, the rate estimation was demonstrated to reach an accuracy plateau with 4–10 rate categories (Mayrose et al. 2004). In our experiments, the probability density around the "true" category is quite uniform. An increased number of categories can thus minimize ({Delta}R)2 without necessarily improving the classification performance.

Assigning Rates to Heterotachous Sites
If a multiple sequence alignment contains sites where rates of evolution changed throughout evolution, the resulting rate will be an average of rates. This average may poorly model the evolutionary process for this site in any given lineage. Such sites are known as heterotachous, or "covarion." There is a growing body of evidence suggesting that rate heterogeneity is a common process in protein evolution (Lockhart et al. 1998; Steel, Huson, and Lockhart 2000; Philippe et al. 2003).

The biological basis for shifts in evolutionary rates is functional divergence. If a site must have a specific identity in one lineage, its rate of evolution should be low. However, if this site is not used in a different lineage, the observed rate of evolution will be larger. Not all functional divergence reflects change in biological function; the clusters of residues that are important for the folding of a protein can also change over time, resulting in site-rate shifts even within orthologs (Lockhart et al. 2000; Lopez, Casane, and Philippe 2002).

Detecting the sites that have undergone site-rate shifts can be used as a comparative tool to understand/predict biological function. This is especially true if there is protein structure available to visualize these rate-shift patterns on a protein (Pupko et al. 2002; Blouin, Boucher, and Roger 2003).

Detecting Site-Rate Shifts
To test methods that detect site-rate changes, data sets were simulated by generating two lineages of homologous sequences with 20% of the sites evolving at uncorrelated rates (fig. 4). Because both uncorrelated rates are randomly generated from the same {Gamma} distribution, only a fraction of the rate shifts represent significantly different rates. As there is no reason to expect that all sites with divergent rates will be identified, the performance metric of most interest is thus the detection accuracy (sites correctly assigned as rate-variable) and the occurrence of false positives.



View larger version (34K):
[in this window]
[in a new window]
 
FIG. 4.— Diagram of the phylogenetic tree simulating a functional divergence event. The node from which the series of site rates r(a) was applied is indicated by an arrow (gray sequences). The nodes simulated under the set of rates r(b) are shown in black.

 
The correlation in simulated rates is shown in figure 5. This situation is assumed to be an idealized case of a site-rate shift event resulting from function gain/loss between two lineages of homologous proteins. Four published methods for the detection of site-rate changes were tested (Table 1 and fig. 5). A related, nonempirical Bayes, method has been described (Pupko and Galtier 2002) but was not included in this comparison because of a lack of available software tools.



View larger version (25K):
[in this window]
[in a new window]
 
FIG. 5.— Detection of sites with rate shifts. All rates simulated under one site rate profile (ra) are plotted as gray dots against their homologous rate used in the other half of the simulation (rb). Sites along the diagonal did not undergo a site-rate shift. The sites that have been detected as having undergone a site rate shift according to the specified method are indicated with black circles. The thresholds used to determine the detection limit are shown in the lower right-hand corner of each plot. Because of the log scale, data points distant from the diagonal in the lower left corner of these plots represent very small difference in rate.

 

View this table:
[in this window]
[in a new window]
 
Table 1 Performance of Methods Designed to Detect Sites Undergoing a Rate Shift

 
Two of the four tested methods yielded a significant number of false positives (sites wrongly classified as having evolved under two distinct rates; table 1): the simple weighted difference of the most probable conditional mode rate (Blouin, Boucher, and Roger 2003) and the likelihood-ratio test (Knudsen and Miyamoto 2001). The direct use of rate estimates in the weighted difference method is prone to sampling issues as presented in the previous sections. Once the threshold of detection is lowered (table 1), the difference between poorly estimated rates from both subtrees yields a significant number of false positives. The likelihood-ratio test (Knudsen and Miyamoto 2001) considers the joint phylogeny of two lineages under dependent and independent vectors of rates (Knudsen and Miyamoto 2001). This method appears to be very sensitive to rate variation across subtrees (table 1). Almost 90% of the sites with true rate shifts were identified as such at a threshold of 0.95. The recovery of 8.3% of false positives with this method is a natural consequence of this heightened sensitivity. A significant number of false positives are still recovered (2.2%), even if the threshold is raised to 0.99. Therefore, the likelihood ratio method would be ideal to enrich a data set in functionally divergent sites if false positives are of less concern.

The bivariate method (Susko et al. 2002) lies at the other end of the spectrum. This method tends to fail to identify small rate shifts. This is because the uncertainty on each of the jointly estimated site rates compounds into a larger confidence interval for site-rate differences as a whole. The method of Gu99 (Gu 1999), which has been widely used (Gu 1999; Wang and Gu 2001; Gaucher et al. 2002) offers a non-significant occurrence of false positives (P(FP) ≤ 0.01, threshold = 0.80). This method uses an empirical Bayes framework to evaluate at each site the posterior probability of functional divergence given the site's amino acid pattern. As an added advantage for the Gu99 method, the final classification appears to be robust to the exact posterior-probability cutoff. This robustness is a desirable property because it is impossible to calibrate the cutoff during the analysis of "real" sequence data. Nevertheless, the performance (precision and power) of the bivariate method and of Gu99 are essentially identical.

Caveats of Modeling Rate Distributions Using a Discretized {Gamma} Distribution
A few caveats have to be considered because of the properties of the assumed discretized {Gamma}-distribution of site rates: (1) For each of these experiments, the rate distribution shape parameter {alpha} was set to 0.65. For data sets with an extremely large {alpha} parameter (more uniform rates across sites), the difference in modes of each category will be smaller. In this situation, more sequences will be required to accurately determine which rate category is the most appropriate. (2) If the rates are estimated with a data set containing functionally diverse sequences, the presence of site-rate shifts within the data set will cause an averaging of the relative rates and an overestimation of the shape parameter {alpha}, which determines the discrete rate categories available. This is likely to adversely affect the accuracy of rate assignment at sites even if they have not undergone rate shifts.

However, it should be noted that the distribution of rates obtained without assuming a {Gamma} distribution in a number of protein data sets was shown to match the {Gamma} distribution very well (Susko et al. 2003). This suggests that the assumption that site rates are {Gamma} distributed is sound for the modeling of rate heterogeneity.


    Conclusions
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 References
 
Site-rate estimates potentially provide information on the restriction caused by the function of specific sites. We have critically assessed the accuracy of estimates of evolutionary rates and, in a comparative manner, the detection of rate shifts in protein families.

The choice of a number of discrete rate categories has an impact on the results obtained for the estimation of site rates. In small-sized and/or taxonomically broad phylogenetic samples, the apparent resolution achieved by using a large number of rate categories may be misleading. This is so because of the difficulty in correctly estimating rates in the intermediate range.

In practice, the exact intermediate rate of evolution of a site has limited importance for the investigator interested in structure-function relationships; a smaller number of rate categories can be used and still provide adequate resolution. In extreme cases of poor sampling, the rate estimates will be no less informative than the visual annotation of multiple sequence alignments using the simple schemes: constant, conserved, or variable. However, providing that a large enough number of sequences are available, and assuming that all positions in the sequences share homologous functions, maximum likelihood–based rate estimation is a better metric of evolutionary constraints because it explicitly considers phylogenetic relationships.

Conversely, it is inadvisable to include distantly related outgroup sequences and/or sequences with suspected different functions, in an attempt to mitigate sampling error. To minimize the effect of distantly related sequences, it is preferable to separately estimate the rates at sites of robust monophyletic groups each of which is made up of a large number of sequences that are more functionally similar.

Heterogeneity in the rate of substitution within a site can be an informative property: When an alignment contains functionally divergent sequences, rate variation among particular lineages or clades provides critical information regarding the presence of fluctuating evolutionary constraints (Inagaki et al. 2003). Four methods were tested for their ability to identify these sites. The choice of the most appropriate method depends on the type of information that is of interest. The method of Knudsen and Miyamoto (Knudsen and Miyamoto 2001) is excellent at detecting a large number of sites containing rate shifts. On the other hand, it identifies a significant number of false positives.

If the objective of the experiment is to target individual positions (a mutational study for example), the methods of Gu (Gu and Vander Velden 2002) and Susko (Susko et al. 2002) are reliable over a range of confidence levels. However, the confidence level should be chosen prior to running the experiment as to avoid problems of multiple testing. The results presented in table 1 can be used as a guide for making this choice.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 References
 
This work was supported by the "Prokaryotic Genome Diversity and Evolution Project" of Genome Atlantic, Natural Sciences and Engineering Research Council of Canada (NSERC) Operating Grant 227085–00, awarded to A.J.R., and NSERC Discovery Grant 298397–04, awarded to C.B.


    Footnotes
 
Peter Lockhart, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 References
 

    Blouin, C., Y. Boucher, and A. J. Roger. 2003. Inferring functional constraints and divergence in protein families using 3D mapping of phylogenetic information. Nucleic Acids Res. 31:790–797.[Abstract/Free Full Text]

    Bruno, W. J. 1996. Modeling residue usage in aligned protein sequences via maximum likelihood. Mol. Biol. Evol. 13:1368–1374.[Abstract/Free Full Text]

    Dean, A. M., C. Neuhauser, E. Grenier, and G. B. Golding. 2002. The pattern of amino acid replacements in alpha/beta-barrels. Mol. Biol. Evol. 19:1846–1864.[Abstract/Free Full Text]

    Felsenstein, J., and G. A. Churchill. 1996. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93–104.[Abstract]

    Gaucher, E. A., X. Gu, M. M. Miyamoto, and S. A. Benner. 2002. Predicting functional divergence in protein evolution by site-specific rate shifts. Trends Biochem. Sci. 27:315–321.[CrossRef][ISI][Medline]

    Grassly, N. C., J. Adachi, and A. Rambaut. 1997. PSeq-Gen: an application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:559–560.[Medline]

    Gu, X. 1999. Statistical methods for testing functional divergence after gene duplication. Mol. Biol. Evol. 16:1664–1674.[Abstract/Free Full Text]

    ———. 2001. Maximum-likelihood approach for gene family evolution under functional divergence. Mol. Biol. Evol. 18:453–464.[Abstract/Free Full Text]

    Gu, X., and K. Vander Velden. 2002. DIVERGE: phylogeny-based analysis for functional-structural divergence of a protein family. Bioinformatics 18:500–1.[Abstract/Free Full Text]

    Halpern, A. L., and W. J. Bruno. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol. Biol. Evol. 15:910–917.[Abstract]

    Inagaki, Y., C. Blouin, W. F. Doolittle, and A. J. Roger. 2002. Convergence and constraint in eukaryotic release factor 1 (eRF1) domain 1: the evolution of stop codon specificity. Nucleic Acids Res. 30:532–544.[Abstract/Free Full Text]

    Inagaki, Y., C. Blouin, E. Susko, and A. J. Roger. 2003. Assessing functional divergence in EF-1alpha and its paralogues in eukaryotes and archaebacteria. Nucleic Acids Res. 31:4227–4237.[Abstract/Free Full Text]

    Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 8:275–282.[Abstract]

    Knudsen, B., and M. M. Miyamoto. 2001. A likelihood ratio test for evolutionary rate shifts and functional divergence among proteins. Proc. Natl. Acad. Sci. USA 98:14512–14517.[Abstract/Free Full Text]

    Lartillot, N., and H. Philippe. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. 21:1095–1109.[Abstract/Free Full Text]

    Lockhart, P. J., D. Huson, U. Maier, M. J. Fraunholz, Y. Van De Peer, A. C. Barbrook, C. J. Howe, and M. A. Steel. 2000. How molecules evolve in eubacteria. Mol. Biol. Evol. 17:835–838.[Free Full Text]

    Lockhart, P. J., M. A. Steel, A. C. Barbrook, D. H. Huson, M. A. Charleston, and C. J. Howe. 1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol. 15:1183–1188.[Abstract]

    Lopez, P., D. Casane, and H. Philippe. 2002. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19:1–7.[Abstract/Free Full Text]

    Mayrose, I., D. Graur, N. BenTal, and T. Pupko. 2004. Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior. Mol. Biol. Evol. 21:1781–1791.[Abstract/Free Full Text]

    Misof, B., C. L. Anderson, T. R. Buckley, D. Erpenbeck, A. Rickert, and K. Misof. 2002. An empirical analysis of mt 16S rRNA covarion-like evolution in insects: site-specific rate variation is clustered and frequently detected. J. Mol. Evol. 55:460–469.[CrossRef][ISI][Medline]

    Philippe, H., D. Casane, S. Gribaldo, P. Lopez, and J. Meunier. 2003. Heterotachy and functional shift in protein evolution. IUBMB Life 55:257–265.[ISI][Medline]

    Pollock, D. D., W. R. Taylor, and N. Goldman. 1999. Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. 287:187–198.[CrossRef][ISI][Medline]

    Pupko, T., R. E. Bell, I. Mayrose, F. Glaser, and N. Ben-Tal. 2002. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. 18(Suppl 1):S71–77.[Abstract/Free Full Text]

    Pupko, T., and N. Galtier. 2002. A covarion-based method for detecting molecular adaptation: application to the evolution of primate mitochondrial genomes. Proc. R. Soc. Lond. Ser. B Biol. Sci. 269:1313–1316.[CrossRef][ISI][Medline]

    Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235–238.[Abstract]

    Reyes, A., G. Pesole, and C. Saccone. 2000. Long-branch attraction phenomenon and the impact of among-site rate variation on rodent phylogeny. Gene 259:177–187.[CrossRef][ISI][Medline]

    Steel, M., D. Huson, and P. J. Lockhart. 2000. Invariable sites models and their use in phylogeny reconstruction. Syst. Biol. 49:225–232.[CrossRef][ISI][Medline]

    Stoye, J., D. Evers, and F. Meyer. 1998. Rose: generating sequence families. Bioinformatics 14:157–163.[Abstract]

    Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. J. Mol. Biol. 13:964–969.

    Sullivan, J., D. L. Swofford, and G. J. P. Naylor. 1999. The effect of taxon sampling on estimating rate heterogeneity parameters of maximum-likelihood models. Mol. Biol. Evol. 16:1347–1356.[Free Full Text]

    Susko, E., C. Field, C. Blouin, and A. J. Roger. 2003. Estimation of rates-across-sites distributions in phylogenetic substitution models. Syst. Biol. 52:594–603.[CrossRef][ISI][Medline]

    Susko, E., Y. Inagaki, C. Field, M. E. Holder, and A. J. Roger. 2002. Testing for differences in rates-across-sites distributions in phylogenetic subtrees. Mol. Biol. Evol. 19:1514–1523.[Abstract/Free Full Text]

    Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis (1996). Phylogenetic inference. Pp. 00–00 in D. M. Hillis, C. Moritz, and B. Mable, eds. Molecular systematics. Sinauer Associates, Sunderland, Mass.

    Tuffery, P. 2002. CS-PSeq-Gen: simulating the evolution of protein sequence under constraints. Bioinformatics 18:1015–1016.[Abstract/Free Full Text]

    Wang, Y., and X. Gu. 2001. Functional divergence in the caspase gene family and altered functional constraints: statistical analysis and prediction. Genetics 158:1311–1320.[Abstract/Free Full Text]

    Whelan, S., and N. Goldman. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18:691–699.[Abstract/Free Full Text]

    Wichmann, B. A., and I. D. Hill. 1982. An efficient and portable pseudo-random number generator. Appl. Stat. 31:188–190.[ISI]

    Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396–1401.[Abstract]

    ———. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306–314.[ISI][Medline]

    ———. 1995. A space-time process model for the evolution of DNA sequences. Genetics 139:993–1005.[Abstract/Free Full Text]

    ———. 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367–372.[CrossRef][ISI]

Accepted for publication December 6, 2004.





This Article
Abstract
FREE Full Text (PDF)
Correction to PDF
An erratum has been published
All Versions of this Article:
22/3/784    most recent
msi065v1
Alert me when this article is cited
Alert me if a correction is posted
Services
Email this article to a friend
Similar articles in this journal
Similar articles in ISI Web of Science
Similar articles in PubMed
Alert me to new issues of the journal
Add to My Personal Archive
Download to citation manager
Search for citing articles in:
ISI Web of Science (2)
Request Permissions
Google Scholar
Articles by Blouin, C.
Articles by Roger, A. J.
PubMed
PubMed Citation
Articles by Blouin, C.
Articles by Roger, A. J.