*Department of Mathematics and Statistics, Dalhousie University;
Program in Evolutionary Biology, Department of Biochemistry and Molecular Biology, Canadian Institute for Advanced Research, Dalhousie University;
Xeotron Corporation, Houston
![]() |
Abstract |
---|
![]() |
Introduction |
---|
The first type of methodology for detecting rate differences uses distances between rates estimated separately for the two subtrees of interest. Because there is a wide variety of ways in which rate variation can occur (for instance, only small rates might vary between the two subtrees or only large rates might vary) a number of different distances are considered. A parametric bootstrap is used with these distances to determine what types of distances are expected under the null hypothesis that there are no rate differences across the two trees.
An alternative approach for detecting whether rate differences exist comes through a regression analysis of the rate estimates. This is fairly easy to implement and can be used to complement the parametric bootstrap methodology. Under the null hypothesis that the rates are the same at each of the sites, a plot of the rates for one of the trees against the rates at the corresponding site for the other tree should look like scatter around the x = y line. We use orthogonal regression methodology with the rate estimates for the two trees to test whether the mean relationship between the rates satisfies this hypothesis. The estimated coefficients from the regression methods allow for statements about the general differences in the rates.
Given that rate differences exist between subtrees, it is of interest to determine at which sites in the sequences these changes occur. Confidence intervals for the rates at a site in the two subtrees based on the conditional distribution of the rates, given the data at the site, provide one way of detecting where changes that take random variation into account occur. Although confidence intervals for the rate at a site for a tree are of interest themselves, in the present context they are useful to construct confidence intervals for the differences in rates. Sites with confidence intervals for differences that do not contain 0 are likely to correspond to rate differences in the two subtrees.
Confidence intervals can be estimated separately for subtrees or based on a bivariate distribution for the rates in the separate subtrees. The bivariate model considered here is similar to Gu's (2001)
model of functional divergence. Under the Gu (2001)
model, the bivariate distribution of the rates in subtrees is a mixture of a distribution in which the rates are assigned independently to the two subtrees and a distribution in which the same rates are assigned to both subtrees. Here, in contrast, the bivariate distribution for rates is not restricted to any particular form. In addition, the methodology presented here provides confidence intervals for the differences in the rates at sites.
![]() |
Models for Rate Variation |
---|
![]() |
|
Given the estimated tree for a set of data, rate estimates are based on the conditional distribution of rates, given the data at a site. This distribution can be obtained through Bayes' formula:
|
|
In cases where differences in rates in subtrees are of interest, separate rate estimates can be obtained by treating the data in the subtrees separately:
![]() |
Parametric Bootstrap Methodology for Detecting Rate Differences |
---|
The first rate-distance measure is simply the absolute value of the conditional mode rate for position i in the alignment from the first subtree subtracted from the conditional mode rate for the corresponding position in the second subtree. The sum of these distances, summed over all sites, yields the global arsum distance measure
|
|
|
In addition to these three distance measures, the same measures were calculated without the absolute values (designated rsum, lrsum, and brsum, respectively). In principle, changes in rates at sites can be equally distributed across both subtrees (a homogeneous shift) or unequally distributed (nonhomogeneous shift). In the latter case, the rates could be systematically higher in one subtree versus the second subtree. The nonabsolute value measure described here can distinguish between these alternatives. If no systematic shift in rates has occurred, then the summed differences are expected to cancel out and fall within the simulated null distribution of the same measure. By contrast, a systematic shift will yield an imbalance of positive or negative distances that, through summation, will be evident as a large positive or negative value falling outside of the simulated null distribution.
The parametric bootstrapping tests were carried out as follows. For each pair of subtrees, the full data set alignment of the two subdata sets together is initially considered. From this alignment, the parameter,
total, is estimated by maximum likelihood on a neighbor-joining topology using TREE-PUZZLE version 4.02 (Strimmer and von Haeseler 1996
), and maximum likelihood distances for all pairs of taxa are estimated using the PAM +
model (the PAM 001 model was used and the gamma distribution is approximated by an eight-category discrete rate distribution, where the eight categories have equal probability and the rate for each category is set to its mean). From this distance matrix, an optimal topology is estimated using the Fitch-Margoliash weighted least-squares method with 10 random additions with global rearrangements implemented in the PHYLIP 3.5c (Felsenstein 1993
) program FITCH. Maximum likelihood branch lengths for topology are estimated with TREE-PUZZLE with the PAM +
model using
total. From this topology, N data sets of equal size are simulated with the PAM +
model using the program PSeq-Gen (Rambaut and Grassly 1997
). Each of the full data set alignments (observed or simulated) is then split into the two smaller data sets corresponding to the separate subtrees (data set1 and data set2) for separate analysis. For each of the smaller data sets, a maximum likelihood distance matrix is estimated with the PAM +
model using
total. A topology is estimated from this matrix using the neighbor-joining algorithm (implemented in the PHYLIP program NEIGHBOR), and this is used as a user-defined tree for a second round of TREE-PUZZLE analysis. This analysis uses the PAM +
model, using
total to estimate maximum likelihood branchlengths for the NJ topology and the conditional mode rates for the data set. The conditional mode rates for each site are then compared between data set1 and data set2, using the global distance measures described earlier. A shell-script program, COVAR, was written to automate these comparisons.
![]() |
Testing for Rate Differences Using Regression Methods |
---|
Let ri1, ri2 denote the rate estimates at site i for the two trees, and let µi1, µi2 denote the true unobserved rates at the site. We can then define errors in estimation through
![]() |
Under the null hypothesis of interest H0: µi1 = µi2 for all i. If this null hypothesis is true, then
![]() |
![]() |
![]() |
In most regression models of the form (5)
, least squares estimates of (ß0, ß1) are used. These are not, however, appropriate in the current setting. This is because both of the variables in the regression equation are subject to error. As a consequence, least squares estimates of the slope are biased in the direction of 0. Because of the common error process for the two trees (the errors i1 and
i2 are assumed to have the same error distribution), an adjustment is available through orthogonal regression (cf.
12.3 Casella and Berger 1990
, pp. 583594).
Orthogonal regression is best explained by contrast with least squares regression. In both forms of regression, parameters are chosen to make the sum of the distances between the observed rates (ri1, ri2) and points on the regression line as small as possible. The regression methods differ in the way they measure the distance from the observed (ri1, ri2) to the line. As illustrated in figure 1 , least squares estimation uses the vertical distance, whereas orthogonal regression uses perpendicular distance.
|
|
If the distribution of the error terms i1 and
i2 is normal, orthogonal regression turns out to be equivalent to maximum likelihood estimation of the regression parameters. Even if this is not the case, it provides a reasonable methodology for estimating the parameters in the model. A standard error for the estimate
1 is available as
|
The hypothesis that (ß0, ß1) = (0,1) in (eq. 5
) is equivalent to the hypothesis that ß1 = 1 and that the mean rates, averaged over all sites, for the two subtrees is the same. A paired t-test can be used to test the hypothesis that the mean rates are equal. A test of the hypothesis that ß1 = 1 can be constructed from the orthogonal regression estimate and standard error for 1. An approximate P value for the test is
![]() |
![]() |
Confidence Bounds for Rates |
---|
Confidence bounds for the rate at a site for a given subtree is most naturally based on the conditional distribution for rates at a site:
|
|
The confidence bounds for the rates at individual sites in individual subtrees are of interest in their own right but also provide a means for detecting the locations of rate differences. If, for a given site, the confidence bounds for rates in two subtrees do not overlap, a rate difference is suggested. More generally, the confidence intervals for the rates at sites can be used to construct confidence intervals for the rate difference between the two sites. For a given site, suppose that the (1 - /2) x 100% confidence bounds for the rates in two subtrees are [l1, u1] and [l2, u2]. Then a (1 -
) x 100% confidence interval for the rate difference is the set of all difference r1 - r2 with r1 in [l1, u1] and r2 in [l2, u2]; this set is simply [l1 - u2, u1 - l2].
![]() |
Confidence Bounds for Rate Differences |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The usual model for rate variation treats rates at different sites as random variables that are independently distributed from a common rate distribution (cf. Yang 1994
; Felsenstein and Churchill 1996
). For all practical purposes, rate distributions are discrete; they assign probabilities
1, ...,
k to a set of rates r1, ..., rk. For a given tree, and a set of rates r1, ..., rk, the likelihood for the rate distribution parameters
1, ...,
k is obtained by taking the product of the probabilities of data at the individual sites. An estimate of the rate distribution can be obtained by maximizing the likelihood over all
1, ...,
k corresponding to rate distributions that have a mean rate of 1 (cf. Susko et al. 2001
).
The extension of rate distributions for a given tree to separate subtrees would be a bivariate distribution that assigns probabilities to pairs of rates (r1, r2) at a site for the two subtrees. If r1, ..., rk is the set of possible rates for tree T1 and s1, ..., sk is the set of possible rates for tree T2, a bivariate distribution would assign some probability ij to the pair of rates (ri, sj). To ensure that the branch lengths can be interpreted as the expected number of substitutions, the rate distribution should be chosen so that the expected rates, separately, at the two subtrees are 1. Specifically,
|
When estimating rate distributions or constructing confidence intervals for a rate in a given subtree, it suffices to consider only the data in that subtree. In contrast, calculations of the probability of data at a site in a bivariate model require that the data in both subtrees be considered jointly. For a given site and given rates r and s for subtrees T1 and T2, let f (x,y|r,s;T1,T2) be the joint probability of the data x in subtree T1 and data y in subtree T2. The probability of data x and y at a site is then obtained as
|
Given probabilities ij for the pairs of rates (ri, sj), similarly as in the case of a single subtree, the conditional distribution of the pair (ri, sj) is calculated as
|
|
|
The joint probability of the data f (x,y|r,s; T1,T2) is needed in all of the aforementioned calculations. Calculation of this quantity requires an additional branch length for the branch connecting the two subtrees, branches in the two subtrees where the additional branch will join the two subtrees, and a position along the additional branch where the rate change occurs. To avoid the additional computation implied by these additional parameters we use an independence model approximation in practice, replacing f (x,y|r,s; T1,T2) by the product of the separate probabilities of data in the two subtrees: f (x|r;T1)f(y|s;T2). If the branch connecting the two subtrees is relatively long, this should provide a good approximation.
![]() |
Results |
---|
![]() |
Parametric Bootstrap Methodology |
---|
|
|
|
![]() |
Regression Methodology Results |
---|
|
|
![]() |
Confidence Bounds for Rate Differences |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
![]() |
Concluding Remarks |
---|
Failure to reject the hypothesis of significant differences suggests that further analysis is not necessary and effectively controls the type I error rate. Given the rejection of the orthogonal regression or parametric bootstrap tests, confidence intervals can be constructed for rates to locate sites that are likely to have significant rate differences. Confidence bounds for rate differences, calculated after estimating a bivariate rate distribution for the two subtrees of interest, can be expected to produce tighter bounds for the rate differences. To allow for easier computation, an independence assumption was made about the data in the two subtrees. When the branch connecting the two subtrees is relatively long, one can expect the resulting approximations to be reasonable. We expect that sites with low rates in both subtrees will be the ones most significantly affected by a loosening of the independence assumption. In this case, the branch length between the two subtrees for the site is effectively shortened because of the low rates so that the data in the two subtrees become more dependent. Determining how or whether this would affect the confidence interval for the rate difference will require further study.
Our methods have been applied to compare rates at sites for data sets with subtrees from the HBS1 and eRF3 data sets. Although significant rate differences were suggested in the comparison of the rates from the aEF-1 and eEF-1
data sets, the rate differences in the HBS1 versus eRF3 comparisons were marginally significant and insignificant using the parametric bootstrapping test and regression test, respectively. Further analysis is needed to determine what the effects of failing to adjust for rate differences in subtrees might be for phylogenetic estimation.
The likelihood ratio statistic test of Knudsen and Miyamato (2001)
treats the rate difference at a site as a fixed parameter for estimation. It uses only the data at the site and assumes that large sample likelihood theory is applicable. In contrast, the bivariate confidence intervals treat the rate difference as a random variable. The information about the range of likely rate differences contained within the data at a site is obtained by conditioning on the data at a site, but information from other sites is obtained by using the bivariate rate distribution in the calculation of intervals. The bivariate model considered here can be viewed as an extension of the rates-across-sites model where rates are allowed to vary in subtrees of the larger tree. The model is similar to Gu's (2001)
model for functional divergence but places less restrictions on the form of the bivariate distribution. The extension is also similar to the covarion models of Tuffley and Steel (1998), Galtier (2001)
, and Penny et al. (2001)
, which also allow rate variation between subtrees. These covarion models assume a stationary process of rate variation at a site throughout the tree. This differs from the bivariate model considered here in at least two respects. First, the process of rate variation is constant throughout the tree so that the rate distribution or rates-across-sites model for one subtree should be the same as for any other subtree. The bivariate model allows different rate distributions in different subtrees. Second, in the Tuffley and Steel model, rates are allowed to vary within any branch of a subtree. In contrast, the bivariate model assumes a single rate at a site for a given subtree. Nevertheless, the bivariate models considered here can be useful in detecting covarion-type rate variation similar to that of the Tuffley and Steel model. In the Tuffley and Steel model one can think of an average rate at a site for a subtree, where the average is taken over branches of the subtree. Assuming that rates at a site vary randomly throughout a tree, by chance there should be differences in the average rates at a site in any two subtrees of the larger tree. Because the average rates will differ, with sufficient data, the orthogonal regression or parametric bootstrap tests presented here will reject the null hypothesis of a single rate distribution model.
The bivariate model can be extended to allow rate variation in smaller and smaller subtrees of the tree of interest. In the most general case, we could partition the tree into m subtrees, T1, ..., Tm. In a multivariate model for rate variation at a site, a set of rates r1, ..., rm for the subtrees would be drawn from a multivariate distribution that assigns some probability (i1, ..., im) to every possible set ri1, ..., rim of m rates that could arise for the subtrees.
![]() |
Acknowledgements |
---|
![]() |
Footnotes |
---|
Abbreviations: EF-1, elongation factor 1
; aEF-1
, archaebacterial EF-1
; eEF-1
, eukaryotic EF-1
; eRF3, eukaryotic release factor 3.
Keywords: covarion
rates-across-sites
Markov models
maximum likelihood
molecular evolution
phylogenetics
Address for correspondence and reprints: Edward Susko, Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia B3H 3J5, Canada. E-mail: susko{at}mathstat.dal.ca
![]() |
References |
---|
Casella G., R. L. Berger, 1990 Statistical inference Brooks/Cole, Pacific Grove, Calif
Dayhoff M. O., R. M. Schwartz, B. C. Orcutt, 1979 A model of evolutionary change in proteins Pp. 345352 in M. O. Dayhoff, ed. Atlas of protein sequence and structure, Vol. 5(Suppl. 3). National Biomedical Research Foundation, Silver Spring, Md
Felsenstein J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach J. Mol. Evol 17:368-376[ISI][Medline]
. 1993 PHYLIP (phylogeny inference package). Version 3.5c Distributed by the author, Department of Genetics, University of Washington, Seattle
Felsenstein J., G. A. Churchill, 1996 A hidden Markov model approach to variation among sites in rate of evolution Mol. Biol. Evol 13:93-104[Abstract]
Fitch W. F., E. Markowitz, 1970 An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution Biochem. Genet 4:579-593[ISI][Medline]
Galtier N., 2001 Maximum-likelihood phylogenetic analysis under a covarion-like model Mol. Biol. Evol 18:866-873
Gaucher E. A., M. M. Miyamato, S. A. Benner, 2001 Function-structure analysis of proteins using covarion-based evolutionary approaches: elongation factors Proc. Natl. Acad. Sci. USA 98:548-552
Gu X., 2001 Maximum likelihood approach for gene family evolution under functional divergence Mol. Biol. Evol 18:453-464
Inagaki Y., W. F. Doolittle, 2000 Evolution of the eukaryotic translation termination system: origins of release factors Mol. Biol. Evol 17:882-889
Knudsen B., M. M. Miyamato, 2001 A likelihood ratio test of evolutionary rate shifts and functional divergence among proteins Proc. Natl. Acad. Sci. USA 98:14512-14517
Lockhart P. J., D. H. Huson, U. Mairer, M. J. Fraunholz, Y. Van de Peer, A. C. Barbrook, C. J. Howe, M. A. Steel, 1998 A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages Mol. Biol. Evol 15:1183-1188[Abstract]
. 2000 How molecules evolve in eubacteria Mol. Biol. Evol 17:835-838
Lopez P., Forterre P., H. Phillipe, 1999 The root of the tree of life in the light of the covarion model J. Mol. Evol 49:496-508[ISI][Medline]
Miyamoto M. M., W. M. Fitch, 1995 Testing the covarion hypothesis of molecular evolution Mol. Biol. Evol 12:503-513[Abstract]
Penny D., B. J. McComish, M. A. Charleston, M. D. Hendy, 2001 Mathematical elegance with biochemical realism: the covarion model of molecular evolution J. Mol. Evol 53:711-723[ISI][Medline]
Rambaut A., N. C. Grassly, 1997 Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees Comput. Appl. Biosci 13:235-238[Abstract]
Strimmer K., A. von Haeseler, 1996 Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies Mol. Biol. Evol 13:964-969
Susko E., C. Field, C. Blouin, A. R. Roger, 2002 Estimation of rate distributions in phylogenetic models Preprint.
Tuffley C., M. Steel, 1998 Modeling the covarion hypothesis of nucleotide substitution Math. Biosci 147:63-91[ISI][Medline]
Yang Z., 1994 Maximum likelihood phylogentic estimation from DNA sequences when substitution rates differ over sites: approximate methods J. Mol. Evol 39:306-314[ISI][Medline]