Deutsches Krebsforschungszentrum, Theoretische Bioinformatik, Im Neuenheimer Feld 280, Heidelberg, Germany
Duke University, Institute of Statistics and Decision Sciences, Durham, North Carolina
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
If this replacement improves the fitness of the organism the new amino acid will be accepted by natural selection. Replacements of very dissimilar amino acids often drastically change the fold of the protein, leading to a complete loss of function. Hence, such mutations are less often observed than those of similar amino acids, which have only slight effects with respect to fold and function. Therefore, amino acid similarity is reflected in replacement frequencies. It is important to note that actual mutation counts depend not only on these similarities but also on the degree of divergence of the sequences that one compares. We thus need a model which describes protein evolution as a function of time.
Modeling amino acid replacements by a Markov chain has been introduced by Dayhoff, Schwartz, and Orcutt (1978)
. In this approach a set of identical Markov chains acting independently on each site of the protein is used. The time index of the process is interpreted as a measure of evolutionary divergence. The challenge is to estimate the parameters of the process from divergent and time-inhomogeneous sequence data.
In the original approach of Dayhoff, Schwartz, and Orcutt (1978)
the actual estimation is restricted to only very closely related pairs of sequences. However, once a Markov model is fitted to this data, replacement frequencies characteristic for distantly related sequences can be extrapolated from the model.
Dayhoff's approach has been generalized and applied to larger data sets (Gonnet, Cohen, and Benner 1992
; Jones, Taylor, and Thornton 1992
). Furthermore, the advent of large numbers of structurally derived alignments has raised interest in using information also from very distant related alignments (Overington et al 1990
; Risler et al 1988
). However, these authors do not provide an estimation procedure which would account for the time-inhomogeneity of the input data.
Benner, Cohen, and Gonnet (1994)
were the first to point out the problem of estimating one consistent model from an inhomogeneous pool of alignment data. They sketch a normalization algorithm, which is based on computing logarithms of transition matrices which they approximate by power series. The approach is heuristic, as the convergence of the power series cannot be guaranteed for empirically derived matrices.
Müller and Vingron (2000)
present a rigorous estimation procedure which is based on an entirely different mathematical formalism. We refer to this method as resolvent method and briefly review it in the Resolvent Method section. Alternatively, we describe a novel maximum likelihood based approach. The Maximum Likelihood section provides the details of the mathematical formalisms and computations.
In principle, one has two important, although conflicting, criteria for evaluating the quality of the methods. For large-scale applications, time performance of the algorithms is crucial, whereas low statistical efficiency of the estimator can be compensated for by the huge amount of data that is used. On the other hand, if one is restricted to only a small set of input data, the accuracy of the estimator is more important. In the Results section we discuss the individual merits of the resolvent and the maximum likelihood estimator.
![]() |
Model |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
|
|
|
![]() |
The forward and backward equations can be solved under the initial condition P(0) = I and yield
![]() |
The problem of modeling amino acid replacement frequencies requires additional assumptions on the Markov chain. Following Dayhoff, Schwartz, and Orcutt (1978)
we model the evolution of each site of the proteins by a single time-homogeneous Markov chain X(t), calibrated, such that on average 1% of the amino acids are changed after one unit of time: Prob[X(t)
X(t + 1)] = 0.01. Once calibrated, the time t in the Markov chain can be used as a measure of evolutionary divergence. The acronym PAM (Point Accepted Mutations) is commonly used for this unit of divergence.
We assume that any amino acid can change into any other one. This is ensured by the requirement that the exchange rates qij be strictly positive for all i, j. Then there exists a unique limiting amino acid distribution = (
1, ...,
20) where
j = limt
pij(t) > 0 is independent of the initial residue i. The distribution
fulfills the equations
Q = 0 and
P(t) =
for all t
0 and is called the equilibrium distribution. We only consider Markov processes which are in equilibrium. With the transition probabilities (Pt)t
0 and the overall amino acid distribution we calculate the joint distribution mij(t) =
ipij(t) of (Xs, Xs+t). M(t) = mij(t) denotes the probability of finding amino acid ai and amino acid aj aligned with each other in two sequences that are t time units apart.
The Markov chain Xt describes the evolution of a single site in a protein from ancestors to descendants. Whereas data from ancestor sequences are not available, we do observe pairs of proteins that have evolved from a common, though unknown, ancestor. We cannot decide the direction of the mutation, we only observe pairs of corresponding amino acids at certain positions in a protein. It would not be appropriate to model such a direction. Such processes are described by the class of time-reversible Markov chains. This means that the probability of being in amino acid ai and going from ai to aj in time t is equal to that of being in amino acid aj and going from amino acid aj to ai. Consequently, we get the detailed balance equation iqij =
jqji. In particular, M(t) is a symmetric matrix. We use the shortcut Evolutionary Markov Process (EMP) for a process satisfying all the conditions discussed earlier.
The transition and the rate matrix of an EMP have the following mathematical properties. Denote by F the diagonal matrix with entries i. Then F constitutes a symmetric, positive definite matrix and
x, y
F =
x, Fy
defines an inner product. Because of the reversibility, Q and P(t) are selfadjoint relative to
·, ·
F, i.e.,
P(t)x, y
F =
x, P(t)y
F, and can therefore be transformed into diagonal form by a change of coordinates. The eigenvalues of Q are real by selfadjointness, and negative owing to Gershgorin's theorem. Using definition (4) we can rewrite Q and P(t) as
|
![]() |
Estimation Algorithms |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Dayhoff's Method
Dayhoff's method has a strategy consisting of estimating P(1) and then extrapolating to higher PAM distances. She pools input alignments of only closely related sequences. Nevertheless, this data can still be time-inhomogeneous. From this data she derives a calibrated transition matrix using the fact that on small evolutionary distances the calibration can be carried out linearly, as in equation (1)
. It is important to note that Dayhoff intended to use only alignments of very closely related pairs of sequences. There is no theoretical justification for applying it to more divergent input alignments. In fact, from our discussion earlier it becomes clear that linear calibration fails for more divergent data. The computational details of Dayhoff's method are summarized in various textbooks, see e.g., Setubal and Meidanis (1997)
.
Maximum Likelihood
With the enormous number of divergent alignments available today, Dayhoff's approach implies a huge loss of information. Yet it is highly desirable to exploit sequence alignments of widely different evolutionary distances. However, this requires progress in the theory of EMP estimation. An appropriate estimator should account for the evolutionary divergence of each alignment in the data set.
Our procedure is based on a two-step algorithm to improve a given set of rates and an equilibrium distribution. In the first step, the degree of evolutionary distance for each given alignment is estimated under the input parameters. In the second step, new parameters are estimated assuming the divergence times just calculated. This two-step procedure was started using Dayhoff's parameters and then iterated until no change of the parameters could be detected. This iterative approach is discussed in Müller and Vingron (2000)
. The time estimation part is discussed in Barry and Hartigan (1987a, 1987b
); Adachi and Hasegawa (1996)
; Baake and von Haeseler (1999)
; Müller and Vingron (2000)
. For the following exposition we thus focus on the estimation of the rate matrix assuming that the evolutionary distances of all alignments are given.
The main problem in formulating a maximum likelihood estimator is to develop a parameterization for the rate matrix that reflects all requirements for an EMP. In order to specify the EMP, we estimate 210 parameters, namely the equilibrium distribution and the rate matrix Q.
For simplicity we start by formulating the estimator only for a single given alignment with evolutionary distance t. The maximum likelihood method yields the following estimates for and Q:
|
![]() |
|
![]() |
|
|
|
The maximization problem is tackled using standard optimization algorithms (Brent 1973
; Press et al 1990
). To this end, the constrained optimization problem needs to be mapped to an unconstrained one. For example, we map the distribution values to the interval (0,1) using the function x
arctan(x)/
+ 0.5 and we map the positive relative rates to the positive real line by the function x
exp(-x).
Resolvent Method
An alternative method for estimating the rate matrix Q of an EMP is presented in Müller and Vingron (2000)
. It is based on the relation
![]() |
Let n alignments be given and assume that tk is the degree of divergence of the sequences in alignment k. The goal is to estimate an EMP from the alignment data using the distances tk. We first estimate P(tk) by the empirical transition frequencies in the respective alignments. We estimate pij(tk) by counting all occurrences of (ai, aj) and (aj, ai), and then normalizing by the overall frequency of amino acid i. For each alignment, this yields one estimated transition matrix (tk) for each time tk. We want to approximate the integral (R
)ij =
0 e-
tpij(t) dt. This is done using linear interpolation of the pij(tk) and then integrating the piecewise functions. Note that the 20 x 20 entries of the resolvent can be calculated separately and independently of each other. Theoretically, the rate matrix is independent of
, but for empirical integrals it is not. The approximation of the integral (R
)ij =
0 e-
tpij(t) dt is most accurate if the lattice points pij(tk) lie in the high-density region of the integrand. Whether this is the case or not depends on the parameter
. Consequently, a sensitive choice for
is called for. Our approach is likelihood based. With equation (3) we obtain
|
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
We pool alignments of various degrees of divergence and run the estimators on this data, which is then used to reestimate the parameters (Q, ) that are used for alignment generation. In this setup, estimator evaluation is straightforward.
The resulting estimated substitution models can be equivalently represented by the rate matrix Q, any transition matrix P(t), or a matrix of pair frequencies M(t). As the first two display strong diagonal dominance, graphical comparison of them is not appropriate. Instead we choose the M(100) pair frequencies.
We show three simulation results. First, we test the estimators in the case of a small input data set. We reestimate the model parameters from 30 alignments of 300 sites each, where the degree of divergence varies from 10 to 300 PAMs with one alignment for each distance. The results are shown in figure 1 . In this situation the maximum likelihood method is more accurate than either the resolvent method or Dayhoff's method. This is not surprising, because maximum likelihood approaches are known to yield highly efficient estimators. Yet one can clearly improve the accuracy of estimations by using more input data. In the case of protein evolution, tens of thousands of alignments are easily accessible. Theoretically, one would assume that the maximum likelihood estimator would be superior in this setup, but in practice, it is not appropriate because it is computationally too demanding. For real data sets the evolutionary degree of divergence of all alignments is very likely to be different. This is especially bad as it makes likelihood evaluations slow, see equation (10) . In order to simulate the performance of the maximum likelihood estimator on a data set of medium size we use a set containing 30 alignments of 5,000 sites, for distances ranging from 10 to 300 PAM. This yields a large set of observed amino acid pairs but only at 30 different PAM distances. The results are shown in figure 2 . One can clearly observe that the resolvent method catches up when compared to the maximum likelihood method, whereas Dayhoff's method shows the expected bias resulting from ignoring the evolutionary distances.
|
|
|
|
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The EMP model reduces the phenomenon of protein evolution to 210 parameters. It is obvious that this cannot cover the entire complexity of evolution. One assumes that the positions in a protein evolve independently of each other with the same dynamics for each site, which can be modeled by a Markov chain. Of course, it is well known that different sites in a protein may evolve at different speeds and that possibly different replacement mechanisms are operating. All of our assumptions are questionable from a biological point of view. However, from the perspective of data analysis it is obvious that one needs to simplify to make model fitting practical. The challenge is to reflect as much of the reality as possible with 210 parameters.
In 1972 when Dayhoff et al. proposed the first solution to this problem, only a few sequences were available, and homology detection was restricted to relatively closely related pairs of sequences. Clearly, their method was intended for the use of this kind of data. Our use of Dayhoff's method in the context of divergent alignments is not in the sense of its authors and has no theoretical justification. We use it to demonstrate the practical importance of including the divergence parameter into a model.
A natural shortcoming of using only closely related sequence alignments is that the estimator is biased toward the evolution of fast-evolving positions in proteins. Thus, basing the estimation on the large and divergent data set that we used does not only improve the model parameters because of the much larger amount of input data, but might also reflect protein evolution on longer time scales more appropriately.
Our simulation results show that the maximum likelihood estimator is more efficient than the resolvent method. On the other hand, it is restricted to input data sets of moderate size. But more input data clearly gives more accurate estimates for the transition probabilities, which may well compensate for the theoretical suboptimality of the estimator. In principle, we have a trade off between the statistical and the computational efficiency of the estimators. For small data sets we recommend a maximum likelihood approach, whereas the resolvent method is a practical alternative tailored for huge data sets.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Keywords: amino acid replacement
amino acid score matrix
maximum-likelihood
protein evolution
Address for correspondence and reprints: Tobias Müller, Deutsches Krebsforschungszentrum, Theoretische Bioinformatik, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany. t.mueller{at}dkfz.de
.
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Adachi J., M. Hasegawa, 1996 Molphy: (programs for molecular phylogenetics) Version 2.3 Institute of Statistical Mathematics, Tokyo
Baake E., A. von Haeseler, 1999 Distance measures in terms of substitution processes Theor. Popul. Biol 5:166-175
Barry D., J. Hartigan, 1987a. Asynchronous distance between homologous DNA sequences Biometrics 43::261-276[ISI][Medline]
. 1987b. Statistical analysis of hominoid molecular evolution Stat. Sci 2:191-210
Benner S., M. Cohen, G. Gonnet, 1994 Amino acid substitution during functionally constrained divergent evolution of protein sequences Protein Eng 7:1323-1332[Abstract]
Brent R. P., 1973 Algorithms for minimization without derivatives Prentice Hall, Englewood Cliffs, NJ
Dayhoff M., R. Schwartz, B. Orcutt, 1978 A model of evolutionary change in protein Atlas Protein Seq. Struct 5:345-352
Gonnet G., M. Cohen, S. Benner, 1992 Exhaustive matching of the entire protein sequence database Science 256:1443-1445[ISI][Medline]
Jones D. T., W. R. Taylor, J. M. Thornton, 1992 The rapid generation of mutation data matrices from protein sequences CABIOS 8:275-282[Abstract]
Müller T., M. Vingron, 2000 Modeling amino acid replacement J. Comput. Biol 7:761-776[ISI][Medline]
Overington J., M. Johnson, A. Sali, T. Blundell, 1990 Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction Proc. R. Soc. Lond. B 241:132-145[ISI][Medline]
Press W. H., B. P. Flannery, S. Teukolsky, W. T. Vetterling, 1990 Numerical recipes in C Press Syndicate of the University of Cambridge, Cambridge
Risler J., M. Delorme, H. Delacroix, A. Henaut, 1988 Amino acid substitutions in structurally related proteins J. Mol. Biol 204:1019-1029[ISI][Medline]
Setubal J., J. Meidanis, 1997 Introduction to computational molecular biology PWS Publishing Company, Boston