Department of Statistics, University of Oxford, Oxford, U.K.
Correspondence: E-mail: miklos{at}stats.ox.ac.uk.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: Stochastic modeling of molecular evolution Structural alignment Maximum Likelihood evolutionary time estimation
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
To date, the canonical model for biological sequence evolution with indels has been the TKF91 model (Thorne et al. 1991). This model describes the evolution of a finite sequence and allows only single-residue indel events. TKF91 alignment therefore uses a global scoring scheme with a linear penalty function for gap sizes. The TKF91 model has been extensively analyzed (Hein et al. 2000) and developed into a multiple alignment algorithm, both in full likelihood (Hein 2001; Lunter et al. 2003) and Markov Chain Monte Carlo (MCMC) settings (Holmes and Bruno 2001).
In contrast to TKF91 alignment, many computational biologists use a local scheme with affine gap penalties. In evolutionary terms, affine gap costs correspond roughly to a geometric length distribution for each single indel event. Previously, the closest evolutionary equivalent to this has been the TKF92 model. In this model, the sequence is assumed to consist of finite-length indivisible fragments, and the indel process acts on fragments rather than residues. This introduces hidden information in the form of fragment boundaries whose locations must be inferred. The realism of these invisible boundaries is questionable, and they may potentially bias multiple alignment (Thorne et al. 1992).
When indels affect single residues only, the fate of each residue in a given sequence is independent: we can chop a pairwise alignment into independently evolving zones by making a cut before every ancestral residue (Thorne et al. 1991). Modeling alignments with long insertion events but single-residue deletions is tractable (if mathematically complex) because each ancestral residue still corresponds to an independent zone (Miklós and Toroczkai 2001). However, finding exact probabilities for alignments with long deletion events is difficult: for any two ancestral residues, there is a finite probability that they will be deleted in a single event, so the alignment cannot be split into independent zones by cutting after each ancestral residue.
Later in this paper (see Models), we present a new long indel model as an alternative to the TKF91 model, introducing a new notation for evolutionary models that we call the rate grammar. We then show (see Algorithm) that, in an idealized model where the evolving sequence is assumed to be part of an infinitely long sequence, the pairwise alignment can still be split into independent zones; not after every ancestral residue, but after every surviving ancestral residue; that is, every column in the alignment that does not contain a gap. Because no deletion event can cross such a boundary, the zones are independent. The probability of observing a zone can be estimated, not analytically (as with the TKF91 and TKF92 models), but by directly summing the probability of short mutation trajectories such as that shown in Figure 1. We have implemented our model and compared its performance to the TKF91, TKF92, and Gotoh algorithms (Gotoh 1982), using for a benchmark the structurally informed alignments from the HOMSTRAD database (Mizuguchi et al. 1998), as described in the section titled Evaluation. Our results, given in Results, show that the long indel model outperforms TKF91 and TKF92 both at alignment and evolutionary distance estimation, and its performance is comparable with that of Gotoh, while providing much more information about sequence comparison (confidence levels). In the Discussion we present applications of our theory.
|
![]() |
Models |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
SID Models
The evolutionary models we consider are continuous-time Markov processes whose state space is the set of all sequences,
=
*. We consider in particular a class of models called substitution/insertion/deletion (SID) models. These models allow local mutations only, including point substitutions and multi-residue indels. The rates of substitution, insertion, and deletion events are given, respectively, by
S(SL,SX,SY,SR),
I(SL,SI,SR), and
D(SL,SD,SR), where SX, SY are the incoming and outgoing sub-stitution sequences (each of length 1), SI,SD are the inserted or deleted sequences, and SL,SR are the sequences flanking the mutation.
Thus, the instantaneous rate R(S,S') with which sequence S mutates to sequence S' is given by
|
A more readable notation for this model is
|
Returning to straightforward SID models, if S,
I, and
D are independent of SL and SR (i.e.,
S
S(SX,SY),
I
I(SI), and
D
D(SD)) then we say that the SID model is context-independent. Note, however, that a context-independent rate grammar does not have quite the same meaning as a context-free grammar in the Chomsky sense, because a context-free grammar would not allow multi-residue deletions, but a context-independent rate grammar allows such deletions as long as they occur at a rate independent of the flanking sequence.
TKF91
The TKF91 model (Thorne et al. 1991) is a reversible context-independent SID model (see above) with the following indel rates:
|
Let (n) be the equilibrium length distribution. For reversibility, detailed balance requires that
(n) = µ
(n + 1) and so
(n) =
n(1 -
) where
=
/µ. That is, the equilibrium length distribution is geometric with parameter
and mean
/(1 -
). The full sequence equilibrium distribution takes the form
.
Because the TKF91 model does not allow multiple residues to be deleted instantaneously, the fates of any two ancestral residues are independent. This permits an exact solution of the transition probabilities by dynamic programming (Thorne et al. 1991).
TKF92
The TKF92 model is a variation of the TKF91 model (Thorne et al. 1992). In TKF92, the sequence consists of fixed-length indivisible fragments, each fragment containing a geometrically distributed number of residues. This is equivalent to replacing the finite alphabet of residues, , with an infinite alphabet of sequence fragments,
*, so that the state space of the model is now
= (
*)*i.e., the set of sequences of sequences. The fragment substitution matrix
S is set up to allow only point substitutions within each fragment.
A TKF92 evolutionary state is not a sequence of residues, but rather a sequence of sequences; the observed, "biological" sequence is recovered by concatenating all the fragments, using the map f(S) = S1S2 ... S|S|. Fragments are not observed, and alignment probabilities, as well as the maximally contributing alignment, are obtained by summing out all possible fragmentations of the observed sequences. (We use the standard Viterbi algorithm for finding the maximally contributing alignment, and the summation over fragmentations is done implicitly by a careful design of the TKF92 HMM.) The advantage of TKF92 is that it allows the simultaneous deletion of multiple residues (in the same fragment) while retaining transition probabilities that can be calculated exactly, as per TKF91 (Thorne et al. 1992). However, this comes at the expense of introducing hidden information that may bias alignment.
Long Indel Model
We return now to models with state space =
*, where the evolutionary state is a sequence (rather than a sequence of sequences, as in TKF92).
The long indel model is a time-reversible, context-independent SID model, as described earlier. It generalizes TKF91 by allowing instantaneous deletion of arbitrarily long subsequences, without requiring that these subsequences form an indivisible fragment.
The long indel model has the following rates:
|
Again, let (n) be the equilibrium length distribution. Time reversibility implies detailed balance, which requires that
k
(n) = µk
(n + k) for all n and k, implying both that
(n) =
n(1 -
) and that
k/µk =
k, where
=
1/µ1. That is, not only is the equilibrium length distribution geometric with parameter
, but so is the ratio
k/µk as a function of k. Again,
.
Note that we have full freedom in choosing the deletion rate function (which then fixes the insertion rates). For simplicity, however, we choose the rate for k-residue deletions to be a geometric function of k with parameter r, so µk = µ(1 - r)2rk-1 and k =
µ(1 - r)2(
r)k-1. The reason for the normalization factor (1 - r)2 is that it makes the total deletion rate per site come out as
, so that µ has the same meaning as in the TKF91 and TKF92 models, where
is the total deletion rate per site. It can be seen that TKF91 emerges as a special case of the geometric long indel model when r = 0.
In sequence alignment terms, with this choice of indel rates our model is the evolutionary analog of Gotoh's affine gap algorithm (Gotoh 1982). Note, however, that, although the size of indel events is geometrically distributed, the size of gaps in an alignment will only be geometrically distributed when the two sequences are sufficiently close so that there is only one expected indel event per observed gap.
In the above treatment, we have adopted a reversible model purely for technical convenience. Compared to irreversible models, reversible models have roughly half as many parameters; furthermore, they permit the use of an unrooted phylogenetic tree, which for us means that we can treat one sequence as the ancestor and the other as descendant with no influence on the outcomes (Durbin et al. 1998).
Although reversibility is a common assumption in molecular evolutionary analyses, we believe that it must be treated with skepticism. Although there is anecdotal evidence that nucleotide substitution processes are often close to reversible (Bruno and Arvestad 1997), there is no reason why this should, in general, be the case. There are many ways in which a realistic indel model could violate detailed balance: for example, insertion events might typically be small and frequent, and deletion events might be rare and large (Holmes and Durbin 1998). Similarly plausible irreversibilities can be conjectured for substitution models, particularly if the substitution rate depends on the sequence context, as, e.g., in CPG depression.
Infinite Sequence
Although the long indel model has some convenient theoretical properties, it also has some that are decidedly inconvenient. For example, consider all deletion events on sequence S that start at residue n; in other words, deletions of the form SLSDSR SLSR where |SL| = n - 1. Call such an event a rightward deletion of residue n. The total rate of all rightward deletions of residue n is
. Clearly this is dependent on n: the total rightward deletion rate is lower when n is near the right-hand end of the sequence, so that the probability that the residue escapes rightward deletion after time t depends on the history of the flanking sequence. This is bad news for dynamic programing alignment methods, which require conditional independence between different parts of the alignment.
It would be convenient to have a rightward deletion rate that is independent of the position along the sequence. This condition is satisfied in infinite sequences, in which the rightward deletion rate of any residue according to the long indel model is a constant, . One way to achieve a position-independent rightward deletion rate is therefore to consider the sequence S as being embedded in an infinite sequence, I = LSR, where L, R
. Here L and R are infinite flanking sequences that we do not observe.
Embedding S in an infinite sequence I has consequences for indels at the ends of S. Consider, for example, a deletion event in I.
|
|
|
|
|
![]() |
Algorithm |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Let us consider the following alignment:
|
|
There are four kinds of chop zone, distinguished by whether they border the left or right flanking sequence, neither of them, or both. The conditional probabilities of observing these chop zones are denoted by Lij, Rij, Nij, and Bij, respectively (see table 1). The indices specify the fate of nucleotides within the chop zone, namely the deletion of i residues, which are replaced by j newly inserted residues. Chop zones that do not border the right flanking sequence (Lij and Nij) furthermore end in a single aligned residue pair; the others do not.
|
|
Finite Trajectory Approximation
To calculate alignment likelihoods, we need to compute probabilities for the four kinds of chop zone listed in table 1. In each case, this involves enumerating the different chains of events, or trajectories, that give rise to the particular outcome, and then calculate the appropriate transition probability for the zone.
Suppose we want to calculate N3,2. If A represents a deleted ancestral residue, M the conserved ancestral residue and B an inserted residue, then the zone starts as the four-residue sequence AAAM and ends as the three-residue sequence BBM. One valid, three-event trajectory generating this outcome is shown in figure 1.
Abstractly, a trajectory is viewed as a sequence of events, and the configuration of the sequence (within the chop zone) in between events is referred to as the state of the sequence at that moment. To calculate a trajectory probability, we need the rate for all events in the trajectory, as well as the exit rate for each state visited (the exit rate of a state is defined as the total rate of mutation events for that statei.e., the sum of the outgoing transition rates). Because the required probability is conditional on no deletion event crossing the left chop zone boundary, this includes only rightward deletions that originate in the current chop zone. However, to calculate the exit rate we must take into account all such deletions, including those that continue into neighboring chop zones on the right. To make sure the last residue pair of a chop zone remains homologous, insertions are assigned to the chop zone of the residue to the right of the insertion, except at the end of the sequence, where it is assigned to the last chop zone. Other insertions (or deletions) are taken not to change the state.
Exact calculation of these probabilities involves solving some partial differential equations for a generating function (Miklós and Toroczkai 2001), and it becomes very complicated, particularly when deletions are involved. Instead of using exact results, we approximated the chop zone probabilities by bounding the number of indel events, and the indel lengths per event. This reduces the infinite sum over all trajectories to a finite sum. We find empirically that, in the parameter regime of the alignments we studied, it is sufficient to allow at most three indel events, and indel lengths of at most 100. (Gaps exceeding 100 residues are extremely rare in the database we used for testing, accounting for less than 0.1% of all gaps, and the probability of more than three indel events overlapping, assuming an indel rate of µ = 0.1 and evolutionary distance 3, is less than (1 - e-3x0.1)4 < 0.005 per site.)
Using this finite trajectory approximation, we can sum the likelihoods of the zone trajectories directly. A recursive algorithm for computing the likelihood of an individual trajectory is given in Appendix A.
Dynamic Programming
To compute the joint likelihood of two sequences, we sum over all chop zone assignments. The following DP algorithm achieves this. Let be the sum of probabilities of all partial alignments of the first i residues of ancestral sequence A with the first j residues of descendant sequence B, where the last characters of the two subsequences are homologous, and conditional on observing the ancestral sequence, then
|
|
|
|
The long indel model can be formulated as a hidden Markov model (HMM) with three states, Initial, Middle, and End, and transitions that correspond to the chop zones (fig. 3). In this formulation, emissions are associated to transitions [the "Mealy machine" view (Durbin et al. 1998)]. The DP algorithm for computing the joint likelihood is then simply the forward algorithm for HMMs, and general HMM algorithms for sampling and finding the most likely path can be used straightforwardly.
|
![]() |
Evaluation |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The alignments in the HOMSTRAD database have sequence similarity ranging from 3% to 99%, with 10% quantiles at 14% and 83%. For the Gotoh alignment algorithm, we used the standard BLOSUM62 matrix, which gave similar but slightly better results than other BLOSUM matrices. We optimized gap opening and extension parameters to maximize the Gotoh algorithm's accuracy on the training set.
To test the probabilistic models, we needed a substitution rate matrix. Using the Dayhoff matrix as initial guess, we obtained maximum likelihood (ML) evolutionary distances t for the training set alignments. We then estimated ML rate matrix by Expectation Maximization (Holmes and Rubin 2002), normalizing the matrix for one expected mutation per site at t - 1. This procedure was iterated until both the evolutionary distances and the rate matrix converged. The resulting rate matrix was used for all probabilistic models.
We obtained the deletion rate parameters, µ and (for the TKF92 and long indel model) r by ML estimation on alignments from the training set, using the time estimates described above. The insertion rate parameter was allowed to vary for each sequence pair to make the expected sequence length equal the average length of the two sequences under consideration. We refer to µ and r as the indel rate parameters.
For the test set, we fixed the indel rate parameters and we performed a ML parameter estimation of the evolutionary distance t for each pair of sequences, summing over all possible alignments. We then recovered to maximum likelihood alignment using the Viterbi algorithm (Durbin et al. 1998).
For the long indel model, we also implemented a program computing the posterior labeling (Durbin et al. 1998) of each aligned column of the Viterbi alignment, that is, the probability of the aligned column given the data.
Alignment software for the long indel model is available on request (miklos{at}stats.ox.ac.uk).
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Maximum likelihood time estimates (MLTEs) of evolutionary separations for the HOMSTRAD pairwise alignments, estimated using a point substitution model, yielded times between 0.014 and 4.23 (in units of expected substitutions per site) with 10% quantiles at 0.37 and 2.26. We refer to these estimates as Homstrad MLTEs. We also calculated time MLTEs for unaligned sequences, using the various indel models to sum over all alignments, referring to these individually as TKF91 MLTEs, TKF92 MLTEs, and Long indel MLTEs, and collectively as model-based MLTEs. The relationship between Homstrad MLTE and long percentage identity was approximately linear (data not shown).
For TKF91, TKF92, and the Long indel model, we obtained evolutionary parameters µ and r by Maximum Likelihood (table 2). In addition, we endowed the Long indel model with a mixed geometric distribution for the indel rates,
|
|
|
The relationships between the model-based and Homstrad MLTEs are shown in figure 4. The TKF91 time estimates often diverged to infinity, probably as a result of a bad model fit. This problem was less pronounced in TKF92, and all but absent for the long indel model. All model-based estimates of divergence times tend to be lower than estimates based on Homstrad alignments, with least-squares slopes (on data with outliers removed) in the range 0.750.78 for the three models, all significantly different from 1. The hypothesis that the slopes for all three models were equal could not be rejected at a 5% level.
|
At higher sequence identity all models perform much better. This can clearly be seen for the long indel model in figure 5, which plots the overlap for the long indel and Gotoh algorithms as a function of Homstrad MLTE.
|
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Our data suggest that the long indel model gives better time estimates than TKF92, which gives better estimates than TKF91. An implication of this result is that the long indel model is preferable for molecular phylogeny. In terms of alignment accuracy, the ordering is TKF91 < TKF92 < long indel Gotoh. The high accuracy of the relatively simple Gotoh algorithm is unexpected, and demonstrates that this algorithm remains a powerful tool for molecular biologists.
Although there is no significant difference in alignment accuracy between Gotoh and the long indel model, the latter provides more information about the alignment. It allows us to compute posterior probabilities ("reliabilities") of individual alignment columns (Durbin et al. 1998). We find that these reliabilities are good predictors of correctness of alignment (see fig. 6). The ability to assign reliabilities to parts of the alignment is a major advantage of using probabilistic, as opposed to score-based, models.
It is interesting to note that the divergence time estimates obtained from the substitution model on HOMSTRAD structural alignments are systematically higher than those obtained from the evolutionary models, which combine substitution and indel events (fig. 4). This is surprising, because heterogeneous mutation rates along the sequence can be expected to bias the estimates from HOMSTRAD alignments toward lower values, because regions experiencing a low mutation rate have more alignable parts. Furthermore, all models investigated estimate divergence times by considering both insertion-deletions and substitutions. Therefore, unlike in score-based alignment methods, aligning more similar amino acids by introducing more gaps does not automatically yield lower time estimates. However, alignments based on structural similarity might not always reflect homology, as small shifts can occur because of the spatial extent and periodicity of structural elements; see figure 7 for an example. This biases the time estimates based on HOMSTRAD alignments toward higher values, and this effect seems to dominate.
|
Our evaluation method used the HOMSTRAD alignments as a guide to trim the sequences to the alignable regions. This means that the absolute alignment accuracies of table 3 are probably overestimates. We chose this method because the alignment algorithms we implemented are global in nature; however, we plan to develop local versions. Note that because of the long indel model's increased indel rate at both sequence ends, this data preparation method puts it at a slight disadvantage compared with the other methods.
Although Gotoh's algorithm has an advantage in our comparison because Gotoh uses optimized parameters, whereas for the evolutionary models we estimate parameters by maximum likelihood, accuracies for the long indel model with estimated parameters closely approach those of Gotoh's algorithm (81.1% vs. 82.2%). This shows the utility of ML parameter estimation for the long indel model. The ability to use such objective parameter estimation methods is a general advantage of probabilistic models over score-based ones.
Using a hybrid method of optimization and curve fitting, we estimated parameters for a mixed geometric indel rate distribution, resulting in a further improved alignment accuracy (to 82.1%). It is interesting that Gotoh reaches this accuracy without such sophisticated models. To test whether Gotoh's success was due to the BLOSUM62 matrix, we tried forcing our long indel algorithm to use BLOSUM62, but this did not perform any better than our ML rate matrix. One factor that might be relevant, however, is that affine gap penalties were used in constructing the HOMSTRAD structural alignments. These particular penalties will bias the resulting gap length distribution, which will naturally favor Gotoh's algorithm.
In general, there may be theoretical limits to the accuracy obtainable with respect to databases such as HOMSTRAD: the homology signal may be too weak in places (Holmes and Durbin 1998), or structural and sequence homology may be mutually inconsistent. Nonetheless, the enhanced performance of the long indel model over TKF91 and TKF92 is grounds for encouragement. Its performance is comparable with that of Gotoh's algorithm. In addition, it allows for inference of evolutionary divergence, and it provides information on the reliability of different parts of the alignment. This information can be useful, e.g., for 3D structural modeling, where reliable regions can be used as a seed for homology modeling.
By sampling alignments in local neighborhoods on a phylogenetic tree, a multiple-sequence Markov Chain Monte Carlo (MCMC) version of our alignment algorithm can be constructed [Holmes and Bruno 2001; Jensen and Hein 2002]. In this context, the trajectory likelihood algorithm of Appendix A will again be useful. As the corpus of data from large-scale sequencing projects grows and our understanding of molecular evolution deepens, the incorporation of this understanding leads to more realistic models, that often cannot be solved analytically. This is especially common when the state space of such models is large, as is the case for whole-sequence models. In such situations, MCMC approaches can be used (Pedersen and Jensen 2001; Robinson et al. 2003). Since the algorithm of Appendix A applies to general discrete-state continuous-time Markov models, it is useful for any MCMC procedure that samples trajectories of such models [Robinson et al., 2003]. So, with our algorithm, it is possible to avoid sampling event times (which are true nuisance variables) and thus to improve the performance of the MCMC procedure.
The long indel model is the first evolutionary model to incorporate realistic long indel events without introducing hidden information. As such, it should find applications in alignment, phylogenetic analysis and profiling.
![]() |
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Because the probability of an event occurring in an infinitesimal time interval is vanishingly small, individual histories have infinitesimal likelihood, but finite likelihood density. Suppose that the model starts in state 0 at time t0 = 0. At time t1, it mutates to state
1; at time t2, it mutates to
2; and so on up to
N at time tN
T. Setting tN+1 = T, the likelihood density of this history for the time interval 0
t
T, conditional on the initial state, is
|
|
|
A subtlety arises in the following analysis if there is degeneracy in the set of exit rates; that is, if any of the rates 0, ... ,
N are equal. To account for this, we suppose that {
0, ... ,
M} is the corresponding set of rates with duplicates removed (so M
N), and let dn + 1 be the multiplicity of the nondegenerate rate
n among the rates
0, ... ,
N; thus, exit rate
0 occurs d0 + 1 times, exit rate
1 occurs d1 + 1 times, and so on; therefore
di = N - M.
As a trial solution, we write the likelihood in the form
|
|
|
|
|
|
|
This algorithm has computational time complexity O(N2). Two special cases are worth noting, because the coefficients can be obtained in closed form. If no two rates n,
m are equal, then dn = 0 and
=
m
n(
m -
n)-1. If all the rates
n are identical, then d0 = N and
= 1/N! while
= 0 for k < N; apart from a factor rn/
n, this is just the Poisson distribution for the number of mutation events N.
The expected amount of evolutionary time spent in a particular state can be found using a variation of this recursion.
Appendix B: Consistency Relations
Because of the complicated combinatorics involved in computing the chop zone probabilities Nij, Lij, Rij, and Bij, it is useful to have some consistency checks:
|
Equations (30) also hold for finite-order approximations; the others are only true for the exact probabilities, but are useful nonetheless to spot errors.
Appendix C: Posterior Likelihood
As with generic HMMs, it is possible to compute the posterior likelihood of particular alignment columns by calculating the likelihood of visiting a certain state using the Forward-Backward algorithm (Durbin et al. 1998). As a consequence of employing the Mealy machine view, to compute the reliability of an unaligned residue, we have to sum over all possible transitions associated to this possibility. Because many transitions may contribute to the emission of an unaligned residue, this is an expensive operation; however, there exists a dynamic programming solution.
Let be the "forward" recursion as given in the section titled, Dynamic Programming, and
the result of the corresponding Backward algorithm. If Ui is the posterior probability that ancestral residue i is unaligned, then
|
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Bruno, W. J., and L. Arvestad. 1997. Estimation of reversible substitution matrices from multiple pairs of sequences. J. Mol. Evol. 45:696-703.[ISI][Medline]
Durbin, R., S. Eddy, A. Krogh, and G. Mitchison. 1998. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, U.K.
Gotoh, O. 1982. An improved algorithm for matching biological sequences. J. Mol. Biol. 162:705-708.[ISI][Medline]
Hein, J. 2001. An algorithm for statistical alignment of sequences related by a binary tree. Pp. 179190 in Pac. Symp. Biocomp., R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, eds, World Scientific, Singapore.
Hein, J., C. Wiuf, B. Knudsen, M. B. Moller, and G. Wibling. 2000. Statistical alignment: computational properties, homology testing and goodness-of-fit. J. Mol. Biol. 302:265-279.[CrossRef][ISI][Medline]
Holmes, I., and W. J. Bruno. 2001. Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics 17:803-820.
Holmes, I., and R. Durbin. 1998. Dynamic programming alignment accuracy. J. Comp. Biol. 5:493-504.[ISI]
Holmes, I., and G. M. Rubin. 2002. An expectation maximization algorithm for training hidden substitution models. J. Mol. Biol. 317:757-768.[CrossRef]
Jensen, J., and J. Hein. 2002. Gibbs sampler for statistical multiple alignment. Technical Report 429, Department of Theoretical Statistics, University of Aarhus, Denmark.
Lunter, G. A., L. Miklós, Y. S. Song, and J. Hein. 2003. An improved algorithm for multiple alignment on arbitrary phylogenetic trees. J. Comp. Biol. 10:869-889.[CrossRef][ISI]
Metzler, D. 2003. Statistical alignment based on fragment insertion and deletion models. Bioinformatics 19:490-499.
Metzler, D., R. Fleißner, A. Wakolbinger, and A. von Haeseler. 2001. Assessing variability by joint sampling of alignments and mutation rates. J. Mol. Evol. 53:660-669.[CrossRef][ISI][Medline]
Miklós, I., and Z. Toroczkai. 2001. An improved model for statistical alignment. Pp. 110. in First workshop on algorithms in bioinformatics. Springer-Verlag, Berlin, Heidelberg.
Mizuguchi, K., C. M. Deane, T. L. Blundell, and J. P. Overington. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7:2469-2471.
Pedersen, A. M., and J. L. Jensen. 2001. A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames. Mol. Biol. Evol. 18:763-776.
Robinson, D. M., D. T. Jones, H. Kishino, N. Goldman, and J. L. Thorne. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol. Biol. Evol. 20:1692-1704.
Thorne, J. L., H. Kishino, and J. Felsenstein. 1991. An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 33:114-124.[ISI][Medline]
Thorne, J. L., H. Kishino, and J. Felsenstein. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J. Mol. Evol. 34:3-16.[ISI][Medline]