Department of Applied Statistics, University of Reading, Reading, England
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Likelihood methods are model-dependent; they usually provide the optimal approach if the model is adequate, but if an inappropriate model is used, they may well return poor results. If a model ignores some of the information in the data, then the likelihood analysis may be inefficient. This latter case is certainly a worry for many data setsthe commonly used Markov models of evolution describe only the substitution process and ignore insertion and deletion events within the sequences. When analyzing alignments using these models, gaps are generally handled in one of two ways. The first possibility is to ignore any columns containing gaps. The second, used in most likelihood programs, is to treat a gap as an ambiguity character (N). This permits likelihood calculations to be carried out (essentially coding a gap as N has the effect of removing the corresponding branch from the tree). Both of these approaches are suboptimal, since shared patterns of insertions and deletions (indels) may reveal much about the relationships among different sequences (e.g., see the alignment in Williams and Holland [1998]
).
Within the pairwise alignment literature, there has been some work on developing stochastic models of sequence evolution which include indel events. Early work by Bishop and Thompson (1986)
on evolutionary models for aligning two sequences was improved on by Thorne, Kishino, and Felsenstein (1991, 1992)
and Thorne and Churchill (1995)
. Their model allows for base substitution, as well as for single or multiple-base insertions and deletions. To do this, they assume that the DNA or amino acid sequences consist of short blocks (fragments) of one or more bases/residues joined by imaginary links. A birth-death process (Grimmett and Stirzaker 1982
, pp. 157161) operates on the links. If a "birth" occurs, a new link and fragment are inserted, while the "death" of a link means that the link itself and its corresponding fragment are deleted. This model could, in principle, be used to model sequence evolution in a phylogenetic tree, but standard likelihood calculations (summing over all possible data assignments at internal nodes of a tree) would be intractable.
An alternate way of using indels in the estimation of a phylogenetic tree is through the use of hidden Markov models (see, e.g., MacDonald and Zucchini 1997
). Mitchison (1999)
expands on earlier work (Mitchison and Durbin 1995
) to tackle the problem of simultaneously aligning and estimating the phylogeny for a set of sequences. He uses profile hidden Markov models (Krogh et al. 1994
), but instead of concentrating on the emission probabilities (i.e., the probability of observing a given nucleotide, amino acid, or gap character given that the underlying state is either a gap or match state), he focuses on changes in the path though the model. For instance, given two sites with nucleotides, these could be represented as MM (M = match). If over the course of time a deletion occurred at the second site, then the path through the model would become MG (G = gap). The probability of path changes like these, together with the usual nucleotide or amino acid substitution probabilities, is used to evaluate a given phylogenetic tree. This algorithm appears to work well in general but has difficulties with some evolutionary events such as a multiple-base/residue deletion followed by an insertion at short distances.
Some implementations of the parsimony principle for estimating a tree treat a gap as a fifth character in the nucleotide alphabet, which seems an obvious first step to using the information contained within gaps (e.g., Simmons and Ochoterena 2000
). For likelihood analyses, existing models of nucleotide or amino acid substitution could be extended to incorporate this additional state. Extending simple models like the Jukes-Cantor (JC; Jukes and Cantor 1969
) or Felsenstein81 (F81; Felsenstein 1981
) model is straightforward. This idea has been mentioned previously in the literature (e.g., Durbin et al. 1998
, p. 217) yet appears to have been largely dismissed because it assumes unrealistically that occurrences of gaps at adjacent sites are independent. However, this overweighting of gaps may well be an improvement over simply ignoring them. Treating gap characters as an extra state in nucleotide and amino acid models has proved useful in other applications; for example, Koshi and Goldstein (1995)
use a Bayesian formalism to derive amino acid substitution models, including a gap as a 21st character, for different constituent parts of proteins.
This paper describes some extensions to standard nucleotide models which incorporate a gap character and examines their performance on some simulated and real data sets. In addition to identifying the tree favored under a model, it is important to examine the precision allocated to this estimate. One way this may be done is to carry out Bayesian inference, which involves calculating the posterior probability of a tree being the true one given the data. Within the Bayesian framework, a commonly used point estimate of the true tree is the maximum a posteriori (MAP) estimate, while the 95% credible set (the smallest set of trees whose cumulative probabilities sum to 0.95) indicates the precision of this estimate. While the posterior distribution cannot be calculated analytically, it is possible to develop efficient sampling strategies using Markov chain Monte Carlo (MCMC).
MCMC methods are generally used to return the posterior distribution of the parameters of interest (in this case, the tree) given the data. Working with the posterior distribution (proportional to the likelihood times a prior distribution for the parameters) is attractive because posterior probabilities have a simple interpretation which contrasts with the difficulty in interpreting nonparametric bootstrap values (e.g., Efron, Halloran, and Holmes 1996
). Mau, Newton, and Larget (1999)
described a computationally feasible approach for estimating the posterior distribution of trees from relatively large data sets conforming to a molecular clock using MCMC. This was subsequently extended to nonclocklike data by Larget and Simon (1999)
, who also released software (BAMBE) implementing this analysis (Simon and Larget 2000). Compared with standard likelihood and bootstrapping approaches in finding the best tree and measuring its reliability, BAMBE returns the posterior distribution over tree space in a fraction of the time required by bootstrapping. Hence, MCMC seems a sensible approach for evaluating the precision of an estimate.
![]() |
Method |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The models described below are similar to the F84 model of nucleotide substitution (Felsenstein and Churchill 1996
) and its special cases. The F84 model differentiates between transitions and transversions and allows for unequal base composition. It is very similar to the HKY85 model (Hasegawa, Kishino, and Yano 1985
) but, despite having a more complicated instantaneous rate matrix, is mathematically more tractable.
Our extension to the F84 model, the F84+Gaps model, makes the usual sequence evolution assumptions, that sites evolve independently, which means that the likelihood is a product over all the individual site likelihoods. We also assume that sites evolve according to the same evolutionary process, and that character (nucleotides, gaps) frequencies have reached equilibrium. The F84+Gaps model, like the F84 model, is time-reversible, which means that it cannot be used to root a phylogenetic tree unless a further assumption of a molecular clock is made.
Following Felsenstein and Churchill (1996)
, there are two types of events in the F84 model: In a type I event, a nucleotide is replaced by one drawn from its own class (e.g., a purine is replaced by one of the two purines, A and G). In a type II event, any nucleotide replaces the existing one. In both cases, a nucleotide may be "replaced" by the same one, resulting in no change.
To add a gap character to this model, the type II event may be modified so that it becomes instead an event in which any character (nucleotide/gap) replaces the existing one. Assigning rates of and
to each of these events, and letting
i, i
{A, C, G, T, }, represent the stationary frequencies of the characters, the rate matrix Q for this model has the following form:
| (1) |
The substitution probabilities are given by the equation P(t) = exp(Qt) and have the following closed- form solution:
| (1) |
The F84+Gaps model can deal with a higher frequency of transitions and can also model different rates of within-nucleotide and nucleotide-gap changes. It accomplishes this by using the rate parameters and, where applicable,
, and the stationary frequencies of each character. For example, if indels appear to be infrequent (a small number of gaps in the alignment), then the stationary frequency of the gap character will be low, and, hence, so too will the rate of change from a nucleotide to a gap.
Special cases of this model include the K2P+Gaps model (extension of the model described in Kimura [1980]
), which assumes that character frequencies are all equal (to 0.2). In this case, the transition rate,
, corresponds to 0.5
+ 0.2
, while
= 0.2
is the rate of all other changes. Substitution probabilities for this model may easily be found from the F84+Gaps probabilities.
The F81+Gaps model (unequal character frequencies, one rate parameter) assumes that = 0. In this case, the substitution probabilities are given by
| (2) |
The simplest model, JC+Gaps, is a special case of F81+Gaps, since it corresponds to the special case in which the character frequencies are all equal. Thus, the rate of change is =
/5, so the substitution probabilities may be written as
| (3) |
These substitution probabilities may be used in likelihood calculations. Since rates and time are confounded, we follow the usual convention here that the overall rate is scaled to 1. This results in the time (branch length) corresponding to the number of character substitutions that have occurred. Should branch lengths be important, some thought is needed to decide on an appropriate measure. If the researcher is interested in measuring distances between the sequences in the alignment itself, then the branch length estimate from the five-state model may be an appropriate measure, since it takes into account gaps which can be an important source of differences between otherwise closely related sequences. On the other hand, if reference is to be made to previously published work, it may be necessary to ignore the gaps and use the usual (four-state) nucleotide substitution models to estimate the number of nucleotide substitutions per site, this being the most commonly employed measure of distance between DNA sequences.
The main criticism that has been leveled at five- state models is that they treat each site independently. For example, a 20-bp insertion is treated as 20 independent events, implying strong evidence for any phylogenetic split supported by this event. At the other extreme, some approaches remove columns where insertion and deletion events have occurred, thereby removing all of the information contained within the indels and nucleotide substitutions at those sites. A simple ad hoc approach to finding a compromise between these extremes is to downweight the likelihoods at sites containing indel events so that they count approximately as one event. To implement this, for any column in an alignment containing a gap, we sum the lengths of any indels which pass through this site in all sequences. Dividing by the number of sequences with indels at that particular site yields the average indel length at that site. The inverse of this is used to downweight the log likelihood. This quantity is calculated for all sites (columns with no indels receive a weighting of 1), and all of the site log likelihoods are weighted to yield the overall log likelihood. This is still too conservative, however, since it essentially views (for example) shared 3-bp gaps and 25-bp gaps as carrying the same evidence that two sequences are monophyletic.
The Posterior Distribution over Tree Space
Following Bayes' theorem, the posterior distribution over the parameter space is proportional to the likelihood times the prior distribution. The likelihood function for a given tree (topology and branch lengths) may be calculated in the usual way (i.e., using the pruning algorithm described in Felsenstein [1981]
).
Prior distributions are required for all parameters used. Obviously, these include the topology (branching pattern) and the branch lengths. All possible topologies were considered equally likely a priori (this uniform distribution is a common choice; e.g., it is used in BAMBE [see Larget and Simon 1999
]). A weak (i.e., almost flat, therefore vague) gamma prior was placed on the branch lengths which assigned positive probability to a wide range of branch lengths. Alternatives include assuming that branch lengths are uniformly distributed over the positive real line or assuming that they lie uniformly within an interval bounded by 0 and some large number. The latter is a better choice than the former, since it is a proper prior (i.e., integrates to a finite number) but has the disadvantage of imposing an upper limit on the branch lengths. The gamma distribution is both a proper prior and does not impose an upper bound (although in practice, branch lengths will generally lie between 0 and 1, so a large enough upper bound should not materially change the posterior distribution).
Prior distributions may also be required for the model parameters. The five character frequencies must lie between 0 and 1 and sum to 1; an obvious prior for these is a Dirichlet distribution with all five parameters equal to 1. This is a vague prior in that it assumes all possible combinations of frequencies are equally likely. In the F84 model, once it has been assumed that the mean rate is 1, given the character frequencies, a simple relationship exists between and
. Thus, one can be written in terms of the other, so only one prior is required. We chose to represent
in terms of
, with the prior for
being a uniform distribution over the interval (0, 2). Ideally, the upper bound should be the largest value of
consistent with
remaining nonnegative. However, fixing the bound at 2 is easier to implement while allowing a range of transition-transversion bias from none to extremely high.
In this application, the posterior is known only up to the constant of proportionality. While this does not pose a problem for MCMC, it does lead to one drawback. Since the five-state models view a gap as another character, they allow a positive probability for a column consisting entirely of gaps. In practice, this will never be observed, so the correct likelihood would condition on that. Thus, the likelihood of seeing a tree T with all other parameter values (branch lengths, character frequencies, etc.) represented by would be Pr(actual data | T,
)/[1 - Pr(all gaps | T,
)], where Pr(all gaps | T,
) is the likelihood of T and
for a site containing gap characters only. Being unable to make this correction is unlikely to have any noticeable effect on the posterior distribution, since, especially for larger trees, Pr(all gaps | T,
) will be very small.
MCMC Algorithm
The Metropolis-Hastings algorithm (Metropolis et al. 1953
; Hastings 1970
; Chib and Greenberg 1995
) is used here to sample from the posterior distribution of trees. The steps of this algorithm are as follows:
| (2) |
In theory, q may have any form provided only that q(x | y) 0 whenever q(y | x)
0, but in practice it is important to select q carefully so that the chain moves rapidly around tree space yet has a reasonable proportion of candidate points accepted (which is usually achieved by ensuring that sampled points from q tend to be close to xt, yet not so close that it takes many steps to traverse the parameter space). Furthermore, if the parameter space is complex, it is often more efficient to split the parameters into related components and update these one by one (e.g., Gilks, Richardson, and Spiegelhalter 1995
). This has been implemented below with three components. The first consists of the tree (both topology and branch lengths) and the second is the set of character frequencies, while the third is the parameter
from the F84+Gaps model. A suitable proposal distribution for each component is described below.
The proposal distribution for trees adopted here depends on the current tree and changes a small part of that tree. This distribution is symmetric (i.e., q(x | y) = q(y | x)) so that the acceptance probability, ß, does not depend on q. Although our method does not employ the molecular-clock assumption and leads to inferences about unrooted trees, the proposal distribution generates rooted trees for ease of computer coding and likelihood calculations. It is a simple matter to identify the unrooted tree that corresponds to any given rooted tree. The steps to generate a candidate tree given the current tree are as follows:
A Dirichlet distribution is used to generate new values for the character frequencies, thus assuring that they both lie in the interval (0, 1) and sum to 1. The parameters of this Dirichlet distribution are proportional to the current values of the character frequencies, thereby generating new values close to the existing ones. This proposal distribution is not symmetric, so the probabilities q(x | y) and q(y | x) contribute to the Metropolis-Hastings acceptance probability.
To vary the F84+Gaps model parameters, new values are proposed for , the transition rate in the F84 model. Since we assume that the mean rate is 1, given
and the character frequencies,
may easily be determined. To propose a new value for
, a value is selected uniformly from an interval centered around the existing value of
, leading to a symmetrical proposal distribution. Should
be close to its boundaries such that part of this interval lies outside, values can be reflected back into range; in such an event, the proposal distribution remains symmetric.
An initial burn-in period is necessary to ensure that the chain has reached its stationary distribution. During this time, the proportion of each of the components accepted is monitored. If any of these values is too low (e.g., <0.05), then the proposal distributions may be tuned to increase acceptance rates. For example, in the tree proposal algorithm, the value of can be reduced, leading to new branch lengths closer to the existing ones. For the character frequencies, using a higher constant of proportionality should lead to an increased chance of accepting new values, while the interval from which new
values are proposed can be narrowed. Postburn-in, these aspects of the proposal distributions are required to remain fixed.
Since the tree proposal distribution in particular makes only small changes in the tree, successive points in the chain will be highly correlated. To reduce this correlation, the output should be thinned; i.e., only every jth iteration should be stored. Sensible choices of burn- in period and the level of thinning required may be made by running several short chains from different starting points for a particular data set and observing the output.
We have written a C++ implementation of this algorithm, MAC5, which returns a number of files containing different information including the topologies (branching order only) at every jth iteration. This application is available on the World Wide Web at http://www.reading.ac.uk/statistics/genetics/ in the software section. The BAMBE program SUMMARIZE (Simon and Larget 2000
) is used to generate a listing of all topologies in descending order of frequency.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
All data sets were simulated under a simple eight- species bifurcating tree, (((1, 2), (3, 4)), ((5, 6), (7, 8))), with all branches having the same length. The data in the first part of the study were simulated under the JC+Gaps model. The sequences were 1,000 bp long, and three data sets having branch lengths of 0.05, 0.1, and 0.4 were generated.
For the second half of the study, the same tree was again used, but with branch lengths of 0.1, 0.2, 0.3, and 0.4. Nucleotide substitution followed the JC model. The root sequence of the tree was 500 bp long and contained no gap characters. At each site along an evolving sequence, there were three possible events: insertion, deletion, or substitution. If either an insertion or a deletion event occurred at a site, then the additional insertion/ deletion length l was randomly generated from a geometric distribution with parameter 0.5. For a deletion, l + 1 bases, including the current base, were replaced by gap characters. If an insertion occurred, then l + 1 bases were placed in the sequence at that point, with the nucleotides being randomly selected (the JC model assumes that nucleotides occur with equal frequencies). In all other sequences, l + 1 gap characters were added in the corresponding locations to maintain the sequence alignment. The smaller of the insertion and deletion probabilities was on the order of 0.005 per site per branch of length 0.1. Three groups of data were simulated; all consisted of four data sets (the four different branch lengths) with the first setting Prob(deletion) = 2 x Prob(insertion), the second having Prob(deletion) = Prob(insertion), and the third having Prob(insertion) = 2 x Prob(deletion). Part of one of these data sets is shown in figure 1 . Note that the lengths of the alignments returned will be variable; the higher the probability of insertion, the longer the alignment.
|
Following some exploratory runs of the Markov chain for some of the simulated data sets, the burn-in time chosen for BAMBE, MCMC-5*, and MCMC-5*W was 200,000 iterations. Postburn-in, 10,000 values were returned, sampled every 200 iterations.
When implementing MCMC, it is important to carry out diagnostics on the chain to check that it has reached the stationary distribution and that it is mixing well. The former was checked by starting the chain from different points and ensuring that the log posterior probabilities all reached a similar plateau; the latter was checked by examining the log posterior probability and branch length information (average branch length for MCMC-5*, half tree-diameter for BAMBE) to see if these values were mixing well with low amounts of serial dependency (this may be checked by examining the autocorrelation function). All the data sets appeared to have passed burn-in and were mixing well (data not shown). An example of these plots is given in figure 2 for a mitochondrial data set.
|
|
|
Further evidence of the increased precision of the estimates comes from examining the number of trees in the 95% credible sets from substitution (BAMBE) and five-state (MCMC-5* and MCMC-5*W) models. All of the 95% credible sets using the latter models consist of only one tree, while, for larger branch lengths, the nucleotide substitution model sets contain more than one tree.
The comparison between the unweighted and weighted methods using the gap models is quite interesting. MCMC-5*W gives less weight to gaps and may represent an approximately minimum use of the gap information, while MCMC-5* tends to overweight gap characters because it treats each of them as resulting from independent indel events, effectively ignoring the possibility of multiple-site indel events. MCMC-5*W leads to a considerable gain in precision over BAMBE for longer trees. There is a further increase in precision from MCMC-5*W to MCMC-5*, but it is much smaller. This suggests that using the information contained within the gaps can lead to a considerable improvement in inference, while, for data sets with short to medium- length gaps, using a straightforward five-state model may not greatly overestimate the precision of the estimate.
The branch length estimation for longer trees appears to be poor for MCMC-5* and MCMC-5*W, which is not surprising. The longer trees, with their increased probability of insertions and deletions, contain a significant number of sites with matching gap characters. Since the five-state model treats a gap as a character, this fools it into believing that the sequences are more closely related than they actually are.
Real Data
We next applied the model to a set of four mitochondrial D-loop sequences (human, chimp, pygmy chimp, and gorilla) discussed by Saitou and Ueda (1994)
, who noted that in this data set (and in others included in that paper), single-nucleotide insertions and deletions occurred at a considerably higher frequency than multiple insertions/deletions. This suggested that the five-state model was appropriate here. The data set was taken from Foran, Hixson, and Brown (1988)
; leaving out uncertain regions of the alignment, it is 990 bp long.
The F84+Gaps model was applied to this data set, with both the character frequencies and the transition- transversion ratio (via ) being updated alongside the tree in the MCMC algorithm. A burn-in of 200,000 iterations was used, and 100,000 values were returned, sampled every 200 iterations. The top graphs in figure 2
show part of the trace of the log posterior probability and the autocorrelation function of the entire postburn- in log posterior probability. These show that the chain converges quickly and then mixes well.
This phylogeny is easily resolvable into ((chimp, pygmy chimp), human, gorilla); this tree is found in all iterations of the process. This agrees with standard ML analysis (using DNAML), which also finds that the correct tree has a considerably higher likelihood than the alternatives. Figure 2 also shows the marginal distributions of the mean branch length and the transition-transversion ratio as yielded by the MCMC algorithm. For this data set, defining the branch length to be the average number of mutations (substitutions, insertions, deletions) per site may be sensible, since most indels affect one site only and thus could be viewed mathematically as a substitution process. Inclusion of the indels leads to significantly longer branches (an average branch length of 0.0697 compared with 0.0599 in the ML tree under a no-gaps model). There is a transition/transversion bias in this data set, with an estimate of 1.58 for the transition-transversion ratio.
A further primate data set was examined. This consisted of the ß-globin spacer for seven primate sequences (two human alleles, chimpanzee, gorilla, orangutan, rhesus macaque, and spider monkey) and was taken from Maeda et al. (1988)
. This alignment contained a CT-rich region, which was removed. While many insertion and deletion events affected only one site, there were a number of longer events (lengths of 10 or 12 bp). Consequently, both the standard application and the weighted versions of the five-state model were applied to the data.
Again the full F84+Gaps model was used (i.e., both the character frequencies and were updated in the MCMC algorithm). Once again, a burn-in of 200,000 iterations was applied, and 100,000 values were returned, sampled at intervals of 200. The diagnostic plots (data not shown) showed that the Markov chain rapidly converged and thereafter mixed well (the autocorrelation function rapidly tails off to 0).
The MAP estimate returned by both MCMC-5* and MCMC-5*W was the tree (((((human1, human2), chimp), gorilla), orang), macaque, spider monkey), which agrees with the currently accepted phylogeny. The posterior probability was estimated to be 1. BAMBE estimates the posterior probability of this tree as 0.69 (HKY model, kappa = 4.3), which is a considerable reduction in precision. The 95% credible interval returned by BAMBE contained two trees. The other tree exchanges chimp and gorilla in the tree above and has a posterior probability of 0.3.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The main difficulties with this overweighting of gaps will be in sequences with long insertion and deletion events where some support relationships different from those in the correct tree. Treating each position independently results in a great deal of support for the incorrect relationship and may change the tree inferred. A good example of a case in which this occurs is that of the chloroplast data described by Golenberg et al. (1993)
. These data consist of a region of noncoding chloroplast DNA for nine members of the grass family. Golenberg et al. (1993)
found evidence that a limited number of sites were prone to insertion and deletion events and that consequently multiple indels had taken place along the same stretches of DNA. This led to conflicting phylogenetic signal which overwhelmed that from the rest of the (closely related) sequences, resulting in an incorrect tree being inferred (data not shown).
One suggestion to avoid this problem is to treat each indel as being one event, which can be approximated by dividing the site log likelihood by the average gap length at that site. As discussed previously, this may be regarded as roughly the other extreme for treating gapsthey are not ignored, but short and long gaps contribute similar amounts to the likelihood. However, our simulation study suggests that even doing this achieved a considerable gain in the precision of the estimate. This method is also better able to handle data sets such as the chloroplast one; it did not return the tree inferred by the gaps as the MAP estimate, although there was some support for it.
The best set of weights to correctly incorporate gap information using a five-state model may lie somewhere between the two cases presented here (weight each site equally vs. downweight each site likelihood by the average gap length if an indel event passes through it in at least one of the sequences). This possibility is currently being explored.
Of course, a fully statistical method would be preferable to using an ad hoc approach like downweighting gaps. The difficulty here will be in the likelihood computations. Multiple-site indel events clearly violate the assumption that each site evolves independently, suggesting that a model that allows some dependence between neighboring sites is necessary. However, independent sites are the key which makes tree likelihood calculations tractable, so this will not be a straightforward task. The method of Mitchison (1999)
allows the state at a site to depend on whether or not there is a gap at neighboring sites; this may suggest a way to proceed. Implementing more realistic models is the subject of current work.
A further issue is the treatment of gaps at the start or end of a sequence alignment. These leading and trailing gaps do not generally correspond to indel events but are, rather, artifacts of the sequencing procedure (the sequencing procedure does not necessarily start or finish in the same place in all of the sequences). Clearly, these should not be treated in the same way as interior gaps. One suggestion is to code these gaps as missing data (e.g., N instead of ) before carrying out any inference.
One important application of this work is to the gene tree problem (the estimation of the hierarchical relationships among a gene familysequences from the same gene in different species and/or similar genes within an individual). Since there is a limited amount of data (i.e., the gene) available for inferring the gene tree, it is important to use these data as efficiently as possible. Many gene families will have a considerable number of insertions and deletions (a good example is the aforementioned alignment in Williams and Holland [1998]
), so these models could be of use. The principles behind the five-state models can also be applied to models for amino acid replacement, leading to 21-state models.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
1 Keywords: DNA sequences
evolution models
gaps
likelihood
Markov chain Monte Carlo
phylogenetics
2 Address for correspondence and reprints: David Balding, Department of Applied Statistics, University of Reading, P.O. Box 240, Reading RG6 6FN, United Kingdom. d.j.balding{at}reading.ac.uk
![]() |
literature cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Bishop, M. J., and E. A. Thompson. 1986. Maximum likelihood alignment of DNA sequences. J. Mol. Biol. 190:159 165[ISI][Medline]
Chib, S., and E. Greenberg. 1995. Understanding the Metropolis-Hastings algorithm. Am. Stat. 49:327335[ISI]
Durbin, R., S. R. Eddy, A. Krogh, and G. Mitchison. 1998. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, England
Efron, B., E. Halloran, and S. Holmes. 1996. Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. USA 93:1342913434
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368 376[ISI][Medline]
. 1993. PHYLIP (phylogeny inference package). Version 3.5c. Distributed by the author, Department of Genetics, University of Washington, Seattle
Felsenstein, J., and G. A. Churchill. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93104[Abstract]
Foran, D. R., J. E. Hixson, and W. M. Brown. 1988. Comparisons of ape and human sequences that regulate mitochondrial DNA transcription and D-loop DNA synthesis. Nucleic Acids Res. 16:58415861[Abstract]
Gilks, W. R., S. Richardson, and D. J. Spiegelhalter. 1995. Introducing Markov chain Monte Carlo. Pp. 119 in W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, eds. Markov chain Monte Carlo in practice. Chapman and Hall, London
Golenberg, E. M., M. T. Clegg, M. L. Durbin, J. Doebley, and D. P. Ma. 1993. Evolution of a noncoding region of the chloroplast genome. Mol. Phylogenet. Evol. 2:5264[Medline]
Grimmett, G. R., and D. R. Stirzaker. 1982. Probability and random processes. 1st edition. Oxford University Press, Oxford, England
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160174[ISI][Medline]
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97 109
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21232 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York
Kimura, M. 1980. A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111120[ISI][Medline]
Koshi, J. M., and R. A. Goldstein. 1995. Context-dependent optimal substitution matrices. Protein Eng. 8:641645[Abstract]
Krogh, A., M. Brown, I. S. Mian, and D. Haussler. 1994. Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 235:15011531[ISI][Medline]
Larget, B., and D. L. Simon. 1999. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16:750759
MacDonald, I. L., and W. Zucchini. 1997. Hidden Markov and other models for discrete-valued time series. Chapman and Hall, London
Maeda, N., C.-I. Wu, J. Bliska, and J. Reneke. 1988. Molecular evolution of intergenic DNA in higher primates: pattern of DNA changes, molecular clock and evolution of repetitive sequences. Mol. Biol. Evol. 5:120[Abstract]
Mau, B., M. A. Newton, and B. Larget. 1999. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55:112
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21: 10871092
Mitchison, G. J. 1999. A probabilistic treatment of phylogeny and sequence alignment. J. Mol. Evol. 49:1122[ISI][Medline]
Mitchison, G. J., and R. M. Durbin. 1995. Tree-based maximal likelihood substitution matrices and hidden Markov models. J. Mol. Evol. 41:11391151[ISI]
Saitou, N., and S. Ueda. 1994. Evolutionary rates of insertion and deletion in noncoding nucleotide sequences of primates. Mol. Biol. Evol. 11:504512[Abstract]
Simmons, M. P., and H. Ochoterena. 2000. Gaps as characters in sequence-based phylogenetic analyses. Syst. Biol. 49:369381[ISI][Medline]
Simon, D. L., and B. Larget. 2000. BAMBE. Version 2.02 beta. Duquesne University, Pittsburgh, Pa
Thorne, J. L., and G. A. Churchill. 1995. Estimation and reliability of molecular sequence alignments. Biometrics 51: 100113
Thorne, J. L., H. Kishino, and J. Felsenstein. 1991. An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 33:114124[ISI][Medline]
. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J. Mol. Evol. 34:316[ISI][Medline]
Williams, N. A., and P. W. H. Holland. 1998. Gene and domain duplication in the chordate Otx gene family: insights from Amphioxus Otx. Mol. Biol. Evol. 15:600607[Abstract]