* Department of Biomedical Informatics, Columbia Genome Center, and Center for Computational Biology and Bioinformatics, Columbia University, New York; and Department of Ecology and Evolutionary Biology, University of California, Irvine
Correspondence: E-mail: wfitch@uci.edu.
Abstract
Key Words: stochastic context-free grammar, human influenza virus hemagglutinin, stochastic measures of tree similarity, multitype branching processes
Introduction
Materials, Methods, and Results
A hierarchy of formal grammars, suggested by Chomsky (1956, 1959), comprises regular, context-free, context-sensitive, and unrestricted grammars, listed here in order of increasing complexity. Each grammar in the list is a special case of all the grammars that follow it. (For a detailed description of Chomsky's hierarchy, see, e.g., Manning and Schütze [1999].) The computational cost of applying formal grammars rises quickly, from linear for regular grammars, to polynomial for context-free grammars, to exponential for the more complex (context-sensitive and unrestricted) grammars. Context-free grammars are often used in practical applications because they are not only more powerful than regular grammars but also far less computationally expensive than the more complex grammars.
The original purpose of formal grammars was to generate sentences of a human language such as English (see Chomsky [1956, 1959]). Thus, we can think of the viral trees generated by our stochastic grammar as languages or dialects.
Assumptions
Two different rooted trees are perceived as having different shapes for two reasons. First, if we imagine the cactus tree growing from its root, many lineages have one or no descendants; in contrast, in the growing acacia tree, most lineages have one or two descendants. Second, the cactus and acacia trees may have dissimilar branch-length distributions. For example, branches leading to nodes with offspring may be longer than branches leading to childless nodes in the acacia tree but not in the cactus tree. Therefore, we make the following assumptions about the process that we are modeling:
Generation of Tree-Encoding Strings
The model that we describe here is suitable for generating any of the numerous currently available tree formats. For the sake of descriptive simplicity, we have chosen to represent trees in the Newick format (Archie et al. 1986), which is itself an extension of a tree encoding suggested by the nineteenth-century English mathematician Arthur Cayley. For example, the three-species tree shown in figure 2A is represented in the Newick format as "((1:1,2:1):1,3:0.45):1;". The topology of the tree is encoded as "((1,2),3)," indicating that species 1 is grouped with species 2, and species 3 is attached to the 1-2 cluster. Furthermore, the branches of the tree (of lengths 1, 1, 1, 0.45, and 1) are encoded with the set of species corresponding to them: 1, 2, and 3 for the leaves of the tree, and (1, 2) for the only interior branch of the tree, which has the cluster 1, 2 beneath it. This encoding allows us to combine in a single string information about both the tree topology and the branch lengths. The length of each branch is specified immediately after the encoding that corresponds to the cluster underneath that branch. In our tree-generating algorithm, we start by specifying the value of N, the size of the tree, where the size is defined as the length of the longest direct path through the tree starting at the root, or as the number of leaves minus 1.
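As a concrete illustration of this encoding, the following Python snippet (a minimal sketch of our own, not the MatLab programs used in this study; the nested-tuple tree representation and the function name to_newick are our conventions) serializes the three-species example of figure 2A into its Newick string:

```python
# Minimal sketch: serializing a rooted tree into the Newick format described
# in the text. A tree node is either a leaf (name, branch_length) or an
# internal node (list_of_children, branch_length).

def to_newick(node):
    """Return the Newick encoding of a (children_or_name, branch_length) node."""
    content, length = node
    if isinstance(content, list):                      # internal node
        inner = ",".join(to_newick(child) for child in content)
        return "({0}):{1:g}".format(inner, length)
    return "{0}:{1:g}".format(content, length)         # leaf node

# The three-species example from figure 2A: species 1 and 2 form a cluster,
# species 3 joins that cluster; branch lengths as given in the text.
tree = ([([("1", 1), ("2", 1)], 1), ("3", 0.45)], 1)

print(to_newick(tree) + ";")   # ((1:1,2:1):1,3:0.45):1;
```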
[Equation (1)]

[Equation (2)]
As an example, we show the generation of a tree of height 3 (see figure 2B).
We start with a string "G;" (step 1 of the algorithm).
Our Newick tree string is then ready to be visualized (see figure 2B). The advantage of using the Newick format is that we can directly apply various existing programs to convert the tree string into a tree picture. To illustrate the application of this algorithm, let us consider two trees that were generated by two grammars differing only in the values of the parameters α and β (see figure 3). Values of α approaching 1 tend to produce short, bushy trees (see figure 2C), whereas values of β approaching 0.5 tend to produce tall, skinny trees (see figure 2D).
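The sketch below illustrates the flavor of such a generator in Python. It is not the authors' implementation: the exact production rules and probabilities are those of equations (1) and (2), which are not reproduced here, so the four-production scheme over the nonterminal G (with probabilities α, β, β, and 1 - α - 2β) and the exponential branch lengths are assumptions made purely for illustration. Starting from a single nonterminal G (the string "G;" of step 1), the string is rewritten until the size limit N is reached.

```python
# Hedged sketch of a stochastic context-free tree generator. We ASSUME a
# plausible four-production set over the nonterminal G (alpha + 2*beta <= 1):
#   G -> (G, G)   with probability alpha           (two nonterminal offspring)
#   G -> (G, L)   with probability beta            (one nonterminal, one leaf)
#   G -> (L, G)   with probability beta
#   G -> (L, L)   with probability 1 - alpha - 2*beta
# Branch lengths are drawn from an exponential density purely for
# illustration; the paper fits two separate densities, f1 and f2.

import random

def generate(alpha, beta, max_depth, depth=0):
    """Return a Newick subtree string grown from a single nonterminal G."""
    length = random.expovariate(1.0)               # illustrative branch length
    if depth >= max_depth:                         # enforce the size limit N
        return "L:{0:.3f}".format(length)
    u = random.random()
    if u < alpha:                                  # G -> (G, G)
        left = right = True
    elif u < alpha + beta:                         # G -> (G, L)
        left, right = True, False
    elif u < alpha + 2 * beta:                     # G -> (L, G)
        left, right = False, True
    else:                                          # G -> (L, L)
        left = right = False
    kids = [generate(alpha, beta, max_depth, depth + 1) if grow
            else "L:{0:.3f}".format(random.expovariate(1.0))
            for grow in (left, right)]
    return "({0},{1}):{2:.3f}".format(kids[0], kids[1], length)

random.seed(1)
print(generate(alpha=0.9, beta=0.05, max_depth=3) + ";")   # bushy
print(generate(alpha=0.1, beta=0.45, max_depth=6) + ";")   # skinnier
```

With these assumed productions, values of α near 1 make almost every rewrite a two-offspring fork (bushy trees), whereas values of β near 0.5 make single-offspring rewrites dominate (tall, caterpillar-like trees), mirroring the behavior described above.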
Assuming that our tree has M interior nodes, the likelihood (the probability of the data given the model and the model parameters) of our tree under our model can be written in the following way:
[Equation (3)]
To explain how to compute probability (3), let us introduce the notations f1(x) and f2(x), which stand for the "probability that x is sampled from distribution f1" and the "probability that x is sampled from distribution f2," respectively. Then, the likelihood value for each pair of edges can be written in the following way. If both the right and left branches are terminal, the likelihood is
[Equation (4)]
If only one branch is terminal but the other one is not, we have
[Equation (5)]
Finally, if both branches are nonterminal, the expression reduces to
[Equation (6)]
Equation (4) has four terms because, when we work with a real tree, we do not know with certainty which production rule actually generated a pair of terminal tree branches; therefore, we have to sum the probabilities of generating the forklike topological motif over all possible productions for the nonterminal symbol G.
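The following Python fragment sketches how such per-fork terms could be assembled into a tree likelihood. It is only a schematic stand-in for equations (3)-(6): the production probabilities reuse the four-production grammar assumed earlier, the densities f1 and f2 are placeholder exponentials, and, unlike equation (4), each configuration is scored with a single production term rather than a sum over all productions that could have generated the motif.

```python
# Hedged sketch of per-fork likelihood terms. ASSUMPTIONS: the four-production
# grammar assumed above; terminal branch lengths scored under a stand-in
# density f1 and nonterminal branch lengths under a stand-in density f2.

import math

def f1(x, rate=1.0):            # stand-in terminal branch-length density
    return rate * math.exp(-rate * x)

def f2(x, rate=0.5):            # stand-in nonterminal branch-length density
    return rate * math.exp(-rate * x)

def fork_loglik(left_len, right_len, left_terminal, right_terminal,
                alpha, beta):
    """Log-likelihood contribution of one internal node and its two edges."""
    if left_terminal and right_terminal:
        prod = 1.0 - alpha - 2.0 * beta          # G -> (L, L)
        dens = f1(left_len) * f1(right_len)
    elif left_terminal or right_terminal:
        prod = beta                              # G -> (G, L) or (L, G)
        t, n = (left_len, right_len) if left_terminal else (right_len, left_len)
        dens = f1(t) * f2(n)
    else:
        prod = alpha                             # G -> (G, G)
        dens = f2(left_len) * f2(right_len)
    return math.log(prod * dens)

# The tree log-likelihood is the sum of fork_loglik over all M interior nodes.
print(fork_loglik(0.4, 1.2, True, False, alpha=0.6, beta=0.15))
```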
The information provided in the preceding paragraphs should be sufficient to allow anyone proficient in computer programming to implement model analyses with the maximum-likelihood method. Such analyses would produce point estimates of the parameter values. In addition to these estimates, we would like to compute credible intervals (the Bayesian equivalents of confidence intervals in classical statistics; see, e.g., Howson and Urbach [1993]). This computation would allow us to make statements about the significance of the difference in parameter estimates for different viral trees. Alternatively, we could use estimates of the information matrix at the point of maximum likelihood to determine approximate confidence intervals. This second method, however, is known to produce liberal confidence intervals and relies on the assumption that the likelihood function is approximately normal, an assumption frequently violated for real-life sample sizes.
Thus, to determine Bayesian credible intervals, we would like to estimate the posterior density of the model parameters.
Calculation of Posterior Densities Using Markov Chain Monte Carlo
We compute the posterior density using Bayes' theorem:
[Equation (7)]

[Equation (8)]
For large trees, equation (7) is too cumbersome for exact analytical treatment, and researchers commonly use an approximation to the integral. An estimate of an integral of a function can be obtained with Monte Carlo integration. In its simplest version, Monte Carlo integration involves generating random points within a hypercube that encloses a surface of interest (the surface corresponds to the function under the integral sign). By estimating the proportion of points that fall under the surface among the total number of randomly generated points, and by knowing the volume of the hypercube, we can estimate the volume under the surface with arbitrarily high precision.
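As a toy, self-contained illustration of this hit-or-miss scheme (unrelated to the influenza data), the Python snippet below estimates the area under f(x) = x² on [0, 1], whose true value is 1/3, by sampling points uniformly in the enclosing unit square:

```python
# Toy hit-or-miss Monte Carlo integration: estimate the area under y = x**2
# on [0, 1] by counting uniform random points that fall under the curve.

import random

def mc_area(n_points=100000, seed=0):
    random.seed(seed)
    hits = 0
    for _ in range(n_points):
        x, y = random.random(), random.random()   # point in the unit square
        if y <= x * x:                            # falls under the surface
            hits += 1
    return hits / n_points                        # enclosing square has area 1

print(mc_area())   # approximately 0.333
```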
The Markov chain Monte Carlo (MCMC) algorithm is a more computationally efficient version of stochastic integration (Metropolis et al. 1953; Hastings 1970; Gilks, Richardson, and Spiegelhalter 1996). It is based on the observation that Markov chains of the class called positive ergodic Markov chains have the important property of converging to their stationary distribution from an arbitrary starting point in state space. Thus, we can start a time-reversible Markov process at an arbitrary point of the parameter space and then let it run for a large number of iterations; eventually, the stochastic process will reach the condition where all states are visited in proportion to the stationary distribution of the chain. (In this application, we used 100,000 iterations of full updates of the parameter values after the MCMC reached stationarity.) Therefore, if we can design a time-reversible Markov chain that has the required probability distribution as its stationary distribution (in our case, the posterior density of the model parameters), then to estimate that distribution we need only compute the frequency with which each state of the Markov chain occurs in a long simulation.
More specifically, given the current values of the parameters Θ, we need to define a probabilistic way to sample new parameter values Θ*. We chose q(Θ* | Θ) to be a uniform one-dimensional probability function that is symmetric when a parameter value is far from the boundary of its admissible range and asymmetric when the value is close to the boundary. We updated the parameters in a stepwise fashion, such that Θ differed from Θ* by the value of a single parameter. We computed the probability of accepting the new parameter values Θ* given the old values Θ in the following way:
[Equation (9)]
We made the actual decision whether to accept the new state by generating a value of a uniformly distributed random variable; if the generated value was smaller than A(Θ*, Θ), we accepted the new state. L(Θ) in equation (9) stands for the likelihood value computed for the parameter values Θ.
We updated one parameter at a time using the following transition function, which, as mentioned above, is a symmetric uniform far from the boundaries and an asymmetric uniform at the boundaries of the parameter values.
[Equation (10)]
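A compact Python sketch of this sampling scheme is given below. It is not the MatLab implementation used in the study: the proposal half-width, the flat prior, and the Beta(3, 2)-shaped toy likelihood are illustrative choices, but the one-parameter-at-a-time updates, the boundary-clipped uniform proposal, and the Hastings correction for its asymmetry near the boundaries follow the description above.

```python
# Hedged sketch of one-parameter-at-a-time Metropolis-Hastings with a uniform
# proposal window clipped at the admissible bounds (symmetric away from the
# boundaries, asymmetric near them, hence the Hastings correction).

import math
import random

WIDTH = 0.05   # half-width of the uniform proposal window (illustrative value)

def window(value, lo, hi):
    """Proposal window around `value`, clipped to the admissible range [lo, hi]."""
    return max(lo, value - WIDTH), min(hi, value + WIDTH)

def mh_sample(log_lik, init, bounds, n_iter=100000):
    """One-parameter-at-a-time Metropolis-Hastings sampler with a flat prior."""
    theta = list(init)
    chain = []
    for _ in range(n_iter):
        for i, (lo, hi) in enumerate(bounds):
            a, b = window(theta[i], lo, hi)
            new = random.uniform(a, b)
            a_rev, b_rev = window(new, lo, hi)
            proposal = theta[:i] + [new] + theta[i + 1:]
            # likelihood ratio times the Hastings correction q(old|new)/q(new|old)
            log_ratio = (log_lik(proposal) - log_lik(theta)
                         + math.log(b - a) - math.log(b_rev - a_rev))
            if random.random() < math.exp(min(0.0, log_ratio)):
                theta = proposal
        chain.append(list(theta))
    return chain

# Toy target: one parameter on (0, 1) with a Beta(3, 2)-shaped likelihood.
random.seed(0)
chain = mh_sample(lambda t: 2 * math.log(t[0]) + math.log(1.0 - t[0]),
                  init=[0.5], bounds=[(0.001, 0.999)], n_iter=5000)
print(sum(t[0] for t in chain) / len(chain))   # close to the Beta(3, 2) mean 0.6
```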
Data Analysis
We applied our estimation procedure to the same three data sets used by Ferguson, Galvani, and Bush (2003). The data sets A.H1, A.H3, and B contain alignments of 104, 357, and 220 sequences (see figure 1), respectively. Following Ferguson, Galvani, and Bush (2003), we estimated phylogenetic trees with the maximum-parsimony method using PAUP* (Swofford 1996). To root the trees, we used as outgroups the sequences X00027_A/USSR/90/77, A/Oita/3/83, and AB027392_B/Aichi/70/81 for viral subtypes A H1, A H3, and B, respectively. The resulting trees were saved in Newick format and analyzed under the model described in the Introduction. All programs for this analysis were written in MATLAB.
The results of our analyses are shown in figures 3-5 and in tables 1 and 2. In a nutshell, parameter estimates for all three data sets are essentially identical under both versions of our six-parameter model (all differences are nonsignificant). One respect in which two data sets (the A H1 and A H3 subtypes) differ from each other is the branch-length distributions (see figure 3E and table 1). For the A H1 and A H3 subtypes, the branch-length densities f2 have significantly (at the 95% level) different mean values (see table 1). Note also that, for all three data sets, the estimated densities f1 and f2 have significantly different parameter estimates (see figure 5 and tables 1 and 2); this difference indicates that the model in which the tree edge lengths are described by two different distributions agrees well with the data.
Discussion
The suggested methodology is not limited to viral trees or even to trees inferred from gene or protein sequences. It should be applicable to a general class of dendrograms utilized in social and natural sciences. As in the current application, the null hypothesis that two or more trees were generated by the same stochastic grammar is compared with the hypothesis that the different trees were produced by significantly different grammars.
Our model also bears kinship to Galton-Watson trees, in which each node of a growing tree produces k offspring (k = 0, 1, 2, 3, ...) with a probability pk that is the same for all nodes. We can convert our model to a special case of a Galton-Watson tree by setting p2 = α, p1 = 2β, and p0 = 1 - p1 - p2 or, by making the substitution p2 = α², p1 = 2α(1 - α), and p0 = (1 - α)², to a one-parameter version. The most common definition of Galton-Watson trees does not allow for correlations between topological elements and edge lengths, such as we have in our model; however, we can reformulate the Galton-Watson model to account for edge-length variation. Context-free grammars whose productions add large fragments of tree topology in a single step are no longer equivalent to Galton-Watson trees.
To exclude the possibility that our model is too parameter-rich to detect a difference in shape among the influenza trees, we analyzed the simplified one-parameter Galton-Watson model just described, in which p2 = α², p1 = 2α(1 - α), and p0 = (1 - α)², and assumed that all edges of the tree were sampled from the same distribution. We found that this model also failed to detect significant differences among the trees (see table 3). Therefore, we currently cannot reject the hypothesis that all three viral trees considered in this study are identical in terms of overall tree topology (i.e., that they were generated by the same stochastic grammar).
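For concreteness, the following Python sketch simulates the offspring process of this one-parameter Galton-Watson model (topology only; edge lengths are ignored, and the function and variable names are our own):

```python
# Hedged sketch of the one-parameter Galton-Watson offspring scheme used in
# the robustness check: each node independently leaves 2, 1, or 0 descendants
# with probabilities alpha**2, 2*alpha*(1 - alpha), and (1 - alpha)**2.

import random

def offspring_count(alpha):
    """Draw the number of descendants of a single node."""
    u = random.random()
    if u < alpha ** 2:
        return 2
    if u < alpha ** 2 + 2 * alpha * (1 - alpha):
        return 1
    return 0

def grow(alpha, max_generations):
    """Return the number of nodes in each generation of a Galton-Watson tree."""
    sizes = [1]                                   # start from a single root
    for _ in range(max_generations):
        children = sum(offspring_count(alpha) for _ in range(sizes[-1]))
        if children == 0:                         # lineage went extinct
            break
        sizes.append(children)
    return sizes

random.seed(0)
print(grow(alpha=0.8, max_generations=10))   # generation sizes of one simulated tree
```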
Acknowledgements
Footnotes
References
Archie, J., W. H. E. Day, W. Maddison, C. Meacham, F. J. Rohlf, D. Swofford, and J. Felsenstein. 1986. Newick tree format. http://evolution.genetics.washington.edu/phylip/newicktree.html.

Athreya, K. B., and N. Keiding. 1975. Estimation theory for continuous-time branching processes. University of Copenhagen, Institute of Mathematical Statistics, Copenhagen.

Athreya, K. B., and P. Ney. 2004. Branching processes. Dover Publications, Mineola, N.Y.

Chomsky, N. 1956. Three models for the description of language. IRE Trans. Inform. Theory 2:113–124.

Chomsky, N. 1959. On certain formal properties of grammars. Inform. Control 1:91–112.

Ferguson, N. M., A. P. Galvani, and R. M. Bush. 2003. Ecological and immunological determinants of influenza evolution. Nature 422:428–433.

Gilks, W. R., S. Richardson, and D. J. Spiegelhalter. 1996. Markov chain Monte Carlo in practice. Chapman & Hall/CRC, New York.

Harris, T. E. 1963. The theory of branching processes. Springer, Berlin.

Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109.

Howson, C., and P. Urbach. 1993. Scientific reasoning: the Bayesian approach. Open Court, Chicago.

Lange, K., and R. Z. Fan. 1997. Branching process models for mutant genes in nonstationary populations. Theor. Popul. Biol. 51:118–133.

Manning, C. D., and H. Schütze. 1999. Foundations of statistical natural language processing. MIT Press, Cambridge, Mass.

Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21:1087–1092.

Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.

Swofford, D. L. 1996. PAUP*: phylogenetic analysis using parsimony (*and other methods). Sinauer Associates, Sunderland, Mass.