Experimental Phylogeny of Neutrally Evolving DNA Sequences Generated by a Bifurcate Series of Nested Polymerase Chain Reactions

Gerdine F. O. Sanson, Silvia Y. Kawashita, Adriana Brunstein and Marcelo R. S. Briones

Departamento de Microbiologia, Imunologia e Parasitologia, Escola Paulista de Medicina, Universidade Federal de São Paulo, São Paulo, Brazil


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Acknowledgements
 References
 
A known phylogeny was generated using a four-step serial bifurcate PCR method. The ancestor sequence (SSU rDNA) evolved in vitro for 280 nested PCR cycles, and the resulting 15 ancestor and 16 terminal sequences (2,238 bp each) were determined. Parsimony, distance, and maximum likelihood analysis of the terminal sequences reconstructed the topology of the real phylogeny and branch lengths accurately. Divergence dates and ancestor sequences were estimated with very small error, particularly at the base of the phylogeny, mostly due to insertion and deletion changes. The substitution patterns along the known phylogeny are not described by reversible models, and accordingly, the probability substitution matrix, based on the observed substitutions from ancestor to terminal nodes along the known phylogeny, was calculated. This approach is an extension of previous studies using bacteriophage serial propagation, because here mutations were allowed to occur neutrally rather than by addition of a mutagenic agent, which produced biased mutational changes. These results provide for the first time biochemical experimental support for phylogenies, divergence date estimates, and an irreversible substitution model based on neutrally evolving DNA sequences. The substitution preferences observed here (A to G and T to C) are consistent with the high G+C content of the Thermus aquaticus genome. This suggests, at least in part, that the method here described, which explores the high Taq DNA polymerase error rate, simulates the evolution of a DNA segment in a thermophilic organism. These organisms include the bacterial rod T. aquaticus and several Archaea, and thus, the method and data set described here may well contribute new insights about the genome evolution of these organisms.


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Acknowledgements
 References
 
Experimental phylogenetics is a convincing means of understanding basic processes of nucleotide change in phylogenies (Hillis, Mable, and Moritz 1996Citation ). Hillis and collaborators evolved the bacteriophage T7 by sequential propagation and generated a known phylogeny which provided for the first time experimental support for phylogeny inference methods (Hillis et al. 1992Citation ; Bull et al. 1993Citation ). Gene phylogenies can be inferred from sequence data using several algorithms under three basic optimality criteria, namely, parsimony (Fitch 1977Citation ), pairwise distances (Saitou and Nei 1987Citation ), and maximum likelihood (Felsenstein 1981Citation ). Among these, the use of explicit models of sequence evolution, or maximum likelihood, which is computationally very intensive, has been increasing recently because of improvements in computer hardware speed and software optimization (Olsen et al. 1994Citation ; Strimmer and von Haeseler 1996Citation ; Korber et al. 2000Citation ). Gene phylogenies may represent species phylogeny if the substitutions in a particular gene represent orthologous steps (Li and Graur 1991Citation ). Divergence dates of genes and species can also be estimated from phylogenetic distances (Rambaut and Bromham 1998Citation ; Yoder and Yang 2000Citation ). These estimates are based on the concept of a molecular clock (Zuckerkandl and Pauling 1962Citation ), either global or local, which can be tested for a set of sequences, models, and trees, using relative rate tests (Sarich and Wilson 1973Citation ), triplets rate test (Tajima 1993Citation ), and the likelihood ratio test (Felsenstein 1988Citation ).

In the maximum likelihood method for phylogeny inference, the explicit model of nucleotide substitution (the state transition matrix) is of primary importance (Felsenstein 1981Citation ; Posada and Crandall 1998Citation ). This varies from the single parameter Jukes and Cantor model to the general time reversible model (Jukes and Cantor 1969Citation ; Rodriguez et al. 1990Citation ). However, these models assume reversible matrices; in other words, they assume that the probability of the forward change over time (e.g., A to G) is equal to the probability of the reverse event (G to A). Other models, based on irreversible matrices, were proposed, but require the position of the phylogeny root to be inferred by maximizing the likelihood (Yang and Roberts 1995Citation ; Galtier and Gouy 1998Citation ; Schadt, Sinsheimer, and Lange 1998Citation ; Galtier, Tourasse, and Gouy 1999Citation ).

Neutral substitutions are very interesting for phylogenetic purposes, although their use for divergence date estimates and phylogeny inference has not yet been tested explicitly by experimental phylogenetics. In the experimental phylogeny of bacteriophage T7, the accelerated rate of change was induced by the presence of a mutagenic agent which changed not only the tempo, but also the mode of evolution, because the mutagenic agent likely biased substitutions from G to A and C to T (Bull et al. 1993Citation ). Also, in the T7 sequential propagation, lineages that replicate faster or more effectively tend to be overrepresented compared with slower replicating lineages, and therefore the system is not neutral (Hillis et al. 1992Citation ). Neutral substitutions, with proper corrections for superimposed mutations and stochastic variations, likely reflect the elapsed time, and therefore history, since the divergence of two sequences (Li and Graur 1991Citation ). Substitutions in sites with selective pressure mostly reflect the selection regime and thus may cause convergent substitutions, under negative selection, or anomalously long phylogeny branches, under positive selection. Accordingly, divergence dates could be underestimated or overestimated, respectively.

Here we present a method, based on serial PCR, that extends previous studies (Hillis et al. 1992Citation ; Bull et al. 1993Citation ) by generation of a data set of neutrally evolving sequences. Analysis of this data set provided experimental support for maximum likelihood, with modelfit analysis, for finding the correct topology, reconstruction of ancestor sequences, and divergence date estimates. This data set was also used to calculate a probability transition matrix which describes an irreversible dynamics, based on Taq DNA polymerase error rate, and should contribute to further research on coalescence theory (Kingman 1982Citation ), phylogeny inference methods, and evolution of thermophilic organisms (Galtier, Tourasse, and Gouy 1999Citation ).


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Acknowledgements
 References
 
Gene Amplification and Sequencing
The ancestor sequence (Trypanosoma cruzi SSU rDNA, GenBank AF288660) was used as template in a 35 cycle PCR using primers RIBA (5'-CCGAATTCGTCGACAACCTGGTTGATCCTGCCA GT-3') and RIBB (5'-CCCGGGATCCAAGCTTGATCCTTCTGCAGGTTCACCTAC-3') which enables the amplification of the complete SSU rDNA sequence. The amplification started with amplification of 0.35 µg of cloned ancestor sequence (0.102 pmol, 6.14 x 1010 molecules) in a 100-µl solution containing 5 units of Taq DNA pol (GIBCO), 7 mM MgCl2, 40 nmol each deoxynucleotide triphosphates (dNTP), 20 mM tris(hydroxymethyl)aminomethane (Tris)–HCl pH 8.4, 50 mM KCl, and 10 pmol of each primer. After 35 cycles amplicons were purified from gel (GFX, Amersham-Pharmacia) serially diluted 1:1,000, and 4 µl of the diluted solution were used as template for an additional 35 cycles in 100-µl solutions containing 5 units of Taq DNA pol (GIBCO), 7 mM MgCl2, 40 nmol each dNTP, 20 mM Tris–HCl p H8.4, 50 mM KCl, and 10 pmol of each primer (fig. 1 ). Cycling conditions were 94°C for 1 min, 41°C for 1 min, and 72°C for 2 min, and final extension 72°C for 7 min. After each 70 nested cycles the amplicons were cloned into pBluescript, and two randomly selected clones were picked to be the ancestor of another 70 cycles nested for a total of four rounds of 70 nested cycles. This procedure was repeated four times, resulting in four rounds of 70 nested PCR cycles. Amplification of cloned products was done using M13 forward and reverse primers. The nested reaction has 280 cycles, and all 16 terminal sequences coalesce to ancestor sequence 1.1 (fig. 2 ). The 70 cycles from each round were divided into two 35 nested cycles, because at 35 cycles the reaction is close to its linear stage. All 31 sequences derived from the process (16 terminal nodes plus 15 ancestors), along with six additional clones of the first 70 cycles, were determined completely, in a total of 37 complete SSU rDNA sequences (total of 82,806 assembled bases). Sequencing was done using BigDye Terminators (Applied Biosystems) in an ABI377/96 automated sequencer, by primer walking using internal primers of the 18S rRNA gene. Sequences were assembled using PHRED+PHRAP+CONSED (Gordon, Abajian, and Green 1998Citation ) from ABI chromatograms, available upon request. Quality of assembled sequences ranged from phred scores 30 to 40 as estimated by CONSED (Gordon, Abajian, and Green 1998Citation ). Sequences presented here have been deposited in GenBank under accession numbers AF288660 (ancestor 1.1) and AF359461 to AF359496 (from 2.1 to T16). Supplementary data, such as alignments and phylogenies, are available on the World Wide Web site (http://compbio.epm.br/ievol) of one of us.



View larger version (49K):
[in this window]
[in a new window]
 
Fig. 1.—Evolution of DNA sequences by a series of bifurcate PCRs. An ancestor SSU rDNA cloned in pBluescript was used as template for series 1 of 70 nested PCR cycles with M13 primers. After the initial 35 cycles, reaction products were diluted 1:1,000 and used as templates for the subsequent 35 cycles, with rDNA primers RIBA and RIBB. After 70 cycles amplicons were cloned, and two clones were picked randomly and used as templates for the next series of nested PCR cycles. Lineages are propagated at random, and therefore the evolution is neutral and behaves as a stochastic process. Tree nodes T1 to T16 indicate terminal sequences, and 1.1 to 4.8, internal ancestors

 


View larger version (79K):
[in this window]
[in a new window]
 
Fig. 2.—Polymorphic sites of sequences generated by serial PCR neutral evolution. Sequences 1.1 to 4.8 represent the internal ancestors, and T1 to T16, the terminal sequences. Sequences 2.3 to 2.8 were not used in subsequent propagation, except to estimate the Taq DNA polymerase error rate at 70 cycles. Numbers above sequences indicate the position number in the alignment (total number of positions considered is 2,238). Dots indicate residues that are identical to sequence 1.1

 
Phylogenetic Analysis
Terminal sequences were aligned by eye using SEAVIEW sequence editor (Galtier, Gouy, and Gautier 1996Citation ). Trees were constructed using PAUP 4.0b6 (Swofford 1998Citation ) and TREE-PUZZLE (Strimmer and von Haeseler 1996Citation ), and modelfit tests were performed using the hierarchical likelihood ratio test implemented in MODELTEST 3.04 (Posada and Crandall 1998Citation ). Maximum likelihood trees were constructed using the model selected by MODELTEST (TVMef [Posada and Crandall 1998Citation ] with equal base frequencies; the rate matrix A–C = 0.4397, A–G = 13.2362, A–T = 4.9778, C–G = 0.2123, C–T = 13.2362, and G–T = 1.0; proportion of invariable sites = 0; and equal rates for all sites) with heuristic tree-bisection–reconnection (TBR) search. Parsimony (Fitch 1977Citation ) trees were built using accelerated transformation (ACCTRAN) and TBR searching with collapse option. Distance trees were constructed using Neighbor-Joining (Saitou and Nei 1987Citation ) with a maximum likelihood distance matrix using the model selected by MODELTEST. The standard errors of branch lengths were estimated using PAUP 4.0b6 (Swofford 1998Citation ). The number of molecules after 280 cycles was calculated by quantitation of templates and PCR products by absorbance at 260 nm, applying corresponding corrections for dilutions, and conversion to number of molecules using Avogadro's number. Divergence dates with low and high confidence intervals and associated evolutionary rates were estimated using the quartet analysis implemented in QDATE version 1.11 (Bromham et al. 1998Citation ; Rambaut and Bromham 1998Citation ). The number of substitutions of each rate category was directly quantitated from polymorphic sites along the real phylogeny.

Model Construction
The instantaneous rate matrix (Q-matrix) was built from the observed number of changes of each of 16 categories, in the real topology. Each element Qij was calculated by dividing the observed number of changes from nucleotide i to nucleotide j by the total number of changes. The diagonal elements Qii were calculated by using the relation Qii = -{Sigma}j!=i Qij. The Q-matrix was then written as in table 1 .


View this table:
[in this window]
[in a new window]
 
Table 1 Observed Substitutions Along the True Phylogeny (fig. 3A)

 
The substitution probability matrix (P-matrix) was calculated from the relation P(t) = eQt (Swofford et al. 1996Citation ). In order to find the elements of the P-matrix we had to decompose the Q-matrix into its eigenvalues and eigenvectors. The eigenvalues were obtained by calculating the {lambda} values (vn) that satisfied the equation det(Q - {lambda}I) = 0 (where I is the identity matrix), namely, v1 = -0.5187, v2 = - 0.4218, v3 = - 0.0595, and v4 = -5.7958 x 10-6. For each {lambda} value the respective eigenvector (V{lambda}) was determined as that satisfying the equation (Q - {lambda}I)V{lambda} = 0. These vectors are: V1 = (-0.6060,-0.1701, 0.1669, 0.7589)T; V2 = (0.5025, -0.2510, -0.2801, 0.7785)T; V3 = (0.5243, -0.5155, 0.6207,-0.2722)T; and V4 = (-0.4999, -0.5001, -0.4998, -0.5000)T. The Pij(t) elements in the P-matrix were calculated using standard transformation procedures of the Q-matrix, using the previously calculated eigenvalues and eigenvectors. All mathematical calculations were confirmed using MATHEMATICA software (Wolfram Research).


    Results and Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Acknowledgements
 References
 
Here we wanted to simulate neutral evolution and therefore we present a model which explores the high error rate of Taq DNA polymerase (Saiki et al. 1988Citation ) (fig. 1 ). A known SSU rRNA gene sequence was used as the ancestor to start the process of sequential PCR evolution (fig. 1 ). Every 70 cycles the PCR products were cloned, and two clones were completely sequenced and used as templates (ancestors) for the next round of 70 cycles (fig. 1 ). Accordingly, the ancestor 1.1 originated 16 terminal sequences or terminal nodes, T1 to T16, after 280 PCR nested cycles (generations) (fig. 1 ). The full-length sequences of the 16 terminal nodes and the 15 internal nodes (2.1 to 4.8, fig. 3A ), or ancestors, were determined. The alignment of all 37 sequences used here revealed 196 polymorphic sites, including gaps, and the alignment of the 16 terminal sequences had 169 polymorphic sites (fig. 2 ). After PCR evolution of 280 generations (fig. 1 ), the total number of molecules, 9.92 x 1011, was estimated from PCR product quantitation. Sequences 2.3 to 2.8 were used only to calculate the mutation rate of Taq DNA polymerase after 70 PCR cycles, as described later.



View larger version (28K):
[in this window]
[in a new window]
 
Fig. 3.—Comparison of real phylogeny with inferred maximum likelihood phylogeny (Felsenstein 1981Citation ; Posada and Crandall 1998Citation ; Swofford 1998Citation ). The serial PCR in vitro evolution resulted in the topology depicted (A) with varying branch lengths whose ancestors (1.1 to 4.8, circled) and terminal sequences (T1 to T16) were sequenced in full length. Scale bar indicates the number of cycles between tree internodes and nodes. The inferred phylogeny (B) has a topology identical to the real tree (A) and 9 out of 30 branch lengths were estimated correctly. Boxed numbers indicate branch lengths (number of substitutions), numbers in italics represent the percentage of a given cluster in 100 bootstrap replicates, with reestimation of parameters at each bootstrap replicate (top), and without reestimation at each replicate (bottom). Numbers below arrows indicate the estimated divergence (cycles ago), with the low–high confidence interval range (in parenthesis) as calculated by maximum likelihood quartet analysis (Rambaut and Bromham 1998Citation ). Numbers of substitutions, with corresponding standard errors, in the inferred tree (B) were calculated by multiplying the branch lengths (in substitutions per site) by the total number of positions (2,238 bp)

 
The real phylogeny obtained (fig. 3A ) has a series of 15 dichotomies from the initial ancestor to the 16 terminal sequences, and substitutions that occurred along the phylogeny involved 7.6% of the total number of positions (table 1 ). We observed that the number of substitutions per time interval (along a branch length) varies significantly. This might be due to stochastic effects once lineages are sampled and propagated randomly. The 16 terminal sequences (T1 to T16) were used to reconstruct the real phylogeny by maximum likelihood, parsimony, and neighbor-joining (Fitch 1977Citation ; Felsenstein 1981Citation ; Saitou and Nei 1987Citation ). Topologies obtained by the three methods were identical and found the real topology. In fig. 3B we show the reconstructed maximum likelihood phylogeny which has a topology identical to the true tree (fig. 3A ). This phylogeny has ln(Likelihood) = -4259.7384 and was inferred using a submodel of the General Time Reversible Model (TVMef) with equal rates for all sites, with parameter determination by hierarchical likelihood ratio tests (Rodriguez et al. 1990Citation ; Posada and Crandall 1998Citation ). The best parsimony tree found had 82 informative sites, 157 steps, rescaled consistency index of 0.962, homoplasy index of 0.038, and ln(Likelihood) = -4259.7384, but was not statistically different from the six other parsimony trees with up to 160 steps, as tested by the Kishino–Hasegawa topology test (Kishino and Hasegawa 1989Citation ). The maximum likelihood tree differs from the real tree by 30% (9 out of 30) with respect to branch lengths, particularly in clusters where the number of substitutions where higher (fig. 3 ). However, the differences in branch lengths between the known phylogeny (fig. 3A ) and the inferred phylogeny (fig. 3B ) are within the range of the maximum likelihood standard error for branch lengths, as estimated in the inferred tree (fig. 3B ), and therefore, these differences were expected by the inference method. In addition to MODELTEST, optimization of model parameters was also done with simultaneous estimation of tree searching. For this, an initial topology was inferred by neighbor-joining with simultaneous estimation of nucleotide frequencies and the rate matrix (R-matrix). This initial topology was then rearranged by TBR with simultaneous estimation of the gamma distribution rate parameter and the proportion of invariant sites, and reestimation of nucleotide frequencies and the R-matrix. This approach is as close as possible, in the scope of the present study, to escape from point estimates of model parameters, as done by MODELTEST, and to letting parameters behave freely during tree searching. The tree obtained by this heuristic full optimization (ln[Likelihood] = -4254.74116) found that the real topology (fig. 3A ) and the parameters were very similar to those obtained by MODELTEST, except in the proportion of invariant sites, which in the full optimization was 0.43. At least in the case of the sequences presented here, MODELTEST gave results similar to those of the heuristic full optimization. A complete full optimization would have to be done using either an exhaustive search of all trees (more than 2 x 1014 trees for 16 taxa) or an exact algorithm, such as branch-and-bound, with simultaneous parameter optimization, and this would take an impractical computational time.

To verify the evolutionary rate of the real phylogeny (Taq DNA polymerase error) we used eight sequences, 2.1 to 2.8 (fig. 2 ), of the first 70 nested cycles. Among 2,238 positions, we observed 44 polymorphic sites, being 2 insertions, 4 deletions, and 48 base substitutions. The average change from the ancestor 1.1 to sequences 2.1 to 2.8 (fig. 2 ) is six misincorporations (excluding indels), which gives 0.0027 errors per position and an evolutionary rate of 0.38 x 10-4 substitutions per site per generation. This is higher than other estimates for Taq DNA polymerase (0.27 x 10-4) (Bracho, Moya, and Barrio 1998Citation ) and might be because of the concentration of MgCl2 (7 mM) used in our study, in order to accelerate the substitution rate. The substitutions at terminal sequences are evenly distributed along the SSU rRNA sequence length (table 1 ) which demonstrates that we generated a neutral evolution simulation. Therefore, selection is not responsible for the branch length variation observed in the real tree (fig. 3A ).

We also tested if divergence times could be estimated from inferred phylogenies. The 16 terminal sequences were analyzed by a maximum likelihood quartet method (Rambaut and Bromham 1998Citation ), using the same substitution model and parameters used to infer the maximum likelihood phylogeny (fig. 3B ). All terminal pairs diverged 70 cycles earlier and were clustered into quartets to infer the divergence of internal nodes. Divergence times of ancestors 1.1, 2.1, 2.2, 3.2, and 3.4 were estimated correctly, and ancestors 3.1 and 3.3 had dates outside the 95% confidence interval. The inferred tree, however, passes the likelihood ratio test for molecular clocks (Felsenstein 1988Citation ) as the difference between the likelihoods of the molecular clock constrained tree and the unconstrained tree is not statistically significant at 95% confidence level (2 x {Delta}LnL = 21.2818, and the Chi-square critical value for 14 degrees of freedom at P < 0.05 is 23.685). The maximum likelihood quartet analysis also estimated the evolutionary rate between 0.24 x 10-4 and 0.42 x 10-4 substitutions per site per generation.

Ancestor sequences 1.1 to 4.8 were also predicted from the maximum likelihood tree (hypothetical ancestors, HA) and compared with ancestors from the real phylogeny (fig. 4 ). Ancestors of the real phylogeny were reconstructed with accuracy from 99.46% to 99.87% (3 to 12 differences) by maximum likelihood. Most of the inaccurately assigned ancestor states were in regions with insertions and deletions and in two positions with two substitutions at the same site, positions 101 and 637 (fig. 4 and table 1 ). The reconstruction with the most errors (12) was ancestor 1.1, and the most accurate reconstructions were from ancestors 3.4, 4.7, and 4.8 (fig. 4 ). This suggests that as we move deeper in the tree, ancestor sequence reconstructions might be more sensitive to insertion and deletion events and multiple substitutions at the same site. Insertions and deletions occurred much more frequently after runs of homopolymeric regions, as observed elsewhere, and might be caused by the slippage of Taq DNA polymerase (Bracho, Moya, and Barrio 1998Citation ).



View larger version (49K):
[in this window]
[in a new window]
 
Fig. 4.—Comparison between ancestor sequences in the real phylogeny (fig. 3A ) and maximum likelihood reconstruction of ancestral states, HA. Only polymorphic positions are shown. Dots indicate residues identical to sequence 1.1. Numbers above the alignment indicate the position numbers. Numbers in parenthesis indicate the differences between the HA and their corresponding real ancestor sequences

 
The substitution model selected by the hierarchical likelihood ratio test was compared with the actual changes (table 2 ). It shows that the changes in the real tree follow an irreversible model, particularly when A > G-G > A, A > T-T > A, and C > T-T > C changes are compared. Base frequencies in the data set (16 terminal sequences) are f(A) = 0.24938; f(C) = 0.22548; f(G) = 0.26574; and f(T) = 0.25941; among constant sites they are (0.24374, 0.23225, 0.27171, 0.25230) and among variable sites (0.32895, 0.12972, 0.18143, 0.35990). This suggests that the direction of change is not biased toward more abundant residues in the nucleotide pool.


View this table:
[in this window]
[in a new window]
 
Table 2 Rate-Matrix Selected by Modelfit Analysis (reversible) Compared with the Rate-Matrix Observed from Real Data (in parenthesis)

 
To depict the probability of substitutions along the real phylogeny, the elements of the instantaneous rate matrix, the Q-matrix (table 2 ), were used to assemble the following equations:


where P(A), P(C), P(G), and P(T) are the probabilities of observing the corresponding nucleotides at a given site after a time interval dt.

To construct the substitution probability matrix, the P-matrix (table 3 ; table 2 , in italics), the Q-matrix was decomposed into its eigenvalues and eigenvectors and used in calculations as described in Materials and Methods. The resulting P-matrix (table 3 ) is a stochastic Markov model (Swofford et al. 1996Citation ) which is also irreversible, as are other models proposed (Yang and Roberts 1995Citation ; Galtier and Gouy 1998Citation ; Schadt, Sinsheimer, and Lange 1998Citation ; Galtier, Tourasse, and Gouy 1999Citation ). Another variation in the models is regarding their homogeneity along the phylogeny (Galtier and Gouy 1998Citation ; Galtier, Tourasse, and Gouy 1999Citation ). To verify whether the model is homogeneous or nonhomogeneous along the known phylogeny (fig. 3A ), we compared the rate matrices (R-matrix) after 280 cycles with the rate matrix after 70 cycles (based on sequences 2.1 to 2.8). The R-matrix at 70 cycles (not shown) is nearly identical to the R-matrix after 280 generations (table 2 ) which suggests that the model of sequence evolution is homogeneous along the know phylogeny in fig. 3A.


View this table:
[in this window]
[in a new window]
 
Table 3 Substitution Probability Matrix Elements of the P-matrix, Derived from Observed Substitutions Along the Real Phylogeny (fig. 3A)

 
The experimental approach to phylogeny inference by a biochemical assay, as presented here, is advantageous over plain computer simulations, because the introduction of variations in sequences along the phylogeny is driven by errors of an enzymatic reaction, which is a better approximation of the real event. The process of substitution along a phylogeny is not understood in its full extent, and therefore, it is very unlikely that a computer simulation will include all the variations and nuances underlying this process. Consequently, computer simulations will be dependent on the particular parameters and bias that the programmer chooses to include.

The results presented here, using a biochemical assay, show that the phylogeny topology was reconstructed correctly, even with differences between the observed model and the maximum likelihood modelfit selected model (table 2 ). However, inference of branch lengths, divergence dates, and hypothetical ancestors are more sensitive to stochastic rate variations which should be corrected by using a more accurate model. Also, the variation in branch lengths, meaning variation in evolutionary rates, observed in the true tree (fig. 3A ) occurred in the absence of selection. Interestingly, the molecular clock is not rejected by the likelihood ratio test of the inferred phylogeny (fig. 3B ) (Felsenstein 1988Citation ), which suggests that even with considerable variation of rates among lineages, at least to the extent shown here, divergence date estimates from phylogenies are accurate (Bromham et al. 1998Citation ; Rambaut and Bromham 1998Citation ). We employed a strategy analogous to the one used by Hillis et al. (1992)Citation . However, in our study the number of molecules at each stage and the exact number of generations (PCR cycles) are known. Because we simulated neutral evolution of an SSU rDNA, changes in the secondary structure of the gene product did not interfere with our in vitro evolutionary process. The SSU rRNA in vivo has a functionally conserved secondary structure, and most substitutions tend to occur within regions of unpaired loops (Hillis and Dixon 1991Citation ).

The serial PCR method described here could be used in studies of evolution of thermophilic organisms. The substitution bias observed here (A > G and T > C, table 1 ) is consistent with the high G + C content of the Thermus aquaticus genome, and might be a consequence of specific properties of Taq DNA polymerase. However, the polymerase domains of Taq DNA polymerase and the Klenow fragment of Escherichia coli DNA polymerase I are nearly identical. In Taq DNA polymerase, two of the catalytically critical carboxylate residues on the 3'–5' exonuclease activity are missing (Kim et al. 1995Citation ). The simulation described here might well reflect the neutral evolution of a DNA segment in a thermophilic organism, such as in the bacterial rod T. aquaticus and in several Archaea, which implies that the method presented here could be developed to address questions about the genome evolution of these organisms (Woese 1987Citation ; Pace 1997Citation ).

As a perspective, the serial PCR evolution method and the data set presented here can contribute to future studies on coalescence theory (Kingman 1982Citation ) and to divergence date estimate methodology (Rambaut and Bromham 1998Citation ; Yoder and Yang 2000Citation ), because the number of generations, the phylogeny, and the mutation rate per generation are known. The in vitro generation of a phylogeny with no selection and no migration should be particularly useful for estimating the {theta} parameter (Kuhner, Yamato, and Felsenstein 1995Citation ). Nevertheless, this study provides for the first time biochemical experimental support for phylogeny inference from neutral substitutions using maximum likelihood with modelfit optimization.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Acknowledgements
 References
 
We thank J. F. Perez, Scientific Director of Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) for sequencing equipment made available to us through the ONSA Brazilian sequencing network, and the excellent suggestions from the two anonymous referees who reviewed this manuscript. G.F.O.S. and S.Y.K. received graduate and undergraduate fellowships, respectively, from CNPq (Brazil). This work was supported by grants to M.R.S.B. from FAPESP and CNPq (Brazil), and the International Research Scholars Program of the Howard Hughes Medical Institute (United States).


    Footnotes
 
Wolfgang Stephan, Reviewing Editor

Keywords: molecular evolution experimental phylogenetics SSU rRNA gene maximum likelihood Back

Address for correspondence and reprints: Marcelo R. S. Briones, Departamento de Microbiologia, Imunologia e Parasitologia, Escola Paulista de Medicina, Universidade Federal de São Paulo, Rua Botucatu, 862, 3° andar. CEP 04023–062, São Paulo, S.P. Brazil. marcelo{at}ecb.epm.br . Back


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Acknowledgements
 References
 

    Bracho M. A., A. Moya, E. Barrio, 1998 Contribution of Taq polymerase-induced errors to the estimation of RNA virus diversity J. Gen. Virol 79:2921-2928[Abstract]

    Bromham L., A. Rambaut, R. Fortey, A. Cooper, D. Penny, 1998 Testing the Cambrian explosion hypothesis by using a molecular dating technique Proc. Natl. Acad. Sci. USA 95:12386-12389[Abstract/Free Full Text]

    Bull J. J., C. W. Cunningham, I. J. Molineux, M. R. Badgett, D. M. Hillis, 1993 Experimental molecular evolution of bacteriophage-T7 Evolution 47:993-1007[ISI]

    Felsenstein J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach J. Mol. Evol 17:368-376[ISI][Medline]

    ———. 1988 Phylogenies from molecular sequences: inference and reliability Annu. Rev. Genet 22:521-565[ISI][Medline]

    Fitch W. M., 1977 On the problem of discovering the most parsimonious tree Am. Nat 111:223-257[ISI]

    Galtier N., M. Gouy, 1998 Inferring pattern and process: maximum likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis Mol. Biol. Evol 15:871-879[Abstract]

    Galtier N., M. Gouy, C. Gautier, 1996 SEAVIEW and PHYLO_WIN: two graphic tools for sequence alignment and molecular phylogeny Comput. Appl. Biosci 12:543-548[Abstract]

    Galtier N., N. Tourasse, M. Gouy, 1999 A nonhyperthermophilic common ancestor to extant life forms Science 283:220-221[Abstract/Free Full Text]

    Gordon D., C. Abajian, P. Green, 1998 Consed: a graphical tool for sequence finishing Genome Res 8:195-202[Abstract/Free Full Text]

    Hillis D. M., J. J. Bull, M. E. White, M. R. Badget, I. J. Molineux, 1992 Experimental phylogenetics: generation of a known phylogeny Science 255:589-592[ISI][Medline]

    Hillis D. M., M. T. Dixon, 1991 Ribosomal DNA: molecular evolution and phylogenetic inference Q. Rev. Biol 66:411-453[Medline]

    Hillis D. M., B. K. Mable, C. Moritz, 1996 Applications of molecular systematics: the state of the field and a look to the future Pp. 515–543 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer Associates, Sunderland, Mass

    Jukes T. H., C. R. Cantor, 1969 Evolution of protein molecules Pp. 21–123 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York

    Kim Y., S. H. Eom, J. Wang, D.-S. Lee, S. W. Suh, T. A. Steitz, 1995 Crystal structure of Thermus aquaticus DNA polymerase Nature 376:612-616[ISI][Medline]

    Kingman J. F. C., 1982 The coalescent Stochast. Proc. Appl 13:235-248

    Kishino H., M. Hasegawa, 1989 Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominidea J. Mol. Evol 29:170-179[ISI][Medline]

    Korber B., M. Muldoon, J. Theiler, F. Gao, R. Gupta, A. Lapedes, B. H. Hahn, S. Wolinksy, T. Bhattacharya, 2000 Timing the ancestor of the HIV-1 pandemic strains Science 288:1789-1796[Abstract/Free Full Text]

    Kuhner M. K., J. Yamato, J. Felsenstein, 1995 Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling Genetics 140:1421-1430[Abstract/Free Full Text]

    Li W. H., D. Graur, 1991 Fundamentals of molecular evolution Sinauer Associates, Sunderland, Mass

    Olsen G. J., H. Matsuda, R. Hagstrom, R. Overbeek, 1994 fastDNAmL: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood Comput. Appl. Biosci 10:41-48[Abstract]

    Pace N. R., 1997 A molecular view of microbial diversity and the biosphere Science 276:734-740[Abstract/Free Full Text]

    Posada D., K. A. Crandall, 1998 Modeltest: testing the model of DNA substitution Bioinformatics 14:817-818[Abstract]

    Rambaut A., L. Bromham, 1998 Estimating divergence dates from molecular sequences Mol. Biol. Evol 15:442-448[Abstract]

    Rodriguez F., J. F. Oliver, A. Marin, J. R. Medina, 1990 The general stochastic model of nucleotide substitution J. Theor. Biol 142:485-501[ISI][Medline]

    Saiki R. K., D. H. Gelfand, S. Stoffel, S. J. Scharf, R. Higuchi, G. T. Horn, K. B. Mullis, H. A. Erlich, 1988 Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase Science 239:487-491[ISI][Medline]

    Saitou N., M. Nei, 1987 The neighbor-joining method: a new method for reconstructing phylogenetic trees Mol. Biol. Evol 4:406-425[Abstract]

    Sarich V. M., A. C. Wilson, 1973 Generation time and genomic evolution in primates Science 179:1144-1147[ISI][Medline]

    Schadt E. E., J. S. Sinsheimer, K. Lange, 1998 Computational advances in maximum likelihood methods for molecular phylogeny Genome Res 8:222-233[Abstract/Free Full Text]

    Strimmer K., A. von Haeseler, 1996 Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies Mol. Biol. Evol 13:964-969[Free Full Text]

    Swofford D. L., 1998 PAUP* Phylogenetic analysis using parsimony (* and other methods). Version 4b6. Sinauer Associates, Sunderland, Mass

    Swofford D. L., G. J. Olsen, P. J. Waddell, D. M. Hillis, 1996 Phylogenetic inference Pp. 407–514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer Associates, Sunderland, Mass

    Tajima F., 1993 Simple methods for testing the molecular evolutionary clock hypothesis Genetics 135:599-607[Abstract/Free Full Text]

    Woese C. R., 1987 Bacterial evolution Microbiol. Rev 51:221-271[ISI]

    Yang Z., D. Roberts, 1995 On the use of nucleic acid sequences to infer early branchings in the tree of life Mol. Biol. Evol 12:451-458[Abstract]

    Yoder A. D., Z. Yang, 2000 Estimation of primate speciation dates using local molecular clocks Mol. Biol. Evol 17:1081-1090[Abstract/Free Full Text]

    Zuckerkandl E., L. Pauling, 1962 Molecular disease, evolution and genetic heterogeneity Pp. 189–225 in M. Marsha and B. Pullman, eds. Horizons in biochemistry. Academic Press, New York

Accepted for publication September 26, 2001.