Patterns of Nucleotide Substitution in Angiosperm cpDNA trnL (UAA)–trnF (GAA) Regions

Freek T. Bakker3,*, Alastair Culham*, Rosalba Gomez-Martinez*, Jose Carvalho*†, James Compton*, Richard Dawtrey* and Mary Gibby‡

*Department of Botany, The University of Reading, Reading, England;
†Botanical Garden, Funchal, Madeira, Portugal; and
‡The Natural History Museum, Cromwell Road, London, England

Abstract

Patterns of substitution in chloroplast-encoded trnL-F regions were compared between species of Actaea (Ranunculales), Digitalis (Scrophulariales), Drosera (Caryophyllales), Panicoideae (Poales), the small chromosome species clade of Pelargonium (Geraniales), each representing a different order of flowering plants, and Huperzia (Lycopodiales). In total, the study included 265 taxa, each with >900-bp sequences, totaling 0.24 Mb. Both pairwise and phylogeny-based comparisons were used to assess nucleotide substitution patterns. In all six groups, we found that transition/transversion ratios, as estimated by maximum likelihood on most-parsimonious trees, ranged between 0.8 and 1.0 for ingroups. These values occurred both at low sequence divergences, where substitutional saturation, i.e., multiple substitutions having occurred at the same (homologous) nucleotide position, was not expected, and at higher levels of divergence. This suggests that the angiosperm trnL-F regions evolve in a pattern different from that generally observed for nuclear and animal mtDNA (transition/transversion ratio ≥ 2). Transition/transversion ratios in the intron and the spacer region differed in all alignments compared, yet base compositions between the regions were highly similar in all six groups. A↔T and G↔C transversions were significantly less frequent than the other four substitution types. This correlates with results from studies on fidelity mechanisms in DNA replication that predict A↔T and G↔C transversions to be least likely to occur. It therefore strengthens confidence in the link between mutation bias at the polymerase level and the actual fixation of substitutions as recorded on evolutionary trees and, concomitantly, in the neutrality of nucleotide substitutions as phylogenetic markers.

Introduction

Knowledge of the process and pattern of nucleotide substitution is important for estimating the number of substitutional events between DNA sequences since their divergence, as well as for methods of phylogenetic reconstruction that incorporate models of DNA sequence evolution. The process of nucleotide substitution is generally considered biased toward transitions, i.e., substitutions between purines (A↔G) or pyrimidines (C↔T), although twice as many transversion-type substitutions, i.e., substitutions between purines and pyrimidines, are possible.
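The distinction drawn above can be made concrete with a short sketch. It classifies a single substitution as a transition or a transversion and enumerates all 12 directed substitutions among the four bases, which shows the 1:2 ratio of possible transitions to possible transversions. The function name `substitution_class` is illustrative only and does not come from any software used in this study.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def substitution_class(a: str, b: str) -> str:
    """Classify the substitution a -> b as 'transition' or 'transversion'."""
    a, b = a.upper(), b.upper()
    if a not in BASES or b not in BASES:
        raise ValueError("both bases must be one of A, C, G, T")
    if a == b:
        raise ValueError("identical bases are not a substitution")
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Enumerate every directed substitution among the four bases.
counts = {"transition": 0, "transversion": 0}
for a, b in permutations("ACGT", 2):
    counts[substitution_class(a, b)] += 1

print(counts)  # {'transition': 4, 'transversion': 8}
```

If all 12 directed substitutions were equally likely, the expected ti/tv ratio would therefore be 0.5, which is why ratios of 2 or more are read as evidence of transition bias.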

At low DNA sequence distances, this transition bias results in a substitution pattern that is characterized by transition/transversion (ti/tv) ratios typically ranging between 2 and 10 (Brown et al. 1982; Gojobori, Li, and Graur 1982; Purvis and Bromham 1997; Ina 1998). Indeed, some phylogenetic reconstruction computer packages, such as PHYLIP (Felsenstein 1993), use a ti/tv ratio of 2 as the default setting in several programs. However, studies reporting ti/tv ratios have mostly involved animal mtDNA sequences, which calls into question the universality of this pattern given the differences in evolutionary origin, composition, and functional constraints between nuclear, mitochondrial, and chloroplast DNA.

Transition bias has been explained by mechanisms operating both at the level of mutation and at the level of fixation. Misincorporation during DNA replication in Escherichia coli has been found to allow predominantly G·T, A·C, and G·A mispairings, which mostly require transitions for correction (Echols and Goodman 1991). Maintenance of secondary structure in RNA molecules through compensatory substitutions conserving purine·pyrimidine base pairing in stem regions favors transitions (Rousset, Pélandakis, and Solignac 1991). Transcriptional regulation through methylation of CpG dinucleotides is also known to constrain DNA sequence evolution in that methylated cytosines frequently mutate to thymines (Jupe and Zimmer 1993; Vairapandi and Duker 1994). At the population level, processes such as selection against replacement substitution in amino acid sequences are known to constrain the allowed substitutions to transitions in coding regions (e.g., Crozier and Crozier 1993).

Assuming higher rates of transitional substitutions, ti/tv ratios <1 have often been equated with transitional saturation, in which the same nucleotide position undergoes multiple transitions, or considered indicative of high levels of homoplasy (e.g., Hillis, Allard, and Miyamoto 1993). However, if the assumption of higher transitional rate is false, ti/tv ratios <1 do not necessarily indicate transitional saturation, but rather reflect linear accumulation of both substitutional types.

Methods for measuring rate differences between transitions and transversions have been summarized by Wakeley (1996) and Ina (1998). A distinction can be made between pairwise estimates, tree-based parsimony estimates, and maximum-likelihood estimates based on specific models of substitution. The main differences between these methods pertain to whether phylogenetic structure and/or correction for multiple substitutions at one nucleotide position are taken into account. Comparisons among taxa should not be considered independent data points because of the shared evolutionary history of related taxa within a phylogeny (Harvey and Pagel 1993, pp. 9–21). Therefore, methods for measuring patterns and rates of nucleotide substitution are preferably phylogeny-based, maximizing the explanatory power of the data (Wakeley 1996). Counting transitional and transversional steps on a tree, for instance, using the "chart changes" option in MacClade (Maddison and Maddison 1992), underestimates the number of substitutional events in the absence of correction for multiple substitutions and rate heterogeneity. In addition, this method usually produces asymmetric substitution rate matrices, which may in fact reflect rooting artifacts. In contrast, the general time-reversible model (Yang 1994), as used in maximum-likelihood estimation, produces a symmetrical rate matrix in which multiple substitution correction, among-site rate heterogeneity, and base frequencies are taken into account.

Angiosperm chloroplast DNA studies have focused mainly on substitution patterns in genes such as rbcL (Albert et al. 1994; Kellogg and Juliano 1997; Manen, Cuénoud, and Martinez 1999), ndhF (Catalan, Kellogg, and Olmstead 1997), and matK (Hilu and Liang 1997) and have outnumbered studies on noncoding chloroplast regions such as the trnL (UAA) 5' exon–trnF (GAA) exon region. Yet, this so-called "trnL-F" region is being used increasingly for species-level phylogenetic reconstruction (Compton, Culham, and Jury 1998; Bakker et al. 1999; Bayer and Starr 1999; McDade and Moody 1999; Wikström, Kenrick, and Chase 1999), making a better understanding of its substitution pattern desirable.

The trnL-F region contains the trnL gene, which is split by a group I intron, an intergenic spacer, and the trnF exon. Group I introns are characterized by a highly conserved core structure encoding an active site that mediates self-splicing from the pre-tRNA (Michel and Dujon 1983; Cech 1988). The trnL intron was the first group I intron described in chloroplast DNA and also the first one described to interrupt a tRNA gene (Bonnard et al. 1984). Other chloroplast DNA group I introns occur in the 16S rRNA, psbA, psbC, and psaB genes (Cavalier-Smith 1993). In plants, the trnL intron usually shows sequence conservation in the regions flanking both trnL exons, whereas the central part is highly variable. Within the intergenic spacer, no secondary-structural elements have been found that could serve as splicing points, indicating that trnL and trnF are probably co-transcribed (Bonnard et al. 1984). A general feature of cpDNA spacer regions is the occurrence of indels that can be derived from either deletion or duplication of adjacent sequences or occur in nonrepetitive regions of the spacer (Golenberg et al. 1993). For instance, in Poaceae and in Proteaceae, the atpB–rbcL intergenic spacer region was found to have 82% and 69% of its indels occurring in repetitive regions (Golenberg et al. 1993; Hoot and Douglas 1998).

In this study, substitution patterns in trnL-F regions were analyzed in sequence alignments from five different angiosperm groups (each representing a different order) and from Huperzia (Lycopodiales). Each data set comprised predominantly closely related species so that most intermediate evolutionary steps were captured. The purpose of this study was to investigate whether universal patterns of substitution exist in the trnL-F region across different angiosperm lineages and to what extent this pattern differs from the non-chloroplast DNA examples given above.

Materials and Methods

TrnL-F sequence alignments were obtained for Actaea (Ranunculales) (Compton, Culham, and Jury 1998; accession numbers AJ222985–AJ222983), Digitalis (Scrophulariales) (Carvalho and Culham 1998), Drosera (Caryophyllales) (Culham, unpublished), the Panicoideae clade (Poales) (Gomez-Martinez, unpublished), the small chromosome clade of Pelargonium (Geraniales) (Bakker et al. 1999; accession numbers AF03685–AF036080), and the Huperzia data set of Wikström, Kenrick, and Chase (1999; accession numbers AJ224591–AJ224609). Sequence alignment followed Bakker et al. (1998) and included initial alignment using CLUSTAL followed by manual adjustment, resulting in largely unambiguous alignments for all data sets. Generally, if ambiguous regions were encountered, nucleotide substitutions were minimized relative to indels, since indel formation is considered to occur at higher rates (Golenberg et al. 1993). Phylogenetically informative indels (table 1), varying in length up to 56 nt (Pelargonium), were considered to represent separate events if found in overlapping positions of the alignment and were scored as single binary characters irrespective of indel length. Nucleotide substitution patterns in the five sequence alignments were evaluated based on pairwise comparisons, as well as on maximum-likelihood evaluation of most-parsimonious trees using PAUP* 4d64 (D. L. Swofford, personal communication). Pairwise comparisons of substitution patterns within each alignment were performed using the "dinucleotide frequencies" command. Because all alignments contained length variation (varying up to 15% of the average number of nucleotides compared), pairwise comparisons were based on different numbers of nucleotides. To avoid overestimation of nucleotides differing divided by total nucleotides compared (p-differences) when the same number of substitutions was encountered in different length comparisons, dinucleotide frequencies, expressed as ti/tv ratios, were plotted against the total number of substitutions encountered in each pairwise comparison (fig. 1).
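As an illustration of the pairwise comparison described above, the following sketch counts transitions and transversions between two aligned sequences, skipping any column in which either sequence has a gap or an ambiguous base. It is not the PAUP* "dinucleotide frequencies" command; the function name `pairwise_ti_tv`, the gap handling, and the two toy sequences are assumptions made for this example.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = PURINES | PYRIMIDINES


def pairwise_ti_tv(seq1: str, seq2: str) -> dict:
    """Count transitions and transversions between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    ti = tv = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Skip gaps and ambiguity codes, so the number of nucleotides
        # compared can differ from one pair of sequences to the next.
        if a not in BASES or b not in BASES:
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ti += 1
        else:
            tv += 1
    total = ti + tv
    return {
        "transitions": ti,
        "transversions": tv,
        "substitutions": total,  # x axis of a plot like figure 1
        "compared": compared,
        "p_difference": total / compared if compared else None,
        "ti_tv": ti / tv if tv else None,  # undefined without transversions
    }


# Toy example (hypothetical sequences, not data from this study).
result = pairwise_ti_tv("ACGTACGT-A", "GCGTTCAT-C")
print(result)
```

Plotting `ti_tv` against `substitutions` rather than against `p_difference` keeps two comparisons with the same number of changes at the same x position even when they span different numbers of compared nucleotides.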


Table 1 Properties of the Five trnL-F Data sets and Their Most-Parsimonious Trees

 


Fig. 1.—Transition/transversion (ti/tv) ratios (black diamonds) plotted against total number of substitutions counted in pairwise comparisons of trnL-F sequences from five angiosperm groups (each representing a different order) and Huperzia (Lycopodiales). Squares indicate all possible ti/tv ratios with P > 0.05 (dark gray) and P > 0.01 (light gray) probability of occurrence, based on maximum-likelihood estimates of numbers of transitions and transversions for each data set (see text).

 
To compare these plots with their theoretical expectation, a separate plot was constructed containing all theoretically possible ti/tv ratios occurring in pairwise comparisons involving up to n = 100 substitutions. For each pairwise comparison in this theoretical data set, a probability for the associated ti/tv ratio was then calculated based on three different combinations of probabilities pti and qtv, the occurrence of transitions and transversions, respectively. The first combination was 2pti = qtv = 2/3, theoretically expected to result in a ti/tv rate ratio of pti/qtv = 0.5, the second was pti = qtv = 1/2, resulting in a ti/tv rate ratio of 1.0, and the third was pti = 2qtv = 2/3, resulting in a ti/tv rate ratio of 2.0. Probabilities for ti/tv ratios associated with each comparison were then calculated as

$$P = \binom{n}{\mathrm{TI}} \, p_{ti}^{\mathrm{TI}} \, q_{tv}^{\mathrm{TV}}$$


where n is the total number of substitutions, TI is the number of transitions, and TV is the number of transversions occurring in the comparison. Probabilities are multiplied in the first term of the equation with the binomial coefficient of n and TI in order to arrive at the sum of probabilities for all possible combinations resulting in that particular ti/tv ratio. For each pti and qtv combination, only those ti/tv ratios with P ≥ 0.05 and 0.05 ≥ P ≥ 0.01 were plotted (fig. 1A–F).
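The probability described here is a binomial probability, and a minimal sketch of the calculation is given below. It assumes qtv = 1 - pti, as in the three combinations listed above; the function names `ti_tv_probability` and `likely_outcomes` are illustrative and are not taken from any software used in this study.

```python
from math import comb


def ti_tv_probability(ti: int, tv: int, p_ti: float) -> float:
    """Probability of exactly `ti` transitions among n = ti + tv substitutions.

    Assumes each substitution is independently a transition with
    probability p_ti and a transversion with probability 1 - p_ti.
    """
    n = ti + tv
    q_tv = 1.0 - p_ti
    return comb(n, ti) * p_ti**ti * q_tv**tv


def likely_outcomes(n: int, p_ti: float, threshold: float = 0.01) -> list:
    """All (ti, tv) splits of n substitutions with probability >= threshold."""
    return [
        (ti, n - ti)
        for ti in range(n + 1)
        if ti_tv_probability(ti, n - ti, p_ti) >= threshold
    ]


# The three combinations considered in the text, keyed by ti/tv rate ratio.
scenarios = {0.5: 1 / 3, 1.0: 1 / 2, 2.0: 2 / 3}

for rate_ratio, p_ti in scenarios.items():
    splits = likely_outcomes(20, p_ti)
    print(f"rate ratio {rate_ratio}: {len(splits)} splits of n = 20 with P >= 0.01")
```

Repeating `likely_outcomes` for every n up to 100, once with a threshold of 0.05 and once with 0.01, would give the two bands of ti/tv ratios of the kind shown as gray squares in figure 1.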

Most-parsimonious trees (MPTs) were calculated following Bakker et al. (1999), including recoded indels and performing 1,000 replicates of random-addition sequence and tree bisection-reconnection (TBR) branch swapping (see table 1 for search scores). The alignments were then divided into three parts, containing the (partial) intron, the 3' trnL (UAA) exon, and the (partial) intergenic spacer, respectively, following Bonnard et al. (1984) for position of coding termini. Using the "treescores/likelihood" option in PAUP, substitution patterns in the intron and spacer alignments were evaluated both separately and in conjunction on each of the MPTs calculated for the total alignments. Relative rates of the different substitution types (the R-matrix) were estimated using the general time-reversible model (Yang 1994) with rate heterogeneity assumed to follow a gamma distribution. Likelihood settings were as follows: empirical nucleotide frequencies; proportion invariable sites estimated; gamma distribution shape parameter α estimated; discrete gamma approximation: number of rate categories = 4, average rate for each category represented by median. Prior to likelihood evaluation, outgroups were pruned from MPTs in order to avoid possible saturation effects introduced by long branches connecting to the ingroup. For each data set (intron, spacer, or total), R-matrix values were then averaged across MPTs calculated for the total data set. Averaged relative-rate values were normalized and multiplied by the number of steps contributed by that particular data set to the total MPT length. This resulted in a contingency table containing for each data set the number of substitutions inferred for each of the six substitution categories, shown for the total data sets as histograms in figure 2. Proportions of each of the six substitution types within the total number of substitutions per group are plotted in figure 3, grouped by type.
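The averaging and scaling step at the end of this procedure can be sketched as follows. The sketch assumes that "normalized" means rescaling the six averaged relative rates so that they sum to 1; the function name `inferred_counts`, the type labels, and all numbers in the example are hypothetical and are not values from this study.

```python
# The six substitution types of a symmetrical (time-reversible) rate matrix.
TYPES = ("A<->C", "A<->G", "A<->T", "C<->G", "C<->T", "G<->T")


def inferred_counts(r_matrices: list, steps: float) -> dict:
    """Turn per-tree relative rates into inferred substitution counts.

    r_matrices: one list of six relative rates per most-parsimonious tree,
                ordered as in TYPES.
    steps:      number of parsimony steps contributed by this data set.
    """
    n_trees = len(r_matrices)
    # Average each relative rate across the trees.
    mean_rates = [sum(r[i] for r in r_matrices) / n_trees for i in range(len(TYPES))]
    # Normalize to proportions (assumed here to mean: sum to 1).
    total = sum(mean_rates)
    # Scale the proportions by the number of steps for this data set.
    return {t: steps * m / total for t, m in zip(TYPES, mean_rates)}


# Hypothetical relative rates from two trees and a hypothetical step count.
example_rates = [
    [1.0, 2.0, 0.5, 0.5, 2.0, 1.0],
    [1.0, 4.0, 0.5, 0.5, 2.0, 1.0],
]
counts = inferred_counts(example_rates, steps=80)
print(counts)
```

One row of this kind per data set, stacked together, gives a contingency table of substitution types by data sets.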



Fig. 2.—Total numbers of substitutions within each substitution type for the five angiosperm data sets and Huperzia (Lycopodiales)

 


Fig. 3.—Proportions of each of the six substitution types within the total number of substitutions occurring in the five angiosperm groups and Huperzia (Lycopodiales), grouped by type. Values are based on maximum-likelihood estimations on most-parsimonious trees calculated for each data set using the general time-reversible model with rate heterogeneity (see text)

 
To test for dependence between substitution pattern and the data sets, i.e., whether the substitution patterns in any of the data sets are significantly different, an R × C test of independence between rows (the six substitution types) and columns (the data sets) was applied using a G-test (Sokal and Rohlf 1995). This test was conducted for the separate intron and spacer data sets and the combined data sets (table 2). Within some data sets, the frequencies observed for some substitution types were combined in order to avoid frequencies <3 in the cells of the contingency table, following the recommendations of Sokal and Rohlf (1995, p. 702).
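For readers unfamiliar with the G-test, a minimal sketch of the statistic for an R × C contingency table is given below. It computes G = 2 Σ O ln(O/E), with expected counts E taken from the row and column totals, and (R - 1)(C - 1) degrees of freedom. No continuity or Williams correction is applied, and the function name `g_test_independence` and the example tables are assumptions for illustration only.

```python
from math import log


def g_test_independence(table: list) -> tuple:
    """Return (G, degrees of freedom) for an R x C table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / grand_total
                g += observed * log(observed / expected)
    g *= 2.0
    df = (len(table) - 1) * (len(table[0]) - 1)
    return g, df


# Hypothetical 2 x 2 tables: one with proportional rows, one fully dependent.
print(g_test_independence([[10, 20], [20, 40]]))
print(g_test_independence([[10, 0], [0, 10]]))
```

The P value is then read from a chi-square distribution with the returned degrees of freedom; a table whose rows are exactly proportional gives G = 0.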


Table 2 Significance Tests for (In)dependence Between Groups and Observed Frequencies Within Each Nucleotide Substitution Type

 
In addition, a two-way ANOVA (n = 36) was performed to test for significance of difference between occurrences of each substitution type in the total alignments. Because levels of sequence divergence in the data sets ranged between 5% and 23%, a variance-stabilizing transformation of the counts of substitutions in the contingency table was performed prior to the ANOVA by using the square root of the counts. To assess whether certain substitution types were significantly rarer than others, a method similar to that described by Calinski and Corsten (1985) was followed. First, the closest sample means of counts, as used in the two-way ANOVA, were tested for significance of difference by t-tests. Internally homogeneous and nonoverlapping subsets of substitution types thus found were then simultaneously compared with other such groups in a second round of t-tests until significant differences were found. By following this approach (rather than performing multiple pairwise t-testing), loss of control over type I error was avoided, and tests between homogeneous pairs of substitution types were independent (Calinski and Corsten 1985).
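The first part of this analysis, the square-root transformation followed by a two-way ANOVA with one observation per cell, can be sketched as below. The sketch covers only the transformation and the F ratios; it does not implement the subset-comparison procedure of Calinski and Corsten (1985). The function name `two_way_anova_sqrt` and the small example table are assumptions for illustration.

```python
from math import sqrt


def two_way_anova_sqrt(counts: list) -> dict:
    """Two-way ANOVA without replication on square-root-transformed counts.

    counts[i][j] is the number of substitutions of type i in group j.
    """
    y = [[sqrt(c) for c in row] for row in counts]  # variance-stabilizing step
    r, c = len(y), len(y[0])
    grand_mean = sum(map(sum, y)) / (r * c)
    row_means = [sum(row) / c for row in y]
    col_means = [sum(col) / r for col in zip(*y)]

    ss_rows = c * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((v - grand_mean) ** 2 for row in y for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    df_rows, df_cols = r - 1, c - 1
    df_error = df_rows * df_cols
    ms_error = ss_error / df_error
    return {
        "type_means": row_means,  # means of square roots, per substitution type
        "F_types": (ss_rows / df_rows) / ms_error,
        "F_groups": (ss_cols / df_cols) / ms_error,
        "df": (df_rows, df_cols, df_error),
    }


# Hypothetical counts for three types (rows) in two groups (columns).
result = two_way_anova_sqrt([[1, 4], [9, 9], [25, 36]])
print(result)
```

The `type_means` returned here correspond to the kind of sample means (means of square roots of counts) that the subsequent t-tests would compare.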

Results

Taxonomic sampling was complete (all recognized species were sampled) for the Actaea and Digitalis data sets. For the Pelargonium data set and the Panicoideae data set, all recognized sections and most genera, respectively, within those clades were represented (Clayton and Renvoize 1986; Bakker et al. 1999; Gomez-Martinez 1999). The Drosera data set (31 taxa) comprised ca. 20% of the genus, representing all but one recognized section sensu Diels (1906). The Huperzia data set comprised ca. 40% of all recognized taxa. In total, the six alignments included 265 sequences over approximately 900 bp (see table 1). This sampling represents five different angiosperm orders and might therefore be expected to reflect general angiosperm substitution patterns in the trnL-F region.

Ingroup sequence divergence for the alignments ranged between 5% and 23%, probably representing a similar range of divergence times among the taxa (table 1; see also fig. 2). No variable positions were found within the trnL 3' exon. Variable positions within the intron region were mostly distributed outside the structural elements required for intron processing (Bonnard et al. 1984), i.e., not within the first 130 bp at the 5' end and not within the 200 bp at the 3' end of the intron region. Within the spacer region, variable positions were more evenly distributed.

A+T content for the angiosperm alignments of the total trnL-F region ranged between 64.5% and 67.1% and was 69% for Huperzia (table 1). These values are similar to those found for 19 other cpDNA noncoding regions in Poaceae (Morton and Clegg 1995; Morton, Oberholzer, and Clegg 1997) and, for the atpB–rbcL intergenic spacer region, in Rubiaceae (Manen and Natali 1995). Mean A+T content differed by up to 2% between the intron and spacer regions of all trnL-F alignments studied here.

For all alignments, pairwise sequence comparisons showed ti/tv ratios to be centered around and below 1 for comparisons involving up to 20 accumulated substitutions (fig. 1A–F). In all six groups studied here, the same trend is present, i.e., an equal occurrence of ti/tv ratios below and above 1 at early sequence divergence with subsequent convergence at a value <1. As can be seen in figure 1, most observed ti/tv values have ≥0.05 probability, but at sequence divergence values of 25–30 substitutions, a cluster of values is found in most data sets with probability 0.01–0.05. Discontinuities in the ranges of data points reflect either differences in levels of taxonomic sampling between the five groups studied or possibly differences in evolutionary age between the groups. For example, no sequence divergences <0.2% were available for Drosera, whereas for Pelargonium nearly all theoretically possible ti/tv ratios were found over that range. High species-level DNA sequence divergence in Drosera was also found for rbcL (Williams, Albert, and Chase 1994). The discontinuities in the plot for the Digitalis data set reflect the relatively isolated phylogenetic position of this tribe within the "Scroph 2" clade (Olmstead and Reeves 1995) of the Scrophulariaceae rather than insufficient taxonomic sampling: the low level of DNA sequence divergence within the Digitalis clade suggests recent proliferation (Carvalho and Culham 1998), and its position on a relatively long branch connecting it to its sister group (Veronica, Erinus) results in a discontinuous range of DNA sequence divergence values when comparisons outside Digitalis are made. Divergence within Digitalis does not exceed 18 substitutional events, while comparisons outside Digitalis all exceed 30 such events (fig. 1).

Ti/tv ratios, as estimated on MPTs derived from the total trnL-F region, were similar to those estimated from the pairwise comparisons involving >30 substitutions (fig. 1). Ti/tv ratios differed between the intron and spacer regions, but this difference was not consistent among the six groups compared (see table 1). No significant correlation between level of sequence divergence and ti/tv ratio in each region could be found. Furthermore, a search for local sequence composition heterogeneity using the WINDOW program of the GCG package (Genetics Computer Group 1995) indicated largely similar sequence compositions between the regions (data not shown).

The R × C test of independence between substitution pattern and the five data sets, which tests H0: no difference between the five data sets with regard to substitution pattern, yielded different results depending on whether or not intron and spacer regions were analyzed separately and whether or not the Huperzia data set was included (table 2). G values were found to be significant for the intron region (across angiosperms: P < 0.025; angiosperms + Huperzia: P < 0.050) and highly significant for both the spacer region and the combined intron + spacer regions (P < 0.001). This would indicate that the trnL-F intron region accumulates nucleotide substitutions in a more uniform pattern than does the intergenic spacer region.

The two-way ANOVA also showed differences between groups and substitution types to be highly significant (table 3). Results of the subsequent analysis of differences among types are graphically represented in figure 4. The closest means of counts of types were between the transitions A↔G and T↔C; a Student's t-test (df = 24) showed that occurrences within these types were not significantly different, and they can therefore be considered a statistically homogeneous group of types. The same result was obtained for the pairs A↔C + G↔T and A↔T + G↔C. Comparison between the pairs yielded a significant difference only for the A↔T + G↔C pair versus the other pairs (P < 0.01 and P < 0.05, respectively). Therefore, we can say that the A↔T and G↔C type transversions have occurred in trnL-F regions significantly less frequently than all other substitution types.


Table 3 Sample Means and Two-Way ANOVA Table of Occurrences of Substitution Types of trnL-F Regions of Five Angiosperm Groups and Huperzia (Lycopodiales)

 


Fig. 4.—Relationships of frequencies of occurrence within each of the six substitution types in trnL-F regions of five angiosperm groups and Huperzia (Lycopodiales). Student's t values are plotted against sample means used in the two-way ANOVA (see table 3 ). Means are of the square roots of counts for each type. Thick lines indicate comparisons between pairs of substitution types found not to be significantly different. P values between lines indicate significance levels for a comparison

 
Discussion

In this study, we investigated patterns of nucleotide substitutions in sets of related cpDNA-encoded trnL-F sequences sampled from five different angiosperm orders and Huperzia (Lycopodiales). Low ti/tv ratios occur in cpDNA trnL-F regions even at early stages of sequence divergence. Ti/tv ratios as estimated on MPTs calculated from the alignments do not exceed 1 for any of the groups examined (table 1). Similar values, some based on pairwise comparisons, have also been found in other chloroplast noncoding regions such as the atpB–rbcL intergenic spacer (Morton and Clegg 1995; Manen and Natali 1995; Hoot and Douglas 1998) and in coding regions such as matK and rbcL (ti/tv = 1 and 0.9–1.4, respectively; Johnson and Soltis 1995; Hilu and Liang 1997; Manen, Cuénoud, and Martinez 1998) but not atpB (ti/tv = 2.21) (Bayer et al. 1999). Because these low ti/tv ratios are found in the trnL-F data sets from early stages of sequence divergence onward, we consider that they do not reflect saturation with transitional substitutions. This contradicts the widely held notion that low ti/tv ratios indicate low phylogenetic signal (e.g., Hillis, Allard, and Miyamoto 1993; Page and Holmes 1998, p. 150). We interpret the values in the trnL-F data sets presented here as indicating similar rates of transition and transversion. In contrast, values of around 2 or more have typically been found for both coding and noncoding animal mtDNA sequences (Brown et al. 1982; Gojobori, Li, and Graur 1982; Purvis and Bromham 1997; Ina 1998), whereas values of 1–1.5 have been found for nuclear DNA sequences such as rDNA internal transcribed spacers (Möller and Cronk 1997; Molvray, Kores, and Chase 1999). The lower ti/tv values found here for angiosperm trnL-F regions raise the question of what different factors underlie the substitution dynamics in these noncoding chloroplast DNA regions.

Base content may be one factor that could explain the occurrence of a relatively high proportion of transversions in trnL-F regions. Based on pairwise comparisons of 20 different chloroplast noncoding regions in Poaceae, substitutions occurring in a context of high A+T content were found to have a greater frequency of transversions than those occurring in a high-G+C context (Morton, Oberholzer, and Clegg 1997). The same authors also found that this transversion bias increased even more when the base 5' to the site of substitution was a pyrimidine. In view of this, the relatively high A+T values (63%–67% A+T) in angiosperm trnL-F regions may explain the high proportion of transversions found. A+T-rich regions have been documented to be highly prone to replication slippage (Levinson and Gutman 1987; Cummings, King, and Kellogg 1994), a mechanism that involves local intrahelical denaturation and displacement of replicating strands, which further increases A+T content, possibly enhancing transversion bias. Difference in local heterogeneity between regions that have otherwise similar overall base compositions could therefore result in different patterns of substitution. In our alignments, no apparent difference in local base composition heterogeneity and similar A+T content was found in the intron and spacer regions (see Results), yet ti/tv ratios differed for both regions in all data sets. Factors other than overall base composition and neighboring base effect could be involved in causing the differences in ti/tv ratios in each region. The different G values for the intron and spacer regions (table 2) indicate that nucleotide substitutions in the intron region accumulate in a more uniform manner across angiosperms than do those in the spacer region. This may reflect differing functional constraints between the regions.

When the observed ti/tv ratios for the total trnL-F regions are compared with the theoretically expected values (fig. 1), we see that most values have P > 0.05. Values with expected probability P < 0.01 are rare but are found in the Actaea, Pelargonium, and Panicoideae data sets at sequence comparisons with approximately 25–30 substitutions (fig. 1A, B, and F). Correlating with this pattern is a lack of observed values in this range in the lower half of the area containing values with P > 0.05. Apparently, a shift in substitution dynamics toward transitions occurs at this level of trnL-F sequence divergence. This could be explained by the presence of a set of more conserved nucleotide positions that become substituted only at the above-mentioned levels of sequence divergence. Possibly, these positions are functionally constrained with a preference to undergo transitions. This could be the case for positions involved in secondary-structural base pairings, such as positions undergoing compensatory changes, known to proceed predominantly via transitions (Rousset, Pélandakis, and Solignac 1991). In order to test this hypothesis, we would need a good secondary-structure model for the variable part of these angiosperm trnL group I intron regions, as well as for the intergenic spacer region.

Nucleotide Substitution Bias
In spite of the high A+T content in the trnL-F regions, A{leftrightarrow}T and G{leftrightarrow}C transversions were found to occur significantly less often than the other substitution types in the intron + spacer regions (P < 0.01; fig. 4 ) for all five angiosperm groups and Huperzia. This finding contrasts sharply with data from, for example, insect mtDNA 16S rDNA sequences that also had high A+T contents (on average, ~72%) but for which A{leftrightarrow}T transversions comprised the vast majority of all substitutions (Fang et al. 1993Citation ; Xiong and Kocher 1993Citation ). A possible explanation for the apparent avoidance of A{leftrightarrow}T and G{leftrightarrow}C transversions might come from mechanisms ensuring fidelity in DNA replication. Apart from free energy minimization of base pairing between template base and incoming dNTP, geometric selection is thought to be the most important mechanism ensuring replication fidelity (Echols and Goodman 1991Citation ; Goodman 1997Citation ). This selection principle, i.e., that the newly formed base pair should have a geometry equivalent to "standard" Watson-Crick base pairs, predicts that base mispairs that are closest to this geometry will occur more frequently (table 4 ). The mispairs G·T, A·C, and G·A are closest in geometry to Watson-Crick geometry and have in fact been found as constituents of B-DNA (Kennard 1987Citation ), the "classical" Watson-Crick type double helix containing about 10 residues per turn. These "allowed" mispairs were also found to be most frequently produced by the pol subunit of Polymerase III, a prokaryotic DNA polymerase (Sloane, Goodman, and Echols 1988Citation ). Substitutions resulting from repair of these "allowed" mispairs are either transitions or A{leftrightarrow}C and G{leftrightarrow}T transversions (table 4 ) and are expected to occur most frequently. 
In contrast, A{leftrightarrow}T and G{leftrightarrow}C transversions, resulting from repair of "disallowed" mispairings, are expected to be rare. The trnL-F data sets described here are in line with that prediction, with A{leftrightarrow}T and G{leftrightarrow}C transversions occurring significantly less often than substitution types resulting from repair of "allowed" mispairings. Whether the correlation between DNA polymerase fidelity and pattern of fixed substitutions is also found in other noncoding regions remains to be investigated.
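The routes from mispair to substitution can be sketched by enumerating all 12 possible misincorporations. The sketch below rests on one simplifying assumption that is ours for illustration: the misincorporated base is retained, so that the template-strand position ends up carrying the complement of the incoming base. The names `ALLOWED_MISPAIRS`, `CLASS_LABELS`, and `mispair_routes` are hypothetical, and the output is not guaranteed to match the layout of table 4.

```python
from collections import defaultdict

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# "Allowed" mispairs named in the text (G·T, A·C, G·A), stored as unordered pairs.
ALLOWED_MISPAIRS = {frozenset("GT"), frozenset("AC"), frozenset("GA")}

# The six substitution types, keyed by the unordered pair of bases involved.
CLASS_LABELS = {
    frozenset("AG"): "A<->G",
    frozenset("CT"): "C<->T",
    frozenset("AC"): "A<->C",
    frozenset("GT"): "G<->T",
    frozenset("AT"): "A<->T",
    frozenset("GC"): "G<->C",
}


def mispair_routes():
    """Map each substitution type to the mispairs (template·incoming) producing it."""
    routes = defaultdict(list)
    for template in "ACGT":
        for incoming in "ACGT":
            if incoming == COMPLEMENT[template]:
                continue  # correct Watson-Crick pair, no substitution
            # Assumption: the incoming base is retained, so the template-strand
            # position becomes the complement of the incoming base.
            new_base = COMPLEMENT[incoming]
            label = CLASS_LABELS[frozenset((template, new_base))]
            allowed = frozenset((template, incoming)) in ALLOWED_MISPAIRS
            routes[label].append((f"{template}·{incoming}", allowed))
    return dict(routes)


routes = mispair_routes()
for label, pairs in routes.items():
    described = ", ".join(
        f"{pair} ({'allowed' if ok else 'disallowed'})" for pair, ok in pairs
    )
    print(f"{label}: {described}")
```

Under this assumption, every substitution type except A<->T and G<->C has at least one route through an "allowed" mispair, whereas A<->T and G<->C arise only from purine·purine or pyrimidine·pyrimidine pairings of identical bases (A·A, T·T, G·G, C·C).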


Table 4 Overview of Routes Leading from Base Mispairings to Substitutions in DNA Sequences During Replication

 
The implications of the findings presented in this study for phylogenetic analysis of DNA sequence data are, first, that data sets in which ti/tv ratios <2 are found are not necessarily substitutionally saturated. Second, our findings indicate a correlation between mutation bias at the polymerase level and the actual fixation of nucleotide substitutions as recorded on evolutionary trees. This strengthens confidence in the neutrality of nucleotide substitutions as phylogenetic markers.

Acknowledgements

We thank Dr. Terry Hedderson for critically reading the manuscript and Dr. Derek Pike for statistical advice. This work was supported by NERC grant GST\02\1169 to A.C. and M.G., by CITMA for J. Carvalho, by University F. de Miranda for R.G.-M., and by a Royal Horticultural Society bursary to A.C.

Footnotes

Elizabeth Kellogg, Reviewing Editor

1 Abbreviation: trnL-F, chloroplast-encoded trnL (UAA) 5' exon–trnF (GAA) exon region.

2 Keywords: angiosperms, trnL-F, transition/transversion ratio, substitution bias

3 Address for correspondence and reprints: Freek T. Bakker, Plant Taxonomy Group, Wageningen University, P.O. Box 8010, 6700 ED Wageningen, The Netherlands. E-mail: freek.bakker;caalgem.pt.wau.nl.

Literature Cited

    Albert, V. A., A. Backlund, K. Bremer, M. W. Chase, J. R. Manhart, B. D. Mishler, and K. C. Nixon. 1994. Functional constraints and rbcL evidence for land plant phylogeny. Ann. Mo. Bot. Gard. 81:534–567.

    Bakker, F. T., A. Culham, L. C. Daugherty, and M. Gibby. 1999. A trnL-F based phylogeny for species of Pelargonium (Geraniaceae) with small chromosomes. Plant Syst. Evol. 216:309–324.

    Bakker, F. T., D. Hellbrügge, A. Culham, and M. Gibby. 1998. Phylogenetic relationships within Pelargonium sect. Peristera (Geraniaceae) inferred from nrDNA and cpDNA sequence comparisons. Plant Syst. Evol. 211:273–287.

    Bayer, R. J., M. F. Fay, A. Y. de Bruijn, V. Savolainen, C. M. Morton, K. Kubitzki, W. S. Alverson, and M. W. Chase. 1999. Support for an expanded family concept of Malvaceae within a recircumscribed order Malvales: a combined analysis of plastid atpB and rbcL sequences. Bot. J. Linn. Soc. 129:267–303.

    Bayer, R. J., and J. R. Starr. 1999. Tribal phylogeny of the Asteraceae based on two non-coding chloroplast sequences, the trnL intron and trnL/trnF intergenic spacer. Ann. Mo. Bot. Gard. 85:242–256.

    Bonnard, G., F. Michel, J. H. Weil, and A. Steinmetz. 1984. Nucleotide sequence of the split tRNALeu gene from Vicia faba chloroplasts: evidence for structural homologies of the chloroplast tRNALeu intron with the intron from the autosplicable Tetrahymena ribosomal RNA precursor. Mol. Gen. Genet. 194:330–336.

    Brown, W. M., E. M. Prager, A. Wang, and A. C. Wilson. 1982. Mitochondrial DNA sequences of primates: the tempo and mode of evolution. J. Mol. Evol. 18:225–239.

    Calinski, T., and L. C. A. Corsten. 1985. Clustering means in ANOVA by simultaneous testing. Biometrics 41:39–48.

    Carvalho, J. A., and A. Culham. 1998. Conservation status and phylogenetics of Isoplexis (Lindl.) Benth. Bol. Mus. Municipal Funchal 5:109–127.

    Catalan, P., E. A. Kellogg, and R. G. Olmstead. 1997. Phylogeny of Poaceae subfamily Pooideae based on chloroplast ndhF gene sequences. Mol. Phylogenet. Evol. 8:150–166.

    Cavalier-Smith, T. 1993. Evolution of the eukaryotic genome. Pp. 333–386 in P. Broda, S. G. Oliver, and P. F. G. Sims, eds. The eukaryotic genome. Cambridge University Press, Cambridge, England.

    Cech, T. R. 1988. Conserved sequences and structures of group I introns: building an active site for RNA catalysis—a review. Gene 73:259–271.

    Clayton, W. D., and S. A. Renvoize. 1986. Genera Graminum: grasses of the world. Kew Bulletin add. series 13, Royal Botanic Gardens, Kew.

    Compton, J. A., A. Culham, and S. Jury. 1998. Reclassification of Actaea to include Cimicifuga and Souliea (Ranunculaceae): phylogeny inferred from morphology, nrDNA ITS and cpDNA trnL-F sequence variation. Taxon 47:593–634.

    Crozier, R. H., and Y. C. Crozier. 1993. The mitochondrial genome of the honeybee Apis mellifera: complete sequence and genome organization. Genetics 133:97–117.

    Cummings, M. P., L. King, and E. A. Kellogg. 1994. Slipped-strand mispairing in a plastid gene—rpoC2 in grasses (Poaceae). Mol. Biol. Evol. 11:1–8.

    Diels, L. 1906. Droseraceae. Pp. 1–137 in A. Engler, ed. Das Pflanzenreich. Vol. 4 (112). von Wilhelm Engelman, Leipzig.

    Echols, H., and M. F. Goodman. 1991. Fidelity mechanisms in DNA replication. Annu. Rev. Biochem. 60:477–511.

    Fang, Q., W. C. Black IV, H. D. Blocker, and R. F. Whitcomb. 1993. A phylogeny of New World Deltocephalus-like leafhopper genera based on mitochondrial 16S ribosomal DNA sequences. Mol. Phylogenet. Evol. 2:119–131.

    Felsenstein, J. 1993. PHYLIP. Version 3.5c. Distributed by the author, Department of Genetics, University of Washington, Seattle.

    Genetics Computer Group. 1995. Program manual for the Wisconsin package, version 8, September 1994. Genetics Computer Group, Madison, Wisconsin.

    Gojobori, T., W.-H. Li, and D. Graur. 1982. Patterns of nucleotide substitution in pseudogenes and functional genes. J. Mol. Evol. 18:360–369.

    Golenberg, E. M., M. T. Clegg, M. L. Durbin, J. Doebley, and D. P. Ma. 1993. Evolution of a noncoding region of the chloroplast genome. Mol. Phylogenet. Evol. 2:52–64.

    Gomez-Martinez, R. 1999. A systematic study of the grass tribe Paniceae with special emphasis on the genus Axonopus. Unpublished Ph.D. thesis, University of Reading, England.

    Goodman, M. F. 1997. Hydrogen bonding revisited: geometric selection as a principal determinant of DNA replication fidelity. Proc. Natl. Acad. Sci. USA 94:10493–10495.

    Harvey, P. H., and M. D. Pagel. 1993. The comparative method in evolutionary biology. Oxford University Press, Oxford, England.

    Hasegawa, M., H. Kishino, and T.-A. Yano. 1985. Dating of the human–ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.

    Hillis, D. M., M. W. Allard, and M. M. Miyamoto. 1993. Analysis of DNA sequence data: phylogenetic inference. Methods Enzymol. 224:456–490.

    Hilu, K. W., and H. P. Liang. 1997. The matK gene: sequence variation and application in plant systematics. Am. J. Bot. 84:830–839.

    Hoot, S. B., and A. W. Douglas. 1998. Phylogeny of the Proteaceae based on atpB and atpB-rbcL intergenic spacer region sequences. Aust. Syst. Bot. 11:301–320.

    Ina, Y. 1998. Estimation of the transition/transversion ratio. J. Mol. Evol. 46:521–533.

    Johnson, L. A., and D. E. Soltis. 1995. Phylogenetic inference in Saxifragaceae sensu stricto and Gilia (Polemoniaceae) using matK sequences. Ann. Mo. Bot. Gard. 82:149–175.

    Jupe, E. R., and E. A. Zimmer. 1993. Assaying differential ribosomal-RNA gene-expression with allele-specific probes. Methods Enzymol. 224:541–552.

    Kellogg, E. A., and N. D. Juliano. 1997. The structure and function of RuBisCo and their implications for systematic studies. Am. J. Bot. 84:413–428.

    Kennard, O. 1987. Pp. 25–52 in F. Eckstein and D. M. J. Lilley, eds. Nucleic acids and molecular biology. Springer, Heidelberg, Germany.

    Levinson, G., and G. A. Gutman. 1987. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. 4:203–221.

    McDade, L. A., and M. L. Moody. 1999. Phylogenetic relationships among Acanthaceae: evidence from noncoding trnL-trnF chloroplast DNA sequences. Am. J. Bot. 86:70–80.

    Maddison, W. P., and D. R. Maddison. 1992. MacClade version 3.04. Analysis of phylogeny and character evolution. Sinauer, Sunderland, Mass.

    Manen, J. F., P. Cuénoud, and M. D. P. Martinez. 1998. Intralineage variation in the pattern of rbcL nucleotide substitution. Plant Syst. Evol. 211:103–112.

    Manen, J.-F., and A. Natali. 1995. Comparison of the evolution of ribulose-1, 5-biphosphate carboxylase (rbcL) and atpB-rbcL noncoding spacer sequences in a recent plant group, the tribe Rubieae (Rubiaceae). J. Mol. Evol. 41:920–927.

    Michel, F., and B. Dujon. 1983. Conservation of RNA secondary structures in two intron families including mitochondrial-, chloroplast- and nuclear-encoded members. EMBO J. 2:33–38.

    Möller, M., and Q. C. B. Cronk. 1997. Origin and relationships of Saintpaulia (Gesneriaceae) based on ribosomal DNA internal transcribed spacer (ITS) sequences. Am. J. Bot. 84:956–965.

    Molvray, M., P. J. Kores, and M. W. Chase. 1999. Phylogenetic relationships within Korthalsella (Viscaceae) based on nuclear ITS and plastid trnL-F sequence data. Am. J. Bot. 86:249–260.

    Morton, B. R., and M. T. Clegg. 1995. Neighboring base composition is strongly correlated with base substitution bias in a region of the chloroplast genome. J. Mol. Evol. 41:597–603.

    Morton, B. R., V. M. Oberholzer, and M. T. Clegg. 1997. The influence of specific neighboring bases on substitution bias in noncoding regions of the plant chloroplast genome. J. Mol. Evol. 45:227–231.

    Olmstead, R. G., and P. A. Reeves. 1995. Evidence for the polyphyly of the Scrophulariaceae based on chloroplast rbcL and ndhF sequences. Ann. Mo. Bot. Gard. 82:176–193.

    Page, R. D. M., and E. C. Holmes. 1998. Molecular evolution: a phylogenetic approach. Blackwell Science, London.

    Purvis, A., and L. Bromham. 1997. Estimating the transition/transversion ratio from independent pairwise comparisons with an assumed phylogeny. J. Mol. Evol. 44:112–119.

    Rousset, F., M. Pélandakis, and M. Solignac. 1991. Evolution of compensatory substitutions through G·U intermediate state in Drosophila rRNA. Proc. Natl. Acad. Sci. USA 88:10032–10036.

    Sloane, D. L., M. F. Goodman, and H. Echols. 1988. The fidelity of base selection by the polymerase subunit of DNA polymerase-III holoenzyme. Nucleic Acids Res. 16:6465–6475.

    Sokal, R. R., and F. J. Rohlf. 1995. Biometry. 3rd edition. W. H. Freeman, New York.

    Vairapandi, M., and N. J. Duker. 1994. Excision of ultraviolet-induced photoproducts of 5-methylcytosine from DNA. Mutat. Res. 315:85–94.

    Wakeley, J. 1996. The excess of transitions among nucleotide substitutions: new methods of estimating transition bias underscore its significance. Trends Ecol. Evol. 11:158–163.

    Wikström, N., P. Kenrick, and M. W. Chase. 1999. Epiphytism and terrestrialization in tropical Huperzia (Lycopodiaceae). Plant Syst. Evol. 218:221–243.

    Williams, S. E., V. A. Albert, and M. W. Chase. 1994. Relationships of Droseraceae: a cladistic analysis of rbcL sequence and morphological data. Am. J. Bot. 81:1027–1037.

    Xiong, B., and T. D. Kocher. 1993. Phylogeny of sibling species of Simulium venustum and S. verecundum (Diptera: Simuliidae) based on sequences of the mitochondrial 16S rRNA gene. Mol. Phylogenet. Evol. 2:293–303.

    Yang, Z. B. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39:105–111.

Accepted for publication March 27, 2000.