Patterns of Nucleotide Substitution Among Simultaneously Duplicated Gene Pairs in Arabidopsis thaliana

Liqing Zhang*, Todd J. Vision{dagger} and Brandon S. Gaut*

*Department of Ecology and Evolutionary Biology, University of California;
{dagger}Department of Biology, University of North Carolina at Chapel Hill


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
We characterized rates and patterns of synonymous and nonsynonymous substitution in 242 duplicated gene pairs on chromosomes 2 and 4 of Arabidopsis thaliana. Based on their collinear order along the two chromosomes, the gene pairs were likely duplicated contemporaneously, and therefore comparison of genetic distances among gene pairs provides insights into the distribution of nucleotide substitution rates among plant nuclear genes. Rates of synonymous substitution varied 13.8-fold among the duplicated gene pairs, but 90% of gene pairs differed by less than 2.6-fold. Average nonsynonymous rates were {approx}fivefold lower than average synonymous rates; this rate difference is lower than that of previously studied nonplant lineages. The coefficient of variation of rates among genes was 0.65 for nonsynonymous rates and 0.44 for synonymous rates, indicating that synonymous and nonsynonymous rates vary among genes to roughly the same extent. The causes underlying rate variation were explored. Our analyses tentatively suggest an effect of physical location on synonymous substitution rates but no similar effect on nonsynonymous rates. Nonsynonymous substitution rates were negatively correlated with GC content at synonymous third codon positions, and synonymous substitution rates were negatively correlated with codon bias, as observed in other systems. Finally, the 242 gene pairs permitted investigation of the processes underlying divergence between paralogs. We found no evidence of positive selection, little evidence that paralogs evolve at different rates, and no evidence of differential codon usage or third position GC content.


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
Synonymous and nonsynonymous substitution rates vary among genes, but the evolutionary forces underlying rate variation remain unclear. For nonsynonymous substitutions, substitution rates vary in part because proteins differ in their degree of selective constraint. For example, histone genes probably evolve slowly because most amino acid changes disrupt gene function and are therefore selectively deleterious (Nei 1987Citation , pp. 47–52). Other factors also contribute to nonsynonymous rate variation, including the physical location of the gene in the genome (Wolfe, Sharp, and Li 1989aCitation ; Williams and Hurst 2000Citation ) and mutation biases (Bird 1980Citation ; Tsunoyama, Bellgard, and Gojobori 2001Citation ).

It is even more difficult to identify causes of synonymous rate variation among genes, but at least three molecular characteristics correlate with synonymous substitution rates. The first characteristic is GC content (Ticher and Graur 1989Citation ), which affects synonymous substitution rates because CpG dinucleotides are subject to high rates of mutation (Tsunoyama, Bellgard, and Gojobori 2001Citation ). The second characteristic is codon bias. Highly biased genes evolve more slowly, on average (Sharp and Li 1987Citation ; Akashi 1994aCitation , 2001Citation ; Eyre-Walker and Bulmer 1995Citation ), presumably because selection for codon use limits the number of acceptable synonymous nucleotide substitutions in highly biased genes. Finally, mutation rates can differ among genomic regions, contributing to variation in synonymous substitution rates among genes (Wolfe, Sharp, and Li 1989aCitation ; Matassi, Sharp, and Gautier 1999Citation ). However, none of these three factors alone consistently explains variation in synonymous substitution rates (Ticher and Graur 1989Citation ; Wolfe, Sharp, and Li 1989aCitation ; Akashi 1994b, 1997Citation ; Matassi, Sharp, and Gautier 1999Citation ), and the interdependence between these factors is not always clear.

An important prerequisite for understanding causes of rate variation is to characterize the distribution of rates among genes. To date, most studies of the distribution of synonymous and nonsynonymous substitution rates have been restricted to animals. In plants, several studies have examined rate variation across evolutionary lineages of mitochondrial and chloroplast genes (e.g., Gaut et al. 1992Citation ; dePamphilis, Young, and Wolfe 1997Citation ; Laroche, Maggia, and Bousquet 1997Citation ), but few studies have either characterized rate variation among plant nuclear genes or discerned the factors contributing to rate variation (Wolfe, Sharp, and Li 1989bCitation ; Alvarez-Valin et al. 1999Citation ). One possible reason for the dearth of plant studies is that most plant nuclear genes are members of multigene families. Copy number within multigene families fluctuates (Clegg, Cummings, and Durbin 1997Citation ), and as a result it is often difficult to identify orthologs between species and hence to compare substitution rates among genes.

The recently sequenced Arabidopsis thaliana genome offers a unique opportunity to study substitution rate variation across plant nuclear genes. Many large arabidopsis chromosomal segments are duplicated (Mayer et al. 1999Citation ; AGI 2000Citation ), and these segments contain genes that were likely duplicated contemporaneously. Genetic distances can be compared among pairs of duplicated sequences (or gene pairs). For this special case in which the time of duplication is equivalent for all gene pairs, comparing genetic distances among gene pairs is equivalent to comparing nucleotide substitution rates among gene pairs.

In addition to providing insight into rate variation among genes, sequence data from duplicated chromosomal regions also facilitate characterization of patterns of sequence divergence between paralogs. Divergence patterns between paralogs have been of interest since Ohno (1970)Citation suggested that most adaptive changes occur after gene duplication (see also Ohta and Kimura 1973Citation ). At the molecular level, adaptive changes can be detected by measuring the ratio of nonsynonymous (Ka) to synonymous substitution (Ks). A Ka/Ks ratio >1 is strong evidence for positive selection having acted during sequence divergence, and a Ka/Ks < 1 is consistent with purifying selection, although it does not rule out the possibility that positive selection acted. Although the sensitivity of Ka/Ks to detect positive selection can be low (Hughes 1999Citation ), the distribution of Ka/Ks among genes can be useful for characterizing the relative strengths of evolutionary forces acting on individual genes (Charlesworth, Charlesworth, and McVean 2001)Citation .

Sequence divergence can have other measurable effects. For example, shifts in selective constraint after duplication can lead to changes in rates of nucleotide substitution in one or both of the duplicated sequences (Goodman, Moore, and Matsuda 1975Citation ; Li and Gojobori 1983Citation ; Gonzalez and Jordan 2000Citation ). Similarly, paralogs often diverge in synonymous codon usage (Gaut et al. 1999Citation ; Zhang, Kosakovsky-Pond, and Gaut 2001bCitation ), either as a function of changes in mutation biases or as a function of shifts in gene expression (Fennoy and Bailey-Serres 1993Citation ; Duret and Mouchiroud 1999Citation ).

The large number of duplicated genes in A. thaliana permits unprecedented examination of rate variation among plant nuclear genes and also facilitates study of patterns of evolutionary divergence between paralogs. Here we characterize patterns of molecular evolution in a duplicated region between arabidopsis chromosomes 2 and 4. With data from 242 gene pairs, we address the following questions: (1) What is the distribution of synonymous and nonsynonymous substitution rates among gene pairs? (2) Are synonymous and nonsynonymous substitution rates a function of physical location, GC content, or codon usage? (3) Is there evidence for positive selection acting during the divergence of paralogs? (4) Is sequence divergence between paralogs accompanied by divergence in codon usage or evolutionary rate?


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
Data set
A duplicated region of 5.6 Mb on chromosome 2 and 4.6 Mb on chromosome 4 was initially identified by Terryn et al. (1999)Citation and subsequently extended by Lin et al. (1999)Citation . We decided to study the genes of this region for two reasons. First, this is one of the largest duplicated segments in the A. thaliana genome. Second, this segment duplicated {approx}100 MYA (Vision, Brown, and Tanksley 2000Citation ), whereas other segments are more ancient and therefore too diverged for reasonable estimates of evolutionary rates.

The chromosome 2–4 duplication was further partitioned into five blocks (numbered 45, 49, 52, 54, and 56 by Vision, Brown, and Tanksley 2000Citation ) that differ in their order and orientation relative to one another on the two chromosomes but show only minor within-block rearrangement (fig. 1 ). A minimum of four inversions is necessary to convert the order and orientation of blocks on one chromosome into that of the other. It is thought that these blocks arose from a single duplication event, both because of their suggestive spatial arrangement and based on a genome-wide analysis of the distribution of amino acid distances between duplicated genes (Vision, Brown, and Tanksley 2000Citation ). It should be emphasized, however, that our analyses do not assume that all blocks were duplicated at the same time, but we do assume that gene pairs within a single block were duplicated contemporaneously when they were in collinear order.



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 1.—Dot-plot alignment of putative paralogous genes between chromosomes 2 and 4. The coordinates are the physical order of predicted genes in each chromosome, counting tandem arrays only once. The 242 genes used in this article are marked by filled circles. Putative paralogous gene pairs removed from the data set are marked by open circles. Identification of paralogous gene pairs, gene coordinates, and block numbers is from Vision, Brown, and Tanksley (2000)Citation

 
Our initial task was to identify gene pairs for further study. In the data set compiled by Vision, Brown, and Tanksley (2000)Citation , the five blocks contained 1183 genes (or tandem arrays of gene families) on chromosome 2, 1168 genes on chromosome 4, and 326 genes duplicated between chromosomes. In 52 of these 326 gene pairs, at least one copy belonged to a tandem array. We excluded these genes from further analyses because of uncertainty as to the relative timing of tandem and chromosomal duplication events. Of the remaining 274 gene pairs, three were excluded from subsequent analyses because they proved to be too diverged to allow reasonable evolutionary inference. In addition, we excluded 29 genes that were suspect either because they were not in the optimal collinear order of gene pairs between the two chromosome copies or because they were too distant from other gene pairs. Collinearity was assessed by computing the longest common subsequence of gene pairs (Gusfield 1997Citation , pp. 287–293). The distance between gene pairs was measured by the number of intervening genes on each chromosome. There are two such distances for a terminal gene pair in a block, whereas there are four for an internal gene pair. We eliminated gene pairs from analysis that were overly distant from neighboring genes by only allowing one (in the case of terminal gene pairs) or two (in the case of internal gene pairs) distances to exceed a value of twelve. This cutoff was chosen to coincide with two standard deviations (SDs) about the mean distance calculated using all gene pairs.

The accession numbers of the 484 sequences used in the analyses are available at bgbox.bio.uci.edu/data/lq2acn.html.

Alignment
Amino acid alignments were obtained with ClustalW (Thompson, Higgins, and Gibson 1994Citation ), using default parameters. Visual inspection revealed that the alignments around gaps were sometimes ambiguous, and we therefore analyzed two alignment data sets. In the first data set, the residues around gaps were included. In the second data set, five amino acids were removed on either side of each gap. The two sets of amino acid alignments were then back translated using the known coding sequences to produce DNA sequence alignments. Gap treatment did not qualitatively affect our results, and we therefore report only the results for the data set with the original alignments. The DNA sequence alignments are available at bgbox.bio.uci.edu/.

Sequence Analyses
We calculated the percent identity of DNA sequences and protein sequences for all gene pairs. The expected relationship between DNA and amino acid identity under strictly neutral evolution was obtained by simulation, using Evolver (Yang 1997Citation ). We simulated DNA sequences with distances ranging from 0.1 to 2.0 along distance intervals of 0.05 units, corresponding to the range of distances observed between duplicated genes in our data set. For each distance, 20 pairs of DNA sequences of 500 codons were simulated. For all simulated data, percent identity was calculated from DNA sequences and translated protein sequences.

Ks and Ka between duplicated sequences were estimated by the maximum likelihood (ML) method implemented in PAML (Yang 1997Citation ), using the codon model of Goldman and Yang (1994)Citation . We used a likelihood ratio (LR) comparison to test for a Ka/Ks ratio different from 1.0. To do this, two models were applied to the data: model 0 constrains the Ka/Ks ratio to 1.0, and model 1 estimates the Ka/Ks ratio as a free parameter. The LR of model 0 and model 1 was compared with the {chi}2 distribution with one degree of freedom, as detailed by Yang (1998)Citation .

Relative rate tests were performed with HYPHY (http://peppercat.statgen.ncsu.edu/~hyphy/), for both protein sequences and DNA sequences. In order to apply the relative rate test, we obtained outgroup sequences for a large number of the gene pairs. Ku et al. (2000)Citation proposed that the duplication between chromosomes 2 and 4 occurred after the split of the tomato and A. thaliana lineages. If this is true, then plants as divergent, or more divergent, from A. thaliana than tomato can be used as the source of outgroup sequences. The Rosidae are the largest phylogenetic grouping in the NCBI taxonomy that contains A. thaliana and excludes tomato. Accordingly, we searched for candidate outgroup sequences among all GenBank records derived from angiosperms, excluding the Rosidae. TBLASTX (Altschul et al. 1997Citation ) was used to find matches between the 242 genes and the DNA sequences in the database. We required that both arabidopsis genes showed a strong match (expected value <1 x 10-10) to at least one of the sequences in the database for which a predicted or experimentally determined protein translation was available. In cases where multiple GenBank records met these criteria, the outgroup was selected from among the following NCBI-derived taxonomic groupings, in descending order of preference (and increasing taxonomic breadth): Asteridae (which includes tomato), core eudicots, eudicotyledons, and Magnoliophyta.

Each relative rate test was based on one gene pair and its outgroup. For DNA sequences, synonymous and nonsynonymous substitution rates were tested separately using the codon substitution model of Muse and Gaut (1994)Citation . For amino acid sequences, we used the substitution model of Dayhoff, Schwartz, and Orcutt (1978)Citation . The relative rate test uses an LR statistic to test the null hypothesis that two sequences evolve at equal rates (Muse and Weir 1992Citation ).

The effective number of codons (ENC) and GC content at synonymous third codon positions (GCS) were calculated using codon W (http://www.molbiol.ox.ac.uk/cu/). We used the ENC as a measure of codon usage because it is not biased by gene length, given a certain minimum length, or by amino acid composition (Wright 1990Citation ; Comeron and Aguade 1998Citation ). ENC values range from 20 to 61; a value of 20 represents extreme bias, and a value of 61 indicates random codon use. We tested for homogeneity of ENC and GCS between duplicated sequences by a permutation procedure described previously (Zhang, Kosakovsky-Pond, and Gaut 2001bCitation ), with the slight modification that the test statistic was the difference in ENC (or GCS) between sequences rather than the variance in ENC (or GCS) among sequences. Test statistics were based on 1000 permutations.

Spatial correlation of Ka and Ks
To examine the relationship between evolutionary distance and physical position, we performed a spatial correlation analysis. The following statistic (Chatfield 1999Citation , p. 20) was calculated for a range of distances between gene pairs, where distance was defined as the number of intervening genes between gene pairs:


where xi is the value of Ka (or Ks) for the gene pair at position i, xi+d is the value of Ka (or Ks) for the gene pair separated by d - 1 intervening genes, and s2x are the mean and sample variance of Ka (or Ks), and n is the number of possible comparisons. This statistic tends toward zero when the Ka (or Ks) values of gene pairs are uncorrelated at distance d, approaches 1.0 when they are tightly correlated, and approaches negative 1.0 when they are inversely correlated. To test for significant spatial correlations, the observed values were compared with the tail values of Cd calculated on 1000 permuted data sets generated by assigning the original Ka (or Ks) values to random positions on the chromosome. The analyses were performed separately for each block and for all blocks simultaneously.


    Results
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
The Distribution of Substitution Rates Among Gene Pairs
We collected 242 gene pairs duplicated between chromosome 2 and 4 that were in compact collinear order and not duplicated in tandem. Their chromosomal locations are shown in figure 1 . Among all gene pairs, DNA sequence identity ranged from 46.8% to 92.8%, and protein sequence identity ranged from 33.2% to 100%. Over the entire region, Ka had a range of 0.0005–0.78 nonsynonymous substitutions per nonsynonymous site and a mean of 0.17. The range of Ks was 0.31–4.27 synonymous substitutions per synonymous site, with a mean of 0.82 (table 1 and fig. 2 ). Despite the larger range of Ks, coefficients of variation (CV) of Ka and Ks were similar, with a 1.5-fold difference between them (table 1 ). Furthermore, Ka and Ks were positively and significantly correlated among genes within each block, just as they were correlated over the entire five-block region (Kendall's rank correlation {tau} = 0.70, P < 0.001; table 1 ).


View this table:
[in this window]
[in a new window]
 
Table 1 Ka, Ks and Ka/Ks Statistics

 


View larger version (13K):
[in this window]
[in a new window]
 
Fig. 2.—The distribution of Ka, Ks and Ka/Ks over the entire region. The line represents the cumulative distribution of each measure

 
Gene pairs within individual blocks exhibited a similar range of DNA and protein identities to the entire region, but there were apparent differences among blocks. For example, block 45 had a relatively high synonymous CV, and blocks 52 and 54 had a relatively narrow range of Ka and K s (table 1 ). These observations could reflect sampling phenomena because some blocks contained small numbers of duplicated genes. To test for among-block rate variation, we performed an ANOVA. The ANOVA indicated that Ka and Ks are homogeneous among blocks (Ka: P = 0.45, Ks: P = 0.09), further suggesting that the blocks may have been duplicated contemporaneously.

Over the whole region, 58% (141/242) of gene pairs had higher DNA than protein identity, and DNA sequence identity was always higher than protein sequence identity when amino acid identity was greater than {approx}80% (fig. 3 ). To determine the relationship between DNA and protein identities relative to a model of strictly neutral evolution, we simulated pairs of gene sequences under neutrality. For simulations, Ka/Ks was set to 1.0; {kappa}, the transition and transversion parameter was set to 3.0; and the codon frequency matrix was based on tabulated frequencies from 14,647,315 A. thaliana codons (http://www.kazusa.or.jp/codon/). The simulations indicated that DNA sequence identity was always higher than protein sequence identity under the neutral model (fig. 3 ), and the result was qualitatively similar with different transition:transversion ratios ({kappa} = 2, 3, or 4; data not shown). In our arabidopsis data set, amino acid sequence identity was always higher than that for simulated gene pairs with equivalent DNA sequence identity (fig. 3 ). Thus, selective constraint appears to be slowing the pace of nonsynonymous nucleotide substitution for all the gene pairs in our data set.



View larger version (16K):
[in this window]
[in a new window]
 
Fig. 3.—The relationship between DNA sequence identity and protein sequence identity for all 242 gene pairs. The simulated data are represented by gray dots. The 242 gene pairs are represented by black marks. The line represents equal DNA sequence identity and protein sequence identity

 
Examining Factors That Contribute to Rate Variation Among Gene Pairs
Four factors are commonly cited as contributing to rate variation among genes: (1) physical location within the genome, (2) GC content, (3) codon usage bias, and (4) differing levels of selective constraint on amino acid substitutions. In this section, we examine the relationship between rate variation and the first three factors.

We tested the relationship between substitution rate and physical distance with two methods. First, we applied a spatial autocorrelation test to both Ka and Ks. There were no significant results with Ka, but autocorrelation for Ks was greater than expected at the 5% significance level for some distances—for example, genes separated by 10 genes are more highly correlated than expected at random (fig. 4 ). We should note, however, that Ks autocorrelation statistics do not demonstrate significantly high autocorrelation in neighboring genes, as expected under the hypothesis that physical location is a contributing factor to variation in synonymous substitution rates among genes. Thus, the Ks results are difficult to interpret but suggest only weakly that physical location contributes to synonymous rate variation among genes. Second, we applied the permutation method of Williams and Hurst (2000)Citation to test the null hypothesis that rates are random with respect to physical location. This test also failed to reject the null hypothesis for Ka (P = 0.66), but the test was borderline significant for Ks (P = 0.067). Thus, both tests provide suggestive, but inconclusive, evidence that there could be a relationship between Ks and physical distance.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 4.—Spatial autocorrelation (Cd statistic) for Ka and Ks. The solid line is the observed Cd at a given distance. The distance is measured as the number of intervening genes. The broken lines show 95% and 99% confidence intervals

 
GC content and codon bias have also been shown to correlate with evolutionary rate (Ticher and Graur 1989Citation ; Tsunoyama, Bellgard, and Gojobori 2001Citation ). We calculated ENC and GCS for each gene pair. Among all genes, ENC ranged from 35.4 to 61.0, and GCS varied from 0.27 to 0.68. We examined correlations among GCS, codon bias, and evolutionary rates with Kendall's nonparametric test of correlation (table 2 ). The results are inconsistent among blocks, but two correlations stand out. The first is the correlation between Ka and GCS, which is negative in four of five blocks and significantly negative over the entire region. The second correlation is between Ks and ENC, which is positive in four of five blocks and significantly positive over the entire region.


View this table:
[in this window]
[in a new window]
 
Table 2 Kendall's Correlation ({tau}) Among GCS, ENC and Substitution Rate, with Probability Value \[cf4\]P\[cf1\]

 
The Distribution of Ka/Ks Across Gene Pairs
To examine whether positive selection commonly contributed to sequence divergence between paralogs, we estimated Ka/Ks for all gene pairs and tested whether Ka/Ks exceeded 1.0. Estimated Ka/Ks values ranged from 0.00097 to 0.677, with a mean of 0.02 (table 1 and fig. 2 ). For all but one of the gene pairs, a model in which Ka/Ks was estimated from the data fit significantly better at the P < 0.05 level than did a model that constrained Ka/Ks to 1.0. Most (236 of 242) tests remained significant after Bonferroni correction. Altogether Ka/Ks estimates indicated that there was neither positive selection (Ka/Ks > 1.0) nor strictly neutral evolution (Ka/Ks = 1.0) governing sequence divergence in the vast majority (236/241 = 98%) of gene pairs. These conclusions are consistent with the analysis of protein and DNA sequence identity (fig. 3 ).

Divergence in Evolutionary Rate and Codon Usage Between Paralogs
We performed relative rate tests to examine whether sequences within a gene pair evolve at different evolutionary rates. Relative rate tests require an outgroup for each gene pair. Using the search strategy described in Materials and Methods, we identified putative outgroup sequences for 105 gene pairs. To examine the validity of these outgroup assignments, we examined the three pairwise genetic distances for each gene pair and its outgroup. In all 105 cases, paralogs had a smaller distance (as measured by both Ka and Ks) to one another than to the outgroup sequence (data not shown), suggesting that our outgroup assignments were reasonable.

Few paralogs demonstrated deviation from clock-like evolution after divergence. Relative rate tests using protein sequences resulted in 14 significant tests at P < 0.05; three remained significant after Bonferroni correction for an experiment-wide error of {alpha} = 0.05. The relative rate tests for nonsynonymous substitution rates provided similar results, with 24 significant gene pairs; three remained significant after Bonferroni correction. Ten of the 14 gene pairs that were significant for protein sequences were also significant for Ka-based tests, showing that results were reasonably consistent between methods. For Ks, six of 105 gene pairs were significant at P < 0.05, but only one remained significant after Bonferroni correction.

To further characterize patterns of sequence divergence between duplicated genes, we measured both GCS and ENC for all gene pairs. Table 3 provides average ENC and GCS for both chromosomes and all blocks. We applied paired t-tests to determine whether the two chromosomes differ significantly for either measure; no significant difference was detected between chromosomes for either ENC (t = 0.478, P = 0.633) or GCS (t = -1.591, P = 0.112). This result did not preclude the possibility that any two paralagous sequences had diverged significantly in codon usage. To test this possibility, we applied a permutation test of homogeneous ENC (or GCS) to each of the 242 gene pairs (Zhang, Kosakovsky-Pond, and Gaut 2001bCitation ). No paralogs exhibited significant difference in either measure after Bonferroni correction (data not shown), and thus there has been no detectable divergence in codon usage or GC content after duplication.


View this table:
[in this window]
[in a new window]
 
Table 3 The Mean and Standard Deviation of ENC and GCS for Chromosomes 2 and 4

 

    Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
In this study, we examined sequence divergence among gene pairs located in the duplicated region between chromosomes 2 and 4 of A. thaliana. Gene pairs were studied under the assumption that they were duplicated contemporaneously. We took three precautions to ensure the accuracy of this assumption. First, the chromosome 2–4 duplication consists of five rearranged blocks that presumably owe their duplication to the same event, but we studied each block separately as well as in combination. Second, we did not consider arrayed genes in tandem because it is difficult to determine the relative timing of tandem and chromosomal duplication events. Finally, we studied only genes that occur in compact collinear arrangement between the two chromosomes; collinearity reinforces the notion that the gene pairs originated during a single duplication event.

Our study differs from previous studies of rate variation among genes in that we used paralogous, rather than orthologous, sequences to estimate substitution rates. One potential complication is that gene conversion can occur between paralogs. If gene conversion has occurred in some gene pairs but not others, the net effect is to increase the range and variance of Ka and Ks among gene pairs. We tested for the presence of gene conversion in all gene pairs with Sawyer's (1989)Citation method but found no evidence for conversion between paralogs (data not shown). We should note, however, that most tests for gene conversion detect events that affect a portion of a gene rather than a complete gene.

Despite the tendency of gene conversion to inflate the range of substitution rates, we found Ks ranges similar to previous studies. For example, Ks ranged 15-fold in a study of 24 Drosophila orthologs (Zeng et al. 1998Citation ) and 20-fold in a study of 363 rat-mouse orthologs (Wolfe and Sharp 1993Citation ). In our study, synonymous substitution rates varied up to 10.4-fold within an individual block and 13.8-fold over the entire data set (table 1 and fig. 2 ).

The distribution of Ka and Ks from arabidopsis gene pairs differs from previous studies of non-plant taxa in two noteworthy ways. First, the ratio of mean synonymous to mean nonsynonymous substitution rates is relatively low. The ratio is {approx}5 for our arabidopsis data (table 1 ), but it is as high as 24 for bacterial genes (Sharp 1991Citation ) and {approx}7 for rat-mouse comparisons (Wolfe and Sharp 1993Citation ), based on comparisons between orthologous sequences. Differences in this ratio may reflect different population sizes and life histories (Sharp 1991Citation ), but may also reflect differences between ortholog and paralog comparisons. For example, Kondrashov et al. (2002)Citation recently found that the ratio of synonymous to nonsynonymous substitution is {approx}twofold lower between paralogs than orthologs. Thus, the forces affecting this ratio remain unclear at present. Second, the ratio of CVs between nonsynonymous and synonymous rates is lower for this study than previous studies. In rat-mouse comparisons, for example, the CV for nonsynonymous rates was fourfold higher than the CV for synonymous rates (Wolfe and Sharp 1993Citation ). Among arabidopsis gene pairs, there is only a 1.5-fold difference between nonsynonymous and synonymous CVs (table 1 ), indicating that synonymous and nonsynonymous rates vary among genes to roughly the same extent.

Our analyses represent the most extensive comparison of rates among plant nuclear genes to date. Some previous studies have been based on a much smaller number of sequence comparisons. For example, Wolfe, Sharp, and Li (1989b)Citation and Gaut (1998)Citation found that synonymous rates vary up to 2.5-fold among genes, but these studies were based on only 11 and nine nuclear genes, respectively. A more recent study made 212 orthologous sequence comparisons between arabidopsis and Brassica rapis (Tiffin and Hahn 2002Citation ), but relied on EST data, which could bias results depending on the accuracy of EST data and the gene regions sequenced. Nonetheless, Tiffin and Hahn (2002)Citation described two rate characteristics that are similar to our results: (1) an {approx}24-fold range of synonymous rate variation among genes and (2) a twofold difference in CV between nonsynonymous and synonymous rates. Together these two plant studies indicate that synonymous rates vary at least an order of magnitude among plant nuclear genes and also that variation in nonsynonymous and synonymous rates are relatively similar among genes, based on the CV. It is clear, however, that a general picture of plant nuclear gene evolution requires additional studies, particularly given the fact that different plant lineages evolve with different mutational patterns (Tiffin and Hahn 2002Citation ) and different rates (e.g., Eyre-Walker and Gaut 1997Citation ).

With the increasing availability of genome sequences, several studies have used genetic distances (either Ka or Ks) to make genome-wide inferences about the timing and frequency of gene and genome duplication (Gaut and Doebley 1997Citation ; Lynch and Conery 2000Citation ; Vision, Brown, and Tanksley 2000Citation ; Friedman and Hughes 2001Citation ). For example, Lynch and Conery (2000)Citation used Ks values between duplicated arabidopsis sequences as a proxy for divergence time; in essence they assumed that substitution rates were homogeneous among gene pairs. Their use of Ks has come under criticism (Long and Thornton 2001Citation ; Zhang, Gaut, and Vision 2001aCitation ), but this study provides quantitative insights about the degree to which the assumption can be misleading. In a worst-case scenario, Ks provides time estimates that differ up to 14-fold (table 1 ). However, the average effect of the homogeneous rate assumption is not nearly as dramatic because ~90% of the gene pairs in this study fall within a Ks range of 0.462–1.188 (fig. 2 ), and thus most genes (~90%) vary less than 2.57-fold in rate. Overall, however, our study indicates that Ks variation among genes is generally higher than previously reported for plants (Wolfe, Sharp, and Li 1989bCitation ; Gaut 1998Citation ), and thus one should be cautious when equating Ks with time.

Factors Contributing to Rate Variation Among Gene Pairs
What are the forces that contribute to variation in evolutionary rates among gene pairs? The first possibility, discussed earlier, is gene conversion. Although we did not detect gene conversion in any of our gene pairs, the possibility of gene conversion should not be ignored. However, the fact that the range of variation found in our 242 gene pairs is similar to that of orthologous gene pairs from other studies (rat-mouse, drosophilids and Arabidopsis-Brassica; see earlier) suggests that the effects of gene conversion have been negligible. In addition, a similar study of yeast paralogs found no evidence of gene conversion, suggesting gene conversion may not be widespread (Pál, Papp, and Hurst 2001Citation ).

The origin of duplicated chromosomes can also affect the variance in rates across gene pairs. If chromosomal duplication originated via a polyploid event, the original gene pairs contained varied levels of residual standing variation (polymorphism) at the onset of duplication. Standing variation contributes to variation in genetic distances among gene pairs (discussed in Gaut and Doebley 1997Citation ). However, for highly diverged gene pairs like those we have studied here, the level of standing variation at the time of origin should be very low relative to the total divergence between paralogs. Hence the origin effect probably contributes little to the observed variance in substitution rates among the 242 arabidopsis gene pairs.

Physical location is a third potential contributor to rate variation among genes. For example, Williams and Hurst (2000)Citation found that nonsynonymous substitution rates between mouse and rat vary as a function of genome location. Similarly, Matassi, Sharp, and Gautier (1999)Citation documented location effects on synonymous substitution rates in human-rodent comparisons. Both of these effects likely reflect variable mutation rates along chromosomes (Casane et al. 1997Citation ; Lercher, Williams, and Hurst 2001Citation ). We examined the relationship between physical location and nucleotide substitution rates with two methods, and neither analysis provided evidence that physical location affects nonsynonymous substitution rates. In contrast, there is a hint that physical location and synonymous substitution rates are correlated, but the effect, if present, is weak. It should be noted that previous studies focused on whole genomes, with neighboring genes up to 5 cM apart (Matassi, Sharp, and Gautier 1999Citation ; Williams and Hurst 2000Citation ; Lercher, Williams, and Hurst 2001Citation ), which is roughly 10 Mb in humans (Williams and Hurst 2000Citation ). In contrast, our entire study focused on a region of 5 Mb, with neighboring genes separated by only a few kb. If the physical scale that affects substitution rates is large, our analyses could miss location effects.

Variation in selective constraint among proteins also contributes to rate variation among gene pairs. Deviation from a molecular clock is one potential measure of variation in selective constraint between paralogs, but on the whole our analyses uncovered little deviation from clock-like evolution. With some caveats, Ka/Ks can also be used for characterizing the relative strength of evolutionary forces acting on individual gene pairs. For the 242 gene pairs in our study, Ka/Ks ranged from 0.0 to 0.70, with a mean of 0.20, suggesting extensive variation in selective constraint among gene pairs. However, Ka/Ks never exceeded 1.0 (table 1 and fig. 3 ), and there is thus no evidence that positive selection has driven divergence between any of the paralogs on the duplicated regions of chromosomes 2 and 4. Altogether, variation in selective constraint, but not positive selection, likely contributes to variation in substitution rates among arabidopsis gene pairs.

Both GC content and codon bias have been shown to correlate with nucleotide substitution rates (Moriyama and Gojobori 1992Citation ; Alvarez-Valin et al. 1999Citation ), and hence GC content and codon bias may also contribute to nucleotide substitution rate variation among gene pairs. We detected no correlations between GCS and Ks in arabidopsis gene pairs, but did find a negative correlation between GCS and Ka. The phenomena underlying the negative correlation are unclear, but the lack of correlation between GCS and Ks is not surprising given the relatively homogeneous base composition of the A. thaliana genome (Barakat, Matassi, and Bernardi 1998Citation ; AGI 2000Citation ). We also found a correlation between codon bias and Ks that is similar to that documented in other species (Sharp and Li 1987Citation ; Moriyama and Hartl 1993Citation , pp. 847–858; Akashi 1994bCitation ). This correlation is consistent with strong codon bias (low ENC) limiting acceptable synonymous changes and thus retarding substitution rates.

One interesting feature of codon bias is that it often also correlates with gene expression level—i.e., highly expressed genes are more biased (Gouy and Gautier 1982Citation ; Duret and Mouchiroud 1999Citation ; Coghlan and Wolfe 2000Citation ). Surprisingly, highly expressed genes also evolve with relatively low Ka (Duret and Mouchiroud 2000Citation ). Both of these observations suggest that yet another factor—gene expression—contributes to variation in Ka and perhaps Ks among genes. To briefly explore this possibility with the arabidopsis gene pairs, we employed the database procedures of Duret and Mouchiroud (2000)Citation and counted the number of BLAST hits for each gene pair to an arabidopsis EST database (http://www.kazusa.or.jp/en/plant/arabi/EST/). We then regarded the number of hits as a rough approximation of the expression level of a gene pair (this approximation is particularly rough given the diverse origin and preparation methods of cDNA libraries used for EST sequencing), and compared this expression level with evolutionary rates. Based on pairwise correlations among Ka, Ks and the number of EST hits, there is a significantly negative partial correlation between expression and Ka (r = -0.172, P = 0.007) but no significant correlation between expression and Ks (r = -0.037, P = 0.57). More detailed analyses require better expression data, but this significant negative correlation is consistent with recent ideas that evolutionary rates may be either proximally or secondarily a function of gene expression (reviewed in Akashi 2001Citation ).


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
We thank A. McLysaght, M. Tenaillon, M. Mondragon, M. Clegg, and P. Tiffin for valuable comments, and S. Muse provided valuable advice. We also thank S. Kosakovsky-Pond and Z. Yang for assistance with Hyphy and PAML, respectively. This work was funded by NSF grants DBI-9872631 and DEB-0104796. T.J.V. was supported by the USDA-ARS Center for Agricultural Bioinformatics and the Cornell Theory Center.


    Footnotes
 
Brian Golding, Reviewing Editor

Address for correspondence and reprints: Brandon S. Gaut, Department of Ecology and Evolutionary Biology, 321 Steinhaus Hall, University of California, Irvine, California 92697-2525. E-mail: bgaut{at}uci.edu Back

Keywords: nucleotide substitution rates positive selection codon usage regional mutation Back


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 

    AGI (Arabidopsis Genome Initiative). 2000 Analysis of the genome sequence of the flowering plant Arabidopsis thaliana Nature 408:796-815[ISI][Medline]

    Akashi H., 1994a. Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA Genetics 139:1067-1076[Abstract/Free Full Text]

    ———. 1994b. Synonymous codon usage in Droshophila melanogaster: natural selection and translational accuracy Genetics 136:927-935[Abstract/Free Full Text]

    ———. 1997 Codon bias evolution in Drosophila Population genetics of mutation-selection drift. Gene 205:269-278

    ———. 2001 Gene expression and molecular evolution Curr. Opin. Genet. Dev 11:660-666[ISI][Medline]

    Altschul S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, 1997 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Nucleic Acids Res 25:3389-3402[Abstract/Free Full Text]

    Alvarez-Valin F., K. Jabbari, N. Carels, G. Bernardi, 1999 Synonymous and nonsynonymous substitutions in genes from Gramineae: intragenic correlations J. Mol. Evol 49:330-342[ISI][Medline]

    Barakat A., G. Matassi, G. Bernardi, 1998 Distribution of genes in the genome of Arabidopsis thaliana and its implications for the genome organization of plants Proc. Natl. Acad. Sci. USA 95:10044-10049[Abstract/Free Full Text]

    Bird A. P., 1980 DNA methylation and the frequency of CpG in animal DNA Nucleic Acids Res 8:1499-1504[Abstract]

    Casane D., S. Boissinot, B. H. J. Chang, L. C. Shimmin, W. H. Li, 1997 Mutation pattern variation among regions of the primate genome J. Mol. Evol 45:216-226[ISI][Medline]

    Charlesworth D., B. Charlesworth, G. McVean, 2001 Genome sequences and evolutionary biology, a two-way interaction TREE 16:235-242[ISI][Medline]

    Chatfield C., 1999 The analysis of time series: an introduction Chapman and Hall, London

    Clegg M. T., M. P. Cummings, M. L. Durbin, 1997 The evolution of plant nuclear genes Proc. Natl. Acad. Sci. USA 94:7791-7798[Abstract/Free Full Text]

    Coghlan A., K. H. Wolfe, 2000 Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae Yeast 16:1131-1145[ISI][Medline]

    Comeron J. M., M. Aguade, 1998 An evaluation of measures of synonymous codon usage bias J. Mol. Evol 47:268-274[ISI][Medline]

    Dayhoff M. O., R. M. Schwartz, B. C. Orcutt, 1978 A model for evolutionary change in proteins Pp. 345–352 in M. O. Dayhoff, ed. Atlas of protein sequence and structure. National Biochemical Research Foundation, Washington, D.C.

    dePamphilis C. W., N. D. Young, A. D. Wolfe, 1997 Evolution of plastid gene rps2 in a lineage of hemiparasitic and holoparasitic plants: many losses of photosynthesis and complex patterns of rate variation Proc. Natl. Acad. Sci. USA 94:7367-7372[Abstract/Free Full Text]

    Duret L., D. Mouchiroud, 1999 Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila and Arabidopsis Proc. Natl. Acad. Sci. USA 96:4482-4487[Abstract/Free Full Text]

    ———. 2000 Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate Mol. Bio. Evol 17:68-85[Abstract/Free Full Text]

    Eyre-Walker A., M. Bulmer, 1995 Synonymous substitution rates in enterobacteria Genetics 140:1407-1412[Abstract/Free Full Text]

    Eyre-Walker A., B. S. Gaut, 1997 Correlated rates of synonymous site evolution among plant genomes Mol. Biol. Evol 14:455-460[Abstract]

    Fennoy S. L., J. Bailey-Serres, 1993 Synonymous codon usage in Zea mays L. nuclear genes is varied by levels of C and G-ending codons Nucleic Acids Res 21:5294-5300[Abstract]

    Friedman R., A. L. Hughes, 2001 Gene duplication and the structure of eukaryotic genomes Genome Res 11:373-381[Abstract/Free Full Text]

    Gaut B. S., 1998 Molecular clocks and nucleotide substitution rates in higher plants Evol. Biol 30:93-120[ISI]

    Gaut B. S., J. F. Doebley, 1997 DNA sequence evidence for the segmental allotetraploid origin of maize Proc. Natl. Acad. USA 94:6809-6814[Abstract/Free Full Text]

    Gaut B. S., S. V. Muse, W. D. Clark, M. T. Clegg, 1992 Relative rates of nucleotide substitution at the rbcL locus of monocotyledonous plants J. Mol. Evol 35:292-303[ISI][Medline]

    Gaut B. S., A. S. Peek, B. R. Morton, M. T. Clegg, 1999 Patterns of genetic diversification within the Adh gene family in the grasses (Poaceae) Mol. Biol. Evol 16:1086-1097[Abstract]

    Goldman N., Z. H. Yang, 1994 Codon-based model of nucleotide substitution for protein coding DNA sequences Mol. Biol. Evol 11:725-736[Abstract/Free Full Text]

    Gonzalez D. S., I. K. Jordan, 2000 The alpha-mannosidases: phylogeny and adaptive diversification Mol. Biol. Evol 17:292-300[Abstract/Free Full Text]

    Goodman M., G. W. Moore, G. Matsuda, 1975 Darwinian evolution in the genealogy of hemoglobin Nature 253:603-608[ISI][Medline]

    Gouy M., C. Gautier, 1982 Codon usage in bacteria: correlation with gene expressivity Nucleic Acids Res 10:7055-7074[Abstract]

    Gusfield D., 1997 Algorithms on strings trees and sequences Cambridge University Press, Cambridge, Mass

    Hughes A. L., 1999 Adaptive evolution of genes and genomes Oxford University Press, Oxford

    Kondrashov F. A., I. B. Rogozin, Y. I. Wolf, E. V. Koonin, 2002 Selection in the evolution of gene duplications Genome Biol 3:1-9

    Ku H.-M., T. Vision, J. Liu, S. D. Tanksley, 2000 Comparing sequenced segments of the tomato and Arabidopsis genomes: large-scale duplication followed by selective gene loss creates a network of synteny Proc. Natl. Acad. Sci. USA 97:9121-9126[Abstract/Free Full Text]

    Laroche J., P. Li, L. Maggia, J. Bousquet, 1997 Molecular evolution of angiosperm mitochondrial introns and exons Proc. Natl. Acad. Sci. USA 94:5722-5727[Abstract/Free Full Text]

    Lercher M. J., E. J. B. Williams, L. D. Hurst, 2001 Local similarity in evolutionary rates extends over whole chromosomes in human-rodent and mouse-rat comparisons: implications for understanding the mechanistic basis of the male mutation bias Mol. Bio. Evol 18:2032-2039[Abstract/Free Full Text]

    Li W. H., T. Gojobori, 1983 Rapid evolution of goat and sheep globin genes following gene duplication Mol. Biol. Evol 1:94-108[Abstract]

    Lin X. Y., S. S. Kaul, S. Rounsley, et al. (37 co-authors) 1999 Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana Nature 402:761-768[ISI][Medline]

    Long M. Y., K. Thornton, 2001 Gene duplication and evolution Science 293:U1.

    Lynch M., J. S. Conery, 2000 The evolutionary fate and consequences of duplicate genes Science 290:1151-1155[Abstract/Free Full Text]

    Matassi G., P. M. Sharp, C. Gautier, 1999 Chromosomal location effects on gene sequence evolution in mammals Curr. Biol 9:786-791[ISI][Medline]

    Mayer K., C. Schuller, R. Wambutt, et al. (240 co-authors) 1999 Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana Nature 402:769-777.[ISI][Medline]

    Moriyama E. N., T. Gojobori, 1992 Rates of synonymous substitution and base composition of nuclear genes in Drosophila Genetics 130:855-964[Abstract/Free Full Text]

    Moriyama E. N., D. N. Hartl, 1993 Codon usage bias and base composition of nuclear genes in Drosophila Genetics 134.

    Muse S. V., B. S. Gaut, 1994 A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome Mol. Biol. Evol 11:715-724[Abstract/Free Full Text]

    Muse S. V., B. S. Weir, 1992 Testing for equality of evolutionary rates Genetics 132:269-276[Abstract/Free Full Text]

    Nei M., 1987 Molecular evolutionary genetics Columbia University Press, New York

    Ohno S., 1970 Evolution by gene duplication Springer-Verlag, Heidelberg

    Ohta T., M. Kimura, 1973 Model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population Genet. Res 22:201-204.[ISI][Medline]

    Pál C., B. Papp, L. D. Hurst, 2001 Highly expressed genes in yeast evolve slowly Genetics 158:927-931[Free Full Text]

    Sawyer S., 1989 Statistical tests for detecting gene conversion Mol. Biol. Evol 6:526-538[Abstract]

    Sharp P. M., 1991 Determinants of DNA sequence divergence between Escherichia coli and Salmonella typhimurium codon usage, map position, and concerted evolution J. Mol. Evol 33:23-33[ISI][Medline]

    Sharp P. M., W.-H. Li, 1987 The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias Mol. Biol. Evol 4:222-230[Abstract]

    Terryn N., L. Heijnen, A. De Keyser, et al. (21 co-authors) 1999 Evidence for an ancient chromosomal duplication in Arabidopsis thaliana by sequencing and analyzing a 400-kb contig at the APETALA2 locus on chromosome 4 FEBS Lett 445:237-245.[ISI][Medline]

    Thompson J. D., D. G. Higgins, T. J. Gibson, 1994 ClustalW—improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res 22:4673-4680[Abstract]

    Ticher A., D. Graur, 1989 Nucleic acid composition, codon usage, and the rate of synonymous substitution in protein-coding genes J. Mol. Evol 28:286-298[ISI][Medline]

    Tiffin P., W. M. Hahn, 2002 Coding sequence divergence between two closely related plant species: Arabidopsis thaliana and Brassica rapa ssp. pekinensis J. Mol. Evol 54::746-753.[ISI][Medline]

    Tsunoyama K., M. I. Bellgard, T. Gojobori, 2001 Intragenic variation of synonymous substitution rates is caused by nonrandom muations at methylated CpG J. Mol. Evol 53:456-464[ISI][Medline]

    Vision T. J., D. G. Brown, S. D. Tanksley, 2000 The origins of genomic duplication in the Arabidopsis genome Science 290:2114-2117[Abstract/Free Full Text]

    Williams E. J., L. D. Hurst, 2000 The proteins of linked genes evolve at similar rates Nature 407:900-903[ISI][Medline]

    Wolfe K. H., P. M. Sharp, 1993 Mammalian gene evolution: nucleotide sequence divergence between mouse and rat J. Mol. Evol 37:441-456[ISI][Medline]

    Wolfe K. H., P. M. Sharp, W.-H. Li, 1989a. Mutation rates differ among regions of the mammalian genome Nature 337:283-285[ISI][Medline]

    ———. 1989b. Rates of synonymous substitution in plant nuclear genes J. Mol. Evol 29:208-211[ISI]

    Wright F., 1990 The ‘effective number of codons' used in a gene Gene 87:23-29[ISI][Medline]

    Yang Z., 1997 PAML: a program package for phylogenetic analysis by maximum likelihood CABIOS 13:555-556[Medline]

    ———. 1998 Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution Mol. Bio. Evol 15:568-573[Abstract]

    Zeng L.-W., J. M. Comeron, B. Chen, M. Kreitman, 1998 The molecular clock revisited: the rate of synonymous versus replacement change in Drosophila Genetica 102/103:369-382

    Zhang L. Q., B. S. Gaut, T. J. Vision, 2001a. Gene duplication and evolution Science 293:U1-U2

    Zhang L. Q., S. Kosakovsky-Pond, B. S. Gaut, 2001b. A survey of the molecular evolutionary dynamics of twenty-five multigene families from four grass taxa J. Mol. Evol 52:144-156[ISI][Medline]

Accepted for publication April 29, 2002.