*The Biological Process Technology Institute
Department of Ecology, Evolution, and Behavior, University of Minnesota;
Department of Biochemistry and Molecular Biology, FUHS/CMS, North Chicago; and
Department of Biology, McMaster University, Hamilton, Ontario
Abstract
A correlation between site-specific rate variation and (1) distance from the active site, (2) solvent accessibility, and (3) treating glycines in unusual main-chain conformations as a separate class, explains approximately half the causal variation. Secondary structure exerts little influence on the pattern and distribution of replacements. Additional domains and subunits, side-chain hydrogen bonds, unusual side-chain rotamers, nonplanar peptide bonds, strained main-chain conformations, and buried hydrophilic-charged residues contribute little to variability among sites because they are rare. Nonlinear models do not improve the fits. In several enzymes, deviations from the typical pattern of replacements suggest the possible action of natural selection. A statistical analysis shows that, in all cases, much of the remaining unexplained variation is not attributable to chance and that other, as yet unidentified, causal relations must exist.
Introduction
A great deal of attention has been paid to the study of variable rates in DNA: between genes, between coding and noncoding sequences, between regulatory elements and their adjacent cistrons, and between nonsynonymous and synonymous substitutions within structural genes (Li 1997). Much discussion has centered on the role played by functional constraints in determining evolutionary rates, particularly with reference to the neutral theory (Kimura 1983). As previously noted (Dean and Golding 2000), many of these patterns are also consistent with Fisher's theory of evolution near fitness optima (Fisher 1930), wherein mutations of small effect are more likely to be fitter than those of large effect; hence the former (e.g., synonymous substitutions) occur more frequently than the latter (e.g., nonsynonymous replacements).
Proteins, diverse in structure and in function, also form a natural arena in which to explore issues surrounding variability in evolutionary rates. After an early study demonstrating that amino acid replacement rates vary among sites in proteins (Uzzell and Corbin 1971), Kimura and Ohta (1973) established that sites lining the heme-binding pockets of hemoglobin evolve less rapidly than those on solvent-accessible surfaces, a now oft-cited example of functional constraint. In many recent studies, high rates of amino acid replacement compared with the rates of silent substitution are taken as evidence of selection (methods and results reviewed by Yang and Bielawski 2000). A number of these studies reveal that rapidly evolving sites are localized within the three-dimensional structures of proteins, thereby providing additional insights into the mode of adaptive evolution (e.g., Hughes and Nei 1988; Bishop, Dean, and Mitchell-Olds 2000). Yet, between the extremes of casual inspection of protein structure and of rigorous application of statistical theory, there remains a vast gulf in our knowledge.
Incorporating protein structure into evolutionary models has recently become a focus of renewed interest (Thorne 2000). Bustamante, Townsend, and Hartl (2000) showed that polymorphic sites in several bacterial enzymes are far more likely to be on solvent-accessible surfaces than in hydrophobic interiors, an observation entirely in accord with the observations of Kimura and Ohta (1973) and of Goldman and coworkers (Goldman, Thorne, and Jones 1998; Lio and Goldman 1999). The latter also attempted to extend the approach by incorporating knowledge of secondary structures, but with mixed success. Atchley, Terhalle, and Dress (1999) and Atchley et al. (2000) used an information theoretic approach to analyze amino acid replacements in a DNA-binding helix-loop-helix domain of transcription factors. They found significant levels of covariation at surface sites that could not be ascribed to common phylogenetic history. Pollock, Taylor, and Goldman (1999) constructed models that explicitly incorporate covariation among sites and compared the fits with models that invoke no covariation.
In this article we explore two issues, building on an approach first used to analyze isocitrate dehydrogenase (Dean and Golding 2000). First, we determine when there is sufficient information in a protein phylogeny such that reliable inferences about the distribution of amino acid replacements can be made. Second, we explore the extent to which biological function affects the distribution of amino acid replacements within structures. To answer the first question we determine the proportion of variation in the number of amino acid replacements among sites that is attributable to causal effects. This is done using a new method that is independent of the structure of the underlying biological model, although it assumes an underlying Poisson process of amino acid replacement. To answer the second question we make a survey of proteins with an eightfold α/β-barrel, a motif that appears in a variety of structural and functional contexts.
The Model
We assume the following model for amino acid sequence evolution: among phylogenetically related amino acid sequences, different sites evolve independently along the branches of a given phylogenetic tree according to a Poisson process. Different sites are allowed to have different rates of evolution, and these rates may vary independently of others, both within and between branches. We do not use empirical correction matrices because these mask heterogeneity that is rightfully attributable to three-dimensional structural effects. When conditioned on phylogenetic history (tree topology, rates of evolution, and changes in rates of evolution), the total number of replacements at a given site for a sample of amino acid sequences is still Poisson distributed. We assume that at site i, the mean number of replacements over the entire history represented in the phylogenetic tree is µi.
Suppose we align a number of amino acid sequences, reconstruct their phylogeny, and infer the actual number of replacements per site. There are two distinct sources for variation in this data set: one is biological, namely the site-to-site variation that is caused by evolutionary forces that determine the rate of replacement, whereas the second is probabilistic, namely the inevitable variation that accompanies stochastic processes, in this case the Poisson processes that determine the number of replacements per site, given the rate of replacement. Because both sources contribute to variation, it is important to determine what fraction of variation is due to the stochastic process and what fraction is due to biological forces determining the site-to-site variation in replacement rates. Fortunately, in the case of Poisson noise, it is possible to tease apart these two sources of variation without recourse to a biological model of protein evolution, i.e., a model that seeks to explain the between-site variation in replacement rates.
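Let $Y_i$ be the random variable counting the number of replacements at site $i$, $i = 1, 2, \ldots, n$, accumulated throughout phylogenetic history. Each $Y_i$ is Poisson distributed with mean $\mu_i$, and sites are independent of each other. Setting $\bar{\mu} = \sum_{i=1}^{n} \mu_i / n$, we can partition the variance in the number of replacements among all sites into two parts, the first ascribable to causal variation among sites and the second ascribable to residual stochastic error due to the Poisson process acting at each site:

$$\sigma_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\mu_i - \bar{\mu})^2 + \bar{\mu},$$

where $\sigma_y^2$ denotes the expected value of the sample variance $s_y^2$ of the number of replacements per site. The proportion of the variance that is causal, $\rho^2 = 1 - \bar{\mu}/\sigma_y^2$, is estimated by the PECD, $\hat{\rho}^2 = 1 - \bar{y}/s_y^2$, where $\bar{y}$ is the sample mean number of replacements per site.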
We also show (see appendix) that the variance of $\hat{\rho}^2$ can be approximated analytically (eq. 2).
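The estimator described in this section can be sketched in a few lines of Python. This is a minimal illustration, assuming the PECD is computed as $\hat{\rho}^2 = 1 - \bar{y}/s_y^2$ (the form given later in the Discussion) with the usual $n - 1$ divisor for the sample variance. The function name `pecd` and the example counts are hypothetical and are not taken from the paper.

```python
from statistics import mean, variance


def pecd(replacements):
    """Estimate the proportion of among-site variance that is causal.

    Computes rho_hat^2 = 1 - ybar / s_y^2, where ybar is the mean number of
    replacements per site and s_y^2 is the sample variance (n - 1 divisor).
    """
    y_bar = mean(replacements)
    s2_y = variance(replacements)  # unbiased sample variance
    if s2_y == 0:
        raise ValueError("all sites have the same count; the PECD is undefined")
    return 1.0 - y_bar / s2_y


# Hypothetical per-site replacement counts, for illustration only.
counts = [0, 0, 1, 1, 2, 3, 5, 8, 13, 21]
print(round(pecd(counts), 3))
```

Note that the estimate can be negative when the counts are less dispersed than a Poisson sample, in which case no causal variation is detectable.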
Reducing Error
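The variation due to Poisson error can be reduced by increasing sample size. The approximate increase in sampling size can be computed assuming constant rates of evolution. Suppose the total branch length in the tree is $t$ and $\mu_i = \lambda_i t$; then the total variance is

$$\sigma_y^2 = t^2 \sigma_\lambda^2 + t\bar{\lambda},$$

where $\bar{\lambda}$ and $\sigma_\lambda^2$ are the mean and variance of the rates $\lambda_i$ among sites.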
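One consequence of the constant-rate assumption can be worked out directly. If $\mu_i = \lambda_i t$, the causal part of the variance grows as $t^2$ and the Poisson part as $t$, so the odds $\rho^2/(1-\rho^2)$ are proportional to the total branch length $t$. The sketch below uses that relation to gauge how much longer a tree would need to be to reach a target PECD. The function name `required_length_factor` is hypothetical, and the formula is a derivation under the stated assumption rather than necessarily the expression used in the paper.

```python
def required_length_factor(rho2_now, rho2_target):
    """Factor by which total branch length must grow to reach a target PECD.

    Assumes constant rates (mu_i = lambda_i * t), under which
    rho^2 / (1 - rho^2) is proportional to the total branch length t.
    """
    for value in (rho2_now, rho2_target):
        if not 0.0 < value < 1.0:
            raise ValueError("PECD values must lie strictly between 0 and 1")
    odds_now = rho2_now / (1.0 - rho2_now)
    odds_target = rho2_target / (1.0 - rho2_target)
    return odds_target / odds_now


# Moving from 50% to 90% causal variation under these assumptions.
print(required_length_factor(0.5, 0.9))
```

Because real phylogenies add correlated branches rather than uniformly stretching the tree, the factor should be read as a rough guide only.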
Using PECD
The PECD $\hat{\rho}^2$ has three uses: (1) identifying data sets with sufficient causal variation to be worthy of analysis ($\hat{\rho}^2$ close to 1), (2) determining how much more data need be collected to reduce the stochastic variation to some desirable limit, and (3) knowing when to stop tinkering with a regression model constructed from biological variables because its correlation coefficient ($r^2$) has approached the theoretical limit ($\hat{\rho}^2$).
PECD helps identify those data sets most worthy of analysis. When all sites in a sequence evolve at the same rate, the expected distribution of replacements is Poisson, with the expected variance ($\sigma^2$) equal to the expected mean ($\mu$). The ratio of the estimated variance ($s_y^2$) to the estimated mean ($\bar{y}$) weighted by the degrees of freedom, $(n - 1)s_y^2/\bar{y}$, is approximately distributed as $\chi^2_{n-1}$ and provides a convenient test for deviations from Poisson (Fisher 1948). Yet significance alone is an insufficient criterion for pursuing an analysis. For example, with df = 400 (common enough with molecular data) the variance need only be 14% larger than the mean to be significant, resulting in a PECD of 1 - 1/1.14 = 0.12, or 12% causal variation. Yet a data set with 12% causal variation is hardly worthy of detailed analysis when there may be others consisting of 90% causal variation.
Additional data must often be gathered in an effort to reduce stochastic noise to an acceptable level. There are two possible strategies: increasing sequence length or obtaining more sequences. The first is of limited use because interest often centers on sequences of defined length (e.g., a gene), and where lengthening is possible it does nothing to reduce the stochastic proportion of the estimated variance ($\bar{y}/s_y^2 = 1 - \hat{\rho}^2$), whereas the variance of that proportion is reduced only in direct proportion to the increase in length. The second approach is far more efficient, even though the degrees of freedom remain unchanged: as the number of replacements per site grows, the stochastic portion of the variance decreases (roughly) in inverse proportion, whereas its variance decreases (roughly) as the inverse cube. For this second approach, the constant-rate calculation (see Reducing Error) provides a useful, if approximate, gauge of the necessary increase in sample size. It is approximate only because the precise increase also depends on sequence relatedness and because very small initial samples necessarily produce large standard errors (table 2).
Simulations
To ascertain the reliability of our analysis when applied to proteins, we simulated the accumulation of amino acid replacements at 245 sites in the glycolytic enzyme triosephosphate isomerase (TIM). Analysis (see Methods) of 178 sequences from extant taxa allocates 90% of the site-to-site variability to causal effects. With 10% of the variation attributed to stochastic effects, the observed distribution provides a sufficiently robust estimate of the true underlying distribution for simulating Poisson scatter.
The mean and standard deviations (SD) of 2,000 replicate simulations (table 1), with the expected mean number of replacements per site varied between 0.853 and 27.29, reveal that the simulated $\hat{\rho}^2$ closely follows both the theoretical $\rho^2$ and the simulated $r^2$. When there are few replacements, $\hat{\rho}^2$ underestimates $r^2$. However, the bias is negligible compared with the SD values (table 1) and the 95% confidence intervals (fig. 1). Table 1 also shows that the approximate SD values of $\hat{\rho}^2$ ($s_{\hat{\rho}^2}$, from eq. 2) closely follow the empirically determined SD values and the approximate SD values of $\rho^2$ ($\sigma_{\rho^2}$). With many replacements, $s_{\hat{\rho}^2}$, SD, and $\sigma_{\rho^2}$ are smaller than the empirically determined SD of $r^2$.
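The simulation design described above can be sketched as follows. This is not the authors' code: the expected replacements per site are drawn here from a hypothetical exponential distribution instead of the TIM data, the number of replicates is reduced for speed, and the Poisson sampler is Knuth's simple multiplication method. For each replicate the sketch records the PECD and the squared correlation between the simulated counts and their expectations.

```python
import math
import random
from statistics import mean, variance


def poisson_draw(rng, mu):
    """Knuth's multiplication method; adequate for the modest means used here."""
    limit = math.exp(-mu)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def squared_correlation(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def theoretical_pecd(mus):
    """Causal variance divided by expected total variance for fixed means."""
    causal = variance(mus)
    return causal / (causal + mean(mus))


def simulate(expected, scale, replicates, seed=1):
    """Return the mean PECD and mean r^2 over replicate Poisson samples."""
    rng = random.Random(seed)
    mus = [m * scale for m in expected]
    pecds, r2s = [], []
    for _ in range(replicates):
        ys = [poisson_draw(rng, mu) for mu in mus]
        pecds.append(1.0 - mean(ys) / variance(ys))
        r2s.append(squared_correlation(mus, ys))
    return mean(pecds), mean(r2s)


# Hypothetical expectations for 245 sites (exponentially distributed, mean 1).
site_rng = random.Random(0)
expected = [site_rng.gammavariate(1.0, 1.0) for _ in range(245)]
for scale in (1.0, 5.0, 20.0):
    sim_pecd, sim_r2 = simulate(expected, scale, replicates=200)
    target = theoretical_pecd([m * scale for m in expected])
    print(scale, round(target, 3), round(sim_pecd, 3), round(sim_r2, 3))
```

Scaling all expectations by a common factor mimics adding sequences to the phylogeny while keeping the relative rates among sites fixed.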
α/β-barrel enzymes have diverse functions (table 3). At the organismal level they contribute to photosynthesis, respiration, cell growth, development, defense, and communication. They may be extracellular or confined to certain organelles. At the metabolic level they play roles in glycolysis and gluconeogenesis, CO2 fixation, assorted biosyntheses and degradations, DNA repair, and bioluminescence. They carry out a diversity of biochemical transformations, including C-C bond formation, oxidations and reductions, hydrolyses and condensations, using a variety of chemical mechanisms (Walsh 1979). Reactions may proceed concertedly (everything happens simultaneously) or sequentially (in a stepwise fashion). The transition states vary widely in chemical character, from enolates to radicals to oxocarbenium ions. These transition states may be stabilized directly by the protein or indirectly by way of divalent metals or pyridoxal phosphate. In the case of chitinase there is even the suggestion that the substrate itself directly assists catalysis (Terwisscha van Scheltinga et al. 1995). In some, the substrate becomes temporarily covalently attached to the enzyme or coenzyme (e.g., through a lysyl or pyridoxal phosphate Schiff-base, or the formation of an acyl-enzyme intermediate). In others, the substrates are noncovalently bound throughout the reaction.
Methods
Amino acid sequences homologous to those in the Protein Data Bank were identified using web-based gapped-BLAST (Altschul et al. 1997; http://www.ncbi.nlm.nih.gov:80/BLAST) and Neighbors (http://www.ncbi.nlm.nih.gov:80/entrez) algorithms. All mutant and chimeric sequences were discarded, as were short peptide fragments (<150 residues). Sequences were aligned using CLUSTALW (Thompson, Higgins, and Gibson 1994; Higgins, Thompson, and Gibson 1996; code and documentation available at ftp://ftp.bio.indiana.edu/molbio/align/clustal/). All sequences less than 40% identical to a known structure were discarded. Only a single wild-type representative in clusters of sequences sharing more than 99% identity was retained. Partial sequences were removed, unless the missing sites comprised less than 5% of the sequence (commonly seen at the amino and carboxy termini). For many enzymes the structures from several different species are available. Structural alignments, obtained from web-based databases and programs (the MMDB database at http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml, VAST alignments at http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml, CE [Shindyalov and Bourne 1998] at http://cl.sdsc.edu/, and FSSP [Holm and Sander 1996] at http://www2.ebi.ac.uk/dali/fssp/) or locally with QUANTA (MSI, CA), were all similar and were used as guides to adjust alignments with the SEQPUP editor (D. Gilbert 1999; http://iubio.bio.indiana.edu/soft/molbio/seqpup/java). Searches were completed on August 1, 2000.
Phylogenetic Reconstructions
Phylogenies were constructed using the Fitch-Margoliash least squares method (Fitch and Margoliash 1967), as implemented in PHYLIP (Felsenstein 1989; code and documentation available at http://evolution.genetics.washington.edu/phylip.html). Distances were calculated using a PAM250 matrix (Jones, Taylor, and Thornton 1992), with gaps treated as missing data. Branches longer than 0.3 were pruned, and clusters having five or more sequences were analyzed separately. For each cluster, the phylogeny was repeatedly reconstructed and pruned until all branch lengths were shorter than 0.3. Phylogenies were further pruned so that no more than 30% of their total lengths comprised branches longer than 0.2.
Inferred Amino Acid Replacements
For each Fitch-Margoliash tree the number of amino acid replacements per site was inferred by parsimony using PAUP* (Swofford 1998). Sites absent in the protein structure were discarded, as were those present in the structure but absent in more than 10% of aligned homologous sequences.
Parsimony systematically underestimates the number of replacements per site. We therefore implemented a Jukes-Cantor-like correction to adjust for this bias. Following Gu and Zhang (1997), the probability of one or more amino acid replacements at site $i$ on branch $j$ ($y_{ij}$) when governed by a Poisson process is given by

$$P(y_{ij} \geq 1) = 1 - e^{-\lambda_i t_j},$$

so that the number of branches expected to show no replacement satisfies

$$m - b_{i\cdot} = \sum_{j=1}^{m} e^{-\lambda_i t_j}.$$

On the left-hand side is the difference between the number of branches ($m$) in the phylogeny and the observed number of branches with at least one replacement ($b_{i\cdot}$). On the right-hand side, estimates of $t_j$ are provided by the Fitch-Margoliash tree. There is no general analytic solution for $\lambda_i$ (this equation being similar in form to Euler equations), and therefore we found its numerical value using the FindRoot function in Mathematica (Wolfram Research, Inc., IL). The corrected number of replacements per site ($y_{i\cdot}$) is then estimated as

$$\hat{y}_{i\cdot} = \hat{\lambda}_i \sum_{j=1}^{m} t_j.$$
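The numerical step can be illustrated with a small root finder. The sketch below assumes the estimating equation takes the form $m - b = \sum_j e^{-\lambda t_j}$, as the description of its left- and right-hand sides suggests, and solves it by bisection in place of Mathematica. The function name `corrected_replacements` and the example branch lengths are hypothetical.

```python
import math


def corrected_replacements(branch_lengths, branches_changed, tol=1e-12):
    """Correct a parsimony count for multiple replacements on a branch.

    Solves sum_j exp(-rate * t_j) = m - b for the site's rate by bisection,
    then returns rate * sum_j t_j as the corrected number of replacements.
    """
    m = len(branch_lengths)
    b = branches_changed
    if any(t <= 0 for t in branch_lengths):
        raise ValueError("branch lengths must be positive")
    if not 0 <= b <= m:
        raise ValueError("branches_changed must lie between 0 and m")
    if b == 0:
        return 0.0
    if b == m:
        raise ValueError("every branch changed; the rate estimate is unbounded")

    target = m - b

    def unchanged(rate):
        # Expected number of branches showing no replacement at this rate.
        return sum(math.exp(-rate * t) for t in branch_lengths)

    lo, hi = 0.0, 1.0
    while unchanged(hi) > target:  # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if unchanged(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) * sum(branch_lengths)


# Ten hypothetical branches of equal length, three of which show a change.
print(corrected_replacements([0.1] * 10, 3))
```

With equal branch lengths the equation has the closed form $\lambda = -\ln(1 - b/m)/t$, which offers a convenient check on the solver. The corrected value is never smaller than the parsimony count.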
Protein Characterization
An SGI Indigo II (Mountain View, CA) running Quanta (MSI, CA) software was used to calculate H-bonds and φ and ψ angles and to define secondary structure from PDB files. Quanta was also used to identify those side-chains engaged in ionic interactions and those engaged in H-bonding, and whether the latter were side-chain to side-chain, side-chain to main-chain, interdomain, or intersubunit interactions, whether they were donors or recipients, and whether the atoms involved were charged or polar (or both). The fraction of each amino acid side-chain exposed to the solvent was calculated using a 0.01-Å grid with a 1.4-Å radius probe (the radius of a water molecule) in Quanta. The distance (Å) from the atom in each residue closest to the active site (taken to be an atom implicated in catalysis from mechanistic considerations and which, depending on context, may reside on a side-chain, in a prosthetic group, in a bound ligand, or be a bound metal ion) was calculated from the x,y,z atomic coordinates using the calculator in JMP (SAS Institute Inc., NC). A similar calculation was performed with ligands bound in allosteric sites.
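The distance measure is simple enough to sketch. The example below assumes atomic coordinates are available as (x, y, z) tuples in Å and takes, for each residue, the distance from its closest atom to a single designated active-site atom. The function name and the coordinates are hypothetical; the paper performed the calculation in JMP.

```python
import math


def distance_to_active_site(residue_atoms, active_site_atom):
    """Distance (in Å) from the residue atom closest to the active-site atom."""
    return min(math.dist(atom, active_site_atom) for atom in residue_atoms)


# Hypothetical coordinates for a two-atom residue and an active-site atom.
residue = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
catalytic_atom = (3.0, 4.0, 12.0)
print(distance_to_active_site(residue, catalytic_atom))
```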
Regression Analyses
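Linear least squares regression models were fitted to the number of amino acid replacements per site. The minimal model uses distance from the active site, solvent accessibility, and glycines in unusual main-chain conformations as predictors.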
Results
Corrections
Parsimony assigns no more than one replacement per site per branch. This causes the number of replacements to be systematically underestimated, particularly at rapidly evolving sites on long branches. The bias is inescapable, although pruning long branches (>0.3) and restricting those of modest length ($0.2 < t_j < 0.3$) to no more than 30% of the total length of the phylogeny minimizes serious underestimates. Nevertheless, the Jukes-Cantor-like correction produces a 15% increase in the mean number of replacements per site (table 4), with increases exceeding 30% at approximately 4% of sites. The need to implement the correction necessitated determining its accuracy.
We used computer simulations to assess the accuracy of the Jukes-Cantor-like correction (table 5). All simulations are based on the trees for enolase and TIM, using the observed branch lengths ($t_j$) and the observed distributions of rates of amino acid replacements ($\lambda_i$). In the simulations, each site $i$ evolves at a constant rate, with the number of replacements on branch $j$ drawn from a Poisson distribution with mean $\lambda_i t_j$. For each site, the total number of replacements (the "Poisson" data) and the number that would be inferred by parsimony (the "Parsimony" data) are recorded. The Jukes-Cantor-like correction is then applied to the Parsimony data to produce the "Corrected" data. In the first pair of simulations all sites evolve at the same constant rate ($\lambda$ constant). As expected of Poisson processes, the mean and variance values are equal and all variability is stochastic ($\bar{y}/s_y^2 \approx 1$). In the second pair of simulations different sites evolve at different rates, with the $\lambda_i$ distributions taken from the enolase and TIM phylogenies. Site-to-site differences produce an additional source of variation that inflates the variance relative to the mean. Now, only 10% of the variability is attributable to stochastic effects ($\bar{y}/s_y^2 \approx 0.1$).
We conclude that parsimony seriously underestimates the number of replacements per site, even on severely pruned trees. Correcting this bias is essential for reliable analyses. Simulations show that the Jukes-Cantor-like correction accurately recovers the mean, the variance, and their ratio (table 5), with residual biases far smaller than the stochastic errors inherent to single replicates.
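The three data sets compared in these simulations ("Poisson", "Parsimony", and "Corrected") can be mimicked with a toy example. The sketch below is a simplification under stated assumptions: branch lengths and rates are hypothetical random values instead of the enolase and TIM trees, parsimony is modeled as counting each branch with at least one replacement exactly once, and the rare site at which every branch changes keeps its raw count because the rate estimate is unbounded there.

```python
import math
import random


def poisson_draw(rng, mu):
    """Knuth's multiplication method for a Poisson variate with mean mu."""
    limit = math.exp(-mu)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def solve_rate(branch_lengths, b):
    """Bisection for the rate in sum_j exp(-rate * t_j) = m - b (0 < b < m)."""
    target = len(branch_lengths) - b
    lo, hi = 0.0, 1.0
    while sum(math.exp(-hi * t) for t in branch_lengths) > target:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if sum(math.exp(-mid * t) for t in branch_lengths) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_sites(branch_lengths, rates, seed=1):
    """Return (true, parsimony, corrected) replacement counts for each site."""
    rng = random.Random(seed)
    total_length = sum(branch_lengths)
    m = len(branch_lengths)
    rows = []
    for rate in rates:
        draws = [poisson_draw(rng, rate * t) for t in branch_lengths]
        true_count = sum(draws)
        parsimony = sum(1 for d in draws if d > 0)  # at most one per branch
        if parsimony == 0:
            corrected = 0.0
        elif parsimony == m:
            corrected = float(parsimony)  # unbounded estimate; keep raw count
        else:
            corrected = solve_rate(branch_lengths, parsimony) * total_length
        rows.append((true_count, parsimony, corrected))
    return rows


# Hypothetical tree of 40 branches and 200 sites with unequal rates.
setup_rng = random.Random(0)
branch_lengths = [0.05 + 0.25 * setup_rng.random() for _ in range(40)]
rates = [0.2 + 9.8 * setup_rng.random() for _ in range(200)]
rows = simulate_sites(branch_lengths, rates)
for label, column in zip(("Poisson", "Parsimony", "Corrected"), zip(*rows)):
    print(label, round(sum(column) / len(column), 2))
```

In this toy setting the parsimony count can never exceed the true count, and the corrected value can never fall below the parsimony count, which is the direction of bias the correction is meant to address.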
Reliability of the NCD ($r^2/\hat{\rho}^2$)
We used simulations to assess the accuracy of NCD. Data from TIM were used as expectations around which Poisson sampling effects were simulated. For each site in the sequence, the simulated number of replacements was drawn from a Poisson distribution whose expectation varied in proportion to the expected number of replacements per site (from 0.8 to 27). The NCD was determined using the regression coefficient of the simulated data against the true expectations and the PECD. Estimates of the NCD are highly unreliable (fig. 4A) below a mean of approximately two replacements per site and accurate above a mean of five. The same general trend is manifest in real data (fig. 4B) regressed against the minimal model (see Regression Analyses below), although here the observed NCDs are far lower. We conclude that simulations and real data indicate that reliable NCDs can only be obtained from large phylogenies.
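The NCD combines two quantities already defined: the squared correlation $r^2$ between observed and fitted replacements, and the PECD $\hat{\rho}^2 = 1 - \bar{y}/s_y^2$. A minimal sketch follows, assuming the ratio $r^2/\hat{\rho}^2$ given in the heading of this section. The function name `ncd` and the example values are hypothetical.

```python
from statistics import mean, variance


def ncd(observed, fitted):
    """Return (r2, rho2_hat, r2 / rho2_hat) for observed and fitted counts."""
    mo, mf = mean(observed), mean(fitted)
    sxy = sum((o - mo) * (f - mf) for o, f in zip(observed, fitted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((f - mf) ** 2 for f in fitted)
    r2 = sxy * sxy / (sxx * syy)
    rho2_hat = 1.0 - mo / variance(observed)
    if rho2_hat <= 0:
        raise ValueError("no detectable causal variation; the NCD is undefined")
    return r2, rho2_hat, r2 / rho2_hat


# Hypothetical observed counts and fitted values from some regression model.
observed = [0, 0, 1, 1, 2, 3, 5, 8, 13, 21]
fitted = [1.0, 0.5, 1.5, 2.0, 2.5, 4.0, 4.5, 7.0, 10.0, 15.0]
r2, rho2_hat, q2 = ncd(observed, fitted)
print(round(r2, 3), round(rho2_hat, 3), round(q2, 3))
```

Because $\hat{\rho}^2$ is itself an estimate, the ratio can exceed 1 when the PECD is small or poorly determined, which is one reason the estimates are unreliable for phylogenies with few replacements per site.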
Frequency Distributions
To determine if α/β-barrel enzymes have similar frequency distributions of amino acid replacements per site we normalized our data such that each enzyme has an average of one amino acid replacement per site. This allows enzymes with few replacements per site to be directly compared with ones with many replacements per site. Of the 25 enzymes studied, 24 appear to share a common distribution (or a set of similar distributions) in which sites with many replacements are far less frequent than expected for a simple exponential decay (fig. 6). RUBISCO is the clear exception, with sites having a moderate number of replacements being relatively rare, whereas those with many replacements are far more frequent than is typical.
We investigated additional models of amino acid replacement rates for four of the largest data sets (enolase, TIM, 5-aminolevulinate dehydratase [5-ALDH], and class-I fructose 1,6-bisphosphate aldolase [F16BP.I]), in which most variation is causal. Improvements in the fits obtained by adding terms to the minimal model are marginal, given the expenditure in degrees of freedom: the NCD increases by approximately 15% at a cost of over 25 df (table 6). Treating the two domains of each enolase monomer separately improves the NCD by only 2%. Distinguishing the amino terminus of each 5-ALDH monomer from the remaining α/β-barrel (the first 27 amino acids form an extended tail) produces no improvement in the fit. Including the x,y,z coordinates of the Cα carbon atoms typically improves the NCD by approximately 5%, suggesting that weak directional gradients in the frequency of replacements across monomers are fairly common. Accounting for secondary structure (α-helix, β-sheet, β-bulge, turn, random coil) produces even less improvement. Hence, secondary structure exerts little influence on amino acid replacement rates.
Introducing interaction terms (e.g., distance × access) also produces small (<5%) improvements in the NCD at the expense of a large number of degrees of freedom. When the TIM data are fitted to the minimal model supplemented with x,y,z coordinates, the NCD rises from 0.69 to 0.72 when 10 second-order interaction terms are included (no interaction terms with glycine are included because of a lack of degrees of freedom) and then to 0.75 when all 26 third- and fourth-order interactions are included. Similar results are obtained when the same models are fitted to enolase data, with the NCD rising from 0.60 to 0.61 and then to 0.63. We conclude that interactions between variables are of little consequence.
There is no theoretical reason to suppose that amino acid replacement rates should be a linear function of distance, access, or any other metric associated with protein structure. We therefore explored a nonlinear version of the minimal model based on the Hill equation of enzyme kinetics, an equation that displays a wide variety of behaviors from hyperbolic to sigmoid. For many data sets, fits did not converge. When fits did converge, increases in the NCDs were again marginal (not shown).
We conclude that distance from the active site, solvent accessibility, and glycine residues at constrained positions in the main-chain explain approximately 50% of the variability in rates of evolution not attributable to chance. Other variables, including secondary structure and interactions among variables, account for only a small proportion of the observed variation.
Discussion
A solution in the special case of Poisson-distributed errors is possible. The reason is that the grand mean ($\bar{y}$) of summed Poisson distributions provides an estimate of the stochastic error that is entirely independent of the observed variance. This is not true for other distributions. For example, the grand mean of summed normal distributions provides no information about variances, whatever their source.
Each site in a protein accumulates amino acid replacements according to its own Poisson process. Though the rate at each site may vary independently of others, or coordinately with them (producing branch length effects), the overall process at each site is still Poisson when we condition on all historical contingencies. With evolution modeled as a Poisson process, the proportion of site-to-site variation due to chance ($\bar{y}/s_y^2$) can be partitioned from that due to unspecified causal effects ($\hat{\rho}^2 = 1 - \bar{y}/s_y^2$).
This simple calculation allows us to concentrate on data with high information content. We decided that at least half the observed variation should be causal ($\hat{\rho}^2 > 0.5$) to warrant further analysis, a criterion that corresponds (roughly) to a phylogeny of 10 sequences averaging 1.75 replacements per site. Only 25 of 125 α/β-barrel structures in the Protein Data Bank, with phylogenies pruned of long branches and highly divergent sequences, satisfied this criterion on August 1, 2000 (table 4). Very large phylogenies (75 sequences averaging 10 replacements per site) are necessary if more than 90% of the variation is to be ascribed to causal effects. Few proteins of known structure are associated with such large phylogenies.
Despite pruning trees to remove branches longer than 0.3, parsimony underestimates both the mean and the variance in the number of amino acid replacements per site and overestimates their ratio. These biases are severe and a correction is essential to any reliable analysis (table 5). A Jukes-Cantor-like correction accurately recovers the mean, the variance, and their ratio, with remaining biases far below the stochastic noise inherent to the data. The only additional assumption needed for this correction is that changes in evolutionary rate affect all sites proportionally.
Approximately half of the causal variation is explained by just df = 3: distance from the active site, solvent accessibility, and glycines in unusual main-chain conformations (table 4). Like several recent analyses (Bustamante, Townsend, and Hartl 2000; Goldman, Thorne, and Jones 1998), we confirm that solvent accessibility is a major determinant of amino acid replacement rates. We also show that distance from the active site equals, and sometimes surpasses, solvent accessibility in importance (see the regression sums of squares in table 4). Glycines in unusual main-chain conformations make significant contributions to the regression sums of squares because they are highly conserved.
Asp, Glu, Arg, and Pro tend to be more conserved than expected, given their positions in crystal structures (data not shown). The first three are charged and their side-chains can H-bond to other polar and charged side-chains, but there is no correlation here: His occupies sites that evolve at expected rates, whereas Lys tends to occupy rapidly evolving sites. H-bonding among these and other polar side-chains does not contribute much to the fit (table 6). Asp sometimes plays a structural role in capping the dipole at the amino termini of α-helices, but not frequently enough to explain this level of conservation. Pro too is more conserved than most residues, perhaps because it often plays an important structural role by restricting acceptable main-chain conformations (its side-chain being covalently attached to the main-chain nitrogen). However, there is as yet no definitive method to predict, from structural data alone, which Pro residues will be conserved and which are free to evolve.
A surprising result of our analysis is that secondary structure has little predictive power regarding rates of evolution (table 6). TIM provides a typical example. Alone, secondary structure produces an NCD of $q^2 = r^2/\hat{\rho}^2 = 0.12$, with helices evolving more rapidly than sheets and with turns and random coils having intermediate rates. When used to supplement the minimal model (distance, access, Gly), secondary structure improves $q^2$ from 0.613 to 0.643, an increase of only 0.03. The difference (0.12 vs. 0.03) arises as a consequence of the construction of α/β-barrels (fig. 2). The sheets, which contain residues forming the active site, are buried in the hydrophobic core of the barrel. The helices, farther from the active site and forming the perimeter of the barrel, have faces exposed to solvent (fig. 2). Hence, the helices of α/β-barrels evolve more rapidly than do sheets, not because they have any innate tendency to do so but because their position and exposure to solvent place them in regions where the functional and structural consequences of amino acid replacements are less severe. Indeed, sites in helices exposed to solvent evolve far more rapidly than those buried against the hydrophobic core (fig. 7). Secondary structure is of little consequence in determining rates of amino acid replacement.
RUBISCO is a notable exception to the above generalization. Its frequency spectrum of amino acid replacements differs dramatically from others in having far higher proportions of both slowly and rapidly evolving sites (fig. 6). Inspection of the structure reveals the typical overall pattern: a conserved active site surrounded by evolving sites, with the most rapidly evolving sites being remote and exposed to solvent. The structure offers no obvious explanation as to why the frequency spectrum should differ so markedly. It also does not offer any obvious insight into why the minimal regression model fits so poorly ($q^2 = r^2/\hat{\rho}^2 = 0.18$; table 4), save that the greater site-to-site variability in rates, spread throughout the structure, reduces the correlation.
Analysis of the aldo-keto reductases also yields a poor fit to the minimal model ($q^2 = r^2/\hat{\rho}^2 = 0.21$). Found in eukaryotes and prokaryotes, these enzymes belong to a diverse superfamily, sharing obvious sequence identity, a common structural fold, and a common catalytic mechanism but having widely different biological functions (Jez et al. 1997). Upon further investigation we discovered that the poor fit ($q^2 = r^2/\hat{\rho}^2 = 0.113$; table 4) is attributable to a cluster of hydroxysteroid dehydrogenases (HSDs) and allied enzymes. Once the HSDs are removed, the remaining aldo-keto reductases behave in a fashion typical of many other superfamilies ($q^2 = r^2/\hat{\rho}^2 = 0.42$; table 4), with the most rapidly evolving sites scattered over the surface, well away from the active site (fig. 9A). In stark contrast, the most rapidly evolving sites in the HSDs cluster on either side of the substrate-binding cleft in the active site (fig. 9B). These replacements are concentrated in three loops that, when introduced into mammalian 3α-HSD from 20α-HSD, switch specificity from androgens to progestins (Ma and Penning 1999).
A reasonable fit (q² = 0.6; table 4) is obtained when our simple model is applied to xylose isomerase. However, when the residues with the 25 highest normalized deviations (observed/expected - 1) are plotted onto the protein structure, they form two contiguous bands, each flanking pairs of active sites in the tetramer (fig. 10). These "rings of fire" are not caused by the small number of expected replacements at sites near the active site: at only three of the 15 sites is the expected number of replacements less than one, and at other sites in the vicinity there is no tendency toward high normalized deviations. Similar patterns are not evident in other proteins; in pyruvate kinase, another large tetrameric enzyme, such sites are scattered haphazardly throughout the structure. The cause of these rings of fire remains a mystery, although their proximity to the active sites is suggestive of mechanistic consequences subject to natural selection.
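The normalized deviation used above (observed/expected - 1) and the check on low expected counts can be illustrated with a minimal sketch. The function names and the dictionary layout (site number mapped to a count) are assumptions made for illustration; they are not taken from the analysis itself.

```python
def normalized_deviations(observed, expected):
    """Return observed/expected - 1 for every site with a positive expectation.

    Both arguments are hypothetical dicts mapping a site number to a count of
    amino acid replacements (observed) or to the model's expected count.
    """
    return {
        site: observed[site] / expected[site] - 1.0
        for site in observed
        if expected.get(site, 0.0) > 0.0
    }


def top_sites(deviations, k=25):
    """Return the k sites with the highest normalized deviations, largest first."""
    return sorted(deviations, key=deviations.get, reverse=True)[:k]


def low_expectation_sites(sites, expected, threshold=1.0):
    """Return the sites whose expected count is below the threshold.

    A high deviation at such a site may simply reflect a small denominator.
    """
    return [site for site in sites if expected[site] < threshold]
```

Sites returned by `top_sites` could then be mapped onto a structure, and `low_expectation_sites` indicates how many of them owe their rank to a very small expected count rather than to an excess of replacements.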
The patterns and distributions of amino acid replacements among α/β-barrel enzymes are remarkably consistent, regardless of their diverse biochemical, metabolic, and biological roles. Indeed, fully half of the variation attributable to causal effects is explained by a simple regression model consisting of nothing more than solvent accessibility, distance from the active site, and treating glycines occupying unusual main-chain conformations as a separate class. Other factors, notably secondary structure, exert little influence.
These results are general. The existence of additional domains in α/β-barrels has no obvious effect, and the simple model proves an equally good fit to isocitrate dehydrogenase, an enzyme that completely lacks an α/β-barrel (Dean and Golding 2000). On rare occasions when, as in the active site of HSDs, biological necessity disturbs the general pattern, a goodly portion of the causal variation in rates remains explainable by overall structural considerations. Nevertheless, our simple statistical analysis reveals that a considerable portion of the remaining unexplained variation is not attributable to chance. Other, as yet unidentified, forces must influence protein evolution.
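The general form of such a three-predictor model can be sketched as an ordinary least-squares fit of per-site rates on solvent accessibility, distance from the active site, and a 0/1 indicator for glycines in unusual main-chain conformations. This is only an illustration under stated assumptions: the function names are hypothetical, the fit uses plain normal equations, and the returned statistic is the ordinary R² (share of total variance explained), not the q² reported in table 4, which is expressed relative to the causal variance.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_minimal_model(accessibility, distance, unusual_gly, rates):
    """Least-squares fit of rate ~ 1 + accessibility + distance + unusual_gly.

    Returns (coefficients, r_squared). All inputs are equal-length sequences,
    one entry per site; unusual_gly holds 0/1 flags (hypothetical encoding).
    """
    rows = [[1.0, a, d, float(g)]
            for a, d, g in zip(accessibility, distance, unusual_gly)]
    k = 4
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, rates)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean = sum(rates) / len(rates)
    ss_res = sum((y - f) ** 2 for y, f in zip(rates, fitted))
    ss_tot = sum((y - mean) ** 2 for y in rates)
    return coef, 1.0 - ss_res / ss_tot
```

Converting the R² from such a fit into a q²-like quantity would additionally require an estimate of the variance attributable to chance, which this sketch does not attempt.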
![]() |
Acknowledgements |
---|
![]() |
Footnotes |
---|
Keywords: amino acid replacement, evolution, rate, structure
Address for correspondence and reprints: Antony M. Dean, The Biological Process Technology Institute, 240 Gortner Laboratories, 1479 Gortner Avenue, University of Minnesota, St. Paul, Minnesota 55108. adean{at}biosci.umn.edu
![]() |
References |
---|
Altschul S. F., T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, 1997 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Nucleic Acids Res 25:3389-3402
Atchley W. R., W. Terhalle, A. W. Dress, 1999 Positional dependence, cliques, and predictive motifs in the bHLH protein domain J. Mol. Evol 48:501-516
Atchley W. R., K. R. Wollenberg, W. M. Fitch, W. Terhalle, A. W. Dress, 2000 Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis Mol. Biol. Evol 17:164-178
Babbitt P. C., J. A. Gerlt, 1997 Understanding enzyme superfamilies. Chemistry as the fundamental determinant in the evolution of new catalytic activities J. Biol. Chem 272:30591-30594
Bishop J. G., A. M. Dean, T. Mitchell-Olds, 2000 Rapid adaptive evolution in the active site of plant class I chitinases Proc. Natl. Acad. Sci. USA 97:5322-5327
Branden C., J. Tooze, 1999 Introduction to protein structure. 2nd edition Garland Science, Ky
Bustamante C. D., J. P. Townsend, D. L. Hartl, 2000 Solvent accessibility and purifying selection within proteins of Escherichia coli and Salmonella enterica Mol. Biol. Evol 17:301-308
Copley R. R., P. Bork, 2000 Homology among (βα)8 barrels: implications for the evolution of metabolic pathways J. Mol. Biol 303:627-640
Dean A. M., G. B. Golding, 2000 Enzyme evolution explained (sort of) Pp. 112 in R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, eds. The Pacific symposium on bioinformatics 2000. World Scientific, Singapore
Felsenstein J., 1989 PHYLIP (phylogeny inference package). (Version 3.2) Cladistics 5:164-166
Fisher R. A., 1930 The genetical theory of natural selection Clarendon Press, Oxford
Fisher R. A., 1948 Statistical methods for research workers. 10th edition Oliver and Boyd, Edinburgh
Fitch W. M., E. Margoliash, 1967 Construction of phylogenetic trees. A method based on mutation distances as estimated from cytochrome c sequences is of general applicability Science 155:279-284
Gerlt J. A., 2000 New wine from old barrels Nat. Struct. Biol 7:171-173
Goldman N., J. L. Thorne, D. T. Jones, 1998 Assessing the impact of secondary structure and solvent accessibility on protein evolution Genetics 149:444-458
Gu X., 1999 Statistical methods for testing functional divergence after gene duplication Mol. Biol. Evol 16:1664-1674
Gu X., 2001 Maximum likelihood approach for gene family evolution under functional divergence Mol. Biol. Evol 18:453-464
Gu X., J. Zhang, 1997 A simple method for estimating the parameter of substitution rate variation among sites Mol. Biol. Evol 14:1106-1113
Higgins D. G., J. D. Thompson, T. J. Gibson, 1996 Using CLUSTAL for multiple sequence alignments Methods Enzymol 266:383-402
Holm L., C. Sander, 1996 Mapping the protein universe Science 273:595-602
Hughes A. L., M. Nei, 1988 Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection Nature 335:167-170
Jez J. M., M. J. Bennett, B. P. Schlegel, M. Lewis, T. M. Penning, 1997 Comparative anatomy of the aldo-keto reductase superfamily Biochem. J 326:625-636
Jones D. T., W. R. Taylor, J. M. Thornton, 1992 The rapid generation of mutation data matrices from protein sequences Comput. Appl. Biosci 8:275-282
Kendall M., A. Stuart, 1977 The advanced theory of statistics. 4th edition, Vol. 1 Macmillan, New York
Kimura M., 1983 The neutral theory of molecular evolution Cambridge University Press, Cambridge, U.K
Kimura M., T. Ohta, 1973 Mutation and evolution at the molecular level Genetics 73:19-35
Landgraf R., D. Fischer, D. Eisenberg, 1999 Analysis of heregulin symmetry by weighted evolutionary tracing Protein Eng 12:943-951
Li W.-H., 1997 Molecular evolution Sinauer Associates, Sunderland, Mass
Lio P., N. Goldman, 1999 Using protein structural information in evolutionary inference: transmembrane proteins Mol. Biol. Evol 16:1696-1710
Lo Conte L., B. Ailey, T. J. Hubbard, S. E. Brenner, A. G. Murzin, C. Chothia, 2000 SCOP: a structural classification of proteins database Nucleic Acids Res 28:257-259
Ma H., T. M. Penning, 1999 Conversion of mammalian 3α-hydroxysteroid dehydrogenase to 20α-hydroxysteroid dehydrogenase using loop chimeras: changing specificity from androgens to progestins Proc. Natl. Acad. Sci. USA 96:11161-11166
Miyamoto M. M., W. M. Fitch, 1996 Constraints on protein evolution and the age of the eubacterial/eukaryotic split Syst. Biol 45:568-575
Pollock D., W. R. Taylor, N. Goldman, 1999 Coevolving protein residues: maximum-likelihood identification and relationship to structure J. Mol. Biol 287:187-198
Shindyalov I. N., P. E. Bourne, 1998 Protein structure alignment by incremental combinatorial extension (CE) of the optimal path Protein Eng 11:739-747
Swofford D. L., 1998 Phylogenetic analysis using parsimony (* and other methods) Sinauer Associates, Sunderland, Mass
Tateno Y., N. Takezaki, M. Nei, 1994 Relative efficiencies of the maximum-likelihood, Neighbor-Joining, and maximum-parsimony methods when substitution rate varies with sites Mol. Biol. Evol 11:261-277
Terwisscha van Scheltinga A. C., S. Armand, K. H. Kalk, A. Isogai, B. Henrissat, B. W. Dijkstra, 1995 Stereochemistry of chitin hydrolysis by a plant chitinase/lysozyme and X-ray structure of a complex with allosamidin: evidence for substrate assisted catalysis Biochemistry 34:15619-15623
Thompson J. D., D. G. Higgins, T. J. Gibson, 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res 22:4673-4680
Thorne J. L., 2000 Models of protein sequence evolution and their applications Curr. Opin. Genet. Dev 10:602-605
Uzzell T., K. W. Corbin, 1971 Fitting discrete probability distributions to evolutionary events Science 172:1089-1096
Walsh C., 1979 Enzymatic reaction mechanisms Freeman, NY
Yang Z., 1994 Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites Mol. Biol. Evol 10:1396-1401
Yang Z., 1996 Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites: approximate methods J. Mol. Evol 39:306-314
Yang Z., J. Bielawski, 2000 Statistical methods for detecting molecular adaptation Trends Ecol. Evol 15:496-503