Department of Organismic and Evolutionary Biology, Harvard University
Abstract |
---|
Introduction |
---|
It is well known that the strength of purifying selection varies considerably between classes of DNA sites (e.g., between sites that alter amino acid sequence vs. those that do not). It has also been established through comparison of Ka/Ks (the ratio of divergence at amino acid replacement sites relative to divergence at synonymous sites) that purifying selection varies considerably between different proteins and, within the same protein, between different regions (Li 1997). Kimura's formulation of the mutation-drift hypothesis postulated that differences in the strength of purifying selection between different proteins are due to differences in functional constraint, such that genes that evolve quickly are more robust with respect to amino acid sequence than those that evolve slowly.
Understanding variability in substitution rates between different regions of proteins and between different classes of amino acid residues has been of considerable interest to molecular evolutionists. A growing literature in molecular phylogenetics has begun to address the question of how structural constraints relate to rate variation and thus to phylogenetic estimation (e.g., Naylor and Brown 1997). It has also been shown that location in the secondary structure and solvent accessibility systematically affect substitution rates in a wide range of protein families (Goldman, Thorne, and Jones 1998). The problem has also been of considerable interest to those who work on protein folding, since nonrandom substitution patterns can signal structural constraints as well as motifs important for optimizing protein-folding prediction (e.g., Koshi and Goldstein 1995; Overington et al. 1992).
The nature of structural factors determining levels of variation below the species level has not been examined. This is largely because the three-dimensional structures of the majority of proteins studied in population genetics are unknown. In this paper, we analyze five enzymes for which sequence variation among natural isolates of Escherichia coli and Salmonella enterica has been characterized and for which protein structures of the E. coli forms of the enzymes are known. For these five proteins, we find that solvent accessibility in the protein structure is a strong predictor of whether or not an amino acid will be polymorphic. This, of course, does not imply that any particular amino acid polymorphism is selectively neutral, only that purifying selection at the site is weak enough to allow the particular amino acid replacement to become polymorphic (Hartl et al. 2000). Here, we show that solvent accessibility is a better predictor of polymorphism for a given amino acid than its size, its physicochemical properties, or its location in the secondary structure of the protein.
Materials and Methods |
---|
Solvent-Accessible Surface
A measure used throughout this paper is the exposed surface area of amino acid residues. This concept has been used extensively in structural biophysics to estimate the net gain in free energy due to protein folding as hydrophobic amino acid residues shed their "water cages" (Chothia 1974; Ooi et al. 1987) and is also used in energy refinement of protein structure (von Freyberg, Richmond, and Braun 1993). Consider the solvent-accessible surface area (SAS) of an atom, defined as the area on the surface of a sphere of radius R on each point of which the center of a solvent molecule can be placed in contact with the van der Waals sphere around the atom without penetrating any other atom in the molecule. The radius R is therefore given by the sum of the van der Waals radius of the atom and the chosen radius of the solvent molecule (Lee and Richards 1971). Finding the SAS of an amino acid is equivalent to rolling a water molecule (or another solvent molecule) over the van der Waals radii of the atoms in the amino acid as it is packed into the protein structure and calculating the surface area that the water molecule touches.
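The definition above can be illustrated numerically. The following is a minimal sketch, not the MOLMOL or ASC calculation: it uses a dot-sampling approximation in the spirit of Shrake and Rupley, in which test points are placed on the sphere of radius R around each atom and a point counts as accessible if it lies outside the expanded spheres of all other atoms. The function names, the 1.4 Å probe radius, and the number of test points are assumptions made for illustration.

```python
import math

def sphere_points(n):
    """Return n roughly evenly spaced unit vectors (golden-spiral layout)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        phi = golden_angle * k
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points

def accessible_area(atoms, probe=1.4, n_points=960):
    """Approximate the solvent-accessible surface area of each atom.

    atoms: list of (x, y, z, vdw_radius) tuples in angstroms.
    probe: assumed solvent radius in angstroms (hypothetical default).
    Returns one area (square angstroms) per atom.
    """
    unit = sphere_points(n_points)
    areas = []
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe  # R = van der Waals radius + solvent radius
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            buried = False
            for j, (xj, yj, zj, rj) in enumerate(atoms):
                if j == i:
                    continue
                # A probe centre here would penetrate atom j.
                if (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (rj + probe) ** 2:
                    buried = True
                    break
            if not buried:
                exposed += 1
        areas.append(4.0 * math.pi * big_r ** 2 * exposed / n_points)
    return areas
```

For an isolated atom every test point is accessible, so the result is the full sphere area 4πR²; a neighbouring atom removes the spherical cap that its expanded sphere covers.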
In our analysis, the SAS measure was used to estimate the proportion of each amino acid residue that is accessible to solvent. This was done by taking the ratio of the SAS calculated from the actual protein structure to the maximum exposed surface area in the fully extended conformation of the pentapeptide gly-gly-X-gly-gly, where X is the amino acid in question. We used two methods to estimate solvent accessibility: the one implemented in the package MOLMOL (Koradi, Billeter, and Wuthrich 1996) and Eisenhaber's ASC method (Eisenhaber and Argos 1993; Eisenhaber et al. 1995). The methods gave indistinguishable results. The distributions of solvent accessibility for polymorphic and invariant residues for each enzyme are indicated by the open and shaded bars, respectively, in figure 2. In this figure, the line segments connect the proportion of polymorphic amino acids observed in each category of solvent accessibility.
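The ratio described above can be written as a one-line calculation. This is only an illustrative sketch: the function name is hypothetical, and the surface areas used in the example are made-up numbers, not values from MOLMOL, ASC, or the proteins studied here.

```python
def relative_accessibility(sas_in_structure, sas_extended):
    """Fraction of a residue exposed to solvent.

    sas_in_structure: SAS of the residue in the folded protein.
    sas_extended: SAS of the same residue X in extended gly-gly-X-gly-gly.
    Both areas must be in the same units.
    """
    if sas_extended <= 0:
        raise ValueError("reference surface area must be positive")
    return sas_in_structure / sas_extended

# Hypothetical residue: 45 square angstroms exposed out of a 180 maximum.
example = relative_accessibility(45.0, 180.0)
```

Here `example` is 0.25, meaning that a quarter of the residue's maximum surface is accessible in the folded structure.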
In the logistic regression model (eq. 1), α, β1, β2, β3, and β4 are the intercept and slopes for secondary-structure class, solvent accessibility, amino acid size, and physicochemical class, respectively, and Wi, Xi, Yi, and Zi are the values of secondary-structure class, solvent accessibility, size, and physicochemical class for the ith amino acid in the primary sequence of the solved structure. The parameters β1 and β4 allow for a unique intercept, and the parameters β5 and β6 allow for unique slopes for each secondary-structure element and physicochemical class of amino acid, respectively. For the sake of clarity, the parameter β2 is hereinafter referred to as βsas.
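For orientation, the reduced model with solvent accessibility alone can be written in the standard logistic form. This is a sketch of the general form of a one-predictor logistic regression rather than a reproduction of the full model; the symbol p_i for the probability that the ith residue is polymorphic is notation introduced here, and α is assumed to denote the intercept.

```latex
\ln\!\left(\frac{p_i}{1 - p_i}\right) = \alpha + \beta_{\mathrm{sas}} X_i
\quad\Longleftrightarrow\quad
p_i = \frac{1}{1 + e^{-(\alpha + \beta_{\mathrm{sas}} X_i)}}
```

A positive β_sas means that the log odds of polymorphism rise linearly, and the probability rises sigmoidally, with solvent accessibility X_i.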
Maximum likelihood is the standard method used to estimate the slopes and intercepts for logistic regressions. Because the equations obtained by setting the derivatives of the log-likelihood function to zero have no closed-form solution (Christensen 1997), we used Newton-Raphson iteration to obtain the estimates. Confidence intervals for the slopes and intercepts reported in this paper are based on nonparametric bootstrapping of the data, with 1,000 replicate data sets generated using a published algorithm for STATA (King, Tomz, and Wittenberg 1998). We report the 25th and 975th ranked estimates of the relevant parameter, which bound the central 95% of the bootstrap distribution.
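The estimation procedure can be sketched for the one-predictor case as follows. This is a generic illustration, not the STATA algorithm cited above: the function names are hypothetical, and the choice to redraw bootstrap replicates on which Newton-Raphson fails to converge (for example, a resample with no polymorphic residues) is an assumption of this sketch.

```python
import math
import random

def _sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_logistic(x, y, iterations=50, tol=1e-8):
    """Newton-Raphson ML estimates (intercept, slope) for one predictor."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = _sigmoid(a + b * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            raise ValueError("singular information matrix")
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    else:
        raise ValueError("Newton-Raphson did not converge")
    return a, b

def bootstrap_slope_ci(x, y, replicates=1000, seed=0):
    """Percentile bootstrap interval for the slope (resampling residues)."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < replicates:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            _, b = fit_logistic([x[i] for i in idx], [y[i] for i in idx])
        except ValueError:
            continue  # assumption: redraw degenerate resamples
        slopes.append(b)
    slopes.sort()
    lo = slopes[max(0, int(0.025 * replicates) - 1)]  # 25th of 1,000
    hi = slopes[max(0, int(0.975 * replicates) - 1)]  # 975th of 1,000
    return lo, hi
```

Here `x` would hold the solvent accessibility of each residue and `y` a 0/1 indicator of polymorphism; with 1,000 replicates the returned pair corresponds to the 25th and 975th ranked slope estimates.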
Multiple logistic regression models were explored to determine whether including amino acid size (residue mass in daltons), physicochemical class, and secondary structure made a significant improvement on the reduced model with solvent accessibility alone. To assess improvement between nested models that differed in complexity, we used twice the difference in the log-likelihoods of the hypotheses, which is approximately χ² distributed with degrees of freedom equal to the difference in degrees of freedom between the models considered.
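As a small worked illustration of the nested-model comparison, the sketch below computes the likelihood ratio statistic and its P value for the special case of one degree of freedom, where the χ² tail probability reduces to a complementary error function. The function name and the log-likelihood values in the example are hypothetical.

```python
import math

def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """Return (statistic, p_value) for nested models differing by 1 parameter.

    The statistic is twice the gain in log-likelihood; for one degree of
    freedom the chi-square survival function equals erfc(sqrt(stat / 2)).
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods for a reduced and a full model.
stat, p = likelihood_ratio_test_1df(-100.0, -98.0795)
```

In this made-up example the statistic is about 3.84 and the P value is about 0.05, the conventional significance threshold for one degree of freedom.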
We estimated how Ka/Ks changes in trpC with solvent accessibility by using separate logistic regressions for replacement polymorphism versus synonymous polymorphism after classifying amino acids according to synonymy class (twofold redundant and fourfold redundant; amino acids that were neither twofold nor fourfold redundant were ignored). For each partition, we estimated Ka/Ks for a given value of solvent accessibility, Xo, as

$$\frac{K_a}{K_s}(X_o) \approx \frac{C}{1 - C} \cdot \frac{P_a(X_o)}{P_s(X_o)}$$

where Pa(Xo) is the probability of amino acid polymorphism per codon at Xo, and Ps(Xo) is the probability of synonymous polymorphism per codon at Xo calculated from the logistic regression (eq. 1); C is the fraction of all single nucleotide changes that lead to a synonymous substitution assuming equal frequencies of nucleotide substitution. The quotient C/(1 - C) is a scaling coefficient that allows us to generate a proxy for Ka/Ks from the ratio of the probability of replacement polymorphism to the probability of synonymous polymorphism. For twofold-redundant sites, C = 1/9 and C/(1 - C) = 1/8. For fourfold-redundant sites, C = 3/9 and C/(1 - C) = 1/2. Confidence intervals for Ka/Ks were obtained from replicate data sets generated through nonparametric bootstrapping.
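The scaling described above can be checked with a short calculation. This is a sketch under the stated definitions (the scaling coefficient C/(1 - C) multiplied by the ratio of replacement to synonymous polymorphism probabilities); the function name and the probabilities used in the example are hypothetical.

```python
from fractions import Fraction

def ka_ks_proxy(p_replacement, p_synonymous, c):
    """Proxy for Ka/Ks at one value of solvent accessibility.

    p_replacement, p_synonymous: per-codon probabilities of replacement
    and synonymous polymorphism predicted by the logistic regressions.
    c: fraction of single-nucleotide changes that are synonymous.
    """
    if p_synonymous <= 0 or not 0 < c < 1:
        raise ValueError("need p_synonymous > 0 and 0 < c < 1")
    return (c / (1 - c)) * (p_replacement / p_synonymous)

# Scaling coefficients for the two synonymy classes.
twofold_scale = Fraction(1, 9) / (1 - Fraction(1, 9))    # 1/8
fourfold_scale = Fraction(3, 9) / (1 - Fraction(3, 9))   # 1/2

# Hypothetical twofold-redundant codons with Pa = 0.04 and Ps = 0.10.
example = ka_ks_proxy(0.04, 0.10, 1 / 9)
```

With these made-up probabilities the proxy is 0.05, that is, (1/8) × 0.4.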
Results |
---|
Table 1 summarizes the maximum-likelihood estimates of the parameters in the logistic regression model of polymorphism on solvent accessibility for amino acids grouped by protein. In the first column, we list the number of sequences from each species used in our study. The second column lists the proportion of sites that vary within each protein. The third and fourth columns give the maximum-likelihood estimates of the intercept α and the slope βsas, respectively. The fifth column gives the results of likelihood ratio tests (LRTs) of whether the model with βsas = MLE(βsas) fits the data significantly better than a model with βsas = 0, where the test statistic is approximately distributed as χ² with one degree of freedom. For four of the five genes, a model in which the probability of polymorphism increases with solvent accessibility is significantly better, and the one protein (phoA) for which the test is not significant has the least polymorphism, which compromises the power of the test. Using the analytical approximation of Whittemore (1981), it can be shown that for overall frequencies of polymorphism of 2%, 5%, and 10%, we have approximately 30%, 70%, and 90% power, respectively, to reject the null hypothesis that βsas = 0 in favor of βsas = 3.5 (the average βsas for all of the genes).
Discussion |
---|
The structures themselves (fig. 1) suggest a concentration of polymorphic amino acid sites on the "outside" of each enzyme. These tend to be regions of relatively high solvent accessibility, and many of the polymorphic residues protrude from the structure in such a way that an amino acid replacement would not drastically alter hydrogen bonding or hydrophobic contacts made with other residues. That polymorphic residues tend to cluster on the outside of molecules is supported by histograms of solvent accessibility for invariant and polymorphic residues (fig. 2). Nevertheless, our analysis indicates that the "outside"-"inside" dichotomy is too simplistic. The probability of amino acid polymorphism increases as a continuous function of solvent accessibility.
The logistic regression analysis combines the intuitive appeal of ordinary least-squares regression with the ease of a likelihood framework for testing more complicated models. We found that all of the proteins surveyed showed strong effects of solvent accessibility on relative probability of polymorphism. This effect was significant for four of five proteins, and the one nonsignificant protein was also the least polymorphic so that the test had the least power. Unexpectedly, all five proteins had very similar regression coefficients, suggesting that lower solvent accessibility may be similarly associated with stronger selective constraints across a wide range of enzymes differing in myriad details of their individual structures.
We also investigated whether the effect of solvent accessibility merely reflects a shift in amino acid composition from areas of low solvent accessibility to areas of high solvent accessibility or from one element of secondary structure to another. For example, if hydrophobic residues tended to be concentrated in areas of low solvent accessibility and also tended to be monomorphic, but for charged amino acids the relations were the other way around, the overall correlation of polymorphism with solvent accessibility would be spurious. This is not the case. When we compare the estimates of the slope, βsas, for each of the major classes of amino acids in tables 2 and 3, we note a striking similarity. There is also a good fit between the predicted probability of polymorphism from the combined logistic regression and the observed probability of polymorphism for amino acids grouped by physicochemical properties as hydrophobic (H), charged (C), and uncharged (U) (fig. 3A) and grouped by structural elements as helix (X), sheet (S), and coil (L) (fig. 3B). To address this issue formally, we also estimated multiple-regression models (table 4) that included size, secondary structure, and/or physicochemical class with and without solvent accessibility. Multiple-regression models that included solvent accessibility were significantly better at predicting probability of polymorphism than those that did not include it, and including amino acid size and/or physicochemical class in a multiple logistic regression made no significant improvement to a simpler model with solvent accessibility alone (table 4). The one improvement that could be made on the simplest model of solvent accessibility alone was to add an intercept term to account for differences in overall levels of polymorphism between elements of secondary structure. In short, the probability of polymorphism is more closely related to solvent accessibility than to amino acid identity, secondary structure, or size.
The logistic regression was also used in conjunction with data on synonymous polymorphism to estimate quantitatively the reduction in purifying selection with increasing solvent accessibility. When compared with the distribution of synonymous polymorphism, the increased probability of amino acid polymorphism with solvent accessibility (fig. 4) suggests strong purifying selection in areas of low solvent accessibility and weak purifying selection in areas of high solvent accessibility, irrespective of synonymy class. The reduction in purifying selection is so large that sites near the high end of the solvent accessibility range appear to be evolving at a rate 5 to 10 times as fast (Ka/Ks ≈ 0.5 for both fourfold- and twofold-redundant sites) as those in areas of low solvent accessibility that are under very strong selection (Ka/Ks ≈ 0.1 for fourfold-redundant sites, and Ka/Ks < 0.05 for twofold-redundant sites).
Although our results are based on only five proteins, they tentatively suggest that similar constraints may govern disparate enzymes independent of their function. This finding, if proven to be general, may be rationalized in a broader consideration of how enzymes are thought to function. For a particular enzyme, only a few key residues are directly involved in the catalytic function (i.e., those residues directly in the vicinity of the active site). The majority of other residues play a role in maintaining the correct three-dimensional structure of the protein so that the protein can perform its function (Pakula and Sauer 1989). Our results tentatively suggest that the majority of the sites that are allowed to vary within species are those sites that are less involved in the stabilizing of protein structure, since they are residues that are in close contact with solvent and thus do not form hydrogen bonds with other residues in the protein. The pervasive effect of solvent accessibility on polymorphism argues for a theory of universal structural constraint on amino acid evolution in enzymes and perhaps in other classes of protein structure.
Acknowledgements |
---|
Footnotes |
---|
1 Abbreviations: LRT, log-likelihood ratio test; MLE, maximum-likelihood estimate; SAS, solvent-accessible surface.
2 Keywords: purifying selection, polymorphism, solvent accessibility, neutral theory, Escherichia coli, Salmonella enterica, logistic regression
3 Address for correspondence and reprints: Daniel L. Hartl, 16 Divinity Avenue, Cambridge, Massachusetts 02138. E-mail: dhartl{at}oeb.harvard.edu
Literature Cited |
---|
Boyd, E. F., K. Nelson, F. S. Wang, T. S. Whittam, and R. K. Selander. 1994. Molecular genetic basis of allelic polymorphism in malate dehydrogenase (mdh) in natural populations of Escherichia coli and Salmonella enterica. Proc. Natl. Acad. Sci. USA 91:1280-1284.
Chothia, C. 1974. Hydrophobic bonding and accessible surface area in proteins. Nature 248:338-339.
Christensen, R. 1997. Log-linear models and logistic regression. Springer, New York.
DuBose, R. F., D. E. Dykhuizen, and D. L. Hartl. 1988. Genetic exchange among natural isolates of bacteria: recombination within the phoA gene of Escherichia coli. Proc. Natl. Acad. Sci. USA 85:7036-7040.
Duee, E., L. Olivier-Deyris, E. Fanchon, C. Corbier, G. Branlant, and O. Dideberg. 1996. Comparison of the structures of wild-type and a N313T mutant of Escherichia coli glyceraldehyde 3-phosphate dehydrogenases: implication for NAD binding and cooperativity. J. Mol. Biol. 257:814-838.
Eisenhaber, F., and P. Argos. 1993. Improved strategy in analytic surface calculations for molecular systems: handling of singularities and computational efficiency. J. Comput. Chem. 14:1272-1280.
Eisenhaber, F., P. Lijnzaad, P. Argos, C. Sander, and M. Scharf. 1995. The double cubic lattice method: efficient approaches to numerical integration of surface area and volume and to dot surface contouring of molecular assemblies. J. Comput. Chem. 16:273-284.
Goldman, N., J. L. Thorne, and D. T. Jones. 1998. Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 149:445-458.
Hall, M. D., D. G. Levitt, and L. J. Banaszak. 1992. Crystal structure of Escherichia coli malate dehydrogenase. A complex of the apoenzyme and citrate at 1.87 Å resolution. J. Mol. Biol. 226:867-882.
Hartl, D., E. F. Boyd, C. D. Bustamante, and S. Sawyer. 2000. The glean machine: What can we learn from DNA sequence polymorphism? In S. Suhai, ed. Genomics and proteomics. Plenum Press, New York (in press).
Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, England.
King, G., M. J. Tomz, and J. Wittenberg. 1998. Making the most of statistical analyses: improving interpretation and presentation. Am. J. Political Sci. (in press).
Koradi, R., M. Billeter, and K. Wuthrich. 1996. MOLMOL: a program for display and analysis of macromolecular structures. J. Mol. Graph. 14:51-55.
Koshi, J. M., and R. A. Goldstein. 1995. Context-dependent optimal substitution matrices. Protein Eng. 8:641-645.
Lawrence, J. G., D. L. Hartl, and H. Ochman. 1991. Molecular considerations in the evolution of bacterial genes. J. Mol. Evol. 33:241-250.
Lee, B., and F. M. Richards. 1971. The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol. 55:379-400.
Li, W.-H. 1997. Molecular evolution. Sinauer, Sunderland, Mass.
Milkman, R., and M. M. Bridges. 1993. Molecular evolution of the Escherichia coli chromosome. IV. Sequence comparisons. Genetics 133:455-468.
Naylor, G. J., and W. M. Brown. 1997. Structural biology and phylogenetic estimation. Nature 388:527-528.
Nelson, K., T. S. Whittam, and R. K. Selander. 1991. Nucleotide polymorphism and evolution in the glyceraldehyde-3-phosphate dehydrogenase gene (gapA) in natural populations of Salmonella and Escherichia coli. Proc. Natl. Acad. Sci. USA 88:6667-6671.
Ooi, T., M. Oobatake, G. Nemethy, and H. A. Scheraga. 1987. Accessible surface areas as a measure of the thermodynamic parameters of hydration of peptides. Proc. Natl. Acad. Sci. USA 84:3086-3090.
Overington, J., D. Donnelly, M. S. Johnson, A. Sali, and T. L. Blundell. 1992. Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci. 1:216-226.
Pakula, A. A., and R. T. Sauer. 1989. Genetic analysis of protein stability and function. Annu. Rev. Genet. 23:289-310.
Priestle, J. P., M. G. Grutter, J. L. White, M. G. Vincent, M. Kania, E. Wilson, T. S. Jardetzky, K. Kirschner, and J. N. Jansonius. 1987. Three-dimensional structure of the bifunctional enzyme N-(5'-phosphoribosyl)anthranilate isomerase-indole-3-glycerol-phosphate synthase from Escherichia coli. Proc. Natl. Acad. Sci. USA 84:5690-5694.
Pupo, G. M., D. K. Karaolis, R. Lan, and P. R. Reeves. 1997. Evolutionary relationships among pathogenic and nonpathogenic Escherichia coli strains inferred from multilocus enzyme electrophoresis and mdh sequence studies. Infect. Immun. 65:2685-2692.
Stoddard, B. L., A. Dean, and D. E. Koshland Jr. 1993. Structure of isocitrate dehydrogenase with isocitrate, nicotinamide adenine dinucleotide phosphate, and calcium at 2.5-Å resolution: a pseudo-Michaelis ternary complex. Biochemistry 32:9310-9316.
von Freyberg, B., T. J. Richmond, and W. Braun. 1993. Surface area included in energy refinement of proteins. A comparative study on atomic solvation parameters. J. Mol. Biol. 233:275-292.
Wang, F. S., T. S. Whittam, and R. K. Selander. 1997. Evolutionary genetics of the isocitrate dehydrogenase gene (icd) in Escherichia coli and Salmonella enterica. J. Bacteriol. 179:6551-6559.
Whittemore, A. 1981. Sample size for logistic regression with small response probability. J. Am. Stat. Assoc. 76:27-32.
Wilmanns, M., J. P. Priestle, T. Niermann, and J. N. Jansonius. 1992. Three-dimensional structure of the bifunctional enzyme phosphoribosylanthranilate isomerase: indoleglycerolphosphate synthase from Escherichia coli refined at 2.0 Å resolution. J. Mol. Biol. 223:477-507.