Probabilistic approach to the design of symmetric protein quaternary structures

Xiaoran Fu1, Hidetoshi Kono2 and Jeffery G. Saven1,3

1Makineni Theoretical Laboratories, Department of Chemistry, University of Pennsylvania, 231 South 34th Street, Philadelphia, PA 19104, USA and 2Neutron Science Center and Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute, 8-1, Umemidai, Kizu-cho, Souraku-gun, Kyoto 619-0215, Japan

3 To whom correspondence should be addressed. e-mail: saven{at}sas.upenn.edu


    Abstract
 
Probabilistic methods have been developed that estimate the site-specific probabilities of the amino acids in sequences likely to fold to a particular target structure, and such information can be used to guide the de novo design of proteins and to probe sequence variability. An extension of these methods for the design of symmetric homo-oligomeric quaternary structures is presented. The theory is in excellent agreement with the results of studies on exactly solvable lattice models. Application to an atomically detailed representation of proteins verifies the utility of a symmetry assumption, which greatly simplifies and accelerates the calculations. The method may be applied to a wide variety of symmetric and periodic protein structures.

Keywords: computational protein design/protein oligomers/p53 tetramerization domain/quaternary structure


    Introduction
 
De novo protein design identifies amino acid sequences with predetermined folded structures and functions. Such design probes the determinants of protein stability and function and may lead to novel enzymes, therapeutics and biomaterials. The design of entire sequences is complicated by the size of proteins, their many conformational degrees of freedom, the subtlety of the interactions that stabilize the folded state, and the exponentially large number of possible sequences (Saven, 2001). Despite this complexity, computationally aided design has led to the successful realization of a variety of proteins (DeGrado et al., 1999; Kraemer-Pecore et al., 2001) and the redesign of natural proteins to confer novel functionalities (Bryson et al., 1995; Hellinga, 1999; Looger et al., 2003). Herein, we present an efficient methodology for estimating the site-specific amino acid probabilities of oligomeric protein structures, where the target structure for design comprises both quaternary and tertiary structure.

Most efforts in computational design apply ‘directed’ methods. Here ‘directed protein design’ refers to the search for a sequence (or a small set of sequences) consistent with a predetermined backbone structure. The search is directed toward those sequences having low energy (or a favorable score) when they assume the target structure. Pioneering efforts in design identified proteins having substantial ordering but not necessarily well defined tertiary structures (Bryson et al., 1995). Since exponentially large numbers of sequences are possible even for small proteins, computational methods have accelerated successful design. Exhaustive searching of all m^N possible sequences is usually intractable, where N is the number of variable residues and m is the number of residue degrees of freedom, e.g. the number of allowed amino acids and side chain conformations. Genetic algorithms (Jones, 1994) and simulated annealing methods (Shakhnovich and Gutin, 1993; Hellinga and Richards, 1994) search sequence space in a partially random fashion, on average progressing toward lower energy sequences. Such stochastic methods have been used to redesign a variety of natural proteins (Desjarlais and Handel, 1995, 1999; Desjarlais and Clarke, 1998; Johnson et al., 1999; Kraemer-Pecore et al., 2001) and novel helical proteins (Bryson et al., 1998). Elimination methods such as ‘dead end elimination’ provide better estimates of global optima and have been used to automate both the redesign of sub-regions within natural proteins (Malakauskas and Mayo, 1998; Strop and Mayo, 1999; Bolon and Mayo, 2001; Looger et al., 2003) and full sequence design (Dahiyat and Mayo, 1997). Directed approaches are sensitive to the energy functions used, which may be problematic given that such energy functions are necessarily approximate. Uncertainties in the energy function may not merit the search for global optima. For cases where information about a problem is incomplete, probabilistic rather than directed approaches to design may be appropriate.

Statistical methods complementary to directed protein design have been developed (Zou and Saven, 2000; Kono and Saven, 2001). Such methods reveal the features of sequences likely to fold to a particular structure but which may not be thermodynamically ‘optimal’. Such probabilistic methods provide site-specific information about the range of allowed amino acid substitutions. Within this methodology, much of the formalism of statistical thermodynamics is recast so as to reveal the properties of sequences likely to fold to a target three-dimensional structure. The site-specific probabilities of the amino acids are determined by maximizing an effective entropy function, subject to constraints on the sequences. Such constraints can be physically based, such as the energy of sequences after they acquire the target backbone structure, or functionally based, such as the patterning of amino acids at predetermined positions to confer metal binding. The theory takes as input (i) a given target structure, (ii) energy functions for quantifying sequence–structure compatibility and (iii) a set of constraints on the sequences. For some forms of the constraints, the approach reduces to a form of heterogeneous mean field theory (Koehl and Delarue, 1996). The theory yields estimates of the number of sequences and, most importantly, the site-specific probabilities of the amino acids and their side chain conformations. The computation time of the calculations scales as (Nm)^2. Statistical methods can therefore examine larger numbers N of variable residues, with a greater diversity of residue states m, in much less time than other computational methods. Sequences are not explicitly sampled; the calculations yield the amino acid probabilities directly, which are useful in protein design for identifying allowed amino acids at each position in a target structure. The probabilities may also be used both to identify sites tolerant of mutations, where functional properties may be engineered, and to guide the construction of a combinatorial library of protein sequences.

Most computational efforts in protein design have focused on the design of tertiary structures, but some of the earliest design targets were oligomeric structures. The DeGrado group has crafted a variety of dimeric, trimeric and tetrameric helical bundles (Bryson et al., 1995; DeGrado et al., 1999). Directed computational methods have been used in the design of coiled coils, where the symmetry of the structure facilitates the design (Harbury et al., 1998; Keating et al., 2001). Just as in nature, quaternary structure provides a facile route to large, well structured protein systems. It is therefore of interest to extend the probabilistic methods for designing proteins and protein libraries to include such intermolecular structure; that extension is presented herein. For symmetric arrangements of protomers, a symmetry assumption greatly reduces the computational overhead so that arbitrary oligomers and multidimensional arrays may be included as design targets.


    Theory
 
In this section, we briefly review the statistical theory that yields estimates of the site-specific amino acid probabilities for a given backbone structure. Molecular energy functions are a key component of the formalism for quantifying sequence–structure compatibility, and methods for incorporating inter- and intra-molecular energies into the statistical formalism are presented. For polymeric or oligomeric systems, i.e. homomeric systems with quaternary structure, a simplifying symmetry assumption greatly improves the efficiency of the calculations.

Overview of statistical theory

A statistical, entropy-based formalism has been developed to identify the features of sequences that are likely to fold to a given backbone structure (Saven and Wolynes, 1997; Zou and Saven, 2000; Kono and Saven, 2001). Generally, the method takes as input a given template backbone structure and a function for characterizing sequence–structure compatibility. The calculation yields the site-specific probabilities of the amino acids. Since the method does not involve the explicit generation of sequences, information about exponentially large numbers of sequences may be obtained. This is particularly useful given the large numbers of possible protein sequences (20^N for an N-residue protein) and the wide natural diversity of sequences sharing common folds. The generality of the method allows a number of requirements to be imposed upon sequences as constraints. Such constraints can include the patterning of residues, e.g. so as to confer solubility, as well as global energetic constraints, so as to identify the properties of sequences likely to have a particular target structure as their folded state.

The fundamentals of the method have been presented in previous work (Saven and Wolynes, 1997; Zou and Saven, 2000; Kono and Saven, 2001), and are briefly reviewed here. The most probable set of site-specific amino acid probabilities at each position is determined by maximizing an effective entropy function subject to imposed constraints using the method of Lagrange multipliers (McQuarrie, 1976), thus specifying a variational functional V:

$$V = S - \lambda_1 f_1 - \lambda_2 f_2 - \cdots \qquad (1)$$

The sequence entropy S quantifies the number of sequences likely to fold to a particular structure. The f_k are functions that impose the constraint conditions and the λ_k are their corresponding Lagrange multipliers. The variational functional V and the constraint functions f_k depend upon the site-specific probabilities w_i(α, r(α)). Each residue position (site) in the structure is labeled by the index i (i = 1, ..., N), where N is the total number of residues. Here w_i(α, r(α)) is the probability that residue position i is in a particular ‘state’, where the state of the residue is specified by both the identity of the amino acid α and the conformation of its side chain r(α). Generally, α labels any one of the possible amino acids (natural or non-natural), and r(α) labels the members of a discrete set of side chain conformations for each amino acid, so-called rotamer states (Dunbrack, 2002). Here S is written solely in terms of the one-body probabilities w_i(α, r(α)):
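An entropy written solely in terms of one-body probabilities takes the familiar factorized form shown here. This is a reconstruction from the surrounding definitions rather than a verbatim copy of the published expression, and the equation number is inferred from the later references to Equation 2.

```latex
S = -\sum_{i=1}^{N} \sum_{\alpha} \sum_{r(\alpha)} w_i(\alpha, r(\alpha)) \, \ln w_i(\alpha, r(\alpha)) \qquad (2)
```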

In identifying the state probabilities consistent with particular values of the constraints, the k-th constraint function f_k is constrained to have a particular value f_k^o:

$$f_k^{o} = f_k\left(\{w_i(\alpha, r(\alpha))\}\right) \qquad (3)$$

For example, such constraints may be used to specify the normalization of the probabilities, to pattern residue types, or to specify the energy sequences acquire when in the folded structure. As a result, the properties of sequences sharing these common constraints can be examined. The site-specific probabilities w_i(α, r(α)) are determined as those that maximize V subject to given values of f_1^o, f_2^o, ... (see Equation 3). It is important to note that although the form of S in Equation 2 would seem to imply that the w_i(α, r(α)) are independent, the constraint conditions in Equation 3 will cause them to be coupled to one another. The probability of an amino acid at a particular position may be obtained by summing over the probabilities of its rotamers: w_i(α) = Σ_r(α) w_i(α, r(α)). This formalism may be applied to different representations of protein structure. In many cases, it is useful to simplify the representation of each residue so as to eliminate explicit side chain degrees of freedom, thus yielding effective united-residue representations and energy functions often used in protein science (Miyazawa and Jernigan, 1996; Liwo et al., 1998). For such representations, each amino acid has effectively only one conformational state: w_i(α) = Σ_r(α) w_i(α, r(α)) = w_i(α, r(α)), and for such cases we may omit the variable indicating the rotamer state.
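As a minimal illustration of the variational idea, consider the simplest case in which the only constraints are normalization at one site and a fixed average of a one-body energy. Maximizing the entropy then gives Boltzmann-like weights, and the amino acid probability follows by summing over rotamers. The energies, state labels, function names and the multiplier `beta` below are hypothetical values chosen for illustration; they are not quantities from this work.

```python
import math
from collections import defaultdict

def site_probabilities(energies, beta):
    """Max-entropy probabilities for one site given one-body energies.

    energies: dict mapping (amino_acid, rotamer) -> energy (hypothetical values).
    beta: Lagrange multiplier conjugate to the mean-energy constraint.
    """
    # Shift by the minimum energy for numerical stability; it cancels on normalization.
    e_min = min(energies.values())
    weights = {s: math.exp(-beta * (e - e_min)) for s, e in energies.items()}
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}

def amino_acid_probabilities(w_site):
    """Sum rotamer probabilities: w_i(alpha) = sum over r of w_i(alpha, r)."""
    totals = defaultdict(float)
    for (aa, _rot), p in w_site.items():
        totals[aa] += p
    return dict(totals)

# Hypothetical one-body energies for three identity-rotamer states at one site.
example_energies = {("LEU", 0): -1.0, ("LEU", 1): -0.5, ("SER", 0): 0.2}
w = site_probabilities(example_energies, beta=0.5)
w_aa = amino_acid_probabilities(w)
```

With pair energies included, the sites become coupled and the probabilities must be solved for self-consistently; this sketch covers only the uncoupled case.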

Constraints and protein sequence energetics

The theory may accommodate a wide variety of constraints of the form in Equation 3. Each residue site i must be occupied, i.e. the probabilities w_i(α, r(α)) are normalized, which leads to:
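In the notation used above, a normalization condition of this kind can be written as follows. The expression is reconstructed from the surrounding definitions, and the equation number is inferred from the numbering of the neighboring equations.

```latex
\sum_{\alpha} \sum_{r(\alpha)} w_i(\alpha, r(\alpha)) = 1, \qquad i = 1, \ldots, N \qquad (4)
```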

The hydrophobic effect may be included as an effective ‘one-residue’ energy that is dependent upon the local density of β-carbons (Kono and Saven, 2001). The sum of these solvation or environment scores may also be constrained by specifying the average ‘environmental energy’ as summed over the positions in a particular target structure:

Other multi-body representations of the hydrophobic effect and effective pair interactions between residues may also be incorporated in the theory (Miyazawa and Jernigan, 1985).

For realistic, atom-based representations of proteins, design algorithms that focus primarily on optimizing inter-atomic interactions within the folded state have had substantial success (Hellinga and Richards, 1994; Dahiyat et al., 1997; Kortemme et al., 1998; Baker and DeGrado, 1999; Hellinga, 1999; Strop and Mayo, 1999; Bolon and Mayo, 2001; Looger et al., 2003). Such methods use atom-based potentials to account for both covalent and non-covalent interactions, e.g. van der Waals forces, hydrogen bonds, and electrostatic interactions. The statistical theory of sequence ensembles may be formulated so as to include such atom-based descriptions of both intra- and inter-molecular interactions.

For a protomer structure in an oligomeric complex, the intra-molecular energy E_f of a single chain depends upon the set of amino acid identities {α_1, ..., α_N} and the rotameric state of each of these amino acids r(α_i). E_f may be written as a sum of effective one-residue and two-residue interactions:

The indices i and j refer to residue positions in the structure, and the second term sums only over unique interactions between pairs of residues. The ‘one-residue’ energy γ_i^(1)(α_i, r(α_i)) is the energy associated with locating the amino acid α_i with conformation r(α_i) at site i within a single protomer structure. This energy is determined by side chain–backbone interactions or intrinsic structural tendencies of the amino acids, such as preferences for solvent exposure or secondary structure. Similarly, the pair energy γ_ij^(2)(α_i, r(α_i); α_j, r(α_j)) is the sequence-dependent interaction energy between residue i and residue j, each of which is a member of the same protein chain. Such interaction energies may be inferred from a database using reduced descriptions of amino acids (Miyazawa and Jernigan, 1996; Onuchic et al., 1997) or, using an atomically detailed model, as a sum over inter-atomic interactions of the identity-rotamer states at sites i and j, as specified by (α_i, r(α_i)) and (α_j, r(α_j)).
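The decomposition described here, one-residue terms plus a sum over unique residue pairs, can be sketched in a few lines. The function names and the toy energy terms are hypothetical and serve only to show the bookkeeping; real one-residue and pair energies would come from a database-derived or atom-based potential.

```python
from itertools import combinations

def folding_energy(states, one_body, pair):
    """E_f for one sequence: one-residue terms plus unique pair terms.

    states:   list of per-site states, e.g. (amino_acid, rotamer) tuples.
    one_body: function (i, state) -> energy of placing `state` at site i.
    pair:     function (i, state_i, j, state_j) -> interaction energy.
    """
    e_one = sum(one_body(i, s) for i, s in enumerate(states))
    # combinations() visits each pair i < j exactly once.
    e_two = sum(pair(i, states[i], j, states[j])
                for i, j in combinations(range(len(states)), 2))
    return e_one + e_two

# Toy (hypothetical) energy terms for a three-site chain.
def toy_one_body(i, s):
    return -1.0 if s[0] == "LEU" else 0.0

def toy_pair(i, si, j, sj):
    return -0.5 if si[0] == sj[0] else 0.25

toy_states = [("LEU", 0), ("LEU", 1), ("SER", 0)]
e_f = folding_energy(toy_states, toy_one_body, toy_pair)
```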

Similarly, the intermolecular association energy E_a is that due to the intermolecular interactions among sites in a complex of M protein chains, where chains are labeled by the indices m and n. The association energy of the complex E_a may be written in terms of the identity-rotamer states of each site in the complex:

where ε_i^(1)(α_{im}, r(α_{im})) is the additional effective energy associated with locating the amino acid α_{im} with conformation r(α_{im}) at site i on chain m in the complex. For example, a residue that is exposed to solvent in the folded structure of an isolated protomer may become sequestered from solvent or interact with the backbone of an adjacent protomer upon complex formation; ε_i^(1)(α_{im}, r(α_{im})) quantifies the effective change in energy upon association. For given values of the identities and rotamers at sites i and j on two different protomers, the pair interaction ε_ij^(2)(α_{im}, r(α_{im}); α_{jn}, r(α_{jn})) is the sequence-dependent intermolecular interaction energy between residue i on chain m and residue j on chain n.

Equation 7 may be applied to any given complex, including hetero-dimers and other proteins having asymmetric quaternary structure. For symmetric complexes comprising identical chains, however, symmetry implies that equivalent positions on each chain have the same identity and conformational state. This symmetry assumption may be expressed as follows:

1. The amino acid identities at equivalent sites on each chain are the same: α_{im} = α_i for each protomer m = 1, ..., M.

2. Side chains at equivalent sites on different chains take on equivalent rotamer states: r(α_{im}) = r(α_i) for each m = 1, ..., M.

This assumption is introduced to expedite the design process, though it is perhaps overly stringent, particularly with regard to side chain conformation. For example, distant, exposed residues of a protein need not always have precisely the same conformation in solution. For systems with only one state per site, e.g. the lattice model discussed in the Applications and results section (‘Lattice model’), this symmetry assumption is exact.

Within this symmetry assumption, Equation 7 may be written in a form analogous to Equation 6, involving effective interactions between sites on a single chain:

where

We note that E_a now has a form exactly analogous to E_f. Each residue ‘sees’ other residues on the same chain and ‘images’ of itself and other residues via intermolecular interactions with other chains.
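The ‘image’ picture can be checked numerically on a toy complex. The sketch below is one possible bookkeeping under the symmetry assumption, not the specific effective interactions defined in the text: inter-chain contacts are collapsed onto site indices of a single chain, and the resulting energy is compared with an explicit sum over the whole complex. The contact list, the function names and the HP-style pair energies are hypothetical.

```python
from collections import Counter

def association_energy_full(chain_seqs, contacts, eps):
    """Sum inter-chain pair energies over explicit contacts in the complex.

    contacts: iterable of (m, i, n, j), site i on chain m touching site j on chain n.
    """
    return sum(eps(chain_seqs[m][i], chain_seqs[n][j]) for m, i, n, j in contacts)

def fold_contacts(contacts):
    """Collapse chain labels: count how often sites (i, j) meet across chains."""
    return Counter((i, j) for _m, i, _n, j in contacts)

def association_energy_symmetric(seq, folded, eps):
    """Same energy using one chain's sequence and the folded contact counts."""
    return sum(count * eps(seq[i], seq[j]) for (i, j), count in folded.items())

# Hypothetical contact list for four identical three-site chains.
toy_contacts = [(0, 0, 1, 2), (1, 0, 0, 2), (0, 1, 1, 1), (2, 0, 3, 2), (3, 0, 2, 2)]

def toy_eps(a, b):
    return {"HH": -3.0, "HP": -1.0, "PH": -1.0, "PP": 0.0}[a + b]

toy_seq = "HPH"
full = association_energy_full([toy_seq] * 4, toy_contacts, toy_eps)
sym = association_energy_symmetric(toy_seq, fold_contacts(toy_contacts), toy_eps)
```

Because every chain carries the same sequence, the two sums agree term by term; only the single-chain variables need to be stored.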

For a given set of constraint conditions, the values of E_f and E_a are assumed to be well represented by their sequence averages (Zou and Saven, 2000; Kono and Saven, 2001). With this approximation and the factorization implicit in Equation 2, E_f and E_a may be expressed as functions of the site probabilities w_i(α, r(α)):

Note that we sum over amino acids at each site and no longer consider just a single sequence as would be indicated by the set {α_i}. The E_f and E_a are now each functions of the site-specific probabilities w_i(α, r(α)) and may now appear as constraints of the form suggested in Equation 3. With the symmetry assumption, the number of variables decreases dramatically relative to a calculation that treats each site i in the complex separately, i.e. a calculation where N is the total number of residues in the complex rather than in a single protomer. For symmetric structures, only the w_i(α, r(α)) of residues on a single chain need be determined. There are typically N × m such independent variables, where m is the number of allowed identity-rotamer states at each site and N is the number of residues per protomer.
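A sequence-averaged energy of this kind can be sketched as follows, assuming that pair probabilities factorize into products of one-body probabilities. The function names and the two-site HP example are hypothetical and chosen only so that the result can be checked by hand.

```python
from itertools import combinations

def average_energy(w, one_body, pair):
    """Sequence-averaged energy with independent site probabilities.

    w: list over sites of dicts mapping state -> probability.
    """
    e_one = sum(p * one_body(i, s) for i, wi in enumerate(w) for s, p in wi.items())
    e_two = 0.0
    for i, j in combinations(range(len(w)), 2):
        for si, pi in w[i].items():
            for sj, pj in w[j].items():
                # Pair probability approximated as the product w_i * w_j.
                e_two += pi * pj * pair(i, si, j, sj)
    return e_one + e_two

def hp_one(i, s):
    return 0.0

def hp_pair(i, si, j, sj):
    return {"HH": -3.0, "HP": -1.0, "PH": -1.0, "PP": 0.0}[si + sj]

# Two sites, each H or P with probability 1/2 (hypothetical values).
w_uniform = [{"H": 0.5, "P": 0.5}, {"H": 0.5, "P": 0.5}]
e_avg = average_energy(w_uniform, hp_one, hp_pair)
# Delta-function probabilities recover the energy of a single sequence.
e_hh = average_energy([{"H": 1.0}, {"H": 1.0}], hp_one, hp_pair)
```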

Other energetic constraints may be imposed on the sequences as well. For simplified representations of the amino acids, in designing protein sequences it is often important to account for non-target structures that a sequence may also acquire. For a foldable sequence, the target state should be energetically removed from these other possible collapsed structures. There are many ways to achieve this (Saven, 2001), but perhaps the simplest is to optimize a stability gap Δ_f = E_f − ⟨E_f⟩_u, where ⟨E_f⟩_u is an average over an appropriate ensemble of unfolded structures. Similarly, we may also define a ‘binding stability gap’ Δ_b = E_a − ⟨E_a⟩_b, where ⟨E_a⟩_b represents an energy average over configurations of the complex other than the target quaternary structure. Herein, these alternate configurations involve different relative orientations of the individually folded protomers. The nature of the configurational averaging ⟨...⟩_b depends on the particular model and may be impractical for atomically detailed models due to both the large number of unfolded conformations and the large numbers of identity-rotamer states. For contact-type energy functions, however, this averaging is straightforward (Zou and Saven, 2000), and an example is presented in the next section.
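A gap of this kind is simple to evaluate once the target energy and the energies of the competing states are known. The sketch below assumes a uniform (unweighted) average over the competing ensemble, which is an assumption of this example; the function name and the numbers are hypothetical.

```python
def stability_gap(target_energy, competing_energies):
    """Gap between the target-state energy and the mean energy of competing states.

    Negative values mean the target lies below the average competing state.
    """
    mean_competing = sum(competing_energies) / len(competing_energies)
    return target_energy - mean_competing

# Hypothetical energies in arbitrary units.
gap = stability_gap(-10.0, [-4.0, -6.0, -2.0])
```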


    Applications and results
 
The statistical theory of homo-oligomeric protein complexes is applied to two systems: a lattice model and a realistic atom-based representation of a protein. The lattice model may be solved exactly via explicit enumeration and is used both to test the accuracy of the theory and to exemplify how sequence space may be resolved in terms of the folding and association energetics. An atomically detailed representation of the protein is used to test and illustrate the symmetry assumption and show how the method may be applied to realistic representations of protein structures.

Lattice model

The model consists of a 27-residue self-avoiding polymer whose monomers occupy sites on a cubic lattice (Lau and Dill, 1990; Shakhnovich and Gutin, 1990; Leopold et al., 1992). Two types of monomers, hydrophobic (H) and polar (P), are used to construct sequences. Each of these residue types has a single conformational state: w_i(α, r(α)) = w_i(α), where α = H or P. The 2^27 possible sequences may be exhaustively enumerated. As a result, this model may be ‘solved’ exactly, and the exact results may be compared with those of the theory. A simplified potential developed for HP-type monomers (Li et al., 1996; Zou and Saven, 2000) is used that contains only two-body interactions; these interactions make the calculations non-trivial. The two-body interactions are non-zero only if the amino acids are nearest neighbors on the lattice but are not bonded to one another. Let r_ij be the distance between residues i and j and let r_0 be the distance between nearest neighbors on the lattice. Then the contact parameter σ_ij is non-zero only if r_ij ≤ r_0, for which σ_ij = 1. The contact variable for intermolecular interactions appears as σ_{ijmn}, which is unity only if residues i and j of chains m and n, respectively, are in contact with one another and σ_{ijmn} = 0 otherwise. For this model, the folded state energy and the association energy then take the following form (see Equations 12 and 13):

The same energy function is used for both folding and association: γ^(2)(α; α′) = ε^(2)(α; α′). The contact energies are chosen as (Li et al., 1996; Zou and Saven, 2000): γ^(2)(H_i, H_j) = −3ε, γ^(2)(H_i, P_j) = γ^(2)(P_i, H_j) = −ε, and γ^(2)(P_i, P_j) = 0. Equating γ^(2)(α, α′) = ε^(2)(α, α′) corresponds to treating inter-residue interactions in the same fashion for both intra-molecular (folding) and inter-molecular (specific oligomerization) organization. As necessary, distinct potentials for intra-molecular folding and for inter-molecular association may be used.
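The contact energy of an HP chain on a lattice can be sketched as follows, using the contact energies quoted above in units of ε. The four-residue square conformation and the function name are hypothetical and much smaller than the 27-mer studied here; they are chosen so that the single non-bonded contact can be verified by inspection.

```python
from itertools import combinations

# Contact energies in units of epsilon, as quoted in the text.
CONTACT_ENERGY = {("H", "H"): -3, ("H", "P"): -1, ("P", "H"): -1, ("P", "P"): 0}

def lattice_energy(sequence, coords):
    """Intra-chain contact energy of an HP sequence on a cubic lattice."""
    energy = 0
    for i, j in combinations(range(len(sequence)), 2):
        if j - i == 1:
            continue  # bonded neighbours along the chain do not count as contacts
        # On a unit lattice, nearest neighbours are at Manhattan distance 1.
        dist = sum(abs(a - b) for a, b in zip(coords[i], coords[j]))
        if dist == 1:
            energy += CONTACT_ENERGY[(sequence[i], sequence[j])]
    return energy

# Hypothetical 4-residue chain folded into a square; residues 0 and 3 are in contact.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
e_hpph = lattice_energy("HPPH", square)
```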

The energetic constraints in this lattice model study are the folding stability gap (Δ_f) and binding stability gap (Δ_b). For such folding criteria, ensembles of unfolded and mis-associated states are necessary. The choice of folded structure is arbitrary, and a structure that is the conformational ground state for a large number of sequences is chosen in this study (Li et al., 1996). The remaining 103 345 compact, cubic structures of the 27-mer are chosen as the ensemble of unfolded states in the calculation of ⟨E_f⟩_u. The target structures of the complex are those depicted in Figure 1. As an ensemble of mis-associated states for the lattice proteins, we consider arrangements of the oligomer for which each chain takes on the target folded structure and the interface between two protomers has nine contacts, i.e. the faces of the protomers are in registry with one another (Figure 1). This loosely mimics the expectation that the mis-associated states of a particular complex most likely to compete with the target structure are those involving large numbers of residues in inter-molecular contact. For the dimer, there are 84 unique configurations of the two protomers.



 
Fig. 1. A lattice model homo-dimer. Two identical 3 × 3 × 3 lattice ‘protomers’ share a common binding surface. Each numbered bead represents a simplified residue. The links between two beads represent effective chemical bonds.

 
Calculations were performed for the homo-dimer depicted in Figure 1. Using this structure, the site-specific amino acid probabilities were determined using entropy optimization for prescribed values of Δ_f and Δ_b. The entropy surface S(Δ_f, Δ_b) is presented in Figure 2, where S has been normalized such that ∫ dΔ_f dΔ_b exp(S(Δ_f, Δ_b)) = 2^27 (see Zou and Saven, 2000). The exact results in Figure 2 are determined using S = ln Ω(Δ_f, Δ_b), where Ω is the number of sequences having common values of Δ_f and Δ_b. In comparing the entropy surfaces, we note that there is excellent agreement of the theory with the exact results. Interestingly, those sequences that are optimal with regard to Δ_f are not those that are optimal with respect to Δ_b and vice versa. Sequences having low values of Δ_f are those for which the target structure is well removed from other competing folds. Such sequences have largely exposed hydrophilic groups. For this model energy function, such sequences have unfavorable values of the binding gap energy, Δ_b ≈ 0.
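The exact side of this comparison, S = ln Ω obtained by counting sequences, can be illustrated on a system small enough to enumerate instantly. The sketch below bins sequences by a single toy energy rather than by the pair (Δ_f, Δ_b), and the toy energy function and names are hypothetical; the same counting applies to any quantity computed per sequence.

```python
import math
from collections import Counter
from itertools import product

def density_of_sequences(n_sites, energy_of):
    """Count HP sequences by energy and return (S, Omega) with S(E) = ln Omega(E)."""
    omega = Counter(energy_of("".join(seq)) for seq in product("HP", repeat=n_sites))
    return {e: math.log(count) for e, count in omega.items()}, omega

# Hypothetical toy energy: one contact between the first and last residue,
# scored with the HP contact energies (in units of epsilon).
def toy_energy(seq):
    return {"HH": -3, "HP": -1, "PH": -1, "PP": 0}[seq[0] + seq[-1]]

entropy, omega = density_of_sequences(4, toy_energy)
total_sequences = sum(math.exp(s) for s in entropy.values())
```

Summing exp(S) over all bins recovers the total number of sequences, here 2^4, which is the discrete analogue of the normalization quoted above.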



 
Fig. 2. Sequence entropy surface spanning Δ_b and Δ_f for the lattice model homo-dimer structure. (a) Theoretical results. (b) Results from the exact enumeration of all 2^27 sequences.

 
The polar probabilities of each residue w_i(P) as obtained by theoretical calculation and enumeration are presented in Figure 3. There is excellent agreement. Although only probabilities calculated at Δ_f = −14ε and Δ_b = −5ε are shown for this illustration, excellent agreement is also obtained at other values of these energies (data not shown). For the sequences considered, Δ_f < 0 and Δ_b < 0. Within the context of this model, such sequences are likely to form stable isolated structures and/or stable complexes, i.e. sequences for which both the folded state of each chain and the oligomer structure are of lower energy than competing collapsed and mis-associated states.



 
Fig. 3. Probability distribution of lattice model homo-dimer, comparing theoretical results (black bars) and enumeration results (open bars).

 
Atomically detailed model

For realistic representations of proteins, an all-atom model is used that includes side chain conformational degrees of freedom. Side chains assume discrete conformations, or rotamers (Ponder and Richards, 1987; Dunbrack and Cohen, 1997; Tuffery et al., 1997). Here we use the backbone-dependent rotamer library of Dunbrack and coworkers (Dunbrack and Cohen, 1997). The Amber force field (Weiner et al., 1984) is used to calculate non-bonded interactions. This same potential is used to quantify both the intra- and inter-molecular energies. To address the hydrophobic effect, the ‘environmental energy’ (Equation 5) for the complex is constrained to the value of the wild-type sequence (Kono and Saven, 2001). The site-specific probabilities are determined for multiple (constrained) values of E = E_f + E_a, and conjugate to this energy is an effective inverse temperature β. In what follows, the probabilities are presented at β = 0.5 mol/kcal, which is also the effective temperature at which (unfolded) reference energies of the amino acids are determined, as described in Calhoun et al. (2003).

We select the tetramerization domain of the p53 tumor suppressor (PDB code: 1C26) as the target structure (Figure 4). This transcription factor is involved in a number of important physiological roles, including regulation of the cell cycle, apoptosis, DNA repair and angiogenesis, and mutations of p53 have been linked to cancers (Vogelstein et al., 2000). The domain considered here is a tetramer with four identical 32-residue chains. In this study, all 20 amino acid residues are allowed at each position. Thus, a total of 20^32 ≈ 4 × 10^41 sequences are possible for a single chain. Taking side chain conformations into account, there are a total of 320 states per residue position. In the absence of the symmetry assumption, there are 128 variable positions and 320^128 possible states. The imposition of the symmetry constraint reduces this complexity to 32 variable positions and 320^32 possible states; the number of independent site probabilities to be determined is reduced from 128 × 320 to 32 × 320.
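The counts quoted in this paragraph follow from simple arithmetic, which the short sketch below reproduces with exact integers. The variable names are illustrative; the numbers are those given in the text.

```python
n_amino_acids = 20
residues_per_chain = 32
chains = 4
states_per_site = 320  # identity-rotamer states per position, as quoted in the text

# Number of amino acid sequences for a single chain.
sequences_per_chain = n_amino_acids ** residues_per_chain

# Number of site probabilities to determine, without and with the symmetry assumption.
vars_without_symmetry = chains * residues_per_chain * states_per_site
vars_with_symmetry = residues_per_chain * states_per_site

print(f"{sequences_per_chain:.2e}")               # 4.29e+41
print(vars_without_symmetry, vars_with_symmetry)  # 40960 10240
```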



 
Fig. 4. p53 tumor suppressor tetramerization domain. Residues shown in space-filling mode: (1) F328 from chain A and F338 from chain C, which have stacking phenyl rings; (2) E343 from chain C and K351 from chain B, which form an intermolecular salt bridge; (3) R337 from chain D and D352 from chain B, which form a second intermolecular salt bridge.

 
Calculations are performed both with and without the symmetry assumption, permitting full sequence variability at each site in the tetramer. Without the symmetry assumption, the four chains are found to have identical site-specific probabilities at equivalent positions. This is expected, since the structures are identical. In terms of both memory usage and computer time, the gain in efficiency scales as M^2 for a symmetrical structure with M subunits. For the p53 tetramer (M^2 = 16), imposition of the symmetry assumption reduces the computation time by a factor of 10. For larger oligomers, we would expect even greater savings in computer time and memory usage; for infinite symmetries, e.g. crystals, the assumption makes calculations on such systems feasible. Comparing the results with and without the symmetry assumption, we find excellent agreement (Figure 5).



 
Fig. 5. Amino acid probabilities at selected sites for the p53 tetramerization domain. Results shown are those calculated without (open bars) and with (black bars) the symmetry assumption. Residue numbers and wild-type amino acids are indicated on each panel.

 
In many cases, the site-specific probabilities are in agreement with the wild-type amino acid residues. For the well structured, stable tetrameric structure, it is not unreasonable to expect that wild-type residues are among the more probable at many sites. This finding is also in harmony with previous design studies that have identified sequences having considerable wild-type character (Kuhlman and Baker, 2000Go; Raha et al., 2000Go). The wild-type amino acid is the most probable at 31.3% (10/32) of the 32 positions (E326, L330, I332, G334, E336, E346, A347, E349, Q354 and G356, where the letters denote the wild-type amino acids). For example, see site L330 in Figure 5. At eight of the remaining positions (G325, R333, E339, F341, E343, L344, L348 and A353), the wild-type amino acids are among the five most probable (for example, see sites 333 and 341 in Figure 5). Sites Y327, T329, Q331, R335, R342, D352 and A355 tolerate many different amino acids, including the wild-type. Interestingly, these 25 sites also include interfacial positions where protomers come into contact, such as L330, I332, F341, L344 and A347. Nonetheless, there are seven positions that have calculated probabilities significantly different from the wild-type amino acids. Although for residues M340, N345 and L350 the wild-type identities are of much lower probability than the calculated most probable residues, the most probable amino acids maintain hydrophobic properties and suggest the following mutations: M340I, N345Y and L350V. At sites F328 and F338, the wild-type residues assume side chain conformations in which the side chains are in van der Waals contact and this arrangement cannot be recovered with the rotamer library used (Figure 4). This does, however, suggest mutations that may reduce side chain ‘strain’: F328H and F338T. 
In the wild-type, two intermolecular salt bridges, R337–D352 and K351–E343, are formed (Figure 4). R337 and K351 are buried in the complex and are replaced by hydrophobic amino acids (R337V and K351V). The other two residues involved in the salt bridges, D352 and E343, are partially exposed, and in both cases the most probable residue is the neutral, hydrophilic amino acid Q (D352Q and E343Q). The seven suggested mutations may increase the stability of this p53 tetramer, but this may come at the expense of function, since no information about the binding properties other than oligomerization was included in the calculations.
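The site-by-site comparison above amounts to ranking the wild-type residue within each calculated probability distribution. A minimal sketch of such a ranking is given below; the function names, the `top_n` cutoff and the probability values are hypothetical and serve only as illustration (they are not the calculated p53 values), while the tallies are the site counts quoted in the text.

```python
from typing import Dict


def wild_type_rank(probs: Dict[str, float], wild_type: str) -> int:
    """1-based rank of the wild-type residue when amino acids are
    sorted by decreasing probability (ties keep input order)."""
    ordered = sorted(probs, key=probs.get, reverse=True)
    return ordered.index(wild_type) + 1


def classify_site(probs: Dict[str, float], wild_type: str, top_n: int = 5) -> str:
    """Assign a site to one of three illustrative categories."""
    rank = wild_type_rank(probs, wild_type)
    if rank == 1:
        return "most probable"
    if rank <= top_n:
        return "among top %d" % top_n
    return "lower"


# Hypothetical distribution for a single site (NOT taken from the paper).
example_probs = {"L": 0.45, "I": 0.25, "V": 0.15, "M": 0.10, "F": 0.05}
example_class = classify_site(example_probs, "L")

# Site counts quoted in the text for the 32 designed positions.
counts = {
    "wild-type most probable": 10,
    "wild-type among five most probable": 8,
    "many amino acids tolerated": 7,
    "differs from wild-type": 7,
}
total_sites = sum(counts.values())
fraction_most_probable = counts["wild-type most probable"] / total_sites
```

With the counts from the text, the four categories account for all 32 positions, and the first category corresponds to the fraction 10/32 quoted above.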


    Summary
 
Herein we have provided a formalism for the probabilistic design of oligomeric protein structures. The theory quantitatively recovers the exact results obtained from a simple lattice model of proteins. The application to an atomically detailed representation of the protein p53 tetramerization domain reveals the utility and accuracy of the symmetry assumption and also recovers many of the sequence properties observed in the wild-type structure. The probabilities obtained from the theory may be used either iteratively in protein design, where increasing numbers of residues are specified with each round of the calculation, or to guide a biased search for low-energy sequences (Zou and Saven, 2003). These probabilities may also be used to specify the composition of a protein combinatorial library. Such computational methodologies will be useful in designing large, oligomeric protein complexes or multi-dimensional arrays of proteins for biomaterials applications.
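As one illustration of how site-specific probabilities might specify the composition of a combinatorial library, the sketch below keeps, at each site, the amino acids whose probability meets a cutoff. The cutoff rule, the function names and the probability values are assumptions made for this example only; the paper does not prescribe a particular selection procedure.

```python
from math import prod
from typing import Dict, List


def library_composition(
    site_probs: Dict[int, Dict[str, float]], cutoff: float = 0.2
) -> Dict[int, List[str]]:
    """For each site, keep amino acids with probability >= cutoff
    (hypothetical rule), ordered by decreasing probability."""
    library: Dict[int, List[str]] = {}
    for site, probs in site_probs.items():
        allowed = sorted(
            (aa for aa, p in probs.items() if p >= cutoff),
            key=lambda aa: -probs[aa],
        )
        if not allowed:
            # Fall back to the single most probable amino acid.
            allowed = [max(probs, key=probs.get)]
        library[site] = allowed
    return library


def library_size(library: Dict[int, List[str]]) -> int:
    """Number of distinct sequences encoded by the library."""
    return prod(len(choices) for choices in library.values())


# Hypothetical probabilities for three sites (NOT the calculated p53 values).
hypothetical_probs = {
    330: {"L": 0.6, "I": 0.3, "V": 0.1},
    333: {"K": 0.25, "R": 0.25, "Q": 0.25, "E": 0.25},
    340: {"I": 0.9, "M": 0.1},
}
library = library_composition(hypothetical_probs, cutoff=0.2)
size = library_size(library)
```

A stricter cutoff yields a smaller library concentrated on the most probable residues, whereas a looser cutoff admits more diversity at the cost of a larger sequence space to screen.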


    Acknowledgements
 
We acknowledge support from the NSF (CHE 99-84752 and DMR 00-79909). J.G.S. is a Cottrell Scholar of Research Corporation and an Arnold and Mabel Beckman Foundation Young Investigator.


    References
 
Baker,D. and DeGrado,W.F. (1999) Curr. Opin. Struct. Biol., 9, 485–486.

Bolon,D.N. and Mayo,S.L. (2001) Proc. Natl Acad. Sci. USA, 98, 14274–14279.

Bryson,J.W., Betz,S.F., Lu,H.S., Suich,D.J., Zhou,H.X., O’Neil,K.T. and DeGrado,W.F. (1995) Science, 270, 935–941.

Bryson,J.W., Desjarlais,J.R., Handel,T.M. and DeGrado,W.F. (1998) Protein Sci., 7, 1404–1414.

Calhoun,J., Kono,H., Lahr,S., Wang,W., DeGrado,W.F. and Saven,J.G. (2003) J. Mol. Biol., 334, 1101–1115.

Dahiyat,B.I. and Mayo,S.L. (1997) Science, 278, 82–87.

Dahiyat,B.I., Sarisky,C.A. and Mayo,S.L. (1997) J. Mol. Biol., 273, 789–796.

DeGrado,W.F., Summa,C.M., Pavone,V., Nastri,F. and Lombardi,A. (1999) Annu. Rev. Biochem., 68, 779–819.

Desjarlais,J.R. and Clarke,N.D. (1998) Curr. Opin. Struct. Biol., 8, 471–475.

Desjarlais,J.R. and Handel,T.M. (1995) Protein Sci., 4, 2006–2018.

Desjarlais,J.R. and Handel,T.M. (1999) J. Mol. Biol., 290, 305–318.

Dunbrack,R.L. (2002) Curr. Opin. Struct. Biol., 12, 431–440.

Dunbrack,R.L.J. and Cohen,F.E. (1997) Protein Sci., 6, 1661–1681.

Harbury,P.B., Plecs,J.J., Tidor,B., Alber,T. and Kim,P.S. (1998) Science, 282, 1462–1467.

Hellinga,H. (1999) FASEB J., 13, A1430.

Hellinga,H.W. and Richards,F.M. (1994) Proc. Natl Acad. Sci. USA, 91, 5803–5807.

Johnson,E.C., Lazar,G.A., Desjarlais,J.R. and Handel,T.M. (1999) Struct. Fold. Design, 7, 967–976.

Jones,D.T. (1994) Protein Sci., 3, 567–574.

Keating,A.E., Malashkevich,V.N., Tidor,B. and Kim,P.S. (2001) Proc. Natl Acad. Sci. USA, 98, 14825–14830.

Koehl,P. and Delarue,M. (1996) Curr. Opin. Struct. Biol., 6, 222–226.

Kono,H. and Saven,J.G. (2001) J. Mol. Biol., 306, 607–628.

Kortemme,T., Ramirez-Alvarado,M. and Serrano,L. (1998) Science, 281, 253–256.

Kraemer-Pecore,C.M., Wollacott,A.M. and Desjarlais,J.R. (2001) Curr. Opin. Chem. Biol., 5, 690–695.

Kuhlman,B.K. and Baker,D. (2000) Proc. Natl Acad. Sci. USA, 97, 10383–10388.

Lau,K.F. and Dill,K.A. (1990) Proc. Natl Acad. Sci. USA, 87, 638–642.

Leopold,P.E., Montal,M. and Onuchic,J.N. (1992) Proc. Natl Acad. Sci. USA, 89, 8721–8725.

Li,H., Helling,R., Tang,C. and Wingreen,N. (1996) Science, 273, 666–669.

Liwo,A., Kazmierkiewicz,R., Czaplewski,C., Groth,M., Oldziej,S., Wawak,R.J., Rackovsky,S., Pincus,M.R. and Scheraga,H.A. (1998) J. Comput. Chem., 19, 259–276.

Looger,L.L., Dwyer,M.A., Smith,J.J. and Hellinga,H.W. (2003) Nature, 423, 185–190.

Malakauskas,S.M. and Mayo,S.L. (1998) Nat. Struct. Biol., 5, 470–475.

McQuarrie,D.A. (1976) Statistical Mechanics. Harper and Row, New York.

Miyazawa,S. and Jernigan,R.L. (1985) Macromolecules, 18, 534–552.

Miyazawa,S. and Jernigan,R.L. (1996) J. Mol. Biol., 256, 623–644.

Onuchic,J.N., Luthey-Schulten,Z. and Wolynes,P.G. (1997) Annu. Rev. Phys. Chem., 48, 539–594.

Ponder,J.W. and Richards,F.M. (1987) J. Mol. Biol., 193, 775–791.

Raha,K., Wollacott,A.M., Italia,M.J. and Desjarlais,J.R. (2000) Protein Sci., 9, 1106–1119.

Saven,J.G. (2001) Chem. Rev., 101, 3113–3130.

Saven,J.G. and Wolynes,P.G. (1997) J. Phys. Chem. B, 101, 8375–8389.

Shakhnovich,E. and Gutin,A. (1990) J. Chem. Phys., 93, 5967–5971.

Shakhnovich,E.I. and Gutin,A.M. (1993) Protein Eng., 6, 793–800.

Strop,P. and Mayo,S.L. (1999) J. Am. Chem. Soc., 121, 2341–2345.

Tuffery,P., Etchebest,C. and Hazout,S. (1997) Protein Eng., 10, 361–373.

Vogelstein,B., Lane,D. and Levine,A.J. (2000) Nature, 408, 307–310.

Weiner,S.J., Kollman,P.A., Case,D.A., Singh,U.C., Ghio,C., Alagona,G., Profeta,S.,Jr and Weiner,P. (1984) J. Am. Chem. Soc., 106, 765–784.

Zou,J. and Saven,J.G. (2000) J. Mol. Biol., 296, 281–294.

Zou,J. and Saven,J.G. (2003) J. Chem. Phys., 118, 3843–3854.

Received August 20, 2003; revised October 24, 2003; accepted October 28, 2003




