From the Department of Chemistry, Princeton University, Princeton, New Jersey 08544-1009
Protein design ultimately comes down to choosing
amino acid sequences. In contrast to traditional studies of protein
folding and function, the choice is not limited to sequences that
nature has provided. Advances in molecular biology and synthetic
chemistry have made it possible to produce virtually any sequence of
amino acids. The number of possible sequences from which to choose is enormous, and even for a small design, one cannot sample all
possibilities. For example, for a relatively short sequence of 100 residues composed of the 20 naturally occurring amino acids, there are
20^100 possibilities. This number is so large
(20^100 > 10^130) that if one synthesized a
single molecule of each sequence and put the entire collection into a
box, the resulting box would be larger than Avogadro's number of
universes.
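The size of this sequence space is easy to check numerically. The following Python sketch (the function name `log10_sequences` is introduced here for illustration only) computes the base-10 logarithm of the number of possible sequences and confirms the inequality quoted above with exact integer arithmetic.

```python
import math


def log10_sequences(length, alphabet=20):
    """Return log10 of the number of sequences of the given length."""
    return length * math.log10(alphabet)


# 100 residues drawn from the 20 naturally occurring amino acids.
exponent = log10_sequences(100)
print(f"20^100 is about 10^{exponent:.1f}")  # prints: 20^100 is about 10^130.1

# Python integers are arbitrary precision, so the comparison is exact.
print(20**100 > 10**130)  # prints: True
```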
Among this astronomical number of possible sequences, the majority will
not fold into soluble, globular structures. Indeed, open reading frames
constructed of randomly chosen sequences typically produce insoluble
material (1-3). Consequently, among the 20^100 possible
sequences, only a small fraction can fold into soluble, globular
proteins. Thus, the box of "good sequences" is far smaller than
Avogadro's number of universes. Nevertheless, even this smaller box is
astronomical in size. Therefore, to enhance the likelihood of success
(and reduce the level of frustration), designs must be confined to
desirable neighborhoods of sequence space.
What properties define the regions of sequence space that favor
soluble, well folded globular structures? Not all amino acids in a sequence
contribute equally to the formation of a native structure.
Indeed, the tolerance of natural proteins to a variety of amino acid
substitutions (4-6) demonstrates that some features are far more
important than others. To successfully design proteins de
novo, which features are crucial?
Binary Patterning of Polar and Nonpolar Amino Acids
The global features likely to be most important for designing
novel proteins can be inferred from an examination of natural proteins.
Such examination reveals two universal themes. First, globular proteins
fold into structures that maximize burial of hydrophobic side chains
while simultaneously exposing hydrophilic side chains to solvent.
Second, these structures typically contain an abundance of secondary
structure such as
Sequences capable of forming regular secondary structure while
simultaneously burying hydrophobic side chains (and exposing hydrophilic ones) can be designed by patterning the sequence
periodicity of polar and nonpolar residues to match the structural
periodicity of the desired secondary structure. For example, to design
Kamtekar et al. (9) proposed that such patterning of polar
and nonpolar residues might serve as the cornerstone for initial stages
of de novo protein design. According to this proposal, only
the sequence locations of polar and nonpolar residues must be specified
explicitly. The precise identities of the polar and nonpolar residues
need not be constrained and can be varied extensively. This proposal
gave rise to a design strategy based on a "binary code," which
specifies only whether a given residue is polar or nonpolar. The
initial test of the binary code strategy focused on the design of
four-helix bundles (9). A combinatorial library of de novo
sequences was constructed and expressed. All sequences shared the
identical pattern of polar and nonpolar residues. However, the
combinatorial underpinnings of the strategy yielded a distinct amino
acid sequence for each member of the collection. This is shown
schematically in Fig. 1. The collection of novel
proteins was expressed from a degenerate family of synthetic genes in
which polar amino acids (Lys, His, Glu, Gln, Asp, and Asn) were encoded by the degenerate DNA codon NAN, and nonpolar amino acids (Met, Leu,
Ile, Val, and Phe) were encoded by the degenerate codon NTN (where N
represents a mixture of A, G, T, and C). Application of this
combinatorial method to a sequence pattern containing 24 nonpolar
positions and 32 polar positions can yield >10^41
(i.e. 5^24 × 6^32) different amino
acid sequences. While this is obviously a very large number, it is
small compared with the number of sequences that would have been
possible if all 56 helical residues were completely randomized
(20^56 > 10^72). Limiting the library to
sequences that satisfy the binary code reduces the available sequence
space by ~31 orders of magnitude. This reduction in sequence space
vastly increased the likelihood of recovering well folded,
water-soluble proteins. Consequently, in contrast to the insoluble
material typically generated by totally random sequences (1, 2), the
majority of binary code sequences gave rise to proteins that were both
Natural proteins fold into structures with well packed hydrophobic
cores. However, the combinatorial basis of the binary code strategy
precludes rational design of specific packing interactions. Can the
binary code nonetheless generate novel proteins with native-like properties? Recent experiments in our laboratory indicate that proteins
with properties similar to those of natural proteins are indeed found
among an initial collection of binary code
Adherence to the binary patterning reduces the amount of sequence space
that must be explored and increases the likelihood of obtaining well
folded proteins. However, even this reduced sector of sequence space is
enormous, and not all sequences within this space will actually fold
into native-like structures. Indeed some do not fold at all (9).
Clearly, other features must also contribute to the stability of folded
structures. Within the reduced sector of sequence space defined by the
binary code, which additional features must be designed explicitly?
Intrinsic Propensities for Secondary Structure
In the 1970s, when the data base of known protein structures was 2 orders of magnitude smaller than it is today, it was already recognized
that the 20 amino acids had different statistical biases for one or
another type of secondary structure (10). In the ensuing years, the
intrinsic propensities that underlie these statistical biases have been
probed in model systems ranging from mutant proteins to synthetic
peptides and copolymers (e.g. Refs. 11-20). Overall, these
studies have demonstrated that the statistical trends observed in the
data base of known structures correlate quite well with the intrinsic
propensities for secondary structure determined from physical
measurements in model systems (16).
The magnitudes of the intrinsic propensities are typically modest.
Therefore, with the possible exception of proline and glycine, these
propensities can be overwhelmed by the context of an amino acid in the
overall sequence. For example, a given peptide sequence can form
different secondary structure in two different protein contexts (21,
22). A dramatic example of this context dependence was demonstrated by
Minor and Kim (23), who showed that an 11-amino acid sequence
(Ala-Trp-Thr-Val-Glu-Lys-Ala-Phe-Lys-Thr-Phe) formed an
Chain reversals connecting successive elements of secondary
structure give rise to the globular appearance of folded protein structures. However, correctly folded globular structures can also form
from sequences in which the normal connectivity of the protein has been
altered by cleavage (24) or by circular permutation (25-27). How
important are the sequences of turns in determining the overall
structure of a protein? Does the amino acid sequence of a turn dictate
the location of a chain reversal? If so, then it will be essential to
design turn sequences explicitly. Alternatively, are turns merely
default structures that occur between elements of secondary structure?
If this is the case, then the precise sequences of these stretches of
"molecular string" may not need to be designed a
priori.
Turns connecting
Interhelical turns have been studied in several different proteins.
Brunet et al. (28) analyzed the turn between the third and
fourth
Is the length of the interhelical region important? Must the length of
the turn be designed to disrupt the polar/nonpolar periodicity of the
helices bracketing the turn? Vlassi et al. (30) addressed
this question by inserting two extra residues into the interhelical
turn of Rop. This insertion causes the
Although the results with Rop and cytochrome
b562 demonstrate that particular interhelical
turn sequences are not essential for maintaining the structure of a
protein, turn sequences can affect stability. This was shown explicitly
by mutating Asp30 of Rop to each of the other 19 naturally
occurring amino acids. While all 19 variants folded and functioned
correctly, they had a range of stabilities (34). Work on variants of
cytochrome b562 yielded similar results
(35).
Recent research suggests that the turns between
The ability of a structure to tolerate different turn sequences will
depend, of course, on the overall stability of the protein. Zhou
et al. (37) mutated a five-residue turn connecting the last
two
Is it crucial that the chain reversals in novel proteins be designed
explicitly? While turn sequences can affect protein stability, effects
are likely to be small and dependent on context. If a design is
otherwise robust, particular turn sequences may not need to be designed
explicitly, especially in four-helix bundles. However, for the design
of more complex structures or for designs that are only moderately
stable, it is prudent to specify "good" turn sequences. Indeed,
careful attention to the design of turns has had dramatic effects for
two recent designs. The betabellin protein was stabilized by the
incorporation of non-natural D-amino acids to favor inverse
common (type I
Is good packing essential? Must it be explicitly designed a
priori? We suggest the answers to these questions are
(respectively): most certainly yes; and probably no.
The importance of good packing in natural proteins is evident both from
analyzing their structures and mutating their sequences. The
hydrophobic cores of wild-type structures are invariably well packed
with densities approaching those seen in crystals of small organic
molecules (40). Mutations that reduce packing density typically render
a protein less stable (e.g. Refs. 41-43), while those that
improve packing yield proteins with enhanced stability (44).
Packing also plays an important role in the structures of de
novo proteins. By using different arrangements of nonpolar
residues to redesign the hydrophobic core of Rop, Munson et
al. (45, 46) demonstrated that size, shape, and relative location
of side chains can specify both the stability and the "native-like" properties of a protein. Underpacking yielded proteins that were not
stable, whereas overpacking yielded structures that were stable but not
native-like.
The determinants of good packing are not merely size, and the effects
of altered packing can be quite dramatic. For example, Harbury et
al. (47) showed that a coiled-coil peptide with Ile in the "a"
positions and Leu in the "d" positions forms dimers, while the
analogous peptide in which these residues have been reversed forms
tetramers. In these and related peptides, the shape (rather than size
or hydrophobicity) of nonpolar side chains dictates whether peptides
associate to form two-, three-, or four-stranded structures (47).
Further work on coiled coils has shown that the drive toward good
packing can serve as the basis for engineering allostery. Variants of
the GCN4 coiled coil were designed to recruit small nonpolar molecules
from solution to fill a hole and thereby stabilize a three-stranded
relative to a two-stranded coiled coil (48).
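The heptad positions discussed above can be made concrete with a short sketch. The Python snippet below lays out the "a" and "d" core positions of a heptad repeat and records the two oligomerization outcomes reported in the text for Harbury et al. (47). The function name `core_layout`, the placeholder character used for the unspecified b, c, e, f, and g positions, and the lookup table are illustrative assumptions, not part of the original study.

```python
HEPTAD = "abcdefg"


def core_layout(n_heptads, a_residue, d_residue, filler="x"):
    """Place residues at the 'a' and 'd' positions; mark all others with a filler."""
    heptad = "".join(
        a_residue if pos == "a" else d_residue if pos == "d" else filler
        for pos in HEPTAD
    )
    return heptad * n_heptads


# Outcomes stated in the text for the two core arrangements (47),
# keyed by (residue at "a", residue at "d") in one-letter code.
REPORTED_STATE = {("I", "L"): "dimer", ("L", "I"): "tetramer"}

for (a_res, d_res), state in REPORTED_STATE.items():
    print(core_layout(2, a_res, d_res), state)
# prints: IxxLxxxIxxLxxx dimer
# prints: LxxIxxxLxxIxxx tetramer
```

The two layouts contain the same residues and differ only in which core position each one occupies, which is the point made in the text: shape and placement, rather than composition, set the oligomerization state.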
Packing is clearly important. Must it be designed a priori?
This question can be addressed using metaphors invoked by Bromberg and
Dill (49). They compare side chain packing to either a jigsaw puzzle
model with "lock-and-key fits in which there is specific pairwise
matching of complementary side chains" or to "a nuts and bolts
model in which side chains pack together without specificity" (see
Fig. 2). Both models ultimately yield good packing.
In the jigsaw puzzle model, complementarity must be designed
explicitly by the manufacturer of the puzzle (i.e. the
protein designer). In the nuts and bolts model, by contrast, nonspecific
forces (e.g. the size of the jar holding the nuts and bolts
or the hydrophobic effect driving the collapse of a polypeptide chain)
lead to compaction, and high packing density results from promiscuous
surfaces finding a way to nuzzle up against one another.
The nuts and bolts model is supported by two lines of evidence. First,
analysis of known protein structures showed that preferred interactions
among hydrophobic side chains are not observed (50). Based
on this observation, Behe et al. (50) suggested that
although packing is an indispensable prerequisite for the native
conformation, it does not serve as the causal agent for the
native conformation. Indeed, they conclude that "high packing
densities are readily attainable among clusters of the naturally
occurring hydrophobic amino acid residues." The second type of
evidence supporting the nuts and bolts model comes from mutagenesis
studies. For example, Axe et al. (51) reconstructed the
entire hydrophobic core of barnase with random nonpolar residues and
found that ~23% of the variants retained enzymatic activity.
Although the detailed structures of these proteins have not been
reported, it is clear that functioning enzymes can be isolated without
specifying the details of a jigsaw puzzle. Similarly, our own work
using the binary code to construct novel sequences has shown that
patterning of polar and nonpolar residues can yield compact
Merely designing favorable interactions in the folded state of a
protein is not sufficient to generate a unique structure. It is equally
important to design against competing alternatives. This is sometimes
described as "negative design" (52). Some examples of negative
design are fairly simple, such as the incorporation of a glycine or
proline to disfavor continuation of a helix and thereby enhance the
likelihood of a desired turn (see Fig. 1 and Refs. 52 and 53). Others
are more subtle. For example, inclusion of polar residues at key
positions throughout a sequence can disfavor "wrong" hydrophobic
cores and thereby favor the formation of the desired unique structure
(54). Experimental support for this suggestion comes from the work of
Raleigh et al. (55), who showed that incorporation of polar
residues at the interface between buried and exposed
Designing proteins necessitates choosing sequences. Do we
understand proteins well enough to make successful choices? If the goal
is to define regions of sequence space that yield stable, water-soluble,
Nonetheless, many challenges remain;
α-helices and
β-sheets. The prevalence of these
two features among natural proteins suggests that they play a crucial
role in defining regions of sequence space that are most likely to
yield soluble, well folded de novo proteins.
α-helical segments, the periodicity of polar and nonpolar residues
would approximate a repeat of 3.6 residues/turn. In contrast, designed
β-strand segments would be composed of sequences with alternating polar and nonpolar residues (7, 8).
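The two periodicities can be illustrated with a minimal Python sketch that writes out binary patterns ("N" for nonpolar, "P" for polar) for a helix and for a strand. The 3.6 residues/turn repeat and the strict alternation come from the text; the 150° width of the nonpolar face, the function names, and the phase convention are assumptions made only for this example.

```python
def helix_pattern(n, residues_per_turn=3.6, nonpolar_arc=150.0, phase=0.0):
    """Binary pattern for an amphiphilic helix: 'N' = nonpolar, 'P' = polar.

    Each residue advances 360 / residues_per_turn degrees around the helix
    axis. Residues within the nonpolar arc (centred on 0 degrees) are 'N'.
    """
    step = 360.0 / residues_per_turn
    pattern = []
    for i in range(n):
        angle = (phase + i * step) % 360.0
        # Angular distance from the centre of the nonpolar face.
        offset = min(angle, 360.0 - angle)
        pattern.append("N" if offset <= nonpolar_arc / 2 else "P")
    return "".join(pattern)


def strand_pattern(n, start_nonpolar=True):
    """Binary pattern for a strand: strictly alternating residues."""
    return "".join(
        "N" if (i % 2 == 0) == start_nonpolar else "P" for i in range(n)
    )


print(helix_pattern(18))  # prints: NPPNNPPNPPPNPPNNPP
print(strand_pattern(8))  # prints: NPNPNPNP
```

With 3.6 residues per turn the helical pattern repeats exactly every 18 residues (five turns), which is why an 18-residue window is shown.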
α-helical and water soluble (9).
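The degenerate codons and library sizes quoted for the binary code library of Kamtekar et al. (9) can be checked with a short Python sketch based on the standard genetic code. One caveat is worth stating plainly: with all four bases allowed at the first position, NAN would also encode Tyr and two stop codons, whereas the text lists six polar residues. The sketch therefore also evaluates a restricted first-base mixture (A, C, G only); that restriction is an assumption of this example and is not stated in the text. All function and variable names are illustrative.

```python
import math
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def encoded(first, second, third):
    """Residues ('*' = stop) encoded when each position allows the given bases."""
    return {CODON_TABLE[a + b + c] for a in first for b in second for c in third}


nonpolar = encoded("ACGT", "T", "ACGT")         # NTN
polar_full = encoded("ACGT", "A", "ACGT")       # NAN, fully degenerate
polar_restricted = encoded("ACG", "A", "ACGT")  # assumption: no T at position 1

print(sorted(nonpolar))          # ['F', 'I', 'L', 'M', 'V']
print(sorted(polar_restricted))  # ['D', 'E', 'H', 'K', 'N', 'Q']
print(sorted(polar_full - polar_restricted))  # ['*', 'Y']

# Library sizes for 24 nonpolar and 32 polar positions, as log10 values.
log_binary = 24 * math.log10(5) + 32 * math.log10(6)
log_random = 56 * math.log10(20)
print(f"binary code library: ~10^{log_binary:.1f}")   # ~10^41.7
print(f"fully random 56-mer: ~10^{log_random:.1f}")   # ~10^72.9
print(f"reduction: ~{log_random - log_binary:.0f} orders of magnitude")  # ~31
```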
Fig. 1.
Helix net representation of the four-helix
bundles designed by Kamtekar et al. (9). Exposed,
polar residues are represented as white circles. Buried,
nonpolar residues are represented as black circles.
Identities of the turn residues are shown explicitly. The hydrophobic
faces of the helices are shaded. Because the exact identity
of each side chain within the helices is not specified, the design is
based on a binary code, which specifies only whether a given residue is
polar or nonpolar.
α-helical proteins. Many
of them display cooperative thermal
denaturations.1 Furthermore, some proteins
in the collection give rise to NMR spectra with significant chemical
shift dispersion in both the amide and methyl regions. Most
importantly, amide protons are protected from exchange with solvent to
an extent similar to that seen in some natural
proteins.2
α-helix when
placed in one location but folded as a
β-strand when placed in an
alternative location within the same protein. In related work, the same
authors showed explicitly that the "intrinsic" propensities of the
20 amino acids for
β-structure are not "intrinsic" but
context-dependent; different values are obtained depending on whether propensities are measured for a central
β-strand or an
edge strand (20). To explicitly measure the importance of intrinsic
propensities relative to context, Xiong et al. (7) designed
a series of peptides in which intrinsic propensities favored one type
of secondary structure, while binary patterning favored an alternative
structure. Characterization of the peptides demonstrated that when the
sequence periodicity of polar and nonpolar residues matches the repeat
pattern of a particular secondary structure, the peptide forms that
structure regardless of the intrinsic propensities of the component
amino acids.
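One way to see which secondary structure a given polar/nonpolar pattern matches is to measure the strength of the pattern at each candidate periodicity. The Python sketch below does this with a simple Fourier-style sum at 100° per residue (3.6 residues/turn) and at 180° per residue (strict alternation). This is a generic illustration and not the analysis used by Xiong et al. (7). The two test sequences are hypothetical, the residue classes are the polar and nonpolar sets listed for the binary code library, and scoring any other residue as zero is a simplifying assumption.

```python
import cmath
import math

# Polar and nonpolar sets as listed in the text for the binary code library.
NONPOLAR = set("MLIVF")
POLAR = set("KHEQDN")


def periodic_signal(sequence, degrees_per_residue):
    """Normalised magnitude of the polar/nonpolar pattern at one periodicity."""
    total = 0j
    for i, aa in enumerate(sequence):
        # +1 for nonpolar, -1 for polar, 0 for residues outside both sets.
        value = 1.0 if aa in NONPOLAR else -1.0 if aa in POLAR else 0.0
        total += value * cmath.exp(1j * math.radians(degrees_per_residue * i))
    return abs(total) / len(sequence)


helix_like = "LKKLLKKLKKKLKKLLKK"  # hypothetical, 3.6-residue repeat
strand_like = "LKLKLKLK"           # hypothetical, alternating

for name, seq in [("helix-like", helix_like), ("strand-like", strand_like)]:
    alpha = periodic_signal(seq, 100.0)  # 360 / 3.6 residues per turn
    beta = periodic_signal(seq, 180.0)   # strict alternation
    print(name, "alpha" if alpha > beta else "beta")
# prints: helix-like alpha
# prints: strand-like beta
```

The score depends only on where polar and nonpolar residues sit, not on which residues they are, which mirrors the premise of the binary code.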
α-helices can be structurally quite different from
those connecting
β-strands, and the results obtained in one system
may not be directly applicable to those in the other system. Thus it is
important to consider both kinds of turns.
α-helices in the four-helix bundle, cytochrome
b562. The natural
Glu81-Gly82-Lys83 turn was replaced
by random tripeptide sequences. All studied variants were shown to form
structures similar to wild type. In a similar study on the
Asp30-Ala31-Asp32 interhelical turn
in Rop, 377 of the 380 isolated Rop variants folded properly (29).
These results suggest that the sequence of interhelical turns does not
dictate the structure of anti-parallel, four-helix bundles.
α-helical periodicity of polar
and nonpolar residues to be maintained through the entire length of the
Rop sequence. Nonetheless, the mutant sequence formed the interhelical
turn in the correct location, and the overall three-dimensional
structure was identical to wild type. Further evidence that the length
of an interhelical turn is not essential comes from experiments
demonstrating that a variety of insertions can be tolerated in the
interhelical turns of cytochrome b562 (31, 32).
These results suggest that helix-helix interactions, not interhelical
turns, dictate the overall structure of a four-helix bundle. This
suggestion was confirmed by Predki and Regan (33), who used polyglycine
linkers of various lengths to connect the helices of Rop in a different
order from wild type. The rearranged sequences folded into the correct
four-helix bundle and were biologically active.
β-strands may not be
as tolerant to substitution as interhelical turns. For example, when a
type II reverse turn
(Pro47-Ser48-Gly49-Val50)
in the
β-sheet protein plastocyanin was replaced by random
tetrapeptides, the correctly folded, blue copper protein formed in only
six of the 98 characterized variants (36). This interstrand turn
apparently plays a significant role in determining the fold of the
β-barrel structure of plastocyanin.
β-strands in an α + β domain of protein G. They found that
the majority of random substitutions was tolerated. However, when the
same turn was randomized in a variant of protein G that had already
been destabilized by mutations elsewhere in the structure, then a far
smaller fraction of turn sequences was tolerated. Thus, tolerance to
turn substitution is strongly dependent on the global stability of the
host protein.
) turns (38). Likewise, incorporation of
D-amino acids to favor a type II
β-turn played a key
role in stabilizing the native-like
structure of a 23-residue de novo sequence (39).
Fig. 2.
Two models for side chain packing.
A, jigsaw puzzle model requires complementary pieces
(adapted from Ref. 57); B, nuts and bolts model permits
promiscuous packing (adapted from Ref. 49).
α-helical
structures (9), which in some cases possess native-like
properties.1,2 These results demonstrate that although good
packing is important, it is possible to construct native-like de
novo proteins without explicitly designing all of the tertiary
interactions a priori.
α-helical
surfaces enhances the native-like properties of their
α2
peptide.
α-helical proteins, then the answer is unequivocally "yes." In the past decade we have progressed from a time when the
only available proteins were those isolated from natural organisms to a
time when de novo proteins have become the focus of new
journals and symposia.
β-sheet proteins and mixed α/β proteins are considerably more difficult to design (56). Even
for
α-helical structures, successful design of novel molecules that
recapitulate all the thermodynamic, structural, and functional properties of natural proteins remains a difficult challenge. Sequences
capable of folding into precise structures that possess high levels of
enzymatic activity will be found only rarely in the overall "box"
of sequence space. The challenge to devise such molecules will both
enhance our understanding of natural proteins and refine our ability to
choose de novo sequences.