Knowledge-based potential defined for a rotamer library to design protein sequences

Motonori Ota1,2, Yasuhiro Isogai3 and Ken Nishikawa1

1 National Institute of Genetics, Mishima, Shizuoka 411-8540 3 The Institute of Physical and Chemical Research (RIKEN), Wako,Saitama 351-0198, Japan


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
A knowledge-based potential for a rotamer library was developed to design protein sequences. Protein side-chain conformations are represented by 56 templates. Each of their fitness to a given structural site-environment is evaluated by a combined function of the three knowledge-based terms, i.e. two-body side-chain packing, one-body hydration and local conformation. The number of matches between the native sequence and the structural site-environment in the database and that of the virtually settled mismatches, counted in advance, were transformed into the energy scores. In the best-14 test (assessment for the reproduction ability of the native rotamer on its structural site within a quarter of 56 fitness rank positions), the structural stability analysis on mutants of human and T4 lysozymes and the inverse-folding search by a structure profile against the sequence database, this function performs better than the function deduced with the conventional normalization and our previously developed function. Targeting various structural motifs, de novo sequence design was conducted with the function. The sequences thus obtained exhibit reasonable molecular masses and hydrophobic/hydrophilic patterns similar to the native sequences of the target and act as if they were the homologs to the target proteins in BLASTP search. This significant improvement is discussed in terms of the reference state for normalization and the crucial role of short-range repulsion to prohibit residue bumps.

Keywords: 3D profile/de novo design/inverse folding/stability/threading


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
The search for a protocol to design a foldable sequence into a desired structure is one of the major issues in structural biology. In the early stages of this endeavor, a simple fold of the four-helix bundle was frequently employed as a target motif (Regan and DeGrado, 1988Go; Hecht et al., 1990Go). For this structure, a manual sequence design succeeded well, based on the amino acid propensities for the secondary structure formation and the binary pattern of polar and non-polar residues (Betz et al., 1997Go). Designing sequences for structures containing a ß-sheet was more difficult (Quinn et al., 1994Go), yet some progress was reported recently (de Alba et al., 1999Go). To design more complicated folds, we cannot rely on the manual building method and thus we should adopt a computational approach. Stringent work along this path was carried out by Dahiyat and Mayo (Dahiyat and Mayo, 1997Go). They used a computational algorithm to design a sequence of about 30 amino acids in length targeting the ßß{alpha} zinc finger motif and confirmed its validity by an NMR structure analysis. They employed a rotamer library, a set of frequently observed side-chain templates, to represent the protein side-chain conformations (Ponder and Richards, 1987Go). After fixing the backbone coordinates of the target structure, the optimal sequence was sought by a sophisticated algorithm, the dead-end elimination (Desmet et al., 1992Go). The target function for the optimization consists of a physico-chemical Lennard–Jones potential for each atom pair and an empirical hydration function determined by a QSAR-like procedure (Dahiyat and Mayo, 1996Go).

We sought to develop a method for designing a sizable protein with a length of more than 100 amino acids. Since there is an enormously large number of different sequence candidates (20100), rigorous physico-chemical approaches could not be employed for the calculation to be completed within a realistic computational time. Thus, a more practical way was attempted (Shakhnovich and Gutin, 1993Go; Jones, 1994Go; Godzik, 1995Go) to utilize a knowledge-based potential of amino acids developed for protein fold recognition by the threading technique (Ota and Nishikawa, 1997Go). Progress in this area was compiled in the Proceedings of CASP3, Critical Assessment of Techniques for Protein Structure Prediction (Murzin, 1999Go). In the structure prediction, called the forward-folding protocol, a query sequence is searched for its compatible fold in the structural library, whereas in the reverse case, for designing the sequence (inverse-folding protocol), compatible sequences for a target fold are sought in the sequence space (Kocher et al., 1994Go). Although the two protocols look symmetrical at a glance, the inverse-folding protocol revealed that more attention should be paid to the reference state problem, i.e. how to define the zero level of the function (Rooman and Wodak, 1995Go; Miyazawa and Jernigan, 1999Go; Ota and Nishikawa, 1999Go).

In our previous work (Ota and Nishikawa, 1997Go), the energy of the reference state was adjusted by utilizing the three-dimensional (3D) profile, in which the fitness of each amino acid type to every structural site is shown as a two-dimensional array of 20 (number of amino acid types)xN (length of protein) (Bowie et al., 1991Go). Employing the 3D profile of ribonuclease HI, the difference of the folding energy ({Delta}G) of a mutant relative to that of the wild-type ({Delta}{Delta}G) was calculated and the computation was correlated with the experimental value ({Delta}Tm) (Ota et al., 1995Go). The design of human lysozyme mutants more stable than the native was further carried out and nine mutants were selected which were predicted to be strongly stabilized. However, significant stabilization was observed in only one of the mutant proteins (Takano et al., 1999aGo). This failure was partly attributed to the poor estimation of the amino acid side-chain volume, because the large-sized amino acids, e.g. Trp, Phe, Met and Leu, tend to augment the stability in the calculation [see Table IGo of Takano et al. (1999a)Go], even if the reference state was adjusted to exclude atomic overlaps among large side chains (Ota and Nishikawa, 1997Go).


View this table:
[in this window]
[in a new window]
 
Table I. Proteins used for the best14 test
 
The issue of side-chain volume also arose in the de novo design of protein sequences. The optimal sequence was sought to fit the globin fold by generating the 3D profile recursively (Isogai et al., 1999Go). The computationally selected sequence, SCS1, contained many Trp, Met and Leu, but few Val, Ile and Ala, and its molecular mass was larger by more than 10% as compared with the native globin [see Table 1Go of Isogai et al. (1999)Go]. Bumping residues were trimmed manually using 3D modeling and many large-sized residues were substituted with smaller ones, based on the 3D profile. The refined sequence (DG1) was synthesized and its structural features were evaluated extensively. It was concluded that the secondary structure and the global shape of DG1 resemble the targeted globin fold; however, it lacks the structural uniqueness at the side-chain level: its NMR spectrum is broad and it exhibits poor folding cooperativity (Isogai et al., 1999Go). The structural uniqueness was improved by manual replacements of Leu and Met in DG1 with ß-branched amino acids, Ile and Val (Isogai et al., 2000Go). Following these results, we were urged to develop a theoretical method to improve the side-chain packing, because part of our failure was attributed to the poor estimation of the steric interactions between side-chains.

In this study, a knowledge-based potential for a rotamer library, based on the structures in the Protein Data Bank (PDB) (Bernstein et al., 1977Go), was developed to improve the side-chain packing estimation for designing a protein sequence. Thus far, potential functions for threading have usually been defined for each residue or residue pair, but detailed side-chain conformations were ignored (Miyazawa and Jernigan, 1985Go; Sippl, 1990Go). We introduced 56 rotamers in total (Ponder and Richards, 1987Go), comprising three rotamers depending on the {chi}1 angle for each amino acid type, except for Ala and Gly. Each rotamer is represented by a virtual atom situated at the centroid of the rotamer conformation. The side-chain packing function was defined to depend on the distance between the rotamer centroids. It also depends on the types of interacting amino acid residues, but does not depend on the rotamer types. For instance, the three Leu rotamers (Leu1, Leu2 and Leu3) share the same potential function; the only difference characterizing these three rotamers is the position of the side-chain centroid relative to the backbone coordinates. Clearly, our knowledge-based function differs from the functions derived for each atom pair (Melo and Feytmans, 1997Go; Samudrala and Moult, 1998Go).


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Structural library

We selected 683 structures deposited in the Protein Data Bank (PDB) (Bernstein et al., 1977Go). These structures, determined within 2.5 Å resolution with no omitted side chains, have sequence lengths of more than 100 residues and are mutually dissimilar, sharing less than 30% of identical sequences.

Rotamer library

A rotamer library consisting of 56 templates was employed (Ponder and Richards, 1987Go). The side-chain conformations of amino acids other than Ala, Gly and Pro are divided into three types, referring to their {chi}1 angle (type 1, 0 to 120°; type 2, 120 to –120°; type 3, –120 to 0°). Pro was divided into three types, referring to its {chi}1 and {chi}2 angles (type 1, 0 < {chi}1 < 60° and –60 < {chi}2 < 0°; type 2, –60 < {chi}1 < 0° and 0 < {chi}2 < 60°; type 3, {chi}1x{chi}2 > 0). Apparently, Ala and Gly have a single template. Although the same library proposed by Dunbrack and Karplus (Dunbrack and Karplus, 1992Go), consisting of 112 templates, was employed at the beginning of this study, a library consisting of many rotamers was unsuitable for the knowledge-based approach, mainly for statistical reasons, and therefore this minimal set was used.

Compatibility function

Three knowledge-based terms, the side-chain packing, hydration and local conformational functions, constitute the compatibility function. The hydration function was the same as used in the previous work (Ota and Nishikawa, 1997Go). The local conformational function also followed the previous definition, except that it was defined for each rotamer instead of each amino acid. The side-chain packing function is explained below.

To compile the side-chain packing function, we employed the native match/mismatch model (NMM model), which is illustrated in Figure 1Go. In the left-hand view, the environment at each residue site (A, B, C, D and E) is extracted from the PDB entry and is shown on the bottom of the right-hand view and they are aligned against the native rotamers at the corresponding sites, a, b, c, d and e (top of the right-hand view). After mounting a native rotamer on its own environment (a on A, b on B), the distance between the rotamers (centroid distance between the native rotamer and a rotamer situated near the site) was measured. Also, we virtually settled the mismatch, corresponding to misalignments, as a on B, a on C, etc. and measured the distance between the rotamers. Each rotamer centroid is situated at the expected center of the gravity of side-chain atoms except Cß. For Ala and Gly, Cß and C{alpha} atoms are regarded as the rotamer centroids, respectively. The number of observations in which rotamers of residue pair xy is located at a distance r under the native-match and mismatch events is counted and normalized by each of the total number in the contact region, to give the potential function:

(1)
where f Nxy (r) and f Mxy (r) are the frequencies of occurrence for rotamers, each of which belongs to residue x or y, located at distance r in the native match and mismatch, respectively. p xync (r) is part of the damping factor: the non-contact probability of rotamers of residue pair xy at r. If some atoms in each of side chains are within 5 Å in the native structure, then they are defined to be in contact. R is the gas constant and T = 300 K. The side-chain packing function is defined when the sequence separation is more than four residues and sparse data correction as proposed by Sippl was applied (Sippl, 1990Go). The discretization of the spatial distance r is carried out using the bins of 1 Å width and linear smoothing: if r is 4.8 Å, 0.7 and 0.3 portions are counted for bins of 4–5 and 5–6 Å, respectively. Equation 1Go indicates that the mismatch plays the part of the reference state. The same formulation is also introduced to the interactions between a main-chain atom (N, C{alpha}, C, O or Cß) and the rotamer centroid as a supplement of the side-chain packing. A combined function composed of this function, the hydration function and local conformational function is abbreviated as ‘rotamer’.



View larger version (24K):
[in this window]
[in a new window]
 
Fig. 1. Schematic diagram to illustrate the native match/mismatch (NMM) model. On the left-hand side, there is a globular protein with structural environments A, B, C, D and E at different residue sites. On the right-hand side, A, B, C, D and E are shown as boxes with different patterns, indicating their different structural features. Above them, native rotamers corresponding to A, B, C, D and E are shown as a, b, c, d and e, respectively. If rotamer a is put in its native environment A, then we call it the native match. If rotamer a is put in any non-native environment (other than A), then we call it the mismatch.

 
Another side-chain packing function originally introduced by Sippl (Sippl, 1990Go), was also compiled (the acronym of the combined function is ‘Sippl_r’). This formulation considers all rotamer pairs of any amino acid type at a distance in the native structures as the reference state for the normalization:

(2)
where f (r) is the probability of any rotamer pair having a centroid separation of r (maximum r is 10 Å). This is a distinct point compared with the NMM model, in which not only native but also non-native (mismatch) rotamers are taken into account.


    Results and discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Best-14 test

The 3D profile was constructed for the 25 proteins listed in Table IGo using the compatibility function of ‘rotamer’ derived from the remaining 658 structures in the structural library. A part of the 3D profile of human lysozyme is shown in Figure 2Go. The 56 rotamers are arranged in order of the fitness score from left to right. The native rotamer is highlighted and generally appears on the left-hand side (preferable to its own site). We examined how many native rotamers were positioned within the best-14 ranks amongst the 56 rotamers and estimated the success rate (best-14 score) for 5024 sites of the 25 proteins. Fourteen is a quarter of 56 and therefore this test, called the best-14 test, is comparable to the previously conducted best-five test for the 20 amino acid types (Ota and Nishikawa, 1997Go). It was assumed that the best-14 score should be higher as the function improves, given that almost all native rotamers (residues) fit well to the native site.



View larger version (88K):
[in this window]
[in a new window]
 
Fig. 2. Part of the 3D profile of human lysozyme (PDB code 1lz1) constructed with the rotamer function. Rotamers are ordered according to fitness from left to right in the profile table. Native rotamers are highlighted. N, the site number; Rt, the native rotamer; L, the local structure; H, the hydration class; En, score of the native rotamer; Rk, ranking position of the native rotamer amongst 56 alternatives; Eb, score of the fittest rotamer; Ew, score of the worst rotamer [see also Ota and Nishikawa (1997)Go for details].

 
The results of the best-14 test, conducted for rotamer and Sippl_r functions, are shown in column 2 in Table IIGo. The best-14 score of the rotamer function is 75.7%. This score is three times higher than the random level (25%) and better than the score of the Sippl_r function (64.2%). Each function is also asked whether the native rotamer is correctly detected as having the lowest energy among the three alternatives of the same amino acid residue (except Ala and Gly). For instance, if the native rotamer is Leu1 and it has the lowest score among Leu1, Leu2 and Leu3, then the assignment is correct. This type of test was applied only to residues at buried sites, where the hydration class defined by the number of surrounding heavy atoms is >=7 (Ota and Nishikawa, 1997Go). The prediction accuracy was 78.6% by the rotamer function (33% by the random assignment), as shown in column 3 of Table IIGo. Apparently, it performs better than the Sippl_r function (70.1%). Gilis and Rooman (1997)Go reported that the relative weights for the local and non-local terms depend on the hydration class, so that the statistics of the rotamer positions were suspected to depend on the hydration class (buried, partially buried in protein or exposed to water). To check this point, counting and compilation were carried out individually on the hydration class on the side-chain packing function or the local conformational function. However, the dependence on the hydration class has little influence on the results (data not shown). Obviously, the introduction of a rotamer succeeded well in the best-14 test (75.7%) owing to the improvement in the packing, as compared with the previous function, ON97, that was defined for the amino acid pairs, yielded 58.9% in the best-5 test for the same 25 samples (Ota and Nishikawa, 1997Go).


View this table:
[in this window]
[in a new window]
 
Table II. Various assessments for the potential functions
 
Stability analysis

To estimate the relative stability of a mutant protein to the wild-type ({Delta}{Delta}G), we have to know the energy level of the denatured state. Regarding the ensemble of non-native (mismatch) occurrences specified by the residue-pair (xy) and their distance (r) as the denatured state, the NMM model directly provides the relative side-chain packing energy (Equation 1Go) between the native and denatured states. This approximation is allowable if we virtually define the denatured state of given x, y and r, maintaining the structurally compact state but lacking the structure (r)–sequence (xy) relationships observed in the native state (Rooman and Wodak, 1995Go). The stability of the mutant proteins relative to the wild-type ({Delta}{Delta}Gc) was calculated and compared with the experimentally determined energy {Delta}{Delta}Ge. Since the present approach with side-chain rotamers requires the 3D structures of the mutant proteins, mutants of human lysozyme and T4 lysozyme were employed, for which substantial stability and structural data are available. Seventy-seven single mutants of human lysozyme, analyzed mainly by Yutani and co-workers (Herning et al., 1993Go; Takano et al., 1995Go, 1997aGo,bGo, 1999aGo,bGo; Yamagata et al., 1998Go; Funahashi et al., 1999Go), and 103 point mutations of T4 lysozyme, analyzed by Matthews (Matthews, 1995Go), were employed. The T4 lysozyme data are a mixture of point mutants of the wild-type and point mutants of lysozyme lacking two cysteines (Matthews, 1995Go). We adopted both the real native protein (PDB code 3lzm) and pseudo-native, the double mutants of C54T and C97A (PDB code 1l63), as parent structures and mutants generated from either of the parent structures are called ‘mutants’ and were treated in the same way. Each data set is a mixture of various mutants whose stability changes were attributed to several reasons, e.g. hydrophobicity, secondary structure propensity or hydrogen bonding (Matthews, 1995Go; Funahashi et al., 2001Go). We examined the total ability of functions for the stability analysis regardless of the reason for the stabilization of each mutant.

Figure 3A and BGoGo are the scatter plots of the theoretical values ({Delta}{Delta}Gc), estimated with the rotamer function, against the experimental values ({Delta}{Delta}Ge) for the human and T4 lysozymes, respectively. The two values correlate: the correlation coefficients for the human and T4 lysozymes are 0.58 and 0.69, as shown in Table IIGo. The results indicate a notable improvement as compared with the previous study, where the correlation coefficients for the human lysozyme data and T4 lysozyme data were 0.48 (Takano et al., 1999aGo) and 0.56, respectively (Table IIGo). Since this calculation contains no adjustable parameters, the correlation coefficients are not comparable directly with other studies in which many factors were determined by regression (Gromiha et al., 1999Go; Funahashi et al., 2001Go). In this test, the Sippl_r function performs as well as the rotamer function (Table IIGo). We classified single-residue substitutions into four groups: between hydrophobic (C, F, I, L, M, V, W, Y)–hydrophobic residues (denoted HH), hydrophilic (D, E, H, K, N, Q, R, S, T)–hydrophilic residues (PP), hydrophobic–hydrophilic residues (HP) and those involving Ala, Gly or Pro (AGP). Substitutions between the residues of the same character, e.g. HH or PP, do not yield a good correlation between the computation and the experiment, whereas substitutions between different types, HP and AGP, significantly correlate: the correlation coefficients are 0.65 (human lysozyme) and 0.76 (T4 lysozyme). Also, correlation at the buried sites is remarkable: 0.64 (human lysozyme) and 0.73 (T4 lysozyme). It is noteworthy that the energy scale is almost identical for the ordinate and the abscissa in Figure 3Go (ratio 0.6). In other words, the theoretical and experimental values are similar on the absolute scale (Mirny and Shakhnovich, 1996Go), although the knowledge-based function has frequently been criticized regarding this point (Thomas and Dill, 1996Go; Ben-Naim, 1997Go).



View larger version (13K):
[in this window]
[in a new window]
 
Fig. 3. Scatter plots of the experimental stability ({Delta}{Delta}Ge) of a mutant protein and its calculated stability ({Delta}{Delta}Gc) by the rotamer function, for human lysozyme (A) and T4 lysozyme (B). Positive values indicate stabilization in comparison with the wild-type; 90% density ellipses are shown.

 
Inverse-folding search

The structure–recognize–sequence protocol, named the inverse-folding search, was conducted with the rotamer and Sippl_r functions. A benchmark set composed of 382 structures used in the previous work was also employed (Ota and Nishikawa, 1999Go); 177 of them have at least one related protein belonging to the same superfamily of the SCOP database (Murzin et al., 1995Go) and the structures of 165 queries (<400 residues long) sought their homologous sequences (including the native one) in the sequence library. Here 657 true pairs and 62 373 false pairs constitute all of the 3D–1D matches.

Employing so-called the jack-knife procedure, the 3D profiles of query structures were accomplished by each of the functions derived from this 382 data set. Assigning the native type of the rotamer to each sequence, e.g. V3L3S1E2G1E3W3Q2L3V2 ... for the sperm whale myoglobin (1mbd) sequence, 3D–1D alignment using the ‘rotamer’-based 3D profile was carried out with the optimized gap penalties. Since the conservation of the rotamer type ({chi}1 angle) is not obvious in the remote homologous proteins, e.g. only 41% of rotamer types are identical between 1mbd and 1ash (ascaris hemoglobin domain I) according to FSSP alignment (Holm and Sander, 1996Go), the assignment of the native rotamer type into the sequence does not always guarantee an improvement of the performance. The alignment score was just regarded as the compatibility score (Ota and Nishikawa, 1999Go) and transformed into the SD (standard deviation unit) score.

Figure 4Go is a sensitivity (number of the true positives/number of the trues)–specificity (number of the true positives/number of the positives) plot, which shows how many of the possible trues are detected at a given confidence level. The curve drawn toward the upper-right corner is excellent. A clear improvement is seen for the rotamer function (thick line) from the previous function, ON97 (thick broken line) (Ota and Nishikawa, 1997Go, 1999Go). In this analysis, an ~3% difference in the sensitivity is statistically significant (Ota and Nishikawa, 1999Go). The Sippl_r function (thin broken line) performs as well as or slightly worse than the ON97 function. To estimate the ability for the actual sequence database search with the rotamer function, where the rotamer type of the sequence is unknown, we transformed the ‘rotamer’-based 3D profile to a ‘residue’- based 3D profile. Among the scores of the rotamers belonging to a given residue, the best of them was chosen. The results of this search using these shrunk 3D profiles are shown with the thin line (rotamer_b). The sensitivity– specificity curve of rotamer_b overlaps with the curve of the rotamer above 0.95 specificity and appears to perform better than the previous results (ON97) above 0.90 specificity. Even the rotamer profile search (rotamer) is still slightly inferior to the sequence similarity search using the Dayhoff PAM250 matrix (PAM250) (Dayhoff et al., 1978Go); however, if it is properly combined with the sequence information (Domingues et al., 2000Go; Panchenko et al., 2000Go), it outperforms the sequence similarity search (rotamer_b + PAM250).



View larger version (23K):
[in this window]
[in a new window]
 
Fig. 4. Sensitivity (number of true positives/number of true positives and true negatives)–specificity (number of true positives/number of true positives and false positives) plot of the inverse-folding search. 3D profiles made by several functions, rotamer (thick line), ON97 (thick broken line) and Sippl_r (thin-broken line), were employed. Rotamer_b (thin line) represents a search results using a residue-based 3D profile made by the rotamer function, in which among the scores of the rotamers belonging to a given residue, the best of them was chosen. PAM250 (thin dotted line) shows the results of the sequence similarity search using Dayhoff PAM250 substitution matrix (Dayhoff et al., 1978Go). Rotamer_b + PAM250 (thin one-point broken line) is the search results obtained by the combined profile consisting of rotamer_b (weight 0.7) and PAM250 matrix (–0.15). In all the searches a simple dynamic programming algorithm was employed using gap opening and extension penalties and global–local alignment scheme, i.e. gaps on the sequence termini are not penalized (Needleman and Wunsch, 1970Go). In the 3D profile searches gap opening penalties were altered according to the hydration class (Ota and Nishikawa, 1997Go, 1999Go). The optimized gap penalties for each search are as follows: gap opening penalty for the most exposed site (hydration class 1: go1), gap opening penalty for the most buried site (hydration class 9: go9) and gap extension penalty (ge) = 2.2, 4.4 and 0.28 kcal/mol, respectively, for rotamer, Sippl_r, rotamer_b and rotamer_b + PAM250 profile searches; and 2.4 (go1), 4.8 (go9) and 0.3 (ge) kcal/mol for ON97 profile search (Ota and Nishikawa, 1999Go). For the partially buried sites the gap opening penalties are determined by linear interpolation of go1 and go9. In the sequence similarity search, gap-opening and extension penalties are –11 and –2, respectively.

 
De novo design

A de novo sequence design of eight proteins chosen from Table IGo was conducted by rotamer, ON97 and Sippl_r functions. Using only the one-body function (hydration and local conformation), an optimal sequence was determined in advance. Mounting this sequence on the protein backbone structure, a 3D profile was produced by adding the two-body function. The next, tentative optimal sequence was derived from this 3D profile and used for the next calculation (as the sequence mounting on the structure). This cycle was iterated recursively until the self-consistent sequence (SCS) was given (Isogai et al., 1999Go). When oscillation occurred between two fixed sequences, we mixed them randomly and continued the calculation. Fifty SCS sequences were generated for each of target proteins and the one with the best compatibility score was selected.

The quality of a designed protein is not easy to assess, other than by experiments and synthesis of the designed protein. However, the molecular mass and the match of the hydrophobic (H)/hydrophilic (P) amino acids with the native sequence are useful criteria to check the rationality of the designed sequences prior to experiments, as shown by Isogai et al. (Isogai et al., 1999Go). In Table IIIGo, the match of the hydrophobicity in the designed sequences and their molecular masses are shown. HPid is the identical ratio between the native and designed sequences, each of which is represented by H/P symbols, and HPr is the correlation coefficient of both H/P patterns. Here we regard A, C, F, I, L, M, V, W and Y as hydrophobic.


View this table:
[in this window]
[in a new window]
 
Table III. Molecular weight, HP pattern identity (HPid) and correlation coefficient (HPr) of the designed proteins
 
The average molecular mass of the designed sequences by the rotamer function is 17.0 kDa and is slightly higher than the native sequence (16.5 kDa). The sequences designed by the other two functions were significantly heavy, >19 kDa. Especially the molecular mass of the sequence designed by targeting the 1mbd template by the rotamer function, 18.2 kDa, is comparable to DG1 (18.6 kDa), which was made by manual modification of the automatically designed sequence, SCS1 (19.2 kDa in Table IIIGo), and was experimentally shown to fold into a globin-like structure (Isogai et al., 1999Go). It is clear that the introduction of the side-chain rotamers resulted in elimination of the atomic collisions and maintaining an appropriate molecular mass. The average HPid (75%) and HPr (0.48) of the rotamer function are also the highest among the results, indicating that an H/P pattern similar to the native sequence was generated by the rotamer function. For instance, sperm whale myoglobin (1mbd) and ascaris hemoglobin domain I (1ash) share 74% H/P pattern and their correlation is 0.47. Therefore, 75% HPid and 0.48 HPr obtained by the rotamer function are promising, implying that the function can generate a sequence similar to the homologs for the target protein. To clarify this point further, a BLASTP search (Altschul et al., 1997Go) was performed with each of the sequences designed by the rotamer function against the SwissProt database release 39 (Bairoch and Apweiler, 2000Go). The first hit of this search is shown in Table IVGo. All of them are homologs of the target protein, aligned with the query (designed) sequences over half of the total length (see alignment length in Table IVGo), and almost all of them have significant E values. In addition, the phylogenetic tree among the designed globins, true globins and phycocyanins (remote homologs of globin) was constructed by CLUSTALW (Thompson et al., 1994Go) and is shown in Fig. 5Go. The designed globins made a cluster and appear to relate with the globin family as well as phycocyanins. Hence we successfully designed a fake of the homologs by the rotamer function.


View this table:
[in this window]
[in a new window]
 
Table IV. BLASTP search from the designed sequence against SwissProt
 


View larger version (16K):
[in this window]
[in a new window]
 
Fig. 5. The phylogenetic tree among the designed globins (by rotamer, ON97 and Sippl_r functions), true globins and phycocyanins constructed by CLUSTALW (Thompson et al., 1994Go). Except for two hemoglobins, natural sequences are distantly related with having <25% sequence identity. SwissProt codes of the sequences are MYG_PHYCA (sperm whale myoglobin), HBA_HUMAN (human hemoglobin {alpha}), HBB_HUMAN (human hemoglobin ß), GLB3_CHITH (erythrocruorin), GLB1_GLYDI (leghemoglobin), PHA1_FREDI (c-phycocyanin {alpha}) and PHB1_FREDI (c-phycocyanin ß).

 
Reference state problem

The only difference between the side-chain packing functions of rotamer and Sippl_r is the denominator in the argument of the logarithmic function (see Equations 1 and 2GoGo), i.e. the reference state for the normalization (Rooman and Wodak, 1995Go; Miyazawa and Jernigan, 1999Go; Ota and Nishikawa, 1999Go). The former employs the distribution of the rotamer distance under virtually settled mismatch according to the NMM model [f xyM (r) in Equation 1Go], whereas the latter employs the distribution of any rotamer pairs observed in the native proteins [f(r) in Equation 2Go]. To investigate this difference further, we plotted scores of each function [{Delta}ExyNMM(r) and {Delta}ExySp(r)] in Fig. 6Go. Surprisingly, corresponding score values over a 3 Å distance are very similar and clustered around 0 kcal/mol (blue dots). The other values, scores within 3 Å (red dots), are significantly affected by the choice of the reference state. Almost all red dots are on or under the diagonal, indicating that the Sippl_r function yields better scores within a 3 Å rotamer distance. In other words, the repulsive part of Sippl_r is very weak. This is the essential difference between the rotamer and Sippl_r functions. Rotamer pairs within 3 Å distance are rare in the real protein structures, whereas this is not the case in mismatch, resulting in a large f xyM (r). Referring to this virtual situation, rotamer bumps are properly inhibited with the rotamer function. To confirm the effect of this repulsive term, we made a modified Sippl_r function in which only the scores within a 3 Å distance are exchanged by that of the rotamer function. The best-14 score of this function is 71% and it predicts rotamer type with 76% accuracy. This superiority over the results of the Sippl_r function (Table IIGo) indicates the crucial role of steric hindrance in the short-range interaction. Interestingly, most of the scores that are frequently used in structural stability analysis are the scores of over 3 Å rotamer distance (data not shown). This is the reason why the rotamer and Sippl_r functions behave equivalently in the stability analysis (Table IIGo). Each of the mutant proteins examined here folds into its unique structure with least deviation from its native structure, and therefore it is unnecessary to evaluate the unrealistic collisions within a 3 Å rotamer distance.



View larger version (31K):
[in this window]
[in a new window]
 
Fig. 6. Scatter plots of the interaction parameters of Sippl_r side-chain packing function [{Delta}E xySp (r)] against rotamer side-chain packing function [{Delta}E xyNMM (r)]. Red and blue dots are the scores for rotamer–rotamer pair distance separation less than or over 3 Å, respectively.

 
In conclusion, the knowledge-based side-chain packing function is improved by the introduction of rotamers if it is normalized by the NMM model. This new function performs better than either the previous function (ON97) or the function normalized by the ordinal formula (Sippl_r) in the best-14 test, the stability analysis of mutant proteins and the inverse folding search. This improvement was achieved by the proper estimation of packing, especially the repulsive part of short-range interactions. The function developed here might be suitable to evaluate designed sequences on a target fold and to investigate the importance of the packing in protein folding or on structural uniqueness. We are validating it also by experiments.


    Notes
 
2 To whom correspondence should be addressed. E-mail: mota{at}genes.nig.ac.jp Back


    Acknowledgments
 
We are grateful to Hiroyuki Izuno for his help in obtaining data in the first stage of this work and Noriko Ito for her assistance. We also thank Katsuhide Yutani, Kazufumi Takano and Jun Funahashi for helpful discussions and their kindness in offering their latest data. This work was supported by a grant-in-aid to M.O. from the Ministry of Education, Science, Sports and Culture, Japan.


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Altschul,S., Madden,T., Schaffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Nucleic Acids Res., 25, 3389–3402.[Abstract/Free Full Text]

Bairoch,A. and Apweiler,R. (2000) Nucleic Acids Res., 28, 45–48.[Abstract/Free Full Text]

Ben-Naim,A. (1997) J. Chem. Phys., 107, 3698–3706.[ISI]

Bernstein,F.C., Koetzle,T.F., Williams,G.T.B., Meyer,E.F., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535–542.[ISI][Medline]

Betz,S., Liebman,P. and DeGrado,W. (1997) Biochemistry, 36, 2450–2458.[ISI][Medline]

Bowie,J.U., Lüthy,R. and Eisenberg,D. (1991) Science, 253, 164–170.[ISI][Medline]

Dahiyat,B. and Mayo,S. (1996) Protein Sci., 5, 895–903.[Abstract/Free Full Text]

Dahiyat,B. and Mayo,S. (1997) Science, 278, 82–87.[Abstract/Free Full Text]

Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) In Dayhoff,M.O. (ed.), A Model of Evolutionary Change in Proteins, Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, DC, pp. 345–352.

de Alba,E., Santoro,J., Rico,M. and Jimenez,M. (1999) Protein Sci., 8, 854–865.[Abstract]

Desmet,J., De Maeyer,M., Hazes,B. and Lasters,I. (1992) Nature, 356, 539–542.[ISI]

Domingues,F.S., Lackner,P., Andreeva,A. and Sippl,M.J. (2000) J. Mol. Biol., 297, 1003–1013.[ISI][Medline]

Dunbrack,R.L. and Karplus,M. (1992) J. Mol. Biol., 230, 543–574.[ISI]

Funahashi,J., Takano,K., Yamagata,Y. and Yutani,K. (1999) Protein Eng., 12, 841–850.[Abstract/Free Full Text]

Funahashi,J., Takano,K. and Yutani,K. (2001) Protein Eng., 14, 127–134.[Abstract/Free Full Text]

Gilis,D. and Rooman,M. (1997) J. Mol. Biol., 272, 276–290.[ISI][Medline]

Godzik,A. (1995) Protein Eng., 8, 409–416.[Abstract]

Gromiha,M.M., Oobatake,M., Kono,H., Uedaira,H. and Sarai,A. (1999) Protein Eng., 12, 549–555.[Abstract/Free Full Text]

Hecht,M., Richardson,J., Richardson,D. and Ogden,R. (1990) Science, 249, 884–891.[ISI][Medline]

Herning,T., Yutani,K., Inaka,K., Kuroki,R., Matsushima,M. and Kikuchi,M. (1993) Biochemistry, 32, 7077–7085.

Holm,L. and Sander,C. (1996) Science, 273, 595–602.[Abstract/Free Full Text]

Isogai,Y., Ota,M., Fujisawa,T., Izuno,H., Mukai,M., Nakamura,H., Iizuka,T. and Nishikawa,K. (1999) Biochemistry, 38, 7431–7443.[ISI][Medline]

Isogai,Y., Ishii,A., Fujisawa,T., Ota,M. and Nishikawa,K. (2000) Biochemistry, 39, 5683–5690.[ISI][Medline]

Jones,D.T. (1994) Protein Sci., 3, 567–574.[Abstract/Free Full Text]

Kocher,J.A., Rooman,M.J. and Wodak,S.J. (1994) J. Mol. Biol., 235, 1598–1613.[ISI][Medline]

Matthews,B. (1995) Adv. Protein Chem., 46, 249–278.[ISI][Medline]

Melo,F. and Feytmans,E. (1997) J. Mol. Biol., 267, 207–222.[ISI][Medline]

Mirny,L. and Shakhnovich,E. (1996) J. Mol. Biol., 264, 1164–1179.[ISI][Medline]

Miyazawa,S. and Jernigan,R.L. (1985) Macromolecules, 18, 534–552.[ISI]

Miyazawa,S. and Jernigan,R. (1999) Proteins, 36, 357–369.[ISI][Medline]

Murzin,A. (1999) Proteins, 37(S3), 88–103.[ISI][Medline]

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.[ISI][Medline]

Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48, 443–453.[ISI][Medline]

Ota,M. and Nishikawa,K. (1997) Protein Eng., 10, 339–351.[Abstract]

Ota,M. and Nishikawa,K. (1999) Protein Sci., 8, 1001–1009.[Abstract]

Ota,M., Kanaya,S. and Nishikawa,K. (1995) J. Mol. Biol., 248, 733–738.[ISI][Medline]

Panchenko,A.R., Marchler-Bauer,A. and Bryant,S.H. (2000) J. Mol. Biol., 296, 1319–1331.[ISI][Medline]

Ponder,J.W. and Richards,M. (1987) J. Mol. Biol., 193, 775–791.[ISI][Medline]

Quinn,T., Tweedy,N., Williams,R., Richardson,J. and Richardson,D. (1994) Proc. Natl Acad. Sci. USA, 91, 8747–8751.[Abstract]

Regan,L. and DeGrado,W.F. (1988) Science, 241, 976–978.[ISI][Medline]

Rooman,M.J. and Wodak,S.J. (1995) Protein Eng., 8, 849–858.[Abstract]

Samudrala,R. and Moult,J. (1998) J. Mol. Biol., 275, 895–916.[ISI][Medline]

Shakhnovich,E. and Gutin,A. (1993) Protein Eng., 6, 793–800.[Abstract]

Sippl,M.J. (1990) J. Mol. Biol., 213, 859–883.[ISI][Medline]

Takano,K., Ogasawara,K., Kaneda,H., Yamagata,Y., Fujii,S., Kanaya,E., Kikuchi,M., Oobatake,M. and Yutani,K. (1995) J. Mol. Biol., 254, 62–76.[ISI][Medline]

Takano,K., Funahashi,J., Yamagata,Y., Fujii,S. and Yutani,K. (1997a) J. Mol. Biol., 274, 132–142.[ISI][Medline]

Takano,K., Yamagata,Y., Fujii,S. and Yutani,K. (1997b) Biochemistry, 36, 688–698.[ISI][Medline]

Takano,K., Ota,M., Ogasahara,K., Yamagata,Y., Nishikawa,K. and Yutani,K. (1999a) Protein Eng., 12, 663–672.[Abstract/Free Full Text]

Takano,K., Yamagata,Y., Kubota,M., Funahashi,J., Fujii,S. and Yutani,K. (1999b) Biochemistry, 38, 6623–6629.[ISI][Medline]

Thomas,P. and Dill,K. (1996) J. Mol. Biol., 257, 457–469.[ISI][Medline]

Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673–4680.[Abstract]

Yamagata,Y., Kubota,M., Sumikawa,Y., Funahashi,J., Takano,K., Fujii,S. and Yutani,K. (1998) Biochemistry, 37, 9355–9362.[ISI][Medline]

Received January 12, 2001; revised March 28, 2001; accepted April 22, 2001.