1 National Institute of Genetics, Mishima, Shizuoka 411-8540 3 The Institute of Physical and Chemical Research (RIKEN), Wako,Saitama 351-0198, Japan
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: 3D profile/de novo design/inverse folding/stability/threading
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
We sought to develop a method for designing a sizable protein with a length of more than 100 amino acids. Since there is an enormously large number of different sequence candidates (20100), rigorous physico-chemical approaches could not be employed for the calculation to be completed within a realistic computational time. Thus, a more practical way was attempted (Shakhnovich and Gutin, 1993; Jones, 1994
; Godzik, 1995
) to utilize a knowledge-based potential of amino acids developed for protein fold recognition by the threading technique (Ota and Nishikawa, 1997
). Progress in this area was compiled in the Proceedings of CASP3, Critical Assessment of Techniques for Protein Structure Prediction (Murzin, 1999
). In the structure prediction, called the forward-folding protocol, a query sequence is searched for its compatible fold in the structural library, whereas in the reverse case, for designing the sequence (inverse-folding protocol), compatible sequences for a target fold are sought in the sequence space (Kocher et al., 1994
). Although the two protocols look symmetrical at a glance, the inverse-folding protocol revealed that more attention should be paid to the reference state problem, i.e. how to define the zero level of the function (Rooman and Wodak, 1995
; Miyazawa and Jernigan, 1999
; Ota and Nishikawa, 1999
).
In our previous work (Ota and Nishikawa, 1997), the energy of the reference state was adjusted by utilizing the three-dimensional (3D) profile, in which the fitness of each amino acid type to every structural site is shown as a two-dimensional array of 20 (number of amino acid types)xN (length of protein) (Bowie et al., 1991
). Employing the 3D profile of ribonuclease HI, the difference of the folding energy (
G) of a mutant relative to that of the wild-type (
G) was calculated and the computation was correlated with the experimental value (
Tm) (Ota et al., 1995
). The design of human lysozyme mutants more stable than the native was further carried out and nine mutants were selected which were predicted to be strongly stabilized. However, significant stabilization was observed in only one of the mutant proteins (Takano et al., 1999a
). This failure was partly attributed to the poor estimation of the amino acid side-chain volume, because the large-sized amino acids, e.g. Trp, Phe, Met and Leu, tend to augment the stability in the calculation [see Table I
of Takano et al. (1999a)
], even if the reference state was adjusted to exclude atomic overlaps among large side chains (Ota and Nishikawa, 1997
).
|
In this study, a knowledge-based potential for a rotamer library, based on the structures in the Protein Data Bank (PDB) (Bernstein et al., 1977), was developed to improve the side-chain packing estimation for designing a protein sequence. Thus far, potential functions for threading have usually been defined for each residue or residue pair, but detailed side-chain conformations were ignored (Miyazawa and Jernigan, 1985
; Sippl, 1990
). We introduced 56 rotamers in total (Ponder and Richards, 1987
), comprising three rotamers depending on the
1 angle for each amino acid type, except for Ala and Gly. Each rotamer is represented by a virtual atom situated at the centroid of the rotamer conformation. The side-chain packing function was defined to depend on the distance between the rotamer centroids. It also depends on the types of interacting amino acid residues, but does not depend on the rotamer types. For instance, the three Leu rotamers (Leu1, Leu2 and Leu3) share the same potential function; the only difference characterizing these three rotamers is the position of the side-chain centroid relative to the backbone coordinates. Clearly, our knowledge-based function differs from the functions derived for each atom pair (Melo and Feytmans, 1997
; Samudrala and Moult, 1998
).
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
We selected 683 structures deposited in the Protein Data Bank (PDB) (Bernstein et al., 1977). These structures, determined within 2.5 Å resolution with no omitted side chains, have sequence lengths of more than 100 residues and are mutually dissimilar, sharing less than 30% of identical sequences.
Rotamer library
A rotamer library consisting of 56 templates was employed (Ponder and Richards, 1987). The side-chain conformations of amino acids other than Ala, Gly and Pro are divided into three types, referring to their
1 angle (type 1, 0 to 120°; type 2, 120 to 120°; type 3, 120 to 0°). Pro was divided into three types, referring to its
1 and
2 angles (type 1, 0 <
1 < 60° and 60 <
2 < 0°; type 2, 60 <
1 < 0° and 0 <
2 < 60°; type 3,
1x
2 > 0). Apparently, Ala and Gly have a single template. Although the same library proposed by Dunbrack and Karplus (Dunbrack and Karplus, 1992
), consisting of 112 templates, was employed at the beginning of this study, a library consisting of many rotamers was unsuitable for the knowledge-based approach, mainly for statistical reasons, and therefore this minimal set was used.
Compatibility function
Three knowledge-based terms, the side-chain packing, hydration and local conformational functions, constitute the compatibility function. The hydration function was the same as used in the previous work (Ota and Nishikawa, 1997). The local conformational function also followed the previous definition, except that it was defined for each rotamer instead of each amino acid. The side-chain packing function is explained below.
To compile the side-chain packing function, we employed the native match/mismatch model (NMM model), which is illustrated in Figure 1. In the left-hand view, the environment at each residue site (A, B, C, D and E) is extracted from the PDB entry and is shown on the bottom of the right-hand view and they are aligned against the native rotamers at the corresponding sites, a, b, c, d and e (top of the right-hand view). After mounting a native rotamer on its own environment (a on A, b on B), the distance between the rotamers (centroid distance between the native rotamer and a rotamer situated near the site) was measured. Also, we virtually settled the mismatch, corresponding to misalignments, as a on B, a on C, etc. and measured the distance between the rotamers. Each rotamer centroid is situated at the expected center of the gravity of side-chain atoms except Cß. For Ala and Gly, Cß and C
atoms are regarded as the rotamer centroids, respectively. The number of observations in which rotamers of residue pair xy is located at a distance r under the native-match and mismatch events is counted and normalized by each of the total number in the contact region, to give the potential function:
![]() | (1) |
|
![]() | (2) |
![]() |
Results and discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
The 3D profile was constructed for the 25 proteins listed in Table I using the compatibility function of rotamer derived from the remaining 658 structures in the structural library. A part of the 3D profile of human lysozyme is shown in Figure 2
. The 56 rotamers are arranged in order of the fitness score from left to right. The native rotamer is highlighted and generally appears on the left-hand side (preferable to its own site). We examined how many native rotamers were positioned within the best-14 ranks amongst the 56 rotamers and estimated the success rate (best-14 score) for 5024 sites of the 25 proteins. Fourteen is a quarter of 56 and therefore this test, called the best-14 test, is comparable to the previously conducted best-five test for the 20 amino acid types (Ota and Nishikawa, 1997
). It was assumed that the best-14 score should be higher as the function improves, given that almost all native rotamers (residues) fit well to the native site.
|
|
To estimate the relative stability of a mutant protein to the wild-type (G), we have to know the energy level of the denatured state. Regarding the ensemble of non-native (mismatch) occurrences specified by the residue-pair (xy) and their distance (r) as the denatured state, the NMM model directly provides the relative side-chain packing energy (Equation 1
) between the native and denatured states. This approximation is allowable if we virtually define the denatured state of given x, y and r, maintaining the structurally compact state but lacking the structure (r)sequence (xy) relationships observed in the native state (Rooman and Wodak, 1995
). The stability of the mutant proteins relative to the wild-type (
Gc) was calculated and compared with the experimentally determined energy
Ge. Since the present approach with side-chain rotamers requires the 3D structures of the mutant proteins, mutants of human lysozyme and T4 lysozyme were employed, for which substantial stability and structural data are available. Seventy-seven single mutants of human lysozyme, analyzed mainly by Yutani and co-workers (Herning et al., 1993
; Takano et al., 1995
, 1997a
,b
, 1999a
,b
; Yamagata et al., 1998
; Funahashi et al., 1999
), and 103 point mutations of T4 lysozyme, analyzed by Matthews (Matthews, 1995
), were employed. The T4 lysozyme data are a mixture of point mutants of the wild-type and point mutants of lysozyme lacking two cysteines (Matthews, 1995
). We adopted both the real native protein (PDB code 3lzm) and pseudo-native, the double mutants of C54T and C97A (PDB code 1l63), as parent structures and mutants generated from either of the parent structures are called mutants and were treated in the same way. Each data set is a mixture of various mutants whose stability changes were attributed to several reasons, e.g. hydrophobicity, secondary structure propensity or hydrogen bonding (Matthews, 1995
; Funahashi et al., 2001
). We examined the total ability of functions for the stability analysis regardless of the reason for the stabilization of each mutant.
Figure 3A and B are the scatter plots of the theoretical values (
Gc), estimated with the rotamer function, against the experimental values (
Ge) for the human and T4 lysozymes, respectively. The two values correlate: the correlation coefficients for the human and T4 lysozymes are 0.58 and 0.69, as shown in Table II
. The results indicate a notable improvement as compared with the previous study, where the correlation coefficients for the human lysozyme data and T4 lysozyme data were 0.48 (Takano et al., 1999a
) and 0.56, respectively (Table II
). Since this calculation contains no adjustable parameters, the correlation coefficients are not comparable directly with other studies in which many factors were determined by regression (Gromiha et al., 1999
; Funahashi et al., 2001
). In this test, the Sippl_r function performs as well as the rotamer function (Table II
). We classified single-residue substitutions into four groups: between hydrophobic (C, F, I, L, M, V, W, Y)hydrophobic residues (denoted HH), hydrophilic (D, E, H, K, N, Q, R, S, T)hydrophilic residues (PP), hydrophobichydrophilic residues (HP) and those involving Ala, Gly or Pro (AGP). Substitutions between the residues of the same character, e.g. HH or PP, do not yield a good correlation between the computation and the experiment, whereas substitutions between different types, HP and AGP, significantly correlate: the correlation coefficients are 0.65 (human lysozyme) and 0.76 (T4 lysozyme). Also, correlation at the buried sites is remarkable: 0.64 (human lysozyme) and 0.73 (T4 lysozyme). It is noteworthy that the energy scale is almost identical for the ordinate and the abscissa in Figure 3
(ratio 0.6). In other words, the theoretical and experimental values are similar on the absolute scale (Mirny and Shakhnovich, 1996
), although the knowledge-based function has frequently been criticized regarding this point (Thomas and Dill, 1996
; Ben-Naim, 1997
).
|
The structurerecognizesequence protocol, named the inverse-folding search, was conducted with the rotamer and Sippl_r functions. A benchmark set composed of 382 structures used in the previous work was also employed (Ota and Nishikawa, 1999); 177 of them have at least one related protein belonging to the same superfamily of the SCOP database (Murzin et al., 1995
) and the structures of 165 queries (<400 residues long) sought their homologous sequences (including the native one) in the sequence library. Here 657 true pairs and 62 373 false pairs constitute all of the 3D1D matches.
Employing so-called the jack-knife procedure, the 3D profiles of query structures were accomplished by each of the functions derived from this 382 data set. Assigning the native type of the rotamer to each sequence, e.g. V3L3S1E2G1E3W3Q2L3V2 ... for the sperm whale myoglobin (1mbd) sequence, 3D1D alignment using the rotamer-based 3D profile was carried out with the optimized gap penalties. Since the conservation of the rotamer type (1 angle) is not obvious in the remote homologous proteins, e.g. only 41% of rotamer types are identical between 1mbd and 1ash (ascaris hemoglobin domain I) according to FSSP alignment (Holm and Sander, 1996
), the assignment of the native rotamer type into the sequence does not always guarantee an improvement of the performance. The alignment score was just regarded as the compatibility score (Ota and Nishikawa, 1999
) and transformed into the SD (standard deviation unit) score.
Figure 4 is a sensitivity (number of the true positives/number of the trues)specificity (number of the true positives/number of the positives) plot, which shows how many of the possible trues are detected at a given confidence level. The curve drawn toward the upper-right corner is excellent. A clear improvement is seen for the rotamer function (thick line) from the previous function, ON97 (thick broken line) (Ota and Nishikawa, 1997
, 1999
). In this analysis, an ~3% difference in the sensitivity is statistically significant (Ota and Nishikawa, 1999
). The Sippl_r function (thin broken line) performs as well as or slightly worse than the ON97 function. To estimate the ability for the actual sequence database search with the rotamer function, where the rotamer type of the sequence is unknown, we transformed the rotamer-based 3D profile to a residue- based 3D profile. Among the scores of the rotamers belonging to a given residue, the best of them was chosen. The results of this search using these shrunk 3D profiles are shown with the thin line (rotamer_b). The sensitivity specificity curve of rotamer_b overlaps with the curve of the rotamer above 0.95 specificity and appears to perform better than the previous results (ON97) above 0.90 specificity. Even the rotamer profile search (rotamer) is still slightly inferior to the sequence similarity search using the Dayhoff PAM250 matrix (PAM250) (Dayhoff et al., 1978
); however, if it is properly combined with the sequence information (Domingues et al., 2000
; Panchenko et al., 2000
), it outperforms the sequence similarity search (rotamer_b + PAM250).
|
A de novo sequence design of eight proteins chosen from Table I was conducted by rotamer, ON97 and Sippl_r functions. Using only the one-body function (hydration and local conformation), an optimal sequence was determined in advance. Mounting this sequence on the protein backbone structure, a 3D profile was produced by adding the two-body function. The next, tentative optimal sequence was derived from this 3D profile and used for the next calculation (as the sequence mounting on the structure). This cycle was iterated recursively until the self-consistent sequence (SCS) was given (Isogai et al., 1999
). When oscillation occurred between two fixed sequences, we mixed them randomly and continued the calculation. Fifty SCS sequences were generated for each of target proteins and the one with the best compatibility score was selected.
The quality of a designed protein is not easy to assess, other than by experiments and synthesis of the designed protein. However, the molecular mass and the match of the hydrophobic (H)/hydrophilic (P) amino acids with the native sequence are useful criteria to check the rationality of the designed sequences prior to experiments, as shown by Isogai et al. (Isogai et al., 1999). In Table III
, the match of the hydrophobicity in the designed sequences and their molecular masses are shown. HPid is the identical ratio between the native and designed sequences, each of which is represented by H/P symbols, and HPr is the correlation coefficient of both H/P patterns. Here we regard A, C, F, I, L, M, V, W and Y as hydrophobic.
|
|
|
The only difference between the side-chain packing functions of rotamer and Sippl_r is the denominator in the argument of the logarithmic function (see Equations 1 and 2), i.e. the reference state for the normalization (Rooman and Wodak, 1995
; Miyazawa and Jernigan, 1999
; Ota and Nishikawa, 1999
). The former employs the distribution of the rotamer distance under virtually settled mismatch according to the NMM model [f xyM (r) in Equation 1
], whereas the latter employs the distribution of any rotamer pairs observed in the native proteins [f(r) in Equation 2
]. To investigate this difference further, we plotted scores of each function [
ExyNMM(r) and
ExySp(r)] in Fig. 6
. Surprisingly, corresponding score values over a 3 Å distance are very similar and clustered around 0 kcal/mol (blue dots). The other values, scores within 3 Å (red dots), are significantly affected by the choice of the reference state. Almost all red dots are on or under the diagonal, indicating that the Sippl_r function yields better scores within a 3 Å rotamer distance. In other words, the repulsive part of Sippl_r is very weak. This is the essential difference between the rotamer and Sippl_r functions. Rotamer pairs within 3 Å distance are rare in the real protein structures, whereas this is not the case in mismatch, resulting in a large f xyM (r). Referring to this virtual situation, rotamer bumps are properly inhibited with the rotamer function. To confirm the effect of this repulsive term, we made a modified Sippl_r function in which only the scores within a 3 Å distance are exchanged by that of the rotamer function. The best-14 score of this function is 71% and it predicts rotamer type with 76% accuracy. This superiority over the results of the Sippl_r function (Table II
) indicates the crucial role of steric hindrance in the short-range interaction. Interestingly, most of the scores that are frequently used in structural stability analysis are the scores of over 3 Å rotamer distance (data not shown). This is the reason why the rotamer and Sippl_r functions behave equivalently in the stability analysis (Table II
). Each of the mutant proteins examined here folds into its unique structure with least deviation from its native structure, and therefore it is unnecessary to evaluate the unrealistic collisions within a 3 Å rotamer distance.
|
![]() |
Notes |
---|
![]() |
Acknowledgments |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Bairoch,A. and Apweiler,R. (2000) Nucleic Acids Res., 28, 4548.
Ben-Naim,A. (1997) J. Chem. Phys., 107, 36983706.[ISI]
Bernstein,F.C., Koetzle,T.F., Williams,G.T.B., Meyer,E.F., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535542.[ISI][Medline]
Betz,S., Liebman,P. and DeGrado,W. (1997) Biochemistry, 36, 24502458.[ISI][Medline]
Bowie,J.U., Lüthy,R. and Eisenberg,D. (1991) Science, 253, 164170.[ISI][Medline]
Dahiyat,B. and Mayo,S. (1996) Protein Sci., 5, 895903.
Dahiyat,B. and Mayo,S. (1997) Science, 278, 8287.
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) In Dayhoff,M.O. (ed.), A Model of Evolutionary Change in Proteins, Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, DC, pp. 345352.
de Alba,E., Santoro,J., Rico,M. and Jimenez,M. (1999) Protein Sci., 8, 854865.[Abstract]
Desmet,J., De Maeyer,M., Hazes,B. and Lasters,I. (1992) Nature, 356, 539542.[ISI]
Domingues,F.S., Lackner,P., Andreeva,A. and Sippl,M.J. (2000) J. Mol. Biol., 297, 10031013.[ISI][Medline]
Dunbrack,R.L. and Karplus,M. (1992) J. Mol. Biol., 230, 543574.[ISI]
Funahashi,J., Takano,K., Yamagata,Y. and Yutani,K. (1999) Protein Eng., 12, 841850.
Funahashi,J., Takano,K. and Yutani,K. (2001) Protein Eng., 14, 127134.
Gilis,D. and Rooman,M. (1997) J. Mol. Biol., 272, 276290.[ISI][Medline]
Godzik,A. (1995) Protein Eng., 8, 409416.[Abstract]
Gromiha,M.M., Oobatake,M., Kono,H., Uedaira,H. and Sarai,A. (1999) Protein Eng., 12, 549555.
Hecht,M., Richardson,J., Richardson,D. and Ogden,R. (1990) Science, 249, 884891.[ISI][Medline]
Herning,T., Yutani,K., Inaka,K., Kuroki,R., Matsushima,M. and Kikuchi,M. (1993) Biochemistry, 32, 70777085.
Holm,L. and Sander,C. (1996) Science, 273, 595602.
Isogai,Y., Ota,M., Fujisawa,T., Izuno,H., Mukai,M., Nakamura,H., Iizuka,T. and Nishikawa,K. (1999) Biochemistry, 38, 74317443.[ISI][Medline]
Isogai,Y., Ishii,A., Fujisawa,T., Ota,M. and Nishikawa,K. (2000) Biochemistry, 39, 56835690.[ISI][Medline]
Jones,D.T. (1994) Protein Sci., 3, 567574.
Kocher,J.A., Rooman,M.J. and Wodak,S.J. (1994) J. Mol. Biol., 235, 15981613.[ISI][Medline]
Matthews,B. (1995) Adv. Protein Chem., 46, 249278.[ISI][Medline]
Melo,F. and Feytmans,E. (1997) J. Mol. Biol., 267, 207222.[ISI][Medline]
Mirny,L. and Shakhnovich,E. (1996) J. Mol. Biol., 264, 11641179.[ISI][Medline]
Miyazawa,S. and Jernigan,R.L. (1985) Macromolecules, 18, 534552.[ISI]
Miyazawa,S. and Jernigan,R. (1999) Proteins, 36, 357369.[ISI][Medline]
Murzin,A. (1999) Proteins, 37(S3), 88103.[ISI][Medline]
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536540.[ISI][Medline]
Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48, 443453.[ISI][Medline]
Ota,M. and Nishikawa,K. (1997) Protein Eng., 10, 339351.[Abstract]
Ota,M. and Nishikawa,K. (1999) Protein Sci., 8, 10011009.[Abstract]
Ota,M., Kanaya,S. and Nishikawa,K. (1995) J. Mol. Biol., 248, 733738.[ISI][Medline]
Panchenko,A.R., Marchler-Bauer,A. and Bryant,S.H. (2000) J. Mol. Biol., 296, 13191331.[ISI][Medline]
Ponder,J.W. and Richards,M. (1987) J. Mol. Biol., 193, 775791.[ISI][Medline]
Quinn,T., Tweedy,N., Williams,R., Richardson,J. and Richardson,D. (1994) Proc. Natl Acad. Sci. USA, 91, 87478751.[Abstract]
Regan,L. and DeGrado,W.F. (1988) Science, 241, 976978.[ISI][Medline]
Rooman,M.J. and Wodak,S.J. (1995) Protein Eng., 8, 849858.[Abstract]
Samudrala,R. and Moult,J. (1998) J. Mol. Biol., 275, 895916.[ISI][Medline]
Shakhnovich,E. and Gutin,A. (1993) Protein Eng., 6, 793800.[Abstract]
Sippl,M.J. (1990) J. Mol. Biol., 213, 859883.[ISI][Medline]
Takano,K., Ogasawara,K., Kaneda,H., Yamagata,Y., Fujii,S., Kanaya,E., Kikuchi,M., Oobatake,M. and Yutani,K. (1995) J. Mol. Biol., 254, 6276.[ISI][Medline]
Takano,K., Funahashi,J., Yamagata,Y., Fujii,S. and Yutani,K. (1997a) J. Mol. Biol., 274, 132142.[ISI][Medline]
Takano,K., Yamagata,Y., Fujii,S. and Yutani,K. (1997b) Biochemistry, 36, 688698.[ISI][Medline]
Takano,K., Ota,M., Ogasahara,K., Yamagata,Y., Nishikawa,K. and Yutani,K. (1999a) Protein Eng., 12, 663672.
Takano,K., Yamagata,Y., Kubota,M., Funahashi,J., Fujii,S. and Yutani,K. (1999b) Biochemistry, 38, 66236629.[ISI][Medline]
Thomas,P. and Dill,K. (1996) J. Mol. Biol., 257, 457469.[ISI][Medline]
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 46734680.[Abstract]
Yamagata,Y., Kubota,M., Sumikawa,Y., Funahashi,J., Takano,K., Fujii,S. and Yutani,K. (1998) Biochemistry, 37, 93559362.[ISI][Medline]
Received January 12, 2001; revised March 28, 2001; accepted April 22, 2001.