1College of Life Science and Bioengineering, Beijing University of Technology, Beijing 100022, 2Department of Astronomy and Applied Physics, University of Science and Technology of China, Hefei 230026 and 3Institute of Biophysics, Academia Sinica, Beijing 100101, China
4 To whom correspondence should be addressed. e-mail: cxwang{at}bjut.edu.cn
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: minimization/off-lattice model/protein folding prediction/relative entropy
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Theory and computational method |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
where P and P0 mean different probabilities. For a given configuration r = {ri}, the probability P0 that the molecule adopts sequence S = {si} is written as
The probability P that the molecule has a specially assigned sequence (the target sequence) S
= {si
} is defined as
From the definition, it is easily shown that the relative entropy G 0. Obviously, the measure G has minimum value when P0 = P
. The property of G determines that it is a measure to evaluate the difference between the distribution functions P0 of random sequence and the distribution P
with the specially assigned sequence in sequence space. Thus, the difference between the distributions P0 and P
is not directly evaluated by P0 P
, but by the mean of the difference of their logarithms. When the protein conformation closes to the native structure adopted by the given sequence, the difference, in the sense of measure G, between the distribution P0 and the distribution P
tends to be small. This is because on the native conformation of the given sequence S
, P0 has a very large value at least on this sequence S
from the evolutionary point of view, which results in a small value of G. Our aim is to find a good distribution function P0 closest to P
through minimizing G, rather than a minimum energy or free energy of the system through energy minimization. Therefore, the folding prediction can be done by searching the conformational space to find an optimal structure
for a sequence {si
} through minimizing G. The gradient descent algorithm for minimization can be expressed as
where is an adjustable parameter with a value between 0 and 1 for controlling the convergence speed and the subscript i denotes the ith C
atom. As shown in Equation 4, the measure G substitutes the systems potential <H>
used in energy minimization.
Moreover, using the definition of P in Equation 3, it is found that the spatial derivative
is zero, because the last two terms in the brackets are equal owing to the limitation of the Kronecker delta functions. Then, using P0 and P defined in Equations 2 and 3, Equation 4 can be reduced to
where the second term in the brackets,
, is a mean value that is independent of the amino acid sequence, but dependent on conformation space {ri}.
A simple Hamiltonian of protein system using the type of the contact potential is written as
where
denotes the contact strength depending on the distance rij between the ith and jth residues, U(si,sj) is the contact potential between residues si and sj. For a real protein, we have chosen a simple form of U(si,sj) given by Li et al. (Li et al., 1997
), who showed that a simple equation could be used as a good approximation of Miyazawa and Jernigans 20x20 potential matrix elements (Miyazawa and Jernigan, 1985, 1996
), which are statistically deduced pairwise interaction potential energies among the 20 types of amino acids:
where the subscripts i and j label the 20 amino acid residues and the three coefficients are set to c0 = 1.38, c1 = 5.08 and c2 = 7.40, in units of RT, the gas constant times room temperature. Each si corresponds to a value qi [see Table I in Wang and Lee (Wang and Lee, 2000)].
Using Equations 5 and 6, the final numerical iteration equation can be deduced from Equation 4 as
where the superscript k represents the kth iteration, ß = 1/RT and <U(si,sj)>0 denotes the average contact potential with respect to the probability distribution P0. In order to search the configurations in real space, a continuous form of the function
[denoted A(rij) for simplicity] is adopted in our work:
where d is a value around the contact distance between residues (it is set to 6.0 Å) and n, and
are adjustable parameters. The distance between the connective residues is constrained by the SHAKE algorithm as a bond (Ryckaert et al., 1977
), and therefore the interaction between any two connective residues is skipped. The first exponential term in Equation 9 can be considered as a continuous approximate form of a delta function, which vanishes quickly when rij > d. This term together with the U(si,sj) factor counts for the major driving force of protein folding: hydrophobic and hydrophilic interactions, in which the d value (6.0 Å) just corresponds to the contact distance. The second term in Equation 9 can be considered as an additional term used to prevent some residues, such as hydrophobic residues, from moving closely. In this work, a van der Waals-like potential is used, which has a smoother distance dependence than the ordinary van der Waals function. This term results in a potential barrier at a small contact distance between two residues around 2 Å.
In order for a sequence to fold into a stable native structure, it is reasonable to suggest that the native state has an energy that is much lower than the energies of the bulk of misfolded states, especially of the denatured states (Shakhnovich and Gutin, 1993; Deutsch and Kurosky, 1996
; Shakhnovich, 1998
). It also holds that there is a large energy gap between the energy of the ground state and the average energy of all possible conformations. As treated below, the average energy of a protein is considered to be approximately equal to that of the denatured state, because the conformational space is mainly occupied by the denatured states. The average energy is related to the term <U(si,sj)>0 in Equation 8. Because the energy of the native state is less than the average, minimizing the quantity in Equation 8 is likely to result in a large energy gap. In addition, our method is essentially to find a structure with higher occupation probability, and therefore the predicted structure should correspond to the conformation with nearly the lowest free energy. With some modifications of this minimization function, it is just the form of the difference in free energy used in the work on reverse protein folding (Deutsch and Kurosky, 1996
). Our method is nearly identical with those used in the learning theory of neural networks, the distinction being that the probability functions are Gibbs distributions in this work.
A difficult task is to estimate the value of the term <U(si,sj)>0 in Equation 8. It can be estimated from the mean field theory. However, in order to test our computational approach, we prefer to estimate <U(si,sj)>0 by a more simplified approach. Because the conformational space is mainly occupied by the denatured or misfolded states, the mean energy <H>0 can be considered as the mean energy of denatured states, which can be estimated by calculating the energy of the denatured state of the given sequence. <H>0, the ensemble average of H(s,r) over the sequence S, is expressed as
where <U(si,sj)>0 can be assumed to be a constant independent of si, sj for simplicity and à denotes the average of A(rij).
The energy of the native state for the given sequence s = {si
} can be divided into two parts:
where the second term corresponds to the long-range interaction. The denatured state energy H is estimated by neglecting all long sequence-range interactions, all those terms whose sequence separation is >1 and taking some average distance between residues. From the above, <H>0 can be taken into account as the denatured state energy H
, hence Equation 10 can be written as
where we used two definitions,
and
Obviously, -U can be calculated from the given sequence. Thus,
where we define an adjustable parameter km = -A/à to avoid calculating à and -A directly. Now <U(si,sj)>0 can be estimated from the given sequence under a condition of an adjustable parameter km.
Equation 8 shows that our approach differs from the energy minimization method by the additional term <U(si,sj)>0. The appearance of this term gives our approach close similarity to other work (Deutsch and Kurosky, 1996) using free energy as a minimization function.
In all performance tests, we set in Equation 4 equal to 0.13 and the temperature T = 1. To decide simply on the values of the unknown parameters n,
and
, the approach was first tested on bovine pancreatic trypsin inhibitor (BPTI). We adjusted the three parameters to achieve the most accurate prediction result and they were finally set as n = 360 Å2,
= 0.17 and
= 1.35 Å. The parameter km was assigned a value of 7.2 for our target proteins. The iteration of our algorithm was considered to be convergent when the positional difference between two consecutive iterations for each bead was <0.001 Å. Therefore, the tolerance for SHAKE was set to 0.001 Å. It should be noted that when the initial bond length deviated too far from the constrained length, a harmonic potential was used to constrain the bond length first, followed by the SHAKE constraint method. For a more efficient conformational search, simulated annealing was adopted.
![]() |
Results and discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
The feature of our approach lies in the additional term of the mean contact potential <U(si,sj)>0 in Equation 8. Its value is fairly small and generally much less than the U(si,s
j) for the small proteins. This additional term imposes obvious effects on the predicted results and the convergence speed for real proteins. A comparison between the performances of the two algorithms with and without the term was made for 1ejg and 1bpi. For 1ejg, the algorithm with this term converges after 2257 iteration steps and produces 76 native contacts with an r.m.s.d. of 5.2 Å (144 contacts in total in the native structure). However, the algorithm without the mean potential term of Equation 8 converges after 3673 iteration steps and produces 71 native contacts with an r.m.s.d. of 5.9 Å. For 1bpi, the algorithm with the mean potential term converges after 2728 iteration steps and produces 92 native contacts with an r.m.s.d. of 6.8 Å (180 contacts in total in the native structure). However, the algorithm without the term converges after 4390 iteration steps and produces 91 native contacts with an r.m.s.d. of 7.0 Å. In most cases, the algorithm with the mean potential results in an increase in the native contacts and therefore the prediction accuracy. The previous minimization method aims to find a structure with the minimum energy. However, the essence of our method is to search the conformational space with a Boltzmann distribution, namely, to find a structure with a maximum occupying probability. This treatment is able to be more consistent with the argument that the native state of a protein does not just stay on the state with the minimum energy; rather, it corresponds to the state with the lowest free energy.
However, as in other minimization methods, there is still a common problem in this method that the predicted result is dependent on the initial conformation. The folded structure can be trapped in the state with the local energy minimum. Table II lists the predicted results for 1bpi with the different initial conformations. Generally, the larger the deviation of the initial conformation from the native conformation, the worse the prediction accuracy will become. To reveal further the difference between our relative entropy minimization method and energy minimization method, the contact potentials of each folded structure are also shown in Table II. It is interesting that the better predicted conformation (with smaller r.m.s.d.) does not necessarily have the lower potential. This indicates that the energy minimization method might fail to determine the native conformation of a protein. The fact that this method tends to select the folded conformation with lower r.m.s.d., instead of with a lower potential, implies that the entropy effect plays an important role in the determination of the native conformation of a protein. The relative entropy of each folded structure is not given in Table II, because the calculation cannot be achieved directly from our minimization method. Generally, the calculation of relative entropy defined in this work needs very expensive sampling in conformational space for real proteins, which is not the aim of our work. In fact, the advantage of this method in practice is just that the calculation of entropy or free energy is avoided and the degrees of freedom in sequence space are averaged out.
|
We have presented a new and efficient algorithm for protein folding, which essentially searches the conformation space obeying a Boltzmann distribution to find a conformation on which the probability P0 is close to P of a given sequence. As a reasonable result, the found conformation is near the native structure of the given sequence. This approach is based entirely on physical principles and is fundamentally different from other structure prediction methods that employ homology modeling, threading and statistical comparisons with the known crystal structures. Moreover, this method only adopts a simple, generalized contact potential that does not include angle, torsion and other forms of potential (Lee et al., 1999
). As a result, this method just predicts the frame of the protein backbone, and the conformation obtained does not contain the detailed structure and image of some parts of the native structure. An additional van der Waals-like potential is required in our method. This term is effective in the short range and actually acts as a repulsion force on closely located residues. Of course, other function forms can replace this term. However, as suggested elsewhere (Li et al., 1997
), Equation 7 underestimates the attractions between positively and negatively charged amino acids and in contact between both Cys residues. In addition, owing to the statistical nature of Miyazawa and Jernigans matrix, certain features of inter-residue interactions (such as orientational dependence of the interaction, side-chain packing, etc.) are averaged out. Specific features may be necessary for building a more realistic potential for protein folding. In principle, this approach can be applied as a uniform frame for both folding and inverse folding of proteins (Wang et al., 1999
). With some changes in the definition of P0 and P
, in which the sum on {si} space is changed on r space, Equation 1 can lead to the algorithm for the reverse folding problem (Wang et al., 1999
).
![]() |
Acknowledgements |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Bonneau,R., Strauss,C.E.M. and Baker,D. (2001) Proteins, 43, 111.[CrossRef][ISI][Medline]
Deutsch,J.M. and Kurosky,T. (1996) Phys. Rev. Lett., 76, 323326.[CrossRef][ISI][Medline]
Hinds,D.A. and Levitt,M. (1994) J. Mol. Biol., 243, 668682.[ISI][Medline]
Huang,E.S., Samudrala,R. and Ponder,J.W. (1999) J. Mol. Biol., 290, 267281.[CrossRef][ISI][Medline]
Lee,J., Liwo,A. and Scheraga,H.A. (1999) Proc. Natl Acad. Sci. USA, 96, 20252030.
Li,H., Tang,C. and Wingreen,N.S. (1997) Phys. Rev. Lett., 79, 765768.[CrossRef][ISI]
Miyazawa,S. and Jernigan,R.L. (1985) Macromolecules, 18, 534552.[ISI]
Miyazawa,S. and Jernigan,R.L. (1996) J. Mol. Biol., 256, 623644.[CrossRef][ISI][Medline]
Moult,J., Hubbard,T., Fidelis,K. and Pedersen,J.T. (1999) Proteins, 3 (Suppl.), 26.[CrossRef]
Mumenthaler,C. and Braun,W. (1995) Protein Sci., 4, 863871.
Orengo,C.A., Bray,J.E., Hubbard,T., LoConte,L. and Sillitoe,I. (1999) Proteins, 3 (Suppl.), 149170.[CrossRef]
Osguthorpe,D.J. (2000) Curr. Opin. Struct. Biol., 10, 146152.[CrossRef][ISI][Medline]
Reva,B.A., Finkelstein,A.V. and Skolnic,J. (1998) Fold. Des., 3, 141147.[ISI][Medline]
Ryckaert,J.P., Ciccotti,G. and Berendsen,H.J.C. (1997) J. Comput. Phys., 23, 327341.
Shakhnovich,E.I. (1994) Phys. Rev. Lett., 72, 39073910.[CrossRef][ISI][Medline]
Shakhnovich,E.I. (1998) Fold. Des., 3, R45R58.[ISI]
Shakhnovich,E.I. and Gutin,A.M. (1993) Proc. Natl Acad. Sci. USA, 90, 71957199.[Abstract]
Sun,S.J., Thomas,P.D. and Dill,K.A. (1995) Protein Eng., 8, 769778.[Abstract]
Venclovas,C., Zemla,A., Fedelis,K. and Moult,J. (1999) Proteins, 3 (Suppl.), 231237.[CrossRef]
Wang,B.H. et al. (1999) J. Biosci., 24 (Suppl. 1), 61.
Wang,Z.H. and Lee,H.C. (2000) Phys. Rev. Lett., 84, 574577.[CrossRef][ISI][Medline]
Zhou,Y.Q. and Karplus,M. (1999) J. Mol. Biol., 293, 917950.[CrossRef][ISI][Medline]
Received June 25, 2002; revised June 6, 2003; accepted July 29, 2003.