1 Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305 and 3 Cereon Genomics, 45 Sidney Street, Cambridge, MA 02139, USA
Abstract |
---|
Keywords: conditional probability/discriminatory function/knowledge-based/protein structure prediction/side chain construction
Introduction |
---|
This approximation reduces atomic detail and leads to certain information being ignored, such as the side chain–side chain and side chain–main chain atom interactions. However, considering the interactions of side chain atoms with other atoms in the environment has been shown to help discriminatory functions choose near-native conformations in the sample space more accurately (Samudrala and Moult, 1998a). Given the intractability of searching protein conformational space with an all-atom representation, a two-step procedure in which the search initially focuses on the main chain, and side chains are added before the conformations are evaluated, would be very useful. The method for side chain construction must be computationally efficient and as accurate as possible, considering that the main chains generated will be fairly distant (beyond 2.0 Å Cα r.m.s.d.) from the native conformation.
In this work, we first constructed side chains using four previously published methods (Levitt, 1992; Koehl and Delarue, 1994; Bower et al., 1997; Samudrala and Moult, 1998b) on four proteins with a varying number of near-native (within 4.0 Å Cα r.m.s.d.) conformations generated by two ab initio protein structure prediction methods (Park and Levitt, 1996; Simons et al., 1997). We compared the performance of these approaches with a naive approach that simply uses the side chain rotamer most frequently observed in protein structures. We then used a residue-specific all-atom conditional probability discriminatory function (RAPDF) to select the lowest scoring all-atom conformations generated by each of the methods and compared the discrimination results obtained using all-atom information with those obtained using only main chain information. The implications of these results for ab initio protein structure prediction are discussed.
Methods |
---|
Table I gives the details of the four proteins that were selected for evaluating side chain construction. The proteins had been used for two ab initio protein structure prediction studies where only the main chain information was used to explore the conformational space (Park and Levitt, 1995; Simons et al., 1997). These methods generate conformations with Cα r.m.s.d. between 1.5 and 12.0 Å for these proteins, and a subset of conformations within 4.0 Å Cα r.m.s.d. was chosen for this study.
The method of Park and Levitt builds all main chains using four discrete (φ, ψ) values. The native secondary structure is fixed and designated residues are permitted to explore all possibilities of the four (φ, ψ) values in a combinatorial fashion (Park and Levitt, 1996). This method generates conformations with lower Cα r.m.s.d. than the method of Simons et al., but is more sensitive to knowledge of the exact protein secondary structure. The original ab initio-generated coordinates for the two methods are available from the Decoys 'R' Us database at <http://dd.stanford.edu>.
Residue-specific all-atom conditional probability discriminatory function (RAPDF)
We use an all-atom distance-dependent conditional probability-based discriminatory function to calculate the probability of a native structure, given a set of distances between pairs of atoms. A full description can be found in Samudrala and Moult (1998a). Briefly, the required probabilities are compiled by counting frequencies of distances between pairs of atom types in a database of protein structures. All non-hydrogen atoms are considered and the description of the atoms is residue specific, i.e. the Cα of an alanine is different from the Cα of a glycine. This results in a total of 167 atom types. We divide the distances observed into 1.0 Å bins ranging from 3.0 to 20.0 Å. Contacts between atom types in the 0.0–3.0 Å range are placed in a separate bin, resulting in a total of 18 distance bins.
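The binning scheme can be illustrated with a minimal Python sketch. The function name `distance_bin`, the half-open treatment of bin boundaries and the choice to return `None` beyond 20.0 Å are assumptions made for illustration; the text specifies only the bin widths and ranges.

```python
import math

# One contact bin (0.0-3.0 Å) plus seventeen 1.0 Å bins covering 3.0-20.0 Å.
N_BINS = 18


def distance_bin(d):
    """Return the bin index (0..17) for a distance d in Å, or None if d >= 20.0 Å.

    Assumption: bins are half-open, so a distance of exactly 3.0 Å falls in
    the first 1.0 Å bin rather than in the contact bin.
    """
    if d < 0.0:
        raise ValueError("distance must be non-negative")
    if d < 3.0:
        return 0  # separate bin for close contacts
    if d >= 20.0:
        return None  # outside the tabulated range
    return int(math.floor(d - 3.0)) + 1
```

For example, `distance_bin(2.9)` gives bin 0 and `distance_bin(19.5)` gives bin 17, the last of the 18 bins.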
We compile tables of scores $s$ proportional to the negative log conditional probability that we are observing a native conformation given an interatomic distance $d$, for all possible pairs of the 167 atom types, $a$ and $b$, and for the 18 distance ranges:

$$ s(d_{ab}) = -\ln \frac{P(d_{ab} \mid C)}{P(d_{ab})} $$
where $P(d_{ab} \mid C)$ is the probability of observing a distance $d$ between atom types $a$ and $b$ in a correct structure and $P(d_{ab})$ is the probability of observing such a distance in any structure, correct or incorrect. The required ratios are obtained as follows:

$$ \frac{P(d_{ab} \mid C)}{P(d_{ab})} = \frac{N(d_{ab}) \,/\, \sum_{d} N(d_{ab})}{\sum_{ab} N(d_{ab}) \,/\, \sum_{d} \sum_{ab} N(d_{ab})} $$

where $N(d_{ab})$ is the number of observations of atom types $a$ and $b$ in a particular distance bin $d$, $\sum_{d} N(d_{ab})$ is the number of $a$–$b$ contacts observed for all distance bins, $\sum_{ab} N(d_{ab})$ is the total number of contacts between all pairs of atom types $a$ and $b$ in a particular distance bin $d$ and $\sum_{d} \sum_{ab} N(d_{ab})$ is the total number of contacts between all pairs of atom types $a$ and $b$ summed over all the distance bins $d$. No intra-residue distances are included in the summation. The tables of scores are compiled from a set of 312 unique folds from the SCOP database (Hubbard et al., 1997).
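As a rough illustration of how such a score table could be assembled from contact counts, the following Python sketch follows the four count terms defined above. The function name `rapdf_scores`, the dictionary layout and the decision to skip zero counts are assumptions for this sketch and are not taken from the original implementation.

```python
import math
from collections import defaultdict


def rapdf_scores(counts):
    """Turn contact counts into scores.

    counts maps (atom_type_pair, distance_bin) -> N(d_ab), the number of
    observations of that pair of atom types in that distance bin.
    Returns a dict with the same keys holding -ln[P(d_ab|C) / P(d_ab)].
    """
    per_pair = defaultdict(int)  # sum over d of N(d_ab)
    per_bin = defaultdict(int)   # sum over ab of N(d_ab)
    total = 0                    # sum over d and ab of N(d_ab)
    for (pair, dist_bin), n in counts.items():
        per_pair[pair] += n
        per_bin[dist_bin] += n
        total += n

    scores = {}
    for (pair, dist_bin), n in counts.items():
        if n == 0:
            continue  # assumption: unobserved contacts are simply left unscored
        p_given_correct = n / per_pair[pair]
        p_any = per_bin[dist_bin] / total
        scores[(pair, dist_bin)] = -math.log(p_given_correct / p_any)
    return scores


# Toy counts for two hypothetical atom-type pairs and two bins (not real data).
toy_counts = {
    (("A", "B"), 0): 30, (("A", "B"), 1): 10,
    (("C", "D"), 0): 10, (("C", "D"), 1): 30,
}
toy_scores = rapdf_scores(toy_counts)
```

In the toy table, the pair ("A", "B") is over-represented in bin 0 relative to all pairs, so its score there is negative (favourable), while in bin 1 it is under-represented and the score is positive.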
A naive approach for side chain construction
The naive approach simply constructs side chains based on the most frequently observed rotamer value in a database of protein structures (Table II). The particular values used were generated by the program mutate by R.Read (personal communication).
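The idea behind the naive approach can be sketched in a few lines of Python. This is not the mutate program; the function name `most_frequent_rotamers`, the rotamer labels and the toy observations are all hypothetical placeholders and do not reproduce the values in Table II.

```python
from collections import Counter


def most_frequent_rotamers(observations):
    """Return the most frequently observed rotamer for each residue type.

    observations is an iterable of (residue_type, rotamer_label) pairs, for
    example one pair per side chain found in a database of structures.
    """
    counts = {}
    for residue_type, rotamer in observations:
        counts.setdefault(residue_type, Counter())[rotamer] += 1
    return {res: c.most_common(1)[0][0] for res, c in counts.items()}


# Made-up observations, for illustration only.
toy_observations = [
    ("SER", "g+"), ("SER", "g+"), ("SER", "t"),
    ("VAL", "t"), ("VAL", "t"), ("VAL", "g-"),
]
naive_library = most_frequent_rotamers(toy_observations)
```

Because the lookup depends only on the residue type, the same side chain conformation is assigned regardless of the main chain it is built on, which is what makes the approach so inexpensive.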
Four previously published approaches were selected for comparison with the naive method. The approaches were primarily chosen because of their computational speed, their widespread use and their diversity in terms of methodology applied: (i) scgen uses the all-atom scoring function described above to select the lowest scoring rotamer from a discrete library considering interactions between the side chain atoms and the local main chain (Samudrala and Moult, 1998b); (ii) scmf uses self-consistent mean-field theory to position rotamers in conjunction with a van der Waals potential (Koehl and Delarue, 1994); (iii) scwrl uses a main chain dependent rotamer library to position side chains and minimizes the steric clashes (Bower et al., 1997); and (iv) segmod pastes in side chain conformers directly from a structural database, using a Boltzmann-weighted probability to choose the conformation in the context of the main chain and the side chains already positioned (Levitt, 1992). All methods assume a fixed main chain at the time of side chain placement.
Evaluating side chain placement accuracy
To evaluate side chain placement accuracy, we determine the percentage of χ angles constructed within ±40° of the values observed in the native conformation. We also use the all-atom r.m.s.d. between the near-native conformations and the experimental conformation, which is calculated using the equation

$$ \mathrm{r.m.s.d.} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \Delta x_i^2 + \Delta y_i^2 + \Delta z_i^2 \right)} $$

where $\Delta x_i$, $\Delta y_i$ and $\Delta z_i$ are distances in Cartesian space between $N$ corresponding atoms. Coordinate superposition is performed using the program align (Satow et al., 1986; McLachlan, 1979).
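Both accuracy measures are simple to compute once the structures have been superposed. The Python sketch below is illustrative only: the function names are hypothetical, the angle comparison assumes that differences should be taken on the circle (so that 170° and -170° differ by 20°), and no superposition is performed, whereas the study uses the program align for that step.

```python
import math


def angle_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0..180)."""
    diff = abs(a - b) % 360.0
    return 360.0 - diff if diff > 180.0 else diff


def chi_accuracy(model_chis, native_chis, tolerance=40.0):
    """Percentage of model angles within +/- tolerance degrees of native values."""
    if len(model_chis) != len(native_chis) or not model_chis:
        raise ValueError("need two non-empty angle lists of equal length")
    hits = sum(
        1 for m, n in zip(model_chis, native_chis)
        if angle_difference(m, n) <= tolerance
    )
    return 100.0 * hits / len(model_chis)


def rmsd(coords_a, coords_b):
    """R.m.s.d. between two already superposed lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))
```

The angle measure looks only at torsion angles and is therefore blind to errors in the main chain, whereas the r.m.s.d. is computed over all corresponding atoms and so reflects deviations in both the main chain and the side chains.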
Designation of buried residues
Computation of solvent accessibility for each side chain was performed using the software naccess (by S.J.Hubbard and J.M.Thornton of University College, London). Side chains with a relative solvent exposure within a 20% cutoff were considered to be buried.
Minimization of conformations
Since different methods have different protocols for generating side chains and since packing influences can have an effect on side chain accuracy, all conformations were minimized for 200 steps using encad (Levitt and Lifson, 1969; Levitt, 1974, 1983; Levitt et al., 1995) as a means of 'normalizing' the side chain models to remove severe steric strain. The all-atom r.m.s.d. between the minimized and unminimized conformations on average for each set is <0.1 Å.
Results and discussion |
---|
Figure 1 shows the accuracy of side chain construction by the five different methods on our test sets. All methods, including the naive approach, build side chains with similar accuracy (50–60%) in terms of the percentage of χ1 angles within 40° for all residues and for buried residues in the core. The trend is identical even when both χ1 and χ2 angles are considered (with accuracies ranging from 40 to 50%).
Even though the results are similar, three of the methods (scwrl, segmod and the naive approach) rely predominantly on the knowledge base of protein structures, whereas the other two (scgen and scmf) select side chains based on energetic criteria.
In previous work it has been shown that the percentage χ1 accuracy has a theoretical upper limit of about 60% given the steric constraints imposed by the near-native main chains for this set of conformations (Huang et al., 1998a). This, combined with the results from Figure 1, would indicate that the more sophisticated methods mimic the naive approach on main chains where the Cα r.m.s.d. ranges from 1.5 to 4.0 Å.
Distribution of all-atom r.m.s.d.s and discrimination by an all-atom scoring function are similar for all methods
Although the results from Figure 1 are interesting, they expose a limitation in determining the accuracy of side chain construction using the percentage of χ1 angles as a gauge on near-native main chains. This is because this measure does not account for the variance in the main chain conformations. Irrespective of how well the main chain is modeled, the χ1 angles will always remain the same: the percentage χ1 accuracy of the naive method on the native main chain and on a main chain that is, say, 10.0 Å away from the native conformation will be identical. A measure such as all-atom r.m.s.d. takes into account the variance in both the main chains and the side chains, and the expectation would be that the sophisticated methods would produce better all-atom r.m.s.d.s since they are designed to perform better on main chains that are closer to the native conformation. We therefore compute the all-atom r.m.s.d. for each set of conformations generated by the different methods to determine whether this expectation is true.
Figure 2 illustrates the distribution of the all-atom r.m.s.d.s for the five methods using a 'gel graph', where the limits of the horizontal bar indicate the all-atom r.m.s.d. range and the density of shading indicates the fraction of conformations observed at a particular r.m.s.d. Again, the naive method generates conformations with r.m.s.d. distributions that are as good as those observed for the more sophisticated methods.
Ignoring side chain information results in worse discrimination
It has been shown before that taking side chain interactions into account leads to better discrimination (Samudrala and Moult, 1998a). For this particular set, using the all-atom scoring function to discriminate using only main chain information leads to an average selection accuracy that is worse by about 0.8 Å all-atom r.m.s.d. over all the sets (Figure 2). The discrimination is consistently better when all-atom information is taken into account.
Implications for protein structure prediction
Our work does not attempt an exhaustive comparison of side chain methods. Rather, the goal was to compare a naive approach, based on using the most frequently observed rotamer in known protein structures, with a set of more sophisticated side chain construction methods and to determine the utility of side chain construction on near-native main chains. We have found that the simple naive approach performs as well as the more sophisticated methods.
The all-atom function does better at discriminating near-native conformations on main chains that are closer to the native conformation (Figure 2), and side chain information is indeed important to achieve this discrimination even at low resolution. Thus current ab initio methods that use reduced representations of proteins would be better off building side chains in the best manner possible. Given that millions or billions of conformations are generated in an ab initio simulation (Samudrala et al., 1999; Xia et al., 2000), taking side chains into account using the inexpensive naive approach appears to be a good trade-off between computation time and accuracy.
It is likely that as the conformations become closer to the native (within 2.0 Å Cα r.m.s.d.), side chain construction by the more sophisticated methods will have a greater impact on discrimination. Although current ab initio methods do not sample conformations to this resolution, this suggests a future approach for side chain construction in ab initio prediction: (i) construct side chains on all or a large subset of the main chains using the naive approach to produce detailed all-atom models; (ii) filter using an all-atom scoring function to give a low scoring subset; and (iii) for this subset, build side chains using one of the more sophisticated methods described previously.
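The three steps proposed above can be sketched as a small Python function. Everything here is hypothetical scaffolding: `build_naive`, `build_refined` and `score` stand in for a naive rotamer builder, one of the more sophisticated side chain programs and an all-atom scoring function, and the final ranking of the refined models is an assumption added for convenience.

```python
def two_stage_selection(main_chains, build_naive, build_refined, score, keep):
    """Hypothetical sketch of the proposed filtering scheme.

    main_chains   -- list of main chain conformations (any representation)
    build_naive   -- callable: main chain -> all-atom model (cheap)
    build_refined -- callable: main chain -> all-atom model (more expensive)
    score         -- callable: all-atom model -> score, lower is better
    keep          -- number of low scoring conformations to carry forward
    """
    # (i) Build side chains on every main chain with the naive approach.
    naive_models = [build_naive(mc) for mc in main_chains]

    # (ii) Keep the main chains whose naive all-atom models score lowest.
    ranked = sorted(zip(main_chains, naive_models), key=lambda pair: score(pair[1]))
    subset = [mc for mc, _ in ranked[:keep]]

    # (iii) Rebuild side chains on the subset with the more expensive method
    # and return the refined models ordered by score.
    refined = [build_refined(mc) for mc in subset]
    return sorted(refined, key=score)


# Toy stand-ins, for illustration only: conformations are plain integers.
result = two_stage_selection(
    main_chains=[3, 1, 2, 5],
    build_naive=lambda mc: mc * 10,
    build_refined=lambda mc: mc * 10 + 1,
    score=lambda model: model,
    keep=2,
)
```

The expensive builder is called only `keep` times, so its cost stays fixed however many conformations the initial search produces.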
The results obtained here used 'vanilla' versions of the side chain modeling software. Improvements to the more sophisticated methods have been published (Dunbrack, 1999). Similarly, the naive approach can be refined as the knowledge base of known protein structures becomes larger, which may lead to more accurate all-atom models and consequently more accurate discrimination.
Notes |
---|
Acknowledgments |
---|
References |
---|
Deisenhofer,J. (1981) Biochemistry, 20, 2361–2370.
Dunbrack,R. (1999) Proteins: Struct. Funct. Genet., S3, 81–87.
Huang,E., Koehl,P., Levitt,M., Pappu,R. and Ponder,J. (1998a) J. Mol. Biol., 33, 204–217.
Huang,E., Samudrala,R. and Ponder,J. (1998b) Protein Sci., 7, 1998–2003.
Hubbard,T., Murzin,A., Brenner,S. and Chothia,C. (1997) Nucleic Acids Res., 25, 236–239.
Kissinger,C.R., Liu,B.S., Martin-Blanco,E., Kornberg,T.B. and Pabo,C.O. (1990) Cell, 63, 579–590.
Koehl,P. and Delarue,M. (1994) J. Mol. Biol., 239, 249–275.
Koehl,P. and Levitt,M. (1999) Nature Struct. Biol., 6, 108–111.
Lee,J., Liwo,A., Ripoll,D., Pillardy,J. and Scheraga,J. (1999) Proteins: Struct. Funct. Genet., S3, 204–208.
Levitt,M. (1974) J. Mol. Biol., 82, 393–420.
Levitt,M. (1983) J. Mol. Biol., 168, 595–620.
Levitt,M. (1992) J. Mol. Biol., 226, 507–533.
Levitt,M. and Lifson,S. (1969) J. Mol. Biol., 46, 269–279.
Levitt,M., Hirshberg,M., Sharon,R. and Daggett,V. (1995) Comput. Phys. Commun., 91, 215–231.
McLachlan,A. (1979) J. Mol. Biol., 128, 49–79.
Mondragon,A., Subbiah,S., Almo,S.C., Drottar,M. and Harrison,S.C. (1989) J. Mol. Biol., 205, 189–200.
Moult,J., Hubbard,T., Fidelis,K. and Pedersen,J. (1999) Proteins: Struct. Funct. Genet., S3, 2–6.
Mumenthaler,C. and Braun,W. (1995) Protein Eng., 4, 863–871.
Orengo,C., Bray,J., Hubbard,T., LoConte,L. and Sillitoe,J. (1999) Proteins: Struct. Funct. Genet., S3, 149–170.
Ortiz,A., Kolinski,A., Rotkiewicz,P., Ilkowski,B. and Skolnick,J. (1999) Proteins: Struct. Funct. Genet., S3, 177–185.
Osguthorpe,D. (1999) Proteins: Struct. Funct. Genet., S3, 186–193.
Park,B., Huang,E. and Levitt,M. (1997) J. Mol. Biol., 266, 831–846.
Park,B. and Levitt,M. (1995) J. Mol. Biol., 249, 493–507.
Park,B. and Levitt,M. (1996) J. Mol. Biol., 258, 367–392.
Pedersen,J.T. and Moult,J. (1997) J. Mol. Biol., 269, 240–259.
Samudrala,R. and Moult,J. (1998a) J. Mol. Biol., 275, 895–916.
Samudrala,R. and Moult,J. (1998b) Protein Eng., 11, 991–997.
Samudrala,R., Xia,Y., Huang,E. and Levitt,M. (1999) Proteins: Struct. Funct. Genet., S3, 194–198.
Satow,Y., Cohen,G., Padlan,E. and Davies,D. (1986) J. Mol. Biol., 190, 593–604.
Simons,K., Kooperberg,C., Huang,E. and Baker,D. (1997) J. Mol. Biol., 268, 209–225.
Simons,K., Bonneau,R., Ruczinski,I. and Baker,D. (1999) Proteins: Struct. Funct. Genet., S3, 171–176.
Sun,S. (1993) Protein Sci., 2, 762–785.
VijayKumar,S., Bugg,C.E. and Cook,W.J. (1987) J. Mol. Biol., 194, 531–544.
Xia,Y., Huang,E.S., Levitt,M. and Samudrala,R. (2000) J. Mol. Biol., 300, 171–185.
Received January 4, 2000; revised April 1, 2000; accepted May 2, 2000.