1 CTA/CAMM, Novartis Institute for Biomedical Research, 556 Morris Avenue, Summit, NJ 07901 and 3 Department of Mathematics, Rutgers State University, New Brunswick, NJ 08855, USA
Abstract |
---|
Keywords: averaging over homologs/fragmentation threading/framework-forming regions in immunoglobulins/sequence-to-structure specificity
Introduction |
---|
Answering this question quantitatively (even approximately) is important for developing protein structure prediction methods, understanding the architecture of protein globules and designing new protein structures. In principle, one could think of two different hypothetical classes of protein sequences. In the first class of sequences, most chain fragments contribute equally to the energy separation of the native structure from other structures. In the second class of sequences, chain fragments differ significantly in their contribution to recognition of the native structure. Some of the chain fragments are highly preferred in the native structure and make the dominant contribution to the energy separation of the native structure from other folds; the other chain fragments are energetically `neutral' (have low specificity) for the native structure or possibly even prefer a different fold. In this work we investigate which of these two hypothetical classes of sequences more closely reflects the case of immunoglobulins.
Recently, we have developed an approach of `fragmentation threading' that allows estimation of the contributions of particular chain fragments to recognition of the native structure (Reva and Topiol, 2000). In particular, we compared the roles of secondary structure and loops in recognition of the native structures of proteins. The accuracy of recognition was estimated by computing the Z-score values for fragments of protein chains in threading tests. We found that for the overwhelming majority of diverse protein sequences, secondary structure fragments are more determinant of the native structure than loops. Typically, however, the secondary structure is, in turn, less determinant of the structure than the whole sequence. Therefore, roughly speaking, the majority of sequences show the features described above as class I. However, ~14% of the sequences studied (34 of 240) showed the features of class II: the fragments of secondary structure taken alone produce significantly lower Z-score values than the corresponding whole sequences. Most of these proteins (33 of 34) contain a significant portion of β-structure.
In this work we apply the fragmentation threading approach in a more detailed study of the specific role of secondary structure fragments in determining (recognizing) the three-dimensional structure of immunoglobulin molecules, which were found among the class II proteins. Immunoglobulins were chosen for this study because they constitute a vast and well-explored class of proteins. There are hundreds of three-dimensional structures of immunoglobulins that can be used for producing reliable sequence and structure alignments; alignments of diverse homologs help to increase the accuracy of energy calculations (Finkelstein, 1998; Reva et al., 1999). Because immunoglobulins play a central role in developing specific immune responses, they are an important object for protein engineering.
Analysis of conserved mutations in multiple alignments of immunoglobulin sequences and structures suggested a special role of four central β-strands (strands B, C, E, F in Figure 1) in determining the immunoglobulin fold (Galitsky et al., 1999; Halaby et al., 1999; Kister et al., 2001).
Methods |
---|
The energy functions
In energy calculations we used pairwise C atom-based potentials (Reva et al., 1997; Reva et al., 2000) as illustrated in Figure 2.
The Z-score
The accuracy of protein structure recognition is commonly characterized by the value of the Z-score (Hendlich et al., 1990):
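$$Z = \frac{E_N - \langle E \rangle}{D} \qquad (1)$$

where $E_N$ is the energy of the native structure, and $\langle E \rangle$ and $D$ are the average energy and the standard deviation of energies over alternative structures.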
| (2) |
When Z ≪ −1, which corresponds to a reasonable accuracy of the prediction methods,
| (3) |
The larger NZ (and hence Z), the more accurate the protein structure recognition.
Generation of alternative structures by gapless threading
For evaluating the average energy, ⟨E⟩, and the standard deviation, D, one needs a representative set of alternative structures. Commonly, such structures are obtained by gapless threading of a query sequence onto all possible three-dimensional structures provided in the form of backbones of a set of non-homologous proteins (Hendlich et al., 1990). No internal gaps or insertions are allowed; thus, a chain of N residues can be threaded through a host protein molecule of M residues in M − N + 1 different ways. Because threaded structures that differ by a small number of register shifts are similar to each other (as measured by RMSD; Reva et al., 1998), we sampled threaded structures with register shifts of 10. Only a fraction of the possible conformations of the query sequences is generated by this procedure. However, this fraction is enough for a crude estimation of the Z-score values. [All the protein structures used in threading were taken from the PDB according to Sander's 25% similarity list of October 1997 (Hobohm et al., 1992). From this list we used only the 364 structures with no chain breaks, a resolution better than 2.5 Å, an R factor less than 0.2 and no structural homologs (Reva et al., 1998).]
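The counting of gapless placements and the sampling with a register shift of 10 can be sketched as follows. This is a minimal illustration, not the authors' program: the function name `threading_offsets` and the convention of starting the sampling at offset 0 are assumptions made here.

```python
def threading_offsets(query_len, host_len, shift=10):
    """Start positions for gapless threading of a query onto a host backbone.

    A query of N residues fits onto a host of M residues in M - N + 1 ways;
    only every `shift`-th placement is sampled (hypothetical convention:
    sampling starts at offset 0).
    """
    n_placements = host_len - query_len + 1  # M - N + 1
    if n_placements <= 0:
        return []  # host shorter than query: no gapless placement exists
    return list(range(0, n_placements, shift))


# Example: a 110-residue query on a 250-residue host has 141 placements,
# of which every tenth one is kept.
offsets = threading_offsets(query_len=110, host_len=250)
```

Each offset defines one alternative structure: residue k of the query is placed on backbone position offset + k of the host.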
To determine a particular contribution of a given group of residues to structure recognition we calculated separately the energy of this group in the native structure as well as the corresponding energies in alternative structures. These energies were used for estimating the Z-score for the particular fragments of aligned sequences.
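A fragment Z-score of this kind can be sketched as below. The sketch assumes the usual definition of the Z-score (native energy minus the mean energy of alternatives, divided by their standard deviation) and, purely for illustration, that the energy of a residue group is available as a sum of per-residue energies; the names `fragment_energy` and `z_score` are hypothetical.

```python
import statistics


def fragment_energy(per_residue_energy, fragment):
    """Energy of a residue group, assuming per-residue energies are given.

    `per_residue_energy` is a list indexed by residue position and
    `fragment` is an iterable of residue indices (illustrative input format).
    """
    return sum(per_residue_energy[i] for i in fragment)


def z_score(native_energy, alternative_energies):
    """Z = (E_native - <E>) / D over a set of alternative structures."""
    mean_e = statistics.fmean(alternative_energies)
    d = statistics.pstdev(alternative_energies)  # population standard deviation
    return (native_energy - mean_e) / d


# Toy numbers only: the native energy lies well below the alternatives.
z = z_score(-5.0, [0.0, 2.0, -2.0, 1.0, -1.0])
```

With this sign convention a more negative Z means a better separation of the native structure from the alternatives.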
Normal distribution for alternative fold energies
In estimating the accuracy of protein structure recognition by a Z-score value we assume a normal distribution for alternative fold energies. This type of energy distribution is typical for systems approximated by the random energy model (REM) (Derrida, 1981). The total energy in the REM is a sum of many independent random interactions. The energy of a protein globule is a sum of thousands of inter-residue interactions. Alternative random structures allow for a huge variety of inter-residue contacts. Therefore, it is generally believed that the energy spectrum of a protein molecule has the normal form. However, because we compare the energies of interactions for relatively small and specially chosen groups of residues it is especially important to validate the applicability of the REM for separate chain fragments. A deviation between the theoretical and observed distribution is usually measured by the χ² value (Mathews and Walker, 1964). To compute this quantity we divide an energy distribution into bins and calculate the observed and the expected bin populations. The χ² value for a protein p is computed as:
| (4) |
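A comparison of observed and expected bin populations can be sketched as follows. This is a minimal sketch that assumes the standard Pearson form of the statistic, the sum over bins of (observed − expected)²/expected, with the expected populations taken from a normal distribution fitted to the sample; the number of bins, the equal-width binning and the function name are assumptions made here and may differ from the normalization used by the authors.

```python
from statistics import NormalDist, fmean, pstdev


def chi_square_vs_normal(energies, n_bins=10):
    """Pearson chi-square between binned energies and a fitted normal law.

    Assumes the sample is not degenerate (non-zero spread).
    """
    normal = NormalDist(fmean(energies), pstdev(energies))
    lo, hi = min(energies), max(energies)
    width = (hi - lo) / n_bins
    chi2 = 0.0
    for b in range(n_bins):
        left, right = lo + b * width, lo + (b + 1) * width
        last = b == n_bins - 1
        # Bins are half-open; the last bin also includes the maximum value.
        observed = sum(1 for e in energies if left <= e < right or (last and e == hi))
        expected = len(energies) * (normal.cdf(right) - normal.cdf(left))
        if expected > 0:
            chi2 += (observed - expected) ** 2 / expected
    return chi2
```

A sample drawn from a normal distribution should give a much smaller value than a clearly bimodal one.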
Alignment of immunoglobulin chains
The multiple alignment of immunoglobulin chains is based on both sequence and structural information (Gelfand and Kister, 1995). The alignment procedure consisted of two steps: (1) determination of the secondary structure units in the proteins, and (2) multiple sequence and structural alignment of the secondary structure units. For assignment of the secondary structure we determined a consensus in (i) backbone dihedral angles, (ii) torsion angles around virtual Cα–Cα bonds, and (iii) hydrogen bonds between main chain atoms (Gelfand and Kister, 1995).
In constructing the multiple alignments the corresponding strands and loops from different structures were grouped together and aligned separately (i.e. the group of A strands, the group of AB loops, etc.; see Tables I and II). No internal deletions or insertions were allowed in the selected chain fragments. In constructing the multiple alignment of separate chain fragments we took into account and determined a consensus in (i) the sequence alignment, (ii) the backbone conformation and hydrogen bonds, and (iii) the residue–residue contacts (a contact between two residues was detected if any two heavy atoms of these residues were closer than 5 Å; Gelfand and Kister, 1995). Only the structures where all the residues have been resolved by X-ray were used in the final multiple alignments. The resulting alignments of variable domains consist of 70 (VL) and 64 (VH) chains for light and heavy chains, respectively; the alignments of constant domains were constructed for 13 (CL) light and 16 (CH) heavy chains.
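The 5 Å contact criterion quoted above can be sketched as a short test on heavy-atom coordinates. The input format (a list of (x, y, z) tuples in Å per residue) and the function name are assumptions made for illustration.

```python
import math


def residues_in_contact(atoms_a, atoms_b, cutoff=5.0):
    """True if any heavy-atom pair from the two residues is closer than cutoff (Å)."""
    return any(math.dist(p, q) < cutoff for p in atoms_a for q in atoms_b)


# Toy coordinates only: the closest atom pair of the first two residues is 4.5 Å apart.
res_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
res_b = [(6.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
res_c = [(10.0, 0.0, 0.0)]
contact_ab = residues_in_contact(res_a, res_b)
contact_ac = residues_in_contact(res_a, res_c)
```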
To reduce the errors in energy calculations we used the recently developed approach of averaging energies over aligned homologs (Finkelstein, 1998; Reva et al., 1999). The averaged energy Ui corresponding to a structure i is computed directly as:
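$$U_i = \frac{1}{n}\sum_{\mu=1}^{n} E_i^{\mu} \qquad (5)$$

where $E_i^{\mu}$ is the energy of homolog $\mu$ in structure $i$ and $n$ is the number of homologs used in the averaging.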
The improvement in the accuracy of protein structure recognition depends on the diversity of homologs used in this averaging (Reva et al., 1999). Nearly identical sequences are obviously of little use for energy averaging; the homologs selected for averaging should be as dissimilar as possible (within the context of a set of homologs). When homologous sequences are diverse and there are no significant differences between their native folds, such energy averaging reduces random energy errors and improves the separation of the common native fold from other structures. According to the general approach (Finkelstein, 1998), the diverse homologs are determined by the lowest correlation in energy. However, in practice, the selection of homologs is a controversial procedure because initially homologs are selected by their sequence similarity. Based on our previous tests (Reva et al., 1999), we used the optimized procedure for selection of homologs, i.e. we computed pairwise correlation coefficients for each pair of homologous chain fragments and, for the purpose of averaging, selected only those homologs with an energy correlation of less than 0.7–0.85. The energies corresponding to interactions with `gap' positions in the alignment were taken to be zero.
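The selection of weakly correlated homologs and the subsequent energy averaging can be sketched as follows. This is an illustration under assumptions: the greedy order of selection, the dictionary input format (homolog name mapped to its energies over the same list of alternative structures) and all function names are hypothetical, and the threshold of 0.7 is one value from the range quoted above.

```python
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long energy profiles."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def select_diverse(energies, root, threshold=0.7):
    """Greedy pick (an assumption): keep a homolog only if its energy profile
    correlates below the threshold with every homolog already kept.
    The root sequence is always kept."""
    kept = [root]
    for name in energies:
        if name != root and all(
            pearson(energies[name], energies[k]) < threshold for k in kept
        ):
            kept.append(name)
    return kept


def averaged_energies(energies, kept):
    """Average the energy of each structure over the selected homologs."""
    n_structures = len(energies[kept[0]])
    return [fmean(energies[h][i] for h in kept) for i in range(n_structures)]


# Toy energies over four alternative structures (illustrative numbers only).
toy = {
    "root": [1.0, 2.0, 3.0, 4.0],
    "a": [1.1, 2.1, 2.9, 4.2],  # nearly identical to the root: rejected
    "b": [4.0, 1.0, 3.0, 2.0],  # weakly correlated with the root: kept
}
kept = select_diverse(toy, "root")
avg = averaged_energies(toy, kept)
```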
In the recognition tests with energy averaging over homologs (Reva et al., 1999) a sequence of known three-dimensional structure was used as a `root' sequence. (A root sequence is taken into the alignment with no deletions or insertions; the alignments of other sequences have to be adjusted correspondingly.) The three-dimensional structures are known for all sequences of immunoglobulins used in this work. Therefore, each of the sequences could be taken as a root sequence with the corresponding three-dimensional structure used as the common native fold. We considered two types of averaging. In the first case, each of the sequences of the original alignment was treated as a root; we computed the corresponding set of diverse homologs for each sequence and used them, in turn, in averaging over energies. The obtained characteristics were simply averaged over all the sequences in the original alignment. Because these sequences are not equally diverse, as a control we also considered separately specially selected diverse sets of homologs (mini-alignments) extracted from the corresponding original alignments. Specifically, the mini-alignments were formed from diverse sets with a maximal number of homologs; in cases where the numbers of homologs in the alignments were equal, we chose the one corresponding to the shortest root sequence to decrease the number of possible deletions. This procedure worked well for relatively diverse variable domains (for more details see the caption to Table III). For constant domains, the root sequences for mini-alignments were chosen among the shortest sequences of the original alignments and then by alphabetic order of the corresponding PDB names.
Results and discussion |
---|
To make the results of our calculations more reliable and also to take advantage of the multiple alignments, we computed the Z-scores using energy averaging over diverse homologs (Finkelstein, 1998; Reva et al., 1999). Diverse homologs were separately grouped into mini-alignments (similar to the ones shown in Tables III and IV) for all types of the considered chain fragments by the maximal pairwise energy correlation criterion. Table I shows that averaging energies over diverse homologs noticeably improves the accuracy of structure recognition in all the cases considered. The results produced for single chain energies, as characterized above, and for averaged energies are perfectly consistent. The average number of diverse homologs may be used to indicate the extent of the diversity of the corresponding chain fragments. One can see that β-structure and core strands are the most conserved fragments of the variable chains of immunoglobulins, whereas loops are the most variable ones. The constant domains show significantly less variability than the variable ones. (To get at least two chains in the diverse alignments for all the sequences in the CL and CH multiple alignments we had to increase the correlation threshold to 0.85.)
One can see that the values of the Z-scores of Tables I and II are generally consistent both qualitatively and quantitatively. Besides the Z-scores, Table II gives additional information on average percentages of β-structure, core strands and loops to indicate how many residues actually contribute to structure recognition. The χ² values of Equation (4) are used to compare the deviations between the observed energy distributions and the REM-based estimates.
A critical question for this study is the legitimacy of the approach used for estimating the accuracy of fold recognition. The Z-score comparative analysis is based on the concept that distributions of alternative fold energies are always well approximated by the normal law. The low χ² values in Table II show that for the majority of the tested chain fragments the energy distributions can reasonably be treated as normal (however, the energy distributions for loops show more significant deviations from the normal law than those of the other fragments considered).
Table II also presents data on average energy differences per residue between the native and misfolded conformations for the considered chain fragments and the corresponding standard deviations of energies. In complete agreement with the general theory (Finkelstein, 1998), the average energy differences between the native and the misfolded structures are virtually unaffected by energy averaging over homologs, whereas the corresponding standard deviations are systematically reduced in comparison to the average standard deviations for single chains. This results in a reduction of the Z-scores and an increase in the accuracy of structure recognition.
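The reduction of the standard deviation can be illustrated with a simple estimate. It is a sketch under assumptions that are not stated in the text: all n homologs are taken to have the same standard deviation D of alternative-fold energies and the same pairwise energy correlation ρ (n, ρ and D_avg are symbols introduced here for illustration only).

```latex
D_{\mathrm{avg}}^{2}
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{\mu=1}^{n} E^{\mu}\right)
  = \frac{D^{2}}{n}\,\bigl[\,1 + (n-1)\,\rho\,\bigr]
```

Under these assumptions, n = 5 uncorrelated homologs (ρ = 0) give D_avg ≈ 0.45 D, whereas the same number of homologs with ρ = 0.7 give only D_avg ≈ 0.87 D, which is in line with the preference for weakly correlated homologs in the averaging.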
Conclusions |
---|
The results of this study provide a quantitatively based understanding of the design of immunoglobulin molecules. Comparing the fold recognition data for different chain fragments one can say that β-strands form a rigid framework for the immunoglobulin molecule, whereas loops, with no structural task, are `free' to develop a broad variety of binding specificities. It is well known that protein function is determined by specific fragments of a protein chain. This study suggests that the whole protein structure can be predominantly determined by a few specific fragments of a chain which form a structural frame of the molecule. This idea may help in better understanding the mechanisms of protein evolution: strengthening of a protein structure in the key frame-forming regions allows mutations and flexibility in other chain regions.
Fragmentation threading is a simple and fast method for exploring the design of protein globules and for approaching the problem of sequence-to-structure specificity.
Notes |
---|
2 Present address: Discovery Partners, Computational Division, Suite 645, Two Executive Drive, Fort Lee, NJ 07024. E-mail: breva{at}stprot.com
Acknowledgments |
---|
References |
---|
Chothia,C., Gelfand,I.M. and Kister,A.E. (1998) J. Mol. Biol., 278, 457–479.
Derrida,B. (1981) Phys. Rev. B, 24, 2613–2626.
Fersht,A. (1999) Structure and Mechanism in Protein Science: A Guide to Enzyme Catalysis and Protein Folding. W.H.Freeman & Co.
Finkelstein,A. (1998) Phys. Rev. Lett., 80, 4823–4825.
Flores,T., Moss,D. and Thornton,J. (1994) Protein Eng., 7, 31–37.
Galitsky,B., Gelfand,I. and Kister,A. (1999) Protein Eng., 12, 919–925.
Gelfand,I. and Kister,A. (1995) Proc. Natl Acad. Sci. USA, 92, 10884–10888.
Halaby,D.M., Poupon,A. and Mornon,J.-P. (1999) Protein Eng., 12, 563–571.
Hendlich,M., Lackner,P., Weitckus,S., Flokner,H., Froschauer,S., Gottsbacher,K., Casari,G. and Sippl,M. (1990) J. Mol. Biol., 216, 167–180.
Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417.
Kister,A., Roytberg,M., Chothia,C., Vasiliev,Y. and Gelfand,I. (2001) Protein Sci., 10, 1801–1810.
Mathews,J. and Walker,B. (1964) Mathematical Methods in Physics. W.A.Benjamin Inc., New York.
Mirny,L. and Shakhnovich,E. (1999) J. Mol. Biol., 291, 177–196.
Orengo,C. (1999) Protein Sci., 8, 699–715.
Ptitsyn,O. (1998) J. Mol. Biol., 278, 655–666.
Reva,B. and Topiol,S. (2000) Biocomputing: Proceedings of the Pacific Symposium. World Scientific Publishing Co., pp. 168–178.
Reva,B., Finkelstein,A., Sanner,M. and Olson,A. (1997) Protein Eng., 10, 865–876.
Reva,B., Finkelstein,A. and Skolnick,J. (1998) Folding Design, 3, 141–147.
Reva,B., Finkelstein,A. and Skolnick,J. (2000) Derivation and testing residue–residue mean force potentials for use in protein structure recognition. Protein Structure Prediction Methods and Protocols. Humana Press Inc., Totowa, NJ, USA, pp. 155–174.
Reva,B., Skolnick,J. and Finkelstein,A. (1999) Proteins, 35, 353–359.
Yang,A.-S. and Honig,B. (2000) J. Mol. Biol., 301, 691–711.
Received May 1, 2001; accepted September 25, 2001.