¹Department of Chemistry and ²Department of General Engineering, San Jose State University, San Jose, CA 95192-0101, ³Sage-N Research, Saratoga, CA 95070-6082 and ⁴L.H. Baker Center for Bioinformatics and Biological Statistics, Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50014, USA
⁵To whom correspondence should be addressed. E-mail: blustig{at}science.sjsu.edu
Abstract
Keywords: hydrophobicity/sequence entropy/sequence–structure relationship/sequence variability
Introduction
Globular proteins are compact and hence densely packed (Richards, 1974), even to the extent that their interior is frequently viewed as being solid-like (Hermans and Scheraga, 1961; Richards, 1997); however, there are still numerous voids and cavities in protein interiors (Liang and Dill, 2001). Tight packing is widely thought to be important for protein stability (Eriksson et al., 1992; Privalov, 1996), for nucleation of protein folding (Ptitsyn, 1998; Ptitsyn and Ting, 1999; Ting and Jernigan, 2002) and for the design of novel proteins (Dahiyat and Mayo, 1997). In conjunction with nucleation, it has previously been posited that the amino acid residues conserved through evolution may include essential tightly packed sites (Mirny et al., 1998; Ptitsyn, 1998; Ptitsyn and Ting, 1999; Ting and Jernigan, 2002).
However, the exact relationship between sequence and structure, which is the subject of this paper, is only partially understood (Jones, 2000; Baker and Sali, 2001). Whereas protein sequence is easily determined, 3-D structure is significantly more difficult to determine. Employing sequence alignments in conjunction with molecular modeling has proven to be among the most successful computational methodologies for protein structure prediction (Bryant and Lawrence, 1993; Marti-Renom et al., 2000). One key assumption in homology-based modeling is that conserved regions share structural similarities, but the structural basis of this connection has not been clearly determined.
Multiple alignments of regions of secondary structure may be useful in the identification of key hydrophobic residues when utilizing hydrophobic cluster analysis (Poupon and Mornon, 1999; Gross et al., 2000). Determining patterns of variability within amino acid sequence by using information theory has also proven useful in identifying unique protein secondary structures (Pilpel and Lancet, 1999). Large-scale exploration of sequence space has shown clustering of sequence entropy values corresponding to a particular fold (Larson et al., 2002). The application of Shannon entropy to nucleic acid sequence variability has proven to be a useful tool in identifying control regions in DNA (Schneider et al., 1986) and has been extended as one of several methods of scoring amino acid conservation in proteins (Zou and Saven, 2000; Valdar, 2002).
Shannon entropies for protein sequence have been shown to correlate with entropies calculated from local physical parameters, including backbone geometry (Koehl and Levitt, 2002). Interestingly, conventional generalized chain statistics appear to overweigh significantly the magnitude of the entropic penalty associated with loop closure in proteins and RNA (Lustig et al., 1998; Scalley-Kim et al., 2003). It is clear that continued exploration of the connections between entropy, structure and sequence is critical to a better understanding of protein stability and function.
Although there have been some demonstrations of connections between sequence conservation and structural properties (Demirel et al., 1998), there are no definitive studies on this subject. Establishing direct connections between sequences and structural features has proven difficult, hence the limited number of successes at protein design and the limited understanding of mutagenesis. Recent applications of sequence variability to structure predictions have enhanced results, so empirical measures of sequence variability are useful by themselves, even if their full implications are not well understood in terms of structural features.
While investigations of packing of protein atoms would likely be informative, we chose here to investigate coarse-grained packing among points each representing a neighboring amino acid. The results we will see are then more general, even if not so directly useful in predictions related to protein design.
Here we generate a large set of aligned protein sequences from a diverse sample of 130 protein sequences. Sequence entropies for individual residues are calculated. They are then compared with the corresponding local flexibility as measured by the extent of Cα packing calculated from the corresponding structures. Similar comparisons are also made between the residue hydrophobicity and the corresponding packing.
Methods
For protein sequences, the sequence entropy S_k at amino acid position k is expressed as
[Equation (1): image not recovered]
[Equation (2): image not recovered]
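As an illustration only, a per-position sequence entropy can be sketched with the standard Shannon form, S_k = −Σ_j P_jk log2 P_jk, where P_jk is the observed fraction of amino acid type j at alignment position k. This form, the base-2 logarithm, the skipping of gaps and the function names `column_entropy` and `alignment_entropies` are assumptions made for the sketch; they are not taken from the equations of this paper, which were not recovered.

```python
import math
from collections import Counter

# The 20 standard amino acids; any other symbol (gap, X, ...) is ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_entropy(column):
    """Shannon entropy in bits of one alignment column (assumed form)."""
    residues = [aa for aa in column.upper() if aa in AMINO_ACIDS]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # P_jk is estimated as the observed frequency of type j in the column.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropies(aligned_sequences):
    """Entropy S_k for every position k of equal-length aligned sequences."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must have equal length")
    return [
        column_entropy("".join(seq[k] for seq in aligned_sequences))
        for k in range(length)
    ]
```

Under these assumptions a fully conserved position has zero entropy, and a position in which all 20 amino acid types occur equally often has the maximum value, log2 20 ≈ 4.32 bits.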
For each residue from the 130 sample protein sequences, Cα packing densities are calculated using their associated atomic coordinates. The optimal radius for Cα packing around a given Cα residue position was determined to be 9 Å. In limited preliminary investigations this value was found to be best; greater scatter is observed, for example, in the single averaged entropies for radii of 10 and 11 Å. Smaller values omit some important cases in the distribution. Here we investigate the extent to which the inverse of the local packing density, as a measure of local flexibility (Bahar et al., 1997), is correlated with sequence variability.
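The packing measure described above can be sketched as a simple neighbor count. The sketch assumes that the packing density of a residue is the number of other Cα atoms lying within 9 Å of its own Cα atom; whether the residue itself is counted, and whether the cutoff is inclusive, are not stated in the text and are choices made here for illustration. The function names are hypothetical.

```python
import math


def ca_packing_density(ca_coords, radius=9.0):
    """Count, for each C-alpha atom, the other C-alpha atoms within `radius` (angstroms).

    `ca_coords` is a list of (x, y, z) tuples, one per residue.
    Assumption: the central residue is excluded and the cutoff is inclusive.
    """
    counts = []
    for i, a in enumerate(ca_coords):
        n = sum(
            1
            for j, b in enumerate(ca_coords)
            if j != i and math.dist(a, b) <= radius
        )
        counts.append(n)
    return counts


def inverse_density(counts):
    """Inverse packing density, used as a relative measure of flexibility."""
    return [1.0 / n if n > 0 else float("inf") for n in counts]
```

With a count-based density of this kind, the inverse density takes discrete values 1/n, so residues fall naturally into bins indexed by the neighbor count n.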
Results
Region I, in the overall correlation plots (Figure 3B), involves 74.9% of all the sample protein residues. Here the single averaged and double averaged sequence entropies are shown to be strongly linearly correlated with the inverse packing density. The straight-line fit for the single averaged sequence entropy versus inverse packing density is y = 12.350x − 0.20; the correlation coefficient is 0.997; P < 0.001. The straight-line fit involving the double averaged entropy is effectively identical. Region II, accounting for an additional 24.4% of the sample protein residues, indicates for strongly hydrophobic residue types (Poupon and Mornon, 1999) an apparent limiting fraction (Figure 3A) of about 10%. This suggests a threshold for the number of hydrophobic residues embedded in regions that are probably accessible to water.
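The kind of analysis reported here, averaging entropies within density bins and then fitting a straight line, can be sketched as follows. This is a generic least-squares illustration, not the authors' procedure: the helper names `bin_means` and `linear_fit` are hypothetical, and the exact definitions of the single and double averages are not given in this section.

```python
import math
from collections import defaultdict


def bin_means(keys, values):
    """Average `values` over all items sharing the same bin key (e.g. neighbor count)."""
    bins = defaultdict(list)
    for key, value in zip(keys, values):
        bins[key].append(value)
    return {key: sum(v) / len(v) for key, v in sorted(bins.items())}


def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept and Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r
```

In use, the bin keys would be the neighbor counts n, the fitted x values would be 1/n and the fitted y values would be the bin-averaged entropies; fitting averages rather than individual residues is what removes most of the scatter.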
Shown in Figure 4A is a superposition of normalized averaged sample protein hydrophobicities and single averaged sequence entropy, as a function of inverse packing density. Using three different scales (Hopp and Woods, 1981; Engelman et al., 1986; Sharp et al., 1991), hydrophobicity is calculated for every query protein residue that is part of an alignment. For the Hopp and Woods (1981) scale, calculations by Levitt (1976) were also included. With each scale, a normalized hydrophobicity is calculated for the set of all residues within a density interval. Those three normalized hydrophobicity plots (see Figure 4B) are then averaged and renormalized. Superimposed is the smooth-curve normalized representation (determined from original values in Figure 3A) of values for sequence entropy. Clearly, all three sets of hydrophobicity values, calculated for each scale (Figure 4B), resemble the corresponding sequence entropy values.
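The normalize, average and renormalize steps described above can be sketched as below. The text does not say which normalization is used, so min-max scaling to the interval [0, 1] is an assumption of this sketch, as are the function names; the inputs are taken to be per-bin mean hydrophobicities, one list per scale, with the bins in the same order.

```python
def min_max(values):
    """Rescale values linearly to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine_scales(per_scale_bin_means):
    """Normalize each scale across bins, average the scales bin by bin, renormalize."""
    normalized = [min_max(scale) for scale in per_scale_bin_means]
    averaged = [sum(column) / len(column) for column in zip(*normalized)]
    return min_max(averaged)
```

Normalizing each scale before averaging keeps a scale with a large numerical range from dominating the combined curve, which is one plausible reason for the two-stage procedure.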
Discussion
Previously a strong correlation has been reported between computed displacements based on elastic networks reflecting residue packing (Bahar et al., 1998) and measured hydrogen exchange (HX). The freedom to move a residue is entropic in character. Regions of high packing density resist hydrogen exchange, because of both stability and inaccessibility. Here, we have gone further to relate our calculated inverse Cα packing density from X-ray structures to the sequence variabilities. Strong linear correlations are observed between sequence entropy and the inverse packing density, except at the highest and lowest ranges of density. This provides a quantitative relationship between these two quantities and an important structural measure for determining likely sites for mutagenesis.
The selection of sequences to be included in sequence analysis is a difficult problem and results can depend strongly on the selection procedure. Ptitsyn (1998) advocated selection of conserved clusters of sequence sets determined by including only distantly related species. However, here we simply used the sequence matches from GenBank without any filtering. Despite this, the overall trends are extremely clear, although to a limited extent within individual proteins.
In addition, the correlation between sequence variability and motility is consistent with a similar pattern that we noted with respect to peptide binding to RNA (Hsieh et al., 2002). Enhanced motility at a particular residue position is associated with the ability of local structure to accommodate mutation. Such behavior can more broadly be related to sequence variability in a folded protein. The ability to accommodate mutations corresponds to allowing a range of positions, including possible contacts.
Hydrophobicity and sequence variability
The strong correlation between calculated sequence entropy and the hydrophobicity shown in Figure 4 is remarkable. For each protein, its sequence entropy is calculated at each sequence position. This simply reflects the sequence variability at that position. The hydrophobicity for each residue position of each original single sequence is averaged for each bin over just the 130 sample proteins. It is important to remember that here the sequence entropy and the hydrophobicity calculations are both averaged over all residues within each density bin. In addition, the three sets of hydrophobicity scales (Hopp and Woods, 1981; Engelman et al., 1986; Sharp et al., 1991) are diverse in their origins and include experimental optimization and/or validation based on a variety of systems. Calculations by Levitt (1976) were also included for the Hopp and Woods (1981) scale. The lack of any significant differences among the three sets of normalized hydrophobicity values (Figure 4B) as a function of inverse density suggests that the relative differences among individual amino acids within a hydrophobicity scale are largely compensated among other values within that set. Clearly, correlations between the sequence variabilities reflected in the sequence entropies and the corresponding hydrophobicities are consistent with the average behavior for residues with a given packing density. The observed correlation between average sequence entropy and hydrophobicity is remarkable, yet both quantities reflect fundamental properties relating to the extent of burial. The critical importance of hydrophobicity for folding of model protein chains (Hinds and Levitt, 1994; Dill et al., 1995) is well known. This is consistent with the fact that key hydrophobic residues can be described as buried or tightly packed (Ptitsyn, 1998; Ting and Jernigan, 2002).
Packing and the resulting interactions associated with hydrophobicity are not a simple matter of just accounting for pairs of contacts (Dima and Thirumalai, 2004). In packing, multiple contacts are usual. Our calculation of Cα packing density represents a coarse-grained counting of such contacts, but is a less detailed consideration. We show that the local flexibility is closely related to the inverse of the coarse-grained packing density.
Here, sequence variability as measured by sequence entropy is correlated with the inverse of the residue packing. The propensity for packing of a particular amino acid type reflects its hydrophobicity and side chain entropy (Pickett and Sternberg, 1993). Notably, average contact energies for the various amino acid pairs also correlate well with existing hydrophobicity scales (Young et al., 1994). This suggests that in principle these are strongly entropic in nature. It might be possible to calculate more directly configurational entropies in lieu of the comparable inverse density measure of relative flexibility, by using full atomic representation. Such calculations would depend upon a residue's environment in more realistic ways than given by simple residue density. This might also reduce the range for individual residue entropies calculated from sequence variability within a density bin.
Progress in this direction would assist with protein design, a closely related problem (Dahiyat and Mayo, 1997; Li et al., 1998; Buchler and Goldstein, 1999; Shih et al., 2000; Tiana et al., 2001; Koehl and Levitt, 2002; Larson et al., 2002; England et al., 2003). Further studies in the direction of the present work could lead to better predictions of sustainable sequence substitutions. However, from the present results it appears that the packing density of a single residue in a single protein does not necessarily correlate well with the sequence conservation at that site. Further efforts are clearly required to achieve this goal; however, the present results begin to point out a way of achieving it.
Conclusion
Here packing at the residue level for coarse-grained structures has been shown to exhibit a strong connection to sequence conservation, by the practice of averaging over large numbers of residues. Why is this averaging necessary? One possible explanation is that the large number of combinations of ways in which a residue's atoms can be packed together requires averaging over large numbers of occurrences, in order to obtain a meaningful single representation of all these combinations. It is also possible that residue size may affect the results, so that averaging over many occurrences will fully account for all of the various types of neighboring residues including individual side chain conformations.
Two distinct behaviors are identified for different inverse packing density regions (Figures 3 and 4). In the first region, 74.9% of sequence positions exhibit a linear dependence of sequence entropy over the inverse Cα packing density range 0.040–0.083, whereas in the second region, having inverse packing density >0.083, another 24.4% of query positions typically indicate a nearly constant sequence entropy. This saturation suggests that up to a certain minimum number of residues are allowed in low-density regions. Moreover, a certain fraction of those residues are hydrophobic and would appear to be accessible to water, consistent with a considerable lack of restrictions on the types of residues that can be accommodated. All of this suggests that for most residue positions the ability to accommodate sequence substitutions as measured by sequence entropy is inversely correlated with the extent of their packing. Also, on average for a particular amino acid type, hydrophobicity is correlated with the degree of residue packing. Deeper understanding of the connections between structural properties and sequence entropy awaits further study. However, the future development of such sequence entropy methods for the identification of core as well as flexible residues appears promising.
Acknowledgments
References
Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.
Bagci,Z., Jernigan,R.L. and Bahar,I. (2002) J. Chem. Phys., 116, 2269–2276.
Bagci,Z., Kloczkowski,A., Jernigan,R.L. and Bahar,I. (2003) Proteins, 53, 56–67.
Bahar,I., Atilgan,A.R. and Erman,B. (1997) Fold. Des., 2, 173–181.
Bahar,I., Wallqvist,A., Covell,D.G. and Jernigan,R.L. (1998) Biochemistry, 37, 1067–1075.
Baker,D. and Sali,A. (2001) Science, 294, 93–96.
Banavar,J.R., Maritan,A., Micheletti,C. and Trovato,A. (2002) Proteins, 47, 315–322.
Bevington,P.R. (1969) Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill, New York, Appendix C.
Bryant,S.H. and Lawrence,C.E. (1993) Proteins, 16, 92–112.
Buchler,N.E. and Goldstein,R.A. (1999) Proteins, 34, 113–124.
Chothia,C. and Finkelstein,A.V. (1990) Annu. Rev. Biochem., 59, 1007–1039.
Chothia,C., Levitt,M. and Richardson,D. (1981) J. Mol. Biol., 145, 215–250.
Dahiyat,B.I. and Mayo,S.L. (1997) Science, 278, 82–87.
Demirel,M.C., Atilgan,A.R., Jernigan,R.L., Erman,B. and Bahar,I. (1998) Protein Sci., 7, 2522–2532.
Dill,K.A., Bromberg,S., Yue,K., Fiebig,K.M., Yee,D.P., Thomas,P.D. and Chan,H.S. (1995) Protein Sci., 4, 561–602.
Dima,R.I. and Thirumalai,D. (2004) J. Phys. Chem. B, 108, 6564–6570.
Doyle,D.A., Cabral,J.M., Pfuetzner,R.M., Kuo,A., Gulbis,J.M., Cohen,S.L., Chait,B.T. and MacKinnon,R. (1998) Science, 280, 69–77.
Engelman,D.M., Steitz,T.A. and Goldman,A. (1986) Annu. Rev. Biophys. Biophys. Chem., 15, 321–353.
England,J.L., Shakhnovich,B.E. and Shakhnovich,E.I. (2003) Proc. Natl Acad. Sci. USA, 100, 8727–8731.
Eriksson,A.E., Baase,W.A., Zhang,X.J., Heinz,D.W., Blaber,M., Baldwin,E.P. and Matthews,B.W. (1992) Science, 255, 178–183.
Gerstein,M. and Altman,R.B. (1995) J. Mol. Biol., 251, 161–175.
Gross,E.A., Li,G.R., Lin,Z.-Y., Ruuska,S.E., Boatright,J.H., Mian,I.S. and Nickerson,J.M. (2000) Mol. Vis., 6, 30–39.
Hermans,J. and Scheraga,H.A. (1961) J. Am. Chem. Soc., 83, 3293–3330.
Hinds,D.A. and Levitt,M. (1994) J. Mol. Biol., 243, 668–682.
Hopp,T.P. and Woods,K.R. (1981) Proc. Natl Acad. Sci. USA, 78, 3824–3828.
Hsieh,M., Collins,E.D., Blomquist,T. and Lustig,B. (2002) J. Biomol. Struct. Dyn., 20, 243–251.
Jones,D.T. (2000) Curr. Opin. Struct. Biol., 10, 371–379.
Koehl,P. and Levitt,M. (2002) Proc. Natl Acad. Sci. USA, 99, 1280–1285.
Larson,S.M., England,J.L., Desjarlais,J.R. and Pande,V.S. (2002) Protein Sci., 11, 2804–2813.
Levitt,M. (1976) J. Mol. Biol., 104, 59–107.
Li,H., Tang,C. and Wingreen,N.S. (1998) Proc. Natl Acad. Sci. USA, 95, 4987–4990.
Liang,J. and Dill,K.A. (2001) Biophys. J., 81, 751–766.
Lustig,B., Bahar,I. and Jernigan,R.L. (1998) Nucleic Acids Res., 26, 5212–5217.
Maritan,A., Micheletti,C., Trovato,A. and Banavar,J.R. (2000) Nature, 406, 287–290.
Marti-Renom,M.A., Stuart,A.C., Fiser,A., Sanchez,R., Melo,F. and Sali,A. (2000) Annu. Rev. Biophys. Biomol. Struct., 29, 291–325.
Mirny,L., Abkevich,V.L. and Shakhnovich,E.I. (1998) Proc. Natl Acad. Sci. USA, 95, 4976–4981.
National Center for Biotechnology Information (2002) http://www.ncbi.nlm.nih.gov/
Pickett,S.D. and Sternberg,M.J.E. (1993) J. Mol. Biol., 231, 825–839.
Pilpel,Y. and Lancet,D. (1999) Protein Sci., 8, 969–977.
Poupon,A. and Mornon,J.-P. (1999) Theor. Chem. Acc., 101, 2–8.
Privalov,P.L. (1996) J. Mol. Biol., 258, 707–725.
Protein Data Bank (2002) http://www.rcsb.org/pdb/
Ptitsyn,O.B. (1998) J. Mol. Biol., 278, 655–666.
Ptitsyn,O.B. and Ting,K.L. (1999) J. Mol. Biol., 291, 671–682.
Richards,F.M. (1974) J. Mol. Biol., 82, 1–14.
Richards,F.M. (1997) Cell. Mol. Life Sci., 53, 790–802.
Scalley-Kim,M., Minard,P. and Baker,D. (2003) Protein Sci., 12, 197–206.
Schneider,T.D., Stormo,G.D. and Gold,L. (1986) J. Mol. Biol., 188, 415–431.
Sharp,K.A., Nicholls,A., Friedman,R. and Honig,B. (1991) Biochemistry, 30, 9686–9697.
Shih,C.T., Su,Z.Y., Gwan,J.F., Hao,B.L., Hsieh,C.H. and Lee,H.C. (2000) Phys. Rev. Lett., 84, 386–389.
Sigler,P.B., Xu,Z., Rye,H.S., Burston,S.G., Fenton,W.A. and Horwich,A.L. (1998) Annu. Rev. Biochem., 67, 581–608.
Tiana,G., Broglia,R.A. and Provasi,D. (2001) Phys. Rev. E, 64, 011904_1–6.
Ting,K.L. and Jernigan,R.L. (2002) J. Mol. Evol., 54, 425–436.
Valdar,W.S.J. (2002) Proteins, 48, 227–241.
Young,L., Jernigan,R.L. and Covell,D.G. (1994) Protein Sci., 3, 717–729.
Zhang,J., Chen,R., Tang,C. and Liang,J. (2003) J. Chem. Phys., 118, 6102–6109.
Zou,J. and Saven,J.G. (2000) J. Mol. Biol., 296, 281–294.
Received August 5, 2004; revised January 25, 2005; accepted January 28, 2005.
Edited by Harold Scheraga