1Centro de Astrobiología (INTA-CSIC), 28850 Torrejón de Ardoz and 2Centro de Biología Molecular Severo Ochoa, Cantoblanco, 28049 Madrid, Spain and 4Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
3 To whom correspondence should be addressed at the Madrid address. E-mail: ubastolla{at}cbm.uam.es
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: disulfide bonds/folding thermodynamics/protein evolution/protein folding
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Natural selection eliminates mutants that do not fulfil thermodynamic or functional requirements and favors the fixation of variants conferring a large enough selective advantage. However, according to the neutral theory of molecular evolution, most amino acid substitutions have been fixed in biological populations not through positive natural selection, but through random genetic drift of selectively neutral alleles (Kimura, 1983). The hypothesis that protein stability is selectively neutral above some threshold provides a natural explanation for the marginal stability of proteins (Taverna and Goldstein, 2002
) and it is consistent with the strong correlation between the observed amino acid frequencies and those predictable from nucleotide frequencies (Sueoka, 1961
; Lobry, 1997
).
Evolution at the molecular level depends on the evolution at the population level. As predicted by the directionality principle for demographic stability, described by Demetrius and co-workers (Kowald and Demetrius, 2005; Ziehe and Demetrius, 2005
), evolution at the population level may result in an increase, a decrease or a random non-directional change in demographic stability, depending on the ecological constraints the population experiences. Since changes in demographic stability are predicted to be positively correlated with changes in protein stability (Demetrius, 2002
), evolutionary changes in protein stability can also result in an increase, a decrease or a random non-directional change. The latter is consistent with the notion of selectively neutral alleles in the neutral theory of molecular evolution (Kimura, 1983
).
It is important to consider that the native state of a protein must be stable not only with respect to the unfolded state (unfolding stability), but also with respect to compact, incorrectly folded conformations (misfolding stability). We distinguish here between these two types of stability. For globular proteins, the two requirements may act in opposite directions. Lattice simulations suggest that stability against misfolded states sets an upper limit on protein hydrophobicity (Sandelin, 2004). Another recent study found a negative correlation between stability against unfolding and stability against misfolding in orthologous proteins, suggesting that there is a trade-off in the evolution of these two quantities (Bastolla et al., 2004
). According to this view, protein evolution is frustrated: it cannot optimize both stabilities at the same time.
We aimed here to investigate the relationship between the thermodynamic requirements on protein evolution and chain length and composition. We adopted a computational study of protein structures in the Protein Data Bank (PDB) (Berman et al., 2000), using a knowledge-based effective energy function to predict thermodynamic properties. Our work is focused on the interplay between stability parameters, chain length and composition. The results are consistent with the hypothesis that thermodynamic requirements impose weaker constraints on the evolution of longer proteins. This interpretation agrees with a recent analysis of site-specific amino acid distributions as a function of chain length (Bastolla et al., 2005
). We distinguish three types of intra-protein interactions: hydrophobic interactions, salt bridges and disulfide bonds.
In particular, we studied in detail the distribution and correlations of disulfide bonds, finding that they illustrate nicely the general pattern. Disulfide bonds are the strongest interactions in protein structures (Anfinsen and Scheraga, 1975; Betz, 1993
). It has been shown that their number is negatively correlated with the content of aliphatic hydrophobic residues in the protein chain (Abkevich and Shakhnovich, 2000
). This finding suggests that disulfide bonds are more frequently observed where the native state, in absence of them, would have limited stability. In a recent computational study of intracellular proteins in the genomes of hyperthermophilic Archaea, Mallick et al. (2002)
found a correlation between the incidence of disulfide bonds and the optimal growth temperature of the cellthe higher the optimal growth temperature, the larger is the fraction of the total number of intracellular cysteines predicted to form disulfide bonds. This suggests that disulfide bonding may be under strong selection for increased stability in hyperthermophilic Archeae. The results presented here are consistent with the hypothesis that disulfide bonds tend to be positively selected in short proteins, whereas they evolve neutrally in long proteins.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
We consider protein domains as parsed in the SCOP database (Murzin et al., 1995). We use a subset of SCOP domains having <40% pairwise sequence similarity. The structure files can be downloaded at the URL http://www.astral.berkeley.edu (Chandonia et al., 2004
). We exclude structures determined by NMR, since they are on the average less compact and they are not suitable to the application of Equation 1 (Bastolla et al., 2001
). We also excluded the following proteins: chains in multimeric complexes for which the interactions with other chains, evaluated through our effective energy function, contribute more than 15% to the stability; chains much less compact than expected for their length according to Equation 8; chains for which we find alternative conformations lower in energy than the native (apart some of the cases listed above, these are typically short fragments or proteins with large cofactors); and chains longer than 700 residues. In this way, we obtained a data set of 4528 protein domains. These were further divided in two subsets: 2835 domains with <200 residues and 1166 domains with
250 residues.
Effective energy function
We adopt a simplified description of protein structures using the contact map Cij, a symmetric matrix with elements one if residues i and j are in contact and zero otherwise. The residues are considered in contact if any pair of their heavy atoms are closer than 4.5 Å and |i j| > 2. The number of contacts is indicated as
We approximate the effective free energy of a contact map C through a simple pairwise contact approximation:
![]() | (1) |
This energy function assigns lowest energy to the experimentally known native state of almost all monomeric proteins studied by X-ray crystallography, among a large set of alternative structures generated by gapless threading and the resulting energy landscapes are typically well correlated (see below).
Unfolding free energy
To estimate the difference in free energy G between the denatured and the native state of a protein, we used the effective free energy of the native state,
, calculated as described above. We neglected the chain entropy of the native state with respect to that of the denatured state and we neglected the effective energy of the denatured state with respect to the native. With these approximations, the unfolding free energy per residue was estimated as
![]() | (2) |
First, we neglected the sequence dependence of the chain entropy, setting S(A) = s. The above equation, with S(A)T = s, was fitted against experimental measures of unfolding free energies for proteins that fold with two-state kinetics, i.e. intermediate states do not contribute appreciably to folding thermodynamics. The correlation coefficient was 0.89 for a set of 44 proteins (U.Bastolla, unpublished work), with the two free parameters a and s, whose fitted values were a = 1.62 ± 0.14 and s = 0.123 ± 0.03 cal/mol. The latter is much smaller than the typical value of side-chain entropies, which is 3.33 cal/mole per side chain (Doig and Sternberg, 1995). This discrepancy suggests that the term s also includes in an effective way stabilizing contributions not included in the effective energy function and proportional to chain length, which partially compensate the chain entropy. These may be, for instance, the main-chain hydrogen bonds that constitute secondary structure.
Then we calculated the conformational entropy per residue of the unfolded state, S(A), using the mean side-chain entropy scale reported by Doig and Sternberg (1995) and tried to fit Equation 2 in the form with only one free parameter a. Since the unfolding free energy per residue is much smaller than the typical side-chain entropy, the quantity on the left-hand side is dominated by the chain entropy. The fit became worse than before, possibly because of compensations due to terms not included in the effective free energy, so we decided to treat the effective free energy
and minus the chain entropy S(A) as two separate variables.
Stability with respect to misfolded states
An estimate of the stability with respect to misfolded states is given by the energy gap (Goldstein et al., 1992, Gutin et al., 1995
). We use a dimensionless version of this quantity, normalizing it with the native energy (Bastolla et al., 1999
). The normalized energy gap of sequence A,
(A), is defined as
![]() | (3) |
Contact strength and interactivity
Here we define two quantities related to the strength of non native interactions. They can be calculated from protein sequence alone. The first one is the mean energy of native and non-native contacts, U
. It is defined as the average over all possible contacts of the contact interaction strength:
![]() | (4) |
We use weights that decrease with sequence separation, wij = 1/|i j|, to take into account the decay of contact probability. Similar results are obtained with uniform weights, wij 1 and even omitting the condition i < j + 2 in the sum, which makes the mean interaction energy only dependent on chain composition.
The other indicator that we calculated is the mean hydrophobicity of the protein sequence, defined as . We used the generalized hydrophobicity scale h(a) called interactivity, which was obtained as the main eigenvector of the contact interaction matrix U(a,b) (Bastolla et al., 2005
). The quantities
U
and
h
are strongly correlated with each other (R = 0.73 over more than 4000 domains).
Z-score of native interactions
To quantify the extent at which native interactions are optimized with respect to the background, we use the zeta score of native interactions, defined as
![]() | (5) |
Disulfide bonds
Disulfide bonds were either read directly from the SSBOND record in the PDB file or, in case this was absent, they were defined as contacts between cysteine residues where the two sulfur atoms are closer than 2.8 Å apart, with the further condition that each cysteine residue can intervene in at most one disulfide bonds.
In our energy function, contacts between cysteine residues receive the strongest interaction value. We estimated the unfolding free energy without the contribution of disulfide bonds, , by subtracting from Enat the effective energetic contribution of disulfide bonds and adding for each subtracted bond an amount of energy equivalent to the average energy per native contact:
![]() | (6) |
Salt bridges
We measured the number of salt bridges in protein structures, NSALT. Salt bridges were identified as pairs of oppositely charged residues with charged groups closer than 4.0 Å apart (Kumar and Nussinov, 1999).
Structural indicators
The main structural indicator used in this study is the number of contacts per residue:
![]() | (7) |
It is well known (see, for instance, Bastolla et al., 2000) that the number of contacts per residue is approximately constant for residues in the core of globular proteins and smaller for residues at the surface, so that the total number of contacts, , divided by the number of residues N scales as the surface to volume ratio:
![]() | (8) |
Another useful parameter to characterize the native structure is the absolute contact order (ACO), defined as the average sequence distance between residues in contacts:
![]() | (9) |
Plaxco et al. (1998) showed that there is a strong negative correlation between the relative contact order (RCO), which is the ACO normalized by chain length, RCO = ACO/N, and the logarithm of the folding rate, for proteins with two-state kinetics. However, it was later realized that the logarithm of the folding rate of two-state and non-two-state folders is better fitted as a linear functional of the ACO (Ivankov et al., 2003
). The ACO is a more convenient indicator than the RCO because it takes into account the negative correlation between the folding rate and chain length (Thirumalai, 1995
; Gutin et al., 1996
; Finkelstein and Badretdinov, 1997
). Chain length is positively correlated with the ACO but negatively with the RCO.
We also measured the RCO of disulfide bridges (i.e. only corresponding to disulfide bridges are considered), which we called DRCO.
Correlations
We measured correlations between pairs of variables using the Pearson correlation coefficient, defined as
![]() | (10) |
![]() | (11) |
The dependences among multiple variables were disentangled using principal component analysis (PCA). This determines the orthogonal directions in a multivariate space that account for the largest part of the variance of a set of elements embedded in this space. PCA consists in diagonalizing the correlation matrix R(Xa,Xb) for the set of variables whose relationships are under study, in order to determine the coefficients through which each variable enters each principal component.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Chain length and unfolding stability
We found that the effective free energy per residue, Enat/N, is uncorrelated with chain length (R = 0.01), whereas the average strength of native contacts, defined as Enat/C, has a negative association both with the number of contacts per residue, C/N (correlation coefficient R = 0.33) and with chain length (R = 0.28). The high significance was confirmed by a chi-squared test where the proteins were divided into nine subsets of approximately equal size according to their length (P < 106), see Figure 1. The average interaction strength decreases by 15% from the shortest to the longest proteins.
|
![]() | (12) |
Proteins are marginally stable and it is not expected that their unfolding free energy per residue G/N increases with chain length: the stability requirements are not expected to depend on chain length. If, in addition, the chain entropy per residue S(A) does not increase with chain length, the above equation implies that Enat/C decreases with chain length.
In agreement with Equation 12, Enat/C is correlated with N/C with R = 0.38. This correlation is stronger than for any other size-related indicator that we tested. Enat/C is also positively correlated with the chain entropy S, with correlation R = 0.32.
We attempted several mean-square fits of the strength of native contacts based on Equation 12. The average length behavior of these fitted curves is represented in Figure 1, but the goodness of the fit was evaluated using all 4528 protein domains studied. We first neglected the variations of G/N and S (one free parameter), obtaining Enat/C
0.23 N/C with R = 0.37. Taking into account also the calculated value of the chain entropy per residue S, we then obtained Enat/C
0.065 N/C + 0.189 SN/C with R = 0.45 for two free parameters. Note, however, that the coefficients of N/C and SN/C are expected from Equation 2 to be roughly equal and nevertheless the coefficient representing
G/N is much smaller. Assuming instead that the unfolding free energy
G is not correlated with chain length, we obtain a fit of Equation 12 in the form Enat/C
1.71/C + 0.25 SN/C, with R = 0.48 for two free parameters. The fitted value of
G corresponds in this case to
G
7 kcal/mol, which is in line with experimental data.
Note that this fit, which is the best and the most plausible one, is such that long proteins lay significantly above the predicted line: the minimal stability requirements in terms of Enat/C are predicted to decrease with chain length, but longer proteins tend to stay higher than the minimal requirements. This observation is further supported by PCA and we will come back to it later.
Chain entropy
As reported below, the amino acid composition changes with chain length so that the small amino acids Ala and Gly become more abundant in longer chains. This compositional change induces a decrease of the chain entropy of the unfolded state, S, for longer chains. This decrease is modest, 5 % in the length range examined. Since the entropy of the unfolded state contributes negatively to folding stability, we consider the variable S, defined as minus the conformational entropy per residue. The correlation between this variable and chain length is R = 0.15.
This decrease in the chain entropy per residue of the unfolded state for longer chains compensates in part for the decrease in the strength of native interactions, Enat/C. The length behavior of the entropy can be explained considering Enat/C as intermediate variable. In fact, this is negatively correlated both with chain length (R = 0.28) and with S (R = 0.32) and the partial correlation between length and entropy at fixed Enat/C is only R = 0.065. S is also strongly and negatively correlated with the mean interaction energy U
(R = 0.45). This is to be expected, since more interactive residues tend to have a large conformational entropy. However, at odds with this argument, S is uncorrelated with the mean interactivity,
h
.
Chain length, misfolding stability and design of native interactions
The decrease in the strength of native interactions with chain length can be attributed in part to the overall decrease in the protein interactivity, involving both native and non-native interactions, as discussed in another section, and in part to the fact that, in longer proteins, native interactions differ less in strength from non-native interactions.
To test this hypothesis, we measured the Z-score of native interactions, Equation 5. The larger Znat is, the stronger are the native interactions with respect to non-native interactions. We observed that Znat is negatively correlated with chain length (R = 0.41), thus confirming our hypothesis that native interactions are less optimized in longer proteins (see Figure 2).
|
This behavior can be explained by considering the condition of stability against misfolding, that we evaluate through the normalized energy gap . We first calculated this quantity by using Equation 3 on all alternative structures generated by threading the protein sequence A along all structures in the PDB with a larger or equal number of residues. The normalized energy gap evaluated in this way shows a very strong length dependence, increasing roughly as the logarithm of chain length.
This length dependence is essentially due to the fact that the number of alternative structures m generated by threading decreases for increasing chain length N, since the protein sequence can be threaded only on structures that are longer than N. This leads to underestimation of the minimal energy of alternative structures and consequently overestimation of with respect to its true value when all possible decoys are used. This artifact is not important when comparing the folding properties of proteins of similar length, but it is essential in the present study.
Therefore, in this study we did not estimate through threading, but resorted to a calculation based on the random energy model (REM) (Derrida, 1981
; Shakhnovich and Gutin, 1989
). In the framework of the REM, it is possible to estimate the minimal energy of alternative structures as
![]() | (13) |
The number of all possible independent decoys is equal to the number of contact matrices with N residues compatible with steric constraints and hydrogen bonding of core residues. This number is expected to increase exponentially with N. We therefore make the ansatz
![]() | (14) |
We obtained a rough evaluation of the two parameters A and B by imposing two conditions on our data: (1) the estimated minimal effective energy should be similar to the observed value for short proteins, for which the number of decoys generated is large; and (2) the estimated minimal effective energy should be higher than the native energy for most proteins. We found that these conditions are satisfied by parameters in the range A 0.1, B
4. The evaluated value of the exponent A agrees very well with the theoretical estimate A
0.115. With these parameters, only 105 domains out of 4528 have negative
'. The average normalized energy gap is 0.3, which is significantly larger than for random sequences,
' is not correlated with chain length and it is correlated with the estimate obtained through threading,
, with R = 0.49. These results support the plausibility of our new computation of the normalized energy gap.
The above equations allow the calculation of the normalized energy gap as a function of the zeta score Znat, Equation 5, and the standard deviation of the contact interactions,
:
![]() | (15) |
The condition that ' should be positive establishes a minimal absolute value of the zeta score as
![]() | (16) |
Also in this case, as in Figure 1, we can see that long proteins lay significantly above the predicted line: the minimal stability requirements in terms of Znat are predicted to decrease with chain length, but longer proteins tend to stay higher than the minimal requirements. We will come back on this observation when discussing the PCA.
Stability with respect to misfolding measured by ' and stability with respect to the unfolding measured by Enat/N are only weakly correlated (R = 0.13). The correlation is slightly stronger between
' and Enat/C (R = 0.17). However, their correlation is essentially due to the variable Znat, which is positively correlated with both
' (R = 0.74) and with Enat/C (R = 0.63). If we calculate the partial correlation coefficient of
' and Enat/C at fixed Znat, we find a rather strong negative correlation, R = 0.51. This negative correlation is easily explained considering the variable
U
, Equation 4, the average strength of non-native and native contacts. This quantity is related to the average hydrophobicity of the protein sequence. As expected, Enat/C and
U
are strongly correlated, with R = 0.65. Moreover, interactive sequences tend also to assign low energy to misfolded configurations, so that the normalized energy gap
' tends to be reduced (R = 0.46). PCA confirms and further clarifies this trade-off between unfolding and misfolding stability mediated by hydrophobicity.
Chain length and protein composition
Apart from the reduced optimization of native interactions with respect to the sequence background, the decrease in the strength of native interactions with chain length can also be due to the change in amino acid composition, affecting both native and non-native interactions. To address this question, we examined the association between amino acid frequencies and chain length. This was already studied by White (1992) several years ago and recently by Sandelin (2004)
, in this case limited to hydrophobic amino acids. Our present results are in qualitative agreement with previous ones and are included for completeness.
Confirming the previous results of White (1992), we observed that the cumulative frequency of positively charged amino acids, Lys and Arg, decreases with chain length, whereas the cumulative frequency of negatively charged amino acids does not vary significantly, because the significant decrease in Glu is compensated by the increase in Asp. Despite the decrease in positively charged residues, the number of salt bridges per residue increases with chain length (see below).
Both the small amino acids Ala and Gly increase significantly in content with chain length. Since these amino acids are weakly interacting, this result is in line with the observed decrease in the average interaction strength. Since their side chain has no conformational entropy, the increase in Ala and Gly content partly explains the decrease in the conformational entropy with chain length.
The frequency of Cys decreases strongly with chain length, as we discuss below in our analysis of disulfide bonds.
The most strongly interacting amino acids with aliphatic (Val, Ile, Leu, Met) and aromatic (Phe, Tyr, Trp) side chains do not show a significant association with chain length, either individually or in combination, in agreement with results of White (1992) and Sandelin (2004)
. Nevertheless, the frequency of hydrophobic residues versus length has a maximum, which is significant (P < 104). The decrease at short chains is probably due to their smaller hydrophobic core.
To test whether the changes in protein composition produce a decrease of inter-residue interaction strength with chain length, we used the interactivity scale defined by Bastolla et al. (2005), which accounts for the main component of our effective interaction matrix. The mean interactivity
h
can be calculated from the protein composition alone (see Materials and methods). Like the hydrophobic content, the mean interactivity also has a maximum as a function of chain length, which is once again attained for chains between 200 and 300 residues long (see Figure 3, bottom left). A chi-squared test with nine length bins yields
2 = 113, confirming the high significance. Both shorter and longer proteins have significantly lower interactivity and for proteins with >250 residues the mean interactivity is negatively correlated with chain length (R = 0.21). The same qualitative conclusions were obtained considering the average interaction strength
U
instead of interactivity.
|
Disulfide bonds are present in about 24% of the proteins in our data set. They are not randomly distributed among protein structures. The number of intra-chain disulfide bonds is negatively correlated with the unfolding free energy per residue that the protein would have in the absence of disulfide bonds, , Equation 6. The correlation coefficient is R = 0.31 (P < 104). A chi-squared analysis confirms that the difference in unfolding free energy among classes with different numbers of disulfide bonds is highly significant (
2 = 92, six degrees of freedom).
We also measured the average length of loops between disulfide-bonded residues, normalized by chain length (DRCO). This variable is related to the reduction in the entropy of the denatured state, which is thought to be the main mechanism by which disulfide bonds stabilize the native state (Anfinsen and Scheraga, 1975; Betz, 1993
). Also this variable is negatively correlated with
.
The number of disulfide bonds per residue is also positively correlated with the normalized energy gap ' (R = 0.20). This may be in part explained because disulfide bonds per residues are associated with weaker interactivity
h
(R = 0.21), which implies larger normalized energy gaps. Therefore, these results suggest that the increase in the number of disulfide bonds in long proteins allows the reduction of hydrophobic interactions, decreasing non-native interactions and enhancing stability against misfolding. This hypothetical mechanism is supported by the PCA.
Disulfide bonds and chain length
Investigating the relationship between disulfide bonds and chain length allows us to distinguish between the relative contribution of genetic drift and natural selection in the evolution of disulfide bonds. In case disulfide bonds are mainly fixed through random drift, we expect that their number is correlated with the number of cysteines in the protein chain, which in turn is expected to be correlated with chain length, since by hypothesis it results from a random mutation process. This leads to the expectation that the number of disulfides divided by chain length is unrelated to chain length if their evolution is driven by random drift.
Instead, the number of disulfide bonds per residue decreases with chain length for short proteins (see Figure 4). For long proteins, this number does not appear to vary systematically with chain length. The hypothesis that the average number of disulfide bonds per residue is the same for all lengths yields 2 = 190 with six degrees of freedom, which is associated with a vanishingly small probability. Dividing the data set into short (<200 residues) and long chains, we find that the number of disulfides per residue is not correlated with chain length for long chains (the correlation coefficient is R = 0.028), whereas for short chains the correlation is negative and highly significant (R = 0.24, P < 104).
|
Chains shorter than 200 residues also favor relatively longer disulfide loops than long chains (the DRCO is 0.35 and 0.16, respectively). Since longer loops imply a stronger entropy reduction in the unfolded state, this is another indication that disulfide bonds lead to a relatively greater enhancement of stability in shorter proteins. This results is in part due to the general tendency of the RCO to decrease with chain length, but also the normalized variable DRCO/RCO is significantly larger in short chains (mean 1.41, standard error of the mean 0.04) than in long chains (mean 0.91, standard error of the mean 0.06), where it is compatible with equality between DRCO and RCO.
These results are summarized in Figure 4. Taken together, they strongly reject the hypothesis that the fixation of disulfide bonds is driven by random drift for short chains, whereas they are consistent with this hypothesis for long chains.
Chain length and salt bridges
In addition to disulfide bonds, we also distinguished salt bridges from other kinds of interactions for two reasons: (1) contact energy functions such as ours cannot estimate electrostatic interactions accurately, since their contribution to the free energy of unfolding depends on a delicate balance between electrostatic interactions and desolvation penalty, whose net effect is not always stabilizing (Bosshard et al., 2004); and (2) salt bridges differ from hydrophobic interactions in that they are highly specific and they differ from disulfide bonds in that they are typically short range (Kumar and Nussinov, 1995
), therefore their role in stabilizing protein folding is expected to be distinct.
We observed that the occurrence of salt bridges depends strongly on chain length. As the length increases, salt bridges represent an increasing fraction of the inter-residue contacts. This fraction, NSALT/C, increases by 27% in the length range examined. The correlation between NSALT/C and chain length is R = 0.10, which is small but significant. The trend of the number of salt bridges per residue, NSALT/N, is much stronger, since there is an additional factor C/N. It increases by 61% of its value in the whole range and it has correlation coefficient R = 0.23 with chain length. The number of salt bridges, normalized either by N or by C, is also strongly associated with the RCO, with a negative correlation coefficient R = 0.18 and 0.25, respectively. This is in part a consequence of the negative correlation between RCO and chain length and in part (partial correlation coefficient at fixed length, R = 0.15) it is a consequence of the fact that salt bridges are frequent in contacts at short sequence distance (Kumar and Nussinov, 1995), which are more common in structures with low RCO.
The number of salt bridges per residue is also negatively correlated with the number of disulfide bonds per residue (R = 0.19), which is only in part explained by their opposite length dependence (partial correlation coefficient R = 0.16). It is positively correlated with the mean interaction energy, U
, with R = 0.16, and uncorrelated with the mean native energy per contact, therefore it is negatively correlated with Znat, with R = 0.23. In other words, proteins with better designed hydrophobic interactions tend to have fewer salt bridges.
Summarizing (see Table I), we see that salt bridges tend to substitute disulfide bonds in longer proteins, in particular in proteins with short-range contacts (smaller RCO), and they occur more frequently in sequences with large interactivity and poor design of hydrophobic interactions.
|
![]() |
Principal component analysis |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In this analysis, we considered eight variables that represent the most important determinants of protein stability with little redundancy. Two variables represent protein size and topology, that determine the structural constraints acting in protein evolution: the number of contacts, per residue C/N, strongly correlated with chain length; and the absolute contact order (ACO), an indicator of native topology, which is also strongly positively correlated with chain length. Two variables estimate two kinds of protein stability: stability against unfolding is estimated through the native energy per residue, Enat/N (however, this variable does not take into account the conformational entropy), and stability with respect to misfolding is estimated through the normalized energy gap ', evaluated through Equations 15 and 14. (The value of
' depends on the estimated parameters A and B, but we found that qualitatively results emerging from PCA are robust with respect to the choice of these parameters. For a control, we also examined the normalized energy gap obtained through threading, which is overestimated more and more for longer proteins, for which less decoys are generated. This artifact produces a spurious correlation between
and length that can be effectively eliminated through PCA, since this correlation basically only contributes in the size direction. The qualitative description of the other directions is the same as obtained through
'.) Last, we considered four types of interactions contributing to protein stability: the mean interaction strength calculated with our energy function,
U
, which mainly takes into account hydrophobicity (similar qualitative results are obtained considering the mean interactivity
h
instead); the number of disulfide bonds per residues, NDIS/N; the number of salt bridges per residue, NSALT/N; and minus the conformational entropy of the unfolded state, S.
We represent in Figure 5 the three components whose contribution to the total variance is larger than the average. The coefficients with which the eight variables enter the first three PCs are plotted in the figure as bars.
|
The first principal component (PC) accounts for 28% of the total variance. Its main contribution comes from the interaction strength, U
. Along this direction, the stability against unfolding, estimated through Enat/N, increases, whereas the stability against misfolding, estimated through the normalized energy gap, decreases. Therefore, there is a trade-off between these two kinds of stability.
This balance is influenced by the protein structure. In fact, this component is negatively correlated with the ACO (R = 0.19). We believe that this correlation is due to the influence of the ACO on the folding rate (Ivankov et al., 2003): proteins with larger ACO fold more slowly, therefore they are expected to undergo stronger selection for improved stability against misfolding, with the consequence of favoring this kind of stability with respect to stability against unfolding.
Looking at mechanisms of protein stability other than interactivity, we see that the conformational entropy of the unfolded state, S, is negatively correlated with the mean interaction strength and decreases with this component. Therefore, the increase in unfolding stability is actually smaller than the estimate based only on Enat/N. The number of disulfide bonds per residue also decreases, in line with the fact that this quantity is correlated with the deficit in unfolding free energy. We interpret the increase in disulfide bridges at low interactivity as a compensation of the otherwise reduced unfolding stability. Last, the number of salt bridges per residue contributes positively to the enhanced stability. In fact, this quantity is positively correlated with both the ACO and the interaction strength, to which it contributes positively. We will discuss salt bridges further in next section.
Second component: size
The second PC accounts for 23% of the total variance and it is mainly related to protein size, since it receives its main contribution from C/N and the ACO. Protein stability hardly varies along this component. The normalized energy gap ' is uncorrelated with it and the effective free energy per residue Enat/N is weakly negatively correlated, but its decrease is compensated for by the increase of the entropy term S. Disulfide bonds and salt bridges strongly co-vary with this component, the former decreasing and the latter increasing in importance, whereas the mean interaction strength
U
is uncorrelated with it. As we will discuss later, we interpret the decrease in disulfide bridges with chain length as an indication of the fact that they are positively selected in short proteins, whereas they evolve in an effectively neutral way in long proteins.
Third component: design
The third PC accounts for 14.5% of the variance and receives its main contribution from the normalized energy gap. Both types of stability improve along this direction. Since the average interaction strength and the chain entropy hardly vary along this direction, the improvement has to be attributed to the better design of native interactions. In line with this, if we include Znat in the analysis (not shown), this variable gives a strong positive contribution to this component. This component is weakly positively correlated with the two structural indicators C/N and ACO. However, the correlation becomes stronger if we distinguish between short and long domains: the correlations become R = 0.34 and 0.36, respectively, for short domains and R = 0.61 and R 0.39 for long domains.
Concerning the stabilization mechanisms, this PC is almost uncorrelated with the mean interactivity U
and with S, positively correlated with the number of disulfide bonds per residue and negatively correlated with the number of salt bridges per residue. Therefore, it appears that most of the stabilization is achieved by optimizing the design of generic interactions.
We interpret this component as a direction along which native interactions are more efficiently designed, as a consequence of enhanced selection for protein stability. Since this component is positively correlated with both C/N (in particular for long domains) and with the ACO, large values of this component appear to be associated with more slowly folding proteins. We believe that these proteins undergo stronger selective pressure for enhanced stability against misfolding.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
To rationalize and separate the roles of these contributions, we performed a PCA. This allows the identification of three main directions, along which native interactions may have evolved as a result of the stability constraints specific to their structure. Along the first direction, stability against unfolding increases at the expense of stability against misfolding. Along the second direction, the two kinds of stabilities do not change, but stability requirements vary in response to protein structure, mainly its size. Along the third direction, both kinds of stability increase, mainly through a better design of native interactions. We will discuss these three evolutionary directions separately.
The hydrophobicity dimension: trade-off between unfolding and misfolding stability
The native state of a protein must be stable with respect to the unfolded state (unfolding stability) and with respect to compact, incorrectly folded conformations (misfolding stability). These two kinds of stabilities are negatively correlated in families of orthologous proteins, sharing the same structure and function but differing in thermodynamic properties (Bastolla et al., 2004). Here their association is further confirmed by the negative correlation between the strength of native interactions and the normalized energy gap for proteins with different structure, along the first principal component, expressing the variation of interactivity at constant chain length (see Figure 5).
The association between the two kinds of stability can be easily understood: A very interactive protein with many hydrophobic residues would be very stable against unfolding, but it would not fold spontaneously in a unique state, because also alternative configurations would have comparably low free energy. It would behave almost like a homopolymer or a random heteropolymer, whose normalized energy gap is expected to vanish. Consistently, Sandelin (2004) has shown through lattice simulations that the requirement of non-vanishing energy gap imposes a limit on the maximum hydrophobicity of model proteins, which is more stringent for longer proteins.
There is experimental support for this view: proteins with low hydrophobicity and high charge are unfolded in native conditions (Uversky, 2002a), implying that some minimal hydrophobicity value is required for proteins to fold. On the other hand, very hydrophobic sequences tend to adopt partially folded configurations (Uversky, 2002b
), which sometimes are off-pathway and may trigger aggregation.
We believe that stability against misfolding is an important selective property, favoring correct folding and preventing aggregation. This is also a central requirement in some models of protein evolution (Bastolla et al., 1999, 2003
). The trade-off that we have demonstrated implies that protein evolution has to mediate between stability against unfolding and stability against misfolding. It was shown recently that a mutation bias favoring or depressing more hydrophobic mutations can shift this balance (Bastolla et al., 2004
). The results presented here indicate that protein length and topology can also shift this balance: longer proteins with larger ACO tend to favor misfolding stability over unfolding stability. It is possible that this is due to their slower folding rate, that exerts a stronger selective pressure against misfolding.
The chain length dimension: weakening of stability requirements
The second principal component is mainly correlated with protein size. Along this direction the two measures of protein stability hardly change, but native interactions become on average weaker with chain length (smaller Enat/C) and less optimized with respect to non-native ones (smaller Znat), meaning that in longer proteins native interactions are more similar to background non-native interactions. Since our energy function is mainly based on hydrophobic interactions, this result refers mainly to this type of interactions.
The conclusion that the hydrophobic profile of the protein sequence is less optimized for its structure in longer proteins than in shorter ones was also reached in a recent study of site-specific amino acid distributions. In this study, amino acid distributions were predicted to have a Boltzmann form, whose parameters represent the intensity of site-specific selection for structural stability. The absolute values of these selection parameters were found to decrease with chain length (Porto et al., 2004), implying that in longer protein it is more frequent to find residues that are less fit to their structural environment. Consistently, Sandelin noticed that the fraction of core positions increases with chain length (see Equation 8), but the frequency of hydrophobic residues is almost constant with chain length (as we have shown here, it even slightly decreases for N > 300). Therefore, longer proteins have on average a larger fraction of buried polar residues (Sandelin, 2004
).
We interpret this reduced optimization of the hydrophobic profile in the sense that stability constraints become easier to fulfil for longer proteins, as expected from a simple model of protein folding, Equations 12 and 16. Since longer proteins have a larger number of native contacts per residue, they can counterbalance more easily the reduction in chain entropy upon folding, which is proportional to the number of residues. In this context, the tendency of the average native interactions to decrease with chain length is a manifestation of the general tendency of proteins to attain only marginal stability, as demonstrated in the neutral model of Taverna and Goldstein (2002). Moreover, the requirement of stability against misfolding imposes that the absolute value of the Z-score of native interactions must be above a minimum value. This condition, Equation 16, is easier to satisfy for longer chains. Therefore, the drift of the actual Z-scores towards their minimal allowed value is another manifestation of the tendency of proteins to reach marginal stability.
An exception to this picture is the increase with chain length of the number of salt bridges, both per residue and per contact. One possible explanations is that salt bridges are positively selected to compensate for the reduction in hydrophobic interactions and disulfide bonds: only these two types of interactions sense the weakening of stability requirements with chain length. However, this interpretation does not explain why salt bridges should be preferred for longer proteins. We think that this preference is due to the peculiarity of salt bridges: they are highly specific interactions and they are typically short range. As such, they are expected to accelerate the folding rate of long proteins, which would otherwise fold very slowly. Therefore, our interpretation is that salt bridges in longer proteins are preferred not in order to increase the stability of the folded state, but in order to make folding faster and more specific. As a preliminary test of this interpretation, we looked at the number of salt bridges as a function of the distance cut-off used in this definition. When this cut-off is reduced to 2.5 Å, corresponding to the most strongly interacting salt bridges, the length dependence completely disappears: long proteins are not enriched in the strongest salt bridges, but only in the weak ones.
Protein evolution and disulfide bonds
The general scenario described above is well illustrated through the example of disulfide bonds, which are the strongest interactions in protein structures.
In analyzing disulfide bonds, it is important to distinguish intracellular and extracellular proteins. Disulfide bonds are expected to be found almost exclusively in the latter ones, since the chemical environment inside a typical cell is reducing (Gilbert, 1990) and cysteine residues in intracellular proteins are generally found in their reduced form with free sulfhydryl groups (Mallick et al., 2002
). Correlations between extracellularity and protein size or other topological or thermodynamic indicators might therefore confuse the analysis of disulfide bonds. However, the PDB files do not contain automatically readable information about the cellular location of the protein and we were not able to distinguish intracellular and extracellular proteins in our large-scale analysis.
Our results are consistent with the hypothesis that disulfide bonds evolve in an effectively neutral way in long proteins, but not in short proteins, where they are much more frequent than expected under neutral evolution. In fact, shorter proteins have a larger frequency of cysteine residues, a larger frequency of those form disulfide bonds and their average relative loop length (DRCO) is larger than for other types of contacts (RCO). This is consistent with the hypothesis that short proteins experience a stronger selection for strong stabilizing interactions, owing to their reduced number of contacts per residue.
The enrichment of disulfides in short chains is complementary to the finding that proteins of thermophilic bacteria, facing more severe problems of unfolding stability, are also enriched in disulfide bonds (Mallick et al., 2002).
We also observed a negative correlation between the number of disulfide bonds and the predicted unfolding free energy per residue that the protein would have in the absence of them. We can decompose the latter in two factors as For short proteins, this negative correlation is in large part due to the increase in disulfide bonds per residue for decreasing C/N (R = 0.20), which implies a deficit in unfolding free energy that may enhance the selection pressure for disulfide bonds. For long proteins, this correlation is weaker and it is entirely due to the negative correlation between
and disulfide bonds per residue, since their correlation with C/N is now very small and positive in sign. This is another manifestation of the tendency of naturally evolved proteins to attain only marginal stability.
The design dimension: enhancing protein stability
Both unfolding stability and misfolding stability increase together along the third principal component. This is achieved without a large increment in the strength of hydrophobic interactions and with only a small increment in the contribution of disulfide bonds and a small decrease in the contribution of salt bridges. The key to this success is a better design of hydrophobic interactions, through an increased Znat. This is an example of the general fact that protein stability can be improved through evolution when it is needed.
The grounds for this enhanced stability are of course functional, but they are also in part structural. Proteins with better designed native interactions tend to be longer and to have larger ACO, as demonstrated by the third principal component. This does not contrast with the observations presented above. The scenario that these results suggest is the following: minimal stability requirements are more easily satisfied for longer proteins, but longer proteins also tend to remain above these minimal stability thresholds more than short proteins do. Both principles are exemplified in Figures 1 and 2.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Alm,E. and Baker,D. (1999) Proc. Natl Acad. Sci. USA, 96, 1130511310.
Anfinsen,C. and Scheraga,H. (1975) Adv. Protein Chem., 29, 205299.[Medline]
Bastolla,U., Roman,H.E. and Vendruscolo,M. (1999) J. Theor. Biol., 200, 4964.[CrossRef][ISI][Medline]
Bastolla,U., Farwer,J., Knapp,E.W. and Vendruscolo,M. (2001) Proteins, 44, 7996.[CrossRef][ISI][Medline]
Bastolla,U., Porto,M., Roman,H.E. and Vendruscolo,M. (2003) J. Mol. Evol., 56, 243254.[CrossRef][ISI][Medline]
Bastolla,U., Moya,A., Viguera,E. and van Ham,R.C.H.J. (2004) J. Mol. Biol., 343, 14511466.[CrossRef][ISI][Medline]
Bastolla,U., Porto,M., Roman,H.E. and Vendruscolo,M. (2005) Proteins, 58, 2230.
Bava,K.A., Gromiha,M.M., Uedaira,H., Kitajima,K. and Sarai,A. (2004) Nucleic Acids Res., 32, D120D121.
Betz,S. (1993) Protein Sci., 2, 1555115558.
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G. Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235242.
Bosshard,H.R., Marti,D.N. and Jelesarov,I. (2004) J. Mol. Recognit., 17, 116.[CrossRef][ISI][Medline]
Chandonia,J.M., Walker,N.S., Lo Conte,L., Koehl,P., Levitt,M. and Brenner,S.E. (2004) Nucleic Acids Res., 32, D189D192.
Demetrius,L. (1995) Protein Eng., 8, 791800.[ISI][Medline]
Demetrius,L. (2002) J. Theor. Biol., 217, 397411.[CrossRef][ISI][Medline]
Derrida,B. (1981) Phys. Rev. B, 24, 2613.[CrossRef][ISI]
Finkelstein,A.V. and Badretdinov (1997) Fold. Des., 2, 115121.[CrossRef][ISI][Medline]
Galzitskaya,O.A. and Finkelstein,A.V. (1999) Proc. Natl Acad. Sci. USA, 96, 1129911304.
Gilbert,H.F. (1990) Adv. Enzymol. Relat. Areas Mol. Biol., 63, 69172.[ISI][Medline]
Goldstein,R., Luthey-Schulten,Z.A. and Wolynes,P.G. (1992) Proc. Natl Acad. Sci. USA, 89, 49184922.
Gutin,A.M., Abkevich,V.I. and Shakhnovich,E.I. (1995) Proc. Natl Acad. Sci. USA, 92, 12821286.
Gutin,A.M., Abkevich,V.I. and Shakhnovich,E.I. (1996) Phys. Rev. Lett., 77, 54335436.[CrossRef][ISI][Medline]
Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522524.
Ivankov,D.N., Garbuzynskiy,S.O., Alm,E., Plaxco,K.W., Baker,D. and Finkelstein,A.V. (2003) Protein Sci., 12, 20572062.
Kimura,M. (1983) The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge.
Kowald,A. and Demetrius,L. (2005) Proc. R. Soc. Lond. B, in press.
Kumar,S. and Nussinov,R. (1999) J. Mol. Biol., 293, 12411255.[CrossRef][ISI][Medline]
Lobry,J.R. (1997) Gene, 205, 309316.[CrossRef][ISI][Medline]
Mallick,P., Boutz,D.R., Eisenberg,D. and Yeates,T.O. (2002) Proc. Natl Acad. Sci. USA, 99, 96799684.
Muñoz,V. and Eaton,W.A. (1999) Proc. Natl Acad. Sci. USA, 96, 1131111316.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536540.[CrossRef][ISI][Medline]
Orland,H., Itzykson,C. and de Dominicis,C. (1985) J. Phys. Lett., 46, L353.
Plaxco,K.W., Simons,K.T. and Baker,D. (1998) J. Mol. Biol., 277, 985994.[CrossRef][ISI][Medline]
Porto,M., Roman,H.E., Vendruscolo,M. and Bastolla,U. (2005) Mol. Biol. Evol., 22, 630638 [erratum: Mol. Biol. Evol., 22, 1165].
Sandelin,E. (2004) Biophys. J., 86, 2330.
Shakhnovich,E.I. and Gutin,A.M. (1989) Biophys. Chem., f 187.
Sueoka,N. (1961) Proc. Natl Acad. Sci. USA, 47, 469478.
Taverna,D.M. and Goldstein,R.A. (2002) Proteins, 46, 105109.[CrossRef][ISI][Medline]
Thirumalai,D. (1995) J. Phys., 5, 14571469.
Uversky,V.N. (2002a) Eur. J. Biochem., 269, 212.
Uversky,V.N. (2002b) FEBS Lett., 514, 181183.[CrossRef][ISI][Medline]
White,S.H. (1992) J. Mol. Biol., 227, 991995.[CrossRef][ISI][Medline]
Ziehe,M. and Demetrius,L. (2005) Proc. R. Soc. Lond. B, in press.
Received December 9, 2004; revised and accepted June 21, 2005
Edited by David Thirumalai
|