* Institut für Festkörperphysik, Technische Universität Darmstadt, Hochschulstr. 8, 64289 Darmstadt, Germany; Dipartimento di Fisica and INFN, Università di Milano, Via Celoria 16, 20133 Milano, Italy;
Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK;
Centro de Astrobiología (INTA-CSIC), 28850 Torrejón de Ardoz, Spain
Correspondence: E-mail: porto{at}fkp.tu-darmstadt.de; bastollau{at}inta.es.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: Amino acid distributions Boltzmann parameters divergent evolutionary changes principal eigenvector of the contact matrix structurally constrained neutral evolution
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
To take into account the protein-level selection, it is necessary to consider site-specific amino acid distributions within a protein family (Halpern and Bruno 1998). Similarly, the use of site-specific substitution matrices improves substantially maximum likelihood methods for reconstructing phylogenetic trees (Liò and Goldman 1998; Koshi and Goldstein 1998; Koshi et al. 1999; Thorne 2000; Fornasari et al. 2002).
In the studies cited above, site-specific constraints are obtained either through simulations of a model of protein evolution or by fitting parameters in a maximum likelihood framework. In this work we derive an analytic expression, without adjustable parameters, for site-specific amino acid distributions and show that they are in very good agreement with simulations of a model of protein evolution and with site-specific amino acid distributions obtained from the Protein Data Bank (PDB).
The site-specific evolutionary patterns that we predict are associated with a key topological indicator of native state topology, namely the principal eigenvector of the contact matrix (Bastolla et al. 2004; Porto et al. 2004). Our approach is based on the structurally constrained neutral model (SCN model; Bastolla, Vendruscolo, and Roman 1999, 2000a; Bastolla et al. 2002, 2003a, 2003b), where possible mutations are tested for conservation of structural stability using a computational model of protein folding (Bastolla et al. 1998; Bastolla, Vendruscolo, and Knapp 2000b; Bastolla, Knapp, and Vendruscolo 2001). This approach is similar in spirit to the structurally constrained protein evolution model (SCPE; Parisi and Echave 2001), which uses a different criterion to assess the thermodynamic stability. Notice that the SCN model does not assume independence at different sites; in fact, we have recently shown that site dependence is linked to a rather debated feature of neutral evolution, the overdispersion of neutral substitutions (Bastolla et al. 2003b).
Site-specific amino acid distributions with Boltzmann form, similar to those that we derive here, have already been used in other studies of protein evolutionfor instance by Koshi and Goldstein (1998), Koshi, Mindell, and Goldstein (1999), and Dokholyan et al. (2001, 2002). As we will discuss, the main difference between our approach and these previous models consists in the fact that we compute the Boltzmann parameters analytically, without the need for a database of sequence families.
As we have previously shown (Bastolla et al. 2003a), the site-specific conservation patterns provided by the SCN model, and therefore also by the present analytic formulation, match qualitatively structural conservation patterns found in bioinformatics studies (Ptitsyn 1998; Ptitsyn and Ting 1999). Further, the site-specific amino acid distribution that we present here also enables us to derive an estimate of the sequence entropy compatible with a given fold, which is termed the "designability" of the fold (Li, Tang, and Wingreen 1998; Helling et al. 2001). These results are in agreement with recent studies suggesting that the designability can be inferred from the protein topology alone (Koehl and Levitt 2002; England and Shakhnovich 2003).
![]() |
Background |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() | (1) |
For testing the stability of a protein conformation, we use two computational parameters: (1) The effective energy per residue, E(A, C)/N, where N is the protein length, which correlates strongly with the folding free energy per residue for a set of 18 small proteins that are folding with two-states thermodynamics (correlation coefficient 0.91; U. Bastolla, unpublished result); (2) The normalized energy gap , which characterizes fast folding model sequences (Bastolla et al. 1998) with well correlated energy landscapes (Bryngelson and Wolynes 1987; Goldstein, Luthey-Schulten, and Wolynes 1992; Abkevich, Gutin, and Shakhnovich 1994; Gutin, Abkevich, and Shakhnovich 1995; Klimov and Thirumalai 1996). A mutated sequence is considered thermodynamically stable if both computational parameters are above predetermined thresholds (Bastolla et al. 2003a).
The interaction matrix U can be written in spectral form as where
are the eigenvalues, ranked by their absolute value, and u(
) are the corresponding eigenvectors. The main contribution to the interaction energy comes from the principal eigenvector u(1), which has a negative eigenvalue
1 < 0, as
1 u(1)(a) u(1)(b) has correlation coefficient 0.81 with the elements U(a, b) of the full matrix. It is well known that hydrophobic interactions determine the most significant contribution to pairwise interactions in proteins, so that the components of the main eigenvector are strongly correlated with experimental hydropathy scales (Casari and Sippl 1992; Li, Tang, and Wingreen 1997). Considering only this main component, we define an approximate effective energy function H(A, C), yielding a good approximation to the full contact energy given by equation (1),
![]() | (2) |
The HP provides a vectorial representation of protein sequences. In turn, a convenient vectorial representation of protein structures may be obtained by the principal eigenvector of the contact matrix C, which we denote by PE and we indicate by c. This vector maximizes the quadratic form ijCijcicj with the constraint
In this sense, ci can be interpreted as the effective connectivity of position i, since positions with large ci are in contact with as many as possible positions j with large cj. All its components have the same sign, which we choose by convention to be positive. Moreover, if the contact matrix represents a single connected graph (as it does for single-domain globular proteins), the information contained in the principal eigenvector is sufficient to reconstruct the whole contact matrix (Porto et al. 2004).
For any given protein fold, identified through its PE, we define the optimal HP, hopt, as the HP that minimizes the approximate effective free energy, given by equation (2), for fixed first and second moment of the hydrophobicity vector, h
= N1
ih(Ai) and
h2
= N1
ih(Ai)2. We impose a condition on the mean hydrophobicity,
h
, because, if a sequence is highly hydrophobic, not only the native structure but also alternative compact structures will have favorable hydrophobic energies. The selection to maintain a large normalized energy gap is therefore expected to place constraints on the value of
h
.
The above property of the PE, c, suggests that the optimal HP, hopt, should be strongly correlated with the PE (Bastolla et al. 2004). In this formulation, h
and
h2
are not determined by the native structure but depend on the mutation and selection process (see below).
The optimal HP constitutes an analytic solution to the sequence design problem with energy function given by equation (2), and an approximate solution to sequence design with the full contact energy function given by equation (1). In the SCN evolutionary model, attempted mutations are accepted whenever the effective free energy and the normalized energy gap overcome predefined thresholds. Therefore, we do not expect that the optimal HP is ever observed during evolution, but we do expect thermodynamically stable sequences compatible with a given fold to have HPs distributed around the optimal one. We have verified this prediction, finding that the HPs of individual selected sequences are correlated with the principal eigenvector of the protein fold with correlation coefficient of 0.45 (averaged over seven protein folds and over hundred thousands of simulated sequences per fold), whereas the HP averaged over all sequences compatible with a given fold, [h]evol, correlates much more strongly with the principal eigenvector of that fold, with mean correlation coefficient 0.96 averaged over the same seven folds (Bastolla et al. 2004). These results show that one can recover the optimal HP through an evolutionary average of the HPs compatible with the protein fold.
We found a similar result for the protein families represented in the PFAM database (Bateman et al. 2000) and in the families of structurally similar proteins (FSSP) (Holm and Sander 1996) database. In this case, however, the correlation between the principal eigenvector and the evolutionary average of the HP is weaker: The average correlation coefficient is 0.57 for PFAM families and 0.58 for FSSP families (Bastolla et al. 2004). This weaker correlation is not unexpected, since functional conservation, which is not accounted for in the SCN model, plays an important role in protein evolution, and the effective energy function that we use to test thermodynamic stability is only an approximation. In addition, the number of sequences used to calculate the average HP in the PFAM and FSSP databases is much smaller than the number of sequences obtained by the SCN model. When the average HP is computed using only a few hundred of SCN sequences, the same order of magnitude as in PFAM or FSSP families, the correlation with the principal eigenvector also drops to values comparable to those observed for the PFAM and the FSSP sequence databases.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Our previous results indicate that the evolutionary average of the hydrophobicity vector, [h]evol, is correlated with the principal eigenvector c of the native contact matrix. To derive an analytical expression for i(a), we now assume that the correlation coefficient between the principal eigenvector and the average HP is 1, i.e., r([h]evol, c) = (
[h]evol·c
[h]evol
c
)/(
c
[h]evol) = 1, and we obtain
![]() | (3) |
![]() | (4) |
Equations (3,4) are the conditions that the stationary distributions i(a) have to fulfill in order to guarantee a perfect correlation between the principal eigenvector and the average HP. We assume that these conditions are the only requirements that the
i(a) have to meet, so that at stationarity the
i(a) are the distributions of maximum entropy compatible with the above conditions. It is well known that the solution of this optimization problem are Boltzmann-like (exponential) distributions, characterized by an effective "temperature" |ßi|1 that, in this context, varies from site to site and measures the tolerance of site i to accept mutations over very long evolutionary times,
![]() | (5) |
![]() | (6) |
Equation (6) can be interpreted as follows: (1) Positions with a large eigenvector component ci are buried in the core of the protein and are therefore, with high probability, occupied by hydrophobic amino acids (positive h(a)) and have large and negative ßi; (2) surface positions with small ci are more likely occupied by polar amino acids (negative h(a)) and have large and positive ßi; and (3) intermediate positions are the most tolerant, having a small negative or small positive ßi.
Similar forms of site-specific amino acid distributions have been previously proposed by other authors; see for instance Koshi and Goldstein (1998), Koshi, Mindell, and Goldstein (1999), and Dokholyan et al. (2001, 2002). We will discuss analogies and differences between these previous approaches and the present one in the last section.
Validation: SCN Model
We first verified these predictions using the simulated trajectories obtained through the SCN model. We found that the site-specific stationary distributions of amino acids, i(a), are well fitted by an exponential function of hydrophobicity,
i(a)
exp[ßi h(a)], where we use the interactivity scale given by the main eigenvector of our interaction matrix.
Analytically, the expected site-dependent Boltzmann parameter ßi can be calculated in an implicit form by rewriting equation (6) as
![]() | (7) |
The predictions derived in this way, for the ribonuclease fold (PDB id. 7rsa), are compared in figure 1 to the ßi obtained by fitting the distributions i(a) simulated through the SCN model. The prediction does not involve any adjustable parameter, since A and B are calculated from
[h]evol
and
as determined by the results of the SCN model. The latter values do not depend on the native structure, but they depend generally on the mutation and selection process, and specifically on its realization as simulated through the SCN model. The other proteins studied show a similar behavior, and the mean correlation coefficient between predicted and observed ßi is 0.935 (cf. table 1).
|
|
![]() | (8) |
|
We thus selected 774 sequences of various lengths, 404 of which were shorter than 200 amino acids. We first considered only this subset of short proteins, and then the whole data set, divided in bins of similar lengths. We counted the number of each of the 20 amino acids as a function of ci/c
, where
c
denotes the average over a single structure. We used a bin size of 0.05 for ci/
c
2.5 and a bin size of 0.1 for ci/
c
> 2.5. Then, for each bin of ci/
c
, we fitted the observed distributions
ci/
c
(a) with an exponential function of the hydrophobicity parameters,
ci/
c
(a)
exp[ßci/
c
h(a)]. As in the case of the SCN simulations, we used the interactivity scale derived from our effective free energy function. The exponential fit was sufficiently good, and it yielded the observed Boltzmann parameters as a function of the normalized PE components.
Next we calculated the predicted Boltzmann parameters through the equation:
![]() | (9) |
![]() | (10) |
The observed Boltzmann parameters are compared in figure 3 to the predictions of equation (9). Notice that the agreement is indeed remarkable, as the predictions do not involve any adjustable parameter, because à and are calculated from the PDB data.
|
|
One convenient measure of the amino acid conservation at a given position is given by the rigidity, defined as
![]() | (11) |
![]() | (12) |
A standard information-theoretic measure of site-specific sequence conservation is given by the entropy of the amino acid distribution,
![]() | (13) |
An important property of the entropy is that its exponential, exp(Si), provides an estimate of the average number of amino acid types acceptable at position i over very long evolutionary times. The exponential of the sum of all site-specific entropies, exp(iSi), gives an estimate of the sequence space compatible with a given fold, termed the designability of the fold, where we are assuming that the amino acid distributions at different positions are independent. Although this assumption is clearly oversimplified, the estimate of designability that can be obtained should be a valuable approximation, and our approach allows us to connect it explicitly to a topological feature of the protein native structure (Koehl and Levitt 2002; England and Shakhnovich 2003).
![]() |
Discussion and Conclusions |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The relationship between a structural profile (the PE) and an evolutionary profile (the Boltzmann parameters) that we find has an interesting interpretation. Positions with a large principal eigenvector component are strongly interacting with other positions in the core of the protein. Therefore, they preferentially host strongly hydrophobic amino acids (large negative ßi). In contrast, positions with a small PE are contained in loops and, with higher probability, they host polar amino acids (large positive ßi).
Despite this general interpretation, we have verified, using the SCN model of protein evolution, that the PE has the strongest predictive power among several similar structural indicators that quantify the difference between core and surface positions of a protein. As alternative structural indicators, we considered the total number of contacts, the number of long-range contacts (separated by more than 10 residues along the sequence), and the contact order (the average loop length of each contact involving a given site), and we correlated them with measures of site-specific conservation, but we always found a correlation significantly weaker than for the PE.
The distributions derived here refer to very long evolutionary times, when memory of the starting sequence has been lost. We recall the assumptions that we made for deriving the site-specific distributions. The first assumption is that selection on folding stability can be represented effectively as a maximum correlation between the hydophobicity profile (HP) of sequences compatible with a given fold and the optimal HP of that fold, the latter basically coinciding with the PE. This assumption follows directly from an approximation of our effective free energy function with its principal (hydrophobic) component. The second assumption is that the average of the HP of selected sequences over very long evolutionary times has a correlation coefficient of unity with the PE; i.e., all other energetic contributions average out. The third assumption is that this correlation is the only relevant property of the site-specific amino acid distributions; in other words, these distributions are the distributions of maximum entropy whose site-specific averages have correlation of one with the PE, thus fulfilling the stability requirement. From these three assumptions, the Boltzmann form of the amino acid distributions follows. To compute the site-specific Boltzmann parameters, however, we still have to know the positional mean and standard deviation of the site-specific HPs. These quantites are determined by the mutation process and by the selection parameters. They can be computed directly from the data, in such a way that the analytic prediction does not contain any free parameter.
Boltzmann distributions, such as those proposed in this work, have a long history in studies of protein structure and evolution. Structural properties of native protein structures, as for instance amino acid contacts, have been assumed to be Boltzmann-distributed (Miyazawa and Jernigan 1985), and Boltzmann statistics for structural elements were predicted in stable folds of globular proteins (Finkelstein, Badretdinov, and Gutin 1995). Our work may point to an additional explanation for such distributions.
Shakhnovich and Gutin (1993) proposed a model of sequence design through Monte Carlo optimization, which produced a Boltzmann distribution in sequence space. A mean field approximation of this model (Dokholyan et al. 2001, 2002) results in site-specific amino acid distributions of the form
![]() | (14) |
Koshi and Goldstein (1998) and Koshi, Mindell, and Goldstein (1999) assumed site-dependent Boltzmann distributions of physicochemical amino acid properties, deriving from it a protein evolution model that they use for phylogenetic reconstruction in a maximum likelihood framework. Because the properties they used are hydrophobicity and amino acid size, their proposed distributions are formally a general case of those derived in this work. There are, however, two important differences between their approach and ours: first, in our approach sites are structurally classified according to the PE component, which is a structural indicator very strongly correlated with conservation; second, whereas the Boltzmann parameters are treated as free parameters by Koshi, Mindell, and Goldstein and fitted in a maximum likelihood framework, in our approach they are computed analytically.
Kinjo and Nishikawa have very recently pointed out the existence of a strong relationship between hydrophobicity and the main eigenvector of substitution matrices derived from protein alignments with various values of the sequence identity of the aligned proteins (Kinjo and Nishikawa 2004). They looked at the eigenvector corresponding to the largest eigenvalue (in absolute value) of the substitution matrices. For high sequence identities (above 35%), this eigenvector indicates the propensity of the amino acid to mutate over short evolutionary times (mutability). For low sequence identities (below 35%), corresponding to long evolutionary times, this eigenvector is very strongly correlated with hydrophobicity. This correlation is easily understood in light of the result presented here. In fact, Kinjo and Nishikawa used the Henikoff and Henikoff's method (Henikoff and Henikoff 1992) for deriving substitution matrices from observed frequencies of aligned amino acids at positions with various PE values. Using our notations, these substitution matrices can be indicated as M(a, b) log[
i(a)
i(b)
/
i(a)
i(b)
], where the angle brackets denote positional average. In other words, these substitution matrices measure the tendency of two residue types a and b to co-occur at the same sites. The relationship between large time substitution matrices and hydrophobicity therefore gives independent support to our main result.
In dealing with protein families generated from natural evolution, our approach has two main difficulties: first, it cannot take into account functional conservation, and second, a large number of sequence pairs separated by very long evolutionary time is required. Because of these difficulties, the site-specific conservation predicted by our model agrees only qualitatively with site-specific conservation observed in aligned protein families (Bastolla et al. 2003a). For tackling these difficulties, we tested our analytic predictions on a large set of proteins in the PDB. Positions belonging to different protein folds were counted together in the same structural class characterized by the value of the normalized PE.
The agreement between the analytic prediction and the observed distributions may be surprising in view of the fact that equation (9) takes into account neither the genetic code nor the codon usage. On the one hand, it is known that the observed frequencies of amino acids correlate strongly with their number of synonymous codons, and even more strongly with their cumulative codon frequencies expected in the simple hypothesis that the three bases constituting each codon are independently distributed (Sueoka 1961). Moreover, the frequency of nucleotides at coding positions correlate strongly with the frequency of nucleotides at noncoding positions such as the synonymous 3rd codon positions (Bernardi and Bernardi 1986), which are thought to reflect the mutational process. On the other hand, mutations are not adequately represented in the SCN model, which is a model at the amino acid level where all amino acid mutations are equiprobable.
The most straightforward way to include the biases of the mutational process is to modify equation (5) with i(a)
w(a) exp[ßi h(a)], where the weights w(a) are either the number of synonymous codons or their cumulative frequency, estimated using average nucleotide compositions. We will address this issue in future work.
We also observed a length effect in the dependence of the Boltzmann parameters ßi on the PE. For longer proteins, this dependence becomes weaker, approximately following a law of the type A0 + A1/N. This result holds for SCN simulations and, although less significantly, for frequency distributions obtained from the PDB. Formally, the result directly follows from the fact that the evolutionary average of the hydrophobicity, [hi]evol, varies less and less across different positions i for longer proteins. This length dependence of the HP is consistent with the fact that the position average of [hi]evol decreases with chain length; i.e., the length of a protein influences its composition (White 1992; Bastolla and Demetrius, in preparation), such that longer proteins tend to become less hydrophobic. (We note, however, that these results were obtained using the interactivity scale, and they cannot be significantly generalized to other hydropathy scales.) These results can be understood considering that longer proteins are stabilized by a larger number of interactions per residue. Selection for protein stability is therefore expected to be weaker on individual interactions. In support of this prediction, the average interaction energy per contact in PDB structures was found to decrease with chain length (Bastolla and Demetrius, unpublished data).
The site-specific amino acid frequencies that we derived may find applications in the estimation of site-specific substitution matrices (Liò and Goldman 1998; Thorne 2000; Fornasari, Parisi, and Echave 2002), and site-specific structural conservation. Our approach may provide an analytic understanding of the limits imposed by structural and functional contraints to the divergent evolution of protein sequences and of the reliability of phylogenetic reconstructions from protein sequences for anciently diverged taxa (Meyer, Cusanovich, and Kamen 1986). A better understanding of structural conservation may also improve the prediction of functional conservation at sites that appear more conserved than expected on a structural basis alone (Casari, Sander, and Valencia 1995; Ota, Kinoshita, and Nishikawa 2003). The dependence of amino acid distributions on the PE that we presented here is also a step toward determining the structural determinants of protein "designability" (Li, Tang, and Wingreen 1998; Helling et al. 2001; England and Shakhnovich 2003), namely the volume of sequence space compatible with a given fold.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
Peter Lockhart, Associate Editor
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Abkevich, V. I., A. M. Gutin, and E. I. Shakhnovich. 1994. Free energy landscapes for protein folding kineticsintermediates, traps and multiple pathways in theory and lattice model simulations. J. Chem. Phys. 101:60526062.[CrossRef][ISI]
Bastolla, U., H. Frauenkron, E. Gerstner, P. Grassberger, and W. Nadler. 1998. Testing a new Monte Carlo algorithm for protein folding. Proteins 32:5266.[CrossRef][ISI][Medline]
Bastolla, U., M. Vendruscolo, and H. E. Roman. 1999. Neutral evolution of model proteins: diffusion in sequence space and overdispersion. J. Theor. Biol. 200:4964.[CrossRef][ISI][Medline]
. 2000a. Structurally constrained protein evolution: results from a lattice simulation. Eur. Phys. J. B 15:385397.[CrossRef][ISI]
Bastolla, U., M. Vendruscolo, and E. W. Knapp. 2000b. A statistical mechanical method to optimize energy parameters for protein folding. Proc. Natl. Acad. Sci. USA 97:39773981.
Bastolla, U., E. W. Knapp, and M. Vendruscolo. 2001. How to guarantee optimal stability for most protein native structures in the Protein Data Bank. Proteins 44:7996.[CrossRef][ISI][Medline]
Bastolla, U., M. Porto, H. E. Roman, and M. Vendruscolo. 2002. Lack of self-averaging in neutral evolution of proteins. Phys. Rev. Lett. 89:208101/1208101/4.
. 2003a. Connectivity of neutral networks, overdispersion and structural conservation in protein evolution. J. Mol. Evol. 56:243254.[CrossRef][ISI][Medline]
. 2003b. Statistical properties of neutral evolution. J. Mol. Evol. 57:S103S119.[CrossRef][ISI][Medline]
. 2005. The principal eigenvector of contact matrices and hydrophobicity profiles in proteins. Proteins 58:2230.[CrossRef][Medline]
Bateman, A., E. Birney, R. Durbin, S. R. Eddy, K. L. Howe and E. L. L. Sonnhammer. 2000. The Pfam contribution to the annual NAR database issue. Nucleic Acids Res. 28:263266.
Bernardi, G., and G. Bernardi. 1986. Compositional constraints and genome evolution. J. Mol. Evol. 24:111.[ISI][Medline]
Bryngelson, J. D., and P. G. Wolynes. 1987. Spin-glasses and the statistical-mechanics of protein folding. Proc. Natl. Acad. Sci. USA 84:75247528.[Abstract]
Casari G., and M. J. Sippl. 1992. Structure-derived hydrophobic potential. Hydrophobic potential derived from X-ray structures of globular proteins is able to identify native folds. J. Mol. Biol. 224:725732.[ISI][Medline]
Casari, G., C. Sander, and A. Valencia. 1995. A method to predict functional residues in proteins. Nat. Struct. Biol. 2:171178.[ISI][Medline]
Dokholyan, N. V., and E. I. Shakhnovich. 2001. Understanding hierachical protein evolution from first principles. J. Mol. Biol. 312:289307.[CrossRef][ISI][Medline]
Dokholyan, N. V., L. A. Mirny, and E. I. Shakhnovich. 2002. Understanding conserved amino acids in proteins. Physica A 314:600606.
England, J. L., and E. I. Shakhnovich. 2003. Structural determinant of protein designability. Phys. Rev. Lett. 90:218101/1218101/4.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368376.[ISI][Medline]
Finkelstein, A. V., A. Ya. Badretdinov, and A. M. Gutin. 1995. Why do protein architectures have Boltzmann-like statistics?. Proteins 23:142150.[ISI][Medline]
Fornasari, M. S., G. Parisi, and J. Echave. 2002. Site-specific amino acid replacement matrices from structurally constrained protein evolution simulations. Mol. Biol. Evol. 19:352356.
Goldstein, R. A., Z. A. Luthey-Schulten, and P. G. Wolynes. 1992. Optimal protein-folding codes from spin-glass theory. Proc. Natl. Acad. Sci. USA 89:49184922.
Gutin, A. M., V. I. Abkevich, and E. I. Shakhnovich. 1995. Evolution-like selection of fast-folding model proteins. Proc. Natl. Acad. Sci. USA 92:12821286.
Halpern, A. L., and W. J. Bruno. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol. Biol. Evol. 15:910917.[Abstract]
Helling, R., H. Li, R. Melin, J. Miller, N. Wingreen, C. Zeng and C. Tang. 2001. The designability of protein structures. J. Mol. Graphics Modelling 19:157167.[CrossRef][ISI][Medline]
Henikoff, S., and J. G. Henikoff. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:1091510919.
Henikoff, S., and J. G. Henikoff. 1993. Performance evaluation of amino acid substitution matrices. Proteins 17:4961.[ISI][Medline]
Hobohm, U., and C. Sander. 1994. Enlarged representative set of protein structure. Protein Sci. 3:522524.
Holm, L., and C. Sander. 1996. Mapping the protein universe. Science 273:595602.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comp. Appl. Biosci. 8:275282.[ISI][Medline]
Kinjo, A. R., and K. Nishikawa. 2004. Eigenvalue analysis of amino acid substitution matrices reveal a sharp transition of the mode of sequence conservation in proteins. Bioinformatics 20:25042508.
Klimov, D. K., and D. Thirumalai. 1996. Factors governing the foldability of proteins. Proteins 26:411441.[CrossRef][ISI][Medline]
Koehl, P., and M. Levitt. 2002 Protein topology and stability define the space of allowed sequences Proc. Natl. Acad. Sci. USA 99:12801285.
Koshi, J. M., and R. A. Goldstein. 1998. Models of natural mutation including site heterogeneity. Proteins 32:289295.[CrossRef][ISI][Medline]
Koshi, J. M., D. P. Mindell, and R. A. Goldstein. 1999. Using physical-chemistry based substitution models in phylogenetic analysis of HIV-1 subtypes. Mol. Biol. Evol. 16:173179.[Abstract]
Li, H., C. Tang, and N. S. Wingreen. 1997. Nature of driving force for protein folding: a result from analyzing the statistical potential. Phys. Rev. Lett. 79:765768.[CrossRef][ISI]
Li, H., C. Tang, and N. S. Wingreen. 1998. Are protein folds atypical?. Proc. Natl. Acad. Sci. USA 95:49874990.
Liò, P. and N. Goldman. 1998. Models of molecular evolution and phylogeny. Genome Res. 8:12331244.
Meyer, T. E., M. A. Cusanovich, and M. D. Kamen. 1986. Evidence against use of bacterial amino acid sequence data for construction of all-inclusive phylogenetic trees. Proc. Natl. Acad. Sci. USA 83:217220.[Abstract]
Miyazawa, S., and R. L. Jernigan. 1985. Estimation of effective inter-residue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18:534552.[ISI]
Ota, M., K. Kinoshita, and K. Nishikawa. 2003. Prediction of catalytic residues in enzymes based on known tertiary structure, stability profile, and sequence conservation. J. Mol. Biol. 327:10531064.[CrossRef][ISI][Medline]
Overington, J., M. S. Johnson, A. Sali, and T. L. Blundell. 1990. Tertiary structural constraints on protein evolutionary diversitytemplates, key residues and structure prediction. Proc. R. Soc. Lond. Ser. B 241:132145.[ISI][Medline]
Parisi, G., and J. Echave. 2001. Structural constraints and emergence of sequence patterns in protein evolution. Mol. Biol. Evol. 18:750756.
Porto, M., U. Bastolla, H. E. Roman, and M. Vendruscolo. 2004. Reconstruction of protein structures from a vectorial representation. Phys. Rev. Lett. 92:218101/1218101/4.
Ptitsyn, O. B. 1998. Protein folding and protein evolution: common folding nucleus in different subfamilies of c-type cytochrome?. J. Mol. Biol. 278:655666.[CrossRef][ISI][Medline]
Ptitsyn, O. B., and K. H. Ting. 1999. Non-functional conserved residues in globins and their possible role as a folding nucleus. J. Mol. Biol. 291:671682.[CrossRef][ISI][Medline]
Shakhnovich, E. I., and A. M. Gutin. 1993. Engineering of stable and fast-folding sequences of model proteins. Proc. Natl. Acad. Sci. USA 90:71957199.
Sueoka, N. 1961. Correlation between base composition of the deoxyribonucleic acid and amino acid composition of proteins. Proc. Natl. Acad. Sci. USA 47:469478.[ISI][Medline]
Thorne, J. L. 2000. Models of protein sequence evolution and their applications. Curr. Opin. Genet. Dev. 10:602605.[CrossRef][ISI][Medline]
White, S. H. 1992. Amino acid preferences of small proteins. Implications for protein stability and evolution. J. Mol. Biol. 227:991995.[CrossRef][ISI][Medline]