*Department of Genetics, North Carolina State University;
and
Department of Ecology and Evolutionary Biology, University of California at Irvine;
and
Fakultät für Mathematik, Universität Bielefeld, Bielefeld, Germany
Abstract
Introduction
Another approach is to model large families of naturally occurring proteins or protein domains and determine how nature has changed their characteristics over billions of years of evolutionary diversification. By examining patterns of sequence diversity, one can explore how naturally occurring sequence variability and amino acid properties (e.g., hydrophobicity, volume, and charge) are important in maintaining protein structure. The latter approach permits analyses of protein structure and function over extensive geological timescales where evolutionary processes have experimented in nature with amino acid changes with regard to protein stability, foldability, and functionality.
Experimental, as well as quantitative, analyses of proteins (including multiple-alignment procedures) often proceed by modeling frequencies of residues at individual amino acid sites. For computational expediency, these analyses assume that amino acid sites are independent, i.e., the presence of a residue at one site is assumed to be independent of residues at other sites (Swofford et al. 1996). However, it is well known that this assumption is naïve, since the activities and properties of proteins are the result of interactions among their constitutive amino acids. Interactions among amino acid sites include salt bridges between charged residues, hydrogen bonds between electron acceptors and donors, size constraints reflecting structural interactions between large and small side chains, electrostatic interactions, hydrophobic effects, van der Waals forces, and similar phenomena.
Detecting structural interactions and statistical covariance or associations among separate amino acid sites is fundamental for understanding protein structure and evolution. Consequently, it is important to determine the magnitude and direction of residue covariability, its origin, and its structural and functional significance. Because associations among separate amino acid sites may arise from several different sources, partitioning these associations into their component sources is fundamental to understanding protein structure, function, and evolution.
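The observed covariation in residue composition between amino acid sites i and j (Cij) arises from several separate underlying causes, which can be expressed by a linear model of the form:
$$C_{ij} = C_{\text{phylogeny}} + C_{\text{structure}} + C_{\text{function}} + C_{\text{interactions}} + C_{\text{stochastic}}$$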
Indeed, the primary null hypothesis to be evaluated for such a model is that any component covariance is equal to zero and, as a consequence, makes no significant contribution to the observed association between amino acid sites i and j. Let us consider what these various sources of variation entail.
One obvious source of covariation among residues at different sites is common evolutionary history (Cphylogeny). Felsenstein (1985) discussed this problem with regard to evolution of complex polygenic traits among species. He pointed out quite elegantly that species are part of a hierarchically structured phylogeny and therefore cannot be regarded for statistical purposes as being drawn independently from the same distribution.
Felsenstein's (1985) argument holds as well for associations among amino acid sites in related proteins. For example, an ancient gene may have undergone early duplications followed by sequence diversification through mutation, natural selection, and genetic drift, which may act differentially in separate evolutionary lineages. The result will be collections of related proteins, e.g., families of bHLH proteins like MyoD and Myc. These families contain a number of functionally and structurally similar proteins that have arisen from a common ancestral protein followed by evolutionary diversification and hierarchical branching (Atchley, Fitch, and Bronner-Fraser 1994; Atchley and Fitch 1995). Within the individual members of such protein families, we would expect to find associations among residues at various amino acid sites that have persisted from the early duplication events.
Additionally, covariation among sites can arise for structural or functional reasons, i.e., Cstructure and Cfunction. In this instance, associations among amino acids arise independently of common ancestry and reflect a bias in amino acid replacements in order to satisfy structural demands. The folded nature of a native functioning protein requires that only certain amino acid replacements can occur at particular sites and still maintain the structural integrity of the folded protein. Furthermore, there are constraints on amino acid replacements that arise for functional reasons, such as amino acid bias at recognition sites related to DNA binding in transcriptional regulators. These functional changes may arise when selection operates to optimize adaptation and subsequently generate protein diversification.
Clearly, the main effects in the linear model (i.e., structure, function, and phylogeny) are confounded and therefore are not statistically independent. It is well known that structural and functional changes arise through evolutionary processes. Consequently, inclusion of a covariance term, Cinteractions, in the model is necessary to account for such higher-order statistical nonindependence.
Finally, covariation among sites may occur that cannot be explained by the main effects in the model and their statistical interaction. This component, designated here Cstochastic, refers to the lack of fit of the data to the model and is analogous to the unexplained sum of squares in analysis of variance or regression. For the sake of simplicity, this stochastic effect can be assumed to represent background covariability.
While it is obvious that covariability among sites has a multidimensional basis, partitioning the observed covariability among sites into appropriate underlying components is not a simple matter. Rather, it is a process fraught with many statistical and computational difficulties. Not the least of these difficulties is that biological sequences are represented by symbols that have no natural ordering or underlying metric (Atchley, Terhalle, and Dress 1999). Consequently, conventional statistical analyses typically used to partition variability and covariability are difficult to apply with sequence data.
Herein, we use an entropy (information theoretic) approach coupled with simulation-based parametric bootstrap procedures to examine the magnitude and origin of associations among amino acid sites in the highly conserved basic helix-loop-helix (bHLH) domain. The bHLH domain is a DNA-binding and dimerization domain of approximately 50–60 amino acids found in a large and diverse family of transcription factors (Murre et al. 1994). A number of these proteins have been the focus of detailed structural and functional analyses. Furthermore, the bHLH domain has been the subject of several recent evolutionary analyses (Atchley and Fitch 1997; Atchley, Terhalle, and Dress 1999; Morgenstern and Atchley 1999).
The present paper explores a number of questions about amino acid associations and protein structure. First, we ascertain the magnitude of association or covariation among residues between amino acid sites within the highly conserved bHLH domain. Second, we carry out computer simulations to elucidate the underlying origins of the observed associations among amino acid sites. We inquire if the observed covariability arises simply from stochastic events or if it is due to evolutionary history or structural and functional constraints. Third, we integrate measures of variability and covariability derived from information theory with structural data from published crystal studies on the bHLH domain. In doing so, we explore the relationships between primary sequence diversity and protein structure/function. Fourth, we examine the evolution of the α-helical structure of the bHLH domain among a diverse collection of proteins.
Methods and Materials
Structure of bHLH Proteins
Crystal structure studies have been carried out on the bHLH domains of six proteins, i.e., Max, E47, MyoD, USF, PHO4, and SREBP (Ferre-D'Amare et al. 1993, 1994; Ellenberger et al. 1994; Ma et al. 1994; Brownlie et al. 1997; Shimizu et al. 1997; Parraga et al. 1998). The Max protein, which is the dimerization partner of the protooncogene Myc, has been examined in considerable crystallographic detail and shown to have an amphipathic α-helical structure (fig. 1) in which the protein has opposing hydrophobic and hydrophilic faces. The crystal structure of the Max homodimer shows it to be a parallel, left-handed, four-helix bundle with a hydrophobic core (Ferre-D'Amare et al. 1993).
The protein Max will be used as the structural model for our discussions, and it will be assumed that the 242 bHLH proteins involved in these analyses have the same general structural features as Max. This extrapolation is based on the studies of Ferre-D'Amare et al. (1993, 1994), Ma et al. (1994), Ellenberger et al. (1994), Shimizu et al. (1997), and Parraga et al. (1998).
Helical Wheel Projections
Helical wheel projections are used in these analyses to provide insight into residue interrelationships within a protein structure. Helical wheels graphically display the disposition of amino acid side chains about an assumed α-helix. The projection is along the central axis of the helix, from the N-terminus to the C-terminus, and it is a useful device for displaying the symmetry (or asymmetry) of hydrophobic/hydrophilic side chains. The helical wheel assumes a periodicity of 3.6 residues per helical turn.
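The geometry of a helical wheel can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a periodicity of 3.6 residues per turn (100° per residue, as given above), helix 1 taken to begin at site 14, and hypothetical function names (`wheel_angle`, `wheel_xy`) that are not part of any published program.

```python
import math

# 3.6 residues per turn means each residue advances 360 / 3.6 = 100 degrees.
DEGREES_PER_RESIDUE = 100.0


def wheel_angle(site, first_site):
    """Angle in degrees (0 <= angle < 360) of a site on the wheel,
    measured from the first site of the helix."""
    return ((site - first_site) * DEGREES_PER_RESIDUE) % 360.0


def wheel_xy(site, first_site, radius=1.0):
    """Cartesian coordinates of a site on a wheel of the given radius."""
    theta = math.radians(wheel_angle(site, first_site))
    return (radius * math.cos(theta), radius * math.sin(theta))


# Assumption: helix 1 spans sites 14-28, with site 14 placed at 0 degrees.
for site in range(14, 29):
    print(site, wheel_angle(site, 14))
```

Under these assumptions, sites 16, 20, 23, and 27 fall at 200°, 240°, 180°, and 220°, respectively, so all four lie within a 60° arc of the wheel.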
Thirty-three exemplar sequences are used to explore the phylogenetic aspects of the helical wheel projections. Generally speaking, these 33 sequences are well-studied proteins that reflect the evolutionary diversity of the bHLH domain. The evolutionary relationships among the various clades and lineages for these sequences are represented by a neighbor-joining tree. Sequences are arranged phylogenetically and shown in a helical wheel configuration for helices 1 and 2.
Secondary Structure Prediction
In several instances, the secondary structure of a particular bHLH-domain-containing protein is examined using the Protein Sequence Analysis (PSA) Server from Boston University. The computer model analyzes amino acid sequences and calculates the probability of secondary structures and folding classes within regions of a sequence. The underlying theory for these predictions is described in White, Stultz, and Smith (1994), and the URL of the server is http://bmerc-www.bu.edu/psa/.
Variability and Covariability in Protein Sequences
As noted earlier, statistical analyses of biological sequences present difficulties because these sequences are represented by symbols that have no natural ordering or underlying metric (Atchley, Terhalle, and Dress 1999). Consequently, conventional statistical estimates of variability and covariability are difficult to apply. Recently, several authors have suggested the use of the concepts of entropy and mutual information (Korber et al. 1993; Clarke 1995; Herzel and Gross 1995; Schneider 1996; Roman-Roldan, Bernaola-Gavan, and Oliver 1996; Atchley, Terhalle, and Dress 1999).
Entropy (E) is a measure of uncertainty derived from thermodynamics and statistical physics which has considerable utility for studies of protein structure. Assume X is a discrete random variable (an amino acid site) for which we are uncertain which of its 20 values (x1, x2, ..., x20) (amino acid residues) will occur at site X, but we do know their expected frequencies, p1, ..., pn. These expected frequencies can be used to calculate how much information E(X) is present at site X. In this context, information is a measure of the uncertainty about which residue will occur at a specified site.
The Boltzmann-Shannon entropy E(X) is defined (Applebaum 1996) by
$$E(X) = -\sum_{j=1}^{n} p_j \log_2 p_j$$
where n = 20, pj is the probability of an amino acid being of the jth kind, and pj log2 pj := 0 if pj = 0. E = 0 when all elements are in the same category (the same amino acid residue at a particular site). E increases with both the number of categories (residues at a site) and their equiprobability. Entropy of a uniform distribution whose range has size n is
$$E = \log_2 n$$
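As a minimal sketch of how a per-site entropy might be computed from an alignment column, the following Python function estimates the pj from observed residue frequencies. Using observed frequencies as the probabilities is an assumption of this sketch, and the function name `site_entropy` is hypothetical.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy in bits of the residues observed at one alignment site.

    `column` is any iterable of one-letter residue codes. Residues that never
    occur contribute nothing, matching the convention p*log2(p) := 0 at p = 0.
    """
    counts = Counter(column)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# An invariant site has zero entropy; 20 equiprobable residues give log2(20).
print(site_entropy("L" * 50))
print(site_entropy("ACDEFGHIKLMNPQRSTVWY"))
```

The second call returns log2(20), roughly 4.32 bits, which is the upper bound for a 20-letter alphabet.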
The relative information content of Y contained in X is termed the mutual information, or MI(X, Y), where
$$MI(X, Y) = E(X) + E(Y) - E(X, Y)$$
and E(X, Y) is the joint entropy of the residue pairs observed at sites X and Y. Note that MI(X, Y) = MI(Y, X), and if X and Y are independent, then MI(X, Y) = 0, corresponding to the fact that no information is obtained regarding Y by finding out about X. In biological sequences, MI describes the extent of "correlation" or association between residues at amino acid sites X and Y that might arise from evolutionary, functional, or structural constraints. More algebraic details are provided in Atchley, Terhalle, and Dress (1999).
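A hedged sketch of the mutual information between two alignment columns follows. It assumes the standard identity MI(X, Y) = E(X) + E(Y) - E(X, Y), with the joint entropy computed over residue pairs taken from the same sequence, and it estimates all probabilities from observed frequencies. The function names and the toy columns are hypothetical.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy in bits of a sequence of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(col_x, col_y):
    """MI between two alignment columns, paired row by row (one row per sequence)."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must come from the same set of sequences")
    joint = list(zip(col_x, col_y))  # residue pairs observed in each sequence
    return entropy(col_x) + entropy(col_y) - entropy(joint)


# Toy columns: perfectly associated sites versus independent sites.
print(mutual_information("AALL", "KKRR"))
print(mutual_information("AALL", "KRKR"))
```

In the first toy pair, knowing the residue at one site fixes the residue at the other, so MI equals the entropy of either site; in the second, every combination is equally frequent and MI is zero.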
Statistical Inference about Mutual Information Values
An important question in biological sequence analyses is whether one can distinguish signals due to various biological sources (phylogeny, structure, and function) from any background noise (stochastic variation) inherent in a set of sequences. This is analogous, in quantitative genetics, to partitioning phenotypic variability into genetic components (including additive, dominance, and epistatic variance components) and environmental components.
In these analyses, we use a parametric bootstrap approach (Efron and Tibshirani 1993; Goldman 1993; Huelsenbeck, Hillis, and Jones 1996) to generate a distribution of MI values reflecting only covariation involving stochastic and phylogenetic constraints. Additional details of this method are presented in Wollenberg and Atchley (2000). The parameters used in the parametric bootstrap simulations were the phylogenetic tree generated from the aligned protein sequences and a residue substitution matrix. Because the tree was derived from the data and the substitution matrix was not (it was chosen to reflect general amino acid substitution probabilities), the data sets generated in the parametric bootstrap simulations contained only stochastic and phylogenetic associations between sites.
A neighbor-joining tree (Saitou and Nei 1987) was computed for the 237 sequences using p-distances. The residue change matrix used was that used in the computer program PAML, version 1.3 (Yang 1997). This matrix was generated using the algorithm of Jones, Taylor, and Thornton (1992) and is hereinafter referred to as the JTT matrix. This matrix does not consider gaps as characters for the generation of replicate data sets. Therefore, for MI values calculated on the empirical data to be comparable with MI values calculated on the parametrically generated data sets, only ungapped sites could be used for this statistical analysis. (All sites were used for generating the phylogeny.) For this reason, the original 242 sequences of Atchley and Fitch (1997) were reduced to 237 to decrease the number of sites in the alignment having gaps. This resulted in 32 sites without gaps for analysis.
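For readers unfamiliar with p-distances, a small Python sketch is given here: the p-distance between two aligned sequences is the proportion of compared sites at which they differ. How gapped positions were treated in the original tree construction is not stated, so skipping positions where either sequence has a gap is purely an assumption of this sketch, as is the function name.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of compared alignment positions at which two sequences differ.

    Assumption: positions where either sequence carries a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # ignore gapped positions (assumed convention)
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


# Toy aligned fragments (hypothetical, not taken from the bHLH alignment).
print(p_distance("ACDE", "ACDF"))
```

A matrix of such pairwise distances is the usual input to a neighbor-joining procedure.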
Like any numerical simulation of a physical process, the results depend on the assumptions of the underlying models for their validity. As in any phylogenetic analysis, results depend on the confidence one has that the tree is a realistic description of the history of the subjects being analyzed. The parametric bootstrap also depends on the tree as the source of information about the level and distribution of sequence variation. The residue substitution matrix used will control the changes that occur between sequences in the simulation. Biases in this matrix can affect the potential associations measured in the resulting simulated sequences. However, a matrix having no biases (i.e., a matrix of uniform substitution probabilities) would ignore the biology of the substitution process. Alternatively, one could use a substitution matrix derived from the empirical data, such as that calculated by the RIND program (Bruno 1996). However, a matrix of this type would reflect biases due to phylogeny, structure, and function that are inherent in the empirical data being analyzed (Wollenberg and Atchley 2000). For these reasons, we used a general protein substitution matrix derived using the JTT algorithm.
The statistical significance of the MI values was determined by comparing the frequency distributions of MI values for the 237 bHLH sequences and the results for the parametric bootstrap analyses. Any MI value above a specific threshold was considered to contain significant associations over and above those due to stochastic or phylogenetic constraints. This threshold MI value was that value in the frequency distribution of parametric bootstrap MI values greater than a specified percentage (e.g., 99% or 99.9%) of parametric bootstrap MI values. Thus, this procedure does not test whether a given MI value is different from random; rather, it tests whether an MI value reflects a significant association due to structural and functional constraints over and above covariation arising from evolutionary history and stochastic events.
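The thresholding step can be sketched as follows. This is an illustration only: the quantile convention (the smallest null value that is at least as large as the requested fraction of the null distribution) is an assumption, the function names are hypothetical, and the null values in the usage example are placeholders and not simulation output.

```python
import math


def bootstrap_threshold(null_mi_values, level=0.99):
    """Smallest null MI value that is >= the requested fraction of null values.

    `null_mi_values` would hold MI values computed on parametric bootstrap
    replicates; `level` is the fraction of that distribution to exceed.
    """
    ordered = sorted(null_mi_values)
    # round() guards against floating-point noise before taking the ceiling
    k = math.ceil(round(level * len(ordered), 9))
    return ordered[max(k - 1, 0)]


def significant_pairs(empirical_mi, threshold):
    """Keep site pairs whose empirical MI exceeds the bootstrap threshold."""
    return {pair: mi for pair, mi in empirical_mi.items() if mi > threshold}


# Placeholder null distribution: the integers 0..999 stand in for MI values.
null_values = list(range(1000))
print(bootstrap_threshold(null_values, 0.99))
print(bootstrap_threshold(null_values, 0.999))
```

Given such a threshold, `significant_pairs` simply filters a dictionary of empirical MI values keyed by site pair.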
Results
The analyses described here assume the crystal structure of Max, E47, MyoD, USF, PHO4, and SREBP, extrapolated to the other known bHLH proteins. Based on the multiple alignment of the bHLH domain sequences used in Atchley and Fitch (1997) and Atchley, Terhalle, and Dress (1999), the most prevalent residues at these contact or packing sites are 16 (I, L, V), 20 (F, I, L), 23 (L), 27 (I, L, V), and 28 (L) in helix 1, and 50 (K), 53 (I, T, V), 57 (A), 60 (Y), and 64 (L) in helix 2. Thus, the five relevant contact sites in helix 1 are highly hydrophobic while the five sites in helix 2 (except for site 50, which initiates the second helix) are predominantly hydrophobic. The packing interrelationships among sites in helix 1 and helix 2 are shown graphically in figure 3. Additionally, structural studies on SREBP (Parraga et al. 1998) also indicate interactions between site 12 in the basic region with site 17 in helix 1 and site 50 in helix 2 in the other monomer of the homodimer.
A. R. Ferre-D'Amare generously provided information from his crystal studies about the structural interactions of site 51. The side chains of the alanine residue in Max and the leucine residue in E47 at site 51 interact with those of site 16 to facilitate stabilization of the protein dimer's hydrophobic core. Furthermore, in SREBP, the serine at site 51 makes a water-mediated hydrogen bond with a phosphate oxygen anchoring the dimer to DNA. Other bHLH proteins have histidine residues at site 51, and this side chain could make DNA contacts as well.
Our results suggest that two positions not originally considered contact sites by Ferre-D'Amare et al. (1993) may be such. Sites 17 and 24 have E values of 1.18 and 1.93, respectively (0.93 and 0.94 with residues classified into functional groups), which are well within the range of the E values for contact sites (table 1). Further examination of the bHLH structural data (A. R. Ferre-D'Amare, personal communication) suggests that site 17 functions in a water-mediated DNA-protein contact. This interrelationship has apparently resulted in a high level of conservation of hydrophilic residues at this site (74% asparagine, 24% lysine or arginine). This interaction is clearly demonstrated in the sterol-regulatory-element-binding proteins (SREBPs) by Parraga et al. (1998).
Unfortunately, the situation with site 24 is more difficult to explain. Side chains in this position form part of an extension of the hydrophobic core of the bHLH dimer by burying under the flap formed by the loop component. The conservation of longer side-chain residues at site 24 may stem from their hydrophobic methylenes being partially buried and the hydrophilic tips being exposed (A. R. Ferre-D'Amare, personal communication). Additional high-resolution studies of the bHLH domain may provide a biological explanation for these quantitative observations on site 24.
Variability in Buried and Exposed Sites
The results reported here are in line with many previous observations that internal (buried) residues are less variable than external or exposed residues, and less variation means lower entropy values (e.g., Goldman, Thorne, and Jones 1998). Indeed, Atchley, Terhalle, and Dress (1999) previously tested this hypothesis and found that the entropy values of the buried sites are significantly smaller than those of the exposed sites.
Mutual Information
Tables 2 and 3 provide MI values describing the level of association among amino acid sites within the bHLH domain. Values >1.0 reported in table 2 constitute the top 5% of MI values for all bHLH domain sites as reported by Atchley, Terhalle, and Dress (1999). When the sites with MI > 1.0 are arranged in a network, specific patterns of association are made apparent (fig. 4). Within this network are subnetworks consisting of sets of sites for which each site has connections to all other sites in the subnetwork. These completely connected subnetworks correspond to the cliques previously defined in Atchley, Terhalle, and Dress (1999). A maximum clique corresponds to the largest maximally connected subnetwork. The two maximum cliques for the data from table 2 are presented in figure 4A and B.
Sites 3, 4, 7, and 8 in the basic region show high levels of association with specific sites in the two helices, primarily sites 14 and 21 in helix 1 and sites 52, 56, and 62 in helix 2. The helical wheel assumes a periodicity of 3.6 residues per helical turn, so site 14 is at the start of helix 1, and site 21 is two complete turns into the helix. In rank order, the MI values for the top 13 paired sites are as follows: 3 and 21 (1.20), 7 and 21 (1.19), 3 and 14 (1.18), 3 and 4 (1.17), 3 and 56 (1.15), 3 and 62 (1.13), 7 and 15 (1.12), 8 and 19 (1.12), 4 and 7 (1.11), 7 and 11 (1.10), 4 and 21 (1.10), 4 and 52 (1.10), and 11 and 21 (1.10).
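To illustrate how completely connected subnetworks can be extracted from a thresholded MI network, the following Python sketch builds a graph from the 13 top-ranked site pairs listed above and enumerates its maximal cliques with a basic Bron-Kerbosch recursion. The choice of algorithm is an assumption of this sketch, not a description of the procedure used to produce figure 4, and because only the 13 listed pairs are used (not the full contents of table 2), the cliques found here need not match those in the figure.

```python
from collections import defaultdict

# The 13 top-ranked site pairs quoted in the text (MI values omitted).
TOP_PAIRS = [(3, 21), (7, 21), (3, 14), (3, 4), (3, 56), (3, 62), (7, 15),
             (8, 19), (4, 7), (7, 11), (4, 21), (4, 52), (11, 21)]


def build_graph(pairs):
    """Undirected adjacency sets keyed by site number."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques (Bron-Kerbosch, no pivoting)."""
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(frozenset(clique))  # cannot be extended further
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return found


cliques = maximal_cliques(build_graph(TOP_PAIRS))
largest = max(len(c) for c in cliques)
print(sorted(sorted(c) for c in cliques if len(c) == largest))
```

For these 13 pairs alone, the largest completely connected subnetworks are the three triangles {3, 4, 21}, {4, 7, 21}, and {7, 11, 21}, all of which share site 21.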
Mutual Information Within and Between Helix 1 and Helix 2
Table 3 provides the E and MI values for interacting sites in helices 1 and 2. Packed or contact sites are underlined and in italics in table 3, and entropy values are provided for each site. These values provide a description of residue diversity at each site, ranging from E = 0.15 at site 23 (which is 98% leucine in this large database) to E = 3.48 at site 21. The maximum possible value for E is 4.32.
From table 3, it is clear that sites with low sequence diversity (small entropy values) also show little covariation with other amino acid sites. This is to be expected, since residues at paired sites cannot covary if the individual sites themselves have little residue variation. Thus, the four primary sites in each helix previously shown to pack together (sites 16, 20, 23, and 27 in helix 1, and sites 50, 53, 57, and 60 in helix 2) exhibit very little residue variability (low entropy values), and there is very little covariability among the contact sites. The latter finding is reflected by the fact that there are no MI values among these eight contact residues higher than 0.22 and, as will be seen below, none show significant covariation due to structural and functional constraints.
Several sites within the helices with higher residue diversity exhibit considerable mutual information with other variable sites. Thus, site 21 (E = 3.48) in helix 1 exhibits the highest observed MI values (MI = 1.19) with sites 52, 55, and 56 in helix 2. In helix 2, site 52 likewise shows high MI values with sites 19, 21, 25, and 26 in helix 1.
Note, however, that while site 18 exhibits considerable residue diversity (E = 3.26), it does not necessarily exhibit high MI values with other sites. This demonstrates that high entropy does not necessarily produce high values of mutual information even in highly conserved protein domains.
Origins of Significant Mutual Information
There are several possible explanations for significant levels of association among many sites within the bHLH domain (other than simply chance associations). These include associations arising from evolutionary constraints, correlated mutations, functional associations, and structural constraints.
Simulation and the Partition of Observed Associations
A simulation was carried out using parametric bootstrap procedures to partition the observed covariation among amino acid sites into that due to evolutionary history (phylogeny) and stochastic events on the one hand, and that due to structural and functional constraints on the other. The distributions of MI values for the parametric bootstrap data and the empirical data were significantly different at P < 0.001. This suggests that there are significant associations among many amino acid sites in the bHLH domain over and above those due to stochastic and phylogenetic effects.
The distribution of MI values from 1,000 parametric bootstrap replicates, calculated using the neighbor-joining tree and the JTT substitution matrix, was compared with the distribution of MI values for 237 bHLH proteins (fig. 5). This comparison permits calculation of threshold values for distinguishing between structural/functional and phylogenetic/chance associations. In these analyses, sites having MI values above a given threshold value have a specific probability of covariation due to structural and functional constraints, rather than due to phylogenetic constraint or chance. The specific MI value used as this threshold has an associated probability based on the number of values in the parametric bootstrap distribution that are greater than the threshold.
Among the contact sites, there is no significant covariation from structural and functional constraints. However, there is considerable significant covariation among noncontact sites that stems from structural and functional constraints (fig. 6). The latter is evident in those MI values larger than 0.56, the 1% critical value from the parametric bootstrap simulations.
For 33 exemplar sequences, residues in helix 1 and helix 2 were coded as -1 if they were hydrophobic (A, C, G, I, L, M, F, P, and V), 1 if they were hydrophilic (R, N, D, E, Q, H, and K), and 0 if they were S, T, Y, or W. Then, pairwise product-moment correlation coefficients (r) were computed among all sites in the two helices to determine if significant associations existed among sites for hydropathy states. Several pairs of sites showed high correlations, including sites 20 and 23 (r = 0.89), 17 and 61 (r = -0.81), 50 and 61 (r = -0.75), and 27 and 61 (r = 0.71). (Any product-moment correlation >0.45 in this analysis is significant at P < 0.01.) As can be seen in the data given in figure 7, the high positive correlations relate to strong association for hydrophobic residues at sites 20 and 23, as well as at sites 27 and 61 (fig. 8). The negative values refer to inverse relationships between hydrophobic and hydrophilic residues. In addition to these four pairs of sites, two pairs of sites (sites 23 and 27 and sites 25 and 50) had correlations >0.6.
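The hydropathy coding and correlation described above can be sketched in Python as follows. The three residue classes are taken from the text; the function names and the toy columns are hypothetical, and returning NaN for an invariant site (where the correlation is undefined) is a choice made for this sketch.

```python
import math

HYDROPHOBIC = set("ACGILMFPV")  # coded -1
HYDROPHILIC = set("RNDEQHK")    # coded +1
NEUTRAL = set("STYW")           # coded 0


def hydropathy_code(residue):
    """Map a one-letter residue code to -1, 0, or 1 as described in the text."""
    if residue in HYDROPHOBIC:
        return -1
    if residue in HYDROPHILIC:
        return 1
    if residue in NEUTRAL:
        return 0
    raise ValueError(f"unrecognized residue: {residue!r}")


def pearson_r(xs, ys):
    """Product-moment correlation; NaN when either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # an invariant site has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def site_correlation(col_a, col_b):
    """Correlation of hydropathy codes between two alignment columns."""
    return pearson_r([hydropathy_code(r) for r in col_a],
                     [hydropathy_code(r) for r in col_b])


# Toy columns (hypothetical): matching and opposite hydropathy patterns.
print(site_correlation("LLKK", "IVRE"))
print(site_correlation("LLKK", "KRLI"))
```

A positive coefficient indicates that the two sites tend to carry residues of the same hydropathy class, and a negative coefficient that they tend to carry opposite classes.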
In spite of these caveats, there are several interesting findings here involving associations based on size. The sites with the largest correlation coefficient (sites 20 and 53) are contact sites, and these observations indicate that there is a like association; i.e., large residues are paired with large residues. Thus, when F occurs at site 20, I, L, Y, or T occurs at site 53. Similarly, an L at site 20 pairs with an L, T, or V at site 53.
The largest negative value occurs between two adjacent sites (sites 53 and 54). It might be expected that significant inverse relationships would exist for the volume of adjacent residues. However, the difference between the entropy values for sites 53 (E = 1.24) and 54 (E = 0.20) stresses the need for caution in interpreting this correlation, since variable site 53 is paired with a largely unvaried site 54.
The correlation coefficient of 0.57 between sites 22 and 26 probably represents a meaningful structural/functional association, because there is considerable residue variability at both of these sites. Both sites occur away from the contact sites and, consequently, exhibit more variability.
Helical Wheels and Sequence Diversity
The helical wheel shown in figure 7 displays the helical distribution of residues including both the basic DNA-binding region and helix 1, while figure 8 provides a helical wheel for amino acid sites 50–64, which constitute helix 2. These figures summarize information for the 392 bHLH domain sequences in the database. The most prevalent residues at each site are shown in the figure. Amino acid cliques (sensu Atchley, Terhalle, and Dress 1999) shown in these figures are defined for each helix. Cliques are groups of amino acid positions all of which are more highly associated with each other than any are with a nonmember of the clique. Maximal cliques are those not contained in larger cliques. Finally, those sites involved in determining the predictive motif described in Atchley, Terhalle, and Dress (1999) are marked by an X. This predictive motif is a collection of 19 highly conserved sites whose amino acid compositions accurately discriminate bHLH-domain-containing proteins into groups A–D according to the evolutionary classification proposed by Atchley and Fitch (1997).
In figure 7, sites are denoted that have entropy values <2.0, values between 2.0 and 2.4, and values >2.4. The most prevalent amino acid residues at each site are noted with the appropriate symbols where possible. Furthermore, five sites (sites 3, 4, 7, 14, and 21) that constitute the highest-ranked multisite clique in helix 1 are denoted.
In the DNA-binding region, there are five strongly conserved sites with E < 1.7 (sites 1, 2, 9, 10, and 12) (fig. 7), and four of these are highly basic in that they have K or R residues in great preponderance. The exception is site 9, which has a glutamic acid in 93% of all sequences. The glutamic acid at site 9 contacts the C in the E-box (CANNTG), and its presence indicates that DNA binding occurs. Those bHLH proteins lacking an E at site 9 do not bind DNA (groups C and D, sensu Atchley and Fitch 1997). The remaining highly variable sites in the basic region (5, 6, and 11) do not appear to show a systematic pattern of functional group amino acids.
The remainder of the sites shown in figure 7 (sites 14–28) constitute helix 1 in the bHLH domain. There are a number of highly conserved sites constituting one face on the helical wheel. These include sites 16, 17, 20, 23, 24, 27, and 28, and they comprise a conspicuous distribution. Sites 16, 20, 23, and 27 are hydrophobic sites with a high preponderance of I, L, V, and F amino acid residues.
According to Klingler and Brutlag (1994) and others, there is a hydrophobic periodicity that characterizes many α-helices, i.e., the relative positions of amino acids in an amphipathic helix may influence their interresidue correlation structure. Thus, an amino acid at position i may show a preference for similar types of amino acids at sites i + 3 and i + 4, which are on the same side of the helix. Analogously, an amino acid at position i may show a preference for dissimilar amino acid types at positions i + 2 and i + 5. More explicitly, if a hydrophobic residue occurs at site i, there is a greater expectation of seeing a hydrophobic residue at sites i + 3 and i + 4. Thus, the more coincident two residues are on one side of an α-helix, the more likely they are to be of the same hydropathy. Conversely, the closer to a 180° separation two residues are, the more likely they are to be of opposite hydropathies (Klingler and Brutlag 1994).
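The geometry behind this periodicity can be illustrated with a minimal sketch. It assumes an ideal α-helix with 3.6 residues per turn (100° of rotation per residue), a textbook value that is not stated in the text above; the function name `wheel_separation` is hypothetical.

```python
# Angular separation on a helical wheel between position i and position i + k.
# Assumption: ideal alpha-helix, 3.6 residues per turn = 100 degrees per residue.
DEGREES_PER_RESIDUE = 100.0


def wheel_separation(k: int) -> float:
    """Smallest angle (0 to 180 degrees) between positions i and i + k."""
    angle = (k * DEGREES_PER_RESIDUE) % 360.0
    return min(angle, 360.0 - angle)


# Offsets discussed in the text: i + 3 and i + 4 versus i + 2 and i + 5.
for k in (2, 3, 4, 5):
    print(f"i + {k}: {wheel_separation(k):.0f} degrees from i")

# Pairwise separations among the hydrophobic helix 1 sites named in the text.
hydrophobic_sites = [16, 20, 23, 27]
pairwise = [
    wheel_separation(b - a)
    for idx, a in enumerate(hydrophobic_sites)
    for b in hydrophobic_sites[idx + 1:]
]
print("largest separation among sites 16, 20, 23, 27:", max(pairwise))
```

Under this idealized geometry, offsets of 3 and 4 fall within 60° of position i, whereas offsets of 2 and 5 fall 140° to 160° away, which is consistent with the same-face and opposite-face preferences described above.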
Sites 16, 20, 23, and 27 are three to four sites apart, in accordance with the i + 3 and i + 4 pattern of hydrophobic periodicity described by Klingler and Brutlag (1994). Thus, this set of four sites provides a highly hydrophobic face for the helix. Sites 17 and 24, on the other hand, are conserved sites with hydrophilic residues (K, R, N) which provide the hydrophilic face indicative of amphipathic helices.
Finally, site 28 has a high frequency of P residues (63%), indicative of the last site of an α-helix. At site 32, 35% of the sequences have P residues. Some proteins, like MyoD, continue helix 1 for another turn, so that the protein turns out of the helix into the loop one turn later than in other bHLH proteins.
Figure 8 shows the helical wheel for helix 2. The helix starts with site 50, which is a highly conserved K residue (93% of all sequences). Hydrophobic periodicity is easily seen relative to sites 57 and 64, which are predominantly hydrophobic residues, as are the i + 3 and i + 4 sites. However, the i + 2 and i + 5 sites to sites 57 and 60 are not necessarily hydrophobic and are rather diverse in their amino acid compositions. This is also the case for sites 54 and 61. On average, helix 2 appears to be more hydrophobic than helix 1. The predictive motif proposed by Atchley, Terhalle, and Dress (1999) to discriminate bHLH proteins employs sites that fall on these highly conserved faces of the two helices.
Phylogenetic Aspects of the Helical Configuration
To explore the phylogenetic aspects of an α-helix configuration, we combined a neighbor-joining tree of 33 bHLH domains with the distribution of their amino acids along a helical wheel (fig. 9). The proteins chosen for these analyses were simply some typical representatives of the various clades and evolutionary lineages as reported by Atchley and Fitch (1997). This tree delimits those proteins that belong to groups A–D. Group A and B bHLH domain proteins are most prevalent in the literature and databases, followed by those of group C. Group A proteins bind to a CAGCTG E-box configuration, while group B proteins bind to the CACGTG E-box configuration. Group C has a more complex DNA-binding behavior, and group D does not bind DNA.
Let us consider the structures of those proteins that deviate most from the predictive motif. CENP-B was originally described as a bHLH protein (Sullivan and Glass 1991; Sugimoto, Muro, and Himeno 1992). However, Atchley and Fitch (1997) suggested that it deviated considerably in its primary sequence from more typical bHLH proteins. In addition to considerable deviation from the predictive motif for bHLH domains (Atchley, Terhalle, and Dress 1999), the first residue in helix 1 is a proline. Proline is an amino acid generally associated with breaking of α-helices, and, indeed, the last residue in helix 1 and the first one in the loop region are prolines. Analysis of the secondary structure of the CENP-B bHLH domain by the PSA algorithm described in Materials and Methods indicates that the residues in the stretch considered to be the helix 1 region have low probabilities of being α-helices. The probability values range from 0.34 for the first residue (proline) to 0.56 (an isoleucine midway in helix 1). The means and standard deviations for the probabilities of being α-helices for the residues in the basic, H1, L, and H2 regions are 0.67 (±0.16), 0.37 (±0.15), 0.40 (±0.15), and 0.55 (±0.15), respectively. An analysis of variance rejects the null hypothesis that the basic and helix 1 regions have an equal probability of being α-helices (P < 0.001).
Recently, it was suggested that CENP-B is not a helix-loop-helix protein but, rather, should be classified as helix-turn-helix (Iwahara et al. 1998). Our analyses here suggest that CENP-B probably should not be classified as an HLH protein.
Another protein that deviates considerably from the predictive model is INO4, which is believed to positively regulate the coordinate expression of phospholipid structural genes in yeast (Nikoloff and Henry 1994). The fit of INO4 to the predictive motif in helix 2 is particularly poor (five mismatches). However, in spite of the sequence differences, the PSA analyses indicate that INO4 fits the α-helix model rather well. Our analyses suggest that a more detailed analysis of the structure of INO4 might be fruitful.
The extent of conservation at particular sites and pairs of sites can now be more clearly seen, together with deviations from these patterns of conservation. For example, all of the proteins except four have proline residues at site 28. The exceptions include the proteins ADD1, INO2, MyoD, and E12. Furthermore, only a single sequence (PHO4) has a residue other than L or I at site 54. In PHO4, the residue at this site is an E.
Discussion
In the Introduction, we listed several questions that these analyses were intended to address. We discuss them below.
Origins of Associations among Amino Acid Sites
We have shown elsewhere that considerable amounts of covariation occur among amino acid sites in the bHLH domain (Atchley, Terhalle, and Dress 1999). One impetus for the present study was to explore whether such covariation could be partitioned into those effects due to phylogenetic, structural/functional, and stochastic causes. Such partitions are critical to understanding the origin of sequence and structural variability and the evolution of protein structure.
To resolve this question about partitioning covariation, we simulated sequence data using a parametric bootstrap procedure. Generation of the simulated data sets requires two underlying models: (1) a phylogenetic model that dictates the amounts of change (branch lengths) and the grouping of changes (tree topology), and (2) an evolutionary model describing the probability of change from one residue to another. Results derived from these simulated data are dependent on the characteristics of these underlying models. Because the JTT matrix is based on a generalized model of protein evolution that accounts for phylogeny, it is an appropriate model for the calculation of residue changes.
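The two ingredients of such a simulation can be sketched as follows. This is only an illustration under loud assumptions: the equal-rates (Poisson) substitution model below is a stand-in for the JTT matrix, the four-taxon tree and its branch lengths are invented, and the names `evolve_site` and `simulate` are hypothetical. Sites evolve independently, so any association among simulated sites can reflect only the tree and chance.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def evolve_site(residue, branch_length, rng):
    """Evolve one residue along a branch under an equal-rates model.

    Assumption: this toy model replaces the JTT matrix used in the paper.
    branch_length is the expected number of substitutions per site.
    """
    n = len(AMINO_ACIDS)
    p_same = 1.0 / n + (1.0 - 1.0 / n) * math.exp(-branch_length * n / (n - 1))
    if rng.random() < p_same:
        return residue
    return rng.choice([a for a in AMINO_ACIDS if a != residue])


def simulate(tree, parent_seq, rng, leaves):
    """Evolve parent_seq down a (name, branch_length, children) tree.

    Leaf sequences are collected in the dict `leaves`, which is returned.
    """
    name, branch_length, children = tree
    seq = "".join(evolve_site(r, branch_length, rng) for r in parent_seq)
    if not children:
        leaves[name] = seq
    for child in children:
        simulate(child, seq, rng, leaves)
    return leaves


# Hypothetical tree: two clades on long internal branches, short terminal branches.
tree = ("root", 0.0, [
    ("cladeA", 0.8, [("A1", 0.1, []), ("A2", 0.1, [])]),
    ("cladeB", 0.8, [("B1", 0.1, []), ("B2", 0.1, [])]),
])

rng = random.Random(1)
root_seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(50))

# Each replicate is one simulated data set; many replicates would give a
# null distribution for any statistic computed on the alignment.
replicates = [simulate(tree, root_seq, rng, {}) for _ in range(3)]
for name, seq in replicates[0].items():
    print(name, seq)
```

Changing the branch lengths in `tree` is one way to explore how tree shape alone alters the associations seen in the simulated data.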
The approach employed here permits residue changes to be generated in a controlled manner. The pattern of clustering and the number of residue changes between nodes (calculated from the branch lengths) constrain the magnitude of correlations between sites. Long branches leading to clades with many taxa produce high MI values. Conversely, a pattern of short internal branches coupled with long terminal branches leads to small MI values. Therefore, as one would expect, the distribution of MI values reflecting only stochastic and phylogenetic constraint will be quite dependent on the characteristics of the tree used in the parametric bootstrap procedure.
The analyses described here for the highly conserved bHLH domain clearly demonstrate that the observed covariation among amino acid sites can be partitioned into those associations due to common evolutionary history versus those due to structural/functional constraints. We showed in tables 1 and 2 that there are significant associations of phylogenetic, structural, and functional origin in all the components of the bHLH domain. With regard to structural and functional constraints, there are significant associations among amino acid sites within the DNA-binding region, between the binding and dimerization regions, and between the dimerization regions. Some of these significant associations can be attributed to particular structural and functional attributes of the protein, including significant associations among amino acid sites due to hydropathy relationships and the sizes of residues. However, the basis for other significant associations awaits further clarification from critical experimental analyses. One of the purposes of large quantitative studies like this one is to provide hypotheses about structural and functional relationships which can be explored by subsequent experimental studies.
Entropy and Site-Directed Mutagenesis Studies
Site-directed mutagenesis is a powerful tool for elucidating protein structure. With this approach, particular amino acids are perturbed in specific ways in order to assess the impact of sequence changes on protein structure. Unfortunately, the number of possible sites to perturb is quite high, the relationships among sites are not well understood, and consequently there is always a quandary about which sites to experimentally alter.
Quantitative data from entropy and MI calculations may provide valuable insight into this problem. For example, perturbing amino acid sites exhibiting low entropy values or sites sharing significant amounts of mutual information with other sites may generate quite different results from those obtained by perturbing sites with high diversity or low mutual information. Obviously, protein stability, folding, and functionality are dynamic, multidimensional, and integrated phenomena. It follows, then, that information about variability and covariability among sites would provide valuable input for mutagenesis experiments and for the subsequent development of robust models for protein structure and function.
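As a rough illustration of the quantities involved, the sketch below computes Shannon entropy for single alignment columns and mutual information for pairs of columns. The use of natural logarithms, the function names, and the toy columns are all assumptions made for the example; they are not taken from the analyses reported here.

```python
import math
from collections import Counter


def entropy(column):
    """Shannon entropy of the symbols in one alignment column.

    Assumption: natural logarithms; another base only rescales the values.
    """
    n = len(column)
    return sum(-(c / n) * math.log(c / n) for c in Counter(column).values())


def mutual_information(col_x, col_y):
    """MI between two columns, computed as E(X) + E(Y) - E(X, Y)."""
    joint = list(zip(col_x, col_y))  # paired residues, one pair per sequence
    return entropy(col_x) + entropy(col_y) - entropy(joint)


# Toy columns (one character per sequence), invented for illustration.
conserved = "EEEEEEEE"
site_a = "LLLLIIII"
site_b = "KKKKRRRR"  # changes in step with site_a
site_c = "KRKRKRKR"  # varies independently of site_a

print("entropy, conserved site:", entropy(conserved))
print("entropy, site_a:", round(entropy(site_a), 3))
print("MI(site_a, site_b):", round(mutual_information(site_a, site_b), 3))
print("MI(site_a, site_c):", round(mutual_information(site_a, site_c), 3))
print("MI(conserved, site_a):", round(mutual_information(conserved, site_a), 3))
```

In this toy example, perturbing `site_a` would be expected to matter for `site_b` but not for `site_c`, which is the kind of distinction that could guide the choice of sites for mutagenesis.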
Entropy and Classes of Amino Acid Sites
Analyses of the basic DNA-binding region and the two α-helical regions of the bHLH domain suggest that there are three classes of sites. The first class includes amino acid sites with low entropy and low mutual information with other sites. Epitomizing this class are the contact sites between the two α-helices that comprise the hydrophobic core of these domains. These contact sites had entropy values varying between 0.2 and 2.3, a range of values that differed significantly from (and did not overlap with) the entropy values for the noncontact sites. If the amino acid residues at each site are transformed into functional groups of amino acids as described in Atchley, Terhalle, and Dress (1999), the entropy relationships are even more pronounced. These contact sites exhibited very low levels of correlation in residue composition with other sites in the bHLH domain. Such low levels of mutual information are to be expected, because two variables must exhibit variation before they can exhibit shared or common variation as reflected by the MI values (which can be demonstrated algebraically, as, e.g., in Atchley, Terhalle, and Dress 1999).
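One standard way to see this algebraically, given here as a sketch rather than as the derivation in the cited paper, writes E for entropy and MI for mutual information and assumes the usual Shannon definitions. The joint entropy of two sites is at least as large as either marginal entropy, so

$$
MI(X,Y) = E(X) + E(Y) - E(X,Y), \qquad E(X,Y) \ge \max\{E(X),\, E(Y)\},
$$

$$
\Longrightarrow \quad 0 \le MI(X,Y) \le \min\{E(X),\, E(Y)\}.
$$

A site with entropy near zero therefore cannot share more than that small amount of information with any other site, whatever the variability of its partner.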
The second class of amino acid sites involves those with higher levels of sequence diversity (entropy) and high levels of mutual information. We described a number of sites with higher entropy values where residue composition was highly correlated with that at other sites. Many of these sites are involved in important structural and/or functional attributes in these proteins.
The third class of amino acid sites includes those with high levels of entropy but low levels of mutual information. Thus, variability at these sites is apparently unrelated to variability at other amino acid sites. This independence could simply stem from stochastic variation at these sites unrelated to any functional or structural considerations in the protein or with regard to other sites. Alternatively, the variability at these individual sites could be of functional or structural significance, but these sites function in a manner orthogonal to other sites.
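The three classes can be expressed as a simple decision rule. The sketch below is purely illustrative: the cutoffs `entropy_cutoff` and `mi_cutoff` and the function `classify_site` are hypothetical, and no specific thresholds for class membership are given in the text.

```python
def classify_site(site_entropy, max_mi, entropy_cutoff=2.0, mi_cutoff=0.5):
    """Assign a site to one of the three classes described in the text.

    site_entropy: entropy of the site.
    max_mi: largest MI the site shares with any other site.
    Both cutoffs are hypothetical values chosen only for this example.
    """
    if site_entropy < entropy_cutoff:
        # Low diversity; MI is limited because the site barely varies.
        return "class 1: low entropy, low MI"
    if max_mi >= mi_cutoff:
        return "class 2: high entropy, high MI"
    return "class 3: high entropy, low MI"


# Invented (entropy, max MI) pairs, one per class.
examples = {"site_x": (0.4, 0.05), "site_y": (2.6, 0.9), "site_z": (2.7, 0.1)}
for name, (e, mi) in examples.items():
    print(name, "->", classify_site(e, mi))
```

In practice, cutoffs of this kind would more plausibly come from a null distribution, such as the one produced by the parametric bootstrap, than from fixed constants.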
Entropy, Conserved Sites, and Protein Structure
A basic tenet of protein structural analyses is that the information contained in the primary sequence is sufficient to dictate the three-dimensional structure (Strait and Dewey 1996). Consequently, another impetus for these analyses was to integrate information theoretic analyses about sequence diversity with attributes of the proteins that had been elucidated by experimental studies. If quantitative approaches using techniques like entropy measures and mutual information are to be successful, we must be able to relate sequence characteristics to structural and functional attributes over large numbers of proteins.
The relationship between sequence covariability, packing, and protein structure is an essential part of understanding protein evolution. Native proteins assume a particular packing density. If the maximum possible packing density is assumed, then sequence evolution could be very difficult, because each mutation of an interior residue would require one or more simultaneous and compensating mutations to maintain the dynamics of such a high density (Richards 1992). In this case, amino acid substitutions would involve paired or higher-order changes to maintain packing relationships. Packing restrictions on the surface residues would be weaker than those in the interior.
There is considerable heterogeneity in the amount of residue diversity at various sites. However, this residue diversity seems to correlate well with surface accessibility of the positions, with the interior positions being much more conserved. Indeed, our results show systematic relationships between covariability, packing, and structure. Amino acid sites from the helices known to pack together in the interior of the protein are highly conserved, as indicated by the low entropy values for these contact sites. Those sites within the α-helix that are buried and constitute the hydrophobic core are significantly less variable than exposed and hydrophilic sites. In addition, they show very little covariability in residue composition between sites. In contrast, sites away from this hydrophobic core show significantly more sequence diversity, as reflected by their entropy values.
Furthermore, there are highly conserved sites among diverse bHLH proteins which are therefore highly predictive for bHLH proteins. As a consequence, these residues discriminate the bHLH domain with high accuracy (Atchley, Terhalle, and Dress 1999). These highly invariant sites show very little intercorrelation with other sites with regard to their constitutive residues. This lack of correlation stems largely from the fact that invariant sites cannot exhibit covariation with other sites. To be most effective, predictive motifs need to exhibit high stability among individual elements and a lack of intercorrelation among the elements, i.e., independence among the component elements. Such low variability and covariability are the case for the elements of the bHLH predictive motif.
The observed relationships between entropy measures and protein interactions described for the bHLH domain are very intriguing. They suggest that certain structural and functional attributes of proteins can be predicted from quantitative measures of sequence diversity and association. However, analyses such as these need to be carried out on other groups of proteins before generalizations can be made. One conclusion from these theoretical and experimental findings is that efforts to model protein structure must take into consideration the simultaneous covariation among amino acid sites. Data on covariation among sites are necessary to understand the multidimensional structural, functional, and evolutionary dynamics in proteins.
Footnotes
1 Present address: Fakultät für Mathematik, Universität Bielefeld, Bielefeld, Germany.
2 Keywords: mutual information, protein evolution, entropy, parametric bootstrap, helix-loop-helix.
3 Address for correspondence and reprints: William R. Atchley, Fakultät für Mathematik, Universität Bielefeld, Postfach 10 01 31, D-33501 Bielefeld, Germany.
References
Applebaum, D. 1996. Probability and information. Cambridge University Press, Cambridge, England.
Atchley, W. R., and W. M. Fitch. 1995. Myc and Max: molecular evolution of a family of proto-oncogenes and their dimerization partner. Proc. Natl. Acad. Sci. USA 92:10217–10221.
Atchley, W. R., and W. M. Fitch. 1997. A natural classification of the basic helix-loop-helix class of transcription factors. Proc. Natl. Acad. Sci. USA 94:5172–5176.
Atchley, W. R., W. M. Fitch, and M. Bronner-Fraser. 1994. Molecular evolution of the MyoD family of transcription factors. Proc. Natl. Acad. Sci. USA 91:11522–11526.
Atchley, W. R., W. Terhalle, and A. Dress. 1999. Positional dependence, cliques and predictive motifs in the bHLH protein domain. J. Mol. Evol. 48:501–516.
Brownlie, P., T. Ceska, M. Lamers, C. Romier, G. Stier, H. Teo, and D. Suck. 1997. The crystal structure of an intact human Max-DNA complex: new insights into mechanisms of transcriptional control. Structure 5:509–520.
Bruno, W. 1996. Modeling residue usage in aligned protein sequences via maximum likelihood. Mol. Biol. Evol. 13:1368–1374.
Clarke, N. D. 1995. Covariation of residues in the homeodomain sequence family. Protein Sci. 4:2269–2278.
Efron, B., and R. J. Tibshirani. 1993. An introduction to the bootstrap. Chapman and Hall, New York.
Ellenberger, T., D. Fass, M. Arnaud, and S. C. Harrison. 1994. Crystal structure of transcription factor E47: E-box recognition by a basic region helix-loop-helix dimer. Genes Dev. 15:970–980.
Felsenstein, J. 1985. Phylogenies and the comparative method. Am. Nat. 125:1–15.
Ferre-D'Amare, A. R., P. Pognonec, R. G. Roeder, and S. K. Burley. 1994. Structure and function of the b/HLH/Z domain of USF. EMBO J. 13:180–189.
Ferre-D'Amare, A. R., G. C. Prendergast, E. B. Ziff, and S. K. Burley. 1993. Recognition by Max of its cognate DNA through a dimeric b/HLH/Z domain. Nature 363:38–45.
Goldman, N. 1993. Simple diagnostic statistical tests of models for DNA substitution. J. Mol. Evol. 37:650–661.
Goldman, N., J. L. Thorne, and D. T. Jones. 1998. Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 149:445–458.
Herzel, H., and I. Gross. 1995. Measuring correlations in symbol sequences. Physica A 216:518–530.
Huelsenbeck, J. P., D. M. Hillis, and R. Jones. 1996. Parametric bootstrapping in molecular phylogenies: applications and performance. Pp. 19–45 in J. D. Ferraris and S. R. Palumbi, eds. Molecular zoology: advances, strategies, and protocols. Wiley-Liss, New York.
Iwahara, J., T. K. Kigawa, H. Kitagawa, T. Masumoto, T. Okazaki, and S. Yokoyama. 1998. A helix-turn-helix structure unit in human centromere protein B (CENP-B). EMBO J. 17:827–837.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. CABIOS 8:275–282.
Klingler, T. M., and D. L. Brutlag. 1994. Discovering structural correlations in alpha-helices. Protein Sci. 3:1847–1857.
Korber, B. T., R. M. Farber, D. H. Wolpert, and A. S. Lapedes. 1993. Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc. Natl. Acad. Sci. USA 90:7176–7180.
Koshi, J. M., and R. A. Goldstein. 1997. Mutation matrices and physical-chemical properties: correlations and implications. Proteins 27:336–344.
Ma, P. C., M. A. Rould, H. Weintraub, and C. O. Pabo. 1994. Crystal structure of MyoD bHLH domain-DNA complex: perspectives on DNA recognition and implications for transcriptional activation. Cell 77:451–459.
Morgenstern, B., and W. R. Atchley. 1999. Modular evolution of the bHLH family of transcription factors. Mol. Biol. Evol. 16:1654–1663.
Murre, C., G. Bain, M. A. van Dijk, I. Engel, B. A. Furnari, M. E. Massari, J. R. Matthews, M. W. Quong, R. R. Rivera, and M. H. Stuiver. 1994. Structure and function of helix-loop-helix proteins. Biochim. Biophys. Acta 1218:129–135.
Nikoloff, D. M., and S. A. Henry. 1994. Functional characterization of the INO2 gene of Saccharomyces cerevisiae. J. Biol. Chem. 269:7402–7411.
Parraga, A., L. Bellsolell, A. R. Ferre-D'Amare, and S. K. Burley. 1998. Co-crystal structure of sterol regulatory element binding protein 1a at 2.3 Å resolution. Structure 6:661–672.
Richards, F. M. 1992. Folded and unfolded proteins: an introduction. Pp. 1–58 in T. E. Creighton, ed. Protein folding. W. H. Freeman and Co., New York.
Roman-Roldan, R., P. Bernaola-Gavan, and J. L. Oliver. 1996. Application of information theory to DNA sequence analysis: a review. Patt. Recog. 29:1187–1194.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.
Schneider, T. D. 1996. Reading of DNA sequence logos: prediction of major groove binding by information theory. Methods Enzymol. s:445–455.
Shimizu, T., A. Toumoto, K. Ihara, M. Shimizu, Y. Kyogoku, N. Ogawa, Y. Oshima, and T. Hakoshima. 1997. Crystal structure of PHO4 bHLH domain-DNA complex: flanking base recognition. EMBO J. 16:4689–4697.
Sokal, R. R., and F. J. Rohlf. 1995. Biometry. Freeman and Sons, New York.
Sternberg, M. J. E. 1996. Protein structure prediction. IRL Press, Oxford, England.
Strait, B. J., and T. G. Dewey. 1996. The Shannon information entropy of protein sequences. Biophys. J. 71:148–155.
Sugimoto, K., Y. Muro, and M. Himeno. 1992. Anti-helix-loop-helix domain antibodies: discovery of antibodies that inhibit DNA binding activity of human centromere protein B (CENP-B). J. Biochem. 111:478–483.
Sullivan, K. F., and C. A. Glass. 1991. CENP-B is a highly conserved mammalian centromere protein with homology to the helix-loop-helix family of proteins. Chromosoma 100:360–370.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407–514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. 2nd edition. Sinauer, Sunderland, Mass.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
White, J. V., C. M. Stultz, and T. F. Smith. 1994. Protein classification by stochastic modeling and optimal filtering of amino-acid sequences. Math. Biosci. 119:35–75.
Wollenberg, K., and W. R. Atchley. 2000. Separation of phylogenetic and functional associations in biological sequences using the parametric bootstrap. Proc. Natl. Acad. Sci. USA (in press).
Yang, Z. 1997. Phylogenetic analysis by maximum likelihood (PAML). Version 1.3. Department of Integrative Biology, University of California at Berkeley.