Exploring the sequence patterns in the α-helices of proteins

Junwen Wang1,2 and Jin-An Feng1,2,3

1Department of Chemistry and 2Center for Biotechnology, Temple University, 1901 North 13th Street, Philadelphia, PA 19122, USA

3 To whom correspondence should be addressed. e-mail: feng@astro.temple.edu


    Abstract
 
This paper reports an extensive sequence analysis of the α-helices of proteins. α-Helices were extracted from the Protein Data Bank (PDB) and were divided into groups according to their sizes. It was found that some amino acids had differential propensity values for adopting helical conformation in short, medium and long α-helices. Pro and Trp had a significantly higher propensity for helical conformation in short helices than in medium and long helices. Trp was the strongest helix conformer in short helices. Sequence patterns favoring helical conformation were derived from a neighbor-dependent sequence analysis of proteins, which calculated the effect of neighboring amino acid type on the propensity of residues for adopting a particular secondary structure in proteins. This method produced an enhanced statistical significance scale that allowed us to explore the positional preference of amino acids for α-helical conformations. It was shown that the amino acid pair preference for α-helix had a unique pattern and this pattern was not always predictable by assuming proportional contributions from the individual propensity values of the amino acids. Our analysis also yielded a series of amino acid dyads that showed preference for α-helix conformation. The data presented in this study, along with our previous study on loop sequences of proteins, should prove useful for developing potential ‘codes’ for recognizing sequence patterns that are favorable for specific secondary structural elements in proteins.

Keywords: α-helix/propensity/protein structures/secondary structure/sequence pattern


    Introduction
 
Recognizing the sequence patterns of proteins in relation to their structures has been one of the most important aspects of our efforts towards understanding the principles of protein folding. The Anfinsen experiments in the 1950s suggested that the primary amino acid sequence contained the information that specifies the folded native protein structure (Anfinsen, 1973). Subsequent experiments over the past few decades have generally supported this conclusion (Dill, 1990; Baldwin and Rose, 1999a,b; Honig, 1999). Two-dimensional NMR hydrogen exchange, coupled with stopped-flow pulse-labeling experiments, showed that the folding intermediate(s) usually possess secondary structures that are similar to those of the native protein (Hughson et al., 1991; Jennings and Wright, 1993; Chamberlain and Marquesee, 1997). These data suggested that the formation of secondary structures largely depended on the local amino acid sequence, since the intermediate folding states presumably did not have the established tertiary contacts of the native structure.

Armed with this experimental evidence, computational efforts to identify relationships between the amino acid sequence and special local structural elements have been intensive. Most attempts to identify such relationships have proceeded by identifying a common structural motif, then characterizing the frequencies of occurrence of each amino acid at each position in that motif. One of the best examples of such a study was the sequence pattern of the β-turn loop. This four-residue β-turn has been observed in a large number of protein structures (Efimov, 1993). It is a tight-turn loop structure with an intra-loop hydrogen bond between the main chain C=O(i) and the N–H(i + 3). The β-turn was later re-classified into many different sub-classes according to the polypeptide backbone dihedral angles (Hutchinson and Thornton, 1994). Other distinct sequence patterns include α-helix capping residues and β-breakers. Gly is commonly described as a helix terminator (or Ccap residue), whereas Asp, Asn, Ser and Thr are favored at the Ncap position of α-helices (Presta and Rose, 1988; Richardson and Richardson, 1988; Parker and Hefford, 1997). β-Breakers are residues often found at the β-strand initiation or termination positions of protein structures (Colloc’h and Cohen, 1991). They include Gly, Glu, Asp and Pro. More comprehensive approaches have involved clustering structural segments of proteins into classes using measures of structural similarity and then tabulating the sequence preference for each of the classes (Unger et al., 1989; Rooman et al., 1990; Olivea et al., 1997). In spite of these efforts, the discovered relationship between a particular local structure and the amino acid sequence has been insufficient for developing a rational secondary structure predictor.

We developed a method for analyzing the sequence–structure relationship of proteins termed neighbor-dependent sequence analysis (Crasto and Feng, 2001). This method calculated the neighboring probability of a pair of amino acids, in any combination, in three classes of secondary structures (α-helix, β-strand and loop). Neighbors were defined as the first neighbor, where the amino acids in the pair were immediately next to each other in sequence; the second neighbor, where the pair of amino acids was separated by one amino acid residue in sequence; the third neighbor, where the pair of amino acids was separated by two amino acids; or the fourth neighbor, where the pair of amino acids was separated by three amino acids. We applied the neighbor-dependent sequence analysis to the residues of immediate neighbors in loops of proteins (Crasto and Feng, 2001). A series of dyad codes that had a strong preference for loop conformation was found. For example, it was found that Cys had a high loop propensity in short loops when it was at a position preceding an Arg, although both residues had low individual loop propensities. It was evident that the neighbor-dependent protein sequence analysis method could reveal ‘hidden’ sequence codes in proteins.
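The four neighbor definitions above amount to counting ordered residue pairs at sequence offsets of one to four. The following is a minimal sketch of that bookkeeping, not the authors' implementation; the function names `neighbor_pairs` and `count_neighbor_pairs` are hypothetical, and sequences are assumed to be plain one-letter-code strings.

```python
from collections import Counter


def neighbor_pairs(sequence, k):
    """Yield ordered pairs (a, x) where x sits k positions after a.

    k=1 is the first neighbor (adjacent residues), k=2 the second
    neighbor (one residue in between), and so on.
    """
    for i in range(len(sequence) - k):
        yield sequence[i], sequence[i + k]


def count_neighbor_pairs(sequence, max_k=4):
    """Count ordered residue pairs for the first through max_k-th neighbors."""
    return {k: Counter(neighbor_pairs(sequence, k)) for k in range(1, max_k + 1)}


# Toy five-residue sequence, used only to show the shape of the result.
counts = count_neighbor_pairs("ACDAC")
print(counts[1])  # first-neighbor pair counts
```

Keeping the pairs ordered, rather than treating Ala–Cys and Cys–Ala as the same pair, is what later allows a preference at the –1 position to be separated from a preference at the +1 position.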

α-Helices are one of the most dominant structural elements in proteins. Extensive studies have been carried out focusing on α-helix folding and its sequence–structure relationship. Early studies by Chou and Fasman (Chou and Fasman, 1978) established a statistical scale to evaluate the likelihood of amino acids adopting α-helix conformation. Residues with high propensities were termed strong helix conformers, and the residues with helix propensities slightly higher than random distribution were termed medium helix conformers. Amino acids having a frequency of occurrence in helices lower than that of the random distribution were regarded as weak helix conformers. Although no chemical–physical rationale could be easily derived for the preference of amino acids adopting helix conformation, such a statistical analysis has achieved limited success in assisting our understanding of the sequence–structure relationship of proteins, as well as in predicting protein secondary structures (Chou and Fasman, 1978). A recent study by Penel et al. (Penel et al., 1999) on the analysis of side chain structures influencing residues adopting α-helical conformation showed that some amino acid residues favored hydrogen bonding with neighboring residues via either main chain–side chain or side chain–side chain interactions, thus suggesting a potential neighbor-dependent preference for residues in the α-helix. In this study, we carried out detailed neighbor-dependent sequence analysis on α-helices in proteins. Our results showed that amino acid pair preference for the α-helix had a unique pattern. Based on the neighbor-dependent propensity values, we derived a series of amino acid dyads that had predominant preference for α-helical conformation. We also carried out an analysis on the propensity variation of amino acids in short, medium and long α-helices. To the best of our knowledge, such an analysis has never been reported. Our study showed that there were significant propensity variations for certain amino acids in helices of different size groups.


    Materials and methods
 
All analyses were performed using a relational database derived from the October 2001 release of the Brookhaven Protein Data Bank (PDB) (Bernstein et al., 1977). Sequences in the helix and strand regions were often more conserved than those of loop structures in homologous proteins (Benner and Gerloff, 1990). A non-redundant set of PDB entries was therefore derived by using PISCES (Wang and Dunbrack, 2003). Proteins with a sequence identity of >25% were removed. Since the helical regions of the protein structures were usually well defined, it was necessary to include only structures that were determined at high resolution. The resolution cut-off of our database was 2.5 Å. Based on these criteria, a total of 1430 proteins were selected from the PDB. These PDB entries were used in the subsequent parsing protocols.

The extraction of sequences and secondary structure information from every PDB entry was based on two independent secondary structure assignment strategies: assignments which were experimentally observed and assignments generated by the Kabsch and Sander DSSP algorithm (Kabsch and Sander, 1983), which assigned secondary structure based on an analysis of backbone dihedral angles and hydrogen bonds. For the sake of comparison between the two methods of assignment, we converted the more sophisticated DSSP structure assignments as follows: helices, 3₁₀ helices and the π-helices were all considered helices.
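The conversion described above can be illustrated with the standard DSSP one-letter codes, in which H denotes an α-helix, G a 3₁₀ helix and I a π-helix. The sketch below is an assumption-laden illustration rather than the parsing code used in the study: the names `collapse_to_helix` and `helix_segments` are hypothetical, every non-helix code is simply reduced to `-`, and adjacent H, G and I runs are merged into a single helix as a side effect of the collapse.

```python
from itertools import groupby

# DSSP codes treated as helix: alpha-helix (H), 3-10 helix (G), pi-helix (I).
HELIX_CODES = {"H", "G", "I"}


def collapse_to_helix(dssp_string):
    """Reduce a per-residue DSSP string to helix ('H') versus everything else ('-')."""
    return "".join("H" if code in HELIX_CODES else "-" for code in dssp_string)


def helix_segments(dssp_string):
    """Return half-open (start, end) index ranges of contiguous helix runs."""
    segments = []
    position = 0
    collapsed = collapse_to_helix(dssp_string)
    for is_helix, run in groupby(collapsed, key=lambda code: code == "H"):
        length = len(list(run))
        if is_helix:
            segments.append((position, position + length))
        position += length
    return segments


# Toy DSSP string: turn, alpha-helix, 3-10 helix, strand, pi-helix.
print(helix_segments("TTHHHHGGGEEIII"))
```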

The database system used in this study was PostgreSQL packaged in Red Hat Linux 7.2. The sequence and secondary structure information of every PDB entry was parsed into relational tables. Two sets of tables (one was parsed based on the author-assigned secondary structure information and the other was parsed according to the DSSP calculation) were compared. Considering that the authors’ assignments were experimentally observed, we decided to use such assignments in the PDB as the standard for defining α-helices in the protein structures. However, manual errors were often encountered in the PDB. In order to avoid such issues, we applied a double-check mechanism where every structural element in the PDB was compared with the DSSP assignment. The α-helices that were agreed upon by both methods were selected from the PDB and placed in a helix library. The total number of helices extracted was 10 643, which constituted 96.2% of the total helices available in those PDB entries. The α-helices were grouped according to their sizes.
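The text does not state how agreement between the author assignment and the DSSP assignment was measured. As one possible reading, the sketch below keeps an author-assigned helix when at least a given fraction of its residues is also helical according to DSSP. The function name `agreed_helices` and the `min_overlap` threshold of 0.5 are assumptions made purely for illustration.

```python
def agreed_helices(author_helices, dssp_helix_mask, min_overlap=0.5):
    """Keep author-assigned helices that DSSP also labels as helix.

    author_helices  -- list of half-open (start, end) residue index ranges
    dssp_helix_mask -- list of booleans, True where DSSP assigns a helix
    min_overlap     -- hypothetical fraction of residues that must agree
    """
    kept = []
    for start, end in author_helices:
        length = end - start
        if length <= 0:
            continue  # skip malformed ranges
        covered = sum(dssp_helix_mask[start:end])
        if covered / length >= min_overlap:
            kept.append((start, end))
    return kept


# Toy example: the first helix is 75% confirmed by DSSP, the second only 25%.
mask = [True, True, True, False, False, False, False, False, False, True]
print(agreed_helices([(0, 4), (6, 10)], mask))
```

A stricter variant would require identical end points in both assignments; the fraction-based test is only one way to make "agreed upon by both methods" concrete.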

The residue helix propensity values (εa) were determined from the ratio of the residue’s frequency of occurrence in helices versus its frequency of occurrence in the PDB (Equation 1):

\[ \epsilon_a = \frac{a_S / n_S}{a_P / n_P} \qquad (1) \]

where aS was the number of residues of type a in the helix library; nS was the total number of residues in the helix library; aP was the number of residues of type a in the PDB entries that contained all helices in the helix library; and nP was the total number of residues in the PDB entries used in this analysis. The εa values for residues in different helix groups were calculated using the corresponding values.
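The ratio described for Equation 1 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' database queries: the function name `helix_propensities` is hypothetical, and both inputs are assumed to be lists of one-letter-code strings (helix sequences from the helix library and full chain sequences from the selected PDB entries).

```python
from collections import Counter


def helix_propensities(helix_sequences, protein_sequences):
    """Return {residue: propensity} as (a_S / n_S) / (a_P / n_P).

    a_S, n_S -- count of residue a and total residues in the helix library
    a_P, n_P -- count of residue a and total residues in the full protein set
    """
    helix_counts = Counter("".join(helix_sequences))
    protein_counts = Counter("".join(protein_sequences))
    n_s = sum(helix_counts.values())
    n_p = sum(protein_counts.values())
    return {
        residue: (helix_counts[residue] / n_s) / (protein_counts[residue] / n_p)
        for residue in protein_counts
    }


# Toy data: one three-residue helix taken from a six-residue chain.
propensities = helix_propensities(["AAL"], ["AALGGA"])
print(propensities)
```

In the toy data Ala makes up two-thirds of the helix residues but only half of the chain, so its propensity is above 1.0, while Gly, which never occurs in the helix, has a propensity of 0.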

For neighbor-dependent analysis of helices, the frequency of occurrence of the residue type x at neighboring positions of a helix residue (a) was calculated according to Equation 2:

\[ \epsilon_x(a \pm i) = \frac{\Sigma x(a \pm i)_S / n(\mathrm{pair})_S}{\Sigma x(a \pm i)_P / n(\mathrm{pair})_P} \qquad (2) \]

where Σx(a ± i)S and Σx(a ± i)P were the occurrences of residue type x at the ±ith positions of the residue a in secondary structure sequence library S (α-helix) and in our PDB (P), respectively; n(pair)S and n(pair)P were the total number of residue pairs in S and P, respectively. The numerator of Equation 2 calculated the frequency of occurrence of residue x neighbored with the residue type a in the secondary structure (S), while the denominator of the equation calculated the frequency of occurrence of residue x neighbored with the residue type a in the PDB (P). The ratio of these values would be the propensity of residue x in S when it was neighbored with residue type a in S.
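The neighbor-dependent ratio described for Equation 2 can be sketched in the same way as the single-residue propensity. The code below is a hedged illustration, not the study's implementation: `pair_counts` and `neighbor_propensity` are hypothetical names, pairs are counted only when both residues lie inside the same sequence string, and pairs absent from the protein set return `None` because the ratio is then undefined.

```python
from collections import Counter


def pair_counts(sequences, offset):
    """Count (a, x) pairs where x sits at position `offset` relative to a.

    offset=-1 means x immediately precedes a; offset=+1 means x follows a.
    """
    counts = Counter()
    for sequence in sequences:
        for i, a in enumerate(sequence):
            j = i + offset
            if 0 <= j < len(sequence):
                counts[(a, sequence[j])] += 1
    return counts


def neighbor_propensity(helix_seqs, protein_seqs, a, x, offset):
    """Propensity of x at `offset` from a in helices, relative to all proteins."""
    in_helix = pair_counts(helix_seqs, offset)
    in_protein = pair_counts(protein_seqs, offset)
    n_pair_s = sum(in_helix.values())
    n_pair_p = sum(in_protein.values())
    if n_pair_s == 0 or in_protein[(a, x)] == 0:
        return None  # ratio undefined for this pair
    return (in_helix[(a, x)] / n_pair_s) / (in_protein[(a, x)] / n_pair_p)


# Toy data: Pro precedes Ala inside the helix, and follows Ala only outside it.
helices, proteins = ["PAE"], ["PAEGAP"]
print(neighbor_propensity(helices, proteins, "A", "P", -1))
print(neighbor_propensity(helices, proteins, "A", "P", +1))
```

Calling the function with `offset=-4` or `offset=+4` gives the corresponding value for residues four positions apart.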

In this paper, we present the neighbor-dependent propensity values as εx(a±1). An εx(a±1) value of 1.0 means that the occurrence of the residue pair, ax (or xa), in helices is the same as its frequency of occurrence in proteins. A value >1.0 means the pair has an occurrence in helices higher than that in proteins, suggesting that the pair has a preference for adopting helix conformation. εx(a±1) values lower than unity suggest less preference for the pair in helices. For example, εP(A – 1) = 1.52 in short helices means Pro has about a 50% greater chance of being found in short helices than in proteins overall when it precedes Ala, i.e. Pro at the –1 position of Ala; εP(A + 1) = 0.47 suggests Pro is less likely to be found in short helices when it follows Ala, i.e. Pro at the +1 position of Ala.


    Results
 
Length distribution of α-helices in proteins

The lengths of α-helices in proteins varied between three and 77 residues with a total of 10 643 helices in the PDB. The population distribution of helices in the library had a mean helical length of approximately 12.1 residues (Figure 1), which was close to the mean helical length determined by Barlow and Thornton (Barlow and Thornton, 1988) from a data set of 291 α-helices. There was a relatively large population of three-residue helices found in proteins. Those helices were most likely half-turn helices or irregular helices such as the 3₁₀ helices, which were more frequently found among helices fewer than four residues long (Barlow and Thornton, 1988). In contrast to the α-helix length distribution reported by Zhu and Blundell (Zhu and Blundell, 1996), where four-residue helices had twice the population of the five-residue helices, our helix library contained more five-residue helices than four-residue helices by a ratio of 3:2 (Figure 1). There was a gradual decrease in the helix population as the helical length increased beyond 13 residues. Helices longer than 40 residues were rarely found in proteins. The longest helix in our helix library had 77 residues. In fact, there were only one or two examples of each helical length above 47 residues (except 50- and 51-residue helices, which had three examples each).



Fig. 1. Length distribution of α-helices in the helix library.

 
In order to establish a data set that was representative of the general helix population, we chose a subset of our helix library containing only helices with lengths between four and 22 residues. This subset had a total population of 8771 helices. Owing to the uneven distribution of helix lengths in the library, where the number of short helices far exceeded the number of medium and long helices (Figure 1), it was likely that the helix preference patterns described for the helix library would only reflect the characteristics of the short or medium α-helices. In order to address such potential bias, we divided the subset of helices into three groups: short helices (four to seven residues), medium helices (eight to 13 residues) and long helices (14 to 22 residues). Based on the selection criteria, all three groups had approximately equal population sizes. In the following discussions, propensities calculated from the entire subset of four- to 22-residue helices were termed total helix propensity; propensities calculated from the short, the medium and the long helical groups were termed short, medium and long helix propensities, respectively.
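The grouping just described is a simple binning by helix length. A minimal sketch follows; the function names `helix_length_group` and `group_helices` are hypothetical, while the length boundaries are the ones stated in the text. Helices outside the four- to 22-residue subset are left ungrouped.

```python
def helix_length_group(length):
    """Map a helix length (in residues) to its group, or None if outside 4-22."""
    if 4 <= length <= 7:
        return "short"
    if 8 <= length <= 13:
        return "medium"
    if 14 <= length <= 22:
        return "long"
    return None  # three-residue helices and helices longer than 22 residues


def group_helices(helix_sequences):
    """Split helix sequences into short, medium and long groups."""
    groups = {"short": [], "medium": [], "long": []}
    for sequence in helix_sequences:
        group = helix_length_group(len(sequence))
        if group is not None:
            groups[group].append(sequence)
    return groups


# Toy helices of 3, 5, 10 and 16 residues; the three-residue one is dropped.
print(group_helices(["AEL", "AELKA", "AELKAAELKA", "AELKAAELKAAELKAA"]))
```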

Propensities of amino acids in α-helices of different length groups

The helix propensities of amino acids in short helices appeared quite different from those of the medium and long helices. Of particular interest were Trp and Pro. Not known as a strong helix conformer, Trp had a significantly higher frequency of occurrence in short helices (εW = 1.51) than in medium and long helices. In fact, Trp was the strongest helix conformer in short helices. In contrast with its presence in medium and long helices, Pro also had a significantly elevated frequency of occurrence in short helices (εP = 0.99). A number of residues, including Asp, Cys, Phe, Ser and Tyr, also had higher helix propensities in short helices than their propensity values in medium and long helices, while in the same subgroup of helices, residues Ala, Arg, Gln, Ile, Leu, Lys, Met and Val had slightly lower propensity values. Particularly noticeable were residues Asp and Ala. Both residues were good helix conformers in short helices, while their helix propensities were quite different in the context of the overall helix population (Figure 2). The amino acid propensities in the medium and the long helix subgroups were generally similar to those of the total helix group. It appeared that the helical composition of short helices was quite different from that of the medium and long helices.



Fig. 2. A bar graph of the normalized helix propensity of amino acids in groups of different helix sizes. The propensity values of amino acids in different helix groups are represented by different bars as indicated in the legend.

 
Neighbor-dependent sequence analysis of α-helices

In an attempt to analyze how neighboring residues affect the α-helix conformations of amino acids, we calculated amino acid preferences at positions immediately preceding (–1) or following (+1) an α-helix residue (a). The neighbor-dependent helix propensities of 20 amino acids at the +1 and –1 position of α-helix residues [εx(a±1)] in different groups are tabulated in Table Ia–d. Propensity values >1.20 are in bold in the table for ease of inspection. Based on estimated standard deviations, most of the neighbor-dependent propensities had a level of confidence comparable to that of the individual amino acid propensities (Table Ia–d) (J.Wang and J.-A.Feng, unpublished results; Kumar and Bansal, 1996). The estimated standard deviations were slightly higher for neighbor-dependent propensities in the short helix group than for those of the other helix groups. This variation could in part be attributed to the small population size of residue pairs in the short helix group, which was less than one-third of that of the other groups.


Table I. Normalized neighbor-dependent helix propensity of residues in various helix groups
 
The neighbor-dependent helix propensities of amino acids often reflected the individual helix propensities of neighboring residues. When two strong helix conformers were neighbored, their neighbor-dependent propensity was almost invariably high. By the same token, when two weak helix conformers were neighbored, their neighbor-dependent propensity was often low. Interesting patterns often occurred when a strong helix conformer was neighbored with a weak helix conformer, or two moderate helix conformers were neighbored.

Ala, Glu, Leu and Gln were stronger helix conformers. Neighbor-dependent propensity calculation showed that they had a strong influence on the preference of neighboring residues for adopting helical conformation. Not surprisingly, amino acids with strong or medium individual helix propensity, including Ala, Arg, Gln, Glu, Ile, Leu, Lys, Met, Phe, Ser, Trp and Tyr, often exhibited a strong preference for adopting α-helix conformation when they were neighbored with Ala, Glu, Leu and Gln residues. On the other hand, the neighbor-dependent effect for stronger helix conformers positioned next to residues with low individual helix propensity was limited. Although the frequency of occurrence for residues with low individual helix propensities was generally increased when they were neighbored with strong helix conformers, most of the neighbor-dependent propensities were nevertheless <1.0 (Table Ia). Proportionally, medium helix conformers had less influence than the strong helix conformers on the preference of neighboring residues for adopting helix conformation. Amino acids Arg, Lys, Met and Trp had high neighbor-dependent propensity when they were neighbored with each other (Table Ia). No neighbor-dependent effect was observed when they were positioned next to weak helix conformers.

Unique sequence patterns were observed in different helix groups for a number of amino acids, particularly in the short helix group. Asp had a high propensity for helix conformation when it was neighbored with Ala, Arg and Glu in short helices, while such a pattern was not observed in the medium and the long helix groups. In contrast, the pairings of Ala with Val and Ile, as well as the pairing of Arg with Leu, were less frequently found in short helices than in other helix groups (Table Ib–d). Similarly reduced neighbor-dependent propensity values were also found for Arg, Ile, Lys and Met when they were neighbored to Gln in short helices. Tyr, a weak helix conformer, had a strong neighbor-dependent helix propensity when it was positioned next to Met in short helices [εY(M + 1) = 1.91, εY(M – 1) = 1.82], while in other helix groups, Met had no influence on the preference of Tyr adopting helical conformation. Another noteworthy pattern was that the pairing of Met and Ala, two strong helix conformers, showed no preference for adopting helical conformation in short helices (Table Ib). Such a pattern was not observed in other helix groups.

For weak helix conformers, their neighbor-dependent helix propensities varied significantly in different helix groups. In the short helix group, Asp was a good ‘helix neighbor’ to a number of amino acids, including Ala, Arg, Asp, Gln, Glu, Ile, Leu, Lys, Met, Phe, Trp, Tyr and Val, while in long helices Asp had little effect on other amino acids adopting helical conformation. Also in the short helix group, the pairing of Tyr with Cys and Glu yielded strong neighbor-dependent propensities. In medium and long helices, Tyr was found to be favored next to Leu residues. Cys was also more frequently found at the +1 position of Phe [εC(F + 1) = 1.96] and at the –1 position of Trp [εC(W – 1) = 2.54]. There were only relatively few amino acids showing preference for the helix conformation when neighbored to a His residue (Table Ia–c). One example was that Cys and Trp were preferably positioned at the –1 position of His in short helices [εW(H – 1) = 2.07, εW(H – 1) = 1.94]. Like that of His, residues neighboring Ser were not often found in helical conformation. The exceptions we found were the pairings of Ser with Trp and Glu in short helices (Table Ib).

Pro and Gly, being the residues that had the lowest propensity for helix conformation, had the expected pattern of weak effect on the helical propensities of their neighbors. This pattern appeared consistent in all helix groups (Table Ia–d). Interesting preference patterns of amino acids neighboring Pro residues, however, were found in short helices. Pro had high helix propensity values when it was positioned at the –1 positions of Ala, Gln, Glu, Phe, Ser and Trp residues [εP(A – 1) = 1.52, εP(Q – 1) = 1.97, εP(E – 1) = 2.18, εP(F – 1) = 1.49, εP(S – 1) = 1.58, εP(W – 1) = 1.90]. On the other hand, no preference pattern was found for residues neighboring Gly in helices.


    Discussion
 
The large population of secondary structural sequences derived in this study allowed us to analyze sequence patterns in different helix groups. Amino acid helix propensity values in the medium and the long helix groups were quite similar to those found in the total helix group. More significant variations were found in short helices, particularly for residues Trp and Pro, where both residues showed a significant increase in their frequency of occurrence (Figure 2). Such differential preference of certain amino acids in helices of variable sizes has also been found in other studies. Kumar and Bansal (Kumar and Bansal, 1996) have shown that long helices (helices with more than 25 residues) often had a higher content of residues with longer side chains than those in medium and short helices. It was suggested that amino acids with longer side chains could perhaps better facilitate complementary interactions with other elements of the protein structure (Kumar and Bansal, 1998). However, the analysis of our helix library failed to show a trend that the appearance of residues with bulkier side chains was more favored in longer helices. It should be noted that our current helix library contained only a limited number of helices longer than 25 residues (Figure 1). It would be interesting to revisit the helix propensities of amino acids in longer helices when a larger population of long helices (>25 residues) becomes available.

The differential propensity values of amino acids in different helix groups were also reflected in the neighbor-dependent sequence analysis. As a result, we found a number of sequence patterns that were unique to specific helix groups. For example, Trp had mostly comparable neighbor-dependent propensities with other amino acids for helices in the medium, long and total helix groups. On the other hand, in the short helix group, Trp was strongly favored for the helical conformation when it was positioned at the –1 positions of Glu, His, Lys and Ser. However, at the +1 positions of these residues, Trp had no preference for helical conformation (Table Ia–d). Although the estimated standard deviations for residue pairs containing Trp were higher than those of other residue pairs, the neighbor-dependent propensities for Trp–Glu, Trp–His, Trp–Lys and Trp–Ser were large enough to justify the statistical significance of the sequence patterns. This unsymmetrical neighbor preference was not limited to Trp; Pro also exhibited unusual sequence patterns in short helices. Although the individual propensity value of Pro in the short helix group (εP = 0.95) was significantly higher than its propensity value in the medium and the long helix groups, it was nevertheless still below the threshold of random distribution (Figure 2). Our neighbor-dependent sequence analysis showed that Pro had high frequencies of occurrence at the –1 position of Ala, Gln, Glu, Phe, Ser and Trp, with propensity values of εP(A – 1) = 1.52, εP(Q – 1) = 1.97, εP(E – 1) = 2.18, εP(F – 1) = 1.49, εP(S – 1) = 1.58 and εP(W – 1) = 1.90, respectively. The occurrences of Pro at the +1 position of these amino acids in short helices, on the other hand, were significantly below that of the random distribution (Table Ib).

Earlier studies had shown that Pro was one of the preferred residues at the Ncap position of the helices (the position before the first residue of the α-helix) (Richardson and Richardson, 1988; Harper and Rose, 1993; Kumar and Bansal, 1998), as well as at the first position of helices (the N1 position) (Penel et al., 1999). The finding of high occurrence of a Pro–X (X = Ala, Gln, Glu, Phe, Ser and Trp) pattern could potentially be a ‘by-product’ of Pro being favored at the N-terminus of helices. On the other hand, while the propensities of the amino acids Asp and Leu were found to be among the highest at the second (the N2 position) and the third positions (the N3 position) of the helix (Penel et al., 1999; Cochran and Doig, 2001), the amino acid pairs Pro–Asp and Pro–Leu had relatively low neighbor-dependent propensity values in α-helices (Table Ia–d). It appeared that the neighbor-dependent effect played a role in determining sequence patterns in α-helices.

While it was difficult to provide a physical–chemical rationale for the sequence patterns discovered in this study, the neighbor dependency of amino acids favoring helical conformation appeared to be consistent with experimental findings. Recent studies have shown that an amino acid propensity value for a particular geometrical conformation is not independent of its environment. Sequence analysis of α-helices in proteins revealed that transitions from loop to helix conformation required the presence of a particular group of amino acids (Presta and Rose, 1988; Richardson and Richardson, 1988; Parker and Hefford, 1997). The amino acid composition at the ends of helices, where they were often more hydrophilic in nature, was distinctly different from the composition in the middle of the helices, where they were often more hydrophobic in nature (Lacroix et al., 1998). These findings were also supported by experimental work that analyzed the helix propensity values of amino acids at different positions of the synthetic peptides (Petukhov et al., 1998, 2002; Thomas et al., 2001).

One of the applications of the knowledge on the sequence–structure relationship of proteins is in predicting protein secondary structures. Protein secondary structure predictions based on statistical methods have been implemented in a number of computer algorithms. These include the rule-based Chou and Fasman method (Chou and Fasman, 1978), the GOR III method, which predicts the structural conformation of an amino acid in a protein sequence based on the statistical information of amino acids within a window surrounding that amino acid (Gilbrat et al., 1987), and the more recent application of the hidden Markov model (HMM), which derives a probability for assigning a structural conformation of an amino acid by inferring its neighboring information (Asai et al., 1993; Schmidler et al., 2000). These studies have shown that the accuracy of secondary structure prediction is improved as more sophisticated implementations of neighboring sequence information are applied. Implicitly, these studies suggest that neighboring residue type could play a role in affecting the propensity of an amino acid adopting a particular conformation. However, structure prediction algorithms are usually prediction-accuracy driven; little consideration has been placed on the nature of intermediate parameters derived from the training data set. As demonstrated in this study, the sequence patterns could differ significantly in secondary structures of variable lengths. It is conceivable that the prediction efficiency of these algorithms could be improved with the incorporation of knowledge learned in this study.

One of the most noticeable attributes of the neighbor-dependent analysis method was the increased statistical significance of the residue propensities. The individual propensity values, εa, were in the range 0.67–1.51 (Figure 2), while the neighbor-dependent propensity values, εx(a ± 1), were in a much greater range between 0.19 and 2.54. Because different scale factors were applied (see Materials and methods), the derived propensity values obviously could not be directly compared. Nevertheless, the statistical significance of both methods should be comparable, i.e. a propensity value of 1.0 represented random distribution of the amino acids in the PDB. The expanded statistical scale of the neighbor-dependent analysis enabled us to explore the hidden codes in the protein sequence. Specifically, we were able to identify dyad signatures (ab) that were highly favorable for the helix conformation, whereas their dyad pairs (i.e. ba) had little or no preference for their corresponding conformations. Table II lists some of the asymmetric dyads that had a high propensity for helix conformations. The dyads in Table II were selected from Table Ia–d according to the following criteria: (i) all entries had propensities >1.30, whereas the dyad pair of these entries was <1.2, and (ii) the propensity difference of the dyad pair was >0.3.
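The two selection criteria for Table II can be written as a small filter over ordered dyad propensities. The sketch below is illustrative only: the function name `asymmetric_dyads` and the dictionary layout, which maps an ordered pair (first residue, second residue) to its neighbor-dependent propensity, are assumptions. The four example values are the short-helix figures quoted in this paper for Pro/Ala and Met/Tyr.

```python
def asymmetric_dyads(dyad_propensity, high=1.30, low=1.2, min_gap=0.3):
    """Select ordered dyads (a, b) that are favored while the reverse (b, a) is not.

    Criteria from the text: propensity of ab > 1.30, propensity of ba < 1.2,
    and a difference between the two greater than 0.3.
    """
    selected = []
    for (first, second), value in dyad_propensity.items():
        if first == second:
            continue  # a homodimer has no distinct reverse dyad
        reverse = dyad_propensity.get((second, first))
        if reverse is None:
            continue  # no value available for the reverse order
        if value > high and reverse < low and value - reverse > min_gap:
            selected.append((first, second))
    return sorted(selected)


# Ordered dyads in one-letter code: ("P", "A") means Pro followed by Ala.
short_helix_values = {
    ("P", "A"): 1.52,  # Pro at the -1 position of Ala
    ("A", "P"): 0.47,  # Pro at the +1 position of Ala
    ("M", "Y"): 1.91,  # Tyr at the +1 position of Met
    ("Y", "M"): 1.82,  # Tyr at the -1 position of Met
}
print(asymmetric_dyads(short_helix_values))
```

With these values only Pro–Ala passes the filter; the Met/Tyr pairing is favored in both orders and therefore does not count as an asymmetric dyad.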


Table II. Dyad sequence codes for different groups of helix and strand
 
The existence of the dyad sequence patterns reflected the dependence of the helical preference of some amino acids on their neighbors. Such patterns were not always easily predictable. Short helices had by far the most diversified sequence patterns (Table II). The preference of amino acid dyads for adopting helical conformation could not be easily rationalized. Through structural geometrical analysis, Penel et al. (1999) suggested that neighbor-dependent sequence preference for adopting helical conformation could arise from side chain–main chain or side chain–side chain interactions between neighboring residues. The sequence preference for {alpha}-helix was not always predictable by assuming proportional contributions from the propensity values of the individual amino acids. For example, the occurrence of a Pro ({epsilon}P = 0.55) following a Gly ({epsilon}G = 0.58), i.e. the sequence Gly–Pro, was actually higher in helices than that of a Pro following a Gln (Gln–Pro), even though Gln had an individual helix propensity value of 1.22 (Table Ia). Similar examples were found in numerous combinations of amino acid pairs. The sequence patterns of loops were even more diversified than those of the helices discussed here (Crasto and Feng, 2001).

Residues in {alpha}-helices could have both sequence neighbors and spatial neighbors. Spatial neighbors in helices were residues at positions i and i ± 4 in the sequence. An intriguing question was whether neighbor-dependent effects, such as those observed in the loop sequences, could exist between spatially neighboring residues. For example, side chains of the spatial neighbors could form favorable interactions, including hydrogen bonding, polar and hydrophobic interactions, thus providing stabilization energy for helical conformation. We examined this scenario by calculating the neighbor-dependent propensity between residues at positions i and i ± 4 in helices. Surprisingly, our results showed no significant correlation between the chemical complementarity of spatial neighbors and their neighbor-dependent propensity in {alpha}-helix (J.Wang and J.-A.Feng, unpublished results). Spatial neighbors with chemical complementarity apparently did not have a higher neighbor-dependent propensity than spatial neighbors without it.
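A calculation of this kind can be sketched as follows. This is an illustration under stated assumptions, not the procedure defined in Materials and methods: the propensity is taken here as the frequency of a residue pair within helices divided by its frequency over all positions, without the scale factors used in this study, and the function name, input layout and the 'H' helix label are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def pair_propensity(
    chains: Iterable[Tuple[str, str]],  # (sequence, per-residue structure labels)
    offset: int = 4,                    # 4 for spatial neighbors, 1 for sequence neighbors
    state: str = "H",                   # assumed label marking helical residues
) -> Dict[Pair, float]:
    """Ratio of a pair's frequency inside helices to its overall frequency.

    A value of 1.0 means the pair is no more common in helices than expected
    from its overall occurrence (simplified definition, assumed for this sketch).
    """
    in_state: Counter = Counter()
    overall: Counter = Counter()
    for seq, ss in chains:
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings differ in length")
        for i in range(len(seq) - offset):
            pair = (seq[i], seq[i + offset])
            overall[pair] += 1
            # Count the pair as helical only if every residue from i to
            # i + offset is helical, i.e. both residues lie in the same helix.
            if all(label == state for label in ss[i:i + offset + 1]):
                in_state[pair] += 1
    n_state = sum(in_state.values())
    n_all = sum(overall.values())
    if n_state == 0:
        return {}
    return {
        pair: (count / n_state) / (overall[pair] / n_all)
        for pair, count in in_state.items()
    }


# Two toy chains (made-up sequences): one fully helical, one fully non-helical
toy_chains = [("AEAAKA", "HHHHHH"), ("GPGGPG", "CCCCCC")]
spatial = pair_propensity(toy_chains, offset=4)
```

Calling the same function with `offset=1` gives the analogous ratio for sequence neighbors, so the two kinds of neighbor can be compared on one scale.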

The results of this study show that the sequence patterns of {alpha}-helices were more predictable than those of loops. A combination of residues with high individual propensity values for helix conformation would usually yield a high preference for adopting a helical conformation. On the other hand, when residues with moderate or low helix preferences were neighbors in sequence, unexpected sequence patterns emerged, particularly in short helices, where numerous amino acid dyads were found. Considering the differential amino acid composition in helices of different lengths, it was not surprising that the sequence patterns of {alpha}-helices were size dependent.


    Acknowledgements
 
The authors would like to thank Jennifer Davis and other members of the Feng laboratory for helpful discussions. We also acknowledge financial support from the National Institutes of Health (GM54630), the American Cancer Society (PRG9926301GMC) and an appropriation from the Commonwealth of Pennsylvania.


    References
Anfinsen,C.B. (1973) Science, 181, 223–230.

Asai,K., Hayamizu,S. and Honda,K.I. (1993) Comput. Appl. Biosci., 9, 141–146.

Baldwin,R.L. and Rose,G.D. (1999a) Trends Biochem. Sci., 24, 26–33.

Baldwin,R.L. and Rose,G.D. (1999b) Trends Biochem. Sci., 24, 77–83.

Barlow,D.J. and Thornton,J.M. (1988) J. Mol. Biol., 201, 601–619.

Benner,S.A. and Gerloff,D. (1990) Adv. Enzyme Regul., 31, 121–181.

Bernstein,F.C. et al. (1977) J. Mol. Biol., 11, 535–542.

Chamberlain,A.K. and Marquesee,S. (1997) Structure, 5, 859–863.

Chou,P.Y. and Fasman,G.D. (1978) Annu. Rev. Biochem., 47, 251–276.

Cochran,D.A.E. and Doig,A.J. (2001) Protein Sci., 10, 1305–1311.

Colloc’h,N. and Cohen,F.E. (1991) J. Mol. Biol., 221, 603–613.

Crasto,C.J. and Feng,J.-A. (2001) Proteins: Struct. Funct. Genet., 42, 399–413.

Dill,K.A. (1990) Biochemistry, 29, 7133–7155.

Efimov,A.V. (1993) Prog. Biophys. Mol. Biol., 60, 201–239.

Gilbrat,J.-F., Garnier,J. and Robson,B. (1987) J. Mol. Biol., 198, 425–433.

Harper,E.T. and Rose,G.D. (1993) Biochemistry, 32, 7605–7609.

Honig,B. (1999) J. Mol. Biol., 293, 283–293.

Hughson,F.M., Barrick,D. and Baldwin,R.L. (1991) Biochemistry, 30, 4143–4148.

Hutchinson,E.G. and Thornton,J.M. (1994) Protein Sci., 3, 2207–2216.

Jennings,P.A. and Wright,P.E. (1993) Science, 262, 892–896.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Kumar,S. and Bansal,M. (1996) Biophys. J., 71, 1574–1586.

Kumar,S. and Bansal,M. (1998) Proteins: Struct. Funct. Genet., 31, 460–476.

Lacroix,E., Viguera,A.R. and Serrano,L. (1998) J. Mol. Biol., 284, 173–191.

Olivea,O., Bates,B.A., Querol,E., Aviles,F.X. and Sternberg,M.J.T. (1997) J. Mol. Biol., 266, 814–830.

Parker,M.H. and Hefford,M.A. (1997) Protein Eng., 10, 487–496.

Penel,S., Hughes,E. and Doig,A.J. (1999) J. Mol. Biol., 287, 127–143.

Petukhov,M., Munoz,V., Yumoto,N., Yoshikawa,S. and Serrano,L. (1998) J. Mol. Biol., 278, 279–289.

Petukhov,M., Uegaki,K., Yumoto,N. and Serrano,L. (2002) Protein Sci., 11, 766–777.

Presta,L.G. and Rose,G.D. (1988) Science, 240, 1632–1641.

Richardson,J.S. and Richardson,D.C. (1988) Science, 240, 1648–1652.

Rooman,M.J., Rodriguez,J. and Wodak,S.J. (1990) J. Mol. Biol., 213, 327–336.

Schmidler,S.C., Liu,J.S. and Brutlag,D.L. (2000) J. Comput. Biol., 7, 233–248.

Thomas,S.T., Loladze,V.V. and Makhatadze,G.I. (2001) Proc. Natl Acad. Sci. USA, 98, 10670–10675.

Unger,R., Harel,D., Wherland,S. and Sussman,J.L. (1989) Proteins: Struct. Funct. Genet., 5, 355–375.

Wang,G. and Dunbrack,R.L. (2003) Bioinformatics, 12, 1589–1591.

Williams,R.W., Chang,A., Juretic,D. and Loughran,S. (1987) Biochim. Biophys Acta, 916, 200–204.

Zhu,Z.-Y. and Blundell,T.L. (1996) J. Mol. Biol., 260, 261–276.

Received January 5, 2003; revised June 25, 2003; accepted September 4, 2003.




