A critical assessment of the secondary structure prediction of α-helices and their termini in proteins

Claire L. Wilson, Simon J. Hubbard and Andrew J. Doig

Department of Biomolecular Sciences, UMIST, P.O. Box 88, Manchester M60 1QD, UK


    Abstract
 
Secondary structure prediction from amino acid sequence is a key component of protein structure prediction, with current accuracy at ~75%. We analysed two state-of-the-art secondary structure prediction methods, PHD and JPRED, comparing predictions with secondary structure assigned by the algorithms DSSP and STRIDE. The specific focus of our study was α-helix N-termini, as empirical free energy scales are available for residue preferences at N-terminal positions. Although these prediction methods perform well in general at predicting the α-helical locations and length distributions in proteins, they perform less well at predicting the correct helical termini. For example, although most predicted α-helices overlap a real α-helix (with relatively few completely missed or extra predicted helices), only one-third of JPRED and PHD predictions correctly identify the N-terminus. Analysis of the sequences neighbouring predicted helical N-termini shows that the correct N-terminus is often within one or two residues. More importantly, the true N-terminal motif is, on average, more favourable as judged by our experimentally measured free energies. This suggests a simple, but powerful, strategy to improve secondary structure prediction using empirically derived energies to adjust the predicted output to a more favourable N-terminal sequence.


    Introduction
 
Secondary structure prediction is a key element in many different approaches to protein structure prediction. It constitutes one of the most generally applicable forms of structure prediction (Schulz and Schirmer, 1974; King, 1996; Lesk, 1997; King et al., 2000), simplifying the 20-state amino acid sequence into typically three states (helix, strand or coil/loop) and is often used to provide constraints for comparative modelling or as a starting point for fold recognition (Russell et al., 1996; Rost et al., 1997). Indeed, recent work has also shown that accurate protein secondary structure information is a useful baseline in fold recognition (McGuffin et al., 2001).

The post-genomic era has continued to drive the interest in predicting gene function from the expanding sequence resources and it is widely recognized that structural analysis, comparison and prediction are powerful tools for inferring biological and molecular function. Because protein structure is better conserved than sequence, structural information can prove extremely useful when trying to predict protein function, either directly (Stawiski et al., 2001) or from predicted secondary structure (Aurora and Rose, 1998a; Xu et al., 1999). Structural comparisons allow the identification of features not recognizable from sequence alone (Holm and Sander, 1994; Orengo, 1994; Shapiro and Harris, 2000) and can give novel information over and above that obtained by traditional sequence-based comparisons such as BLAST (Altschul et al., 1990) and FASTA (Pearson and Lipman, 1988).

One of the major limitations in assigning function through structural similarity or in the prediction of tertiary structure is the limited number of experimentally determined protein structures. Although comparative modelling allows accurate models to be built based on similarities of the unknown protein sequence to known structures, there are relatively few homologous proteins of known structure. Roughly one-third of all open reading frames (ORFs) in a genome will lack any detectable sequence similarity to any other proteins (Fischer and Eisenberg, 1997). Indeed, a larger fraction still (up to two-thirds) usually have no detectable homologue of known structure, necessitating the use of secondary structure prediction or ab initio methods to produce a structural model.

Secondary structure prediction is still widely used to make structural inferences in the absence of a similar structure in the databases and has progressed considerably since early approaches. Original methods were knowledge-based, derived from patterns of residues found in known structures (Chou and Fasman, 1974; Garnier et al., 1978). Despite the increase in the size of the structural databases, no parallel increase in accuracy of secondary structure prediction has been observed (Chou and Fasman, 1974; Garnier et al., 1978; Gibrat et al., 1987; Biou et al., 1988; Levin and Garnier, 1988) and most advances have been achieved through improvements in methodology. The many methods for predicting protein secondary structure all essentially employ pattern recognition techniques along with analysis of structural features and residue compositions and usually fall into one of the following four categories (Ouali and King, 2000):

  1. simple, linear statistics derived from residue patterns or physico-chemical properties of amino acids;
  2. nearest neighbour;
  3. machine learning, such as neural networks;
  4. methods employing complex, non-linear statistics derived from residue patterns/physico-chemical properties, such as hidden Markov models.

All of these methods derive information from known protein structures and are dependent on the contents of current structural databases. The earliest secondary structure prediction methods were mainly centred on the first scheme using residue properties (Garnier et al., 1978) or propensities (Chou and Fasman, 1974). These were further developed to incorporate local side chain effects (Maxfield and Scheraga, 1979; Gibrat et al., 1987) and the use of multiple sequence alignments to average over the secondary structure propensities for aligned residues (Zvelebil et al., 1987).

Nearest neighbour approaches are statistical methods that arise from the hypothesis that similar primary sequences will have similar secondary structures and have exploited empirically derived measures (Levin and Garnier, 1988), artificial intelligence (Yi and Lander, 1993) and multiple sequence alignments (Salamov and Solovyev, 1995). Symbolic machine learning approaches form rules/theories from a training data set and form the basis for most artificial intelligence research. These rules may then be applied to test sequences to predict secondary structure (King and Sternberg, 1990, 1996). Hidden Markov models use probabilistic models and assume that the probability of a residue being in a particular type of secondary structure depends on the secondary structure of preceding residues (Krogh et al., 1994). Neural networks are one of the most popular applications of complex non-linear statistics in predicting protein secondary structure and these take either an amino acid sequence or a multiple sequence alignment as input and output the predicted secondary structure (Rost and Sander, 1993; Cuff et al., 1998; Jones, 1999).

A key advance in all types of secondary structure prediction was the inclusion of multiple sequence alignments as input to the algorithms, originally proposed by Benner and Gerloff (Benner and Gerloff, 1991). This study was able to predict accurately the secondary structure of cAMP-dependent kinases through solvent accessibility patterns derived by clustering sequences of an aligned family. Multiple sequence alignments reveal additional evolutionary information through patterns of sequence variability and locations of insertions and deletions, which are more likely to occur in loop regions (Jones, 1999). Although the original approach was more manual than automatic, multiple sequence alignments have since been incorporated into most automated methods of secondary structure prediction. Further improvements were seen in secondary structure prediction accuracy by using sequence profiles, generated by PSI-BLAST (Altschul et al., 1997) searches, as input into a neural network (Jones, 1999).

Another important development is the use of consensus servers (Rost and Sander, 1993; Cuff et al., 1998), which combine the output from several secondary structure prediction methods to yield a final consensus output. This general approach has been suggested to outperform individual methods (King et al., 2000). Current state of the art in secondary structure prediction is ~75% accuracy (King and Sternberg, 1990, 1996; Yi and Lander, 1993; Salamov and Solovyev, 1995; Ouali and King, 2000). We observe that one of the apparent failings of secondary structure prediction based on multiple sequence alignments is that although these methods are able to identify accurately interior regions of secondary structure, they often fail to identify the termini correctly. This is because, although the number, type and location of secondary structures within homologous families are frequently conserved (Shi et al., 2001), the lengths of secondary structural elements may vary across the family (Flores et al., 1993; Russell and Barton, 1993; Rost et al., 1994).

In addition to multiple sequence-based approaches, there is also a need to develop further methods to predict the secondary structure of single sequences for which no or very few orthologues exist and multiple sequence alignment-based methods are inappropriate. Accurate secondary structure predictions are also necessary for correct tertiary structure prediction, comparative modelling and fold recognition (Jones, 1999) and any improvements will lead to benefits in these areas.

In this work, we made a detailed study of the successes and failures of secondary structure prediction, focusing on the state-of-the-art prediction methods PHD and JPRED applied to α-helices and assessing the potential improvements in prediction that can be gained using empirically derived information on specific positions at the fringes of α-helices. The residue immediately preceding the first helical residue is termed the N-cap and is followed by the N1, N2 and N3 positions (Richardson and Richardson, 1988). The N' residue precedes the N-cap. The N1, N2 and N3 sites differ from interior sites (N4 onwards) as their backbone amide NH groups do not participate in helical backbone i, i + 4 hydrogen bonds. Similarly, residues at helix C-termini are named C3, C2, C1, C-cap and C'. The termini of α-helices show unique structural and energetic properties with distinct preferences for each of these sites (Argos and Palau, 1982; Kumar and Bansal, 1998a; Petukhov et al., 1998, 1999; Penel et al., 1999a). In particular, the preferences for each of the 20 amino acids for the N-cap (Doig and Baldwin, 1995), N1 (Cochran et al., 2001) and N2 (Cochran and Doig, 2001) sites are known in terms of empirically measured free energies. Other groups have also measured these values and they show close agreement with the values recorded by Doig et al.; however, the scales are incomplete and lack values for some charged and aromatic amino acids (Petukhov et al., 1998, 1999). The N-terminus of the α-helix is the only secondary structure region for which this detailed empirical information is available and so it is here that we concentrated our analysis.

The aim of this work was to investigate current multiple sequence alignment-based methods of secondary structure prediction and how accurately they predict the lengths, locations and N-termini of α-helices. Specifically, the ability to predict the amino acids at the N-terminal boundaries of α-helices in a set of predominantly α-helical proteins was assessed in comparison with the empirically measured free energies of all 20 amino acids in α-helical peptides at the same positions. JPRED and PHD were chosen as model secondary structure prediction programs because, at the onset of this work, they represented the state of the art in secondary structure prediction. They are two of the most popular methods used to predict secondary structure and are freely available over the Internet. It is shown that JPRED and PHD fail to assign N-terminal residues correctly in 60% of cases and that a simple comparison based on free energy data could potentially improve this.


    Materials and methods
 
Data set

Protein chains were selected from the June 2000 release of PDBselect (Hobohm et al., 1992; http://www.sander.embl-heidelberg.de/pdbsel), with a maximum of 25% pairwise sequence identity between any two members. The data set was further restricted to those structures solved by X-ray crystallography, with a maximum R-factor of 0.2 and a minimum helical content of 10%. In total, 358 protein chains were analysed that contained 69 906 residues. A number of sequences were removed because either the prediction programs or the assignment programs failed to assign secondary structure (information about the data set is available on request).
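The selection criteria above amount to a simple filter over chain metadata. The following is a minimal sketch only, assuming a hypothetical `ChainRecord` structure and placeholder chain identifiers; it is not the procedure used to build the data set, and the 25% pairwise identity limit is taken as already enforced by the PDBselect list.

```python
from dataclasses import dataclass

@dataclass
class ChainRecord:
    # Hypothetical metadata fields, named here for illustration only.
    chain_id: str
    method: str            # experimental method, e.g. "X-ray" or "NMR"
    r_factor: float        # crystallographic R-factor (unused for NMR entries)
    helix_fraction: float  # fraction of residues assigned as alpha-helix

def passes_filters(rec, max_r_factor=0.2, min_helix_fraction=0.10):
    """Apply the three criteria stated in the text to one chain."""
    return (rec.method == "X-ray"
            and rec.r_factor <= max_r_factor
            and rec.helix_fraction >= min_helix_fraction)

# Placeholder records, not real PDB entries.
candidates = [
    ChainRecord("chainA", "X-ray", 0.18, 0.45),  # kept
    ChainRecord("chainB", "X-ray", 0.25, 0.60),  # R-factor too high
    ChainRecord("chainC", "NMR", 0.0, 0.50),     # not solved by X-ray
    ChainRecord("chainD", "X-ray", 0.19, 0.05),  # too little helix
]
selected = [rec.chain_id for rec in candidates if passes_filters(rec)]
```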

Assigning secondary structure

Secondary structure assignments were made to all proteins in the PDB dataset using two widely used algorithms, DSSP (Kabsch and Sander, 1983) and STRIDE (Frishman and Argos, 1995). DSSP assigns secondary structure primarily based on consecutive repeats of hydrogen bonding patterns between the carbonyl of residue i and the amide group of residue i + 4 (Kabsch and Sander, 1983). STRIDE combines the use of hydrogen bond energies with statistically derived backbone torsional angle information (Frishman and Argos, 1995). STRIDE defines an α-helix through the presence of at least two consecutive hydrogen bonds between residues i and i + 4 and combines this with torsional angle probabilities for these four residues. The inclusion of a geometric constraint means that many of the four-residue α-helices over-assigned by DSSP are assigned by STRIDE as turns or short 3₁₀-helices (Frishman and Argos, 1995). These helices often occur in peripheral loop regions and are not considered core secondary structure elements and are not of typical α-helix appearance (Frishman and Argos, 1995). For this reason, we took a consensus set of assigned α-helices: although DSSP is often the preferred choice for secondary structure assignment, it over-assigns four-residue α-helices (Colloc’h et al., 1993; Cuff and Barton, 1999). Further inspection of the data (not shown) supports this and shows that in many cases DSSP four-residue α-helices corresponded to either turns or 3₁₀-helices in STRIDE (Frishman and Argos, 1995). In addition, isolated helical residues were re-assigned with a coil classification (details available on request).

In prediction studies, the eight possible assigned DSSP states (including α-helix, 3₁₀-helix, β-turn, bend, β-bridge or loop) are reduced to three states: helix, strand or coil (Colloc’h et al., 1993; Rost and Sander, 1993; Frishman and Argos, 1995). For the purposes of defining α-helices, only the H states were considered since our investigations are specific to α-helices. The N-termini of α-helices are also structurally distinct from other secondary structure types and we were aiming to improve the secondary structure prediction with respect to α-helix N-termini. DSSP output was modified by manual inspection in a number of instances where unknown or unusual residues are labelled X or where exclamation marks represent possible chain breaks.

The choice of eight to three state reduction method is important since the two approaches under consideration, PHD and JPRED, use alternative methods and it could be argued that these differences lead to ‘artificially’ poor predictions at helical termini, particularly since it is well known that a significant number of α-helices contain a few 3₁₀-helical residues at their N-termini. However, we did not find this to be the case. We note that JPRED considers only DSSP ‘H’ states to be true α-helix and everything else (e.g. 3₁₀ ‘G’ states) is converted to coil, whilst PHD converts both ‘H’ and ‘G’ states to α-helix ‘H’. However, there are only 103 instances of Gs preceding Hs in the dataset out of 2228 α-helices (<5%). Furthermore, when comparing either JPRED or PHD predictions with three state definitions, using either method to assign helical states has a negligible effect on the prediction quality of helical N-caps.
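The two reduction schemes described above differ only in how the 3₁₀ ‘G’ state is treated, which a short sketch makes concrete. The function and scheme names below are hypothetical, and mapping only DSSP ‘E’ to strand is an assumption made for illustration; the text specifies the helix rule only.

```python
# DSSP one-letter codes: H = alpha-helix, G = 3-10 helix, E = extended strand.
# Which codes count as helix under each scheme (scheme names are illustrative).
HELIX_CODES = {
    "h_only": {"H"},        # JPRED-style: only H is helix, G becomes coil
    "h_and_g": {"H", "G"},  # PHD-style: both H and G become helix
}

def reduce_to_three_states(dssp, scheme):
    """Map a DSSP state string to helix (H), strand (E) or coil (C)."""
    helix = HELIX_CODES[scheme]
    out = []
    for code in dssp:
        if code in helix:
            out.append("H")
        elif code == "E":   # assumption: only E is treated as strand
            out.append("E")
        else:
            out.append("C")
    return "".join(out)

def count_g_before_h(dssp):
    """Count helices whose first H residue is immediately preceded by a G."""
    return sum(1 for i in range(1, len(dssp))
               if dssp[i] == "H" and dssp[i - 1] == "G")

example = "TTGGHHHHHHTTEEEE"
jpred_like = reduce_to_three_states(example, "h_only")
phd_like = reduce_to_three_states(example, "h_and_g")
```

In this toy string the helix start moves two residues towards the N-terminus under the second scheme, which is the kind of terminus shift the comparison in the text was designed to rule out as a source of error.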

Secondary structure prediction

JPRED (Cuff et al., 1998; http://jura.ebi.ac.uk:8008/jpred) and PHD (Rost and Sander, 1993; http://www.ebi.ac.uk/~rost/predictprotein) are both consensus servers based on neural networks, with reported accuracies of 72.9% (Cuff and Barton, 1999) and 71.9% (Rost and Sander, 1993), respectively. JPRED outputs a consensus secondary structure prediction based on a number of different prediction methods including JNET (Cuff and Barton, 2000). Secondary structure predictions from JPRED were restricted to JNET as this is a neural network-based system that takes profiles generated from PSI-BLAST along with profile-based hidden Markov models and outputs a three state prediction (helix, extended strand or coil). This restriction makes JPRED comparable to PHD for our purposes. PHD uses a system of neural networks to obtain a three state prediction (helix, extended strand or coil) and the secondary structure predicted was restricted to the output from PHD only. Both methods are reliant on underlying multiple alignments of related sequences to achieve their predictions.

All protein sequences were automatically submitted to JPRED and PHD. For PHD, the advanced submission form was used to submit sequences to PHDsec only, with the default outputs switched off. The alignment generated by PHD was not filtered prior to secondary structure prediction and a minimal column format output was returned. Submissions to JPRED involved selection of JNET only, bypassing the search PDB option to prevent the algorithm defaulting to the known structure in the PDB.

Defining positions within α-helices

The N-cap is the residue immediately preceding the N-terminal residue in an α-helix and the N' residue immediately precedes the N-cap residue. The N1 position is the first helical residue, N2 is the second, N3 is the third, N4 the fourth. All other helical residues that occur after the N4 position were classified as interior. This was necessary for two reasons:

  1. the study is aimed at predictions of residues at the N-termini of α-helices and the empirical data we have is for residue preferences at the N-terminus;
  2. residues at N4 and beyond are essentially interior residues. The N-termini of α-helices are more solvent exposed and are rarely involved with tertiary interactions; local interactions are the key influences (Doig et al., 1997; Penel et al., 1999b). Residues at the C-termini of α-helices are often involved with tertiary interactions and there is less empirical data available for residue preferences. Although C-cap free energies have been measured (Doig and Baldwin, 1995), these are all very similar and show less variation compared to the free energies of residues at the N-terminus.
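The position definitions above can be sketched as an index calculation on a three-state string. This is an illustrative sketch, not the code used in the study; the function names are hypothetical and indices are 0-based.

```python
def find_helices(states):
    """Return (start, end) index pairs, inclusive, for each run of 'H'."""
    helices, start = [], None
    for i, s in enumerate(states + "C"):  # sentinel closes a helix at chain end
        if s == "H" and start is None:
            start = i
        elif s != "H" and start is not None:
            helices.append((start, i - 1))
            start = None
    return helices

def label_n_terminus(start, end):
    """Label N', N-cap, N1..N4 and interior for a helix spanning start..end."""
    labels = {}
    if start - 2 >= 0:
        labels["N'"] = start - 2      # residue preceding the N-cap
    if start - 1 >= 0:
        labels["N-cap"] = start - 1   # residue preceding the first helical residue
    for k in range(1, 5):             # N1 is the first helical residue
        idx = start + k - 1
        if idx <= end:
            labels[f"N{k}"] = idx
    labels["interior"] = list(range(start + 4, end + 1))  # everything after N4
    return labels

states = "CCCHHHHHHHCC"
helices = find_helices(states)
labels = label_n_terminus(*helices[0])
```

Helices that begin at the first or second residue of a chain simply have no N-cap or N' entry in this sketch.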

Free energies of residues at the N-terminus of α-helices

Table I lists the experimentally determined free energies of substituting each residue for alanine at the N-cap (Doig and Baldwin, 1995), N1 (Cochran et al., 2001) and N2 (Cochran and Doig, 2001) positions of the α-helix. A value of 3 kcal/mol is assigned to residues where the authors quoted the energy as being too high to measure. Since these instances are rare, removing these residues from consideration made no difference to the trends seen (data not shown).


 
Table I. Free energy of residues at the N-terminal positions in α-helices

Definition of α-helix subsets

A number of α-helix subsets were identified in order to analyse and compare secondary structure predictions and assignments. A core set of α-helices was defined as α-helices assigned by DSSP and STRIDE that agree on start position, referred to as the core set throughout this paper. All predicted α-helices which overlapped with core set α-helices but had a different start position were deemed to match a true, core-set α-helix. Extra α-helices were defined as those predicted helices corresponding to a region outside the core set of α-helices, where no residue in the predicted helix matched to any in the core set. Missed α-helices were defined as the core set α-helices that either JPRED or PHD failed to recognize.
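The subset definitions above can be expressed as interval comparisons. The sketch below is illustrative only: helices are represented as hypothetical (start, end) index pairs within one chain, and the core set is given the DSSP end position, as stated later in the Results.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]

def core_set(dssp_helices, stride_helices):
    """DSSP helices whose start position is also a STRIDE helix start."""
    stride_starts = {start for start, _ in stride_helices}
    return [h for h in dssp_helices if h[0] in stride_starts]

def classify(predicted, core):
    """Sort predicted helices into agree/match/extra and list missed core helices."""
    result = {"agree": [], "match": [], "extra": [], "missed": []}
    for p in predicted:
        hits = [c for c in core if overlaps(p, c)]
        if not hits:
            result["extra"].append(p)      # no residue in common with the core set
        elif any(c[0] == p[0] for c in hits):
            result["agree"].append(p)      # overlaps and has the same start
        else:
            result["match"].append(p)      # overlaps but starts elsewhere
    result["missed"] = [c for c in core
                        if not any(overlaps(p, c) for p in predicted)]
    return result

# Toy helices for one chain, not taken from the data set.
dssp = [(3, 9), (15, 18), (25, 34), (60, 70)]
stride = [(3, 10), (25, 36), (40, 44), (60, 71)]
core = core_set(dssp, stride)
subsets = classify([(3, 8), (27, 33), (50, 55)], core)
```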

Comparison of predicted and assigned α-helices

The difference in start position was calculated for predicted α-helices matching to an assigned helix, relative to the core set of α-helices. Lengths of predicted α-helices were compared with the lengths of core set α-helices and categorized as favoured, disfavoured [according to the classification of Penel et al. (Penel et al., 1999b)] and other (all lengths outside the range 6–31) for a number of different subsets.

Residue frequencies were calculated for positions N', N-cap, N1, N2, N3, N4 and interior for all predicted and core set α-helices. The average free energy of a residue at the N-cap, N1 or N2 positions was calculated using the following equation:

$$\bar{E} = \frac{\sum_{i=1}^{20} N_i E_i}{N_{tot}}$$

where i is one of the 20 amino acids, $N_i$ is the number of instances of that amino acid at that position, $N_{tot}$ is the total number of amino acids at that position and $E_i$ is the free energy of that amino acid at that position. Although we used experimental free energies calculated in our laboratory, we would expect to see almost identical values using comparable scales such as those of Petukhov et al. (Petukhov et al., 1998, 1999).
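As a worked illustration of this count-weighted average, the sketch below computes it for a handful of residues. The energy values are toy numbers chosen for easy arithmetic and are not the values in Table I; the function name is hypothetical.

```python
from collections import Counter

def average_free_energy(residues, energies):
    """Count-weighted mean free energy of the residues seen at one position.

    residues: one-letter codes observed at the position, one per helix.
    energies: free energy (kcal/mol) of each amino acid at that position.
    """
    counts = Counter(residues)                 # N_i for each amino acid i
    n_tot = sum(counts.values())               # N_tot
    return sum(n * energies[aa] for aa, n in counts.items()) / n_tot

# Toy scale for illustration only; NOT the measured values.
toy_ncap = {"A": 0.0, "S": -1.0, "K": 0.5}

# Four helices with S, S, A and K at the N-cap:
# (2 * -1.0 + 1 * 0.0 + 1 * 0.5) / 4 = -0.375
mean_ncap = average_free_energy("SSAK", toy_ncap)
```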

The average cumulative free energy at the N-terminus of α-helices was calculated as follows: for each helix in the data set the cumulative free energy at the N-terminus was calculated by summing the free energy of the residues at positions N-cap, N1 and N2:

$$CFE_{helix} = E_{Ncap} + E_{N1} + E_{N2}$$

where $CFE_{helix}$ is the cumulative free energy of a helix, $E_{Ncap}$ is the free energy of the residue at the N-cap, $E_{N1}$ is the free energy of the residue at the N1 position and $E_{N2}$ is the free energy of the residue at the N2 position.

The average for the data set was calculated by summing all $CFE_{helix}$ in the data set and dividing by the number of helices in the data set. This was also calculated for a number of more specific α-helix subsets (for predicted α-helices that start 1 and 2 positions before core set α-helices and also for predicted α-helices that start 1 and 2 positions after core set α-helices).
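The cumulative free energy and its data set average can be sketched as below. The scales are toy values, not those of Table I, and all names are hypothetical. The last function is only one possible reading of the adjustment strategy suggested in the Abstract: it assumes that a lower cumulative free energy is more favourable and searches a ±2 residue window, neither of which is specified by the text.

```python
# Toy energy scales (kcal/mol) for illustration only; NOT the values in Table I.
SCALES = {
    "Ncap": {"A": 0.0, "S": -1.0, "K": 0.5, "E": 0.25},
    "N1":   {"A": 0.0, "S": -0.25, "K": 0.25, "E": -0.5},
    "N2":   {"A": 0.0, "S": 0.0, "K": 0.25, "E": -0.5},
}

def cfe(seq, n1, scales=SCALES):
    """Cumulative free energy for a helix whose first helical residue is seq[n1]."""
    if n1 < 1 or n1 + 1 >= len(seq):
        return None  # N-cap or N2 would fall outside the chain
    return (scales["Ncap"][seq[n1 - 1]]
            + scales["N1"][seq[n1]]
            + scales["N2"][seq[n1 + 1]])

def mean_cfe(helices, scales=SCALES):
    """Average CFE over (sequence, N1 index) pairs, skipping undefined cases."""
    values = [cfe(seq, n1, scales) for seq, n1 in helices]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)

def most_favourable_start(seq, predicted_n1, window=2, scales=SCALES):
    """Assumed strategy: pick the start within +/-window with the lowest CFE."""
    candidates = [(cfe(seq, s, scales), s)
                  for s in range(predicted_n1 - window, predicted_n1 + window + 1)]
    candidates = [(e, s) for e, s in candidates if e is not None]
    return min(candidates)[1]

seq = "KASEEAAK"
adjusted = most_favourable_start(seq, predicted_n1=4)
```

With these toy scales, a helix predicted to start at index 4 is moved to index 3, where the serine at the N-cap gives the lowest cumulative free energy in the window.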


    Results and discussion
 
Numbers of α-helices assigned/predicted

Considering the total number of α-helices assigned and predicted by the various algorithms, DSSP assigns the greatest number, followed by PHD then STRIDE and then JPRED (Table II). DSSP assigns a greater total number of α-helices than STRIDE because it has a tendency to over-assign α-helices that are four residues long (Colloc’h et al., 1993), whilst STRIDE assigns some of these as 3₁₀-helices or turns (Colloc’h et al., 1993; Frishman and Argos, 1995). Consideration of helical lengths also helps illustrate the disparity in numbers of α-helices assigned and predicted. The prediction methods predict more three-residue α-helices (Figure 1), with PHD predicting double the number of three-residue α-helices (6%) compared with JPRED (3%). There are only two cases of an assigned α-helix shorter than four residues; both are assigned by STRIDE, one a two-residue α-helix and the other a single-residue α-helix. This is somewhat unexpected because the STRIDE definition of α-helix requires hydrogen bonds between residues at i, i + 4 and i + 1, i + 5. The prediction methods also favour longer α-helices (Figure 1) and occasionally will predict an α-helix that spans more than one assigned α-helix (data not shown).


 
Table II. Number of α-helices assigned by DSSP and STRIDE or predicted by JPRED and PHD
 


 
Fig. 1. Distribution of α-helix lengths for all α-helices predicted by JPRED (dark grey) and PHD (black) and those assigned by DSSP (white) and STRIDE (light grey) that belong to the core set.

 
A total of 1725 α-helices agree on start position in DSSP and STRIDE, which represents 77 and 83% of the total numbers of α-helices assigned by the independent methods, respectively (referred to as the core set of α-helices, with C-terminal positions as assigned by DSSP). The assigned C-termini positions show more variation than the N-termini partly because there is less of a preference seen at the C-terminus for specific residues and the identity of the C-cap residue is relatively unimportant in comparison with the identity of the N-cap residue (Doig and Baldwin, 1995). It is also interesting that DSSP α-helices that agree on start position with α-helices assigned by STRIDE tend to terminate before their counterparts assigned by STRIDE. This is in agreement with previous work which showed that STRIDE tends to extend secondary structure elements with respect to the DSSP assignments (Frishman and Argos, 1995). DSSP often assigns residues flanking α-helices as 3₁₀-helices or turns, as the hydrogen bond patterns are intermediate between those of α-helical and 3₁₀-helical conformations (Bally and Delettre, 1989). STRIDE includes backbone geometry as a parameter in secondary structure assignment and assigns these as α-helical, thus extending the ends of the α-helix relative to DSSP.

Predicted α-helices agree on start position with these core sets in only 32% (JPRED) and 27% (PHD) of total predicted α-helices (Table II). JPRED predicts that 25% of α-helices are located outside helical regions from the core set (classified as extra compared with the core set), whereas this number rises slightly to 32% for PHD. Overall, PHD predicts that 16% of α-helices correspond to non-helical regions in the DSSP and STRIDE assignments and this is double the number predicted by JPRED (8%). JPRED fails to identify 16% of α-helices from the core set and this number drops to 13% for PHD (identified as the missed subset). In total, 40–43% of all predicted α-helices overlap core set α-helices (match subset), although the start position is frequently incorrectly predicted. Further classification of the start difference reveals that for α-helices not agreeing on start position, 45% (JPRED) and 38% (PHD) start within ±2 residues of a core α-helix (Table III).


 
Table III. Predicted α-helices that start within ±2 residues of α-helices in the core set

Differences in start positions of predicted α-helices compared with the core set of α-helices

Of the α-helices predicted by JPRED and PHD that match to α-helical regions in the core set, 44 and 40% agree with the core set start position, respectively (Table III). For start differences of ±1–4 residues, helices predicted by JPRED and PHD are more likely to start after the start of an assigned α-helix than they are to start before it (Figure 2). Figure 2 shows that for JPRED the total frequency for predicted helices starting before core set helices is 0.23 compared with 0.34 starting after, with a standard deviation of 0.01 estimated on these values. Similar, but slightly less striking, results were obtained for PHD. This agrees with the hypothesis that multiple sequence alignments improve the prediction of core regions of secondary structure, but are less able to identify fringe regions correctly as the helical signal becomes weaker on moving away from the helix centre; hence predicted helices are often too short.
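The before/after comparison can be sketched as a tally of signed start differences. The sketch below assumes the sign convention of Fig. 2 (positive means the predicted helix starts after the core set helix) and uses toy helix pairs; the function names are hypothetical.

```python
def start_differences(pairs):
    """Signed start difference for each (predicted, core) helix pair.

    Helices are inclusive (start, end) index pairs; a positive value means
    the predicted helix starts after the core set helix.
    """
    return [pred[0] - core[0] for pred, core in pairs]

def start_fractions(diffs, max_shift=4):
    """Fraction of helices starting before, exactly at, or after the core start."""
    n = len(diffs)
    return {
        "before": sum(1 for d in diffs if -max_shift <= d <= -1) / n,
        "exact": sum(1 for d in diffs if d == 0) / n,
        "after": sum(1 for d in diffs if 1 <= d <= max_shift) / n,
    }

# Toy (predicted, core) pairs, not taken from the data set.
pairs = [((5, 12), (3, 12)), ((10, 20), (10, 22)),
         ((29, 40), (30, 40)), ((51, 60), (50, 60))]
diffs = start_differences(pairs)
fractions = start_fractions(diffs)
```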



 
Fig. 2. Differences in start position of α-helices predicted by JPRED (filled) and PHD (unfilled) that match to α-helices in the core set. A positive difference in start position means that the predicted helix starts after the core set helix.

 
Lengths of assigned and predicted α-helices

Although the length dependence of α-helices has not been studied in great detail, a recent study by Penel et al. showed that α-helices tend to occur with a near-integral number of turns and that there is periodicity in favoured and disfavoured lengths (Penel et al., 1999b). This is a result of the favourable orientation of the N- and C-caps on the same side of the α-helix. Lengths were categorized as follows:

favoured: 6, 7, 10, 11, 13, 14, 17, 18, 21, 22, 24, 25, 28, 29, 31;

disfavoured: 8, 9, 12, 15, 16, 19, 20, 23, 26, 27, 30.

Their results were in agreement with previous work carried out on smaller data sets (Srinivisian, 1976; Barlow and Thornton, 1988; Kumar and Bansal, 1998b). In this work, we note that there is an increase in the frequency of predicted α-helix lengths of four residues or less, compared with helices assigned by DSSP and STRIDE, as well as a higher frequency of lengths >11 residues (Figure 1). This is reflected by the lower proportion of predicted α-helices with lengths in the range 5–10 residues (Figure 1) compared with the number of assigned α-helices.
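The favoured/disfavoured/other categorization listed above is a direct lookup, sketched here with hypothetical function names and toy helix lengths.

```python
from collections import Counter

# Length categories exactly as listed in the text.
FAVOURED = {6, 7, 10, 11, 13, 14, 17, 18, 21, 22, 24, 25, 28, 29, 31}
DISFAVOURED = {8, 9, 12, 15, 16, 19, 20, 23, 26, 27, 30}

def length_category(length):
    """Classify a helix length; anything outside 6-31 is 'other'."""
    if length in FAVOURED:
        return "favoured"
    if length in DISFAVOURED:
        return "disfavoured"
    return "other"

def category_proportions(lengths):
    """Proportion of helices falling in each category."""
    counts = Counter(length_category(n) for n in lengths)
    return {cat: counts[cat] / len(lengths)
            for cat in ("favoured", "disfavoured", "other")}

# Toy helix lengths, not taken from the data set.
proportions = category_proportions([3, 7, 11, 12])
```

The two sets together cover every length from 6 to 31 exactly once, so "other" is reached only for lengths outside that range.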

In general, predicted α-helices that correspond to α-helical regions in the core set have a similar proportion of favoured, disfavoured and ‘other’ (outside the 6–31 length range) α-helical lengths. In contrast, predicted α-helices occurring outside of the set of core α-helical regions have a much higher proportion of α-helices with ‘other’ lengths (data not shown). Predicted α-helices that correspond to non-helical regions in DSSP and STRIDE have a lower proportion of favoured lengths and a higher proportion of disfavoured and ‘other’ lengths and may be described as more ‘atypical’ of those most commonly found in protein structures.

The distribution of all α-helices predicted or assigned by each program was examined in a similar procedure to that of Penel et al. (Penel et al., 1999b), by fitting a fourth-order polynomial to the frequency distribution of α-helical lengths (Figure 3A and B). A residual value from the theoretical frequency distribution was calculated, enabling a direct comparison to be made between our results and those of Penel et al. (Penel et al., 1999b). Classification of favoured/disfavoured lengths was based on this smoothing. Favoured α-helix lengths appear more frequently and disfavoured α-helix lengths appear less frequently than the theoretical frequencies, for lengths corresponding to a discrete number of turns of the α-helix. Plots are shown for the core set (Figure 3A) and all α-helices predicted by JPRED (Figure 3B). To aid clarity, STRIDE and PHD data have been omitted but STRIDE mirrors the DSSP plot whilst PHD mirrors the JPRED plot.



 
Fig. 3. (A) Frequency distribution for all DSSP-assigned {alpha}-helices from the core set plotted as a function of {alpha}-helix length. Data are fitted to the equation $y = -0.0035x^{4} + 0.2998x^{3} - 8.7154x^{2} + 93.8565x - 194.4204$ and the inset shows the residual (observed frequency – calculated frequency) plotted against length (x-axis). (B) Frequency distribution for all {alpha}-helices predicted by JPRED plotted as a function of {alpha}-helix length. Data are fitted to the equation $y = -0.0069x^{4} + 0.5701x^{3} - 16.3516x^{2} + 180.9729x - 510.7330$. The dashed line is the calculated frequency of length distribution for {alpha}-helices in core set D (A).
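The smoothing and residual calculation can be illustrated with a minimal sketch. It evaluates the fourth-order polynomial using the coefficients given in the Fig. 3A legend and labels a length by the sign of its residual. The function names, the observed counts and the simple "residual above zero means favoured" rule are assumptions made for illustration; the exact classification rule used in the study is not stated here.

```python
# Coefficients of the fourth-order fit for DSSP core-set helices, taken from
# the Fig. 3A legend and listed from the x^4 term down to the constant.
DSSP_COEFFS = (-0.0035, 0.2998, -8.7154, 93.8565, -194.4204)


def fitted_frequency(length, coeffs=DSSP_COEFFS):
    """Evaluate the fitted polynomial at a helix length (Horner's rule)."""
    value = 0.0
    for c in coeffs:
        value = value * length + c
    return value


def classify_lengths(observed, coeffs=DSSP_COEFFS):
    """Map each length to (label, residual), residual = observed - fitted.

    Assumption: a positive residual is labelled 'favoured', anything else
    'disfavoured'.
    """
    labels = {}
    for length, count in observed.items():
        residual = count - fitted_frequency(length, coeffs)
        label = "favoured" if residual > 0 else "disfavoured"
        labels[length] = (label, residual)
    return labels


# Hypothetical observed counts, for illustration only (not the study's data).
example_counts = {10: 150, 11: 175, 12: 110}
for length, (label, residual) in classify_lengths(example_counts).items():
    print(length, label, round(residual, 1))
```

Passing the JPRED coefficients from the Fig. 3B legend as `coeffs` would give the corresponding residuals for predicted helices.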

 
There is a clear difference in the frequency distributions of {alpha}-helix lengths that are assigned (dashed line, Figure 3B) and predicted (solid line, Figure 3B). The inset in Figure 3A is in agreement with the work by Penel et al. (Penel et al., 1999b), with a small increase in the number of {alpha}-helices assigned with length 11 residues. The current study did not include extensions of $3_{10}$-helix at {alpha}-helical termini and this may account for the small differences, particularly for long {alpha}-helices. The peaks at favoured lengths show a 3.6 residue periodicity corresponding to an integral number of helical turns, placing the N- and C-termini on the same side of the {alpha}-helix. Predicted {alpha}-helices do not reflect this trend (Figure 3B) and have their modal length at a shorter value. Clearly, prediction methods do not accurately reflect the bias in helical lengths observed in real {alpha}-helices.

Residue frequencies at the N-termini of assigned and predicted {alpha}-helices

The distribution of amino acids at the N-termini of {alpha}-helices predicted by JPRED and PHD is different from the characteristic pattern seen at the N-termini of {alpha}-helices in the core set. Figure 4A shows a lower frequency of favourable N-cap residues (serine, asparagine, aspartate and threonine) in predicted {alpha}-helices compared with the core set. This results from the high number of predicted {alpha}-helices with incorrect start positions, which reduces the frequency with which real N-capping residues are placed at the predicted N-cap. A good example is seen in the case of proline. The increased frequency of proline at the N-caps of predicted {alpha}-helices probably reflects its preference for the N1 position in {alpha}-helices and the frequency at which the prediction programs mispredict the helical start by one position. If predicted {alpha}-helices start one residue after core set {alpha}-helices, the residue distribution at the predicted N-cap will reflect the residue distribution of the real N1 position. This also explains why there is a slightly higher proportion of the poor N-capping residue lysine at the N-caps of predicted {alpha}-helices. Lysine is a poor N-cap residue because partial positive charges are often associated with residues at N1, N2 and N3 positions (such as lysine itself) and an N-cap lysine would probably introduce unfavourable electrostatic interactions. Conversely, if predicted {alpha}-helices start one residue before core {alpha}-helices, the residue distribution for predicted N-caps will reflect that of the N' position. This is exemplified by the increased frequency at predicted N-caps of leucine, which is found relatively often at the N' position in real {alpha}-helices (Richardson and Richardson, 1988).



Fig. 4. Residue frequencies at {alpha}-helix N-termini defined as the number of amino acids of a given type at a helix sub-site divided by the total number of amino acids at that site. (A) N-cap of all {alpha}-helices predicted by JPRED (grey), PHD (black) and all core set {alpha}-helices (white). (B) N' position for {alpha}-helices predicted by JPRED (filled) that start one residue after core set {alpha}-helices (unfilled). (C) N-cap position for {alpha}-helices predicted by JPRED (filled) that start one residue after core set {alpha}-helices (unfilled).

 
Figure 4B and C further support these observations. Although data are shown only for {alpha}-helices predicted by JPRED, a similar trend is seen for {alpha}-helices predicted by PHD. The distribution of residues at the N' position for predicted {alpha}-helices starting one residue after core set {alpha}-helices is more representative of the typical N-cap distribution (Figure 4B), including a high proportion of classical N-capping residues (serine, aspartic acid, glycine, asparagine and threonine). Similarly, leucines are predicted with a low frequency at N' in this set of mispredicted {alpha}-helices, despite the fact that they are the most abundant at N' in the core set {alpha}-helices (Figure 4B) and typically participate in hydrophobic interactions that contribute to {alpha}-helix stability (Aurora and Rose, 1998b). Equally, the distribution of residues at the N-caps of predicted {alpha}-helices that start one residue after core set {alpha}-helices is more representative of that seen at N1 (Figure 4C), typified by the proline frequencies.

Equivalent trends are shown by predicted {alpha}-helices that start two residues after core set {alpha}-helices (data not shown), except that the distribution at the N-cap now reflects N2 positional preferences, with the N-cap distribution seen at the N'' position. In fact, this trend appears to be general: predicted helices that start one or two positions away from the true {alpha}-helical start, whether before or after it, show residue preferences most similar to those of the true position onto which each predicted site falls. This suggests that there are strong structural preferences that are maintained in these mispredicted helices that could be exploited to improve prediction quality.
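The effect of a shifted start on the observed residue distribution can be sketched as follows, using the frequency definition from the Fig. 4 legend (count of a residue type at a site divided by the total count at that site). The function name, the sequences and the N-cap indices are hypothetical and serve only to show that a prediction starting one residue late samples the true N1 site.

```python
from collections import Counter


def site_frequencies(sequences, ncap_indices, offset=0):
    """Fraction of each residue type at the site `offset` from the N-cap.

    offset=0 is the N-cap, +1 is N1 and -1 is N'. Helices whose site falls
    outside the sequence are skipped.
    """
    counts = Counter()
    for seq, ncap in zip(sequences, ncap_indices):
        site = ncap + offset
        if 0 <= site < len(seq):
            counts[seq[site]] += 1
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


# Hypothetical sequences, each with its true N-cap at index 2 (S, D and T).
sequences = ["GLSPEELAK", "ALDPATVKE", "KLTAEEWAR"]
true_ncaps = [2, 2, 2]

# Predicted N-caps that start one residue after the true N-cap.
late_ncaps = [i + 1 for i in true_ncaps]

predicted_ncap_profile = site_frequencies(sequences, late_ncaps)
true_n1_profile = site_frequencies(sequences, true_ncaps, offset=1)

# The predicted N-cap profile is identical to the true N1 profile.
print(predicted_ncap_profile == true_n1_profile)
```

The same call with `offset=-1` on the late predictions recovers the true N-cap residues at the predicted N' site, which is the pattern described for Figure 4B.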

Average free energies at the N-termini of assigned and predicted {alpha}-helices

Using the empirically determined free energies listed in Table I, the average free energies at the N-cap, N1 and N2 positions in assigned and predicted {alpha}-helices were calculated (Table IV). The N-cap position shows the strongest trend, reflecting the fact that it has the greatest range of free energy values (Table I). The average free energy of N-cap residues in all {alpha}-helices is higher for predicted {alpha}-helices than for assigned {alpha}-helices, suggesting that, on average, predicted {alpha}-helices have less energetically favourable residues at this position. Few obvious trends are noted at the N1 position, but the prediction programs generally assign more energetically favourable residues at the N2 position. This can be rationalized by differences in the start positions of predicted and assigned {alpha}-helices. If the predicted {alpha}-helix is out by one or two residues and an N-cap, N1 or N3 residue is placed at the N2 position, it is unlikely to have a detrimental effect on the free energy value because the preferred residue types at these positions are similar (Table I). These results indicate that JPRED and PHD are systematically making energetically unfavourable predictions for helix N-termini.


Table IV. Average free energies at the N-termini of assigned and predicted {alpha}-helices
 
Interestingly, the core set has the lowest average free energies at each individual position. When the prediction programs correctly predict these {alpha}-helices they actually have more energetically favourable residues at N-cap, N1 and N2 (lower average free energies). This set corresponds to roughly one-third of all assigned helices and suggests that for a predicted {alpha}-helix to have the correct start position there has to be a very clear signal. When the two assignment programs disagree on the start position of an {alpha}-helix, the assigned helices have less energetically favourable N-caps; a similar trend is seen at N2 and it is less noticeable at N1. The fact that the two assignment programs fail to agree suggests that these {alpha}-helices represent the more unusual {alpha}-helices which may be more distorted or have different hydrogen bond patterns. Certainly, it is expected that at least one of the two assignment programs must be out of register with the ‘true’ helical assignment in these cases, placing less favourable amino acids in the helical start positions in half of the cases. Similarly, all {alpha}-helices assigned and predicted that occur outside the core set have higher average free energies at all positions, and therefore they are less energetically stable. Interestingly, {alpha}-helices from the core set that are not predicted by JPRED and PHD have low average free energies for N-cap, N1 and N2 positions. This suggests that they have favourable residues at these positions, which could be used to improve prediction quality.
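A minimal sketch of the per-position averaging is given below. The energy values are invented placeholders and are not the scale of Table I; the function name, sequences and N-cap indices are likewise hypothetical. Skipping residues that are missing from the table is an assumption of this sketch.

```python
from statistics import mean

# Placeholder free energies in kcal/mol (lower = more favourable).
# These numbers are invented for illustration and are NOT those of Table I.
PLACEHOLDER_ENERGIES = {
    "N-cap": {"S": -1.0, "D": -1.2, "T": -0.8, "P": 0.5, "L": 0.6, "A": 0.0},
    "N1": {"P": -0.6, "E": -0.4, "A": 0.0, "S": 0.1, "D": 0.0, "T": 0.2},
    "N2": {"E": -0.5, "D": -0.4, "A": 0.0, "P": 0.3, "T": 0.1, "L": 0.2},
}
SITE_OFFSETS = {"N-cap": 0, "N1": 1, "N2": 2}


def average_site_energies(helices, energies=PLACEHOLDER_ENERGIES):
    """Mean free energy at N-cap, N1 and N2 over (sequence, ncap_index) pairs.

    Sites outside the sequence and residues absent from the table are skipped;
    a site with no usable residues is reported as None.
    """
    averages = {}
    for site, offset in SITE_OFFSETS.items():
        values = []
        for seq, ncap in helices:
            i = ncap + offset
            if 0 <= i < len(seq) and seq[i] in energies[site]:
                values.append(energies[site][seq[i]])
        averages[site] = mean(values) if values else None
    return averages


# Hypothetical helices: the same sequences with assigned and predicted N-caps.
assigned = [("GLSPEELAK", 2), ("ALDPATVKE", 2)]
predicted = [("GLSPEELAK", 3), ("ALDPATVKE", 1)]

print("assigned:", average_site_energies(assigned))
print("predicted:", average_site_energies(predicted))
```

With these placeholder numbers the misplaced predicted starts give a higher (less favourable) average at the N-cap, which is the kind of comparison summarised in Table IV.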

Cumulative free energy at the N-termini of assigned and predicted {alpha}-helices

The trends seen in the average free energies at the individual N-cap, N1 and N2 positions are mirrored when looking at the cumulative free energy of residues at these positions (Table V). The average cumulative free energy at the N-termini of all assigned {alpha}-helices is 0.1 kcal/mol, increasing to 0.2 kcal/mol for all predicted {alpha}-helices. This implies that the N-termini of assigned {alpha}-helices are locally energetically more favourable than the N-termini of predicted {alpha}-helices. Again, the values for all assigned and predicted {alpha}-helices that agree on start position are lower than for any other category, with the prediction programs assigning more energetically favourable residues at these positions. This supports the previous observation that a clear signal is needed to enable JPRED or PHD to identify correctly the start position of an {alpha}-helix. Given that the prediction programs correctly predict the starts of only around one-third of the core set, there are still two-thirds of the data set which are mispredicted but may have good capping motifs.


Table V. Average cumulative free energy at the N-termini of assigned and predicted {alpha}-helices
 
The effect of incorrectly assigning the start position in predicted {alpha}-helices is visible when comparing the average cumulative free energy of predicted {alpha}-helices that match core set helices with that of the core set helices themselves (0.4 and 0.1 kcal/mol, respectively). This indicates that the residues at the N-termini of these predicted {alpha}-helices are less energetically favourable than those at the N-termini in the core set. This becomes even more apparent when considering predicted {alpha}-helices that start up to two residues before or after {alpha}-helices in the core set (Table VI). In all cases but one, the cumulative free energy at the N-terminus is noticeably lower for the true core set {alpha}-helices than for their predicted counterparts. The exception is for PHD predictions that start two residues after the true helix. In these cases, the true N2 residue is predicted as N-cap, N3 as N1 and so on. These transitions are less deleterious to the total free energy since many good or acceptable N3 residues are good N1 residues (Penel et al., 1999b).


Table VI. Average cumulative free energy at the N-termini of predicted {alpha}-helices that start within ±2 residues of the core set
 
The distribution of cumulative free energies at the N-termini of predicted {alpha}-helices is distinct from the distribution seen for {alpha}-helices in the core set (Figure 5). Core set {alpha}-helices have a higher proportion of N-termini with lower cumulative free energies (and are therefore more energetically favourable).



Fig. 5. Distribution of cumulative free energies (kcal/mol) at the N-cap, N1 and N2 positions for all {alpha}-helices in the core set (black) and all {alpha}-helices predicted by JPRED (grey) or PHD (white).
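A distribution of the kind plotted in Fig. 5 could be built roughly as sketched below: sum the N-cap, N1 and N2 energies for each helix, then count helices per energy bin. The energy values, the zero default for residues missing from the table, the bin width and all names are assumptions made for illustration and are not taken from the study.

```python
import math
from collections import Counter

# Placeholder per-site energies in kcal/mol, invented for illustration only.
PLACEHOLDER_ENERGIES = (
    {"S": -1.0, "D": -1.2, "T": -0.8, "P": 0.5, "L": 0.6},  # N-cap
    {"P": -0.6, "E": -0.4, "D": 0.0, "A": 0.0},             # N1
    {"E": -0.5, "A": 0.0, "P": 0.3},                        # N2
)
DEFAULT_ENERGY = 0.0  # assumed score for residues missing from the table


def cumulative_energy(seq, ncap, energies=PLACEHOLDER_ENERGIES):
    """Sum the N-cap, N1 and N2 energies for a helix with N-cap at `ncap`."""
    if ncap < 0 or ncap + len(energies) > len(seq):
        raise ValueError("N-cap, N1 and N2 must all lie inside the sequence")
    return sum(
        table.get(seq[ncap + k], DEFAULT_ENERGY)
        for k, table in enumerate(energies)
    )


def energy_histogram(helices, bin_width=0.5):
    """Count helices per energy bin, keyed by the lower edge of each bin."""
    bins = Counter()
    for seq, ncap in helices:
        total = cumulative_energy(seq, ncap)
        bins[math.floor(total / bin_width) * bin_width] += 1
    return dict(sorted(bins.items()))


# Hypothetical (sequence, ncap_index) pairs.
core_set = [("GLSPEELAK", 2), ("ALDPATVKE", 2)]
print(energy_histogram(core_set))
```

Running the same function on predicted start positions and comparing the two histograms bin by bin would reproduce the comparison in the figure, given the measured energy scale.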

 
Conclusions

The aim of this work was to identify areas in which improvements in secondary structure prediction should be targeted. Prediction methods are in general very good at predicting helices that overlap with a real helix, with relatively few complete misses or extra helices. The length distribution of predicted helices differs only a little from that of real helices, with the predicted mean helix length being too long. Where predictions show a high failure rate, however, is in identification of the helix N-terminus, with only one-third of JPRED and PHD predictions correctly finding the N-cap residue. This is presumably a result of using multiple aligned families of protein sequences to learn the statistics ultimately used in the prediction process, as the central cores of helices are better conserved. Analysis of neighbouring N-terminal sequences and calculation of their energies show that the correct N-terminus is frequently only one or two residues away. This suggests that a powerful strategy to improve secondary structure prediction would be to adjust the PHD or JPRED output to find a more favourable N-terminal sequence, using the empirical capping free energies as a guide. We are currently evaluating this approach.
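One way the proposed adjustment might look is sketched below; it is an illustration under stated assumptions, not the authors' implementation. Each candidate N-cap within two residues of the predicted one is scored by its cumulative N-cap, N1 and N2 energy and the lowest-scoring candidate is kept. The energy values are invented placeholders, and the names, the zero default for unlisted residues and the tie-breaking rule are hypothetical.

```python
# Placeholder per-site energies in kcal/mol (lower = more favourable).
# Invented for illustration; a real run would use the measured scale (Table I).
PLACEHOLDER_ENERGIES = (
    {"S": -1.0, "D": -1.2, "T": -0.8, "P": 0.5, "L": 0.6},  # N-cap
    {"P": -0.6, "E": -0.4, "D": 0.0, "A": 0.0},             # N1
    {"E": -0.5, "A": 0.0, "P": 0.3},                        # N2
)
DEFAULT_ENERGY = 0.0  # assumed score for residues missing from the table


def nterm_energy(seq, ncap):
    """Cumulative N-cap + N1 + N2 energy for an N-cap at index `ncap`."""
    return sum(
        table.get(seq[ncap + k], DEFAULT_ENERGY)
        for k, table in enumerate(PLACEHOLDER_ENERGIES)
    )


def refine_ncap(seq, predicted_ncap, max_shift=2):
    """Return the N-cap index within +/- max_shift with the lowest energy.

    Ties are broken in favour of the smallest move from the prediction. If no
    candidate fits inside the sequence, the prediction is returned unchanged.
    """
    candidates = [
        i
        for i in range(predicted_ncap - max_shift, predicted_ncap + max_shift + 1)
        if 0 <= i and i + 2 < len(seq)
    ]
    if not candidates:
        return predicted_ncap
    return min(
        candidates,
        key=lambda i: (nterm_energy(seq, i), abs(i - predicted_ncap)),
    )


# Hypothetical sequence whose best-scoring N-cap (S at index 2) lies one
# residue before the predicted N-cap at index 3.
print(refine_ncap("GLSPEELAK", predicted_ncap=3))
```

Restricting the search to two residues either side follows the observation that the correct N-terminus is frequently only one or two residues away; a wider window would risk moving the start onto an unrelated favourable motif.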


    Notes
 
1 To whom correspondence should be addressed. E-mail: simon.hubbard{at}umist.ac.uk


    Acknowledgments
 
We acknowledge support from the BBSRC in the form of a studentship to C.L.W.


    References
 
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) J. Mol. Biol., 215, 403–410.

Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.

Argos,P. and Palau,J. (1982) Protein Res., 19, 380–393.

Aurora,R. and Rose,G.D. (1998a) Protein Sci., 7, 21–38.

Aurora,R. and Rose,G.D. (1998b) Proc. Natl Acad. Sci. USA, 95, 2818–2823.

Bally,R. and Delettre,J. (1989) J. Mol. Biol., 206, 153–170.

Barlow,D.J. and Thornton,J.M. (1988) J. Mol. Biol., 201, 601–619.

Benner,S.A. and Gerloff,D. (1991) Adv. Enzyme Regul., 31, 121–181.

Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235–242.

Biou,V., Gibrat,J.F., Robson,B. and Garnier,J. (1988) Protein Eng., 2, 185–191.

Chandonia,J.M. and Karplus,M. (1999) Proteins: Struct. Funct. Genet., 35, 293–306.

Chou,P.Y. and Fasman,G.D. (1974) Biochemistry, 13, 212–222.

Cochran,D.A.E. and Doig,A.J. (2001) Protein Sci., 10, 1305–1311.

Cochran,D.A.E., Penel,S. and Doig,A.J. (2001) Protein Sci., 10, 463–470.

Colloc’h,N., Etchebest,C., Thoreau,E., Henrissat,B. and Mornon,J.-P. (1993) Protein Eng., 6, 377–382.

Cuff,J.A. and Barton,G.J. (2000) Proteins: Struct. Funct. Genet., 40, 502–511.

Cuff,J.A. and Barton,G.J. (1999) Proteins: Struct. Funct. Genet., 34, 508–519.

Cuff,J.A., Clamp,M., Siddiqui,A., Finlay,M. and Barton,G.J. (1998) Bioinformatics, 14, 892–893.

Doig,A.J. and Baldwin,R.L. (1995) Protein Sci., 4, 1325–1336.

Doig,A.J., MacArthur,M.W., Stapley,B.J. and Thornton,J.M. (1997) Protein Sci., 6, 147–155.

Fischer,D. and Eisenberg,D. (1997) Bioinformatics, 15, 759–762.

Flores,T.P., Orengo,C.A., Moss,D.S. and Thornton,J.M. (1993) Protein Sci., 2, 1811–1826.

Frishman,D. and Argos,P. (1995) Proteins: Struct. Funct. Genet., 23, 566–579.

Garnier,J., Osguthorpe,D.J. and Robson,B. (1978) J. Mol. Biol., 120, 97–120.

Gibrat,J.-F., Garnier,J. and Robson,B. (1987) J. Mol. Biol., 198, 425–443.

Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417.

Holm,L. and Sander,C. (1994) Proteins: Struct. Funct. Genet., 19, 165–173.

Holm,L. and Sander,C. (1995) Trends Biochem. Sci., 20, 478–480.

Jones,D.T. (1999) J. Mol. Biol., 292, 195–202.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

King,R.D. (1996) In Sternberg,M.J.E. (ed.), Protein Structure Prediction: A Practical Approach. Oxford University Press, Oxford, pp. 79–97.

King,R.D. and Sternberg,M.J.E. (1990) J. Mol. Biol., 216, 441–457.

King,R.D. and Sternberg,M.J.E. (1996) Protein Sci., 5, 2298–2310.

King,R.D., Ouali,M., Strong,A.T., Aly,A., Elmagrhaby,A., Kantardzic,M. and Page,D. (2000) Protein Eng., 13, 15–19.

Krogh,A., Brown,M., Mian,I.S., Sjolander,K. and Haussler,D. (1994) J. Mol. Biol., 235, 1501–1531.

Kumar,S. and Bansal,M. (1998a) Proteins: Struct. Funct. Genet., 31, 460–476.

Kumar,S. and Bansal,M. (1998b) Biophys. J., 75, 1935–1944.

Lesk,A.M. (1997) Proteins: Struct. Funct. Genet., S1, 151–166.

Levin,J.M. (1997) Protein Eng., 10, 771–776.

Levin,J.M. and Garnier,J. (1988) Biochim. Biophys. Acta, 955, 283–295.

Lim,V.I. (1974) J. Mol. Biol., 88, 973–894.

Maxfield,F.R. and Scheraga,H.A. (1979) Biochemistry, 18, 697–704.

McGuffin,L.J., Bryson,K. and Jones,D.T. (2001) Bioinformatics, 17, 63–72.

Orengo,C.A. (1994) Curr. Opin. Struct. Biol., 4, 429–440.

Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.

Ouali,M. and King,R.D. (2000) Protein Sci., 9, 1162–1176.

Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444–2448.

Penel,S., Hughes,E. and Doig,A.J. (1999a) J. Mol. Biol., 287, 127–143.

Penel,S., Morrison,R.G., Mortishire-Smith,R.J. and Doig,A.J. (1999b) J. Mol. Biol., 293, 1211–1219.

Petukhov,M., Munoz,V., Yumoto,N., Yoshikawa,S. and Serrano,L. (1998) J. Mol. Biol., 278, 279–289.

Petukhov,M., Uegaki,K., Yumoto,N., Yoshikawa,S. and Serrano,L. (1999) Protein Sci., 8, 2144–2150.

Richardson,J. and Richardson,D. (1988) Science, 240, 1648–1652.

Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584–599.

Rost,B., Sander,C. and Schneider,R. (1994) J. Mol. Biol., 235, 13–26.

Rost,B., Schneider,R. and Sander,C. (1997) J. Mol. Biol., 270, 471–480.

Russell,R.B. and Barton,G.J. (1993) J. Mol. Biol., 234, 951–957.

Russell,R.B., Copley,R.R. and Barton,G.J. (1996) J. Mol. Biol., 259, 349–365.

Salamov,A.A. and Solovyev,V.V. (1995) J. Mol. Biol., 247, 11–15.

Salamov,A.A. and Solovyev,V.V. (1997) J. Mol. Biol., 268, 31–36.

Schulz,G.E. and Schirmer,R.H. (1978) Principles of Protein Structure. Springer, Berlin.

Shapiro,L. and Harris,T. (2000) Curr. Opin. Struct. Biol., 11, 31–35.

Shi,J., Blundell,T.L. and Mizuguchi,K. (2001) J. Mol. Biol., 310, 243–257.

Srinivisian,R. (1976) Indian J. Biochem. Biophys., 13, 192–193.

Stawiski,E.W., Baucom,A.E., Lohr,S.C. and Gregoret,L.M. (2000) Proc. Natl Acad. Sci. USA, 97, 3954–3958.

Xu,H., Aurora,R., Rose,G.D. and White,R.H. (1999) Nature Struct. Biol., 6, 750–754.

Yi,T.-M. and Lander,E.S. (1993) J. Mol. Biol., 232, 1117–1129.

Zvelebil,M.J., Barton,G.J., Taylor,W.R. and Sternberg,M.J.E. (1987) J. Mol. Biol., 195, 957–961.

Received August 2, 2001; revised March 15, 2002; accepted March 22, 2002.




