An approach to improving multiple alignments of protein sequences using predicted secondary structure

Andrew J. Jennings1,2, Colin M. Edge1 and Michael J.E. Sternberg3

1 Discovery Chemistry, SmithKline Beecham Pharmaceuticals, New Frontiers Science Park, Third Avenue, Harlow, Essex CM19 5AW and 3 Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, Lincoln's Inn Fields, P.O. Box 123, London WC2A 3PX, UK


    Abstract
 
The object of this work was to improve multiple sequence alignments using public-domain software and methods as far as possible. A method is described in which the secondary structure of proteins is predicted and this information, coupled with a simplified description of the amino acids, is used to produce multiple sequence alignments. This method improved the accuracy of the resulting alignments by between 5 and 14% when compared with full sequence profile alignments (as scored against structural alignments). These improved alignments were used to predict the secondary structure of the sequences they contain. The resultant predictions were more accurate than those produced from less optimal alignments. An improvement of 6% for a three-state (helix, sheet and coil) prediction was observed when the best alignment from the method presented here was used in place of the alignment obtained using sequence only. The method makes use of public-domain software and all the associated files required to repeat the work are available from the primary author.

Keywords: alignment/predicted/sequence/structure


    Introduction
 
One of the most important techniques in bioinformatics and homology modelling is the alignment of multiple protein sequences. Conserved residues or patterns allow the scientist to infer the structure and/or function of a protein or family of proteins. The importance of the alignment for modelling structures by homology has been exemplified by the results from the CASP2 (Marchler-Bauer and Bryant, 1997) and CASP3 (Sternberg et al., 1999; http://PredictionCenter.llnl.gov/casp3/; and Proteins, 1997, Suppl., 1–230) evaluations.

Methods of aligning protein sequences [such as ClustalW (Thompson et al., 1994), the HMMER package (Eddy, http://hmmer.wustl.edu) for Hidden Markov Models (Eddy, 1996) and Psi-Blast (Altschul et al., 1997)] tend to rely upon the amino acid types themselves and do not include other information which may be available. This approach works well when the proteins in question are closely related but breaks down as the sequence similarity decreases. Where the sequence similarities approach the 'twilight' range of below 30% (Rost, 1999) the resulting alignments are generally poor, hence additional information might be expected to aid the alignment process.

One choice of additional information would be secondary structure, as much work has concentrated on designing and improving prediction algorithms (Chou and Fasman, 1974; Garnier et al., 1978; Zvelebil et al., 1987; Biou et al., 1988; Luthy et al., 1991; Rost and Sander, 1993; Mehta et al., 1995; King and Sternberg, 1996; Lemer et al., 1996; Frishman and Argos, 1997). The problem with this choice is that current secondary structure prediction methods such as PSIPRED (Jones, http://globin.bio.warwick.ac.uk/psipred/), used by Jones at CASP3, appear to peak at around 77% accuracy for a three-state prediction. Three-state accuracy is a measure of how accurately helix, sheet and coil are predicted, expressed as a percentage of the known secondary structure. If the sequence and predicted secondary structure information could be combined, it may be possible to overcome the flaws of one source by augmenting it with information from the other.

It was considered important that the computational tools employed in this work were readily available in the public domain and that the implementation should be within the grasp of scientists in the area.


    Materials and methods
 
Overview

This work required protein superfamilies composed of at least two families, with each family containing more than one member. A superfamily is defined as being a set of proteins which are related by homology. A family is defined as a set of proteins which are related by sequence identity to form a distinct group within the protein superfamily. The SCOP (Murzin et al., 1995) database was used to select superfamilies to be used so as to enable all results to be tested against the experimentally determined data. Test sets were chosen from across all four fold classes (as defined by the SCOP database): A, B, A + B, A/B (for the list of those families chosen, see Table I). Each family within a superfamily was aligned using ClustalW and each of these alignments was converted to a form which incorporated predicted secondary structure information. These new alignments were then aligned against one another using a custom matrix and the alignments produced converted back into the correct amino acid alphabet. The alignments were scored versus a structural alignment and also used as input to DSC so that further secondary structure predictions could be made.
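The final back-conversion step in this overview can be illustrated with a minimal Python sketch. It assumes that each encoded row has the same ungapped length as the original sequence and that gaps are written as '-'. The function name restore_residues is hypothetical and is not part of ClustalW or of any package used in this work.

```python
def restore_residues(aligned_encoded: str, original: str, gap: str = "-") -> str:
    """Thread the original residues back onto the gap pattern of an encoded row.

    aligned_encoded: one row of the alignment built from the encoded alphabet.
    original: the ungapped amino acid sequence that the row was derived from.
    """
    n_symbols = sum(1 for symbol in aligned_encoded if symbol != gap)
    if n_symbols != len(original):
        raise ValueError("encoded row and original sequence differ in length")
    residues = iter(original)
    # Keep every gap in place and substitute each encoded symbol, in order,
    # with the residue it originally stood for.
    return "".join(gap if symbol == gap else next(residues)
                   for symbol in aligned_encoded)


print(restore_residues("AB--CD", "MKVL"))
```

Because only the gap pattern is taken from the encoded alignment, the restored row is independent of how the encoded symbols were chosen.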


Table I. The 14 SCOP protein superfamilies considered in this work
 
Alignment and prediction programs

Secondary structure predictions were carried out using the program DSC (King and Sternberg, 1996), multiple sequence alignments by the program ClustalW (Thompson et al., 1994) and the Hidden Markov Model work using the HMMER2 suite of programs (Eddy, 1998; http://hmmer.wustl.edu).

Alignment benchmarks

Structural alignments of the proteins under examination were generated by the STAMP program (Russell and Barton, 1992). When comparing any of the multiple sequence alignments generated with the structural alignments, only those regions of structural equivalence as identified by STAMP were examined. Many measures of alignment similarity were investigated with the most useful being deemed to be the sum of all correct pairs in the query alignment divided by the maximum possible correct pairs (as identified from the STAMP structural alignment).
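One way to compute such a pair-based score is sketched below in Python. This is an illustration under stated assumptions rather than the scoring code used in this work: both alignments are given as equal-length gapped strings in the same sequence order, gaps are '-', and the structurally equivalent regions are supplied as a set of reference column indices. The names aligned_pairs and pair_score are hypothetical.

```python
from itertools import combinations


def aligned_pairs(rows, columns=None, gap="-"):
    """Collect residue pairs (seq i, position in i, seq j, position in j)
    that share a column; if columns is given, only those columns count."""
    counters = [0] * len(rows)          # residues consumed so far per sequence
    pairs = set()
    for col in range(len(rows[0])):
        present = []
        for i, row in enumerate(rows):
            if row[col] != gap:
                present.append((i, counters[i]))
                counters[i] += 1
        if columns is None or col in columns:
            for (i, p), (j, q) in combinations(present, 2):
                pairs.add((i, p, j, q))
    return pairs


def pair_score(query_rows, reference_rows, reference_columns=None):
    """Fraction of reference residue pairs reproduced by the query alignment."""
    reference = aligned_pairs(reference_rows, reference_columns)
    if not reference:
        raise ValueError("reference alignment contains no aligned pairs")
    query = aligned_pairs(query_rows)
    return len(query & reference) / len(reference)


reference = ["AC-D", "ACED"]
query = ["ACD-", "ACED"]
print(pair_score(query, reference))
```

In this toy case two of the three reference pairs are reproduced by the query alignment, so the score lies between zero (no agreement) and one (exact agreement).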

For the purposes of this work, all areas of the secondary structure predictions were examined. The accuracy can be thought of in two ways. If the type of regular secondary structure is ignored and only its position considered, one can measure how well the prediction algorithms detect where there is coil and where there is not (coil being irregular and non-periodic), i.e. two-state accuracy. In addition to this positional score, one may also consider whether the type of periodic/regular secondary structure (i.e. helix or sheet) at these positions of non-coil is predicted correctly (i.e. three-state accuracy).
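The two measures can be written down compactly. The Python sketch below assumes per-residue scoring over strings of H (helix), E (sheet) and C (coil) labels of equal length; the function names are hypothetical, and the collapsed label R for regular structure is introduced only for illustration.

```python
def three_state_accuracy(predicted: str, observed: str) -> float:
    """Percentage of positions whose predicted label equals the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("label strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def two_state_accuracy(predicted: str, observed: str) -> float:
    """Same measure after merging helix and sheet into one regular state (R)."""
    def collapse(labels: str) -> str:
        return "".join("C" if label == "C" else "R" for label in labels)
    return three_state_accuracy(collapse(predicted), collapse(observed))


predicted = "HHHEECCC"
observed = "HHEEECCH"
print(three_state_accuracy(predicted, observed))  # 75.0
print(two_state_accuracy(predicted, observed))    # 87.5
```

The third position is predicted as helix but observed as sheet, so it counts as an error in the three-state measure and as correct in the two-state measure.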

Simplification schemes

Simplification schemes have been proposed in the past as a means of grouping together amino acids possessing similar properties. Three simplification schemes were chosen, two from the literature and one a personally devised (PD) scheme.

Scheme 1: Taylor scheme (Taylor, 1986)

AGS, CP, DE, ILV, KMNQRT, FHWY

Scheme 2: Smith scheme (Smith and Smith, 1990)

DE, KRH, NQ, ST, ILV, FWY, C, M, AG, P

Scheme 3: PD scheme

AMLVI (lipophilic), GP (initiating/terminating), HWFY (aromatic), KDRE (charged), QNST (polar), C (disulphide bridge forming)
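The three schemes listed above can be held as simple lookup tables. The following Python sketch copies the groups as given; the names SCHEMES and group_index are hypothetical and serve only to show that each scheme partitions the 20 standard amino acids.

```python
SCHEMES = {
    "taylor": ["AGS", "CP", "DE", "ILV", "KMNQRT", "FHWY"],
    "smith": ["DE", "KRH", "NQ", "ST", "ILV", "FWY", "C", "M", "AG", "P"],
    "pd": ["AMLVI", "GP", "HWFY", "KDRE", "QNST", "C"],
}


def group_index(scheme: str) -> dict:
    """Map each one-letter amino acid code to the index of its group."""
    lookup = {}
    for index, group in enumerate(SCHEMES[scheme]):
        for residue in group:
            if residue in lookup:
                raise ValueError(f"{residue} appears in more than one group")
            lookup[residue] = index
    return lookup


for name, groups in SCHEMES.items():
    print(name, len(groups), "groups,", len(group_index(name)), "residues")
```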

Comparison with other approaches

To gauge how well the method presented here performs against one of the current 'state-of-the-art' methods, the selected superfamilies of proteins were also aligned using Hidden Markov Models. In some preliminary work, the different ways of implementing the HMMER2 package were examined and the best-performing method was chosen for all subsequent work. In this implementation, a model is first trained on the largest of the family alignments (produced by ClustalW) and the members of the other, smaller subfamily are then aligned to this first alignment.

Algorithm and matrices

The Taylor and PD amino acid simplification schemes consist of six groups of amino acids whilst the Smith scheme consists of 10 groups. To remain within the 20 × 20 matrix which ClustalW uses, any matrices used with the Smith scheme may only consider two structural states for each amino acid grouping. The structural states chosen were regular/periodic (helix or sheet) and irregular/non-periodic (coil). This leads to two states for each group which completely fills a 20 × 20 matrix. Twenty matrices were designed heuristically to consider different weightings of residue type and secondary structure conservation.

Matrices for the two schemes which contain only six groups can employ either a two-state or a three-state description. A three-state scheme (where the structural types are coil, helix and sheet) produces an 18 × 18 matrix whilst a two-state scheme produces a 12 × 12 matrix.
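The size constraint follows from simple counting: the number of symbols needed is the number of groups multiplied by the number of structural states, and this product must not exceed 20. The Python sketch below illustrates one possible encoding of a sequence and its predicted secondary structure into such symbols. The assignment of symbols to (group, state) combinations is an assumption made for illustration, since any fixed one-to-one assignment would serve, and the names ALPHABET and encode are hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed pool of 20 symbols for the matrix

TAYLOR = ["AGS", "CP", "DE", "ILV", "KMNQRT", "FHWY"]
SMITH = ["DE", "KRH", "NQ", "ST", "ILV", "FWY", "C", "M", "AG", "P"]


def encode(sequence: str, structure: str, groups: list, states: str) -> str:
    """Replace each residue by a symbol standing for (group, structural state).

    states lists the state labels in use, e.g. "HEC" (three-state) or
    "RC" (two-state: regular, coil).
    """
    if len(groups) * len(states) > len(ALPHABET):
        raise ValueError("groups x states exceeds the 20 available symbols")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    group_of = {res: g for g, members in enumerate(groups) for res in members}
    encoded = []
    for residue, state in zip(sequence, structure):
        code = group_of[residue] * len(states) + states.index(state)
        encoded.append(ALPHABET[code])
    return "".join(encoded)


print(len(TAYLOR) * 3, len(TAYLOR) * 2, len(SMITH) * 2)  # 18 12 20
print(encode("AKF", "HCE", TAYLOR, "HEC"))               # ART
```

With the 10-group Smith scheme a three-state description would need 30 symbols, which is why the sketch rejects that combination.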

The matrices developed are depicted in Figure 1, which represents matrices for secondary structure matching and Taylor simplification group matching and how they lead to the final matrix for this work. Each amino acid in a sequence can have two or three states depending upon which choice of secondary structure representation we have chosen. By varying the scores for the two-state or three-state matrix we can favour one match of secondary structure over another. The three states are labelled H (helix), E (sheet) and C (coil). Similarly, for the groups of amino acids described by the Taylor paper we can favour one group over another by varying the values in the matrix. Both this matrix and the secondary structure matrix can be made to favour exact matches by making the leading diagonal values higher than off-diagonal values. By combining these two matrices we arrive at a complete matrix for this work as shown in the figure: the matrices are simply combined such that the score for any group match is affected by a score related to the type of secondary structure present. The figure shows the matrix for a Taylor three-state approach. Each Taylor group can be in one of three states and so by weighting the elements in this matrix we can favour secondary structure matches, amino acid group matches or both. Again, the leading diagonal controls the scores for an exact match of residues. By varying the combinations of these high scoring elements very diverse matrices can be constructed which favour different matches during alignment.
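The construction can be sketched in Python as follows. The text does not state the exact rule by which the two matrices are combined, so the sketch assumes a simple additive combination; the score values are arbitrary illustrations and not those of the matrices designed in this work, and the names diagonal_matrix and combine are hypothetical.

```python
def diagonal_matrix(size: int, match: int, mismatch: int) -> list:
    """Square matrix with one score on the leading diagonal and another off it."""
    return [[match if i == j else mismatch for j in range(size)]
            for i in range(size)]


def combine(group_scores: list, state_scores: list) -> list:
    """Build the (groups x states)-square matrix; ASSUMED rule: add the scores."""
    n_groups, n_states = len(group_scores), len(state_scores)
    size = n_groups * n_states
    combined = [[0] * size for _ in range(size)]
    for g1 in range(n_groups):
        for s1 in range(n_states):
            for g2 in range(n_groups):
                for s2 in range(n_states):
                    row = g1 * n_states + s1
                    col = g2 * n_states + s2
                    combined[row][col] = group_scores[g1][g2] + state_scores[s1][s2]
    return combined


groups = diagonal_matrix(6, match=4, mismatch=0)    # six Taylor groups
states = diagonal_matrix(3, match=2, mismatch=-1)   # H, E, C
matrix = combine(groups, states)
print(len(matrix), len(matrix[0]))  # 18 18
print(matrix[0][0], matrix[0][1], matrix[0][3], matrix[0][4])  # 6 3 2 -1
```

Raising the diagonal of the state matrix relative to the group matrix favours secondary structure matches, and the reverse favours amino acid group matches, which is the kind of weighting the paragraph above describes.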



Fig. 1. Schematic showing the method used to construct the alignment matrices for this work and the most successful matrix designed (see text for explanation).

 
Overview of the process


    Results
 
The results of the cross-validation work show that the two-state secondary structure predictions perform better than the three-state (for the PD and Taylor schemes where both two- and three-state are considered) and that the Smith two-state simplification scheme is the best overall by a small margin (cross-validation data not shown).

Table II shows the results obtained from all the alignment methods examined. The scores are expressed as a fraction of the confidently aligned sections of the structural alignment (as calculated by STAMP and subsequently verified manually). A score of zero would indicate that none of the structural alignment was reproduced and a score of one that the structural alignment was reproduced exactly. 'HMMER2' represents the scores obtained using the HMMER2 package and 'Sequence' the alignment using the primary sequence (the standard 20 amino acid set) information only. 'Profile' refers to the result of aligning the family alignments using primary sequence information only and the profile alignment routine of ClustalW. From the scores quoted one can see that the Smith two-state method is the most successful of those examined here.


Table II. Results of the alignments produced for all superfamilies using matrices developed in this work
 
In earlier work not reproduced here, the proteins were aligned using the simplified amino acid groupings without any structural predictions. In nearly all cases the alignments were at least as inaccurate as those produced using the standard 20 amino acid set. The experimentally determined secondary structure (as produced by DSSP) was used in the earlier work in the same way as predicted structure was used here. It was found that very good alignments could be obtained but interestingly and understandably the matrices that scored well were not those that scored best when using predicted secondary structure.

Table III shows the results of the secondary structure predictions obtained using DSC. It shows the prediction accuracy expressed as a percentage when using the single sequences, the profile alignment obtained from ClustalW in its default mode and the alignment produced by the matrices developed in this work (averaged over all 14 families) as input to DSC. The scores are quoted for both two-state (coil and non-coil) and three-state (helix, sheet and coil) descriptions of secondary structure.


Table III. Average accuracies (%) of the DSC predicted secondary structure for the superfamilies examined
 
Table IV contains the results of profile alignments using reduced amino acid sets but with varying amounts of secondary structure information for all protein superfamilies examined in this work. The results show that reducing the amino acid alphabet alone does not have a positive effect on alignment quality and confirm that the improvement in alignment score is due to the incorporation of predicted secondary structure information. The results reflect the redundancy of amino acid descriptors when looking at distantly related protein superfamilies. The alignment quality increases as the amount of secondary structure information increases as expected, but it is interesting that even using the known secondary structure we can do no better than getting half the alignment correct. This may be due to the size of the amino acid alphabets chosen or the scoring matrices, or may simply show that folding does not have a simple relationship with amino acid code. If the latter is the case, this would have important implications for threading strategies.


Table IV. Alignment accuracy between profile alignments produced using a reduced amino acid set and predicted secondary structure information (RAA/PSS), a reduced amino acid set but no secondary structure information (RAA/NSS) and a reduced amino acid set with experimentally determined secondary structure information (RAA/XSS)
 

    Discussion
 
The work presented here shows that the inclusion of secondary structural information when aligning proteins leads to improved accuracy. The choice of simplification scheme, matrix and whether to consider a two-state or three-state secondary structure description has a profound effect upon the accuracy of the resulting alignment. We have found that the improvement in alignment accuracy using our method results in an improved secondary structure prediction for each protein. Whilst it may seem intuitive that this should be the case, it was not obvious beforehand whether the prediction algorithm would perform better on a more accurate alignment. One technique which makes use of predicted secondary structure is fold recognition. The quality of the secondary structure prediction is a major factor in the success or failure of fold recognition techniques (Rost, 1995). The improved secondary structure predictions presented here should lead to better quality fold recognition.

From Table III one can see that the position of non-coil structure is predicted more accurately than the type of non-coil structure when three-state prediction methods are used. Although the work here has used the program DSC to produce structure predictions, other methods and algorithms such as PHD (Rost and Sander, 1993), Predator (Frishman and Argos, 1997) and the Quadratic Logistic Server (Di Francesco et al., 1995) were briefly examined (results not shown here) and the same observations made. These results and the figures in Table II suggest that concentrating on producing an accurate two-state prediction may lead to better results than the more usual three-state. Whilst the secondary structure prediction scores quoted in Table III are not as good as those quoted for the more recent methods such as PSIPRED, it should be noted that the predictions used here initially were less accurate than is now achievable. As it is not specific to DSC, more accurate secondary structure predictions may increase the alignment accuracy obtained by the method presented here still further.

Another method with the same aim of incorporating secondary structure information has been published by Heringa (1999). Heringa's work differs from that detailed here in that the full amino acid alphabet is retained and three amino acid exchange matrices are used for each of the three secondary structures considered (helix, sheet and coil). There is also some filtering of the predictions used within the method, with the predictions being obtained using the SSPRED technique. Gap penalties are also varied for each of the three secondary structure states predicted. Heringa's method was applied to the sequence sets of the flavodoxin and cupredoxin protein families only. This, coupled with the absence of scores representing the amount of correct alignment achieved by the method, unfortunately makes it impossible to compare the relative effectiveness of the two methods.

Homology modelling is another area where this work has implications. The initial alignment used to construct a homology model is crucial to the accuracy of the final model and any errors at this point are magnified by each subsequent step. The CASP competitions, of which CASP3 [see Sternberg et al. (1999), http://PredictionCenter.llnl.gov/casp3/ and Proteins, 1997, Suppl., 1–230] is the most recent, highlight this. It has been shown that the single most important step in the comparative (homology) modelling section is the sequence alignment (Sternberg et al., 1999). If the secondary structure considerations are taken into account during the alignment stage, it is possible to build better models and to prevent regions of different secondary structural type being aligned with one another.

This work has looked at the scenario where none of the three-dimensional structures within a superfamily are known yet has been able to increase the alignment accuracy by incorporating predicted secondary structure. With the explosion in the number of sequences as a result of the human genome work, many superfamilies will have no structural data whatsoever and the sheer number of sequences will make automation essential. This approach manages to automate the task of aligning protein superfamily members successfully and provides a tool to work with the large numbers of new proteins.


    Notes
 
2 To whom correspondence should be addressed


    References
 
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.

Biou,V., Gibrat,J.F., Levin,J.M., Robson,B. and Garnier,J. (1988) Protein Eng., 2, 185–191.

Chou,P.Y. and Fasman,G.D. (1974) Biochemistry, 13, 211–222.

Di Francesco,V., Munson,P.J. et al. (1995) In Proceedings of the 28th Hawaii International Conference on System Sciences. IEEE, Los Alamitos, CA, 5, pp. 285–291.

Eddy,S.R. (1996) Curr. Opin. Struct. Biol., 6, 361–365.

Eddy,S.R. (1998) http://hmmer.wustl.edu.

Frishman,D. and Argos,P. (1997) Proteins, 27, 329–335.

Garnier,J., Osguthorpe,D.J. and Robson,B. (1978) J. Mol. Biol., 120, 97–120.

Gonnet,G.H., Cohen,M.A. and Benner,S.A. (1992) Science, 256, 1443–1445.

Heringa,J. (1999) Comput. Chem. (Oxford), 23, 341–364.

Jones,D.T. http://globin.bio.warwick.ac.uk/psipred/.

King,R.D. and Sternberg,M.J.E. (1996) Protein Sci., 5, 2298–2310.

King,R.D., Sternberg,M.J.E. et al. (1997) CABIOS, 13, 473–474.

Lemer,C., Rooman,M.J. and Wodak,S.J. (1996) Proteins, 23, 337–355.

Luthy,R., McLachlan,A.D. and Eisenberg,D. (1991) Proteins: Struct. Funct. Genet., 10, 229–239.

Marchler-Bauer,A. and Bryant,S.H. (1997) Trends Biochem. Sci., 22, 236–240.

Mehta,P.K., Heringa,J.P. and Argos,P. (1995) Protein Sci., 4, 2517–2525.

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.

Rost,B. (1995) In Bohr,H. and Brunak,S. (eds), Protein Folds. A Distance-based Approach. CRC Press, Boca Raton, FL, pp. 132–151.

Rost,B. (1999) Protein Eng., 12, 85–94.

Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584–599.

Russell,R.B. and Barton,G.J. (1992) Proteins: Struct. Funct. Genet., 14, 309–323.

Smith,F.R. and Smith,T.F. (1990) Proc. Natl Acad. Sci. USA, 87, 118–122.

Sternberg,M.J.E., Bates,P.A., Kelley,L.A. and MacCallum,R.M. (1999) Curr. Opin. Struct. Biol., 9, 368–373.

Taylor,W.R. (1986) J. Theor. Biol., 119, 205–218.

Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673–4680.

Zvelebil,M.J.J.M., Barton,G.J., Taylor,W.R. and Sternberg,M.J.E. (1987) J. Mol. Biol., 195, 957–961.

Received January 28, 2000; revised January 18, 2001; accepted February 15, 2001.