Laboratory of Biocomputing, CIRB/Department of Biology, University of Bologna, via Irnerio 42, 40126 Bologna, Italy
Abstract |
---|
Keywords: cysteine bonding state/disulfide bridges/hidden Markov models/hidden neural networks/neural networks
Introduction |
---|
The contribution of the disulfide bridge to the thermodynamic stability of proteins has been attributed to a reduction in the conformational entropy of the unfolded polypeptide chain, which destabilizes the unfolded state relative to the native state (for a review, see Betz, 1993). This contribution can be estimated both experimentally (Privalov and Gill, 1988; Freire, 1993) and theoretically (Casadio et al., 1995). Several analyses of the characteristics of disulfide bonds in proteins have been performed, including structural and sequence features and classification of connectivity (Harrison and Sternberg, 1994). This strengthens the view that disulfide bonds increase the conformational stability of the protein mainly by constraining the unfolded conformation, as many experimental and theoretical studies suggest (Harrison and Sternberg, 1994; for a review, see Wedemeyer et al., 2000).
Moreover, the disposition of cysteine residues relative to each other and relative to protein secondary structure is important in the classification of the structure of small disulfide-rich irregular proteins (Harrison and Sternberg, 1996).
In protein folding prediction, knowing the location of disulfide bridges can strongly reduce the search in conformational space (Skolnick et al., 1997; Huang et al., 1999). Therefore, the correct prediction of disulfide connectivity from the protein residue sequence may also help in predicting its 3D structure.
A few studies have addressed the important problem of predicting the bonding state of cysteine in a protein chain. The correct prediction of this state can help in predicting ab initio the 3D structure of proteins by adding structural constraints and also in predicting the correct connectivity of disulfide bridges in the protein (Fariselli and Casadio, 2001). The relevance of the flanking residues in predicting a cysteine bonding state has been demonstrated using statistical methods (Fiser et al., 1992), neural networks (Muskal et al., 1990; Fariselli et al., 1999) and methods that combine local context and global information about protein sequences (Fiser and Simon, 2000; Mucchielli-Giorgi et al., 2002).
In this paper, we present an approach based on hidden neural networks (HNN) that combines neural networks and hidden Markov models and outperforms all the existing methods.
Materials and methods |
---|
4136 segments containing cysteines [free and disulfide bonded (half-cystines)] were taken from the crystallographic data of the Brookhaven Protein Data Bank. Disulfide bond assignment was based on the Define Secondary Structure of Proteins (DSSP) program (Kabsch and Sander, 1983).
Non-homologous proteins (with an identity value <25% and without chain breaks) were selected using the PAPIA system (Noguchi et al., 2001). Segments whose cysteines are inter-chain disulfide bonded are included as free cysteines in the database (34 segments from 27 monomeric chains). After this filtering procedure, the total number of proteins was 969, with 4136 cysteine-containing segments, 1446 of which were in the disulfide-bonded state and 2690 in the non-disulfide-bonded state. For each protein in our database, a profile based on a multiple sequence alignment was created using the BLAST program on the non-redundant dataset of sequences. The resulting profiles are used to create the neural network input.
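As an illustration of what such a profile can look like, the following minimal sketch turns a multiple sequence alignment into per-position residue frequencies. How the profiles were actually derived from the BLAST output is not specified here, so the function name `alignment_profile`, the residue ordering and the treatment of gaps (ignored) are assumptions made only for this example.

```python
# Minimal sketch: per-column residue frequencies from an alignment.
# Residue order and gap handling are assumptions, not the authors' procedure.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def alignment_profile(aligned_seqs):
    """Return one 20-value frequency row per alignment column."""
    length = len(aligned_seqs[0])
    profile = []
    for col in range(length):
        counts = [0] * len(AMINO_ACIDS)
        for seq in aligned_seqs:
            idx = AMINO_ACIDS.find(seq[col])
            if idx >= 0:  # gaps and unknown symbols are skipped
                counts[idx] += 1
        total = sum(counts)
        profile.append([c / total if total else 0.0 for c in counts])
    return profile
```

Each row sums to 1 whenever the column contains at least one standard residue, so a fully conserved cysteine column gives a frequency of 1.0 at the index of `C`.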
During the training/testing phase, the database was split into 20 subsets (almost equally sized and distributed) in order to perform a 20-fold cross-validation. Moreover, in order to highlight the method accuracy better, the performance was evaluated using (i) the whole dataset of proteins (WD) and (ii) a reduced set (RD) in which chains containing only one cysteine are excluded.
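The splitting scheme can be sketched as follows. This is an illustrative assumption about how almost equally sized subsets might be built (shuffle, then deal out round-robin); the actual lists used in this work are those distributed with the dataset. Splitting by protein rather than by segment keeps all cysteines of one chain in the same subset; `make_folds` and `train_test_splits` are hypothetical names.

```python
import random

def make_folds(protein_ids, k=20, seed=0):
    """Shuffle protein identifiers and deal them into k nearly equal folds."""
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

def train_test_splits(folds):
    """Yield (train, test) lists, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test
```

With 969 proteins and k = 20, this gives folds of 48 or 49 proteins, and every protein is tested exactly once.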
The PDB codes of the proteins whose cysteine-containing segments are included in the database, the 20-fold cross-validation lists and the training profiles are available at http://www.biocomp.unibo.it/piero/cyspred/cysdataset.tgz.
Measures of performance
The efficiency of the predictors is scored using the statistical indices defined as follows.
![]() | (1) |
The correlation coefficient C is defined as
![]() | (2) |
The accuracy for each discriminated structure s is evaluated as
![]() | (3) |
The probability of correct predictions P(s) is computed as
![]() | (4) |
Finally, the accuracy per protein is
![]() | (5) |
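The indices above can be computed as in the following sketch, which assumes their conventional definitions: for a state s, p is the number of correct predictions, n the number of correct rejections, o the number of over-predictions and u the number of under-predictions. The per-protein accuracy is assumed here to be the fraction of proteins whose cysteines are all correctly predicted. The function names are hypothetical.

```python
import math

def counts(pred, true, s):
    """Return (p, n, o, u) for state s."""
    p = sum(a == s and b == s for a, b in zip(pred, true))
    n = sum(a != s and b != s for a, b in zip(pred, true))
    o = sum(a == s and b != s for a, b in zip(pred, true))
    u = sum(a != s and b == s for a, b in zip(pred, true))
    return p, n, o, u

def q2(pred, true):
    """Overall fraction of correctly predicted cysteines."""
    return sum(a == b for a, b in zip(pred, true)) / len(true)

def correlation(pred, true, s):
    """Correlation coefficient C(s); 0.0 when the denominator vanishes."""
    p, n, o, u = counts(pred, true, s)
    d = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / d if d else 0.0

def accuracy(pred, true, s):
    """Q(s): fraction of true s cases that are predicted as s."""
    p, _, _, u = counts(pred, true, s)
    return p / (p + u) if p + u else 0.0

def prob_correct(pred, true, s):
    """P(s): fraction of predictions of s that are correct."""
    p, _, o, _ = counts(pred, true, s)
    return p / (p + o) if p + o else 0.0

def q2_prot(proteins):
    """Fraction of proteins with every cysteine correct (assumed definition)."""
    return sum(pred == true for pred, true in proteins) / len(proteins)
```

For example, with predictions `["B", "B", "F", "F"]` against observations `["B", "F", "F", "F"]`, the overall accuracy is 0.75, Q(B) is 1.0 and P(B) is 0.5.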
Neural networks
Standard feed-forward neural networks are implemented with a back-propagation algorithm as the learning procedure. The network architecture is similar to that used previously (Fariselli et al., 1999) and consists of a two-layer perceptron with two hidden neurons, one output node (discriminating the disulfide and free cysteine propensities, respectively) and an input layer that consists of 540 neurons (27 residue-long input window). Owing to the limited number of examples currently available, an early-stopping procedure was used to train the networks (Fariselli et al., 1999).
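The size of the input layer follows from the window length: 27 positions times 20 residue types gives 540 values. The sketch below builds such an input vector from a sequence profile, assuming the window is centred on the cysteine and that positions beyond the chain ends are filled with zeros. Both choices, and the name `window_input`, are assumptions for illustration.

```python
def window_input(profile, cys_index, window=27, n_residue_types=20):
    """Flatten the profile rows around a cysteine into one input vector.

    profile: list of rows, each with n_residue_types frequencies.
    Positions outside the chain are zero-padded (an assumption).
    """
    half = window // 2
    features = []
    for pos in range(cys_index - half, cys_index + half + 1):
        if 0 <= pos < len(profile):
            features.extend(profile[pos])
        else:
            features.extend([0.0] * n_residue_types)
    return features
```

A cysteine at the first position of a chain therefore has its first 13 window positions (260 values) set to zero.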
Hidden neural network
A vector-based HMM that can handle emission probability vectors is used on top of the neural networks described above. The hybrid system is defined as a hidden neural network, following Krogh and Riis (1999). A vector-based HMM, similar to that used in this paper, was recently developed and applied to the prediction of transmembrane β-barrel proteins (Martelli et al., 2002).
Briefly, if L is the number of cysteines in the protein and A is the size of the alphabet over which vectors are built (that is, A = 2, bonding and non-bonding/free cysteine states), we refer to this sequence vector with the notation
![]() | (6) |
The HMM for the specific problem at hand is composed of a Markov model with N states connected by means of the transition probabilities aij (Figure 1). The probability density function for the emission of a vector from each state is determined by a number A of parameters that are specific to each state k and are indicated with the symbols ek(c) (with c = 1, 2, ..., A):
![]() | (7) |
![]() | (8) |
Training the HMM parameters is accomplished by using a modified expectation-maximization algorithm (Martelli et al., 2002). In order to respect the constraints imposed by the selected HMM model (Figure 1), the prediction of each cysteine is made one protein at a time by means of Viterbi decoding (Durbin et al., 1998).
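The decoding step can be illustrated with the following minimal sketch. It does not reproduce the model of Figure 1: the four-state topology (which tracks the parity of the bonded cysteines seen so far, so that a chain ends with an even number of them), the uniform transition probabilities and the emission parameters are all hypothetical. The emission probability of a vector from state k is assumed to be the dot product of the vector with ek(c).

```python
import math

# Hypothetical topology, not the paper's Figure 1: each state pairs a label
# with the parity of the number of bonded cysteines seen so far.
STATES = ("F_even", "B_odd", "F_odd", "B_even")
LABEL = {"F_even": "free", "F_odd": "free", "B_odd": "bonded", "B_even": "bonded"}
START = {"F_even": 0.5, "B_odd": 0.5}
END_STATES = ("F_even", "B_even")  # an even number of bonded cysteines
TRANS = {  # illustrative, untrained transition probabilities
    "F_even": {"F_even": 0.5, "B_odd": 0.5},
    "B_even": {"F_even": 0.5, "B_odd": 0.5},
    "B_odd": {"F_odd": 0.5, "B_even": 0.5},
    "F_odd": {"F_odd": 0.5, "B_even": 0.5},
}
# Assumed emission parameters e_k(c), with c ordered as (bonded, free).
EMIT = {"F_even": (0.0, 1.0), "F_odd": (0.0, 1.0),
        "B_odd": (1.0, 0.0), "B_even": (1.0, 0.0)}

def safe_log(x):
    return math.log(x) if x > 0 else float("-inf")

def emission(state, vec):
    # Assumed form: dot product between the vector and e_k(c).
    return sum(e * v for e, v in zip(EMIT[state], vec))

def viterbi(vectors):
    """vectors: one (p_bonded, p_free) pair per cysteine of a single protein."""
    if not vectors:
        return []
    score = {k: safe_log(START.get(k, 0.0)) + safe_log(emission(k, vectors[0]))
             for k in STATES}
    back = []
    for vec in vectors[1:]:
        new_score, pointers = {}, {}
        for k in STATES:
            prev = max(STATES, key=lambda j: score[j] + safe_log(TRANS[j].get(k, 0.0)))
            new_score[k] = (score[prev] + safe_log(TRANS[prev].get(k, 0.0))
                            + safe_log(emission(k, vec)))
            pointers[k] = prev
        back.append(pointers)
        score = new_score
    state = max(END_STATES, key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [LABEL[k] for k in path]
```

With network outputs `[(0.9, 0.1), (0.4, 0.6), (0.1, 0.9)]`, thresholding each cysteine independently would label only the first as bonded, whereas the constrained decoding returns bonded, bonded, free. This is the sense in which decoding one protein at a time adds global information to the local predictions.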
Results and discussion |
---|
Remarkably, the per-protein accuracy rises to 80.2% for the more difficult set (RD) and to 84.0% for the entire database (Q2prot in Table II).
Even though it is difficult to compare methods tested on different databases, the accuracy obtained with HNN appears greater than that previously reported for other methods, including those that also incorporate global protein rules (Fiser and Simon, 2000; Mucchielli-Giorgi et al., 2002). The method implemented by Fiser and Simon is based on a simple majority rule and reaches an accuracy of 82% when predicting the disulfide bonding state of cysteines on a small set of 81 protein chains; that of Mucchielli-Giorgi et al. makes use of global protein descriptors and scores as high as 84% for the same task on 869 chains. The higher accuracy (88%) obtained with HNN on 969 chains is probably due to the greater flexibility of our system in capturing features of the sequences essential for the prediction of the cysteine bonding state.
In conclusion, it has been shown that a hybrid system combining local with global information outperforms previously developed methods to solve the same task, confirming that for the problem at hand a crucial step forward can be made only when global features of the protein chains are taken into consideration.
Notes |
---|
Acknowledgments |
---|
References |
---|
Casadio,R., Compiani,M., Fariselli,P. and Vivarelli,F. (1995) Proc. Int. Conf. Intell. Syst. Mol. Biol., 3, 81–88, and references therein.
Creighton,T. (1996) Proteins: Structures and Molecular Properties. Freeman, San Francisco.
Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge.
Fariselli,P. and Casadio,R. (2001) Bioinformatics, 17, 957–964.
Fariselli,P., Riccobelli,P. and Casadio,R. (1999) Proteins, 36, 340–346.
Fiser,A. and Simon,I. (2000) Bioinformatics, 16, 251–256.
Fiser,A., Cserzo,M., Tudos,E. and Simon,I. (1992) FEBS Lett., 302, 117–120.
Freire,E. (1993) Arch. Biochem. Biophys., 303, 181–184.
Harrison,P.M. and Sternberg,M.J.E. (1994) J. Mol. Biol., 244, 448–463, and references therein.
Harrison,P.M. and Sternberg,M.J.E. (1996) J. Mol. Biol., 264, 603–623.
Huang,E.S., Samudrala,R. and Ponder,J.W. (1999) J. Mol. Biol., 290, 267–281.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Krogh,A. and Riis,S.K. (1999) Neural Comput., 11, 541–563.
Martelli,P.L., Fariselli,P., Krogh,A. and Casadio,R. (2002) Bioinformatics, 18, S1, 46–53.
Mucchielli-Giorgi,M.H., Hazout,S. and Tuffery,P. (2002) Proteins, 46, 243–249.
Muskal,S.M., Holbrook,S.R. and Kim,S.H. (1990) Protein Eng., 3, 667–672.
Noguchi,T., Matsuda,T.H. and Akiyama,Y. (2001) Nucleic Acids Res., 29, 219–220.
Privalov,P.L. and Gill,S.J. (1988) Adv. Protein Chem., 39, 191–324.
Skolnick,J., Kolinski,A. and Ortiz,A.R. (1997) J. Mol. Biol., 265, 217–241.
Wedemeyer,W.J., Welker,E., Narayan,M. and Scheraga,H.A. (2000) Biochemistry, 39, 4207–4216.
Received May 28, 2002; accepted October 10, 2002.