Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks

Pier Luigi Martelli, Piero Fariselli, Luca Malaguti and Rita Casadio1

Laboratory of Biocomputing, CIRB/Department of Biology, University of Bologna, via Irnerio 42, 40126 Bologna, Italy


    Abstract
 
A hybrid system (hidden neural network) based on a hidden Markov model (HMM) and neural networks (NN) was trained to predict the bonding states of cysteines in proteins starting from the residue chains. Training was performed using 4136 cysteine-containing segments extracted from 969 non-homologous proteins of well-resolved 3D structure and without chain breaks. Under a 20-fold cross-validation procedure, neural networks based on evolutionary information reach a prediction accuracy of 80%. When the whole protein is taken into account by means of an HMM whose emission probabilities are computed from the NN output, the resulting hybrid system (hidden neural network) reaches an accuracy of 88%. Further, when tested on a per-protein basis, the hybrid system correctly predicts 84% of the chains in the data set, a gain of at least 27 percentage points over the NN predictor.

Keywords: cysteine bonding state/disulfide bridges/hidden Markov models/hidden neural networks/neural networks


    Introduction
 
The bonding state of cysteines plays an important role in stabilizing the tertiary folds of proteins and in defining protein functions. Among the amino acid residues, cysteines are unique, since they can create covalent bonds between two non-contiguous residues in the protein chain. Moreover, reduction of disulfide bridges triggers functionally relevant conformational changes (Creighton, 1996).

The contribution of disulfide bridges to the thermodynamic stability of proteins has been attributed to a reduction in the conformational entropy of the unfolded polypeptide chain, which destabilizes the unfolded state relative to the native state (for a review, see Betz, 1993). This contribution can be estimated both experimentally (Privalov and Gill, 1988; Freire, 1993) and theoretically (Casadio et al., 1995). Several analyses of the characteristics of disulfide bonds in proteins have been performed, including structural and sequence features and classification of connectivity (Harrison and Sternberg, 1994). These analyses strengthen the view that disulfide bonds increase the conformational stability of the protein mainly by constraining the unfolded conformation, as many experimental and theoretical studies suggest (Harrison and Sternberg, 1994; for a review, see Wedemeyer et al., 2000).

Moreover, the disposition of cysteine residues relative to each other and relative to protein secondary structure is important in the classification of the structure of small disulfide-rich irregular proteins (Harrison and Sternberg, 1996).

In protein folding prediction, the location of disulfide bridges can strongly reduce the search in the conformational space (Skolnick et al., 1997; Huang et al., 1999). Therefore, the correct prediction of the disulfide connectivity starting from the protein residue sequence may also help in predicting its 3D structure.

A few studies have addressed the important problem of predicting the bonding state of cysteines in a protein chain. The correct prediction of this state can help in predicting ab initio the 3D structure of proteins by adding structural constraints and also in predicting the correct connectivity of disulfide bridges in the protein (Fariselli and Casadio, 2001). The relevance of the flanking residues in predicting a cysteine bonding state has been demonstrated using statistical methods (Fiser et al., 1992), neural networks (Muskal et al., 1990; Fariselli et al., 1999) and methods that combine local context and global information about protein sequences (Fiser and Simon, 2000; Mucchielli-Giorgi et al., 2002).

In this paper, we present an approach based on hidden neural networks (HNN) that combines neural networks and hidden Markov models and outperforms all the existing methods.


    Materials and methods
 
The database

A total of 4136 segments containing cysteines [free and disulfide bonded (half-cystines)] were taken from the crystallographic data of the Brookhaven Protein Data Bank. Disulfide bond assignment was based on the Define Secondary Structure of Proteins (DSSP) program (Kabsch and Sander, 1983).

Non-homologous proteins (with an identity value <25% and without chain breaks) were selected using the PAPIA system (Noguchi et al., 2001). Segments whose cysteines are inter-chain disulfide bonded are included as ‘free’ cysteines in the database (34 segments from 27 monomeric chains). After this filtering procedure, the total number of proteins was 969, with 4136 cysteine-containing segments, 1446 of which were in the disulfide-bonded state and 2690 in the non-disulfide-bonded state. For each protein in our database, a profile based on a multiple sequence alignment was created using the BLAST program on the non-redundant dataset of sequences. These profiles are used to build the neural network input.

During the training/testing phase, the database was split into 20 subsets (almost equally sized and distributed) in order to perform a 20-fold cross-validation. Moreover, to give a clearer picture of the method's accuracy, the performance was evaluated using (i) the whole dataset of proteins (WD) and (ii) a reduced set (RD) in which chains containing only one cysteine are excluded.
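The splitting step can be sketched as follows. This is only an illustration under stated assumptions: the text says the 20 subsets are almost equally sized but does not say how they were built, so the round-robin assignment, the choice of whole proteins (rather than single segments) as the unit, the seed and the function names `split_into_folds` and `cross_validation_rounds` are all hypothetical.

```python
import random

def split_into_folds(protein_ids, n_folds=20, seed=0):
    """Shuffle protein identifiers and deal them round-robin into n_folds subsets.

    Assumption: whole proteins are assigned to a fold, so that all cysteines
    of one chain end up in the same subset.
    """
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

def cross_validation_rounds(folds):
    """Yield (training_ids, test_ids), holding out one fold at a time."""
    for i, test_ids in enumerate(folds):
        train_ids = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train_ids, test_ids
```

With 969 proteins and 20 folds, this scheme gives subsets of 48 or 49 chains each; every chain is tested exactly once, by a model trained on the other 19 subsets.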

The PDB codes of the proteins whose cysteine-containing segments are included in the database, the 20-fold cross-validation lists and the training profiles are available at http://www.biocomp.unibo.it/piero/cyspred/cysdataset.tgz.

Measures of performance

The efficiency of the predictors is scored using the statistical indices defined as follows.

The accuracy is

$$ Q_2 = \frac{P}{N} \tag{1} $$
where P is the total number of correctly predicted cysteines and N is the total number of cysteines.

The correlation coefficient C is defined as

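$$ C(s) = \frac{p(s)\,n(s) - u(s)\,o(s)}{\sqrt{[p(s)+u(s)]\,[p(s)+o(s)]\,[n(s)+u(s)]\,[n(s)+o(s)]}} \tag{2} $$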
where, for each class s (free or bonded cysteines), p(s) and n(s) are the total number of correct predictions and correctly rejected assignments, respectively, and u(s) and o(s) are the numbers of under- and over-predictions, respectively.

The accuracy for each discriminated structure s is evaluated as

$$ Q(s) = \frac{p(s)}{p(s) + u(s)} \tag{3} $$
where p(s) and u(s) are the same as in Equation 2.

The probability of correct predictions P(s) is computed as

$$ P(s) = \frac{p(s)}{p(s) + o(s)} \tag{4} $$
where p(s) and o(s) are the same as in Equation 2.

Finally, the accuracy per protein is

$$ Q_{2prot} = \frac{P_p}{N_p} \tag{5} $$
where Pp is the number of the proteins whose cysteines are all correctly predicted and Np is the total number of proteins.
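The indices defined above can be computed from lists of observed and predicted states. The following is a minimal sketch, assuming states are encoded as the characters 'B' (bonded) and 'F' (free), that the correlation coefficient takes the usual Matthews form, and that an index with a zero denominator is reported as 0.0; the function names are hypothetical and not taken from the paper.

```python
import math

def class_counts(true, pred, s):
    """Return p(s), n(s), u(s), o(s) for class s ('B' or 'F')."""
    pairs = list(zip(true, pred))
    p = sum(t == s and q == s for t, q in pairs)  # correct predictions of s
    n = sum(t != s and q != s for t, q in pairs)  # correctly rejected assignments
    u = sum(t == s and q != s for t, q in pairs)  # under-predictions of s
    o = sum(t != s and q == s for t, q in pairs)  # over-predictions of s
    return p, n, u, o

def q2(true, pred):
    """Overall accuracy: correctly predicted cysteines over all cysteines."""
    return sum(t == q for t, q in zip(true, pred)) / len(true)

def correlation(true, pred, s):
    """Correlation coefficient C for class s (0.0 if the denominator vanishes)."""
    p, n, u, o = class_counts(true, pred, s)
    d = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / d if d > 0 else 0.0

def q_class(true, pred, s):
    """Accuracy for class s: p(s) / [p(s) + u(s)]."""
    p, _, u, _ = class_counts(true, pred, s)
    return p / (p + u) if p + u else 0.0

def p_class(true, pred, s):
    """Probability of correct prediction for class s: p(s) / [p(s) + o(s)]."""
    p, _, _, o = class_counts(true, pred, s)
    return p / (p + o) if p + o else 0.0

def q2_prot(chains):
    """Fraction of chains whose cysteines are all correctly predicted.

    chains: list of (true, pred) pairs, one pair of state strings per protein.
    """
    return sum(t == q for t, q in chains) / len(chains)
```

For example, with observed states "BBFFFB" and predicted states "BFFFBB", four of the six cysteines are correct, and for class B the counts are p = 2, n = 2, u = 1 and o = 1.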

Neural networks

Standard feed-forward neural networks were trained with the back-propagation algorithm. The network architecture is similar to that used previously (Fariselli et al., 1999) and consists of a two-layer perceptron with two hidden neurons, one output node (discriminating between the disulfide-bonded and free cysteine propensities) and an input layer of 540 neurons (a 27-residue input window, with 20 inputs per position). Owing to the limited number of examples currently available, early stopping was used to train the networks (Fariselli et al., 1999).
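The size of the input layer follows from the window length: 27 positions times 20 profile values per position gives 540 inputs. A minimal sketch of such an encoding is shown below, assuming the profile is a list of 20-element rows (one per residue) and that window positions falling outside the chain are filled with zeros; the padding scheme and the name `encode_window` are illustrative assumptions, not details given in the text.

```python
WINDOW = 27   # residues in the input window, centered on the cysteine
N_AA = 20     # profile values per position (one per amino acid type)

def encode_window(profile, center, window=WINDOW):
    """Flatten the profile rows around `center` into one input vector.

    profile: list of rows, each a list of N_AA numbers.
    Positions beyond either end of the chain are zero-padded (assumption).
    """
    half = window // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            features.extend(profile[pos])
        else:
            features.extend([0.0] * N_AA)
    return features
```

For a cysteine at index 2 of a 30-residue chain, the first 11 window positions fall before the chain start, so the first 220 inputs are zero and the remaining 320 carry profile values.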

Hidden neural network

A vector-based HMM that can handle emission probability vectors is used on top of the neural networks described above. The hybrid system is defined as a hidden neural network, following Krogh and Riis (1999). A vector-based HMM, similar to that used in this paper, was recently developed and applied to the prediction of transmembrane β-barrel proteins (Martelli et al., 2002).

Briefly, if L is the number of cysteines in the protein and A is the size of the alphabet over which vectors are built (that is, A = 2, bonding and non-bonding/free cysteine states), we refer to this sequence vector with the notation

$$ s = (s_1, s_2, \ldots, s_L), \qquad s_t = \big(s_t(1), \ldots, s_t(A)\big) \tag{6} $$
The components of each vector s_t are positive and sum to a constant value S (independent of the position t).

The HMM for the specific problem at hand is composed of a Markov model with N states connected by means of the transition probabilities a_ij (Figure 1). The probability density function for the emission of a vector from each state is determined by a number A of parameters that are specific to each state k and are indicated with the symbols e_k(c) (with c = 1, 2, ..., A):

$$ P(s_t \mid \pi_t = k) = \frac{1}{Z} \sum_{c=1}^{A} s_t(c)\, e_k(c) \tag{7} $$
where π_t is the t-th state in the path, Z is the normalizing factor and Σ_c e_k(c) = 1 [for further details, see Martelli et al. (2002)].



 
Fig. 1. HNN state diagram. The arrows represent the allowed transitions. The B and F boxes represent the bonding and non-bonding cysteine states, respectively. The labels ‘e’ (even) and ‘o’ (odd) indicate the parity of the number of bonded cysteines processed so far. A path can end only in an even state, which guarantees that an even number of cysteines is assigned to the bonding state, as required when considering intra-chain disulfide bonds.

 
The vector s_t is obtained directly from the neural network outputs as

$$ s_t = \big(NN(B, W),\ NN(F, W)\big) \tag{8} $$
where W is the local context of the cysteine and NN(B,W) and NN(F,W) are the neural network estimated probabilities of being in the bonding (B) or non-bonding/free (F) state, respectively. In this way, the local context exploited by the NN is coupled with the global information captured by the hybrid system.

Training the HMM parameters is accomplished by using a modified expectation-maximization algorithm (Martelli et al., 2002). In order to keep the constraints imposed by the selected HMM model (Figure 1), the prediction of each cysteine is made using one protein at a time and by means of Viterbi decoding (Durbin et al., 1998).
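The decoding step can be illustrated with a small Viterbi sketch over a four-state even/odd grammar of the kind described in the caption of Figure 1. Several simplifications are assumed here and are not taken from the paper: the state names (`Fe`, `Bo`, `Fo`, `Be`) and the exact transition topology are a guess from the caption, all allowed transitions are given equal weight (so they drop out of the comparison), and each state emits with the NN probability of its own class instead of trained emission parameters. The trained HNN would use the parameters fitted by expectation-maximization.

```python
import math

# Class label carried by each state: F = free, B = bonded;
# e / o = even / odd number of bonded cysteines seen so far.
LABEL = {"Fe": "F", "Fo": "F", "Bo": "B", "Be": "B"}

# Allowed transitions (assumed topology): a bonded cysteine flips the parity,
# a free cysteine keeps it. "begin" behaves like an even state.
NEXT = {
    "begin": ["Fe", "Bo"],
    "Fe": ["Fe", "Bo"],
    "Be": ["Fe", "Bo"],
    "Bo": ["Fo", "Be"],
    "Fo": ["Fo", "Be"],
}
END_OK = {"Fe", "Be"}  # a path may end only in an even state

def viterbi_bonding_states(nn_bonded):
    """Return the best B/F labeling with an even number of B's.

    nn_bonded: NN-estimated probability of the bonded state for each
    cysteine of one chain, in sequence order.
    """
    if not nn_bonded:
        return ""

    def emit(state, pb):
        # Simplified emission: probability of the state's own class.
        prob = pb if LABEL[state] == "B" else 1.0 - pb
        return math.log(max(prob, 1e-12))

    # best[state] = (log score, path) of the best path ending in state
    best = {k: (emit(k, nn_bonded[0]), [k]) for k in NEXT["begin"]}
    for pb in nn_bonded[1:]:
        new = {}
        for prev, (score, path) in best.items():
            for k in NEXT[prev]:
                cand = score + emit(k, pb)
                if k not in new or cand > new[k][0]:
                    new[k] = (cand, path + [k])
        best = new
    _, path = max((v for k, v in best.items() if k in END_OK),
                  key=lambda v: v[0])
    return "".join(LABEL[k] for k in path)
```

Under these assumptions, NN outputs of 0.9, 0.6 and 0.7 would give three bonded cysteines if thresholded independently at 0.5, which is impossible for intra-chain bridges; the decoder instead returns "BFB", the most probable labeling with an even number of bonded cysteines. A chain with a single cysteine is always labeled "F".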


    Results and discussion
 
The NN-based predictor is the basic component of the hybrid system. Its accuracy compares well with that previously obtained with a similar method (Fariselli et al., 1999), as shown by the NN performance results (Table I). The statistical indices are computed using two different testing sets. One contains the whole database (WD, 969 protein chains); the second (RD) is a subset of the former, obtained by eliminating all the protein sequences (187 chains) that contain only one cysteine. For these chains the prediction is trivial, since a single cysteine residue cannot form an intra-chain disulfide bridge. Accordingly, the NN-based predictor, trained to capture local features within the input window centered on the cysteine at hand, performs similarly on both sets when scored on a per-cysteine basis. When we consider the performance on a per-protein basis (accepting only those chains for which the bonding states of all the cysteines are correctly predicted), the score ranges from about 50% for the difficult set (RD) to 57% for the whole database (Q2prot in Table I), indicating that most of the trivial cysteines are also well predicted. The overall figure is, however, rather low; we therefore designed a system that takes into consideration global features of the chains, such as the total number of cysteines and their bonding state in each chain.


 
Table I. Performance (%) of the NN predictor (20-fold cross-validation)
 
When the NN is integrated with the HMM and the resulting HNN method is tested, the results improve. The performance measures in Table II indicate that introducing information on the whole protein sequence substantially improves the prediction score: the accuracy on the whole dataset increases by 8 percentage points. In this case a cysteine that is the only one in its chain is easily labeled as free. Nevertheless, after the removal of the sequences containing only one cysteine, the accuracy of the HNN remains high, more than 7 percentage points above that of the standard NN (compare the performances on the RD set in Tables I and II).


 
Table II. Performance (%) of the HNN predictor (20-fold cross validation)
 
The improvement obtained with the HNN method compared with the NN appears to be due to the introduction of global ‘rules’ defined by the regular grammar implemented in the HMM (Figure 1). This second step captures the number of cysteines in a chain and also keeps track of the bonding states of all the cysteines in the same chain.

Remarkably, the accuracy on a per-protein basis rises to 80.2% for the difficult set (RD) and to 84.0% for the entire database (Q2prot in Table II).

Even though it is difficult to compare methods tested on different databases, the accuracy obtained with the HNN is higher than that previously reported for other methods, including those that also incorporate global protein rules (Fiser and Simon, 2000; Mucchielli-Giorgi et al., 2002). The method implemented by Fiser and Simon is based on a simple majority rule and reaches an accuracy of 82% when predicting the disulfide bonding state of cysteines on a small set of proteins comprising 81 chains; that of Mucchielli-Giorgi et al. makes use of global protein descriptors and scores as high as 84% for the same task on 869 chains. The higher accuracy (88%) obtained with the HNN on 969 chains is probably due to the higher flexibility of our system in capturing features of the sequences essential for the prediction of the cysteine bonding state.

In conclusion, it has been shown that a hybrid system combining local with global information outperforms previously developed methods to solve the same task, confirming that for the problem at hand a crucial step forward can be made only when global features of the protein chains are taken into consideration.


    Notes
 
1 To whom correspondence should be addressed. E-mail: casadio@alma.unibo.it


    Acknowledgments
 
This work was partially supported by a grant from the Ministero della Università e della Ricerca Scientifica e Tecnologica (MURST) for the project ‘Hydrolases from Thermophiles: Structure, Function and Homologous and Heterologous Expression’, a grant for a target project in Biotechnology and a project on Molecular Genetics, both from the Italian Consiglio Nazionale delle Ricerche (CNR), to R.C. R.C. also acknowledges an EC grant, Biowulf IST 1999-20232, for supporting the development of DNCBLAST, a parallelized version of PSI-BLAST for PC nets. P.L.M. is the recipient of a fellowship from the Italian National Institute of Biostructures and Biosystems (INBB).


    References
 
Betz,S.F. (1993) Protein Sci., 2, 1551–1558.

Casadio,R., Compiani,M., Fariselli,P. and Vivarelli,F. (1995) Proc. Int. Conf. Intell. Syst. Mol. Biol., 3, 81–88, and references therein.

Creighton,T. (1996) Proteins: Structures and Molecular Properties. Freeman, San Francisco.

Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge.

Fariselli,P. and Casadio,R. (2001) Bioinformatics, 17, 957–964.

Fariselli,P., Riccobelli,P. and Casadio,R. (1999) Proteins, 36, 340–346.

Fiser,A. and Simon,I. (2000) Bioinformatics, 16, 251–256.

Fiser,A., Cserzo,M., Tudos,E. and Simon,I. (1992) FEBS Lett., 302, 117–120.

Freire,E. (1993) Arch. Biochem. Biophys., 303, 181–184.

Harrison,P.M. and Sternberg,M.J.E. (1994) J. Mol. Biol., 244, 448–463, and references therein.

Harrison,P.M. and Sternberg,M.J.E. (1996) J. Mol. Biol., 264, 603–623.

Huang,E.S., Samudrala,R. and Ponder,J.W. (1999) J. Mol. Biol., 290, 267–281.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Krogh,A. and Riis,S.K. (1999) Neural Comput., 11, 541–563.

Martelli,P.L., Fariselli,P., Krogh,A. and Casadio,R. (2002) Bioinformatics, 18, S1, 46–53.

Mucchielli-Giorgi,M.H., Hazout,S. and Tuffery,P. (2002) Proteins, 46, 243–249.

Muskal,S.M., Holbrook,S.R. and Kim,S.H. (1990) Protein Eng., 3, 667–672.

Noguchi,T., Matsuda,T.H. and Akiyama,Y. (2001) Nucleic Acids Res., 29, 219–220.

Privalov,P.L. and Gill,S.J. (1988) Adv. Protein Chem., 39, 191–324.

Skolnick,J., Kolinski,A. and Ortiz,A.R. (1997) J. Mol. Biol., 265, 217–241.

Wedemeyer,W.J., Welker,E., Narayan,M. and Scheraga,H.A. (2000) Biochemistry, 39, 4207–4216.

Received May 28, 2002; accepted October 10, 2002.