1 Biocomputing Group (Centro Interdipartimentale per le Ricerche Biotecnologiche) and 2 Laboratory of Biophysics, Department of Biology, University of Bologna, Via Irnerio, 42, 40126 Bologna, Italy
Abstract
Keywords: neural networks/protein structure prediction/contact maps/multiple sequence information
Introduction
In this paper we address the question of whether a neural network system is capable of learning the correlation between the residue covalent structure of a protein and its contact map as computed from its known 3D structure. It has been shown that neural networks are efficient tools for solving several kinds of problems, including the prediction of secondary structures (Rost and Sander, 1995), transmembrane helices (Rost et al., 1996) and folding initiation sites (Compiani et al., 1998). Neural networks were used in the pioneering work by Bohr et al. (1990) to predict distances between amino acids of homologous protein sequences, and also to represent knowledge-based potentials (Grossman et al., 1995). Recently, neural networks have been successfully applied to predict whether distances between pairs of amino acids are above or below a given variable threshold (Lund et al., 1997).
Our neural network system is trained to learn the correlation between protein sequences and contact maps on a set of 200 non-homologous protein chains of known structure. With a cross-validation procedure on the training set and blind tests on another 408 proteins never seen before by the network, our results indicate that the neural network-based predictor achieves on average an improvement over a random predictor by a factor of 6, which is higher than that previously obtained by other methods developed to predict residue contacts (Thomas et al., 1996; Olmea and Valencia, 1997). Although the predictive accuracy obtained with neural networks is still far from satisfactory, it is rather independent of protein size, unlike that previously obtained by combining correlated mutations and sequence conservation (Olmea and Valencia, 1997). Our results are obtained by introducing a new coding procedure based on a 'coupled residue' representation.
Materials and methods
Neural networks were trained and tested on a database of proteins selected from the Protein Data Bank (PDB; Bernstein et al., 1977) using the PDB_select algorithm (Hobohm et al., 1992). For the training phase, proteins with a pairwise sequence identity <25% were extracted from the PDB_select_oct_1996 file (http://www.embl-heidelberg.de). This set was then reduced by excluding those chains whose backbone is interrupted. Moreover, in order to avoid false contacts due to the presence of hetero-atoms, only proteins without ligands in the PDB files were kept in the training set (set LRN in Table I). A full cross-validation procedure was routinely applied to the training set (LRN) by grouping the proteins at random into eight subsets of 25 proteins each. One subset at a time was used as a testing set and the network was trained on the remaining seven. Furthermore, in order to test the generalization capabilities of our predictors, two more sets of sequences (with sequence identity <25%) were also used as testing sets. One set included the proteins listed in the PDB_select_oct_1996 file and not included in the training set (namely 225 proteins containing ligands in the PDB files; COF in Table I). The other testing set comprised 184 chains selected from PDB_select_oct_1997 and not present in the previous release (TS97 in Table I).
Given a protein of Lp residues, its contact map S is computed as a symmetric matrix of Lp × Lp elements whose values are set to 1 if there is a contact between two residues and to 0 otherwise. In order to capture the physical properties of the contacts, the contact maps of all the sequences were generated taking into account the whole set of coordinates of the heavy atoms of the protein (all atoms except hydrogen). In this way a contact is assigned if the minimal distance between heavy atoms belonging to the side chain or to the backbone of two residues is <4.5 Å. This threshold value is chosen in order to guarantee that there is neither a third residue nor a water molecule between two amino acids in contact, and it is equal to that previously used in other studies (Hinds and Levitt, 1992; Mirny and Domany, 1996). In order to prevent spurious short-range contacts (Thomas et al., 1996), our procedure does not include contacts between residues whose sequence separation is less than four residues. The training and testing sets of protein contact maps comprise 608 examples, evaluated from our database. These maps contain on average a number of contacts (NC) much lower than the number of non-contacts (NNC) (NC/NNC equals about 1/60). This is due to the different scaling of NC and NNC with protein length: as pointed out by others (Vendruscolo et al., 1997), NC increases almost linearly with the protein sequence length (data not shown), whereas NNC increases with the square of the protein length.
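As an illustration of this definition, the following minimal Python sketch builds a contact map from per-residue heavy-atom coordinates (assumed to be available as numpy arrays, e.g. already parsed from a PDB file; the parsing itself is outside the sketch). The 4.5 Å threshold and the minimum sequence separation of four residues follow the text above, while the data layout is an assumption of the sketch.

```python
import numpy as np

def contact_map(residue_atoms, threshold=4.5, min_separation=4):
    """Binary contact map from heavy-atom coordinates.

    residue_atoms: list of (n_atoms_i, 3) numpy arrays, one per residue,
    holding the heavy-atom (non-hydrogen) coordinates of backbone and
    side chain. Two residues are in contact if any pair of their heavy
    atoms lies closer than `threshold` (Angstrom); pairs separated by
    fewer than `min_separation` positions along the chain are ignored.
    """
    L = len(residue_atoms)
    S = np.zeros((L, L), dtype=int)
    for i in range(L):
        for j in range(i + min_separation, L):
            # minimal heavy-atom distance between residues i and j
            d = np.linalg.norm(
                residue_atoms[i][:, None, :] - residue_atoms[j][None, :, :],
                axis=-1,
            ).min()
            if d < threshold:
                S[i, j] = S[j, i] = 1   # symmetric map
    return S
```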
Another important point concerns the distribution of residue contacts, which is also relevant for determining the most appropriate input coding for the neural network system. By plotting the frequency of contacts in our protein database as a function of the residue sequence separation, it appears that the contact distribution is far from uniform (Figure 1) and that contacts cluster predominantly in the short and middle range of sequence separation.
Our predictor is a classical feed-forward neural network trained with a standard back-propagation algorithm (Rumelhart et al., 1986) to associate protein single sequences with their corresponding contact maps. We implemented a network topology which was a reasonable trade-off between the size of the data set (about 6×10^6 examples of contacts and non-contacts for 200 proteins) and the representation of contacts among residues. After preliminary experiments (results not shown), we found that the best architecture in terms of computational expedience was characterized by one output neuron representing the contact propensity, one hidden layer containing two neurons and one input layer with a different number of neurons depending on the amount of information encoded. We implemented five different networks of increasing input complexity. All the networks contain two input neurons representing, respectively, the normalized sequence separation between the two residues whose contact propensity is under examination (the separation divided by 1000, taken as the maximum protein length in residues) and the normalized protein length (the sequence length divided by 1000). This was done in order to take into account the properties of each contact map (see the previous paragraph and Figure 1). When this information was not encoded, a dramatic drop in network performance was observed, especially for large proteins (data not shown). The major characteristics of the different networks based on this architecture are listed below.
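Before the individual input codings are described, the shared topology can be sketched as follows. This is a minimal illustration only: sigmoid activations, the weight initialization and all training-loop details are assumptions not specified in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ContactNet:
    """Minimal feed-forward net: n_in inputs -> 2 hidden units -> 1 output.

    The single output is read as the contact propensity of the residue
    pair encoded in the input vector. Training would follow standard
    back-propagation on (input vector, 0/1 contact) examples.
    """
    def __init__(self, n_in, rng=np.random.default_rng(0)):
        self.W1 = rng.normal(scale=0.1, size=(2, n_in))
        self.b1 = np.zeros(2)
        self.W2 = rng.normal(scale=0.1, size=(1, 2))
        self.b2 = np.zeros(1)

    def forward(self, x):
        h = sigmoid(self.W1 @ x + self.b1)
        return float(sigmoid(self.W2 @ h + self.b2)[0])
```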
Net1 contains 210 input neurons representing all the possible unordered couples of residue types, i.e. each residue couple and its symmetric counterpart are coded in the same way. This representation reduces the number of input neurons from 20×20 to 20×(20+1)/2 = 210. The input neuron representing the couple of amino acid residues at positions i and j is set to 1, while the remaining 209 are set to 0. The total number of input neurons of Net1 is then 212, including also the two other neurons associated with the protein length and the sequence separation.
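One possible realization of this coding is sketched below: a hypothetical pair_index helper maps each unordered couple of the 20 residue types to one of the 210 positions, and the two global neurons are appended at the end (their exact position in the vector is an assumption of this sketch).

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"               # the 20 residue types
AA_INDEX = {a: i for i, a in enumerate(AA)}

def pair_index(a, b):
    """Index in 0..209 of the unordered residue couple (a, b)."""
    i, j = sorted((AA_INDEX[a], AA_INDEX[b]))
    # offset of row i in the upper triangle (diagonal included)
    return i * 20 - i * (i - 1) // 2 + (j - i)

def encode_net1(seq, i, j, max_len=1000.0):
    """212-element Net1 input: one-hot couple code plus two global neurons."""
    x = np.zeros(212)
    x[pair_index(seq[i], seq[j])] = 1.0
    x[210] = abs(j - i) / max_len      # normalized sequence separation
    x[211] = len(seq) / max_len        # normalized protein length
    return x
```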
Compared with Net1, Net2 contains two more input neurons in order to take into account also the hydrophobic neighbourhood of the two residues of the couple. Hydrophobicity values are computed according to a hydrophobicity scale (Rose et al., 1985) by averaging over a window of seven contiguous residues. Net2 is therefore provided with 214 input neurons.
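The window average could be computed as sketched below; the per-residue scale is passed in as a dictionary (e.g. filled with the Rose et al., 1985 values), and the handling of window positions falling outside the chain is an assumption of the sketch.

```python
def window_hydrophobicity(seq, pos, scale, window=7):
    """Average hydrophobicity over `window` contiguous residues centred
    on `pos`, using the supplied per-residue `scale` (a dict mapping
    one-letter codes to hydrophobicity values). Positions outside the
    chain are simply skipped here.
    """
    half = window // 2
    values = [scale[seq[k]]
              for k in range(max(0, pos - half), min(len(seq), pos + half + 1))]
    return sum(values) / len(values)
```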
Net3 is implemented to encode evolutionary information by considering the alignment from the corresponding HSSP files (Sander and Schneider, 1991). As depicted in Figure 2, all the possible couples generated by the residues in positions i and j of the different aligned sequences are counted. After normalization to the number of sequences, the frequency of occurrence of each couple in the alignment is used in the corresponding position of the 210-element input vector representing all the possible residue couples. In this way, the 210-element vector may have more than one component activated. Two more neurons are then added to take into account the conservation weights of positions i and j in the alignment. As in the case of Net2, Net3 also codes for the hydrophobic neighbourhood of residues i and j. Thus, the final number of input neurons is 216.
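Reusing the hypothetical pair_index helper from the Net1 sketch, the frequency coding could look as follows; the treatment of alignment gaps and non-standard residues is an assumption of the sketch.

```python
import numpy as np
# AA_INDEX and pair_index as defined in the Net1 sketch above.

def couple_frequencies(alignment, i, j):
    """210-element vector of couple frequencies at alignment columns i, j.

    `alignment` is a list of equal-length aligned sequences (the master
    sequence plus its HSSP homologues). Each sequence contributes the
    couple formed by its residues at columns i and j; counts are then
    normalized by the number of contributing sequences. Gaps and
    non-standard residues are skipped here.
    """
    v = np.zeros(210)
    n = 0
    for s in alignment:
        a, b = s[i], s[j]
        if a in AA_INDEX and b in AA_INDEX:
            v[pair_index(a, b)] += 1.0
            n += 1
    return v / max(n, 1)
```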
Net4 extends the coding of Net2 with explicit sequence context information: a three-residue window centred on each of the two residues whose contact propensity is under examination is considered, so that five residue couples are coded instead of one (see Results).
Finally, with Net5 we add to Net4 the evolutionary information, following the same procedure described for Net3. In particular, all five couples are coded according to the evolutionary information and, as in Net3, two more neurons code for the conservation weights associated with the alignment positions i and j. In this case the final number of input neurons is 1056.
In order to compensate for the disproportion between contacts and non-contacts during the training phase, learning was accomplished by means of a procedure including a balancing probability factor to reduce the number of back-propagation cycles for the most abundant class (NNC) (Fariselli et al., 1993).
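One simple way to realize such a balancing factor (a sketch of the idea only, not necessarily the exact scheme of Fariselli et al., 1993) is to back-propagate every contact example while presenting only a random fraction of the far more numerous non-contact examples:

```python
import random

def balanced_examples(examples, p_keep):
    """Yield training examples, keeping every contact but back-propagating
    a non-contact only with probability `p_keep` (e.g. of the order of
    NC/NNC), so that the two classes are presented in comparable numbers.
    `examples` is an iterable of (input_vector, target) pairs with
    target = 1 for contacts and 0 for non-contacts.
    """
    for x, target in examples:
        if target == 1 or random.random() < p_keep:
            yield x, target
```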
Evaluation of the predictor accuracy
We are primarily interested in the capability of the network to predict residue contacts. Therefore, the accuracy (A) of the network performance is scored as
A = Ncp*/Ncp (1)
where Ncp* and Ncp are the number of correctly assigned contacts and the total number of predicted contacts, respectively. However, the evaluation of accuracy (Equation 1) requires an estimate of Ncp, and this is obtained as follows. The network uses a single output neuron whose outcomes are real numbers in the interval [0, 1], depending on its activation level. Therefore, in order to discriminate contacts from non-contacts, a possible strategy is to order the network outputs according to their distance from the maximum activation level (1). Then, based on the observation that a protein is characterized by Nc contacts, only the top Nc outputs are labelled as contacts predicted by the network. In this way the number of predicted contacts is restricted to the number actually present in the protein (Ncp = Nc). This procedure is similar to that previously adopted by authors using a statistical method to predict contact maps (Thomas et al., 1996), and it was chosen in order to compare our results with those already published. In our case, selecting only the top Nc outputs of the network is almost equivalent to setting the decision threshold for accepted contacts to >0.8 for each protein.
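The ranking procedure and Equation 1 can be sketched together as follows (a minimal Python illustration; the exclusion of short-range pairs follows the contact-map definition given above).

```python
def contact_accuracy(scores, true_map, min_separation=4):
    """Select the Nc highest-scoring pairs as predicted contacts and
    return A = Ncp*/Ncp (Equation 1), with Ncp fixed to the true
    number of contacts Nc.

    `scores` is an Lp x Lp array of network outputs and `true_map`
    the observed binary contact map.
    """
    L = len(true_map)
    pairs = [(i, j) for i in range(L) for j in range(i + min_separation, L)]
    nc = sum(true_map[i][j] for i, j in pairs)          # real contacts Nc
    ranked = sorted(pairs, key=lambda p: scores[p[0]][p[1]], reverse=True)
    predicted = ranked[:nc]                             # top Nc outputs
    correct = sum(true_map[i][j] for i, j in predicted) # Ncp*
    return correct / nc if nc else 0.0
```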
For comparison, the accuracy of a random predictor (Ar) can be evaluated as

Ar = Nc/Np (2)

where Nc is the number of real contacts in the protein of length Lp and Np is the number of all possible outcomes. Since the contact map S is symmetric and residue pairs whose sequence separation is <4 are not included, Np is equal to (Lp − 4)×(Lp − 3)/2.
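As a quick numerical illustration with purely arbitrary figures, a 100-residue chain with 250 observed contacts would give Ar = 250/((96×97)/2) ≈ 0.054; in code:

```python
def random_accuracy(n_contacts, length):
    """Expected accuracy of a random predictor (Equation 2):
    Ar = Nc / Np, with Np = (Lp - 4)(Lp - 3)/2 the number of residue
    pairs with sequence separation >= 4 in the upper triangle of the map.
    """
    n_pairs = (length - 4) * (length - 3) // 2
    return n_contacts / n_pairs

# Illustrative (arbitrary) figures: 100 residues, 250 observed contacts
print(random_accuracy(250, 100))   # -> ~0.054
```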
Results
Upon increasing the information content provided to the networks, the accuracy level also increases, as shown in Table III for each subset of proteins of different length. In this table the performance of each network is listed as a single average value over the whole database (Table I). This is done by averaging the values obtained after cross-validation on LRN (as described above) with the average values obtained with the eight networks trained on LRN on the proteins comprised in the COF and TS97 test sets. The monotonic trend of the average accuracy of the different networks suggests that all the added information is useful for contact prediction (see the rightmost column of Table III). Indeed, the performance increases from 0.144 (Net1) to 0.147 (Net2) when the hydrophobic propensity of the residues is taken into account; moreover, a further improvement is noticeable when multiple sequence information is added, with the performance of Net3 reaching 0.150. With Net4 the sequence context information is introduced, by explicitly taking into account a three-residue window centred on the residues whose contact is to be predicted. The final average value is not very different from that obtained with Net3 (0.151 compared with 0.150). These results indicate that evolutionary information and sequence context information contribute similarly to the increase in the predictive performance of the method. This is also evident from the fact that, for small proteins (shorter than 170 residues), Net3 performs better than Net4 (Table III). For larger proteins, however, Net4 outperforms Net3. In conclusion, it appears that the two types of information are complementary, and that combining evolutionary and sequence context information further increases the accuracy, as demonstrated by the fact that Net5 reaches an accuracy of 0.160 on the whole set of proteins.
Discussion
With neural networks it is also easy to change and increase the amount of information in the input coding. Variables such as the hydrophobicity of the environment, the sequence context and evolutionary information are easily fed to the networks and their effects can be investigated. One interesting result is that all these variables, when taken into consideration separately, improve the accuracy of the predictor in a stepwise manner. When the performance is analysed by grouping the proteins by chain length, it appears that evolutionary information is more relevant for predicting proteins whose chain length is <170 residues. For longer chains the sequence context seems to play a larger role. Indeed, the best predictor is obtained when all the variables are taken into consideration simultaneously.
The generalization capability of the neural network-based predictor can be compared with that of other methods previously described and based on approaches different from ours. The best performances described so far were obtained with a statistical method and a joint method, respectively (Thomas et al., 1996; Olmea and Valencia, 1997). The first approach achieves an accuracy level of 0.13 (five times higher than a random predictor) in predicting contact maps generated by computing Cβ–Cβ distances with a contact threshold of 8.0 Å. However, the accuracy of the method drops below 0.1 when contact maps generated from an all-atom representation (unfortunately with an unspecified contact threshold) are predicted. Even though our contact maps, computed from all-atom interactions, cannot be directly compared with those previously described (owing to the unspecified contact threshold), it is evident from our results that Net5 achieves an accuracy level at least 6 percentage points higher (<A> = 0.160, six times higher than that of a random predictor; Table III).
The second approach focuses on the prediction of Cβ–Cβ contacts, and its accuracy is improved in different ways (Olmea and Valencia, 1997). The most successful one adds to the correlated mutation information the statistical propensities derived from the different numbers of contacts that the different residues make in the context of the three most representative secondary structure types (α-helices, β-strands and random coils). In this way a filtering procedure is generated, which takes into account the secondary structure labels of the different residues from the DSSP files (Kabsch and Sander, 1983); moreover, for each residue, predicted contacts are accepted provided that their number is less than the average number of contacts per residue in the given structural type. The result is to lower the final number of predicted contacts by 25%, remarkably increasing the accuracy. In order to compare our predictions with those obtained with this method, we therefore consider only the first 75% of the real contacts (Figure 5). The final results are reported in Table IV, where the occupancy method (Occ; Olmea and Valencia, 1997) is compared with the filtered Net5 (Net5_75 in Table IV). From the data, it is evident that the occupancy method performs similarly to our predictor for small proteins. For proteins of large size (>100 residues), neural networks perform with an accuracy which is better than that of the other methods and comparable only to that obtained with Occ. However, it should be remembered that in our case the prediction is based on the all-atom representation of the protein and is therefore more detailed.
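In terms of the scoring sketch given in Materials and methods, this filtered comparison amounts to labelling only the top 0.75·Nc network outputs as predicted contacts, for example:

```python
def filtered_contact_accuracy(scores, true_map, fraction=0.75, min_separation=4):
    """Variant of the earlier scoring sketch: only the top fraction*Nc
    network outputs are labelled as predicted contacts, mirroring the
    ~25% reduction applied by the occupancy filter."""
    L = len(true_map)
    pairs = [(i, j) for i in range(L) for j in range(i + min_separation, L)]
    nc = sum(true_map[i][j] for i, j in pairs)
    n_pred = max(1, int(fraction * nc))
    ranked = sorted(pairs, key=lambda p: scores[p[0]][p[1]], reverse=True)
    correct = sum(true_map[i][j] for i, j in ranked[:n_pred])
    return correct / n_pred
```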
References
Bohr,H., Bohr,J., Brunak,S., Cotterill,R., Fredholm,H., Lautrup,B. and Petersen,S.B. (1990) FEBS Lett., 261, 43–46.
Compiani,M., Fariselli,P., Martelli,P. and Casadio,R. (1998) Proc. Natl Acad. Sci. USA, 95, 9290–9294.
Fariselli,P., Compiani,M. and Casadio,R. (1993) Eur. Biophys. J., 22, 41–51.
Göbel,U., Sander,C., Schneider,R. and Valencia,A. (1994) Proteins, 18, 309–317.
Grossman,T., Farber,R. and Lapedes,A. (1995) ISMB, 3, 154–161.
Hinds,D.A. and Levitt,M. (1992) Proc. Natl Acad. Sci. USA, 89, 2536–2540.
Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417.
Huang,E.S., Subbiah,S. and Levitt,M. (1995) J. Mol. Biol., 252, 709–720.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Lund,O., Frimand,K., Gorodkin,J., Bohr,H., Bohr,J., Hansen,J. and Brunak,S. (1997) Protein Engng, 10, 1241–1248.
Maiorov,V.N. and Crippen,G.M. (1992) J. Mol. Biol., 227, 876–888.
Miyazawa,S. and Jernigan,R.L. (1985) Macromolecules, 18, 534–552.
Mirny,L.A. and Shakhnovich,E.I. (1996) J. Mol. Biol., 264, 1164–1179.
Mirny,L. and Domany,E. (1996) Proteins, 26, 391–410.
Olmea,O. and Valencia,A. (1997) Fold. Des., 2, S25–S32.
Rose,G.D., Geselowitz,A.R., Lesser,G.J., Lee,R.H. and Zehfus,M.H. (1985) Science, 229, 834–838.
Rost,B. and Sander,C. (1995) Proteins, 23, 295–300.
Rost,B., Fariselli,P. and Casadio,R. (1996) Protein Sci., 5, 1704–1718.
Rumelhart,D.E., Hinton,G.E. and Williams,R.J. (1986) Nature, 323, 533–536.
Sander,C. and Schneider,R. (1991) Proteins, 9, 56–68.
Selbig,J. and Argos,P. (1998) Proteins, 31, 172–185.
Shindyalov,I.N., Kolchanov,N.A. and Sander,C. (1994) Protein Engng, 7, 349–358.
Sippl,M.J. (1990) J. Mol. Biol., 213, 859–883.
Thomas,D.J., Casari,G. and Sander,C. (1996) Protein Engng, 9, 941–948.
Taylor,W.R. and Hatrick,K. (1994) Protein Engng, 7, 341–348.
Vendruscolo,M., Kussell,E. and Domany,E. (1997) Fold. Des., 2, 295–306.
Received July 14, 1998; revised September 29, 1998; accepted October 7, 1998.