1 CIRB and Department of Biology, University of Bologna, via Irnerio 42, Bologna, Italy and 2 Protein Design Group, CNB-CSIC, Cantoblanco, Madrid 28049, Spain
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: contact maps/correlated mutations/neural networks/protein structure predictions/residue contacts
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
A contact map also constitutes a structural `fingerprint' of a given protein. The secondary structure, fold topology and side-chain packing patterns (for side-chain contact maps) can be highlighted easily from a contact map. Indeed, each protein can be identified based on its contact map. Furthermore, structural similarity between a pair of proteins can be detected by inspecting their contact maps, without searching for all their possible orientations (Godzik et al., 1992). Recently residue contact classification has been correlated to structural patterns (Selbig and Argos, 1998
). It follows that a good predictor of protein contacts would play a fundamental role in structural theoretical approaches.
The prediction of contacts between residues requires the study of the protein inter-residue distances related to the specific type of residue pair. The distribution of distances between residue pairs in proteins of known 3D structure has been first computed to derive mean force potentials with the aim of tackling the folding and the inverse folding problem (Sippl, 1990; Maiorov and Crippen, 1992
; Huang et al., 1995
; Mirny and Shaknovich, 1996
; Miyazawa and Jernigam, 1999; Zhang and Sung-Hou, 2000
). Other methods more specifically address the problem of residue contact prediction using the information derived from the occurrence of correlated mutations in similar proteins (Göbel et al., 1994
; Shidyalov et al., 1994; Taylor and Hatrick, 1994
; Thomas et al., 1996
). Moreover, algorithms have been developed which combine correlated mutations with other properties, such as sequence conservation (as computed from multiple sequence alignment), sequence separation along the chain, alignment stability and residue-specific contact occupancy (as evaluated from the 3D protein structure) (Olmea and Valencia, 1997
). Finally, neural network-based methods have been applied to predict whether distances between couples of residues are above or below a given variable threshold (Lund et al., 1997
; Gorodkin et al., 1999
) and also contact maps of proteins (Fariselli and Casadio, 1998
, 1999
).
It has been shown that predictions of contact maps with neural networks score higher than statistical approaches (Fariselli and Casadio, 1999). Even though deviation from a random predictor was equal to six; the efficiency was not sufficient for protein prediction and the development of new predictors was foreseen. Therefore, we thought to take advantage of the high degree of flexibility of the neural network system and in this paper we combine the information derived from different sources, including sequence conservation, correlated mutations and predicted secondary structures, to address the problem of predicting contact maps of proteins.
The question still to be answered is then to what extent a neural network system is capable of learning the correlation between the residue covalent structure of a protein and its contact map, as it is computed from its known 3D structure. Specifically, we investigate how all the possible information which can be derived from the sequence alignment of the protein, as previously described (Olmea and Valencia, 1997), affects the predictive performance. The neural networks described in this paper are trained to learn the correlation between protein sequences and the contact maps of a set of non-homologous proteins of known structure, including 173 chains of different length. Proteins are selected with the specific requirement that they have at least 15 sequences in the multiple sequence alignment. The test is performed using a cross-validation procedure. Our results indicate that the neural network-based predictor can reach an accuracy as high as 0.21, with a deviation from a random predictor of a factor >6. Moreover, the inclusion of a filtering procedure previously described and based on residue occupancies (Olmea and Valencia, 1997
) improves the average accuracy up to 0.25; in this case the improvement over a random predictor reaches a factor of 8. These scores are higher than that previously obtained by the other methods developed to predict residue contacts (Thomas et al., 1996
; Olmea and Valencia, 1997
; Fariselli and Casadio, 1999
) and indicate that the performance of the prediction of contact maps with neural networks can be improved by increasing the amount of information given as input.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
We use a large set of non-homologous proteins of known 3D structure. In Table I we list the proteins sorted by length [proteins are labelled by their Protein Data Bank (PDB) code], since our results require this classification. The list includes all proteins in the PDB select-list of non-sequence-redundant protein structures with percentage of sequence identity lower than 25% (Hobohm and Sander, 1994
) and whose chain was not interrupted (822 proteins). Furthermore, the chains are selected provided that alignments with more than 15 sequences were obtained: using this approach, in total, our set includes 173 proteins (Table I
).
|
The native structure of a protein is approximated by the set of the coordinates listed in its PDB file. If a protein contains N atoms, the corresponding representation requires 3N coordinates. An alternative view of the protein makes use of a distance matrix, a symmetric square NxN matrix whose elements are the distances among the atoms in the protein. This representation is obviously redundant: it requires N(N 1)/2 degrees of freedom instead of 3N. However, it is still very important, since it has been demonstrated that the redundancy can help in the reconstruction of the 3D structure of the protein only when some elements of the distance matrix are available (Kuntz et al., 1989). This is often the case, especially in NMR spectroscopy. Based on this notion, a correct prediction of a relevant part of a distance matrix could be used to compute ab initio the protein 3D structure. Unfortunately, so far no method is available to compute the distance matrix starting from the protein sequence. However, we can try an approximate solution of this problem by predicting the contact map of a protein. A contact map is a binary version of the distance matrix and it is defined as a symmetric matrix consisting of N x N elements (for a protein of N residues) whose values are set to 1 or 0 according to whether there is or there is no contact between the two residues at issue.
A contact is said to exist between each pair of residues whenever the mutual distance is below a given arbitrary threshold. The distance involved in the different definitions of a contact can be that between the CC
atoms (Mirny and Domany, 1996
), between the CßCß (Thomas et al., 1996
; Lund et al., 1997
; Olmea and Valencia, 1997
) and the minimal distance between atoms belonging to the side chain or to the backbone of the two residues (Fariselli and Casadio, 1999
).
In this paper we adopt the definition of contact used also by Olmea and Valencia (Olmea and Valencia, 1997), i.e. we calculate contacts only using Cß atoms and adopting a threshold value of 8 Å. When it is not stated explicitly, the minimal sequence separation is |i j|
7 residues. This choice is done in order to overcome the effect of learning local contacts and particularly to polarize predictions on the intra turn and intra helix contacts (i, i +4 ; i, i + 5; i, i + 6) and in this, the present approach is different from that previously described (Fariselli and Casadio, 1999
).
Computing sequence conservation and correlated mutations
Sequence variability was taken from the HSSP database (Sander and Schneider, 1993). In the HSSP definition (Sander and Schneider, 1991
), variability is 0 when positions in the multiple sequence alignments are completely conserved and it increases proportionally to the number of amino acid changes occurring at that position.
Correlated mutations are calculated as previously described (Göbel et al., 1994). Briefly, a distance array is used to codify each position in the alignment. This position-specific array contains all the residueresidue distances between all the possible pairs of sequences at that position. The correlation value between each pair of positions in the alignment is computed as the correlation of the two arrays for each possible residue pair. Corresponding elements in the arrays contain the distance between the same two sequences in the two positions under comparison. The scoring matrix of McLachlan (McLachlan, 1971
) defines the distances between residues. Positions with a percentage of gaps >10% are set at a correlation value of 1 and completely conserved positions are set at a correlation value of 0. The similarity value of gaps is set to a dummy value of 0.
Neural network architecture and input codings
Neural networks have been proved to be one of the most successful methods for prediction of contact maps of proteins (Fariselli and Casadio, 1999). In this work we implement a neural network architecture which is similar to that described before. This topology, which was found to be the best performing one with the problem at hand, is depicted in Figure 1
. A single output neuron codes for contact (output value close to 1) and non-contact (output value close to 0). The hidden layer consists of eight hidden neurons. A new type of input coding was previously introduced (Fariselli and Casadio, 1999
) and it is also used here. Each residue pair in the protein sequence is coded as an input vector containing 210 elements (20x(20 + 1)/2), representing all the possible ordered couples of residues (considering that each residue couple and its symmetric are coded in the same way). This is done in order to reduce the number of weight junctions. When a single sequence is used, the input neuron coding for the ordered couple of amino acidic residues at positions i and j is set to 1, while the remaining 209 are set to 0. In order to take into account the sequence neighbours we use a 3-residue-long input window, considering both parallel and anti-parallel pairing of the two segments centred at positions i and j, respectively. This leads to the coding of the couples formed by the residues in positions {i 1, j 1}, {i, j}, {i + 1, j + 1} (parallel pairing) and {i 1, j+1}, {i, j}, {i + 1, j 1} (anti-parallel pairing) ending up with five possible combinations ({i 1, j 1}, {i, j}, {i + 1, j + 1}, {i 1, j + 1}, {i + 1, j 1}) of the ordered couples. This procedure requires 1050 (210x5) input neurons. When multiple sequence information is used (as is always the case in this paper), this binary input code is changed to a frequency-based one. This is done by considering the alignment from the corresponding HSSP files (Sander and Schneider, 1991
) and taking all the possible couples generated by residues in positions i and j of the different aligned sequences. After normalization to the number of sequences, the frequencies of occurrence in the alignment of each couple are used in the corresponding position of the 210 element input vector representing all the possible ordered couples. Using this approach, the 210 element vector may have more than one component activated (Fariselli and Casadio, 1999
).
|
To solve our specific problem five neural networks of different complexity are implemented (Figure 1):
|
In order to compare all the implemented networks together and with the previously described methods based on correlated mutations, the number of predicted contacts is set equal to half of the protein length Lp/2 (following the procedure described in Olmea and Valencia, 1997). A sorting procedure based on the network output values is adopted. Contacts are defined as the highest Lp/2 prediction values for a protein of length equal to Lp and are routinely characterized by output activation values >0.750.80.
The filtering procedure
To avoid contact overprediction, the predicted pairs are filtered taking into account the amount of contacts that each residue type can make (Olmea and Valencia, 1997). The filtering procedure is based on the occupancy data (or residue-coordination numbers) of each residue. This value is statistically derived from the set of protein structures of the database and takes into account the secondary structure type and the solvent exposition of each residue. Using this, the number of predicted contacts of a residue becomes a function of its structural environment. Therefore, the occupancy can be considered an estimate of the maximal number of contacts that each residue can make and is used to limit the number of contacts predicted for each residue.
Evaluation of the efficiency
To score the efficiency of the predictors we use three statistical indices. The first and the most used index to evaluate the predictor accuracy is defined as:
![]() | (1) |
The improvement over a random predictor is evaluated by computing the ratio between A (Equation 1) and the accuracy of a random predictor (Nc / Np):
![]() | (2) |
A third and useful index measures the difference in the distribution of the inter-residue distances in the 3D structure for predicted pairs compared with all pair distances in the structure (Pazos et al., 1997). This index is defined as:
![]() | (3) |
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Irrespective of the contact definition, it has been shown that the number of contacts (Nc) in the proteins increases almost linearly with the protein sequence length, while the number of `non-contacts' (Nnc) increases with the square of the sequence length (Vendruscolo et al., 1997). On average Nc/Nnc is approximately 1/60 (Fariselli and Casadio, 1999
). In order to compensate for this disproportion a balancing procedure is used during the training phase of the networks and a sorting procedure based on the network output values is adopted (see Materials and methods, Neural network architecture and input codings).
However, a crucial point to take into consideration is the distribution of protein contacts with respect to the length of the sequence separation between pairs. The sequence separation between two residues in position i and j in the protein sequence is computed as |j i|. It is obvious that a well performing predictor of contacts should not be polarized by contact predictions among residues that are nearest neighbours in the sequence and covalently bounded. It can be evaluated that the frequency distribution of the contacts decreases at increasing length of the sequence separation between residue pairs (Thomas et al., 1996; Fariselli and Casadio, 1999
).
The theoretical distribution of contacts in a protein can be computed by assuming that:
From these assumptions it follows that the a priori probability of finding a contact (c) in a map i is:
![]() | (4) |
![]() | (5) |
![]() | (6) |
The theoretical frequency distribution for a a priori random assignment of contacts in a protein (Equation 6) is computed using the N(Li) value from our database. In Figure 3
the theoretical frequency distribution of contacts is compared with the real one evaluated using our database. It is evident that the pattern of the a priori distribution is characterized by an exponential-like behaviour, indicating that most of the contacts (68%) in proteins are between residues with a sequence separation ranging from 7 to 100. The real distribution deviates from that of a a priori random assignment of contacts. It is remarkable that the real frequency distribution of contacts contains in the same region of sequence separation 81% of the population. This indicates that contacts in proteins are not randomly distributed and occur predominantly between residues with a sequence separation spanning from 7 to 100 residues.
|
The implemented networks are tested on the selected database using a cross-validation procedure. The results are compared with those obtained using a previous method based only on correlated mutations (Olmea and Valencia, 1997) (Table II
). In order to highlight the dependency of the results on the protein dimension, the accuracy values are listed after grouping the proteins also by sequence length. It is evident that on average for all the proteins neural networks (NET) give a very good performance with respect to the method based only on correlated mutation (Corr). For the sake of comparison with a neural network-based predictor previously described (Fariselli and Casadio, 1999
) we notice that the average accuracy is the same as that published before (0.16). However, since the sequence separation value is set now to 7 (instead of 4 residues, as before), deviation from random is 1.16 points lower than before. This simply reflects the fact that with the current procedure we are narrowing the range of contact predictions accepted as correct.
|
When information on sequence conservation and correlated mutations is extended to a 3-residue-long window (NETCW), the accuracy value increases of one percentage point; however, deviation from random (R) and Xd values are not modified with respect to NETC, indicating that the efficiency of the prediction is rather similar to that obtained with NETC. If the prediction efficiency of the subsets of proteins of different size is considered, it can be concluded that this result is mainly due to the decrease in prediction accuracy of proteins of large size (>300 residues).
If sequence separation of the contact residue pairs is taken into account (NETCsep), an increase of the accuracy by two percentage points with respect to NETC is noticed. In this case all the other indices also increase. NETCsep reaches an average deviation from random (R) higher than 6 and an Xd value of 9.8. Noticeably, this predictor performs well also on proteins of large size (>300 residues).
The predictor which also implements the protein-predicted secondary structures (NETCSS) gives the highest score (Table II). As compared to NETC, which only considers correlated mutations, accuracy is increased 2-fold. With the increase of protein length the performance of NETCSS decreases (Table II
). For large proteins the accuracy is lower than that obtained with NETsep. Nevertheless the overall performance of NETCSS is still the highest for proteins of small and medium size (<300 residues), indicating that this method is the best predictor of contact maps for proteins of average dimensions.
In Figure 4, the accuracy of the contact prediction is plotted as a function of the protein length using the different methods (Corr, FNET and NETCSS). It is evident that the accuracy of the prediction is dependent on the length of the protein. It is indeed easier to predict the contacts of short sequences, since the contact density is higher in small than in large proteins (Vendruscolo et al., 1997
). Neural network-based methods (the NET predictors) are out-performing that based only on correlated mutations (Corr); however, the most significant differences in the efficiency of the prediction between the two approaches are to be found in the contact prediction of proteins of small and medium size.
|
When the filtering procedure is adopted, even though the number of predicted contacts is reduced, the accuracy level of the rest is of higher quality (Table III). All the networks increase their efficiency as measured by the different indices while keeping the same order of performance listed in Table II
. After filtering, it also becomes evident that there is no improvement when the 3-residue-long window coding for conservation and correlated mutations is used; indeed NETC and NETCW show the same accuracy values. The former performs better on proteins of large size, while the latter performs better on small ones.
|
Network performance in different regions of sequence separation
All the contact predictions described above were performed considering a sequence separation of at least 7 residues. In this section we explore the effect of increasing this minimal sequence separation length on the prediction. Increasing the minimal sequence separations causes a severe reduction of the number of real contacts taken into consideration (Figure 3). Since the greatest amount of contacts in a protein is found in the region of small sequence separation (see above), increasing this value is equivalent to reducing the expectation of randomly locating contacts during the prediction. In Table IV
we analyse the results obtained using our best performing method (NETCSS), with and without the filtering procedure, tested on the protein sets using four different minimal sequence separations (7, 20, 40 and 80 residues). As expected, the accuracy index A diminishes as the minimal sequence separation increases. The Xd value decrease less than the accuracy value, indicating that the predicted contacts are anyway quite close in space (Olmea and Valencia, 1997
). This is due to the fact that in the regions of large sequence separations contacts are not only less abundant but also more spot-like and less grouped together. This explanation is also confirmed by the behaviour of the index evaluating the improvement over random prediction (R).
|
CASP3 common targets
Both groups (Olmea and Valencia, 1997; Fariselli and Casadio, 1999
) sent predictions to the Critical Assessment of techniques for protein Structure Prediction (CASP3) competition (Orengo et al., 1999
). At that time there was not a common criterion for the prediction of contact maps, in terms of number of predicted contacts, minimal sequence separation and contact definition. Then, in order to compare the accuracy we have re-calculated the predictions in a consistent way. This is done by scoring the old submissions using a number of predicted contacts equal to half of the length of the protein, a minimal sequence separation of 7 residues and a contact definition based only on Cß with a threshold value of 8 Å. Among the submitted targets only three of them were common to the predictions of both groups, namely T0067, T0077 and T0079 (Table V
). The new method (NETCSS with filter) for two of the three targets predicts with an accuracy level much higher than the two single methods separately (Table V
). Interestingly, for the target T0079 the method based on correlated mutations (and filtered) ranks higher than neural networks. However, the predictor described in this paper scores on the targets with a very high efficiency, which is even greater than the average values obtained on our database. The predictions obtained with NETCSS on the three targets are highlighted in Figures 57
, where all the predicted contacts are shown using a `sticks' representation (the correct contacts are depicted in black).
|
|
|
![]() |
Conclusions |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
From the CASP3 results of contact map predictions (Orengo et al., 1999), it turned out that the method based on correlated mutations (Olmea and Valencia, 1997
) was the best performing of the predicted contacts with an accuracy equal to 0.15. However, the number of contacts predicted was less than that obtained with other methods. On the other hand, the method based on neural networks (Fariselli and Casadio, 1999
) performed best when the statistical significance of the predictions was evaluated with a self-threading method (Orengo et al., 1999
) aimed at recognizing the correct threading of the protein sequence onto its 3D structure from all the alternative threadings.
In this paper we combine both methods and show that neural networks can predict protein contacts with an accuracy of 0.21 starting from the residue sequence and including residue conservation, correlated mutations and predicted secondary structures. With this approach, prediction is improved as compared to that of a random predictor by a factor >6, previously obtained with neural networks using as input only evolutionary information (Fariselli and Casadio, 1999). Our results are even more important if we consider that the score is independent of the information on sequence separation, making the predictor of a more general use than before.
At the CASP3 competition it has been shown that the information on sequence separation also strongly affected the prediction of distances between residues made by other neural network-based predictors (Lund et al., 1997).
A substantial improvement of the prediction efficiency is obtained by filtering the network outputs using the filtering procedure previously introduced by Olmea and Valencia (Olmea and Valencia, 1997) and based on the residue-coordination numbers. In this case, the neural network-based predictor increases its accuracy up to 0.25 for all proteins, with a deviation from a random predictor of a factor of 8. Even though this level of accuracy is still not sufficient for folding a protein, we believe that this is a tangible step forward in predicting contact maps of proteins.
|
![]() |
Notes |
---|
![]() |
Acknowledgments |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Bohr,H., Bohr,J. Brunak,S. Cotterill,R. Fredholm,H. Lautrup,B. and Petersen,S.B. (1990) FEBS Lett., 261, 4346.[ISI]
Bohr,J., Bohr,H., Brunak,S., Cotterill,R.M.J., Fredholm,H., Lautrup,B. and Petersen,S.B. (1993) J. Mol. Biol., 231, 861869.[ISI][Medline]
Compiani,M., Fariselli,P., Martelli,P. and Casadio,R. (1998) Proc. Natl Acad. Sci. USA, 95, 92909294.
Eisenhaber,F., Persson,B. and Argos,P. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 194.[Abstract]
Fariselli,P. and Casadio,R. (1998) In Proceedings of World Multiconference on Systemics, Cybernetics and Informatics (SCI'98, Orlando USA), Vol. 1, pp. 527533.
Fariselli,P. and Casadio,R. (1999) Protein Eng., 12, 1521.
Fariselli P, Compiani M. and Casadio R. (1993) Eur. Biophys. J., 22, 4151.[ISI][Medline]
Kuntz,I.D., Thomason,J.F. and Oshiro,C.M. (1989) Methods Enzymol., 177, 159205.[ISI][Medline]
Govindarajan,S. and Goldstein,R. (1998) Proc. Natl Acad. Sci. USA, 95, 55455549.
Göbel,U., Sander,C., Schneider,R. and Valencia,A. (1994) Proteins, 18, 309317.[ISI][Medline]
Godzik,A., Skolnick, J., and Kolinski,A. (1992) J. Mol. Biol., 227, 227238.[ISI][Medline]
Gorodkin,J., Lund,O. Andersen,C.A. and Brunak,S. (1999) Proc. Int. Conf. Intell. Sys., 6, 95105.
Grossman,T., Farber,R. and Lapedes,A. (1995) Mol. Biol., 3, 154161.
Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522524.
Huang,E.S., Subbiah,S. and Levitt,M. (1995) J. Mol. Biol., 252, 709720.[ISI][Medline]
Jacoboni,I., Martelli,P.L., Fariselli,P., Compiani,M. and Casadio,R. (2000) Proteins, 41,535544
Jones,D.T. (1999) J. Mol. Biol., 292, 195202.[ISI][Medline]
Lund,O., Frimand,K., Gorodkin,J., Bohr,H., Bohr,J., Hansen,J. and Brunak,S. (1997) Protein Eng., 10, 12411248.[Abstract]
Maiorov,V.N. and Crippen,G.M. (1992) J. Mol. Biol., 227, 876888.[ISI][Medline]
McLachlan,A. (1971) J. Mol. Biol., 61, 409424.[ISI][Medline]
Miyazawa,S. and Jernigan,R.L. (1999) Proteins, 36, 357369.[ISI][Medline]
Mirny,L. and Domany,E. (1996) Proteins, 26, 391410.[ISI][Medline]
Mirny,L.A. and Shaknovich,E.I. (1996) J. Mol. Biol., 264, 11641179.[ISI][Medline]
Orengo,C.A., Bray,J.E., Hubbard,T., LoConte,L. and Sillitoe,I. (1999) Proteins, 37, 149170.[Medline]
Olmea,O. and Valencia,A. (1997) Fold. Des., 2, S25S32.[ISI][Medline]
Pazos,F., Helmer Citterich,M., Ausiello,G. and Valencia,A. (1997) J. Mol. Biol., 271, 511523.[ISI][Medline]
Rost,B. and Sander,C. (1995) Proteins, 3, 295300.
Rost,B., Fariselli,P. and Casadio,R. (1996) Protein Sci., 5, 17041718.
Sander,C. and Schneider,R. (1991) Proteins, 9, 5668.[ISI][Medline]
Sander,C. and Schneider,R. (1993) Nucleic Acids Res., 21, 31053109.[ISI][Medline]
Selbig,J. and Argos,P. (1998) Proteins, 31, 172185.[Medline]
Shindyalov,I.,N., Kolchanov,N.A. and Sander,C. (1994) Protein Eng., 7, 49358.
Sippl,M.J. (1990) J. Mol. Biol., 213, 859883.[ISI][Medline]
Taylor,W.R. and Hatrick,K. (1994) Protein Eng., 7, 341348.[Abstract]
Thomas,D.J., Casari,G. & Sander,C. (1996) Protein Eng., 9, 941948.[Abstract]
Vendruscolo,M., Kussell,E. and Domany,E. (1997) Fold Des., 2, 295306.[ISI][Medline]
Zhang,C. and Sung-Hou,K. (2000) Proc. Natl Acad. Sci. USA, 97, 25502555.
Received November 30, 2000; revised June 19, 2001; accepted July 10, 2001.