Prediction of contact maps with neural networks and correlated mutations

Piero Fariselli1, Osvaldo Olmea2, Alfonso Valencia2 and Rita Casadio1,3

1 CIRB and Department of Biology, University of Bologna, via Irnerio 42, Bologna, Italy and 2 Protein Design Group, CNB-CSIC, Cantoblanco, Madrid 28049, Spain


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Conclusions
 References
 
Contact maps of proteins are predicted with neural network-based methods, using as input codings of increasing complexity including evolutionary information, sequence conservation, correlated mutations and predicted secondary structures. Neural networks are trained on a data set comprising the contact maps of 173 non-homologous proteins as computed from their well resolved three-dimensional structures. Proteins are selected from the Protein Data Bank database provided that they align with at least 15 similar sequences in the corresponding families. The predictors are trained to learn the association rules between the covalent structure of each protein and its contact map with a standard back propagation algorithm and tested on the same protein set with a cross-validation procedure. Our results indicate that the method can assign protein contacts with an average accuracy of 0.21 and with an improvement over a random predictor of a factor >6, which is higher than that previously obtained with methods only based either on neural networks or on correlated mutations. Furthermore, filtering the network outputs with a procedure based on the residue coordination numbers, the accuracy of predictions increases up to 0.25 for all the proteins, with an 8-fold deviation from a random predictor. These scores are the highest reported so far for predicting protein contact maps.

Keywords: contact maps/correlated mutations/neural networks/protein structure predictions/residue contacts


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Conclusions
 References
 
The three-dimensional (3D) architecture of a protein represents the ultimate molecular information and from that springs a wealth of significant scientific results comprising the understanding of the protein-folding mechanism. Anfinsen's thermodynamic hypothesis (Anfinsen, 1973Go), recently revisited by using lattice models (Govindarajan and Goldstein, 1998Go), has placed the bases of the today commonly accepted idea that under physiological conditions the covalent structure of a protein is sufficient to funnel the folding of the molecule into its native structure. In order to exploit the content of this `molecular biology Holy Grail' (Eisenhaber et al., 1995Go), different methods focusing on distances or contacts among residues in the protein have been proposed for predicting the structure starting from the sequence, when a significantly similar homologue with known structure is lacking. In particular, contacts among residues constrain protein folding and characterize different protein structures. Therefore, the prediction of residue contacts in proteins is an interesting problem whose solution may be useful in protein-folding recognition and de novo design. Provided that residue contacts are known for a protein sequence, the major features of its 3D structure could be deduced by applying different reconstruction methods (Bohr et al., 1993Go; Vendruscolo et al., 1997Go).

A contact map also constitutes a structural `fingerprint' of a given protein. The secondary structure, fold topology and side-chain packing patterns (for side-chain contact maps) can be highlighted easily from a contact map. Indeed, each protein can be identified based on its contact map. Furthermore, structural similarity between a pair of proteins can be detected by inspecting their contact maps, without searching for all their possible orientations (Godzik et al., 1992Go). Recently residue contact classification has been correlated to structural patterns (Selbig and Argos, 1998Go). It follows that a good predictor of protein contacts would play a fundamental role in structural theoretical approaches.

The prediction of contacts between residues requires the study of the protein inter-residue distances related to the specific type of residue pair. The distribution of distances between residue pairs in proteins of known 3D structure has been first computed to derive mean force potentials with the aim of tackling the folding and the inverse folding problem (Sippl, 1990Go; Maiorov and Crippen, 1992Go; Huang et al., 1995Go; Mirny and Shaknovich, 1996Go; Miyazawa and Jernigam, 1999; Zhang and Sung-Hou, 2000Go). Other methods more specifically address the problem of residue contact prediction using the information derived from the occurrence of correlated mutations in similar proteins (Göbel et al., 1994Go; Shidyalov et al., 1994; Taylor and Hatrick, 1994Go; Thomas et al., 1996Go). Moreover, algorithms have been developed which combine correlated mutations with other properties, such as sequence conservation (as computed from multiple sequence alignment), sequence separation along the chain, alignment stability and residue-specific contact occupancy (as evaluated from the 3D protein structure) (Olmea and Valencia, 1997Go). Finally, neural network-based methods have been applied to predict whether distances between couples of residues are above or below a given variable threshold (Lund et al., 1997Go; Gorodkin et al., 1999Go) and also contact maps of proteins (Fariselli and Casadio, 1998Go, 1999Go).

It has been shown that predictions of contact maps with neural networks score higher than statistical approaches (Fariselli and Casadio, 1999Go). Even though deviation from a random predictor was equal to six; the efficiency was not sufficient for protein prediction and the development of new predictors was foreseen. Therefore, we thought to take advantage of the high degree of flexibility of the neural network system and in this paper we combine the information derived from different sources, including sequence conservation, correlated mutations and predicted secondary structures, to address the problem of predicting contact maps of proteins.

The question still to be answered is then to what extent a neural network system is capable of learning the correlation between the residue covalent structure of a protein and its contact map, as it is computed from its known 3D structure. Specifically, we investigate how all the possible information which can be derived from the sequence alignment of the protein, as previously described (Olmea and Valencia, 1997Go), affects the predictive performance. The neural networks described in this paper are trained to learn the correlation between protein sequences and the contact maps of a set of non-homologous proteins of known structure, including 173 chains of different length. Proteins are selected with the specific requirement that they have at least 15 sequences in the multiple sequence alignment. The test is performed using a cross-validation procedure. Our results indicate that the neural network-based predictor can reach an accuracy as high as 0.21, with a deviation from a random predictor of a factor >6. Moreover, the inclusion of a filtering procedure previously described and based on residue occupancies (Olmea and Valencia, 1997Go) improves the average accuracy up to 0.25; in this case the improvement over a random predictor reaches a factor of 8. These scores are higher than that previously obtained by the other methods developed to predict residue contacts (Thomas et al., 1996Go; Olmea and Valencia, 1997Go; Fariselli and Casadio, 1999Go) and indicate that the performance of the prediction of contact maps with neural networks can be improved by increasing the amount of information given as input.


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Conclusions
 References
 
The database

We use a large set of non-homologous proteins of known 3D structure. In Table IGo we list the proteins sorted by length [proteins are labelled by their Protein Data Bank (PDB) code], since our results require this classification. The list includes all proteins in the PDB select-list of non-sequence-redundant protein structures with percentage of sequence identity lower than 25% (Hobohm and Sander, 1994Go) and whose chain was not interrupted (822 proteins). Furthermore, the chains are selected provided that alignments with more than 15 sequences were obtained: using this approach, in total, our set includes 173 proteins (Table IGo).


View this table:
[in this window]
[in a new window]
 
Table I. The database of proteins used to train and test the predictors
 
Contact map definition

The native structure of a protein is approximated by the set of the coordinates listed in its PDB file. If a protein contains N atoms, the corresponding representation requires 3N coordinates. An alternative view of the protein makes use of a distance matrix, a symmetric square NxN matrix whose elements are the distances among the atoms in the protein. This representation is obviously redundant: it requires N(N – 1)/2 degrees of freedom instead of 3N. However, it is still very important, since it has been demonstrated that the redundancy can help in the reconstruction of the 3D structure of the protein only when some elements of the distance matrix are available (Kuntz et al., 1989Go). This is often the case, especially in NMR spectroscopy. Based on this notion, a correct prediction of a relevant part of a distance matrix could be used to compute ab initio the protein 3D structure. Unfortunately, so far no method is available to compute the distance matrix starting from the protein sequence. However, we can try an approximate solution of this problem by predicting the contact map of a protein. A contact map is a binary version of the distance matrix and it is defined as a symmetric matrix consisting of N x N elements (for a protein of N residues) whose values are set to 1 or 0 according to whether there is or there is no contact between the two residues at issue.

A contact is said to exist between each pair of residues whenever the mutual distance is below a given arbitrary threshold. The distance involved in the different definitions of a contact can be that between the C{alpha}–C{alpha} atoms (Mirny and Domany, 1996Go), between the Cß–Cß (Thomas et al., 1996Go; Lund et al., 1997Go; Olmea and Valencia, 1997Go) and the minimal distance between atoms belonging to the side chain or to the backbone of the two residues (Fariselli and Casadio, 1999Go).

In this paper we adopt the definition of contact used also by Olmea and Valencia (Olmea and Valencia, 1997Go), i.e. we calculate contacts only using Cß atoms and adopting a threshold value of 8 Å. When it is not stated explicitly, the minimal sequence separation is |ij| <= 7 residues. This choice is done in order to overcome the effect of learning local contacts and particularly to polarize predictions on the intra turn and intra helix contacts (i, i +4 ; i, i + 5; i, i + 6) and in this, the present approach is different from that previously described (Fariselli and Casadio, 1999Go).

Computing sequence conservation and correlated mutations

Sequence variability was taken from the HSSP database (Sander and Schneider, 1993Go). In the HSSP definition (Sander and Schneider, 1991Go), variability is 0 when positions in the multiple sequence alignments are completely conserved and it increases proportionally to the number of amino acid changes occurring at that position.

Correlated mutations are calculated as previously described (Göbel et al., 1994Go). Briefly, a distance array is used to codify each position in the alignment. This position-specific array contains all the residue–residue distances between all the possible pairs of sequences at that position. The correlation value between each pair of positions in the alignment is computed as the correlation of the two arrays for each possible residue pair. Corresponding elements in the arrays contain the distance between the same two sequences in the two positions under comparison. The scoring matrix of McLachlan (McLachlan, 1971Go) defines the distances between residues. Positions with a percentage of gaps >10% are set at a correlation value of –1 and completely conserved positions are set at a correlation value of 0. The similarity value of gaps is set to a dummy value of 0.

Neural network architecture and input codings

Neural networks have been proved to be one of the most successful methods for prediction of contact maps of proteins (Fariselli and Casadio, 1999Go). In this work we implement a neural network architecture which is similar to that described before. This topology, which was found to be the best performing one with the problem at hand, is depicted in Figure 1Go. A single output neuron codes for contact (output value close to 1) and non-contact (output value close to 0). The hidden layer consists of eight hidden neurons. A new type of input coding was previously introduced (Fariselli and Casadio, 1999Go) and it is also used here. Each residue pair in the protein sequence is coded as an input vector containing 210 elements (20x(20 + 1)/2), representing all the possible ordered couples of residues (considering that each residue couple and its symmetric are coded in the same way). This is done in order to reduce the number of weight junctions. When a single sequence is used, the input neuron coding for the ordered couple of amino acidic residues at positions i and j is set to 1, while the remaining 209 are set to 0. In order to take into account the sequence neighbours we use a 3-residue-long input window, considering both parallel and anti-parallel pairing of the two segments centred at positions i and j, respectively. This leads to the coding of the couples formed by the residues in positions {i – 1, j – 1}, {i, j}, {i + 1, j + 1} (parallel pairing) and {i – 1, j+1}, {i, j}, {i + 1, j – 1} (anti-parallel pairing) ending up with five possible combinations ({i – 1, j – 1}, {i, j}, {i + 1, j + 1}, {i –1, j + 1}, {i + 1, j – 1}) of the ordered couples. This procedure requires 1050 (210x5) input neurons. When multiple sequence information is used (as is always the case in this paper), this binary input code is changed to a frequency-based one. This is done by considering the alignment from the corresponding HSSP files (Sander and Schneider, 1991Go) and taking all the possible couples generated by residues in positions i and j of the different aligned sequences. After normalization to the number of sequences, the frequencies of occurrence in the alignment of each couple are used in the corresponding position of the 210 element input vector representing all the possible ordered couples. Using this approach, the 210 element vector may have more than one component activated (Fariselli and Casadio, 1999Go).



View larger version (34K):
[in this window]
[in a new window]
 
Fig. 1. Neural network architectures implemented in this paper. The different input codings are depicted. The simplest coding includes 210 residue couples (AA, AC, . . . ) times five possible pairings (out of the 3-residue-long windows centred in the i and j paired residues, respectively). Each array (a rectangular box), coding for one of the possible pairing (5), contains the frequency of the couple (c) in the alignment positions (p) [indicated as f(c, p)]. Inputs of increasing complexity includes sequence conservation [c(i), c(j)], correlated mutations [cm (i, j)] and predicted secondary structures in a 3-residue-long window (W) centred on each residue of the i, j couple [ss (Wi), ss (Wj)] (see text for further details).

 
A major characteristic of the networks described in this paper is that they use different input codings, characterized by increasing complexity. We implement neural networks taking into account each position in the sequence alignment information derived from sequence conservation and correlated mutations as previously computed (Olmea and Valencia, 1997Go). Moreover, the predicted secondary structures are also added as input to the networks.

To solve our specific problem five neural networks of different complexity are implemented (Figure 1Go):

  1. NET is the neural network with the simplest input, consisting only of the pairing couples taken from the multiple sequence alignments with a 3-residue-long window centred at the residue pair to be predicted. This network, which was also used in a previous paper (Fariselli and Casadio, 1999Go), is implemented here for the sake of comparison with our previously published results (see below).
  2. NETC is a neural network which contains three input nodes more than NET. Two nodes code for the conservation of each residue in position i and j, respectively, and a third one codes for the correlated mutations of the i and j positions.
  3. NETCW is similar to NETC, with six input neurons coding for the sequence conservation (instead of the two of NETC). This is done in order to also take into consideration the sequence conservation value of the nearest neighbouring residues. The number of input neurons coding for the correlated mutations is increased from one in NETC (referring to the i, j couple) to nine (each combination in a window of three residues, Figure 2Go).
  4. NETCSEP is NETC with one more input neuron that codes for the length of the sequence separation between residues to be predicted. The length of the protein under inspection is used as a normalization factor.
  5. NETCSS adds to NETC 18 input neurons in order to include also the predicted secondary structure. For each residue couple (i, j) a 3-residue-long input window, that contains a node for each discriminated secondary structure type ({alpha}, ß, coil), is considered. Predictions are obtained with a neural network-based predictor, implemented `in house' and trained using a cross-validation procedure on 822 non-homologous proteins (25% sequence identity). Its overall three-state accuracy reaches 0.74 (Jacoboni et al., 2000Go).



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 2. A graphical view of the windows presented to the network (NETCW) including sequence conservation and correlated mutations.

 
All the five networks are trained using a cross-validation procedure. This is performed by splitting the set comprising 173 proteins into three subsets including an approximate equal number of proteins. The networks are trained using the back-propagation algorithm and a balancing procedure (Fariselli et al., 1993Go) in order to avoid deterioration of the performance due to the large imbalance between contacts (less abundant) and non-contacts (more abundant). The weight junctions are randomly initialized in the range [–0.01, 0.01]; the learning rate and the momentum term are set to 0.1 and 0.9, respectively.

In order to compare all the implemented networks together and with the previously described methods based on correlated mutations, the number of predicted contacts is set equal to half of the protein length Lp/2 (following the procedure described in Olmea and Valencia, 1997Go). A sorting procedure based on the network output values is adopted. Contacts are defined as the highest Lp/2 prediction values for a protein of length equal to Lp and are routinely characterized by output activation values >0.75–0.80.

The filtering procedure

To avoid contact overprediction, the predicted pairs are filtered taking into account the amount of contacts that each residue type can make (Olmea and Valencia, 1997Go). The filtering procedure is based on the occupancy data (or residue-coordination numbers) of each residue. This value is statistically derived from the set of protein structures of the database and takes into account the secondary structure type and the solvent exposition of each residue. Using this, the number of predicted contacts of a residue becomes a function of its structural environment. Therefore, the occupancy can be considered an estimate of the maximal number of contacts that each residue can make and is used to limit the number of contacts predicted for each residue.

Evaluation of the efficiency

To score the efficiency of the predictors we use three statistical indices. The first and the most used index to evaluate the predictor accuracy is defined as:

(1)
where Ncp* and Ncp are the number of correctly assigned contacts and that of total predicted contacts, respectively. Routinely the accuracy is evaluated for each protein and then averaged over the protein set under consideration (<A>).

The improvement over a random predictor is evaluated by computing the ratio between A (Equation 1Go) and the accuracy of a random predictor (Nc / Np):

(2)
where Nc is the number of real contacts in the protein of length Lp, and Np are all the possible contacts. In this paper in order to limit the prediction of local contacts (clustered along the main diagonal of the contact map) we set to 7 the minimum length of the sequence separation between residues in contact. Since the contact map S is symmetric and residues whose sequence gap is <7 are not included, Np is computed to be equal to (Lp – 7)(Lp – 6). In a previous paper in which we used neural networks having as input only evolution information, the minimum sequence gap between residues in contact was set to 4. Therefore, the R values computed in this paper cannot be directly compared to those previously published since the accuracy of the random predictor is differently evaluated (Fariselli and Casadio, 1999Go). Also for the R value an averaging procedure is adopted (<R>).

A third and useful index measures the difference in the distribution of the inter-residue distances in the 3D structure for predicted pairs compared with all pair distances in the structure (Pazos et al., 1997Go). This index is defined as:

(3)
where n is the number of bins of the distance distribution (15 equally distributed bins from 4 to 60 Å cluster all the possible distances of residue pairs observed in the protein structure); di is the upper limit (normalized to 60 Å) for each bin, e.g. 8 Å for the 4–8 Å bin; Pic and Pia are the percentage of predicted contact pairs (with distance between di and di 1) and that of all possible pairs, respectively. By definition, values of Xd = 0 indicate no separation between the two distance populations, meaning that the predicted contacts are randomly distributed; values of Xd > 0 indicate positive cases, when the population of the distances between predicted contact pairs is shifted to smaller values with respect to the population of the distances of all residue pairs in the protein. Since contact distances have an upper limit of 8 Å, the larger and positive Xd, the more efficient prediction of contacts is. Similarly to the other two indexes, Xd is also averaged on the protein sets (<Xd>).


    Results
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Conclusions
 References
 
Contact distribution

Irrespective of the contact definition, it has been shown that the number of contacts (Nc) in the proteins increases almost linearly with the protein sequence length, while the number of `non-contacts' (Nnc) increases with the square of the sequence length (Vendruscolo et al., 1997Go). On average Nc/Nnc is approximately 1/60 (Fariselli and Casadio, 1999Go). In order to compensate for this disproportion a balancing procedure is used during the training phase of the networks and a sorting procedure based on the network output values is adopted (see Materials and methods, Neural network architecture and input codings).

However, a crucial point to take into consideration is the distribution of protein contacts with respect to the length of the sequence separation between pairs. The sequence separation between two residues in position i and j in the protein sequence is computed as |ji|. It is obvious that a well performing predictor of contacts should not be polarized by contact predictions among residues that are nearest neighbours in the sequence and covalently bounded. It can be evaluated that the frequency distribution of the contacts decreases at increasing length of the sequence separation between residue pairs (Thomas et al., 1996Go; Fariselli and Casadio, 1999Go).

The theoretical distribution of contacts in a protein can be computed by assuming that:

  1. (i)For the protein i, the number of contacts Nc(i) increases linearly with the protein length Li. The phenomenological estimate Nc(i) = axLi is consistent with previous evaluations (Vendruscolo et al., 1997Go; Fariselli and Casadio, 1999Go).
  2. In a contact map, contacts are randomly distributed (a priori).

From these assumptions it follows that the a priori probability of finding a contact (c) in a map i is:

(4)
The a priori number of contacts at a given length of sequence separation (s) for a protein of length Li is:

(5)
From this it follows that the probability P(c, s) decreases linearly at increasing length of sequence separation. However, it must also be considered that the distribution is computed on proteins of different size, and that not all of the proteins can equally contribute to all sequence separations since the constraint s < Li must hold. Taking this into account, the probability of contacts for a given length of sequence separation (s) becomes:

(6)
where the sum is carried out over all protein length, N(Li) is the number of proteins of length Li and Ntot is the sum over all the sequence separations.

The theoretical frequency distribution for a a priori random assignment of contacts in a protein (Equation 6Go) is computed using the N(Li) value from our database. In Figure 3Go the theoretical frequency distribution of contacts is compared with the real one evaluated using our database. It is evident that the pattern of the a priori distribution is characterized by an exponential-like behaviour, indicating that most of the contacts (68%) in proteins are between residues with a sequence separation ranging from 7 to 100. The real distribution deviates from that of a a priori random assignment of contacts. It is remarkable that the real frequency distribution of contacts contains in the same region of sequence separation 81% of the population. This indicates that contacts in proteins are not randomly distributed and occur predominantly between residues with a sequence separation spanning from 7 to 100 residues.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 3. Frequency distributions of the real and hypothetical contacts as a function of sequence separations. Hypothetical contacts are computed according to Equation 6Go. The minimum value of the length of sequence separation is 7 (residues).

 
The efficiency of the predictors

The implemented networks are tested on the selected database using a cross-validation procedure. The results are compared with those obtained using a previous method based only on correlated mutations (Olmea and Valencia, 1997Go) (Table IIGo). In order to highlight the dependency of the results on the protein dimension, the accuracy values are listed after grouping the proteins also by sequence length. It is evident that on average for all the proteins neural networks (NET) give a very good performance with respect to the method based only on correlated mutation (Corr). For the sake of comparison with a neural network-based predictor previously described (Fariselli and Casadio, 1999Go) we notice that the average accuracy is the same as that published before (0.16). However, since the sequence separation value is set now to 7 (instead of 4 residues, as before), deviation from random is 1.16 points lower than before. This simply reflects the fact that with the current procedure we are narrowing the range of contact predictions accepted as correct.


View this table:
[in this window]
[in a new window]
 
Table II. Efficiency of the different methods used to predict contact maps
 
When the information relative to correlated mutations is introduced as input to the network (NETC), the accuracy increases by 1% and the Xd value becomes 8.5 instead of 7.83 (Table IIGo). This indicates that although the neural network captures most information, the pre-processing computation of the sequence conservation and correlated mutations helps the system in the classification task. This is not the case when the method based only on correlated mutations is adopted (Corr).

When information on sequence conservation and correlated mutations is extended to a 3-residue-long window (NETCW), the accuracy value increases of one percentage point; however, deviation from random (R) and Xd values are not modified with respect to NETC, indicating that the efficiency of the prediction is rather similar to that obtained with NETC. If the prediction efficiency of the subsets of proteins of different size is considered, it can be concluded that this result is mainly due to the decrease in prediction accuracy of proteins of large size (>300 residues).

If sequence separation of the contact residue pairs is taken into account (NETCsep), an increase of the accuracy by two percentage points with respect to NETC is noticed. In this case all the other indices also increase. NETCsep reaches an average deviation from random (R) higher than 6 and an Xd value of 9.8. Noticeably, this predictor performs well also on proteins of large size (>300 residues).

The predictor which also implements the protein-predicted secondary structures (NETCSS) gives the highest score (Table IIGo). As compared to NETC, which only considers correlated mutations, accuracy is increased 2-fold. With the increase of protein length the performance of NETCSS decreases (Table IIGo). For large proteins the accuracy is lower than that obtained with NETsep. Nevertheless the overall performance of NETCSS is still the highest for proteins of small and medium size (<300 residues), indicating that this method is the best predictor of contact maps for proteins of average dimensions.

In Figure 4Go, the accuracy of the contact prediction is plotted as a function of the protein length using the different methods (Corr, FNET and NETCSS). It is evident that the accuracy of the prediction is dependent on the length of the protein. It is indeed easier to predict the contacts of short sequences, since the contact density is higher in small than in large proteins (Vendruscolo et al., 1997Go). Neural network-based methods (the NET predictors) are out-performing that based only on correlated mutations (Corr); however, the most significant differences in the efficiency of the prediction between the two approaches are to be found in the contact prediction of proteins of small and medium size.



View larger version (35K):
[in this window]
[in a new window]
 
Fig. 4. Bar graph showing a comparison of the accuracy (A) of the contact predictions obtained with the different methods used in this paper. The effect of filtering (F) is also included. Accuracy of each method is averaged over the set of all proteins and also on the four different subsets including proteins of small, medium and large size, as indicated. For each protein set, six different sets of predictions are considered. Contact predictions are obtained with: Corr, correlated mutations; NET, the network trained with multiple sequence alignment; and NETCSS, the best performing network; the results of the predictors after filtering (FCorr, FNET, FNETCSS) are also shown. The height of the bars corresponds to the average accuracy of each method and the lines to the standard deviation.

 
Accuracy after the filtering procedure

When the filtering procedure is adopted, even though the number of predicted contacts is reduced, the accuracy level of the rest is of higher quality (Table IIIGo). All the networks increase their efficiency as measured by the different indices while keeping the same order of performance listed in Table IIGo. After filtering, it also becomes evident that there is no improvement when the 3-residue-long window coding for conservation and correlated mutations is used; indeed NETC and NETCW show the same accuracy values. The former performs better on proteins of large size, while the latter performs better on small ones.


View this table:
[in this window]
[in a new window]
 
Table III. Comparison of the performances of the different predictors after the filtering procedure
 
The most remarkable result shown in Table IIIGo is the accuracy reached by FNETCSS. The efficiency of FNETCSS on the whole set of proteins is demonstrated by the accuracy value equal to 0.25. Furthermore, the improvement over a random predictor is increasing up to 8 and the Xd value is close to 12. The other relevant observation is that this predictor still maintains its accuracy on proteins of medium and large size. This indicates that after filtering, contacts are predicted with a high efficiency, as concluded by considering the improvement over random is 9.47 and 12.71 times, respectively, on medium and large size proteins.

Network performance in different regions of sequence separation

All the contact predictions described above were performed considering a sequence separation of at least 7 residues. In this section we explore the effect of increasing this minimal sequence separation length on the prediction. Increasing the minimal sequence separations causes a severe reduction of the number of real contacts taken into consideration (Figure 3Go). Since the greatest amount of contacts in a protein is found in the region of small sequence separation (see above), increasing this value is equivalent to reducing the expectation of randomly locating contacts during the prediction. In Table IVGo we analyse the results obtained using our best performing method (NETCSS), with and without the filtering procedure, tested on the protein sets using four different minimal sequence separations (7, 20, 40 and 80 residues). As expected, the accuracy index A diminishes as the minimal sequence separation increases. The Xd value decrease less than the accuracy value, indicating that the predicted contacts are anyway quite close in space (Olmea and Valencia, 1997Go). This is due to the fact that in the regions of large sequence separations contacts are not only less abundant but also more spot-like and less grouped together. This explanation is also confirmed by the behaviour of the index evaluating the improvement over random prediction (R).


View this table:
[in this window]
[in a new window]
 
Table IV. The efficiency of the best performing method (NETCSS) with and without filtering as a function of sequence separation
 
In the set of experiments shown in Table IVGo, the values of R are nearly constant, even when the accuracy (A) is decreased. This indicates that when the minimal sequence separation increases, the improvement over a random predictor is almost constant due to the fact that the probability of randomly getting correct answers is also decreasing. Therefore, it can be concluded that quality of contact predictions is independent of the sequence separation above a given threshold, whose extent depends on the protein length (using a sequence separation >80 residues, only those proteins whose length is >170 residues give significant results).

CASP3 common targets

Both groups (Olmea and Valencia, 1997Go; Fariselli and Casadio, 1999Go) sent predictions to the Critical Assessment of techniques for protein Structure Prediction (CASP3) competition (Orengo et al., 1999Go). At that time there was not a common criterion for the prediction of contact maps, in terms of number of predicted contacts, minimal sequence separation and contact definition. Then, in order to compare the accuracy we have re-calculated the predictions in a consistent way. This is done by scoring the old submissions using a number of predicted contacts equal to half of the length of the protein, a minimal sequence separation of 7 residues and a contact definition based only on Cß with a threshold value of 8 Å. Among the submitted targets only three of them were common to the predictions of both groups, namely T0067, T0077 and T0079 (Table VGo). The new method (NETCSS with filter) for two of the three targets predicts with an accuracy level much higher than the two single methods separately (Table VGo). Interestingly, for the target T0079 the method based on correlated mutations (and filtered) ranks higher than neural networks. However, the predictor described in this paper scores on the targets with a very high efficiency, which is even greater than the average values obtained on our database. The predictions obtained with NETCSS on the three targets are highlighted in Figures 5–7GoGo, where all the predicted contacts are shown using a `sticks' representation (the correct contacts are depicted in black).


View this table:
[in this window]
[in a new window]
 
Table V. Comparison of FNETCSS with previously described methods on the same CASP3 targets
 


View larger version (47K):
[in this window]
[in a new window]
 
Fig. 5. Localization of the predicted contacts of the CASP3 target T0067 (phosphatidylethanolamine-binding protein, 1BD9) with FNETCSS. The prediction accuracy for this protein is A = 0.38, R = 12.7, Xd = 17.5. Side chains of correctly and wrongly predicted contact residues are highlighted in black and grey, respectively.

 


View larger version (40K):
[in this window]
[in a new window]
 
Fig. 7. Localization of the predicted contacts of CASP3 target T0079 (MarA protein, 1BL0) with FNETCSS. The prediction accuracy for this protein is A = 0.12, R = 5.7, Xd = 7.6. Side chain details are as in Figure 5Go

 

    Conclusions
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Conclusions
 References
 
It has been previously shown that neural networks are efficient tools in solving several kinds of problems, including predictions of secondary structures (Rost and Sander, 1995Go; Jones, 1999Go), transmembrane helices (Rost et al., 1996Go) and folding initiation sites (Compiani et al., 1998Go). Neural networks have been used in pioneer work by Bohr et al. (Bohr et al., 1990Go) to predict distances between amino acids of homologous protein sequences and also to represent knowledge based potentials (Grossman et al., 1995Go). In all this work it is clearly demonstrated that neural network-based predictors perform with a better efficiency when more information is given in input to the system. Therefore, we took advantage of the adaptability of neural networks to re-address the problem of predicting protein contact maps. The purpose of the present work is to use as input to the networks all the `orthogonal' information left behind in our first approach in which neural networks were trained only with evolutionary information (Fariselli and Casadio, 1999Go).

From the CASP3 results of contact map predictions (Orengo et al., 1999Go), it turned out that the method based on correlated mutations (Olmea and Valencia, 1997Go) was the best performing of the predicted contacts with an accuracy equal to 0.15. However, the number of contacts predicted was less than that obtained with other methods. On the other hand, the method based on neural networks (Fariselli and Casadio, 1999Go) performed best when the statistical significance of the predictions was evaluated with a self-threading method (Orengo et al., 1999Go) aimed at recognizing the correct threading of the protein sequence onto its 3D structure from all the alternative threadings.

In this paper we combine both methods and show that neural networks can predict protein contacts with an accuracy of 0.21 starting from the residue sequence and including residue conservation, correlated mutations and predicted secondary structures. With this approach, prediction is improved as compared to that of a random predictor by a factor >6, previously obtained with neural networks using as input only evolutionary information (Fariselli and Casadio, 1999Go). Our results are even more important if we consider that the score is independent of the information on sequence separation, making the predictor of a more general use than before.

At the CASP3 competition it has been shown that the information on sequence separation also strongly affected the prediction of distances between residues made by other neural network-based predictors (Lund et al., 1997Go).

A substantial improvement of the prediction efficiency is obtained by filtering the network outputs using the filtering procedure previously introduced by Olmea and Valencia (Olmea and Valencia, 1997Go) and based on the residue-coordination numbers. In this case, the neural network-based predictor increases its accuracy up to 0.25 for all proteins, with a deviation from a random predictor of a factor of 8. Even though this level of accuracy is still not sufficient for folding a protein, we believe that this is a tangible step forward in predicting contact maps of proteins.



View larger version (74K):
[in this window]
[in a new window]
 
Fig. 6. Localization of the predicted contacts of CASP3 target T0077 (ribosomal protein L30) with FNETCSS. The prediction accuracy for this protein is A = 0.43, R = 8.8, Xd = 15.7. Side chain details are as in Figure 5Go

 

    Notes
 
3 To whom correspondence should be addressed. E-mail: casadio{at}alma.unibo.it Back


    Acknowledgments
 
Financial support to this work was provided to R.C. by a grant from the Ministero della Università e della Ricerca Scientifica e Tecnologica (MURST) to the project `Structural, Functional and Applicative Prospects of Proteins from Thermophiles' and by a grant for a target project in biotechnology from the Italian Centro Nazionale delle Ricerche (CNR). We thank the Italian Ministero della Università e della Ricerca Scientifica e Tecnologica and the Spanish Minister of Research for supporting the joint collaboration between Italy and Spain.


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Conclusions
 References
 
Anfinsen,C.B. (1973) Science, 181, 223–230.[ISI][Medline]

Bohr,H., Bohr,J. Brunak,S. Cotterill,R. Fredholm,H. Lautrup,B. and Petersen,S.B. (1990) FEBS Lett., 261, 43–46.[ISI]

Bohr,J., Bohr,H., Brunak,S., Cotterill,R.M.J., Fredholm,H., Lautrup,B. and Petersen,S.B. (1993) J. Mol. Biol., 231, 861–869.[ISI][Medline]

Compiani,M., Fariselli,P., Martelli,P. and Casadio,R. (1998) Proc. Natl Acad. Sci. USA, 95, 9290–9294.[Abstract/Free Full Text]

Eisenhaber,F., Persson,B. and Argos,P. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 1–94.[Abstract]

Fariselli,P. and Casadio,R. (1998) In Proceedings of World Multiconference on Systemics, Cybernetics and Informatics (SCI'98, Orlando USA), Vol. 1, pp. 527–533.

Fariselli,P. and Casadio,R. (1999) Protein Eng., 12, 15–21.[Abstract/Free Full Text]

Fariselli P, Compiani M. and Casadio R. (1993) Eur. Biophys. J., 22, 41–51.[ISI][Medline]

Kuntz,I.D., Thomason,J.F. and Oshiro,C.M. (1989) Methods Enzymol., 177, 159–205.[ISI][Medline]

Govindarajan,S. and Goldstein,R. (1998) Proc. Natl Acad. Sci. USA, 95, 5545–5549.[Abstract/Free Full Text]

Göbel,U., Sander,C., Schneider,R. and Valencia,A. (1994) Proteins, 18, 309–317.[ISI][Medline]

Godzik,A., Skolnick, J., and Kolinski,A. (1992) J. Mol. Biol., 227, 227–238.[ISI][Medline]

Gorodkin,J., Lund,O. Andersen,C.A. and Brunak,S. (1999) Proc. Int. Conf. Intell. Sys., 6, 95–105.

Grossman,T., Farber,R. and Lapedes,A. (1995) Mol. Biol., 3, 154–161.

Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522–524.[Abstract/Free Full Text]

Huang,E.S., Subbiah,S. and Levitt,M. (1995) J. Mol. Biol., 252, 709–720.[ISI][Medline]

Jacoboni,I., Martelli,P.L., Fariselli,P., Compiani,M. and Casadio,R. (2000) Proteins, 41,535–544

Jones,D.T. (1999) J. Mol. Biol., 292, 195–202.[ISI][Medline]

Lund,O., Frimand,K., Gorodkin,J., Bohr,H., Bohr,J., Hansen,J. and Brunak,S. (1997) Protein Eng., 10, 1241–1248.[Abstract]

Maiorov,V.N. and Crippen,G.M. (1992) J. Mol. Biol., 227, 876–888.[ISI][Medline]

McLachlan,A. (1971) J. Mol. Biol., 61, 409–424.[ISI][Medline]

Miyazawa,S. and Jernigan,R.L. (1999) Proteins, 36, 357–369.[ISI][Medline]

Mirny,L. and Domany,E. (1996) Proteins, 26, 391–410.[ISI][Medline]

Mirny,L.A. and Shaknovich,E.I. (1996) J. Mol. Biol., 264, 1164–1179.[ISI][Medline]

Orengo,C.A., Bray,J.E., Hubbard,T., LoConte,L. and Sillitoe,I. (1999) Proteins, 37, 149–170.[Medline]

Olmea,O. and Valencia,A. (1997) Fold. Des., 2, S25–S32.[ISI][Medline]

Pazos,F., Helmer Citterich,M., Ausiello,G. and Valencia,A. (1997) J. Mol. Biol., 271, 511–523.[ISI][Medline]

Rost,B. and Sander,C. (1995) Proteins, 3, 295–300.

Rost,B., Fariselli,P. and Casadio,R. (1996) Protein Sci., 5, 1704–1718.[Abstract/Free Full Text]

Sander,C. and Schneider,R. (1991) Proteins, 9, 56–68.[ISI][Medline]

Sander,C. and Schneider,R. (1993) Nucleic Acids Res., 21, 3105–3109.[ISI][Medline]

Selbig,J. and Argos,P. (1998) Proteins, 31, 172–185.[Medline]

Shindyalov,I.,N., Kolchanov,N.A. and Sander,C. (1994) Protein Eng., 7, 49–358.

Sippl,M.J. (1990) J. Mol. Biol., 213, 859–883.[ISI][Medline]

Taylor,W.R. and Hatrick,K. (1994) Protein Eng., 7, 341–348.[Abstract]

Thomas,D.J., Casari,G. & Sander,C. (1996) Protein Eng., 9, 941–948.[Abstract]

Vendruscolo,M., Kussell,E. and Domany,E. (1997) Fold Des., 2, 295–306.[ISI][Medline]

Zhang,C. and Sung-Hou,K. (2000) Proc. Natl Acad. Sci. USA, 97, 2550–2555.[Abstract/Free Full Text]

Received November 30, 2000; revised June 19, 2001; accepted July 10, 2001.