Istituto Superiore Sanità, TCE Laboratory, Viale Regina Elena 299, 00161, Rome and 1 Department of Biochemical Sciences, University of Rome `La Sapienza', P.le A. Moro 5, 00185 Rome, Italy
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: chimeric sequences/pattern recognition/primary structures/principal component analysis/recurrence plots
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Considering two foldable sequences of almost identical size located far apart in the sequence space, however, qualifies the problem of sequence/folding prediction as intrinsically non-linear (as well as amenable to generalization), given the huge difference between the SH3 and PG folds, and permits the use of analytical tools for comparing sequences sensitive to even fine and non-local differences in primary structures (Zbilut et al., 1998, 2000
; Giuliani et al., 2000
). We decided to investigate the sequence space generated by Blanco et al. (1999) by means of a relatively new technique, recurrence quantification analysis (RQA), as applied to the recurrence plots of hydrophobicity-coded primary structures. Recurrence plots visualize the distance matrix between epochs (rows) of the embedding matrix constructed over sequences and, consequently, the autocorrelation structure present in them at any possible scale. Since, in fact, distances are computed for all the possible pairs of epochs, the elements close to the main diagonal of the plot correspond to short-range correlations (the diagonal marks the identity in time) while long-range correlations correspond to points distant from the main diagonal (Webber and Zbilut, 1994
). Our goal was to test the ability of RQA in predicting the folding properties of polypeptide chains and in acting as a useful supplement to traditional alignment strategies in sequencefunction predictions. For this purpose, the data set used by Blanco et al. (1999) is almost ideal since, starting from one folding module only and moving away from the location of the corresponding sequence in the sequence space by successive changes would have made the sequencestructure relation an ill-posed problem, owing to the impossibility of setting a `direction' for those changes. In the present case, however, by moving in the sequence space from one fold to another, an estimate of the relative size of the folding basins becomes feasible and a direction for the imposed changes may be defined.
As a result, by tracing the exploration of the sequence space along a linear (in the sequence space) path connecting the two leading proteins via their hybrids, the intrinsically non-linear character of RQA, in terms of departure from the fractional similarity between hybrid sequences and the SH3 and PG poles, could be confirmed.
Another, somewhat unexpected, although by no means secondary, result of the present work is the discovery that the fine structure of recurrence plots even by visual inspection produces reliable clustering of different folding architectures.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
RQA was introduced by Eckmann et al. (1987) as a purely graphical technique based on the computation of a distance matrix (DM) between any possible pair-combination of rows (epochs) of an embedding matrix (Broomhead and King, 1986; Zbilut and Webber, 1992
) of a one-dimensional signal. The distance matrix is then colored, darkening the pixels located at specific (i, j) coordinates corresponding to distance values between ith and jth rows (epochs) lower than a predefined radius. The features of the distance function make the plot symmetric (DMi,j = DMj,i) and with a darkened main diagonal marking the identity line (DMi,i = 0) (see Figure 1
). The darkened (recurrent) points single out recurrences and the plot can be considered as a global picture of the autocorrelation structure of the system at hand (for details, see Webber and Zbilut, 1994; Giuliani and Manetti, 1996; Zbilut et al. 1998). Webber and Zbilut (1994) enhanced the technique by defining five non-linear numerical descriptors of recurrence plots, namely REC (% recurrence), DET (% determinism), TREND (see below), MAXL (maximum line) and ENT (informational entropy), whose calculation and meaning are thoroughly explained in their paper. Such descriptors were used in fields ranging from molecular dynamics to physiology (Giuliani and Manetti, 1996
; Faure and Korn, 1997
; Manetti et al., 1999
) and in the analysis of protein sequences (Zbilut et al., 1998
, 2000
; Giuliani et al., 2000
). Here it is sufficient to stress that recurrence plot descriptors are sensitive to the number of recurrent points in the plot (REC) and also to (i) their distribution along consecutive patches on the sequence (DET, ENT, MAXL) and (ii) how stationary their relative frequency is along the sequence (TREND).
|
Both hybrid and native sequences were coded in terms of the relative hydrophobic character (logarithm of the octanolwater partition coefficient) of amino acid residues. The ensuing hydrophobicity series was submitted to RQA after a four-dimensional embedding by the method of delays. The embedding procedure consists in building an n-column matrix (in our case n = 4) out of the original linear array, by shifting the series of a fixed lag. The discrete character of protein sequences makes compulsory, in the present case, the choice of lag = 1. Thus, the rows of the embedding matrix correspond to subsequent windows of length 4 along the sequence. The choice of the embedding dimension was dictated by a balance between the need to have a window large enough to keep track of between-residue interactions and on relying, at the same time, on a sufficiently long series (Broomhead and King, 1986; Zbilut and Webber, 1992
). Note that the last four values are eliminated from the analysis as an obvious consequence of shifting the series in the embedding procedure. In any case, different choices of the embedding dimension (from three to six) gave consistent results (Giuliani et al., 2000
). Moreover, the four-residue window was demonstrated (Strait and Dewey, 1996
), through the application of a formalism derived from information theory, to constitute an upper limit for the information content of protein sequences.
Data analysis strategy
Protein sequences have been considered as statistical units defined in different multivariate spaces, namely (i) the space of hydrophobicity patterns along the chain as described by RQA (RQ-space), (ii) the space of the folding information (FOLD-space) and (iii) the sequence-alignment space (DIST-space), expressed as relative superposition with the leading sequences, according to the criterion suggested by Blanco et al. (1999). The results of the present work stem from the correlations between the above spaces.
These spaces allow for the representation of the same statistical units (PG, SH3 proteins plus chimeric sequences) as points in different multivariate frames, each corresponding to a specific portion of the available information (see Table I). More specifically: RQ-space includes as axes the variables corresponding to columns 29 of Table I
, DIST-space columns 1013 and FOLD-space columns 1418.
|
The folding behavior was quantitated by the following qualitative and semi-quantitative variables in the FOLD-space:
The RQ-space was based on the RQA quantitative descriptors (see the previous paragraph). Another derived space (RP-space) was computed to assess the congruence between the quantitative description given by RQA and the holistic graphic appearance of the plots. The RP-space was generated from a consensus score computed from the general resemblance between plots as judged by a panel of unbiased observers (see Appendix for details).
To maximize the information content in each space and to overcome the distortion induced by within-set correlations, each data set was previously filtered by a principal component analysis (PCA). Principal components (Lebart et al., 1984) correspond to the optimal (in a least-squares sense) parameterization of a given multivariate set. The components are mutually orthogonal, each pointing to an `autonomous' aspect of the data set, and are reckoned by the algorithm in decreasing order of explained variance (%EXPLVAR). Thus, between-sets correlations were computed as correlations between the corresponding principal component spaces by means of both linear (Pearson r, multiple linear regression) and non-linear (k-nearest neighbors) methods. The k-nearest-neighbors discriminant analysis is a non-linear pattern recognition procedure aiming at classifying a given statistical unit on the basis of the class assigned to its k nearest units in a given metric space (Caver and Hart, 1967
). In our case k was set to 3 and the problem at hand was the assignment of sequences to the correct folding NMR class (NODISP/BROAD/DISP) on the basis of their position in the space spanned by the first two RQ components (PCRQ1, PCRQ2). The nearest-neighbors algorithm assigned each unit to the class corresponding to the `majority vote' (weighted for the actual Euclidean distance in the PCRQ1, PCRQ2 plane) on the classes of the three nearest units.
![]() |
Results and discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Table I contains the overall set of data from which the subsets corresponding to the RQA-, DIST- and FOLD-spaces have been extracted. Each subset was independently analyzed and summarized by PCA.
Table II refers to the RQ-space in terms of component loadings, i.e. of correlation coefficients between the original variables and the components derived from them. A five-component solution saves almost completely the original RQ information (explained variance = 97%). PCRQ1 is by far the most important component (explained variance = 70%) and is an inverse estimate of the hydrophobic profile variability, as can be inferred by the inspection of the component loadings. PCRQ2 is driven by the TREND variable and accounts for 18% of total variance: it can be considered a measure of the `ruggedness' of the hydrophobic plots. The third component (PCRQ3) is mainly linked to the length of patches of homogeneous hydrophobic behavior in the plots and accounts for 6% of total variance, while the minor components are difficult to explain, being linked to subtleties of the RQ patterns.
|
|
|
|
Modeling the DIST-space through a multiple linear regression model having PCDIST1 as the dependent variable and PCRQ1-5 as regressors reveals a statistically significant correlation between the two spaces (R2 = 0.714; r = 0.845, p < 0.001). This points to the general consistency between sequence alignment and RQA descriptions. Although the correlation between PCDIST2 and the RQ-space was only marginally significant (r = 0.606, p < 0.05), it is worth stressing that RQA is able to recognize a relatively minor (in terms of explained variance) feature of the global sequence, as the presence of the SH3 hydrophobic core (length = eight amino acids).
Figure 3 reports the combined PCDIST1-PCRQ1 space (r = 0.629, p < 0.005) and allows us to visualize the non-linearity introduced into the sequence space by RQA: SH3 and PG lie (as a natural consequence of the Blanco et al. strategy) at the extremes of the sequence alignment axis (PCDIST1), but they are not the poles of the RQ-space. The RQ-space, in fact, only partially follows the alignment `order parameter' and very large deviations from the SH3
PG line are evident (see, for example, the m5m8m9m15 cluster). These deviations indicate some new hydrophobic patterns stemming from the simple addition of ready-made elements and point to the possibility of completely new structures emerging from recombination processes during evolution.
|
As stated above, the presence of two opposite poles (SH3 and PG) prevented any direct linear correlation approach as for folding prediction. Thus, to correlate the folding class probability estimated by the non-linear model on the recurrence space with the components of the folding space, we adopted a non-linear strategy (k-nearest-neighbors) followed by a subsequent linear validation step.
Figure 4 reports the space of the first two RQ components together with the indication of the NMR pattern of the statistical units. One can see a clear tendency for aggregation of sequences characterized by the same NMR pattern. To check whether the NMR pattern location (and thus the folding behavior) was derivable by the relative location in the recurrence space with reliability higher than expected by chance, a k-nearest-neighbors discriminant analysis was carried out. In other words, we looked for the existence of a significant correlation between folding behavior (as expressed by NMR pattern) and recurrence quantification of hydrophobic profiles. The sequence classification matrix reported in Table V
shows that this was actually the case: the global accuracy was around 75% (six misclassifications, as marked by asterisks in Table V
, out of 26 NMR-characterized structures), corresponding to a Fisher exact test probability equal to 0.0024 for the classification matrix being due to chance alone. The following features of the classification matrix are particularly worth noting:
|
|
It should be remembered that our prediction exercise has only been expressed in statistical terms and its main goal is to test the linkage (if any) between the non-linear description of sequences provided by RQA and the folding behavior. We cannot directly verify by appropriate experiments our predictions, but leaving, however, as a commitment, such a proof to independent and objective external validation.
Unique features of RQA descriptors
From our viewpoint it is important to check the independent character of the folding relevant information provided by RQA with respect to that obtained from standard sequence analysis. In other words, we want to check whether the RQA approach goes beyond the simplistic assumption that a sequence more similar to a real protein has a higher probability of folding than a less similar one. In order to prove the point, we checked the stability of the correlation between the probability of `not being a protein' as estimated by RQA (NODISP variable in Table V) and the global amount of 3D structuring as measured by the first component of the FOLD-space (PCFOLD1), when the possible confounding effect of sequence similarities (as expressed by PCDIST1) is partialled out. This procedure is equivalent to checking the invariance of the correlation coefficient (Pearson r) between NODISP and PCFOLD1, after elimination of the linear influence of PCDIST1. This was actually the case, with the r coefficient between NODISP and PCFOLD1 scoring values of 0.608 (p < 0.001) and 0.620 (p < 0.001) before and after such elimination, respectively. We prefer to use the probability of `not being a protein' (NODISP) instead of the direct index (DISP), in order to rule out possible ambiguities linked to the intermediate class (BROAD).
Visual screening of recurrence plots
In brief, a metric was built between pairs of RPs in terms of the number of observers assigning the two elements of the pair to different classes (see the Appendix for details). The ensuing distance matrix between plots was submitted to a principal component analysis in order to derive an explicit quantitative space (RP-space), where visual similarities could be expressed. The RP-space was exhaustively described by a six-component solution (96% of explained variance); in the subsequent step of `between-sets' correlations, however, we limited ourselves to a two-component solution (PCRP1, PCRP2) globally explaining 56% of the total variability. This choice was made in order to retain only the most `robust' (consensual) portion of the visual judgment information. Table VI contains the RP-space component loadings: it is important to note how PCRP1 is a measure of the SH3 hydrophobicity pattern, as evident from the extremely high loadings of SH3 and nearby (in terms of superposition to SH3, namely SH3P, see Table I
) chimeric sequences m2, m14, etc., on the component. PCRP2 points to a new pattern (marked by m27, m29, m30, m31 hybrids; see the Appendix, Figure 5
) only partially overlapping the PG pattern (PG loading = 0.541) they were supposed to mimic in terms of sequence overlapping (see Table I
). PCRP3 points to another new pattern (m5, m8, m9, m11, m15; see also Figure 5
) while the PG pattern is only represented by PCRP4. The relative isolation of PG, despite the high sequence overlapping with a number of structures (from m27 to m31, see Table I
) is another important point worth noting.
|
|
An immediate perception of qualitatively different patterns of hydrophobicity distribution emerges from simple row-wise consideration of the ordering of recurrence plots provided by the first component of the RP-space, as displayed in Figure 5.
Conclusion
RQA of primary structures showed itself to be useful in several cases: in a site-directed mutagenesis study, to discriminate between function-retaining and non-functional mutants of ß-lactamase (Zbilut et al. 1998); in predicting the thermophilic/mesophilic character of rubredoxins (Giuliani et al., 2000
) and, more recently, in revealing some peculiar hydrophobicity patterns in prions (Zbilut et al., 2000
). In the present work, the significant correlation stemming between the RQA-induced non-linearity over the sequence space of chimeric structures and their folding properties shows the potential use of RQA in supplementing alignment methods to explore sequence databases.
As in the above-mentioned cases of rubredoxins and prions, an RQA-based quantification scheme of primary structures may reveal crucial information in solving specific structural problems and, by a widely used psychometric approach (Nosofsky, 1986; Shepard, 1987
), we showed that such information can be reliably appreciated even by visual inspection.
Our proposal is to use RQA and recurrence plots to optimize the process of engineering new proteins, given their relevance for protein folding/activity prediction, in addition to their relative independence from sequence similarity methods.
Recurrence plots can introduce in protein engineering a working style analogous to that adopted by medicinal chemists in `rational drug design' (Martin, 1981): before initiating a synthesis campaign to find the most potent (or less toxic) analogues of a lead drug, the proposed structures are designed in silico and labeled by a wide array of chemico-physical and/or topological descriptors. This procedure generates a multi-dimensional space in which the set of proposed structures is defined as a data cloud. Candidates for real synthesis and pharmacological testing are selected from this cloud by purely heuristic strategies, such as maximizing the diversity (and thus the probability of finding a potent analogue) within a selected subset of structures or, in a more advanced step, maximizing the similarity to an already found potent analogue (Martin, 1981
; Willett, 1997
). This rationale of drug design was demonstrated to outperform by orders of magnitude random searching huge libraries of organic molecules in order to find potential new drug candidates (Willett, 1997
).
As for the original problem set by Blanco et al. (1999), our analysis suggests the possibility that new folds can be generated from already existing ones not through a continuous path but through a number of sequential and discrete `successful' events giving rise to new folds from relatively minor sequence modification and/or gene duplication processes.
![]() |
Appendix |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Nineteen independent observers, not aware of the nature of the recurrence plots to be analyzed, were asked to classify them on the basis of the relative similarity in their visual appearance without any constraint on the number and type of classes. Defining the relative similarity of each possible pair of plots could generate a straightforward metrics by the number of judgements assigning them to the same class minus the number of judgements assigning them to different classes (Nosofsky, 1986; Shepard, 1987
). The consistency of this space was assessed by the recognition of a very high between-judgement concordance (Pearson r correlation coefficients ranging around 0.90) with the only noticeable exception of an `aberrant' judgement (r
0.50 with the others). In order to help any reader in deriving his/her own visual impression, the recurrence plots relative to all the examined structures are reported in Figure 5
, in which RPs are ordered along the first component (PCRP1), so that the meaning of the whole procedure can be easily verified.
![]() |
Notes |
---|
![]() |
Acknowledgments |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Broomhead,D.S. and King,G.P. (1986) Physica D, 20, 217236.[ISI]
Caver,T.M. and Hart,P.E. (1967) IEEE Trans. Inf. Theory, IT-13, 2127.
Eckmann,J.P., Kamphorst,S.O. and Ruelle,D. (1987) Europhys. Lett., 4, 324327.
Faure,P. and Korn,H. (1997) Proc. Natl Acad. Sci. USA, 94, 65066511.
Giuliani,A. and Manetti,C. (1996) Phys. Rev. E, 53, 63366340.[ISI]
Giuliani,A., Benigni,R., Sirabella P., Zbilut,J.P. and Colosimo,A. (2000) Biophys. J., 78, 136148.
Lebart,L., Morineau,A. and Warwick,K.M. (1984) Multivariate Descriptive Statistical Analysis. Wiley, New York.
Manetti,C., Ceruso,M.A., Giuliani,A., Webber,C.L.,Jr and Zbilut,J.P. (1999) Phys. Rev. E, 59, 992998.[ISI]
Martin,Y. (1981) J. Med. Chem., 24, 229237.[ISI][Medline]
Nosofsky,R.M. (1986) J. Exp. Psychol. Gen. 115, 3961.[ISI][Medline]
Shepard,R.N. (1987) Science, 237, 13171323.[ISI][Medline]
Strait,B.J. and Dewey,T.G. (1996) Biophys. J., 71, 148155.[Abstract]
Webber,C.L. and Zbilut,J.P. (1994) J. Appl. Physiol., 76, 965973.
Willett,P. (1997) Computational Methods for the Analysis of Molecular Diversity: Perspectives in Drug Discovery and Design, vol. 7/8. Elsevier, New York.
Zbilut,J.P. and Webber,C.L.,Jr (1992) Phys. Lett. A, 171, 199203.[ISI]
Zbilut,J.P., Giuliani,A., Webber,C.L.,Jr and Colosimo,A. (1998) Protein Eng., 11, 8793.[Abstract]
Zbilut,J.P., Webber,C.L.,Jr, Colosimo,A. and Giuliani,A. (2000) Protein Eng., 13, 99104.
Received June 20, 2000; revised August 2, 2000; accepted August 28, 2000.