Mapping protein sequence spaces by recurrence quantification analysis: a case study on chimeric structures

Alessandro Giuliani, Paolo Sirabella1, Romualdo Benigni and Alfredo Colosimo1,2

Istituto Superiore Sanità, TCE Laboratory, Viale Regina Elena 299, 00161, Rome and 1 Department of Biochemical Sciences, University of Rome `La Sapienza', P.le A. Moro 5, 00185 Rome, Italy


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 Appendix
 References
 
Recurrence quantification analysis (RQA) was used to characterize the folding properties of 22 chimeric sequences derived from two parent proteins of similar length but different three-dimensional arrangement. A non-linear relation between sequence data and their RQA representation was revealed, which points to new information carried by this method as compared with classical best-alignment methods. This new information is significantly correlated with the folding properties of the hybrid polypeptide chains, as substantiated by careful statistical analysis of the recurrence plots' numerical descriptors, thus encouraging their systematic use to complement sequence data in both proteomics and protein engineering tasks. Even the direct visual screening of the qualitative graphical features of recurrence plots is shown to provide useful hints to discriminate between different recurrence structures of protein sequences.

Keywords: chimeric sequences/pattern recognition/primary structures/principal component analysis/recurrence plots


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 Appendix
 References
 
In a recent paper Blanco et al. (1999) addressed the point of the existence and relative extension of `basins of attraction' made of efficiently foldable sequences surrounding proteins in the sequence space. They investigated the possibility of going from a given `folding-optimized' sequence to a completely new fold through a route of folded intermediate sequences, which is a crucial point in order to make hypotheses on the basic mechanisms of biological evolution. The problem was approached starting from two very different folds, the {alpha}-spectrin SH3 domain (PDB code = 1shg) and the protein G B1 domain (PDB code = 1pgb), and then constructing a number of chimeric sequences made of different combinations of the two leading proteins. Having examined the foldability of the hybrid sequences by different experimental methods (circular dichroism, temperature dependence, NMR), the authors concluded that a continuous route of foldable structures between the two poles does not exist and cast doubt on (i) the existence of a continuous, purely additive (point mutation-by-point mutation) pathway changing a given fold into another and (ii) the plausibility of generating artificial proteins from the rearrangement of patches of genes corresponding to natural proteins.

Considering two foldable sequences of almost identical size located far apart in the sequence space, however, qualifies the problem of sequence/folding prediction as intrinsically non-linear (as well as amenable to generalization), given the huge difference between the SH3 and PG folds, and permits the use of analytical tools for comparing sequences sensitive to even fine and non-local differences in primary structures (Zbilut et al., 1998Go, 2000Go; Giuliani et al., 2000Go). We decided to investigate the sequence space generated by Blanco et al. (1999) by means of a relatively new technique, recurrence quantification analysis (RQA), as applied to the recurrence plots of hydrophobicity-coded primary structures. Recurrence plots visualize the distance matrix between epochs (rows) of the embedding matrix constructed over sequences and, consequently, the autocorrelation structure present in them at any possible scale. Since, in fact, distances are computed for all the possible pairs of epochs, the elements close to the main diagonal of the plot correspond to short-range correlations (the diagonal marks the identity in time) while long-range correlations correspond to points distant from the main diagonal (Webber and Zbilut, 1994Go). Our goal was to test the ability of RQA in predicting the folding properties of polypeptide chains and in acting as a useful supplement to traditional alignment strategies in sequence–function predictions. For this purpose, the data set used by Blanco et al. (1999) is almost ideal since, starting from one folding module only and moving away from the location of the corresponding sequence in the sequence space by successive changes would have made the sequence–structure relation an ill-posed problem, owing to the impossibility of setting a `direction' for those changes. In the present case, however, by moving in the sequence space from one fold to another, an estimate of the relative size of the folding basins becomes feasible and a direction for the imposed changes may be defined.

As a result, by tracing the exploration of the sequence space along a linear (in the sequence space) path connecting the two leading proteins via their hybrids, the intrinsically non-linear character of RQA, in terms of departure from the fractional similarity between hybrid sequences and the SH3 and PG poles, could be confirmed.

Another, somewhat unexpected, although by no means secondary, result of the present work is the discovery that the fine structure of recurrence plots even by visual inspection produces reliable clustering of different folding architectures.


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 Appendix
 References
 
Recurrence quantification analysis (RQA)

RQA was introduced by Eckmann et al. (1987) as a purely graphical technique based on the computation of a distance matrix (DM) between any possible pair-combination of rows (epochs) of an embedding matrix (Broomhead and King, 1986Go; Zbilut and Webber, 1992Go) of a one-dimensional signal. The distance matrix is then colored, darkening the pixels located at specific (i, j) coordinates corresponding to distance values between ith and jth rows (epochs) lower than a predefined radius. The features of the distance function make the plot symmetric (DMi,j = DMj,i) and with a darkened main diagonal marking the identity line (DMi,i = 0) (see Figure 1Go). The darkened (recurrent) points single out recurrences and the plot can be considered as a global picture of the autocorrelation structure of the system at hand (for details, see Webber and Zbilut, 1994; Giuliani and Manetti, 1996; Zbilut et al. 1998). Webber and Zbilut (1994) enhanced the technique by defining five non-linear numerical descriptors of recurrence plots, namely REC (% recurrence), DET (% determinism), TREND (see below), MAXL (maximum line) and ENT (informational entropy), whose calculation and meaning are thoroughly explained in their paper. Such descriptors were used in fields ranging from molecular dynamics to physiology (Giuliani and Manetti, 1996Go; Faure and Korn, 1997Go; Manetti et al., 1999Go) and in the analysis of protein sequences (Zbilut et al., 1998Go, 2000Go; Giuliani et al., 2000Go). Here it is sufficient to stress that recurrence plot descriptors are sensitive to the number of recurrent points in the plot (REC) and also to (i) their distribution along consecutive patches on the sequence (DET, ENT, MAXL) and (ii) how stationary their relative frequency is along the sequence (TREND).



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 1. Recurrence plots of (A) SH3 and (B) PG sequences. In each plot, both axes give the residue number, the dots mark the recurrences (similarity in hydrophobicity patterns) within four-residue windows (starting from the marked position) and the main diagonal corresponds to the amino acid identity.

 
RQA on the hydrophobicity profile of protein sequences

Both hybrid and native sequences were coded in terms of the relative hydrophobic character (logarithm of the octanol–water partition coefficient) of amino acid residues. The ensuing hydrophobicity series was submitted to RQA after a four-dimensional embedding by the method of delays. The embedding procedure consists in building an n-column matrix (in our case n = 4) out of the original linear array, by shifting the series of a fixed lag. The discrete character of protein sequences makes compulsory, in the present case, the choice of lag = 1. Thus, the rows of the embedding matrix correspond to subsequent windows of length 4 along the sequence. The choice of the embedding dimension was dictated by a balance between the need to have a window large enough to keep track of between-residue interactions and on relying, at the same time, on a sufficiently long series (Broomhead and King, 1986Go; Zbilut and Webber, 1992Go). Note that the last four values are eliminated from the analysis as an obvious consequence of shifting the series in the embedding procedure. In any case, different choices of the embedding dimension (from three to six) gave consistent results (Giuliani et al., 2000Go). Moreover, the four-residue window was demonstrated (Strait and Dewey, 1996Go), through the application of a formalism derived from information theory, to constitute an upper limit for the information content of protein sequences.

Data analysis strategy

Protein sequences have been considered as statistical units defined in different multivariate spaces, namely (i) the space of hydrophobicity patterns along the chain as described by RQA (RQ-space), (ii) the space of the folding information (FOLD-space) and (iii) the sequence-alignment space (DIST-space), expressed as relative superposition with the leading sequences, according to the criterion suggested by Blanco et al. (1999). The results of the present work stem from the correlations between the above spaces.

These spaces allow for the representation of the same statistical units (PG, SH3 proteins plus chimeric sequences) as points in different multivariate frames, each corresponding to a specific portion of the available information (see Table IGo). More specifically: RQ-space includes as axes the variables corresponding to columns 2–9 of Table IGo, DIST-space columns 10–13 and FOLD-space columns 14–18.


View this table:
[in this window]
[in a new window]
 
Table I. Raw data set used in this work
 
The sequence-alignment space was defined by means of the same variables as used by Blanco et al. (1999): SH3P = percentage of superposition with SH3; PGP = percentage of superposition with PG; aacSH3 = number of residues of SH3 core; aacPG = number of residues of PG core.

The folding behavior was quantitated by the following qualitative and semi-quantitative variables in the FOLD-space:

The RQ-space was based on the RQA quantitative descriptors (see the previous paragraph). Another derived space (RP-space) was computed to assess the congruence between the quantitative description given by RQA and the holistic graphic appearance of the plots. The RP-space was generated from a consensus score computed from the general resemblance between plots as judged by a panel of unbiased observers (see Appendix for details).

To maximize the information content in each space and to overcome the distortion induced by within-set correlations, each data set was previously filtered by a principal component analysis (PCA). Principal components (Lebart et al., 1984Go) correspond to the optimal (in a least-squares sense) parameterization of a given multivariate set. The components are mutually orthogonal, each pointing to an `autonomous' aspect of the data set, and are reckoned by the algorithm in decreasing order of explained variance (%EXPLVAR). Thus, between-sets correlations were computed as correlations between the corresponding principal component spaces by means of both linear (Pearson r, multiple linear regression) and non-linear (k-nearest neighbors) methods. The k-nearest-neighbors discriminant analysis is a non-linear pattern recognition procedure aiming at classifying a given statistical unit on the basis of the class assigned to its k nearest units in a given metric space (Caver and Hart, 1967Go). In our case k was set to 3 and the problem at hand was the assignment of sequences to the correct folding NMR class (NODISP/BROAD/DISP) on the basis of their position in the space spanned by the first two RQ components (PCRQ1, PCRQ2). The nearest-neighbors algorithm assigned each unit to the class corresponding to the `majority vote' (weighted for the actual Euclidean distance in the PCRQ1, PCRQ2 plane) on the classes of the three nearest units.


    Results and discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 Appendix
 References
 
PCA description of the single data sets

Table IGo contains the overall set of data from which the subsets corresponding to the RQA-, DIST- and FOLD-spaces have been extracted. Each subset was independently analyzed and summarized by PCA.

Table IIGo refers to the RQ-space in terms of component loadings, i.e. of correlation coefficients between the original variables and the components derived from them. A five-component solution saves almost completely the original RQ information (explained variance = 97%). PCRQ1 is by far the most important component (explained variance = 70%) and is an inverse estimate of the hydrophobic profile variability, as can be inferred by the inspection of the component loadings. PCRQ2 is driven by the TREND variable and accounts for 18% of total variance: it can be considered a measure of the `ruggedness' of the hydrophobic plots. The third component (PCRQ3) is mainly linked to the length of patches of homogeneous hydrophobic behavior in the plots and accounts for 6% of total variance, while the minor components are difficult to explain, being linked to subtleties of the RQ patterns.


View this table:
[in this window]
[in a new window]
 
Table II. Factor loadings matrix for the RQ-space
 
The principal components generated from the DIST-space are reported in Table IIIGo. A two-component solution explains 98% of the total variance. PCDIST1 corresponds to the line connecting SH3 and PG poles in the sequence space (Pearson r = –0.99 and 0.99 between PCDIST1 and the sequence superposition with SH3 and PG, respectively). This structure mirrors the strategy followed by Blanco et al. (1999): it is interesting that whereas aacPG scales with the general superposition of the hybrid sequences with PG, the number of SH3 core residues (aacSH3) is partially independent of the general superposition with SH3, giving rise to an independent component (PCRQ2) which explains 18% of total variance. This is due to the fact that the SH3 hydrophobic core is maintained even in sequences distant from SH3 (Table IGo). From the graphical representation of the first two components of the sequence space (Figure 2Go), the general sense of the Blanco et al. (1999) strategy can be appreciated: the PG `isolation' derives from the absence of the SH3 core (low values of PCDIST2) present in the other sequences. Moreover, the presence of SH3 core limits the possible extension on PCDIST1 and, consequently, the linear path starting from SH3 turns down (Figure 2Go).


View this table:
[in this window]
[in a new window]
 
Table III. Factor loadings matrix for the DIST-space
 


View larger version (8K):
[in this window]
[in a new window]
 
Fig. 2. The DIST-space spanned by the first two principal components. The relative position of hybrid and parent sequences are shown in the PCDIST1, PCDIST2 plane (see the text for further explanations).

 
The FOLD component space is reported in Table IVGo. A four-component solution retains 98% of the total information. PCFOLD1 represents a general score of the `amount of three-dimensional structuring' of the analyzed sequences, being positively correlated with all the `structure indicators' and negatively linked to the amount of random coil (RC). The appearance of such a clear component is an important proof of both reliability and consistency of Blanco et al.'s (1999) experimental results and of the quantification scheme adopted by us. Further, our results indicate a self-consistent structural description of proteins, with the only relevant exception of the `Helix' descriptor which appears largely independent of the others, since it heavily loads on PCFOLD2 (r = 0.95). Thus, PCFOLD2 appears as a `global helicity score' independent of the general 3D structuring of proteins. PCFOLD3 and PCFOLD4 account for minor subtleties of the data set, corresponding to the peculiarities of different analytical methods, and can be interpreted as `coil specific' and `NMR specific', respectively.


View this table:
[in this window]
[in a new window]
 
Table IV. Factor loadings matrix for the FOLD-space
 
Between-set correlations

Modeling the DIST-space through a multiple linear regression model having PCDIST1 as the dependent variable and PCRQ1-5 as regressors reveals a statistically significant correlation between the two spaces (R2 = 0.714; r = 0.845, p < 0.001). This points to the general consistency between sequence alignment and RQA descriptions. Although the correlation between PCDIST2 and the RQ-space was only marginally significant (r = 0.606, p < 0.05), it is worth stressing that RQA is able to recognize a relatively minor (in terms of explained variance) feature of the global sequence, as the presence of the SH3 hydrophobic core (length = eight amino acids).

Figure 3Go reports the combined PCDIST1-PCRQ1 space (r = 0.629, p < 0.005) and allows us to visualize the non-linearity introduced into the sequence space by RQA: SH3 and PG lie (as a natural consequence of the Blanco et al. strategy) at the extremes of the sequence alignment axis (PCDIST1), but they are not the poles of the RQ-space. The RQ-space, in fact, only partially follows the alignment `order parameter' and very large deviations from the SH3{leftrightarrow}PG line are evident (see, for example, the m5–m8–m9–m15 cluster). These deviations indicate some new hydrophobic patterns stemming from the simple addition of ready-made elements and point to the possibility of completely new structures emerging from recombination processes during evolution.



View larger version (9K):
[in this window]
[in a new window]
 
Fig. 3. Relation between DIST- and RQ-spaces. The non-linear relation between sequence and RQA parameter information is visualized in the PCRQ1, PCDIST1 plane. Only the parent proteins' location is labeled (see the text for further explanation).

 
Fold prediction by means of RQA information on sequences

As stated above, the presence of two opposite poles (SH3 and PG) prevented any direct linear correlation approach as for folding prediction. Thus, to correlate the folding class probability estimated by the non-linear model on the recurrence space with the components of the folding space, we adopted a non-linear strategy (k-nearest-neighbors) followed by a subsequent linear validation step.

Figure 4Go reports the space of the first two RQ components together with the indication of the NMR pattern of the statistical units. One can see a clear tendency for aggregation of sequences characterized by the same NMR pattern. To check whether the NMR pattern location (and thus the folding behavior) was derivable by the relative location in the recurrence space with reliability higher than expected by chance, a k-nearest-neighbors discriminant analysis was carried out. In other words, we looked for the existence of a significant correlation between folding behavior (as expressed by NMR pattern) and recurrence quantification of hydrophobic profiles. The sequence classification matrix reported in Table VGo shows that this was actually the case: the global accuracy was around 75% (six misclassifications, as marked by asterisks in Table VGo, out of 26 NMR-characterized structures), corresponding to a Fisher exact test probability equal to 0.0024 for the classification matrix being due to chance alone. The following features of the classification matrix are particularly worth noting:



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 4. The RQ-space spanned by the first two principal components. The parent sequences are labeled. For hybrid sequences different symbols indicate their assignment to the corresponding NMR class (see Table VGo).

 

View this table:
[in this window]
[in a new window]
 
Table V. Classification matrix from the k-nearest-neighbors discriminant analysis
 
i) The existence of limitations, due to the sensitivity of proteins to single amino acid changes: in spite of the fact that m10 differs from SH3 by two amino acids only, it does not fold. This kind of behavior is virtually unpredictable by statistical methods.
ii) The greater accuracy in predicting folded (NMR = DISP) than unfolded structures. This can have relevant implications in protein engineering since, by means of a recurrence study of theoretical sequences, the structures predicted as foldable may be selected with higher probability of finding the functionally interesting ones.
iii) The prediction concerning sequences m6, m12 and m15, for which experimental data do not exist as yet, allows for a subsequent validation test of the proposed approach.

It should be remembered that our prediction exercise has only been expressed in statistical terms and its main goal is to test the linkage (if any) between the non-linear description of sequences provided by RQA and the folding behavior. We cannot directly verify by appropriate experiments our predictions, but leaving, however, as a commitment, such a proof to independent and objective external validation.

Unique features of RQA descriptors

From our viewpoint it is important to check the independent character of the folding relevant information provided by RQA with respect to that obtained from standard sequence analysis. In other words, we want to check whether the RQA approach goes beyond the simplistic assumption that a sequence more similar to a real protein has a higher probability of folding than a less similar one. In order to prove the point, we checked the stability of the correlation between the probability of `not being a protein' as estimated by RQA (NODISP variable in Table VGo) and the global amount of 3D structuring as measured by the first component of the FOLD-space (PCFOLD1), when the possible confounding effect of sequence similarities (as expressed by PCDIST1) is partialled out. This procedure is equivalent to checking the invariance of the correlation coefficient (Pearson r) between NODISP and PCFOLD1, after elimination of the linear influence of PCDIST1. This was actually the case, with the r coefficient between NODISP and PCFOLD1 scoring values of –0.608 (p < 0.001) and –0.620 (p < 0.001) before and after such elimination, respectively. We prefer to use the probability of `not being a protein' (NODISP) instead of the direct index (DISP), in order to rule out possible ambiguities linked to the intermediate class (BROAD).

Visual screening of recurrence plots

In brief, a metric was built between pairs of RPs in terms of the number of observers assigning the two elements of the pair to different classes (see the Appendix for details). The ensuing distance matrix between plots was submitted to a principal component analysis in order to derive an explicit quantitative space (RP-space), where visual similarities could be expressed. The RP-space was exhaustively described by a six-component solution (96% of explained variance); in the subsequent step of `between-sets' correlations, however, we limited ourselves to a two-component solution (PCRP1, PCRP2) globally explaining 56% of the total variability. This choice was made in order to retain only the most `robust' (consensual) portion of the visual judgment information. Table VIGo contains the RP-space component loadings: it is important to note how PCRP1 is a measure of the SH3 hydrophobicity pattern, as evident from the extremely high loadings of SH3 and nearby (in terms of superposition to SH3, namely SH3P, see Table IGo) chimeric sequences m2, m14, etc., on the component. PCRP2 points to a new pattern (marked by m27, m29, m30, m31 hybrids; see the Appendix, Figure 5Go) only partially overlapping the PG pattern (PG loading = 0.541) they were supposed to mimic in terms of sequence overlapping (see Table IGo). PCRP3 points to another new pattern (m5, m8, m9, m11, m15; see also Figure 5Go) while the PG pattern is only represented by PCRP4. The relative isolation of PG, despite the high sequence overlapping with a number of structures (from m27 to m31, see Table IGo) is another important point worth noting.


View this table:
[in this window]
[in a new window]
 
Table VI. Factor loadings matrix for the RP-space
 



View larger version (64K):
[in this window]
[in a new window]
 
Fig. 5. The whole set of recurrence plots of hybrid and parent sequences. The plots are ordered along the main consensus line (PCRP1) in order to appreciate the classification criteria.

 
A k-nearest-neighbors procedure applied to the PCRP1–PCRP2 plane in order to predict folding behavior gives results (not shown) fully compatible with the corresponding ones obtained in the RQ-space. Finally, the significant correlation between principal components of the RP- and RQ-spaces (PCRP1–PCRQ1, r = –0.75; PCRP2–PCRQ2, r = –0.77) indicates a good match between conclusions drawn by visual inspection and quantitative descriptors of recurrence plots.

An immediate perception of qualitatively different patterns of hydrophobicity distribution emerges from simple row-wise consideration of the ordering of recurrence plots provided by the first component of the RP-space, as displayed in Figure 5Go.

Conclusion

RQA of primary structures showed itself to be useful in several cases: in a site-directed mutagenesis study, to discriminate between function-retaining and non-functional mutants of ß-lactamase (Zbilut et al. 1998Go); in predicting the thermophilic/mesophilic character of rubredoxins (Giuliani et al., 2000Go) and, more recently, in revealing some peculiar hydrophobicity patterns in prions (Zbilut et al., 2000Go). In the present work, the significant correlation stemming between the RQA-induced non-linearity over the sequence space of chimeric structures and their folding properties shows the potential use of RQA in supplementing alignment methods to explore sequence databases.

As in the above-mentioned cases of rubredoxins and prions, an RQA-based quantification scheme of primary structures may reveal crucial information in solving specific structural problems and, by a widely used psychometric approach (Nosofsky, 1986Go; Shepard, 1987Go), we showed that such information can be reliably appreciated even by visual inspection.

Our proposal is to use RQA and recurrence plots to optimize the process of engineering new proteins, given their relevance for protein folding/activity prediction, in addition to their relative independence from sequence similarity methods.

Recurrence plots can introduce in protein engineering a working style analogous to that adopted by medicinal chemists in `rational drug design' (Martin, 1981Go): before initiating a synthesis campaign to find the most potent (or less toxic) analogues of a lead drug, the proposed structures are designed in silico and labeled by a wide array of chemico-physical and/or topological descriptors. This procedure generates a multi-dimensional space in which the set of proposed structures is defined as a data cloud. Candidates for real synthesis and pharmacological testing are selected from this cloud by purely heuristic strategies, such as maximizing the diversity (and thus the probability of finding a potent analogue) within a selected subset of structures or, in a more advanced step, maximizing the similarity to an already found potent analogue (Martin, 1981Go; Willett, 1997Go). This rationale of drug design was demonstrated to outperform by orders of magnitude random searching huge libraries of organic molecules in order to find potential new drug candidates (Willett, 1997Go).

As for the original problem set by Blanco et al. (1999), our analysis suggests the possibility that new folds can be generated from already existing ones not through a continuous path but through a number of sequential and discrete `successful' events giving rise to new folds from relatively minor sequence modification and/or gene duplication processes.


    Appendix
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 Appendix
 References
 
Quantifying the visual information in recurrence plots

Nineteen independent observers, not aware of the nature of the recurrence plots to be analyzed, were asked to classify them on the basis of the relative similarity in their visual appearance without any constraint on the number and type of classes. Defining the relative similarity of each possible pair of plots could generate a straightforward metrics by the number of judgements assigning them to the same class minus the number of judgements assigning them to different classes (Nosofsky, 1986Go; Shepard, 1987Go). The consistency of this space was assessed by the recognition of a very high between-judgement concordance (Pearson r correlation coefficients ranging around 0.90) with the only noticeable exception of an `aberrant' judgement (r {approx} –0.50 with the others). In order to help any reader in deriving his/her own visual impression, the recurrence plots relative to all the examined structures are reported in Figure 5Go, in which RPs are ordered along the first component (PCRP1), so that the meaning of the whole procedure can be easily verified.


    Notes
 
2 To whom correspondence should be addressed. E-mail: colosimo{at}caspur.it Back


    Acknowledgments
 
This work was partly supported by grants from the Italian MURST (40%, 60%) to A.Colosimo.


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 Appendix
 References
 
Blanco,F.J., Angrand,I. and Serrano,L. (1999) J. Mol. Biol., 285, 741–753.[ISI][Medline]

Broomhead,D.S. and King,G.P. (1986) Physica D, 20, 217–236.[ISI]

Caver,T.M. and Hart,P.E. (1967) IEEE Trans. Inf. Theory, IT-13, 21–27.

Eckmann,J.P., Kamphorst,S.O. and Ruelle,D. (1987) Europhys. Lett., 4, 324–327.

Faure,P. and Korn,H. (1997) Proc. Natl Acad. Sci. USA, 94, 6506–6511.[Abstract/Free Full Text]

Giuliani,A. and Manetti,C. (1996) Phys. Rev. E, 53, 6336–6340.[ISI]

Giuliani,A., Benigni,R., Sirabella P., Zbilut,J.P. and Colosimo,A. (2000) Biophys. J., 78, 136–148.[Abstract/Free Full Text]

Lebart,L., Morineau,A. and Warwick,K.M. (1984) Multivariate Descriptive Statistical Analysis. Wiley, New York.

Manetti,C., Ceruso,M.A., Giuliani,A., Webber,C.L.,Jr and Zbilut,J.P. (1999) Phys. Rev. E, 59, 992–998.[ISI]

Martin,Y. (1981) J. Med. Chem., 24, 229–237.[ISI][Medline]

Nosofsky,R.M. (1986) J. Exp. Psychol. Gen. 115, 39–61.[ISI][Medline]

Shepard,R.N. (1987) Science, 237, 1317–1323.[ISI][Medline]

Strait,B.J. and Dewey,T.G. (1996) Biophys. J., 71, 148–155.[Abstract]

Webber,C.L. and Zbilut,J.P. (1994) J. Appl. Physiol., 76, 965–973.[Abstract/Free Full Text]

Willett,P. (1997) Computational Methods for the Analysis of Molecular Diversity: Perspectives in Drug Discovery and Design, vol. 7/8. Elsevier, New York.

Zbilut,J.P. and Webber,C.L.,Jr (1992) Phys. Lett. A, 171, 199–203.[ISI]

Zbilut,J.P., Giuliani,A., Webber,C.L.,Jr and Colosimo,A. (1998) Protein Eng., 11, 87–93.[Abstract]

Zbilut,J.P., Webber,C.L.,Jr, Colosimo,A. and Giuliani,A. (2000) Protein Eng., 13, 99–104.[Abstract/Free Full Text]

Received June 20, 2000; revised August 2, 2000; accepted August 28, 2000.