BioElectronics Group, Department of Electrical and Computer Systems Engineering, PO Box 35, Monash University, VIC 3800, Australia
Abstract |
---|
Keywords: heme proteins/resonance recognition model/wavelet transform
Introduction |
---|
Within a protein class, often only a few amino acid residues can be designated as 'invariable'. Substitution of such residues would destroy both biological activity and function. One consequence of the differing amino acid sequences of functionally similar proteins is their immunological diversity and species specificity. The similarities can help to identify individual amino acids crucial for biological function, protein-target interaction and structure maintenance. The structurally essential similarities can be deduced most effectively from amino acid exchange frequencies in proteins of different species. Both the local and the global similarity between sequences need to be found (Bishop and Rawlings, 1996). The similarities can be expressed as a template or a motif, the determinant of a specific structure and function.
Derivation of the three-dimensional structure from the amino acid sequence would be worthwhile, since it cannot be expected that all proteins would produce suitable crystals for X-ray analysis (Goffin et al., 1996). Thus, if a protein sequence of unknown function and unknown structure is compared with other known sequences, its functional and structural information may be revealed by their similarity pattern.
Previous approaches such as FASTA, BLAST and PROSRCH (Pearson and Lipman, 1988; Bishop and Rawlings, 1996) are based mainly on sequence comparison and alignment. In those approaches, similarity (a sequence similarity) simply means the number of identical amino acid pairs shared by the query sequence and the subject sequence.
However, two protein sequences with low sequence identity may show similarities in their physicochemical properties, tertiary structure, resonance recognition model (RRM) spectra and biological functions (Lesk, 1988; Cosic, 1994, 1997). The similarity concept can therefore be enriched by incorporating the notion of similarity in these other contexts.
The RRM multiple cross-spectral function can be regarded as a measure of the similarity among different protein sequences in the frequency domain when each protein sequence is treated as a numerical series (Cosic, 1994, 1997). The most prominent peak frequencies show the spectral similarity of the protein sequences. Furthermore, the similarity can be either a local similarity or a long-range similarity, i.e. the overall sequence similarity. With traditional sequence comparison approaches, finding the local similarity is relatively easy but finding the global similarity is difficult (Bishop and Rawlings, 1987). The significance of the similarity is also hard to assess with those approaches. The spectral similarity determined by the RRM is a global similarity because the spectrum contains a contribution from every amino acid in the sequence.
Another analytical approach is the wavelet transform (WT) representation, a signal processing method that is efficient for multi-resolution analysis and local feature extraction (Daubechies, 1988, 1992). If the WT is applied to a protein sequence, the similarity can be measured at different resolution scales based on a space-scale analysis. This sequence-scale similarity may reveal more information than conventional methods.
Materials and methods |
---|
The RRM model
The RRM (Cosic and Nesic, 1988; Cosic et al., 1989, 1991; Cosic, 1990, 1994, 1995, 1996, 1997; Cosic and Hearn, 1991) is a physical and mathematical model which interprets the linear information in a protein sequence using signal analysis methods. It comprises two stages. The first involves the transformation of the amino acid sequence into a numerical sequence. Each amino acid is represented by the value of the electron-ion interaction potential (EIIP) (Veljkovic and Slavic, 1972; Pirogova and Cosic, 1999), which describes the average energy states of all valence electrons in particular amino acids (Table I). The EIIP values for each amino acid were calculated using the following general model pseudopotential (Veljkovic and Slavic, 1972; Pirogova and Cosic, 1999):
[Equations (1) to (4): rendered as images in the source; not recovered]
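The first RRM stage, the mapping of residues to EIIP values, can be sketched as below. This is a minimal illustration rather than the authors' implementation: the EIIP figures are the values commonly tabulated in the RRM literature and are an assumption here (they should be checked against Table I), and the function name `encode_eiip` is hypothetical.

```python
# EIIP values for the 20 standard amino acids, keyed by one-letter code.
# Assumption: taken from the general RRM literature, not from this text;
# verify against Table I before relying on them.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def encode_eiip(sequence):
    """Map a one-letter amino acid string to its EIIP numerical series."""
    try:
        return [EIIP[residue] for residue in sequence.upper()]
    except KeyError as err:
        # Non-standard residue codes (e.g. X, B, Z) have no entry in the table.
        raise ValueError(f"no EIIP value for residue {err.args[0]!r}") from None
```

Calling `encode_eiip("MVLS")` returns a four-element numerical series that can then be passed to any spectral or wavelet analysis.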
Peak frequencies in such a multiple cross-spectral function denote frequency components common to all sequences analysed. The signal-to-noise ratio (S/N) of each peak is defined as a measure of similarity between the sequences analysed. S/N is calculated as the ratio between the signal intensity at the particular peak frequency and the mean value over the whole spectrum. Extensive experience gained from previous research (Cosic, 1994, 1995, 1996, 1997) suggests that an S/N of at least 20 can be considered significant. The multiple cross-spectral function for a large group of sequences with the same biological function has been named the 'consensus spectrum'. The presence of a peak frequency with significant S/N in a consensus spectrum implies that all of the analysed sequences within the group have one frequency component in common. This frequency is related to the biological function provided that the following criteria are met:
In our previous studies (Table II), the above criteria were tested with over 1000 proteins from 28 functional groups (Cosic, 1994, 1995, 1996, 1997; Trad et al., 2000). Multiple cross-spectral functions of four different functional groups of proteins are shown in Figure 2. The following fundamental conclusion was drawn from our studies: each specific biological function of a protein or regulatory DNA sequence is characterized by a single frequency. Once the RRM characteristic frequency for a particular biological function has been determined, it is possible to identify the individual amino acids, so-called 'hot spots' [using the Fourier transform (FT)], or the domains [using the continuous wavelet transform (CWT) (Fang and Cosic, 1998, 1999; Trad et al., 2000, 2001)] that contribute most to the characteristic frequency and thus also to the protein's biological function.
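The consensus spectrum and its S/N can be sketched in a few lines. The code below is an illustrative sketch under stated assumptions, not the authors' procedure: the multiple cross-spectral function is taken to be the pointwise product of the amplitude spectra, each series has its mean removed and is zero-padded to the longest length, and the DFT is written naively with the standard library. All function names are hypothetical.

```python
import cmath
import math


def amplitude_spectrum(series):
    """Return |X(n)| of the discrete Fourier transform for n = 1 .. N//2."""
    n_points = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * n * m / n_points)
                for m, x in enumerate(series)))
        for n in range(1, n_points // 2 + 1)
    ]


def consensus_spectrum(series_list):
    """Pointwise product of amplitude spectra (assumed cross-spectral form)."""
    length = max(len(s) for s in series_list)
    product = [1.0] * (length // 2)
    for s in series_list:
        mean = sum(s) / len(s)
        # Assumption: remove the mean, then zero-pad to the common length.
        padded = [v - mean for v in s] + [0.0] * (length - len(s))
        product = [p * a for p, a in zip(product, amplitude_spectrum(padded))]
    return length, product


def peak_and_snr(spectrum, n_points):
    """Return (peak frequency, S/N); S/N = peak value / mean of the spectrum."""
    index = max(range(len(spectrum)), key=spectrum.__getitem__)
    mean_level = sum(spectrum) / len(spectrum)
    return (index + 1) / n_points, spectrum[index] / mean_level
```

For two numerical series that share a single periodicity, `peak_and_snr(*reversed(consensus_spectrum(...)))` is not the calling order; unpack the pair first, as in `n, spec = consensus_spectrum([a, b])` followed by `peak_and_snr(spec, n)`. Frequencies are expressed as n/N and therefore lie between 0 and 0.5.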
The correlation between the amplitude spectrum of the numerical representation of genetic sequences and the corresponding biological function, presented previously, can lead to a completely new approach to protein dynamics. Each frequency in the RRM characterizes one biological function (Figure 2). Each biological process involves a number of interactions between proteins and their targets (another protein, a DNA regulatory segment or a small molecule). Each of these processes involves energy transfer between the interacting molecules. These interactions are highly selective and this selectivity is defined within the protein structure. The selectivity is proposed to arise from resonant energy transfer between the interacting molecules (Cosic, 1994). Consequently, the characteristic resonant frequencies for a number of different interactions, i.e. biological functions, were theoretically calculated (Table II). These calculations were based on the following key finding: proteins with the same biological function have common periodicities in the distribution of the energies of delocalized electrons along the protein. With this in mind, and taking into account the conductive properties of the protein backbone, the theoretical model of biologically relevant protein resonances was established (Cosic, 1990, 1994, 1997).
The discrete wavelet transform
The wavelet transform (WT) is a relatively new signal processing tool that is efficient for multi-resolution analysis and local feature extraction of non-stationary signals (Daubechies, 1988, 1992). It can be viewed as an inner product operation that measures the similarity or cross-correlation between the signal and the wavelets.
The sequence-scale similarity measurement introduced here is based on the discrete wavelet transform (DWT) and a cross-correlation analysis. The sequences to be compared are first converted into numerical series using the RRM (Cosic et al., 1989; Cosic, 1997). These numerical series are normalized to zero mean and unit standard deviation and zero-padded to an identical length. They are then decomposed by the DWT to M levels, giving details from level 1 to level M and an approximation at level M. Because a correlation function quantifies the degree of interdependence of one process upon another, or establishes the similarity between one set of data and another (Oppenheim and Schafer, 1997), the cross-correlation coefficients are calculated at each level to establish and quantify the similarity between the two protein sequences compared. There are M + 1 correlation coefficients in total. The value of a correlation coefficient lies between -1 and +1; +1 means 100% correlation in the same sense and -1 means 100% correlation in the opposing sense (Oppenheim and Schafer, 1997). The cross-correlation coefficient is defined as
[Equations (5) and (6): rendered as images in the source; not recovered]
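For orientation, one standard textbook form of the normalized cross-correlation coefficient between two zero-mean series x and y of common length N is given below. Whether it matches the authors' exact definition, including the lag convention, is an assumption; the symbols r and rho are introduced here for illustration only.

```latex
r_{xy}(l) = \sum_{n} x(n)\, y(n - l), \qquad
\rho_{xy}(l) = \frac{r_{xy}(l)}{\sqrt{r_{xx}(0)\, r_{yy}(0)}}, \qquad
-1 \le \rho_{xy}(l) \le 1
```

The bound on the normalized coefficient follows from the Cauchy-Schwarz inequality, which is consistent with the range of -1 to +1 quoted in the text.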
The maximum absolute value of the correlation coefficient at each decomposition level is regarded as the similarity score for the two proteins at that level. The resulting M + 1 maximum values are collected to form a sequence-scale similarity vector. This vector depicts the similarity of two protein sequences at different scales or in different frequency bands. More specifically, it describes the correlation from a multiresolution point of view.
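The procedure described above (normalize, zero-pad, decompose, correlate band by band, keep the maximum absolute value) can be sketched as follows. To stay dependency-free, the sketch swaps in a plain Haar decomposition in place of the Bior3.3 wavelet used in the paper, so its numbers will not reproduce the published vectors. The output order (A_M, D_M, ..., D_1) is inferred from the way the results are reported and is an assumption, as are all function names.

```python
import math


def normalise(series):
    """Scale a series to zero mean and unit standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    return [(v - mean) / std for v in series]


def haar_dwt(series, levels):
    """Haar decomposition (stand-in for Bior3.3): [A_M, D_M, ..., D_1]."""
    approx, details = list(series), []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return [approx] + details[::-1]


def max_abs_correlation(x, y):
    """Largest |normalized cross-correlation| of two equal-length bands."""
    norm = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    if norm == 0.0:
        return 0.0
    n = len(x)
    best = 0.0
    for lag in range(-(n - 1), n):
        total = sum(x[i] * y[i - lag]
                    for i in range(max(0, lag), min(n, n + lag)))
        best = max(best, abs(total))
    return best / norm


def similarity_vector(series_a, series_b, levels=4):
    """Sequence-scale similarity vector with levels + 1 entries."""
    a, b = normalise(series_a), normalise(series_b)
    block = 2 ** levels
    # Zero-pad both series to a common length that Haar can halve `levels` times.
    length = -(-max(len(a), len(b)) // block) * block
    a += [0.0] * (length - len(a))
    b += [0.0] * (length - len(b))
    return [max_abs_correlation(u, v)
            for u, v in zip(haar_dwt(a, levels), haar_dwt(b, levels))]
```

With PyWavelets installed, `haar_dwt` could be replaced by `pywt.wavedec(series, "bior3.3", level=levels)`, which returns the coefficient arrays in the same approximation-first order; the band lengths then depend on the boundary mode chosen.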
The underlying property of wavelets is that they are localized in both time and frequency (Strang and Nguyen, 1996). The product of the uncertainties in time and frequency is bounded below by Heisenberg's uncertainty principle: no filter can have a time-frequency width product smaller than a fixed lower bound. Gaussian filters attain this theoretical limit.
In this work we used the Bior3.3 biorthogonal wavelets (Cohen et al., 1992) for the protein signal decomposition in all cases. The biorthogonal discrete wavelet transform uses two wavelets, one for decomposition and the other for reconstruction, so the analysis and synthesis tasks can be separated (Cohen et al., 1992). Biorthogonal wavelets are symmetrical and have linear phase.
Results |
---|
The similarity of closely related sequences is obvious. However, it is difficult to find the sequence similarity for proteins which are distantly related but have a similar biological function or tertiary structure. For example, sperm whale myoglobin and lupine leghemoglobin have only 15% identical residues, which is far below the twilight zone of sequence identity, although they both contain a heme group, have similar secondary and tertiary structures and bind oxygen (Doolittle, 1981). The cross-correlation analysis of lupine leghemoglobin and sperm whale myoglobin revealed the sequence-similarity vector (0.40, 0.53, 0.44, 0.36, 0.25), showing a weak correlation in D4 (= 0.53). It is reasonable to deduce that this correlation is related to their shared biological function, the oxygen-binding capability.
Another example is chymotrypsin and subtilisin. These two proteins have a very low sequence identity, only 12% even with an optimal alignment method. However, they share a common proteolytic function and a common catalytic mechanism, as an example of convergent evolution (Lesk, 1988). Because of the low sequence identity within these two pairs of proteins, it is unlikely that they can be linked together using sequence alignment methods. However, using the sequence-scale similarity as defined above, we can still probe their distant connections. The sequence-scale similarity analysis of chymotrypsin and subtilisin revealed the sequence-similarity vector (0.35, 0.60, 0.42, 0.25, 0.18). At D4 (= 0.60), there is also a weak correlation for these two distantly related proteins.
Myoglobin is an oxygen-carrying globular heme protein which, like hemoglobin, is involved in oxygen storage and transport in vertebrate muscle. The myoglobin molecule is built up of eight helices, which compose a box-like structure with a hydrophobic pocket. The heme group responsible for oxygen binding (Fe2+-porphyrin) is fixed in this pocket only by weak bonding. Hemoglobin is composed of an association of smaller subunits (α- and β-chains), and myoglobin and hemoglobin are thought to be evolutionarily related (Lehninger et al., 1993). The sequence similarity of myoglobin and hemoglobin is very poor. However, Figure 7 indicates that hemoglobin and myoglobin are not dissimilar in the sense of the sequence-scale similarity. There are two weakly correlated frequency bands, A4 and D4, with correlation coefficients of 0.63 and 0.60, respectively. Moreover, the hemoglobin α-chain and β-chain are also correlated in these two frequency bands (see Figure 6). The two chains have one strong correlation (correlation coefficient 0.97) and one weak correlation (correlation coefficient 0.62). This gives more evidence that frequency bands A4 and/or D4 contain the information related to the oxygen-carrying function of those proteins (hemoglobin, sigmoid oxygen saturation curve; myoglobin, hyperbolic saturation curve).
The sequence-scale similarity vector shows a strong cross-correlation between two closely related proteins and a certain correlation between two functionally related proteins. One requirement in choosing an appropriate analysis tool for protein sequences is that it has a direct relationship with the underlying process. This requires that a self-contained similarity measurement scheme gives no-correlation results for functionally and/or structurally unrelated proteins.
Lysozyme is a widespread enzyme found especially in animal secretions, in egg white and in some microorganisms. It splits the glycosidic bond between certain residues in mucopolysaccharides and mucopeptides of bacterial cell walls. Lysozyme and hemoglobin do not share any biological function. This is also shown by the cross-correlation study of their DWTs (Figure 9), in which no peak exceeds the weak correlation boundary of 0.5.
Discussion and conclusion |
---|
This finding indicates that the functional or structural similarity of two protein sequences could be revealed by the sequence-scale study. One important criterion for comparing different computational approaches is how well they perform in finding low degrees of similarity (Bishop and Rawlings, 1996). Hence the sequence-scale similarity can be a very promising tool for sequence comparison, with the important advantage of not requiring indels.
These comparative studies have provided new insights into the structure-function relationships of certain groups of proteins. The results in Table III generally match the biological relationships of each protein pair. Using BLAST for the protein pair hemoglobin α-chain (hahu: 142 amino acids) and itself revealed the following results: score = 286 bits; identities = 142/142 (100%); positives = 142/142 (100%). The sequence-scale similarity vector shows complete correlation in all five scales (5S). Using BLAST for the protein pair hemoglobin α-chain (hahu: 142 amino acids) and sperm whale myoglobin (mwhp: 153 amino acids) revealed the following results: score = 46.2 bits; identities = 37/147 (25%); positives = 59/147 (39%); gaps = 6/147 (4%). The sequence-scale similarity vector (0.63, 0.60, 0.48, 0.31, 0.30) shows weak correlations at A4 and D4, expressed as 2W3N. It is reasonable to deduce that this correlation is related to their shared biological function, the oxygen-binding capability. Only fgfbh (basic human growth factor) and legh (lupine leghemoglobin) have a clear correlation although no common biological properties have been reported for them. The reason for this exception is still not clear.
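The shorthand used above (5S, 2W3N) can be generated from a similarity vector with a small helper. This is a sketch only: the weak boundary of 0.5 is the value quoted in the Results, whereas the strong boundary of 0.8 is a hypothetical placeholder, since the text does not state where weak correlation ends and strong correlation begins. The function name is also hypothetical.

```python
def correlation_label(vector, weak=0.5, strong=0.8):
    """Summarize a similarity vector as counts of S(trong), W(eak), N(one).

    weak:   boundary quoted in the text (0.5).
    strong: hypothetical boundary, not given in the text.
    """
    counts = {"S": 0, "W": 0, "N": 0}
    for value in vector:
        magnitude = abs(value)
        if magnitude >= strong:
            counts["S"] += 1
        elif magnitude >= weak:
            counts["W"] += 1
        else:
            counts["N"] += 1
    # Emit only the classes that occur, in the order S, W, N.
    return "".join(f"{counts[key]}{key}" for key in "SWN" if counts[key])
```

Applied to the hemoglobin α-chain and myoglobin vector quoted above, the helper yields the same 2W3N label; a self-comparison with all entries equal to 1 yields 5S.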
Thus a fundamental and empirical conclusion for sequence-scale similarity measurement is reached:
There are two additional advantages of the sequence-scale similarity measurement. First, the significance of the similarity is given directly by the correlation value rather than by an alignment score, as shown in the discussion above. The results derived from a sequence comparison scheme measure the quality of the alignment. Thus with the sequence-scale similarity vectors, the similarity significance can be compared, assessed and interpreted easily. For the conventional comparison methods, the comparison score needs to be processed using various empirical and statistical methods before it can be evaluated (Bishop and Rawlings, 1987; Lesk, 1988). Second, with the introduction of a cross-correlation function, the deletions and insertions which are often used in conventional sequence comparison and alignment schemes are no longer needed. None of the drawbacks arising from gap insertion and deletion applies to this method. Therefore, proteins with different sequence lengths can be compared easily.
Having in mind that the majority of theoretically predicted biological properties of proteins in this paper are functionally important, we can conclude that this study confirms our earlier hypothesis that the WT method could be established as a novel approach to examine protein sequences at different spatial resolutions.
References |
---|
Bishop,M. and Rawlings,C. (1996) DNA and Protein Sequence Analysis: A Practical Approach. IRL Press, Oxford.
Cohen,A., Daubechies,I. and Feauveau,J.C. (1992) Commun. Pure Appl. Math., 45, 485–560.
Cosic,I. (1990) In Wise,D. (ed.), Bioinstrumentation and Biosensors. Marcel Dekker, New York, pp. 475–510.
Cosic,I. (1994) IEEE Trans. Biomed. Eng., 41, 1101–1114.
Cosic,I. (1995) Bio/Technology, 13, 236–238.
Cosic,I. (1996) Med. Biol. Eng. Comput., 34, 139–140.
Cosic,I. (1997) The Resonant Recognition Model of Macromolecular Activity. Birkhauser, Basel.
Cosic,I. and Hearn,M.T.W. (1991) J. Mol. Recognit., 4, 57–62.
Cosic,I. and Nesic,D. (1988) Eur. J. Biochem., 170, 247–252.
Cosic,I., Pavlovic,V. and Vojisavljevic,V. (1989) Biochimie, 71, 333–342.
Cosic,I., Hodder,A., Aguilar,M. and Hearn,M.T.W. (1991) Eur. J. Biochem., 198, 113–119.
Daubechies,I. (1988) Commun. Pure Appl. Math., 41, 909–996.
Daubechies,I. (1992) Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, Philadelphia.
Doolittle,R.F. (1981) Science, 214, 149–159.
Fang,Q. and Cosic,I. (1998) Aus. Phy. Eng. Sci. Med., 21, 179–185.
Fang,Q. and Cosic,I. (1999) In Proceedings of the Inaugural Conference of the Victorian Chapter of the IEEE EMBS. pp. 211–214.
Goffin,V., Martial,J.A. and Summers,N.L. (1996) Protein Eng., 8, 1215–1231.
Lehninger,A.L., Nelson,D.L. and Cox,M.M. (1993) In Principles of Biochemistry. Worth, New York.
Lesk,A.M. (1988) In Computational Molecular Biology. Oxford University Press, Oxford.
Oppenheim,A.V. and Schafer,R.W. (1997) In Discrete-time Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
Oyster,C.K., Hanten,W.O. and Liorence,L.A. (1987) In Introduction to Research: a Guide for the Health Science Professional. Lippincott, Oxford.
Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444.
Pirogova,E. and Cosic,I. (1999) In Proceedings of the Inaugural Conference of the Victorian Chapter of the IEEE EMBS. pp. 203–206.
Strang,G. and Nguyen,T. (1996) In Wavelets and Filter Banks. Wellesley-Cambridge Press, Wellesley.
Trad,C.H., Fang,Q. and Cosic,I. (2000) Biophys. Chem., 84, 149–157.
Trad,C.H., Fang,Q. and Cosic,I. (2001) In Proceedings of the 2nd Conference of the Victorian Chapter of the IEEE EMBS. pp. 115–119.
Veljkovic,V. and Slavic,I. (1972) Phys. Rev. Lett., 29, 105–108.
Received October 24, 2001; revised December 18, 2001; accepted January 4, 2002.