Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK and 2 Mathematics Department, Stanford University, Palo Alto, CA, USA
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: protein structure repeats/Fourier transform symmetry
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The equivalent problem in the one-dimensional-(1D) world of the sequence has been approached both by direct internal comparisons (Heringa and Argos, 1993) and by Fourier transform (McLachlan and Stewart, 1977
), which is particularly suitable for the sequences of long fibrous proteins (McLachlan, 1977
, 1983
). The former approach can be directly transposed to 3D structures simply by repeated superposition of fragments (McLachlan, 1979
; Matthews and Rossmann, 1985
). However, while the Fourier transform is a natural tool to adopt for the analysis of repetition in the 1D sequence, its extension to structural data is not straightforward. In Cartesian coordinates (with each atomic point modelled by a Gaussian density) the 3D Fourier transform cannot be interpreted easily. Attempts have also been made using spherical coordinates (harmonics) (Duncan and Olson, 1993
) but addressing protein topography and not topology.
In this work we describe a Fourier analysis at an intermediate, 2D level based on the structural similarity matrices used in protein structure comparison. In this representation, symmetric1 structures appear as off-diagonal ridges of high score (as in the sequence dot-plot) and the periodicity of these ridges can be analysed by the Fourier transform. To allow for the quantitative interpretation of the resulting spectra, we also describe a normalization method based on random structures.
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The SAP program (Taylor, 1999) was used to calculate the similarity between two proteins. This method uses the double dynamic programming (DDP) algorithm (Taylor and Orengo, 1989
). In the SAP program, a pair of positions (one residue from each structure) is compared by matching their interatomic vectors to all other residues in their respective struc tures. This is achieved by constructing a matrix, defined by the sequence of one structure against the other, which is composed of a measure based on relative geometry of the pair of pairs (Figure 1
) (Taylor, 1999
). In structure comparison, this matrix (referred to as the low-level matrix) provides the base from which an alignment is extracted (using a standard dynamic programming algorithm). The alignments over all pairs of residues (between the two structures) are summed, forming another matrix (high-level matrix) from which the final alignment is extracted.
|
Smoothing the score matrix
In a protein structure that contains repeated sub-structures, these will form diagonal ridges across the matrix parallel to the main diagonal (i = j) and it is the strength and periodicity of these undulations that can be analysed using a Fourier analysis. However, it was found that often these ridges are sharply defined and the resulting spikes or peaks generated a high background level in the Fourier transform spectrum. (It should be remembered that the transform of a spike gives rise to an equal score across the full frequency range.)
To make the frequency spectrum easier to interpret, the high-level matrix was smoothed. Simple averaging was avoided as this tended to eliminate locally strong similarities. (For example, the peak from two corresponding secondary structures would be averaged with the weak signals from their flanking coil regions.) Instead the averaging was made by taking the maximum over adjacent valuesequivalent to a dynamic programming algorithm, as follows:
| (1) |
The factor of a half applied to off-diagonal contributions is equivalent to a gap-penalty and dampens smoothing in the anti-diagonal direction while still allowing the propagation of some signal across small insertions.
This operation was applied to each cell in the matrix A (with increasing i and j) creating the matrix B. As B will be asymmetric, the smoothing operation was then applied to B with decreasing i and j, recreating a new matrix A. This flip-flop smoothing was repeated 20 times resulting in a matrix that still contained substantial detail. The effect of this repetition can be pictured by considering a single diagonal line (i = j + n) in the matrix, which for half its length has a value 1 and is otherwise 0. After 20 iterations, the sharp edge will be smoothed to a sigmoid curve which makes most of the transition from 1 to 0 over a range equal to the number of smoothing iterations. More generally, a single point will be smoothed into a Gaussian (bell-shaped) curve of the form exp(x2/n), where n is the number of iterations. By anology with optics, this means that two spikes closer than 6.5 residues will not be resolved. In terms of protein structure, short-range features will be smoothed away, but typical secondary structures will remain distinct. It should be remembered, however, that this smoothing takes place after comparison by the SAP algorithm, so all features, whatever their size, are fully considered in establishing internal similarities.
Fourier transform
The remaining problem with the data generated as decribed above, is the direction in which to calculate the Fourier transform. As the matrix is symmetric, this leaves a choice between the direction parallel to an axis or parallel to the minor diagonal (i + j = N, where N is the length of the protein, referred to below as the antidiagonal). The former has the problem that the diagonal itself constitutes an anomaly in the scores since the similarity of a residue to itself (plus environment) is very high. In the later direction, however, there is only one line in this direction that covers the full length of the protein (the antidiagonal itself).
Fourier calculation.As proteins are short and computers are fast, the fast-Fourier transform (FFT) algorithm was not used as this requires special treatment of the signal to render it in lengths of powers of two (Press et al., 1986). Instead, the direct approach was used of multiplying sine and cosine waves over a range of frequencies and plotting the power of each component as a spectrum. Each term in the spectrum was normalized for the length of the protein (N) as: 100(s2 + c2)1/2/N, where s and c are sine and cosine (real and imaginary) components of the transform. (See Figure 2
for examples and a comparison to the FFT method.)
|
|
For the calculation of the transform in the antidiagonal direction, the use of the antidiagonal alone may be unrepresentative and a band of 20 antidiagonals was taken. Such a construction, however, creates outer edges that are shorter than the antidiagonal and to correct this, the missing corners were padded by repeated reflection of the diagonal at each level.
For the calculation in the direction of the rows, two variants were tried: firstly, with the untouched score matrix and secondly, with the row shifted by one in each column and wrapped so as to shift the diagonal to the edge.
Variance calculation. Each row in the matrix (or each antidiagonal) was individually transformed and the spectra averaged (rather than transforming an average signal). This approach has the advantage that the variance of each component can then be calculated as a guide to significance of the peak and potentially used as a weighting factor.
![]() |
Random models |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Random fake proteins. Random proteins have been generated many times in the past for testing purposes and a simple algorithm to do this is a self-avoiding random walk from one -carbon to the next constrained to lie inside a sphere or ellipsoid (Cohen and Sternberg, 1980
; Thornton and Sibanda, 1983
). The algorithm employed below is similar but incorporates a local, rather than a global, constraint for the chain to be confined. This was implemented by selecting the next position in a growing chain to be preferentially in contact with its predecessors using the following algorithm (in C code):
The density of the fake structures are dependent on the number of target neighbours (controlled by the variable n in the above code). If this is set to find too few contacts then the resulting structures are not sufficiently compact while if set to find many neighbours, then the time taken increases. Trials indicated that aiming for four or more neighbours produced compact structures while aiming for only two neighbours was not sufficient (n = 1). However, as the program need be run only once to produce a databank of fake structures, the denser models were chosen (n = 3).
The resulting structures are remarkably life-like (Figure 3), even incorporating elements reminiscent of secondary structure and, if allowed to grow large enough, breaking-up into domain-like regions (Figure 3c
).
|
![]() |
Visualizing repeats |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
It might be thought that this information could be extracted using wavelet analysis, however, this technique can only be used where the feature is much smaller than the length of the signal (such as a short pattern of residues in a long protein sequence). In the current application the features constitute a large fraction of the length of the protein, and indeed, at this scale, wavelet and Fourier transform methods converge.
Biasing SAP to overlap repeats. In the SAP score matrix, each off-diagonal ridge represents a region in which repeats can be aligned. The SAP comparison was focused on each solution in turn by convoluting a Gaussian (bell-shaped) function over the off-diagonal ridge in the score matrix. The width of the bell-curve was set as a function of the period of the repeat such that all neighbouring ridges were essentially eliminated, as follows:
![]() | (2) |
In a protein containing five repeated sub-structures, A . . . E, the solution obtained with p = 5 and r = 3 is the alignment of sub-structures, A, B, C with C, D, E.
Delineating repeat boundaries. The same masking approach can be used to identify repeat boundaries by focusing on the main diagonal (r = 0) and extracting the alignment path, then cyclicly shifting the original score matrix rows by one period (N/p) and repeating the alignment. After a full set of shifts, the points at which each of the p alignment paths cross the edge of the matrix gives the repeat boundaries.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The highly repetitive protein structure of the ribonuclease inhibitor H1 (2bnh) was used to assess the two transform directions described above. This structure consists of 16 (leucine-rich) ß repeats arranged in a regular arc (Figure 4
).
|
|
|
With this treatment of the diagonal, the row-based approach appeared preferable to the antidiagonal approach as it permits the full data matrix to be used, allowing statistics to be calculated over a bigger sample than the limited band along the antidiagonal. The transform in the (unshifted) row direction was used to generate all further results presented below.
Analysis of total power
Model repeat proteins. The ribonuclease inhibitor structure 2bnh (Figure 4) was used to generate a double series of model structures. For one, the last strand was deleted, leaving exactly 16 ß
-units. These were then successively deleted in turn from the N-terminus (being the least regular) giving 16 coordinate sets each with a different number of repeats. For the second series, this was repeated but with the last strand deleted giving a series of 16
ß-units.
The total power of the spectrum obtained when transforming each construct in these series of structures is plotted against the number of residues in the structure in Figure 8. The power rises quickly followed by a slower, linear increase. The combined trend can be reasonably modelled with the combination of a linear component and an inverted Gaussian function:
|
![]() | (3) |
Ideal ß proteins. Using the same ß
-unit lengths as in 2bnh, model protein structures were generated with 4, 6, 8, 10 and 12 ß
-units (plus a terminal ß-strand) and a sample of 15 different folds were taken from each. These are plotted on Figure 8a
where it can be seen that these structures have a very high power, with only the weaker members being comparable to the 2bnh series.
Random proteins. A series of random-walk (fake) proteins from 50 to 450 residues was generated in steps of 50 residues. A sample of 20 from each length was transformed and the resulting power of the spectrum plotted in Figure 8a.
This series shows a slight dependence on length, with the power level rising from around 1200 (±200) for structures with 100 residues. The upper limits of this distribution can again be modelled by the same function as described above (Equation 3) with values for the coefficients a = 2, b = 1400 and c = 4. This curve provides a good base-level against which the power of real proteins can be assessed.
Real proteins. The program was run on a selection of proteins from the PDB (as used previously by Jonassen et al., 2000) and the power of their spectra plotted against length (Figure 8b
). The majority of proteins lie below the upper limits attained by the fake proteins but a considerable number (17%) still lie above this line and even above the values attained by the series of 2bnh repeats (1%), with a few approaching the highest values seen for the ideal model proteins.
Inspection of the highest scoring structures revealed a collection of symmetric folds, dominated by the very regular ß-propellor structures (Murzin, 1992) but also containing many regular ß
-folds, including TIM barrels (Table I
). These results can be summarized by plotting the total power for each protein along with its secondary structure composition, while colouring the protein as a function of the most dominant period in its spectrum (Figure 9
).
|
|
Removing expected symmetries
Some of the proteins included in Table Ia contain internally duplicated domains that are apparent from sequence comparison. To access the more unexpected symmetries, these were filtered using the program REPRO (Heringa and Argos, 1993
). This adjustment was made by taking the normalized power (s) (column score in Table Ia
) divided by the REPRO score (r) as: t = s exp(r2/105). The new score t is typically s/1000 for the strongest sequence repeats (r = 831), falling to s/10 for clear repeats (r = 480) and 3s/4 for weak repeats (r = 170). These modified results are shown in Table Ib
.
The greatest change between the rankings is the disappearance of the highly repetitive folds seen at the top of Table Ia (ß-propellors, ß-prisms and ß
-arc proteins). The single example of this type that holds its place in the top 50 is the 4-fold ß-propellor 1gen and although its internal repeat was recognized by REPRO, the sequence identity over the repeats is <20%. Those remaining in the filtered list (Table Ib
) were overwhelmingly of the globular ß
fold class and are dominated by the Rossmann-like folds (which contain a pseudo-2-fold) and the ß
-barrel (TIM-like) fold which have 8-fold cyclic symmetry (but also incorporate many deviations). The only globular ß
class protein to drop markedly in the rankings was the von Willebrand factor protein 1atzA which, as well as a very symmetric fold (classic Rossmann fold2)2, has sufficient sequence similarity in the two halves for REPRO to pick-up a repeat.
A fold that makes a stronger appearance in the filtered list is the ßß
layer protein (ABBA in Table I
) 1ryp1 from the proteasome which contains an internal structural duplication. This symmetry runs through three of the four layers of secondary structure and, although it is not clear to the eye, it was identified as a 2-fold repeat in the Fourier spectrum (Table Ib
). The sequence identity over the repeats is <10% which would not be seen by any sequence-based method. The 14 related chains from the proteasome, labelled 1, 2 and AK and the related heat-shock protein structure 1dooA, all score above random and by the scoring used in Table 1b
, lie in ranked positions: 30, 86 (chains 1, 2); 122, 189, 265, 123, 148, 139, 71, 58, 77, 264, 127 (chains AK) and 63 (1dooA).
The more obvious structural repeat seen in the double domain -crystalins (1bd7A = 1.4 Å/84 res. 36% sequence ID; 1gcs = 1.6 Å/84 res. 36% sequence ID) was easily eliminated by REPRO but the more distant internal (Greek-key) internal repeat within each individual domain did not score enough to be down-graded and the structure 1dsl holds its position after filtering.
As would be expected, there are no sequence repeats detected by REPRO that do not have a corresponding structural repeat. Although this simply reflects the principal that structure is better conserved than sequence, it is interesting to examine the proteins that approach the violation of this principle most closely. These are both artificial constructs, being linked dimers of a globin (1abwA) and the HIV protease (1hvp). Their exact internal sequence repeat gives a very high REPRO score while the single structural duplication, although strong, constitutes only one off-diagonal ridge in the score matrix. The closest native proteins to these are the annexins (1aei and 1axn).
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Fourier calculation. The SAP score matrix was found to be a useful construct in which to search for repeated structures using a Fourier analysis. For the Fourier calculation, trials indicated that it was best to avoid artificial constructs, such as the padding required for the FFT and the calculation in the direction of the antidiagonal. When tested with model data, the simple approach used above provided a clear signal of the repeat periodicity that could be used to extract the repeating units with the SAP program.
Model structures. The strength of the Fourier signal would have been difficult to interpret without the use of good model proteins for calibration. The most useful of these was the novel random-walk approach employed to generate realistic fake structures. On the basis of these it was possible to conclude that only a small proportion of protein structures have an unexpected degree of symmetry.
The artificially generated ideal folds provided an upper limit on the strength of signal that could be expected from a perfectly repeating compact structure. Unexpectedly, some of the very regular ß-propellor folds attained the same strength of signal but the majority of symmetric structures lay below this region.
Proteins ranked by symmetry. When native proteins were ranked by the power of their spectrum (normalized for length) the ß-propellor folds occupied the top positions but, otherwise, a wide variety of fold types were seen to score highly. In the ß class, these included the globular ß
proteins (TIM barrels, Rossmann-like folds) and the more repetitive leucine-rich ß
folds. In the all-ß class; ß-propellors, ß-prisms and ß-helices were found as well as the more globular
-crystalin domains, with both single- and double-domain structures. The less-regular all-
class was represented by the double
-barrel structure 1dceB.
When this ranked list was corrected to down-weight proteins that contained detectable internal sequence similarity (using the program REPRO), the list became almost exclusively composed of just globular ß class proteins. In the top 50 re-ranked proteins, only a 4-fold propellor structure remained but lying well down on its former ranking.
Assessment of the Fourier approach
Advantages and disadvantages. The Fourier transform is a simple way of extracting the periodicities in a signal and in its application above is able to utilize all the information in the 2D score matrix, also allowing statistics to be gathered on the significance of the peaks in the spectrum. Even without using the FFT algorithm, the calculation takes very little time relative to the calculation of the structure comparison.
As mentioned in the Introduction, a disadvantage of the approach is that it is not possible to tell the number or (within limits) the size of the sub-structures that give rise to any periodicity without returning to examine the original score matrix. Given that some information must be lost in the reduction of a complex structure to a few numbers, this is perhaps not unexpected and (as was outlined in Methods), the transformed signal can still be used to help extract the repeating substructures.
A further ramification of this loss of information is when the sub-structures occur in a range of sizesas is often seen with the ß-units in the 8-fold ß
-barrel folds. In this situation, rather than the expected sharp peak at frequency N/8, the peak becomes spread (or sometimes split) over adjacent frequencies. Similarly, if the symmetric domain has a large insert or terminal addition, then the peak will again be displaced from the expected frequency. An example of this can be seen at the top of Table Ia
where the number of repeats for the 7-fold ß-propellor structure 1gotB is recorded as eight. Figure 10
shows the obvious explanation for this is the long N-terminal
-helix and loop.
|
Repeats in the sequence were identified using the program REPRO but they can also be identified using the same Fourier transform approach as applied here to structures. This was tested by presenting the initial high-level SAP matrix directly to the Fourier transform, without evaluation of any pairs at the low-level (Figure 1). Strong signals were obtained but these will need further analysis to find if it is best to use just sequence identity, a Dayhoff-like substitution model or a combination of this and local structural properties as used to initialize the SAP calculation.
An approach to avoid the problems of variable length in the repeating units might be to count the number of peaks in the raw signal (rather than convolute them with periodic function as in the Fourier transform). This is often a difficult process to automate but some recent applications to other data might provide a way forward (May, 2001;Taylor, 2001
).
![]() |
Origin of structural symmetry |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
This result was unexpected as there is no obvious structural reason why more globular ß proteins should not be found with a clear internal sequence duplication or why more symmetric all-ß or all-
protein should not exist without an accompanying signal in the sequence. One simple explanation might be that the ß
class is so numerous that examples in the other classes have just not yet been observed. Alternatively, it might be argued that the more obviously repeating structures are more recently evolved and so retain their sequence signal while those in the ß
class tend to be ancient metabolic enzymes.
A more structural explanation of the dominance of the globular ß proteins might be based on the relative sizes and degrees of structural freedom that are available to the different super-secondary structure types. Those composed of all-ß structure have a geometric regularity imposed by the plane of the ß-sheet but are otherwise relatively topologically unconstrained, so giving rise to few symmetries by chance. The all-
structures lack the spatial register imposed by a hydrogen-bonded sheet and so will naturally be less symmetric in their packing. However, as the
-helix is a relatively large structure, smaller proteins (with less than six helices) will stand a good chance of acquiring a symmetric arrangement). The ß
unit combines both symmetry-inducing attributes of the previous types, having the spatial register of the ß-sheet while being relatively large, so that there will not be too many unsymmetric arrangements in a typical sized protein.
The unexpectedly high values for the smaller ß model protein (Figures 3e and 8a
) lends some support to this explanation. However, a more extensive analysis over a wider range of model structures will be necessary before any firm theoretical conclusions can be drawn.
![]() |
Conclusions |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
As increasing numbers of protein structures are being determined in a high-throughput manner, the automatic analysis of symmetry will provide an additional tool to help annotate the emerging structures and aid in their comparison and classification.
|
![]() |
Notes |
---|
4 Present address: Accelrys Ltd, 230/250 The Quorum, Barnwell Road, Cambridge CB5 8RE, UK
1 To whom correspondence should be addressed. E-mail: wtaylor{at}nimr.mrc.ac.uk
1 In the following sections, the term repeat will be largely used to refer to sequence repeats, while the term symmetry will be used mainly for structural repeats. This distinction is made to emphasize that the measure of structural similarity used here is based on the comparison of 3D structures and captures more of the structural context than just the linear arrangement of secondary structures. For example, if the comparison matches two ßßß sub-structures, then these will have the same internal structure and will also hold a similar 3D relationship to the rest of the structure. This implies symmetry but does not specify which sort.
2 The original Rossmann fold was half a dinucleotide-binding domain (ß)3. It is used here to refer to the double fold: 2 X (ß
)3 , which constitutes an intact domain.
![]() |
Acknowledgments |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Blundell,T.L. and Srinivasan,N. (1996) Proc. Natl Acad. Sci. USA, 93, 1424314248.
Cohen,F.E. and Sternberg,M.J.E. (1980) J. Mol. Biol., 138, 321333.[ISI][Medline]
Cohen,F.E., Sternberg,M.J.E. and Taylor,W.R. (1982) J. Mol. Biol., 156, 821862.[ISI][Medline]
Duncan,B.S. and Olson,A.J. (1993) Biopolymers, 33, 123456.
Flores,T.P., Orengo,C.A., Moss,D.S. and Thornton,J.M. (1993) Protein Sci., 2, 18111826.
Heringa,J. and Argos,P. (1993) Protein Struct. Funct. Genet., 17, 391411.[ISI]
Heringa,J. and Taylor,W.R. (1997) Curr. Opin. Struct. Biol., 7, 416421.[CrossRef][ISI][Medline]
Jonassen,I., Eidhammer,I., Grindhaug,S.H. and Taylor,W.R. (2000) J. Mol. Biol., 304, 599619.[CrossRef][ISI][Medline]
Matthews,B.W. and Rossmann,M.G. (1985) Methods. Enzymol., 115, 397420.[ISI][Medline]
May,A.C.W. (2001) Protein Eng., 14, 209217.
McLachlan,A.D. (1977) Biopolymers, 16, 12711297.[ISI][Medline]
McLachlan,A.D. (1979) J. Mol. Biol., 128, 4979.[ISI][Medline]
McLachlan,A.D. (1983) J. Mol. Biol., 169, 1530.[ISI][Medline]
McLachlan,A.D. and Stewart,M. (1977) J. Mol. Biol., 103, 271298.[ISI]
Murzin,A.G. (1992) Protein Struct. Funct. Genet., 14, 191201.[ISI]
Press,W.H., Flannery,B.P., Teukolsky,S.A. and Vetterling,W.T. (1986) Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge, UK.
Rao,S.T. and Rossmann,M.G. (1973) J. Mol. Biol., 76, 241256.[ISI][Medline]
Salem,G.M., Hutchinson,E.G. and Orengo,C.A. (1999) J. Mol. Biol., 287, 969981.[CrossRef][ISI][Medline]
Taylor,W.R. (1991) Protein Eng., 4, 853870.[Abstract]
Taylor,W.R. (1999) Protein Sci., 8, 654665.[Abstract]
Taylor,W.R. (2001) J. Mol. Biol., 310, 11351150.[CrossRef][ISI][Medline]
Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 122.[ISI][Medline]
Thornton,J. and Sibanda,B. (1983) J. Mol. Biol., 167, 443460.[ISI][Medline]
Received August 28, 2001; revised November 2, 2001; accepted November 21, 2001.